Because our newspaper OCR text is noisy, with article text often scrambled and articles clipped across pages, we’re limited in what information we can extract with more advanced Natural Language Processing (NLP) algorithms.

For example, the earlier pages of the corpus have a very high error rate (~20%) and few complete sentences.  In addition, it appears that all sentence-terminating punctuation (periods ‘.’) has been stripped from both XML datasets, which trips up even basic utilities like the NLTK word and sentence tokenizers.  As a result, we cannot perform the more sophisticated textual analyses that assume complete English sentences and depend upon grammatically correct construction.

As a consequence, we’ll first characterize our word corpus with simple statistical metrics and then limit ourselves to fundamental NLP analysis via the Python NLTK toolkit.  For a great introduction to NLP, see NLTK’s excellent online book, from which the excerpts below are drawn.

In rough order of increasing complexity, here are some statistics we’d like to gather for each scanned page, perhaps averaged across each newspaper issue:

  • Python Package “textstat” (see the readability sketch after this outline)
    • syllable_count(text)
    • lexicon_count(text, removepunct=True/False) – removepunct=True (the default) removes punctuation first
    • sentence_count(text)
    • flesch_reading_ease(text)
    • flesch_kincaid_grade(text)
    • dale_chall_readability_score(text) – uses a lookup table of ~3,000 familiar English words
  • NLTK (see NLTK online book chapter 1)
    • Concordance (context within which a particular word appears)
      • from nltk.book import *
      • text1.concordance("monstrous") # context around word in Moby Dick
    • Similar (what other words appear in similar contexts)
      • text1.similar("monstrous")
    • Common Context (find common context shared by 2+ words)
      • text2.common_contexts(["monstrous", "very"])
    • Dispersion Plots (relative offset into text of word(s))
      • text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
    • Generate (generate random text in same style)
      • text3.generate()
    • Counting Vocabulary
      • len(text3)  # total words
      • sorted(set(text3))  # unique words
      • def lexical_diversity(text):
        • return len(set(text)) / len(text)  # lexical richness, % distinct words
      • text3.count("smote")  # specific word count
      • 100 * text4.count('a') / len(text4)
      • def percentage(count, total):
        • return 100 * count / total  # percent of total that is count
      • percentage(text4.count('a'), len(text4))
    • Tokenizer
    • Frequency Distributions
      • fdist1 = FreqDist(text1)
      • fdist1.most_common(50)  # 50 most common words with count
      • fdist1['whale']  # count for word 'whale'
      • Functions Defined for NLTK’s FreqDist (Table 3.1)
        • fdist = FreqDist(samples)
        • fdist[sample] += 1
        • fdist['monstrous']  # count of word
        • fdist.freq('monstrous')  # frequency of the sample
        • fdist.N()  # total number of samples
        • fdist.most_common(n)
        • for sample in fdist:  # iterate over
        • fdist.max()  # greatest count
        • fdist.tabulate()  # tabulate freq dist
        • fdist.plot()
        • fdist.plot(cumulative=True)
        • fdist1 |= fdist2  # update fdist1 with counts from fdist2
        • fdist1 < fdist2  # test if samples in fdist1 occur less freq than in fdist2
    • Fine-grained Selection of Words
      • Min length
        • V = set(text1)
        • long_words = [w for w in V if len(w) > 15]
        • sorted(long_words)
      • Combined length with frequency
        • fdist5 = FreqDist(text5)
        • sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
    • Collocations and Bigrams
      • Bigrams
        • list(bigrams(["more", "is", "said", "than", "done"]))
      • Collocations
        • text4.collocations()
    • Stylistics
      • NLTK Brown Corpus (roughly one million words, published 1961)
      • “news” genre, e.g. Chicago Tribune: Society Reportage
        • from nltk.corpus import brown
        • news_text = brown.words(categories='news')
        • fdist = nltk.FreqDist(w.lower() for w in news_text)
        • modals = ['can', 'could', 'may', 'might', 'must', 'will']
        • for m in modals:
          • print(m + ':', fdist[m], end=' ')
        • cfd = nltk.ConditionalFreqDist(
          • (genre, word)
          • for genre in brown.categories()
          • for word in brown.words(categories=genre))
        • genres = ['news', 'religion', …]
        • modals = ['can', 'could', …]
        • cfd.tabulate(conditions=genres, samples=modals)
      • Reuters Corpus: 1.3 million words from 10,788 news documents, classified into 90 topics and divided into training and test sets
      • from nltk.corpus import reuters
      • reuters.fileids()
      • reuters.categories()
      • Inaugural Address Corpus to chart word count over time
      • from nltk.corpus import inaugural
      • cfd = nltk.ConditionalFreqDist(
        • (target, fileid[:4])
        • for fileid in inaugural.fileids()
        • for w in inaugural.words(fileid)
        • for target in ['american', 'citizen']
        • if w.lower().startswith(target))
      • cfd.plot()
    • Loading Your Own Corpus
      • from nltk.corpus import PlaintextCorpusReader
      • corpus_root = '/usr/share/dict'
      • wordlists = PlaintextCorpusReader(corpus_root, '.*')
      • wordlists.fileids()
      • wordlists.words('connectives')
  • Word Count
    • from nltk.tokenize import RegexpTokenizer
    • tokenizer = RegexpTokenizer(r'\w+')
    • with open("filename", "r") as file:
      • text = " ".join(line.strip() for line in file)
    • tokens = tokenizer.tokenize(text)
  • Sentence Length
  • Vocabulary Diversity
  • Reading Difficulty Level (a combined sketch for these last four per-page metrics follows the outline)
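
Below is a minimal sketch of how the textstat functions listed above might be applied to a single OCR’d page. The file name “page.txt” is a hypothetical placeholder, the module-level API of a recent textstat release is assumed, and given the stripped periods the sentence-based scores should be treated as rough indicators only.

    # Readability sketch for one OCR'd page ("page.txt" is a placeholder path).
    import textstat

    with open("page.txt", "r", encoding="utf-8") as f:
        page_text = f.read()

    print("syllables:", textstat.syllable_count(page_text))
    print("words    :", textstat.lexicon_count(page_text, removepunct=True))
    print("sentences:", textstat.sentence_count(page_text))
    print("Flesch reading ease :", textstat.flesch_reading_ease(page_text))
    print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(page_text))
    print("Dale-Chall score    :", textstat.dale_chall_readability_score(page_text))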
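
And a minimal sketch, under the same assumptions, combining the last four per-page metrics from the outline (word count, sentence length, vocabulary diversity, reading difficulty). The RegexpTokenizer drops punctuation, and because periods were stripped from the datasets the sentence-length figure is an approximation at best.

    # Per-page statistics sketch for noisy OCR text ("page.txt" is a placeholder path).
    import textstat
    from nltk.tokenize import RegexpTokenizer

    with open("page.txt", "r", encoding="utf-8") as f:
        page_text = f.read()

    tokenizer = RegexpTokenizer(r"\w+")            # word characters only
    tokens = [t.lower() for t in tokenizer.tokenize(page_text)]

    word_count = len(tokens)                                                # Word Count
    vocab_diversity = len(set(tokens)) / word_count if word_count else 0.0  # Vocabulary Diversity
    sentences = max(textstat.sentence_count(page_text), 1)
    avg_sentence_len = word_count / sentences                               # Sentence Length (approximate)
    reading_grade = textstat.flesch_kincaid_grade(page_text)                # Reading Difficulty Level

    print(word_count, round(vocab_diversity, 3), round(avg_sentence_len, 1), reading_grade)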


A good overview of Natural Language Processing is available on SlideShare.

