Introduction

This is an exploratory analysis report of the text corpus provided as part of the Coursera Capstone. The English corpus (which is the scope of the project) consists of news articles, blogs and tweets as 3 separate sets.

Approach

Transformations

Each corpus was ingested as a character vector. The following transformations were then applied:

Word transformations: Compound words were decomposed into separate words, and abbreviated words were expanded to their full forms. The specific transformations were:

  • “can’t” to “cannot”
  • “won’t” to “would not”
  • “n’t” to “ not”
  • “I’m” to “I am”
  • “it’s” to “it is”
  • “Mr.” to “Mr”
  • “Mrs.” to “Mrs”
  • “Sr.” to “Sr”

The last three transformations preserve only those periods that mark the end of a sentence, and drop the periods that occur as part of abbreviations.
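As a rough sketch, these substitutions could be applied with base R, assuming the raw text for one corpus sits in a character vector (here called corpus_text, an illustrative name):

    # Expand contractions and strip abbreviation periods (illustrative sketch).
    # The contraction-specific rules must run before the generic "n't" rule.
    corpus_text <- gsub("can't", "cannot", corpus_text, ignore.case = TRUE)
    corpus_text <- gsub("won't", "would not", corpus_text, ignore.case = TRUE)
    corpus_text <- gsub("n't", " not", corpus_text, ignore.case = TRUE)
    corpus_text <- gsub("I'm", "I am", corpus_text, ignore.case = TRUE)
    corpus_text <- gsub("it's", "it is", corpus_text, ignore.case = TRUE)
    corpus_text <- gsub("Mr\\.", "Mr", corpus_text)
    corpus_text <- gsub("Mrs\\.", "Mrs", corpus_text)
    corpus_text <- gsub("Sr\\.", "Sr", corpus_text)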

End-of-sentence transformations: When creating bigrams or trigrams, end-of-sentence boundaries have to be preserved. This was achieved by splitting every element of the corpus on periods, exclamation marks or round braces, which resulted in a character vector where each element is a sentence.
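A minimal sketch of this splitting step, assuming the transformed text is still in corpus_text:

    # Split on sentence-ending punctuation and drop empty fragments.
    sentences <- unlist(strsplit(corpus_text, "[.!()]"))
    sentences <- trimws(sentences)
    sentences <- sentences[sentences != ""]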

Removing special characters: After accounting for end-of-sentence markers and abbreviation markers as above, there was no need to preserve any non-alphanumeric characters. The sentence vector was therefore stripped of anything other than alphanumeric characters.

Converting to lower case: This was not possible earlier because of special non-UTF characters such as emoticons. This step is necessary before any tokenization and frequency analysis.
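The two steps above can be sketched as follows, continuing with the sentences vector from the previous step:

    # Keep only alphanumerics and spaces, then lower-case everything.
    sentences <- gsub("[^a-zA-Z0-9 ]", "", sentences)
    sentences <- tolower(sentences)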

Removing profanities: A list of profane words was obtained from https://www.cs.cmu.edu/~biglou/resources/ and these words were stripped from the corpus vectors. A regex approach ensured the identification of these words even if they were part of other words.
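For illustration, the profanity filtering could look like the sketch below, assuming the word list has been downloaded to a local file (the file name is only a placeholder):

    # Build one alternation regex from the profanity list and blank out matches,
    # including matches inside longer words. A single large alternation is
    # simple but can be slow on big corpora.
    profanities <- readLines("profanity-list.txt", warn = FALSE)
    profanities <- profanities[profanities != ""]
    profanity_regex <- paste(profanities, collapse = "|")
    sentences <- gsub(profanity_regex, "", sentences)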

Tokenization

After creating the clean sentence set, the quanteda package was used to create tokens. Unigrams, bigrams, trigrams, 4-grams and 5-grams were generated along with their frequencies.
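A sketch of this step with quanteda (only unigrams and bigrams are shown; the same pattern extends to the higher-order n-grams):

    library(quanteda)

    # Tokenize the cleaned sentences and build n-grams.
    toks    <- tokens(sentences)
    bigrams <- tokens_ngrams(toks, n = 2)

    # Frequencies via a document-feature matrix.
    unigram_freq <- colSums(dfm(toks))
    bigram_freq  <- colSums(dfm(bigrams))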

Exploration of the corpus

What is the nature and size of the corpus?

Corpus Type    No. of documents in corpus
Blogs          899288
News           77259
Tweets         2360148

What is the distribution of document size for each corpus?

As expected, tweets have a maximum of 140 characters. Blogs have the highest average length as well as the largest outliers, and the news articles fall between the two.

Exploring n-grams

Unigrams

How many unigrams exist in the corpus?

Corpus    No. of unigrams in corpus
News      66978
Blogs     762857
Tweets    276425

There are a substantial number of unigrams in each corpus. An examination of the unigram length distribution may tell us about the outliers.

Outliers are more common in tweets and blogs, but since most unigrams fall within 15-30 characters, the outliers are not a concern.

What are the top 30 unigrams in each corpus?

An examination of the unigrams shows almost the same set of words at the top in each dataset.
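For reference, the top-30 lists can be pulled directly from the document-feature matrices built earlier, for example:

    # Top 30 unigrams by frequency for one corpus (sketch).
    library(quanteda)
    top30_unigrams <- topfeatures(dfm(toks), n = 30)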

Bigrams

How many bigrams exist in each corpus? The table below shows the number of bigrams extracted from the corpus.

Corpus    No. of bigrams in corpus
News      815019
Blogs     4871093
Tweets    3855666

The number of bigrams in blogs and tweets is substantial. Subsequent computations have to take a selective approach for bigrams in order to manage memory and performance.

What are the top 30 bigrams in each corpus?

Observations on the bigrams: As with the unigrams, a consistent set of bigrams appears at the top in each dataset. There is a very sharp decrease in bigram frequency beyond the first 10 bigrams.

Trigrams

How many trigrams exist in each corpus?

Corpus    No. of trigrams in corpus
News      1665499
Blogs     15117452
Tweets    10081385

The number of trigrams in blogs and tweets is also substantial and needs to be actively managed in subsequent computations.

What are the top 30 trigrams in each corpus?

Finally, a consistent pattern emerges for the trigrams across the various corpora.

Approach for next phase

The overall approach is to use the chain rule of probability to determine which next word gives the sentence or phrase the highest probability of occurrence. The probability will be determined using Good-Turing smoothing for the n-gram language model. The steps would be as follows (a sketch of the prediction step appears after the list):

  • Create a test set by randomly selecting 5% of the phrases from each of the news, blogs and tweets corpora
  • From the remaining corpus, create the training set:
  • Remove singletons from all n-grams and create a combined set of unigrams and other n-grams across news, blogs and tweets
  • Use Good-Turing smoothing to compute bigram counts and the probability of any bigram that may not be in the created training corpus
  • Use the chain rule with the Markov assumption, using bigram probabilities
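
As a rough illustration of the prediction step, the sketch below scores candidate next words with bigram probabilities under the Markov assumption. It assumes a data frame bigram_freq with columns w1, w2 and count (a hypothetical reshaping of the bigram frequencies computed earlier), and the Good-Turing adjustment shown is deliberately simplified: it does not yet reserve probability mass for unseen bigrams.

    # Sketch only: pick the most probable next word given the previous word,
    # using simplified Good-Turing adjusted bigram counts.
    predict_next_word <- function(prev_word, bigram_freq) {
      # Counts of counts: N_c = number of distinct bigrams seen exactly c times.
      coc <- as.data.frame(table(count = bigram_freq$count), stringsAsFactors = FALSE)
      coc$count <- as.numeric(coc$count)
      N <- function(c) {
        i <- match(c, coc$count)
        if (is.na(i)) 0 else coc$Freq[i]
      }
      # Good-Turing adjusted count c* = (c + 1) * N_(c+1) / N_c, falling back to c.
      gt_count <- function(c) {
        if (N(c) > 0 && N(c + 1) > 0) (c + 1) * N(c + 1) / N(c) else c
      }

      candidates <- bigram_freq[bigram_freq$w1 == prev_word, ]
      if (nrow(candidates) == 0) return(NA_character_)

      candidates$adj  <- vapply(candidates$count, gt_count, numeric(1))
      # Markov assumption: P(w2 | w1) is estimated from bigrams starting with w1.
      candidates$prob <- candidates$adj / sum(candidates$count)
      candidates$w2[which.max(candidates$prob)]
    }

    # Example usage: predict_next_word("of", bigram_freq)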