Milestone Report

Coursera Data Science Specialization - Capstone Project

Victor Ruiz

Synopsis

The goal of the Capstone Project is to build a predictive text model that suggests the next word to be entered based on the previous words typed by the user. For this purpose, Coursera and SwiftKey provide a text corpus, i.e. a collection of text documents, which can be downloaded here. Text documents are provided in four languages: English, German, Finnish and Russian. The documents in the corpus come from three different sources: blogs, news articles and tweets from twitter.com. This report presents the exploratory analysis performed on the English corpus.

Exploratory Analysis

Basic Summary Statistics

Before starting the analysis of the corpus, a basic summary of the files was built. The results are shown in the table below.

##                    blogs     news  twitter
## lines             899288  1010242  2360148
## words           37182923 33983128 29746934
## words/line            41       34       13
## max word length      164       36      120

On average, blogs and news have more words per line than the Twitter file, as expected, since tweets are limited to 140 characters.
The blogs and Twitter files also contain the longest words (164 and 120 characters, respectively), whereas the longest word in the news file has 36 characters.
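The figures in the table can be computed in a single pass over each file. The following is a minimal sketch in Python, used here only for illustration. It assumes that a word is any whitespace-separated token, which may differ from the tokenization behind the table, and the function name `summarize_lines` is hypothetical.

```python
def summarize_lines(lines):
    """Return line count, word count, mean words per line and longest token.

    Assumption: a "word" is any whitespace-separated token.
    """
    n_lines = 0
    n_words = 0
    max_word_len = 0
    for line in lines:
        tokens = line.split()
        n_lines += 1
        n_words += len(tokens)
        for tok in tokens:
            max_word_len = max(max_word_len, len(tok))
    # Rounded mean, matching the integer "words/line" row of the table.
    words_per_line = round(n_words / n_lines) if n_lines else 0
    return {
        "lines": n_lines,
        "words": n_words,
        "words/line": words_per_line,
        "max word length": max_word_len,
    }


# Toy input standing in for one of the corpus files.
sample = ["the quick brown fox", "hello world", "supercalifragilistic"]
stats = summarize_lines(sample)
```

Because the function only iterates over its argument, an open file handle can be passed instead of a list, so a large file never has to be held in memory at once.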

Data Preprocessing

Since the files provided are very large, only a small sample of each file was used for the exploratory analysis. The corpus used in this report is therefore a subset of the original corpus, built by randomly sampling 10,000 lines from each of the files. Before starting the analysis, some basic preprocessing was necessary to clean the text:

transform all characters to lower case.
remove email addresses, URLs.
remove twitter names, twitter hashtags, RTs and via.
remove numbers.
transform end of sentence characters to periods.
remove non-alphabetic characters.
trim extra whitespace.
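The cleaning steps above can be sketched as a chain of regular-expression substitutions. This is an illustrative Python sketch, not the code used for the report: every pattern is an assumption about what counts as an email address, URL or Twitter artifact, and the name `clean_text` is hypothetical. Note that non-alphabetic characters are replaced by a space here, so a contraction such as "don't" would be split into two tokens.

```python
import re

# All patterns are illustrative assumptions, not the report's exact rules.
EMAIL = re.compile(r"\S+@\S+")
URL = re.compile(r"(?:https?://|www\.)\S+")
TWITTER = re.compile(r"(?:[@#]\w+)|\b(?:rt|via)\b")
NUMBER = re.compile(r"\d+")
SENTENCE_END = re.compile(r"[.!?]+")
NON_ALPHA = re.compile(r"[^a-z. ]")
SPACES = re.compile(r"\s+")


def clean_text(text):
    """Apply the preprocessing steps in the order listed in the report."""
    text = text.lower()                   # lower case
    text = EMAIL.sub(" ", text)           # email addresses (before @names)
    text = URL.sub(" ", text)             # URLs
    text = TWITTER.sub(" ", text)         # @names, #hashtags, rt, via
    text = NUMBER.sub(" ", text)          # numbers
    text = SENTENCE_END.sub(" . ", text)  # end-of-sentence marks -> period
    text = NON_ALPHA.sub(" ", text)       # remaining non-alphabetic characters
    return SPACES.sub(" ", text).strip()  # extra whitespace


raw = "RT @bob: Check http://t.co/abc NOW!!! Email me at a@b.com, 42 times? #fun via @al"
cleaned = clean_text(raw)
```

The order matters: email addresses and URLs are removed before end-of-sentence characters are normalized, otherwise the dots inside them would be treated as sentence boundaries.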

After this processing, the corpus contains only sentences separated by periods. Sentence separators are important for the n-gram analysis: if all punctuation marks were removed, the last word of a sentence would form a bigram with the first word of the next sentence.
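To illustrate why the separators matter, the following Python sketch extracts n-grams one sentence at a time, so that no n-gram spans a period. It assumes the text has already been cleaned as described above; the function name `ngrams_by_sentence` is hypothetical.

```python
def ngrams_by_sentence(cleaned, n):
    """Yield n-grams (as tuples) that never cross a sentence boundary.

    Assumption: `cleaned` is preprocessed text in which sentences are
    separated by periods.
    """
    for sentence in cleaned.split("."):
        words = sentence.split()
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


text = "i like tea . you like coffee"
bigrams = list(ngrams_by_sentence(text, 2))
```

In this toy example the pair ("tea", "you") is never produced, whereas it would be if the period were dropped before tokenizing.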

Corpus Analysis

Unigram Analysis

Over 860,000 total words in the sampled corpus.
Nearly 47,000 unique words in the dictionary.
The frequency distribution is skewed: many words occur only a few times, whereas a small proportion of the dictionary is used very often.
About 22,800 words (48.6%) occur only once, i.e. roughly 51% of the words occur more than once.
The top 20 words cover more than 28% of total occurrences.
140 words cover 50% of total occurrences, 7,050 words cover 90% and 15,083 words cover 95%.
Around 45% of total occurrences are stopwords. These words are mostly pronouns, prepositions, conjunctions and connectors.
Words have an average length of 7 characters.
By inspecting some extremely long words, I realized that many of them are not valid words, very likely coming from Twitter and blogs. Therefore, a list of valid words is necessary to prevent the model from predicting these invalid words.
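The coverage figures above (how many distinct words account for a given share of all occurrences) come from accumulating counts in decreasing order of frequency. Here is a minimal Python sketch on toy data; the function name `words_needed_for_coverage` is hypothetical and the numbers it produces are unrelated to the report's corpus.

```python
from collections import Counter


def words_needed_for_coverage(counts, target):
    """Return how many of the most frequent items are needed so that their
    combined count reaches `target` (a fraction in (0, 1]) of all occurrences."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(counts)


tokens = "the cat and the dog and the bird saw the fish".split()
counts = Counter(tokens)

half = words_needed_for_coverage(counts, 0.5)    # words covering 50%
ninety = words_needed_for_coverage(counts, 0.9)  # words covering 90%
singletons = sum(1 for c in counts.values() if c == 1)
```

The same function works unchanged for bigram or trigram counts, since it only relies on the counts and not on the type of the keys.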

plot of chunk uni1

plot of chunk uni2

plot of chunk uni3

Bigram Analysis

Over 390,000 unique bigrams in the sampled corpus.
Around 350,000 bigrams occur only once or twice and were discarded to reduce the sparsity of the term matrix.
Not surprisingly, the most frequent bigrams are combinations of stopwords.
The frequency distribution is even more skewed than that of unigrams.
To cover 50% of all occurrences, a dictionary of 3,165 bigrams is required, and 26,696 bigrams are required to cover 90%.
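Discarding rare bigrams amounts to a simple threshold on the frequency counts. The Python sketch below shows the idea on made-up counts; the name `prune_rare` is hypothetical, and the default threshold of 3 only mirrors the statement that bigrams seen once or twice were discarded.

```python
from collections import Counter


def prune_rare(counts, min_count=3):
    """Keep only n-grams seen at least `min_count` times.

    min_count=3 drops everything that occurs once or twice.
    """
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


# Made-up counts, for illustration only.
bigram_counts = Counter({
    ("of", "the"): 5,
    ("in", "the"): 3,
    ("the", "cat"): 2,
    ("cat", "sat"): 1,
})
kept = prune_rare(bigram_counts)
```

The trade-off is that any bigram removed this way can no longer be suggested, so the threshold controls both the size of the table and the range of possible predictions.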

plot of chunk bi1

plot of chunk bi2

Trigram Analysis

Over 681,000 unique trigrams in the sampled corpus.
Around 634,000 trigrams occur only once.
As observed in the bigram analysis, the most frequent trigrams are combinations of the most frequent unigrams.
The frequency distribution is also highly skewed.
To cover 50% of all occurrences, a dictionary of 10,186 trigrams is required, and 38,697 trigrams are required to cover 90%.

plot of chunk tri1

plot of chunk tri2

plot of chunk cover

Next Steps

A stemming algorithm will be used to reduce the size of the dictionary, since many words will be reduced to the same stem. For a fixed dictionary size, this will also increase the coverage of the dictionary.
Use a list of valid words to remove profanity and other words that we don’t want to predict.
Build frequency matrices for 1-grams, 2-grams and 3-grams. Using 4-grams would improve the accuracy of the prediction, but would also increase the memory footprint and the processing time.
The sentence written by the user needs to be processed with the same steps that were applied to the corpus, i.e. the text is cleaned to keep only valid words.
The model will then use only the last 2 words written by the user to calculate which 3-grams or 2-grams are most likely to occur, based on frequency counts of the n-grams.
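One possible reading of the last step is a simple back-off scheme: look up the last two words in the trigram counts and, if that context was never seen, fall back to the last word and the bigram counts. The Python sketch below illustrates this under stated assumptions. It ranks candidates by raw frequency without any smoothing, the names `build_model` and `predict_next` are hypothetical, and the final model may well be designed differently.

```python
from collections import Counter, defaultdict


def build_model(sentences):
    """Count which word follows each 1-word and 2-word context.

    `sentences` is an iterable of token lists, assumed already cleaned.
    """
    after_one = defaultdict(Counter)
    after_two = defaultdict(Counter)
    for words in sentences:
        for i in range(1, len(words)):
            after_one[words[i - 1]][words[i]] += 1
            if i >= 2:
                after_two[(words[i - 2], words[i - 1])][words[i]] += 1
    return after_one, after_two


def predict_next(model, typed_words, k=3):
    """Suggest up to k next words, trying the 2-word context first and
    backing off to the 1-word context when the former is unseen."""
    after_one, after_two = model
    last = typed_words[-2:]
    if len(last) == 2 and tuple(last) in after_two:
        candidates = after_two[tuple(last)]
    elif last and last[-1] in after_one:
        candidates = after_one[last[-1]]
    else:
        return []
    return [word for word, _ in candidates.most_common(k)]


# Toy training data, for illustration only.
sentences = [
    ["i", "like", "green", "tea"],
    ["i", "like", "green", "apples"],
    ["i", "like", "green", "tea"],
    ["we", "like", "coffee"],
]
model = build_model(sentences)
suggestion = predict_next(model, ["like", "green"], k=1)
fallback = predict_next(model, ["they", "like"])
```

In the toy data, the context ("they", "like") never occurs, so the second call backs off to the words that follow "like" on its own.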