Coursera Data Science Specialization - Capstone Project

Victor Ruiz

Synopsis

The goal of the Capstone Project is to build a predictive text model that suggests the next word to be entered, based on the previous words typed by the user. For this purpose, Coursera and SwiftKey provide a text corpus, i.e. a collection of text documents, which can be downloaded here. The documents are available in four languages: English, German, Finnish and Russian. They come from three different sources: blogs, news articles and tweets from twitter.com. This report presents the exploratory analysis performed on the English corpus.

Exploratory Analysis

Basic Summary Statistics

Before starting the analysis of the corpus, a basic summary of the files was built. The results are shown in the table below.

##                    blogs     news  twitter
## lines             899288  1010242  2360148
## words           37182923 33983128 29746934
## words/line            41       34       13
## max word length      164       36      120
  • Blogs and news articles have on average more words per document than tweets, as expected, since tweets are limited to 140 characters.

  • Blogs and tweets contain much longer words (up to 164 and 120 characters respectively), whereas the longest word in the news articles has 36 characters.
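
The figures in the table can be computed in a single pass over each file. The following is a minimal Python sketch, not the code used for this report; the function name `summarize` and the whitespace-based tokenization are assumptions made for illustration.

```python
def summarize(lines):
    """Return line count, word count, mean words per line and longest word length."""
    n_lines = 0
    n_words = 0
    max_word_len = 0
    for line in lines:
        words = line.split()  # assumption: words are separated by whitespace
        n_lines += 1
        n_words += len(words)
        for word in words:
            max_word_len = max(max_word_len, len(word))
    return {
        "lines": n_lines,
        "words": n_words,
        "words_per_line": round(n_words / n_lines) if n_lines else 0,
        "max_word_length": max_word_len,
    }


# Tiny in-memory example; a real run would iterate over an open file instead.
sample = ["the quick brown fox", "hello world"]
stats = summarize(sample)
```

Because the function only needs an iterable of lines, it can be fed an open file handle directly, so the large files never have to be loaded into memory at once.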

Data Preprocessing

Since the files provided are very large, only a small sample was used for the exploratory analysis. The corpus used in this report is therefore a subset of the original corpus, built by randomly sampling 10,000 lines from each file. Before starting the analysis, some basic preprocessing was necessary to clean the text:

  • transform all characters to lower case.

  • remove email addresses and URLs.

  • remove Twitter names, Twitter hashtags, RTs and via.

  • remove numbers.

  • transform end of sentence characters to periods.

  • remove non-alphabetic characters.

  • trim extra whitespace.
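
The cleaning steps above can be sketched as a chain of regular-expression substitutions. This is an illustrative Python version under stated assumptions, not the code used for this report: the exact patterns, the function name `clean_text` and the decision to drop apostrophes (so that contractions stay in one token) are all assumptions.

```python
import re


def clean_text(text):
    """Apply the cleaning steps listed above to one line of text."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)               # email addresses
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)               # Twitter names and hashtags
    text = re.sub(r"\b(rt|via)\b", " ", text)          # retweet markers
    text = re.sub(r"\d+", " ", text)                   # numbers
    text = re.sub(r"[.!?]+", ".", text)                # end of sentence -> period
    text = re.sub(r"['’]", "", text)                   # assumption: drop apostrophes
    text = re.sub(r"[^a-z. ]", " ", text)              # other non-alphabetic characters
    text = re.sub(r"(\s*\.)+\s*", ". ", text)          # one period per sentence end
    text = re.sub(r" +", " ", text).strip()            # trim extra whitespace
    return re.sub(r"^\. ?", "", text)                  # no leading separator


tweet = "RT @bob: Visit http://example.com NOW!!! I paid 20 dollars... mail me@example.com #wow"
cleaned = clean_text(tweet)
```

The order of the substitutions matters: email addresses and URLs are removed before punctuation is touched, since they would otherwise be broken into fragments that look like ordinary words.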

After this processing, the corpus contains only sentences separated by periods. Sentence separators are important for the n-gram analysis: if all punctuation were removed, the last word of a sentence would form a bigram with the first word of the next sentence.
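
To make the role of the sentence separators concrete, here is a small Python sketch that counts n-grams one sentence at a time, so that no n-gram spans a sentence boundary. The function name `ngrams_by_sentence` is hypothetical, and the input is assumed to be already cleaned text.

```python
from collections import Counter


def ngrams_by_sentence(text, n):
    """Count n-grams without crossing the periods that separate sentences."""
    counts = Counter()
    for sentence in text.split("."):
        words = sentence.split()
        # Slide a window of n words over this sentence only.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


bigrams = ngrams_by_sentence("i like tea. tea is good", 2)
```

In this example the pair "tea tea" is never counted, although the two words are adjacent in the raw text, because they belong to different sentences.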

Corpus Analysis

Unigram Analysis

  • Over 860,000 total words in the sampled corpus.
  • Nearly 47,000 unique words in the dictionary.
  • The frequency distribution is skewed: many words occur only a few times, whereas a small proportion of the dictionary is used very often.
  • About 22,800 words (48.6%) occur only once, i.e. about 51% of the words are used more than once.
  • The top 20 words cover more than 28% of total occurrences.
  • 140 words cover 50% of total occurrences, 7,050 words cover 90%, and 15,083 words cover 95%.
  • Around 45% of total occurrences are stopwords. These words are mostly pronouns, prepositions, conjunctions and connectors.
  • Words have an average length of 7 characters.
  • By inspecting some extremely long words, I realized that many of them are not valid words, very likely coming from Twitter and blogs. Therefore, a list of valid words is necessary to prevent the model from predicting these bad words.
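
The coverage figures above answer the question "how many of the most frequent words are needed to account for a given share of all occurrences?". A minimal Python sketch of that calculation follows; the function name `words_needed_for_coverage` and the toy counts are assumptions, not the data or code of this report.

```python
from collections import Counter


def words_needed_for_coverage(counts, target):
    """Smallest number of top-ranked entries whose counts reach `target` (0 to 1)."""
    total = sum(counts.values())
    covered = 0
    # most_common() yields entries from the most to the least frequent.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy frequency table with 10 occurrences in total.
counts = Counter({"the": 6, "of": 2, "cat": 1, "dog": 1})
half = words_needed_for_coverage(counts, 0.5)
most = words_needed_for_coverage(counts, 0.85)
singletons = sum(1 for c in counts.values() if c == 1)
```

The same function works unchanged for bigram or trigram counts, since it only relies on the frequencies.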

[Figures: unigram plots (chunks uni1, uni2, uni3)]

Bigram Analysis

  • Over 390,000 unique bigrams in the sampled corpus.
  • Around 350,000 bigrams occur only one or two times and were discarded to reduce the sparsity of the term matrix.
  • Not surprisingly, the most frequent bigrams are combinations of stopwords.
  • Frequency distribution is even more skewed than the distribution of unigrams.
  • To cover 50% of all occurrences, a dictionary of 3,165 bigrams is required, and 26,696 bigrams are required to cover 90%.
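
Discarding the bigrams that occur only once or twice amounts to a simple frequency threshold. A hedged Python sketch is given below; the function name `prune_rare` is hypothetical, and the default `min_count=3` is derived from the "one or two times" rule stated above.

```python
from collections import Counter


def prune_rare(counts, min_count=3):
    """Keep only the n-grams seen at least `min_count` times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


# Toy bigram counts, for illustration only.
bigram_counts = Counter({"of the": 5, "in the": 3, "the cat": 2, "cat sat": 1})
kept = prune_rare(bigram_counts)
share_kept = sum(kept.values()) / sum(bigram_counts.values())
```

Here `share_kept` is the fraction of all bigram occurrences that survive the pruning, which is a quick way to check how much coverage the threshold costs.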

[Figures: bigram plots (chunks bi1, bi2)]

Trigram Analysis

  • Over 681,000 unique trigrams in the sampled corpus.
  • Around 634,000 trigrams occur only once.
  • As observed in the bigram analysis, the most frequent trigrams are combinations of the most frequent unigrams.
  • Frequency distribution is also highly skewed.
  • To cover 50% of all occurrences, a dictionary of 10,186 trigrams is required, and 38,697 trigrams are required to cover 90%.

[Figures: trigram plots (chunks tri1, tri2) and coverage plot (chunk cover)]

Next Steps

  • A stemming algorithm will be used to reduce the size of the dictionary, since many words will be reduced to the same stem. For a fixed dictionary size, this will also increase the coverage of the dictionary.

  • Use a list of valid words to remove profanity and other words that we don’t want to predict.

  • Build frequency matrices for 1-grams, 2-grams and 3-grams. Using 4-grams would improve the accuracy of the prediction, but would also increase the memory footprint and the processing time.

  • The sentence written by the user needs to be processed following the same steps that have been applied to the corpus, i.e., clean the text to keep only valid words.

  • The model will then use only the last 2 words that the user has written to calculate which 3-grams or 2-grams are most likely to occur, based on the frequency counts of the n-grams.
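
The prediction step described in the last item can be sketched as a frequency lookup that tries trigrams first and falls back to bigrams. The Python code below is only an illustration of that idea under stated assumptions, not the final model: the names `build_model` and `predict_next` are hypothetical, and the final fallback to the most frequent single words is an assumption that is not part of the plan above.

```python
from collections import Counter, defaultdict


def build_model(sentences):
    """Build frequency tables from sentences given as lists of cleaned words."""
    unigrams = Counter()
    after_one = defaultdict(Counter)  # word -> counts of the following word
    after_two = defaultdict(Counter)  # (word1, word2) -> counts of the following word
    for words in sentences:
        unigrams.update(words)
        for i in range(len(words) - 1):
            after_one[words[i]][words[i + 1]] += 1
        for i in range(len(words) - 2):
            after_two[(words[i], words[i + 1])][words[i + 2]] += 1
    return unigrams, after_one, after_two


def predict_next(model, typed_words, k=3):
    """Suggest up to k next words from the last two words typed."""
    unigrams, after_one, after_two = model
    last = tuple(typed_words[-2:])
    if len(last) == 2 and last in after_two:   # trigram evidence available
        table = after_two[last]
    elif last and last[-1] in after_one:       # back off to bigrams
        table = after_one[last[-1]]
    else:                                      # assumption: fall back to unigrams
        table = unigrams
    return [word for word, _ in table.most_common(k)]


# Toy training data, for illustration only.
model = build_model([
    ["i", "like", "tea"],
    ["i", "like", "tea"],
    ["we", "like", "coffee"],
    ["you", "like", "coffee"],
    ["they", "like", "coffee"],
])
from_trigram = predict_next(model, ["i", "like"], k=1)
from_bigram = predict_next(model, ["cats", "like"], k=1)
from_unigram = predict_next(model, ["hello"], k=1)
```

In the toy data, "i like" is followed by "tea", while "like" on its own is most often followed by "coffee", so the two calls give different suggestions depending on how much context matches.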