The goal of the Capstone Project is to build a predictive text model that suggests the next word to be entered, based on the words the user has already typed. For this purpose, Coursera and SwiftKey provide a downloadable text corpus, i.e. a collection of text documents. Text documents are provided in four languages: English, German, Finnish and Russian. The documents in the corpus come from three different sources: blogs, news articles and tweets from twitter.com. This report presents the exploratory analysis performed on the English corpus.
Before starting the analysis of the corpus, a basic summary of the files was built. The results are shown in the table below.
##                     blogs     news  twitter
## lines              899288  1010242  2360148
## words            37182923 33983128 29746934
## words/line             41       34       13
## max word length       164       36      120
On average, blog and news entries contain more words per line than tweets, as expected, since tweets were limited to 140 characters.
Blogs and tweets contain much longer "words" (up to 164 and 120 characters, respectively), whereas the longest word in news has 36 characters.
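A summary like the one above could be computed along the following lines. This is only a sketch: the report does not say how words were counted, so splitting on whitespace is an assumption, and the function name `summarize` is hypothetical.

```python
from typing import Dict, Iterable


def summarize(lines: Iterable[str]) -> Dict[str, int]:
    """Return line count, word count, words per line and longest word length."""
    n_lines = 0
    n_words = 0
    max_word_len = 0
    for line in lines:
        # Assumption: a "word" is any whitespace-delimited token.
        words = line.split()
        n_lines += 1
        n_words += len(words)
        if words:
            max_word_len = max(max_word_len, max(len(w) for w in words))
    return {
        "lines": n_lines,
        "words": n_words,
        # Rounded average, matching the integer values shown in the table.
        "words_per_line": round(n_words / n_lines) if n_lines else 0,
        "max_word_length": max_word_len,
    }


# A file object can be passed directly, since iterating over it yields lines,
# e.g. with open(path, encoding="utf-8") as fh: summarize(fh)
print(summarize(["the quick brown fox", "jumps over"]))
```

Because the function consumes lines one at a time, it does not need to hold a whole file in memory, which matters for files with millions of lines.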
Since the files provided are very large, only a small sample of each file was used for the exploratory analysis. The corpus used in this report is therefore a subset of the original corpus, built by randomly sampling 10,000 lines from each of the files. Before starting the analysis, some basic preprocessing was necessary to clean the text:
transform all characters to lower case.
remove email addresses, URLs.
remove twitter names, twitter hashtags, RTs and via.
remove numbers.
transform end of sentence characters to periods.
remove non-alphabetic characters.
trim extra whitespace.
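The sampling and cleaning steps above can be sketched as follows. The regular expressions are illustrative assumptions, not the patterns used for this report: in particular, the sketch treats `.`, `!` and `?` as end-of-sentence characters and drops apostrophes so that contractions stay in one piece. The names `sample_lines` and `clean_text` are hypothetical.

```python
import random
import re
from typing import List, Sequence

EMAIL = re.compile(r"\S+@\S+")
URL = re.compile(r"(?:https?://|www\.)\S+")
TWITTER = re.compile(r"[@#]\w+|\brt\b|\bvia\b")  # names, hashtags, RT, via
NUMBER = re.compile(r"\d+")
SENTENCE_END = re.compile(r"[.!?]+")  # assumed set of sentence terminators
NON_ALPHA = re.compile(r"[^a-z. ]")
SPACES = re.compile(r"\s+")


def sample_lines(lines: Sequence[str], n: int, seed: int = 0) -> List[str]:
    """Randomly pick n lines without replacement (all lines if fewer exist)."""
    return random.Random(seed).sample(list(lines), min(n, len(lines)))


def clean_text(text: str) -> str:
    """Apply the preprocessing steps in the order listed in the report."""
    text = text.lower()
    text = EMAIL.sub(" ", text)   # emails first, so "@" handles stay intact
    text = URL.sub(" ", text)
    text = TWITTER.sub(" ", text)
    text = NUMBER.sub(" ", text)
    text = SENTENCE_END.sub(". ", text)
    text = text.replace("'", "").replace("\u2019", "")  # "it's" -> "its"
    text = NON_ALPHA.sub(" ", text)
    return SPACES.sub(" ", text).strip()


print(clean_text("RT @user: Check http://t.co/abc NOW!!! It's 100% free?"))
```

The order of the steps matters: email addresses and URLs are removed before periods are normalised, otherwise the dots inside them would be mistaken for sentence boundaries.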
After this processing, the corpus contains only sentences separated by periods. Sentence separators are important for the n-gram analysis: if all punctuation were removed, the last word of a sentence would be treated as forming a bigram with the first word of the next sentence.
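To illustrate why the periods are kept, the sketch below counts n-grams within each sentence only, so no n-gram spans a sentence boundary. It assumes text that has already been cleaned as described, with `.` as the only separator; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterator, Tuple


def sentence_ngrams(text: str, n: int) -> Iterator[Tuple[str, ...]]:
    """Yield n-grams sentence by sentence, never crossing a period."""
    for sentence in text.split("."):
        words = sentence.split()
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def ngram_counts(text: str, n: int) -> Counter:
    """Frequency count of all n-grams found in the cleaned text."""
    return Counter(sentence_ngrams(text, n))


bigrams = ngram_counts("i like tea. tea is good.", 2)
# ("tea", "tea") is absent because the two words sit in different sentences.
print(("tea", "tea") in bigrams)
```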
Unigram Analysis
Bigram Analysis
Trigram Analysis
Next Steps
A stemming algorithm will be used to reduce the size of the dictionary, since many words will be reduced to the same stem. For a fixed dictionary size, this will also increase the coverage of the dictionary.
Use a list of valid words to remove profanity and other words that we don’t want to predict.
Build frequency matrices for 1-grams, 2-grams and 3-grams. Using 4-grams would improve the accuracy of the prediction, but it would also increase the memory footprint and the processing time.
The sentence written by the user needs to be processed following the same steps that have been applied to the corpus, i.e., clean the text to keep only valid words.
The model will then use only the last two words written by the user to determine which 3-grams or 2-grams are most likely to occur, based on the frequency counts of the n-grams.
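The prediction step outlined above might look roughly like the sketch below. It is one possible reading of the plan, not the final model: it assumes a simple back-off, in which the last two words are looked up among the 3-grams first, then the last word among the 2-grams, and finally the most frequent single words are returned. The class name `NextWordModel` and its methods are hypothetical, and the input is assumed to be already cleaned.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class NextWordModel:
    """Frequency-based next-word suggestions from 1-, 2- and 3-gram counts."""

    def __init__(self) -> None:
        self.unigrams: Counter = Counter()
        # Maps a context of one or two words to counts of the following word.
        self.by_context: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)

    def train(self, cleaned_text: str) -> None:
        for sentence in cleaned_text.split("."):
            words = sentence.split()
            self.unigrams.update(words)
            for i, word in enumerate(words):
                for k in (1, 2):  # context length: 1 -> 2-gram, 2 -> 3-gram
                    if i >= k:
                        self.by_context[tuple(words[i - k:i])][word] += 1

    def predict(self, typed: str, top: int = 3) -> List[str]:
        # Only the current (last) sentence is used as context.
        words = typed.split(".")[-1].split()
        for k in (2, 1):  # try the longer context first, then back off
            if len(words) >= k:
                context = tuple(words[-k:])
                if context in self.by_context:
                    ranked = self.by_context[context].most_common(top)
                    return [w for w, _ in ranked]
        return [w for w, _ in self.unigrams.most_common(top)]


model = NextWordModel()
model.train("i like green tea. i like green apples. i like green tea.")
print(model.predict("i like green"))
```

Ranking candidates by raw count within a context gives the same order as ranking by relative frequency, so no explicit probabilities are needed for the suggestion list.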