Overview

This analysis report is part of a larger project that aims to build a model predicting the next word in a sequence entered by a user. The goal of this first sub-project is to become familiar with the data and to perform an exploratory analysis of the files provided. As this document should remain concise, it only describes the major features of the data and briefly summarizes the next steps and key ideas for building the prediction model.
The files provided in this capstone project are extracts from US blogs, Twitter and news sites. They contain millions of lines of text, so the very first action after inspecting the files was to draw a random sample of those lines in order to keep computation times reasonable. We then built 3 sub data sets from this sample: training (60%), testing (30%) and validation (10%).
The raw data are too messy to analyse directly: they contain unexpected formatting and syntax, as well as profanities, all of which needed to be cleaned before going further.
Once this pre-processing had been performed, we were able to proceed with a first level of analysis of words and word sequences: unigrams, bigrams and trigrams.
This quickly showed why the ‘stop words’ deserve special attention in our prediction model. Indeed, as per the analysis hereafter, a very small number of the most frequent ‘stop words’ account for a very large part of the total number of words in our training data set: 4 words represent 12% of the total number of words (6.1+ million), 142 words represent 50% and 8021 words represent 90%.

Analysis

Global analysis of the provided files

The 3 files provided are:
- ‘en_US.blogs.txt’,
- ‘en_US.twitter.txt’,
- ‘en_US.news.txt’

Once the files had been loaded, we performed a few basic analyses, summarised in the table below.
The table gives the following details for each file:
- file: name of the file
- file_size_Mb: size of the file (in Mb)
- nb_lines: number of lines in the file
- nb_words: number of words in the file
- words_per_line: average number of words per line
- shortest_line: number of characters in the shortest line
- min_nb_words: minimum number of words in a line
- longest_line: number of characters in the longest line
- max_nb_words: maximum number of words in a line

| file | file_size_Mb | nb_lines | nb_words | words_per_line | shortest_line | min_nb_words | longest_line | max_nb_words |
|---|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.42 | 899,288 | 37,546,246 | 41.75 | 1 | 0 | 40,833 | 6,726 |
| en_US.twitter.txt | 159.36 | 2,360,148 | 30,093,410 | 12.75 | 2 | 1 | 140 | 47 |
| en_US.news.txt | 196.28 | 1,010,242 | 34,762,395 | 34.41 | 1 | 1 | 11,384 | 1,796 |
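
The figures above can be reproduced with a short script along the following lines. This is a minimal sketch: the file paths and the whitespace-based word count are assumptions, so the exact numbers may differ slightly from other counting methods.

```r
files <- c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")

file_summary <- do.call(rbind, lapply(files, function(f) {
  lines          <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  words_per_line <- vapply(strsplit(lines, "\\s+"), length, integer(1))
  data.frame(file           = f,
             file_size_Mb   = round(file.size(f) / 1024^2, 2),
             nb_lines       = length(lines),
             nb_words       = sum(words_per_line),
             words_per_line = round(mean(words_per_line), 2),
             shortest_line  = min(nchar(lines)),
             min_nb_words   = min(words_per_line),
             longest_line   = max(nchar(lines)),
             max_nb_words   = max(words_per_line))
}))
file_summary
```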

Sampling and splitting into TRAINING, TESTING and VALIDATION data sets

As shown in the table above, the total amount of raw data combined from the 3 input files (‘blogs’, ‘twitter’ and ‘news’) is very large (about 4.27 million lines). We therefore extracted a random sample of 10% of the total number of lines, which allowed us to implement our model without spending too much time on computation. This ratio is parameterizable, so we can still increase the amount of data if needed.
Then, out of this sample, we created 3 data sets: ‘training’ (60%), ‘testing’ (30%) and ‘validation’ (10%), which will be used throughout the creation of our prediction model.
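
A minimal sketch of this sampling and splitting, assuming the three files have already been read into the character vectors blogs, twitter and news (the seed and variable names are illustrative, not the exact code used):

```r
set.seed(1234)
sample_ratio <- 0.10

# combine the three sources and draw a 10% random sample of the lines
all_lines <- c(blogs, twitter, news)
sampled   <- sample(all_lines, size = round(length(all_lines) * sample_ratio))

# assign each sampled line to one of the three data sets (60/30/10)
groups <- sample(c("training", "testing", "validation"),
                 size = length(sampled), replace = TRUE,
                 prob = c(0.60, 0.30, 0.10))

training   <- sampled[groups == "training"]
testing    <- sampled[groups == "testing"]
validation <- sampled[groups == "validation"]
```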

Cleaning and pre-processing the data

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 3

[1] How was ur day ma nigga                                                                                                     
[2] Like the episode with Adam Baldwin. Not as much Firefly as I expected. Loved the spot after with R. Downey Jr about Avengers
[3] Your dog won't stop licking my ankles, so that's a little weird, but I get it though. Delicious ankles run in my family.    

A quick look at 3 lines of the training data set (see above) shows that some cleaning and pre-processing is needed. The data were taken “as is” from blogs, Twitter and news web sites, and include many kinds of formatting and syntax that would hurt the accuracy of our prediction model. Moreover, because we do not want our model to suggest profanities, we need to remove them from our data sets.

Thus, we performed the following actions prior to any further analysis of our data sets (a sketch of this pipeline is given after the list):
- Convert to plain text document
- Convert to lower case
- Replace contractions with their full forms
- Remove profanities
- Remove numbers and punctuation
- Strip white spaces
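
The sketch below illustrates these steps with the tm package. It is an outline under stated assumptions rather than the exact code used in this project: the contraction replacements shown are only a subset of the full mapping, and profanity_list is a placeholder for the actual profanity list.

```r
library(tm)

expand_contractions <- function(x) {
  # illustrative subset of the replacements; the full mapping is larger
  x <- gsub("won't", "will not", x, fixed = TRUE)
  x <- gsub("n't",   " not",     x, fixed = TRUE)
  x <- gsub("'s",    " is",      x, fixed = TRUE)
  x
}

clean_dataset <- function(lines, profanity_list = character(0)) {
  # VCorpus(VectorSource()) already stores the lines as plain text documents
  corpus <- VCorpus(VectorSource(lines))
  corpus <- tm_map(corpus, content_transformer(tolower))              # lower case
  corpus <- tm_map(corpus, content_transformer(expand_contractions))  # full forms
  if (length(profanity_list) > 0)
    corpus <- tm_map(corpus, removeWords, profanity_list)             # profanities
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}
```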

Hereafter is the result after cleaning the same 3 lines as previously:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 3

[1] How was ur day ma                                                                                                        
[2] Like the episode with adam baldwin not as much firefly as i expected loved the spot after with r downey jr about avengers
[3] Your dog will not stop licking my ankles so that is a little weird but i get it though delicious ankles run in my family 

In addition to the above actions, we also found that many of the most frequent words in our sample of raw data are ‘stop words’ (see the Stop_words page on Wikipedia). We may therefore want to remove them in order to build a model that predicts something other than those ‘stop words’.
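
With the tm package, this removal can be done in one step on the cleaned corpus. A minimal sketch, where corpus_cleaned stands for the output of the cleaning pipeline sketched above and the built-in English stop word list is used:

```r
library(tm)

# remove the built-in English stop words, then collapse the extra spaces left behind
corpus_no_stop_words <- tm_map(corpus_cleaned, removeWords, stopwords("en"))
corpus_no_stop_words <- tm_map(corpus_no_stop_words, stripWhitespace)
```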

Hereafter is the result after removing the English stop words from the same 3 lines of our training data set:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 3

[1] How was ur day ma                                                                                                        
[2] Like the episode with adam baldwin not as much firefly as i expected loved the spot after with r downey jr about avengers
[3] Your dog will not stop licking my ankles so that is a little weird but i get it though delicious ankles run in my family 

Finally, the table below presents a few details of the data sets we created for our analysis:
- dataset: name of the data set (‘Training’, ‘Testing’ or ‘Validation’)
- ratio_total_nb_lines: percentage of the data set out of the original sample of raw data
- nb_lines: number of lines in the data set
- nb_words: number of words in the data set
- words_per_line: average number of words per line
- nb_words_after_cleaning: number of words in the data set after cleaning
- nb_words_no_stop_words: number of words after cleaning and removing English stop words

| dataset | ratio_total_nb_lines | nb_lines | nb_words | words_per_line | nb_words_after_cleaning | nb_words_no_stop_words |
|---|---|---|---|---|---|---|
| training | 60% | 256180 | 6137867 | 23.96 | 6140431 | 3516993 |
| testing | 30% | 128092 | 3068220 | 23.95 | 3070184 | 1759787 |
| validation | 10% | 42694 | 1017782 | 23.84 | 1018387 | 583370 |

Analysis of the frequency of words and words sequences in TRAINING data set

1-gram sequence (unigram)

We know that our training data set contains a total of about 6.14 million words.

The table below presents the cumulated frequency of the most frequent words in the training data set, with the following details (a sketch of how such a table can be derived follows the column descriptions):
- nb_words: the number of most frequent words needed to reach this coverage level
- frequency_words / nb_words_total: the ratio of the cumulated frequency of those words to the total number of words in the training data set
- cumulated_frequency: the cumulated frequency of those words
- nb_words / nb_unique_words: the ratio of the number of words needed to reach this level to the total number of unique words
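
Such a table can be derived from a named frequency vector (word → count). The sketch below assumes a hypothetical unigram_freq vector produced by a tokenization step such as the getNgram function mentioned in Appendix 1; it is not the exact getNGramSummary implementation.

```r
coverage_table <- function(word_freq, levels = seq(0.1, 0.9, by = 0.1)) {
  freq_sorted <- sort(word_freq, decreasing = TRUE)
  cum_freq    <- cumsum(freq_sorted)
  total_words <- sum(freq_sorted)

  # smallest number of most frequent words whose cumulated count reaches each level
  nb_words <- vapply(levels,
                     function(p) which(cum_freq >= p * total_words)[1],
                     integer(1))

  data.frame(coverage_level      = paste0(100 * levels, "%"),
             nb_words            = nb_words,
             cumulated_frequency = unname(cum_freq[nb_words]),
             pct_of_unique_words = round(100 * nb_words / length(freq_sorted), 2))
}

coverage_table(unigram_freq)
```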

Analysis of the words in training data set including the English stop words

| nb_words | frequency_words / nb_words_total | cumulated_frequency | nb_words / nb_unique_words |
|---|---|---|---|
| 4 | 12% | 722109 | 0% |
| 10 | 20% | 1239884 | 0.01% |
| 26 | 30% | 1858877 | 0.02% |
| 60 | 40% | 2460985 | 0.04% |
| 142 | 50% | 3072803 | 0.09% |
| 379 | 60% | 3683215 | 0.25% |
| 972 | 70% | 4296735 | 0.65% |
| 2523 | 80% | 4910507 | 1.68% |
| 8021 | 90% | 5524124 | 5.35% |

Including the English stop words, the table above shows that the 4 most frequent words already represent 12% of the total number of words in the training data set, while rounding to 0% of the total number of unique words.
To reach 50% of the total number of words, we only need the 142 most frequent words, which represent just 0.09% of the total number of unique words.
Finally, 90% of the total number of words is reached with 8021 words, i.e. 5.35% of the total number of unique words.

Analysis of the words in training data set excluding the English stop words

| nb_words | frequency_words / nb_words_total | cumulated_frequency | nb_words / nb_unique_words |
|---|---|---|---|
| 79 | 10% | 616343 | 0.05% |
| 380 | 20% | 1227956 | 0.25% |
| 1133 | 30% | 1841811 | 0.76% |
| 3212 | 40% | 2455310 | 2.15% |
| 12178 | 50% | 3068941 | 8.14% |

When excluding the English stop words, we only need the 79 most frequent words to cover 10% of the total number of words in our training data set, which represents 0.05% of the total number of unique words.
12178 words are needed to reach 50% of the total number of words, i.e. 8.14% of the total number of unique words.

Word cloud and histogram of the 20 most frequent words in our training data set, including the English stop words

Word cloud and histogram of the 20 most frequent words in our training data set, excluding the English stop words
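
These plots can be generated from the same frequency tables. A minimal sketch using the wordcloud package and base graphics, again assuming the hypothetical unigram_freq vector used above:

```r
library(wordcloud)

top20 <- sort(unigram_freq, decreasing = TRUE)[1:20]

# word cloud of the 20 most frequent words
wordcloud(words = names(top20), freq = as.numeric(top20),
          max.words = 20, colors = "steelblue")

# bar chart of the same 20 words and their frequencies
barplot(as.numeric(top20), names.arg = names(top20), las = 2,
        main = "20 most frequent words", ylab = "frequency")
```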

2-gram sequence (bigram)
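
The bigram and trigram frequency tables are built in the same way as the unigrams; only the tokenization changes. The helper below is a hypothetical stand-in for the getNgram function mentioned in Appendix 1 (not the actual implementation), counting n-grams in a character vector of cleaned lines:

```r
get_ngram_freq <- function(lines, n = 2) {
  tokens_per_line <- strsplit(lines, "\\s+")
  ngrams <- unlist(lapply(tokens_per_line, function(tokens) {
    if (length(tokens) < n) return(character(0))
    vapply(seq_len(length(tokens) - n + 1),
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(ngrams), decreasing = TRUE)   # named counts, most frequent first
}

# 'training_lines' is assumed to hold the cleaned training lines
bigram_freq  <- get_ngram_freq(training_lines, n = 2)
trigram_freq <- get_ngram_freq(training_lines, n = 3)
```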

Word cloud and histogram of the 20 most frequent bigrams in our training data set, including the English stop words

Word cloud and histogram of the 20 most frequent bigrams in our training data set, excluding the English stop words

3-gram sequence (trigram)

Word cloud and histogram of the 20 most frequent trigrams in our training data set, including the English stop words

Word cloud and histogram of the 20 most frequent trigrams in our training data set, excluding the English stop words

Feedback on the plans for creating a prediction algorithm and Shiny app

Based on the analysis we performed, we have a good starting point for computing the probabilities of the next word from the n-grams identified in our training data set. We still need to define a strategy regarding the prediction of ‘stop words’ and how they should be handled in our model.
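
As an illustration of the intended direction (not the final algorithm, which will need smoothing and back-off), next-word probabilities can be estimated from the bigram and unigram counts by maximum likelihood, P(w2 | w1) = count(w1 w2) / count(w1). The sketch below relies on the hypothetical bigram_freq and unigram_freq tables introduced above:

```r
predict_next_word <- function(w1, bigram_freq, unigram_freq, n_candidates = 3) {
  pattern    <- paste0("^", w1, " ")   # assumes w1 contains no regex metacharacters
  candidates <- bigram_freq[grepl(pattern, names(bigram_freq))]
  if (length(candidates) == 0) return(NULL)

  probs        <- as.numeric(candidates) / as.numeric(unigram_freq[w1])
  names(probs) <- sub(pattern, "", names(candidates))
  head(sort(probs, decreasing = TRUE), n_candidates)
}

predict_next_word("thanks", bigram_freq, unigram_freq)
```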

The Shiny app will let the user enter a sequence of words. Each time the user types a letter, the app will check whether the input matches a known word and, if so, will suggest the next word based on the probabilities computed by our model. The app could also offer an option to display from 1 to n candidate words, together with the probability of each one being the next word in the sequence.
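
A minimal Shiny skeleton along these lines is sketched below; predict_next_word and the frequency tables are the hypothetical objects from the previous sketch, and the final app will be richer (reactive per-letter suggestions, configurable number of candidates, etc.).

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("phrase", "Enter a sequence of words:"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    words <- strsplit(trimws(tolower(input$phrase)), "\\s+")[[1]]
    if (length(words) == 0) return(NULL)

    # use the last typed word as the bigram context
    preds <- predict_next_word(tail(words, 1), bigram_freq, unigram_freq)
    if (is.null(preds)) return(NULL)
    data.frame(next_word = names(preds), probability = round(preds, 3))
  })
}

shinyApp(ui = ui, server = server)
```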

APPENDICES

APPENDIX 1 - Processing times for the main treatments (user / system / elapsed, in seconds)

| treatment | user | system | elapsed |
|---|---|---|---|
| Create training, testing and validation data sets | 0.684 | 0.080 | 0.772 |
| Training data set - Call to clean_dataset function | 132.725 | 1.070 | 134.677 |
| Training data set - Call to remove_stop_words function | 13.554 | 0.083 | 13.679 |
| Testing data set - Call to clean_dataset function | 64.249 | 0.460 | 64.868 |
| Testing data set - Call to remove_stop_words function | 6.028 | 0.018 | 6.061 |
| Validation data set - Call to clean_dataset function | 22.075 | 0.181 | 23.149 |
| Validation data set - Call to remove_stop_words function | 2.157 | 0.007 | 2.172 |
| Training data set - Call to getNgram function for 1 gram with English stop words | 24.260 | 0.323 | 24.717 |
| Training data set - Call to getNgram function for 1 gram without English stop words | 12.877 | 0.146 | 13.046 |
| Training data set - Call to getNGramSummary function for 1 gram with English stop words | -0.003 | 0.000 | -0.003 |
| Training data set - Call to getNGramSummary function for 1 gram without English stop words | -0.154 | -0.001 | -0.155 |
| Training data set - Call to getNgram function for 2 gram with English stop words | 17.943 | 0.358 | 18.470 |
| Training data set - Call to getNgram function for 2 gram without English stop words | 14.977 | 0.427 | 15.792 |
| Training data set - Call to getNgram function for 3 gram with English stop words | 30.698 | 1.583 | 32.844 |
| Training data set - Call to getNgram function for 3 gram without English stop words | 23.114 | 1.088 | 24.639 |