This analysis report is part of a larger project that aims to build a model predicting the next word in a sequence entered by a user. The goal of this first sub-project is to become familiar with the data and to perform an exploratory analysis of the files provided. As this document is meant to be concise, it only describes the major features of the data and briefly summarizes the next steps and key ideas for building the prediction model.
The files provided in this capstone project are extracts from US blogs, Twitter and news. They contain millions of lines of text, so the very first step after inspecting the files was to draw a sample from them in order to keep computation times reasonable. We then built 3 sub data sets from this sample: training (60%), testing (30%) and validation (10%).
The raw data are too messy to analyse directly: they contain unexpected formatting and syntax, as well as profanities, all of which had to be cleaned before going further.
Once this pre-processing had been performed, we were able to carry out a first level of analysis of words and word sequences: unigrams, bigrams and trigrams.
This quickly showed why ‘stop words’ deserve particular attention in our prediction model. Indeed, as detailed in the analysis below, a very small number of the most frequent words, most of them ‘stop words’, account for a very large share of the total number of words in our training data set (6.1+ million words): 4 words account for 12% of all words, 142 words for 50% and 8,021 words for 90%.
The 3 files provided are:
- ‘en_US.blogs.txt’,
- ‘en_US.twitter.txt’,
- ‘en_US.news.txt’
Once the files had been loaded, we performed a few basic analyses, summarized in the table below (a sketch of this computation is given after the table).
The table gives the following details for each file:
- file: name of the file
- file_size_Mb: size of the file (in Mb)
- nb_lines: number of lines in the file
- nb_words: number of words in the file
- words_per_line: average number of words per line
- shortest_line: number of characters in the shortest line
- min_nb_words: minimum number of words in a line
- longest_line: number of characters in the longest line
- max_nb_words: maximum number of words in a line
| file | file_size_Mb | nb_lines | nb_words | words_per_line | shortest_line | min_nb_words | longest_line | max_nb_words |
|---|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.42 | 899,288 | 37,546,246 | 41.75 | 1 | 0 | 40,833 | 6,726 |
| en_US.twitter.txt | 159.36 | 2,360,148 | 30,093,410 | 12.75 | 2 | 1 | 140 | 47 |
| en_US.news.txt | 196.28 | 1,010,242 | 34,762,395 | 34.41 | 1 | 1 | 11,384 | 1,796 |
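As an illustration, here is a minimal sketch in base R of how such per-file statistics could be computed. Word counts rely on a simple whitespace split, so the exact figures produced for the table above may differ slightly:

```r
# Basic statistics for one input file (word counts use a simple whitespace split).
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words_per_line <- lengths(strsplit(lines, "\\s+"))
  chars_per_line <- nchar(lines)
  data.frame(
    file           = basename(path),
    file_size_Mb   = round(file.size(path) / 1024^2, 2),
    nb_lines       = length(lines),
    nb_words       = sum(words_per_line),
    words_per_line = round(mean(words_per_line), 2),
    shortest_line  = min(chars_per_line),
    min_nb_words   = min(words_per_line),
    longest_line   = max(chars_per_line),
    max_nb_words   = max(words_per_line)
  )
}

do.call(rbind, lapply(c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"),
                      file_stats))
```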
As shown in the table above, the total amount of raw data combined from the 3 input files (‘blogs’, ‘twitter’ and ‘news’) is very large: about 4.27 million lines. We therefore extracted a random sample of 10% of the total number of lines, which allowed us to develop our model without spending too much time on computation. This ratio is a parameter, so the amount of data can still be increased if needed.
From this sample we then created 3 data sets, ‘training’ (60%), ‘testing’ (30%) and ‘validation’ (10%), which are used throughout the creation of our prediction model.
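A minimal sketch of this sampling and splitting step in base R; `all_lines` (the combined lines of the 3 files), the seed and the exact call are assumptions used for illustration:

```r
set.seed(12345)                      # arbitrary seed, for reproducibility

sample_ratio <- 0.10                 # parameterizable: share of the lines kept in the sample
sampled <- sample(all_lines, size = round(length(all_lines) * sample_ratio))

# Assign each sampled line to one of the 3 data sets (60% / 30% / 10%).
set_label <- sample(c("training", "testing", "validation"),
                    size = length(sampled), replace = TRUE,
                    prob = c(0.6, 0.3, 0.1))

training   <- sampled[set_label == "training"]
testing    <- sampled[set_label == "testing"]
validation <- sampled[set_label == "validation"]
```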
For illustration, here are 3 raw lines from the training data set:

```
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
[1] How was ur day ma nigga
[2] Like the episode with Adam Baldwin. Not as much Firefly as I expected. Loved the spot after with R. Downey Jr about Avengers
[3] Your dog won't stop licking my ankles, so that's a little weird, but I get it though. Delicious ankles run in my family.
```
A quick look at 3 lines of the training data set (see above) shows that some cleaning and pre-processing is needed. The data were taken “as is” from blogs, Twitter and news websites, and include many kinds of formatting and syntax that would hurt the accuracy of our prediction model. Moreover, because we do not want our model to suggest profanities, we need to remove them from our data sets.
Thus, we performed the following actions prior to any further analysis of our data sets (a sketch of such a cleaning pipeline is given after this list):
- Convert to plain text document
- Convert to lower case
- Replace contractions with their full forms
- Remove profanities
- Remove numbers and punctuation
- Strip white spaces
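Below is a minimal sketch of such a cleaning pipeline using the tm package. The function name mirrors the `clean_dataset` function listed in the processing times at the end of this report, but the body is only illustrative: the profanity list and the contraction patterns shown here are placeholders, not the lists actually used in the project.

```r
library(tm)

# Sketch of the cleaning pipeline; 'profanities' is a character vector of words to remove.
clean_dataset <- function(lines, profanities) {
  corpus <- VCorpus(VectorSource(lines))               # plain text documents

  # Convert to lower case.
  corpus <- tm_map(corpus, content_transformer(tolower))

  # Replace a few common contractions with their full forms (illustrative subset only).
  expand_contractions <- function(x) {
    x <- gsub("won't",  "will not", x, fixed = TRUE)
    x <- gsub("can't",  "cannot",   x, fixed = TRUE)
    x <- gsub("that's", "that is",  x, fixed = TRUE)
    x <- gsub("n't",    " not",     x, fixed = TRUE)
    x
  }
  corpus <- tm_map(corpus, content_transformer(expand_contractions))

  # Remove profanities, numbers and punctuation, then strip extra white space.
  corpus <- tm_map(corpus, removeWords, profanities)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Example call with a placeholder profanity list:
# training_clean <- clean_dataset(training, c("badword1", "badword2"))
```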
Below is the result of cleaning the same 3 lines as before:
```
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
[1] How was ur day ma
[2] Like the episode with adam baldwin not as much firefly as i expected loved the spot after with r downey jr about avengers
[3] Your dog will not stop licking my ankles so that is a little weird but i get it though delicious ankles run in my family
```
In addition to the above actions, we also observed that most of the words in our sample of raw data are ‘stop words’ (see the Stop words page on Wikipedia). We may therefore consider removing them, so that the model can predict something other than those ‘stop words’.
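A minimal sketch of this step with the tm package; the project's own `remove_stop_words` function (see the processing times at the end of this report) may differ:

```r
library(tm)

# Remove English stop words from an already cleaned corpus.
remove_stop_words <- function(corpus) {
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tm_map(corpus, stripWhitespace)     # removing words leaves extra spaces behind
}
```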
Below is the result of removing English stop words from the same 3 lines of our training data set:
```
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
[1] How was ur day ma
[2] Like the episode with adam baldwin not as much firefly as i expected loved the spot after with r downey jr about avengers
[3] Your dog will not stop licking my ankles so that is a little weird but i get it though delicious ankles run in my family
```
Finally, the table below presents a few details of the data sets we created for our analysis:
- dataset: name of the data set (‘Training’, ‘Testing’ or ‘Validation’)
- ratio_total_nb_lines: percentage of lines of the original sample of raw data included in the data set
- nb_lines: number of lines in the data set
- nb_words: number of words in the data set
- words_per_line: average number of words per line
- nb_words_after_cleaning: number of words in the data set after cleaning
- nb_words_no_stop_words: number of words in the data set after cleaning and removing English stop words
| dataset | ratio_total_nb_lines | nb_lines | nb_words | words_per_line | nb_words_after_cleaning | nb_words_no_stop_words |
|---|---|---|---|---|---|---|
| training | 60% | 256,180 | 6,137,867 | 23.96 | 6,140,431 | 3,516,993 |
| testing | 30% | 128,092 | 3,068,220 | 23.95 | 3,070,184 | 1,759,787 |
| validation | 10% | 42,694 | 1,017,782 | 23.84 | 1,018,387 | 583,370 |
Our training data set contains a total of about 6.138 million words.
The table below presents the cumulative frequency of the most frequent words in the training data set, with the following details (a sketch of this computation is given after the list):
- nb_words: the number of most frequent words needed to reach this level
- frequency_words / nb_words_total: the cumulative frequency of those words as a share of the total number of words in the training data set
- cumulated_frequency: the cumulative frequency (count) of those words
- nb_words / nb_unique_words: the number of words needed to reach this level as a share of the total number of unique words
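These coverage figures can be derived from a sorted word-frequency table. Here is a minimal sketch in base R, assuming `words` holds the character vector of all tokens of the training data set:

```r
# Number of most frequent words needed to cover a given share of all word occurrences.
coverage <- function(words, share) {
  freq <- sort(table(words), decreasing = TRUE)         # frequencies, most frequent first
  cum  <- cumsum(freq)                                  # cumulated frequencies
  nb   <- unname(which(cum >= share * sum(freq))[1])    # first rank reaching the target share
  c(nb_words              = nb,
    cumulated_frequency   = unname(cum[nb]),
    ratio_nb_unique_words = nb / length(freq))
}

# Example, for some of the coverage levels shown in the tables below:
# sapply(c(0.12, 0.5, 0.9), function(s) coverage(words, s))
```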
Analysis of the words in the training data set, including English stop words
| nb_words | frequency_words / nb_words_total | cumulated_frequency | nb_words / nb_unique_words |
|---|---|---|---|
| 4 | 12% | 722,109 | 0% |
| 10 | 20% | 1,239,884 | 0.01% |
| 26 | 30% | 1,858,877 | 0.02% |
| 60 | 40% | 2,460,985 | 0.04% |
| 142 | 50% | 3,072,803 | 0.09% |
| 379 | 60% | 3,683,215 | 0.25% |
| 972 | 70% | 4,296,735 | 0.65% |
| 2,523 | 80% | 4,910,507 | 1.68% |
| 8,021 | 90% | 5,524,124 | 5.35% |
Including the English stop words, the table above shows that the 4 most frequent words already represent 12% of the total number of words in the training data set, while accounting for a negligible share (rounded to 0%) of the total number of unique words.
To reach 50% of the total number of words, only the 142 most frequent words are needed, i.e. just 0.09% of the total number of unique words.
Finally, 90% of the total number of words is reached with 8,021 words, i.e. 5.35% of the total number of unique words.
Analysis of the words in the training data set, excluding English stop words
| nb_words | frequency_words / nb_words_total | cumulated_frequency | nb_words / nb_unique_words |
|---|---|---|---|
| 79 | 10% | 616,343 | 0.05% |
| 380 | 20% | 1,227,956 | 0.25% |
| 1,133 | 30% | 1,841,811 | 0.76% |
| 3,212 | 40% | 2,455,310 | 2.15% |
| 12,178 | 50% | 3,068,941 | 8.14% |
When the English stop words are excluded, 79 words are needed to cover 10% of the total number of words in our training data set, which represents 0.05% of the total number of unique words.
Reaching 50% of the total number of words requires 12,178 words, i.e. 8.14% of the total number of unique words.
Word cloud and histogram of the 20 most frequent unigrams in our training data set, including and excluding English stop words
Word cloud and histogram of the 20 most frequent bigrams in our training data set, including and excluding English stop words
Word cloud and histogram of the 20 most frequent trigrams in our training data set, including and excluding English stop words
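The bigram and trigram counts behind these figures can be obtained by pasting consecutive tokens together. Here is a minimal base R sketch of such a helper; the project's `getNgram` function may be implemented differently (e.g. with a dedicated tokenizer), and `training_clean_lines` is an assumed name for the cleaned training lines:

```r
# Frequency table of n-grams built from a character vector of cleaned lines.
get_ngrams <- function(lines, n) {
  ngrams <- unlist(lapply(strsplit(lines, "\\s+"), function(tokens) {
    tokens <- tokens[tokens != ""]
    if (length(tokens) < n) return(character(0))
    # Paste every run of n consecutive tokens into a single n-gram.
    sapply(seq_len(length(tokens) - n + 1),
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}

# Example: the 20 most frequent trigrams of the cleaned training lines.
# head(get_ngrams(training_clean_lines, 3), 20)
```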
Based on the analysis performed, we have a good starting point for computing the probability of the next word from the n-grams identified in our training data set. We still need to define a strategy regarding ‘stop words’ and how they should be handled in our model.
The Shiny app will let the user enter a sequence of words. Each time the user types a letter, the app will check whether the input ends with a known word and, if so, will suggest the next word based on the probabilities computed by our model. The app could also offer an option to display from 1 to n candidate words, together with the probability of each being the next word in the sequence.
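A minimal sketch of what such a Shiny interface could look like; `predict_next_word()` and the values it returns are hypothetical placeholders for the prediction model we still have to build:

```r
library(shiny)

# Hypothetical placeholder: returns the n most probable next words for a phrase.
# The real implementation will query the n-gram probabilities of our model.
predict_next_word <- function(phrase, n = 3) {
  dummy <- data.frame(word        = c("the", "to", "and", "a", "of"),
                      probability = c(0.10, 0.08, 0.05, 0.04, 0.03))
  head(dummy, n)
}

ui <- fluidPage(
  textInput("phrase", "Enter a sequence of words:"),
  sliderInput("n", "Number of suggested words:", min = 1, max = 5, value = 3),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(nzchar(input$phrase))          # wait until the user has typed something
    predict_next_word(input$phrase, input$n)
  })
}

shinyApp(ui = ui, server = server)
```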
For reference, the table below lists the processing times (in seconds, as measured in R) of the main processing steps:
| processing step | user time (s) | system time (s) | elapsed time (s) |
|---|---|---|---|
| Create training, testing and validation data sets | 0.684 | 0.080 | 0.772 |
| Training data set - Call to clean_dataset function | 132.725 | 1.070 | 134.677 |
| Training data set - Call to remove_stop_words function | 13.554 | 0.083 | 13.679 |
| Testing data set - Call to clean_dataset function | 64.249 | 0.460 | 64.868 |
| Testing data set - Call to remove_stop_words function | 6.028 | 0.018 | 6.061 |
| Validation data set - Call to clean_dataset function | 22.075 | 0.181 | 23.149 |
| Validation data set - Call to remove_stop_words function | 2.157 | 0.007 | 2.172 |
| Training data set - Call to getNgram function for 1 gram with English stop words | 24.260 | 0.323 | 24.717 |
| Training data set - Call to getNgram function for 1 gram without English stop words | 12.877 | 0.146 | 13.046 |
| Training data set - Call to getNGramSummary function for 1 gram with English stop words | -0.003 | 0.000 | -0.003 |
| Training data set - Call to getNGramSummary function for 1 gram without English stop words | -0.154 | -0.001 | -0.155 |
| Training data set - Call to getNgram function for 2 gram with English stop words | 17.943 | 0.358 | 18.470 |
| Training data set - Call to getNgram function for 2 gram without English stop words | 14.977 | 0.427 | 15.792 |
| Training data set - Call to getNgram function for 3 gram with English stop words | 30.698 | 1.583 | 32.844 |
| Training data set - Call to getNgram function for 3 gram without English stop words | 23.114 | 1.088 | 24.639 |