This analysis report is part of a larger project that aims to build a model predicting the next word in a sequence entered by a user. The goal of this first sub-project is to become familiar with the data and to perform an exploratory analysis of the files provided. As this document is meant to be concise, it only describes the major features of the data and briefly summarizes the next steps and key ideas for building the prediction model.
The files provided in this capstone project are extracts from US blogs, Twitter and news. They contain millions of lines of text, so the very first step after inspecting the files was to draw a sample from them in order to keep computation times reasonable. We then built 3 sub data sets from this sample: training (60%), testing (30%) and validation (10%).
The raw data are too messy to analyse directly: they contain unexpected formatting and syntax, as well as profanities, all of which had to be cleaned before going further.
Once this pre-processing had been performed, we were able to carry out a first level of analysis of words and word sequences: unigrams, bigrams and trigrams.
This quickly showed why ‘stop words’ deserve particular attention in our prediction model. Indeed, as detailed in the analysis below, a very small number of the most frequent words, most of them ‘stop words’, account for a very large share of the total number of words in our training data set (6.1+ million words): 4 words account for 12% of all words, 142 words for 50% and 8,021 words for 90%.
The 3 files provided are:
- ‘en_US.blogs.txt’,
- ‘en_US.twitter.txt’,
- ‘en_US.news.txt’
Once the files had been loaded, we performed a few basic analyses, summarized in the table below (a sketch of this computation is given after the table).
The table gives the following details for each file:
- file: name of the file
- file_size_Mb: size of the file (in Mb)
- nb_lines: number of lines in the file
- nb_words: number of words in the file
- words_per_line: average number of words per line
- shortest_line: number of characters in the shortest line
- min_nb_words: minimum number of words in a line
- longest_line: number of characters in the longest line
- max_nb_words: maximum number of words in a line
| file | file_size_Mb | nb_lines | nb_words | words_per_line | shortest_line | min_nb_words | longest_line | max_nb_words |
|---|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.42 | 899,288 | 37,546,246 | 41.75 | 1 | 0 | 40,833 | 6,726 |
| en_US.twitter.txt | 159.36 | 2,360,148 | 30,093,410 | 12.75 | 2 | 1 | 140 | 47 |
| en_US.news.txt | 196.28 | 1,010,242 | 34,762,395 | 34.41 | 1 | 1 | 11,384 | 1,796 |
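As an illustration, here is a minimal sketch in base R of how such per-file statistics could be computed. Word counts rely on a simple whitespace split, so the exact figures produced for the table above may differ slightly:

```r
# Basic statistics for one input file (word counts use a simple whitespace split).
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words_per_line <- lengths(strsplit(lines, "\\s+"))
  chars_per_line <- nchar(lines)
  data.frame(
    file           = basename(path),
    file_size_Mb   = round(file.size(path) / 1024^2, 2),
    nb_lines       = length(lines),
    nb_words       = sum(words_per_line),
    words_per_line = round(mean(words_per_line), 2),
    shortest_line  = min(chars_per_line),
    min_nb_words   = min(words_per_line),
    longest_line   = max(chars_per_line),
    max_nb_words   = max(words_per_line)
  )
}

do.call(rbind, lapply(c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"),
                      file_stats))
```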
As shown in the table above, the total amount of raw data combined from the 3 input files (‘blogs’, ‘twitter’ and ‘news’) is very large: about 4.27 million lines. We therefore extracted a random sample of 10% of the total number of lines, which allowed us to develop our model without spending too much time on computation. This ratio is a parameter, so the amount of data can still be increased if needed.
From this sample we then created 3 data sets, ‘training’ (60%), ‘testing’ (30%) and ‘validation’ (10%), which are used throughout the creation of our prediction model.
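A minimal sketch of this sampling and splitting step in base R; `all_lines` (the combined lines of the 3 files), the seed and the exact call are assumptions used for illustration:

```r
set.seed(12345)                      # arbitrary seed, for reproducibility

sample_ratio <- 0.10                 # parameterizable: share of the lines kept in the sample
sampled <- sample(all_lines, size = round(length(all_lines) * sample_ratio))

# Assign each sampled line to one of the 3 data sets (60% / 30% / 10%).
set_label <- sample(c("training", "testing", "validation"),
                    size = length(sampled), replace = TRUE,
                    prob = c(0.6, 0.3, 0.1))

training   <- sampled[set_label == "training"]
testing    <- sampled[set_label == "testing"]
validation <- sampled[set_label == "validation"]
```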
For illustration, here are 3 raw lines from the training data set:

```
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
[1] How was ur day ma nigga
[2] Like the episode with Adam Baldwin. Not as much Firefly as I expected. Loved the spot after with R. Downey Jr about Avengers
[3] Your dog won't stop licking my ankles, so that's a little weird, but I get it though. Delicious ankles run in my family.
```
A quick look at 3 lines of the training data set (see above) shows that some cleaning and pre-processing is needed. The data were taken “as is” from blogs, Twitter and news websites, and include many kinds of formatting and syntax that would hurt the accuracy of our prediction model. Moreover, because we do not want our model to suggest profanities, we need to remove them from our data sets.
Thus, we performed the following actions prior to any further analysis of our data sets (a sketch of such a cleaning pipeline is given after this list):
- Convert to plain text document
- Convert to lower case
- Replace contractions with their full forms
- Remove profanities
- Remove numbers and punctuation
- Strip white spaces
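Below is a minimal sketch of such a cleaning pipeline using the tm package. The function name mirrors the `clean_dataset` function listed in the processing times at the end of this report, but the body is only illustrative: the profanity list and the contraction patterns shown here are placeholders, not the lists actually used in the project.

```r
library(tm)

# Sketch of the cleaning pipeline; 'profanities' is a character vector of words to remove.
clean_dataset <- function(lines, profanities) {
  corpus <- VCorpus(VectorSource(lines))               # plain text documents

  # Convert to lower case.
  corpus <- tm_map(corpus, content_transformer(tolower))

  # Replace a few common contractions with their full forms (illustrative subset only).
  expand_contractions <- function(x) {
    x <- gsub("won't",  "will not", x, fixed = TRUE)
    x <- gsub("can't",  "cannot",   x, fixed = TRUE)
    x <- gsub("that's", "that is",  x, fixed = TRUE)
    x <- gsub("n't",    " not",     x, fixed = TRUE)
    x
  }
  corpus <- tm_map(corpus, content_transformer(expand_contractions))

  # Remove profanities, numbers and punctuation, then strip extra white space.
  corpus <- tm_map(corpus, removeWords, profanities)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Example call with a placeholder profanity list:
# training_clean <- clean_dataset(training, c("badword1", "badword2"))
```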
Below is the result of cleaning the same 3 lines as before:
```
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
[1] How was ur day ma
[2] Like the episode with adam baldwin not as much firefly as i expected loved the spot after with r downey jr about avengers
[3] Your dog will not stop licking my ankles so that is a little weird but i get it though delicious ankles run in my family
```
In addition to the above actions, we also observed that most of the words in our sample of raw data are ‘stop words’ (see the Stop words page on Wikipedia). We may therefore consider removing them, so that the model can predict something other than those ‘stop words’.
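A minimal sketch of this step with the tm package; the project's own `remove_stop_words` function (see the processing times at the end of this report) may differ:

```r
library(tm)

# Remove English stop words from an already cleaned corpus.
remove_stop_words <- function(corpus) {
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tm_map(corpus, stripWhitespace)     # removing words leaves extra spaces behind
}
```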
Below is the result of removing English stop words from the same 3 lines of our training data set:
```
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
[1] How was ur day ma
[2] Like the episode with adam baldwin not as much firefly as i expected loved the spot after with r downey jr about avengers
[3] Your dog will not stop licking my ankles so that is a little weird but i get it though delicious ankles run in my family
```
Finally, the table below presents a few details of the data sets we created for our analysis:
- dataset: name of the data set (‘Training’, ‘Testing’ or ‘Validation’)
- ratio_total_nb_lines: percentage of lines of the original sample of raw data included in the data set
- nb_lines: number of lines in the data set
- nb_words: number of words in the data set
- words_per_line: average number of words per line
- nb_words_after_cleaning: number of words in the data set after cleaning
- nb_words_no_stop_words: number of words in the data set after cleaning and removing English stop words
| dataset | ratio_total_nb_lines | nb_lines | nb_words | words_per_line | nb_words_after_cleaning | nb_words_no_stop_words |
|---|---|---|---|---|---|---|
| training | 60% | 256,180 | 6,137,867 | 23.96 | 6,140,431 | 3,516,993 |
| testing | 30% | 128,092 | 3,068,220 | 23.95 | 3,070,184 | 1,759,787 |
| validation | 10% | 42,694 | 1,017,782 | 23.84 | 1,018,387 | 583,370 |
Our training data set contains a total of about 6.138 million words.
The table below presents the cumulative frequency of the most frequent words in the training data set, with the following details (a sketch of this computation is given after the list):
- nb_words: the number of most frequent words needed to reach this level
- frequency_words / nb_words_total: the cumulative frequency of those words as a share of the total number of words in the training data set
- cumulated_frequency: the cumulative frequency (count) of those words
- nb_words / nb_unique_words: the number of words needed to reach this level as a share of the total number of unique words
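These coverage figures can be derived from a sorted word-frequency table. Here is a minimal sketch in base R, assuming `words` holds the character vector of all tokens of the training data set:

```r
# Number of most frequent words needed to cover a given share of all word occurrences.
coverage <- function(words, share) {
  freq <- sort(table(words), decreasing = TRUE)         # frequencies, most frequent first
  cum  <- cumsum(freq)                                  # cumulated frequencies
  nb   <- unname(which(cum >= share * sum(freq))[1])    # first rank reaching the target share
  c(nb_words              = nb,
    cumulated_frequency   = unname(cum[nb]),
    ratio_nb_unique_words = nb / length(freq))
}

# Example, for some of the coverage levels shown in the tables below:
# sapply(c(0.12, 0.5, 0.9), function(s) coverage(words, s))
```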
Analysis of the words in the training data set, including English stop words
| nb_words | frequency_words / nb_words_total | cumulated_frequency | nb_words / nb_unique_words |
|---|---|---|---|
| 4 | 12% | 722,109 | 0% |
| 10 | 20% | 1,239,884 | 0.01% |
| 26 | 30% | 1,858,877 | 0.02% |
| 60 | 40% | 2,460,985 | 0.04% |
| 142 | 50% | 3,072,803 | 0.09% |
| 379 | 60% | 3,683,215 | 0.25% |
| 972 | 70% | 4,296,735 | 0.65% |
| 2,523 | 80% | 4,910,507 | 1.68% |
| 8,021 | 90% | 5,524,124 | 5.35% |
Including the English stop words, the table above shows that the 4 most frequent words already represent 12% of the total number of words in the training data set, while accounting for a negligible share (rounded to 0%) of the total number of unique words.
To reach 50% of the total number of words, only the 142 most frequent words are needed, i.e. just 0.09% of the total number of unique words.
Finally, 90% of the total number of words is reached with 8,021 words, i.e. 5.35% of the total number of unique words.
Analysis of the words in the training data set, excluding English stop words
| nb_words | frequency_words / nb_words_total | cumulated_frequency | nb_words / nb_unique_words |
|---|---|---|---|
| 79 | 10% | 616,343 | 0.05% |
| 380 | 20% | 1,227,956 | 0.25% |
| 1,133 | 30% | 1,841,811 | 0.76% |
| 3,212 | 40% | 2,455,310 | 2.15% |
| 12,178 | 50% | 3,068,941 | 8.14% |
When the English stop words are excluded, 79 words are needed to cover 10% of the total number of words in our training data set, which represents 0.05% of the total number of unique words.
Reaching 50% of the total number of words requires 12,178 words, i.e. 8.14% of the total number of unique words.
Word cloud and histogram of the 20 most frequent unigrams in our training data set, including and excluding English stop words
Word cloud and histogram of the 20 most frequent bigrams in our training data set, including and excluding English stop words
Word cloud and histogram of the 20 most frequent trigrams in our training data set, including and excluding English stop words
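The bigram and trigram counts behind these figures can be obtained by pasting consecutive tokens together. Here is a minimal base R sketch of such a helper; the project's `getNgram` function may be implemented differently (e.g. with a dedicated tokenizer), and `training_clean_lines` is an assumed name for the cleaned training lines:

```r
# Frequency table of n-grams built from a character vector of cleaned lines.
get_ngrams <- function(lines, n) {
  ngrams <- unlist(lapply(strsplit(lines, "\\s+"), function(tokens) {
    tokens <- tokens[tokens != ""]
    if (length(tokens) < n) return(character(0))
    # Paste every run of n consecutive tokens into a single n-gram.
    sapply(seq_len(length(tokens) - n + 1),
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}

# Example: the 20 most frequent trigrams of the cleaned training lines.
# head(get_ngrams(training_clean_lines, 3), 20)
```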
Based on the analysis performed, we have a good starting point for computing the probability of the next word from the n-grams identified in our training data set. We still need to define a strategy regarding ‘stop words’ and how they should be handled in our model.
The Shiny app will let the user enter a sequence of words. Each time the user types a letter, the app will check whether the input ends with a known word and, if so, will suggest the next word based on the probabilities computed by our model. The app could also offer an option to display from 1 to n candidate words, together with the probability of each being the next word in the sequence.
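A minimal sketch of what such a Shiny interface could look like; `predict_next_word()` and the values it returns are hypothetical placeholders for the prediction model we still have to build:

```r
library(shiny)

# Hypothetical placeholder: returns the n most probable next words for a phrase.
# The real implementation will query the n-gram probabilities of our model.
predict_next_word <- function(phrase, n = 3) {
  dummy <- data.frame(word        = c("the", "to", "and", "a", "of"),
                      probability = c(0.10, 0.08, 0.05, 0.04, 0.03))
  head(dummy, n)
}

ui <- fluidPage(
  textInput("phrase", "Enter a sequence of words:"),
  sliderInput("n", "Number of suggested words:", min = 1, max = 5, value = 3),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(nzchar(input$phrase))          # wait until the user has typed something
    predict_next_word(input$phrase, input$n)
  })
}

shinyApp(ui = ui, server = server)
```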
For reference, the table below lists the processing times (in seconds, as measured in R) of the main processing steps:
| processing step | user time (s) | system time (s) | elapsed time (s) |
|---|---|---|---|
| Create training, testing and validation data sets | 0.684 | 0.080 | 0.772 |
| Training data set - Call to clean_dataset function | 132.725 | 1.070 | 134.677 |
| Training data set - Call to remove_stop_words function | 13.554 | 0.083 | 13.679 |
| Testing data set - Call to clean_dataset function | 64.249 | 0.460 | 64.868 |
| Testing data set - Call to remove_stop_words function | 6.028 | 0.018 | 6.061 |
| Validation data set - Call to clean_dataset function | 22.075 | 0.181 | 23.149 |
| Validation data set - Call to remove_stop_words function | 2.157 | 0.007 | 2.172 |
| Training data set - Call to getNgram function for 1 gram with English stop words | 24.260 | 0.323 | 24.717 |
| Training data set - Call to getNgram function for 1 gram without English stop words | 12.877 | 0.146 | 13.046 |
| Training data set - Call to getNGramSummary function for 1 gram with English stop words | -0.003 | 0.000 | -0.003 |
| Training data set - Call to getNGramSummary function for 1 gram without English stop words | -0.154 | -0.001 | -0.155 |
| Training data set - Call to getNgram function for 2 gram with English stop words | 17.943 | 0.358 | 18.470 |
| Training data set - Call to getNgram function for 2 gram without English stop words | 14.977 | 0.427 | 15.792 |
| Training data set - Call to getNgram function for 3 gram with English stop words | 30.698 | 1.583 | 32.844 |
| Training data set - Call to getNgram function for 3 gram without English stop words | 23.114 | 1.088 | 24.639 |