This document describes the activities carried out towards completing the Capstone Project of the Data Science Specialization (Johns Hopkins University, Coursera). The project consists of building a predictive text model that guesses what the next word to be typed might be. The text files are provided by Coursera in partnership with SwiftKey. Being a midterm report, this paper briefly explains how the data were retrieved, the steps taken to clean the data, and the process by which samples were drawn from these huge datasets. Finally, the paper presents the exploratory analysis of the corpora built from the samples. The .Rmd file behind this .html file was written to start from scratch; as such, this report supports Reproducible Research.
The first step consists of downloading the file from the Coursera repository. Once unzipped, the file creates a directory named final. Inside this directory there are four directories containing text files in German, Finnish, Russian and English. Inside each language directory there are datasets from three sources: Twitter, News and Blogs.
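A minimal sketch of this step is shown below; the URL is the one given in the course instructions, and the destination file name is an assumption:

```r
# Download the SwiftKey zip file and extract it; unzipping creates the
# 'final' directory with one sub-directory per language.
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
}
if (!dir.exists("final")) {
  unzip(zip_file)
}
```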
The next table summarizes the three sources in the English directory. The datasets were saved in a native R file type (.RData) to work with them faster; as can be seen, the .RData files are much smaller than the .txt files (a sketch of how these figures could be obtained appears after the table):
| Source | .txt file (MB) | .RData file (MB) | Number of lines | Number of words |
|---|---|---|---|---|
| Blogs | 200.42 | 88.34 | 899,288 | 37,153,263 |
| News | 196.28 | 88.43 | 1,010,242 | 34,190,519 |
| Twitter | 159.36 | 78.33 | 2,360,148 | 29,760,108 |
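As an illustration, these figures could be obtained for one source as follows; the file names follow the unzipped final directory, and the use of the stringi word counter is an assumption, since the report does not state how words were counted:

```r
library(stringi)

# Read one source, save it in native R format and compute the table figures.
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
save(blogs, file = "en_US.blogs.RData")

file.size("final/en_US/en_US.blogs.txt") / 1024^2  # .txt size in MB
file.size("en_US.blogs.RData") / 1024^2            # .RData size in MB
length(blogs)                                      # number of lines
sum(stri_count_words(blogs))                       # number of words
```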
Due to the size of the datasets, it was necessary to sample them before proceeding with the exploratory analysis; the sample files were generated with 0.1% of the original lines (a sampling sketch appears after the table). The next table shows statistics of the sampled datasets:
| Source | Sample file (MB) | Number of lines | Number of words |
|---|---|---|---|
| Blogs | 0.09 | 899 | 37,818 |
| News | 0.09 | 1,010 | 34,264 |
| Twitter | 0.08 | 2,360 | 29,622 |
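A minimal sketch of the sampling step follows; the seed and the helper function are illustrative, not taken from the report:

```r
set.seed(1234)  # illustrative seed, for reproducibility

# Keep 0.1% of the lines of a source, chosen at random.
sample_lines <- function(lines, fraction = 0.001) {
  lines[sample(seq_along(lines), size = floor(length(lines) * fraction))]
}

blogs_sample <- sample_lines(blogs)
writeLines(blogs_sample, "sample.en_US.blogs.txt")
```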
Once the samples were generated, the three text files were processed with the tm and RWeka packages to produce a corpus for each of them. With the tm package, punctuation, numbers and extra white space were removed, and all capital letters were transformed to lower case. I preferred not to mutilate the datasets by removing profanity, so I preserved all the words written by people. This decision follows the intention to keep the original information as it is analysed in the Social Sciences, for example in Sentiment Analysis, where it is desirable to analyse the genuine ideas of the people.
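A sketch of these cleaning steps with the tm package is shown below (object names are illustrative); note that no profanity filter is applied, in line with the decision above:

```r
library(tm)

corpus_blogs <- VCorpus(VectorSource(blogs_sample))
corpus_blogs <- tm_map(corpus_blogs, content_transformer(tolower))  # lower-case
corpus_blogs <- tm_map(corpus_blogs, removePunctuation)             # drop punctuation
corpus_blogs <- tm_map(corpus_blogs, removeNumbers)                 # drop numbers
corpus_blogs <- tm_map(corpus_blogs, stripWhitespace)               # collapse white space
```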
With the RWeka package, an N-gram analysis was performed on each of the sources. N-grams were selected as the basis for the next-word prediction algorithm. Frequencies of 1-grams, 2-grams and 3-grams were computed for each of the three sources. The results of this analysis are shown next.
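As an illustrative sketch, the counts reported below could be computed with RWeka for one corpus and one N-gram order as follows (the exact options used in the report are not shown, so treat this as an assumption-laden example that continues from the previous chunk):

```r
library(RWeka)

# Extract the plain text back from the corpus and tokenize into 2-grams.
text    <- unlist(lapply(corpus_blogs, as.character))
bigrams <- NGramTokenizer(text, Weka_control(min = 2, max = 2))

bigram_freq <- sort(table(bigrams), decreasing = TRUE)
length(bigrams)                     # total 2-grams found
length(bigram_freq)                 # unique 2-grams
sum(bigram_freq > 1)                # 2-grams appearing more than once
sum(bigram_freq[bigram_freq > 1])   # occurrences of those repeated 2-grams
```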
For each corpus and N-gram order, the table below shows the total number of N-grams found, the number of unique entries among them, how many N-grams appear more than once, how many times those repeated N-grams appear in total, and what percentage of the corpus they cover.

| Corpus | N-gram | N-grams found | Unique entries | Appearing more than once | Occurrences of repeated N-grams | Coverage (%) |
|---|---|---|---|---|---|---|
| Blogs | 1-grams | 74,477 | 8,326 | 2,811 | 68,962 | 93 |
| Blogs | 2-grams | 74,474 | 21,374 | 725 | 53,825 | 72 |
| Blogs | 3-grams | 74,473 | 23,761 | 302 | 51,014 | 68 |
| News | 1-grams | 81,462 | 9,115 | 3,119 | 75,466 | 93 |
| News | 2-grams | 81,459 | 22,008 | 658 | 60,109 | 74 |
| News | 3-grams | 81,458 | 24,553 | 295 | 57,200 | 70 |
| Twitter | 1-grams | 161,430 | 8,487 | 2,098 | 155,041 | 96 |
| Twitter | 2-grams | 161,427 | 21,649 | 1,185 | 140,963 | 87 |
| Twitter | 3-grams | 161,426 | 27,022 | 791 | 135,195 | 84 |
Three sets of graphics appear below; each set contains three plots as a visual summary of the N-gram analyses.
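As an illustration of one such plot (not the exact code behind the figures in this report), the ten most frequent 2-grams of the blogs sample could be drawn as a bar chart:

```r
library(ggplot2)

# Build a small data frame from the sorted 2-gram frequency table above.
top_bigrams <- data.frame(
  ngram = names(bigram_freq)[1:10],
  freq  = as.integer(bigram_freq[1:10])
)

ggplot(top_bigrams, aes(x = reorder(ngram, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "2-gram", y = "Frequency",
       title = "Top 10 2-grams in the blogs sample")
```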
The exploratory analysis, as well as the graphics, shows that unique 1-grams account for the highest frequencies. This is expected because of the higher unique word count. However, 1-grams are highly variable and are not, on their own, an option for predicting the next word to be typed: the prediction model must predict the next word given at least one word that comes before it. Therefore 2-grams and 3-grams are more likely to support the requirements of the model.
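As a conceptual sketch only (the prediction model itself is not built in this report), a 2-gram frequency table such as the one computed earlier could already suggest a next word by looking up the most frequent 2-gram that starts with the last typed word; the function name is illustrative:

```r
# Suggest the next word from a sorted 2-gram frequency table.
predict_next <- function(last_word, bigram_freq) {
  pattern    <- paste0("^", last_word, " ")
  candidates <- bigram_freq[grepl(pattern, names(bigram_freq))]
  if (length(candidates) == 0) return(NA_character_)
  sub(pattern, "", names(candidates)[1])  # table is sorted, so [1] is the most frequent
}

predict_next("thank", bigram_freq)
```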
I am sure that a lot of improvements still need to be made, and I suspect the datasets can be cleaned even more; for instance, I was not careful enough about abbreviations. Finally, I still think it is not necessary to remove profanities, and I prefer not to delete them, as mentioned before.