Executive Summary

This document describes the work carried out towards completing the Capstone Project of the Data Science Specialization offered by Johns Hopkins University on Coursera. The project consists of building a predictive text model that guesses what the next word to be typed might be. The text files are provided by Coursera in partnership with SwiftKey. As a midterm report, this paper briefly explains how the data were retrieved, the steps taken to clean them, and the process by which samples were drawn from these large datasets. Finally, the paper presents the exploratory analysis of the corpora built from the samples. The .Rmd file behind this .html file was written so that the analysis can be run from scratch; as such, this report supports Reproducible Research.

Download and Summary of the Original Data

The first step consists of downloading the file from the Coursera repository. Once unzipped, the file creates a directory named final. Inside this directory there are four subdirectories containing text files in German, Finnish, Russian and English. Each language directory contains datasets from three sources: Twitter, News and Blogs.
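As a minimal sketch, the download and extraction step could look like the R code below. The URL and destination file name are assumptions taken from the course instructions and may need to be adjusted.

```r
# Hypothetical URL and file name; adjust to the actual course links if they differ.
url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- "Coursera-SwiftKey.zip"

if (!file.exists(dest)) {
  download.file(url, destfile = dest, mode = "wb")
}
if (!dir.exists("final")) {
  unzip(dest)  # creates the 'final' directory with the four language folders
}
```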

The next table summarizes the three sources in the English directory. The datasets were saved in R's native file format (.RData) to make them faster to work with; a code sketch of this step appears after the table. As can be seen, the .RData files are much smaller than the .txt files:

Original Data Statistics
          .txt file (MB)   .RData file (MB)   Number of lines   Number of words
Blogs     200.42           88.34                       899288          37153263
News      196.28           88.43                      1010242          34190519
Twitter   159.36           78.33                      2360148          29760108
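Below is a sketch of the caching step described above: the English text files are read once, saved as .RData, and the line and word counts from the table are computed. The file paths assume the standard layout of the unzipped archive.

```r
# Read the English datasets once (paths assume the 'final/en_US' layout)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Cache them in R's native format for faster loading in later sessions
save(blogs,   file = "blogs.RData")
save(news,    file = "news.RData")
save(twitter, file = "twitter.RData")

# Line count and a rough word count, as reported in the table above
length(blogs)
sum(sapply(strsplit(blogs, "\\s+"), length))
```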

Sampling the datasets to build the corpora

Due to the size of the datasets, it was necessary to sample them in order to create smaller files for the exploratory analysis. The samples were generated with 0.1% of the original data (a code sketch appears after the table). The next table shows statistics of the sampled datasets:

Sampled Data Statistics
          Sample file (MB)   Number of lines   Number of words
Blogs     0.09                           899             37818
News      0.09                          1010             34264
Twitter   0.08                          2360             29622
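A minimal sketch of the sampling step is shown below: each line is kept with probability 0.001 (0.1%), and the seed keeps the sample reproducible. The object names reuse those from the earlier sketch.

```r
set.seed(1234)

# Keep each line with probability 'fraction' (0.1% by default)
sample_lines <- function(x, fraction = 0.001) {
  x[as.logical(rbinom(length(x), size = 1, prob = fraction))]
}

blogs_sample   <- sample_lines(blogs)
news_sample    <- sample_lines(news)
twitter_sample <- sample_lines(twitter)

# Save the samples as plain text for the exploratory analysis
writeLines(blogs_sample,   "blogs_sample.txt")
writeLines(news_sample,    "news_sample.txt")
writeLines(twitter_sample, "twitter_sample.txt")
```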

Exploratory Analysis

Once the samples were generated, the three text files were processed with the tm and RWeka packages to produce one corpus for each of them. With the tm package, punctuation, numbers and extra white space were removed, and all capital letters were transformed to lower case. I preferred not to mutilate the datasets by removing profanity, so I preserved all the words written by people. This decision follows the intention of keeping the original information, as is done in the Social Sciences, for example in Sentiment Analysis, where it is desirable to analyse people's genuine ideas.
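A sketch of these cleaning steps with the tm package is shown below for the blogs sample; the same function applies to the other two samples. Note that no profanity filtering is applied, as explained above.

```r
library(tm)

clean_corpus <- function(lines) {
  corpus <- VCorpus(VectorSource(lines))
  corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case everything
  corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
  corpus <- tm_map(corpus, removeNumbers)                  # drop numbers
  corpus <- tm_map(corpus, stripWhitespace)                # collapse extra white space
  corpus
}

blogs_corpus <- clean_corpus(blogs_sample)
```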

With the RWeka package, an N-gram analysis was performed on each of the sources. N-gram analysis was selected as the basis for building the next-word prediction algorithm. Frequencies of 1-grams, 2-grams and 3-grams were computed for each of the three sources. The results of this analysis are shown next.
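As an illustration, the 2-gram frequencies for the blogs corpus could be obtained as sketched below; 1-grams and 3-grams only differ in the min and max values passed to Weka_control. The blogs_corpus object comes from the cleaning sketch above.

```r
library(RWeka)

# Collapse the cleaned corpus back into a character vector
blogs_text <- unlist(lapply(blogs_corpus, as.character))

bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigrams     <- bigram_tokenizer(blogs_text)
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

head(bigram_freq, 10)  # the ten most frequent 2-grams
```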

Analysis of blogs corpus

1-grams in blogs corpus:

74477 1-grams were found in the blogs corpus.
8326 unique entries of these were found in this corpus.
2811 1-grams appear more than once in this corpus.
Together, those repeated 1-grams appear 68962 times in the blogs corpus.
They cover 93 percent of the words in the blogs corpus.

2-grams in blogs corpus:

74474 2-grams were found in the blogs corpus.
21374 unique entries of these were found in this corpus.
725 2-grams appear more than once in this corpus.
Together, those repeated 2-grams appear 53825 times in the blogs corpus.
They cover 72 percent of all 2-grams in the blogs corpus.

3-grams in blogs corpus:

74473 3-grams were found in the blogs corpus.
23761 unique entries of these were found in this corpus.
302 3-grams appear more than once in this corpus.
Together, those repeated 3-grams appear 51014 times in the blogs corpus.
They cover 68 percent of all 3-grams in the blogs corpus.
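The counts reported in these listings can be derived from the frequency tables. The sketch below shows the idea for the 1-grams of the blogs corpus, where unigram_freq is assumed to be a table built the same way as bigram_freq above (with min = 1 and max = 1); the other listings follow the same logic.

```r
total_ngrams  <- sum(unigram_freq)               # 1-grams found in the corpus
unique_ngrams <- length(unigram_freq)            # unique entries
repeated      <- unigram_freq[unigram_freq > 1]  # entries appearing more than once

length(repeated)                                 # how many 1-grams repeat
sum(repeated)                                    # total occurrences of the repeated 1-grams
round(100 * sum(repeated) / total_ngrams)        # coverage as a percentage
```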

Analysis of news corpus

1-grams in news corpus:

81462 1-grams were found in the news corpus.
9115 unique entries of these were found in this corpus.
3119 1-grams appear more than once in this corpus.
Together, those repeated 1-grams appear 75466 times in the news corpus.
They cover 93 percent of the words in the news corpus.

2-grams in news corpus:

81459 2-grams were found in the news corpus.
22008 unique entries of these were found in this corpus.
658 2-grams appear more than once in this corpus.
Together, those repeated 2-grams appear 60109 times in the news corpus.
They cover 74 percent of all 2-grams in the news corpus.

3-grams in news corpus:

81458 3-grams were found in the news corpus.
24553 unique entries of these were found in this corpus.
295 3-grams appear more than once in this corpus.
Together, those repeated 3-grams appear 57200 times in the news corpus.
They cover 70 percent of all 3-grams in the news corpus.

Analysis of twitter corpus

1-grams in twitter corpus:

161430 1-grams were found in the twitter corpus.
8487 unique entries of these were found in this corpus.
2098 1-grams appear more than once in this corpus.
Together, those repeated 1-grams appear 155041 times in the twitter corpus.
They cover 96 percent of the words in the twitter corpus.

2-grams in twitter corpus:

161427 2-grams were found in the twitter corpus.
21649 unique entries of these were found in this corpus.
1185 2-grams appear more than once in this corpus.
Together, those repeated 2-grams appear 140963 times in the twitter corpus.
They cover 87 percent of all 2-grams in the twitter corpus.

3-grams in twitter corpus:

161426 3-grams were found in the twitter corpus.
27022 unique entries of these were found in this corpus.
791 3-grams appear more than once in this corpus.
Together, those repeated 3-grams appear 135195 times in the twitter corpus.
They cover 84 percent of all 3-grams in the twitter corpus.

Graphic output of the N-grams analysis

Three sets of graphics appear below. Each set contains three graphics, one per corpus, as a visual summary of the corresponding N-gram analysis.
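Each graphic is, in essence, a bar chart of the most frequent N-grams in one corpus. A minimal sketch with ggplot2, assuming the bigram_freq table built earlier, could be:

```r
library(ggplot2)

top <- head(bigram_freq, 20)
df  <- data.frame(ngram = names(top), freq = as.numeric(top))

ggplot(df, aes(x = reorder(ngram, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "2-gram", y = "Frequency",
       title = "Top 20 2-grams in the blogs corpus")
```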

Graphic output for the 1-grams analysis

Graphic output for the 2-grams analysis

Graphic output for the 3-grams analysis

Observations

The exploratory analysis, as well as the graphics, shows that unique 1-grams account for the highest frequencies. This is expected, given the higher count of unique words. 1-grams are highly variable and, on their own, are not an option for predicting the next word to be typed. The prediction model needs to predict the next word given at least one word that comes before it. Therefore, 2-grams and 3-grams are more likely to support the requirements of the model.

Future Improvements

I am sure that many improvements still need to be made; I suspect the datasets can be cleaned even further. I noticed that I was not careful enough with abbreviations. Finally, I think it is not strictly required to remove profanity; in any case, I prefer not to delete it, as mentioned before.