Executive Summary

The aim of the Johns Hopkins University Capstone project, run in partnership with SwiftKey, is to develop a predictive text model starting from a large, unstructured corpus of English text. As a first step, an exploratory analysis was performed on the data, including the distribution of unigrams (single words), bigrams (two consecutive words), trigrams and quadgrams for the 3 documents in the text collection (corpus), namely a blog file, a news file and a Twitter file. Features of each document are presented in this report and will be used to build the predictive model.

Data Processing

The training dataset was accessed at the following site: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The data was first loaded and cleaned: white space was stripped, text was converted to lower case, stop words and bad words (the latter from an author-compiled list) were removed, and the remaining words were stemmed to reduce them to their root form. Stemming is particularly useful because it groups the different forms of a word in the corpus under one root, leading to better statistics across the entire dataset.
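The cleaning steps described above can be sketched as follows. This is a minimal illustration in Python, not the code used for the report: the word lists are tiny hypothetical placeholders, and `naive_stem` is a crude stand-in for a proper stemming algorithm such as Porter's.

```python
import re

# Hypothetical, deliberately tiny word lists; the report's actual stop-word
# and bad-word lists are not shown.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}
BAD_WORDS = {"darn"}


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_line(line, stem=naive_stem):
    """Lower-case, tokenise, drop stop/bad words, then stem what is left."""
    # Tokenising on letter runs also discards extra white space and digits.
    tokens = re.findall(r"[a-z']+", line.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and t not in BAD_WORDS]
    return [stem(t) for t in kept]


print(clean_line("  Looking   forward to the darn WEEKEND  "))
```

Passing the stemmer in as an argument keeps the sketch independent of any particular stemming library; a real stemmer can be swapped in without touching the rest of the function.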

Exploratory Analysis

The features of the corpus were then extracted, first in terms of individual words or unigrams, and word count performed for each file.

##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##          18256720           1427231          16091650

The number of words for each document in the corpus is calculated by the code to be 18256720 for en_US.blogs.txt, 1427231 for en_US.news.txt and 16091650 for en_US.twitter.txt.

We also do a similar count on sentences.

The number of sentences for each document in the corpus is calculated by the code to be 286531 for en_US.blogs.txt, 5913 for en_US.news.txt and 1233466 for en_US.twitter.txt.
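As an illustration of how such counts can be obtained, the sketch below counts words and sentences in a piece of text. The splitting rules are assumptions made for this example (words are white-space separated; sentences end at `.`, `!` or `?`) and may differ from the rules used to produce the figures above.

```python
import re


def count_words_and_sentences(text):
    """Return (word_count, sentence_count) using simple regex heuristics."""
    # Assumption: a word is any run of non-white-space characters.
    words = re.findall(r"\S+", text)
    # Assumption: a sentence ends at one or more of . ! ? followed by
    # white space or the end of the text.
    pieces = re.split(r"[.!?]+(?:\s+|$)", text)
    sentences = [p for p in pieces if p.strip()]
    return len(words), len(sentences)


sample = "I love this. Do you? Yes!"
print(count_words_and_sentences(sample))
```

Applied to the full contents of each of the three files, the same function would give one (words, sentences) pair per document.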

The wordclouds show that the most common words for blogs are “one”, “like” and “time”, whereas for news they are “said”, “year” and “one”. For tweets the most commonly used words appear to be “just”, “get” and “thank”.

Further analysis was then performed on bigrams, trigrams and quadgrams for each document and the below plots show the frequency of the most common ones.

Rank  Blog bigram  Count  News bigram    Count  Twitter bigram  Count
1     look_like    6646   last_year      1191   right_now       16969
2     feel_like    5934   new_york       897    thank_follow    12378
3     year_ago     5774   new_jersey     678    look_forward    11720
4     last_year    5267   year_ago       677    look_like       11124
5     new_york     5115   st_loui        671    feel_like       9093
6     right_now    5112   last_week      579    happi_birthday  8428
7     last_week    4544   los_angel      429    good_morn       8006
8     make_sure    4495   san_francisco  405    just_got        7245
9     can_see      4327   two_year       400    follow_back     7188
10    first_time   4277   unit_state     360    thank_much      6473
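A frequency table of this kind can be produced by sliding a window over the cleaned tokens and counting the results. The sketch below is one possible way to do this in Python; the function names and the toy input are hypothetical, and the underscore-joined output format is chosen only to match the table above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as underscore-joined strings."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(token_lines, n, k=10):
    """Count n-grams line by line and return the k most common.

    Counting per line means an n-gram never spans two separate lines.
    """
    counts = Counter()
    for tokens in token_lines:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Toy input: each inner list stands for one cleaned, stemmed line.
lines = [
    ["look", "like", "rain"],
    ["feel", "like", "rain"],
    ["look", "like", "snow"],
]
print(top_ngrams(lines, 2, k=2))
```

Setting `n` to 1, 3 or 4 gives the unigram, trigram and quadgram counts with the same code.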

Discussion

We can see from the above that the three files in the corpus exhibit interesting features. First, the number of words in the blog file is about 13% higher than in the Twitter file, although the blog file contains about 4 times fewer sentences (286531 against 1233466), which indicates that tweets are made up of much shorter sentences than blog posts. The news file contains significantly fewer words and sentences than the other 2 documents.
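The comparisons between files can be rechecked from the reported counts alone. The short sketch below recomputes the blog-to-Twitter word ratio, the Twitter-to-blog sentence ratio and the average number of words per sentence; the figures are copied from the counts reported earlier, and the variable names are illustrative.

```python
# Counts copied from the word and sentence tallies reported above.
words = {"blogs": 18256720, "news": 1427231, "twitter": 16091650}
sentences = {"blogs": 286531, "news": 5913, "twitter": 1233466}

# How many times more words the blog file has than the Twitter file.
word_ratio = words["blogs"] / words["twitter"]
# How many times more sentences the Twitter file has than the blog file.
sentence_ratio = sentences["twitter"] / sentences["blogs"]
# Average sentence length per document.
words_per_sentence = {name: words[name] / sentences[name] for name in words}

print(f"blog/twitter word ratio: {word_ratio:.2f}")
print(f"twitter/blog sentence ratio: {sentence_ratio:.1f}")
for name, value in words_per_sentence.items():
    print(f"{name}: {value:.1f} words per sentence")
```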

When looking at the bigrams, it is interesting to see some commonalities in the most frequently encountered bigrams (e.g. “look like” and “feel like”) between the blog and Twitter files. This is not entirely surprising given the nature of these documents: authors typically use Twitter or blogs to convey their feelings about a particular topic. The news bigrams appear to be more factual and focused on places or times.

There is a generally similar trend in the trigram distribution, although the focus of the blog trigrams shifts somewhat to “amazon service llc”, whilst the news trigrams show an increase in political terms, including names of politicians and the mention of World War II. The Twitter trigram topics, however, appear to remain aligned with the bigrams.

The quadgrams show another interesting shift: the most frequent blog quadgrams appear to be solely focused on selling services, moving away from the expressions of feeling that were apparent in the bigrams. The largest shift, however, is observed for the news file, where the most frequent quadgrams are of a dietary nature, with the exception of a few political and financial terms. For Twitter, a significant drop can be noticed between the counts of the most frequent trigrams and the most frequent quadgrams, suggesting that although similar expressions tend to be used across tweets they vary slightly, which may be attributable to the varied syntax used by different users.

The above analysis is significant in the sense that it highlights the context of the most used words. For example, Amazon Services LLC will be frequently encountered as a trigram as this is the name of the company. Similarly the full name ‘President George W Bush’ tends to be used to refer to the president, which is why the frequency of this quadgram in the news file is so high.

Model Development and Way Forward

In order to build our model to predict the word following a bigram entered by the app user, we will use conditional probabilities. Essentially, our model will calculate the probability of a word occurring given the preceding word sequence. The model will make use of the Markov chain approach, whereby the probability of the next word is assumed to depend only on a fixed number of immediately preceding words (here, the previous two) rather than on the entire history of the text.
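One simple way to estimate such a conditional probability is from trigram counts: P(w3 | w1 w2) is approximated by count(w1 w2 w3) divided by the number of times the bigram w1 w2 is followed by any word. The sketch below illustrates this estimate in Python on a toy corpus. It is an assumption about how the planned model could look rather than the final implementation, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_trigram_model(token_lines):
    """Map each (w1, w2) context to a Counter of the words that followed it."""
    followers = defaultdict(Counter)
    for tokens in token_lines:
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            followers[(w1, w2)][w3] += 1
    return followers


def predict_next(followers, w1, w2, k=3):
    """Return up to k (word, probability) pairs for the context (w1, w2).

    probability = count(w1 w2 word) / count(w1 w2 followed by any word)
    """
    counts = followers.get((w1, w2))
    if not counts:
        return []  # unseen context: this sketch makes no prediction
    total = sum(counts.values())
    return [(word, n / total) for word, n in counts.most_common(k)]


# Toy corpus: each inner list stands for one cleaned, stemmed line.
corpus = [
    ["thank", "much", "follow"],
    ["thank", "much", "follow"],
    ["thank", "much", "support"],
]
model = train_trigram_model(corpus)
print(predict_next(model, "thank", "much"))
```

In this toy corpus “follow” comes after “thank much” in two of three cases, so it is ranked first with an estimated probability of 2/3. How to handle bigrams that were never seen in training is left open here.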