SwiftKey Capstone Project Milestone Report

Executive Summary

The aim of the Johns Hopkins University Capstone project, run in partnership with SwiftKey, is to develop a predictive text model starting from a large, unstructured corpus of English text. As a first step, an exploratory analysis was performed on the data, looking at the distribution of unigrams (single words), bigrams (two consecutive words), trigrams (three consecutive words) and quadgrams (four consecutive words) for the three documents in the text collection (corpus): a blog file, a news file and a Twitter file. Features of each document are presented in this report and will be used to build the predictive model.

Data Processing

The training dataset was accessed at the following site: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The data was first loaded and cleaned: extra white space was removed, the text was converted to lower case, stop words and bad words (taken from an author-compiled list) were removed, and the remaining words were stemmed to reduce them to their roots. Stemming is particularly useful because it groups the different forms of a word in the corpus together, leading to better statistics across the entire dataset.
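The cleaning steps described above can be sketched as follows. This is a minimal Python illustration and not the code used for this report: the stop-word and bad-word sets are tiny hypothetical placeholders, and `crude_stem` is a rough suffix stripper standing in for a proper stemmer such as Porter's.

```python
import re

# Hypothetical, tiny word lists for illustration only; the report relies on
# a full stop-word list and an author-compiled bad-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "in", "at"}
BAD_WORDS = {"darn"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_line(line):
    """Lower-case, drop non-letters, collapse white space, filter, stem."""
    line = line.lower()
    line = re.sub(r"[^a-z\s]", " ", line)  # keep letters and white space only
    tokens = line.split()  # split() also collapses repeated white space
    tokens = [t for t in tokens if t not in STOP_WORDS and t not in BAD_WORDS]
    return [crude_stem(t) for t in tokens]


print(clean_line("  The cats   are LOOKING at the darn birds!  "))
# prints ['cat', 'look', 'bird']
```

The order of the steps matters: stop words and bad words are filtered before stemming so that the lists can hold ordinary, unstemmed word forms.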

Exploratory Analysis

The features of the corpus were then extracted, first in terms of individual words (unigrams), and a word count was performed for each file.

##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##          18256720           1427231          16091650

The number of words in each document of the corpus is 18256720 for en_US.blogs.txt, 1427231 for en_US.news.txt and 16091650 for en_US.twitter.txt.

A similar count was performed on sentences.

The number of sentences in each document of the corpus is 286531 for en_US.blogs.txt, 5913 for en_US.news.txt and 1233466 for en_US.twitter.txt.
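As an illustration of how such counts can be obtained, here is a minimal Python sketch. The rule used in this report to delimit sentences is not stated, so the sketch assumes that a sentence ends with a full stop, exclamation mark or question mark; the function names are hypothetical.

```python
import re


def count_words(text):
    """Count white-space-separated tokens."""
    return len(text.split())


def count_sentences(text):
    """Count non-empty spans separated by '.', '!' or '?' (an assumption)."""
    parts = re.split(r"[.!?]+", text)
    return sum(1 for part in parts if part.strip())


sample = "Thanks for the follow! Looking forward to it. See you soon"
print(count_words(sample), count_sentences(sample))
# prints 11 3
```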

The wordclouds show that the most common words for blogs are “one”, “like” and “time”, whereas for news they are “said”, “year” and “one”. For tweets the most commonly used words appear to be “just”, “get” and “thank”.

Further analysis was then performed on bigrams, trigrams and quadgrams for each document. The table below lists the ten most frequent bigrams in each file, with their counts.

Blog Bigram	Blog Bigram Count	News Bigram	News Bigram Count	Twitter Bigram	Twitter Bigram Count	Bigram Ranking
look_like	6646	last_year	1191	right_now	16969	1
feel_like	5934	new_york	897	thank_follow	12378	2
year_ago	5774	new_jersey	678	look_forward	11720	3
last_year	5267	year_ago	677	look_like	11124	4
new_york	5115	st_loui	671	feel_like	9093	5
right_now	5112	last_week	579	happi_birthday	8428	6
last_week	4544	los_angel	429	good_morn	8006	7
make_sure	4495	san_francisco	405	just_got	7245	8
can_see	4327	two_year	400	follow_back	7188	9
first_time	4277	unit_state	360	thank_much	6473	10
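A frequency table of this kind can be produced by sliding a window over the cleaned tokens and counting the results. The Python sketch below is illustrative only: the function names and the toy input are hypothetical, and the underscore separator simply mirrors the way n-grams are written in the table above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(token_lists, n, k=10):
    """Count n-grams over many token lists and return the k most frequent."""
    counts = Counter()
    for tokens in token_lists:
        # N-grams are built per line so they never span two lines.
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Toy input: two already-cleaned lines.
docs = [
    ["look", "like", "rain"],
    ["feel", "like", "look", "like"],
]
print(top_ngrams(docs, 2, k=1))
# prints [('look_like', 2)]
```

Calling `top_ngrams` with `n` set to 3 or 4 gives the corresponding trigram and quadgram rankings.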

Discussion

We can see from the above that the three files in the corpus exhibit interesting features. First of all, the number of words in the blog file is about 13% higher than in the Twitter file, even though the Twitter file contains roughly four times as many sentences (1233466 versus 286531), which indicates that tweets are made up of much shorter sentences. The news file contains significantly fewer words and sentences than the other two documents.
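The comparison between the files can be recomputed directly from the word and sentence counts reported in the Exploratory Analysis section. The short Python sketch below uses only those reported figures; the variable names are illustrative.

```python
# Counts as reported in the Exploratory Analysis section.
words = {"blogs": 18256720, "news": 1427231, "twitter": 16091650}
sentences = {"blogs": 286531, "news": 5913, "twitter": 1233466}

# How many more words do the blogs contain than the tweets?
word_ratio = words["blogs"] / words["twitter"]

# How many more sentences do the tweets contain than the blogs?
sentence_ratio = sentences["twitter"] / sentences["blogs"]

print(f"blog/twitter word ratio: {word_ratio:.2f}")
print(f"twitter/blog sentence ratio: {sentence_ratio:.2f}")

# Average number of words per sentence in each file.
words_per_sentence = {name: words[name] / sentences[name] for name in words}
for name, value in words_per_sentence.items():
    print(f"{name}: {value:.1f} words per sentence")
```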

Looking at the bigrams, the blog and Twitter files share some of their most frequent entries (e.g. “look like” and “feel like”). This is not entirely surprising given the nature of these documents: authors typically use Twitter or blogs to convey their feelings about a particular topic. The news bigrams appear to be more factual and focused on places or times.

A broadly similar trend is seen in the trigram distribution, although the focus of the blog trigrams shifts somewhat to “amazon service llc”, whilst the news trigrams show an increase in political terms, including the names of politicians and mentions of World War II. The topics of the Twitter trigrams, however, appear to remain aligned with those of the bigrams.

The quadgrams show another interesting shift: the most frequent blog quadgrams appear to be focused solely on selling services, moving away from the expressions of feeling that were apparent in the bigrams. The largest shift, however, is observed for the news file, where the most frequent quadgrams are of a dietary nature, with the exception of a few political and financial terms. For Twitter, a significant drop can be noticed between the counts of the most frequent trigrams and those of the most frequent quadgrams, suggesting that although similar expressions tend to be used across tweets, they vary slightly, which may be attributable to the varied syntax used by different users.

The above analysis is significant in the sense that it highlights the context of the most used words. For example, Amazon Services LLC will be frequently encountered as a trigram as this is the name of the company. Similarly the full name ‘President George W Bush’ tends to be used to refer to the president, which is why the frequency of this quadgram in the news file is so high.

Model Development and Way Forward

To build a model that predicts the word following a bigram entered by the app user, we will use conditional probabilities: the model will calculate the probability of a word occurring given the preceding word sequence. The model will follow the Markov chain approach, whereby the probability of the next word depends only on the last few words entered (here, the preceding bigram) rather than on the full history of the text.
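One simple way to estimate these conditional probabilities, given here as a sketch under stated assumptions and not as the final model, is from trigram counts: P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2), where count(w1 w2) is taken over the bigrams that have a following word. The Python illustration below uses hypothetical function names and a toy corpus, and it does not handle bigrams that were never seen in training.

```python
from collections import Counter, defaultdict


def train(token_lists):
    """For every bigram, count the words that follow it."""
    followers = defaultdict(Counter)
    for tokens in token_lists:
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            followers[(w1, w2)][w3] += 1
    return followers


def predict(followers, w1, w2, k=3):
    """Return up to k (word, probability) pairs for the word after (w1, w2)."""
    counts = followers.get((w1, w2))
    if not counts:
        # Unseen bigram: a fuller model would fall back to shorter n-grams.
        return []
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(k)]


# Toy corpus of already-cleaned lines.
corpus = [
    ["thank", "much", "follow"],
    ["thank", "much", "follow"],
    ["thank", "much", "support"],
]
model = train(corpus)
print(predict(model, "thank", "much"))
```

In this toy corpus, “follow” comes after “thank much” in two of the three lines, so it is ranked first with an estimated probability of 2/3, ahead of “support” with 1/3.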