1. Synopsis

This document was prepared as the final project (Capstone) of the JHU-Coursera Data Science Specialization. The capstone is done in conjunction with SwiftKey, a software development company famous for its keyboard with predictive algorithms. The idea behind this project is to build an algorithm for a predictive keyboard and to develop a corresponding application. Through this project, we’ll be tackling text data analysis, natural language processing, and product development.

2. Exploratory Data Analysis

For this project, we’re given data sets consisting of multiple lines of text written in English (German, Russian and Finnish are also available), which we analyze using R. The data sets can be downloaded from here.

2.1 Loading and looking at the data

We’ll focus on the English data, available in the final/en_US/ directory, which contains three text files:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt

Each file contains lines of text taken from blogs, news articles or tweets, which can be analyzed to see which words tend to follow which in natural language. The Appendix shows which packages we’ll be using for our analysis.

The following R packages are used for this analysis: dplyr, tidyr, ggplot2, LaF, tokenizers, stringr, stringi, quanteda, data.table, caret, and, of course, knitr.

First, let’s load the data, and make a summary table of it:
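The original code chunk is not reproduced here. As a minimal sketch of how such a summary could be computed in base R, assuming each file has been read into a character vector with one element per line (the helper name summarise_corpus and the demo data are hypothetical, not taken from the report):

```r
# Hypothetical helper (not from the original report): summarise one corpus,
# given as a character vector with one element per line.
summarise_corpus <- function(lines) {
  # Whitespace-separated words per line; an empty line yields 0 words.
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  c(
    "Lines"               = length(lines),
    "Characters"          = sum(nchar(lines)),
    "Words"               = sum(words_per_line),
    "Min words per line"  = min(words_per_line),
    "Mean words per line" = round(mean(words_per_line)),
    "Max words per line"  = max(words_per_line)
  )
}

# With the real data, each corpus would be read with something like
# readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE).
# A tiny in-memory stand-in keeps this example runnable.
demo_corpora <- list(
  Twitter = c("so excited for the weekend", "hello world"),
  Blogs   = c("the quick brown fox jumps over the lazy dog", "")
)

# One column per corpus, one row per statistic.
summary_table <- sapply(demo_corpora, summarise_corpus)
print(summary_table)
```

Counting words by splitting on whitespace is a simplification; a proper tokenizer would treat punctuation and URLs differently, so the exact counts may differ from those reported.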

And to take a quick look at it:

                        Twitter        Blogs        News
Lines                 2,360,148      899,288      77,259
Characters          162,096,241  206,824,382  15,639,408
Words                30,451,170   37,570,839   2,651,432
Min words per line            1            0           1
Mean words per line          13           42          35
Max words per line           47        6,726       1,123

As you can see, these are large data sets, weighing over 550 MB in total. Because my computer has limited processing capacity, we’ll be working with random samples from the data sets. The sampling method uses the sample_lines() function from the LaF package. We’ll be working with samples of 10% of the total lines per data set. From this 10%, 20% will be for our training set, and 5% for our testing set. Although my computer could handle a bigger sample for exploratory data analysis, it just doesn’t have enough memory when it comes to fitting models and making predictions.
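The sampling scheme described above can be sketched in base R as follows. This is an illustration only: it uses sample.int() on lines already held in memory as a stand-in for LaF's sample_lines(), and the helper names, the seed, and the demo data are assumptions rather than the report's actual code.

```r
# Hypothetical helper: keep a random fraction of the lines, reproducibly.
sample_fraction <- function(lines, fraction, seed = 1234) {
  set.seed(seed)
  n <- round(length(lines) * fraction)
  lines[sort(sample.int(length(lines), n))]
}

# Hypothetical helper: split a sample into disjoint training and testing sets.
split_train_test <- function(lines, train_frac = 0.20, test_frac = 0.05,
                             seed = 1234) {
  set.seed(seed)
  idx     <- sample.int(length(lines))          # random permutation of indices
  n_train <- round(length(lines) * train_frac)
  n_test  <- round(length(lines) * test_frac)
  list(
    train = lines[idx[seq_len(n_train)]],
    test  = lines[idx[n_train + seq_len(n_test)]]
  )
}

# Demo on synthetic lines instead of the real corpora.
demo_lines  <- paste("line", 1:1000)
demo_sample <- sample_fraction(demo_lines, 0.10)   # 100 lines
demo_parts  <- split_train_test(demo_sample)       # 20 training, 5 testing
lengths(demo_parts)
```

Drawing the training and testing indices from a single permutation guarantees that no line ends up in both sets.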

2.2 Cleaning and partitioning our data sets

So, we now have samples 25% the size of our original data sets; given how large the originals are, this should suffice (for my computer’s sake). The next step is to extract the relevant data from the sample data sets. That is, we’re not interested in the whole stories being told in blogs, news or tweets, but in the frequency of individual words and phrases, as well as in which words follow which.

To do this, we’ll be using the quanteda package, together with base R and regular expressions, to extract words (tokens) and ngrams (sequences of tokens) from the texts. I’ve created a simple function that extracts tokens and ngrams of any given number of words. This function can then be applied to our data samples to get the most common tokens and ngrams.
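The author's function is not shown. A minimal base-R sketch of the same idea could look like the following; the function names, the cleaning regular expression, and the demo sentences are all assumptions, and quanteda's own tokenizer would handle many more edge cases.

```r
# Hypothetical tokenizer: lower-case, keep only letters and apostrophes,
# then split on whitespace. Returns one character vector per input line.
tokenize_words <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]+", " ", text)
  tokens <- strsplit(trimws(text), "\\s+")
  lapply(tokens, function(x) x[nzchar(x)])
}

# Build all n-grams (n consecutive tokens) from one token vector.
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over all lines and return the k most frequent.
# N-grams never cross line boundaries.
top_ngrams <- function(lines, n = 1, k = 10) {
  grams  <- unlist(lapply(tokenize_words(lines), make_ngrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  head(counts, k)
}

demo_text <- c("I love data science", "I love R", "Data science is fun")
demo_bigrams <- top_ngrams(demo_text, n = 2, k = 3)
print(demo_bigrams)
```

With n = 1 the same function returns the most common single tokens.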

For this example, we’ll be working with the most common tokens considering all words, as well as excluding so-called stopwords (words that are very common but carry little meaning in an overall analysis, such as “the” and “is”). Luckily, the quanteda package exports a stopwords() function which provides a list of 175 common English words that can be excluded from the analysis. We’ll also be looking at the most common 2-token and 3-token ngrams.
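To illustrate the effect of excluding stopwords, here is a small base-R sketch. The short demo_stopwords vector is a hypothetical stand-in for the list returned by stopwords(), and the token vector is made up for the example.

```r
# Hypothetical stand-in for the full stopword list used in the report.
demo_stopwords <- c("the", "is", "a", "of", "and", "to", "in")

# Drop every token that appears in the stopword list.
remove_stopwords <- function(tokens, stopwords) {
  tokens[!tokens %in% stopwords]
}

demo_tokens   <- c("the", "cat", "is", "in", "the", "garden", "cat")
demo_filtered <- remove_stopwords(demo_tokens, demo_stopwords)

# Frequencies with and without stopwords.
sort(table(demo_tokens), decreasing = TRUE)
sort(table(demo_filtered), decreasing = TRUE)
```

In the unfiltered counts "the" ties with "cat" at the top; once stopwords are removed, only the content words "cat" and "garden" remain.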

We’ll plot our results on the training data set.

2.3 Plotting our results

The following tabs show the occurrence of each token or ngram for Twitter, blogs and news. For quick reference:

  • 1T. Most common tokens
  • 1T-nsw. Most common tokens, without stopwords
  • 2T. Most common 2-token ngrams
  • 2T-nsw. Most common 2-token ngrams, without stopwords
  • 3T. Most common 3-token ngrams
  • 4T. Most common 4-token ngrams

[Figures: plots of the most common tokens and ngrams, one tab per view (1T, 1T-nsw, 2T, 2T-nsw, 3T, 4T), each with separate panels for Twitter, Blogs and News.]

2.4 Word Coverage

Now we want to see how many distinct words account for a given percentage of all word occurrences. For this, we’ll create some basic plots that help illustrate it. We will notice that a relatively small number of words covers 50% of all occurrences; coverage keeps climbing quickly up to almost 80% of total word usage, after which each additional word adds progressively less. For quick reference:

  • 1T. Most common tokens
  • 1T-nsw. Most common tokens, without stopwords
  • 2T. Most common 2-token ngrams
  • 2T-nsw. Most common 2-token ngrams, without stopwords
  • 3T. Most common 3-token ngrams
  • 4T. Most common 4-token ngrams
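The coverage calculation behind these plots can be sketched as a cumulative sum over frequencies sorted in decreasing order. The helper names and the toy counts below are hypothetical; with the real data, the counts would come from the token or ngram frequency tables.

```r
# Cumulative share of all occurrences covered by the k most frequent items.
coverage_curve <- function(counts) {
  counts <- sort(counts, decreasing = TRUE)
  cumsum(counts) / sum(counts)
}

# Smallest number of distinct items needed to reach a target coverage.
items_needed <- function(counts, target = 0.5) {
  unname(which(coverage_curve(counts) >= target)[1])
}

# Toy frequency table: 100 occurrences spread over 5 distinct words.
demo_counts <- c(the = 50, of = 20, data = 15, model = 10, ngram = 5)

coverage_curve(demo_counts)     # 0.50 0.70 0.85 0.95 1.00
items_needed(demo_counts, 0.5)  # 1
items_needed(demo_counts, 0.8)  # 3
```

Plotting coverage_curve() against the rank of each item gives the cumulative distribution shown in the tabs.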

[Figures: cumulative distribution plots of tokens and ngrams, one tab per view (1T, 1T-nsw, 2T, 2T-nsw, 3T, 4T), each with separate panels for Twitter, Blogs and News.]

2.5 Interpreting the cumulative frequency plots

What we notice in the cumulative frequency plots is very straightforward. When dealing with single words, we reach 90% of total word occurrences with a relatively small number of unique words. In this sense, predicting the first word to be typed should be easy (especially if we restrict ourselves to analyzing only the first word of every sentence in our data sets). However, ngrams are ordered sequences of n words, so the number of distinct ngrams grows quickly with n, and covering even 50% of the observed ngrams takes us into the hundreds of thousands of possibilities.

So, 4-token ngrams may be considerably better for predicting the fourth word typed given the previous three words. However, the amount of data needed for this prediction is a clear drawback that we have to weigh.
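To make the idea of predicting a word from the preceding ones concrete, here is a minimal sketch of a frequency-based lookup. It is not the model chosen for this project (none has been selected yet): the function names and the demo token lists are hypothetical, and it uses 3-token ngrams rather than 4-token ones to keep the example small.

```r
# Count how often each word follows each (n - 1)-word context.
# token_list: a list of character vectors, one per line of text.
build_ngram_model <- function(token_list, n = 3) {
  contexts  <- character(0)
  following <- character(0)
  for (tokens in token_list) {
    if (length(tokens) < n) next
    for (i in seq_len(length(tokens) - n + 1)) {
      contexts  <- c(contexts, paste(tokens[i:(i + n - 2)], collapse = " "))
      following <- c(following, tokens[i + n - 1])
    }
  }
  # Rows are contexts, columns are candidate next words.
  table(context = contexts, following = following)
}

# Return the most frequent continuation of a context, or NA if unseen.
predict_next <- function(model, context) {
  if (!context %in% rownames(model)) return(NA_character_)
  colnames(model)[which.max(model[context, ])]
}

demo_tokens_list <- list(
  c("i", "love", "data", "science"),
  c("i", "love", "data", "analysis"),
  c("i", "love", "data", "science")
)
demo_model <- build_ngram_model(demo_tokens_list, n = 3)

predict_next(demo_model, "love data")   # "science" (seen twice vs. once)
predict_next(demo_model, "hello there") # NA, context never observed
```

The NA case shows the data problem mentioned above: the longer the context, the more often it will never have been observed, which is why approaches of this kind usually fall back to shorter ngrams when a context is missing.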

3. Next Steps

For the next steps, we have to try out prediction models and test them, subject to computational constraints, and choose the model that shows the best results in terms of efficiency. The idea is to deploy the final product as a Shiny app, along with a presentation to pitch the app.