Coursera Data Science Capstone Project

Swift Key Predictive Text Analytics

David Williams, PMP

Introduction

This report explains how we will build a predictive text algorithm. This is the type of program that allows your phone to anticipate what you will type next and offer suggestions before you finish typing. The procedure I follow is:

  1. Obtain a large body (corpus) of text to analyze. It is important to document how we retrieve these data so that all of the findings can be reproduced from a fixed starting place by anyone. The full data set is too large to process during our analysis, so we will cut it down to about 1% of its initial size to start with.

  2. Load the corpus and transform it to remove words and symbols that are unimportant for predicting text.

  3. Do some basic analysis of which words appear most often in the text; single words are called unigrams. Then we will analyze which words appear most frequently together: two-word combinations are called bigrams, and three-word combinations are trigrams (a short illustration follows this list).
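
To make the n-gram idea concrete, here is a tiny base-R illustration on an invented four-word tweet. The underscore joins mirror how quanteda names n-gram features later in this report (e.g., right_now, new_york_city); the real analysis below does this at scale.

# an invented example sentence, split into words (unigrams)
words <- c("thanks", "for", "the", "follow")
# adjacent pairs (bigrams): "thanks_for" "for_the" "the_follow"
paste(head(words, -1), tail(words, -1), sep = "_")
# adjacent triples (trigrams): "thanks_for_the" "for_the_follow"
paste(head(words, -2), head(tail(words, -1), -1), tail(words, -2), sep = "_")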

Data Acquisition and Sampling

Data for the final project was obtained from this location. Our first step is to get some summary statistics on what we are dealing with. We will limit our analysis to the US English files.

%%bash
ls -lh data/final/en_US/*
wc -l data/final/en_US/*
-rw-r--r--@ 1 David  staff   200M Jul 22  2014 data/final/en_US/en_US.blogs.txt
-rw-r--r--@ 1 David  staff   196M Jul 22  2014 data/final/en_US/en_US.news.txt
-rw-r--r--@ 1 David  staff   159M Jul 22  2014 data/final/en_US/en_US.twitter.txt
  899288 data/final/en_US/en_US.blogs.txt
 1010242 data/final/en_US/en_US.news.txt
 2360148 data/final/en_US/en_US.twitter.txt
 4269678 total

So the files are roughly 150M - 200M, with 0.9 to 2.4 million lines each. The course materials strongly hint that these files should be randomly sampled to reduce their size for the initial analysis. I'm going to do this with a few UNIX commands instead of writing R code, limiting each sample to roughly 1% of the lines in the original data set.

%%bash
gshuf -n 23696 -o data/sample/en_US/en_US.twitter.txt data/final/en_US/en_US.twitter.txt
gshuf -n 10102 -o data/sample/en_US/en_US.news.txt data/final/en_US/en_US.news.txt
gshuf -n 8992 -o data/sample/en_US/en_US.blogs.txt data/final/en_US/en_US.blogs.txt
%%bash
ls -lh data/sample/en_US/*
wc -l data/sample/en_US/*
-rw-r--r--  1 David  staff   2.0M Mar 11 10:01 data/sample/en_US/en_US.blogs.txt
-rw-r--r--  1 David  staff   2.0M Mar 11 10:01 data/sample/en_US/en_US.news.txt
-rw-r--r--  1 David  staff   1.6M Mar 11 10:01 data/sample/en_US/en_US.twitter.txt
    8992 data/sample/en_US/en_US.blogs.txt
   10102 data/sample/en_US/en_US.news.txt
   23696 data/sample/en_US/en_US.twitter.txt
   42790 total
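
The gshuf calls above are not seeded, so the exact sample is not reproducible from run to run. For a fully reproducible starting point, the same ~1% draw could instead be done in R. The sketch below assumes the directory layout used above, and the seed value is arbitrary; the sampled lines (and the exact Twitter line count) will differ slightly from the gshuf output shown.

# reproducible ~1% sample in R (a sketch; equivalent in spirit to the gshuf commands above)
set.seed(1234)
for (f in c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")) {
  lines <- readLines(file.path("data/final/en_US", f), skipNul = TRUE)  # skipNul guards against embedded nulls
  keep  <- sample(lines, length(lines) %/% 100)                         # roughly 1% of the lines
  writeLines(keep, file.path("data/sample/en_US", f))
}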

Load Libraries

Now we start the data analysis in R. We are going to use the following libraries. Some of the computations benefit from parallelization, so we load the doMC (multicore) backend and register eight worker cores.

library(tm)
library(doMC)
registerDoMC(cores = 8)
library(ggplot2)
library(dplyr)
library(quanteda)

Load the documents with the quanteda library

Now we are ready to start working with our downsampled data. We will limit ourselves to the 3 files found in the English subdirectory.

# Load the corpus
c <- corpus(textfile(file='data/sample/en_US/*'))
summary(c)
## Corpus consisting of 3 documents.
## 
##               Text Types Tokens Sentences
##    en_US.blogs.txt 34160 423542     23523
##     en_US.news.txt 35118 408764     19959
##  en_US.twitter.txt 33912 373192     38041
## 
## Source:  /Users/David/Hack/Datascience-Coursera/Capstone/* on x86_64 by David
## Created: Sun Mar 20 14:31:58 2016
## Notes:

Feature Selection

Filter out words (features) that are not useful and create document-feature matrices for unigrams, bigrams, and trigrams. quanteda stores these as sparse matrices: entries for features that never occur in a document are not stored explicitly, which keeps the objects manageable.

# create a list of profanity to filter out of the corpus
conn <- file("data/swearWords.csv", "r")
sw <- readLines(conn)
close(conn)
sw <- strsplit(sw, ",")  # comma-separated list; sw[[1]] holds the individual words

# create a sparse document feature matrix for the unigrams
d <- dfm(c, ignoredFeatures=c(sw[[1]], stopwords("english")),removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 57,645 feature types
##    ... removed 218 features, from 251 supplied (glob) feature types
##    ... created a 3 x 57427 sparse dfm
##    ... complete. 
## Elapsed time: 2.703 seconds.
# create the freq. matrix for bigrams
d2 <- dfm(c, ngrams = 2, ignoredFeatures=c(sw[[1]], stopwords("english")),removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 472,624 feature types
##    ... removed 251,849 features, from 251 supplied (glob) feature types
##    ... created a 3 x 220775 sparse dfm
##    ... complete. 
## Elapsed time: 12.813 seconds.
# create the freq matrix for trigrams
d3 <- dfm(c, ngrams = 3, ignoredFeatures=c(sw[[1]], stopwords("english")),removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 848,078 feature types
##    ... removed 717,022 features, from 251 supplied (glob) feature types
##    ... created a 3 x 131056 sparse dfm
##    ... complete. 
## Elapsed time: 21.871 seconds.

Analyze words and word combination frequencies

Now we can do some analysis on the unigrams, bigrams, and trigrams. We want to get a sense of which terms are most common and how often they appear in the text as a whole. Word clouds are provided to visualize how often terms appear.

Unigrams

# counts of the top words
topfeatures(d)
## will just said  one like  can  get time  new  now 
## 3173 3048 3002 2927 2748 2498 2284 2157 1912 1862
# top word counts normalized by the total number of feature types
topfeatures(d) / d@Dim[2]
##       will       just       said        one       like        can 
## 0.05525276 0.05307608 0.05227506 0.05096906 0.04785206 0.04349870 
##        get       time        new        now 
## 0.03977223 0.03756073 0.03329444 0.03242377
plot(d, min.freq = 750, random.order = FALSE)

Bigrams

topfeatures(d2) # counts for the top bigrams
##       right_now        new_york       last_year      last_night 
##             242             218             203             162 
##     high_school       years_ago       last_week       feel_like 
##             138             138             135             123 
##      first_time looking_forward 
##             120             111
topfeatures(d2) / d2@Dim[2] # top bigram counts normalized by the number of bigram feature types
##       right_now        new_york       last_year      last_night 
##    0.0010961386    0.0009874306    0.0009194882    0.0007337787 
##     high_school       years_ago       last_week       feel_like 
##    0.0006250708    0.0006250708    0.0006114823    0.0005571283 
##      first_time looking_forward 
##    0.0005435398    0.0005027743
plot(d2, min.freq = 50, random.order = FALSE)

Trigrams

topfeatures(d3) # counts for the top trigrams
##          new_york_city            let_us_know         happy_new_year 
##                     33                     27                     23 
##      happy_mothers_day     happy_mother's_day          cinco_de_mayo 
##                     22                     21                     18 
## president_barack_obama         new_york_times          two_years_ago 
##                     15                     14                     14 
##           world_war_ii 
##                     13
topfeatures(d3) / d3@Dim[2] # top trigram counts normalized by the number of trigram feature types
##          new_york_city            let_us_know         happy_new_year 
##           2.518008e-04           2.060188e-04           1.754975e-04 
##      happy_mothers_day     happy_mother's_day          cinco_de_mayo 
##           1.678672e-04           1.602368e-04           1.373459e-04 
## president_barack_obama         new_york_times          two_years_ago 
##           1.144549e-04           1.068246e-04           1.068246e-04 
##           world_war_ii 
##           9.919424e-05
plot(d3, min.freq = 5, random.order = FALSE)

Findings

The most interesting finding is that the count of the most prevalent term falls off by roughly an order of magnitude as we move from unigrams to bigrams to trigrams.

* n-gram : count
* will : 3173
* right_now : 242
* new_york_city : 33
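
For reference, this comparison can be read directly off the three document-feature matrices:

topfeatures(d, 1)   # top unigram: will (3173)
topfeatures(d2, 1)  # top bigram: right_now (242)
topfeatures(d3, 1)  # top trigram: new_york_city (33)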

This intuitively makes sense, since longer word sequences appear less frequently, but it will reduce the predictive power of our algorithm. It also points to the inherent limitations of the bag-of-words approach taken in this initial analysis. If we instead built models that understood parts of speech and the most likely next word given the context of the preceding words, our predictive power would likely increase.
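
Even within the bag-of-words framing, the bigram counts already support a crude next-word lookup. The sketch below is illustrative only; predict_next is a hypothetical helper (not part of the course materials) that ranks the bigrams beginning with a given word and returns their second words.

# hypothetical helper: suggest likely next words from the bigram dfm d2
predict_next <- function(word, bigram_dfm, n = 3) {
  counts <- colSums(as.matrix(bigram_dfm))                 # total count of each bigram (dense is fine: only 3 documents)
  starts <- grep(paste0("^", word, "_"), names(counts), value = TRUE)
  top    <- head(sort(counts[starts], decreasing = TRUE), n)
  sub(paste0("^", word, "_"), "", names(top))              # strip the leading word, keep the candidate next words
}
predict_next("right", d2)   # "now" should rank first, given the counts above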

Next Steps

  1. Create a prototype application based on this sampling. We will use the Shiny rapid prototyping framework to do this (a minimal skeleton is sketched after this list).
  2. Test the usability of this initial analysis and sampling. This will guide us in considering improvements.
  3. Try to build a prediction model from ALL the data, not just a 1% sample.
  4. Consider building a model more sophisticated than using the bag of words approach.
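
To give a sense of scope for step 1, a Shiny prototype can start from a skeleton as small as the one below. This is only a sketch: the layout is a placeholder, and it calls the hypothetical predict_next helper from the Findings section, assuming predict_next and d2 are available in the session.

library(shiny)

# minimal prototype skeleton: one text box in, suggested next words out
ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderText({
    last_word <- tail(strsplit(tolower(input$phrase), "\\s+")[[1]], 1)
    if (length(last_word) == 0) return("")
    paste(predict_next(last_word, d2), collapse = ", ")
  })
}
shinyApp(ui, server)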