This is an exploratory report for the Data Science Specialization Swiftkey Capstone project, through Coursera and the Johns Hopkins University Bloomberg School of Public Health, and in conjunction with Swiftkey. The instructors are Jeff Leek, Roger Peng, and Brian Caffo.
The task at hand is to build a predictive text application using NLP that will predict the next word based on the previous entry or entries. The data set consists of a corpus of text that will be used to create the predictive model. It can be found here.
The data is loaded into R using the unz and readLines functions and split into two samples: one is archived, while the other is processed into a series of vectors containing the raw single words, along with position and element indexes for each word so that phrases can be rebuilt later. Whitespace is stripped and the text is encoded as UTF-8.
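A minimal sketch of this step for one of the files is shown below; the object names here (rawLines, wordVec, lineIndex, posIndex) are illustrative assumptions, not necessarily the objects used in the rest of this report.
# Read one file from the archive, take a reproducible sample, and break it
# into word / position vectors (illustrative names only).
con <- unz("Coursera-SwiftKey.zip", "final/en_US/en_US.twitter.txt")
rawLines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

set.seed(1234)
sampleLines <- sample(rawLines, length(rawLines) %/% 2)   # half kept for exploration

tokenList <- strsplit(trimws(sampleLines), "\\s+")         # strip whitespace, split on runs of spaces
wordVec   <- unlist(tokenList)                             # raw single words
lineIndex <- rep(seq_along(tokenList), lengths(tokenList)) # which sampled line each word came from
posIndex  <- unlist(lapply(lengths(tokenList), seq_len))   # position of each word within its line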
Basic data summaries are calculated.
# Count the lines and whitespace-separated words in the Twitter file,
# then release the raw text from memory.
tw2 <- unz("Coursera-SwiftKey.zip", "final/en_US/en_US.twitter.txt")
twData <- readLines(tw2)
twLines <- length(twData)
twWords <- length(unlist(strsplit(twData, "\\s")))
close(tw2)
rm(tw2)
rm(twData)
gcv2 <- gc(verbose = FALSE)
Twitter Summary:
# Same line and word counts for the blog file.
bl2 <- unz("Coursera-SwiftKey.zip", "final/en_US/en_US.blogs.txt")
blData <- readLines(bl2)
blLines <- length(blData)
blWords <- length(unlist(strsplit(blData, "\\s")))
close(bl2)
rm(bl2)
rm(blData)
gcv3 <- gc(verbose = FALSE)
Blog Summary:
# Same line and word counts for the news file.
nw2 <- unz("Coursera-SwiftKey.zip", "final/en_US/en_US.news.txt")
nwData <- readLines(nw2)
nwLines <- length(nwData)
nwWords <- length(unlist(strsplit(nwData, "\\s")))
close(nw2)
rm(nw2)
rm(nwData)
gcv4 <- gc(verbose = FALSE)
News Summary:
A quick summary of this largely uncleaned data may help direct the data cleaning. It may be a good idea to take a look at the highest-frequency words.
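A sketch of how such a table can be built, assuming the sampled words sit in the hypothetical character vector wordVec from the earlier sketch:
# Count raw word occurrences and keep the 20 most frequent for display.
wordFreq <- sort(c(table(wordVec)), decreasing = TRUE)   # named vector, largest counts first
topWords <- as.data.frame(head(wordFreq, 20))
colnames(topWords) <- "frequency"
print(topWords)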
There are a number of very high frequency words. These are the 20 words with the highest frequency.
## frequency
## the 2104752
## to 1346798
## and 1129139
## a 1127168
## of 989608
## in 755406
## I 730524
## for 519742
## is 507385
## that 463196
## on 378286
## you 364483
## with 338093
## was 304270
## it 303389
## at 264551
## my 261575
## be 260377
## have 249990
## The 240479
These words appear to be the function words commonly chosen as “stop words” in NLP. Many techniques recommend removing these words because of their overrepresentation; however, it may be worth keeping them in order to investigate alternative ways of handling them.
Another question is how often each word frequency occurs.
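A frequency-of-frequencies plot answers this; a sketch of how it can be built from the hypothetical wordFreq vector above:
# How many distinct words occur exactly 1, 2, 3, ... times, on log-log axes.
freqOfFreq <- table(wordFreq)
plot(as.integer(names(freqOfFreq)), as.integer(freqOfFreq),
     log = "xy", xlab = "word frequency", ylab = "number of words")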
The frequency-of-frequencies plot shows a large number of one-occurrence words. Before deciding how to handle these, more information is needed.
Digging a little deeper reveals that 2.08% of all words have a frequency of one. In fact, removing all words with 66 occurrences or fewer still retains 90% of the overall words in the data set.
By removing all words with 15088 occurrences or fewer, only the highest-frequency words remain, and they still account for 50.01% of the overall words.
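A sketch of how these coverage figures can be computed from the same hypothetical wordFreq vector:
# Cumulative share of all word occurrences covered by the most frequent words.
coverage  <- cumsum(wordFreq) / sum(wordFreq)   # wordFreq is sorted, largest counts first
cutoff90  <- which(coverage >= 0.90)[1]         # number of distinct words needed for 90% coverage
minFreq90 <- wordFreq[cutoff90]                 # words rarer than this could be dropped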
It is clear that removing low-frequency words will help with model efficiency, and it should be an effective way to remove little-used foreign-language words, hard-to-detect misspellings, and other “noise” from the model. What is less clear is at what point these less frequent words gain sufficient value to merit remaining in the model. This will require testing during the model-building phase.
Lastly, it seems likely that some punctuation removal will be necessary to consolidate word occurrences, but a better understanding of what kinds of punctuation are in use is needed first, to see whether any of it might be useful data.
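One way to get that picture, sketched here with the hypothetical wordVec from earlier, is to extract and tabulate the punctuation characters themselves:
# Pull every punctuation character out of the tokens and count each one.
puncChars <- unlist(regmatches(wordVec, gregexpr("[[:punct:]]", wordVec)))
puncFreq  <- sort(table(puncChars), decreasing = TRUE)
head(puncFreq, 10)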
Unsurprisingly, periods, commas, and apostrophes are the most common punctuation. Glancing through the other results, however, suggests a little more investigation is warranted.
One special case to consider is words starting with #. These are most likely hashtags from the Twitter data set. They are likely to be unrelated to the words found before and after them, and may introduce noise into the model.
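A sketch of how these can be pulled out for inspection (again using the hypothetical wordVec):
# Find every token that begins with '#' and show a few for manual review.
hashtags <- grep("^#", wordVec, value = TRUE)
length(hashtags)             # total hashtag occurrences
head(unique(hashtags), 20)   # a small set to eyeball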
## [1] "#MothersDay" "#OccupyMadison" "#cufon"
## [4] "#wordpress." "#SNTCK" "#specialneeds"
## [7] "#Well" "#FridayThe13th" "#Broncos"
## [10] "#fashion," "#internmistakes!" "#kids"
## [13] "#Wisconsin" "#FanNight" "#unfollowfriday"
## [16] "#eating" "#sleep" "#exercising"
## [19] "#TeamSoaringHigh" "#cockroach"
133866 occurrences are found, 0.26% of all words. Most look like low-frequency words that would probably be excluded anyway, but given how these tokens tend to be used, the best course of action seems to be to exclude them entirely.
While more detailed data cleaning may be required later, the cleaning is rounded out for now with these steps:
These actions were considered but not taken at this time, pending a better picture of their impact on the model:
We can begin our summary statistics review by creating 1-gram, 2-gram, and 3-gram vectors from the cleaned data. This can be done with a set of flag variables that indicate whether a word is eligible to start a 1-gram, 2-gram, or 3-gram, based on the exclusion rules established above and the word-position data preserved from the original sample.
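A sketch of the flag logic, assuming the cleaned words and the line they came from are held in the hypothetical wordVec and lineIndex vectors from the earlier sketch (flags for the exclusion rules would be combined with these in the same way):
# A word can start a 2-gram only if the next word exists and comes from the
# same original line; a 3-gram additionally needs the word two positions
# ahead to be on that line. (Illustrative objects only.)
n <- length(wordVec)
startsTwoGram   <- c(lineIndex[-1] == lineIndex[-n], FALSE)
startsThreeGram <- startsTwoGram &
  c(lineIndex[-(1:2)] == lineIndex[-((n - 1):n)], FALSE, FALSE)

twoGrams   <- paste(wordVec[startsTwoGram], wordVec[which(startsTwoGram) + 1])
threeGrams <- paste(wordVec[startsThreeGram],
                    wordVec[which(startsThreeGram) + 1],
                    wordVec[which(startsThreeGram) + 2])

twoGramFreq   <- sort(c(table(twoGrams)), decreasing = TRUE)
threeGramFreq <- sort(c(table(threeGrams)), decreasing = TRUE)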
A two-gram frequency distribution can now be examined.
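Assuming the hypothetical twoGramFreq vector from the sketch above, the distribution can be summarised directly:
# Five-number summary of how often each distinct 2-gram occurs.
summary(as.integer(twoGramFreq))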
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 6 2 2220000
Two-grams range in frequency from 1 to 2221297, with a median of 1.
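The most frequent two-grams can be listed the same way as the three-gram table later in this report (twgHd is an illustrative name, again built from the hypothetical twoGramFreq):
# Top 15 two-grams by raw count.
twgHd <- as.data.frame(head(twoGramFreq, 15))
colnames(twgHd) <- "frequency"
print(twgHd)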
## frequency
## 2221297
## of the 215211
## in the 203072
## to the 105961
## for the 100415
## on the 98176
## to be 80759
## at the 70967
## and the 62301
## in a 59272
## with the 52362
## is a 50066
## it was 47680
## for a 46852
## from the 43495
An examination of the most frequent phrases again shows a large spike in those based solely on “function” words.
A three-gram frequency distribution can be examined in the same way.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 2 1 4410000
Three-grams range in frequency from 1 to 4409362, with a median of 1.
# Top 15 three-grams by raw count.
thgHd <- as.data.frame(head(thgFreq, 15))
colnames(thgHd) <- "frequency"
print(thgHd)
## frequency
## 4409362
## one of the 17142
## a lot of 14851
## thanks for the 11768
## to be a 8978
## going to be 8678
## the end of 7427
## i want to 7375
## out of the 7288
## it was a 6986
## as well as 6926
## some of the 6824
## be able to 6538
## part of the 6177
## i have a 5755
An examination of the most frequent phrases still shows a large spike in those built solely from “function” words. There is also an increasing number of one-occurrence phrases, indicating that phrases become more unique as more words are added.
Based on this cleaning and exploratory analysis, the first stages of modeling will proceed using the cleaned n-gram frequencies as a basis. It is expected that other elements will be included or removed based on the quality of the results, and that the inclusion/exclusion thresholds for low- and high-frequency words will be set during model testing.
The Shiny app based on the model will be constructed to allow free entry of a text string and will provide 3 suggestions for the next word based on the prediction model's analysis of the previously entered text. The predictions will update every time a space is entered, and it would be beneficial to design the app so that a suggested word can be added to the text with a button click.
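A minimal sketch of such an interface, with a dummy predictNext function standing in for the model (which does not exist yet); all names here are illustrative:
library(shiny)

# Hypothetical placeholder: the real prediction model will replace this.
predictNext <- function(text) c("the", "to", "and")

ui <- fluidPage(
  textInput("entry", "Type your text:", value = ""),
  uiOutput("suggestions")
)

server <- function(input, output, session) {
  # Recompute suggestions whenever the entered text changes (e.g. after a space).
  preds <- reactive(predictNext(input$entry))

  # Render one button per suggested word.
  output$suggestions <- renderUI({
    lapply(seq_along(preds()), function(i) {
      actionButton(paste0("pick", i), preds()[i])
    })
  })

  # Clicking a suggestion appends that word to the entered text.
  lapply(1:3, function(i) {
    observeEvent(input[[paste0("pick", i)]], {
      updateTextInput(session, "entry",
                      value = trimws(paste(input$entry, preds()[i])))
    })
  })
}

shinyApp(ui, server)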