SwiftKey Text Prediction, Analysis Findings

Analysis

The following are the basic steps/pipeline used to analyze the data with supporting comments:

1. Reading The Data Sets

The data sets were acquired at the following location: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

For this analysis, the English-language versions of the files were used.

I found the data sets a bit tricky to read at first; however, the following combination of functions and parameters in R read the complete files successfully. For example, reading the blog data set:

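## READ BLOGS
# Open the file as UTF-8, read every line into a character vector, then close it.
f <- file("../raw_data/final/en_US/en_US.blogs.txt", encoding="UTF-8")
blogs <- readLines(f)
len.blogs <- length(blogs)  # number of blog documents (one per line)
close(f)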
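
The exact cause of the difficulty is not recorded here. Assuming it was embedded NUL bytes or control characters in the raw files, a helper like the following sketch is one way to read a file completely. The name read_corpus_file is hypothetical; the readLines arguments are standard base R.

```r
# Sketch only: read_corpus_file is a hypothetical helper, not the code used above.
# Assumption: read problems come from embedded NULs or control characters.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")      # binary mode: bytes are passed through untouched
  on.exit(close(con))                 # close the connection even if reading fails
  readLines(con, encoding = "UTF-8",  # mark the resulting strings as UTF-8
            warn = FALSE, skipNul = TRUE)
}
```

With such a helper, the same call could be reused for the news and twitter files by changing only the path.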

2. Creating Working Sets

After working with the raw data in R, it was determined that R was not an appropriate language for raw text exploration. Given the size of the raw data and the number of parsing/replacement operations needed, R was simply too slow.

To speed up the exploration, four sets of data were “sampled” out of the raw data. The sampling was seeded so that any subsequent access to the raw data would be consistent. The following code snippet is used to sample the blog data set:

### CREATE BLOG INDEXES
# Fix the seed so the same documents are selected on every run.
set.seed(111)
# Draw line numbers without replacement, one index vector per working set.
test.blog.sample.index <- sample(1:len.blogs, 100, replace=FALSE)
s10K.blog.sample.index <- sample(1:len.blogs, 10000, replace=FALSE)
s100K.blog.sample.index <- sample(1:len.blogs, 100000, replace=FALSE)
s300K.blog.sample.index <- sample(1:len.blogs, 300000, replace=FALSE)

This creates ‘indexes’ into the data sets for the selection of a specific number of documents. The indexes are then used to create consolidated data sets, i.e. data sets that contain samples from each document set. The following code is an example of merging the document types:

### TEST DATA
test.data <- c(blogs[test.blog.sample.index],
               news[test.news.sample.index],
               twit[test.twit.sample.index])

Four different data sets were created. The test.data data set is small, only 300 documents (100 from each document type), and is used to test any exploration code or algorithms before running them on the larger data sets. The other data sets include progressively more documents, to explore the effect of data set size on the size and speed of the resulting models.
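
The sample-then-merge steps above can be sketched as a single function. This is an illustration only: make_working_set and its arguments are hypothetical names, and because it resets the seed on every call it will not reproduce the exact indexes drawn by the code above.

```r
# Sketch only: make_working_set is a hypothetical helper.
# corpora: a named list of character vectors (one vector per document type).
# n:       number of documents to draw from each document type.
make_working_set <- function(corpora, n, seed = 111) {
  set.seed(seed)  # seeded so repeated calls select the same documents
  sampled <- lapply(corpora, function(docs) {
    docs[sample.int(length(docs), n, replace = FALSE)]
  })
  unlist(sampled, use.names = FALSE)  # merge the document types into one vector
}

# Toy corpora standing in for blogs, news and twit.
toy <- list(blogs = paste("blog", 1:50),
            news  = paste("news", 1:50),
            twit  = paste("twit", 1:50))
toy.data <- make_working_set(toy, 10)
```

Called with n = 100, 10000, 100000 and 300000 on the real corpora, the same function would give four working sets of the sizes described above.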

3. Iterative Exploration

3.1 Raw Data Statistics

Lines	Words	Characters	File
899288	37334690	210160014	en_US.blogs.txt
1010242	34372720	205811889	en_US.news.txt
2360148	30374206	167105338	en_US.twitter.txt
4269678	102081616	583077241	total

3.2 Test Data Statistics

Lines	Words	Characters	File
300000	8823440	50419258	s100K_test_data.txt
30000	885356	5058972	s10K_test_data.txt
900000	26502233	151477345	s300K_test_data.txt
300	9071	51234	test_data.txt
1230300	36220100	207006809	total
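
For reference, counts of this kind can also be computed from a character vector already loaded in R. The sketch below assumes that a word is any whitespace-delimited token and that characters are counted without line terminators, so its numbers may differ slightly from the tables above. The name text_stats is hypothetical.

```r
# Sketch only: text_stats is a hypothetical helper.
# Assumptions: words are whitespace-delimited; newlines are not counted as characters.
text_stats <- function(docs) {
  # Trim first so leading whitespace does not produce an empty first token.
  words_per_doc <- lengths(strsplit(trimws(docs), "\\s+"))
  c(lines      = length(docs),
    words      = sum(words_per_doc),
    characters = sum(nchar(docs, type = "chars")))
}

stats <- text_stats(c("hello world", "  one two three ", ""))
```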

3.3 Word Occurrences vs Total Words Found

In each of the data sets, the ratio of the occurrences of the first 100, 500, and 1500 words to the total words found was consistent, i.e. this ratio was not influenced by the number of documents in each data set.

Note: The x axis on the following chart represents the index, i.e. the first 3000 words found in the data sets. See the table in 4.1 1-Grams for the first 20 words.
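
One way to compute such a ratio is sketched below. It assumes that "first" means the highest-ranked words by frequency and that the ratio is the cumulative share of all word occurrences covered by those words; both are interpretations, and word_coverage is a hypothetical name.

```r
# Sketch only: word_coverage is a hypothetical helper.
# Assumption: the ratio is (occurrences of the top-k words) / (all word occurrences).
word_coverage <- function(tokens, ranks = c(100, 500, 1500)) {
  counts <- sort(table(tokens), decreasing = TRUE)      # most frequent word first
  cum_share <- cumsum(as.numeric(counts)) / sum(counts) # cumulative share by rank
  idx <- pmin(ranks, length(cum_share))                 # clip ranks to vocabulary size
  setNames(cum_share[idx], paste0("top", ranks))
}

# Toy example with a three-word vocabulary.
toy.tokens <- c(rep("said", 5), rep("will", 3), rep("one", 2))
toy.coverage <- word_coverage(toy.tokens, ranks = c(1, 2, 100))
```

Repeating the same tokens ten times leaves the result unchanged, which is the size-independence property described above in miniature.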

3.4 Peeling Back The Onion

To be brief, I will not go into all the findings/passes made in the exploration. The following, however, are some of the key takeaways that will be used for subsequent model building:

Remove all punctuation.
Remove stop words, including profanity (the stopwords("English") list was used).
Remove all numeric data.
Include only valid, correctly spelled words found in the aspell dictionary (ftp://ftp.gnu.org/gnu/aspell/dict/en/aspell6-en-7.1-0.tar.bz2).

3.5 Pre-Processing Sets

Prior to any subsequent analysis or model building, all the data sets created were pre-processed using the steps above. Any future prediction algorithm will also need to pre-process its inputs with these same steps.
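
The pre-processing steps can be sketched in base R as follows. This is not the code used for the analysis: preprocess is a hypothetical name, the stop word and dictionary vectors are small stand-ins for the stopwords list and the aspell dictionary, and lowercasing the text first is an added assumption.

```r
# Sketch only: preprocess is a hypothetical helper, not the analysis code.
# stop_words and dictionary are stand-ins for the real stop word list and aspell dictionary.
preprocess <- function(docs, stop_words, dictionary) {
  docs <- tolower(docs)                    # assumption: matching is case-insensitive
  docs <- gsub("[[:punct:]]+", " ", docs)  # remove all punctuation
  docs <- gsub("[[:digit:]]+", " ", docs)  # remove all numeric data
  tokens <- strsplit(trimws(docs), "\\s+")
  lapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]                # drop empty tokens
    tok <- tok[!tok %in% stop_words]       # remove stop words
    tok[tok %in% dictionary]               # keep only dictionary words
  })
}

cleaned <- preprocess("The 3 cats, sat on the mat!",
                      stop_words = c("the", "on"),
                      dictionary = c("cats", "sat", "mat", "dog"))
```

Keeping the steps in one function makes it straightforward to apply the identical cleaning to the training sets and to any text later passed to a prediction algorithm.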

4. N-Gram Analysis

4.1 1-Gram or Word Occurrences

The following table summarizes the 1-grams, i.e. the counts of the most frequently occurring words in the documents. From 30K documents on up, the number of documents does not materially change which words rank highest in the data sets. This suggests that a suitably chosen smaller collection of documents of each document type could be used to limit the size of the model.

##    X300K_words X300K_count X100K_words X100K_count X10K_words X10K_count
## 1         said       88795        said       29898       said       2917
## 2         will       82075        will       27531       will       2769
## 3          one       76261         one       25477        one       2614
## 4         just       68335        just       22795       just       2244
## 5         like       62688        like       21051        can       2022
## 6          can       61765         can       20703       like       2020
## 7         time       54205        time       17847       time       1823
## 8          get       50612         get       16809        get       1705
## 9          new       47752         new       16064        new       1604
## 10         now       41043         now       13797     people       1326
## 11      people       40615      people       13670        now       1292
## 12        also       38201        also       12601       good       1292
## 13        good       37714        know       12461        day       1243
## 14        know       37024         day       12298       also       1242
## 15          us       36997          us       12191         us       1231
## 16         day       36809        good       12185      first       1201
## 17       first       36278       first       12078       know       1162
## 18        back       33813        back       11346       back       1157
## 19         two       32595        make       10835        see       1105
## 20        make       32292         two       10778        two       1093
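
A table of this kind can be produced from a vector of pre-processed words roughly as follows. The sketch is illustrative; top_words is a hypothetical name and the input is assumed to be one word per element.

```r
# Sketch only: top_words is a hypothetical helper.
# tokens: character vector with one pre-processed word per element.
top_words <- function(tokens, n = 20) {
  counts <- sort(table(tokens), decreasing = TRUE)  # most frequent word first
  counts <- counts[seq_len(min(n, length(counts)))] # keep the n highest-ranked words
  data.frame(word  = names(counts),
             count = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy.top <- top_words(c("said", "will", "said", "one", "said", "will"), n = 2)
```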

4.2 X-Gram

The code to calculate the X_GRAMS is still under development.
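
As a placeholder until that code exists, the following sketch shows one simple way X-grams could be formed from the words of a single pre-processed document. It is not the code under development; make_ngrams is a hypothetical name.

```r
# Sketch only: make_ngrams is a hypothetical helper, not the code under development.
# tokens: the words of one document, in order; n: the number of words per gram.
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))  # too short to form any n-gram
  starts <- seq_len(length(tokens) - n + 1)     # start position of each n-gram
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy.bigrams  <- make_ngrams(c("new", "york", "city", "time"), n = 2)
toy.trigrams <- make_ngrams(c("new", "york", "city", "time"), n = 3)
```

Building the grams one document at a time avoids creating grams that span document boundaries, and the results can be counted with table() in the same way as the 1-grams.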