Introduction

This report summarizes the findings from the analysis of the raw data sets provided for the SwiftKey Text Prediction project. There are 3 data sets, representing blogs, Twitter and news reports. Each of these data sets has unique characteristics, and the objective of this analysis is to determine how they can be used in the construction of a text prediction model.

Some initial expectations about the data sets:

Analysis

The following are the basic steps/pipeline used to analyze the data with supporting comments:

1. Reading The Data Sets

The data sets were acquired at the following location: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

For this analysis the English language versions of the files were used.

I found the data sets a bit tricky to read at first; however, the following combination of functions and parameters in R successfully reads the complete files. For example, to read the blog data set:

 ## READ BLOGS
f <- file("../raw_data/final/en_US/en_US.blogs.txt", encoding="UTF-8")
blogs <- readLines(f)
len.blogs <- length(blogs)
close(f)
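The same read can be wrapped in a small function so that all three files are loaded the same way. The following is a minimal sketch, not the code used for this report: the name read_corpus is hypothetical, and opening the connection in binary mode with skipNul = TRUE is an assumption about what makes the reads robust (it avoids early truncation on embedded control characters on some platforms).

```r
# Hypothetical helper (not part of the original analysis): read one corpus
# file completely, one document per line, marking the text as UTF-8.
read_corpus <- function(path) {
  con <- file(path, open = "rb")      # binary mode: no platform-specific EOF handling
  on.exit(close(con))                 # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Example call, using the path from the snippet above:
# blogs <- read_corpus("../raw_data/final/en_US/en_US.blogs.txt")
# len.blogs <- length(blogs)
```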

2. Creating Working Sets

After working with the raw data in R it was determined that R was not an appropriate language to use for raw text exploration. Given the size of the raw data and the number of parsing/replacement operations that were needed, R was just too slow.

To speed up the exploration, 4 sets of data were sampled from the raw data. The sampling was seeded so that any subsequent run selects the same documents. The following code snippet is used to sample the blog data set:

 ### CREATE BLOG INDEXES
set.seed(111)
test.blog.sample.index <- sample(1:len.blogs, 100, replace=FALSE)
s10K.blog.sample.index <- sample(1:len.blogs, 10000, replace=FALSE)
s100K.blog.sample.index <- sample(1:len.blogs, 100000, replace=FALSE)
s300K.blog.sample.index <- sample(1:len.blogs, 300000, replace=FALSE)

This creates ‘indexes’ into the data sets for the selection of a specific number of documents. The indexes are then used to create consolidated data sets, i.e. data sets that contain samples from each document set. The following code is an example of merging the document types:

### TEST DATA
# Merge the sampled documents from each source into one character vector.
test.data <- c(blogs[test.blog.sample.index],
               news[test.news.sample.index],
               twit[test.twit.sample.index])
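The index and merge steps can also be combined into one function. This is a sketch under stated assumptions: sample_corpus is a hypothetical name, and it resets the seed on every call, which differs from the snippets above, where set.seed(111) is called once before all four index vectors are drawn. It will therefore not select the same documents as the original code.

```r
# Hypothetical helper: draw n documents from each source and merge them
# into one character vector. `sources` is a list of character vectors.
sample_corpus <- function(sources, n, seed = 111) {
  set.seed(seed)                      # assumption: reseed per call for repeatability
  picked <- lapply(sources, function(docs) {
    docs[sample.int(length(docs), n, replace = FALSE)]
  })
  unlist(picked, use.names = FALSE)
}

# Example call with the objects from the snippets above:
# test.data <- sample_corpus(list(blogs, news, twit), 100)
```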

Four different data sets were created. The test.data data set is small, only 300 documents, and is used to test any exploration code and algorithms before running them on larger data sets. The other data sets include incrementally more documents in order to explore the effect of the number of documents on the size and speed of the resulting models.

3. Iterative Exploration

3.1 Raw Data Statistics

Lines     Words       Characters  File
899288    37334690    210160014   en_US.blogs.txt
1010242   34372720    205811889   en_US.news.txt
2360148   30374206    167105338   en_US.twitter.txt
4269678   102081616   583077241   total
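Counts of this kind can be approximated inside R once a file has been read with readLines. The sketch below is an illustration only: corpus_stats is a hypothetical name, words are taken to be whitespace-separated tokens, and characters exclude line terminators, so the results may differ from the figures above depending on how those were produced.

```r
# Hypothetical helper: line, word and character counts for a vector of
# documents (one document per element).
corpus_stats <- function(docs) {
  tokens <- strsplit(trimws(docs), "[[:space:]]+")   # whitespace-separated words
  c(lines      = length(docs),
    words      = sum(lengths(tokens)),
    characters = sum(nchar(docs, type = "chars")))   # newlines are not counted
}

# Example call: corpus_stats(blogs)
```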

3.2 Test Data Statistics

Lines     Words       Characters  File
300000    8823440     50419258    s100K_test_data.txt
30000     885356      5058972     s10K_test_data.txt
900000    26502233    151477345   s300K_test_data.txt
300       9071        51234       test_data.txt
1230300   36220100    207006809   total

3.3 Word Occurrences vs Total Words Found

In each of the data sets, the ratio of the occurrences of the first 100, 500, and 1500 words to the total words found was consistent, i.e. this ratio was not influenced by the number of documents in the data set.
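One way to compute such a ratio is sketched below. It assumes that "first" means the most frequent words and that the ratio is the cumulative share of all word occurrences; both are interpretations of the paragraph above, and the name coverage is hypothetical.

```r
# Hypothetical helper: share of all word occurrences covered by the
# top_n most frequent words. `words` is a character vector of tokens.
coverage <- function(words, top_n = c(100, 500, 1500)) {
  counts <- sort(table(words), decreasing = TRUE)
  cum_share <- cumsum(as.numeric(counts)) / length(words)
  k <- pmin(top_n, length(counts))    # cap at the vocabulary size
  setNames(cum_share[k], top_n)
}

# Example: comparing coverage(words_10K) with coverage(words_300K), where
# both arguments are hypothetical token vectors from two of the sampled sets.
```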

Note: The x axis on the following chart represents the index, i.e. the first 3000 words found in the data sets. See the table in 4.1 1-Grams for the first 20 words.

3.4 Peeling Back The Onion

To be brief, I will not go into all the findings and passes made in the exploration. The following, however, are the key takeaways that will be used for subsequent model building:

  1. Remove all punctuation.
  2. Remove stop words, including profanity (the stopwords("English") list was used).
  3. Remove all numeric data.
  4. Include only valid, correctly spelled words found in the aspell dictionary (ftp://ftp.gnu.org/gnu/aspell/dict/en/aspell6-en-7.1-0.tar.bz2).

3.5 Pre-Processing Sets

Prior to performing any subsequent analysis or model building, all the data sets created were pre-processed using the steps above. Any future prediction algorithm will also need to pre-process its inputs with these same steps.
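The four pre-processing steps could be expressed in base R roughly as follows. This is a sketch, not the pipeline used for this report: preprocess is a hypothetical name, the stop word list and dictionary are passed in as plain character vectors, and lower-casing the text first is an added assumption.

```r
# Hypothetical helper: apply the four pre-processing steps to a vector of
# documents and return one token vector per document.
preprocess <- function(docs, stop_words, dictionary) {
  docs <- tolower(docs)                        # assumption: case-insensitive matching
  docs <- gsub("[[:punct:]]+", " ", docs)      # step 1: remove punctuation
  docs <- gsub("[[:digit:]]+", " ", docs)      # step 3: remove numeric data
  tokens <- strsplit(trimws(docs), "[[:space:]]+")
  lapply(tokens, function(w) {
    w <- w[!(w %in% stop_words)]               # step 2: drop stop words and profanity
    w[w %in% dictionary]                       # step 4: keep dictionary words only
  })
}
```

Passing the same stop_words and dictionary vectors at prediction time keeps the input handling identical to the handling of the training data.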

4. N-Gram Analysis

4.1 1-Gram or Word Occurrences

The following table summarizes the 1-grams, i.e. the word counts for the most frequently occurring words in the documents. The number of documents, from 30K documents on up, does not materially change the top words that occur in the data sets. This suggests that a smaller collection of documents of each document type could be used to limit the size of the model.

##    X300K_words X300K_count X100K_words X100K_count X10K_words X10K_count
## 1         said       88795        said       29898       said       2917
## 2         will       82075        will       27531       will       2769
## 3          one       76261         one       25477        one       2614
## 4         just       68335        just       22795       just       2244
## 5         like       62688        like       21051        can       2022
## 6          can       61765         can       20703       like       2020
## 7         time       54205        time       17847       time       1823
## 8          get       50612         get       16809        get       1705
## 9          new       47752         new       16064        new       1604
## 10         now       41043         now       13797     people       1326
## 11      people       40615      people       13670        now       1292
## 12        also       38201        also       12601       good       1292
## 13        good       37714        know       12461        day       1243
## 14        know       37024         day       12298       also       1242
## 15          us       36997          us       12191         us       1231
## 16         day       36809        good       12185      first       1201
## 17       first       36278       first       12078       know       1162
## 18        back       33813        back       11346       back       1157
## 19         two       32595        make       10835        see       1105
## 20        make       32292         two       10778        two       1093
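A table of this shape can be produced from pre-processed tokens with table and sort. The sketch below shows one possible way; top_words is a hypothetical name, and the input is assumed to be a list of token vectors, one per document.

```r
# Hypothetical helper: the n most frequent words and their counts.
top_words <- function(tokens, n = 20) {
  counts <- sort(table(unlist(tokens)), decreasing = TRUE)
  top <- counts[seq_len(min(n, length(counts)))]
  data.frame(words = names(top), count = as.integer(top),
             stringsAsFactors = FALSE)
}

# The per-set results could then be placed side by side with cbind().
```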

4.2 X-Gram

The code to calculate the X-grams is still under development.
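For orientation only, a base R sketch of n-gram generation is given here. It is not the code under development: make_ngrams is a hypothetical name, and it assumes the tokens of a single document are already pre-processed.

```r
# Hypothetical helper: all n-grams of one document, as space-joined strings.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))  # document too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Counting is then a matter of tabulating the result, for example
# sort(table(make_ngrams(words, 2)), decreasing = TRUE).
```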

Modeling and Prediction Plans

Based on the analysis, the following is the strategy for developing the text prediction model:

  1. Calculate the 2-grams and 3-grams.
  2. Use the 3-gram as the first key to determine the most probable next word.
  3. If the 3-gram has no solution, use the 2-gram to determine the most probable word.
  4. If the 2-gram has no solution, use the 1-gram to determine the most probable word to follow.
  5. If none of the above provides a solution, offer the top 3 most probable words.
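The fallback order in this plan can be sketched as a simple lookup chain. The code below is one possible reading, not a finished design: it assumes the 3-gram lookup is keyed on the last two words typed, the 2-gram lookup on the last word, and that the final fallback returns the three most frequent single words. The names build_table and predict_next are hypothetical.

```r
# Hypothetical helper: for each (n - 1)-word context, count the words that
# follow it. Returns a list of sorted count tables keyed by context.
build_table <- function(words, n) {
  if (length(words) < n) return(list())
  idx <- seq_len(length(words) - n + 1)
  context <- vapply(idx,
                    function(i) paste(words[i + seq_len(n - 1) - 1], collapse = " "),
                    character(1))
  following <- words[idx + n - 1]
  lapply(split(following, context),
         function(w) sort(table(w), decreasing = TRUE))
}

# Hypothetical helper: try the 3-gram table, then the 2-gram table, then
# fall back to the k most frequent words overall.
predict_next <- function(history, tri, bi, uni, k = 3) {
  h <- length(history)
  if (h >= 2) {
    hit <- tri[[paste(history[(h - 1):h], collapse = " ")]]
    if (!is.null(hit)) return(names(hit)[1])
  }
  if (h >= 1) {
    hit <- bi[[history[h]]]
    if (!is.null(hit)) return(names(hit)[1])
  }
  head(names(uni), k)
}

# Tiny illustration on a toy token vector (not real data):
words <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat", "sat", "down")
tri <- build_table(words, 3)
bi  <- build_table(words, 2)
uni <- sort(table(words), decreasing = TRUE)
predict_next(c("the", "cat"), tri, bi, uni)
```

Storing the tables as lists keyed by context keeps each lookup to a single name match, at the cost of holding every observed context in memory.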
