The following report summarizes the findings from the analysis of the raw data sets provided for the SwiftKey Text Prediction project. There are three data sets, representing blogs, Twitter, and news reports. Each of these data sets has unique characteristics, and the objective of this analysis is to determine how they can be used in the construction of a text prediction model.
Some initial expectations about the data sets:
The following are the basic steps/pipeline used to analyze the data with supporting comments:
The data sets were acquired at the following location: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
For this analysis, the English-language versions of the files were used.
I found the data sets a bit tricky to read at first, but the following combination of functions and parameters in R successfully reads the complete files. For example, to read the blog data set:
## READ BLOGS
f <- file("../raw_data/final/en_US/en_US.blogs.txt", encoding="UTF-8")
blogs <- readLines(f)
len.blogs <- length(blogs)
close(f)
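The same read can be wrapped in a small function so that all three files are loaded identically. This is a minimal sketch rather than the code used for this report: `read_corpus` is a hypothetical name, and opening the connection in binary mode with `skipNul = TRUE` is an assumption about what helps with stray control characters and embedded NULs. Here `encoding = "UTF-8"` only marks the strings as UTF-8; it does not re-encode them.

```r
# Hypothetical helper (not from the original report): read a whole text file
# as a character vector, one element per line.
read_corpus <- function(path) {
  con <- file(path, open = "rb")   # binary mode: no platform text-mode translation
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Usage, with the path from the snippet above:
# blogs <- read_corpus("../raw_data/final/en_US/en_US.blogs.txt")
# len.blogs <- length(blogs)
```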
After working with the raw data in R, it was determined that R was not an appropriate language for raw text exploration: given the size of the raw data and the number of parsing/replacement operations needed, R was simply too slow.
To improve the exploration of the data, four sets of data were “sampled” out of the raw data. The sampling was seeded so that any subsequent access to the raw data would be consistent. The following code snippet is used to sample the blog data set:
### CREATE BLOG INDEXES
set.seed(111)
test.blog.sample.index <- sample(1:len.blogs, 100, replace=FALSE)
s10K.blog.sample.index <- sample(1:len.blogs, 10000, replace=FALSE)
s100K.blog.sample.index <- sample(1:len.blogs, 100000, replace=FALSE)
s300K.blog.sample.index <- sample(1:len.blogs, 300000, replace=FALSE)
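The four draws above follow one pattern, so they could be folded into a helper that takes the sample sizes as a named vector. This is only a sketch: `make_sample_indexes` is a hypothetical name, and the seed is set once at the start, so each index vector depends on the order of the sizes.

```r
# Hypothetical helper (not from the original report): draw one index vector
# per requested sample size, without replacement within each draw.
make_sample_indexes <- function(n_docs, sizes, seed = 111) {
  set.seed(seed)  # fix the RNG state so repeated runs give the same indexes
  lapply(sizes, function(k) sample.int(n_docs, size = k, replace = FALSE))
}

sizes <- c(test = 100, s10K = 10000, s100K = 100000, s300K = 300000)
# 899288 is the line count of en_US.blogs.txt reported in this document
blog.indexes <- make_sample_indexes(899288, sizes)
lengths(blog.indexes)
```

Calling the helper once per source (blogs, news, Twitter) with that source's line count would give the same kind of index sets for each document type.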
This creates ‘indexes’ into the data sets for the selection of a specific number of documents. The indexes are then used to create consolidated data sets, i.e. data sets that contain samples from each document set. The following code is an example of merging the document types:
### TEST DATA
test.data <- vector()
test.data <- c(test.data,
               blogs[test.blog.sample.index],
               news[test.news.sample.index],
               twit[test.twit.sample.index])
Four different data sets were created. The test.data data set is small, only 300 documents, and is used to test any ‘exploration’ code/algorithms before they are run on larger data sets. The other data sets include incrementally more documents, to explore the effect of corpus size on the performance (size and speed) of the resulting models.
| Lines | Words | Characters | File |
|---|---|---|---|
| 899288 | 37334690 | 210160014 | en_US.blogs.txt |
| 1010242 | 34372720 | 205811889 | en_US.news.txt |
| 2360148 | 30374206 | 167105338 | en_US.twitter.txt |
| 4269678 | 102081616 | 583077241 | total |
| Lines | Words | Characters | File |
|---|---|---|---|
| 300000 | 8823440 | 50419258 | s100K_test_data.txt |
| 30000 | 885356 | 5058972 | s10K_test_data.txt |
| 900000 | 26502233 | 151477345 | s300K_test_data.txt |
| 300 | 9071 | 51234 | test_data.txt |
| 1230300 | 36220100 | 207006809 | total |
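The report does not state which tool produced the counts in these tables. As a rough cross-check from within R, the sketch below counts lines, whitespace-separated words, and characters for a character vector. `corpus_counts` is a hypothetical name, and its results need not match the tables exactly, since tools differ on whether they count bytes or characters and whether line terminators are included.

```r
# Hypothetical helper (not from the original report): basic size statistics
# for a character vector holding one document per element.
corpus_counts <- function(lines) {
  # trimws() avoids an empty leading token; empty lines contribute 0 words
  words <- lengths(strsplit(trimws(lines), "\\s+"))
  c(lines = length(lines),
    words = sum(words),
    chars = sum(nchar(lines, type = "chars")))  # excludes line terminators
}

corpus_counts(c("hello world", "", "a b c"))
```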
Across all of the data sets, the ratio of the first 100, 500, and 1500 words was consistent, i.e. this ratio was not influenced by the number of documents in each data set.
Note: The x axis on the following chart represents the index, i.e. the first 3000 words found in the data sets. See the table in 4.1 1-Grams for the first 20 words.
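One way to compute such a ratio is as the share of all word occurrences accounted for by the N most frequent words. That reading of "ratio" is an assumption, and `top_word_coverage` is a hypothetical name, so treat this as a sketch rather than the calculation behind the result above.

```r
# Hypothetical helper (not from the original report): share of all word
# occurrences covered by the n most frequent words, for each n in top_n.
top_word_coverage <- function(words, top_n = c(100, 500, 1500)) {
  counts <- sort(table(words), decreasing = TRUE)
  cum_share <- cumsum(as.numeric(counts)) / sum(counts)
  # If the vocabulary has fewer than n distinct words, coverage is 1
  sapply(top_n, function(n) cum_share[min(n, length(cum_share))])
}

# Toy example: "a" occurs 3 times, "b" twice, "c" once
top_word_coverage(c("a", "a", "a", "b", "b", "c"), top_n = c(1, 2, 3))
```

Running the same function on the token vectors of the 10K, 100K, and 300K samples would allow the ratios to be compared directly.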
To be brief, I will not go into all the findings/passes made in the exploration. The following, however, are some of the key takeaways that will be used for subsequent model building:
Prior to performing any subsequent analysis or model building, all the data sets created were pre-processed using the above process steps. Any future prediction algorithm will also need to pre-process its inputs with these same processing steps.
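A simple way to guarantee this is to keep the pre-processing in a single function that both the training pipeline and the prediction code call. The sketch below shows only that structure: `preprocess_text` is a hypothetical name, and the individual steps (case folding, dropping non-letter characters, collapsing whitespace) are stand-ins, not the steps actually used for this report.

```r
# Hypothetical helper (not from the original report). The steps are stand-ins;
# substitute the real pre-processing steps when adapting this.
preprocess_text <- function(x) {
  x <- tolower(x)                    # assumed step: case folding
  x <- gsub("[^a-z' ]+", " ", x)     # assumed step: drop digits and punctuation
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)
}

# The same function is applied to training documents and to user input
training.docs <- preprocess_text(c("Hello,  World!", "It's 5 o'clock"))
user.input    <- preprocess_text("HELLO world")
```

Note that the character class in this stand-in also discards non-ASCII letters, which may or may not be desirable for the real pipeline.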
The following table summarizes the 1-grams, i.e. the word counts for the most frequently occurring words in the documents. The number of documents, from 30K documents on up, does not materially change the top words that occur in the data sets. This possibly indicates that a suitably chosen smaller collection of documents of each type could be used to limit the size of the model.
## X300K_words X300K_count X100K_words X100K_count X10K_words X10K_count
## 1 said 88795 said 29898 said 2917
## 2 will 82075 will 27531 will 2769
## 3 one 76261 one 25477 one 2614
## 4 just 68335 just 22795 just 2244
## 5 like 62688 like 21051 can 2022
## 6 can 61765 can 20703 like 2020
## 7 time 54205 time 17847 time 1823
## 8 get 50612 get 16809 get 1705
## 9 new 47752 new 16064 new 1604
## 10 now 41043 now 13797 people 1326
## 11 people 40615 people 13670 now 1292
## 12 also 38201 also 12601 good 1292
## 13 good 37714 know 12461 day 1243
## 14 know 37024 day 12298 also 1242
## 15 us 36997 us 12191 us 1231
## 16 day 36809 good 12185 first 1201
## 17 first 36278 first 12078 know 1162
## 18 back 33813 back 11346 back 1157
## 19 two 32595 make 10835 see 1105
## 20 make 32292 two 10778 two 1093
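A table like this can be produced with base R alone. The sketch below is not the code behind the table above: `top_words` and its `drop` argument are hypothetical. The `drop` argument is there because the table contains no common function words, which suggests that a stop-word list was applied, but that is an inference.

```r
# Hypothetical helper (not from the original report): the k most frequent
# words in a character vector of documents.
top_words <- function(docs, k = 20, drop = character(0)) {
  tokens <- unlist(strsplit(tolower(docs), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens) & !(tokens %in% drop)]  # remove empties and dropped words
  counts <- sort(table(tokens), decreasing = TRUE)
  head(data.frame(word = names(counts), count = as.integer(counts),
                  stringsAsFactors = FALSE), k)
}

docs <- c("The cat sat", "the cat ran", "The dog")
top_words(docs, k = 2, drop = "the")
```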
The code to calculate the X_GRAMS is still under development.
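For illustration only, n-grams can be built from an already tokenised document with a sliding window. This is a sketch under that assumption, not the report's implementation, and `ngrams` is a hypothetical name.

```r
# Hypothetical helper (not from the original report): all n-grams from the
# tokens of a single document, joined with spaces.
ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))  # document too short
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("new", "york", "city", "new", "york")
bigram.counts <- sort(table(ngrams(tokens, 2)), decreasing = TRUE)
bigram.counts
```

Applying the function per document, rather than to one concatenated token stream, keeps n-grams from spanning document boundaries.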
Based on the analysis, the following is the strategy for developing the text prediction model: