Executive Summary

The purpose of this analysis is to explore the SwiftKey data sets in order to eventually develop a model that predicts the next word in a sequence of words. Data from Twitter, news, and blogs will be used to build this model.

After analyzing the data, it appears that around one third of the words in all three sources come from a subset of just 100 (roughly 0.1%) of the unique words. Therefore we will develop a prediction model that focuses on predicting these top 100 words. This is mainly because we are limited computationally, and it is a reasonable goal for the class.

Based on this analysis, below is my plan for developing a prediction model:

  1. Create a data set of each word paired with the word that preceded it.
  2. Create features from up to the three words prior to the preceding word. I assume the word immediately before is the most important, and the three words prior to that also carry importance, although their order may matter less.
  3. The data set will have the outcome word, and the predictors are the word directly before it plus a bucket of up to three earlier words.

  4. I’ll limit the number of features that could possibly be included to something computationally reasonable.

  5. The final training file will have the following variables.

    • Outcome Word
    • Word prior
    • Word 2 prior
    • Word 3 prior
    • Word 4 prior
  6. I’ll limit this to the most common words so that my training data set is reasonably sized and the model actually runs. The final training data set will look like this:

    • Outcome Word
    • Top X words prior (binary 1 or 0). I’m guessing there will be 100+ variables
    • Prior words 2-4 (binary 1 or 0). I’m guessing around 100-500
  7. I’ll either develop this as 100 models, each with a word as a binary outcome, or as a single model with multiple possible outcomes. Every time it is run, the outcomes with the 3 highest probabilities will be displayed. A sketch of the planned training-set layout follows below.
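
This is a minimal, hypothetical sketch of steps 1 through 5: the tokens vector and the column names are placeholders, not the final implementation.

#Minimal sketch (hypothetical): pair each outcome word with the four words
#that came before it in the token stream
tokens <- c("thanks", "for", "the", "follow", "and", "the", "retweet")
n <- length(tokens)
train <- data.frame(
  outcome = tokens[5:n],        #word to predict
  prior1  = tokens[4:(n - 1)],  #word directly before
  prior2  = tokens[3:(n - 2)],  #two words before
  prior3  = tokens[2:(n - 3)],  #three words before
  prior4  = tokens[1:(n - 4)],  #four words before
  stringsAsFactors = FALSE
)
head(train)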

Summaries of 3 files

Set up libraries and set the working directory

#Set working directory and import the three files
library(tm)
## Loading required package: NLP
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
setwd("/Users/SRG/Desktop/Coursera/Capstone")

Import the three files

#Import blogs, news, and twitter
blogs = readLines("final/en_US/en_US.blogs.txt",n = -1)
news = readLines("final/en_US/en_US.news.txt",n = -1)
twitter = readLines("final/en_US/en_US.twitter.txt",n=-1)
## Warning in readLines("final/en_US/en_US.twitter.txt", n = -1): line 167155
## appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", n = -1): line 268547
## appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", n = -1): line 1274086
## appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", n = -1): line 1759032
## appears to contain an embedded nul

Get line counts for each data source

blogs.rows = NROW(blogs)
news.rows = NROW(news)
twitter.rows = NROW(twitter)

The three data sets are quite large. The number of lines of text in each data set is as follows: blogs 899,288; news 1,010,242; twitter 2,360,148.

Next, I’ll use the MC_tokenizer command to break the lines into words. I’m limiting it to the first 10,000 rows of each file because this process is far too memory intensive for my machine. I’ll use that sample to estimate the number of words per row and multiply by the number of rows in the file.

library(tm)
#Tokenize the first 10000 rows in each data source
blogs.words = MC_tokenizer(blogs[1:10000])
news.words = MC_tokenizer(news[1:10000])
twitter.words = MC_tokenizer(twitter[1:10000])

#Count of words in each 10,000-row sample
blogs.wordCount = NROW(blogs.words)
news.wordCount = NROW(news.words)
twitter.wordCount = NROW(twitter.words)

#Estimate of total words in the file
blogs.wordEstimate = (blogs.wordCount / 10000) * blogs.rows
news.wordEstimate = (news.wordCount / 10000) * news.rows
twitter.wordEstimate = (twitter.wordCount / 10000) * twitter.rows

Below are my estimates for the number of words per data source.

Blogs: about 43,645,055; News: about 43,415,453; Twitter: about 37,473,486
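
These totals come from the wordEstimate variables computed above, which R prints in scientific notation by default; a snippet like the one below displays them as whole numbers instead.

#Print the word count estimates as whole numbers with thousands separators
format(round(c(blogs = blogs.wordEstimate,
               news = news.wordEstimate,
               twitter = twitter.wordEstimate)),
       big.mark = ",", scientific = FALSE)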

There are over 100,000,000 words in all the files combined. This is such a large data set that it is just not possible for me to load it all into my computer. Therefore I’ll need to cut it down to a subset.

Top 10 words by data source

I’m interested in seeing the top ten most frequently used words by data set. Below are those 3 tables.

Blogs top 10 words

blogs.freq = data.frame(table(blogs.words))
blogs.freq = blogs.freq[order(-blogs.freq$Freq),]
head(blogs.freq,10)
##       blogs.words  Freq
## 1                 68559
## 30583         the 18310
## 30996          to 11507
## 1118          and 11363
## 14816           I  9871
## 2               a  9519
## 20999          of  9459
## 15120          in  6000
## 30572        that  5044
## 15878          is  4700

News top 10 words

news.freq = data.frame(table(news.words))
news.freq = news.freq[order(-news.freq$Freq),]
head(news.freq,10)
##       news.words  Freq
## 1                83597
## 28693        the 17089
## 29039         to  8968
## 1008         and  8595
## 2              a  8523
## 19684         of  7795
## 14057         in  6240
## 24608          s  4338
## 11035        for  3466
## 28686       that  3379

Twitter top 10 words

twitter.freq = data.frame(table(twitter.words))
twitter.freq = twitter.freq[order(-twitter.freq$Freq),]
head(twitter.freq,10)
##       twitter.words  Freq
## 1                   30420
## 16089           the  3492
## 7850              I  3356
## 16372            to  3181
## 2                 a  2428
## 18322           you  2175
## 562             and  1709
## 6169            for  1565
## 8035             in  1503
## 11278            of  1485

There are a number of similarities between the data sets. The words “the”, “to”, “and”, and “of” are all in the top 10 for each source. The one expected difference was that “I” appeared frequently in Twitter and blogs, which makes sense because they are typically written in the first person, while “I” never made the top 10 for news, as news articles are typically written in the third person.

The similarities are great enough between the data sources that it is reasonable to combine them into a single data source.

Combined Analysis

I was having problems performing any text mining on the full data, so I opted to cut it down to a subset. The goal is simply to identify word counts of the most common words. I imported a different number of records per data set to keep the number of words roughly equal between the sources.

#Import files
blogs.sub = readLines("final/en_US/en_US.blogs.txt",n = 10000)
news.sub = readLines("final/en_US/en_US.news.txt",n = 20000)
twitter.sub = readLines("final/en_US/en_US.twitter.txt",n=30000)

#Export Subset for text mining purposes
write.table(blogs.sub,file = "out/blogs_sub.txt")
write.table(news.sub,"out/news_sub.txt")
write.table(twitter.sub,"out/twitter_sub.txt")

rm(blogs.sub,news.sub,twitter.sub)

Get summary statistics on the data

Create a corpus for text mining. This will make it easy to compute distributions of the word counts.

path <- file.path("/Users/SRG/Desktop/Coursera/Capstone/out")
docs <- Corpus(DirSource(path))   
summary(docs)  
##                 Length Class             Mode
## blogs_sub.txt   2      PlainTextDocument list
## news_sub.txt    2      PlainTextDocument list
## twitter_sub.txt 2      PlainTextDocument list

Perform several preprocessing steps and create the term-document matrix. We only want to remove the punctuation and extra white space and then convert each file to a plain text document. Typically in text mining you would also remove the common (stop) words and stem frequent endings such as “ing”. However, in this assignment the common words are the ones we are most likely to predict, so they should be preserved. The typical steps that are skipped are shown for reference after the code below.

docs <- tm_map(docs, removePunctuation)  #strip punctuation
docs <- tm_map(docs, stripWhitespace)    #collapse extra white space
docs <- tm_map(docs, PlainTextDocument)  #convert to plain text documents
dtm <- DocumentTermMatrix(docs)          #document-term matrix of word counts
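
For reference, the cleaning steps that are deliberately skipped would look roughly like the lines below (stemDocument requires the SnowballC package). The docs.cleaned object is only an illustration and is not used anywhere else in this analysis.

#Typical text mining cleanup, NOT applied here because it would remove the
#very words this model aims to predict
docs.cleaned <- tm_map(docs, removeWords, stopwords("english"))  #drop common stop words
docs.cleaned <- tm_map(docs.cleaned, stemDocument)               #strip endings such as "ing"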

Identify the most frequent words relative to the total number of words

#Data frame of word frequencies
freq <- colSums(as.matrix(dtm))  
wf <- data.frame(word=names(freq), freq=freq)  
word.unique = NROW(wf)
word.total = sum(wf$freq)


#Top 100 frequencies
wf.100 = head(wf[order(-wf$freq),],100)  
word.total.100 = sum(wf.100$freq)
word.total.100 / word.total
## [1] 0.3422228

In this data set there are approximately 1,221,409 total words in the files and 103,658 unique words. We can see that the top 100 words (roughly 0.1% of the unique words) account for approximately one third (34%) of the total words used.

Based on the resources available for this project, we will focus the prediction model on the top 100 words. As outlined in the executive summary, this will take the form of either a single model with 100 possible outcomes, each assigned a probability, or a set of 100 binary models, one per word, with the three highest-probability words displayed as the predictions.
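
As a rough illustration of that output, the snippet below sorts a hypothetical named vector of probabilities (hard-coded here; the real values would come from the fitted model) and displays the three most likely next words.

#Hypothetical probabilities for candidate next words; in the final model
#these would be produced by the prediction model
next.word.prob <- c(the = 0.21, to = 0.14, and = 0.12, a = 0.09, of = 0.08)

#Display the three candidates with the highest probability
head(sort(next.word.prob, decreasing = TRUE), 3)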

The following bar chart clearly shows that certain words are used much more often than others.

library(ggplot2)   
p <- ggplot(subset(wf.100, freq>4000), aes(reorder(word,-freq),freq))    
p <- p + geom_bar(stat="identity")   
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))   
p 

Conclusion

In short, it appears very feasible to develop a prediction model using the existing data sets. Since around one third of the words in the documents come from just 100 (roughly 0.1%) of the unique words, we will focus our predictive model on this subset. The word list may have to be cut down further depending on computational resources, but it is a reasonable start. Further details on how I intend to develop the prediction model are in the executive summary.