The purpose of this analysis is to explore the SwiftKey data sets in order to eventually develop a model that predicts the next word in a sequence of words. Data from Twitter, news, and blogs will be used to build this model.
After analyzing the data, it appears that around one third of the words in all three sources come from a subset of just 100 (roughly 0.1%) of the unique words. Therefore, we will develop a prediction model that focuses on predicting these top 100 words. This is mainly because we are limited computationally, and it is a reasonable goal for the class.
After analyzing this data, below is my plan for developing a prediction model:
The data set will have the outcome word, and the predictors will be the word directly before it and a bucket of up to three preceding words.
I’ll limit the count of features that could possibly be included to something that is reasonable.
I’ll limit the outcomes to the most common words so that the training data set is reasonably sized and the model actually runs. The final training data set will contain the outcome word and the predictor words described above; a sketch of its layout follows this list.
I’ll either develop this as 100 models with each word as a binary outcome, or as a single model that allows multiple possible outcomes. Every time it is run, the three outcomes with the highest probabilities will be displayed.
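To make the planned layout concrete, below is a minimal sketch of a single training row built from one example sentence. The column names (outcome, prev1, bucket1 through bucket3) are placeholders I am using for illustration only, not final variable names.

#Hypothetical layout of one training row: the outcome word, the word
#directly before it, and a bucket of up to three earlier words
sentence = c("i", "am", "going", "to", "the", "store")
train.row = data.frame(outcome = sentence[6],   #word to predict: "store"
                       prev1   = sentence[5],   #word directly before: "the"
                       bucket1 = sentence[4],   #up to three earlier words
                       bucket2 = sentence[3],
                       bucket3 = sentence[2],
                       stringsAsFactors = FALSE)
train.row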
Set up libraries and set the working directory
#Set working directory and load libraries
library(tm)
## Loading required package: NLP
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
setwd("/Users/SRG/Desktop/Coursera/Capstone")
Import the three files
#Import blogs, news, and twitter
blogs = readLines("final/en_US/en_US.blogs.txt",n = -1)
news = readLines("final/en_US/en_US.news.txt",n = -1)
twitter = readLines("final/en_US/en_US.twitter.txt",n=-1)
## Warning in readLines("final/en_US/en_US.twitter.txt", n = -1): line 167155
## appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", n = -1): line 268547
## appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", n = -1): line 1274086
## appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", n = -1): line 1759032
## appears to contain an embedded nul
Get line counts for each data source
blogs.rows = NROW(blogs)
news.rows = NROW(news)
twitter.rows = NROW(twitter)
The three data sets are quite large. The number of lines of text in each data set is as follows: blogs: 899,288; news: 1,010,242; Twitter: 2,360,148.
Next, I’ll use the MC_tokenizer command to break the lines into words. I’m limiting it to the first 10,000 rows of each file because this process is far too memory intensive for my machine. I’ll use that sample to estimate the number of words per row and multiply by the number of rows in the file.
library(tm)
#Tokenize the first 10000 rows in each data source
blogs.words = MC_tokenizer(blogs[1:10000])
news.words = MC_tokenizer(news[1:10000])
twitter.words = MC_tokenizer(twitter[1:10000])
#Count of words in each sample
blogs.wordCount = NROW(blogs.words)
news.wordCount = NROW(news.words)
twitter.wordCount = NROW(twitter.words)
#Estimate of total words in the file
blogs.wordEstimate = (blogs.wordCount / 10000) * blogs.rows
news.wordEstimate = (news.wordCount / 10000) * news.rows
twitter.wordEstimate = (twitter.wordCount / 10000) * twitter.rows
Below are my estimates for the number of words per data source.
Blogs: approximately 43,645,055; News: approximately 43,415,453; Twitter: approximately 37,473,486.
There are over 100,000,000 words across the files combined. This is such a large data set that it is simply not possible for me to load it all into memory on my machine, so I’ll need to cut it down to a subset.
I’m interested in seeing the ten most frequently used words in each data set. Below are those three tables.
Blogs top 10 words
blogs.freq = data.frame(table(blogs.words))
blogs.freq = blogs.freq[order(-blogs.freq$Freq),]
head(blogs.freq,10)
## blogs.words Freq
## 1 68559
## 30583 the 18310
## 30996 to 11507
## 1118 and 11363
## 14816 I 9871
## 2 a 9519
## 20999 of 9459
## 15120 in 6000
## 30572 that 5044
## 15878 is 4700
News top 10 words
news.freq = data.frame(table(news.words))
news.freq = news.freq[order(-news.freq$Freq),]
head(news.freq,10)
## news.words Freq
## 1 83597
## 28693 the 17089
## 29039 to 8968
## 1008 and 8595
## 2 a 8523
## 19684 of 7795
## 14057 in 6240
## 24608 s 4338
## 11035 for 3466
## 28686 that 3379
Twitter top 10 words
twitter.freq = data.frame(table(twitter.words))
twitter.freq = twitter.freq[order(-twitter.freq$Freq),]
head(twitter.freq,10)
## twitter.words Freq
## 1 30420
## 16089 the 3492
## 7850 I 3356
## 16372 to 3181
## 2 a 2428
## 18322 you 2175
## 562 and 1709
## 6169 for 1565
## 8035 in 1503
## 11278 of 1485
There are a number of similarities between the data sets. The words “the”, “to”, “and”, and “of” are all in the top 10 for each source. One expected difference is that “I” appears frequently in the Twitter and blogs data, which makes sense because those are typically written in the first person, while “I” never made the top 10 in the news data, as news articles are typically written in the third person.
The similarities are great enough between the data sources that it is reasonable to combine them into a single data source.
I was having problems performing any text mining on the full data, so I opted to cut the data down to a subset. The goal is simply to identify word counts for the most common words. I imported a different number of records per data set to try to keep the number of words roughly equal across the data sets.
#Import files
blogs.sub = readLines("final/en_US/en_US.blogs.txt",n = 10000)
news.sub = readLines("final/en_US/en_US.news.txt",n = 20000)
twitter.sub = readLines("final/en_US/en_US.twitter.txt",n=30000)
#Export Subset for text mining purposes
write.table(blogs.sub,file = "out/blogs_sub.txt")
write.table(news.sub,"out/news_sub.txt")
write.table(twitter.sub,"out/twitter_sub.txt")
rm(blogs.sub,news.sub,twitter.sub)
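As a note for future iterations, taking the first N lines of each file biases the subset toward the start of each source; a more representative subset could be drawn by random sampling. Below is a minimal sketch of that alternative, assuming the full blogs, news, and twitter vectors read in earlier are still in memory; the sample sizes mirror those used above and are illustrative only.

#Alternative subsetting: randomly sample lines instead of taking the first N
set.seed(1234)
blogs.sub = sample(blogs, 10000)
news.sub = sample(news, 20000)
twitter.sub = sample(twitter, 30000)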
Create a corpus for text mining. This will make it easy to create distributions of the word counts.
path <- file.path("/Users/SRG/Desktop/Coursera/Capstone/out")
docs <- Corpus(DirSource(path))
summary(docs)
## Length Class Mode
## blogs_sub.txt 2 PlainTextDocument list
## news_sub.txt 2 PlainTextDocument list
## twitter_sub.txt 2 PlainTextDocument list
Perform several preprocessing steps and create a document term matrix. We only want to remove punctuation and extra white space and then convert everything to a plain text document. Typically in text mining you would also remove the common stop words and stem frequent endings such as “ing”; however, in this assignment the common words are exactly the ones we want to predict, so they should be preserved.
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
Identify the most frequent words and compare their counts to the total number of words.
#Data frame of word frequencies
freq <- colSums(as.matrix(dtm))
wf <- data.frame(word=names(freq), freq=freq)
word.unique = NROW(wf)
word.total = sum(wf$freq)
#Top 100 frequencies
wf.100 = head(wf[order(-wf$freq),],100)
word.total.100 = sum(wf.100$freq)
word.total.100 / word.total
## [1] 0.3422228
In this data set there are approximately 1,221,409 total words in the files and 103,658 unique words. We can see that the top 100 words (roughly 0.1% of the unique words) account for approximately one third (34%) of the total words used.
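As a sanity check on that coverage figure, the cumulative share of all word occurrences captured by the top N words can be computed directly from the frequency table. Below is a minimal sketch, reusing the wf data frame created above.

#Cumulative coverage: fraction of all word occurrences accounted for by
#the N most frequent words, for a few values of N
wf.sorted = wf[order(-wf$freq),]
coverage = cumsum(wf.sorted$freq) / sum(wf.sorted$freq)
round(coverage[c(10, 50, 100, 500, 1000)], 3)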
Based on the resources available for this project, we will focus the prediction model on the top 100 words. This will either take the form of a single predictive model with 100 possible outcomes and a probability assigned to each outcome, or 100 separate models with each word as a binary outcome, as described in the plan above.
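Whichever form the model takes, the final step will be the same: rank the candidate words by predicted probability and display the top three. Below is a minimal sketch of that step, using a made-up probability vector purely for illustration.

#Hypothetical predicted probabilities for a few candidate words; in the
#real model these would come from the fitted model's output
pred.prob = c(the = 0.21, to = 0.15, and = 0.12, of = 0.09, a = 0.08)
#Display the three candidates with the highest predicted probability
head(sort(pred.prob, decreasing = TRUE), 3)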
The following bar chart clearly shows that certain words are used far more often than others.
library(ggplot2)
p <- ggplot(subset(wf.100, freq>4000), aes(reorder(word,-freq),freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
In short, it appears very feasible to develop a prediction model using the existing data sets. Since around one third of the words in the documents come from just 100 (roughly 0.1%) of the unique words, we will focus our predictive model on this subset. This may have to be cut down further at some point depending on the computational resources, but it is a reasonable start. Further details on how I intend to develop the prediction model are in the executive summary.