The Coursera Data Science Capstone Project is a collaboration with the predictive text company SwiftKey. The goal is to develop and test a text prediction algorithm. One of the primary challenges will be finding the balance between computational performance and predictive accuracy. The purpose of this report is to review the initial exploratory data analysis results, looking at the major features of the HC Corpora reference data and noting any unexpected features that may inform the design of the algorithm. From there we will clean and tokenize the data and review those results. Finally, we will discuss next steps for designing the predictive algorithm.
# Set the random seed for reproducibility and temporarily suppress warnings
set.seed(88888)
options(warn=-1)
library(RWeka)
library(NLP)
library(RColorBrewer)
library(tm)
library(stringi)
library(ggplot2)
library(wordcloud)
setwd("C:/Users/J/DataScienceCapstone/en_US")
con <- file("en_US.twitter.txt")
twit <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
con <- file("en_US.news.txt")
news <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
con <- file("en_US.blogs.txt")
blog <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
Before going any further, let’s look at the high-level characteristics of these three data sets: the number of lines in each source, the average number of words per line, and the overall file size. This may help us decide whether to focus on a specific file and how large a sample we need.
frame <- data.frame( File = c("Twitter", "News", "Blogs"),
c(length(twit), length(news), length(blog)),
c(summary(stri_count_words(twit))[4],
summary(stri_count_words(news))[4],
summary(stri_count_words(blog))[4]),
c(file.info("en_US.twitter.txt")$size,
file.info("en_US.news.txt")$size,
file.info("en_US.blogs.txt")$size))
colnames(frame) <- c("File", "Number of Lines", "Average Words per Line", "File Size (bytes)")
frame
## File Number of Lines Average Words per Line File Size (bytes)
## 1 Twitter 2360148 12.75 167105338
## 2 News 77259 34.62 205811889
## 3 Blogs 899288 41.75 210160014
qplot(stri_count_words(twit), geom="histogram", main="Twitter Words per Line", xlab="Words", ylab="Count", binwidth=2, fill=I("blue"), xlim=c(0,200))
qplot(stri_count_words(news), geom="histogram", main="News Words per Line", xlab="Words", ylab="Count", binwidth=2, fill=I("red"), xlim=c(0,200))
qplot(stri_count_words(blog), geom="histogram", main="Blogs Words per Line", xlab="Words", ylab="Count", binwidth=2, fill=I("green"), xlim=c(0,200))
As might be expected, Twitter has far more lines than the other sources, but far fewer words per line.
The initial text processing that follows was not particularly sensitive to sample size; however, tokenizing large samples quickly ran into memory issues on my computer with 6GB of RAM, so we draw 5,000 lines from each source.
# Draw 5,000 random lines from each source
twitsamp <- sample(twit, 5000)
newssamp <- sample(news, 5000)
blogsamp <- sample(blog, 5000)
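Given the memory constraints noted above, one optional housekeeping step (not part of the original analysis) is to release the full datasets once the samples are drawn:
# Optional: free the full datasets now that only the samples are needed downstream
rm(twit, news, blog)
gc()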
Next we will clean up the data. This includes converting to lowercase, filtering profanity against a publicly available word list, and removing non-standard characters, numbers, punctuation, and extra whitespace. The cleaned result can then be saved to disk for reuse.
# Merge the three samples and drop non-ASCII characters
mergesamp <- c(twitsamp, newssamp, blogsamp)
mergesamp <- iconv(mergesamp, "latin1", "ASCII", sub="")
# Build a single-document corpus and apply the cleaning transformations
corpus <- VCorpus(VectorSource(list(mergesamp)))
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove profanity using a publicly available word list
download.file("http://www.bannedwordlist.com/lists/swearWords.txt", "swearwords.txt")
swearwords <- readLines("swearwords.txt")
corpus <- tm_map(corpus, removeWords, swearwords)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
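Since the cleaning steps take some time to re-run, the cleaned corpus can be written to disk as mentioned above; a minimal sketch (the file name cleanCorpus.rds is just an illustrative choice):
# Save the cleaned corpus so the cleaning steps do not need to be repeated (illustrative file name)
saveRDS(corpus, "cleanCorpus.rds")
# A later session can restore it with: corpus <- readRDS("cleanCorpus.rds")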
We now take our clean corpus and tokenize it into three classes of n-gram: 1-grams, 2-grams, and 3-grams:
# Tokenize the cleaned text into 1-, 2-, and 3-grams
# (extract the plain character vector from the single-document corpus first)
cleanText <- corpus[[1]]$content
OneToken <- NGramTokenizer(cleanText, Weka_control(min = 1, max = 1, delimiters = "\"\'\\t\\r\\n ().,;!?"))
TwoToken <- NGramTokenizer(cleanText, Weka_control(min = 2, max = 2, delimiters = "\"\'\\t\\r\\n ().,;!?"))
ThreeToken <- NGramTokenizer(cleanText, Weka_control(min = 3, max = 3, delimiters = "\"\'\\t\\r\\n ().,;!?"))
#Prepare Top-10 Frequency Plots
OneTokenDF <- data.frame(table(OneToken))
OneTokenDF <- OneTokenDF[order(OneTokenDF$Freq,decreasing = TRUE),]
Plot1 <- OneTokenDF[1:10,]
TwoTokenDF <- data.frame(table(TwoToken))
TwoTokenDF <- TwoTokenDF[order(TwoTokenDF$Freq,decreasing = TRUE),]
Plot2 <- TwoTokenDF[1:10,]
ThreeTokenDF <- data.frame(table(ThreeToken))
ThreeTokenDF <- ThreeTokenDF[order(ThreeTokenDF$Freq,decreasing = TRUE),]
Plot3 <- ThreeTokenDF[1:10,]
colnames(Plot1) <- c("Word", "Freq")
colnames(Plot2) <- c("Phrase", "Freq")
colnames(Plot3) <- c("Phrase", "Freq")
ggplot(Plot1, aes(x=Word, y=Freq)) + geom_bar(stat="identity", fill="darkblue") + ggtitle("10 Most Frequent Words")
ggplot(Plot2, aes(x=Phrase, y=Freq)) + geom_bar(stat="identity", fill="darkred") + ggtitle("10 Most Frequent 2 Word Phrases")
ggplot(Plot3, aes(x=Phrase, y=Freq)) + geom_bar(stat="identity", fill="darkgreen") + ggtitle("10 Most Frequent 3 Word Phrases")
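The wordcloud and RColorBrewer packages loaded earlier can also give a quick visual summary of the unigram frequencies; a minimal sketch using the unigram table built above:
# Word cloud of the most common unigrams (display capped at 100 words)
wordcloud(words = as.character(OneTokenDF$OneToken), freq = OneTokenDF$Freq,
          max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))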
options(warn=0)
The next phase will be to review the current set of filters and tokenizer settings to see whether the dataset can be further improved. Following that, it will be a matter of building the predictive model, which will likely rely on some combination of maximum likelihood estimation, smoothing, and perhaps a combined estimator approach such as a back-off model. The design choices will be driven by computational performance: trading accuracy against speed so that the algorithm can return good results from within a Shiny application.
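As a rough illustration of the back-off idea, the bigram and trigram tables built above can already drive a naive next-word lookup. The sketch below is illustrative only: the helper name predictNext is our own, it simply returns the completion of the most frequent matching n-gram, and it applies no smoothing or probability estimation.
# Naive back-off lookup: try the trigram table first, then fall back to the bigram table
# (illustrative helper, not the final model; assumes ThreeTokenDF and TwoTokenDF from above)
predictNext <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    # Match trigrams whose first two words equal the last two words of the input
    hits <- ThreeTokenDF[startsWith(as.character(ThreeTokenDF$ThreeToken),
                                    paste(words[n - 1], words[n], "")), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$ThreeToken[which.max(hits$Freq)])
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # Back off to bigrams keyed on the last word only
  hits <- TwoTokenDF[startsWith(as.character(TwoTokenDF$TwoToken),
                                paste(words[n], "")), ]
  if (nrow(hits) > 0) {
    best <- as.character(hits$TwoToken[which.max(hits$Freq)])
    return(tail(strsplit(best, " ")[[1]], 1))
  }
  NA_character_  # no matching n-gram found
}
# Example call
predictNext("thanks for the")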