This report summarizes the exploratory analysis for the data science capstone project, whose goal is to build a predictive text model. The model should predict the next word from the words the user has already typed.
We explore the data to understand the distribution of words and phrases in the text and the relationships between them.
The purpose of this report broadly is:
The data for the project was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Unzipping the archive produced one directory per language. The en_US directory holds the English-language documents used for this project. This directory contained the following files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
These files were used for the exploratory analysis.
con <- file("en_US.blogs.txt", open = "r")
rawblogs <- readLines(con, skipNul = TRUE)
close(con)
con <- file("en_US.news.txt", open = "rb")
rawnews <- readLines(con, skipNul = TRUE)
close(con)
con <- file("en_US.twitter.txt", open = "r")
rawtwitter <- readLines(con, skipNul = TRUE)
close(con)
The in-memory size of each text object, its number of lines, and the length of its longest line are summarized next.
numlines<-c(length(rawblogs),length(rawnews), length(rawtwitter))
memsize<-c(format(object.size(rawblogs),unit="MB"), format(object.size(rawnews),unit="MB"), format(object.size(rawtwitter), unit="MB"))
maxline<-c(max(nchar(rawblogs)),max(nchar(rawnews)),max(nchar(rawtwitter)))
DF<-data.frame(Item=c("Blogs","News","Twitter"),NumLines=numlines,MemorySize=memsize,MaxLengthofLine=maxline)
DF
##      Item NumLines MemorySize MaxLengthofLine
## 1   Blogs   899288   248.5 Mb           40835
## 2    News  1010242   249.6 Mb           11384
## 3 Twitter  2360148   301.4 Mb             213
The exploratory analysis does not require all of the data. Because the data set is very large, we sample only part of it to understand its distribution and characteristics. Here we sample 5% of the lines from each raw file for further analysis.
set.seed(1234)
sample_blogs<-rawblogs[sample(length(rawblogs),replace=F,size=0.05*length(rawblogs))]
set.seed(54)
sample_news<-rawnews[sample(length(rawnews),replace=F,size=0.05*length(rawnews))]
set.seed(511)
sample_twit<-rawtwitter[sample(length(rawtwitter),replace=F,size=0.05*length(rawtwitter))]
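The three sampling calls above follow the same pattern, so they could be wrapped in one helper. The following is only a sketch: `sample_lines` and the toy input are hypothetical names that do not appear in the original analysis, and the fraction is rounded down to a whole number of lines explicitly rather than leaving that to `sample()`.

```r
# Hypothetical helper (not part of the original analysis): draw a fixed
# fraction of lines in a reproducible way.
sample_lines <- function(lines, fraction = 0.05, seed = 1234) {
  set.seed(seed)                              # make the draw repeatable
  n_keep <- floor(fraction * length(lines))   # explicit whole number of lines
  lines[sample(length(lines), size = n_keep, replace = FALSE)]
}

toy <- paste("line", 1:200)                   # toy stand-in for rawblogs
picked <- sample_lines(toy, fraction = 0.05, seed = 1234)
length(picked)                                # 5% of 200 lines
```

Calling the helper again with the same seed returns the same lines, which keeps the downstream counts reproducible.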
With the sampled data we use the R package “tm”, which is designed for text mining. This package provides functions to clean up text data, such as changing case, removing punctuation and removing English stop words. After cleaning, we create a Corpus object from the data so that the tm functions can be applied easily for word analysis.
library(tm) # provides removeNumbers(), removePunctuation(), removeWords(), stopwords(), stripWhitespace()
cleanText<-function(x) {
  x <- iconv(x, from = "UTF-8", to = "latin1", sub = "") #drop characters that cannot be converted to latin1
  x <- tolower(x) #convert to lowercase
  x <- removeNumbers(x) # remove numbers
  x <- removePunctuation(x) #remove punctuation (this also strips apostrophes)
  x <- removeWords(x, stopwords("english")) #remove stopwords
  x <- gsub("[^[:alnum:][:space:]']", "", x) #remove any remaining special characters
  #x <- stemDocument(x) # remove common word endings
  x <- stripWhitespace(x) #collapse repeated whitespace
  x
}
clean_blog<-cleanText(sample_blogs)
clean_news<-cleanText(sample_news)
clean_twit<-cleanText(sample_twit)
text.list<-list(blog=clean_blog,news=clean_news,twitter=clean_twit)
#Create Corpus Class
text.corpus <- tm::Corpus(VectorSource(text.list))
rm(text.list)
rm(rawblogs)
rm(rawnews)
rm(rawtwitter)
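The order of the cleaning steps affects which tokens survive. The trigram tables later in this report contain tokens such as "dont" and "im", which is consistent with punctuation being removed before stop words: once the apostrophe is gone, a contraction no longer matches a stop-word entry such as "don't". The sketch below illustrates the effect using base R only. The helper `remove_stop` and the two-word stop list are hypothetical stand-ins, not the tm functions or the tm stop-word list.

```r
# Hypothetical base-R stand-in for stop-word removal (not the tm function).
remove_stop <- function(x, stops) {
  vapply(strsplit(x, "[[:space:]]+"), function(tokens) {
    paste(tokens[nzchar(tokens) & !(tokens %in% stops)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

toy_stops <- c("don't", "me")                 # assumed toy stop-word list
phrase <- tolower("Don't get me wrong")

# Punctuation first, then stop words: "dont" no longer matches "don't".
punct_first <- remove_stop(gsub("[[:punct:]]", "", phrase), toy_stops)

# Stop words first, then punctuation: the contraction is removed.
stops_first <- gsub("[[:punct:]]", "", remove_stop(phrase, toy_stops))

punct_first   # "dont get wrong"
stops_first   # "get wrong"
```

For a prediction model that keeps contractions intact, the apostrophe handling would need to change regardless of the order chosen here.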
First we create a TermDocumentMatrix for each of the data sources. This will allow us to analyze the data easily. The most frequent words are then extracted from the tdm data.
tdm_b <- tm::TermDocumentMatrix(text.corpus["blog"], control = list(wordLengths = c(3,Inf)))
tdm_n <- tm::TermDocumentMatrix(text.corpus["news"], control = list(wordLengths = c(3,Inf)))
tdm_t <- tm::TermDocumentMatrix(text.corpus["twitter"], control = list(wordLengths = c(3,Inf)))
## Get word counts using TDM and create a dataframe for plotting easily
get_wordcount<-function(x) { #creates a DF and also arranges in decreasing order
  df<-data.frame(word=x$dimnames$Terms, frequency=x$v)
  df<-plyr::arrange(df, -frequency)
  df
}
wc.blog<-get_wordcount(tdm_b)
wc.news<-get_wordcount(tdm_n)
wc.twit<-get_wordcount(tdm_t)
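To make the counting step concrete, here is a minimal base-R sketch of what the term-document matrix and `get_wordcount` compute for a single source. It is an illustration only, not the tm implementation: `count_words` and `min_chars` are hypothetical names, with `min_chars = 3` chosen to mirror `wordLengths = c(3,Inf)`.

```r
# Hypothetical base-R word counter (illustration only, not the tm code path).
count_words <- function(lines, min_chars = 3) {
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))  # split on whitespace
  tokens <- tokens[nchar(tokens) >= min_chars]       # drop very short tokens
  counts <- sort(table(tokens), decreasing = TRUE)   # most frequent first
  data.frame(word = names(counts),
             frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_lines <- c("new york city", "new york times", "a new day")
wc.toy <- count_words(toy_lines)
head(wc.toy, 3)
```

In this toy input "new" appears three times and "york" twice, while "a" is dropped by the length filter.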
Next we plot the 20 most frequently occurring words in each data source.
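library(ggplot2) # provides ggplot(), aes(), geom_bar(), coord_flip(), labs()
plot_top_n<-function(x,n,tx) {
  subn<-x[1:n,] # x is already sorted by decreasing frequency
  subn$word<-reorder(subn$word,subn$frequency) # order the bars by frequency
  ggplot(subn, aes(x = word, y = frequency)) + geom_bar(stat = "identity", col="red") + coord_flip() + labs(title= tx)
}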
plot_top_n(wc.blog,20,"Top 20 Words: Blog")
plot_top_n(wc.news,20,"Top 20 Words: News")
plot_top_n(wc.twit,20,"Top 20 Words: Twitter")
We next look at multi-word phrases, since understanding them is necessary for building the predictive model. Multi-word phrases are called n-grams: a 2-word phrase is a bigram, a 3-word phrase is a trigram, and so on. We use the RWeka package to extract them.
library(RWeka) # provides NGramTokenizer() and Weka_control()
BiTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram.blog <- tm::TermDocumentMatrix(text.corpus["blog"], control = list(tokenize = TriTokenizer))
trigram.news <- tm::TermDocumentMatrix(text.corpus["news"], control = list(tokenize = TriTokenizer))
wc.tri.blog<-get_wordcount(trigram.blog)
wc.tri.news<-get_wordcount(trigram.news)
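For readers without RWeka, the idea behind an n-gram tokenizer can be sketched in base R: slide a window of n tokens over the text and paste each window back together. This is an illustrative sketch under simple assumptions (whitespace tokenization of a single string), not the RWeka implementation, and `make_ngrams` is a hypothetical name.

```r
# Hypothetical base-R n-gram generator (illustration only, not RWeka).
make_ngrams <- function(text, n) {
  tokens <- strsplit(text, "[[:space:]]+")[[1]]   # split one string into words
  tokens <- tokens[nzchar(tokens)]                # drop empty tokens
  if (length(tokens) < n) return(character(0))    # too short for any n-gram
  starts <- seq_len(length(tokens) - n + 1)       # start index of each window
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("new york city is big", 3)
# "new york city" "york city is" "city is big"
```

A text with fewer than n words yields an empty result, so short tweets contribute no trigrams.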
Here we list the top 10 trigrams found in the blog and news datasets.
wc.tri.blog[1:10,]
##                   word frequency
## 1       im pretty sure        33
## 2        new york city        30
## 3       dont get wrong        28
## 4       new york times        27
## 5        lets just say        24
## 6        cant wait see        23
## 7        long time ago        22
## 8          new york ny        22
## 9     couple weeks ago        21
## 10 incorporated item c        18
wc.tri.news[1:10,]
##                      word frequency
## 1  president barack obama        80
## 2           new york city        73
## 3         st louis county        60
## 4           two years ago        51
## 5      gov chris christie        50
## 6        first time since        37
## 7            world war ii        36
## 8          four years ago        32
## 9         three years ago        29
## 10        cents per share        28
Analyzing the datasets, we find some interesting facts.
In this project we need to predict the next word after a user types a few words. Once the user types a phrase, the app should suggest the most relevant words based on the prediction model. The prediction model will be based on the n-grams generated from the data. A dictionary of n-grams (at least 2-, 3- and 4-grams) will be generated from the data. It will also be important to keep the stop words that were excluded in this analysis, and to preserve apostrophes and abbreviations, to get better predictions.
One approach to prediction is to split each n-gram into an (n-1)-gram and its nth word. A database of all the words that follow each (n-1)-gram can then be created. Only the most frequently used combinations need to be stored.
As the user types a phrase and we find a match in the (n-1)-gram database, the app can suggest the most probable next word. It can provide 3 or 4 suggestions for the user to select from.
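The lookup idea can be sketched in a few lines of base R. This is only one possible shape for the final model, not a committed design: `build_lookup` and `suggest_next` are hypothetical names, and the toy input reuses four blog trigram counts from the table above.

```r
# Hypothetical sketch: split each n-gram into its (n-1)-gram prefix and
# its last word, then rank candidate next words by count.
build_lookup <- function(ngrams, counts) {
  tokens <- strsplit(ngrams, " ", fixed = TRUE)
  prefix <- vapply(tokens, function(tok) paste(head(tok, -1), collapse = " "),
                   character(1))
  nextword <- vapply(tokens, function(tok) tail(tok, 1), character(1))
  data.frame(prefix = prefix, nextword = nextword, count = counts,
             stringsAsFactors = FALSE)
}

suggest_next <- function(lookup, phrase, k = 3) {
  hits <- lookup[lookup$prefix == phrase, ]      # all n-grams with this prefix
  head(hits$nextword[order(-hits$count)], k)     # k most frequent next words
}

# Toy data: four blog trigram counts from this report.
lookup <- build_lookup(
  c("new york city", "new york times", "new york ny", "long time ago"),
  c(30, 27, 22, 22)
)

suggest_next(lookup, "new york", k = 2)
# "city" "times"
```

An unseen prefix returns an empty result, so a full model would need a fallback, for example backing off to a shorter n-gram.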