Summary

This report summarizes the exploratory analysis done for the data science capstone project, whose goal is to build a predictive text model. The purpose of the model is to correctly predict the next word based on the words the user has typed.

We perform a thorough exploratory analysis of the data to understand the distribution of words and phrases in the text and the relationships between them.

The broad purposes of this report are to:

  1. Demonstrate that the data has been downloaded and successfully loaded.
  2. Perform basic summary statistics on the data.
  3. Report any interesting findings.
  4. Obtain feedback on prediction models.

Exploratory Data Analysis

The data for the project was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Unzipping the file produced multiple language directories. The en_US directory was used to access the English-language documents for this project. This directory contained three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
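
As a quick check that the download and extraction succeeded (a minimal sketch, assuming the working directory holds the extracted en_US files):

list.files(pattern = "^en_US")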

These files were read in for the exploratory analysis:

con <- file("en_US.blogs.txt", open = "r")
rawblogs <- readLines(con, skipNul = TRUE)
close(con)

con <- file("en_US.news.txt", open = "rb")
rawnews <- readLines(con, skipNul = TRUE)
close(con)

con <- file("en_US.twitter.txt", open = "r")
rawtwitter <- readLines(con, skipNul = TRUE)
close(con)

The size of the read-in text objects in memory, the number of lines in each object, and the length of the longest line are summarized below.

numlines<-c(length(rawblogs),length(rawnews), length(rawtwitter))
memsize<-c(format(object.size(rawblogs),unit="MB"), format(object.size(rawnews),unit="MB"), format(object.size(rawtwitter), unit="MB"))
maxline<-c(max(nchar(rawblogs)),max(nchar(rawnews)),max(nchar(rawtwitter)))
DF<-data.frame(Item=c("Blogs","News","Twitter"),NumLines=numlines,MemorySize=memsize,MaxLengthofLine=maxline)
DF
##      Item NumLines MemorySize MaxLengthofLine
## 1   Blogs   899288   248.5 Mb           40835
## 2    News  1010242   249.6 Mb           11384
## 3 Twitter  2360148   301.4 Mb             213

Sampling for exploratory analysis

Performing the exploratory analysis does not require all of the data. Because the data set is very large, we sample only a portion of it to understand its distribution and characteristics. Here we sample 5% of the raw data for further analysis.

set.seed(1234)
sample_blogs <- rawblogs[sample(length(rawblogs), size = 0.05 * length(rawblogs), replace = FALSE)]
set.seed(54)
sample_news <- rawnews[sample(length(rawnews), size = 0.05 * length(rawnews), replace = FALSE)]
set.seed(511)
sample_twit <- rawtwitter[sample(length(rawtwitter), size = 0.05 * length(rawtwitter), replace = FALSE)]

Cleaning the data

Now, with the sampled data, we use the R package “tm”, which is designed for text mining. This package provides functions to clean up text data, such as changing case, removing punctuation, and removing English stop words. After cleaning, we create a Corpus object with the data so that the tm package functions can be applied easily for word analysis.

cleanText<-function(x) {
        x <- iconv(x, from = "UTF-8", to = "latin1", sub = "")
        x <- tolower(x) #convert to lowercase
        x <- removeNumbers(x) # remove numbers
        x <- removePunctuation(x) #remove punctuation
        x <- removeWords(x, stopwords("english")) #remove stopwords
        x <- gsub("[^[:alnum:][:space:]']", "", x) #remove any special char except '
        #x <- stemDocument(x) # remove common word endings
        x <- stripWhitespace(x) #strip whitespace
        x
}
clean_blog<-cleanText(sample_blogs)
clean_news<-cleanText(sample_news)
clean_twit<-cleanText(sample_twit)

text.list<-list(blog=clean_blog,news=clean_news,twitter=clean_twit)

# Create Corpus object
text.corpus <- tm::Corpus(VectorSource(text.list))
rm(text.list, rawblogs, rawnews, rawtwitter) # free memory: the raw objects are no longer needed

First we create a TermDocumentMatrix (TDM) for each of the data sources; this allows us to analyze the data easily. The most frequent words are then extracted from each TDM.

tdm_b <- tm::TermDocumentMatrix(text.corpus["blog"], control = list(wordLengths = c(3,Inf)))
tdm_n <- tm::TermDocumentMatrix(text.corpus["news"], control = list(wordLengths = c(3,Inf)))
tdm_t <- tm::TermDocumentMatrix(text.corpus["twitter"], control = list(wordLengths = c(3,Inf)))

## Get word counts from a TDM and create a data frame for easy plotting
get_wordcount <- function(x) { # creates a DF arranged in decreasing order of frequency
  # note: this relies on each TDM holding a single document, so x$v lines up with the terms
  df <- data.frame(word = x$dimnames$Terms, frequency = x$v)
  df <- plyr::arrange(df, -frequency)
  df
}
wc.blog<-get_wordcount(tdm_b)
wc.news<-get_wordcount(tdm_n)
wc.twit<-get_wordcount(tdm_t)

Next we plot the 20 most frequently occurring words in each data source.

plot_top_n<-function(x,n,tx) {
  subn<-x[1:n,]
  subn$word<-reorder(subn$word,subn$frequency)
  ggplot(subn, aes(x = word, y = frequency)) + geom_bar(stat = "identity", col="red") + coord_flip() + labs(title= tx)
}
plot_top_n(wc.blog,20,"Top 20 Words: Blog")

plot_top_n(wc.news,20,"Top 20 Words: News")

plot_top_n(wc.twit,20,"Top 20 Words: Twitter")

We also look at multi-word phrases, since understanding them is necessary for building the predictive model. Multi-word phrases are called n-grams: a 2-word phrase is a bigram, a 3-word phrase a trigram, and so on. We use the RWeka package to accomplish this task.

BiTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram.blog <- tm::TermDocumentMatrix(text.corpus["blog"], control = list(tokenize = TriTokenizer))
trigram.news <- tm::TermDocumentMatrix(text.corpus["news"], control = list(tokenize = TriTokenizer))
wc.tri.blog<-get_wordcount(trigram.blog)
wc.tri.news<-get_wordcount(trigram.news)

Here we list the top 10 trigrams found in the blog and news datasets:

wc.tri.blog[1:10,]
##                   word frequency
## 1       im pretty sure        33
## 2        new york city        30
## 3       dont get wrong        28
## 4       new york times        27
## 5        lets just say        24
## 6        cant wait see        23
## 7        long time ago        22
## 8          new york ny        22
## 9     couple weeks ago        21
## 10 incorporated item c        18
wc.tri.news[1:10,]
##                      word frequency
## 1  president barack obama        80
## 2           new york city        73
## 3         st louis county        60
## 4           two years ago        51
## 5      gov chris christie        50
## 6        first time since        37
## 7            world war ii        36
## 8          four years ago        32
## 9         three years ago        29
## 10        cents per share        28

Findings

Analyzing the datasets, we find some interesting facts:

  • The News dataset has a lot of proper nouns (names of persons and places).
  • The Twitter dataset has a lot of informal and abbreviated words, such as lol and rt.
  • Quite a few words are common to the most-frequent word lists of all three sources (checked below).
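
As a quick illustration of the last point, the overlap can be computed directly from the word-count data frames built earlier (a minimal sketch; top20 is an illustrative name):

top20 <- lapply(list(wc.blog, wc.news, wc.twit),
                function(d) as.character(d$word[1:20]))
Reduce(intersect, top20) # words appearing in the top 20 of every source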

Prediction Model Planning

In this project we need to predict the next word after a user types a few words. As the user types a sentence, the app should suggest the most relevant words based on the prediction model. The prediction model will be based on n-grams generated from the data: a dictionary of n-grams (at least n = 2, 3, 4) will be built. It will also be important to include the stop words that were excluded in this analysis, and to handle apostrophes and abbreviations, to get better predictions.

One approach is to split each n-gram into an (n-1)-gram prefix and its final word, treated as a unigram. A database of all possible words that follow each (n-1)-gram can then be created; only the most frequently used combinations need to be stored.
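
A minimal sketch of this idea, applied to the blog trigram counts computed above (the names tri, prefix, nextword, and lookup are illustrative assumptions, not part of the analysis; a real model would combine all sources and cover the 2- and 4-gram tables as well):

library(plyr)

# split each trigram into a 2-word prefix and its final word
tri <- wc.tri.blog
tri$word <- as.character(tri$word)
tri$prefix <- sub(" [^ ]+$", "", tri$word)           # first n-1 words
tri$nextword <- sub("^.* ([^ ]+)$", "\\1", tri$word) # nth word
# for each prefix, store only the 3 most frequent next words
lookup <- ddply(tri, "prefix", function(d)
  head(d[order(-d$frequency), c("nextword", "frequency")], 3))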

When the user types a phrase that matches an entry in the (n-1)-gram database, the app can suggest the most probable next word, providing 3 or 4 suggestions for the user to select from.
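
On top of such a table, the lookup could work as in the hypothetical sketch below (predict_next is an illustrative name; a real version would first apply the same cleaning used for the corpus to the typed phrase):

predict_next <- function(phrase, tbl, n = 3) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  key <- paste(tail(words, 2), collapse = " ") # last two words form the trigram prefix
  head(tbl$nextword[tbl$prefix == key], n)
}
predict_next("I live in new york", lookup)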