Instructions
This milestone report is a summary overview of the major features of the effort, including basic data analysis and the fundamental steps taken to analyze the text data. The tasks are: perform exploratory data analysis; build a basic n-gram model that uses this exploratory analysis to predict the next word from the previous 1, 2, or 3 words; and build a model that handles unseen n-grams, since people will sometimes want to type a combination of words that does not appear in the corpora.
Preprocessing
The data for the project are taken from the Coursera website at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The data will be downloaded and prepared for further analysis.
library(SnowballC)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.3.1
## Loading required package: RColorBrewer
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.3.1
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
src <- 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
dest <- '~/Coursera-SwiftKey.zip'
# download.file(src,dest)
# unzip(dest,exdir = '~/Coursera-SwiftKey')
final <- '~/Coursera-SwiftKey/final/en_US/'
blogsFile <- paste(final,"en_US.blogs.txt",sep = "")
newsFile <- paste(final,"en_US.news.txt",sep = "")
twitterFile <- paste(final,"en_US.twitter.txt",sep = "")
conBlogs <- file(blogsFile,open = "r")
blog <- readLines(conBlogs,n=-1,encoding = "UTF-8", skipNul = TRUE)
close(conBlogs)
#
# conNews <- file(newsFile, open ="r")
# news <- readLines(conNews,n=-1,encoding = "UTF-8", skipNul = TRUE)
# close(conNews)
#
# conTwitter<-file(twitterFile, open = "r")
# tweet <- readLines(conTwitter,n=-1,encoding = "UTF-8", skipNul = TRUE)
# close(conTwitter)
Let's draw a sample of documents from the feeds for exploration.
Merge.doc <- sample(blog, size = 1000, replace = FALSE)
#rm(blog,news,tweet,blogsFile,newsFile,twitterFile,conTwitter,conNews,conBlogs,src,dest,final)
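The news and Twitter feeds can be read in the same way (see the commented code above). A sketch of a combined sample across all three feeds, left commented out like those reads since only the blog feed is loaded here:
# Combined sample across the three feeds, assuming news and tweet were read in
# with the commented-out code above (illustrative until those lines are run):
# Merge.doc <- sample(c(blog, news, tweet), size = 3000, replace = FALSE)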
Here are two functions that will be important for working with the data. makeCorpus takes a character vector of text and returns a cleaned corpus built from it. ngramer takes a text (or corpus) and returns a tokenized frequency table for the requested n-gram size.
#function to generate a clean corpus from a character vector of text
makeCorpus <- function(text){
    #download the list of profanity words used for filtering
    badwords <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
    ##doc <- iconv(text, "UTF-8", "bytes")
    #remove profanity; stopwords are intentionally kept so the model can predict them (see Modeling)
    #doc <- removeWords(text, stopwords("english"))
    doc <- removeWords(text, badwords)
    doc.Vec <- VectorSource(doc)
    doc.Corpus <- Corpus(doc.Vec)
    #doc.Corpus <- tm_map(doc.Corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
    #doc.Corpus <- tm_map(doc.Corpus, toSpace, "@[^\\s]+")
    doc.Corpus <- tm_map(doc.Corpus, content_transformer(tolower))
    doc.Corpus <- tm_map(doc.Corpus, removeWords, letters)   #drop stray single letters
    doc.Corpus <- tm_map(doc.Corpus, removePunctuation)
    doc.Corpus <- tm_map(doc.Corpus, removeNumbers)
    doc.Corpus <- tm_map(doc.Corpus, stripWhitespace)
    doc.Corpus <- tm_map(doc.Corpus, content_transformer(trimws))
    #doc.Corpus <- tm_map(doc.Corpus, stemDocument)
    doc.Corpus
}
#function that takes a text (or tm corpus) and returns a sorted n-gram frequency table
ngramer <- function(text, n) {
    #coerce a corpus to a plain character vector before tokenizing
    if (inherits(text, "Corpus")) text <- unlist(lapply(text, as.character), use.names = FALSE)
    ngram <- NGramTokenizer(text, Weka_control(min = n, max = n))
    ngram <- data.frame(table(ngram))
    ngram <- ngram[order(ngram$Freq, decreasing = TRUE), ]
    colnames(ngram) <- c("Ngram", "Frequency")
    ngram
}
Data cleaning
In this section we clean up the data, removing all white space, numbers, and punctuation with the makeCorpus function above; the tm package performs these tasks. A bad-word list is downloaded from http://www.cs.cmu.edu/~biglou/resources/bad-words.txt and used to filter profanity from the data. We also generate a TermDocumentMatrix for the sample data.
Merge.Corpus <- makeCorpus(Merge.doc)
Merge.TDM <- TermDocumentMatrix(Merge.Corpus)
Exploratory Analysis
The first step in building a predictive model for text is understanding the distribution of and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data and prepare to build the first linguistic models.
We take a look at various components of the data and draw some figures for visualization: a word cloud is generated from the merged, sampled corpus, and 2-gram and 3-gram frequencies are plotted. This sampled corpus is used for all further processing.
Word cloud for sample data
# Merge.Matrix <- as.matrix(Merge.TDM)
# Merge.Freq <- colSums(Merge.Matrix)
# findFreqTerms(Merge.TDM,100)
# findAssocs(Merge.TDM,"like",0.1)
# Merge.Common <- removeSparseTerms(Merge.TDM,0.1)
# Merge.Lines <- sapply(Merge.doc, nchar)
# Merge.doc[which.max(Merge.Lines)]
# length(grep("love", Merge.doc))/length(grep("hate", Merge.doc))
# Merge.doc[grep("biostats",Merge.doc)]
# length(grep("A computer once beat me at chess, but it was no match for me at kickboxing",Merge.doc))
wordcloud(Merge.Corpus, max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
We now generate 1-, 2-, and 3-grams for the merged data and produce visualizations of the tokens.
## Generate the 1-, 2- and 3-grams for the merged corpus
Merge.n1gram <- ngramer(Merge.Corpus,1)
Merge.n2gram <- ngramer(Merge.Corpus,2)
Merge.n3gram <- ngramer(Merge.Corpus,3)
## keep only the top 20 n-grams for visualization
Merge.n1gram <- Merge.n1gram[1:20, ]
Merge.n2gram <- Merge.n2gram[1:20, ]
Merge.n3gram <- Merge.n3gram[1:20, ]
Histogram of the top 20 unigrams
ggplot(Merge.n1gram, aes(x = Ngram, y = Frequency)) + geom_bar(stat = "identity", fill = "blue", colour = "green") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Histogram of the top 20 bigrams
ggplot(Merge.n2gram, aes(x = Ngram, y = Frequency)) + geom_bar(stat = "identity", fill = "blue", colour = "green") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Histogram of the top 20 trigrams
ggplot(Merge.n3gram, aes(x = Ngram, y = Frequency)) + geom_bar(stat = "identity", fill = "blue", colour = "green") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Modeling
The goal is to build our first simple model of the relationship between words; this is the first step in building a predictive text-mining application. We will explore simple models. Findings so far: when comparing the highest-frequency results using 4-grams, we did not find that 4-grams were helpful for predicting the next word.
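For reference, that comparison can be redone with the same ngramer() helper; a minimal sketch (the Merge.n4gram object is introduced here for illustration only, and the exact code behind the finding is not shown in this report):
# Sketch: build a 4-gram table and inspect its top entries next to the trigrams
Merge.n4gram <- ngramer(Merge.Corpus, 4)
head(Merge.n4gram, 10)   # highest-frequency 4-grams
head(Merge.n3gram, 10)   # highest-frequency trigrams, for comparison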
A major trade-off is the amount of data analyzed (corpus size) versus analysis time. Note that stopwords were intentionally kept so that the model can predict them, even though they are numerous.
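As a rough illustration of that trade-off, the corpus build and tokenization can be timed for two sample sizes (a sketch; the sizes and the small.corpus/large.corpus names are arbitrary choices, not part of the analysis above):
# Sketch: compare processing time for a small vs. a larger sample of the blog feed
small.corpus <- makeCorpus(sample(blog, 500))
large.corpus <- makeCorpus(sample(blog, 2000))
system.time(ngramer(small.corpus, 3))
system.time(ngramer(large.corpus, 3))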
Adding more lines of text to the target corpus does not always improve model accuracy. The model will therefore be built on qualitative n-gram criteria rather than purely quantitative ones.
Next steps: build an n-gram model using the exploratory analysis performed above (http://en.wikipedia.org/wiki/N-gram) to predict the next word from the previous 1, 2, or 3 words; assess the Katz back-off model for accuracy; generate a two-column table of unique n-grams and their frequencies by summing frequency counts; and match an n-gram character string against the appropriate (n+1)-gram entries in the frequency table, proposing the high-frequency completions to the user when there is a match.
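A minimal sketch of that lookup, using a simplified back-off (not a full Katz back-off with discounting) to cover unseen n-grams. The predictNext helper and the rebuilt full.* tables are illustrative names, not code from this report; the sketch only relies on ngramer() returning Ngram/Frequency tables sorted by decreasing frequency.
# Sketch: simple back-off next-word lookup over the n-gram frequency tables
predictNext <- function(phrase, n1gram, n2gram, n3gram, top = 3) {
    # clean the input the same way the corpus was cleaned
    words <- unlist(strsplit(tolower(trimws(phrase)), "\\s+"))
    lastWord <- function(x) sapply(strsplit(as.character(x), " "), tail, 1)
    # try the trigram table first: match on the last two words of the phrase
    if (length(words) >= 2) {
        ctx <- paste(tail(words, 2), collapse = " ")
        hits <- n3gram[startsWith(as.character(n3gram$Ngram), paste0(ctx, " ")), ]
        if (nrow(hits) > 0) return(head(lastWord(hits$Ngram), top))
    }
    # back off to the bigram table: match on the last word only
    ctx <- tail(words, 1)
    hits <- n2gram[startsWith(as.character(n2gram$Ngram), paste0(ctx, " ")), ]
    if (nrow(hits) > 0) return(head(lastWord(hits$Ngram), top))
    # unseen n-gram: fall back to the most frequent unigrams
    head(as.character(n1gram$Ngram), top)
}
# rebuild full tables (the ones above were cut to the top 20 for plotting)
full.n1gram <- ngramer(Merge.Corpus, 1)
full.n2gram <- ngramer(Merge.Corpus, 2)
full.n3gram <- ngramer(Merge.Corpus, 3)
predictNext("thanks for the", full.n1gram, full.n2gram, full.n3gram)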