Milestone Report

This document describes the process designed to address the challenges covered in the sections below.

Data set

The data comes from a corpus called HC Corpora (http://www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available.

The data set used in this analysis is available at: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The downloaded file is stored locally in the following location: ./Capstone Project/en_US/

Loading the required packages

# Install any required packages that are not already present
list.of.packages <- c("tm", "SnowballC", "RColorBrewer", "ggplot2", "reshape2", "wordcloud", "biclust", "cluster", "igraph", "fpc", "doParallel", "RWeka", "R.utils")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

Loading the libraries

library("tm")
library("SnowballC")   
library("cldr")
library("reshape2")
library("ggplot2")
library("R.utils")
library("RWeka")

Basic summaries of the individual files in the data set

The code below checks the size and line count of the individual files in the English corpus.

blogs.txt.size <- file.info("./Capstone Project/en_US/en_US.blogs.txt")$size
news.txt.size <- file.info("./Capstone Project/en_US/en_US.news.txt")$size
twitter.txt.size <- file.info("./Capstone Project/en_US/en_US.twitter.txt")$size

blogs.txt.lines <- countLines("./Capstone Project/en_US/en_US.blogs.txt")
news.txt.lines <- countLines("./Capstone Project/en_US/en_US.news.txt")
twitter.txt.lines <- countLines("./Capstone Project/en_US/en_US.twitter.txt")
# Combine file names, sizes (in bytes) and line counts into a summary table
Files <- c("blogs.txt", "news.txt", "twitter.txt")
Size <- c(blogs.txt.size, news.txt.size, twitter.txt.size)
Lines <- c(blogs.txt.lines, news.txt.lines, twitter.txt.lines)
size_table <- data.frame(Files, Size, Lines)

The results (with sizes in bytes) are printed using the knitr kable function.

knitr::kable(size_table)
|Files       |      Size|   Lines|
|:-----------|---------:|-------:|
|blogs.txt   | 210160014|  899288|
|news.txt    | 205811889| 1010242|
|twitter.txt | 167105338| 2360148|
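
The Size column above is in bytes; for readability it can also be expressed in megabytes. This is an optional convenience step, not part of the original table:

# Add a column with the file sizes expressed in megabytes
size_table$Size_MB <- round(size_table$Size / 1024^2, 1)
knitr::kable(size_table)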

Sampling the data sets

It is clear from the table above that the corpus is too large to process comfortably on a single machine. To reduce the runtime, the decision was made to draw a 2% sample (one line in fifty) from each of the original files, as per the code below:

# Read each full file, then keep a random 2% (1/50) sample of its lines
connection <- file("./Capstone Project/en_US/en_US.news.txt")
news <- readLines(connection, encoding = "UTF-8", skipNul = TRUE)
close(connection)
news_sample <- sample(news, length(news)/50)

connection <- file("./Capstone Project/en_US/en_US.blogs.txt")
blogs <- readLines(connection, encoding = "UTF-8", skipNul = TRUE)
close(connection)
blogs_sample <- sample(blogs, length(blogs)/50)

connection <- file("./Capstone Project/en_US/en_US.twitter.txt")
twitter <- readLines(connection, encoding = "UTF-8", skipNul = TRUE)
close(connection)
twitter_sample <- sample(twitter, length(twitter)/50)
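
Note that sample() draws a random subset, so the exact sample (and therefore the counts and plots below) will vary between runs. A minimal sketch of making the sampling reproducible is shown below; the seed value is an arbitrary choice for illustration and was not part of the original analysis:

# Fix the random seed before sampling so the same lines are drawn on every run
set.seed(1234)  # arbitrary illustrative seed
news_sample    <- sample(news, length(news)/50)
blogs_sample   <- sample(blogs, length(blogs)/50)
twitter_sample <- sample(twitter, length(twitter)/50)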

Saving the samples
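
The writeLines() calls below will fail if the ./Capstone Project/en_US/sample/ directory does not yet exist; the small guard below (an addition for robustness, not part of the original script) creates it first:

# Create the sample directory if it is not already there
sample_dir <- "./Capstone Project/en_US/sample"
if (!dir.exists(sample_dir)) dir.create(sample_dir, recursive = TRUE)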

connection <- file("./Capstone Project/en_US/sample/news_sample.txt")
writeLines(news_sample,connection)
close(connection)

connection <- file("./Capstone Project/en_US/sample/blogs_sample.txt")
writeLines(blogs_sample,connection)
close(connection)

connection <- file("./Capstone Project/en_US/sample/twitter_sample.txt")
writeLines(twitter_sample,connection)
close(connection)

Evaluating how many lines in the samples come from foreign languages

One of the quickest ways to detect foreign-language text is to use the “cldr” package (http://cran.r-project.org/web/packages/cldr/), which brings Google Chrome’s language detection into R. In order to install the package, make sure you have devtools installed and execute the following command:

devtools::install_version("cldr",version="1.1.0") 
library(cldr)
# Count the sample lines whose detected language is not English
non_english_blogs <- which(detectLanguage(blogs_sample)$detectedLanguage != "ENGLISH")
length(non_english_blogs)

non_english_news <- which(detectLanguage(news_sample)$detectedLanguage != "ENGLISH")
length(non_english_news)

non_english_twitter <- which(detectLanguage(twitter_sample)$detectedLanguage != "ENGLISH")
length(non_english_twitter)

Non_English <- c(length(non_english_blogs), length(non_english_news), length(non_english_twitter))
Non_English_table <- data.frame(Files, Non_English)
knitr::kable(Non_English_table)
|Files       | Non_English|
|:-----------|-----------:|
|blogs.txt   |         608|
|news.txt    |         444|
|twitter.txt |        1101|
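
The analysis above only counts the lines flagged as non-English. If one wanted to drop them from the samples before building the corpus, a sketch along the following lines would work; this is an optional extension, not a step taken in the original pipeline:

# Remove the lines flagged as non-English from each sample (optional)
if (length(non_english_blogs))   blogs_sample   <- blogs_sample[-non_english_blogs]
if (length(non_english_news))    news_sample    <- news_sample[-non_english_news]
if (length(non_english_twitter)) twitter_sample <- twitter_sample[-non_english_twitter]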

Creating Corpus data set

cat_name <- file.path("./Capstone Project/en_US/sample")

texts <- tm::Corpus(DirSource(cat_name)) 
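
A quick sanity check (illustrative only, not part of the original code) confirms that the three sample files were picked up as separate documents in the corpus:

# The corpus should contain one document per sample file
length(texts)   # expected: 3
summary(texts)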

Text Preprocessing and Cleaning

This step removes numbers, capitalization, common (stop) words and punctuation to better prepare the text for analysis. It can be time- and processing-intensive, but it greatly improves the overall quality of the analysis.

Removing punctuation

texts <- tm_map(texts, removePunctuation)

Removing numbers

texts <- tm_map(texts, removeNumbers)

Removing the English stopwords

texts <- tm_map(texts, removeWords, stopwords("english"))

Converting all capital letters to lower case. Note that because the stopword removal above runs before this step, capitalized stopwords such as “The” or “I” are not caught, which is why they still appear in the frequency counts later in this report.

texts <- tm_map(texts, content_transformer(tolower))

Stemming Documents

We stem the documents so that a word is recognizable to the computer regardless of the variety of endings it may have in the original text. In practice this means removing common word endings (e.g., “ing”, “es”, “s”).

texts <- tm_map(texts, stemDocument)

Removing white spaces from the document

All of the above text cleaning activities leave the corpus with many unnecessary white spaces, which are simply leftovers from the words and characters we have removed.

texts <- tm_map(texts, stripWhitespace)

Creating the Term-Document Matrix

It is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.

texts_tdm <- TermDocumentMatrix(texts, control=list(wordLengths=c(3,Inf)))
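
A brief inspection of the resulting matrix can be useful at this point; the calls below are illustrative additions rather than part of the original analysis:

# Dimensions of the term-document matrix (terms x documents)
dim(texts_tdm)
# Terms occurring at least 1000 times across the corpus (arbitrary threshold)
findFreqTerms(texts_tdm, lowfreq = 1000)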

Generating n-grams

In this step we generate three TermDocumentMatrix objects which represent the original corpus in the form of 1-, 2- and 3-word sequences (n-grams).

require(RWeka)

options(mc.cores = 1)
UnigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}
tdm_unigram <- TermDocumentMatrix(texts, control = list(tokenize = UnigramTokenizer)) # create tdm from 1-grams

# note that in theory texts_tdm and tdm_unigram are equivalent, since a 1-gram is simply a single word

BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
tdm_bigram <- TermDocumentMatrix(texts, control = list(tokenize = BigramTokenizer)) # create tdm from 2-grams


ThreegramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
tdm_threegram <- TermDocumentMatrix(texts, control = list(tokenize = ThreegramTokenizer)) # create tdm from 3-grams

Finding top words and phrases

tdm_unigram.matrix <- as.matrix(tdm_unigram)
topwords_uni_gram<- rowSums(tdm_unigram.matrix)
head(sort(topwords_uni_gram,decreasing = TRUE), 25)
##   the  will   one   get  said  just  like  time   can   day  year  make 
## 10688  6366  6213  6033  6028  5945  5945  5111  4779  4426  4357  4175 
##  love   new  good  know   now  dont  work think   say   see  want peopl 
##  3911  3857  3674  3597  3537  3424  3387  3291  3276  3236  3224  3213 
## thank 
##  3048
tdm_bigram.matrix <- as.matrix(tdm_bigram)
topwords_bigram<- rowSums(tdm_bigram.matrix)
head(sort(topwords_bigram,decreasing = TRUE), 25)
##    i think     i dont     i love      i can     i know     i just 
##       1211       1107        999        752        743        735 
##     i want     i will     i cant     i need  last year  right now 
##        688        630        509        483        475        467 
##     i like    i didnt     i feel  look like      i got      i get 
##        453        412        405        404        401        399 
##   new york  cant wait  i thought     i hope  dont know   year ago 
##        383        364        364        358        346        344 
## last night 
##        341
tdm_threegram.matrix <- as.matrix(tdm_threegram)
topwords_threegram<- rowSums(tdm_threegram.matrix)
head(sort(topwords_threegram,decreasing = TRUE), 25)
##         i dont know        i dont think           i think i 
##                 187                 162                 147 
##         i feel like            i know i         i dont want 
##                 128                 121                  94 
##         i cant wait            i wish i       cant wait see 
##                  90                  87                  73 
##    happi mother day         i thought i         feel like i 
##                  62                  61                  60 
##         i dont like         i just want         let us know 
##                  56                  52                  50 
##          i think im      happi new year       new york citi 
##                  45                  44                  44 
##        i didnt know presid barack obama       i realli want 
##                  43                  42                  41 
##           i can get       i cant believ         i dont even 
##                  38                  38                  38 
##         dont know i 
##                  36
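
Converting a TermDocumentMatrix to a dense matrix with as.matrix(), as done above, can exhaust memory on larger samples. A sketch of a lighter alternative, using the slam package on which tm’s sparse matrices are built, is shown below as an optional optimization:

# Sum term frequencies directly on the sparse matrix, without densifying it
library(slam)
topwords_uni_sparse <- slam::row_sums(tdm_unigram)
head(sort(topwords_uni_sparse, decreasing = TRUE), 25)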

Generating 1-gram top 25 word Frequency Distribution

unigram_freq <- rowSums(tdm_unigram.matrix)
unigram_freq_ord <- order(unigram_freq, decreasing = TRUE)
unigram_freq_top25 <- unigram_freq[head(unigram_freq_ord, 25)]
unigram_freq_top25 <- melt(unigram_freq_top25)
unigram_freq_top25$words <- rownames(unigram_freq_top25)

#Generating the word cloud for unigram
wordcloud::wordcloud(names(unigram_freq), unigram_freq,max.words=200, scale = c(5, .1))

# plotting the top words
p <- ggplot(unigram_freq_top25, aes(x=words, y=value))
p <- p + geom_bar(stat = "identity", colour="red", fill="navy", width = 0.5)
p <- p + geom_text( aes (label = value ) , vjust = - 0.20, size = 3 )
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1)) 
p
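
By default ggplot2 orders the x axis alphabetically; if the bars should instead appear in decreasing frequency order, a small tweak such as the one below can be applied (shown here for the unigram plot only, as an optional refinement):

# Reorder the factor levels so the bars are sorted by decreasing frequency
p_sorted <- ggplot(unigram_freq_top25, aes(x = reorder(words, -value), y = value)) +
  geom_bar(stat = "identity", colour = "red", fill = "navy", width = 0.5) +
  geom_text(aes(label = value), vjust = -0.20, size = 3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("words")
p_sorted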

Generating 2-gram top 25 word Frequency Distribution

bigram_freq <- rowSums(tdm_bigram.matrix)
bigram_freq_ord <- order(bigram_freq, decreasing = TRUE)
bigram_freq_top25 <- bigram_freq[head(bigram_freq_ord, 25)]
bigram_freq_top25 <- melt(bigram_freq_top25)
bigram_freq_top25$words <- rownames(bigram_freq_top25)

#Generating the word cloud for bigram
wordcloud::wordcloud(names(bigram_freq), bigram_freq,max.words=100, scale = c(5, .1))

# plotting the top words
b <- ggplot(bigram_freq_top25, aes(x=words, y=value))
b <- b + geom_bar(stat = "identity", colour="yellow", fill="red", width = 0.5)
b <- b + geom_text( aes (label = value ) , vjust = - 0.20, size = 3 )
b <- b + theme(axis.text.x=element_text(angle=45, hjust=1)) 
b

Generating 3-gram top 25 word Frequency Distribution

threegram_freq <- rowSums(tdm_threegram.matrix)
threegram_freq_ord <- order(threegram_freq, decreasing = TRUE)
threegram_freq_top25 <- threegram_freq[head(threegram_freq_ord, 25)]
threegram_freq_top25 <- melt(threegram_freq_top25)
threegram_freq_top25$words <- rownames(threegram_freq_top25)

#Generating the word cloud for threegram
wordcloud::wordcloud(names(threegram_freq), threegram_freq,max.words=25)

# plotting the top words
t <- ggplot(threegram_freq_top25, aes(x=words, y=value))
t <- t + geom_bar(stat = "identity", colour="yellow", fill="green", width = 0.5)
t <- t + geom_text( aes (label = value ) , vjust = - 0.20, size = 3 )
t <- t + theme(axis.text.x=element_text(angle=45, hjust=1))
t

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

all_uniwords <- sort(topwords_uni_gram,decreasing = TRUE)
all_uniwords <- melt(all_uniwords)

# Function that returns how many frequency-sorted words are needed before their
# cumulative count exceeds the threshold length(x) * y.
# Note: this threshold is a fraction of the number of unique 1-gram words,
# not of the total number of word instances (see the alternative sketch below).
words_percentage_sum <- function(x, y) {
  for (i in 1:length(x)) {
    a <- sum(x[1:i])
    if (a > length(x) * y) return(i)
  }
  return(i)
}
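
As stated, the question asks about coverage of all word instances. A sketch of an alternative that measures coverage against the total token count (the sum of all frequencies) is given below; it is not the calculation used to produce the figures reported next, so its output would differ:

# Number of frequency-sorted words whose cumulative frequency reaches a given
# fraction of the total number of word instances
words_coverage <- function(freqs, coverage) {
  freqs <- sort(freqs, decreasing = TRUE)
  which(cumsum(freqs) >= sum(freqs) * coverage)[1]
}
# e.g. words_coverage(topwords_uni_gram, 0.5)  # illustrative call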

Using the function above, to cover 50 percent of all word instances we need the following number of frequency-sorted top words:

## [1] 5

To cover 90 percent of all word instances we need the following number of frequency-sorted top words:

## [1] 9

Conclusions

The above exploratory analysis has delivered the following insights:

  1. The input data sets represent quite different registers of English. The way people express themselves on Twitter differs from the language used in news and blogs.

  2. The input data set is large and, given the limitations of R, sampling of the source data was necessary. Further analysis may require packages which enable multi-threaded, parallel processing (e.g. doParallel).

  3. Using a sample of the input data may impact the final algorithm’s prediction accuracy. The modelling phase of the project will have to deliver a proper analysis of the trade-off between the size of the sample used for building the model and its accuracy.

  4. The data cleaning stage presented above will also have to be re-designed to meet the requirements of the final prediction model. Key concepts to consider and next actions are:
     1. Reducing the sparsity of the corpus.
     2. Removing the English stop words - how does it impact the quality of prediction?
     3. Stemming the corpus - how does it impact the quality of prediction?
     4. Developing a prediction model based on n-grams.
     5. Developing a Shiny app with a UI to present the outcome of the prediction algorithm.
     6. Optimizing the size of the n-gram files and the prediction model to minimize the requirements on the runtime environment.