In this report I explain my exploratory analysis and my goals for the final app and algorithm. This document briefly summarizes the plans for creating the prediction algorithm and Shiny app.
The data used in this document were downloaded on January 02, 2018 from the following site:
if (!file.exists("./data")) {
    dir.create("./data")
}
link <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("./data/Coursera-Swiftkey.zip")) {
    download.file(link, "./data/Coursera-Swiftkey.zip")
}
Libraries used:
library(readr)
library(ngram)
library(NLP)
library(tm)
library(ggplot2)
library(gridExtra)
library(wordcloud)
library(RWeka)
The data were processed as follows.
First, I decompress the downloaded zip file:
unzip("./data/Coursera-Swiftkey.zip")
and then load the files into separate objects for analysis:
blogs_full <- read_lines("./final/en_US/en_US.blogs.txt", progress = FALSE)
news_full <- read_lines("./final/en_US/en_US.news.txt", progress = FALSE)
twitter_full <- read_lines("./final/en_US/en_US.twitter.txt", progress = FALSE)
Some basic stats:
df_stat <- data.frame(FileName = "en_US.blogs.txt", NumRow = NROW(blogs_full), NumChars = sum(nchar(blogs_full)))
df_stat <- rbind(df_stat, data.frame(FileName = "en_US.news.txt", NumRow = NROW(news_full), NumChars = sum(nchar(news_full))))
df_stat <- rbind(df_stat, data.frame(FileName = "en_US.twitter.txt", NumRow = NROW(twitter_full), NumChars = sum(nchar(twitter_full))))
knitr::kable(df_stat, row.names = FALSE, format.args = list(big.mark = ','))
| FileName | NumRow | NumChars |
|---|---|---|
| en_US.blogs.txt | 899,288 | 206,824,505 |
| en_US.news.txt | 1,010,242 | 203,223,159 |
| en_US.twitter.txt | 2,360,148 | 162,096,031 |
The files are large enough that they can be sampled to reduce processing time. I sample 1% of each file:
set.seed(27182)
blogs <- sample(blogs_full, 0.01 * NROW(blogs_full))
news <- sample(news_full, 0.01 * NROW(news_full))
twitter <- sample(twitter_full, 0.01 * NROW(twitter_full))
df_stat_sample <- data.frame(Data = "blogs", NumRow = NROW(blogs), NumWords = wordcount(blogs), NumChars = sum(nchar(blogs)))
df_stat_sample <- rbind(df_stat_sample, data.frame(Data = "news", NumRow = NROW(news), NumWords = wordcount(news), NumChars = sum(nchar(news))))
df_stat_sample <- rbind(df_stat_sample, data.frame(Data = "twitter", NumRow = NROW(twitter), NumWords = wordcount(twitter), NumChars = sum(nchar(twitter))))
knitr::kable(df_stat_sample, row.names = FALSE, format.args = list(big.mark = ','))
| Data | NumRow | NumWords | NumChars |
|---|---|---|---|
| blogs | 8,992 | 372,961 | 2,063,969 |
| news | 10,102 | 343,461 | 2,032,212 |
| twitter | 23,601 | 303,961 | 1,621,456 |

With these samples the report is completely processed within one hour.
Now I clean the datasets and prepare the corpora:
# to avoid repetition I create a function
cleanData <- function(x) {
    x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "") # drop non-ASCII characters that would otherwise cause errors
    x <- VCorpus(VectorSource(x)) # convert to a tm corpus
    x <- tm_map(x, tolower) # all lowercase
    x <- tm_map(x, removeNumbers) # no numbers
    x <- tm_map(x, removePunctuation) # no punctuation
    x <- tm_map(x, stripWhitespace) # collapse extra white space
    x <- tm_map(x, PlainTextDocument) # back to plain text documents for NLP processing
    x <- tm_map(x, removeWords, stopwords(kind = "en")) # remove English stop words
    return(x)
}
blogs_clean <- cleanData(blogs)
news_clean <- cleanData(news)
twitter_clean <- cleanData(twitter)
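As a quick optional sanity check (not part of the original pipeline; it only assumes the cleanData() output above), one cleaned document can be compared with its raw counterpart:

blogs[1] # original sampled line
as.character(blogs_clean[[1]]) # the same line after cleaning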
Now I analyze the cleaned corpora. First I create a Document-Term Matrix for each of them:
# define various steps in one function to repeat for every corpus
createDocumentTermMatrix <- function(x) {
    # create Document-Term Matrix with term-frequency weighting
    documentTermMatrix <- DocumentTermMatrix(x, control = list(weighting = weightTf))
    return(documentTermMatrix)
}
blogsDTM <- createDocumentTermMatrix(blogs_clean)
newsDTM <- createDocumentTermMatrix(news_clean)
twitterDTM <- createDocumentTermMatrix(twitter_clean)
Now some statistics on the Document-Term Matrices:
blogsDTM
## <<DocumentTermMatrix (documents: 8992, terms: 28758)>>
## Non-/sparse entries: 171156/258420780
## Sparsity : 100%
## Maximal term length: 84
## Weighting : term frequency (tf)
newsDTM
## <<DocumentTermMatrix (documents: 10102, terms: 30403)>>
## Non-/sparse entries: 181704/306949402
## Sparsity : 100%
## Maximal term length: 35
## Weighting : term frequency (tf)
twitterDTM
## <<DocumentTermMatrix (documents: 23601, terms: 25595)>>
## Non-/sparse entries: 159930/603907665
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
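The reported sparsity is simply the share of empty cells in each matrix, and it can be reproduced from the figures above; a small illustrative check (numbers taken from the blogsDTM summary):

# sparsity = 1 - non-sparse entries / (documents * terms)
1 - 171156 / (8992 * 28758) # ~0.9993, which tm rounds and prints as 100%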
And then some statistics about word frequencies:
freqBlogs <- sort(colSums(as.matrix(blogsDTM)), decreasing=TRUE)
head(freqBlogs, 10)
## one will just can like time get now people also
## 1221 1107 1031 956 949 888 721 632 600 565
freqNews <- sort(colSums(as.matrix(newsDTM)), decreasing=TRUE)
head(freqNews, 10)
## said will one new year two can also first just
## 2416 1056 870 709 610 591 579 573 547 528
freqTwitter <- sort(colSums(as.matrix(twitterDTM)), decreasing=TRUE)
head(freqTwitter, 10)
## just like get love good will dont can day thanks
## 1577 1233 1117 1035 979 913 899 888 874 855
We can see that the ten most frequent words of the three corpora share several common terms.
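To make the overlap explicit, the three top-10 lists can be intersected; a small illustrative check using the frequency vectors computed above:

# words appearing in the top 10 of all three corpora
Reduce(intersect, list(names(head(freqBlogs, 10)),
                       names(head(freqNews, 10)),
                       names(head(freqTwitter, 10))))
# from the tables above this yields "will", "just" and "can"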
Histograms of the word frequencies illustrate this more clearly:
plotTopFreq <- function(x, corp) {
    inputData <- head(sort(x, decreasing = TRUE), 25) # top 25 words (at most; avoids NA padding when fewer terms exist)
    df <- data.frame(x = names(inputData), y = inputData) # data frame for ggplot
    p <- ggplot(data = df, aes(x = reorder(x, -y), y = y)) +
        xlab(paste("Top 25 words -", corp)) + ylab("Count") +
        geom_bar(stat = "identity") +
        theme_light() + theme(axis.text.x = element_text(angle = 60))
    p
}
pb <- plotTopFreq(freqBlogs, "Blogs")
pn <- plotTopFreq(freqNews, "News")
ptw <- plotTopFreq(freqTwitter, "Twitter")
grid.arrange(pb, pn, ptw, ncol=1)
and some word clouds:
wordcloud(names(freqBlogs), freqBlogs, max.words = 50)
wordcloud(names(freqNews), freqNews, max.words = 50)
wordcloud(names(freqTwitter), freqTwitter, max.words = 50)
Now, similarly to the single-word (unigram) statistics, I calculate bigram frequencies and print the ten most frequent for each corpus:
# function to create N-Gram frequencies
createNFreq <- function(x, n = 2, lowfreq = 10) {
    # tokenizer to extract n-grams (from the tm FAQ)
    myNGramTokenizer <- function(y) {
        NGramTokenizer(y, Weka_control(min = n, max = n))
    }
    ngramXTDM <- TermDocumentMatrix(x, control = list(tokenize = myNGramTokenizer))
    # keep only n-grams occurring at least 'lowfreq' times and sum their counts
    freqNgramX <- rowSums(as.matrix(ngramXTDM[findFreqTerms(ngramXTDM, lowfreq = lowfreq), ]))
    return(freqNgramX)
}
bigramTopFreqBlogs <- sort(createNFreq(blogs_clean), decreasing = T)
bigramTopFreqNews <- sort(createNFreq(news_clean), decreasing = T)
bigramTopFreqTwitter <- sort(createNFreq(twitter_clean), decreasing = T)
# Print Top 10
head(bigramTopFreqBlogs, 10)
## dont know even though im sure years ago first time feel like
## 69 50 50 50 48 47
## right now can see new york dont think
## 46 44 42 38
head(bigramTopFreqNews, 10)
## last year new york st louis high school years ago los angeles
## 145 109 92 91 75 66
## new jersey last week first time dont know
## 66 60 50 48
head(bigramTopFreqTwitter, 10)
## cant wait right now last night looking forward
## 175 165 122 102
## dont know happy birthday just got im going
## 88 84 84 81
## feel like good morning
## 80 76
Then I can plot histograms:
pb <- plotTopFreq(bigramTopFreqBlogs, "Blogs")
pn <- plotTopFreq(bigramTopFreqNews, "News")
ptw <- plotTopFreq(bigramTopFreqTwitter, "Twitter")
print(pb)
print(pn)
print(ptw)
For trigrams I proceed as in the previous case:
# same function as for bigrams, with n = 3 and a lower lowfreq threshold
trigramTopFreqBlogs <- sort(createNFreq(blogs_clean, n = 3, lowfreq = 5), decreasing = T)
trigramTopFreqNews <- sort(createNFreq(news_clean, n = 3, lowfreq = 5), decreasing = T)
trigramTopFreqTwitter <- sort(createNFreq(twitter_clean, n = 3, lowfreq = 5), decreasing = T)
# Print Top 10
head(trigramTopFreqBlogs, 10)
## accounting jobs italy new york city ventimiglia italy hotels
## 20 14 9
## amazon services llc italy accounting jobs dont get wrong
## 8 7 6
## im pretty sure jobs italy accounting new york ny
## 6 6 6
## two years ago
## 6
head(trigramTopFreqNews, 10)
## st louis county president barack obama four years ago
## 15 14 11
## gov chris christie world war ii first time since
## 11 10 9
## new york city cents per share high school basketball
## 9 8 7
## two weeks ago
## 7
head(trigramTopFreqTwitter, 10)
## cant wait see happy mothers day let us know
## 37 34 18
## la la la happy valentines day cinco de mayo
## 13 12 11
## im pretty sure just got back cant wait get
## 11 10 9
## cant wait hear
## 9
Then I can plot histograms:
pb <- plotTopFreq(trigramTopFreqBlogs, "Blogs")
pn <- plotTopFreq(trigramTopFreqNews, "News")
ptw <- plotTopFreq(trigramTopFreqTwitter, "Twitter")
print(pb)
print(pn)
print(ptw)
As the plots and the tables of most frequent terms show, moving from unigrams to bigrams and trigrams gives an increasingly narrow set of frequent values, which should make prediction more effective.
The processing currently takes a lot of time, but we have seen that the data are very sparse, so taking advantage of this sparsity should make the algorithm more efficient.
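One possible way to do this, sketched below, is to keep the matrices in their sparse slam form (the format tm already uses internally) and drop extremely rare terms with removeSparseTerms(); the 0.999 threshold is only an illustrative value:

# drop terms missing from more than 99.9% of documents and
# sum counts on the sparse matrix instead of a dense as.matrix()
blogsDTM_small <- removeSparseTerms(blogsDTM, sparse = 0.999)
freqBlogs_fast <- sort(slam::col_sums(blogsDTM_small), decreasing = TRUE)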
For now, Katz's back-off model seems the most promising option for prediction, but only later will I be able to say whether it is efficient enough to be applied.
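As a first, much simplified sketch of the back-off idea (not a full Katz model: there is no discounting, it only looks up the blogs frequency vectors computed above, and predictNext() is a hypothetical helper):

# look for the most frequent trigram starting with the last two words;
# if none is found, back off to bigrams, then to the overall top unigram
predictNext <- function(w1, w2) {
    tri <- trigramTopFreqBlogs[grepl(paste0("^", w1, " ", w2, " "), names(trigramTopFreqBlogs))]
    if (length(tri) > 0) return(strsplit(names(tri)[1], " ")[[1]][3])
    bi <- bigramTopFreqBlogs[grepl(paste0("^", w2, " "), names(bigramTopFreqBlogs))]
    if (length(bi) > 0) return(strsplit(names(bi)[1], " ")[[1]][2])
    names(freqBlogs)[1]
}
predictNext("new", "york") # expected to return "city" given the trigram table above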