Exploratory Data Analysis

Synopsis

The goal of the Data Science Capstone Project is to use the skills acquired in the specialization to create an application based on a predictive model for text. Given a word or phrase as input, the application will try to predict the next word. The predictive model will be trained on a corpus, a collection of written texts, called the HC Corpora, which has been filtered by language.

This report is an EDA of the training data supplied for the capstone project. The data can be found here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

In addition to loading and cleaning the data, the aim here is to use the NLP packages for R to tokenize n-grams as a first step toward testing a Markov model for prediction, i.e. estimating the most likely next word from counts of the n-grams that precede it.

Load the libraries

library(ggplot2)
library(NLP)
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(tm)
library(RWeka)
library(stringi)
library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(RColorBrewer)
library(wordcloud)
library(SnowballC)
library(pander)
library(caret)
## Loading required package: lattice

Loading data

news <- readLines("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.news.txt",encoding="UTF-8", skipNul = TRUE, warn = FALSE)
blogs<- readLines("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.blogs.txt",encoding="UTF-8", skipNul = TRUE, warn = FALSE)
twitter<- readLines("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.twitter.txt",encoding="UTF-8", skipNul = TRUE, warn = FALSE)

Summarise the data

blogs_size <- file.info("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news_size <- file.info("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter_size <- file.info("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
pop_summary <- data.frame('File' = c("Blogs","News","Twitter"),
                      "FileSizeinMB" = c(blogs_size, news_size, twitter_size),
                      'NumberofLines' = sapply(list(blogs, news, twitter), function(x){length(x)}),
                      'TotalCharacters' = sapply(list(blogs, news, twitter), function(x){sum(nchar(x))}),
                      'TotalWords' = sapply(list(blogs,news,twitter),stri_stats_latex)[4,],
                      'MaxCharacters' = sapply(list(blogs, news, twitter), function(x){max(unlist(lapply(x, function(y) nchar(y))))})
                      )
pop_summary
##      File FileSizeinMB NumberofLines TotalCharacters TotalWords MaxCharacters
## 1   Blogs     200.4242        899288       206824505   37570839         40833
## 2    News     196.2775         77259        15639408    2651432          5760
## 3 Twitter     159.3641       2360148       162096241   30451170           140

Sample the data

set.seed(1130)
samp_size = 5000

news_samp <- news[sample(1:length(news),samp_size)]
twitter_samp <- twitter[sample(1:length(twitter),samp_size)]
blogs_samp <- blogs[sample(1:length(blogs), samp_size)]

invisible(write.table(blogs_samp, file="blog_samp.txt", quote=F))
invisible(write.table(twitter_samp, file="twitter_samp.txt", quote=F))
invisible(write.table(news_samp, file="news_samp.txt", quote=F))

df <- rbind(news_samp, twitter_samp, blogs_samp)
rm(news,twitter,blogs)
wd<- file.path("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US")
dir(wd)
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

The size of the data sets being evaluated matters for what follows, so word and line counts are calculated for each 5,000-line sample.

BlogWords <- stri_count_words(blogs_samp)
summary(BlogWords)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   30.00   41.66   60.00  472.00
TwitterWords <- stri_count_words(twitter_samp)
summary(TwitterWords)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    7.00   12.00   12.89   19.00   34.00
NewsWords <- stri_count_words(news_samp)
summary(NewsWords)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.51   46.00  301.00
stri_stats_general(blogs_samp)
##       Lines LinesNEmpty       Chars CharsNWhite 
##        5000        5000     1149398      947454
stri_stats_general(twitter_samp)
##       Lines LinesNEmpty       Chars CharsNWhite 
##        5000        5000      346996      286969
stri_stats_general(news_samp)
##       Lines LinesNEmpty       Chars CharsNWhite 
##        5000        5000     1007349      841782

Corpus

In text mining, a corpus is created to facilitate statistical analysis, hypothesis testing, and counting term occurrences. Note that the corpus below is built with DirSource from the full en_US files rather than from the samples written above, which is why the character counts reported by inspect() are so large.

docs <- VCorpus(DirSource(wd))
summary(docs)
##                   Length Class             Mode
## en_US.blogs.txt   2      PlainTextDocument list
## en_US.news.txt    2      PlainTextDocument list
## en_US.twitter.txt 2      PlainTextDocument list
inspect(docs[1])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 208361438
inspect(docs[2])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 15683765
inspect(docs[3])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 162384825

Cleaning data

docs <- tm_map(docs, removePunctuation)                  # strip punctuation
docs <- tm_map(docs, removeNumbers)                      # strip digits
docs <- tm_map(docs, content_transformer(tolower))       # convert to lower case
docs <- tm_map(docs, PlainTextDocument)                  # keep documents as PlainTextDocument
DocsCopy <- docs                                         # copy kept before stopword removal

docs <- tm_map(docs, removeWords, stopwords("english"))  # drop common English stopwords
docs <- tm_map(docs, PlainTextDocument)

Document-term matrices

dtm <- DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 862832)>>
## Non-/sparse entries: 1067805/1520691
## Sparsity           : 59%
## Maximal term length: 1101
## Weighting          : term frequency (tf)
tdm <- TermDocumentMatrix(docs)
tdm
## <<TermDocumentMatrix (terms: 862832, documents: 3)>>
## Non-/sparse entries: 1067805/1520691
## Sparsity           : 59%
## Maximal term length: 1101
## Weighting          : term frequency (tf)

Reviewing the cleaned data

freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 862832
dtms <-removeSparseTerms(dtm, 0.2)
dtms
## <<DocumentTermMatrix (documents: 3, terms: 55990)>>
## Non-/sparse entries: 167970/0
## Sparsity           : 0%
## Maximal term length: 23
## Weighting          : term frequency (tf)
freq <- colSums(as.matrix(dtm))
head(table(freq))
## freq
##      1      2      3      4      5      6 
## 567428  95363  41459  24376  16180  11768
tail(table(freq))
## freq
## 185744 191293 211198 215078 222959 253230 
##      1      1      1      1      1      1
freq <- colSums(as.matrix(dtms))


freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq,20)
##   just   like   will    one    can    get   time   love   good    now    day 
## 253230 222959 215078 211198 191293 185744 165936 150447 149659 143211 142003 
##   know    new   dont    see people   back  great  think   make 
## 140393 128702 118459 117719 113698 109903 107278 102959 100065
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
##      word   freq
## just just 253230
## like like 222959
## will will 215078
## one   one 211198
## can   can 191293
## get   get 185744

Plot a bar chart of the words that occur more than 50 times.

p <- ggplot(subset(wf, freq>50), aes(x = reorder(word, -freq), y = freq)) +
        geom_bar(stat = "identity") + 
        theme(axis.text.x=element_text(angle=45, hjust=1))
p   

Removal of sparse terms: removeSparseTerms drops terms whose sparsity across documents exceeds the given threshold, rather than thresholding on raw frequency.

dtms <- removeSparseTerms(dtm, 0.1) 

head(table(freq), 20)  
## freq
##      1      2      3      4      5      6      7      8      9     10     11 
## 567428  95363  41459  24376  16180  11768   8923   7079   5850   4888   4121 
##     12     13     14     15     16     17     18     19     20 
##   3545   3097   2730   2503   2270   1948   1885   1638   1530
tail(table(freq), 20)  
## freq
## 100065 102959 107278 109903 113698 117719 118459 128702 140393 142003 143211 
##      1      1      1      1      1      1      1      1      1      1      1 
## 149659 150447 165936 185744 191293 211198 215078 222959 253230 
##      1      1      1      1      1      1      1      1      1
freq <- colSums(as.matrix(dtms))   

Creating N-grams

NOTE: Because of a memory error during NGramTokenizer, my n-gram tokenization failed with the message "Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer, : java.lang.OutOfMemoryError: GC overhead limit exceeded".

If you have any ideas or comments regarding this error I would greatly appreciate them.
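
One possible workaround, sketched below, is to raise the Java heap before rJava/RWeka are loaded and to tokenize only the 15,000 sampled lines (blogs_samp, news_samp, twitter_samp) instead of the full corpus. This is an untested sketch: the 4 GB heap value and the sample-based corpus are my assumptions, not part of the original analysis.

# The java.parameters option must be set before rJava/RWeka are loaded,
# so in practice this belongs at the top of a fresh R session.
options(java.parameters = "-Xmx4g")   # assumed heap size; adjust to available RAM
library(RWeka)
library(tm)

# Build a small corpus from the sampled lines rather than the full files
samp_corpus <- VCorpus(VectorSource(c(blogs_samp, news_samp, twitter_samp)))
samp_corpus <- tm_map(samp_corpus, removePunctuation)
samp_corpus <- tm_map(samp_corpus, removeNumbers)
samp_corpus <- tm_map(samp_corpus, content_transformer(tolower))

# Bigram and trigram tokenizers for use with TermDocumentMatrix
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdm2 <- TermDocumentMatrix(samp_corpus, control = list(tokenize = BigramTokenizer))
tdm3 <- TermDocumentMatrix(samp_corpus, control = list(tokenize = TrigramTokenizer))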

Predictive Analysis

For the predictive assignment, I propose the following workflow for model computation (a rough sketch of step 4 follows the list):

1. Load the corpus
2. Clean each corpus (as in the EDA, but also replace contractions)
3. Extract train/test/validation sets (60/20/20)
4. Build n-grams on the training set (sizes 1, 2, 3, and 4)
5. Test prediction (always the largest available k-gram, or a vote between all k-grams)
6. Save the chosen frequency matrices and the chosen model
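
As a rough illustration of step 4, the sketch below builds a bigram frequency table from a character vector of cleaned training lines using stringi and data.table (both already loaded above). The function name and the train_lines placeholder are hypothetical, not part of the final model.

# train_lines: hypothetical character vector holding the cleaned training split
build_bigram_freq <- function(train_lines) {
        # split each line into lowercase word tokens
        tokens <- stri_split_regex(stri_trans_tolower(train_lines), "[^a-z']+", omit_empty = TRUE)
        # pair each word with the word that follows it on the same line
        pairs <- rbindlist(lapply(tokens, function(w) {
                if (length(w) < 2) return(NULL)
                data.table(w1 = w[-length(w)], w2 = w[-1])
        }))
        # count how often each (w1, w2) pair occurs and sort by frequency
        pairs[, .N, by = .(w1, w2)][order(-N)]
}

For example, bigram_freq <- build_bigram_freq(c(blogs_samp, news_samp, twitter_samp)) would give the counts needed for a simple bigram look-up.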

To use the saved models for prediction (a minimal look-up sketch follows the list):

1. Propose a first word (the most frequent 1-gram)
2. Receive the input text
3. Clean the text by the corpus rules (as in the EDA)
4. Extract the last n-gram (sizes 1, 2, 3)
5. Match the extracted n-gram (by regex) against the names in the frequency matrices
6. Choose according to the selected model
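
A minimal sketch of steps 4-6, assuming the bigram_freq table from the previous sketch and backing off to a default word when no bigram matches (the function name and default are my assumptions):

# Predict the next word from the last word of the input text,
# backing off to the proposed first word when no bigram matches
predict_next <- function(input_text, bigram_freq, default = "the") {
        words <- stri_split_regex(stri_trans_tolower(input_text), "[^a-z']+", omit_empty = TRUE)[[1]]
        if (length(words) == 0) return(default)
        last_word <- words[length(words)]
        matches <- bigram_freq[w1 == last_word][order(-N)]
        if (nrow(matches) == 0) return(default)
        matches$w2[1]
}

predict_next("I would like to", bigram_freq) would then return the word that most often follows "to" in the training sample.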

This will be done once I resolve the memory errors described in the previous section.