The goal of the Data Science Capstone Project is to apply the skills acquired in the specialization to create an application based on a predictive model for text. Given a word or phrase as input, the application will try to predict the next word. The predictive model will be trained on a corpus, a collection of written texts, called the HC Corpora, which has been filtered by language.
This report is an exploratory data analysis (EDA) of the training data supplied for the capstone project. The data can be found here: (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)
In addition to loading and cleaning the data, the aim here is to use the NLP packages for R to tokenize n-grams as a first step toward testing a Markov model for prediction.
library(ggplot2)
library(NLP)
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(tm)
library(RWeka)
library(stringi)
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(RColorBrewer)
library(wordcloud)
library(SnowballC)
library(pander)
library(caret)
## Loading required package: lattice
news <- readLines("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.news.txt",encoding="UTF-8", skipNul = TRUE, warn = FALSE)
blogs<- readLines("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.blogs.txt",encoding="UTF-8", skipNul = TRUE, warn = FALSE)
twitter<- readLines("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.twitter.txt",encoding="UTF-8", skipNul = TRUE, warn = FALSE)
blogs_size <- file.info("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news_size <- file.info("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter_size <- file.info("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
pop_summary <- data.frame('File' = c("Blogs","News","Twitter"),
"FileSizeinMB" = c(blogs_size, news_size, twitter_size),
'NumberofLines' = sapply(list(blogs, news, twitter), function(x){length(x)}),
'TotalCharacters' = sapply(list(blogs, news, twitter), function(x){sum(nchar(x))}),
'TotalWords' = sapply(list(blogs,news,twitter),stri_stats_latex)[4,],
'MaxCharacters' = sapply(list(blogs, news, twitter), function(x){max(unlist(lapply(x, function(y) nchar(y))))})
)
pop_summary
## File FileSizeinMB NumberofLines TotalCharacters TotalWords MaxCharacters
## 1 Blogs 200.4242 899288 206824505 37570839 40833
## 2 News 196.2775 77259 15639408 2651432 5760
## 3 Twitter 159.3641 2360148 162096241 30451170 140
set.seed(1130)
samp_size = 5000
news_samp <- news[sample(1:length(news),samp_size)]
twitter_samp <- twitter[sample(1:length(twitter),samp_size)]
blogs_samp<- blogs[sample(1:length(blogs),samp_size)]
invisible(write.table(blogs_samp, file="blog_samp.txt", quote=F))
invisible(write.table(twitter_samp, file="twitter_samp.txt", quote=F))
invisible(write.table(news_samp, file="news_samp.txt", quote=F))
df <-rbind(news_samp,twitter_samp,blogs_samp)
rm(news,twitter,blogs)
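Note that rbind() on three character vectors returns a 3 x 5000 character matrix rather than a single vector; if a pooled sample is wanted for the later corpus steps, concatenation is probably closer to the intent. A minimal sketch (the name samp_all and the file en_US.sample.txt are illustrative only):
# Pool the three 5,000-line samples into one character vector and write a
# single combined sample file.
samp_all <- c(blogs_samp, news_samp, twitter_samp)
writeLines(samp_all, "en_US.sample.txt")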
wd<- file.path("D:/RStudio/Documents/Capstone/NLP/Coursera-SwiftKey/final/en_US")
dir(wd)
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
The size of the data being evaluated matters for later modeling, so word counts and general summary statistics are calculated for each 5,000-line sample.
BlogWords <- stri_count_words(blogs_samp)
summary(BlogWords)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 30.00 41.66 60.00 472.00
TwitterWords <- stri_count_words(twitter_samp)
summary(TwitterWords)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 7.00 12.00 12.89 19.00 34.00
NewsWords <- stri_count_words(news_samp)
summary(NewsWords)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.51 46.00 301.00
stri_stats_general(blogs_samp)
## Lines LinesNEmpty Chars CharsNWhite
## 5000 5000 1149398 947454
stri_stats_general(twitter_samp)
## Lines LinesNEmpty Chars CharsNWhite
## 5000 5000 346996 286969
stri_stats_general(news_samp)
## Lines LinesNEmpty Chars CharsNWhite
## 5000 5000 1007349 841782
In text mining, a corpus is created to facilitate statistical analysis, hypothesis testing, and the counting of term occurrences.
docs <- VCorpus(DirSource(wd))
summary(docs)
## Length Class Mode
## en_US.blogs.txt 2 PlainTextDocument list
## en_US.news.txt 2 PlainTextDocument list
## en_US.twitter.txt 2 PlainTextDocument list
inspect(docs[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 208361438
inspect(docs[2])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15683765
inspect(docs[3])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 162384825
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, PlainTextDocument)
DocsCopy <- docs
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, PlainTextDocument)
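A more idiomatic variant in recent versions of tm, shown here only as a sketch and not re-run, wraps the base function in content_transformer() so the corpus class is preserved and the PlainTextDocument() round-trips above are not needed:
# Lower-case while keeping the PlainTextDocument structure intact, then
# collapse the extra whitespace left behind by the earlier removals.
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, stripWhitespace)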
dtm <- DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 862832)>>
## Non-/sparse entries: 1067805/1520691
## Sparsity : 59%
## Maximal term length: 1101
## Weighting : term frequency (tf)
tdm <- TermDocumentMatrix(docs)
tdm
## <<TermDocumentMatrix (terms: 862832, documents: 3)>>
## Non-/sparse entries: 1067805/1520691
## Sparsity : 59%
## Maximal term length: 1101
## Weighting : term frequency (tf)
freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 862832
dtms <-removeSparseTerms(dtm, 0.2)
dtms
## <<DocumentTermMatrix (documents: 3, terms: 55990)>>
## Non-/sparse entries: 167970/0
## Sparsity : 0%
## Maximal term length: 23
## Weighting : term frequency (tf)
freq <- colSums(as.matrix(dtm))
head(table(freq))
## freq
## 1 2 3 4 5 6
## 567428 95363 41459 24376 16180 11768
tail(table(freq))
## freq
## 185744 191293 211198 215078 222959 253230
## 1 1 1 1 1 1
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq,20)
## just like will one can get time love good now day
## 253230 222959 215078 211198 191293 185744 165936 150447 149659 143211 142003
## know new dont see people back great think make
## 140393 128702 118459 117719 113698 109903 107278 102959 100065
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
## word freq
## just just 253230
## like like 222959
## will will 215078
## one one 211198
## can can 191293
## get get 185744
p <- ggplot(subset(wf, freq>50), aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
p
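The wordcloud package loaded earlier gives another view of the same term frequencies. A minimal sketch using the freq vector built above (top 100 terms only, to keep the plot readable):
# Word cloud of the 100 most frequent terms, colored with an RColorBrewer palette.
set.seed(1130)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))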
dtms <- removeSparseTerms(dtm, 0.1)
head(table(freq), 20)
## freq
## 1 2 3 4 5 6 7 8 9 10 11
## 567428 95363 41459 24376 16180 11768 8923 7079 5850 4888 4121
## 12 13 14 15 16 17 18 19 20
## 3545 3097 2730 2503 2270 1948 1885 1638 1530
tail(table(freq), 20)
## freq
## 100065 102959 107278 109903 113698 117719 118459 128702 140393 142003 143211
## 1 1 1 1 1 1 1 1 1 1 1
## 149659 150447 165936 185744 191293 211198 215078 222959 253230
## 1 1 1 1 1 1 1 1 1
freq <- colSums(as.matrix(dtms))
NOTE: Because of a memory error during NGramTokenizer, my n-gram tokenization failed with the message: Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer, : java.lang.OutOfMemoryError: GC overhead limit exceeded.
If you have any ideas or comments regarding this error, I would greatly appreciate them.
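One mitigation I plan to try (untested here, so the heap size is only a guess) is to give the JVM more memory before rJava/RWeka are loaded, and to tokenize the 15,000-line sample rather than the full corpus:
# Must be set in a fresh R session, before library(RWeka) attaches rJava;
# a larger heap makes the "GC overhead limit exceeded" error less likely.
options(java.parameters = "-Xmx4g")
library(RWeka)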
For the predictive assignment, I propose the following workflow for model computation:
1. Load the corpus
2. Clean each corpus (as in the EDA, but also replace contractions)
3. Split into train/test/validation sets (60/20/20)
4. Build n-grams on the training set (sizes 1, 2, 3, and 4; see the sketch after this list)
5. Test prediction (always use the longest k-gram available, otherwise vote across all k-grams)
6. Save the chosen frequency matrices and the chosen model
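A sketch of step 4 on the sampled data rather than the full files (the names BigramTokenizer, samp_corpus, bigram_tdm, and bigram_freq are illustrative; slam is available as a dependency of tm):
# Bigram frequencies from the 15,000 sampled lines. The same pattern extends
# to 3- and 4-grams by changing min/max in Weka_control().
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
samp_corpus <- VCorpus(VectorSource(c(blogs_samp, news_samp, twitter_samp)))
samp_corpus <- tm_map(samp_corpus, content_transformer(tolower))
samp_corpus <- tm_map(samp_corpus, removePunctuation)
samp_corpus <- tm_map(samp_corpus, removeNumbers)
bigram_tdm  <- TermDocumentMatrix(samp_corpus, control = list(tokenize = BigramTokenizer))
bigram_freq <- sort(slam::row_sums(bigram_tdm), decreasing = TRUE)
head(bigram_freq, 10)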
To use the saved models for prediction:
1. Propose a first word (the most frequent 1-gram)
2. Receive the input text
3. Clean the text using the corpus rules (as in the EDA)
4. Extract the last n-gram (sizes 1, 2, 3)
5. Match the extracted n-gram against the names in the frequency matrices via regex (see the sketch after this list)
6. Choose the prediction according to the selected model
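A sketch of steps 4 and 5, assuming the bigram_freq table from the sketch above; predict_next is a hypothetical helper, and backing off to lower-order tables is left as a comment:
# Return the most frequent continuation of the last word of the input,
# using the names of the (already sorted) bigram frequency vector.
predict_next <- function(input, ngram_freq) {
  input <- gsub("[^a-z' ]", "", tolower(input))
  last_word <- tail(strsplit(trimws(input), "\\s+")[[1]], 1)
  hits <- ngram_freq[grepl(paste0("^", last_word, " "), names(ngram_freq))]
  if (length(hits) == 0) return(NA_character_)  # back off to a lower-order table here
  strsplit(names(hits)[1], " ")[[1]][2]         # second token of the top matching bigram
}
predict_next("I love", bigram_freq)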
This will be done once I resolve the memory errors described in the previous section.