Data Science Capstone - Milestone Report

I want to look at the size of the files, namely the number of lines. Since the Twitter feed contains "normal" conversational language, it should be weighted heavily. The news data is weighted less for this purpose, as it contains more formal writing, although much of that content also shows up in everyday conversation. Blogs fall in the middle, but may contain more unique words than ordinary conversation. Something on the order of a 10:6:1 ratio seems appropriate.
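As a rough check on that weighting, the line counts of the three files can be compared without reading them fully into memory. A minimal sketch, assuming the R.utils package is installed and the files sit in the working directory:

library(R.utils)  # assumed available; countLines() streams through each file
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
line_counts <- sapply(files, countLines)
round(line_counts / min(line_counts), 1)  # rough size ratio across the sources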

library(stringi)   # stri_stats_general()
library(ngram)     # wordcount()
library(quanteda)  # corpus(), tokens(), tokens_ngrams(), dfm(), topfeatures()

blogs<-file("en_US.blogs.txt","r")
blogs_lines<-readLines(blogs)
close(blogs)
summary(nchar(blogs_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    47.0   157.0   231.7   331.0 40835.0
news<-file("en_US.news.txt","r")
news_lines<-readLines(news)
## Warning in readLines(news): incomplete final line found on 'en_US.news.txt'
close(news)
summary(nchar(news_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2     111     186     203     270    5760
twitter<-file("en_US.twitter.txt","r")
twitter_lines<-readLines(twitter)
## Warning in readLines(twitter): line 167155 appears to contain an embedded nul
## Warning in readLines(twitter): line 268547 appears to contain an embedded nul
## Warning in readLines(twitter): line 1274086 appears to contain an embedded nul
## Warning in readLines(twitter): line 1759032 appears to contain an embedded nul
close(twitter)
summary(nchar(twitter_lines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.8   100.0   213.0
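The warnings above are harmless for this analysis, but they can be avoided with base readLines() arguments alone. A sketch, shown for reference only (the counts reported below come from the reads above, and twitter_clean is a hypothetical name not used again): opening the connection in binary mode ("rb") suppresses the "incomplete final line" warning, and skipNul = TRUE drops the embedded nuls.

con <- file("en_US.twitter.txt", open = "rb")  # "rb" avoids the final-line warning
twitter_clean <- readLines(con, encoding = "UTF-8", skipNul = TRUE)  # drop embedded nuls
close(con)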

I will keep all of the Twitter data, most of the news, and half of the blogs.

set.seed(1234)
# floor() makes the fractional sample sizes explicit integers
blogs_lines <- sample(blogs_lines, floor(length(blogs_lines) * 0.5))
news_lines <- sample(news_lines, floor(length(news_lines) * 0.9))
sapply(list(blogs_lines, news_lines, twitter_lines), function(x){format(object.size(x), "MB")})
## [1] "127.7 Mb" "17.8 Mb"  "319 Mb"
sapply(list(blogs_lines, news_lines, twitter_lines), stri_stats_general)
##                  [,1]     [,2]      [,3]
## Lines          449644    69533   2360148
## LinesNEmpty    449644    69533   2360148
## Chars       104171581 14121974 162384825
## CharsNWhite  85952345 11811375 134370864
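For readability, the columns can be labeled (they are the blogs, news, and Twitter samples, in that order); a small sketch:

stats <- sapply(list(blogs_lines, news_lines, twitter_lines), stri_stats_general)
colnames(stats) <- c("blogs", "news", "twitter")  # label the unnamed columns
stats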

I will use one tenth of that for the exploratory analysis.

sampleTtext <- c(sample(blogs_lines, floor(length(blogs_lines) * 0.1)),
                 sample(news_lines, floor(length(news_lines) * 0.1)),
                 sample(twitter_lines, floor(length(twitter_lines) * 0.1)))
stri_stats_general(sampleTtext)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      287931      287931    27995022    23152314
wordcount(sampleTtext)
## [1] 5130551
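That works out to roughly 17.8 words per line on average across the sample:

wordcount(sampleTtext) / length(sampleTtext)  # ~17.8 words per line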
# free the full datasets now that the sample is drawn
rm(blogs_lines, news_lines, twitter_lines)
gc()

Cleaning the data

sampleTtext <- tolower(sampleTtext)
# strip non-ASCII characters first so curly quotes and the like
# do not survive the punctuation pass below
sampleTtext <- iconv(sampleTtext, "latin1", "ASCII", sub = "")
sampleTtext <- gsub("[[:punct:]]", "", sampleTtext)  # punctuation (incl. apostrophes)
sampleTtext <- gsub("[[:digit:]]", "", sampleTtext)  # digits
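A quick sanity check that the cleaning did what was intended; both expressions should return FALSE:

any(grepl("[[:punct:]]", sampleTtext))  # any punctuation left?
any(grepl("[[:digit:]]", sampleTtext))  # any digits left?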

Getting the n-gram lists

# build one corpus and tokenize once; each n-gram order reuses the tokens
corpus_1 <- corpus(sampleTtext)
token_1 <- tokens(corpus_1, remove_punct = TRUE)  # punctuation already stripped; harmless safeguard
ngramUni <- tokens_ngrams(token_1, n = 1)
top40NgramUni <- topfeatures(dfm(ngramUni), 40)
ngramBi <- tokens_ngrams(token_1, n = 2)
top40NgramBi <- topfeatures(dfm(ngramBi), 40)
ngramTri <- tokens_ngrams(token_1, n = 3)
top40NgramTRI <- topfeatures(dfm(ngramTri), 40)
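# An alternative tabulation (a sketch; assumes the quanteda.textstats
# package is installed): textstat_frequency() gives the same counts as a
# tidy data frame, which is convenient for plotting.
library(quanteda.textstats)
head(textstat_frequency(dfm(ngramBi), n = 40))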
rm(token_1, ngramUni, ngramBi, ngramTri)

Plotting the n-grams

Frequency of the most common single words

barplot(height = top40NgramUni, names.arg = names(top40NgramUni),
        las = 2, main = "Frequency of the most common single words")

Frequency of the most common 2-grams

barplot(height = top40NgramBi, names.arg = names(top40NgramBi),
        las = 2, main = "Frequency of the most common 2-grams")

Frequency of the most common 3-grams

barplot(height = top40NgramTRI, names.arg = names(top40NgramTRI),
        las = 2, main = "Frequency of the most common 3-grams")
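With 40 bars the base-graphics labels get cramped; a sketch of an alternative using ggplot2 (assumed installed), flipping the axes so the labels stay readable:

library(ggplot2)
tri_df <- data.frame(ngram = reorder(names(top40NgramTRI), top40NgramTRI),
                     freq  = unname(top40NgramTRI))
ggplot(tri_df, aes(x = ngram, y = freq)) +
  geom_col() +
  coord_flip() +
  labs(title = "Frequency of the most common 3-grams", x = NULL, y = "Count")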

The next step will be to split the data into training, testing, and validation sets, build a predictive algorithm using a similar n-gram model, and then deploy the algorithm in a Shiny app.