I want to look at the size of the files, namely the number of lines in each. Since the Twitter feed consists of “normal” conversational language, it should be weighted heavily. The news corpus is weighted less for this purpose because it contains more formal writing, although much of its content is echoed in everyday conversation. Blogs sit in the middle, but may contain more unique words than ordinary conversation. Something on the order of a 10:6:1 ratio (Twitter : blogs : news) seems appropriate.
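As a quick check on those relative sizes, the raw line counts can be computed directly; this is a minimal sketch, assuming the three files sit in the working directory (each file is read fully just to count its lines, and suppressWarnings() hides the read warnings discussed below).
# Count the number of lines in each raw file
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
sapply(files, function(f) length(suppressWarnings(readLines(f))))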
# Read the blog corpus and summarise the number of characters per line
blogs <- file("en_US.blogs.txt", "r")
blogs_lines <- readLines(blogs)
close(blogs)
summary(nchar(blogs_lines))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 47.0 157.0 231.7 331.0 40835.0
# Read the news corpus (readLines() warns that the file's final line has no
# end-of-line terminator)
news <- file("en_US.news.txt", "r")
news_lines <- readLines(news)
## Warning in readLines(news): incomplete final line found on 'en_US.news.txt'
close(news)
summary(nchar(news_lines))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 111 186 203 270 5760
# Read the Twitter corpus; a handful of lines contain embedded nul characters
# (readLines(..., skipNul = TRUE) would suppress these warnings)
twitter <- file("en_US.twitter.txt", "r")
twitter_lines <- readLines(twitter)
## Warning in readLines(twitter): line 167155 appears to contain an embedded nul
## Warning in readLines(twitter): line 268547 appears to contain an embedded nul
## Warning in readLines(twitter): line 1274086 appears to contain an embedded nul
## Warning in readLines(twitter): line 1759032 appears to contain an embedded nul
close(twitter)
summary(nchar(twitter_lines))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 37.0 64.0 68.8 100.0 213.0
I will take all of the Twitter lines, most (90%) of the news lines, and half of the blog lines.
# Down-sample blogs (50%) and news (90%); keep all of the Twitter lines
set.seed(1234)
blogs_lines <- sample(blogs_lines, length(blogs_lines) * 0.5)
news_lines <- sample(news_lines, length(news_lines) * 0.9)
# Approximate in-memory size of each sample
sapply(list(blogs_lines, news_lines, twitter_lines), function(x){format(object.size(x), "MB")})
## [1] "127.7 Mb" "17.8 Mb" "319 Mb"
library(stringi)  # stri_stats_general() below comes from the stringi package
sapply(list(blogs_lines, news_lines, twitter_lines), stri_stats_general)
## [,1] [,2] [,3]
## Lines 449644 69533 2360148
## LinesNEmpty 449644 69533 2360148
## Chars 104171581 14121974 162384825
## CharsNWhite 85952345 11811375 134370864
I want to use one tenth of that combined sample.
# Take a 10% random sample from each source and combine them into one vector
sampleTtext <- c(sample(blogs_lines, length(blogs_lines) * 0.1),
                 sample(news_lines, length(news_lines) * 0.1),
                 sample(twitter_lines, length(twitter_lines) * 0.1))
stri_stats_general(sampleTtext)
## Lines LinesNEmpty Chars CharsNWhite
## 287931 287931 27995022 23152314
wordcount(sampleTtext)  # wordcount() is assumed to come from the 'ngram' package
## [1] 5130551
# Free memory: the full per-source samples are no longer needed
remove(blogs_lines, news_lines, twitter_lines)
Cleaning the data
# Normalise case, strip punctuation, digits and apostrophes, then drop any
# remaining non-ASCII characters
sampleTtext <- tolower(sampleTtext)
sampleTtext <- gsub("[[:punct:]]", "", sampleTtext)
sampleTtext <- gsub("[[:digit:]]", "", sampleTtext)
sampleTtext <- gsub("'", "", sampleTtext)
sampleTtext <- iconv(sampleTtext, "latin1", "ASCII", sub = "")
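For reuse later (for example on the held-out testing and validation text mentioned at the end of this report), the same steps can be wrapped in a small helper. This is only a sketch; clean_text() is a name introduced here, not one used elsewhere in the analysis.
# Hypothetical helper applying the same cleaning steps as above, so the
# identical transformation can be reapplied to other text samples
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:digit:]]", "", x)
  x <- gsub("'", "", x)
  x <- iconv(x, "latin1", "ASCII", sub = "")
  x
}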
Getting the n-gram lists
# Build a quanteda corpus, tokenise, and pull the 40 most frequent
# 1-, 2- and 3-grams from their document-feature matrices
library(quanteda)
corpus_1 <- corpus(sampleTtext)
token_1 <- tokens(corpus_1, remove_punct = TRUE)
ngramUni <- tokens_ngrams(token_1, n = 1)
top40NgramUni <- topfeatures(dfm(ngramUni), 40)
ngramBi <- tokens_ngrams(token_1, n = 2)
top40NgramBi <- topfeatures(dfm(ngramBi), 40)
ngramTri <- tokens_ngrams(token_1, n = 3)
top40NgramTRI <- topfeatures(dfm(ngramTri), 40)
# Drop the intermediate token and n-gram objects to free memory
remove(token_1, ngramUni, ngramBi, ngramTri)
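For readers unfamiliar with the feature names produced by tokens_ngrams(), adjacent tokens are joined with an underscore by default; a tiny self-contained illustration (independent of the corpus built above, quanteda already attached):
# What tokens_ngrams() produces for a toy sentence
toy <- tokens("this is a test")
unlist(as.list(tokens_ngrams(toy, n = 2)), use.names = FALSE)
# e.g. "this_is" "is_a" "a_test"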
Frequency of the most common single words
barplot(height = top40NgramUni, names.arg = names(top40NgramUni),
las = 2, main = "Frequency of the most common single words")
Frequency of the most common 2-gram words
barplot(height = top40NgramBi, names.arg = names(top40NgramBi),
las = 2, main = "Frequency of the most common 2-gram words")
Frequency of the most common 3-gram words
barplot(height = top40NgramTRI, names.arg = names(top40NgramTRI),
las = 2, main = "Frequency of the most common 3-gram words")
The next steps will be to set aside separate testing and validation sets, build a predictive algorithm based on a similar n-gram model, and then deploy that algorithm in a Shiny app.
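As a rough illustration of how such a model could work, the bigram frequencies computed above can already drive a naive next-word lookup. This is only a sketch: predict_next() is a hypothetical helper, not the planned algorithm, which will need a fuller frequency table and a backoff strategy rather than just the top 40 bigrams.
# Hypothetical next-word lookup from a named bigram frequency vector
# (names like "of_the", values = counts), such as top40NgramBi above
predict_next <- function(last_word, bigram_freq, n = 3) {
  prefix <- paste0(last_word, "_")
  candidates <- bigram_freq[startsWith(names(bigram_freq), prefix)]
  if (length(candidates) == 0) return(character(0))
  candidates <- sort(candidates, decreasing = TRUE)
  # Strip the "lastword_" prefix, leaving only the predicted words
  substring(names(head(candidates, n)), nchar(prefix) + 1)
}

# Example: candidate words to follow "of" in the sample
# predict_next("of", top40NgramBi)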