Using the three sources (blogs, news, and Twitter), we computed the number of unique words, the overall word count in each file, and the most frequent words of various lengths, a few samples of which I have chosen to display in histogram format. Given the sheer volume of data, the final model will need an appropriate data representation to stay within the constraints of limited computer system resources.
library(stringr)
library(plotly)
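# Open a connection to each corpus file (paths are relative to the working directory)
con1 <- file("./en_US/en_US.twitter.txt")
con2 <- file("./en_US/en_US.blogs.txt")
con3 <- file("./en_US/en_US.news.txt")
# Read each file into a character vector, one element per line
data1 <- readLines(con1)
data2 <- readLines(con2)
data3 <- readLines(con3)
# Release the connections once the data is in memory
close(con1); close(con2); close(con3)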
fullData <- c(data1, data2, data3)
## [1] "The file 4 has 2360148 lines, and 30373543 words. Further there are 980 unique words."
## [1] "The file 5 has 899288 lines, and 37334131 words. Further there are 980 unique words."
## [1] "The file 6 has 77259 lines, and 2643969 words. Further there are 980 unique words."
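The code that produced the per-file summaries above is not shown. As a minimal sketch of how such figures could be computed, assuming whitespace tokenization with stringr, the helper below returns line, word, and unique-word counts for one character vector. The name `file_stats` and the `demo` input are hypothetical, and the original tokenization rules are unknown, so this would not necessarily reproduce the numbers reported here.

```r
library(stringr)

# Hypothetical helper: line, word, and unique-word counts for one file's lines
file_stats <- function(lines) {
  # Split each line on runs of whitespace and drop empty tokens
  tokens <- unlist(str_split(str_trim(lines), "\\s+"))
  tokens <- tokens[tokens != ""]
  list(
    lines  = length(lines),
    words  = length(tokens),
    unique = length(unique(tokens))
  )
}

# Small stand-in input so the sketch runs on its own
demo <- c("a b c", "a b", "")
s <- file_stats(demo)
print(sprintf(
  "The file has %d lines, and %d words. Further there are %d unique words.",
  s$lines, s$words, s$unique
))
```

Because the unique count is derived inside the helper from that file's own tokens, calling it once per file (for example on `data1`, `data2`, and `data3`) keeps the three results independent of one another.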
print(sprintf("The length of the combined number of lines is %d.", length(fullData)))
## [1] "The length of the combined number of lines is 3336695."
## [1] "Word Count:"
## [1] 70351643
## [1] "Unique Words:"
## [1] 867231
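The `words` vector used in the next step is not defined in the code shown. One way it could be built, offered only as a sketch, is to lowercase every line and extract runs of letters and apostrophes with stringr. The tokenization pattern and the `sample_lines` stand-in for `fullData` are assumptions, not the method behind the counts above.

```r
library(stringr)

# Hypothetical stand-in for fullData so the sketch runs on its own
sample_lines <- c("The quick brown fox", "the lazy dog; the END!")

# Lowercase each line, then extract runs of letters and apostrophes as tokens
words <- unlist(str_extract_all(str_to_lower(sample_lines), "[a-z']+"))

word_count   <- length(words)          # total tokens
unique_count <- length(unique(words))  # distinct tokens

print("Word Count:")
print(word_count)
print("Unique Words:")
print(unique_count)
```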
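# Keep only words longer than min_length characters (i.e. 4 characters or more)
min_length <- 3
word_counts_4 <- table(words[nchar(words) > min_length])
total_words_4 <- sum(word_counts_4)
# Express each word's count as a percentage of all retained words
word_ratios_4 <- 100 * word_counts_4 / total_words_4
# Take the 20 most frequent words; as.data.frame() yields columns Var1 and Freq
top_words_4 <- as.data.frame(sort(word_ratios_4, decreasing = TRUE)[1:20])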
plot_ly(top_words_4, x = ~Var1, y = ~Freq, type = "bar") %>%
  layout(
    title = "Top 20 Word Occurrences - 4 Characters or Greater",
    xaxis = list(title = "Words", tickangle = -45),
    yaxis = list(title = "Frequency (%)")
  )
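On the data-representation concern raised in the opening paragraph, one option (a sketch, not a decision made in this report) is to store each distinct word once with its count instead of keeping the full token vector. The `tokens` vector and the `freq` data frame below are hypothetical illustrations.

```r
# Hypothetical token vector with heavy repetition, standing in for `words`
tokens <- rep(c("the", "quick", "fox"), times = 1000)

# Collapse to one row per distinct word with its count
counts <- table(tokens)
freq <- data.frame(
  word = names(counts),
  n    = as.integer(counts),
  stringsAsFactors = FALSE
)
freq <- freq[order(-freq$n, freq$word), ]

# Compare the in-memory size of the raw tokens and the frequency table
print(object.size(tokens), units = "Kb")
print(object.size(freq), units = "Kb")
```

The frequency table grows with the number of distinct words, not with the total word count, which is the property that matters for a corpus of this size.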