Milestone Report

Summary

Using the three sources (blogs, news, and Twitter), I computed the number of lines, the overall word count, and the number of unique words in each file, along with the most frequent words of various minimum lengths; a sample of these is displayed as a bar chart. Given the sheer volume of data, the final model will need an appropriate data representation to stay within the constraints of limited computing resources.

Load necessary libraries

library(stringr)
library(plotly)

Load data into a character vector

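# Open a connection to each source file
con1 <- file("./en_US/en_US.twitter.txt")
con2 <- file("./en_US/en_US.blogs.txt")
con3 <- file("./en_US/en_US.news.txt")
# Read every line of each file into a character vector
data1 <- readLines(con1)
data2 <- readLines(con2)
data3 <- readLines(con3)
# Close the connections now that the data is in memory
close(con1); close(con2); close(con3)
# Combine the three sources into a single character vector
fullData <- c(data1, data2, data3)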

## [1] "The file 4 has 2360148 lines, and 30373543 words. Further there are 980 unique words."

## [1] "The file 5 has 899288 lines, and 37334131 words. Further there are 980 unique words."

## [1] "The file 6 has 77259 lines, and 2643969 words. Further there are 980 unique words."
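
The code that produced the per-file summaries above is not shown in this report. A minimal sketch of how such figures could be computed follows, assuming a simple tokeniser that lower-cases the text and splits on anything other than letters and apostrophes. The helper corpus_stats and the small stand-in vectors are hypothetical; in the report the inputs would be data1, data2, and data3, and the resulting counts depend on the tokenisation rule chosen.

```r
# Hypothetical helper: summarise one corpus given as a character vector of lines.
corpus_stats <- function(lines) {
  # Assumed tokenisation: lower-case, then split on runs of non-letter characters
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]  # drop empty strings left by leading separators
  c(
    lines        = length(lines),
    words        = length(tokens),
    unique_words = length(unique(tokens))
  )
}

# Small stand-in vectors; the report's data1, data2 and data3 would go here.
corpora <- list(
  twitter = c("so happy today", "Happy happy day!"),
  blogs   = c("The cat sat on the mat."),
  news    = c("Rates rose again.", "Rates fell.")
)

# One row per source, one column per statistic
stats <- t(sapply(corpora, corpus_stats))
print(stats)
```

Because the unique-word count is computed separately for each vector, it would normally differ from one source to the next.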

Count the combined number of lines

print(sprintf("The length of the combined number of lines is %d.", length(fullData)))

## [1] "The length of the combined number of lines is 3336695."

Extract all words from the combined data and output the number of unique words

## [1] "Word Count:"

## [1] 70351643

## [1] "Unique Words:"

## [1] 867231
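
The extraction step itself is not shown here, although the plotting code in the next section relies on a vector named words. One way to build such a vector is sketched below using base R pattern matching. The pattern `[a-z']+` and the stand-in input sample_lines are assumptions made for illustration, so the counts it yields on the full data need not match the figures reported above.

```r
# Stand-in input; the report's fullData would be used in its place.
sample_lines <- c("The quick brown fox.", "the lazy dog; THE END")

# Assumed rule: a word is a run of lower-case letters or apostrophes
lowered <- tolower(sample_lines)
words <- unlist(regmatches(lowered, gregexpr("[a-z']+", lowered)))

print("Word Count:")
print(length(words))
print("Unique Words:")
print(length(unique(words)))
```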

Plot the most frequent words of four or more characters as a percentage of all such words

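# Keep words longer than min_length characters (here, 4 or more)
min_length <- 3
word_counts_4 <- table(words[nchar(words) > min_length])
# Express each word's count as a percentage of all retained words
total_words_4 <- sum(word_counts_4)
word_ratios_4 <- 100 * word_counts_4 / total_words_4
# Take the 20 most frequent words as a data frame with columns Var1 and Freq
top_words_4 <- as.data.frame(sort(word_ratios_4, decreasing = TRUE)[1:20])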

plot_ly(top_words_4, x = ~Var1, y = ~Freq, type = "bar") %>%
  layout(
    title = "Top 20 Word Occurrences - 4 or More Characters",
    xaxis = list(title = "Words", tickangle = -45),
    yaxis = list(title = "Frequency (%)")
  )
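
Since the summary mentions frequent words of various lengths, the same calculation can be wrapped in a function so that other length thresholds are easy to try. The sketch below is one possible arrangement and not the code used for this report: the helper top_word_ratios, its arguments, and the stand-in vector sample_words are all hypothetical.

```r
# Hypothetical helper: percentage share of the n most frequent words
# that have at least min_chars characters.
top_word_ratios <- function(words, min_chars, n = 20) {
  kept <- words[nchar(words) >= min_chars]
  counts <- table(kept)
  ratios <- data.frame(
    word    = names(counts),
    percent = 100 * as.vector(counts) / length(kept),
    stringsAsFactors = FALSE
  )
  # Most frequent first; head() returns fewer rows if fewer than n words qualify
  ratios <- ratios[order(-ratios$percent), ]
  head(ratios, n)
}

# Stand-in vector; the report's words vector would be used in its place.
sample_words <- c("that", "with", "that", "about", "the",
                  "and", "that", "with", "there", "a")

# Compare two thresholds: 4 or more and 5 or more characters
for (k in c(4, 5)) {
  print(top_word_ratios(sample_words, min_chars = k, n = 3))
}
```

Each returned data frame could then be passed to plot_ly in the same way as top_words_4, with x = ~word and y = ~percent.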