In this document we load data from three files, each containing a large amount of raw English text, and analyze it to get an idea of its structure and nature. Some cleaning is performed to improve this initial analysis.
The goal is to use the data from these three files (or some subset of it) to build an algorithm that predicts the next word after a series of words has been entered.
The first step is to load the data. The Twitter data contains repeated lines, so duplicates are removed to keep the word sequences in them from being weighted more heavily.
blog.data <- readLines("en_US.blogs.txt", encoding="UTF-8")
news.data <- readLines("en_US.news.txt", encoding="UTF-8")
twtr.data <- unique(readLines("en_US.twitter.txt", encoding="UTF-8"))
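To gauge how much the deduplication changes the data, one could compare line counts before and after `unique()`. The following is a minimal sketch on a toy vector; `toy.lines` is a hypothetical stand-in for the raw Twitter lines, not data from the files above.

```r
# Hypothetical stand-in for the raw lines read from en_US.twitter.txt
toy.lines <- c("good morning", "so excited", "good morning", "lol", "so excited")

# unique() keeps the first occurrence of each line, in the original order
deduped <- unique(toy.lines)

# Number of repeated lines that were dropped
n.removed <- length(toy.lines) - length(deduped)
n.removed
```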
Here are some high-level statistics.
Line counts:
##               Blog   News Twitter
## Line count: 899288  77259 2305923
Word counts:
##                 Blog    News  Twitter
## Word count: 37334131 2643969 30094542
Average word lengths:
##                       Blog     News  Twitter
## Mean Word Length: 4.563904 4.944346 4.414994
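The code behind these figures is not shown. One way to compute comparable numbers is sketched below with base R, assuming a rough whitespace tokenization that leaves punctuation attached to words. `toy.data` is a hypothetical stand-in for one of the character vectors (for example `blog.data`), and this is not necessarily the method used for the numbers above.

```r
# Hypothetical stand-in for one of the loaded character vectors
toy.data <- c("The quick brown fox", "jumps over", "the lazy dog")

# Rough tokenization: split each line on runs of whitespace
tokens <- unlist(strsplit(toy.data, "\\s+"))
tokens <- tokens[nzchar(tokens)]  # drop empty strings caused by leading whitespace

line.count <- length(toy.data)           # number of lines
word.count <- length(tokens)             # number of whitespace-separated words
mean.word.length <- mean(nchar(tokens))  # average characters per word

c(lines = line.count, words = word.count, mean.length = mean.word.length)
```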
Since the average word length is between 4 and 5 for all three data sets, it is helpful to see what proportion of words has each length. Randomly selecting 10,000 rows from each object, combining them into one object, and analyzing all of the words gives us a very large sample to work with.
The words are extracted with unnest_tokens from the tidytext package, which lowercases the text and strips punctuation by default, so that, for example, a period attached to the last word of a sentence is not counted. (The same cleaning could have been applied to the word counts above, but those were rougher measurements made on much larger character vectors.)
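library(tidytext)  # unnest_tokens
library(stringr)   # str_length
set.seed(97315)
# Draw 10,000 lines from each source and pool them
row.sample <- c(sample(blog.data, 10000), sample(news.data, 10000), sample(twtr.data, 10000))
df.sample <- data.frame(sample = row.sample, stringsAsFactors = FALSE)
# One row per word; unnest_tokens lowercases and strips punctuation by default
words <- unnest_tokens(tbl = df.sample, output = word, input = sample)
words.lengths <- str_length(words$word)
# Proportion of words with lengths 1 through 9
words.length.frequencies <- table(words.lengths)[1:9] / length(words.lengths)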
We have now pooled these 30,000 randomly selected entries from the three data sources, split them into words, measured the length of each word, and computed the proportion of words at each length from 1 through 9 characters. The cutoff of 9 characters covers just over 96% of all words; it is meant to keep the model effective without carrying so many words that performance suffers.
See the following plot of the relative rates of occurrence of word lengths:
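A bar chart of this kind could be produced along the following lines. This is a minimal sketch using base R graphics on a toy vector (`toy.words` is a hypothetical stand-in for `words$word`), not necessarily the code behind the original figure.

```r
# Hypothetical word sample; in the report this would be words$word
toy.words <- c("the", "to", "and", "a", "of", "in", "i", "that", "is", "for")

# Proportion of words at each length
len.freq <- table(nchar(toy.words)) / length(toy.words)

# Bar chart of relative frequency by word length
barplot(len.freq,
        xlab = "Word length (characters)",
        ylab = "Proportion of words",
        main = "Relative frequency of word lengths")
```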
More importantly, what about specific words? These 30,000 rows contain nearly 900,000 words, and it would be useful to know what share of them is accounted for by the 10, 100, or even 1,000 most common distinct words.
What are the 10 most common words, how many times does each appear, and what is each word’s relative frequency in this sample?
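library(dplyr)  # %>%, group_by, mutate, distinct
divisor <- nrow(words)  # total number of word tokens in the sample
# Count each distinct word and express the count as a share of all tokens
word.frequency <- words %>%
  group_by(word) %>%
  mutate(count = n()) %>%
  distinct(word, count) %>%
  mutate(freq = round(count / divisor, 4))
# Sort from most to least frequent and show the top ten
word.frequency <- word.frequency[order(word.frequency$count, decreasing = TRUE), ]
data.frame(word.frequency[1:10, ])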
##    word count   freq
## 1   the 44010 0.0497
## 2    to 23728 0.0268
## 3   and 22695 0.0256
## 4     a 21168 0.0239
## 5    of 18890 0.0213
## 6    in 14918 0.0169
## 7     i 13179 0.0149
## 8  that  9388 0.0106
## 9    is  9219 0.0104
## 10  for  9069 0.0102
Finally, we look at what share of all words in the sample is accounted for by just the 10, 100, and 1,000 most frequent words.
top.10 <- sum(word.frequency$freq[1:10])
top.100 <- sum(word.frequency$freq[1:100])
top.1000 <- sum(word.frequency$freq[1:1000])
##              Top 10 Top 100 Top 1000
## Frequencies: 0.2103  0.4517    0.691
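The same idea can be turned around to ask how many of the most frequent words are needed to reach a given coverage. The sketch below uses a cumulative sum over raw counts, which also avoids accumulating the rounding in the `freq` column. `toy.counts` and the 80% threshold are hypothetical values chosen for illustration.

```r
# Hypothetical counts, already sorted from most to least frequent
toy.counts <- c(the = 50, to = 20, and = 15, a = 10, of = 5)

# Cumulative share of all word occurrences covered by the top-k words
coverage <- cumsum(toy.counts) / sum(toy.counts)

# Smallest k whose top-k words cover at least 80% of occurrences
k.80 <- which(coverage >= 0.80)[1]
k.80
```

With the real data, `toy.counts` would be replaced by `word.frequency$count` after sorting in decreasing order.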
The purpose of this milestone report is to inform a plan for the final project. The clearest finding from this analysis is that these files contain a very large amount of data, as the line and word counts above show.
Furthermore, nearly 70% of all word occurrences come from the 1,000 most used words, and over 96% of words are at most 9 characters long. These two features suggest that we can use a subset of the words in our predictive algorithm: