JHS Data Science Capstone Milestone

Introduction

In this document we load data from three files, each containing a large amount of raw English text, and analyze it to get a sense of its structure and nature. Some cleaning is performed to improve this initial analysis.

The goal is to use the data from these three files (or some subset thereof) to create an algorithm that predicts the next word after a series of words is entered.

Data

The first step is to load the data. The Twitter data contains repeated lines, so those rows are removed to keep the word sequences in them from being weighted more heavily.

blog.data <- readLines("en_US.blogs.txt", encoding="UTF-8")
news.data <- readLines("en_US.news.txt", encoding="UTF-8")
twtr.data <- unique(readLines("en_US.twitter.txt", encoding="UTF-8"))
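To illustrate what the de-duplication step does, here is a minimal sketch on a toy vector (the values are invented, not taken from the Twitter file). Note that unique() compares lines exactly, so lines that differ only in capitalization are both kept.

```r
# Toy stand-in for the Twitter lines; the real vector comes from readLines().
tweets <- c("good morning", "thanks for the follow", "good morning",
            "Thanks for the follow", "good morning")

# duplicated() flags every repeat after the first occurrence.
n.repeats <- sum(duplicated(tweets))

# unique() keeps the first occurrence of each distinct line, in order.
tweets.unique <- unique(tweets)

n.repeats              # number of repeated lines removed
length(tweets.unique)  # number of lines kept
```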

Here are some high-level statistics.

Line counts:

##               Blog  News Twitter
## Line count: 899288 77259 2305923

Word counts:

##                 Blog    News  Twitter
## Word count: 37334131 2643969 30094542

Average word lengths:

##                       Blog     News  Twitter
## Mean Word Length: 4.563904 4.944346 4.414994
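The code behind these statistics is not shown, so the following is only a sketch of one way they could be computed in base R. The helper name text.summary and the toy lines are assumptions for illustration. Splitting on whitespace leaves punctuation attached to words, which makes this a rough measurement.

```r
# Hypothetical helper; the report does not show how its statistics were computed.
text.summary <- function(lines) {
  # Split each line on runs of whitespace and drop empty tokens.
  tokens <- unlist(strsplit(lines, "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  c(lines = length(lines),
    words = length(tokens),
    mean.word.length = mean(nchar(tokens)))
}

# Toy input (illustrative only, not from the data files).
toy.lines <- c("The quick brown fox", "jumps over", "the lazy dog.")
stats <- text.summary(toy.lines)
stats
```

Applied to blog.data, news.data, and twtr.data in turn, a helper like this would give one column of each table above.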

Exploratory Analysis

Knowing that the average word length is between 4 and 5 for all three data sets, it would be helpful to see what proportion of words have each length. If we randomly select 10,000 rows from each object, combine them into one object, and analyze all of the words, that will give us a very large sample to work with.

We will also strip punctuation so that we do not include, for example, the period attached to a word at the end of a sentence; the unnest_tokens function from tidytext used below does this by default when splitting text into words. (This could have been done for the word count measurements above, but that was a rougher measurement performed on much larger character vectors.)

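library(dplyr)     # %>%, count, mutate
library(tidytext)  # unnest_tokens
library(stringr)   # str_length

set.seed(97315)
row.sample <- c(sample(blog.data, 10000), sample(news.data, 10000), sample(twtr.data, 10000))
df.sample <- data.frame("sample" = row.sample)
# One row per word; unnest_tokens lowercases and strips punctuation by default
words <- unnest_tokens(tbl = df.sample, word, sample)
words.lengths <- str_length(words$word)
words.length.frequencies <- table(words.lengths)[1:9] / length(words.lengths)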

We have now taken the words from these 30,000 randomly selected entries, counted the length of each word, and computed the proportion of words of each length from 1 through 9 characters. Words of at most 9 characters make up just over 96% of all words, so this limit aims to balance an effective model against the performance cost of carrying too many words.
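One way to choose such a cutoff is to look at the cumulative share of words by length. The sketch below uses a toy word list and an arbitrary 85% target, both chosen for illustration. It is not the calculation used in this report.

```r
# Toy tokens; in the report this would be the `word` column of `words`.
toy.words <- c("a", "to", "the", "and", "that", "words", "sample",
               "analysis", "frequency", "exploratory")

lengths <- nchar(toy.words)

# Running share of words that are at most each observed length.
coverage <- cumsum(table(lengths)) / length(lengths)

# Smallest length whose cumulative share reaches the target.
target <- 0.85
cutoff <- as.integer(names(coverage)[which(coverage >= target)[1]])
cutoff
```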

See the following plot of the relative rates of occurrence of word lengths:
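A plot of this kind could be drawn with base R's barplot. The sketch below is computed from a toy word list, not from the report's sample, and the axis labels and title are assumptions.

```r
# Toy tokens standing in for the sampled words (not the report's data).
toy.words <- c("a", "i", "to", "of", "in", "the", "and", "for",
               "that", "word", "words", "sample")
toy.props <- table(nchar(toy.words)) / length(toy.words)

# barplot() returns the bar midpoints, which is handy for adding labels.
mids <- barplot(toy.props,
                xlab = "Word length (characters)",
                ylab = "Proportion of words",
                main = "Relative frequency of word lengths")
```

With the real data, words.length.frequencies would take the place of toy.props.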

More importantly, what about specific words? These 30,000 rows contain nearly 900,000 words and it would be useful to know how many of those are repeats of the 10, 100, or even 1000 most common unique words.

What are the 10 most common words, how many times do they each appear, and what is each word’s relative frequency in this sample?

divisor <- nrow(words)
# Count each distinct word, most frequent first, and add its share of all words
word.frequency <- words %>%
  count(word, name = "count", sort = TRUE) %>%
  mutate(freq = round(count / divisor, 4))
data.frame(word.frequency[1:10,])

##    word count   freq
## 1   the 44010 0.0497
## 2    to 23728 0.0268
## 3   and 22695 0.0256
## 4     a 21168 0.0239
## 5    of 18890 0.0213
## 6    in 14918 0.0169
## 7     i 13179 0.0149
## 8  that  9388 0.0106
## 9    is  9219 0.0104
## 10  for  9069 0.0102

Finally, we look at what share of all words in the sample is accounted for by just the 10, 100, and 1000 most frequent words.

top.10 <- sum(word.frequency$freq[1:10])
top.100 <- sum(word.frequency$freq[1:100])
top.1000 <- sum(word.frequency$freq[1:1000])

##               Top 10 Top 100 Top 1000
## Frequencies:  0.2103  0.4517    0.691
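The same idea can be turned around to ask how many words are needed to reach a given coverage. The sketch below uses invented counts and a hypothetical helper, words.needed. It works from raw counts rather than rounded frequencies, which avoids accumulating rounding error when summing many small values.

```r
# Toy word counts (illustrative only), sorted from most to least frequent.
toy.counts <- c(the = 50, to = 20, and = 10, a = 8, of = 6, fox = 3,
                dog = 2, zebra = 1)
toy.counts <- sort(toy.counts, decreasing = TRUE)

# Cumulative share of all tokens covered by the top-k words.
coverage <- cumsum(toy.counts) / sum(toy.counts)

# Hypothetical helper: smallest k whose top-k words reach `target` coverage.
words.needed <- function(coverage, target) which(coverage >= target)[1]

coverage[3]                  # share covered by the top 3 words
words.needed(coverage, 0.9)  # number of words needed for 90% coverage
```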

Plan

The purpose of this milestone report is to inform a plan for the final project. The clearest finding from this analysis is that these files contain a very large amount of data, as shown by the number of rows and words in particular.

Furthermore, nearly 70% of all words used come from the 1000 most frequent words, and over 96% of words are at most 9 characters long. These two features suggest that we can use a subset of the words in our predictive algorithm:

By word length

We will only need words of at most 9 characters to cover most possible suggestions.
Ignoring longer words also filters out likely errors and unhelpful suggestions, e.g. two five-letter words run together by a typographical error, geographical names, and other proper nouns.

By word frequency

Restricting both the input words and the suggested words to at most 1000 words will improve performance (there were over 50,000 unique words in just the 30,000-row sample in this report).
Less frequently used words are likely to be more obscure and specialized terms that we do not want to offer as suggestions to a wide audience.
We would need to have rules for words that are not in our dictionary regardless, so this does not create a new issue in algorithm design.
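The two restrictions above, together with a rule for out-of-dictionary words, could be sketched as follows. This is only one possible design. The toy frequency table, the helper to.vocab, and the "<unk>" placeholder are assumptions for illustration, and the 4-word cap stands in for the 1000-word cap.

```r
# Toy frequency table; names and values are illustrative only.
toy.freq <- data.frame(
  word  = c("the", "to", "and", "internationalization", "fox", "zebra"),
  count = c(50, 20, 10, 9, 3, 1),
  stringsAsFactors = FALSE
)

max.length <- 9   # length cutoff from the plan
max.words  <- 4   # stands in for the 1000-word cap

# Keep short words, then take the most frequent ones.
short <- toy.freq[nchar(toy.freq$word) <= max.length, ]
short <- short[order(short$count, decreasing = TRUE), ]
vocab <- head(short$word, max.words)

# Hypothetical rule for out-of-dictionary words: map them to one placeholder.
to.vocab <- function(tokens, vocab, unknown = "<unk>") {
  ifelse(tokens %in% vocab, tokens, unknown)
}

mapped <- to.vocab(c("the", "zebra", "and", "internationalization"), vocab)
mapped
```

Mapping every unknown word to a single placeholder means the prediction step only ever sees tokens from the restricted dictionary, whatever the user types.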

Darren Whitwood

2023-04-11
