Purpose

The purpose of this document is to outline my progress in analyzing the data for the Capstone project in Coursera’s Data Science Specialization. It shows the steps I took to load the data, clean it, and develop a few exploratory plots to get a sense of the data sets. Finally, I outline the basic idea behind my predictive text model.

Data Access and Cleaning

I began by reading each of the three files into R as a data frame. This let me look at the size of each file, along with its number of lines. Then, using the tidytext package, I counted the actual number of words in each file. These results are summarized in the table below.

library(tidytext)   # unnest_tokens()
library(dplyr)      # count(), anti_join(), mutate()
library(ggplot2)    # exploratory plots

# Read each corpus into a one-column data frame, one line of text per row
path <- getwd()
twitterFrame <- data.frame(text = readLines(file.path(path, "final/en_US/en_US.twitter.txt")))
blogsFrame <- data.frame(text = readLines(file.path(path, "final/en_US/en_US.blogs.txt")))
newsFrame <- data.frame(text = readLines(file.path(path, "final/en_US/en_US.news.txt")))

# Tokenize each corpus into single words
twitterWords <- unnest_tokens(twitterFrame, word, text)
blogWords <- unnest_tokens(blogsFrame, word, text)
newsWords <- unnest_tokens(newsFrame, word, text)
##      Name Size(Mb)   Lines    Words
## 1 Twitter 326645.9 2360148 30093372
## 2   Blogs 261483.7  899288 37546239
## 3    News 263517.3 1010242 34762395
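
The code above does not itself produce the summary table; below is a minimal sketch of how such a table could be assembled. The use of object.size() and the division by 1024 are my assumptions about how the size column was computed, not something taken from the original analysis.

# A sketch of how a summary table like the one above could be built; the size
# units depend on the divisor chosen and may not match the table exactly.
sizes <- sapply(list(twitterFrame, blogsFrame, newsFrame),
                function(x) as.numeric(object.size(x)))
summaryTable <- data.frame(
  Name  = c("Twitter", "Blogs", "News"),
  Size  = round(sizes / 1024, 1),                                   # in-memory size (assumed units)
  Lines = c(nrow(twitterFrame), nrow(blogsFrame), nrow(newsFrame)), # one row per line of text
  Words = c(nrow(twitterWords), nrow(blogWords), nrow(newsWords))   # one row per token
)
summaryTable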

Next, I removed all of the swear words using the sweary package.

library(sweary)   # get_swearwords()

# Remove profanity from each tokenized corpus using sweary's English word list
swears <- get_swearwords("en")
twitterClean <- twitterWords %>% anti_join(swears, by = "word")
blogClean <- blogWords %>% anti_join(swears, by = "word")
newsClean <- newsWords %>% anti_join(swears, by = "word")

Exploratory Analysis

Here is a quick look at the top 20 words in the combined data set. Note that they are all stop words.

# Per-source word counts (not used further below, kept for reference)
un1 <- twitterWords %>% count(word, sort = TRUE)
un2 <- blogWords %>% count(word, sort = TRUE)
un3 <- newsWords %>% count(word, sort = TRUE)

# Combine the three cleaned corpora and rank words by overall frequency
uniques <- do.call("rbind", list(twitterClean, blogClean, newsClean))
finalList <- uniques %>%
  count(word, sort = TRUE) %>%
  mutate(rank = row_number(),
         percent = n * 100 / sum(n),
         cumul = cumsum(percent))
head(finalList, n = 20)
##    word       n rank   percent     cumul
## 1   the 4771927    1 4.6643677  4.664368
## 2    to 2764230    2 2.7019242  7.366292
## 3   and 2422450    3 2.3678479  9.734140
## 4     a 2389755    4 2.3358899 12.070030
## 5    of 2010936    5 1.9656095 14.035639
## 6    in 1657973    6 1.6206023 15.656241
## 7     i 1657335    7 1.6199786 17.276220
## 8   for 1103087    8 1.0782234 18.354443
## 9    is 1075727    9 1.0514801 19.405923
## 10 that 1042522   10 1.0190235 20.424947
## 11  you  943299   11 0.9220370 21.346984
## 12   it  919081   12 0.8983649 22.245349
## 13   on  824414   13 0.8058317 23.051181
## 14 with  715023   14 0.6989064 23.750087
## 15  was  624700   15 0.6106192 24.360706
## 16   my  604889   16 0.5912548 24.951961
## 17   at  573144   17 0.5602253 25.512186
## 18   be  549771   18 0.5373791 26.049565
## 19 this  545290   19 0.5329992 26.582565
## 20 have  530991   20 0.5190225 27.101587

These twenty stop words account for just over a quarter of the text. To explore further, I created a cumulative frequency plot, which shows that 154 words make up half of the text and 7916 words make up 95% of the text.

# Cumulative word frequency by rank, with reference lines at the 50% and 90% marks
cutoff <- finalList %>% filter(cumul <= 95)
cutoff %>% ggplot(aes(x = rank, y = cumul)) +
  geom_point() + geom_vline(xintercept = c(which.min(abs(finalList$cumul - 50)),
                                           which.min(abs(finalList$cumul - 90)))) +
  geom_hline(yintercept = c(50, 90)) +
  scale_x_continuous(breaks = c(which.min(abs(finalList$cumul - 50)), which.min(abs(finalList$cumul - 90)))) +
  scale_y_continuous(breaks = c(0, 25, 50, 70, 90, 95))

It is not surprising that many of the most frequent words are stop words, as the output below shows. The 54 most popular words are all stop words, and only four of the words in the top 100 are not stop words.

# Remove stop words (tidytext's stop_words lexicon) to see the most common content words
cutoff %>% anti_join(stop_words, by = "word") %>% select(word, n, rank) %>% head()
##     word      n rank
## 1   time 224774   55
## 2    day 175983   72
## 3   love 161651   75
## 4 people 159280   76
## 5      2 106860  112
## 6      3 104152  116

As far as prediction goes, single-word frequencies are not very powerful on their own. My plan is to use bigrams, trigrams, and quadrigrams to predict the next word via Katz’s back-off model. Essentially, the process goes as follows (a rough sketch of the lookup appears after the list):

  1. Check whether the previous three words match the first three words of a quadrigram. If so, recommend the quadrigram's fourth word.
  2. If there are fewer than three words of context, or no matching quadrigram exists, repeat the lookup using the trigram list.
  3. If that also fails, repeat using the bigram list.
  4. Otherwise, recommend the most popular single word.
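
Below is a minimal sketch of this lookup, under assumed inputs: quadgramCounts, trigramCounts, and bigramCounts are hypothetical frequency tables with columns prefix, nextWord, and n, and unigramCounts has columns word and n. It is a plain back-off without Katz's discounting, so it only illustrates the order of the lookups, not the final model.

# A sketch of the back-off lookup described above. The four frequency tables
# are assumed (hypothetical) inputs; no smoothing or discounting is applied.
predictNext <- function(input, quadgramCounts, trigramCounts, bigramCounts, unigramCounts) {
  words  <- unlist(strsplit(tolower(input), "\\s+"))
  tables <- list(quadgramCounts, trigramCounts, bigramCounts)
  prefixLengths <- c(3, 2, 1)                       # quadrigram, trigram, bigram prefixes
  for (i in seq_along(prefixLengths)) {
    k <- prefixLengths[i]
    if (length(words) < k) next                     # not enough context for this n-gram order
    key  <- paste(tail(words, k), collapse = " ")   # last k words of the input
    hits <- tables[[i]][tables[[i]]$prefix == key, ]
    if (nrow(hits) > 0) return(hits$nextWord[which.max(hits$n)])
  }
  unigramCounts$word[which.max(unigramCounts$n)]    # fall back to the most frequent word
}

For example, predictNext("thanks for the", ...) would first look for quadrigrams whose first three words are "thanks for the", and, failing that, back off to trigrams beginning with "for the", and so on.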

As an illustration, here are the top 50 trigrams from the Twitter data.

# Tokenize the Twitter corpus into trigrams and plot the 50 most frequent
twitterTrigrams <- unnest_tokens(twitterFrame, trigram, text, token = "ngrams", n = 3)
sortedTrigrams <- twitterTrigrams %>% count(trigram, sort = TRUE) %>% na.omit() %>% head(n = 50)
ggplot(sortedTrigrams, aes(x = reorder(trigram, n), y = n)) + geom_col() + coord_flip() + xlab("trigram")
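
To feed trigrams like these into the back-off lookup sketched earlier, one possible next step (again only a sketch, not the final pipeline) is to split each trigram into a two-word prefix and the word it predicts, for example with tidyr:

# One way to reshape the trigram counts into the prefix/nextWord lookup table
# assumed by predictNext() above (an illustration, not the final pipeline).
library(tidyr)

trigramCounts <- twitterTrigrams %>%
  count(trigram, sort = TRUE) %>%
  na.omit() %>%
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ") %>%  # split into individual words
  unite(prefix, w1, w2, sep = " ") %>%                          # first two words form the prefix
  rename(nextWord = w3)

head(trigramCounts)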

Thanks for reading through this.