Executive Summary

This document provides a milestone report for the Data Science Capstone project. The objective of the project is to create a predictive text algorithm and a Shiny app. The purpose of this document is to demonstrate that the input data has been properly loaded, share some exploratory data analysis, and preview an approach for developing the predictive text algorithm.

Part One: Exploratory Data Analysis

The first step is to load the files provided in the assignment, which include text datasets from Twitter, news articles, and blogs. We will use the English set; however, German, Finnish, and Russian datasets are also available and may be incorporated later in the project.

library(readr)     # read_lines()
library(dplyr)     # tibble(), %>%, count(), mutate()
library(tidytext)  # unnest_tokens()

twitter <- read_lines("en_US.twitter.txt")
news <- read_lines("en_US.news.txt")
blogs <- read_lines("en_US.blogs.txt")

The datasets are quite large: the Twitter file contains 2,360,148 lines of text, the news file contains 1,010,242, and the blogs file contains 899,288. For performance reasons, most of this document looks at only the first 50,000 lines of each file; however, the final model may use more or less data.
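
For reference, these line counts come straight from length() on each character vector returned by read_lines():

length(twitter)  # 2360148
length(news)     # 1010242
length(blogs)    # 899288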

To ensure we have a diverse set of language patterns, we combine a subset of each input file to create an alltext dataset for later use.

alltext <- c(twitter[1:50000], news[1:50000], blogs[1:50000])

To analyze the data properly, we first tokenize each file:

tidy_twitter <- tibble(text = twitter) %>% head(50000) %>%
                unnest_tokens(word, text)

tidy_news <- tibble(text = news) %>% head(50000) %>%
             unnest_tokens(word, text)

tidy_blogs <- tibble(text = blogs) %>% head(50000) %>%
              unnest_tokens(word, text) 

tidy_alltext <- tibble(text = alltext) %>%
                unnest_tokens(word, text)

Using these tibbles, we can get a sense of the first 50,000 lines of each file. For instance, the Twitter subset has 39,551 unique words, the news subset has 70,161, and the blogs subset has 71,753. The combined alltext data shows that there is a lot of overlap in words between the datasets: alltext has only 119,695 unique words, far fewer than the sum of the three individual counts.
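
For reference, these unique-word counts can be reproduced with dplyr's n_distinct() on each tibble's word column:

n_distinct(tidy_twitter$word)   # 39551
n_distinct(tidy_news$word)      # 70161
n_distinct(tidy_blogs$word)     # 71753
n_distinct(tidy_alltext$word)   # 119695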

Not surprisingly, the most common words in each dataset (as well as the combined dataset) are similar.
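
As a quick check, the top words can be counted directly from the tidy tibbles built above, e.g.:

tidy_twitter %>% count(word, sort = TRUE) %>% head(10)
tidy_news %>% count(word, sort = TRUE) %>% head(10)
tidy_blogs %>% count(word, sort = TRUE) %>% head(10)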

Looking at the long tail of the datasets is also interesting. There are many words that occur only once across hundreds of thousands of lines of text. Many of these words are typos, rare scientific terms, uncommon brand names, or stylistic text that is not standard English.
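
The output below uses a count_alltext tibble that ranks every word by frequency. It is not built anywhere above, so here is a minimal sketch of how it could be constructed, with column names chosen to match the printed output:

count_alltext <- tidy_alltext %>%
  count(word, sort = TRUE) %>%      # one row per word, most frequent first
  mutate(total = sum(n),            # total word occurrences in the corpus
         percent = n / total,       # share of all occurrences
         cumsum = cumsum(percent),  # cumulative share by rank
         rank = row_number())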

tail(count_alltext,10)
## # A tibble: 10 x 6
##    word                                        n  total    percent cumsum   rank
##    <chr>                                   <int>  <int>      <dbl>  <dbl>  <dbl>
##  1 zycam                                       1 4.44e6    2.25e-7   1.00 119686
##  2 zygi                                        1 4.44e6    2.25e-7   1.00 119687
##  3 zygodactyl                                  1 4.44e6    2.25e-7   1.00 119688
##  4 zygodactylous                               1 4.44e6    2.25e-7   1.00 119689
##  5 zylstra                                     1 4.44e6    2.25e-7   1.00 119690
##  6 zymurgy                                     1 4.44e6    2.25e-7   1.00 119691
##  7 zyrtec                                      1 4.44e6    2.25e-7   1.00 119692
##  8 zzeon                                       1 4.44e6    2.25e-7   1.00 119693
##  9 zzzzzzzzzzzzz                               1 4.44e6    2.25e-7   1.00 119694
## 10 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz~     1 4.44e6    2.25e-7   1.   119695
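
The size of this long tail can be measured directly, for example by counting the words that occur exactly once:

count_alltext %>% filter(n == 1) %>% nrow()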

This raises the question: how many words do you really need to capture most text? We can see from the plot below that the top 5,000 words account for more than 80% of all word occurrences, and the top 10,000 account for nearly 90%.
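
Using the count_alltext sketch above, those coverage figures can be read off the cumulative-sum column:

count_alltext %>%
  filter(rank %in% c(5000, 10000)) %>%
  select(rank, cumsum)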

While individual word counts are interesting, much of the predictive text model will be built by looking at groups of words, or n-grams. As shown below, even the most common trigrams occur far less frequently than the most common individual words.

# Note: head(50000) keeps only the first 50,000 lines of alltext,
# which, given how alltext was built, is the Twitter subset.
alltext_trigram <- tibble(text = alltext) %>% head(50000) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

trigram_count <- alltext_trigram %>% count(trigram, sort = TRUE)

head(trigram_count,25)
## # A tibble: 25 x 2
##    trigram                n
##    <chr>              <int>
##  1 thanks for the       488
##  2 looking forward to   187
##  3 thank you for        184
##  4 i love you           181
##  5 for the follow       172
##  6 going to be          171
##  7 i want to            167
##  8 can't wait to        161
##  9 i have a             130
## 10 a lot of             129
## # ... with 15 more rows

An interesting way to visualize trigrams was developed by Emil Hvitfeldt. I borrowed some of his code from his site and adapted it to the assignment’s data set. Below is a plot of the most common trigrams starting with either ‘he’ or ‘she’. It gives a sense of what the predictive text output options would look like as the user types.

Part Two: Modeling Approach

As a first approach, I’m going to try a 4-gram model that matches on the previous three words to predict the fourth. If fewer than three words have been provided, 3-gram or 2-gram models can be applied; likewise, if the 4-gram does not appear in the corpora, the model can back off to 3-grams and then 2-grams. If multiple 4-grams match the first three words, the most common one is chosen. A sketch of this backoff lookup appears at the end of this section.
The approach above will likely not yield great results. Additional features such as sentiment analysis or pairwise correlation may be helpful when there are either no perfect matches or multiple perfect matches and a ‘best’ choice needs to be made. A lot of work will be required to make the algorithm both accurate and efficient.
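
To make the backoff idea concrete, below is a minimal sketch. It assumes hypothetical precomputed lookup tables (fourgrams, trigrams, and bigrams, none of which exist in the code above) in which each n-gram has already been split into a prefix column holding the leading words, a last_word column holding the word to predict, and a count n:

predict_next <- function(input, fourgrams, trigrams, bigrams) {
  # Keep (up to) the last three words of the user's input
  words <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)
  models <- list(list(tbl = fourgrams, k = 3),  # 4-gram model: 3-word prefix
                 list(tbl = trigrams,  k = 2),  # 3-gram model: 2-word prefix
                 list(tbl = bigrams,   k = 1))  # 2-gram model: 1-word prefix
  for (m in models) {
    if (length(words) >= m$k) {
      key <- paste(tail(words, m$k), collapse = " ")
      hit <- m$tbl %>%
        filter(prefix == key) %>%
        slice_max(n, n = 1, with_ties = FALSE)  # keep the most common completion
      if (nrow(hit) > 0) return(hit$last_word)
      # otherwise fall through and back off to the next-shorter model
    }
  }
  NA_character_  # nothing matched; the real model will need a smarter fallback
}

# e.g. predict_next("can't wait to", fourgrams, trigrams, bigrams)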