Overview

This brief report for the Johns Hopkins Data Science Specialization Capstone Project will be broken down into three parts:

  1. Exploratory Data Analysis (EDA) - I will provide basic statistics on each of the three text files, along with plots of word and n-gram (n = 2 and 3) counts for the three files combined.

  2. Prediction Algorithm - This section will contain no finished code, only ideas (plus a rough illustrative sketch) of how I plan to use the results of my EDA to create a prediction algorithm.

  3. Shiny App - Again, this section will contain no finished code, only a conceptual blueprint (and a minimal sketch) of how I plan to implement the prediction algorithm in a Shiny app.

1. EDA

The first step I took (after reading in each file separately) was to randomly select 10,000 lines from each file. With a little manipulation, I can then calculate the number of words in each line and the average words per line in each text (in the output below, the first data table is based on the blog text, the second on the news text, and the third on the Twitter text).
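As a rough illustration (not necessarily the exact code behind the tables below), the sampling and per-line word counts for the blog file could be produced along these lines, assuming the raw file is named en_US.blogs.txt:

```r
library(readr)
library(dplyr)
library(tidytext)

set.seed(1234)                                   # reproducible sampling
blogs        <- read_lines("en_US.blogs.txt")    # assumed file name
blogs_sample <- sample(blogs, 10000)             # randomly select 10,000 lines

blog_words <- tibble(line = seq_along(blogs_sample), text = blogs_sample) %>%
  unnest_tokens(word, text) %>%                  # one row per word, keeping the line index
  count(line, name = "words_in_line") %>%        # words per line
  mutate(avg_words_line = mean(words_in_line))   # overall average, repeated on each row

blog_words
```

The same steps, repeated for the news and Twitter files, give the second and third tables.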

## # A tibble: 10,000 x 3
##     line words_in_line avg_words_line
##    <int>         <int>          <dbl>
##  1     1            35        41.8881
##  2     2            49        41.8881
##  3     3             2        41.8881
##  4     4             5        41.8881
##  5     5           159        41.8881
##  6     6            69        41.8881
##  7     7            48        41.8881
##  8     8            59        41.8881
##  9     9            21        41.8881
## 10    10            60        41.8881
## # ... with 9,990 more rows
## # A tibble: 10,000 x 3
##     line words_in_line avg_words_line
##    <int>         <int>          <dbl>
##  1     1            35        34.4457
##  2     2            14        34.4457
##  3     3            61        34.4457
##  4     4            35        34.4457
##  5     5            49        34.4457
##  6     6            68        34.4457
##  7     7            25        34.4457
##  8     8            30        34.4457
##  9     9            21        34.4457
## 10    10            71        34.4457
## # ... with 9,990 more rows
## # A tibble: 10,000 x 3
##     line words_in_line avg_words_line
##    <int>         <int>          <dbl>
##  1     1            18        12.6462
##  2     2            16        12.6462
##  3     3            18        12.6462
##  4     4            16        12.6462
##  5     5             7        12.6462
##  6     6            11        12.6462
##  7     7             8        12.6462
##  8     8            10        12.6462
##  9     9            26        12.6462
## 10    10             3        12.6462
## # ... with 9,990 more rows

However, it is important to remember that the goal of this project is to create an app that predicts the next word after receiving a word or sequence of words. It does not matter whether that sequence comes from Twitter, a blog, or a news source. Therefore, combining all the words, regardless of source, allows a more relevant analysis.

After combining the sources, I can create a few bar plots showing the 10 most common words and n-grams (two- and three-word sequences) in the combined data set.
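A sketch of how these counts and plots could be produced with tidytext and ggplot2 follows; blogs_sample, news_sample, and twitter_sample stand in for the 10,000 sampled lines from each source:

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Combine the sampled lines from all three sources into one table
combined <- tibble(text = c(blogs_sample, news_sample, twitter_sample))

# Single-word and bigram counts across the combined text (trigrams are analogous, with n = 3)
word_counts <- combined %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

bigram_counts <- combined %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

# Bar plot of the 10 most common words; the bigram/trigram plots follow the same pattern
word_counts %>%
  head(10) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count", title = "10 most common words")
```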

One thing to notice is the inclusion of “stopwords” (words that are very common in a given language, e.g. “of”, “the”, “to”) in the bigram (two-word sequence) and trigram (three-word sequence) distributions. My reasoning is that, although some Natural Language Processing projects omit stopwords so the algorithm can learn more from the text over fewer words, the goal of this project is to create an app that predicts a word after receiving a word or sequence of words. Stopwords should therefore be given some probability (certainly not 0) of being the next word in a sequence.

2. Prediction Algorithm

My rough plan for the prediction algorithm is:

  1. Use the distribution of single words to predict any given word without a preceding string of words.

  2. Use the bigram and trigram distributions, combined with a series of regular expressions (patterns that search through a string of text to find matches), to predict the next word given a previous sequence of words.

The model is Markovian, a class of model in which the conditional probability of the next event occurring (in our case, the next word being W) can be estimated without looking too far into the past. For example, given a string of six words, we can assume the probability of the seventh word given the previous six words is roughly equal to the probability of the seventh word given only the preceding 1-3 words (written formally just after the list below). The benefits of this assumption are:

  1. Computational efficiency; creating an algorithm that attempts to take in all previous information (preceding words) to predict the next word would require huge processing power.

  2. Flexibility; since language is inherently highly dynamic, even if my computer had the processing power mentioned above, new sequences of words are constantly being created, so “storing” all possible sequences of words is impossible.
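Written out, the assumption above is, with $w_i$ denoting the $i$-th word:

$$P(w_7 \mid w_1, w_2, \dots, w_6) \;\approx\; P(w_7 \mid w_{7-k}, \dots, w_6), \qquad k \in \{1, 2, 3\}$$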

For a slightly more in-depth explanation, imagine the string of words “I have…”. I will use a series of regex statements to reorder each distribution (word, bigram, and trigram) based on that information. So, the bigram distribution only includes bigrams that begin with “have,” and the trigram distribution only includes sequences that begin with “I have…” For the single-word distribution, my plan is to sort the data so that only lines containing the string anywhere in the line are shown. A rough sketch of this filtering step appears below.
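A minimal sketch of that filtering, using toy bigram and trigram frequency tables in place of the real distributions (the column names ngram and n are just placeholders), might look like this:

```r
library(dplyr)
library(stringr)

# Toy frequency tables standing in for the real bigram/trigram distributions
bigram_counts <- tibble(
  ngram = c("have a", "have been", "have to", "of the"),
  n     = c(120, 95, 80, 300)
)
trigram_counts <- tibble(
  ngram = c("i have a", "i have been", "i have to", "one of the"),
  n     = c(40, 35, 25, 60)
)

input <- str_to_lower("I have")

# Trigram distribution: keep only sequences that begin with the full input
trigram_matches <- trigram_counts %>%
  filter(str_detect(ngram, paste0("^", input, " "))) %>%
  arrange(desc(n))

# Bigram distribution: keep only sequences that begin with the last input word
last_word <- word(input, -1)
bigram_matches <- bigram_counts %>%
  filter(str_detect(ngram, paste0("^", last_word, " "))) %>%
  arrange(desc(n))

# Predicted next word: the final token of the most frequent matching n-gram
word(trigram_matches$ngram[1], -1)   # "a" in this toy example
```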

These three distributions will then be arranged in descending order (as shown in the plots above), and the next predicted word will be the one with the highest probability.

3. Shiny App

My plan for the Shiny application is a simple text box where the user types a string of words, along with a section where the predicted next word is shown (I might show the top 3 most probable words; I am still undecided).
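A minimal conceptual sketch of that layout, with a placeholder prediction function standing in for the n-gram lookup described above, could be:

```r
library(shiny)

# Placeholder: the real function would perform the regex/n-gram lookup sketched earlier
predict_next_word <- function(text) {
  c("a", "been", "to")   # dummy top-3 candidates
}

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("user_text", "Type a phrase:", value = ""),
  h4("Predicted next word(s):"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$user_text)                          # wait until the user has typed something
    paste(predict_next_word(input$user_text), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)
```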