Capstone Report

Aimie Faucett
07 October, 2016

Coursera Data Science Specialization
Capstone Project

Project Overview

The objective of the project is to build a corpus from a collection of Twitter, news, and blog entries and use it to build an algorithm that predicts the next word as someone types. I decided to write my own code to create the corpus rather than use existing R packages, because I wanted experience building an algorithm and wanted to see how it would perform.

Processing the data and creating the corpus:

  • Read in a 3% sample of all text in the news and blogs corpora (larger samples took too long to load)
  • Twitter is excluded because tweets frequently don't use normal sentence structure and contain many nonstandard abbreviations (e.g. “u” rather than “you”)
  • Break each entry into sentences by splitting on punctuation that typically ends a complete sentence (?, ., !, ;)
  • Iterate through each sentence and:
    • Convert everything to lower case
    • Keep only alphabetical characters so that numbers and punctuation are removed
    • Count the words in the sentence and keep the sentence only if it contains more than one word
    • Find swear words and replace them with “swearjar” to preserve the behavior of swearing in text
    • Collapse white spaces that are an artifact of removing numbers and punctuation, but preserve spaces between words
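
The steps above can be sketched in base R roughly as follows. This is a minimal illustration, not the code used for the project: the function name `clean_entry` and the placeholder swear list are assumptions, and whitespace is collapsed before the word count so that the count is not thrown off by leftover gaps.

```r
# Hypothetical placeholder list; the report does not say which words were filtered.
swear_words <- c("darn", "heck")

clean_entry <- function(entry, swears = swear_words) {
  # Split the entry on sentence-ending punctuation: ? . ! ;
  sentences <- unlist(strsplit(entry, "[?.!;]+"))
  # Convert everything to lower case
  sentences <- tolower(sentences)
  # Keep only letters and whitespace, dropping numbers and punctuation
  sentences <- gsub("[^a-z[:space:]]", "", sentences)
  # Collapse runs of whitespace left behind by the removal, then trim the ends
  sentences <- trimws(gsub("[[:space:]]+", " ", sentences))
  # Replace whole-word swear matches with the "swearjar" token
  if (length(swears) > 0) {
    pattern <- paste0("\\b(", paste(swears, collapse = "|"), ")\\b")
    sentences <- gsub(pattern, "swearjar", sentences, perl = TRUE)
  }
  # Keep only sentences with more than one word
  word_counts <- lengths(strsplit(sentences, " ", fixed = TRUE))
  sentences[word_counts > 1]
}

clean_entry("I have 3 cats. Wow! What the heck is that?")
```

In this example the one-word sentence "Wow" is dropped, the digit is removed, and "heck" becomes "swearjar".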

Predicting Text

Next, I used the do.call function to comb through all entries in the corpus and extract one-, two-, three-, and four-grams. The gram data frames were written to .csv files so they could be used in the Shiny App.

  X.1    X           Combo Record    leadGram gram
1   1 6174  the end of the    124  the end of  the
2   2 7313 the rest of the    120 the rest of  the
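
One way to produce a table with the Combo, Record, leadGram, and gram columns shown above is sketched below. It assumes, based on the printout, that Record is the number of occurrences, leadGram is every word but the last, and gram is the last word. The function name `extract_ngrams` is illustrative, and the use of `do.call` here is only a guess at how it might fit in.

```r
# Sketch: count n-grams across cleaned sentences and split each one
# into its leading words (leadGram) and its final word (gram).
extract_ngrams <- function(sentences, n) {
  per_sentence <- lapply(strsplit(sentences, " ", fixed = TRUE), function(words) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  # Flatten the per-sentence vectors into one vector of n-grams
  combos <- do.call(c, per_sentence)
  if (length(combos) == 0) {
    return(data.frame(Combo = character(0), Record = integer(0),
                      leadGram = character(0), gram = character(0),
                      stringsAsFactors = FALSE))
  }
  counts <- table(combos)
  grams <- data.frame(Combo = names(counts), Record = as.integer(counts),
                      stringsAsFactors = FALSE)
  # Unigrams have no leading words
  grams$leadGram <- if (n > 1) sub(" [^ ]+$", "", grams$Combo) else ""
  grams$gram <- sub("^.* ", "", grams$Combo)
  # Most frequent n-grams first
  grams <- grams[order(-grams$Record), ]
  rownames(grams) <- NULL
  grams
}

four_grams <- extract_ngrams(c("the end of the day", "the end of the road"), 4)
```

The X.1 and X columns in the printout look like row names picked up through CSV round trips. If that is the cause, passing `row.names = FALSE` to `write.csv` would avoid them.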

To predict the text, I wrote a function that takes in a user-entered character string and does the following:

  • Removes punctuation and numbers, and replaces swear words with “swearjar” to match the swear pattern in the corpus
  • Searches for a match in the largest gram table available for the length of the text entered
  • If the largest gram does not return a match, it recursively checks smaller grams for a match
  • If no match can be found in any gram, the function returns a random word. I chose random words that I thought were funny and mixed in the top 100 most common words from the corpus. For example:
 [1] "homeboy"      "taco"         "snaggletooth" "sombrero"    
 [5] "blankie"      "kittens"      "tequila"      "moose"       
 [9] "spongebob"    "fireball"    
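
The lookup described above can be sketched as a recursive back-off. This is an illustration under stated assumptions rather than the function from the app: `gram_tables[[k]]` is assumed to hold the table whose leadGram has k words, and the names `normalize_input`, `backoff`, and `predict_next` are hypothetical.

```r
# Clean user input the same way the corpus was cleaned (assumed steps).
normalize_input <- function(text, swears = character(0)) {
  text <- gsub("[^a-z[:space:]]", "", tolower(text))
  text <- trimws(gsub("[[:space:]]+", " ", text))
  if (length(swears) > 0) {
    pattern <- paste0("\\b(", paste(swears, collapse = "|"), ")\\b")
    text <- gsub(pattern, "swearjar", text, perl = TRUE)
  }
  text
}

# Try the longest usable lead first, then recurse on shorter leads.
backoff <- function(words, gram_tables) {
  k <- min(length(words), length(gram_tables))
  if (k == 0) return(NA_character_)          # nothing left to match
  lead <- paste(utils::tail(words, k), collapse = " ")
  tbl <- gram_tables[[k]]
  hits <- tbl[tbl$leadGram == lead, , drop = FALSE]
  if (nrow(hits) > 0) {
    # Return the most frequent continuation for this lead
    return(as.character(hits$gram[which.max(hits$Record)]))
  }
  backoff(utils::tail(words, k - 1), gram_tables)
}

predict_next <- function(input, gram_tables, fallback_words, swears = character(0)) {
  words <- strsplit(normalize_input(input, swears), " ", fixed = TRUE)[[1]]
  guess <- backoff(words, gram_tables)
  # No match in any table: fall back to a random word
  if (is.na(guess)) sample(fallback_words, 1) else guess
}
```

Given a four-gram table containing the row "the end of" / "the", a call such as `predict_next("At the end of", gram_tables, fallback_words)` would match on the three-word lead before any smaller table is consulted.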

Shiny App

The app is hosted at: https://bioaimie.shinyapps.io/capApp/

It utilizes custom CSS for styling and loads data from the one-, two-, three-, and four-gram .csv files. The gram files had to be trimmed to fit within the upload size limits on shinyapps.io. To do this, I kept all grams occurring with a frequency greater than 1 and then randomly selected 1,000 of the remaining grams to keep for each data set.
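
The trimming rule can be sketched as below, assuming the frequency is stored in the Record column as in the earlier printout. The function name `trim_grams` and its `n_extra` and `seed` parameters are illustrative.

```r
# Sketch: keep every gram seen more than once, plus a random sample
# of up to n_extra grams that were seen only once.
trim_grams <- function(grams, n_extra = 1000, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)         # optional, for reproducibility
  frequent <- grams[grams$Record > 1, , drop = FALSE]
  rare <- grams[grams$Record <= 1, , drop = FALSE]
  keep <- sample.int(nrow(rare), min(n_extra, nrow(rare)))
  rbind(frequent, rare[keep, , drop = FALSE])
}
```

Because the sample is random, fixing the seed is the only way to get the same trimmed file twice.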

The app asks the user to input some text and press the submit button. Once it is submitted, the app will display the text appended with the predicted word in a window below the button. It will also generate a word cloud of possible matches. The sizes of the words in the word cloud are relative to their frequency within the corpus.

Conclusions and Further Work

I was fairly disappointed with my algorithm's ability to predict text. Additionally, when there is no match in the corpus, my algorithm falls back to a randomly selected word. That random word is displayed as the prediction, but it does not match the words shown in the word cloud. This was very frustrating, and I was unable to find a way around it with a reactive function.

In the future I would:

  • Use existing R packages
  • Use the sparklyr or SparkR packages to process big data with Spark and do preliminary cleaning
  • Store output from data processing in a more big-data-friendly format, such as Parquet, so that more of the corpus could be used by the app
  • Work on more customization of the app styling using CSS or custom JS to improve user experience

Overall, I enjoyed the project, and I learned a lot by writing my own processing and predictive algorithm; however, I think my app's ability to predict correctly was sub-par.