Aimie Faucett
07 October, 2016
Coursera Data Science Specialization Capstone Project
The objective of the project is to create a corpus from a collection of Twitter, news, and blog entries and use this corpus to build an algorithm that will predict the next word when someone is typing text. I decided to write my own algorithm to create the corpus rather than using a built-in R package, because I wanted to get experience building an algorithm and see how it would perform.
Processing the data and creating the corpus:
Next, I used the do.call function to comb through all entries in the corpus and extract one-, two-, three-, and four-grams. The gram data frames were written to .csv files so they could be used in the Shiny App.
  X.1    X           Combo Record    leadGram gram
1   1 6174  the end of the    124  the end of  the
2   2 7313 the rest of the    120 the rest of  the
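The gram extraction described above can be sketched in base R. This is a minimal illustration, not the code used for the project: the function name `extract_ngrams`, the tokenization rule, and the use of `do.call(c, ...)` to flatten per-entry results are all assumptions. Only the column names `Combo`, `Record`, `leadGram`, and `gram` come from the table above; the `X.1` and `X` columns look like row indices and are left out.

```r
# Hypothetical helper: count the n-grams in a character vector of entries.
extract_ngrams <- function(entries, n) {
  per_entry <- lapply(entries, function(entry) {
    # Lowercase, then split on anything that is not a letter or apostrophe
    words <- strsplit(tolower(entry), "[^a-z']+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  all_grams <- do.call(c, per_entry)  # flatten the per-entry vectors
  if (length(all_grams) == 0) {
    return(data.frame(Combo = character(0), Record = integer(0),
                      leadGram = character(0), gram = character(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(all_grams), decreasing = TRUE)
  combo <- names(counts)
  data.frame(
    Combo = combo,                          # the full n-gram
    Record = as.integer(counts),            # how often it occurs
    leadGram = sub(" ?[^ ]+$", "", combo),  # every word except the last
    gram = sub("^.* ", "", combo),          # the last word
    stringsAsFactors = FALSE
  )
}

entries <- c("The end of the day", "At the end of the road")
four_grams <- extract_ngrams(entries, 4)
```

Splitting each gram into `leadGram` and `gram` up front means prediction only needs an equality lookup on `leadGram`. The resulting data frame could then be saved with `write.csv`.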
To predict the text, I wrote a function that takes in a user-entered character string and does the following:
[1] "homeboy" "taco" "snaggletooth" "sombrero"
[5] "blankie" "kittens" "tequila" "moose"
[9] "spongebob" "fireball"
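One way a prediction function of this kind could work is sketched below. The details are assumptions rather than the project's actual steps: the function name `predict_next`, the back-off from longer to shorter contexts, and the `fallback_words` argument are all hypothetical. The sketch assumes `gram_tables` is a list whose n-th element is the n-gram data frame with `Record`, `leadGram`, and `gram` columns.

```r
# Hypothetical sketch: look up the trailing words of the input in the gram
# tables, longest context first, and fall back to a random word on no match.
predict_next <- function(text, gram_tables, fallback_words) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # The largest usable context is one word shorter than the largest gram size
  max_context <- min(length(words), length(gram_tables) - 1)
  for (k in rev(seq_len(max_context))) {
    lead <- paste(tail(words, k), collapse = " ")
    table_k <- gram_tables[[k + 1]]  # the (k + 1)-gram table
    hits <- table_k[table_k$leadGram == lead, , drop = FALSE]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$Record), , drop = FALSE]
      # Return the top word together with every candidate that matched
      return(list(word = hits$gram[1], candidates = hits, matched = TRUE))
    }
  }
  # No context matched: pick a random word from the supplied vocabulary
  list(word = sample(fallback_words, 1), candidates = NULL, matched = FALSE)
}
```

Returning the chosen word and the candidate rows in one object means that anything displaying the prediction and anything plotting the candidates read from the same result.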
The app is hosted at: https://bioaimie.shinyapps.io/capApp/
It utilizes custom CSS for styling and loads data from the one-, two-, three-, and four-gram csv files. The gram files had to be trimmed to work within the confines of upload sizes on shinyapps.io. To do this, I kept all grams occurring with a frequency > 1 and then randomly selected 1000 other grams to keep for each data set.
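The trimming rule above can be expressed in a few lines of base R. This is a sketch under assumptions: the function name `trim_grams` and the `seed` argument are hypothetical, and the frequency is assumed to live in a `Record` column as in the gram table shown earlier.

```r
# Hypothetical helper: keep every gram seen more than once, plus a random
# sample of up to n_extra grams from the remaining rows.
trim_grams <- function(grams, n_extra = 1000, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional, for a reproducible sample
  frequent <- grams[grams$Record > 1, , drop = FALSE]
  rare <- grams[grams$Record <= 1, , drop = FALSE]
  # sample.int avoids the surprise of sample() when given a single number
  keep <- sample.int(nrow(rare), size = min(n_extra, nrow(rare)))
  rbind(frequent, rare[keep, , drop = FALSE])
}

# Toy data: 5 frequent grams and 2000 grams that occur once
toy <- data.frame(
  Combo = paste("gram", seq_len(2005)),
  Record = c(rep(3L, 5), rep(1L, 2000)),
  stringsAsFactors = FALSE
)
trimmed <- trim_grams(toy, n_extra = 1000, seed = 42)
```

With the toy data, the trimmed table has 1005 rows: the 5 frequent grams plus 1000 sampled singletons.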
The app asks the user to input some text and press the submit button. Once it is submitted, the app will display the text appended with the predicted word in a window below the button. It will also generate a word cloud of possible matches. The sizes of the words in the word cloud are relative to their frequency within the corpus.
I was fairly disappointed with my own algorithm's ability to predict text. Additionally, my algorithm randomly selects a guess for the predicted word when there isn't a match in the corpus. In that case the function that produces the random word supplies the next word, but that word does not match the words shown in the word cloud. This was very frustrating, and I was unable to figure out a way around it with a reactive function.
In the future I would:
Overall, I enjoyed the project, and I learned a lot by writing my own processing and predictive algorithms; however, I think my app's ability to predict the next word correctly was sub-par.