Rick Lingle
October 4, 2016
Coursera Capstone Project
This project was completed for the Coursera Data Science Capstone, whose task was next-word prediction. Three text files were originally used to construct the n-grams for the search algorithm, but the data had to be cut back because of file size constraints and processing time. The final n-grams were built from the twitter text file alone, using approximately ¼ of it. The news and blogs text files were not used for the apps, but were used for the quiz portion of the course. The R package “quanteda” was used to analyze the text files and create the n-grams.
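To make the n-gram construction concrete, here is a minimal sketch in Python using only the standard library. It is not the project's code, which used R and quanteda. The function names, the tokenizing rule, and the sample lines are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def build_ngrams(lines, n):
    """Count every run of n consecutive tokens, one line at a time."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical stand-in for lines sampled from the twitter file.
sample_lines = [
    "Thanks for the follow",
    "Thanks for the retweet",
    "Can't wait for the weekend",
]

trigrams = build_ngrams(sample_lines, 3)
```

Counting line by line keeps n-grams from spanning two unrelated tweets. Working from a sample of the file, as the project did, keeps the table small enough to load quickly.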
After the n-grams were constructed from the text file, a simple R script searches them for text matching the user's input. For each match, the script takes the next word in the n-gram and stores it as a candidate. After all matches have been identified, the script pools the candidate words, totals them, and calculates the probability of each word as its share of that total.
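The search and probability steps described above can be sketched roughly as follows. This is a Python illustration, not the original R code. The trigram table is made up, and weighting each candidate by its n-gram count is an assumption about how the totals were formed.

```python
from collections import Counter

# Hypothetical trigram counts; in the project these came from the twitter file.
ngram_counts = {
    ("thanks", "for", "the"): 5,
    ("for", "the", "follow"): 3,
    ("for", "the", "retweet"): 1,
    ("for", "the", "weekend"): 1,
    ("in", "the", "morning"): 2,
}

def predict_next(text, ngram_counts):
    """Return (word, probability) pairs for the word following `text`."""
    context = tuple(text.lower().split())
    candidates = Counter()
    for ngram, count in ngram_counts.items():
        prefix, next_word = ngram[:-1], ngram[-1]
        # A match means the n-gram's leading words equal the end of the input.
        if len(context) >= len(prefix) and context[-len(prefix):] == prefix:
            candidates[next_word] += count
    total = sum(candidates.values())
    # Each probability is the candidate's share of all matches found.
    return [(word, count / total) for word, count in candidates.most_common()]

predictions = predict_next("thanks for the", ngram_counts)
```

With this table, "thanks for the" matches the three n-grams beginning "for the", so "follow" receives 3 of the 5 matches and the other two words receive 1 each. An input with no matches returns an empty list.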
Simply enter text in the input field and press Submit.
Words need to be spelled correctly :)
NO punctuation or symbols :(
Please wait for the library (n-grams) to load
I learned more about NLP than I originally wanted to for this project, but it was very rewarding. With more time, I would like to compare more text files from different genres; the differences between the three supplied files were very enlightening. The code itself could be extended to strip symbols and punctuation, to identify misspelled words, and to build new n-grams for specific users, which would be very beneficial for those individuals.
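As one possible shape for the symbol and punctuation cleanup mentioned above, here is a short Python sketch. The function name and the choice to keep apostrophes are assumptions, and the app itself does not do this.

```python
import re

def clean_input(text):
    """Lowercase, then drop everything except letters, apostrophes and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    # Collapse the runs of spaces left behind by the removed characters.
    return " ".join(text.split())

cleaned = clean_input("Thanks for the... #follow!!")
```

Running user input through a step like this before the n-gram search would remove the need to ask for punctuation-free text.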