Antonio Ferraro
This presentation pitches GuessNextWord, an application for predicting the next word a user is going to type.
The application is the final submission for the capstone project of the Coursera/Johns Hopkins Data Science Specialization.
The goal of this capstone project is to produce a Shiny application that predicts the next word a user is going to type, based on the user's input and on a collection of documents extracted from Twitter, blogs and news articles that was provided for the project.
A subset of the provided data (HC Corpora) has been cleaned, lowercased and reduced to ASCII only. Profanities have been removed. The preferred library is quanteda, because it is faster than tm. The 4-gram frequencies generated with quanteda (dfm) are stored in an SQLite database, which speeds up retrieval and allows a bigger database. In the end the database may be small anyway, because it takes a long time to produce and the shinyapps.io size limit is 100 MB per application.
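The cleaning steps can be sketched in base R roughly as follows. This is a minimal illustration, not the application's actual code: the helper names `clean_text` and `remove_words`, the regular expressions, and the tiny banned-word list are all assumptions made for the example.

```r
# Hypothetical helper: normalise raw text along the lines described above.
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")     # drop non-ASCII characters
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)               # email addresses
  x <- gsub("[@#]\\S+", " ", x, perl = TRUE)                # Twitter handles and hashtags
  x <- gsub("'", "", x, fixed = TRUE)                       # apostrophes: "it's" becomes "its"
  x <- gsub("[^a-z ]", " ", x)                              # numbers and remaining punctuation
  x <- gsub(" +", " ", x)                                   # collapse runs of spaces
  trimws(x)
}

# Hypothetical helper: drop every word found in a banned list (e.g. profanities).
remove_words <- function(x, banned) {
  words <- strsplit(x, " ", fixed = TRUE)
  vapply(words,
         function(w) paste(w[!(w %in% banned)], collapse = " "),
         character(1))
}

cleaned  <- clean_text("Check http://example.com NOW, it's 100% great! @bob #fun me@mail.org")
filtered <- remove_words("this is darn good", banned = c("darn"))
```

Applying the same function to the training sample and to the user's input keeps both sides of the lookup in the same form.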
Main libraries used: DBI, RSQLite, quanteda, stringr
The data sample has been cleaned by removing URLs, email addresses, Twitter characters, numbers, profanities and single quotes (apostrophes, which are troublesome to handle with SQLite), then converted to lowercase and tokenized with quanteda. The sparse matrix (dfm) produced by quanteda has been stored in SQLite tables, in a single database named capstone.sqlite. This may take more space, but the tables are stored on disk, search is faster because of indexes, and the sample is much easier to handle because it consists of a single file. Furthermore, the database can be regenerated at will and modified in any way (increasing its size, pruning it, etc.) without changing the model. Frequency tables of 4-grams have been generated and transferred into the SQLite table QUADGRAM. Each 4-gram is split into its individual words, which allows searching with fewer than three words of input.
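A minimal sketch of the storage step, under stated assumptions: the column names `w1` to `w4` and `freq`, the index name, and the helper `count_quadgrams` are hypothetical, and the counting is done in base R here to keep the example self-contained, whereas the application derives the frequencies from a quanteda dfm. An in-memory database stands in for capstone.sqlite.

```r
library(DBI)

# Hypothetical helper: count the 4-grams of already-cleaned sentences and
# return one row per distinct 4-gram, with its four words in separate columns.
# Assumes at least one sentence has four or more words.
count_quadgrams <- function(sentences) {
  rows <- lapply(strsplit(sentences, " ", fixed = TRUE), function(w) {
    n <- length(w)
    if (n < 4) return(NULL)               # too short to contain a 4-gram
    i <- seq_len(n - 3)
    data.frame(w1 = w[i], w2 = w[i + 1], w3 = w[i + 2], w4 = w[i + 3],
               stringsAsFactors = FALSE)
  })
  grams <- do.call(rbind, rows)
  grams$freq <- 1L
  aggregate(freq ~ w1 + w2 + w3 + w4, data = grams, FUN = sum)
}

sample_text <- c("thanks for the follow",
                 "thanks for the help",
                 "thanks for the follow back")
quad <- count_quadgrams(sample_text)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # the app would use a file instead
dbWriteTable(con, "QUADGRAM", quad, overwrite = TRUE)
# Index on the context words so that lookups do not scan the whole table.
dbExecute(con, "CREATE INDEX idx_quadgram_context ON QUADGRAM (w1, w2, w3)")

res <- dbGetQuery(
  con,
  "SELECT w4, freq FROM QUADGRAM
    WHERE w1 = ? AND w2 = ? AND w3 = ?
    ORDER BY freq DESC",
  params = list("thanks", "for", "the")
)
dbDisconnect(con)
```

Keeping the words in separate columns is what makes partial matches possible: a query can constrain only `w3`, or only `w2` and `w3`, when the user has typed fewer than three words.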
The SQL table is used to predict the next word when a user inputs text, which is preprocessed with the same cleaning criteria used to generate the sample. The application always gives a prediction: if the input is insufficient, it proposes the most frequent unigram as a default. The user input and the predicted word are displayed. The initial database load is a bit slow, but after that the app is quite responsive.
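One way such a lookup could work is a simple back-off: try the last three words of the input, then the last two, then the last one, and fall back to a default word if nothing matches. The sketch below assumes a QUADGRAM table with hypothetical columns `w1` to `w4` and `freq`; the function name `predict_next` and the default word `"the"` are also assumptions, and the application's actual matching rules may differ.

```r
library(DBI)

# Hypothetical lookup with back-off over the context columns of QUADGRAM.
# The input is only lowercased and trimmed here; the app applies its full cleaning.
predict_next <- function(con, input, default = "the") {
  words <- strsplit(trimws(tolower(input)), " +")[[1]]
  words <- words[nzchar(words)]
  cols  <- c("w1", "w2", "w3")
  k <- min(3L, length(words))
  while (k >= 1L) {
    ctx   <- tail(words, k)               # last k words typed
    where <- paste(sprintf("%s = ?", tail(cols, k)), collapse = " AND ")
    sql   <- paste("SELECT w4, SUM(freq) AS n FROM QUADGRAM WHERE", where,
                   "GROUP BY w4 ORDER BY n DESC, w4 LIMIT 1")
    hit <- dbGetQuery(con, sql, params = as.list(ctx))
    if (nrow(hit) > 0) return(hit$w4[1])
    k <- k - 1L                           # back off to a shorter context
  }
  default                                 # nothing matched, or empty input
}

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "QUADGRAM", data.frame(
  w1   = c("thanks", "thanks", "for"),
  w2   = c("for",    "for",    "the"),
  w3   = c("the",    "the",    "follow"),
  w4   = c("follow", "help",   "back"),
  freq = c(2L, 1L, 1L),
  stringsAsFactors = FALSE
))

p_full    <- predict_next(con, "thanks for the")   # three-word context matches
p_backoff <- predict_next(con, "see the follow")   # falls back to the last two words
p_default <- predict_next(con, "")                 # no input, default word
dbDisconnect(con)
```

Passing the words as query parameters rather than pasting them into the SQL string also sidesteps quoting problems with user input.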
GuessNextWord is hosted on shinyapps.io: https://anfe67.shinyapps.io/guessnextword/
The bulk of the application's code is shown in the application itself (third tab).
This presentation can be found here: http://rpubs.com/anfe67/guessnextword