Coursera Data Science Capstone Project

Antonio Ferraro

This presentation pitches guessnextword, an application that predicts the next word a user is going to type.

The application is the final submission for the capstone project of the Coursera/Johns Hopkins Data Science specialization.

Goal

The goal of this capstone project is to produce a Shiny application that predicts the next word a user is going to type. The prediction is based on the user's input and on a collection of documents, extracted from Twitter, blogs and news articles, that was provided for the project.

A subset of the provided data (HC Corpora) has been cleaned, lowercased and reduced to ASCII only. Profanities have been removed. The preferred library is quanteda, because it is faster than tm. The 4-gram frequencies generated with quanteda (dfm) are stored in a SQLite database. This speeds up retrieval and allows a bigger database (although in the end it may be small anyway, because it takes a long time to produce and the shinyapps size limit is 100 MB per application).

Main libraries used: DBI, RSQLite, quanteda, stringr

Method & Model

The data sample has been cleaned by removing URLs, email addresses, Twitter characters, numbers, profanities and single quotes (apostrophes, which are troublesome to handle with SQLite), then converted to lowercase and tokenized with quanteda. The sparse matrix (dfm) produced by quanteda has been stored in SQLite tables, in a single database named capstone.sqlite. This may take more memory, but the data is stored offline, searches are faster because of indexes, and the sample is much easier to handle because it consists of a single file. Furthermore, I can regenerate the database at will and make any kind of modification (increase its size, perform specific pruning, etc.) without changing the model. Frequency tables of 4-grams have been generated and transferred into the SQLite table QUADGRAM (each 4-gram is split into its words, which allows searching with fewer than 3 words of input).
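The pipeline described above can be sketched roughly as follows. The application itself is written in R with quanteda and RSQLite; this sketch uses Python's standard library only to illustrate the logic, and it is not the author's code. The table name QUADGRAM comes from the project, but the column names (w1 to w4, freq), the index, the regular expressions, the helper names and the one-word profanity list are all assumptions made for illustration.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical placeholder; the real profanity list is not part of this document.
PROFANITIES = {"darn"}

def clean(text):
    """Clean and tokenize text, roughly following the steps listed above."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)                # email addresses
    text = text.encode("ascii", "ignore").decode()      # keep ASCII only
    text = text.replace("'", "")                        # apostrophes
    text = re.sub(r"[^a-z\s]", " ", text)               # numbers, '#', '@', punctuation
    return [w for w in text.split() if w not in PROFANITIES]

def quadgram_counts(docs):
    """Count every run of 4 consecutive words in the cleaned documents."""
    counts = Counter()
    for doc in docs:
        words = clean(doc)
        for i in range(len(words) - 3):
            counts[tuple(words[i:i + 4])] += 1
    return counts

def build_db(docs, path=":memory:"):
    """Store the 4-grams split into one column per word, plus a frequency."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE QUADGRAM (w1 TEXT, w2 TEXT, w3 TEXT, w4 TEXT, freq INTEGER)"
    )
    con.executemany(
        "INSERT INTO QUADGRAM VALUES (?, ?, ?, ?, ?)",
        [(*gram, n) for gram, n in quadgram_counts(docs).items()],
    )
    # Index ordered from the most recent context word backwards, so lookups
    # with only 1 or 2 context words can still use it (an assumed design).
    con.execute("CREATE INDEX idx_quadgram ON QUADGRAM (w3, w2, w1)")
    con.commit()
    return con

con = build_db([
    "I can't wait to see you at 8pm! http://t.co/x",
    "Can't wait to see you again",
])
for row in con.execute("SELECT * FROM QUADGRAM ORDER BY freq DESC, w1"):
    print(row)
```

Passing a file name such as capstone.sqlite instead of ":memory:" would write the same table to a single file on disk, which matches the single-file setup described above.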

Prediction and the Application in practice

The SQL table is used to predict the next word when a user enters text, which is preprocessed with the same cleaning criteria used to generate the sample. The application always gives a prediction: if the input is insufficient, it proposes the most frequent unigram as a default. The user input and the predicted word are both displayed. The initial database load is a bit slow, but after that the app is quite responsive.
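A lookup of this kind could look roughly like the sketch below, again in Python's standard library rather than the R code the app actually uses. It assumes a QUADGRAM table with hypothetical columns w1 to w4 and freq, assumes the input has already been cleaned and split into words, and assumes that the search falls back to fewer context words when the longer context has no match. The function name predict and the default word "the" are illustrative stand-ins; the real default would be the most frequent unigram in the sample.

```python
import sqlite3

def predict(con, words, default="the"):
    """Return the most frequent next word for the last (up to) 3 input words.

    Falls back to shorter contexts, then to `default` (a stand-in for the
    most frequent unigram) when nothing matches.
    """
    context = words[-3:]
    while context:
        # Match the most recent word against w3, the one before against w2, ...
        cols = ["w3", "w2", "w1"][:len(context)]
        where = " AND ".join(f"{c} = ?" for c in cols)
        row = con.execute(
            f"SELECT w4 FROM QUADGRAM WHERE {where} "
            "GROUP BY w4 ORDER BY SUM(freq) DESC, w4 LIMIT 1",
            list(reversed(context)),
        ).fetchone()
        if row:
            return row[0]
        context = context[1:]  # drop the oldest word and try again
    return default

# Tiny made-up table, for demonstration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE QUADGRAM (w1 TEXT, w2 TEXT, w3 TEXT, w4 TEXT, freq INTEGER)")
con.executemany("INSERT INTO QUADGRAM VALUES (?, ?, ?, ?, ?)", [
    ("thanks", "for", "the", "follow", 5),
    ("thanks", "for", "the", "help", 2),
    ("one", "of", "the", "best", 4),
])

print(predict(con, ["thanks", "for", "the"]))  # follow
print(predict(con, ["of", "the"]))             # best
print(predict(con, ["hello", "for", "the"]))   # follow (falls back to 2 words)
print(predict(con, []))                        # the (default)
```

Because the context words are bound as query parameters rather than pasted into the SQL string, quoting problems with user input do not arise in this sketch.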

Application Screenshot

Links