Coursera Data Science Specialization Capstone Project

Steve Wenck
2/11/2018

The objective of this capstone is to develop an application that can predict the next word, given a user-provided string of text. This is similar to how mobile device keyboard applications function.

Overview

  • This project involved the following tasks: (1) understanding the problem, (2) acquiring and cleaning the data, (3) performing exploratory data analysis, (4) performing statistical modeling, (5) creating a prediction model, (6) optimizing the prediction model for performance and creativity, (7) creating the data product (the application), and (8) creating the presentation.

  • The corpora were collected from publicly available sources by a web crawler. Three files were provided: Blogs, News and Twitter.

  • The milestone report describing the data acquisition, cleaning, and exploratory analysis can be found here: http://www.rpubs.com/Demographer/CapstoneWeek2.

Approach

  • The downloaded data are too large for a prediction algorithm intended to run on a mobile device, so a sample was taken that was large enough to predict accurately but small enough not to slow down the app.

  • The sample was converted into n-grams (contiguous sequences of n items from a given sequence of text). Bigrams (2 words), trigrams (3 words) and 4-grams (4 words) were created, and the frequencies of those n-grams were computed.

  • The n-gram frequency matrices were used in the application to predict the next word by backing off from longer to shorter n-grams. If a likely match was found using the last 3 words typed, 4-grams were used. If not, the last 2 words typed were compared against trigrams. If there was still no match, the last word typed was compared against bigrams.
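
The n-gram counting and the longest-match-first lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used in the app: the function names (`tokenize`, `build_ngram_tables`, `predict_next_word`), the whitespace tokenizer, and the choice to return the single most frequent continuation are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def build_ngram_tables(sentences, max_n=4):
    """Count n-grams for n = 2..max_n.

    tables[n] maps a prefix of n-1 words to a Counter of the words
    that followed that prefix in the sample.
    """
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = tokenize(sentence)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                prefix = tuple(words[i:i + n - 1])
                next_word = words[i + n - 1]
                tables[n][prefix][next_word] += 1
    return tables


def predict_next_word(text, tables, max_n=4):
    """Try the longest n-gram first, then back off to shorter ones.

    Returns (predicted_word, n_gram_level), or (None, None) if even
    the bigram table has no match for the last word typed.
    """
    words = tokenize(text)
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough typed words for this n-gram level
        prefix = tuple(words[-(n - 1):])
        candidates = tables[n].get(prefix)
        if candidates:
            # Most frequent continuation seen after this prefix.
            return candidates.most_common(1)[0][0], n
    return None, None


# Hypothetical toy sample standing in for the sampled corpus.
sample = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow",
    "see you in the morning",
]
tables = build_ngram_tables(sample)
print(predict_next_word("many thanks for the", tables))  # uses 4-grams
print(predict_next_word("over the", tables))             # backs off to bigrams
```

Returning the n-gram level alongside the word mirrors the app's behavior of reporting which level of n-gram produced the prediction. A real implementation would likely also need punctuation handling and a rule for ties, neither of which is specified here.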

App Functionality

  • The app is designed to be intuitive and user friendly.

  • Simply start by typing text in the box labeled “enter text here” on the left side of the app.

  • The app automatically tries to predict the next word as soon as typing pauses for a second.

  • The resulting predicted word appears in the “Predicted next word:” box on the right side of the app.

  • The app also replicates the words entered and states which level of n-gram was used to predict the next word. This information appears below the predicted next word.

The App