Steve Wenck
2/11/2018
The objective of this capstone is to develop an application that can predict the next word, given a user-provided string of text. This is similar to how mobile device keyboard applications function.
This project involved the following tasks: (1) understanding the problem, (2) acquiring and cleaning the data, (3) performing exploratory data analysis, (4) performing statistical modeling, (5) creating a prediction model, (6) optimizing the prediction model for performance and creativity, (7) creating the data product (the application), and (8) creating the presentation.
The corpora were collected from publicly available sources by a web crawler. Three files were provided: Blogs, News and Twitter.
The milestone report describing the data acquisition, cleaning, and exploratory analysis can be found here: http://www.rpubs.com/Demographer/CapstoneWeek2.
The downloaded data are too large to use in a prediction algorithm intended for a mobile device, so a sample was taken that was large enough to give accurate predictions without slowing down the app.
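As an illustration of the sampling step, here is a minimal Python sketch that draws a reproducible random sample of lines. It is not the code used for this project: the function name `sample_lines`, the 5% fraction, and the seed are assumptions chosen only for the example, and the toy `corpus` stands in for the real Blogs, News and Twitter files.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Return a random subset containing roughly `fraction` of the lines."""
    if not lines:
        return []
    rng = random.Random(seed)  # fixed seed so the same sample is drawn each run
    k = min(len(lines), max(1, int(len(lines) * fraction)))
    return rng.sample(lines, k)  # sampling without replacement

# Toy stand-in for a corpus file that has been read line by line.
corpus = [f"line {i}" for i in range(1000)]
sample = sample_lines(corpus, fraction=0.05)
print(len(sample))  # 50
```

The fraction is the knob that trades accuracy against speed: a larger sample gives more n-grams to match against, but also larger lookup tables for the app to load and search.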
The sample was converted into n-grams (contiguous sequences of n items from a given sequence of text). Bigrams (2 words), trigrams (3 words) and 4-grams (4 words) were created, and the frequencies of those n-grams were computed.
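To make the n-gram step concrete, the following Python sketch counts n-grams in a small set of sentences. It is only an illustration under simplifying assumptions, not the project's own code: the helper names `tokenize` and `ngram_counts` are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace, whereas the real cleaning steps are described in the milestone report.

```python
from collections import Counter

def tokenize(text):
    """Very simple tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()

def ngram_counts(sentences, n):
    """Count every contiguous run of n words across the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

sentences = ["the cat sat on the mat", "the cat ate the fish"]
bigrams = ngram_counts(sentences, 2)
trigrams = ngram_counts(sentences, 3)
print(bigrams[("the", "cat")])  # 2
```

N-grams are counted within each sentence separately, so no n-gram spans the boundary between two sentences.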
The n-gram frequency matrices were used in the application to predict the next word, falling back to shorter n-grams when needed. If a likely match could be found using the last 3 words typed, 4-grams were used. If there was no match, the last 2 words typed were compared against trigrams. If there was still no match, the last word typed was compared against bigrams.
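The fallback order described above can be sketched in Python as follows. This is a simplified illustration, not the application's actual implementation: the names `build_tables` and `predict_next` are hypothetical, a "likely match" is assumed to mean the most frequent continuation, and the toy sentences stand in for the sampled corpus.

```python
from collections import Counter, defaultdict

def build_tables(sentences, orders=(2, 3, 4)):
    """For each n-gram order, map a context (first n-1 words) to next-word counts."""
    tables = {n: defaultdict(Counter) for n in orders}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in orders:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(text, tables):
    """Try the longest n-grams first, then fall back to shorter ones.

    Returns (predicted word, n-gram order used), or (None, None) if no match.
    """
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):  # 4-grams, then trigrams, then bigrams
        if len(tokens) < n - 1:
            continue  # not enough typed words to form this context
        context = tuple(tokens[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0], n
    return None, None

sentences = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "a dog sat on the mat",
]
tables = build_tables(sentences)
print(predict_next("my cat sat on", tables))      # ('the', 4)
print(predict_next("a bird sat on", tables))      # ('the', 3)
print(predict_next("he was sitting on", tables))  # ('the', 2)
```

Returning the n-gram order alongside the word mirrors the app's behaviour of reporting which level of n-gram produced the prediction.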
The app is intuitive and user friendly.
Simply start by typing text in the box labeled “enter text here” on the left side of the app.
The app automatically starts predicting the next word once typing pauses for a second.
The resulting predicted word appears in the “Predicted next word:” box on the right side of the app.
The app also replicates the words entered and states which level of n-gram was used to predict the next word. This information appears below the predicted next word.