Next Word Prediction

15/12/2019

About the Capstone Project

This project is part of the tenth and final course of the Coursera Data Science Specialization. It analyzes several large text files to understand their structure and, based on that analysis, builds a model that predicts the next word a user will type.
Contents
- Text data analysis: analysis of the corpus to understand the relationship of words and word pairs
- Predictive modeling: build basic n-gram models and develop algorithms to facilitate text prediction
- Shiny app development: produce a web-based Shiny app to predict next words

Getting and cleaning the data: profanity was first removed and words tokenized
Exploratory data analysis: the frequencies of words and word pairs were calculated
Modeling: 2-gram to 7-gram models were built to facilitate word prediction
Prediction model:
- Katz’s back-off model was used to predict the next word
- The model iterates from the 7-gram model down to the 2-gram model, looking for a match on the last n-1 words
- In the case of an unseen n-gram, the most frequent word, ‘the’, is returned
- To improve efficiency, word pairs that appear fewer than 5 times in the corpus were removed
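
The back-off lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's actual code: the function names (tokenize, build_models, predict_next), the placeholder profanity list, and the choice to apply the minimum-count pruning to every n-gram order are all hypothetical. It is also a simplified back-off that returns the most frequent continuation at the longest matching order; it does not implement the probability discounting that Katz's model uses.

```python
import re
from collections import Counter, defaultdict

# Hypothetical placeholder; the project's actual profanity list is not given.
PROFANITY = {"darn"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, drop listed profanity."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in PROFANITY]

def build_models(sentences, max_n=7, min_count=5):
    """Return {n: {context tuple: Counter of next words}} for n = 2..max_n."""
    counts = {n: Counter() for n in range(2, max_n + 1)}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    models = {}
    for n, ngram_counts in counts.items():
        table = defaultdict(Counter)
        for ngram, c in ngram_counts.items():
            if c >= min_count:  # assumption: prune rare n-grams at every order
                table[ngram[:-1]][ngram[-1]] = c
        models[n] = table
    return models

def predict_next(phrase, models, max_n=7, fallback="the"):
    """Try the longest context first, then back off to shorter ones."""
    tokens = tokenize(phrase)
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # not enough words typed for this order
        context = tuple(tokens[-(n - 1):])
        candidates = models[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return fallback  # unseen context: return the most frequent word

corpus = ["the cat sat on the mat", "the cat sat on the sofa", "the cat sat on the mat"]
models = build_models(corpus, max_n=3, min_count=1)
print(predict_next("sat on the", models, max_n=3))
print(predict_next("zebra", models, max_n=3))
```

In this toy corpus the context "on the" is followed by "mat" twice and "sofa" once, so the first call prints "mat"; the second call finds no match at any order and falls back to "the". The minimum count is lowered to 1 here only because the toy corpus is tiny.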

The data analysis and model-building write-ups were submitted to the Coursera platform
The Shiny app for prediction can be found at: https://rafmesal.shinyapps.io/predictorApp/
The app takes in the following inputs:
1. a word or phrase that the user inputs
2. “# words to predict”, where the user selects the number of words to predict
The predicted next words are shown in order from most to least frequently used
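
As a rough illustration of that ordering, the Python sketch below ranks candidate words by frequency and keeps the number the user asked for. The function name top_predictions and the example counts are hypothetical, not taken from the app.

```python
from collections import Counter

def top_predictions(candidates, k):
    """Return up to k candidate words, most frequent first."""
    return [word for word, _ in candidates.most_common(k)]

# Hypothetical counts of words seen after some typed phrase.
counts = Counter({"mat": 12, "sofa": 7, "floor": 3})
print(top_predictions(counts, 2))
```

With these made-up counts, asking for two words yields "mat" and then "sofa".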

Type a word or phrase and see results in seconds!