Pandatas
January 2020
The Word Prediction App was developed as part of the Coursera/SwiftKey Data Science Capstone Project.
This application predicts the next word of a sentence entered by a user using a text prediction algorithm.
The Word Prediction App is located at https://pandatas.shinyapps.io/TextPrediction/.
The text prediction model was developed using three English text datasets (“blogs”, “news” and “twitter”) taken from a multilingual dataset located at:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
These three datasets were loaded, sampled and cleaned by removing extra white space, punctuation, numbers and stopwords, and by converting upper case letters to lower case.
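The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for the app: the function name `clean_text` and the tiny `STOPWORDS` set are hypothetical, and the actual stopword list and the order of the steps are not given in this description.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; the list used for the app is not stated).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, drop numbers and punctuation, remove stopwords, collapse white space."""
    text = text.lower()                    # upper case -> lower case
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = text.translate(PUNCT_TO_SPACE)  # remove punctuation
    words = [w for w in text.split() if w not in stopwords]  # split() also drops extra white space
    return " ".join(words)

print(clean_text("The 3 Cats, and 2 dogs!  "))
```

Replacing punctuation with a space rather than deleting it is a design choice of this sketch: it keeps words such as "news,blogs" apart, at the cost of splitting contractions like "don't" into two tokens.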
Then the sampled corpus was “tokenized” into n-grams, i.e. the text was broken up into sequences of n consecutive words. The n-grams were then sorted by frequency, so that the most frequent continuations of the user input can be offered as predictions of the next word in the application.
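To make the tokenization step concrete, here is a small sketch of how text can be split into n-grams and sorted by frequency. The helper names `ngrams` and `ngram_frequencies` are hypothetical, and the sketch assumes the text has already been cleaned and split into sentences; the tooling actually used for the app is not described here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(sentences, n):
    """Count n-grams over all sentences and return them sorted by descending frequency."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts.most_common()

# Example: bigrams (n = 2) over two short sentences.
for gram, freq in ngram_frequencies(["i like tea", "i like coffee"], 2):
    print(" ".join(gram), freq)
```

N-grams are built per sentence in this sketch so that no phrase spans a sentence boundary; whether the original model did the same is an assumption.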
The application uses the text prediction algorithm to suggest three candidate next words for the text phrase entered by the user.
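One plausible way to turn n-gram frequencies into three suggestions is sketched below. It is an illustration under stated assumptions, not the app's actual algorithm: `build_model` and `predict_next` are hypothetical names, the model uses a single n-gram order (n must be at least 2), and no fallback to shorter n-grams is attempted when the phrase is unseen.

```python
from collections import Counter, defaultdict

def build_model(sentences, n=3):
    """Map each (n-1)-word prefix to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            model[prefix][tokens[i + n - 1]] += 1
    return model

def predict_next(model, phrase, n=3, k=3):
    """Return up to k most frequent words seen after the last n-1 words of phrase."""
    tokens = phrase.lower().split()
    prefix = tuple(tokens[-(n - 1):])
    # .get avoids inserting unseen prefixes into the defaultdict.
    return [word for word, _ in model.get(prefix, Counter()).most_common(k)]

# Toy corpus: "milk" follows "i like" three times, "tea" twice, "coffee" once.
corpus = ["i like milk"] * 3 + ["i like tea"] * 2 + ["i like coffee"]
model = build_model(corpus)
print(predict_next(model, "I like"))
```

An unseen phrase simply yields an empty list in this sketch; a production model would typically need some strategy for that case, which this description does not specify.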