Word Prediction App

Pandatas
January 2020

Introduction

Background

The Word Prediction App was developed as part of the Coursera/Swiftkey Data Science Capstone Project.

Objective

This application predicts the next word of a sentence entered by a user using a text prediction algorithm.

Location

The Word Prediction App is located at https://pandatas.shinyapps.io/TextPrediction/.

Description of the algorithm

The text prediction model was developed using three English text datasets (“blogs”, “news” and “twitter”) taken from a multilingual dataset located at:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

These three datasets were loaded, sampled and cleaned by removing extra white space, punctuation, numbers and stopwords, and by converting upper case letters to lower case.
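The cleaning steps can be illustrated with a minimal Python sketch. It is not the code behind the app; the function name `clean_text` and the tiny stopword list are assumptions made for the example, since the stopword list actually used is not stated here.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}

def clean_text(text):
    """Return the cleaned tokens of a piece of raw text."""
    text = text.lower()                      # upper case -> lower case
    text = re.sub(r"[^a-z\s]", " ", text)    # drop punctuation and numbers
    tokens = text.split()                    # split() also collapses white space
    return [t for t in tokens if t not in STOPWORDS]  # drop stopwords

print(clean_text("The 3 Cats,  and a DOG!"))
```

Note that removing stopwords also prevents them from ever being suggested, so whether to keep this step is a design choice rather than a requirement.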

Then the sampled corpus was “tokenized” into n-grams, i.e. the text was broken up into phrases of n consecutive words. The n-grams were sorted by frequency, so that the words that most often follow the user's input in the corpus can be offered as predictions.
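A rough Python sketch of frequency-based n-gram prediction follows. The names `build_ngram_table` and `predict_next` are hypothetical, and falling back from longer to shorter n-grams when no match is found is an assumption of this sketch, not a detail confirmed for the app.

```python
from collections import Counter, defaultdict

def build_ngram_table(tokens, n):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        table[prefix][tokens[i + n - 1]] += 1
    return table

def predict_next(tables, words, k=3):
    """Return up to k next-word candidates, most frequent first.

    tables maps n (>= 2) to the table built for that n. The longest
    n-gram is tried first; shorter ones are a fallback (an assumption).
    """
    for n in sorted(tables, reverse=True):
        prefix = tuple(words[-(n - 1):])
        if len(prefix) == n - 1 and prefix in tables[n]:
            return [w for w, _ in tables[n][prefix].most_common(k)]
    return []

tokens = "i love data science i love data analysis i love coffee".split()
tables = {n: build_ngram_table(tokens, n) for n in (2, 3)}
print(predict_next(tables, ["i", "love"]))
```

In this toy corpus "i love" is followed by "data" twice and "coffee" once, so "data" is ranked first.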

Description of the Application

The application uses the text prediction algorithm to suggest three words based on the text phrase entered by the user.

  • Enter your text
  • Press the button with the suggestion you prefer
  • The suggestion is added to the original text and three new suggestions appear.
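The interaction described in these steps can be sketched in Python as follows. This is only an illustration of the flow, not the Shiny code of the app; `accept_suggestion`, `session_step` and the `suggest` callable are hypothetical names, and `suggest` stands in for whatever prediction function is plugged in.

```python
def accept_suggestion(text, suggestion):
    """Append the chosen suggestion to the current text with a single space."""
    return f"{text.rstrip()} {suggestion}".strip()

def session_step(text, choice, suggest):
    """Simulate one button press.

    suggest is any function mapping a text to a list of suggested words;
    choice is the index of the button the user pressed.
    Returns the updated text and the new suggestions for it.
    """
    new_text = accept_suggestion(text, suggest(text)[choice])
    return new_text, suggest(new_text)

# Stand-in predictor that always offers the same three words (assumption).
def dummy_suggest(text):
    return ["the", "to", "and"]

text, suggestions = session_step("I want ", 1, dummy_suggest)
print(text)
print(suggestions)
```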

Recommendations for further development

  • The accuracy of the app could be improved by increasing the sample size.
  • The application is currently only available in English, but it could be extended to other languages.