Data Science Capstone Project: Words Prediction

Irni Jasmina Ibrahim
May 27, 2017

Introduction

  • The aim of this project is to create a Shiny app that takes a phrase or several words as input and predicts the most probable next words.

  • The project is based on the Twitter, news and blog data provided by SwiftKey.

  • The work included cleaning the data and building the prediction model. For example, special characters and bad words were removed from the data.
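The cleaning step can be illustrated with a minimal Python sketch. It is not the project's actual code: the function name `clean_text`, the placeholder `BAD_WORDS` list, the lowercasing step, and the choice to keep apostrophes are all assumptions made for the example.

```python
import re

# Hypothetical stand-in list; the word list actually used is not given here.
BAD_WORDS = {"darn", "heck"}

def clean_text(text, bad_words=BAD_WORDS):
    """Lowercase the text, drop special characters, and remove listed words."""
    text = text.lower()  # assumption: case is normalised before tokenizing
    # Replace everything except letters, apostrophes and whitespace with a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Splitting on whitespace also collapses the runs of spaces left behind.
    words = [w for w in text.split() if w not in bad_words]
    return " ".join(words)

print(clean_text("What the HECK?! It's #great :) @user"))
```

Replacing special characters with a space, rather than deleting them, keeps neighbouring words from being fused together.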

Model and Algorithms

  • The cleaned data was tokenized into n-grams.

  • An n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. Source: Wikipedia

  • The algorithm processes the sample corpus into bi-gram, tri-gram and quad-gram tables, each recording how often every n-gram occurs.

  • To predict, the app looks up the input in the n-gram frequency tables (stored as data frames) and returns the words that most frequently follow it.
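The counting and lookup described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the app's implementation: the helper names `ngram_counts` and `next_words` and the toy corpus are invented for the example, and a `Counter` stands in for the data frame of n-grams and frequencies.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def next_words(table, context):
    """Return (word, frequency) pairs that follow `context`, most frequent first."""
    context = tuple(context)
    matches = [(gram[-1], freq) for gram, freq in table.items()
               if gram[:-1] == context]
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))

# Toy corpus, assumed to be already cleaned.
corpus = "i love data science and i love data analysis and i love coffee".split()

# One frequency table each for bi-grams, tri-grams and quad-grams.
tables = {n: ngram_counts(corpus, n) for n in (2, 3, 4)}

print(next_words(tables[2], ["love"]))               # lookup in the bi-gram table
print(next_words(tables[4], ["i", "love", "data"]))  # lookup in the quad-gram table
```

A context of k words is looked up in the (k+1)-gram table, so one word of input uses the bi-gram table and three words use the quad-gram table.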

The App

Image

You can access the ShinyApp by clicking here.

About the App

  • The user interface of the app is as shown in the previous slide.
  • The user types input into the text box provided.
  • A couple of options are provided as well, giving more flexibility in choosing the possible outputs and the frequency of outputs, as desired.
  • The user can select one of the top 3 predictions that appear above the text box, or change the options first and then press the 'Predict' button to get the desired outputs.
  • The app predicts outputs based only on the last three words of the input in the text box.
  • An instruction manual is also provided for reference.
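The "last three words" and "top 3" behaviour can be sketched in Python as below. This is a hedged illustration only: the names `build_tables` and `predict`, the toy corpus, and in particular the fallback to shorter contexts when the longest one has no match are assumptions, since the slides do not say how unmatched input is handled.

```python
from collections import Counter

def build_tables(tokens, sizes=(2, 3, 4)):
    """Build one n-gram frequency table per size (bi-, tri-, quad-gram)."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in sizes
    }

def predict(text, tables, top_k=3):
    """Suggest up to top_k next words using only the last three input words."""
    words = text.lower().split()[-3:]  # everything before the last three words is ignored
    # Assumed fallback: try the longest context first, then shorter ones.
    for size in range(len(words), 0, -1):
        context = tuple(words[-size:])
        candidates = Counter()
        for gram, freq in tables.get(size + 1, {}).items():
            if gram[:-1] == context:
                candidates[gram[-1]] += freq
        if candidates:
            ranked = sorted(candidates.items(), key=lambda p: (-p[1], p[0]))
            return [word for word, _ in ranked[:top_k]]
    return []  # nothing matched at any context length

corpus = ("the cat sat on the mat and the cat sat on the sofa "
          "and the dog sat on the rug").split()
tables = build_tables(corpus)

print(predict("my cat sat on the", tables))
print(predict("zzz the", tables))
```

In the first call only "sat on the" is used, so the earlier words "my cat" have no effect on the suggestions.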