A Shiny Word Predictor!

Rui La
12/27/2016

2014 Data Science Capstone

Project introduction

The purpose of this Data Science Capstone project is to apply the skills acquired in the previous courses to build an application based on a predictive text model. Given a word or a sentence as input, the application returns a list of predicted next words.

The next word prediction app is on shinyapps.io:

Larry's word prediction App

The App is slow when predicting next words. Please be patient. Thanks.

How it works

  • The app draws its predictions from three English-language data sets: us_news, us_twitter and us_blogs.
us_twitter = readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")
us_blog = readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")
us_news = readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8")
  • Strings saved into datasource are cleaned by removing punctuation and digits and converting to lowercase.
library(stringr)  # provides str_replace_all
string = str_replace_all(string, "[[:punct:]]", "")  # strip punctuation
string = str_replace_all(string, "[[:digit:]]", "")  # strip digits
string = tolower(string)                             # normalize case
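  • General predictions are made by extracting the text that follows the search term from datasource with regexpr.
# paste() joins with a space, so the pattern is "<term> (.*?) ": the term plus the word after it
match = regmatches(datasource, regexpr(paste(term, "(.*?) "), datasource))
# strip the leading term and the trailing space, leaving only the following word
matchlist = gsub(paste(term, "| $"),"", match)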
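The cleaning and extraction steps above can be tried end to end on a few lines of text. The following is a minimal sketch, not the app's actual code: the three sample sentences and the helper names clean_text and next_words are made up for illustration, base R gsub stands in for stringr, and the pattern is a simplified variant ("[^ ]+" for the following word) of the one shown above.

```r
# Clean a character vector: drop punctuation and digits, then lowercase.
clean_text <- function(x) {
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:digit:]]", "", x)
  tolower(x)
}

# Hypothetical sample lines standing in for the twitter/blogs/news corpus.
datasource <- clean_text(c(
  "I love New York!",
  "We love pizza, 24/7.",
  "They love New ideas"
))

# Return the word that directly follows `term` in each line containing it.
next_words <- function(term, datasource) {
  # \\b keeps "love" from matching inside a longer word such as "glove"
  pattern <- paste0("\\b", term, " [^ ]+")
  match <- regmatches(datasource, regexpr(pattern, datasource, perl = TRUE))
  # remove the leading term so only the following word remains
  sub(paste0("^", term, " "), "", match)
}

matchlist <- next_words("love", datasource)
print(matchlist)
```

Note that regexpr only reports the first match in each line, so a term that occurs twice in one line contributes a single following word.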

How it works (2)

  • The data sample was then tokenized into so-called n-grams

  • Find the most frequent words that follow each n-gram term.

nwText[i] = names(sort(table(next.word), decreasing = TRUE)[i])
# nwText[i] holds the i-th most frequent word following the previous n-gram term
  • Build the model to predict the next words and make a table of their frequencies
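The tokenize, count and rank steps above can be sketched on a toy corpus. This is an illustration under stated assumptions rather than the app's model: the four sample lines and the helper names ngram_next and predict_next are hypothetical, and the text is assumed to be already cleaned and split on spaces.

```r
# Hypothetical, already-cleaned sample corpus.
corpus <- c(
  "i love new york",
  "we love new ideas",
  "they love pizza",
  "i love new york city"
)

# Build a table of (n-gram, next word) pairs from every line.
ngram_next <- function(lines, n) {
  pairs <- lapply(strsplit(lines, " +"), function(tok) {
    if (length(tok) <= n) return(NULL)  # too short to have a following word
    idx <- seq_len(length(tok) - n)
    data.frame(
      ngram = vapply(idx, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
                     character(1)),
      next.word = tok[idx + n],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pairs)
}

# Rank the words seen after `term` by frequency and return the top k.
predict_next <- function(term, pairs, k = 3) {
  next.word <- pairs$next.word[pairs$ngram == term]
  if (length(next.word) == 0) return(character(0))
  freq <- sort(table(next.word), decreasing = TRUE)
  head(names(freq), k)
}

pairs <- ngram_next(corpus, n = 2)
print(predict_next("love new", pairs, k = 2))
```

Precomputing the pairs table once means each prediction is a lookup plus a small sort, instead of a fresh scan of the whole corpus.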

The App

Free to use and runs right in the browser using Shiny.

  • Type a sentence
  • Select number of predictions you want to see
  • Check prediction results
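
The three steps above map onto a small Shiny interface. The sketch below is only a guess at the general shape, not the published app: the input IDs, the slider range and the predict_stub placeholder (which ignores its input and returns fixed words) are all assumptions made for illustration.

```r
library(shiny)

# Placeholder predictor (hypothetical): swap in the real model lookup here.
predict_stub <- function(sentence, k) {
  head(c("the", "to", "and", "a", "of"), k)
}

ui <- fluidPage(
  titlePanel("Word predictor (sketch)"),
  textInput("sentence", "Type a sentence"),
  sliderInput("k", "Number of predictions", min = 1, max = 5, value = 3),
  tableOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderTable({
    req(input$sentence)  # show nothing until a sentence has been typed
    data.frame(prediction = predict_stub(input$sentence, input$k))
  })
}

app <- shinyApp(ui, server)
# In an interactive R session, launch it with: runApp(app)
```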