Final Project Submission - Word Prediction
17.11.2017
A Shiny app that takes a phrase (multiple words) from a text box and, after a short delay, outputs a prediction of the next word.
The data basis is US blogs, news, and Twitter data downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The following code shows how the data for the app is loaded and preprocessed.
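library(ngram)    # n-gram tokenization and phrase tables
library(stringr)  # provides word(), used by find_word()
# Read the first 10,000 lines of each source
blogs <- readLines("C:/Coursera/ngram/final/en_US/en_US.blogs.txt", n = 10000)
news <- readLines("C:/Coursera/ngram/final/en_US/en_US.news.txt", n = 10000)
twitter <- readLines("C:/Coursera/ngram/final/en_US/en_US.twitter.txt", n = 10000)
# Merge everything into one string and normalize it (lower case, spacing)
str <- concatenate(blogs, news, twitter)
str <- preprocess(str)
# Build the 2-grams and their frequency table
ng <- ngram(str, n = 2)
pt_ng <- get.phrasetable(ng)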
The app works with word 2-grams built from this data.
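To make the shape of such a 2-gram table concrete, here is a minimal base-R sketch for a toy sentence. It does not use the ngram package; the names toy_text and toy_table are hypothetical, and the columns only mimic the ngrams/freq layout that the lookup relies on.

```r
# Toy corpus (hypothetical); the real app uses the blogs, news, and twitter samples
toy_text <- "the cat sat on the mat and the cat slept"
words <- strsplit(tolower(toy_text), " +")[[1]]

# Pair each word with its successor to form 2-grams
bigrams <- paste(head(words, -1), tail(words, -1))

# Count the 2-grams and sort them by descending frequency
counts <- sort(table(bigrams), decreasing = TRUE)
toy_table <- data.frame(
  ngrams = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
print(head(toy_table, 3))
```

In this toy corpus "the cat" occurs twice and every other 2-gram once, so "the cat" ends up in the first row.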
The core of the app is a function that takes the last word of the text input and searches for the 2-gram that starts with this word and has the highest frequency. The second word of this 2-gram is displayed.
Code of the core function:
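find_word <- function(w) {
  # Last word of the input, lower-cased like the preprocessed corpus
  last <- tolower(stringr::word(trimws(w), -1))
  # Keep 2-grams whose first word is exactly `last`; the trailing space avoids prefix matches ("the" vs. "there")
  x <- pt_ng[startsWith(as.character(pt_ng$ngrams), paste0(last, " ")), ]
  # Take the first matching row; this relies on the phrase table being sorted by descending frequency
  x <- as.character(x$ngrams)[1]
  if (is.na(x)) "Sorry, no suggestion" else stringr::word(x, 2)
}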
If no corresponding 2-gram is found, the text “Sorry, no suggestion” is displayed.
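How such a function might be wired into the Shiny app can be sketched as follows. This is a minimal sketch under assumptions, not the submitted app code: the input and output IDs (phrase, prediction), the 500 ms delay, and the toy stand-ins demo_table and find_word_demo are hypothetical and only there so the example runs on its own.

```r
library(shiny)

# Hypothetical stand-in data; the real app would use the full phrase table
demo_table <- data.frame(
  ngrams = c("the cat", "the mat", "cat sat"),
  freq = c(2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the core function, written in base R
find_word_demo <- function(w) {
  words <- strsplit(trimws(tolower(w)), " +")[[1]]
  if (length(words) == 0) return("Sorry, no suggestion")
  last <- words[length(words)]
  # First 2-gram that starts with the last word (table is sorted by frequency)
  hit <- demo_table$ngrams[startsWith(demo_table$ngrams, paste0(last, " "))][1]
  if (is.na(hit)) "Sorry, no suggestion" else strsplit(hit, " ")[[1]][2]
}

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output, session) {
  # Wait until typing has paused for 500 ms before predicting (assumed delay)
  phrase <- debounce(reactive(input$phrase), 500)
  output$prediction <- renderText({
    req(nzchar(trimws(phrase())))  # show nothing while the box is empty
    find_word_demo(phrase())
  })
}

# shinyApp(ui, server)  # uncomment to launch the app
```

Debouncing the input is one way to get a short delay between typing and prediction, so the lookup does not run on every keystroke.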