Word Predictor

Justin Pizzino
3/7/20

Purpose

Have you ever wanted to impress your friends by guessing what they are about to say? Now you can, with my new word predictor application.

For the low, low cost of free, you can enter words into this application and it will predict the next word.

How does it work?

  • You enter text in the input box, and the top predicted word is returned.

  • Below that, a table shows how the result was reached.

How was this built?

  • Using data from blogs, texts, and Twitter provided by Johns Hopkins, I pulled in and cleaned all the sentences to use for prediction.

  • These lines were further cleaned (symbols removed, etc.) and turned into 2-, 3-, 4-, and 5-word phrases (called n-grams) using the quanteda package.

  • Finally, the data was cut down to phrases that showed up at least 3 times, number-only results were removed, and a cumulative % was added so that the user can choose the accuracy they wish.
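The steps above can be sketched in a few lines of base R. This is only an illustration on a toy corpus, not the code behind the app (which used quanteda): `make_ngrams`, `build_gram_table`, the `min_count` argument, and the `cum_pct` column are names I made up here, and treating "number only" as "digits and separators only" is an assumption.

```r
# Build all n-word phrases from one cleaned line, words joined by "_".
make_ngrams <- function(line, n) {
    words <- strsplit(line, " +")[[1]]
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(s) paste(words[s:(s + n - 1)], collapse = "_"),
           character(1))
}

# Count phrases, filter them, and add a cumulative percentage.
build_gram_table <- function(lines, n, min_count = 3) {
    grams <- unlist(lapply(lines, make_ngrams, n = n))
    counts <- table(grams)
    out <- data.frame(line = names(counts),
                      freq = as.vector(counts),
                      stringsAsFactors = FALSE)
    # Keep phrases seen at least min_count times.
    out <- out[out$freq >= min_count, ]
    # Drop phrases made only of digits and separators, e.g. "10_20".
    out <- out[!grepl("^[0-9_]+$", out$line), ]
    # Most frequent first, then a running share of all kept occurrences.
    out <- out[order(-out$freq), ]
    out$cum_pct <- 100 * cumsum(out$freq) / sum(out$freq)
    rownames(out) <- NULL
    out
}

# Toy corpus standing in for the real data.
toy_lines <- c(rep("i love my dog", 3), "i love my cat", rep("10 20", 4))
tbl <- build_gram_table(toy_lines, n = 2)
print(tbl)
```

Cutting the table at a chosen `cum_pct` keeps only the most common phrases, which is one way to trade table size against coverage.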

Code examples

Here is the code used to predict (minor code lines removed to save space on the slide):

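library(stringr)  # str_count

# prep_line(), grepn(), filter_method, and max_lines_per_gram are not shown here.
next_word <- function(input_text, grams) {
    # Clean the input; the words come back joined by "_".
    input_text <- prep_line(input_text, filter_method)
    # Look up the full input first, allowing 10x the usual number of matches.
    index <- grepn(input_text, grams$line, max_lines_per_gram * 10, 20000)
    results <- grams[index, ]
    # Back off: retry with only the last i words, longest to shortest.
    num_words <- str_count(input_text, "_") - 1
    words <- strsplit(input_text, split = "_")[[1]]
    # seq_len() skips the loop when there is nothing to back off to;
    # num_words:1 would count upward from 0 or -1 instead.
    for (i in rev(seq_len(max(num_words, 0)))) {
        input_text_reduced <- paste0(paste(tail(words, i), collapse = "_"), "_")
        index <- grepn(input_text_reduced, grams$line, max_lines_per_gram, 20000)
        results <- rbind(results, grams[index, ])
    }
    return(results)
}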
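Since the helper functions are left off the slide, here is a small self-contained sketch of the same back-off idea that can be run as is. It is a stand-in, not the app's code: `backoff_lookup` and the toy `grams` table are hypothetical, and the plain `startsWith()` prefix match is an assumption in place of `grepn`.

```r
# Hypothetical gram table: `line` holds the phrase, words joined by "_".
grams <- data.frame(
    line = c("i_love_my_dog", "love_my_cat", "my_dog", "my_cat", "my_car"),
    freq = c(3, 5, 7, 4, 2),
    stringsAsFactors = FALSE
)

# Try the full input first, then drop leading words one at a time.
backoff_lookup <- function(input_words, grams) {
    results <- NULL
    for (i in rev(seq_along(input_words))) {
        # Prefix built from the last i input words, e.g. "love_my_".
        prefix <- paste0(paste(tail(input_words, i), collapse = "_"), "_")
        rest <- substring(grams$line, nchar(prefix) + 1)
        # Keep phrases that start with the prefix and add exactly one word.
        keep <- startsWith(grams$line, prefix) & !grepl("_", rest, fixed = TRUE)
        if (any(keep)) {
            hits <- grams[keep, ]
            hits$next_word <- rest[keep]
            hits$num_words <- i + 1  # length of the matched phrase
            results <- rbind(results, hits)
        }
    }
    results
}

candidates <- backoff_lookup(c("i", "love", "my"), grams)
print(candidates)
```

Matches from every prefix length are kept together, so a later scoring step can decide how much more a long match should count than a short one.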

Code examples

Here is the code used to score and return the top results:

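library(dplyr)  # %>%, group_by, summarise, arrange

my_score <- function(test) {
    # Weights indexed by num_words; a higher num_words gets a far larger weight.
    # Earlier weights: c(1, 1, 10, 35, 70)
    eq <- c(1, 1, 100000, 200000, 300000)
    # No candidates: return early instead of calling $ on a string.
    if (length(test$next_word) == 0) {
        return("no value")
    }
    test$score <- test$freq * eq[test$num_words]

    # Sum the scores for each candidate word and rank them.
    test_summary <- test %>%
        group_by(next_word) %>%
        summarise(total = sum(score)) %>%
        arrange(desc(total))

    # Up to five best candidates, without NA padding.
    return(head(test_summary$next_word, 5))
}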
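To see the scoring arithmetic on real numbers, here is a worked example in base R. The candidate table is made up, `score_candidates` is a hypothetical name, and I am assuming `num_words` is the length of the matched phrase; the weights are the ones from the slide.

```r
# Made-up candidates, as a lookup step might return them.
candidates <- data.frame(
    next_word = c("dog", "cat", "dog", "cat", "car"),
    freq      = c(3, 5, 7, 4, 2),
    num_words = c(4, 3, 2, 2, 2),
    stringsAsFactors = FALSE
)

score_candidates <- function(candidates,
                             weights = c(1, 1, 100000, 200000, 300000),
                             top_n = 5) {
    if (nrow(candidates) == 0) return(character(0))
    # Each row scores its frequency times the weight for its phrase length.
    candidates$score <- candidates$freq * weights[candidates$num_words]
    # Add up the scores per candidate word and sort, best first.
    totals <- aggregate(score ~ next_word, data = candidates, FUN = sum)
    totals <- totals[order(-totals$score), ]
    as.character(head(totals$next_word, top_n))
}

# dog: 3 * 200000 + 7 * 1 = 600007
# cat: 5 * 100000 + 4 * 1 = 500004
# car: 2 * 1              = 2
top_words <- score_candidates(candidates)
print(top_words)
```

With weights this far apart, a single match on a longer phrase outranks any realistic number of short matches, so the short phrases mostly act as tie-breakers and fallbacks.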