Data Science Capstone: word prediction

2017/1/23

Purpose of the Presentation

This presentation describes a Shiny application created for the PA of Johns Hopkins Coursera Data Science Unit 10, "Capstone Project". The application predicts the next word.

The application was inspired by SwiftKey, which predicts the next word from the user's input.

The application is published on shinyapps.io (https://takashisendo.shinyapps.io/JH_DataScience_Capstone_Words_Prediction/).

Overview of the Application

The application asks the user to input any number of words and predicts up to three possible next words.

Data frames were created for 4-word, 3-word, and 2-word sequences.
1-word (unigram) sequences are not used, to reduce the memory requirement.
If the input has more than two words, the 4-word table (QuaGram) is searched for the best matches, using the last three words of the input.
If no match is found, or the input has two words, the 3-word table is searched using the last two words of the input.
If no match is found in the 3-word table, or the input has only one word, the 2-word table is searched.
At most three best matches are shown to the user.
If no word is matched, "no match" is returned.
The sample rate from the corpus is 0.003 (0.3%).
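
As an illustration of the first and last points above, the following is a minimal sketch of how such n-gram data frames could be built in base R from a sampled corpus. The helper names `sample_lines` and `build_ngrams`, the `freq` column, and the simple tokenization are assumptions made for this example; they are not taken from the application's source.

```r
# Hypothetical helper: keep roughly `rate` of the lines (0.003 = 0.3%).
sample_lines <- function(lines, rate = 0.003, seed = 1) {
    set.seed(seed)
    lines[runif(length(lines)) < rate]
}

# Hypothetical helper: count n-word sequences and return a data frame
# with columns `word` (the n-gram) and `freq`, most frequent first.
build_ngrams <- function(lines, n) {
    grams <- unlist(lapply(lines, function(line) {
        # keep letters and apostrophes, lower-case, split on spaces
        cleaned <- tolower(gsub("[^A-Za-z' ]", " ", line))
        tokens <- strsplit(cleaned, " +")[[1]]
        tokens <- tokens[tokens != ""]
        if (length(tokens) < n) return(character(0))
        # paste every run of n consecutive tokens into one string
        starts <- seq_len(length(tokens) - n + 1)
        vapply(starts,
               function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
               character(1))
    }))
    counts <- sort(table(grams), decreasing = TRUE)
    data.frame(word = names(counts), freq = as.integer(counts),
               stringsAsFactors = FALSE)
}

# Toy corpus standing in for the sampled text
toy <- c("I love you", "I love you too", "I love it")
tri <- build_ngrams(toy, 3)
bi  <- build_ngrams(toy, 2)
```

With the real corpus, the call would look like `build_ngrams(sample_lines(corpus_lines), 4)`, where `corpus_lines` is a hypothetical character vector holding one line of text per element. Sorting by frequency up front means the first matching rows of a table can be treated as the best matches.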

Logic to handle unknown input length

The following shows the general logic for handling it, using the 4-word case as an example.

### prediction by matching

############# Ngrams = 4    
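# user_input: character vector of the input words (three or more)
predict_quagram<-function(user_input) {
    # join the last three input words into one search phrase
    words <- paste(user_input[length(user_input)-2], user_input[length(user_input)-1],
    user_input[length(user_input)])
    # keep the 4-grams that start with the phrase
    start_with_term<-paste("^","\\b",words,"\\b", sep="")
    find<-QuaGram[grep(start_with_term, QuaGram$word),]
    # no 4-gram match: fall back to the 3-gram search
    if (nrow(find)==0) {
        predicted<-predict_trigram(user_input)
        return(predicted)
    }
    find$word<-as.character(find$word)
    # split each matched 4-gram into its words; column 4 is the next word
    find_mat<-matrix(unlist(strsplit(find$word, " ")), ncol=4, byrow=TRUE)
    predicted<-find_mat[1,4]
    return(predicted)
}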
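
The function above covers one level of the search. As a rough sketch of how the three levels could be chained for any input length, the example below tries the 4-word, 3-word, and 2-word tables in turn and returns up to three candidates. The helpers `lookup` and `predict_next`, the toy tables, and the table names `TriGram` and `BiGram` are assumptions for illustration. It also swaps the regular-expression match for a fixed-string prefix test with `startsWith`, so that punctuation in the input is not interpreted as a pattern.

```r
# Toy n-gram tables, assumed to be sorted by descending frequency.
QuaGram <- data.frame(word = c("thanks for the follow", "thanks for the help"),
                      stringsAsFactors = FALSE)
TriGram <- data.frame(word = c("for the first", "for the follow", "one of the"),
                      stringsAsFactors = FALSE)
BiGram  <- data.frame(word = c("the same", "the first", "of the"),
                      stringsAsFactors = FALSE)

# Hypothetical helper: final words of up to `max_n` n-grams in `ngrams`
# that start with the last `k` words of `user_input`.
lookup <- function(ngrams, user_input, k, max_n = 3) {
    if (length(user_input) < k) return(character(0))
    prefix <- paste(tail(user_input, k), collapse = " ")
    hits <- ngrams$word[startsWith(ngrams$word, paste0(prefix, " "))]
    last_words <- vapply(strsplit(hits, " "),
                         function(w) w[length(w)], character(1))
    head(unique(last_words), max_n)
}

# Hypothetical dispatcher: search 4-grams, then 3-grams, then 2-grams.
predict_next <- function(user_input) {
    tables <- list(QuaGram, TriGram, BiGram)
    for (i in seq_along(tables)) {
        k <- 4 - i  # prefix length: 3, 2, 1 words
        found <- lookup(tables[[i]], user_input, k)
        if (length(found) > 0) return(found)
    }
    "no match"
}

# Usage: split the raw input into words before predicting
words <- strsplit(tolower("Thanks for the"), " +")[[1]]
result <- predict_next(words)
```

Because `lookup` returns an empty vector when the input is shorter than the prefix it needs, short inputs fall through to the smaller tables without a separate length check.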

Conclusion

By excluding unigrams, the application runs reasonably fast while recommending up to three possible next words.

When there is no match, either because the sequence is absent from the n-gram tables or because a word is new to the corpus, the application only returns "no match". This handling would have to be improved.