September 4th 2017
Coursera Data Science Capstone Project
Natural Language Processing (NLP) is a field of data science concerned with building models and software that improve computer-human interaction. A commercial example is next-word prediction on mobile keyboards, such as SwiftKey: https://swiftkey.com/en
As the capstone project for the Coursera Data Science Certification, a Shiny app was developed that uses a simple n-gram model to predict the next word in a sentence. The prediction algorithm was built for speed, since a user cannot be expected to wait several moments for each suggestion while typing.
# Create a document-feature matrix of 5-word phrases:
five_grams <- dfm(all_sample, ngrams = 5, verbose = FALSE)
# Sum occurrences of each 5-word phrase across all documents, and store as a data.table:
five_freq <- as.data.frame(col_sums(five_grams, na.rm = TRUE))
fivegrams_Frequency <- data.table(NGram = rownames(five_freq), Frequency = five_freq[,1])
# Keep ngrams that appear more than once:
fivegrams_Frequency <- fivegrams_Frequency[Frequency>1]
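The prediction excerpt in this document filters on `Initial` and `Final` columns, while the table built here has only `NGram` and `Frequency`. The step that connects the two is not shown, so the following is a minimal sketch of one way it could be done, not the app's actual code. It assumes the words of each n-gram are joined by `_` (quanteda's default concatenator) and uses a hypothetical toy table, `toy_freq`, in place of the real data.

```r
library(data.table)

# Hypothetical toy table in the same shape as fivegrams_Frequency.
# Assumption: words inside each n-gram are joined by "_".
toy_freq <- data.table(
  NGram     = c("it_would_mean_the_world",
                "it_would_mean_the_most",
                "can_i_get_what_i"),
  Frequency = c(5L, 2L, 3L)
)

# Initial: everything before the last separator, with "_" replaced by spaces
# so it can be compared directly with space-separated user input.
toy_freq[, Initial := gsub("_", " ", sub("_[^_]*$", "", NGram))]

# Final: the last word of the n-gram, i.e. the candidate prediction.
toy_freq[, Final := sub("^.*_", "", NGram)]

# The full n-gram string is no longer needed once it has been split.
toy_freq[, NGram := NULL]
```

Storing the context and the predicted word in separate columns means the lookup at prediction time is a plain equality filter on `Initial`, with no string splitting per request.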
# An excerpt of the model function predictword()
ngramtables <- fread("ngrams.csv")
input <- removePunctuation(input)
input <- stripWhitespace(input)
input <- char_tolower(input)
fourwordsample <- word(input, start = -4, end = -1, sep=" ")
return5 <- ngramtables[Initial==fourwordsample]
return5 <- head(return5[order(-Frequency), Final], 1)
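The excerpt shows only the five-gram lookup. How the app behaves when the four-word context has no match is not shown. As a hedged illustration, the sketch below backs off to shorter contexts and keys the table so that equality filters can use binary search. The function `predict_sketch()`, the toy table `toy_table`, and the backoff rule itself are assumptions made for this example; the normalisation uses base R rather than the packages in the excerpt.

```r
library(data.table)

# Hypothetical toy lookup table using the column names from the excerpt.
toy_table <- data.table(
  Initial   = c("it would mean the", "it would mean the", "mean the", "the"),
  Final     = c("world", "most", "same", "first"),
  Frequency = c(5L, 2L, 7L, 40L)
)
# Keying on Initial lets data.table answer `Initial == x` by binary search.
setkey(toy_table, Initial)

# predict_sketch() is a hypothetical stand-in, not the app's predictword().
predict_sketch <- function(input, lookup, max_context = 4L) {
  # Normalise: lower-case, strip punctuation, collapse runs of whitespace.
  input <- tolower(input)
  input <- gsub("[[:punct:]]", "", input)
  input <- trimws(gsub("\\s+", " ", input))
  words <- strsplit(input, " ", fixed = TRUE)[[1]]
  if (length(words) == 0L) return(NA_character_)

  # Try the longest context first, then drop the leading word and retry.
  for (n in seq(min(max_context, length(words)), 1L)) {
    context <- paste(tail(words, n), collapse = " ")
    hits <- lookup[Initial == context]
    if (nrow(hits) > 0L) {
      # Most frequent continuation wins.
      return(hits[order(-Frequency), Final][1L])
    }
  }
  NA_character_
}

predict_sketch("It would mean the", toy_table)    # "world"
predict_sketch("that would mean the", toy_table)  # "same"
```

Because the table is passed in as an argument, it can be read once when the app starts rather than on every call, which avoids repeating the file-read cost for each prediction.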
The published Shiny app can be found at this link: https://dboucher.shinyapps.io/n-gram_word_prediction/
Below are two example predictions from the model:
predictword("it would mean the")
Read 77.7% of 2186699 rows
Read 98.3% of 2186699 rows
Read 2186699 rows and 3 (of 3) columns from 0.050 GB file in 00:00:04
[1] "world"
predictword("can i get what i")
[1] "want"