FC
December 2014
Here's how it works in plain English…
Say the user types “you are my”. The app searches the database for the most frequent 4-gram that starts with “you are my” and passes back the fourth word of that phrase… perhaps “best”.
# build a tiny corpus
tinyCorpus <- c("you are my best friend", "It's possible you are my... dog", "says you are my best fan")
# Remove punctuation, change capitals to lowers, and transform lines to long list of words
tinyCorpus <- gsub("([[:punct:]])", "", tolower(tinyCorpus))
allWords <- unlist(strsplit(tinyCorpus, " "))
# create 4-grams and count frequency of those 4-grams
library(stylo)
ngrams <- make.ngrams(allWords, 4)
count <- as.data.frame(table(ngrams))
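# sort by frequency, most frequent first (order() needs a column, not the whole data frame)
count <- count[order(count$Freq, decreasing = TRUE), ]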
# now search for "you are my " to see that "best" is the most likely next word
count[grep("^you are my", count$ngrams),]
            ngrams Freq
12 you are my best    2
13  you are my dog    1
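The search above stops at the matching rows. The last step described earlier, passing back the fourth word, could look roughly like the sketch below. This is not the app's actual code: `predictNext` is a hypothetical helper name, and the 4-grams are built with a base-R sliding window as a stand-in for `make.ngrams`, so the example runs without extra packages.

```r
# Rebuild the tiny corpus and its 4-gram counts using base R only
tinyCorpus <- c("you are my best friend", "It's possible you are my... dog", "says you are my best fan")
cleaned <- gsub("[[:punct:]]", "", tolower(tinyCorpus))
allWords <- unlist(strsplit(cleaned, " "))

# Base-R stand-in for make.ngrams (assumption): slide a 4-word window over the word list
n <- 4
starts <- seq_len(length(allWords) - n + 1)
ngrams <- vapply(starts,
                 function(i) paste(allWords[i:(i + n - 1)], collapse = " "),
                 character(1))
count <- as.data.frame(table(ngrams), stringsAsFactors = FALSE)

# Hypothetical helper: return the last word of the most frequent 4-gram
# that starts with the given three-word phrase, or NA if nothing matches
predictNext <- function(phrase, count) {
  # clean the phrase the same way the corpus was cleaned
  phrase <- trimws(gsub("[[:punct:]]", "", tolower(phrase)))
  hits <- count[startsWith(count$ngrams, paste0(phrase, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  best <- hits$ngrams[which.max(hits$Freq)]
  words <- strsplit(best, " ")[[1]]
  words[length(words)]
}

predictNext("you are my", count)
# [1] "best"
```

Returning `NA` for an unseen phrase is just one choice for this sketch; it makes the "no match" case explicit to the caller.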
NextWords is OK, but it could be much better…
NextWords is a project created for the Data Science Capstone course offered by Johns Hopkins University via Coursera. Overall, the app…
The app itself lives here, the code for the app lives here, and the code for this deck lives here.
Thanks for reading!