Jean-Marc Terrettaz
12.01.202
This pitch presents my 'Predict next word' application,
developed for the capstone project of
the Data Science Specialization of Johns Hopkins University.
I used the quanteda package to prepare the data for the app.
First I read the three files (twitter, blogs and news) and concatenated them into one vector of strings.
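A minimal sketch of this reading step in base R, not the original code. The helper name `read_corpus` and the file names are assumptions made for illustration; the binary connection and `skipNul = TRUE` are a common precaution against stray control characters in such text files.

```r
# Hypothetical helper: read several text files and return all their lines
# as one character vector (one element per line).
read_corpus <- function(paths) {
  read_one <- function(path) {
    con <- file(path, open = "rb")   # binary mode avoids early stops on odd characters
    on.exit(close(con))
    readLines(con, encoding = "UTF-8", skipNul = TRUE)
  }
  unlist(lapply(paths, read_one), use.names = FALSE)
}

# Assumed file names; adjust to wherever the three files are stored.
paths <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")
if (all(file.exists(paths))) {
  data <- read_corpus(paths)
}
```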
Then, using quanteda, I tokenized the strings and created a document-feature matrix (dfm):
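grams_dfm <- function(ngrams, filename) {
  # 'data' is the concatenated vector of strings from the three files
  grams_tokens <- tokens(data,
                         ngrams = ngrams,
                         remove_punct = TRUE,
                         ...)
  # Count how often each n-gram occurs in each document
  grams_dfm <- dfm(grams_tokens, verbose = TRUE)
  saveRDS(grams_dfm, filename)
}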
I ran this for unigrams, bigrams and trigrams.
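The tokenizing step can be tried out on toy data. The sketch below assumes a recent quanteda version, where n-grams are built with `tokens_ngrams()` instead of an `ngrams` argument to `tokens()`; the two example sentences and the variable names are made up for illustration.

```r
library(quanteda)

# Toy input standing in for the concatenated corpus
txt <- c("the cat sat on the mat", "the cat ate")

# Tokenize, dropping punctuation, then build bigrams from the tokens
toks <- tokens(txt, remove_punct = TRUE)
bigram_toks <- tokens_ngrams(toks, n = 2)

# Document-feature matrix of bigrams and their total frequencies
bigram_dfm <- dfm(bigram_toks)
bigram_freq <- featfreq(bigram_dfm)
print(bigram_freq)
```

The n-gram features are joined with an underscore (for example `the_cat`), which is quanteda's default concatenator.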
The next step is to extract the frequencies and save them as dataframes:
build_grams <- function(dfm_filename, grams_filename) {
  grams_dfm <- readRDS(dfm_filename)
  # Total count of each n-gram over all documents
  grams_freq <- featfreq(grams_dfm)
  # Keep only n-grams that occur more than 3 times
  grams_freq <- subset(grams_freq, grams_freq > 3)
  words <- names(grams_freq)
  count <- grams_freq
  grams <- data.frame(words, count)
  saveRDS(grams, grams_filename)
}
I had to remove n-grams with a frequency of 3 or less, because otherwise the Shiny app ran out of memory on shinyapps.io.
The prediction algorithm was implemented as follows:
For a first attempt, the app predicts reasonably well in my opinion.
I initially tried to use quadrigrams as well, but had to leave them out because the app ran out of memory.
I first tried to build the n-gram frequencies with the tm package, but the resulting data was far too big. After reading this article I switched to the quanteda package, which is much more efficient (and also very easy to use).
There is still a lot to do to make the app better. Among other things, the performance is not very good: prediction takes too long. One solution would be to use the data.table package to index the n-grams for the searches.
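A minimal sketch of what such an index could look like, assuming the n-grams are stored as underscore-joined strings with a count, as in the data frames built earlier. The toy counts, the `prefix` and `nextword` columns and the `predict_next` helper are hypothetical and are not part of the app.

```r
library(data.table)

# Toy bigram table with made-up counts
bigrams <- data.table(
  words = c("of_the", "of_a", "in_the", "of_course"),
  count = c(120L, 45L, 98L, 30L)
)

# Split each bigram into its first word (prefix) and the word that follows
bigrams[, c("prefix", "nextword") := tstrsplit(words, "_", fixed = TRUE)]

# Setting a key sorts the table by prefix, so lookups use binary search
setkey(bigrams, prefix)

# Hypothetical helper: the n most frequent words following a given prefix
predict_next <- function(dt, prefix_value, n = 3) {
  hits <- dt[.(prefix_value), nomatch = NULL]   # keyed lookup on 'prefix'
  head(hits[order(-count), nextword], n)
}

print(predict_next(bigrams, "of"))
```

With a key set, a lookup does not scan the whole table, which is where the speed-up over filtering a plain data frame would come from.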