CapstoneProject

Jean-Marc Terrettaz
12.01.202

This pitch presents my 'Predict next word' application,
developed for the capstone project of
the Data Science Specialization of Johns Hopkins University.

Create the document-feature matrices

I used the quanteda package to prepare the data for the app.

First, I read the three files (twitter, blogs and news) and concatenated them into one vector of strings.
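
The reading step could look roughly like the sketch below. The helper name `read_corpus` and the file names in the usage comment are assumptions for illustration, not the code actually used for the app.

```r
# Hypothetical helper: read several text files and concatenate their lines
# into one character vector (one element per line).
read_corpus <- function(paths) {
        read_one <- function(path) {
                con <- file(path, open = "rb")   # binary mode copes with stray control characters
                on.exit(close(con))
                readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
        }
        unlist(lapply(paths, read_one), use.names = FALSE)
}

# Usage (file names are assumed):
# data <- read_corpus(c("twitter.txt", "blogs.txt", "news.txt"))
```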

Then, using quanteda, I tokenized the strings and created a document-feature matrix (dfm):

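library(quanteda)

# Tokenize the text into n-grams, build a dfm and save it to disk.
# `data` is the concatenated vector of strings described above;
# `ngrams` is the n-gram size (1, 2 or 3).
grams_dfm <- function(ngrams, filename) {
        grams_tokens <- tokens(data,
                               ngrams = ngrams,
                               remove_punct = TRUE,
                               ...)
        grams_dfm <- dfm(grams_tokens, verbose = TRUE)
        saveRDS(grams_dfm, filename)
}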

I did this for unigrams, bigrams and trigrams.
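
Depending on the installed quanteda version, `tokens()` may not accept an `ngrams` argument. As a minimal sketch on two made-up sentences, the same n-gram step can be written with the separate `tokens_ngrams()` function. The toy input and variable names are assumptions for illustration only.

```r
library(quanteda)

# Toy input standing in for the real corpus.
toy <- c("one of the best", "one of the most")

# Tokenize first, then form bigrams in a separate step.
toks    <- tokens(toy, remove_punct = TRUE)
bigrams <- tokens_ngrams(toks, n = 2)

# Build the dfm and extract the frequency of each bigram.
toy_dfm <- dfm(bigrams)
freq    <- featfreq(toy_dfm)
```

Here `freq` is a named vector whose names are the bigrams joined with `_`, for example `one_of`.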

Create the frequency dataframes

The next step is to extract the frequencies and save them as dataframes:

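# Read a saved dfm, extract the n-gram frequencies and save them as a data frame.
build_grams <- function(dfm_filename, grams_filename) {
        grams_dfm <- readRDS(dfm_filename)
        # named vector: n-gram -> number of occurrences
        grams_freq <- featfreq(grams_dfm)
        # keep only n-grams seen more than 3 times
        grams_freq <- subset(grams_freq, grams_freq > 3)
        words <- names(grams_freq)
        count <- grams_freq
        grams <- data.frame(words, count)
        saveRDS(grams, grams_filename)
}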

I had to remove n-grams with a frequency of 3 or less, because otherwise the Shiny app ran out of memory on shinyapps.io.
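
To get a feeling for the effect of such a threshold, the sketch below prunes a toy frequency table and compares sizes before and after. The table is randomly generated and only stands in for the real n-gram data; the numbers say nothing about the actual app.

```r
# Toy frequency table (made-up data, not the real n-grams).
set.seed(1)
grams <- data.frame(
        words = paste0("gram_", seq_len(10000)),
        count = rpois(10000, lambda = 2) + 1,
        stringsAsFactors = FALSE
)

# Keep only n-grams seen more than 3 times, as in build_grams().
pruned <- grams[grams$count > 3, ]

# Compare the number of rows and the in-memory size before and after pruning.
rows  <- c(before = nrow(grams), after = nrow(pruned))
sizes <- c(before = format(object.size(grams), units = "Kb"),
           after  = format(object.size(pruned), units = "Kb"))
```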

Predict the next word

The prediction algorithm was implemented as follows:

  • take the last two words of the input and find the most frequent trigram starting with these two words
  • if found, take the third word of this trigram
  • if not found, take the last word of the input and find the most frequent bigram starting with this word
  • if found, take the second word of this bigram
  • if not found, pick a word at random from the 100 most frequent unigrams
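
The steps above can be sketched in base R as follows. This is an illustration under assumptions, not the app's actual code: the function names are hypothetical, the data frames are assumed to have the columns `words` and `count` as produced by `build_grams()`, the words of an n-gram are assumed to be joined with `_`, and the input is assumed to be lower-cased and split on whitespace.

```r
# Return the last word of the most frequent n-gram starting with `prefix`,
# or NA if no n-gram matches.
best_continuation <- function(grams, prefix) {
        words <- as.character(grams$words)
        hits <- grams[startsWith(words, paste0(prefix, "_")), ]
        if (nrow(hits) == 0) return(NA_character_)
        best <- as.character(hits$words)[which.max(hits$count)]
        sub("^.*_", "", best)   # keep only the text after the last "_"
}

predict_next <- function(input, unigrams, bigrams, trigrams) {
        words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
        words <- words[nzchar(words)]
        n <- length(words)

        # 1. Most frequent trigram starting with the last two words.
        if (n >= 2) {
                guess <- best_continuation(trigrams, paste(words[n - 1], words[n], sep = "_"))
                if (!is.na(guess)) return(guess)
        }
        # 2. Most frequent bigram starting with the last word.
        if (n >= 1) {
                guess <- best_continuation(bigrams, words[n])
                if (!is.na(guess)) return(guess)
        }
        # 3. Random word among the (up to) 100 most frequent unigrams.
        top <- head(unigrams[order(-unigrams$count), ], 100)
        sample(as.character(top$words), 1)
}

# Toy tables with made-up counts.
trigrams <- data.frame(words = c("one_of_the", "one_of_a", "of_the_best"), count = c(10, 4, 7))
bigrams  <- data.frame(words = c("the_best", "the_same", "of_the"), count = c(5, 9, 20))
unigrams <- data.frame(words = c("the", "of", "and"), count = c(100, 80, 60))

predict_next("He is one of", unigrams, bigrams, trigrams)   # "the"
predict_next("see the", unigrams, bigrams, trigrams)        # "same"
```

Scanning the whole table with `startsWith()` on every call is simple but slow on large tables.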

Conclusions

For a first attempt, the app predicts reasonably well in my opinion.

Initially I tried to use quadrigrams as well, but had to leave them out because the app ran out of memory.

I first tried to build the n-gram frequencies with the tm package, but the resulting data was far too big. After reading this article, I switched to the quanteda package, which is much more efficient (and also very easy to use).

There is still a lot to do to improve the app. In particular, performance is not very good: prediction takes too long. One possible solution would be to use the data.table package to index the n-grams for the searches.
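
A minimal sketch of the data.table idea, on a toy table with made-up counts: the trigrams are split into a `prefix` (first two words) and a `nextword`, reduced to the best continuation per prefix, and keyed so that lookups use a binary search instead of a full scan. The column names and the `lookup` helper are assumptions for illustration; this has not been tried in the app.

```r
library(data.table)

# Toy trigram table: prefix = first two words, nextword = third word.
trigrams <- data.table(
        prefix   = c("one_of", "one_of", "of_the", "of_the"),
        nextword = c("the", "a", "best", "most"),
        count    = c(10L, 4L, 7L, 9L)
)

# Precompute the most frequent continuation for every prefix.
best <- trigrams[, .(nextword = nextword[which.max(count)]), by = prefix]

# Sort by prefix and mark it as the key, which enables binary-search lookups.
setkey(best, prefix)

# Hypothetical helper: returns NA when the prefix is unknown.
lookup <- function(pfx) best[.(pfx), nextword]

lookup("one_of")   # "the"
lookup("of_the")   # "most"
```

Keeping only one row per prefix also reduces the memory needed by the app, at the cost of not being able to offer alternative suggestions.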