8/23/2020

Performance and Accuracy

This application uses data.table, an R package implemented in C with OpenMP multithreading, to combine gzipped 4-gram and 3-gram tables with a Bayesian “multigram” table. Because data.table can use as many threads as the system makes available, the app can run faster than some canned text analysis packages.

system.time(dt <- read.csv("testcsv.csv")) #regular read.csv
##    user  system elapsed 
##   1.485   0.056   1.544
library(data.table)
system.time(dt <- fread("testcsv.csv")) # data.table 
##    user  system elapsed 
##   0.707   0.022   0.105
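
The number of threads data.table uses can be inspected and changed at run time. The lines below are a minimal sketch, assuming a data.table build with OpenMP support; the temporary CSV file and the cap of two threads are illustrative choices, not settings taken from the app.

```r
library(data.table)

# Report how many threads data.table is currently allowed to use
getDTthreads()

# Write a small throwaway CSV so the example is self-contained
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1000, value = rnorm(1000)), tmp)

# Cap data.table at two threads (illustrative value), then read the file
old_threads <- setDTthreads(2)
dt <- fread(tmp)

# Restore the previous thread setting and remove the temporary file
setDTthreads(old_threads)
unlink(tmp)
```

On a machine with more cores, raising the cap lets fread() and other parallel data.table operations use them.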

Why Use Data Table Predict?

  • Easy to use: Enter text and click “Predict”
  • Shows you your top prediction, and other possible answers
  • Uses “fuzzy searching” on a misspelled last word to find a match

The Algorithm

First, try to get a 4-gram or 3-gram match using precalculated probabilities:

# 4-gram: match the last three words typed (x holds the words, l is its length)
dt <- gram4[word1 == x[l-2] & word2 == x[l-1] & word3 == x[l]][order(-p4),
                                              .(word = word4, prob = p4)]
if (nrow(dt) < 1) { # No matches: ignore the last word, match the two before it
      dt <- rbind(dt, gram4[word1 == x[l-2] & word2 == x[l-1],
                            .(word = word4, prob = p4)])
}
# 3-gram, only if the 4-gram table gave no matches
if (nrow(dt) < 1) {
      dt <- gram3[word1 == x[l-1] & word2 == x[l]][order(-p3),
                                             .(word = word3, prob = p3)]
      if (nrow(dt) < 3) { # Fewer than 3 matches: ignore the last word
            dt2 <- gram3[word1 == x[l-1], .(word = word3, prob = p3)]
            dt <- rbind(dt, dt2)
      }
}
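
The back-off idea can be tried on toy data. This is a sketch only: the table contents and the helper name predict_backoff are hypothetical, and it covers just the exact 4-gram and 3-gram lookups, not the app's full logic.

```r
library(data.table)

# Toy tables (hypothetical data; the app's real tables are precalculated)
gram4 <- data.table(
  word1 = c("thanks", "thanks"),
  word2 = c("for", "for"),
  word3 = c("the", "the"),
  word4 = c("help", "follow"),
  p4    = c(0.6, 0.4)
)
gram3 <- data.table(
  word1 = c("for", "for", "of"),
  word2 = c("the", "the", "the"),
  word3 = c("first", "help", "year"),
  p3    = c(0.5, 0.3, 0.9)
)

# Hypothetical helper: try the 4-gram table, then back off to the 3-gram table
predict_backoff <- function(sentence) {
  x <- strsplit(tolower(sentence), "\\s+")[[1]]
  l <- length(x)
  dt <- data.table(word = character(), prob = numeric())
  if (l >= 3) {
    dt <- gram4[word1 == x[l - 2] & word2 == x[l - 1] & word3 == x[l]][
      order(-p4), .(word = word4, prob = p4)]
  }
  if (nrow(dt) < 1 && l >= 2) {
    dt <- gram3[word1 == x[l - 1] & word2 == x[l]][
      order(-p3), .(word = word3, prob = p3)]
  }
  dt
}

hit4 <- predict_backoff("thanks for the")   # found in the 4-gram table
hit3 <- predict_backoff("waiting for the")  # falls back to the 3-gram table
```

The first row of each result is the top prediction; the remaining rows are the other possible answers.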

Algorithm (cont.)

If neither the 4-grams nor the 3-grams match, use the precalculated Bayesian table (calculated from the conditional probability of a word given the words appearing before it).

# pool of words and probabilities given all words in sentence
pool<-group10_2[word1 %in% x] # precalculated with Bayes' theorem
dt2<-pool[,.(word=word2, prob=prob)]
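
The slides do not show how the Bayesian table was built, so the following is only one plausible sketch. It assumes the table stores an estimate of P(word2 | word1) derived from co-occurrence counts. The pairs table, its counts, and the ranking rule (summing probabilities per candidate) are all hypothetical.

```r
library(data.table)

# Hypothetical counts: word1 appeared earlier in a sentence than word2
pairs <- data.table(
  word1 = c("happy", "happy", "new", "new", "new"),
  word2 = c("birthday", "year", "year", "york", "car"),
  n     = c(6, 4, 5, 3, 2)
)

# Estimate P(word2 | word1) as the pair count over the total count for word1
pairs[, prob := n / sum(n), by = word1]

# Pool every candidate whose conditioning word appears in the input
x <- c("happy", "new")
pool <- pairs[word1 %in% x]

# One possible ranking (an assumption): sum the probabilities per candidate
ranked <- pool[, .(prob = sum(prob)), by = .(word = word2)][order(-prob)]
```

Here “year” ranks first because both input words support it.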

If nothing has worked, assume a misspelled word, and use fuzzy matching:

dt2<-pool[agrep(x[l],word1),.(word=word2,prob=prob)]
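
A small illustration of the fuzzy step, using base R's agrep() on a hypothetical pool with the same column names as above. The words and probabilities are made up for the example.

```r
library(data.table)

# Hypothetical pool with the same columns as the lookup table above
pool <- data.table(
  word1 = c("lazy", "quick", "brown"),
  word2 = c("dog", "fox", "bear"),
  prob  = c(0.7, 0.2, 0.1)
)

typed <- "lasy"  # misspelling of "lazy"

# agrep() returns the indices of the word1 entries that approximately match
idx <- agrep(typed, pool$word1)
dt2 <- pool[idx, .(word = word2, prob = prob)]
```

By default agrep() tolerates an edit distance of up to 10% of the pattern length, rounded up, so short words allow a single-character error. The max.distance argument loosens or tightens this.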

Thank You!