Shiny app

2022-06-18

About the app

The app was created as part of the final project for Coursera's Data Science Capstone by Johns Hopkins University and is available here: https://www.coursera.org/learn/data-science-project/peer/EI1l4/final-project-submission/submit

The training data set for the prediction model was prepared as described in the Milestone report, available at https://rpubs.com/aaturki/Milestone. In short, I removed non-Latin characters, punctuation marks, digits, stopwords, and swear words (based on Google's publicly available Bad Words list) from the raw data.
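
The cleaning steps listed above can be sketched in base R as follows. This is a minimal sketch, not the code from the Milestone report: the function name clean_text and the tiny stopword and bad-word vectors are hypothetical placeholders, and "non-Latin" is approximated here as "anything outside ASCII".

```r
# Hypothetical helper: applies the cleaning steps in order to a character vector.
clean_text <- function(x, stopwords, bad_words) {
  # Replace every non-ASCII byte with a space (stand-in for "non-Latin" removal)
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ")
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)  # punctuation marks
  x <- gsub("[[:digit:]]+", " ", x)  # digits
  drop <- c(stopwords, bad_words)    # words to filter out
  words <- strsplit(x, "[[:space:]]+")
  # Keep non-empty words that are not in the drop list, then re-join
  vapply(words,
         function(w) paste(w[nzchar(w) & !(w %in% drop)], collapse = " "),
         character(1))
}

# Placeholder word lists; a real run would use full stopword and bad-word lists
cleaned <- clean_text("I LOVE the 2 damn cats, \u043f\u0440\u0438\u0432\u0435\u0442!",
                      stopwords = c("i", "the"),
                      bad_words = "damn")
print(cleaned)
```

Filtering on whole words after tokenising avoids accidentally deleting substrings of longer, harmless words.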

Deployment & Performance

Since the introduction of the sbo (Stupid Back-Off) R package, deploying an N-gram prediction model in R has become as easy as calling a function. More information on the sbo package is available at https://cran.r-project.org/web/packages/sbo/vignettes/sbo.html

library(sbo)

# Train predictor
p <- sbo_predictor(object = t_train, # Training corpus (character vector)
                   N = 3, # Train a 3-gram model
                   dict = target ~ 0.75, # cover 75% of training corpus
                   .preprocess = sbo::preprocess, # Preprocessing transformation 
                   EOS = ".?!:;", # End-Of-Sentence tokens
                   lambda = 0.4, # Back-off penalization in SBO algorithm
                   L = 1L, # Number of predictions for input
                   filtered = "<UNK>"  # Exclude the <UNK> token from predictions
)

# Generate prediction 
predict(p, "i love")
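
To make the role of the lambda parameter concrete, here is a minimal base-R sketch of Stupid Back-Off scoring on a toy corpus. It is not how the sbo package is implemented internally; the corpus and the helper names (count_ngrams, get_count, sbo_score) are made up for illustration. A word is scored by its relative frequency after the full context; if that N-gram was never seen, the score of the shorter context is used, multiplied by lambda.

```r
# Toy corpus, tokenised on single spaces (an assumption of this sketch)
corpus <- c("i love you", "i love r", "you love r")
tokens <- strsplit(corpus, " ", fixed = TRUE)

# Count all k-grams, returned as a named numeric vector keyed by "w1 w2 ..."
count_ngrams <- function(tokens, k) {
  grams <- unlist(lapply(tokens, function(s) {
    if (length(s) < k) return(character(0))
    vapply(seq_len(length(s) - k + 1),
           function(i) paste(s[i:(i + k - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.numeric(tab), names(tab))
}

counts <- lapply(1:3, function(k) count_ngrams(tokens, k))
total_words <- sum(counts[[1]])

# Look up the count of a k-gram given as a character vector of words
get_count <- function(gram) {
  tab <- counts[[length(gram)]]
  key <- paste(gram, collapse = " ")
  if (key %in% names(tab)) tab[[key]] else 0
}

# Stupid Back-Off score of `word` following `context`
sbo_score <- function(word, context, lambda = 0.4) {
  if (length(context) == 0) return(get_count(word) / total_words)
  seen <- get_count(c(context, word))
  if (seen > 0) return(seen / get_count(context))
  lambda * sbo_score(word, context[-1], lambda)  # back off to a shorter context
}

seen_score    <- sbo_score("r",   c("i", "love"))    # trigram seen: no penalty
backoff_score <- sbo_score("you", c("you", "love"))  # unseen: backs off once
print(c(seen_score, backoff_score))
```

The scores are not normalised probabilities, which is what makes the scheme cheap to compute; only their ranking matters when picking the top L predictions.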

Since training is computationally expensive, the prediction table can be built once with the ‘sbo_predtable’ function and saved as an RDA object with ‘save’. The saved table can then be loaded in global.R of the Shiny app and turned into a predictor with the ‘sbo_predictor’ function.
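
A sketch of this train-once, load-many workflow is shown below. It assumes the sbo package is installed; the file name predtable.rda and the object name predtable are arbitrary choices for this example, and sbo::twitter_train (a sample corpus bundled with the package) stands in for the author's t_train.

```r
library(sbo)

# --- One-off training script (run offline, not inside the Shiny app) ---
predtable <- sbo_predtable(object = sbo::twitter_train, # stand-in for t_train
                           N = 3,
                           dict = target ~ 0.75,
                           .preprocess = sbo::preprocess,
                           EOS = ".?!:;",
                           lambda = 0.4,
                           L = 1L,
                           filtered = "<UNK>")
save(predtable, file = "predtable.rda")

# --- global.R of the Shiny app ---
load("predtable.rda")             # restores the object `predtable`
p <- sbo_predictor(predtable)     # cheap: no retraining needed
prediction <- predict(p, "i love")
print(prediction)
```

The prediction table is the object that gets saved because it is a plain R object; the predictor is rebuilt from it at app start-up.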

Accuracy evaluation

The accuracy of the predictor can be evaluated using one of the built-in functions of the sbo package:

library(dplyr) # provides %>%, filter(), summarise() and n()

# p is a predictor: the output of sbo_predictor(), built either from the corpus or from a loaded sbo_predtable object

(evaluation <- eval_sbo_predictor(p, test = t_test))

evaluation %>% # Accuracy for in-sentence predictions
    filter(true != "<EOS>") %>%
    summarise(accuracy = sum(correct) / n(),
              uncertainty = sqrt(accuracy * (1 - accuracy) / n()))

With the SBO model and the chosen training and test sets, the resulting accuracy was:

# A tibble: 1 x 2
  accuracy uncertainty
     <dbl>       <dbl>
1    0.128     0.00865
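
The uncertainty column is the binomial standard error sqrt(accuracy * (1 - accuracy) / n). As a reading aid, the sketch below derives an approximate 95% interval and the implied number of evaluated predictions from the two reported figures. The normal approximation and the variable names are assumptions of this example, and the results inherit the rounding of the printed values.

```r
accuracy  <- 0.128    # reported accuracy
std_error <- 0.00865  # reported uncertainty (binomial standard error)

# Approximate 95% confidence interval under a normal approximation
z  <- qnorm(0.975)
ci <- accuracy + c(-1, 1) * z * std_error

# Invert se = sqrt(a * (1 - a) / n) to estimate the number of predictions scored
n_implied <- accuracy * (1 - accuracy) / std_error^2

print(round(ci, 3))
print(round(n_implied))
```

Under these assumptions the accuracy lies roughly between 11% and 14.5%, based on about 1,490 in-sentence predictions.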

Conclusions

The introduction of the sbo package made it very easy to get an N-gram prediction model up and running.
While training the model is computationally expensive, the ability to save the trained prediction table makes hosting the Shiny app much easier and improves its response time.
At the same time, the accuracy of the model still depends on the fundamentals, such as:
- origins, size and quality of the data set;
- how the data was sampled and cleaned;
- parameters of the model.
To improve the accuracy of the model, it might be beneficial to include other text corpora. Blog, news, and Twitter texts each have a specific style (or constraints, such as the character limit of a tweet) that might not always match user expectations.