2023-01-11

Introduction

With improvements in technology and computing power, and with vast amounts of text available on the internet, it is possible to build an application that assists with text prediction. There are a variety of methods for developing text prediction; however, there are nuances that depend on the platform hosting the app, as well as on the input language and tone of the message.

Model Dev

The application was trained on a small subset of text from the following sources in the USA:

  • Twitter
  • Blogs
  • News

From these sources, a sample was extracted to train and test the model. The model was initially built using a k-gram modeling application. The data was converted to lowercase and cleaned of the following:

  • swear words
  • numbers
  • symbols
  • URLs / links
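The cleaning steps listed above could be sketched in base R roughly as follows. This is a minimal illustration, not the script used for the app: the function name clean_text() and the profanity argument are hypothetical, and the regular expressions assume plain ASCII English text.

```r
# Hypothetical helper: clean_text() is an illustration, not the original script.
clean_text <- function(x, profanity = character(0)) {
  # Reduce everything to lowercase first
  x <- tolower(x)
  # Remove URLs / links before symbols are stripped
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " ", x, perl = TRUE)
  # Remove swear words (whole-word matches from a user-supplied list)
  if (length(profanity) > 0) {
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  # Remove numbers
  x <- gsub("[0-9]+", " ", x, perl = TRUE)
  # Remove symbols, keeping only letters, apostrophes and spaces
  x <- gsub("[^a-z' ]", " ", x, perl = TRUE)
  # Collapse repeated whitespace left behind by the removals
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

cleaned <- clean_text("Visit https://example.com NOW, it's 100% darn good!",
                      profanity = "darn")
print(cleaned)
```

URLs are removed before symbols so that the punctuation inside a link does not leave fragments behind.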

Model Dev Cont. 1

In the initial assessment of the data, stopwords were removed to get an indication of popular n-grams. However, they were kept in the final model, as they are required for text prediction.

The initial k-gram model, while allowing for the inclusion of “temperature” (a metric for tone), was slow and could not produce output in a form that could be launched on a website.
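As an illustration of how a temperature parameter commonly works in language models, the sketch below rescales a set of next-word probabilities. It assumes the usual sampling definition (probabilities raised to the power 1/temperature and renormalised); whether the original model used exactly this form is not stated. The function name apply_temperature() and the probabilities are made up for the example.

```r
# Hypothetical helper; assumes the usual sampling definition of temperature.
apply_temperature <- function(p, temperature = 1) {
  stopifnot(temperature > 0, all(p >= 0))
  w <- p^(1 / temperature)  # temperature < 1 sharpens, > 1 flattens
  w / sum(w)                # renormalise so the result sums to 1
}

# Illustrative probabilities only, not output from the model
p <- c(world = 0.6, corner = 0.3, house = 0.1)

sharp <- apply_temperature(p, 0.5)  # favours the most likely word more strongly
flat  <- apply_temperature(p, 2)    # gives less likely words a better chance
print(round(sharp, 3))
print(round(flat, 3))
```

A low temperature gives conservative, predictable suggestions, while a high temperature gives more varied ones.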

Thereafter, a 3-gram “Stupid Back-off” model was developed.
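The idea behind stupid back-off can be sketched in base R as follows: score a candidate word by its relative frequency after the full context, and if that n-gram was never seen, shorten the context by one word and multiply the score by a penalty lambda. This is a toy illustration on a made-up token sequence, not the implementation in the “sbo” package. The helper names count_ngrams(), get_count(), sbo_score() and predict_next() are hypothetical.

```r
# Count n-grams of order 1..N in a token vector (toy version, no sentence breaks)
count_ngrams <- function(tokens, N = 3) {
  stopifnot(length(tokens) >= N)
  lapply(seq_len(N), function(n) {
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(starts, function(i) {
      paste(tokens[i:(i + n - 1)], collapse = " ")
    }, character(1))
    tab <- table(grams)
    setNames(as.vector(tab), names(tab))  # plain named vector of counts
  })
}

# Look up a count, returning 0 for unseen n-grams
get_count <- function(counts_n, key) {
  if (key %in% names(counts_n)) counts_n[[key]] else 0
}

# Stupid back-off score of `word` given `context` (at most N - 1 words)
sbo_score <- function(word, context, counts, lambda = 0.4) {
  penalty <- 1
  repeat {
    k <- length(context)
    if (k == 0) {
      # Base case: unigram relative frequency
      return(penalty * get_count(counts[[1]], word) / sum(counts[[1]]))
    }
    num <- get_count(counts[[k + 1]], paste(c(context, word), collapse = " "))
    if (num > 0) {
      den <- get_count(counts[[k]], paste(context, collapse = " "))
      return(penalty * num / den)
    }
    context <- context[-1]       # drop the oldest context word
    penalty <- penalty * lambda  # apply the back-off penalty
  }
}

# Rank every known word and keep the top L
predict_next <- function(context, counts, L = 3, lambda = 0.4) {
  vocab <- names(counts[[1]])
  scores <- vapply(vocab, sbo_score, numeric(1),
                   context = context, counts = counts, lambda = lambda)
  names(sort(scores, decreasing = TRUE))[seq_len(min(L, length(scores)))]
}

# Made-up toy corpus
tokens <- c("all", "around", "the", "world", "all", "around", "the", "house",
            "around", "the", "world")
counts <- count_ngrams(tokens, N = 3)
top2 <- predict_next(c("around", "the"), counts, L = 2)
print(top2)
```

The scores are not normalised probabilities, which is what makes the method cheap to compute and quick to serve.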

Model Dev Cont. 2

Using the “sbo” package in R, a model was developed using the provided Kneser-Ney smoothing. The model was then tested on a sub-sample of the original data. The model’s initial perplexity was over 300. After testing and adjusting the smoothing parameters, the perplexity was reduced to below 300, and the model is able both to predict the next word (giving three options) and to provide a probability for the most likely word:

## Next-word text predictor from Stupid Back-off N-gram model
## 
## Order (N): 3 
## Dictionary size: 1685  words
## Back-off penalization (lambda): 0.4 
## Maximum number of predictions (L): 3 
## 
## See ?predict.sbo_predictor for usage help.
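For reference, perplexity is commonly defined as the exponential of the average negative log-probability that the model assigns to each word of a test text. The sketch below assumes that standard definition; the function name perplexity() is hypothetical and the input is synthetic, not output from the model.

```r
# Hypothetical helper; assumes the standard definition of perplexity.
# `probs` holds the probability the model assigned to each observed test word.
perplexity <- function(probs) {
  stopifnot(length(probs) > 0, all(probs > 0), all(probs <= 1))
  exp(-mean(log(probs)))
}

# Synthetic check: a model that gives every word probability 1/300
# is as uncertain as a uniform choice among 300 words.
pp <- perplexity(rep(1 / 300, 10))
print(pp)
```

Lower values mean the model is, on average, less surprised by the test text.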

The best-performing model uses N = 5, allowing for more context. The model is saved as a .rda file, which allows for quick loading and predicting. The model can be updated by running the R scripts in the repository.

The App and future developments

The Shiny app has been developed to predict the next word based on the sentence typed into the text box. In preparation for improvements to the model, a slider has been included to adjust the tone of the output. Currently the model predicts the top three next words; however, this can easily be changed in the app to show just one.

predict(model3,"Lets travel all around the")
## [1] "world"  "corner" "house"

Reference and Conclusion