2025-08-22

R Markdown

This presentation outlines the process and limitations of predicting the next word following a given sentence.

Outline: Data Collection and Cleaning, Tokenisation, Document Term frequency dataframes and matrices, and the

ANLP package

The ANLP package was used to generate a sample of the data set. The sample was not proportioned by the source the data comes from, i.e. Twitter, news, or blogs.
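As a rough illustration of sampling without proportioning by source, the sketch below pools all lines and draws a simple random sample in base R. It is a stand-in for the idea, not the ANLP package's own sampling routine; the three input vectors and `sample_fraction` are hypothetical.

```r
# Hypothetical stand-ins for the three source files (assumption, not real data)
twitter <- c("tweet one", "tweet two", "tweet three", "tweet four")
news    <- c("news line one", "news line two")
blogs   <- c("blog line one", "blog line two", "blog line three", "blog line four")

set.seed(42)  # make the sample reproducible

# Pool everything first, so the source type is ignored when sampling
all_lines <- c(twitter, news, blogs)

# Hypothetical sampling fraction
sample_fraction <- 0.5

# Simple random sample without replacement from the pooled lines
sampled.data <- sample(all_lines, size = ceiling(sample_fraction * length(all_lines)))

length(sampled.data)
```

Because the lines are pooled before sampling, a source that contributes more lines will tend to contribute more of the sample.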

Data Cleaning

sampled.data <- gsub("[^a-zA-Z0-9 ]", "", sampled.data)

The data was not pre-processed further beyond removing non-alphanumeric characters.
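A small sketch of what this single cleaning step does to a few made-up strings may help show its limitations. The helper name `clean_text` is hypothetical; the pattern is the one used above.

```r
# Hypothetical wrapper around the cleaning step used above
clean_text <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)

# Apostrophes and punctuation are dropped, so contractions are merged
contraction <- clean_text("Don't stop, it's 5pm!")   # "Dont stop its 5pm"

# Removing a free-standing symbol leaves a double space behind
spacing <- clean_text("well - known")                # "well  known"

# Letter case is left untouched because no lower-casing is applied
casing <- clean_text("Hello World")                  # "Hello World"
```

Since case and whitespace are not normalised, "Hello" and "hello" remain distinct tokens at this stage.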

ngrams
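As a minimal base R sketch of what an n-gram table contains, the hypothetical helper `make_ngrams` below splits a cleaned sentence on spaces and pastes consecutive tokens together. It is illustrative only and is not the routine used by the packages in this presentation.

```r
# Hypothetical helper: build n-grams from one cleaned sentence
make_ngrams <- function(text, n) {
  # Split on runs of spaces and drop empty tokens
  tokens <- strsplit(tolower(text), " +")[[1]]
  tokens <- tokens[tokens != ""]
  # A sentence shorter than n tokens yields no n-grams
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  # Paste each window of n consecutive tokens into one string
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams("the cat sat on the cat", 2)

# Frequency table of the bigrams, most frequent first
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
bigram_counts
```

Counts like these are what a next-word predictor ranks when choosing the most likely continuation.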

Prediction Algorithm

  • This library imposes a limit of 1000 documents/sentences
  • The sbo library was used in the prediction algorithm
  • sbo_dictionary and sbo_predictor were used to generate predictions
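The two functions named above might be combined roughly as follows. This is a sketch, not the exact code behind the presentation: it assumes the sbo package is installed, uses the `twitter_train` example corpus that ships with sbo in place of the sampled data, and the dictionary size, n-gram order, and other parameter values are illustrative choices.

```r
# Assumes the sbo package is installed: install.packages("sbo")
library(sbo)

# Example corpus shipped with sbo; substitute the cleaned sample here
corpus <- twitter_train

# Dictionary of the most frequent words (max_size is an illustrative choice)
dict <- sbo_dictionary(corpus, max_size = 1000,
                       .preprocess = preprocess, EOS = ".?!:;")

# Train a trigram Stupid Back-Off predictor returning L = 3 candidates
predictor <- sbo_predictor(corpus, N = 3, dict = dict,
                           .preprocess = preprocess, EOS = ".?!:;",
                           lambda = 0.4, L = 3L)

# Predict the next word for a short input
preds <- predict(predictor, "i love")
preds
```

For a single input string, `predict` returns a character vector of the top candidates, ordered from most to least likely.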