Next Word Prediction

2025-08-22

R Markdown

This presentation outlines the process and limitations of predicting the next word that follows a given sentence.

Outline

Data Collection and Cleaning
Tokenisation
Document Term frequency dataframes and matrices and the

ANLP package

The ANLP package was used to generate a sample of the data set, without proportioning by the type of source the data came from (i.e. Twitter, news, blogs).
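As a rough illustration of what sampling without proportioning by source means, the sketch below pools all lines and draws one simple random sample in base R. It is a stand-in, not the ANLP call used for this presentation; the vectors `twitter`, `news`, `blogs`, the seed and `sample.fraction` are all hypothetical.

```r
# Hypothetical stand-in data: a few lines per source
twitter <- c("tweet one", "tweet two", "tweet three")
news    <- c("news one", "news two", "news three")
blogs   <- c("blog one", "blog two", "blog three")

# Pool every line, ignoring which source it came from
all.lines <- c(twitter, news, blogs)

# Assumed sampling fraction and seed, chosen only for reproducibility
set.seed(42)
sample.fraction <- 0.5
n <- ceiling(length(all.lines) * sample.fraction)

# One simple random sample over the pooled lines (no stratification)
sampled.data <- sample(all.lines, size = n)
```

Because the draw ignores the source, the share of Twitter, news and blog lines in `sampled.data` can differ from their share in the full data set.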

sampled.data <- gsub("[^a-zA-Z0-9 ]", "", sampled.data)

The data was not pre-processed further beyond removing non-alphanumeric characters (spaces are kept).
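To make the effect of this cleaning step concrete, here is a small sketch that applies the same regular expression to two made-up strings. The vector `example.text` is hypothetical and is not taken from the data set.

```r
# Hypothetical input lines, for illustration only
example.text <- c("Hello, world! It's 2025.",
                  "R&D costs rose 5% (approx.)")

# Same pattern as above: drop everything except letters, digits and spaces
cleaned <- gsub("[^a-zA-Z0-9 ]", "", example.text)

print(cleaned)
# [1] "Hello world Its 2025"   "RD costs rose 5 approx"
```

Note that case is preserved and apostrophes are removed, so "It's" becomes "Its". This is one of the limitations of such minimal pre-processing.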