2025-01-16

Problem Statement

The problem tackled in the Data Science Capstone Project was a natural language processing challenge: predicting the next word a user will type, given the words they have already entered.

This was definitely a stretch project when it was initially assigned, and for that reason it took a long time to research best practices in this domain. Nonetheless, the challenge was rewarding, as it got me to think deeply about the language I speak every day and how a computer might begin to understand that language.

How It Works

The model takes the context given by the user, which could be two, three, four or more words, and tokenizes it into individual words, so “I went to” becomes “I”, “went”, “to”. The tokenized context is then matched against the model’s training data, which is a collection of unigrams, bigrams, trigrams and quadgrams. These are phrases split into a “context” and a “next word”, along with the “frequency” at which each pairing occurs. Once the context supplied by the user is matched to a context in the training data, the next word of that phrase is returned to the user. If the context does not occur in the training data, one token is dropped from the front of the context (the word furthest from the point of prediction) and the shorter context is matched against the data. This is repeated until a match is found. If no match is found, an NA is returned.
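To make the lookup concrete, here is a minimal sketch of that backoff search in R. It assumes the n-gram tables are combined into a single data frame with context, next_word and frequency columns; the function name and column names are illustrative assumptions, not the project’s actual code.

    # Minimal sketch of the backoff lookup. The data frame layout (context,
    # next_word, frequency) and all names here are illustrative assumptions.
    predict_next_word <- function(input, ngrams) {
      # Tokenize the user's context: "I went to" -> c("i", "went", "to")
      tokens <- tolower(unlist(strsplit(trimws(input), "\\s+")))
      if (length(tokens) == 0) return(NA_character_)

      # Try the longest available context first (three words for a quadgram
      # model), then back off by dropping the earliest word each time.
      for (n in min(length(tokens), 3):1) {
        context <- paste(tail(tokens, n), collapse = " ")
        matches <- ngrams[ngrams$context == context, ]
        if (nrow(matches) > 0) {
          # Return the most frequent continuation of this context.
          return(matches$next_word[which.max(matches$frequency)])
        }
      }
      NA_character_  # no context of any length matched the training data
    }

    # Example usage with a toy table:
    ngrams <- data.frame(
      context   = c("went to", "to"),
      next_word = c("the", "be"),
      frequency = c(12, 30),
      stringsAsFactors = FALSE
    )
    predict_next_word("I went to", ngrams)  # "the"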

Model Performance

  • The model was trained on only 70,000 sentences.
  • These sentences were turned into 3,623,785 n-grams, each split into a “context” and a “next word” (a sketch of this split follows the list).
  • However, due to limited computational power, the model’s accuracy is only around 15%.
  • With greater resources, the model could be trained on more data and its accuracy improved.
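The context/next-word split can be illustrated with a short R sketch. The function and column names below are my own, and the real preprocessing (cleaning, filtering, and counting duplicate pairs into frequencies) is omitted.

    # Illustrative sketch: split one sentence into n-grams, each divided into
    # a "context" (first n-1 words) and a "next word" (the nth word).
    sentence_to_ngrams <- function(sentence, n) {
      stopifnot(n >= 2)  # need at least one context word plus a next word
      words <- tolower(unlist(strsplit(trimws(sentence), "\\s+")))
      if (length(words) < n) return(NULL)
      do.call(rbind, lapply(seq_len(length(words) - n + 1), function(i) {
        data.frame(
          context   = paste(words[i:(i + n - 2)], collapse = " "),
          next_word = words[i + n - 1],
          stringsAsFactors = FALSE
        )
      }))
    }

    sentence_to_ngrams("I went to the store", 3)
    #   context next_word
    # 1  i went        to
    # 2 went to       the
    # 3  to the     store
    # Frequencies come from counting how often each context/next-word pair
    # repeats across all of the training sentences.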

Next Steps

The model can be improved with more training data, which would require more computational power as well as careful programming and attention to algorithmic complexity (Big O). The current model has a quick response time, which is good for the user experience. With more training data, another task will be ensuring that this fast response time persists while the model's accuracy improves; one possible approach is sketched below.
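For example (this is an assumption on my part, not what the project currently does), keying the n-gram table on its context column with the data.table package would let each lookup use a binary search rather than a full scan of the table:

    # Sketch only: keying the table on "context" lets data.table use a binary
    # search (roughly O(log n)) instead of scanning every row (O(n)).
    library(data.table)

    ngrams <- data.table(
      context   = c("went to", "to", "to"),
      next_word = c("the", "be", "the"),
      frequency = c(12, 30, 25)
    )
    setkey(ngrams, context)  # sort once so keyed lookups can binary-search

    hits <- ngrams["to"]                    # keyed lookup of one context
    hits[which.max(frequency), next_word]   # "be", the most frequent continuation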

I look forward to revisiting this project periodically to make adjustments to the code as my programming skills improve!