Predicting the Next Word

2022-10-09

Introduction

Mobile phones are increasingly used to communicate through emails, text messages, and/or social media
To make typing on mobile phones easier, smart keyboards that use models to predict the next word have been developed
Creating such an app is the Capstone project of Coursera’s Data Science Specialization
This project required researching NLP (Natural Language Processing) techniques for processing text
The project deliverable is a prediction model that uses the SwiftKey data files to predict a user’s next word
The SwiftKey data used can be found here.
For information on the raw data and text processing methods I used, see the Milestone Report

Markov chain models use n-grams (sequences of n words) to predict the next word
A typical algorithm first looks for the most probable next word in the largest n-gram model that the entered text allows
If no prediction is found, the algorithm moves to progressively smaller n-gram models until a word is found
This is known as the Back-Off method
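
The back-off idea above can be sketched in a few lines of Python. This is only an illustration, not the code behind my app: the function names `build_ngram_models` and `predict_next_word`, the maximum n-gram size of 4, the whitespace tokenizer, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_ngram_models(lines, max_n=4):
    """Return {n: {context tuple: Counter of next words}} for n = 1..max_n.

    max_n=4 and the lowercase/whitespace tokenizer are assumptions for this sketch.
    """
    models = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for line in lines:
        words = line.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])  # first n-1 words of the n-gram
                models[n][context][words[i + n - 1]] += 1
    return models


def predict_next_word(models, text):
    """Try the largest usable n-gram model first, then back off to smaller ones.

    Returns (predicted word, n of the model that produced it), or (None, 0).
    """
    words = text.lower().split()
    largest_n = min(max(models), len(words) + 1)
    for n in range(largest_n, 0, -1):
        context = tuple(words[len(words) - (n - 1):])  # last n-1 words entered
        candidates = models[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0], n
    return None, 0


# Toy corpus, invented for the example
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]
models = build_ngram_models(corpus)
print(predict_next_word(models, "sat on"))     # found in the 3-gram model
print(predict_next_word(models, "purple on"))  # backs off to the 2-gram model
```

In this sketch the 1-gram model has an empty context, so the last fallback is simply the most frequent word in the corpus.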

Test data was processed using the same methods as training data
20 tests were run, each randomly selecting 50 lines from the test data, for a total of 1,000 individual tests
Input was processed by all n-gram models whose size was equal to or less than the length of the text entered
Results were compared to the actual next word in the test data
Sample results for one 50-line test are shown below

The first table shows the end results of five 50-line tests
- PercNgCorrect = NoNgCorrect/NoNgPredicted
The second table shows the overall results for all 20 test runs
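
As a rough sketch of how the test runs and the PercNgCorrect column could be computed, the Python below samples 50-line batches and tallies results per n-gram model. It is not the code I used: `sample_test_lines`, `score_by_ngram`, the record layout, and the fixed seed are assumptions for the example, and I am assuming NoNgPredicted counts the cases where a model returned any word at all.

```python
import random
from collections import defaultdict


def sample_test_lines(test_lines, runs=20, lines_per_run=50, seed=0):
    """Draw `runs` random batches of `lines_per_run` lines (seed is an assumption)."""
    rng = random.Random(seed)
    return [rng.sample(test_lines, lines_per_run) for _ in range(runs)]


def score_by_ngram(records):
    """Tally results per n-gram model.

    records: iterable of (n, predicted_word_or_None, actual_word).
    Assumes NoNgPredicted = number of cases where model n returned a word.
    """
    tally = defaultdict(lambda: {"NoNgPredicted": 0, "NoNgCorrect": 0})
    for n, predicted, actual in records:
        if predicted is None:
            continue  # this model made no prediction for this input
        tally[n]["NoNgPredicted"] += 1
        if predicted == actual:
            tally[n]["NoNgCorrect"] += 1
    return {
        n: {**counts, "PercNgCorrect": counts["NoNgCorrect"] / counts["NoNgPredicted"]}
        for n, counts in sorted(tally.items())
    }


# Invented records, only to show the shape of the input and output
records = [
    (2, "the", "the"),
    (2, "a", "the"),
    (3, None, "the"),
    (3, "cat", "cat"),
]
scores = score_by_ngram(records)
batches = sample_test_lines([f"line {i}" for i in range(200)])
print(scores)
```

Models that never produced a prediction are left out of the result, which avoids dividing by zero.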

Seeing the different words predicted by each model during testing was interesting to me
Thinking others may find it interesting as well, I decided to return the same information with my app
My Shiny App will show you what each n-gram model predicts based on the text you submit
Give it a try and see which n-gram model looks the most accurate to you
My Shiny App