Predict Next Word

A Practice of Natural Language Processing

Author: Jessie J. Q

Project Objective

The objective of this project is to build a model that predicts the next word in a sequence of text. The project covers the range of activities a practicing data scientist encounters, which mirror many of the skills taught in the data science specialization:

Understand the problem
Data acquisition and cleaning
Exploratory analysis
Predictive modeling
Creative exploration
Creating a data product

Data Understanding and Preprocessing

Download the Coursera-Swiftkey.zip file from the Coursera website and unzip the Coursera-Swiftkey folder into the working directory
Read the twitter, blogs, and news files separately, each as a text file into a vector whose elements are lines
Sample the data down to a smaller, workable size: sample size = samplingFactor * total number of lines
Partition the sampled data into training (60%), validation (20%), and test (20%) sets
Preprocess a corpus by building a function that converts its text to a plain text document, lowercases it, replaces contractions with their full forms, and removes profanities, numbers, punctuation, and English stopwords
Clean the training, validation, and test corpora
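
The sampling, partitioning, and cleaning steps above can be sketched as follows. This is a minimal illustration in Python, not the project's own code: the function names, the fixed random seed, and the tiny contraction, stopword, and profanity lists are all hypothetical stand-ins for the full resources a real pipeline would load.

```python
import random
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and"}
PROFANITIES = {"darn"}  # placeholder entry


def sample_lines(lines, sampling_factor, seed=42):
    """Draw sampling_factor * len(lines) lines at random, without replacement."""
    rng = random.Random(seed)
    sample_size = round(len(lines) * sampling_factor)
    return rng.sample(lines, sample_size)


def partition(lines, train=0.6, validation=0.2):
    """Split lines into train, validation, and test; test takes the remainder."""
    n_train = round(len(lines) * train)
    n_val = round(len(lines) * validation)
    return (lines[:n_train],
            lines[n_train:n_train + n_val],
            lines[n_train + n_val:])


def clean_line(line):
    """Lowercase, expand contractions, drop numbers, punctuation, and unwanted words."""
    text = line.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = [t for t in text.split()
              if t not in STOPWORDS and t not in PROFANITIES]
    return " ".join(tokens)
```

For example, `partition(sample_lines(lines, 0.5))` returns the three sets in 60/20/20 proportion, and `clean_line` can then be mapped over each set. Note that removing stopwords also removes them as prediction candidates, which is a design trade-off worth checking against the intended use.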

Modeling

Get frequencies of terms in a corpus, in decreasing order: getTermFrequency
Generate unigrams, bigrams, and trigrams using the ngram library
Get 1-, 2-, and 3-grams for the validation and test data using the same function
Use the Simple Good-Turing algorithm to smooth frequencies
Remove words with frequencies less than minimum (singletons)
p(w3|w1,w2) = count(w1,w2,w3)/count(w1,w2) to calculate trigram frequencies and probabilities
p(w2|w1) = count(w1,w2)/count(w1) to calculate bigram frequencies and probabilities
Train, validate and test the model
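
The counting and probability steps above can be illustrated with a short sketch. It is written in Python under stated assumptions: all function names are hypothetical, the probabilities are plain maximum-likelihood estimates from raw counts, and the Simple Good-Turing smoothing step is left out. The prediction function uses a simple fallback from trigrams to bigrams to unigrams, which is one common choice and not necessarily the strategy this project uses.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n, min_count=1):
    """Count n-grams over cleaned sentences, dropping those below min_count."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return Counter({g: c for g, c in counts.items() if c >= min_count})


def conditional_probability(ngram, counts_n, counts_prefix):
    """p(w_n | w_1..w_{n-1}) = count(w_1..w_n) / count(w_1..w_{n-1})."""
    prefix_count = counts_prefix.get(ngram[:-1], 0)
    if prefix_count == 0:
        return 0.0
    return counts_n.get(ngram, 0) / prefix_count


def predict_next(history, unigrams, bigrams, trigrams):
    """Pick the most frequent continuation, trying the longest context first."""
    tokens = history.split()
    for size, table in ((2, trigrams), (1, bigrams)):
        if len(tokens) >= size:
            context = tuple(tokens[-size:])
            candidates = {g[-1]: c for g, c in table.items()
                          if g[:size] == context}
            if candidates:
                return max(candidates, key=candidates.get)
    if unigrams:
        return unigrams.most_common(1)[0][0][0]  # most frequent single word
    return None


# Toy training corpus (made up for illustration).
corpus = ["i love new york", "i love new ideas", "i love new york pizza"]
uni = count_ngrams(corpus, 1)
bi = count_ngrams(corpus, 2)
tri = count_ngrams(corpus, 3)
```

With this toy corpus, `conditional_probability(("love", "new", "york"), tri, bi)` evaluates count(love,new,york)/count(love,new), and `predict_next("we love new", uni, bi, tri)` looks up the trigram context first. Passing `min_count=2` to `count_ngrams` would drop singletons, as described in the list above.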

References

A Shiny app that predicts the next word is available at http://jjq5958.shinyapps.io/PredictNewWord2.
More details can be found at https://github.com/JJQU/PredictNextWord.
Natural Language Processing: A Model to Predict a Sequence of Words by Gerald R. Gendron
NLP: Language Models-Lecture 9 by Joshua Goodman
https://faculty.cs.byu.edu/~ringger/CS479/papers/Gale-SimpleGoodTuring.pdf
https://medium.com/ymedialabs-innovation/next-word-prediction-using-markov-model-570fc0475f96
http://ptrckprry.com/course/ssd/data/
The many people who have contributed to this capstone project on GitHub over the years