Predict Next Word

A Practice of Natural Language Processing

Author: Jessie J. Q

Project Objective

The objective of this project is to build a model that predicts the next word given the words that precede it. The project covers the range of activities a practicing data scientist encounters, and these activities mirror many of the skills required in the data science specialization:

  • Understand the problem
  • Data acquisition and cleaning
  • Exploratory analysis
  • Predictive modeling
  • Creative exploration
  • Creating a data product

Data Understanding and Preprocessing

  • Download the Coursera-Swiftkey.zip file from the Coursera website and unzip the Coursera-Swiftkey folder into the working directory
  • Read each of the twitter, blogs and news files as text into a vector whose elements are lines
  • Sample the data down to a smaller, workable size: sample size = samplingFactor * total number of lines
  • Partition the sampled data into training (60%), validation (20%) and test (20%) sets
  • Preprocess a corpus by building a function that converts its text to a plain text document, lowercases it, replaces contractions with their full forms, and removes profanities, numbers, punctuation and English stopwords
  • Clean the training, validation and test corpora
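
The sampling, partitioning and cleaning steps above can be sketched as follows. This is a minimal Python illustration, not the project's own code. The value of SAMPLING_FACTOR, the tiny contraction, stopword and profanity tables, and the function names sample_lines, partition and clean_line are all assumptions made for the example.

```python
import random
import re

# Hypothetical value; the source names a samplingFactor but does not state it.
SAMPLING_FACTOR = 0.1

# Tiny illustrative tables (assumptions); a real run would use full lists.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and"}
PROFANITIES = {"darn"}


def sample_lines(lines, factor, seed=42):
    """Draw round(factor * len(lines)) lines at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(lines, round(factor * len(lines)))


def partition(lines, train_pct=60, validation_pct=20):
    """Split lines, in order, into train, validation and test parts.

    The test part receives whatever remains (20% with the defaults).
    """
    n_train = len(lines) * train_pct // 100
    n_val = len(lines) * validation_pct // 100
    return (lines[:n_train],
            lines[n_train:n_train + n_val],
            lines[n_train + n_val:])


def clean_line(line):
    """Lowercase, expand contractions, then drop numbers, punctuation,
    profanities and stopwords. Returns a list of tokens."""
    text = line.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)  # strips digits and punctuation
    return [w for w in text.split()
            if w not in STOPWORDS and w not in PROFANITIES]
```

Contractions are expanded before punctuation is stripped, because removing the apostrophe first would leave fragments such as "can t" that no longer match the lookup table.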

Modeling

  • Get frequencies of terms in a corpus, in decreasing order: getTermFrequency
  • Generate unigrams, bigrams, and trigrams using the ngram library
  • Get 1-, 2- and 3-grams for the validation and test data using the same function
  • Use the Simple Good-Turing algorithm to smooth frequencies
  • Remove terms whose frequency is below the minimum threshold (singletons)
  • Calculate bigram frequencies and probabilities: p(w2|w1) = count(w1,w2)/count(w1)
  • Calculate trigram frequencies and probabilities: p(w3|w1w2) = count(w1,w2,w3)/count(w1,w2)
  • Train, validate and test the model
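
The counting and probability steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's own implementation: all function names are hypothetical, and good_turing_counts applies only the basic Good-Turing adjustment r* = (r + 1) * N[r + 1] / N[r], without the log-linear smoothing of N[r] that Simple Good-Turing adds. The fallback from trigram to bigram to unigram in predict_next is likewise an assumption made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts


def prune(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times (singletons by default)."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})


def good_turing_counts(counts):
    """Basic Good-Turing adjusted counts r* = (r + 1) * N[r + 1] / N[r].

    Keeps the raw count when N[r + 1] is zero. Simple Good-Turing would
    instead smooth N[r] with a log-linear fit, which is omitted here.
    """
    freq_of_freq = Counter(counts.values())  # N[r]: how many n-grams occur r times
    adjusted = {}
    for gram, r in counts.items():
        if freq_of_freq[r + 1] > 0:
            adjusted[gram] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            adjusted[gram] = float(r)
    return adjusted


def conditional_prob(gram, counts, context_counts):
    """Maximum-likelihood estimate count(context + word) / count(context)."""
    context = gram[:-1]
    if context_counts[context] == 0:
        return 0.0
    return counts[gram] / context_counts[context]


def predict_next(words, unigrams, bigrams, trigrams):
    """Return the most frequent continuation of the last words typed.

    Tries the trigram table first, then bigrams, then unigrams. Within one
    context the denominator is shared, so ranking by count is the same as
    ranking by conditional probability.
    """
    candidates = {}
    if len(words) >= 2:
        w1w2 = tuple(words[-2:])
        candidates = {g[-1]: c for g, c in trigrams.items() if g[:2] == w1w2}
    if not candidates and words:
        w1 = (words[-1],)
        candidates = {g[-1]: c for g, c in bigrams.items() if g[:1] == w1}
    if not candidates:
        candidates = {g[0]: c for g, c in unigrams.items()}
    return max(candidates, key=candidates.get) if candidates else None
```

With the trigram and bigram tables built by count_ngrams, conditional_prob(("like", "green", "tea"), trigrams, bigrams) evaluates p(tea | like green) = count(like, green, tea) / count(like, green), which is the trigram formula listed above.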

References