8/20/2020

Introduction

The goal of this JHU Data Science Capstone Project is to build a Shiny application that takes a text input and outputs a prediction of the next word based on the selected prediction algorithm, while providing a user-friendly interface that others can conveniently access. The final product was built to meet the following review criteria as provided by the course (a minimal app skeleton follows the list):

  • Must provide a text input box that can accept user input.
  • Must provide at least one word prediction based on the user input within a suitable delay.
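
A minimal Shiny skeleton satisfying both criteria might look like the sketch below; predict_next_word() is a hypothetical stub standing in for the n-gram model described in the following sections, and the 300 ms debounce is an assumed value for the "suitable delay".

```r
# Minimal Shiny skeleton; predict_next_word() is a hypothetical placeholder
# for the n-gram model described below, and 300 ms is an assumed debounce.
library(shiny)

predict_next_word <- function(text) {
  if (nchar(trimws(text)) == 0) "" else "the"  # placeholder prediction
}

ui <- fluidPage(
  textInput("user_text", "Enter your text:"),  # criterion 1: a text input box
  textOutput("prediction")                     # criterion 2: a predicted word
)

server <- function(input, output) {
  # Re-run the prediction only after the user pauses typing
  typed <- debounce(reactive(input$user_text), 300)
  output$prediction <- renderText(predict_next_word(typed()))
}

shinyApp(ui, server)
```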

N-gram Probability

The n-gram probability was calculated by maximum likelihood estimation (read more here):

  • Bigram: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1},\, w_n)}{C(w_{n-1})}$
  • Trigram: $P(w_n \mid w_{n-1}, w_{n-2}) = \frac{C(w_{n-2},\, w_{n-1},\, w_n)}{C(w_{n-2},\, w_{n-1})}$
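
As a toy illustration of these formulas (the corpus below is made up, not the course data), the counts and probabilities can be computed directly in R:

```r
# Toy corpus; the counts C(.) are taken directly from these tokens.
tokens <- c("the", "dog", "barks", "at", "the", "dog")
n <- length(tokens)

bigrams  <- paste(tokens[-n], tokens[-1])
trigrams <- paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n])

# P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
p_bigram <- function(w, w1) {
  sum(bigrams == paste(w1, w)) / sum(tokens[-n] == w1)
}

# P(w_n | w_{n-1}, w_{n-2}) = C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1})
p_trigram <- function(w, w1, w2) {
  sum(trigrams == paste(w2, w1, w)) / sum(bigrams[-(n - 1)] == paste(w2, w1))
}

p_bigram("dog", "the")           # C("the dog") / C("the") = 2 / 2 = 1
p_trigram("barks", "dog", "the") # C("the dog barks") / C("the dog") = 1 / 1 = 1
```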

Unobserved n-grams were assumed to have the frequency of the least-observed n-gram, and smoothed frequencies were obtained using the Simple Good-Turing smoothing method, with code developed by William Gale (read here).
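
For intuition, here is a simplified sketch of the Simple Good-Turing procedure. It is not Gale's published code: the log-linear fit is applied throughout, the Turing/log-linear switching rule of the full algorithm is omitted, and at least two distinct observed counts are assumed.

```r
# Simplified Simple Good-Turing sketch (after Gale & Sampson), NOT Gale's
# published code: the log-linear fit is used throughout and the
# Turing/log-linear switching rule is omitted. Assumes >= 2 distinct counts.
simple_good_turing <- function(counts) {
  tab <- table(counts)
  r  <- as.numeric(names(tab))  # distinct observed counts, ascending
  Nr <- as.numeric(tab)         # N_r: number of types seen exactly r times
  N  <- sum(r * Nr)             # total observed tokens

  # Z_r spreads each N_r over the gap between neighbouring non-zero counts
  k <- length(r)
  q <- c(0, r[-k])
  t <- c(r[-1], 2 * r[k] - r[k - 1])
  Z <- Nr / (0.5 * (t - q))

  # Fit log(Z_r) = a + b*log(r), then smooth counts via r* = (r+1)*S(r+1)/S(r)
  fit <- lm(log(Z) ~ log(r))
  S <- function(x) exp(coef(fit)[1] + coef(fit)[2] * log(x))
  r_star <- (r + 1) * S(r + 1) / S(r)

  # Total probability mass reserved for unseen n-grams: N_1 / N
  p_unseen <- if (1 %in% r) Nr[r == 1] / N else 0
  list(r = r, r_star = r_star, p_unseen = p_unseen)
}
```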

Model Selection

The probability of the next word was estimated using the simple linear interpolation method, which combines the weighted probabilities of the next word from the unigram, bigram, and trigram models:

$\hat{P}(w_n \mid w_{n-1}, w_{n-2}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-1}, w_{n-2})$

where $\hat{P}$ is the estimated probability of the next word and the $\lambda_i$ are weighting factors whose sum is 1. After repeated model tests and comparisons, the final $\lambda_i$ were set to (0, 0.3, 0.7). The final selected vocabulary includes over 20,000 words (compiled across three sources provided by this course) in addition to stopwords; it was also found that removing stopwords drastically reduces prediction accuracy.
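
A self-contained toy illustration of the interpolation follows; the probability tables are made up, and the (0, 0.3, 0.7) ordering is assumed to map to (unigram, bigram, trigram).

```r
# Toy illustration of simple linear interpolation; the probability tables are
# made up, and the (0, 0.3, 0.7) ordering is assumed to be (uni, bi, tri).
p_uni <- c("barks" = 0.005)
p_bi  <- c("dog barks" = 0.10)
p_tri <- c("the dog barks" = 0.30)

lookup <- function(tbl, key) if (key %in% names(tbl)) tbl[[key]] else 0

# P_hat(w | w1, w2) = l1*P(w) + l2*P(w | w1) + l3*P(w | w2, w1)
interpolated_prob <- function(w, w1, w2, lambda = c(0, 0.3, 0.7)) {
  lambda[1] * lookup(p_uni, w) +
    lambda[2] * lookup(p_bi, paste(w1, w)) +
    lambda[3] * lookup(p_tri, paste(w2, w1, w))
}

interpolated_prob("barks", "dog", "the")
# 0 * 0.005 + 0.3 * 0.10 + 0.7 * 0.30 = 0.24
```

Note that with $\lambda_1 = 0$ the unigram term contributes nothing, so (under the assumed ordering) predictions are driven entirely by the bigram and trigram evidence.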

How to use?

Downloadable Sources