## Introduction

This report outlines the methodology for building a word prediction application using Natural Language Processing techniques as part of the Coursera Data Science Specialization.

The milestone report outlines the initial approach to building a series of ngram models from a range of text documents.

## Prediction models

• A cleaned corpus sample was used to create five bag-of-words n-gram models (1-gram to 5-gram)
• The process described in the next section was used to search these models for candidate next words given an input sentence
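
As a rough illustration of what such n-gram models can look like, the following Python sketch counts n-grams of orders 1 to 5 from an already tokenised text. It is not the project's actual code: the function name `build_ngram_counts`, the dictionary-of-counters layout, and the toy sentence are assumptions made for this example.

```python
from collections import Counter

def build_ngram_counts(tokens, max_order=5):
    """Return {order: Counter of n-gram tuples} for orders 1..max_order.

    Hypothetical helper; the report does not describe its storage format.
    """
    counts = {}
    for n in range(1, max_order + 1):
        # Slide a window of width n across the token list.
        counts[n] = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts

# Toy corpus, assumed to be cleaned and lower-cased already.
tokens = "thank you for the follow thank you for the help".split()
models = build_ngram_counts(tokens)
print(models[2][("thank", "you")])  # count of the bigram "thank you"
```

Keeping one counter per order makes it cheap to look up both the matched n-gram count and the count of its (n-1)-gram prefix, which is what the scoring step needs.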

## Algorithm used to make the prediction

A Stupid Backoff smoothing strategy was used to calculate a ‘score’ for each word as follows:

    if the row's n-gram order is 5
        score = matched 5-gram count / input 4-gram count
    else if the row's n-gram order is 4
        score = 0.4 * matched 4-gram count / input 3-gram count
    else if the row's n-gram order is 3
        score = 0.4 * 0.4 * matched 3-gram count / input 2-gram count
    else if the row's n-gram order is 2
        score = 0.4 * 0.4 * 0.4 * matched 2-gram count / input 1-gram count
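
For example, suppose the word ‘you’ is matched in three of the models:

| ngram | predicted | score |
|-------|-----------|-------|
| 4     | you       | 0.2   |
| 3     | you       | 0.1   |
| 2     | you       | 0.05  |

The total score for the predicted word ‘you’ is 0.2 + 0.1 + 0.05 = 0.35. Scores are summed for each word across all matching n-gram models, and the top 3 words are selected according to the highest total score.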
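
The scoring rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the `models` layout (`{order: {n-gram tuple: count}}`), the function name `score_candidates`, and the toy counts are all hypothetical. It follows the report's description of summing the discounted scores across orders; the original Stupid Backoff formulation instead uses only the highest-order match and backs off when none is found.

```python
from collections import defaultdict

DISCOUNT = 0.4   # back-off factor used in the report
TOP_ORDER = 5    # highest n-gram order in the report

def score_candidates(context, models):
    """Sum discounted relative frequencies per candidate word.

    context: list of input tokens.
    models:  {order: {n-gram tuple: count}} (assumed layout).
    """
    scores = defaultdict(float)
    for n in range(TOP_ORDER, 1, -1):
        prefix = tuple(context[-(n - 1):])
        if len(prefix) < n - 1:
            continue  # not enough input words for this order
        prefix_count = models.get(n - 1, {}).get(prefix, 0)
        if prefix_count == 0:
            continue  # the input (n-1)-gram was never seen
        weight = DISCOUNT ** (TOP_ORDER - n)  # 1, 0.4, 0.16, 0.064
        # Linear scan for clarity; a real model would index by prefix.
        for ngram, count in models.get(n, {}).items():
            if ngram[:-1] == prefix:
                scores[ngram[-1]] += weight * count / prefix_count
    return dict(scores)

# Toy counts, invented for illustration only.
toy_models = {
    1: {("love",): 4},
    2: {("i", "love"): 2, ("love", "you"): 2, ("love", "it"): 1},
    3: {("i", "love", "you"): 1},
}
toy_scores = score_candidates(["i", "love"], toy_models)
print(sorted(toy_scores.items(), key=lambda kv: kv[1], reverse=True))
```

In the toy data, ‘you’ collects a contribution from both the 3-gram and the 2-gram model, so it ranks above ‘it’, which is matched only by the 2-gram model.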

## Evaluation

The prediction model was evaluated using the Benchmark.R tool (see references for source).

Initial prediction accuracy was quite high, but the search was slow. Restricting the search to the 1-gram to 3-gram models halved the search time but also reduced accuracy by 10%.

## Instructions

To use the application, navigate to the following URL:

https://chrismckelt.shinyapps.io/datascience-capstone/

Once the page has loaded, start typing text.

When no results are found, the 3 most common words in the English language (‘the’, ‘be’, ‘to’) are returned instead.
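
The selection and fallback behaviour described above might look like the following Python sketch. The function name `top_predictions` and the `scores` dictionary (word mapped to total score) are assumptions for illustration; only the three fallback words come from the report.

```python
FALLBACK_WORDS = ["the", "be", "to"]  # default words named in the report

def top_predictions(scores, k=3):
    """Return the k highest-scoring words, or the fallback words if none."""
    if not scores:
        return FALLBACK_WORDS[:k]
    # Sort candidate words by total score, highest first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]

print(top_predictions({"you": 0.35, "it": 0.1, "me": 0.2, "us": 0.05}))
print(top_predictions({}))
```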

Click on the green side menu for visual display options.