Final Project: Next Word Prediction

Shenqin Yao
2020-08-21

Introduction

This presentation is part of the final assignment for the online course Data Science Capstone (https://www.coursera.org/learn/data-science-project).

The goal is to build a data product that predicts the next word. The related Shiny app can be found at https://shenqiny.shinyapps.io/finalproject/

The presentation was generated using RStudio.

Data source

The data come from the SwiftKey.zip file provided in the class.

It includes tweets, blog posts, and news articles written in English.

The data were cleaned by converting the text to lower case and removing non-letter strings.

Trigrams were generated for next-word prediction.

Low-frequency trigrams were eliminated to save time.
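
The cleaning and trigram steps above can be sketched in a few lines. This is a minimal illustration in Python, not the app's own code: the function names `clean` and `build_trigrams`, the `min_count` threshold of 2, and the three sample sentences are all assumptions made for the example. It also assumes that "removing non-letter strings" means replacing every run of non-letter characters with a space.

```python
import re
from collections import Counter

def clean(text):
    """Lower-case the text and replace each run of non-letters with one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()

def build_trigrams(lines, min_count=2):
    """Count trigrams line by line, then drop those seen fewer than min_count times.

    min_count is a hypothetical cutoff; the presentation does not state the real one.
    """
    counts = Counter()
    for line in lines:
        words = clean(line).split()
        # zip over three shifted views of the word list yields consecutive triples
        counts.update(zip(words, words[1:], words[2:]))
    return {tri: n for tri, n in counts.items() if n >= min_count}

# Made-up sample text standing in for the tweets, blogs, and news files.
sample = [
    "I love New York City!",
    "New York City is big.",
    "Visit new york state in 2020",
]
trigrams = build_trigrams(sample, min_count=2)
print(trigrams)  # {('new', 'york', 'city'): 2}
```

Only the trigram that occurs twice survives the cutoff, which shows how pruning shrinks the dictionary at the price of dropping rare continuations.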

Prediction

The Shiny app uses a trigram dictionary for fast prediction, at some cost in accuracy.

One can input any number of words.

If the input has two or more words, the last two words are matched against the trigrams to predict the third word. If no entry is found, only the last word of the input is matched. If there is still no entry, the app returns “the”.

If the input has only one word, that word is matched against the trigrams and the next word is returned. If no entry is found, the app returns “the”.
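
The fallback logic described above can be sketched as follows. This is a hedged Python illustration rather than the app's implementation. The toy `TRIGRAMS` counts and the names `predict` and `best` are hypothetical. Two further details are assumptions, since the slides do not specify them: a single word is matched against the first word of each trigram and the most frequent second word is returned, and empty input also returns the default word.

```python
import re

DEFAULT_WORD = "the"

# Toy trigram counts standing in for the app's dictionary (hypothetical values).
TRIGRAMS = {
    ("new", "york", "city"): 5,
    ("new", "york", "times"): 3,
    ("york", "city", "is"): 2,
    ("in", "new", "york"): 4,
}

def best(candidates):
    """Return the candidate word with the highest count, or None if empty."""
    return max(candidates, key=candidates.get) if candidates else None

def predict(text, trigrams=TRIGRAMS):
    """Predict the next word: last two words, then last word, then the default."""
    words = re.sub(r"[^a-z]+", " ", text.lower()).split()
    if not words:
        return DEFAULT_WORD  # assumption: empty input gets the default word
    if len(words) >= 2:
        w1, w2 = words[-2:]
        hits = {c: n for (a, b, c), n in trigrams.items() if (a, b) == (w1, w2)}
        if hits:
            return best(hits)
    # Back off to the last word only. Assumption: it is compared with the
    # first word of each trigram, and counts are summed per second word.
    last = words[-1]
    hits = {}
    for (a, b, _), n in trigrams.items():
        if a == last:
            hits[b] = hits.get(b, 0) + n
    return best(hits) or DEFAULT_WORD

print(predict("I live in New York"))  # city
print(predict("new"))                 # york
print(predict("capstone"))            # the
```

Each step is a dictionary scan over a small pruned table, which is consistent with the speed-over-accuracy trade-off the slides describe.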

Summary

How to use this app?

The Shiny app comes with an “about” tab that explains how to use the app.

For words that appear frequently in the original data source, such as “new york”, the output is accurate: “city”.

For words that are not frequently used, such as “capstone”, the output is the default word “the”.