Coursera: Data Science Capstone Presentation

Ramesh Thyagarajan
March 24, 2017

Introduction

This presentation showcases the knowledge acquired in the Data Science Specialization by building an application.

The application takes a string of words and predicts the next word, based on the probability of occurrence.

A Shiny application was built to demonstrate the predictions. The application is located at https://rameshthy.shinyapps.io/capstone_swiftkey_prediction/

The basis of the prediction algorithm is a set of three documents (corpus) containing text from blogs, news articles and tweets.

Constraints and Available Resources

We need to create a robust program that predicts the next word to be written, based on the preceding text.

For this project, our corpus is built from the following files provided by SwiftKey:

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

Due to the size limitations on shinyapps.io, our data files have to be under 5 MB.

Methodology - Analyzing the Corpus

The sentences in the corpus were split into individual word combinations (n-grams). Datasets were created for the following:

Unigrams (One word sets)
Bigrams (Two word sets)
Trigrams (Three word sets)
Quadgrams (Four word sets)
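
The word sets listed above can be illustrated with a minimal Python sketch. It is not the code used in the application; the function names `tokenize` and `ngrams` and the sample sentence are assumptions made for this example, and the tokenizer is deliberately simple (lowercase letters and apostrophes only).

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", sentence.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, used only for illustration.
tokens = tokenize("Thanks for the follow")
for n, name in enumerate(["Unigrams", "Bigrams", "Trigrams", "Quadgrams"], start=1):
    print(name, ngrams(tokens, n))
```

A sentence shorter than the requested set size simply yields an empty list, so short tweets need no special handling in this sketch.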

For each of the above, the frequency of occurrence of each set was calculated. These frequencies were later converted into probabilities of occurrence, which are used directly in the application for next word prediction.
Profanity was removed using Google's bad word list.
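
As a rough sketch of the frequency-to-probability step, the Python below counts each word set, drops sets containing a banned word, and divides each count by the total for its preceding words. This assumes the probabilities are conditional on the preceding words, which the slides do not state explicitly. The names `BANNED` and `ngram_probabilities` and the tiny sample data are hypothetical; `BANNED` stands in for the real bad word list.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a real profanity word list.
BANNED = {"darn"}

def ngram_probabilities(ngram_list, banned=BANNED):
    """Map each context (all words but the last) to {next_word: probability}."""
    # Count word sets, skipping any set that contains a banned word.
    counts = Counter(g for g in ngram_list if not banned.intersection(g))
    # Total count of all word sets that share the same context.
    context_totals = defaultdict(int)
    for gram, c in counts.items():
        context_totals[gram[:-1]] += c
    # Probability of the last word, given the context.
    probs = defaultdict(dict)
    for gram, c in counts.items():
        probs[gram[:-1]][gram[-1]] = c / context_totals[gram[:-1]]
    return dict(probs)

# Hypothetical bigram sample, used only for illustration.
bigrams = [("of", "the"), ("of", "the"), ("of", "a"), ("of", "darn")]
table = ngram_probabilities(bigrams)
print(table)
```

Storing only probabilities per context, rather than the raw text, is one way to keep the data files small.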

Application

Start typing words in the text box, and the three most relevant predictions will be displayed.
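
One way such a lookup could work is sketched below in Python. It is an assumption, not the application's actual logic: the sketch tries the longest available context first and falls back to shorter ones, a strategy the slides do not describe. The function `predict_next`, the layout of `tables`, and all probabilities shown are hypothetical.

```python
def predict_next(text, tables, k=3):
    """Return up to k candidate next words, trying the longest context first."""
    words = text.lower().split()
    # Try a three-word context (quadgram table), then two, one, and finally none.
    for n in range(min(3, len(words)), -1, -1):
        context = tuple(words[len(words) - n:])
        candidates = tables.get(n, {}).get(context)
        if candidates:
            # Rank candidate words by probability, highest first.
            ranked = sorted(candidates, key=candidates.get, reverse=True)
            return ranked[:k]
    return []

# Hypothetical tables keyed by context length: {context: {next_word: probability}}.
tables = {
    0: {(): {"the": 0.05, "to": 0.03, "and": 0.02}},
    1: {("for",): {"the": 0.4, "a": 0.3, "you": 0.2, "me": 0.1}},
    2: {("thanks", "for"): {"the": 0.6, "your": 0.4}},
}

print(predict_next("Thanks for", tables))
print(predict_next("waiting for", tables))
print(predict_next("hello", tables))
```

In this sketch an unseen phrase still produces suggestions, because the empty context falls back to the most probable single words.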