Coursera Data Science Specialization

September 8, 2017

Slide 1

Presentation

This presentation briefly explains an application that predicts the next word(s), given a word or a sentence. The application is part of the Capstone Project of the Coursera Data Science Specialization.

Objectives

The main objective is to build an application that predicts the next word when a user enters a word or phrase. The second objective is to showcase the word prediction method in a Shiny app.

Slide 2

Process Overview

Sampling the corpus: three files (from blogs, news, and Twitter, respectively) are combined into a single corpus. Due to memory constraints, 1% of the corpus is taken (through random sampling) for further analysis.
Cleaning the data: the text is stemmed, and HTML tags, email addresses, Twitter handles, extra white space, punctuation, digits, numbers, etc. are removed.
Making N-gram tokens: unigrams, bigrams, and trigrams are made using the RWeka package. In computational linguistics, an N-gram is a contiguous sequence of N items from a given sequence of text.
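The three steps above can be sketched in a few lines. This is only an illustration in Python, not the application's actual R/RWeka code: the function names (sample_lines, clean, ngrams), the regular expressions, and the example sentence are all assumptions, and stemming is left out for brevity.

```python
import random
import re
from collections import Counter

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep roughly `fraction` of the lines through random sampling."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean(text):
    """Lowercase the text and strip tags, emails, handles, digits, punctuation."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # HTML tags
    text = re.sub(r"\S+@\S+", " ", text)      # email addresses
    text = re.sub(r"@\w+", " ", text)         # Twitter handles
    text = re.sub(r"[^a-z\s']", " ", text)    # digits and punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse white space

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, used only to show the pipeline.
raw = "<p>Email me@x.com or @bob: I love data, I love R 2 much!</p>"
tokens = clean(raw).split()
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))
print(bigram_counts.most_common(1))
```

The counts of bigrams and trigrams are what a prediction model would later turn into probabilities.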

Slide 3

Process Details

Using a Markov chain model: words are treated as transitional states with probabilities.
The model is based on calculating the so-called transition matrix, in which probabilities are assigned to words. In a transition matrix, each row and each column represents a transitional state, and each entry holds the probability of moving from one state to the next. If no word is found (even when a probability is assigned), UNK is returned. UNK can occur with different probabilities.
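A minimal sketch of this idea, assuming a first-order chain where the state is the last word only. It is written in Python for illustration and is not the application's own code: build_transitions, predict_next, and the toy sentence are hypothetical, and the transition matrix is stored as a nested dictionary (row = current word, column = next word) rather than a dense matrix.

```python
from collections import Counter, defaultdict

def build_transitions(tokens):
    """Estimate P(next word | current word) from adjacent word pairs."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return {
        word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for word, followers in counts.items()
    }

def predict_next(transitions, word, k=3):
    """Return up to k most probable next words, or UNK for an unseen word."""
    followers = transitions.get(word.lower())
    if not followers:
        return ["UNK"]
    # Sort by descending probability, breaking ties alphabetically.
    ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

# Toy corpus, assumed only for this example.
tokens = "i love data i love r i like data".split()
transitions = build_transitions(tokens)
print(predict_next(transitions, "i"))
print(predict_next(transitions, "python"))
```

Each row of the nested dictionary sums to 1, as a row of a transition matrix should. A word that never appears as a current state has no row, which is the case this sketch maps to UNK.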

Slide 4

Application Instructions

Using the application: type the desired word or phrase into the text box. The application preprocesses the input text (uppercase letters are converted to lowercase), then tries to predict the next words and displays them.
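The input preprocessing step might look like the sketch below. It is an illustration in Python under stated assumptions, not the Shiny app's code: prepare_context is a hypothetical name, and keeping only the last two words is an assumption based on trigrams being the longest N-grams mentioned in this presentation.

```python
import re

def prepare_context(text, max_words=2):
    """Lowercase the input and keep its last `max_words` words as context."""
    words = re.findall(r"[a-z']+", text.lower())
    return words[-max_words:]

print(prepare_context("I LOVE Data Science"))
```

The returned words would then be looked up in the N-gram model to find the most likely next words.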

Access the Shiny app here: https://tibi23.shinyapps.io/Week5/