Introduction

This presentation summarizes the steps and results of the capstone project for the Coursera Data Science specialization offered by Johns Hopkins University.

This simple application predicts the next word separately for each of three sources:

Twitter
Blogs
News

We treat the sources separately because our first exploration showed that people communicate differently in each of them.

Because the source files are large, we use only a random 10% sample of each one and save a copy of the sample to disk, so the original large files do not have to be reloaded.
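The sampling step could look roughly like the sketch below. It is only an illustration in base R: the helper name `sample_file`, the fixed seed, and the temporary demo file are assumptions, not the code used in the project.

```r
# Hypothetical helper (not from the project): keep a random fraction of the
# lines of a text file and write them to a smaller file on disk.
sample_file <- function(in_path, out_path, fraction = 0.10, seed = 1234) {
  lines <- readLines(in_path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  set.seed(seed)  # assumed fixed seed, so the sample is reproducible
  n_keep <- round(length(lines) * fraction)
  # sort() keeps the sampled lines in their original order
  kept <- lines[sort(sample.int(length(lines), n_keep))]
  writeLines(kept, out_path)
  invisible(length(kept))
}

# Demo on a small temporary file standing in for one of the large sources
in_path  <- tempfile(fileext = ".txt")
out_path <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:1000), in_path)
n_kept <- sample_file(in_path, out_path)
```

Later steps can then read the file at `out_path` instead of the original source file.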

Data cleansing

With the sample files in place, we generate the corpus to be analyzed. We clean it by converting all words to lower case and removing punctuation and extra white space. We keep common English words and numbers, since they are an essential part of communication and cannot be left out of the prediction.

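library(tm)  # provides tm_map() and the transformations used below
# corpus_sample is the corpus built from the sample files
# convert all words to lower case
corpus_sample <- tm_map(corpus_sample, content_transformer(tolower))
# remove all punctuation
corpus_sample <- tm_map(corpus_sample, removePunctuation)
# collapse runs of white space into a single space
corpus_sample <- tm_map(corpus_sample, stripWhitespace)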

Algorithm and application

The core of the project is the prediction algorithm, together with the way the data are accessed and how the application manages them.

To keep responses fast, we pre-calculate the results and store them in files on disk. We generate one file for each information source. Each file contains the list of the most frequent n-grams together with their frequencies, so they can be sorted.
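As a rough sketch of this pre-calculation, the base R code below counts n-grams in a few cleaned lines, sorts them by frequency, and saves the table to disk. The helper name `ngram_table`, the column names `ngram` and `freq`, and the RDS file format are assumptions made for illustration; the presentation does not say how the project's files are built or stored.

```r
# Hypothetical helper (not from the project): count the n-grams in a
# character vector of cleaned lines, sorted from most to least frequent.
ngram_table <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    words <- words[nzchar(words)]            # drop empty tokens
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input standing in for one cleaned source
lines <- c("i love data science", "i love r", "we love data")
bigrams <- ngram_table(lines, 2)

# Assumed storage format: one RDS file per source and n-gram size
out_file <- tempfile(fileext = ".rds")
saveRDS(bigrams, out_file)
```

Because the counting happens once, ahead of time, the application only has to read the saved table with `readRDS()` when it starts.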

The application detects the length of the text typed into the textbox and looks up, in the pre-calculated files, the output with the highest probability.

Once the value is retrieved from the files, it is sent to the result textbox in the application front end.
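The lookup described above might be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function name `predict_next`, the `tables` list indexed by n-gram size, the `max_n` limit, and the toy frequencies are all hypothetical, and the most frequent matching n-gram stands in for the output with the highest probability.

```r
# Hypothetical lookup (not from the project): tables[[n]] holds the n-gram
# frequency table (columns ngram, freq) for n = 2, 3, ...
predict_next <- function(text, tables, max_n = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)
  # The number of typed words decides which n-gram table is consulted
  n <- min(length(words) + 1, max_n)
  prefix <- paste(tail(words, n - 1), collapse = " ")
  tab <- tables[[n]]
  hits <- tab[startsWith(tab$ngram, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  best <- hits$ngram[which.max(hits$freq)]
  # The prediction is the last word of the most frequent matching n-gram
  sub(".* ", "", best)
}

# Toy tables with made-up frequencies; position 1 is unused
tables <- list(
  NULL,
  data.frame(ngram = c("i love", "love data", "love r"),
             freq = c(5L, 3L, 1L), stringsAsFactors = FALSE),
  data.frame(ngram = c("i love data", "i love r"),
             freq = c(1L, 4L), stringsAsFactors = FALSE)
)

one_word  <- predict_next("love", tables)    # consults the bigram table
two_words <- predict_next("I love", tables)  # consults the trigram table
```

In this sketch a prefix with no match returns `NA`; a real application would need some fallback for that case, which the presentation does not describe.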

Application usage

The application is published at this link

The usage is simple:

Type a sentence in the textbox
Press the submit button
Check the predicted next word for each source
