Data Science Capstone - Final Project

January, 2021

Introduction

The Capstone is the last module of the Data Science specialization offered by Johns Hopkins University.

The objective of the Capstone is to apply the skills learned in the previous modules to a new and challenging problem: predicting the next word from the user's input.

The user interface and user interaction are implemented with R Shiny.

Data and model

The data used to build the predictive model come from a set of large text files extracted from internet sources (such as posts, comments and tweets).

A data sample was first drawn from the original text corpus. The sample was then cleaned by converting it to lowercase and removing punctuation, extra whitespace, numbers and special characters. Finally, the cleaned sample was tokenized into n-grams (1-grams, 2-grams, 3-grams and 4-grams).
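The cleaning and tokenization steps can be sketched as follows. This is a minimal illustration in Python, not the project's actual R code; the function names `clean_text` and `ngrams`, the handling of apostrophes, and the sample sentence are all assumptions made for the example.

```python
import re

def clean_text(text):
    """Lowercase the text and strip everything that is not a letter.

    Assumption: apostrophes are dropped without a space ("it's" -> "its");
    the original project may treat them differently.
    """
    text = text.lower()
    text = text.replace("'", "")
    # Numbers, punctuation, special characters and whitespace runs
    # all collapse into a single space.
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()

def ngrams(tokens, n):
    """Return all consecutive n-word tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, for illustration only.
sample = "Hello, World! It's 2021... see you @ the park."
tokens = clean_text(sample).split()

print(tokens)
print(ngrams(tokens, 2))
print(ngrams(tokens, 4))
```

The same `ngrams` helper covers all four sizes used here (n = 1 to 4); an input shorter than n simply yields an empty list.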

The n-grams were then used to build the corresponding frequency matrices, which are in turn used to predict the next word from the user input.
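As a rough illustration of what such frequency tables hold, the sketch below counts n-grams with Python's `collections.Counter`. The helper name `ngram_counts` and the toy token list are assumptions for the example; the project itself builds its tables in R.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count how often each n-gram (as a tuple of words) occurs."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy, already-cleaned token list (hypothetical data).
tokens = "the cat sat on the mat and the cat slept".split()

# One frequency table per n-gram size, from 1-grams to 4-grams.
tables = {n: ngram_counts(tokens, n) for n in range(1, 5)}

print(tables[1].most_common(1))    # [(('the',), 3)]
print(tables[2][("the", "cat")])   # 2
```

Sorting each table by count makes it easy to look up the most frequent continuation of a given word sequence.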

The Application

The Shiny application provides a user interface to enter a word (or a sentence) and get the predicted next word, based on the model described above.

The user enters text in the “Enter a partially complete sentence:” field, and the predicted next word is shown under “The predicted word is:”.

If no match is found, the application simply returns the most frequent word in the underlying data sample (that is, 'the').
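The lookup and its fallback can be sketched as below. This is only an illustration in Python under stated assumptions: it tries the longest matching context first and then shorter ones (a simple backoff that the text does not describe explicitly), and the names `build_model` and `predict_next`, as well as the toy data, are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=4):
    """Map each context of 1..max_n-1 words to a Counter of next words."""
    followers = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            followers[context][tokens[i + n - 1]] += 1
    # Fallback word: the most frequent single word in the sample.
    most_frequent = Counter(tokens).most_common(1)[0][0]
    return followers, most_frequent

def predict_next(text, followers, most_frequent, max_n=4):
    """Predict the next word, falling back to the most frequent word."""
    words = text.lower().split()
    # Assumed strategy: longest context first, then progressively shorter.
    for k in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-k:])
        if context in followers:
            return followers[context].most_common(1)[0][0]
    return most_frequent

# Toy, already-cleaned sample (hypothetical data).
tokens = "the cat sat on the mat and the cat slept".split()
followers, most_frequent = build_model(tokens)

print(predict_next("I saw the", followers, most_frequent))  # cat
print(predict_next("zebra", followers, most_frequent))      # the
```

In the second call no context matches, so the function returns the most frequent word of the sample, mirroring the application's fallback to 'the'.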

Conclusion

The next word prediction application is hosted here: https://elenadb93.shinyapps.io/CapstoneFInalProject/

The milestone report can be found here: https://rpubs.com/eledb/CapstoneWeek2

(Some changes have been made with respect to the milestone report: in particular, connecting words are now kept in the final cleaned data sample.)

The model could be extended and its performance improved by building a larger underlying data sample.