Coursera Data Science Capstone

Elisa Villar
March 3rd, 2022

Introduction

This presentation was created as part of the requirements for the Coursera Data Science Capstone course.

The goal of the project was to build a predictive model and a Shiny app that predicts the next word as the user types into a text box, much as modern cellphone keyboards do.

Getting started

Before developing the algorithm, the data had to be downloaded and cleaned. The main steps were:

  • Take a 5% sample of the three files bundled together (blogs, news, twitter).
  • Clean the sample by removing punctuation, numbers, and extra white space, and by converting everything to lower case.
  • Create n-gram files from the sample, sorted in descending order by frequency.
  • Create functions such as clean and find_next, which clean the input and predict the next word, respectively.
  • Save those files for later use.
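
The sampling, cleaning, and n-gram counting steps above can be sketched as follows. This is a minimal Python illustration, not the project's actual code (which was built for a Shiny app in R). The helper names sample_lines and ngram_counts, the fixed seed, and the ASCII-only cleaning rule are assumptions made for the example; only the name clean comes from the slides.

```python
import random
import re
from collections import Counter


def clean(text):
    """Lower-case the text and strip punctuation, digits, and extra white space.

    Assumption: anything other than a-z and white space is dropped, so
    accented letters are removed too.
    """
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def sample_lines(lines, fraction=0.05, seed=42):
    """Return a random sample holding roughly `fraction` of the lines."""
    if not lines:
        return []
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)


def ngram_counts(lines, n):
    """Count n-grams over the cleaned lines, sorted by descending frequency."""
    counts = Counter()
    for line in lines:
        words = clean(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common()
```

Calling ngram_counts(sample_lines(all_lines), 2) would yield a list of (bigram, count) pairs with the most frequent bigram first, which could then be written to disk for later use.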

Next word - prediction model

When creating the prediction model, a few steps were crucial to getting good results:

  • Load the n-gram files (unigram, bigram, etc.) with their frequencies to determine the most probable next word.
  • Because the files contain no punctuation, numbers, or extra white space and are entirely lower case, the user's input must be cleaned the same way before it is looked up.
  • The first step is to check the length of the input:
  1. Three or more words: look into the quadgram table.
  2. Two words: look into the trigram table.
  3. One word: look into the bigram table.
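
The lookup described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the app's actual implementation: the TABLES layout and its toy entries are hypothetical, and falling back to a shorter n-gram when the longer one has no match is an assumption that the slides do not state. Only the function names clean and find_next come from the slides.

```python
import re


def clean(text):
    """Lower-case the input and strip punctuation, digits, and extra white space."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical toy tables: prefix length -> {prefix words: most frequent next word}.
# Key 3 stands in for the quadgram file, 2 for the trigram file, 1 for the bigram file.
TABLES = {
    3: {("thanks", "for", "the"): "follow"},
    2: {("for", "the"): "first"},
    1: {("the",): "same"},
}


def find_next(text, tables, default=None):
    """Predict the next word from the last one to three words of the input."""
    words = clean(text).split()
    # Start from the longest usable prefix (at most three words).
    # Assumption: if that prefix is not found, try the next shorter one.
    for k in range(min(3, len(words)), 0, -1):
        prefix = tuple(words[-k:])
        match = tables.get(k, {}).get(prefix)
        if match is not None:
            return match
    # Nothing matched; `default` could be the most frequent unigram.
    return default
```

With these toy tables, find_next("Thanks for the", TABLES) uses the quadgram entry, while find_next("waiting for the", TABLES) finds no quadgram match and falls back to the trigram entry for "for the".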

Shiny Application

A Shiny application was developed to suggest the next word.

[Shiny App] - [https://ucsos6-elisa-villar.shinyapps.io/Next_Word_Predictor/]

Capture: