Date: 15-5-2020

Introduction

This presentation was created as part of the final project of the Capstone course of the Coursera Data Science Specialisation.

The goal of this project is to develop a model that predicts the next word of a sentence, together with a Shiny app that applies this model.

The project consists of:

  • Prediction model
  • Shiny app
  • Presentation

Data

This project uses a SwiftKey dataset, which consists of three text files:

  • Twitter.txt
  • Blogs.txt
  • News.txt

All three text files consist of multiple lines of text. The text was cleaned by removing punctuation, extra spaces and numbers, and by converting it to lowercase.

A subsample was taken from each file to reduce the size of the dataset. The three subsamples were then combined into one dataset, which is used for the prediction model.
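
The cleaning and sampling steps can be illustrated with a minimal sketch. It is written in Python purely for illustration (the project itself was built in R), and the function names, the 50% sampling fraction, the random seed and the example lines are all assumptions, not values taken from the project.

```python
import random
import re


def clean_line(line: str) -> str:
    """Lowercase a line and remove punctuation, numbers and extra spaces."""
    line = line.lower()
    # Keep only the letters a-z and whitespace; this also drops accented
    # characters, which is a simplification made for this sketch.
    line = re.sub(r"[^a-z\s]", "", line)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", line).strip()


def subsample(lines, fraction, seed=42):
    """Return a random subsample holding roughly `fraction` of the lines."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, int(len(lines) * fraction))
    return rng.sample(lines, k)


# Hypothetical stand-ins for the contents of the three text files.
twitter = ["Just landed!!! 2 hours late", "Good morning, world."]
blogs = ["My 3 favourite recipes:", "Why I started   blogging"]
news = ["Markets fell 5% on Monday.", "The mayor spoke at noon."]

combined = []
for source in (twitter, blogs, news):
    sample = subsample(source, fraction=0.5)  # assumed fraction
    combined.extend(clean_line(line) for line in sample)
```

Sampling before cleaning keeps the expensive text processing limited to the lines that are actually used.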

The prediction model

To predict the next word in a sentence, several types of n-grams were created:

  • Bigrams
  • Trigrams
  • Quadgrams

These n-grams were converted to frequency tables that record how often each word combination occurs. This was done for combinations of two words (bigrams), three words (trigrams) and four words (quadgrams).

These tables were saved as R files so that the app can use the different n-grams.
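
As an illustration of what such a frequency table contains, the sketch below counts n-grams in a tiny example corpus. It is written in Python rather than R, and the helper name ngram_counts and the three example sentences are assumptions made for this sketch only.

```python
from collections import Counter


def ngram_counts(lines, n):
    """Count every run of n consecutive words in the given cleaned lines."""
    counts = Counter()
    for line in lines:
        words = line.split()
        # Slide a window of n words over the line.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Hypothetical cleaned corpus.
corpus = ["i love new york", "i love new shoes", "new york is big"]

bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)
quadgrams = ngram_counts(corpus, 4)

# The three most frequent bigrams with their counts.
print(bigrams.most_common(3))
```

Each table maps a word combination to the number of times it occurs, so the most frequent combinations can be looked up directly.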

The app

The final step of this project is applying the prediction model in an app. In the app, a user can enter a sentence in an input box. The sentence can consist of either one word or multiple words. The sentence is then passed to the prediction algorithm, which outputs the predicted next word. This predicted word is then displayed in the app.
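
The presentation does not spell out how the three n-gram tables are combined. One plausible approach, shown here only as an assumption, is a simple back-off: look up the last three words in the quadgram table, fall back to the trigram table, and finally to the bigram table. The sketch is in Python for illustration (the app itself is a Shiny app in R), and build_table, predict_next and the example corpus are hypothetical.

```python
from collections import Counter


def build_table(lines, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = {}
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            next_word = words[i + n - 1]
            table.setdefault(context, Counter())[next_word] += 1
    return table


def predict_next(sentence, tables):
    """Return the most frequent next word, trying the longest context first.

    `tables` maps n to the table built for that n. Returns None when no
    table contains a matching context.
    """
    words = sentence.lower().split()
    for n in sorted(tables, reverse=True):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # sentence is too short for this n-gram size
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None


# Hypothetical cleaned corpus.
corpus = [
    "i love new york",
    "i love new shoes",
    "i love new york city",
    "new york is big",
]
tables = {n: build_table(corpus, n) for n in (2, 3, 4)}

print(predict_next("I love new", tables))      # york (found in the quadgram table)
print(predict_next("we really love", tables))  # new (backs off to the bigram table)
```

Trying the longest context first favours the most specific evidence, while the fall-back to shorter contexts still gives a prediction when the user enters only one or two words.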