15-2-2020

Introduction

In this presentation, I present my final project for the Data Science Specialisation on Coursera. The specialisation was developed by Johns Hopkins University.

During the Capstone Project, I developed a model that predicts the next word in a sequence of words. The predictions are based on English Twitter, blog and news data provided by SwiftKey. SwiftKey is a virtual keyboard app that uses text analysis to predict the words a user is likely to type next.

Text Cleaning, Corpus Creation and Model Building

The model is built on text data from English news, Twitter and blog sources. Before building the model, the following steps are performed on the data:

  • Loading the data into R
  • Cleaning the data with the tm package, which includes the following:
    • Converting text to lowercase
    • Removing numbers
    • Removing stopwords
    • Removing punctuation
    • Removing extra whitespace
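
The cleaning steps above can be illustrated with a short sketch. The project itself uses R and the tm package; the Python below is only a stand-in to show what each step does to a line of text. The function name clean_text and the tiny stopword list are assumptions made for illustration and are not taken from the project.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; real stopword lists are much longer).
STOPWORDS = ["a", "an", "and", "the", "is", "of", "to", "in"]

# Matches any stopword as a whole word.
STOPWORD_PATTERN = re.compile(r"\b(?:" + "|".join(STOPWORDS) + r")\b")


def clean_text(text):
    """Apply the cleaning steps in the order listed above (hypothetical helper)."""
    text = text.lower()                                               # lowercase
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    text = STOPWORD_PATTERN.sub("", text)                             # remove stopwords
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(text.split())                                     # collapse whitespace


print(clean_text("The 2 cats, and a dog!"))
```

Whitespace is collapsed last because the earlier removals leave gaps between the remaining words.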

Model Building

The cleaned data is imported directly into a corpus, which is a collection of text documents. From the corpus, we build unigrams, bigrams and trigrams (sequences of one, two and three consecutive words). The n-grams serve as the basis for the model and are built using the RWeka package.
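
To make the idea of n-grams concrete, here is a minimal sketch of how they can be extracted and counted. It is written in Python rather than R and does not reproduce the RWeka tokenizer; the function names ngrams and count_ngrams are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(documents, n):
    """Count how often each n-gram occurs across a list of cleaned documents."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(doc.split(), n))
    return counts


docs = ["i love data", "i love r"]
print(count_ngrams(docs, 2).most_common(1))
```

A document shorter than n words simply contributes no n-grams, so short tweets do not need special handling.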

The Application

Link to the application: https://pascalspijkerman12.shinyapps.io/Predicting/

The application is based on the trigrams and bigrams. You can enter a sequence of words as input, and the model will predict the word that is most likely to follow.
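
One plausible way to combine trigrams and bigrams is a simple back-off lookup: try the last two words against the trigrams first, and fall back to the last word and the bigrams if nothing is found. The sketch below assumes that strategy; the source does not state exactly how the application combines the two, and the Python code and the names build_table and predict_next are illustrative only.

```python
from collections import Counter, defaultdict


def build_table(documents, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for doc in documents:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            table[context][tokens[i + n - 1]] += 1
    return table


def predict_next(words, trigram_table, bigram_table):
    """Return the most frequent follower, backing off from trigrams to bigrams."""
    tokens = words.lower().split()
    if len(tokens) >= 2:
        followers = trigram_table.get(tuple(tokens[-2:]))
        if followers:
            return followers.most_common(1)[0][0]
    if tokens:
        followers = bigram_table.get(tuple(tokens[-1:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None  # no matching context in either table


docs = ["thank you very much", "thank you very much",
        "thank you for coming", "see you soon"]
trigrams = build_table(docs, 3)
bigrams = build_table(docs, 2)
print(predict_next("thank you", trigrams, bigrams))
```

In this toy corpus, "thank you" is followed by "very" twice and "for" once, so the trigram lookup returns "very". An input such as "miss you" has no trigram match and falls back to the bigram context "you".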

Thank you very much for using my application. Please feel free to provide me with any feedback.