Data Science Capstone Project (Coursera)

Pascal Grabbe
28.08.2018

Background

This project is part of the Coursera Data Science Specialization created by Johns Hopkins University. The task, set in cooperation with the company SwiftKey, was to develop an algorithm that predicts the next word.

This Slidify presentation, together with the app hosted on shinyapps.io, forms the Capstone Project for this course.

The goal of this project is to build an app that predicts the next word based on the words typed so far.

Methods used

The data consisted of three sets of English text samples from blogs, news and Twitter. To make them processable, a small random sample was taken, cleaned and tokenized into so-called n-grams. A key package for processing the data was quanteda, which turned out to be faster than similar packages.
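The cleaning and tokenizing step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the quanteda code used in the project; the function names `tokenize` and `ngram_counts` and the sample sentences are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only words (letters, optional apostrophe).

    Numbers and punctuation are dropped as a simple form of cleaning.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngram_counts(texts, n):
    """Count how often each n-gram (a tuple of n consecutive words) occurs."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample standing in for the blog, news and Twitter texts.
sample = ["I love data science.", "I love R, and I love Shiny!"]
bigrams = ngram_counts(sample, 2)
print(bigrams.most_common(1))
```

Counting n-grams once, ahead of time, is what keeps the later lookup cheap: the prediction step only has to read from these tables.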

The stupid backoff algorithm was chosen for this task. It uses n-gram counts to predict the next word: if the longest available context has not been seen, it falls back to a shorter one. To keep the app fast and usable, it predicts words based on at most the two previous words.
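To illustrate the idea, here is a minimal Python sketch of stupid backoff limited to two words of context, as described above. It is not the app's actual R implementation. The backoff factor of 0.4 is the value commonly used with this method and is an assumption here, as are the helper names `build_counts`, `score` and `predict` and the toy token sequence.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the app's actual value is not stated
MAX_N = 3    # trigrams at most, i.e. two previous words of context


def build_counts(tokens):
    """Return {n: Counter of n-gram tuples} for n = 1..MAX_N."""
    counts = {n: Counter() for n in range(1, MAX_N + 1)}
    for n in range(1, MAX_N + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts):
    """Stupid backoff score of `word` following `context` (not a probability)."""
    context = tuple(context)[-(MAX_N - 1):]  # keep the last two words only
    factor = 1.0
    while context:
        ngram = context + (word,)
        seen = counts[len(ngram)][ngram]
        if seen > 0:
            # Relative frequency of the n-gram given its context.
            return factor * seen / counts[len(context)][context]
        context = context[1:]  # drop the oldest word and back off
        factor *= ALPHA
    total = sum(counts[1].values())
    return factor * counts[1][(word,)] / total


def predict(context, counts, top=3):
    """Rank all known words by score, highest first (ties alphabetically)."""
    vocab = [w for (w,) in counts[1]]
    return sorted(vocab, key=lambda w: (-score(w, context, counts), w))[:top]


# Hypothetical toy corpus, already cleaned and tokenized.
tokens = "i love data science and i love shiny apps and i like data".split()
counts = build_counts(tokens)
print(predict(["i", "love"], counts, top=2))  # ['data', 'shiny']
```

Because the scores are simple ratios of precomputed counts, with a fixed penalty for each backoff step, no smoothing or normalization is needed, which suits an interactive app.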

The Shiny Web App

After the algorithm was developed, the challenge was to make the prediction model work in a user-friendly app that is accessible online. In this app the user can type in a sequence of words (English only) and the app, based on the model working in the background, creates a list of possible next words ranked from most to least likely. To visualise the prediction result, a wordcloud is shown on the right side of the app.

Things to note:

  • The app shows an error on startup. Ignore it and just type some words
  • The wordcloud only appears if the number of possible words is smaller than the maximum number of words set with the slider
  • Numbers and punctuation in the input are not yet filtered out by the app. This should be improved in a future version

On the next page you will see a screenshot of the final version, with a basic prediction for the word “I” visualised in the wordcloud.

[Screenshot of the final app]