Pascal Grabbe
28.08.2018
This project is part of the Coursera Data Science Specialization offered by Johns Hopkins University. In cooperation with the company SwiftKey, the task was to develop an algorithm that predicts the next word.
This Slidify presentation, together with the app hosted on shinyapps.io, forms the Capstone Project for this course.
The goal of this project is to build an app that predicts the next word based on the words typed so far.
Three data sets were provided, containing English text samples from blogs, news and Twitter. To make them manageable, a small random sample was taken, cleaned and tokenized into so-called n-grams. A key package for processing the data was quanteda, which turned out to be faster than similar packages.
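The sampling, cleaning and tokenizing steps can be illustrated with a short sketch. This is plain Python written for illustration only; the project itself used quanteda in R, and the function names, the cleaning rule and the sampling fraction below are all assumptions rather than the author's actual settings.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction, seed=42):
    """Keep each line with probability `fraction` (hypothetical sampling rule)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, max_n=3):
    """Count unigrams up to max_n-grams, one Counter per n-gram order."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts
```

Counting up to trigrams matches a model that looks at no more than two previous words: the two context words plus the predicted word form a trigram.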
The Stupid Backoff algorithm was chosen for this task. It uses n-gram frequencies to predict the next word, falling back to shorter n-grams when the longer context has not been seen. To keep the app fast and usable, predictions are based on at most the two preceding words.
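A minimal sketch of Stupid Backoff scoring with a two-word context might look as follows. It is written in plain Python for illustration and is not the app's actual code; the helper names are hypothetical, and the back-off factor of 0.4 is the commonly cited default, assumed here because the slides do not state which value was used.

```python
from collections import Counter

ALPHA = 0.4  # back-off penalty; commonly cited default, assumed here


def build_counts(sentences, max_n=3):
    """Count all n-grams up to max_n in a list of sentences (toy tokenizer)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts, total_unigrams):
    """Relative frequency of `word` after `context`, backing off if unseen."""
    context = tuple(context)
    penalty = 1.0
    while context:
        seen = counts.get(context + (word,), 0)
        if seen > 0:
            return penalty * seen / counts[context]
        context = context[1:]  # drop the oldest context word
        penalty *= ALPHA       # penalise each back-off step
    return penalty * counts.get((word,), 0) / total_unigrams


def predict(text, counts, top_k=3):
    """Rank every known word by its score given the last two typed words."""
    context = text.lower().split()[-2:]
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    total = sum(counts[(w,)] for w in vocab)
    ranked = sorted(vocab, key=lambda w: (-score(w, context, counts, total), w))
    return ranked[:top_k]
```

The scores are relative frequencies scaled by the penalty, not normalised probabilities, which is what keeps the method cheap enough for an interactive app. Scoring the whole vocabulary as done here is only practical for a toy corpus; a real app would look up candidate continuations directly.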
After the algorithm was developed, the challenge was to make the prediction model work in a user-friendly app that is accessible online. In this app the user can type in a sequence of words (English only), and the app will use the model running in the background to produce a list of candidate words, ordered from most to least likely. To visualise the prediction result, a word cloud is shown on the right side of the app.
Things to note:
On the next page you will see a screenshot of the final version and a basic prediction for the word “I”, visualised as a word cloud.