Coursera Data Science Capstone Project

Arkadiusz Oliwa
12 December 2018

This application is the capstone project for the Coursera Data Science Specialization, taught by professors at Johns Hopkins University.

Project goals

The main goal of this capstone project is to build a Shiny application that predicts the next word a user is likely to type.

The work was divided into several tasks, including data cleansing, exploratory analysis, and the creation of a predictive model.

How does it work?

The general idea is to look at each pair (or triple, set of four, etc.) of words that occur next to each other. In a large corpus, you’re likely to see ‘the red’ and ‘red apple’ several times, but less likely to see ‘apple red’ and ‘red the’. These differences in frequency can be used to predict the next word as the user types.

These co-occurring words are known as ‘n-grams’, where ‘n’ is the number of words in each sequence.
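
To make the idea concrete, here is a minimal sketch of counting n-grams. It is written in Python purely for illustration; the application itself was built in R, and the toy sentence and the `ngrams` helper below are made up for this example rather than taken from the project.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus (hypothetical; not the project's data).
text = "the red apple fell and the red apple rolled under the red car"
tokens = text.lower().split()

# Count how often each pair and each triple of adjacent words occurs.
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts[("the", "red")])    # 3
print(bigram_counts[("red", "apple")])  # 2
print(bigram_counts[("apple", "red")])  # 0 (never seen in this order)
```

Even in this tiny corpus, ‘the red’ and ‘red apple’ are counted several times while ‘apple red’ never appears, which is the signal a next-word predictor can exploit.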

The n-grams and all other text mining were produced with a variety of R packages, such as tm and quanteda.

Description of application

The Shiny application provides an input text box where the user enters a partial sentence or phrase; the application then predicts the next word.

The instructions are simple:

Enter a sentence in the input text box.
The most probable word will appear below.
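
The two steps above can be sketched as a lookup in a table of n-gram counts. This is only an illustration in Python under stated assumptions: the text does not describe the model the application actually uses, so the `build_model` and `predict_next` functions, the fallback from a two-word context to a one-word context, and the toy sentences are all hypothetical.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=3):
    """Map each context of 1..max_n-1 words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, phrase, max_n=3):
    """Return the most frequent follower of the longest known context, or None."""
    tokens = phrase.lower().split()
    # Try the longest context first, then fall back to shorter ones (an assumption).
    for size in range(max_n - 1, 0, -1):
        if len(tokens) < size:
            continue
        context = tuple(tokens[-size:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None

# Toy training sentences (hypothetical; not the project's corpus).
sentences = [
    "i would like a cup of tea",
    "i would like a cup of coffee",
    "a cup of tea please",
    "she would like some tea",
]
model = build_model(sentences)

print(predict_next(model, "may i have a cup of"))  # tea
print(predict_next(model, "they would"))           # like
print(predict_next(model, "xyzzy"))                # None
```

The second call shows the fallback: ‘they would’ was never seen as a pair, so the lookup drops to the single word ‘would’ and returns its most frequent follower.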

References

The application is hosted here:
https://aroliwa.shinyapps.io/DataScienceCapstone/
The source code of this application is available on GitHub:
https://github.com/aroliwa/DataScienceCapstone
Information about the Coursera Data Science Specialization:
https://www.coursera.org/specializations/jhu-data-science