Next Word

Angelo Klin
August/2015

Specialisation: Data Science
Course: SwiftKey Capstone Project
Education Institution: Johns Hopkins
Publisher: Coursera

Overview

The goal of the SwiftKey Capstone Project, part of Coursera's Data Science Specialisation, is to expose students to a real-life problem in which the overall scope is known but little more than a source dataset is given.

One of its purposes is to push the student not only to understand the problem but also to search for the best way to approach it, weigh the alternatives, and even build something customised to solve it.

Data

The original dataset was provided and comes from blogs, news and a Twitter feed.

After an initial cleanup, the number of n-grams produced from each source is shown in the table below.

N-gram size     Blogs      News   Twitter
1-grams        145764    141822    138536
2-grams       1766213   1828608   1335629
3-grams       2862731   2833636   1893462
4-grams       1898999   1931072   1258080
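
As a rough illustration of how counts like these can be produced, the sketch below cleans a text, tokenises it and counts the n-grams of sizes 1 to 4. It is written in Python for brevity and is not the project's R code. The cleaning rule (lower-casing, keeping only letters and apostrophes), the function names and the sample sentence are all assumptions, and the sketch assumes the table reports distinct n-grams.

```python
import re
from collections import Counter


def clean(text):
    """Lower-case the text and keep only letters, apostrophes and spaces.

    This cleaning rule is an assumption made for the example.
    """
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=4):
    """Map each n-gram size to a Counter of the n-grams found in the text."""
    tokens = clean(text).split()
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}


# Hypothetical sample text; n-grams here are allowed to cross sentence ends.
sample = "The cat sat on the mat. The cat ran!"
counts = ngram_counts(sample)
distinct = {n: len(c) for n, c in counts.items()}
print(distinct)  # {1: 6, 2: 7, 3: 7, 4: 6}
```

Keeping the full `Counter` per n-gram size, rather than only the number of distinct n-grams, also gives the frequency distribution needed later to rank candidate words.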

Application

  • A simple application that suggests likely next words for a given text
    • Input is a text in English
    • Built with RStudio's Shiny
  • R packages used
    • stringi
    • quanteda

Work-flow

  • Pre-Processing
    • Cleaning
    • Creation of a Corpus
    • Tokenisation
    • Frequency Distribution
  • Processing
    • Collect some user text
    • Clean the input text
    • Define previous words
    • Find Candidates
    • Calculate probabilities
    • Present the most likely candidates
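
The processing steps above can be sketched as follows. This is a minimal Python illustration, not the application's R code. It assumes that a candidate's probability is its relative frequency after the given previous words, and that the lookup backs off to a shorter context when the longer one was never seen. The slides do not say which scoring method the application uses. The names `tokenise`, `build_model` and `predict`, and the toy corpus, are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenise(text):
    """Clean the text (lower-case, letters and apostrophes only) and split it."""
    return re.sub(r"[^a-z' ]+", " ", text.lower()).split()


def build_model(corpus, max_n=4):
    """Map each context (a tuple of 0 to max_n - 1 words) to next-word counts."""
    model = defaultdict(Counter)
    for line in corpus:
        tokens = tokenise(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model


def predict(model, text, top=3, max_n=4):
    """Return up to `top` (word, probability) pairs for the next word."""
    tokens = tokenise(text)
    # Try the longest context first, then back off to shorter ones.
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:])  # the previous k words
        candidates = model.get(context)
        if candidates:
            total = sum(candidates.values())
            return [(w, c / total) for w, c in candidates.most_common(top)]
    return []


# Hypothetical toy corpus standing in for the blogs, news and Twitter data.
corpus = [
    "I like green tea",
    "I like green tea too",
    "I like green apples",
    "We like black tea",
]
model = build_model(corpus)
print(predict(model, "I like green"))      # 'tea' ranks above 'apples'
print(predict(model, "They drink black"))  # backs off to the context ('black',)
```

With an empty input the loop falls through to the empty context, so the sketch simply returns the most frequent single words in the corpus.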