Data Science Specialization Capstone Project

Yan Feng
July 27, 2017

This presentation describes the capstone project for the Data Science Specialization.

Overview

  • Background

People now spend a great deal of time typing on smartphones. With such a small keyboard, however, typing can become painful after a long chat. SwiftKey has developed an application that eases this burden: instead of typing the next word, the user only needs to pick it from the words the application predicts.

  • Goals

The goal is to build an application that predicts the next word from the text the user has already typed.

  • Methodology

The project uses fourgram, trigram, bigram, and unigram models to predict the next word.

  1. Three sources (blogs, news, and twitter) form the basis of the training set. They are cleaned with the “tm” package and used to build the fourgram, trigram, bigram, and unigram data frames.

  2. To improve efficiency, only a small number of lines are sampled.

  3. A shiny app is built that uses the previous 3, 2, or 1 words to predict the next word.
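
The three steps above can be sketched as follows. This is a minimal Python illustration, not the app's R code: the sample lines, the sampling fraction, and the function names (`sample_lines`, `tokenize`, `build_ngram_counts`) are assumptions made for the example, and a simple regular-expression tokenizer stands in for the “tm” cleaning step.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction, seed=42):
    """Randomly keep roughly `fraction` of the lines (at least one if any exist)."""
    if not lines:
        return []
    rng = random.Random(seed)
    k = max(1, int(len(lines) * fraction))
    return rng.sample(lines, k)


def tokenize(line):
    """Lowercase a line and keep only runs of letters as words."""
    return re.findall(r"[a-z]+", line.lower())


def build_ngram_counts(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Hypothetical stand-ins for lines drawn from the blogs, news, and twitter files.
corpus = [
    "Thanks for the follow",
    "Thanks for the update on the project",
    "I look forward to the weekend",
    "Looking forward to the weekend with friends",
]

# fraction=1.0 keeps every line here; a real run would sample a small fraction.
sampled = sample_lines(corpus, fraction=1.0)
tables = {n: build_ngram_counts(sampled, n) for n in (1, 2, 3, 4)}

print(tables[1].most_common(3))
print(tables[3][("forward", "to", "the")])
```

Each table maps a tuple of n words to the number of times it occurs, which is the information the fourgram, trigram, bigram, and unigram data frames need to hold.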

Algorithm

  • This shiny app uses an N-gram model to predict the next word. Modeling text as a Markov chain, the next word depends on the words that precede it.

  • To simplify the problem, the app considers only the previous 3, 2, or 1 words.

  • The training set is built by sampling from blogs, news, and twitter.

  • A user may type one or more words, a phrase, or even an unfinished sentence. The app first cleans the input: it removes links, “@” account names, profanity, numbers, and punctuation, converts everything to lower case, and converts the result to plain text.

  • The app uses a back-off model to predict. It first tries the previous 3 words (fourgram). If that combination was not observed in the training set, it backs off to the previous 2 words (trigram), and it keeps backing off until it reaches the unigram.

  • If the word typed by the user is not in the unigram training set, or the user types nothing, the app returns the most likely words from the unigram.
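
The cleaning and back-off steps described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's R code: the tiny tables in `NGRAMS` and `UNIGRAMS`, the placeholder `PROFANITY` set, the function names, and the order of the cleaning steps are all invented for the example.

```python
import re

# Hypothetical tables: for each context length, a context tuple maps to
# {next_word: count}. Real tables would come from the sampled training text.
NGRAMS = {
    3: {("thanks", "for", "the"): {"follow": 3, "update": 1}},
    2: {("for", "the"): {"first": 5, "follow": 3}},
    1: {("the",): {"same": 9, "first": 7}},
}
UNIGRAMS = {"the": 50, "to": 40, "and": 30}
PROFANITY = {"darn"}  # placeholder, not the app's actual word list
LABELS = {3: "fourgram", 2: "trigram", 1: "bigram"}


def clean(text):
    """Remove links, @accounts, digits, punctuation, and profanity; lowercase."""
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return [w for w in text.split() if w not in PROFANITY]


def top(counts, k):
    """Return the k most frequent words, most frequent first."""
    return sorted(counts, key=counts.get, reverse=True)[:k]


def predict(text, k=1):
    """Try the last 3 words, then 2, then 1; fall back to plain unigrams."""
    words = clean(text)
    for n in (3, 2, 1):
        if len(words) >= n:
            followers = NGRAMS[n].get(tuple(words[-n:]))
            if followers:
                return top(followers, k), LABELS[n]
    return top(UNIGRAMS, k), "unigram"


print(predict("Thanks for the"))
print(predict("Waiting for the", k=2))
print(predict(""))
```

The function returns both the predicted words and the name of the model that produced them, so a caller can report which level of the back-off was used.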

Experience with the "predict_the_next_word" app

  1. The “predict_the_next_word” app is deployed on shinyapps.io.

  2. The app's interface consists of 2 panels.

  3. The left panel has 3 parts: an input text box where the user can type anything; a number slider for choosing how many predictions to show (1 by default); and a submit button labeled “predict”, which triggers the prediction when clicked.

  4. The right panel also has 3 parts. The first is an output text box showing the word or words predicted by the app, depending on the number the user chose. The second shows which model the app used to make the prediction: fourgram, trigram, bigram, and/or unigram. The third is a word cloud of the predicted words, based on the probability of each word.
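
One way to obtain a probability for each predicted word, as the word cloud requires, is to normalize the candidate counts. The sketch below is a Python illustration only: it assumes the probability is each word's share of the counts among the returned candidates, which the presentation does not specify, and the function name and counts are hypothetical.

```python
def relative_probabilities(counts):
    """Turn {word: count} into {word: share of the total count}."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Hypothetical counts for the words that followed the user's last phrase.
candidates = {"follow": 3, "update": 1}
weights = relative_probabilities(candidates)
print(weights)
```

A word cloud could then scale each word by its weight, so more probable words appear larger.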

References