Data Science Capstone Project Slide

Qi Shao
April 25, 2015

Overview

The Data Science Capstone project, part of the Data Science Specialization and run in partnership with SwiftKey, applies natural language processing to build a next-word prediction data product.

Data Acquisition and Cleaning

HC Corpora Data Set.

Stanford Tokenizer

  • Mimics Penn Treebank 3 (PTB) tokenization.
  • Mainly targets formal English writing rather than SMS-speak.

Clean Tokens

  • Remove non-English tokens
  • Remove numbers and punctuation
  • Convert to lowercase
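
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_tokens` is hypothetical, "non-English" is approximated as "contains anything other than ASCII letters", and PTB-style clitics such as `n't` and `'s` are kept on the assumption that they carry meaning for prediction.

```python
import re

# Assumption: an "English" token is ASCII letters only, optionally with a
# leading or internal apostrophe so PTB clitics like 's and n't survive.
_WORD = re.compile(r"'?[a-z]+(?:'[a-z]+)?")


def clean_tokens(tokens):
    """Lowercase tokens and drop numbers, punctuation, and non-ASCII words."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        # fullmatch rejects tokens with digits, punctuation, or accented letters
        if _WORD.fullmatch(tok):
            cleaned.append(tok)
    return cleaned
```

For example, `clean_tokens(["Hello", ",", "42", "world"])` keeps only the two word tokens, lowercased.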

Exploratory Analysis

Modeling and Prediction

  • Katz back-off model

    A generative n-gram language model that estimates the conditional probability of a word given its history, backing off to a shorter history when the full n-gram has not been observed.
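
The back-off idea can be illustrated with a small bigram sketch. It makes several simplifying assumptions that are not from the slides: the class name `BackoffBigram` is hypothetical, only bigrams and unigrams are used, and a fixed absolute discount stands in for the Good-Turing discounting of the original Katz formulation.

```python
from collections import Counter, defaultdict


class BackoffBigram:
    """Bigram model that backs off to unigrams for unseen word pairs."""

    def __init__(self, tokens, discount=0.5):
        self.discount = discount  # assumption: fixed discount, not Good-Turing
        self.unigrams = Counter(tokens)
        self.total = len(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.context = Counter()            # how often each word starts a bigram
        self.followers = defaultdict(set)   # words observed after each word
        for (prev, word), count in self.bigrams.items():
            self.context[prev] += count
            self.followers[prev].add(word)

    def unigram_prob(self, word):
        return self.unigrams[word] / self.total

    def prob(self, prev, word):
        """Estimate P(word | prev)."""
        ctx = self.context[prev]
        if ctx == 0:
            # History never seen as a context: use the unigram estimate.
            return self.unigram_prob(word)
        if (prev, word) in self.bigrams:
            # Seen bigram: discounted relative frequency.
            return (self.bigrams[(prev, word)] - self.discount) / ctx
        # Unseen bigram: share the discounted mass among unseen followers,
        # in proportion to their unigram probabilities.
        seen = self.followers[prev]
        left_over = self.discount * len(seen) / ctx
        unseen_mass = 1.0 - sum(self.unigram_prob(w) for w in seen)
        if unseen_mass <= 0:
            return 0.0
        return left_over * self.unigram_prob(word) / unseen_mass

    def predict(self, prev):
        """Return the most probable next word after prev."""
        return max(self.unigrams, key=lambda w: self.prob(prev, w))
```

Training on a token list and calling `predict("the")` returns the highest-probability follower of "the". For any history, the probabilities over the training vocabulary sum to one, which is the property the back-off weight is meant to preserve.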