Coursera Johns Hopkins Data Science Capstone Project

Mr. Jim
Aug 2018

Overview

  • The project asks for a word prediction app
  • The project offers the Blogs, News, Twitter corpus
    • The writing activity matters
  • The project guidelines are vague on the kind of writing to predict, Twitter or News
  • The project guidelines give latitude to explore other data sources
  • For the purpose of the project the model is generated from
    • Corpus of Contemporary American English (COCA)

Corpus Consideration

  • Project guidelines call out the Blogs, News, Twitter (BNT) corpus
    • Study indicates Bag of Words model performance is linked to the corpus
    • Twitter text differs greatly from News copy
    • Using a combined BNT corpus seems a bad compromise
  • Alternative: Corpus of Contemporary American English (COCA)
    • Register, then download the sampled, preprocessed COCA corpus
  • Note: BNT ngrams were generated and could be used

COCA Corpus

  • The COCA corpus: conditioned set of N-grams with profanity
    • 2 gram to 5 gram
    • Case insensitive
    • Case sensitive
    • Case sensitive with part-of-speech (PoS) content
  • The project uses the case insensitive 2 gram to 5 gram set
  • Most native speakers of American English have a vocabulary of ~40K words
    • 40K words is the target size of the working app vocabulary
    • Why predict words the user does not know?
  • The actual vocabulary in the model is ~25K words

The Model and Runtime

  • The model searches a conditioned list of ngrams from 5 gram to 2 gram
  • Preprocessing ensures
    • Lower case, no profanity, no punctuation, no digits
  • The evaluation result is a hierarchical list of all matches
  • Search runs from long grams to short grams, and results are presented in that order
  • If no match is found with the input sequence
    • 3 grams and 2 grams do a 'Context' or unordered search
    • Context search is based on a random sample of the existing input
  • If there is still no match, the app provides a random sample of the vocabulary

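The search order described above can be sketched roughly as follows. This is one interpretation under stated assumptions, not the app's source: the function `predict`, the model layout (a dict from n to a prefix-to-next-words table), and the exact form of the 'Context' search (sample n-1 input words and match them in any order) are all hypothetical.

```python
import random

def predict(words, model, vocabulary, rng=None):
    """Return (kind, results), where results is a list of (n, next_words) pairs.

    model: {n: {prefix tuple of n-1 words: [next words]}} for n in 2..5 (assumed layout).
    """
    rng = rng or random.Random()
    results = []

    # Ordered search: longest n-grams first, then shorter ones; keep every match.
    for n in range(5, 1, -1):
        prefix = tuple(words[-(n - 1):])
        if len(prefix) == n - 1 and prefix in model.get(n, {}):
            results.append((n, model[n][prefix]))
    if results:
        return "ordered", results

    # Context search: sample n-1 words from the input and ignore their order.
    for n in (3, 2):
        if len(words) < n - 1:
            continue
        sample = sorted(rng.sample(words, n - 1))
        for prefix, next_words in model.get(n, {}).items():
            if sorted(prefix) == sample:
                results.append((n, next_words))
    if results:
        return "context", results

    # Last resort: a random sample of the vocabulary.
    k = min(3, len(vocabulary))
    return "random", [(1, rng.sample(sorted(vocabulary), k))]

# Toy model for illustration only.
model = {3: {("thank", "you"): ["for", "very"]}, 2: {("you",): ["are"]}}
vocab = {"for", "very", "are", "you", "thank"}
kind, results = predict(["i", "thank", "you"], model, vocab)
```

Here the input ends in "thank you", so the 3-gram table matches first and the 2-gram match for "you" follows it, mirroring the long-to-short presentation of results.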
The Application

  • Side bar is for text entry and prediction options
  • Main panel tabs include
    • Runtime: to see evaluation results
    • Code: the actual source code and other files
    • About: description
  • Preprocessing ensures
    • Lower case, no profanity, no punctuation, no digits
  • Runtime
    • Input is preprocessed, evaluated against the model, and results are presented
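
The preprocessing step (lower case, no profanity, no punctuation, no digits) could look something like the sketch below. The function name `preprocess` and the two-word `PROFANITY` set are placeholders, since the source does not give the app's actual word list; replacing removed characters with a space is also an assumption.

```python
import re

# Placeholder list for illustration; the app's real profanity list is not given in the source.
PROFANITY = {"darn", "heck"}

def preprocess(text, profanity=PROFANITY):
    """Lower-case the text, strip punctuation and digits, and drop profane words."""
    text = text.lower()
    # Replace every character that is not a-z or whitespace with a space.
    # This removes punctuation and digits; it also splits contractions ("don't" -> "don", "t").
    text = re.sub(r"[^a-z\s]", " ", text)
    return [word for word in text.split() if word not in profanity]

tokens = preprocess("Hello, World 2018! Darn it")
```

The resulting token list is what would then be evaluated against the n-gram model.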