Coursera Johns Hopkins Data Science Capstone Project
Mr. Jim
Aug 2018
Overview
- The project requests Word Prediction
- The project offers a Blogs, News, Twitter corpus
- The project guideline is vague on the kind of writing, Twitter or News
- The project guidelines give latitude to explore other data sources
- For the purpose of the project the model is generated from
- Corpus of Contemporary American English (COCA)
Corpus Consideration
- Project guidelines call out the Blog, News, Twitter (BNT) corpus
- Study indicates Bag of Words model performance is linked to the corpus
- Twitter is a lot different than News copy
- Using a combined BNT corpus seems a bad compromise
- Alternative: Corpus of Contemporary American English (COCA)
- Register then download sampled preprocessed COCA corpus
- Note: BNT ngrams were generated and could be used
COCA Corpus
- The COCA corpus: conditioned set of N-grams with profanity
- 2 gram to 5 gram
- Case insensitive
- Case sensitive
- Case sensitive with PoS content
- The project uses case insensitive 2 gram to 5 gram
- Most native speakers of American English have a vocabulary of ~40K words
- 40K words is the target size of the working app vocabulary
- Why predict words the user does not know?
- The actual vocabulary in the model is ~25K words
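
A minimal sketch of how a vocabulary cap like the 40K target could be applied to an n-gram table. The slides do not say how the cap was enforced, so the function name `limit_vocabulary`, the count-based ranking, and the rule of dropping any n-gram with an out-of-vocabulary word are all assumptions made for illustration.

```python
from collections import Counter


def limit_vocabulary(ngram_counts, max_words=40_000):
    """Keep the most frequent words and the n-grams built only from them.

    ngram_counts maps a tuple of words to its frequency (hypothetical layout).
    """
    # Estimate word frequency from the n-gram counts themselves.
    word_freq = Counter()
    for ngram, count in ngram_counts.items():
        for word in ngram:
            word_freq[word] += count

    # Assumed rule: the vocabulary is the top max_words words by frequency.
    keep = {word for word, _ in word_freq.most_common(max_words)}

    # Drop every n-gram that contains a word outside the vocabulary.
    kept = {ng: c for ng, c in ngram_counts.items()
            if all(word in keep for word in ng)}
    return keep, kept
```

With a cap of this kind the resulting vocabulary can end up smaller than the target, since the cap is only an upper bound on the number of words kept.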
The Model and Runtime
- The model searches a conditioned list of ngrams from 5 gram to 2 gram
- Preprocessing ensures
- Lower case, no profanity, no punctuation, no digits
- The evaluation result is a hierarchical list of all matches
- Search is long grams then short grams, results presented that way
- If no match is found with the input sequence
- 3 grams and 2 grams do 'Context' or unordered search
- Context search is based on random sample of existing input
- Still no match, app provides a random sample of the vocabulary
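
The search order described on this slide can be sketched roughly as follows. This is an illustration, not the app's actual code: the table layout (`ngram_tables[n]` mapping a context tuple to a list of next words), the function name `predict`, and the reading of 'Context' search as "sample words from the input and try them in any order" are assumptions.

```python
import itertools
import random


def predict(tokens, ngram_tables, vocabulary, k=5, rng=None):
    """Return candidate next words, longest-gram matches first."""
    rng = rng or random.Random()
    results = []

    def collect(n, context):
        # Append unseen next words stored under this context in the n-gram table.
        for word in ngram_tables.get(n, {}).get(tuple(context), []):
            if word not in results:
                results.append(word)

    # Ordered search: 5 gram down to 2 gram, using the last n-1 input words.
    for n in range(5, 1, -1):
        if len(tokens) >= n - 1:
            collect(n, tokens[-(n - 1):])
    if results:
        return results

    # Assumed 'Context' search: 3 grams and 2 grams on a random sample
    # of the input words, tried in every order.
    for n in (3, 2):
        if len(tokens) >= n - 1:
            sampled = rng.sample(tokens, n - 1)
            for context in itertools.permutations(sampled):
                collect(n, context)
    if results:
        return results

    # Last resort: a random sample of the vocabulary.
    words = sorted(vocabulary)
    return rng.sample(words, min(k, len(words)))
```

Because long grams are searched first, the returned list keeps the hierarchical order in which results are presented.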
The Application
- Side bar is for text entry and prediction options
- Main panel tabs include
- Runtime: to see evaluation results
- Code: the actual source code and other files
- About: description
- Preprocessing ensures
- Lower case, no profanity, no punctuation, no digits
- Runtime
- Input is preprocessed, evaluated against the model, results presented
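
The preprocessing step listed above (lower case, no profanity, no punctuation, no digits) might look something like this. The function name `preprocess`, the placeholder banned-word set, and the choice to delete apostrophes rather than split on them are assumptions; the sketch also keeps only ASCII letters, which is a simplification.

```python
import re

# Placeholder list; the app's real profanity list is not shown in the slides.
BANNED_WORDS = {"darn"}


def preprocess(text, banned=BANNED_WORDS):
    """Return a list of clean lower-case tokens ready for lookup."""
    text = text.lower()
    # Delete apostrophes so contractions stay one token ("it's" -> "its").
    text = text.replace("'", "")
    # Replace punctuation, digits, and any other non-letter with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Split on whitespace and drop banned words.
    return [word for word in text.split() if word not in banned]
```

The resulting token list is what would then be evaluated against the model.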