Coursera Johns Hopkins Data Science Capstone Project

Mr. Jim
Aug 2018

The project asks for Word Prediction
The project supplies a Blogs, News, Twitter corpus
- The kind of writing activity matters
The project guidelines are vague on the kind of writing, Twitter or News
The project guidelines give latitude to explore other data sources
For the purpose of the project the model is generated from
- Corpus of Contemporary American English (COCA)

Project guidelines call out the Blogs, News, Twitter (BNT) corpus
- Study indicates Bag of Words model performance is linked to the corpus
- Twitter text is very different from News copy
- Using a combined BNT corpus seems a bad compromise
Alternative: Corpus of Contemporary American English (COCA)
- Register, then download the sampled, preprocessed COCA corpus
Note: BNT ngrams were generated and could be used

The COCA corpus: a conditioned set of N-grams, profanity included
- 2 gram to 5 gram
- Case insensitive
- Case sensitive
- Case sensitive with PoS content
The project uses case insensitive 2 gram to 5 gram
Most native speakers of American English have a vocabulary of ~40K words
- 40K words is the target size of the working app vocabulary
- Why predict words the user does not know?
The actual vocabulary in the model is ~25K words

The model searches a conditioned list of ngrams from 5 gram to 2 gram
Preprocessing ensures
- Lower case, no profanity, no punctuation, no digits
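A minimal sketch of the preprocessing step above, in Python for illustration only. The function name `preprocess`, the placeholder `PROFANITY` set, and the choice to drop apostrophes (so "it's" becomes "its") are assumptions, not taken from the project source.

```python
import re

# Hypothetical placeholder list; the project's real profanity list is not shown.
PROFANITY = {"darn", "heck"}

def preprocess(text, profanity=PROFANITY):
    """Return lower-case tokens with no profanity, punctuation, or digits."""
    text = text.lower()
    # Assumption: apostrophes are dropped rather than split on ("it's" -> "its").
    text = text.replace("'", "")
    # Replace every remaining non-letter (digits, punctuation) with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Split on whitespace and filter out profane tokens.
    return [tok for tok in text.split() if tok not in profanity]
```

For example, `preprocess("Hello, World! 123")` yields the tokens `hello` and `world`.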
The evaluation result is a hierarchical list of all matches
Search runs from long grams to short grams, and results are presented in that order
If no match is found with the input sequence
- 3 grams and 2 grams do a 'Context' or unordered search
- Context search is based on a random sample of the existing input
If there is still no match, the app provides a random sample of the vocabulary
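The search order described above can be sketched as follows, in Python for illustration only. This is one possible reading, not the app's actual code: the function `predict`, the `ngrams` lookup (a dict from a context tuple of n-1 words to a list of next words), the labels, and `fallback_size` are all hypothetical. The 'Context' search is interpreted here as sampling words from the input at random and trying them in any order.

```python
import itertools
import random

def predict(tokens, ngrams, vocabulary, max_n=5, fallback_size=5, rng=None):
    """Return (label, candidates) pairs, longest matching n-gram first.

    tokens: list of preprocessed input words.
    ngrams: dict mapping a context tuple (n-1 words) to next-word candidates.
    vocabulary: list of words used for the last-resort random sample.
    """
    rng = rng or random.Random()
    results = []

    # Ordered search: 5 gram down to 2 gram, keeping every match.
    for n in range(max_n, 1, -1):
        context = tuple(tokens[-(n - 1):])
        if len(context) == n - 1 and context in ngrams:
            results.append((f"{n}-gram", ngrams[context]))
    if results:
        return results

    # Context (unordered) search for 3 grams and 2 grams:
    # sample n-1 words from the input and try every ordering.
    for n in (3, 2):
        k = n - 1
        if len(tokens) < k:
            continue
        sample = rng.sample(tokens, k)
        for perm in itertools.permutations(sample):
            if perm in ngrams:
                results.append((f"{n}-gram context", ngrams[perm]))
                break
    if results:
        return results

    # Still no match: random sample of the vocabulary.
    size = min(fallback_size, len(vocabulary))
    return [("random", rng.sample(vocabulary, size))]
```

Returning every match, grouped by n-gram length, mirrors the hierarchical result list; a production version would likely also rank candidates by frequency within each group.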

Side bar is for text entry and prediction options
Main panel tabs include
- Runtime: the evaluation results
- Code: the actual source code and other files
- About: description
Runtime
- Input is preprocessed, evaluated against the model, and the results are presented