Capstone Project
Data Science Specialization
Nidhi Mavani
16th August 2015
The goal of the capstone project of the Data Science Specialization was to build a Shiny app that predicts the next word in a sentence. This problem falls under natural language processing.
The corpus used for this app was drawn from news, blogs, and Twitter (~550 MB).
The provided data was pre-processed to remove punctuation, numbers, extra whitespace, and profane words, so that none of these would be predicted. From the processed data, n-grams of length 1, 2, and 3 were generated using the kfNgram software.
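The pre-processing and n-gram steps can be illustrated with a minimal Python sketch. This is not the kfNgram workflow used for the project; it only shows the same idea on a toy sentence. The names `PROFANE_WORDS`, `preprocess`, and `count_ngrams` are hypothetical, and the word list is a placeholder because the actual profanity list is not given.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the real profanity list is not given in the source.
PROFANE_WORDS = {"darn", "heck"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, collapse whitespace, drop profane words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replaces punctuation and numbers with spaces
    tokens = text.split()                   # split() also collapses repeated whitespace
    return [t for t in tokens if t not in PROFANE_WORDS]

def count_ngrams(tokens, n):
    """Count all contiguous n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = preprocess("The 2 cats sat, and the cats slept!  Darn.")
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)
```

Storing each n-gram as a tuple key keeps the three tables in the same shape, which makes later lookups by context straightforward.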
| Source | Total(MB) | Training(MB) |
|---|---|---|
| Blogs | 200 | 119 |
| News | 196 | 116 |
| Twitter | 160 | 92 |
The total size of the RData file used for the app is about 100 MB, with all the n-grams stored in data tables. The table below shows the size of the n-grams before and after pruning.
| Ngram | Total Size(MB) | Final Size(MB) |
|---|---|---|
| Unigram | 200 | 2 |
| Bigram | 203 | 4 |
| Trigram | 740 | 58 |
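The pruning criterion is not stated here. One common approach, shown below purely as an assumption, is to drop n-grams that occur fewer than a minimum number of times. The function `prune` and the threshold `min_count` are hypothetical names chosen for illustration.

```python
from collections import Counter

def prune(ngram_counts, min_count=2):
    """Keep only n-grams seen at least min_count times (assumed threshold)."""
    return Counter({gram: c for gram, c in ngram_counts.items() if c >= min_count})

# Toy bigram counts; real tables would hold millions of entries.
bigrams = Counter({("the", "cats"): 2, ("cats", "sat"): 1, ("sat", "and"): 1})
pruned = prune(bigrams)
```

Because most n-grams in a large corpus occur only once, even a low cutoff like this can shrink a table considerably, at the cost of losing rare continuations.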
The algorithm used to build the model from the trigrams, bigrams, and unigrams is Stupid Backoff. It takes about 20 ms to return the top 5 most likely next words.
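Stupid Backoff scores each candidate next word by its relative frequency after the longest matching context, and backs off to a shorter context, with a fixed penalty, when the longer n-gram was not observed.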
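For reference, the standard Stupid Backoff score for a trigram model can be written as below, where $c(\cdot)$ is an n-gram count, $N$ is the total number of tokens, and $\alpha$ is the back-off factor. The value $\alpha = 0.4$ comes from the original Stupid Backoff paper; the value used in this app is not stated, so it is an assumption here.

$$
S(w_i \mid w_{i-2} w_{i-1}) =
\begin{cases}
\dfrac{c(w_{i-2} w_{i-1} w_i)}{c(w_{i-2} w_{i-1})} & \text{if } c(w_{i-2} w_{i-1} w_i) > 0 \\[2ex]
\alpha \, S(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \dfrac{c(w_i)}{N}
$$

The bigram score $S(w_i \mid w_{i-1})$ is defined the same way, backing off to $S(w_i)$. A minimal Python sketch of this scoring on a toy corpus follows; the names `score`, `predict_next`, and `ALPHA` are hypothetical, and this is an illustration rather than the app's actual code.

```python
from collections import Counter

ALPHA = 0.4  # back-off factor from the original paper; assumed, not confirmed for this app

def score(word, context, counts, total):
    """Stupid Backoff score of `word` given up to two preceding words."""
    context = tuple(context)[-2:]
    factor = 1.0
    while context:
        num = counts.get(context + (word,), 0)
        den = counts.get(context, 0)
        if num > 0 and den > 0:
            return factor * num / den
        context = context[1:]  # drop the oldest word and back off
        factor *= ALPHA
    return factor * counts.get((word,), 0) / total

def predict_next(context, counts, total, k=5):
    """Return the k highest-scoring (word, score) pairs for the next word."""
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    scored = [(w, score(w, context, counts, total)) for w in vocab]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:k]

# Toy corpus; unigram, bigram and trigram counts share one table keyed by tuple.
tokens = "the cats sat and the cats slept".split()
counts = Counter()
for n in (1, 2, 3):
    counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
total = len(tokens)

top = predict_next(["the", "cats"], counts, total, k=2)
```

The scores are not probabilities, since they do not sum to one. That is the trade-off the method makes: skipping normalization keeps each lookup cheap.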
GitHub repository
Data Science Specialization by Johns Hopkins University
Natural Language Processing by Stanford University on Coursera