The Data Science Capstone Project consists of designing and implementing a language model capable of predicting the next word in a sentence. For this we must put into practice all the skills acquired during the previous 9 courses, plus some additional skills that we must learn on our own along the way, mainly NLP techniques.
This problem has been introduced before and is also known as a variant of “The Shannon Game”, where we calculate the probability of a word given the preceding sequence of words.
In this report we show an initial exploratory analysis of the data provided to train and test our model, as well as an initial predictive model based on the Markov assumption and n-gram counts.
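
To make the idea concrete, here is a minimal sketch of a bigram model under the Markov assumption: the next word is taken to depend only on the previous one, so its probability is estimated as count(previous word, next word) / count(previous word). It is written in Python purely for illustration and is not the model built for this report. The function names `train_bigrams` and `predict_next` and the toy corpus are hypothetical, and tokenization is a plain lowercase whitespace split.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        tokens = sentence.lower().split()
        for prev_word, next_word in zip(tokens, tokens[1:]):
            counts[prev_word][next_word] += 1
    return counts


def predict_next(counts, prev_word, k=3):
    """Return up to k candidate next words with their relative frequencies."""
    followers = counts.get(prev_word.lower())
    if not followers:
        return []  # unseen context: a real model would back off or smooth here
    total = sum(followers.values())
    return [(word, n / total) for word, n in followers.most_common(k)]


# Hypothetical toy corpus, not taken from the project data.
toy_corpus = [
    "i love new york",
    "i love data science",
    "i like new ideas",
]
model = train_bigrams(toy_corpus)
print(predict_next(model, "i"))
```

In the toy corpus, "love" follows "i" twice and "like" once, so "love" is ranked first with an estimated probability of 2/3. Longer contexts (trigrams and beyond) follow the same counting pattern with a tuple of previous words as the key.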
Our dataset consists of 3 files with samples of text from Twitter, news websites and blogs.
Let’s take a look at these files and summarize their content.
| Dataset | Twitter | News | Blogs |
|---|---|---|---|
| Size | 159.4 MB | 196.3 MB | 200.4 MB |
| Size in Memory | 301.4 MB | 19.2 MB | 248.5 MB |
| Lines | 2360148 | 77259 | 899288 |
| Word Count | 30373543 | 2643969 | 37334131 |
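
A summary of this kind can be produced with a short script. The sketch below is one possible way to do it in Python, shown for illustration only and not the code used for this report. The helper name `summarize_file` and the file names in the loop are assumptions. It counts words by splitting on whitespace, so its word counts may differ from the table if a different tokenizer was used, and it does not measure the in-memory size.

```python
import os


def summarize_file(path, encoding="utf-8"):
    """Return size on disk (MB), line count and whitespace-delimited word count."""
    n_lines = 0
    n_words = 0
    # Stream the file line by line so large files do not need to fit in memory.
    with open(path, "r", encoding=encoding, errors="replace") as fh:
        for line in fh:
            n_lines += 1
            n_words += len(line.split())
    return {
        "size_mb": os.path.getsize(path) / (1024 ** 2),
        "lines": n_lines,
        "words": n_words,
    }


# Hypothetical file names; adjust to wherever the three text files are stored.
for name in ["twitter.txt", "news.txt", "blogs.txt"]:
    if os.path.exists(name):
        stats = summarize_file(name)
        print(f"{name}: {stats['size_mb']:.1f} MB, "
              f"{stats['lines']} lines, {stats['words']} words")
```

Reading in text mode with an explicit encoding and `errors="replace"` is a deliberate choice here: it keeps the line count from being cut short by stray bytes in the raw text.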