The capstone project allows us (students) to create a usable, public data product that showcases the skills developed throughout the nine courses of the data science specialization. On this occasion, we’ll work on understanding and building Predictive Text Models like the ones used by SwiftKey, Coursera’s corporate partner for this capstone project.
This Milestone Report will cover:
Note: If you are interested in the code used to create this .Rmd file, you can find it on GitHub
The data used in this project comes from a corpus called HC Corpora. The files have been language-filtered by Coursera but may still contain some foreign text. The *.zip file contains the following language folders:
| Language | Folder Name | Files Included |
|---|---|---|
| German | de_DE | blogs.txt, news.txt, twitter.txt |
| English | en_US | blogs.txt, news.txt, twitter.txt |
| Finnish | fi_FI | blogs.txt, news.txt, twitter.txt |
| Russian | ru_RU | blogs.txt, news.txt, twitter.txt |
Each Language folder contains .txt files from 3 different sources: blogs, news and twitter.
Looking at the files contained in en_US gives the following characteristics:
| Name | File Name | File Size | Lines | Words |
|---|---|---|---|---|
| blogs | en_US.blogs.txt | 200.4242 MB | 899,288 | 37,510,168 |
| news | en_US.news.txt | 196.2775 MB | 77,259 | 2,673,480 |
| twitter | en_US.twitter.txt | 159.3641 MB | 2,360,148 | 30,088,564 |
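For reference, a minimal sketch of how the table above could be generated, assuming the files were unzipped into a `final/en_US/` folder (the standard layout of the Coursera zip) and using `stri_count_words()` from the stringi package to count words:

```r
library(stringi)  # fast word counting

# Summarise one file: short name, size in MB, number of lines, number of words
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    name    = sub("en_US\\.(\\w+)\\.txt", "\\1", basename(path)),
    file    = basename(path),
    size_mb = round(file.size(path) / 1024^2, 4),
    lines   = length(lines),
    words   = sum(stri_count_words(lines))
  )
}

files <- file.path("final", "en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, summarise_file))
```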
As expected, twitter has the most lines despite being the smallest file in the folder. This is most likely related to twitter’s 140-character limit.
The following histogram shows how the word count is distributed across the entire en_US folder (i.e. all files within the folder). It is interesting to see how twitter skews the plot to the right up to the 28-word mark.
Note: If we were to use the number of characters instead of words, twitter would still skew the plot, but the cutoff would be closer to the 140-character mark
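A sketch of how the words-per-line distribution could be plotted with ggplot2, reusing the `files` vector from the chunk above (the axis limit is an assumption made purely for readability; the exact binning in the report may differ):

```r
library(ggplot2)
library(stringi)

# Words per line across all three en_US files
words_per_line <- unlist(lapply(files, function(path) {
  stri_count_words(readLines(path, encoding = "UTF-8", skipNul = TRUE))
}))

ggplot(data.frame(words = words_per_line), aes(x = words)) +
  geom_histogram(binwidth = 1) +
  xlim(0, 100) +                       # trim the long tail for readability
  labs(x = "Words per line", y = "Number of lines")
```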
From this point on we are going to sample the dataset to control the processing power required for the next operations. For now, we are going to create a subset with 3,000 lines per file.
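A possible sampling step, again assuming the `files` vector defined earlier (the seed is arbitrary and only there for reproducibility):

```r
set.seed(1234)  # make the random sample reproducible

# Draw 3,000 random lines from each en_US file
sample_lines <- function(path, n = 3000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

sample_data <- unlist(lapply(files, sample_lines))
length(sample_data)  # 9,000 lines in total
```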
We start by describing tokenization as the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
The following plot shows the 25 most frequent words from our sample data set:
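One way such a frequency plot can be built is sketched below, using tidytext for tokenization and ggplot2 for plotting (these packages are an assumption; tm or quanteda would work just as well). `sample_data` is the 9,000-line sample created above:

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Tokenize: one word per row, lowercased and stripped of punctuation
tokens <- tibble(text = sample_data) %>%
  unnest_tokens(word, text)

# Count word frequencies and plot the top 25
tokens %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 25) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency")
```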
Most of the words shown above are also called stop words. In general, stop words are the most common words in a language.
If we remove the stop words from our data set and recreate the previous plot we get the following top 25 words:
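A sketch of the stop-word removal, reusing the `tokens` table from the previous chunk and tidytext’s built-in `stop_words` lexicon (an assumption; `tm::stopwords("en")` would serve the same purpose):

```r
# Drop rows whose word appears in the stop-word lexicon, then recount
tokens %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 25) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency")
```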
The word *said* moved from the 14th position to the 1st position after removing the stop words. Removing stop words is useful for other NLP tasks, but here it is only meant to show the difference between the two datasets. For this specific project we’ll need to leave all stop words in the dataset, as we are trying to predict phrases.