Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities, yet typing on those devices can be a serious pain. SwiftKey, the corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. In this capstone we will be applying data science in the area of natural language processing. With the advent of social media and blogs, the value of text-based information continues to increase. Work with this kind of text falls into three broad categories: 1) Natural Language Processing (NLP), 2) Text Mining and 3) Machine Learning.
Whichever category it falls into, it comes with its fair share of challenges.
The ultimate goal of the project is to create a Shiny application built around a predictive text model. Given the limited resources available on mobile devices, we will have to choose an algorithm that balances accuracy and speed.
Our tasks will involve:

- obtaining and cleaning the training data,
- exploratory analysis of the three datasets (blogs, news and Twitter),
- building and evaluating a word-prediction model, and
- deploying the model as a Shiny application.
The data provided is for training purposes. It was downloaded from the following site as per the instructions.
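To work with the counts below, the three English corpora can be read into R roughly as follows; the file paths assume the standard `final/en_US` layout of the downloaded archive and may need adjusting.

```r
# Read the three English corpora. The paths assume the standard
# final/en_US layout of the downloaded archive; adjust if yours differs.
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```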
The raw text was pre-processed before further analysis and will be used to build our model.
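The exact steps are not listed in this report, so the sketch below only illustrates a typical cleaning pass; the specific choices (lower-casing, removing URLs, numbers and punctuation) are assumptions for illustration.

```r
library(stringr)

# Minimal cleaning sketch: lower-case, drop URLs, numbers and punctuation,
# and collapse repeated whitespace. The exact steps used may differ.
clean_text <- function(x) {
  x <- str_to_lower(x)
  x <- str_remove_all(x, "https?://\\S+")   # URLs
  x <- str_remove_all(x, "[0-9]+")          # numbers
  x <- str_remove_all(x, "[[:punct:]]+")    # punctuation
  str_squish(x)                             # squeeze whitespace
}

blogs   <- clean_text(blogs)
news    <- clean_text(news)
twitter <- clean_text(twitter)
```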
The three datasets were analysed and the findings follow.
The following table shows the word counts for the datasets:
| Dataset Type | Documents | Total Words (TW) | Distinct Words (DW) | TTR (DW/TW) |
|---|---|---|---|---|
| Blogs | 899288 | 37546246 | 319112 | 0.0085 |
| Twitter | 2360148 | 30093410 | 369615 | 0.0123 |
| News | 77259 | 2674536 | 86620 | 0.0324 |
We see that the Type/Token Ratio (TTR) is highest for the News dataset, followed by Twitter and then Blogs. We may have to explore more datasets at a later stage if we need to cover more words.
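The table above can be reproduced with a tokeniser; the sketch below uses `tidytext::unnest_tokens`, which is an assumption, since the counting method actually used is not shown here.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Summarise one corpus: documents (lines), total words, distinct words and TTR.
corpus_summary <- function(lines, name) {
  words <- tibble(text = lines) %>% unnest_tokens(word, text)
  tibble(dataset        = name,
         documents      = length(lines),
         total_words    = nrow(words),
         distinct_words = n_distinct(words$word),
         ttr            = round(distinct_words / total_words, 4))
}

bind_rows(corpus_summary(blogs,   "Blogs"),
          corpus_summary(twitter, "Twitter"),
          corpus_summary(news,    "News"))
```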
Now let's dive deeper into the content of the three datasets, starting with the News corpus. Let's look at its unigrams first.
Fig-1: Top 20 Unigrams (News)
Words like 'the', 'and', 'to', etc. (stop words) do not give us much information, so let us filter them out and look at the remaining words. We will filter these stop words in the other datasets too.
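One way to do this filtering is with the `stop_words` lexicon that ships with tidytext; whether this exact list was used for the figures is an assumption.

```r
library(dplyr)
library(tidytext)

data("stop_words")   # tidytext's stop-word lexicon ('the', 'and', 'to', ...)

top_unigrams <- tibble(text = news) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%   # drop the stop words
  count(word, sort = TRUE) %>%
  slice_head(n = 20)
```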
Fig-2: Top 20 Unigrams after Stop-Word Removal (News)
The top 20 bigrams are as follows.
Fig-3: Top 20 Bigrams (News)
The top 20 trigrams are as follows.
Fig-4: Top 20 Trigrams (News)
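The bigram and trigram counts can be produced the same way by switching the tokeniser to n-grams; again, this is a sketch rather than the exact code behind the figures.

```r
library(dplyr)
library(tidytext)

# Count the top n-grams of a given order for one corpus.
top_ngrams <- function(lines, n_gram, top = 20) {
  tibble(text = lines) %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n_gram) %>%
    filter(!is.na(ngram)) %>%         # short lines yield NA n-grams
    count(ngram, sort = TRUE) %>%
    slice_head(n = top)
}

top_bigrams  <- top_ngrams(news, 2)
top_trigrams <- top_ngrams(news, 3)
```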
Let’s explore how the words are related.
Fig-5: Word Relations (News)
We observe words from various topics, ranging from politics, economics, real estate, national issues and health care to social media. It is interesting to note the various numbers and their relation to time, money, size/volume, etc.
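A common way to visualise such relations is a bigram network, where frequent word pairs become edges of a graph; assuming that is roughly what the figure shows, a sketch using igraph and ggraph follows.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(igraph)
library(ggraph)

# Build a graph from the most frequent word pairs and plot it.
bigram_graph <- tibble(text = news) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  slice_head(n = 60) %>%                  # keep the 60 most frequent pairs
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```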
Next, let's look at the Twitter dataset. The top 20 unigrams are as follows.
Fig-6: Top 20 Unigrams (Twitter)
The top 20 bigrams are as follows.
Fig-7: Top 20 Bigrams (Twitter)
The top 20 trigrams are as follows.
Fig-8: Top 20 Trigrams (Twitter)
Let’s explore how the words are related.
Fig-9: Word Relations (Twitter)
For Twitter, we observe words mostly around entertainment, sports and social interactions. There are some offensive/profane words which need further cleaning.
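That further cleaning could be a simple profanity filter that drops any document containing a banned word; `profanity.txt` below is a placeholder for whichever word list is eventually chosen.

```r
library(stringr)

# Hypothetical profanity filter: remove documents containing any banned word.
# 'profanity.txt' stands in for the word list to be chosen later.
profanity   <- readLines("profanity.txt", encoding = "UTF-8")
bad_pattern <- str_c("\\b(", str_c(profanity, collapse = "|"), ")\\b")

twitter_clean <- twitter[!str_detect(twitter, regex(bad_pattern, ignore_case = TRUE))]
```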
Finally, let's look at the Blogs dataset. The top 20 unigrams are as follows.
Fig-10: Top 20 Unigrams (Blogs)
The top 20 bigrams are as follows.
Fig-11: Top 20 Bigrams (Blogs)
The top 20 trigrams are as follows.
Fig-12: Top 20 Trigrams (Blogs)
Let's explore how the words are related.
Fig-13: Word Relations (Blogs)
In Blogs, we observe words around entertainment, sports, health and cooking. This figure covers only a subset of the corpus, so many more interactions are not shown.