Overview

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. Typing on mobile devices, however, can be a serious pain. SwiftKey, the corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of its smart keyboard is predictive text models. In this capstone we will apply data science to natural language processing. With the advent of social media and blogs, the value of text-based information continues to increase. Work on this kind of data falls into 3 broad categories: 1) Natural Language Processing (NLP), 2) Text Mining and 3) Machine Learning.

Whichever category it falls into, it comes with its fair share of challenges.

Some key Challenges

  • The vast amount of unstructured data available on the web.
  • Understanding the language and the context.
  • Limitations in current mobile devices.
  • Scalability.

Goal

The ultimate goal of the project is to create a Shiny application with a predictive model. Given the limited resources available on mobile devices, we will have to choose an algorithm based on accuracy and speed.

Our tasks will involve:

Data

The data provided is for training purposes. It was downloaded from the following site as per the instructions.

Initial Pre-processing

The following pre-processing is done and will be used to build our model:

  • Cases: All words will be converted to lower case.
  • Punctuation: Will be stripped off.
  • Stopwords: Stop words will be removed from the list.
  • Conjunctions, Prepositions: As per research these words do not add value. I think otherwise. A decision on these types of words will be taken later, during model building.
  • URL: Will be removed.
  • Emojis: Will be removed.
  • Numbers: As per research these do not add value. A decision on numbers will be taken later, during model building.
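
The steps listed above can be sketched as a single cleaning function. This is only a minimal illustration in Python, not the code used for this report: the name `preprocess`, the tiny `STOPWORDS` set and the regular expressions are all assumptions. Stop-word and number removal are left as switches because those decisions are deferred to model building.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOPWORDS = {"the", "and", "to", "a", "of", "in"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Keep only ASCII letters, digits, apostrophes and whitespace.
# This strips punctuation and emojis (and, as a side effect, accented letters).
NON_WORD_RE = re.compile(r"[^a-z0-9'\s]")
NUMBER_RE = re.compile(r"\b\d+\b")


def preprocess(text, remove_stopwords=True, remove_numbers=False):
    """Return a list of cleaned tokens for one document."""
    text = text.lower()
    text = text.replace("’", "'")      # normalise curly apostrophes
    text = URL_RE.sub(" ", text)       # URLs go first, before punctuation is stripped
    text = NON_WORD_RE.sub(" ", text)  # punctuation and emojis
    if remove_numbers:
        text = NUMBER_RE.sub(" ", text)
    tokens = [t.strip("'") for t in text.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only apostrophes
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens
```

URLs are removed before punctuation so that the pieces of a link (for example the domain name) do not survive as stray tokens.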

Exploratory Analysis

The 3 datasets were analysed and the following were the findings.

The following table shows the data for words in the datasets:

Dataset Type | Documents | Total Words (TW) | Distinct Words (DW) | TTR (DW/TW)
Blogs        |    899288 |         37546246 |              319112 |      0.0085
Twitter      |   2360148 |         30093410 |              369615 |      0.0123
News         |     77259 |          2674536 |               86620 |      0.0324

We see that the Type/Token Ratio (TTR) is highest for the News dataset, followed by Twitter and then Blogs. We may have to explore more datasets at a later stage, if needed, to cover more words.
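
As a quick check of the TTR column, the ratio can be recomputed from the word counts reported in the table. The snippet below is just an illustrative sketch in Python; the names `type_token_ratio` and `corpus_counts` are made up for this example.

```python
def type_token_ratio(distinct_words, total_words):
    """TTR = distinct words (types) divided by total words (tokens)."""
    return distinct_words / total_words


# (total words, distinct words) as reported in the table above
corpus_counts = {
    "Blogs": (37546246, 319112),
    "Twitter": (30093410, 369615),
    "News": (2674536, 86620),
}

ttr = {
    name: round(type_token_ratio(distinct, total), 4)
    for name, (total, distinct) in corpus_counts.items()
}

# Datasets ordered from highest to lowest TTR
ranking = sorted(ttr, key=ttr.get, reverse=True)
```

One caveat when comparing the three values: TTR tends to fall as a corpus grows, so part of the higher ratio for News may simply reflect that it is the smallest dataset.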

Now let’s dive deeper into the data content of the 3 datasets.

News-Dataset

Let’s look at the unigrams in this dataset.

Fig-1: Top 20 Unigrams

Words like ‘the’, ‘and’, ‘to’ etc. do not give us much information. Let us filter these words out and look at the rest. We will filter the same words out of the other datasets too.
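
The filtering step can be sketched as a frequency count that skips a stop-word set. This is a hedged Python illustration rather than the code behind the figures; `top_unigrams` is a hypothetical helper and `STOPWORDS` contains only the three words named above.

```python
from collections import Counter

# Only the words named in the text; a real stop-word list would be much longer.
STOPWORDS = {"the", "and", "to"}


def top_unigrams(tokens, n=20, stopwords=frozenset()):
    """Return the n most frequent tokens, ignoring any token in `stopwords`."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)
```

Calling `top_unigrams(tokens)` gives the unfiltered ranking, while `top_unigrams(tokens, stopwords=STOPWORDS)` gives the filtered one.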

Fig-2: Top 20 Unigrams

The top 20 bigrams are as follows.

Fig-3: Top 20 Bigrams

The top 20 trigrams are as follows.

Fig-3: Top 20 Trigrams
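
Bigram and trigram rankings like the ones plotted here can be produced with a sliding window over each document's tokens. The following is a minimal Python sketch under the assumption that documents are already tokenised; `ngrams` and `top_ngrams` are hypothetical names.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield consecutive n-word tuples from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def top_ngrams(documents, n, k=20):
    """Count n-grams per document (never across document boundaries)
    and return the k most frequent ones as (text, count) pairs."""
    counts = Counter()
    for tokens in documents:
        counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts.most_common(k)
```

Here `top_ngrams(documents, 2)` gives the top 20 bigrams and `top_ngrams(documents, 3)` the top 20 trigrams.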

Let’s explore how the words are related.

Fig-4: Word relations

We observe words from various topics, ranging from Politics, Economics, Real Estate, National Issues and Health Care to Social Media. It is interesting to note the various numbers and their relation to time, money, size/volume etc.
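
The report does not state how the word relations were derived. One common approach, assumed here purely for illustration, is to link two words whenever one directly follows the other and to keep only the links that occur often enough. In the Python sketch below, `build_followers`, `strongest_links` and the `min_count` threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def build_followers(documents):
    """Map each word to a Counter of the words that directly follow it."""
    followers = defaultdict(Counter)
    for tokens in documents:
        for current, nxt in zip(tokens, tokens[1:]):
            followers[current][nxt] += 1
    return followers


def strongest_links(followers, min_count=2):
    """Return (word, next_word, count) edges seen at least min_count times,
    strongest first; ties are broken alphabetically."""
    edges = [
        (word, nxt, count)
        for word, nxts in followers.items()
        for nxt, count in nxts.items()
        if count >= min_count
    ]
    return sorted(edges, key=lambda e: (-e[2], e[0], e[1]))
```

The same follower counts are also a natural starting point for next-word prediction, since `followers[word].most_common(1)` gives the word most often seen after `word`.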

Twitter-Dataset

The top 20 unigrams are as follows.

Fig-5: Top 20 Unigrams

The top 20 bigrams are as follows.

Fig-6: Top 20 Bigrams

The top 20 trigrams are as follows.

Fig-7: Top 20 Trigrams

Let’s explore how the words are related.

Fig-8: Word relations

For Twitter, we observe words mostly around Entertainment, Sports and social interactions. There are some offensive/profane words, which need further cleaning.
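
The further cleaning mentioned above could be as simple as dropping every token that appears on a block list. This is a minimal Python sketch: `remove_profanity` is a hypothetical helper, and the block list would have to come from an external source, which is not chosen in this report.

```python
def remove_profanity(tokens, blocked_words):
    """Return the tokens with every block-listed word removed (case-insensitive)."""
    blocked = {w.lower() for w in blocked_words}
    return [t for t in tokens if t.lower() not in blocked]
```

Removing single tokens leaves gaps in the sentence, which can create n-grams that never occurred in the original text. An alternative worth weighing during model building is to drop the whole tweet when it contains a blocked word.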

Blogs-Dataset

The top 20 unigrams are as follows.

Fig-9: Top 20 Unigrams

The top 20 bigrams are as follows.

Fig-10: Top 20 Bigrams

The top 20 trigrams are as follows.

Fig-11: Top 20 Trigrams

Let’s explore how the words are related.

Fig-12: Word relations

In Blogs, we observe words around Entertainment, Sports, Health and Cooking. Since this is only a subset of the data, many more interactions are not shown yet.

Future course of actions