Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities, yet typing on those devices can be a serious pain. SwiftKey, the corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. In this capstone we will be applying data science in the area of natural language processing. With the advent of social media and blogs, the value of text-based information continues to increase. Work with this kind of text falls into three broad categories: 1) Natural Language Processing (NLP), 2) Text Mining and 3) Machine Learning.
Whichever category it falls into, it comes with its fair share of challenges.
The ultimate goal of the project is to create a Shiny application built around a predictive text model. Given the limited resources available on mobile devices, we will have to choose an algorithm that balances accuracy and speed.
Our tasks will involve:

- obtaining and cleaning the training data,
- exploratory analysis of the three datasets (blogs, news and Twitter),
- building and evaluating a word-prediction model, and
- deploying the model as a Shiny application.
The data provided is for training purposes. It was downloaded from the following site as per the instructions.
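To work with the counts below, the three English corpora can be read into R roughly as follows; the file paths assume the standard `final/en_US` layout of the downloaded archive and may need adjusting.

```r
# Read the three English corpora. The paths assume the standard
# final/en_US layout of the downloaded archive; adjust if yours differs.
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```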
The raw text was pre-processed before further analysis and will be used to build our model.
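The exact steps are not listed in this report, so the sketch below only illustrates a typical cleaning pass; the specific choices (lower-casing, removing URLs, numbers and punctuation) are assumptions for illustration.

```r
library(stringr)

# Minimal cleaning sketch: lower-case, drop URLs, numbers and punctuation,
# and collapse repeated whitespace. The exact steps used may differ.
clean_text <- function(x) {
  x <- str_to_lower(x)
  x <- str_remove_all(x, "https?://\\S+")   # URLs
  x <- str_remove_all(x, "[0-9]+")          # numbers
  x <- str_remove_all(x, "[[:punct:]]+")    # punctuation
  str_squish(x)                             # squeeze whitespace
}

blogs   <- clean_text(blogs)
news    <- clean_text(news)
twitter <- clean_text(twitter)
```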
The three datasets were analysed and the findings follow.
The following table shows the word counts for the datasets:
| Dataset Type | Documents | Total Words (TW) | Distinct Words (DW) | TTR (DW/TW) |
|---|---|---|---|---|
| Blogs | 899288 | 37546246 | 319112 | 0.0085 |
| Twitter | 2360148 | 30093410 | 369615 | 0.0123 |
| News | 77259 | 2674536 | 86620 | 0.0324 |
We see that the Type/Token Ratio (TTR) is highest for the News dataset, followed by Twitter and then Blogs. We may have to explore more datasets at a later stage if we need to cover more words.
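The table above can be reproduced with a tokeniser; the sketch below uses `tidytext::unnest_tokens`, which is an assumption, since the counting method actually used is not shown here.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Summarise one corpus: documents (lines), total words, distinct words and TTR.
corpus_summary <- function(lines, name) {
  words <- tibble(text = lines) %>% unnest_tokens(word, text)
  tibble(dataset        = name,
         documents      = length(lines),
         total_words    = nrow(words),
         distinct_words = n_distinct(words$word),
         ttr            = round(distinct_words / total_words, 4))
}

bind_rows(corpus_summary(blogs,   "Blogs"),
          corpus_summary(twitter, "Twitter"),
          corpus_summary(news,    "News"))
```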
Now let's dive deeper into the content of the three datasets, starting with the News corpus. Let's look at its unigrams first.
Fig-1: Top 20 Unigrams (News)
Words like 'the', 'and', 'to', etc. (stop words) do not give us much information, so let us filter them out and look at the remaining words. We will filter these stop words in the other datasets too.
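One way to do this filtering is with the `stop_words` lexicon that ships with tidytext; whether this exact list was used for the figures is an assumption.

```r
library(dplyr)
library(tidytext)

data("stop_words")   # tidytext's stop-word lexicon ('the', 'and', 'to', ...)

top_unigrams <- tibble(text = news) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%   # drop the stop words
  count(word, sort = TRUE) %>%
  slice_head(n = 20)
```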
Fig-2: Top 20 Unigrams after Stop-Word Removal (News)
The top 20 bigrams are as follows.
Fig-3: Top 20 Bigrams (News)
The top 20 trigrams are as follows.
Fig-4: Top 20 Trigrams (News)
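The bigram and trigram counts can be produced the same way by switching the tokeniser to n-grams; again, this is a sketch rather than the exact code behind the figures.

```r
library(dplyr)
library(tidytext)

# Count the top n-grams of a given order for one corpus.
top_ngrams <- function(lines, n_gram, top = 20) {
  tibble(text = lines) %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n_gram) %>%
    filter(!is.na(ngram)) %>%         # short lines yield NA n-grams
    count(ngram, sort = TRUE) %>%
    slice_head(n = top)
}

top_bigrams  <- top_ngrams(news, 2)
top_trigrams <- top_ngrams(news, 3)
```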
Let’s explore how the words are related.
Fig-5: Word Relations (News)
We observe words from various topics, ranging from politics, economics, real estate, national issues and health care to social media. It is interesting to note the various numbers and their relation to time, money, size/volume, etc.
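A common way to visualise such relations is a bigram network, where frequent word pairs become edges of a graph; assuming that is roughly what the figure shows, a sketch using igraph and ggraph follows.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(igraph)
library(ggraph)

# Build a graph from the most frequent word pairs and plot it.
bigram_graph <- tibble(text = news) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  slice_head(n = 60) %>%                  # keep the 60 most frequent pairs
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```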
Next, let's look at the Twitter dataset. The top 20 unigrams are as follows.
Fig-6: Top 20 Unigrams (Twitter)
The top 20 bigrams are as follows.
Fig-7: Top 20 Bigrams (Twitter)
The top 20 trigrams are as follows.
Fig-8: Top 20 Trigrams (Twitter)
Let’s explore how the words are related.
Fig-9: Word Relations (Twitter)
For Twitter, we observe words mostly around entertainment, sports and social interactions. There are some offensive/profane words which need further cleaning.
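That further cleaning could be a simple profanity filter that drops any document containing a banned word; `profanity.txt` below is a placeholder for whichever word list is eventually chosen.

```r
library(stringr)

# Hypothetical profanity filter: remove documents containing any banned word.
# 'profanity.txt' stands in for the word list to be chosen later.
profanity   <- readLines("profanity.txt", encoding = "UTF-8")
bad_pattern <- str_c("\\b(", str_c(profanity, collapse = "|"), ")\\b")

twitter_clean <- twitter[!str_detect(twitter, regex(bad_pattern, ignore_case = TRUE))]
```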
Finally, let's look at the Blogs dataset. The top 20 unigrams are as follows.
Fig-10: Top 20 Unigrams (Blogs)
The top 20 bigrams are as follows.
Fig-11: Top 20 Bigrams (Blogs)
The top 20 trigrams are as follows.
Fig-12: Top 20 Trigrams (Blogs)
Let's explore how the words are related.
Fig-13: Word Relations (Blogs)
In Blogs, we observe words around entertainment, sports, health and cooking. This figure covers only a subset of the corpus, so many more interactions are not shown.