Introduction

This report documents the exploratory analysis I conducted on the raw text data to better understand it and to begin developing a strategy for building a text prediction model, i.e. a model that predicts the next word a user will type based on her/his previous inputs.

Raw Data Overview

The training data is provided by SwiftKey. I chose to use the English versions of the Twitter, news, and blog text files.

Let’s first load the source files and convert them into a corpus object using the quanteda package in R. The summary of the three .txt files is below:
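A minimal sketch of this loading step, assuming the three files sit under a local data/en_US/ directory (the path and the use of the readtext package are assumptions, not shown in the output below):

```r
library(quanteda)
library(readtext)

# Read the three .txt files at once (the glob path is an assumption)
raw <- readtext("data/en_US/*.txt")

# Build a corpus with one document per source file and summarise it
corp <- corpus(raw)
summary(corp)
```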

## Corpus consisting of 3 documents:
## 
##               Text  Types   Tokens Sentences
##    en_US.blogs.txt 482484 42840192   2072941
##     en_US.news.txt 431667 39918317   1867522
##  en_US.twitter.txt 566995 36719702   2588548
## 
## Source: /home/roger/NLP-R/* on x86_64 by roger
## Created: Wed Sep 25 14:12:23 2019
## Notes:

Since the raw data consists of a large amount of information - millions of sentences and tens of millions of tokens - processing it all at once would require substantial resources and time. For the purposes of exploratory data analysis, only 1% of the data, randomly sampled from each of the three text files, is used for practical reasons: fast processing with a sufficient amount of information to find patterns.

Once a good strategy for cleaning/processing the data and for constructing the text prediction model is developed, a greater portion of the raw data will be used/revisited as needed.

Exploratory Analysis

Twitter Data

We start by looking into the Twitter file, reading the lines of the .txt file into a data.table and randomly drawing 1% of the lines for the analysis.
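A sketch of this sampling step (the file path, seed value, and variable names are assumptions):

```r
library(data.table)

set.seed(1234)  # for reproducibility; the seed value is arbitrary

# Read every line of the Twitter file into a one-column data.table
twitter_dt <- data.table(text = readLines("data/en_US/en_US.twitter.txt",
                                          encoding = "UTF-8", skipNul = TRUE))

# Randomly draw 1% of the lines for the exploratory analysis
twitter_sample <- twitter_dt[sample(.N, size = floor(.N * 0.01))]
```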

Then we create a corpus from the sampled Twitter data and see what the most frequent tokens are.
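A sketch of this counting step; with quanteda version 3 or later, textstat_frequency() lives in the quanteda.textstats package (in older versions it was part of quanteda itself):

```r
library(quanteda.textstats)

# Build a corpus from the sampled tweets, tokenise without any cleaning,
# and count how often each feature occurs
twitter_corp <- corpus(twitter_sample, text_field = "text")
twitter_dfm  <- dfm(tokens(twitter_corp))

# Frequency table of every feature, most frequent first
twitter_freq <- textstat_frequency(twitter_dfm)
head(twitter_freq)
```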

##            feature frequency
##     1:           .     25177
##     2:           !     12619
##     3:         the      9258
##     4:          to      7776
##     5:           ,      7456
##    ---                      
## 27276:       ankel         1
## 27277:    sports-_         1
## 27278:      kassim         1
## 27279:          tp         1
## 27280: #iamamentor         1

At first glance, the sampled Twitter data has 27,280 unique features/tokens before any processing/trimming/stemming is performed. A closer look at the features is required to determine the strategy for cleaning up the text data.

It’s easy to notice that the following features should be removed:

- punctuation
- numbers
- emojis
- foreign characters

They show up clearly when the frequency table is filtered down to non-alphanumeric features, as in the sketch below.
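A quick way to surface those features from the frequency table (the regular expression used here is only an illustrative assumption):

```r
# Keep only features that contain no letters or digits, i.e. punctuation,
# symbols, and emoji, and inspect them
junk_feats <- twitter_freq[!grepl("[[:alnum:]]", twitter_freq$feature), ]
head(junk_feats)
```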

##      feature frequency  rank docfreq group
##   1:       .     25177     1   12340   all
##   2:       !     12619     2    7262   all
##   3:       ,      7456     5    5416   all
##   4:       ?      4179    10    3289   all
##   5:       :      4041    11    3576   all
##  ---                                      
## 172:       😢         1 11085       1   all
## 173:       🍆         1 11085       1   all
## 174:       💏         1 11085       1   all
## 175:       🚼         1 11085       1   all
## 176:       🐬         1 11085       1   all

Additionally, common English stopwords, URLs, Twitter characters, and hyphens will also be removed, and trimming will be applied.
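A sketch of the cleaning step under the quanteda version 3+ API (in older versions most of these options were arguments to dfm()); the handle/hashtag patterns and the trimming threshold are assumptions:

```r
# Re-tokenise with the unwanted feature classes removed
clean_toks <- tokens(twitter_corp,
                     remove_punct   = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     remove_url     = TRUE,
                     split_hyphens  = TRUE)

# Drop common English stopwords plus Twitter handles and hashtags
clean_toks <- tokens_remove(clean_toks, c(stopwords("en"), "@*", "#*"))

# Build the document-feature matrix and trim very rare features
twitter_dfm_clean <- dfm_trim(dfm(clean_toks), min_termfreq = 2)
```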

We will then follow a similar approach to analyze/process the news and blog text data.
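One way to avoid repeating the pipeline is to wrap it in a helper and apply it to the other two files; the function name, paths, and defaults below are hypothetical:

```r
# Hypothetical helper: sample 1% of a raw .txt file and return a cleaned dfm
sample_and_clean <- function(path, frac = 0.01, seed = 1234) {
  set.seed(seed)
  lines   <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sampled <- sample(lines, size = floor(length(lines) * frac))
  toks <- tokens(corpus(sampled),
                 remove_punct = TRUE, remove_numbers = TRUE,
                 remove_symbols = TRUE, remove_url = TRUE,
                 split_hyphens = TRUE)
  toks <- tokens_remove(toks, c(stopwords("en"), "@*", "#*"))
  dfm(toks)
}

news_dfm  <- sample_and_clean("data/en_US/en_US.news.txt")
blogs_dfm <- sample_and_clean("data/en_US/en_US.blogs.txt")
```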

Visualizing the features/tokens

As the last step of the initial exploratory analysis, we will visualize the top 100 features from each of the three data sets.
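A sketch of one such plot for the Twitter sample, using textstat_frequency() together with ggplot2 (the plot title and styling are assumptions):

```r
library(ggplot2)

# Top 100 features of the cleaned Twitter dfm as a horizontal bar chart
top_twitter <- textstat_frequency(twitter_dfm_clean, n = 100)

ggplot(top_twitter, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Top 100 features - Twitter sample")
```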

Finally, we examine how the features look when all three data sets are combined.

A word cloud of the top 200 features from all three sources combined is shown below.
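A sketch of the combined word cloud; textplot_wordcloud() lives in the quanteda.textplots package with quanteda version 3 or later:

```r
library(quanteda.textplots)

# Stack the three cleaned dfms into one and plot the 200 most frequent features
combined_dfm <- rbind(twitter_dfm_clean, news_dfm, blogs_dfm)
textplot_wordcloud(combined_dfm, max_words = 200)
```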

N-Gram Modeling

N-grams can be created easily using the same process as the feature creation so far, with minor modifications when calling the dfm() function from the quanteda package.

We will first write a function to create n-grams.
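A sketch of such a function; the name is hypothetical, and it uses tokens_ngrams() rather than the older ngrams argument of dfm():

```r
# Hypothetical helper: build an n-gram frequency table from a tokens object
ngram_freq <- function(toks, n) {
  ng_dfm <- dfm(tokens_ngrams(toks, n = n, concatenator = " "))
  textstat_frequency(ng_dfm)
}
```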

We will then create 2-gram, 3-gram, 4-gram, and 5-gram models for each of the three data sets.
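For example, for the Twitter sample (variable names are assumptions):

```r
# 2- to 5-gram frequency tables for the cleaned Twitter tokens
twitter_ngrams <- lapply(2:5, function(n) ngram_freq(clean_toks, n))
names(twitter_ngrams) <- paste0(2:5, "-gram")
```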

And then combine them into a single set of n-gram frequency tables.
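A sketch of combining the 2-gram tables, assuming news_ngrams and blogs_ngrams were built the same way as twitter_ngrams:

```r
# Stack the per-source 2-gram tables and re-aggregate the counts
all_bigrams <- rbindlist(list(twitter_ngrams[["2-gram"]],
                              news_ngrams[["2-gram"]],
                              blogs_ngrams[["2-gram"]]),
                         fill = TRUE)
all_bigrams <- all_bigrams[, .(frequency = sum(frequency)), by = feature][order(-frequency)]
```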

Visualize N-grams

For demonstration purposes, the top 25 most frequent 2-grams and 3-grams are plotted below.

Next Steps

References