Introduction

This report lays the foundation for our Natural Language Processing application, for which we will eventually create a predictive algorithm and expose it through a Shiny app. The main objectives of this report are:

  1. To successfully download the given dataset and load it into the R environment.

  2. To create a basic report of summary statistics about the data sets.

  3. To perform exploratory data analysis on this dataset, highlighting the major features using tables and plots.

  4. To outline the plan for creating the prediction algorithm and Shiny app.

The Dataset

The training dataset for this project can be downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The data consists of three files from “twitter”, “news” and “blogs”. Here is a basic summary of the dataset:

#FILE_NAME           SIZE(ls -lh)   WORD_COUNT(wc -w)    LINE_COUNT(wc -l)  LONGEST_LINE_LENGTH(GNU: wc -L)
#-----------------------------------------------------------------------------------------------------------
 en_US.blogs.txt     200MB          37334690              899288            40385
 en_US.news.txt      196MB          34372720             1010242            11384
 en_US.twitter.txt   159MB          30374206             2360148              213

As a side note, the following PowerShell command can be used on Windows to get the line count for each file:

dir | ForEach-Object { (Get-Content $_).Count }
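
For completeness, similar statistics can also be computed directly in R. The following is a minimal sketch; it assumes the three en_US.* files are in the working directory, and the simple whitespace split is only an approximation of wc -w:

# Minimal sketch: recompute line count, approximate word count and longest line length in R
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
stats <- sapply(files, function(f) {
  fileLines <- readLines(f, skipNul = TRUE, warn = FALSE)
  c(lineCount   = length(fileLines),
    wordCount   = sum(lengths(strsplit(fileLines, "\\s+"))),  # approximation of "wc -w"
    longestLine = max(nchar(fileLines)))
})
t(stats)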

The above statistics give us an idea of the complexity of this dataset. The following observations can be made:

  1. There are over 4 million records in total. This is a large dataset, so a sensible approach is to work with a sample of it.

  2. There are some very long records, especially in the blogs dataset. If included in the sample, they would make the creation of n-grams far more complex. This calls for further analysis once the data is loaded into R; if such rows turn out to be a rare occurrence, we will explicitly exclude them.

  3. The dataset has already been cleaned and contains only the “text”. We are going to treat each row as a separate document in our corpus.

Also, as the final predictive model is going to be exposed as a Shiny application, we need to be mindful of resource consumption, especially memory, CPU and garbage collection.

Now that we have a basic understanding of the data, let’s load it into the R environment for further exploration.

Loading the dataset

blogsRaw   <- readLines("/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.blogs.txt",  skipNul = TRUE, warn = TRUE)
twitterRaw <- readLines("/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.twitter.txt",skipNul = TRUE, warn = TRUE)
newsRaw    <- readLines("/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.news.txt",   skipNul = TRUE, warn = TRUE)

Please note that you will get the following warning while loading the “news” dataset:

## Warning in readLines("/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.news.txt",
##: incomplete final line found on '/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.news.txt'

This is caused by a few bad line-ending characters in the file (4 occurrences), which I removed manually in Notepad++ by searching for the hex value “1A”.
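
Alternatively, assuming the culprit is indeed the 0x1A (SUB) control character, it can be stripped programmatically from the loaded vector instead of editing the file by hand (note that this does not suppress the readLines warning itself):

# Minimal sketch: remove any embedded 0x1A (SUB) characters from the loaded vector
newsRaw <- gsub("\x1a", "", newsRaw, fixed = TRUE)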

Pre-processing the dataset

As we saw previously, there are a few very long rows in our “blogs” dataset. It is worth looking at this aspect and potentially removing them; otherwise, if such rows get selected, they will blow up the DFM of the sample set we are going to create next.

summary(nchar(blogsRaw))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40833
summary(nchar(newsRaw))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11384.0

For “blogs” and “news”, the 3rd quartile of the line length is 329 and 268 characters respectively. With that in mind, we will remove some of the longest rows from the dataset.

blogsRaw <- blogsRaw[!(nchar(blogsRaw) >= 10000)] # 6 rows deleted (based on intuition)
newsRaw  <- newsRaw[!(nchar(newsRaw) >= 5000)] # 5 rows deleted (based on intuition)
summary(nchar(blogsRaw))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    47.0   156.0   229.8   329.0  9810.0
summary(nchar(newsRaw))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.1   268.0  4198.0

Sampling the Data

Before we proceed further, we need to sample the data for the sake of performance. As we have around 4 million records in total, we are going to sample roughly 1% of the data. Drawing each line independently with probability 0.01 (via rbinom) keeps the three sources represented roughly in proportion to their original sizes.

set.seed(32124)  
sampleSize <- 0.01
# Creating 1% sample
blogs   <- blogsRaw  [rbinom(n = length(blogsRaw),   size = 1, prob = sampleSize) == 1]
news    <- newsRaw   [rbinom(n = length(newsRaw),    size = 1, prob = sampleSize) == 1]
twitter <- twitterRaw[rbinom(n = length(twitterRaw), size = 1, prob = sampleSize) == 1]
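
# Optional sanity check: each sample should be roughly 1% of the corresponding source
c(blogs = length(blogs), news = length(news), twitter = length(twitter))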

# Combining the three datasets into 1
sampledData <- c(blogs, news, twitter)

# Removing the original datasets (no longer required)
rm(blogsRaw, newsRaw, twitterRaw, blogs, news, twitter, sampleSize)

length(sampledData)
## [1] 43049
head(sampledData, 3)
## [1] "I’ll wear pajamas and give up pajahmas."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## [2] "I have mixed emotions about the start to this book. As I’ve already mentioned I very much enjoyed the characters and the voice of the protagonist. But as much as I was enjoying it, it took me a little while to get into the story. We’re presented with a huge case of insta-love right from the start. Lucas, the mysterious and hot new neighbor, comes on strong. I’m pretty sure they start making out on their second or third meeting! I’m warning you now. There’s not a lot of rationalization for the insta-love…you won’t really understand the reasons until the end of the book. Which is why, in retrospect, I kinda like that I had to wait for all of the info to be revealed."
## [3] "to do this in remembrance of Him. Luke 22:19"

As we can see above, there are non-ASCII characters in the dataset which need to be cleaned.
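
A quick, rough way to confirm this is to count the sampled lines containing anything outside the printable ASCII range (a minimal check, not part of the pipeline):

# Rough count of sampled lines containing at least one non-printable-ASCII character
sum(grepl("[^ -~]", sampledData, useBytes = TRUE))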

Cleaning the data

For building our data pipeline, we will be using the “quanteda” package, which is known for its performance. Let’s get started:

library(quanteda)
library(ggplot2)
library(gridExtra)

# setting parallel threads for "quanteda"
quanteda_options(threads = 7)

# First of all, convert to ASCII encoding
sampledData <- iconv(sampledData, "latin1", "ASCII", sub = "")

# Creating tokens
sampledTokens <- tokens(corpus(sampledData),
                        remove_punct = TRUE,        # Remove punctuation
                        remove_twitter = TRUE,      # Remove twitter # and @
                        remove_numbers = TRUE,      # Remove numbers
                        remove_symbols = TRUE,      # Remove symbols
                        remove_separators = TRUE,   # Remove separators
                        remove_hyphens = TRUE,      # Remove hyphens
                        remove_url = FALSE)         # Keep URLs; with remove_punct = TRUE they get tokenised rather than dropped

# Get rid of sample vector
rm(sampledData)

# Making tokens lowercase, except when it's all UPPERCASE
sampledTokens <- tokens_tolower(sampledTokens, keep_acronyms = TRUE)

# Applying Stemmer
sampledTokens <- tokens_wordstem(sampledTokens)

# Removing stopwords
sampledTokens <- tokens_select(sampledTokens, pattern = stopwords('en'), selection = 'remove')
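
To verify the effect of these steps, we can peek at the tokens of the first sampled document (a quick, optional check):

# Optional check: tokens of the first document after lower-casing, stemming and stopword removal
head(as.list(sampledTokens)[[1]], 10)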

Removing Profanity

For our purpose, we are going to remove inappropriate words from the analysis. For that, we have included a list of banned words from Google.

profanityList <- readLines("/Users/promisinganuj/Data/Technical/R/capstone/final/profanity.txt",  skipNul = TRUE, warn = FALSE)

# Removing profanity
sampledTokens <- tokens_select(sampledTokens, pattern = profanityList, selection = 'remove')
rm(profanityList)
# Garbage collection
gc()
##           used  (Mb) gc trigger  (Mb) limit (Mb)  max used  (Mb)
## Ncells 2075155 110.9    5551816 296.5         NA   6939771 370.7
## Vcells 8173444  62.4   68597362 523.4      16384 107181955 817.8

Creating n-Grams and corresponding DFMs

Now that the tokens are ready, we can create n-grams. For our purpose, we are going to create unigrams, bigrams and trigrams.

# Creating n-gram tokens
monoGram <- tokens_ngrams(sampledTokens, n = 1)
biGram   <- tokens_ngrams(sampledTokens, n = 2)
triGram  <- tokens_ngrams(sampledTokens, n = 3)

# Creating DFM
monoGramDFM <- dfm(monoGram)
biGramDFM   <- dfm(biGram)
triGramDFM  <- dfm(triGram)
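
As a quick sanity check, the most frequent features of each DFM can be inspected with quanteda's topfeatures():

# Quick sanity check: the 10 most frequent features in each DFM
topfeatures(monoGramDFM, 10)
topfeatures(biGramDFM, 10)
topfeatures(triGramDFM, 10)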

Creating some exploratory plots

When exploring DFMs, the word cloud is the most common plot for getting a first-hand insight into the data. Let’s create one for the 1-gram DFM:

# Creating a word cloud from the 1-gram DFM
textplot_wordcloud(monoGramDFM, max_words = 100, color = rev(RColorBrewer::brewer.pal(10, "RdBu")))

We can also create a frequency plot to observe the actual occurrence count.

monoGramDFM                  %>% 
  textstat_frequency(n = 15) %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  ggtitle("Top 15 1-Gram Frequency Plot") +
  theme_minimal()

Similarly, we can also create frequency plots for 2-Gram and 3-Grams.
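
For example, the 2-gram version only requires swapping in the corresponding DFM and title:

biGramDFM                    %>%
  textstat_frequency(n = 15) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  ggtitle("Top 15 2-Gram Frequency Plot") +
  theme_minimal()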

The above graphs validate the intuitive idea that the frequency of individual terms decreases as n increases in n-gram models.

Next Steps

Now that we have the basic DFMs in place, we can start thinking about the predictive Shiny app. There is still a lot to be done. The idea behind the predictive search is to suggest the next word, based on our trained model, as the user types the first few words. The model should also keep adapting to newly entered combinations of words. One idea is to predict purely from the n-grams, choosing the n-gram order according to the number of words the user has already entered.
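
To make that idea concrete, here is a minimal sketch of what such an n-gram lookup could look like. This is an illustration only, not the final algorithm: predictNextWord and topN are hypothetical names, and a real model would need back-off to lower-order n-grams as well as the same pre-processing (lower-casing, stemming) applied to the user's input.

# Illustrative sketch only: suggest the next word from trigram counts.
# Feature names in triGramDFM have the form "w1_w2_w3".
triGramFreq <- textstat_frequency(triGramDFM)

predictNextWord <- function(w1, w2, freqTable = triGramFreq, topN = 3) {
  prefix  <- paste(w1, w2, "", sep = "_")                     # e.g. "happi_new_"
  matches <- freqTable[startsWith(freqTable$feature, prefix), ]
  if (nrow(matches) == 0) return(character(0))                # a real model would back off to bigrams here
  substring(head(matches$feature, topN), nchar(prefix) + 1)   # keep only the candidate third words
}

predictNextWord("happi", "new")                               # input must be stemmed/lower-cased like the corpus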