This report lays the foundation for our Natural Language Processing application, for which we will eventually create a predictive algorithm and expose it through a Shiny app. The main objectives of this report are:
To successfully download the given dataset and load it into the R environment.
To create a basic report of summary statistics about the data sets.
To perform exploratory data analysis on this dataset, highlighting its major features using tables and plots.
To outline the plan for creating the prediction algorithm and Shiny app.
The training dataset for this project can be downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The data consists of three files from “twitter”, “news” and “blogs”. Here is a basic summary of the dataset:
# FILE_NAME           SIZE (ls -lh)   WORD_COUNT (wc -w)   LINE_COUNT (wc -l)   LONGEST_LINE_LENGTH (GNU wc -L)
# ------------------------------------------------------------------------------------------------------------
en_US.blogs.txt       200MB           37334690             899288               40385
en_US.news.txt        196MB           34372720             1010242              11384
en_US.twitter.txt     159MB           30374206             2360148              213
Just a side note: the following PowerShell command can be used on Windows to get the line count for each file.
dir | foreach-object {(get-content $_).count}
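The same summary can also be produced from within R. Below is a minimal sketch, assuming the working directory contains the three files; fileSummary is a hypothetical helper, and the word count is only an approximation based on whitespace splitting.
# Sketch: rough R equivalent of the shell commands above
fileSummary <- function(path) {
  lines <- readLines(path, skipNul = TRUE, warn = FALSE)
  data.frame(
    file         = basename(path),
    size_mb      = round(file.info(path)$size / 1024^2),
    word_count   = sum(lengths(strsplit(lines, "\\s+"))),
    line_count   = length(lines),
    longest_line = max(nchar(lines))
  )
}
do.call(rbind, lapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"), fileSummary))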
The above statistics give us an idea of the complexity of this dataset. The following observations can be made:
There are over 4 million records in total. This is a huge dataset, so a sensible approach is to work with a sample.
There are some really long records, especially in the blogs dataset. If included in the sample, they would make creating n-grams far more complex than necessary. This calls for further analysis once the data is loaded into R; if such rows are a rare occurrence, we will explicitly exclude them.
The dataset has already been cleaned and contains only the text. We are going to treat each row as a separate document in our corpus.
Also, as the final predictive model is going to be exposed as a Shiny application, we need to be mindful of resource consumption, especially memory, CPU and garbage collection.
Now that we have a basic understanding of the data, let's load it into the R environment for further exploration.
blogsRaw <- readLines("/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.blogs.txt", skipNul = TRUE, warn = TRUE)
twitterRaw <- readLines("/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.twitter.txt",skipNul = TRUE, warn = TRUE)
newsRaw <- readLines("/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.news.txt", skipNul = TRUE, warn = TRUE)
Please note that you will get the following warning while loading the "news" dataset:
## Warning in readLines("/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.news.txt",
##: incomplete final line found on '/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.news.txt'
This is caused by a stray end-of-file control character (0x1A, four occurrences) in the file, which I removed manually in Notepad++ by searching for the hex value "1A".
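If you prefer not to edit the file by hand, the same cleanup can be done programmatically. The sketch below is one possible approach; newsPath is just a helper variable holding the same path used above, and the file is overwritten in place.
# Sketch: strip the stray SUB (0x1A) bytes from the news file
newsPath   <- "/Users/promisinganuj/Data/Technical/R/capstone/final/en_US/en_US.news.txt"
rawBytes   <- readBin(newsPath, what = "raw", n = file.info(newsPath)$size)
cleanBytes <- rawBytes[rawBytes != as.raw(0x1A)]  # keep everything except 0x1A bytes
writeBin(cleanBytes, newsPath)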
As we saw previously, there are a few very long rows in our blogs dataset. It is worth taking a closer look at them and potentially removing them; otherwise, if selected, they would blow up the document-feature matrix (DFM) we are going to build from the sample set.
summary(nchar(blogsRaw))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40833
summary(nchar(newsRaw))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 110.0 185.0 201.2 268.0 11384.0
For the "blogs" and "news" datasets, the 3rd quartile is 329 and 268 characters respectively. Given that, we will remove the extremely long rows from both datasets.
blogsRaw <- blogsRaw[!(nchar(blogsRaw) >= 10000)] # 6 rows deleted (based on intuition)
newsRaw <- newsRaw[!(nchar(newsRaw) >= 5000)] # 5 rows deleted (based on intuition)
summary(nchar(blogsRaw))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 47.0 156.0 229.8 329.0 9810.0
summary(nchar(newsRaw))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 110.0 185.0 201.1 268.0 4198.0
Before we proceed further, we need to sample the data for the sake of performance. As there are around 4 million records in total, we are going to sample 1% of the data on a proportionate basis.
set.seed(32124)
sampleSize <- 0.01
# Creating 1% sample
blogs <- blogsRaw [rbinom(n = length(blogsRaw), size = 1, prob = sampleSize) == 1]
news <- newsRaw [rbinom(n = length(newsRaw), size = 1, prob = sampleSize) == 1]
twitter <- twitterRaw[rbinom(n = length(twitterRaw), size = 1, prob = sampleSize) == 1]
# Combining the three datasets into 1
sampledData <- c(blogs, news, twitter)
# Removing the original datasets (no longer required)
rm(blogsRaw, newsRaw, twitterRaw, blogs, news, twitter, sampleSize)
length(sampledData)
## [1] 43049
head(sampledData, 3)
## [1] "I’ll wear pajamas and give up pajahmas."
## [2] "I have mixed emotions about the start to this book. As I’ve already mentioned I very much enjoyed the characters and the voice of the protagonist. But as much as I was enjoying it, it took me a little while to get into the story. We’re presented with a huge case of insta-love right from the start. Lucas, the mysterious and hot new neighbor, comes on strong. I’m pretty sure they start making out on their second or third meeting! I’m warning you now. There’s not a lot of rationalization for the insta-love…you won’t really understand the reasons until the end of the book. Which is why, in retrospect, I kinda like that I had to wait for all of the info to be revealed."
## [3] "to do this in remembrance of Him. Luke 22:19"
As we can see above, there are non-ASCII characters in the dataset which need to be cleaned.
For building our data pipeline, we will be using the “quanteda” package which is known for its performance. Let’s get started:
library(quanteda)
library(ggplot2)
library(gridExtra)
# setting parallel threads for "quanteda"
quanteda_options(threads = 7)
# First of all, convert to ASCII encoding
sampledData <- iconv(sampledData, "latin1", "ASCII", sub = "")
# Creating tokens
sampledTokens <- tokens(corpus(sampledData),
remove_punct = TRUE, # Removing punctuation
remove_twitter = TRUE, # Removing twitter # and @
remove_numbers = TRUE, # Remove numbers
remove_symbols = TRUE, # Remove symbols
remove_separators = TRUE, # Remove separators
remove_hyphens = TRUE, # Remove hyphens
remove_url = FALSE) # URLs are tokenised rather than removed ("remove_punct = TRUE" with "remove_url = FALSE")
# Get rid of sample vector
rm (sampledData)
# Making tokens lowercase, except when it's all UPPERCASE
sampledTokens <- tokens_tolower(sampledTokens, keep_acronyms = TRUE)
# Applying Stemmer
sampledTokens <- tokens_wordstem(sampledTokens)
# Removing stopwords
sampledTokens <- tokens_select(sampledTokens, pattern = stopwords('en'), selection = 'remove')
For our purpose, we are going to remove inappropriate words from the analysis. For that, we have included a list of banned words from Google.
profanityList <- readLines("/Users/promisinganuj/Data/Technical/R/capstone/final/profanity.txt", skipNul = TRUE, warn = FALSE)
# Removing profanity
sampledTokens <- tokens_select(sampledTokens, pattern = profanityList, selection = 'remove')
rm(profanityList)
# Garbage collection
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 2075155 110.9 5551816 296.5 NA 6939771 370.7
## Vcells 8173444 62.4 68597362 523.4 16384 107181955 817.8
Now that the tokens are ready, we can create n-grams. For our purpose, we are going to create unigrams, bigrams and trigrams.
# Creating n-gram tokens
monoGram <- tokens_ngrams(sampledTokens, n = 1)
biGram <- tokens_ngrams(sampledTokens, n = 2)
triGram <- tokens_ngrams(sampledTokens, n = 3)
# Creating DFM
monoGramDFM <- dfm(monoGram)
biGramDFM <- dfm(biGram)
triGramDFM <- dfm(triGram)
When exploring DFMs, a word cloud is the most common plot for getting first-hand insight into the data. Let's create one for the 1-gram DFM:
# Creating a word cloud for the 1-gram DFM
textplot_wordcloud(monoGramDFM, max_words = 100, color = rev(RColorBrewer::brewer.pal(10, "RdBu")))
We can also create a frequency plot to observe the actual occurrence count.
monoGramDFM %>%
textstat_frequency(n = 15) %>%
ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
coord_flip() +
labs(x = NULL, y = "Frequency") +
ggtitle("Top 15 1-Gram Frequency Plot") +
theme_minimal()
Similarly, we can create frequency plots for the 2-gram and 3-gram DFMs, as sketched below.
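One possible way to produce those plots is to reuse the same pattern through a small helper; plotTopFeatures is a hypothetical function, and gridExtra (loaded earlier) arranges the two panels side by side.
# Sketch: top-15 frequency plots for the 2-gram and 3-gram DFMs
plotTopFeatures <- function(dfmObj, plotTitle) {
  textstat_frequency(dfmObj, n = 15) %>%
    ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
    geom_point() +
    coord_flip() +
    labs(x = NULL, y = "Frequency") +
    ggtitle(plotTitle) +
    theme_minimal()
}
grid.arrange(plotTopFeatures(biGramDFM,  "Top 15 2-Gram Frequency Plot"),
             plotTopFeatures(triGramDFM, "Top 15 3-Gram Frequency Plot"),
             ncol = 2)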
The above graphs validate the intuitive expectation that usage frequency decreases as n increases in n-gram models.
Now that we have basic DFMs in place, we can start thinking about the predictive Shiny app. There is still a lot to be done. The idea behind predictive search is to suggest the next word, based on our trained model, as the user types. The model should also continuously adapt itself to newly entered combinations of words. One idea is to predict simply from the n-grams, where the order of n-gram used depends on the number of words the user has already entered, as sketched below.
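As a very rough illustration of that idea (not the final model), the sketch below builds frequency tables from the bigram and trigram DFMs and looks up the user's last one or two tokens, backing off from trigrams to bigrams when no match is found. predictNext is a hypothetical helper, and a real implementation would have to apply the same cleaning, stemming and stopword handling to the user's input.
# Sketch: simple n-gram back-off lookup (hypothetical predictNext helper)
biFreq  <- textstat_frequency(biGramDFM)
triFreq <- textstat_frequency(triGramDFM)

predictNext <- function(phrase, n = 3) {
  tokensIn <- tolower(unlist(strsplit(phrase, "\\s+")))
  len <- length(tokensIn)
  # Try trigrams first: match on the user's last two words
  if (len >= 2) {
    prefix <- paste(tokensIn[len - 1], tokensIn[len], sep = "_")
    hits <- triFreq[startsWith(triFreq$feature, paste0(prefix, "_")), ]
    if (nrow(hits) > 0) return(head(sub(".*_", "", hits$feature), n))
  }
  # Back off to bigrams: match on the last word only
  hits <- biFreq[startsWith(biFreq$feature, paste0(tokensIn[len], "_")), ]
  head(sub(".*_", "", hits$feature), n)
}

# Usage example: top 3 candidate next words for the phrase "new york"
predictNext("new york")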