This is the Milestone Report for the Coursera Data Science Capstone project. In this capstone, we apply data science in the area of natural language processing (NLP). The project is sponsored by SwiftKey.
This milestone report describes the exploratory data analysis of the Capstone Dataset.
The following tasks have been performed for this report.
# Preload necessary R libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(stringi)
library(SnowballC)
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
The data come from HC Corpora and are available in four languages, but only English will be used. The English dataset consists of three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. The data were downloaded from the Coursera link to the local machine and are read from local disk.
if(!file.exists("Coursera-SwiftKey.zip")) {
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
}
con <- file("final/en_US/en_US.blogs.txt","r")
blogs <- readLines(con)
close(con)
con <- file("final/en_US/en_US.news.txt","rb")
news <- readLines(con)
close(con)
con <- file("final/en_US/en_US.twitter.txt","r")
twitter <- readLines(con, skipNul = TRUE)
close(con)
Calculate some summary statistics for each file: size in megabytes, number of entries (lines), total characters, total words, and the length of the longest entry.
# Get file sizes
blogs_size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news_size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter_size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
pop_summary <- data.frame('File' = c("Blogs","News","Twitter"),
                          'FileSizeinMB' = c(blogs_size, news_size, twitter_size),
                          'NumberofLines' = sapply(list(blogs, news, twitter), function(x){length(x)}),
                          'TotalCharacters' = sapply(list(blogs, news, twitter), function(x){sum(nchar(x))}),
                          'TotalWords' = sapply(list(blogs, news, twitter), stri_stats_latex)[4,],
                          'MaxCharacters' = sapply(list(blogs, news, twitter), function(x){max(unlist(lapply(x, function(y) nchar(y))))})
)
pop_summary
## File FileSizeinMB NumberofLines TotalCharacters TotalWords MaxCharacters
## 1 Blogs 200.4242 899288 208361438 37746231 40835
## 2 News 196.2775 1010242 203791405 34623841 11384
## 3 Twitter 159.3641 2360148 162385035 30556566 213
This dataset is fairly large, and it is not necessary to load the entire dataset in order to build the algorithms. At least initially, a smaller subset of the data is sufficient.
To build models we do not need to load in and use all of the data. Often relatively few randomly selected rows or chunks are enough to get an accurate approximation to the results that would be obtained using all the data.
Since the data are so big (see the population summary table above), we will only proceed with a subset (e.g. 5% of each file), as running the calculations on the full files would be very slow. We then clean the data and convert it to a corpus.
set.seed(2020)
# Remove all non-English characters as they cause issues
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
# Set sample percentage
percent <- 0.05
# Binomial sampling of the data and create the relevant files
samp_blogs <- blogs[as.logical(rbinom(length(blogs),1,percent))]
samp_news <- news[as.logical(rbinom(length(news),1,percent))]
samp_twitter <- twitter[as.logical(rbinom(length(twitter),1,percent))]
samp_data <- c(samp_blogs,samp_news,samp_twitter)
# remove temporary variables
rm(blogs, news, twitter, samp_blogs, samp_news, samp_twitter)
The final selected text data needs to be cleaned before it can be used in the word prediction model.
The data can be cleaned using techniques such as removing whitespace, numbers, URLs, punctuation, and profanity.
sample_data <- VCorpus(VectorSource(samp_data)) # load the data as a corpus
sample_data <- tm_map(sample_data, content_transformer(tolower))
# Removing Profanity Words using one of the available dictionaries of 1384 words.
profanity_words <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
sample_data <- tm_map(sample_data,removeWords, profanity_words)
# Replacing special chars with space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
sample_data <- tm_map(sample_data, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
sample_data <- tm_map(sample_data, toSpace, "@[^\\s]+")
# sample_data <- tm_map(sample_data, tolower) # redundant: lowercase conversion already applied above via content_transformer(tolower)
#sample_data <- tm_map(sample_data, removeWords, stopwords("en"))#remove english stop words
sample_data <- tm_map(sample_data, removePunctuation) # remove punctuation
sample_data <- tm_map(sample_data, removeNumbers) # remove numbers
sample_data <- tm_map(sample_data, stripWhitespace) # remove extra whitespaces
#sample_data <- tm_map(sample_data, stemDocument) # initiate stemming
sample_data <- tm_map(sample_data, PlainTextDocument)
sample_corpus <- data.frame(text=unlist(sapply(sample_data,'[',"content")),stringsAsFactors = FALSE)
head(sample_corpus)
## text
## 1 the one thing that was astounding though was the support from the marshals they were all phenomenal for being out in the rain for so long and remaining so cheery and supportive they were an amazing bunch owing to the nature of the course and very few closed roads meant that supporters who knew the area were able to skip around the course and a few people were seen about or times which was also super
## 2 the anc had earlier approached the south gauteng high court to grant us leave to appeal its earlier ruling that the singing of the freedom song was unconstitutional and fitted the of incitement
## 3 i have it on good authority that everyone in this company particularly the strikers recieved gifts in the companyorganised secret santa i also know that the strikers attended the companyfunded xmas party where they partook in the companyfunded luncheon they also recieved vouchers to the value of k from the company as xmas gifts
## 4 and i very much doubt if it will end there because of all the foregoing factors and the immediate crisis murdoch faces bskybs movers know that he must do this dealor wind up in very bad shape as a business there may also come a point at which bskyb find murdoch himself simply too toxic to have as a buyer anyway
## 5 im sad you cant be with us but we know youll be watching i dont know if you ever got to eat at longhorn or not but i hear they cook a really mean steak
## 6 len woodbyrne joe oreilly development at francis street dublin central dublin has a range of apartments on offer
Now the corpus sample_data holds the cleaned data. We need to convert this cleaned data into a format that is most useful for NLP: N-grams stored in term-document or document-term matrices. We use a Document Term Matrix (DTM) representation: documents as the rows, terms/words as the columns, and the frequency of the term in the document as the entries. Because of the number of unique words in the corpus, the dimensions can be large. N-gram models are created to explore word frequencies. We use tokenizers built on the ngrams() function from the NLP package to create unigrams, bigrams, trigrams, and quadgrams.
review_dtm <- DocumentTermMatrix(sample_data)
review_dtm
## <<DocumentTermMatrix (documents: 213068, terms: 140393)>>
## Non-/sparse entries: 3402264/29909853460
## Sparsity : 100%
## Maximal term length: 94
## Weighting : term frequency (tf)
Unigram analysis shows which words are the most frequent and what their frequencies are. A unigram is a single word.
unigrams <- DocumentTermMatrix(sample_data, control = list(tokenize = "words"))
Bigram analysis shows which word combinations are the most frequent and what their frequencies are. A bigram is a two-word combination.
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
bigrams <- DocumentTermMatrix(sample_data, control = list(tokenize = BigramTokenizer))
Trigram analysis shows which word combinations are the most frequent and what their frequencies are. A trigram is a three-word combination.
trigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
trigrams <- DocumentTermMatrix(sample_data, control = list(tokenize = trigramTokenizer))
Quadgram analysis shows which word combinations are the most frequent and what their frequencies are. A quadgram is a four-word combination.
quadgramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)
}
quadgrams <- DocumentTermMatrix(sample_data, control = list(tokenize = quadgramTokenizer))
Now we can perform exploratory analysis on the tidy data. For each document-term matrix, we list the most common unigrams, bigrams, trigrams, and quadgrams. It is interesting and helpful to find the most frequently occurring words in the data. Sparse terms are dropped with removeSparseTerms() so the matrices are small enough to convert to dense form for the frequency counts.
unigrams_frequency <- sort(colSums(as.matrix(removeSparseTerms(unigrams, 0.99))),decreasing = TRUE)
unigrams_freq_df <- data.frame(word = names(unigrams_frequency), frequency = unigrams_frequency)
head(unigrams_freq_df, 10)
## word frequency
## the the 237452
## and and 119657
## for for 54524
## that that 51856
## you you 46562
## with with 35747
## was was 31165
## this this 27012
## have have 26675
## are are 24743
unigrams_freq_df %>%
  filter(frequency > 10000) %>%
  ggplot(aes(reorder(word, -frequency), frequency)) +
  geom_bar(stat = "identity") +
  ggtitle("Unigrams with frequencies > 10000") +
  xlab("Unigrams") + ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
bigrams_frequency <- sort(colSums(as.matrix(removeSparseTerms(bigrams, 0.995))),decreasing = TRUE)
bigrams_freq_df <- data.frame(word = names(bigrams_frequency), frequency = bigrams_frequency)
head(bigrams_freq_df, 10)
## word frequency
## of the of the 21650
## in the in the 20492
## to the to the 10803
## for the for the 9994
## on the on the 9908
## to be to be 8219
## at the at the 7094
## and the and the 6241
## in a in a 5907
## with the with the 5263
Here we create a generic function to plot the top 30 frequencies; it is used for the bigram, trigram, and quadgram plots below.
hist_plot <- function(data, label) {
  ggplot(data[1:30,], aes(reorder(word, -frequency), frequency)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("grey50"))
}
hist_plot(bigrams_freq_df, "30 Most Common Bigrams")
trigrams_frequency <- sort(colSums(as.matrix(removeSparseTerms(trigrams, 0.999))),decreasing = TRUE)
trigrams_freq_df <- data.frame(word = names(trigrams_frequency), frequency = trigrams_frequency)
head(trigrams_freq_df, 10)
## word frequency
## one of the one of the 1737
## a lot of a lot of 1399
## thanks for the thanks for the 1180
## to be a to be a 896
## going to be going to be 873
## out of the out of the 771
## the end of the end of 767
## i want to i want to 730
## be able to be able to 682
## it was a it was a 669
hist_plot(trigrams_freq_df, "30 Most Common Trigrams")
quadgrams_frequency <- sort(colSums(as.matrix(removeSparseTerms(quadgrams, 0.9997))),decreasing = TRUE)
quadgrams_freq_df <- data.frame(word = names(quadgrams_frequency), frequency = quadgrams_frequency)
head(quadgrams_freq_df, 10)
## word frequency
## the end of the the end of the 406
## at the end of at the end of 332
## the rest of the the rest of the 327
## for the first time for the first time 305
## thanks for the follow thanks for the follow 291
## at the same time at the same time 279
## one of the most one of the most 230
## is going to be is going to be 220
## to be able to to be able to 203
## when it comes to when it comes to 203
hist_plot(quadgrams_freq_df, "30 Most Common Quadgrams")
This concludes the exploratory analysis. As a next step, a prediction model will be created and integrated into a Shiny app for word prediction.
The corpus has been converted to N-grams stored in Document Term Matrices (DTMs) and then to data frames of frequencies. This format should be useful for predicting the next word in a sequence of words. For example, given a string of three words, the most likely next word can be guessed by looking up all 4-grams that start with those three words and choosing the most frequent one.
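As a rough illustration of this lookup, the sketch below searches the quadgrams_freq_df data frame built above for quadgrams that begin with a given three-word phrase and returns the last word of the most frequent matches. The function predict_next_word is hypothetical; the final model will also need back-off to lower-order N-grams and some form of smoothing.
# Minimal sketch of a quadgram lookup (assumes quadgrams_freq_df with columns 'word' and 'frequency')
predict_next_word <- function(phrase, ngram_df = quadgrams_freq_df, top_n = 3) {
  phrase <- tolower(trimws(phrase))
  # keep quadgrams whose first three words match the input phrase
  hits <- ngram_df[grepl(paste0("^", phrase, " "), ngram_df$word), ]
  if (nrow(hits) == 0) return(character(0)) # a back-off to trigrams/bigrams would go here
  hits <- hits[order(-hits$frequency), ]
  # return the final word of the top_n most frequent matching quadgrams
  sapply(strsplit(as.character(head(hits$word, top_n)), " "), tail, 1)
}
predict_next_word("the end of") # should suggest "the", the top quadgram in the table above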
For the Shiny application, the plan is to create an app with a simple interface where the user can enter a string of text; the prediction model will then suggest a list of the most likely next words.
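A minimal sketch of what such an interface might look like, reusing the hypothetical predict_next_word() function above (the layout and names are illustrative only, not the final design):
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  verbatimTextOutput("suggestions") # list of suggested next words
)
server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$phrase)
    predict_next_word(input$phrase) # hypothetical prediction function sketched above
  })
}
# shinyApp(ui, server)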