Introduction

The goal of the Data Science Capstone is to bring together everything learned in the previous courses into one final course, preparing participants to work as real data scientists. For this capstone, a case from SwiftKey is used to test our skills, and we are required to complete a number of tasks along the way.

This report serves as a milestone report for the work I have done so far.

Getting ready

Before obtaining and analyzing the data, a couple of packages are loaded, along with a helper function for pretty printing of tables.

# Packages
library(magrittr)
library(stringi)
library(ggplot2)
library(tm)
library(RWeka)
library(knitr)
library(kableExtra)
library(dplyr)
library(wordcloud)

# Functions
## Print table, for pretty printing of tables
print_table <- function(dataframe, caption="Don't forget your caption!", font=12.5) {
  kable(dataframe,
    format="html",
    align=c("l", rep("c", ncol(dataframe)-1)),
    caption=caption) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
    full_width=T,
    font_size=font)
}
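
As an illustration, the helper can be called on any data frame; the example below uses the built-in iris data purely for demonstration.

# Example usage of the pretty-printing helper (illustrative only)
print_table(head(iris), caption = "First six rows of the iris dataset")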

Loading the data

To load the data into R, it was first stored locally. The data (Coursera-SwiftKey.zip) can be downloaded from the link provided on the Coursera course page.

By obtaining the paths of the locally stored data and using readLines(), the data can subsequently be read into R. This is done using the code below.

# Working directory
setwd("C:/Projects/Capstone")

# Path to files
blogpath <- paste0(getwd(), "/final/en_US/en_US.blogs.txt")
newspath <- paste0(getwd(), "/final/en_US/en_US.news.txt")
twitterpath <- paste0(getwd(), "/final/en_US/en_US.twitter.txt")

# Read data
blogdata <- readLines(blogpath, encoding = "UTF-8", skipNul = TRUE)
newsdata <- readLines(newspath, encoding = "UTF-8", skipNul = TRUE)
twitterdata <- readLines(twitterpath, encoding = "UTF-8", skipNul = TRUE)

To get an idea of the size of each file and the number of lines and words it contains, a few summary statistics are computed.
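A minimal sketch of how such a summary could be produced is shown below; it assumes file.size() for the size on disk and stri_count_words() from stringi for the word counts, and the helper summarise_text() is illustrative rather than the exact code used to generate the output.

# Summarise file size, line count and word count for a dataset
# (sketch; file.size() and stri_count_words() are assumptions about
# how the reported numbers were obtained)
summarise_text <- function(name, path, data) {
  print(paste0("Example of the ", name, ": ", data[1]))
  print(paste0("The ", name, " is ", file.size(path) / 10^6, " Mb in size.",
               " It contains ", length(data), " lines of text.",
               " The text consists of ", sum(stri_count_words(data)), " words."))
}

summarise_text("blogdata", blogpath, blogdata)
summarise_text("newsdata", newspath, newsdata)
summarise_text("twitterdata", twitterpath, twitterdata)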

[1] "Example of the blogdata: In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
[1] "The blogdata is 210.160014 Mb in size. It contains 899288 lines of text. The text consists of 37546246 words."
[1] "Example of the newsdata: <U+FEFF>He wasn't home alone, apparently."
[1] "The newsdata is 205.81189 Mb in size. It contains 77259 lines of text. The text consists of 2674536 words."
[1] "Example of the twitterdata: How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
[1] "The twitterdata is 167.105338 Mb in size. It contains 2360148 lines of text. The text consists of 30093410 words."

As can be seen from the output, the files are rather large and contain a lot of words. To prevent further computations from taking too long, a subsample of the data is obtained using the code below. A seed is set to make sure that the results are reproducible. For training the predictive model, a larger subset will be used.

# Generate sample
set.seed(711)
blogsample <- sample(blogdata, length(blogdata)*0.02, replace = FALSE)
newssample <- sample(newsdata, length(newsdata)*0.02, replace = FALSE)
twittersample <- sample(twitterdata, length(twitterdata)*0.02, replace = FALSE) 
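# Drop non-ASCII characters (e.g. emoji) from the Twitter sample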
twittersample <- sapply(twittersample, 
                        function(row) iconv(row, "latin1", "ASCII", sub=""))

Furthermore, the printed lines make it apparent that the text requires further cleaning: the examples contain abbreviations and punctuation, and, as stated on Coursera, the project requires a profanity filter. Hence, the next step is to clean the text data.

Data cleaning

For data cleaning purposes, it is easiest to combine the samples of the three datasets into one general text dataset. The combined sample is then put into a corpus for further processing.

# Combine samples into one dataset
df <- c(blogsample, newssample, twittersample)

# Transform the data into a corpus
corpus <- VCorpus(VectorSource(df))

Using the sampled datasets in the corpus, the text data can then be further cleaned, after which the exploratory data analysis can be performed. The cleaning is done in several steps:

  1. Clean the text using standard functions from the tm package
    1. Transform all characters to lowercase
    2. Remove numbers
    3. Remove punctuation
  2. Further clean the text using custom regex patterns
    1. Remove @-signs
    2. Remove links
  3. Remove common English stopwords from the text
  4. Apply the profanity filter
  5. Strip whitespace

To apply the profanity filter, a publicly available list of profane words is used, stored locally as a text file.

The code that applies the steps listed above can be found in the chunk below.

# Profanity
profanity_path <- "C:/Projects/Capstone/Task 2/Profanity_filter_text.txt"
profanity <- readLines(profanity_path, encoding="UTF-8", skipNul = TRUE)

# Custom regex
customRegex <- content_transformer(function(x, regex) {gsub(regex, "", x)})

# Wrapper to run all cleaning functions
clean_text <- function(corpus){
  
  # 1. First, all characters are converted to lowercase, numbers are removed and punctuation is removed, preserving intra word dashes
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
  
  # 2. Then, custom regex is used to remove @-signs, slashes and pipes, as well as links
  corpus <- tm_map(corpus, customRegex, "/|@|\\|")
  corpus <- tm_map(corpus, customRegex, "(f|ht)tp(s?)://(.*)[.][a-z]+")
  
  # 3. Next, stopwords are removed and the profanity filter is applied, using the previously loaded profanity text 
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords, profanity)
  
  # 4. Lastly, whitespaces are trimmed from the data
  corpus <- tm_map(corpus, stripWhitespace)
  
  # 5. Cleaning is completed, the corpus is returned
  return(corpus)
}

# Clean the text using the wrapper
corpus <- clean_text(corpus)
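
To check that the cleaning worked as intended, individual documents in the corpus can be inspected; a quick, illustrative sanity check:

# Inspect the first cleaned document
content(corpus[[1]])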

Now that the text data has been cleaned, the next step is to analyze the processed data.

Exploratory data analysis

Word frequencies

A first step in this analysis is to evaluate which words are used most often. This is done by generating a term-document matrix using the tm package, after which the 20 most frequently used words are obtained and printed using the printing function that was loaded in the first code chunk.

# Frequency of words
tdm <- TermDocumentMatrix(corpus)

frequencies <- tdm %>%
  removeSparseTerms(.99) %>%
  {sort(rowSums(as.matrix(.)), decreasing=T)} %>%
  {data.frame(Term = names(.), Frequency=.)} %>%
  `rownames<-`(c())

print_table(frequencies[1:20,], caption="Term frequencies of the top 20 most often-used terms")
Term frequencies of the top 20 most often-used terms
Term Frequency
just 4955
like 4415
will 4279
one 4179
can 3717
get 3590
time 3396
good 3017
love 2983
now 2878
day 2825
know 2790
new 2458
dont 2447
see 2238
back 2207
people 2175
great 2170
think 2048
make 2002

As can be seen in the table, the words just, like, will, one and can are used most often. To get an overall idea of the distribution of word frequencies, a bar plot of the 100 most frequently used words is shown below.

gdf <- ggplot(data=frequencies[1:100,], aes(x=reorder(Term, -Frequency), y=Frequency))
gdf + 
  geom_col(fill="#86BC25") +
  theme(panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.margin = unit(c(.5,.5,.5,.5), "cm"),
        axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  labs(x="", title = "Bar plot of the 100 most frequently used words") +
  scale_y_continuous(expand=c(0,0)) +
  scale_x_discrete(expand=c(0,0))

As can be seen from the plot, a few words are used very often. It is therefore interesting to evaluate how many unique words are required to cover 50% and 90% of all word occurrences in the sample. This is calculated below:

# Generate a full frequency table
words <- sapply(corpus, as.character)
words <- WordTokenizer(words) 

frequencies <- data.frame(table(words)) %>%
{.[order(.$Freq, decreasing = TRUE),]} %>%
  `colnames<-`(c("Term", "Frequency"))

# Obtain the sum of words used in the sample
sum_of_words <- sum(frequencies$Frequency)

# Calculate the 50% and 90% point
percent_50 <- sum_of_words*.5
percent_90 <- sum_of_words*.9

# Loop over the frequency table to evaluate how many words are needed for the 50% point
counter <- 0
wordcounter <- 1

while(counter < percent_50) {
  counter <- counter + frequencies[wordcounter, 2]
  wordcounter <- wordcounter + 1
}

print(paste0("The number of words required to cover 50% of the text is ", wordcounter-1))
## [1] "The number of words required to cover 50% of the text is 805"
# Loop over the frequency table to evaluate how many words are needed for the 90% point
counter <- 0
wordcounter <- 1

while(counter < percent_90) {
  counter <- counter + frequencies[wordcounter, 2]
  wordcounter <- wordcounter + 1
}

print(paste0("The number of words required to cover 90% of the text is ", wordcounter-1))
## [1] "The number of words required to cover 90% of the text is 15276"

As can be seen in the output above, 805 words are needed to cover 50% of the text and 15276 words are needed to cover 90% of the text. Note that this result only holds for this cleaned subsample of the real dataset; when using the full dataset, I expect these numbers to be quite different.
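
The same counts can also be obtained without an explicit loop, using a cumulative sum over the sorted frequency table. A minimal sketch, reusing the frequencies data frame from above (the variable names are illustrative):

# Cumulative share of all word occurrences, in decreasing order of frequency
coverage <- cumsum(frequencies$Frequency) / sum(frequencies$Frequency)

# First position where the cumulative share reaches 50% / 90%
words_for_50 <- which(coverage >= 0.5)[1]
words_for_90 <- which(coverage >= 0.9)[1]

print(paste0("Words needed for 50% coverage: ", words_for_50))
print(paste0("Words needed for 90% coverage: ", words_for_90))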

Bigrams

Next, we can generate a term-document matrix using the NGramTokenizer from RWeka to extract bigrams, i.e. pairs of consecutive words found in the text. By computing their frequencies, we can evaluate which word pairs occur most often in the subset of the data used for this report. In addition to a frequency table, the results are visualized as a word cloud.

# Term document matrix with bigrams
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

frequencies <- tdm2 %>%
  removeSparseTerms(.999) %>%
  {sort(rowSums(as.matrix(.)), decreasing=T)} %>%
  {data.frame(Bigram = names(.), Frequency=.)} %>%
  `rownames<-`(c())

print_table(frequencies[1:20,], caption="Term frequencies of the top 20 most often-used bigrams")
Term frequencies of the top 20 most often-used bigrams
Bigram Frequency
right now 405
cant wait 350
last night 273
dont know 272
feel like 239
looking forward 211
im going 198
first time 179
happy birthday 171
years ago 168
looks like 166
good morning 162
can get 154
im sure 154
just got 143
make sure 142
good luck 138
let know 138
dont think 133
new york 133
wordcloud(frequencies[1:40,]$Bigram, frequencies[1:40,]$Frequency, colors=brewer.pal(12, "Set3"),
          scale=c(5,0.1), random.order=F)

Trigrams

Lastly, it is interesting to evaluate combinations of three consecutive words that occur in the text, also known as trigrams. This is done below, after which the results are visualized in a bar plot.

# Trigrams
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

frequencies <- tdm3 %>%
  removeSparseTerms(.9999) %>%
  {sort(rowSums(as.matrix(.)), decreasing=T)} %>%
  {data.frame(Trigram = names(.), Frequency=.)} %>%
  `rownames<-`(c())

print_table(frequencies[1:20,], caption="Term frequencies of the top 20 most often-used trigrams")
Term frequencies of the top 20 most often-used trigrams
Trigram Frequency
cant wait see 70
happy mothers day 57
let us know 55
happy new year 41
im pretty sure 37
please please please 29
cinco de mayo 22
dont even know 21
new york city 21
love love love 20
looking forward seeing 19
keep good work 18
cant wait get 17
feel like im 17
im looking forward 16
look forward seeing 16
cant wait hear 15
cant wait till 14
just got back 13
makes feel like 13
gdf <- ggplot(data=frequencies[1:20,], aes(x=reorder(Trigram, Frequency), y=Frequency))
gdf + 
  geom_col(fill="#86BC25") +
  theme(panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.margin = unit(c(.5,.5,.5,.5), "cm")) +
  labs(x="", title = "Bar plot of the 20 most frequently used trigrams") +
  coord_flip() +
  scale_y_continuous(expand=c(0,0)) +
  scale_x_discrete(expand=c(0,0))

Next steps

Using the cleaned dataset, the next step is to develop a first predictive model. From there, it can be fine-tuned to achieve better results while reducing the time needed for computations. Currently, my plan is to build this first model on n-grams such as those explored above and to improve it iteratively.

Additionally, I would like to see whether a Word2vec algorithm can be used to predict the next word in a sentence. If so, I may experiment with that as well, to see if it can achieve better results than the n-gram approach.

Additional thoughts

Before proceeding to the predictive model, I want to experiment with different options for cleaning the text. Things that I am not yet certain about are: