Introduction

The goal of the Data Science Capstone is to bring together everything learned in the previous courses into one final course, preparing participants to work as real data scientists. For this capstone, a case from SwiftKey is used to test our skills, and we are required to complete a number of tasks along the way.

This report serves as a milestone report for the work I have done so far.

Getting ready

Before obtaining and analyzing the data, a couple of packages are loaded, along with a helper function for pretty printing of tables.

# Packages
library(magrittr)
library(stringi)
library(ggplot2)
library(tm)
library(RWeka)
library(knitr)
library(kableExtra)
library(dplyr)
library(wordcloud)

# Functions
## Print table, for pretty printing of tables
print_table <- function(dataframe, caption="Don't forget your caption!", font=12.5) {
  kable(dataframe,
    format="html",
    align=c("l", rep("c", ncol(dataframe)-1)),
    caption=caption) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"),
    full_width=T,
    font_size=font)
}
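
As an illustration, the helper can be called on any data frame; the example below uses the built-in iris data purely for demonstration.

# Example usage of the pretty-printing helper (illustrative only)
print_table(head(iris), caption = "First six rows of the iris dataset")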

Loading the data

To load the data into R, it was first stored locally. The data (Coursera-SwiftKey.zip) can be downloaded from the link provided on the Coursera course page.

By obtaining the paths of the locally stored data and using readLines(), the data can subsequently be read into R. This is done using the code below.

# Working directory
setwd("C:/Projects/Capstone")

# Path to files
blogpath <- paste0(getwd(), "/final/en_US/en_US.blogs.txt")
newspath <- paste0(getwd(), "/final/en_US/en_US.news.txt")
twitterpath <- paste0(getwd(), "/final/en_US/en_US.twitter.txt")

# Read data
blogdata <- readLines(blogpath, encoding = "UTF-8", skipNul = TRUE)
newsdata <- readLines(newspath, encoding = "UTF-8", skipNul = TRUE)
twitterdata <- readLines(twitterpath, encoding = "UTF-8", skipNul = TRUE)

To get an idea of the size of each file and the number of lines and words it contains, a few summary statistics are computed.
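A minimal sketch of how such a summary could be produced is shown below; it assumes file.size() for the size on disk and stri_count_words() from stringi for the word counts, and the helper summarise_text() is illustrative rather than the exact code used to generate the output.

# Summarise file size, line count and word count for a dataset
# (sketch; file.size() and stri_count_words() are assumptions about
# how the reported numbers were obtained)
summarise_text <- function(name, path, data) {
  print(paste0("Example of the ", name, ": ", data[1]))
  print(paste0("The ", name, " is ", file.size(path) / 10^6, " Mb in size.",
               " It contains ", length(data), " lines of text.",
               " The text consists of ", sum(stri_count_words(data)), " words."))
}

summarise_text("blogdata", blogpath, blogdata)
summarise_text("newsdata", newspath, newsdata)
summarise_text("twitterdata", twitterpath, twitterdata)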

[1] "Example of the blogdata: In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
[1] "The blogdata is 210.160014 Mb in size. It contains 899288 lines of text. The text consists of 37546246 words."
[1] "Example of the newsdata: <U+FEFF>He wasn't home alone, apparently."
[1] "The newsdata is 205.81189 Mb in size. It contains 77259 lines of text. The text consists of 2674536 words."
[1] "Example of the twitterdata: How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
[1] "The twitterdata is 167.105338 Mb in size. It contains 2360148 lines of text. The text consists of 30093410 words."

As can be seen from the output, the files are rather large and contain a lot of words. To prevent further computations from taking too long, a subsample of the data is obtained using the code below. A seed is set to make sure that the results are reproducible. For training the predictive model, a larger subset will be used.

# Generate sample
set.seed(711)
blogsample <- sample(blogdata, length(blogdata)*0.02, replace = FALSE)
newssample <- sample(newsdata, length(newsdata)*0.02, replace = FALSE)
twittersample <- sample(twitterdata, length(twitterdata)*0.02, replace = FALSE) 
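# Drop non-ASCII characters (e.g. emoji) from the Twitter sample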
twittersample <- sapply(twittersample, 
                        function(row) iconv(row, "latin1", "ASCII", sub=""))

Furthermore, the printed lines make it apparent that the text requires further cleaning: the examples contain abbreviations and punctuation, and, as stated on Coursera, the project requires a profanity filter. Hence, the next step is to clean the text data.

Data cleaning

For data cleaning purposes, it is easiest to combine the samples of the three datasets into one general text dataset. The combined sample is then put into a corpus for further processing.

# Combine samples into one dataset
df <- c(blogsample, newssample, twittersample)

# Transform the data into a corpus
corpus <- VCorpus(VectorSource(df))

Using the sampled datasets in the corpus, the text data can then be further cleaned, after which the exploratory data analysis can be performed. The cleaning is done in several steps:

  1. Clean the text using standard functions from the tm package
    1. Transform all characters to lowercase
    2. Remove numbers
    3. Remove punctuation
  2. Further clean the text using custom regex patterns
    1. Remove @-signs
    2. Remove links
  3. Remove common English stopwords from the text
  4. Apply the profanity filter
  5. Strip whitespace

To apply the profanity filter, a publicly available list of profane words is used, stored locally as a text file.

The code that applies the steps listed above can be found in the chunk below.

# Profanity
profanity_path <- "C:/Projects/Capstone/Task 2/Profanity_filter_text.txt"
profanity <- readLines(profanity_path, encoding="UTF-8", skipNul = TRUE)

# Custom regex
customRegex <- content_transformer(function(x, regex) {gsub(regex, "", x)})

# Wrapper to run all cleaning functions
clean_text <- function(corpus){
  
  # 1. First, all characters are converted to lowercase, numbers are removed and punctuation is removed, preserving intra word dashes
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
  
  # 2. Then, custom regex is used to remove @-signs, slashes and pipes, as well as links
  corpus <- tm_map(corpus, customRegex, "/|@|\\|")
  corpus <- tm_map(corpus, customRegex, "(f|ht)tp(s?)://(.*)[.][a-z]+")
  
  # 3. Next, stopwords are removed and the profanity filter is applied, using the previously loaded profanity text 
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords, profanity)
  
  # 4. Lastly, whitespaces are trimmed from the data
  corpus <- tm_map(corpus, stripWhitespace)
  
  # 5. Cleaning is completed, the corpus is returned
  return(corpus)
}

# Clean the text using the wrapper
corpus <- clean_text(corpus)
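
To check that the cleaning worked as intended, individual documents in the corpus can be inspected; a quick, illustrative sanity check:

# Inspect the first cleaned document
content(corpus[[1]])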

Now that the text data has been cleaned, the next step is to analyze the processed data.

Exploratory data analysis

Word frequencies

A first step in this analysis is to evaluate which words are used most often. This is done by generating a term-document matrix using the tm package, after which the 20 most frequently used words are obtained and printed using the printing function that was loaded in the first code chunk.

# Frequency of words
tdm <- TermDocumentMatrix(corpus)

frequencies <- tdm %>%
  removeSparseTerms(.99) %>%
  {sort(rowSums(as.matrix(.)), decreasing=T)} %>%
  {data.frame(Term = names(.), Frequency=.)} %>%
  `rownames<-`(c())

print_table(frequencies[1:20,], caption="Term frequencies of the top 20 most often-used terms")
Term frequencies of the top 20 most often-used terms
Term Frequency
just 4955
like 4415
will 4279
one 4179
can 3717
get 3590
time 3396
good 3017
love 2983
now 2878
day 2825
know 2790
new 2458
dont 2447
see 2238
back 2207
people 2175
great 2170
think 2048
make 2002

As can be seen in the table, the words just, like, will, one and can are used most often. To get an overall idea of the distribution of word frequencies, a bar plot of the 100 most frequently used words is shown below.

gdf <- ggplot(data=frequencies[1:100,], aes(x=reorder(Term, -Frequency), y=Frequency))
gdf + 
  geom_col(fill="#86BC25") +
  theme(panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.margin = unit(c(.5,.5,.5,.5), "cm"),
        axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  labs(x="", title = "Bar plot of the 100 most frequently used words") +
  scale_y_continuous(expand=c(0,0)) +
  scale_x_discrete(expand=c(0,0))

As can be seen from the plot, a few words are used very often. It is therefore interesting to evaluate how many unique words are required to cover 50% and 90% of all word occurrences in the sample. This is calculated below:

# Generate a full frequency table
words <- sapply(corpus, as.character)
words <- WordTokenizer(words) 

frequencies <- data.frame(table(words)) %>%
{.[order(.$Freq, decreasing = TRUE),]} %>%
  `colnames<-`(c("Term", "Frequency"))

# Obtain the sum of words used in the sample
sum_of_words <- sum(frequencies$Frequency)

# Calculate the 50% and 90% point
percent_50 <- sum_of_words*.5
percent_90 <- sum_of_words*.9

# Loop over the frequency table to evaluate how many words are needed for the 50% point
counter <- 0
wordcounter <- 1

while(counter < percent_50) {
  counter <- counter + frequencies[wordcounter, 2]
  wordcounter <- wordcounter + 1
}

print(paste0("The number of words required to cover 50% of the text is ", wordcounter-1))
## [1] "The number of words required to cover 50% of the text is 805"
# Loop over the frequency table to evaluate how many words are needed for the 90% point
counter <- 0
wordcounter <- 1

while(counter < percent_90) {
  counter <- counter + frequencies[wordcounter, 2]
  wordcounter <- wordcounter + 1
}

print(paste0("The number of words required to cover 90% of the text is ", wordcounter-1))
## [1] "The number of words required to cover 90% of the text is 15276"

As can be seen in the output above, 805 words are needed to cover 50% of the text and 15276 words are needed to cover 90% of the text. Note that this result only holds for this cleaned subsample of the real dataset; when using the full dataset, I expect these numbers to be quite different.
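
The same counts can also be obtained without an explicit loop, using a cumulative sum over the sorted frequency table. A minimal sketch, reusing the frequencies data frame from above (the variable names are illustrative):

# Cumulative share of all word occurrences, in decreasing order of frequency
coverage <- cumsum(frequencies$Frequency) / sum(frequencies$Frequency)

# First position where the cumulative share reaches 50% / 90%
words_for_50 <- which(coverage >= 0.5)[1]
words_for_90 <- which(coverage >= 0.9)[1]

print(paste0("Words needed for 50% coverage: ", words_for_50))
print(paste0("Words needed for 90% coverage: ", words_for_90))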

Bigrams

Next, we can generate a term-document matrix using the NGramTokenizer from RWeka to extract bigrams, i.e. pairs of consecutive words found in the text. By computing their frequencies, we can evaluate which word pairs occur most often in the subset of the data used for this report. In addition to a frequency table, the results are visualized as a word cloud.

# Term document matrix with bigrams
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

frequencies <- tdm2 %>%
  removeSparseTerms(.999) %>%
  {sort(rowSums(as.matrix(.)), decreasing=T)} %>%
  {data.frame(Bigram = names(.), Frequency=.)} %>%
  `rownames<-`(c())

print_table(frequencies[1:20,], caption="Term frequencies of the top 20 most often-used bigrams")
Term frequencies of the top 20 most often-used bigrams
Bigram Frequency
right now 405
cant wait 350
last night 273
dont know 272
feel like 239
looking forward 211
im going 198
first time 179
happy birthday 171
years ago 168
looks like 166
good morning 162
can get 154
im sure 154
just got 143
make sure 142
good luck 138
let know 138
dont think 133
new york 133
wordcloud(frequencies[1:40,]$Bigram, frequencies[1:40,]$Frequency, colors=brewer.pal(12, "Set3"),
          scale=c(5,0.1), random.order=F)

Trigrams

Lastly, it is interesting to evaluate combinations of three consecutive words that occur in the text, also known as trigrams. This is done below, after which the results are visualized in a bar plot.

# Trigrams
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

frequencies <- tdm3 %>%
  removeSparseTerms(.9999) %>%
  {sort(rowSums(as.matrix(.)), decreasing=T)} %>%
  {data.frame(Trigram = names(.), Frequency=.)} %>%
  `rownames<-`(c())

print_table(frequencies[1:20,], caption="Term frequencies of the top 20 most often-used trigrams")
Term frequencies of the top 20 most often-used trigrams
Trigram Frequency
cant wait see 70
happy mothers day 57
let us know 55
happy new year 41
im pretty sure 37
please please please 29
cinco de mayo 22
dont even know 21
new york city 21
love love love 20
looking forward seeing 19
keep good work 18
cant wait get 17
feel like im 17
im looking forward 16
look forward seeing 16
cant wait hear 15
cant wait till 14
just got back 13
makes feel like 13
gdf <- ggplot(data=frequencies[1:20,], aes(x=reorder(Trigram, Frequency), y=Frequency))
gdf + 
  geom_col(fill="#86BC25") +
  theme(panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.margin = unit(c(.5,.5,.5,.5), "cm")) +
  labs(x="", title = "Bar plot of the 20 most frequently used trigrams") +
  coord_flip() +
  scale_y_continuous(expand=c(0,0)) +
  scale_x_discrete(expand=c(0,0))

Next steps

Using the cleaned dataset, the next step is to develop a first predictive model. From there, it can be fine-tuned to achieve better results while reducing the time needed for computations. Currently, my plan is to build this first model on n-grams such as those explored above and to improve it iteratively.

Additionally, I would like to see whether a Word2vec algorithm can be used to predict the next word in a sentence. If so, I may experiment with that as well, to see if it can achieve better results than the n-gram approach.

Additional thoughts

Before proceeding to the predictive model, I want to experiment with different options for cleaning the text. Things that I am not yet certain about are: