The goal of the Data Science Capstone course is to bring together everything from the previous courses into one final course, in which participants are prepared to work as real data scientists. For this course, a case from SwiftKey will be used to test our skills. During the course, we are required to complete a number of tasks.
This report serves as a milestone report for the work I have done so far.
Before obtaining and analyzing the data, a couple of packages are loaded, along with a function for pretty printing of tables.
# Packages
library(magrittr)
library(stringi)
library(ggplot2)
library(tm)
library(RWeka)
library(knitr)
library(kableExtra)
library(dplyr)
library(wordcloud)
# Functions
## Print table, for pretty printing of tables
print_table <- function(dataframe, caption="Don't forget your caption!", font=12.5) {
  kable(dataframe,
        format="html",
        align=c("l", rep("c", ncol(dataframe)-1)),
        caption=caption) %>%
    kable_styling(bootstrap_options=c("striped", "hover", "responsive"),
                  full_width=TRUE,
                  font_size=font)
}
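As a quick illustration of the helper, it can be called on any data frame; the example below uses the built-in mtcars dataset rather than the course data.
# Example call on a built-in dataset
print_table(head(mtcars), caption="First six rows of mtcars")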
To load the data into RStudio, it was first stored locally. The data can be downloaded using this link.
By obtaining the paths of the locally stored data and using readLines(), the data can subsequently be read into R. This is done using the code below.
# Working directory
setwd("C:/Projects/Capstone")
# Path to files
blogpath <- paste0(getwd(), "/final/en_US/en_US.blogs.txt")
newspath <- paste0(getwd(), "/final/en_US/en_US.news.txt")
twitterpath <- paste0(getwd(), "/final/en_US/en_US.twitter.txt")
# Read data
blogdata <- readLines(blogpath, encoding="UTF-8", skipNul=TRUE)
newsdata <- readLines(newspath, encoding="UTF-8", skipNul=TRUE)
twitterdata <- readLines(twitterpath, encoding="UTF-8", skipNul=TRUE)
To get an idea of the size of the data and the number of lines of text it contains, the following code is run.
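A sketch of such a chunk is shown below. The helper name describe_text() and the use of file.size() and stringi::stri_count_words() are assumptions on my part, so the exact word totals may differ slightly depending on the tokenizer.
# Sketch of a summary helper (assumed implementation, see note above)
describe_text <- function(name, path, data) {
  # Show the first line as an example of the content
  print(paste0("Example of the ", name, ": ", data[1]))
  # Report the file size (in millions of bytes), number of lines and number of words
  print(paste0("The ", name, " is ", file.size(path) / 10^6, " Mb in size.",
               " It contains ", length(data), " lines of text.",
               " The text consists of ", sum(stri_count_words(data)), " words."))
}
describe_text("blogdata", blogpath, blogdata)
describe_text("newsdata", newspath, newsdata)
describe_text("twitterdata", twitterpath, twitterdata)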
[1] "Example of the blogdata: In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
[1] "The blogdata is 210.160014 Mb in size. It contains 899288 lines of text. The text consists of 37546246 words."
[1] "Example of the newsdata: <U+FEFF>He wasn't home alone, apparently."
[1] "The newsdata is 205.81189 Mb in size. It contains 77259 lines of text. The text consists of 2674536 words."
[1] "Example of the twitterdata: How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
[1] "The twitterdata is 167.105338 Mb in size. It contains 2360148 lines of text. The text consists of 30093410 words."
As can be seen from the output, the files are rather large and contain a lot of words. To prevent further computations from taking too long, a subsample of the data is obtained using the code below. A seed is set to make sure that the results are reproducible. For training the predictive model, a larger subset will be used.
# Generate sample
set.seed(711)
blogsample <- sample(blogdata, length(blogdata)*0.02, replace=FALSE)
newssample <- sample(newsdata, length(newsdata)*0.02, replace=FALSE)
twittersample <- sample(twitterdata, length(twitterdata)*0.02, replace=FALSE)
# Strip non-ASCII characters from the twitter sample
twittersample <- sapply(twittersample,
                        function(row) iconv(row, "latin1", "ASCII", sub=""))
Furthermore, the printed lines make it apparent that the text requires further cleaning: the examples show the use of abbreviations and punctuation, and the project requires a profanity filter, as stated on Coursera. Hence, the next step is to clean the text data.
For data cleaning purposes, it is easiest to combine the samples of the three datasets into one general text dataset. This combined sample is then put into a corpus for further processing.
# Combine samples into one dataset
df <- c(blogsample, newssample, twittersample)
# Transform the data into a corpus
corpus <- VCorpus(VectorSource(df))
Using the sampled datasets in the corpus, the text data can then be further cleaned, after which the exploratory data analysis can be performed. The cleaning is done in several steps using the tm package: first, all characters are converted to lowercase and numbers and punctuation are removed (preserving intra-word dashes); next, a custom regex removes @-signs and links; then, English stopwords are removed and the profanity filter is applied; finally, excess whitespace is stripped.
To apply the profanity filter, a text dataset is used which can be found here.
The code to apply the steps listed above can be found in the chunk below.
# Profanity
profanity_path <- "C:/Projects/Capstone/Task 2/Profanity_filter_text.txt"
profanity <- readLines(profanity_path, encoding="UTF-8", skipNul=TRUE)
# Custom regex
customRegex <- content_transformer(function(x, regex) {gsub(regex, "", x)})
# Wrapper to run all cleaning functions
clean_text <- function(corpus){
  # 1. First, all characters are converted to lowercase, numbers are removed
  #    and punctuation is removed, preserving intra-word dashes
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
  # 2. Then, custom regex is used to remove @-signs and links
  corpus <- tm_map(corpus, customRegex, "/|@|\\|")
  corpus <- tm_map(corpus, customRegex, "(f|ht)tp(s?)://(.*)[.][a-z]+")
  # 3. Next, stopwords are removed and the profanity filter is applied,
  #    using the previously loaded profanity text
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords, profanity)
  # 4. Lastly, whitespaces are trimmed from the data
  corpus <- tm_map(corpus, stripWhitespace)
  # 5. Cleaning is completed, the corpus is returned
  return(corpus)
}
# Clean the text using the wrapper
corpus <- clean_text(corpus)
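As a quick sanity check of the cleaning (an optional inspection step), the first few cleaned documents can be printed:
# Inspect the first three cleaned documents
sapply(corpus[1:3], as.character)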
Now that the text data has been cleaned, the next step is to analyze the processed data.
A first step in this analysis is to evaluate which words are used most often. This is done by generating a term-document matrix using the tm package, after which the top 20 most often used words are obtained and printed using the printing function that was loaded in the first code chunk.
# Frequency of words
tdm <- TermDocumentMatrix(corpus)
frequencies <- tdm %>%
  removeSparseTerms(.99) %>%
  {sort(rowSums(as.matrix(.)), decreasing=TRUE)} %>%
  {data.frame(Term=names(.), Frequency=.)} %>%
  `rownames<-`(c())
print_table(frequencies[1:20,], caption="Term frequencies of the top 20 most often-used terms")
| Term | Frequency |
|---|---|
| just | 4955 |
| like | 4415 |
| will | 4279 |
| one | 4179 |
| can | 3717 |
| get | 3590 |
| time | 3396 |
| good | 3017 |
| love | 2983 |
| now | 2878 |
| day | 2825 |
| know | 2790 |
| new | 2458 |
| dont | 2447 |
| see | 2238 |
| back | 2207 |
| people | 2175 |
| great | 2170 |
| think | 2048 |
| make | 2002 |
As can be seen in the table, the words just, like, will, one and can are used most often. To get an overall idea of the distribution of the word frequencies, a histogram of the top 100 most often used words is plotted below.
gdf <- ggplot(data=frequencies[1:100,], aes(x=reorder(Term, -Frequency), y=Frequency))
gdf +
  geom_col(fill="#86BC25") +
  theme(panel.background=element_blank(),
        plot.title=element_text(hjust=0.5),
        plot.margin=unit(c(.5,.5,.5,.5), "cm"),
        axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  labs(x="", title="Histogram of top 100 most often used words") +
  scale_y_continuous(expand=c(0,0)) +
  scale_x_discrete(expand=c(0,0))
As can be seen from the plot, there are a few words which are used very often. In that sense, it is interesting to evaluate how many words are required to cover 50% and 90% of all text in the sample. This is calculated below:
# Generate a full frequency table
words <- sapply(corpus, as.character)
words <- WordTokenizer(words)
frequencies <- data.frame(table(words)) %>%
  {.[order(.$Freq, decreasing=TRUE),]} %>%
  `colnames<-`(c("Term", "Frequency"))
# Obtain the sum of words used in the sample
sum_of_words <- sum(frequencies$Frequency)
# Calculate the 50% and 90% point
percent_50 <- sum_of_words*.5
percent_90 <- sum_of_words*.9
# Loop over the frequency table to evaluate how many words are needed for the 50% point
counter <- 0
wordcounter <- 1
while(counter < percent_50) {
  counter <- counter + frequencies[wordcounter, 2]
  wordcounter <- wordcounter + 1
}
print(paste0("The number of words required to cover 50% of the text is ", wordcounter-1))
## [1] "The number of words required to cover 50% of the text is 805"
# Loop over the frequency table to evaluate how many words are needed for the 90% point
counter <- 0
wordcounter <- 1
while(counter < percent_90) {
  counter <- counter + frequencies[wordcounter, 2]
  wordcounter <- wordcounter + 1
}
print(paste0("The number of words required to cover 90% of the text is ", wordcounter-1))
## [1] "The number of words required to cover 90% of the text is 15276"
As can be seen in the output above, 805 words are needed to cover 50% of the text and 15276 words are needed to cover 90% of the text. Do note that this result only holds for this cleaned subsample of the real dataset. When using the full dataset, I expect these numbers to be quite different.
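As an aside, the same coverage figures can be obtained without explicit loops; a vectorised sketch using cumsum() on the same frequency table would look like this:
# Vectorised alternative: cumulative share of the total word count
coverage <- cumsum(frequencies$Frequency) / sum_of_words
print(paste0("The number of words required to cover 50% of the text is ", which(coverage >= 0.5)[1]))
print(paste0("The number of words required to cover 90% of the text is ", which(coverage >= 0.9)[1]))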
Next, we can generate a term-document matrix using the NGramTokenizer from RWeka to extract sets of two words, or bigrams, from the text. By computing their frequencies, we can then evaluate which pairs of words occur most often in the subset of the data used for this report. In addition to a table, the results are also visualized as a wordcloud.
# Term-document matrix with bigrams
tdm2 <- TermDocumentMatrix(corpus,
                           control=list(tokenize=function(x) {NGramTokenizer(x, Weka_control(min=2, max=2))}))
frequencies <- tdm2 %>%
  removeSparseTerms(.999) %>%
  {sort(rowSums(as.matrix(.)), decreasing=TRUE)} %>%
  {data.frame(Bigram=names(.), Frequency=.)} %>%
  `rownames<-`(c())
print_table(frequencies[1:20,], caption="Term frequencies of the top 20 most often-used bigrams")
| Bigram | Frequency |
|---|---|
| right now | 405 |
| cant wait | 350 |
| last night | 273 |
| dont know | 272 |
| feel like | 239 |
| looking forward | 211 |
| im going | 198 |
| first time | 179 |
| happy birthday | 171 |
| years ago | 168 |
| looks like | 166 |
| good morning | 162 |
| can get | 154 |
| im sure | 154 |
| just got | 143 |
| make sure | 142 |
| good luck | 138 |
| let know | 138 |
| dont think | 133 |
| new york | 133 |
wordcloud(frequencies[1:40,]$Bigram, frequencies[1:40,]$Frequency,
          colors=brewer.pal(12, "Set3"), scale=c(5,0.1), random.order=FALSE)
Lastly, it is interesting to evaluate combinations of three words that occur in the text, also known as trigrams. This is done below, after which the results are visualized in a barplot.
# Trigrams
tdm3 <- TermDocumentMatrix(corpus,
                           control=list(tokenize=function(x) {NGramTokenizer(x, Weka_control(min=3, max=3))}))
frequencies <- tdm3 %>%
  removeSparseTerms(.9999) %>%
  {sort(rowSums(as.matrix(.)), decreasing=TRUE)} %>%
  {data.frame(Trigram=names(.), Frequency=.)} %>%
  `rownames<-`(c())
print_table(frequencies[1:20,], caption="Term frequencies of the top 20 most often-used trigrams")
| Trigram | Frequency |
|---|---|
| cant wait see | 70 |
| happy mothers day | 57 |
| let us know | 55 |
| happy new year | 41 |
| im pretty sure | 37 |
| please please please | 29 |
| cinco de mayo | 22 |
| dont even know | 21 |
| new york city | 21 |
| love love love | 20 |
| looking forward seeing | 19 |
| keep good work | 18 |
| cant wait get | 17 |
| feel like im | 17 |
| im looking forward | 16 |
| look forward seeing | 16 |
| cant wait hear | 15 |
| cant wait till | 14 |
| just got back | 13 |
| makes feel like | 13 |
gdf <- ggplot(data=frequencies[1:20,], aes(x=reorder(Trigram, Frequency), y=Frequency))
gdf +
  geom_col(fill="#86BC25") +
  theme(panel.background=element_blank(),
        plot.title=element_text(hjust=0.5),
        plot.margin=unit(c(.5,.5,.5,.5), "cm")) +
  labs(x="", title="Barplot of top 20 most-often used trigrams") +
  coord_flip() +
  scale_y_continuous(expand=c(0,0)) +
  scale_x_discrete(expand=c(0,0))
Using the cleaned dataset, the next step is to develop a first predictive model. From there, it can be fine-tuned to achieve better results whilst reducing the time needed for computations. Currently, my plan is to start from an n-gram based model that predicts the next word from the preceding words.
Additionally, I would like to see if a Word2vec model can be used to predict the next word in a sentence. If so, I may also experiment with that to see if I can achieve better results than with n-grams.
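To make the n-gram idea a bit more concrete, a toy sketch is given below. It is an illustration only, not the planned model: it assumes the bigram frequency table computed earlier has been kept in a separate object (here called bigram_freqs, since frequencies was later overwritten by the trigram table) and simply returns the most frequent follower of a given word.
# Toy next-word lookup based on a bigram frequency table
# (bigram_freqs is assumed to hold the earlier bigram table: columns Bigram and
# Frequency, sorted by decreasing Frequency)
predict_next <- function(word, bigram_freqs) {
  parts <- strsplit(as.character(bigram_freqs$Bigram), " ")
  first <- sapply(parts, `[`, 1)
  second <- sapply(parts, `[`, 2)
  candidates <- second[first == word]
  if (length(candidates) == 0) return(NA)
  candidates[1]  # table is sorted by frequency, so the first match is the most frequent
}
predict_next("right", bigram_freqs)  # likely returns "now" for this sample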
Before proceeding to the predictive model, I want to experiment further with the options for cleaning the text, as there are still a few choices I am not yet certain about.