The aim of this lab is to apply tree-based models to natural language processing and to learn how to train and tune them.
By the end of this lab session, students should be able to
Visualize text data.
Extract features from text data.
Train and plot decision trees and regularized decision trees.
Train random forest and tune the hyperparameters in random forest to minimize the out-of-bag error.
You should be familiar with the libraries `caret` and `tidyverse`.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or a modification of the R code (modify the code here). In either case, you can discuss your work with the lab instructor.
We will work with a dataset of Donald Trump’s tweets. Source:
The objective is to predict the number of retweets from the text of each tweet.
First we will load the data into R. Note that the original dataset is a JSON file that contains a lot of information. For this lab, we extracted only the text and the number of retweets from 5K of the original 40K Trump tweets.
library(tidyverse) # for manipulation with data
library(caret) # for machine learning (train, trainControl, varImp, RMSE)
library(rpart) # for training decision trees
library(rpart.plot) # for plotting decision trees
library(ranger) # for training random forest
# Note for students: `randomForest` and `cforest` (from the `party` package)
# provide better access to the internals of random forest models.
# That's why they were used to prepare the lecture 5 slides.
# However, neither of them allows tuning the minimum node size via caret.
library(tm) # for text mining
library(wordcloud) # for text visualization
tweets <- read_csv("trump_tweets.csv")
head(tweets)
The number of tweets is
nrow(tweets)
## [1] 5000
We will remove all non-letter symbols except “@” and “#”, which are special symbols, and then convert all characters to lower case. Below is the processed dataset.
tweets <- tweets %>%
mutate(cleaned_text = gsub("[^a-zA-Z@#]", " ", text)) %>%
mutate(cleaned_text = tolower(cleaned_text))
head(tweets)
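To see what the regular expression does, here it is applied to a made-up example string (not from the dataset): every character other than a letter, “@” or “#” is turned into a space.
# Letters, "@" and "#" survive; digits and punctuation become spaces
gsub("[^a-zA-Z@#]", " ", "Thank you, @FoxNews!! #Trump2016")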
The symbols “@” and “#” have a special meaning on Twitter, so it is probably a good idea to treat them as separate words. Write R code that replaces all occurrences of “@” with “twacc” and all occurrences of “#” with “htag” in Trump’s tweets.
# Modify the code below
tweets <- tweets %>%
mutate(cleaned_text = gsub("@", "twacc ", cleaned_text)) %>%
mutate(cleaned_text = gsub("#", "htag ", cleaned_text))
head(tweets)
An important tool in natural language processing is the document-term matrix. This is a matrix whose rows represent documents, i.e., tweets, and whose columns represent words. The element \(T_{ij}\) in row \(i\) and column \(j\) is the number of occurrences of word \(j\) in document \(i\). Below we construct the document-term matrix of Donald Trump’s tweets and report its dimensions.
corpus <- VCorpus(VectorSource(tweets$cleaned_text))
DTM <- DocumentTermMatrix(corpus)
dim(DTM)
## [1] 5000 12164
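To see how the counts \(T_{ij}\) arise, here is a minimal toy example with two made-up documents (not part of the lab data):
toy_corpus <- VCorpus(VectorSource(c("the cat sat", "the cat and the dog")))
# row i, column j: number of occurrences of word j in document i
as.matrix(DocumentTermMatrix(toy_corpus))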
Below is a sample of the document-term matrix (note that the words appear as column labels while the rows are labelled only by document numbers):
as.matrix(DTM[1:5, c('trump', 'that', 'sure', 'the', 'twacc')])
## Terms
## Docs trump that sure the twacc
## 1 1 0 0 1 1
## 2 1 1 0 3 0
## 3 0 0 0 0 1
## 4 0 0 0 0 1
## 5 0 0 0 1 2
To find the total frequency of each word, we will calculate the sum of the values in each column of this matrix. Below are the frequencies of the first few words (in alphabetical order):
words_freq <- termFreq(tweets$cleaned_text)
head(words_freq)
## aaa aaeu aafczflvmx aarnold aaronpha aaszkler
## 1 1 1 1 1 1
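Note that `termFreq` recomputes the frequencies from the raw text. Assuming the tokenization matches, the same totals can be obtained as column sums of the (sparse) document-term matrix using `slam`, a package that `tm` depends on:
library(slam) # sparse matrix utilities; installed together with tm
head(col_sums(DTM))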
Given the dimensions of the document-term matrix, is it possible to use linear regression to predict the number of retweets from it?
It is impossible to use linear regression because the number of predictors (12164) is greater than the number of observations (5000). It is, however, possible to use ridge regression.
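For reference, here is a minimal sketch (not run in this lab) of how ridge regression could be fit with the `glmnet` package, assuming it is installed; the response used here, the log retweet count, is introduced later in the lab.
# Not run: ridge regression works even when p > n
# library(glmnet)
# X <- as.matrix(DTM)                     # 5000 x 12164 at this point
# y <- log1p(tweets$retweet_count)        # log-transformed response
# mod_ridge <- cv.glmnet(X, y, alpha = 0) # alpha = 0 selects the ridge penalty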
Remove the columns of the document-term matrix that represent words whose total frequency is below 50. What are the dimensions of the matrix now?
frequent_words <- words_freq[words_freq >= 50]
length(frequent_words)
## [1] 190
# modify the code below
DTM <- DTM[ , names(frequent_words)]
dim(DTM)
## [1] 5000 190
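For reference, `tm` has a built-in helper for this kind of filtering. Here is a sketch of the equivalent selection on the unfiltered matrix (not run, since `DTM` has already been subset above):
# Not run: equivalent filtering with tm's helper
# frequent_terms <- findFreqTerms(DTM, lowfreq = 50)
# DTM <- DTM[ , frequent_terms]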
Text data are often visualized with a word cloud. Below is the word cloud of Trump’s tweets.
wordcloud(words = names(frequent_words), freq = frequent_words, min.freq = 0,
max.words = 100, random.order=FALSE, rot.per=0.35)
Note that some of these words are so-called stopwords. Below is the list of English stopwords:
stopwords()
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
Now we will remove these stopwords from our vector of word frequencies in Trump’s tweets and plot the new word cloud. We will also add some colours to the word cloud (the palette comes from RColorBrewer, which is attached together with wordcloud).
frequent_words <- frequent_words[!(names(frequent_words) %in% stopwords())]
wordcloud(words = names(frequent_words), freq = frequent_words, min.freq = 0,
max.words = 100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Now we will remove the stopwords from the document-term matrix and report the dimensions of the new document-term matrix.
# Modify this code chunk
DTM <- DTM[ , names(frequent_words)]
dim(DTM)
## [1] 5000 136
Let us plot the distribution of the number of retweets. As shown in the plot below, it is highly skewed:
ggplot(data = tweets, aes(x = retweet_count)) +
geom_histogram(fill = "orange", color = "black", bins = 20)
It therefore makes more sense to predict the logarithm of the number of retweets rather than the raw count. Compute the logarithm of the number of retweets and store it in the variable “Y” of our dataset. Warning: think about what you will do with tweets whose retweet count is 0.
Plot the histogram of the log retweet count.
# Modify the code below
# You should see a nice bimodal distribution
tweets$Y <- log(1 + tweets$retweet_count) # adding 1 handles tweets with 0 retweets
ggplot(data = tweets, aes(x = Y)) +
geom_histogram(fill = "orange", color = "black", bins = 20) +
xlab("Log-retweet-count")
You can try to figure out why the distribution is bimodal, but it is not necessary.
We will randomly split the data into 80% training and 20% test sets. We will also rename all columns by adding “w_” to every column name. This is done to avoid some technical issues (random forest does not behave well if R keywords, such as `next`, are used as variable names).
set.seed(2020)
p <- 0.8
ind <- runif(nrow(DTM)) < p
all_data <- DTM %>%
as.matrix %>%
as_tibble %>%
rename_with(function(x) paste("w", x, sep = "_")) %>%
mutate(Y = tweets$Y)
train_data <- all_data[ind , ]
test_data <- all_data[!ind , ]
cat("Dimensions of the training set are", dim(train_data),"\n")
## Dimensions of the training set are 4056 137
cat("Dimensions of the test set are", dim(test_data),"\n")
## Dimensions of the test set are 944 137
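A note on the split: with `runif`, the 80% fraction is only approximate. For reference, `caret` provides `createDataPartition`, which hits the requested fraction almost exactly and stratifies on the outcome; a sketch (not used further):
# Alternative split via caret (stratified on quantiles of Y)
set.seed(2020)
idx <- createDataPartition(all_data$Y, p = 0.8, list = FALSE)
length(idx) # number of training rows, close to 0.8 * 5000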
We will use the raw interface of the library `rpart` here rather than wrapping it in `train`. This is because it is a bit simpler, and `rpart` automatically does cross-validation for choosing the optimal value of the regularization constant.
First, we construct a decision tree and print it.
mod_tree <- rpart(Y ~ . , data = train_data)
print(mod_tree)
## n= 4056
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4056 33944.6800 5.836876
## 2) w_https< 0.5 3226 21289.5400 4.985934
## 4) w_twacc>=0.5 2155 11362.5700 4.123382
## 8) w_thanks>=0.5 194 476.3170 2.264581 *
## 9) w_thanks< 0.5 1961 10149.6400 4.307272 *
## 5) w_twacc< 0.5 1071 5097.5800 6.721506
## 10) w_http>=0.5 141 204.2644 4.461973 *
## 11) w_http< 0.5 930 4064.3000 7.064080 *
## 3) w_https>=0.5 830 1239.9060 9.144275 *
And here is the plot:
rpart.plot(mod_tree)
We can access the internals of the tree model to extract the individual splits and the deviance at each node:
mod_tree$frame
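The columns of `$frame` that we will need below are the following (see `?rpart.object` for the full description):
# var: splitting variable ("<leaf>" marks terminal nodes)
# n:   number of training observations in the node
# dev: deviance of the node (for regression trees, the RSS)
mod_tree$frame %>% select(var, n, dev) %>% head()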
Compute the RSS in two ways: by aggregating the deviance over all leaf nodes, and as the total training squared error.
The RSS of the entire dataset is the deviance of the root node. Below is the RSS of the decision tree, obtained by summing the deviance over the leaf nodes.
sum(mod_tree$frame$dev[mod_tree$frame$var == "<leaf>"])
## [1] 16134.43
And here is the same RSS computed as the total training squared error:
tse <- function(x, y) {
  # total squared error between predictions x and observations y
  sum((x - y)^2)
}
mod_tree %>%
predict(train_data) %>%
tse(train_data$Y)
## [1] 16134.43
To prune a decision tree, we first grow it and then we choose a sub-tree minimizing the regularized loss function \[
\sum_{m=1}^{|T|}\sum_{x^{i}\in R_m}(y^{i}-\hat{y}_{R_m})^2
+\alpha|T|
\] Below is the plot of the cross-validation error vs \(\alpha\) (in `rpart`, the parameter `cp`, which is essentially \(\alpha\) rescaled by the deviance of the root node):
plotcp(mod_tree)
In this case, we do not need to prune the tree (the cross-validation error is minimized by the full tree). But for your future reference, here is how to do it. First, we will look at the table:
mod_tree$cptable
## CP nsplit rel error xerror xstd
## 1 0.33628942 0 1.0000000 1.0001212 0.01340659
## 2 0.14227232 1 0.6637106 0.6639109 0.01360342
## 3 0.02442257 2 0.5214383 0.5217974 0.01217237
## 4 0.02170031 3 0.4970157 0.5048248 0.01202827
## 5 0.01000000 4 0.4753154 0.4760675 0.01151867
The optimal value of \(\alpha\) (expressed as `cp`) is
opt_cp <- mod_tree$cptable[which.min(mod_tree$cptable[ , 'xerror']) , 'CP']
opt_cp
## [1] 0.01
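For your future reference, a popular alternative is the one-standard-error rule: take the simplest tree whose cross-validation error is within one standard error of the minimum. A sketch using the same table:
cpt <- mod_tree$cptable
best <- which.min(cpt[ , "xerror"])
# first row (i.e., smallest tree) whose xerror is within 1 SE of the minimum
within_1se <- which(cpt[ , "xerror"] <= cpt[best, "xerror"] + cpt[best, "xstd"])
opt_cp_1se <- cpt[min(within_1se), "CP"]
opt_cp_1se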
And now we prune the tree (since the optimal `cp` equals 0.01, the default used when growing the tree, pruning changes nothing here):
mod_tree_pruned <- prune.rpart(mod_tree, opt_cp)
mod_tree_pruned
## n= 4056
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4056 33944.6800 5.836876
## 2) w_https< 0.5 3226 21289.5400 4.985934
## 4) w_twacc>=0.5 2155 11362.5700 4.123382
## 8) w_thanks>=0.5 194 476.3170 2.264581 *
## 9) w_thanks< 0.5 1961 10149.6400 4.307272 *
## 5) w_twacc< 0.5 1071 5097.5800 6.721506
## 10) w_http>=0.5 141 204.2644 4.461973 *
## 11) w_http< 0.5 930 4064.3000 7.064080 *
## 3) w_https>=0.5 830 1239.9060 9.144275 *
Variable importance is measured as the total drop in the residual sum of squares due to splits on each variable. Here is how we calculate it with the function `varImp` from `caret`:
varImp(mod_tree_pruned) %>%
arrange(-Overall) %>%
head(n = 20)
And here is how we can extract it directly from the model:
mod_tree_pruned$variable.importance
## w_https w_twacc w_http
## 11415.235687 4829.388032 829.016160
## w_thanks w_makeamericagreatagain w_obama
## 736.609860 261.312624 211.933929
## w_cont w_now w_hillary
## 141.109134 130.767743 126.258511
## w_honor w_big w_obamacare
## 123.779664 121.749278 103.712348
## w_democrats w_american w_jobs
## 68.766480 13.753296 13.753296
## w_via w_trump w_love
## 11.759094 5.879547 3.796958
Finally, let us construct predictions and report the test root mean squared error calculated with `caret` (here, we did not create our own version of the very simple `RMSE` function because we are loading `caret` anyway).
mod_tree %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 2.014989
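As a sanity check (not part of the lab), we can compare this with the trivial baseline that always predicts the training-set mean of `Y`:
# Baseline: constant prediction equal to the training mean
rep(mean(train_data$Y), nrow(test_data)) %>%
  RMSE(test_data$Y)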
We will use the library `ranger` because it allows tuning more than just one hyperparameter.
First, we train a random forest model and print the result. Note that by default, it tries three values of `mtry` (the number of predictors considered at each split), \(2\), \(p/2\) and \(p\), two values of `splitrule` (this is something that we did not cover in class and you can either ignore it or google what it is), and only one value of `min.node.size`, 5.
Here we set the train control to `oob`, i.e., out-of-bag error (with 5-fold cross-validation, training would be about 5 times slower), and `num.trees` to 50 (with the default value of 500, training would be about 10 times slower).
set.seed(100)
mod_rf <- train(Y ~ . , data = train_data, method = "ranger",
num.trees = 50,
importance = 'impurity',
trControl = trainControl("oob"))
print(mod_rf)
## Random Forest
##
## 4056 samples
## 136 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 2.082586 0.5918521 1.768071
## 2 extratrees 2.380923 0.4881632 2.050377
## 69 variance 1.799780 0.6182019 1.294085
## 69 extratrees 1.791730 0.6202089 1.296690
## 136 variance 1.814183 0.6130391 1.305854
## 136 extratrees 1.809448 0.6149121 1.300855
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 69, splitrule = extratrees
## and min.node.size = 5.
Here we will construct predictions and report the test error:
mod_rf %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.714026
Here is variable importance for top 20 most important variables:
varImp(mod_rf)
## ranger variable importance
##
## only 20 most important variables shown (out of 136)
##
## Overall
## w_https 100.000
## w_twacc 74.355
## w_realdonaldtrump 32.627
## w_http 21.841
## w_thanks 21.092
## w_htag 5.843
## w_hillary 5.722
## w_trump 4.894
## w_will 4.479
## w_democrats 4.167
## w_great 3.958
## w_fake 3.456
## w_media 3.339
## w_amp 3.014
## w_america 2.951
## w_celebapprentice 2.716
## w_people 2.617
## w_today 2.613
## w_thank 2.490
## w_cnn 2.394
If we want all of them, here is how we can get the full information (we only print the top 10 most important variables, but you can easily print the whole vector):
var_importance <- mod_rf$finalModel$variable.importance %>%
sort(decreasing = TRUE)
var_importance %>% head(10)
## w_https w_twacc w_realdonaldtrump w_http
## 7260.1919 5401.6458 2377.5899 1595.9337
## w_thanks w_htag w_hillary w_trump
## 1541.6702 436.5180 427.7609 367.7833
## w_will w_democrats
## 337.6637 315.0795
Here is the built-in plot of variable importance:
varImp(mod_rf) %>%
plot(top = 10)
And here is how you can make a custom plot with `ggplot2`:
var_importance <- mod_rf$finalModel$variable.importance %>%
sort(decreasing = TRUE) %>% head(10)
data.frame(variable = names(var_importance),
importance = var_importance) %>%
mutate(word = gsub("w_", "", variable)) %>%
ggplot(aes(x = reorder(word, -importance), y = importance)) +
geom_col() + xlab("word") + ylab("importance") +
theme(axis.text.x = element_text(angle = 45))
Random forest has a number of hyperparameters that can be tuned with OOB error (faster) or with cross-validation (slower).
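For reference, switching from the OOB error to 5-fold cross-validation is just a change of `trainControl`; a sketch (not run here, to save time):
# Not run: cross-validated tuning (about 5 times slower than OOB)
# ctrl_cv <- trainControl(method = "cv", number = 5)
# mod_rf_cv <- train(Y ~ . , data = train_data, method = "ranger",
#                    num.trees = 50, trControl = ctrl_cv)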
Random forest models may take a lot of time to train. For the sake of saving time, we will create a mini-version of our dataset to demonstrate the tuning process.
set.seed(199)
ind <- sample(1:nrow(train_data), size = 500, replace = FALSE)
mini_data <- train_data %>% slice(ind)
rfGrid <- expand.grid(mtry = c(10, 20, 30, 40, 50, 60),
min.node.size = c(5, 10, 20, 40),
splitrule = "variance")
mod_rf_tune <- train(Y ~ . , data = mini_data, method = "ranger",
num.trees = 100,
importance = 'impurity',
tuneGrid = rfGrid,
trControl = trainControl("oob"))
mod_rf_tune
## Random Forest
##
## 500 samples
## 136 predictors
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 10 5 1.840810 0.6011599 1.404396
## 10 10 1.813542 0.6150020 1.372422
## 10 20 1.825748 0.6096740 1.388036
## 10 40 1.825687 0.6110682 1.396257
## 20 5 1.869219 0.5908714 1.362570
## 20 10 1.871719 0.5893268 1.371413
## 20 20 1.860045 0.5936597 1.374263
## 20 40 1.850608 0.5968650 1.374273
## 30 5 1.898025 0.5834530 1.366885
## 30 10 1.899723 0.5812882 1.374557
## 30 20 1.887607 0.5851192 1.367166
## 30 40 1.891024 0.5813137 1.381814
## 40 5 1.954260 0.5647347 1.402775
## 40 10 1.940716 0.5668066 1.397490
## 40 20 1.924161 0.5720631 1.384602
## 40 40 1.889100 0.5833508 1.375735
## 50 5 1.938187 0.5720726 1.380867
## 50 10 1.935023 0.5733453 1.379490
## 50 20 1.953506 0.5631649 1.398478
## 50 40 1.917230 0.5738187 1.378502
## 60 5 1.975515 0.5558973 1.411896
## 60 10 1.966518 0.5603053 1.407614
## 60 20 1.956011 0.5630785 1.397491
## 60 40 1.939511 0.5654197 1.400145
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 10, splitrule = variance
## and min.node.size = 10.
The tuning process can be plotted to get a better picture of what is going on:
plot(mod_rf_tune)
The optimal values of the hyperparameters are
mod_rf_tune$bestTune
Now we will retrain the model on the whole training set with these values:
mod_rf_tuned <- train(Y ~ . , data = train_data, method = "ranger",
num.trees = 100,
importance = 'impurity',
tuneGrid = expand.grid(mod_rf_tune$bestTune),
trControl = trainControl("oob"))
mod_rf_tuned
## Random Forest
##
## 4056 samples
## 136 predictor
##
## No pre-processing
## Resampling results:
##
## RMSE Rsquared MAE
## 1.703226 0.654538 1.27152
##
## Tuning parameter 'mtry' was held constant at a value of 10
## Tuning
## parameter 'splitrule' was held constant at a value of variance
##
## Tuning parameter 'min.node.size' was held constant at a value of 10
Here is the test error. It should be a bit better than that of the first version of the random forest.
mod_rf_tuned %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.669528
Extract the 70 most important variables and train a new random forest with 50 trees on just those 70 variables. Tune the hyperparameters using the following values:
`mtry` = 3, 5, 8, 15
`min.node.size` = 2, 4, 8, 16, 32
Retrain random forest with the optimal values of hyperparameters but with 500 trees instead of 50 trees.
Report the test error of the final model.
top_variables <- mod_rf_tuned$finalModel$variable.importance %>%
sort(decreasing = TRUE) %>% head(70) %>% names
t_data <- train_data %>%
select(all_of(top_variables), Y)
v_data <- test_data %>%
select(all_of(top_variables), Y)
rfGrid_2 <- expand.grid(mtry = c(3, 5, 8, 15),
min.node.size = c(2, 4, 8, 16, 32),
splitrule = "variance")
mod_rf_topvar <- train(Y ~ . , data = t_data, method = "ranger",
num.trees = 50,
importance = 'impurity',
tuneGrid = expand.grid(rfGrid_2),
trControl = trainControl("oob"))
mod_rf_topvar
## Random Forest
##
## 4056 samples
## 70 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 3 2 1.823820 0.6319876 1.465312
## 3 4 1.787475 0.6393289 1.410124
## 3 8 1.807706 0.6340743 1.436824
## 3 16 1.823363 0.6318878 1.461573
## 3 32 1.859618 0.6174428 1.504937
## 5 2 1.723347 0.6486114 1.305000
## 5 4 1.730880 0.6452282 1.305491
## 5 8 1.718247 0.6508553 1.299993
## 5 16 1.720310 0.6500063 1.300286
## 5 32 1.734724 0.6449590 1.316366
## 8 2 1.716500 0.6480733 1.269446
## 8 4 1.707477 0.6517444 1.257412
## 8 8 1.713821 0.6493004 1.264234
## 8 16 1.708285 0.6515648 1.261307
## 8 32 1.708223 0.6518810 1.270337
## 15 2 1.756308 0.6326508 1.275370
## 15 4 1.744559 0.6372440 1.268672
## 15 8 1.734011 0.6413787 1.257799
## 15 16 1.718971 0.6472359 1.255121
## 15 32 1.719297 0.6468900 1.263364
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 8, splitrule = variance
## and min.node.size = 4.
The optimal values of the hyperparameters are
mod_rf_topvar$bestTune
mod_rf_final <- train(Y ~ . , data = t_data, method = "ranger",
num.trees = 500,
importance = 'impurity',
tuneGrid = expand.grid(mod_rf_topvar$bestTune),
trControl = trainControl("oob"))
And the error of the final model is
mod_rf_final %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.669269
Here are the answers: