The aim of this lab is to apply tree-based models to natural language processing and to learn how to train and tune them.
By the end of this lab session, students should be able to:
Visualize text data.
Extract features from text data.
Train and plot decision trees and regularized decision trees.
Train random forests and tune their hyperparameters to minimize the out-of-bag error.
You should be familiar with the libraries caret and tidyverse.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or require modifying R code (modify the code here). In either case, you can discuss your work with the lab instructor.
We will work with a dataset of Donald Trump’s tweets. Source:
The objective is to predict the number of retweets from the text of each tweet.
First, we will load the data into R. Note that the original dataset is a JSON file that contains a lot of information. For this lab, we extracted only the text and the number of retweets from 5K out of the original 40K tweets.
library(tidyverse) # for manipulation with data
library(tidytext) # for tidyverse-style text tokenization
library(caret) # for machine learning, including KNN
library(rpart) # for training decision trees
library(rpart.plot) # for plotting decision trees
library(ranger) # for training random forest
# Note for students: cforest (from the party package) and the randomForest library
# provide better access to the internals of random forest models.
# That's why they were used to prepare the lecture 5 slides.
# However, cforest and randomForest do not allow tuning the minimum node size.
library(wordcloud) # for text visualization
set.seed(8192)
tweets <- read_csv("trump_tweets.csv")
head(tweets)
The number of tweets is
nrow(tweets)
## [1] 5000
We will remove all symbols other than letters, “@”, and “#” (the latter two are kept because they are special symbols in Twitter). Then we will convert all characters to lower case. Below is the processed dataset.
tweets <- tweets %>%
mutate(
cleaned_text = text %>%
str_replace_all("[^A-Za-z@#]", " ") %>%
str_squish() %>%
str_to_lower()
)
head(tweets)
The symbols “@” and “#” have special meaning in Twitter, and it is probably a good idea to treat them as separate words. Write R code that replaces all occurrences of “@” with “twacc” and all occurrences of “#” with “htag” in Trump’s tweets.
# Modify the code below
tweets <- tweets %>%
mutate(
cleaned_text = cleaned_text %>%
str_replace_all("@", "twacc ") %>%
str_replace_all("#", "htag ")
)
head(tweets)
An important tool in natural language processing is the document-term matrix. This is a matrix whose rows represent documents (here, tweets) and whose columns represent words. The element \(T_{ij}\) in row \(i\) and column \(j\) is the number of occurrences of word \(j\) in document \(i\). To construct the document-term matrix, the first step is to create a data frame of word counts in each document:
tidy_counts <- tweets %>%
mutate(doc_id = row_number()) %>%
select(doc_id, cleaned_text) %>%
unnest_tokens(word, cleaned_text) %>%
count(doc_id, word, sort = FALSE)
tidy_counts %>% sample_n(10)
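To make the definition of \(T_{ij}\) concrete, here is a tiny sketch with a made-up two-document corpus (the toy object and its sentences are purely illustrative, not part of the Trump data):
toy <- tibble(doc_id = 1:2,   # made-up toy corpus, for illustration only
              cleaned_text = c("make america great again",
                               "america first america strong"))
toy %>%
  unnest_tokens(word, cleaned_text) %>%
  count(doc_id, word) %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0)
# The entry in row 2 and column "america" equals 2 because the word "america"
# occurs twice in the second toy document.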
How many different words are there in the vocabulary?
Given the dimensions of the document-term matrix, is it possible to use linear regression to predict the number of retweets from it?
Answer The number of words in the vocabulary is
n_distinct(tidy_counts$word)
## [1] 12674
It is impossible to use linear regression because the number of predictors (words) is greater than the number of observations (tweets).
If we included all words that are mentioned at least once in any tweet, we would have too many words, and most of them wouldn’t be useful since they appear only in one or two tweets.
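You can check this claim directly; the sketch below computes the share of vocabulary words whose total count in the whole corpus is at most 2:
tidy_counts %>%
  count(word, wt = n, name = "freq") %>%     # total count of each word in the corpus
  summarise(prop_rare = mean(freq <= 2))     # share of words appearing at most twice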
To reduce the number of independent variables, we will keep only words that appear in the whole corpus (the collection of all tweets) at least, say, 30 times. We won’t keep stopwords either, since they aren’t meaningful for our purpose.
keep_words <- tidy_counts %>%
count(word, wt = n, name = "freq") %>%
filter(freq >= 30, !word %in% stop_words$word)
keep_words %>% sample_n(10)
The next step is to subset tidy_counts to include only
frequent words. There are two ways to do it: the first one (commented out) is more
efficient (works faster for large datasets), but the second one is easier to
read.
# tidy_counts_f <- tidy_counts %>%
# semi_join(keep_words, by = "word")
tidy_counts_f <- tidy_counts %>%
filter(word %in% keep_words$word)
Finally, we can create the document-term matrix. Note that it has a
special class, so we will immediately create a data frame out of it. You
are encouraged to explore the DTM itself to get a feeling for what
it looks like.
Again, there are two ways to create the document-term matrix. The first one (commented out below) is more efficient if the number of predictors is large but takes some effort to understand:
# all_docs <- seq_len(nrow(tweets))
# DTM <- tidy_counts_f %>%
# mutate(doc_id = factor(doc_id, levels = all_docs),
# word = factor(word, levels = keep_words$word)) %>%
# cast_sparse(doc_id, word, n)
#
# all_data <- DTM %>%
# as.matrix() %>%
# as_tibble(.name_repair = "minimal") %>%
# mutate(Y = tweets$Y) # note: the variable Y is only created later in this lab
all_data <- tibble(doc_id = 1:nrow(tweets)) %>%
left_join(tidy_counts_f) %>%
pivot_wider(id_cols = doc_id,
names_from = "word",
values_from = "n",
values_fill = 0) %>%
select(-any_of(c("doc_id", "NA")))
## Joining with `by = join_by(doc_id)`
all_data %>%
select(amp, iran, golf, hillary, twacc) %>%
slice(1:5)
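If you run the commented-out cast_sparse() version above, here is one way to explore the resulting DTM object (a sketch, shown commented out because the sparse matrix is not built in this document):
# dim(DTM)                       # number of documents x vocabulary size
# DTM[1:5, 1:8]                  # a small corner of the sparse matrix
# Matrix::colSums(DTM) %>%
#   sort(decreasing = TRUE) %>%
#   head(10)                     # the most frequent kept words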
Text data are often visualized with a word cloud. Below is the word cloud of Trump’s tweets.
wordcloud(words = keep_words$word, freq = keep_words$freq, min.freq = 0,
max.words = 100, random.order=FALSE, rot.per=0.35)

Plot the word cloud without the tokens “twacc”, “realdonaldtrump”, “htag”, “http”, and “https”.
df_to_word_cloud <- function(word_data) {
wordcloud(words = word_data$word, freq = word_data$freq, min.freq = 0,
max.words = 100, random.order=FALSE, rot.per=0.35)
}
keep_words %>%
filter(!word %in% c("twacc", "realdonaldtrump", "htag", "http","https")) %>%
df_to_word_cloud

Let us plot the distribution of the number of retweets. As shown in the plot below, it is highly skewed:
ggplot(data = tweets, aes(x = retweet_count)) +
geom_histogram(fill = "orange", color = "black", bins = 20)

It therefore makes more sense to predict the logarithm of the number of retweets rather than the raw count. Compute the logarithm of the number of retweets and store it in the variable “Y” of our dataset. Warning: think about what you will do with tweets whose retweet count is 0.
Plot the histogram of the log-retweet count.
# Modify the code below
# You should see a nice bimodal distribution
tweets <- tweets %>%
mutate(Y = log(1 + retweet_count))
all_data <- all_data %>% mutate(Y = tweets$Y)
ggplot(data = tweets, aes(x = Y)) +
geom_histogram(fill = "orange", color = "black", bins = 20) +
xlab("Log-retweet-count")

You can try to figure out why the distribution is bimodal, but it is not necessary.
We will randomly split the data into 80% training and 20% test sets.
p <- 0.8
ind <- runif(nrow(all_data)) < p
train_data <- all_data %>% filter(ind)
test_data <- all_data %>% filter(!ind)
cat("Dimensions of the training set are", dim(train_data),"\n")
## Dimensions of the training set are 3993 178
cat("Dimensions of the test set are", dim(test_data),"\n")
## Dimensions of the test set are 1007 178
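For reference, caret also provides createDataPartition(), which makes a split stratified on the outcome; a minimal sketch (commented out so that the split above is left unchanged, and the object names idx, train_data2, test_data2 are only illustrative):
# idx <- createDataPartition(all_data$Y, p = 0.8, list = FALSE)
# train_data2 <- all_data[idx, ]     # stratified 80% training set
# test_data2  <- all_data[-idx, ]    # remaining 20% test set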
We will use the raw interface of the library rpart here
rather than wrapping it in caret's train(). This is because it is a
bit simpler, and rpart automatically does cross-validation
for choosing the optimal value of the regularization constant.
Remark As a rule of thumb, it is preferable to use
tidyverse tools such as ggplot() for plotting,
because they are much easier to customize and extend than base R
graphics. In practice, if you need anything more elaborate than a quick
plot(), ggplot() will usually be the better
choice. An exception is the simple diagnostic plots that come bundled with
modeling packages (e.g., decision trees with rpart.plot,
cross-validation error curves with plotcp(), etc.). These are
often implemented in base R graphics and are convenient to use as-is.
First, we construct a decision tree and print it.
mod_tree <- rpart(Y ~ . , data = train_data)
print(mod_tree)
## n= 3993
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3993 33455.24000 5.771200
## 2) https< 0.5 3219 21443.57000 4.962397
## 4) twacc>=0.5 2170 11533.10000 4.097823
## 8) rt< 0.5 2038 8273.64200 3.815346
## 16) interviewed< 0.5 2010 7856.63900 3.765413 *
## 17) interviewed>=0.5 28 52.22542 7.399866 *
## 9) rt>=0.5 132 586.11860 8.459084 *
## 5) twacc< 0.5 1049 4932.98900 6.750888
## 10) http>=0.5 141 198.46960 4.485972 *
## 11) http< 0.5 908 3898.89200 7.102598 *
## 3) https>=0.5 774 1148.29700 9.134942 *
And here is the plot:
rpart.plot(mod_tree)

We can access the internals of the tree model to extract particular splits and the deviances at those splits:
mod_tree$frame
Compute the RSS in two ways: by aggregating the deviance of all leaf nodes, and directly as the total training squared error.
The RSS of the entire dataset is the deviance of the root node. Below is the RSS of the decision tree, obtained by summing the deviance over all leaf nodes:
sum(mod_tree$frame$dev[mod_tree$frame$var == "<leaf>"])
## [1] 13740.64
And here is the same RSS computed directly as the total training squared error:
tse <- function(x, y) {   # total squared error
  sum((x - y)^2)
}
mod_tree %>%
predict(train_data) %>%
tse(train_data$Y)
## [1] 13740.64
To prune a decision tree, we first grow it and then choose a
sub-tree minimizing the regularized loss function \[
\sum_{m=1}^{|T|}\sum_{x^{i}\in R_m}\left(y^{i}-\hat{y}_{R_m}\right)^2
+\alpha|T|.
\] Below is the plot of the cross-validation error vs \(\alpha\) (parameter cp):
plotcp(mod_tree)

In this case, we do not need to prune the tree. But for future reference, here is how to do it. First, we look at the table of candidate cp values:
mod_tree$cptable
## CP nsplit rel error xerror xstd
## 1 0.32471355 0 1.0000000 1.0005826 0.01374169
## 2 0.14878048 1 0.6752864 0.6756644 0.01387620
## 3 0.07990787 2 0.5265060 0.5271873 0.01243399
## 4 0.02497748 3 0.4465981 0.4475226 0.01086897
## 5 0.01090345 4 0.4216206 0.4226515 0.01065410
## 6 0.01000000 5 0.4107172 0.4161941 0.01061745
The value of cp minimizing the cross-validation error is
opt_cp <- mod_tree$cptable[which.min(mod_tree$cptable[ , 'xerror']) , 'CP']
opt_cp
## [1] 0.01
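As an aside, a common alternative to picking the cp with the smallest xerror is the one-standard-error rule: choose the simplest tree whose xerror is within one xstd of the minimum. A sketch (the names cpt, threshold, cp_1se are illustrative):
cpt <- mod_tree$cptable
best <- which.min(cpt[, "xerror"])                        # row with the smallest CV error
threshold <- cpt[best, "xerror"] + cpt[best, "xstd"]      # one standard error above the minimum
cp_1se <- cpt[min(which(cpt[, "xerror"] <= threshold)), "CP"]  # simplest tree within the threshold
cp_1se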
And now we prune the tree:
mod_tree_pruned <- prune.rpart(mod_tree, opt_cp)
mod_tree_pruned
## n= 3993
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3993 33455.24000 5.771200
## 2) https< 0.5 3219 21443.57000 4.962397
## 4) twacc>=0.5 2170 11533.10000 4.097823
## 8) rt< 0.5 2038 8273.64200 3.815346
## 16) interviewed< 0.5 2010 7856.63900 3.765413 *
## 17) interviewed>=0.5 28 52.22542 7.399866 *
## 9) rt>=0.5 132 586.11860 8.459084 *
## 5) twacc< 0.5 1049 4932.98900 6.750888
## 10) http>=0.5 141 198.46960 4.485972 *
## 11) http< 0.5 908 3898.89200 7.102598 *
## 3) https>=0.5 774 1148.29700 9.134942 *
Variable importance is measured as the total drop in the residual sum of
squares due to splits on each variable. Here is how we calculate it with
the function varImp from caret:
varImp(mod_tree_pruned) %>%
arrange(-Overall) %>%
head(n = 20)
And here is how we can extract it directly from the model:
mod_tree_pruned$variable.importance
## https twacc rt
## 10863.36988 5017.99184 2673.33702
## http interviewed obama
## 835.62757 364.77748 241.99411
## join makeamericagreatagain honor
## 230.49222 196.49506 154.38898
## democrats cont hillary
## 150.84151 142.23448 137.60450
## obamacare crooked whitehouse
## 104.38962 90.15467 76.39400
## russia donald hotel
## 20.25255 17.77931 11.85287
Finally, let us construct predictions and report the test root mean
squared error calculated with caret's RMSE() (here, we did not write
our own version of this very simple function because we
are loading caret anyway).
mod_tree %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.880089
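For reference, caret's RMSE() simply computes the square root of the mean squared error; a minimal sketch of an equivalent calculation (the helper rmse_manual is only illustrative):
rmse_manual <- function(pred, obs) sqrt(mean((pred - obs)^2))  # root mean squared error
mod_tree %>%
  predict(test_data) %>%
  rmse_manual(test_data$Y)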
We will use the library ranger because it allows tuning
more than just one hyperparameter.
First, we train a random forest model and print the result. Note that
by default, caret tries three values of mtry (the number of
predictors considered at each split), namely \(2\), \(p/2\), and \(p\); two values of
splitrule (something we did not cover in class,
so you can either ignore it or look it up); and only one value of
min.node.size, namely 5.
Here we set the train control to "oob", i.e., the out-of-bag error
(with 5-fold cross-validation, training would be about 5 times slower), and
num.trees to 50 (with the default value of 500, training
would be about 10 times slower).
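For comparison, if you preferred 5-fold cross-validation to the out-of-bag error (at the cost of the extra training time mentioned above), the control object would look like this (a sketch only; it is not run here, and the names ctrl_cv and mod_rf_cv are illustrative):
# ctrl_cv <- trainControl(method = "cv", number = 5)
# mod_rf_cv <- train(Y ~ . , data = train_data, method = "ranger",
#                    num.trees = 50, trControl = ctrl_cv)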
set.seed(100)
mod_rf <- train(Y ~ . , data = train_data, method = "ranger",
num.trees = 50,
importance = 'impurity',
trControl = trainControl("oob"))
print(mod_rf)
## Random Forest
##
## 3993 samples
## 177 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 2.109581 0.6145837 1.793226
## 2 extratrees 2.261572 0.5535946 1.927110
## 89 variance 1.619465 0.6910906 1.161660
## 89 extratrees 1.603458 0.6958904 1.158384
## 177 variance 1.636035 0.6858236 1.177388
## 177 extratrees 1.620119 0.6911891 1.166473
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 89, splitrule = extratrees
## and min.node.size = 5.
Here we construct predictions and report the test error:
mod_rf %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.608222
Here is variable importance for top 20 most important variables:
varImp(mod_rf)
## ranger variable importance
##
## only 20 most important variables shown (out of 177)
##
## Overall
## https 100.000
## twacc 92.056
## rt 28.445
## http 20.302
## realdonaldtrump 19.433
## hillary 6.598
## democrats 4.189
## htag 4.049
## fake 3.935
## cnn 3.564
## interviewed 3.383
## trump 3.329
## america 3.328
## people 3.073
## media 2.973
## clinton 2.529
## foxnews 2.365
## obama 2.249
## makeamericagreatagain 2.021
## barackobama 1.966
If we want the importance of all variables, here is how to extract the full vector from the model (we print only the top 10 most important variables, but you can easily print the whole vector):
var_importance <- mod_rf$finalModel$variable.importance %>%
sort(decreasing = TRUE)
var_importance %>% head(10)
## https twacc rt http realdonaldtrump
## 7303.1909 6723.7392 2083.6586 1489.6843 1426.3358
## hillary democrats htag fake cnn
## 490.0537 314.3798 304.1789 295.8575 268.7504
Here is the built-in plot of variable importance:
varImp(mod_rf) %>%
plot(top = 10)
And here is how you can make a custom plot with
ggplot2:
var_importance <- mod_rf$finalModel$variable.importance %>%
sort(decreasing = TRUE) %>% head(10)
var_importance %>%
enframe(name = "variable", value = "importance") %>%
mutate(variable = fct_reorder(variable, importance)) %>%
ggplot(aes(variable, importance)) +
geom_col() +
coord_flip()
Random forest has a number of hyperparameters that can be tuned with OOB error (faster) or with cross-validation (slower).
Random forest models may take a lot of time to train. For the sake of saving time, we will create a mini-version of our dataset to demonstrate the tuning process.
set.seed(199)
ind <- sample(1:nrow(train_data), size = 500, replace = FALSE)
mini_data <- train_data %>% slice(ind)
rfGrid <- expand.grid(mtry = c(10, 20, 30, 40, 50, 60),
min.node.size = c(5, 10, 20, 40),
splitrule = "variance")
mod_rf_tune <- train(Y ~ . , data = mini_data, method = "ranger",
num.trees = 100,
importance = 'impurity',
tuneGrid = rfGrid,
trControl = trainControl("oob"))
mod_rf_tune
## Random Forest
##
## 500 samples
## 177 predictors
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 10 5 1.762335 0.6287188 1.401409
## 10 10 1.756478 0.6332227 1.409939
## 10 20 1.770713 0.6270420 1.427645
## 10 40 1.773984 0.6254328 1.425865
## 20 5 1.779183 0.6145700 1.367211
## 20 10 1.774561 0.6169195 1.356166
## 20 20 1.783228 0.6129856 1.369521
## 20 40 1.755187 0.6257252 1.357974
## 30 5 1.801382 0.6064170 1.349882
## 30 10 1.795172 0.6090385 1.348945
## 30 20 1.794053 0.6089956 1.341256
## 30 40 1.790565 0.6102296 1.351091
## 40 5 1.835113 0.5946635 1.358717
## 40 10 1.835684 0.5939132 1.345487
## 40 20 1.821771 0.6000633 1.351702
## 40 40 1.810445 0.6017705 1.356285
## 50 5 1.839444 0.5933170 1.359724
## 50 10 1.851677 0.5884200 1.367300
## 50 20 1.834790 0.5955452 1.346606
## 50 40 1.826745 0.5957722 1.365810
## 60 5 1.841390 0.5939469 1.353436
## 60 10 1.860252 0.5852938 1.370551
## 60 20 1.849120 0.5893441 1.360764
## 60 40 1.830226 0.5960918 1.355391
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 20, splitrule = variance
## and min.node.size = 40.
The tuning process can be plotted to get a better picture of what is going on:
plot(mod_rf_tune)
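In line with the earlier remark about preferring ggplot(), caret also provides a ggplot method for train objects, so the same tuning curves can be drawn with:
ggplot(mod_rf_tune)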
The optimal values of the hyperparameters are
mod_rf_tune$bestTune
Now we will retrain the model on the whole training data with these values:
mod_rf_tuned <- train(Y ~ . , data = train_data, method = "ranger",
num.trees = 100,
importance = 'impurity',
tuneGrid = expand.grid(mod_rf_tune$bestTune),
trControl = trainControl("oob"))
mod_rf_tuned
## Random Forest
##
## 3993 samples
## 177 predictor
##
## No pre-processing
## Resampling results:
##
## RMSE Rsquared MAE
## 1.552133 0.7125418 1.142014
##
## Tuning parameter 'mtry' was held constant at a value of 20
## Tuning
## parameter 'splitrule' was held constant at a value of variance
##
## Tuning parameter 'min.node.size' was held constant at a value of 40
Here is the test error. It should be a bit better than that of the first random forest.
mod_rf_tuned %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.569078
Extract the 70 most important variables and train a new random forest with 50 trees on just those 70 variables. Tune the hyperparameters using the following values:
mtry = 3, 5, 8, 15
min.node.size = 2, 4, 8, 16, 32
Then retrain the random forest with the optimal values of the hyperparameters but with 500 trees instead of 50.
Report the test error of the final model.
top_variables <- mod_rf_tuned$finalModel$variable.importance %>%
sort(decreasing = TRUE) %>% head(70) %>% names
t_data <- train_data %>%
select(all_of(top_variables), Y)
v_data <- test_data %>%
select(all_of(top_variables), Y)
rfGrid_2 <- expand.grid(mtry = c(3, 5, 8, 15),
min.node.size = c(2, 4, 8, 16, 32),
splitrule = "variance")
mod_rf_topvar <- train(Y ~ . , data = t_data, method = "ranger",
num.trees = 50,
importance = 'impurity',
tuneGrid = rfGrid_2,
trControl = trainControl("oob"))
mod_rf_topvar
## Random Forest
##
## 3993 samples
## 70 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 3 2 1.691569 0.6864934 1.364034
## 3 4 1.698157 0.6880713 1.371738
## 3 8 1.745042 0.6762095 1.400343
## 3 16 1.705281 0.6895987 1.380798
## 3 32 1.703106 0.6845100 1.376944
## 5 2 1.565631 0.7108941 1.192262
## 5 4 1.569269 0.7113679 1.201785
## 5 8 1.566804 0.7124685 1.197332
## 5 16 1.576999 0.7078233 1.204693
## 5 32 1.577176 0.7076377 1.206910
## 8 2 1.549006 0.7140285 1.149347
## 8 4 1.544244 0.7159724 1.148777
## 8 8 1.549222 0.7140333 1.154370
## 8 16 1.547220 0.7146295 1.150184
## 8 32 1.544423 0.7162628 1.157942
## 15 2 1.558433 0.7104340 1.137908
## 15 4 1.552323 0.7126835 1.133491
## 15 8 1.546518 0.7147791 1.129417
## 15 16 1.559776 0.7098098 1.140998
## 15 32 1.555910 0.7111383 1.140837
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 8, splitrule = variance
## and min.node.size = 4.
The optimal values of the hyperparameters are
mod_rf_topvar$bestTune
mod_rf_final <- train(Y ~ . , data = t_data, method = "ranger",
num.trees = 500,
importance = 'impurity',
tuneGrid = expand.grid(mod_rf_topvar$bestTune),
trControl = trainControl("oob"))
And the test error of the final model is
mod_rf_final %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.575252
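To wrap up, here is one way to collect the test RMSEs of the models above into a single table (a sketch that assumes the model objects fitted earlier are still in memory):
tibble(
  model = c("decision tree", "default random forest",
            "tuned random forest", "tuned random forest (top 70 words)"),
  test_rmse = c(
    RMSE(predict(mod_tree, test_data), test_data$Y),
    RMSE(predict(mod_rf, test_data), test_data$Y),
    RMSE(predict(mod_rf_tuned, test_data), test_data$Y),
    RMSE(predict(mod_rf_final, test_data), test_data$Y)
  )
)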
I used ChatGPT to update the code from 2024 to make it more tidyverse-styled.