The aim of this lab is to apply tree-based models to natural language processing and to learn how to train and tune them.
By the end of this lab session, students should be able to:
Visualize text data.
Extract features from text data.
Train and plot decision trees and regularized decision trees.
Train random forests and tune their hyperparameters to minimize the out-of-bag error.
You should be familiar with the libraries caret and tidyverse.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or require modifying R code (modify the code here). In either case, you can discuss your work with the lab instructor.
We will work with a dataset of Donald Trump’s tweets. Source:
The objective is to predict the number of retweets from the text of each tweet.
First, we will load the data into R. Note that the original dataset is a JSON file that contains a lot of information. For this lab, we extracted only the text and the number of retweets from 5K out of the original 40K tweets.
library(tidyverse) # for manipulation with data
library(tidytext) # for tidyverse-style text tokenization
library(caret) # for machine learning, including KNN
library(rpart) # for training decision trees
library(rpart.plot) # for plotting decision trees
library(ranger) # for training random forest
# Note for students: cforest (from the party package) and the randomForest library
# provide better access to the internals of random forest models.
# That's why they were used to prepare the lecture 5 slides.
# However, cforest and randomForest do not allow tuning the minimum node size.
library(wordcloud) # for text visualization
set.seed(8192)
tweets <- read_csv("trump_tweets.csv")
head(tweets)
The number of tweets is
nrow(tweets)
## [1] 5000
We will remove all symbols other than letters, “@”, and “#” (the latter two are kept because they are special symbols in Twitter). Then we will convert all characters to lower case. Below is the processed dataset.
tweets <- tweets %>%
mutate(
cleaned_text = text %>%
str_replace_all("[^A-Za-z@#]", " ") %>%
str_squish() %>%
str_to_lower()
)
head(tweets)
The symbols “@” and “#” have special meaning in Twitter, and it is probably a good idea to treat them as separate words. Write R code that replaces all occurrences of “@” with “twacc” and all occurrences of “#” with “htag” in Trump’s tweets.
# Modify the code below
tweets <- tweets %>%
mutate(
cleaned_text = cleaned_text %>%
str_replace_all("@", "twacc ") %>%
str_replace_all("#", "htag ")
)
head(tweets)
An important tool in natural language processing is the document-term matrix. This is a matrix whose rows represent documents (here, tweets) and whose columns represent words. The element \(T_{ij}\) in row \(i\) and column \(j\) is the number of occurrences of word \(j\) in document \(i\). To construct the document-term matrix, the first step is to create a data frame of word counts in each document:
tidy_counts <- tweets %>%
mutate(doc_id = row_number()) %>%
select(doc_id, cleaned_text) %>%
unnest_tokens(word, cleaned_text) %>%
count(doc_id, word, sort = FALSE)
tidy_counts %>% sample_n(10)
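To make the definition of \(T_{ij}\) concrete, here is a tiny sketch with a made-up two-document corpus (the toy object and its sentences are purely illustrative, not part of the Trump data):
toy <- tibble(doc_id = 1:2,   # made-up toy corpus, for illustration only
              cleaned_text = c("make america great again",
                               "america first america strong"))
toy %>%
  unnest_tokens(word, cleaned_text) %>%
  count(doc_id, word) %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0)
# The entry in row 2 and column "america" equals 2 because the word "america"
# occurs twice in the second toy document.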
How many different words are there in the vocabulary?
Given the dimensions of the document-term matrix, is it possible to use linear regression to predict the number of retweets from it?
Answer The number of words in the vocabulary is
n_distinct(tidy_counts$word)
## [1] 12674
It is impossible to use linear regression because the number of predictors (words) is greater than the number of observations (tweets).
If we included all words that are mentioned at least once in any tweet, we would have too many words, and most of them wouldn’t be useful since they appear only in one or two tweets.
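You can check this claim directly; the sketch below computes the share of vocabulary words whose total count in the whole corpus is at most 2:
tidy_counts %>%
  count(word, wt = n, name = "freq") %>%     # total count of each word in the corpus
  summarise(prop_rare = mean(freq <= 2))     # share of words appearing at most twice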
To reduce the number of independent variables, we will keep only words that appear in the whole corpus (the collection of all tweets) at least, say, 30 times. We won’t keep stopwords either, since they aren’t meaningful for our purpose.
keep_words <- tidy_counts %>%
count(word, wt = n, name = "freq") %>%
filter(freq >= 30, !word %in% stop_words$word)
keep_words %>% sample_n(10)
The next step is to subset tidy_counts to include only
frequent words. There are two ways to do it: the first one (commented out) is more
efficient (works faster for large datasets), but the second one is easier to
read.
# tidy_counts_f <- tidy_counts %>%
# semi_join(keep_words, by = "word")
tidy_counts_f <- tidy_counts %>%
filter(word %in% keep_words$word)
Finally, we can create the document-term matrix. Note that it has a
special class, so we will immediately create a data frame out of it. You
are encouraged to explore the DTM itself to get a feeling for what
it looks like.
Again, there are two ways to create the document-term matrix. The first one (commented out below) is more efficient if the number of predictors is large but takes some effort to understand:
# all_docs <- seq_len(nrow(tweets))
# DTM <- tidy_counts_f %>%
# mutate(doc_id = factor(doc_id, levels = all_docs),
# word = factor(word, levels = keep_words$word)) %>%
# cast_sparse(doc_id, word, n)
#
# all_data <- DTM %>%
# as.matrix() %>%
# as_tibble(.name_repair = "minimal") %>%
# mutate(Y = tweets$Y) # note: the variable Y is only created later in this lab
all_data <- tibble(doc_id = 1:nrow(tweets)) %>%
left_join(tidy_counts_f) %>%
pivot_wider(id_cols = doc_id,
names_from = "word",
values_from = "n",
values_fill = 0) %>%
select(-any_of(c("doc_id", "NA")))
## Joining with `by = join_by(doc_id)`
all_data %>%
select(amp, iran, golf, hillary, twacc) %>%
slice(1:5)
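If you run the commented-out cast_sparse() version above, here is one way to explore the resulting DTM object (a sketch, shown commented out because the sparse matrix is not built in this document):
# dim(DTM)                       # number of documents x vocabulary size
# DTM[1:5, 1:8]                  # a small corner of the sparse matrix
# Matrix::colSums(DTM) %>%
#   sort(decreasing = TRUE) %>%
#   head(10)                     # the most frequent kept words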
Text data are often visualized with a word cloud. Below is the word cloud of Trump’s tweets.
wordcloud(words = keep_words$word, freq = keep_words$freq, min.freq = 0,
max.words = 100, random.order=FALSE, rot.per=0.35)

Plot the word cloud without the tokens “twacc”, “realdonaldtrump”, “htag”, “http”, and “https”.
df_to_word_cloud <- function(word_data) {
wordcloud(words = word_data$word, freq = word_data$freq, min.freq = 0,
max.words = 100, random.order=FALSE, rot.per=0.35)
}
keep_words %>%
filter(!word %in% c("twacc", "realdonaldtrump", "htag", "http","https")) %>%
df_to_word_cloud

Let us plot the distribution of the number of retweets. As shown in the plot below, it is highly skewed:
ggplot(data = tweets, aes(x = retweet_count)) +
geom_histogram(fill = "orange", color = "black", bins = 20)

It therefore makes more sense to predict the logarithm of the number of retweets rather than the raw count. Compute the logarithm of the number of retweets and store it in the variable “Y” of our dataset. Warning: think about what you will do with tweets whose retweet count is 0.
Plot the histogram of the log-retweet count.
# Modify the code below
# You should see a nice bimodal distribution
tweets <- tweets %>%
mutate(Y = log(1 + retweet_count))
all_data <- all_data %>% mutate(Y = tweets$Y)
ggplot(data = tweets, aes(x = Y)) +
geom_histogram(fill = "orange", color = "black", bins = 20) +
xlab("Log-retweet-count")

You can try to figure out why the distribution is bimodal, but it is not necessary.
We will randomly split the data into 80% training and 20% test sets.
p <- 0.8
ind <- runif(nrow(all_data)) < p
train_data <- all_data %>% filter(ind)
test_data <- all_data %>% filter(!ind)
cat("Dimensions of the training set are", dim(train_data),"\n")
## Dimensions of the training set are 3993 178
cat("Dimensions of the test set are", dim(test_data),"\n")
## Dimensions of the test set are 1007 178
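For reference, caret also provides createDataPartition(), which makes a split stratified on the outcome; a minimal sketch (commented out so that the split above is left unchanged, and the object names idx, train_data2, test_data2 are only illustrative):
# idx <- createDataPartition(all_data$Y, p = 0.8, list = FALSE)
# train_data2 <- all_data[idx, ]     # stratified 80% training set
# test_data2  <- all_data[-idx, ]    # remaining 20% test set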
We will use the raw interface of the library rpart here
rather than wrapping it in caret's train(). This is because it is a
bit simpler, and rpart automatically does cross-validation
for choosing the optimal value of the regularization constant.
Remark As a rule of thumb, it is preferable to use
tidyverse tools such as ggplot() for plotting,
because they are much easier to customize and extend than base R
graphics. In practice, if you need anything more elaborate than a quick
plot(), ggplot() will usually be the better
choice. An exception is the simple diagnostic plots that come bundled with
modeling packages (e.g., decision trees with rpart.plot,
cross-validation error curves with plotcp(), etc.). These are
often implemented in base R graphics and are convenient to use as-is.
First, we construct a decision tree and print it.
mod_tree <- rpart(Y ~ . , data = train_data)
print(mod_tree)
## n= 3993
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3993 33455.24000 5.771200
## 2) https< 0.5 3219 21443.57000 4.962397
## 4) twacc>=0.5 2170 11533.10000 4.097823
## 8) rt< 0.5 2038 8273.64200 3.815346
## 16) interviewed< 0.5 2010 7856.63900 3.765413 *
## 17) interviewed>=0.5 28 52.22542 7.399866 *
## 9) rt>=0.5 132 586.11860 8.459084 *
## 5) twacc< 0.5 1049 4932.98900 6.750888
## 10) http>=0.5 141 198.46960 4.485972 *
## 11) http< 0.5 908 3898.89200 7.102598 *
## 3) https>=0.5 774 1148.29700 9.134942 *
And here is the plot:
rpart.plot(mod_tree)

We can access the internals of the tree model to extract particular splits and the deviances at those splits:
mod_tree$frame
Compute the RSS in two ways: by aggregating the deviance of all leaf nodes, and directly as the total training squared error.
The RSS of the entire dataset is the deviance of the root node. Below is the RSS of the decision tree, obtained by summing the deviance over all leaf nodes:
sum(mod_tree$frame$dev[mod_tree$frame$var == "<leaf>"])
## [1] 13740.64
And here is the same RSS computed directly as the total training squared error:
tse <- function(x, y) {   # total squared error
  sum((x - y)^2)
}
mod_tree %>%
predict(train_data) %>%
tse(train_data$Y)
## [1] 13740.64
To prune a decision tree, we first grow it and then choose a
sub-tree minimizing the regularized loss function \[
\sum_{m=1}^{|T|}\sum_{x^{i}\in R_m}\left(y^{i}-\hat{y}_{R_m}\right)^2
+\alpha|T|.
\] Below is the plot of the cross-validation error vs \(\alpha\) (parameter cp):
plotcp(mod_tree)

In this case, we do not need to prune the tree. But for future reference, here is how to do it. First, we look at the table of candidate cp values:
mod_tree$cptable
## CP nsplit rel error xerror xstd
## 1 0.32471355 0 1.0000000 1.0005826 0.01374169
## 2 0.14878048 1 0.6752864 0.6756644 0.01387620
## 3 0.07990787 2 0.5265060 0.5271873 0.01243399
## 4 0.02497748 3 0.4465981 0.4475226 0.01086897
## 5 0.01090345 4 0.4216206 0.4226515 0.01065410
## 6 0.01000000 5 0.4107172 0.4161941 0.01061745
The value of cp minimizing the cross-validation error is
opt_cp <- mod_tree$cptable[which.min(mod_tree$cptable[ , 'xerror']) , 'CP']
opt_cp
## [1] 0.01
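As an aside, a common alternative to picking the cp with the smallest xerror is the one-standard-error rule: choose the simplest tree whose xerror is within one xstd of the minimum. A sketch (the names cpt, threshold, cp_1se are illustrative):
cpt <- mod_tree$cptable
best <- which.min(cpt[, "xerror"])                        # row with the smallest CV error
threshold <- cpt[best, "xerror"] + cpt[best, "xstd"]      # one standard error above the minimum
cp_1se <- cpt[min(which(cpt[, "xerror"] <= threshold)), "CP"]  # simplest tree within the threshold
cp_1se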
And now we prune the tree:
mod_tree_pruned <- prune.rpart(mod_tree, opt_cp)
mod_tree_pruned
## n= 3993
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3993 33455.24000 5.771200
## 2) https< 0.5 3219 21443.57000 4.962397
## 4) twacc>=0.5 2170 11533.10000 4.097823
## 8) rt< 0.5 2038 8273.64200 3.815346
## 16) interviewed< 0.5 2010 7856.63900 3.765413 *
## 17) interviewed>=0.5 28 52.22542 7.399866 *
## 9) rt>=0.5 132 586.11860 8.459084 *
## 5) twacc< 0.5 1049 4932.98900 6.750888
## 10) http>=0.5 141 198.46960 4.485972 *
## 11) http< 0.5 908 3898.89200 7.102598 *
## 3) https>=0.5 774 1148.29700 9.134942 *
Variable importance is measured as the total drop in the residual sum of
squares due to splits on each variable. Here is how we calculate it with
the function varImp from caret:
varImp(mod_tree_pruned) %>%
arrange(-Overall) %>%
head(n = 20)
And here is how we can extract it directly from the model:
mod_tree_pruned$variable.importance
## https twacc rt
## 10863.36988 5017.99184 2673.33702
## http interviewed obama
## 835.62757 364.77748 241.99411
## join makeamericagreatagain honor
## 230.49222 196.49506 154.38898
## democrats cont hillary
## 150.84151 142.23448 137.60450
## obamacare crooked whitehouse
## 104.38962 90.15467 76.39400
## russia donald hotel
## 20.25255 17.77931 11.85287
Finally, let us construct predictions and report the test root mean
squared error calculated with caret's RMSE() (here, we did not write
our own version of this very simple function because we
are loading caret anyway).
mod_tree %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.880089
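For reference, caret's RMSE() simply computes the square root of the mean squared error; a minimal sketch of an equivalent calculation (the helper rmse_manual is only illustrative):
rmse_manual <- function(pred, obs) sqrt(mean((pred - obs)^2))  # root mean squared error
mod_tree %>%
  predict(test_data) %>%
  rmse_manual(test_data$Y)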
We will use the library ranger because it allows tuning
more than just one hyperparameter.
First, we train a random forest model and print the result. Note that
by default, caret tries three values of mtry (the number of
predictors considered at each split), namely \(2\), \(p/2\), and \(p\); two values of
splitrule (something we did not cover in class,
so you can either ignore it or look it up); and only one value of
min.node.size, namely 5.
Here we set the train control to "oob", i.e., the out-of-bag error
(with 5-fold cross-validation, training would be about 5 times slower), and
num.trees to 50 (with the default value of 500, training
would be about 10 times slower).
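For comparison, if you preferred 5-fold cross-validation to the out-of-bag error (at the cost of the extra training time mentioned above), the control object would look like this (a sketch only; it is not run here, and the names ctrl_cv and mod_rf_cv are illustrative):
# ctrl_cv <- trainControl(method = "cv", number = 5)
# mod_rf_cv <- train(Y ~ . , data = train_data, method = "ranger",
#                    num.trees = 50, trControl = ctrl_cv)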
set.seed(100)
mod_rf <- train(Y ~ . , data = train_data, method = "ranger",
num.trees = 50,
importance = 'impurity',
trControl = trainControl("oob"))
print(mod_rf)
## Random Forest
##
## 3993 samples
## 177 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 2.109581 0.6145837 1.793226
## 2 extratrees 2.261572 0.5535946 1.927110
## 89 variance 1.619465 0.6910906 1.161660
## 89 extratrees 1.603458 0.6958904 1.158384
## 177 variance 1.636035 0.6858236 1.177388
## 177 extratrees 1.620119 0.6911891 1.166473
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 89, splitrule = extratrees
## and min.node.size = 5.
Here we construct predictions and report the test error:
mod_rf %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.608222
Here is variable importance for top 20 most important variables:
varImp(mod_rf)
## ranger variable importance
##
## only 20 most important variables shown (out of 177)
##
## Overall
## https 100.000
## twacc 92.056
## rt 28.445
## http 20.302
## realdonaldtrump 19.433
## hillary 6.598
## democrats 4.189
## htag 4.049
## fake 3.935
## cnn 3.564
## interviewed 3.383
## trump 3.329
## america 3.328
## people 3.073
## media 2.973
## clinton 2.529
## foxnews 2.365
## obama 2.249
## makeamericagreatagain 2.021
## barackobama 1.966
If we want the importance of all variables, here is how to extract the full vector from the model (we print only the top 10 most important variables, but you can easily print the whole vector):
var_importance <- mod_rf$finalModel$variable.importance %>%
sort(decreasing = TRUE)
var_importance %>% head(10)
## https twacc rt http realdonaldtrump
## 7303.1909 6723.7392 2083.6586 1489.6843 1426.3358
## hillary democrats htag fake cnn
## 490.0537 314.3798 304.1789 295.8575 268.7504
Here is the built-in plot of variable importance:
varImp(mod_rf) %>%
plot(top = 10)
And here is how you can make a custom plot with
ggplot2:
var_importance <- mod_rf$finalModel$variable.importance %>%
sort(decreasing = TRUE) %>% head(10)
var_importance %>%
enframe(name = "variable", value = "importance") %>%
mutate(variable = fct_reorder(variable, importance)) %>%
ggplot(aes(variable, importance)) +
geom_col() +
coord_flip()
Random forest has a number of hyperparameters that can be tuned with OOB error (faster) or with cross-validation (slower).
Random forest models may take a lot of time to train. For the sake of saving time, we will create a mini-version of our dataset to demonstrate the tuning process.
set.seed(199)
ind <- sample(1:nrow(train_data), size = 500, replace = FALSE)
mini_data <- train_data %>% slice(ind)
rfGrid <- expand.grid(mtry = c(10, 20, 30, 40, 50, 60),
min.node.size = c(5, 10, 20, 40),
splitrule = "variance")
mod_rf_tune <- train(Y ~ . , data = mini_data, method = "ranger",
num.trees = 100,
importance = 'impurity',
tuneGrid = rfGrid,
trControl = trainControl("oob"))
mod_rf_tune
## Random Forest
##
## 500 samples
## 177 predictors
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 10 5 1.762335 0.6287188 1.401409
## 10 10 1.756478 0.6332227 1.409939
## 10 20 1.770713 0.6270420 1.427645
## 10 40 1.773984 0.6254328 1.425865
## 20 5 1.779183 0.6145700 1.367211
## 20 10 1.774561 0.6169195 1.356166
## 20 20 1.783228 0.6129856 1.369521
## 20 40 1.755187 0.6257252 1.357974
## 30 5 1.801382 0.6064170 1.349882
## 30 10 1.795172 0.6090385 1.348945
## 30 20 1.794053 0.6089956 1.341256
## 30 40 1.790565 0.6102296 1.351091
## 40 5 1.835113 0.5946635 1.358717
## 40 10 1.835684 0.5939132 1.345487
## 40 20 1.821771 0.6000633 1.351702
## 40 40 1.810445 0.6017705 1.356285
## 50 5 1.839444 0.5933170 1.359724
## 50 10 1.851677 0.5884200 1.367300
## 50 20 1.834790 0.5955452 1.346606
## 50 40 1.826745 0.5957722 1.365810
## 60 5 1.841390 0.5939469 1.353436
## 60 10 1.860252 0.5852938 1.370551
## 60 20 1.849120 0.5893441 1.360764
## 60 40 1.830226 0.5960918 1.355391
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 20, splitrule = variance
## and min.node.size = 40.
The tuning process can be plotted to get a better picture of what is going on:
plot(mod_rf_tune)
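In line with the earlier remark about preferring ggplot(), caret also provides a ggplot method for train objects, so the same tuning curves can be drawn with:
ggplot(mod_rf_tune)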
The optimal values of the hyperparameters are
mod_rf_tune$bestTune
Now we will retrain the model on the whole training data with these values:
mod_rf_tuned <- train(Y ~ . , data = train_data, method = "ranger",
num.trees = 100,
importance = 'impurity',
tuneGrid = expand.grid(mod_rf_tune$bestTune),
trControl = trainControl("oob"))
mod_rf_tuned
## Random Forest
##
## 3993 samples
## 177 predictor
##
## No pre-processing
## Resampling results:
##
## RMSE Rsquared MAE
## 1.552133 0.7125418 1.142014
##
## Tuning parameter 'mtry' was held constant at a value of 20
## Tuning
## parameter 'splitrule' was held constant at a value of variance
##
## Tuning parameter 'min.node.size' was held constant at a value of 40
Here is the test error. It should be a bit better than that of the first random forest.
mod_rf_tuned %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.569078
Extract the 70 most important variables and train a new random forest with 50 trees on just those 70 variables. Tune the hyperparameters using the following values:
mtry = 3, 5, 8, 15
min.node.size = 2, 4, 8, 16, 32
Then retrain the random forest with the optimal values of the hyperparameters but with 500 trees instead of 50.
Report the test error of the final model.
top_variables <- mod_rf_tuned$finalModel$variable.importance %>%
sort(decreasing = TRUE) %>% head(70) %>% names
t_data <- train_data %>%
select(all_of(top_variables), Y)
v_data <- test_data %>%
select(all_of(top_variables), Y)
rfGrid_2 <- expand.grid(mtry = c(3, 5, 8, 15),
min.node.size = c(2, 4, 8, 16, 32),
splitrule = "variance")
mod_rf_topvar <- train(Y ~ . , data = t_data, method = "ranger",
num.trees = 50,
importance = 'impurity',
tuneGrid = rfGrid_2,
trControl = trainControl("oob"))
mod_rf_topvar
## Random Forest
##
## 3993 samples
## 70 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 3 2 1.691569 0.6864934 1.364034
## 3 4 1.698157 0.6880713 1.371738
## 3 8 1.745042 0.6762095 1.400343
## 3 16 1.705281 0.6895987 1.380798
## 3 32 1.703106 0.6845100 1.376944
## 5 2 1.565631 0.7108941 1.192262
## 5 4 1.569269 0.7113679 1.201785
## 5 8 1.566804 0.7124685 1.197332
## 5 16 1.576999 0.7078233 1.204693
## 5 32 1.577176 0.7076377 1.206910
## 8 2 1.549006 0.7140285 1.149347
## 8 4 1.544244 0.7159724 1.148777
## 8 8 1.549222 0.7140333 1.154370
## 8 16 1.547220 0.7146295 1.150184
## 8 32 1.544423 0.7162628 1.157942
## 15 2 1.558433 0.7104340 1.137908
## 15 4 1.552323 0.7126835 1.133491
## 15 8 1.546518 0.7147791 1.129417
## 15 16 1.559776 0.7098098 1.140998
## 15 32 1.555910 0.7111383 1.140837
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 8, splitrule = variance
## and min.node.size = 4.
The optimal values of the hyperparameters are
mod_rf_topvar$bestTune
mod_rf_final <- train(Y ~ . , data = t_data, method = "ranger",
num.trees = 500,
importance = 'impurity',
tuneGrid = expand.grid(mod_rf_topvar$bestTune),
trControl = trainControl("oob"))
And the test error of the final model is
mod_rf_final %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.575252
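To wrap up, here is one way to collect the test RMSEs of the models above into a single table (a sketch that assumes the model objects fitted earlier are still in memory):
tibble(
  model = c("decision tree", "default random forest",
            "tuned random forest", "tuned random forest (top 70 words)"),
  test_rmse = c(
    RMSE(predict(mod_tree, test_data), test_data$Y),
    RMSE(predict(mod_rf, test_data), test_data$Y),
    RMSE(predict(mod_rf_tuned, test_data), test_data$Y),
    RMSE(predict(mod_rf_final, test_data), test_data$Y)
  )
)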
I used ChatGPT to update the code from 2024 to make it more tidyverse-styled.