The aim of this lab is to apply tree-based models to natural language processing and to learn how to train and tune them.
By the end of this lab session, students should be able to
Visualize text data.
Extract features from text data.
Train and plot decision trees and regularized decision trees.
Train random forest and tune the hyperparameters in random forest to minimize the out-of-bag error.
You should be familiar with the libraries `caret` and `tidyverse`.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or a modification of the R code (modify the code here). In either case, you can discuss your work with the lab instructor.
We will work with a dataset of Donald Trump’s tweets. Source:
The objective is to predict the number of retweets from the text of each tweet.
First we will load the data into R. Note that the original dataset is a JSON file that contains a lot of information. For this lab, we extracted only the text and the number of retweets from 5K of the original 40K Trump tweets.
library(tidyverse) # for manipulation with data
library(caret) # for machine learning (train, trainControl, varImp, RMSE)
library(rpart) # for training decision trees
library(rpart.plot) # for plotting decision trees
library(ranger) # for training random forest
# Note for students: `randomForest` and `cforest` (from the `party` package)
# provide better access to the internals of random forest models.
# That's why they were used to prepare the lecture 5 slides.
# However, neither of them allows tuning the minimum node size via caret.
library(tm) # for text mining
library(wordcloud) # for text visualization
tweets <- read_csv("trump_tweets.csv")
head(tweets)
The number of tweets is
nrow(tweets)
## [1] 5000
We will remove all non-letter symbols except “@” and “#”, which are special symbols, and then convert all characters to lower case. Below is the processed dataset.
tweets <- tweets %>%
mutate(cleaned_text = gsub("[^a-zA-Z@#]", " ", text)) %>%
mutate(cleaned_text = tolower(cleaned_text))
head(tweets)
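To see what the regular expression does, here it is applied to a made-up example string (not from the dataset): every character other than a letter, “@” or “#” is turned into a space.
# Letters, "@" and "#" survive; digits and punctuation become spaces
gsub("[^a-zA-Z@#]", " ", "Thank you, @FoxNews!! #Trump2016")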
The symbols “@” and “#” have a special meaning on Twitter, so it is probably a good idea to treat them as separate words. Write R code that replaces all occurrences of “@” with “twacc” and all occurrences of “#” with “htag” in Trump’s tweets.
# Modify the code below
tweets <- tweets %>%
mutate(cleaned_text = gsub("@", "twacc ", cleaned_text)) %>%
mutate(cleaned_text = gsub("#", "htag ", cleaned_text))
head(tweets)
An important tool in natural language processing is the document-term matrix. This is a matrix whose rows represent documents, i.e., tweets, and whose columns represent words. The element \(T_{ij}\) in row \(i\) and column \(j\) is the number of occurrences of word \(j\) in document \(i\). Below we construct the document-term matrix of Donald Trump’s tweets and report its dimensions.
corpus <- VCorpus(VectorSource(tweets$cleaned_text))
DTM <- DocumentTermMatrix(corpus)
dim(DTM)
## [1] 5000 12164
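To see how the counts \(T_{ij}\) arise, here is a minimal toy example with two made-up documents (not part of the lab data):
toy_corpus <- VCorpus(VectorSource(c("the cat sat", "the cat and the dog")))
# row i, column j: number of occurrences of word j in document i
as.matrix(DocumentTermMatrix(toy_corpus))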
Below is a sample of the document-term matrix (note that the words appear as column labels while the rows are labelled only by document numbers):
as.matrix(DTM[1:5, c('trump', 'that', 'sure', 'the', 'twacc')])
## Terms
## Docs trump that sure the twacc
## 1 1 0 0 1 1
## 2 1 1 0 3 0
## 3 0 0 0 0 1
## 4 0 0 0 0 1
## 5 0 0 0 1 2
To find the total frequency of each word, we will calculate the sum of the values in each column of this matrix. Below are the frequencies of the first few words (in alphabetical order):
words_freq <- termFreq(tweets$cleaned_text)
head(words_freq)
## aaa aaeu aafczflvmx aarnold aaronpha aaszkler
## 1 1 1 1 1 1
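Note that `termFreq` recomputes the frequencies from the raw text. Assuming the tokenization matches, the same totals can be obtained as column sums of the (sparse) document-term matrix using `slam`, a package that `tm` depends on:
library(slam) # sparse matrix utilities; installed together with tm
head(col_sums(DTM))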
Given the dimensions of the document-term matrix, is it possible to use linear regression to predict the number of retweets from it?
It is impossible to use linear regression because the number of predictors (12164) is greater than the number of observations (5000). It is, however, possible to use ridge regression.
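For reference, here is a minimal sketch (not run in this lab) of how ridge regression could be fit with the `glmnet` package, assuming it is installed; the response used here, the log retweet count, is introduced later in the lab.
# Not run: ridge regression works even when p > n
# library(glmnet)
# X <- as.matrix(DTM)                     # 5000 x 12164 at this point
# y <- log1p(tweets$retweet_count)        # log-transformed response
# mod_ridge <- cv.glmnet(X, y, alpha = 0) # alpha = 0 selects the ridge penalty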
Remove the columns of the document-term matrix that represent words whose total frequency is below 50. What are the dimensions of the matrix now?
frequent_words <- words_freq[words_freq >= 50]
length(frequent_words)
## [1] 190
# modify the code below
DTM <- DTM[ , names(frequent_words)]
dim(DTM)
## [1] 5000 190
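For reference, `tm` has a built-in helper for this kind of filtering. Here is a sketch of the equivalent selection on the unfiltered matrix (not run, since `DTM` has already been subset above):
# Not run: equivalent filtering with tm's helper
# frequent_terms <- findFreqTerms(DTM, lowfreq = 50)
# DTM <- DTM[ , frequent_terms]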
Text data are often visualized with a word cloud. Below is the word cloud of Trump’s tweets.
wordcloud(words = names(frequent_words), freq = frequent_words, min.freq = 0,
max.words = 100, random.order=FALSE, rot.per=0.35)
Note that some of these words are so-called stopwords. Below is the list of English stopwords:
stopwords()
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
Now we will remove these stopwords from our vector of word frequencies in Trump’s tweets and plot the new word cloud. We will also add some colours to the word cloud (the palette comes from RColorBrewer, which is attached together with wordcloud).
frequent_words <- frequent_words[!(names(frequent_words) %in% stopwords())]
wordcloud(words = names(frequent_words), freq = frequent_words, min.freq = 0,
max.words = 100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Now we will remove the stopwords from the document-term matrix and report the dimensions of the new document-term matrix.
# Modify this code chunk
DTM <- DTM[ , names(frequent_words)]
dim(DTM)
## [1] 5000 136
Let us plot the distribution of the number of retweets. As shown in the plot below, it is highly skewed:
ggplot(data = tweets, aes(x = retweet_count)) +
geom_histogram(fill = "orange", color = "black", bins = 20)
It therefore makes more sense to predict the logarithm of the number of retweets rather than the raw count. Compute the logarithm of the number of retweets and store it in the variable “Y” of our dataset. Warning: think about what you will do with tweets whose retweet count is 0.
Plot the histogram of the log retweet count.
# Modify the code below
# You should see a nice bimodal distribution
tweets$Y <- log(1 + tweets$retweet_count) # adding 1 handles tweets with 0 retweets
ggplot(data = tweets, aes(x = Y)) +
geom_histogram(fill = "orange", color = "black", bins = 20) +
xlab("Log-retweet-count")
You can try to figure out why the distribution is bimodal, but it is not necessary.
We will randomly split the data into 80% training and 20% test sets. We will also rename all columns by adding “w_” to every column name. This is done to avoid some technical issues (random forest does not behave well if R keywords, such as `next`, are used as variable names).
set.seed(2020)
p <- 0.8
ind <- runif(nrow(DTM)) < p
all_data <- DTM %>%
as.matrix %>%
as_tibble %>%
rename_with(function(x) paste("w", x, sep = "_")) %>%
mutate(Y = tweets$Y)
train_data <- all_data[ind , ]
test_data <- all_data[!ind , ]
cat("Dimensions of the training set are", dim(train_data),"\n")
## Dimensions of the training set are 4056 137
cat("Dimensions of the test set are", dim(test_data),"\n")
## Dimensions of the test set are 944 137
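A note on the split: with `runif`, the 80% fraction is only approximate. For reference, `caret` provides `createDataPartition`, which hits the requested fraction almost exactly and stratifies on the outcome; a sketch (not used further):
# Alternative split via caret (stratified on quantiles of Y)
set.seed(2020)
idx <- createDataPartition(all_data$Y, p = 0.8, list = FALSE)
length(idx) # number of training rows, close to 0.8 * 5000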
We will use the raw interface of the library `rpart` here rather than wrapping it in `train`. This is because it is a bit simpler, and `rpart` automatically does cross-validation for choosing the optimal value of the regularization constant.
First, we construct a decision tree and print it.
mod_tree <- rpart(Y ~ . , data = train_data)
print(mod_tree)
## n= 4056
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4056 33944.6800 5.836876
## 2) w_https< 0.5 3226 21289.5400 4.985934
## 4) w_twacc>=0.5 2155 11362.5700 4.123382
## 8) w_thanks>=0.5 194 476.3170 2.264581 *
## 9) w_thanks< 0.5 1961 10149.6400 4.307272 *
## 5) w_twacc< 0.5 1071 5097.5800 6.721506
## 10) w_http>=0.5 141 204.2644 4.461973 *
## 11) w_http< 0.5 930 4064.3000 7.064080 *
## 3) w_https>=0.5 830 1239.9060 9.144275 *
And here is the plot:
rpart.plot(mod_tree)
We can access the internals of the tree model to extract the individual splits and the deviance at each node:
mod_tree$frame
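The columns of `$frame` that we will need below are the following (see `?rpart.object` for the full description):
# var: splitting variable ("<leaf>" marks terminal nodes)
# n:   number of training observations in the node
# dev: deviance of the node (for regression trees, the RSS)
mod_tree$frame %>% select(var, n, dev) %>% head()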
Compute the RSS in two ways: by aggregating the deviance over all leaf nodes, and as the total training squared error.
The RSS of the entire dataset is the deviance of the root node. Below is the RSS of the decision tree, obtained by summing the deviance over the leaf nodes.
sum(mod_tree$frame$dev[mod_tree$frame$var == "<leaf>"])
## [1] 16134.43
And here is the same RSS computed as the total training squared error:
tse <- function(x, y) {
  # total squared error between predictions x and observations y
  sum((x - y)^2)
}
mod_tree %>%
predict(train_data) %>%
tse(train_data$Y)
## [1] 16134.43
To prune a decision tree, we first grow it and then we choose a sub-tree minimizing the regularized loss function \[
\sum_{m=1}^{|T|}\sum_{x^{i}\in R_m}(y^{i}-\hat{y}_{R_m})^2
+\alpha|T|
\] Below is the plot of the cross-validation error vs \(\alpha\) (in `rpart`, the parameter `cp`, which is essentially \(\alpha\) rescaled by the deviance of the root node):
plotcp(mod_tree)
In this case, we do not need to prune the tree (the cross-validation error is minimized by the full tree). But for your future reference, here is how to do it. First, we will look at the table:
mod_tree$cptable
## CP nsplit rel error xerror xstd
## 1 0.33628942 0 1.0000000 1.0001212 0.01340659
## 2 0.14227232 1 0.6637106 0.6639109 0.01360342
## 3 0.02442257 2 0.5214383 0.5217974 0.01217237
## 4 0.02170031 3 0.4970157 0.5048248 0.01202827
## 5 0.01000000 4 0.4753154 0.4760675 0.01151867
The optimal value of \(\alpha\) (expressed as `cp`) is
opt_cp <- mod_tree$cptable[which.min(mod_tree$cptable[ , 'xerror']) , 'CP']
opt_cp
## [1] 0.01
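For your future reference, a popular alternative is the one-standard-error rule: take the simplest tree whose cross-validation error is within one standard error of the minimum. A sketch using the same table:
cpt <- mod_tree$cptable
best <- which.min(cpt[ , "xerror"])
# first row (i.e., smallest tree) whose xerror is within 1 SE of the minimum
within_1se <- which(cpt[ , "xerror"] <= cpt[best, "xerror"] + cpt[best, "xstd"])
opt_cp_1se <- cpt[min(within_1se), "CP"]
opt_cp_1se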
And now we prune the tree (since the optimal `cp` equals 0.01, the default used when growing the tree, pruning changes nothing here):
mod_tree_pruned <- prune.rpart(mod_tree, opt_cp)
mod_tree_pruned
## n= 4056
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4056 33944.6800 5.836876
## 2) w_https< 0.5 3226 21289.5400 4.985934
## 4) w_twacc>=0.5 2155 11362.5700 4.123382
## 8) w_thanks>=0.5 194 476.3170 2.264581 *
## 9) w_thanks< 0.5 1961 10149.6400 4.307272 *
## 5) w_twacc< 0.5 1071 5097.5800 6.721506
## 10) w_http>=0.5 141 204.2644 4.461973 *
## 11) w_http< 0.5 930 4064.3000 7.064080 *
## 3) w_https>=0.5 830 1239.9060 9.144275 *
Variable importance is measured as the total drop in the residual sum of squares due to splits on each variable. Here is how we calculate it with the function `varImp` from `caret`:
varImp(mod_tree_pruned) %>%
arrange(-Overall) %>%
head(n = 20)
And here is how we can extract it directly from the model:
mod_tree_pruned$variable.importance
## w_https w_twacc w_http
## 11415.235687 4829.388032 829.016160
## w_thanks w_makeamericagreatagain w_obama
## 736.609860 261.312624 211.933929
## w_cont w_now w_hillary
## 141.109134 130.767743 126.258511
## w_honor w_big w_obamacare
## 123.779664 121.749278 103.712348
## w_democrats w_american w_jobs
## 68.766480 13.753296 13.753296
## w_via w_trump w_love
## 11.759094 5.879547 3.796958
Finally, let us construct predictions and report the test root mean squared error calculated with `caret` (here, we did not create our own version of the very simple `RMSE` function because we are loading `caret` anyway).
mod_tree %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 2.014989
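As a sanity check (not part of the lab), we can compare this with the trivial baseline that always predicts the training-set mean of `Y`:
# Baseline: constant prediction equal to the training mean
rep(mean(train_data$Y), nrow(test_data)) %>%
  RMSE(test_data$Y)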
We will use the library `ranger` because it allows tuning more than just one hyperparameter.
First, we train a random forest model and print the result. Note that by default, it tries three values of `mtry` (the number of predictors considered at each split), \(2\), \(p/2\) and \(p\), two values of `splitrule` (this is something that we did not cover in class and you can either ignore it or google what it is), and only one value of `min.node.size`, 5.
Here we set the train control to `oob`, i.e., out-of-bag error (with 5-fold cross-validation, training would be about 5 times slower), and `num.trees` to 50 (with the default value of 500, training would be about 10 times slower).
set.seed(100)
mod_rf <- train(Y ~ . , data = train_data, method = "ranger",
num.trees = 50,
importance = 'impurity',
trControl = trainControl("oob"))
print(mod_rf)
## Random Forest
##
## 4056 samples
## 136 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 2.082586 0.5918521 1.768071
## 2 extratrees 2.380923 0.4881632 2.050377
## 69 variance 1.799780 0.6182019 1.294085
## 69 extratrees 1.791730 0.6202089 1.296690
## 136 variance 1.814183 0.6130391 1.305854
## 136 extratrees 1.809448 0.6149121 1.300855
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 69, splitrule = extratrees
## and min.node.size = 5.
Here we will construct predictions and report the test error:
mod_rf %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.714026
Here is variable importance for top 20 most important variables:
varImp(mod_rf)
## ranger variable importance
##
## only 20 most important variables shown (out of 136)
##
## Overall
## w_https 100.000
## w_twacc 74.355
## w_realdonaldtrump 32.627
## w_http 21.841
## w_thanks 21.092
## w_htag 5.843
## w_hillary 5.722
## w_trump 4.894
## w_will 4.479
## w_democrats 4.167
## w_great 3.958
## w_fake 3.456
## w_media 3.339
## w_amp 3.014
## w_america 2.951
## w_celebapprentice 2.716
## w_people 2.617
## w_today 2.613
## w_thank 2.490
## w_cnn 2.394
If we want all of them, here is how we can get the full information (we only print the top 10 most important variables, but you can easily print the whole vector):
var_importance <- mod_rf$finalModel$variable.importance %>%
sort(decreasing = TRUE)
var_importance %>% head(10)
## w_https w_twacc w_realdonaldtrump w_http
## 7260.1919 5401.6458 2377.5899 1595.9337
## w_thanks w_htag w_hillary w_trump
## 1541.6702 436.5180 427.7609 367.7833
## w_will w_democrats
## 337.6637 315.0795
Here is the built-in plot of variable importance:
varImp(mod_rf) %>%
plot(top = 10)
And here is how you can make a custom plot with `ggplot2`:
var_importance <- mod_rf$finalModel$variable.importance %>%
sort(decreasing = TRUE) %>% head(10)
data.frame(variable = names(var_importance),
importance = var_importance) %>%
mutate(word = gsub("w_", "", variable)) %>%
ggplot(aes(x = reorder(word, -importance), y = importance)) +
geom_col() + xlab("word") + ylab("importance") +
theme(axis.text.x = element_text(angle = 45))
Random forest has a number of hyperparameters that can be tuned with OOB error (faster) or with cross-validation (slower).
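For reference, switching from the OOB error to 5-fold cross-validation is just a change of `trainControl`; a sketch (not run here, to save time):
# Not run: cross-validated tuning (about 5 times slower than OOB)
# ctrl_cv <- trainControl(method = "cv", number = 5)
# mod_rf_cv <- train(Y ~ . , data = train_data, method = "ranger",
#                    num.trees = 50, trControl = ctrl_cv)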
Random forest models may take a lot of time to train. For the sake of saving time, we will create a mini-version of our dataset to demonstrate the tuning process.
set.seed(199)
ind <- sample(1:nrow(train_data), size = 500, replace = FALSE)
mini_data <- train_data %>% slice(ind)
rfGrid <- expand.grid(mtry = c(10, 20, 30, 40, 50, 60),
min.node.size = c(5, 10, 20, 40),
splitrule = "variance")
mod_rf_tune <- train(Y ~ . , data = mini_data, method = "ranger",
num.trees = 100,
importance = 'impurity',
tuneGrid = rfGrid,
trControl = trainControl("oob"))
mod_rf_tune
## Random Forest
##
## 500 samples
## 136 predictors
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 10 5 1.840810 0.6011599 1.404396
## 10 10 1.813542 0.6150020 1.372422
## 10 20 1.825748 0.6096740 1.388036
## 10 40 1.825687 0.6110682 1.396257
## 20 5 1.869219 0.5908714 1.362570
## 20 10 1.871719 0.5893268 1.371413
## 20 20 1.860045 0.5936597 1.374263
## 20 40 1.850608 0.5968650 1.374273
## 30 5 1.898025 0.5834530 1.366885
## 30 10 1.899723 0.5812882 1.374557
## 30 20 1.887607 0.5851192 1.367166
## 30 40 1.891024 0.5813137 1.381814
## 40 5 1.954260 0.5647347 1.402775
## 40 10 1.940716 0.5668066 1.397490
## 40 20 1.924161 0.5720631 1.384602
## 40 40 1.889100 0.5833508 1.375735
## 50 5 1.938187 0.5720726 1.380867
## 50 10 1.935023 0.5733453 1.379490
## 50 20 1.953506 0.5631649 1.398478
## 50 40 1.917230 0.5738187 1.378502
## 60 5 1.975515 0.5558973 1.411896
## 60 10 1.966518 0.5603053 1.407614
## 60 20 1.956011 0.5630785 1.397491
## 60 40 1.939511 0.5654197 1.400145
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 10, splitrule = variance
## and min.node.size = 10.
The tuning process can be plotted to get a better picture of what is going on:
plot(mod_rf_tune)
The optimal values of the hyperparameters are
mod_rf_tune$bestTune
Now we will retrain the model on the whole training set with these values:
mod_rf_tuned <- train(Y ~ . , data = train_data, method = "ranger",
num.trees = 100,
importance = 'impurity',
tuneGrid = expand.grid(mod_rf_tune$bestTune),
trControl = trainControl("oob"))
mod_rf_tuned
## Random Forest
##
## 4056 samples
## 136 predictor
##
## No pre-processing
## Resampling results:
##
## RMSE Rsquared MAE
## 1.703226 0.654538 1.27152
##
## Tuning parameter 'mtry' was held constant at a value of 10
## Tuning
## parameter 'splitrule' was held constant at a value of variance
##
## Tuning parameter 'min.node.size' was held constant at a value of 10
Here is the test error. It should be a bit better than that of the first version of the random forest.
mod_rf_tuned %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.669528
Extract the 70 most important variables and train a new random forest with 50 trees on just those 70 variables. Tune the hyperparameters using the following values:
`mtry` = 3, 5, 8, 15
`min.node.size` = 2, 4, 8, 16, 32
Retrain random forest with the optimal values of hyperparameters but with 500 trees instead of 50 trees.
Report the test error of the final model.
top_variables <- mod_rf_tuned$finalModel$variable.importance %>%
sort(decreasing = TRUE) %>% head(70) %>% names
t_data <- train_data %>%
select(all_of(top_variables), Y)
v_data <- test_data %>%
select(all_of(top_variables), Y)
rfGrid_2 <- expand.grid(mtry = c(3, 5, 8, 15),
min.node.size = c(2, 4, 8, 16, 32),
splitrule = "variance")
mod_rf_topvar <- train(Y ~ . , data = t_data, method = "ranger",
num.trees = 50,
importance = 'impurity',
tuneGrid = expand.grid(rfGrid_2),
trControl = trainControl("oob"))
mod_rf_topvar
## Random Forest
##
## 4056 samples
## 70 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 3 2 1.823820 0.6319876 1.465312
## 3 4 1.787475 0.6393289 1.410124
## 3 8 1.807706 0.6340743 1.436824
## 3 16 1.823363 0.6318878 1.461573
## 3 32 1.859618 0.6174428 1.504937
## 5 2 1.723347 0.6486114 1.305000
## 5 4 1.730880 0.6452282 1.305491
## 5 8 1.718247 0.6508553 1.299993
## 5 16 1.720310 0.6500063 1.300286
## 5 32 1.734724 0.6449590 1.316366
## 8 2 1.716500 0.6480733 1.269446
## 8 4 1.707477 0.6517444 1.257412
## 8 8 1.713821 0.6493004 1.264234
## 8 16 1.708285 0.6515648 1.261307
## 8 32 1.708223 0.6518810 1.270337
## 15 2 1.756308 0.6326508 1.275370
## 15 4 1.744559 0.6372440 1.268672
## 15 8 1.734011 0.6413787 1.257799
## 15 16 1.718971 0.6472359 1.255121
## 15 32 1.719297 0.6468900 1.263364
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 8, splitrule = variance
## and min.node.size = 4.
The optimal values of the hyperparameters are
mod_rf_topvar$bestTune
mod_rf_final <- train(Y ~ . , data = t_data, method = "ranger",
num.trees = 500,
importance = 'impurity',
tuneGrid = expand.grid(mod_rf_topvar$bestTune),
trControl = trainControl("oob"))
And the error of the final model is
mod_rf_final %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 1.669269
Here are the answers: