STA 279 Lab 10

Complete all Questions.

The Goal

Today, we are focused on classification trees. We are going to deepen our understanding of how trees work, and we are going to fit and visualize tree models in R.

Formatting your lab

When you open your Markdown file, your first chunk likely looks like:

knitr::opts_chunk$set(echo = TRUE)

Change it to:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.asp = 0.5)

The Data Set

We are going to stick with our cell phone review data from the last lab. Recall that our client provided us with a data set called train with 1000 phone reviews, each labelled with the brand of the phone. They ask us to use those reviews to build a model to (1) predict which brand a review is about and (2) describe any key traits in reviews that might suggest a phone is from a certain brand.

They then also provide us with a data set called test with 500 reviews to use to assess the predictive abilities of our model.

Let’s start off by loading our libraries. Most of these will look familiar, but a few are new. The new ones are the libraries that we need for classification trees. REMEMBER: If you get an error that says your computer does not have or cannot find a package, go to the top of your screen (the right of your camera for my Apple folks) and choose Tools. From there, choose Install Packages. From there, type in the name of the packages you need and install!

# Load the libraries
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(nnet)
library(tm)

# New - needed for trees
library(rpart)
library(rattle)
library(rpart.plot)

Once you have loaded the libraries, you can use the following code to load the data.

# Load the data
train <- read.csv("https://www.dropbox.com/scl/fi/y9srlcqu36qh4x2gxx1mf/CellPhonetrain.csv?rlkey=qc1bgktk8kdg8xo0uwhx9gbiv&st=bp2sdljz&dl=1")
test  <- read.csv("https://www.dropbox.com/scl/fi/i875r45iwefli11b8bhbq/CellPhonetest.csv?rlkey=ux8hcgfrb7dagm1lirmhszlxz&st=4gb6yu3q&dl=1")
train$brand <-as.factor(train$brand)
test$brand <-as.factor(test$brand)

Just as we did in Lab 9, our goal is to use the 118 LIWC features to (1) predict which phone brand each review is about and (2) identify and discuss key traits in the text that identify reviews from each brand.

Today, we are going to do this using classification trees.

Classification Tree: One Split

Each step we use to build a tree is a question. These questions are things like “If the word count is less than 50, is a review more likely to have been written about Apple, Samsung, or Motorola?” This means that trees make no assumptions about the shape of the relationship between X and Y. Instead, trees assume that we can use the features to group the data into brands using a series of splitting rules.

To see this, let’s build a tree with only one split.

tree_onesplit <- rpart( brand ~ . , data = train, method = "class", maxdepth = 1)

This looks a lot like the code we have been using all semester with:

brand: our \(Y\) variable
.: meaning we use all other variables as features
method= "class": tells R to fit a classification tree.
maxdepth = 1: Tells R right now we only want one splitting rule (so 2 leaves)

To visualize the tree, we can use

rpart.plot(tree_onesplit, sub = "" )

Here, Perception measures the percent of words in a review that are related to perception. These are words about how a phone feels, sounds, looks, etc. In other words, these are words about how a person physically interacts with the device.

Question 1

If a review has 22% of its words being perception words, which brand do we predict the review is about?

Question 2

Why are no reviews predicted to be from Motorola right now?

Question 3

What percent of reviews have less than 11% of their words being perception words?

Question 4

What percent of reviews with at least 11% of their words being perception words were written about (a) Apple, (b) Motorola, and (c) Samsung?

Right now, this tree is very clear. Only one feature is being used, and if we needed to chat with clients, we could easily explain how to read the tree, both for predictions and for explaining the relationships highlighted in the tree.

However…there were 118 different features to choose from. Why did we choose perception words?? And why 11%??

Gini Index

When we grow a tree, our priority is stability in the leaves. A leaf is stable if there is one very dominant level of a categorical variable in the leaf.

Question 5

Our current tree has 2 leaves. Which leaf is more stable in our current tree: Leaf 1 (left) or Leaf 2 (right)?

We humans can assess stability by eye, but that doesn’t help the computer. Instead, we need some numeric way to assess stability or instability. We can then use this to determine which splitting rules to use to make the most stable leaf.

We measure instability using the Gini Index, which is a weighted average of the Gini Impurity Score in each leaf. We define the Gini Impurity Score as

\[G(Leaf~\ell ) = 1 - \left(\hat{p}_{\ell(A)}^2 +\hat{p}_{\ell(M)}^2 + \hat{p}_{\ell(S)}^2 \right)\] where

\(\hat{p}_{\ell(A)}\) = proportion of reviews in leaf \(\ell\) that were written about Apple.
\(\hat{p}_{\ell(M)}\) = proportion of reviews in leaf \(\ell\) that were written about Motorola.
\(\hat{p}_{\ell(S)}\) = proportion of reviews in leaf \(\ell\) that were written about Samsung.

If we had a different number of levels, we would have a \(\hat{p}_{\ell(j)}^2\) term for each level \(j\) of \(Y\).

The Gini Impurity score measures instability, meaning it measures how unstable a leaf is. This means that more stable leaves have smaller values of the Gini Impurity Score.
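To get a feel for the scale of the score, here is a minimal sketch of the formula above in R. The function name gini_impurity and the two leaves are made up for illustration; they are not taken from our tree.

```r
# Gini Impurity Score for one leaf, given the brand proportions in that leaf.
# (gini_impurity is a hypothetical helper, not part of any package.)
gini_impurity <- function(p) {
  stopifnot(abs(sum(p) - 1) < 1e-8)  # the proportions must sum to 1
  1 - sum(p^2)
}

# Two made-up leaves: one with a single brand, one split evenly across brands
pure_leaf  <- c(Apple = 1,   Motorola = 0,   Samsung = 0)
mixed_leaf <- c(Apple = 1/3, Motorola = 1/3, Samsung = 1/3)

gini_impurity(pure_leaf)   # 0: as stable as a leaf can be
gini_impurity(mixed_leaf)  # 2/3: the largest possible value with 3 brands
```

So with three brands, every leaf has a score between 0 and 2/3, and closer to 0 means more stable.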

Question 6

What is the Gini Impurity Score in Leaf 1?

Question 7

Should the Gini Impurity Score in Leaf 2 be higher, lower, or about the same as in Leaf 1? Explain without doing any calculations.

Once we have the Gini Impurity Score for each leaf, we compute the Gini Index as a weighted average of these impurity scores. The weights are the percent of the data in each leaf. For these data, this means

\[Gini = (\text{% Reviews in Leaf 1}) \times G(Leaf~1 ) + (\text{% Reviews in Leaf 2}) \times G(Leaf~2) \]
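Here is a sketch of that weighted average in R, using made-up leaf counts (these are NOT the counts from tree_onesplit, so you still need to do Question 8 yourself). The helper gini_impurity and the object names are assumptions for illustration only.

```r
# Impurity of one leaf, computed from the brand counts in that leaf.
# (gini_impurity is a hypothetical helper, not part of any package.)
gini_impurity <- function(counts) {
  p <- counts / sum(counts)  # turn counts into proportions
  1 - sum(p^2)
}

# Made-up counts: Leaf 1 holds 50 reviews, Leaf 2 holds 150 reviews
leaf1 <- c(Apple = 30, Motorola = 10, Samsung = 10)
leaf2 <- c(Apple = 30, Motorola = 60, Samsung = 60)

# Weights: the share of all reviews that land in each leaf
n_total <- sum(leaf1) + sum(leaf2)
w1 <- sum(leaf1) / n_total  # 0.25
w2 <- sum(leaf2) / n_total  # 0.75

g1 <- gini_impurity(leaf1)  # 1 - (0.6^2 + 0.2^2 + 0.2^2) = 0.56
g2 <- gini_impurity(leaf2)  # 1 - (0.2^2 + 0.4^2 + 0.4^2) = 0.64

gini_index <- w1 * g1 + w2 * g2  # 0.25 * 0.56 + 0.75 * 0.64 = 0.62
gini_index
```

Notice that the bigger leaf pulls the Gini Index toward its own impurity score, which is the whole point of weighting.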

Question 8

What is the Gini Index for tree_onesplit? Show your work (meaning you cannot use a function to do it for you, I need to see the steps!)

If you would like to check your answer to Question 8, I have built you a function that can compute the Gini Index for a tree with one split. It relies on the function below (which I did not write!). Copy and paste it into a chunk and press play. We will do nothing else with this function.

rpart_splits <- function(fit, digits = getOption("digits")) {
  splits <- fit$splits
  if (!is.null(splits)) {
    ff <- fit$frame
    is.leaf <- ff$var == "<leaf>"
    n <- nrow(splits)
    nn <- ff$ncompete + ff$nsurrogate + !is.leaf
    ix <- cumsum(c(1L, nn))
    ix_prim <- unlist(mapply(ix, ix + c(ff$ncompete, 0), FUN = seq, SIMPLIFY = F))
    type <- rep.int("surrogate", n)
    type[ix_prim[ix_prim <= n]] <- "primary"
    type[ix[ix <= n]] <- "main"
    left <- character(nrow(splits))
    side <- splits[, 2L]
    for (i in seq_along(left)) {
      left[i] <- if (side[i] == -1L)
        paste("<", format(signif(splits[i, 4L], digits)))
      else if (side[i] == 1L)
        paste(">=", format(signif(splits[i, 4L], digits)))
      else {
        catside <- fit$csplit[splits[i, 4L], 1:side[i]]
        paste(c("L", "-", "R")[catside], collapse = "", sep = "")
      }
    }
    cbind(data.frame(var = rownames(splits),
                     type = type,
                     node = rep(as.integer(row.names(ff)), times = nn),
                     ix = rep(seq_len(nrow(ff)), nn),
                     left = left),
          as.data.frame(splits, row.names = F))
  }
}

Next, copy and paste the function below into a chunk and press play. This is the function that actually computes the Gini Index.

gini <- function(tree,data,Y){
  
  variable <-rpart_splits(tree)[1,"var"]
  value <-rpart_splits(tree)[1,"index"]
  #print(length(rpart_splits(tree)[1,"left"]))
  if(length(rpart_splits(tree)[1,"left"]) >0 ){
    sign <- unlist(strsplit(rpart_splits(tree)[1,"left"], " "))[1]
    
    if(sign == ">=" ){
      Leaf <- ifelse( data[,variable] >= value, "Leaf 1", "Leaf 2")
    }
    
    if(sign == ">" ){
      Leaf <- ifelse( data[,variable] > value, "Leaf 1", "Leaf 2")
    }
    
    if(sign == "<=" ){
      Leaf <- ifelse( data[,variable] <= value, "Leaf 1", "Leaf 2")
    }
    
    if(sign == "<" ){
      Leaf <- ifelse( data[,variable] < value, "Leaf 1", "Leaf 2")
    }
    
    weights <- data.frame(table(Leaf)/length(Leaf))
    
    Leaf1 <- subset(data, Leaf== "Leaf 1")
    Leaf2 <- subset(data, Leaf== "Leaf 2")
    
    props1 <- data.frame(table(Leaf1[,Y])/nrow(Leaf1))
    props2 <- data.frame(table(Leaf2[,Y])/nrow(Leaf2))
    
    Impurity1 <- 1 -(sum(props1$Freq^2))
    Impurity2 <- 1 -(sum(props2$Freq^2))
    
    out <- weights$Freq[1]*Impurity1 + weights$Freq[2]*Impurity2
  }
  
  if(length(rpart_splits(tree)[1,"left"]) ==0){
    out <- NULL
  }
  
  out
  
  
}

To use the function on our tree_onesplit, we use:

gini(tree_onesplit, train,1)

where

tree_onesplit: is the name of our tree model
train: is the name of our training data set
1: is the number of the column in the data set holding our Y variable.

NOTE: Your answer from the function and your answer to Question 8 might be slightly different due to rounding that takes place in the visualization. Don’t worry about that!!

Question 9

When we grow a classification tree, we choose each splitting rule to minimize the Gini Index. Our tree uses Perception, the percent of perception words in the review, as the first splitting rule. This means that there should be no other feature that we could split on once and get a lower Gini Index. Let’s check.

Choose any other feature in the data set (just pick one). State the feature you chose.
Grow a tree with one split using your chosen feature. Use the gini function to compute the Gini index, and state the Gini Index for that tree.
Is this Gini Index larger, smaller, or the same as the one we got from tree_onesplit with Perception?
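If you are not sure how to restrict a tree to a single feature, the trick is to replace the . in the formula with the name of that feature. Here is a sketch using the iris data that ships with R, NOT our phone data: Species plays the role of brand, and Petal.Length stands in for whichever feature you choose.

```r
library(rpart)

# One feature on the right-hand side of ~ means only that feature can be used.
# iris, Species, and Petal.Length are stand-ins for train, brand, and your feature.
tree_one_feature <- rpart(Species ~ Petal.Length, data = iris,
                          method = "class", maxdepth = 1)

# One split gives three rows in the frame: the root plus two leaves
tree_one_feature$frame[, c("var", "n")]
```

For the lab, you would swap in brand, your chosen column name, and train.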

So, this is our tree with only one split. However, we were only restricted to one split for now so we could dig into trees a little more deeply. In reality, we rarely stop after just one split!

Classification Tree 2: More Splits

When we grew our tree with one split, we used:

tree_onesplit <- rpart( brand ~ . , data = train, method = "class", maxdepth = 1)

The maxdepth = 1 argument is the part that tells R to stop after one split. If we remove it, the tree grows more!

tree_full <- rpart( brand ~ . , data = train, method = "class")

Question 10

Make a plot to show tree_full.

Hint: If you don’t like the results from rpart.plot, try prp(tree_full). This is a different way to visualize that sometimes works better with larger trees!

Question 11

Which splitting rule gives us the largest reduction in the Gini Index in one split? Hint: Answering this should require no calculation!!

You will notice that this tree did not use all the features. Why did it stop growing???

Trees stop growing when they hit what is called a stopping rule. Stopping rules are rules that we choose to keep trees from getting too large. Very large trees (meaning trees with many leaves) are hard to interpret and generally do not predict well on test data…which is usually the whole point of prediction.

I don’t see any stopping rules in our code, though. Where did we specify these rules?

It turns out that R has done this for us.

tree_full$control

## $minsplit
## [1] 20
## 
## $minbucket
## [1] 7
## 
## $cp
## [1] 0.01
## 
## $maxcompete
## [1] 4
## 
## $maxsurrogate
## [1] 5
## 
## $usesurrogate
## [1] 2
## 
## $surrogatestyle
## [1] 0
## 
## $maxdepth
## [1] 30
## 
## $xval
## [1] 10

See all of this output?? These are the default control settings that R uses when it grows a tree, and several of them are stopping rules. What do they all mean?

Here are the important stopping rules:

minsplit = the minimum number of rows that must exist in a leaf in order for a split to be attempted.
minbucket = the minimum number of rows allowed in any leaf.
cp = the percent improvement in the Gini Index we require in order to accept a split. Right now, this is .01 = 1%. This means that if we cannot improve the Gini Index by at least 1%, we do not split a leaf (meaning we stop growing).
maxdepth = the maximum depth of any leaf, with the root counted as depth 0. Basically, this controls how many splitting rules we can go through to get from the root to a leaf.

Having default stopping rules is great, but we do need to think about whether or not they make sense for our data. Suppose I had 3 million rows of data. Do we really want to allow leaves as small as 7 rows??? The tree would be huge!!

If we want to change one of these stopping rules when growing a tree, we can adapt our tree growing code. For example, if we want to change the maxdepth stopping rule to 5, we use:

tree_new <- rpart( brand ~ . , data = train, method = "class", maxdepth = 5)
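If you want to change several stopping rules at once, rpart also accepts them bundled together through its control argument and the rpart.control() function. Here is a sketch on the iris data that ships with R (not our phone data); the particular values are arbitrary choices for illustration, not recommendations.

```r
library(rpart)

# Bundle several stopping rules together with rpart.control().
# The values 2, 10, and 0.02 are arbitrary, chosen only to show the syntax.
tree_ctrl <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(maxdepth = 2,
                                           minbucket = 10,
                                           cp = 0.02))

# Check that the tree stored the rules we asked for
tree_ctrl$control$maxdepth
tree_ctrl$control$minbucket
tree_ctrl$control$cp
```

Looking at the control element of the fitted tree is a quick way to confirm that a rule was actually changed.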

Question 12

We only have 1000 rows in our data set, and right now R will attempt to split a leaf as long as it holds at least 20 rows (2% of the data). This seems like too few.

Suppose we instead want to require at least 50 rows in order to split. Adapt tree_full to create a tree called tree_50 with this change. Show a plot of this tree as the answer to this question.

For the rest of the lab, use tree_50 for all questions.

Note: If you are finding your tree hard to read, consider:

prp(tree_50)

Question 14

If a review has 10% perception words, 24 words per sentence, 4% of the characters being comma, 4.5% of the words being adjectives, and 12% of the words being social, which brand will the tree predict the review is about?

Question 15

How stable is the prediction from Question 14? Explain.

Question 16

Your client asks what separates reviews for Apple phones from other reviews. Answer them!

As we can see, classification trees have a few nice properties! Even though we started with 118 features, only a few of them are actually used in our model. This makes it fairly easy to visualize and interpret the relationships in the data.

Predictions

Now that we have built and interpreted the tree, let’s use it for making predictions. We can (1) make predictions on the training data to see how well our model fits the data, and (2) make predictions on the test data to see how well our model predicts for new reviews. Here, we will focus on the test data.

To make predictions using a tree, we use the code:

yhat_test <- predict(tree_50, type = "class", newdata = test)

To make our confusion matrix, we use the same code as we did for multinomial regression:

holder <- table("Prediction"=yhat_test,"Actual" = test$brand)

# Label the rows 
row.names(holder) <- c("Predicted: Apple", "Predicted: Motorola", "Predicted: Samsung")

# Label the columns
colnames(holder) <- c("True: Apple", "True: Motorola", "True: Samsung")

# Format the confusion matrix
knitr::kable(holder)
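If you need a reminder of how to read a table like this, here is a tiny sketch with six made-up reviews (the vectors predicted and actual are invented for illustration and have nothing to do with our test data). Rows are what the model predicted, and columns are the true brand.

```r
brands <- c("Apple", "Motorola", "Samsung")

# Six made-up reviews: what a model predicted vs. the true brand
predicted <- factor(c("Apple", "Apple", "Samsung", "Motorola", "Apple", "Samsung"),
                    levels = brands)
actual    <- factor(c("Apple", "Samsung", "Samsung", "Motorola", "Apple", "Motorola"),
                    levels = brands)

toy <- table("Prediction" = predicted, "Actual" = actual)
toy

# Diagonal entries count reviews where the prediction matched the truth
diag(toy)

# Off-diagonal entries are mistakes, e.g. Samsung reviews predicted as Apple
toy["Apple", "Samsung"]
```

In this toy table, 4 of the 6 reviews sit on the diagonal, and each off-diagonal cell tells us which brand got mistaken for which.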

Question 17

Using the confusion matrix, explain to our client how well the model is doing at prediction. You do NOT need to compute the accuracy or F1-score or anything like that - just discuss the confusion matrix.

References

This activity was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2026 April 12.