STA521 Lab 05

Justin Kao

Course Tools

RStudio Online: Your Data Science Playground


  • Browser-based: Access RStudio right from your browser—no setup headaches.
  • Duke VPN or Campus Connection: Seamlessly connect from anywhere using Duke VPN (Cisco) or while on campus.
  • Cloud-Powered: Harness the power of online resources—save your computer’s CPU and GPU for other tasks.
  • Hassle-Free Setup: No need to run install.packages—everything is pre-installed and ready to go!

Source: Dr. Colin Rundel STA 523

Tree-Based Methods

Classification and Regression Trees (CART): Supervised

  • These methods stratify or segment the predictor space into a number of simple regions.
  • To make a prediction for a given observation, we typically use the mean or the mode of the response values for the training observations in the region to which it belongs.
  • Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.
  • Tree-based methods are simple and useful for interpretation. However, single decision trees are typically not competitive with the best supervised learning approaches. This is where bagging, random forests, and Bayesian additive regression trees come into play.
  • We will see that combining a large number of trees (an ensemble) can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretability.

The architecture of a decision tree

Source: Applied and Computational Engineering

Methodology - Regression Tree 1.0

Generally speaking, there are two steps.

  1. We divide the predictor space - that is, the set of possible values for \(X_1, X_2, \ldots, X_p\) - into \(J\) distinct and non-overlapping regions, \(R_1, R_2, \ldots, R_J\).
  2. For every observation that falls into the region \(R_j\), we make the same prediction, which is simply the mean of the response values for the training observations in \(R_j\).

A tree corresponding to the partition

The partition

Methodology - Regression Tree 1.1

How do we construct the regions \(R_1, \ldots, R_J\)? In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes \(R_1, \ldots, R_J\) that minimize the RSS, given by

\[ \sum_{j=1}^J \sum_{i \in R_j}\left(y_i-\hat{y}_{R_j}\right)^2 \]

where \(\hat{y}_{R_j}\) is the mean response for the training observations within the \(j\) th box.
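As a minimal sketch (with hypothetical toy values and a hand-chosen region assignment, not rpart output), the region means \(\hat{y}_{R_j}\) and the RSS can be computed directly:

# Toy responses and a hand-assigned split into J = 2 regions (hypothetical)
y      <- c(1.2, 2.1, 1.5, 3.0)
region <- c(1, 1, 2, 2)

y_hat <- tapply(y, region, mean)    # region means, i.e., the predictions
rss   <- sum((y - y_hat[region])^2) # sum over j, then over i in R_j
rss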

Methodology - Regression Tree 1.2

In order to perform recursive binary splitting, we first select the predictor \(X_j\) and the cutpoint \(s\) such that splitting the predictor space into the regions \(\left\{X \mid X_j<s\right\}\) and \(\left\{X \mid X_j \geq s\right\}\) leads to the greatest possible reduction in RSS. That is, we consider all predictors \(X_1, \ldots, X_p\), and all possible values of the cutpoint \(s\) for each of the predictors, and then choose the predictor and cutpoint such that the resulting tree has the lowest RSS.

Methodology - Regression Tree 1.3

For each \(j\) and \(s\), we define the pair of half-planes

\[ R_1(j, s)=\left\{X \mid X_j<s\right\} \text { and } R_2(j, s)=\left\{X \mid X_j \geq s\right\} \]

and we seek the values \(j^*\) and \(s^*\) that minimize the combined RSS:

\[ (j^*, s^*) = \arg\min_{j, s} \left( \sum_{i: x_i \in R_1(j, s)} \left( y_i - \hat{y}_{R_1} \right)^2 + \sum_{i: x_i \in R_2(j, s)} \left( y_i - \hat{y}_{R_2} \right)^2 \right) \]

where \(\hat{y}_{R_1}\) is the mean response for the training observations in \(R_1(j, s)\), and \(\hat{y}_{R_2}\) is the mean response for the training observations in \(R_2(j, s)\).
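A minimal sketch of one step of this search, written as a hypothetical helper (best_split is not part of rpart): it exhaustively scans every predictor \(j\) and every observed cutpoint \(s\) and returns the pair minimizing the summed RSS.

# Exhaustive search over predictors j and cutpoints s (one step of
# recursive binary splitting). X: numeric matrix; y: numeric vector.
best_split <- function(X, y) {
  best <- list(rss = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    for (s in unique(X[, j])) {
      left  <- y[X[, j] <  s]
      right <- y[X[, j] >= s]
      if (length(left) == 0 || length(right) == 0) next  # skip degenerate splits
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, j = j, s = s)
    }
  }
  best  # the minimizing (j*, s*) and its RSS
}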

Methodology - Classification Tree 2.1

A natural alternative to RSS is the classification error rate. Since we plan to assign an observation in a given region to the most commonly occurring class of training observations in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class:

\[ E=1-\max _k\left(\hat{p}_{m k}\right) \]

Methodology - Classification Tree 2.2

The Gini index is defined by

\[ G=\sum_{k=1}^K \hat{p}_{m k}\left(1-\hat{p}_{m k}\right), \]

The entropy is defined by

\[ D=-\sum_{k=1}^K \hat{p}_{m k} \log \hat{p}_{m k} \]

Since \(0 \leq \hat{p}_{m k} \leq 1\), it follows that \(0 \leq-\hat{p}_{m k} \log \hat{p}_{m k}\). One can show that the entropy will take on a value near zero if the \(\hat{p}_{m k}\)’s are all near zero or near one.
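A quick sketch computing all three impurity measures from a vector of class labels (node_impurity is a hypothetical helper, not an rpart function):

node_impurity <- function(labels) {
  p <- prop.table(table(labels))     # the proportions p_hat_mk
  c(error   = 1 - max(p),            # classification error rate E
    gini    = sum(p * (1 - p)),      # Gini index G
    entropy = -sum(p * log(p)))      # entropy D
}
node_impurity(c("A", "A", "A", "B")) # a node dominated by class A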

Tree Pruning - Cost complexity pruning

For each value of \(\alpha\) there corresponds a subtree \(T \subset T_0\) that minimizes

\[ \sum_{m=1}^{|T|} \sum_{i: x_i \in R_m}\left(y_i-\hat{y}_{R_m}\right)^2+\alpha|T| \]

For classification trees, the analogous criterion replaces the RSS with the misclassification counts in the terminal nodes:

\[ \sum_{\text {Terminal Nodes }} \text{Misclass}_m+\alpha|T| \]

Here \(|T|\) indicates the number of terminal nodes of the tree \(T\), \(R_m\) is the rectangle (i.e., the subset of predictor space) corresponding to the \(m\)th terminal node, and \(\hat{y}_{R_m}\) is the predicted response associated with \(R_m\), that is, the mean of the training observations in \(R_m\).
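As a sketch, the penalized criterion can be evaluated directly from a vector of terminal-node assignments (cost_complexity is a hypothetical helper, not how rpart prunes internally):

cost_complexity <- function(y, leaf, alpha) {
  y_hat <- tapply(y, leaf, mean)                  # leaf means y_hat_Rm
  rss   <- sum((y - y_hat[as.character(leaf)])^2) # within-leaf RSS
  rss + alpha * length(unique(leaf))              # add alpha * |T|
}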

Mini-Case Study: A Toy Decision Tree

Establish the model framework

# Clean the working memory (just my preference; you don't have to follow it)

rm(list = ls())

# Set up our X1, X2, and Y

X1 <- c(1, 2, 3, 4, 2, 1)
X2 <- c(0, 1, 2, 1, 2, 1)
Y <- c(1.2, 2.1, 1.5, 3.0, 2.0, 1.6)
dat <- data.frame(X1, X2, Y)

Use rpart and rpart.plot packages to construct the basic tree

# Install once if the packages are not already on your machine
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

## dt stands for decision tree
dt <- rpart(Y ~ .,
            data = dat,
            control = rpart.control(minsplit = 1,
                                    minbucket = 1,
                                    cp = 0,
                                    xval = 6))

We also fit an ordinary least squares (OLS) model, with an interaction term, to compare model performance later.

ols <- lm(Y ~ X1 * X2, data = dat)
summary(ols)

rpart.control I

  • minsplit: The minimum number of observations that must exist in a node in order for a split to be attempted.

  • minbucket: The minimum number of observations in any terminal (leaf) node. If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket * 3 or minbucket to minsplit / 3, as appropriate.

  • cp: The complexity parameter. Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with ANOVA splitting, this means that the overall R-squared must increase by cp at each step. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially, the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and hence the program need not pursue it. A quick sketch of this effect follows below.
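As a quick sketch (reusing the toy dat from above), a larger cp prunes away marginal splits up front:

# With cp = 0.2, splits improving the fit by less than 0.2 are never attempted,
# so the tree stops growing much earlier than with cp = 0.
dt.coarse <- rpart(Y ~ ., data = dat,
                   control = rpart.control(minsplit = 1, minbucket = 1,
                                           cp = 0.2))
nrow(dt.coarse$cptable)  # fewer rows (fewer splits) than dt$cptable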

rpart.control II (Rarely Used)

  • maxcompete: The number of competitor splits retained in the output. It is useful to know not just which split was chosen, but which variable came in second, third, etc.

  • maxsurrogate: The number of surrogate splits retained in the output. If this is set to zero, the compute time will be reduced, since approximately half of the computational time (other than setup) is used in the search for surrogate splits.

  • usesurrogate: Specifies how to use surrogates in the splitting process:

    • 0: Display only; an observation with a missing value for the primary split rule is not sent further down the tree.
    • 1: Use surrogates, in order, to split subjects missing the primary variable; if all surrogates are missing, the observation is not split.
    • 2: If all surrogates are missing, then send the observation in the majority direction. A value of 0 corresponds to the behavior of the earlier tree software, and 2 corresponds to the recommendations of Breiman et al. (1984).
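A small sketch of surrogates at prediction time (reusing the toy dt from earlier): blank out the primary split variable for one row and predict; with the default usesurrogate = 2 the observation is still routed down the tree.

dat.na <- dat
dat.na$X1[1] <- NA             # drop the primary split variable for one row
predict(dt, newdata = dat.na)  # surrogate splits (or the majority direction) route it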

rpart.plot(dt)

# dt is the CART model we just constructed on the toy data above.
# Then, we plot the decision tree
rpart.plot(dt)

How does the tree split?
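Printing the fitted rpart object shows the full split structure:

print(dt)  # node), split, n, deviance, yval; * marks a terminal node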

n= 6 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 6 2.000 1.90  
   2) X1< 3.5 5 0.548 1.68  
     4) X2< 0.5 1 0.000 1.20 *
     5) X2>=0.5 4 0.260 1.80  
      10) X1>=2.5 1 0.000 1.50 *
      11) X1< 2.5 3 0.140 1.90  
        22) X1< 1.5 1 0.000 1.60 *
        23) X1>=1.5 2 0.005 2.05  
          46) X2>=1.5 1 0.000 2.00 *
          47) X2< 1.5 1 0.000 2.10 *
   3) X1>=3.5 1 0.000 3.00 *

Set maxdepth

dt.new <- rpart(Y ~ .,
                data = dat,
                control = rpart.control(minsplit = 1,
                                        minbucket = 1,
                                        maxdepth = 3,
                                        cp = 0,
                                        xval = 6))
rpart.plot(dt.new)

cptable

dt$cptable
       CP nsplit rel error   xerror     xstd
1 0.72600      0    1.0000 1.440000 0.745794
2 0.14400      1    0.2740 2.399012 1.059821
3 0.06375      2    0.1300 2.220556 1.004330
4 0.00250      4    0.0025 2.420000 1.252211
5 0.00000      5    0.0000 2.420000 1.252211
dt.new$cptable
     CP nsplit rel error   xerror     xstd
1 0.726      0     1.000 1.440000 0.745794
2 0.144      1     0.274 2.399012 1.059821
3 0.060      2     0.130 2.220556 1.004330
4 0.000      3     0.070 2.466250 1.237930

Prune based on xerror

cp.min <- dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"]
cp.min
[1] 0.726
dt.pruned <- prune(dt, cp = cp.min)
rpart.plot(dt.pruned)

HW6 Boston Housing Data

Loading Dataset Assigned by Dr. Banks

library(MASS)
library(rpart)
library(rpart.plot)
data(Boston)
View(Boston)
dt.Boston <- rpart(medv ~ .,
                data = Boston,
                control = rpart.control(minsplit = 1,
                                        minbucket = 1,
                                        cp = 0))
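One caveat worth flagging (my note, not part of the assigned code): rpart's built-in cross-validation shuffles the data, so the xerror column, and hence the cp chosen later, changes from run to run. Fixing the seed before fitting makes the cptable reproducible.

set.seed(521)  # any fixed value works; 521 is just an example
# ...then rerun the rpart() call above to get a stable cptable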

\(T_0\): Constructing the Unpruned Decision Tree

Checking dt.Boston$cptable

dt.Boston$cptable
              CP nsplit    rel error    xerror       xstd
1   4.527442e-01      0 1.000000e+00 1.0047628 0.08311193
2   1.711724e-01      1 5.472558e-01 0.6255648 0.05777035
3   7.165784e-02      2 3.760834e-01 0.4197694 0.04583758
4   5.900152e-02      3 3.044255e-01 0.4076077 0.04900619
5   3.375589e-02      4 2.454240e-01 0.3151550 0.04007391
6   2.661300e-02      5 2.116681e-01 0.2816160 0.03818040
7   2.357238e-02      6 1.850551e-01 0.2682153 0.03808902
8   1.303109e-02      7 1.614827e-01 0.2529092 0.03761612
9   9.147048e-03      8 1.484516e-01 0.2452456 0.03774482
10  7.430420e-03      9 1.393046e-01 0.2255687 0.03123433
11  7.265385e-03     10 1.318742e-01 0.2311023 0.03264274
12  7.071417e-03     11 1.246088e-01 0.2311023 0.03264274
13  6.126335e-03     12 1.175374e-01 0.2274893 0.03264578
14  4.805320e-03     13 1.114110e-01 0.2272790 0.03178548
15  4.560925e-03     14 1.066057e-01 0.2228210 0.03155596
16  4.259314e-03     15 1.020448e-01 0.2293847 0.03251198
17  3.941023e-03     16 9.778548e-02 0.2231474 0.03147975
18  3.663324e-03     17 9.384445e-02 0.2242917 0.03151500
19  3.660250e-03     18 9.018113e-02 0.2165161 0.03113851
20  3.193311e-03     19 8.652088e-02 0.2205585 0.03139184
21  3.015697e-03     20 8.332757e-02 0.2198595 0.03146316
22  2.809232e-03     21 8.031187e-02 0.2198330 0.03146324
23  2.361294e-03     22 7.750264e-02 0.2208772 0.03145014
24  2.245942e-03     23 7.514134e-02 0.2283068 0.03218943
25  2.235404e-03     25 7.064946e-02 0.2245907 0.03191181
26  2.159020e-03     26 6.841406e-02 0.2268077 0.03191436
27  2.152471e-03     27 6.625504e-02 0.2269014 0.03191368
28  2.092321e-03     28 6.410257e-02 0.2274998 0.03191185
29  1.716908e-03     29 6.201024e-02 0.2249387 0.03183153
30  1.675508e-03     30 6.029334e-02 0.2217785 0.03116124
31  1.620985e-03     31 5.861783e-02 0.2215931 0.03116291
32  1.382927e-03     32 5.699684e-02 0.2113680 0.02607803
33  1.369015e-03     33 5.561392e-02 0.2119546 0.02612078
34  1.368269e-03     35 5.287589e-02 0.2116555 0.02612203
35  1.361573e-03     37 5.013935e-02 0.2116555 0.02612203
36  1.314040e-03     38 4.877777e-02 0.2117230 0.02617437
37  1.226582e-03     39 4.746373e-02 0.2133950 0.02724218
38  1.192170e-03     40 4.623715e-02 0.2108053 0.02724866
39  1.139508e-03     41 4.504498e-02 0.2107720 0.02725254
40  1.137254e-03     42 4.390548e-02 0.2102784 0.02728532
41  8.859617e-04     43 4.276822e-02 0.2075984 0.02729335
42  8.778851e-04     44 4.188226e-02 0.2076084 0.02727834
43  8.758795e-04     45 4.100437e-02 0.2074852 0.02727913
44  8.685405e-04     46 4.012849e-02 0.2074579 0.02727951
45  8.395913e-04     47 3.925995e-02 0.2081016 0.02728100
46  8.324919e-04     48 3.842036e-02 0.2115141 0.02748328
47  7.826737e-04     49 3.758787e-02 0.2111169 0.02746421
48  7.641216e-04     50 3.680520e-02 0.2103132 0.02746715
49  7.387612e-04     51 3.604108e-02 0.2090312 0.02744662
50  7.293603e-04     52 3.530231e-02 0.2079757 0.02745025
51  7.045085e-04     53 3.457295e-02 0.2086287 0.02746798
52  6.846748e-04     54 3.386845e-02 0.2089363 0.02746542
53  6.429334e-04     55 3.318377e-02 0.2081058 0.02812364
54  6.423778e-04     56 3.254084e-02 0.2079723 0.02811854
55  6.299617e-04     57 3.189846e-02 0.2079723 0.02811854
56  6.270240e-04     58 3.126850e-02 0.2075843 0.02811951
57  6.045313e-04     59 3.064147e-02 0.2066059 0.02808030
58  5.970048e-04     60 3.003694e-02 0.2072717 0.02807488
59  5.618465e-04     61 2.943994e-02 0.2095727 0.02819980
60  5.592278e-04     62 2.887809e-02 0.2103155 0.02818243
61  5.124541e-04     64 2.775964e-02 0.2109692 0.02820288
62  5.098757e-04     65 2.724718e-02 0.2108952 0.02816671
63  5.052292e-04     66 2.673731e-02 0.2108884 0.02819217
64  5.004614e-04     67 2.623208e-02 0.2103325 0.02819566
65  4.870116e-04     68 2.573162e-02 0.2104626 0.02819818
66  4.551531e-04     69 2.524460e-02 0.2094181 0.02821251
67  4.447989e-04     70 2.478945e-02 0.2081616 0.02818422
68  4.383963e-04     72 2.389985e-02 0.2082399 0.02818402
69  4.326223e-04     73 2.346146e-02 0.2094637 0.02818543
70  4.301637e-04     74 2.302883e-02 0.2094637 0.02818543
71  4.226647e-04     75 2.259867e-02 0.2094798 0.02818521
72  4.210740e-04     76 2.217601e-02 0.2112314 0.02917944
73  4.017436e-04     77 2.175493e-02 0.2117176 0.02917775
74  3.877437e-04     78 2.135319e-02 0.2101435 0.02911430
75  3.870010e-04     79 2.096544e-02 0.2095798 0.02911792
76  3.851032e-04     80 2.057844e-02 0.2095798 0.02911792
77  3.556226e-04     81 2.019334e-02 0.2114300 0.02915503
78  3.550169e-04     83 1.948209e-02 0.2123955 0.02915182
79  3.534990e-04     84 1.912708e-02 0.2123955 0.02915182
80  3.432142e-04     85 1.877358e-02 0.2123000 0.02915234
81  3.382784e-04     87 1.808715e-02 0.2124429 0.02914039
82  3.336110e-04     88 1.774887e-02 0.2124010 0.02914090
83  3.090234e-04     89 1.741526e-02 0.2135314 0.02916227
84  3.011680e-04     90 1.710624e-02 0.2146825 0.02919257
85  2.987376e-04     93 1.620273e-02 0.2154501 0.02920061
86  2.886642e-04     94 1.590400e-02 0.2140930 0.02905024
87  2.848562e-04     95 1.561533e-02 0.2123978 0.02890373
88  2.835417e-04     96 1.533048e-02 0.2124388 0.02890316
89  2.758042e-04     97 1.504693e-02 0.2124109 0.02890294
90  2.725420e-04     98 1.477113e-02 0.2125969 0.02890131
91  2.654725e-04     99 1.449859e-02 0.2127858 0.02890501
92  2.585664e-04    100 1.423312e-02 0.2124448 0.02890334
93  2.485620e-04    101 1.397455e-02 0.2131571 0.02890739
94  2.433961e-04    102 1.372599e-02 0.2143729 0.02890693
95  2.345107e-04    103 1.348259e-02 0.2125591 0.02889473
96  2.263011e-04    104 1.324808e-02 0.2115563 0.02885650
97  2.106929e-04    105 1.302178e-02 0.2126524 0.02889759
98  2.026849e-04    106 1.281109e-02 0.2127742 0.02890111
99  2.022725e-04    108 1.240572e-02 0.2124675 0.02888628
100 1.969995e-04    109 1.220344e-02 0.2125694 0.02888674
101 1.928337e-04    110 1.200644e-02 0.2125673 0.02888729
102 1.927771e-04    111 1.181361e-02 0.2123900 0.02888787
103 1.819271e-04    112 1.162083e-02 0.2127075 0.02888690
104 1.815034e-04    114 1.125698e-02 0.2125672 0.02888095
105 1.800299e-04    116 1.089397e-02 0.2125551 0.02888093
106 1.751478e-04    117 1.071394e-02 0.2125815 0.02887998
107 1.751107e-04    118 1.053880e-02 0.2127563 0.02888279
108 1.731424e-04    119 1.036368e-02 0.2127563 0.02888279
109 1.729133e-04    120 1.019054e-02 0.2127563 0.02888279
110 1.709262e-04    122 9.844716e-03 0.2127503 0.02888278
111 1.693064e-04    123 9.673789e-03 0.2130442 0.02887988
112 1.687004e-04    124 9.504483e-03 0.2129946 0.02888053
113 1.651204e-04    125 9.335783e-03 0.2132505 0.02887799
114 1.618619e-04    126 9.170662e-03 0.2131195 0.02887925
115 1.595722e-04    127 9.008800e-03 0.2133795 0.02887576
116 1.589868e-04    128 8.849228e-03 0.2140921 0.02891557
117 1.580193e-04    129 8.690241e-03 0.2145119 0.02891456
118 1.574382e-04    130 8.532222e-03 0.2145754 0.02891372
119 1.518566e-04    132 8.217346e-03 0.2142427 0.02891414
120 1.499818e-04    133 8.065489e-03 0.2143700 0.02891333
121 1.464625e-04    134 7.915507e-03 0.2147630 0.02891325
122 1.462191e-04    135 7.769045e-03 0.2147630 0.02891325
123 1.451827e-04    136 7.622826e-03 0.2149276 0.02891150
124 1.358186e-04    137 7.477643e-03 0.2149784 0.02890255
125 1.357807e-04    138 7.341824e-03 0.2149976 0.02890229
126 1.279957e-04    139 7.206044e-03 0.2144958 0.02890404
127 1.266027e-04    140 7.078048e-03 0.2145681 0.02890996
128 1.220043e-04    141 6.951445e-03 0.2135451 0.02842631
129 1.180268e-04    142 6.829441e-03 0.2130822 0.02842405
130 1.180268e-04    143 6.711414e-03 0.2132041 0.02842790
131 1.146422e-04    144 6.593387e-03 0.2131352 0.02842782
132 1.126814e-04    145 6.478745e-03 0.2134060 0.02843631
133 1.039611e-04    146 6.366064e-03 0.2137010 0.02843044
134 9.838166e-05    147 6.262103e-03 0.2139648 0.02842933
135 9.559193e-05    148 6.163721e-03 0.2150689 0.02845158
136 9.559193e-05    149 6.068129e-03 0.2150629 0.02845167
137 9.411059e-05    150 5.972537e-03 0.2150629 0.02845167
138 9.176826e-05    151 5.878427e-03 0.2150761 0.02845152
139 9.166424e-05    152 5.786658e-03 0.2151547 0.02845056
140 8.989544e-05    153 5.694994e-03 0.2150893 0.02845091
141 8.901755e-05    154 5.605099e-03 0.2150036 0.02845130
142 8.849082e-05    155 5.516081e-03 0.2151113 0.02844998
143 8.731195e-05    156 5.427590e-03 0.2155870 0.02849184
144 8.542449e-05    157 5.340278e-03 0.2148724 0.02846399
145 8.521533e-05    159 5.169429e-03 0.2149045 0.02846345
146 8.497928e-05    160 5.084214e-03 0.2149045 0.02846345
147 8.242366e-05    161 4.999235e-03 0.2146917 0.02846476
148 7.912671e-05    162 4.916811e-03 0.2138527 0.02782933
149 7.830317e-05    163 4.837684e-03 0.2135738 0.02782350
150 7.826973e-05    165 4.681078e-03 0.2137284 0.02782275
151 7.783155e-05    166 4.602808e-03 0.2136513 0.02782367
152 7.761062e-05    167 4.524977e-03 0.2136376 0.02782387
153 7.584927e-05    168 4.447366e-03 0.2136405 0.02782386
154 7.398377e-05    169 4.371517e-03 0.2136529 0.02782381
155 7.315709e-05    171 4.223549e-03 0.2137421 0.02782357
156 7.191635e-05    172 4.150392e-03 0.2136823 0.02782684
157 7.173994e-05    173 4.078476e-03 0.2136823 0.02782684
158 7.121404e-05    174 4.006736e-03 0.2137072 0.02782650
159 7.042589e-05    175 3.935522e-03 0.2141363 0.02782282
160 6.882619e-05    176 3.865096e-03 0.2141379 0.02782120
161 6.093498e-05    177 3.796270e-03 0.2148974 0.02783234
162 6.067942e-05    180 3.610665e-03 0.2154700 0.02783390
163 6.005069e-05    181 3.549986e-03 0.2154700 0.02783390
164 5.940635e-05    182 3.489935e-03 0.2154453 0.02783426
165 5.934503e-05    183 3.430529e-03 0.2153728 0.02783514
166 5.771412e-05    184 3.371184e-03 0.2150702 0.02783341
167 5.634072e-05    185 3.313470e-03 0.2152203 0.02783170
168 5.624317e-05    186 3.257129e-03 0.2151919 0.02783209
169 5.479954e-05    187 3.200886e-03 0.2151805 0.02783223
170 5.456154e-05    188 3.146086e-03 0.2152478 0.02783056
171 5.247022e-05    190 3.036963e-03 0.2150744 0.02783165
172 5.206444e-05    192 2.932023e-03 0.2150876 0.02783191
173 5.074176e-05    194 2.827894e-03 0.2153564 0.02783137
174 4.645768e-05    195 2.777152e-03 0.2154738 0.02783172
175 4.494772e-05    196 2.730694e-03 0.2165527 0.02784513
176 4.494772e-05    197 2.685747e-03 0.2165708 0.02784486
177 4.448146e-05    198 2.640799e-03 0.2165867 0.02784505
178 4.389426e-05    199 2.596318e-03 0.2165746 0.02784448
179 4.248964e-05    200 2.552423e-03 0.2164736 0.02784301
180 4.225554e-05    201 2.509934e-03 0.2163965 0.02784207
181 4.195380e-05    202 2.467678e-03 0.2166921 0.02784308
182 3.995353e-05    203 2.425724e-03 0.2170926 0.02784046
183 3.995353e-05    204 2.385771e-03 0.2169039 0.02784105
184 3.934226e-05    205 2.345817e-03 0.2169039 0.02784105
185 3.802998e-05    206 2.306475e-03 0.2167103 0.02783923
186 3.792464e-05    208 2.230415e-03 0.2167124 0.02783920
187 3.749545e-05    210 2.154566e-03 0.2166999 0.02783938
188 3.749545e-05    211 2.117070e-03 0.2165686 0.02784093
189 3.657855e-05    213 2.042079e-03 0.2165686 0.02784093
190 3.450869e-05    214 2.005501e-03 0.2169389 0.02783764
191 3.413217e-05    215 1.970992e-03 0.2171010 0.02783798
192 3.382784e-05    217 1.902728e-03 0.2171790 0.02783762
193 3.382784e-05    219 1.835072e-03 0.2172001 0.02783731
194 3.287972e-05    221 1.767416e-03 0.2172065 0.02783709
195 3.281339e-05    222 1.734537e-03 0.2170442 0.02782608
196 3.279389e-05    223 1.701723e-03 0.2170442 0.02782608
197 3.121369e-05    224 1.668929e-03 0.2166450 0.02782323
198 3.096008e-05    225 1.637716e-03 0.2168044 0.02783531
199 3.058942e-05    226 1.606756e-03 0.2168044 0.02783531
200 2.996515e-05    227 1.576166e-03 0.2168044 0.02783531
201 2.967252e-05    229 1.516236e-03 0.2165585 0.02783312
202 2.903059e-05    230 1.486563e-03 0.2167242 0.02783518
203 2.786937e-05    231 1.457533e-03 0.2171607 0.02783889
204 2.637557e-05    232 1.429664e-03 0.2175076 0.02784959
205 2.633655e-05    233 1.403288e-03 0.2179451 0.02789624
206 2.438570e-05    237 1.297942e-03 0.2181034 0.02789716
207 2.438570e-05    238 1.273556e-03 0.2179579 0.02788785
208 2.416274e-05    239 1.249170e-03 0.2179809 0.02788752
209 2.389798e-05    240 1.225008e-03 0.2179809 0.02788752
210 2.294206e-05    241 1.201110e-03 0.2181332 0.02789714
211 2.294206e-05    242 1.178168e-03 0.2182343 0.02789626
212 2.275478e-05    243 1.155225e-03 0.2181609 0.02789708
213 2.247386e-05    244 1.132471e-03 0.2181609 0.02789708
214 2.112777e-05    246 1.087523e-03 0.2183053 0.02790742
215 2.064005e-05    247 1.066395e-03 0.2184871 0.02790730
216 2.029670e-05    248 1.045755e-03 0.2184804 0.02790740
217 1.978168e-05    249 1.025458e-03 0.2184804 0.02790740
218 1.978168e-05    250 1.005677e-03 0.2184804 0.02790740
219 1.978168e-05    251 9.858951e-04 0.2184804 0.02790740
220 1.967633e-05    252 9.661134e-04 0.2184804 0.02790740
221 1.888428e-05    253 9.464371e-04 0.2184453 0.02790792
222 1.874772e-05    254 9.275528e-04 0.2182712 0.02790684
223 1.720655e-05    255 9.088051e-04 0.2192807 0.02805577
224 1.720655e-05    256 8.915985e-04 0.2192766 0.02805583
225 1.691392e-05    257 8.743920e-04 0.2192766 0.02805583
226 1.691392e-05    258 8.574781e-04 0.2191610 0.02805729
227 1.685539e-05    259 8.405641e-04 0.2191610 0.02805729
228 1.685539e-05    260 8.237087e-04 0.2191610 0.02805729
229 1.651204e-05    261 8.068533e-04 0.2191610 0.02805729
230 1.640670e-05    262 7.903413e-04 0.2192270 0.02805750
231 1.602433e-05    263 7.739346e-04 0.2192270 0.02805750
232 1.560685e-05    264 7.579103e-04 0.2192055 0.02805670
233 1.560685e-05    265 7.423034e-04 0.2192868 0.02805564
234 1.416321e-05    267 7.110897e-04 0.2192492 0.02805610
235 1.416321e-05    268 6.969265e-04 0.2194753 0.02806062
236 1.416321e-05    269 6.827633e-04 0.2194753 0.02806062
237 1.408518e-05    270 6.686001e-04 0.2194753 0.02806062
238 1.353114e-05    272 6.404297e-04 0.2198634 0.02806270
239 1.316828e-05    273 6.268986e-04 0.2198634 0.02806270
240 1.264155e-05    274 6.137303e-04 0.2199390 0.02806295
241 1.264155e-05    275 6.010888e-04 0.2198612 0.02806349
242 1.170513e-05    276 5.884472e-04 0.2196914 0.02805856
243 1.147103e-05    281 5.299216e-04 0.2197065 0.02805817
244 1.127595e-05    282 5.184505e-04 0.2196847 0.02805840
245 1.123693e-05    284 4.958986e-04 0.2196847 0.02805840
246 1.011324e-05    285 4.846617e-04 0.2197635 0.02805833
247 9.844018e-06    286 4.745485e-04 0.2198039 0.02805887
248 9.481159e-06    287 4.647044e-04 0.2198469 0.02805899
249 9.442142e-06    289 4.457421e-04 0.2198553 0.02805889
250 9.442142e-06    290 4.363000e-04 0.2198553 0.02805889
251 9.390863e-06    291 4.268578e-04 0.2198944 0.02805875
252 8.778851e-06    293 4.080761e-04 0.2200116 0.02806018
253 7.647355e-06    295 3.905184e-04 0.2202834 0.02805086
254 7.647355e-06    297 3.752237e-04 0.2202479 0.02805120
255 7.491286e-06    300 3.522816e-04 0.2202479 0.02805120
256 7.081607e-06    301 3.447904e-04 0.2201953 0.02805132
257 6.593893e-06    302 3.377087e-04 0.2201979 0.02805134
258 6.593893e-06    305 3.179271e-04 0.2201979 0.02805134
259 6.554876e-06    306 3.113332e-04 0.2201979 0.02805134
260 6.320773e-06    308 2.982234e-04 0.2201801 0.02805062
261 6.192016e-06    309 2.919027e-04 0.2201014 0.02805223
262 5.735516e-06    310 2.857106e-04 0.2200668 0.02803344
263 5.735516e-06    311 2.799751e-04 0.2200668 0.02803344
264 5.735516e-06    312 2.742396e-04 0.2200668 0.02803344
265 5.637973e-06    319 2.340910e-04 0.2200668 0.02803344
266 5.618465e-06    320 2.284530e-04 0.2200668 0.02803344
267 4.877140e-06    322 2.172161e-04 0.2202453 0.02803453
268 4.877140e-06    323 2.123390e-04 0.2202453 0.02803453
269 4.740580e-06    324 2.074618e-04 0.2202356 0.02803461
270 4.721071e-06    325 2.027212e-04 0.2202356 0.02803461
271 4.721071e-06    326 1.980002e-04 0.2202356 0.02803461
272 4.389426e-06    327 1.932791e-04 0.2198039 0.02803459
273 4.213849e-06    328 1.888897e-04 0.2198317 0.02803401
274 4.213849e-06    332 1.720343e-04 0.2199185 0.02803338
275 4.213849e-06    335 1.593927e-04 0.2199185 0.02803338
276 3.901712e-06    337 1.509650e-04 0.2199531 0.02803346
277 3.901712e-06    338 1.470633e-04 0.2199531 0.02803346
278 3.823677e-06    340 1.392599e-04 0.2199531 0.02803346
279 3.745643e-06    341 1.354362e-04 0.2199531 0.02803346
280 3.382784e-06    342 1.316906e-04 0.2199927 0.02803306
281 3.296946e-06    343 1.283078e-04 0.2200858 0.02803535
282 3.296946e-06    344 1.250108e-04 0.2200858 0.02803535
283 3.160386e-06    345 1.217139e-04 0.2202383 0.02804624
284 2.926284e-06    349 1.090723e-04 0.2201897 0.02804650
285 2.867758e-06    358 8.273580e-05 0.2201897 0.02804650
286 2.497095e-06    359 7.986804e-05 0.2201437 0.02804715
287 2.106924e-06    361 7.487385e-05 0.2201475 0.02804712
288 1.950856e-06    362 7.276692e-05 0.2204144 0.02805646
289 1.911839e-06    363 7.081607e-05 0.2204144 0.02805646
290 1.911839e-06    364 6.890423e-05 0.2204144 0.02805646
291 1.911839e-06    365 6.699239e-05 0.2204144 0.02805646
292 1.872822e-06    367 6.316871e-05 0.2204144 0.02805646
293 1.872822e-06    370 5.755025e-05 0.2204144 0.02805646
294 1.872822e-06    372 5.380460e-05 0.2204144 0.02805646
295 1.463142e-06    376 4.631332e-05 0.2204087 0.02805653
296 1.463142e-06    377 4.485018e-05 0.2203685 0.02805683
297 1.404616e-06    378 4.338703e-05 0.2203685 0.02805683
298 1.404616e-06    379 4.198242e-05 0.2203685 0.02805683
299 1.404616e-06    380 4.057780e-05 0.2203685 0.02805683
300 1.053462e-06    381 3.917318e-05 0.2203860 0.02805703
301 1.053462e-06    390 2.969203e-05 0.2204696 0.02805921
302 1.053462e-06    391 2.863856e-05 0.2204696 0.02805921
303 9.754279e-07    396 2.337125e-05 0.2204403 0.02805945
304 9.754279e-07    397 2.239582e-05 0.2204403 0.02805945
305 9.754279e-07    398 2.142040e-05 0.2204403 0.02805945
306 9.754279e-07    400 1.946954e-05 0.2204403 0.02805945
307 6.242739e-07    402 1.751869e-05 0.2204403 0.02805945
308 6.242739e-07    403 1.689441e-05 0.2204496 0.02805931
309 6.242739e-07    404 1.627014e-05 0.2204496 0.02805931
310 4.682054e-07    405 1.564586e-05 0.2204779 0.02806040
311 4.682054e-07    412 1.236843e-05 0.2204639 0.02806077
312 4.682054e-07    413 1.190022e-05 0.2204639 0.02806077
313 4.682054e-07    414 1.143202e-05 0.2204639 0.02806077
314 4.682054e-07    424 6.749961e-06 0.2204639 0.02806077
315 3.511540e-07    426 5.813550e-06 0.2204639 0.02806077
316 3.511540e-07    427 5.462396e-06 0.2204719 0.02806087
317 1.560685e-07    428 5.111242e-06 0.2204719 0.02806087
318 1.560685e-07    429 4.955174e-06 0.2204663 0.02806091
319 1.170513e-07    430 4.799105e-06 0.2204663 0.02806091
320 1.170513e-07    454 1.989873e-06 0.2204655 0.02805998
321 1.170513e-07    460 1.287565e-06 0.2204655 0.02805998
322 1.170513e-07    470 1.170513e-07 0.2204655 0.02805998
323 0.000000e+00    471 0.000000e+00 0.2204655 0.02805998

We have to find the cp value that gives us the smallest xerror.

cp.Boston.min <- dt.Boston$cptable[which.min(dt.Boston$cptable[, "xerror"]), "CP"]
cp.Boston.min
[1] 0.0006045313
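rpart also provides plotcp() for eyeballing this choice; it plots the cross-validated error against cp:

plotcp(dt.Boston)  # xerror (with one-SE bars) versus cp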

Prune tree

dt.Boston.pruned <- prune(dt.Boston, cp = cp.Boston.min)
rpart.plot(dt.Boston.pruned)

Make predictions and compare with the OLS model

predictions <- predict(dt.Boston.pruned, newdata = Boston)
actual_values <- Boston$medv
dt_sse <- sum((predictions - actual_values) ^ 2)
dt_sse
[1] 1308.89
ols_model <- lm(medv ~ ., data = Boston)
ols_predictions <- predict(ols_model, newdata = Boston)
ols_sse <- sum((ols_predictions - actual_values) ^ 2)
ols_sse
[1] 11078.78

Pay Attention

Ask yourself why dt_sse can differ each time you rerun the code (recall the note about cross-validation and set.seed above).

HW6 Q1 Modify CART code for regression

Some ideas for your modified CART algorithm

Function to track used variables and apply the penalty

custom_split <- function(response, covariates, weights, used_vars, lambda) {
  best_split <- NULL
  best_mse <- Inf
  best_variable <- NULL
  # TODO: loop over covariates and candidate cutpoints, compute a penalized
  # MSE (e.g., add lambda when a variable is not already in used_vars),
  # and keep the best split found
}

Recursive function to build the tree

build_custom_tree <- function(data, response_var, lambda, used_vars) {
  # TODO: call custom_split(), partition the data, and recurse on each child
}

Source

  • Leo Breiman, Department of Statistics, UC Berkeley. Data used in Leo Breiman and Jerome H. Friedman (1985), Estimating optimal transformations for multiple regression and correlation, JASA, 80, pp. 580-598.
  • Dr. Colin Rundel, STA 523 slides
  • Dr. David Banks, STA 521 slides
  • Dr. Ambrose Lo, ACTEX
  • An Introduction to Statistical Learning (ISLR)

Thank you!

Have a great rest of your day!