Regression Trees and Model Trees

Understanding regression trees and model trees

Example: Calculating SDR

# set up the data
tee <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 7)
at1 <- c(1, 1, 1, 2, 2, 3, 4, 5, 5)
at2 <- c(6, 6, 7, 7, 7, 7)
bt1 <- c(1, 1, 1, 2, 2, 3, 4)
bt2 <- c(5, 5, 6, 6, 7, 7, 7, 7)
summary(tee)
str(tee)

#We test
# compute the SDR
sdr_a <- sd(tee) - (length(at1) / length(tee) * sd(at1) + length(at2) / length(tee) * sd(at2))
sdr_b <- sd(tee) - (length(bt1) / length(tee) * sd(bt1) + length(bt2) / length(tee) * sd(bt2))

#these values provide some indication of how well the main signal (tee) stands out from the #variations present in its sub-sets (at1, at2, bt1, bt2). Higher values of sdr_a and sdr_b #would suggest stronger signals relative to the sub-sets' noise.
# compare the SDR for each split
sdr_a
sdr_b

#sdr_a is = to 1.202815
#sdr_b is =  to 1.392751

#This indicates that sdr_b represents a stronger signal relative to its sub-sets compared to sdr_a.

# If sdr_a or sdr_b has a larger value, it means that even after removing the sub-sets' variability, the remaining standard deviation (representing the variability of the main signal) is still high. This implies that the main signal is strong and distinct from the variations in the sub-sets.

Exercise No 3: Estimating Wine Quality

Step 2: Exploring and preparing the data

wine <- read.csv("whitewines.csv")
# examine the wine data
str(wine)
# the distribution of quality ratings
hist(wine$quality)
# summary statistics of the wine data
summary(wine)
wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]

#used 3750 observations to train to model and 1148 to test it

Step 3: Training a model on the data

# regression tree using rpart
library(rpart)
m.rpart <- rpart(quality ~ ., data = wine_train)
# get basic information about the tree
m.rpart
# get more detailed information about the tree
summary(m.rpart)

Variable Importance:

The most important variable for predicting wine quality is alcohol, followed by density, volatile acidity, and chlorides.

Tree Structure:

The tree starts by splitting the data based on alcohol content. Wines with alcohol < 10.85 are more likely to have lower quality (mean quality around 5.6). Wines with alcohol >= 10.85 are more likely to have higher quality (mean quality around 6.3). Further splits are made based on other variables like volatile acidity, free sulfur dioxide, and chlorides. Each terminal node represents a group of wines with similar characteristics and predicted quality.

#install.packages("rpart.plot")
# use the rpart.plot package to create a visualization
library(rpart.plot)
# a basic decision tree diagram
rpart.plot(m.rpart, digits = 3)
# a few adjustments to the diagram
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)

Step 4: Evaluate model performanc

# generate predictions for the testing dataset
p.rpart <- predict(m.rpart, wine_test)
# compare the distribution of predicted values vs. actual values
summary(p.rpart)
summary(wine_test$quality)

#Both distributions have similar medians and 3rd quartiles, suggesting predictions tend to be close to actual values for a significant portion of the data.
#The actual values range wider (3.000 to 9.000) compared to predicted values (4.545 to 6.597), indicating the model might not capture the full range of quality.
# compare the correlation
cor(p.rpart, wine_test$quality)
# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))  
}
# mean absolute error between predicted and actual values
MAE(p.rpart, wine_test$quality)

#The resulting MAE value represents the average absolute error made by your model on the test set. It reflects the typical magnitude of difference between predictions and actual quality on unseen data.
#on average every prediction is 0.5872 units of quality off the actual value.
# mean absolute error between actual values and mean value
mean(wine_train$quality) # result = 5.87
MAE(5.87, wine_test$quality)  #result = 0.6722474

Step 5: Improving model performance

#install.packages("plyr")
#install.packages("Cubist")
# train a Cubist Model Tree
library(Cubist)
m.cubist <- cubist(x = wine_train[-12], y = wine_train$quality)
# display basic information about the model tree
m.cubist
# display the tree itself
summary(m.cubist)
# generate predictions for the model
p.cubist <- predict(m.cubist, wine_test)
# summary statistics about the predictions
summary(p.cubist)
# correlation between the predicted and true values
cor(p.cubist, wine_test$quality)

#The magnitude of 0.62 suggests a moderately strong linear relationship.
# mean absolute error of predicted and true values
# (uses a custom function defined above)
MAE(wine_test$quality, p.cubist) 
LS0tDQp0aXRsZTogIlJlZ3Jlc3Npb24gVHJlZXMiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KDQoNCg0KDQojIyMjIFJlZ3Jlc3Npb24gVHJlZXMgYW5kIE1vZGVsIFRyZWVzDQoNCiMjIFVuZGVyc3RhbmRpbmcgcmVncmVzc2lvbiB0cmVlcyBhbmQgbW9kZWwgdHJlZXMNCg0KIyMgRXhhbXBsZTogQ2FsY3VsYXRpbmcgU0RSDQoNCmBgYHtyfQ0KIyBzZXQgdXAgdGhlIGRhdGENCnRlZSA8LSBjKDEsIDEsIDEsIDIsIDIsIDMsIDQsIDUsIDUsIDYsIDYsIDcsIDcsIDcsIDcpDQphdDEgPC0gYygxLCAxLCAxLCAyLCAyLCAzLCA0LCA1LCA1KQ0KYXQyIDwtIGMoNiwgNiwgNywgNywgNywgNykNCmJ0MSA8LSBjKDEsIDEsIDEsIDIsIDIsIDMsIDQpDQpidDIgPC0gYyg1LCA1LCA2LCA2LCA3LCA3LCA3LCA3KQ0KYGBgDQoNCg0KYGBge3J9DQpzdW1tYXJ5KHRlZSkNCnN0cih0ZWUpDQoNCiNXZSB0ZXN0DQpgYGANCg0KDQpgYGB7cn0NCiMgY29tcHV0ZSB0aGUgU0RSDQpzZHJfYSA8LSBzZCh0ZWUpIC0gKGxlbmd0aChhdDEpIC8gbGVuZ3RoKHRlZSkgKiBzZChhdDEpICsgbGVuZ3RoKGF0MikgLyBsZW5ndGgodGVlKSAqIHNkKGF0MikpDQpzZHJfYiA8LSBzZCh0ZWUpIC0gKGxlbmd0aChidDEpIC8gbGVuZ3RoKHRlZSkgKiBzZChidDEpICsgbGVuZ3RoKGJ0MikgLyBsZW5ndGgodGVlKSAqIHNkKGJ0MikpDQoNCiN0aGVzZSB2YWx1ZXMgcHJvdmlkZSBzb21lIGluZGljYXRpb24gb2YgaG93IHdlbGwgdGhlIG1haW4gc2lnbmFsICh0ZWUpIHN0YW5kcyBvdXQgZnJvbSB0aGUgI3ZhcmlhdGlvbnMgcHJlc2VudCBpbiBpdHMgc3ViLXNldHMgKGF0MSwgYXQyLCBidDEsIGJ0MikuIEhpZ2hlciB2YWx1ZXMgb2Ygc2RyX2EgYW5kIHNkcl9iICN3b3VsZCBzdWdnZXN0IHN0cm9uZ2VyIHNpZ25hbHMgcmVsYXRpdmUgdG8gdGhlIHN1Yi1zZXRzJyBub2lzZS4NCmBgYA0KDQoNCmBgYHtyfQ0KIyBjb21wYXJlIHRoZSBTRFIgZm9yIGVhY2ggc3BsaXQNCnNkcl9hDQpzZHJfYg0KDQojc2RyX2EgaXMgPSB0byAxLjIwMjgxNQ0KI3Nkcl9iIGlzID0gIHRvIDEuMzkyNzUxDQoNCiNUaGlzIGluZGljYXRlcyB0aGF0IHNkcl9iIHJlcHJlc2VudHMgYSBzdHJvbmdlciBzaWduYWwgcmVsYXRpdmUgdG8gaXRzIHN1Yi1zZXRzIGNvbXBhcmVkIHRvIHNkcl9hLg0KDQojIElmIHNkcl9hIG9yIHNkcl9iIGhhcyBhIGxhcmdlciB2YWx1ZSwgaXQgbWVhbnMgdGhhdCBldmVuIGFmdGVyIHJlbW92aW5nIHRoZSBzdWItc2V0cycgdmFyaWFiaWxpdHksIHRoZSByZW1haW5pbmcgc3RhbmRhcmQgZGV2aWF0aW9uIChyZXByZXNlbnRpbmcgdGhlIHZhcmlhYmlsaXR5IG9mIHRoZSBtYWluIHNpZ25hbCkgaXMgc3RpbGwgaGlnaC4gVGhpcyBpbXBsaWVzIHRoYXQgdGhlIG1haW4gc2lnbmFsIGlzIHN0cm9uZyBhbmQgZGlzdGluY3QgZnJvbSB0aGUgdmFyaWF0aW9ucyBpbiB0aGUgc3ViLXNldHMuDQoNCmBgYA0KDQoNCg0KIyMgRXhlcmNpc2UgTm8gMzogRXN0aW1hdGluZyBXaW5lIFF1YWxpdHkNCg0KDQojIyBTdGVwIDI6IEV4cGxvcmluZyBhbmQgcHJlcGFyaW5nIHRoZSBkYXRhDQoNCmBgYHtyfQ0Kd2luZSA8LSByZWFkLmNzdigid2hpdGV3aW5lcy5jc3YiKQ0KYGBgDQoNCg0KDQpgYGB7cn0NCiMgZXhhbWluZSB0aGUgd2luZSBkYXRhDQpzdHIod2luZSkNCmBgYA0KDQoNCmBgYHtyfQ0KIyB0aGUgZGlzdHJpYnV0aW9uIG9mIHF1YWxpdHkgcmF0aW5ncw0KaGlzdCh3aW5lJHF1YWxpdHkpDQpgYGANCg0KDQpgYGB7cn0NCiMgc3VtbWFyeSBzdGF0aXN0aWNzIG9mIHRoZSB3aW5lIGRhdGENCnN1bW1hcnkod2luZSkNCmBgYA0KDQoNCg0KYGBge3J9DQp3aW5lX3RyYWluIDwtIHdpbmVbMTozNzUwLCBdDQp3aW5lX3Rlc3QgPC0gd2luZVszNzUxOjQ4OTgsIF0NCg0KI3VzZWQgMzc1MCBvYnNlcnZhdGlvbnMgdG8gdHJhaW4gdG8gbW9kZWwgYW5kIDExNDggdG8gdGVzdCBpdA0KDQpgYGANCg0KDQoNCiMjIFN0ZXAgMzogVHJhaW5pbmcgYSBtb2RlbCBvbiB0aGUgZGF0YQ0KDQpgYGB7cn0NCiMgcmVncmVzc2lvbiB0cmVlIHVzaW5nIHJwYXJ0DQpsaWJyYXJ5KHJwYXJ0KQ0KbS5ycGFydCA8LSBycGFydChxdWFsaXR5IH4gLiwgZGF0YSA9IHdpbmVfdHJhaW4pDQpgYGANCg0KDQpgYGB7cn0NCiMgZ2V0IGJhc2ljIGluZm9ybWF0aW9uIGFib3V0IHRoZSB0cmVlDQptLnJwYXJ0DQpgYGANCg0KDQoNCmBgYHtyfQ0KIyBnZXQgbW9yZSBkZXRhaWxlZCBpbmZvcm1hdGlvbiBhYm91dCB0aGUgdHJlZQ0Kc3VtbWFyeShtLnJwYXJ0KQ0KYGBgDQpWYXJpYWJsZSBJbXBvcnRhbmNlOg0KDQpUaGUgbW9zdCBpbXBvcnRhbnQgdmFyaWFibGUgZm9yIHByZWRpY3Rpbmcgd2luZSBxdWFsaXR5IGlzIGFsY29ob2wsIGZvbGxvd2VkIGJ5IGRlbnNpdHksIHZvbGF0aWxlIGFjaWRpdHksIGFuZCBjaGxvcmlkZXMuDQoNClRyZWUgU3RydWN0dXJlOg0KDQpUaGUgdHJlZSBzdGFydHMgYnkgc3BsaXR0aW5nIHRoZSBkYXRhIGJhc2VkIG9uIGFsY29ob2wgY29udGVudC4NCldpbmVzIHdpdGggYWxjb2hvbCA8IDEwLjg1IGFyZSBtb3JlIGxpa2VseSB0byBoYXZlIGxvd2VyIHF1YWxpdHkgKG1lYW4gcXVhbGl0eSBhcm91bmQgNS42KS4NCldpbmVzIHdpdGggYWxjb2hvbCA+PSAxMC44NSBhcmUgbW9yZSBsaWtlbHkgdG8gaGF2ZSBoaWdoZXIgcXVhbGl0eSAobWVhbiBxdWFsaXR5IGFyb3VuZCA2LjMpLg0KRnVydGhlciBzcGxpdHMgYXJlIG1hZGUgYmFzZWQgb24gb3RoZXIgdmFyaWFibGVzIGxpa2Ugdm9sYXRpbGUgYWNpZGl0eSwgZnJlZSBzdWxmdXIgZGlveGlkZSwgYW5kIGNobG9yaWRlcy4NCkVhY2ggdGVybWluYWwgbm9kZSByZXByZXNlbnRzIGEgZ3JvdXAgb2Ygd2luZXMgd2l0aCBzaW1pbGFyIGNoYXJhY3RlcmlzdGljcyBhbmQgcHJlZGljdGVkIHF1YWxpdHkuDQoNCmBgYHtyfQ0KI2luc3RhbGwucGFja2FnZXMoInJwYXJ0LnBsb3QiKQ0KYGBgDQoNCg0KYGBge3J9DQojIHVzZSB0aGUgcnBhcnQucGxvdCBwYWNrYWdlIHRvIGNyZWF0ZSBhIHZpc3VhbGl6YXRpb24NCmxpYnJhcnkocnBhcnQucGxvdCkNCmBgYA0KDQoNCmBgYHtyfQ0KIyBhIGJhc2ljIGRlY2lzaW9uIHRyZWUgZGlhZ3JhbQ0KcnBhcnQucGxvdChtLnJwYXJ0LCBkaWdpdHMgPSAzKQ0KYGBgDQoNCg0KYGBge3J9DQojIGEgZmV3IGFkanVzdG1lbnRzIHRvIHRoZSBkaWFncmFtDQpycGFydC5wbG90KG0ucnBhcnQsIGRpZ2l0cyA9IDQsIGZhbGxlbi5sZWF2ZXMgPSBUUlVFLCB0eXBlID0gMywgZXh0cmEgPSAxMDEpDQpgYGANCg0KDQojIyBTdGVwIDQ6IEV2YWx1YXRlIG1vZGVsIHBlcmZvcm1hbmMNCg0KYGBge3J9DQojIGdlbmVyYXRlIHByZWRpY3Rpb25zIGZvciB0aGUgdGVzdGluZyBkYXRhc2V0DQpwLnJwYXJ0IDwtIHByZWRpY3QobS5ycGFydCwgd2luZV90ZXN0KQ0KYGBgDQoNCg0KYGBge3J9DQojIGNvbXBhcmUgdGhlIGRpc3RyaWJ1dGlvbiBvZiBwcmVkaWN0ZWQgdmFsdWVzIHZzLiBhY3R1YWwgdmFsdWVzDQpzdW1tYXJ5KHAucnBhcnQpDQpzdW1tYXJ5KHdpbmVfdGVzdCRxdWFsaXR5KQ0KDQojQm90aCBkaXN0cmlidXRpb25zIGhhdmUgc2ltaWxhciBtZWRpYW5zIGFuZCAzcmQgcXVhcnRpbGVzLCBzdWdnZXN0aW5nIHByZWRpY3Rpb25zIHRlbmQgdG8gYmUgY2xvc2UgdG8gYWN0dWFsIHZhbHVlcyBmb3IgYSBzaWduaWZpY2FudCBwb3J0aW9uIG9mIHRoZSBkYXRhLg0KI1RoZSBhY3R1YWwgdmFsdWVzIHJhbmdlIHdpZGVyICgzLjAwMCB0byA5LjAwMCkgY29tcGFyZWQgdG8gcHJlZGljdGVkIHZhbHVlcyAoNC41NDUgdG8gNi41OTcpLCBpbmRpY2F0aW5nIHRoZSBtb2RlbCBtaWdodCBub3QgY2FwdHVyZSB0aGUgZnVsbCByYW5nZSBvZiBxdWFsaXR5Lg0KYGBgDQoNCg0KYGBge3J9DQojIGNvbXBhcmUgdGhlIGNvcnJlbGF0aW9uDQpjb3IocC5ycGFydCwgd2luZV90ZXN0JHF1YWxpdHkpDQpgYGANCg0KDQpgYGB7cn0NCiMgZnVuY3Rpb24gdG8gY2FsY3VsYXRlIHRoZSBtZWFuIGFic29sdXRlIGVycm9yDQpNQUUgPC0gZnVuY3Rpb24oYWN0dWFsLCBwcmVkaWN0ZWQpIHsNCiAgbWVhbihhYnMoYWN0dWFsIC0gcHJlZGljdGVkKSkgIA0KfQ0KYGBgDQoNCg0KDQpgYGB7cn0NCiMgbWVhbiBhYnNvbHV0ZSBlcnJvciBiZXR3ZWVuIHByZWRpY3RlZCBhbmQgYWN0dWFsIHZhbHVlcw0KTUFFKHAucnBhcnQsIHdpbmVfdGVzdCRxdWFsaXR5KQ0KDQojVGhlIHJlc3VsdGluZyBNQUUgdmFsdWUgcmVwcmVzZW50cyB0aGUgYXZlcmFnZSBhYnNvbHV0ZSBlcnJvciBtYWRlIGJ5IHlvdXIgbW9kZWwgb24gdGhlIHRlc3Qgc2V0LiBJdCByZWZsZWN0cyB0aGUgdHlwaWNhbCBtYWduaXR1ZGUgb2YgZGlmZmVyZW5jZSBiZXR3ZWVuIHByZWRpY3Rpb25zIGFuZCBhY3R1YWwgcXVhbGl0eSBvbiB1bnNlZW4gZGF0YS4NCiNvbiBhdmVyYWdlIGV2ZXJ5IHByZWRpY3Rpb24gaXMgMC41ODcyIHVuaXRzIG9mIHF1YWxpdHkgb2ZmIHRoZSBhY3R1YWwgdmFsdWUuDQpgYGANCg0KDQpgYGB7cn0NCiMgbWVhbiBhYnNvbHV0ZSBlcnJvciBiZXR3ZWVuIGFjdHVhbCB2YWx1ZXMgYW5kIG1lYW4gdmFsdWUNCm1lYW4od2luZV90cmFpbiRxdWFsaXR5KSAjIHJlc3VsdCA9IDUuODcNCk1BRSg1Ljg3LCB3aW5lX3Rlc3QkcXVhbGl0eSkgICNyZXN1bHQgPSAwLjY3MjI0NzQNCmBgYCANCg0KDQojIyBTdGVwIDU6IEltcHJvdmluZyBtb2RlbCBwZXJmb3JtYW5jZQ0KDQpgYGB7cn0NCiNpbnN0YWxsLnBhY2thZ2VzKCJwbHlyIikNCiNpbnN0YWxsLnBhY2thZ2VzKCJDdWJpc3QiKQ0KYGBgDQoNCg0KYGBge3J9DQojIHRyYWluIGEgQ3ViaXN0IE1vZGVsIFRyZWUNCmxpYnJhcnkoQ3ViaXN0KQ0KbS5jdWJpc3QgPC0gY3ViaXN0KHggPSB3aW5lX3RyYWluWy0xMl0sIHkgPSB3aW5lX3RyYWluJHF1YWxpdHkpDQpgYGANCg0KDQpgYGB7cn0NCiMgZGlzcGxheSBiYXNpYyBpbmZvcm1hdGlvbiBhYm91dCB0aGUgbW9kZWwgdHJlZQ0KbS5jdWJpc3QNCmBgYA0KDQoNCg0KYGBge3J9DQojIGRpc3BsYXkgdGhlIHRyZWUgaXRzZWxmDQpzdW1tYXJ5KG0uY3ViaXN0KQ0KYGBgDQoNCg0KYGBge3J9DQojIGdlbmVyYXRlIHByZWRpY3Rpb25zIGZvciB0aGUgbW9kZWwNCnAuY3ViaXN0IDwtIHByZWRpY3QobS5jdWJpc3QsIHdpbmVfdGVzdCkNCmBgYA0KDQoNCmBgYHtyfQ0KIyBzdW1tYXJ5IHN0YXRpc3RpY3MgYWJvdXQgdGhlIHByZWRpY3Rpb25zDQpzdW1tYXJ5KHAuY3ViaXN0KQ0KYGBgDQoNCg0KYGBge3J9DQojIGNvcnJlbGF0aW9uIGJldHdlZW4gdGhlIHByZWRpY3RlZCBhbmQgdHJ1ZSB2YWx1ZXMNCmNvcihwLmN1YmlzdCwgd2luZV90ZXN0JHF1YWxpdHkpDQoNCiNUaGUgbWFnbml0dWRlIG9mIDAuNjIgc3VnZ2VzdHMgYSBtb2RlcmF0ZWx5IHN0cm9uZyBsaW5lYXIgcmVsYXRpb25zaGlwLg0KYGBgDQoNCg0KYGBge3J9DQojIG1lYW4gYWJzb2x1dGUgZXJyb3Igb2YgcHJlZGljdGVkIGFuZCB0cnVlIHZhbHVlcw0KIyAodXNlcyBhIGN1c3RvbSBmdW5jdGlvbiBkZWZpbmVkIGFib3ZlKQ0KTUFFKHdpbmVfdGVzdCRxdWFsaXR5LCBwLmN1YmlzdCkgDQpgYGANCg0KDQoNCg==