Regression Trees and Model Trees
Understanding regression trees and model trees
Example: Calculating SDR
# set up the data
tee <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 7)
at1 <- c(1, 1, 1, 2, 2, 3, 4, 5, 5)
at2 <- c(6, 6, 7, 7, 7, 7)
bt1 <- c(1, 1, 1, 2, 2, 3, 4)
bt2 <- c(5, 5, 6, 6, 7, 7, 7, 7)
summary(tee)
str(tee)
#We test
# compute the SDR
sdr_a <- sd(tee) - (length(at1) / length(tee) * sd(at1) + length(at2) / length(tee) * sd(at2))
sdr_b <- sd(tee) - (length(bt1) / length(tee) * sd(bt1) + length(bt2) / length(tee) * sd(bt2))
#these values provide some indication of how well the main signal (tee) stands out from the #variations present in its sub-sets (at1, at2, bt1, bt2). Higher values of sdr_a and sdr_b #would suggest stronger signals relative to the sub-sets' noise.
# compare the SDR for each split
sdr_a
sdr_b
#sdr_a is = to 1.202815
#sdr_b is = to 1.392751
#This indicates that sdr_b represents a stronger signal relative to its sub-sets compared to sdr_a.
# If sdr_a or sdr_b has a larger value, it means that even after removing the sub-sets' variability, the remaining standard deviation (representing the variability of the main signal) is still high. This implies that the main signal is strong and distinct from the variations in the sub-sets.
Exercise No 3: Estimating Wine Quality
Step 2: Exploring and preparing the data
wine <- read.csv("whitewines.csv")
# examine the wine data
str(wine)
# the distribution of quality ratings
hist(wine$quality)
# summary statistics of the wine data
summary(wine)
wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
#used 3750 observations to train to model and 1148 to test it
Step 3: Training a model on the data
# regression tree using rpart
library(rpart)
m.rpart <- rpart(quality ~ ., data = wine_train)
# get basic information about the tree
m.rpart
# get more detailed information about the tree
summary(m.rpart)
Variable Importance:
The most important variable for predicting wine quality is alcohol,
followed by density, volatile acidity, and chlorides.
Tree Structure:
The tree starts by splitting the data based on alcohol content. Wines
with alcohol < 10.85 are more likely to have lower quality (mean
quality around 5.6). Wines with alcohol >= 10.85 are more likely to
have higher quality (mean quality around 6.3). Further splits are made
based on other variables like volatile acidity, free sulfur dioxide, and
chlorides. Each terminal node represents a group of wines with similar
characteristics and predicted quality.
#install.packages("rpart.plot")
# use the rpart.plot package to create a visualization
library(rpart.plot)
# a basic decision tree diagram
rpart.plot(m.rpart, digits = 3)
# a few adjustments to the diagram
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
LS0tDQp0aXRsZTogIlJlZ3Jlc3Npb24gVHJlZXMiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KDQoNCg0KDQojIyMjIFJlZ3Jlc3Npb24gVHJlZXMgYW5kIE1vZGVsIFRyZWVzDQoNCiMjIFVuZGVyc3RhbmRpbmcgcmVncmVzc2lvbiB0cmVlcyBhbmQgbW9kZWwgdHJlZXMNCg0KIyMgRXhhbXBsZTogQ2FsY3VsYXRpbmcgU0RSDQoNCmBgYHtyfQ0KIyBzZXQgdXAgdGhlIGRhdGENCnRlZSA8LSBjKDEsIDEsIDEsIDIsIDIsIDMsIDQsIDUsIDUsIDYsIDYsIDcsIDcsIDcsIDcpDQphdDEgPC0gYygxLCAxLCAxLCAyLCAyLCAzLCA0LCA1LCA1KQ0KYXQyIDwtIGMoNiwgNiwgNywgNywgNywgNykNCmJ0MSA8LSBjKDEsIDEsIDEsIDIsIDIsIDMsIDQpDQpidDIgPC0gYyg1LCA1LCA2LCA2LCA3LCA3LCA3LCA3KQ0KYGBgDQoNCg0KYGBge3J9DQpzdW1tYXJ5KHRlZSkNCnN0cih0ZWUpDQoNCiNXZSB0ZXN0DQpgYGANCg0KDQpgYGB7cn0NCiMgY29tcHV0ZSB0aGUgU0RSDQpzZHJfYSA8LSBzZCh0ZWUpIC0gKGxlbmd0aChhdDEpIC8gbGVuZ3RoKHRlZSkgKiBzZChhdDEpICsgbGVuZ3RoKGF0MikgLyBsZW5ndGgodGVlKSAqIHNkKGF0MikpDQpzZHJfYiA8LSBzZCh0ZWUpIC0gKGxlbmd0aChidDEpIC8gbGVuZ3RoKHRlZSkgKiBzZChidDEpICsgbGVuZ3RoKGJ0MikgLyBsZW5ndGgodGVlKSAqIHNkKGJ0MikpDQoNCiN0aGVzZSB2YWx1ZXMgcHJvdmlkZSBzb21lIGluZGljYXRpb24gb2YgaG93IHdlbGwgdGhlIG1haW4gc2lnbmFsICh0ZWUpIHN0YW5kcyBvdXQgZnJvbSB0aGUgI3ZhcmlhdGlvbnMgcHJlc2VudCBpbiBpdHMgc3ViLXNldHMgKGF0MSwgYXQyLCBidDEsIGJ0MikuIEhpZ2hlciB2YWx1ZXMgb2Ygc2RyX2EgYW5kIHNkcl9iICN3b3VsZCBzdWdnZXN0IHN0cm9uZ2VyIHNpZ25hbHMgcmVsYXRpdmUgdG8gdGhlIHN1Yi1zZXRzJyBub2lzZS4NCmBgYA0KDQoNCmBgYHtyfQ0KIyBjb21wYXJlIHRoZSBTRFIgZm9yIGVhY2ggc3BsaXQNCnNkcl9hDQpzZHJfYg0KDQojc2RyX2EgaXMgPSB0byAxLjIwMjgxNQ0KI3Nkcl9iIGlzID0gIHRvIDEuMzkyNzUxDQoNCiNUaGlzIGluZGljYXRlcyB0aGF0IHNkcl9iIHJlcHJlc2VudHMgYSBzdHJvbmdlciBzaWduYWwgcmVsYXRpdmUgdG8gaXRzIHN1Yi1zZXRzIGNvbXBhcmVkIHRvIHNkcl9hLg0KDQojIElmIHNkcl9hIG9yIHNkcl9iIGhhcyBhIGxhcmdlciB2YWx1ZSwgaXQgbWVhbnMgdGhhdCBldmVuIGFmdGVyIHJlbW92aW5nIHRoZSBzdWItc2V0cycgdmFyaWFiaWxpdHksIHRoZSByZW1haW5pbmcgc3RhbmRhcmQgZGV2aWF0aW9uIChyZXByZXNlbnRpbmcgdGhlIHZhcmlhYmlsaXR5IG9mIHRoZSBtYWluIHNpZ25hbCkgaXMgc3RpbGwgaGlnaC4gVGhpcyBpbXBsaWVzIHRoYXQgdGhlIG1haW4gc2lnbmFsIGlzIHN0cm9uZyBhbmQgZGlzdGluY3QgZnJvbSB0aGUgdmFyaWF0aW9ucyBpbiB0aGUgc3ViLXNldHMuDQoNCmBgYA0KDQoNCg0KIyMgRXhlcmNpc2UgTm8gMzogRXN0aW1hdGluZyBXaW5lIFF1YWxpdHkNCg0KDQojIyBTdGVwIDI6IEV4cGxvcmluZyBhbmQgcHJlcGFyaW5nIHRoZSBkYXRhDQoNCmBgYHtyfQ0Kd2luZSA8LSByZWFkLmNzdigid2hpdGV3aW5lcy5jc3YiKQ0KYGBgDQoNCg0KDQpgYGB7cn0NCiMgZXhhbWluZSB0aGUgd2luZSBkYXRhDQpzdHIod2luZSkNCmBgYA0KDQoNCmBgYHtyfQ0KIyB0aGUgZGlzdHJpYnV0aW9uIG9mIHF1YWxpdHkgcmF0aW5ncw0KaGlzdCh3aW5lJHF1YWxpdHkpDQpgYGANCg0KDQpgYGB7cn0NCiMgc3VtbWFyeSBzdGF0aXN0aWNzIG9mIHRoZSB3aW5lIGRhdGENCnN1bW1hcnkod2luZSkNCmBgYA0KDQoNCg0KYGBge3J9DQp3aW5lX3RyYWluIDwtIHdpbmVbMTozNzUwLCBdDQp3aW5lX3Rlc3QgPC0gd2luZVszNzUxOjQ4OTgsIF0NCg0KI3VzZWQgMzc1MCBvYnNlcnZhdGlvbnMgdG8gdHJhaW4gdG8gbW9kZWwgYW5kIDExNDggdG8gdGVzdCBpdA0KDQpgYGANCg0KDQoNCiMjIFN0ZXAgMzogVHJhaW5pbmcgYSBtb2RlbCBvbiB0aGUgZGF0YQ0KDQpgYGB7cn0NCiMgcmVncmVzc2lvbiB0cmVlIHVzaW5nIHJwYXJ0DQpsaWJyYXJ5KHJwYXJ0KQ0KbS5ycGFydCA8LSBycGFydChxdWFsaXR5IH4gLiwgZGF0YSA9IHdpbmVfdHJhaW4pDQpgYGANCg0KDQpgYGB7cn0NCiMgZ2V0IGJhc2ljIGluZm9ybWF0aW9uIGFib3V0IHRoZSB0cmVlDQptLnJwYXJ0DQpgYGANCg0KDQoNCmBgYHtyfQ0KIyBnZXQgbW9yZSBkZXRhaWxlZCBpbmZvcm1hdGlvbiBhYm91dCB0aGUgdHJlZQ0Kc3VtbWFyeShtLnJwYXJ0KQ0KYGBgDQpWYXJpYWJsZSBJbXBvcnRhbmNlOg0KDQpUaGUgbW9zdCBpbXBvcnRhbnQgdmFyaWFibGUgZm9yIHByZWRpY3Rpbmcgd2luZSBxdWFsaXR5IGlzIGFsY29ob2wsIGZvbGxvd2VkIGJ5IGRlbnNpdHksIHZvbGF0aWxlIGFjaWRpdHksIGFuZCBjaGxvcmlkZXMuDQoNClRyZWUgU3RydWN0dXJlOg0KDQpUaGUgdHJlZSBzdGFydHMgYnkgc3BsaXR0aW5nIHRoZSBkYXRhIGJhc2VkIG9uIGFsY29ob2wgY29udGVudC4NCldpbmVzIHdpdGggYWxjb2hvbCA8IDEwLjg1IGFyZSBtb3JlIGxpa2VseSB0byBoYXZlIGxvd2VyIHF1YWxpdHkgKG1lYW4gcXVhbGl0eSBhcm91bmQgNS42KS4NCldpbmVzIHdpdGggYWxjb2hvbCA+PSAxMC44NSBhcmUgbW9yZSBsaWtlbHkgdG8gaGF2ZSBoaWdoZXIgcXVhbGl0eSAobWVhbiBxdWFsaXR5IGFyb3VuZCA2LjMpLg0KRnVydGhlciBzcGxpdHMgYXJlIG1hZGUgYmFzZWQgb24gb3RoZXIgdmFyaWFibGVzIGxpa2Ugdm9sYXRpbGUgYWNpZGl0eSwgZnJlZSBzdWxmdXIgZGlveGlkZSwgYW5kIGNobG9yaWRlcy4NCkVhY2ggdGVybWluYWwgbm9kZSByZXByZXNlbnRzIGEgZ3JvdXAgb2Ygd2luZXMgd2l0aCBzaW1pbGFyIGNoYXJhY3RlcmlzdGljcyBhbmQgcHJlZGljdGVkIHF1YWxpdHkuDQoNCmBgYHtyfQ0KI2luc3RhbGwucGFja2FnZXMoInJwYXJ0LnBsb3QiKQ0KYGBgDQoNCg0KYGBge3J9DQojIHVzZSB0aGUgcnBhcnQucGxvdCBwYWNrYWdlIHRvIGNyZWF0ZSBhIHZpc3VhbGl6YXRpb24NCmxpYnJhcnkocnBhcnQucGxvdCkNCmBgYA0KDQoNCmBgYHtyfQ0KIyBhIGJhc2ljIGRlY2lzaW9uIHRyZWUgZGlhZ3JhbQ0KcnBhcnQucGxvdChtLnJwYXJ0LCBkaWdpdHMgPSAzKQ0KYGBgDQoNCg0KYGBge3J9DQojIGEgZmV3IGFkanVzdG1lbnRzIHRvIHRoZSBkaWFncmFtDQpycGFydC5wbG90KG0ucnBhcnQsIGRpZ2l0cyA9IDQsIGZhbGxlbi5sZWF2ZXMgPSBUUlVFLCB0eXBlID0gMywgZXh0cmEgPSAxMDEpDQpgYGANCg0KDQojIyBTdGVwIDQ6IEV2YWx1YXRlIG1vZGVsIHBlcmZvcm1hbmMNCg0KYGBge3J9DQojIGdlbmVyYXRlIHByZWRpY3Rpb25zIGZvciB0aGUgdGVzdGluZyBkYXRhc2V0DQpwLnJwYXJ0IDwtIHByZWRpY3QobS5ycGFydCwgd2luZV90ZXN0KQ0KYGBgDQoNCg0KYGBge3J9DQojIGNvbXBhcmUgdGhlIGRpc3RyaWJ1dGlvbiBvZiBwcmVkaWN0ZWQgdmFsdWVzIHZzLiBhY3R1YWwgdmFsdWVzDQpzdW1tYXJ5KHAucnBhcnQpDQpzdW1tYXJ5KHdpbmVfdGVzdCRxdWFsaXR5KQ0KDQojQm90aCBkaXN0cmlidXRpb25zIGhhdmUgc2ltaWxhciBtZWRpYW5zIGFuZCAzcmQgcXVhcnRpbGVzLCBzdWdnZXN0aW5nIHByZWRpY3Rpb25zIHRlbmQgdG8gYmUgY2xvc2UgdG8gYWN0dWFsIHZhbHVlcyBmb3IgYSBzaWduaWZpY2FudCBwb3J0aW9uIG9mIHRoZSBkYXRhLg0KI1RoZSBhY3R1YWwgdmFsdWVzIHJhbmdlIHdpZGVyICgzLjAwMCB0byA5LjAwMCkgY29tcGFyZWQgdG8gcHJlZGljdGVkIHZhbHVlcyAoNC41NDUgdG8gNi41OTcpLCBpbmRpY2F0aW5nIHRoZSBtb2RlbCBtaWdodCBub3QgY2FwdHVyZSB0aGUgZnVsbCByYW5nZSBvZiBxdWFsaXR5Lg0KYGBgDQoNCg0KYGBge3J9DQojIGNvbXBhcmUgdGhlIGNvcnJlbGF0aW9uDQpjb3IocC5ycGFydCwgd2luZV90ZXN0JHF1YWxpdHkpDQpgYGANCg0KDQpgYGB7cn0NCiMgZnVuY3Rpb24gdG8gY2FsY3VsYXRlIHRoZSBtZWFuIGFic29sdXRlIGVycm9yDQpNQUUgPC0gZnVuY3Rpb24oYWN0dWFsLCBwcmVkaWN0ZWQpIHsNCiAgbWVhbihhYnMoYWN0dWFsIC0gcHJlZGljdGVkKSkgIA0KfQ0KYGBgDQoNCg0KDQpgYGB7cn0NCiMgbWVhbiBhYnNvbHV0ZSBlcnJvciBiZXR3ZWVuIHByZWRpY3RlZCBhbmQgYWN0dWFsIHZhbHVlcw0KTUFFKHAucnBhcnQsIHdpbmVfdGVzdCRxdWFsaXR5KQ0KDQojVGhlIHJlc3VsdGluZyBNQUUgdmFsdWUgcmVwcmVzZW50cyB0aGUgYXZlcmFnZSBhYnNvbHV0ZSBlcnJvciBtYWRlIGJ5IHlvdXIgbW9kZWwgb24gdGhlIHRlc3Qgc2V0LiBJdCByZWZsZWN0cyB0aGUgdHlwaWNhbCBtYWduaXR1ZGUgb2YgZGlmZmVyZW5jZSBiZXR3ZWVuIHByZWRpY3Rpb25zIGFuZCBhY3R1YWwgcXVhbGl0eSBvbiB1bnNlZW4gZGF0YS4NCiNvbiBhdmVyYWdlIGV2ZXJ5IHByZWRpY3Rpb24gaXMgMC41ODcyIHVuaXRzIG9mIHF1YWxpdHkgb2ZmIHRoZSBhY3R1YWwgdmFsdWUuDQpgYGANCg0KDQpgYGB7cn0NCiMgbWVhbiBhYnNvbHV0ZSBlcnJvciBiZXR3ZWVuIGFjdHVhbCB2YWx1ZXMgYW5kIG1lYW4gdmFsdWUNCm1lYW4od2luZV90cmFpbiRxdWFsaXR5KSAjIHJlc3VsdCA9IDUuODcNCk1BRSg1Ljg3LCB3aW5lX3Rlc3QkcXVhbGl0eSkgICNyZXN1bHQgPSAwLjY3MjI0NzQNCmBgYCANCg0KDQojIyBTdGVwIDU6IEltcHJvdmluZyBtb2RlbCBwZXJmb3JtYW5jZQ0KDQpgYGB7cn0NCiNpbnN0YWxsLnBhY2thZ2VzKCJwbHlyIikNCiNpbnN0YWxsLnBhY2thZ2VzKCJDdWJpc3QiKQ0KYGBgDQoNCg0KYGBge3J9DQojIHRyYWluIGEgQ3ViaXN0IE1vZGVsIFRyZWUNCmxpYnJhcnkoQ3ViaXN0KQ0KbS5jdWJpc3QgPC0gY3ViaXN0KHggPSB3aW5lX3RyYWluWy0xMl0sIHkgPSB3aW5lX3RyYWluJHF1YWxpdHkpDQpgYGANCg0KDQpgYGB7cn0NCiMgZGlzcGxheSBiYXNpYyBpbmZvcm1hdGlvbiBhYm91dCB0aGUgbW9kZWwgdHJlZQ0KbS5jdWJpc3QNCmBgYA0KDQoNCg0KYGBge3J9DQojIGRpc3BsYXkgdGhlIHRyZWUgaXRzZWxmDQpzdW1tYXJ5KG0uY3ViaXN0KQ0KYGBgDQoNCg0KYGBge3J9DQojIGdlbmVyYXRlIHByZWRpY3Rpb25zIGZvciB0aGUgbW9kZWwNCnAuY3ViaXN0IDwtIHByZWRpY3QobS5jdWJpc3QsIHdpbmVfdGVzdCkNCmBgYA0KDQoNCmBgYHtyfQ0KIyBzdW1tYXJ5IHN0YXRpc3RpY3MgYWJvdXQgdGhlIHByZWRpY3Rpb25zDQpzdW1tYXJ5KHAuY3ViaXN0KQ0KYGBgDQoNCg0KYGBge3J9DQojIGNvcnJlbGF0aW9uIGJldHdlZW4gdGhlIHByZWRpY3RlZCBhbmQgdHJ1ZSB2YWx1ZXMNCmNvcihwLmN1YmlzdCwgd2luZV90ZXN0JHF1YWxpdHkpDQoNCiNUaGUgbWFnbml0dWRlIG9mIDAuNjIgc3VnZ2VzdHMgYSBtb2RlcmF0ZWx5IHN0cm9uZyBsaW5lYXIgcmVsYXRpb25zaGlwLg0KYGBgDQoNCg0KYGBge3J9DQojIG1lYW4gYWJzb2x1dGUgZXJyb3Igb2YgcHJlZGljdGVkIGFuZCB0cnVlIHZhbHVlcw0KIyAodXNlcyBhIGN1c3RvbSBmdW5jdGlvbiBkZWZpbmVkIGFib3ZlKQ0KTUFFKHdpbmVfdGVzdCRxdWFsaXR5LCBwLmN1YmlzdCkgDQpgYGANCg0KDQoNCg==