The criteria for the two CART models used here were: 1) the complete tree, built with the default CART settings in rpart together with the altered prior. The altered prior can affect the splits through the improvement criterion \( m^* = \arg\max \left( \frac{N_m^{l} N_m^{r}}{N_m} \right) \left( \bar{y}_m^{l} - \bar{y}_m^{r} \right)^2 \). 2) the pruned tree: in addition to the altered prior, adjust the settings in rpart.control; I set cp (the complexity parameter) to 0 and xval (the number of cross-validations) to 10. The two criteria yield different decision trees, but both procedures split the data into new subgroups recursively until the subgroups either reach a minimum size or no further improvement can be made. The optimal CART has 3 splits for the complete tree and 17 splits for the pruned tree, with complexity parameters 0.010000 and 0.0020695 respectively. The advantage of the second criterion is that it fits the data more accurately; the disadvantage is that it uses several of the random predictors in doing so and seriously overfits the data. The debrief is below:
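Before starting the debrief, to make the improvement criterion above concrete, here is a small illustrative R sketch (not part of the fits that follow; the function name and toy values are my own) that computes \( (N^{l} N^{r} / N)(\bar{y}^{l} - \bar{y}^{r})^2 \) for one candidate binary partition of a numeric response:
# Illustrative only: improvement of a candidate binary split of a numeric y
split.improve <- function(y, left) {
    n.l <- sum(left)
    n.r <- sum(!left)
    (n.l * n.r/(n.l + n.r)) * (mean(y[left]) - mean(y[!left]))^2
}
split.improve(c(1, 2, 8, 9), c(TRUE, TRUE, FALSE, FALSE))
## [1] 49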
Read the income data into R and convert each column to a factor:
# install.packages("rpart")  # run once interactively; knitr cannot reach CRAN without a mirror set
library(rpart)
incomeData = read.table("/Users/shijiabian/Dropbox/STATS315B/Data/Income_Data.txt",
sep = ",", col.names = c("ANNUAL_INCOME", "SEX", "MARITAL_STATUS", "AGE",
"EDUCATION", "OCCUPATION", "LIVING_TIME", "DUAL_INCOMES", "HOUSEHOLD_NUMBER",
"HOUSEHOLD_UNDER_18", "HOUSEHOLDER_STATUS", "TYPE_OF_HOME", "ETHNIC",
"LANGUAGE"), fill = FALSE, strip.white = TRUE)
# This step is redundant as long as I specify method = 'class' in rpart()
for (i in 1:14) {
incomeData[, i] = as.factor(incomeData[, i])
}
summary(incomeData)
## ANNUAL_INCOME SEX MARITAL_STATUS AGE EDUCATION
## 1 :1745 1:4075 1 :3334 1: 878 1 : 264
## 8 :1308 2:4918 2 : 668 2:2129 2 :1046
## 6 :1110 3 : 875 3:2249 3 :2041
## 7 : 969 4 : 302 4:1615 4 :3066
## 9 : 884 5 :3654 5: 922 5 :1524
## 4 : 813 NA's: 160 6: 640 6 : 966
## (Other):2164 7: 560 NA's: 86
## OCCUPATION LIVING_TIME DUAL_INCOMES HOUSEHOLD_NUMBER
## 1 :2820 1 : 270 1:5438 2 :2664
## 6 :1489 2 :1042 2:2211 3 :1670
## 4 :1062 3 : 686 3:1344 1 :1620
## 2 : 770 4 : 900 4 :1526
## 3 : 767 5 :5182 5 : 686
## (Other):1949 NA's: 913 (Other): 452
## NA's : 136 NA's : 375
## HOUSEHOLD_UNDER_18 HOUSEHOLDER_STATUS TYPE_OF_HOME ETHNIC
## 0 :5724 1 :3256 1 :5073 7 :5811
## 1 :1506 2 :3670 2 : 655 5 :1231
## 2 :1148 3 :1827 3 :2373 3 : 910
## 3 : 412 NA's: 240 4 : 151 2 : 477
## 4 : 117 5 : 384 8 : 225
## 5 : 46 NA's: 357 (Other): 271
## (Other): 40 NA's : 68
## LANGUAGE
## 1 :7794
## 2 : 579
## 3 : 261
## NA's: 359
##
##
##
For the complete tree, fit the 13 predictors onto “ANNUAL INCOME OF HOUSEHOLD” using the default settings only. According to the printed tree, only three predictors are used: OCCUPATION, AGE and HOUSEHOLDER_STATUS. We can print the complexity table and plot the tree. It has 4 terminal nodes.
fit.noconstrain <- rpart(ANNUAL_INCOME ~ ., data = incomeData, method = "class")
print(fit.noconstrain)
## n= 8993
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 8993 7248 1 (0.19 0.086 0.074 0.09 0.08 0.12 0.11 0.15 0.098)
## 2) OCCUPATION=6,9 1843 684 1 (0.63 0.077 0.042 0.033 0.028 0.041 0.042 0.058 0.051) *
## 3) OCCUPATION=1,2,3,4,5,7,8 7150 5948 8 (0.082 0.089 0.083 0.11 0.094 0.14 0.12 0.17 0.11)
## 6) HOUSEHOLDER_STATUS=2,3 3950 3394 6 (0.13 0.13 0.12 0.13 0.12 0.14 0.1 0.094 0.037)
## 12) AGE=1,2,6,7 1569 1215 1 (0.23 0.2 0.15 0.12 0.085 0.081 0.055 0.063 0.029) *
## 13) AGE=3,4,5 2381 1952 6 (0.062 0.088 0.1 0.14 0.14 0.18 0.13 0.11 0.042) *
## 7) HOUSEHOLDER_STATUS=1 3200 2369 8 (0.026 0.036 0.037 0.069 0.065 0.15 0.16 0.26 0.2) *
plot(fit.noconstrain)
text(fit.noconstrain, use.n = T)
The analysis and graph above are reasonable. The group consisting of students (HS or college) and the unemployed has the lowest income. Householder status then plays an important role within the remaining occupations: those who rent or live with family have lower income than those who own their home. Within the renting group, age makes a further difference: people roughly under 25 or 55 and over have lower income than the rest. Overall, the tree illustrates that people who are neither students nor unemployed and who own their home have higher income than the rest.
The output below shows that the cp for the tree with 3 splits is 0.01. cp decreases as the number of splits increases. The best model is the one with 3 splits (4 nodes).
summary(fit.noconstrain)
## Call:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class")
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.08499 0 1.0000 1.0000 0.005174
## 2 0.02842 1 0.9150 0.9193 0.005732
## 3 0.01000 3 0.8582 0.8600 0.006034
##
## Variable importance
## OCCUPATION AGE HOUSEHOLDER_STATUS
## 39 21 19
## EDUCATION MARITAL_STATUS TYPE_OF_HOME
## 7 5 4
## DUAL_INCOMES
## 4
##
## Node number 1: 8993 observations, complexity param=0.08499
## predicted class=1 expected loss=0.806 P(node) =1
## class counts: 1745 775 667 813 722 1110 969 1308 884
## probabilities: 0.194 0.086 0.074 0.090 0.080 0.123 0.108 0.145 0.098
## left son=2 (1843 obs) right son=3 (7150 obs)
## Primary splits:
## OCCUPATION splits as RRRRRLRRL, improve=495.5, (136 missing)
## AGE splits as LRRRRRR, improve=495.5, (0 missing)
## EDUCATION splits as LLRRRR, improve=400.3, (86 missing)
## HOUSEHOLDER_STATUS splits as RRL, improve=396.2, (240 missing)
## MARITAL_STATUS splits as RRRRL, improve=295.3, (160 missing)
## Surrogate splits:
## AGE splits as LRRRRRR, agree=0.859, adj=0.314, (136 split)
## EDUCATION splits as LLRRRR, agree=0.832, adj=0.185, (0 split)
## HOUSEHOLDER_STATUS splits as RRL, agree=0.832, adj=0.185, (0 split)
##
## Node number 2: 1843 observations
## predicted class=1 expected loss=0.3711 P(node) =0.2049
## class counts: 1159 142 77 61 51 75 78 106 94
## probabilities: 0.629 0.077 0.042 0.033 0.028 0.041 0.042 0.058 0.051
##
## Node number 3: 7150 observations, complexity param=0.02842
## predicted class=8 expected loss=0.8319 P(node) =0.7951
## class counts: 586 633 590 752 671 1035 891 1202 790
## probabilities: 0.082 0.089 0.083 0.105 0.094 0.145 0.125 0.168 0.110
## left son=6 (3950 obs) right son=7 (3200 obs)
## Primary splits:
## HOUSEHOLDER_STATUS splits as RLL, improve=159.70, (210 missing)
## MARITAL_STATUS splits as RLLLL, improve=136.70, (109 missing)
## DUAL_INCOMES splits as LRR, improve=125.70, (0 missing)
## AGE splits as LLRRRRR, improve=108.40, (0 missing)
## TYPE_OF_HOME splits as RRLLL, improve= 90.16, (268 missing)
## Surrogate splits:
## AGE splits as LLLRRRR, agree=0.733, adj=0.406, (210 split)
## MARITAL_STATUS splits as RLLRL, agree=0.721, adj=0.377, (0 split)
## TYPE_OF_HOME splits as RRLRL, agree=0.718, adj=0.371, (0 split)
## DUAL_INCOMES splits as LRR, agree=0.703, adj=0.337, (0 split)
## OCCUPATION splits as RLLLR-LR-, agree=0.621, adj=0.155, (0 split)
##
## Node number 6: 3950 observations, complexity param=0.02842
## predicted class=6 expected loss=0.8592 P(node) =0.4392
## class counts: 502 518 470 530 463 556 394 371 146
## probabilities: 0.127 0.131 0.119 0.134 0.117 0.141 0.100 0.094 0.037
## left son=12 (1569 obs) right son=13 (2381 obs)
## Primary splits:
## AGE splits as LLRRRLL, improve=58.66, (0 missing)
## OCCUPATION splits as RLLLL-LL-, improve=49.69, (78 missing)
## EDUCATION splits as LLLRRR, improve=36.17, (36 missing)
## HOUSEHOLDER_STATUS splits as -RL, improve=32.30, (122 missing)
## MARITAL_STATUS splits as RLLLL, improve=28.84, (60 missing)
## Surrogate splits:
## HOUSEHOLDER_STATUS splits as -RL, agree=0.674, adj=0.180, (0 split)
## OCCUPATION splits as RLRRR-LL-, agree=0.658, adj=0.140, (0 split)
## EDUCATION splits as RLLRRR, agree=0.648, adj=0.113, (0 split)
## MARITAL_STATUS splits as RRRLL, agree=0.637, adj=0.087, (0 split)
## HOUSEHOLD_UNDER_18 splits as RRRRRRLR--, agree=0.604, adj=0.002, (0 split)
##
## Node number 7: 3200 observations
## predicted class=8 expected loss=0.7403 P(node) =0.3558
## class counts: 84 115 120 222 208 479 497 831 644
## probabilities: 0.026 0.036 0.037 0.069 0.065 0.150 0.155 0.260 0.201
##
## Node number 12: 1569 observations
## predicted class=1 expected loss=0.7744 P(node) =0.1745
## class counts: 354 308 229 187 134 127 86 99 45
## probabilities: 0.226 0.196 0.146 0.119 0.085 0.081 0.055 0.063 0.029
##
## Node number 13: 2381 observations
## predicted class=6 expected loss=0.8198 P(node) =0.2648
## class counts: 148 210 241 343 329 429 308 272 101
## probabilities: 0.062 0.088 0.101 0.144 0.138 0.180 0.129 0.114 0.042
printcp(fit.noconstrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class")
##
## Variables actually used in tree construction:
## [1] AGE HOUSEHOLDER_STATUS OCCUPATION
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.085 0 1.00 1.00 0.0052
## 2 0.028 1 0.92 0.92 0.0057
## 3 0.010 3 0.86 0.86 0.0060
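As a quick visual check of the complexity table (an optional extra step, not in the original output), rpart's plotcp() plots the cross-validated error against cp and draws a reference line one standard error above the minimum:
plotcp(fit.noconstrain)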
The alternative criterion is the pruned tree, specified through rpart.control: the number of cross-validations is 10 and the complexity parameter is 0. The complexity table is listed below, from the smallest tree (no splits) to the largest (559 splits). The xerror column clearly shows a convex shape: it reaches its minimum at 23 splits, where xerror is 0.80560 with a standard error of 0.0062435. Applying the one-standard-error rule of cross-validation, the best tree has 17 splits (18 nodes), with cp equal to 0.0020695.
fit.constrain <- rpart(ANNUAL_INCOME ~ ., data = incomeData, method = "class",
control = rpart.control(xval = 10, cp = 0))
printcp(fit.constrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class",
## control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] AGE DUAL_INCOMES EDUCATION
## [4] ETHNIC HOUSEHOLD_NUMBER HOUSEHOLD_UNDER_18
## [7] HOUSEHOLDER_STATUS LANGUAGE LIVING_TIME
## [10] MARITAL_STATUS OCCUPATION SEX
## [13] TYPE_OF_HOME
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 8.5e-02 0 1.00 1.00 0.0052
## 2 2.8e-02 1 0.92 0.90 0.0058
## 3 5.7e-03 3 0.86 0.86 0.0060
## 4 5.1e-03 5 0.85 0.85 0.0061
## 5 4.6e-03 6 0.84 0.85 0.0061
## 6 4.2e-03 8 0.83 0.84 0.0061
## 7 4.0e-03 11 0.82 0.83 0.0062
## 8 3.1e-03 13 0.81 0.82 0.0062
## 9 3.0e-03 15 0.81 0.82 0.0062
## 10 2.1e-03 17 0.80 0.81 0.0062
## 11 1.9e-03 18 0.80 0.81 0.0062
## 12 1.8e-03 20 0.79 0.81 0.0062
## 13 1.2e-03 21 0.79 0.81 0.0062
## 14 1.1e-03 23 0.79 0.80 0.0062
## 15 1.1e-03 25 0.79 0.80 0.0062
## 16 9.7e-04 28 0.78 0.81 0.0062
## 17 8.3e-04 31 0.78 0.81 0.0062
## 18 6.9e-04 33 0.78 0.81 0.0062
## 19 6.2e-04 41 0.77 0.81 0.0062
## 20 6.0e-04 52 0.77 0.82 0.0062
## 21 5.5e-04 61 0.76 0.82 0.0062
## 22 5.2e-04 72 0.75 0.82 0.0062
## 23 4.8e-04 76 0.75 0.82 0.0062
## 24 4.6e-04 83 0.75 0.82 0.0062
## 25 4.1e-04 111 0.73 0.83 0.0062
## 26 3.7e-04 149 0.72 0.83 0.0062
## 27 3.4e-04 158 0.71 0.83 0.0061
## 28 3.4e-04 180 0.71 0.83 0.0061
## 29 3.2e-04 195 0.70 0.83 0.0061
## 30 3.1e-04 210 0.69 0.83 0.0061
## 31 3.0e-04 217 0.69 0.83 0.0061
## 32 2.8e-04 228 0.69 0.84 0.0061
## 33 2.5e-04 309 0.67 0.84 0.0061
## 34 2.5e-04 316 0.66 0.84 0.0061
## 35 2.4e-04 321 0.66 0.84 0.0061
## 36 2.3e-04 325 0.66 0.84 0.0061
## 37 2.2e-04 343 0.66 0.84 0.0061
## 38 2.2e-04 348 0.66 0.84 0.0061
## 39 2.1e-04 355 0.65 0.84 0.0061
## 40 1.8e-04 404 0.64 0.85 0.0061
## 41 1.4e-04 414 0.64 0.85 0.0061
## 42 1.1e-04 519 0.63 0.85 0.0061
## 43 9.2e-05 524 0.63 0.85 0.0061
## 44 6.9e-05 533 0.63 0.85 0.0061
## 45 4.6e-05 553 0.62 0.86 0.0061
## 46 0.0e+00 559 0.62 0.86 0.0061
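The one-standard-error selection described above can also be read off the fitted object's cptable programmatically. This is a minimal sketch (the variable names are mine, and the row it picks can shift slightly with the cross-validation randomization):
cpt <- fit.constrain$cptable
min.row <- which.min(cpt[, "xerror"])
threshold <- cpt[min.row, "xerror"] + cpt[min.row, "xstd"]
best.row <- min(which(cpt[, "xerror"] <= threshold))
cpt[best.row, c("CP", "nsplit")]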
Use the branch and compress arguments to control the shape of the tree and plot the pruned tree:
fit17 <- prune(fit.constrain, cp = 0.0020695)
par(mar = rep(0.1, 4))
plot(fit17, branch = 0.3, compress = TRUE)
text(fit17)
Overall, the pruned tree tells a similar story to the previous CART model but in more detail: it has more nodes and uses more predictors, such as EDUCATION, ETHNIC and MARITAL_STATUS.
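A quick way to compare the sizes of the two trees is to count the leaf rows in each tree's frame (an extra check, not part of the original output):
sum(fit.noconstrain$frame$var == "<leaf>")
sum(fit17$frame$var == "<leaf>")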
Yes. Surrogate splits are used in the construction of the optimal trees for both models.
Definition of Surrogate Split:
When the data have missing values in some or all of the predictors, we estimate the missing datum using the other independent variables. The constructed variables that are used to fill in (impute) the missing values are called surrogate variables. Surrogate splits exploit correlations between predictors to try to alleviate the effect of missing data.
Example of a Surrogate Split:
We look at the first model from the previous question using the command summary(). Its first split divides the 8993 observations into 1843 (node 2, which is a terminal node) and 7150 (node 3). At node 1 the best split is on OCCUPATION, which has 136 missing values, so surrogate splits are applied for OCCUPATION at this node; AGE is the variable that is split on as the surrogate.
printcp(fit.noconstrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class")
##
## Variables actually used in tree construction:
## [1] AGE HOUSEHOLDER_STATUS OCCUPATION
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.085 0 1.00 1.00 0.0052
## 2 0.028 1 0.92 0.92 0.0057
## 3 0.010 3 0.86 0.86 0.0060
summary(fit.noconstrain)
## Call:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class")
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.08499 0 1.0000 1.0000 0.005174
## 2 0.02842 1 0.9150 0.9193 0.005732
## 3 0.01000 3 0.8582 0.8600 0.006034
##
## Variable importance
## OCCUPATION AGE HOUSEHOLDER_STATUS
## 39 21 19
## EDUCATION MARITAL_STATUS TYPE_OF_HOME
## 7 5 4
## DUAL_INCOMES
## 4
##
## Node number 1: 8993 observations, complexity param=0.08499
## predicted class=1 expected loss=0.806 P(node) =1
## class counts: 1745 775 667 813 722 1110 969 1308 884
## probabilities: 0.194 0.086 0.074 0.090 0.080 0.123 0.108 0.145 0.098
## left son=2 (1843 obs) right son=3 (7150 obs)
## Primary splits:
## OCCUPATION splits as RRRRRLRRL, improve=495.5, (136 missing)
## AGE splits as LRRRRRR, improve=495.5, (0 missing)
## EDUCATION splits as LLRRRR, improve=400.3, (86 missing)
## HOUSEHOLDER_STATUS splits as RRL, improve=396.2, (240 missing)
## MARITAL_STATUS splits as RRRRL, improve=295.3, (160 missing)
## Surrogate splits:
## AGE splits as LRRRRRR, agree=0.859, adj=0.314, (136 split)
## EDUCATION splits as LLRRRR, agree=0.832, adj=0.185, (0 split)
## HOUSEHOLDER_STATUS splits as RRL, agree=0.832, adj=0.185, (0 split)
##
## Node number 2: 1843 observations
## predicted class=1 expected loss=0.3711 P(node) =0.2049
## class counts: 1159 142 77 61 51 75 78 106 94
## probabilities: 0.629 0.077 0.042 0.033 0.028 0.041 0.042 0.058 0.051
##
## Node number 3: 7150 observations, complexity param=0.02842
## predicted class=8 expected loss=0.8319 P(node) =0.7951
## class counts: 586 633 590 752 671 1035 891 1202 790
## probabilities: 0.082 0.089 0.083 0.105 0.094 0.145 0.125 0.168 0.110
## left son=6 (3950 obs) right son=7 (3200 obs)
## Primary splits:
## HOUSEHOLDER_STATUS splits as RLL, improve=159.70, (210 missing)
## MARITAL_STATUS splits as RLLLL, improve=136.70, (109 missing)
## DUAL_INCOMES splits as LRR, improve=125.70, (0 missing)
## AGE splits as LLRRRRR, improve=108.40, (0 missing)
## TYPE_OF_HOME splits as RRLLL, improve= 90.16, (268 missing)
## Surrogate splits:
## AGE splits as LLLRRRR, agree=0.733, adj=0.406, (210 split)
## MARITAL_STATUS splits as RLLRL, agree=0.721, adj=0.377, (0 split)
## TYPE_OF_HOME splits as RRLRL, agree=0.718, adj=0.371, (0 split)
## DUAL_INCOMES splits as LRR, agree=0.703, adj=0.337, (0 split)
## OCCUPATION splits as RLLLR-LR-, agree=0.621, adj=0.155, (0 split)
##
## Node number 6: 3950 observations, complexity param=0.02842
## predicted class=6 expected loss=0.8592 P(node) =0.4392
## class counts: 502 518 470 530 463 556 394 371 146
## probabilities: 0.127 0.131 0.119 0.134 0.117 0.141 0.100 0.094 0.037
## left son=12 (1569 obs) right son=13 (2381 obs)
## Primary splits:
## AGE splits as LLRRRLL, improve=58.66, (0 missing)
## OCCUPATION splits as RLLLL-LL-, improve=49.69, (78 missing)
## EDUCATION splits as LLLRRR, improve=36.17, (36 missing)
## HOUSEHOLDER_STATUS splits as -RL, improve=32.30, (122 missing)
## MARITAL_STATUS splits as RLLLL, improve=28.84, (60 missing)
## Surrogate splits:
## HOUSEHOLDER_STATUS splits as -RL, agree=0.674, adj=0.180, (0 split)
## OCCUPATION splits as RLRRR-LL-, agree=0.658, adj=0.140, (0 split)
## EDUCATION splits as RLLRRR, agree=0.648, adj=0.113, (0 split)
## MARITAL_STATUS splits as RRRLL, agree=0.637, adj=0.087, (0 split)
## HOUSEHOLD_UNDER_18 splits as RRRRRRLR--, agree=0.604, adj=0.002, (0 split)
##
## Node number 7: 3200 observations
## predicted class=8 expected loss=0.7403 P(node) =0.3558
## class counts: 84 115 120 222 208 479 497 831 644
## probabilities: 0.026 0.036 0.037 0.069 0.065 0.150 0.155 0.260 0.201
##
## Node number 12: 1569 observations
## predicted class=1 expected loss=0.7744 P(node) =0.1745
## class counts: 354 308 229 187 134 127 86 99 45
## probabilities: 0.226 0.196 0.146 0.119 0.085 0.081 0.055 0.063 0.029
##
## Node number 13: 2381 observations
## predicted class=6 expected loss=0.8198 P(node) =0.2648
## class counts: 148 210 241 343 329 429 308 272 101
## probabilities: 0.062 0.088 0.101 0.144 0.138 0.180 0.129 0.114 0.042
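To see the surrogate at work (an extra check, not part of the original output), the 136 observations with OCCUPATION missing are still routed through the first split when predicting, because rpart falls back on the surrogate variable AGE:
miss <- is.na(incomeData$OCCUPATION)
sum(miss)
table(predict(fit.noconstrain, incomeData[miss, ], type = "class"))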
The misclassification rate is 1 - (677 + 117 + 97 + 10 + 126 + 101 + 292 + 93)/4496 = 0.6634786, i.e. one minus the sum of the diagonal of the confusion table divided by the test-set size. Below is the confusion table of the model's predictions on the test set:
set.seed(23)
train = sample(1:nrow(incomeData), 4497)
incomeData.test = incomeData[-train, ]
fit.train <- rpart(ANNUAL_INCOME ~ ., data = incomeData, subset = train, method = "class",
control = rpart.control(xval = 10, cp = 0))
printcp(fit.train)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, subset = train,
## method = "class", control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] AGE DUAL_INCOMES EDUCATION
## [4] ETHNIC HOUSEHOLD_NUMBER HOUSEHOLD_UNDER_18
## [7] HOUSEHOLDER_STATUS LANGUAGE LIVING_TIME
## [10] MARITAL_STATUS OCCUPATION SEX
## [13] TYPE_OF_HOME
##
## Root node error: 3625/4497 = 0.81
##
## n= 4497
##
## CP nsplit rel error xerror xstd
## 1 5.3e-02 0 1.00 1.00 0.0073
## 2 5.2e-02 1 0.95 0.98 0.0075
## 3 2.2e-02 2 0.90 0.90 0.0083
## 4 6.1e-03 4 0.85 0.86 0.0085
## 5 5.8e-03 5 0.85 0.86 0.0085
## 6 5.0e-03 6 0.84 0.85 0.0086
## 7 4.1e-03 7 0.83 0.84 0.0086
## 8 3.7e-03 11 0.82 0.83 0.0087
## 9 3.6e-03 15 0.80 0.82 0.0088
## 10 3.0e-03 16 0.80 0.81 0.0088
## 11 2.9e-03 17 0.80 0.81 0.0088
## 12 2.5e-03 19 0.79 0.81 0.0088
## 13 1.9e-03 20 0.79 0.81 0.0088
## 14 1.7e-03 23 0.78 0.81 0.0088
## 15 1.4e-03 25 0.78 0.82 0.0088
## 16 1.2e-03 30 0.77 0.82 0.0087
## 17 1.1e-03 35 0.77 0.82 0.0087
## 18 1.1e-03 47 0.75 0.83 0.0087
## 19 9.7e-04 53 0.75 0.83 0.0087
## 20 9.2e-04 57 0.74 0.83 0.0087
## 21 8.3e-04 60 0.74 0.84 0.0087
## 22 7.5e-04 106 0.70 0.84 0.0087
## 23 6.9e-04 114 0.69 0.83 0.0087
## 24 6.4e-04 120 0.69 0.84 0.0087
## 25 6.2e-04 123 0.69 0.84 0.0087
## 26 5.5e-04 127 0.68 0.84 0.0087
## 27 4.8e-04 168 0.66 0.84 0.0086
## 28 4.6e-04 173 0.66 0.84 0.0086
## 29 4.1e-04 176 0.66 0.84 0.0086
## 30 3.7e-04 209 0.64 0.85 0.0086
## 31 2.8e-04 215 0.64 0.85 0.0086
## 32 1.6e-04 247 0.63 0.85 0.0086
## 33 1.4e-04 256 0.63 0.85 0.0086
## 34 9.2e-05 266 0.63 0.85 0.0086
## 35 6.9e-05 275 0.63 0.85 0.0086
## 36 0.0e+00 279 0.63 0.85 0.0086
# Prune with a cp value between two adjacent rows of the cp table (their geometric mean)
cptarg = sqrt(fit.train$cptable[14, 1] * fit.train$cptable[15, 1])
prunedtree = prune(fit.train, cp = cptarg)
# Visualize the tree
par(mfrow = c(1, 1), mar = c(1, 1, 1, 1))
plot(prunedtree, uniform = T, compress = T, margin = 0.1, branch = 0.3)
text(prunedtree, use.n = T, digits = 3, cex = 0.6)
fit.preds = predict(prunedtree, newdata = incomeData.test, type = "class")
fit.table = table(incomeData.test$ANNUAL_INCOME, fit.preds)
fit.table
## fit.preds
## 1 2 3 4 5 6 7 8 9
## 1 677 103 0 45 4 16 19 9 0
## 2 134 117 0 50 8 37 23 11 0
## 3 85 86 0 66 6 53 30 13 1
## 4 49 79 0 97 9 88 43 42 1
## 5 43 59 0 64 10 94 55 43 0
## 6 59 60 0 52 24 126 105 112 7
## 7 55 39 0 20 15 82 101 156 16
## 8 65 32 0 21 9 60 106 292 47
## 9 61 17 0 6 4 37 45 203 93
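The misclassification rate quoted above can be recovered directly from this confusion table:
1 - sum(diag(fit.table))/sum(fit.table)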
Running the models below shows that CART is strongly influenced by the spurious predictors: when the tree is grown without a complexity constraint, it pulls many of the spurious variables into the model, i.e. it over-models. However, the size of the optimal model does not change much after adding the spurious variables. Below are the steps of my debrief:
Read the Income_Big data into R:
incomeBig = read.table("/Users/shijiabian/Dropbox/STATS315B/Data/Income_Big.txt",
sep = ",", col.names = c("col1", "col2", "col3", "col4", "col5", "col6",
"col7", "col8", "col9", "col10", "col11", "col12", "col13", "col14",
"col15", "col16", "col17", "col18", "col19", "col20", "col21", "col22",
"col23", "col24"), fill = FALSE, strip.white = TRUE)
dim(incomeBig)
## [1] 8993 24
# This step is redundant as long as I set method = 'class'
for (i in 1:24) {
incomeBig[, i] = as.factor(incomeBig[, i])
}
summary(incomeBig)
## col1 col2 col3 col4 col5 col6
## 1 :1745 1:4075 1 :3334 1: 878 1 : 264 1 :2820
## 8 :1308 2:4918 2 : 668 2:2129 2 :1046 6 :1489
## 6 :1110 3 : 875 3:2249 3 :2041 4 :1062
## 7 : 969 4 : 302 4:1615 4 :3066 2 : 770
## 9 : 884 5 :3654 5: 922 5 :1524 3 : 767
## 4 : 813 NA's: 160 6: 640 6 : 966 (Other):1949
## (Other):2164 7: 560 NA's: 86 NA's : 136
## col7 col8 col9 col10 col11
## 1 : 270 1:5438 2 :2664 0 :5724 1 :3256
## 2 :1042 2:2211 3 :1670 1 :1506 2 :3670
## 3 : 686 3:1344 1 :1620 2 :1148 3 :1827
## 4 : 900 4 :1526 3 : 412 NA's: 240
## 5 :5182 5 : 686 4 : 117
## NA's: 913 (Other): 452 5 : 46
## NA's : 375 (Other): 40
## col12 col13 col14 col15 col16
## 1 :5073 7 :5811 1 :7794 5 :1024 5 :1054
## 2 : 655 5 :1231 2 : 579 2 :1023 1 :1048
## 3 :2373 3 : 910 3 : 261 3 :1009 4 :1018
## 4 : 151 2 : 477 NA's: 359 8 :1009 8 :1005
## 5 : 384 8 : 225 6 :1001 6 :1002
## NA's: 357 (Other): 271 7 : 992 7 : 995
## NA's : 68 (Other):2935 (Other):2871
## col17 col18 col19 col20
## 6 :1038 2 :1039 5 :1043 7 :1045
## 4 :1017 7 :1032 3 :1026 1 :1040
## 7 :1009 1 :1031 1 :1002 5 :1019
## 3 : 996 8 :1005 9 :1001 3 :1006
## 9 : 994 3 :1003 7 :1000 6 :1004
## 8 : 991 5 :1003 6 : 998 2 : 982
## (Other):2948 (Other):2880 (Other):2923 (Other):2897
## col21 col22 col23 col24
## 5 :1034 1 :1061 4 :1035 9 :1040
## 7 :1032 8 :1048 5 :1027 5 :1029
## 4 :1019 6 :1040 3 :1023 8 :1014
## 1 :1016 5 :1019 9 :1023 4 :1012
## 8 :1015 7 : 975 2 : 992 1 : 993
## 3 : 993 3 : 970 6 : 989 6 : 987
## (Other):2884 (Other):2880 (Other):2904 (Other):2918
First model: construct the complete tree for Income_Big with the default loss and prior, and print its complexity table alongside the one for the Income data. Based on the one-standard-error rule of cross-validation, we choose 3 splits (4 nodes) for both data sets. The variables actually used for Income_Big are col11, col4 and col6; since the first 14 columns of Income_Big match the Income_Data variables in the same order (compare the summaries above), these are HOUSEHOLDER_STATUS, AGE and OCCUPATION, the same predictors chosen for the Income data, so the spurious variables do not enter at this tree size.
fit.big.noconstrain <- rpart(col1 ~ ., data = incomeBig, method = "class")
printcp(fit.big.noconstrain)
##
## Classification tree:
## rpart(formula = col1 ~ ., data = incomeBig, method = "class")
##
## Variables actually used in tree construction:
## [1] col11 col4 col6
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.085 0 1.00 1.00 0.0052
## 2 0.028 1 0.92 0.91 0.0058
## 3 0.010 3 0.86 0.86 0.0060
printcp(fit.noconstrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class")
##
## Variables actually used in tree construction:
## [1] AGE HOUSEHOLDER_STATUS OCCUPATION
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.085 0 1.00 1.00 0.0052
## 2 0.028 1 0.92 0.92 0.0057
## 3 0.010 3 0.86 0.86 0.0060
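Since the marginal summaries of col1 through col14 of Income_Big match the Income data column for column, the colN labels in the tree output can be translated back to the original variable names. This is a small sketch under that assumption:
# Assumes the first 14 columns of Income_Big are the Income_Data variables, in order
used.cols <- c("col11", "col4", "col6")
names(incomeData)[as.integer(sub("col", "", used.cols))]
# -> HOUSEHOLDER_STATUS, AGE, OCCUPATION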
Second model: construct the pruned tree for Income_Big by setting xval and cp as before, and print the complexity table for this new CART model together with the table for the same construction on the Income data. About half of the variables actually used in the model fitted to Income_Big (10 of the 21, namely col15 through col24) are spurious. So, unlike the small complete tree above, the fully grown tree does bring the random predictors into the model, which is where the over-modeling shows up.
fit.big.constrain <- rpart(col1 ~ ., data = incomeBig, method = "class", control = rpart.control(xval = 10,
cp = 0))
printcp(fit.big.constrain)
##
## Classification tree:
## rpart(formula = col1 ~ ., data = incomeBig, method = "class",
## control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] col10 col11 col12 col13 col15 col16 col17 col18 col19 col2 col20
## [12] col21 col22 col23 col24 col3 col4 col5 col6 col7 col9
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 8.5e-02 0 1.00 1.00 0.0052
## 2 2.8e-02 1 0.92 0.93 0.0057
## 3 5.7e-03 3 0.86 0.86 0.0060
## 4 5.1e-03 5 0.85 0.85 0.0061
## 5 4.6e-03 6 0.84 0.85 0.0061
## 6 4.2e-03 8 0.83 0.85 0.0061
## 7 4.0e-03 11 0.82 0.83 0.0062
## 8 3.1e-03 13 0.81 0.82 0.0062
## 9 3.0e-03 15 0.81 0.82 0.0062
## 10 2.1e-03 17 0.80 0.82 0.0062
## 11 1.9e-03 18 0.80 0.81 0.0062
## 12 1.8e-03 20 0.79 0.81 0.0062
## 13 1.7e-03 21 0.79 0.81 0.0062
## 14 1.4e-03 24 0.79 0.82 0.0062
## 15 1.3e-03 25 0.79 0.82 0.0062
## 16 1.2e-03 29 0.78 0.82 0.0062
## 17 1.1e-03 31 0.78 0.82 0.0062
## 18 1.0e-03 51 0.75 0.83 0.0062
## 19 9.7e-04 54 0.75 0.83 0.0062
## 20 9.2e-04 64 0.74 0.83 0.0062
## 21 9.0e-04 67 0.74 0.83 0.0061
## 22 8.6e-04 76 0.73 0.83 0.0061
## 23 8.3e-04 81 0.73 0.84 0.0061
## 24 7.8e-04 98 0.71 0.85 0.0061
## 25 7.6e-04 101 0.71 0.85 0.0061
## 26 7.2e-04 107 0.71 0.86 0.0061
## 27 7.2e-04 122 0.69 0.86 0.0061
## 28 6.9e-04 127 0.69 0.86 0.0060
## 29 6.6e-04 151 0.67 0.86 0.0061
## 30 6.4e-04 157 0.67 0.86 0.0060
## 31 6.3e-04 160 0.67 0.86 0.0060
## 32 6.2e-04 165 0.66 0.86 0.0060
## 33 5.9e-04 171 0.66 0.86 0.0061
## 34 5.5e-04 175 0.66 0.86 0.0061
## 35 5.2e-04 224 0.63 0.86 0.0060
## 36 5.1e-04 230 0.63 0.86 0.0060
## 37 4.8e-04 238 0.62 0.86 0.0060
## 38 4.6e-04 262 0.61 0.86 0.0060
## 39 4.1e-04 280 0.60 0.87 0.0060
## 40 3.7e-04 396 0.55 0.87 0.0060
## 41 3.4e-04 402 0.55 0.88 0.0060
## 42 3.3e-04 435 0.54 0.88 0.0060
## 43 2.8e-04 440 0.53 0.88 0.0059
## 44 2.3e-04 504 0.52 0.88 0.0059
## 45 2.1e-04 510 0.51 0.89 0.0059
## 46 1.8e-04 529 0.51 0.89 0.0059
## 47 1.4e-04 533 0.51 0.89 0.0059
## 48 9.2e-05 568 0.50 0.89 0.0059
## 49 0.0e+00 571 0.50 0.89 0.0059
printcp(fit.constrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class",
## control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] AGE DUAL_INCOMES EDUCATION
## [4] ETHNIC HOUSEHOLD_NUMBER HOUSEHOLD_UNDER_18
## [7] HOUSEHOLDER_STATUS LANGUAGE LIVING_TIME
## [10] MARITAL_STATUS OCCUPATION SEX
## [13] TYPE_OF_HOME
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 8.5e-02 0 1.00 1.00 0.0052
## 2 2.8e-02 1 0.92 0.90 0.0058
## 3 5.7e-03 3 0.86 0.86 0.0060
## 4 5.1e-03 5 0.85 0.85 0.0061
## 5 4.6e-03 6 0.84 0.85 0.0061
## 6 4.2e-03 8 0.83 0.84 0.0061
## 7 4.0e-03 11 0.82 0.83 0.0062
## 8 3.1e-03 13 0.81 0.82 0.0062
## 9 3.0e-03 15 0.81 0.82 0.0062
## 10 2.1e-03 17 0.80 0.81 0.0062
## 11 1.9e-03 18 0.80 0.81 0.0062
## 12 1.8e-03 20 0.79 0.81 0.0062
## 13 1.2e-03 21 0.79 0.81 0.0062
## 14 1.1e-03 23 0.79 0.80 0.0062
## 15 1.1e-03 25 0.79 0.80 0.0062
## 16 9.7e-04 28 0.78 0.81 0.0062
## 17 8.3e-04 31 0.78 0.81 0.0062
## 18 6.9e-04 33 0.78 0.81 0.0062
## 19 6.2e-04 41 0.77 0.81 0.0062
## 20 6.0e-04 52 0.77 0.82 0.0062
## 21 5.5e-04 61 0.76 0.82 0.0062
## 22 5.2e-04 72 0.75 0.82 0.0062
## 23 4.8e-04 76 0.75 0.82 0.0062
## 24 4.6e-04 83 0.75 0.82 0.0062
## 25 4.1e-04 111 0.73 0.83 0.0062
## 26 3.7e-04 149 0.72 0.83 0.0062
## 27 3.4e-04 158 0.71 0.83 0.0061
## 28 3.4e-04 180 0.71 0.83 0.0061
## 29 3.2e-04 195 0.70 0.83 0.0061
## 30 3.1e-04 210 0.69 0.83 0.0061
## 31 3.0e-04 217 0.69 0.83 0.0061
## 32 2.8e-04 228 0.69 0.84 0.0061
## 33 2.5e-04 309 0.67 0.84 0.0061
## 34 2.5e-04 316 0.66 0.84 0.0061
## 35 2.4e-04 321 0.66 0.84 0.0061
## 36 2.3e-04 325 0.66 0.84 0.0061
## 37 2.2e-04 343 0.66 0.84 0.0061
## 38 2.2e-04 348 0.66 0.84 0.0061
## 39 2.1e-04 355 0.65 0.84 0.0061
## 40 1.8e-04 404 0.64 0.85 0.0061
## 41 1.4e-04 414 0.64 0.85 0.0061
## 42 1.1e-04 519 0.63 0.85 0.0061
## 43 9.2e-05 524 0.63 0.85 0.0061
## 44 6.9e-05 533 0.63 0.85 0.0061
## 45 4.6e-05 553 0.62 0.86 0.0061
## 46 0.0e+00 559 0.62 0.86 0.0061
For the Income_Big data set, the smallest xerror in the complexity table is 0.80726 with standard deviation 0.0062380. According to the one-standard-error rule of cross-validation, the model with 17 splits (18 nodes) is the optimal choice; its cp is 0.0020695.
For the Income_Data data set, the smallest xerror in the complexity table is 0.80574 with standard deviation 0.0062431. According to the one-standard-error rule of cross-validation, the model with 18 splits (19 nodes) is the optimal choice; its cp is 0.0017936.
Thus, the size of the optimal model does not change much after adding the unrelated predictors. The Income_Big tree is slightly smaller, but the overall shapes of the two trees are very similar. Here are the plots of the two trees:
fit17 <- prune(fit.big.constrain, cp = 0.0020695)
par(mar = rep(0.1, 4))
plot(fit17, branch = 0.3, compress = TRUE)
text(fit17)
fit18 <- prune(fit.constrain, cp = 0.0017936)
par(mar = rep(0.1, 4))
plot(fit18, branch = 0.3, compress = TRUE)
text(fit18)
I analyzed two newly generated data sets and plotted the complete tree and the pruned tree for each. The two data sets give different models, but for each of them the complete tree is much simpler than the pruned tree selected by the complexity parameter and the one-standard-error rule. The complete tree for each resampled data set is very similar to the model obtained in Q1, while the pruned tree is much more complex. This is probably due to the repeated information introduced by sampling with replacement: copies of the same observation end up in both the training and validation folds of the cross-validation, which makes large trees look better than they would on fresh data, so the selected pruned tree overfits. Below is the debrief:
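A quick sanity check on the duplication mentioned above (standalone, and placed before the set.seed() calls below so it does not affect their results): a bootstrap resample of size n keeps only about 1 - (1 - 1/n)^n, roughly 63.2%, distinct rows, so about a third of each resampled data set consists of repeated observations.
n <- nrow(incomeData)
1 - (1 - 1/n)^n
length(unique(sample(1:n, n, replace = TRUE)))/n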
Below is the first data set, generated with seed 4363. I fit both the complete tree and the pruned tree. The complete tree has 3 splits (4 nodes); the predictors actually used are AGE, HOUSEHOLDER_STATUS and OCCUPATION. The optimal pruned tree is the one with 251 splits (252 nodes); 13 predictors are actually used and its cp is 0.00045442. This tree has too many splits to display. The complete tree is very similar to the tree we obtained in question 1.
# The first generated model for the Income_Data with replacement:
set.seed(4363)
sampler = sample(1:8993, 8993, replace = TRUE)
new.incomeData1 = incomeData[sampler, ]
# Complete tree
fit.new.nonconstrain <- rpart(ANNUAL_INCOME ~ ., data = new.incomeData1, method = "class")
printcp(fit.new.nonconstrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = new.incomeData1, method = "class")
##
## Variables actually used in tree construction:
## [1] AGE HOUSEHOLDER_STATUS OCCUPATION
##
## Root node error: 7152/8993 = 0.8
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.089 0 1.00 1.00 0.0054
## 2 0.027 1 0.91 0.91 0.0059
## 3 0.010 3 0.86 0.86 0.0062
# Prunned tree
fit.new.constrain <- rpart(ANNUAL_INCOME ~ ., data = new.incomeData1, method = "class",
control = rpart.control(xval = 10, cp = 0))
printcp(fit.new.constrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = new.incomeData1, method = "class",
## control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] AGE DUAL_INCOMES EDUCATION
## [4] ETHNIC HOUSEHOLD_NUMBER HOUSEHOLD_UNDER_18
## [7] HOUSEHOLDER_STATUS LANGUAGE LIVING_TIME
## [10] MARITAL_STATUS OCCUPATION SEX
## [13] TYPE_OF_HOME
##
## Root node error: 7152/8993 = 0.8
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 8.9e-02 0 1.00 1.00 0.0054
## 2 2.7e-02 1 0.91 0.91 0.0059
## 3 7.4e-03 3 0.86 0.86 0.0062
## 4 6.7e-03 4 0.85 0.85 0.0062
## 5 6.5e-03 8 0.82 0.84 0.0062
## 6 5.3e-03 11 0.80 0.82 0.0063
## 7 4.8e-03 12 0.80 0.81 0.0064
## 8 3.8e-03 13 0.79 0.80 0.0064
## 9 2.1e-03 14 0.79 0.79 0.0064
## 10 2.0e-03 15 0.79 0.80 0.0064
## 11 1.8e-03 21 0.78 0.79 0.0064
## 12 1.7e-03 23 0.77 0.79 0.0064
## 13 1.4e-03 24 0.77 0.78 0.0064
## 14 1.3e-03 26 0.77 0.79 0.0064
## 15 1.2e-03 27 0.77 0.78 0.0064
## 16 1.1e-03 30 0.76 0.78 0.0064
## 17 1.0e-03 32 0.76 0.78 0.0064
## 18 1.0e-03 39 0.75 0.78 0.0064
## 19 9.8e-04 44 0.75 0.78 0.0064
## 20 9.1e-04 50 0.74 0.78 0.0064
## 21 8.7e-04 59 0.73 0.78 0.0064
## 22 8.4e-04 65 0.73 0.77 0.0065
## 23 7.0e-04 77 0.72 0.76 0.0065
## 24 6.5e-04 86 0.71 0.76 0.0065
## 25 6.3e-04 95 0.70 0.76 0.0065
## 26 6.2e-04 99 0.70 0.76 0.0065
## 27 6.1e-04 115 0.69 0.76 0.0065
## 28 5.9e-04 127 0.68 0.76 0.0065
## 29 5.9e-04 139 0.68 0.76 0.0065
## 30 5.9e-04 149 0.67 0.75 0.0065
## 31 5.6e-04 155 0.67 0.75 0.0065
## 32 5.4e-04 204 0.64 0.75 0.0065
## 33 5.2e-04 216 0.63 0.75 0.0065
## 34 5.1e-04 220 0.63 0.75 0.0065
## 35 4.9e-04 223 0.63 0.75 0.0065
## 36 4.7e-04 248 0.62 0.75 0.0065
## 37 4.5e-04 251 0.61 0.74 0.0065
## 38 4.2e-04 255 0.61 0.74 0.0065
## 39 4.0e-04 295 0.59 0.74 0.0065
## 40 3.7e-04 312 0.59 0.73 0.0065
## 41 3.5e-04 322 0.58 0.73 0.0065
## 42 3.4e-04 334 0.58 0.73 0.0065
## 43 3.3e-04 340 0.57 0.73 0.0065
## 44 3.1e-04 349 0.57 0.73 0.0065
## 45 3.1e-04 353 0.57 0.73 0.0065
## 46 2.8e-04 358 0.57 0.72 0.0066
## 47 2.6e-04 436 0.55 0.72 0.0066
## 48 2.4e-04 446 0.54 0.72 0.0066
## 49 2.3e-04 454 0.54 0.72 0.0066
## 50 2.1e-04 459 0.54 0.72 0.0066
## 51 2.0e-04 476 0.54 0.72 0.0066
## 52 1.9e-04 481 0.53 0.72 0.0066
## 53 1.6e-04 493 0.53 0.72 0.0066
## 54 1.4e-04 499 0.53 0.72 0.0066
## 55 7.0e-05 542 0.53 0.72 0.0066
## 56 4.7e-05 552 0.52 0.72 0.0066
## 57 0.0e+00 555 0.52 0.72 0.0066
I also plot the complete tree:
plot(fit.new.nonconstrain)
text(fit.new.nonconstrain, use.n = T)
Below is the second data set, generated with seed 12345. I fit both the complete tree and the pruned tree. The complete tree has 4 splits (5 nodes); the predictors actually used are again AGE, HOUSEHOLDER_STATUS and OCCUPATION. The optimal pruned tree is the one with 377 splits (378 nodes); 13 predictors are actually used and its cp is 0.00027720. This tree has too many splits to display. The complete tree is very similar to the tree from question 1 and to the previous tree from the first resampled data set.
# The second generated model for the Income_Data with replacement:
set.seed(12345)
sampler.2 = sample(1:8993, 8993, replace = TRUE)
new.incomeData2 = incomeData[sampler.2, ]
# Complete tree
fit.new.nonconstrain.2 <- rpart(ANNUAL_INCOME ~ ., data = new.incomeData2, method = "class")
printcp(fit.new.nonconstrain.2)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = new.incomeData2, method = "class")
##
## Variables actually used in tree construction:
## [1] AGE HOUSEHOLDER_STATUS OCCUPATION
##
## Root node error: 7215/8993 = 0.8
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.049 0 1.00 1.00 0.0052
## 2 0.017 2 0.90 0.90 0.0059
## 3 0.010 4 0.87 0.87 0.0060
# Prunned tree
fit.new.constrain.2 <- rpart(ANNUAL_INCOME ~ ., data = new.incomeData2, method = "class",
control = rpart.control(xval = 10, cp = 0))
printcp(fit.new.constrain.2)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = new.incomeData2, method = "class",
## control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] AGE DUAL_INCOMES EDUCATION
## [4] ETHNIC HOUSEHOLD_NUMBER HOUSEHOLD_UNDER_18
## [7] HOUSEHOLDER_STATUS LANGUAGE LIVING_TIME
## [10] MARITAL_STATUS OCCUPATION SEX
## [13] TYPE_OF_HOME
##
## Root node error: 7215/8993 = 0.8
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 4.9e-02 0 1.00 1.00 0.0052
## 2 1.7e-02 2 0.90 0.90 0.0059
## 3 7.5e-03 4 0.87 0.87 0.0060
## 4 6.1e-03 6 0.85 0.86 0.0061
## 5 5.1e-03 7 0.85 0.85 0.0061
## 6 4.9e-03 9 0.84 0.85 0.0061
## 7 3.5e-03 10 0.83 0.84 0.0062
## 8 3.3e-03 12 0.82 0.84 0.0062
## 9 3.0e-03 14 0.82 0.84 0.0062
## 10 2.9e-03 16 0.81 0.84 0.0062
## 11 2.6e-03 17 0.81 0.84 0.0062
## 12 2.5e-03 19 0.80 0.83 0.0062
## 13 2.2e-03 21 0.80 0.83 0.0062
## 14 1.9e-03 23 0.79 0.82 0.0062
## 15 1.8e-03 24 0.79 0.82 0.0062
## 16 1.7e-03 25 0.79 0.82 0.0062
## 17 1.4e-03 27 0.79 0.81 0.0063
## 18 1.3e-03 31 0.78 0.81 0.0063
## 19 1.3e-03 35 0.77 0.81 0.0063
## 20 1.2e-03 41 0.77 0.80 0.0063
## 21 1.2e-03 43 0.76 0.80 0.0063
## 22 1.2e-03 47 0.76 0.80 0.0063
## 23 1.1e-03 50 0.76 0.80 0.0063
## 24 1.1e-03 59 0.75 0.80 0.0063
## 25 9.7e-04 62 0.74 0.79 0.0063
## 26 9.4e-04 71 0.73 0.79 0.0063
## 27 9.2e-04 80 0.73 0.79 0.0063
## 28 9.0e-04 83 0.72 0.78 0.0064
## 29 8.3e-04 87 0.72 0.78 0.0064
## 30 7.6e-04 107 0.70 0.78 0.0064
## 31 7.4e-04 111 0.70 0.77 0.0064
## 32 6.9e-04 115 0.70 0.77 0.0064
## 33 6.5e-04 137 0.68 0.76 0.0064
## 34 6.4e-04 140 0.68 0.76 0.0064
## 35 6.2e-04 149 0.67 0.76 0.0064
## 36 6.0e-04 156 0.67 0.75 0.0064
## 37 5.8e-04 159 0.66 0.75 0.0064
## 38 5.7e-04 165 0.66 0.75 0.0064
## 39 5.5e-04 175 0.65 0.75 0.0064
## 40 5.1e-04 207 0.64 0.74 0.0065
## 41 4.9e-04 211 0.63 0.74 0.0065
## 42 4.6e-04 219 0.63 0.74 0.0065
## 43 4.5e-04 226 0.63 0.73 0.0065
## 44 4.2e-04 230 0.63 0.73 0.0065
## 45 3.9e-04 295 0.60 0.73 0.0065
## 46 3.8e-04 301 0.60 0.72 0.0065
## 47 3.7e-04 309 0.59 0.72 0.0065
## 48 3.5e-04 316 0.59 0.72 0.0065
## 49 3.2e-04 355 0.57 0.72 0.0065
## 50 3.1e-04 373 0.57 0.72 0.0065
## 51 2.8e-04 377 0.57 0.71 0.0065
## 52 2.5e-04 430 0.55 0.71 0.0065
## 53 2.3e-04 441 0.55 0.71 0.0065
## 54 2.2e-04 452 0.54 0.71 0.0065
## 55 2.1e-04 459 0.54 0.71 0.0065
## 56 1.8e-04 476 0.54 0.71 0.0065
## 57 1.4e-04 482 0.54 0.71 0.0065
## 58 6.9e-05 536 0.53 0.71 0.0065
## 59 5.5e-05 548 0.53 0.71 0.0065
## 60 4.6e-05 553 0.53 0.71 0.0065
## 61 0.0e+00 565 0.53 0.71 0.0065
I also plot the complete tree:
plot(fit.new.nonconstrain.2)
text(fit.new.nonconstrain.2, use.n = T)
The misclassification rate is 1 - (2349 + 970)/4507 = 0.26359, i.e. one minus the sum of the diagonal of the confusion table divided by the test-set size. The debrief is below:
houseTypeData = read.table("/Users/shijiabian/Dropbox/STATS315B/Data/Housetype_Data.txt",
sep = ",", col.names = c("typeHome", "sex", "maritalStatus", "age", "education",
"occupation", "anualIncome", "lengthLiving", "dualIncome", "personHouse",
"youngPersonHous", "houseStatus", "ethnic", "language"), fill = FALSE,
strip.white = TRUE)
for (i in 1:14) {
houseTypeData[, i] = as.factor(houseTypeData[, i])
}
summary(houseTypeData)
## typeHome sex maritalStatus age education occupation
## 1:5319 1:4043 1 :3351 1: 870 1 : 279 1 :2798
## 2: 674 2:4970 2 : 667 2:2151 2 :1050 6 :1475
## 3:2454 3 : 856 3:2244 3 :2052 4 :1049
## 4: 162 4 : 312 4:1608 4 :3036 2 : 786
## 5: 404 5 :3669 5: 909 5 :1525 3 : 755
## NA's: 158 6: 651 6 : 978 (Other):1994
## 7: 580 NA's: 93 NA's : 156
## anualIncome lengthLiving dualIncome personHouse youngPersonHous
## 1 :1650 1 : 276 1:5439 2 :2670 0 :5695
## 8 :1274 2 :1034 2:2217 3 :1685 1 :1523
## 6 :1074 3 : 681 3:1357 1 :1600 2 :1163
## 7 : 943 4 : 890 4 :1552 3 : 423
## 9 : 862 5 :5190 5 : 696 4 : 122
## (Other):2833 NA's: 942 (Other): 479 5 : 47
## NA's : 377 NA's : 331 (Other): 40
## houseStatus ethnic language
## 1 :3304 7 :5845 1 :7803
## 2 :3664 5 :1235 2 : 591
## 3 :1851 3 : 893 3 : 260
## NA's: 194 2 : 471 NA's: 359
## 8 : 238
## (Other): 270
## NA's : 61
set.seed(2)
train = sample(1:nrow(houseTypeData), 4506)
houseTypeData.test = houseTypeData[-train, ]
fit.train <- rpart(typeHome ~ ., data = houseTypeData, subset = train, method = "class",
control = rpart.control(xval = 10, cp = 0))
printcp(fit.train)
##
## Classification tree:
## rpart(formula = typeHome ~ ., data = houseTypeData, subset = train,
## method = "class", control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] age anualIncome dualIncome education
## [5] ethnic houseStatus lengthLiving maritalStatus
## [9] occupation personHouse sex youngPersonHous
##
## Root node error: 1848/4506 = 0.41
##
## n= 4506
##
## CP nsplit rel error xerror xstd
## 1 0.32630 0 1.00 1.00 0.018
## 2 0.02868 1 0.67 0.67 0.016
## 3 0.01569 2 0.65 0.66 0.016
## 4 0.00595 3 0.63 0.64 0.016
## 5 0.00244 4 0.62 0.63 0.016
## 6 0.00216 7 0.61 0.63 0.016
## 7 0.00186 9 0.61 0.63 0.016
## 8 0.00162 17 0.59 0.62 0.016
## 9 0.00135 21 0.59 0.62 0.016
## 10 0.00108 23 0.58 0.62 0.016
## 11 0.00090 24 0.58 0.64 0.016
## 12 0.00087 27 0.58 0.64 0.016
## 13 0.00081 32 0.58 0.65 0.016
## 14 0.00079 34 0.58 0.65 0.016
## 15 0.00072 47 0.56 0.66 0.016
## 16 0.00054 50 0.56 0.66 0.016
## 17 0.00047 66 0.55 0.69 0.016
## 18 0.00036 93 0.54 0.69 0.016
## 19 0.00032 96 0.54 0.69 0.016
## 20 0.00027 104 0.53 0.70 0.016
## 21 0.00014 108 0.53 0.71 0.017
## 22 0.00011 120 0.53 0.71 0.017
## 23 0.00000 125 0.53 0.72 0.017
# Prune with a cp value between two adjacent rows of the cp table (their geometric mean)
cptarg = sqrt(fit.train$cptable[2, 1] * fit.train$cptable[3, 1])
prunedtree = prune(fit.train, cp = cptarg)
# Visualize the tree
par(mfrow = c(1, 1), mar = c(1, 1, 1, 1))
plot(prunedtree, uniform = T, compress = T, margin = 0.1, branch = 0.3)
text(prunedtree, use.n = T, digits = 3, cex = 0.6)
fit.preds = predict(prunedtree, newdata = houseTypeData.test, type = "class")
fit.table = table(houseTypeData.test$typeHome, fit.preds)
fit.table
## fit.preds
## 1 2 3 4 5
## 1 2349 0 312 0 0
## 2 227 0 116 0 0
## 3 256 0 970 0 0
## 4 64 0 15 0 0
## 5 104 0 94 0 0
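As with the income model, the misclassification rate quoted above follows directly from this table; the per-class accuracies also show that this small tree only ever predicts home types 1 and 3:
1 - sum(diag(fit.table))/sum(fit.table)
diag(fit.table)/rowSums(fit.table)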
This question is attached on a separate piece of paper.
Discuss the relative advantages/disadvantages of this strategy relative to the surrogate splitting strategy?
Advantage: if there is minimal missing data or a sufficiently large sample size, and there is no structure or pattern to the missingness (i.e. the values are missing completely at random), this strategy is fast and efficient to compute.
Disadvantage: the impact of this strategy depends on how the missing values are divided among the real categories and on how the probability of a value being missing depends on the other variables; very dissimilar classes can be lumped into one group; severe bias can arise, in either direction, and when the completed categorical variable is used to stratify for adjustment (or to correct for confounding) it will not do its job properly.
Does this strategy allow a “surrogate effect” by encouraging variables within highly correlated sets to substitute for each other in the prediction process? If so, how (by what mechanism)?
Yes. This is realized through the surrogate splitting mechanism during the greedy algorithm.
Can this new strategy be used if there are no missing values in the training set? If not, what can be done so as to enable the tree to predict with missing values in future data?
No. One can randomly select some observations of each predictor and set them to missing (following the MCAR criterion), then treat these missing values as additional categories and fit the CART model on this modified training set so that it can predict future data with missing values. However, this requires that the missing values in the test set be independent of the response and of the other predictors.
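A minimal sketch of this procedure, with arbitrary choices of my own (5% of each predictor set to missing, the new level named "Missing"); it is only meant to show the mechanics, not a tuned analysis:
set.seed(1)
incomeData.mcar <- incomeData
for (j in 2:ncol(incomeData.mcar)) {
    # MCAR: drop 5% of the values of predictor j at random
    drop <- sample(nrow(incomeData.mcar), round(0.05 * nrow(incomeData.mcar)))
    incomeData.mcar[drop, j] <- NA
    # turn NA into an explicit factor level called 'Missing'
    incomeData.mcar[, j] <- addNA(incomeData.mcar[, j])
    levels(incomeData.mcar[, j])[is.na(levels(incomeData.mcar[, j]))] <- "Missing"
}
fit.mcar <- rpart(ANNUAL_INCOME ~ ., data = incomeData.mcar, method = "class")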