The criteria for the two CART models used here were: 1) the complete tree, built with the default CART settings in rpart together with the altered prior. The altered prior can affect the splits through the improvement criterion \( m^* = \arg\max \left( \frac{N_m^{l} N_m^{r}}{N_m} \right) \left( \bar{y}_m^{l} - \bar{y}_m^{r} \right)^2 \). 2) the pruned tree: in addition to the altered prior, adjust the settings in rpart.control; I set cp (the complexity parameter) to 0 and xval (the number of cross-validations) to 10. The two criteria yield different decision trees, but both procedures split the data into new subgroups recursively until the subgroups either reach a minimum size or no further improvement can be made. The optimal CART has 3 splits for the complete tree and 17 splits for the pruned tree, with complexity parameters 0.010000 and 0.0020695 respectively. The advantage of the second criterion is that it fits the data more accurately; the disadvantage is that it uses several of the random predictors in doing so and seriously overfits the data. The debrief is below:
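Before starting the debrief, to make the improvement criterion above concrete, here is a small illustrative R sketch (not part of the fits that follow; the function name and toy values are my own) that computes \( (N^{l} N^{r} / N)(\bar{y}^{l} - \bar{y}^{r})^2 \) for one candidate binary partition of a numeric response:
# Illustrative only: improvement of a candidate binary split of a numeric y
split.improve <- function(y, left) {
    n.l <- sum(left)
    n.r <- sum(!left)
    (n.l * n.r/(n.l + n.r)) * (mean(y[left]) - mean(y[!left]))^2
}
split.improve(c(1, 2, 8, 9), c(TRUE, TRUE, FALSE, FALSE))
## [1] 49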
Read the income data into R and convert each column to a factor:
# install.packages("rpart")  # run once interactively; knitr cannot reach CRAN without a mirror set
library(rpart)
incomeData = read.table("/Users/shijiabian/Dropbox/STATS315B/Data/Income_Data.txt",
sep = ",", col.names = c("ANNUAL_INCOME", "SEX", "MARITAL_STATUS", "AGE",
"EDUCATION", "OCCUPATION", "LIVING_TIME", "DUAL_INCOMES", "HOUSEHOLD_NUMBER",
"HOUSEHOLD_UNDER_18", "HOUSEHOLDER_STATUS", "TYPE_OF_HOME", "ETHNIC",
"LANGUAGE"), fill = FALSE, strip.white = TRUE)
# This step is redundant as long as I specify method = 'class' in rpart()
for (i in 1:14) {
incomeData[, i] = as.factor(incomeData[, i])
}
summary(incomeData)
## ANNUAL_INCOME SEX MARITAL_STATUS AGE EDUCATION
## 1 :1745 1:4075 1 :3334 1: 878 1 : 264
## 8 :1308 2:4918 2 : 668 2:2129 2 :1046
## 6 :1110 3 : 875 3:2249 3 :2041
## 7 : 969 4 : 302 4:1615 4 :3066
## 9 : 884 5 :3654 5: 922 5 :1524
## 4 : 813 NA's: 160 6: 640 6 : 966
## (Other):2164 7: 560 NA's: 86
## OCCUPATION LIVING_TIME DUAL_INCOMES HOUSEHOLD_NUMBER
## 1 :2820 1 : 270 1:5438 2 :2664
## 6 :1489 2 :1042 2:2211 3 :1670
## 4 :1062 3 : 686 3:1344 1 :1620
## 2 : 770 4 : 900 4 :1526
## 3 : 767 5 :5182 5 : 686
## (Other):1949 NA's: 913 (Other): 452
## NA's : 136 NA's : 375
## HOUSEHOLD_UNDER_18 HOUSEHOLDER_STATUS TYPE_OF_HOME ETHNIC
## 0 :5724 1 :3256 1 :5073 7 :5811
## 1 :1506 2 :3670 2 : 655 5 :1231
## 2 :1148 3 :1827 3 :2373 3 : 910
## 3 : 412 NA's: 240 4 : 151 2 : 477
## 4 : 117 5 : 384 8 : 225
## 5 : 46 NA's: 357 (Other): 271
## (Other): 40 NA's : 68
## LANGUAGE
## 1 :7794
## 2 : 579
## 3 : 261
## NA's: 359
##
##
##
For the complete tree, fit the 13 predictors onto “ANNUAL INCOME OF HOUSEHOLD” using the default settings only. According to the printed tree, only three predictors are used: OCCUPATION, AGE and HOUSEHOLDER_STATUS. We can print the complexity table and plot the tree. It has 4 terminal nodes.
fit.noconstrain <- rpart(ANNUAL_INCOME ~ ., data = incomeData, method = "class")
print(fit.noconstrain)
## n= 8993
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 8993 7248 1 (0.19 0.086 0.074 0.09 0.08 0.12 0.11 0.15 0.098)
## 2) OCCUPATION=6,9 1843 684 1 (0.63 0.077 0.042 0.033 0.028 0.041 0.042 0.058 0.051) *
## 3) OCCUPATION=1,2,3,4,5,7,8 7150 5948 8 (0.082 0.089 0.083 0.11 0.094 0.14 0.12 0.17 0.11)
## 6) HOUSEHOLDER_STATUS=2,3 3950 3394 6 (0.13 0.13 0.12 0.13 0.12 0.14 0.1 0.094 0.037)
## 12) AGE=1,2,6,7 1569 1215 1 (0.23 0.2 0.15 0.12 0.085 0.081 0.055 0.063 0.029) *
## 13) AGE=3,4,5 2381 1952 6 (0.062 0.088 0.1 0.14 0.14 0.18 0.13 0.11 0.042) *
## 7) HOUSEHOLDER_STATUS=1 3200 2369 8 (0.026 0.036 0.037 0.069 0.065 0.15 0.16 0.26 0.2) *
plot(fit.noconstrain)
text(fit.noconstrain, use.n = T)
The analysis and graph above are reasonable. The group consisting of students (HS or college) and the unemployed has the lowest income. Householder status then plays an important role within the remaining occupations: those who rent or live with family have lower income than those who own their home. Within the renting group, age makes a further difference: people roughly under 25 or 55 and over have lower income than the rest. Overall, the tree illustrates that people who are neither students nor unemployed and who own their home have higher income than the rest.
The output below shows that the cp for the tree with 3 splits is 0.01. cp decreases as the number of splits increases. The best model is the one with 3 splits (4 nodes).
summary(fit.noconstrain)
## Call:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class")
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.08499 0 1.0000 1.0000 0.005174
## 2 0.02842 1 0.9150 0.9193 0.005732
## 3 0.01000 3 0.8582 0.8600 0.006034
##
## Variable importance
## OCCUPATION AGE HOUSEHOLDER_STATUS
## 39 21 19
## EDUCATION MARITAL_STATUS TYPE_OF_HOME
## 7 5 4
## DUAL_INCOMES
## 4
##
## Node number 1: 8993 observations, complexity param=0.08499
## predicted class=1 expected loss=0.806 P(node) =1
## class counts: 1745 775 667 813 722 1110 969 1308 884
## probabilities: 0.194 0.086 0.074 0.090 0.080 0.123 0.108 0.145 0.098
## left son=2 (1843 obs) right son=3 (7150 obs)
## Primary splits:
## OCCUPATION splits as RRRRRLRRL, improve=495.5, (136 missing)
## AGE splits as LRRRRRR, improve=495.5, (0 missing)
## EDUCATION splits as LLRRRR, improve=400.3, (86 missing)
## HOUSEHOLDER_STATUS splits as RRL, improve=396.2, (240 missing)
## MARITAL_STATUS splits as RRRRL, improve=295.3, (160 missing)
## Surrogate splits:
## AGE splits as LRRRRRR, agree=0.859, adj=0.314, (136 split)
## EDUCATION splits as LLRRRR, agree=0.832, adj=0.185, (0 split)
## HOUSEHOLDER_STATUS splits as RRL, agree=0.832, adj=0.185, (0 split)
##
## Node number 2: 1843 observations
## predicted class=1 expected loss=0.3711 P(node) =0.2049
## class counts: 1159 142 77 61 51 75 78 106 94
## probabilities: 0.629 0.077 0.042 0.033 0.028 0.041 0.042 0.058 0.051
##
## Node number 3: 7150 observations, complexity param=0.02842
## predicted class=8 expected loss=0.8319 P(node) =0.7951
## class counts: 586 633 590 752 671 1035 891 1202 790
## probabilities: 0.082 0.089 0.083 0.105 0.094 0.145 0.125 0.168 0.110
## left son=6 (3950 obs) right son=7 (3200 obs)
## Primary splits:
## HOUSEHOLDER_STATUS splits as RLL, improve=159.70, (210 missing)
## MARITAL_STATUS splits as RLLLL, improve=136.70, (109 missing)
## DUAL_INCOMES splits as LRR, improve=125.70, (0 missing)
## AGE splits as LLRRRRR, improve=108.40, (0 missing)
## TYPE_OF_HOME splits as RRLLL, improve= 90.16, (268 missing)
## Surrogate splits:
## AGE splits as LLLRRRR, agree=0.733, adj=0.406, (210 split)
## MARITAL_STATUS splits as RLLRL, agree=0.721, adj=0.377, (0 split)
## TYPE_OF_HOME splits as RRLRL, agree=0.718, adj=0.371, (0 split)
## DUAL_INCOMES splits as LRR, agree=0.703, adj=0.337, (0 split)
## OCCUPATION splits as RLLLR-LR-, agree=0.621, adj=0.155, (0 split)
##
## Node number 6: 3950 observations, complexity param=0.02842
## predicted class=6 expected loss=0.8592 P(node) =0.4392
## class counts: 502 518 470 530 463 556 394 371 146
## probabilities: 0.127 0.131 0.119 0.134 0.117 0.141 0.100 0.094 0.037
## left son=12 (1569 obs) right son=13 (2381 obs)
## Primary splits:
## AGE splits as LLRRRLL, improve=58.66, (0 missing)
## OCCUPATION splits as RLLLL-LL-, improve=49.69, (78 missing)
## EDUCATION splits as LLLRRR, improve=36.17, (36 missing)
## HOUSEHOLDER_STATUS splits as -RL, improve=32.30, (122 missing)
## MARITAL_STATUS splits as RLLLL, improve=28.84, (60 missing)
## Surrogate splits:
## HOUSEHOLDER_STATUS splits as -RL, agree=0.674, adj=0.180, (0 split)
## OCCUPATION splits as RLRRR-LL-, agree=0.658, adj=0.140, (0 split)
## EDUCATION splits as RLLRRR, agree=0.648, adj=0.113, (0 split)
## MARITAL_STATUS splits as RRRLL, agree=0.637, adj=0.087, (0 split)
## HOUSEHOLD_UNDER_18 splits as RRRRRRLR--, agree=0.604, adj=0.002, (0 split)
##
## Node number 7: 3200 observations
## predicted class=8 expected loss=0.7403 P(node) =0.3558
## class counts: 84 115 120 222 208 479 497 831 644
## probabilities: 0.026 0.036 0.037 0.069 0.065 0.150 0.155 0.260 0.201
##
## Node number 12: 1569 observations
## predicted class=1 expected loss=0.7744 P(node) =0.1745
## class counts: 354 308 229 187 134 127 86 99 45
## probabilities: 0.226 0.196 0.146 0.119 0.085 0.081 0.055 0.063 0.029
##
## Node number 13: 2381 observations
## predicted class=6 expected loss=0.8198 P(node) =0.2648
## class counts: 148 210 241 343 329 429 308 272 101
## probabilities: 0.062 0.088 0.101 0.144 0.138 0.180 0.129 0.114 0.042
printcp(fit.noconstrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class")
##
## Variables actually used in tree construction:
## [1] AGE HOUSEHOLDER_STATUS OCCUPATION
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.085 0 1.00 1.00 0.0052
## 2 0.028 1 0.92 0.92 0.0057
## 3 0.010 3 0.86 0.86 0.0060
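As a quick visual check of the complexity table (an optional extra step, not in the original output), rpart's plotcp() plots the cross-validated error against cp and draws a reference line one standard error above the minimum:
plotcp(fit.noconstrain)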
The alternative criterion is the pruned tree, specified through rpart.control: the number of cross-validations is 10 and the complexity parameter is 0. The complexity table is listed below, from the smallest tree (no splits) to the largest (559 splits). The xerror column clearly shows a convex shape: it reaches its minimum at 23 splits, where xerror is 0.80560 with a standard error of 0.0062435. Applying the one-standard-error rule of cross-validation, the best tree has 17 splits (18 nodes), with cp equal to 0.0020695.
fit.constrain <- rpart(ANNUAL_INCOME ~ ., data = incomeData, method = "class",
control = rpart.control(xval = 10, cp = 0))
printcp(fit.constrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class",
## control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] AGE DUAL_INCOMES EDUCATION
## [4] ETHNIC HOUSEHOLD_NUMBER HOUSEHOLD_UNDER_18
## [7] HOUSEHOLDER_STATUS LANGUAGE LIVING_TIME
## [10] MARITAL_STATUS OCCUPATION SEX
## [13] TYPE_OF_HOME
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 8.5e-02 0 1.00 1.00 0.0052
## 2 2.8e-02 1 0.92 0.90 0.0058
## 3 5.7e-03 3 0.86 0.86 0.0060
## 4 5.1e-03 5 0.85 0.85 0.0061
## 5 4.6e-03 6 0.84 0.85 0.0061
## 6 4.2e-03 8 0.83 0.84 0.0061
## 7 4.0e-03 11 0.82 0.83 0.0062
## 8 3.1e-03 13 0.81 0.82 0.0062
## 9 3.0e-03 15 0.81 0.82 0.0062
## 10 2.1e-03 17 0.80 0.81 0.0062
## 11 1.9e-03 18 0.80 0.81 0.0062
## 12 1.8e-03 20 0.79 0.81 0.0062
## 13 1.2e-03 21 0.79 0.81 0.0062
## 14 1.1e-03 23 0.79 0.80 0.0062
## 15 1.1e-03 25 0.79 0.80 0.0062
## 16 9.7e-04 28 0.78 0.81 0.0062
## 17 8.3e-04 31 0.78 0.81 0.0062
## 18 6.9e-04 33 0.78 0.81 0.0062
## 19 6.2e-04 41 0.77 0.81 0.0062
## 20 6.0e-04 52 0.77 0.82 0.0062
## 21 5.5e-04 61 0.76 0.82 0.0062
## 22 5.2e-04 72 0.75 0.82 0.0062
## 23 4.8e-04 76 0.75 0.82 0.0062
## 24 4.6e-04 83 0.75 0.82 0.0062
## 25 4.1e-04 111 0.73 0.83 0.0062
## 26 3.7e-04 149 0.72 0.83 0.0062
## 27 3.4e-04 158 0.71 0.83 0.0061
## 28 3.4e-04 180 0.71 0.83 0.0061
## 29 3.2e-04 195 0.70 0.83 0.0061
## 30 3.1e-04 210 0.69 0.83 0.0061
## 31 3.0e-04 217 0.69 0.83 0.0061
## 32 2.8e-04 228 0.69 0.84 0.0061
## 33 2.5e-04 309 0.67 0.84 0.0061
## 34 2.5e-04 316 0.66 0.84 0.0061
## 35 2.4e-04 321 0.66 0.84 0.0061
## 36 2.3e-04 325 0.66 0.84 0.0061
## 37 2.2e-04 343 0.66 0.84 0.0061
## 38 2.2e-04 348 0.66 0.84 0.0061
## 39 2.1e-04 355 0.65 0.84 0.0061
## 40 1.8e-04 404 0.64 0.85 0.0061
## 41 1.4e-04 414 0.64 0.85 0.0061
## 42 1.1e-04 519 0.63 0.85 0.0061
## 43 9.2e-05 524 0.63 0.85 0.0061
## 44 6.9e-05 533 0.63 0.85 0.0061
## 45 4.6e-05 553 0.62 0.86 0.0061
## 46 0.0e+00 559 0.62 0.86 0.0061
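The one-standard-error selection described above can also be read off the fitted object's cptable programmatically. This is a minimal sketch (the variable names are mine, and the row it picks can shift slightly with the cross-validation randomization):
cpt <- fit.constrain$cptable
min.row <- which.min(cpt[, "xerror"])
threshold <- cpt[min.row, "xerror"] + cpt[min.row, "xstd"]
best.row <- min(which(cpt[, "xerror"] <= threshold))
cpt[best.row, c("CP", "nsplit")]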
Use the branch and compress arguments to control the shape of the tree and plot the pruned tree:
fit17 <- prune(fit.constrain, cp = 0.0020695)
par(mar = rep(0.1, 4))
plot(fit17, branch = 0.3, compress = TRUE)
text(fit17)
Overall, the pruned tree tells a similar story to the previous CART model but in more detail: it has more nodes and uses more predictors, such as EDUCATION, ETHNIC and MARITAL_STATUS.
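A quick way to compare the sizes of the two trees is to count the leaf rows in each tree's frame (an extra check, not part of the original output):
sum(fit.noconstrain$frame$var == "<leaf>")
sum(fit17$frame$var == "<leaf>")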
Yes. Surrogate splits are used in the construction of the optimal trees for both models.
Definition of Surrogate Split:
When the data have missing values in some or all of the predictors, we estimate the missing datum using the other independent variables. The constructed variables that are used to fill in (impute) the missing values are called surrogate variables. Surrogate splits exploit correlations between predictors to try to alleviate the effect of missing data.
Example of a Surrogate Split:
We look at the first model from the previous question using the command summary(). Its first split divides the 8993 observations into 1843 (node 2, which is a terminal node) and 7150 (node 3). At node 1 the best split is on OCCUPATION, which has 136 missing values, so surrogate splits are applied for OCCUPATION at this node; AGE is the variable that is split on as the surrogate.
printcp(fit.noconstrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class")
##
## Variables actually used in tree construction:
## [1] AGE HOUSEHOLDER_STATUS OCCUPATION
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.085 0 1.00 1.00 0.0052
## 2 0.028 1 0.92 0.92 0.0057
## 3 0.010 3 0.86 0.86 0.0060
summary(fit.noconstrain)
## Call:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class")
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.08499 0 1.0000 1.0000 0.005174
## 2 0.02842 1 0.9150 0.9193 0.005732
## 3 0.01000 3 0.8582 0.8600 0.006034
##
## Variable importance
## OCCUPATION AGE HOUSEHOLDER_STATUS
## 39 21 19
## EDUCATION MARITAL_STATUS TYPE_OF_HOME
## 7 5 4
## DUAL_INCOMES
## 4
##
## Node number 1: 8993 observations, complexity param=0.08499
## predicted class=1 expected loss=0.806 P(node) =1
## class counts: 1745 775 667 813 722 1110 969 1308 884
## probabilities: 0.194 0.086 0.074 0.090 0.080 0.123 0.108 0.145 0.098
## left son=2 (1843 obs) right son=3 (7150 obs)
## Primary splits:
## OCCUPATION splits as RRRRRLRRL, improve=495.5, (136 missing)
## AGE splits as LRRRRRR, improve=495.5, (0 missing)
## EDUCATION splits as LLRRRR, improve=400.3, (86 missing)
## HOUSEHOLDER_STATUS splits as RRL, improve=396.2, (240 missing)
## MARITAL_STATUS splits as RRRRL, improve=295.3, (160 missing)
## Surrogate splits:
## AGE splits as LRRRRRR, agree=0.859, adj=0.314, (136 split)
## EDUCATION splits as LLRRRR, agree=0.832, adj=0.185, (0 split)
## HOUSEHOLDER_STATUS splits as RRL, agree=0.832, adj=0.185, (0 split)
##
## Node number 2: 1843 observations
## predicted class=1 expected loss=0.3711 P(node) =0.2049
## class counts: 1159 142 77 61 51 75 78 106 94
## probabilities: 0.629 0.077 0.042 0.033 0.028 0.041 0.042 0.058 0.051
##
## Node number 3: 7150 observations, complexity param=0.02842
## predicted class=8 expected loss=0.8319 P(node) =0.7951
## class counts: 586 633 590 752 671 1035 891 1202 790
## probabilities: 0.082 0.089 0.083 0.105 0.094 0.145 0.125 0.168 0.110
## left son=6 (3950 obs) right son=7 (3200 obs)
## Primary splits:
## HOUSEHOLDER_STATUS splits as RLL, improve=159.70, (210 missing)
## MARITAL_STATUS splits as RLLLL, improve=136.70, (109 missing)
## DUAL_INCOMES splits as LRR, improve=125.70, (0 missing)
## AGE splits as LLRRRRR, improve=108.40, (0 missing)
## TYPE_OF_HOME splits as RRLLL, improve= 90.16, (268 missing)
## Surrogate splits:
## AGE splits as LLLRRRR, agree=0.733, adj=0.406, (210 split)
## MARITAL_STATUS splits as RLLRL, agree=0.721, adj=0.377, (0 split)
## TYPE_OF_HOME splits as RRLRL, agree=0.718, adj=0.371, (0 split)
## DUAL_INCOMES splits as LRR, agree=0.703, adj=0.337, (0 split)
## OCCUPATION splits as RLLLR-LR-, agree=0.621, adj=0.155, (0 split)
##
## Node number 6: 3950 observations, complexity param=0.02842
## predicted class=6 expected loss=0.8592 P(node) =0.4392
## class counts: 502 518 470 530 463 556 394 371 146
## probabilities: 0.127 0.131 0.119 0.134 0.117 0.141 0.100 0.094 0.037
## left son=12 (1569 obs) right son=13 (2381 obs)
## Primary splits:
## AGE splits as LLRRRLL, improve=58.66, (0 missing)
## OCCUPATION splits as RLLLL-LL-, improve=49.69, (78 missing)
## EDUCATION splits as LLLRRR, improve=36.17, (36 missing)
## HOUSEHOLDER_STATUS splits as -RL, improve=32.30, (122 missing)
## MARITAL_STATUS splits as RLLLL, improve=28.84, (60 missing)
## Surrogate splits:
## HOUSEHOLDER_STATUS splits as -RL, agree=0.674, adj=0.180, (0 split)
## OCCUPATION splits as RLRRR-LL-, agree=0.658, adj=0.140, (0 split)
## EDUCATION splits as RLLRRR, agree=0.648, adj=0.113, (0 split)
## MARITAL_STATUS splits as RRRLL, agree=0.637, adj=0.087, (0 split)
## HOUSEHOLD_UNDER_18 splits as RRRRRRLR--, agree=0.604, adj=0.002, (0 split)
##
## Node number 7: 3200 observations
## predicted class=8 expected loss=0.7403 P(node) =0.3558
## class counts: 84 115 120 222 208 479 497 831 644
## probabilities: 0.026 0.036 0.037 0.069 0.065 0.150 0.155 0.260 0.201
##
## Node number 12: 1569 observations
## predicted class=1 expected loss=0.7744 P(node) =0.1745
## class counts: 354 308 229 187 134 127 86 99 45
## probabilities: 0.226 0.196 0.146 0.119 0.085 0.081 0.055 0.063 0.029
##
## Node number 13: 2381 observations
## predicted class=6 expected loss=0.8198 P(node) =0.2648
## class counts: 148 210 241 343 329 429 308 272 101
## probabilities: 0.062 0.088 0.101 0.144 0.138 0.180 0.129 0.114 0.042
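To see the surrogate at work (an extra check, not part of the original output), the 136 observations with OCCUPATION missing are still routed through the first split when predicting, because rpart falls back on the surrogate variable AGE:
miss <- is.na(incomeData$OCCUPATION)
sum(miss)
table(predict(fit.noconstrain, incomeData[miss, ], type = "class"))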
The misclassification rate is 1 - (677 + 117 + 97 + 10 + 126 + 101 + 292 + 93)/4496 = 0.6634786, i.e. one minus the sum of the diagonal of the confusion table divided by the test-set size. Below is the confusion table of the model's predictions on the test set:
set.seed(23)
train = sample(1:nrow(incomeData), 4497)
incomeData.test = incomeData[-train, ]
fit.train <- rpart(ANNUAL_INCOME ~ ., data = incomeData, subset = train, method = "class",
control = rpart.control(xval = 10, cp = 0))
printcp(fit.train)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, subset = train,
## method = "class", control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] AGE DUAL_INCOMES EDUCATION
## [4] ETHNIC HOUSEHOLD_NUMBER HOUSEHOLD_UNDER_18
## [7] HOUSEHOLDER_STATUS LANGUAGE LIVING_TIME
## [10] MARITAL_STATUS OCCUPATION SEX
## [13] TYPE_OF_HOME
##
## Root node error: 3625/4497 = 0.81
##
## n= 4497
##
## CP nsplit rel error xerror xstd
## 1 5.3e-02 0 1.00 1.00 0.0073
## 2 5.2e-02 1 0.95 0.98 0.0075
## 3 2.2e-02 2 0.90 0.90 0.0083
## 4 6.1e-03 4 0.85 0.86 0.0085
## 5 5.8e-03 5 0.85 0.86 0.0085
## 6 5.0e-03 6 0.84 0.85 0.0086
## 7 4.1e-03 7 0.83 0.84 0.0086
## 8 3.7e-03 11 0.82 0.83 0.0087
## 9 3.6e-03 15 0.80 0.82 0.0088
## 10 3.0e-03 16 0.80 0.81 0.0088
## 11 2.9e-03 17 0.80 0.81 0.0088
## 12 2.5e-03 19 0.79 0.81 0.0088
## 13 1.9e-03 20 0.79 0.81 0.0088
## 14 1.7e-03 23 0.78 0.81 0.0088
## 15 1.4e-03 25 0.78 0.82 0.0088
## 16 1.2e-03 30 0.77 0.82 0.0087
## 17 1.1e-03 35 0.77 0.82 0.0087
## 18 1.1e-03 47 0.75 0.83 0.0087
## 19 9.7e-04 53 0.75 0.83 0.0087
## 20 9.2e-04 57 0.74 0.83 0.0087
## 21 8.3e-04 60 0.74 0.84 0.0087
## 22 7.5e-04 106 0.70 0.84 0.0087
## 23 6.9e-04 114 0.69 0.83 0.0087
## 24 6.4e-04 120 0.69 0.84 0.0087
## 25 6.2e-04 123 0.69 0.84 0.0087
## 26 5.5e-04 127 0.68 0.84 0.0087
## 27 4.8e-04 168 0.66 0.84 0.0086
## 28 4.6e-04 173 0.66 0.84 0.0086
## 29 4.1e-04 176 0.66 0.84 0.0086
## 30 3.7e-04 209 0.64 0.85 0.0086
## 31 2.8e-04 215 0.64 0.85 0.0086
## 32 1.6e-04 247 0.63 0.85 0.0086
## 33 1.4e-04 256 0.63 0.85 0.0086
## 34 9.2e-05 266 0.63 0.85 0.0086
## 35 6.9e-05 275 0.63 0.85 0.0086
## 36 0.0e+00 279 0.63 0.85 0.0086
# Prune with a cp value between two adjacent rows of the cp table (their geometric mean)
cptarg = sqrt(fit.train$cptable[14, 1] * fit.train$cptable[15, 1])
prunedtree = prune(fit.train, cp = cptarg)
# Visualize the tree
par(mfrow = c(1, 1), mar = c(1, 1, 1, 1))
plot(prunedtree, uniform = T, compress = T, margin = 0.1, branch = 0.3)
text(prunedtree, use.n = T, digits = 3, cex = 0.6)
fit.preds = predict(prunedtree, newdata = incomeData.test, type = "class")
fit.table = table(incomeData.test$ANNUAL_INCOME, fit.preds)
fit.table
## fit.preds
## 1 2 3 4 5 6 7 8 9
## 1 677 103 0 45 4 16 19 9 0
## 2 134 117 0 50 8 37 23 11 0
## 3 85 86 0 66 6 53 30 13 1
## 4 49 79 0 97 9 88 43 42 1
## 5 43 59 0 64 10 94 55 43 0
## 6 59 60 0 52 24 126 105 112 7
## 7 55 39 0 20 15 82 101 156 16
## 8 65 32 0 21 9 60 106 292 47
## 9 61 17 0 6 4 37 45 203 93
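The misclassification rate quoted above can be recovered directly from this confusion table:
1 - sum(diag(fit.table))/sum(fit.table)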
Running the models below shows that CART is strongly influenced by the spurious predictors: when the tree is grown without a complexity constraint, it pulls many of the spurious variables into the model, i.e. it over-models. However, the size of the optimal model does not change much after adding the spurious variables. Below are the steps of my debrief:
Read the Income_Big data into R:
incomeBig = read.table("/Users/shijiabian/Dropbox/STATS315B/Data/Income_Big.txt",
sep = ",", col.names = c("col1", "col2", "col3", "col4", "col5", "col6",
"col7", "col8", "col9", "col10", "col11", "col12", "col13", "col14",
"col15", "col16", "col17", "col18", "col19", "col20", "col21", "col22",
"col23", "col24"), fill = FALSE, strip.white = TRUE)
dim(incomeBig)
## [1] 8993 24
# This step is redundant as long as I set method = 'class'
for (i in 1:24) {
incomeBig[, i] = as.factor(incomeBig[, i])
}
summary(incomeBig)
## col1 col2 col3 col4 col5 col6
## 1 :1745 1:4075 1 :3334 1: 878 1 : 264 1 :2820
## 8 :1308 2:4918 2 : 668 2:2129 2 :1046 6 :1489
## 6 :1110 3 : 875 3:2249 3 :2041 4 :1062
## 7 : 969 4 : 302 4:1615 4 :3066 2 : 770
## 9 : 884 5 :3654 5: 922 5 :1524 3 : 767
## 4 : 813 NA's: 160 6: 640 6 : 966 (Other):1949
## (Other):2164 7: 560 NA's: 86 NA's : 136
## col7 col8 col9 col10 col11
## 1 : 270 1:5438 2 :2664 0 :5724 1 :3256
## 2 :1042 2:2211 3 :1670 1 :1506 2 :3670
## 3 : 686 3:1344 1 :1620 2 :1148 3 :1827
## 4 : 900 4 :1526 3 : 412 NA's: 240
## 5 :5182 5 : 686 4 : 117
## NA's: 913 (Other): 452 5 : 46
## NA's : 375 (Other): 40
## col12 col13 col14 col15 col16
## 1 :5073 7 :5811 1 :7794 5 :1024 5 :1054
## 2 : 655 5 :1231 2 : 579 2 :1023 1 :1048
## 3 :2373 3 : 910 3 : 261 3 :1009 4 :1018
## 4 : 151 2 : 477 NA's: 359 8 :1009 8 :1005
## 5 : 384 8 : 225 6 :1001 6 :1002
## NA's: 357 (Other): 271 7 : 992 7 : 995
## NA's : 68 (Other):2935 (Other):2871
## col17 col18 col19 col20
## 6 :1038 2 :1039 5 :1043 7 :1045
## 4 :1017 7 :1032 3 :1026 1 :1040
## 7 :1009 1 :1031 1 :1002 5 :1019
## 3 : 996 8 :1005 9 :1001 3 :1006
## 9 : 994 3 :1003 7 :1000 6 :1004
## 8 : 991 5 :1003 6 : 998 2 : 982
## (Other):2948 (Other):2880 (Other):2923 (Other):2897
## col21 col22 col23 col24
## 5 :1034 1 :1061 4 :1035 9 :1040
## 7 :1032 8 :1048 5 :1027 5 :1029
## 4 :1019 6 :1040 3 :1023 8 :1014
## 1 :1016 5 :1019 9 :1023 4 :1012
## 8 :1015 7 : 975 2 : 992 1 : 993
## 3 : 993 3 : 970 6 : 989 6 : 987
## (Other):2884 (Other):2880 (Other):2904 (Other):2918
First model: construct the complete tree for Income_Big with the default loss and prior, and print its complexity table alongside the one for the Income data. Based on the one-standard-error rule of cross-validation, we choose 3 splits (4 nodes) for both data sets. The variables actually used for Income_Big are col11, col4 and col6; since the first 14 columns of Income_Big match the Income_Data variables in the same order (compare the summaries above), these are HOUSEHOLDER_STATUS, AGE and OCCUPATION, the same predictors chosen for the Income data, so the spurious variables do not enter at this tree size.
fit.big.noconstrain <- rpart(col1 ~ ., data = incomeBig, method = "class")
printcp(fit.big.noconstrain)
##
## Classification tree:
## rpart(formula = col1 ~ ., data = incomeBig, method = "class")
##
## Variables actually used in tree construction:
## [1] col11 col4 col6
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.085 0 1.00 1.00 0.0052
## 2 0.028 1 0.92 0.91 0.0058
## 3 0.010 3 0.86 0.86 0.0060
printcp(fit.noconstrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class")
##
## Variables actually used in tree construction:
## [1] AGE HOUSEHOLDER_STATUS OCCUPATION
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.085 0 1.00 1.00 0.0052
## 2 0.028 1 0.92 0.92 0.0057
## 3 0.010 3 0.86 0.86 0.0060
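Since the marginal summaries of col1 through col14 of Income_Big match the Income data column for column, the colN labels in the tree output can be translated back to the original variable names. This is a small sketch under that assumption:
# Assumes the first 14 columns of Income_Big are the Income_Data variables, in order
used.cols <- c("col11", "col4", "col6")
names(incomeData)[as.integer(sub("col", "", used.cols))]
# -> HOUSEHOLDER_STATUS, AGE, OCCUPATION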
Second model: construct the pruned tree for Income_Big by setting xval and cp as before, and print the complexity table for this new CART model together with the table for the same construction on the Income data. About half of the variables actually used in the model fitted to Income_Big (10 of the 21, namely col15 through col24) are spurious. So, unlike the small complete tree above, the fully grown tree does bring the random predictors into the model, which is where the over-modeling shows up.
fit.big.constrain <- rpart(col1 ~ ., data = incomeBig, method = "class", control = rpart.control(xval = 10,
cp = 0))
printcp(fit.big.constrain)
##
## Classification tree:
## rpart(formula = col1 ~ ., data = incomeBig, method = "class",
## control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] col10 col11 col12 col13 col15 col16 col17 col18 col19 col2 col20
## [12] col21 col22 col23 col24 col3 col4 col5 col6 col7 col9
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 8.5e-02 0 1.00 1.00 0.0052
## 2 2.8e-02 1 0.92 0.93 0.0057
## 3 5.7e-03 3 0.86 0.86 0.0060
## 4 5.1e-03 5 0.85 0.85 0.0061
## 5 4.6e-03 6 0.84 0.85 0.0061
## 6 4.2e-03 8 0.83 0.85 0.0061
## 7 4.0e-03 11 0.82 0.83 0.0062
## 8 3.1e-03 13 0.81 0.82 0.0062
## 9 3.0e-03 15 0.81 0.82 0.0062
## 10 2.1e-03 17 0.80 0.82 0.0062
## 11 1.9e-03 18 0.80 0.81 0.0062
## 12 1.8e-03 20 0.79 0.81 0.0062
## 13 1.7e-03 21 0.79 0.81 0.0062
## 14 1.4e-03 24 0.79 0.82 0.0062
## 15 1.3e-03 25 0.79 0.82 0.0062
## 16 1.2e-03 29 0.78 0.82 0.0062
## 17 1.1e-03 31 0.78 0.82 0.0062
## 18 1.0e-03 51 0.75 0.83 0.0062
## 19 9.7e-04 54 0.75 0.83 0.0062
## 20 9.2e-04 64 0.74 0.83 0.0062
## 21 9.0e-04 67 0.74 0.83 0.0061
## 22 8.6e-04 76 0.73 0.83 0.0061
## 23 8.3e-04 81 0.73 0.84 0.0061
## 24 7.8e-04 98 0.71 0.85 0.0061
## 25 7.6e-04 101 0.71 0.85 0.0061
## 26 7.2e-04 107 0.71 0.86 0.0061
## 27 7.2e-04 122 0.69 0.86 0.0061
## 28 6.9e-04 127 0.69 0.86 0.0060
## 29 6.6e-04 151 0.67 0.86 0.0061
## 30 6.4e-04 157 0.67 0.86 0.0060
## 31 6.3e-04 160 0.67 0.86 0.0060
## 32 6.2e-04 165 0.66 0.86 0.0060
## 33 5.9e-04 171 0.66 0.86 0.0061
## 34 5.5e-04 175 0.66 0.86 0.0061
## 35 5.2e-04 224 0.63 0.86 0.0060
## 36 5.1e-04 230 0.63 0.86 0.0060
## 37 4.8e-04 238 0.62 0.86 0.0060
## 38 4.6e-04 262 0.61 0.86 0.0060
## 39 4.1e-04 280 0.60 0.87 0.0060
## 40 3.7e-04 396 0.55 0.87 0.0060
## 41 3.4e-04 402 0.55 0.88 0.0060
## 42 3.3e-04 435 0.54 0.88 0.0060
## 43 2.8e-04 440 0.53 0.88 0.0059
## 44 2.3e-04 504 0.52 0.88 0.0059
## 45 2.1e-04 510 0.51 0.89 0.0059
## 46 1.8e-04 529 0.51 0.89 0.0059
## 47 1.4e-04 533 0.51 0.89 0.0059
## 48 9.2e-05 568 0.50 0.89 0.0059
## 49 0.0e+00 571 0.50 0.89 0.0059
printcp(fit.constrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = incomeData, method = "class",
## control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] AGE DUAL_INCOMES EDUCATION
## [4] ETHNIC HOUSEHOLD_NUMBER HOUSEHOLD_UNDER_18
## [7] HOUSEHOLDER_STATUS LANGUAGE LIVING_TIME
## [10] MARITAL_STATUS OCCUPATION SEX
## [13] TYPE_OF_HOME
##
## Root node error: 7248/8993 = 0.81
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 8.5e-02 0 1.00 1.00 0.0052
## 2 2.8e-02 1 0.92 0.90 0.0058
## 3 5.7e-03 3 0.86 0.86 0.0060
## 4 5.1e-03 5 0.85 0.85 0.0061
## 5 4.6e-03 6 0.84 0.85 0.0061
## 6 4.2e-03 8 0.83 0.84 0.0061
## 7 4.0e-03 11 0.82 0.83 0.0062
## 8 3.1e-03 13 0.81 0.82 0.0062
## 9 3.0e-03 15 0.81 0.82 0.0062
## 10 2.1e-03 17 0.80 0.81 0.0062
## 11 1.9e-03 18 0.80 0.81 0.0062
## 12 1.8e-03 20 0.79 0.81 0.0062
## 13 1.2e-03 21 0.79 0.81 0.0062
## 14 1.1e-03 23 0.79 0.80 0.0062
## 15 1.1e-03 25 0.79 0.80 0.0062
## 16 9.7e-04 28 0.78 0.81 0.0062
## 17 8.3e-04 31 0.78 0.81 0.0062
## 18 6.9e-04 33 0.78 0.81 0.0062
## 19 6.2e-04 41 0.77 0.81 0.0062
## 20 6.0e-04 52 0.77 0.82 0.0062
## 21 5.5e-04 61 0.76 0.82 0.0062
## 22 5.2e-04 72 0.75 0.82 0.0062
## 23 4.8e-04 76 0.75 0.82 0.0062
## 24 4.6e-04 83 0.75 0.82 0.0062
## 25 4.1e-04 111 0.73 0.83 0.0062
## 26 3.7e-04 149 0.72 0.83 0.0062
## 27 3.4e-04 158 0.71 0.83 0.0061
## 28 3.4e-04 180 0.71 0.83 0.0061
## 29 3.2e-04 195 0.70 0.83 0.0061
## 30 3.1e-04 210 0.69 0.83 0.0061
## 31 3.0e-04 217 0.69 0.83 0.0061
## 32 2.8e-04 228 0.69 0.84 0.0061
## 33 2.5e-04 309 0.67 0.84 0.0061
## 34 2.5e-04 316 0.66 0.84 0.0061
## 35 2.4e-04 321 0.66 0.84 0.0061
## 36 2.3e-04 325 0.66 0.84 0.0061
## 37 2.2e-04 343 0.66 0.84 0.0061
## 38 2.2e-04 348 0.66 0.84 0.0061
## 39 2.1e-04 355 0.65 0.84 0.0061
## 40 1.8e-04 404 0.64 0.85 0.0061
## 41 1.4e-04 414 0.64 0.85 0.0061
## 42 1.1e-04 519 0.63 0.85 0.0061
## 43 9.2e-05 524 0.63 0.85 0.0061
## 44 6.9e-05 533 0.63 0.85 0.0061
## 45 4.6e-05 553 0.62 0.86 0.0061
## 46 0.0e+00 559 0.62 0.86 0.0061
For the Income_Big data set, the smallest xerror in the complexity table is 0.80726 with standard deviation 0.0062380. According to the one-standard-error rule of cross-validation, the model with 17 splits (18 nodes) is the optimal choice; its cp is 0.0020695.
For the Income_Data data set, the smallest xerror in the complexity table is 0.80574 with standard deviation 0.0062431. According to the one-standard-error rule of cross-validation, the model with 18 splits (19 nodes) is the optimal choice; its cp is 0.0017936.
Thus, the size of the optimal model does not change much after adding the unrelated predictors. The Income_Big tree is slightly smaller, but the overall shapes of the two trees are very similar. Here are the plots of the two trees:
fit17 <- prune(fit.big.constrain, cp = 0.0020695)
par(mar = rep(0.1, 4))
plot(fit17, branch = 0.3, compress = TRUE)
text(fit17)
fit18 <- prune(fit.constrain, cp = 0.0017936)
par(mar = rep(0.1, 4))
plot(fit18, branch = 0.3, compress = TRUE)
text(fit18)
I analyzed two newly generated data sets and plotted the complete tree and the pruned tree for each. The two data sets give different models, but for each of them the complete tree is much simpler than the pruned tree selected by the complexity parameter and the one-standard-error rule. The complete tree for each resampled data set is very similar to the model obtained in Q1, while the pruned tree is much more complex. This is probably due to the repeated information introduced by sampling with replacement: copies of the same observation end up in both the training and validation folds of the cross-validation, which makes large trees look better than they would on fresh data, so the selected pruned tree overfits. Below is the debrief:
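A quick sanity check on the duplication mentioned above (standalone, and placed before the set.seed() calls below so it does not affect their results): a bootstrap resample of size n keeps only about 1 - (1 - 1/n)^n, roughly 63.2%, distinct rows, so about a third of each resampled data set consists of repeated observations.
n <- nrow(incomeData)
1 - (1 - 1/n)^n
length(unique(sample(1:n, n, replace = TRUE)))/n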
Below is the first data set, generated with seed 4363. I fit both the complete tree and the pruned tree. The complete tree has 3 splits (4 nodes); the predictors actually used are AGE, HOUSEHOLDER_STATUS and OCCUPATION. The optimal pruned tree is the one with 251 splits (252 nodes); 13 predictors are actually used and its cp is 0.00045442. This tree has too many splits to display. The complete tree is very similar to the tree we obtained in question 1.
# The first generated model for the Income_Data with replacement:
set.seed(4363)
sampler = sample(1:8993, 8993, replace = TRUE)
new.incomeData1 = incomeData[sampler, ]
# Complete tree
fit.new.nonconstrain <- rpart(ANNUAL_INCOME ~ ., data = new.incomeData1, method = "class")
printcp(fit.new.nonconstrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = new.incomeData1, method = "class")
##
## Variables actually used in tree construction:
## [1] AGE HOUSEHOLDER_STATUS OCCUPATION
##
## Root node error: 7152/8993 = 0.8
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.089 0 1.00 1.00 0.0054
## 2 0.027 1 0.91 0.91 0.0059
## 3 0.010 3 0.86 0.86 0.0062
# Prunned tree
fit.new.constrain <- rpart(ANNUAL_INCOME ~ ., data = new.incomeData1, method = "class",
control = rpart.control(xval = 10, cp = 0))
printcp(fit.new.constrain)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = new.incomeData1, method = "class",
## control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] AGE DUAL_INCOMES EDUCATION
## [4] ETHNIC HOUSEHOLD_NUMBER HOUSEHOLD_UNDER_18
## [7] HOUSEHOLDER_STATUS LANGUAGE LIVING_TIME
## [10] MARITAL_STATUS OCCUPATION SEX
## [13] TYPE_OF_HOME
##
## Root node error: 7152/8993 = 0.8
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 8.9e-02 0 1.00 1.00 0.0054
## 2 2.7e-02 1 0.91 0.91 0.0059
## 3 7.4e-03 3 0.86 0.86 0.0062
## 4 6.7e-03 4 0.85 0.85 0.0062
## 5 6.5e-03 8 0.82 0.84 0.0062
## 6 5.3e-03 11 0.80 0.82 0.0063
## 7 4.8e-03 12 0.80 0.81 0.0064
## 8 3.8e-03 13 0.79 0.80 0.0064
## 9 2.1e-03 14 0.79 0.79 0.0064
## 10 2.0e-03 15 0.79 0.80 0.0064
## 11 1.8e-03 21 0.78 0.79 0.0064
## 12 1.7e-03 23 0.77 0.79 0.0064
## 13 1.4e-03 24 0.77 0.78 0.0064
## 14 1.3e-03 26 0.77 0.79 0.0064
## 15 1.2e-03 27 0.77 0.78 0.0064
## 16 1.1e-03 30 0.76 0.78 0.0064
## 17 1.0e-03 32 0.76 0.78 0.0064
## 18 1.0e-03 39 0.75 0.78 0.0064
## 19 9.8e-04 44 0.75 0.78 0.0064
## 20 9.1e-04 50 0.74 0.78 0.0064
## 21 8.7e-04 59 0.73 0.78 0.0064
## 22 8.4e-04 65 0.73 0.77 0.0065
## 23 7.0e-04 77 0.72 0.76 0.0065
## 24 6.5e-04 86 0.71 0.76 0.0065
## 25 6.3e-04 95 0.70 0.76 0.0065
## 26 6.2e-04 99 0.70 0.76 0.0065
## 27 6.1e-04 115 0.69 0.76 0.0065
## 28 5.9e-04 127 0.68 0.76 0.0065
## 29 5.9e-04 139 0.68 0.76 0.0065
## 30 5.9e-04 149 0.67 0.75 0.0065
## 31 5.6e-04 155 0.67 0.75 0.0065
## 32 5.4e-04 204 0.64 0.75 0.0065
## 33 5.2e-04 216 0.63 0.75 0.0065
## 34 5.1e-04 220 0.63 0.75 0.0065
## 35 4.9e-04 223 0.63 0.75 0.0065
## 36 4.7e-04 248 0.62 0.75 0.0065
## 37 4.5e-04 251 0.61 0.74 0.0065
## 38 4.2e-04 255 0.61 0.74 0.0065
## 39 4.0e-04 295 0.59 0.74 0.0065
## 40 3.7e-04 312 0.59 0.73 0.0065
## 41 3.5e-04 322 0.58 0.73 0.0065
## 42 3.4e-04 334 0.58 0.73 0.0065
## 43 3.3e-04 340 0.57 0.73 0.0065
## 44 3.1e-04 349 0.57 0.73 0.0065
## 45 3.1e-04 353 0.57 0.73 0.0065
## 46 2.8e-04 358 0.57 0.72 0.0066
## 47 2.6e-04 436 0.55 0.72 0.0066
## 48 2.4e-04 446 0.54 0.72 0.0066
## 49 2.3e-04 454 0.54 0.72 0.0066
## 50 2.1e-04 459 0.54 0.72 0.0066
## 51 2.0e-04 476 0.54 0.72 0.0066
## 52 1.9e-04 481 0.53 0.72 0.0066
## 53 1.6e-04 493 0.53 0.72 0.0066
## 54 1.4e-04 499 0.53 0.72 0.0066
## 55 7.0e-05 542 0.53 0.72 0.0066
## 56 4.7e-05 552 0.52 0.72 0.0066
## 57 0.0e+00 555 0.52 0.72 0.0066
I also plot the complete tree:
plot(fit.new.nonconstrain)
text(fit.new.nonconstrain, use.n = T)
Below is the second data set, generated with seed 12345. I fit both the complete tree and the pruned tree. The complete tree has 4 splits (5 nodes); the predictors actually used are again AGE, HOUSEHOLDER_STATUS and OCCUPATION. The optimal pruned tree is the one with 377 splits (378 nodes); 13 predictors are actually used and its cp is 0.00027720. This tree has too many splits to display. The complete tree is very similar to the tree from question 1 and to the previous tree from the first resampled data set.
# The second generated model for the Income_Data with replacement:
set.seed(12345)
sampler.2 = sample(1:8993, 8993, replace = TRUE)
new.incomeData2 = incomeData[sampler.2, ]
# Complete tree
fit.new.nonconstrain.2 <- rpart(ANNUAL_INCOME ~ ., data = new.incomeData2, method = "class")
printcp(fit.new.nonconstrain.2)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = new.incomeData2, method = "class")
##
## Variables actually used in tree construction:
## [1] AGE HOUSEHOLDER_STATUS OCCUPATION
##
## Root node error: 7215/8993 = 0.8
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.049 0 1.00 1.00 0.0052
## 2 0.017 2 0.90 0.90 0.0059
## 3 0.010 4 0.87 0.87 0.0060
# Prunned tree
fit.new.constrain.2 <- rpart(ANNUAL_INCOME ~ ., data = new.incomeData2, method = "class",
control = rpart.control(xval = 10, cp = 0))
printcp(fit.new.constrain.2)
##
## Classification tree:
## rpart(formula = ANNUAL_INCOME ~ ., data = new.incomeData2, method = "class",
## control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] AGE DUAL_INCOMES EDUCATION
## [4] ETHNIC HOUSEHOLD_NUMBER HOUSEHOLD_UNDER_18
## [7] HOUSEHOLDER_STATUS LANGUAGE LIVING_TIME
## [10] MARITAL_STATUS OCCUPATION SEX
## [13] TYPE_OF_HOME
##
## Root node error: 7215/8993 = 0.8
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 4.9e-02 0 1.00 1.00 0.0052
## 2 1.7e-02 2 0.90 0.90 0.0059
## 3 7.5e-03 4 0.87 0.87 0.0060
## 4 6.1e-03 6 0.85 0.86 0.0061
## 5 5.1e-03 7 0.85 0.85 0.0061
## 6 4.9e-03 9 0.84 0.85 0.0061
## 7 3.5e-03 10 0.83 0.84 0.0062
## 8 3.3e-03 12 0.82 0.84 0.0062
## 9 3.0e-03 14 0.82 0.84 0.0062
## 10 2.9e-03 16 0.81 0.84 0.0062
## 11 2.6e-03 17 0.81 0.84 0.0062
## 12 2.5e-03 19 0.80 0.83 0.0062
## 13 2.2e-03 21 0.80 0.83 0.0062
## 14 1.9e-03 23 0.79 0.82 0.0062
## 15 1.8e-03 24 0.79 0.82 0.0062
## 16 1.7e-03 25 0.79 0.82 0.0062
## 17 1.4e-03 27 0.79 0.81 0.0063
## 18 1.3e-03 31 0.78 0.81 0.0063
## 19 1.3e-03 35 0.77 0.81 0.0063
## 20 1.2e-03 41 0.77 0.80 0.0063
## 21 1.2e-03 43 0.76 0.80 0.0063
## 22 1.2e-03 47 0.76 0.80 0.0063
## 23 1.1e-03 50 0.76 0.80 0.0063
## 24 1.1e-03 59 0.75 0.80 0.0063
## 25 9.7e-04 62 0.74 0.79 0.0063
## 26 9.4e-04 71 0.73 0.79 0.0063
## 27 9.2e-04 80 0.73 0.79 0.0063
## 28 9.0e-04 83 0.72 0.78 0.0064
## 29 8.3e-04 87 0.72 0.78 0.0064
## 30 7.6e-04 107 0.70 0.78 0.0064
## 31 7.4e-04 111 0.70 0.77 0.0064
## 32 6.9e-04 115 0.70 0.77 0.0064
## 33 6.5e-04 137 0.68 0.76 0.0064
## 34 6.4e-04 140 0.68 0.76 0.0064
## 35 6.2e-04 149 0.67 0.76 0.0064
## 36 6.0e-04 156 0.67 0.75 0.0064
## 37 5.8e-04 159 0.66 0.75 0.0064
## 38 5.7e-04 165 0.66 0.75 0.0064
## 39 5.5e-04 175 0.65 0.75 0.0064
## 40 5.1e-04 207 0.64 0.74 0.0065
## 41 4.9e-04 211 0.63 0.74 0.0065
## 42 4.6e-04 219 0.63 0.74 0.0065
## 43 4.5e-04 226 0.63 0.73 0.0065
## 44 4.2e-04 230 0.63 0.73 0.0065
## 45 3.9e-04 295 0.60 0.73 0.0065
## 46 3.8e-04 301 0.60 0.72 0.0065
## 47 3.7e-04 309 0.59 0.72 0.0065
## 48 3.5e-04 316 0.59 0.72 0.0065
## 49 3.2e-04 355 0.57 0.72 0.0065
## 50 3.1e-04 373 0.57 0.72 0.0065
## 51 2.8e-04 377 0.57 0.71 0.0065
## 52 2.5e-04 430 0.55 0.71 0.0065
## 53 2.3e-04 441 0.55 0.71 0.0065
## 54 2.2e-04 452 0.54 0.71 0.0065
## 55 2.1e-04 459 0.54 0.71 0.0065
## 56 1.8e-04 476 0.54 0.71 0.0065
## 57 1.4e-04 482 0.54 0.71 0.0065
## 58 6.9e-05 536 0.53 0.71 0.0065
## 59 5.5e-05 548 0.53 0.71 0.0065
## 60 4.6e-05 553 0.53 0.71 0.0065
## 61 0.0e+00 565 0.53 0.71 0.0065
I also plot the complete tree:
plot(fit.new.nonconstrain.2)
text(fit.new.nonconstrain.2, use.n = T)
The misclassification rate is 1 - (2349 + 970)/4507 = 0.26359, i.e. one minus the sum of the diagonal of the confusion table divided by the test-set size. The debrief is below:
houseTypeData = read.table("/Users/shijiabian/Dropbox/STATS315B/Data/Housetype_Data.txt",
sep = ",", col.names = c("typeHome", "sex", "maritalStatus", "age", "education",
"occupation", "anualIncome", "lengthLiving", "dualIncome", "personHouse",
"youngPersonHous", "houseStatus", "ethnic", "language"), fill = FALSE,
strip.white = TRUE)
for (i in 1:14) {
houseTypeData[, i] = as.factor(houseTypeData[, i])
}
summary(houseTypeData)
## typeHome sex maritalStatus age education occupation
## 1:5319 1:4043 1 :3351 1: 870 1 : 279 1 :2798
## 2: 674 2:4970 2 : 667 2:2151 2 :1050 6 :1475
## 3:2454 3 : 856 3:2244 3 :2052 4 :1049
## 4: 162 4 : 312 4:1608 4 :3036 2 : 786
## 5: 404 5 :3669 5: 909 5 :1525 3 : 755
## NA's: 158 6: 651 6 : 978 (Other):1994
## 7: 580 NA's: 93 NA's : 156
## anualIncome lengthLiving dualIncome personHouse youngPersonHous
## 1 :1650 1 : 276 1:5439 2 :2670 0 :5695
## 8 :1274 2 :1034 2:2217 3 :1685 1 :1523
## 6 :1074 3 : 681 3:1357 1 :1600 2 :1163
## 7 : 943 4 : 890 4 :1552 3 : 423
## 9 : 862 5 :5190 5 : 696 4 : 122
## (Other):2833 NA's: 942 (Other): 479 5 : 47
## NA's : 377 NA's : 331 (Other): 40
## houseStatus ethnic language
## 1 :3304 7 :5845 1 :7803
## 2 :3664 5 :1235 2 : 591
## 3 :1851 3 : 893 3 : 260
## NA's: 194 2 : 471 NA's: 359
## 8 : 238
## (Other): 270
## NA's : 61
set.seed(2)
train = sample(1:nrow(houseTypeData), 4506)
houseTypeData.test = houseTypeData[-train, ]
fit.train <- rpart(typeHome ~ ., data = houseTypeData, subset = train, method = "class",
control = rpart.control(xval = 10, cp = 0))
printcp(fit.train)
##
## Classification tree:
## rpart(formula = typeHome ~ ., data = houseTypeData, subset = train,
## method = "class", control = rpart.control(xval = 10, cp = 0))
##
## Variables actually used in tree construction:
## [1] age anualIncome dualIncome education
## [5] ethnic houseStatus lengthLiving maritalStatus
## [9] occupation personHouse sex youngPersonHous
##
## Root node error: 1848/4506 = 0.41
##
## n= 4506
##
## CP nsplit rel error xerror xstd
## 1 0.32630 0 1.00 1.00 0.018
## 2 0.02868 1 0.67 0.67 0.016
## 3 0.01569 2 0.65 0.66 0.016
## 4 0.00595 3 0.63 0.64 0.016
## 5 0.00244 4 0.62 0.63 0.016
## 6 0.00216 7 0.61 0.63 0.016
## 7 0.00186 9 0.61 0.63 0.016
## 8 0.00162 17 0.59 0.62 0.016
## 9 0.00135 21 0.59 0.62 0.016
## 10 0.00108 23 0.58 0.62 0.016
## 11 0.00090 24 0.58 0.64 0.016
## 12 0.00087 27 0.58 0.64 0.016
## 13 0.00081 32 0.58 0.65 0.016
## 14 0.00079 34 0.58 0.65 0.016
## 15 0.00072 47 0.56 0.66 0.016
## 16 0.00054 50 0.56 0.66 0.016
## 17 0.00047 66 0.55 0.69 0.016
## 18 0.00036 93 0.54 0.69 0.016
## 19 0.00032 96 0.54 0.69 0.016
## 20 0.00027 104 0.53 0.70 0.016
## 21 0.00014 108 0.53 0.71 0.017
## 22 0.00011 120 0.53 0.71 0.017
## 23 0.00000 125 0.53 0.72 0.017
# Prune with a cp value between two adjacent rows of the cp table (their geometric mean)
cptarg = sqrt(fit.train$cptable[2, 1] * fit.train$cptable[3, 1])
prunedtree = prune(fit.train, cp = cptarg)
# Visualize the tree
par(mfrow = c(1, 1), mar = c(1, 1, 1, 1))
plot(prunedtree, uniform = T, compress = T, margin = 0.1, branch = 0.3)
text(prunedtree, use.n = T, digits = 3, cex = 0.6)
fit.preds = predict(prunedtree, newdata = houseTypeData.test, type = "class")
fit.table = table(houseTypeData.test$typeHome, fit.preds)
fit.table
## fit.preds
## 1 2 3 4 5
## 1 2349 0 312 0 0
## 2 227 0 116 0 0
## 3 256 0 970 0 0
## 4 64 0 15 0 0
## 5 104 0 94 0 0
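As with the income model, the misclassification rate quoted above follows directly from this table; the per-class accuracies also show that this small tree only ever predicts home types 1 and 3:
1 - sum(diag(fit.table))/sum(fit.table)
diag(fit.table)/rowSums(fit.table)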
This question is attached on a separate piece of paper.
Discuss the relative advantages/disadvantages of this strategy relative to the surrogate splitting strategy?
Advantage: if there is minimal missing data or a sufficiently large sample size, and there is no structure or pattern to the missingness (i.e. the values are missing completely at random), this strategy is fast and efficient to compute.
Disadvantage: the impact of this strategy depends on how the missing values are divided among the real categories and on how the probability of a value being missing depends on the other variables; very dissimilar classes can be lumped into one group; severe bias can arise, in either direction, and when the completed categorical variable is used to stratify for adjustment (or to correct for confounding) it will not do its job properly.
Does this strategy allow a “surrogate effect” by encouraging variables within highly correlated sets to substitute for each other in the prediction process? If so, how (by what mechanism)?
Yes. This is realized through the surrogate splitting mechanism during the greedy algorithm.
Can this new strategy be used if there are no missing values in the training set? If not, what can be done so as to enable the tree to predict with missing values in future data?
No. One can randomly select some observations of each predictor and set them to missing (following the MCAR criterion), then treat these missing values as additional categories and fit the CART model on this modified training set so that it can predict future data with missing values. However, this requires that the missing values in the test set be independent of the response and of the other predictors.
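A minimal sketch of this procedure, with arbitrary choices of my own (5% of each predictor set to missing, the new level named "Missing"); it is only meant to show the mechanics, not a tuned analysis:
set.seed(1)
incomeData.mcar <- incomeData
for (j in 2:ncol(incomeData.mcar)) {
    # MCAR: drop 5% of the values of predictor j at random
    drop <- sample(nrow(incomeData.mcar), round(0.05 * nrow(incomeData.mcar)))
    incomeData.mcar[drop, j] <- NA
    # turn NA into an explicit factor level called 'Missing'
    incomeData.mcar[, j] <- addNA(incomeData.mcar[, j])
    levels(incomeData.mcar[, j])[is.na(levels(incomeData.mcar[, j]))] <- "Missing"
}
fit.mcar <- rpart(ANNUAL_INCOME ~ ., data = incomeData.mcar, method = "class")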