library(dplyr)
library(rpart)
library(rpart.plot)
IncomeData <- read.csv(
  "~/materials/minor/lab09/Income_Data.txt",
  header = FALSE,
  col.names = c("income", "sex", "martial_status", "age", "edu", "ocupation",
                "living_time", "dual_incomes", "pers_in_house",
                "pers_in_house_under18", "householder_status", "type_of_home",
                "ethnic_cl", "lang")
)
IncomeData <- within(IncomeData, {
  # Nominal variables become unordered factors
  sex <- factor(sex)
  martial_status <- factor(martial_status)
  edu <- factor(edu)
  ocupation <- factor(ocupation)
  living_time <- factor(living_time)
  dual_incomes <- factor(dual_incomes)
  householder_status <- factor(householder_status)
  type_of_home <- factor(type_of_home)
  ethnic_cl <- factor(ethnic_cl)
  lang <- factor(lang)
  # Variables with a natural order become ordered factors
  age <- ordered(age)
  pers_in_house <- ordered(pers_in_house)
  pers_in_house_under18 <- ordered(pers_in_house_under18)
})
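1. Build a regression tree with the “anova” method. At each node this method decomposes the sum of squares into a between-group part, SST (Sum of Squares for Treatments), and a within-group part, SSE (Sum of Squares for Error), in order to assess the total variation explained by the model. In other words, “anova” compares the mean squares associated with SST and SSE and prefers the split with the largest between-group share.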
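To make the criterion concrete, the following minimal sketch computes the sum-of-squares reduction for one candidate split by hand. The vectors `y` and `left` and the helper `ss` are made up for illustration and are not part of the income data.

```r
# Toy response values and one candidate binary split (hypothetical data)
y    <- c(1, 2, 2, 3, 7, 8, 8, 9)
left <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Sum of squared deviations from the group mean
ss <- function(v) sum((v - mean(v))^2)

ss_total <- ss(y)          # sum of squares of the parent node
ss_left  <- ss(y[left])    # sum of squares of the left child
ss_right <- ss(y[!left])   # sum of squares of the right child

gain  <- ss_total - (ss_left + ss_right)  # between-group sum of squares
share <- gain / ss_total                  # fraction of the node's variation removed
print(c(total = ss_total, within = ss_left + ss_right, gain = gain, share = share))
```

A split that separates low and high values cleanly, as here, leaves very little within-group variation, so almost all of the parent node's sum of squares shows up as gain.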
Take an arbitrary set of predictors that seem intuitively relevant: sex, time lived in the region (SAN FRAN./OAKLAND/SAN JOSE) (`living_time`), number of household members under 18 (`pers_in_house_under18`), type of home (`type_of_home`), ethnicity (`ethnic_cl`) and language (`lang`).
random.income.tree <- rpart(
  income ~ sex + living_time + pers_in_house_under18 + type_of_home + ethnic_cl + lang,
  data = IncomeData, method = "anova"
)
prp(random.income.tree)
printcp(random.income.tree)
##
## Regression tree:
## rpart(formula = income ~ sex + living_time + pers_in_house_under18 +
## type_of_home + ethnic_cl + lang, data = IncomeData, method = "anova")
##
## Variables actually used in tree construction:
## [1] ethnic_cl type_of_home
##
## Root node error: 69179/8993 = 7.6925
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.048174 0 1.00000 1.00015 0.0082592
## 2 0.019783 1 0.95183 0.95229 0.0087303
## 3 0.010000 2 0.93204 0.93491 0.0089279
The resulting tree has only 2 internal nodes, which would make it a compact and informative model if its error (xerror) were low. In this model household income depends first on the type of home and then on ethnicity. Each split divides the data into two parts.
However, the lowest cross-validated error, reached at the smallest cp = 0.01, is about 0.935 (0.93491 in the output above; the value changes slightly from run to run because the cross-validation folds are random). This means that random.income.tree is a very poor model for predicting household income in this dataset.
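The relative errors in the cp table can be converted back to the units of the response. The arithmetic below uses only numbers printed above (the root node error 69179/8993 and the last-row xerror); the variable names are illustrative.

```r
root_mse <- 69179 / 8993   # root node error reported by printcp
xerror   <- 0.93491        # cross-validated relative error in the last row

cv_mse   <- xerror * root_mse  # cross-validated MSE on the income scale
cv_share <- 1 - xerror         # share of variance explained under cross-validation

print(round(c(root_mse = root_mse, cv_mse = cv_mse, cv_share = cv_share), 4))
```

Read this way, the tree explains only about 6.5% of the variance in income under cross-validation, which is what makes it a weak predictive model.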
2. Build the regression tree simple.income.tree on all 13 predictors (every column except income) and assess how well it predicts household income in this dataset.
The lowest xerror, at cp = 0.01, is about 0.576 (0.57585 in the output below), far below that of the previous tree (about 0.935).
However, the tree has grown: it now has 8 internal nodes and relies on entirely different variables. It splits on householder_status, age, martial_status and ocupation; the variable importance ranking, which also counts surrogate splits, is (in decreasing order) householder_status, age, martial_status, ocupation, dual_incomes, edu, pers_in_house.
simple.income.tree <- rpart(income~., data = IncomeData, method = "anova")
prp(simple.income.tree)
printcp(simple.income.tree)
##
## Regression tree:
## rpart(formula = income ~ ., data = IncomeData, method = "anova")
##
## Variables actually used in tree construction:
## [1] age householder_status martial_status
## [4] ocupation
##
## Root node error: 69179/8993 = 7.6925
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.223922 0 1.00000 1.00019 0.0082597
## 2 0.080770 1 0.77608 0.77631 0.0088516
## 3 0.044728 2 0.69531 0.69571 0.0091812
## 4 0.030699 3 0.65058 0.65109 0.0093115
## 5 0.013778 4 0.61988 0.62050 0.0089310
## 6 0.013013 5 0.60610 0.60811 0.0089090
## 7 0.012217 6 0.59309 0.60139 0.0089203
## 8 0.011860 7 0.58087 0.59228 0.0088108
## 9 0.010000 8 0.56901 0.57585 0.0082853
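The difference between variables that appear in splits and variables that only enter the importance ranking can be checked on a fitted object. The sketch below uses a small synthetic data frame (`toy`, with made-up columns `x1`, `x2`, `x3`, `y`) rather than the income data; it reads the split variables from `fit$frame` and rescales `fit$variable.importance` to percentages, which is assumed to mirror the rounded values shown by `summary()`.

```r
library(rpart)

set.seed(42)
n <- 300
# Synthetic data: x1 drives most of the response, x2 a little, x3 is noise
toy <- data.frame(
  x1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x2 = factor(sample(c("u", "v"), n, replace = TRUE)),
  x3 = rnorm(n)
)
toy$y <- ifelse(toy$x1 == "a", 5, 2) + ifelse(toy$x2 == "u", 1, 0) +
  rnorm(n, sd = 0.5)

fit <- rpart(y ~ ., data = toy, method = "anova")

# Variables that actually define a split (leaves are labelled "<leaf>")
used <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")

# Importance also credits surrogate splits, so it may list more variables
imp     <- fit$variable.importance
imp_pct <- round(100 * imp / sum(imp))

print(used)
print(imp_pct)
```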
3. Build trees on all 13 predictors with cp fixed at several hand-picked values (chosen from the plot of xerror against cp).
plotcp(simple.income.tree)
Take cp equal to \(10^{-3}\) (ten times smaller than the cp = 0.01 at which simple.income.tree stopped), \(0.012\) (the value suggested by the plot of xerror against cp) and, for comparison, the suggested value \(0.2\).
| Tree | cp | min{xerror} | internal nodes |
|---|---|---|---|
| cp1.income.tree | \(10^{-3}\) | 0.52920 | 33 |
| cp2.income.tree | \(0.012\) | 0.59077 | 7 |
| cp3.income.tree | \(0.2\) | 0.77627 | 1 |
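A table like this can also be assembled programmatically instead of copying numbers from each printout. The following is a minimal sketch on synthetic data (the data frame `toy` and its columns `g`, `h`, `z`, `y` are invented for the example); it fits one tree per cp value and collects the lowest xerror and the number of splits from each `cptable`.

```r
library(rpart)

set.seed(1)
n <- 400
# Synthetic data with one strong and one weaker categorical effect
toy <- data.frame(
  g = factor(sample(letters[1:4], n, replace = TRUE)),
  h = factor(sample(c("yes", "no"), n, replace = TRUE)),
  z = runif(n)
)
toy$y <- 3 * (toy$h == "yes") + 0.5 * as.integer(toy$g) + rnorm(n, sd = 0.5)

cp_grid <- c(0.001, 0.012, 0.2)

compare <- do.call(rbind, lapply(cp_grid, function(cp) {
  fit <- rpart(y ~ ., data = toy, method = "anova",
               control = rpart.control(cp = cp))
  tab <- fit$cptable
  # Guard in case the table carries no cross-validation columns
  min_xerror <- if ("xerror" %in% colnames(tab)) min(tab[, "xerror"]) else NA_real_
  data.frame(cp = cp,
             min_xerror = min_xerror,
             nsplit = tab[nrow(tab), "nsplit"])
}))

print(compare)
```

Because xerror comes from random cross-validation folds, fixing the seed before fitting keeps such a table reproducible between runs.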
cp1.income.tree <- rpart(income ~ ., data = IncomeData, control = rpart.control(cp = 10^-3), method = "anova")
prp(cp1.income.tree)
printcp(cp1.income.tree)
##
## Regression tree:
## rpart(formula = income ~ ., data = IncomeData, method = "anova",
## control = rpart.control(cp = 10^-3))
##
## Variables actually used in tree construction:
## [1] age dual_incomes edu
## [4] ethnic_cl householder_status martial_status
## [7] ocupation pers_in_house pers_in_house_under18
## [10] sex type_of_home
##
## Root node error: 69179/8993 = 7.6925
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.2239220 0 1.00000 1.00009 0.0082578
## 2 0.0807698 1 0.77608 0.77629 0.0088532
## 3 0.0447277 2 0.69531 0.69562 0.0091810
## 4 0.0306987 3 0.65058 0.65101 0.0093114
## 5 0.0137779 4 0.61988 0.62049 0.0089313
## 6 0.0130133 5 0.60610 0.61389 0.0089202
## 7 0.0122172 6 0.59309 0.59817 0.0088663
## 8 0.0118599 7 0.58087 0.59272 0.0088405
## 9 0.0075503 8 0.56901 0.57391 0.0083000
## 10 0.0060297 9 0.56146 0.56986 0.0083057
## 11 0.0058936 10 0.55543 0.56619 0.0082807
## 12 0.0058179 11 0.54954 0.56551 0.0082784
## 13 0.0056115 12 0.54372 0.56180 0.0082696
## 14 0.0052484 13 0.53811 0.55411 0.0081725
## 15 0.0029893 14 0.53286 0.54043 0.0080844
## 16 0.0021926 15 0.52987 0.53769 0.0080957
## 17 0.0020195 16 0.52768 0.54034 0.0081722
## 18 0.0020188 17 0.52566 0.53992 0.0081607
## 19 0.0019891 18 0.52364 0.53987 0.0081660
## 20 0.0018678 19 0.52165 0.53808 0.0081582
## 21 0.0018067 20 0.51979 0.53602 0.0081379
## 22 0.0016202 21 0.51798 0.53369 0.0081602
## 23 0.0015428 22 0.51636 0.53087 0.0081333
## 24 0.0013878 23 0.51482 0.53132 0.0081386
## 25 0.0013376 25 0.51204 0.53136 0.0081768
## 26 0.0013004 26 0.51070 0.53094 0.0081778
## 27 0.0012727 27 0.50940 0.53077 0.0081856
## 28 0.0012212 28 0.50813 0.53118 0.0082166
## 29 0.0011047 29 0.50691 0.53021 0.0082510
## 30 0.0010760 30 0.50580 0.52967 0.0082515
## 31 0.0010668 31 0.50473 0.52920 0.0082433
## 32 0.0010000 33 0.50259 0.52923 0.0082619
plotcp(cp1.income.tree)
cp2.income.tree <- rpart(income ~ ., data = IncomeData, control = rpart.control(cp = 0.012), method = "anova")
prp(cp2.income.tree)
printcp(cp2.income.tree)
##
## Regression tree:
## rpart(formula = income ~ ., data = IncomeData, method = "anova",
## control = rpart.control(cp = 0.012))
##
## Variables actually used in tree construction:
## [1] age householder_status martial_status
## [4] ocupation
##
## Root node error: 69179/8993 = 7.6925
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.223922 0 1.00000 1.00019 0.0082582
## 2 0.080770 1 0.77608 0.77642 0.0088543
## 3 0.044728 2 0.69531 0.69590 0.0091874
## 4 0.030699 3 0.65058 0.65125 0.0093169
## 5 0.013778 4 0.61988 0.62075 0.0089383
## 6 0.013013 5 0.60610 0.60974 0.0089214
## 7 0.012217 6 0.59309 0.60158 0.0088589
## 8 0.012000 7 0.58087 0.59077 0.0086969
plotcp(cp2.income.tree)
cp3.income.tree <- rpart(income ~ ., data = IncomeData, control = rpart.control(cp = 0.2), method = "anova")
prp(cp3.income.tree)
printcp(cp3.income.tree)
##
## Regression tree:
## rpart(formula = income ~ ., data = IncomeData, method = "anova",
## control = rpart.control(cp = 0.2))
##
## Variables actually used in tree construction:
## [1] householder_status
##
## Root node error: 69179/8993 = 7.6925
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.22392 0 1.00000 1.00018 0.0082593
## 2 0.20000 1 0.77608 0.77627 0.0088523
plotcp(cp3.income.tree)
Setting cp = 0.2 raises the cross-validated error sharply compared with the other trees (xerror ≈ 0.776). Moreover, this value is close to the largest cp of simple.income.tree (0.223922), which corresponds to the root-only tree with xerror ≈ 1.000.
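The columns of the cp table are tied together: from one row to the next, the relative error falls by the CP of the earlier row times the number of splits added. A quick check with the values printed for simple.income.tree (copied by hand, so agreement is only up to rounding):

```r
# Columns of the cp table of simple.income.tree, as printed
cp      <- c(0.223922, 0.080770, 0.044728, 0.030699, 0.013778,
             0.013013, 0.012217, 0.011860, 0.010000)
nsplit  <- 0:8
rel_err <- c(1.00000, 0.77608, 0.69531, 0.65058, 0.61988,
             0.60610, 0.59309, 0.58087, 0.56901)

k <- length(cp)
# Relative error of each row reconstructed from the row before it
reconstructed <- rel_err[-k] - cp[-k] * diff(nsplit)

print(round(reconstructed, 5))
print(max(abs(reconstructed - rel_err[-1])))  # only rounding noise should remain
```

In particular, any cp between 0.080770 and 0.223922, such as 0.2, keeps only the first split, so the relative error is 1 - 0.223922 ≈ 0.776.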
The xerror of cp1.income.tree at cp = 0.001, about 0.529, is the lowest of all the trees obtained. Pruning this tree at the cp with the lowest xerror can simplify it without giving up that error level; it cannot push xerror below the minimum already listed in the cp table. (Recall that fixing cp at \(10^{-3}\) already lowered xerror from about 0.576 to about 0.529.)
4. Check the variability of the fitted model.
set.seed(1)
# Hold out half of the rows: fit on `train`, evaluate on the rest
train <- sample(1:nrow(IncomeData), nrow(IncomeData) / 2)
train.tree <- rpart(income ~ ., data = IncomeData, subset = train, method = "anova")
prp(train.tree)
predicted.value <- predict(train.tree, newdata = IncomeData[-train, ])
real.income <- IncomeData[-train, "income"]
The plot below shows that predicting income with this model does not give an unambiguous result.
plot(predicted.value,real.income)
abline(0,1)
However, the MSE here drops to 4.468168, compared with the largest node MSE in simple.income.tree (9.016954 for Node number 9). The corresponding root-mean-square error is \(4.468168^{0.5} \approx 2.114\), so predicted household income is typically off by about two income categories.
mean((predicted.value-real.income)^2)
## [1] 4.468168
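A hold-out MSE is easier to judge next to a baseline. The sketch below defines an illustrative helper, `holdout_summary` (not part of rpart or of the original analysis), that reports MSE, RMSE and the error of always predicting a constant; the toy vectors stand in for `predicted.value` and `real.income`.

```r
# Summarise hold-out error against a constant-prediction baseline
holdout_summary <- function(predicted, actual, train_mean) {
  mse      <- mean((predicted - actual)^2)
  baseline <- mean((train_mean - actual)^2)  # always predict the training mean
  c(mse = mse, rmse = sqrt(mse), baseline_mse = baseline,
    r2 = 1 - mse / baseline)
}

# Toy vectors (hypothetical values, not the income data)
actual    <- c(1, 3, 5, 7, 9)
predicted <- c(2, 3, 4, 8, 8)
res <- holdout_summary(predicted, actual, train_mean = 5)
print(res)

# RMSE implied by the MSE printed above
rmse_income <- sqrt(4.468168)
print(rmse_income)
```

For the income tree, `train_mean` would be `mean(IncomeData[train, "income"])`, so the baseline uses only information available at training time.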
summary(simple.income.tree)
## Call:
## rpart(formula = income ~ ., data = IncomeData, method = "anova")
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.22392203 0 1.0000000 1.0001855 0.008259672
## 2 0.08076978 1 0.7760780 0.7763120 0.008851600
## 3 0.04472770 2 0.6953082 0.6957114 0.009181190
## 4 0.03069866 3 0.6505805 0.6510856 0.009311537
## 5 0.01377787 4 0.6198818 0.6204977 0.008931006
## 6 0.01301330 5 0.6061040 0.6081105 0.008909027
## 7 0.01221720 6 0.5930907 0.6013947 0.008920293
## 8 0.01185992 7 0.5808735 0.5922766 0.008810843
## 9 0.01000000 8 0.5690135 0.5758513 0.008285319
##
## Variable importance
## householder_status age martial_status
## 28 21 15
## ocupation dual_incomes edu
## 14 14 7
## pers_in_house
## 2
##
## Node number 1: 8993 observations, complexity param=0.223922
## mean=4.895029, MSE=7.692528
## left son=2 (5646 obs) right son=3 (3347 obs)
## Primary splits:
## householder_status splits as RLL, improve=0.2243474, (240 missing)
## age splits as LLRRRRR, improve=0.2121074, (0 missing)
## martial_status splits as RLLLL, improve=0.2060636, (160 missing)
## dual_incomes splits as LRR, improve=0.1915465, (0 missing)
## ocupation splits as RLLLRLLRL, improve=0.1732841, (136 missing)
## Surrogate splits:
## age splits as LLLRRRR, agree=0.769, adj=0.378, (240 split)
## martial_status splits as RLLRL, agree=0.755, adj=0.341, (0 split)
## dual_incomes splits as LRR, agree=0.742, adj=0.307, (0 split)
## ocupation splits as RLLLRLLRL, agree=0.681, adj=0.143, (0 split)
## edu splits as LLLLLR, agree=0.647, adj=0.051, (0 split)
##
## Node number 2: 5646 observations, complexity param=0.08076978
## mean=3.88452, MSE=6.656165
## left son=4 (853 obs) right son=5 (4793 obs)
## Primary splits:
## age splits as LRRRRRR, improve=0.14868170, (0 missing)
## ocupation splits as RLLLLLLLL, improve=0.13472560, (95 missing)
## edu splits as LLRRRR, improve=0.12592370, (57 missing)
## martial_status splits as RRRLL, improve=0.08387126, (107 missing)
## householder_status splits as -RL, improve=0.07570331, (149 missing)
## Surrogate splits:
## edu splits as LLRRRR, agree=0.919, adj=0.461, (0 split)
##
## Node number 3: 3347 observations, complexity param=0.03069866
## mean=6.599641, MSE=4.812525
## left son=6 (918 obs) right son=7 (2429 obs)
## Primary splits:
## martial_status splits as RRLLL, improve=0.13684810, (53 missing)
## ocupation splits as RLLLRLLLL, improve=0.12023850, (41 missing)
## dual_incomes splits as LRR, improve=0.11457570, (0 missing)
## edu splits as LLLLRR, improve=0.09346821, (29 missing)
## pers_in_house splits as LRRRRRRRR, improve=0.05325719, (97 missing)
## Surrogate splits:
## dual_incomes splits as LRR, agree=0.955, adj=0.838, (53 split)
## pers_in_house splits as LRRRRRRRR, agree=0.834, adj=0.396, (0 split)
## age splits as LLRRRRR, agree=0.758, adj=0.121, (0 split)
## ocupation splits as RRRRRLLRR, agree=0.725, adj=0.002, (0 split)
## pers_in_house_under18 splits as RRRRRRRRRL, agree=0.725, adj=0.002, (0 split)
##
## Node number 4: 853 observations, complexity param=0.01185992
## mean=1.526377, MSE=2.844849
## left son=8 (689 obs) right son=9 (164 obs)
## Primary splits:
## ocupation splits as LRRRLLRRR, improve=0.336137600, (17 missing)
## householder_status splits as -RL, improve=0.032457790, (5 missing)
## martial_status splits as LRL-L, improve=0.017270670, (31 missing)
## dual_incomes splits as LLR, improve=0.010735450, (0 missing)
## type_of_home splits as LLLRR, improve=0.008932559, (47 missing)
## Surrogate splits:
## householder_status splits as -RL, agree=0.828, adj=0.122, (17 split)
##
## Node number 5: 4793 observations, complexity param=0.0447277
## mean=4.304194, MSE=6.168681
## left son=10 (3399 obs) right son=11 (1394 obs)
## Primary splits:
## ocupation splits as RLLLLLLLL, improve=0.10421770, (78 missing)
## age splits as LLRRRRR, improve=0.05642667, (0 missing)
## dual_incomes splits as LRL, improve=0.05346468, (0 missing)
## edu splits as LLLLRR, improve=0.05227030, (43 missing)
## martial_status splits as RLLLL, improve=0.05186097, (76 missing)
## Surrogate splits:
## edu splits as LLLLRR, agree=0.744, adj=0.124, (73 split)
##
## Node number 6: 918 observations, complexity param=0.0122172
## mean=5.303922, MSE=6.104799
## left son=12 (555 obs) right son=13 (363 obs)
## Primary splits:
## ocupation splits as RLLLLLLLL, improve=0.15307880, (12 missing)
## edu splits as LLLLRR, improve=0.08517863, (10 missing)
## age splits as RRRRRRL, improve=0.05236809, (0 missing)
## type_of_home splits as RRRLR, improve=0.02903844, (39 missing)
## sex splits as RL, improve=0.02451025, (0 missing)
## Surrogate splits:
## edu splits as LLLLRR, agree=0.705, adj=0.264, (12 split)
##
## Node number 7: 2429 observations, complexity param=0.01377787
## mean=7.089337, MSE=3.44982
## left son=14 (1340 obs) right son=15 (1089 obs)
## Primary splits:
## ocupation splits as RLLLLLLLL, improve=0.11597370, (29 missing)
## edu splits as LLLRRR, improve=0.10846340, (19 missing)
## age splits as RRRRRRL, improve=0.05258903, (0 missing)
## dual_incomes splits as LRL, improve=0.04283990, (0 missing)
## type_of_home splits as RRLLR, improve=0.03380694, (71 missing)
## Surrogate splits:
## dual_incomes splits as LRL, agree=0.684, adj=0.292, (29 split)
## edu splits as LLLLRR, agree=0.673, adj=0.266, (0 split)
## age splits as RRRRRLL, agree=0.610, adj=0.126, (0 split)
## sex splits as RL, agree=0.587, adj=0.074, (0 split)
## martial_status splits as LR---, agree=0.557, adj=0.007, (0 split)
##
## Node number 8: 689 observations
## mean=1.047896, MSE=0.1849339
##
## Node number 9: 164 observations
## mean=3.536585, MSE=9.016954
##
## Node number 10: 3399 observations, complexity param=0.0130133
## mean=3.789644, MSE=6.022535
## left son=20 (2631 obs) right son=21 (768 obs)
## Primary splits:
## martial_status splits as RLLLL, improve=0.04750783, (66 missing)
## dual_incomes splits as LRL, improve=0.03465388, (0 missing)
## type_of_home splits as RRLLL, improve=0.02513209, (154 missing)
## age splits as LLRRRRR, improve=0.01889926, (0 missing)
## ocupation splits as -RRRRRRLL, improve=0.01645735, (62 missing)
## Surrogate splits:
## dual_incomes splits as LRR, agree=0.957, adj=0.807, (66 split)
## ocupation splits as -LLLRLLLL, agree=0.781, adj=0.008, (0 split)
##
## Node number 11: 1394 observations
## mean=5.558824, MSE=4.305363
##
## Node number 12: 555 observations
## mean=4.527928, MSE=6.202373
##
## Node number 13: 363 observations
## mean=6.490358, MSE=3.627318
##
## Node number 14: 1340 observations
## mean=6.524627, MSE=3.982229
##
## Node number 15: 1089 observations
## mean=7.784206, MSE=1.919457
##
## Node number 20: 2631 observations
## mean=3.511593, MSE=5.961002
##
## Node number 21: 768 observations
## mean=4.742188, MSE=5.061137
5. Prune the tree simple.income.tree.
pruned.tree1 <- prune(simple.income.tree, simple.income.tree$cptable[which.min(simple.income.tree$cptable[,"xerror"]),"CP"])
printcp(pruned.tree1)
##
## Regression tree:
## rpart(formula = income ~ ., data = IncomeData, method = "anova")
##
## Variables actually used in tree construction:
## [1] age householder_status martial_status
## [4] ocupation
##
## Root node error: 69179/8993 = 7.6925
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.223922 0 1.00000 1.00019 0.0082597
## 2 0.080770 1 0.77608 0.77631 0.0088516
## 3 0.044728 2 0.69531 0.69571 0.0091812
## 4 0.030699 3 0.65058 0.65109 0.0093115
## 5 0.013778 4 0.61988 0.62050 0.0089310
## 6 0.013013 5 0.60610 0.60811 0.0089090
## 7 0.012217 6 0.59309 0.60139 0.0089203
## 8 0.011860 7 0.58087 0.59228 0.0088108
## 9 0.010000 8 0.56901 0.57585 0.0082853
prp(pruned.tree1)
After pruning simple.income.tree at the cp with the lowest xerror, the tree did not change at all: the lowest xerror already belongs to the last row of the cp table (cp = 0.01), so there is nothing to cut. Pruning gives no benefit here, neither for predicting household income nor for reducing xerror.
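Why nothing changes can be seen from the selection step alone. The snippet below repeats the `which.min` lookup on the xerror and CP columns of simple.income.tree, copied by hand from the printed table, so it runs without the data:

```r
# xerror and CP columns of simple.income.tree, as printed above
xerror <- c(1.00019, 0.77631, 0.69571, 0.65109, 0.62050,
            0.60811, 0.60139, 0.59228, 0.57585)
cp     <- c(0.223922, 0.080770, 0.044728, 0.030699, 0.013778,
            0.013013, 0.012217, 0.011860, 0.010000)

best <- which.min(xerror)         # row with the lowest cross-validated error
is_last_row <- best == length(xerror)

print(c(row = best, cp = cp[best]))
print(is_last_row)                # TRUE means the full tree is already the choice
```

When the selected row is the last one, the chosen cp equals the cp the tree was grown with, and pruning returns the same tree.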
6. Prune cp1.income.tree, the best tree in terms of cross-validation.
pruned.tree2 <- prune(cp1.income.tree, cp1.income.tree$cptable[which.min(cp1.income.tree$cptable[,"xerror"]),"CP"])
printcp(pruned.tree2)
##
## Regression tree:
## rpart(formula = income ~ ., data = IncomeData, method = "anova",
## control = rpart.control(cp = 10^-3))
##
## Variables actually used in tree construction:
## [1] age dual_incomes edu
## [4] ethnic_cl householder_status martial_status
## [7] ocupation pers_in_house sex
## [10] type_of_home
##
## Root node error: 69179/8993 = 7.6925
##
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.2239220 0 1.00000 1.00009 0.0082578
## 2 0.0807698 1 0.77608 0.77629 0.0088532
## 3 0.0447277 2 0.69531 0.69562 0.0091810
## 4 0.0306987 3 0.65058 0.65101 0.0093114
## 5 0.0137779 4 0.61988 0.62049 0.0089313
## 6 0.0130133 5 0.60610 0.61389 0.0089202
## 7 0.0122172 6 0.59309 0.59817 0.0088663
## 8 0.0118599 7 0.58087 0.59272 0.0088405
## 9 0.0075503 8 0.56901 0.57391 0.0083000
## 10 0.0060297 9 0.56146 0.56986 0.0083057
## 11 0.0058936 10 0.55543 0.56619 0.0082807
## 12 0.0058179 11 0.54954 0.56551 0.0082784
## 13 0.0056115 12 0.54372 0.56180 0.0082696
## 14 0.0052484 13 0.53811 0.55411 0.0081725
## 15 0.0029893 14 0.53286 0.54043 0.0080844
## 16 0.0021926 15 0.52987 0.53769 0.0080957
## 17 0.0020195 16 0.52768 0.54034 0.0081722
## 18 0.0020188 17 0.52566 0.53992 0.0081607
## 19 0.0019891 18 0.52364 0.53987 0.0081660
## 20 0.0018678 19 0.52165 0.53808 0.0081582
## 21 0.0018067 20 0.51979 0.53602 0.0081379
## 22 0.0016202 21 0.51798 0.53369 0.0081602
## 23 0.0015428 22 0.51636 0.53087 0.0081333
## 24 0.0013878 23 0.51482 0.53132 0.0081386
## 25 0.0013376 25 0.51204 0.53136 0.0081768
## 26 0.0013004 26 0.51070 0.53094 0.0081778
## 27 0.0012727 27 0.50940 0.53077 0.0081856
## 28 0.0012212 28 0.50813 0.53118 0.0082166
## 29 0.0011047 29 0.50691 0.53021 0.0082510
## 30 0.0010760 30 0.50580 0.52967 0.0082515
## 31 0.0010668 31 0.50473 0.52920 0.0082433
prp(pruned.tree2)
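Taking the row with the lowest xerror is one common choice. Another, which this analysis does not use, is the one-standard-error rule: pick the smallest tree whose xerror lies within one xstd of the minimum (plotcp draws a horizontal line at that level). A sketch with the xerror column of cp1.income.tree copied by hand from its cp table:

```r
# xerror column of cp1.income.tree (32 rows), as printed
xerror <- c(1.00009, 0.77629, 0.69562, 0.65101, 0.62049, 0.61389, 0.59817,
            0.59272, 0.57391, 0.56986, 0.56619, 0.56551, 0.56180, 0.55411,
            0.54043, 0.53769, 0.54034, 0.53992, 0.53987, 0.53808, 0.53602,
            0.53369, 0.53087, 0.53132, 0.53136, 0.53094, 0.53077, 0.53118,
            0.53021, 0.52967, 0.52920, 0.52923)
nsplit <- c(0:23, 25:31, 33)      # nsplit column of the same table

best      <- which.min(xerror)    # row with the lowest xerror
xstd_best <- 0.0082433            # xstd printed for that row
threshold <- xerror[best] + xstd_best

# First (that is, smallest) tree whose xerror is under the threshold
one_se <- which(xerror <= threshold)[1]

print(c(min_row = best, min_nsplit = nsplit[best],
        one_se_row = one_se, one_se_nsplit = nsplit[one_se]))
```

With these numbers the minimum sits at 31 splits, which matches the pruned table above, while the one-standard-error rule would settle for a tree with 20 splits.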
summary(pruned.tree2)
## Call:
## rpart(formula = income ~ ., data = IncomeData, method = "anova",
## control = rpart.control(cp = 10^-3))
## n= 8993
##
## CP nsplit rel error xerror xstd
## 1 0.223922026 0 1.0000000 1.0000946 0.008257812
## 2 0.080769775 1 0.7760780 0.7762887 0.008853204
## 3 0.044727705 2 0.6953082 0.6956159 0.009181028
## 4 0.030698661 3 0.6505805 0.6510073 0.009311442
## 5 0.013777873 4 0.6198818 0.6204938 0.008931261
## 6 0.013013301 5 0.6061040 0.6138851 0.008920178
## 7 0.012217199 6 0.5930907 0.5981732 0.008866258
## 8 0.011859924 7 0.5808735 0.5927154 0.008840503
## 9 0.007550314 8 0.5690135 0.5739124 0.008299986
## 10 0.006029731 9 0.5614632 0.5698556 0.008305743
## 11 0.005893641 10 0.5554335 0.5661943 0.008280675
## 12 0.005817907 11 0.5495398 0.5655086 0.008278358
## 13 0.005611504 12 0.5437219 0.5618030 0.008269551
## 14 0.005248411 13 0.5381104 0.5541062 0.008172475
## 15 0.002989290 14 0.5328620 0.5404281 0.008084427
## 16 0.002192594 15 0.5298727 0.5376862 0.008095687
## 17 0.002019477 16 0.5276801 0.5403361 0.008172205
## 18 0.002018791 17 0.5256607 0.5399218 0.008160722
## 19 0.001989076 18 0.5236419 0.5398664 0.008166003
## 20 0.001867794 19 0.5216528 0.5380760 0.008158165
## 21 0.001806660 20 0.5197850 0.5360208 0.008137890
## 22 0.001620207 21 0.5179783 0.5336936 0.008160208
## 23 0.001542823 22 0.5163581 0.5308696 0.008133275
## 24 0.001387839 23 0.5148153 0.5313247 0.008138598
## 25 0.001337636 25 0.5120396 0.5313579 0.008176807
## 26 0.001300372 26 0.5107020 0.5309378 0.008177769
## 27 0.001272740 27 0.5094016 0.5307655 0.008185550
## 28 0.001221219 28 0.5081289 0.5311780 0.008216572
## 29 0.001104741 29 0.5069077 0.5302056 0.008251009
## 30 0.001076029 30 0.5058029 0.5296723 0.008251471
## 31 0.001066758 31 0.5047269 0.5292038 0.008243304
##
## Variable importance
## householder_status age martial_status
## 25 20 14
## ocupation dual_incomes edu
## 14 13 9
## pers_in_house type_of_home
## 2 1
##
## Node number 1: 8993 observations, complexity param=0.223922
## mean=4.895029, MSE=7.692528
## left son=2 (5646 obs) right son=3 (3347 obs)
## Primary splits:
## householder_status splits as RLL, improve=0.2243474, (240 missing)
## age splits as LLRRRRR, improve=0.2121074, (0 missing)
## martial_status splits as RLLLL, improve=0.2060636, (160 missing)
## dual_incomes splits as LRR, improve=0.1915465, (0 missing)
## ocupation splits as RLLLRLLRL, improve=0.1732841, (136 missing)
## Surrogate splits:
## age splits as LLLRRRR, agree=0.769, adj=0.378, (240 split)
## martial_status splits as RLLRL, agree=0.755, adj=0.341, (0 split)
## dual_incomes splits as LRR, agree=0.742, adj=0.307, (0 split)
## ocupation splits as RLLLRLLRL, agree=0.681, adj=0.143, (0 split)
## edu splits as LLLLLR, agree=0.647, adj=0.051, (0 split)
##
## Node number 2: 5646 observations, complexity param=0.08076978
## mean=3.88452, MSE=6.656165
## left son=4 (853 obs) right son=5 (4793 obs)
## Primary splits:
## age splits as LRRRRRR, improve=0.14868170, (0 missing)
## ocupation splits as RLLLLLLLL, improve=0.13472560, (95 missing)
## edu splits as LLRRRR, improve=0.12592370, (57 missing)
## martial_status splits as RRRLL, improve=0.08387126, (107 missing)
## householder_status splits as -RL, improve=0.07570331, (149 missing)
## Surrogate splits:
## edu splits as LLRRRR, agree=0.919, adj=0.461, (0 split)
##
## Node number 3: 3347 observations, complexity param=0.03069866
## mean=6.599641, MSE=4.812525
## left son=6 (918 obs) right son=7 (2429 obs)
## Primary splits:
## martial_status splits as RRLLL, improve=0.13684810, (53 missing)
## ocupation splits as RLLLRLLLL, improve=0.12023850, (41 missing)
## dual_incomes splits as LRR, improve=0.11457570, (0 missing)
## edu splits as LLLLRR, improve=0.09346821, (29 missing)
## pers_in_house splits as LRRRRRRRR, improve=0.05325719, (97 missing)
## Surrogate splits:
## dual_incomes splits as LRR, agree=0.955, adj=0.838, (53 split)
## pers_in_house splits as LRRRRRRRR, agree=0.834, adj=0.396, (0 split)
## age splits as LLRRRRR, agree=0.758, adj=0.121, (0 split)
## ocupation splits as RRRRRLLRR, agree=0.725, adj=0.002, (0 split)
## pers_in_house_under18 splits as RRRRRRRRRL, agree=0.725, adj=0.002, (0 split)
##
## Node number 4: 853 observations, complexity param=0.01185992
## mean=1.526377, MSE=2.844849
## left son=8 (689 obs) right son=9 (164 obs)
## Primary splits:
## ocupation splits as LRRRLLRRR, improve=0.336137600, (17 missing)
## householder_status splits as -RL, improve=0.032457790, (5 missing)
## martial_status splits as LRL-L, improve=0.017270670, (31 missing)
## dual_incomes splits as LLR, improve=0.010735450, (0 missing)
## type_of_home splits as LLLRR, improve=0.008932559, (47 missing)
## Surrogate splits:
## householder_status splits as -RL, agree=0.828, adj=0.122, (17 split)
##
## Node number 5: 4793 observations, complexity param=0.0447277
## mean=4.304194, MSE=6.168681
## left son=10 (3399 obs) right son=11 (1394 obs)
## Primary splits:
## ocupation splits as RLLLLLLLL, improve=0.10421770, (78 missing)
## age splits as LLRRRRR, improve=0.05642667, (0 missing)
## dual_incomes splits as LRL, improve=0.05346468, (0 missing)
## edu splits as LLLLRR, improve=0.05227030, (43 missing)
## martial_status splits as RLLLL, improve=0.05186097, (76 missing)
## Surrogate splits:
## edu splits as LLLLRR, agree=0.744, adj=0.124, (73 split)
##
## Node number 6: 918 observations, complexity param=0.0122172
## mean=5.303922, MSE=6.104799
## left son=12 (555 obs) right son=13 (363 obs)
## Primary splits:
## ocupation splits as RLLLLLLLL, improve=0.15307880, (12 missing)
## edu splits as LLLLRR, improve=0.08517863, (10 missing)
## age splits as RRRRRRL, improve=0.05236809, (0 missing)
## type_of_home splits as RRRLR, improve=0.02903844, (39 missing)
## sex splits as RL, improve=0.02451025, (0 missing)
## Surrogate splits:
## edu splits as LLLLRR, agree=0.705, adj=0.264, (12 split)
##
## Node number 7: 2429 observations, complexity param=0.01377787
## mean=7.089337, MSE=3.44982
## left son=14 (1340 obs) right son=15 (1089 obs)
## Primary splits:
## ocupation splits as RLLLLLLLL, improve=0.11597370, (29 missing)
## edu splits as LLLRRR, improve=0.10846340, (19 missing)
## age splits as RRRRRRL, improve=0.05258903, (0 missing)
## dual_incomes splits as LRL, improve=0.04283990, (0 missing)
## type_of_home splits as RRLLR, improve=0.03380694, (71 missing)
## Surrogate splits:
## dual_incomes splits as LRL, agree=0.684, adj=0.292, (29 split)
## edu splits as LLLLRR, agree=0.673, adj=0.266, (0 split)
## age splits as RRRRRLL, agree=0.610, adj=0.126, (0 split)
## sex splits as RL, agree=0.587, adj=0.074, (0 split)
## martial_status splits as LR---, agree=0.557, adj=0.007, (0 split)
##
## Node number 8: 689 observations
## mean=1.047896, MSE=0.1849339
##
## Node number 9: 164 observations, complexity param=0.00127274
## mean=3.536585, MSE=9.016954
## left son=18 (72 obs) right son=19 (92 obs)
## Primary splits:
## pers_in_house splits as LLLRRRRRR, improve=0.05652358, (5 missing)
## type_of_home splits as RLLRR, improve=0.05360106, (7 missing)
## ethnic_cl splits as LRR-LRRR, improve=0.03884052, (1 missing)
## lang splits as RLR, improve=0.02399643, (2 missing)
## ocupation splits as -LLR--LLR, improve=0.02324270, (0 missing)
## Surrogate splits:
## pers_in_house_under18 splits as LLRRRRRRRR, agree=0.717, adj=0.328, (5 split)
## type_of_home splits as RLLRR, agree=0.704, adj=0.299, (0 split)
## ethnic_cl splits as RRL-LLRR, agree=0.673, adj=0.224, (0 split)
## householder_status splits as -LR, agree=0.623, adj=0.104, (0 split)
## lang splits as RLR, agree=0.616, adj=0.090, (0 split)
##
## Node number 10: 3399 observations, complexity param=0.0130133
## mean=3.789644, MSE=6.022535
## left son=20 (2631 obs) right son=21 (768 obs)
## Primary splits:
## martial_status splits as RLLLL, improve=0.04750783, (66 missing)
## dual_incomes splits as LRL, improve=0.03465388, (0 missing)
## type_of_home splits as RRLLL, improve=0.02513209, (154 missing)
## age splits as LLRRRRR, improve=0.01889926, (0 missing)
## ocupation splits as -RRRRRRLL, improve=0.01645735, (62 missing)
## Surrogate splits:
## dual_incomes splits as LRR, agree=0.957, adj=0.807, (66 split)
## ocupation splits as -LLLRLLLL, agree=0.781, adj=0.008, (0 split)
##
## Node number 11: 1394 observations, complexity param=0.007550314
## mean=5.558824, MSE=4.305363
## left son=22 (1062 obs) right son=23 (332 obs)
## Primary splits:
## dual_incomes splits as LRL, improve=0.08702943, (0 missing)
## age splits as LLRRRRR, improve=0.08268045, (0 missing)
## martial_status splits as RRLLL, improve=0.07957170, (10 missing)
## edu splits as LLLLRR, improve=0.02843957, (7 missing)
## type_of_home splits as LRLLL, improve=0.02334594, (46 missing)
## Surrogate splits:
## martial_status splits as RLLLL, agree=0.922, adj=0.672, (0 split)
##
## Node number 12: 555 observations, complexity param=0.001620207
## mean=4.527928, MSE=6.202373
## left son=24 (449 obs) right son=25 (106 obs)
## Primary splits:
## edu splits as LLLLRR, improve=0.03360729, (7 missing)
## type_of_home splits as RRRLL, improve=0.02562304, (29 missing)
## age splits as RRRRRRL, improve=0.01456847, (0 missing)
## sex splits as RL, improve=0.01375062, (0 missing)
## living_time splits as LRRRL, improve=0.01358937, (70 missing)
##
## Node number 13: 363 observations, complexity param=0.001337636
## mean=6.490358, MSE=3.627318
## left son=26 (202 obs) right son=27 (161 obs)
## Primary splits:
## sex splits as RL, improve=0.07027800, (0 missing)
## age splits as RRRRRRL, improve=0.05343036, (0 missing)
## type_of_home splits as RRRLR, improve=0.03035936, (10 missing)
## edu splits as -LLLRR, improve=0.02043215, (3 missing)
## pers_in_house_under18 splits as LRRRRRRRRR, improve=0.00598936, (0 missing)
## Surrogate splits:
## martial_status splits as --LLR, agree=0.595, adj=0.087, (0 split)
## age splits as RRRLLLL, agree=0.595, adj=0.087, (0 split)
## ethnic_cl splits as RRLRRRLL, agree=0.584, adj=0.062, (0 split)
## edu splits as -LLLRL, agree=0.567, adj=0.025, (0 split)
## pers_in_house_under18 splits as LLLLRRRRRR, agree=0.562, adj=0.012, (0 split)
##
## Node number 14: 1340 observations, complexity param=0.005893641
## mean=6.524627, MSE=3.982229
## left son=28 (450 obs) right son=29 (890 obs)
## Primary splits:
## edu splits as LLLRRR, improve=0.07523903, (11 missing)
## type_of_home splits as RRLLL, improve=0.03675295, (50 missing)
## ocupation splits as -RLRRLLLL, improve=0.03292025, (10 missing)
## age splits as RRRRRRL, improve=0.03234622, (0 missing)
## ethnic_cl splits as LRL-LRRR, improve=0.03159858, (8 missing)
## Surrogate splits:
## ethnic_cl splits as RRR-LRRL, agree=0.671, adj=0.025, (11 split)
## pers_in_house_under18 splits as RRRRLLLLLL, agree=0.670, adj=0.020, (0 split)
## age splits as LRRRRRR, agree=0.665, adj=0.007, (0 split)
##
## Node number 15: 1089 observations, complexity param=0.00180666
## mean=7.784206, MSE=1.919457
## left son=30 (769 obs) right son=31 (320 obs)
## Primary splits:
## edu splits as LLLLLR, improve=0.05991902, (8 missing)
## type_of_home splits as RRRLR, improve=0.02162214, (21 missing)
## ethnic_cl splits as LRRLLRRR, improve=0.02148641, (12 missing)
## sex splits as RL, improve=0.01973967, (0 missing)
## lang splits as RLR, improve=0.01043185, (32 missing)
## Surrogate splits:
## pers_in_house_under18 splits as LLLLLLRRRR, agree=0.705, adj=0.003, (8 split)
##
## Node number 18: 72 observations
## mean=2.708333, MSE=5.401042
##
## Node number 19: 92 observations
## mean=4.184783, MSE=10.88977
##
## Node number 20: 2631 observations, complexity param=0.005817907
## mean=3.511593, MSE=5.961002
## left son=40 (1406 obs) right son=41 (1225 obs)
## Primary splits:
## type_of_home splits as RRLLL, improve=0.025502850, (126 missing)
## ocupation splits as -RRRLRLLL, improve=0.019287680, (44 missing)
## living_time splits as LLLRR, improve=0.014444490, (259 missing)
## householder_status splits as -LR, improve=0.014031590, (104 missing)
## sex splits as RL, improve=0.007410382, (0 missing)
## Surrogate splits:
## householder_status splits as -LR, agree=0.763, adj=0.502, (95 split)
## pers_in_house splits as LLRRRRRRR, agree=0.685, adj=0.338, (13 split)
## pers_in_house_under18 splits as LRRRRRRRRR, agree=0.628, adj=0.216, (18 split)
## age splits as RRLLLLL, agree=0.617, adj=0.193, (0 split)
## ocupation splits as -RLLLRLLL, agree=0.582, adj=0.121, (0 split)
##
## Node number 21: 768 observations, complexity param=0.006029731
## mean=4.742188, MSE=5.061137
## left son=42 (384 obs) right son=43 (384 obs)
## Primary splits:
## edu splits as LLLRRR, improve=0.10840950, (11 missing)
## age splits as LLRRRRR, improve=0.07587009, (0 missing)
## lang splits as RLR, improve=0.06001032, (32 missing)
## ethnic_cl splits as RLRRLRRR, improve=0.05328627, (5 missing)
## ocupation splits as -RRRRLRLL, improve=0.03893828, (18 missing)
## Surrogate splits:
## ethnic_cl splits as LRRRLRRR, agree=0.655, adj=0.308, (11 split)
## ocupation splits as -RLRLRRLL, agree=0.622, adj=0.241, (0 split)
## lang splits as RLR, agree=0.587, adj=0.170, (0 split)
## pers_in_house splits as RRRLLLLLL, agree=0.584, adj=0.164, (0 split)
## pers_in_house_under18 splits as RLLLLLLLLL, agree=0.579, adj=0.154, (0 split)
##
## Node number 22: 1062 observations, complexity param=0.005248411
## mean=5.216573, MSE=4.162136
## left son=44 (248 obs) right son=45 (814 obs)
## Primary splits:
## age splits as LLRRRRR, improve=0.08214115, (0 missing)
## edu splits as -LLLRR, improve=0.04175026, (5 missing)
## type_of_home splits as LRLLL, improve=0.02471563, (35 missing)
## sex splits as RL, improve=0.01795391, (0 missing)
## martial_status splits as RRRLL, improve=0.01789405, (9 missing)
## Surrogate splits:
## householder_status splits as -RL, agree=0.786, adj=0.085, (0 split)
##
## Node number 23: 332 observations
## mean=6.653614, MSE=3.190258
##
## Node number 24: 449 observations, complexity param=0.001542823
## mean=4.309577, MSE=6.20483
## left son=48 (95 obs) right son=49 (354 obs)
## Primary splits:
## age splits as RRRRRRL, improve=0.03831012, (0 missing)
## type_of_home splits as RRRLL, improve=0.02546873, (23 missing)
## ocupation splits as -LRRLLLLL, improve=0.02218097, (12 missing)
## ethnic_cl splits as LRRLLLLR, improve=0.01979479, (11 missing)
## sex splits as RL, improve=0.01864807, (0 missing)
## Surrogate splits:
## ocupation splits as -RRRRRRLR, agree=0.837, adj=0.232, (0 split)
## martial_status splits as --RLR, agree=0.829, adj=0.189, (0 split)
##
## Node number 25: 106 observations
## mean=5.45283, MSE=5.134567
##
## Node number 26: 202 observations
## mean=6.039604, MSE=3.859818
##
## Node number 27: 161 observations
## mean=7.055901, MSE=2.76085
##
## Node number 28: 450 observations, complexity param=0.002192594
## mean=5.748889, MSE=4.58361
## left son=56 (66 obs) right son=57 (384 obs)
## Primary splits:
## edu splits as LLR---, improve=0.07441736, (2 missing)
## ethnic_cl splits as RLR-LRRR, improve=0.05084362, (4 missing)
## ocupation splits as -LLRLLRLL, improve=0.03727764, (3 missing)
## pers_in_house splits as RRRRRRLLL, improve=0.03593712, (14 missing)
## age splits as RRRRRLL, improve=0.03180490, (0 missing)
## Surrogate splits:
## age splits as LRRRRRR, agree=0.862, adj=0.061, (2 split)
##
## Node number 29: 890 observations, complexity param=0.002019477
## mean=6.916854, MSE=3.220053
## left son=58 (45 obs) right son=59 (845 obs)
## Primary splits:
## type_of_home splits as RRLLL, improve=0.04966567, (33 missing)
## ocupation splits as -RRRRLLLL, improve=0.03648984, (7 missing)
## edu splits as ---LRR, improve=0.03299970, (9 missing)
## age splits as RRRRRRL, improve=0.02804480, (0 missing)
## ethnic_cl splits as LLL-LRRL, improve=0.01601724, (4 missing)
##
## Node number 30: 769 observations
## mean=7.56567, MSE=2.092241
##
## Node number 31: 320 observations
## mean=8.309375, MSE=1.113662
##
## Node number 40: 1406 observations, complexity param=0.005611504
## mean=3.146515, MSE=4.241691
## left son=80 (631 obs) right son=81 (775 obs)
## Primary splits:
## ocupation splits as -RRRLLLLL, improve=0.06452061, (25 missing)
## age splits as LLRRRRR, improve=0.04730598, (0 missing)
## martial_status splits as -RRLL, improve=0.02161422, (13 missing)
## edu splits as LLLLRR, improve=0.01421960, (15 missing)
## living_time splits as LLRRR, improve=0.01244381, (129 missing)
## Surrogate splits:
## age splits as LLRRRRR, agree=0.594, adj=0.098, (25 split)
## martial_status splits as -RRLR, agree=0.571, adj=0.045, (0 split)
## pers_in_house_under18 splits as RRLLLLLLLL, agree=0.558, adj=0.016, (0 split)
## type_of_home splits as --RRL, agree=0.556, adj=0.013, (0 split)
## householder_status splits as -RL, agree=0.555, adj=0.010, (0 split)
##
## Node number 41: 1225 observations, complexity param=0.001867794
## mean=3.930612, MSE=7.605798
## left son=82 (133 obs) right son=83 (1092 obs)
## Primary splits:
## ocupation splits as -RRRLRRRL, improve=0.014011080, (19 missing)
## living_time splits as LLLLR, improve=0.007596146, (130 missing)
## sex splits as RL, improve=0.006403813, (0 missing)
## pers_in_house splits as LRRRRRRRR, improve=0.004286220, (49 missing)
## edu splits as LLLLRL, improve=0.003555879, (10 missing)
##
## Node number 42: 384 observations, complexity param=0.002018791
## mean=4.005208, MSE=4.526015
## left son=84 (91 obs) right son=85 (293 obs)
## Primary splits:
## edu splits as LLR---, improve=0.07847638, (7 missing)
## type_of_home splits as RRLRR, improve=0.05658004, (13 missing)
## age splits as LLRRRRR, improve=0.05264514, (0 missing)
## lang splits as RLL, improve=0.03505059, (9 missing)
## ethnic_cl splits as RRL-LLRR, improve=0.03028826, (2 missing)
## Surrogate splits:
## pers_in_house_under18 splits as RRRRRRLLLL, agree=0.764, adj=0.022, (7 split)
##
## Node number 43: 384 observations, complexity param=0.001989076
## mean=5.479167, MSE=4.509983
## left son=86 (64 obs) right son=87 (320 obs)
## Primary splits:
## age splits as LLRRRRR, improve=0.07945458, (0 missing)
## ocupation splits as -RRRRLRRR, improve=0.07816118, (2 missing)
## ethnic_cl splits as LLRLLRRR, improve=0.03693821, (3 missing)
## type_of_home splits as RRLLL, improve=0.02947526, (15 missing)
## lang splits as RLL, improve=0.02745962, (23 missing)
##
## Node number 44: 248 observations
## mean=4.157258, MSE=4.487367
##
## Node number 45: 814 observations, complexity param=0.001221219
## mean=5.539312, MSE=3.617005
## left son=90 (752 obs) right son=91 (62 obs)
## Primary splits:
## type_of_home splits as LRLLL, improve=0.02851125, (26 missing)
## martial_status splits as LRLLL, improve=0.02487905, (9 missing)
## edu splits as -LLLRR, improve=0.02447788, (3 missing)
## sex splits as RL, improve=0.01888394, (0 missing)
## ethnic_cl splits as LRRRLRRL, improve=0.01571250, (6 missing)
##
## Node number 48: 95 observations
## mean=3.368421, MSE=4.906371
##
## Node number 49: 354 observations
## mean=4.562147, MSE=6.251787
##
## Node number 56: 66 observations
## mean=4.348485, MSE=5.257346
##
## Node number 57: 384 observations, complexity param=0.001387839
## mean=5.989583, MSE=4.072808
## left son=114 (184 obs) right son=115 (200 obs)
## Primary splits:
## age splits as RRRRRLL, improve=0.05780988, (0 missing)
## pers_in_house splits as LLRRRRRRR, improve=0.05251579, (11 missing)
## pers_in_house_under18 splits as LRRRRRRRRR, improve=0.04376480, (0 missing)
## ocupation splits as -LRRRLRLL, improve=0.04331208, (3 missing)
## ethnic_cl splits as RLR-LLRR, improve=0.02355029, (4 missing)
## Surrogate splits:
## pers_in_house splits as LLRRRRRRR, agree=0.776, adj=0.533, (0 split)
## pers_in_house_under18 splits as LRRRRRRRRR, agree=0.773, adj=0.527, (0 split)
## ocupation splits as -RRRRRRLR, agree=0.760, adj=0.500, (0 split)
## dual_incomes splits as LRL, agree=0.714, adj=0.402, (0 split)
## ethnic_cl splits as LRR-RRLR, agree=0.625, adj=0.217, (0 split)
##
## Node number 58: 45 observations
## mean=5.2, MSE=5.937778
##
## Node number 59: 845 observations, complexity param=0.001104741
## mean=7.008284, MSE=2.909991
## left son=118 (524 obs) right son=119 (321 obs)
## Primary splits:
## ocupation splits as -RLLRLLLL, improve=0.03044176, (7 missing)
## edu splits as ---LRR, improve=0.02917369, (8 missing)
## age splits as RRRRRRL, improve=0.02917211, (0 missing)
## ethnic_cl splits as LLL-LRRL, improve=0.02256119, (3 missing)
## sex splits as LR, improve=0.00733365, (0 missing)
## Surrogate splits:
## age splits as RRRRLLL, agree=0.658, adj=0.095, (7 split)
## pers_in_house_under18 splits as LRRRRRRRRR, agree=0.655, adj=0.088, (0 split)
## pers_in_house splits as LLLRRRRRR, agree=0.626, adj=0.013, (0 split)
##
## Node number 80: 631 observations
## mean=2.564184, MSE=4.220524
##
## Node number 81: 775 observations, complexity param=0.00298929
## mean=3.620645, MSE=3.758025
## left son=162 (279 obs) right son=163 (496 obs)
## Primary splits:
## age splits as LLRRRRR, improve=0.07100360, (0 missing)
## edu splits as LLLLRR, improve=0.03201773, (8 missing)
## martial_status splits as -RRLL, improve=0.02612554, (6 missing)
## lang splits as RLL, improve=0.02512991, (36 missing)
## pers_in_house splits as RRLLLLLLL, improve=0.01542619, (45 missing)
## Surrogate splits:
## ethnic_cl splits as RRR-RLRR, agree=0.641, adj=0.004, (0 split)
##
## Node number 82: 133 observations
## mean=3, MSE=6.421053
##
## Node number 83: 1092 observations
## mean=4.043956, MSE=7.631767
##
## Node number 84: 91 observations
## mean=2.923077, MSE=3.961116
##
## Node number 85: 293 observations, complexity param=0.001300372
## mean=4.341297, MSE=4.224813
## left son=170 (158 obs) right son=171 (135 obs)
## Primary splits:
## type_of_home splits as RRLLL, improve=0.08901701, (12 missing)
## age splits as LLRRRRR, improve=0.05534796, (0 missing)
## ocupation splits as -LRRRLLLL, improve=0.03830745, (9 missing)
## pers_in_house_under18 splits as LLLRRRRRRR, improve=0.02073889, (0 missing)
## pers_in_house splits as LLLRRRRRR, improve=0.01532507, (7 missing)
## Surrogate splits:
## pers_in_house splits as LLLRRRRRR, agree=0.683, adj=0.315, (11 split)
## pers_in_house_under18 splits as LLRRRRRRRR, agree=0.680, adj=0.308, (1 split)
## age splits as LLRRRRR, agree=0.598, adj=0.131, (0 split)
## ethnic_cl splits as LLL-RRLL, agree=0.577, adj=0.085, (0 split)
## ocupation splits as -LRLRLLLL, agree=0.566, adj=0.062, (0 split)
##
## Node number 86: 64 observations
## mean=4.140625, MSE=4.30835
##
## Node number 87: 320 observations
## mean=5.746875, MSE=4.120303
##
## Node number 90: 752 observations, complexity param=0.001076029
## mean=5.446809, MSE=3.62749
## left son=180 (320 obs) right son=181 (432 obs)
## Primary splits:
## edu splits as -LLLRR, improve=0.027120560, (3 missing)
## sex splits as RL, improve=0.019410740, (0 missing)
## martial_status splits as LRLLL, improve=0.019304080, (9 missing)
## ethnic_cl splits as LRR-LRRL, improve=0.014561860, (6 missing)
## age splits as LLLRRRR, improve=0.008569508, (0 missing)
## Surrogate splits:
## martial_status splits as LRLLR, agree=0.636, adj=0.144, (3 split)
## pers_in_house_under18 splits as RLLLLLLLLL, agree=0.611, adj=0.088, (0 split)
## age splits as RRRRLLL, agree=0.586, adj=0.028, (0 split)
## dual_incomes splits as R-L, agree=0.586, adj=0.028, (0 split)
## type_of_home splits as L-RLL, agree=0.586, adj=0.028, (0 split)
##
## Node number 91: 62 observations
## mean=6.66129, MSE=2.127211
##
## Node number 114: 184 observations
## mean=5.483696, MSE=4.195386
##
## Node number 115: 200 observations, complexity param=0.001387839
## mean=6.455, MSE=3.507975
## left son=230 (66 obs) right son=231 (134 obs)
## Primary splits:
## ethnic_cl splits as RLL-LLRR, improve=0.14154760, (1 missing)
## ocupation splits as -LRRRLRRL, improve=0.07030001, (2 missing)
## age splits as LLRRRRR, improve=0.04785090, (0 missing)
## type_of_home splits as RRLLR, improve=0.04195659, (9 missing)
## sex splits as LR, improve=0.03422510, (0 missing)
## Surrogate splits:
## lang splits as RLR, agree=0.714, adj=0.136, (0 split)
## pers_in_house_under18 splits as RRRRRLLLLL, agree=0.688, adj=0.061, (1 split)
## sex splits as LR, agree=0.678, adj=0.030, (0 split)
## age splits as LLRRRRR, agree=0.678, adj=0.030, (0 split)
## pers_in_house splits as RRRRRLLLL, agree=0.678, adj=0.030, (0 split)
##
## Node number 118: 524 observations
## mean=6.772901, MSE=2.946518
##
## Node number 119: 321 observations
## mean=7.392523, MSE=2.612281
##
## Node number 162: 279 observations
## mean=2.9319, MSE=3.274932
##
## Node number 163: 496 observations
## mean=4.008065, MSE=3.612838
##
## Node number 170: 158 observations
## mean=3.829114, MSE=4.407507
##
## Node number 171: 135 observations
## mean=4.940741, MSE=3.344636
##
## Node number 180: 320 observations
## mean=5.08125, MSE=3.605898
##
## Node number 181: 432 observations
## mean=5.717593, MSE=3.471172
##
## Node number 230: 66 observations
## mean=5.439394, MSE=3.640266
##
## Node number 231: 134 observations
## mean=6.955224, MSE=2.684562
The number of internal nodes dropped from 33 to 31. The cross-validated error (xerror) also decreased, if only slightly, from 0.52910 to 0.52801 at the smallest cp values. Further pruning likewise changes very little, so pruned.tree2 is the best of all the trees built.
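The comparison above can be reproduced with a short sketch. It is only an illustration under stated assumptions: the built-in mtcars data stands in for IncomeData, and n_internal is a hypothetical helper name, not an rpart function. It counts internal nodes before and after pruning at the cp with the lowest xerror.

```r
library(rpart)

# Illustrative data only: mtcars stands in for IncomeData.
set.seed(1)  # xerror comes from random cross-validation folds
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 5))

# Hypothetical helper: number of internal (non-leaf) nodes of a tree.
n_internal <- function(tree) sum(tree$frame$var != "<leaf>")

# Row of the cp table with the smallest cross-validated error.
best <- which.min(fit$cptable[, "xerror"])
pruned <- prune(fit, cp = fit$cptable[best, "CP"])

# Internal nodes before and after pruning, and the smallest xerror.
c(before = n_internal(fit), after = n_internal(pruned))
min(fit$cptable[, "xerror"])
```

If the two counts and the two minimum errors barely differ, as in the text, pruning mostly buys a slightly simpler tree rather than a better fit.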
7. Let us return to item (a) and describe one branch of that tree. Predictors (listed in decreasing order of importance to the model): householder_status, age, martial_status, ocupation, dual_incomes, edu, pers_in_house, type_of_home.
If the householder status is renting or living with parents, the resident is an adult (over 17) and older than 25, the occupation is professional worker, there is no dual income (the incomes are separate or the resident is not married), the type of home is house, apartment, mobile home or other, and education does not exceed three years of undergraduate study, then income tends toward the fifth category (5.1; $25,000 to $29,999). If education exceeds three years of undergraduate study, income tends toward the sixth category (5.7; $30,000 to $39,999).
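A branch like this one does not have to be read off the plot by hand: rpart's path.rpart() prints the chain of split conditions from the root to a given node. The sketch below is a minimal illustration that assumes mtcars as stand-in data. For the tree in the text, the nodes of interest would be 180 and 181 from the summary output.

```r
library(rpart)

# Illustrative data only: mtcars stands in for IncomeData.
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 5))

# Node numbers of all leaves, taken from the row names of the frame.
leaves <- as.integer(rownames(fit$frame)[fit$frame$var == "<leaf>"])

# Print the split conditions leading from the root to the first two leaves.
paths <- path.rpart(fit, nodes = leaves[1:2])

# Size and mean response of those leaves (the mean is what predict() returns).
fit$frame[as.character(leaves[1:2]), c("n", "yval")]
```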
8. Let us add our own data row: 1,2,5,2,1,6,1,1,1,0,2,3,7,3
MyIncomeData <- read.csv("~/hw2/My_Income_Data.txt", header = FALSE, col.names = c("income", "sex", "martial_status", "age", "edu", "ocupation", "living_time", "dual_incomes", "pers_in_house", "pers_in_house_under18", "householder_status", "type_of_home", "ethnic_cl", "lang"))
MyIncomeData <- within(MyIncomeData, {
sex <- factor(sex)
martial_status <- factor(martial_status)
edu <- factor(edu)
ocupation <- factor(ocupation)
living_time <- factor(living_time)
dual_incomes <- factor(dual_incomes)
householder_status <- factor(householder_status)
type_of_home <- factor(type_of_home)
ethnic_cl <- factor(ethnic_cl)
lang <- factor(lang)
age <- ordered(age)
pers_in_house <- ordered(pers_in_house)
pers_in_house_under18 <- ordered(pers_in_house_under18)
})
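Since the same type conversions are applied to both copies of the data, they could be wrapped in one function. The following is a sketch only: prepare_income is a hypothetical helper name, and the first two rows of the inline sample are made up so that the block runs without the data file. The third row is the one from the text.

```r
# Hypothetical helper: applies the same type conversions to any copy of the data.
prepare_income <- function(df) {
  nominal <- c("sex", "martial_status", "edu", "ocupation", "living_time",
               "dual_incomes", "householder_status", "type_of_home",
               "ethnic_cl", "lang")
  ordinal <- c("age", "pers_in_house", "pers_in_house_under18")
  df[nominal] <- lapply(df[nominal], factor)   # unordered categories
  df[ordinal] <- lapply(df[ordinal], ordered)  # ordered categories
  df
}

cols <- c("income", "sex", "martial_status", "age", "edu", "ocupation",
          "living_time", "dual_incomes", "pers_in_house",
          "pers_in_house_under18", "householder_status", "type_of_home",
          "ethnic_cl", "lang")

# Inline sample instead of a file: two made-up rows plus the row from the text.
txt <- "9,1,1,5,4,5,5,3,3,0,1,1,7,1
4,2,1,3,5,1,5,2,3,1,2,3,7,1
1,2,5,2,1,6,1,1,1,0,2,3,7,3"
demo <- prepare_income(read.csv(text = txt, header = FALSE, col.names = cols))
str(demo)
```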
Let us build a tree on the data with the added row and a fixed cp = \(10^{-3}\).
my.income.tree <- rpart(income~., data = MyIncomeData, control = rpart.control(cp = 10^-3), method = "anova")
prp(my.income.tree)
printcp(my.income.tree)
##
## Regression tree:
## rpart(formula = income ~ ., data = MyIncomeData, method = "anova",
## control = rpart.control(cp = 10^-3))
##
## Variables actually used in tree construction:
## [1] age dual_incomes edu
## [4] ethnic_cl householder_status martial_status
## [7] ocupation pers_in_house pers_in_house_under18
## [10] sex type_of_home
##
## Root node error: 69194/8994 = 7.6934
##
## n= 8994
##
## CP nsplit rel error xerror xstd
## 1 0.2239719 0 1.00000 1.00024 0.0082582
## 2 0.0807145 1 0.77603 0.77630 0.0088512
## 3 0.0447632 2 0.69531 0.69572 0.0091799
## 4 0.0306919 3 0.65055 0.65125 0.0093118
## 5 0.0137749 4 0.61986 0.62069 0.0089330
## 6 0.0130318 5 0.60608 0.61369 0.0089361
## 7 0.0122145 6 0.59305 0.59790 0.0089475
## 8 0.0118573 7 0.58084 0.59087 0.0086895
## 9 0.0075487 8 0.56898 0.57273 0.0082712
## 10 0.0060284 9 0.56143 0.56855 0.0082889
## 11 0.0058923 10 0.55540 0.56430 0.0082377
## 12 0.0058412 11 0.54951 0.56430 0.0082377
## 13 0.0056415 12 0.54367 0.56190 0.0081989
## 14 0.0052473 13 0.53803 0.55239 0.0081799
## 15 0.0029886 14 0.53278 0.54060 0.0081497
## 16 0.0021921 15 0.52979 0.53702 0.0081285
## 17 0.0020190 16 0.52760 0.53782 0.0081565
## 18 0.0020183 17 0.52558 0.53741 0.0081474
## 19 0.0019886 18 0.52356 0.53764 0.0081576
## 20 0.0018674 19 0.52157 0.53619 0.0081509
## 21 0.0018063 20 0.51971 0.53546 0.0081514
## 22 0.0016199 21 0.51790 0.53286 0.0081147
## 23 0.0015425 22 0.51628 0.53132 0.0081491
## 24 0.0013875 23 0.51474 0.53222 0.0082338
## 25 0.0013373 25 0.51196 0.53101 0.0082745
## 26 0.0013001 26 0.51063 0.53037 0.0082742
## 27 0.0012725 27 0.50933 0.53023 0.0082812
## 28 0.0012210 28 0.50805 0.53004 0.0082859
## 29 0.0011045 29 0.50683 0.53013 0.0083384
## 30 0.0010758 30 0.50573 0.52936 0.0083606
## 31 0.0010733 31 0.50465 0.52956 0.0083618
## 32 0.0010000 33 0.50250 0.53002 0.0083682
predict(my.income.tree, MyIncomeData[29,])
## 29
## 2.223368
According to my.income.tree, my income lies between the second and third categories, closer to the second (2.223368; $10,000 to $14,999). That is certainly not the case at the current USD/RUB exchange rate, although at the 2013 rate the prediction would even be an underestimate.
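Reading a fitted value as an income category amounts to rounding it to the nearest code. A minimal sketch of that lookup follows. The helper nearest_bracket is a hypothetical name, and only the labels for codes 2, 3, 5 and 6 appear in this report. The remaining labels are assumed from the survey's codebook.

```r
# Income codes 1..9 and their brackets. Codes 2, 3, 5 and 6 appear in the text;
# the other labels are an assumption taken from the survey's codebook.
brackets <- c("Less than $10,000", "$10,000 to $14,999", "$15,000 to $19,999",
              "$20,000 to $24,999", "$25,000 to $29,999", "$30,000 to $39,999",
              "$40,000 to $49,999", "$50,000 to $74,999", "$75,000 or more")

# Hypothetical helper: nearest income code for a fitted value, clamped to 1..9.
nearest_bracket <- function(pred) {
  code <- pmin(pmax(round(pred), 1), length(brackets))
  data.frame(pred = pred, code = code, bracket = brackets[code])
}

# The prediction above and the two leaf means from the branch described in item 7.
nearest_bracket(c(2.223368, 5.08125, 5.717593))
```

Because the tree is fitted with method = "anova", the category codes are treated as numbers, so this rounding is a reading aid rather than a classification.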
my.pruned.tree <- prune(my.income.tree, my.income.tree$cptable[which.min(my.income.tree$cptable[,"xerror"]),"CP"])
printcp(my.pruned.tree)
##
## Regression tree:
## rpart(formula = income ~ ., data = MyIncomeData, method = "anova",
## control = rpart.control(cp = 10^-3))
##
## Variables actually used in tree construction:
## [1] age dual_incomes edu
## [4] ethnic_cl householder_status martial_status
## [7] ocupation pers_in_house sex
## [10] type_of_home
##
## Root node error: 69194/8994 = 7.6934
##
## n= 8994
##
## CP nsplit rel error xerror xstd
## 1 0.2239719 0 1.00000 1.00024 0.0082582
## 2 0.0807145 1 0.77603 0.77630 0.0088512
## 3 0.0447632 2 0.69531 0.69572 0.0091799
## 4 0.0306919 3 0.65055 0.65125 0.0093118
## 5 0.0137749 4 0.61986 0.62069 0.0089330
## 6 0.0130318 5 0.60608 0.61369 0.0089361
## 7 0.0122145 6 0.59305 0.59790 0.0089475
## 8 0.0118573 7 0.58084 0.59087 0.0086895
## 9 0.0075487 8 0.56898 0.57273 0.0082712
## 10 0.0060284 9 0.56143 0.56855 0.0082889
## 11 0.0058923 10 0.55540 0.56430 0.0082377
## 12 0.0058412 11 0.54951 0.56430 0.0082377
## 13 0.0056415 12 0.54367 0.56190 0.0081989
## 14 0.0052473 13 0.53803 0.55239 0.0081799
## 15 0.0029886 14 0.53278 0.54060 0.0081497
## 16 0.0021921 15 0.52979 0.53702 0.0081285
## 17 0.0020190 16 0.52760 0.53782 0.0081565
## 18 0.0020183 17 0.52558 0.53741 0.0081474
## 19 0.0019886 18 0.52356 0.53764 0.0081576
## 20 0.0018674 19 0.52157 0.53619 0.0081509
## 21 0.0018063 20 0.51971 0.53546 0.0081514
## 22 0.0016199 21 0.51790 0.53286 0.0081147
## 23 0.0015425 22 0.51628 0.53132 0.0081491
## 24 0.0013875 23 0.51474 0.53222 0.0082338
## 25 0.0013373 25 0.51196 0.53101 0.0082745
## 26 0.0013001 26 0.51063 0.53037 0.0082742
## 27 0.0012725 27 0.50933 0.53023 0.0082812
## 28 0.0012210 28 0.50805 0.53004 0.0082859
## 29 0.0011045 29 0.50683 0.53013 0.0083384
## 30 0.0010758 30 0.50573 0.52936 0.0083606
prp(my.pruned.tree)
predict(my.pruned.tree, MyIncomeData[29,])
## 29
## 2.561709
Pruning the tree compensated for the hypothetical underestimate, moving the value closer to the third category (2.561709; $15,000 to $19,999).
9. Let us describe my.pruned.tree from the standpoint of our own data row. If the householder status is renting (2) or living with parents (3) (in our case, 2), age is over 17 (age = 2), occupation is student (ocupation = 6), marital status is 5 and type of home is 3, then the tree leads us to the value 2.6.
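The leaf that a training row ends up in, before and after pruning, can also be looked up programmatically through the where component of the fitted object. The sketch below is an illustration under assumptions: mtcars replaces MyIncomeData, leaf_of is a hypothetical helper name, and row 29 is used only to mirror the index in the text.

```r
library(rpart)

# Illustrative data only: mtcars stands in for MyIncomeData.
set.seed(1)
full <- rpart(mpg ~ ., data = mtcars, method = "anova",
              control = rpart.control(cp = 0.001, minsplit = 5))
best <- which.min(full$cptable[, "xerror"])
pruned <- prune(full, cp = full$cptable[best, "CP"])

# Hypothetical helper: node number of the leaf that training row i falls into.
leaf_of <- function(tree, i) as.integer(rownames(tree$frame)[tree$where[i]])

i <- 29  # same row index as in the text; any training row works
c(full = leaf_of(full, i), pruned = leaf_of(pruned, i))

# The prediction for a row is the mean response (yval) of its leaf.
c(full = unname(predict(full, mtcars[i, ])),
  pruned = unname(predict(pruned, mtcars[i, ])))

# Split conditions leading to that leaf in the pruned tree.
path.rpart(pruned, nodes = leaf_of(pruned, i))
```

When pruning removes the split below a row's leaf, the row moves up to the parent node and its prediction becomes the parent's mean, which is how the prediction can shift as it did between 2.223368 and 2.561709.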