Los datos provienen de un censo en USA donde hay que predecir si la persona gana mas o menos de u$50K anuales.
adult <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = F)
nombres <- c('age','workclass','fnlwgt','education',
'education-num','marital-status','occupation',
'relationship','race','sex','capital-gain',
'capital-loss','hours-per-week','native-country','class')
names(adult) <- nombres
dim(adult)
## [1] 32561 15
str(adult, list.len = ncol(adult))
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education-num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital-status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital-gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital-loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours-per-week: int 40 13 40 40 40 40 16 45 50 40 ...
## $ native-country: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
## $ class : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
summary(adult)
## age workclass fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 ? : 1836 Mean : 189778
## 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education education-num marital-status
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship
## Prof-specialty :4140 Husband :13193
## Craft-repair :4099 Not-in-family : 8305
## Exec-managerial:4066 Other-relative: 981
## Adm-clerical :3770 Own-child : 5068
## Sales :3650 Unmarried : 3446
## Other-service :3295 Wife : 1568
## (Other) :9541
## race sex capital-gain
## Amer-Indian-Eskimo: 311 Female:10771 Min. : 0
## Asian-Pac-Islander: 1039 Male :21790 1st Qu.: 0
## Black : 3124 Median : 0
## Other : 271 Mean : 1078
## White :27816 3rd Qu.: 0
## Max. :99999
##
## capital-loss hours-per-week native-country class
## Min. : 0.0 Min. : 1.00 United-States:29170 <=50K:24720
## 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 643 >50K : 7841
## Median : 0.0 Median :40.00 ? : 583
## Mean : 87.3 Mean :40.44 Philippines : 198
## 3rd Qu.: 0.0 3rd Qu.:45.00 Germany : 137
## Max. :4356.0 Max. :99.00 Canada : 121
## (Other) : 1709
table(adult$class) ; prop.table(table(adult$class))
##
## <=50K >50K
## 24720 7841
##
## <=50K >50K
## 0.7591904 0.2408096
Para comenzar, vamos a dividir el dataset en training y testing para evitar posibles overfittings del modelo. Para ello utilizamos la funcion “sample” que escoge de manera aleatoria un subconjunto de valores del vector que le pasamos. La cantidad de numeros escogidos esta condicionado por el parametro “size”.
idx <- sample(x = 1:nrow(adult), size = nrow(adult) * 0.7)
print(idx, max = 20)
## [1] 11812 5904 29362 29901 2525 442 16563 8311 4738 30997 27423
## [12] 11396 6005 26601 22840 7315 32479 5603 12148 21523
## [ reached getOption("max.print") -- omitted 22772 entries ]
Como puede observarse, ahora tenemos un 22792 numeros que corresponden a los indices fila de nuestro dataset. Por lo tanto, procedemos a armar nuestros nuevos conjuntos de datos.
train <- adult[idx,] ; test <- adult[-idx,]
dim(train) ; dim(test)
## [1] 22792 15
## [1] 9769 15
Ahora si comencemos a modelar.
# Hago el modelo
mdlrpart <- rpart(formula = class ~ ., data = train)
# Hago un print del modelo
print(mdlrpart, digits = 2)
## n= 22792
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 22792 5500 <=50K (0.758 0.242)
## 2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 12441 830 <=50K (0.933 0.067)
## 4) capital-gain< 7.1e+03 12215 610 <=50K (0.950 0.050) *
## 5) capital-gain>=7.1e+03 226 6 >50K (0.027 0.973) *
## 3) relationship= Husband, Wife 10351 4700 <=50K (0.547 0.453)
## 6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 7263 2500 <=50K (0.662 0.338)
## 12) capital-gain< 5.1e+03 6888 2100 <=50K (0.697 0.303) *
## 13) capital-gain>=5.1e+03 375 8 >50K (0.021 0.979) *
## 7) education= Bachelors, Doctorate, Masters, Prof-school 3088 850 >50K (0.276 0.724) *
En el print del arbol se pueden observar los nodos y las ramas del mismo. Al final de cada nodo se puede observar la distribucion de la variable target.
plot(mdlrpart) ; text(mdlrpart)
library(partykit)
## Loading required package: grid
plot(as.party.rpart(mdlrpart))
Da un resumen de como va cayendo el error de prediccion y muestra la cantidad de cortes que fueron ocurriendo. El valor CP es el parametro principal del algoritmo. El mismo va a determinar el prunning que vamos a hacerle al arbol.
printcp(mdlrpart)
##
## Classification tree:
## rpart(formula = class ~ ., data = train)
##
## Variables actually used in tree construction:
## [1] capital-gain education relationship
##
## Root node error: 5520/22792 = 0.24219
##
## n= 22792
##
## CP nsplit rel error xerror xstd
## 1 0.125181 0 1.00000 1.00000 0.011717
## 2 0.065036 2 0.74964 0.75797 0.010588
## 3 0.038768 3 0.68460 0.69982 0.010261
## 4 0.010000 4 0.64583 0.65634 0.010000
En este ejemplo, se puede observar que a partir de un CP de 0.126 el arbol comienza a abrirse, ya que con uno mas alto los splits son 0. Esto nos da a entender que nuestro primer arbol tiene unicamente 2 splits.
print(mdlrpart, cp = 0.126387) ; print(mdlrpart, cp = 0.12)
## n= 22792
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 22792 5520 <=50K (0.7578098 0.2421902) *
## n= 22792
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 22792 5520 <=50K (0.75780976 0.24219024)
## 2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 12441 833 <=50K (0.93304397 0.06695603) *
## 3) relationship= Husband, Wife 10351 4687 <=50K (0.54719351 0.45280649)
## 6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 7263 2452 <=50K (0.66239846 0.33760154) *
## 7) education= Bachelors, Doctorate, Masters, Prof-school 3088 853 >50K (0.27623057 0.72376943) *
El primer split corresponde a la variable relationship (nodos 2 y 3) y el segundo, a la variable education (nodos 6 y 7).
Tambien se puede definir un CP en el momento de ejecutar el arbol. Esto se usa generalmente para: o lograr una respuesta mas rapida, si se quiere hacer vistazo general de los datos; o, para conseguir un arbol mas chico que evite el OVERFITTING.
mdlrpart2 <- rpart(formula = class ~ ., data = adult, control = rpart.control(cp = 0.065))
print(mdlrpart2)
## n= 32561
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 32561 7841 <=50K (0.75919044 0.24080956)
## 2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 17800 1178 <=50K (0.93382022 0.06617978) *
## 3) relationship= Husband, Wife 14761 6663 <=50K (0.54860782 0.45139218)
## 6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 10329 3456 <=50K (0.66540807 0.33459193) *
## 7) education= Bachelors, Doctorate, Masters, Prof-school 4432 1225 >50K (0.27639892 0.72360108) *
La funcion summary da informacion de las decisiones que tomo el algoritmo en cada nodo para elegir variables, seguir abriendo o detenerse. Tambien da informacion acerca de como trato los NAs.
summary(mdlrpart)
## Call:
## rpart(formula = class ~ ., data = train)
## n= 22792
##
## CP nsplit rel error xerror xstd
## 1 0.12518116 0 1.0000000 1.0000000 0.01171684
## 2 0.06503623 2 0.7496377 0.7579710 0.01058804
## 3 0.03876812 3 0.6846014 0.6998188 0.01026115
## 4 0.01000000 4 0.6458333 0.6563406 0.01000008
##
## Variable importance
## relationship marital-status capital-gain education education-num
## 24 24 11 9 9
## sex occupation age hours-per-week
## 8 7 6 3
##
## Node number 1: 22792 observations, complexity param=0.1251812
## predicted class= <=50K expected loss=0.2421902 P(node) =1
## class counts: 17272 5520
## probabilities: 0.758 0.242
## left son=2 (12441 obs) right son=3 (10351 obs)
## Primary splits:
## relationship splits as RLLLLR, improve=1682.3770, (0 missing)
## marital-status splits as LRRLLLL, improve=1653.7110, (0 missing)
## capital-gain < 5095.5 to the left, improve=1169.7100, (0 missing)
## education splits as LLLLLLLLLRRLRLRL, improve= 865.8919, (0 missing)
## education-num < 12.5 to the left, improve= 865.8919, (0 missing)
## Surrogate splits:
## marital-status splits as LRRLLLL, agree=0.993, adj=0.984, (0 split)
## sex splits as LR, agree=0.691, adj=0.319, (0 split)
## age < 33.5 to the left, agree=0.651, adj=0.232, (0 split)
## occupation splits as LLLRRRLLLLRRLLR, agree=0.619, adj=0.161, (0 split)
## hours-per-week < 43.5 to the left, agree=0.604, adj=0.128, (0 split)
##
## Node number 2: 12441 observations, complexity param=0.03876812
## predicted class= <=50K expected loss=0.06695603 P(node) =0.5458494
## class counts: 11608 833
## probabilities: 0.933 0.067
## left son=4 (12215 obs) right son=5 (226 obs)
## Primary splits:
## capital-gain < 7073.5 to the left, improve=378.29570, (0 missing)
## education-num < 13.5 to the left, improve=110.81060, (0 missing)
## education splits as LLLLLLLLLLRLRLRL, improve=110.81060, (0 missing)
## occupation splits as LLLLRLLLLLRRLLL, improve= 88.95199, (0 missing)
## hours-per-week < 42.5 to the left, improve= 82.62189, (0 missing)
##
## Node number 3: 10351 observations, complexity param=0.1251812
## predicted class= <=50K expected loss=0.4528065 P(node) =0.4541506
## class counts: 5664 4687
## probabilities: 0.547 0.453
## left son=6 (7263 obs) right son=7 (3088 obs)
## Primary splits:
## education splits as LLLLLLLLLRRLRLRL, improve=646.2393, (0 missing)
## education-num < 12.5 to the left, improve=646.2393, (0 missing)
## occupation splits as LLRLRLLLLLRLRRL, improve=635.6832, (0 missing)
## capital-gain < 5095.5 to the left, improve=527.4262, (0 missing)
## capital-loss < 1782.5 to the left, improve=174.3500, (0 missing)
## Surrogate splits:
## education-num < 12.5 to the left, agree=1.000, adj=1.000, (0 split)
## occupation splits as LLRLRLLLLLRLLLL, agree=0.789, adj=0.294, (0 split)
## capital-gain < 7493 to the left, agree=0.717, adj=0.051, (0 split)
## native-country splits as LLLRLLLLLRLLLLL-LLLRRLLLRLLLLLLLLLLRRLLLLL, agree=0.709, adj=0.026, (0 split)
## capital-loss < 1894.5 to the left, agree=0.707, adj=0.018, (0 split)
##
## Node number 4: 12215 observations
## predicted class= <=50K expected loss=0.0501842 P(node) =0.5359337
## class counts: 11602 613
## probabilities: 0.950 0.050
##
## Node number 5: 226 observations
## predicted class= >50K expected loss=0.02654867 P(node) =0.00991576
## class counts: 6 220
## probabilities: 0.027 0.973
##
## Node number 6: 7263 observations, complexity param=0.06503623
## predicted class= <=50K expected loss=0.3376015 P(node) =0.3186644
## class counts: 4811 2452
## probabilities: 0.662 0.338
## left son=12 (6888 obs) right son=13 (375 obs)
## Primary splits:
## capital-gain < 5095.5 to the left, improve=325.0038, (0 missing)
## occupation splits as LR-LRLLLLLRRRRL, improve=186.4794, (0 missing)
## education splits as LLLLLLLRR--L-L-R, improve=130.1685, (0 missing)
## education-num < 9.5 to the left, improve=130.1685, (0 missing)
## age < 33.5 to the left, improve= 94.3062, (0 missing)
##
## Node number 7: 3088 observations
## predicted class= >50K expected loss=0.2762306 P(node) =0.1354861
## class counts: 853 2235
## probabilities: 0.276 0.724
##
## Node number 12: 6888 observations
## predicted class= <=50K expected loss=0.3027003 P(node) =0.3022113
## class counts: 4803 2085
## probabilities: 0.697 0.303
##
## Node number 13: 375 observations
## predicted class= >50K expected loss=0.02133333 P(node) =0.01645314
## class counts: 8 367
## probabilities: 0.021 0.979
Algunas estadisticas importantes a tener en cuenta son: 1) improve -> Mide la importancia que tenia la variable al momento de realizar el split. Es interesante poder observar que tanta diferencia hay entre las primeras dos variables, la segunda y la tercera, etc. Son realmente significativas ? 2) Surrogate -> Mide la forma en que se tratan los valores nulos (si los hay) que se encuentran en las variables que van abriendo el arbol. El valor agree dice que tan bien la variable surrogate predice el corte de la variable principal. El valor adj mide lo mismo, pero cambiando el benchmark de la medicion. Ahora en vez de medir que tan bien predice los cortes, mide que tan bien predice los cortes por encima de la regla “ir con el mas probable”.
Esta funcion hace un prunning de un arbol ya armado seteando el parametro CP. Por ejemplo, en vez de tener un arbol con 5 hojas, tengo uno con 3.
prune(mdlrpart, cp = 0.07)
## n= 22792
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 22792 5520 <=50K (0.75780976 0.24219024)
## 2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 12441 833 <=50K (0.93304397 0.06695603) *
## 3) relationship= Husband, Wife 10351 4687 <=50K (0.54719351 0.45280649)
## 6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 7263 2452 <=50K (0.66239846 0.33760154) *
## 7) education= Bachelors, Doctorate, Masters, Prof-school 3088 853 >50K (0.27623057 0.72376943) *
Este metodo utiliza criterios estadisticos tanto para elegir las primeras variables del arbol como para dividir las mismas. El nombre deviene de Conditional Inference Trees. Una de las principales diferencias que tiene con otros metodos de particion es que no hace prunning debido a su sustento estadistico. Es decir, el arbol sigue creciendo hasta que no quede ninguna variable que posea una relacion significativa con el target.
Comencemos…
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'party'
## The following objects are masked from 'package:partykit':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner,
## node_surv, node_terminal
mdlctree <- ctree(class ~ . , data = train)
plot(mdlctree) ; print(mdlctree)
##
## Conditional inference tree with 87 terminal nodes
##
## Response: class
## Inputs: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country
## Number of observations: 22792
##
## 1) relationship == { Husband, Wife}; criterion = 1, statistic = 4693.86
## 2) education == { 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college}; criterion = 1, statistic = 1752.002
## 3) occupation == { ?, Craft-repair, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Transport-moving}; criterion = 1, statistic = 569.781
## 4) education == { 12th, Assoc-acdm, Assoc-voc, HS-grad, Some-college}; criterion = 1, statistic = 182.167
## 5) capital-gain <= 5013; criterion = 1, statistic = 93.555
## 6) occupation == { ?, Craft-repair, Machine-op-inspct, Transport-moving}; criterion = 1, statistic = 79.057
## 7) age <= 35; criterion = 1, statistic = 42.404
## 8) age <= 29; criterion = 0.999, statistic = 26.657
## 9)* weights = 357
## 8) age > 29
## 10)* weights = 470
## 7) age > 35
## 11) hours-per-week <= 33; criterion = 1, statistic = 30.974
## 12) education-num <= 9; criterion = 0.999, statistic = 18.986
## 13)* weights = 110
## 12) education-num > 9
## 14)* weights = 54
## 11) hours-per-week > 33
## 15) education-num <= 9; criterion = 1, statistic = 26.727
## 16)* weights = 1036
## 15) education-num > 9
## 17) capital-loss <= 1740; criterion = 0.996, statistic = 18.112
## 18) fnlwgt <= 193459; criterion = 0.973, statistic = 15.64
## 19) workclass == { Federal-gov, Local-gov, Self-emp-not-inc}; criterion = 0.971, statistic = 20.638
## 20)* weights = 50
## 19) workclass == { ?, Private, Self-emp-inc, State-gov}
## 21)* weights = 249
## 18) fnlwgt > 193459
## 22)* weights = 187
## 17) capital-loss > 1740
## 23) capital-loss <= 1887; criterion = 0.996, statistic = 13.282
## 24)* weights = 18
## 23) capital-loss > 1887
## 25)* weights = 9
## 6) occupation == { Farming-fishing, Handlers-cleaners, Other-service, Priv-house-serv}
## 26) hours-per-week <= 32; criterion = 0.991, statistic = 34.599
## 27)* weights = 133
## 26) hours-per-week > 32
## 28)* weights = 678
## 5) capital-gain > 5013
## 29) age <= 61; criterion = 1, statistic = 19.866
## 30)* weights = 129
## 29) age > 61
## 31)* weights = 19
## 4) education == { 10th, 11th, 1st-4th, 5th-6th, 7th-8th, 9th, Preschool}
## 32) capital-gain <= 5013; criterion = 1, statistic = 34.764
## 33) capital-loss <= 1735; criterion = 1, statistic = 26.104
## 34) hours-per-week <= 49; criterion = 0.998, statistic = 22.224
## 35) workclass == { ?, Federal-gov, Local-gov, Private, Self-emp-not-inc, State-gov}; criterion = 0.999, statistic = 28.701
## 36)* weights = 757
## 35) workclass == { Self-emp-inc}
## 37)* weights = 9
## 34) hours-per-week > 49
## 38)* weights = 132
## 33) capital-loss > 1735
## 39) capital-loss <= 1977; criterion = 0.989, statistic = 11.183
## 40)* weights = 10
## 39) capital-loss > 1977
## 41)* weights = 10
## 32) capital-gain > 5013
## 42)* weights = 18
## 3) occupation == { Adm-clerical, Exec-managerial, Prof-specialty, Protective-serv, Sales, Tech-support}
## 43) education-num <= 9; criterion = 1, statistic = 90.391
## 44) capital-gain <= 5013; criterion = 1, statistic = 36.886
## 45) capital-loss <= 1740; criterion = 1, statistic = 42.749
## 46) education-num <= 8; criterion = 0.997, statistic = 34.29
## 47)* weights = 164
## 46) education-num > 8
## 48) hours-per-week <= 39; criterion = 0.999, statistic = 33.63
## 49)* weights = 151
## 48) hours-per-week > 39
## 50) age <= 35; criterion = 1, statistic = 30.507
## 51) age <= 28; criterion = 0.996, statistic = 13.04
## 52)* weights = 90
## 51) age > 28
## 53)* weights = 168
## 50) age > 35
## 54) workclass == { Self-emp-not-inc, State-gov}; criterion = 0.964, statistic = 20.61
## 55)* weights = 103
## 54) workclass == { Federal-gov, Local-gov, Private, Self-emp-inc}
## 56)* weights = 501
## 45) capital-loss > 1740
## 57) capital-loss <= 1977; criterion = 0.992, statistic = 11.89
## 58) workclass == { Federal-gov, Local-gov, Self-emp-inc, State-gov}; criterion = 0.991, statistic = 21.5
## 59)* weights = 13
## 58) workclass == { Private, Self-emp-not-inc}
## 60)* weights = 32
## 57) capital-loss > 1977
## 61)* weights = 10
## 44) capital-gain > 5013
## 62)* weights = 77
## 43) education-num > 9
## 63) capital-gain <= 5013; criterion = 1, statistic = 35.641
## 64) age <= 33; criterion = 1, statistic = 30.77
## 65) age <= 24; criterion = 0.953, statistic = 22.785
## 66)* weights = 35
## 65) age > 24
## 67)* weights = 310
## 64) age > 33
## 68) hours-per-week <= 34; criterion = 0.999, statistic = 23.815
## 69)* weights = 91
## 68) hours-per-week > 34
## 70) workclass == { Self-emp-not-inc, State-gov}; criterion = 0.986, statistic = 22.812
## 71)* weights = 112
## 70) workclass == { Federal-gov, Local-gov, Private, Self-emp-inc}
## 72) capital-loss <= 1740; criterion = 0.989, statistic = 19.346
## 73)* weights = 787
## 72) capital-loss > 1740
## 74) capital-loss <= 1902; criterion = 0.979, statistic = 12.265
## 75)* weights = 36
## 74) capital-loss > 1902
## 76)* weights = 16
## 63) capital-gain > 5013
## 77)* weights = 132
## 2) education == { Bachelors, Doctorate, Masters, Prof-school}
## 78) occupation == { Armed-Forces, Exec-managerial, Prof-specialty, Protective-serv, Sales, Tech-support}; criterion = 1, statistic = 190.056
## 79) capital-loss <= 1740; criterion = 1, statistic = 48.285
## 80) capital-gain <= 5013; criterion = 1, statistic = 53.016
## 81) capital-gain <= 3103; criterion = 1, statistic = 44.109
## 82) education-num <= 13; criterion = 1, statistic = 39.733
## 83) workclass == { Local-gov, Self-emp-not-inc, State-gov}; criterion = 0.996, statistic = 34.097
## 84)* weights = 289
## 83) workclass == { Federal-gov, Private, Self-emp-inc}
## 85)* weights = 845
## 82) education-num > 13
## 86) hours-per-week <= 25; criterion = 0.983, statistic = 39.197
## 87)* weights = 33
## 86) hours-per-week > 25
## 88)* weights = 696
## 81) capital-gain > 3103
## 89)* weights = 35
## 80) capital-gain > 5013
## 90) age <= 69; criterion = 1, statistic = 17.658
## 91)* weights = 408
## 90) age > 69
## 92)* weights = 8
## 79) capital-loss > 1740
## 93)* weights = 247
## 78) occupation == { ?, Adm-clerical, Craft-repair, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Transport-moving}
## 94) capital-gain <= 5013; criterion = 1, statistic = 30.873
## 95) capital-loss <= 1740; criterion = 1, statistic = 31.843
## 96)* weights = 444
## 95) capital-loss > 1740
## 97)* weights = 29
## 94) capital-gain > 5013
## 98)* weights = 54
## 1) relationship == { Not-in-family, Other-relative, Own-child, Unmarried}
## 99) capital-gain <= 6849; criterion = 1, statistic = 1366.286
## 100) education == { Doctorate, Masters, Prof-school}; criterion = 1, statistic = 1066.048
## 101) education-num <= 14; criterion = 1, statistic = 41.236
## 102) hours-per-week <= 43; criterion = 0.998, statistic = 35.117
## 103) age <= 47; criterion = 0.981, statistic = 13.752
## 104) capital-loss <= 0; criterion = 0.97, statistic = 15.069
## 105)* weights = 228
## 104) capital-loss > 0
## 106)* weights = 9
## 103) age > 47
## 107)* weights = 84
## 102) hours-per-week > 43
## 108) sex == { Female}; criterion = 0.954, statistic = 22.522
## 109)* weights = 84
## 108) sex == { Male}
## 110)* weights = 79
## 101) education-num > 14
## 111) capital-loss <= 2339; criterion = 0.964, statistic = 19.135
## 112)* weights = 170
## 111) capital-loss > 2339
## 113)* weights = 7
## 100) education == { 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, HS-grad, Preschool, Some-college}
## 114) occupation == { ?, Adm-clerical, Armed-Forces, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Transport-moving}; criterion = 1, statistic = 386.703
## 115) capital-loss <= 1726; criterion = 1, statistic = 84.958
## 116) age <= 33; criterion = 1, statistic = 95.256
## 117) native-country == { ?, England, Guatemala, Portugal}; criterion = 1, statistic = 120.776
## 118) hours-per-week <= 45; criterion = 0.975, statistic = 16.221
## 119) workclass == { Private}; criterion = 0.993, statistic = 17.75
## 120)* weights = 61
## 119) workclass == { ?, Local-gov, Self-emp-not-inc}
## 121)* weights = 14
## 118) hours-per-week > 45
## 122)* weights = 14
## 117) native-country == { Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, France, Germany, Greece, Haiti, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Peru, Philippines, Poland, Puerto-Rico, South, Taiwan, Thailand, United-States, Vietnam, Yugoslavia}
## 123) education == { 7th-8th}; criterion = 0.997, statistic = 37.516
## 124)* weights = 48
## 123) education == { 10th, 11th, 12th, 1st-4th, 5th-6th, 9th, Assoc-acdm, Assoc-voc, Bachelors, HS-grad, Preschool, Some-college}
## 125)* weights = 3738
## 116) age > 33
## 126) native-country == { ?, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, Germany, Guatemala, Haiti, Honduras, Hungary, Iran, Ireland, Jamaica, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Scotland, Trinadad&Tobago, United-States, Vietnam}; criterion = 1, statistic = 121.996
## 127) marital-status == { Married-civ-spouse}; criterion = 1, statistic = 40.656
## 128)* weights = 29
## 127) marital-status == { Divorced, Married-spouse-absent, Never-married, Separated, Widowed}
## 129) hours-per-week <= 41; criterion = 1, statistic = 33.277
## 130)* weights = 2195
## 129) hours-per-week > 41
## 131)* weights = 425
## 126) native-country == { England, Greece, India, Italy, Japan, Puerto-Rico, South}
## 132)* weights = 50
## 115) capital-loss > 1726
## 133)* weights = 71
## 114) occupation == { Craft-repair, Exec-managerial, Prof-specialty, Protective-serv, Sales, Tech-support}
## 134) hours-per-week <= 42; criterion = 1, statistic = 153.809
## 135) age <= 37; criterion = 1, statistic = 85.299
## 136) age <= 26; criterion = 1, statistic = 51.545
## 137)* weights = 1354
## 136) age > 26
## 138) education-num <= 9; criterion = 0.99, statistic = 47.247
## 139)* weights = 360
## 138) education-num > 9
## 140) capital-loss <= 1408; criterion = 0.995, statistic = 33.171
## 141) marital-status == { Married-civ-spouse, Widowed}; criterion = 0.995, statistic = 37.263
## 142)* weights = 19
## 141) marital-status == { Divorced, Married-spouse-absent, Never-married, Separated}
## 143)* weights = 653
## 140) capital-loss > 1408
## 144)* weights = 17
## 135) age > 37
## 145) sex == { Male}; criterion = 1, statistic = 47.71
## 146) capital-loss <= 2205; criterion = 0.996, statistic = 22.677
## 147) education-num <= 9; criterion = 0.987, statistic = 21.153
## 148) occupation == { Craft-repair, Exec-managerial, Prof-specialty, Sales}; criterion = 0.996, statistic = 23.433
## 149) marital-status == { Married-civ-spouse, Married-spouse-absent}; criterion = 1, statistic = 34.363
## 150)* weights = 13
## 149) marital-status == { Divorced, Never-married, Separated, Widowed}
## 151) race == { Amer-Indian-Eskimo, Asian-Pac-Islander, Other}; criterion = 0.998, statistic = 22.294
## 152)* weights = 9
## 151) race == { Black, White}
## 153)* weights = 177
## 148) occupation == { Protective-serv, Tech-support}
## 154)* weights = 19
## 147) education-num > 9
## 155)* weights = 297
## 146) capital-loss > 2205
## 156)* weights = 8
## 145) sex == { Female}
## 157) race == { Asian-Pac-Islander}; criterion = 1, statistic = 54.52
## 158)* weights = 17
## 157) race == { Amer-Indian-Eskimo, Black, Other, White}
## 159)* weights = 664
## 134) hours-per-week > 42
## 160) age <= 27; criterion = 1, statistic = 44.927
## 161)* weights = 301
## 160) age > 27
## 162) education-num <= 12; criterion = 1, statistic = 47.123
## 163) capital-loss <= 2339; criterion = 0.998, statistic = 22.116
## 164) marital-status == { Divorced, Married-spouse-absent, Never-married, Separated, Widowed}; criterion = 0.997, statistic = 23.999
## 165)* weights = 624
## 164) marital-status == { Married-civ-spouse}
## 166)* weights = 10
## 163) capital-loss > 2339
## 167)* weights = 8
## 162) education-num > 12
## 168) capital-loss <= 0; criterion = 0.984, statistic = 10.52
## 169)* weights = 342
## 168) capital-loss > 0
## 170)* weights = 17
## 99) capital-gain > 6849
## 171) occupation == { ?, Handlers-cleaners, Transport-moving}; criterion = 1, statistic = 50.848
## 172)* weights = 12
## 171) occupation == { Adm-clerical, Craft-repair, Exec-managerial, Farming-fishing, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Sales, Tech-support}
## 173)* weights = 214
Como son muchas ramas se hace imposible ver el grafico, por lo tanto usamos las reglas.
Puedo utilizar la funcion “nodes” para seleccionar a partir de que nodo quiero ver el arbol.
nodes(mdlctree, 42)
## [[1]]
## 42)* weights = 18
mincriterion: Establece el nivel de significancia que va a usar el algoritmo para seguir abriendo el arbol. Esta entre 0 y 1. A mayor valor mas restrictivo es el arbol (menos ramas tiene). minsplit: La minima cantidad de casos en un nodo para considerar una particion. minbucket: La minima cantidad de casos que deben quedar en un nodo final. maxdepth: Maxima profundidad del arbol. mtry: Cantidad de variables que se van a considerar en cada split. Se eligen aleatoriamente.
Como ejemplo usemos maxdepth para restringir la profundidad del arbol a no mas de 2.
mdlctree2 <- ctree(class ~ . , data = train, controls = ctree_control(maxdepth = 2))
plot(mdlctree2)
Depende del conjunto de testing. Hay que probar distintas combinaciones y fijarse cual da mejor de acuerdo a alaguna/s metrica/s selecionada/s. En nuestro caso vamos a probar como ordenan nuestros arboles.
print(mdlrpart) ; print(mdlctree)
## n= 22792
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 22792 5520 <=50K (0.75780976 0.24219024)
## 2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 12441 833 <=50K (0.93304397 0.06695603)
## 4) capital-gain< 7073.5 12215 613 <=50K (0.94981580 0.05018420) *
## 5) capital-gain>=7073.5 226 6 >50K (0.02654867 0.97345133) *
## 3) relationship= Husband, Wife 10351 4687 <=50K (0.54719351 0.45280649)
## 6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 7263 2452 <=50K (0.66239846 0.33760154)
## 12) capital-gain< 5095.5 6888 2085 <=50K (0.69729965 0.30270035) *
## 13) capital-gain>=5095.5 375 8 >50K (0.02133333 0.97866667) *
## 7) education= Bachelors, Doctorate, Masters, Prof-school 3088 853 >50K (0.27623057 0.72376943) *
##
## Conditional inference tree with 87 terminal nodes
##
## Response: class
## Inputs: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country
## Number of observations: 22792
##
## 1) relationship == { Husband, Wife}; criterion = 1, statistic = 4693.86
## 2) education == { 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college}; criterion = 1, statistic = 1752.002
## 3) occupation == { ?, Craft-repair, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Transport-moving}; criterion = 1, statistic = 569.781
## 4) education == { 12th, Assoc-acdm, Assoc-voc, HS-grad, Some-college}; criterion = 1, statistic = 182.167
## 5) capital-gain <= 5013; criterion = 1, statistic = 93.555
## 6) occupation == { ?, Craft-repair, Machine-op-inspct, Transport-moving}; criterion = 1, statistic = 79.057
## 7) age <= 35; criterion = 1, statistic = 42.404
## 8) age <= 29; criterion = 0.999, statistic = 26.657
## 9)* weights = 357
## 8) age > 29
## 10)* weights = 470
## 7) age > 35
## 11) hours-per-week <= 33; criterion = 1, statistic = 30.974
## 12) education-num <= 9; criterion = 0.999, statistic = 18.986
## 13)* weights = 110
## 12) education-num > 9
## 14)* weights = 54
## 11) hours-per-week > 33
## 15) education-num <= 9; criterion = 1, statistic = 26.727
## 16)* weights = 1036
## 15) education-num > 9
## 17) capital-loss <= 1740; criterion = 0.996, statistic = 18.112
## 18) fnlwgt <= 193459; criterion = 0.973, statistic = 15.64
## 19) workclass == { Federal-gov, Local-gov, Self-emp-not-inc}; criterion = 0.971, statistic = 20.638
## 20)* weights = 50
## 19) workclass == { ?, Private, Self-emp-inc, State-gov}
## 21)* weights = 249
## 18) fnlwgt > 193459
## 22)* weights = 187
## 17) capital-loss > 1740
## 23) capital-loss <= 1887; criterion = 0.996, statistic = 13.282
## 24)* weights = 18
## 23) capital-loss > 1887
## 25)* weights = 9
## 6) occupation == { Farming-fishing, Handlers-cleaners, Other-service, Priv-house-serv}
## 26) hours-per-week <= 32; criterion = 0.991, statistic = 34.599
## 27)* weights = 133
## 26) hours-per-week > 32
## 28)* weights = 678
## 5) capital-gain > 5013
## 29) age <= 61; criterion = 1, statistic = 19.866
## 30)* weights = 129
## 29) age > 61
## 31)* weights = 19
## 4) education == { 10th, 11th, 1st-4th, 5th-6th, 7th-8th, 9th, Preschool}
## 32) capital-gain <= 5013; criterion = 1, statistic = 34.764
## 33) capital-loss <= 1735; criterion = 1, statistic = 26.104
## 34) hours-per-week <= 49; criterion = 0.998, statistic = 22.224
## 35) workclass == { ?, Federal-gov, Local-gov, Private, Self-emp-not-inc, State-gov}; criterion = 0.999, statistic = 28.701
## 36)* weights = 757
## 35) workclass == { Self-emp-inc}
## 37)* weights = 9
## 34) hours-per-week > 49
## 38)* weights = 132
## 33) capital-loss > 1735
## 39) capital-loss <= 1977; criterion = 0.989, statistic = 11.183
## 40)* weights = 10
## 39) capital-loss > 1977
## 41)* weights = 10
## 32) capital-gain > 5013
## 42)* weights = 18
## 3) occupation == { Adm-clerical, Exec-managerial, Prof-specialty, Protective-serv, Sales, Tech-support}
## 43) education-num <= 9; criterion = 1, statistic = 90.391
## 44) capital-gain <= 5013; criterion = 1, statistic = 36.886
## 45) capital-loss <= 1740; criterion = 1, statistic = 42.749
## 46) education-num <= 8; criterion = 0.997, statistic = 34.29
## 47)* weights = 164
## 46) education-num > 8
## 48) hours-per-week <= 39; criterion = 0.999, statistic = 33.63
## 49)* weights = 151
## 48) hours-per-week > 39
## 50) age <= 35; criterion = 1, statistic = 30.507
## 51) age <= 28; criterion = 0.996, statistic = 13.04
## 52)* weights = 90
## 51) age > 28
## 53)* weights = 168
## 50) age > 35
## 54) workclass == { Self-emp-not-inc, State-gov}; criterion = 0.964, statistic = 20.61
## 55)* weights = 103
## 54) workclass == { Federal-gov, Local-gov, Private, Self-emp-inc}
## 56)* weights = 501
## 45) capital-loss > 1740
## 57) capital-loss <= 1977; criterion = 0.992, statistic = 11.89
## 58) workclass == { Federal-gov, Local-gov, Self-emp-inc, State-gov}; criterion = 0.991, statistic = 21.5
## 59)* weights = 13
## 58) workclass == { Private, Self-emp-not-inc}
## 60)* weights = 32
## 57) capital-loss > 1977
## 61)* weights = 10
## 44) capital-gain > 5013
## 62)* weights = 77
## 43) education-num > 9
## 63) capital-gain <= 5013; criterion = 1, statistic = 35.641
## 64) age <= 33; criterion = 1, statistic = 30.77
## 65) age <= 24; criterion = 0.953, statistic = 22.785
## 66)* weights = 35
## 65) age > 24
## 67)* weights = 310
## 64) age > 33
## 68) hours-per-week <= 34; criterion = 0.999, statistic = 23.815
## 69)* weights = 91
## 68) hours-per-week > 34
## 70) workclass == { Self-emp-not-inc, State-gov}; criterion = 0.986, statistic = 22.812
## 71)* weights = 112
## 70) workclass == { Federal-gov, Local-gov, Private, Self-emp-inc}
## 72) capital-loss <= 1740; criterion = 0.989, statistic = 19.346
## 73)* weights = 787
## 72) capital-loss > 1740
## 74) capital-loss <= 1902; criterion = 0.979, statistic = 12.265
## 75)* weights = 36
## 74) capital-loss > 1902
## 76)* weights = 16
## 63) capital-gain > 5013
## 77)* weights = 132
## 2) education == { Bachelors, Doctorate, Masters, Prof-school}
## 78) occupation == { Armed-Forces, Exec-managerial, Prof-specialty, Protective-serv, Sales, Tech-support}; criterion = 1, statistic = 190.056
## 79) capital-loss <= 1740; criterion = 1, statistic = 48.285
## 80) capital-gain <= 5013; criterion = 1, statistic = 53.016
## 81) capital-gain <= 3103; criterion = 1, statistic = 44.109
## 82) education-num <= 13; criterion = 1, statistic = 39.733
## 83) workclass == { Local-gov, Self-emp-not-inc, State-gov}; criterion = 0.996, statistic = 34.097
## 84)* weights = 289
## 83) workclass == { Federal-gov, Private, Self-emp-inc}
## 85)* weights = 845
## 82) education-num > 13
## 86) hours-per-week <= 25; criterion = 0.983, statistic = 39.197
## 87)* weights = 33
## 86) hours-per-week > 25
## 88)* weights = 696
## 81) capital-gain > 3103
## 89)* weights = 35
## 80) capital-gain > 5013
## 90) age <= 69; criterion = 1, statistic = 17.658
## 91)* weights = 408
## 90) age > 69
## 92)* weights = 8
## 79) capital-loss > 1740
## 93)* weights = 247
## 78) occupation == { ?, Adm-clerical, Craft-repair, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Transport-moving}
## 94) capital-gain <= 5013; criterion = 1, statistic = 30.873
## 95) capital-loss <= 1740; criterion = 1, statistic = 31.843
## 96)* weights = 444
## 95) capital-loss > 1740
## 97)* weights = 29
## 94) capital-gain > 5013
## 98)* weights = 54
## 1) relationship == { Not-in-family, Other-relative, Own-child, Unmarried}
## 99) capital-gain <= 6849; criterion = 1, statistic = 1366.286
## 100) education == { Doctorate, Masters, Prof-school}; criterion = 1, statistic = 1066.048
## 101) education-num <= 14; criterion = 1, statistic = 41.236
## 102) hours-per-week <= 43; criterion = 0.998, statistic = 35.117
## 103) age <= 47; criterion = 0.981, statistic = 13.752
## 104) capital-loss <= 0; criterion = 0.97, statistic = 15.069
## 105)* weights = 228
## 104) capital-loss > 0
## 106)* weights = 9
## 103) age > 47
## 107)* weights = 84
## 102) hours-per-week > 43
## 108) sex == { Female}; criterion = 0.954, statistic = 22.522
## 109)* weights = 84
## 108) sex == { Male}
## 110)* weights = 79
## 101) education-num > 14
## 111) capital-loss <= 2339; criterion = 0.964, statistic = 19.135
## 112)* weights = 170
## 111) capital-loss > 2339
## 113)* weights = 7
## 100) education == { 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, HS-grad, Preschool, Some-college}
## 114) occupation == { ?, Adm-clerical, Armed-Forces, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Transport-moving}; criterion = 1, statistic = 386.703
## 115) capital-loss <= 1726; criterion = 1, statistic = 84.958
## 116) age <= 33; criterion = 1, statistic = 95.256
## 117) native-country == { ?, England, Guatemala, Portugal}; criterion = 1, statistic = 120.776
## 118) hours-per-week <= 45; criterion = 0.975, statistic = 16.221
## 119) workclass == { Private}; criterion = 0.993, statistic = 17.75
## 120)* weights = 61
## 119) workclass == { ?, Local-gov, Self-emp-not-inc}
## 121)* weights = 14
## 118) hours-per-week > 45
## 122)* weights = 14
## 117) native-country == { Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, France, Germany, Greece, Haiti, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Peru, Philippines, Poland, Puerto-Rico, South, Taiwan, Thailand, United-States, Vietnam, Yugoslavia}
## 123) education == { 7th-8th}; criterion = 0.997, statistic = 37.516
## 124)* weights = 48
## 123) education == { 10th, 11th, 12th, 1st-4th, 5th-6th, 9th, Assoc-acdm, Assoc-voc, Bachelors, HS-grad, Preschool, Some-college}
## 125)* weights = 3738
## 116) age > 33
## 126) native-country == { ?, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, Germany, Guatemala, Haiti, Honduras, Hungary, Iran, Ireland, Jamaica, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Scotland, Trinadad&Tobago, United-States, Vietnam}; criterion = 1, statistic = 121.996
## 127) marital-status == { Married-civ-spouse}; criterion = 1, statistic = 40.656
## 128)* weights = 29
## 127) marital-status == { Divorced, Married-spouse-absent, Never-married, Separated, Widowed}
## 129) hours-per-week <= 41; criterion = 1, statistic = 33.277
## 130)* weights = 2195
## 129) hours-per-week > 41
## 131)* weights = 425
## 126) native-country == { England, Greece, India, Italy, Japan, Puerto-Rico, South}
## 132)* weights = 50
## 115) capital-loss > 1726
## 133)* weights = 71
## 114) occupation == { Craft-repair, Exec-managerial, Prof-specialty, Protective-serv, Sales, Tech-support}
## 134) hours-per-week <= 42; criterion = 1, statistic = 153.809
## 135) age <= 37; criterion = 1, statistic = 85.299
## 136) age <= 26; criterion = 1, statistic = 51.545
## 137)* weights = 1354
## 136) age > 26
## 138) education-num <= 9; criterion = 0.99, statistic = 47.247
## 139)* weights = 360
## 138) education-num > 9
## 140) capital-loss <= 1408; criterion = 0.995, statistic = 33.171
## 141) marital-status == { Married-civ-spouse, Widowed}; criterion = 0.995, statistic = 37.263
## 142)* weights = 19
## 141) marital-status == { Divorced, Married-spouse-absent, Never-married, Separated}
## 143)* weights = 653
## 140) capital-loss > 1408
## 144)* weights = 17
## 135) age > 37
## 145) sex == { Male}; criterion = 1, statistic = 47.71
## 146) capital-loss <= 2205; criterion = 0.996, statistic = 22.677
## 147) education-num <= 9; criterion = 0.987, statistic = 21.153
## 148) occupation == { Craft-repair, Exec-managerial, Prof-specialty, Sales}; criterion = 0.996, statistic = 23.433
## 149) marital-status == { Married-civ-spouse, Married-spouse-absent}; criterion = 1, statistic = 34.363
## 150)* weights = 13
## 149) marital-status == { Divorced, Never-married, Separated, Widowed}
## 151) race == { Amer-Indian-Eskimo, Asian-Pac-Islander, Other}; criterion = 0.998, statistic = 22.294
## 152)* weights = 9
## 151) race == { Black, White}
## 153)* weights = 177
## 148) occupation == { Protective-serv, Tech-support}
## 154)* weights = 19
## 147) education-num > 9
## 155)* weights = 297
## 146) capital-loss > 2205
## 156)* weights = 8
## 145) sex == { Female}
## 157) race == { Asian-Pac-Islander}; criterion = 1, statistic = 54.52
## 158)* weights = 17
## 157) race == { Amer-Indian-Eskimo, Black, Other, White}
## 159)* weights = 664
## 134) hours-per-week > 42
## 160) age <= 27; criterion = 1, statistic = 44.927
## 161)* weights = 301
## 160) age > 27
## 162) education-num <= 12; criterion = 1, statistic = 47.123
## 163) capital-loss <= 2339; criterion = 0.998, statistic = 22.116
## 164) marital-status == { Divorced, Married-spouse-absent, Never-married, Separated, Widowed}; criterion = 0.997, statistic = 23.999
## 165)* weights = 624
## 164) marital-status == { Married-civ-spouse}
## 166)* weights = 10
## 163) capital-loss > 2339
## 167)* weights = 8
## 162) education-num > 12
## 168) capital-loss <= 0; criterion = 0.984, statistic = 10.52
## 169)* weights = 342
## 168) capital-loss > 0
## 170)* weights = 17
## 99) capital-gain > 6849
## 171) occupation == { ?, Handlers-cleaners, Transport-moving}; criterion = 1, statistic = 50.848
## 172)* weights = 12
## 171) occupation == { Adm-clerical, Craft-repair, Exec-managerial, Farming-fishing, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Sales, Tech-support}
## 173)* weights = 214
En vez de predecir la clase vamos a generar, para cada caso del dataset, el vector de probabilidades condicionadas de cada nodo del arbol.
prob_rpart <- predict(mdlrpart, test, type = 'prob')
prob_ctree <- treeresponse(mdlctree, test, type = 'prob')
Vemos que tienen estas variables.
head(prob_rpart) ; head(prob_ctree)
## <=50K >50K
## 2 0.2762306 0.7237694
## 5 0.2762306 0.7237694
## 12 0.2762306 0.7237694
## 21 0.2762306 0.7237694
## 22 0.9498158 0.0501842
## 25 0.9498158 0.0501842
## [[1]]
## [1] 0.4532872 0.5467128
##
## [[2]]
## [1] 0.304142 0.695858
##
## [[3]]
## [1] 0.4532872 0.5467128
##
## [[4]]
## [1] 0.2126437 0.7873563
##
## [[5]]
## [1] 0.98997722 0.01002278
##
## [[6]]
## [1] 0.95331325 0.04668675
Como puede observarse en ambos casos el resultado es una matriz con 2 columnas y n filas que contiene, para cada caso, la probabilidad de que el mismo sea <=50K o >50K. Como queremos predecir las personas que ganan mas de 50K nos quedamos con la segunda columna.
test$prob_rpart <- prob_rpart[,2]
test$prob_ctree <- matrix(unlist(prob_ctree), ncol = 2, byrow = T)[,2]
View(test)
Ahora ordenemos por cada una de las probabilidades.
Tambien podriamos hacer una matriz de confusion y calcular el Accuracy. Elegimos el corte de 0.6 en cada caso para clasificar a los de mayor renta.
prop.table(table(test$class, test$prob_rpart > 0.6))
##
## FALSE TRUE
## <=50K 0.72351315 0.03889856
## >50K 0.11444365 0.12314464
prop.table(table(test$class, test$prob_ctree > 0.6))
##
## FALSE TRUE
## <=50K 0.74050568 0.02190603
## >50K 0.12508957 0.11249872
# Accuracy 1
sum(diag(prop.table(table(test$class, test$prob_rpart > 0.6))))
## [1] 0.8466578
# Accuracy 2
sum(diag(prop.table(table(test$class, test$prob_ctree > 0.6))))
## [1] 0.8530044
T. M. Therneau et al. “An Introduction to Recursive Partitioning Using the RPART Routines” (2015)
T. Hothorn et al. “Unbiased Recursive Partitioning: A Conditional Inference Framework” (2006) C. Molnar. “Recursive partitioning by conditional inference” (2013). Seminar paper. Department of Statistics University of Munich.