Informacion de la base:

Los datos provienen de un censo en USA donde hay que predecir si la persona gana mas o menos de u$50K anuales.

Exploracion de los datos

Levanto la base

adult <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = F)

nombres <- c('age','workclass','fnlwgt','education',
              'education-num','marital-status','occupation',
             'relationship','race','sex','capital-gain',
             'capital-loss','hours-per-week','native-country','class')

names(adult) <- nombres

Estructura del dataset

dim(adult)
## [1] 32561    15
str(adult, list.len = ncol(adult))
## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
##  $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education     : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education-num : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital-status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation    : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship  : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race          : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex           : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital-gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital-loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours-per-week: int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native-country: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
##  $ class         : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...

Summary de la base

summary(adult)
##       age                    workclass         fnlwgt       
##  Min.   :17.00    Private         :22696   Min.   :  12285  
##  1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00    Local-gov       : 2093   Median : 178356  
##  Mean   :38.58    ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.00    State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)          :  981                    
##          education     education-num                  marital-status 
##   HS-grad     :10501   Min.   : 1.00    Divorced             : 4443  
##   Some-college: 7291   1st Qu.: 9.00    Married-AF-spouse    :   23  
##   Bachelors   : 5355   Median :10.00    Married-civ-spouse   :14976  
##   Masters     : 1723   Mean   :10.08    Married-spouse-absent:  418  
##   Assoc-voc   : 1382   3rd Qu.:12.00    Never-married        :10683  
##   11th        : 1175   Max.   :16.00    Separated            : 1025  
##  (Other)      : 5134                    Widowed              :  993  
##             occupation            relationship  
##   Prof-specialty :4140    Husband       :13193  
##   Craft-repair   :4099    Not-in-family : 8305  
##   Exec-managerial:4066    Other-relative:  981  
##   Adm-clerical   :3770    Own-child     : 5068  
##   Sales          :3650    Unmarried     : 3446  
##   Other-service  :3295    Wife          : 1568  
##  (Other)         :9541                          
##                   race            sex         capital-gain  
##   Amer-Indian-Eskimo:  311    Female:10771   Min.   :    0  
##   Asian-Pac-Islander: 1039    Male  :21790   1st Qu.:    0  
##   Black             : 3124                   Median :    0  
##   Other             :  271                   Mean   : 1078  
##   White             :27816                   3rd Qu.:    0  
##                                              Max.   :99999  
##                                                             
##   capital-loss    hours-per-week         native-country     class      
##  Min.   :   0.0   Min.   : 1.00    United-States:29170    <=50K:24720  
##  1st Qu.:   0.0   1st Qu.:40.00    Mexico       :  643    >50K : 7841  
##  Median :   0.0   Median :40.00    ?            :  583                 
##  Mean   :  87.3   Mean   :40.44    Philippines  :  198                 
##  3rd Qu.:   0.0   3rd Qu.:45.00    Germany      :  137                 
##  Max.   :4356.0   Max.   :99.00    Canada       :  121                 
##                                   (Other)       : 1709

Me fijo las frecuencias de las diferentes rentas

table(adult$class) ; prop.table(table(adult$class))
## 
##  <=50K   >50K 
##  24720   7841
## 
##     <=50K      >50K 
## 0.7591904 0.2408096

Modelado

Para comenzar, vamos a dividir el dataset en training y testing para evitar posibles overfittings del modelo. Para ello utilizamos la funcion “sample” que escoge de manera aleatoria un subconjunto de valores del vector que le pasamos. La cantidad de numeros escogidos esta condicionado por el parametro “size”.

idx <- sample(x = 1:nrow(adult), size = nrow(adult) * 0.7)

print(idx, max = 20)
##  [1] 11812  5904 29362 29901  2525   442 16563  8311  4738 30997 27423
## [12] 11396  6005 26601 22840  7315 32479  5603 12148 21523
##  [ reached getOption("max.print") -- omitted 22772 entries ]

Como puede observarse, ahora tenemos un 22792 numeros que corresponden a los indices fila de nuestro dataset. Por lo tanto, procedemos a armar nuestros nuevos conjuntos de datos.

train <- adult[idx,] ; test <- adult[-idx,]

dim(train) ; dim(test)
## [1] 22792    15
## [1] 9769   15

Ahora si comencemos a modelar.

RPART (Recursive Partitioning)

# Hago el modelo
mdlrpart <- rpart(formula = class ~ ., data = train)

# Hago un print del modelo

print(mdlrpart, digits = 2)
## n= 22792 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 22792 5500  <=50K (0.758 0.242)  
##    2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 12441  830  <=50K (0.933 0.067)  
##      4) capital-gain< 7.1e+03 12215  610  <=50K (0.950 0.050) *
##      5) capital-gain>=7.1e+03 226    6  >50K (0.027 0.973) *
##    3) relationship= Husband, Wife 10351 4700  <=50K (0.547 0.453)  
##      6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 7263 2500  <=50K (0.662 0.338)  
##       12) capital-gain< 5.1e+03 6888 2100  <=50K (0.697 0.303) *
##       13) capital-gain>=5.1e+03 375    8  >50K (0.021 0.979) *
##      7) education= Bachelors, Doctorate, Masters, Prof-school 3088  850  >50K (0.276 0.724) *

En el print del arbol se pueden observar los nodos y las ramas del mismo. Al final de cada nodo se puede observar la distribucion de la variable target.

Grafico el arbol

plot(mdlrpart) ; text(mdlrpart)

Grafico el arbol usando la libreria partykit

library(partykit)
## Loading required package: grid
plot(as.party.rpart(mdlrpart))

Algunas funciones de RPART

printcp

Da un resumen de como va cayendo el error de prediccion y muestra la cantidad de cortes que fueron ocurriendo. El valor CP es el parametro principal del algoritmo. El mismo va a determinar el prunning que vamos a hacerle al arbol.

printcp(mdlrpart)
## 
## Classification tree:
## rpart(formula = class ~ ., data = train)
## 
## Variables actually used in tree construction:
## [1] capital-gain education    relationship
## 
## Root node error: 5520/22792 = 0.24219
## 
## n= 22792 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.125181      0   1.00000 1.00000 0.011717
## 2 0.065036      2   0.74964 0.75797 0.010588
## 3 0.038768      3   0.68460 0.69982 0.010261
## 4 0.010000      4   0.64583 0.65634 0.010000

En este ejemplo, se puede observar que a partir de un CP de 0.126 el arbol comienza a abrirse, ya que con uno mas alto los splits son 0. Esto nos da a entender que nuestro primer arbol tiene unicamente 2 splits.

print(mdlrpart, cp = 0.126387) ; print(mdlrpart, cp = 0.12)
## n= 22792 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 22792 5520  <=50K (0.7578098 0.2421902) *
## n= 22792 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 22792 5520  <=50K (0.75780976 0.24219024)  
##   2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 12441  833  <=50K (0.93304397 0.06695603) *
##   3) relationship= Husband, Wife 10351 4687  <=50K (0.54719351 0.45280649)  
##     6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 7263 2452  <=50K (0.66239846 0.33760154) *
##     7) education= Bachelors, Doctorate, Masters, Prof-school 3088  853  >50K (0.27623057 0.72376943) *

El primer split corresponde a la variable relationship (nodos 2 y 3) y el segundo, a la variable education (nodos 6 y 7).

Tambien se puede definir un CP en el momento de ejecutar el arbol. Esto se usa generalmente para: o lograr una respuesta mas rapida, si se quiere hacer vistazo general de los datos; o, para conseguir un arbol mas chico que evite el OVERFITTING.

mdlrpart2 <- rpart(formula = class ~ ., data = adult, control = rpart.control(cp = 0.065))

print(mdlrpart2)
## n= 32561 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 32561 7841  <=50K (0.75919044 0.24080956)  
##   2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 17800 1178  <=50K (0.93382022 0.06617978) *
##   3) relationship= Husband, Wife 14761 6663  <=50K (0.54860782 0.45139218)  
##     6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 10329 3456  <=50K (0.66540807 0.33459193) *
##     7) education= Bachelors, Doctorate, Masters, Prof-school 4432 1225  >50K (0.27639892 0.72360108) *

summary

La funcion summary da informacion de las decisiones que tomo el algoritmo en cada nodo para elegir variables, seguir abriendo o detenerse. Tambien da informacion acerca de como trato los NAs.

summary(mdlrpart)
## Call:
## rpart(formula = class ~ ., data = train)
##   n= 22792 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.12518116      0 1.0000000 1.0000000 0.01171684
## 2 0.06503623      2 0.7496377 0.7579710 0.01058804
## 3 0.03876812      3 0.6846014 0.6998188 0.01026115
## 4 0.01000000      4 0.6458333 0.6563406 0.01000008
## 
## Variable importance
##   relationship marital-status   capital-gain      education  education-num 
##             24             24             11              9              9 
##            sex     occupation            age hours-per-week 
##              8              7              6              3 
## 
## Node number 1: 22792 observations,    complexity param=0.1251812
##   predicted class= <=50K  expected loss=0.2421902  P(node) =1
##     class counts: 17272  5520
##    probabilities: 0.758 0.242 
##   left son=2 (12441 obs) right son=3 (10351 obs)
##   Primary splits:
##       relationship   splits as  RLLLLR, improve=1682.3770, (0 missing)
##       marital-status splits as  LRRLLLL, improve=1653.7110, (0 missing)
##       capital-gain   < 5095.5 to the left,  improve=1169.7100, (0 missing)
##       education      splits as  LLLLLLLLLRRLRLRL, improve= 865.8919, (0 missing)
##       education-num  < 12.5   to the left,  improve= 865.8919, (0 missing)
##   Surrogate splits:
##       marital-status splits as  LRRLLLL, agree=0.993, adj=0.984, (0 split)
##       sex            splits as  LR, agree=0.691, adj=0.319, (0 split)
##       age            < 33.5   to the left,  agree=0.651, adj=0.232, (0 split)
##       occupation     splits as  LLLRRRLLLLRRLLR, agree=0.619, adj=0.161, (0 split)
##       hours-per-week < 43.5   to the left,  agree=0.604, adj=0.128, (0 split)
## 
## Node number 2: 12441 observations,    complexity param=0.03876812
##   predicted class= <=50K  expected loss=0.06695603  P(node) =0.5458494
##     class counts: 11608   833
##    probabilities: 0.933 0.067 
##   left son=4 (12215 obs) right son=5 (226 obs)
##   Primary splits:
##       capital-gain   < 7073.5 to the left,  improve=378.29570, (0 missing)
##       education-num  < 13.5   to the left,  improve=110.81060, (0 missing)
##       education      splits as  LLLLLLLLLLRLRLRL, improve=110.81060, (0 missing)
##       occupation     splits as  LLLLRLLLLLRRLLL, improve= 88.95199, (0 missing)
##       hours-per-week < 42.5   to the left,  improve= 82.62189, (0 missing)
## 
## Node number 3: 10351 observations,    complexity param=0.1251812
##   predicted class= <=50K  expected loss=0.4528065  P(node) =0.4541506
##     class counts:  5664  4687
##    probabilities: 0.547 0.453 
##   left son=6 (7263 obs) right son=7 (3088 obs)
##   Primary splits:
##       education     splits as  LLLLLLLLLRRLRLRL, improve=646.2393, (0 missing)
##       education-num < 12.5   to the left,  improve=646.2393, (0 missing)
##       occupation    splits as  LLRLRLLLLLRLRRL, improve=635.6832, (0 missing)
##       capital-gain  < 5095.5 to the left,  improve=527.4262, (0 missing)
##       capital-loss  < 1782.5 to the left,  improve=174.3500, (0 missing)
##   Surrogate splits:
##       education-num  < 12.5   to the left,  agree=1.000, adj=1.000, (0 split)
##       occupation     splits as  LLRLRLLLLLRLLLL, agree=0.789, adj=0.294, (0 split)
##       capital-gain   < 7493   to the left,  agree=0.717, adj=0.051, (0 split)
##       native-country splits as  LLLRLLLLLRLLLLL-LLLRRLLLRLLLLLLLLLLRRLLLLL, agree=0.709, adj=0.026, (0 split)
##       capital-loss   < 1894.5 to the left,  agree=0.707, adj=0.018, (0 split)
## 
## Node number 4: 12215 observations
##   predicted class= <=50K  expected loss=0.0501842  P(node) =0.5359337
##     class counts: 11602   613
##    probabilities: 0.950 0.050 
## 
## Node number 5: 226 observations
##   predicted class= >50K   expected loss=0.02654867  P(node) =0.00991576
##     class counts:     6   220
##    probabilities: 0.027 0.973 
## 
## Node number 6: 7263 observations,    complexity param=0.06503623
##   predicted class= <=50K  expected loss=0.3376015  P(node) =0.3186644
##     class counts:  4811  2452
##    probabilities: 0.662 0.338 
##   left son=12 (6888 obs) right son=13 (375 obs)
##   Primary splits:
##       capital-gain  < 5095.5 to the left,  improve=325.0038, (0 missing)
##       occupation    splits as  LR-LRLLLLLRRRRL, improve=186.4794, (0 missing)
##       education     splits as  LLLLLLLRR--L-L-R, improve=130.1685, (0 missing)
##       education-num < 9.5    to the left,  improve=130.1685, (0 missing)
##       age           < 33.5   to the left,  improve= 94.3062, (0 missing)
## 
## Node number 7: 3088 observations
##   predicted class= >50K   expected loss=0.2762306  P(node) =0.1354861
##     class counts:   853  2235
##    probabilities: 0.276 0.724 
## 
## Node number 12: 6888 observations
##   predicted class= <=50K  expected loss=0.3027003  P(node) =0.3022113
##     class counts:  4803  2085
##    probabilities: 0.697 0.303 
## 
## Node number 13: 375 observations
##   predicted class= >50K   expected loss=0.02133333  P(node) =0.01645314
##     class counts:     8   367
##    probabilities: 0.021 0.979

Algunas estadisticas importantes a tener en cuenta son: 1) improve -> Mide la importancia que tenia la variable al momento de realizar el split. Es interesante poder observar que tanta diferencia hay entre las primeras dos variables, la segunda y la tercera, etc. Son realmente significativas ? 2) Surrogate -> Mide la forma en que se tratan los valores nulos (si los hay) que se encuentran en las variables que van abriendo el arbol. El valor agree dice que tan bien la variable surrogate predice el corte de la variable principal. El valor adj mide lo mismo, pero cambiando el benchmark de la medicion. Ahora en vez de medir que tan bien predice los cortes, mide que tan bien predice los cortes por encima de la regla “ir con el mas probable”.

prune

Esta funcion hace un prunning de un arbol ya armado seteando el parametro CP. Por ejemplo, en vez de tener un arbol con 5 hojas, tengo uno con 3.

prune(mdlrpart, cp = 0.07)
## n= 22792 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 22792 5520  <=50K (0.75780976 0.24219024)  
##   2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 12441  833  <=50K (0.93304397 0.06695603) *
##   3) relationship= Husband, Wife 10351 4687  <=50K (0.54719351 0.45280649)  
##     6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 7263 2452  <=50K (0.66239846 0.33760154) *
##     7) education= Bachelors, Doctorate, Masters, Prof-school 3088  853  >50K (0.27623057 0.72376943) *

PARTY (ctree)

Este metodo utiliza criterios estadisticos tanto para elegir las primeras variables del arbol como para dividir las mismas. El nombre deviene de Conditional Inference Trees. Una de las principales diferencias que tiene con otros metodos de particion es que no hace prunning debido a su sustento estadistico. Es decir, el arbol sigue creciendo hasta que no quede ninguna variable que posea una relacion significativa con el target.

Comencemos…

## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'party'
## The following objects are masked from 'package:partykit':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner,
##     node_surv, node_terminal
mdlctree <- ctree(class ~ . , data = train)

Grafico el arbol

plot(mdlctree) ; print(mdlctree)

## 
##   Conditional inference tree with 87 terminal nodes
## 
## Response:  class 
## Inputs:  age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country 
## Number of observations:  22792 
## 
## 1) relationship == { Husband,  Wife}; criterion = 1, statistic = 4693.86
##   2) education == { 10th,  11th,  12th,  1st-4th,  5th-6th,  7th-8th,  9th,  Assoc-acdm,  Assoc-voc,  HS-grad,  Preschool,  Some-college}; criterion = 1, statistic = 1752.002
##     3) occupation == { ?,  Craft-repair,  Farming-fishing,  Handlers-cleaners,  Machine-op-inspct,  Other-service,  Priv-house-serv,  Transport-moving}; criterion = 1, statistic = 569.781
##       4) education == { 12th,  Assoc-acdm,  Assoc-voc,  HS-grad,  Some-college}; criterion = 1, statistic = 182.167
##         5) capital-gain <= 5013; criterion = 1, statistic = 93.555
##           6) occupation == { ?,  Craft-repair,  Machine-op-inspct,  Transport-moving}; criterion = 1, statistic = 79.057
##             7) age <= 35; criterion = 1, statistic = 42.404
##               8) age <= 29; criterion = 0.999, statistic = 26.657
##                 9)*  weights = 357 
##               8) age > 29
##                 10)*  weights = 470 
##             7) age > 35
##               11) hours-per-week <= 33; criterion = 1, statistic = 30.974
##                 12) education-num <= 9; criterion = 0.999, statistic = 18.986
##                   13)*  weights = 110 
##                 12) education-num > 9
##                   14)*  weights = 54 
##               11) hours-per-week > 33
##                 15) education-num <= 9; criterion = 1, statistic = 26.727
##                   16)*  weights = 1036 
##                 15) education-num > 9
##                   17) capital-loss <= 1740; criterion = 0.996, statistic = 18.112
##                     18) fnlwgt <= 193459; criterion = 0.973, statistic = 15.64
##                       19) workclass == { Federal-gov,  Local-gov,  Self-emp-not-inc}; criterion = 0.971, statistic = 20.638
##                         20)*  weights = 50 
##                       19) workclass == { ?,  Private,  Self-emp-inc,  State-gov}
##                         21)*  weights = 249 
##                     18) fnlwgt > 193459
##                       22)*  weights = 187 
##                   17) capital-loss > 1740
##                     23) capital-loss <= 1887; criterion = 0.996, statistic = 13.282
##                       24)*  weights = 18 
##                     23) capital-loss > 1887
##                       25)*  weights = 9 
##           6) occupation == { Farming-fishing,  Handlers-cleaners,  Other-service,  Priv-house-serv}
##             26) hours-per-week <= 32; criterion = 0.991, statistic = 34.599
##               27)*  weights = 133 
##             26) hours-per-week > 32
##               28)*  weights = 678 
##         5) capital-gain > 5013
##           29) age <= 61; criterion = 1, statistic = 19.866
##             30)*  weights = 129 
##           29) age > 61
##             31)*  weights = 19 
##       4) education == { 10th,  11th,  1st-4th,  5th-6th,  7th-8th,  9th,  Preschool}
##         32) capital-gain <= 5013; criterion = 1, statistic = 34.764
##           33) capital-loss <= 1735; criterion = 1, statistic = 26.104
##             34) hours-per-week <= 49; criterion = 0.998, statistic = 22.224
##               35) workclass == { ?,  Federal-gov,  Local-gov,  Private,  Self-emp-not-inc,  State-gov}; criterion = 0.999, statistic = 28.701
##                 36)*  weights = 757 
##               35) workclass == { Self-emp-inc}
##                 37)*  weights = 9 
##             34) hours-per-week > 49
##               38)*  weights = 132 
##           33) capital-loss > 1735
##             39) capital-loss <= 1977; criterion = 0.989, statistic = 11.183
##               40)*  weights = 10 
##             39) capital-loss > 1977
##               41)*  weights = 10 
##         32) capital-gain > 5013
##           42)*  weights = 18 
##     3) occupation == { Adm-clerical,  Exec-managerial,  Prof-specialty,  Protective-serv,  Sales,  Tech-support}
##       43) education-num <= 9; criterion = 1, statistic = 90.391
##         44) capital-gain <= 5013; criterion = 1, statistic = 36.886
##           45) capital-loss <= 1740; criterion = 1, statistic = 42.749
##             46) education-num <= 8; criterion = 0.997, statistic = 34.29
##               47)*  weights = 164 
##             46) education-num > 8
##               48) hours-per-week <= 39; criterion = 0.999, statistic = 33.63
##                 49)*  weights = 151 
##               48) hours-per-week > 39
##                 50) age <= 35; criterion = 1, statistic = 30.507
##                   51) age <= 28; criterion = 0.996, statistic = 13.04
##                     52)*  weights = 90 
##                   51) age > 28
##                     53)*  weights = 168 
##                 50) age > 35
##                   54) workclass == { Self-emp-not-inc,  State-gov}; criterion = 0.964, statistic = 20.61
##                     55)*  weights = 103 
##                   54) workclass == { Federal-gov,  Local-gov,  Private,  Self-emp-inc}
##                     56)*  weights = 501 
##           45) capital-loss > 1740
##             57) capital-loss <= 1977; criterion = 0.992, statistic = 11.89
##               58) workclass == { Federal-gov,  Local-gov,  Self-emp-inc,  State-gov}; criterion = 0.991, statistic = 21.5
##                 59)*  weights = 13 
##               58) workclass == { Private,  Self-emp-not-inc}
##                 60)*  weights = 32 
##             57) capital-loss > 1977
##               61)*  weights = 10 
##         44) capital-gain > 5013
##           62)*  weights = 77 
##       43) education-num > 9
##         63) capital-gain <= 5013; criterion = 1, statistic = 35.641
##           64) age <= 33; criterion = 1, statistic = 30.77
##             65) age <= 24; criterion = 0.953, statistic = 22.785
##               66)*  weights = 35 
##             65) age > 24
##               67)*  weights = 310 
##           64) age > 33
##             68) hours-per-week <= 34; criterion = 0.999, statistic = 23.815
##               69)*  weights = 91 
##             68) hours-per-week > 34
##               70) workclass == { Self-emp-not-inc,  State-gov}; criterion = 0.986, statistic = 22.812
##                 71)*  weights = 112 
##               70) workclass == { Federal-gov,  Local-gov,  Private,  Self-emp-inc}
##                 72) capital-loss <= 1740; criterion = 0.989, statistic = 19.346
##                   73)*  weights = 787 
##                 72) capital-loss > 1740
##                   74) capital-loss <= 1902; criterion = 0.979, statistic = 12.265
##                     75)*  weights = 36 
##                   74) capital-loss > 1902
##                     76)*  weights = 16 
##         63) capital-gain > 5013
##           77)*  weights = 132 
##   2) education == { Bachelors,  Doctorate,  Masters,  Prof-school}
##     78) occupation == { Armed-Forces,  Exec-managerial,  Prof-specialty,  Protective-serv,  Sales,  Tech-support}; criterion = 1, statistic = 190.056
##       79) capital-loss <= 1740; criterion = 1, statistic = 48.285
##         80) capital-gain <= 5013; criterion = 1, statistic = 53.016
##           81) capital-gain <= 3103; criterion = 1, statistic = 44.109
##             82) education-num <= 13; criterion = 1, statistic = 39.733
##               83) workclass == { Local-gov,  Self-emp-not-inc,  State-gov}; criterion = 0.996, statistic = 34.097
##                 84)*  weights = 289 
##               83) workclass == { Federal-gov,  Private,  Self-emp-inc}
##                 85)*  weights = 845 
##             82) education-num > 13
##               86) hours-per-week <= 25; criterion = 0.983, statistic = 39.197
##                 87)*  weights = 33 
##               86) hours-per-week > 25
##                 88)*  weights = 696 
##           81) capital-gain > 3103
##             89)*  weights = 35 
##         80) capital-gain > 5013
##           90) age <= 69; criterion = 1, statistic = 17.658
##             91)*  weights = 408 
##           90) age > 69
##             92)*  weights = 8 
##       79) capital-loss > 1740
##         93)*  weights = 247 
##     78) occupation == { ?,  Adm-clerical,  Craft-repair,  Farming-fishing,  Handlers-cleaners,  Machine-op-inspct,  Other-service,  Transport-moving}
##       94) capital-gain <= 5013; criterion = 1, statistic = 30.873
##         95) capital-loss <= 1740; criterion = 1, statistic = 31.843
##           96)*  weights = 444 
##         95) capital-loss > 1740
##           97)*  weights = 29 
##       94) capital-gain > 5013
##         98)*  weights = 54 
## 1) relationship == { Not-in-family,  Other-relative,  Own-child,  Unmarried}
##   99) capital-gain <= 6849; criterion = 1, statistic = 1366.286
##     100) education == { Doctorate,  Masters,  Prof-school}; criterion = 1, statistic = 1066.048
##       101) education-num <= 14; criterion = 1, statistic = 41.236
##         102) hours-per-week <= 43; criterion = 0.998, statistic = 35.117
##           103) age <= 47; criterion = 0.981, statistic = 13.752
##             104) capital-loss <= 0; criterion = 0.97, statistic = 15.069
##               105)*  weights = 228 
##             104) capital-loss > 0
##               106)*  weights = 9 
##           103) age > 47
##             107)*  weights = 84 
##         102) hours-per-week > 43
##           108) sex == { Female}; criterion = 0.954, statistic = 22.522
##             109)*  weights = 84 
##           108) sex == { Male}
##             110)*  weights = 79 
##       101) education-num > 14
##         111) capital-loss <= 2339; criterion = 0.964, statistic = 19.135
##           112)*  weights = 170 
##         111) capital-loss > 2339
##           113)*  weights = 7 
##     100) education == { 10th,  11th,  12th,  1st-4th,  5th-6th,  7th-8th,  9th,  Assoc-acdm,  Assoc-voc,  Bachelors,  HS-grad,  Preschool,  Some-college}
##       114) occupation == { ?,  Adm-clerical,  Armed-Forces,  Farming-fishing,  Handlers-cleaners,  Machine-op-inspct,  Other-service,  Priv-house-serv,  Transport-moving}; criterion = 1, statistic = 386.703
##         115) capital-loss <= 1726; criterion = 1, statistic = 84.958
##           116) age <= 33; criterion = 1, statistic = 95.256
##             117) native-country == { ?,  England,  Guatemala,  Portugal}; criterion = 1, statistic = 120.776
##               118) hours-per-week <= 45; criterion = 0.975, statistic = 16.221
##                 119) workclass == { Private}; criterion = 0.993, statistic = 17.75
##                   120)*  weights = 61 
##                 119) workclass == { ?,  Local-gov,  Self-emp-not-inc}
##                   121)*  weights = 14 
##               118) hours-per-week > 45
##                 122)*  weights = 14 
##             117) native-country == { Cambodia,  Canada,  China,  Columbia,  Cuba,  Dominican-Republic,  Ecuador,  El-Salvador,  France,  Germany,  Greece,  Haiti,  Honduras,  Hong,  Hungary,  India,  Iran,  Ireland,  Italy,  Jamaica,  Japan,  Laos,  Mexico,  Nicaragua,  Peru,  Philippines,  Poland,  Puerto-Rico,  South,  Taiwan,  Thailand,  United-States,  Vietnam,  Yugoslavia}
##               123) education == { 7th-8th}; criterion = 0.997, statistic = 37.516
##                 124)*  weights = 48 
##               123) education == { 10th,  11th,  12th,  1st-4th,  5th-6th,  9th,  Assoc-acdm,  Assoc-voc,  Bachelors,  HS-grad,  Preschool,  Some-college}
##                 125)*  weights = 3738 
##           116) age > 33
##             126) native-country == { ?,  Canada,  China,  Columbia,  Cuba,  Dominican-Republic,  Ecuador,  El-Salvador,  Germany,  Guatemala,  Haiti,  Honduras,  Hungary,  Iran,  Ireland,  Jamaica,  Laos,  Mexico,  Nicaragua,  Outlying-US(Guam-USVI-etc),  Peru,  Philippines,  Poland,  Portugal,  Scotland,  Trinadad&Tobago,  United-States,  Vietnam}; criterion = 1, statistic = 121.996
##               127) marital-status == { Married-civ-spouse}; criterion = 1, statistic = 40.656
##                 128)*  weights = 29 
##               127) marital-status == { Divorced,  Married-spouse-absent,  Never-married,  Separated,  Widowed}
##                 129) hours-per-week <= 41; criterion = 1, statistic = 33.277
##                   130)*  weights = 2195 
##                 129) hours-per-week > 41
##                   131)*  weights = 425 
##             126) native-country == { England,  Greece,  India,  Italy,  Japan,  Puerto-Rico,  South}
##               132)*  weights = 50 
##         115) capital-loss > 1726
##           133)*  weights = 71 
##       114) occupation == { Craft-repair,  Exec-managerial,  Prof-specialty,  Protective-serv,  Sales,  Tech-support}
##         134) hours-per-week <= 42; criterion = 1, statistic = 153.809
##           135) age <= 37; criterion = 1, statistic = 85.299
##             136) age <= 26; criterion = 1, statistic = 51.545
##               137)*  weights = 1354 
##             136) age > 26
##               138) education-num <= 9; criterion = 0.99, statistic = 47.247
##                 139)*  weights = 360 
##               138) education-num > 9
##                 140) capital-loss <= 1408; criterion = 0.995, statistic = 33.171
##                   141) marital-status == { Married-civ-spouse,  Widowed}; criterion = 0.995, statistic = 37.263
##                     142)*  weights = 19 
##                   141) marital-status == { Divorced,  Married-spouse-absent,  Never-married,  Separated}
##                     143)*  weights = 653 
##                 140) capital-loss > 1408
##                   144)*  weights = 17 
##           135) age > 37
##             145) sex == { Male}; criterion = 1, statistic = 47.71
##               146) capital-loss <= 2205; criterion = 0.996, statistic = 22.677
##                 147) education-num <= 9; criterion = 0.987, statistic = 21.153
##                   148) occupation == { Craft-repair,  Exec-managerial,  Prof-specialty,  Sales}; criterion = 0.996, statistic = 23.433
##                     149) marital-status == { Married-civ-spouse,  Married-spouse-absent}; criterion = 1, statistic = 34.363
##                       150)*  weights = 13 
##                     149) marital-status == { Divorced,  Never-married,  Separated,  Widowed}
##                       151) race == { Amer-Indian-Eskimo,  Asian-Pac-Islander,  Other}; criterion = 0.998, statistic = 22.294
##                         152)*  weights = 9 
##                       151) race == { Black,  White}
##                         153)*  weights = 177 
##                   148) occupation == { Protective-serv,  Tech-support}
##                     154)*  weights = 19 
##                 147) education-num > 9
##                   155)*  weights = 297 
##               146) capital-loss > 2205
##                 156)*  weights = 8 
##             145) sex == { Female}
##               157) race == { Asian-Pac-Islander}; criterion = 1, statistic = 54.52
##                 158)*  weights = 17 
##               157) race == { Amer-Indian-Eskimo,  Black,  Other,  White}
##                 159)*  weights = 664 
##         134) hours-per-week > 42
##           160) age <= 27; criterion = 1, statistic = 44.927
##             161)*  weights = 301 
##           160) age > 27
##             162) education-num <= 12; criterion = 1, statistic = 47.123
##               163) capital-loss <= 2339; criterion = 0.998, statistic = 22.116
##                 164) marital-status == { Divorced,  Married-spouse-absent,  Never-married,  Separated,  Widowed}; criterion = 0.997, statistic = 23.999
##                   165)*  weights = 624 
##                 164) marital-status == { Married-civ-spouse}
##                   166)*  weights = 10 
##               163) capital-loss > 2339
##                 167)*  weights = 8 
##             162) education-num > 12
##               168) capital-loss <= 0; criterion = 0.984, statistic = 10.52
##                 169)*  weights = 342 
##               168) capital-loss > 0
##                 170)*  weights = 17 
##   99) capital-gain > 6849
##     171) occupation == { ?,  Handlers-cleaners,  Transport-moving}; criterion = 1, statistic = 50.848
##       172)*  weights = 12 
##     171) occupation == { Adm-clerical,  Craft-repair,  Exec-managerial,  Farming-fishing,  Machine-op-inspct,  Other-service,  Priv-house-serv,  Prof-specialty,  Sales,  Tech-support}
##       173)*  weights = 214

Como son muchas ramas se hace imposible ver el grafico, por lo tanto usamos las reglas.

Funcion nodes

Puedo utilizar la funcion “nodes” para seleccionar a partir de que nodo quiero ver el arbol.

nodes(mdlctree, 42)
## [[1]]
## 42)*  weights = 18

Parametros de control

mincriterion: Establece el nivel de significancia que va a usar el algoritmo para seguir abriendo el arbol. Esta entre 0 y 1. A mayor valor mas restrictivo es el arbol (menos ramas tiene). minsplit: La minima cantidad de casos en un nodo para considerar una particion. minbucket: La minima cantidad de casos que deben quedar en un nodo final. maxdepth: Maxima profundidad del arbol. mtry: Cantidad de variables que se van a considerar en cada split. Se eligen aleatoriamente.

Como ejemplo usemos maxdepth para restringir la profundidad del arbol a no mas de 2.

mdlctree2 <- ctree(class ~ . , data = train, controls = ctree_control(maxdepth = 2))
plot(mdlctree2)

Prediccion

¿Que parametros son los mejores para escoger?

Depende del conjunto de testing. Hay que probar distintas combinaciones y fijarse cual da mejor de acuerdo a alaguna/s metrica/s selecionada/s. En nuestro caso vamos a probar como ordenan nuestros arboles.

print(mdlrpart) ; print(mdlctree)
## n= 22792 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 22792 5520  <=50K (0.75780976 0.24219024)  
##    2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 12441  833  <=50K (0.93304397 0.06695603)  
##      4) capital-gain< 7073.5 12215  613  <=50K (0.94981580 0.05018420) *
##      5) capital-gain>=7073.5 226    6  >50K (0.02654867 0.97345133) *
##    3) relationship= Husband, Wife 10351 4687  <=50K (0.54719351 0.45280649)  
##      6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 7263 2452  <=50K (0.66239846 0.33760154)  
##       12) capital-gain< 5095.5 6888 2085  <=50K (0.69729965 0.30270035) *
##       13) capital-gain>=5095.5 375    8  >50K (0.02133333 0.97866667) *
##      7) education= Bachelors, Doctorate, Masters, Prof-school 3088  853  >50K (0.27623057 0.72376943) *
## 
##   Conditional inference tree with 87 terminal nodes
## 
## Response:  class 
## Inputs:  age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country 
## Number of observations:  22792 
## 
## 1) relationship == { Husband,  Wife}; criterion = 1, statistic = 4693.86
##   2) education == { 10th,  11th,  12th,  1st-4th,  5th-6th,  7th-8th,  9th,  Assoc-acdm,  Assoc-voc,  HS-grad,  Preschool,  Some-college}; criterion = 1, statistic = 1752.002
##     3) occupation == { ?,  Craft-repair,  Farming-fishing,  Handlers-cleaners,  Machine-op-inspct,  Other-service,  Priv-house-serv,  Transport-moving}; criterion = 1, statistic = 569.781
##       4) education == { 12th,  Assoc-acdm,  Assoc-voc,  HS-grad,  Some-college}; criterion = 1, statistic = 182.167
##         5) capital-gain <= 5013; criterion = 1, statistic = 93.555
##           6) occupation == { ?,  Craft-repair,  Machine-op-inspct,  Transport-moving}; criterion = 1, statistic = 79.057
##             7) age <= 35; criterion = 1, statistic = 42.404
##               8) age <= 29; criterion = 0.999, statistic = 26.657
##                 9)*  weights = 357 
##               8) age > 29
##                 10)*  weights = 470 
##             7) age > 35
##               11) hours-per-week <= 33; criterion = 1, statistic = 30.974
##                 12) education-num <= 9; criterion = 0.999, statistic = 18.986
##                   13)*  weights = 110 
##                 12) education-num > 9
##                   14)*  weights = 54 
##               11) hours-per-week > 33
##                 15) education-num <= 9; criterion = 1, statistic = 26.727
##                   16)*  weights = 1036 
##                 15) education-num > 9
##                   17) capital-loss <= 1740; criterion = 0.996, statistic = 18.112
##                     18) fnlwgt <= 193459; criterion = 0.973, statistic = 15.64
##                       19) workclass == { Federal-gov,  Local-gov,  Self-emp-not-inc}; criterion = 0.971, statistic = 20.638
##                         20)*  weights = 50 
##                       19) workclass == { ?,  Private,  Self-emp-inc,  State-gov}
##                         21)*  weights = 249 
##                     18) fnlwgt > 193459
##                       22)*  weights = 187 
##                   17) capital-loss > 1740
##                     23) capital-loss <= 1887; criterion = 0.996, statistic = 13.282
##                       24)*  weights = 18 
##                     23) capital-loss > 1887
##                       25)*  weights = 9 
##           6) occupation == { Farming-fishing,  Handlers-cleaners,  Other-service,  Priv-house-serv}
##             26) hours-per-week <= 32; criterion = 0.991, statistic = 34.599
##               27)*  weights = 133 
##             26) hours-per-week > 32
##               28)*  weights = 678 
##         5) capital-gain > 5013
##           29) age <= 61; criterion = 1, statistic = 19.866
##             30)*  weights = 129 
##           29) age > 61
##             31)*  weights = 19 
##       4) education == { 10th,  11th,  1st-4th,  5th-6th,  7th-8th,  9th,  Preschool}
##         32) capital-gain <= 5013; criterion = 1, statistic = 34.764
##           33) capital-loss <= 1735; criterion = 1, statistic = 26.104
##             34) hours-per-week <= 49; criterion = 0.998, statistic = 22.224
##               35) workclass == { ?,  Federal-gov,  Local-gov,  Private,  Self-emp-not-inc,  State-gov}; criterion = 0.999, statistic = 28.701
##                 36)*  weights = 757 
##               35) workclass == { Self-emp-inc}
##                 37)*  weights = 9 
##             34) hours-per-week > 49
##               38)*  weights = 132 
##           33) capital-loss > 1735
##             39) capital-loss <= 1977; criterion = 0.989, statistic = 11.183
##               40)*  weights = 10 
##             39) capital-loss > 1977
##               41)*  weights = 10 
##         32) capital-gain > 5013
##           42)*  weights = 18 
##     3) occupation == { Adm-clerical,  Exec-managerial,  Prof-specialty,  Protective-serv,  Sales,  Tech-support}
##       43) education-num <= 9; criterion = 1, statistic = 90.391
##         44) capital-gain <= 5013; criterion = 1, statistic = 36.886
##           45) capital-loss <= 1740; criterion = 1, statistic = 42.749
##             46) education-num <= 8; criterion = 0.997, statistic = 34.29
##               47)*  weights = 164 
##             46) education-num > 8
##               48) hours-per-week <= 39; criterion = 0.999, statistic = 33.63
##                 49)*  weights = 151 
##               48) hours-per-week > 39
##                 50) age <= 35; criterion = 1, statistic = 30.507
##                   51) age <= 28; criterion = 0.996, statistic = 13.04
##                     52)*  weights = 90 
##                   51) age > 28
##                     53)*  weights = 168 
##                 50) age > 35
##                   54) workclass == { Self-emp-not-inc,  State-gov}; criterion = 0.964, statistic = 20.61
##                     55)*  weights = 103 
##                   54) workclass == { Federal-gov,  Local-gov,  Private,  Self-emp-inc}
##                     56)*  weights = 501 
##           45) capital-loss > 1740
##             57) capital-loss <= 1977; criterion = 0.992, statistic = 11.89
##               58) workclass == { Federal-gov,  Local-gov,  Self-emp-inc,  State-gov}; criterion = 0.991, statistic = 21.5
##                 59)*  weights = 13 
##               58) workclass == { Private,  Self-emp-not-inc}
##                 60)*  weights = 32 
##             57) capital-loss > 1977
##               61)*  weights = 10 
##         44) capital-gain > 5013
##           62)*  weights = 77 
##       43) education-num > 9
##         63) capital-gain <= 5013; criterion = 1, statistic = 35.641
##           64) age <= 33; criterion = 1, statistic = 30.77
##             65) age <= 24; criterion = 0.953, statistic = 22.785
##               66)*  weights = 35 
##             65) age > 24
##               67)*  weights = 310 
##           64) age > 33
##             68) hours-per-week <= 34; criterion = 0.999, statistic = 23.815
##               69)*  weights = 91 
##             68) hours-per-week > 34
##               70) workclass == { Self-emp-not-inc,  State-gov}; criterion = 0.986, statistic = 22.812
##                 71)*  weights = 112 
##               70) workclass == { Federal-gov,  Local-gov,  Private,  Self-emp-inc}
##                 72) capital-loss <= 1740; criterion = 0.989, statistic = 19.346
##                   73)*  weights = 787 
##                 72) capital-loss > 1740
##                   74) capital-loss <= 1902; criterion = 0.979, statistic = 12.265
##                     75)*  weights = 36 
##                   74) capital-loss > 1902
##                     76)*  weights = 16 
##         63) capital-gain > 5013
##           77)*  weights = 132 
##   2) education == { Bachelors,  Doctorate,  Masters,  Prof-school}
##     78) occupation == { Armed-Forces,  Exec-managerial,  Prof-specialty,  Protective-serv,  Sales,  Tech-support}; criterion = 1, statistic = 190.056
##       79) capital-loss <= 1740; criterion = 1, statistic = 48.285
##         80) capital-gain <= 5013; criterion = 1, statistic = 53.016
##           81) capital-gain <= 3103; criterion = 1, statistic = 44.109
##             82) education-num <= 13; criterion = 1, statistic = 39.733
##               83) workclass == { Local-gov,  Self-emp-not-inc,  State-gov}; criterion = 0.996, statistic = 34.097
##                 84)*  weights = 289 
##               83) workclass == { Federal-gov,  Private,  Self-emp-inc}
##                 85)*  weights = 845 
##             82) education-num > 13
##               86) hours-per-week <= 25; criterion = 0.983, statistic = 39.197
##                 87)*  weights = 33 
##               86) hours-per-week > 25
##                 88)*  weights = 696 
##           81) capital-gain > 3103
##             89)*  weights = 35 
##         80) capital-gain > 5013
##           90) age <= 69; criterion = 1, statistic = 17.658
##             91)*  weights = 408 
##           90) age > 69
##             92)*  weights = 8 
##       79) capital-loss > 1740
##         93)*  weights = 247 
##     78) occupation == { ?,  Adm-clerical,  Craft-repair,  Farming-fishing,  Handlers-cleaners,  Machine-op-inspct,  Other-service,  Transport-moving}
##       94) capital-gain <= 5013; criterion = 1, statistic = 30.873
##         95) capital-loss <= 1740; criterion = 1, statistic = 31.843
##           96)*  weights = 444 
##         95) capital-loss > 1740
##           97)*  weights = 29 
##       94) capital-gain > 5013
##         98)*  weights = 54 
## 1) relationship == { Not-in-family,  Other-relative,  Own-child,  Unmarried}
##   99) capital-gain <= 6849; criterion = 1, statistic = 1366.286
##     100) education == { Doctorate,  Masters,  Prof-school}; criterion = 1, statistic = 1066.048
##       101) education-num <= 14; criterion = 1, statistic = 41.236
##         102) hours-per-week <= 43; criterion = 0.998, statistic = 35.117
##           103) age <= 47; criterion = 0.981, statistic = 13.752
##             104) capital-loss <= 0; criterion = 0.97, statistic = 15.069
##               105)*  weights = 228 
##             104) capital-loss > 0
##               106)*  weights = 9 
##           103) age > 47
##             107)*  weights = 84 
##         102) hours-per-week > 43
##           108) sex == { Female}; criterion = 0.954, statistic = 22.522
##             109)*  weights = 84 
##           108) sex == { Male}
##             110)*  weights = 79 
##       101) education-num > 14
##         111) capital-loss <= 2339; criterion = 0.964, statistic = 19.135
##           112)*  weights = 170 
##         111) capital-loss > 2339
##           113)*  weights = 7 
##     100) education == { 10th,  11th,  12th,  1st-4th,  5th-6th,  7th-8th,  9th,  Assoc-acdm,  Assoc-voc,  Bachelors,  HS-grad,  Preschool,  Some-college}
##       114) occupation == { ?,  Adm-clerical,  Armed-Forces,  Farming-fishing,  Handlers-cleaners,  Machine-op-inspct,  Other-service,  Priv-house-serv,  Transport-moving}; criterion = 1, statistic = 386.703
##         115) capital-loss <= 1726; criterion = 1, statistic = 84.958
##           116) age <= 33; criterion = 1, statistic = 95.256
##             117) native-country == { ?,  England,  Guatemala,  Portugal}; criterion = 1, statistic = 120.776
##               118) hours-per-week <= 45; criterion = 0.975, statistic = 16.221
##                 119) workclass == { Private}; criterion = 0.993, statistic = 17.75
##                   120)*  weights = 61 
##                 119) workclass == { ?,  Local-gov,  Self-emp-not-inc}
##                   121)*  weights = 14 
##               118) hours-per-week > 45
##                 122)*  weights = 14 
##             117) native-country == { Cambodia,  Canada,  China,  Columbia,  Cuba,  Dominican-Republic,  Ecuador,  El-Salvador,  France,  Germany,  Greece,  Haiti,  Honduras,  Hong,  Hungary,  India,  Iran,  Ireland,  Italy,  Jamaica,  Japan,  Laos,  Mexico,  Nicaragua,  Peru,  Philippines,  Poland,  Puerto-Rico,  South,  Taiwan,  Thailand,  United-States,  Vietnam,  Yugoslavia}
##               123) education == { 7th-8th}; criterion = 0.997, statistic = 37.516
##                 124)*  weights = 48 
##               123) education == { 10th,  11th,  12th,  1st-4th,  5th-6th,  9th,  Assoc-acdm,  Assoc-voc,  Bachelors,  HS-grad,  Preschool,  Some-college}
##                 125)*  weights = 3738 
##           116) age > 33
##             126) native-country == { ?,  Canada,  China,  Columbia,  Cuba,  Dominican-Republic,  Ecuador,  El-Salvador,  Germany,  Guatemala,  Haiti,  Honduras,  Hungary,  Iran,  Ireland,  Jamaica,  Laos,  Mexico,  Nicaragua,  Outlying-US(Guam-USVI-etc),  Peru,  Philippines,  Poland,  Portugal,  Scotland,  Trinadad&Tobago,  United-States,  Vietnam}; criterion = 1, statistic = 121.996
##               127) marital-status == { Married-civ-spouse}; criterion = 1, statistic = 40.656
##                 128)*  weights = 29 
##               127) marital-status == { Divorced,  Married-spouse-absent,  Never-married,  Separated,  Widowed}
##                 129) hours-per-week <= 41; criterion = 1, statistic = 33.277
##                   130)*  weights = 2195 
##                 129) hours-per-week > 41
##                   131)*  weights = 425 
##             126) native-country == { England,  Greece,  India,  Italy,  Japan,  Puerto-Rico,  South}
##               132)*  weights = 50 
##         115) capital-loss > 1726
##           133)*  weights = 71 
##       114) occupation == { Craft-repair,  Exec-managerial,  Prof-specialty,  Protective-serv,  Sales,  Tech-support}
##         134) hours-per-week <= 42; criterion = 1, statistic = 153.809
##           135) age <= 37; criterion = 1, statistic = 85.299
##             136) age <= 26; criterion = 1, statistic = 51.545
##               137)*  weights = 1354 
##             136) age > 26
##               138) education-num <= 9; criterion = 0.99, statistic = 47.247
##                 139)*  weights = 360 
##               138) education-num > 9
##                 140) capital-loss <= 1408; criterion = 0.995, statistic = 33.171
##                   141) marital-status == { Married-civ-spouse,  Widowed}; criterion = 0.995, statistic = 37.263
##                     142)*  weights = 19 
##                   141) marital-status == { Divorced,  Married-spouse-absent,  Never-married,  Separated}
##                     143)*  weights = 653 
##                 140) capital-loss > 1408
##                   144)*  weights = 17 
##           135) age > 37
##             145) sex == { Male}; criterion = 1, statistic = 47.71
##               146) capital-loss <= 2205; criterion = 0.996, statistic = 22.677
##                 147) education-num <= 9; criterion = 0.987, statistic = 21.153
##                   148) occupation == { Craft-repair,  Exec-managerial,  Prof-specialty,  Sales}; criterion = 0.996, statistic = 23.433
##                     149) marital-status == { Married-civ-spouse,  Married-spouse-absent}; criterion = 1, statistic = 34.363
##                       150)*  weights = 13 
##                     149) marital-status == { Divorced,  Never-married,  Separated,  Widowed}
##                       151) race == { Amer-Indian-Eskimo,  Asian-Pac-Islander,  Other}; criterion = 0.998, statistic = 22.294
##                         152)*  weights = 9 
##                       151) race == { Black,  White}
##                         153)*  weights = 177 
##                   148) occupation == { Protective-serv,  Tech-support}
##                     154)*  weights = 19 
##                 147) education-num > 9
##                   155)*  weights = 297 
##               146) capital-loss > 2205
##                 156)*  weights = 8 
##             145) sex == { Female}
##               157) race == { Asian-Pac-Islander}; criterion = 1, statistic = 54.52
##                 158)*  weights = 17 
##               157) race == { Amer-Indian-Eskimo,  Black,  Other,  White}
##                 159)*  weights = 664 
##         134) hours-per-week > 42
##           160) age <= 27; criterion = 1, statistic = 44.927
##             161)*  weights = 301 
##           160) age > 27
##             162) education-num <= 12; criterion = 1, statistic = 47.123
##               163) capital-loss <= 2339; criterion = 0.998, statistic = 22.116
##                 164) marital-status == { Divorced,  Married-spouse-absent,  Never-married,  Separated,  Widowed}; criterion = 0.997, statistic = 23.999
##                   165)*  weights = 624 
##                 164) marital-status == { Married-civ-spouse}
##                   166)*  weights = 10 
##               163) capital-loss > 2339
##                 167)*  weights = 8 
##             162) education-num > 12
##               168) capital-loss <= 0; criterion = 0.984, statistic = 10.52
##                 169)*  weights = 342 
##               168) capital-loss > 0
##                 170)*  weights = 17 
##   99) capital-gain > 6849
##     171) occupation == { ?,  Handlers-cleaners,  Transport-moving}; criterion = 1, statistic = 50.848
##       172)*  weights = 12 
##     171) occupation == { Adm-clerical,  Craft-repair,  Exec-managerial,  Farming-fishing,  Machine-op-inspct,  Other-service,  Priv-house-serv,  Prof-specialty,  Sales,  Tech-support}
##       173)*  weights = 214

En vez de predecir la clase vamos a generar, para cada caso del dataset, el vector de probabilidades condicionadas de cada nodo del arbol.

prob_rpart <- predict(mdlrpart, test, type = 'prob')
prob_ctree <- treeresponse(mdlctree, test, type = 'prob')

Vemos que tienen estas variables.

head(prob_rpart) ; head(prob_ctree)
##        <=50K      >50K
## 2  0.2762306 0.7237694
## 5  0.2762306 0.7237694
## 12 0.2762306 0.7237694
## 21 0.2762306 0.7237694
## 22 0.9498158 0.0501842
## 25 0.9498158 0.0501842
## [[1]]
## [1] 0.4532872 0.5467128
## 
## [[2]]
## [1] 0.304142 0.695858
## 
## [[3]]
## [1] 0.4532872 0.5467128
## 
## [[4]]
## [1] 0.2126437 0.7873563
## 
## [[5]]
## [1] 0.98997722 0.01002278
## 
## [[6]]
## [1] 0.95331325 0.04668675

Como puede observarse en ambos casos el resultado es una matriz con 2 columnas y n filas que contiene, para cada caso, la probabilidad de que el mismo sea <=50K o >50K. Como queremos predecir las personas que ganan mas de 50K nos quedamos con la segunda columna.

test$prob_rpart <- prob_rpart[,2]
test$prob_ctree <- matrix(unlist(prob_ctree), ncol = 2, byrow = T)[,2]
View(test)

Ahora ordenemos por cada una de las probabilidades.

Tambien podriamos hacer una matriz de confusion y calcular el Accuracy. Elegimos el corte de 0.6 en cada caso para clasificar a los de mayor renta.

prop.table(table(test$class, test$prob_rpart > 0.6))
##         
##               FALSE       TRUE
##    <=50K 0.72351315 0.03889856
##    >50K  0.11444365 0.12314464
prop.table(table(test$class, test$prob_ctree > 0.6))
##         
##               FALSE       TRUE
##    <=50K 0.74050568 0.02190603
##    >50K  0.12508957 0.11249872
# Accuracy 1
sum(diag(prop.table(table(test$class, test$prob_rpart > 0.6))))
## [1] 0.8466578
# Accuracy 2
sum(diag(prop.table(table(test$class, test$prob_ctree > 0.6))))
## [1] 0.8530044

Bibliografia

RPART

T. M. Therneau et al. “An Introduction to Recursive Partitioning Using the RPART Routines” (2015)

PARTY

T. Hothorn et al. “Unbiased Recursive Partitioning: A Conditional Inference Framework” (2006) C. Molnar. “Recursive partitioning by conditional inference” (2013). Seminar paper. Department of Statistics University of Munich.