PONTIFICIA UNIVERSIDAD JAVERIANA
FACULTY OF MEDICINE
DEPARTMENT OF CLINICAL EPIDEMIOLOGY
PRACTICAL EXERCISE ON CLASSIFICATION AND REGRESSION TREES AND CLUSTER ANALYSIS

Problem

The data in the file “cirrosis.csv”, available on the platform, contain information on 312 patients with primary biliary cirrhosis who were referred to the Mayo Clinic over a 10-year period and who met the eligibility criteria for a clinical trial evaluating a drug. The variables studied include the following:

Variables, coding, and meaning
Variable	Meaning
number	Identification number
status	Vital status (alive/dead)
rx	Assigned treatment 1. New treatment 2. Placebo
sex	Patient sex 0. Male 1. Female
hepatom	Presence of hepatomegaly 0. No 1. Yes
spiders	Presence of spider angiomas 0. No 1. Yes
edema	Presence of edema 0. No edema 1. Edema despite therapy
bilirubin	Serum bilirubin in mg/dl
cholest	Serum cholesterol in mg/dl
albumin	Albumin in gm/dl
copper	Urine copper in ug/day
alkphos	Alkaline phosphatase in U/liter
trigli	Triglycerides in mg/dl
platel	Platelet count per cubic ml of blood / 1000
prothrom	Prothrombin time in seconds
histol	Histologic stage of the disease (1. Grade 1, 2. Grade 2, 3. Grade 3, 4. Grade 4)
age	Age in years
years	Time to death in years

Carry out a descriptive analysis of the data.
The main objective of the study is to build a classification tree that helps identify which factors are associated with vital status in this group of patients.
A second objective is to build a regression tree that helps identify which factors help predict the cholesterol level in this group of patients.
The third objective of the study is to cluster these patients: how many groups would you create, and what characterizes them?

Reading and transforming the data

library(readr)
datos <- as.data.frame(read_csv("cirrosis.csv"))
# summary(datos)

table(datos$status)
datos$status <- factor(datos$status, levels = 0:1, labels = c('Vivo', 'Muerto'))
table(unclass(datos$status), datos$status)
# boxplot(datos$years~datos$status, xlab='Estado vital', ylab='Años de vida')
datos$status = relevel(datos$status, ref=1)

table(datos$rx)
datos$rx <- factor(datos$rx, levels = 0:1, labels = c('Placebo', 'Tratamiento'))
table(unclass(datos$rx), datos$rx)
datos$rx = relevel(datos$rx, ref=1)

table(datos$sex)
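# Note: the codebook above lists 0 = male and 1 = female, so these labels are
# reversed with respect to it; the outputs in this document use them as written.
datos$sex <- factor(datos$sex, levels = 0:1, labels = c('Mujeres', 'Hombres'))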
table(unclass(datos$sex), datos$sex)
datos$sex = relevel(datos$sex, ref=1)

table(datos$hepatom)
datos$hepatom <- factor(datos$hepatom, levels = 0:1, labels = c('No', 'Sí'))
table(unclass(datos$hepatom), datos$hepatom)
datos$hepatom = relevel(datos$hepatom, ref=1)

table(datos$spiders)
datos$spiders <- factor(datos$spiders, levels = 0:1, labels = c('No', 'Sí'))
table(unclass(datos$spiders), datos$spiders)
datos$spiders = relevel(datos$spiders, ref=1)

table(datos$edema)
datos$edema <- factor(datos$edema, levels = 0:1, labels = c('No', 'Sí'))
table(unclass(datos$edema), datos$edema)
datos$edema = relevel(datos$edema, ref=1)

table(datos$histol)
datos$histol <- factor(datos$histol, levels = 1:4,
                       labels = c('Grado 1', 'Grado 2', 'Grado 3', 'Grado 4'), ordered = TRUE)
table(unclass(datos$histol), datos$histol)

Data summary

summary(datos)

     number          status              rx           sex      hepatom 
 Min.   :  1.00   Vivo  :187   Placebo    :158   Mujeres: 36   No:152  
 1st Qu.: 78.75   Muerto:125   Tratamiento:154   Hombres:276   Sí:160  
 Median :156.50                                                        
 Mean   :156.50                                                        
 3rd Qu.:234.25                                                        
 Max.   :312.00                                                        
                                                                       
 spiders  edema      bilirubin         cholest          albumin    
 No:222   No:263   Min.   : 0.300   Min.   : 120.0   Min.   :1.96  
 Sí: 90   Sí: 49   1st Qu.: 0.800   1st Qu.: 249.5   1st Qu.:3.31  
                   Median : 1.350   Median : 309.5   Median :3.55  
                   Mean   : 3.256   Mean   : 369.5   Mean   :3.52  
                   3rd Qu.: 3.425   3rd Qu.: 400.0   3rd Qu.:3.80  
                   Max.   :28.000   Max.   :1775.0   Max.   :4.64  
                                    NA's   :28                     
     copper          alkphos            trigli           platel     
 Min.   :  4.00   Min.   :  289.0   Min.   : 33.00   Min.   : 62.0  
 1st Qu.: 41.25   1st Qu.:  871.5   1st Qu.: 84.25   1st Qu.:199.8  
 Median : 73.00   Median : 1259.0   Median :108.00   Median :257.0  
 Mean   : 97.65   Mean   : 1982.7   Mean   :124.70   Mean   :261.9  
 3rd Qu.:123.00   3rd Qu.: 1980.0   3rd Qu.:151.00   3rd Qu.:322.5  
 Max.   :588.00   Max.   :13862.4   Max.   :598.00   Max.   :563.0  
 NA's   :2                          NA's   :30       NA's   :4      
    prothrom         histol         age            years        
 Min.   : 9.00   Grado 1: 16   Min.   :26.28   Min.   : 0.1123  
 1st Qu.:10.00   Grado 2: 67   1st Qu.:42.24   1st Qu.: 3.2608  
 Median :10.60   Grado 3:120   Median :49.79   Median : 5.0363  
 Mean   :10.73   Grado 4:109   Mean   :50.02   Mean   : 5.4931  
 3rd Qu.:11.10                 3rd Qu.:56.71   3rd Qu.: 7.3847  
 Max.   :17.10                 Max.   :78.44   Max.   :12.4736

Data summary by vital status

by(datos[-1],INDICES = datos$status, summary)

datos$status: Vivo
    status              rx          sex      hepatom  spiders  edema   
 Vivo  :187   Placebo    :93   Mujeres: 14   No:115   No:149   No:174  
 Muerto:  0   Tratamiento:94   Hombres:173   Sí: 72   Sí: 38   Sí: 13  
                                                                       
                                                                       
                                                                       
                                                                       
                                                                       
   bilirubin         cholest          albumin          copper      
 Min.   : 0.300   Min.   : 120.0   Min.   :2.750   Min.   :  4.00  
 1st Qu.: 0.600   1st Qu.: 247.2   1st Qu.:3.395   1st Qu.: 35.25  
 Median : 1.000   Median : 298.5   Median :3.620   Median : 57.00  
 Mean   : 1.669   Mean   : 338.5   Mean   :3.626   Mean   : 72.47  
 3rd Qu.: 1.800   3rd Qu.: 363.2   3rd Qu.:3.845   3rd Qu.: 83.50  
 Max.   :13.000   Max.   :1712.0   Max.   :4.640   Max.   :444.00  
                  NA's   :17                       NA's   :1       
    alkphos            trigli          platel         prothrom    
 Min.   :  289.0   Min.   : 33.0   Min.   : 79.0   Min.   : 9.00  
 1st Qu.:  800.5   1st Qu.: 80.0   1st Qu.:217.0   1st Qu.: 9.85  
 Median : 1132.0   Median :104.0   Median :271.0   Median :10.20  
 Mean   : 1573.8   Mean   :114.1   Mean   :275.3   Mean   :10.38  
 3rd Qu.: 1648.5   3rd Qu.:139.0   3rd Qu.:326.2   3rd Qu.:10.75  
 Max.   :11046.6   Max.   :382.0   Max.   :539.0   Max.   :17.10  
                   NA's   :18      NA's   :3                      
     histol        age            years       
 Grado 1:15   Min.   :26.28   Min.   : 1.459  
 Grado 2:51   1st Qu.:40.35   1st Qu.: 4.244  
 Grado 3:77   Median :47.42   Median : 6.089  
 Grado 4:44   Mean   :47.86   Mean   : 6.464  
              3rd Qu.:55.97   3rd Qu.: 8.193  
              Max.   :78.44   Max.   :12.474  
                                              
-------------------------------------------------------- 
datos$status: Muerto
    status              rx          sex      hepatom spiders edema  
 Vivo  :  0   Placebo    :65   Mujeres: 22   No:37   No:73   No:89  
 Muerto:125   Tratamiento:60   Hombres:103   Sí:88   Sí:52   Sí:36  
                                                                    
                                                                    
                                                                    
                                                                    
                                                                    
   bilirubin        cholest          albumin          copper     
 Min.   : 0.30   Min.   : 127.0   Min.   :1.960   Min.   : 13.0  
 1st Qu.: 1.40   1st Qu.: 257.5   1st Qu.:3.110   1st Qu.: 63.5  
 Median : 3.20   Median : 339.0   Median :3.430   Median :111.0  
 Mean   : 5.63   Mean   : 415.8   Mean   :3.361   Mean   :135.4  
 3rd Qu.: 6.70   3rd Qu.: 454.0   3rd Qu.:3.670   3rd Qu.:199.2  
 Max.   :28.00   Max.   :1775.0   Max.   :4.400   Max.   :588.0  
                 NA's   :11                       NA's   :1      
    alkphos          trigli          platel         prothrom    
 Min.   :  516   Min.   : 49.0   Min.   : 62.0   Min.   : 9.50  
 1st Qu.: 1029   1st Qu.: 91.0   1st Qu.:163.8   1st Qu.:10.60  
 Median : 1664   Median :122.0   Median :226.5   Median :11.00  
 Mean   : 2594   Mean   :140.5   Mean   :242.1   Mean   :11.24  
 3rd Qu.: 2468   3rd Qu.:171.0   3rd Qu.:309.2   3rd Qu.:11.80  
 Max.   :13862   Max.   :598.0   Max.   :563.0   Max.   :15.20  
                 NA's   :12      NA's   :1                      
     histol        age            years        
 Grado 1: 1   Min.   :30.86   Min.   : 0.1123  
 Grado 2:16   1st Qu.:46.10   1st Qu.: 1.9001  
 Grado 3:43   Median :52.83   Median : 3.2608  
 Grado 4:65   Mean   :53.24   Mean   : 4.0401  
              3rd Qu.:61.15   3rd Qu.: 6.1766  
              Max.   :76.71   Max.   :11.4743

For decision trees, rpart drops an observation only when its response is missing or all of its covariates are missing; when only some covariates are missing, surrogate splits are used to route the observation. For clustering we use the Gower distance. For a given pair of patients, variables with a missing value in either patient do not contribute to that pair's Gower distance.
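As a minimal sketch of the surrogate-split behaviour described above, the following uses the kyphosis example data that ships with rpart (an assumption for illustration, not the cirrhosis file) and deletes ten Age values artificially. The names toy_surr and fit_na are hypothetical.

```r
library(rpart)  # recommended package, installed with standard R distributions

# kyphosis is an example data set bundled with rpart (81 children).
toy_surr <- kyphosis
set.seed(1)
toy_surr$Age[sample(nrow(toy_surr), 10)] <- NA  # knock out 10 values of one predictor

# Default na.action (na.rpart) keeps rows that still have the response
# and at least one predictor.
fit_na <- rpart(Kyphosis ~ Age + Number + Start, data = toy_surr, method = "class")

# Number of observations in the root node: rows with a missing Age are kept.
fit_na$frame$n[1]

# Prediction also works for a case whose Age is missing.
predict(fit_na, newdata = toy_surr[is.na(toy_surr$Age), ][1, ], type = "class")
```

The root-node count shows how many rows the tree actually used; comparing it with nrow() of the data is a quick check that incomplete rows were not silently discarded.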

Classification tree

Fit

We fit a classification tree for status without the variable years, since the goal is precisely to predict vital status. The following tree is obtained:

# Classification Tree with rpart

library(rpart)

# grow tree
set.seed(17)
fit <- rpart(status ~ . -years,
   method="class", data=datos[-1])

# plot tree
library(rattle)
fancyRpartPlot(fit, cex=1.1)

Variable importance

The variable importance shown below is computed as the sum of the decreases in impurity over the splits in which the variable takes part, either as the primary split or as a surrogate.

# variable importance
library(knitr)
kable(fit$variable.importance, digits = 2)

bilirubin	44.11
prothrom	18.77
copper	17.90
age	15.74
albumin	11.15
alkphos	8.73
spiders	8.11
hepatom	7.10
cholest	4.45
histol	2.04
platel	1.35
edema	0.52
sex	0.52
trigli	0.48

Bilirubin is the best predictor

boxplot(datos$bilirubin~datos$status, xlab='Vital status', ylab='Bilirubin')

Operating characteristics of the tree

On the modeling sample (100% of the data), we obtain the following confusion matrix:

pred = predict(fit, type="class")

library(caret)
confusionMatrix(pred, datos$status)

Confusion Matrix and Statistics

          Reference
Prediction Vivo Muerto
    Vivo    161     21
    Muerto   26    104
                                          
               Accuracy : 0.8494          
                 95% CI : (0.8048, 0.8872)
    No Information Rate : 0.5994          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.6884          
 Mcnemar's Test P-Value : 0.5596          
                                          
            Sensitivity : 0.8610          
            Specificity : 0.8320          
         Pos Pred Value : 0.8846          
         Neg Pred Value : 0.8000          
             Prevalence : 0.5994          
         Detection Rate : 0.5160          
   Detection Prevalence : 0.5833          
      Balanced Accuracy : 0.8465          
                                          
       'Positive' Class : Vivo

Pruning the tree

The tree above can be pruned based on the complexity parameter:

# printcp(fit) # display the results
# summary(fit) # detailed summary of splits
library(knitr)
kable(fit$cptable, digits=3)

CP	nsplit	rel error	xerror	xstd
0.360	0	1.000	1.000	0.069
0.128	1	0.640	0.760	0.065
0.028	2	0.512	0.608	0.061
0.024	4	0.456	0.656	0.062
0.016	6	0.408	0.656	0.062
0.010	8	0.376	0.648	0.062

plotcp(fit) # visualize cross-validation results

# prune the tree

# use the CP value of the row with the smallest cross-validated error (xerror)
pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])
# summary(pfit)

# plot the pruned tree
fancyRpartPlot(pfit)
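The cp selection in the prune() call can be reproduced by hand from the printed table. The sketch below retypes the rounded CP, nsplit, xerror, and xstd columns into a hypothetical data frame cp_table (not an rpart object) and picks the row with the smallest cross-validated error.

```r
# Rounded values copied from the complexity table printed above.
cp_table <- data.frame(
  CP     = c(0.360, 0.128, 0.028, 0.024, 0.016, 0.010),
  nsplit = c(0, 1, 2, 4, 6, 8),
  xerror = c(1.000, 0.760, 0.608, 0.656, 0.656, 0.648),
  xstd   = c(0.069, 0.065, 0.061, 0.062, 0.062, 0.062)
)

# Row with the minimum cross-validated error; its CP is the pruning threshold.
best <- which.min(cp_table$xerror)
cp_table[best, c("CP", "nsplit", "xerror")]
```

With these values the selected row has two splits, which is why the pruned tree is much smaller than the eight-split tree grown initially.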

The following is the confusion matrix for the pruned tree:

pred = predict(pfit, type="class")

library(caret)
confusionMatrix(pred, datos$status)

Confusion Matrix and Statistics

          Reference
Prediction Vivo Muerto
    Vivo    168     45
    Muerto   19     80
                                          
               Accuracy : 0.7949          
                 95% CI : (0.7458, 0.8383)
    No Information Rate : 0.5994          
    P-Value [Acc > NIR] : 1.426e-13       
                                          
                  Kappa : 0.5576          
 Mcnemar's Test P-Value : 0.001778        
                                          
            Sensitivity : 0.8984          
            Specificity : 0.6400          
         Pos Pred Value : 0.7887          
         Neg Pred Value : 0.8081          
             Prevalence : 0.5994          
         Detection Rate : 0.5385          
   Detection Prevalence : 0.6827          
      Balanced Accuracy : 0.7692          
                                          
       'Positive' Class : Vivo

Final tree

Comparing the confusion matrices, both computed on the training sample, and placing special emphasis on the positive predictive value (0.88 unpruned versus 0.79 pruned), the first (unpruned) tree would be the recommended one. We do not want to say that a patient will live when in fact the patient is more likely to die.
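The statistics compared here can be recomputed directly from the counts in the two confusion matrices above. The sketch below is only a worked check; the matrices are retyped by hand, and the helper metrics() is a hypothetical name, not part of caret.

```r
# Counts copied from the two confusion matrices above
# (rows = prediction, columns = reference; positive class = "Vivo").
unpruned <- matrix(c(161, 26, 21, 104), nrow = 2,
                   dimnames = list(Prediction = c("Vivo", "Muerto"),
                                   Reference  = c("Vivo", "Muerto")))
pruned <- matrix(c(168, 19, 45, 80), nrow = 2,
                 dimnames = dimnames(unpruned))

# Hypothetical helper: standard 2x2 measures with "Vivo" as the positive class.
metrics <- function(m) {
  tp <- m["Vivo", "Vivo"];   fp <- m["Vivo", "Muerto"]
  fn <- m["Muerto", "Vivo"]; tn <- m["Muerto", "Muerto"]
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn),
    accuracy    = (tp + tn) / sum(m))
}

round(rbind(unpruned = metrics(unpruned), pruned = metrics(pruned)), 4)
```

Because "Vivo" is the positive class, a false positive is a patient predicted to live who died; the positive predictive value is the measure that penalizes that error.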

Regression tree

Fit

Next we fit a regression tree to predict cholesterol levels. As covariates we select only characteristics of the individual measured at the start of the study, which could later be used to predict cholesterol levels. For this reason we leave vital status at the end of the study, time to death, and treatment out of the analysis.

# Regression tree with rpart

library(rpart)

# grow tree (dropped columns 1:3 and 18 are number, status, rx and years)
set.seed(17)
fit <- rpart(cholest ~ .,
   method="anova", data=datos[,-c(1:3,18)])

# plot tree
library(rattle)
fancyRpartPlot(fit, cex=1.1)

Variable importance

library(knitr)
kable(fit$variable.importance, digits=3)

bilirubin	4796345.6
histol	2491644.0
age	2309245.1
alkphos	1421659.1
albumin	1388574.9
prothrom	911733.8
trigli	757582.2
platel	588420.0
edema	437508.2
copper	295846.7

Predictive value, R²

We compute R² on the training sample (the whole sample is used).

pred <- predict(fit, newdata = datos, type="vector")
# mean squared residual over the patients with an observed cholesterol value
difer <- datos$cholest - pred
difer <- difer[!is.na(difer)]
mean(difer^2)

[1] 23796.46

tmp <- printcp(fit)


Regression tree:
rpart(formula = cholest ~ ., data = datos[, -c(1:3, 18)], method = "anova")

Variables actually used in tree construction:
[1] age       bilirubin histol    platel    prothrom 

Root node error: 15224911/284 = 53609

n=284 (28 observations deleted due to missingness)

        CP nsplit rel error  xerror    xstd
1 0.244119      0   1.00000 1.00769 0.23978
2 0.162627      1   0.75588 0.90811 0.21285
3 0.077133      2   0.59325 0.91831 0.20601
4 0.036377      3   0.51612 0.82552 0.19394
5 0.020322      4   0.47974 0.81427 0.17289
6 0.015532      5   0.45942 0.80574 0.17033
7 0.010000      6   0.44389 0.79654 0.17409

# last row of the table: 1 - rel error = apparent R², 1 - xerror = cross-validated R²
(rsq.val <- 1 - tmp[nrow(tmp), c(3, 4)])

rel error    xerror 
0.5561095 0.2034624

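Pruning

Note from the table above that both the relative error and the cross-validated error (xerror) reach their minimum in the last row. Under that criterion we would be maximizing R², which is already only fair (apparent R² = 0.56; cross-validated R² = 0.20). Therefore, pruning the tree is neither necessary nor advisable.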
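The R² values quoted here follow from numbers already printed above. As a worked check (values retyped by hand; variable names are hypothetical), the apparent R² is one minus the ratio of the tree's mean squared residual to the root node error, which equals one minus the relative error in the last row of the table:

```r
mse_tree <- 23796.46        # mean squared residual of the fitted tree
mse_root <- 15224911 / 284  # root node error: total sum of squares / n

# Apparent (training-sample) R²
r2_apparent <- 1 - mse_tree / mse_root

# Same quantities from the last row of the complexity table
rel_error <- 0.44389
xerror    <- 0.79654
r2_table  <- c(apparent = 1 - rel_error, cross_validated = 1 - xerror)

round(c(from_mse = r2_apparent, r2_table), 4)
```

The gap between the apparent and the cross-validated value indicates how much of the training-sample fit is unlikely to carry over to new patients.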

Clustering and characterization of the patients

For clustering we use all the variables, but we first check whether there is redundancy among them and whether the set of variables can be reduced.

Correlation between variables

# is.numeric() is TRUE for both "numeric" and "integer" columns
var_continuas <- sapply(datos[-1], is.numeric)

correlacion <- cor(datos[-1][,var_continuas], use = 'pairwise.complete.obs')

library(caret)
tooHighNames <- findCorrelation(correlacion, cutoff = 0.8, verbose = TRUE, names = TRUE)

All correlations <= 0.8

# tooHighNames
# 
# View(correlacion)
# 
# myvars <- names(datos) %in% tooHighNames
# # data_seg <- datos[!myvars]
# # str(data_seg)
# # summary(data_seg)

library(GGally)
ggpairs(datos[-1][,var_continuas])

None of the correlations between the continuous variables is too high.

Correlation with the other variables

library(CluMix)
mix.heatmap(datos[,-1], rowmar=7, legend.mat=TRUE)

boxplot(datos$age~datos$rx, ylab='Age')

table(datos$sex,datos$spiders)

         
           No  Sí
  Mujeres  32   4
  Hombres 190  86

table(datos$edema,datos$status)

    
     Vivo Muerto
  No  174     89
  Sí   13     36

table(datos$histol,datos$hepatom)

         
          No Sí
  Grado 1 16  0
  Grado 2 48 19
  Grado 3 67 53
  Grado 4 21 88

No redundancy among the variables is apparent, so we will not remove any variables.

Gower distance

Since we have mixed data, we use the Gower distance.

library(cluster)
gower_dist <- daisy(datos[,-1],
                    metric = "gower")

summary(gower_dist)

48516 dissimilarities, summarized :
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01803 0.18889 0.25446 0.25976 0.32630 0.66584 
Metric :  mixed ;  Types = N, N, N, N, N, N, I, I, I, I, I, I, I, I, O, I, I 
Number of objects : 312

hist(gower_dist)
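To illustrate how the Gower distance treats mixed and missing data, the following sketch uses a made-up four-row data frame (toy_gower, not the cirrhosis data) and compares daisy() with a hand calculation. Numeric variables contribute |difference| / range, factors contribute 0 or 1, and a variable that is missing for either member of a pair is left out of that pair's average.

```r
library(cluster)  # recommended package, installed with standard R distributions

# Made-up data: two numeric variables (one with a missing value) and one factor.
toy_gower <- data.frame(
  age   = c(40, 50, 60, 45),
  bili  = c(1.0, NA, 3.0, 2.0),
  edema = factor(c("No", "No", "Yes", "No"))
)

d_toy <- as.matrix(daisy(toy_gower, metric = "gower"))

# Hand calculation (range of age = 20, range of bili = 2):
# rows 1 and 2: bili is missing for row 2, so only age and edema are averaged
d12 <- (abs(40 - 50) / 20 + 0) / 2
# rows 1 and 3: all three variables are available
d13 <- (abs(40 - 60) / 20 + abs(1 - 3) / 2 + 1) / 3

c(daisy_12 = d_toy[1, 2], hand_12 = d12, daisy_13 = d_toy[1, 3], hand_13 = d13)
```

Since each pair is averaged only over the variables it has in common, patients with missing laboratory values remain in the distance matrix instead of being dropped.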

The two most similar and the two most dissimilar subjects

gower_mat <- as.matrix(gower_dist)


# Output most similar pair

datos[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]),
        arr.ind = TRUE)[1, ], ]

	number	status	rx	sex	hepatom	spiders	edema	bilirubin	cholest	albumin	copper	alkphos	trigli	platel	prothrom	histol	age	years
126	126	Muerto	Placebo	Hombres	Sí	Sí	No	1.2	269	3.12	NA	1441	68	166	11.1	Grado 4	53.5989	2.255989
94	94	Muerto	Placebo	Hombres	Sí	Sí	No	3.2	201	3.11	178	1212	69	188	11.8	Grado 4	53.9165	2.053388

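# Output most dissimilar pair
# Note: excluding max(gower_mat) below selects the pair at the second-largest
# distance; gower_mat == max(gower_mat) would select the largest one.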

datos[
  which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
        arr.ind = TRUE)[1, ], ]

	number	status	rx	sex	hepatom	spiders	edema	bilirubin	cholest	albumin	copper	alkphos	trigli	platel	prothrom	histol	age	years
164	164	Muerto	Tratamiento	Hombres	Sí	Sí	Sí	8.5	NA	3.34	161	1428	NA	88	13.3	Grado 4	43.41410	0.7227926
58	58	Vivo	Placebo	Mujeres	No	No	No	0.7	242	4.08	73	5890	118	NA	10.6	Grado 1	44.56947	12.2080760

Agglomerative

Number of clusters

library(seriation)
dissplot(gower_mat)

library(NbClust)

clusters <- 2:6
Medidas <- data.frame(clusters,
                      frey= NA,mcclain= NA,cindex= NA,silhouette= NA,dunn= NA
                      )
Medidas <- rbind(Medidas, c("Best.nc", rep(NA,5)))

indices <- c('frey', 'mcclain', 'cindex', 'silhouette', 'dunn')

for (i in seq_along(indices)) {
  # only a dissimilarity object is supplied, so distance must be NULL
  res <- NbClust(diss = gower_dist, distance = NULL,
                 min.nc = min(clusters), max.nc = max(clusters),
                 method = "ward.D2",
                 index = indices[i])

  # column i + 1 holds index i: its value for each k, then the best k
  Medidas[i+1] <- c(res$All.index, res$Best.nc[1])
}


 Only frey, mcclain, cindex, sihouette and dunn can be computed. To compute the other indices, data matrix is needed 


Medidas

clusters	frey	mcclain	cindex	silhouette	dunn
2	4.4665	0.5686	0.3631	0.2876	0.1746
3	0.7061	1.4355	0.3511	0.1690	0.1384
4	-0.0109	1.9098	0.3393	0.1624	0.1326
5	0.0090	2.2476	0.3600	0.1727	0.1545
6	0.1839	2.4403	0.3651	0.1889	0.1677
Best.nc	2.0000	2.0000	4.0000	2.0000	2.0000

Partitioning around medoids (PAM)

Number of clusters

# Calculate silhouette width for many k using PAM

sil_width <- c(NA)

for(i in 2:10){
  
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  
  sil_width[i] <- pam_fit$silinfo$avg.width
  
}

# Plot sihouette width (higher is better)

plot(1:10, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:10, sil_width)

We select the agglomerative method with Ward linkage and two segments

hier <- hclust(gower_dist, method = 'ward.D2')
plot(hier)
segmento <- cutree(hier,2)
rect.hclust(hier, k=2)

# Medidas
sil <- silhouette(segmento,dist = gower_dist)
summary(sil)

Silhouette of 312 units in 2 clusters from silhouette.default(x = segmento, dist = gower_dist) :
 Cluster sizes and average silhouette widths:
      108       204 
0.1907506 0.3388885 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.2935  0.1753  0.2986  0.2876  0.4311  0.5157

clValid::connectivity(clusters = as.integer(segmento), distance = as.matrix(gower_dist), neighbSize=10)

[1] 30.54722

clValid::dunn(distance = as.matrix(gower_dist), clusters = as.integer(segmento))

[1] 0.1745829

An acceptable average silhouette width is obtained (0.29 > 0.25).
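For reference, the silhouette width of an observation is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. A minimal sketch on four made-up points on a line (not the study data; x_toy, cl_toy, and sil_toy are hypothetical names):

```r
library(cluster)

x_toy  <- c(0, 1, 10, 11)    # four made-up points
cl_toy <- c(1L, 1L, 2L, 2L)  # two obvious clusters

sil_toy <- silhouette(cl_toy, dist(x_toy))

# Hand calculation for the first point:
a1 <- 1                      # distance to the only other member of cluster 1
b1 <- mean(c(10, 11))        # mean distance to the members of cluster 2
s1 <- (b1 - a1) / max(a1, b1)

c(package = sil_toy[1, "sil_width"], by_hand = s1)
mean(sil_toy[, "sil_width"]) # average silhouette width of the partition
```

Values close to 1 indicate well separated clusters, values near 0 indicate observations lying between clusters, and negative values (as in the minimum reported above) indicate observations closer to the other cluster than to their own.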

Description of the clusters

library(FactoMineR)
# num.var = 1: the first column (the segment) is the variable to be characterized
catdes(donnee = data.frame(as.factor(segmento), datos), num.var = 1)

$test.chi2
             p.value df
status  1.182342e-44  1
hepatom 4.037427e-21  1
histol  2.277717e-12  3
edema   4.762346e-10  1
spiders 9.563962e-09  1
sex     3.810746e-04  1

$category
$category$`1`
                 Cla/Mod   Mod/Cla    Global      p.value     v.test
status=Muerto  80.800000 93.518519 40.064103 3.879472e-49  14.734344
hepatom=Sí     59.375000 87.962963 51.282051 1.147617e-22   9.798066
histol=Grado 4 61.467890 62.037037 34.935897 5.070206e-13   7.223404
edema=Sí       73.469388 33.333333 15.705128 1.732336e-09   6.021102
spiders=Sí     58.888889 49.074074 28.846154 1.937902e-08   5.617455
sex=Mujeres    61.111111 20.370370 11.538462 6.459985e-04   3.411528
histol=Grado 3 24.166667 26.851852 38.461538 2.065069e-03  -3.080711
histol=Grado 1  0.000000  0.000000  5.128205 9.021682e-04  -3.319382
histol=Grado 2 17.910448 11.111111 21.474359 8.745751e-04  -3.328046
sex=Hombres    31.159420 79.629630 88.461538 6.459985e-04  -3.411528
spiders=No     24.774775 50.925926 71.153846 1.937902e-08  -5.617455
edema=No       27.376426 66.666667 84.294872 1.732336e-09  -6.021102
hepatom=No      8.552632 12.037037 48.717949 1.147617e-22  -9.798066
status=Vivo     3.743316  6.481481 59.935897 3.879472e-49 -14.734344

$category$`2`
                 Cla/Mod   Mod/Cla    Global      p.value     v.test
status=Vivo     96.25668 88.235294 59.935897 3.879472e-49  14.734344
hepatom=No      91.44737 68.137255 48.717949 1.147617e-22   9.798066
edema=No        72.62357 93.627451 84.294872 1.732336e-09   6.021102
spiders=No      75.22523 81.862745 71.153846 1.937902e-08   5.617455
sex=Hombres     68.84058 93.137255 88.461538 6.459985e-04   3.411528
histol=Grado 2  82.08955 26.960784 21.474359 8.745751e-04   3.328046
histol=Grado 1 100.00000  7.843137  5.128205 9.021682e-04   3.319382
histol=Grado 3  75.83333 44.607843 38.461538 2.065069e-03   3.080711
sex=Mujeres     38.88889  6.862745 11.538462 6.459985e-04  -3.411528
spiders=Sí      41.11111 18.137255 28.846154 1.937902e-08  -5.617455
edema=Sí        26.53061  6.372549 15.705128 1.732336e-09  -6.021102
histol=Grado 4  38.53211 20.588235 34.935897 5.070206e-13  -7.223404
hepatom=Sí      40.62500 31.862745 51.282051 1.147617e-22  -9.798066
status=Muerto   19.20000 11.764706 40.064103 3.879472e-49 -14.734344


$quanti.var
                Eta2      P-value
bilirubin 0.24006170 3.034507e-20
years     0.20871965 1.706906e-17
prothrom  0.19385802 3.164805e-16
albumin   0.16965325 3.309655e-14
copper    0.13994583 9.842020e-12
number    0.09502280 2.713015e-08
age       0.07430820 1.015777e-06
platel    0.06828684 3.343948e-06
trigli    0.03993311 7.382622e-04
alkphos   0.02354596 6.614873e-03
cholest   0.02292481 1.061602e-02

$quanti
$quanti$`1`
             v.test Mean in category Overall mean sd in category
bilirubin  8.640555         6.301852      3.25609      6.2341055
prothrom   7.764654        11.332407     10.72564      1.0235042
copper     6.611699       141.691589     97.64839    102.7998156
age        4.807271        53.976880     50.01901     10.4061826
trigli     3.541144       142.649485    124.70213     80.9785566
alkphos    2.706066      2433.324073   1982.65577   2484.0164762
cholest    2.697521       418.185567    369.51056    279.1519311
platel    -4.595700       227.747664    261.93506     98.1597109
number    -5.436183       118.342593    156.50000     83.0292409
albumin   -7.263757         3.282685      3.52000      0.4414006
years     -8.056787         3.565214      5.49312      2.7387771
            Overall sd      p.value
bilirubin    4.5230494 5.594062e-18
prothrom     1.0027125 8.186870e-15
copper      85.4757214 3.799331e-11
age         10.5642898 1.530048e-06
trigli      65.0330243 3.983967e-04
alkphos   2136.9559586 6.808545e-03
cholest    231.5358319 6.985796e-03
platel      95.4534071 4.312994e-06
number      90.0661794 5.443390e-08
albumin      0.4192186 3.764850e-13
years        3.0704429 7.832619e-16

$quanti$`2`
             v.test Mean in category Overall mean sd in category
years      8.056787         6.513777      5.49312      2.7273025
albumin    7.263757         3.645637      3.52000      0.3464641
number     5.436183       176.700980    156.50000     87.0509981
platel     4.621104       280.134328    261.93506     88.7638539
cholest   -2.643025       344.262032    369.51056    197.8099363
alkphos   -2.706066      1744.066666   1982.65577   1884.9624915
trigli    -3.507115       115.291892    124.70213     52.4509675
age       -4.807271        47.923662     50.01901     10.0337405
copper    -6.582748        74.433498     97.64839     63.4467253
prothrom  -7.764654        10.404412     10.72564      0.8276651
bilirubin -8.640555         1.643627      3.25609      1.7895143
            Overall sd      p.value
years        3.0704429 7.832619e-16
albumin      0.4192186 3.764850e-13
number      90.0661794 5.443390e-08
platel      95.4534071 3.817027e-06
cholest    231.5358319 8.216891e-03
alkphos   2136.9559586 6.808545e-03
trigli      65.0330243 4.529941e-04
age         10.5642898 1.530048e-06
copper      85.4757214 4.618307e-11
prothrom     1.0027125 8.186870e-15
bilirubin    4.5230494 5.594062e-18


attr(,"class")
[1] "catdes" "list "

by(datos, segmento, summary, na.rm=TRUE)

segmento: 1
     number          status              rx          sex     hepatom
 Min.   :  1.00   Vivo  :  7   Placebo    :58   Mujeres:22   No:13  
 1st Qu.: 49.75   Muerto:101   Tratamiento:50   Hombres:86   Sí:95  
 Median :104.50                                                     
 Mean   :118.34                                                     
 3rd Qu.:169.25                                                     
 Max.   :300.00                                                     
                                                                    
 spiders edema     bilirubin         cholest          albumin     
 No:55   No:72   Min.   : 0.600   Min.   : 151.0   Min.   :1.960  
 Sí:53   Sí:36   1st Qu.: 2.000   1st Qu.: 251.0   1st Qu.:3.065  
                 Median : 3.850   Median : 346.0   Median :3.345  
                 Mean   : 6.302   Mean   : 418.2   Mean   :3.283  
                 3rd Qu.: 7.225   3rd Qu.: 456.0   3rd Qu.:3.553  
                 Max.   :28.000   Max.   :1775.0   Max.   :4.160  
                                  NA's   :11                      
     copper         alkphos          trigli          platel     
 Min.   : 16.0   Min.   :  516   Min.   : 49.0   Min.   : 62.0  
 1st Qu.: 70.0   1st Qu.: 1014   1st Qu.: 91.0   1st Qu.:151.0  
 Median :111.0   Median : 1580   Median :124.0   Median :214.0  
 Mean   :141.7   Mean   : 2433   Mean   :142.6   Mean   :227.7  
 3rd Qu.:200.0   3rd Qu.: 2462   3rd Qu.:171.0   3rd Qu.:280.5  
 Max.   :588.0   Max.   :13862   Max.   :598.0   Max.   :563.0  
 NA's   :1                       NA's   :11      NA's   :1      
    prothrom         histol        age            years        
 Min.   : 9.50   Grado 1: 0   Min.   :30.86   Min.   : 0.1123  
 1st Qu.:10.60   Grado 2:12   1st Qu.:46.36   1st Qu.: 1.5092  
 Median :11.15   Grado 3:29   Median :53.54   Median : 2.8939  
 Mean   :11.33   Grado 4:67   Mean   :53.98   Mean   : 3.5652  
 3rd Qu.:12.00                3rd Qu.:61.26   3rd Qu.: 4.7830  
 Max.   :15.20                Max.   :78.44   Max.   :11.4743  
                                                               
-------------------------------------------------------- 
segmento: 2
     number         status              rx           sex      hepatom 
 Min.   :  2.0   Vivo  :180   Placebo    :100   Mujeres: 14   No:139  
 1st Qu.:106.5   Muerto: 24   Tratamiento:104   Hombres:190   Sí: 65  
 Median :186.5                                                        
 Mean   :176.7                                                        
 3rd Qu.:251.2                                                        
 Max.   :312.0                                                        
                                                                      
 spiders  edema      bilirubin         cholest          albumin     
 No:167   No:191   Min.   : 0.300   Min.   : 120.0   Min.   :2.750  
 Sí: 37   Sí: 13   1st Qu.: 0.675   1st Qu.: 249.0   1st Qu.:3.408  
                   Median : 1.000   Median : 303.0   Median :3.640  
                   Mean   : 1.644   Mean   : 344.3   Mean   :3.646  
                   3rd Qu.: 1.800   3rd Qu.: 372.5   3rd Qu.:3.870  
                   Max.   :13.000   Max.   :1712.0   Max.   :4.640  
                                    NA's   :17                      
     copper          alkphos            trigli          platel     
 Min.   :  4.00   Min.   :  289.0   Min.   : 33.0   Min.   : 79.0  
 1st Qu.: 35.50   1st Qu.:  813.8   1st Qu.: 78.0   1st Qu.:221.0  
 Median : 58.00   Median : 1177.5   Median :104.0   Median :277.0  
 Mean   : 74.43   Mean   : 1744.1   Mean   :115.3   Mean   :280.1  
 3rd Qu.: 89.50   3rd Qu.: 1726.5   3rd Qu.:143.0   3rd Qu.:336.0  
 Max.   :444.00   Max.   :12258.8   Max.   :382.0   Max.   :539.0  
 NA's   :1                          NA's   :19      NA's   :3      
    prothrom        histol        age            years        
 Min.   : 9.0   Grado 1:16   Min.   :26.28   Min.   : 0.5421  
 1st Qu.: 9.9   Grado 2:55   1st Qu.:40.53   1st Qu.: 4.2546  
 Median :10.2   Grado 3:91   Median :47.87   Median : 6.2902  
 Mean   :10.4   Grado 4:42   Mean   :47.92   Mean   : 6.5138  
 3rd Qu.:10.7                3rd Qu.:55.97   3rd Qu.: 8.3566  
 Max.   :17.1                Max.   :75.01   Max.   :12.4736

# Classification tree with rpart, using the segment as the response

library(rpart)

# grow tree
set.seed(17)
fit <- rpart(segmento ~ . ,
   method="class", data=datos[-1])

# plot tree
library(rattle)
fancyRpartPlot(fit, cex=1.1)

library(knitr)
kable(fit$variable.importance)

status	88.970833
years	37.894640
prothrom	36.113034
bilirubin	35.552289
albumin	25.909585
copper	24.911833
hepatom	21.919135
spiders	11.531531
age	1.184818
alkphos	1.184818
edema	1.153153

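Segment 1 (n = 108) consists mainly of patients who died (101 of 108); most of them had hepatomegaly (95 of 108) and about half had spider angiomas (53 of 108). Segment 2 (n = 204) consists mainly of patients who were alive (180 of 204), most of them without hepatomegaly (139 of 204) and without spider angiomas (167 of 204).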