Skeletal muscle mass prediction

Univariate descriptive statistics

##                                
##                                 Overall       
##   n                                101        
##   Sexo = M (%)                      67 (66.3) 
##   Edad_calc (mean (SD))          58.00 (11.66)
##   Sitio (%)                                   
##      Colorrectal                    52 (51.5) 
##      Esófago                         4 ( 4.0) 
##      Gástrico                       18 (17.8) 
##      Hepatobiliopancreático         27 (26.7) 
##   ECOG (%)                                    
##      ECOG 0                          9 ( 8.9) 
##      ECOG 1                         59 (58.4) 
##      ECOG 2                         33 (32.7) 
##   MuscEsqueléticoKg (mean (SD))  28.40 (6.00) 
##   SupCorp (mean (SD))             1.81 (0.22) 
##   BMI_calc (mean (SD))           26.14 (4.90) 
##   Talla (mean (SD))             165.84 (9.30) 
##   Peso (mean (SD))               72.04 (15.65)
##   GrasaCorporalkg (mean (SD))    21.24 (10.94)
##   GrasaVisceral (mean (SD))       9.71 (5.24) 
##   RelaciónCC (mean (SD))          0.91 (0.07) 
##   LabAlb (mean (SD))              3.89 (0.23) 
##   LabCreat (mean (SD))            0.83 (0.19) 
##   LabUrea (mean (SD))            35.11 (9.33) 
##   LabHb (mean (SD))              12.77 (1.33) 
##   LabLinf (mean (SD))            23.70 (8.75)
##                                Stratified by Sexo
##                                 F              M              
##   n                                 34             67         
##   Sexo = M (%)                       0 ( 0.0)      67 (100.0) 
##   Edad_calc (mean (SD))          56.66 (11.73)  58.69 (11.66) 
##   Sitio (%)                                                   
##      Colorrectal                    15 (44.1)      37 ( 55.2) 
##      Esófago                         0 ( 0.0)       4 (  6.0) 
##      Gástrico                        2 ( 5.9)      16 ( 23.9) 
##      Hepatobiliopancreático         17 (50.0)      10 ( 14.9) 
##   ECOG (%)                                                    
##      ECOG 0                          1 ( 2.9)       8 ( 11.9) 
##      ECOG 1                         19 (55.9)      40 ( 59.7) 
##      ECOG 2                         14 (41.2)      19 ( 28.4) 
##   MuscEsqueléticoKg (mean (SD))  22.42 (3.73)   31.53 (4.35)  
##   SupCorp (mean (SD))             1.67 (0.18)    1.88 (0.21)  
##   BMI_calc (mean (SD))           25.95 (4.75)   26.24 (5.01)  
##   Talla (mean (SD))             157.53 (6.81)  170.06 (7.36)  
##   Peso (mean (SD))               64.47 (13.36)  75.88 (15.40) 
##   GrasaCorporalkg (mean (SD))    23.88 (10.99)  19.91 (10.75) 
##   GrasaVisceral (mean (SD))      11.25 (5.04)    8.97 (5.21)  
##   RelaciónCC (mean (SD))          0.91 (0.07)    0.91 (0.07)  
##   LabAlb (mean (SD))              3.81 (0.26)    3.93 (0.21)  
##   LabCreat (mean (SD))            0.72 (0.17)    0.89 (0.18)  
##   LabUrea (mean (SD))            33.74 (11.32)  35.83 (8.11)  
##   LabHb (mean (SD))              12.09 (1.02)   13.12 (1.34)  
##   LabLinf (mean (SD))            24.59 (9.05)   23.29 (8.65)
  • Sitio is grouped into two categories: sup (Esófago, Gástrico, Hepatobiliopancreático) and inf (Colorrectal)
  • Weight loss is highly skewed, so it is log-transformed (log_P_Peso)

Missing data

## # A tibble: 15 x 3
##    variable          n_miss pct_miss
##    <chr>              <int>    <dbl>
##  1 linfocitos            15   14.9  
##  2 RelaciónCC             3    2.97 
##  3 MuscEsqueléticoKg      2    1.98 
##  4 LabUrea                2    1.98 
##  5 LabAlb                 1    0.990
##  6 LabCreat               1    0.990
##  7 LabHb                  1    0.990
##  8 SupCorp                0    0    
##  9 BMI_calc               0    0    
## 10 Talla                  0    0    
## 11 Peso                   0    0    
## 12 Edad_calc              0    0    
## 13 log_P_Peso             0    0    
## 14 Male                   0    0    
## 15 Sitio_sup              0    0

## # A tibble: 101 x 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1    45      5    33.3 
##  2    18      2    13.3 
##  3    41      2    13.3 
##  4    66      2    13.3 
##  5     1      1     6.67
##  6     2      1     6.67
##  7     7      1     6.67
##  8     9      1     6.67
##  9    12      1     6.67
## 10    19      1     6.67
## # ... with 91 more rows
  • Only 25 values are missing, which is 1.65% of the 1,515 data cells (101 cases × 15 variables)
  • linfocitos is the variable with the most missing values (15 cases, 14.9%).
  • Stochastic imputation is applied (mice package)
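
The missing-data arithmetic and the idea behind stochastic imputation can be sketched in base R. This is only an illustration on simulated data: the data frame `toy`, its values, and the single-variable regression imputation are assumptions made for the example. The report itself uses the mice package, and its settings are not shown.

```r
# Illustrative sketch on simulated data (not the study data).
set.seed(1)
n <- 101
toy <- data.frame(
  Peso  = rnorm(n, 72, 15),
  Talla = rnorm(n, 166, 9)
)
toy$LabAlb <- 3.9 + 0.005 * (toy$Peso - 72) + rnorm(n, 0, 0.2)
toy$LabAlb[sample(n, 10)] <- NA          # introduce 10 missing values

# Missing values per variable, and as a share of all cells (rows x columns)
n_miss   <- colSums(is.na(toy))
pct_cell <- 100 * sum(n_miss) / (nrow(toy) * ncol(toy))

# Stochastic regression imputation for one variable:
# predicted value plus a random draw from the residual distribution
fit  <- lm(LabAlb ~ Peso + Talla, data = toy)   # lm() drops incomplete rows
miss <- is.na(toy$LabAlb)
pred <- predict(fit, newdata = toy[miss, ])
toy$LabAlb[miss] <- pred + rnorm(sum(miss), 0, sigma(fit))

# The report's figure follows the same cell-based arithmetic:
pct_report <- 100 * 25 / (101 * 15)      # 25 missing cells out of 1515
```

Adding the random draw keeps the imputed values from sitting exactly on the regression line, which would otherwise understate the variability of the imputed variable.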

Bivariate descriptive statistics

Relationships fitted both as forced linear fits and with smoothers, split by sex

QUESTION: Looking at the correlations with muscle mass, at equal body size (weight, BMI, body surface area; height is the exception), women have more muscle mass. Is that reasonable?

Multivariate descriptive statistics

Principal component analysis of the quantitative variables

  • The first 2 components explain 48.02% of the total variability
  • The first axis separates by size (heavier individuals to the right); the second axis separates by shape (shorter individuals with wider hips toward the bottom)
  • Sitio does not appear to be associated with distinct patterns in the remaining variables
  • There is a difference between the sexes
  • One outlier is detected among the men (case 41)
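
A minimal sketch of how the explained-variance figure and the axis scores are obtained, assuming a standardized PCA with base R's `prcomp`. The simulated data frame `toy` and its column names are hypothetical; the report does not show which function or scaling was used.

```r
# Illustrative sketch on simulated data (not the study data).
set.seed(2)
n <- 100
size  <- rnorm(n)                        # latent "size" factor
shape <- rnorm(n)                        # latent "shape" factor
toy <- data.frame(
  Peso       = 72   + 12 * size + 4 * shape + rnorm(n, 0, 3),
  Talla      = 166  + 6 * size  - 5 * shape + rnorm(n, 0, 3),
  RelacionCC = 0.91 + 0.03 * shape + rnorm(n, 0, 0.03),
  LabAlb     = rnorm(n, 3.9, 0.2)
)

# PCA on standardized variables, since the units differ widely
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
var_expl <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * cumsum(var_expl), 2)         # cumulative percentage

# Scores on the first two axes, e.g. for a plot colored by sex or site
scores <- pca$x[, 1:2]
```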

Linear model. Model selection

We assess collinearity (variance inflation factors)

## RelaciónCC     LabAlb   LabCreat    LabUrea      LabHb    SupCorp 
##   4.518738   1.353824   1.902131   1.532458   1.520250 714.716376 
##   BMI_calc      Talla       Peso  Edad_calc linfocitos log_P_Peso 
## 118.678541  80.263343 627.687164   1.339684   1.111300   1.683424 
##       Male  Sitio_sup 
##   2.260774   1.406221
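
Inflation factors like the ones above can be reproduced by hand from their definition, VIF_j = 1 / (1 - R²_j). The sketch below assumes simulated data with hypothetical columns. The report does not say which function produced its values.

```r
# Illustrative sketch on simulated data (not the study data).
set.seed(3)
n <- 100
Talla  <- rnorm(n, 166, 9)
Peso   <- 72 + 0.9 * (Talla - 166) + rnorm(n, 0, 12)
BMI    <- Peso / (Talla / 100)^2         # deterministic function of Peso and Talla
LabAlb <- rnorm(n, 3.9, 0.2)             # unrelated to the size variables
X <- data.frame(Talla, Peso, BMI, LabAlb)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

Because BMI is computed from Peso and Talla, those three variables inflate each other, while the unrelated variable stays close to 1. This is the same pattern as SupCorp, BMI_calc, Talla and Peso in the output above.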
  • The size and shape variables are correlated, some of them very strongly. Different models are tried, avoiding collinear variables in the same model: Peso + Talla versus SupCorp + BMI_calc. The best is chosen by information criterion (AIC), which keeps Peso and Talla.
  • The remaining variables are chosen by AIC (MuMIn, multimodel inference): all possible models are fitted and ranked by AIC. Finally, variable importance is shown as the sum of the Akaike weights of the models containing each variable (variables present in the largest share of the best-fitting models score highest)
##                      Male RelaciónCC LabAlb SupCorp linfocitos BMI_calc
## Sum of weights:      1.00 0.90       0.86   0.83    0.69       0.63    
## N containing models: 8192 8192       8192   8192    8192       8192    
##                      LabHb Peso Sitio_sup Talla Edad_calc log_P_Peso
## Sum of weights:      0.55  0.49 0.49      0.39  0.32      0.24      
## N containing models: 8192  8192 8192      8192  8192      8192      
##                      LabCreat LabUrea
## Sum of weights:      0.23     0.23   
## N containing models: 8192     8192
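
The "Sum of weights" row can be understood from a small all-subsets example in base R. This is a sketch on simulated data with three hypothetical predictors, not the MuMIn call used in the report, whose arguments are not shown.

```r
# Illustrative sketch on simulated data (not the study data).
set.seed(4)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 0.3 * d$x2 + rnorm(n)     # x3 is pure noise

preds <- c("x1", "x2", "x3")
# All 2^3 = 8 subsets of predictors, as a logical inclusion matrix
incl <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(preds))))
colnames(incl) <- preds

aic <- apply(incl, 1, function(keep) {
  rhs <- if (any(keep)) preds[keep] else "1"
  AIC(lm(reformulate(rhs, response = "y"), data = d))
})

# Akaike weights: relative likelihood of each model, normalized to sum to 1
delta <- aic - min(aic)
w <- exp(-delta / 2) / sum(exp(-delta / 2))

# Importance of a variable = sum of the weights of the models that contain it
importance <- colSums(incl * w)
round(sort(importance, decreasing = TRUE), 2)
```

Each predictor appears in exactly half of the candidate models (4 of 8 here), which matches the constant 8192 in the "N containing models" row: half of the 2^14 models built from 14 predictors.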
  • Linear model without collinearity problems, including: Male + Peso + Talla + RelaciónCC + LabAlb + linfocitos + LabHb
##       Male       Peso      Talla RelaciónCC     LabAlb linfocitos 
##   1.893981   4.735725   2.509589   3.840202   1.143337   1.047990 
##      LabHb 
##   1.276754
## 
## Call:
## lm(formula = MuscEsqueléticoKg ~ Male + Peso + Talla + RelaciónCC + 
##     LabAlb + linfocitos + LabHb, data = bd_compl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4884 -1.1751  0.1353  1.4211  5.6298 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.157e+01  8.432e+00  -2.558  0.01213 *  
## Male1        3.523e+00  6.098e-01   5.777 1.00e-07 ***
## Peso         2.322e-01  2.927e-02   7.935 4.64e-12 ***
## Talla        2.161e-01  3.585e-02   6.028 3.32e-08 ***
## RelaciónCC  -1.362e+01  5.844e+00  -2.331  0.02192 *  
## LabAlb       2.684e+00  9.699e-01   2.768  0.00681 ** 
## linfocitos   5.481e-04  3.021e-04   1.814  0.07285 .  
## LabHb       -3.116e-01  1.778e-01  -1.752  0.08300 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.104 on 93 degrees of freedom
## Multiple R-squared:  0.8841, Adjusted R-squared:  0.8754 
## F-statistic: 101.3 on 7 and 93 DF,  p-value: < 2.2e-16

Linear model assumptions

  • Observation 48 is atypical and is discarded. The model is re-estimated and its assumptions are checked

## 
##  Shapiro-Wilk normality test
## 
## data:  re
## W = 0.98955, p-value = 0.6284
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m5_lineal)
## W = 0.98834, p-value = 0.5339
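
A minimal sketch of this kind of check, assuming studentized residuals are used to flag the atypical case before the normality test. The report does not state how observation 48 was identified, and the data below are simulated with an outlier planted at row 48 purely for illustration.

```r
# Illustrative sketch on simulated data (not the study data).
set.seed(5)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
d$y[48] <- d$y[48] + 8                   # plant one atypical observation

m_all <- lm(y ~ x, data = d)

# Externally studentized residuals flag observations the model fits poorly
rs <- rstudent(m_all)
worst <- unname(which.max(abs(rs)))

# Refit without the flagged row and test residual normality
m_drop <- lm(y ~ x, data = d[-worst, ])
sw <- shapiro.test(residuals(m_drop))
sw$p.value                               # a large p-value gives no evidence against normality
```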

  • All fine: no problems with normality or heteroscedasticity, and the linear relationships are reasonable. The model is estimated without observation 48
## 
## Call:
## lm(formula = MuscEsqueléticoKg ~ Peso + Male + Talla + LabAlb + 
##     RelaciónCC + linfocitos + LabHb, data = bd_compl[-48, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0671 -1.3416 -0.0184  1.2543  5.3990 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.285e+01  8.043e+00  -1.598 0.113526    
## Peso         2.555e-01  2.753e-02   9.280 7.46e-15 ***
## Male1        3.161e+00  5.684e-01   5.561 2.61e-07 ***
## Talla        1.973e-01  3.333e-02   5.921 5.46e-08 ***
## LabAlb       2.086e+00  9.050e-01   2.305 0.023415 *  
## RelaciónCC  -2.050e+01  5.630e+00  -3.642 0.000447 ***
## linfocitos   5.258e-04  2.784e-04   1.889 0.062082 .  
## LabHb       -1.829e-01  1.667e-01  -1.097 0.275530    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.939 on 92 degrees of freedom
## Multiple R-squared:  0.8951, Adjusted R-squared:  0.8871 
## F-statistic: 112.1 on 7 and 92 DF,  p-value: < 2.2e-16
  • LabHb is not significant (p = 0.28); drop it?
  • Pablo selected Peso + Male + Talla + LabAlb

We compare the 3 models by AIC

##           df      AIC
## m5_lineal  9 425.8712
## m6_sinHb   8 425.1707
## m7_Pablo   6 440.2575
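
The `df` column in a table like this counts the regression coefficients (including the intercept) plus the residual variance, which is why a model with 7 predictors shows df = 9. A short sketch on simulated data, with hypothetical model names, shows the same table and the AIC computed by hand:

```r
# Illustrative sketch on simulated data (not the study data).
set.seed(6)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x2 + rnorm(n)

m_full  <- lm(y ~ x1 + x2 + x3, data = d)
m_mid   <- lm(y ~ x1 + x2,      data = d)
m_small <- lm(y ~ x1,           data = d)

# AIC = 2k - 2 logLik, where k counts the coefficients
# (including the intercept) plus the residual variance
tab <- AIC(m_full, m_mid, m_small)
tab

# Same value computed by hand for one model
k <- length(coef(m_mid)) + 1
aic_by_hand <- 2 * k - 2 * as.numeric(logLik(m_mid))
```

AIC values are only comparable when every model is fitted to the same rows, so a case excluded from one model has to be excluded from all of them.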

Best model (lowest AIC): m6_sinHb, with Peso + Male + Talla + LabAlb + RelaciónCC + linfocitos, fitted without observation 48

Random forest

## 
## Call:
##  randomForest(formula = MuscEsqueléticoKg ~ ., data = bd_compl,      mtry = 5, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 6.013267
##                     % Var explained: 82.91

  • The size and shape variables are the most important: Talla, SupCorp, Peso, BMI_calc. Sex (Male) is also important. Next come LabAlb, LabCreat and LabHb
  • The 5 most important variables are similar to those of the linear model

Lasso regression

  • 14 variables are used: RelaciónCC, LabAlb, LabCreat, LabUrea, LabHb, Talla, Peso, Male, Edad_calc, Sitio_sup, linfocitos, log_P_Peso, SupCorp and BMI_calc
  • Observation 48 is excluded
  • The best lambda is taken as the one at the minimum + 1 SD, that is, the largest lambda whose cross-validation error is within one standard error of the minimum
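
The lambda rule above can be sketched in a few lines, assuming it is the usual one-standard-error rule (what glmnet calls `lambda.1se`; the report does not name the package). The cross-validation numbers below are made up for the example.

```r
# Illustrative sketch with made-up cross-validation results (not from the report).
lambda <- c(1.00, 0.50, 0.25, 0.12, 0.06, 0.03)   # decreasing penalty
cv_err <- c(9.0,  6.0,  4.6,  4.2,  4.0,  4.1)    # mean CV error per lambda
cv_se  <- c(0.9,  0.6,  0.5,  0.4,  0.4,  0.4)    # its standard error

i_min      <- which.min(cv_err)
lambda_min <- lambda[i_min]

# One-standard-error rule: the largest lambda (sparsest model) whose CV error
# is within one standard error of the minimum
threshold  <- cv_err[i_min] + cv_se[i_min]
lambda_1se <- max(lambda[cv_err <= threshold])
```

Choosing the larger lambda trades a small loss in estimated accuracy for a sparser model, which fits a setting with about 100 cases and 14 candidate predictors.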

## 15 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -2.724338e+01
## RelaciónCC  -3.550333e+00
## LabAlb       1.018021e+00
## LabCreat     .           
## LabUrea      .           
## LabHb        .           
## SupCorp      1.362145e+01
## BMI_calc     .           
## Talla        1.683747e-01
## Peso         .           
## Edad_calc    .           
## linfocitos   2.168022e-04
## log_P_Peso   .           
## Male1        2.986540e+00
## Sitio_sup1  -8.850462e-02

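Lasso selects 6 variables: Male, Talla, SupCorp, RelaciónCC, LabAlb and linfocitos (Sitio_sup also keeps a nonzero but small coefficient, -0.089, and is not included in this list)

GAM

For smoothing-parameter selection we use REML, which is preferable to GCV because it penalizes overfitting more strongly. Explanatory variables: Male + Peso + Talla + RelaciónCC + LabAlb + linfocitos + LabHb (the fitted formula in the output omits LabHb)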

## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
## Loaded package itsadug 2.3 (see 'help("itsadug")' ).
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## MuscEsqueléticoKg ~ s(Peso) + s(Talla) + s(LabAlb) + s(RelaciónCC) + 
##     s(linfocitos) + Male
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.5207     0.4075  65.089  < 2e-16 ***
## Male1         2.9348     0.5379   5.456 4.13e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                 edf Ref.df      F  p-value    
## s(Peso)       1.001  1.003 91.345  < 2e-16 ***
## s(Talla)      1.000  1.000 34.595 5.19e-08 ***
## s(LabAlb)     1.000  1.000  4.309 0.040669 *  
## s(RelaciónCC) 2.592  3.297  5.830 0.000715 ***
## s(linfocitos) 1.000  1.000  3.256 0.074392 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.891   Deviance explained =   90%
## -REML = 207.27  Scale est. = 3.6149    n = 100
[itsadug prediction summaries, one per smooth term: each is evaluated for Male = 0 and 1 over 30 values of the varying predictor, with the other predictors held at Peso = 71.5, Talla = 167, LabAlb = 3.9, RelaciónCC = 0.91, linfocitos = 1620. Ranges of the varying predictor: Peso 45.7 to 115; Talla 145 to 188; LabAlb 3.4 to 4.4; RelaciónCC 0.72 to 1.08; linfocitos 328 to 4600.]

## 
## Method: REML   Optimizer: outer newton
## full convergence after 10 iterations.
## Gradient range [-0.0001123183,0.0005739024]
## (score 207.2742 & scale 3.614904).
## Hessian positive definite, eigenvalue range [9.724347e-06,46.51313].
## Model rank =  47 / 47 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                 k'  edf k-index p-value   
## s(Peso)       9.00 1.00    0.78    0.01 **
## s(Talla)      9.00 1.00    1.06    0.66   
## s(LabAlb)     9.00 1.00    1.03    0.57   
## s(RelaciónCC) 9.00 2.59    1.00    0.43   
## s(linfocitos) 9.00 1.00    1.05    0.68   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model selection

  • We compare the models by cross-validation, to select the one with the best predictive performance.
  • We use leave-one-out (LOO) cross-validation (caret package)
  • 3 models are compared:

# Linear model chosen by AIC (m6_sinHb), without observation 48
m_lineal <- lm(MuscEsqueléticoKg ~ Peso + Male + Talla + LabAlb + RelaciónCC + linfocitos, data = bd_compl[-48, ])

# Linear model refitted on the variables selected by lasso
m_lasso <- lm(MuscEsqueléticoKg ~ Male + Talla + SupCorp + RelaciónCC + LabAlb + linfocitos, data = bd_compl[-48, ])

# GAM with smooth terms for the quantitative predictors (requires library(mgcv))
m_gam <- gam(MuscEsqueléticoKg ~ s(Peso) + s(Talla) + s(LabAlb) + s(RelaciónCC) + s(linfocitos) + Male, data = bd_compl[-48, ], method = "REML")
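
What leave-one-out cross-validation computes for a linear model can be sketched in base R. The example uses simulated data and hypothetical variable names. It is not the caret call behind the output that follows, which is not shown in the report.

```r
# Illustrative sketch on simulated data (not the study data).
set.seed(7)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x2 + rnorm(n)

# Explicit leave-one-out: refit without row i, then predict row i
loo_pred <- sapply(seq_len(n), function(i) {
  fit <- lm(y ~ x1 + x2, data = d[-i, ])
  predict(fit, newdata = d[i, ])
})
rmse_loo <- sqrt(mean((d$y - loo_pred)^2))
mae_loo  <- mean(abs(d$y - loo_pred))

# For linear models the same errors follow from a single fit:
# e_i / (1 - h_ii), with h_ii the leverage of observation i
fit_all   <- lm(y ~ x1 + x2, data = d)
e_loo     <- residuals(fit_all) / (1 - hatvalues(fit_all))
rmse_fast <- sqrt(mean(e_loo^2))
```

The shortcut applies only to models fitted by ordinary least squares. A GAM with estimated smoothing parameters needs the explicit refit loop.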

## Linear Regression 
## 
## 101 samples
##   6 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.240971  0.8574766  1.718937
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Linear Regression 
## 
## 101 samples
##   6 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.214402  0.8608193  1.705464
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Generalized Additive Model using Splines 
## 
## 101 samples
##   6 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
## Resampling results across tuning parameters:
## 
##   select  RMSE      Rsquared   MAE     
##   FALSE   2.450007  0.8313108  1.937566
##    TRUE   2.427435  0.8332436  1.882016
## 
## Tuning parameter 'method' was held constant at a value of GCV.Cp
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were select = TRUE and method = GCV.Cp.