Introducción

Las enfermedades cardiovasculares afectan aproximadamente a 17 millones de personas en todo el mundo cada año, se manifiestan principalmente como infartos de miocardio e insuficiencias cardíacas (IC); la cual se produce cuando el corazón no puede bombear suficiente sangre para satisfacer las necesidades del cuerpo. Los síntomas más comunes de estas enfermedades son dolor o molestias en el centro del pecho; y/o en los brazos, el hombro izquierdo, los codos, la mandíbula o la espalda. Además, la persona puede experimentar dificultades para respirar o disnea, náuseas o vómitos, sensación de mareo o desmayo, sudores fríos y palidez.

Por lo que se analizará un conjunto de datos con información recopilada en el año 2015 de 299 pacientes con insuficiencia cardíaca, en donde se evaluan las variables edad, anemia, creatinina fosfoquinasa, diabetes,fracción de eyección, hipertensión arterial, nivel de las plaquetas, creatinina sérica, sodio sérico, sexo, tabaquismo, tiempo, fallecimiento.

Se va aplicar una distribución Binomial - Bernoulli con apoyo de clasificadores de aprendizaje automático para predecir la supervivencia de los pacientes y clasificar las características correspondientes a los factores de riesgo más importantes.

Objetivo

Desarrollar un modelo Binomial con enlance probit para predecir la probabilidad de muerte en pacientes con insuficiencia cardíaca. Se emplearán datos clínicos relevantes, como la edad, el sexo, la presión arterial, la fracción de eyección y otros factores de riesgo.

Hipótesis

\[\left\{\begin{matrix} H_0: & \text{El efecto de las variables edad, fracción de eyección, creatinina sérica y tiempo del paciente, no incide en la probabilidad de muerte} \\ H_1: & \text{El efecto de las variables edad, fracción de eyección, creatinina sérica y tiempo del paciente, incide en la probabilidad de muerte} \end{matrix}\right.\]

Exploración de los datos:

Descripción de las variables

Variables Continuas

Edad: Edad de los pacientes (años).
Creatinina Fosfoquinasa (CPK): Enzima presente en el cuerpo, predominante en el corazón, cerebro y músculo esquelético. Se mide la cantidad en sangre (U/L).
Fracción de Eyección: Porcentaje de sangre expulsado por el corazón en cada contracción (%).
Plaquetas: Recuento de plaquetas en la sangre (por microlitro).
Creatinina Sérica: Desecho metabólico en la sangre proveniente de los músculos (mg/dL).
Sodio Sérico: Concentración de sodio en la sangre (mEq/L).

Variables Dicotómicas

Anemia: Diagnóstico de anemia (Sí / No).
Diabetes: Diagnóstico de diabetes (Sí / No).
Hipertensión Arterial: Diagnóstico de hipertensión (Sí / No).
Sexo: Género biológico (Femenino / Masculino).
Tabaquismo: Antecedentes de consumo de tabaco (Sí / No).
Evento de Muerte: Indica si el paciente falleció (Sí / No).

Variable Discreta

Tiempo: Tiempo de seguimiento del paciente (días).

Análisis exploratorio

## # A tibble: 6 × 13
##     age anaemia creatinine_phosphokinase diabetes ejection_fraction
##   <dbl>   <dbl>                    <dbl>    <dbl>             <dbl>
## 1    75       0                      582        0                20
## 2    55       0                     7861        0                38
## 3    65       0                      146        0                20
## 4    50       1                      111        0                20
## 5    65       1                      160        1                20
## 6    90       1                       47        0                40
## # ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
## #   serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
## #   time <dbl>, DEATH_EVENT <dbl>

## [1] 299  13

## spc_tbl_ [299 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age                     : num [1:299] 75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : num [1:299] 0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: num [1:299] 582 7861 146 111 160 ...
##  $ diabetes                : num [1:299] 0 0 0 0 1 0 0 1 0 0 ...
##  $ ejection_fraction       : num [1:299] 20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : num [1:299] 1 0 0 0 0 1 0 0 0 1 ...
##  $ platelets               : num [1:299] 265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num [1:299] 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : num [1:299] 130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : num [1:299] 1 1 1 1 0 1 1 1 0 1 ...
##  $ smoking                 : num [1:299] 0 0 1 0 0 1 0 1 0 1 ...
##  $ time                    : num [1:299] 4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : num [1:299] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   anaemia = col_double(),
##   ..   creatinine_phosphokinase = col_double(),
##   ..   diabetes = col_double(),
##   ..   ejection_fraction = col_double(),
##   ..   high_blood_pressure = col_double(),
##   ..   platelets = col_double(),
##   ..   serum_creatinine = col_double(),
##   ..   serum_sodium = col_double(),
##   ..   sex = col_double(),
##   ..   smoking = col_double(),
##   ..   time = col_double(),
##   ..   DEATH_EVENT = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

##       age           anaemia       creatinine_phosphokinase    diabetes     
##  Min.   :40.00   Min.   :0.0000   Min.   :  23.0           Min.   :0.0000  
##  1st Qu.:51.00   1st Qu.:0.0000   1st Qu.: 116.5           1st Qu.:0.0000  
##  Median :60.00   Median :0.0000   Median : 250.0           Median :0.0000  
##  Mean   :60.83   Mean   :0.4314   Mean   : 581.8           Mean   :0.4181  
##  3rd Qu.:70.00   3rd Qu.:1.0000   3rd Qu.: 582.0           3rd Qu.:1.0000  
##  Max.   :95.00   Max.   :1.0000   Max.   :7861.0           Max.   :1.0000  
##  ejection_fraction high_blood_pressure   platelets      serum_creatinine
##  Min.   :14.00     Min.   :0.0000      Min.   : 25100   Min.   :0.500   
##  1st Qu.:30.00     1st Qu.:0.0000      1st Qu.:212500   1st Qu.:0.900   
##  Median :38.00     Median :0.0000      Median :262000   Median :1.100   
##  Mean   :38.08     Mean   :0.3512      Mean   :263358   Mean   :1.394   
##  3rd Qu.:45.00     3rd Qu.:1.0000      3rd Qu.:303500   3rd Qu.:1.400   
##  Max.   :80.00     Max.   :1.0000      Max.   :850000   Max.   :9.400   
##   serum_sodium        sex            smoking            time      
##  Min.   :113.0   Min.   :0.0000   Min.   :0.0000   Min.   :  4.0  
##  1st Qu.:134.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 73.0  
##  Median :137.0   Median :1.0000   Median :0.0000   Median :115.0  
##  Mean   :136.6   Mean   :0.6488   Mean   :0.3211   Mean   :130.3  
##  3rd Qu.:140.0   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:203.0  
##  Max.   :148.0   Max.   :1.0000   Max.   :1.0000   Max.   :285.0  
##   DEATH_EVENT    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3211  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

##                      age                  anaemia creatinine_phosphokinase 
##                        0                        0                        0 
##                 diabetes        ejection_fraction      high_blood_pressure 
##                        0                        0                        0 
##                platelets         serum_creatinine             serum_sodium 
##                        0                        0                        0 
##                      sex                  smoking                     time 
##                        0                        0                        0 
##              DEATH_EVENT 
##                        0

Visualizar el comportamiento de las variables.

Matriz de correlaciones

##                                  age creatinine_phosphokinase ejection_fraction
## age                       1.00000000             -0.081583900        0.06009836
## creatinine_phosphokinase -0.08158390              1.000000000       -0.04407955
## ejection_fraction         0.06009836             -0.044079554        1.00000000
## platelets                -0.05235437              0.024463389        0.07217747
## serum_creatinine          0.15918713             -0.016408480       -0.01130247
## serum_sodium             -0.04596584              0.059550156        0.17590228
## time                     -0.22406842             -0.009345653        0.04172924
##                            platelets serum_creatinine serum_sodium         time
## age                      -0.05235437       0.15918713  -0.04596584 -0.224068420
## creatinine_phosphokinase  0.02446339      -0.01640848   0.05955016 -0.009345653
## ejection_fraction         0.07217747      -0.01130247   0.17590228  0.041729235
## platelets                 1.00000000      -0.04119808   0.06212462  0.010513909
## serum_creatinine         -0.04119808       1.00000000  -0.18909521 -0.149315418
## serum_sodium              0.06212462      -0.18909521   1.00000000  0.087640000
## time                      0.01051391      -0.14931542   0.08764000  1.000000000

La mayoría de las varibales tienen correlación baja, las cuales se encuentran en color blanco.

Ejection_fraction y platelets no parecen estar correlacionadas.

serum_creatinine y time tienen una correlación negativa (ligeramente rojiza), esto sugiere que a mayor tiempo (time), menor es el nivel de serum_creatinine.

Ajustar modelos de regresión logística con diferentes funciones de enlace

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     ejection_fraction + high_blood_pressure + platelets + serum_creatinine + 
##     serum_sodium + sex + smoking + time, family = binomial(link = "logit"), 
##     data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               1.057e+01  5.584e+00   1.893 0.058423 .  
## age                       4.669e-02  1.567e-02   2.979 0.002891 ** 
## anaemia1                 -9.408e-03  3.605e-01  -0.026 0.979178    
## creatinine_phosphokinase  2.219e-04  1.778e-04   1.248 0.211994    
## ejection_fraction        -7.636e-02  1.628e-02  -4.690 2.73e-06 ***
## high_blood_pressure1     -1.044e-01  3.582e-01  -0.291 0.770755    
## platelets                -1.152e-06  1.894e-06  -0.608 0.542901    
## serum_creatinine          6.592e-01  1.797e-01   3.668 0.000244 ***
## serum_sodium             -6.897e-02  3.948e-02  -1.747 0.080612 .  
## sex1                     -5.460e-01  4.126e-01  -1.323 0.185759    
## smoking1                 -1.926e-02  4.120e-01  -0.047 0.962706    
## time                     -2.109e-02  3.019e-03  -6.986 2.83e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 219.72  on 287  degrees of freedom
## AIC: 243.72
## 
## Number of Fisher Scoring iterations: 6

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     ejection_fraction + high_blood_pressure + platelets + serum_creatinine + 
##     serum_sodium + sex + smoking + time, family = binomial(link = "probit"), 
##     data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               6.241e+00  3.191e+00   1.956  0.05047 .  
## age                       2.714e-02  8.732e-03   3.108  0.00188 ** 
## anaemia1                  1.720e-03  2.041e-01   0.008  0.99328    
## creatinine_phosphokinase  1.336e-04  9.946e-05   1.343  0.17922    
## ejection_fraction        -4.489e-02  9.069e-03  -4.949 7.45e-07 ***
## high_blood_pressure1     -8.815e-02  2.056e-01  -0.429  0.66815    
## platelets                -7.614e-07  1.071e-06  -0.711  0.47717    
## serum_creatinine          3.678e-01  1.025e-01   3.590  0.00033 ***
## serum_sodium             -4.039e-02  2.269e-02  -1.780  0.07509 .  
## sex1                     -3.169e-01  2.338e-01  -1.356  0.17524    
## smoking1                 -3.220e-02  2.340e-01  -0.138  0.89055    
## time                     -1.204e-02  1.601e-03  -7.521 5.43e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 219.13  on 287  degrees of freedom
## AIC: 243.13
## 
## Number of Fisher Scoring iterations: 6

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     ejection_fraction + high_blood_pressure + platelets + serum_creatinine + 
##     serum_sodium + sex + smoking + time, family = binomial(link = "cloglog"), 
##     data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               7.697e+00  3.724e+00   2.067 0.038778 *  
## age                       3.363e-02  1.044e-02   3.222 0.001272 ** 
## anaemia1                 -1.224e-01  2.564e-01  -0.477 0.633183    
## creatinine_phosphokinase  1.193e-04  1.004e-04   1.189 0.234612    
## ejection_fraction        -5.128e-02  1.127e-02  -4.549 5.38e-06 ***
## high_blood_pressure1     -5.637e-03  2.553e-01  -0.022 0.982382    
## platelets                -1.130e-06  1.347e-06  -0.839 0.401478    
## serum_creatinine          3.911e-01  1.177e-01   3.324 0.000887 ***
## serum_sodium             -5.276e-02  2.691e-02  -1.960 0.049946 *  
## sex1                     -4.170e-01  2.852e-01  -1.462 0.143647    
## smoking1                  7.006e-02  2.884e-01   0.243 0.808058    
## time                     -1.611e-02  2.149e-03  -7.499 6.43e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 220.78  on 287  degrees of freedom
## AIC: 244.78
## 
## Number of Fisher Scoring iterations: 7

Comparación AIC y Deviance

##          Modelo      AIC Deviance
## Logit     Logit 243.7250 219.7250
## Probit   Probit 243.1344 219.1344
## Cloglog Cloglog 244.7791 220.7791

Al revisar el AIC y el Deviance, se encuentra que el mejor modelo es Probit porque para ambos casos el AIC y el Deviance son los más bajos en comparación del Logit y Cloglog.

anova2(modelo_probit)

## 
##   Wald test 
## 
## Model 1 :  DEATH_EVENT ~ 1 
## Model 2 :  DEATH_EVENT ~ age 
## Model 3 :  DEATH_EVENT ~ age + anaemia 
## Model 4 :  DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase 
## Model 5 :  DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction 
## Model 6 :  DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure 
## Model 7 :  DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets 
## Model 8 :  DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets + serum_creatinine 
## Model 9 :  DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium 
## Model 10 :  DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex 
## Model 11 :  DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking 
## Model 12 :  DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time 
## 
##             Chi    df  Pr(Chisq>)    
## 1 vs 2   18.40490   1   1.786e-05 ***
## 2 vs 3    0.66600   1     0.41445    
## 3 vs 4    2.77338   1     0.09584 .  
## 4 vs 5   22.82993   1   1.770e-06 ***
## 5 vs 6    1.36318   1     0.24299    
## 6 vs 7    0.37613   1     0.53968    
## 7 vs 8   17.05559   1   3.630e-05 ***
## 8 vs 9    3.10096   1     0.07825 .  
## 9 vs 10   1.33593   1     0.24775    
## 10 vs 11  0.10206   1     0.74937    
## 11 vs 12 56.56745   1   5.429e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Como la variable ‘smoking’ no tiene una significancia alta en el modelo, se realiza uno nuevo quitando la variable.

summary(modelo_probit)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     ejection_fraction + high_blood_pressure + platelets + serum_creatinine + 
##     serum_sodium + sex + smoking + time, family = binomial(link = "probit"), 
##     data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               6.241e+00  3.191e+00   1.956  0.05047 .  
## age                       2.714e-02  8.732e-03   3.108  0.00188 ** 
## anaemia1                  1.720e-03  2.041e-01   0.008  0.99328    
## creatinine_phosphokinase  1.336e-04  9.946e-05   1.343  0.17922    
## ejection_fraction        -4.489e-02  9.069e-03  -4.949 7.45e-07 ***
## high_blood_pressure1     -8.815e-02  2.056e-01  -0.429  0.66815    
## platelets                -7.614e-07  1.071e-06  -0.711  0.47717    
## serum_creatinine          3.678e-01  1.025e-01   3.590  0.00033 ***
## serum_sodium             -4.039e-02  2.269e-02  -1.780  0.07509 .  
## sex1                     -3.169e-01  2.338e-01  -1.356  0.17524    
## smoking1                 -3.220e-02  2.340e-01  -0.138  0.89055    
## time                     -1.204e-02  1.601e-03  -7.521 5.43e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 219.13  on 287  degrees of freedom
## AIC: 243.13
## 
## Number of Fisher Scoring iterations: 6

model1<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+ high_blood_pressure+ platelets+serum_creatinine+serum_sodium+sex+time, data = df, family = binomial(link = "probit"))
summary(model1)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     ejection_fraction + high_blood_pressure + platelets + serum_creatinine + 
##     serum_sodium + sex + time, family = binomial(link = "probit"), 
##     data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               6.235e+00  3.191e+00   1.954 0.050700 .  
## age                       2.711e-02  8.731e-03   3.105 0.001903 ** 
## anaemia1                  5.033e-03  2.025e-01   0.025 0.980177    
## creatinine_phosphokinase  1.344e-04  9.923e-05   1.355 0.175519    
## ejection_fraction        -4.487e-02  9.065e-03  -4.949 7.44e-07 ***
## high_blood_pressure1     -8.730e-02  2.056e-01  -0.425 0.671071    
## platelets                -7.723e-07  1.066e-06  -0.724 0.468832    
## serum_creatinine          3.691e-01  1.021e-01   3.615 0.000301 ***
## serum_sodium             -4.037e-02  2.270e-02  -1.779 0.075250 .  
## sex1                     -3.297e-01  2.138e-01  -1.542 0.123085    
## time                     -1.203e-02  1.599e-03  -7.524 5.30e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 219.15  on 288  degrees of freedom
## AIC: 241.15
## 
## Number of Fisher Scoring iterations: 6

anova(model1,modelo_probit, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + platelets + serum_creatinine + serum_sodium + 
##     sex + time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + platelets + serum_creatinine + serum_sodium + 
##     sex + smoking + time
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       288     219.15                     
## 2       287     219.13  1 0.018693   0.8913

Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.8913), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘smoking’.

model2<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+ high_blood_pressure+ platelets+serum_creatinine+serum_sodium+time, data = df, family = binomial(link = "probit"))
summary(model2)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     ejection_fraction + high_blood_pressure + platelets + serum_creatinine + 
##     serum_sodium + time, family = binomial(link = "probit"), 
##     data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               5.745e+00  3.131e+00   1.835  0.06655 .  
## age                       2.524e-02  8.583e-03   2.941  0.00327 ** 
## anaemia1                  4.378e-02  1.993e-01   0.220  0.82608    
## creatinine_phosphokinase  1.157e-04  9.718e-05   1.191  0.23369    
## ejection_fraction        -4.225e-02  8.825e-03  -4.787 1.69e-06 ***
## high_blood_pressure1     -3.426e-02  2.016e-01  -0.170  0.86503    
## platelets                -6.762e-07  1.045e-06  -0.647  0.51775    
## serum_creatinine          3.771e-01  1.014e-01   3.719  0.00020 ***
## serum_sodium             -3.873e-02  2.243e-02  -1.727  0.08420 .  
## time                     -1.190e-02  1.586e-03  -7.502 6.29e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 221.54  on 289  degrees of freedom
## AIC: 241.54
## 
## Number of Fisher Scoring iterations: 6

anova(model2,model1, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + platelets + serum_creatinine + serum_sodium + 
##     time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + platelets + serum_creatinine + serum_sodium + 
##     sex + time
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       289     221.54                     
## 2       288     219.15  1   2.3847   0.1225

Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.1225), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘sex’.

model3<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+ high_blood_pressure+ platelets+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model3)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     ejection_fraction + high_blood_pressure + platelets + serum_creatinine + 
##     time, family = binomial(link = "probit"), data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.882e-01  6.877e-01   0.710  0.47773    
## age                       2.549e-02  8.467e-03   3.011  0.00261 ** 
## anaemia1                  1.452e-02  1.968e-01   0.074  0.94119    
## creatinine_phosphokinase  1.003e-04  9.471e-05   1.059  0.28948    
## ejection_fraction        -4.338e-02  8.710e-03  -4.980 6.35e-07 ***
## high_blood_pressure1     -3.140e-02  1.999e-01  -0.157  0.87517    
## platelets                -8.141e-07  1.021e-06  -0.797  0.42538    
## serum_creatinine          4.033e-01  1.014e-01   3.976 7.00e-05 ***
## time                     -1.173e-02  1.555e-03  -7.542 4.64e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 224.56  on 290  degrees of freedom
## AIC: 242.56
## 
## Number of Fisher Scoring iterations: 6

anova(model3,model2, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + platelets + serum_creatinine + time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + platelets + serum_creatinine + serum_sodium + 
##     time
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       290     224.56                       
## 2       289     221.54  1   3.0191  0.08229 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.08229), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘serum_sodium’.

model4<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+ high_blood_pressure+ platelets+time, data = df, family = binomial(link = "probit"))
summary(model4)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     ejection_fraction + high_blood_pressure + platelets + time, 
##     family = binomial(link = "probit"), data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               8.292e-01  6.558e-01   1.264 0.206081    
## age                       2.699e-02  8.164e-03   3.306 0.000945 ***
## anaemia1                  5.082e-02  1.873e-01   0.271 0.786154    
## creatinine_phosphokinase  1.067e-04  9.318e-05   1.146 0.251991    
## ejection_fraction        -3.839e-02  8.223e-03  -4.669 3.03e-06 ***
## high_blood_pressure1     -5.455e-02  1.923e-01  -0.284 0.776660    
## platelets                -1.102e-06  9.960e-07  -1.106 0.268553    
## time                     -1.136e-02  1.455e-03  -7.805 5.97e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 244.35  on 291  degrees of freedom
## AIC: 260.35
## 
## Number of Fisher Scoring iterations: 6

anova(model4,model3, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + platelets + time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + platelets + serum_creatinine + time
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       291     244.35                          
## 2       290     224.56  1   19.796 8.617e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Del anova se obtiene un p-valor menor al nivel de significancia (0.05 > 8.617e-06), por lo tanto existe evidencia estadísticamente significativa para norechazar la hipótesis nula y se concluye que hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo con la variable ‘serum_creatinine’.

model5<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+ high_blood_pressure+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model5)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     ejection_fraction + high_blood_pressure + serum_creatinine + 
##     time, family = binomial(link = "probit"), data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               2.619e-01  6.215e-01   0.421  0.67346    
## age                       2.553e-02  8.426e-03   3.030  0.00245 ** 
## anaemia1                  1.713e-02  1.965e-01   0.087  0.93053    
## creatinine_phosphokinase  9.947e-05  9.499e-05   1.047  0.29505    
## ejection_fraction        -4.349e-02  8.686e-03  -5.007 5.52e-07 ***
## high_blood_pressure1     -3.776e-02  1.992e-01  -0.190  0.84968    
## serum_creatinine          4.075e-01  1.012e-01   4.025 5.69e-05 ***
## time                     -1.161e-02  1.542e-03  -7.532 5.01e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 225.19  on 291  degrees of freedom
## AIC: 241.19
## 
## Number of Fisher Scoring iterations: 6

anova(model5,model3, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + serum_creatinine + time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + platelets + serum_creatinine + time
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       291     225.19                     
## 2       290     224.56  1  0.63358    0.426

Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.426), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘platelets’.

model6<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model6)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     ejection_fraction + serum_creatinine + time, family = binomial(link = "probit"), 
##     data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               2.417e-01  6.148e-01   0.393  0.69424    
## age                       2.551e-02  8.425e-03   3.028  0.00246 ** 
## anaemia1                  1.648e-02  1.964e-01   0.084  0.93312    
## creatinine_phosphokinase  1.013e-04  9.484e-05   1.068  0.28558    
## ejection_fraction        -4.346e-02  8.682e-03  -5.006 5.55e-07 ***
## serum_creatinine          4.084e-01  1.010e-01   4.042 5.30e-05 ***
## time                     -1.157e-02  1.525e-03  -7.585 3.31e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 225.23  on 292  degrees of freedom
## AIC: 239.23
## 
## Number of Fisher Scoring iterations: 6

anova(model6,model5, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     serum_creatinine + time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     high_blood_pressure + serum_creatinine + time
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       292     225.22                     
## 2       291     225.19  1 0.034981   0.8516

Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.8516), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘high_blood_pressure’.

model7<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model7)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + 
##     serum_creatinine + time, family = binomial(link = "probit"), 
##     data = df)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.674e-01  5.420e-01  -1.785 0.074263 .  
## age                       1.808e-02  7.805e-03   2.316 0.020532 *  
## anaemia1                 -1.507e-02  1.845e-01  -0.082 0.934867    
## creatinine_phosphokinase  1.316e-04  9.396e-05   1.401 0.161302    
## serum_creatinine          4.060e-01  1.164e-01   3.487 0.000489 ***
## time                     -1.088e-02  1.418e-03  -7.673 1.68e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 254.19  on 293  degrees of freedom
## AIC: 266.19
## 
## Number of Fisher Scoring iterations: 6

anova(model7,model6, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + serum_creatinine + 
##     time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     serum_creatinine + time
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       293     254.19                          
## 2       292     225.22  1    28.96 7.388e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Del anova se obtiene un p-valor menor al nivel de significancia (0.05 > 7.388e-08), por lo tanto existe evidencia estadísticamente significativa para no rechazar la hipótesis nula y se concluye que hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo con la variable ‘ejection_fraction’.

model8<- glm(DEATH_EVENT ~ age + anaemia +ejection_fraction+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model8)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + ejection_fraction + 
##     serum_creatinine + time, family = binomial(link = "probit"), 
##     data = df)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.400859   0.597743   0.671  0.50246    
## age                0.024703   0.008343   2.961  0.00307 ** 
## anaemia1          -0.027929   0.192753  -0.145  0.88479    
## ejection_fraction -0.043914   0.008662  -5.070 3.98e-07 ***
## serum_creatinine   0.407060   0.099471   4.092 4.27e-05 ***
## time              -0.011650   0.001524  -7.644 2.10e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 226.38  on 293  degrees of freedom
## AIC: 238.38
## 
## Number of Fisher Scoring iterations: 6

anova(model8,model6, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: DEATH_EVENT ~ age + anaemia + ejection_fraction + serum_creatinine + 
##     time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + 
##     serum_creatinine + time
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       293     226.38                     
## 2       292     225.22  1   1.1559   0.2823

Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.2823), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘creatinine_phosphokinase’.

model9<- glm(DEATH_EVENT ~ age +ejection_fraction+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model9)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + 
##     time, family = binomial(link = "probit"), data = df)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.389955   0.593672   0.657   0.5113    
## age                0.024670   0.008341   2.958   0.0031 ** 
## ejection_fraction -0.043956   0.008651  -5.081 3.75e-07 ***
## serum_creatinine   0.406534   0.099421   4.089 4.33e-05 ***
## time              -0.011625   0.001514  -7.676 1.64e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 226.40  on 294  degrees of freedom
## AIC: 236.4
## 
## Number of Fisher Scoring iterations: 6

anova(model9,model8, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + time
## Model 2: DEATH_EVENT ~ age + anaemia + ejection_fraction + serum_creatinine + 
##     time
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       294     226.40                     
## 2       293     226.38  1  0.02104   0.8847

Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.8847), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘anaemia’.

model10<- glm(DEATH_EVENT ~ ejection_fraction+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model10)

## 
## Call:
## glm(formula = DEATH_EVENT ~ ejection_fraction + serum_creatinine + 
##     time, family = binomial(link = "probit"), data = df)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.777798   0.382242   4.651 3.30e-06 ***
## ejection_fraction -0.039786   0.008296  -4.796 1.62e-06 ***
## serum_creatinine   0.425490   0.101056   4.210 2.55e-05 ***
## time              -0.012024   0.001491  -8.065 7.33e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 235.25  on 295  degrees of freedom
## AIC: 243.25
## 
## Number of Fisher Scoring iterations: 6

anova(model10,model9, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: DEATH_EVENT ~ ejection_fraction + serum_creatinine + time
## Model 2: DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + time
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       295     235.25                        
## 2       294     226.40  1    8.849 0.002932 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Del anova se obtiene un p-valor menor al nivel de significancia (0.05 > 0.002932), por lo tanto existe evidencia estadísticamente significativa para no rechazar la hipótesis nula y se concluye que hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo con la variable ‘age’.

modelo_seleccionado<-model9
summary(modelo_seleccionado)

## 
## Call:
## glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + 
##     time, family = binomial(link = "probit"), data = df)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.389955   0.593672   0.657   0.5113    
## age                0.024670   0.008341   2.958   0.0031 ** 
## ejection_fraction -0.043956   0.008651  -5.081 3.75e-07 ***
## serum_creatinine   0.406534   0.099421   4.089 4.33e-05 ***
## time              -0.011625   0.001514  -7.676 1.64e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 226.40  on 294  degrees of freedom
## AIC: 236.4
## 
## Number of Fisher Scoring iterations: 6

Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.1004), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos.

Ahora, se comparan los AIC, BIC y el deviance de los modelo elegidos (modelo_probit y modelo_seleccionado)

##                                  Modelo      AIC      BIC Deviance
## Probit                           Probit 243.1344 287.5397 219.1344
## modelo_seleccionado modelo_seleccionado 236.4023 254.9046 226.4023

Respeto a mi primer modelo, tenemos que el AIC fue de 243.1344, y el deviance de 219.1344, al eliminar las variables menos signicativas obtenemos un nuevo modelo, el cual presenta un mejor AIC de 236.4023 y el deviance de 226.4023. A pesar de que el deviance aumentó, el AIC disminuyo en el modelo.

Especificación del Modelo

Sea Y la variable binaria que indica el estado del paciente:

\[ Y_i = \begin{cases} 1, & \text{si el paciente falleció} \\ 0, & \text{si el paciente sobrevivió} \end{cases} \]

Realizo una anova al modelo seleccionado y obtengo el Deviance para la dosis:

anova(modelo_seleccionado)

## Analysis of Deviance Table
## 
## Model: binomial, link: probit
## 
## Response: DEATH_EVENT
## 
## Terms added sequentially (first to last)
## 
## 
##                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                298     375.35              
## age                1   19.146       297     356.20 1.211e-05 ***
## ejection_fraction  1   26.946       296     329.26 2.092e-07 ***
## serum_creatinine   1   22.720       295     306.54 1.874e-06 ***
## time               1   80.134       294     226.40 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Evaluación de los niveles de significancia de 5%, si el efecto de la edad, la fracción de eyección, la creatinina sérica y el tiempo inciden en la probabilidad de muerte.

Con el Deviance obtenido para la variable ‘edad’, se calcula el p-valor para las hipótesis

pchisq(19.146,1,lower.tail = F)

## [1] 1.21091e-05

Dado que el p-valor calculado es menor al nivel de significancia (1.21091\(e^{-05}\)< 0.05), existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que la edad del paciente, incide en la probabilidad de muerte.

Ahora para la variable ‘fracción de eyección’, con el Deviance obtenido, se calcula el p-valor para las hipótesis

pchisq(26.946,1,lower.tail = F)

## [1] 2.092197e-07

Dado que el p-valor calculado es menor al nivel de significancia (2.092197\(e^{07}\) < 0.05), existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que la unidad de fracción de eyección, incide en la probabilidad de muerte.

Ahora para la variable ‘creatinina sérica’, con el Deviance obtenido, se calcula el p-valor para las hipótesis

pchisq(22.720,1,lower.tail = F)

## [1] 1.874071e-06

Dado que el p-valor calculado es menor al nivel de significancia (2.092197\(e^{07}\) < 0.05), existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que la unidad de creatinina sérica, incide en la probabilidad de muerte.

Ahora para la variable ‘tiempo’, con el Deviance obtenido, se calcula el p-valor para las hipótesis

pchisq(80.134,1,lower.tail = F)

## [1] 3.498602e-19

Dado que el p-valor calculado es menor al nivel de significancia (3.498602e\(e^{19}\) < 0.05), existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que la unidad de creatinina sérica, incide en la probabilidad de muerte.

Diagnostico del modelo

Varianza de los residuales bajo la log-verosimilitud:

source("/cloud/project/macros.txt")
residuales <- residuals_glm(modelo_seleccionado,col="red")

Revisando el gráfico, los residuales parecen no estar dispersos de manera aleatoria, se procede entonces a realizar el test para homocedasticidad de Breusch Pagan.

bptest(modelo_seleccionado)

## 
##  studentized Breusch-Pagan test
## 
## data:  modelo_seleccionado
## BP = 24.733, df = 4, p-value = 5.692e-05

Con un p-valor menor al nivel de significacia (5.692e\(e^{05}\) < 0.05) existe evidencia estadísticamente significativa para rechazar la hipótesis nula del test y se concluye que no existe homocedasticidad (varianza constante) en los residuales del modelo.

Bondad de ajuste

Distribución Normal de los residuales:

envelope_glm(modelo_seleccionado, rep=100, conf=0.95,col="yellow")

##   |                                                          |                                                  |   0%  |                                                          |                                                  |   1%  |                                                          |+                                                 |   2%  |                                                          |++                                                |   3%  |                                                          |++                                                |   4%  |                                                          |++                                                |   5%  |                                                          |+++                                               |   6%  |                                                          |++++                                              |   7%  |                                                          |++++                                              |   8%  |                                                          |++++                                              |   9%  |                                                          |+++++                                             |  10%  |                                                          |++++++                                            |  11%  |                                                          |++++++                                            |  12%  |                                                          |++++++                                            |  13%  |                                                          |+++++++                                           |  14%  |                                                          |++++++++                                          |  15%  |                                                          |++++++++                                          |  16%  |                                                          |++++++++                                          |  17%  |                                                          |+++++++++                                         |  18%  |                                                          |++++++++++                                        |  19%  |                                                          |++++++++++                                        |  20%  |                                                          |++++++++++                                        |  21%  |                                                          |+++++++++++                                       |  22%  |                                                          |++++++++++++                                      |  23%  |                                                          |++++++++++++                                      |  24%  |                                                          |++++++++++++                                      |  25%  |                                                          |+++++++++++++                                     |  26%  |                                                          |++++++++++++++                                    |  27%  |                                                          |++++++++++++++                                    |  28%  |                                                          |++++++++++++++                                    |  29%  |                                                          |+++++++++++++++                                   |  30%  |                                                          |++++++++++++++++                                  |  31%  |                                                          |++++++++++++++++                                  |  32%  |                                                          |++++++++++++++++                                  |  33%  |                                                          |+++++++++++++++++                                 |  34%  |                                                          |++++++++++++++++++                                |  35%  |                                                          |++++++++++++++++++                                |  36%  |                                                          |++++++++++++++++++                                |  37%  |                                                          |+++++++++++++++++++                               |  38%  |                                                          |++++++++++++++++++++                              |  39%  |                                                          |++++++++++++++++++++                              |  40%  |                                                          |++++++++++++++++++++                              |  41%  |                                                          |+++++++++++++++++++++                             |  42%  |                                                          |++++++++++++++++++++++                            |  43%  |                                                          |++++++++++++++++++++++                            |  44%  |                                                          |++++++++++++++++++++++                            |  45%  |                                                          |+++++++++++++++++++++++                           |  46%  |                                                          |++++++++++++++++++++++++                          |  47%  |                                                          |++++++++++++++++++++++++                          |  48%  |                                                          |++++++++++++++++++++++++                          |  49%  |                                                          |+++++++++++++++++++++++++                         |  50%  |                                                          |++++++++++++++++++++++++++                        |  51%  |                                                          |++++++++++++++++++++++++++                        |  52%  |                                                          |++++++++++++++++++++++++++                        |  53%  |                                                          |+++++++++++++++++++++++++++                       |  54%  |                                                          |++++++++++++++++++++++++++++                      |  55%  |                                                          |++++++++++++++++++++++++++++                      |  56%  |                                                          |++++++++++++++++++++++++++++                      |  57%  |                                                          |+++++++++++++++++++++++++++++                     |  58%  |                                                          |++++++++++++++++++++++++++++++                    |  59%  |                                                          |++++++++++++++++++++++++++++++                    |  60%  |                                                          |++++++++++++++++++++++++++++++                    |  61%  |                                                          |+++++++++++++++++++++++++++++++                   |  62%  |                                                          |++++++++++++++++++++++++++++++++                  |  63%  |                                                          |++++++++++++++++++++++++++++++++                  |  64%  |                                                          |++++++++++++++++++++++++++++++++                  |  65%  |                                                          |+++++++++++++++++++++++++++++++++                 |  66%  |                                                          |++++++++++++++++++++++++++++++++++                |  67%  |                                                          |++++++++++++++++++++++++++++++++++                |  68%  |                                                          |++++++++++++++++++++++++++++++++++                |  69%  |                                                          |+++++++++++++++++++++++++++++++++++               |  70%  |                                                          |++++++++++++++++++++++++++++++++++++              |  71%  |                                                          |++++++++++++++++++++++++++++++++++++              |  72%  |                                                          |++++++++++++++++++++++++++++++++++++              |  73%  |                                                          |+++++++++++++++++++++++++++++++++++++             |  74%  |                                                          |++++++++++++++++++++++++++++++++++++++            |  75%  |                                                          |++++++++++++++++++++++++++++++++++++++            |  76%  |                                                          |++++++++++++++++++++++++++++++++++++++            |  77%  |                                                          |+++++++++++++++++++++++++++++++++++++++           |  78%  |                                                          |++++++++++++++++++++++++++++++++++++++++          |  79%  |                                                          |++++++++++++++++++++++++++++++++++++++++          |  80%  |                                                          |++++++++++++++++++++++++++++++++++++++++          |  81%  |                                                          |+++++++++++++++++++++++++++++++++++++++++         |  82%  |                                                          |++++++++++++++++++++++++++++++++++++++++++        |  83%  |                                                          |++++++++++++++++++++++++++++++++++++++++++        |  84%  |                                                          |++++++++++++++++++++++++++++++++++++++++++        |  85%  |                                                          |+++++++++++++++++++++++++++++++++++++++++++       |  86%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++      |  87%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++      |  88%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++      |  89%  |                                                          |+++++++++++++++++++++++++++++++++++++++++++++     |  90%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++    |  91%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++    |  92%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++    |  93%  |                                                          |+++++++++++++++++++++++++++++++++++++++++++++++   |  94%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++++  |  95%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++++  |  96%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++++  |  97%  |                                                          |+++++++++++++++++++++++++++++++++++++++++++++++++ |  98%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++++++|  99%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%

Se observa que todos los residuales del modelo están dentro de las bandas de confianza con un nivel de confianza del 95%. Ahora se realiza la prueba de normalidad Shapiro Wilk para los residuales.

shapiro.test(residuales)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuales
## W = 0.96049, p-value = 2.972e-07

Con un p-valor menor al nivel de significacia (0.05 < 2.972e-07) existe evidencia estadísticamente significativa para no rechazar la hipótesis nula del test y se concluye que los residuales del modelo no siguen una distribución normal.

Proyecto Modelos

Jenny Moceton

2025-04-07