Las enfermedades cardiovasculares afectan aproximadamente a 17 millones de personas en todo el mundo cada año, se manifiestan principalmente como infartos de miocardio e insuficiencias cardíacas (IC); la cual se produce cuando el corazón no puede bombear suficiente sangre para satisfacer las necesidades del cuerpo. Los síntomas más comunes de estas enfermedades son dolor o molestias en el centro del pecho; y/o en los brazos, el hombro izquierdo, los codos, la mandíbula o la espalda. Además, la persona puede experimentar dificultades para respirar o disnea, náuseas o vómitos, sensación de mareo o desmayo, sudores fríos y palidez.
Por lo que se analizará un conjunto de datos con información recopilada en el año 2015 de 299 pacientes con insuficiencia cardíaca, en donde se evaluan las variables edad, anemia, creatinina fosfoquinasa, diabetes,fracción de eyección, hipertensión arterial, nivel de las plaquetas, creatinina sérica, sodio sérico, sexo, tabaquismo, tiempo, fallecimiento.
Se va aplicar una distribución Binomial - Bernoulli con apoyo de clasificadores de aprendizaje automático para predecir la supervivencia de los pacientes y clasificar las características correspondientes a los factores de riesgo más importantes.
Desarrollar un modelo Binomial con enlance probit para predecir la probabilidad de muerte en pacientes con insuficiencia cardíaca. Se emplearán datos clínicos relevantes, como la edad, el sexo, la presión arterial, la fracción de eyección y otros factores de riesgo.
\[\left\{\begin{matrix} H_0: & \text{El efecto de las variables edad, fracción de eyección, creatinina sérica y tiempo del paciente, no incide en la probabilidad de muerte} \\ H_1: & \text{El efecto de las variables edad, fracción de eyección, creatinina sérica y tiempo del paciente, incide en la probabilidad de muerte} \end{matrix}\right.\]
Descripción de las variables
Sí
/
No
).Sí
/ No
).Sí
/ No
).Femenino
/
Masculino
).Sí
/ No
).Sí
/ No
).## # A tibble: 6 × 13
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75 0 582 0 20
## 2 55 0 7861 0 38
## 3 65 0 146 0 20
## 4 50 1 111 0 20
## 5 65 1 160 1 20
## 6 90 1 47 0 40
## # ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
## # serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
## # time <dbl>, DEATH_EVENT <dbl>
## [1] 299 13
## spc_tbl_ [299 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:299] 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : num [1:299] 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: num [1:299] 582 7861 146 111 160 ...
## $ diabetes : num [1:299] 0 0 0 0 1 0 0 1 0 0 ...
## $ ejection_fraction : num [1:299] 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : num [1:299] 1 0 0 0 0 1 0 0 0 1 ...
## $ platelets : num [1:299] 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num [1:299] 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : num [1:299] 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : num [1:299] 1 1 1 1 0 1 1 1 0 1 ...
## $ smoking : num [1:299] 0 0 1 0 0 1 0 1 0 1 ...
## $ time : num [1:299] 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : num [1:299] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. anaemia = col_double(),
## .. creatinine_phosphokinase = col_double(),
## .. diabetes = col_double(),
## .. ejection_fraction = col_double(),
## .. high_blood_pressure = col_double(),
## .. platelets = col_double(),
## .. serum_creatinine = col_double(),
## .. serum_sodium = col_double(),
## .. sex = col_double(),
## .. smoking = col_double(),
## .. time = col_double(),
## .. DEATH_EVENT = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## age anaemia creatinine_phosphokinase diabetes
## Min. :40.00 Min. :0.0000 Min. : 23.0 Min. :0.0000
## 1st Qu.:51.00 1st Qu.:0.0000 1st Qu.: 116.5 1st Qu.:0.0000
## Median :60.00 Median :0.0000 Median : 250.0 Median :0.0000
## Mean :60.83 Mean :0.4314 Mean : 581.8 Mean :0.4181
## 3rd Qu.:70.00 3rd Qu.:1.0000 3rd Qu.: 582.0 3rd Qu.:1.0000
## Max. :95.00 Max. :1.0000 Max. :7861.0 Max. :1.0000
## ejection_fraction high_blood_pressure platelets serum_creatinine
## Min. :14.00 Min. :0.0000 Min. : 25100 Min. :0.500
## 1st Qu.:30.00 1st Qu.:0.0000 1st Qu.:212500 1st Qu.:0.900
## Median :38.00 Median :0.0000 Median :262000 Median :1.100
## Mean :38.08 Mean :0.3512 Mean :263358 Mean :1.394
## 3rd Qu.:45.00 3rd Qu.:1.0000 3rd Qu.:303500 3rd Qu.:1.400
## Max. :80.00 Max. :1.0000 Max. :850000 Max. :9.400
## serum_sodium sex smoking time
## Min. :113.0 Min. :0.0000 Min. :0.0000 Min. : 4.0
## 1st Qu.:134.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 73.0
## Median :137.0 Median :1.0000 Median :0.0000 Median :115.0
## Mean :136.6 Mean :0.6488 Mean :0.3211 Mean :130.3
## 3rd Qu.:140.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:203.0
## Max. :148.0 Max. :1.0000 Max. :1.0000 Max. :285.0
## DEATH_EVENT
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3211
## 3rd Qu.:1.0000
## Max. :1.0000
## age anaemia creatinine_phosphokinase
## 0 0 0
## diabetes ejection_fraction high_blood_pressure
## 0 0 0
## platelets serum_creatinine serum_sodium
## 0 0 0
## sex smoking time
## 0 0 0
## DEATH_EVENT
## 0
## age creatinine_phosphokinase ejection_fraction
## age 1.00000000 -0.081583900 0.06009836
## creatinine_phosphokinase -0.08158390 1.000000000 -0.04407955
## ejection_fraction 0.06009836 -0.044079554 1.00000000
## platelets -0.05235437 0.024463389 0.07217747
## serum_creatinine 0.15918713 -0.016408480 -0.01130247
## serum_sodium -0.04596584 0.059550156 0.17590228
## time -0.22406842 -0.009345653 0.04172924
## platelets serum_creatinine serum_sodium time
## age -0.05235437 0.15918713 -0.04596584 -0.224068420
## creatinine_phosphokinase 0.02446339 -0.01640848 0.05955016 -0.009345653
## ejection_fraction 0.07217747 -0.01130247 0.17590228 0.041729235
## platelets 1.00000000 -0.04119808 0.06212462 0.010513909
## serum_creatinine -0.04119808 1.00000000 -0.18909521 -0.149315418
## serum_sodium 0.06212462 -0.18909521 1.00000000 0.087640000
## time 0.01051391 -0.14931542 0.08764000 1.000000000
La mayoría de las varibales tienen correlación baja, las cuales se
encuentran en color blanco.
Ejection_fraction y platelets no parecen estar correlacionadas.
serum_creatinine y time tienen una correlación negativa (ligeramente rojiza), esto sugiere que a mayor tiempo (time), menor es el nivel de serum_creatinine.
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## ejection_fraction + high_blood_pressure + platelets + serum_creatinine +
## serum_sodium + sex + smoking + time, family = binomial(link = "logit"),
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.057e+01 5.584e+00 1.893 0.058423 .
## age 4.669e-02 1.567e-02 2.979 0.002891 **
## anaemia1 -9.408e-03 3.605e-01 -0.026 0.979178
## creatinine_phosphokinase 2.219e-04 1.778e-04 1.248 0.211994
## ejection_fraction -7.636e-02 1.628e-02 -4.690 2.73e-06 ***
## high_blood_pressure1 -1.044e-01 3.582e-01 -0.291 0.770755
## platelets -1.152e-06 1.894e-06 -0.608 0.542901
## serum_creatinine 6.592e-01 1.797e-01 3.668 0.000244 ***
## serum_sodium -6.897e-02 3.948e-02 -1.747 0.080612 .
## sex1 -5.460e-01 4.126e-01 -1.323 0.185759
## smoking1 -1.926e-02 4.120e-01 -0.047 0.962706
## time -2.109e-02 3.019e-03 -6.986 2.83e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 219.72 on 287 degrees of freedom
## AIC: 243.72
##
## Number of Fisher Scoring iterations: 6
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## ejection_fraction + high_blood_pressure + platelets + serum_creatinine +
## serum_sodium + sex + smoking + time, family = binomial(link = "probit"),
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.241e+00 3.191e+00 1.956 0.05047 .
## age 2.714e-02 8.732e-03 3.108 0.00188 **
## anaemia1 1.720e-03 2.041e-01 0.008 0.99328
## creatinine_phosphokinase 1.336e-04 9.946e-05 1.343 0.17922
## ejection_fraction -4.489e-02 9.069e-03 -4.949 7.45e-07 ***
## high_blood_pressure1 -8.815e-02 2.056e-01 -0.429 0.66815
## platelets -7.614e-07 1.071e-06 -0.711 0.47717
## serum_creatinine 3.678e-01 1.025e-01 3.590 0.00033 ***
## serum_sodium -4.039e-02 2.269e-02 -1.780 0.07509 .
## sex1 -3.169e-01 2.338e-01 -1.356 0.17524
## smoking1 -3.220e-02 2.340e-01 -0.138 0.89055
## time -1.204e-02 1.601e-03 -7.521 5.43e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 219.13 on 287 degrees of freedom
## AIC: 243.13
##
## Number of Fisher Scoring iterations: 6
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## ejection_fraction + high_blood_pressure + platelets + serum_creatinine +
## serum_sodium + sex + smoking + time, family = binomial(link = "cloglog"),
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.697e+00 3.724e+00 2.067 0.038778 *
## age 3.363e-02 1.044e-02 3.222 0.001272 **
## anaemia1 -1.224e-01 2.564e-01 -0.477 0.633183
## creatinine_phosphokinase 1.193e-04 1.004e-04 1.189 0.234612
## ejection_fraction -5.128e-02 1.127e-02 -4.549 5.38e-06 ***
## high_blood_pressure1 -5.637e-03 2.553e-01 -0.022 0.982382
## platelets -1.130e-06 1.347e-06 -0.839 0.401478
## serum_creatinine 3.911e-01 1.177e-01 3.324 0.000887 ***
## serum_sodium -5.276e-02 2.691e-02 -1.960 0.049946 *
## sex1 -4.170e-01 2.852e-01 -1.462 0.143647
## smoking1 7.006e-02 2.884e-01 0.243 0.808058
## time -1.611e-02 2.149e-03 -7.499 6.43e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 220.78 on 287 degrees of freedom
## AIC: 244.78
##
## Number of Fisher Scoring iterations: 7
## Modelo AIC Deviance
## Logit Logit 243.7250 219.7250
## Probit Probit 243.1344 219.1344
## Cloglog Cloglog 244.7791 220.7791
Al revisar el AIC y el Deviance, se encuentra que el mejor modelo es Probit porque para ambos casos el AIC y el Deviance son los más bajos en comparación del Logit y Cloglog.
anova2(modelo_probit)
##
## Wald test
##
## Model 1 : DEATH_EVENT ~ 1
## Model 2 : DEATH_EVENT ~ age
## Model 3 : DEATH_EVENT ~ age + anaemia
## Model 4 : DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase
## Model 5 : DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction
## Model 6 : DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure
## Model 7 : DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets
## Model 8 : DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets + serum_creatinine
## Model 9 : DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium
## Model 10 : DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex
## Model 11 : DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking
## Model 12 : DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time
##
## Chi df Pr(Chisq>)
## 1 vs 2 18.40490 1 1.786e-05 ***
## 2 vs 3 0.66600 1 0.41445
## 3 vs 4 2.77338 1 0.09584 .
## 4 vs 5 22.82993 1 1.770e-06 ***
## 5 vs 6 1.36318 1 0.24299
## 6 vs 7 0.37613 1 0.53968
## 7 vs 8 17.05559 1 3.630e-05 ***
## 8 vs 9 3.10096 1 0.07825 .
## 9 vs 10 1.33593 1 0.24775
## 10 vs 11 0.10206 1 0.74937
## 11 vs 12 56.56745 1 5.429e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Como la variable ‘smoking’ no tiene una significancia alta en el modelo, se realiza uno nuevo quitando la variable.
summary(modelo_probit)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## ejection_fraction + high_blood_pressure + platelets + serum_creatinine +
## serum_sodium + sex + smoking + time, family = binomial(link = "probit"),
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.241e+00 3.191e+00 1.956 0.05047 .
## age 2.714e-02 8.732e-03 3.108 0.00188 **
## anaemia1 1.720e-03 2.041e-01 0.008 0.99328
## creatinine_phosphokinase 1.336e-04 9.946e-05 1.343 0.17922
## ejection_fraction -4.489e-02 9.069e-03 -4.949 7.45e-07 ***
## high_blood_pressure1 -8.815e-02 2.056e-01 -0.429 0.66815
## platelets -7.614e-07 1.071e-06 -0.711 0.47717
## serum_creatinine 3.678e-01 1.025e-01 3.590 0.00033 ***
## serum_sodium -4.039e-02 2.269e-02 -1.780 0.07509 .
## sex1 -3.169e-01 2.338e-01 -1.356 0.17524
## smoking1 -3.220e-02 2.340e-01 -0.138 0.89055
## time -1.204e-02 1.601e-03 -7.521 5.43e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 219.13 on 287 degrees of freedom
## AIC: 243.13
##
## Number of Fisher Scoring iterations: 6
model1<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+ high_blood_pressure+ platelets+serum_creatinine+serum_sodium+sex+time, data = df, family = binomial(link = "probit"))
summary(model1)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## ejection_fraction + high_blood_pressure + platelets + serum_creatinine +
## serum_sodium + sex + time, family = binomial(link = "probit"),
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.235e+00 3.191e+00 1.954 0.050700 .
## age 2.711e-02 8.731e-03 3.105 0.001903 **
## anaemia1 5.033e-03 2.025e-01 0.025 0.980177
## creatinine_phosphokinase 1.344e-04 9.923e-05 1.355 0.175519
## ejection_fraction -4.487e-02 9.065e-03 -4.949 7.44e-07 ***
## high_blood_pressure1 -8.730e-02 2.056e-01 -0.425 0.671071
## platelets -7.723e-07 1.066e-06 -0.724 0.468832
## serum_creatinine 3.691e-01 1.021e-01 3.615 0.000301 ***
## serum_sodium -4.037e-02 2.270e-02 -1.779 0.075250 .
## sex1 -3.297e-01 2.138e-01 -1.542 0.123085
## time -1.203e-02 1.599e-03 -7.524 5.30e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 219.15 on 288 degrees of freedom
## AIC: 241.15
##
## Number of Fisher Scoring iterations: 6
anova(model1,modelo_probit, test="LRT")
## Analysis of Deviance Table
##
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + platelets + serum_creatinine + serum_sodium +
## sex + time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + platelets + serum_creatinine + serum_sodium +
## sex + smoking + time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 288 219.15
## 2 287 219.13 1 0.018693 0.8913
Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.8913), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘smoking’.
model2<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+ high_blood_pressure+ platelets+serum_creatinine+serum_sodium+time, data = df, family = binomial(link = "probit"))
summary(model2)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## ejection_fraction + high_blood_pressure + platelets + serum_creatinine +
## serum_sodium + time, family = binomial(link = "probit"),
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.745e+00 3.131e+00 1.835 0.06655 .
## age 2.524e-02 8.583e-03 2.941 0.00327 **
## anaemia1 4.378e-02 1.993e-01 0.220 0.82608
## creatinine_phosphokinase 1.157e-04 9.718e-05 1.191 0.23369
## ejection_fraction -4.225e-02 8.825e-03 -4.787 1.69e-06 ***
## high_blood_pressure1 -3.426e-02 2.016e-01 -0.170 0.86503
## platelets -6.762e-07 1.045e-06 -0.647 0.51775
## serum_creatinine 3.771e-01 1.014e-01 3.719 0.00020 ***
## serum_sodium -3.873e-02 2.243e-02 -1.727 0.08420 .
## time -1.190e-02 1.586e-03 -7.502 6.29e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 221.54 on 289 degrees of freedom
## AIC: 241.54
##
## Number of Fisher Scoring iterations: 6
anova(model2,model1, test="LRT")
## Analysis of Deviance Table
##
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + platelets + serum_creatinine + serum_sodium +
## time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + platelets + serum_creatinine + serum_sodium +
## sex + time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 289 221.54
## 2 288 219.15 1 2.3847 0.1225
Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.1225), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘sex’.
model3<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+ high_blood_pressure+ platelets+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model3)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## ejection_fraction + high_blood_pressure + platelets + serum_creatinine +
## time, family = binomial(link = "probit"), data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.882e-01 6.877e-01 0.710 0.47773
## age 2.549e-02 8.467e-03 3.011 0.00261 **
## anaemia1 1.452e-02 1.968e-01 0.074 0.94119
## creatinine_phosphokinase 1.003e-04 9.471e-05 1.059 0.28948
## ejection_fraction -4.338e-02 8.710e-03 -4.980 6.35e-07 ***
## high_blood_pressure1 -3.140e-02 1.999e-01 -0.157 0.87517
## platelets -8.141e-07 1.021e-06 -0.797 0.42538
## serum_creatinine 4.033e-01 1.014e-01 3.976 7.00e-05 ***
## time -1.173e-02 1.555e-03 -7.542 4.64e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 224.56 on 290 degrees of freedom
## AIC: 242.56
##
## Number of Fisher Scoring iterations: 6
anova(model3,model2, test="LRT")
## Analysis of Deviance Table
##
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + platelets + serum_creatinine + time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + platelets + serum_creatinine + serum_sodium +
## time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 290 224.56
## 2 289 221.54 1 3.0191 0.08229 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.08229), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘serum_sodium’.
model4<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+ high_blood_pressure+ platelets+time, data = df, family = binomial(link = "probit"))
summary(model4)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## ejection_fraction + high_blood_pressure + platelets + time,
## family = binomial(link = "probit"), data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.292e-01 6.558e-01 1.264 0.206081
## age 2.699e-02 8.164e-03 3.306 0.000945 ***
## anaemia1 5.082e-02 1.873e-01 0.271 0.786154
## creatinine_phosphokinase 1.067e-04 9.318e-05 1.146 0.251991
## ejection_fraction -3.839e-02 8.223e-03 -4.669 3.03e-06 ***
## high_blood_pressure1 -5.455e-02 1.923e-01 -0.284 0.776660
## platelets -1.102e-06 9.960e-07 -1.106 0.268553
## time -1.136e-02 1.455e-03 -7.805 5.97e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 244.35 on 291 degrees of freedom
## AIC: 260.35
##
## Number of Fisher Scoring iterations: 6
anova(model4,model3, test="LRT")
## Analysis of Deviance Table
##
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + platelets + time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + platelets + serum_creatinine + time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 291 244.35
## 2 290 224.56 1 19.796 8.617e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Del anova se obtiene un p-valor menor al nivel de significancia (0.05 > 8.617e-06), por lo tanto existe evidencia estadísticamente significativa para norechazar la hipótesis nula y se concluye que hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo con la variable ‘serum_creatinine’.
model5<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+ high_blood_pressure+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model5)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## ejection_fraction + high_blood_pressure + serum_creatinine +
## time, family = binomial(link = "probit"), data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.619e-01 6.215e-01 0.421 0.67346
## age 2.553e-02 8.426e-03 3.030 0.00245 **
## anaemia1 1.713e-02 1.965e-01 0.087 0.93053
## creatinine_phosphokinase 9.947e-05 9.499e-05 1.047 0.29505
## ejection_fraction -4.349e-02 8.686e-03 -5.007 5.52e-07 ***
## high_blood_pressure1 -3.776e-02 1.992e-01 -0.190 0.84968
## serum_creatinine 4.075e-01 1.012e-01 4.025 5.69e-05 ***
## time -1.161e-02 1.542e-03 -7.532 5.01e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 225.19 on 291 degrees of freedom
## AIC: 241.19
##
## Number of Fisher Scoring iterations: 6
anova(model5,model3, test="LRT")
## Analysis of Deviance Table
##
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + serum_creatinine + time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + platelets + serum_creatinine + time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 291 225.19
## 2 290 224.56 1 0.63358 0.426
Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.426), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘platelets’.
model6<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model6)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## ejection_fraction + serum_creatinine + time, family = binomial(link = "probit"),
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.417e-01 6.148e-01 0.393 0.69424
## age 2.551e-02 8.425e-03 3.028 0.00246 **
## anaemia1 1.648e-02 1.964e-01 0.084 0.93312
## creatinine_phosphokinase 1.013e-04 9.484e-05 1.068 0.28558
## ejection_fraction -4.346e-02 8.682e-03 -5.006 5.55e-07 ***
## serum_creatinine 4.084e-01 1.010e-01 4.042 5.30e-05 ***
## time -1.157e-02 1.525e-03 -7.585 3.31e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 225.23 on 292 degrees of freedom
## AIC: 239.23
##
## Number of Fisher Scoring iterations: 6
anova(model6,model5, test="LRT")
## Analysis of Deviance Table
##
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## serum_creatinine + time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## high_blood_pressure + serum_creatinine + time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 292 225.22
## 2 291 225.19 1 0.034981 0.8516
Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.8516), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘high_blood_pressure’.
model7<- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model7)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase +
## serum_creatinine + time, family = binomial(link = "probit"),
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.674e-01 5.420e-01 -1.785 0.074263 .
## age 1.808e-02 7.805e-03 2.316 0.020532 *
## anaemia1 -1.507e-02 1.845e-01 -0.082 0.934867
## creatinine_phosphokinase 1.316e-04 9.396e-05 1.401 0.161302
## serum_creatinine 4.060e-01 1.164e-01 3.487 0.000489 ***
## time -1.088e-02 1.418e-03 -7.673 1.68e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 254.19 on 293 degrees of freedom
## AIC: 266.19
##
## Number of Fisher Scoring iterations: 6
anova(model7,model6, test="LRT")
## Analysis of Deviance Table
##
## Model 1: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + serum_creatinine +
## time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## serum_creatinine + time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 293 254.19
## 2 292 225.22 1 28.96 7.388e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Del anova se obtiene un p-valor menor al nivel de significancia (0.05 > 7.388e-08), por lo tanto existe evidencia estadísticamente significativa para no rechazar la hipótesis nula y se concluye que hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo con la variable ‘ejection_fraction’.
model8<- glm(DEATH_EVENT ~ age + anaemia +ejection_fraction+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model8)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + anaemia + ejection_fraction +
## serum_creatinine + time, family = binomial(link = "probit"),
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.400859 0.597743 0.671 0.50246
## age 0.024703 0.008343 2.961 0.00307 **
## anaemia1 -0.027929 0.192753 -0.145 0.88479
## ejection_fraction -0.043914 0.008662 -5.070 3.98e-07 ***
## serum_creatinine 0.407060 0.099471 4.092 4.27e-05 ***
## time -0.011650 0.001524 -7.644 2.10e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 226.38 on 293 degrees of freedom
## AIC: 238.38
##
## Number of Fisher Scoring iterations: 6
anova(model8,model6, test="LRT")
## Analysis of Deviance Table
##
## Model 1: DEATH_EVENT ~ age + anaemia + ejection_fraction + serum_creatinine +
## time
## Model 2: DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + ejection_fraction +
## serum_creatinine + time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 293 226.38
## 2 292 225.22 1 1.1559 0.2823
Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.2823), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘creatinine_phosphokinase’.
model9<- glm(DEATH_EVENT ~ age +ejection_fraction+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model9)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine +
## time, family = binomial(link = "probit"), data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.389955 0.593672 0.657 0.5113
## age 0.024670 0.008341 2.958 0.0031 **
## ejection_fraction -0.043956 0.008651 -5.081 3.75e-07 ***
## serum_creatinine 0.406534 0.099421 4.089 4.33e-05 ***
## time -0.011625 0.001514 -7.676 1.64e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 226.40 on 294 degrees of freedom
## AIC: 236.4
##
## Number of Fisher Scoring iterations: 6
anova(model9,model8, test="LRT")
## Analysis of Deviance Table
##
## Model 1: DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + time
## Model 2: DEATH_EVENT ~ age + anaemia + ejection_fraction + serum_creatinine +
## time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 294 226.40
## 2 293 226.38 1 0.02104 0.8847
Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.8847), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo sin la variable ‘anaemia’.
model10<- glm(DEATH_EVENT ~ ejection_fraction+serum_creatinine+time, data = df, family = binomial(link = "probit"))
summary(model10)
##
## Call:
## glm(formula = DEATH_EVENT ~ ejection_fraction + serum_creatinine +
## time, family = binomial(link = "probit"), data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.777798 0.382242 4.651 3.30e-06 ***
## ejection_fraction -0.039786 0.008296 -4.796 1.62e-06 ***
## serum_creatinine 0.425490 0.101056 4.210 2.55e-05 ***
## time -0.012024 0.001491 -8.065 7.33e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 235.25 on 295 degrees of freedom
## AIC: 243.25
##
## Number of Fisher Scoring iterations: 6
anova(model10,model9, test="LRT")
## Analysis of Deviance Table
##
## Model 1: DEATH_EVENT ~ ejection_fraction + serum_creatinine + time
## Model 2: DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 295 235.25
## 2 294 226.40 1 8.849 0.002932 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Del anova se obtiene un p-valor menor al nivel de significancia (0.05 > 0.002932), por lo tanto existe evidencia estadísticamente significativa para no rechazar la hipótesis nula y se concluye que hay diferencias significativas en la calidad del ajuste entre los dos modelos. Entonces se elige el modelo con la variable ‘age’.
modelo_seleccionado<-model9
summary(modelo_seleccionado)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine +
## time, family = binomial(link = "probit"), data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.389955 0.593672 0.657 0.5113
## age 0.024670 0.008341 2.958 0.0031 **
## ejection_fraction -0.043956 0.008651 -5.081 3.75e-07 ***
## serum_creatinine 0.406534 0.099421 4.089 4.33e-05 ***
## time -0.011625 0.001514 -7.676 1.64e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 226.40 on 294 degrees of freedom
## AIC: 236.4
##
## Number of Fisher Scoring iterations: 6
Del anova se obtiene un p-valor mayor al nivel de significancia (0.05 < 0.1004), por lo tanto no existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que no hay diferencias significativas en la calidad del ajuste entre los dos modelos.
Ahora, se comparan los AIC, BIC y el deviance de los modelo elegidos (modelo_probit y modelo_seleccionado)
## Modelo AIC BIC Deviance
## Probit Probit 243.1344 287.5397 219.1344
## modelo_seleccionado modelo_seleccionado 236.4023 254.9046 226.4023
Respeto a mi primer modelo, tenemos que el AIC fue de 243.1344, y el deviance de 219.1344, al eliminar las variables menos signicativas obtenemos un nuevo modelo, el cual presenta un mejor AIC de 236.4023 y el deviance de 226.4023. A pesar de que el deviance aumentó, el AIC disminuyo en el modelo.
Sea Y la variable binaria que indica el estado del paciente:
\[ Y_i = \begin{cases} 1, & \text{si el paciente falleció} \\ 0, & \text{si el paciente sobrevivió} \end{cases} \]
Realizo una anova al modelo seleccionado y obtengo el Deviance para la dosis:
anova(modelo_seleccionado)
## Analysis of Deviance Table
##
## Model: binomial, link: probit
##
## Response: DEATH_EVENT
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 298 375.35
## age 1 19.146 297 356.20 1.211e-05 ***
## ejection_fraction 1 26.946 296 329.26 2.092e-07 ***
## serum_creatinine 1 22.720 295 306.54 1.874e-06 ***
## time 1 80.134 294 226.40 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Con el Deviance obtenido para la variable ‘edad’, se calcula el p-valor para las hipótesis
pchisq(19.146,1,lower.tail = F)
## [1] 1.21091e-05
Dado que el p-valor calculado es menor al nivel de significancia (1.21091\(e^{-05}\)< 0.05), existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que la edad del paciente, incide en la probabilidad de muerte.
Ahora para la variable ‘fracción de eyección’, con el Deviance obtenido, se calcula el p-valor para las hipótesis
pchisq(26.946,1,lower.tail = F)
## [1] 2.092197e-07
Dado que el p-valor calculado es menor al nivel de significancia (2.092197\(e^{07}\) < 0.05), existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que la unidad de fracción de eyección, incide en la probabilidad de muerte.
Ahora para la variable ‘creatinina sérica’, con el Deviance obtenido, se calcula el p-valor para las hipótesis
pchisq(22.720,1,lower.tail = F)
## [1] 1.874071e-06
Dado que el p-valor calculado es menor al nivel de significancia (2.092197\(e^{07}\) < 0.05), existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que la unidad de creatinina sérica, incide en la probabilidad de muerte.
Ahora para la variable ‘tiempo’, con el Deviance obtenido, se calcula el p-valor para las hipótesis
pchisq(80.134,1,lower.tail = F)
## [1] 3.498602e-19
Dado que el p-valor calculado es menor al nivel de significancia (3.498602e\(e^{19}\) < 0.05), existe evidencia estadísticamente significativa para rechazar la hipótesis nula y se concluye que la unidad de creatinina sérica, incide en la probabilidad de muerte.
Varianza de los residuales bajo la log-verosimilitud:
source("/cloud/project/macros.txt")
residuales <- residuals_glm(modelo_seleccionado,col="red")
Revisando el gráfico, los residuales parecen no estar dispersos de manera aleatoria, se procede entonces a realizar el test para homocedasticidad de Breusch Pagan.
bptest(modelo_seleccionado)
##
## studentized Breusch-Pagan test
##
## data: modelo_seleccionado
## BP = 24.733, df = 4, p-value = 5.692e-05
Con un p-valor menor al nivel de significacia (5.692e\(e^{05}\) < 0.05) existe evidencia estadísticamente significativa para rechazar la hipótesis nula del test y se concluye que no existe homocedasticidad (varianza constante) en los residuales del modelo.
Distribución Normal de los residuales:
envelope_glm(modelo_seleccionado, rep=100, conf=0.95,col="yellow")
## | | | 0% | | | 1% | |+ | 2% | |++ | 3% | |++ | 4% | |++ | 5% | |+++ | 6% | |++++ | 7% | |++++ | 8% | |++++ | 9% | |+++++ | 10% | |++++++ | 11% | |++++++ | 12% | |++++++ | 13% | |+++++++ | 14% | |++++++++ | 15% | |++++++++ | 16% | |++++++++ | 17% | |+++++++++ | 18% | |++++++++++ | 19% | |++++++++++ | 20% | |++++++++++ | 21% | |+++++++++++ | 22% | |++++++++++++ | 23% | |++++++++++++ | 24% | |++++++++++++ | 25% | |+++++++++++++ | 26% | |++++++++++++++ | 27% | |++++++++++++++ | 28% | |++++++++++++++ | 29% | |+++++++++++++++ | 30% | |++++++++++++++++ | 31% | |++++++++++++++++ | 32% | |++++++++++++++++ | 33% | |+++++++++++++++++ | 34% | |++++++++++++++++++ | 35% | |++++++++++++++++++ | 36% | |++++++++++++++++++ | 37% | |+++++++++++++++++++ | 38% | |++++++++++++++++++++ | 39% | |++++++++++++++++++++ | 40% | |++++++++++++++++++++ | 41% | |+++++++++++++++++++++ | 42% | |++++++++++++++++++++++ | 43% | |++++++++++++++++++++++ | 44% | |++++++++++++++++++++++ | 45% | |+++++++++++++++++++++++ | 46% | |++++++++++++++++++++++++ | 47% | |++++++++++++++++++++++++ | 48% | |++++++++++++++++++++++++ | 49% | |+++++++++++++++++++++++++ | 50% | |++++++++++++++++++++++++++ | 51% | |++++++++++++++++++++++++++ | 52% | |++++++++++++++++++++++++++ | 53% | |+++++++++++++++++++++++++++ | 54% | |++++++++++++++++++++++++++++ | 55% | |++++++++++++++++++++++++++++ | 56% | |++++++++++++++++++++++++++++ | 57% | |+++++++++++++++++++++++++++++ | 58% | |++++++++++++++++++++++++++++++ | 59% | |++++++++++++++++++++++++++++++ | 60% | |++++++++++++++++++++++++++++++ | 61% | |+++++++++++++++++++++++++++++++ | 62% | |++++++++++++++++++++++++++++++++ | 63% | |++++++++++++++++++++++++++++++++ | 64% | |++++++++++++++++++++++++++++++++ | 65% | |+++++++++++++++++++++++++++++++++ | 66% | |++++++++++++++++++++++++++++++++++ | 67% | |++++++++++++++++++++++++++++++++++ | 68% | |++++++++++++++++++++++++++++++++++ | 69% | |+++++++++++++++++++++++++++++++++++ | 70% | |++++++++++++++++++++++++++++++++++++ | 71% | |++++++++++++++++++++++++++++++++++++ | 72% | |++++++++++++++++++++++++++++++++++++ | 73% | |+++++++++++++++++++++++++++++++++++++ | 74% | |++++++++++++++++++++++++++++++++++++++ | 75% | |++++++++++++++++++++++++++++++++++++++ | 76% | |++++++++++++++++++++++++++++++++++++++ | 77% | |+++++++++++++++++++++++++++++++++++++++ | 78% | |++++++++++++++++++++++++++++++++++++++++ | 79% | |++++++++++++++++++++++++++++++++++++++++ | 80% | |++++++++++++++++++++++++++++++++++++++++ | 81% | |+++++++++++++++++++++++++++++++++++++++++ | 82% | |++++++++++++++++++++++++++++++++++++++++++ | 83% | |++++++++++++++++++++++++++++++++++++++++++ | 84% | |++++++++++++++++++++++++++++++++++++++++++ | 85% | |+++++++++++++++++++++++++++++++++++++++++++ | 86% | |++++++++++++++++++++++++++++++++++++++++++++ | 87% | |++++++++++++++++++++++++++++++++++++++++++++ | 88% | |++++++++++++++++++++++++++++++++++++++++++++ | 89% | |+++++++++++++++++++++++++++++++++++++++++++++ | 90% | |++++++++++++++++++++++++++++++++++++++++++++++ | 91% | |++++++++++++++++++++++++++++++++++++++++++++++ | 92% | |++++++++++++++++++++++++++++++++++++++++++++++ | 93% | |+++++++++++++++++++++++++++++++++++++++++++++++ | 94% | |++++++++++++++++++++++++++++++++++++++++++++++++ | 95% | |++++++++++++++++++++++++++++++++++++++++++++++++ | 96% | |++++++++++++++++++++++++++++++++++++++++++++++++ | 97% | |+++++++++++++++++++++++++++++++++++++++++++++++++ | 98% | |++++++++++++++++++++++++++++++++++++++++++++++++++| 99% | |++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Se observa que todos los residuales del modelo están dentro de las bandas de confianza con un nivel de confianza del 95%. Ahora se realiza la prueba de normalidad Shapiro Wilk para los residuales.
shapiro.test(residuales)
##
## Shapiro-Wilk normality test
##
## data: residuales
## W = 0.96049, p-value = 2.972e-07
Con un p-valor menor al nivel de significacia (0.05 < 2.972e-07) existe evidencia estadísticamente significativa para no rechazar la hipótesis nula del test y se concluye que los residuales del modelo no siguen una distribución normal.