465 recien nacidos:
pesonac=Peso al Nacer (en gramos)
sexo = Sexo del Recién Nacido (1. Masculino 2.Femenino)
egest=Edad Gestacional (en semanas)
tiparto=Tipo de Parto (1. Eutócico 2. Vaginal Instrumentado 3. Cesárea Programada 4. Cesárea No Programada)
emadre=Edad de la Madre (en años)
infección=Presentó Infección Nosocomial (1. Si 0.No)
estadia=Días de Hospitalización
cateter=Tuvo catéter central (1. Si 0.No)
library(readr)
base <- read_csv("base taller.csv")
names(base)
[1] "pesonac" "sexo" "egest" "tiparto" "emadre"
[6] "infeccion" "estadia" "cateter" "egestcat" "id"
[11] "pesonac2"
base <- base[,-c(9,11)]
base$sexo <- factor(base$sexo, 1:2, c("Masculino","Femenino"))
base$infeccion <- factor(base$infeccion, 1:0, c("Sí","No"))
base$cateter <- factor(base$cateter, 1:0, c("Sí","No"))
base$tiparto <- factor(base$tiparto, 1:4, c("Eutócico","Vaginal Instrumentado",
"Cesárea Programada", "Cesárea No Programada"))
library(pander)
pander(head(base))
| pesonac | sexo | egest | tiparto | emadre | infeccion | estadia | cateter | id |
|---|---|---|---|---|---|---|---|---|
| 2600 | Femenino | 37 | Eutócico | 21 | No | 9 | No | 79 |
| 2200 | Masculino | 37 | Eutócico | 20 | No | 6 | No | 43 |
| 1000 | Femenino | 27 | Eutócico | 25 | Sí | 21 | No | 143 |
| 2700 | Femenino | 40 | Eutócico | 16 | No | 4 | No | 55 |
| 3600 | Masculino | 40 | Eutócico | 20 | Sí | 58 | No | 28 |
| 2840 | Femenino | 40 | Eutócico | 32 | No | 4 | No | 46 |
summary(base)
pesonac sexo egest tiparto
Min. : 457 Masculino:234 Min. :23.0 Eutócico :164
1st Qu.:1680 Femenino :231 1st Qu.:33.0 Vaginal Instrumentado: 21
Median :2250 Median :35.0 Cesárea Programada :102
Mean :2296 Mean :35.2 Cesárea No Programada:178
3rd Qu.:2965 3rd Qu.:38.0
Max. :4440 Max. :41.0
NA's :1
emadre infeccion estadia cateter id
Min. :15.00 Sí: 78 Min. : 0.00 Sí: 26 Min. : 1
1st Qu.:22.00 No:387 1st Qu.: 3.00 No:439 1st Qu.:117
Median :28.00 Median : 7.00 Median :233
Mean :27.63 Mean :12.48 Mean :233
3rd Qu.:32.00 3rd Qu.:17.00 3rd Qu.:349
Max. :46.00 Max. :94.00 Max. :465
NA's :5
library(GGally)
ggpairs(base[-ncol(base)])
Ajustar un modelo de regresión lineal \(E(pesonac)=\beta_0 + \beta_1 X_1 + ... + + \beta_k X_k\) nos permitirá predecir el valor esperado del peso al nacer.
lm1 <- lm(data = base[-ncol(base)], formula = pesonac ~ .)
# Modelo ajustado
# as.formula(
# paste0("y ~ ", round(coefficients(lm1)[1],2), "",
# paste(sprintf(" %+.2f*%s ",
# coefficients(lm1)[-1],
# names(coefficients(lm1)[-1])),
# collapse="")
# )
# )
# Resumen del modelo
summary(lm1)
Call:
lm(formula = pesonac ~ ., data = base[-ncol(base)])
Residuals:
Min 1Q Median 3Q Max
-1238.91 -237.08 7.89 241.16 1272.53
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4075.1210 229.6445 -17.745 < 2e-16 ***
sexoFemenino 62.8074 37.5961 1.671 0.095500 .
egest 187.9200 5.8404 32.176 < 2e-16 ***
tipartoVaginal Instrumentado 88.5382 93.6894 0.945 0.345157
tipartoCesárea Programada -32.7893 51.7893 -0.633 0.526971
tipartoCesárea No Programada -17.7036 45.0218 -0.393 0.694341
emadre 0.2837 2.8483 0.100 0.920702
infeccionNo -43.2111 68.6035 -0.630 0.529101
estadia -5.9191 1.7169 -3.448 0.000619 ***
cateterNo -172.9632 96.3394 -1.795 0.073268 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 399.9 on 450 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.779, Adjusted R-squared: 0.7746
F-statistic: 176.2 on 9 and 450 DF, p-value: < 2.2e-16
library(car)
Anova(lm1, type=3)
Anova Table (Type III tests)
Response: pesonac
Sum Sq Df F value Pr(>F)
(Intercept) 50365341 1 314.8972 < 2.2e-16 ***
sexo 446373 1 2.7908 0.0954999 .
egest 165586594 1 1035.2905 < 2.2e-16 ***
tiparto 273103 3 0.5692 0.6355630
emadre 1587 1 0.0099 0.9207016
infeccion 63454 1 0.3967 0.5291010
estadia 1900981 1 11.8854 0.0006189 ***
cateter 515540 1 3.2233 0.0732683 .
Residuals 71973970 450
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Utilizaremos la estrategia backwards, eliminando en cada paso la variables menos significativas. Para esto observamos el p-valor más alto en las F parciales de cada variable. El nivel de significancia será .05.
Buscaremos un modelo reducido donde el R² no decresca significativamente y la F mútiple del modelo reducido vs el modelo completo nos indeique que posemos quedarnos con el modelo reducido.
En la tabla anova vemos que la edad de la madre resultó la variable menos significativa. Procedemos a eliminarla.
lm2 <- update(lm1, . ~ . - emadre)
summary(lm2)
Call:
lm(formula = pesonac ~ sexo + egest + tiparto + infeccion + estadia +
cateter, data = base[-ncol(base)])
Residuals:
Min 1Q Median 3Q Max
-1230.61 -234.02 7.07 236.19 1270.22
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4058.938 221.368 -18.336 < 2e-16 ***
sexoFemenino 60.445 37.258 1.622 0.105426
egest 187.638 5.768 32.529 < 2e-16 ***
tipartoVaginal Instrumentado 90.863 93.174 0.975 0.329983
tipartoCesárea Programada -30.213 50.733 -0.596 0.551790
tipartoCesárea No Programada -13.628 43.987 -0.310 0.756841
infeccionNo -43.294 68.223 -0.635 0.526010
estadia -5.955 1.703 -3.496 0.000519 ***
cateterNo -171.963 95.897 -1.793 0.073604 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 398.2 on 455 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.7799, Adjusted R-squared: 0.776
F-statistic: 201.5 on 8 and 455 DF, p-value: < 2.2e-16
Anova(lm2, type = 3)
Anova Table (Type III tests)
Response: pesonac
Sum Sq Df F value Pr(>F)
(Intercept) 53301749 1 336.1970 < 2.2e-16 ***
sexo 417278 1 2.6319 0.1054257
egest 167763713 1 1058.1576 < 2.2e-16 ***
tiparto 265356 3 0.5579 0.6430696
infeccion 63848 1 0.4027 0.5260100
estadia 1937329 1 12.2196 0.0005193 ***
cateter 509809 1 3.2156 0.0736043 .
Residuals 72137167 455
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
La siguiente variable a elimiar es el tipo de parto
lm3 <- update(lm2, . ~ . - tiparto)
summary(lm3)
Call:
lm(formula = pesonac ~ sexo + egest + infeccion + estadia + cateter,
data = base[-ncol(base)])
Residuals:
Min 1Q Median 3Q Max
-1224.1 -230.6 4.7 236.2 1249.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4100.753 213.640 -19.195 < 2e-16 ***
sexoFemenino 59.756 37.021 1.614 0.107188
egest 188.583 5.681 33.198 < 2e-16 ***
infeccionNo -44.828 67.949 -0.660 0.509758
estadia -5.942 1.690 -3.515 0.000483 ***
cateterNo -169.522 95.705 -1.771 0.077178 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 397.6 on 458 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.7791, Adjusted R-squared: 0.7766
F-statistic: 323 on 5 and 458 DF, p-value: < 2.2e-16
Anova(lm3, type = 3)
Anova Table (Type III tests)
Response: pesonac
Sum Sq Df F value Pr(>F)
(Intercept) 58244136 1 368.4376 < 2.2e-16 ***
sexo 411876 1 2.6054 0.1071876
egest 174224187 1 1102.0980 < 2.2e-16 ***
infeccion 68805 1 0.4352 0.5097578
estadia 1953290 1 12.3560 0.0004833 ***
cateter 495981 1 3.1375 0.0771782 .
Residuals 72402523 458
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
La siguiente variable a elimiar es infección nosocomial
lm4 <- update(lm3, . ~ . - infeccion)
summary(lm4)
Call:
lm(formula = pesonac ~ sexo + egest + estadia + cateter, data = base[-ncol(base)])
Residuals:
Min 1Q Median 3Q Max
-1199.99 -232.89 2.36 234.29 1248.51
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4118.200 211.866 -19.438 < 2e-16 ***
sexoFemenino 61.161 36.937 1.656 0.09844 .
egest 188.433 5.673 33.219 < 2e-16 ***
estadia -5.403 1.479 -3.652 0.00029 ***
cateterNo -192.803 88.907 -2.169 0.03063 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 397.4 on 459 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.7788, Adjusted R-squared: 0.7769
F-statistic: 404.1 on 4 and 459 DF, p-value: < 2.2e-16
Anova(lm4, type = 3)
Anova Table (Type III tests)
Response: pesonac
Sum Sq Df F value Pr(>F)
(Intercept) 59654964 1 377.8271 < 2.2e-16 ***
sexo 432905 1 2.7418 0.0984367 .
egest 174226575 1 1103.4708 < 2.2e-16 ***
estadia 2106327 1 13.3405 0.0002897 ***
cateter 742513 1 4.7027 0.0306276 *
Residuals 72471328 459
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
El p-valor del sexo aún es mayor a \(\alpha=.05\), la eliminamos
lm5 <- update(lm1, . ~ . - sexo - infeccion -tiparto - emadre)
summary(lm5)
Call:
lm(formula = pesonac ~ egest + estadia + cateter, data = base[-ncol(base)])
Residuals:
Min 1Q Median 3Q Max
-1169.43 -236.47 -6.51 237.24 1280.30
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4091.290 211.641 -19.331 < 2e-16 ***
egest 188.683 5.681 33.212 < 2e-16 ***
estadia -5.377 1.482 -3.628 0.000317 ***
cateterNo -198.855 89.000 -2.234 0.025942 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 398.1 on 460 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.7775, Adjusted R-squared: 0.7761
F-statistic: 535.9 on 3 and 460 DF, p-value: < 2.2e-16
Anova(lm5, type = 3)
Anova Table (Type III tests)
Response: pesonac
Sum Sq Df F value Pr(>F)
(Intercept) 59226373 1 373.6975 < 2.2e-16 ***
egest 174812496 1 1103.0052 < 2.2e-16 ***
estadia 2086293 1 13.1638 0.0003174 ***
cateter 791200 1 4.9922 0.0259415 *
Residuals 72904234 460
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# anova(lm1,lm5)
Finalmente nuestro mejor modelo es
# Modelo ajustado
as.formula(
paste0("pesonac ~ ", round(coefficients(lm5)[1],2), "",
paste(sprintf(" %+.2f*%s ",
coefficients(lm5)[-1],
names(coefficients(lm5)[-1])),
collapse="")
)
)
pesonac ~ -4091.29 + 188.68 * egest - 5.38 * estadia - 198.86 *
cateterNo
El R²-ajustado fue 0.7761 contra 0.7746 del modelo completo.
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(lm5)
par(mfrow=c(2,3))
plot(lm5, 1:6)
par(mfrow=c(1,1))
# Normality of Residuals
# qq plot for studentized resid
qqPlot(lm5, main="QQ Plot")
# Ajustados
plot(base$pesonac, predict(lm5, newdata = base))
# distribution of studentized residuals
library(MASS)
sresid <- studres(lm5)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
# prueba de normalidad
ks.test(stdres(lm5), "pnorm", exact = T)
One-sample Kolmogorov-Smirnov test
data: stdres(lm5)
D = 0.047671, p-value = 0.2348
alternative hypothesis: two-sided
shapiro.test(stdres(lm5))
Shapiro-Wilk normality test
data: stdres(lm5)
W = 0.9955, p-value = 0.2031
Podemos entonces asumir la normalidad tanto gráfica como en las dos pruebas de normalidad. Sin embargo los residuos estandarizados no se distribuyen alrededor del cero. Las distribusción de los residuos y el gráfico de respuesta vs predichos sugieren una tendencia curvilinea con algún predictor y por tanto puede ser necesaria una transformación.
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(lm5)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 11.50677 Df = 1 p = 0.000693432
# plot studentized residuals vs. fitted values
spreadLevelPlot(lm5)
Suggested power transformation: 0.9054878
En la prueba de Breusch-Pagan rechazamos la hipótesis nula de varianza constante. No podemos asumir homocedasticidad.
# Test for Autocorrelated Errors
durbinWatsonTest(lm5)
lag Autocorrelation D-W Statistic p-value
1 -0.005759672 2.008448 0.946
Alternative hypothesis: rho != 0
No hay evidencia suficiente para pensar que los errores entén autocorrelacionados y asumimos su independencia.
# Evaluate Nonlinearity
# component + residual plot
crPlots(lm5)
# Ceres plots
ceresPlots(lm5)
Se sugiere una tranformación de la edad gestacional
# Assessing Outliers
outlierTest(lm5) # Bonferonni p-value for most extreme obs
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferonni p
238 3.257264 0.0012083 0.56065
qqPlot(lm5, main="QQ Plot") #qq plot for studentized resid
leveragePlots(lm5) # leverage plots
No encontramos datos atípicos
# Influential Observations
# added variable plots
library(car)
avPlots(lm5)
# Cook's D plot
# identify D values > 4/(n-k-1)
cutoff <- 4/((nrow(base)-length(lm5$coefficients)-2))
plot(lm5, which=4, cook.levels=cutoff)
# Influence Plot
influencePlot(lm5, id.method="identify", main="Influence Plot", sub="Circle size is proportial to Cook's Distance" )
Ninguna distancia de Cook es mayor a 1. El paciente id=12 tiene la mayor distancia de Cook (>.08). El umbral de apalancamiento es 2(k+1)/n = 0.0086022. Varios puntos están por encima de este umbral.
# Evaluate Collinearity
vif(lm5) # variance inflation factors
egest estadia cateter
1.357876 1.403242 1.226638
sqrt(vif(lm5)) > 2 # problem?
egest estadia cateter
FALSE FALSE FALSE
# library(perturb)
# colldiag(lm5, scale=F)
Las variables en el modelo final no presentan problemas de multicolinealidad
# Global test of model assumptions
library(gvlma)
gvmodel <- gvlma(lm5)
summary(gvmodel)
Call:
lm(formula = pesonac ~ egest + estadia + cateter, data = base[-ncol(base)])
Residuals:
Min 1Q Median 3Q Max
-1169.43 -236.47 -6.51 237.24 1280.30
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4091.290 211.641 -19.331 < 2e-16 ***
egest 188.683 5.681 33.212 < 2e-16 ***
estadia -5.377 1.482 -3.628 0.000317 ***
cateterNo -198.855 89.000 -2.234 0.025942 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 398.1 on 460 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.7775, Adjusted R-squared: 0.7761
F-statistic: 535.9 on 3 and 460 DF, p-value: < 2.2e-16
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = lm5)
Value p-value Decision
Global Stat 50.68229 2.601e-10 Assumptions NOT satisfied!
Skewness 0.03746 8.465e-01 Assumptions acceptable.
Kurtosis 1.75612 1.851e-01 Assumptions acceptable.
Link Function 48.62007 3.107e-12 Assumptions NOT satisfied!
Heteroscedasticity 0.26864 6.042e-01 Assumptions acceptable.
En resumen es necesario transformar la variable edad gestacional para que tener linealidad con la variable respuesta.