We want to predict the probability of low birth weight and of preterm birth as a function of the following variables, using logistic regression. All variables are categorical.
Mother’s age:
- less than or equal to 19 years old
- 20 to 34 years old
- 35 years old or more
Educational level:
- less than 8 years
- 8 years or more
Marital Status:
- single (with a partner)
- not single (without a partner)
Number of children:
- None
- 1 or 2
- 3 or more
Fetal loss:
- No
- Yes
Skin color:
- Yellow/White
- Black/Brown
Type of pregnancy:
- single
- multiple
Prenatal consultation:
- None
- 1 to 3
- 4 to 6
- 7 or more
The results of the univariate analyses are shown below, together with the odds ratio (OR) and its 95% confidence interval for each variable. All variables with a p-value below 0.05 in the univariate analysis were considered for the multivariable logistic regression.
##
## Call:
## glm(formula = baixo_peso ~ idade_mae, family = binomial(link = "logit"),
## data = df_low_weight_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4752 -0.4636 -0.4123 -0.4123 2.2394
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.17660 0.02332 -93.327 <2e-16 ***
## idade_mae2 -0.24594 0.02683 -9.168 <2e-16 ***
## idade_mae3 0.05221 0.03875 1.347 0.178
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63747 on 107097 degrees of freedom
## Residual deviance: 63614 on 107095 degrees of freedom
## AIC: 63620
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.1134261 0.1083276 0.1186985
## idade_mae2 0.7819674 0.7420519 0.8243381
## idade_mae3 1.0535958 0.9763298 1.1365049
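The odds ratios in the table above are the exponentiated model coefficients. As a hedged check, the Python sketch below recomputes the OR for `idade_mae2` and `idade_mae3` from the printed estimates and standard errors, together with a Wald 95% interval, the z statistic and the two-sided p-value. The helper name `odds_ratio_summary` is hypothetical, and the Wald interval is an approximation assumed here; the intervals in the report may have been obtained differently (for example by profile likelihood), so small differences in the later decimals are expected.

```python
import math

def odds_ratio_summary(estimate, std_error, z_crit=1.959964):
    """Return OR, Wald 95% CI, z statistic and two-sided p-value for a logit coefficient."""
    odds_ratio = math.exp(estimate)
    lower = math.exp(estimate - z_crit * std_error)
    upper = math.exp(estimate + z_crit * std_error)
    z = estimate / std_error
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return odds_ratio, lower, upper, z, p_value

# Estimates and standard errors copied from the model baixo_peso ~ idade_mae
or_2, lo_2, hi_2, z_2, p_2 = odds_ratio_summary(-0.24594, 0.02683)  # idade_mae2
or_3, lo_3, hi_3, z_3, p_3 = odds_ratio_summary(0.05221, 0.03875)   # idade_mae3

print(f"idade_mae2: OR = {or_2:.4f} ({lo_2:.4f}, {hi_2:.4f}), z = {z_2:.3f}")
print(f"idade_mae3: OR = {or_3:.4f} ({lo_3:.4f}, {hi_3:.4f}), p = {p_3:.3f}")
```

The same function can be applied to any coefficient row in the summaries that follow.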
##
## Call:
## glm(formula = baixo_peso ~ escolaridade, family = binomial(link = "logit"),
## data = df_low_weight_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4498 -0.4498 -0.4189 -0.4189 2.2256
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.24015 0.01840 -121.75 < 2e-16 ***
## escolaridade2 -0.14883 0.02272 -6.55 5.75e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63747 on 107097 degrees of freedom
## Residual deviance: 63705 on 107096 degrees of freedom
## AIC: 63709
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.1064426 0.1026541 0.1103321
## escolaridade2 0.8617172 0.8242510 0.9010348
##
## Call:
## glm(formula = baixo_peso ~ sit_conjugal, family = binomial(link = "logit"),
## data = df_low_weight_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4366 -0.4366 -0.4366 -0.4118 2.2403
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.42476 0.01995 -121.522 < 2e-16 ***
## sit_conjugal2 0.12198 0.02372 5.142 2.72e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63747 on 107097 degrees of freedom
## Residual deviance: 63720 on 107096 degrees of freedom
## AIC: 63724
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.08849901 0.08508649 0.09200904
## sit_conjugal2 1.12973618 1.07854095 1.18365961
##
## Call:
## glm(formula = baixo_peso ~ filhos, family = binomial(link = "logit"),
## data = df_low_weight_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4521 -0.4521 -0.4374 -0.3973 2.2708
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.22939 0.01455 -153.225 <2e-16 ***
## filhos2 -0.27001 0.02312 -11.679 <2e-16 ***
## filhos3 -0.06933 0.03925 -1.766 0.0774 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63747 on 107097 degrees of freedom
## Residual deviance: 63607 on 107095 degrees of freedom
## AIC: 63613
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.1075945 0.1045581 0.1106949
## filhos2 0.7633725 0.7295061 0.7987109
## filhos3 0.9330210 0.8634198 1.0070518
##
## Call:
## glm(formula = baixo_peso ~ perda_fetal, family = binomial(link = "logit"),
## data = df_low_weight_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4989 -0.4209 -0.4209 -0.4209 2.2215
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.37900 0.01154 -206.21 <2e-16 ***
## perda_fetal2 0.35808 0.03278 10.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63747 on 107097 degrees of freedom
## Residual deviance: 63636 on 107096 degrees of freedom
## AIC: 63640
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.09264296 0.09056516 0.09475499
## perda_fetal2 1.43057579 1.34101612 1.52490729
##
## Call:
## glm(formula = baixo_peso ~ raca_cor, family = binomial(link = "logit"),
## data = df_low_weight_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4465 -0.4465 -0.4465 -0.3944 2.2770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.51467 0.01967 -127.85 <2e-16 ***
## raca_cor2 0.25903 0.02353 11.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63747 on 107097 degrees of freedom
## Residual deviance: 63622 on 107096 degrees of freedom
## AIC: 63626
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.08088938 0.07781389 0.08405095
## raca_cor2 1.29567566 1.23741931 1.35700360
##
## Call:
## glm(formula = baixo_peso ~ tipo_gravidez, family = binomial(link = "logit"),
## data = df_low_weight_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3418 -0.4012 -0.4012 -0.4012 2.2625
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.47903 0.01156 -214.50 <2e-16 ***
## tipo_gravidez2 2.85751 0.04504 63.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63747 on 107097 degrees of freedom
## Residual deviance: 60074 on 107096 degrees of freedom
## AIC: 60078
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.08382491 0.08194148 0.08573922
## tipo_gravidez2 17.41806174 15.94955195 19.03004278
##
## Call:
## glm(formula = baixo_peso ~ consultas, family = binomial(link = "logit"),
## data = df_low_weight_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7311 -0.4575 -0.3558 -0.3558 2.3627
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.18300 0.05233 -22.605 < 2e-16 ***
## consultas2 -0.32239 0.05884 -5.479 4.27e-08 ***
## consultas3 -1.02124 0.05521 -18.496 < 2e-16 ***
## consultas4 -1.54494 0.05504 -28.069 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63747 on 107097 degrees of freedom
## Residual deviance: 61916 on 107094 degrees of freedom
## AIC: 61924
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.3063584 0.2762285 0.3391415
## consultas2 0.7244143 0.6459360 0.8135272
## consultas3 0.3601475 0.3234658 0.4016486
## consultas4 0.2133238 0.1916626 0.2378263
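For factors with more than two levels, a single coefficient can be non-significant while the variable as a whole still improves the fit (for example `idade_mae3`, p = 0.178, and `filhos3`, p = 0.0774, in the univariate models above). One common way to test the whole variable is a likelihood-ratio test on the drop in deviance, sketched below in Python. This is an illustration and not necessarily the criterion applied in the analysis; the function name `lr_test_2df` is hypothetical, and the deviances are the rounded values printed above.

```python
import math

def lr_test_2df(null_deviance, residual_deviance):
    """Likelihood-ratio test for a 3-level factor (2 degrees of freedom).

    For 2 degrees of freedom the chi-square upper tail is exactly exp(-x / 2).
    """
    statistic = null_deviance - residual_deviance
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

# Rounded deviances from the univariate low-birth-weight models
stat_age, p_age = lr_test_2df(63747, 63614)            # idade_mae
stat_children, p_children = lr_test_2df(63747, 63607)  # filhos

print(f"idade_mae: LR chi2 = {stat_age:.0f}, p = {p_age:.2e}")
print(f"filhos:    LR chi2 = {stat_children:.0f}, p = {p_children:.2e}")
```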
The first model includes all the variables listed above, since each of them had at least one level that was significant in the univariate analysis (p < 0.05). The model summary is shown below.
##
## Call:
## glm(formula = baixo_peso ~ idade_mae + escolaridade + sit_conjugal +
## filhos + perda_fetal + raca_cor + tipo_gravidez + consultas,
## family = binomial(link = "logit"), data = df_low_weight_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1056 -0.4400 -0.3514 -0.2830 2.6984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.87393 0.07009 -12.469 < 2e-16 ***
## idade_mae2 -0.00157 0.03131 -0.050 0.960002
## idade_mae3 0.40134 0.04648 8.634 < 2e-16 ***
## escolaridade2 0.05741 0.02716 2.114 0.034535 *
## sit_conjugal2 -0.10440 0.02828 -3.692 0.000222 ***
## filhos2 -0.52838 0.02627 -20.115 < 2e-16 ***
## filhos3 -0.69862 0.04745 -14.723 < 2e-16 ***
## perda_fetal2 0.36434 0.03570 10.205 < 2e-16 ***
## raca_cor2 0.08405 0.02742 3.065 0.002176 **
## tipo_gravidez2 2.99724 0.04695 63.837 < 2e-16 ***
## consultas2 -0.47234 0.06086 -7.761 8.43e-15 ***
## consultas3 -1.28397 0.05786 -22.191 < 2e-16 ***
## consultas4 -1.93556 0.05988 -32.326 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63747 on 107097 degrees of freedom
## Residual deviance: 57451 on 107085 degrees of freedom
## AIC: 57477
##
## Number of Fisher Scoring iterations: 5
Only level 2 of mother’s age (20 to 34 years) is not statistically significant in the model (p = 0.96). A stepwise procedure was run in both directions, backward and forward, to assess the effect of removing each variable, one at a time, from the original model.
The Akaike Information Criterion (AIC) measures how well a model fits while penalizing the number of parameters; the smaller the AIC, the better the model. Removing any of the variables increased the AIC, so the initial model was kept as the final model.
## Start: AIC=57476.81
## baixo_peso ~ idade_mae + escolaridade + sit_conjugal + filhos +
## perda_fetal + raca_cor + tipo_gravidez + consultas
##
## Df Deviance AIC
## <none> 57451 57477
## - escolaridade 1 57455 57479
## - raca_cor 1 57460 57484
## - sit_conjugal 1 57464 57488
## - perda_fetal 1 57549 57573
## - idade_mae 2 57563 57585
## - filhos 2 57928 57950
## - consultas 3 59450 59470
## - tipo_gravidez 1 61234 61258
##
## Call: glm(formula = baixo_peso ~ idade_mae + escolaridade + sit_conjugal +
## filhos + perda_fetal + raca_cor + tipo_gravidez + consultas,
## family = binomial(link = "logit"), data = df_low_weight_reg)
##
## Coefficients:
## (Intercept) idade_mae2 idade_mae3 escolaridade2 sit_conjugal2
## -0.87393 -0.00157 0.40134 0.05741 -0.10440
## filhos2 filhos3 perda_fetal2 raca_cor2 tipo_gravidez2
## -0.52838 -0.69862 0.36434 0.08405 2.99724
## consultas2 consultas3 consultas4
## -0.47234 -1.28397 -1.93556
##
## Degrees of Freedom: 107097 Total (i.e. Null); 107085 Residual
## Null Deviance: 63750
## Residual Deviance: 57450 AIC: 57480
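The AIC values in the stepwise table can be checked by hand. For a logistic model fit to individual binary outcomes, the residual deviance equals -2 times the log-likelihood, so AIC = deviance + 2k, where k is the number of estimated coefficients. The Python sketch below applies this to the rounded deviances printed above; the helper name `aic_from_deviance` is hypothetical.

```python
def aic_from_deviance(residual_deviance, n_coefficients):
    """AIC for a binary-response logistic model: deviance + 2 * number of coefficients."""
    return residual_deviance + 2 * n_coefficients

FULL_MODEL_COEFS = 13  # intercept + 12 dummy coefficients in the full model

# (term dropped, Df removed, residual deviance) copied from the stepwise table
candidates = [
    ("<none>", 0, 57451),
    ("escolaridade", 1, 57455),
    ("idade_mae", 2, 57563),
    ("consultas", 3, 59450),
    ("tipo_gravidez", 1, 61234),
]

aic_table = {
    term: aic_from_deviance(deviance, FULL_MODEL_COEFS - df)
    for term, df, deviance in candidates
}
best_term = min(aic_table, key=aic_table.get)

for term, aic in aic_table.items():
    print(f"- {term:<14} AIC = {aic}")
print("Lowest AIC:", best_term)
```

Dropping a term removes as many coefficients as its Df column indicates, which is why a variable with several levels needs a larger drop in deviance to justify staying in the model.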
The odds ratio (OR) for each variable in the fitted model, together with its 95% confidence interval, is shown below.
## OR 2.5 % 97.5 %
## (Intercept) 0.4173095 0.3635214 0.4784783
## idade_mae2 0.9984308 0.9391206 1.0617774
## idade_mae3 1.4938209 1.3634873 1.6360146
## escolaridade2 1.0590902 1.0042453 1.1170676
## sit_conjugal2 0.9008630 0.8523685 0.9522858
## filhos2 0.5895622 0.5599342 0.6206668
## filhos3 0.4972712 0.4528776 0.5454658
## perda_fetal2 1.4395622 1.3417557 1.5433313
## raca_cor2 1.0876809 1.0308735 1.1478612
## tipo_gravidez2 20.0301934 18.2730324 21.9660433
## consultas2 0.6235441 0.5537892 0.7030282
## consultas3 0.2769357 0.2474352 0.3104423
## consultas4 0.1443429 0.1284484 0.1624366
For example, type of pregnancy has the largest OR for low birth weight. A woman with a multiple pregnancy has about 20 times the odds (OR = 20.03, 95% CI: 18.27-21.97) of having a low-birth-weight baby compared with a woman with a single pregnancy, holding all other variables constant.
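An odds ratio multiplies odds, not probabilities. As an illustration, the Python sketch below converts the fitted odds into predicted probabilities for the reference profile with a single and with a multiple pregnancy. It assumes that the first listed category of every variable is the reference level, which the coefficient names suggest but the text does not state; the function name `odds_to_probability` is hypothetical.

```python
def odds_to_probability(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

# Values copied from the adjusted odds-ratio table
baseline_odds = 0.4173095           # (Intercept): all variables at their reference level
or_multiple_pregnancy = 20.0301934  # tipo_gravidez2

p_single = odds_to_probability(baseline_odds)
p_multiple = odds_to_probability(baseline_odds * or_multiple_pregnancy)
probability_ratio = p_multiple / p_single

print(f"P(low weight | single pregnancy)   = {p_single:.3f}")
print(f"P(low weight | multiple pregnancy) = {p_multiple:.3f}")
print(f"Ratio of probabilities             = {probability_ratio:.2f}")
```

For this profile the odds are about 20 times higher, while the predicted probability is roughly three times higher, because probabilities are bounded by 1.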
The metrics below are used to validate the proposed model.
The data were split into a training set (80%, n = 85,679) and a test set (20%, n = 21,419).
##
## Call:
## glm(formula = baixo_peso ~ idade_mae + escolaridade + sit_conjugal +
## filhos + perda_fetal + raca_cor + tipo_gravidez + consultas,
## family = binomial(link = "logit"), data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1023 -0.4427 -0.3526 -0.2872 2.6872
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.879087 0.078291 -11.228 < 2e-16 ***
## idade_mae2 0.003441 0.035029 0.098 0.921750
## idade_mae3 0.430228 0.051735 8.316 < 2e-16 ***
## escolaridade2 0.059699 0.030363 1.966 0.049278 *
## sit_conjugal2 -0.114633 0.031520 -3.637 0.000276 ***
## filhos2 -0.513762 0.029252 -17.563 < 2e-16 ***
## filhos3 -0.721727 0.053429 -13.508 < 2e-16 ***
## perda_fetal2 0.341505 0.040090 8.518 < 2e-16 ***
## raca_cor2 0.093035 0.030608 3.040 0.002369 **
## tipo_gravidez2 2.990775 0.052576 56.885 < 2e-16 ***
## consultas2 -0.507481 0.068100 -7.452 9.2e-14 ***
## consultas3 -1.279506 0.064567 -19.817 < 2e-16 ***
## consultas4 -1.930784 0.066785 -28.911 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 51000 on 85678 degrees of freedom
## Residual deviance: 46059 on 85666 degrees of freedom
## AIC: 46085
##
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 19327 1598
## 1 210 284
##
## Accuracy : 0.9156
## 95% CI : (0.9118, 0.9193)
## No Information Rate : 0.9121
## P-Value [Acc > NIR] : 0.03744
##
## Kappa : 0.2102
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.15090
## Specificity : 0.98925
## Pos Pred Value : 0.57490
## Neg Pred Value : 0.92363
## Prevalence : 0.08787
## Detection Rate : 0.01326
## Detection Prevalence : 0.02306
## Balanced Accuracy : 0.57008
##
## 'Positive' Class : 1
##
The accuracy is high (0.916) but only marginally above the no-information rate (0.912), and the sensitivity is low (0.151). This can be explained by the class imbalance (about 9% of the births are low weight) and by the limited ability of general socioeconomic variables to predict low birth weight.
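The statistics in the confusion-matrix output can be recomputed from the four cell counts, which makes the gap between accuracy and sensitivity explicit. The Python sketch below does this for the test-set matrix; the function name `confusion_metrics` and the dictionary keys are hypothetical.

```python
def confusion_metrics(tn, fn, fp, tp):
    """Classification metrics for a 2x2 confusion matrix with class 1 as positive."""
    total = tn + fn + fp + tp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    # Agreement expected by chance from the row and column totals (for Cohen's kappa)
    expected = ((tn + fn) * (tn + fp) + (tp + fp) * (tp + fn)) / total**2
    return {
        "accuracy": accuracy,
        "no_information_rate": max(tn + fp, tp + fn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }

# Counts from the test-set confusion matrix (rows = prediction, columns = reference)
metrics = confusion_metrics(tn=19327, fn=1598, fp=210, tp=284)
for name, value in metrics.items():
    print(f"{name:<20} {value:.4f}")
```

Always predicting class 0 would already reach the no-information rate, so balanced accuracy and kappa are more informative than raw accuracy for an outcome this rare.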
The area under the ROC curve (AUC) was 0.704 (95% CI: 0.579-0.731), which indicates acceptable discrimination.
We also performed the Hosmer-Lemeshow goodness-of-fit test, which gave no evidence of lack of fit (\(\chi^2 = 11.57, df = 8, p = 0.172\)).
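The p-value of the Hosmer-Lemeshow test is the upper tail of a chi-square distribution with the stated degrees of freedom. The Python sketch below evaluates that tail using only the standard library, through the series expansion of the regularized lower incomplete gamma function. `chi2_upper_tail` is a hypothetical helper, and because the statistic is rounded to two decimals the result can differ from the reported p-value in the third decimal.

```python
import math

def chi2_upper_tail(x, df):
    """P(X > x) for a chi-square variable with df degrees of freedom.

    Uses P(a, z) = z^a * exp(-z) / Gamma(a) * sum_{n>=0} z^n / (a (a+1) ... (a+n))
    with a = df / 2 and z = x / 2, and returns 1 - P(a, z).
    """
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return 1.0 - lower

# Hosmer-Lemeshow statistic reported for the low-birth-weight model
p_value = chi2_upper_tail(11.57, df=8)
print(f"p = {p_value:.3f}")
```

A p-value above 0.05 means the test found no evidence of miscalibration; it does not by itself prove that the model fits well.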
R squared is a useful metric for multiple linear regression, but does not have the same meaning in logistic regression. Statisticians have come up with a variety of analogues of R squared for multiple logistic regression that they refer to collectively as “pseudo R squared”. These do not have the same interpretation, in that they are not simply the proportion of variance explained by the model. In fact, the magnitude of any particular pseudo R squared value can’t be used to compare across datasets. Instead, the primary use for these pseudo R squared values is for comparing multiple models fit to the same dataset. (Source: https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_mult_logistic_gof_pseudo_r_squared.htm)
## $CoxSnell
## [1] 0.05603599
##
## $Nagelkerke
## [1] 0.1249215
##
## $McFadden
## [1] 0.09688073
##
## $Tjur
## [1] 0.09336173
##
## $sqPearson
## [1] 0.09352899
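Three of the pseudo R squared values above depend only on the null deviance, the residual deviance and the sample size. The Python sketch below recomputes them from the rounded deviances of the training-set model, which is the fit these values appear to correspond to. It assumes individual 0/1 outcomes, so that deviance = -2 times the log-likelihood; the function name `pseudo_r_squared` is hypothetical. The Tjur and squared-Pearson measures need the individual fitted probabilities and are not covered here.

```python
import math

def pseudo_r_squared(null_deviance, residual_deviance, n_obs):
    """McFadden, Cox-Snell and Nagelkerke pseudo R-squared for a binary logistic model."""
    mcfadden = 1.0 - residual_deviance / null_deviance
    cox_snell = 1.0 - math.exp((residual_deviance - null_deviance) / n_obs)
    # Upper bound of Cox-Snell, used by Nagelkerke to rescale it to a 0-1 range
    max_cox_snell = 1.0 - math.exp(-null_deviance / n_obs)
    return {
        "McFadden": mcfadden,
        "CoxSnell": cox_snell,
        "Nagelkerke": cox_snell / max_cox_snell,
    }

# Training-set model: n_obs = null degrees of freedom + 1 = 85678 + 1
r2 = pseudo_r_squared(null_deviance=51000, residual_deviance=46059, n_obs=85679)
for name, value in r2.items():
    print(f"{name:<10} {value:.4f}")
```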
For preterm birth, logistic regression was applied using the same variables listed in Section 1.
##
## Call:
## glm(formula = prematuro ~ idade_mae, family = binomial(link = "logit"),
## data = df_preterm_reg_noNA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4610 -0.4336 -0.3952 -0.3952 2.2753
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.31713 0.02469 -93.846 < 2e-16 ***
## idade_mae2 -0.19330 0.02827 -6.839 7.98e-12 ***
## idade_mae3 0.12900 0.04021 3.208 0.00134 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59950 on 107063 degrees of freedom
## Residual deviance: 59841 on 107061 degrees of freedom
## AIC: 59847
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.09855611 0.09386987 0.1034099
## idade_mae2 0.82423389 0.77998346 0.8713787
## idade_mae3 1.13769452 1.05125203 1.2307517
##
## Call:
## glm(formula = prematuro ~ escolaridade, family = binomial(link = "logit"),
## data = df_preterm_reg_noNA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4104 -0.4104 -0.4104 -0.4080 2.2483
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.44417 0.02002 -122.096 <2e-16 ***
## escolaridade2 0.01245 0.02419 0.515 0.607
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59950 on 107063 degrees of freedom
## Residual deviance: 59950 on 107062 degrees of freedom
## AIC: 59954
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.08679789 0.08344014 0.09025177
## escolaridade2 1.01253152 0.96576561 1.06180991
##
## Call:
## glm(formula = prematuro ~ sit_conjugal, family = binomial(link = "logit"),
## data = df_preterm_reg_noNA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.410 -0.410 -0.410 -0.409 2.246
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.438892 0.020075 -121.492 <2e-16 ***
## sit_conjugal2 0.004716 0.024222 0.195 0.846
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59950 on 107063 degrees of freedom
## Residual deviance: 59950 on 107062 degrees of freedom
## AIC: 59954
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.08725744 0.0838726 0.0907395
## sit_conjugal2 1.00472734 0.9582542 1.0537032
##
## Call:
## glm(formula = prematuro ~ filhos, family = binomial(link = "logit"),
## data = df_preterm_reg_noNA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4240 -0.4240 -0.3980 -0.3941 2.2778
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.36386 0.01537 -153.761 < 2e-16 ***
## filhos2 -0.15268 0.02375 -6.429 1.29e-10 ***
## filhos3 -0.13196 0.04246 -3.108 0.00188 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59950 on 107063 degrees of freedom
## Residual deviance: 59906 on 107061 degrees of freedom
## AIC: 59912
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.09405669 0.09125335 0.09692182
## filhos2 0.85840090 0.81931608 0.89925803
## filhos3 0.87637280 0.80580062 0.95174739
##
## Call:
## glm(formula = prematuro ~ perda_fetal, family = binomial(link = "logit"),
## data = df_preterm_reg_noNA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4783 -0.4018 -0.4018 -0.4018 2.2613
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.47592 0.01202 -206.03 <2e-16 ***
## perda_fetal2 0.36546 0.03396 10.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59950 on 107063 degrees of freedom
## Residual deviance: 59843 on 107062 degrees of freedom
## AIC: 59847
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.08408567 0.08212188 0.08608294
## perda_fetal2 1.44117756 1.34777026 1.53972137
##
## Call:
## glm(formula = prematuro ~ raca_cor, family = binomial(link = "logit"),
## data = df_preterm_reg_noNA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4187 -0.4187 -0.4187 -0.3925 2.2812
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.52496 0.01976 -127.801 < 2e-16 ***
## raca_cor2 0.13449 0.02402 5.599 2.15e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59950 on 107063 degrees of freedom
## Residual deviance: 59919 on 107062 degrees of freedom
## AIC: 59923
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.08006134 0.07700386 0.08320474
## raca_cor2 1.14395342 1.09146900 1.19923034
##
## Call:
## glm(formula = prematuro ~ consultas, family = binomial(link = "logit"),
## data = df_preterm_reg_noNA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6860 -0.4406 -0.3344 -0.3344 2.4131
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.32706 0.05450 -24.350 < 2e-16 ***
## consultas2 -0.23600 0.06100 -3.869 0.000109 ***
## consultas3 -0.95614 0.05745 -16.643 < 2e-16 ***
## consultas4 -1.52856 0.05741 -26.626 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59950 on 107063 degrees of freedom
## Residual deviance: 58104 on 107060 degrees of freedom
## AIC: 58112
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.2652553 0.2381125 0.2948391
## consultas2 0.7897782 0.7013384 0.8908298
## consultas3 0.3843738 0.3437674 0.4306184
## consultas4 0.2168474 0.1939551 0.2429168
For the multivariable model, only marital status was left out, since it was not statistically significant in the univariate analysis (p = 0.846). Educational level was retained even though its univariate p-value was 0.607.
The results of the stepwise logistic regression are shown below. Both forward and backward selection were performed, and the model with the smallest AIC was chosen as the final model.
## Start: AIC=53961.79
## prematuro ~ idade_mae + escolaridade + filhos + perda_fetal +
## raca_cor + tipo_gravidez + consultas
##
## Df Deviance AIC
## <none> 53938 53962
## - raca_cor 1 53948 53970
## - escolaridade 1 54031 54053
## - perda_fetal 1 54036 54058
## - idade_mae 2 54059 54079
## - filhos 2 54237 54257
## - consultas 3 56203 56221
## - tipo_gravidez 1 57416 57438
##
## Call:
## glm(formula = prematuro ~ idade_mae + escolaridade + filhos +
## perda_fetal + raca_cor + tipo_gravidez + consultas, family = binomial(link = "logit"),
## data = df_preterm_reg_noNA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0646 -0.4043 -0.3313 -0.2748 2.8145
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.120657 0.067285 -16.655 < 2e-16 ***
## idade_mae2 0.001097 0.032320 0.034 0.97291
## idade_mae3 0.426788 0.047169 9.048 < 2e-16 ***
## escolaridade2 0.271750 0.028442 9.554 < 2e-16 ***
## filhos2 -0.382333 0.026877 -14.225 < 2e-16 ***
## filhos3 -0.705726 0.050715 -13.915 < 2e-16 ***
## perda_fetal2 0.377500 0.036925 10.224 < 2e-16 ***
## raca_cor2 -0.087524 0.027612 -3.170 0.00153 **
## tipo_gravidez2 2.917801 0.046687 62.496 < 2e-16 ***
## consultas2 -0.398325 0.063140 -6.309 2.82e-10 ***
## consultas3 -1.255241 0.060224 -20.843 < 2e-16 ***
## consultas4 -2.027502 0.062304 -32.542 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59950 on 107063 degrees of freedom
## Residual deviance: 53938 on 107052 degrees of freedom
## AIC: 53962
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.3260655 0.2855332 0.3717259
## idade_mae2 1.0010981 0.9398070 1.0667582
## idade_mae3 1.5323273 1.3967589 1.6804676
## escolaridade2 1.3122589 1.2412270 1.3876358
## filhos2 0.6822679 0.6472162 0.7191321
## filhos3 0.4937501 0.4467480 0.5450230
## perda_fetal2 1.4586331 1.3562305 1.5674800
## raca_cor2 0.9161970 0.8680044 0.9672372
## tipo_gravidez2 18.5005643 16.8848844 20.2763580
## consultas2 0.6714439 0.5937498 0.7605306
## consultas3 0.2850073 0.2535039 0.3210220
## consultas4 0.1316640 0.1166239 0.1488947
Below we can assess some metrics to validate the proposed model.
Bootstrap was used for internal validation.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97860 7692
## 1 586 926
##
## Accuracy : 0.9227
## 95% CI : (0.9211, 0.9243)
## No Information Rate : 0.9195
## P-Value [Acc > NIR] : 6.266e-05
##
## Kappa : 0.1627
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.107450
## Specificity : 0.994047
## Pos Pred Value : 0.612434
## Neg Pred Value : 0.927126
## Prevalence : 0.080494
## Detection Rate : 0.008649
## Detection Prevalence : 0.014122
## Balanced Accuracy : 0.550749
##
## 'Positive' Class : 1
##
ROC curve for the preterm birth model applied to the complete data set.
Using bootstrap resampling, the mean AUC was 0.702 (95% CI: 0.700-0.704, p < 0.001).
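A bootstrap estimate of the AUC can be sketched as follows. The Python example below resamples with replacement, computes the AUC of each resample through the rank (Mann-Whitney) formulation, and reports the mean with an approximate percentile interval. Everything in it is an assumption made for illustration: the data are synthetic, the function names are hypothetical, the number and size of the resamples are reduced so that it runs quickly (the methods describe 1,000 resamples of size 10,000), and the percentile interval is not necessarily how the interval in the text was obtained.

```python
import random
import statistics

def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied scores share their average 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc(labels, scores, n_boot, sample_size, seed=0):
    """Mean AUC and approximate 95% percentile interval over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(sample_size)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < sample_size:  # both classes are needed to compute an AUC
            aucs.append(auc(y, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int(0.025 * n_boot)]
    upper = aucs[int(0.975 * n_boot) - 1]
    return statistics.mean(aucs), lower, upper

# Synthetic stand-in data (NOT the study data): about 8% positives with higher scores
data_rng = random.Random(42)
labels = [1 if data_rng.random() < 0.08 else 0 for _ in range(5000)]
scores = [data_rng.gauss(0.75 if y else 0.0, 1.0) for y in labels]

mean_auc, ci_low, ci_high = bootstrap_auc(labels, scores, n_boot=200, sample_size=2000)
print(f"mean AUC = {mean_auc:.3f} (95% percentile interval: {ci_low:.3f}-{ci_high:.3f})")
```

With real data, `labels` would hold the observed outcome and `scores` the fitted probabilities from the final model.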
We also performed the Hosmer-Lemeshow goodness-of-fit test, which gave no evidence of lack of fit (\(\chi^2 = 0.85, df = 1, p = 0.356\)).
Thus, we consider that the model fits the data adequately.
The chi-squared test was used to compare variables between the low birth weight and non-low birth weight groups, as well as between the preterm and non-preterm groups. After the univariate analysis, the variables that reached significance were entered into the multivariable logistic regression model. A stepwise procedure in both directions, forward and backward, was used to identify the variables for the final model, and the model with the lowest AIC was chosen. Odds ratios (OR) were calculated for both the univariate and the multivariable (adjusted) models, together with 95% confidence intervals and significance levels.
Model validation was performed with bootstrap resampling: 1,000 bootstrap samples of size 10,000 were drawn with replacement from the original data set. The area under the receiver operating characteristic (ROC) curve, sensitivity, and specificity were calculated. Statistical analysis was conducted in R [1].
[1] R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL <https://www.R-project.org/>.