1 Introduction

We want to predict the probability of low birth weight and of preterm birth as a function of the following variables, using logistic regression. All the variables are categorical.

  • Mother’s age

    • less than or equal to 19 years old

    • 20 to 34 years old

    • 35 years old or more

  • Educational level

    • less than 8 years

    • 8 years or more

  • Marital Status

    • single (without a partner)

    • not single (with a partner)

  • Number of children

    • None

    • 1 or 2

    • 3 or more

  • Fetal loss

    • No

    • Yes

  • Skin color

    • Yellow/White

    • Black/Brown

  • Type of pregnancy

    • single

    • multiple

  • Prenatal consultation

    • None

    • 1 to 3

    • 4 to 6

    • 7 or more

2 Logistic regression for low weight birth

2.1 Univariate Analysis

Below are the results of the univariate analyses, together with the odds ratio (OR) and its 95% confidence interval for each variable. All variables with a p-value below 0.05 in the univariate analysis were considered for the multivariable logistic regression.

2.1.1 Mother’s age

## 
## Call:
## glm(formula = baixo_peso ~ idade_mae, family = binomial(link = "logit"), 
##     data = df_low_weight_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4752  -0.4636  -0.4123  -0.4123   2.2394  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.17660    0.02332 -93.327   <2e-16 ***
## idade_mae2  -0.24594    0.02683  -9.168   <2e-16 ***
## idade_mae3   0.05221    0.03875   1.347    0.178    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63747  on 107097  degrees of freedom
## Residual deviance: 63614  on 107095  degrees of freedom
## AIC: 63620
## 
## Number of Fisher Scoring iterations: 5
##                    OR     2.5 %    97.5 %
## (Intercept) 0.1134261 0.1083276 0.1186985
## idade_mae2  0.7819674 0.7420519 0.8243381
## idade_mae3  1.0535958 0.9763298 1.1365049
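
As a check on the table above, the OR column is the exponential of the corresponding coefficient. The following is a minimal Python sketch (the report itself was produced in R) that recomputes the OR for `idade_mae2` from the printed estimate. The Wald interval used here is an assumption made for illustration; the interval printed in the report may have been computed differently (for example by profile likelihood), so small differences in the last digits are expected.

```python
import math

# Values copied from the idade_mae2 row of the model summary above.
beta = -0.24594   # coefficient on the log-odds scale
se = 0.02683      # standard error of the coefficient
z = 1.959964      # 97.5th percentile of the standard normal distribution

# The odds ratio is the exponentiated coefficient.
odds_ratio = math.exp(beta)

# Wald 95% interval: exponentiate beta +/- z * se (an assumption, see above).
wald_low = math.exp(beta - z * se)
wald_high = math.exp(beta + z * se)

print(f"OR = {odds_ratio:.4f}, Wald 95% CI = ({wald_low:.4f}, {wald_high:.4f})")
```

With a sample this large, the Wald limits agree with the printed limits to about three decimal places.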

2.1.2 Educational level

## 
## Call:
## glm(formula = baixo_peso ~ escolaridade, family = binomial(link = "logit"), 
##     data = df_low_weight_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4498  -0.4498  -0.4189  -0.4189   2.2256  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.24015    0.01840 -121.75  < 2e-16 ***
## escolaridade2 -0.14883    0.02272   -6.55 5.75e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63747  on 107097  degrees of freedom
## Residual deviance: 63705  on 107096  degrees of freedom
## AIC: 63709
## 
## Number of Fisher Scoring iterations: 5
##                      OR     2.5 %    97.5 %
## (Intercept)   0.1064426 0.1026541 0.1103321
## escolaridade2 0.8617172 0.8242510 0.9010348

2.1.3 Marital Status

## 
## Call:
## glm(formula = baixo_peso ~ sit_conjugal, family = binomial(link = "logit"), 
##     data = df_low_weight_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4366  -0.4366  -0.4366  -0.4118   2.2403  
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)   -2.42476    0.01995 -121.522  < 2e-16 ***
## sit_conjugal2  0.12198    0.02372    5.142 2.72e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63747  on 107097  degrees of freedom
## Residual deviance: 63720  on 107096  degrees of freedom
## AIC: 63724
## 
## Number of Fisher Scoring iterations: 5
##                       OR      2.5 %     97.5 %
## (Intercept)   0.08849901 0.08508649 0.09200904
## sit_conjugal2 1.12973618 1.07854095 1.18365961

2.1.4 Number of children

## 
## Call:
## glm(formula = baixo_peso ~ filhos, family = binomial(link = "logit"), 
##     data = df_low_weight_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4521  -0.4521  -0.4374  -0.3973   2.2708  
## 
## Coefficients:
##             Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -2.22939    0.01455 -153.225   <2e-16 ***
## filhos2     -0.27001    0.02312  -11.679   <2e-16 ***
## filhos3     -0.06933    0.03925   -1.766   0.0774 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63747  on 107097  degrees of freedom
## Residual deviance: 63607  on 107095  degrees of freedom
## AIC: 63613
## 
## Number of Fisher Scoring iterations: 5
##                    OR     2.5 %    97.5 %
## (Intercept) 0.1075945 0.1045581 0.1106949
## filhos2     0.7633725 0.7295061 0.7987109
## filhos3     0.9330210 0.8634198 1.0070518

2.1.5 Fetal loss

## 
## Call:
## glm(formula = baixo_peso ~ perda_fetal, family = binomial(link = "logit"), 
##     data = df_low_weight_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4989  -0.4209  -0.4209  -0.4209   2.2215  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.37900    0.01154 -206.21   <2e-16 ***
## perda_fetal2  0.35808    0.03278   10.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63747  on 107097  degrees of freedom
## Residual deviance: 63636  on 107096  degrees of freedom
## AIC: 63640
## 
## Number of Fisher Scoring iterations: 5
##                      OR      2.5 %     97.5 %
## (Intercept)  0.09264296 0.09056516 0.09475499
## perda_fetal2 1.43057579 1.34101612 1.52490729

2.1.6 Race

## 
## Call:
## glm(formula = baixo_peso ~ raca_cor, family = binomial(link = "logit"), 
##     data = df_low_weight_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4465  -0.4465  -0.4465  -0.3944   2.2770  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.51467    0.01967 -127.85   <2e-16 ***
## raca_cor2    0.25903    0.02353   11.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63747  on 107097  degrees of freedom
## Residual deviance: 63622  on 107096  degrees of freedom
## AIC: 63626
## 
## Number of Fisher Scoring iterations: 5
##                     OR      2.5 %     97.5 %
## (Intercept) 0.08088938 0.07781389 0.08405095
## raca_cor2   1.29567566 1.23741931 1.35700360

2.1.7 Type of pregnancy

## 
## Call:
## glm(formula = baixo_peso ~ tipo_gravidez, family = binomial(link = "logit"), 
##     data = df_low_weight_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3418  -0.4012  -0.4012  -0.4012   2.2625  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.47903    0.01156 -214.50   <2e-16 ***
## tipo_gravidez2  2.85751    0.04504   63.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63747  on 107097  degrees of freedom
## Residual deviance: 60074  on 107096  degrees of freedom
## AIC: 60078
## 
## Number of Fisher Scoring iterations: 5
##                         OR       2.5 %      97.5 %
## (Intercept)     0.08382491  0.08194148  0.08573922
## tipo_gravidez2 17.41806174 15.94955195 19.03004278

2.1.8 Prenatal consultation

## 
## Call:
## glm(formula = baixo_peso ~ consultas, family = binomial(link = "logit"), 
##     data = df_low_weight_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7311  -0.4575  -0.3558  -0.3558   2.3627  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.18300    0.05233 -22.605  < 2e-16 ***
## consultas2  -0.32239    0.05884  -5.479 4.27e-08 ***
## consultas3  -1.02124    0.05521 -18.496  < 2e-16 ***
## consultas4  -1.54494    0.05504 -28.069  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63747  on 107097  degrees of freedom
## Residual deviance: 61916  on 107094  degrees of freedom
## AIC: 61924
## 
## Number of Fisher Scoring iterations: 5
##                    OR     2.5 %    97.5 %
## (Intercept) 0.3063584 0.2762285 0.3391415
## consultas2  0.7244143 0.6459360 0.8135272
## consultas3  0.3601475 0.3234658 0.4016486
## consultas4  0.2133238 0.1916626 0.2378263

2.2 The Multivariate Model for Low Weight Birth

The first model incorporates all the aforementioned variables, since each of them had at least one level that was significant in the univariate analysis (p < 0.05). The model summary is shown below.

## 
## Call:
## glm(formula = baixo_peso ~ idade_mae + escolaridade + sit_conjugal + 
##     filhos + perda_fetal + raca_cor + tipo_gravidez + consultas, 
##     family = binomial(link = "logit"), data = df_low_weight_reg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1056  -0.4400  -0.3514  -0.2830   2.6984  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.87393    0.07009 -12.469  < 2e-16 ***
## idade_mae2     -0.00157    0.03131  -0.050 0.960002    
## idade_mae3      0.40134    0.04648   8.634  < 2e-16 ***
## escolaridade2   0.05741    0.02716   2.114 0.034535 *  
## sit_conjugal2  -0.10440    0.02828  -3.692 0.000222 ***
## filhos2        -0.52838    0.02627 -20.115  < 2e-16 ***
## filhos3        -0.69862    0.04745 -14.723  < 2e-16 ***
## perda_fetal2    0.36434    0.03570  10.205  < 2e-16 ***
## raca_cor2       0.08405    0.02742   3.065 0.002176 ** 
## tipo_gravidez2  2.99724    0.04695  63.837  < 2e-16 ***
## consultas2     -0.47234    0.06086  -7.761 8.43e-15 ***
## consultas3     -1.28397    0.05786 -22.191  < 2e-16 ***
## consultas4     -1.93556    0.05988 -32.326  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63747  on 107097  degrees of freedom
## Residual deviance: 57451  on 107085  degrees of freedom
## AIC: 57477
## 
## Number of Fisher Scoring iterations: 5

Only level 2 of mother’s age (20 to 34 years) is not statistically significant in the model (p = 0.96). A stepwise approach was performed in both directions, backward and forward, to assess the effect of excluding variables one at a time from the original model.

2.3 Stepwise Model

The AIC (Akaike Information Criterion) measures the quality of the fitted model while penalizing the number of parameters. The smaller the AIC, the better the model. Excluding any of the considered variables caused the AIC to increase, so the initial model was kept as the final model.

## Start:  AIC=57476.81
## baixo_peso ~ idade_mae + escolaridade + sit_conjugal + filhos + 
##     perda_fetal + raca_cor + tipo_gravidez + consultas
## 
##                 Df Deviance   AIC
## <none>                57451 57477
## - escolaridade   1    57455 57479
## - raca_cor       1    57460 57484
## - sit_conjugal   1    57464 57488
## - perda_fetal    1    57549 57573
## - idade_mae      2    57563 57585
## - filhos         2    57928 57950
## - consultas      3    59450 59470
## - tipo_gravidez  1    61234 61258
## 
## Call:  glm(formula = baixo_peso ~ idade_mae + escolaridade + sit_conjugal + 
##     filhos + perda_fetal + raca_cor + tipo_gravidez + consultas, 
##     family = binomial(link = "logit"), data = df_low_weight_reg)
## 
## Coefficients:
##    (Intercept)      idade_mae2      idade_mae3   escolaridade2   sit_conjugal2  
##       -0.87393        -0.00157         0.40134         0.05741        -0.10440  
##        filhos2         filhos3    perda_fetal2       raca_cor2  tipo_gravidez2  
##       -0.52838        -0.69862         0.36434         0.08405         2.99724  
##     consultas2      consultas3      consultas4  
##       -0.47234        -1.28397        -1.93556  
## 
## Degrees of Freedom: 107097 Total (i.e. Null);  107085 Residual
## Null Deviance:       63750 
## Residual Deviance: 57450     AIC: 57480
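
The AIC values in the stepwise table can be reproduced from the deviances. The Python sketch below assumes the outcome is stored as one 0/1 record per birth, in which case the residual deviance equals minus twice the log-likelihood and AIC = deviance + 2 × (number of estimated coefficients). The parameter counts are derived from the `Df` column; the dictionary keys and the helper name `aic_from_deviance` are illustrative only.

```python
# Residual deviances copied from the stepwise table above. n_params is the
# number of estimated coefficients (intercept included) in each model.
full_model = {"deviance": 57451, "n_params": 13}
reduced_models = {
    "- escolaridade": {"deviance": 57455, "n_params": 12},  # drops 1 coefficient
    "- idade_mae": {"deviance": 57563, "n_params": 11},     # drops 2 coefficients
    "- consultas": {"deviance": 59450, "n_params": 10},     # drops 3 coefficients
}


def aic_from_deviance(deviance, n_params):
    """AIC for a 0/1 outcome, where deviance = -2 * log-likelihood."""
    return deviance + 2 * n_params


full_aic = aic_from_deviance(**full_model)
print(f"<none>          AIC = {full_aic}")
for name, model in reduced_models.items():
    aic = aic_from_deviance(**model)
    # A positive difference means dropping the variable makes the model worse.
    print(f"{name:<15} AIC = {aic}  (change: {aic - full_aic:+d})")
```

Every reduced model has a higher AIC than the full model, which is why no variable is dropped.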

2.4 Determining the OR for the multivariate model

The odds ratio (OR) for each variable in the fitted model, together with its 95% confidence interval, is shown below.

##                        OR      2.5 %     97.5 %
## (Intercept)     0.4173095  0.3635214  0.4784783
## idade_mae2      0.9984308  0.9391206  1.0617774
## idade_mae3      1.4938209  1.3634873  1.6360146
## escolaridade2   1.0590902  1.0042453  1.1170676
## sit_conjugal2   0.9008630  0.8523685  0.9522858
## filhos2         0.5895622  0.5599342  0.6206668
## filhos3         0.4972712  0.4528776  0.5454658
## perda_fetal2    1.4395622  1.3417557  1.5433313
## raca_cor2       1.0876809  1.0308735  1.1478612
## tipo_gravidez2 20.0301934 18.2730324 21.9660433
## consultas2      0.6235441  0.5537892  0.7030282
## consultas3      0.2769357  0.2474352  0.3104423
## consultas4      0.1443429  0.1284484  0.1624366

We can observe, for example, that the type of pregnancy has the largest OR for low birth weight. The odds of having a low-birth-weight baby are about 20 times higher (OR = 20.03; 95% CI: 18.27-21.97) for a woman with a multiple pregnancy than for a woman with a single pregnancy, with all other variables held constant.
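
An odds ratio is not a ratio of probabilities. The Python sketch below converts the fitted log-odds into probabilities for a woman at the reference level of every other variable. It assumes that the reference level of each factor is the first category listed in Section 1, which the report does not state explicitly; the function name `inv_logit` is illustrative.

```python
import math


def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Coefficients copied from the multivariate model summary.
intercept = -0.87393      # (Intercept): all factors at their reference level
beta_multiple = 2.99724   # tipo_gravidez2: multiple versus single pregnancy

# Predicted probability of low birth weight at the reference profile
# (assumed to be the first category of every variable in Section 1).
p_single = inv_logit(intercept)
p_multiple = inv_logit(intercept + beta_multiple)

odds_ratio = math.exp(beta_multiple)
risk_ratio = p_multiple / p_single

print(f"P(low weight | single)   = {p_single:.3f}")
print(f"P(low weight | multiple) = {p_multiple:.3f}")
print(f"odds ratio = {odds_ratio:.2f}, risk ratio = {risk_ratio:.2f}")
```

For this profile the predicted probability rises from about 0.29 to about 0.89, a roughly threefold increase in probability, even though the odds are multiplied by about 20.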

2.5 Model Validation

Below we can assess some metrics to validate the proposed model.

The data is split into 80% train and 20% test.

## 
## Call:
## glm(formula = baixo_peso ~ idade_mae + escolaridade + sit_conjugal + 
##     filhos + perda_fetal + raca_cor + tipo_gravidez + consultas, 
##     family = binomial(link = "logit"), data = df_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1023  -0.4427  -0.3526  -0.2872   2.6872  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.879087   0.078291 -11.228  < 2e-16 ***
## idade_mae2      0.003441   0.035029   0.098 0.921750    
## idade_mae3      0.430228   0.051735   8.316  < 2e-16 ***
## escolaridade2   0.059699   0.030363   1.966 0.049278 *  
## sit_conjugal2  -0.114633   0.031520  -3.637 0.000276 ***
## filhos2        -0.513762   0.029252 -17.563  < 2e-16 ***
## filhos3        -0.721727   0.053429 -13.508  < 2e-16 ***
## perda_fetal2    0.341505   0.040090   8.518  < 2e-16 ***
## raca_cor2       0.093035   0.030608   3.040 0.002369 ** 
## tipo_gravidez2  2.990775   0.052576  56.885  < 2e-16 ***
## consultas2     -0.507481   0.068100  -7.452  9.2e-14 ***
## consultas3     -1.279506   0.064567 -19.817  < 2e-16 ***
## consultas4     -1.930784   0.066785 -28.911  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 51000  on 85678  degrees of freedom
## Residual deviance: 46059  on 85666  degrees of freedom
## AIC: 46085
## 
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 19327  1598
##          1   210   284
##                                           
##                Accuracy : 0.9156          
##                  95% CI : (0.9118, 0.9193)
##     No Information Rate : 0.9121          
##     P-Value [Acc > NIR] : 0.03744         
##                                           
##                   Kappa : 0.2102          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.15090         
##             Specificity : 0.98925         
##          Pos Pred Value : 0.57490         
##          Neg Pred Value : 0.92363         
##              Prevalence : 0.08787         
##          Detection Rate : 0.01326         
##    Detection Prevalence : 0.02306         
##       Balanced Accuracy : 0.57008         
##                                           
##        'Positive' Class : 1               
## 

The model has high accuracy (0.916), but this is only slightly above the no-information rate (0.912), and its sensitivity is low (0.151). This can be explained by the unbalanced data and by the limited ability of general socioeconomic variables to predict low birth weight.
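
The statistics in the confusion matrix output follow directly from its four cell counts. The Python sketch below recomputes them; the variable names are illustrative, and class 1 (low birth weight) is taken as the positive class, as stated in the output.

```python
# Cell counts copied from the confusion matrix above
# (rows are predictions, columns are reference labels).
tn, fn = 19327, 1598   # predicted 0: reference 0, reference 1
fp, tp = 210, 284      # predicted 1: reference 0, reference 1
n = tn + fn + fp + tp

sensitivity = tp / (tp + fn)   # share of true cases that were detected
specificity = tn / (tn + fp)   # share of non-cases correctly ruled out
ppv = tp / (tp + fp)           # positive predictive value
npv = tn / (tn + fn)           # negative predictive value
accuracy = (tp + tn) / n
# Accuracy obtained by always predicting the most frequent class.
no_information_rate = max(tn + fp, tp + fn) / n
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginal totals give by chance.
expected = ((tn + fn) * (tn + fp) + (tp + fp) * (tp + fn)) / n**2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [
    ("sensitivity", sensitivity), ("specificity", specificity),
    ("ppv", ppv), ("npv", npv), ("accuracy", accuracy),
    ("no information rate", no_information_rate),
    ("balanced accuracy", balanced_accuracy), ("kappa", kappa),
]:
    print(f"{name:<20} {value:.4f}")
```

Because only about 9% of births are positive, a model that predicted "not low weight" for everyone would already reach the no-information rate, which is why accuracy alone is a weak summary here.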

2.6 ROC Curve and Hosmer-Lemeshow test

The area under the ROC curve (AUC) was 0.704 (95% CI: 0.579-0.731), which is generally regarded as acceptable discrimination.

We also performed the Hosmer-Lemeshow goodness-of-fit test, which showed no evidence of lack of fit (\(\chi^2 = 11.57, df = 8, p = 0.172\)).
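
The p-value of the Hosmer-Lemeshow test is the upper-tail probability of a chi-squared distribution. As a rough check, the Python sketch below uses the closed-form tail that exists when the degrees of freedom are even; the helper name `chi2_sf_even_df` is illustrative, and the input is the rounded statistic printed above, so the result can differ from the reported p-value in the third decimal.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which is valid for even, positive df only.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires an even, positive df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k   # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


# Statistic and degrees of freedom reported for the low birth weight model.
p_value = chi2_sf_even_df(11.57, 8)
print(f"Hosmer-Lemeshow p-value = {p_value:.3f}")
```

A p-value above 0.05 means the test does not reject the hypothesis that observed and predicted event rates agree across the risk groups.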

2.7 Pseudo \(R^2\)

R squared is a useful metric for multiple linear regression, but does not have the same meaning in logistic regression. Statisticians have come up with a variety of analogues of R squared for multiple logistic regression that they refer to collectively as “pseudo R squared”. These do not have the same interpretation, in that they are not simply the proportion of variance explained by the model. In fact, the magnitude of any particular pseudo R squared value can’t be used to compare across datasets. Instead, the primary use for these pseudo R squared values is for comparing multiple models fit to the same dataset. (https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_mult_logistic_gof_pseudo_r_squared.htm)

## $CoxSnell
## [1] 0.05603599
## 
## $Nagelkerke
## [1] 0.1249215
## 
## $McFadden
## [1] 0.09688073
## 
## $Tjur
## [1] 0.09336173
## 
## $sqPearson
## [1] 0.09352899

3 Logistic regression for preterm birth

For preterm birth, logistic regression was also applied using the same variables listed in Section 1.

3.1 The Univariate Model

3.1.1 Mother’s age

## 
## Call:
## glm(formula = prematuro ~ idade_mae, family = binomial(link = "logit"), 
##     data = df_preterm_reg_noNA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4610  -0.4336  -0.3952  -0.3952   2.2753  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.31713    0.02469 -93.846  < 2e-16 ***
## idade_mae2  -0.19330    0.02827  -6.839 7.98e-12 ***
## idade_mae3   0.12900    0.04021   3.208  0.00134 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 59950  on 107063  degrees of freedom
## Residual deviance: 59841  on 107061  degrees of freedom
## AIC: 59847
## 
## Number of Fisher Scoring iterations: 5
##                     OR      2.5 %    97.5 %
## (Intercept) 0.09855611 0.09386987 0.1034099
## idade_mae2  0.82423389 0.77998346 0.8713787
## idade_mae3  1.13769452 1.05125203 1.2307517

3.1.2 Educational Level

## 
## Call:
## glm(formula = prematuro ~ escolaridade, family = binomial(link = "logit"), 
##     data = df_preterm_reg_noNA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4104  -0.4104  -0.4104  -0.4080   2.2483  
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)   -2.44417    0.02002 -122.096   <2e-16 ***
## escolaridade2  0.01245    0.02419    0.515    0.607    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 59950  on 107063  degrees of freedom
## Residual deviance: 59950  on 107062  degrees of freedom
## AIC: 59954
## 
## Number of Fisher Scoring iterations: 5
##                       OR      2.5 %     97.5 %
## (Intercept)   0.08679789 0.08344014 0.09025177
## escolaridade2 1.01253152 0.96576561 1.06180991

3.1.3 Marital Status

## 
## Call:
## glm(formula = prematuro ~ sit_conjugal, family = binomial(link = "logit"), 
##     data = df_preterm_reg_noNA)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.410  -0.410  -0.410  -0.409   2.246  
## 
## Coefficients:
##                Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)   -2.438892   0.020075 -121.492   <2e-16 ***
## sit_conjugal2  0.004716   0.024222    0.195    0.846    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 59950  on 107063  degrees of freedom
## Residual deviance: 59950  on 107062  degrees of freedom
## AIC: 59954
## 
## Number of Fisher Scoring iterations: 5
##                       OR     2.5 %    97.5 %
## (Intercept)   0.08725744 0.0838726 0.0907395
## sit_conjugal2 1.00472734 0.9582542 1.0537032

3.1.4 Number of Children

## 
## Call:
## glm(formula = prematuro ~ filhos, family = binomial(link = "logit"), 
##     data = df_preterm_reg_noNA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4240  -0.4240  -0.3980  -0.3941   2.2778  
## 
## Coefficients:
##             Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -2.36386    0.01537 -153.761  < 2e-16 ***
## filhos2     -0.15268    0.02375   -6.429 1.29e-10 ***
## filhos3     -0.13196    0.04246   -3.108  0.00188 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 59950  on 107063  degrees of freedom
## Residual deviance: 59906  on 107061  degrees of freedom
## AIC: 59912
## 
## Number of Fisher Scoring iterations: 5
##                     OR      2.5 %     97.5 %
## (Intercept) 0.09405669 0.09125335 0.09692182
## filhos2     0.85840090 0.81931608 0.89925803
## filhos3     0.87637280 0.80580062 0.95174739

3.1.5 Fetal Loss

## 
## Call:
## glm(formula = prematuro ~ perda_fetal, family = binomial(link = "logit"), 
##     data = df_preterm_reg_noNA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4783  -0.4018  -0.4018  -0.4018   2.2613  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.47592    0.01202 -206.03   <2e-16 ***
## perda_fetal2  0.36546    0.03396   10.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 59950  on 107063  degrees of freedom
## Residual deviance: 59843  on 107062  degrees of freedom
## AIC: 59847
## 
## Number of Fisher Scoring iterations: 5
##                      OR      2.5 %     97.5 %
## (Intercept)  0.08408567 0.08212188 0.08608294
## perda_fetal2 1.44117756 1.34777026 1.53972137

3.1.6 Race

## 
## Call:
## glm(formula = prematuro ~ raca_cor, family = binomial(link = "logit"), 
##     data = df_preterm_reg_noNA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4187  -0.4187  -0.4187  -0.3925   2.2812  
## 
## Coefficients:
##             Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -2.52496    0.01976 -127.801  < 2e-16 ***
## raca_cor2    0.13449    0.02402    5.599 2.15e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 59950  on 107063  degrees of freedom
## Residual deviance: 59919  on 107062  degrees of freedom
## AIC: 59923
## 
## Number of Fisher Scoring iterations: 5
##                     OR      2.5 %     97.5 %
## (Intercept) 0.08006134 0.07700386 0.08320474
## raca_cor2   1.14395342 1.09146900 1.19923034

3.1.7 Prenatal Consultation

## 
## Call:
## glm(formula = prematuro ~ consultas, family = binomial(link = "logit"), 
##     data = df_preterm_reg_noNA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6860  -0.4406  -0.3344  -0.3344   2.4131  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.32706    0.05450 -24.350  < 2e-16 ***
## consultas2  -0.23600    0.06100  -3.869 0.000109 ***
## consultas3  -0.95614    0.05745 -16.643  < 2e-16 ***
## consultas4  -1.52856    0.05741 -26.626  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 59950  on 107063  degrees of freedom
## Residual deviance: 58104  on 107060  degrees of freedom
## AIC: 58112
## 
## Number of Fisher Scoring iterations: 5
##                    OR     2.5 %    97.5 %
## (Intercept) 0.2652553 0.2381125 0.2948391
## consultas2  0.7897782 0.7013384 0.8908298
## consultas3  0.3843738 0.3437674 0.4306184
## consultas4  0.2168474 0.1939551 0.2429168

3.2 The Multivariate Model for Preterm Birth

For the multivariate model, only Marital Status was excluded, because it was not statistically significant in the univariate analysis (p = 0.846).

The results of the stepwise logistic regression are shown below. Both forward and backward approaches were performed, and the model with the smallest AIC was chosen as the final model.

## Start:  AIC=53961.79
## prematuro ~ idade_mae + escolaridade + filhos + perda_fetal + 
##     raca_cor + tipo_gravidez + consultas
## 
##                 Df Deviance   AIC
## <none>                53938 53962
## - raca_cor       1    53948 53970
## - escolaridade   1    54031 54053
## - perda_fetal    1    54036 54058
## - idade_mae      2    54059 54079
## - filhos         2    54237 54257
## - consultas      3    56203 56221
## - tipo_gravidez  1    57416 57438
## 
## Call:
## glm(formula = prematuro ~ idade_mae + escolaridade + filhos + 
##     perda_fetal + raca_cor + tipo_gravidez + consultas, family = binomial(link = "logit"), 
##     data = df_preterm_reg_noNA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0646  -0.4043  -0.3313  -0.2748   2.8145  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.120657   0.067285 -16.655  < 2e-16 ***
## idade_mae2      0.001097   0.032320   0.034  0.97291    
## idade_mae3      0.426788   0.047169   9.048  < 2e-16 ***
## escolaridade2   0.271750   0.028442   9.554  < 2e-16 ***
## filhos2        -0.382333   0.026877 -14.225  < 2e-16 ***
## filhos3        -0.705726   0.050715 -13.915  < 2e-16 ***
## perda_fetal2    0.377500   0.036925  10.224  < 2e-16 ***
## raca_cor2      -0.087524   0.027612  -3.170  0.00153 ** 
## tipo_gravidez2  2.917801   0.046687  62.496  < 2e-16 ***
## consultas2     -0.398325   0.063140  -6.309 2.82e-10 ***
## consultas3     -1.255241   0.060224 -20.843  < 2e-16 ***
## consultas4     -2.027502   0.062304 -32.542  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 59950  on 107063  degrees of freedom
## Residual deviance: 53938  on 107052  degrees of freedom
## AIC: 53962
## 
## Number of Fisher Scoring iterations: 5

3.3 Determining the OR for the multivariate model

##                        OR      2.5 %     97.5 %
## (Intercept)     0.3260655  0.2855332  0.3717259
## idade_mae2      1.0010981  0.9398070  1.0667582
## idade_mae3      1.5323273  1.3967589  1.6804676
## escolaridade2   1.3122589  1.2412270  1.3876358
## filhos2         0.6822679  0.6472162  0.7191321
## filhos3         0.4937501  0.4467480  0.5450230
## perda_fetal2    1.4586331  1.3562305  1.5674800
## raca_cor2       0.9161970  0.8680044  0.9672372
## tipo_gravidez2 18.5005643 16.8848844 20.2763580
## consultas2      0.6714439  0.5937498  0.7605306
## consultas3      0.2850073  0.2535039  0.3210220
## consultas4      0.1316640  0.1166239  0.1488947

3.4 Model Validation

Below we can assess some metrics to validate the proposed model.

Bootstrap was used for internal validation.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 97860  7692
##          1   586   926
##                                           
##                Accuracy : 0.9227          
##                  95% CI : (0.9211, 0.9243)
##     No Information Rate : 0.9195          
##     P-Value [Acc > NIR] : 6.266e-05       
##                                           
##                   Kappa : 0.1627          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.107450        
##             Specificity : 0.994047        
##          Pos Pred Value : 0.612434        
##          Neg Pred Value : 0.927126        
##              Prevalence : 0.080494        
##          Detection Rate : 0.008649        
##    Detection Prevalence : 0.014122        
##       Balanced Accuracy : 0.550749        
##                                           
##        'Positive' Class : 1               
## 

3.5 ROC Curve

This is the ROC curve for the model applied to predict preterm birth in the complete data frame.

Using bootstrap resampling, the mean AUC was 0.702 (95% CI: 0.700-0.704, p < 0.001).

We also performed the Hosmer-Lemeshow goodness-of-fit test, which showed no evidence of lack of fit (\(\chi^2 = 0.85, df = 1, p = 0.356\)).

Thus, we can consider that the model has a good fit.
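
With one degree of freedom, the chi-squared tail probability reduces to the complementary error function, so the Hosmer-Lemeshow p-value for the preterm model can be checked in a few lines of Python. The helper name `chi2_sf_df1` is illustrative, and the input is the rounded statistic printed above, so the last digit may differ slightly from the reported value.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability P(X > x) for a chi-squared variable with df = 1.

    If Z is standard normal, Z^2 is chi-squared with 1 df, so
    P(Z^2 > x) = 2 * P(Z > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2.0))


# Statistic reported for the preterm birth model.
p_value = chi2_sf_df1(0.85)
print(f"Hosmer-Lemeshow p-value = {p_value:.3f}")
```

Under the usual rule that the test has (number of groups − 2) degrees of freedom, df = 1 corresponds to only three risk groups, so this test has limited power to detect lack of fit.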

4 Statistical Analyses

A chi-squared test was used to compare variables between the low birth weight and non-low birth weight groups, as well as between the preterm and non-preterm birth groups. After the univariate analysis, the variables presenting significance were chosen for the multivariable logistic regression model. A stepwise approach, in both directions (forward and backward), was performed to identify the variables for the final model. The model with the lowest AIC was chosen. Odds ratios (OR) were calculated for the univariate models and, adjusted, for the multivariable model, together with 95% confidence intervals and significance levels.

Model validation was performed with bootstrap resampling, drawing 1,000 random bootstrap samples of size 10,000 from the original data set, with replacement. The area under the receiver operating characteristic (ROC) curve, sensitivity, and specificity were calculated. Statistical analysis was conducted in R [1].
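
The bootstrap procedure described above can be sketched as follows. This is a minimal Python illustration on synthetic data, not the code used for the report (which was written in R): the function names `auc` and `bootstrap_auc`, the percentile interval, the reduced number of resamples, and the simulated labels and scores are all assumptions made for the example.

```python
import random


def auc(labels, scores):
    """Probability that a random positive scores above a random negative
    (ties count one half); this equals the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels, scores, n_boot=1000, sample_size=None, seed=0):
    """Mean AUC and percentile 95% interval over bootstrap resamples."""
    if len(set(labels)) < 2:
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    n = len(labels)
    size = sample_size or n
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(size)]   # with replacement
        y = [labels[i] for i in idx]
        if 0 < sum(y) < size:   # AUC needs both classes in the resample
            values.append(auc(y, [scores[i] for i in idx]))
    values.sort()
    mean = sum(values) / n_boot
    return mean, values[int(0.025 * n_boot)], values[int(0.975 * n_boot) - 1]


# Synthetic example: higher scores are more likely to be positives.
rng = random.Random(42)
scores = [rng.random() for _ in range(200)]
labels = [1 if rng.random() < 0.05 + 0.3 * s else 0 for s in scores]

mean_auc, low, high = bootstrap_auc(labels, scores, n_boot=200)
print(f"mean AUC = {mean_auc:.3f} (95% percentile interval: {low:.3f}-{high:.3f})")
```

The pairwise AUC used here is quadratic in the sample size; for resamples of 10,000 records a rank-based formula or a dedicated library routine would be the more practical choice.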

5 Describing the results from the models

6 References

[1] R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL <https://www.R-project.org/>.