Background

In this project, i use the 2018 Natality data file for my predictive model. The outcome of interest is birth outcome(preterm birth). Mothers with singleton birth are classified based on whether they have preterm baby or not.

Data Manupulation For Analysis

natality <-read_dta ("Natality-18.dta")
natality18 <- natality %>% 
  select(age,race,ptb_singleton,bmi,year,bw,edu,wic,rrural_rgrp,tbo_rec,sex,previs) %>% 
  mutate( ageg= as.numeric(age), age2=age^2, ptb_singleton= as.factor(ifelse(ptb_singleton=="Yes" , 1, ifelse(ptb_singleton=="No", 0, NA) ))) %>% 
  filter(complete.cases(.))

Creating Training and Test Dataset

Here, I split the data into two sets, a training set and a test set. The training set will be used to estimate the model parameters, and the test set will be used to validate the model’s predictive ability.

I use an 80% training fraction.

# training the data 

set.seed(1115)
train<- createDataPartition(y = natality18$ptb_singleton,
                            p = .80,
                            list=F)

model.dat2train<-natality18[train,]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
model.dat2test<-natality18[-train,]

table(model.dat2train$ptb_singleton)
## 
##       0       1 
## 2790325  240700
prop.table(table(model.dat2train$ptb_singleton))
## 
##          0          1 
## 0.92058792 0.07941208

Fitting Logistic Regresion Model

Here, I fit a logistic regression model to estimate the probability of a woman having a preterm birth. The predicting variable include; age, rurality, educ, Body Mass Index, number of prenatal visits, the weight of the baby at birth, Participation in WIC program, total birth order, sex of the baby.

glm1<-glm(ptb_singleton~factor(sex)+scale(age2)+scale(ageg)+ scale(bmi)+scale(previs)+factor(edu)+scale(bw)+factor(wic)+factor(rrural_rgrp)+scale(tbo_rec),
          data=model.dat2train[,-1],
          family = binomial)
summary(glm1)
## 
## Call:
## glm(formula = ptb_singleton ~ factor(sex) + scale(age2) + scale(ageg) + 
##     scale(bmi) + scale(previs) + factor(edu) + scale(bw) + factor(wic) + 
##     factor(rrural_rgrp) + scale(tbo_rec), family = binomial, 
##     data = model.dat2train[, -1])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3409  -0.3390  -0.2223  -0.1382   5.6873  
## 
## Coefficients:
##                               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                  -3.593877   0.022063 -162.890  < 2e-16 ***
## factor(sex)M                  0.405233   0.005094   79.558  < 2e-16 ***
## scale(age2)                   0.260805   0.019793   13.177  < 2e-16 ***
## scale(ageg)                  -0.218225   0.020279  -10.761  < 2e-16 ***
## scale(bmi)                    0.109590   0.002158   50.792  < 2e-16 ***
## scale(previs)                -0.049763   0.002429  -20.486  < 2e-16 ***
## factor(edu)< HS               0.173706   0.022726    7.643 2.12e-14 ***
## factor(edu)HS degree          0.126805   0.022226    5.705 1.16e-08 ***
## factor(edu)More than HS       0.026277   0.021919    1.199    0.231    
## scale(bw)                    -1.557507   0.002766 -563.028  < 2e-16 ***
## factor(wic)Yes               -0.060604   0.005660  -10.708  < 2e-16 ***
## factor(rrural_rgrp)Non-Metro  0.100892   0.007303   13.815  < 2e-16 ***
## scale(tbo_rec)                0.095943   0.002599   36.912  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1681195  on 3031024  degrees of freedom
## Residual deviance: 1154394  on 3031012  degrees of freedom
## AIC: 1154420
## 
## Number of Fisher Scoring iterations: 6

All the predictors are significantly related to our outcome except for education level that is higher than high school (More than HS).

Model Performance

Here, I see how the model performs in terms of accuracy of prediction. Then show the estimated probability that each of these women had preterm birth, based on the model.

tr_pred<- predict(glm1,
                  newdata = model.dat2train,
                  type = "response")

head(tr_pred)
##           1           2           3           4           5           6 
## 0.267614907 0.014005693 0.124610458 0.031724084 0.005767668 0.802684067

##Classification of the Training Data

I create classes (had Preterm birth vs doesn’t have Preterm birth)Using the .5 decision rule, then classify the observation as women that had Preterm birth vs women who do not have Preterm birth.

tr_predcl<-factor(ifelse(tr_pred>.5, 1, 0))

pred1<-data.frame(pr=tr_pred, gr=tr_predcl, modcon=model.dat2train$ptb_singleton)

pred1%>%
  ggplot()+
  geom_histogram(aes(x=pr, color=gr, fill=gr))+
  ggtitle(label = "Probability of Having Preterm Birth", subtitle = "Threshold = .5")+
  geom_vline(xintercept=.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pred1%>%
  ggplot()+
  geom_histogram(aes(x=pr, color=modcon, fill=modcon))+
  ggtitle(label = "Probability of Having Preterm Birth", subtitle = "Truth")+
  geom_vline(xintercept=.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Overall Accuracy of Train data

This gives the overall accuracy of the train data using the .5 decision rule, sensitivity and specificity.

cm1<-confusionMatrix(data = tr_predcl,
                     reference = model.dat2train$ptb_singleton )
cm1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction       0       1
##          0 2759556  176855
##          1   30769   63845
##                                           
##                Accuracy : 0.9315          
##                  95% CI : (0.9312, 0.9318)
##     No Information Rate : 0.9206          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3518          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9890          
##             Specificity : 0.2652          
##          Pos Pred Value : 0.9398          
##          Neg Pred Value : 0.6748          
##              Prevalence : 0.9206          
##          Detection Rate : 0.9104          
##    Detection Prevalence : 0.9688          
##       Balanced Accuracy : 0.6271          
##                                           
##        'Positive' Class : 0               
## 

Overall the model has a 93.15% accuracy. It has 98% sensitivity,and 26% Specificity. In other word, the model is very good at predicting women who do not have preterm birth but not so good at and women who had preterm birth

Here, i use the mean of the response as the cutoff value or deceison rule.

tr_predcl<-factor(ifelse(tr_pred>mean(I(model.dat2train$ptb_singleton==1)), 1, 0)) #mean of response

pred2<-data.frame(pr=tr_pred, gr=tr_predcl, modcon=model.dat2train$ptb_singleton)

pred2%>%
  ggplot(aes(x=pr, fill=gr))+
  geom_histogram(position="identity", alpha=.2)+
  ggtitle(label = "Probability of Having Preterm Birth", subtitle = "Threshold = Mean")+
  geom_vline(xintercept=mean(I(model.dat2train$ptb_singleton==1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pred2%>%
  ggplot(aes(x=pr, fill=modcon))+
  geom_histogram(position="identity", alpha=.2)+
  ggtitle(label = "Probability of Having Preterm Birth", subtitle = "Truth")+
  geom_vline(xintercept=mean(I(model.dat2train$ptb_singleton==1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

confusionMatrix(data = tr_predcl,
                model.dat2train$ptb_singleton,
                positive = "1" )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction       0       1
##          0 2284787   46891
##          1  505538  193809
##                                           
##                Accuracy : 0.8177          
##                  95% CI : (0.8173, 0.8182)
##     No Information Rate : 0.9206          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3336          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.80519         
##             Specificity : 0.81882         
##          Pos Pred Value : 0.27713         
##          Neg Pred Value : 0.97989         
##              Prevalence : 0.07941         
##          Detection Rate : 0.06394         
##    Detection Prevalence : 0.23073         
##       Balanced Accuracy : 0.81201         
##                                           
##        'Positive' Class : 1               
## 

Changing the decision rule gave a more accurate prediction of the outcome classes. Overall the model has a 81.77% accuracy. It has 80.5% sensitivity,and 81.88% Specificity. In other word, the model is very good at predicting women that had preterm birth and women who do not have preterm birth.

Overall Accuracy of Test Data

This gives the overall accuracy of the test data using the .5 decision rule, sensitivity and specificity.

pred_test<-predict(glm1,
                   newdata=model.dat2test,
                   type="response")

predcl<-factor(ifelse(pred_test>.5, 1, 0))


pred_c<-data.frame(pr=pred_test, gr=predcl, modcon=model.dat2test$ptb_singleton)


cm2<-confusionMatrix(data = predcl,
                     reference = model.dat2test$ptb_singleton )
cm2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 689872  44164
##          1   7709  16010
##                                          
##                Accuracy : 0.9315         
##                  95% CI : (0.931, 0.9321)
##     No Information Rate : 0.9206         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3526         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9889         
##             Specificity : 0.2661         
##          Pos Pred Value : 0.9398         
##          Neg Pred Value : 0.6750         
##              Prevalence : 0.9206         
##          Detection Rate : 0.9104         
##    Detection Prevalence : 0.9687         
##       Balanced Accuracy : 0.6275         
##                                          
##        'Positive' Class : 0              
## 

Overall, the model has a 93.15% accuracy just as the train data. Also as observed in the train data, It has 98% sensitivity,and 26% Specificity. In other word, the model is very good at predicting women who do not have preterm birth but not so good at and women who had preterm birth

Here, i use the mean of the response as the cutoff value or decision rule for the test data.

pred_test<-predict(glm1,
                   newdata=model.dat2test,
                   type="response")

pred_cl<-factor(ifelse(pred_test > mean( I(model.dat2test$ptb_singleton==1)), 1, 0))

table(model.dat2test$ptb_singleton,pred_cl)
##    pred_cl
##          0      1
##   0 570921 126660
##   1  11735  48439
confusionMatrix(data = pred_cl,model.dat2test$ptb_singleton)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 570921  11735
##          1 126660  48439
##                                           
##                Accuracy : 0.8174          
##                  95% CI : (0.8165, 0.8182)
##     No Information Rate : 0.9206          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3329          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8184          
##             Specificity : 0.8050          
##          Pos Pred Value : 0.9799          
##          Neg Pred Value : 0.2766          
##              Prevalence : 0.9206          
##          Detection Rate : 0.7534          
##    Detection Prevalence : 0.7689          
##       Balanced Accuracy : 0.8117          
##                                           
##        'Positive' Class : 0               
## 

Overall the model has a 81.77% accuracy. It has 81.84% sensitivity,and 80.50% Specificity. In other word, the model is very good at predicting women that had preterm birth and women who do not have preterm birth. In general, the model does about as well as it did on the training data.