In this project, i use the 2018 Natality data file for my predictive model. The outcome of interest is birth outcome(preterm birth). Mothers with singleton birth are classified based on whether they have preterm baby or not.
natality <-read_dta ("Natality-18.dta")
natality18 <- natality %>%
select(age,race,ptb_singleton,bmi,year,bw,edu,wic,rrural_rgrp,tbo_rec,sex,previs) %>%
mutate( ageg= as.numeric(age), age2=age^2, ptb_singleton= as.factor(ifelse(ptb_singleton=="Yes" , 1, ifelse(ptb_singleton=="No", 0, NA) ))) %>%
filter(complete.cases(.))
Here, I split the data into two sets, a training set and a test set. The training set will be used to estimate the model parameters, and the test set will be used to validate the model’s predictive ability.
I use an 80% training fraction.
# training the data
set.seed(1115)
train<- createDataPartition(y = natality18$ptb_singleton,
p = .80,
list=F)
model.dat2train<-natality18[train,]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
model.dat2test<-natality18[-train,]
table(model.dat2train$ptb_singleton)
##
## 0 1
## 2790325 240700
prop.table(table(model.dat2train$ptb_singleton))
##
## 0 1
## 0.92058792 0.07941208
Here, I fit a logistic regression model to estimate the probability of a woman having a preterm birth. The predicting variable include; age, rurality, educ, Body Mass Index, number of prenatal visits, the weight of the baby at birth, Participation in WIC program, total birth order, sex of the baby.
glm1<-glm(ptb_singleton~factor(sex)+scale(age2)+scale(ageg)+ scale(bmi)+scale(previs)+factor(edu)+scale(bw)+factor(wic)+factor(rrural_rgrp)+scale(tbo_rec),
data=model.dat2train[,-1],
family = binomial)
summary(glm1)
##
## Call:
## glm(formula = ptb_singleton ~ factor(sex) + scale(age2) + scale(ageg) +
## scale(bmi) + scale(previs) + factor(edu) + scale(bw) + factor(wic) +
## factor(rrural_rgrp) + scale(tbo_rec), family = binomial,
## data = model.dat2train[, -1])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3409 -0.3390 -0.2223 -0.1382 5.6873
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.593877 0.022063 -162.890 < 2e-16 ***
## factor(sex)M 0.405233 0.005094 79.558 < 2e-16 ***
## scale(age2) 0.260805 0.019793 13.177 < 2e-16 ***
## scale(ageg) -0.218225 0.020279 -10.761 < 2e-16 ***
## scale(bmi) 0.109590 0.002158 50.792 < 2e-16 ***
## scale(previs) -0.049763 0.002429 -20.486 < 2e-16 ***
## factor(edu)< HS 0.173706 0.022726 7.643 2.12e-14 ***
## factor(edu)HS degree 0.126805 0.022226 5.705 1.16e-08 ***
## factor(edu)More than HS 0.026277 0.021919 1.199 0.231
## scale(bw) -1.557507 0.002766 -563.028 < 2e-16 ***
## factor(wic)Yes -0.060604 0.005660 -10.708 < 2e-16 ***
## factor(rrural_rgrp)Non-Metro 0.100892 0.007303 13.815 < 2e-16 ***
## scale(tbo_rec) 0.095943 0.002599 36.912 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1681195 on 3031024 degrees of freedom
## Residual deviance: 1154394 on 3031012 degrees of freedom
## AIC: 1154420
##
## Number of Fisher Scoring iterations: 6
All the predictors are significantly related to our outcome except for education level that is higher than high school (More than HS).
Here, I see how the model performs in terms of accuracy of prediction. Then show the estimated probability that each of these women had preterm birth, based on the model.
tr_pred<- predict(glm1,
newdata = model.dat2train,
type = "response")
head(tr_pred)
## 1 2 3 4 5 6
## 0.267614907 0.014005693 0.124610458 0.031724084 0.005767668 0.802684067
##Classification of the Training Data
I create classes (had Preterm birth vs doesn’t have Preterm birth)Using the .5 decision rule, then classify the observation as women that had Preterm birth vs women who do not have Preterm birth.
tr_predcl<-factor(ifelse(tr_pred>.5, 1, 0))
pred1<-data.frame(pr=tr_pred, gr=tr_predcl, modcon=model.dat2train$ptb_singleton)
pred1%>%
ggplot()+
geom_histogram(aes(x=pr, color=gr, fill=gr))+
ggtitle(label = "Probability of Having Preterm Birth", subtitle = "Threshold = .5")+
geom_vline(xintercept=.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pred1%>%
ggplot()+
geom_histogram(aes(x=pr, color=modcon, fill=modcon))+
ggtitle(label = "Probability of Having Preterm Birth", subtitle = "Truth")+
geom_vline(xintercept=.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This gives the overall accuracy of the train data using the .5 decision rule, sensitivity and specificity.
cm1<-confusionMatrix(data = tr_predcl,
reference = model.dat2train$ptb_singleton )
cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2759556 176855
## 1 30769 63845
##
## Accuracy : 0.9315
## 95% CI : (0.9312, 0.9318)
## No Information Rate : 0.9206
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3518
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9890
## Specificity : 0.2652
## Pos Pred Value : 0.9398
## Neg Pred Value : 0.6748
## Prevalence : 0.9206
## Detection Rate : 0.9104
## Detection Prevalence : 0.9688
## Balanced Accuracy : 0.6271
##
## 'Positive' Class : 0
##
Overall the model has a 93.15% accuracy. It has 98% sensitivity,and 26% Specificity. In other word, the model is very good at predicting women who do not have preterm birth but not so good at and women who had preterm birth
Here, i use the mean of the response as the cutoff value or deceison rule.
tr_predcl<-factor(ifelse(tr_pred>mean(I(model.dat2train$ptb_singleton==1)), 1, 0)) #mean of response
pred2<-data.frame(pr=tr_pred, gr=tr_predcl, modcon=model.dat2train$ptb_singleton)
pred2%>%
ggplot(aes(x=pr, fill=gr))+
geom_histogram(position="identity", alpha=.2)+
ggtitle(label = "Probability of Having Preterm Birth", subtitle = "Threshold = Mean")+
geom_vline(xintercept=mean(I(model.dat2train$ptb_singleton==1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pred2%>%
ggplot(aes(x=pr, fill=modcon))+
geom_histogram(position="identity", alpha=.2)+
ggtitle(label = "Probability of Having Preterm Birth", subtitle = "Truth")+
geom_vline(xintercept=mean(I(model.dat2train$ptb_singleton==1)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
confusionMatrix(data = tr_predcl,
model.dat2train$ptb_singleton,
positive = "1" )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2284787 46891
## 1 505538 193809
##
## Accuracy : 0.8177
## 95% CI : (0.8173, 0.8182)
## No Information Rate : 0.9206
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3336
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.80519
## Specificity : 0.81882
## Pos Pred Value : 0.27713
## Neg Pred Value : 0.97989
## Prevalence : 0.07941
## Detection Rate : 0.06394
## Detection Prevalence : 0.23073
## Balanced Accuracy : 0.81201
##
## 'Positive' Class : 1
##
Changing the decision rule gave a more accurate prediction of the outcome classes. Overall the model has a 81.77% accuracy. It has 80.5% sensitivity,and 81.88% Specificity. In other word, the model is very good at predicting women that had preterm birth and women who do not have preterm birth.
This gives the overall accuracy of the test data using the .5 decision rule, sensitivity and specificity.
pred_test<-predict(glm1,
newdata=model.dat2test,
type="response")
predcl<-factor(ifelse(pred_test>.5, 1, 0))
pred_c<-data.frame(pr=pred_test, gr=predcl, modcon=model.dat2test$ptb_singleton)
cm2<-confusionMatrix(data = predcl,
reference = model.dat2test$ptb_singleton )
cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 689872 44164
## 1 7709 16010
##
## Accuracy : 0.9315
## 95% CI : (0.931, 0.9321)
## No Information Rate : 0.9206
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3526
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9889
## Specificity : 0.2661
## Pos Pred Value : 0.9398
## Neg Pred Value : 0.6750
## Prevalence : 0.9206
## Detection Rate : 0.9104
## Detection Prevalence : 0.9687
## Balanced Accuracy : 0.6275
##
## 'Positive' Class : 0
##
Overall, the model has a 93.15% accuracy just as the train data. Also as observed in the train data, It has 98% sensitivity,and 26% Specificity. In other word, the model is very good at predicting women who do not have preterm birth but not so good at and women who had preterm birth
Here, i use the mean of the response as the cutoff value or decision rule for the test data.
pred_test<-predict(glm1,
newdata=model.dat2test,
type="response")
pred_cl<-factor(ifelse(pred_test > mean( I(model.dat2test$ptb_singleton==1)), 1, 0))
table(model.dat2test$ptb_singleton,pred_cl)
## pred_cl
## 0 1
## 0 570921 126660
## 1 11735 48439
confusionMatrix(data = pred_cl,model.dat2test$ptb_singleton)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 570921 11735
## 1 126660 48439
##
## Accuracy : 0.8174
## 95% CI : (0.8165, 0.8182)
## No Information Rate : 0.9206
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3329
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8184
## Specificity : 0.8050
## Pos Pred Value : 0.9799
## Neg Pred Value : 0.2766
## Prevalence : 0.9206
## Detection Rate : 0.7534
## Detection Prevalence : 0.7689
## Balanced Accuracy : 0.8117
##
## 'Positive' Class : 0
##
Overall the model has a 81.77% accuracy. It has 81.84% sensitivity,and 80.50% Specificity. In other word, the model is very good at predicting women that had preterm birth and women who do not have preterm birth. In general, the model does about as well as it did on the training data.