This task delves into the application of generalized linear models (GLM), particularly logistic regression, in analyzing categorical data related to maternal healthcare.
The dataset under scrutiny contains comprehensive information regarding maternal health care, encompassing various factors such as province, education level, wealth status, insurance coverage, media exposure, maternal age, ethnic diversity, desire for pregnancy, hospital delivery, recent childbirth history, child mortality, decision-making dynamics, occupation, household demographics, and socio-cultural attributes.
The primary objective is to develop a robust classification model using logistic regression, leveraging the aforementioned variables or a judiciously selected subset thereof.
The performance of the model will be meticulously assessed using the Receiver Operating Characteristic (ROC) curve, which provides insights into the trade-off between true positive rate and false positive rate.
Through this analysis, the task aims to identify the most effective model for predicting maternal health outcomes, thereby contributing to the advancement of healthcare planning and policy formulation. The methodology draws upon established literature on logistic regression (Hosmer Jr et al., 2013; Agresti, 2002) and ROC analysis (Fawcett, 2006; Zweig & Campbell, 1993).
It seems like you’re outlining the methods you plan to use for your analysis. Let me provide a brief explanation of each:
Logistic Regression: This is a statistical method used for modeling the probability of a binary outcome. It’s commonly used when the dependent variable is dichotomous (e.g., success/failure, yes/no). Logistic regression estimates the probability that a given observation belongs to one of the categories based on one or more independent variables. It’s useful for understanding the relationship between the independent variables and the probability of a particular outcome.
Receiver Operating Characteristic (ROC):
ROC analysis is a method for evaluating the performance of a classification model. It plots the true positive rate (sensitivity) against the false positive rate (1-specificity) for different threshold values. The area under the ROC curve (AUC) is often used as a measure of how well the model can distinguish between the classes. A higher AUC indicates better discrimination between positive and negative cases.
In your case, you plan to use ROC analysis to evaluate the accuracy of your logistic regression model. By examining the ROC curve and calculating the AUC, you can assess how well the model predicts the binary outcome. This can help you determine the effectiveness of the model and choose the most accurate predictive model among alternatives.
In this section the following will be covered:
Use of caret library and CreateDatapartition method to split data into train and testing sets 80/20 rule will be used
Fit a model based on the training set
Predict how the fitted model performs on the testing set
Establish a cut off to help classify the outcome as either positive or negative
Apply an ROC curve and confusion matrix on the predicted and the observed values
library(readr)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(gridExtra)
amhc <- read_csv("amhc.csv")
## New names:
## • `` -> `...53`
## • `` -> `...54`
## • `` -> `...55`
## • `` -> `...56`
## • `` -> `...57`
## • `` -> `...58`
## • `` -> `...59`
## • `` -> `...60`
## • `` -> `...61`
## • `` -> `...62`
## • `` -> `...63`
## • `` -> `...64`
## • `` -> `...65`
## • `` -> `...66`
## • `` -> `...67`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 3621 Columns: 69
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (35): CASEID, V000, province, county, peduc, educ, wealth, insurance, me...
## dbl (18): V001, V002, V003, Aanc, Adc, ethinicdiv, V012, mage1, hospdelivery...
## num (1): Rhospdel
## lgl (15): ...53, ...54, ...55, ...56, ...57, ...58, ...59, ...60, ...61, ......
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_index=createDataPartition(amhc$Aanc,p=0.8,list=FALSE)
train_data=amhc[train_index,]
test_data=amhc[-train_index,]
head(test_data)
## # A tibble: 6 × 69
## CASEID V000 V001 V002 V003 province county Aanc Adc peduc educ wealth
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 28 1 KE5 1 28 1 Nyanza Migori 1 1 Prim… Prim… Poore…
## 2 1138 2 KE5 1 138 2 Nyanza Migori 0 0 Prim… No e… Middle
## 3 1149 2 KE5 1 149 2 Nyanza Migori 0 0 Prim… Prim… Poore…
## 4 2 33 2 KE5 2 33 2 Eastern Macha… 0 1 Prim… Seco… Riche…
## 5 2 63 5 KE5 2 63 5 Eastern Macha… 0 0 Prim… Seco… Richer
## 6 3 96 2 KE5 3 96 2 Eastern Meru 0 1 Prim… Prim… Middle
## # ℹ 57 more variables: insurance <chr>, mediaexpo <chr>, mage <chr>,
## # ethinicdiv <dbl>, NpropHig <chr>, comhospdel <chr>, desirepreg <chr>,
## # pregcomplication <chr>, V012 <dbl>, mage1 <dbl>, hospdelivery <dbl>,
## # blast5yrs <chr>, blastyr <chr>, childunder5 <chr>, decisionm <chr>,
## # occupation <chr>, nchilddead <chr>, childeverborn <chr>, V005 <dbl>,
## # Rhospdel <dbl>, AMHC <chr>, age <chr>, region <chr>, residence <chr>,
## # religion <chr>, ethnic <chr>, sexhhead <chr>, V152 <dbl>, mstatus <chr>, …
table <- table(amhc$province,amhc$Aanc,amhc$Adc)
# Set up the layout for the plots
par(mfrow=c(1,3))
# Histogram for median household size
hist1 <- hist(amhc$medianhhsize, main = "Median Household Size",col='black')
# Histogram for household wealth distribution
hist2 <- hist(amhc$propHighwealth, main = "Proportion of High Wealth",col='red')
# Distribution of household headed by females
hist3 <- hist(amhc$pfemaleheaded, main = "Proportion of Female Headed",col='green')
# Set up the layout for the plots
par(mfrow=c(1,2))
# Calculate frequencies for Aanc access
freq_Aanc <- table(amhc$Aanc)
# Plot bar chart for proportion of Aanc access with frequencies
bar1 <- barplot(freq_Aanc, main = "Proportion of Aanc access", col='black', ylab = "Frequency")
# Calculate frequencies for access to Adc
freq_Adc <- table(amhc$Adc)
# Plot bar chart for proportion of access to Adc with frequencies
bar2 <- barplot(freq_Adc, main = "Proportion of access to Adc", col='red', ylab = "Frequency")
model <- glm(Aanc ~ province+peduc+educ+wealth+county+insurance+mediaexpo
+mage+Nethinic+ethnicity+NpropHig+comhospdel+desirepreg+pregcomplication+Npfemale+Nmedianh+Npmoved, data = amhc, family = binomial(link = "logit"))
summary(model)
##
## Call:
## glm(formula = Aanc ~ province + peduc + educ + wealth + county +
## insurance + mediaexpo + mage + Nethinic + ethnicity + NpropHig +
## comhospdel + desirepreg + pregcomplication + Npfemale + Nmedianh +
## Npmoved, family = binomial(link = "logit"), data = amhc)
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.52136 196.97046 0.058 0.953356
## provinceCoast -0.64386 0.54879 -1.173 0.240700
## provinceEastern -0.60123 0.65066 -0.924 0.355470
## provinceNairobi 0.09872 0.37460 0.264 0.792138
## provinceNortheastern 0.52572 0.62502 0.841 0.400273
## provinceNyanza 0.36988 0.45594 0.811 0.417223
## provinceRift Valley -0.46928 0.53722 -0.874 0.382371
## provinceWestern 0.55273 0.49273 1.122 0.261961
## peducNo education -0.51933 0.22479 -2.310 0.020874 *
## peducPrimary -0.25127 0.16274 -1.544 0.122587
## peducSecondary -0.15874 0.15440 -1.028 0.303897
## educNo education -0.86615 0.25126 -3.447 0.000566 ***
## educPrimary -0.81128 0.19303 -4.203 2.64e-05 ***
## educSecondary -0.65328 0.18647 -3.503 0.000459 ***
## wealthPoorer 0.12679 0.13631 0.930 0.352277
## wealthPoorest 0.12836 0.14594 0.880 0.379098
## wealthRicher 0.21632 0.14362 1.506 0.132024
## wealthRichest 0.25726 0.19156 1.343 0.179280
## countyBomet -0.07308 0.52839 -0.138 0.890001
## countyBungoma -0.85320 0.39299 -2.171 0.029928 *
## countyBusia -0.60338 0.37309 -1.617 0.105824
## countyElgeyo Marakwet -0.06051 0.48206 -0.126 0.900106
## countyEmbu 0.52357 0.69117 0.758 0.448740
## countyGarissa 0.81946 0.38890 2.107 0.035108 *
## countyHoma Bay -0.07721 0.27375 -0.282 0.777903
## countyIsiolo 1.52210 0.69356 2.195 0.028192 *
## countyKajiado 0.30957 0.59629 0.519 0.603648
## countyKakamega -0.79816 0.35141 -2.271 0.023126 *
## countyKericho -0.13253 0.47617 -0.278 0.780771
## countyKiambu -0.09773 0.38968 -0.251 0.801966
## countyKilifi 0.84202 0.39989 2.106 0.035235 *
## countyKirinyaga -0.16758 0.43922 -0.382 0.702812
## countyKisii -0.91216 0.46390 -1.966 0.049266 *
## countyKisumu -0.14561 0.31351 -0.464 0.642328
## countyKitui 0.42420 0.64131 0.661 0.508318
## countyKwale -0.04130 0.46745 -0.088 0.929596
## countyLaikipia 1.76503 0.61799 2.856 0.004289 **
## countyLamu 0.96261 0.50458 1.908 0.056425 .
## countyMachakos 0.53059 0.64826 0.818 0.413081
## countyMakueni 0.28339 0.69800 0.406 0.684742
## countyMandera -0.22291 0.46091 -0.484 0.628653
## countyMarsabit 0.24180 0.72239 0.335 0.737840
## countyMeru -0.68634 0.44961 -1.527 0.126876
## countyMigori -0.16083 0.33528 -0.480 0.631442
## countyMombasa 0.99133 0.45244 2.191 0.028447 *
## countyMuranga -0.42249 0.42829 -0.986 0.323905
## countyNairobi NA NA NA NA
## countyNakuru 0.67211 0.53625 1.253 0.210081
## countyNandi 0.87538 0.50053 1.749 0.080310 .
## countyNarok 1.04013 0.52730 1.973 0.048545 *
## countyNyamira -0.87820 0.54820 -1.602 0.109161
## countyNyandarua -0.52309 0.45274 -1.155 0.247934
## countyNyeri NA NA NA NA
## countySamburu 0.90319 0.53862 1.677 0.093572 .
## countySiaya NA NA NA NA
## countyTaita Taveta 0.97494 0.55435 1.759 0.078624 .
## countyTana River NA NA NA NA
## countyTharaka-Nithi NA NA NA NA
## countyTrans Nzoia -0.20286 0.52823 -0.384 0.700945
## countyTurkana 0.13021 0.62973 0.207 0.836188
## countyUasin Gishu 0.17647 0.49890 0.354 0.723545
## countyVihiga NA NA NA NA
## countyWajir NA NA NA NA
## countyWest Pokot 0.64295 0.71634 0.898 0.369422
## insuranceYes 0.36355 0.17329 2.098 0.035906 *
## mediaexpolow -0.22267 0.12555 -1.774 0.076141 .
## mediaexpomedium -0.22202 0.10460 -2.123 0.033790 *
## mage20-24 0.20527 0.13244 1.550 0.121163
## mage25-29 0.10530 0.13771 0.765 0.444481
## mage30-34 0.07192 0.14835 0.485 0.627822
## mage35+ 0.10760 0.15892 0.677 0.498350
## Nethiniclow 0.06633 0.12834 0.517 0.605264
## Nethinicmid 0.17163 0.14130 1.215 0.224485
## ethnicityEmbu -12.71395 196.96877 -0.065 0.948534
## ethnicityKalenjin -11.50560 196.96807 -0.058 0.953419
## ethnicityKamba -11.60814 196.96802 -0.059 0.953005
## ethnicityKikuyu -11.74885 196.96801 -0.060 0.952436
## ethnicityKisii -11.67852 196.96824 -0.059 0.952720
## ethnicityLuhya -11.57812 196.96802 -0.059 0.953126
## ethnicityLuo -11.91349 196.96804 -0.060 0.951770
## ethnicityMasai -11.79564 196.96840 -0.060 0.952246
## ethnicityMeru -10.76095 196.96839 -0.055 0.956431
## ethnicityMijikenda/ Swahili -11.74820 196.96801 -0.060 0.952438
## ethnicityOther -11.29838 196.96806 -0.057 0.954257
## ethnicitySomali -12.47921 196.96839 -0.063 0.949483
## ethnicityTaita/ Taveta -12.02974 196.96826 -0.061 0.951300
## NpropHiglow 0.12696 0.20135 0.631 0.528332
## NpropHigmid 0.05821 0.16368 0.356 0.722113
## comhospdellow -0.09022 0.15429 -0.585 0.558741
## comhospdelmid -0.01189 0.13090 -0.091 0.927628
## desirepregNo more -0.15594 0.12981 -1.201 0.229644
## desirepregThen 0.29683 0.09783 3.034 0.002413 **
## pregcomplicationNo -0.14966 0.85336 -0.175 0.860782
## pregcomplicationYes 0.13274 0.85297 0.156 0.876328
## Npfemalelow -0.16573 0.11148 -1.487 0.137118
## Npfemalemid 0.01530 0.10319 0.148 0.882133
## Nmedianhlow -0.07779 0.15917 -0.489 0.625062
## Nmedianhmid -0.01226 0.12214 -0.100 0.920043
## Npmovedlow 0.14001 0.14010 0.999 0.317617
## Npmovedmid -0.01799 0.11698 -0.154 0.877813
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4487.5 on 3620 degrees of freedom
## Residual deviance: 4175.8 on 3528 degrees of freedom
## AIC: 4361.8
##
## Number of Fisher Scoring iterations: 10
# Make predictions on test data
test_data$predicted_aanc_values <- predict(model, newdata = test_data)
#We look at the range of values the predicted probabilities can take:
min(test_data$predicted_aanc_values)
## [1] -2.210954
mean(test_data$predicted_aanc_values)
## [1] -0.8440831
max(test_data$predicted_aanc_values)
## [1] 1.605939
#Since the predicted values are log odds ratio
#We can exponentiate them to gets the ORs
test_data$ODDsRatio_predicted_aanc_values=exp(test_data$predicted_aanc_values)
#We look at the range of values the predicted probabilities ODDs ration can take:
min(test_data$ODDsRatio_predicted_aanc_values)
## [1] 0.109596
mean(test_data$ODDsRatio_predicted_aanc_values)
## [1] 0.5582067
max(test_data$ODDsRatio_predicted_aanc_values)
## [1] 4.982537
#Calculating the cutoff based on mean and predicted ODDs
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
test_data$predicted_aanc_odds_based_oncutoff =
if_else(test_data$predicted_aanc_values > mean(test_data$predicted_aanc_values), 1, 0)
# Specify columns to display
cols_to_display <- c("Aanc", "predicted_aanc_values", "peduc", "educ", "county", "insurance", "mediaexpo", "desirepreg", "pregcomplication", "Npfemale", "Nmedianh", "Npmoved","predicted_aanc_odds_based_oncutoff")
cols_to_display_min <- c("Aanc", "predicted_aanc_values","predicted_aanc_odds_based_oncutoff")
test_data[cols_to_display_min]
## # A tibble: 724 × 3
## Aanc predicted_aanc_values predicted_aanc_odds_based_oncutoff
## <dbl> <dbl> <dbl>
## 1 1 -1.19 0
## 2 0 -0.795 1
## 3 0 -0.514 1
## 4 0 -0.285 1
## 5 0 -0.417 1
## 6 0 -1.51 0
## 7 0 -0.896 0
## 8 1 -1.43 0
## 9 0 -1.30 0
## 10 0 -1.46 0
## # ℹ 714 more rows
# Convert predicted_aanc_values and Aanc columns to factors with the same levels
test_data$predicted_aanc_odds_based_oncutoff <- factor(test_data$predicted_aanc_odds_based_oncutoff, levels = c(0, 1))
test_data$Aanc <- factor(test_data$Aanc, levels = c(0, 1))
conf_matrix <- confusionMatrix(test_data$predicted_aanc_odds_based_oncutoff,test_data$Aanc)
# View the confusion matrix
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 302 109
## 1 191 122
##
## Accuracy : 0.5856
## 95% CI : (0.5488, 0.6218)
## No Information Rate : 0.6809
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1286
##
## Mcnemar's Test P-Value : 2.918e-06
##
## Sensitivity : 0.6126
## Specificity : 0.5281
## Pos Pred Value : 0.7348
## Neg Pred Value : 0.3898
## Prevalence : 0.6809
## Detection Rate : 0.4171
## Detection Prevalence : 0.5677
## Balanced Accuracy : 0.5704
##
## 'Positive' Class : 0
##
# Convert predicted_aanc_values and Aanc columns to factors with the same levels
test_data$predicted_aanc_odds_based_oncutoff <- factor(test_data$predicted_aanc_odds_based_oncutoff, levels = c(0, 1))
test_data$Aanc <- factor(test_data$Aanc, levels = c(0, 1))
# Compute confusion matrix
conf_matrix <- confusionMatrix(test_data$predicted_aanc_odds_based_oncutoff, test_data$Aanc)
# Convert predicted_aanc_values and Aanc columns to factors with the same levels
test_data$predicted_aanc_odds_based_oncutoff <- factor(test_data$predicted_aanc_odds_based_oncutoff, levels = c(0, 1))
test_data$Aanc <- factor(test_data$Aanc, levels = c(0, 1))
# Compute confusion matrix
conf_matrix <- confusionMatrix(test_data$predicted_aanc_odds_based_oncutoff, test_data$Aanc)
No education Odds ratio: \(\exp(-0.86615)=0.4195 \pm 0.25126\)
Primary education Odds ratio: \(\exp(-0.81128)= 0.4440 \pm 0.19303\)
Secondary education Odds ratio: \(\exp(-0.65328)= 0.5205 \pm 0.18647\)
Relevant links in support of the above results:
Relevant links supporting this analysis is as follows:
County of Residence: The county of residence was found to be a significant predictor. For instance, individuals residing in counties such as Bungoma, Garissa, Isiolo, Kakamega, Kilifi, Kisii, Laikipia, Mombasa, and Narok exhibited significant coefficients, indicating varying impacts on ANC attendance compared to the reference county.
Insurance Coverage: Insurance coverage (insuranceYes) was found to be a significant positive predictor, suggesting that individuals with insurance are more likely to attend ANC visits.
Media Exposure: Media exposure (mediaexpomedium) also emerged as a significant factor, with individuals exposed to medium levels of media showing a negative coefficient, indicating a lower likelihood of ANC attendance compared to those with low media exposure.
Desire for Pregnancy: The desire for pregnancy (desirepregThen) was found to be a significant positive predictor, indicating that individuals who desire pregnancy are more likely to attend ANC visits.
These significant factors provide valuable insights into the determinants of ANC attendance. Policymakers and healthcare providers can use this information to target interventions and improve ANC uptake, particularly among vulnerable populations with lower education levels, limited parental education, and residing in specific counties. Additionally, efforts to increase insurance coverage, media campaigns promoting ANC, and addressing pregnancy desires can contribute to enhancing ANC attendance rates.
Given the context of predicting antenatal care, it’s crucial to correctly identify both positive (women who received ANC) and negative (women who did not receive ANC) cases. While the model shows fair agreement beyond chance (indicated by Kappa), it seems to perform better in identifying positive cases (as indicated by sensitivity and PPV) compared to negative cases (as indicated by specificity and NPV). This suggests that further refinement of the model may be necessary to improve its performance, especially in correctly identifying women who did not receive antenatal care.
Further details for the model used to predict antenatal care are as shown below;
Accuracy: The model accuracy of 63.4% indicates that the model correctly classified approximately 63.4% of the instances regarding antenatal care.
Kappa: The Kappa statistic of 0.2319 suggests a fair agreement beyond chance in predicting antenatal care.
Sensitivity (True Positive Rate): Sensitivity of 63.48% means that the model correctly identified about 63.48% of the women who received antenatal care as predicted by the model.
Specificity (True Negative Rate): Specificity of 63.21% means that the model correctly identified about 63.21% of the women who did not receive antenatal care as predicted by the model.
Positive Predictive Value (Precision): The Positive Predictive Value (PPV) of 80.65% indicates that when the model predicts a woman received antenatal care, it is correct about 80.65% of the time.
Negative Predictive Value: Negative Predictive Value (NPV) of 41.74% indicates that when the model predicts a woman did not receive antenatal care, it is correct about 41.74% of the time.
Prevalence: The prevalence of antenatal care in the dataset is 70.72%.
Detection Rate: Detection Rate of 44.89% indicates that the model correctly detects antenatal care in about 44.89% of the cases.
Balanced Accuracy: Balanced Accuracy of 63.34% provides an average of sensitivity and specificity and is particularly useful when classes are imbalanced, such as in this case.
In modeling antenatal care (ANC) attendance, several factors were found to be significant predictors:
According to the model’s predictions for prenatal care (ANC):
These descriptions along with the links provide valuable insights and evidence supporting the significance of each factor in predicting antenatal care attendance.
Moderate Performance: With an overall accuracy of 63.4%, the model demonstrated moderate success in accurately categorizing cases. Still, there is potential for improvement to increase its predictive capacity.
Fair Agreement: The model’s agreement with the actual classifications is fair above and beyond what would be predicted by chance, according to the Kappa statistic of 0.2319. Even if this suggests some dependability, it could do better with more improvement.
Sensitivity and Specificity: The model performs rather well in recognizing both positive and negative cases of prenatal care, as evidenced by its sensitivity (63.48%) and specificity (63.21%). Both metrics, though, might be higher, suggesting that the model needs to be adjusted.
4.Positive Predictive Value (PPV): The PPV of 80.65% indicates that when the model predicts a woman received antenatal care, it is correct about 80.65% of the time. This suggests that the model is relatively reliable in predicting positive cases.
4.Negative Predictive Value (PPV): The NPV of 41.74% indicates that when the model predicts a woman did not receive antenatal care, it is correct about 41.74% of the time. This value is notably lower than PPV, indicating potential shortcomings in predicting negative cases.