This code sample analyzes maternal health risk factors using a public dataset of health records collected through an IoT-based monitoring system in Bangladesh. The goal is to identify which variables are most associated with low, mid, and high maternal health risk levels.
This project demonstrates data cleaning, exploratory data analysis, class imbalance handling, predictive modeling, and interpretation of results using R.
What are the strongest predictors of maternal health risk level in this dataset?
library(tidyverse)
library(caret)
library(randomForest)
library(nnet)
library(corrplot)
library(knitr)
url <- "https://raw.githubusercontent.com/lburenkov/maternalrisk/refs/heads/main/Maternal%20Health%20Risk%20Data%20Set.csv"
df <- read.csv(url)
head(df)
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 1 25 130 80 15.00 98 86 high risk
## 2 35 140 90 13.00 98 70 high risk
## 3 29 90 70 8.00 100 80 high risk
## 4 30 140 85 7.00 98 70 high risk
## 5 35 120 60 6.10 98 76 low risk
## 6 23 140 80 7.01 98 70 high risk
str(df)
## 'data.frame': 1014 obs. of 7 variables:
## $ Age : int 25 35 29 30 35 23 23 35 32 42 ...
## $ SystolicBP : int 130 140 90 140 120 140 130 85 120 130 ...
## $ DiastolicBP: int 80 90 70 85 60 80 70 60 90 80 ...
## $ BS : num 15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
## $ BodyTemp : num 98 98 100 98 98 98 98 102 98 98 ...
## $ HeartRate : int 86 70 80 70 76 70 78 86 70 70 ...
## $ RiskLevel : chr "high risk" "high risk" "high risk" "high risk" ...
summary(df)
## Age SystolicBP DiastolicBP BS
## Min. :10.00 Min. : 70.0 Min. : 49.00 Min. : 6.000
## 1st Qu.:19.00 1st Qu.:100.0 1st Qu.: 65.00 1st Qu.: 6.900
## Median :26.00 Median :120.0 Median : 80.00 Median : 7.500
## Mean :29.87 Mean :113.2 Mean : 76.46 Mean : 8.726
## 3rd Qu.:39.00 3rd Qu.:120.0 3rd Qu.: 90.00 3rd Qu.: 8.000
## Max. :70.00 Max. :160.0 Max. :100.00 Max. :19.000
## BodyTemp HeartRate RiskLevel
## Min. : 98.00 Min. : 7.0 Length:1014
## 1st Qu.: 98.00 1st Qu.:70.0 Class :character
## Median : 98.00 Median :76.0 Mode :character
## Mean : 98.67 Mean :74.3
## 3rd Qu.: 98.00 3rd Qu.:80.0
## Max. :103.00 Max. :90.0
missing_values <- colSums(is.na(df))
missing_values
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate
## 0 0 0 0 0 0
## RiskLevel
## 0
The dataset does not contain missing values. The target variable is RiskLevel, which includes three categories: low risk, mid risk, and high risk.
Before modeling, I checked for values that may indicate data entry errors or extreme outliers.
# Unusual heart rate values
df %>%
  filter(HeartRate < 40 | HeartRate > 200)
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 1 16 120 75 7.9 98 7 low risk
## 2 16 120 75 7.9 98 7 low risk
# Unusual blood sugar or body temperature values
df %>%
  filter(BS > 15 | BodyTemp > 100 | BodyTemp < 97)
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 1 35 85 60 11.0 102 86 high risk
## 2 42 130 80 18.0 98 70 high risk
## 3 30 120 80 6.9 101 76 mid risk
## 4 40 140 100 18.0 98 90 high risk
## 5 12 95 60 6.1 102 60 low risk
## 6 17 85 60 9.0 102 86 mid risk
## 7 32 120 65 6.0 101 76 mid risk
## 8 26 85 60 6.0 101 86 mid risk
## 9 44 120 90 16.0 98 80 mid risk
## 10 13 90 65 7.8 101 80 mid risk
## 11 28 115 60 7.8 101 86 mid risk
## 12 34 85 60 11.0 102 86 high risk
## 13 42 140 100 18.0 98 90 high risk
## 14 50 140 95 17.0 98 60 high risk
## 15 38 135 60 7.9 101 86 high risk
## 16 30 120 80 7.9 101 76 high risk
## 17 55 140 100 18.0 98 90 high risk
## 18 40 160 100 19.0 98 77 high risk
## 19 32 140 90 18.0 98 88 high risk
## 20 55 140 95 19.0 98 77 high risk
## 21 40 160 100 19.0 98 77 high risk
## 22 32 140 90 18.0 98 88 high risk
## 23 22 90 60 7.5 102 60 high risk
## 24 55 140 95 19.0 98 77 high risk
## 25 50 130 100 16.0 98 75 high risk
## 26 18 120 80 6.9 102 76 mid risk
## 27 17 90 60 6.9 101 76 mid risk
## 28 17 90 63 6.9 101 70 mid risk
## 29 25 120 90 6.7 101 80 mid risk
## 30 17 120 80 6.7 102 76 mid risk
## 31 14 90 65 7.0 101 70 high risk
## 32 17 110 75 12.0 101 76 high risk
## 33 40 160 100 19.0 98 77 high risk
## 34 32 140 90 18.0 98 88 high risk
## 35 12 90 60 7.9 102 66 high risk
## 36 12 95 60 6.1 102 60 low risk
## 37 55 140 95 19.0 98 77 high risk
## 38 50 130 100 16.0 98 75 high risk
## 39 13 90 65 7.9 101 80 mid risk
## 40 17 90 65 6.1 103 67 high risk
## 41 28 83 60 8.0 101 86 high risk
## 42 17 85 60 9.0 102 86 high risk
## 43 50 140 95 17.0 98 60 high risk
## 44 28 85 60 9.0 101 86 mid risk
## 45 17 85 60 9.0 102 86 mid risk
## 46 55 140 80 7.2 101 76 high risk
## 47 40 140 100 18.0 98 77 high risk
## 48 28 120 80 9.0 102 76 high risk
## 49 17 90 60 11.0 101 78 high risk
## 50 17 90 63 8.0 101 70 high risk
## 51 25 120 90 12.0 101 80 high risk
## 52 17 120 80 7.0 102 76 high risk
## 53 19 90 65 11.0 101 70 high risk
## 54 32 120 65 6.0 101 76 mid risk
## 55 17 110 75 13.0 101 76 high risk
## 56 40 160 100 19.0 98 77 high risk
## 57 32 140 90 18.0 98 88 high risk
## 58 12 90 60 8.0 102 66 high risk
## 59 12 90 60 11.0 102 60 high risk
## 60 55 140 95 19.0 98 77 high risk
## 61 50 130 100 16.0 98 76 high risk
## 62 13 90 65 9.0 101 80 high risk
## 63 17 90 65 7.7 103 67 high risk
## 64 26 85 60 6.0 101 86 mid risk
## 65 17 85 60 6.3 102 86 high risk
## 66 55 120 90 18.0 98 60 high risk
## 67 35 85 60 19.0 98 86 high risk
## 68 43 120 90 18.0 98 70 high risk
## 69 44 120 90 16.0 98 80 mid risk
## 70 45 120 80 6.9 103 70 low risk
## 71 70 85 60 6.9 102 70 low risk
## 72 65 120 90 6.9 103 76 low risk
## 73 55 120 80 6.9 102 80 low risk
## 74 45 90 60 18.0 101 70 high risk
## 75 22 120 80 6.9 103 76 low risk
## 76 17 110 75 6.9 101 76 high risk
## 77 40 160 100 19.0 98 77 high risk
## 78 32 140 90 18.0 98 88 high risk
## 79 12 90 60 7.8 102 60 high risk
## 80 55 140 95 19.0 98 77 high risk
## 81 50 130 100 16.0 98 75 high risk
## 82 13 90 65 7.8 101 80 mid risk
## 83 17 90 65 7.8 103 67 high risk
## 84 28 115 60 7.8 101 86 mid risk
## 85 17 85 69 7.8 102 86 high risk
## 86 50 130 80 16.0 102 76 mid risk
## 87 27 120 90 6.8 102 68 mid risk
## 88 55 100 70 6.8 101 80 mid risk
## 89 60 140 80 16.0 98 66 high risk
## 90 17 140 100 6.8 103 80 high risk
## 91 36 140 100 6.8 102 76 high risk
## 92 40 140 100 13.0 101 66 high risk
## 93 36 140 100 6.8 102 76 high risk
## 94 40 140 100 13.0 101 66 high risk
## 95 35 85 60 11.0 102 86 high risk
## 96 43 130 80 18.0 98 70 mid risk
## 97 34 85 60 11.0 102 86 high risk
## 98 42 130 80 18.0 98 70 mid risk
## 99 30 120 80 6.8 101 76 low risk
## 100 42 140 100 18.0 98 90 high risk
## 101 18 120 80 6.8 102 76 low risk
## 102 17 90 60 7.9 101 76 low risk
## 103 50 140 95 17.0 98 60 high risk
## 104 38 135 60 7.9 101 86 high risk
## 105 17 85 60 7.9 102 86 low risk
## 106 30 120 80 7.9 101 76 high risk
## 107 55 140 100 18.0 98 90 high risk
## 108 18 120 80 7.9 102 76 mid risk
## 109 17 90 60 7.5 101 76 low risk
## 110 17 90 63 7.5 101 70 low risk
## 111 25 120 90 7.5 101 80 low risk
## 112 17 120 80 7.5 102 76 low risk
## 113 19 90 65 7.5 101 70 low risk
## 114 18 85 60 7.5 101 86 mid risk
## 115 17 85 60 7.5 102 86 low risk
## 116 30 120 80 7.5 101 76 mid risk
## 117 40 160 100 19.0 98 77 high risk
## 118 32 140 90 18.0 98 88 high risk
## 119 12 90 60 7.5 102 66 low risk
## 120 12 90 60 7.5 102 60 low risk
## 121 55 140 95 19.0 98 77 high risk
## 122 50 130 100 16.0 98 75 mid risk
## 123 13 90 65 7.5 101 80 low risk
## 124 17 90 65 7.5 103 67 low risk
## 125 28 115 60 7.5 101 86 mid risk
## 126 17 85 60 7.5 102 86 low risk
## 127 40 160 100 19.0 98 77 high risk
## 128 32 140 90 18.0 98 88 high risk
## 129 12 90 60 7.5 102 66 mid risk
## 130 22 90 60 7.5 102 60 high risk
## 131 55 140 95 19.0 98 77 high risk
## 132 50 130 100 16.0 98 75 high risk
## 133 55 140 95 19.0 98 77 high risk
## 134 50 130 100 16.0 98 75 mid risk
## 135 13 90 65 7.5 101 80 high risk
## 136 17 90 65 7.5 103 67 mid risk
## 137 27 135 60 7.5 101 86 high risk
## 138 17 85 60 7.5 101 86 high risk
## 139 50 140 95 17.0 98 60 high risk
## 140 28 85 60 9.0 101 86 mid risk
## 141 28 95 60 10.0 101 86 high risk
## 142 17 90 60 9.0 102 86 mid risk
## 143 30 120 80 9.0 101 76 mid risk
## 144 35 85 60 11.0 102 86 high risk
## 145 42 130 80 18.0 98 70 high risk
## 146 40 140 100 18.0 98 90 high risk
## 147 14 90 65 7.0 101 70 high risk
## 148 17 110 75 12.0 101 76 high risk
## 149 40 160 100 19.0 98 77 high risk
## 150 30 120 80 6.9 101 76 mid risk
## 151 18 120 80 6.9 102 76 mid risk
## 152 17 90 60 6.9 101 76 mid risk
## 153 17 90 63 6.9 101 70 mid risk
## 154 25 120 90 6.7 101 80 mid risk
## 155 17 120 80 6.7 102 76 mid risk
## 156 13 90 65 7.9 101 80 mid risk
## 157 28 85 60 9.0 101 86 mid risk
## 158 17 85 60 9.0 102 86 mid risk
## 159 30 120 80 6.9 101 76 mid risk
## 160 18 120 80 6.9 102 76 mid risk
## 161 17 90 60 6.9 101 76 mid risk
## 162 17 90 63 6.9 101 70 mid risk
## 163 25 120 90 6.7 101 80 mid risk
## 164 17 120 80 6.7 102 76 mid risk
## 165 13 90 65 7.9 101 80 mid risk
## 166 28 85 60 9.0 101 86 mid risk
## 167 17 85 60 9.0 102 86 mid risk
## 168 32 120 65 6.0 101 76 mid risk
## 169 26 85 60 6.0 101 86 mid risk
## 170 44 120 90 16.0 98 80 mid risk
## 171 13 90 65 7.8 101 80 mid risk
## 172 28 115 60 7.8 101 86 mid risk
## 173 50 130 80 16.0 102 76 mid risk
## 174 27 120 90 6.8 102 68 mid risk
## 175 55 100 70 6.8 101 80 mid risk
## 176 43 130 80 18.0 98 70 mid risk
## 177 42 130 80 18.0 98 70 mid risk
## 178 18 120 80 7.9 102 76 mid risk
## 179 18 85 60 7.5 101 86 mid risk
## 180 30 120 80 7.5 101 76 mid risk
## 181 50 130 100 16.0 98 75 mid risk
## 182 28 115 60 7.5 101 86 mid risk
## 183 12 90 60 7.5 102 66 mid risk
## 184 50 130 100 16.0 98 75 mid risk
## 185 17 90 65 7.5 103 67 mid risk
## 186 28 85 60 9.0 101 86 mid risk
## 187 17 90 60 9.0 102 86 mid risk
## 188 30 120 80 9.0 101 76 mid risk
## 189 30 120 80 6.9 101 76 mid risk
## 190 18 120 80 6.9 102 76 mid risk
## 191 17 90 60 6.9 101 76 mid risk
## 192 17 90 63 6.9 101 70 mid risk
## 193 25 120 90 6.7 101 80 mid risk
## 194 17 120 80 6.7 102 76 mid risk
## 195 13 90 65 7.9 101 80 mid risk
## 196 28 85 60 9.0 101 86 mid risk
## 197 17 85 60 9.0 102 86 mid risk
## 198 32 120 65 6.0 101 76 mid risk
## 199 30 120 80 6.8 101 76 low risk
## 200 18 120 80 6.8 102 76 low risk
## 201 17 90 60 7.9 101 76 low risk
## 202 17 85 60 7.9 102 86 low risk
## 203 17 90 60 7.5 101 76 low risk
## 204 17 90 63 7.5 101 70 low risk
## 205 25 120 90 7.5 101 80 low risk
## 206 17 120 80 7.5 102 76 low risk
## 207 19 90 65 7.5 101 70 low risk
## 208 17 85 60 7.5 102 86 low risk
## 209 12 90 60 7.5 102 66 low risk
## 210 12 90 60 7.5 102 60 low risk
## 211 13 90 65 7.5 101 80 low risk
## 212 17 90 65 7.5 103 67 low risk
## 213 17 85 60 7.5 102 86 low risk
## 214 40 140 100 18.0 98 90 high risk
## 215 14 90 65 7.0 101 70 high risk
## 216 17 110 75 12.0 101 76 high risk
## 217 40 160 100 19.0 98 77 high risk
## 218 32 140 90 18.0 98 88 high risk
## 219 12 90 60 7.9 102 66 high risk
## 220 55 140 95 19.0 98 77 high risk
## 221 50 130 100 16.0 98 75 high risk
## 222 17 90 65 6.1 103 67 high risk
## 223 28 83 60 8.0 101 86 high risk
## 224 17 85 60 9.0 102 86 high risk
## 225 50 140 95 17.0 98 60 high risk
## 226 55 140 80 7.2 101 76 high risk
## 227 40 140 100 18.0 98 77 high risk
## 228 28 120 80 9.0 102 76 high risk
## 229 17 90 60 11.0 101 78 high risk
## 230 17 90 63 8.0 101 70 high risk
## 231 25 120 90 12.0 101 80 high risk
## 232 17 120 80 7.0 102 76 high risk
## 233 19 90 65 11.0 101 70 high risk
## 234 17 110 75 13.0 101 76 high risk
## 235 40 160 100 19.0 98 77 high risk
## 236 32 140 90 18.0 98 88 high risk
## 237 12 90 60 8.0 102 66 high risk
## 238 12 90 60 11.0 102 60 high risk
## 239 55 140 95 19.0 98 77 high risk
## 240 50 130 100 16.0 98 76 high risk
## 241 13 90 65 9.0 101 80 high risk
## 242 17 90 65 7.7 103 67 high risk
## 243 17 85 60 6.3 102 86 high risk
## 244 55 120 90 18.0 98 60 high risk
## 245 35 85 60 19.0 98 86 high risk
## 246 43 120 90 18.0 98 70 high risk
## 247 32 120 65 6.0 101 76 mid risk
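Printing every flagged row makes the output hard to scan. As a rough sketch, the same rules can be summarised as one count per rule. The helper `count_flags` and the toy data frame are hypothetical names introduced for illustration; the thresholds simply mirror the filters used above.

```r
# Toy data with the same column names as the dataset (values are made up).
toy <- data.frame(
  HeartRate = c(70, 7, 86, 76),
  BS        = c(7.5, 7.9, 18, 6.9),
  BodyTemp  = c(98, 98, 98, 102)
)

# Hypothetical helper: count the rows that trip each range rule.
count_flags <- function(data) {
  rules <- list(
    heart_rate  = data$HeartRate < 40 | data$HeartRate > 200,
    blood_sugar = data$BS > 15,
    body_temp   = data$BodyTemp > 100 | data$BodyTemp < 97
  )
  flagged <- vapply(rules, sum, integer(1))
  # any_rule counts rows flagged by at least one rule
  c(flagged, any_rule = sum(Reduce(`|`, rules)))
}

flag_counts <- count_flags(toy)
print(flag_counts)
```

Calling `count_flags(df)` on the real data should give a compact summary of how many rows each rule would remove before deciding whether to filter them.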
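For this code sample, I removed the observations flagged by the checks above to keep the modeling dataset more consistent. This drops 249 of the 1,014 rows (2 with a recorded heart rate of 7 and 247 with blood sugar above 15 or body temperature above 100), leaving 765 observations.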
df_clean <- df %>%
  filter(HeartRate >= 40 & HeartRate <= 200) %>%
  filter(BS <= 15 & BodyTemp <= 100 & BodyTemp >= 97)
df_clean$RiskLevel <- as.factor(df_clean$RiskLevel)
summary(df_clean)
## Age SystolicBP DiastolicBP BS
## Min. :10.00 Min. : 70.0 Min. : 49.00 Min. : 6.000
## 1st Qu.:20.00 1st Qu.:100.0 1st Qu.: 65.00 1st Qu.: 6.900
## Median :25.00 Median :120.0 Median : 80.00 Median : 7.200
## Mean :30.13 Mean :113.6 Mean : 76.65 Mean : 8.041
## 3rd Qu.:36.00 3rd Qu.:120.0 3rd Qu.: 90.00 3rd Qu.: 7.800
## Max. :66.00 Max. :140.0 Max. :100.00 Max. :15.000
## BodyTemp HeartRate RiskLevel
## Min. : 98.00 Min. :60.00 high risk:148
## 1st Qu.: 98.00 1st Qu.:70.00 low risk :367
## Median : 98.00 Median :70.00 mid risk :250
## Mean : 98.07 Mean :73.68
## 3rd Qu.: 98.00 3rd Qu.:78.00
## Max. :100.00 Max. :90.00
table(df_clean$RiskLevel)
##
## high risk low risk mid risk
## 148 367 250
ggplot(df_clean, aes(x = RiskLevel, fill = RiskLevel)) +
  geom_bar() +
  labs(
    title = "Risk Level Distribution",
    x = "Risk Level",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
After cleaning, the dataset is imbalanced: 367 low-risk, 250 mid-risk, and 148 high-risk observations. This imbalance may affect classification because a model can learn to favor the majority class.
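To put numbers on the imbalance, the class counts reported in the table above can be turned into proportions and a majority-to-minority ratio. This is a small sketch that re-enters those counts by hand; `class_counts` is a name introduced here for illustration.

```r
# Counts copied from table(df_clean$RiskLevel) above.
class_counts <- c("high risk" = 148, "low risk" = 367, "mid risk" = 250)

# Share of each class and the ratio of the largest to the smallest class.
class_props <- class_counts / sum(class_counts)
imbalance_ratio <- max(class_counts) / min(class_counts)

print(round(class_props, 3))
print(round(imbalance_ratio, 2))
```

On the real data, `prop.table(table(df_clean$RiskLevel))` would give the same proportions directly.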
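Because the dataset is relatively small, I used random oversampling to balance the risk categories while keeping as much data as possible. Minority-class rows are resampled with replacement, and a small uniform jitter (±0.1) is added to the numeric columns of the resampled rows.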
set.seed(42)
high_risk <- df_clean[df_clean$RiskLevel == "high risk", ]
low_risk <- df_clean[df_clean$RiskLevel == "low risk", ]
mid_risk <- df_clean[df_clean$RiskLevel == "mid risk", ]
target_size <- max(nrow(high_risk), nrow(low_risk), nrow(mid_risk))
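# Oversample a class up to target_size by resampling rows with replacement
# and adding small uniform noise to the numeric columns of the new rows.
generate_synthetic <- function(minority_class, target_size) {
  n_samples <- target_size - nrow(minority_class)
  if (n_samples <= 0) {
    return(minority_class)
  }
  # Draw the extra rows from the existing ones
  synthetic_samples <- minority_class[
    sample(seq_len(nrow(minority_class)), n_samples, replace = TRUE),
  ]
  # Jitter numeric columns so the copies are not exact duplicates
  numeric_cols <- names(synthetic_samples)[sapply(synthetic_samples, is.numeric)]
  for (col in numeric_cols) {
    synthetic_samples[[col]] <- synthetic_samples[[col]] +
      runif(n_samples, -0.1, 0.1)
  }
  rbind(minority_class, synthetic_samples)
}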
high_risk_balanced <- generate_synthetic(high_risk, target_size)
mid_risk_balanced <- generate_synthetic(mid_risk, target_size)
low_risk_balanced <- low_risk
df_balanced <- rbind(high_risk_balanced, mid_risk_balanced, low_risk_balanced)
df_balanced$RiskLevel <- as.factor(df_balanced$RiskLevel)
table(df_balanced$RiskLevel)
##
## high risk low risk mid risk
## 367 367 367
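One caveat with oversampling the full dataset is that a jittered copy of a row can later land in the test set while its original sits in the training set, which tends to inflate test metrics. A common alternative, sketched below on made-up data, is to split first and oversample only the training portion. The helper names (`stratified_split`, `oversample_train`) and the toy data are assumptions for illustration, not part of the analysis above.

```r
set.seed(42)

# Made-up imbalanced data; `id` lets us trace rows across the split.
toy <- data.frame(
  id = 1:60,
  x = rnorm(60),
  RiskLevel = factor(rep(c("high risk", "low risk", "mid risk"), times = c(10, 30, 20)))
)

# Hypothetical helper: pick about 70% of the rows within each class for training.
stratified_split <- function(data, target, p = 0.7) {
  idx_by_class <- split(seq_len(nrow(data)), data[[target]])
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }))
  list(train = data[train_idx, ], test = data[-train_idx, ])
}

# Hypothetical helper: resample each class (with replacement) up to the largest class size.
oversample_train <- function(train, target) {
  parts <- split(train, train[[target]])
  target_size <- max(vapply(parts, nrow, integer(1)))
  balanced <- lapply(parts, function(part) {
    extra <- target_size - nrow(part)
    if (extra <= 0) return(part)
    rbind(part, part[sample.int(nrow(part), extra, replace = TRUE), ])
  })
  do.call(rbind, balanced)
}

parts <- stratified_split(toy, "RiskLevel")
train_balanced <- oversample_train(parts$train, "RiskLevel")

print(table(train_balanced$RiskLevel))
print(table(parts$test$RiskLevel))
```

The test set keeps its natural class mix and contains no resampled rows, so metrics computed on it describe unseen observations. Jitter, if wanted, could be added inside `oversample_train` in the same way as in `generate_synthetic`.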
ggplot(df_balanced, aes(x = RiskLevel, fill = RiskLevel)) +
  geom_bar() +
  labs(
    title = "Balanced Risk Level Distribution",
    x = "Risk Level",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
df_clean %>%
  select(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate) %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~ Feature, scales = "free") +
  labs(title = "Distribution of Numeric Features") +
  theme_minimal()
cor_matrix <- cor(
  df_clean %>% select(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate)
)
corrplot(
  cor_matrix,
  method = "circle",
  type = "lower",
  tl.cex = 0.8,
  addCoef.col = "black"
)
The strongest relationship among numeric predictors is between systolic and diastolic blood pressure, which is expected because both measure blood pressure. Blood sugar also shows moderate relationships with age and blood pressure.
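A correlation plot is easy to read by eye, but a ranked list of pairs makes the "strongest relationship" claim checkable. The sketch below does this on made-up data; the column names mirror the dataset for readability only, and the numbers are not the study's values.

```r
set.seed(1)

# Made-up numeric data in which two columns are deliberately related.
n <- 200
systolic <- rnorm(n, 115, 18)
toy <- data.frame(
  SystolicBP = systolic,
  DiastolicBP = 0.6 * systolic + rnorm(n, 0, 6),
  BS = rnorm(n, 8, 2),
  HeartRate = rnorm(n, 74, 8)
)

cor_matrix <- cor(toy)

# Keep each pair once (lower triangle), then sort by absolute correlation.
pair_idx <- which(lower.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[pair_idx[, "row"]],
  var2 = colnames(cor_matrix)[pair_idx[, "col"]],
  r = cor_matrix[pair_idx],
  stringsAsFactors = FALSE
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)
```

Applied to the `cor_matrix` computed from `df_clean`, the same steps would list every predictor pair from strongest to weakest.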
features <- c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")
for (feature in features) {
  p <- ggplot(df_balanced, aes(x = RiskLevel, y = .data[[feature]], fill = RiskLevel)) +
    geom_boxplot() +
    labs(
      title = paste(feature, "by Risk Level"),
      x = "Risk Level",
      y = feature
    ) +
    theme_minimal() +
    theme(legend.position = "none")
  print(p)
}
ggplot(df_balanced, aes(x = SystolicBP, y = DiastolicBP, color = RiskLevel)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Systolic vs. Diastolic Blood Pressure by Risk Level",
    x = "Systolic Blood Pressure",
    y = "Diastolic Blood Pressure"
  ) +
  theme_minimal()
I created a binary variable to identify whether each observation was classified as high risk.
df_balanced$HighRisk <- ifelse(df_balanced$RiskLevel == "high risk", 1, 0)
numeric_features <- c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")
pearson_corr <- sapply(numeric_features, function(feature) {
  cor(df_balanced[[feature]], df_balanced$HighRisk, method = "pearson")
})
correlation_df <- data.frame(
  Feature = names(pearson_corr),
  Correlation = as.numeric(pearson_corr)
)
correlation_df %>%
  arrange(desc(Correlation)) %>%
  kable(caption = "Feature Correlation with High-Risk Outcome")
| Feature | Correlation |
|---|---|
| BS | 0.6465102 |
| DiastolicBP | 0.4859777 |
| SystolicBP | 0.4505163 |
| Age | 0.3831357 |
| HeartRate | 0.2460041 |
| BodyTemp | 0.0474882 |
ggplot(correlation_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Correlation)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Correlation of Features With High-Risk Outcome",
    x = "Feature",
    y = "Pearson Correlation"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
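Blood sugar shows the strongest correlation with high-risk classification (r ≈ 0.65), followed by diastolic blood pressure (r ≈ 0.49) and systolic blood pressure (r ≈ 0.45). Body temperature is essentially uncorrelated with the high-risk indicator (r ≈ 0.05).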
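Because HighRisk is coded 0/1, the Pearson coefficient used here is the point-biserial correlation, which can be read as a standardized difference between the two group means. The check below uses made-up data (the variable names are illustrative only) to show that both calculations agree.

```r
set.seed(7)

# Made-up binary outcome and a numeric feature whose mean shifts with it.
high_risk <- rep(c(0, 1), times = c(60, 40))
blood_sugar <- rnorm(100, mean = 7 + 2 * high_risk, sd = 1.5)

# Pearson correlation, as computed in the analysis.
pearson_r <- cor(blood_sugar, high_risk)

# Point-biserial form: (mean difference / population SD) * sqrt(p1 * p0).
p1 <- mean(high_risk == 1)
p0 <- 1 - p1
mean_diff <- mean(blood_sugar[high_risk == 1]) - mean(blood_sugar[high_risk == 0])
sd_pop <- sqrt(mean((blood_sugar - mean(blood_sugar))^2))
point_biserial <- mean_diff / sd_pop * sqrt(p1 * p0)

print(all.equal(pearson_r, point_biserial))
```

Read this way, a larger coefficient means the high-risk and non-high-risk groups sit further apart on that feature relative to its overall spread.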
Age is often assumed to be a major factor in health risk. However, patients under 40 also appear in the high-risk group, which suggests that direct health indicators may be more informative than age alone. The comparison below uses 40 as the cutoff for the younger group.
young_high_risk <- df_balanced %>%
  filter(Age < 40 & HighRisk == 1)
young_non_high_risk <- df_balanced %>%
  filter(Age < 40 & HighRisk == 0)
# Mean of each indicator per group; na.rm is passed through an anonymous
# function, the form dplyr >= 1.1.0 expects inside across()
young_summary <- bind_rows(
  young_high_risk %>%
    summarise(across(c(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate), function(x) mean(x, na.rm = TRUE))) %>%
    mutate(Group = "Young High Risk"),
  young_non_high_risk %>%
    summarise(across(c(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate), function(x) mean(x, na.rm = TRUE))) %>%
    mutate(Group = "Young Non-High Risk")
) %>%
  relocate(Group)
kable(young_summary, caption = "Average Indicators for Young High-Risk vs. Non-High-Risk Groups")
| Group | Age | SystolicBP | DiastolicBP | BS | BodyTemp | HeartRate |
|---|---|---|---|---|---|---|
| Young High Risk | 31.16188 | 127.2987 | 88.2902 | 9.726348 | 98.16856 | 75.49617 |
| Young Non-High Risk | 23.04462 | 110.3446 | 72.8990 | 7.079843 | 98.07958 | 72.87373 |
A multinomial logistic regression model was used as an interpretable baseline model.
multinom_model <- multinom(
  RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate,
  data = df_balanced,
  trace = FALSE
)
predicted_multinom <- predict(multinom_model, df_balanced)
table(Predicted = predicted_multinom, Actual = df_balanced$RiskLevel)
## Actual
## Predicted high risk low risk mid risk
## high risk 295 16 43
## low risk 25 218 104
## mid risk 47 133 220
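This model is useful for interpretation, but it may struggle when relationships between predictors and risk levels are non-linear or when risk categories overlap. The confusion matrix above is computed on the same data the model was fit on, and it classifies 733 of 1,101 observations correctly (about 66.6%). Most of the errors are confusions between low and mid risk.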
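The multinomial confusion matrix can be summarised with a few lines of arithmetic. The sketch below re-enters the counts from the table above by hand (rows are predictions, columns are actual classes) and derives accuracy, per-class recall, and per-class precision.

```r
# Confusion matrix copied from the output above (rows = predicted, columns = actual).
classes <- c("high risk", "low risk", "mid risk")
cm <- matrix(
  c(295,  16,  43,
     25, 218, 104,
     47, 133, 220),
  nrow = 3, byrow = TRUE,
  dimnames = list(Predicted = classes, Actual = classes)
)

accuracy  <- sum(diag(cm)) / sum(cm)
recall    <- diag(cm) / colSums(cm)  # share of each actual class predicted correctly
precision <- diag(cm) / rowSums(cm)  # share of each predicted class that is correct

print(round(accuracy, 4))
print(round(rbind(recall, precision), 3))
```

Recall is noticeably higher for the high-risk class than for the low and mid classes, which matches the overlap between those two categories visible in the matrix.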
Random Forest was selected because it can capture non-linear relationships and interactions between variables.
set.seed(42)
train_index <- createDataPartition(df_balanced$RiskLevel, p = 0.7, list = FALSE)
train_data <- df_balanced[train_index, ]
test_data <- df_balanced[-train_index, ]
set.seed(42)
rf_model <- randomForest(
  RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate,
  data = train_data,
  ntree = 100,
  importance = TRUE
)
rf_model
##
## Call:
## randomForest(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate, data = train_data, ntree = 100, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.66%
## Confusion matrix:
## high risk low risk mid risk class.error
## high risk 243 5 9 0.05447471
## low risk 7 212 38 0.17509728
## mid risk 15 39 203 0.21011673
rf_predictions <- predict(rf_model, test_data)
confusion_matrix <- confusionMatrix(rf_predictions, test_data$RiskLevel)
confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction high risk low risk mid risk
## high risk 108 2 11
## low risk 1 88 14
## mid risk 1 20 85
##
## Overall Statistics
##
## Accuracy : 0.8515
## 95% CI : (0.8085, 0.8881)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7773
##
## Mcnemar's Test P-Value : 0.02105
##
## Statistics by Class:
##
## Class: high risk Class: low risk Class: mid risk
## Sensitivity 0.9818 0.8000 0.7727
## Specificity 0.9409 0.9318 0.9045
## Pos Pred Value 0.8926 0.8544 0.8019
## Neg Pred Value 0.9904 0.9031 0.8884
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3273 0.2667 0.2576
## Detection Prevalence 0.3667 0.3121 0.3212
## Balanced Accuracy 0.9614 0.8659 0.8386
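The Kappa value in the output above corrects accuracy for the agreement expected by chance. As a worked check, the sketch below re-enters the test confusion matrix by hand and recomputes Cohen's kappa as (observed − expected) / (1 − expected); the variable names are introduced here for illustration.

```r
# Test-set confusion matrix copied from the output above
# (rows = prediction, columns = reference).
classes <- c("high risk", "low risk", "mid risk")
cm <- matrix(
  c(108,  2, 11,
      1, 88, 14,
      1, 20, 85),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n <- sum(cm)
observed <- sum(diag(cm)) / n                     # plain accuracy
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa <- (observed - expected) / (1 - expected)

print(round(c(accuracy = observed, chance = expected, kappa = kappa), 4))
```

With three equally sized reference classes, chance agreement is one third, so the kappa of about 0.78 says the model closes roughly 78% of the gap between guessing and perfect agreement.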
varImpPlot(rf_model, main = "Feature Importance")
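With importance = TRUE, one of the measures randomForest reports (mean decrease in accuracy) is permutation-based: it shuffles one predictor and records how much accuracy drops. The sketch below illustrates that idea only. It swaps in a base R logistic regression and made-up data so it runs without extra packages, and `permutation_importance` is a hypothetical helper, not the package's implementation.

```r
set.seed(123)

# Made-up data: `signal` drives the outcome, `noise` does not.
n <- 400
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$y <- rbinom(n, 1, plogis(3 * toy$signal))

model <- glm(y ~ signal + noise, data = toy, family = binomial)

# Share of rows whose predicted class (cutoff 0.5) matches the outcome.
accuracy <- function(model, data) {
  predicted <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(predicted == data$y)
}

# Hypothetical helper: average accuracy drop after shuffling one column.
permutation_importance <- function(model, data, feature, n_repeats = 20) {
  baseline <- accuracy(model, data)
  drops <- replicate(n_repeats, {
    shuffled <- data
    shuffled[[feature]] <- sample(shuffled[[feature]])
    baseline - accuracy(model, shuffled)
  })
  mean(drops)
}

importance_scores <- sapply(c("signal", "noise"), function(f) {
  permutation_importance(model, toy, f)
})
print(round(importance_scores, 3))
```

A predictor the model relies on loses a lot of accuracy when shuffled, while an irrelevant one barely changes it. For the fitted forest itself, `importance(rf_model)` returns the numbers behind the plot.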
This analysis shows that maternal health risk in this dataset is most strongly associated with blood sugar, followed by diastolic and systolic blood pressure, and then age. While age is relevant, the strongest indicators are direct clinical measures. On the held-out 30% of the balanced data, the Random Forest model reached 85.2% accuracy (95% CI 80.9% to 88.8%, kappa 0.78) with 98.2% sensitivity for the high-risk class, and it helped identify the most important variables for maternal health risk classification.
The project demonstrates an end-to-end data analysis workflow in R, including data loading, validation, cleaning, exploratory analysis, modeling, evaluation, and communication of findings.