Maternal health complications remain a critical challenge in rural, resource-limited settings like Bangladesh. This study investigates primary risk factors by analyzing 1,014 health records collected through an IoT-based monitoring system across rural healthcare facilities. Elevated systolic and diastolic blood pressures in the high-risk group, even among younger individuals, confirm blood pressure control as a fundamental risk indicator. Markedly elevated blood sugar levels in the high-risk group underscore the importance of comprehensive metabolic monitoring. Age revealed a complex risk profile, with high-risk individuals generally older, yet young women under 30 also significantly represented. This finding challenges simplistic age-based risk assumptions. Using advanced machine learning techniques, specifically random forest classification, researchers developed predictive models stratifying maternal health risks into low, medium, and high-risk categories. These models provide nuanced insights into the significance of various health indicators in predicting adverse outcomes. The research offers a scalable methodology for early risk identification and targeted interventions, illuminating potential strategies for addressing maternal health challenges in low-resource environments. The findings challenge traditional approaches, emphasizing the need for individualized, comprehensive risk assessment and proactive intervention strategies. Ultimately, this research contributes crucial knowledge to improve maternal health outcomes, offering a data-driven approach to understanding and mitigating risks in vulnerable populations.
In rural Bangladesh, maternal health complications pose significant challenges, especially in resource-limited settings. This study examines key risk factors by analyzing 1,014 health records collected by an Internet of Things (IoT)-based monitoring system across rural healthcare facilities.
What are the primary risk factors for maternal health complications in rural Bangladesh?
By analyzing the features in the dataset, researchers could identify the factors most strongly associated with adverse maternal outcomes.
Loading Data and packages
library(tidyverse)
library(openintro)
#URL of the dataset
url <- "https://raw.githubusercontent.com/lburenkov/maternalrisk/refs/heads/main/Maternal%20Health%20Risk%20Data%20Set.csv"
#Loading the dataset into a data frame
df <- read.csv(url)
#Displaying the first few rows of the dataset
head(df)
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 1 25 130 80 15.00 98 86 high risk
## 2 35 140 90 13.00 98 70 high risk
## 3 29 90 70 8.00 100 80 high risk
## 4 30 140 85 7.00 98 70 high risk
## 5 35 120 60 6.10 98 76 low risk
## 6 23 140 80 7.01 98 70 high risk
Checking for missing values
#Just checking for missing values
colSums(is.na(df)) #Should all be 0 since no missing values are mentioned
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate
## 0 0 0 0 0 0
## RiskLevel
## 0
## 'data.frame': 1014 obs. of 7 variables:
## $ Age : int 25 35 29 30 35 23 23 35 32 42 ...
## $ SystolicBP : int 130 140 90 140 120 140 130 85 120 130 ...
## $ DiastolicBP: int 80 90 70 85 60 80 70 60 90 80 ...
## $ BS : num 15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
## $ BodyTemp : num 98 98 100 98 98 98 98 102 98 98 ...
## $ HeartRate : int 86 70 80 70 76 70 78 86 70 70 ...
## $ RiskLevel : chr "high risk" "high risk" "high risk" "high risk" ...
## Age SystolicBP DiastolicBP BS
## Min. :10.00 Min. : 70.0 Min. : 49.00 Min. : 6.000
## 1st Qu.:19.00 1st Qu.:100.0 1st Qu.: 65.00 1st Qu.: 6.900
## Median :26.00 Median :120.0 Median : 80.00 Median : 7.500
## Mean :29.87 Mean :113.2 Mean : 76.46 Mean : 8.726
## 3rd Qu.:39.00 3rd Qu.:120.0 3rd Qu.: 90.00 3rd Qu.: 8.000
## Max. :70.00 Max. :160.0 Max. :100.00 Max. :19.000
## BodyTemp HeartRate RiskLevel
## Min. : 98.00 Min. : 7.0 Length:1014
## 1st Qu.: 98.00 1st Qu.:70.0 Class :character
## Median : 98.00 Median :76.0 Mode :character
## Mean : 98.67 Mean :74.3
## 3rd Qu.: 98.00 3rd Qu.:80.0
## Max. :103.00 Max. :90.0
Some relevant observations from this preview of this dataset:
Age:
The ages range from 10 to 70 years. The distribution is right-skewed, with the mean (29.87) above the median (26). A closer look at age groups could provide insights (e.g., teenagers, adults, older individuals).
SystolicBP and DiastolicBP:
The ranges appear plausible, but values at the extremes (e.g., 70 for SystolicBP and 49 for DiastolicBP) might need further validation. Median SystolicBP (120) and DiastolicBP (80) align with common healthy ranges.
BS (Blood Sugar):
A wide range from 6 to 19 mmol/L. The mean (8.726) exceeds the median (7.5), indicating that high blood sugar cases skew the data upward.
BodyTemp:
Most values are clustered around 98°F, but there are outliers as high as 103°F.
HeartRate:
Heart rate ranges from 7 to 90 bpm. The minimum of 7 bpm is not physiologically plausible and is likely a recording error. The median (76 bpm) is within a normal range for healthy adults.
RiskLevel:
This categorical variable is the target. Analyzing its distribution is crucial to understand class balance.
Validate and clean data
#Checking for unexpected values (e.g., HeartRate < 40 or > 200)
filter(df, HeartRate < 40 | HeartRate > 200)
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 1 16 120 75 7.9 98 7 low risk
## 2 16 120 75 7.9 98 7 low risk
#Flagging high blood sugar (BS > 15) and body temperatures outside 97 to 100 °F
filter(df, BS > 15 | BodyTemp > 100 | BodyTemp < 97)
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate RiskLevel
## 1 35 85 60 11.0 102 86 high risk
## 2 42 130 80 18.0 98 70 high risk
## 3 30 120 80 6.9 101 76 mid risk
## 4 40 140 100 18.0 98 90 high risk
## 5 12 95 60 6.1 102 60 low risk
## 6 17 85 60 9.0 102 86 mid risk
## 7 32 120 65 6.0 101 76 mid risk
## 8 26 85 60 6.0 101 86 mid risk
## 9 44 120 90 16.0 98 80 mid risk
## 10 13 90 65 7.8 101 80 mid risk
## 11 28 115 60 7.8 101 86 mid risk
## 12 34 85 60 11.0 102 86 high risk
## 13 42 140 100 18.0 98 90 high risk
## 14 50 140 95 17.0 98 60 high risk
## 15 38 135 60 7.9 101 86 high risk
## 16 30 120 80 7.9 101 76 high risk
## 17 55 140 100 18.0 98 90 high risk
## 18 40 160 100 19.0 98 77 high risk
## 19 32 140 90 18.0 98 88 high risk
## 20 55 140 95 19.0 98 77 high risk
## 21 40 160 100 19.0 98 77 high risk
## 22 32 140 90 18.0 98 88 high risk
## 23 22 90 60 7.5 102 60 high risk
## 24 55 140 95 19.0 98 77 high risk
## 25 50 130 100 16.0 98 75 high risk
## 26 18 120 80 6.9 102 76 mid risk
## 27 17 90 60 6.9 101 76 mid risk
## 28 17 90 63 6.9 101 70 mid risk
## 29 25 120 90 6.7 101 80 mid risk
## 30 17 120 80 6.7 102 76 mid risk
## 31 14 90 65 7.0 101 70 high risk
## 32 17 110 75 12.0 101 76 high risk
## 33 40 160 100 19.0 98 77 high risk
## 34 32 140 90 18.0 98 88 high risk
## 35 12 90 60 7.9 102 66 high risk
## 36 12 95 60 6.1 102 60 low risk
## 37 55 140 95 19.0 98 77 high risk
## 38 50 130 100 16.0 98 75 high risk
## 39 13 90 65 7.9 101 80 mid risk
## 40 17 90 65 6.1 103 67 high risk
## 41 28 83 60 8.0 101 86 high risk
## 42 17 85 60 9.0 102 86 high risk
## 43 50 140 95 17.0 98 60 high risk
## 44 28 85 60 9.0 101 86 mid risk
## 45 17 85 60 9.0 102 86 mid risk
## 46 55 140 80 7.2 101 76 high risk
## 47 40 140 100 18.0 98 77 high risk
## 48 28 120 80 9.0 102 76 high risk
## 49 17 90 60 11.0 101 78 high risk
## 50 17 90 63 8.0 101 70 high risk
## 51 25 120 90 12.0 101 80 high risk
## 52 17 120 80 7.0 102 76 high risk
## 53 19 90 65 11.0 101 70 high risk
## 54 32 120 65 6.0 101 76 mid risk
## 55 17 110 75 13.0 101 76 high risk
## 56 40 160 100 19.0 98 77 high risk
## 57 32 140 90 18.0 98 88 high risk
## 58 12 90 60 8.0 102 66 high risk
## 59 12 90 60 11.0 102 60 high risk
## 60 55 140 95 19.0 98 77 high risk
## 61 50 130 100 16.0 98 76 high risk
## 62 13 90 65 9.0 101 80 high risk
## 63 17 90 65 7.7 103 67 high risk
## 64 26 85 60 6.0 101 86 mid risk
## 65 17 85 60 6.3 102 86 high risk
## 66 55 120 90 18.0 98 60 high risk
## 67 35 85 60 19.0 98 86 high risk
## 68 43 120 90 18.0 98 70 high risk
## 69 44 120 90 16.0 98 80 mid risk
## 70 45 120 80 6.9 103 70 low risk
## 71 70 85 60 6.9 102 70 low risk
## 72 65 120 90 6.9 103 76 low risk
## 73 55 120 80 6.9 102 80 low risk
## 74 45 90 60 18.0 101 70 high risk
## 75 22 120 80 6.9 103 76 low risk
## 76 17 110 75 6.9 101 76 high risk
## 77 40 160 100 19.0 98 77 high risk
## 78 32 140 90 18.0 98 88 high risk
## 79 12 90 60 7.8 102 60 high risk
## 80 55 140 95 19.0 98 77 high risk
## 81 50 130 100 16.0 98 75 high risk
## 82 13 90 65 7.8 101 80 mid risk
## 83 17 90 65 7.8 103 67 high risk
## 84 28 115 60 7.8 101 86 mid risk
## 85 17 85 69 7.8 102 86 high risk
## 86 50 130 80 16.0 102 76 mid risk
## 87 27 120 90 6.8 102 68 mid risk
## 88 55 100 70 6.8 101 80 mid risk
## 89 60 140 80 16.0 98 66 high risk
## 90 17 140 100 6.8 103 80 high risk
## 91 36 140 100 6.8 102 76 high risk
## 92 40 140 100 13.0 101 66 high risk
## 93 36 140 100 6.8 102 76 high risk
## 94 40 140 100 13.0 101 66 high risk
## 95 35 85 60 11.0 102 86 high risk
## 96 43 130 80 18.0 98 70 mid risk
## 97 34 85 60 11.0 102 86 high risk
## 98 42 130 80 18.0 98 70 mid risk
## 99 30 120 80 6.8 101 76 low risk
## 100 42 140 100 18.0 98 90 high risk
## 101 18 120 80 6.8 102 76 low risk
## 102 17 90 60 7.9 101 76 low risk
## 103 50 140 95 17.0 98 60 high risk
## 104 38 135 60 7.9 101 86 high risk
## 105 17 85 60 7.9 102 86 low risk
## 106 30 120 80 7.9 101 76 high risk
## 107 55 140 100 18.0 98 90 high risk
## 108 18 120 80 7.9 102 76 mid risk
## 109 17 90 60 7.5 101 76 low risk
## 110 17 90 63 7.5 101 70 low risk
## 111 25 120 90 7.5 101 80 low risk
## 112 17 120 80 7.5 102 76 low risk
## 113 19 90 65 7.5 101 70 low risk
## 114 18 85 60 7.5 101 86 mid risk
## 115 17 85 60 7.5 102 86 low risk
## 116 30 120 80 7.5 101 76 mid risk
## 117 40 160 100 19.0 98 77 high risk
## 118 32 140 90 18.0 98 88 high risk
## 119 12 90 60 7.5 102 66 low risk
## 120 12 90 60 7.5 102 60 low risk
## 121 55 140 95 19.0 98 77 high risk
## 122 50 130 100 16.0 98 75 mid risk
## 123 13 90 65 7.5 101 80 low risk
## 124 17 90 65 7.5 103 67 low risk
## 125 28 115 60 7.5 101 86 mid risk
## 126 17 85 60 7.5 102 86 low risk
## 127 40 160 100 19.0 98 77 high risk
## 128 32 140 90 18.0 98 88 high risk
## 129 12 90 60 7.5 102 66 mid risk
## 130 22 90 60 7.5 102 60 high risk
## 131 55 140 95 19.0 98 77 high risk
## 132 50 130 100 16.0 98 75 high risk
## 133 55 140 95 19.0 98 77 high risk
## 134 50 130 100 16.0 98 75 mid risk
## 135 13 90 65 7.5 101 80 high risk
## 136 17 90 65 7.5 103 67 mid risk
## 137 27 135 60 7.5 101 86 high risk
## 138 17 85 60 7.5 101 86 high risk
## 139 50 140 95 17.0 98 60 high risk
## 140 28 85 60 9.0 101 86 mid risk
## 141 28 95 60 10.0 101 86 high risk
## 142 17 90 60 9.0 102 86 mid risk
## 143 30 120 80 9.0 101 76 mid risk
## 144 35 85 60 11.0 102 86 high risk
## 145 42 130 80 18.0 98 70 high risk
## 146 40 140 100 18.0 98 90 high risk
## 147 14 90 65 7.0 101 70 high risk
## 148 17 110 75 12.0 101 76 high risk
## 149 40 160 100 19.0 98 77 high risk
## 150 30 120 80 6.9 101 76 mid risk
## 151 18 120 80 6.9 102 76 mid risk
## 152 17 90 60 6.9 101 76 mid risk
## 153 17 90 63 6.9 101 70 mid risk
## 154 25 120 90 6.7 101 80 mid risk
## 155 17 120 80 6.7 102 76 mid risk
## 156 13 90 65 7.9 101 80 mid risk
## 157 28 85 60 9.0 101 86 mid risk
## 158 17 85 60 9.0 102 86 mid risk
## 159 30 120 80 6.9 101 76 mid risk
## 160 18 120 80 6.9 102 76 mid risk
## 161 17 90 60 6.9 101 76 mid risk
## 162 17 90 63 6.9 101 70 mid risk
## 163 25 120 90 6.7 101 80 mid risk
## 164 17 120 80 6.7 102 76 mid risk
## 165 13 90 65 7.9 101 80 mid risk
## 166 28 85 60 9.0 101 86 mid risk
## 167 17 85 60 9.0 102 86 mid risk
## 168 32 120 65 6.0 101 76 mid risk
## 169 26 85 60 6.0 101 86 mid risk
## 170 44 120 90 16.0 98 80 mid risk
## 171 13 90 65 7.8 101 80 mid risk
## 172 28 115 60 7.8 101 86 mid risk
## 173 50 130 80 16.0 102 76 mid risk
## 174 27 120 90 6.8 102 68 mid risk
## 175 55 100 70 6.8 101 80 mid risk
## 176 43 130 80 18.0 98 70 mid risk
## 177 42 130 80 18.0 98 70 mid risk
## 178 18 120 80 7.9 102 76 mid risk
## 179 18 85 60 7.5 101 86 mid risk
## 180 30 120 80 7.5 101 76 mid risk
## 181 50 130 100 16.0 98 75 mid risk
## 182 28 115 60 7.5 101 86 mid risk
## 183 12 90 60 7.5 102 66 mid risk
## 184 50 130 100 16.0 98 75 mid risk
## 185 17 90 65 7.5 103 67 mid risk
## 186 28 85 60 9.0 101 86 mid risk
## 187 17 90 60 9.0 102 86 mid risk
## 188 30 120 80 9.0 101 76 mid risk
## 189 30 120 80 6.9 101 76 mid risk
## 190 18 120 80 6.9 102 76 mid risk
## 191 17 90 60 6.9 101 76 mid risk
## 192 17 90 63 6.9 101 70 mid risk
## 193 25 120 90 6.7 101 80 mid risk
## 194 17 120 80 6.7 102 76 mid risk
## 195 13 90 65 7.9 101 80 mid risk
## 196 28 85 60 9.0 101 86 mid risk
## 197 17 85 60 9.0 102 86 mid risk
## 198 32 120 65 6.0 101 76 mid risk
## 199 30 120 80 6.8 101 76 low risk
## 200 18 120 80 6.8 102 76 low risk
## 201 17 90 60 7.9 101 76 low risk
## 202 17 85 60 7.9 102 86 low risk
## 203 17 90 60 7.5 101 76 low risk
## 204 17 90 63 7.5 101 70 low risk
## 205 25 120 90 7.5 101 80 low risk
## 206 17 120 80 7.5 102 76 low risk
## 207 19 90 65 7.5 101 70 low risk
## 208 17 85 60 7.5 102 86 low risk
## 209 12 90 60 7.5 102 66 low risk
## 210 12 90 60 7.5 102 60 low risk
## 211 13 90 65 7.5 101 80 low risk
## 212 17 90 65 7.5 103 67 low risk
## 213 17 85 60 7.5 102 86 low risk
## 214 40 140 100 18.0 98 90 high risk
## 215 14 90 65 7.0 101 70 high risk
## 216 17 110 75 12.0 101 76 high risk
## 217 40 160 100 19.0 98 77 high risk
## 218 32 140 90 18.0 98 88 high risk
## 219 12 90 60 7.9 102 66 high risk
## 220 55 140 95 19.0 98 77 high risk
## 221 50 130 100 16.0 98 75 high risk
## 222 17 90 65 6.1 103 67 high risk
## 223 28 83 60 8.0 101 86 high risk
## 224 17 85 60 9.0 102 86 high risk
## 225 50 140 95 17.0 98 60 high risk
## 226 55 140 80 7.2 101 76 high risk
## 227 40 140 100 18.0 98 77 high risk
## 228 28 120 80 9.0 102 76 high risk
## 229 17 90 60 11.0 101 78 high risk
## 230 17 90 63 8.0 101 70 high risk
## 231 25 120 90 12.0 101 80 high risk
## 232 17 120 80 7.0 102 76 high risk
## 233 19 90 65 11.0 101 70 high risk
## 234 17 110 75 13.0 101 76 high risk
## 235 40 160 100 19.0 98 77 high risk
## 236 32 140 90 18.0 98 88 high risk
## 237 12 90 60 8.0 102 66 high risk
## 238 12 90 60 11.0 102 60 high risk
## 239 55 140 95 19.0 98 77 high risk
## 240 50 130 100 16.0 98 76 high risk
## 241 13 90 65 9.0 101 80 high risk
## 242 17 90 65 7.7 103 67 high risk
## 243 17 85 60 6.3 102 86 high risk
## 244 55 120 90 18.0 98 60 high risk
## 245 35 85 60 19.0 98 86 high risk
## 246 43 120 90 18.0 98 70 high risk
## 247 32 120 65 6.0 101 76 mid risk
#Handling outliers by removing the flagged rows (no capping is applied)
df <- df %>%
  filter(HeartRate >= 40 & HeartRate <= 200) %>%
  filter(BS <= 15 & BodyTemp <= 100 & BodyTemp >= 97)
This filter drops 249 of the 1,014 records (2 for heart rate, 247 for blood sugar or body temperature), leaving 765. Many of the dropped rows are labeled high risk, which contributes to the imbalance shown below. This imbalance could pose a challenge for classification tasks, as models tend to favor the majority class (low risk) over the minority classes (high risk and mid risk). To address this, several techniques can be employed, such as resampling, class weighting, or using specialized algorithms.
##
## high risk low risk mid risk
## 148 367 250
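As an illustration of the class-weighting option mentioned above, the following is a minimal sketch that turns the tabulated class counts into inverse-frequency weights. The variable names (`counts`, `class_weights`, `weighted_sizes`) are hypothetical, and this is not the approach used in the rest of this analysis, which relies on oversampling instead.

```r
# Class counts copied from the table above (after cleaning)
counts <- c("high risk" = 148, "low risk" = 367, "mid risk" = 250)

# Inverse-frequency weights: n_total / (n_classes * n_class)
# Rarer classes receive larger weights
class_weights <- sum(counts) / (length(counts) * counts)
print(round(class_weights, 3))

# Sanity check: after weighting, every class has the same effective size
weighted_sizes <- class_weights * counts
print(weighted_sizes)
```

Weights of this kind could, for example, be passed to a classifier that supports per-class weights, so that errors on the smaller classes count for more during training.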
#Bar plot to visualize the class distribution
ggplot(df, aes(x = RiskLevel, fill = RiskLevel)) +
  geom_bar() +
  labs(title = "Risk Level Distribution", x = "Risk Level", y = "Count") +
  theme_minimal()
Decision for this Dataset
The dataset is relatively small (1,014 instances originally, 765 after cleaning). Removing more samples via undersampling could harm the model’s generalizability. Therefore, I am using oversampling to retain as much data as possible.
Converting to a factor
## 'data.frame': 765 obs. of 7 variables:
## $ Age : int 25 35 29 30 35 23 23 32 23 19 ...
## $ SystolicBP : int 130 140 90 140 120 140 130 120 90 120 ...
## $ DiastolicBP: int 80 90 70 85 60 80 70 90 60 80 ...
## $ BS : num 15 13 8 7 6.1 7.01 7.01 6.9 7.01 7 ...
## $ BodyTemp : num 98 98 100 98 98 98 98 98 98 98 ...
## $ HeartRate : int 86 70 80 70 76 70 78 70 76 70 ...
## $ RiskLevel : Factor w/ 3 levels "high risk","low risk",..: 1 1 1 1 2 1 3 3 2 3 ...
#Loading necessary library for visualization
library(ggplot2)
#Converting RiskLevel to a factor
df$RiskLevel <- as.factor(df$RiskLevel)
#Splitting the dataset into individual classes
high_risk <- df[df$RiskLevel == "high risk", ]
low_risk <- df[df$RiskLevel == "low risk", ]
mid_risk <- df[df$RiskLevel == "mid risk", ]
#Setting the target size for all classes (equal to the largest class size)
target_size <- max(nrow(low_risk), nrow(high_risk), nrow(mid_risk))
#Function to generate synthetic samples for balancing
generate_synthetic <- function(minority_class, target_size) {
  n_samples <- target_size - nrow(minority_class)
  if (n_samples <= 0) return(minority_class) #If already balanced, return original
  #Randomly sample from the minority class with replacement
  synthetic_samples <- minority_class[sample(1:nrow(minority_class), n_samples, replace = TRUE), ]
  #Adding small random noise to numeric columns for synthetic samples
  for (col in names(synthetic_samples)) {
    if (is.numeric(synthetic_samples[[col]])) {
      synthetic_samples[[col]] <- synthetic_samples[[col]] + runif(n_samples, -0.1, 0.1)
    }
  }
  #Combining original and synthetic samples
  return(rbind(minority_class, synthetic_samples))
}
#Generating synthetic samples for minority classes
high_risk_balanced <- generate_synthetic(high_risk, target_size)
mid_risk_balanced <- generate_synthetic(mid_risk, target_size)
low_risk_balanced <- low_risk # Low risk is already at the target size
#Combining all classes into a single balanced dataset
df_balanced <- rbind(high_risk_balanced, mid_risk_balanced, low_risk_balanced)
#Checking the new class distribution
print(table(df_balanced$RiskLevel))
##
## high risk low risk mid risk
## 367 367 367
#Visualizing the balanced dataset
ggplot(df_balanced, aes(x = RiskLevel, fill = RiskLevel)) +
geom_bar() +
labs(title = "Balanced Risk Level Distribution", x = "Risk Level", y = "Count") +
theme_minimal()
#Summary statistics for numeric variables
df %>%
select(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate) %>%
summary()
## Age SystolicBP DiastolicBP BS
## Min. :10.00 Min. : 70.0 Min. : 49.00 Min. : 6.000
## 1st Qu.:20.00 1st Qu.:100.0 1st Qu.: 65.00 1st Qu.: 6.900
## Median :25.00 Median :120.0 Median : 80.00 Median : 7.200
## Mean :30.13 Mean :113.6 Mean : 76.65 Mean : 8.041
## 3rd Qu.:36.00 3rd Qu.:120.0 3rd Qu.: 90.00 3rd Qu.: 7.800
## Max. :66.00 Max. :140.0 Max. :100.00 Max. :15.000
## BodyTemp HeartRate
## Min. : 98.00 Min. :60.00
## 1st Qu.: 98.00 1st Qu.:70.00
## Median : 98.00 Median :70.00
## Mean : 98.07 Mean :73.68
## 3rd Qu.: 98.00 3rd Qu.:78.00
## Max. :100.00 Max. :90.00
#Visualizing numeric feature distributions
df %>%
select(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate) %>%
gather(key = "Feature", value = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~ Feature, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Numeric Features")
#Creating a data frame summarizing variables and comments
summary_table <- data.frame(
Variable = c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate"),
Likely_Distribution = c(
"Slight positive skew",
"Slight negative skew",
"Slight negative skew",
"Positive skew",
"Symmetrical",
"Slight positive skew"
),
Comments = c(
"Younger ages dominate with a few older outliers.",
"Concentration near median with lower outliers.",
"Similar to SystolicBP with fewer low outliers.",
"Higher outliers increase the mean.",
"Likely normal distribution centered at 98.",
"Most values around 70–78, with some high outliers."
)
)
#Printing the table
print(summary_table)
## Variable Likely_Distribution
## 1 Age Slight positive skew
## 2 SystolicBP Slight negative skew
## 3 DiastolicBP Slight negative skew
## 4 BS Positive skew
## 5 BodyTemp Symmetrical
## 6 HeartRate Slight positive skew
## Comments
## 1 Younger ages dominate with a few older outliers.
## 2 Concentration near median with lower outliers.
## 3 Similar to SystolicBP with fewer low outliers.
## 4 Higher outliers increase the mean.
## 5 Likely normal distribution centered at 98.
## 6 Most values around 70–78, with some high outliers.
#Optionally display it in a more readable table format if using R Markdown or Viewer
library(knitr)
kable(summary_table, caption = "Variable Distributions and Comments")

| Variable | Likely_Distribution | Comments |
|---|---|---|
| Age | Slight positive skew | Younger ages dominate with a few older outliers. |
| SystolicBP | Slight negative skew | Concentration near median with lower outliers. |
| DiastolicBP | Slight negative skew | Similar to SystolicBP with fewer low outliers. |
| BS | Positive skew | Higher outliers increase the mean. |
| BodyTemp | Symmetrical | Likely normal distribution centered at 98. |
| HeartRate | Slight positive skew | Most values around 70–78, with some high outliers. |
Observations:
#Correlations
cor_matrix <- cor(df %>% select(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate))
#Visualizing correlation matrix
library(corrplot)
## corrplot 0.95 loaded
Strongest Correlation: SystolicBP and DiastolicBP (0.74), which is expected since they are both measures of blood pressure.
Moderate Correlations: Age and BS (0.45), and BS with both blood pressure measures (~0.33–0.34).
Weak or No Correlation: Body temperature and heart rate show very little to no correlation with other variables.
#Boxplot for Age and RiskLevel
ggplot(df_balanced, aes(x = RiskLevel, y = Age, fill = RiskLevel)) +
geom_boxplot() +
labs(title = "Age Distribution by Risk Level", x = "Risk Level", y = "Age") +
theme_minimal()
#Repeating for other features
#The .data pronoun replaces the deprecated aes_string()
features <- c("SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")
for (feature in features) {
  print(
    ggplot(df_balanced, aes(x = RiskLevel, y = .data[[feature]], fill = RiskLevel)) +
      geom_boxplot() +
      labs(title = paste(feature, "Distribution by Risk Level"),
           x = "Risk Level", y = feature) +
      theme_minimal()
  )
}
#Scatter plot for SystolicBP vs. DiastolicBP by RiskLevel
ggplot(df_balanced, aes(x = SystolicBP, y = DiastolicBP, color = RiskLevel)) +
geom_point(alpha = 0.7) +
labs(title = "Systolic vs. Diastolic BP by Risk Level", x = "Systolic BP", y = "Diastolic BP") +
theme_minimal()
Correlation between the continuous variables and the probability of being in the “high risk” class
#Creating a binary variable for high risk
df_balanced$HighRisk <- ifelse(df_balanced$RiskLevel == "high risk", 1, 0)
#Selecting numeric features
numeric_features <- c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")
#Computing Pearson correlations
pearson_corr <- sapply(numeric_features, function(feature) {
cor(df_balanced[[feature]], df_balanced$HighRisk, method = "pearson")
})
#Displaying Pearson correlations
pearson_corr
## Age SystolicBP DiastolicBP BS BodyTemp HeartRate
## 0.380103117 0.484005994 0.482593784 0.662769454 -0.008962854 0.236358506
#Converting Pearson correlations to a data frame
correlation_df <- data.frame(
Feature = names(pearson_corr),
Correlation = pearson_corr
)
#Bar plot for correlations
library(ggplot2)
ggplot(correlation_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Correlation)) +
geom_bar(stat = "identity") +
labs(title = "Correlation of Features with High Risk",
x = "Feature",
y = "Pearson Correlation") +
theme_minimal() +
coord_flip() + #Flip coordinates for horizontal bars
scale_fill_gradient(low = "blue", high = "red")
I found it surprising that age, which is often thought to be a key factor in health risks, is not the strongest correlate of the “high risk” class (r = 0.38). Instead, direct health indicators show stronger correlations: blood sugar (BS, r = 0.66), systolic blood pressure (r = 0.48), and diastolic blood pressure (r = 0.48).
library(ggplot2)
#List of variables to plot
variables <- c("BS", "SystolicBP", "DiastolicBP")
#Loop through each variable and create a trend plot
#HighRisk is converted to a factor so each group gets its own color and trend line
for (var in variables) {
  print(
    ggplot(df_balanced, aes(x = Age, y = .data[[var]], color = factor(HighRisk))) +
      geom_point(alpha = 0.6) +
      geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
      labs(title = paste(var, "Trend Over Age"),
           x = "Age", y = var, color = "HighRisk") +
      theme_minimal()
  )
}
#Filtering for younger individuals (e.g., Age < 40) who are High Risk
young_high_risk <- df_balanced %>%
filter(Age < 40 & HighRisk == 1)
#Comparing younger high-risk individuals to others
young_non_high_risk <- df_balanced %>%
filter(Age < 40 & HighRisk == 0)
#Summary statistics for young high-risk
summary(young_high_risk)
## Age SystolicBP DiastolicBP BS
## Min. :21.93 Min. : 89.9 Min. : 59.98 Min. : 6.710
## 1st Qu.:25.00 1st Qu.:120.0 1st Qu.: 79.98 1st Qu.: 7.087
## Median :30.00 Median :140.0 Median : 90.00 Median : 8.000
## Mean :30.24 Mean :128.8 Mean : 88.03 Mean : 9.715
## 3rd Qu.:35.00 3rd Qu.:140.0 3rd Qu.:100.00 3rd Qu.:11.956
## Max. :39.99 Max. :140.1 Max. :100.09 Max. :15.081
## BodyTemp HeartRate RiskLevel HighRisk
## Min. : 97.90 Min. :60.00 high risk:203 Min. :1
## 1st Qu.: 97.98 1st Qu.:70.00 low risk : 0 1st Qu.:1
## Median : 98.00 Median :78.03 mid risk : 0 Median :1
## Mean : 98.09 Mean :76.03 Mean :1
## 3rd Qu.: 98.03 3rd Qu.:80.00 3rd Qu.:1
## Max. :100.00 Max. :90.01 Max. :1
## Age SystolicBP DiastolicBP BS
## Min. :10.00 Min. : 69.9 Min. : 49.00 Min. : 5.910
## 1st Qu.:19.00 1st Qu.:100.0 1st Qu.: 60.00 1st Qu.: 6.800
## Median :22.00 Median :120.0 Median : 70.04 Median : 7.000
## Mean :22.99 Mean :109.8 Mean : 72.73 Mean : 7.104
## 3rd Qu.:29.00 3rd Qu.:120.0 3rd Qu.: 80.00 3rd Qu.: 7.500
## Max. :39.95 Max. :130.1 Max. :100.00 Max. :10.994
## BodyTemp HeartRate RiskLevel HighRisk
## Min. : 97.90 Min. :59.90 high risk: 0 Min. :0
## 1st Qu.: 98.00 1st Qu.:70.00 low risk :299 1st Qu.:0
## Median : 98.00 Median :70.00 mid risk :306 Median :0
## Mean : 98.08 Mean :72.93 Mean :0
## 3rd Qu.: 98.00 3rd Qu.:77.96 3rd Qu.:0
## Max. :100.09 Max. :88.00 Max. :0
Insights:
Age: Among individuals under 40, the high-risk group is older on average than the non-high-risk group (mean 30.2 vs. 23.0 years). However, individuals under 30 are still present in the high-risk group (first quartile 25 years), emphasizing the need to explore other factors.
Systolic and diastolic blood pressure: Both are markedly higher in the high-risk group, even among these younger individuals (mean 128.8/88.0 vs. 109.8/72.7). This is consistent with poor blood pressure control being a major risk factor.
Blood sugar (BS): Blood sugar levels are elevated in the high-risk group (mean 9.72 vs. 7.10 mmol/L). The mean sits well above the median (8.0), which suggests that some individuals in this group have extreme values.
library(nnet)
#Fitting multinomial logistic regression
multinom_model <- multinom(RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate, data = df_balanced)
## # weights: 24 (14 variable)
## initial value 1209.572130
## iter 10 value 871.126916
## iter 20 value 778.844664
## iter 30 value 771.175254
## iter 40 value 771.167253
## iter 50 value 764.888262
## iter 60 value 757.170932
## iter 70 value 757.166158
## iter 70 value 757.166155
## final value 757.166055
## converged
## Call:
## multinom(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP +
## BS + BodyTemp + HeartRate, data = df_balanced)
##
## Coefficients:
## (Intercept) Age SystolicBP DiastolicBP BS BodyTemp
## low risk 226.0809 0.01944114 -0.12805405 -0.04257232 -0.9177363 -1.965944
## mid risk 127.2109 0.01825165 -0.07244766 -0.07573707 -0.6787368 -1.020532
## HeartRate
## low risk -0.09894234
## mid risk -0.08889116
##
## Std. Errors:
## (Intercept) Age SystolicBP DiastolicBP BS BodyTemp
## low risk 0.0002141865 0.01187866 0.01265363 0.01317042 0.09080277 0.02194777
## mid risk 0.0001805383 0.01143589 0.01191964 0.01217270 0.06406734 0.02020524
## HeartRate
## low risk 0.01679366
## mid risk 0.01516674
##
## Residual Deviance: 1514.332
## AIC: 1542.332
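To make the coefficients above easier to read, here is a small sketch that converts two of them into odds ratios. The values are copied from the summary output, the reference class is high risk (the first factor level), and the object names (`coefs`, `odds_ratios`) are hypothetical.

```r
# Coefficients copied from the multinom() summary above
# (each row compares that class against the reference class, high risk)
coefs <- rbind(
  "low risk" = c(BS = -0.9177363, SystolicBP = -0.12805405),
  "mid risk" = c(BS = -0.6787368, SystolicBP = -0.07244766)
)

# Exponentiating a log-odds coefficient gives an odds ratio per unit increase
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

Read this way, each additional 1 mmol/L of blood sugar multiplies the odds of being low risk rather than high risk by roughly 0.4, holding the other predictors fixed. An odds ratio below 1 means higher values push a record toward the high-risk class.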
#Predicting on the training set
predicted <- predict(multinom_model, df_balanced)
#Confusion matrix
table(Predicted = predicted, Actual = df_balanced$RiskLevel)
## Actual
## Predicted high risk low risk mid risk
## high risk 297 13 37
## low risk 15 229 114
## mid risk 55 125 216
#Loading caret for createDataPartition() and confusionMatrix()
library(caret)
#Converting RiskLevel to a factor if not already
df_balanced$RiskLevel <- as.factor(df_balanced$RiskLevel)
#Splitting the dataset into training and testing sets
set.seed(42)
train_index <- createDataPartition(df_balanced$RiskLevel, p = 0.7, list = FALSE)
train_data <- df_balanced[train_index, ]
test_data <- df_balanced[-train_index, ]
#Loading randomForest for the classifier
library(randomForest)
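Because `df_balanced` was oversampled before the split above, a jittered copy of a record can end up in the test set while its original sits in the training set, which may make test accuracy look better than it would on unseen data. The sketch below shows one possible alternative ordering, split first and then oversample only the training part. It uses a toy data frame (`toy`, with made-up values) and plain duplication rather than the jittering function used earlier, so it illustrates the idea and is not the pipeline used in this report.

```r
set.seed(42)

# Toy stand-in for the cleaned data (hypothetical values, not the real records)
toy <- data.frame(
  BS = c(rnorm(30, mean = 7, sd = 0.5), rnorm(10, mean = 12, sd = 1)),
  RiskLevel = factor(c(rep("low risk", 30), rep("high risk", 10)))
)

# Step 1: stratified 70/30 split on the ORIGINAL rows
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$RiskLevel),
  function(idx) sample(idx, round(0.7 * length(idx)))
))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Step 2: oversample minority classes in the training part only
target <- max(table(train$RiskLevel))
train_bal <- do.call(rbind, lapply(split(train, train$RiskLevel), function(d) {
  extra <- d[sample(seq_len(nrow(d)), target - nrow(d), replace = TRUE), ]
  rbind(d, extra)
}))

# Training classes are now balanced; the test set keeps the original mix
print(table(train_bal$RiskLevel))
print(table(test$RiskLevel))
```

With this ordering, every test row is an original record that the model never saw in any form during training.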
#Fitting the Random Forest model
rf_model <- randomForest(RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate,
data = train_data, ntree = 100, importance = TRUE)
# View the model
print(rf_model)
##
## Call:
## randomForest(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate, data = train_data, ntree = 100, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.14%
## Confusion matrix:
## high risk low risk mid risk class.error
## high risk 244 6 7 0.05058366
## low risk 6 209 42 0.18677043
## mid risk 14 34 209 0.18677043
#Predicting on the test set
rf_predictions <- predict(rf_model, test_data)
#Confusion matrix
confusion_matrix <- confusionMatrix(rf_predictions, test_data$RiskLevel)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high risk low risk mid risk
## high risk 108 1 10
## low risk 1 88 17
## mid risk 1 21 83
##
## Overall Statistics
##
## Accuracy : 0.8455
## 95% CI : (0.8019, 0.8827)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7682
##
## Mcnemar's Test P-Value : 0.05068
##
## Statistics by Class:
##
## Class: high risk Class: low risk Class: mid risk
## Sensitivity 0.9818 0.8000 0.7545
## Specificity 0.9500 0.9182 0.9000
## Pos Pred Value 0.9076 0.8302 0.7905
## Neg Pred Value 0.9905 0.9018 0.8800
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3273 0.2667 0.2515
## Detection Prevalence 0.3606 0.3212 0.3182
## Balanced Accuracy 0.9659 0.8591 0.8273
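The headline statistics above can be reproduced by hand from the confusion matrix. The following sketch re-enters the counts from the output and recomputes accuracy, per-class sensitivity, and positive predictive value. The object names (`cm`, `accuracy`, `sensitivity`, `ppv`) are hypothetical.

```r
# Counts copied from the confusion matrix above
# (rows = predicted class, columns = reference class; matrix() fills by column)
classes <- c("high risk", "low risk", "mid risk")
cm <- matrix(
  c(108, 1, 1,    # reference: high risk
    1, 88, 21,    # reference: low risk
    10, 17, 83),  # reference: mid risk
  nrow = 3,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Accuracy: correctly classified records over all test records
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correct predictions over all records truly in that class
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions over all predictions of that class
ppv <- diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(ppv, 4))
```

This gives 279 correct out of 330 test records, and shows that the errors sit mostly between the low-risk and mid-risk classes rather than in the high-risk class.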
In conclusion, this study underscores the persistent challenges of maternal health complications in rural, resource-limited settings like Bangladesh. Using records collected by an IoT-based monitoring system, I identified critical risk factors such as elevated blood pressure and blood sugar levels, along with a complex age-related risk profile. High-risk individuals are generally older, but individuals under 30 are also well represented in the high-risk group, highlighting the need to explore additional factors. Blood pressure, both systolic and diastolic, is markedly higher in the high-risk group, even among younger individuals, which points to poor blood pressure control as a major risk factor. Elevated blood sugar levels in the high-risk group further emphasize the importance of comprehensive metabolic monitoring.
Initially, I employed Multinomial Logistic Regression to identify key predictors such as blood sugar, blood pressure, and age. While this model provided interpretable coefficients, it classified only 742 of 1,101 training records correctly (67.4%) and struggled to separate the low-risk and mid-risk categories (114 mid-risk records predicted as low risk and 125 low-risk records predicted as mid risk). This is likely due to overlapping feature distributions and the model's assumption of linearity between features and the log-odds of each class. These limitations motivated the transition to a more flexible model.
I chose Random Forest for its ability to handle non-linear relationships and capture feature interactions, such as the combined effect of high blood sugar and elevated blood pressure on maternal health risks. This model performed better, with a test-set accuracy of 84.55% (95% CI 80.19% to 88.27%) and sensitivities of 98.2% for high risk, 80.0% for low risk, and 75.5% for mid risk. Random Forest also identified blood sugar, systolic and diastolic blood pressure, and age as the most important predictors of maternal health risks.
The use of advanced machine learning techniques, particularly Random Forest classification, enabled the development of predictive models that offer nuanced insights into maternal health risks. These findings underscore the importance of comprehensive, individualized risk assessments and proactive intervention strategies. The scalable methodology presented in this research provides a valuable framework for early risk identification and targeted interventions, ultimately contributing to improved maternal health outcomes in vulnerable populations.
-Oversampling and undersampling in imbalanced data:
https://medium.com/metaor-artificial-intelligence/solving-the-class-imbalance-problem-58cb926b5a0f