Abstract

Maternal health complications remain a critical challenge in rural, resource-limited settings such as those in Bangladesh. This study investigates primary risk factors by analyzing 1,014 health records collected through an IoT-based monitoring system across rural healthcare facilities. Elevated systolic and diastolic blood pressures in the high-risk group, even among younger individuals, confirm blood pressure control as a fundamental risk indicator. Markedly elevated blood sugar levels in the high-risk group underscore the importance of comprehensive metabolic monitoring. Age revealed a complex risk profile: high-risk individuals were generally older, yet young women under 30 were also significantly represented. This finding challenges simplistic age-based risk assumptions. Using machine learning techniques, specifically random forest classification, researchers developed predictive models stratifying maternal health risks into low, medium, and high-risk categories. These models provide nuanced insights into the significance of various health indicators in predicting adverse outcomes. The research offers a scalable methodology for early risk identification and targeted interventions, illuminating potential strategies for addressing maternal health challenges in low-resource environments. The findings challenge traditional approaches, emphasizing the need for individualized, comprehensive risk assessment and proactive intervention strategies. Ultimately, this research contributes knowledge to improve maternal health outcomes, offering a data-driven approach to understanding and mitigating risks in vulnerable populations.

Introduction

In rural Bangladesh, maternal health complications pose significant challenges, especially in resource-limited settings. This study examines key risk factors by analyzing 1,014 health records from an Internet of Things (IoT)-based monitoring system across rural healthcare facilities.

Research question

What are the primary risk factors for maternal health complications in rural Bangladesh?

By analyzing the features in the dataset, researchers could identify the factors most strongly associated with adverse maternal outcomes.

Loading Data and packages

library(tidyverse)
library(openintro)
#URL of the dataset
url <- "https://raw.githubusercontent.com/lburenkov/maternalrisk/refs/heads/main/Maternal%20Health%20Risk%20Data%20Set.csv"

#Loading the dataset into a data frame
df <- read.csv(url)

#Displaying the first few rows of the dataset
head(df)
##   Age SystolicBP DiastolicBP    BS BodyTemp HeartRate RiskLevel
## 1  25        130          80 15.00       98        86 high risk
## 2  35        140          90 13.00       98        70 high risk
## 3  29         90          70  8.00      100        80 high risk
## 4  30        140          85  7.00       98        70 high risk
## 5  35        120          60  6.10       98        76  low risk
## 6  23        140          80  7.01       98        70 high risk

Data exploration and analysis

Checking for missing values

#Just checking for missing values
colSums(is.na(df)) #Should all be 0 since no missing values are mentioned
##         Age  SystolicBP DiastolicBP          BS    BodyTemp   HeartRate 
##           0           0           0           0           0           0 
##   RiskLevel 
##           0

Dataset Summary and Structure

#Checking structure and summary
str(df)        #Checking data types
## 'data.frame':    1014 obs. of  7 variables:
##  $ Age        : int  25 35 29 30 35 23 23 35 32 42 ...
##  $ SystolicBP : int  130 140 90 140 120 140 130 85 120 130 ...
##  $ DiastolicBP: int  80 90 70 85 60 80 70 60 90 80 ...
##  $ BS         : num  15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
##  $ BodyTemp   : num  98 98 100 98 98 98 98 102 98 98 ...
##  $ HeartRate  : int  86 70 80 70 76 70 78 86 70 70 ...
##  $ RiskLevel  : chr  "high risk" "high risk" "high risk" "high risk" ...
summary(df)    #Summary of features
##       Age          SystolicBP     DiastolicBP           BS        
##  Min.   :10.00   Min.   : 70.0   Min.   : 49.00   Min.   : 6.000  
##  1st Qu.:19.00   1st Qu.:100.0   1st Qu.: 65.00   1st Qu.: 6.900  
##  Median :26.00   Median :120.0   Median : 80.00   Median : 7.500  
##  Mean   :29.87   Mean   :113.2   Mean   : 76.46   Mean   : 8.726  
##  3rd Qu.:39.00   3rd Qu.:120.0   3rd Qu.: 90.00   3rd Qu.: 8.000  
##  Max.   :70.00   Max.   :160.0   Max.   :100.00   Max.   :19.000  
##     BodyTemp        HeartRate     RiskLevel        
##  Min.   : 98.00   Min.   : 7.0   Length:1014       
##  1st Qu.: 98.00   1st Qu.:70.0   Class :character  
##  Median : 98.00   Median :76.0   Mode  :character  
##  Mean   : 98.67   Mean   :74.3                     
##  3rd Qu.: 98.00   3rd Qu.:80.0                     
##  Max.   :103.00   Max.   :90.0

Some relevant observations from this preview of this dataset:

Age:

The ages range from 10 to 70 years. The distribution seems right-skewed with a mean higher than the median. A closer look at age groups could provide insights (e.g., teenagers, adults, older individuals).

SystolicBP and DiastolicBP:

The ranges appear plausible, but values at the extremes (e.g., 70 for SystolicBP and 49 for DiastolicBP) might need further validation. Median SystolicBP (120) and DiastolicBP (80) align with common healthy ranges.

BS (Blood Sugar):

A wide range from 6 to 19 mmol/L. Mean (8.726) and median (7.5) indicate possible high blood sugar cases skewing the data.

BodyTemp:

Most values are clustered around 98°F, but there are outliers like 103°F.

HeartRate:

Resting heart rate ranges from 7 (anomalous or error?) to 90 bpm. Median (76 bpm) is within a normal range for healthy adults.

RiskLevel:

This categorical variable is the target. Analyzing its distribution is crucial to understand class balance.
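The closer look at age groups suggested above could be sketched with `cut()`. This is a minimal illustration only: the ages are toy values standing in for `df$Age`, and the cut points and labels are assumptions, not bands defined in the original analysis.

```r
# Toy ages standing in for df$Age (illustrative values, not from the dataset)
ages <- c(10, 16, 19, 23, 29, 35, 42, 55, 70)

# Hypothetical bands: up to 19, 20 to 34, 35 and over
age_group <- cut(
  ages,
  breaks = c(-Inf, 19, 34, Inf),
  labels = c("under 20", "20-34", "35+")
)

# Count how many records fall in each band
age_counts <- table(age_group)
print(age_counts)
```

With the real data, the same grouping column could be cross-tabulated against `RiskLevel` to compare risk across age bands.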

Validate and clean data

#Checking for unexpected values (e.g., HeartRate < 40 or > 200)
filter(df, HeartRate < 40 | HeartRate > 200)
##   Age SystolicBP DiastolicBP  BS BodyTemp HeartRate RiskLevel
## 1  16        120          75 7.9       98         7  low risk
## 2  16        120          75 7.9       98         7  low risk
#Investigating unusually low or high Blood Sugar and BodyTemp
filter(df, BS > 15 | BodyTemp > 100 | BodyTemp < 97)
##     Age SystolicBP DiastolicBP   BS BodyTemp HeartRate RiskLevel
## 1    35         85          60 11.0      102        86 high risk
## 2    42        130          80 18.0       98        70 high risk
## 3    30        120          80  6.9      101        76  mid risk
## 4    40        140         100 18.0       98        90 high risk
## 5    12         95          60  6.1      102        60  low risk
## 6    17         85          60  9.0      102        86  mid risk
## 7    32        120          65  6.0      101        76  mid risk
## 8    26         85          60  6.0      101        86  mid risk
## 9    44        120          90 16.0       98        80  mid risk
## 10   13         90          65  7.8      101        80  mid risk
## 11   28        115          60  7.8      101        86  mid risk
## 12   34         85          60 11.0      102        86 high risk
## 13   42        140         100 18.0       98        90 high risk
## 14   50        140          95 17.0       98        60 high risk
## 15   38        135          60  7.9      101        86 high risk
## 16   30        120          80  7.9      101        76 high risk
## 17   55        140         100 18.0       98        90 high risk
## 18   40        160         100 19.0       98        77 high risk
## 19   32        140          90 18.0       98        88 high risk
## 20   55        140          95 19.0       98        77 high risk
## 21   40        160         100 19.0       98        77 high risk
## 22   32        140          90 18.0       98        88 high risk
## 23   22         90          60  7.5      102        60 high risk
## 24   55        140          95 19.0       98        77 high risk
## 25   50        130         100 16.0       98        75 high risk
## 26   18        120          80  6.9      102        76  mid risk
## 27   17         90          60  6.9      101        76  mid risk
## 28   17         90          63  6.9      101        70  mid risk
## 29   25        120          90  6.7      101        80  mid risk
## 30   17        120          80  6.7      102        76  mid risk
## 31   14         90          65  7.0      101        70 high risk
## 32   17        110          75 12.0      101        76 high risk
## 33   40        160         100 19.0       98        77 high risk
## 34   32        140          90 18.0       98        88 high risk
## 35   12         90          60  7.9      102        66 high risk
## 36   12         95          60  6.1      102        60  low risk
## 37   55        140          95 19.0       98        77 high risk
## 38   50        130         100 16.0       98        75 high risk
## 39   13         90          65  7.9      101        80  mid risk
## 40   17         90          65  6.1      103        67 high risk
## 41   28         83          60  8.0      101        86 high risk
## 42   17         85          60  9.0      102        86 high risk
## 43   50        140          95 17.0       98        60 high risk
## 44   28         85          60  9.0      101        86  mid risk
## 45   17         85          60  9.0      102        86  mid risk
## 46   55        140          80  7.2      101        76 high risk
## 47   40        140         100 18.0       98        77 high risk
## 48   28        120          80  9.0      102        76 high risk
## 49   17         90          60 11.0      101        78 high risk
## 50   17         90          63  8.0      101        70 high risk
## 51   25        120          90 12.0      101        80 high risk
## 52   17        120          80  7.0      102        76 high risk
## 53   19         90          65 11.0      101        70 high risk
## 54   32        120          65  6.0      101        76  mid risk
## 55   17        110          75 13.0      101        76 high risk
## 56   40        160         100 19.0       98        77 high risk
## 57   32        140          90 18.0       98        88 high risk
## 58   12         90          60  8.0      102        66 high risk
## 59   12         90          60 11.0      102        60 high risk
## 60   55        140          95 19.0       98        77 high risk
## 61   50        130         100 16.0       98        76 high risk
## 62   13         90          65  9.0      101        80 high risk
## 63   17         90          65  7.7      103        67 high risk
## 64   26         85          60  6.0      101        86  mid risk
## 65   17         85          60  6.3      102        86 high risk
## 66   55        120          90 18.0       98        60 high risk
## 67   35         85          60 19.0       98        86 high risk
## 68   43        120          90 18.0       98        70 high risk
## 69   44        120          90 16.0       98        80  mid risk
## 70   45        120          80  6.9      103        70  low risk
## 71   70         85          60  6.9      102        70  low risk
## 72   65        120          90  6.9      103        76  low risk
## 73   55        120          80  6.9      102        80  low risk
## 74   45         90          60 18.0      101        70 high risk
## 75   22        120          80  6.9      103        76  low risk
## 76   17        110          75  6.9      101        76 high risk
## 77   40        160         100 19.0       98        77 high risk
## 78   32        140          90 18.0       98        88 high risk
## 79   12         90          60  7.8      102        60 high risk
## 80   55        140          95 19.0       98        77 high risk
## 81   50        130         100 16.0       98        75 high risk
## 82   13         90          65  7.8      101        80  mid risk
## 83   17         90          65  7.8      103        67 high risk
## 84   28        115          60  7.8      101        86  mid risk
## 85   17         85          69  7.8      102        86 high risk
## 86   50        130          80 16.0      102        76  mid risk
## 87   27        120          90  6.8      102        68  mid risk
## 88   55        100          70  6.8      101        80  mid risk
## 89   60        140          80 16.0       98        66 high risk
## 90   17        140         100  6.8      103        80 high risk
## 91   36        140         100  6.8      102        76 high risk
## 92   40        140         100 13.0      101        66 high risk
## 93   36        140         100  6.8      102        76 high risk
## 94   40        140         100 13.0      101        66 high risk
## 95   35         85          60 11.0      102        86 high risk
## 96   43        130          80 18.0       98        70  mid risk
## 97   34         85          60 11.0      102        86 high risk
## 98   42        130          80 18.0       98        70  mid risk
## 99   30        120          80  6.8      101        76  low risk
## 100  42        140         100 18.0       98        90 high risk
## 101  18        120          80  6.8      102        76  low risk
## 102  17         90          60  7.9      101        76  low risk
## 103  50        140          95 17.0       98        60 high risk
## 104  38        135          60  7.9      101        86 high risk
## 105  17         85          60  7.9      102        86  low risk
## 106  30        120          80  7.9      101        76 high risk
## 107  55        140         100 18.0       98        90 high risk
## 108  18        120          80  7.9      102        76  mid risk
## 109  17         90          60  7.5      101        76  low risk
## 110  17         90          63  7.5      101        70  low risk
## 111  25        120          90  7.5      101        80  low risk
## 112  17        120          80  7.5      102        76  low risk
## 113  19         90          65  7.5      101        70  low risk
## 114  18         85          60  7.5      101        86  mid risk
## 115  17         85          60  7.5      102        86  low risk
## 116  30        120          80  7.5      101        76  mid risk
## 117  40        160         100 19.0       98        77 high risk
## 118  32        140          90 18.0       98        88 high risk
## 119  12         90          60  7.5      102        66  low risk
## 120  12         90          60  7.5      102        60  low risk
## 121  55        140          95 19.0       98        77 high risk
## 122  50        130         100 16.0       98        75  mid risk
## 123  13         90          65  7.5      101        80  low risk
## 124  17         90          65  7.5      103        67  low risk
## 125  28        115          60  7.5      101        86  mid risk
## 126  17         85          60  7.5      102        86  low risk
## 127  40        160         100 19.0       98        77 high risk
## 128  32        140          90 18.0       98        88 high risk
## 129  12         90          60  7.5      102        66  mid risk
## 130  22         90          60  7.5      102        60 high risk
## 131  55        140          95 19.0       98        77 high risk
## 132  50        130         100 16.0       98        75 high risk
## 133  55        140          95 19.0       98        77 high risk
## 134  50        130         100 16.0       98        75  mid risk
## 135  13         90          65  7.5      101        80 high risk
## 136  17         90          65  7.5      103        67  mid risk
## 137  27        135          60  7.5      101        86 high risk
## 138  17         85          60  7.5      101        86 high risk
## 139  50        140          95 17.0       98        60 high risk
## 140  28         85          60  9.0      101        86  mid risk
## 141  28         95          60 10.0      101        86 high risk
## 142  17         90          60  9.0      102        86  mid risk
## 143  30        120          80  9.0      101        76  mid risk
## 144  35         85          60 11.0      102        86 high risk
## 145  42        130          80 18.0       98        70 high risk
## 146  40        140         100 18.0       98        90 high risk
## 147  14         90          65  7.0      101        70 high risk
## 148  17        110          75 12.0      101        76 high risk
## 149  40        160         100 19.0       98        77 high risk
## 150  30        120          80  6.9      101        76  mid risk
## 151  18        120          80  6.9      102        76  mid risk
## 152  17         90          60  6.9      101        76  mid risk
## 153  17         90          63  6.9      101        70  mid risk
## 154  25        120          90  6.7      101        80  mid risk
## 155  17        120          80  6.7      102        76  mid risk
## 156  13         90          65  7.9      101        80  mid risk
## 157  28         85          60  9.0      101        86  mid risk
## 158  17         85          60  9.0      102        86  mid risk
## 159  30        120          80  6.9      101        76  mid risk
## 160  18        120          80  6.9      102        76  mid risk
## 161  17         90          60  6.9      101        76  mid risk
## 162  17         90          63  6.9      101        70  mid risk
## 163  25        120          90  6.7      101        80  mid risk
## 164  17        120          80  6.7      102        76  mid risk
## 165  13         90          65  7.9      101        80  mid risk
## 166  28         85          60  9.0      101        86  mid risk
## 167  17         85          60  9.0      102        86  mid risk
## 168  32        120          65  6.0      101        76  mid risk
## 169  26         85          60  6.0      101        86  mid risk
## 170  44        120          90 16.0       98        80  mid risk
## 171  13         90          65  7.8      101        80  mid risk
## 172  28        115          60  7.8      101        86  mid risk
## 173  50        130          80 16.0      102        76  mid risk
## 174  27        120          90  6.8      102        68  mid risk
## 175  55        100          70  6.8      101        80  mid risk
## 176  43        130          80 18.0       98        70  mid risk
## 177  42        130          80 18.0       98        70  mid risk
## 178  18        120          80  7.9      102        76  mid risk
## 179  18         85          60  7.5      101        86  mid risk
## 180  30        120          80  7.5      101        76  mid risk
## 181  50        130         100 16.0       98        75  mid risk
## 182  28        115          60  7.5      101        86  mid risk
## 183  12         90          60  7.5      102        66  mid risk
## 184  50        130         100 16.0       98        75  mid risk
## 185  17         90          65  7.5      103        67  mid risk
## 186  28         85          60  9.0      101        86  mid risk
## 187  17         90          60  9.0      102        86  mid risk
## 188  30        120          80  9.0      101        76  mid risk
## 189  30        120          80  6.9      101        76  mid risk
## 190  18        120          80  6.9      102        76  mid risk
## 191  17         90          60  6.9      101        76  mid risk
## 192  17         90          63  6.9      101        70  mid risk
## 193  25        120          90  6.7      101        80  mid risk
## 194  17        120          80  6.7      102        76  mid risk
## 195  13         90          65  7.9      101        80  mid risk
## 196  28         85          60  9.0      101        86  mid risk
## 197  17         85          60  9.0      102        86  mid risk
## 198  32        120          65  6.0      101        76  mid risk
## 199  30        120          80  6.8      101        76  low risk
## 200  18        120          80  6.8      102        76  low risk
## 201  17         90          60  7.9      101        76  low risk
## 202  17         85          60  7.9      102        86  low risk
## 203  17         90          60  7.5      101        76  low risk
## 204  17         90          63  7.5      101        70  low risk
## 205  25        120          90  7.5      101        80  low risk
## 206  17        120          80  7.5      102        76  low risk
## 207  19         90          65  7.5      101        70  low risk
## 208  17         85          60  7.5      102        86  low risk
## 209  12         90          60  7.5      102        66  low risk
## 210  12         90          60  7.5      102        60  low risk
## 211  13         90          65  7.5      101        80  low risk
## 212  17         90          65  7.5      103        67  low risk
## 213  17         85          60  7.5      102        86  low risk
## 214  40        140         100 18.0       98        90 high risk
## 215  14         90          65  7.0      101        70 high risk
## 216  17        110          75 12.0      101        76 high risk
## 217  40        160         100 19.0       98        77 high risk
## 218  32        140          90 18.0       98        88 high risk
## 219  12         90          60  7.9      102        66 high risk
## 220  55        140          95 19.0       98        77 high risk
## 221  50        130         100 16.0       98        75 high risk
## 222  17         90          65  6.1      103        67 high risk
## 223  28         83          60  8.0      101        86 high risk
## 224  17         85          60  9.0      102        86 high risk
## 225  50        140          95 17.0       98        60 high risk
## 226  55        140          80  7.2      101        76 high risk
## 227  40        140         100 18.0       98        77 high risk
## 228  28        120          80  9.0      102        76 high risk
## 229  17         90          60 11.0      101        78 high risk
## 230  17         90          63  8.0      101        70 high risk
## 231  25        120          90 12.0      101        80 high risk
## 232  17        120          80  7.0      102        76 high risk
## 233  19         90          65 11.0      101        70 high risk
## 234  17        110          75 13.0      101        76 high risk
## 235  40        160         100 19.0       98        77 high risk
## 236  32        140          90 18.0       98        88 high risk
## 237  12         90          60  8.0      102        66 high risk
## 238  12         90          60 11.0      102        60 high risk
## 239  55        140          95 19.0       98        77 high risk
## 240  50        130         100 16.0       98        76 high risk
## 241  13         90          65  9.0      101        80 high risk
## 242  17         90          65  7.7      103        67 high risk
## 243  17         85          60  6.3      102        86 high risk
## 244  55        120          90 18.0       98        60 high risk
## 245  35         85          60 19.0       98        86 high risk
## 246  43        120          90 18.0       98        70 high risk
## 247  32        120          65  6.0      101        76  mid risk
#Handling outliers by removing the out-of-range rows (no capping is applied)
df <- df %>%
  filter(HeartRate >= 40 & HeartRate <= 200) %>%
  filter(BS <= 15 & BodyTemp <= 100 & BodyTemp >= 97)
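The step above drops out-of-range rows. If capping were preferred instead, so that no records are lost, it might look like the following sketch. The vector is toy data, and the limits simply reuse the heart-rate thresholds from the filter; whether those are the right clinical limits is an assumption.

```r
# Toy readings standing in for df$HeartRate (illustrative values only)
heart_rate <- c(7, 60, 76, 90, 210)

# Hypothetical plausibility limits, matching the filter thresholds used above
lower <- 40
upper <- 200

# Cap: pull values outside the limits back to the nearest limit
heart_rate_capped <- pmin(pmax(heart_rate, lower), upper)
print(heart_rate_capped)
```

Capping keeps the row count unchanged but replaces the suspect reading with a boundary value, so it trades one distortion for another.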

After cleaning, the classes are imbalanced: low risk has 367 records, mid risk 250, and high risk 148. This imbalance could pose a challenge for classification tasks, as models tend to favor the majority class (low risk) over the minority classes (high risk and mid risk). Techniques such as resampling, class weighting, or specialized algorithms can address this.

#Class distribution
table(df$RiskLevel)
## 
## high risk  low risk  mid risk 
##       148       367       250
#Bar plot to visualize
ggplot(df, aes(x = RiskLevel, fill = RiskLevel)) +
  geom_bar() +
  labs(title = "Risk Level Distribution", x = "Risk Level", y = "Count") +
  theme_minimal()
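Class weighting, one of the options for handling imbalance, can be sketched from the counts reported by `table(df$RiskLevel)`. The inverse-frequency formula below is one common convention and is an assumption here; the original analysis uses oversampling rather than weights.

```r
# Class counts as reported by table(df$RiskLevel) after cleaning
class_counts <- c("high risk" = 148, "low risk" = 367, "mid risk" = 250)

# Inverse-frequency weights: n_total / (n_classes * n_class)
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)
print(round(class_weights, 3))
```

The rarest class receives the largest weight, and the weighted counts still sum to the original number of records.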

Oversampling

Decision for this dataset: The original dataset has 1,014 instances (765 after cleaning), which is relatively small. Removing more samples via undersampling could harm the model’s generalizability. Therefore, I am using oversampling to retain as much data as possible.

Converting to a factor

df$RiskLevel <- as.factor(df$RiskLevel)
str(df)
## 'data.frame':    765 obs. of  7 variables:
##  $ Age        : int  25 35 29 30 35 23 23 32 23 19 ...
##  $ SystolicBP : int  130 140 90 140 120 140 130 120 90 120 ...
##  $ DiastolicBP: int  80 90 70 85 60 80 70 90 60 80 ...
##  $ BS         : num  15 13 8 7 6.1 7.01 7.01 6.9 7.01 7 ...
##  $ BodyTemp   : num  98 98 100 98 98 98 98 98 98 98 ...
##  $ HeartRate  : int  86 70 80 70 76 70 78 70 76 70 ...
##  $ RiskLevel  : Factor w/ 3 levels "high risk","low risk",..: 1 1 1 1 2 1 3 3 2 3 ...

Numeric variables

#Loading necessary library for visualization
library(ggplot2)

#Converting RiskLevel to a factor
df$RiskLevel <- as.factor(df$RiskLevel)

#Splitting the dataset into individual classes
high_risk <- df[df$RiskLevel == "high risk", ]
low_risk <- df[df$RiskLevel == "low risk", ]
mid_risk <- df[df$RiskLevel == "mid risk", ]

#Setting the target size for all classes (equal to the largest class size)
target_size <- max(nrow(low_risk), nrow(high_risk), nrow(mid_risk))

#Function to generate synthetic samples for balancing
generate_synthetic <- function(minority_class, target_size) {
  n_samples <- target_size - nrow(minority_class)
  if (n_samples <= 0) return(minority_class) #If already balanced, return original
  
  #Randomly sample from the minority class with replacement
  synthetic_samples <- minority_class[sample(1:nrow(minority_class), n_samples, replace = TRUE), ]
  
  #Adding small random noise to numeric columns for synthetic samples
  for (col in names(synthetic_samples)) {
    if (is.numeric(synthetic_samples[[col]])) {
      synthetic_samples[[col]] <- synthetic_samples[[col]] + runif(n_samples, -0.1, 0.1)
    }
  }
  
  #Combining original and synthetic samples
  return(rbind(minority_class, synthetic_samples))
}

#Generating synthetic samples for minority classes
high_risk_balanced <- generate_synthetic(high_risk, target_size)
mid_risk_balanced <- generate_synthetic(mid_risk, target_size)
low_risk_balanced <- low_risk # Low risk is already at the target size

#Combining all classes into a single balanced dataset
df_balanced <- rbind(high_risk_balanced, mid_risk_balanced, low_risk_balanced)

#Checking the new class distribution
print(table(df_balanced$RiskLevel))
## 
## high risk  low risk  mid risk 
##       367       367       367
#Visualizing the balanced dataset
ggplot(df_balanced, aes(x = RiskLevel, fill = RiskLevel)) +
  geom_bar() +
  labs(title = "Balanced Risk Level Distribution", x = "Risk Level", y = "Count") +
  theme_minimal()
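The resample-and-jitter idea behind `generate_synthetic()` can be checked on a tiny example. The sketch below is a standalone toy version, not the function itself: the data frame values and the `target_size` of 8 are made up, and `set.seed()` is added so that the draw is reproducible.

```r
set.seed(42)  # fixed seed so the resampling and noise are reproducible

# Toy minority class (hypothetical values, not taken from the dataset)
minority <- data.frame(
  SystolicBP = c(140, 130, 160),
  BS = c(15, 13, 18),
  RiskLevel = factor(rep("high risk", 3))
)
target_size <- 8

# Resample existing rows with replacement to fill the gap
n_extra <- target_size - nrow(minority)
idx <- sample(seq_len(nrow(minority)), n_extra, replace = TRUE)
synthetic <- minority[idx, ]

# Jitter numeric columns by at most 0.1 in either direction
numeric_cols <- names(synthetic)[sapply(synthetic, is.numeric)]
for (col in numeric_cols) {
  synthetic[[col]] <- synthetic[[col]] + runif(n_extra, -0.1, 0.1)
}

balanced <- rbind(minority, synthetic)
print(nrow(balanced))

# Each synthetic value stays within 0.1 of the row it was copied from
max_shift <- max(abs(synthetic$SystolicBP - minority$SystolicBP[idx]))
print(max_shift <= 0.1)
```

Because the noise is so small, the synthetic rows are near-duplicates of real rows; that is worth keeping in mind if the balanced data is later split into training and test sets.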

#Summary statistics for numeric variables
df %>%
  select(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate) %>%
  summary()
##       Age          SystolicBP     DiastolicBP           BS        
##  Min.   :10.00   Min.   : 70.0   Min.   : 49.00   Min.   : 6.000  
##  1st Qu.:20.00   1st Qu.:100.0   1st Qu.: 65.00   1st Qu.: 6.900  
##  Median :25.00   Median :120.0   Median : 80.00   Median : 7.200  
##  Mean   :30.13   Mean   :113.6   Mean   : 76.65   Mean   : 8.041  
##  3rd Qu.:36.00   3rd Qu.:120.0   3rd Qu.: 90.00   3rd Qu.: 7.800  
##  Max.   :66.00   Max.   :140.0   Max.   :100.00   Max.   :15.000  
##     BodyTemp        HeartRate    
##  Min.   : 98.00   Min.   :60.00  
##  1st Qu.: 98.00   1st Qu.:70.00  
##  Median : 98.00   Median :70.00  
##  Mean   : 98.07   Mean   :73.68  
##  3rd Qu.: 98.00   3rd Qu.:78.00  
##  Max.   :100.00   Max.   :90.00
#Visualizing numeric feature distributions
df %>%
  select(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate) %>%
  pivot_longer(everything(), names_to = "Feature", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~ Feature, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Numeric Features")

#Creating a data frame summarizing variables and comments
summary_table <- data.frame(
  Variable = c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate"),
  Likely_Distribution = c(
    "Slight positive skew",
    "Slight negative skew",
    "Slight negative skew",
    "Positive skew",
    "Symmetrical",
    "Slight positive skew"
  ),
  Comments = c(
    "Younger ages dominate with a few older outliers.",
    "Concentration near median with lower outliers.",
    "Similar to SystolicBP with fewer low outliers.",
    "Higher outliers increase the mean.",
    "Likely normal distribution centered at 98.",
    "Most values around 70–78, with some high outliers."
  )
)

#Printing the table
print(summary_table)
##      Variable  Likely_Distribution
## 1         Age Slight positive skew
## 2  SystolicBP Slight negative skew
## 3 DiastolicBP Slight negative skew
## 4          BS        Positive skew
## 5    BodyTemp          Symmetrical
## 6   HeartRate Slight positive skew
##                                             Comments
## 1   Younger ages dominate with a few older outliers.
## 2     Concentration near median with lower outliers.
## 3     Similar to SystolicBP with fewer low outliers.
## 4                 Higher outliers increase the mean.
## 5         Likely normal distribution centered at 98.
## 6 Most values around 70–78, with some high outliers.
#Optionally display it in a more readable table format if using R Markdown or Viewer
library(knitr)
kable(summary_table, caption = "Variable Distributions and Comments")
Table: Variable Distributions and Comments

|Variable    |Likely_Distribution  |Comments                                           |
|:-----------|:--------------------|:--------------------------------------------------|
|Age         |Slight positive skew |Younger ages dominate with a few older outliers.   |
|SystolicBP  |Slight negative skew |Concentration near median with lower outliers.     |
|DiastolicBP |Slight negative skew |Similar to SystolicBP with fewer low outliers.     |
|BS          |Positive skew        |Higher outliers increase the mean.                 |
|BodyTemp    |Symmetrical          |Likely normal distribution centered at 98.         |
|HeartRate   |Slight positive skew |Most values around 70–78, with some high outliers. |

Observations:

  1. Age
     Observation: The distribution is positively skewed (right-skewed), with most counts concentrated between 20 and 36. A small number of individuals are older (above 50), creating a long right tail.
     Insight: The population in this dataset is primarily young to middle-aged, with fewer older individuals.
  2. Body Temperature
     Observation: Almost all values sit at exactly 98°F, with very few observations above it (98.5 to 100).
     Insight: After cleaning, body temperature in the dataset remains consistent with the normal range for healthy individuals.
  3. Blood Sugar (BS)
     Observation: The distribution is positively skewed, with most values concentrated between 6.9 and 7.8. However, there are clear outliers at higher values (10 to 15), creating a long right tail.
     Insight: The majority of individuals have blood sugar levels in a narrow band, but some exhibit significantly elevated levels, which may warrant closer attention (e.g., potential diabetes cases).
  4. Diastolic Blood Pressure (DiastolicBP)
     Observation: The distribution is approximately symmetric, with the highest counts around 80 mmHg. There are slightly fewer observations at the extremes (50 and 100).
     Insight: Diastolic blood pressure values are largely concentrated around the normal range, with some variation at lower and higher values.
  5. Heart Rate
     Observation: The distribution is slightly positively skewed, with a peak around 70 bpm. There are some counts in the higher range (80–90 bpm) and, after cleaning, none below 60 bpm.
     Insight: Most individuals have a normal resting heart rate, with a smaller number in the higher range.
  6. Systolic Blood Pressure (SystolicBP)
     Observation: The distribution shows a slight negative skew, with a strong peak at 120 mmHg. There are a few lower outliers (around 80) and, after cleaning, no values above 140.
     Insight: Systolic blood pressure is concentrated near the higher end of the normal range, with some low outliers potentially indicating hypotension.

General Observations: Most features exhibit expected physiological ranges (e.g., body temperature, heart rate). Skewness is evident in Age, BS, and Heart Rate, suggesting outliers or unbalanced distributions. DiastolicBP is close to symmetrical, and Body Temperature is concentrated almost entirely at 98°F; neither pattern on its own establishes a normal distribution.
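The skewness judgments above are read off histograms. They could also be quantified with a sample skewness coefficient, as in this sketch. The helper `sample_skewness()` is a hypothetical function written for illustration, and the vectors are toy values rather than the study's data.

```r
# Sample skewness: third central moment divided by the cubed
# (population) standard deviation
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# Toy vectors (illustrative values only)
right_tailed <- c(6.9, 7.0, 7.2, 7.5, 7.8, 11, 15)
symmetric <- c(60, 70, 80, 90, 100)

print(sample_skewness(right_tailed))
print(sample_skewness(symmetric))
```

A positive coefficient indicates a longer right tail, a negative one a longer left tail, and a value near zero indicates rough symmetry.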

Relationships Between Features

#Correlations
cor_matrix <- cor(df %>% select(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate))

#Visualizing correlation matrix
library(corrplot)
## corrplot 0.95 loaded
corrplot(cor_matrix, method = "circle", type = "lower", tl.cex = 0.8, addCoef.col = "black")

Strongest Correlation: SystolicBP and DiastolicBP (0.74), which is expected since they are both measures of blood pressure. Moderate Correlations: Age and BS (0.45), and BS with both blood pressure measures (~0.33–0.34). Weak or No Correlation: Body temperature and heart rate show very little to no correlation with other variables.

#Boxplot for Age and RiskLevel
ggplot(df_balanced, aes(x = RiskLevel, y = Age, fill = RiskLevel)) +
  geom_boxplot() +
  labs(title = "Age Distribution by Risk Level", x = "Risk Level", y = "Age") +
  theme_minimal()

#Repeating for other features
features <- c("SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")
for (feature in features) {
  print(
    #.data[[feature]] looks up the column by name (replaces the deprecated aes_string())
    ggplot(df_balanced, aes(x = RiskLevel, y = .data[[feature]], fill = RiskLevel)) +
      geom_boxplot() +
      labs(title = paste(feature, "Distribution by Risk Level"),
           x = "Risk Level", y = feature) +
      theme_minimal()
  )
}

#Scatter plot for SystolicBP vs. DiastolicBP by RiskLevel
ggplot(df_balanced, aes(x = SystolicBP, y = DiastolicBP, color = RiskLevel)) +
  geom_point(alpha = 0.7) +
  labs(title = "Systolic vs. Diastolic BP by Risk Level", x = "Systolic BP", y = "Diastolic BP") +
  theme_minimal()

Correlation between the continuous variables and the probability of being in the “high risk” group

#Creating a binary variable for high risk
df_balanced$HighRisk <- ifelse(df_balanced$RiskLevel == "high risk", 1, 0)
#Selecting numeric features
numeric_features <- c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")

#Computing Pearson correlations
pearson_corr <- sapply(numeric_features, function(feature) {
  cor(df_balanced[[feature]], df_balanced$HighRisk, method = "pearson")
})

#Displaying Pearson correlations
pearson_corr
##          Age   SystolicBP  DiastolicBP           BS     BodyTemp    HeartRate 
##  0.380103117  0.484005994  0.482593784  0.662769454 -0.008962854  0.236358506
#Converting Pearson correlations to a data frame
correlation_df <- data.frame(
  Feature = names(pearson_corr),
  Correlation = pearson_corr
)

#Bar plot for correlations
library(ggplot2)
ggplot(correlation_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Correlation)) +
  geom_bar(stat = "identity") +
  labs(title = "Correlation of Features with High Risk",
       x = "Feature",
       y = "Pearson Correlation") +
  theme_minimal() +
  coord_flip() + #Flip coordinates for horizontal bars
  scale_fill_gradient(low = "blue", high = "red")

I found it surprising that age, which is often thought to be a key factor in health risks, does not turn out to be the strongest predictor. Instead, direct health indicators like blood sugar (BS), systolic blood pressure, and diastolic blood pressure emerge as more critical factors in predicting “high risk” outcomes.
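Because HighRisk is a 0/1 indicator, each Pearson coefficient reported above is a point-biserial correlation. The following is a minimal sketch of how one might check whether such a coefficient differs from zero, using base R's cor.test(). The `toy` data frame and all of its values are simulated placeholders assumed for illustration, not the study data.

```r
# Simulated placeholder data (an assumption for illustration; NOT the study data).
set.seed(123)
n <- 300
toy <- data.frame(HighRisk = rbinom(n, size = 1, prob = 1 / 3))
# Blood sugar is drawn higher on average for the simulated high-risk group.
toy$BS <- rnorm(n, mean = ifelse(toy$HighRisk == 1, 10, 7), sd = 2)

# Pearson correlation with a 0/1 variable is the point-biserial correlation.
res <- cor.test(toy$BS, toy$HighRisk, method = "pearson")
res$estimate   # correlation coefficient
res$conf.int   # 95% confidence interval
res$p.value    # test of H0: correlation = 0
```

With the real data, the same call could be applied to df_balanced$BS and df_balanced$HighRisk.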

library(ggplot2)

#List of variables to plot
variables <- c("BS", "SystolicBP", "DiastolicBP")

#Loop through each variable and create a trend plot
for (var in variables) {
  print(
    #HighRisk is converted to a factor so that colour defines two groups,
    #giving one loess curve per group instead of dropping the colour aesthetic
    ggplot(df_balanced, aes(x = Age, y = .data[[var]], color = factor(HighRisk))) +
      geom_point(alpha = 0.6) +
      geom_smooth(method = "loess", se = FALSE) +
      labs(title = paste(var, "Trend Over Age"),
           x = "Age", y = var, color = "HighRisk") +
      theme_minimal()
  )
}
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#Filtering for younger individuals (e.g., Age < 40) who are High Risk
young_high_risk <- df_balanced %>%
  filter(Age < 40 & HighRisk == 1)

#Comparing younger high-risk individuals to others
young_non_high_risk <- df_balanced %>%
  filter(Age < 40 & HighRisk == 0)

#Summary statistics for young high-risk
summary(young_high_risk)
##       Age          SystolicBP     DiastolicBP           BS        
##  Min.   :21.93   Min.   : 89.9   Min.   : 59.98   Min.   : 6.710  
##  1st Qu.:25.00   1st Qu.:120.0   1st Qu.: 79.98   1st Qu.: 7.087  
##  Median :30.00   Median :140.0   Median : 90.00   Median : 8.000  
##  Mean   :30.24   Mean   :128.8   Mean   : 88.03   Mean   : 9.715  
##  3rd Qu.:35.00   3rd Qu.:140.0   3rd Qu.:100.00   3rd Qu.:11.956  
##  Max.   :39.99   Max.   :140.1   Max.   :100.09   Max.   :15.081  
##     BodyTemp        HeartRate         RiskLevel      HighRisk
##  Min.   : 97.90   Min.   :60.00   high risk:203   Min.   :1  
##  1st Qu.: 97.98   1st Qu.:70.00   low risk :  0   1st Qu.:1  
##  Median : 98.00   Median :78.03   mid risk :  0   Median :1  
##  Mean   : 98.09   Mean   :76.03                   Mean   :1  
##  3rd Qu.: 98.03   3rd Qu.:80.00                   3rd Qu.:1  
##  Max.   :100.00   Max.   :90.01                   Max.   :1
#Summary statistics for young non-high-risk
summary(young_non_high_risk)
##       Age          SystolicBP     DiastolicBP           BS        
##  Min.   :10.00   Min.   : 69.9   Min.   : 49.00   Min.   : 5.910  
##  1st Qu.:19.00   1st Qu.:100.0   1st Qu.: 60.00   1st Qu.: 6.800  
##  Median :22.00   Median :120.0   Median : 70.04   Median : 7.000  
##  Mean   :22.99   Mean   :109.8   Mean   : 72.73   Mean   : 7.104  
##  3rd Qu.:29.00   3rd Qu.:120.0   3rd Qu.: 80.00   3rd Qu.: 7.500  
##  Max.   :39.95   Max.   :130.1   Max.   :100.00   Max.   :10.994  
##     BodyTemp        HeartRate         RiskLevel      HighRisk
##  Min.   : 97.90   Min.   :59.90   high risk:  0   Min.   :0  
##  1st Qu.: 98.00   1st Qu.:70.00   low risk :299   1st Qu.:0  
##  Median : 98.00   Median :70.00   mid risk :306   Median :0  
##  Mean   : 98.08   Mean   :72.93                   Mean   :0  
##  3rd Qu.: 98.00   3rd Qu.:77.96                   3rd Qu.:0  
##  Max.   :100.09   Max.   :88.00                   Max.   :0

Insights: Age: High-risk individuals are generally older than non-high-risk individuals in this age range (mean 30.2 vs. 23.0 years). However, individuals under 30 are still well represented in the high-risk group (its median age is 30), emphasizing the need to explore other factors.

Systolic and Diastolic Blood Pressure: Both are markedly higher in the high-risk group, even among younger individuals (mean systolic 128.8 vs. 109.8 mmHg; mean diastolic 88.0 vs. 72.7 mmHg). This supports the view that poor blood pressure control is a major risk factor.

Blood Sugar (BS): Blood sugar levels are markedly elevated in the high-risk group (mean 9.7 vs. 7.1). The mean lies well above the median (8.0), which indicates that some individuals in this group have extreme values.
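Whether the blood pressure difference between the two young groups is statistically significant can be checked with a two-sample test. The sketch below runs Welch's t-test with base R's t.test() on simulated values. The group sizes and means mirror the summaries above, but the standard deviation of 15 mmHg and the simulated vectors themselves are assumptions, so the output does not describe the real data.

```r
# Simulated placeholder values (assumed SD of 15 mmHg; NOT the study data).
set.seed(42)
sbp_high  <- rnorm(203, mean = 128.8, sd = 15)  # young high-risk group
sbp_other <- rnorm(605, mean = 109.8, sd = 15)  # young non-high-risk group

# t.test() performs Welch's unequal-variance test by default (var.equal = FALSE).
welch <- t.test(sbp_high, sbp_other)
welch$estimate   # group means
welch$conf.int   # 95% CI for the difference in means
welch$p.value

# A rank-based alternative that does not assume normality.
wilcox.test(sbp_high, sbp_other)$p.value
```

With the actual data, one would pass young_high_risk$SystolicBP and young_non_high_risk$SystolicBP instead. Since the recorded blood pressure values cluster at a few round numbers, the rank-based test may be the more cautious choice.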

Model

library(nnet)

#Fitting multinomial logistic regression
multinom_model <- multinom(RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate, data = df_balanced)
## # weights:  24 (14 variable)
## initial  value 1209.572130 
## iter  10 value 871.126916
## iter  20 value 778.844664
## iter  30 value 771.175254
## iter  40 value 771.167253
## iter  50 value 764.888262
## iter  60 value 757.170932
## iter  70 value 757.166158
## iter  70 value 757.166155
## final  value 757.166055 
## converged
#Summary of the model
summary(multinom_model)
## Call:
## multinom(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP + 
##     BS + BodyTemp + HeartRate, data = df_balanced)
## 
## Coefficients:
##          (Intercept)        Age  SystolicBP DiastolicBP         BS  BodyTemp
## low risk    226.0809 0.01944114 -0.12805405 -0.04257232 -0.9177363 -1.965944
## mid risk    127.2109 0.01825165 -0.07244766 -0.07573707 -0.6787368 -1.020532
##            HeartRate
## low risk -0.09894234
## mid risk -0.08889116
## 
## Std. Errors:
##           (Intercept)        Age SystolicBP DiastolicBP         BS   BodyTemp
## low risk 0.0002141865 0.01187866 0.01265363  0.01317042 0.09080277 0.02194777
## mid risk 0.0001805383 0.01143589 0.01191964  0.01217270 0.06406734 0.02020524
##           HeartRate
## low risk 0.01679366
## mid risk 0.01516674
## 
## Residual Deviance: 1514.332 
## AIC: 1542.332
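Since the summary lists coefficients only for "low risk" and "mid risk", "high risk" is the reference class, and each coefficient is a change in the log-odds of the listed class relative to high risk. Exponentiating gives odds ratios, which are often easier to read. The sketch below re-enters a few coefficients from the summary above by hand; the `coefs` matrix is an illustrative helper, and with the fitted object one could call exp(coef(multinom_model)) directly.

```r
# Selected coefficients copied from the model summary (log-odds vs. "high risk").
coefs <- rbind(
  "low risk" = c(Age = 0.01944114, SystolicBP = -0.12805405, BS = -0.9177363),
  "mid risk" = c(Age = 0.01825165, SystolicBP = -0.07244766, BS = -0.6787368)
)

# Odds ratio = multiplicative change in the odds per one-unit increase in the predictor.
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

For example, the BS odds ratio for low risk is about 0.40: each one-unit increase in blood sugar multiplies the odds of being low risk rather than high risk by roughly 0.4, holding the other predictors fixed.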
#Predicting on the training set
predicted <- predict(multinom_model, df_balanced)

#Confusion matrix
table(Predicted = predicted, Actual = df_balanced$RiskLevel)
##            Actual
## Predicted   high risk low risk mid risk
##   high risk       297       13       37
##   low risk         15      229      114
##   mid risk         55      125      216
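The confusion matrix above can be summarized as an overall accuracy and a per-class recall. A short sketch that re-enters the printed counts by hand (the `cm` object is an illustrative helper; with the fitted model one would use the table() result directly):

```r
# Counts copied from the training-set confusion matrix (rows = predicted, columns = actual).
cm <- matrix(
  c(297,  13,  37,
     15, 229, 114,
     55, 125, 216),
  nrow = 3, byrow = TRUE,
  dimnames = list(Predicted = c("high risk", "low risk", "mid risk"),
                  Actual    = c("high risk", "low risk", "mid risk"))
)

accuracy <- sum(diag(cm)) / sum(cm)   # share of correct predictions
recall   <- diag(cm) / colSums(cm)    # correct predictions per actual class

round(accuracy, 3)
round(recall, 3)
```

This gives a training accuracy of about 67%, with recall of roughly 81% for high risk but only about 62% and 59% for low and mid risk, so most errors are confusions between the low-risk and mid-risk classes.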
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following objects are masked from 'package:openintro':
## 
##     ethanol, lsegments
## 
## Attaching package: 'caret'
## The following object is masked from 'package:openintro':
## 
##     dotPlot
## The following object is masked from 'package:purrr':
## 
##     lift
#Converting RiskLevel to a factor if not already
df_balanced$RiskLevel <- as.factor(df_balanced$RiskLevel)

#Splitting the dataset into training and testing sets
set.seed(42)
train_index <- createDataPartition(df_balanced$RiskLevel, p = 0.7, list = FALSE)
train_data <- df_balanced[train_index, ]
test_data <- df_balanced[-train_index, ]
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
#Fitting the Random Forest model
rf_model <- randomForest(RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate, 
                         data = train_data, ntree = 100, importance = TRUE)

# View the model
print(rf_model)
## 
## Call:
##  randomForest(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP +      BS + BodyTemp + HeartRate, data = train_data, ntree = 100,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 14.14%
## Confusion matrix:
##           high risk low risk mid risk class.error
## high risk       244        6        7  0.05058366
## low risk          6      209       42  0.18677043
## mid risk         14       34      209  0.18677043
#Predicting on the test set
rf_predictions <- predict(rf_model, test_data)

#Confusion matrix
confusion_matrix <- confusionMatrix(rf_predictions, test_data$RiskLevel)
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  high risk low risk mid risk
##   high risk       108        1       10
##   low risk          1       88       17
##   mid risk          1       21       83
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8455          
##                  95% CI : (0.8019, 0.8827)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7682          
##                                           
##  Mcnemar's Test P-Value : 0.05068         
## 
## Statistics by Class:
## 
##                      Class: high risk Class: low risk Class: mid risk
## Sensitivity                    0.9818          0.8000          0.7545
## Specificity                    0.9500          0.9182          0.9000
## Pos Pred Value                 0.9076          0.8302          0.7905
## Neg Pred Value                 0.9905          0.9018          0.8800
## Prevalence                     0.3333          0.3333          0.3333
## Detection Rate                 0.3273          0.2667          0.2515
## Detection Prevalence           0.3606          0.3212          0.3182
## Balanced Accuracy              0.9659          0.8591          0.8273
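To make the headline statistics above less opaque, the sketch below recomputes Accuracy and Kappa by hand from the printed test-set counts. The `cm_test` object is an illustrative helper; the formula used is the standard Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement).

```r
# Counts copied from the test-set confusion matrix (rows = prediction, columns = reference).
cm_test <- matrix(
  c(108,  1, 10,
      1, 88, 17,
      1, 21, 83),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = c("high risk", "low risk", "mid risk"),
                  Reference  = c("high risk", "low risk", "mid risk"))
)

n          <- sum(cm_test)
p_observed <- sum(diag(cm_test)) / n                           # accuracy
p_expected <- sum(rowSums(cm_test) * colSums(cm_test)) / n^2   # agreement expected by chance
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

round(c(Accuracy = p_observed, Kappa = cohen_kappa), 4)
```

Because the three classes are balanced in the test set, chance agreement is exactly 1/3, which is also the No Information Rate reported by confusionMatrix().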
#Plot feature importance
varImpPlot(rf_model, main = "Feature Importance")

Conclusions

In conclusion, this study underscores the persistent challenges of maternal health complications in rural, resource-limited settings like Bangladesh. By leveraging an IoT-based monitoring system, I identified critical risk factors such as elevated blood pressure and blood sugar levels, and complex age-related risk profiles. My findings reveal that high-risk individuals are generally older, but young individuals under 30 are also significantly represented in the high-risk group, highlighting the need to explore additional factors. Blood pressure, both systolic and diastolic, is significantly higher in the high-risk group, even among younger individuals, confirming poor blood pressure control as a major risk factor. Elevated blood sugar levels in the high-risk group further emphasize the importance of comprehensive metabolic monitoring.

Initially, I employed Multinomial Logistic Regression to identify key predictors such as blood sugar, blood pressure, and age. While this model provided interpretable coefficients, it struggled to separate the low-risk and mid-risk categories: on the training data, 125 low-risk cases were predicted as mid risk and 114 mid-risk cases as low risk. This is likely due to overlapping feature distributions and the model's assumption of a linear relationship between the features and the log-odds of each class. These limitations motivated the transition to a more flexible model.

I chose Random Forest for its ability to handle non-linear relationships and capture feature interactions, such as the combined effect of high blood sugar and elevated blood pressure on maternal health risks. This model performed considerably better, with an accuracy of 84.55% on the held-out test set (out-of-bag error of 14.14% on the training set) and sensitivities of 98.2% for high risk, 80.0% for low risk, and 75.5% for mid risk. Random Forest also identified blood sugar, systolic and diastolic blood pressure, and age as the most important predictors of maternal health risks.

The use of advanced machine learning techniques, particularly Random Forest classification, enabled the development of predictive models that offer nuanced insights into maternal health risks. These findings underscore the importance of comprehensive, individualized risk assessments and proactive intervention strategies. The scalable methodology presented in this research provides a valuable framework for early risk identification and targeted interventions, ultimately contributing to improved maternal health outcomes in vulnerable populations.

References

- Oversampling and undersampling in imbalanced data: https://medium.com/metaor-artificial-intelligence/solving-the-class-imbalance-problem-58cb926b5a0f

