1 Project Overview
2 Research Question
3 Tools Used
4 Load Libraries and Data
5 Data Structure and Missing Values
6 Data Validation and Cleaning
7 Class Distribution
8 Balancing the Dataset
9 Exploratory Data Analysis
- 9.1 Numeric Feature Distributions
- 9.2 Correlation Between Numeric Features
10 Relationship Between Features and Risk Level
11 Correlation With High-Risk Outcome
12 Young High-Risk Group Analysis
13 Model 1: Multinomial Logistic Regression
14 Model 2: Random Forest Classification
15 Key Findings
16 Limitations
17 Conclusion

1 Project Overview

This code sample analyzes maternal health risk factors using a public dataset of health records collected through an IoT-based monitoring system in Bangladesh. The goal is to identify which variables are most associated with low, mid, and high maternal health risk levels.

This project demonstrates data cleaning, exploratory data analysis, class imbalance handling, predictive modeling, and interpretation of results using R.

2 Research Question

What are the strongest predictors of maternal health risk level in this dataset?

3 Tools Used

R
tidyverse
ggplot2
caret
randomForest
nnet
corrplot
knitr

4 Load Libraries and Data

library(tidyverse)
library(caret)
library(randomForest)
library(nnet)
library(corrplot)
library(knitr)

url <- "https://raw.githubusercontent.com/lburenkov/maternalrisk/refs/heads/main/Maternal%20Health%20Risk%20Data%20Set.csv"

df <- read.csv(url)

head(df)

##   Age SystolicBP DiastolicBP    BS BodyTemp HeartRate RiskLevel
## 1  25        130          80 15.00       98        86 high risk
## 2  35        140          90 13.00       98        70 high risk
## 3  29         90          70  8.00      100        80 high risk
## 4  30        140          85  7.00       98        70 high risk
## 5  35        120          60  6.10       98        76  low risk
## 6  23        140          80  7.01       98        70 high risk

5 Data Structure and Missing Values

str(df)

## 'data.frame':    1014 obs. of  7 variables:
##  $ Age        : int  25 35 29 30 35 23 23 35 32 42 ...
##  $ SystolicBP : int  130 140 90 140 120 140 130 85 120 130 ...
##  $ DiastolicBP: int  80 90 70 85 60 80 70 60 90 80 ...
##  $ BS         : num  15 13 8 7 6.1 7.01 7.01 11 6.9 18 ...
##  $ BodyTemp   : num  98 98 100 98 98 98 98 102 98 98 ...
##  $ HeartRate  : int  86 70 80 70 76 70 78 86 70 70 ...
##  $ RiskLevel  : chr  "high risk" "high risk" "high risk" "high risk" ...

summary(df)

##       Age          SystolicBP     DiastolicBP           BS        
##  Min.   :10.00   Min.   : 70.0   Min.   : 49.00   Min.   : 6.000  
##  1st Qu.:19.00   1st Qu.:100.0   1st Qu.: 65.00   1st Qu.: 6.900  
##  Median :26.00   Median :120.0   Median : 80.00   Median : 7.500  
##  Mean   :29.87   Mean   :113.2   Mean   : 76.46   Mean   : 8.726  
##  3rd Qu.:39.00   3rd Qu.:120.0   3rd Qu.: 90.00   3rd Qu.: 8.000  
##  Max.   :70.00   Max.   :160.0   Max.   :100.00   Max.   :19.000  
##     BodyTemp        HeartRate     RiskLevel        
##  Min.   : 98.00   Min.   : 7.0   Length:1014       
##  1st Qu.: 98.00   1st Qu.:70.0   Class :character  
##  Median : 98.00   Median :76.0   Mode  :character  
##  Mean   : 98.67   Mean   :74.3                     
##  3rd Qu.: 98.00   3rd Qu.:80.0                     
##  Max.   :103.00   Max.   :90.0

missing_values <- colSums(is.na(df))
missing_values

##         Age  SystolicBP DiastolicBP          BS    BodyTemp   HeartRate 
##           0           0           0           0           0           0 
##   RiskLevel 
##           0

The dataset does not contain missing values. The target variable is RiskLevel, which includes three categories: low risk, mid risk, and high risk.
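
Missing values are only one aspect of data quality. As a complementary check that is not part of the original analysis, the sketch below counts exact duplicate rows with base R's `duplicated()`. The `toy` data frame and its values are made up for illustration; on the real data the same calls would be applied to `df`.

```r
# Toy stand-in for df (values are illustrative, not from the dataset)
toy <- data.frame(
  Age       = c(25, 35, 25, 40, 35),
  BS        = c(7.5, 13.0, 7.5, 18.0, 13.0),
  RiskLevel = c("low risk", "high risk", "low risk", "high risk", "high risk")
)

# duplicated() flags every repeat of an earlier row (the first copy is not flagged)
n_duplicates <- sum(duplicated(toy))

# Rows that remain after dropping exact repeats
toy_unique <- toy[!duplicated(toy), ]

n_duplicates
nrow(toy_unique)
```

Whether duplicates should be dropped depends on how the records were collected; repeated readings from the same patient and accidental copies look identical in the file.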

6 Data Validation and Cleaning

Before modeling, I checked for values that may indicate data entry errors or extreme outliers.

# Unusual heart rate values
df %>%
  filter(HeartRate < 40 | HeartRate > 200)

##   Age SystolicBP DiastolicBP  BS BodyTemp HeartRate RiskLevel
## 1  16        120          75 7.9       98         7  low risk
## 2  16        120          75 7.9       98         7  low risk

# Unusual blood sugar or body temperature values
df %>%
  filter(BS > 15 | BodyTemp > 100 | BodyTemp < 97)

##     Age SystolicBP DiastolicBP   BS BodyTemp HeartRate RiskLevel
## 1    35         85          60 11.0      102        86 high risk
## 2    42        130          80 18.0       98        70 high risk
## 3    30        120          80  6.9      101        76  mid risk
## 4    40        140         100 18.0       98        90 high risk
## 5    12         95          60  6.1      102        60  low risk
## 6    17         85          60  9.0      102        86  mid risk
## 7    32        120          65  6.0      101        76  mid risk
## 8    26         85          60  6.0      101        86  mid risk
## 9    44        120          90 16.0       98        80  mid risk
## 10   13         90          65  7.8      101        80  mid risk
## 11   28        115          60  7.8      101        86  mid risk
## 12   34         85          60 11.0      102        86 high risk
## 13   42        140         100 18.0       98        90 high risk
## 14   50        140          95 17.0       98        60 high risk
## 15   38        135          60  7.9      101        86 high risk
## 16   30        120          80  7.9      101        76 high risk
## 17   55        140         100 18.0       98        90 high risk
## 18   40        160         100 19.0       98        77 high risk
## 19   32        140          90 18.0       98        88 high risk
## 20   55        140          95 19.0       98        77 high risk
## 21   40        160         100 19.0       98        77 high risk
## 22   32        140          90 18.0       98        88 high risk
## 23   22         90          60  7.5      102        60 high risk
## 24   55        140          95 19.0       98        77 high risk
## 25   50        130         100 16.0       98        75 high risk
## 26   18        120          80  6.9      102        76  mid risk
## 27   17         90          60  6.9      101        76  mid risk
## 28   17         90          63  6.9      101        70  mid risk
## 29   25        120          90  6.7      101        80  mid risk
## 30   17        120          80  6.7      102        76  mid risk
## 31   14         90          65  7.0      101        70 high risk
## 32   17        110          75 12.0      101        76 high risk
## 33   40        160         100 19.0       98        77 high risk
## 34   32        140          90 18.0       98        88 high risk
## 35   12         90          60  7.9      102        66 high risk
## 36   12         95          60  6.1      102        60  low risk
## 37   55        140          95 19.0       98        77 high risk
## 38   50        130         100 16.0       98        75 high risk
## 39   13         90          65  7.9      101        80  mid risk
## 40   17         90          65  6.1      103        67 high risk
## 41   28         83          60  8.0      101        86 high risk
## 42   17         85          60  9.0      102        86 high risk
## 43   50        140          95 17.0       98        60 high risk
## 44   28         85          60  9.0      101        86  mid risk
## 45   17         85          60  9.0      102        86  mid risk
## 46   55        140          80  7.2      101        76 high risk
## 47   40        140         100 18.0       98        77 high risk
## 48   28        120          80  9.0      102        76 high risk
## 49   17         90          60 11.0      101        78 high risk
## 50   17         90          63  8.0      101        70 high risk
## 51   25        120          90 12.0      101        80 high risk
## 52   17        120          80  7.0      102        76 high risk
## 53   19         90          65 11.0      101        70 high risk
## 54   32        120          65  6.0      101        76  mid risk
## 55   17        110          75 13.0      101        76 high risk
## 56   40        160         100 19.0       98        77 high risk
## 57   32        140          90 18.0       98        88 high risk
## 58   12         90          60  8.0      102        66 high risk
## 59   12         90          60 11.0      102        60 high risk
## 60   55        140          95 19.0       98        77 high risk
## 61   50        130         100 16.0       98        76 high risk
## 62   13         90          65  9.0      101        80 high risk
## 63   17         90          65  7.7      103        67 high risk
## 64   26         85          60  6.0      101        86  mid risk
## 65   17         85          60  6.3      102        86 high risk
## 66   55        120          90 18.0       98        60 high risk
## 67   35         85          60 19.0       98        86 high risk
## 68   43        120          90 18.0       98        70 high risk
## 69   44        120          90 16.0       98        80  mid risk
## 70   45        120          80  6.9      103        70  low risk
## 71   70         85          60  6.9      102        70  low risk
## 72   65        120          90  6.9      103        76  low risk
## 73   55        120          80  6.9      102        80  low risk
## 74   45         90          60 18.0      101        70 high risk
## 75   22        120          80  6.9      103        76  low risk
## 76   17        110          75  6.9      101        76 high risk
## 77   40        160         100 19.0       98        77 high risk
## 78   32        140          90 18.0       98        88 high risk
## 79   12         90          60  7.8      102        60 high risk
## 80   55        140          95 19.0       98        77 high risk
## 81   50        130         100 16.0       98        75 high risk
## 82   13         90          65  7.8      101        80  mid risk
## 83   17         90          65  7.8      103        67 high risk
## 84   28        115          60  7.8      101        86  mid risk
## 85   17         85          69  7.8      102        86 high risk
## 86   50        130          80 16.0      102        76  mid risk
## 87   27        120          90  6.8      102        68  mid risk
## 88   55        100          70  6.8      101        80  mid risk
## 89   60        140          80 16.0       98        66 high risk
## 90   17        140         100  6.8      103        80 high risk
## 91   36        140         100  6.8      102        76 high risk
## 92   40        140         100 13.0      101        66 high risk
## 93   36        140         100  6.8      102        76 high risk
## 94   40        140         100 13.0      101        66 high risk
## 95   35         85          60 11.0      102        86 high risk
## 96   43        130          80 18.0       98        70  mid risk
## 97   34         85          60 11.0      102        86 high risk
## 98   42        130          80 18.0       98        70  mid risk
## 99   30        120          80  6.8      101        76  low risk
## 100  42        140         100 18.0       98        90 high risk
## 101  18        120          80  6.8      102        76  low risk
## 102  17         90          60  7.9      101        76  low risk
## 103  50        140          95 17.0       98        60 high risk
## 104  38        135          60  7.9      101        86 high risk
## 105  17         85          60  7.9      102        86  low risk
## 106  30        120          80  7.9      101        76 high risk
## 107  55        140         100 18.0       98        90 high risk
## 108  18        120          80  7.9      102        76  mid risk
## 109  17         90          60  7.5      101        76  low risk
## 110  17         90          63  7.5      101        70  low risk
## 111  25        120          90  7.5      101        80  low risk
## 112  17        120          80  7.5      102        76  low risk
## 113  19         90          65  7.5      101        70  low risk
## 114  18         85          60  7.5      101        86  mid risk
## 115  17         85          60  7.5      102        86  low risk
## 116  30        120          80  7.5      101        76  mid risk
## 117  40        160         100 19.0       98        77 high risk
## 118  32        140          90 18.0       98        88 high risk
## 119  12         90          60  7.5      102        66  low risk
## 120  12         90          60  7.5      102        60  low risk
## 121  55        140          95 19.0       98        77 high risk
## 122  50        130         100 16.0       98        75  mid risk
## 123  13         90          65  7.5      101        80  low risk
## 124  17         90          65  7.5      103        67  low risk
## 125  28        115          60  7.5      101        86  mid risk
## 126  17         85          60  7.5      102        86  low risk
## 127  40        160         100 19.0       98        77 high risk
## 128  32        140          90 18.0       98        88 high risk
## 129  12         90          60  7.5      102        66  mid risk
## 130  22         90          60  7.5      102        60 high risk
## 131  55        140          95 19.0       98        77 high risk
## 132  50        130         100 16.0       98        75 high risk
## 133  55        140          95 19.0       98        77 high risk
## 134  50        130         100 16.0       98        75  mid risk
## 135  13         90          65  7.5      101        80 high risk
## 136  17         90          65  7.5      103        67  mid risk
## 137  27        135          60  7.5      101        86 high risk
## 138  17         85          60  7.5      101        86 high risk
## 139  50        140          95 17.0       98        60 high risk
## 140  28         85          60  9.0      101        86  mid risk
## 141  28         95          60 10.0      101        86 high risk
## 142  17         90          60  9.0      102        86  mid risk
## 143  30        120          80  9.0      101        76  mid risk
## 144  35         85          60 11.0      102        86 high risk
## 145  42        130          80 18.0       98        70 high risk
## 146  40        140         100 18.0       98        90 high risk
## 147  14         90          65  7.0      101        70 high risk
## 148  17        110          75 12.0      101        76 high risk
## 149  40        160         100 19.0       98        77 high risk
## 150  30        120          80  6.9      101        76  mid risk
## 151  18        120          80  6.9      102        76  mid risk
## 152  17         90          60  6.9      101        76  mid risk
## 153  17         90          63  6.9      101        70  mid risk
## 154  25        120          90  6.7      101        80  mid risk
## 155  17        120          80  6.7      102        76  mid risk
## 156  13         90          65  7.9      101        80  mid risk
## 157  28         85          60  9.0      101        86  mid risk
## 158  17         85          60  9.0      102        86  mid risk
## 159  30        120          80  6.9      101        76  mid risk
## 160  18        120          80  6.9      102        76  mid risk
## 161  17         90          60  6.9      101        76  mid risk
## 162  17         90          63  6.9      101        70  mid risk
## 163  25        120          90  6.7      101        80  mid risk
## 164  17        120          80  6.7      102        76  mid risk
## 165  13         90          65  7.9      101        80  mid risk
## 166  28         85          60  9.0      101        86  mid risk
## 167  17         85          60  9.0      102        86  mid risk
## 168  32        120          65  6.0      101        76  mid risk
## 169  26         85          60  6.0      101        86  mid risk
## 170  44        120          90 16.0       98        80  mid risk
## 171  13         90          65  7.8      101        80  mid risk
## 172  28        115          60  7.8      101        86  mid risk
## 173  50        130          80 16.0      102        76  mid risk
## 174  27        120          90  6.8      102        68  mid risk
## 175  55        100          70  6.8      101        80  mid risk
## 176  43        130          80 18.0       98        70  mid risk
## 177  42        130          80 18.0       98        70  mid risk
## 178  18        120          80  7.9      102        76  mid risk
## 179  18         85          60  7.5      101        86  mid risk
## 180  30        120          80  7.5      101        76  mid risk
## 181  50        130         100 16.0       98        75  mid risk
## 182  28        115          60  7.5      101        86  mid risk
## 183  12         90          60  7.5      102        66  mid risk
## 184  50        130         100 16.0       98        75  mid risk
## 185  17         90          65  7.5      103        67  mid risk
## 186  28         85          60  9.0      101        86  mid risk
## 187  17         90          60  9.0      102        86  mid risk
## 188  30        120          80  9.0      101        76  mid risk
## 189  30        120          80  6.9      101        76  mid risk
## 190  18        120          80  6.9      102        76  mid risk
## 191  17         90          60  6.9      101        76  mid risk
## 192  17         90          63  6.9      101        70  mid risk
## 193  25        120          90  6.7      101        80  mid risk
## 194  17        120          80  6.7      102        76  mid risk
## 195  13         90          65  7.9      101        80  mid risk
## 196  28         85          60  9.0      101        86  mid risk
## 197  17         85          60  9.0      102        86  mid risk
## 198  32        120          65  6.0      101        76  mid risk
## 199  30        120          80  6.8      101        76  low risk
## 200  18        120          80  6.8      102        76  low risk
## 201  17         90          60  7.9      101        76  low risk
## 202  17         85          60  7.9      102        86  low risk
## 203  17         90          60  7.5      101        76  low risk
## 204  17         90          63  7.5      101        70  low risk
## 205  25        120          90  7.5      101        80  low risk
## 206  17        120          80  7.5      102        76  low risk
## 207  19         90          65  7.5      101        70  low risk
## 208  17         85          60  7.5      102        86  low risk
## 209  12         90          60  7.5      102        66  low risk
## 210  12         90          60  7.5      102        60  low risk
## 211  13         90          65  7.5      101        80  low risk
## 212  17         90          65  7.5      103        67  low risk
## 213  17         85          60  7.5      102        86  low risk
## 214  40        140         100 18.0       98        90 high risk
## 215  14         90          65  7.0      101        70 high risk
## 216  17        110          75 12.0      101        76 high risk
## 217  40        160         100 19.0       98        77 high risk
## 218  32        140          90 18.0       98        88 high risk
## 219  12         90          60  7.9      102        66 high risk
## 220  55        140          95 19.0       98        77 high risk
## 221  50        130         100 16.0       98        75 high risk
## 222  17         90          65  6.1      103        67 high risk
## 223  28         83          60  8.0      101        86 high risk
## 224  17         85          60  9.0      102        86 high risk
## 225  50        140          95 17.0       98        60 high risk
## 226  55        140          80  7.2      101        76 high risk
## 227  40        140         100 18.0       98        77 high risk
## 228  28        120          80  9.0      102        76 high risk
## 229  17         90          60 11.0      101        78 high risk
## 230  17         90          63  8.0      101        70 high risk
## 231  25        120          90 12.0      101        80 high risk
## 232  17        120          80  7.0      102        76 high risk
## 233  19         90          65 11.0      101        70 high risk
## 234  17        110          75 13.0      101        76 high risk
## 235  40        160         100 19.0       98        77 high risk
## 236  32        140          90 18.0       98        88 high risk
## 237  12         90          60  8.0      102        66 high risk
## 238  12         90          60 11.0      102        60 high risk
## 239  55        140          95 19.0       98        77 high risk
## 240  50        130         100 16.0       98        76 high risk
## 241  13         90          65  9.0      101        80 high risk
## 242  17         90          65  7.7      103        67 high risk
## 243  17         85          60  6.3      102        86 high risk
## 244  55        120          90 18.0       98        60 high risk
## 245  35         85          60 19.0       98        86 high risk
## 246  43        120          90 18.0       98        70 high risk
## 247  32        120          65  6.0      101        76  mid risk

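For this code sample, I removed the observations flagged by these checks to keep the modeling dataset more consistent. This drops 249 of the 1,014 rows (about 25%) and leaves 765. Body temperatures above 100 and blood sugar values above 15 may be genuine clinical readings rather than entry errors, and many of the removed rows are labeled high risk, so this is a deliberate simplification rather than a correction of known errors.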

df_clean <- df %>%
  filter(HeartRate >= 40 & HeartRate <= 200) %>%
  filter(BS <= 15 & BodyTemp <= 100 & BodyTemp >= 97)

df_clean$RiskLevel <- as.factor(df_clean$RiskLevel)

summary(df_clean)

##       Age          SystolicBP     DiastolicBP           BS        
##  Min.   :10.00   Min.   : 70.0   Min.   : 49.00   Min.   : 6.000  
##  1st Qu.:20.00   1st Qu.:100.0   1st Qu.: 65.00   1st Qu.: 6.900  
##  Median :25.00   Median :120.0   Median : 80.00   Median : 7.200  
##  Mean   :30.13   Mean   :113.6   Mean   : 76.65   Mean   : 8.041  
##  3rd Qu.:36.00   3rd Qu.:120.0   3rd Qu.: 90.00   3rd Qu.: 7.800  
##  Max.   :66.00   Max.   :140.0   Max.   :100.00   Max.   :15.000  
##     BodyTemp        HeartRate         RiskLevel  
##  Min.   : 98.00   Min.   :60.00   high risk:148  
##  1st Qu.: 98.00   1st Qu.:70.00   low risk :367  
##  Median : 98.00   Median :70.00   mid risk :250  
##  Mean   : 98.07   Mean   :73.68                  
##  3rd Qu.: 98.00   3rd Qu.:78.00                  
##  Max.   :100.00   Max.   :90.00
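
To make the effect of each cleaning rule visible, a sketch like the following counts how many rows each threshold flags before anything is dropped. The thresholds are the ones used in the filters above; the `toy` data frame is a made-up stand-in for `df`.

```r
# Toy stand-in for df (values are illustrative, not from the dataset)
toy <- data.frame(
  HeartRate = c(7, 70, 86, 76, 80),
  BS        = c(7.9, 18.0, 6.9, 7.5, 16.0),
  BodyTemp  = c(98, 98, 101, 98, 102)
)

# One logical vector per rule, using the same thresholds as the filters above
rules <- list(
  heart_rate  = toy$HeartRate < 40 | toy$HeartRate > 200,
  blood_sugar = toy$BS > 15,
  body_temp   = toy$BodyTemp > 100 | toy$BodyTemp < 97
)

# Rows flagged by each rule (a row can be flagged by more than one rule)
flagged_per_rule <- sapply(rules, sum)

# Rows flagged by at least one rule, i.e. rows the cleaning step would drop
flagged_any <- Reduce(`|`, rules)

flagged_per_rule
sum(flagged_any)
```

Reporting the per-rule counts alongside the total makes it easier to judge which threshold drives most of the data loss.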

7 Class Distribution

table(df_clean$RiskLevel)

## 
## high risk  low risk  mid risk 
##       148       367       250

ggplot(df_clean, aes(x = RiskLevel, fill = RiskLevel)) +
  geom_bar() +
  labs(
    title = "Risk Level Distribution",
    x = "Risk Level",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

The cleaned dataset is imbalanced: it contains 367 low-risk, 250 mid-risk, and 148 high-risk observations. This imbalance may affect classification models because a model can reach reasonable accuracy by favoring the majority class.
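
The degree of imbalance can be expressed as class shares, together with the accuracy a trivial majority-class predictor would achieve. The short sketch below uses the counts from the `table()` output above; the variable names are illustrative.

```r
# Class counts copied from the table() output above
counts <- c("high risk" = 148, "low risk" = 367, "mid risk" = 250)

# Share of each class in the cleaned data
proportions <- round(counts / sum(counts), 3)

# Accuracy of a trivial model that always predicts the largest class
majority_baseline <- max(counts) / sum(counts)

proportions
round(majority_baseline, 3)
```

Any classifier trained on the unbalanced data should be judged against this baseline rather than against zero.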

8 Balancing the Dataset

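Because the dataset is relatively small, I used random oversampling rather than undersampling, so that no observations are discarded. The high-risk and mid-risk classes are resampled with replacement up to the size of the low-risk class (367), and a small uniform jitter (±0.1) is added to the numeric features of the copies so they are not exact duplicates.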

set.seed(42)

high_risk <- df_clean[df_clean$RiskLevel == "high risk", ]
low_risk  <- df_clean[df_clean$RiskLevel == "low risk", ]
mid_risk  <- df_clean[df_clean$RiskLevel == "mid risk", ]

target_size <- max(nrow(high_risk), nrow(low_risk), nrow(mid_risk))

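# Oversample one class up to target_size by resampling its rows with replacement
# and adding small uniform noise (+/- 0.1) to every numeric column.
generate_synthetic <- function(minority_class, target_size) {
  n_samples <- target_size - nrow(minority_class)

  # Nothing to add if the class is already at (or above) the target size
  if (n_samples <= 0) {
    return(minority_class)
  }

  # Draw existing rows at random, with replacement
  synthetic_samples <- minority_class[
    sample(seq_len(nrow(minority_class)), n_samples, replace = TRUE),
  ]

  numeric_cols <- names(synthetic_samples)[sapply(synthetic_samples, is.numeric)]

  # Jitter the copies so they are not exact duplicates of the originals
  for (col in numeric_cols) {
    synthetic_samples[[col]] <- synthetic_samples[[col]] +
      runif(n_samples, -0.1, 0.1)
  }

  rbind(minority_class, synthetic_samples)
}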

high_risk_balanced <- generate_synthetic(high_risk, target_size)
mid_risk_balanced  <- generate_synthetic(mid_risk, target_size)
low_risk_balanced  <- low_risk

df_balanced <- rbind(high_risk_balanced, mid_risk_balanced, low_risk_balanced)
df_balanced$RiskLevel <- as.factor(df_balanced$RiskLevel)

table(df_balanced$RiskLevel)

## 
## high risk  low risk  mid risk 
##       367       367       367

ggplot(df_balanced, aes(x = RiskLevel, fill = RiskLevel)) +
  geom_bar() +
  labs(
    title = "Balanced Risk Level Distribution",
    x = "Risk Level",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
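
One caveat with balancing the full dataset is that a later train/test split can place jittered copies of the same record on both sides of the split. A possible alternative, sketched below under assumptions, is to split first and oversample only the training rows. The helper `oversample_class()` and the toy data are hypothetical, and plain resampling without jitter is used to keep the example short.

```r
set.seed(42)

# Toy stand-in for df_clean (values are simulated, not from the dataset)
toy <- data.frame(
  BS        = c(rnorm(40, mean = 7, sd = 0.5), rnorm(10, mean = 13, sd = 1)),
  RiskLevel = factor(rep(c("low risk", "high risk"), times = c(40, 10)))
)

# 1. Split first, stratified by class, so test rows are never copied
test_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$RiskLevel), function(idx) {
  sample(idx, size = round(0.3 * length(idx)))
}))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# 2. Oversample each class in the training set up to the largest class size
oversample_class <- function(rows, target_size) {
  extra <- target_size - nrow(rows)
  if (extra <= 0) return(rows)
  rbind(rows, rows[sample(seq_len(nrow(rows)), extra, replace = TRUE), , drop = FALSE])
}

target_size    <- max(table(train$RiskLevel))
train_balanced <- do.call(
  rbind,
  lapply(split(train, train$RiskLevel), oversample_class, target_size = target_size)
)

table(train_balanced$RiskLevel)
table(test$RiskLevel)
```

With this ordering the test set keeps the original class mix and contains no resampled rows, so its accuracy is a more conservative estimate.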

9 Exploratory Data Analysis

9.1 Numeric Feature Distributions

df_clean %>%
  select(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate) %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~ Feature, scales = "free") +
  labs(title = "Distribution of Numeric Features") +
  theme_minimal()

9.2 Correlation Between Numeric Features

cor_matrix <- cor(
  df_clean %>% select(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate)
)

corrplot(
  cor_matrix,
  method = "circle",
  type = "lower",
  tl.cex = 0.8,
  addCoef.col = "black"
)

The strongest relationship among numeric predictors is between systolic and diastolic blood pressure, which is expected because both measure blood pressure. Blood sugar also shows moderate relationships with age and blood pressure.
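
To read a correlation matrix more systematically, the feature pairs can be ranked by absolute correlation. The sketch below does this on simulated data; the column names mirror the dataset, but the values are generated for illustration, so the numbers are not the report's results.

```r
set.seed(1)

# Simulated stand-in for three of the numeric columns (not the real data)
n <- 200
systolic  <- rnorm(n, mean = 113, sd = 18)
diastolic <- 0.6 * systolic + rnorm(n, mean = 8, sd = 8)  # built to correlate with systolic
heart     <- rnorm(n, mean = 74, sd = 8)                  # independent of both
toy <- data.frame(SystolicBP = systolic, DiastolicBP = diastolic, HeartRate = heart)

cor_matrix <- cor(toy)

# Keep each pair once by taking the lower triangle, then sort by |r|
pair_idx <- which(lower.tri(cor_matrix), arr.ind = TRUE)
pairs <- data.frame(
  feature_1 = rownames(cor_matrix)[pair_idx[, "row"]],
  feature_2 = colnames(cor_matrix)[pair_idx[, "col"]],
  r         = cor_matrix[pair_idx]
)
pairs <- pairs[order(-abs(pairs$r)), ]

pairs
```

Applied to the report's `cor_matrix`, the same steps would list the blood pressure pair first if it is indeed the strongest relationship.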

10 Relationship Between Features and Risk Level

features <- c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")

for (feature in features) {
  p <- ggplot(df_balanced, aes(x = RiskLevel, y = .data[[feature]], fill = RiskLevel)) +
    geom_boxplot() +
    labs(
      title = paste(feature, "by Risk Level"),
      x = "Risk Level",
      y = feature
    ) +
    theme_minimal() +
    theme(legend.position = "none")

  print(p)
}

ggplot(df_balanced, aes(x = SystolicBP, y = DiastolicBP, color = RiskLevel)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Systolic vs. Diastolic Blood Pressure by Risk Level",
    x = "Systolic Blood Pressure",
    y = "Diastolic Blood Pressure"
  ) +
  theme_minimal()

11 Correlation With High-Risk Outcome

I created a binary variable to identify whether each observation was classified as high risk.

df_balanced$HighRisk <- ifelse(df_balanced$RiskLevel == "high risk", 1, 0)

numeric_features <- c("Age", "SystolicBP", "DiastolicBP", "BS", "BodyTemp", "HeartRate")

pearson_corr <- sapply(numeric_features, function(feature) {
  cor(df_balanced[[feature]], df_balanced$HighRisk, method = "pearson")
})

correlation_df <- data.frame(
  Feature = names(pearson_corr),
  Correlation = as.numeric(pearson_corr)
)

correlation_df %>%
  arrange(desc(Correlation)) %>%
  kable(caption = "Feature Correlation with High-Risk Outcome")

Feature Correlation with High-Risk Outcome
Feature        Correlation
BS               0.6465102
DiastolicBP      0.4859777
SystolicBP       0.4505163
Age              0.3831357
HeartRate        0.2460041
BodyTemp         0.0474882

ggplot(correlation_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Correlation)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Correlation of Features With High-Risk Outcome",
    x = "Feature",
    y = "Pearson Correlation"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

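Blood sugar shows the strongest relationship with high-risk classification (r ≈ 0.65), followed by diastolic blood pressure (r ≈ 0.49) and systolic blood pressure (r ≈ 0.45). Body temperature is almost uncorrelated with the high-risk label (r ≈ 0.05), likely in part because the cleaning step removed all readings above 100. These correlations are computed on the balanced dataset, so they reflect the oversampled class mix.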
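
Pearson correlation with a 0/1 outcome is the point-biserial correlation, and base R's `cor.test()` adds a confidence interval to the point estimate. The sketch below uses simulated data with hypothetical names `bs` and `high_risk`, so its values are not the report's results.

```r
set.seed(7)

# Simulated stand-in: a binary outcome and one predictor that is higher when the outcome is 1
high_risk <- rep(c(0, 1), each = 100)
bs        <- rnorm(200, mean = 7 + 3 * high_risk, sd = 2)

# Pearson correlation with a 0/1 variable is the point-biserial correlation
result <- cor.test(bs, high_risk, method = "pearson")

result$estimate   # point estimate of r
result$conf.int   # 95% confidence interval for r
```

On oversampled data the interval would be too narrow, because resampled rows are not independent observations, so this check is better run on the cleaned, unbalanced data.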

12 Young High-Risk Group Analysis

Age is often assumed to be a major factor in health risk. However, younger individuals also appear in the high-risk group, which suggests that direct health indicators may be more informative than age alone.

young_high_risk <- df_balanced %>%
  filter(Age < 40 & HighRisk == 1)

young_non_high_risk <- df_balanced %>%
  filter(Age < 40 & HighRisk == 0)

# Mean of each indicator per group. An explicit lambda is used because passing
# extra arguments through across(...) is deprecated as of dplyr 1.1.0.
young_summary <- bind_rows(
  young_high_risk %>%
    summarise(across(c(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate), ~ mean(.x, na.rm = TRUE))) %>%
    mutate(Group = "Young High Risk"),
  young_non_high_risk %>%
    summarise(across(c(Age, SystolicBP, DiastolicBP, BS, BodyTemp, HeartRate), ~ mean(.x, na.rm = TRUE))) %>%
    mutate(Group = "Young Non-High Risk")
) %>%
  relocate(Group)

kable(young_summary, caption = "Average Indicators for Young High-Risk vs. Non-High-Risk Groups")

Average Indicators for Young High-Risk vs. Non-High-Risk Groups
Group                 Age       SystolicBP  DiastolicBP  BS        BodyTemp  HeartRate
Young High Risk       31.16188  127.2987    88.2902      9.726348  98.16856  75.49617
Young Non-High Risk   23.04462  110.3446    72.8990      7.079843  98.07958  72.87373
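
The same comparison can also be written as a single grouped summary instead of two filtered data frames. A minimal base R sketch, using made-up values in place of `df_balanced`:

```r
# Toy stand-in for df_balanced (values are illustrative, not from the dataset)
toy <- data.frame(
  Age        = c(22, 35, 38, 19, 45, 30),
  SystolicBP = c(140, 130, 110, 100, 150, 120),
  BS         = c(11, 9, 7, 6.5, 15, 7.5),
  HighRisk   = c(1, 1, 0, 0, 1, 0)
)

# Restrict to the under-40 group, then average each indicator by risk status
young <- toy[toy$Age < 40, ]
young_means <- aggregate(
  cbind(Age, SystolicBP, BS) ~ HighRisk,
  data = young,
  FUN  = mean
)

young_means
```

Grouping on the outcome keeps both rows in one table and avoids repeating the list of columns.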

13 Model 1: Multinomial Logistic Regression

A multinomial logistic regression model was used as an interpretable baseline model.

multinom_model <- multinom(
  RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate,
  data = df_balanced,
  trace = FALSE
)

predicted_multinom <- predict(multinom_model, df_balanced)

table(Predicted = predicted_multinom, Actual = df_balanced$RiskLevel)

##            Actual
## Predicted   high risk low risk mid risk
##   high risk       295       16       43
##   low risk         25      218      104
##   mid risk         47      133      220

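This model is useful for interpretation, but it may struggle when relationships between predictors and risk levels are non-linear or when risk categories overlap. The table above is computed on the same data the model was fitted to, so it is an in-sample result: 733 of 1,101 observations (about 67%) are classified correctly, and most errors are confusions between the low-risk and mid-risk classes.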
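
Overall accuracy and per-class recall can be derived directly from the confusion matrix. The sketch below re-enters the matrix printed above by hand; the object names `cm`, `accuracy`, and `recall` are illustrative.

```r
# Confusion matrix copied from the output above (rows = predicted, columns = actual)
classes <- c("high risk", "low risk", "mid risk")
cm <- matrix(
  c(295,  16,  43,
     25, 218, 104,
     47, 133, 220),
  nrow = 3, byrow = TRUE,
  dimnames = list(Predicted = classes, Actual = classes)
)

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correct predictions over the actual count of each class
recall <- diag(cm) / colSums(cm)

round(accuracy, 3)
round(recall, 3)
```

Recall is noticeably higher for the high-risk class than for the other two, which matches the overlap between low and mid risk visible in the table.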

14 Model 2: Random Forest Classification

Random Forest was selected because it can capture non-linear relationships and interactions between variables.

set.seed(42)

train_index <- createDataPartition(df_balanced$RiskLevel, p = 0.7, list = FALSE)

train_data <- df_balanced[train_index, ]
test_data  <- df_balanced[-train_index, ]

set.seed(42)

rf_model <- randomForest(
  RiskLevel ~ Age + SystolicBP + DiastolicBP + BS + BodyTemp + HeartRate,
  data = train_data,
  ntree = 100,
  importance = TRUE
)

rf_model

## 
## Call:
##  randomForest(formula = RiskLevel ~ Age + SystolicBP + DiastolicBP +      BS + BodyTemp + HeartRate, data = train_data, ntree = 100,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 14.66%
## Confusion matrix:
##           high risk low risk mid risk class.error
## high risk       243        5        9  0.05447471
## low risk          7      212       38  0.17509728
## mid risk         15       39      203  0.21011673

rf_predictions <- predict(rf_model, test_data)

confusion_matrix <- confusionMatrix(rf_predictions, test_data$RiskLevel)
confusion_matrix

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  high risk low risk mid risk
##   high risk       108        2       11
##   low risk          1       88       14
##   mid risk          1       20       85
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8515          
##                  95% CI : (0.8085, 0.8881)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7773          
##                                           
##  Mcnemar's Test P-Value : 0.02105         
## 
## Statistics by Class:
## 
##                      Class: high risk Class: low risk Class: mid risk
## Sensitivity                    0.9818          0.8000          0.7727
## Specificity                    0.9409          0.9318          0.9045
## Pos Pred Value                 0.8926          0.8544          0.8019
## Neg Pred Value                 0.9904          0.9031          0.8884
## Prevalence                     0.3333          0.3333          0.3333
## Detection Rate                 0.3273          0.2667          0.2576
## Detection Prevalence           0.3667          0.3121          0.3212
## Balanced Accuracy              0.9614          0.8659          0.8386

varImpPlot(rf_model, main = "Feature Importance")
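
For a like-for-like comparison, the multinomial baseline can be scored on a held-out set as well. The sketch below shows the pattern with `nnet::multinom()` on R's built-in `iris` data, used here only as a self-contained three-class stand-in; in this report, `train_data` and `test_data` would take its place.

```r
library(nnet)

set.seed(42)

# Stratified 70/30 split on a built-in three-class dataset (stand-in for the report's data)
train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit on the training rows only
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = train, trace = FALSE)

# Score on rows the model has not seen
pred <- predict(fit, newdata = test)
holdout_accuracy <- mean(pred == test$Species)

table(Predicted = pred, Actual = test$Species)
holdout_accuracy
```

Scoring both models on the same test rows is what makes their accuracies directly comparable.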

15 Key Findings

Blood sugar showed the strongest relationship with high-risk maternal health classification.
Systolic and diastolic blood pressure were also strongly associated with maternal health risk.
Age contributed to risk prediction, but direct health indicators were more important than age alone.
Younger individuals under 40 were still represented in the high-risk group, especially when blood pressure and blood sugar were elevated.
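Random Forest reached about 85% accuracy on the held-out test set, compared with about 67% in-sample accuracy for the multinomial logistic regression model, which is consistent with Random Forest capturing non-linear relationships and feature interactions. The two figures are not strictly comparable, because the logistic model was scored on its own training data and the test set was drawn after oversampling.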

16 Limitations

The dataset is relatively small.
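Oversampling was used to address class imbalance, and the train/test split was made after oversampling, so jittered copies of the same record can appear in both sets. The reported test accuracy is therefore likely optimistic and may not generalize.
The cleaning thresholds removed 249 of 1,014 observations, including readings that may be clinically valid, so the results depend on these cleaning decisions.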
The analysis identifies patterns and predictors, but it does not establish causality.
Additional clinical and socioeconomic variables would improve the model and provide more context.

17 Conclusion

This analysis shows that maternal health risk in this dataset is most strongly associated with blood sugar, followed by diastolic and systolic blood pressure, and then age. While age is relevant, the strongest indicators are direct clinical measures. The Random Forest model reached about 85% accuracy on the held-out test set and helped identify the most important variables for maternal health risk classification, although that figure should be read with the oversampling limitation in mind.

The project demonstrates an end-to-end data analysis workflow in R, including data loading, validation, cleaning, exploratory analysis, modeling, evaluation, and communication of findings.

Maternal Health Risk Analysis

Laura Burenkov

2026-06-04