Analyzing Maternal Health Data to Understand High vs. Low Risk Pregnancy

Irene Choi

10/7/2025

Purpose of Analysis

This dataset contains maternal health data from 998 pregnant people collected from various health clinics in Bangladesh. It includes details such as the mother’s age, weight, and height and the fetus’s health status, as well as medical test results, such as urine sugar and fetal heart rate. It also includes a categorization of each pregnancy as high or low risk.

Pregnancy places massive physiological strain on the body and changes it substantially.
Homeostasis in a pregnant body is very different from homeostasis in a non-pregnant body.
It is important to understand what a standard pregnancy looks like, to help define which health metrics can cause a pregnancy to be considered “high risk”.
Understanding which maternal health factors are seen in high-risk pregnancies may help reduce maternal mortality rates and allow for safer pregnancies.

I will perform basic analysis of different factors of maternal health to define “standard” pregnancy physiology and to understand what medical tests and health metrics, if any, are more correlated with high-risk pregnancies than others.

Reading Data

Data was collected by Ankur Ray Chayan in 2024 and published on Mendeley Data: https://data.mendeley.com/datasets/8k9pvpmykk/1

#Install necessary packages, including readxl, httr, stringr, gtsummary, and Hmisc


#Importing necessary libraries
library(ggplot2) 
library(readxl)
library(httr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.4     ✔ tidyr     1.3.1
## ✔ readr     2.1.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(stringr)
library(gtsummary)
library(Hmisc)

## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units

#Creating a temporary file with the correct .xlsx extension
temp <-tempfile(fileext = ".xlsx")

#Download data from Mendeley
GET("https://data.mendeley.com/public-files/datasets/8k9pvpmykk/files/1147acd1-10a7-486e-a34d-00f3bef999cc/file_downloaded", write_disk(temp, overwrite = TRUE))

## Response [https://prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com/5bb3fb83-1b51-4579-a388-eaec2457bb30]
##   Date: 2025-10-07 18:34
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 96.6 kB
## <ON DISK>  /var/folders/cb/kv2249f91c70wqh8lyh9rwg80000gn/T//RtmpVIWaJb/file56743fdb420.xlsx

maternaldata <- read_excel(temp)

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`

#Showing the top rows of the data for visualization
head(maternaldata)

## # A tibble: 6 × 18
##   `ANCC REGISTER` ...2  ...3    ...4   ...5  ...6  ...7  ...8  ...9  ...10 ...11
##   <chr>           <chr> <chr>   <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Name            Age   Gravida TiTi … গর্ভকাল… ওজন   উচ্চতা রক্ত চা… রক্তস্বল্প… জন্ডিস গর্ভস্হ …
## 2 Rituporna       18    1st     1st    38 w… 50 kg 5.3'' 100/… None  None  Norm…
## 3 Moina           25    2nd     2nd    38 w… 60 kg 5.2'' 100/… None  None  Norm…
## 4 Rabeya          20    1st     1st    30 w… 55 kg 5.0'' 100/… None  None  Norm…
## 5 Shorna          22    1st     3rd    35 w… 51 kg 5.4'' 110/… None  None  Norm…
## 6 Tania Akter     20    1st     2nd    30 w… 53 kg 5.2'' 100/… None  None  Norm…
## # ℹ 7 more variables: ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
## #   ...16 <chr>, ...17 <chr>, ...18 <chr>

Cleaning and Organizing the Data

Some column headers were in Bengali and needed translating. Several columns also needed relabeling and recoding to make the data easier to analyze.

#Set first row as column names, then remove the row that became the column names

colnames(maternaldata) <- as.character(unlist(maternaldata[1,]))
maternaldata <- maternaldata[-1,]


#Relabeling column titles with English translations

maternaldata <- rename(maternaldata, "Tetanus_Vaccination" = `TiTi Tika`, "Pregnancy_time" = `গর্ভকাল`, "Weight" = `ওজন`, "Height" = `উচ্চতা`, "Blood_pressure" = `রক্ত চাপ`, "Anemia" = `রক্তস্বল্পতা`, "Jaundice" = `জন্ডিস`, "Fetal_position" = `গর্ভস্হ শিশু অবস্থান`, "Fetal_movement" = `গর্ভস্হ শিশু নাড়াচাড়া`, "Fetal_heartbeat" = `গর্ভস্হ শিশু হৃৎস্পন্দন`, "Urine_albumin" = `প্রসাব পরিক্ষা এলবুমিন`, "Urine_sugar" = `প্রসাব পরিক্ষা সুগার`, "HBsAG" = HRsAG, "Pregnancy_risk" = `ঝুকিপূর্ণ গর্ভ`)


#Replace names in "Name" column with sequential ID numbers to help anonymize patients and for easier referencing

maternaldata$Name <- seq_len(nrow(maternaldata))


#Recode columns Urine_albumin, Urine_sugar, VDRL, HBsAG, Gravida, Tetanus_Vaccination, and Height such that None=0, Minimal=1, Medium=2, and Higher=3 (for Urine_albumin); No=0 and Yes=1 (for Urine_sugar); Negative=0 and Positive=1 (for VDRL and HBsAG); 1st=1, 2nd=2, 3rd=3 (for Gravida and Tetanus_Vaccination); 5.0''=152.4, 5.1''=154.94 (for Height, feet and inches converted to cm and stored in Height_cm)

maternaldata$Urine_albumin <- recode_factor(maternaldata$Urine_albumin, "None"=0, "Minimal"=1, "Medium"=2, "Higher"=3)
maternaldata$Urine_sugar <- recode_factor(maternaldata$Urine_sugar, "No"=0, "Yes"=1)
maternaldata$VDRL <- recode_factor(maternaldata$VDRL, "Negative"=0, "Positive"=1)
maternaldata$HBsAG <- recode_factor(maternaldata$HBsAG, "Negative"=0, "Positive"=1)
maternaldata$Gravida <- recode_factor(maternaldata$Gravida, "1st"=1, "2nd"=2, "3rd"=3)
maternaldata$Tetanus_Vaccination <- recode_factor(maternaldata$Tetanus_Vaccination, "1st"=1, "2nd"=2, "3rd"=3)
maternaldata$Height_cm <- recode_factor(maternaldata$Height, "5.0''"=152.4, "5.1''"=154.94, "5.2''"=157.48, "5.3''"=160.02,"5.4''"=162.56, "5.5''"=165.1, "5.6''"=167.64)


#Remove unnecessary words from columns (ex. "week" from Pregnancy_time) and change column title to reflect the unit (ex. Pregnancy_time -> Pregnancy_time_wk)

maternaldata$Pregnancy_time <- str_replace(maternaldata$Pregnancy_time, "week", "")
colnames(maternaldata)[5] <- "Pregnancy_time_wk"

maternaldata$Weight <- str_replace(maternaldata$Weight, "kg", "")
colnames(maternaldata)[6] <- "Weight_kg"

maternaldata$Fetal_heartbeat <- str_replace(maternaldata$Fetal_heartbeat, "m", "")
colnames(maternaldata)[13] <- "Fetal_heartbeat_bpm"


#Convert Height_cm and Weight_kg columns to numeric, calculate BMI (kg/m^2), then remove Weight_kg, Height, and Height_cm columns

maternaldata$Weight_kg <- as.numeric(as.character(maternaldata$Weight_kg))
maternaldata$Height_cm <- as.numeric(as.character(maternaldata$Height_cm))

maternaldata$BMI <- (maternaldata$Weight_kg/maternaldata$Height_cm/maternaldata$Height_cm)*10000
maternaldata <- maternaldata[,-c(6,7,19)]


#Separate Blood_pressure column into 2 columns, systolic and diastolic

maternaldata <- maternaldata %>% separate_wider_delim(Blood_pressure, delim = "/", names = c("Systolic", "Diastolic"))


#Change appropriate columns into numeric

maternaldata <- maternaldata %>%
     mutate(across(c(Age, Gravida, Pregnancy_time_wk, BMI, Systolic, Diastolic, Tetanus_Vaccination, Fetal_heartbeat_bpm, Urine_albumin, Urine_sugar, VDRL, HBsAG), ~ as.numeric(as.character(.))))


#Make column factoring pregnancy time (in weeks) into trimester

maternaldata$Trimester <- ifelse(maternaldata$Pregnancy_time_wk <= 13, "First",
  ifelse(maternaldata$Pregnancy_time_wk <= 27, "Second", "Third"))


#Reorder columns

maternaldata <- maternaldata[,c(1:3, 5, 18, 6:9, 4, 10:17, 19)]

head(maternaldata)

## # A tibble: 6 × 19
##    Name   Age Gravida Pregnancy_time_wk   BMI Systolic Diastolic Anemia Jaundice
##   <int> <dbl>   <dbl>             <dbl> <dbl>    <dbl>     <dbl> <chr>  <chr>   
## 1     1    18       1                38  19.5      100        60 None   None    
## 2     2    25       2                38  24.2      100        70 None   None    
## 3     3    20       1                30  23.7      100        60 None   None    
## 4     4    22       1                35  19.3      110        65 None   None    
## 5     5    20       1                30  21.4      100        55 None   None    
## 6     6    22       1                30  25.4      100        65 None   None    
## # ℹ 10 more variables: Tetanus_Vaccination <dbl>, Fetal_position <chr>,
## #   Fetal_movement <chr>, Fetal_heartbeat_bpm <dbl>, Urine_albumin <dbl>,
## #   Urine_sugar <dbl>, VDRL <dbl>, HBsAG <dbl>, Pregnancy_risk <chr>,
## #   Trimester <chr>
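
The height recode and BMI calculation in the cleaning step can also be sketched without a lookup table. The example below assumes that a height such as 5.3'' means 5 feet 3 inches, which is what the lookup values (5.3'' = 160.02 cm) imply. The helper names `height_to_cm` and `bmi` are hypothetical and are not part of the analysis above.

```r
# Hypothetical helper: convert heights written like "5.3''" (feet.inches) to cm.
# Values without a "." part (e.g. "5''") return NA in this sketch.
height_to_cm <- function(x) {
  cleaned <- gsub("'", "", x, fixed = TRUE)         # "5.3''" -> "5.3"
  parts <- strsplit(cleaned, ".", fixed = TRUE)     # "5.3"   -> c("5", "3")
  feet <- as.numeric(vapply(parts, `[`, character(1), 1))
  inches <- as.numeric(vapply(parts, `[`, character(1), 2))
  (feet * 12 + inches) * 2.54
}

# BMI = weight (kg) divided by height (m) squared
bmi <- function(weight_kg, height_cm) {
  weight_kg / (height_cm / 100)^2
}

# First patient in the raw data: 50 kg, 5.3''
first_height <- height_to_cm("5.3''")
first_bmi <- bmi(50, first_height)
round(first_bmi, 1)
```

Parsing the string avoids listing every height by hand, but the lookup table has the advantage of making unexpected values visible as NAs.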

Looking at the NAs within the data

I wanted to check whether there were any NAs and, if so, see where they occur.

#Create a new data frame containing only the rows with at least one NA.

maternaldata_na <- maternaldata %>% filter(if_any(everything(), ~ is.na(.)))

#Summarizing the maternaldata_na dataframe.

summary(maternaldata_na)

##       Name          Age         Gravida    Pregnancy_time_wk      BMI     
##  Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA       Min.   : NA  
##  1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA       1st Qu.: NA  
##  Median : NA   Median : NA   Median : NA   Median : NA       Median : NA  
##  Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN       Mean   :NaN  
##  3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA       3rd Qu.: NA  
##  Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA       Max.   : NA  
##     Systolic     Diastolic      Anemia            Jaundice        
##  Min.   : NA   Min.   : NA   Length:0           Length:0          
##  1st Qu.: NA   1st Qu.: NA   Class :character   Class :character  
##  Median : NA   Median : NA   Mode  :character   Mode  :character  
##  Mean   :NaN   Mean   :NaN                                        
##  3rd Qu.: NA   3rd Qu.: NA                                        
##  Max.   : NA   Max.   : NA                                        
##  Tetanus_Vaccination Fetal_position     Fetal_movement     Fetal_heartbeat_bpm
##  Min.   : NA         Length:0           Length:0           Min.   : NA        
##  1st Qu.: NA         Class :character   Class :character   1st Qu.: NA        
##  Median : NA         Mode  :character   Mode  :character   Median : NA        
##  Mean   :NaN                                               Mean   :NaN        
##  3rd Qu.: NA                                               3rd Qu.: NA        
##  Max.   : NA                                               Max.   : NA        
##  Urine_albumin  Urine_sugar       VDRL         HBsAG     Pregnancy_risk    
##  Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA   Length:0          
##  1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   Class :character  
##  Median : NA   Median : NA   Median : NA   Median : NA   Mode  :character  
##  Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN                     
##  3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA                     
##  Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA                     
##   Trimester        
##  Length:0          
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Interpretation of NAs

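The filtered data frame has zero rows (every column reports Length:0 or NA summaries), so there are no NA values in the data and no decision about omitting or imputing them is needed.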
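
A per-column count can be a more direct way to confirm this, since `summary()` on an empty data frame prints only NA placeholders. The sketch below uses a small made-up data frame (`toy`, not values from the dataset) to show the idea.

```r
# Illustrative data only: two missing values planted on purpose
toy <- data.frame(
  Age = c(18, NA, 20),
  BMI = c(19.5, 24.2, NA),
  Anemia = c("None", "None", "Minimal")
)

# Number of NAs in each column
na_counts <- colSums(is.na(toy))
na_counts

# Single TRUE/FALSE answer for the whole data frame
anyNA(toy)
```

On the cleaned data the equivalent calls would be `colSums(is.na(maternaldata))` and `anyNA(maternaldata)`.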

Evaluation of Skewness

#Create a histogram to observe skew in the following columns: Age, Gravida, Pregnancy_time_wk, BMI,  Systolic, Diastolic, and Fetal_heartbeat_bpm

age_histo <- hist(maternaldata$Age, xlab = "Age (years)", ylab = "Frequency")

gravida_histo <- hist(maternaldata$Gravida, breaks = seq(0.5, max(maternaldata$Gravida) + 0.5, by = 1), xaxt = "n", xlab = "Gravida (number of pregnancies)", ylab = "Frequency")
  axis(1, at = seq(0, max(maternaldata$Gravida), by = 1))

pregnancy_time_histo <- hist(maternaldata$Pregnancy_time_wk, xlab = "Pregnancy time (weeks)", ylab = "Frequency")

BMI_histo <- hist(maternaldata$BMI, xlab = "BMI", ylab = "Frequency")

systolic_histo <- hist(maternaldata$Systolic, breaks = seq(70, max(maternaldata$Systolic) + 10, by = 10), xaxt = "n", xlab = "Systolic blood pressure", ylab = "Frequency")
  axis(1, at = seq(70, max(maternaldata$Systolic), by = 10))

diastolic_histo <- hist(maternaldata$Diastolic, breaks = seq(50, max(maternaldata$Diastolic) + 5, by = 5), xaxt = "n", xlab = "Diastolic blood pressure", ylab = "Frequency")
  axis(1, at = seq(50, max(maternaldata$Diastolic), by = 5))

fetal_bpm_histo <- hist(maternaldata$Fetal_heartbeat_bpm, breaks = seq(120, max(maternaldata$Fetal_heartbeat_bpm) + 5, by = 5), xaxt = "n", xlab = "Fetal heartbeat (bpm)", ylab = "Frequency")
  axis(1, at = seq(120, max(maternaldata$Fetal_heartbeat_bpm), by = 5))

#Create new dataframes based on pregnancy risk
  
highrisk_maternaldata <- maternaldata[maternaldata$Pregnancy_risk == "Yes", ]

lowrisk_maternaldata <- maternaldata[maternaldata$Pregnancy_risk == "No", ]

Maternal age and gravida (number of pregnancies) have a right skew (a longer tail toward higher values), indicating that most mothers in this study are younger and on their first pregnancy.
Fetal heartbeat also appears to have a slight right skew, indicating that most values are on the lower end of the observed range. However, all fetal heart rates (120-150 bpm) fall within a healthy range.
All other factors (pregnancy time, BMI, and systolic and diastolic blood pressure) have relatively normal distributions, indicating that most patients in this study are in mid-to-late gestation and do not have an extremely high or low BMI or blood pressure.
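
The visual impression of skew from a histogram can be cross-checked with a number. The sketch below computes the moment coefficient of skewness in base R. The helper name `skewness_coef` and the toy ages are illustrative assumptions, not values from the dataset. A positive coefficient means a longer tail toward higher values, and a negative coefficient means a longer tail toward lower values.

```r
# Hypothetical helper: moment coefficient of skewness (third standardized moment)
skewness_coef <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative ages: mostly young, with a few older values
toy_age <- c(18, 19, 20, 20, 21, 22, 22, 23, 30, 38)
skewness_coef(toy_age)
```

On the cleaned data this would be called as `skewness_coef(maternaldata$Age)`.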

Summarizing health statistics for the average pregnancy using boxplots

I want to summarize the age, BMI, and systolic and diastolic blood pressure for all patients included in the study through boxplots

#Make boxplots for each of the above listed factors

age_boxplot <- boxplot(maternaldata$Age, main = "Boxplot of Maternal Age", ylab = "Age (years)")

BMI_boxplot <- boxplot(maternaldata$BMI, main = "Boxplot of BMI", ylab = "BMI")

bp_boxplot <- boxplot(maternaldata[,c("Systolic", "Diastolic")])

Median age during pregnancy is 22. Several patients on the older end of the spectrum produce a longer upper tail, consistent with the skew seen in the age histogram. Median BMI is also about 22, which falls within the healthy range (18.5-24.9). Median systolic blood pressure is about 100 mmHg and median diastolic is about 60 mmHg.
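
The points drawn beyond the whiskers of a boxplot can also be listed directly. A minimal sketch, using made-up ages rather than the study data, with `boxplot.stats()` from base R:

```r
# Illustrative ages only
toy_age <- c(18, 20, 22, 22, 23, 25, 45)

age_stats <- boxplot.stats(toy_age)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
age_stats$stats

# Values beyond 1.5 times the hinge spread from either hinge
age_stats$out
```

Applied to the cleaned data, `boxplot.stats(maternaldata$Age)$out` would return the ages plotted as individual points.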

Relationships between pregnancy risk and fetal health or anemia/jaundice in mother

First I wanted to see if there was a relationship between fetal health (Fetal_position, Fetal_movement, Fetal_heartbeat_bpm) and pregnancy risk

#Using gtsummary to list the above listed factors in a table from low risk pregnancies

lowrisk_fetal_health_tbl <- lowrisk_maternaldata %>% tbl_summary(include = c(Fetal_position, Fetal_movement, Fetal_heartbeat_bpm))
lowrisk_fetal_health_tbl

Characteristic	N = 332¹
Fetal_position
Abnormal	2 (0.6%)
Normal	330 (99%)
Fetal_movement
Normal	332 (100%)
Fetal_heartbeat_bpm
120	60 (18%)
125	31 (9.3%)
130	120 (36%)
140	91 (27%)
150	30 (9.0%)
¹ n (%)

highrisk_fetal_health_tbl <- highrisk_maternaldata %>% tbl_summary(include = c(Fetal_position, Fetal_movement, Fetal_heartbeat_bpm))
highrisk_fetal_health_tbl

Characteristic	N = 666¹
Fetal_position
Abnormal	4 (0.6%)
Normal	662 (99%)
Fetal_movement
Normal	666 (100%)
Fetal_heartbeat_bpm
120	121 (18%)
125	60 (9.0%)
130	243 (36%)
140	181 (27%)
150	61 (9.2%)
¹ n (%)

#Merge tables for a side-by-side comparison

tbl_merge(tbls = list(lowrisk_fetal_health_tbl, highrisk_fetal_health_tbl), tab_spanner = c("Low risk", "High risk"))

Characteristic	Low risk	High risk
Characteristic	N = 332¹	N = 666¹
Fetal_position
Abnormal	2 (0.6%)	4 (0.6%)
Normal	330 (99%)	662 (99%)
Fetal_movement
Normal	332 (100%)	666 (100%)
Fetal_heartbeat_bpm
120	60 (18%)	121 (18%)
125	31 (9.3%)	60 (9.0%)
130	120 (36%)	243 (36%)
140	91 (27%)	181 (27%)
150	30 (9.0%)	61 (9.2%)
¹ n (%)

The proportions in every fetal health category are nearly identical in the two groups, so there does not seem to be a notable relationship between fetal health and pregnancy risk.
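
One way to back this kind of side-by-side comparison with a formal test is a chi-squared test of independence on the counts. The sketch below re-enters the fetal heartbeat counts from the merged table by hand. Treating heartbeat as a categorical variable (as `tbl_summary` does here) and the object names are assumptions made for illustration.

```r
# Fetal heartbeat counts copied from the merged table (filled column by column)
heartbeat_counts <- matrix(
  c(60, 31, 120, 91, 30,      # low risk
    121, 60, 243, 181, 61),   # high risk
  nrow = 5,
  dimnames = list(
    bpm = c("120", "125", "130", "140", "150"),
    risk = c("Low", "High")
  )
)

# Chi-squared test of independence between heartbeat category and risk group
heartbeat_test <- chisq.test(heartbeat_counts)

# Counts expected if heartbeat and risk were unrelated
heartbeat_test$expected
heartbeat_test
```

With the cleaned data, the same test could be run without retyping counts as `chisq.test(table(maternaldata$Fetal_heartbeat_bpm, maternaldata$Pregnancy_risk))`.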

Next I wanted to assess if there’s a relationship between anemia or jaundice in the mother and pregnancy risk

#Use gtsummary to list the above listed factors in a table from high and low risk pregnancies

lowrisk_medicalconditions <- lowrisk_maternaldata %>% tbl_summary(include = c(Jaundice, Anemia))
highrisk_medicalconditions <- highrisk_maternaldata %>% tbl_summary(include = c(Jaundice, Anemia))

#Merge tables for a side-by-side comparison

tbl_merge (tbls = list(lowrisk_medicalconditions, highrisk_medicalconditions), tab_spanner = c("Low risk", "High risk"))

Characteristic	Low risk	High risk
Characteristic	N = 332¹	N = 666¹
Jaundice
Medium	1 (0.3%)	3 (0.5%)
Minimal	3 (0.9%)	5 (0.8%)
None	328 (99%)	658 (99%)
Anemia
Medium	20 (6.0%)	41 (6.2%)
Minimal	21 (6.3%)	41 (6.2%)
None	291 (88%)	584 (88%)
¹ n (%)

The proportions of each severity level are nearly identical in the two groups, so there does not seem to be a notable relationship between anemia or jaundice in the mother and pregnancy risk.
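
Because so few patients have jaundice, an exact test may be more suitable than a chi-squared approximation for that variable. The sketch below collapses Medium and Minimal into a single "Any" category (an assumption made here to get a 2 x 2 table) and applies Fisher's exact test to the counts from the merged table.

```r
# Jaundice counts from the merged table, with Medium + Minimal combined
jaundice_counts <- matrix(
  c(4, 328,    # low risk: any jaundice, none
    8, 658),   # high risk: any jaundice, none
  nrow = 2,
  dimnames = list(
    jaundice = c("Any", "None"),
    risk = c("Low", "High")
  )
)

# Fisher's exact test does not rely on large expected counts
jaundice_test <- fisher.test(jaundice_counts)
jaundice_test$estimate   # conditional odds ratio estimate
jaundice_test$p.value
```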

Assessing stage of pregnancy and pregnancy risk

I wanted to assess whether women in later stages of pregnancy (i.e., 3rd trimester vs. 2nd trimester) would have a higher incidence of high risk pregnancies

#Use gtsummary to summarize trimester stage in a table from high and low risk pregnancies

lowrisk_trimester <- lowrisk_maternaldata %>% tbl_summary(include = c(Trimester))
highrisk_trimester <- highrisk_maternaldata %>% tbl_summary(include = c(Trimester))

#Merge tables for a side-by-side comparison

tbl_merge (tbls = list(lowrisk_trimester, highrisk_trimester), tab_spanner = c("Low risk", "High risk"))

Characteristic	Low risk	High risk
Characteristic	N = 332¹	N = 666¹
Trimester
Second	131 (39%)	265 (40%)
Third	201 (61%)	401 (60%)
¹ n (%)

About 60% of patients in both groups are in the third trimester, so there is no apparent relationship between stage of pregnancy and having a high or low risk pregnancy.

Assessing relationship between age of mother and pregnancy risk

#Use ggplot to make boxplots to visualize age in high and low risk groups

ggplot(maternaldata, aes(x = Pregnancy_risk, y = Age)) + geom_boxplot(fill = "lightblue") + ylab("Age (years)") + xlab("Pregnancy Risk")

There is no apparent relationship between age of mother and having a high or low risk pregnancy
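
Since age is skewed, a rank-based comparison such as the Mann-Whitney (Wilcoxon rank-sum) test is one option for backing up the boxplot with a test. A minimal sketch on made-up data (the `toy` values are illustrative and not from the dataset):

```r
# Illustrative data only: ages for five "No" and five "Yes" risk patients
toy <- data.frame(
  Age = c(18, 19, 20, 22, 22, 25, 30, 35, 21, 24),
  Pregnancy_risk = rep(c("No", "Yes"), each = 5)
)

# exact = FALSE uses the normal approximation, which tolerates tied ages
age_test <- wilcox.test(Age ~ Pregnancy_risk, data = toy, exact = FALSE)
age_test$statistic
age_test$p.value
```

On the cleaned data the call would be `wilcox.test(Age ~ Pregnancy_risk, data = maternaldata, exact = FALSE)`.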

Assessing mean BMI of pregnancy risk groups and visualizing means

I wanted to see if there’s a relationship between mean BMI and pregnancy risk

#Calculating mean BMI from low risk and high risk pregnancies
lowrisk_meanBMI <- mean(lowrisk_maternaldata$BMI)
highrisk_meanBMI <- mean(highrisk_maternaldata$BMI)

lowrisk_meanBMI

## [1] 22.95498

highrisk_meanBMI

## [1] 21.99694

Visualization of mean BMI and pregnancy risk

I then wanted to visualize the means calculated above in bar graphs.

#Use ggplot to create bar graphs of average BMI in low risk and high risk pregnancy mothers

ggplot(maternaldata, aes(x = Pregnancy_risk, y = BMI)) + stat_summary(fun = mean, geom = "bar", fill = "lightblue") + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .5) + ylab("Mean BMI") + xlab("Pregnancy Risk")

Mothers identified with high risk pregnancies have, on average, lower BMIs (mean of about 22.0) than those with low risk pregnancies (mean of about 23.0).
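
Whether a difference in means of this size is more than chance could be checked with a Welch two-sample t-test. The sketch below runs on simulated BMI values (the group means, standard deviation, and sample sizes are assumptions chosen only for illustration), so its output says nothing about the real data.

```r
# Simulated data only: not the study values
set.seed(1)
sim <- data.frame(
  BMI = c(rnorm(50, mean = 23, sd = 3), rnorm(50, mean = 22, sd = 3)),
  Pregnancy_risk = rep(c("No", "Yes"), each = 50)
)

# t.test() defaults to the Welch test, which does not assume equal variances
bmi_test <- t.test(BMI ~ Pregnancy_risk, data = sim)
bmi_test$estimate   # mean BMI in each group
bmi_test$conf.int   # confidence interval for the difference in means
bmi_test$p.value
```

On the cleaned data the call would be `t.test(BMI ~ Pregnancy_risk, data = maternaldata)`.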

Discussion and Future Directions

Discussion

The median mother from this study was around 22 years of age, with a BMI of 22 and blood pressure of 100/60.

From my analysis, I did not see a relationship between the following and low vs. high pregnancy risk:
- Fetal health
- Maternal jaundice or anemia
- Stage of pregnancy (2nd vs. 3rd trimester)
- Age of mother

I did see a relationship between average BMI and pregnancy risk, wherein mothers with high risk pregnancies had on average lower BMIs (about 22.0 vs. 23.0), although this difference was not tested statistically.

Future Directions

Perform non-parametric analysis (e.g., Mann-Whitney test) of certain numeric variables (such as maternal age, gravida, and fetal heartbeat in bpm) due to skew.
Investigate whether other health factors, such as urine albumin or urine sugar, occurred at a higher incidence in high risk pregnancies.
All data from this study came from various clinics in Bangladesh. It would be interesting to see if similar patterns are seen in other countries with women of different ethnicities and various socioeconomic backgrounds.
Collect birth data and infant health data and see whether there are any correlations between high vs. low risk pregnancy and birth complications or infant health.

References

- Chayan, Ankur Ray (2024), “Maternal Health and High-Risk Pregnancy Dataset”, Mendeley Data, V1, doi: 10.17632/8k9pvpmykk.1