Analyzing Maternal Health Data to Understand High vs. Low Risk Pregnancy

Irene Choi

10/7/2025

Purpose of Analysis

This dataset contains maternal health data from 998 pregnant people collected from various health clinics in Bangladesh. It includes details such as the mother’s age, weight, and height and the fetal health status, as well as medical test results, such as urine sugar and fetal heart rate. Each pregnancy is also categorized as high or low risk.

I will perform basic analysis of different factors of maternal health to define “standard” pregnancy physiology and to understand what medical tests and health metrics, if any, are more correlated with high-risk pregnancies than others.

Reading Data

Data was collected by Ankur Ray Chayan in 2024 and published on Mendeley Data: https://data.mendeley.com/datasets/8k9pvpmykk/1

#Install necessary packages, including readxl, httr, stringr, gtsummary, and Hmisc


#Importing necessary libraries
library(ggplot2) 
library(readxl)
library(httr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.4     ✔ tidyr     1.3.1
## ✔ readr     2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(gtsummary)
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
#Creating a temporary file with the correct .xlsx extension
temp <-tempfile(fileext = ".xlsx")

#Download data from Mendeley
GET("https://data.mendeley.com/public-files/datasets/8k9pvpmykk/files/1147acd1-10a7-486e-a34d-00f3bef999cc/file_downloaded", write_disk(temp, overwrite = TRUE))
## Response [https://prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com/5bb3fb83-1b51-4579-a388-eaec2457bb30]
##   Date: 2025-10-07 18:34
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 96.6 kB
## <ON DISK>  /var/folders/cb/kv2249f91c70wqh8lyh9rwg80000gn/T//RtmpVIWaJb/file56743fdb420.xlsx
maternaldata <- read_excel(temp)
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
#Showing the top rows of the data for visualization
head(maternaldata)
## # A tibble: 6 × 18
##   `ANCC REGISTER` ...2  ...3    ...4   ...5  ...6  ...7  ...8  ...9  ...10 ...11
##   <chr>           <chr> <chr>   <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Name            Age   Gravida TiTi … গর্ভকাল… ওজন   উচ্চতা রক্ত চা… রক্তস্বল্প… জন্ডিস গর্ভস্হ …
## 2 Rituporna       18    1st     1st    38 w… 50 kg 5.3'' 100/… None  None  Norm…
## 3 Moina           25    2nd     2nd    38 w… 60 kg 5.2'' 100/… None  None  Norm…
## 4 Rabeya          20    1st     1st    30 w… 55 kg 5.0'' 100/… None  None  Norm…
## 5 Shorna          22    1st     3rd    35 w… 51 kg 5.4'' 110/… None  None  Norm…
## 6 Tania Akter     20    1st     2nd    30 w… 53 kg 5.2'' 100/… None  None  Norm…
## # ℹ 7 more variables: ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
## #   ...16 <chr>, ...17 <chr>, ...18 <chr>

Cleaning and Organizing the Data

The data had some phrases in Bengali and required translating and relabeling for clarity. It also required changing the labels of some columns and recoding to make the data easier to analyze.

#Set first row as column names, then remove the row that became the column names

colnames(maternaldata) <- maternaldata[1,]
maternaldata <- maternaldata[-1,]
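
The header-promotion step above can be sketched on a small stand-in data frame. The `raw` data and the `promote_header` helper are hypothetical and exist only for illustration; they are not part of the original analysis. `read_excel()` also has a `skip` argument that might make this step unnecessary, but that was not tried here.

```r
# Toy data frame standing in for the raw import (values are made up)
raw <- data.frame(
  V1 = c("Name", "A", "B"),
  V2 = c("Age", "18", "25"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: use the first row as column names, then drop that row
promote_header <- function(df) {
  names(df) <- as.character(unlist(df[1, ]))
  df <- df[-1, , drop = FALSE]
  rownames(df) <- NULL
  df
}

clean <- promote_header(raw)
names(clean)
```

Converting the first row with `as.character(unlist(...))` makes the assignment explicit rather than relying on how a one-row data frame is coerced to names.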


#Relabeling column titles with English translations

maternaldata <- rename(maternaldata, "Tetanus_Vaccination" = `TiTi Tika`, "Pregnancy_time" = `গর্ভকাল`, "Weight" = `ওজন`, "Height" = `উচ্চতা`, "Blood_pressure" = `রক্ত চাপ`, "Anemia" = `রক্তস্বল্পতা`, "Jaundice" = `জন্ডিস`, "Fetal_position" = `গর্ভস্হ শিশু অবস্থান`, "Fetal_movement" = `গর্ভস্হ শিশু নাড়াচাড়া`, "Fetal_heartbeat" = `গর্ভস্হ শিশু হৃৎস্পন্দন`, "Urine_albumin" = `প্রসাব পরিক্ষা এলবুমিন`, "Urine_sugar" = `প্রসাব পরিক্ষা সুগার`, "HBsAG" = HRsAG, "Pregnancy_risk" = `ঝুকিপূর্ণ গর্ভ`)


#Replace names in "Name" column with sequential ID numbers to help make patients anonymous and for easier referencing

maternaldata$Name <- seq_len(nrow(maternaldata))


#Recode columns Urine_albumin, Urine_sugar, VDRL, HBsAG, Gravida, and Tetanus_Vaccination such that None=0, Minimal=1, Medium=2, and Higher=3 (for Urine_albumin); No=0 and Yes=1 (for Urine_sugar); Negative=0 and Positive=1 (for VDRL and HBsAG); 1st=1, 2nd=2, 3rd=3 (for Gravida and Tetanus_Vaccination). Also convert Height from feet.inches notation into centimeters in a new Height_cm column (ex. 5.0''=152.4, 5.1''=154.94)

maternaldata$Urine_albumin <- recode_factor(maternaldata$Urine_albumin, "None"=0, "Minimal"=1, "Medium"=2, "Higher"=3)
maternaldata$Urine_sugar <- recode_factor(maternaldata$Urine_sugar, "No"=0, "Yes"=1)
maternaldata$VDRL <- recode_factor(maternaldata$VDRL, "Negative"=0, "Positive"=1)
maternaldata$HBsAG <- recode_factor(maternaldata$HBsAG, "Negative"=0, "Positive"=1)
maternaldata$Gravida <- recode_factor(maternaldata$Gravida, "1st"=1, "2nd"=2, "3rd"=3)
maternaldata$Tetanus_Vaccination <- recode_factor(maternaldata$Tetanus_Vaccination, "1st"=1, "2nd"=2, "3rd"=3)
maternaldata$Height_cm <- recode_factor(maternaldata$Height, "5.0''"=152.4, "5.1''"=154.94, "5.2''"=157.48, "5.3''"=160.02,"5.4''"=162.56, "5.5''"=165.1, "5.6''"=167.64)
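
As an alternative to the hard-coded height lookup, the conversion could be computed. The sketch below assumes every height is written as feet, a period, inches, and two trailing quote marks (as in the values shown by `head()`); `height_to_cm` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: read "5.3''" as 5 feet 3 inches and convert to centimeters
height_to_cm <- function(x) {
  cleaned <- gsub("'", "", x, fixed = TRUE)        # drop the trailing quote marks
  parts <- strsplit(cleaned, ".", fixed = TRUE)    # split feet from inches
  vapply(parts, function(p) {
    feet <- as.numeric(p[1])
    inches <- if (length(p) > 1) as.numeric(p[2]) else 0
    (feet * 12 + inches) * 2.54                    # 1 inch = 2.54 cm
  }, numeric(1))
}

height_to_cm(c("5.0''", "5.3''", "5.6''"))
```

For these three inputs the function reproduces the values used in the lookup above (152.4, 160.02, and 167.64), and it would also handle heights that the lookup does not list.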


#Remove unnecessary words from columns (ex. "week" from Pregnancy_time) and change column title to reflect the unit (ex. Pregnancy_time -> Pregnancy_time_wk)

maternaldata$Pregnancy_time <- str_replace(maternaldata$Pregnancy_time, "week", "")
colnames(maternaldata)[5] <- "Pregnancy_time_wk"

maternaldata$Weight <- str_replace(maternaldata$Weight, "kg", "")
colnames(maternaldata)[6] <- "Weight_kg"

maternaldata$Fetal_heartbeat <- str_replace(maternaldata$Fetal_heartbeat, "m", "")
colnames(maternaldata)[13] <- "Fetal_heartbeat_bpm"
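
The unit-stripping steps above could also be handled by one pattern that keeps only digits and decimal points. This is a sketch; `strip_units` is a hypothetical helper and the example strings are assumptions about the raw format, not values taken from the dataset.

```r
# Hypothetical helper: keep digits and decimal points, then convert to numeric
strip_units <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

# Made-up example strings in the style of the raw columns
strip_units(c("38 week", "50 kg", "120m"))
```

This approach is only suitable for columns holding a single number plus a unit. It would misread values such as blood pressure ("100/60") or heights in feet.inches notation, which need their own parsing.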


#Convert Weight_kg and Height_cm columns to numeric, calculate BMI (kg/m^2), then remove Weight_kg, Height, and Height_cm columns

maternaldata$Weight_kg <- as.numeric(as.character(maternaldata$Weight_kg))
maternaldata$Height_cm <- as.numeric(as.character(maternaldata$Height_cm))

maternaldata$BMI <- (maternaldata$Weight_kg/maternaldata$Height_cm/maternaldata$Height_cm)*10000
maternaldata <- maternaldata[,-c(6,7,19)]
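
The BMI calculation above is weight in kilograms divided by height in meters squared. A minimal sketch follows, using the weight and height of the first two patients shown by `head()`; the `bmi` helper and the `toy` data frame are illustrative names. Dropping columns by name rather than by position is shown as one possible way to keep the step working if the column order changes.

```r
# BMI = weight (kg) / height (m)^2, with height supplied in centimeters
bmi <- function(weight_kg, height_cm) {
  weight_kg / (height_cm / 100)^2
}

# First two patients from the head() output: 50 kg at 5.3'' and 60 kg at 5.2''
toy <- data.frame(
  Weight_kg = c(50, 60),
  Height    = c("5.3''", "5.2''"),
  Height_cm = c(160.02, 157.48)
)
toy$BMI <- bmi(toy$Weight_kg, toy$Height_cm)

# Drop the intermediate columns by name instead of by index
toy <- toy[, !(names(toy) %in% c("Weight_kg", "Height", "Height_cm")), drop = FALSE]
round(toy$BMI, 1)
```

The rounded results (19.5 and 24.2) agree with the BMI values shown for patients 1 and 2 in the cleaned data.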


#Separate Blood_pressure column into 2 columns, systolic and diastolic

maternaldata <- maternaldata %>% separate_wider_delim(Blood_pressure, delim = "/", names = c("Systolic", "Diastolic"))


#Change appropriate columns into numeric

maternaldata <- maternaldata %>%
     mutate(across(c(Age, Gravida, Pregnancy_time_wk, BMI, Systolic, Diastolic, Tetanus_Vaccination, Fetal_heartbeat_bpm, Urine_albumin, Urine_sugar, VDRL, HBsAG), ~ as.numeric(as.character(.))))


#Make column factoring pregnancy time (in weeks) into trimester

maternaldata$Trimester <- ifelse(maternaldata$Pregnancy_time_wk <= 13, "First",
  ifelse(maternaldata$Pregnancy_time_wk <= 27, "Second", "Third"))
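
The same trimester boundaries can be expressed with `cut()`, which avoids nested `ifelse()` calls. This is a sketch on made-up week values, not a change to the analysis above.

```r
# Made-up gestational ages in weeks
weeks <- c(8, 13, 14, 27, 28, 38)

# Intervals are (0, 13], (13, 27], (27, Inf], matching the <= 13 and <= 27 cutoffs
trimester <- cut(
  weeks,
  breaks = c(0, 13, 27, Inf),
  labels = c("First", "Second", "Third")
)
as.character(trimester)
```

Unlike the `ifelse()` version, `cut()` returns an ordered set of factor levels, and a value of 0 or less would become `NA` rather than "First".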


#Reorder columns

maternaldata <- maternaldata[,c(1:3, 5, 18, 6:9, 4, 10:17, 19)]

head(maternaldata)
## # A tibble: 6 × 19
##    Name   Age Gravida Pregnancy_time_wk   BMI Systolic Diastolic Anemia Jaundice
##   <int> <dbl>   <dbl>             <dbl> <dbl>    <dbl>     <dbl> <chr>  <chr>   
## 1     1    18       1                38  19.5      100        60 None   None    
## 2     2    25       2                38  24.2      100        70 None   None    
## 3     3    20       1                30  23.7      100        60 None   None    
## 4     4    22       1                35  19.3      110        65 None   None    
## 5     5    20       1                30  21.4      100        55 None   None    
## 6     6    22       1                30  25.4      100        65 None   None    
## # ℹ 10 more variables: Tetanus_Vaccination <dbl>, Fetal_position <chr>,
## #   Fetal_movement <chr>, Fetal_heartbeat_bpm <dbl>, Urine_albumin <dbl>,
## #   Urine_sugar <dbl>, VDRL <dbl>, HBsAG <dbl>, Pregnancy_risk <chr>,
## #   Trimester <chr>

Looking at the NAs within the data

I wanted to check whether there were any NAs and, if so, where they occur.

#Create a new data frame containing only the rows with at least one NA.

maternaldata_na <- maternaldata %>% filter(if_any(everything(), ~ is.na(.)))

#Summarizing the maternaldata_na dataframe.

summary(maternaldata_na)
##       Name          Age         Gravida    Pregnancy_time_wk      BMI     
##  Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA       Min.   : NA  
##  1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA       1st Qu.: NA  
##  Median : NA   Median : NA   Median : NA   Median : NA       Median : NA  
##  Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN       Mean   :NaN  
##  3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA       3rd Qu.: NA  
##  Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA       Max.   : NA  
##     Systolic     Diastolic      Anemia            Jaundice        
##  Min.   : NA   Min.   : NA   Length:0           Length:0          
##  1st Qu.: NA   1st Qu.: NA   Class :character   Class :character  
##  Median : NA   Median : NA   Mode  :character   Mode  :character  
##  Mean   :NaN   Mean   :NaN                                        
##  3rd Qu.: NA   3rd Qu.: NA                                        
##  Max.   : NA   Max.   : NA                                        
##  Tetanus_Vaccination Fetal_position     Fetal_movement     Fetal_heartbeat_bpm
##  Min.   : NA         Length:0           Length:0           Min.   : NA        
##  1st Qu.: NA         Class :character   Class :character   1st Qu.: NA        
##  Median : NA         Mode  :character   Mode  :character   Median : NA        
##  Mean   :NaN                                               Mean   :NaN        
##  3rd Qu.: NA                                               3rd Qu.: NA        
##  Max.   : NA                                               Max.   : NA        
##  Urine_albumin  Urine_sugar       VDRL         HBsAG     Pregnancy_risk    
##  Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA   Length:0          
##  1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   Class :character  
##  Median : NA   Median : NA   Median : NA   Median : NA   Mode  :character  
##  Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN                     
##  3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA                     
##  Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA                     
##   Trimester        
##  Length:0          
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Interpretation of NAs

The filtered data frame has zero rows (every summary field is NA or has Length:0), so there are no NA values and I don’t need to decide whether to include or omit any.
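
A more direct way to check for missing values is to count them per column. The sketch below uses a made-up data frame with deliberately missing entries; on the real data, the same call would be expected to return zero for every column.

```r
# Made-up data frame with two missing values, for illustration only
toy <- data.frame(
  Age    = c(18, NA, 20),
  BMI    = c(19.5, 24.2, NA),
  Anemia = c("None", "None", "None")
)

# Number of NAs in each column
na_counts <- colSums(is.na(toy))
na_counts

# TRUE if any value anywhere is missing
sum(na_counts) > 0
```

Per-column counts also show where the missing values are, which an empty filtered data frame cannot do.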

Evaluation of Skewness

#Create a histogram to observe skew in the following columns: Age, Gravida, Pregnancy_time_wk, BMI,  Systolic, Diastolic, and Fetal_heartbeat_bpm

age_histo <- hist(maternaldata$Age, xlab = "Age (years)", ylab = "Frequency")

gravida_histo <- hist(maternaldata$Gravida, breaks = seq(0.5, max(maternaldata$Gravida) + 0.5, by = 1), xaxt = "n", xlab = "Gravida (number of pregnancies)", ylab = "Frequency")
  axis(1, at = seq(0, max(maternaldata$Gravida), by = 1))

pregnancy_time_histo <- hist(maternaldata$Pregnancy_time_wk, xlab = "Pregnancy time (weeks)", ylab = "Frequency")

BMI_histo <- hist(maternaldata$BMI, xlab = "BMI", ylab = "Frequency")

systolic_histo <- hist(maternaldata$Systolic, breaks = seq(70, max(maternaldata$Systolic) + 10, by = 10), xaxt = "n", xlab = "Systolic blood pressure", ylab = "Frequency")
  axis(1, at = seq(70, max(maternaldata$Systolic), by = 10))

diastolic_histo <- hist(maternaldata$Diastolic, breaks = seq(50, max(maternaldata$Diastolic) + 5, by = 5), xaxt = "n", xlab = "Diastolic blood pressure", ylab = "Frequency")
  axis(1, at = seq(50, max(maternaldata$Diastolic), by = 5))

fetal_bpm_histo <- hist(maternaldata$Fetal_heartbeat_bpm, breaks = seq(120, max(maternaldata$Fetal_heartbeat_bpm) + 5, by = 5), xaxt = "n", xlab = "Fetal heartbeat (bpm)", ylab = "Frequency")
  axis(1, at = seq(120, max(maternaldata$Fetal_heartbeat_bpm), by = 5))

#Create new dataframes based on pregnancy risk
  
highrisk_maternaldata <- maternaldata[maternaldata$Pregnancy_risk == "Yes", ]

lowrisk_maternaldata <- maternaldata[maternaldata$Pregnancy_risk == "No", ]
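
Histograms show skew visually; a numeric skewness value can complement them. The sketch below implements the moment-based sample skewness (third central moment divided by the second central moment raised to the power 1.5) in base R. `sample_skewness` is a hypothetical helper and the input vectors are made up.

```r
# Hypothetical helper: moment-based sample skewness
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Symmetric values give a skewness of zero
sample_skewness(c(1, 2, 3, 4, 5))

# One large value produces a longer right tail and positive skewness
sample_skewness(c(1, 1, 1, 1, 10))
```

Applied to a column such as `maternaldata$Age`, a positive result would indicate a longer upper tail and a negative result a longer lower tail.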

Summarizing health statistics for the average pregnancy using boxplots

I want to summarize the age, BMI, and systolic and diastolic blood pressure of all patients included in the study using boxplots.

#Make boxplots for each of the above listed factors

age_boxplot <- boxplot(maternaldata$Age, main = "Boxplot of Maternal Age", ylab = "Age (years)")

BMI_boxplot <- boxplot(maternaldata$BMI, main = "Boxplot of BMI", ylab = "BMI")

bp_boxplot <- boxplot(maternaldata[,c("Systolic", "Diastolic")])

Median age during pregnancy is 22, with a roughly normal distribution (as the earlier histogram also indicated). Notably, several patients fall at the older end of the range, which lengthens the upper tail. Median BMI is also about 22, which falls within the healthy range (18.5-24.9). Median systolic BP is about 100 and median diastolic BP is about 60.
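
Medians read from a boxplot are approximate, so they could also be computed directly. The sketch below uses only the six rows shown by `head()` earlier, so its results describe those six patients and not the full dataset.

```r
# The six patients shown in the head() output of the cleaned data
toy <- data.frame(
  Age       = c(18, 25, 20, 22, 20, 22),
  Systolic  = c(100, 100, 100, 110, 100, 100),
  Diastolic = c(60, 70, 60, 65, 55, 65)
)

# Median of each column
medians <- sapply(toy, median)
medians

# Points a boxplot would draw as outliers (beyond 1.5 x IQR from the box)
boxplot.stats(toy$Age)$out
```

Running `sapply(maternaldata[, c("Age", "BMI", "Systolic", "Diastolic")], median)` on the full data would give the exact values behind the summary above.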

Relationships between pregnancy risk and fetal health or anemia/jaundice in mother

First, I wanted to see if there was a relationship between fetal health (Fetal_position, Fetal_movement, Fetal_heartbeat_bpm) and pregnancy risk.

#Using gtsummary to list the above factors in separate tables for low risk and high risk pregnancies

lowrisk_fetal_health_tbl <- lowrisk_maternaldata %>% tbl_summary(include = c(Fetal_position, Fetal_movement, Fetal_heartbeat_bpm))
lowrisk_fetal_health_tbl
Characteristic            N = 332¹
Fetal_position
    Abnormal              2 (0.6%)
    Normal                330 (99%)
Fetal_movement
    Normal                332 (100%)
Fetal_heartbeat_bpm
    120                   60 (18%)
    125                   31 (9.3%)
    130                   120 (36%)
    140                   91 (27%)
    150                   30 (9.0%)
¹ n (%)
highrisk_fetal_health_tbl <- highrisk_maternaldata %>% tbl_summary(include = c(Fetal_position, Fetal_movement, Fetal_heartbeat_bpm))
highrisk_fetal_health_tbl
Characteristic            N = 666¹
Fetal_position
    Abnormal              4 (0.6%)
    Normal                662 (99%)
Fetal_movement
    Normal                666 (100%)
Fetal_heartbeat_bpm
    120                   121 (18%)
    125                   60 (9.0%)
    130                   243 (36%)
    140                   181 (27%)
    150                   61 (9.2%)
¹ n (%)
#Merge tables for a side-by-side comparison

tbl_merge(tbls = list(lowrisk_fetal_health_tbl, highrisk_fetal_health_tbl), tab_spanner = c("Low risk", "High risk"))
Characteristic            Low risk       High risk
                          N = 332¹       N = 666¹
Fetal_position
    Abnormal              2 (0.6%)       4 (0.6%)
    Normal                330 (99%)      662 (99%)
Fetal_movement
    Normal                332 (100%)     666 (100%)
Fetal_heartbeat_bpm
    120                   60 (18%)       121 (18%)
    125                   31 (9.3%)      60 (9.0%)
    130                   120 (36%)      243 (36%)
    140                   91 (27%)       181 (27%)
    150                   30 (9.0%)      61 (9.2%)
¹ n (%)

There does not seem to be a notable relationship between fetal health and pregnancy risk: the distributions of all three measures are nearly identical in the two groups.
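
The side-by-side comparison can also be produced in a single call by passing the grouping column to the `by` argument of `tbl_summary()`, instead of building two tables and merging them. The sketch below uses a made-up `toy` data frame purely to show the call shape; the counts in it are not from the dataset.

```r
library(gtsummary)

# Made-up toy data, used only to illustrate the call
toy <- data.frame(
  Pregnancy_risk = rep(c("No", "Yes"), each = 6),
  Fetal_position = rep(c("Normal", "Normal", "Abnormal"), times = 4)
)

# One table with a column per risk group
comparison_tbl <- tbl_summary(
  toy,
  by = Pregnancy_risk,
  include = c(Fetal_position)
)
comparison_tbl
```

With this layout, gtsummary's `add_p()` could be appended to request a p-value per row, although that may require additional packages to be installed.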

Next, I wanted to assess whether there is a relationship between anemia or jaundice in the mother and pregnancy risk.

#Use gtsummary to list the above listed factors in a table from high and low risk pregnancies

lowrisk_medicalconditions <- lowrisk_maternaldata %>% tbl_summary(include = c(Jaundice, Anemia))
highrisk_medicalconditions <- highrisk_maternaldata %>% tbl_summary(include = c(Jaundice, Anemia))

#Merge tables for a side-by-side comparison

tbl_merge (tbls = list(lowrisk_medicalconditions, highrisk_medicalconditions), tab_spanner = c("Low risk", "High risk"))
Characteristic            Low risk       High risk
                          N = 332¹       N = 666¹
Jaundice
    Medium                1 (0.3%)       3 (0.5%)
    Minimal               3 (0.9%)       5 (0.8%)
    None                  328 (99%)      658 (99%)
Anemia
    Medium                20 (6.0%)      41 (6.2%)
    Minimal               21 (6.3%)      41 (6.2%)
    None                  291 (88%)      584 (88%)
¹ n (%)

There does not seem to be a notable relationship between anemia or jaundice in the mother and pregnancy risk: the proportions in each category differ by less than one percentage point between the two groups.
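
The visual comparison of proportions could be backed by a formal test of independence. The sketch below applies a chi-squared test to the anemia counts from the merged table above. Choosing this test is an assumption made here, not part of the original analysis; for a sparse table such as jaundice, an exact test like `fisher.test()` may be more appropriate.

```r
# Anemia counts copied from the merged table (rows: anemia level, columns: risk group)
anemia_counts <- matrix(
  c(20, 21, 291,    # Low risk: Medium, Minimal, None
    41, 41, 584),   # High risk: Medium, Minimal, None
  nrow = 3,
  dimnames = list(
    Anemia = c("Medium", "Minimal", "None"),
    Risk   = c("Low", "High")
  )
)

# Chi-squared test of independence between anemia level and risk group
anemia_test <- chisq.test(anemia_counts)
anemia_test$expected
anemia_test$p.value
```

A large p-value here would be consistent with the conclusion above that anemia level and risk category are not associated in this dataset.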

Assessing stage of pregnancy and pregnancy risk

I wanted to assess whether women in later stages of pregnancy (i.e., 3rd trimester vs. 2nd trimester) would have a higher incidence of high risk pregnancies.

#Use gtsummary to summarize trimester stage in a table from high and low risk pregnancies

lowrisk_trimester <- lowrisk_maternaldata %>% tbl_summary(include = c(Trimester))
highrisk_trimester <- highrisk_maternaldata %>% tbl_summary(include = c(Trimester))

#Merge tables for a side-by-side comparison

tbl_merge (tbls = list(lowrisk_trimester, highrisk_trimester), tab_spanner = c("Low risk", "High risk"))
Characteristic            Low risk       High risk
                          N = 332¹       N = 666¹
Trimester
    Second                131 (39%)      265 (40%)
    Third                 201 (61%)      401 (60%)
¹ n (%)

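There is no apparent relationship between stage of pregnancy and having a high or low risk pregnancy: about 60% of both groups are in the third trimester. No first-trimester pregnancies appear in the data.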

Assessing relationship between age of mother and pregnancy risk

#Use ggplot to make boxplots to visualize age in high and low risk groups

ggplot(maternaldata, aes(x = Pregnancy_risk, y = Age)) + geom_boxplot(fill = "lightblue") + ylab("Age (years)") + xlab("Pregnancy Risk")

There is no apparent relationship between age of mother and having a high or low risk pregnancy.

Assessing mean BMI of pregnancy risk groups and visualizing means

I wanted to see if there’s a relationship between mean BMI and pregnancy risk.

#Calculating mean BMI from low risk and high risk pregnancies
lowrisk_meanBMI <- mean(lowrisk_maternaldata$BMI)
highrisk_meanBMI <- mean(highrisk_maternaldata$BMI)

lowrisk_meanBMI
## [1] 22.95498
highrisk_meanBMI
## [1] 21.99694

Visualization of mean BMI and pregnancy risk

I then wanted to visualize the means calculated above in bar graphs.

#Use ggplot to create bar graphs of average BMI in low risk and high risk pregnancy mothers

ggplot(maternaldata, aes(x = Pregnancy_risk, y = BMI)) + stat_summary(fun = mean, geom = "bar", fill = "lightblue") + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .5) + ylab("Mean BMI") + xlab("Pregnancy Risk")

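Mothers identified with high risk pregnancies have, on average, a lower BMI (about 22.0) than those with low risk pregnancies (about 23.0). This is a difference of roughly one BMI unit. No significance test was run, so the comparison is descriptive only.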
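
One way to check whether a difference in mean BMI of this size is larger than chance variation is a two-sample t-test. The sketch below runs Welch's t-test on simulated data. The group sizes, the standard deviation of 3, and the random values are assumptions made purely for illustration, so its output says nothing about the real dataset.

```r
set.seed(1)

# Simulated BMI values; only the group means (23 and 22) echo the analysis above
toy <- data.frame(
  Pregnancy_risk = rep(c("No", "Yes"), times = c(30, 60)),
  BMI = c(rnorm(30, mean = 23, sd = 3), rnorm(60, mean = 22, sd = 3))
)

# Welch two-sample t-test (does not assume equal variances)
bmi_test <- t.test(BMI ~ Pregnancy_risk, data = toy)
bmi_test$estimate   # group means
bmi_test$conf.int   # confidence interval for the difference in means
bmi_test$p.value
```

On the real data, the equivalent call would be `t.test(BMI ~ Pregnancy_risk, data = maternaldata)`.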

Discussion and Future Directions

Discussion

The median mother in this study was around 22 years of age, with a BMI of about 22 and a blood pressure of about 100/60.

From my analysis, I did not see a relationship between low vs. high pregnancy risk and any of the following:

- Fetal health
- Maternal jaundice or anemia
- Stage of pregnancy (2nd vs. 3rd trimester)
- Age of mother

I did see a relationship between average BMI and pregnancy risk: mothers with high risk pregnancies had, on average, lower BMIs (mean of about 22.0 vs. 23.0).

Future Directions

References

- Chayan, Ankur Ray (2024), “Maternal Health and High-Risk Pregnancy Dataset”, Mendeley Data, V1, doi: 10.17632/8k9pvpmykk.1