Irene Choi
10/7/2025
This dataset contains maternal health data from 998 pregnant people, collected at various health clinics in Bangladesh. It includes details such as the mother’s age, weight, and height and the fetus’s health status, as well as medical test results such as urine sugar levels and fetal heart rate. Each pregnancy is also categorized as high or low risk.
I will perform a basic analysis of different maternal health factors to define “standard” pregnancy physiology and to understand which medical tests and health metrics, if any, are more strongly associated with high-risk pregnancies than others.
Data was collected by Ankur Ray Chayan in 2024 and published on Mendeley Data: https://data.mendeley.com/datasets/8k9pvpmykk/1
#Install necessary packages, including readxl, httr, stringr, gtsummary, and Hmisc
#Importing necessary libraries
library(ggplot2)
library(readxl)
library(httr)
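library(dplyr)
library(tidyverse) #provides stringr (str_replace) and tidyr (separate_wider_delim)
library(gtsummary) #provides tbl_summary and tbl_merge
library(Hmisc) #needed by ggplot2's mean_cl_normal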
#Creating a temporary file with the correct .xlsx extension
temp <-tempfile(fileext = ".xlsx")
#Download data from Mendeley
GET("https://data.mendeley.com/public-files/datasets/8k9pvpmykk/files/1147acd1-10a7-486e-a34d-00f3bef999cc/file_downloaded", write_disk(temp, overwrite = TRUE))
## Response [https://prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com/5bb3fb83-1b51-4579-a388-eaec2457bb30]
## Date: 2025-10-07 18:34
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 96.6 kB
## <ON DISK> /var/folders/cb/kv2249f91c70wqh8lyh9rwg80000gn/T//RtmpVIWaJb/file56743fdb420.xlsx
#Read the downloaded workbook into a data frame and preview the first rows
maternaldata <- read_excel(temp)
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
head(maternaldata)
## # A tibble: 6 × 18
## `ANCC REGISTER` ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Name Age Gravida TiTi … গর্ভকাল… ওজন উচ্চতা রক্ত চা… রক্তস্বল্প… জন্ডিস গর্ভস্হ …
## 2 Rituporna 18 1st 1st 38 w… 50 kg 5.3'' 100/… None None Norm…
## 3 Moina 25 2nd 2nd 38 w… 60 kg 5.2'' 100/… None None Norm…
## 4 Rabeya 20 1st 1st 30 w… 55 kg 5.0'' 100/… None None Norm…
## 5 Shorna 22 1st 3rd 35 w… 51 kg 5.4'' 110/… None None Norm…
## 6 Tania Akter 20 1st 2nd 30 w… 53 kg 5.2'' 100/… None None Norm…
## # ℹ 7 more variables: ...12 <chr>, ...13 <chr>, ...14 <chr>, ...15 <chr>,
## # ...16 <chr>, ...17 <chr>, ...18 <chr>
The data had some phrases in Bengali and required translating and relabeling for clarity. It also required changing the labels of some columns and recoding to make the data easier to analyze.
#Set first row as column names, then remove the row that became the column names, and finally check for proper labeling
colnames(maternaldata) <- maternaldata[1,]
maternaldata <- maternaldata[-1,]
#Relabeling column titles with English translations
maternaldata <- rename(maternaldata, "Tetanus_Vaccination" = `TiTi Tika`, "Pregnancy_time" = `গর্ভকাল`, "Weight" = `ওজন`, "Height" = `উচ্চতা`, "Blood_pressure" = `রক্ত চাপ`, "Anemia" = `রক্তস্বল্পতা`, "Jaundice" = `জন্ডিস`, "Fetal_position" = `গর্ভস্হ শিশু অবস্থান`, "Fetal_movement" = `গর্ভস্হ শিশু নাড়াচাড়া`, "Fetal_heartbeat" = `গর্ভস্হ শিশু হৃৎস্পন্দন`, "Urine_albumin" = `প্রসাব পরিক্ষা এলবুমিন`, "Urine_sugar" = `প্রসাব পরিক্ষা সুগার`, "HBsAG" = HRsAG, "Pregnancy_risk" = `ঝুকিপূর্ণ গর্ভ`)
#Replace names in "Name" column with sequential ID numbers to help make patients anonymous and for easier referencing
maternaldata$Name <- seq_len(nrow(maternaldata))
#Recode columns Urine_albumin, Urine_sugar, VDRL, HBsAG, Gravida, Tetanus_Vaccination, and Height such that None=0, Minimal=1, Medium=2, and Higher=3 (for Urine_albumin); No=0 and Yes=1 (for Urine_sugar); Negative=0 and Positive=1 (for VDRL and HBsAG); 1st=1, 2nd=2, 3rd=3 (for Gravida and Tetanus_Vaccination); 5.1''=154.94, 5.2''=157.48 (for Height, converted from feet.inches to centimeters and stored in the new Height_cm column)
maternaldata$Urine_albumin <- recode_factor(maternaldata$Urine_albumin, "None"=0, "Minimal"=1, "Medium"=2, "Higher"=3)
maternaldata$Urine_sugar <- recode_factor(maternaldata$Urine_sugar, "No"=0, "Yes"=1)
maternaldata$VDRL <- recode_factor(maternaldata$VDRL, "Negative"=0, "Positive"=1)
maternaldata$HBsAG <- recode_factor(maternaldata$HBsAG, "Negative"=0, "Positive"=1)
maternaldata$Gravida <- recode_factor(maternaldata$Gravida, "1st"=1, "2nd"=2, "3rd"=3)
maternaldata$Tetanus_Vaccination <- recode_factor(maternaldata$Tetanus_Vaccination, "1st"=1, "2nd"=2, "3rd"=3)
maternaldata$Height_cm <- recode_factor(maternaldata$Height, "5.0''"=152.4, "5.1''"=154.94, "5.2''"=157.48, "5.3''"=160.02,"5.4''"=162.56, "5.5''"=165.1, "5.6''"=167.64)
#Remove unnecessary words from columns (ex. "week" from Pregnancy_time) and change column title to reflect the unit (ex. Pregnancy_time -> Pregnancy_time_wk)
maternaldata$Pregnancy_time <- str_replace(maternaldata$Pregnancy_time, "week", "")
colnames(maternaldata)[5] <- "Pregnancy_time_wk"
maternaldata$Weight <- str_replace(maternaldata$Weight, "kg", "")
colnames(maternaldata)[6] <- "Weight_kg"
maternaldata$Fetal_heartbeat <- str_replace(maternaldata$Fetal_heartbeat, "m", "")
colnames(maternaldata)[13] <- "Fetal_heartbeat_bpm"
#Convert Weight_kg and Height_cm columns to numeric, calculate BMI (kg/m^2), then remove Weight_kg, Height, and Height_cm columns
maternaldata$Weight_kg <- as.numeric(as.character(maternaldata$Weight_kg))
maternaldata$Height_cm <- as.numeric(as.character(maternaldata$Height_cm))
maternaldata$BMI <- (maternaldata$Weight_kg/maternaldata$Height_cm/maternaldata$Height_cm)*10000
maternaldata <- maternaldata[,-c(6,7,19)]
#Separate Blood_pressure column into 2 columns, systolic and diastolic
maternaldata <- maternaldata %>% separate_wider_delim(Blood_pressure, delim = "/", names = c("Systolic", "Diastolic"))
#Change appropriate columns into numeric
maternaldata <- maternaldata %>%
  mutate(across(c(Age, Gravida, Pregnancy_time_wk, BMI, Systolic, Diastolic, Tetanus_Vaccination, Fetal_heartbeat_bpm, Urine_albumin, Urine_sugar, VDRL, HBsAG), ~ as.numeric(as.character(.))))
#Make column factoring pregnancy time (in weeks) into trimester
maternaldata$Trimester <- ifelse(maternaldata$Pregnancy_time_wk <= 13, "First",
ifelse(maternaldata$Pregnancy_time_wk <= 27, "Second", "Third"))
#Reorder columns
maternaldata <- maternaldata[,c(1:3, 5, 18, 6:9, 4, 10:17, 19)]
head(maternaldata)
## # A tibble: 6 × 19
## Name Age Gravida Pregnancy_time_wk BMI Systolic Diastolic Anemia Jaundice
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 18 1 38 19.5 100 60 None None
## 2 2 25 2 38 24.2 100 70 None None
## 3 3 20 1 30 23.7 100 60 None None
## 4 4 22 1 35 19.3 110 65 None None
## 5 5 20 1 30 21.4 100 55 None None
## 6 6 22 1 30 25.4 100 65 None None
## # ℹ 10 more variables: Tetanus_Vaccination <dbl>, Fetal_position <chr>,
## # Fetal_movement <chr>, Fetal_heartbeat_bpm <dbl>, Urine_albumin <dbl>,
## # Urine_sugar <dbl>, VDRL <dbl>, HBsAG <dbl>, Pregnancy_risk <chr>,
## # Trimester <chr>
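The height recode, BMI formula, and trimester binning above can also be written as small reusable functions. The sketch below is an illustration rather than the code used in this report: `height_to_cm()`, `bmi()`, and `trimester()` are hypothetical helper names, and it assumes heights are recorded as feet.inches strings such as `5.3''` (5 ft 3 in), as they appear in the raw data.

```r
# Hypothetical helpers (not part of the original analysis).

# Convert a feet.inches string such as "5.3''" to centimeters.
# Assumes the digits before the dot are feet and the digits after it are inches.
height_to_cm <- function(h) {
  cleaned <- gsub("'", "", h, fixed = TRUE)
  parts <- strsplit(cleaned, ".", fixed = TRUE)
  vapply(parts, function(p) {
    (as.numeric(p[1]) * 12 + as.numeric(p[2])) * 2.54
  }, numeric(1))
}

# BMI = weight (kg) / height (m)^2, with height supplied in centimeters.
bmi <- function(weight_kg, height_cm) {
  weight_kg / (height_cm / 100)^2
}

# Bin gestational age in weeks with the same cut-offs as the ifelse() above:
# up to 13 weeks = First, 14 to 27 = Second, 28 or more = Third.
trimester <- function(weeks) {
  cut(weeks, breaks = c(-Inf, 13, 27, Inf),
      labels = c("First", "Second", "Third"))
}

heights <- height_to_cm(c("5.0''", "5.3''"))
heights
bmi(50, heights[2])        # patient 1 in the raw data: 50 kg, 5.3''
trimester(c(12, 20, 38))
```

Parsing the height string avoids listing every height by hand, so an unexpected value such as `5.7''` would still be converted instead of being left as a non-numeric label.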
I wanted to check whether there were any NAs and, if so, where they occur.
#Create a new data frame with NAs.
maternaldata_na <- maternaldata %>% filter(if_any(everything(), ~ is.na(.)))
#Summarizing the maternaldata_na dataframe.
summary(maternaldata_na)
## Name Age Gravida Pregnancy_time_wk BMI
## Min. : NA Min. : NA Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA Max. : NA
## Systolic Diastolic Anemia Jaundice
## Min. : NA Min. : NA Length:0 Length:0
## 1st Qu.: NA 1st Qu.: NA Class :character Class :character
## Median : NA Median : NA Mode :character Mode :character
## Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA
## Tetanus_Vaccination Fetal_position Fetal_movement Fetal_heartbeat_bpm
## Min. : NA Length:0 Length:0 Min. : NA
## 1st Qu.: NA Class :character Class :character 1st Qu.: NA
## Median : NA Mode :character Mode :character Median : NA
## Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA
## Urine_albumin Urine_sugar VDRL HBsAG Pregnancy_risk
## Min. : NA Min. : NA Min. : NA Min. : NA Length:0
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA Class :character
## Median : NA Median : NA Median : NA Median : NA Mode :character
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA
## Trimester
## Length:0
## Class :character
## Mode :character
##
##
##
The filtered data frame has zero rows, so there are no NA values and I don’t need to decide whether to include or omit any.
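A summary of an empty data frame is hard to read, so counting missing values directly may be a simpler check. The sketch below uses a small toy data frame with made-up values (not taken from the dataset) so that it runs on its own; on the real data the same calls would be applied to `maternaldata`.

```r
# Toy data frame with hypothetical values; two entries are deliberately missing.
toy <- data.frame(
  Age = c(18, 25, NA, 22),
  BMI = c(19.5, NA, 23.7, 19.3),
  Pregnancy_risk = c("Yes", "No", "Yes", "No")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy))
rows_with_na
```

A result of all zeros from `colSums(is.na(maternaldata))` would confirm the conclusion above in a single line of output.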
#Create histograms to observe skew in the following columns: Age, Gravida, Pregnancy_time_wk, BMI, Systolic, Diastolic, and Fetal_heartbeat_bpm
age_histo <- hist(maternaldata$Age, xlab = "Age (years)", ylab = "Frequency")
gravida_histo <- hist(maternaldata$Gravida, breaks = seq(0.5, max(maternaldata$Gravida) + 0.5, by = 1), xaxt = "n", xlab = "Gravida (number of pregnancies)", ylab = "Frequency")
axis(1, at = seq(0, max(maternaldata$Gravida), by = 1))
pregnancy_time_histo <- hist(maternaldata$Pregnancy_time_wk, xlab = "Pregnancy time (weeks)", ylab = "Frequency")
systolic_histo <- hist(maternaldata$Systolic, breaks = seq(70, max(maternaldata$Systolic) + 10, by = 10), xaxt = "n", xlab = "Systolic blood pressure", ylab = "Frequency")
axis(1, at = seq(70, max(maternaldata$Systolic), by = 10))
diastolic_histo <- hist(maternaldata$Diastolic, breaks = seq(50, max(maternaldata$Diastolic) + 5, by = 5), xaxt = "n", xlab = "Diastolic blood pressure", ylab = "Frequency")
axis(1, at = seq(50, max(maternaldata$Diastolic), by = 5))
fetal_bpm_histo <- hist(maternaldata$Fetal_heartbeat_bpm, breaks = seq(120, max(maternaldata$Fetal_heartbeat_bpm) + 5, by = 5), xaxt = "n", xlab = "Fetal heartbeat (bpm)", ylab = "Frequency")
axis(1, at = seq(120, max(maternaldata$Fetal_heartbeat_bpm), by = 5))
#Create new dataframes based on pregnancy risk
highrisk_maternaldata <- maternaldata[maternaldata$Pregnancy_risk == "Yes", ]
lowrisk_maternaldata <- maternaldata[maternaldata$Pregnancy_risk == "No", ]
I want to summarize the age, BMI, and systolic and diastolic blood pressure for all patients included in the study through boxplots.
#Make boxplots for each of the above listed factors
age_boxplot <- boxplot(maternaldata$Age, main = "Boxplot of Maternal Age", ylab = "Age (years)")
The median age during pregnancy is 22, and the distribution is roughly normal (as the histogram above also indicated). Notably, several patients fall at the older end of the range, producing a longer upper tail. The median BMI is also about 22, which falls within the healthy range (18.5 to 24.9). The median systolic blood pressure is about 100 and the median diastolic is about 60.
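Only the age boxplot call is shown, so here is a minimal sketch of how all four measures could be summarized in one loop. To stay self-contained it uses just the first six rows printed by `head(maternaldata)` earlier in this report, so its medians describe those six rows, not the full dataset; `first_rows` and `measures` are names introduced here for illustration.

```r
# First six rows of the cleaned data, copied from the head() output above.
first_rows <- data.frame(
  Age = c(18, 25, 20, 22, 20, 22),
  BMI = c(19.5, 24.2, 23.7, 19.3, 21.4, 25.4),
  Systolic = c(100, 100, 100, 110, 100, 100),
  Diastolic = c(60, 70, 60, 65, 55, 65)
)

# Column name -> axis label
measures <- c(Age = "Age (years)", BMI = "BMI",
              Systolic = "Systolic blood pressure",
              Diastolic = "Diastolic blood pressure")

# Median of each measure
medians <- vapply(first_rows[names(measures)], median, numeric(1))
medians

# One boxplot per measure, arranged in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
for (col in names(measures)) {
  boxplot(first_rows[[col]], main = col, ylab = measures[[col]])
}
par(old_par)
```

Replacing `first_rows` with `maternaldata` would apply the same loop to all 998 patients.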
First, I wanted to see whether there was a relationship between fetal health (Fetal_position, Fetal_movement, Fetal_heartbeat_bpm) and pregnancy risk.
#Using gtsummary to list the above factors in a table, first for low risk and then for high risk pregnancies
lowrisk_fetal_health_tbl <- lowrisk_maternaldata %>% tbl_summary(include = c(Fetal_position, Fetal_movement, Fetal_heartbeat_bpm))
lowrisk_fetal_health_tbl
| Characteristic | N = 332¹ |
|---|---|
| Fetal_position | |
| Abnormal | 2 (0.6%) |
| Normal | 330 (99%) |
| Fetal_movement | |
| Normal | 332 (100%) |
| Fetal_heartbeat_bpm | |
| 120 | 60 (18%) |
| 125 | 31 (9.3%) |
| 130 | 120 (36%) |
| 140 | 91 (27%) |
| 150 | 30 (9.0%) |
¹ n (%)
highrisk_fetal_health_tbl <- highrisk_maternaldata %>% tbl_summary(include = c(Fetal_position, Fetal_movement, Fetal_heartbeat_bpm))
highrisk_fetal_health_tbl
| Characteristic | N = 666¹ |
|---|---|
| Fetal_position | |
| Abnormal | 4 (0.6%) |
| Normal | 662 (99%) |
| Fetal_movement | |
| Normal | 666 (100%) |
| Fetal_heartbeat_bpm | |
| 120 | 121 (18%) |
| 125 | 60 (9.0%) |
| 130 | 243 (36%) |
| 140 | 181 (27%) |
| 150 | 61 (9.2%) |
¹ n (%)
#Merge tables for a side-by-side comparison
tbl_merge(tbls = list(lowrisk_fetal_health_tbl, highrisk_fetal_health_tbl), tab_spanner = c("Low risk", "High risk"))
| Characteristic | Low risk, N = 332¹ | High risk, N = 666¹ |
|---|---|---|
| Fetal_position | | |
| Abnormal | 2 (0.6%) | 4 (0.6%) |
| Normal | 330 (99%) | 662 (99%) |
| Fetal_movement | | |
| Normal | 332 (100%) | 666 (100%) |
| Fetal_heartbeat_bpm | | |
| 120 | 60 (18%) | 121 (18%) |
| 125 | 31 (9.3%) | 60 (9.0%) |
| 130 | 120 (36%) | 243 (36%) |
| 140 | 91 (27%) | 181 (27%) |
| 150 | 30 (9.0%) | 61 (9.2%) |
¹ n (%)
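There does not seem to be a notable relationship between fetal health and pregnancy risk: the distributions of fetal position, movement, and heartbeat are nearly identical in the two groups.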
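The split-then-merge approach above works, but gtsummary's `tbl_summary()` also accepts a `by` argument that produces one column per group directly. The sketch below is a possible alternative, not the method used in this report; it runs on a toy data frame with made-up values whose column names mirror the cleaned data.

```r
library(gtsummary)

# Toy data with hypothetical values; column names follow the cleaned dataset.
toy <- data.frame(
  Pregnancy_risk = c("No", "No", "No", "Yes", "Yes", "Yes"),
  Fetal_position = c("Normal", "Normal", "Abnormal", "Normal", "Normal", "Normal"),
  Fetal_heartbeat_bpm = c(120, 130, 140, 130, 140, 150)
)

# One table with a column per risk group, without splitting the data first
fetal_tbl <- tbl_summary(
  toy,
  by = Pregnancy_risk,
  include = c(Fetal_position, Fetal_heartbeat_bpm)
)
fetal_tbl
```

gtsummary also provides `add_p()` for attaching test p-values to a grouped table; depending on the installed version it may need additional packages, so it is left out of this sketch.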
Next, I wanted to assess whether there is a relationship between anemia or jaundice in the mother and pregnancy risk.
#Use gtsummary to list the above listed factors in a table from high and low risk pregnancies
lowrisk_medicalconditions <- lowrisk_maternaldata %>% tbl_summary(include = c(Jaundice, Anemia))
highrisk_medicalconditions <- highrisk_maternaldata %>% tbl_summary(include = c(Jaundice, Anemia))
#Merge tables for a side-by-side comparison
tbl_merge(tbls = list(lowrisk_medicalconditions, highrisk_medicalconditions), tab_spanner = c("Low risk", "High risk"))
| Characteristic | Low risk, N = 332¹ | High risk, N = 666¹ |
|---|---|---|
| Jaundice | | |
| Medium | 1 (0.3%) | 3 (0.5%) |
| Minimal | 3 (0.9%) | 5 (0.8%) |
| None | 328 (99%) | 658 (99%) |
| Anemia | | |
| Medium | 20 (6.0%) | 41 (6.2%) |
| Minimal | 21 (6.3%) | 41 (6.2%) |
| None | 291 (88%) | 584 (88%) |
¹ n (%)
There does not seem to be a notable relationship between anemia or jaundice in the mother and pregnancy risk.
I wanted to assess whether women in later stages of pregnancy (i.e., 3rd trimester vs. 2nd trimester) have a higher incidence of high-risk pregnancies.
#Use gtsummary to summarize trimester stage in a table from high and low risk pregnancies
lowrisk_trimester <- lowrisk_maternaldata %>% tbl_summary(include = c(Trimester))
highrisk_trimester <- highrisk_maternaldata %>% tbl_summary(include = c(Trimester))
#Merge tables for a side-by-side comparison
tbl_merge(tbls = list(lowrisk_trimester, highrisk_trimester), tab_spanner = c("Low risk", "High risk"))
| Characteristic | Low risk, N = 332¹ | High risk, N = 666¹ |
|---|---|---|
| Trimester | | |
| Second | 131 (39%) | 265 (40%) |
| Third | 201 (61%) | 401 (60%) |
¹ n (%)
There is no apparent relationship between stage of pregnancy and having a high or low risk pregnancy.
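The comparisons so far rest on reading percentages side by side. One way to check them more formally is a chi-squared test of independence on the counts. The sketch below is a suggestion, not part of the original analysis: it re-enters the anemia and trimester counts from the merged tables above by hand and applies base R's `chisq.test()`.

```r
# Counts copied from the merged tables above
# (rows = category, columns = risk group).
anemia_counts <- matrix(
  c(20, 41,
    21, 41,
    291, 584),
  ncol = 2, byrow = TRUE,
  dimnames = list(Anemia = c("Medium", "Minimal", "None"),
                  Risk = c("Low", "High"))
)
trimester_counts <- matrix(
  c(131, 265,
    201, 401),
  ncol = 2, byrow = TRUE,
  dimnames = list(Trimester = c("Second", "Third"),
                  Risk = c("Low", "High"))
)

# Chi-squared tests of independence between each factor and risk group
anemia_test <- chisq.test(anemia_counts)
trimester_test <- chisq.test(trimester_counts)

anemia_test$p.value
trimester_test$p.value
```

A large p-value would be consistent with the conclusion that these factors are not associated with the risk label. For the sparse jaundice and fetal position counts, `fisher.test()` is likely the safer choice because several cells are very small.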
#Use ggplot to make boxplots to visualize age in high and low risk groups
ggplot(maternaldata, aes(x = Pregnancy_risk, y = Age)) + geom_boxplot(fill = "lightblue") + ylab("Age (years)") + xlab("Pregnancy Risk")
There is no apparent relationship between age of mother and having a high or low risk pregnancy.
I wanted to see whether there is a relationship between mean BMI and pregnancy risk.
#Calculating mean BMI from low risk and high risk pregnancies
lowrisk_meanBMI <- mean(lowrisk_maternaldata$BMI)
highrisk_meanBMI <- mean(highrisk_maternaldata$BMI)
lowrisk_meanBMI
## [1] 22.95498
highrisk_meanBMI
## [1] 21.99694
I then wanted to visualize the means calculated above in bar graphs.
#Use ggplot to create bar graphs of average BMI in low risk and high risk pregnancy mothers
ggplot(maternaldata, aes(x = Pregnancy_risk, y = BMI)) + stat_summary(fun = mean, geom = "bar", fill = "lightblue") + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .5) + ylab("Mean BMI") + xlab("Pregnancy Risk")
Mothers identified as having high-risk pregnancies have, on average, lower BMIs than those with low-risk pregnancies (mean of about 22.0 versus 23.0).
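The two means differ by about one BMI unit, and the bar chart shows confidence intervals, but the report does not test whether the gap exceeds sampling variation. A Welch two-sample t-test is one common option. The sketch below is a suggestion rather than the author's method, and it runs on a toy data frame with made-up BMI values, so its numbers say nothing about the real dataset.

```r
# Toy data with hypothetical BMI values; not taken from the dataset.
toy <- data.frame(
  Pregnancy_risk = rep(c("No", "Yes"), each = 5),
  BMI = c(23.1, 24.0, 22.5, 23.8, 22.9,
          21.7, 22.4, 21.9, 22.8, 21.2)
)

# Group means, analogous to lowrisk_meanBMI and highrisk_meanBMI
group_means <- tapply(toy$BMI, toy$Pregnancy_risk, mean)
group_means

# Welch two-sample t-test of BMI by risk group (unequal variances allowed)
bmi_test <- t.test(BMI ~ Pregnancy_risk, data = toy)
bmi_test$p.value
bmi_test$conf.int
```

On the real data the equivalent call would be `t.test(BMI ~ Pregnancy_risk, data = maternaldata)`.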
Discussion
The median mother in this study was around 22 years of age, with a BMI of about 22 and a blood pressure of about 100/60.
From my analysis, I did not see a relationship between the following and low vs. high pregnancy risk:
- Fetal health
- Maternal jaundice or anemia
- Stage of pregnancy (2nd vs. 3rd trimester)
- Age of mother
I did see a relationship between average BMI and pregnancy risk: mothers with high-risk pregnancies had, on average, lower BMIs.
Future Directions
- Chayan, Ankur Ray (2024), “Maternal Health and High-Risk Pregnancy Dataset”, Mendeley Data, V1, doi: 10.17632/8k9pvpmykk.1