Data 606 Final Project

Chafiaa Nadour

2023-12-05

Introduction:

High blood pressure during pregnancy can put the mother and her baby at risk; it can cause permanent organ damage, strokes, and low birth weight. I want to examine whether there is a relationship between blood pressure and three factors: age, BMI, and heart rate. I’m focusing on these factors because many women who prioritize their career and education become pregnant in their 30s, and because weight and heart rate are known to increase during pregnancy.

Data:

To answer my question, I used the pregnancy risk factor data set that I found on Kaggle (https://www.kaggle.com/datasets/mmhossain/pregnancy-risk-factor-data). The data set has 6103 observations and 11 attributes. The cases are pregnant women of different ages who were monitored for their BMI, blood glucose, blood pressure (systolic and diastolic), heart rate, body temperature, etc. To start, I created a new data set with all 6103 observations but only 4 attributes (diastolic blood pressure, age, heart rate, and BMI), which makes plotting easier.

data=read.csv("C:/Users/Chafiaa/OneDrive/Documents/pregnancy risk.csv")
head(data)
##   Patient.ID      Name Age Body.Temperature.F. Heart.rate.bpm.
## 1    1994601    Moulya  20                97.5              91
## 2    2001562      Soni  45                97.7              99
## 3    2002530  Baishali  29                98.6              84
## 4    2002114 Abhilasha  26                99.5             135
## 5    2002058    Aanaya  38               102.5              51
## 6    1993812     Navni  21                98.6              85
##   Systolic.Blood.Pressure.mm.Hg. Diastolic.Blood.Pressure.mm.Hg. BMI.kg.m.2.
## 1                            161                             100        24.9
## 2                             99                              94        22.1
## 3                            129                              87        19.0
## 4                            161                             101        23.7
## 5                            106                              91        18.8
## 6                            142                              89        22.0
##   Blood.Glucose.HbA1c. Blood.Glucose.Fasting.hour.mg.dl.   Outcome
## 1                   41                               5.8 high risk
## 2                   36                               5.7 high risk
## 3                   42                               6.4  mid risk
## 4                   46                               4.5 high risk
## 5                   38                               4.3 high risk
## 6                   30                               5.6  mid risk

sum(is.na(data))
## [1] 0
dim(data)
## [1] 6103   11
names(data)
##  [1] "Patient.ID"                        "Name"                             
##  [3] "Age"                               "Body.Temperature.F."              
##  [5] "Heart.rate.bpm."                   "Systolic.Blood.Pressure.mm.Hg."   
##  [7] "Diastolic.Blood.Pressure.mm.Hg."   "BMI.kg.m.2."                      
##  [9] "Blood.Glucose.HbA1c."              "Blood.Glucose.Fasting.hour.mg.dl."
## [11] "Outcome"
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data1=data %>%
  select(Diastolic.Blood.Pressure.mm.Hg., Age, Heart.rate.bpm., BMI.kg.m.2.)
head(data1)
##   Diastolic.Blood.Pressure.mm.Hg. Age Heart.rate.bpm. BMI.kg.m.2.
## 1                             100  20              91        24.9
## 2                              94  45              99        22.1
## 3                              87  29              84        19.0
## 4                             101  26             135        23.7
## 5                              91  38              51        18.8
## 6                              89  21              85        22.0

Exploratory Data Analysis:

To conduct my analysis, I chose one dependent variable, y = diastolic BP, and three independent variables: x1 = Age, x2 = HR (heart rate), and x3 = BMI.

y=data1$Diastolic.Blood.Pressure.mm.Hg.
head(y)
## [1] 100  94  87 101  91  89
is.numeric(y)#numerical 
## [1] TRUE
x1=data1$Age
head(x1)
## [1] 20 45 29 26 38 21
is.numeric(x1)#numerical 
## [1] TRUE
x2=data1$Heart.rate.bpm.
head(x2)
## [1]  91  99  84 135  51  85
is.numeric(x2)#numerical 
## [1] TRUE
x3=data1$BMI.kg.m.2.
head(x3)
## [1] 24.9 22.1 19.0 23.7 18.8 22.0
is.numeric(x3)#numerical
## [1] TRUE

Summary:

summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   82.00   87.00   87.26   92.00  142.00
summary(x1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   22.00   25.00   26.43   30.00  250.00
summary(x2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    45.0    72.0    80.0    86.1    91.0   150.0
summary(x3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.90   19.60   21.30   21.44   23.10   27.90
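The summaries show some values that are not physiologically plausible, such as a maximum age of 250 and a minimum diastolic BP of 9. A minimal sketch of how such rows could be flagged before modelling follows. The cut-offs (age 10 to 60, diastolic BP 40 to 150) and the toy data frame are assumptions chosen for illustration; they are not part of the original analysis.

```r
# Toy data frame using two column names from data1. The values are made up,
# except that 250 and 9 mimic the implausible extremes seen in the summaries.
toy <- data.frame(
  Diastolic.Blood.Pressure.mm.Hg. = c(100, 94, 9, 101, 91),
  Age = c(20, 45, 29, 250, 38)
)

# Assumed plausibility ranges (illustrative only, not clinical guidance).
plausible <- toy$Age >= 10 & toy$Age <= 60 &
  toy$Diastolic.Blood.Pressure.mm.Hg. >= 40 &
  toy$Diastolic.Blood.Pressure.mm.Hg. <= 150

toy_clean <- toy[plausible, ]   # keep only rows inside both ranges
n_removed <- sum(!plausible)    # number of rows flagged as implausible

print(n_removed)
print(toy_clean)
```

The same logical filter could be applied to data1 before fitting the models, which would show how much the extreme rows influence the results.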

Plots:

To get a visual idea about the relationship between the variables, I used scatter plots and pair plots.

library("ggplot2")                     
library("GGally")
## Warning: package 'GGally' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(data1)

pairs(data1,
      col = "red",                                        
      pch = 18,                                            
      labels = c("y", "x1", "x2", "x3"),                  
      main = "Pair plot of all variables") 


plot(y ~ x1, data = data1, xlab = "Age",
 ylab = "BP", main = "BP by Age" , col = c("red","green"))

plot(y ~ x2, data = data1, xlab = "HR",
 ylab = "BP", main = "BP by HR" , col = c("red","green"))

plot(y ~ x3, data = data1, xlab = "BMI",
 ylab = "BP", main = "BP by BMI" , col = c("red","green"))

From the plots, it appears that a pregnant woman’s diastolic blood pressure has a weak positive linear relationship with her age, HR, and BMI.

Linear Regression:

To better understand the relationship between the variables, I used simple linear regression. I first fitted each model to get its equation, then ran the summary.

M1=lm(y~x1)
M1
## 
## Call:
## lm(formula = y ~ x1)
## 
## Coefficients:
## (Intercept)           x1  
##     82.2440       0.1897

y = 82.2440 + 0.1897 x1. Example: for Age = 41, the predicted BP is about 90, which is high.

M2=lm(y~x2)
M2
## 
## Call:
## lm(formula = y ~ x2)
## 
## Coefficients:
## (Intercept)           x2  
##    78.89607      0.09711

y = 78.89607 + 0.09711 x2. Example: for HR = 110, the predicted BP is about 90.

M3=lm(y~x3)
M3
## 
## Call:
## lm(formula = y ~ x3)
## 
## Coefficients:
## (Intercept)           x3  
##     73.1042       0.6603

y = 73.1042 + 0.6603 x3. Example: for BMI = 25, the predicted BP is about 90.
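The three worked examples can be reproduced by plugging the example values into the fitted equations. The sketch below hard-codes the coefficients printed by M1, M2, and M3; the helper name predict_bp is hypothetical and used only for illustration.

```r
# Hypothetical helper: evaluate a fitted line y = intercept + slope * x.
predict_bp <- function(intercept, slope, x) intercept + slope * x

bp_age <- predict_bp(82.2440, 0.1897, 41)     # M1 coefficients, Age = 41
bp_hr  <- predict_bp(78.89607, 0.09711, 110)  # M2 coefficients, HR = 110
bp_bmi <- predict_bp(73.1042, 0.6603, 25)     # M3 coefficients, BMI = 25

print(round(c(age = bp_age, hr = bp_hr, bmi = bp_bmi), 1))
```

With the fitted objects in memory, predict(M1, newdata = data.frame(x1 = 41)) should give the same value for the first example, up to rounding of the printed coefficients.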

Summary:

summary(M1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.418  -4.987  -0.608   4.495  50.649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 82.24404    0.41930   196.1   <2e-16 ***
## x1           0.18973    0.01542    12.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.699 on 6101 degrees of freedom
## Multiple R-squared:  0.0242, Adjusted R-squared:  0.02404 
## F-statistic: 151.3 on 1 and 6101 DF,  p-value: < 2.2e-16

summary(M2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.180  -4.888  -0.568   4.529  53.198 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 78.89607    0.37661  209.49   <2e-16 ***
## x2           0.09711    0.00423   22.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.477 on 6101 degrees of freedom
## Multiple R-squared:  0.07951,    Adjusted R-squared:  0.07936 
## F-statistic:   527 on 1 and 6101 DF,  p-value: < 2.2e-16

summary(M3)
## 
## Call:
## lm(formula = y ~ x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -81.535  -4.970  -0.828   4.888  52.059 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 73.10423    0.97970   74.62   <2e-16 ***
## x3           0.66027    0.04547   14.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.662 on 6101 degrees of freedom
## Multiple R-squared:  0.0334, Adjusted R-squared:  0.03324 
## F-statistic: 210.8 on 1 and 6101 DF,  p-value: < 2.2e-16

From the summary results I see that the p-values for Age, HR, and BMI are all below 0.05, which makes each relationship with BP statistically significant. Comparing the R squared values, HR has the highest R^2, meaning HR explains more of the variation in BP than Age or BMI: R1sq = 0.0242, R2sq = 0.07951, R3sq = 0.0334.

From the hypothesis-testing side:

H0: There is no relationship between BP and Age, HR, or BMI (null hypothesis). H1: There is a relationship between BP and Age, HR, or BMI (alternative hypothesis).

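Because all three p-values are far below 0.05, we reject H0 for each predictor: the relationships are statistically significant. However, R squared is close to 0 in every model, which means each predictor explains only a small share of the variation in BP (about 8% at most), so these models are not good for prediction.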
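The two quantities used in this section, the p-value of the slope and R squared, answer different questions and can be read directly from a fitted model. The sketch below uses simulated data (not the Kaggle data) with a weak signal and a large sample; all numbers in it are assumptions chosen for illustration.

```r
set.seed(1)

# Simulated data (NOT the Kaggle data): a weak linear signal plus a lot of noise.
n <- 5000
x <- rnorm(n, mean = 26, sd = 5)
y <- 82 + 0.2 * x + rnorm(n, sd = 7.5)

fit <- lm(y ~ x)
s <- summary(fit)

p_slope <- coef(s)["x", "Pr(>|t|)"]  # p-value for H0: slope = 0
r2 <- s$r.squared                     # share of variance in y explained by x

print(p_slope)
print(r2)
```

In a large sample, a very small p-value and a small R squared can occur together: the slope is distinguishable from zero, yet the predictor accounts for little of the variation in the response.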

Correlation:

I also checked if there is any correlation between the variables.

cor(x1,y)
## [1] 0.1555725
cor(x2,y)
## [1] 0.2819719
cor(x3,y)
## [1] 0.1827578

We can tell that there is a weak positive correlation between BP and each of Age, HR, and BMI.
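For a simple linear regression with one predictor, R squared equals the square of the Pearson correlation, so the correlations can be cross-checked against the R squared values reported by summary(). A quick check using the correlation values printed by cor():

```r
# Correlations printed by cor(x1, y), cor(x2, y), and cor(x3, y).
r <- c(age = 0.1555725, hr = 0.2819719, bmi = 0.1827578)

# In simple linear regression, R squared is the squared correlation.
r_squared <- r^2

print(round(r_squared, 4))
```

The squared values agree with the Multiple R-squared figures for M1, M2, and M3 (0.0242, 0.07951, and 0.0334).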

Conclusion:

The data I worked with is not ideal because it contains some implausible values, such as a maximum age of 250. In general, though, the results are consistent with the scientific understanding that blood pressure tends to increase with age, weight, and heart rate, although the relationships in this data are weak.

References:

https://www.kaggle.com/datasets/mmhossain/pregnancy-risk-factor-data

www.google.com

Thank you all!