Analysis

Transform the dataset

summary(data)

##        id            sex                 age            height     
##  Min.   :  2.0   Length:219         Min.   :21.00   Min.   :145.0  
##  1st Qu.: 85.5   Class :character   1st Qu.:48.00   1st Qu.:155.0  
##  Median :152.0   Mode  :character   Median :58.00   Median :160.0  
##  Mean   :156.6                      Mean   :57.17   Mean   :161.2  
##  3rd Qu.:214.5                      3rd Qu.:67.50   3rd Qu.:167.0  
##  Max.   :419.0                      Max.   :86.00   Max.   :196.0  
##      weight          systolic       diastolic          h_rate      
##  Min.   : 36.00   Min.   : 80.0   Min.   : 42.00   Min.   : 52.00  
##  1st Qu.: 52.50   1st Qu.:113.5   1st Qu.: 64.00   1st Qu.: 66.00  
##  Median : 60.00   Median :126.0   Median : 70.00   Median : 73.00  
##  Mean   : 60.19   Mean   :127.9   Mean   : 71.85   Mean   : 73.64  
##  3rd Qu.: 66.50   3rd Qu.:139.0   3rd Qu.: 78.00   3rd Qu.: 80.00  
##  Max.   :103.00   Max.   :182.0   Max.   :107.00   Max.   :106.00  
##       bmi        hypertension         diabetes         cerebral_infarction
##  Min.   :14.69   Length:219         Length:219         Length:219         
##  1st Qu.:20.55   Class :character   Class :character   Class :character   
##  Median :22.60   Mode  :character   Mode  :character   Mode  :character   
##  Mean   :23.11                                                            
##  3rd Qu.:25.00                                                            
##  Max.   :37.46                                                            
##  cerebrovascular_disease
##  Length:219             
##  Class :character       
##  Mode  :character       
##                         
##                         
##

Following the pre-defined criteria, patients in the Stage 1 and Stage 2 hypertension groups are considered to have high blood pressure.

data$high_bp <- ifelse(data$hypertension == "Normal" | data$hypertension == "Prehypertension", "Normal", "High")

BMI will also be divided into three groups: values above 24.9 are considered high, values below 18.5 are considered low, and everything in between is considered normal.

data$bmi_c <- ifelse(data$bmi > 24.9, "High",
                    ifelse(data$bmi < 18.5, "Low", "Normal"))
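Because `bmi_c` is stored as plain text, plots and tables order its groups alphabetically (High, Low, Normal). The following is a minimal sketch of one way to fix the order with a factor. It uses a small made-up vector rather than the real dataset; the names `bmi_toy` and `bmi_c_toy` and the values are illustrative assumptions.

```r
# Made-up BMI values for illustration only (not taken from the dataset)
bmi_toy <- c(17.2, 18.5, 22.0, 24.9, 25.3, 31.0)

# Same thresholds as above: > 24.9 is "High", < 18.5 is "Low", otherwise "Normal"
bmi_c_toy <- ifelse(bmi_toy > 24.9, "High",
                    ifelse(bmi_toy < 18.5, "Low", "Normal"))

# Convert to a factor with an explicit level order: Low, Normal, High
bmi_c_toy <- factor(bmi_c_toy, levels = c("Low", "Normal", "High"))

table(bmi_c_toy)
```

With the levels set this way, bar charts and grouped summaries list the categories from low to high instead of alphabetically.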

Exploratory Data Analysis

In our dataset, several variables stand out as possible factors affecting blood pressure: sex, age, heart rate, BMI, diabetes, cerebral infarction and cerebrovascular disease. We will begin by checking the correlation between age, heart rate and BMI.

cor(data[c("age", "h_rate", "bmi")])

##                age      h_rate         bmi
## age     1.00000000 -0.08564041  0.01632591
## h_rate -0.08564041  1.00000000 -0.10552071
## bmi     0.01632591 -0.10552071  1.00000000

As we can see, there is almost no correlation between any pair of the three variables (all coefficients are below 0.11 in absolute value), so we can use all of them in our analysis.
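To judge whether the largest of these correlations (heart rate and BMI, about -0.106) is distinguishable from zero, one option is the usual t statistic for a Pearson correlation, r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below plugs in the values printed above and assumes that all 219 rows entered the calculation; the variable names are illustrative.

```r
r <- -0.10552071   # correlation between h_rate and bmi, from the output above
n <- 219           # number of patients, from the summary output

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(t = t_stat, p = p_value), 3)
```

Under these assumptions the t statistic is roughly -1.56 and the p-value is above 0.05, which is consistent with reading the three variables as essentially uncorrelated.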

Now we will confirm the distribution of the blood pressure data.

data %>% ggplot(aes(x = hypertension, fill = hypertension)) + geom_bar()

data %>% ggplot(aes(x = high_bp, fill = high_bp)) + geom_bar()

There are about three times as many patients with “Normal” blood pressure as with “High” blood pressure. We will keep this imbalance in mind as we analyze the dataset further.

Analysis With Variables

Gender

data %>% ggplot(aes(x = sex, fill = high_bp)) + geom_bar()

It seems that we have a similar number of male and female patients in our data, and the two groups appear to have different proportions of people with high blood pressure. We will dig deeper by defining a function to compute a 95% confidence interval.

ci95_high <- function(data, dv, value) {
  grp <- data[dv == value, ]             # Rows belonging to the chosen group
  is_high <- grp$high_bp == "High"       # TRUE where blood pressure is high
  m <- mean(is_high)                     # Proportion with high bp in the group
  se <- sd(is_high) / sqrt(nrow(grp))    # Standard error uses the group size
  c(m - 1.96 * se, m, m + 1.96 * se)     # Lower bound, proportion, upper bound
}

This function calculates the proportion of patients with “High” blood pressure within a given group. The output is that proportion together with the bounds of its 95% confidence interval (lower bound, proportion, upper bound). The standard error is divided by the square root of the group size rather than the size of the whole dataset, since the interval describes the group.

round(ci95_high(data, data$sex, "Female"), 3)

## [1] 0.157 0.235 0.313

round(ci95_high(data, data$sex, "Male"), 3)

## [1] 0.175 0.260 0.344

The two intervals overlap substantially. This means that the difference between the two genders is not statistically significant at the 95% level, so we cannot claim that gender affects blood pressure.
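Comparing two intervals by eye is only a rough check. A more direct option is a two-sample test of proportions with base R's `prop.test`. The counts below are not printed in this report: they are inferred from the reported proportions (27/115 is about 0.2348 for women, 27/104 is about 0.2596 for men, and 115 + 104 = 219), so treat them as an assumption.

```r
# Assumed counts, reconstructed from the proportions reported above
high_counts <- c(Female = 27, Male = 27)     # patients with "High" blood pressure
group_sizes <- c(Female = 115, Male = 104)   # patients in each group

# Two-sample test for equality of proportions (with continuity correction)
res <- prop.test(x = high_counts, n = group_sizes)

res$p.value    # p-value for the difference in proportions
res$conf.int   # 95% CI for (Female proportion - Male proportion)
```

If the confidence interval for the difference contains zero, the test agrees with the conclusion drawn from the overlapping intervals.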

Age

data %>% ggplot(aes(x = age, fill = high_bp)) + geom_histogram(binwidth = 5) +
  facet_grid(. ~ high_bp)

There is a pattern here: it seems that only patients older than 45 have high blood pressure. This is an important result that will help us reach our goal.
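A histogram suggests the pattern, and a table of proportions by age band would quantify it. The following is a minimal sketch using base R's `cut` and `tapply`. The `toy` data frame, its values, and the band edges are made up for illustration and are not results from the dataset.

```r
# Made-up rows for illustration only (not taken from the dataset)
toy <- data.frame(
  age     = c(25, 38, 44, 47, 52, 61, 66, 73),
  high_bp = c("Normal", "Normal", "Normal", "High",
              "Normal", "High", "Normal", "High")
)

# Split age into bands; right = FALSE makes each band include its lower edge
toy$age_band <- cut(toy$age,
                    breaks = c(-Inf, 45, 60, Inf),
                    labels = c("<45", "45-59", "60+"),
                    right = FALSE)

# Proportion of "High" blood pressure within each band
prop_high <- tapply(toy$high_bp == "High", toy$age_band, mean)
prop_high
```

Applied to the real data, the same two steps would show how sharply the proportion changes around age 45.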

BMI

To simplify the analysis, here we will use the BMI categories.

data %>% group_by(bmi_c) %>% summarise(perc = mean(high_bp == "High")) %>%
  ggplot(aes(x = bmi_c, y = perc, fill = bmi_c)) + geom_col()

The plot shows that patients with a high BMI have a markedly higher proportion of high blood pressure (43%) than those with a normal or low BMI.

Heart rate

A higher heart rate means the heart pumps blood more frequently. We will check whether it also comes with higher blood pressure by plotting heart rate against the systolic and diastolic values.

data %>% ggplot(aes(x = h_rate, y = systolic)) + geom_point()

data %>% ggplot(aes(x = h_rate, y = diastolic)) + geom_point()

The plots indicate that there is no visible correlation between heart rate and the blood pressure values. In this dataset, a higher heart rate is not associated with higher blood pressure.
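A scatter plot can be backed up with a number. The sketch below defines a small helper, `hr_bp_cor` (a hypothetical name, not part of the original analysis), that returns the Pearson correlation of heart rate with each blood pressure measure. It is demonstrated on made-up rows, so the printed values say nothing about the real dataset.

```r
# Correlation of heart rate with systolic and diastolic pressure
hr_bp_cor <- function(df) {
  sapply(df[c("systolic", "diastolic")],
         function(bp) cor(df$h_rate, bp))
}

# Made-up rows for illustration only (not taken from the dataset)
toy <- data.frame(
  h_rate    = c(60, 72, 85, 66, 90, 78),
  systolic  = c(130, 118, 125, 140, 121, 133),
  diastolic = c(80, 70, 76, 85, 72, 79)
)

toy_cor <- hr_bp_cor(toy)
toy_cor
```

On the real data this would be called as `hr_bp_cor(data)`; values close to zero would support what the scatter plots show.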

Diabetes

Several research papers have pointed out that diabetes leads to higher blood pressure. We shall see if the result holds true in this group of patients.

data %>% group_by(diabetes) %>% summarise(
  perc_high_bp = round(mean(high_bp == "High"),2),
  n = n())

## # A tibble: 3 × 3
##   diabetes          perc_high_bp     n
##   <chr>                    <dbl> <int>
## 1 ""                        0.25   181
## 2 "Diabetes"                1        1
## 3 "Type 2 Diabetes"         0.22    37

The results above show that, in this dataset, patients with diabetes do not have a higher proportion of high blood pressure than those without. Note that only one person is labelled “Diabetes” (type 1), which is why the percentage for that group is 100%. We can confirm the overall picture by grouping both diabetes labels together.

data %>% mutate(h_diabetes = ifelse(diabetes == "", "Normal", "Diabetes")) %>%
  group_by(h_diabetes) %>% summarise(perc_high_bp = round(mean(high_bp == "High"),2))

## # A tibble: 2 × 2
##   h_diabetes perc_high_bp
##   <chr>             <dbl>
## 1 Diabetes           0.24
## 2 Normal             0.25

Diseases

Lastly, we want to know whether cerebral infarction and cerebrovascular disease are related to blood pressure. Both terms describe conditions involving insufficient blood supply to the brain.

data %>% group_by(cerebral_infarction) %>% 
  summarise(normal = round(mean(high_bp == "Normal"),2), 
            high = round(mean(high_bp == "High"),2),
            n = n())

## # A tibble: 2 × 4
##   cerebral_infarction   normal  high     n
##   <chr>                  <dbl> <dbl> <int>
## 1 ""                      0.77  0.23   199
## 2 "cerebral infarction"   0.55  0.45    20

data %>% group_by(cerebrovascular_disease) %>% 
  summarise(normal = round(mean(high_bp == "Normal"),2), 
            high = round(mean(high_bp == "High"),2),
            n = n())

## # A tibble: 3 × 4
##   cerebrovascular_disease                  normal  high     n
##   <chr>                                     <dbl> <dbl> <int>
## 1 ""                                         0.75  0.25   194
## 2 "cerebrovascular disease"                  0.5   0.5     10
## 3 "insufficiency of cerebral blood supply"   0.93  0.07    15

The results above indicate that patients with cerebral infarction or cerebrovascular disease have a higher proportion of high blood pressure (45% and 50%, respectively) than those without (23% and 25%). Note that the number of observations is small in both cases (20 and 10 patients).
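With groups this small, an exact test is a reasonable way to see how much weight the difference can bear. The sketch below applies base R's `fisher.test` to the cerebral infarction table. The counts are reconstructed from the rounded percentages above (9 of 20 and roughly 45 of 199), so they are an approximation and not figures printed in this report.

```r
# Rows: with / without cerebral infarction; columns: High / Normal blood pressure
# Counts are reconstructed from rounded percentages, so they are approximate
ci_table <- matrix(c(9,  11,
                     45, 154),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(infarction = c("Yes", "No"),
                                   bp = c("High", "Normal")))

res <- fisher.test(ci_table)

res$p.value    # exact two-sided p-value
res$estimate   # conditional estimate of the odds ratio
res$conf.int   # 95% CI for the odds ratio
```

A wide confidence interval for the odds ratio would reflect the caveat above: the direction of the effect is suggestive, but 20 patients do not pin it down precisely.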

Conclusion

Based on the results above, in order to improve sales, we should focus our campaign on patients older than 45, those with a high BMI, and patients with either cerebral infarction or cerebrovascular disease.

Age 45

The results above have shown that most patients with high blood pressure are older than 45. In this part, we will try to improve our prediction further by including only patients aged 45 or older in a new dataset.

data45 <- data[data$age >= 45,]

BMI

Instead of using BMI categories, we will use BMI values.

data45 %>% ggplot(aes(x = bmi, fill = high_bp)) + geom_histogram(binwidth = 5)

It appears that the proportion of patients with high blood pressure jumps when BMI is 30 or higher. We will confirm this numerically.

c(round(mean(data45[data45$bmi >= 30,]$high_bp == "High"),2),
  round(mean(data45[data45$bmi < 30,]$high_bp == "High"),2))

## [1] 0.64 0.27

The first number is the proportion of patients with high blood pressure among those with a BMI of 30 or above; the second is the proportion among those with a BMI below 30. The difference between them is large (64% versus 27%).

Similar analyses were performed using the other variables (heart rate, diabetes and the other diseases); however, none of them showed results noticeably different from the analysis of the whole dataset.

High Blood Pressure

Danh Dang

2025-02-28
