Introduction

High blood pressure (hypertension) is a major health problem for older adults. As time passes, the body’s blood vessels change: the arteries get stiffer, causing blood pressure to go up. High blood pressure is also known as “the silent killer” because many people are not even aware they have it (National Institute on Aging, 2022). In the dataset below, hypertension status is divided into four groups: “Normal”, “Prehypertension”, “Stage 1 hypertension” and “Stage 2 hypertension”. In this case study, patients are placed in the high blood pressure group if they have Stage 1 or Stage 2 hypertension.

About the dataset

This dataset was collected by Yongbo Liang et al. (2018). It contains information on 219 patients in China, described by 13 variables. The dataset covers an age range of 20-89 years and includes records of diseases such as hypertension and diabetes. The variables are:

  1. ID
  2. Sex (Male / Female)
  3. Age (Year)
  4. Height (cm)
  5. Weight (kg)
  6. Systolic Blood Pressure (mmHg)
  7. Diastolic Blood Pressure (mmHg)
  8. Heart Rate (per minute)
  9. BMI (kg/m^2)
  10. Hypertension (Normal, Prehypertension, Stage 1, Stage 2)
  11. Diabetes (Diabetes, Type 2 Diabetes, None)
  12. Cerebral Infarction
  13. Cerebrovascular Disease

Scenario and Goal of the case study

Scenario: Imagine yourself as an analyst at a pharmaceutical company that has just developed a new medicine for patients with high blood pressure. The company now wants to maximize sales by advertising the new medication to the people most likely to need it.

Goal: Find the groups of people who are at higher risk of having high blood pressure.

Analysis

Transform the dataset

summary(data)
##        id            sex                 age            height     
##  Min.   :  2.0   Length:219         Min.   :21.00   Min.   :145.0  
##  1st Qu.: 85.5   Class :character   1st Qu.:48.00   1st Qu.:155.0  
##  Median :152.0   Mode  :character   Median :58.00   Median :160.0  
##  Mean   :156.6                      Mean   :57.17   Mean   :161.2  
##  3rd Qu.:214.5                      3rd Qu.:67.50   3rd Qu.:167.0  
##  Max.   :419.0                      Max.   :86.00   Max.   :196.0  
##      weight          systolic       diastolic          h_rate      
##  Min.   : 36.00   Min.   : 80.0   Min.   : 42.00   Min.   : 52.00  
##  1st Qu.: 52.50   1st Qu.:113.5   1st Qu.: 64.00   1st Qu.: 66.00  
##  Median : 60.00   Median :126.0   Median : 70.00   Median : 73.00  
##  Mean   : 60.19   Mean   :127.9   Mean   : 71.85   Mean   : 73.64  
##  3rd Qu.: 66.50   3rd Qu.:139.0   3rd Qu.: 78.00   3rd Qu.: 80.00  
##  Max.   :103.00   Max.   :182.0   Max.   :107.00   Max.   :106.00  
##       bmi        hypertension         diabetes         cerebral_infarction
##  Min.   :14.69   Length:219         Length:219         Length:219         
##  1st Qu.:20.55   Class :character   Class :character   Class :character   
##  Median :22.60   Mode  :character   Mode  :character   Mode  :character   
##  Mean   :23.11                                                            
##  3rd Qu.:25.00                                                            
##  Max.   :37.46                                                            
##  cerebrovascular_disease
##  Length:219             
##  Class :character       
##  Mode  :character       
##                         
##                         
## 

As defined above, patients with Stage 1 or Stage 2 hypertension are labeled as having high blood pressure.

data$high_bp <- ifelse(data$hypertension == "Normal" | data$hypertension == "Prehypertension", "Normal", "High")

BMI is also divided into three groups: values above 24.9 are labeled high, values below 18.5 are labeled low, and everything in between is labeled normal.

data$bmi_c <- ifelse(data$bmi > 24.9, "High",
                    ifelse(data$bmi < 18.5, "Low", "Normal"))
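One practical detail: ifelse() returns plain character strings, which table() and ggplot2 order alphabetically (High, Low, Normal). A minimal sketch of storing the categories as a factor with an explicit level order is shown below. The BMI values are made up for illustration and are not taken from the dataset.

```r
# Toy BMI values (hypothetical, chosen to hit the category boundaries)
bmi <- c(17.2, 18.5, 22.6, 24.9, 25.0, 31.4)

# Same rule as in the case study: > 24.9 is High, < 18.5 is Low, otherwise Normal
bmi_c <- ifelse(bmi > 24.9, "High",
                ifelse(bmi < 18.5, "Low", "Normal"))

# Fix the display order so tables and plots read Low, Normal, High
bmi_c <- factor(bmi_c, levels = c("Low", "Normal", "High"))

table(bmi_c)
```

With the real data, the same factor() call could be applied to data$bmi_c right after it is created.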

Exploratory Data Analysis

Several variables in this dataset stand out as possible factors affecting blood pressure: sex, age, heart rate, BMI, diabetes, cerebral infarction and cerebrovascular disease. We begin by checking the correlation between age, heart rate and BMI.

cor(data[c("age", "h_rate", "bmi")])
##                age      h_rate         bmi
## age     1.00000000 -0.08564041  0.01632591
## h_rate -0.08564041  1.00000000 -0.10552071
## bmi     0.01632591 -0.10552071  1.00000000

As we can see, there is almost no correlation among the three variables above, so we can use all of them in our analysis.

Next, we look at the distribution of the blood pressure data.

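library(dplyr)   #%>%, group_by(), summarise(), mutate()
library(ggplot2) #ggplot() and the geom_* layers

data %>% ggplot(aes(x = hypertension, fill = hypertension)) + geom_bar()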

data %>% ggplot(aes(x = high_bp, fill = high_bp)) + geom_bar()

There are almost three times as many patients with “Normal” blood pressure as with “High” blood pressure. We will keep this imbalance in mind as we analyze the dataset further.

Analysis With Variables

Gender

data %>% ggplot(aes(x = sex, fill = high_bp)) + geom_bar()

The data contain similar numbers of male and female patients, and the proportion of patients with high blood pressure appears to differ slightly between them. We dig deeper by defining a function that computes a 95% confidence interval for that proportion.

ci95_high <- function(data, dv, value) {
  is_high <- data[dv == value,]$high_bp == "High" #TRUE/FALSE for each patient in the group
  m <- mean(is_high) #Proportion with high bp
  se <- sd(is_high)/sqrt(length(is_high)) #Standard error based on the group size
  ci95_r <- c(m - 1.96*se, m, m + 1.96*se) #Lower bound, proportion, upper bound
  return(ci95_r)
}

This function calculates the proportion of patients with “High” blood pressure within a certain group. The output is the lower bound, the proportion itself and the upper bound of its 95% confidence interval.

round(ci95_high(data, data$sex, "Female"), 3)
## [1] 0.157 0.235 0.313
round(ci95_high(data, data$sex, "Male"), 3)
## [1] 0.175 0.260 0.344

The two intervals overlap. This suggests that the difference between the two genders is not statistically significant at the 95% level, so we cannot claim that gender affects blood pressure.
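Comparing intervals by eye is only a rough check. A two-sample test of proportions addresses the question directly, and a minimal sketch with base R's prop.test() follows. The counts used here (27 of 115 women and 27 of 104 men with high blood pressure) are back-calculated from the proportions printed above and the total of 219 patients, so treat them as an assumption rather than values read from the data.

```r
# Counts inferred from the reported proportions (assumption, see text)
high  <- c(Female = 27, Male = 27)    # patients with high blood pressure
total <- c(Female = 115, Male = 104)  # patients in each group

# Two-sample test for equality of proportions (continuity correction by default)
res <- prop.test(x = high, n = total)

res$estimate  # the two sample proportions
res$p.value   # a large p-value means no evidence of a difference
```

With the real data, the counts could instead be taken from table(data$sex, data$high_bp).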

Age

data %>% ggplot(aes(x = age, fill = high_bp)) + geom_histogram(binwidth = 5) +
  facet_grid(. ~ high_bp)

There is a pattern here: it seems that only patients older than 45 have high blood pressure. This is an important result that will help us reach our goal.

BMI

To simplify the analysis, we will use the BMI categories here.

data %>% group_by(bmi_c) %>% summarise(perc = mean(high_bp == "High")) %>%
  ggplot(aes(x = bmi_c, y = perc, fill = bmi_c)) + geom_col()

The plot shows that people with a high BMI have a markedly higher proportion of high blood pressure (43%) than those in the normal and low BMI groups.

Heart rate

A higher heart rate means the heart pumps blood more frequently. To see whether it is also associated with higher blood pressure, we plot heart rate against the systolic and diastolic values.

data %>% ggplot(aes(x = h_rate, y = systolic)) + geom_point()

data %>% ggplot(aes(x = h_rate, y = diastolic)) + geom_point()

The plots show no visible correlation between heart rate and blood pressure. In this dataset, a higher heart rate is not associated with higher blood pressure values.
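A scatter plot gives a visual impression only; cor.test() adds a correlation estimate and a p-value. The sketch below uses simulated values, because the raw data are not reproduced here. The means and standard deviations are arbitrary assumptions, and the two vectors are generated independently on purpose.

```r
set.seed(1)  # make the simulation reproducible

# Simulated stand-ins for data$h_rate and data$systolic (hypothetical values)
n <- 219
h_rate   <- rnorm(n, mean = 74, sd = 10)
systolic <- rnorm(n, mean = 128, sd = 20)

# Pearson correlation test between the two variables
res <- cor.test(h_rate, systolic)

res$estimate  # correlation coefficient
res$p.value   # p-value for the null hypothesis of zero correlation
```

With the real data, the call would be cor.test(data$h_rate, data$systolic), and likewise for data$diastolic.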

Diabetes

Several research papers have pointed out that diabetes leads to higher blood pressure. We shall see if the result holds true in this group of patients.

data %>% group_by(diabetes) %>% summarise(
  perc_high_bp = round(mean(high_bp == "High"),2),
  n = n())
## # A tibble: 3 × 3
##   diabetes          perc_high_bp     n
##   <chr>                    <dbl> <int>
## 1 ""                        0.25   181
## 2 "Diabetes"                1        1
## 3 "Type 2 Diabetes"         0.22    37

The results above show that patients with diabetes do not have a higher rate of high blood pressure than those without. Note that only one person in this dataset is labeled “Diabetes” (type 1), which is why that group’s percentage is 100%. We can confirm this by merging both diabetes labels into a single group:

data %>% mutate(h_diabetes = ifelse(diabetes == "", "Normal", "Diabetes")) %>%
  group_by(h_diabetes) %>% summarise(perc_high_bp = round(mean(high_bp == "High"),2))
## # A tibble: 2 × 2
##   h_diabetes perc_high_bp
##   <chr>             <dbl>
## 1 Diabetes           0.24
## 2 Normal             0.25

Diseases

Lastly, we want to know whether cerebral infarction and cerebrovascular disease are related to blood pressure. Both terms describe conditions in which the blood supply to the brain is insufficient.

data %>% group_by(cerebral_infarction) %>% 
  summarise(normal = round(mean(high_bp == "Normal"),2), 
            high = round(mean(high_bp == "High"),2),
            n = n())
## # A tibble: 2 × 4
##   cerebral_infarction   normal  high     n
##   <chr>                  <dbl> <dbl> <int>
## 1 ""                      0.77  0.23   199
## 2 "cerebral infarction"   0.55  0.45    20
data %>% group_by(cerebrovascular_disease) %>% 
  summarise(normal = round(mean(high_bp == "Normal"),2), 
            high = round(mean(high_bp == "High"),2),
            n = n())
## # A tibble: 3 × 4
##   cerebrovascular_disease                  normal  high     n
##   <chr>                                     <dbl> <dbl> <int>
## 1 ""                                         0.75  0.25   194
## 2 "cerebrovascular disease"                  0.5   0.5     10
## 3 "insufficiency of cerebral blood supply"   0.93  0.07    15

The results above indicate that patients with cerebral infarction or cerebrovascular disease have a higher rate of high blood pressure (45% and 50%, respectively) than those without. Note that the number of observations is low in both groups (20 and 10 patients).
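To see how much uncertainty such small groups carry, one option is an exact (Clopper-Pearson) confidence interval from base R's binom.test(). This is a sketch, not part of the original analysis. The counts are the ones implied by the tables above: 45% of 20 patients is 9, and 50% of 10 patients is 5.

```r
# Exact 95% confidence intervals for the proportion with high blood pressure
ci_infarction <- binom.test(x = 9, n = 20)$conf.int  # cerebral infarction
ci_cerebro    <- binom.test(x = 5, n = 10)$conf.int  # cerebrovascular disease

round(ci_infarction, 2)
round(ci_cerebro, 2)
```

Both intervals span several tens of percentage points, which is worth keeping in mind before drawing strong conclusions from these two groups.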

Conclusion

Based on the results above, to improve sales we should focus our campaign on patients older than 45, those with a high BMI, and patients with either cerebral infarction or cerebrovascular disease.

Age 45

The results above showed that most patients with high blood pressure are older than 45. In this part, we try to refine our findings by keeping only patients aged 45 or older in a new dataset.

data45 <- data[data$age >= 45,]

BMI

Instead of using BMI categories, we will use BMI values.

data45 %>% ggplot(aes(x = bmi, fill = high_bp)) + geom_histogram(binwidth = 5)

It appears that the proportion of patients with high blood pressure spikes when BMI is 30 or higher. We will confirm this numerically.

c(round(mean(data45[data45$bmi >= 30,]$high_bp == "High"),2),
  round(mean(data45[data45$bmi < 30,]$high_bp == "High"),2))
## [1] 0.64 0.27

The first number is the proportion of patients with high blood pressure among those with a BMI of 30 or above; the second is the proportion among those with a BMI below 30. The difference between them is large (64% versus 27%).

Similar analyses were performed with the other variables (heart rate, diabetes and the other diseases); none of them showed a notable difference from the analysis of the whole dataset.

Conclusion

What did we learn after analyzing the dataset?

  1. Individuals older than 45, those with a high BMI, and those affected by cerebral infarction or cerebrovascular disease have a higher chance of having hypertension.
  2. Adults aged 45 or older with a BMI of 30 or above have a much higher rate of hypertension than those with lower BMI values.

Future Plan

So what should we do as analysts to improve sales?

  1. The advertising campaign should focus on the population that satisfies the conditions above.
  2. Based on the targeted population, we can also derive a better advertising strategy. For example, the types of media used by people aged 45 or above might differ from those used by younger people.