High blood pressure (or hypertension) is a major health problem for older adults. As time passes, the body’s blood vessels change. The arteries get stiffer, causing blood pressure to go up. High blood pressure is also known as “the silent killer” because many people are not even aware they have it (National Institute on Aging, 2022). In the dataset below, hypertension is divided into four groups: “Normal”, “Prehypertension”, “Stage 1 hypertension” and “Stage 2 hypertension”. In this case study, patients are assigned to the high blood pressure group if they belong to the Stage 1 or Stage 2 hypertension group.
This dataset was collected by Yongbo Liang et al. (2018). It contains information on 219 patients in China, described by 13 variables. The dataset covers an age range of 20-89 years and includes records of diseases such as hypertension and diabetes. The variables contained in this dataset are id, sex, age, height, weight, systolic, diastolic, h_rate, bmi, hypertension, diabetes, cerebral_infarction and cerebrovascular_disease.
Scenario: Imagine you are an analyst at a pharmaceutical company that has just developed a new medicine for patients with high blood pressure. The company now wants to maximize sales by advertising the new medication to the people most likely to have high blood pressure.
Goal: find the groups of people who have a higher risk of high blood pressure.
summary(data)
## id sex age height
## Min. : 2.0 Length:219 Min. :21.00 Min. :145.0
## 1st Qu.: 85.5 Class :character 1st Qu.:48.00 1st Qu.:155.0
## Median :152.0 Mode :character Median :58.00 Median :160.0
## Mean :156.6 Mean :57.17 Mean :161.2
## 3rd Qu.:214.5 3rd Qu.:67.50 3rd Qu.:167.0
## Max. :419.0 Max. :86.00 Max. :196.0
## weight systolic diastolic h_rate
## Min. : 36.00 Min. : 80.0 Min. : 42.00 Min. : 52.00
## 1st Qu.: 52.50 1st Qu.:113.5 1st Qu.: 64.00 1st Qu.: 66.00
## Median : 60.00 Median :126.0 Median : 70.00 Median : 73.00
## Mean : 60.19 Mean :127.9 Mean : 71.85 Mean : 73.64
## 3rd Qu.: 66.50 3rd Qu.:139.0 3rd Qu.: 78.00 3rd Qu.: 80.00
## Max. :103.00 Max. :182.0 Max. :107.00 Max. :106.00
## bmi hypertension diabetes cerebral_infarction
## Min. :14.69 Length:219 Length:219 Length:219
## 1st Qu.:20.55 Class :character Class :character Class :character
## Median :22.60 Mode :character Mode :character Mode :character
## Mean :23.11
## 3rd Qu.:25.00
## Max. :37.46
## cerebrovascular_disease
## Length:219
## Class :character
## Mode :character
##
##
##
Following the criteria defined above, patients in the Stage 1 and Stage 2 hypertension groups are labelled as having high blood pressure; all others are labelled as normal.
data$high_bp <- ifelse(data$hypertension == "Normal" | data$hypertension == "Prehypertension", "Normal", "High")
The BMI values are also divided into three groups. Patients with a BMI above 24.9 are labelled high BMI, those with a BMI below 18.5 are labelled low BMI, and everyone in between is labelled normal.
data$bmi_c <- ifelse(data$bmi > 24.9, "High",
ifelse(data$bmi < 18.5, "Low", "Normal"))
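One practical detail when categories are stored as plain character strings is that tables and plots order them alphabetically (High, Low, Normal). The following is a minimal sketch, using hypothetical toy BMI values rather than the real dataset, of how the same thresholds can be combined with an explicit factor level order.

```r
# Toy BMI values (hypothetical, not taken from the dataset)
bmi <- c(17.2, 18.5, 22.6, 24.9, 25.0, 31.4)

# Same thresholds as above: above 24.9 is "High", below 18.5 is "Low"
bmi_c <- ifelse(bmi > 24.9, "High",
                ifelse(bmi < 18.5, "Low", "Normal"))

# Fix the level order so tables and plots read Low, Normal, High
bmi_c <- factor(bmi_c, levels = c("Low", "Normal", "High"))
table(bmi_c)
```

Whether to convert to a factor is a presentation choice; the grouping itself is unchanged.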
In this dataset, several variables stand out as possible factors affecting blood pressure: sex, age, heart rate, BMI, diabetes, cerebral infarction and cerebrovascular disease. We will begin by checking the correlation between age, heart rate and BMI.
cor(data[c("age", "h_rate", "bmi")])
## age h_rate bmi
## age 1.00000000 -0.08564041 0.01632591
## h_rate -0.08564041 1.00000000 -0.10552071
## bmi 0.01632591 -0.10552071 1.00000000
As we can see, there is almost no correlation between any pair of these three variables, so we can use all of them in our analysis.
Next, we check the distribution of the blood pressure data.
data %>% ggplot(aes(x = hypertension, fill = hypertension)) + geom_bar()
data %>% ggplot(aes(x = high_bp, fill = high_bp)) + geom_bar()
There are roughly three times as many patients with “Normal” blood
pressure as with “High” blood pressure. We will keep this imbalance
in mind as we analyze the dataset further.
data %>% ggplot(aes(x = sex, fill = high_bp)) + geom_bar()
It seems that we have similar numbers of male and female patients in our
data, and the two groups appear to have slightly different proportions of
people with high blood pressure. We shall dig deeper by defining a function
that computes a 95% confidence interval.
ci95_high <- function(data, dv, value) {
  sub <- data[dv == value,] #Patients in the selected group
  is_high <- sub$high_bp == "High"
  ci95_r <- rep(0,3) #Create placeholder for results
  m <- mean(is_high) #Proportion with high bp
  se <- sd(is_high)/sqrt(nrow(sub)) #Standard error uses the group size, not nrow(data)
  ci95_r[1] <- m - 1.96*se
  ci95_r[2] <- m
  ci95_r[3] <- m + 1.96*se
  return(round(ci95_r, 3)) #Lower bound, proportion, upper bound
}
This function calculates the proportion of patients with “High” blood pressure within a given group. The output is the lower bound, the proportion itself and the upper bound of the 95% confidence interval, rounded to three decimals. The standard error is based on the number of patients in the group rather than in the whole dataset.
ci95_high(data, data$sex, "Female")
## [1] 0.157 0.235 0.313
ci95_high(data, data$sex, "Male")
## [1] 0.175 0.260 0.344
The two intervals overlap almost entirely. The data therefore give no evidence that the proportion of high blood pressure differs between the sexes, so we cannot claim that sex affects blood pressure.
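Comparing two intervals by eye is only an informal check. A two-sample test of proportions compares the groups directly. The sketch below assumes counts of 27 of 115 female patients and 27 of 104 male patients with high blood pressure; these counts are inferred from the proportions reported above and the total of 219 patients, not read from the data.

```r
# Counts inferred from the reported proportions and the 219 patients (assumption):
# 27 of 115 female and 27 of 104 male patients with high blood pressure
high_counts <- c(Female = 27, Male = 27)
group_sizes <- c(Female = 115, Male = 104)

# Two-sample test for equality of proportions (continuity correction by default)
res <- prop.test(x = high_counts, n = group_sizes)
res$estimate  # the two sample proportions
res$p.value   # a large p-value means no evidence of a difference
res$conf.int  # 95% confidence interval for the difference in proportions
```

With the real data, the counts could instead be taken from `table(data$sex, data$high_bp)`.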
data %>% ggplot(aes(x = age, fill = high_bp)) + geom_histogram(binwidth = 5) +
facet_grid(. ~ high_bp)
There is a pattern here: it seems that only patients older than about 45 have high blood pressure. This is an important result that will help us reach our goal.
To simplify the analysis, we will use the BMI categories here.
data %>% group_by(bmi_c) %>% summarise(perc = mean(high_bp == "High")) %>%
ggplot(aes(x = bmi_c, y = perc, fill = bmi_c)) + geom_col()
The plot shows that patients with a high BMI have a noticeably higher proportion of high blood pressure (43%) than those in the lower BMI categories. No formal test was run here, so this is a descriptive difference.
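A proportion per category is easier to judge when the group size and an interval are shown next to it. The sketch below uses a small hypothetical data frame `toy` (not the real dataset) and a hypothetical helper `prop_summary` based on the normal approximation, to show one way of tabulating these per BMI category.

```r
# Hypothetical toy data; the real `data` frame is not reproduced here
toy <- data.frame(
  bmi_c   = c("High", "High", "High", "High",
              "Normal", "Normal", "Normal", "Normal", "Low", "Low"),
  high_bp = c("High", "High", "Normal", "Normal",
              "High", "Normal", "Normal", "Normal", "Normal", "Normal")
)

# Group size, proportion with high blood pressure and an approximate 95% interval
prop_summary <- function(is_high) {
  n  <- length(is_high)
  p  <- mean(is_high)
  se <- sqrt(p * (1 - p) / n)  # normal-approximation standard error
  c(n = n, prop = p,
    lower = max(0, p - 1.96 * se),
    upper = min(1, p + 1.96 * se))
}

# One row per BMI category
res <- t(sapply(split(toy$high_bp == "High", toy$bmi_c), prop_summary))
res
```

The normal approximation is rough for small groups, so the bounds should be read as indicative only.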
Higher heart rate means the heart has to pump blood more frequently. We shall see if it will also result in higher blood pressure. We will check the correlation between heart rate and blood pressure values.
data %>% ggplot(aes(x = h_rate, y = systolic)) + geom_point()
data %>% ggplot(aes(x = h_rate, y = diastolic)) + geom_point()
The scatter plots show no visible correlation between heart rate
and either blood pressure value. In this dataset, a higher heart rate
is not associated with higher blood pressure.
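A scatter plot can be complemented with a numeric estimate. The sketch below runs `cor.test()` on simulated values; the vectors `h_rate` and `systolic` here are hypothetical stand-ins generated independently of each other, not the columns of the real dataset.

```r
set.seed(1)  # reproducible toy data
h_rate   <- rnorm(50, mean = 73, sd = 10)   # hypothetical heart rates
systolic <- rnorm(50, mean = 128, sd = 20)  # hypothetical, generated independently

ct <- cor.test(h_rate, systolic)  # Pearson correlation with a test of r = 0
ct$estimate  # sample correlation coefficient
ct$p.value   # p-value for the null hypothesis of zero correlation
ct$conf.int  # 95% confidence interval for the correlation
```

On the real data the equivalent call would be `cor.test(data$h_rate, data$systolic)`, and likewise for `diastolic`.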
Several research papers have pointed out that diabetes leads to higher blood pressure. We shall see if the result holds true in this group of patients.
data %>% group_by(diabetes) %>% summarise(
perc_high_bp = round(mean(high_bp == "High"),2),
n = n())
## # A tibble: 3 × 3
## diabetes perc_high_bp n
## <chr> <dbl> <int>
## 1 "" 0.25 181
## 2 "Diabetes" 1 1
## 3 "Type 2 Diabetes" 0.22 37
The results above show that patients with diabetes do not have a higher proportion of high blood pressure than those without. Note that only one person in this dataset is labelled “Diabetes” (type 1), which is why the percentage for that group is 100%. We can confirm the overall picture by pooling both diabetes labels into a single group:
data %>% mutate(h_diabetes = ifelse(diabetes == "", "Normal", "Diabetes")) %>%
group_by(h_diabetes) %>% summarise(perc_high_bp = round(mean(high_bp == "High"),2))
## # A tibble: 2 × 2
## h_diabetes perc_high_bp
## <chr> <dbl>
## 1 Diabetes 0.24
## 2 Normal 0.25
Lastly, we want to know whether cerebral infarction and cerebrovascular disease are associated with blood pressure. Both terms describe conditions involving an insufficient blood supply to the brain.
data %>% group_by(cerebral_infarction) %>%
summarise(normal = round(mean(high_bp == "Normal"),2),
high = round(mean(high_bp == "High"),2),
n = n())
## # A tibble: 2 × 4
## cerebral_infarction normal high n
## <chr> <dbl> <dbl> <int>
## 1 "" 0.77 0.23 199
## 2 "cerebral infarction" 0.55 0.45 20
data %>% group_by(cerebrovascular_disease) %>%
summarise(normal = round(mean(high_bp == "Normal"),2),
high = round(mean(high_bp == "High"),2),
n = n())
## # A tibble: 3 × 4
## cerebrovascular_disease normal high n
## <chr> <dbl> <dbl> <int>
## 1 "" 0.75 0.25 194
## 2 "cerebrovascular disease" 0.5 0.5 10
## 3 "insufficiency of cerebral blood supply" 0.93 0.07 15
The results above indicate that patients with cerebral infarction or cerebrovascular disease have a higher proportion of high blood pressure (45% and 50%, respectively) than those without. Note that the number of observations is low in both groups (20 and 10 patients), so these proportions are uncertain.
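With so few patients, an exact binomial interval gives a better sense of the uncertainty than the proportion alone. The sketch below assumes counts of 9 of 20 and 5 of 10 patients with high blood pressure; these are inferred from the rounded proportions and group sizes in the tables above, not read from the data.

```r
# Counts inferred from the rounded proportions and group sizes above (assumption)
ci_infarction      <- binom.test(x = 9, n = 20)$conf.int  # 45% of 20 patients
ci_cerebrovascular <- binom.test(x = 5, n = 10)$conf.int  # 50% of 10 patients

ci_infarction       # exact 95% interval for cerebral infarction
ci_cerebrovascular  # exact 95% interval for cerebrovascular disease
```

Both intervals are wide, which is the expected consequence of the small group sizes.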
From the results above, in order to improve sales, we should focus our campaign on patients above 45 years old, those with high BMI, and patients with either cerebral infarction or cerebrovascular disease.
The results above have shown that most patients with high blood pressure are older than 45. In this part, we will try to improve our prediction further by including only patients aged 45 or older in a new dataset.
data45 <- data[data$age >= 45,]
Instead of using BMI categories, we will use BMI values.
data45 %>% ggplot(aes(x = bmi, fill = high_bp)) + geom_histogram(binwidth = 5)
It appears that the proportion of patients with high blood pressure jumps when BMI is 30 or higher. We will confirm this numerically.
c(round(mean(data45[data45$bmi >= 30,]$high_bp == "High"),2),
round(mean(data45[data45$bmi < 30,]$high_bp == "High"),2))
## [1] 0.64 0.27
The first number is the proportion of patients with high blood pressure among those with a BMI of 30 or above; the second is the same proportion among those with a BMI below 30. There is a large difference between the two (64% versus 27%).
Similar analyses were performed using the other variables (heart rate, diabetes and the other diseases). None of them showed a notable difference from the analysis of the whole dataset.
What did we learn after analyzing the dataset?
So what should we do as an analyst to improve sales?