Do resting blood pressure, cholesterol, age, sex and heart rate predict presence of hypertension?

Abstract:

Hypertension is one of the most common inherited diseases among Americans. The consequences of having hypertension are so severe that it puts one’s life at stake. Hypertension, also known as High blood pressure, is the main cause of kidney failure, stroke, diabetic, Vision loss, Erectile dysfunction, Memory loss, Angina(chest pain) and heart failure. Early detection is an effective way to control the incidence of hypertension by knowing risk factors such as Age, cholesterol level, resting blood pressure, blood sugar, heart rate and exercising habits e.t.c. The method used to analyze these significant risk factors to cause hypertension is logistic regression [1]. Similar to linear regression, logistic regression is a member of the family of generalized linear models. Logistic regressions were initially created to categorize binary outcomes based on a number of independent categorical or continuous factors. These predictions are made via logistic regression using probability discovered through maximum-likelihood estimations [2].

In this particular research, a data set that has been obtained from one of the largest and secure data resources website Kaggle.com. The data set does incorporate all the significant factors mentioned in the paragraph above. Although, the data set does require some preprocessing as it is not clean and ready to be analyzed. Once the data set is prepared for the analysis then an exploratory data analysis (EDA) will be carried out before the application of logistic regression. After carrying out EDA, a thorough application of logistics regression will be done. Starting from a simple regression model where only one factor will be used as a function of hypertension, to a more advanced model that might incorporate all the important factors. In this whole process, a thorough explanation will be asserted alongside source code to explain the process along the way and comments will be provided to explain different coefficients of logistic regression.

Loading and transforming data set:

Loading the data set:

The data set used in this study has been obtained from the kaggle link: https://www.kaggle.com/datasets/prosperchuks/health-dataset?select=hypertension_data.csv. After downloading the data set it was uploaded to a github repository so that it can be called into Rstudio environment while keeping the work’s reproducibility. In order to load the data set into Rstudio we will use read.csv() function from base R to read the URL of the stored repository

hyper_t <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/hypertension_data.csv")

Now in order to see if the data set has been loaded properly with all the columns we can use head() function from Base R which can display the first six row of the data set.

knitr::kable(head(hyper_t))
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
57 1 3 145 233 1 0 150 0 2.3 0 0 1 1
64 0 2 130 250 0 1 187 0 3.5 0 0 2 1
52 1 1 130 204 0 0 172 0 1.4 2 0 2 1
56 0 1 120 236 0 1 178 0 0.8 2 0 2 1
66 0 0 120 354 0 1 163 1 0.6 2 0 2 1
51 1 0 140 192 0 1 148 0 0.4 1 0 1 1

We can see that the data set has been loaded properly into our Rstudio environment and now we can go ahead and analyze the data set but before starting to analyze we have to bear in mind that our data set might not be in a clean form to work with so we’ll have to wrangle and transform our data set to make it analyze worthy.

Wrangling and Transforming:

First, we will look at the structure of the data set, Since we will be carrying out logistic regression so first we will look at the types of data in each variable/column. In order to do that we will use str() function.

str(hyper_t)
## 'data.frame':    26083 obs. of  14 variables:
##  $ age     : num  57 64 52 56 66 51 42 38 72 47 ...
##  $ sex     : num  1 0 1 0 0 1 0 0 0 0 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

As we can see that a lot of columns has numeric and integer data type which is true for some columns but other columns like sex,cp,fbs,exang,slope,ca,thal and target have a factor data type. So we will need to fix that. But before changing data type of any columns we need to look out for any missing values in those rows. we will is.na() to see any missing values. According to our previous findings in proposal there in total 25 missing values in the data set:

sum(is.na(hyper_t))
## [1] 25

And if we check column wise all of those missing values are in sex column

knitr::kable(hyper_t[is.na(hyper_t$sex),])
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
287 43 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
875 78 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
1450 65 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
1996 60 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
3940 58 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
4383 74 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
4813 66 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
5831 68 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
6609 54 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
6894 59 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
12903 33 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
13527 55 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
13797 50 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
14868 48 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
15109 64 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
15332 56 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
16468 44 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
16659 49 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
20186 53 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
20478 88 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
20756 75 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
21026 70 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
22182 84 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
22406 76 NA 3 134 204 0 1 162 0 0.8 2 2 2 0
23306 69 NA 3 134 204 0 1 162 0 0.8 2 2 2 0

Now we can see that apart from age column in the data set every other variable/column has the same value for each of those missing values so we can ignore those rows and since they are only 25 out 26k (way below 5%) observations so we can easily impute it from our data set.

hyper_t <- hyper_t[!(is.na(hyper_t$sex)),]

After imputation lets quickly confirm that our data set is free from any missing values using sum() and is.na() functions:

sum(is.na(hyper_t))
## [1] 0

Now we do not have any missing values in our data set so the next step is to manage the data type of the columns. The source of our data set does gives us the information about data type of columns and according to the source of data set, columns like sex, cp, fbs, exang, slope, ca, thal and target have a factor data type. So let’s change that first and we can change the factors like 0 and 1 to F(female) and M (male), in sex column, respectively:

hyper_t[hyper_t$sex == 0,]$sex <- "F" # Replacing 0 by F
hyper_t[hyper_t$sex == 1,]$sex <- "M" # Replacing 1 by M


hyper_t$sex <- as.factor(hyper_t$sex)
hyper_t$cp <- as.factor(hyper_t$cp)
hyper_t$fbs <- as.factor(hyper_t$fbs)
hyper_t$restecg <- as.factor(hyper_t$restecg)
hyper_t$exang <- as.factor(hyper_t$exang)
hyper_t$slope <- as.factor(hyper_t$slope)
hyper_t$thal <- as.factor(hyper_t$thal)


hyper_t$target <- ifelse(test = hyper_t$target == 0, yes = "Non-Hypertension", no = "Hypertension")
hyper_t$target <- as.factor(hyper_t$target)

After changing the type of columns now lets recall str() function again to see if everything in our data frame is according to the way they were explained in the source

str(hyper_t)
## 'data.frame':    26058 obs. of  14 variables:
##  $ age     : num  57 64 52 56 66 51 42 38 72 47 ...
##  $ sex     : Factor w/ 2 levels "F","M": 2 1 2 1 1 2 1 1 1 1 ...
##  $ cp      : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 2 1 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ target  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Now we can see that we have columns bearing integers, numeric and factors with different levels as mentioned in the source of data set. Our data set is not ready for analysis and modeling:

Exploratory Data Analysis

In this particular section we will explore different aspects of data sets and will try to visualize and check the dispersion of data. let’s check out the summary of columns that might effect hypertension according to the research.

Age

describe(hyper_t$age)
##    vars     n  mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 26058 55.66 15.19     56   55.74 17.79  11  98    87 -0.04    -0.74
##      se
## X1 0.09

As we can that our mean age is around 55-56 years with minimum of 11 and maximum of 98 years old. we can see the distribution more clearly by plotting a graph:

ggplot()+
  geom_histogram(data = hyper_t, mapping = aes(x=age, fill = target), color = "black", bins = 30)+
  geom_vline(xintercept=mean(hyper_t$age), color='red')+labs(x="Age", y="Count",title ="Distribution plot of Age")+theme_bw()

So we can see that we have a nice distribution of age ranging from young to very old people even though our mean is on the higher side but we will still further look into age as a factor causing hypertension. We can also create a contingency table using xtabs() function. The contingency table gives the number of cases in each subgroup. For instance, if we apply xtabs()function to target and age column it will give us the number of people having and not having hypertension in each age subgroup as shwon below:

knitr::kable(xtabs(~target+age, data = hyper_t))
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
0 0 0 0 0 0 3 0 6 9 7 16 17 16 31 41 45 56 80 76 98 115 109 130 146 150 156 176 183 176 202 214 213 218 232 245 231 243 249 248 254 258 261 260 262 267 256 261 262 258 258 255 262 260 255 256 246 245 246 237 228 231 230 208 202 196 180 154 145 138 106 101 91 72 58 49 41 22 22 17 7 10 7 3 3 3 2 2 0
1 1 2 2 2 3 6 5 5 16 10 20 31 33 51 57 66 81 90 121 120 129 163 169 177 189 213 223 227 252 259 260 271 279 289 283 297 305 299 308 315 308 306 316 312 310 310 314 307 302 310 304 297 302 299 288 283 287 277 271 273 254 250 245 236 208 198 205 157 149 145 109 104 87 83 56 40 49 25 21 15 8 11 3 3 3 1 2 2
#xtabs(~target+age, data = hyper_t)

First row in the contingency table represents ages of patient or respondents. Similarly, second and third rows represent number of patient in that age group not having and having hypertension, respectively.

Sex:

Sex could arguably be another factor that might affect hyper tension. Over here before putting sex in our model we will need to check out for the distribution among male and female patients. Since we change the values of 0 and 1 to “F” and “M”, respectively so instead of plotting lets quickly have a look at contingency table.

knitr::kable(xtabs(~ target + sex, data=hyper_t))
F M
Hypertension 7137 7137
Non-Hypertension 5892 5892

So we can see that gender/sex, even though could arguably be a factor in causing hypertension since male and female body have different response to different medical condition, but we can not use this data set to predict hypertension based on sex/gender since the data has been artificially set in a way that it has equal number of male and female respondents and equal number of responses for both gender in favor of having hypertension and not having hypertension.

In this particular study we will later on use logistic regression which talks about logs odd and logs of the odd ratios and if we do a manual calculations to check out log of odds of female and male having hypertension so we get the same answer:

female.log.odds <- log(5892/7137)
female.log.odds
## [1] -0.191697

Similarly for males:

male.log.odds <- log(5892/7137)
male.log.odds
## [1] -0.191697

So we can see that we get same answer for both and if we check out the log of the odd ratio we get a zero as shown below:

male.log.odds.ratio <- log((5892/7137)/(5892/7137))
male.log.odds.ratio
## [1] 0

logs odd and log of the odds will be explained later one when we apply simple logistic regression but for now we will not include gender/sex into our analysis:

Resting systolic blood pressure:

Systolic blood pressure is another significant factor that can cause hypertension. Most of the time someone with a high blood pressure is automatically considered hypertensive, in fact high blood pressure is another word for hypertension but today we will see if its just a high resting systolic blood pressure or is there more to it. We can quickly check out the summary using describe() function:

describe(hyper_t$trestbps)
##    vars     n   mean   sd median trimmed   mad min max range skew kurtosis   se
## X1    1 26058 131.59 17.6    130  130.38 14.83  94 200   106 0.72     0.92 0.11

We can see a very good distribution of the resting systolic blood pressure. It ranges from 94 mm of Hg (mercury) to 200 mm of Hg and the mean value is around 131.59 mm of Hg which is a very good indication of having evenly distributed data. We can also plot this data using a histogram.

ggplot()+
  geom_histogram(data = hyper_t, mapping = aes(x=trestbps, fill=target), color = "black", bins = 20)+theme_bw()+
  geom_vline(xintercept=mean(hyper_t$trestbps), color='red')+labs(x="Resting systolic blood pressure", y="Count",title ="Distribution of Resting Blood pressure (systolic)")

To check the distribution of resting blood pressure among hypertensive and non-hypertensive patients we can go ahead and plot a density ridges plot as shown below:

ggplot(hyper_t, aes(x = trestbps, y = target, fill=target)) + geom_density_ridges(alpha = .7)+labs(x="Resting blood pressure(systolic)", y= "", title = "Desnity Ridges Plot of Blood Pressure")+theme_bw()+theme(legend.position = "none")

Similarly on a contingency table:

knitr::kable(xtabs(~target+trestbps, data = hyper_t))
94 100 101 102 104 105 106 108 110 112 114 115 117 118 120 122 123 124 125 126 128 129 130 132 134 135 136 138 140 142 144 145 146 148 150 152 154 155 156 160 164 165 170 172 174 178 180 192 200
Hypertension 178 170 96 168 82 266 82 342 702 454 0 262 0 460 1968 260 0 170 342 88 532 82 2018 256 166 424 84 800 1484 168 0 84 96 88 774 168 0 98 88 438 0 0 74 80 0 82 100 0 0
Non-Hypertension 0 172 0 0 0 0 0 174 962 310 80 0 94 156 1244 86 82 344 614 168 510 0 1080 422 208 80 158 232 1318 88 184 368 92 78 676 250 90 0 0 510 86 82 254 0 82 92 180 84 94

Well if we look at the contingency table we kind of get the idea that even people with low systolic resting blood pressure are hypertensive and as the blood pressure goes up toward normal range the number of hypertensive patients drops and as we crosses the 126 mm of Hg the number of hypertensive patients increases with exceptions. Surprisingly people with higher blood pressure (at 192 and 200) are not hypertensive.

We need to check these exceptions and see if other variable like chest pain, fasting blood sugar, heart rate e.t.c are playing their role. For instance, at 115 mm of Hg there are 262 hypertensive patients so lets check what else do we have at this particular blood pressure:

hyper_115 <- hyper_t|>
  filter(trestbps == 115)
hyper_rest <- hyper_t|>
  filter(trestbps != 115)

Let’s compare the filtered data frame (hyper_115) to rest of the data (hyper_rest):

knitr::kable(describe(hyper_rest))
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 25796 55.6416499 15.1766409 56.00 55.7230352 17.79120 11 98.0 87.0 -0.0382295 -0.7392775 0.0944930
sex* 2 25796 1.5000000 0.5000097 1.50 1.5000000 0.74130 1 2.0 1.0 0.0000000 -2.0000775 0.0031132
cp* 3 25796 1.9601489 1.0228449 2.00 1.8583196 1.48260 1 4.0 3.0 0.4918026 -1.1694205 0.0063684
trestbps 4 25796 131.7591875 17.6062070 130.00 130.5731176 14.82600 94 200.0 106.0 0.7088067 0.9125417 0.1096200
chol 5 25796 245.0445030 48.6600181 240.00 242.9127338 47.44320 126 417.0 291.0 0.5357609 0.5784409 0.3029675
fbs* 6 25796 1.1514188 0.3584636 1.00 1.0642989 0.00000 1 2.0 1.0 1.9447870 1.7822655 0.0022319
restecg* 7 25796 1.5276787 0.5258666 2.00 1.5175405 0.00000 1 3.0 2.0 0.1710449 -1.3431565 0.0032742
thalach 8 25796 149.3764925 22.8019236 152.00 150.6693963 23.72160 71 202.0 131.0 -0.5168684 -0.0731136 0.1419696
exang* 9 25796 1.3302062 0.4702963 1.00 1.2877701 0.00000 1 2.0 1.0 0.7220434 -1.4787107 0.0029282
oldpeak 10 25796 1.0408746 1.1695812 0.75 0.8532610 1.11195 0 6.2 6.2 1.2644032 1.4970476 0.0072821
slope* 11 25796 2.4004497 0.6178000 2.00 2.4640469 1.48260 1 3.0 2.0 -0.5185572 -0.6321075 0.0038466
ca 12 25796 0.7279423 1.0138162 0.00 0.5403624 0.00000 0 4.0 4.0 1.2784350 0.7129430 0.0063122
thal* 13 25796 3.3191968 0.6061561 3.00 3.3603547 0.00000 1 4.0 3.0 -0.4533726 0.2212450 0.0037741
target* 14 25796 1.4568150 0.4981412 1.00 1.4460219 0.00000 1 2.0 1.0 0.1733778 -1.9700165 0.0031015
knitr::kable(describe(hyper_115))
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 262 57.0419847 16.4620239 57.0 57.371429 17.79120 13 92.0 79.0 -0.1631529 -0.6005061 1.0170272
sex* 2 262 1.5000000 0.5009569 1.5 1.500000 0.74130 1 2.0 1.0 0.0000000 -2.0076190 0.0309492
cp* 3 262 1.6106870 0.9228684 1.0 1.514286 0.00000 1 3.0 2.0 0.8404818 -1.2985028 0.0570150
trestbps 4 262 115.0000000 0.0000000 115.0 115.000000 0.00000 115 115.0 0.0 NaN NaN 0.0000000
chol 5 262 368.5801527 131.0420338 303.0 357.828571 63.75180 260 564.0 304.0 0.7800013 -1.3121401 8.0958039
fbs* 6 262 1.0000000 0.0000000 1.0 1.000000 0.00000 1 1.0 0.0 NaN NaN 0.0000000
restecg* 7 262 1.3664122 0.4827461 1.0 1.333333 0.00000 1 2.0 1.0 0.5513371 -1.7024818 0.0298242
thalach 8 262 175.9007634 10.6931102 181.0 176.742857 5.93040 160 185.0 25.0 -0.7620023 -1.3159965 0.6606226
exang* 9 262 1.0000000 0.0000000 1.0 1.000000 0.00000 1 1.0 0.0 NaN NaN 0.0000000
oldpeak 10 262 0.9282443 0.6703662 1.2 0.960000 0.59304 0 1.6 1.6 -0.5398043 -1.4755251 0.0414154
slope* 11 262 2.3282443 0.4704730 2.0 2.285714 0.00000 2 3.0 1.0 0.7273539 -1.4765483 0.0290659
ca 12 262 0.0000000 0.0000000 0.0 0.000000 0.00000 0 0.0 0.0 NaN NaN 0.0000000
thal* 13 262 3.3053435 0.4614342 3.0 3.257143 0.00000 3 4.0 1.0 0.8404818 -1.2985028 0.0285075
target* 14 262 1.0000000 0.0000000 1.0 1.000000 0.00000 1 1.0 0.0 NaN NaN 0.0000000

After doing a rough comparison of both data frames(hyper_t and hyper_115) we can see that some parameters has increased significantly. Parameter like mean cholesterol, mean age, mean heart rate e.t.c. Which kind of indicates that there is a lot more to hypertension than just having a high blood pressure.

Fasting blood sugar:

Blood sugar also cause hypertension.Diabetes and high blood pressure are closely related. In addition to having similar risk factors and causes, diabetes can also raise blood pressure. Eventually, the conditions that induce high blood pressure can result from insulin resistance and high blood glucose levels [3]. Lets check out the how many people with hypertension have a high blood pressure

xtabs(~target+fbs, data = hyper_t)
##                   fbs
## target                 0     1
##   Hypertension     12292  1982
##   Non-Hypertension  9860  1924

Another astonishing result, as we witness a lot of people with fasting blood sugar less than 120 mg/ are hypertensive while almost equal amount of people with fasting blood sugar level higher than 120 mg/ are either hypertensive or non-hypertensive. Which is kind of strange considering the research done in the reference shared above.

ggplot(hyper_t, aes(x = fbs, y = target, fill=target)) + geom_bar(stat = "identity")+labs(x="Fasting blood sugar", y= "", title = "Desnity Ridges Plot of Fasting Blood Sugar")+theme_bw()+theme(legend.position = "none")

Cholesterol:

High cholesterol and high blood pressure (hypertension) are related. Your arteries hardens and narrows because of cholesterol plaque and calcium. As a result, pumping blood through them requires significantly more effort from your heart. Consequently, Your blood pressure goes up too [4]. Lets check out the summary statistics of our data for the column cholesterol using describe() function.

describe(hyper_t$chol)
##    vars     n   mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 26058 246.29 51.65    240   243.5 47.44 126 564   438  1.1     4.15
##      se
## X1 0.32

Anything above 240 mg/dl when it comes to total cholesterol is very harmful [4] and as we can see that our mean is a bit on higher side but do have a very good variation in the data ranging from minimum of 126 mg/dl to 564 mg/dl. We are not going to create a contingency table since the range is too high and the table is going to be huge and not easy o understand so instead we will plot a ridge density plot to see how cholesterol is distributed in hypertensive and non hypertensive patients:

ggplot(hyper_t, aes(x = chol, y = target, fill=target)) + geom_density_ridges(alpha=.6, color='black')+labs(x="Cholesterol", y="", title = "Density Ridges Plot of Cholesterol ")+theme_bw()+theme(legend.position = "none")

and a general distribution of cholesterol.

ggplot()+
  geom_histogram(data = hyper_t, mapping = aes(x=chol, fill=target), color = "black", bins = 20)+theme_bw()+
  geom_vline(xintercept=mean(hyper_t$chol), color='red')+labs(x="Cholesterol", y="Count",title ="Distribution of Cholesterol (systolic)")

Chest pain:

High blood pressure can damage your arteries by making them less elastic, which decreases the flow of blood and oxygen to your heart and leads to heart disease. In addition, decreased blood flow to the heart can cause, chest pain, also called angina [5]. Since we got the data for chest pain in ordinal form so its better to use it because a patient might not have high blood pressure, blood sugar and cholesterol on the day but what if the data collected was on a day when the hypertensive patient were taking proper precaution to keep those indicators down. So in that case we can refer to angina since angina is caused by permanent hypertension.

hyper_t %>%
  group_by(cp) %>%
  summarise(n = n())
## # A tibble: 4 × 2
##   cp        n
##   <fct> <int>
## 1 0     12314
## 2 1      4456
## 3 2      7392
## 4 3      1896

In the above grouped summary we have 12314 patients with asymptomatic chest pain, 4456 patients with typical angina, 7392 patients with atypical angina and 1896 with no angina pain. Let’s create a contingency table and see how many patients with hypertension is having angina:

xtabs(~target+cp, data = hyper_t)
##                   cp
## target                0    1    2    3
##   Hypertension     3384 3680 5860 1350
##   Non-Hypertension 8930  776 1532  546

As we can see that there are a lot of patients who has no symptoms of chest pain are non-hypertensive. Conversely, high number patients with angina and non angina chest pain are hypertensive. If we go back to our previous statement which stated :patient might not have high blood pressure, blood sugar and cholesterol on the day but what if the data collected was on a day when the hypertensive patient were taking proper precaution to keep those indicators down. So in that case we can refer to angina since angina is caused by permanent hypertension Now in order to prove our point lets filter the data for patients who has angina and see if the means of resting blood pressure for those patients is lower than overall mean. If we get a Yes it means that our assumption was right.

hyper_angina <- hyper_t |>
  filter(cp==1)
summarise(hyper_angina, mean = mean(trestbps))
##       mean
## 1 128.2617

So as we can see that mean of resting blood pressure is lower than overall mean, almost around 80% of patients have angina are hypertensive which kind of proves the point that maybe patients with hypertension had lower blood pressure on the day only.

Chest Vessels:

In our data we have one columns that accounts for the number of vessels narrowed more than 50%. The data ranges from 0 to 4.

describe(hyper_t$ca)
##    vars     n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 26058 0.72 1.01      0    0.53   0   0   4     4 1.29     0.75 0.01
xtabs(~target+ca, data = hyper_t)
##                   ca
## target                 0     1     2     3     4
##   Hypertension     11252  1870   624   258   270
##   Non-Hypertension  3894  3644  2674  1474    98
ggplot(hyper_t, aes(x = ca, fill=target)) + geom_bar( color='black')+labs(x="Chest Vessels", y="", title = "Chest Vessels plot")+theme_bw()

Maximum Heart rate achieved:

describe(hyper_t$thalach)
##    vars     n   mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 26058 149.64 22.87    153  150.94 22.24  71 202   131 -0.52    -0.08
##      se
## X1 0.14
ggplot()+
  geom_histogram(data = hyper_t, mapping = aes(x=thalach, fill=target), color = "black", bins = 20)+theme_bw()+
  geom_vline(xintercept=mean(hyper_t$thalach), color='yellow')+labs(x="Maximum heart rate", y="Count",title ="Distribution of Heart rate")

ggplot(hyper_t, aes(x = thalach, y = target, fill=target)) + geom_density_ridges(alpha=.6, color = "black")+labs(x="Maximum heart rate", y="", title = "Density of hypertensive and non hypertensive patients")+theme_bw()+theme(legend.position = "none")

Exercise Induced Angina:

xtabs(~target+exang, data = hyper_t)
##                   exang
## target                 0     1
##   Hypertension     12280  1994
##   Non-Hypertension  5260  6524

Logistic Regression:

Now we are going to apply logistic regression on our target column as a function any independent column like Blood pressure, Heart rate. Chest pain e.t.c and to see the summary of model and check out the co-efficient. Our focus will be on important independent variable that we explored in EDA but before that we will need to understand how logistic regression works so we will start with sex as an independent variable or a function of hypertension and see how it responds. We already found out through EDA that we have equal number of male and female patients that are equally hypertensive and non- hypertensive but lets apply logistic regression and see what happens:

log_sex <- glm(target ~ sex, data=hyper_t, family="binomial")

Let’s call the summary function on our model and see what happens:

summary(log_sex)
## 
## Call:
## glm(formula = target ~ sex, family = "binomial", data = hyper_t)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.097  -1.097  -1.097   1.260   1.260  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.917e-01  1.760e-02  -10.89   <2e-16 ***
## sexM        -3.772e-15  2.489e-02    0.00        1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35886  on 26057  degrees of freedom
## Residual deviance: 35886  on 26056  degrees of freedom
## AIC: 35890
## 
## Number of Fisher Scoring iterations: 3

We can see that summary contains all the details about model. It start with calling the model, then it portrays deviance residuals, and after that we have the coefficients in which the intercept is log odds of patients having hypertension but the overall equation comes out to be: \[odds of hypertension = -.1917-3.773e-15 * M\] Now this means that if the patient is female: \[odds of hypertension = -.1917-3.773e-15 * 0\] \[odds of hypertension = -.1917\] and odds of male patients having hypertension is: \[odds of hypertension = -.1917-3.773e-15 * 1\] \[odds of hypertension = -.1917-3.773e-15\] \[odds of hypertension = -.1917\]

and as we can see both has same log odds to have hyper tension. The second coefficient is the log odds ratio between male and female which is almost equal to zero and has p = 1, meaning that it has no statistical significant. This was the reason why we suggested not add sex as a predictor or function of hypertension since the data set has been set in a way that we automatically have same off for each possibility. We can also find out probability from log odds. Below I have created a small function that does it for us.

logit_to_probability <- function(logit){
  odds <- exp(logit)
  probability <- odds / (1 + odds)
  return(probability)}
logit_to_probability(-.1917)
## [1] 0.4522212

As we can see that both male and female has same log odds so there respectively probability will be equal too.

Similarly we can check a continuous independent variable like resting blood pressure and check out the coefficients:

log_bps <- glm(target ~ trestbps, data=hyper_t, family="binomial")
summary(log_bps)
## 
## Call:
## glm(formula = target ~ trestbps, family = "binomial", data = hyper_t)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4592  -1.0848  -0.9347   1.2430   1.5034  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.4672669  0.0971890  -25.39   <2e-16 ***
## trestbps     0.0172720  0.0007311   23.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35886  on 26057  degrees of freedom
## Residual deviance: 35307  on 26056  degrees of freedom
## AIC: 35311
## 
## Number of Fisher Scoring iterations: 4

So we can see that our equation comes out to be:

\[logs of hypertension = -2.4672669 + 0.0172720 * Resting BP\] So if hypothetically is blood pressure is 0 the odd of having hypertension are -2.467. We can also plot a probability curve and manually calculate probability, to see how the probability of having heart disease increases with increases in resting blood pressure
Min: Patient with a resting systolic blood pressure of 100 mm of Hg:

logit_to_probability(-.76)
## [1] 0.3186463

As we can see that the probability of having hypertension is around .31

Max: Patient with a resting systolic blood pressure of 200 mm of Hg:

logit_to_probability(.94)
## [1] 0.7190997

We can see that as the resting blood pressure increased the probability went up to .72.

Probability curve:

predicted.data.bps <- data.frame(
  probability.of.target=log_bps$fitted.values,
  trestbps=hyper_t$trestbps)
ggplot(data=predicted.data.bps, aes(x=trestbps, y=probability.of.target)) +
  geom_point(aes(color=trestbps), size=3) +
  xlab("Resting Blood Pressure") +
  ylab("Predicted probability of Hypertension")+labs(title = "Probability Curve for Blood pressure", label ="BP")+theme_bw()

Similarly, we can check out cholesterol

log_chol <- glm(target ~ chol, data=hyper_t, family="binomial")
summary(log_chol)
## 
## Call:
## glm(formula = target ~ chol, family = "binomial", data = hyper_t)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.549  -1.088  -1.017   1.246   1.426  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.9944045  0.0616589  -16.13   <2e-16 ***
## chol         0.0032555  0.0002447   13.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35886  on 26057  degrees of freedom
## Residual deviance: 35706  on 26056  degrees of freedom
## AIC: 35710
## 
## Number of Fisher Scoring iterations: 4

\[logs of hypertension = -.9944 + 0.003255 * cholesterol\]

Min: Probability of patient having 150mg/dl cholesterol:

logit_to_probability(-.505)
## [1] 0.3763664

Max: Probability of patient having 550mg/dl cholesterol:

logit_to_probability(.79585)
## [1] 0.6890861

and the probability curve:

predicted.data.chol <- data.frame(
  probability.of.target=log_chol$fitted.values,
  chol=hyper_t$chol)
ggplot(data=predicted.data.chol, aes(x=chol, y=probability.of.target)) +
  geom_point(aes(color=chol), size=3) +
  xlab("Cholesterol") +
  ylab("Predicted probability of hypertension")+labs(title = "Probability Curve for Cholesterol")+theme_bw()

Similarly, we can do that for all the independent variables and we might find an interesting solution at the end but over here we will try to stick with our question:

Modeling and Validation:

Setting up the data:

Let’s create a model of our data so we will split the data set into testing and training sets. We will use training data set to create a model and then predict the answers using testing data set. We can validate our prediction using confusion matrix:

We will split the data 80/20 and will use 80% of the overall data set (hyper_t) to train our model and we will use the rest of 20% data to test our model, in other words we will perform supervised machine learning using logistic regression:

set.seed(1992)
split <- sample.split(hyper_t, SplitRatio = .8)

Assigning the training and testing data to variables:

hyper_train <- subset(hyper_t, split == "TRUE")
hyper_test <- subset(hyper_t, split == "FALSE")

Model 1:

Now we have got our training and testing data so lets create our machine learning model. But before creating a model we will need to select the independent variable out of pool of variables listed in the data frame. Even though we started of with a question to predict hypertension using sex, age, blood pressure, cholesterol, and heart rate but after carrying out exploratory data analysis we did find that age and sex can not be a good predictor so that reduces our independent variable to three only. Over here we will create a model using heart rate, blood pressure and cholesterol level but after creating a model we will definitely do carry out some cross validation to check the performance of the model and will try to re tweak the model for better accuracy.

log_my <- glm(target ~ thalach+trestbps+chol, data=hyper_train, family="binomial")

Lets use our model to check the prediction. We will use predict() function to predict and save the response in a variable:

resp <- predict(log_my, hyper_test, type = "response")

Confusion matrix:

Now we have got the predictions using our testing data set but in order the evaluate our model and check the accuracy we will create a confusion matrix for it.

con_mat <- table(Actual_value = hyper_test$target, Predicted_value = resp>.5)

Our confusion matrix is ready and lets display the matrix and evaluate the performance of our model

con_mat
##                   Predicted_value
## Actual_value       FALSE TRUE
##   Hypertension      2422  626
##   Non-Hypertension   993 1543

As we can see that our model does not have a very good accuracy we do a quick math we can find out the this model is around 71% accurate which is un acceptable in the medical field. So in order to make our model better we can add more independent variables that can affect the prediction of hypertension in patients:

McFadden R-Square:

Calculating McFadden R-square:

l.null <- log_my$null.deviance/-2
l.proposed <- log_my$deviance/-2
(l.null - l.proposed) / l.null
## [1] 0.1511686

Model 2:

After not a very successful model using only three variables we went back to exploratory data analysis and looked for more independent variables that can improve our model’s performance. This time around we added variables like chest pain, fasting blood sugar, exercise induced angina, and number of chest vessels narrowed by more than 50%. We did feed above mention variables to our model and evaluated it for performance.

Here is our model:

log_def <- glm(target ~ cp+slope+thalach+thal+ca+trestbps+fbs+exang, data=hyper_train, family="binomial")

Let’s predict using testing data:

resp_n <- predict(log_def, hyper_test, type = "response")

Confusion matrix:

We have the predicted and actual lets create a confusion matrix for our model

con_mat_n <- table(Actual_value = hyper_test$target, Predicted_value = resp_n>.5)

Here is our confusion matrix:

con_mat_n
##                   Predicted_value
## Actual_value       FALSE TRUE
##   Hypertension      2765  283
##   Non-Hypertension   430 2106

We can see that our accuracy has now jumped up to 87%. Which is much better than our previous model but still might get disregarded since the data set in from medical field and the accuracy require in the field of medical sciences is in higher 90s (%).

McFadden R-Square:

Calculating McFadden R-square:

ll.null <- log_def$null.deviance/-2
ll.proposed <- log_def$deviance/-2
(ll.null - ll.proposed) / ll.null
## [1] 0.4710719

Model 3:

We did add a lot of variables in our previous model and we did see some improvement, so does this mean that the more variables we put in the better it will get? To find the answer lets use all the independent variables from the data set and check out the performance of the model. We will start with creating a model:

log_all <- glm(target ~ ., data=hyper_t, family="binomial")

Now our model is ready under the name of log_all. Let’s apply our model to the testing data set and create the confusion matrix. We can also calculate the accuracy from the confusion matrix.

resp_all <- predict(log_all, hyper_test, type = "response")

Confusion matrix:

Our predictions are ready to be evaluated so let’s create a confusion matrix:

con_mat_all <- table(Actual_value = hyper_test$target, Predicted_value = resp_all>.5)

Calling our confusion matrix:

con_mat_all
##                   Predicted_value
## Actual_value       FALSE TRUE
##   Hypertension      2756  292
##   Non-Hypertension   500 2036

As we can see the accuracy is around 85.6% which again not great and I would recommend to use model 2 rather than model 3. Since model 2 and less independent variables and we still get an accuracy which is a shade over model 3 with all the variables.

McFadden R-Square:

Calculating McFadden R-square:

lll.null <- log_all$null.deviance/-2
lll.proposed <- log_all$deviance/-2
(lll.null - lll.proposed) / lll.null
## [1] 0.5003144

Attempt at Shiny App:

# Import libraries
library(shiny)
library(shinythemes)
library(data.table)


####################################
# Model                            #
####################################



log_def <- glm(target ~ cp+ca+thalach+thal+trestbps+slope+fbs+exang, data=hyper_train, family="binomial")



####################################
# User interface                   #
####################################

ui <- fluidPage(theme = shinytheme("united"),
  
  # Page header
  headerPanel('Hypertension'),
  
  # Input values
  sidebarPanel(
    HTML("<h3>Input parameters</h3>"),
    
    sliderInput("ca", label = "Chest Vessel:", min = 0, max=4,value=1),
    selectInput("cp", label = "Chest Pain",
                choices = list("Asympomatic" = 0, "Angina" = 1, "Atypical Angina" = 2, "Non-Anginal" = 3), 
                selected = 0),
    sliderInput("trestbps", "Systolic Blood pressure:",
                min = 94, max = 200,
                value = 115),
    sliderInput("thalac", "Heart Rate:",
                min = 71, max = 202,
                value = 90),
    selectInput("thal", label = "Thalassemia type", 
                choices = list("Normal" = 3, "Fixed defect" = "6","Reversible defect"=7), 
                selected = 3),
    selectInput("slope", label = "Slope of the peak exercise ST segment", 
                choices = list("Unsloping" = 0, "Flat" = 1,"Downsloping"=2), 
                selected = 1),
    selectInput("fbs", label = "Fasting Blood sugar over 120 mg/dl:", 
                choices = list("Yes" = 1, "No" = 0), 
                selected = 0),
    selectInput("exang", label = "Exercise Induced Angina:", 
                choices = list("Yes" = 1, "No" = 0), 
                selected = 1),
    
    actionButton("submitbutton", "Submit", class = "btn btn-primary")
  ),
  
  mainPanel(
    tags$label(h3('Status/Output')), # Status/Output Text Box
    verbatimTextOutput('contents'),
    tableOutput('tabledata') # Prediction results table
    
  )
)

####################################
# Server                           #
####################################

server <- function(input, output, session) {

  # Input Data
  datasetInput <- reactive({  
    
  # outlook,temperature,humidity,windy,play
length
  df <- data.frame(
    Name = c("ca",
             "cp",
             "trestbps",
             "thalach",
             "thal",
             "slope",
             "fbs",
             "exang"),
    Value = as.character(c(input$ca,
                           input$cp, input$trestbps, input$thalach,
                           input$thal,
                           input$slope,
                           input$fbs, input$exang)),
    stringsAsFactors = FALSE)
  
 # target <- "target"
 # df <- rbind(df, target)
 # input <- transpose(df)
 # write.table(input,"input.csv", sep=",", quote = FALSE, row.names = FALSE, col.names = FALSE)
  
  #test <- hyper_test|>
  #  select(ca,cp,trestbps,thalach,thal,slope,fbs,exang,target)
  
  #test$outlook <- factor(test$outlook, levels = c("overcast", "rainy", "sunny"))
  
resp_n <- predict(log_def, hyper_test, type = "response")  

  
con_mat_n <- table(Actual_value = hyper_test$target, Predicted_value = resp_n>.5)

Output <- con_mat_n
  
  })
  
  # Status/Output Text Box
  output$contents <- renderPrint({
    if (input$submitbutton>0) { 
      isolate("Calculation complete.") 
    } else {
      return("Server is ready for calculation.")
    }
  })
  
  # Prediction results table
  output$tabledata <- renderTable({
    if (input$submitbutton>0) { 
      isolate(datasetInput()) 
    } 
  })
  
}

####################################
# Create the shiny app             #
####################################
shinyApp(ui = ui, server = server)

Probalility curve of model 2:

predicted.data.fit <- data.frame(
 probability.of.target=log_def$fitted.values,
  target=hyper_train$target)
predicted.data.fit <- predicted.data.fit[
order(predicted.data.fit$probability.of.target, decreasing=FALSE),]
predicted.data.fit$rank <- 1:nrow(predicted.data.fit)
ggplot(data=predicted.data.fit, aes(x=rank, y=probability.of.target)) +
  geom_point(aes(color=target), alpha=.5, shape=4, stroke=2) +
  xlab("Index") +
  ylab("Predicted probability of hypertension")+labs(title = "Probability curve over training index")+theme_bw()+theme(legend.position ='none')

The blue points above .5 are hypertensive and red below .5 probability are non hypertensive.

Conclusion:

We did start off with a question to check if resting blood pressure, cholesterol, age, sex and heart rate predict presence of hypertension but there were some other factors involved too. Before starting this study a general thought/assumption was built around a narrative that whoever has a high blood pressure is always a hypertension patient, in fact the words “hypertension” and “High blood pressure” were used interchangeably but after carrying out this research a lot of other factors that contribute towards hypertension were revealed. Factors like type of chest pain, narrowing of chest vessel more than 50%, heart rate and fasting blood sugar were playing key roles in predicting presence of hypertension among patients. When we used a model based on variables that were mentioned in the question of our research, the accuracy to predict hypertension came out to be only 71% which is way below the par in medical field but upon adding more variables like chest pain, chest vessels, fasting blood sugar e.t.c the accuracy went up to 87% which meant that all of those factors/variables were contributing towards predicting the hypertension. Apart from all this, other interesting finds were patients with low resting blood pressure and still subjected to hypertension, people from different ages and different level of cholesterol having hypertension.