library(tidyverse)
library(psych)
Data Preparation:
The data set I found was already in tabular form, stored in a GitHub repository. It did require some wrangling and transformation, which I will include in the final project report. Here I will highlight some of the issues that need to be fixed before starting the analysis.
# Let's load data
hyper_t <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/hypertension_data.csv")
Let’s display the first few rows of the data to check that everything loaded as expected:
knitr::kable(head(hyper_t))
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 57 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 64 | 0 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 52 | 1 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 56 | 0 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 66 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| 51 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 | 0.4 | 1 | 0 | 1 | 1 |
Let’s check the data type of each column:
str(hyper_t)
## 'data.frame': 26083 obs. of 14 variables:
## $ age : num 57 64 52 56 66 51 42 38 72 47 ...
## $ sex : num 1 0 1 0 0 1 0 0 0 0 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
We can see that some columns, such as sex, cp, ca and thal, are stored as numbers or integers but should be factors, so we need to convert them before starting the analysis.
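One way to make this conversion is sketched below. It is a minimal illustration, not the final wrangling code: `hyper_toy` is a small stand-in built from the first rows displayed above so the snippet runs on its own, and `cat_cols` is a name introduced here for illustration. The same `mutate()` call could be applied to `hyper_t`.

```r
library(dplyr)

# Toy stand-in for hyper_t, copied from the first four rows shown above
hyper_toy <- data.frame(
  age  = c(57, 64, 52, 56),
  sex  = c(1, 0, 1, 0),
  cp   = c(3L, 2L, 1L, 1L),
  ca   = c(0L, 0L, 0L, 0L),
  thal = c(1L, 2L, 2L, 2L)
)

# Columns that hold category codes rather than measurements
cat_cols <- c("sex", "cp", "ca", "thal")

# Convert every listed column to a factor in one step
hyper_toy <- hyper_toy |>
  mutate(across(all_of(cat_cols), as.factor))

str(hyper_toy)
```

Listing the columns in one vector keeps the conversion in a single place, so adding another categorical column later only means extending `cat_cols`.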
Let’s check whether the data frame has any missing values.
sum(is.na(hyper_t))
## [1] 25
We can see that the number of missing values (25) is small relative to the 26,083 observations, but we still need to address them. In addition, we can rename the variables (columns) to more meaningful names so that readers can follow the whole process more easily.
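Both clean-up steps can be sketched as follows. This is only an illustration under stated assumptions: `hyper_toy` is a toy stand-in with one artificially inserted missing value, dropping incomplete rows is just one possible way to handle them (imputation is another), and the new column names are hypothetical suggestions, not names from the original data.

```r
library(dplyr)

# Toy stand-in for hyper_t; the NA in sex is inserted for illustration
hyper_toy <- data.frame(
  age      = c(57, 64, 52, 56),
  sex      = c(1, NA, 1, 0),
  trestbps = c(145L, 130L, 130L, 120L),
  chol     = c(233L, 250L, 204L, 236L),
  thalach  = c(150L, 187L, 172L, 178L)
)

# Count missing values per column to see where they are concentrated
na_per_column <- colSums(is.na(hyper_toy))
print(na_per_column)

# Drop incomplete rows, then give columns more descriptive (hypothetical) names
hyper_clean <- hyper_toy |>
  na.omit() |>
  rename(
    resting_bp  = trestbps,
    cholesterol = chol,
    heart_rate  = thalach
  )

names(hyper_clean)
```

Checking `colSums(is.na(...))` first shows whether the missing values sit in a single column or are spread across several, which helps in deciding how to handle them.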
Research question:
The purpose of this study is to examine how variables such as resting blood pressure, cholesterol, age, sex and heart rate contribute to hypertension, and to formulate a predictive model using logistic regression.
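As a rough sketch of what such a model could look like in R, the snippet below fits a logistic regression with `glm()`. Everything in it is an assumption made for illustration: the data frame `sim` is simulated (it is not the hypertension data), the coefficients used to generate the outcome are arbitrary, and the final model may use a different set of predictors.

```r
set.seed(123)  # make the simulated data reproducible

# Simulated stand-in for the cleaned data (NOT the real hypertension data)
n <- 200
sim <- data.frame(
  age      = round(runif(n, 30, 80)),
  sex      = factor(sample(c(0, 1), n, replace = TRUE)),
  trestbps = round(rnorm(n, 130, 15)),
  chol     = round(rnorm(n, 240, 40)),
  thalach  = round(rnorm(n, 150, 20))
)

# Simulated binary outcome that loosely depends on age and blood pressure
log_odds   <- -8 + 0.04 * sim$age + 0.04 * sim$trestbps
sim$target <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: binomial family with the default logit link
fit <- glm(target ~ age + sex + trestbps + chol + thalach,
           data = sim, family = binomial)

summary(fit)

# Predicted probability of the outcome for each simulated case
pred_prob <- predict(fit, type = "response")
head(pred_prob)
```

The coefficients reported by `summary(fit)` are on the log-odds scale; `exp(coef(fit))` converts them to odds ratios, which are often easier to interpret.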
Cases:
There are about 26K cases in total in this data set (26,083 observations). Each case represents a patient along with their medical condition, described by measures such as blood pressure, chest pain level, age, cholesterol level, etc.
Note: The identities of the patients are not disclosed, as per HIPAA.
Data collection:
The data was collected by the CDC in its 2015 BRFSS Survey Data and Documentation.
Type of study:
This is an observational study.
Data Source:
The data used for this study comes from Kaggle, one of the largest data resources for data scientists. The link below leads to the data set on Kaggle:
https://www.kaggle.com/datasets/prosperchuks/health-dataset?select=hypertension_data.csv
Dependent Variable:
The dependent variable is the target column, which contains 0 and 1. A value of 0 means that the patient does not have hypertension, while 1 means that the patient does. Its data type should be a factor rather than an integer.
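The recoding of the outcome can be sketched as below. The vector `target` is a short made-up example, not values from the data, and the labels `no_hypertension` and `hypertension` are hypothetical names chosen for readability.

```r
# Illustrative 0/1 outcome vector (not taken from the real data)
target <- c(1L, 1L, 0L, 1L, 0L)

# Recode to a labelled factor; the level order keeps 0 as the first level
target_f <- factor(target,
                   levels = c(0, 1),
                   labels = c("no_hypertension", "hypertension"))

levels(target_f)
table(target_f)
```

The level order matters for modelling: with a factor response, `glm()` with a binomial family treats the first level as failure, so keeping 0 first means the model describes the probability of hypertension.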
Independent Variable:
The independent variables are columns such as cholesterol (chol), resting blood pressure (trestbps), heart rate (thalach), age and sex.
Relevant Summary Statistics:
Let’s compute some basic summary statistics for the independent variables.
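For a compact overview, the same kind of statistics can also be computed for several numeric columns at once. The sketch below uses base R only and is an alternative to calling `describe()` column by column, not a replacement for it. `hyper_toy` holds the first five rows displayed earlier, and `basic_stats()` is a hypothetical helper defined here for illustration.

```r
# Toy stand-in for hyper_t, copied from the first five rows shown earlier
hyper_toy <- data.frame(
  age      = c(57, 64, 52, 56, 66),
  trestbps = c(145, 130, 130, 120, 120),
  chol     = c(233, 250, 204, 236, 354),
  thalach  = c(150, 187, 172, 178, 163)
)

# Hypothetical helper: a few summary statistics for one numeric vector
basic_stats <- function(x) {
  c(n    = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
}

# One row per variable, one column per statistic
stats_table <- t(sapply(hyper_toy, basic_stats))
round(stats_table, 2)
```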
Age of respondents:
describe(hyper_t$age)
We can see that the data was collected over a wide range of ages, from 11 years old all the way to 98 years old, with a mean age of 55. Similarly, we can check how many males and females responded.
Gender of respondents:
# Split the respondents by recorded sex (1 = male, 0 = female)
male <- hyper_t |>
  filter(sex == 1)
female <- hyper_t |>
  filter(sex == 0)
print(paste0("The number of male: ", nrow(male)))
## [1] "The number of male: 13029"
print(paste0("The number of female: ", nrow(female)))
## [1] "The number of female: 13029"
The data covers an equal number of males and females (13,029 each), with 25 individuals not specifying their gender/sex.
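The missing entries can be counted in the same step as the two groups. The sketch below uses a short made-up vector `sex` (not the real column) to show how `table()` reports missing values alongside the 0 and 1 codes.

```r
# Illustrative sex codes with two missing entries (not the real data)
sex <- c(1, 0, 1, NA, 0, 1, NA, 0)

# useNA = "ifany" adds an <NA> cell so missing values are counted too
sex_counts <- table(sex, useNA = "ifany")
print(sex_counts)
```

Applied to `hyper_t$sex`, this gives all three counts in one call, without creating separate `male` and `female` data frames.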
Resting Blood Pressure:
describe(hyper_t$trestbps)
Cholesterol:
describe(hyper_t$chol)
Heart rate:
describe(hyper_t$thalach)
Chest pain:
describe(hyper_t$cp)
Plotting the data:
Resting Blood Pressure:
ggplot() +
  geom_histogram(data = hyper_t, mapping = aes(x = trestbps), bins = 30) +
  theme_bw()
Cholesterol:
ggplot() +
  geom_histogram(data = hyper_t, mapping = aes(x = chol), bins = 45) +
  theme_bw()
Heart rate:
ggplot() +
  geom_histogram(data = hyper_t, mapping = aes(x = thalach), bins = 50) +
  theme_bw()
Age:
ggplot() +
  geom_histogram(data = hyper_t, mapping = aes(x = age), bins = 45) +
  theme_bw()
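The four histograms above could also be drawn as panels of a single figure. The following is a minimal sketch of that alternative, assuming the data is first reshaped to long format. `hyper_toy` contains only the first six rows displayed earlier, and the names `hyper_long`, `variable` and `value` are introduced here for illustration.

```r
library(ggplot2)
library(tidyr)

# Toy stand-in for hyper_t, copied from the first six rows shown earlier
hyper_toy <- data.frame(
  age      = c(57, 64, 52, 56, 66, 51),
  trestbps = c(145, 130, 130, 120, 120, 140),
  chol     = c(233, 250, 204, 236, 354, 192),
  thalach  = c(150, 187, 172, 178, 163, 148)
)

# Reshape to long format: one row per (variable, value) pair
hyper_long <- pivot_longer(hyper_toy,
                           cols = everything(),
                           names_to = "variable",
                           values_to = "value")

# One histogram per variable; free x scales because the units differ
p <- ggplot(hyper_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free_x") +
  theme_bw()

print(p)
```

A faceted figure uses one bin count for all panels, so the separate plots remain the better choice when each variable needs its own `bins` setting.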