DATA 606 Data Project Proposal

library(tidyverse)
library(psych)

Data Preparation:

The data set I found was already in tabular form, stored in a GitHub repository. It does require some wrangling and transformation, which I will include in the final project report. Here I will highlight some of the issues that need to be fixed before starting the analysis.

# Let's load data
hyper_t <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/hypertension_data.csv")

Let’s display the first few rows of the data to check that everything loaded as expected:

knitr::kable(head(hyper_t))

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	thal	target
57	1	3	145	233	1	0	150	0	2.3	0	1	1
64	0	2	130	250	0	1	187	0	3.5	0	2	1
52	1	1	130	204	0	0	172	0	1.4	2	2	1
56	0	1	120	236	0	1	178	0	0.8	2	2	1
66	0	0	120	354	0	1	163	1	0.6	2	2	1
51	1	0	140	192	0	1	148	0	0.4	1	1	1

Let’s check the data type of each column:

str(hyper_t)

## 'data.frame':    26083 obs. of  14 variables:
##  $ age     : num  57 64 52 56 66 51 42 38 72 47 ...
##  $ sex     : num  1 0 1 0 0 1 0 0 0 0 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

We can see that some columns, such as sex, cp, ca and thal, should be factors rather than numeric or integer, so we need to convert them before starting the analysis.
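
A minimal sketch of that conversion, run on a small made-up data frame (`toy`) rather than on the real `hyper_t`. The choice of columns mirrors the four named above and is an assumption about which ones should be treated as categorical. It uses base R only, so it runs without extra packages:

```r
# Toy stand-in for hyper_t (values are made up for illustration)
toy <- data.frame(
  age  = c(57, 64, 52, 56),
  sex  = c(1, 0, 1, 0),
  cp   = c(3L, 2L, 1L, 1L),
  ca   = c(0L, 0L, 0L, 0L),
  thal = c(1L, 2L, 2L, 2L)
)

# Columns that encode categories rather than quantities
cat_cols <- c("sex", "cp", "ca", "thal")

# Convert each categorical column to a factor, leaving the rest untouched
toy[cat_cols] <- lapply(toy[cat_cols], factor)

str(toy)
```

The same pattern should carry over to `hyper_t` once the final list of categorical columns is settled.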

Let’s check whether the data frame has any missing values.

sum(is.na(hyper_t))

## [1] 25

We can see that the number of missing values is small, but we need to address them too. In addition, we can rename the variables (columns) to more meaningful names so that readers can follow the whole process more easily.
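
A short sketch of both steps on a made-up data frame (`toy`), not the real data. Dropping incomplete rows is only one possible way to handle the gaps, and the new column names (`resting_bp`, `cholesterol`) are illustrative placeholders, not final choices:

```r
# Toy stand-in for hyper_t; the NA in sex mimics the kind of gap reported above
toy <- data.frame(
  age      = c(57, 64, 52, 56),
  sex      = c(1, NA, 1, 0),
  trestbps = c(145L, 130L, 130L, 120L),
  chol     = c(233L, 250L, 204L, 236L)
)

# Count missing values per column to see where the gaps are
na_per_col <- colSums(is.na(toy))
print(na_per_col)

# One simple option: keep only the rows with no missing values
toy_complete <- toy[complete.cases(toy), ]

# Rename cryptic columns; the new names are illustrative, not final
names(toy_complete)[names(toy_complete) == "trestbps"] <- "resting_bp"
names(toy_complete)[names(toy_complete) == "chol"]     <- "cholesterol"
```

Counting per column, rather than taking a single overall sum, shows which variable the missing values belong to before deciding how to treat them.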

Research question:

The purpose of this study is to examine how variables such as resting blood pressure, cholesterol, age, sex and heart rate contribute to hypertension, and to build a predictive model using logistic regression.
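
As a rough sketch of the kind of model this question points to, the block below fits a logistic regression with base R's `glm()` on simulated data. Nothing here comes from the real data set: the sample size, the distributions and the coefficients used to generate the outcome are arbitrary assumptions, and only the column names follow `hyper_t`.

```r
set.seed(606)  # make the simulation reproducible

# Simulated predictors (NOT the real data); names follow the columns of hyper_t
n <- 500
sim <- data.frame(
  age      = round(runif(n, 30, 80)),
  sex      = factor(rbinom(n, 1, 0.5)),
  trestbps = round(rnorm(n, 130, 15)),
  chol     = round(rnorm(n, 240, 40)),
  thalach  = round(rnorm(n, 150, 20))
)

# Simulated 0/1 outcome whose log-odds rise with resting blood pressure
# (intercept and slope are arbitrary values chosen for illustration)
log_odds   <- -6.5 + 0.05 * sim$trestbps
sim$target <- rbinom(n, 1, plogis(log_odds))

# Logistic regression: binomial family with the default logit link
fit <- glm(target ~ age + sex + trestbps + chol + thalach,
           data = sim, family = binomial)

summary(fit)

# Predicted probability of target == 1 for each simulated row
sim$p_hat <- predict(fit, type = "response")
```

Each coefficient in `summary(fit)` is on the log-odds scale, so `exp(coef(fit))` gives odds ratios, which are usually easier to interpret.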

Cases:

There are about 26K cases (26,083 rows) in this data set. Each case represents a patient, described by medical measures such as blood pressure, chest pain level, age and cholesterol level.

Note: Patient identities are not disclosed, as per HIPAA.

Data collection:

The data was collected by the CDC as part of its 2015 BRFSS Survey Data and Documentation.

Type of study:

This is an observational study.

Data Source:

The data used for this study comes from Kaggle, one of the largest data resources for data scientists. The link below leads to the data set on Kaggle:

https://www.kaggle.com/datasets/prosperchuks/health-dataset?select=hypertension_data.csv

Dependent Variable:

The dependent variable is the target column, which contains 0 and 1: 0 means that the patient does not have hypertension, while 1 means that the patient does. The data type is supposed to be a factor rather than an integer.
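
A minimal sketch of that recoding on a made-up vector standing in for `hyper_t$target`. The label text is an illustrative choice:

```r
# Toy outcome vector standing in for hyper_t$target
target <- c(1L, 0L, 1L, 1L, 0L)

# Recode 0/1 as a labelled factor; the label text is illustrative
target_f <- factor(target,
                   levels = c(0, 1),
                   labels = c("no hypertension", "hypertension"))

table(target_f)
```

Keeping "no hypertension" as the first level matters for modelling: when `glm()` with a binomial family is given a factor response, it treats the first level as failure and the remaining level as success.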

Independent Variable:

The independent variables are columns such as cholesterol (chol), resting blood pressure (trestbps), heart rate (thalach), age and sex.

Relevant Summary Statistics:

Let’s compute some basic summary statistics for the independent variables.

Age of respondents:

describe(hyper_t$age)

We can see that the data was collected over a wide range of ages, from 11 years old all the way to 98 years old, with a mean age of 55. Similarly, we can check how many males and females responded.

Gender of respondents:

# Split respondents by sex (1 = male, 0 = female); rows with a missing sex fall in neither group
male <- hyper_t |>
  filter(sex == 1)
female <- hyper_t |>
  filter(sex == 0)
print(paste0("The number of male: ", nrow(male)))

## [1] "The number of male: 13029"

print(paste0("The number of female: ",nrow(female)))

## [1] "The number of female: 13029"

The data covers an equal number of males and females (13,029 each), with 25 individuals not specifying their gender/sex.
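
The three counts can also be obtained in a single call. The sketch below uses a short made-up vector in place of `hyper_t$sex`, with the same coding (1 = male, 0 = female) and `NA` for respondents whose sex is not recorded:

```r
# Toy stand-in for hyper_t$sex (1 = male, 0 = female, NA = not recorded)
sex <- c(1, 0, 1, NA, 0, 1, NA)

# useNA = "ifany" adds a separate <NA> count when missing values exist
sex_counts <- table(sex, useNA = "ifany")
print(sex_counts)
```

This avoids building separate `male` and `female` data frames and makes the missing group visible alongside the other two.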

Resting Blood Pressure:

describe(hyper_t$trestbps)

Cholesterol:

describe(hyper_t$chol)

Heart rate:

describe(hyper_t$thalach)

Chest pain:

describe(hyper_t$cp)
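
Because cp is a category code rather than a measurement, a frequency table may be a more informative summary than a mean or standard deviation. A small sketch, using only the first ten cp values visible in the `str()` output earlier, so the counts say nothing about the full data set:

```r
# First ten cp values as shown in the str() output (not the full column)
cp <- c(3L, 2L, 1L, 1L, 0L, 0L, 1L, 1L, 2L, 2L)

cp_counts <- table(cp)              # number of cases per category
cp_share  <- prop.table(cp_counts)  # proportion of cases per category

print(cp_counts)
print(round(cp_share, 2))
```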

Plotting the data: