Applied Analytics Assignment 2

Heart Disease Patients Analysis

Denis Bharatbhai Vaghasia - s3858391

Last updated: 28 May, 2021

Introduction

Problem Statement

Data

Data Cont.

Descriptive Statistics and Visualisation

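library(dplyr)  # provides %>%, mutate(), group_by() and summarise()
library(car)    # provides qqPlot()

heart_patient_data <- read.csv('heart_disease_patients.csv')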

heart_patient_data <- heart_patient_data %>% mutate(
sex = factor(sex, levels=c(1,0), labels=c('Male','Female')),
fbs = factor(fbs, levels=c(1,0), labels=c('Yes','No')),
exang = factor(exang, levels=c(1,0), labels=c('Yes','No')),
cp = factor(cp, levels = c(1,2,3,4), labels=c(1,2,3,4), ordered=TRUE))

str(heart_patient_data)
## 'data.frame':    303 obs. of  12 variables:
##  $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 2 2 1 1 ...
##  $ cp      : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 1 ...
##  $ restecg : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : Factor w/ 2 levels "Yes","No": 2 1 1 2 2 2 2 1 2 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : int  3 2 2 3 1 1 3 1 2 3 ...
sum(is.na(heart_patient_data))
## [1] 0

Descriptive Statistics and Visualisation Cont.

heart_patient_data %>% group_by(sex) %>% summarise(Min = min(chol, na.rm = TRUE),
                                  Q1 = quantile(chol, probs = 0.25, na.rm = TRUE),
                                  Median = median(chol, na.rm = TRUE),
                                  Q3 = quantile(chol, probs = 0.75, na.rm = TRUE),
                                  Max = max(chol, na.rm = TRUE),
                                  Mean = mean(chol, na.rm = TRUE),
                                  SD = sd(chol, na.rm = TRUE),
                                  n = n(),
                                  Missing = sum(is.na(chol))) -> table_chol
knitr::kable(table_chol)

|sex    | Min|     Q1| Median|    Q3| Max|     Mean|       SD|   n| Missing|
|:------|---:|------:|------:|-----:|---:|--------:|--------:|---:|-------:|
|Male   | 126| 208.75|    235| 268.5| 353| 239.6019| 42.64976| 206|       0|
|Female | 141| 215.00|    254| 302.0| 564| 261.7526| 64.90089|  97|       0|

heart_patient_data %>% boxplot(chol ~ sex, data = ., main = 'Box Plot of Patient Cholesterol by Sex', ylab = 'Cholesterol Level', xlab = 'Sex', col = '#1ABC9C')
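
As a rough cross-check of what the box plot should show, the quartiles in the table above can be turned into Tukey's 1.5 × IQR fences. This is only a sketch: the numbers are copied from the table, the data frame `chol_summary` and its column names are made up for illustration, and `boxplot()` computes its hinges slightly differently from `quantile()`, so the plotted whiskers may not match these fences exactly.

```r
# Quartiles and extremes copied from the cholesterol-by-sex table above.
# 'chol_summary' is an illustrative name, not an object from the analysis.
chol_summary <- data.frame(
  sex = c("Male", "Female"),
  min = c(126, 141),
  q1  = c(208.75, 215.00),
  q3  = c(268.5, 302.0),
  max = c(353, 564)
)

# Tukey's rule: values more than 1.5 * IQR beyond the quartiles are flagged.
iqr <- chol_summary$q3 - chol_summary$q1
chol_summary$lower_fence <- chol_summary$q1 - 1.5 * iqr
chol_summary$upper_fence <- chol_summary$q3 + 1.5 * iqr

# Does the group minimum or maximum fall outside the fences?
chol_summary$has_low_outlier  <- chol_summary$min < chol_summary$lower_fence
chol_summary$has_high_outlier <- chol_summary$max > chol_summary$upper_fence

print(chol_summary)
```

Under these assumptions only the female group has a maximum (564) beyond its upper fence, so high cholesterol outliers are expected on the female side of the plot.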

Descriptive Statistics and Visualisation Cont.

heart_patient_data %>% summarise(Min = min(trestbps, na.rm = TRUE),
                                  Q1 = quantile(trestbps, probs = 0.25, na.rm = TRUE),
                                  Median = median(trestbps, na.rm = TRUE),
                                  Q3 = quantile(trestbps, probs = 0.75, na.rm = TRUE),
                                  Max = max(trestbps, na.rm = TRUE),
                                  Mean = mean(trestbps, na.rm = TRUE),
                                  SD = sd(trestbps, na.rm = TRUE),
                                  n = n(),
                                  Missing = sum(is.na(trestbps))) -> table_trestbps
knitr::kable(table_trestbps)

| Min|  Q1| Median|  Q3| Max|     Mean|       SD|   n| Missing|
|---:|---:|------:|---:|---:|--------:|--------:|---:|-------:|
|  94| 120|    130| 140| 200| 131.6898| 17.59975| 303|       0|

heart_patient_data %>% plot(trestbps ~ age, data = ., xlab = 'Patient Age', ylab = 'Resting Blood Pressure (mm Hg)')

Descriptive Statistics and Visualisation Cont.

heart_patient_data %>% summarise(Min = min(thalach, na.rm = TRUE),
                                  Q1 = quantile(thalach, probs = 0.25, na.rm = TRUE),
                                  Median = median(thalach, na.rm = TRUE),
                                  Q3 = quantile(thalach, probs = 0.75, na.rm = TRUE),
                                  Max = max(thalach, na.rm = TRUE),
                                  Mean = mean(thalach, na.rm = TRUE),
                                  SD = sd(thalach, na.rm = TRUE),
                                  n = n(),
                                  Missing = sum(is.na(thalach))) -> table_thalach
knitr::kable(table_thalach)

| Min|    Q1| Median|  Q3| Max|     Mean|     SD|   n| Missing|
|---:|-----:|------:|---:|---:|--------:|------:|---:|-------:|
|  71| 133.5|    153| 166| 202| 149.6073| 22.875| 303|       0|

heart_patient_data$thalach %>% hist(col = '#F39C12', xlim = c(50, 250), xlab = "Maximum heart rate achieved", main = "Histogram of Maximum Heart Rate Achieved")
heart_patient_data$thalach %>% mean() %>% abline(v=., col="#1C2833", lwd=2, lty=5)
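
The three summary tables above repeat the same nine statistics for different columns. One way to avoid the repetition is a small helper built on dplyr's `{{ }}` embracing. This is only a sketch: `summarise_numeric` is a hypothetical name that does not appear in the original analysis, the 75th percentile is labelled `Q3` here, and the built-in `mtcars` data stands in for the heart data so the example runs on its own.

```r
library(dplyr)

# Hypothetical helper (not part of the original analysis): returns the nine
# summary statistics used in the tables above for one numeric column.
# Works on grouped or ungrouped data frames.
summarise_numeric <- function(data, var) {
  data %>%
    summarise(
      Min     = min({{ var }}, na.rm = TRUE),
      Q1      = unname(quantile({{ var }}, probs = 0.25, na.rm = TRUE)),
      Median  = median({{ var }}, na.rm = TRUE),
      Q3      = unname(quantile({{ var }}, probs = 0.75, na.rm = TRUE)),
      Max     = max({{ var }}, na.rm = TRUE),
      Mean    = mean({{ var }}, na.rm = TRUE),
      SD      = sd({{ var }}, na.rm = TRUE),
      n       = n(),
      Missing = sum(is.na({{ var }}))
    )
}

# Demonstrated on the built-in mtcars data so the sketch is self-contained.
overall  <- mtcars %>% summarise_numeric(mpg)
by_group <- mtcars %>% group_by(cyl) %>% summarise_numeric(mpg)

print(overall)
print(by_group)
```

With the heart data loaded, the equivalent calls would be `heart_patient_data %>% group_by(sex) %>% summarise_numeric(chol)` and `heart_patient_data %>% summarise_numeric(trestbps)`.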

- We examined three attributes: cholesterol level by sex, the relationship between resting blood pressure and age, and the distribution of maximum heart rate achieved.
- Female patients have higher cholesterol on average (mean 261.8 versus 239.6 for males) with a wider spread, and patients older than 40 tend to show higher resting blood pressure. This suggests that patients over 40 may be at greater risk of heart disease and that elevated cholesterol is more common among the female patients in this sample. Most patients reached a maximum heart rate between 120 and 200, which we treat, together with the factors above, as a further possible indicator of heart disease.

Hypothesis Testing

heart_patient_data$chol %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   126.0   211.0   241.0   246.7   275.0   564.0
heart_patient_data$chol %>% qqPlot(dist = 'norm')

## [1] 153  49

Hypothesis Testing Cont.

\[H_0: \mu = 240 \qquad H_A: \mu \neq 240\]

- We use the confidence interval approach. For a one-sample t-test, checking whether the hypothesised mean falls inside the 95% CI is equivalent to a two-tailed hypothesis test at the 0.05 significance level.
- We therefore calculate a 95% CI around the sample mean of 246.7. Because the population standard deviation is unknown, we estimate the standard error as s/sqrt(n) and use the t distribution. The code below calculates the 95% CI.

t.test(heart_patient_data$chol, conf.level = .95)$conf.int
## [1] 240.8397 252.5465
## attr(,"conf.level")
## [1] 0.95
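
For reference, the interval printed above follows the standard one-sample t interval. The critical value and standard error shown here are not printed in the original output; they are backed out from the reported mean and interval (with n = 303) and are rounded, so treat them as approximate.

\[
\bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
\;\approx\; 246.69 \pm 1.968 \times 2.975
\;\approx\; (240.84,\ 252.55)
\]

The reference value 240 lies just below the lower limit, so a two-sided test at the 5% level would reject a population mean of 240.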
t.test(heart_patient_data$chol, mu = 240, alternative = 'two.sided')
## 
##  One Sample t-test
## 
## data:  heart_patient_data$chol
## t = 2.2501, df = 302, p-value = 0.02516
## alternative hypothesis: true mean is not equal to 240
## 95 percent confidence interval:
##  240.8397 252.5465
## sample estimates:
## mean of x 
##  246.6931
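
The test statistic and p-value above can be cross-checked using only the numbers already printed. This is a sketch, not a replacement for `t.test()`: the standard error is backed out from the reported confidence interval rather than computed from the raw data, the variable names are illustrative, and rounding in the printed values means the results agree only approximately.

```r
# Values copied from the t.test() output above.
x_bar <- 246.6931                # sample mean of chol
ci    <- c(240.8397, 252.5465)   # reported 95% confidence interval
n     <- 303                     # number of patients
mu_0  <- 240                     # hypothesised mean
df    <- n - 1

# Back out the standard error from the CI half-width:
# half-width = t_crit * SE, so SE = half-width / t_crit.
t_crit  <- qt(0.975, df)
std_err <- diff(ci) / 2 / t_crit

# One-sample t statistic and two-sided p-value.
t_stat  <- (x_bar - mu_0) / std_err
p_value <- 2 * pt(-abs(t_stat), df)

print(round(c(t = t_stat, p = p_value), 4))
```

Under these assumptions the calculation should reproduce t ≈ 2.25 and p ≈ 0.025, matching the `t.test()` output.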

Discussion

References