Stats Project

heart_data <- read.csv("heart.csv", header = TRUE)
head(heart_data)

##   age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp caa thall output
## 1  63   1  3    145  233   1       0      150    0     2.3   0   0     1      1
## 2  37   1  2    130  250   0       1      187    0     3.5   0   0     2      1
## 3  41   0  1    130  204   0       0      172    0     1.4   2   0     2      1
## 4  56   1  1    120  236   0       1      178    0     0.8   2   0     2      1
## 5  57   0  0    120  354   0       1      163    1     0.6   2   0     2      1
## 6  57   1  0    140  192   0       1      148    0     0.4   1   0     1      1

The topic is which individuals are likely to suffer from heart disease, given various quantitative and qualitative attributes. I’m interested in this dataset because my family has a history of heart disease, and I unfortunately know quite a few near and dear ones who have been affected by it. I therefore wanted to see which factors are associated with an increased chance of heart disease (and whether I personally identify with one or more of them!).

The cases (observations) are 303 patients. The variables are the 14 attributes listed below. Where the data file abbreviates a name differently, the column name is given in parentheses.

•age (in years): numerical discrete

•sex (1 = male; 0 = female): Boolean

•cp (chest pain type): categorical ordinal. Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic

•trestbps (resting blood pressure in mm Hg; column trtbps): numerical discrete

•chol (serum cholesterol in mg/dl): numerical discrete

•fbs (fasting blood sugar > 120 mg/dl; 1 = true, 0 = false): Boolean

•restecg (resting electrocardiographic results): categorical ordinal. Value 0: normal; Value 1: ST wave abnormality; Value 2: showing probable or definite left ventricular hypertrophy

•thalach (maximum heart rate achieved; column thalachh): numerical discrete

•exang (exercise-induced angina; 1 = yes, 0 = no; column exng): Boolean

•oldpeak (ST depression induced by exercise relative to rest): numerical continuous

•slope (slope of the peak exercise ST segment; column slp): categorical ordinal. Value 1: upsloping; Value 2: flat; Value 3: downsloping

•ca (number of major vessels (0-3) coloured by fluoroscopy; column caa): numerical discrete

•thal (column thall): categorical, coded numerically. 3 = normal; 6 = fixed defect; 7 = reversible defect

•num (diagnosis of heart disease; column output): Boolean. Value 0: <50% diameter narrowing; Value 1: >50% diameter narrowing
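The codings listed above can be made explicit in R by converting the coded columns to factors, so that summaries and plots treat them as categories rather than numbers. The sketch below is only an illustration: heart_head is a hypothetical data frame holding four columns of the six rows printed by head(heart_data), not the full dataset, and the labels follow the sex and exang codings in the list.

```r
# Hypothetical six-row sample copied from the head() output (not the full data).
heart_head <- data.frame(
  age  = c(63, 37, 41, 56, 57, 57),
  sex  = c(1, 1, 0, 1, 0, 1),
  chol = c(233, 250, 204, 236, 354, 192),
  exng = c(0, 0, 0, 0, 1, 0)
)

# Label the Boolean columns so R treats them as categories, not numbers.
heart_head$sex  <- factor(heart_head$sex,  levels = c(0, 1), labels = c("female", "male"))
heart_head$exng <- factor(heart_head$exng, levels = c(0, 1), labels = c("no", "yes"))

str(heart_head)        # column types after the conversion
table(heart_head$sex)  # number of rows in each category
```

The same two factor() calls should work on heart_data itself, since its sex and exng columns use the same 0/1 coding.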

The dataset was originally taken from the UCI Machine Learning Repository, which lists four creators for this dataset:

1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.

2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.

3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.

4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Personally, I found exang the most interesting variable, since I had always associated exercise and physical activity with a reduced chance of heart disease and had not thought about them from the perspective of angina pain. I think the response variable we could use in a model is num (diagnosis of heart disease), since it indicates whether the diameter narrowing is less than or greater than 50%. The greater the narrowing, the more severe the heart disease.

A sample of 303 patients may be too small to generalise to the larger population. Some results also appear strange at first glance, such as the trend in who is more prone to heart disease: it doesn’t make sense that people with low cholesterol and normal ECG results would be at higher risk. Furthermore, this dataset records continuous quantities such as cholesterol as whole numbers, which might pose an issue.

mean(heart_data$age)

## [1] 54.36634

median(heart_data$chol)

## [1] 240

mean(heart_data$exng)

## [1] 0.3267327

age <- heart_data$age
hist(age)

chol <- heart_data$chol
hist(chol)

exng <- heart_data$exng
hist(exng)

plot(age, chol, xlab = "age", ylab = "chol", pch = 3)
title("Scatter Plot of chol vs. Age")

Mean age is important because it summarises the typical age, about 54 here, of the patients in this dataset, and age is a variable that can be obtained instantly by asking the patient. The histogram of age shows that the distribution is unimodal, with most observations lying between 55 and 60.

Median chol is important because the value of 240 tells us that half the observations lie below it and half above, so it is the midpoint of cholesterol among the patients in the data. The histogram reflects this: it is centred on the 200 to 250 interval.

Mean exng is important because it is nearer to 0 than to 1, which indicates that the responses for exercise-induced angina leaned towards “no”. The histogram reflects this, with the majority of observations at 0.
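Because exng is coded 0/1, its mean is simply the proportion of patients who answered “yes”. The sketch below assumes that the reported mean of 0.3267327 over 303 patients corresponds to 99 “yes” responses (99/303); exng_demo is a hypothetical vector built from that assumption, not the real column.

```r
# Hypothetical 0/1 vector: 99 "yes" and 204 "no" responses (assumed counts).
exng_demo <- c(rep(1, 99), rep(0, 204))

length(exng_demo)             # number of patients in the sketch
mean(exng_demo)               # mean of a 0/1 variable = proportion of 1s
prop.table(table(exng_demo))  # the same proportions, shown for both categories
```

Under that assumption, roughly a third of the patients reported exercise-induced angina.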

I think these three variables are the most important because they can be measured quickly, so a summary of the patient’s profile can be derived almost instantly. I was also interested in them because of the confusing trend I mentioned earlier: heart disease appearing alongside exercise and low cholesterol, which is the opposite of the widely held view that links heart problems with little exercise and high cholesterol.

Lastly, I made a plot to show the relationship between age and cholesterol, as I wanted to understand whether older people are more likely than younger people to have unhealthy dietary habits that may lead to high cholesterol. The scatterplot reflects a positive association.
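The strength of an association seen in a scatterplot can be summarised with Pearson’s correlation coefficient. The sketch below is an illustration only: age_demo and chol_demo are hypothetical vectors holding the six rows shown by head(heart_data), which are far too few to draw a conclusion from. On the full data the same calls would take heart_data$age and heart_data$chol.

```r
# Hypothetical six-value vectors copied from the head() output.
age_demo  <- c(63, 37, 41, 56, 57, 57)
chol_demo <- c(233, 250, 204, 236, 354, 192)

# Pearson correlation: the sign gives the direction, the magnitude the strength.
r <- cor(age_demo, chol_demo)
r

# cor.test() adds a p-value and confidence interval for the correlation.
ct <- cor.test(age_demo, chol_demo)
ct$p.value
```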

t.test(chol ~ sex, data = heart_data, conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  chol by sex
## t = 3.0244, df = 134.39, p-value = 0.002985
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.617474 36.406982
## sample estimates:
## mean in group 0 mean in group 1 
##        261.3021        239.2899

chol <- heart_data$chol
hist(chol)

sex <- heart_data$sex
hist(sex)

The thing I found most interesting in my analysis last time was the relationship between age and cholesterol, which could be used to predict the likelihood of heart disease. I found it interesting because so many other factors affect cholesterol levels, including dietary habits and exercise level, and I wanted to see whether age played a significant role. What surprised me was the positive association in the scatterplot, which indicated that older people have higher cholesterol than younger people (who are presumably considered more unhealthy because of access to junk food, consumption of sugar, etc.).

The conditions we need for inference on a mean are:

Random: A random sample or randomized experiment should be used to obtain the data. There is no apparent link between subjects, so this is reasonable.

Normal: The sampling distribution of the sample mean needs to be approximately normal. This is true if the parent population is normal or if the sample is reasonably large (n≥30). That holds here, since the number of observations is large and the histograms are roughly bell shaped.

Independent: Individual observations need to be independent. If sampling without replacement, the sample size shouldn’t be more than 10% of the population. This is a reasonable assumption since the data come from multiple credible sources.

H0: The true difference in mean cholesterol between the two sex groups is zero

Ha: The true difference in mean cholesterol between the two sex groups is not zero

At an alpha of 5%, the p-value (0.002985) is less than 0.05, so H0 is rejected: the true difference in mean cholesterol between the two sex groups is not zero. That is, mean cholesterol differs between females (group 0, mean 261.30) and males (group 1, mean 239.29) in this sample.
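The decision can be checked by hand from the numbers printed in the Welch t-test output. The sketch below copies those rounded values and recomputes the p-value and confidence interval. The variable names are illustrative, and because the inputs are rounded the results should agree with the output only approximately.

```r
# Values copied from the Welch t-test output above (rounded as printed).
mean_female <- 261.3021  # mean in group 0
mean_male   <- 239.2899  # mean in group 1
t_stat      <- 3.0244
df_welch    <- 134.39
alpha       <- 0.05

diff_means <- mean_female - mean_male  # observed difference in means
se_diff    <- diff_means / t_stat      # standard error implied by t = diff / SE

# Two-sided p-value from the t distribution with the Welch degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

# 95% confidence interval: difference +/- critical value * standard error.
margin <- qt(1 - alpha / 2, df = df_welch) * se_diff
ci <- c(diff_means - margin, diff_means + margin)

diff_means
p_value
ci
p_value < alpha  # TRUE means H0 is rejected at the 5% level
```

The interval excludes zero exactly when the p-value falls below alpha, so both lead to the same conclusion.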

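My concern is that the histogram for sex does not look normal (bell shaped). However, sex is only the binary grouping variable here, so its histogram cannot be bell shaped and does not need to be; the normality condition concerns cholesterol within each group. If that condition were in doubt, I could run a nonparametric version of the test that does not assume normality, such as the Wilcoxon rank-sum test. A chi-square test is also robust to the distribution of the data, but it compares two categorical variables rather than a mean across two groups.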
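One nonparametric option for comparing cholesterol across the two sex groups is the Wilcoxon rank-sum test, which R provides as wilcox.test(). This is a minimal sketch and not part of the original analysis: heart_head is a hypothetical data frame holding only the six rows shown by head(heart_data), so the result says nothing about the full dataset.

```r
# Hypothetical six-row sample copied from the head() output (not the full data).
heart_head <- data.frame(
  sex  = c(1, 1, 0, 1, 0, 1),
  chol = c(233, 250, 204, 236, 354, 192)
)

# Rank-based comparison of chol between the two sex groups;
# it does not assume cholesterol is normally distributed.
wt <- wilcox.test(chol ~ sex, data = heart_head)
wt
```

On the full data, the same call with data = heart_data would give the rank-based counterpart of the t-test above.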