Assignment-1.knit

TAKE HOME ASSIGNMENT

’’’

DUE: Wednesday January 19th 11:59 pm

For this assignment you will apply what you learned about basic statistical analysis to a data set on heart disease. The assignment should be completed as a new RMD document and knitted. Submit the html knitted document and the original RMD document.

Assignment To help you out I have set up another data set.

To get you started here is the link to the data https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat

and here is a vector of column names:

age, sex, chestPain, restBP, cholesterol, fastGlucose, restEC, maxHeartRate, exAngina, oldPeak, slope, numVessels, thal, disease

Here is what they mean

age– 1. age
sex– 2. sex
cp– 3. chest pain type (4 values)
restbp– 4. resting blood pressure
chol– 5. serum cholesterol in mg/dl
fbs– 6. fasting blood sugar > 120 mg/dl
restecg– 7. resting electrocardiographic results (values 0,1,2) maxach – 8. maximum heart rate achieved
exang – 9. exercise induced angina
oldpeak – 10. oldpeak = ST depression induced by exercise relative to rest
slope – 11. the slope of the peak exercise ST segment
num – 12. number of major vessels (0-3) colored by fluoroscopy
thal – 13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect disease – 14. 1 = healthy; 2 = sick

Attributes types

Real: 1,4,5,8,10,12 Ordered:11, Binary: 2,6,9 Nominal:7,3,13

QUESTIONS

Starting with a new notebook,completed the numbered tasks/questions below. Put each task as at least one code chunk or more if needed. ’’’

Concept quesitons

Complete this question on your own using short answers 1-2 sentences.

#1. What is the difference between the functions c() and paste()? Answer 1: From my understanding - the c() functions used to combine multiple things into one vector, and does so by keeping all the elements in the same order. Since its a vector, all things must be of the same nature. The paste() function on the other hand, is used to combine multiple things into a string. When using paste, the things dont have to be of the same nature - i.e. you can mix words and numbers.

#2. What is one similarity and one major difference between a matrix() and a data.frame() object?

Answer 2: Similarity: Both matrix() and data.frame() can be used to store data as tables - which is very useful in large data sets. Difference: One of the major differences between the two, is that matrix can only store data of the same type, while data.frame can store different types of information in different columns - so its a more open ended function.

#3. Explain what each of these lines of code will do?

myVar <- patient_ID

myVar2 <- “123663”

myVar3 <- list(A, 23, “X”)

Answer 3: Each of these lines of code, is defining the variable (the part preceeding the <-) In line one, we are defining myVar to be patient_ID - assuming that patient_ID has a value/definition In line two, we are defining myVar2 to be the value 123663 (a string) In line three, we are defining myVar3 to be a list, containing the values A, 23 and X

Coding activity

Be sure to briefly describe what each of your code chunks is doing and why you are doing the analysis/calculation.

Only use the # comments in the code chunks to organize and annotate the code, DO NOT use this to discuss results.

After the code chunk explain your findings or interpret you results the free text. For example if using a statistical test was the result significant ornon-significant? Next make any conclusions from your findings. You can work alone or in a small group of 2-3 people. But you must submit your own work.

1. Read in the data tables. We will do this together!

This code chunk will read in the data remotely from a url and add the labels on to the column and format some of the categorical variables to be more readable. Often these are 0 and 1 or 1 and 2. It is convent to people readable letters or words. For example sex will be converted from 0 and 1 to M and F.

url<-"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"

heart <- read.csv(url, sep=" ", header = F)
names <- c("age", "sex", "cp", "restbp", 
           "chol", "fbs", "restecg", "maxach", "exang", "oldpeak", "slope", "num", 
           "thal","disease")
heart<-data.frame(heart)
colnames(heart) <- names
heart.dat<-heart
heart.dat$sex <- factor(heart.dat$sex, labels=c("F", "M"))

heart.dat$cp <- factor(heart.dat$cp, 
                       labels=c("Typ", "Atyp", "Non-Ang", "Asymp"))

heart.dat$fbs <- factor(heart.dat$fbs, labels=c("T", "F"))

heart.dat$restecg <- factor(heart.dat$restecg, 
                            labels=c("Normal", "Abnorm", "Hypertrophy"))

heart.dat$exang <- factor(heart.dat$exang, labels=c("N", "Y"))

heart.dat$slope <- factor(heart.dat$slope, 
                          labels=c("Up", "Flat", "Down"))

heart.dat$thal <- factor(heart.dat$thal, 
                         labels=c("Normal", "Fixed", "Reversible"))

heart.dat$disease <- factor(heart.dat$disease, labels=c("H", "S"))

NOTE: the call to the internet may cause problems when knitting the document to HTML. To solve this you should run the code to create the heart.dat table. Then save the table to your drive. Create a new code chunk to load the table from your drive. Comment the code to prevent it from running during knitting.

# to save the data file, only need to do this once then comment out
save( heart.dat, file= "heart_data.Rdata") 

# to load the data file, need to do this every time, this will load the object heart.dat
load("heart_data.Rdata")

2. Create an appropriate graph that shows the number of healthy and sick patients by sex.

Before creating a graph, I always like to start by interpreting what is being asked of us: Given the nature of the question, we also know that we are only considering two things - sex and the state of health

summary(heart.dat[,1:14])

##       age        sex           cp          restbp           chol       fbs    
##  Min.   :29.00   F: 87   Typ    : 20   Min.   : 94.0   Min.   :126.0   T:230  
##  1st Qu.:48.00   M:183   Atyp   : 42   1st Qu.:120.0   1st Qu.:213.0   F: 40  
##  Median :55.00           Non-Ang: 79   Median :130.0   Median :245.0          
##  Mean   :54.43           Asymp  :129   Mean   :131.3   Mean   :249.7          
##  3rd Qu.:61.00                         3rd Qu.:140.0   3rd Qu.:280.0          
##  Max.   :77.00                         Max.   :200.0   Max.   :564.0          
##         restecg        maxach      exang      oldpeak      slope    
##  Normal     :131   Min.   : 71.0   N:181   Min.   :0.00   Up  :130  
##  Abnorm     :  2   1st Qu.:133.0   Y: 89   1st Qu.:0.00   Flat:122  
##  Hypertrophy:137   Median :153.5           Median :0.80   Down: 18  
##                    Mean   :149.7           Mean   :1.05             
##                    3rd Qu.:166.0           3rd Qu.:1.60             
##                    Max.   :202.0           Max.   :6.20             
##       num                 thal     disease
##  Min.   :0.0000   Normal    :152   H:150  
##  1st Qu.:0.0000   Fixed     : 14   S:120  
##  Median :0.0000   Reversible:104          
##  Mean   :0.6704                           
##  3rd Qu.:1.0000                           
##  Max.   :3.0000

# I like to use this function to be able to see a table of the data directly on this screen, as opposed to the small screen on the right, makes it easier to work with the data

#install.packages("ggplot2")
#install.packages("corrplot")
library(ggplot2)
library(corrplot)

## corrplot 0.92 loaded

ggplot(data = heart.dat, mapping = aes(x = sex, fill = disease)) +
  geom_bar() +
  labs(title ="Distrubution of Disease with Relation to Sex",
       x = "Sex",
       y = "Number of People",
       color="State of Health (Sick vs Healthy)",
       caption = "A graphical representation of the distribution of sick (s) vs healthy (h) individuals within the data set, organized by biological sex")

# I am importing and then specifying the libraries necessary to make the graph, and then telling R all the details necessary to make the graph (i.e. what data to use, what I want on the X axis, what the fill of each bar is)
# I decided to use a bar graph for a few reasons - not only can we physically see the comparison between the sexes overall, but can also see the overwhelming comparison between the sexes as well as within each sex (i.e. we can see the ration of sick vs healthy - denoted by the S and H in the graph)

It’s interesting to see the distribution of the individuals within the data set. The most glaring difference is in the total number of males vs total number of females - in that many more males were included in the study than females. It also appears that males were sick more often than females within the study.

3. Use an appropriate statistical test to determine if there are a similar distribution of male and female healthy and sick patients.

table(heart.dat$disease, heart.dat$sex)

##    
##       F   M
##   H  67  83
##   S  20 100

# We were given this as part of the hint - this numerically lays out the distribution of sick vs healthy, while keeping the sex separate
# This allows us to initially visualize the data before we do a statistical test, as well as make the necessary table 
# As the question states, the comparison to be made is of the distribution of sick vs healthy patients in the two sexes

chisq.test(table(heart.dat$disease, heart.dat$sex))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(heart.dat$disease, heart.dat$sex)
## X-squared = 22.667, df = 1, p-value = 1.926e-06

#HINT, here is a function called table() that can create a table of counts of categories. 
#table(heart.dat$disease, heart.dat$sex)

Part of Answer to Q3: We see that the X^2 value is 22.66, while the p-value is 1.92x10^-6. Taking these into account, we would reject the null hypothesis (that the differences are due to chance, and are of statistically significant). Hence, we would say that the data suggests that the difference in distribution is not due to chance.

Also note that the chi-squared test is possible on both normal and non-normally distributed data (as long as the sample size is sufficiently large, which is the case with this data set) and hence it’s not necessary to prove normalcy of the data.

4. Make appropriate graphs for each of the age, resting blood pressure and cholesterol between the helathy and sick patients (one graph for each continuous variable)

# From my understanding, the healthy/sick is the only constant between the graphs, with age, resting BP and cholesterol being continuous variables. Hence I will present 3 graphs, although information can be combined to make fewer graphs. I will use jitter plots, as it allows for slighly better visualization than the normal plot function. 

# The first graph to be made is of BP vs healthy/sick 
library(ggplot2)
library(corrplot)
ggplot(data = heart.dat, mapping = aes(x = disease, y = restbp, color = sex)) +
  geom_jitter() +
  labs(title ="BP of healthy vs Sick Individuals",
       x = "Disease (H or S)",
       y = "BP",
       caption = "A graphical representation of BP, based on the healthy vs sick state, with color indicating sex of patient"
       )

# The second graph to be made is of age vs healthy/sick
library(ggplot2)
library(corrplot)
ggplot(data = heart.dat, mapping = aes(x = disease, y = age, color = sex)) +
  geom_jitter() +
  labs(title ="Age of healthy vs Sick Individuals",
       x = "Disease (H or S)",
       y = "Age",
       caption = "A graphical representation of the age distribution, based on the healthy vs sick state, with color indicating sex of patient"
       )

# The third graph to be made is of cholesterol vs healthy/sick 
library(ggplot2)
library(corrplot)
ggplot(data = heart.dat, mapping = aes(x = disease, y = chol, color = sex)) +
  geom_jitter() +
  labs(title ="Cholesterol levels of healthy vs Sick Individuals",
       x = "Disease (H or S)",
       y = "Cholesterol Levels",
       caption = "A graphical representation of the Cholesterol levels, based on the healthy vs sick state, with color indicating sex of patient"
       )

BP Graph Observations/comments: The data seems to be quite spread out when assessing the BP’s of sick and healthy individuals. Although it would require more quantitative analysis, it also appears that within sick individuals, males tend to have a lower BP than females.

Age Graph Observations/comments: Again, the data seems quite spread out, with the main observation being that within the sick individuals. It appears that younger women don’t get sick as often as younger males. Again requiring further analysis to make this conclusion, but it would appear that within the population of women, older women are more likely to be sick.

Cholesterol Graph Observations/comments: As with the other graphs - there are no standout observations/trends and the data seems to be quite spread out. Though, it would appear that within healthy and sick individuals, women tend to be the high, outlier values. However, this conclusion would also require further quantitative analysis/testing - with this simply being an observation from the graph.

5. Make an appropriate statistical test for each of the resting blood pressure and the cholesterl between helathy and sick patients.

# So the tests to pick between for this type of data is either a T-test (if the data is of normal distribution) or a Wilcox test (if the data is not of normal distribution). The first step is to run the normalcy test - which is done by creating a QQ plot:

# Testing normalcy of the Resting BP data 
qqnorm(heart.dat$restbp)
qqline(heart.dat$restbp)

#Running a wilcox test for resting bp of sick vs healthy patients, since the data is not normally distributed (although not far from normal in the middle, the data starts to driverge from the normalcy line towrds the extremes of the x axis)

wilcox.test(formula = restbp ~ disease, data = heart.dat)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  restbp by disease
## W = 7632.5, p-value = 0.03154
## alternative hypothesis: true location shift is not equal to 0

Part of Answer to Q5:

Resting BP of Healthy vs Sick individuals: From the results, we can see that the data is not fitting to normal distributions at the ends of the axis, and because of this, a wilcox test was chosen instead of a T test. It is clear from the wilcox test result, that the P value is 0.031. Using the 0.05 threshold, we can see that the P value of 0.31 indicates a 3.1% chance that the observed differences in the BPs of healthy and sick individuals are due to chance - hence suggesting that there is indeed a statistically significant difference. Hence we would say that there is a statistically significant difference in the BP of healthy and sick individuals

#Again, testing for normalcy, but now of the cholesterol 

qqnorm(heart.dat$chol)
qqline(heart.dat$chol)

# Running a wilcox test for cholesterol levels of sick vs healthy patients. While the QQ plot shows a pretty normal distribution (i.e. it's arguably close to being a normal distribution), since the wilcox test is more stringent, and since the normalcy is debatable, I decided to use that, since it can be used on both normal and non-normal distributions. Note that the limitation of assessing normalcy via QQ plots is seen here - i.e. QQ plots give a visual estimation of the data's normalcy, and not a quantitative answer. Hence, if you want to be highly accurate, another method would be employed to determine if it can be classified as normal 

wilcox.test(formula = chol ~ disease, data = heart.dat)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  chol by disease
## W = 7300.5, p-value = 0.007701
## alternative hypothesis: true location shift is not equal to 0

Part of Answer to Q5:

Cholesterol levels of Healthy vs Sick individuals: Following the same mode of thinking as above - we can see that here, the P value is 0.0077, indicating a 0.7% chance that the observed cholesterol difference is due to chance. Again using the 0.05 (i.e. 5%) threshold, we can say that there is a statistically significant difference in the cholesterol levels of sick and healthy individuals.

6. Males and female often have different values for clincial and physiological variables. Make a graph that compares cholesterol in males and females who are helathy and sick. Describe your observations from the graph and how this compares to the graph in Question 4.

# I will make a graph similar to that in Q4, except this time, I will have the color indicating healthy vs sick, while the X axis will separate the data points by sex. I decided to make both a jitter plot and a point . The jitter plot is more usefull overall, as it shows each of the points clearly - but the point plot allows for the points to overlap, which is useful in trying to find the difference between healthy and sick levels within a sex, as this shows the overlap, but it must be kept in mind that this was only done since we are not doing any quantitative calculations. If calculations would be done, the point plot would be usefull and subpar. 

library(ggplot2)
library(corrplot)
ggplot(data = heart.dat, mapping = aes(x = sex, y = chol, color = disease)) +
  geom_jitter() +
  labs(title ="Cholesterol levels of healthy vs Sick Individuals",
       x = "Sex",
       y = "Cholesterol Levels",
       caption = "A graphical representation of the Cholesterol levels between healthy and sick individuals - with respect to the sex of each individual"
       )

library(ggplot2)
library(corrplot)
ggplot(data = heart.dat, mapping = aes(x = sex, y = chol, color = disease)) +
  geom_point() +
  labs(title ="Cholesterol levels of healthy vs Sick Individuals",
       x = "Sex",
       y = "Cholesterol Levels",
       caption = "A graphical representation of the Cholesterol levels between healthy and sick individuals - with respect to the sex of each individual"
       )

Part of answer to Q6: In Q4, it’s much harder to try to find a trend for any given sex - the data is more spread out. In this graph, since the sexes are seperated, and the colors indicate healthy vs sick, it would appear that males who are sick seem to have a lower average level of cholesterol (while this is just what the graph seemes to show to me, in order to actually say this, a mean would need to be calculated). It also seems that the healthy and sick men have a much closer average cholesterol level than the healthy vs sick females. Whereas, in Q4, the only observation that can be made is that it seems that sick individuals have a slighly higher average cholesterol level.

7. Perfrom a statisical test compariing cholesterol levels between sick males and sick females. Discuss this result and compare to the result of question 5.

# Here, we need to run a wilcox test similar to that in Q5, but this time, we are not separating just by healthy/sick - we are considering only sick individuals, with respect to sex. The easiest way to do this, is to create a new data set that consists only of the sick individuals, and then run a test to differentiate between the cholesterol levels of males and females (since in this data set they are all sick, that aspect is already taken care of)

# I always like to start with a overview/summary to remind myself of what the data consists of:
summary(heart.dat[,1:14])

##       age        sex           cp          restbp           chol       fbs    
##  Min.   :29.00   F: 87   Typ    : 20   Min.   : 94.0   Min.   :126.0   T:230  
##  1st Qu.:48.00   M:183   Atyp   : 42   1st Qu.:120.0   1st Qu.:213.0   F: 40  
##  Median :55.00           Non-Ang: 79   Median :130.0   Median :245.0          
##  Mean   :54.43           Asymp  :129   Mean   :131.3   Mean   :249.7          
##  3rd Qu.:61.00                         3rd Qu.:140.0   3rd Qu.:280.0          
##  Max.   :77.00                         Max.   :200.0   Max.   :564.0          
##         restecg        maxach      exang      oldpeak      slope    
##  Normal     :131   Min.   : 71.0   N:181   Min.   :0.00   Up  :130  
##  Abnorm     :  2   1st Qu.:133.0   Y: 89   1st Qu.:0.00   Flat:122  
##  Hypertrophy:137   Median :153.5           Median :0.80   Down: 18  
##                    Mean   :149.7           Mean   :1.05             
##                    3rd Qu.:166.0           3rd Qu.:1.60             
##                    Max.   :202.0           Max.   :6.20             
##       num                 thal     disease
##  Min.   :0.0000   Normal    :152   H:150  
##  1st Qu.:0.0000   Fixed     : 14   S:120  
##  Median :0.0000   Reversible:104          
##  Mean   :0.6704                           
##  3rd Qu.:1.0000                           
##  Max.   :3.0000

mySplit <- split(heart.dat, heart.dat$disease)

#Now I want to view the splits that I have created - I commented this out, since it's not longer necessary - but this serves as a checkpoint to make sure the split worked correctly
# mySplit
# mySplit$S

# Now that we know that mySplit$S is the data that we want, we use that to run the wilcox test. We know that the data only contains sick individuals, hence the wilcox test will determine statsitical significance of the differences of cholesterol levels of sick men and sick women. 
wilcox.test(formula = chol ~ sex, data = mySplit$S)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  chol by sex
## W = 1423, p-value = 0.002925
## alternative hypothesis: true location shift is not equal to 0

Part of Answer to Q7: The first piece of information to start with - is that as was shown in Q5 (and hence not shown again, as it is already konwn) the cholesterol distribution was not normal. Hence, we cannot use a standard T test in this case. Given that the wilcox test is both more stringent and can fit non-normal data, that test is chosen and used. We can see from the wilcox test, that the resulting P value was 0.002 (i.e. 0.29%). Again using the 0.05 threshold, we can see that the wilcox test suggests that there is a statistically significant difference in the cholesterol levels of sick men and sick women. The split command allows us to seperate out the sick individuals from the master datasheet, which can then be used to analyze the cholesterol level (due to the split, cholesterol levels of only sick individuals) differences between the sexes. Given the P value, the test would indicate that a statistically significant difference is present between the cholesterol levels of sick men vs sick women.

Compared to question 5, we are now differentiating between the sexes, and not simply the cholesterol of sick vs healthy individuals. From question 5, we learned that the cholesterol difference between sick and healthy individuals is of statistically significant difference. Now, we see that the cholesterol levels of an individual is also impacted by sex. While both analyses tell us different things, we now know that both an individual’s sex as well as state of disease (healthy vs sick) contribute significantly to cholesterol levels.

8. Consider if the variable thal significantly changes the differences in the maximum heart rate achieved. Use an ANOVA and the Tukey post hoc test to statistically test the differences. Make sure to report and interpret the the p-value of the ANOVA and if it passed or failed the Tukey’s test.

#The first thing to do is to run the ANOVA, recognizing that we are assessing for maxach with respect to thal. We need to put this into a variable that I called ANOVA1 to allow for further use of the ANOVA results as well as being able to see it. 

ANOVA1 <- aov(formula = maxach ~ thal, data = heart.dat)

summary(ANOVA1)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## thal          2  10540    5270   10.52 4.02e-05 ***
## Residuals   267 133818     501                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#I like putting a summary when executing functions like splitting or running a test like this, where I immediately inserted it into a variable, as this allows me to make sure that the code ran correctly before I use it in downstream code 

TukeyHSD(ANOVA1)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = maxach ~ thal, data = heart.dat)
## 
## $thal
##                         diff       lwr       upr     p adj
## Fixed-Normal      -18.448308 -33.18515 -3.711468 0.0096456
## Reversible-Normal -11.385121 -18.09968 -4.670560 0.0002451
## Reversible-Fixed    7.063187  -7.95773 22.084104 0.5098109

#The code below creates a QQ plot that allows us to assess the normal distribution of the data
plot(ANOVA1, 2)

We see that thal had 3 possible values, and hence the 2 Df is the correct value.

The Pr(>F) indicates the p-value for the F-test. Similarly, it indicates the probability that the observed values are due to chance. Since the Pr(>F) is so small of a value (4.02e-05), it suggests that the difference in Max HR achieved is thal dependant (i.e. the groups with different thal values have a statistically significant difference). However, this is not a conclusion that we can draw yet (we cannot say for sure, we can only speculate so far). We must conduct a Tukey Honest Significant Difference post-hoc test to actually determine this.

Once we have conducted the Tukey HSD test (a series of paired T tests). To interpret this, we must understand what is returned after the code is ran. On the right side, we are given P values. For the overall ANOVA to be significant - each of the P-values given by the post-hoc test must also be significant. However, this is not the case here (one value is 0.009, one is 0.0002, one is 0.509). Using the 0.05 threshold (i.e. 5% threshold, which is often the accepted threshold level for a P-value), we can see that 2 T-test results show significance, but the reversible-fixed thal T test was an anomaly and was not significant (i.e. the 0.509 P value does not rule out the fact that differences can be a result of random chance). Because of the magnitude of the values/outlier the original ANOVA appears to be significant although it is not.

Given this result for the ANOVA and the Tukey’s post-hoc test - we cannot say that the values are of statistically significant difference - and hence we cannot draw that conclusion.

In order to apply the ANOVA test, the data must be normally distributed. This is assessed by the QQ plot shown above. The data appears to have a fairly strong correlation to the dashed line - suggesting that the data is indeed normally distributed. However, the ANOVA’s post-hoc test was failed - showing that the results of the ANOVA was invalidated. Hence, no conclusions can be made from the ANOVA test. In a case like this (i.e. results from an ANOVA are not accepted/used), proving normalcy is unnecessary (since the data is not being used, regardless of what it says), showing normalcy of the data can be useful to justify the use of an ANOVA.