DUE: Wednesday January 19th 11:59 pm
For this assignment you will apply what you learned about basic statistical analysis to a data set on heart disease. The assignment should be completed as a new RMD document and knitted. Submit the html knitted document and the original RMD document.
Assignment To help you out I have set up another data set.
To get you started here is the link to the data https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat
and here is a vector of column names:
age, sex, chestPain, restBP, cholesterol, fastGlucose, restEC, maxHeartRate, exAngina, oldPeak, slope, numVessels, thal, disease
Here is what they mean
age– 1. age
sex– 2. sex
cp– 3. chest pain type (4 values)
restbp– 4. resting blood pressure
chol– 5. serum cholesterol in mg/dl
fbs– 6. fasting blood sugar > 120 mg/dl
restecg– 7. resting electrocardiographic results (values 0,1,2) maxach –
8. maximum heart rate achieved
exang – 9. exercise induced angina
oldpeak – 10. oldpeak = ST depression induced by exercise relative to
rest
slope – 11. the slope of the peak exercise ST segment
num – 12. number of major vessels (0-3) colored by fluoroscopy
thal – 13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
disease – 14. 1 = healthy; 2 = sick
Attributes types
Real: 1,4,5,8,10,12 Ordered:11, Binary: 2,6,9 Nominal:7,3,13
QUESTIONS
Starting with a new notebook,completed the numbered tasks/questions below. Put each task as at least one code chunk or more if needed. ’’’
Complete this question on your own using short answers 1-2 sentences.
#1. What is the difference between the functions c() and paste()? Answer 1: From my understanding - the c() functions used to combine multiple things into one vector, and does so by keeping all the elements in the same order. Since its a vector, all things must be of the same nature. The paste() function on the other hand, is used to combine multiple things into a string. When using paste, the things dont have to be of the same nature - i.e. you can mix words and numbers.
#2. What is one similarity and one major difference between a matrix() and a data.frame() object?
Answer 2: Similarity: Both matrix() and data.frame() can be used to store data as tables - which is very useful in large data sets. Difference: One of the major differences between the two, is that matrix can only store data of the same type, while data.frame can store different types of information in different columns - so its a more open ended function.
#3. Explain what each of these lines of code will do?
myVar <- patient_ID
myVar2 <- “123663”
myVar3 <- list(A, 23, “X”)
Answer 3: Each of these lines of code, is defining the variable (the part preceeding the <-) In line one, we are defining myVar to be patient_ID - assuming that patient_ID has a value/definition In line two, we are defining myVar2 to be the value 123663 (a string) In line three, we are defining myVar3 to be a list, containing the values A, 23 and X
Be sure to briefly describe what each of your code chunks is doing and why you are doing the analysis/calculation.
Only use the # comments in the code chunks to organize and annotate the code, DO NOT use this to discuss results.
After the code chunk explain your findings or interpret you results the free text. For example if using a statistical test was the result significant ornon-significant? Next make any conclusions from your findings. You can work alone or in a small group of 2-3 people. But you must submit your own work.
This code chunk will read in the data remotely from a url and add the labels on to the column and format some of the categorical variables to be more readable. Often these are 0 and 1 or 1 and 2. It is convent to people readable letters or words. For example sex will be converted from 0 and 1 to M and F.
url<-"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
heart <- read.csv(url, sep=" ", header = F)
names <- c("age", "sex", "cp", "restbp",
"chol", "fbs", "restecg", "maxach", "exang", "oldpeak", "slope", "num",
"thal","disease")
heart<-data.frame(heart)
colnames(heart) <- names
heart.dat<-heart
heart.dat$sex <- factor(heart.dat$sex, labels=c("F", "M"))
heart.dat$cp <- factor(heart.dat$cp,
labels=c("Typ", "Atyp", "Non-Ang", "Asymp"))
heart.dat$fbs <- factor(heart.dat$fbs, labels=c("T", "F"))
heart.dat$restecg <- factor(heart.dat$restecg,
labels=c("Normal", "Abnorm", "Hypertrophy"))
heart.dat$exang <- factor(heart.dat$exang, labels=c("N", "Y"))
heart.dat$slope <- factor(heart.dat$slope,
labels=c("Up", "Flat", "Down"))
heart.dat$thal <- factor(heart.dat$thal,
labels=c("Normal", "Fixed", "Reversible"))
heart.dat$disease <- factor(heart.dat$disease, labels=c("H", "S"))
NOTE: the call to the internet may cause problems when knitting the document to HTML. To solve this you should run the code to create the heart.dat table. Then save the table to your drive. Create a new code chunk to load the table from your drive. Comment the code to prevent it from running during knitting.
# to save the data file, only need to do this once then comment out
save( heart.dat, file= "heart_data.Rdata")
# to load the data file, need to do this every time, this will load the object heart.dat
load("heart_data.Rdata")
Before creating a graph, I always like to start by interpreting what is being asked of us: Given the nature of the question, we also know that we are only considering two things - sex and the state of health
summary(heart.dat[,1:14])
## age sex cp restbp chol fbs
## Min. :29.00 F: 87 Typ : 20 Min. : 94.0 Min. :126.0 T:230
## 1st Qu.:48.00 M:183 Atyp : 42 1st Qu.:120.0 1st Qu.:213.0 F: 40
## Median :55.00 Non-Ang: 79 Median :130.0 Median :245.0
## Mean :54.43 Asymp :129 Mean :131.3 Mean :249.7
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:280.0
## Max. :77.00 Max. :200.0 Max. :564.0
## restecg maxach exang oldpeak slope
## Normal :131 Min. : 71.0 N:181 Min. :0.00 Up :130
## Abnorm : 2 1st Qu.:133.0 Y: 89 1st Qu.:0.00 Flat:122
## Hypertrophy:137 Median :153.5 Median :0.80 Down: 18
## Mean :149.7 Mean :1.05
## 3rd Qu.:166.0 3rd Qu.:1.60
## Max. :202.0 Max. :6.20
## num thal disease
## Min. :0.0000 Normal :152 H:150
## 1st Qu.:0.0000 Fixed : 14 S:120
## Median :0.0000 Reversible:104
## Mean :0.6704
## 3rd Qu.:1.0000
## Max. :3.0000
# I like to use this function to be able to see a table of the data directly on this screen, as opposed to the small screen on the right, makes it easier to work with the data
#install.packages("ggplot2")
#install.packages("corrplot")
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
ggplot(data = heart.dat, mapping = aes(x = sex, fill = disease)) +
geom_bar() +
labs(title ="Distrubution of Disease with Relation to Sex",
x = "Sex",
y = "Number of People",
color="State of Health (Sick vs Healthy)",
caption = "A graphical representation of the distribution of sick (s) vs healthy (h) individuals within the data set, organized by biological sex")
# I am importing and then specifying the libraries necessary to make the graph, and then telling R all the details necessary to make the graph (i.e. what data to use, what I want on the X axis, what the fill of each bar is)
# I decided to use a bar graph for a few reasons - not only can we physically see the comparison between the sexes overall, but can also see the overwhelming comparison between the sexes as well as within each sex (i.e. we can see the ration of sick vs healthy - denoted by the S and H in the graph)
It’s interesting to see the distribution of the individuals within the data set. The most glaring difference is in the total number of males vs total number of females - in that many more males were included in the study than females. It also appears that males were sick more often than females within the study.
table(heart.dat$disease, heart.dat$sex)
##
## F M
## H 67 83
## S 20 100
# We were given this as part of the hint - this numerically lays out the distribution of sick vs healthy, while keeping the sex separate
# This allows us to initially visualize the data before we do a statistical test, as well as make the necessary table
# As the question states, the comparison to be made is of the distribution of sick vs healthy patients in the two sexes
chisq.test(table(heart.dat$disease, heart.dat$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(heart.dat$disease, heart.dat$sex)
## X-squared = 22.667, df = 1, p-value = 1.926e-06
#HINT, here is a function called table() that can create a table of counts of categories.
#table(heart.dat$disease, heart.dat$sex)
Part of Answer to Q3: We see that the X^2 value is 22.66, while the p-value is 1.92x10^-6. Taking these into account, we would reject the null hypothesis (that the differences are due to chance, and are of statistically significant). Hence, we would say that the data suggests that the difference in distribution is not due to chance.
Also note that the chi-squared test is possible on both normal and non-normally distributed data (as long as the sample size is sufficiently large, which is the case with this data set) and hence it’s not necessary to prove normalcy of the data.
# From my understanding, the healthy/sick is the only constant between the graphs, with age, resting BP and cholesterol being continuous variables. Hence I will present 3 graphs, although information can be combined to make fewer graphs. I will use jitter plots, as it allows for slighly better visualization than the normal plot function.
# The first graph to be made is of BP vs healthy/sick
library(ggplot2)
library(corrplot)
ggplot(data = heart.dat, mapping = aes(x = disease, y = restbp, color = sex)) +
geom_jitter() +
labs(title ="BP of healthy vs Sick Individuals",
x = "Disease (H or S)",
y = "BP",
caption = "A graphical representation of BP, based on the healthy vs sick state, with color indicating sex of patient"
)
# The second graph to be made is of age vs healthy/sick
library(ggplot2)
library(corrplot)
ggplot(data = heart.dat, mapping = aes(x = disease, y = age, color = sex)) +
geom_jitter() +
labs(title ="Age of healthy vs Sick Individuals",
x = "Disease (H or S)",
y = "Age",
caption = "A graphical representation of the age distribution, based on the healthy vs sick state, with color indicating sex of patient"
)
# The third graph to be made is of cholesterol vs healthy/sick
library(ggplot2)
library(corrplot)
ggplot(data = heart.dat, mapping = aes(x = disease, y = chol, color = sex)) +
geom_jitter() +
labs(title ="Cholesterol levels of healthy vs Sick Individuals",
x = "Disease (H or S)",
y = "Cholesterol Levels",
caption = "A graphical representation of the Cholesterol levels, based on the healthy vs sick state, with color indicating sex of patient"
)
BP Graph Observations/comments: The data seems to be quite spread out
when assessing the BP’s of sick and healthy individuals. Although it
would require more quantitative analysis, it also appears that within
sick individuals, males tend to have a lower BP than females.
Age Graph Observations/comments: Again, the data seems quite spread out, with the main observation being that within the sick individuals. It appears that younger women don’t get sick as often as younger males. Again requiring further analysis to make this conclusion, but it would appear that within the population of women, older women are more likely to be sick.
Cholesterol Graph Observations/comments: As with the other graphs - there are no standout observations/trends and the data seems to be quite spread out. Though, it would appear that within healthy and sick individuals, women tend to be the high, outlier values. However, this conclusion would also require further quantitative analysis/testing - with this simply being an observation from the graph.
# So the tests to pick between for this type of data is either a T-test (if the data is of normal distribution) or a Wilcox test (if the data is not of normal distribution). The first step is to run the normalcy test - which is done by creating a QQ plot:
# Testing normalcy of the Resting BP data
qqnorm(heart.dat$restbp)
qqline(heart.dat$restbp)
#Running a wilcox test for resting bp of sick vs healthy patients, since the data is not normally distributed (although not far from normal in the middle, the data starts to driverge from the normalcy line towrds the extremes of the x axis)
wilcox.test(formula = restbp ~ disease, data = heart.dat)
##
## Wilcoxon rank sum test with continuity correction
##
## data: restbp by disease
## W = 7632.5, p-value = 0.03154
## alternative hypothesis: true location shift is not equal to 0
Part of Answer to Q5:
Resting BP of Healthy vs Sick individuals: From the results, we can see that the data is not fitting to normal distributions at the ends of the axis, and because of this, a wilcox test was chosen instead of a T test. It is clear from the wilcox test result, that the P value is 0.031. Using the 0.05 threshold, we can see that the P value of 0.31 indicates a 3.1% chance that the observed differences in the BPs of healthy and sick individuals are due to chance - hence suggesting that there is indeed a statistically significant difference. Hence we would say that there is a statistically significant difference in the BP of healthy and sick individuals
#Again, testing for normalcy, but now of the cholesterol
qqnorm(heart.dat$chol)
qqline(heart.dat$chol)
# Running a wilcox test for cholesterol levels of sick vs healthy patients. While the QQ plot shows a pretty normal distribution (i.e. it's arguably close to being a normal distribution), since the wilcox test is more stringent, and since the normalcy is debatable, I decided to use that, since it can be used on both normal and non-normal distributions. Note that the limitation of assessing normalcy via QQ plots is seen here - i.e. QQ plots give a visual estimation of the data's normalcy, and not a quantitative answer. Hence, if you want to be highly accurate, another method would be employed to determine if it can be classified as normal
wilcox.test(formula = chol ~ disease, data = heart.dat)
##
## Wilcoxon rank sum test with continuity correction
##
## data: chol by disease
## W = 7300.5, p-value = 0.007701
## alternative hypothesis: true location shift is not equal to 0
Part of Answer to Q5:
Cholesterol levels of Healthy vs Sick individuals: Following the same mode of thinking as above - we can see that here, the P value is 0.0077, indicating a 0.7% chance that the observed cholesterol difference is due to chance. Again using the 0.05 (i.e. 5%) threshold, we can say that there is a statistically significant difference in the cholesterol levels of sick and healthy individuals.
# I will make a graph similar to that in Q4, except this time, I will have the color indicating healthy vs sick, while the X axis will separate the data points by sex. I decided to make both a jitter plot and a point . The jitter plot is more usefull overall, as it shows each of the points clearly - but the point plot allows for the points to overlap, which is useful in trying to find the difference between healthy and sick levels within a sex, as this shows the overlap, but it must be kept in mind that this was only done since we are not doing any quantitative calculations. If calculations would be done, the point plot would be usefull and subpar.
library(ggplot2)
library(corrplot)
ggplot(data = heart.dat, mapping = aes(x = sex, y = chol, color = disease)) +
geom_jitter() +
labs(title ="Cholesterol levels of healthy vs Sick Individuals",
x = "Sex",
y = "Cholesterol Levels",
caption = "A graphical representation of the Cholesterol levels between healthy and sick individuals - with respect to the sex of each individual"
)
library(ggplot2)
library(corrplot)
ggplot(data = heart.dat, mapping = aes(x = sex, y = chol, color = disease)) +
geom_point() +
labs(title ="Cholesterol levels of healthy vs Sick Individuals",
x = "Sex",
y = "Cholesterol Levels",
caption = "A graphical representation of the Cholesterol levels between healthy and sick individuals - with respect to the sex of each individual"
)
Part of answer to Q6: In Q4, it’s much harder to try to find a trend for
any given sex - the data is more spread out. In this graph, since the
sexes are seperated, and the colors indicate healthy vs sick, it would
appear that males who are sick seem to have a lower average level of
cholesterol (while this is just what the graph seemes to show to me, in
order to actually say this, a mean would need to be calculated). It also
seems that the healthy and sick men have a much closer average
cholesterol level than the healthy vs sick females. Whereas, in Q4, the
only observation that can be made is that it seems that sick individuals
have a slighly higher average cholesterol level.
# Here, we need to run a wilcox test similar to that in Q5, but this time, we are not separating just by healthy/sick - we are considering only sick individuals, with respect to sex. The easiest way to do this, is to create a new data set that consists only of the sick individuals, and then run a test to differentiate between the cholesterol levels of males and females (since in this data set they are all sick, that aspect is already taken care of)
# I always like to start with a overview/summary to remind myself of what the data consists of:
summary(heart.dat[,1:14])
## age sex cp restbp chol fbs
## Min. :29.00 F: 87 Typ : 20 Min. : 94.0 Min. :126.0 T:230
## 1st Qu.:48.00 M:183 Atyp : 42 1st Qu.:120.0 1st Qu.:213.0 F: 40
## Median :55.00 Non-Ang: 79 Median :130.0 Median :245.0
## Mean :54.43 Asymp :129 Mean :131.3 Mean :249.7
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:280.0
## Max. :77.00 Max. :200.0 Max. :564.0
## restecg maxach exang oldpeak slope
## Normal :131 Min. : 71.0 N:181 Min. :0.00 Up :130
## Abnorm : 2 1st Qu.:133.0 Y: 89 1st Qu.:0.00 Flat:122
## Hypertrophy:137 Median :153.5 Median :0.80 Down: 18
## Mean :149.7 Mean :1.05
## 3rd Qu.:166.0 3rd Qu.:1.60
## Max. :202.0 Max. :6.20
## num thal disease
## Min. :0.0000 Normal :152 H:150
## 1st Qu.:0.0000 Fixed : 14 S:120
## Median :0.0000 Reversible:104
## Mean :0.6704
## 3rd Qu.:1.0000
## Max. :3.0000
mySplit <- split(heart.dat, heart.dat$disease)
#Now I want to view the splits that I have created - I commented this out, since it's not longer necessary - but this serves as a checkpoint to make sure the split worked correctly
# mySplit
# mySplit$S
# Now that we know that mySplit$S is the data that we want, we use that to run the wilcox test. We know that the data only contains sick individuals, hence the wilcox test will determine statsitical significance of the differences of cholesterol levels of sick men and sick women.
wilcox.test(formula = chol ~ sex, data = mySplit$S)
##
## Wilcoxon rank sum test with continuity correction
##
## data: chol by sex
## W = 1423, p-value = 0.002925
## alternative hypothesis: true location shift is not equal to 0
Part of Answer to Q7: The first piece of information to start with - is that as was shown in Q5 (and hence not shown again, as it is already konwn) the cholesterol distribution was not normal. Hence, we cannot use a standard T test in this case. Given that the wilcox test is both more stringent and can fit non-normal data, that test is chosen and used. We can see from the wilcox test, that the resulting P value was 0.002 (i.e. 0.29%). Again using the 0.05 threshold, we can see that the wilcox test suggests that there is a statistically significant difference in the cholesterol levels of sick men and sick women. The split command allows us to seperate out the sick individuals from the master datasheet, which can then be used to analyze the cholesterol level (due to the split, cholesterol levels of only sick individuals) differences between the sexes. Given the P value, the test would indicate that a statistically significant difference is present between the cholesterol levels of sick men vs sick women.
Compared to question 5, we are now differentiating between the sexes, and not simply the cholesterol of sick vs healthy individuals. From question 5, we learned that the cholesterol difference between sick and healthy individuals is of statistically significant difference. Now, we see that the cholesterol levels of an individual is also impacted by sex. While both analyses tell us different things, we now know that both an individual’s sex as well as state of disease (healthy vs sick) contribute significantly to cholesterol levels.
#The first thing to do is to run the ANOVA, recognizing that we are assessing for maxach with respect to thal. We need to put this into a variable that I called ANOVA1 to allow for further use of the ANOVA results as well as being able to see it.
ANOVA1 <- aov(formula = maxach ~ thal, data = heart.dat)
summary(ANOVA1)
## Df Sum Sq Mean Sq F value Pr(>F)
## thal 2 10540 5270 10.52 4.02e-05 ***
## Residuals 267 133818 501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#I like putting a summary when executing functions like splitting or running a test like this, where I immediately inserted it into a variable, as this allows me to make sure that the code ran correctly before I use it in downstream code
TukeyHSD(ANOVA1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = maxach ~ thal, data = heart.dat)
##
## $thal
## diff lwr upr p adj
## Fixed-Normal -18.448308 -33.18515 -3.711468 0.0096456
## Reversible-Normal -11.385121 -18.09968 -4.670560 0.0002451
## Reversible-Fixed 7.063187 -7.95773 22.084104 0.5098109
#The code below creates a QQ plot that allows us to assess the normal distribution of the data
plot(ANOVA1, 2)
We see that thal had 3 possible values, and hence the 2 Df is the
correct value.
The Pr(>F) indicates the p-value for the F-test. Similarly, it indicates the probability that the observed values are due to chance. Since the Pr(>F) is so small of a value (4.02e-05), it suggests that the difference in Max HR achieved is thal dependant (i.e. the groups with different thal values have a statistically significant difference). However, this is not a conclusion that we can draw yet (we cannot say for sure, we can only speculate so far). We must conduct a Tukey Honest Significant Difference post-hoc test to actually determine this.
Once we have conducted the Tukey HSD test (a series of paired T tests). To interpret this, we must understand what is returned after the code is ran. On the right side, we are given P values. For the overall ANOVA to be significant - each of the P-values given by the post-hoc test must also be significant. However, this is not the case here (one value is 0.009, one is 0.0002, one is 0.509). Using the 0.05 threshold (i.e. 5% threshold, which is often the accepted threshold level for a P-value), we can see that 2 T-test results show significance, but the reversible-fixed thal T test was an anomaly and was not significant (i.e. the 0.509 P value does not rule out the fact that differences can be a result of random chance). Because of the magnitude of the values/outlier the original ANOVA appears to be significant although it is not.
Given this result for the ANOVA and the Tukey’s post-hoc test - we cannot say that the values are of statistically significant difference - and hence we cannot draw that conclusion.
In order to apply the ANOVA test, the data must be normally distributed. This is assessed by the QQ plot shown above. The data appears to have a fairly strong correlation to the dashed line - suggesting that the data is indeed normally distributed. However, the ANOVA’s post-hoc test was failed - showing that the results of the ANOVA was invalidated. Hence, no conclusions can be made from the ANOVA test. In a case like this (i.e. results from an ANOVA are not accepted/used), proving normalcy is unnecessary (since the data is not being used, regardless of what it says), showing normalcy of the data can be useful to justify the use of an ANOVA.