Graded: 1.8, 1.10, 1.28, 1.36, 1.48, 1.50, 1.56, 1.70 (use the library(openintro); data(heartTr) to load the data)
1.8 (a) What does each row of the data matrix represent?
Each row represents a UK resident.
n=1691
sex: categorigal - not ordinal age: numerical - discrete marital: categorical - not ordinal grossIncome: categorical - ordinal smoke: categorical - not ordinal amtWeekends: categorical - ordinal amtWeekdays: categorical - ordinal
1.10 (a) Identify the population of interest and the sample in this study.
The population of interest seems to be all people, regardless of age. The sample in this study is 160 children between the ages of 5 and 15.
The results of the study cannot be generalized to the population. A causal relationship cannot be established because there are too many dependent variables here. We cannot isolate an association between age and honesty because some children were explicitly told not to cheat while others were not, so their honesty depends on that as well. This is also not a blind experiment, so some bias comes into play.
1.28 (a) An article titled Risks: Smokers Found More Prone to Dementia states the following:61 “Researchers analyzed data from 23,123 health plan members who participated in a voluntary exam and health behavior survey from 1978 to 1985, when they were 50-60 years old. 23 years later, about 25% of the group had dementia, including 1,136 with Alzheimer’s disease and 416 with vascular dementia. After adjusting for other factors, the researchers concluded that pack-aday smokers were 37% more likely than nonsmokers to develop dementia, and the risks went up with increased smoking; 44% for one to two packs a day; and twice the risk for more than two packs.”
Based on this study, can we conclude that smoking causes dementia later in life? Explain your reasoning.
We cannot conclude that smoking causes dementia later in life as this group of voluntary participants could have been self-selecting and pre-disposed to developing dementia. Their pre-disposition could have been a reason why they started smoking to begin with. Also, this was an observational study, so we cannot assume causation.
A friend of yours who read the article says, “The study shows that sleep disorders lead to bullying in school children.” Is this statement justified? If not, how best can you describe the conclusion that can be drawn from this study?
This statement is not justified as there could be many other confounding variables between sleep habits and bullying. It could very well be flipped the other way around: these students have sleep disorders because they bully. We can only conclude what is said in the article, that “children who had behavioral issues and those who were identified as bullies were twice as likely to have shown symptoms of sleep disorders.”
1.36 (a) What type of study is this?
Experimental
Treatment group: those who exercise Control group: those who do not exercise
This study blocks by age group.
There is no blinding as both the patient and the researcher know who is exercising and who is not. There could be some anonimity when the mental health exam is conducted.
The conclusions can be generalized to adults ages 18-55, though it would be important to know if the sample size is large enough to begin with. The results of the study cannot be used to establish a causal relationship as other things may have changed in the lives of patients during this trial period that could also affect their mental health.
I would have reservations in terms of the size of the population and also their adherence to the treatment (excercising vs not exercising). It would be better to have some data on the movement of patients using an accelerometer. As it is, there is too much room for error/ambiguity in this proposal.
1.48 Create a box plot of the distribution of these scores. The five number summary provided below may be useful.
x <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
boxplot(x)
1.50 Describe the distribution in the histograms below and match them to the box plots.
matches with (2) as it is normally distributed and about 50% of the data looks to be above 60 while 50% of the data appears to be below.
matches with (3). This is multimodal.
matches with (1). This histogram is left-skewed.
1.56 For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR. Explain your reasoning.
The median house cost is $450,000. This distribution is right skewed as the mean is greater than the median. The median and IQR should be used here for the typical observation and variability as this data is not normally distributed/symmetrical.
The median house cost is $600,000. This is close to normally distributed/symmetric with few outliers. The mean and standard deviation should be used here for the typical observation and variability as this data is normally distributed/symmetrical.
This is right skewed as a few students who drink excessively bring the mean higher than the median number of drinks consumed. The median and IQR should be used here for the typical observation and variability as this data is not normally distributed/symmetrical.
This is right skewed as a few employees who earn much higher salaries bring the mean higher than the median salary. The median and IQR should be used here for the typical observation and variability as this data is not normally distributed/symmetrical.
1.70 (a) Based on the mosaic plot, is survival independent of whether or not the patient got a transplant? Explain your reasoning.
Based on the mosaic plot, more of those who are alive were from the treatment group than from the control group. A larger portion of those in the control group died compared to those in the treatment group. However, we cannot determine dependence here as the treatment group appears to be twice the size of the control group based off of the mosaic. There could also have been other factors that played into their survival, the survival rate alone is not enough to claim dependence.
The box plots show that those in the treatment group had a longer survival time than those in the control group.
45 out of 69 patients (88.2%) in the treatment group died, and 30 out of 34 patients (65.2%) in the control group died.
trans <- read.table(file="https://raw.githubusercontent.com/jbryer/DATA606Spring2019/master/data/os3_data/Ch%201%20Exercise%20Data/heart_transplant.csv", header=TRUE, sep="," )
table(trans$transplant, trans$survived)
##
## alive dead
## control 4 30
## treatment 24 45
30/34
## [1] 0.8823529
45/69
## [1] 0.6521739
A transplant increases the likelihood of survival for a patient.
We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated diffrences in proportions are less than or equal to ratio of treatment died - ratio control died. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
The results show that the program is effective.