data(iris)

Chapter 1

1.8 Smoking habits of UK residents.

  1. What does each row of the data matrix represent?

Each row of data represents an observation or participant.

  1. How many participants were included in the survey?

1,691 participants

  1. Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

1.10 Cheaters, scope of inference.

  1. Identify the population of interest and the sample in this study.

Population of interest: Children between the ages of 5 and 15. Sample: 160 children between the ages of 5 and 15.

  1. Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.

For this study to be generalizable to the population, this study needs to be replicated by collecting a sufficiently large sample.

This is an experimental study testing whether a specific instruction not to cheat affects whether children cheat. I think the researchers would need to consider many other factors, such as the upbringing and values of the children, and would need to use appropriate blocking and randomization to control for other factors that may influence the outcome of the experiment.

1.28 Reading the paper.

  1. Based on this study, can we conclude that smoking causes dementia later in life? Explain your reasoning.

No, this is an observational study. This type of study cannot prove causality. The study may demonstrate an association between smoking and dementia, but not causation.

1.36 Exercise and mental health

  1. What type of study is this?

Experimental

  1. What are the treatment and control groups in this study?

The treatment group is instructed to exercise twice a week. The control group is instructed NOT to exercise at all.

  1. Does this study make use of blocking? If so, what is the blocking variable?

Stratified randomization is used to ensure that the age groups between 18 and 55 are appropriately represented in the study. I don’t think the study blocks on any variable such as the subject’s mental health at the start of the experiment.

  1. Does this study make use of blinding? No.

  2. Comment on whether or not the results of the study can be used to establish a causal relationship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population at large.

For this study to be generalized to the population (ages 18 to 55), it needs to be replicated with a sufficiently large sample. I’m not sure if this study can be used to establish causality. One obvious factor is the subject’s initial mental health, and I do not think this is being controlled in this study.

  1. Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?

Yes. Initial mental health of the participants may have an effect on the outcome. If the treatment group has more subjects with good mental health and control group just happens to have more subjects with poorer mental health, this would have an effect on the outcome.
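The reservation above is about chance imbalance between groups. As a rough sketch of how random assignment within strata could be carried out, the code below splits each age band evenly between the two arms. The 12 participants, the three age bands, and all variable names are hypothetical and are not taken from the exercise.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical participants: 4 in each of three assumed age bands.
participants <- data.frame(
  id = 1:12,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 4),
  stringsAsFactors = FALSE
)

# Within each age band, shuffle an even mix of the two labels.
participants$arm <- NA_character_
for (g in unique(participants$age_group)) {
  rows <- which(participants$age_group == g)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  participants$arm[rows] <- sample(labels)
}

# Each age band should contribute equally to both arms.
assignment_counts <- table(participants$age_group, participants$arm)
assignment_counts
```

The same idea would extend to a baseline mental health score: stratify on it first, then randomize within each stratum.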

1.48 Stats scores.

data <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
summary(data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.75   78.50   77.70   82.25   94.00
boxplot(data)
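To make the box plot easier to read, the sketch below computes the quartiles, the IQR, and the 1.5 × IQR outlier limits for the same 20 scores. The variable names are illustrative, and the quartiles use R's default `quantile()` method, which matches the `summary()` output above.

```r
# Same 20 scores as above, stored under a new name for this sketch.
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# First and third quartiles (default quantile type 7).
quartiles <- unname(quantile(scores, c(0.25, 0.75)))
score_iqr <- quartiles[2] - quartiles[1]

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
lower_limit <- quartiles[1] - 1.5 * score_iqr
upper_limit <- quartiles[2] + 1.5 * score_iqr
flagged <- scores[scores < lower_limit | scores > upper_limit]

c(lower = lower_limit, upper = upper_limit)
flagged
```

With these scores the limits are 58.5 and 96.5, so only the lowest score, 57, is flagged as an outlier.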

1.50 Mix-and-match.

1.56 Distribution and appropriate statistics, Part II.

  1. Housing prices in a country where 25% of the houses cost below 350,000, 50% of the houses cost below 450,000, 75% of the houses cost below 1,000,000, and there are a meaningful number of houses that cost more than 6,000,000.

The distribution of housing prices looks skewed to the right. Since there are outliers, the median and IQR are more robust measures of the center and spread of the data.

IQR <- 1000000 - 350000 
IQR
## [1] 650000
top_whisker <- 1000000 + 1.5*IQR
top_whisker
## [1] 1975000
  1. Housing prices in a country where 25% of the houses cost below 300,000, 50% of the houses cost below 600,000, 75% of the houses cost below 900,000, and very few houses cost more than 1,200,000.

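The data are more symmetric: the median sits midway between the two quartiles. The mean and standard deviation would describe this distribution well, although I think the median and IQR remain the more robust choice.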

IQR <- 900000 - 300000
top_whisker <- 900000 + 1.5*IQR
top_whisker
## [1] 1800000
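The two whisker calculations above follow the same rule, so they can be wrapped in one small helper and compared against the price levels mentioned in each scenario. This is only a sketch; the function name `upper_whisker_limit` is made up for illustration.

```r
# Upper limit for a box plot whisker: Q3 + 1.5 * IQR.
upper_whisker_limit <- function(q1, q3) {
  q3 + 1.5 * (q3 - q1)
}

limit_first  <- upper_whisker_limit(350000, 1000000)  # first scenario
limit_second <- upper_whisker_limit(300000, 900000)   # second scenario

# Do the high prices named in each scenario fall beyond the limit?
first_has_outliers  <- 6000000 > limit_first
second_has_outliers <- 1200000 > limit_second

c(limit_first, limit_second)
c(first_has_outliers, second_has_outliers)
```

Houses above 6,000,000 lie far beyond the first limit of 1,975,000, while 1,200,000 is still inside the second limit of 1,800,000.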
  1. Number of alcoholic drinks consumed by college students in a given week. Assume that most of these students don’t drink since they are under 21 years old, and only a few drink excessively.

The data are concentrated near zero drinks, with a few outliers who drink excessively. Since the number of drinks cannot be negative, every value is zero or greater. I would say the data are skewed to the right.

Since most of the data would be centered near zero with a few outliers, the median and IQR would be appropriate to use, because the mean would be pulled upward by the few heavy drinkers.

  1. Annual salaries of the employees at a Fortune 500 company where only a few high level executives earn much higher salaries than all the other employees.

I think this data would be skewed to the right because of the executive salaries. I think it would be better to use the median and IQR.
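A small numerical illustration may help show why the median and IQR are preferred for skewed data like this. The salaries below are invented for the sketch (in thousands, nine regular employees and one executive) and do not come from the exercise.

```r
# Hypothetical salaries in thousands; the last value is the executive.
salaries <- c(48, 52, 55, 58, 60, 61, 63, 65, 70, 900)
staff_only <- salaries[salaries < 900]

# Center: the mean reacts strongly to the one large value, the median barely.
mean_all     <- mean(salaries)
median_all   <- median(salaries)
mean_staff   <- mean(staff_only)
median_staff <- median(staff_only)

# Spread: the standard deviation is inflated by the executive, the IQR is not.
# stats::IQR is spelled out because IQR is also used as a variable name above.
sd_all  <- sd(salaries)
iqr_all <- stats::IQR(salaries)

c(mean_all = mean_all, median_all = median_all)
c(mean_staff = mean_staff, median_staff = median_staff)
c(sd_all = sd_all, iqr_all = iqr_all)
```

In this made-up sample, the single executive moves the mean from about 59 to 143.2, while the median only moves from 60 to 60.5.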

1.70 Heart transplants

  1. Based on the mosaic plot, is survival independent of whether or not the patient got a transplant? Explain your reasoning.

If survival were independent of whether or not the patient got a transplant, the outcomes in the control group and the treatment group should be close to each other. Based on the data, the control group has a survival rate of only about 12%, while the treatment group has a survival rate of about 35%. This is a difference of 23 percentage points. How likely is a difference this large to arise by chance alone if survival is independent of whether or not the patient got a transplant? A hypothesis test should be done.

H0: Survival is independent of whether or not the patient received a heart transplant. This means that the difference in survival rates should be 0.

HA: Survival is not independent of whether or not the patient received a heart transplant. This means that the difference in survival rates is not 0.

Control group survival: 0.1176471. Treatment group survival: 0.3478261.

control_survival <- (34-30)/34
treatment_survival <- (69-45)/69

control_survival
## [1] 0.1176471
treatment_survival
## [1] 0.3478261
treatment_survival - control_survival
## [1] 0.230179
  1. What do the box plots below suggest about the efficacy (effectiveness) of the heart transplant treatment?

Looking at the boxplot for the treatment group, it looks like 75% of the patients survived no longer than about 525 days after the transplant (about 1 year and 160 days). 75% of 69 patients (rounded up to the next integer) is 52 patients, so about 52 of the patients did not survive past 525 days. Now I am wondering how the researchers defined “survival” when they reported that 45 died. They must have defined survival using a cutoff of fewer than 525 days. Only very few patients survived beyond the top whisker (close to 1500 days). So it looks like most patients did not really survive long term (past 1500 days, or about 4 years).

  1. What proportion of patients in the treatment group and what proportion of patients in the control group died?
control_died <- 30/34
treatment_died <- 45/69

control_died
## [1] 0.8823529
treatment_died
## [1] 0.6521739
  1. One approach for investigating whether or not the treatment is effective is to use a randomization technique.
  1. What are the claims being tested?

Experimental heart transplant increases lifespan.

  1. Answers to fill in the blanks are below.
  1. What do the simulation results shown below suggest about the effectiveness of the transplant program?

About 88% of the control group died, while about 65% of the treatment group died. This is a difference of 23 percentage points. The distribution of simulated differences suggests that a difference this large is highly unlikely to have occurred by chance. The researchers might decide to reject the null hypothesis that there is no difference between the control and treatment groups.
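The randomization technique can be sketched in R by shuffling the group labels many times and recomputing the difference in survival rates each time. This is a minimal illustration built only from the counts used above (4 of 34 control patients and 24 of 69 treatment patients survived). The seed, the number of simulations, and all names are arbitrary choices for this sketch, not the exercise's own simulation.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# 1 = survived, 0 = died, using the counts from the answers above.
outcome <- c(rep(1, 4), rep(0, 30), rep(1, 24), rep(0, 45))
group   <- c(rep("control", 34), rep("treatment", 69))

# Difference in survival proportion: treatment minus control.
diff_in_survival <- function(outcome, group) {
  mean(outcome[group == "treatment"]) - mean(outcome[group == "control"])
}
observed_diff <- diff_in_survival(outcome, group)

# Under H0 the labels are exchangeable, so shuffle them and recompute.
n_sims <- 10000
simulated_diffs <- replicate(n_sims, diff_in_survival(outcome, sample(group)))

# Share of shuffles with a difference at least as large as the observed one.
p_value <- mean(simulated_diffs >= observed_diff)

observed_diff
p_value
```

A small `p_value` here would mean that shuffled labels rarely produce a gap as large as the observed 0.23, which is the pattern the simulation results in the exercise point to.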