Chapter 1 Homework

Graded: 1.8, 1.10, 1.28, 1.36, 1.48, 1.50, 1.56, 1.70 (use the library(openintro); data(heartTr) to load the data)

1.8 (a) What does each row of the data matrix represent?

Each row represents a UK resident.

How many participants were included in the survey?

n=1691

Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

sex: categorical - not ordinal
age: numerical - discrete
marital: categorical - not ordinal
grossIncome: categorical - ordinal
smoke: categorical - not ordinal
amtWeekends: numerical - discrete (a count of cigarettes)
amtWeekdays: numerical - discrete (a count of cigarettes)

1.10 (a) Identify the population of interest and the sample in this study.

The population of interest seems to be all people, regardless of age. The sample in this study is 160 children between the ages of 5 and 15.

Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.

The results of the study cannot be generalized to the population. A causal relationship cannot be established because there are too many explanatory variables at play. We cannot isolate an association between age and honesty because some children were explicitly told not to cheat while others were not, so their honesty depends on that instruction as well. The experiment was also not blinded, so some bias may come into play.

1.28 (a) An article titled Risks: Smokers Found More Prone to Dementia states the following: “Researchers analyzed data from 23,123 health plan members who participated in a voluntary exam and health behavior survey from 1978 to 1985, when they were 50-60 years old. 23 years later, about 25% of the group had dementia, including 1,136 with Alzheimer’s disease and 416 with vascular dementia. After adjusting for other factors, the researchers concluded that pack-a-day smokers were 37% more likely than nonsmokers to develop dementia, and the risks went up with increased smoking; 44% for one to two packs a day; and twice the risk for more than two packs.”

Based on this study, can we conclude that smoking causes dementia later in life? Explain your reasoning.

We cannot conclude that smoking causes dementia later in life. This was an observational study, so it can show an association but not causation. In addition, the voluntary participants were self-selected and could have been predisposed to developing dementia; that predisposition could even have been a reason why they started smoking to begin with.

Another article titled The School Bully Is Sleepy states the following: “The University of Michigan study, collected survey data from parents on each child’s sleep habits and asked both parents and teachers to assess behavioral concerns. About a third of the students studied were identified by parents or teachers as having problems with disruptive behavior or bullying. The researchers found that children who had behavioral issues and those who were identified as bullies were twice as likely to have shown symptoms of sleep disorders.”

A friend of yours who read the article says, “The study shows that sleep disorders lead to bullying in school children.” Is this statement justified? If not, how best can you describe the conclusion that can be drawn from this study?

This statement is not justified as there could be many other confounding variables between sleep habits and bullying. It could very well be flipped the other way around: these students have sleep disorders because they bully. We can only conclude what is said in the article, that “children who had behavioral issues and those who were identified as bullies were twice as likely to have shown symptoms of sleep disorders.”

1.36 (a) What type of study is this?

Experimental

What are the treatment and control groups in this study?

Treatment group: those who exercise
Control group: those who do not exercise

Does this study make use of blocking? If so, what is the blocking variable?

This study blocks by age group.

Does this study make use of blinding?

There is no blinding, as both the patient and the researcher know who is exercising and who is not. There could be some anonymity when the mental health exam is conducted.

Comment on whether or not the results of the study can be used to establish a causal relationship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population at large.

The conclusions can be generalized to adults ages 18-55, though it would be important to know if the sample size is large enough to begin with. The results of the study cannot be used to establish a causal relationship as other things may have changed in the lives of patients during this trial period that could also affect their mental health.

Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?

I would have reservations about the sample size and about participants' adherence to the treatment (exercising vs. not exercising). It would be better to have some data on the movement of patients, for example from an accelerometer. As it is, there is too much room for error and ambiguity in this proposal.

1.48 Create a box plot of the distribution of these scores. The five number summary provided below may be useful.

x <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
boxplot(x)
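As a quick check on the plot, the five-number summary can be computed from the same scores. This is a minimal sketch using base R's fivenum() and boxplot.stats(); the variable names five and outliers are illustrative only.

```r
# Scores from exercise 1.48 (same vector as above)
x <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)
names(five) <- c("min", "Q1", "median", "Q3", "max")
five

# Points lying more than 1.5 * IQR beyond the hinges (drawn as dots by boxplot)
outliers <- boxplot.stats(x)$out
outliers
```

With hinges at 72.5 and 82.5, the lower fence is 72.5 - 1.5 * 10 = 57.5, so the score of 57 is drawn as an outlier.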

1.50 Describe the distribution in the histograms below and match them to the box plots.

(a) matches with (2) as it is normally distributed and about 50% of the data looks to be above 60 while 50% of the data appears to be below.
(b) matches with (3). This is multimodal.
(c) matches with (1). This histogram is left-skewed.

1.56 For each of the following, state whether you expect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR. Explain your reasoning.

Housing prices in a country where 25% of the houses cost below $350,000, 50% of the houses cost below $450,000, 75% of the houses cost below $1,000,000 and there are a meaningful number of houses that cost more than $6,000,000.

The median house cost is $450,000. This distribution is right skewed as the mean is greater than the median. The median and IQR should be used here for the typical observation and variability as this data is not normally distributed/symmetrical.

Housing prices in a country where 25% of the houses cost below $300,000, 50% of the houses cost below $600,000, 75% of the houses cost below $900,000 and very few houses that cost more than $1,200,000.

The median house cost is $600,000. This is close to normally distributed/symmetric with few outliers. The mean and standard deviation should be used here for the typical observation and variability as this data is normally distributed/symmetrical.

Number of alcoholic drinks consumed by college students in a given week. Assume that most of these students don’t drink since they are under 21 years old, and only a few drink excessively.

This is right skewed as a few students who drink excessively bring the mean higher than the median number of drinks consumed. The median and IQR should be used here for the typical observation and variability as this data is not normally distributed/symmetrical.

Annual salaries of the employees at a Fortune 500 company where only a few high level executives earn much higher salaries than the all other employees.

This is right skewed as a few employees who earn much higher salaries bring the mean higher than the median salary. The median and IQR should be used here for the typical observation and variability as this data is not normally distributed/symmetrical.
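The reasoning in the skewed cases can be illustrated with a small made-up example. The salaries below are hypothetical, chosen only to mimic a few very high earners; they are not data from the exercise.

```r
# Hypothetical salaries: 95 regular employees and 5 executives (made-up values)
salaries <- c(seq(40000, 80000, length.out = 95), rep(2000000, 5))

# The few large values pull the mean and sd up; median and IQR barely notice them
center <- c(mean = mean(salaries), median = median(salaries))
spread <- c(sd = sd(salaries), IQR = IQR(salaries))
center
spread

# Without the executives the distribution is symmetric and mean equals median
regular <- salaries[salaries < 1000000]
c(mean = mean(regular), median = median(regular))
```

In this sketch the mean lands far above the median once the executives are included, which is why the median and IQR are the more robust choices for skewed data.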

1.70 (a) Based on the mosaic plot, is survival independent of whether or not the patient got a transplant? Explain your reasoning.

Based on the mosaic plot, more of the patients who were alive came from the treatment group than from the control group, and a larger proportion of the control group died compared to the treatment group. However, we cannot determine dependence from this alone, as the treatment group appears to be about twice the size of the control group in the mosaic plot. Other factors could also have played into survival, so the survival rate alone is not enough to claim dependence.

What do the box plots below suggest about the efficacy (effectiveness) of the heart transplant treatment.

The box plots show that those in the treatment group had a longer survival time than those in the control group.

What proportion of patients in the treatment group and what proportion of patients in the control group died?

45 out of 69 patients (65.2%) in the treatment group died, and 30 out of 34 patients (88.2%) in the control group died.

trans <- read.table(file="https://raw.githubusercontent.com/jbryer/DATA606Spring2019/master/data/os3_data/Ch%201%20Exercise%20Data/heart_transplant.csv", header=TRUE, sep="," )
table(trans$transplant, trans$survived)

##            
##             alive dead
##   control       4   30
##   treatment    24   45

30/34

## [1] 0.8823529

45/69

## [1] 0.6521739
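The same proportions can be read off the table directly rather than typed by hand. This is a minimal sketch that rebuilds the table from the counts printed above, so it runs without downloading the file; prop.table() with margin = 1 divides each row by its row total. The names counts, row_props and died are illustrative.

```r
# Counts copied from the table() output above (rows: group, columns: outcome)
counts <- matrix(c(4, 30, 24, 45), nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("control", "treatment"),
                                 outcome = c("alive", "dead")))

# Row-wise proportions: each row sums to 1
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)

# Proportion of patients who died in each group
died <- row_props[, "dead"]
died
```

With the downloaded data, the equivalent call would be prop.table(table(trans$transplant, trans$survived), margin = 1).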

What are the claims being tested?

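Null hypothesis: survival is independent of whether or not the patient got a transplant. Alternative hypothesis: a transplant increases the likelihood of survival for a patient.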

The paragraph below describes the set up for such approach, if we were to do it without using statistical software. Fill in the blanks with a number or phrase, whichever is appropriate.

We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are less than or equal to the observed difference, 45/69 - 30/34 ≈ -0.230 (proportion of treatment patients who died minus proportion of control patients who died). If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
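The card-shuffling procedure described above can be sketched in R. This is only an illustration under stated assumptions: the seed is arbitrary, the variable names are made up, and 100 repetitions follows the paragraph rather than any recommendation.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is reproducible

# 28 "alive" cards and 75 "dead" cards, as in the paragraph above
cards <- c(rep("alive", 28), rep("dead", 75))
n_treat <- 69                    # treatment group size; the other 34 are control
observed <- 45/69 - 30/34        # observed difference (treatment - control)

# Shuffle, split into two groups, and record the difference in proportion dead
sim_diff <- replicate(100, {
  shuffled <- sample(cards)
  treat <- shuffled[1:n_treat]
  ctrl <- shuffled[(n_treat + 1):length(cards)]
  mean(treat == "dead") - mean(ctrl == "dead")
})

# Fraction of simulated differences at least as extreme as the observed one
p_value <- mean(sim_diff <= observed)
p_value
```

A histogram of the simulated values, hist(sim_diff), should be roughly centered at 0, with the observed difference far out in the left tail if the transplant is associated with survival.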

What do the simulation results shown below suggest about the effectiveness of the transplant program?

The results show that the program is effective.