Data 606 HW 1

1.8)
(a) Each row represents a different UK resident who participated in the survey.

(b) There were 1691 participants in the survey.

(c) sex - categorical - 2 levels (male and female)

age - numerical - discrete

marital - categorical - 2 levels (single and married)

grossIncome - categorical - ordinal

smoke - categorical - 2 levels (yes and no)

amt/Weekends - numerical - discrete

amt/Weekdays - numerical - discrete
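
The variable types listed above map onto R's basic column types. The sketch below is only an illustration: the three rows, the income labels ("low", "middle", "high") and the name `smoking_toy` are invented for this example and are not taken from the survey data.

```r
# Hypothetical rows, invented only to illustrate the variable types.
smoking_toy <- data.frame(
  sex         = factor(c("female", "male", "male")),            # categorical, 2 levels
  age         = c(42L, 44L, 53L),                               # numerical, discrete
  marital     = factor(c("single", "single", "married")),       # categorical
  grossIncome = factor(c("low", "high", "middle"),
                       levels = c("low", "middle", "high"),
                       ordered = TRUE),                          # categorical, ordinal
  smoke       = factor(c("yes", "no", "yes")),                  # categorical, 2 levels
  amtWeekends = c(12L, NA, 6L),                                 # numerical, discrete
  amtWeekdays = c(12L, NA, 6L)                                  # numerical, discrete
)

# str() reports the type R assigned to each column.
str(smoking_toy)
```

An ordered factor keeps the ranking of the income categories, which is what distinguishes an ordinal variable from a plain categorical one.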

1.10) (a) Population of interest - all children between the ages of 5 and 15. Sample - the 160 children between the ages of 5 and 15 who took part in the study.

(b) Whether the results can be generalized to the population depends on how the children were chosen for the study. If the children were chosen at random from all over the world and every child who was chosen participated, then the results could be generalized to all children between the ages of 5 and 15. However, if only children from a certain location or background, or only those whose parents volunteered them, were part of the study, then the results cannot be generalized to the population. Whether the findings can be used to establish causal relationships depends on how the children were chosen to be given instructions. Because this was an experiment, if the decision about which children were given instructions was randomized, then a causal connection between the explanatory variable and the response variable can be determined.

1.28) (a) We cannot conclude that smoking causes dementia later in life. This was an observational study, so a causal relationship cannot be determined.

(b) The statement that sleep disorders lead to bullying is unjustified. This is an observational study, so a causal relationship cannot be determined. I would conclude only that there is an association between children having a sleep disorder and children being identified as bullies or as having behavioral issues.

1.36) (a) experiment using stratified random sampling

(b) treatment group - those individuals who exercise twice a week
control group - those individuals who are instructed not to exercise
(c) Yes, blocking is being used. The blocking variable is age. Individuals are grouped by age, and then half of each age group is put into each category.
(d) This study does not make use of blinding. The patients know whether they are exercising or not.
(e) The results can be used to show a causal relationship between exercise and mental health because this is an experiment: the researcher assigns the treatment rather than only observing it. The conclusion can be generalized to the population at large because the subjects were selected by stratified random sampling.
(f) The reservations that I have relate to other factors that might be significant but are not taken into account in the study, such as gender, health background, socio-economic background, whether the participant has exercised in the past, and the location where the participant lives. I also think it would be important to specify the type and rigor of the exercise so that participants' experiences are uniform and a general conclusion can be drawn.
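
The design described in this exercise, sampling within age strata and then splitting each age block in half, can be sketched in R. This is only an illustration under stated assumptions: the age bands, the 20 subjects per band, the seed, and the names `subjects` and `group` are all invented for the example.

```r
set.seed(136)  # arbitrary seed so the sketch is reproducible

# Hypothetical stratified sample: 20 subjects from each of three age bands.
# The bands and the sample sizes are assumptions made for illustration.
subjects <- data.frame(
  id        = 1:60,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 20),
  stringsAsFactors = FALSE
)

# Blocking: within each age group, randomly assign half to each condition.
subjects$group <- NA_character_
for (block in unique(subjects$age_group)) {
  in_block <- which(subjects$age_group == block)
  labels   <- rep(c("exercise", "no exercise"), length.out = length(in_block))
  subjects$group[in_block] <- sample(labels)  # random order within the block
}

# Each age group contributes equally to both conditions.
table(subjects$age_group, subjects$group)
```

Because the split is made separately inside each age block, age cannot differ between the treatment and control groups, which is the purpose of blocking.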

1.48)

scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
boxplot(scores)
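
As a check on the plot, the numbers behind the box plot can be printed directly. The sketch below reuses the same scores and relies on base R's `fivenum()` and `boxplot.stats()`, which is what `boxplot()` uses to place the box, whiskers, and outlier points.

```r
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(scores)
five

# Whisker ends and hinges as drawn by boxplot(), plus any flagged outliers
bp <- boxplot.stats(scores)
bp$stats
bp$out
```

The five-number summary is 57, 72.5, 78.5, 82.5, 94. The distance between the hinges is 10, so the lower fence is 72.5 - 1.5 * 10 = 57.5. The lowest score, 57, falls just below that fence, so it is drawn as a separate point and the lower whisker stops at 66.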

1.50) (a) unimodal, symmetric, matches picture 2

(b) multimodal, symmetric, matches picture 3
(c) bimodal, right skewed, matches picture 1

1.56) (a) I would expect the data to be right skewed because of the number of houses that cost over $6,000,000. The median would best represent a typical observation. The variability of observations would be best represented by the IQR because the distribution is right skewed.

(b) This distribution is closer to being symmetric, but it skews slightly to the right because of the most expensive houses and because prices are bounded on the left end (a house cannot cost less than zero). Because of the right skew, I would use the median to represent a typical observation and the IQR to represent the variability of observations.
(c) I would expect the data to skew right because the minimum value is zero and some people drink a lot. The median would best represent a typical observation, and the variability of observations would be best represented by the IQR because the distribution is right skewed.
(d) I would expect the data to skew right because of the high-salary employees and because no employee can make less than zero dollars. I would use the median to represent a typical observation and the IQR to represent the variability.
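
The preference for the median and IQR on right-skewed data can be illustrated with a small made-up example. The salary figures below (in thousands of dollars) are invented for this sketch and do not come from the exercise.

```r
# Hypothetical salaries in thousands of dollars; one executive earns far more.
with_outlier    <- c(30, 32, 35, 38, 40, 42, 45, 50, 55, 400)
# Same staff, but the top earner makes a more typical 60.
without_outlier <- c(30, 32, 35, 38, 40, 42, 45, 50, 55, 60)

# The mean and standard deviation are pulled upward by the single large value.
mean(with_outlier);   mean(without_outlier)
sd(with_outlier);     sd(without_outlier)

# The median and IQR are identical for both vectors.
median(with_outlier); median(without_outlier)
IQR(with_outlier);    IQR(without_outlier)
```

In this example the mean moves from 42.7 to 76.7 when the top salary changes, while the median stays at 41 and the IQR stays at 13, which is why the median and IQR are the more robust summaries for skewed data.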

1.70) (a) Survival appears to depend on whether the patient got a transplant. In the mosaic plot, a greater percentage of the people who received treatment lived than of the people who did not receive treatment.

(b) The box plots suggest that treatment has a large effect on life expectancy. Half of the people who underwent treatment lived more than about 220 days, whereas a person who did not receive treatment and lived that long would be considered an outlier.
(c) The proportion of people in the treatment group who died is about .67. The proportion of people in the control group who died is about .86.
(d)
1. H0 - Heart transplants and life expectancy are independent. HA - Heart transplants and life expectancy are dependent.
2. alive on 116 cards
  
  dead on 284 cards
  
  one group of size 300 representing treatment
  
  one group of size 100 representing control
  
  distribution centered at zero
  
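  calculate the fraction of simulations where the simulated differences in proportions are at least as extreme as the difference in the actual data (about .67 - .86 = -.19, the proportion who died in the treatment group minus the proportion who died in the control group).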
3. The simulation suggests that survival depends on receiving a transplant. About 33% of patients who received a transplant survived, compared with about 14% of patients who did not. In the simulation there were no cases in which the difference between the groups was that large.
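
The card-shuffling procedure in the steps above can be sketched in R. This is a rough illustration, not the textbook's simulation: it takes the card counts (116 alive, 284 dead), the group sizes (300 and 100), and the approximate proportions (.67 and .86) exactly as written in this answer, and the number of shuffles (1000) and the seed are assumptions.

```r
set.seed(606)  # arbitrary seed so the sketch is reproducible

# Card counts and group sizes as stated in the answer above.
cards       <- c(rep("alive", 116), rep("dead", 284))
n_treatment <- 300
n_sims      <- 1000  # assumed number of shuffles

# Observed difference in the proportion who died (treatment - control),
# using the approximate proportions from part (c).
observed_diff <- 0.67 - 0.86

# Shuffle the cards, deal them into two groups, and record the difference.
sim_diffs <- replicate(n_sims, {
  shuffled  <- sample(cards)
  treatment <- shuffled[1:n_treatment]
  control   <- shuffled[(n_treatment + 1):length(shuffled)]
  mean(treatment == "dead") - mean(control == "dead")
})

# Under independence the simulated differences should center near zero.
mean(sim_diffs)

# Fraction of shuffles at least as extreme as the observed difference.
frac_extreme <- mean(sim_diffs <= observed_diff)
frac_extreme
```

If `frac_extreme` is small, a difference as large as the observed one would rarely arise by chance alone, which supports rejecting the hypothesis of independence.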

Sarah Wigodsky

August 26, 2017