Data 606 Homework 1 - Introduction to data

Heather Geiger

February 11, 2018

  1. Q1.8
      1. Each row is an individual, also known as a case.
      1. The survey included 1,691 participants.
      1. Indicate whether each variable in the study is numerical or categorical.
      • sex - categorical
      • age - discrete numeric
      • marital - categorical
      • grossIncome - categorical, ordinal
      • smoke - categorical
      • amtWeekends - discrete numeric
      • amtWeekdays - discrete numeric
  2. Q1.10
      1. Identify the population of interest and the sample in this study.
      • Population of interest - all children age 5-15
      • Sample - 160 children age 5-15
      1. Comment on whether or not the results of the study can be generalized to the population, and if the findings of the study can be used to establish causal relationships.
      • As this was an experiment and not just an observational study, one should be able to establish causal relationships using this data. My only concern in generalizing this data is that one needs to be careful in interpretation given the “differences across children’s characteristics within each group” observed. If there are differences between the groups that involve an obvious confounding variable (for example, one group has many more younger or older children than the other), one should control for that during analysis. Also, one should only generalize to children similar to those studied. So if all the children studied were from a certain geographic area, socioeconomic class, etc., that limits the population to which one can generalize.
  3. Q1.28
      1. Based on this study, we cannot conclude that smoking causes dementia later in life. We can only conclude that they are associated. There are many health and social factors associated with smoking that could be the actual causal link. The only way to tell for sure if there is a causal link would be to randomly assign some people to smoke and see if they get dementia later in life, which would be unethical.
      1. Based on this study, one cannot conclude that sleep disorders lead to bullying in school children. We can only conclude that they are correlated. Similar to the first example, there may be many other factors associated with sleep disorders. For example, children whose parents don’t set a proper sleep routine and bedtime for them may be more likely to have sleep disorders, and this different parenting could be what is actually causing the bullying behavior.
  4. Q1.36
      1. This study is an experiment.
      1. The treatment group is the one instructed to exercise. The control group is the one instructed not to exercise.
      1. This study does make use of blocking based on age group.
      1. This study does not make use of blinding.
      1. One can establish a causal link based on the results of this study, as it is an experiment rather than just an observational study. One should be able to generalize to the population of adults age 18-55.
      1. The study design is mostly good, but I do have a few suggestions for them to improve their study design.
      • In its current form, the study is really measuring the variable of whether or not someone is instructed to exercise rather than the actual amount of exercise they get. Ideally one would measure compliance in some way (like having study participants wear a pedometer or send videos of themselves exercising).
      • If the sample is large enough for this to be practical, it might also be good to block based on initial mental health exam scores. One could imagine that people with depression might have more trouble sticking to an exercise regimen, for example. Blocking based on initial mental health would eliminate this possible confounding variable.
      • While the treatment makes it impossible to have a fully blinded study, ideally one would at least have the researchers administering the mental health exam be blinded if the mental health exam grading has any potential for administrator bias. You wouldn’t want the exercise group to get higher exam scores because the person giving the mental health exam subconsciously perceives people who exercise as more healthy.
  5. Q1.48
    • Based on the numbers provided, the median line in the boxplot will be at 78.5, while the box will extend from 72.5 to 82.5.
    • The IQR is 82.5 - 72.5 = 10, so 1.5*IQR = 15. The whiskers can extend at most 1.5*IQR beyond the boundaries of the box on either side. On the bottom, this limit would be 72.5 - 15 = 57.5, while on the top it would be 82.5 + 15 = 97.5.
    • This would mean that 57 will be an outlier, while all other values fall within the whiskers.
    • We then adjust the whiskers to the min and max of all non-outlier values. This would mean that the bottom whisker extends to 66, while the top whisker extends to 94.
    • The boxplot and its code are shown in the appendix.
  6. Q1.50
      1. Symmetric, plot 2
      1. Uniform, plot 3
      1. Right-skewed, plot 1
  7. Q1.56
      1. Right-skewed, summarize with median and IQR.
      1. Symmetric, summarize with mean and SD.
      1. Right-skewed, summarize with median and IQR.
      1. Right-skewed, summarize with median and IQR.
  8. Q1.70
      1. Based on the mosaic plot, survival is NOT independent of whether or not the patient got a transplant.
      1. The boxplots suggest that the heart transplant treatment is effective, with those in the treatment group having longer survival times.
      1. In the treatment group, 65% of patients died (45/69 * 100) compared to 88% in the control group (30/34 * 100).
    • d, i. The null hypothesis is that survival is independent of whether or not the patient got a transplant, so the difference we see between groups is due to chance. The alternative hypothesis is that survival depends on the transplant, so the difference we see is larger than we would expect from chance alone.
    • d, ii. We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are less than or equal to -0.23. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
    • d, iii. The simulation results suggest that the transplant program is effective, as a difference of less than or equal to -0.23 happens very rarely, suggesting that the result we saw is unlikely to have occurred just due to chance.
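
The card-shuffling procedure described in Q1.70 d, ii can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the variable names, the `simulate_diff` helper, and the seed are made up for this sketch, and it compares against the exact observed difference (about -0.23) rather than the rounded value.

```r
# Counts from the answers above: 69 treatment patients (45 died) and
# 34 control patients (30 died), giving 75 "dead" and 28 "alive" cards.
set.seed(606)  # arbitrary seed (an assumption), used only for reproducibility

cards <- c(rep("dead", 75), rep("alive", 28))
n_treatment <- 69
n_sims <- 100  # number of shuffles described in the text

# Observed difference in the proportion dead (treatment - control)
observed_diff <- 45 / 69 - 30 / 34

# Hypothetical helper: shuffle the cards, deal the first 69 to "treatment"
# and the remaining 34 to "control", and return the difference in proportions
simulate_diff <- function() {
  shuffled <- sample(cards)
  treatment <- shuffled[1:n_treatment]
  control <- shuffled[(n_treatment + 1):length(shuffled)]
  mean(treatment == "dead") - mean(control == "dead")
}

sim_diffs <- replicate(n_sims, simulate_diff())

# Fraction of shuffles with a difference at least as extreme as the observed one
p_value <- mean(sim_diffs <= observed_diff)
p_value
```

With only 100 shuffles the estimated fraction is coarse (it can only take values in steps of 0.01), so increasing `n_sims` would give a more stable estimate.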

Appendix - Boxplot and code

# Scores from Q1.48
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
boxplot(scores)
# Dashed reference lines at the lower whisker, Q1, median, Q3, and upper whisker
abline(h = c(66, 72.5, 78.5, 82.5, 94), lty = 2)
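
The hand calculation in Q1.48 can also be checked numerically. The sketch below uses `fivenum()`, which returns the hinges that `boxplot()` draws; the variable names are chosen only for this illustration. Note that `quantile()` with its default settings can give slightly different quartiles for the same data.

```r
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(scores)
q1 <- five[2]
med <- five[3]
q3 <- five[4]

# Whisker limits ("fences") sit 1.5 * IQR beyond the edges of the box
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are drawn as outliers
outliers <- scores[scores < lower_fence | scores > upper_fence]

# Whiskers end at the most extreme values that are not outliers
whiskers <- range(scores[scores >= lower_fence & scores <= upper_fence])

c(q1 = q1, median = med, q3 = q3, lower_fence = lower_fence, upper_fence = upper_fence)
outliers
whiskers
```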