Each row of the data matrix represents one person’s answers to the survey.
Since the index number of the final row is 1691, we see that the survey included 1691 participants.
Variables:
sex - categorical (not ordinal)
age - numerical (discrete)
marital - categorical (not ordinal)
grossIncome - categorical (ordinal)
smoke - categorical (not ordinal)
amtWeekends - numerical (discrete)
amtWeekdays - numerical (discrete)
1.10 Cheaters, scope of inference
The population of interest is all children aged 5 to 15, and the sample is the 160 children in that age range who participated in the study.
Without information on how the subjects were selected, we cannot determine whether the results can be generalized. If the subjects were sampled randomly, the results can be generalized. Since this is an experiment, we can use the results to establish causal relationships.
1.28 Reading the paper
Though the article states that the researchers adjusted for outside factors, it’s difficult to establish causal relationships from this study without further detail. First, this is an observational study and not an experiment. Furthermore, the subjects participated in a voluntary exam and survey and were not randomly sampled, meaning that selection bias may be present among the participants.
For this bullying study, I wonder about bias within the sample. Parents and teachers may not have an objective view of a child when it comes to identifying which children are bullies. Furthermore, there are certainly factors other than sleep disorders that could act as confounding variables for the causes of bullying. This study does not provide a full enough picture.
1.36 Exercise and mental health
This is a designed experiment.
The treatment group exercises twice weekly, and the control group does not exercise.
The study is blocked on three age groups: 18-30, 31-40, and 41-55.
This study is not blind - both the subjects and the researchers will know who is exercising.
The study can be used to establish a causal relationship because it is an experiment. Because the participants were randomly sampled, the conclusions can be generalized to the population at large.
While the exercise variable is specifically controlled in this experiment, as a funder I would be concerned about other confounding variables in the participants’ lives that may influence the results of their mental health exams.
1.48 Stats scores
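# Final exam scores of the twenty introductory statistics students
statsScores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
library(ggplot2)
# A single box: the empty string gives one unlabeled category on the x axis
ggplot() + aes(x = "", y = statsScores) + geom_boxplot() + labs(title = "Boxplot of Stats Scores")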
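The numbers behind the boxplot can be checked with base R. This is a minimal sketch, assuming the default quantile method (type 7) and the usual 1.5 × IQR rule for flagging outliers; the variable names are illustrative only.

```r
# Same scores as above, repeated so this block runs on its own
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Minimum, lower hinge, median, upper hinge, maximum
five_numbers <- fivenum(scores)

# Quartiles and IQR using R's default quantile method (type 7)
quartiles <- unname(quantile(scores, c(0.25, 0.75)))
iqr <- IQR(scores)

# Fences under the 1.5 * IQR rule (an assumed convention)
lower_fence <- quartiles[1] - 1.5 * iqr
upper_fence <- quartiles[2] + 1.5 * iqr

# Scores outside the fences would be drawn as individual points
outliers <- scores[scores < lower_fence | scores > upper_fence]

print(five_numbers)
print(outliers)
```

Under these assumptions the median is 78.5, and the lowest score (57) falls just below the lower fence, so it would appear as a separate point rather than as part of the whisker.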
1.50 Mix-and-match
This distribution is unimodal and fairly symmetric. Its boxplot is (2).
This distribution is nearly uniform. Its boxplot is (3).
This distribution is skewed to the right with a long tail. Its boxplot is (1).
1.56 Distributions and appropriate statistics
Since there are a meaningful number of houses valued over $6,000,000, the distribution would be right skewed. In this case the median and IQR would be the best measurements, since they are less subject to the influence of the outliers at the top of the price range.
The distribution of housing prices in this example is symmetric around the $600,000 mark. From the price lists given, the mean and the standard deviation would be good measurements for the center and spread.
The number of drinks for the students cannot be less than zero, and while many of the students will have only a few drinks, there are still some that will drink excessively. Therefore, this distribution is right skewed and best described by the median and IQR.
Because the few high salaries of the top executives are much higher than the majority of the workers, what would normally be a symmetric distribution is thrown off and becomes right skewed when they are included. Therefore, the median salary and IQR are the best measurements.
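To illustrate why the median and IQR are preferred for right-skewed data, here is a small sketch using made-up salary figures (in thousands of dollars). The values are hypothetical and are not taken from the exercise.

```r
# Hypothetical salaries in thousands: eight typical workers, two executives
salaries <- c(38, 40, 42, 45, 47, 50, 52, 55, 400, 900)

# The mean and standard deviation are pulled upward by the two large values
salary_mean <- mean(salaries)
salary_sd <- sd(salaries)

# The median and IQR depend only on the middle of the data
salary_median <- median(salaries)
salary_iqr <- IQR(salaries)

cat("mean:", salary_mean, " median:", salary_median, "\n")
cat("sd:", salary_sd, " IQR:", salary_iqr, "\n")
```

With these made-up numbers, the mean lands far above what any typical worker earns, while the median stays within the range of typical salaries.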
1.70 Heart transplants
The mosaic plot suggests that survival is not independent of receiving the treatment. The proportion of survivors in the treatment group appears much larger than in the control group.
The box plots show that the median survival time is higher for the treatment group. Furthermore, the IQR for the treatment group spans a much wider range of survival times than the IQR for the control group. The spread of survival times is very small for the control group, but it is much larger for the treatment group (and the times are mostly longer than for the control group). These boxplots suggest that the treatment is effective for extending survival time.
30 out of 34 members of the control group died, or a proportion of 88%. 45 out of 69 members of the treatment group died, or a proportion of 65%.
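These proportions, and the difference between them, can be verified with a few lines of R. This is a simple sketch; the variable names are illustrative.

```r
# Counts stated above
control_dead <- 30
control_n <- 34
treatment_dead <- 45
treatment_n <- 69

# Proportion of patients who died in each group
prop_control <- control_dead / control_n
prop_treatment <- treatment_dead / treatment_n

# Observed difference in proportion dead (treatment - control)
observed_diff <- prop_treatment - prop_control

print(round(100 * c(control = prop_control, treatment = prop_treatment)))
print(round(observed_diff, 2))
```

The difference is negative because a smaller share of the treatment group died.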
The claims being tested are:
H0: The experimental heart transplant program and survival time of the patient are independent. Any difference between treatment and control groups in survival rate and survival time is due to chance.
HA: The experimental heart transplant program and survival time of the patient are not independent. The difference between survival times and death rates between the treatment and control groups is not due to chance.
To simulate: We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are at least as extreme as the observed difference, that is, less than or equal to the observed value of about -0.23. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
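The card-shuffling procedure can be sketched in base R as follows. This is an illustrative version under stated assumptions, not the simulation used in the exercise: the seed is arbitrary, the helper `simulate_diff` is a hypothetical name, and the result will vary from run to run with only 100 repetitions.

```r
set.seed(1)  # arbitrary seed, chosen only so the sketch is reproducible

# One card per patient: 28 alive and 75 dead
cards <- c(rep("alive", 28), rep("dead", 75))
n_treatment <- 69
n_control <- 34

# Observed difference in proportion dead (treatment - control)
observed_diff <- 45 / 69 - 30 / 34

# Hypothetical helper: shuffle the cards, deal them into the two groups,
# and return the difference in proportion dead (treatment - control)
simulate_diff <- function() {
  shuffled <- sample(cards)
  treatment <- shuffled[seq_len(n_treatment)]
  control <- shuffled[-seq_len(n_treatment)]
  mean(treatment == "dead") - mean(control == "dead")
}

# Repeat the shuffle 100 times to build the null distribution
sim_diffs <- replicate(100, simulate_diff())

# Fraction of simulated differences at least as extreme as the observed one
fraction_extreme <- mean(sim_diffs <= observed_diff)
print(fraction_extreme)
```

Because the observed difference is negative, "at least as extreme" here means less than or equal to it. A small fraction would count as evidence against the null hypothesis.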
The simulation results suggest that the difference in survival rates is not due to chance and that we should reject the null hypothesis.