606 Homework Week 1

library(openintro)  # provides the smoking data set
data(smoking)

Each row represents a case (specifically, a UK resident participating in the study).
There were 1,691 participants.
Key: CN = Categorical / Non-ordinal; CO = Categorical / Ordinal; NC = Numerical / Continuous; ND = Numerical / Discrete
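
As a rough illustration of the key, the sketch below labels the columns of a toy data frame with the same four codes. The column names and values are hypothetical and are not taken from the smoking data. The `classify()` helper is an assumption made for this example: it relies on how each column happens to be stored in R, so treat its output as a first guess rather than a definitive classification.

```r
# Toy data frame with hypothetical columns, one per variable type in the key
toy <- data.frame(
  region     = factor(c("North", "South", "North")),           # CN
  education  = factor(c("low", "high", "mid"),
                      levels = c("low", "mid", "high"),
                      ordered = TRUE),                          # CO
  height_cm  = c(170.2, 165.5, 180.1),                          # NC
  cigarettes = c(0L, 10L, 20L)                                  # ND
)

# Hypothetical helper: guess the key code from the R storage type
classify <- function(x) {
  if (is.ordered(x)) {
    "CO"
  } else if (is.factor(x) || is.character(x)) {
    "CN"
  } else if (is.integer(x)) {
    "ND"
  } else if (is.numeric(x)) {
    "NC"
  } else {
    NA_character_
  }
}

types <- sapply(toy, classify)
print(types)
```

A storage type is only a hint. A count stored as a double would be labeled NC here, so each variable still needs to be checked against its meaning.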

The population of interest for this study was children between the ages of 5 and 15. It is possible that the actual population of interest is broader and that this group is being used as a limited study.
There is not enough information to say whether the results can be generalized to the population of children ages 5-15 because we don’t know what sampling techniques were used. If we assume that those techniques produced a representative sample, then the results could be generalized. Because this is an experiment rather than an observational study, a causal relationship could be established, assuming adequate data was collected to account for external factors.

We cannot conclude that smoking causes dementia later in life, but we can conclude that the study suggests that smoking is associated with a higher likelihood of dementia later in life. Because this is an observational study, other factors that differ between smokers and non-smokers could account for the link.
There’s really no compelling information that should lead to that conclusion. If nothing else, this is an observational study, so you shouldn’t assign causation to it. Further, the study talks about disruptive behavior OR bullying. There are many forms of disruptive behavior that are not bullying. Finally, we don’t know from this blurb how the subjects of this study were grouped. Were bullies included among those with disruptive behaviors, or were they considered independently? There is also an assumption that the parents are accurately reporting on their children’s sleep behaviors. Reporting by parents isn’t a famously accurate method of data collection. In this case the parents could be reporting with a bias, and they could also be unaware of a child’s actual sleeping behavior if, for example, the child wakes in the middle of the night without disturbing the parent or just pretends to sleep.

What can be concluded from this study is that kids who don’t sleep well are more likely to have behavioral issues.

This is an experiment.
The treatment group is the one asked to exercise; the control group is the one asked not to exercise.
The study uses blocking because it not only draws its sample from strata, but it maintains those strata when assigning whether the participants will be in the treatment or control group.
No, there is no blinding.
Assuming proper sampling, including screening for physical health and current exercise routine, and monitoring to ensure the groups followed their exercise instructions, the results of this study could be used to establish a short-term causal relationship between exercise and mental health and could be generalized to persons aged 18-55.
I would definitely have issues with the study. In addition to my previously stated reservations about physical health and the ability to monitor, a two-week window might show short-term improvements in mental health, but this could be a “pink cloud” effect, where just the fact that someone is doing something perceived as good for themselves results in improved mood and sense of self-worth.
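
The blocked assignment described above can be sketched in a few lines of R. Everything in this example is hypothetical: the participant IDs, the three age strata, and the group sizes are assumptions chosen only to show the mechanics of randomizing within each block.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical participants: three assumed age strata, 10 people each
participants <- data.frame(
  id = 1:30,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 10),
  stringsAsFactors = FALSE
)

# Within each stratum, randomly assign half to treatment and half to control
participants$arm <- NA_character_
for (g in unique(participants$age_group)) {
  rows <- which(participants$age_group == g)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  participants$arm[rows] <- sample(labels)
}

# Each stratum should contribute equally to both arms
assignment_table <- table(participants$age_group, participants$arm)
print(assignment_table)
```

Because the shuffle happens separately inside each age group, age cannot end up unbalanced between the two arms by chance.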

scores = c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
df = data.frame(scores)
summary(df)

##      scores     
##  Min.   :57.00  
##  1st Qu.:72.75  
##  Median :78.50  
##  Mean   :77.70  
##  3rd Qu.:82.25  
##  Max.   :94.00

library(ggplot2)  # plotting package used for the boxplot
scorebox <- ggplot(df, aes(x = "distribution", y = scores)) + 
  geom_boxplot() 
scorebox
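
The boxplot can be cross-checked by hand. The sketch below recomputes the quartiles and applies the 1.5 × IQR rule, which is the default whisker rule for `geom_boxplot()`. The variable names `lower_fence`, `upper_fence`, and `outliers` are introduced here for illustration only.

```r
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
            79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Quartiles using R's default method, matching the summary() output
q <- quantile(scores, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Points beyond these fences are drawn as individual outliers
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[2]) + 1.5 * iqr

outliers <- scores[scores < lower_fence | scores > upper_fence]
print(c(iqr = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```

With these scores the IQR is 9.5 and the fences fall at 58.5 and 96.5, so the score of 57 is the only value flagged as an outlier.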

The distribution should be right-skewed: the top quarter of home prices has no upper limit, and “a meaningful number” of homes cost more than ten times the $450K that half the homes fall under. Because of this extreme skew, the median and IQR would better represent the data.
The distribution will still be right-skewed, but much less so, because very few homes are far above the $900K that 75% of the homes cost less than. The mean and median are probably much closer, and the IQR is probably comparable to (though somewhat narrower than) the range between -1 and 1 SD. The median and IQR are probably still the way to go, given that there is still no upper limit and the cases don’t follow a normal curve.
This is an odd question for a couple of reasons. First, it assumes that college students under 21 years old don’t drink, which is an assumption that makes the whole data set skewed (at best). If we include the students under 21 in the data, then it would be highly right-skewed, but if we exclude them, then it would be a symmetrical, normal curve where mean and standard deviation would be most useful, since the normal distribution calculations are based on these.
The distribution would be symmetric, as the high-level executives are outliers. This would also lead to using the median to represent a typical observation, but because it is a normal distribution, standard deviation should be used.
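
To see why the median and IQR are preferred for skewed data, the sketch below simulates a right-skewed sample and compares the two sets of summaries. The log-normal draws and their parameters are assumptions made purely for illustration; they are not real housing or salary data.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed "prices" with a median near $450K
prices <- rlnorm(10000, meanlog = log(450000), sdlog = 0.8)

# Measures of center: the long right tail pulls the mean above the median
centre <- c(mean = mean(prices), median = median(prices))

# Measures of spread: the SD is sensitive to the extreme values, the IQR is not
spread <- c(sd = sd(prices), iqr = IQR(prices))

print(round(centre))
print(round(spread))
```

In a sample like this the mean sits well above the median, which is the signature of right skew and the reason the median describes a typical observation better.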

It’s impossible to know as there is no indication as to how many people were in each group. If I assume that there was an equal number of participants in each group and that participants all had similar medical backgrounds I could say that the plot suggests survival was more common among the treatment group.
The box plots suggest that patients in the treatment group survived far longer than those in the control group.
The study was published in 1974. It’s highly unlikely that anyone in consideration for a new heart would still be alive today, so 100% of both groups have likely died. As to the proportion who survived or died by the end of the study, there are no actual numbers available. The charts suggest that almost all of the patients died within the study period: the control group died off very soon after the study began, while about a quarter of the treatment group survived until nearly the end of the study.

The claim being tested is whether these experimental heart transplants increased lifespans.
We write alive on index cards representing patients who were alive at the end of the study, and dead on index cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size x participants (e.g. 25%) representing treatment, and another group of size n-x participants (e.g. 75%) representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0, the difference expected if the treatment had no effect. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are at least as extreme as the difference observed in the study. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.
Effectiveness of the transplant program follows a fairly normal distribution.
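
The card-shuffling procedure described above can be sketched as a short simulation. The counts below are hypothetical placeholders, NOT the study's actual numbers, which are not available here. The names `one_shuffle`, `sim_diffs`, and `p_value` are introduced only for this example.

```r
set.seed(606)  # arbitrary seed so the sketch is reproducible

# Hypothetical counts for illustration only (not the study's data)
n_treat <- 60
n_ctrl <- 40
dead_treat <- 40
dead_ctrl <- 35

# One index card per patient, labeled by outcome
n_dead <- dead_treat + dead_ctrl
n_alive <- n_treat + n_ctrl - n_dead
cards <- c(rep("dead", n_dead), rep("alive", n_alive))

# Observed difference in proportion dead (treatment - control)
obs_diff <- dead_treat / n_treat - dead_ctrl / n_ctrl

# Shuffle the cards, deal them into two groups, record the difference
one_shuffle <- function() {
  shuffled <- sample(cards)
  treat <- shuffled[seq_len(n_treat)]
  ctrl <- shuffled[-seq_len(n_treat)]
  mean(treat == "dead") - mean(ctrl == "dead")
}

sim_diffs <- replicate(100, one_shuffle())

# Fraction of shuffles with a difference at least as low as the observed one
p_value <- mean(sim_diffs <= obs_diff)
print(c(observed = obs_diff, simulated_mean = mean(sim_diffs), p = p_value))
```

The simulated differences cluster around 0 because shuffling breaks any link between group and outcome. A small fraction in the last step would count as evidence against the null hypothesis.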