1.8 A. Each row of the data matrix represents an observation or case: one individual’s response data to this survey.

B. Assuming the last data point in this table is the last data point in the dataset, 1,691 participants were included in this survey.

C. Variables and their types: Sex (categorical); Age (numerical, continuous); Marital (categorical); grossIncome (numerical, continuous); smoke (categorical); amtWeekends (numerical, discrete); amtWeekdays (numerical, discrete).

1.10 A. The population of interest is children between the ages of 5 and 15, with no other specifications on diagnoses, gender, or other characteristics. The sample consisted of 160 children ages 5 to 15.

B. To generalize the results of this study to the population or to establish a causal relationship, we would need to know whether the sample was randomly selected. Furthermore, because of the differences in characteristics between the two groups, this study alone should not be used to establish a causal relationship.

1.28 A. We cannot conclude that smoking causes dementia later in life based on the results of this study. First, it is an observational study, and experiments are needed to establish causality. Second, the study is not a random sample of the population. Instead, it is a convenience sample of members of a specific health plan, who may share certain characteristics. For example, the health plan may be an expensive one, meaning the socioeconomic status of those enrolled is considerably higher than that of the general population.

B. This statement is not justified because there may be a third variable that correlates with both symptoms of sleep disorders and bullying: a confounding variable. Furthermore, the link reported in the article doesn’t imply any particular direction between the two variables. It is unclear which variable is the explanatory variable and which is the response, so we cannot say that one gives rise to the other. I would frame the results of the study as demonstrating that sleep disorder symptoms and bullying co-occur, but that further research is needed to clarify their effect on one another and to identify other variables that could be involved.

1.36 A. This is a randomized experiment using stratified random sampling.

B. The treatment group consists of all participants receiving the exercise intervention (instruction to exercise twice a week), and the control group consists of all participants instructed not to exercise.

C. This study does make use of blocking, with age as the blocking variable.

D. The description of the study doesn’t mention anything that makes it a blind study. To make it blind, the participants could not know which group they are in, and the mental health practitioner administering the baseline and outcome mental health exams could not know which group each participant is in.

E. If the study were blinded, so that the group not exercising also received some kind of sham treatment and the assessors of the mental health examination did not know each participant’s group, then this study would be reasonably well designed to generalize to the population of 18- to 55-year-olds.

F. The only reservations I would have are those listed above regarding the blinding of examiners and participants. Furthermore, I would want to make sure that there aren’t extraneous variables, such as previous mental health diagnoses or treatment, that would confound the results of this study. I am skeptical that two days a week of exercise would be enough to see an outcome and would suggest a more frequent exercise instruction. Finally, I would want some way of tracking whether participants actually did the exercise.
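The blocking described in part C amounts to randomizing to treatment and control separately within each age block. A minimal sketch in R, assuming a hypothetical roster of 12 participants and three illustrative age brackets (the roster, the bracket boundaries, and the seed are all invented here, not taken from the study):

```r
set.seed(36)  # arbitrary seed so the sketch is reproducible

# Hypothetical roster: ids and age brackets are invented for illustration
participants <- data.frame(
  id        = 1:12,
  age_group = rep(c("18-30", "31-40", "41-55"), each = 4),
  stringsAsFactors = FALSE
)
participants$group <- NA_character_

# Randomize separately inside each block so that every age bracket
# contributes equally to the treatment and control groups
for (block in unique(participants$age_group)) {
  rows   <- which(participants$age_group == block)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  participants$group[rows] <- sample(labels)  # random permutation of labels
}

# Cross-tabulate blocks against assigned groups to check the balance
assignment_counts <- table(participants$age_group, participants$group)
print(assignment_counts)
```

Because the labels are shuffled within each block rather than across the whole roster, age cannot end up unevenly distributed between the two groups.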

1.48

FinalScores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
summary(FinalScores)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   72.75   78.50   77.70   82.25   94.00
boxplot(FinalScores)
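As a follow-up to the summary above, the sketch below computes the IQR and the usual 1.5 × IQR fences to see which scores would be flagged as potential outliers. It assumes the quartiles reported by summary(); boxplot() itself builds its box from Tukey's hinges, which can differ slightly from these quartiles, so treat this as an approximation of what the plot shows rather than its exact rule.

```r
FinalScores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78,
                 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Quartiles as reported by summary() (R's default quantile method)
q1 <- unname(quantile(FinalScores, 0.25))
q3 <- unname(quantile(FinalScores, 0.75))

# Interquartile range; IQR(FinalScores) gives the same value
iqr <- q3 - q1

# Conventional fences: points beyond 1.5 * IQR from the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Scores falling outside the fences
outliers <- FinalScores[FinalScores < lower_fence | FinalScores > upper_fence]

print(c(IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```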

1.50 Histogram A goes with Boxplot 2; Histogram B goes with Boxplot 3; Histogram C goes with Boxplot 1.

1.56 A. I would expect these data to be right skewed because of a meaningful number of extreme outliers greater than $6,000,000. Because these outliers pull the mean upward, I would use the median as a better measure of a typical observation. For the same reason, I would use the IQR to measure variability, since extreme values have little effect on either the median or the IQR.

B. I would expect these data to be roughly symmetric because the observations are spread evenly across the ranges $0 to $300,000; $300,000 to $600,000; $600,000 to $900,000; and $900,000 to $1,200,000. The lack of outliers beyond these ranges indicates that the mean and standard deviation are appropriate measures of a typical observation and the variability of observations, respectively.

C. I would expect these data to be strongly right skewed, given that about 75% of college students would be drinking 0 alcoholic drinks per week since they are under 21. Because of this skewness and the (few) outliers who drink excessively, the median and IQR would be the best measures of the typical observation and the variability of observations, respectively.

D. I would expect these data to be slightly right skewed because of the infrequent but large executive salaries. Because of these outliers, I would consider the median and IQR to be the best measures of the typical observation and the variability of the responses.
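The reasoning in 1.56 rests on the median and IQR being less sensitive to extreme values than the mean and standard deviation. A small sketch of that effect, using made-up house prices in thousands of dollars (the numbers are invented for illustration and are not data from the exercise):

```r
# Hypothetical prices in thousands of dollars; invented for illustration
prices_without <- c(300, 350, 400, 450, 500, 1000)
prices_with    <- c(prices_without, 6000)  # add one extreme observation

# Compute each summary statistic with and without the extreme value
robustness <- data.frame(
  statistic       = c("mean", "median", "sd", "IQR"),
  without_outlier = c(mean(prices_without), median(prices_without),
                      sd(prices_without), IQR(prices_without)),
  with_outlier    = c(mean(prices_with), median(prices_with),
                      sd(prices_with), IQR(prices_with))
)

# How far each statistic moves when the extreme value is added
robustness$change <- robustness$with_outlier - robustness$without_outlier
print(robustness)
```

In this toy data set the mean and standard deviation shift far more than the median and IQR, which is the pattern the answers above rely on.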

1.70 A. The mosaic plot does not suggest that group and survival are independent. Even though the treatment group is about twice as large, it looks like about three times as many people survived in the treatment group compared to the control group.

B. The box plot suggests that the treatment group survived longer than the control group. Nearly the entire interquartile range (IQR) of the control group lies below the start of the IQR for the treatment group. This means that roughly 75% of the participants in the control group survived fewer days than roughly 75% of the treatment group.

C. The proportion that died in the control group is .882. The proportion that died in the treatment group is .652. See the R code chunk below for the work.

library(openintro); data(heartTr)
## Warning: package 'openintro' was built under R version 3.3.2
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
# Split the patients by study group
control <- subset(heartTr, transplant == 'control')
treatment <- subset(heartTr, transplant == 'treatment')

# Proportion of each group that died: the mean of a logical vector is the share of TRUE values
ProportionDeadControl <- mean(control$survived == 'dead')
ProportionDeadTreatment <- mean(treatment$survived == 'dead')
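As a cross-check that does not need the openintro package, the same proportions can be recovered from a two-way table of counts. The cell counts below are an assumption: they are back-calculated from the group sizes (34 and 69), the totals (28 alive, 75 dead), and the proportions reported in part C, not read from the data set directly.

```r
# Counts inferred from the reported proportions and totals (an assumption)
outcomes <- matrix(
  c( 4, 30,    # control:   alive, dead
    24, 45),   # treatment: alive, dead
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("control", "treatment"),
                  survived = c("alive", "dead"))
)

# Row proportions: share of each group that was alive or dead
row_props <- prop.table(outcomes, margin = 1)
print(round(row_props, 3))

# Difference in the proportion that died (control - treatment)
diff_dead <- row_props["control", "dead"] - row_props["treatment", "dead"]
print(round(diff_dead, 3))
```

If group and survival were independent, the two rows of `row_props` would be close to identical, which ties the numbers in part C back to the reading of the mosaic plot in part A.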

D. i. The claim being investigated is that the treatment group, which was given an experimental procedure, had an increased lifespan compared to the control group.

ii. We write alive on 28 cards representing patients who were alive at the end of the study, and dead on 75 cards representing patients who were not. Then, we shuffle these cards and split them into two groups: one group of size 69 representing treatment, and another group of size 34 representing control. We calculate the difference between the proportion of dead cards in the treatment and control groups (treatment - control) and record this value. We repeat this 100 times to build a distribution centered at 0. Lastly, we calculate the fraction of simulations where the simulated differences in proportions are at or below the observed difference. If this fraction is low, we conclude that it is unlikely to have observed such an outcome by chance and that the null hypothesis should be rejected in favor of the alternative.

iii. The simulation results in the histogram below suggest that it would be extremely unlikely to see a difference between these two proportions as large as the one observed (.230 in magnitude) due to chance alone.
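The card-shuffling procedure in part D.ii can be sketched directly in R. This is a minimal illustration under stated assumptions: the count of 45 deaths in the treatment group is back-calculated from the reported proportion (.652 of 69), the seed is arbitrary, and 1,000 shuffles are used instead of the 100 mentioned above simply to get a smoother estimate.

```r
set.seed(170)  # arbitrary seed so the sketch is reproducible

n_alive     <- 28
n_dead      <- 75
n_treatment <- 69
n_control   <- 34

# One card per patient, labeled by outcome
cards <- c(rep("alive", n_alive), rep("dead", n_dead))

# Difference in proportion dead (treatment - control), given how many
# "dead" cards landed in the treatment pile
diff_in_props <- function(dead_in_treatment) {
  dead_in_control <- n_dead - dead_in_treatment
  dead_in_treatment / n_treatment - dead_in_control / n_control
}

# Observed difference; 45 treatment deaths is inferred from .652 * 69
observed_diff <- diff_in_props(45)

# Shuffle the cards, deal the first 69 to treatment, record the difference
n_sims <- 1000  # assumption: more repetitions than the 100 in the text
sim_diffs <- replicate(n_sims, {
  shuffled <- sample(cards)
  diff_in_props(sum(shuffled[1:n_treatment] == "dead"))
})

# Fraction of shuffles at or below the observed difference
p_value <- mean(sim_diffs <= observed_diff)
print(observed_diff)
print(p_value)
```

Calling `hist(sim_diffs)` afterwards would draw the simulated null distribution, which should be centered near 0 with the observed difference far in its lower tail.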