Submit your homework to Canvas by the due date and time. Email your instructor if you have extenuating circumstances and need to request an extension.
If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.
If a problem does not specify how to compute the answer, you many use any appropriate method. I may ask you to use R or use manually calculations on your exams, so practice accordingly.
You must include an explanation and/or intermediate calculations for an exercise to be complete.
Be sure to submit the HWK 10 Auto grade Quiz which will give you ~20 of your 40 accuracy points.
50 points total: 40 points accuracy, and 10 points completion
Exercise 1. A study was conducted to explore the effects of ethanol on sleep time. Fifteen rats were randomized to one of three treatments. Treatment 1 got only water (control). Treatment 2 got 1g of ethanol per kg of body weight, and treatment 3 got 2g/kg. The amount of REM sleep in a 24hr period was recorded, in minutes. Data are given below.
The researchers plan to perform a test to help decide between a model that says a mean amount of REM sleep for all three treatment is equal \(H_0: \mu_1=\mu_2=\mu_3\) and a model that allows for at least one group mean to be different \(H_A:\) at least one \(\mu_i\) is different.
- Make a preliminary boxplot of the data that enables you to compare the centers and spreads of the three samples. Does this graph suggest the three samples come from populations with the same mean value (\(H_0\) is true)? What about the graph makes you say that? Also comment on the equal variance assumption for the three groups.
# Q1 data
trt1 <- c(63, 56, 69, 59, 67)
trt2 <- c(45, 60, 52, 56)
trt3 <- c(31, 40, 44, 33, 37, 28)
Q1vals <- c(trt1, trt2, trt3)
Q1trt <- c(rep("trt1", times = 5), rep("trt2", times = 4),
rep("trt3", times = 6))
boxplot(trt1, trt2, trt3, main = "Ethanol on Sleep Time", xlab = "Treatment", ylab = "Minutes", names = c("trt1", "trt2", "trt3") )
- Complete the following table of summary statistics that will be useful in an ANOVA analysis. Keep values to at least 3 decimal places.
| Treatment 1 | Treatment 2 | Treatment 3 | Overall | |
|---|---|---|---|---|
| mean | 62.8 | 53.25 | 35.5 | 49.333 |
| sd | 5.404 | 6.397 | 5.958 | 13.452 |
| n | 5 | 4 | 6 | 15 |
sd(trt3)
## [1] 5.958188
treatment = c(63, 56, 69, 59, 67,
45, 60, 52, 56,
31, 40, 44, 33, 37, 28)
mean(treatment)
## [1] 49.33333
pf(30.445, 2, 12, lower.tail = FALSE)
## [1] 1.991039e-05
treatment = c(63, 56, 69, 59, 67,
45, 60, 52, 56,
31, 40, 44, 33, 37, 28)
time = c(rep("trt1", 5), rep("trt2", 4), rep("trt3",6))
time_mod = aov(treatment ~ time)
summary(time_mod)
## Df Sum Sq Mean Sq F value Pr(>F)
## time 2 2116 1058.1 30.45 1.99e-05 ***
## Residuals 12 417 34.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Group means, SDs, and sample sizes
ybar1 <- mean(trt1); s1 <- sd(trt1); n1 <- length(trt1)
ybar2 <- mean(trt2); s2 <- sd(trt2); n2 <- length(trt2)
ybar3 <- mean(trt3); s3 <- sd(trt3); n3 <- length(trt3)
# Overall mean, SD, and sample size
ybar <- mean(Q1vals); s <- sd(Q1vals); N <- length(Q1vals)
# Answers for table:
ybar; s3; n2; N
## [1] 49.33333
## [1] 5.958188
## [1] 4
## [1] 15
- Fill in the blanks for the ANOVA table below (by hand), then check your work with the
aovfunction in R.
The
aovfunction will round the SS terms to the nearest integer, but keep MS and F as decimals, so you may want to do the same.
| Source | DF | SS | MS | F | p-value |
|---|---|---|---|---|---|
| Treatment | 2 | 2116 | 1058.1 | 30.45 | 1.99e-05 |
| Error | 12 | 417 | 34.8 | ||
| Total | 14 | 2533 |
- Use the plot() function on your ANOVA aov object to obtain a residual plot and qqnorm() plot of the residuals. Use the graphs and the summary values above to explain why the assumptions for an ANOVA analysis are well met for this data.
plot(time_mod)
- Based on the ANOVA table, make a conclusion in the context of the problem. (Write out your hypotheses, identify your test statistic, and p value here.)
- Use R to obtain the relevant multiplier and then create \(95\%\) CIs for all pairwise comparisons of means using the Tukey method. Do this by hand and show your work.
qtukey(0.95, nmeans = 3, df = 12)/sqrt(2)
## [1] 2.667864
- Summarize your results regarding which groups are found significantly different by adding letter codes to the table below. What do you conclude?
| Treatment | Sample Mean | Letter (according to Tukey’s) |
|---|---|---|
| 1 | 62.8 | A |
| 2 | 53.25 | A |
| 3 | 35.5 | B |
Exercise 2. Suppose we are interested in exploring the relationship between city air particulate and rates of childhood asthma. We sample 15 cities for particulate (X) measured in parts-per-million (ppm) of large particulate matter and for the rate of childhood asthma (Y) measured in percents. The data is given in the summary table and R vectors below.
| variable | size | mean | variance |
|---|---|---|---|
| X | 15 | 11.42 | 13.05 |
| Y | 15 | 14.513 | 2.636 |
- Plot the data as you see fit and summarize the pattern’s shape, direction, and strength in the context of the problem.
particulate <- c(11.6, 15.9, 15.7, 7.9, 6.3, 13.7, 13.1,
10.8, 6.0, 7.6, 14.8, 7.4, 16.2, 13.1, 11.2)
asthma <- c(14.5, 16.6, 16.5, 12.6, 12.0, 15.8, 15.1,
14.2, 12.2, 13.1, 16.0, 12.9, 16.4, 15.4, 14.4)
plot(particulate, asthma, xlab = "particulates ppm", ylab = "asthma percentage", main = "City Particulates vs Rates of Chilhood Asthma")
cor(particulate, asthma)
## [1] 0.9931873
mean(particulate)
## [1] 11.42
sd(particulate)
## [1] 3.612518
mean(asthma)
## [1] 14.51333
sd(asthma)
## [1] 1.62343
asthma_model = lm(asthma ~ particulate)
summary(asthma_model)
##
## Call:
## lm(formula = asthma ~ particulate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34226 -0.12842 -0.01514 0.12130 0.29164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.41626 0.17344 54.29 < 2e-16 ***
## particulate 0.44633 0.01452 30.73 1.6e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1963 on 13 degrees of freedom
## Multiple R-squared: 0.9864, Adjusted R-squared: 0.9854
## F-statistic: 944.4 on 1 and 13 DF, p-value: 1.595e-13
plot(fitted(asthma_model), resid(asthma_model))
qqnorm(resid(asthma_model))
plot(particulate, asthma, abline(lm(asthma ~ particulate)), main = "City Particulates vs Rates of Chilhood Asthma")
data.frame(particulate, asthma, fitted(asthma_model), resid(asthma_model))
## particulate asthma fitted.asthma_model. resid.asthma_model.
## 1 11.6 14.5 14.59367 -0.09367246
## 2 15.9 16.6 16.51288 0.08711504
## 3 15.7 16.5 16.42362 0.07638074
## 4 7.9 12.6 12.94226 -0.34225706
## 5 6.3 12.0 12.22813 -0.22813148
## 6 13.7 15.8 15.53096 0.26903771
## 7 13.1 15.1 15.26317 -0.16316519
## 8 10.8 14.2 14.23661 -0.03660967
## 9 6.0 12.2 12.09423 0.10576707
## 10 7.6 13.1 12.80836 0.29164149
## 11 14.8 16.0 16.02192 -0.02192362
## 12 7.4 12.9 12.71909 0.18090719
## 13 16.2 16.4 16.64678 -0.24678350
## 14 13.1 15.4 15.26317 0.13683481
## 15 11.2 14.4 14.41514 -0.01514107
- Calculate the correlation coefficient (with R) and explain how the value corresponds to what you observed in the graph in part (a)
- Build a linear regression model with least squares estimators for slope and y intercept for the data
- First, build a regression model by hand using the correlation computed in (b) and summary statistics given above.
- Check your computations using
lmin R.
- Interpret the estimated intercept and slope in the context of the question.
- Construct a residual plot of fitted y values on the x axis and residuals on the y. Graphically assess whether the correct model and constant variance assumptions are reasonably met.
- Identify which (particulate, asthma) data point results in the residual with the largest magnitude. Is that point above or below the fitted regression line? Show how the residual is calculated. (Make sure that you can also identify that point on the residual plot.)