Stat 371 Homework #10

Submit your homework to Canvas by the due date and time. Email your instructor if you have extenuating circumstances and need to request an extension.
If an exercise asks you to use R, include a copy of the code and output. Please edit your code and output to be only the relevant portions.
If a problem does not specify how to compute the answer, you many use any appropriate method. I may ask you to use R or use manually calculations on your exams, so practice accordingly.
You must include an explanation and/or intermediate calculations for an exercise to be complete.
Be sure to submit the HWK 10 Auto grade Quiz which will give you ~20 of your 40 accuracy points.
50 points total: 40 points accuracy, and 10 points completion

Analysis of Variance

Exercise 1. A study was conducted to explore the effects of ethanol on sleep time. Fifteen rats were randomized to one of three treatments. Treatment 1 got only water (control). Treatment 2 got 1g of ethanol per kg of body weight, and treatment 3 got 2g/kg. The amount of REM sleep in a 24hr period was recorded, in minutes. Data are given below.

The researchers plan to perform a test to help decide between a model that says a mean amount of REM sleep for all three treatment is equal \(H_0: \mu_1=\mu_2=\mu_3\) and a model that allows for at least one group mean to be different \(H_A:\) at least one \(\mu_i\) is different.

Make a preliminary boxplot of the data that enables you to compare the centers and spreads of the three samples. Does this graph suggest the three samples come from populations with the same mean value (\(H_0\) is true)? What about the graph makes you say that? Also comment on the equal variance assumption for the three groups.

# Q1 data
trt1 <- c(63, 56, 69, 59, 67)
trt2 <- c(45, 60, 52, 56)
trt3 <- c(31, 40, 44, 33, 37, 28)

Q1vals <- c(trt1, trt2, trt3)
Q1trt <- c(rep("trt1", times = 5), rep("trt2", times = 4),
           rep("trt3", times = 6))


boxplot(trt1, trt2, trt3, main = "Ethanol on Sleep Time", xlab = "Treatment", ylab = "Minutes", names = c("trt1", "trt2", "trt3") )

Complete the following table of summary statistics that will be useful in an ANOVA analysis. Keep values to at least 3 decimal places.

	Treatment 1	Treatment 2	Treatment 3	Overall
mean	62.8	53.25	35.5	49.333
sd	5.404	6.397	5.958	13.452
n	5	4	6	15

sd(trt3)

## [1] 5.958188

treatment = c(63, 56, 69, 59, 67,
              45, 60, 52, 56, 
              31, 40, 44, 33, 37, 28)
mean(treatment)

## [1] 49.33333

pf(30.445, 2, 12, lower.tail = FALSE)

## [1] 1.991039e-05

treatment = c(63, 56, 69, 59, 67,
              45, 60, 52, 56, 
              31, 40, 44, 33, 37, 28)

time = c(rep("trt1", 5), rep("trt2", 4), rep("trt3",6))
time_mod = aov(treatment ~ time)

summary(time_mod)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## time         2   2116  1058.1   30.45 1.99e-05 ***
## Residuals   12    417    34.8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Group means, SDs, and sample sizes
ybar1 <- mean(trt1); s1 <- sd(trt1); n1 <- length(trt1) 
ybar2 <- mean(trt2); s2 <- sd(trt2); n2 <- length(trt2) 
ybar3 <- mean(trt3); s3 <- sd(trt3); n3 <- length(trt3)

# Overall mean, SD, and sample size
ybar <- mean(Q1vals); s <- sd(Q1vals); N <- length(Q1vals)

# Answers for table:
ybar; s3; n2; N

## [1] 49.33333

## [1] 5.958188

## [1] 4

## [1] 15

Fill in the blanks for the ANOVA table below (by hand), then check your work with the aov function in R.

The aov function will round the SS terms to the nearest integer, but keep MS and F as decimals, so you may want to do the same.

Source	DF	SS	MS	F	p-value
Treatment	2	2116	1058.1	30.45	1.99e-05
Error	12	417	34.8
Total	14	2533

Use the plot() function on your ANOVA aov object to obtain a residual plot and qqnorm() plot of the residuals. Use the graphs and the summary values above to explain why the assumptions for an ANOVA analysis are well met for this data.

plot(time_mod)

Based on the ANOVA table, make a conclusion in the context of the problem. (Write out your hypotheses, identify your test statistic, and p value here.)

Use R to obtain the relevant multiplier and then create \(95\%\) CIs for all pairwise comparisons of means using the Tukey method. Do this by hand and show your work.

qtukey(0.95, nmeans = 3, df = 12)/sqrt(2)

## [1] 2.667864

Summarize your results regarding which groups are found significantly different by adding letter codes to the table below. What do you conclude?

Treatment	Sample Mean	Letter (according to Tukey’s)
1	62.8	A
2	53.25	A
3	35.5	B

Least Squares Linear Regression

Exercise 2. Suppose we are interested in exploring the relationship between city air particulate and rates of childhood asthma. We sample 15 cities for particulate (X) measured in parts-per-million (ppm) of large particulate matter and for the rate of childhood asthma (Y) measured in percents. The data is given in the summary table and R vectors below.

variable	size	mean	variance
X	15	11.42	13.05
Y	15	14.513	2.636

Plot the data as you see fit and summarize the pattern’s shape, direction, and strength in the context of the problem.

particulate <- c(11.6, 15.9, 15.7, 7.9, 6.3, 13.7, 13.1,
                 10.8, 6.0, 7.6, 14.8, 7.4, 16.2, 13.1, 11.2)
asthma <- c(14.5, 16.6, 16.5, 12.6, 12.0, 15.8, 15.1,
            14.2, 12.2, 13.1, 16.0, 12.9, 16.4, 15.4, 14.4)
            
plot(particulate, asthma, xlab = "particulates ppm", ylab = "asthma percentage", main = "City Particulates vs Rates of Chilhood Asthma")

cor(particulate, asthma)

## [1] 0.9931873

mean(particulate)

## [1] 11.42

sd(particulate)

## [1] 3.612518

mean(asthma)

## [1] 14.51333

sd(asthma)

## [1] 1.62343

asthma_model = lm(asthma ~ particulate)
summary(asthma_model)

## 
## Call:
## lm(formula = asthma ~ particulate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34226 -0.12842 -0.01514  0.12130  0.29164 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.41626    0.17344   54.29  < 2e-16 ***
## particulate  0.44633    0.01452   30.73  1.6e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1963 on 13 degrees of freedom
## Multiple R-squared:  0.9864, Adjusted R-squared:  0.9854 
## F-statistic: 944.4 on 1 and 13 DF,  p-value: 1.595e-13

plot(fitted(asthma_model), resid(asthma_model))

qqnorm(resid(asthma_model))

plot(particulate, asthma, abline(lm(asthma ~ particulate)), main = "City Particulates vs Rates of Chilhood Asthma")

data.frame(particulate, asthma, fitted(asthma_model), resid(asthma_model))

##    particulate asthma fitted.asthma_model. resid.asthma_model.
## 1         11.6   14.5             14.59367         -0.09367246
## 2         15.9   16.6             16.51288          0.08711504
## 3         15.7   16.5             16.42362          0.07638074
## 4          7.9   12.6             12.94226         -0.34225706
## 5          6.3   12.0             12.22813         -0.22813148
## 6         13.7   15.8             15.53096          0.26903771
## 7         13.1   15.1             15.26317         -0.16316519
## 8         10.8   14.2             14.23661         -0.03660967
## 9          6.0   12.2             12.09423          0.10576707
## 10         7.6   13.1             12.80836          0.29164149
## 11        14.8   16.0             16.02192         -0.02192362
## 12         7.4   12.9             12.71909          0.18090719
## 13        16.2   16.4             16.64678         -0.24678350
## 14        13.1   15.4             15.26317          0.13683481
## 15        11.2   14.4             14.41514         -0.01514107

Calculate the correlation coefficient (with R) and explain how the value corresponds to what you observed in the graph in part (a)

Build a linear regression model with least squares estimators for slope and y intercept for the data

First, build a regression model by hand using the correlation computed in (b) and summary statistics given above.

Check your computations using lm in R.

Interpret the estimated intercept and slope in the context of the question.

Construct a residual plot of fitted y values on the x axis and residuals on the y. Graphically assess whether the correct model and constant variance assumptions are reasonably met.

Identify which (particulate, asthma) data point results in the residual with the largest magnitude. Is that point above or below the fitted regression line? Show how the residual is calculated. (Make sure that you can also identify that point on the residual plot.)

Stat 371 Homework #10

Student’s Name Here

Analysis of Variance

Least Squares Linear Regression