7

(a)

The p-value for the coefficient of the interaction term is 0.0120. It is Statistically significant at 0.05, but it is not significant at 0.01.

latour <- read.table("/Users/shijiabian/Dropbox/STA 721 Linear Model/Homework/HW2/Latour.txt", header = T)
fit <- lm(formula = Quality ~ EndofHarvest + Rain + Rain:EndofHarvest, data = latour)
summary(fit)

## 
## Call:
## lm(formula = Quality ~ EndofHarvest + Rain + Rain:EndofHarvest, 
##     data = latour)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.683 -0.570  0.127  0.439  1.635 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.1612     0.6892    7.49  3.9e-09 ***
## EndofHarvest       -0.0314     0.0176   -1.79    0.082 .  
## Rain                1.7867     1.3174    1.36    0.183    
## EndofHarvest:Rain  -0.0831     0.0316   -2.63    0.012 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.758 on 40 degrees of freedom
## Multiple R-squared:  0.685,  Adjusted R-squared:  0.661 
## F-statistic:   29 on 3 and 40 DF,  p-value: 4.02e-10

(b) Estimate the number of days of delay to the end of harvest it takes to decrease the quality rating by 1 point when there is:

i. No unwanted rain at harvest

If there is no unwanter rain happens, 31.85 days of delay to the end of the harvest can decrease the quality rating by one point.

ii. Some unwanted rain at harvest

If there is some unwanter rain happens, 8.73 days of delay to the end of the harvest can decrease the quality rating by one point.

\[ \begin{equation} \begin{split} \text{Quality} &= \hat{\beta_0} + \text{EndofHarvest} \times \hat{\beta_1} + \text{Rain} \times \hat{\beta_2} + \text{Rain} \times \text{EndofHarvest} \times \hat{\beta_3}\\ & = \hat{\beta_0} + \text{EndofHarvest} \times \hat{\beta_1} \quad\text{No unwanted rain at harvest, R = 0}\\ & \text{or} \\ & = (\hat{\beta_0} + \hat{\beta_0}) + \text{EndofHarvest} \times (\hat{\beta_1} + \hat{\beta_3}) \quad \text{Some unwanted rain at harvest, R = 1} \end{split} \end{equation} \]

Therefore, \[ \begin{equation} \begin{split} \text{Quality} &= \hat{\beta_0} + \text{EndofHarvest} \times \hat{\beta_1} + \text{Rain} \times \hat{\beta_2} + \text{Rain} \times \text{EndofHarvest} \times \hat{\beta_3}\\ & = 5.16122 - 0.03145 \times \hat{\beta_1} \quad\text{No unwanted rain at harvest, R = 0}\\ & \text{or} \\ & = (5.16122 + 1.78670) + \text{EndofHarvest} \times (-0.03145 - 0.08314) \quad \text{Some unwanted rain at harvest, R = 1} \end{split} \end{equation} \]

(c) Construct confidence and prediction intervals for the quality rating and overlay on the plot of Quality verus the End of harvest plot, making sure to label all axes and provide a legend as in Figure 5.8.

attach(latour)

### For no rain at harvest
# Build the data frame specific for no rain, and extract x and y axis
latourNoRaine <- latour$Rain==0
NoRain <- latour[latourNoRaine,]
yNoRaine <- NoRain$Quality
xNoRaine <- NoRain$EndofHarvest

# Build the data frame specific for rain
latourRain <-latour$Rain==1
Rain <- latour[latourRain,]
yRaine <- Rain$Quality
xRaine <- Rain$EndofHarvest

# Rebuild the model for rain and no rain specifically
fitnoRain <- lm (yNoRaine~xNoRaine)
fitRain <- lm (yRaine~xRaine)

y = latour$Rain
par(mfrow=c(1,1))
plot(EndofHarvest,Quality,pch=y+1,col=y+1,xlab="End of Harvest (in days since August 31)")
legend(20, 2.5,legend=c("No","Yes"),pch=1:2,col=1:2,title="Rain at Harvest?")
legend(45, 5,legend=c("NoRainCI","RainCI","NoRainPI","RainPI"),lty=c(2,2,4,4),col= c("blue","purple", 
                          "red", "orange"))
abline(fitnoRain,lty=1,col=1)
abline(fitRain,lty=2,col=2)

new <- data.frame(xNoRaine = seq(min(xNoRaine),max(xNoRaine), length.out=100))
pred.w.plim <- predict(lm (yNoRaine~xNoRaine), new, interval="prediction")
pred.w.clim <- predict(lm (yNoRaine~xNoRaine), new, interval="confidence")

# overlaying the C intervals 
lines(new$xNoRaine,pred.w.clim[,2],col="blue",lty=2)
lines(new$xNoRaine,pred.w.clim[,3],col="blue",lty=2)

# overlaying the P intervals
lines(new$xNoRaine,pred.w.plim[,2],col="red",lty=4)
lines(new$xNoRaine,pred.w.plim[,3],col="red",lty=4)

newRain <- data.frame(xRaine = seq(min(xRaine),max(xRaine), length.out=100))
pred.r.plim <- predict(lm(yRaine~xRaine), newRain, interval="prediction")
pred.r.clim <- predict(lm(yRaine~xRaine), newRain, interval="confidence")

# overlaying the C intervals 
lines(newRain$xRaine,pred.r.plim[,2],col="purple",lty=2)
lines(newRain$xRaine,pred.r.clim[,3],col="purple",lty=2)

# overlaying the P intervals
lines(newRain$xRaine,pred.r.plim[,2],col="orange",lty=4)
lines(newRain$xRaine,pred.r.clim[,3],col="orange",lty=4)

detach(latour)

### Citation 
## Textbook Website <http://www.stat.tamu.edu/~sheather/book/docs/rcode/Chapter5.R>
## Plot Prediction and Confidence Interval in R <http://www.stat.sc.edu/~wang345/Teaching/STA704Fall2013/Code/Ch1_Toluca.R>

(d) Provide an interpretation of the prediction and confidence intervals for Quality and expected quality ratings respectively (in this context), and explain when it is appropriate to use each.

For both the rain and no rain case, the linear model, confidence interval and prediciton interval are negatively related. The overall rating for the no rain is higher than that have rain, this also matches our intuition. Please ignore the graph showing the x-axis is smaller than 20, because this is out of the range of the data. I ploted this way because I want to show the graph more clearly. We want to use confidence interval to test sampling error in relation to the parameter of interest: sample size increases, sampling error decreases and the interval becomes narrower. Confidence intervals does not show the distribution of individual values, it shows the the interested parameter. Given the specified settings of the predictors in the model, we use a prediction interval to predict the range that is likely to contain the response value of a single new observation.