Day 2 Practice Solution

## Warning: package 'ggplot2' was built under R version 3.5.3

Airfreight breakage. A substance used in biological and medical research is shipped by airfreight to users in cartons of 1,000 ampules. The data below, involving 10 shipments, were collected on the number of times the carton was transferred from one aircraft to another over the shipment route \((X)\) and the number of ampules found to be broken upon arrival \((Y)\). Assume the first-order regression model is appropriate.

a) Obtain the estimated regression function. Plot the estimated regression function and the data. Does a linear regression function appear to give a good fit here?

X <- myData$X
Y <- myData$Y
n <- nrow(myData)
b1 <- (sum(X*Y)-sum(X)*sum(Y)/n)/(sum(X^2)-sum(X)^2/n)
b0 <- mean(Y)-b1*mean(X)

# or with matrices
X <- cbind(1, X)
betahat <- solve(t(X)%*%X)%*%t(X)%*%Y

# or with lm
mod1 <- lm(Y ~ X, data = myData)
summary(mod1)

## 
## Call:
## lm(formula = Y ~ X, data = myData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -2.2   -1.2    0.3    0.8    1.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.2000     0.6633  15.377 3.18e-07 ***
## X             4.0000     0.4690   8.528 2.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.483 on 8 degrees of freedom
## Multiple R-squared:  0.9009, Adjusted R-squared:  0.8885 
## F-statistic: 72.73 on 1 and 8 DF,  p-value: 2.749e-05

ggplot(data = myData, aes(x = X, y = Y)) +
  geom_point() +
  geom_abline(aes(slope = b1, intercept = b0)) +
  labs(title = "Transfers vs Broken Ampules", x = "Number of Times Carton Transferred",
       y = "Number of Broken Ampules")

The estimated regression function is \(\hat{Y}=10.2+4.0X\). A linear regression function appears to give a good fit here.

b) Obtain a point estimate of the expected number of broken ampules when \(X=1\) transfer is made.

coef(mod1)[1]+coef(mod1)[2]*1

## (Intercept) 
##        14.2

predict(mod1, newdata = data.frame(X = 1))

##    1 
## 14.2

That is, we estimate an expected number of \(14.2\) broken ampules when \(X=1\) transfer is made.

c) Estimate the increase in the expected number of ampules broken when there are 2 transfers as compared to 1 transfer.

The increase in the expected number of ampules broken when there are 2 transfers as compared to 1 transfer is the slope, 4.0 broken ampules.

d) Verify that your fitted regression line goes through the point \((\bar{X},\bar{Y})\).

xmean <- mean(myData$X)
ymean <- mean(myData$Y)
predict(mod1, newdata = data.frame(X = xmean))

##    1 
## 14.2

ymean

## [1] 14.2

The fitted regression line goes through the point \((\bar{X},\bar{Y})\).

e) Obtain the residual for the first case. What is its relation to \(\epsilon _1\)?

#Residual=Observed-Expected
observed <- myData$Y[1]
expected <- predict(mod1, newdata = data.frame(X = myData$X[1]))
observed-expected

##   1 
## 1.8

resid(mod1)[1]

##   1 
## 1.8

mod1$residuals[1]

##   1 
## 1.8

The residual for the first case is 1.8, which is an estimate of \(\epsilon _1\).

f) Compute \(\sum e_i ^2\) and \(MSE\). What is estimated by \(MSE\)?

yhat <- mod1$fitted.values
SSE <- sum((yhat-myData$Y)^2)

SSE <- sum(resid(mod1)^2)
SSE

## [1] 17.6

n <- nrow(myData)
MSE <- SSE/(n-2)
MSE

## [1] 2.2

\(SSE = 17.6\) and \(MSE = 2.2\). \(MSE\) estimates \(\sigma ^2\).

g) Obtain the residuals \(e_i\). Compute \(\sum e_i\). What should this number be?

resid(mod1)

##    1    2    3    4    5    6    7    8    9   10 
##  1.8 -1.2 -1.2  1.8 -0.2 -1.2 -2.2  0.8  0.8  0.8

sum(resid(mod1))

## [1] -4.996004e-16

\(\sum e_i=-4.996 \times 10^{-16}\) which is close to zero (as it should be).

h) Evaluate the least squares criterion \(Q\) for the estimates (1) \(b_0 = 9,b_1=3;\) (2) \(b_0 = 11,b_1=5.\) Is the criterion \(Q\) larger for these estimates than for the least squares estimates? Note that \(Q=\sum (Y_i -\beta _0 - \beta_1 X_i)^2.\)

#(1)
b0 <- 9
b1 <- 3
Q <- sum((myData$Y-b0-b1*myData$X)^2)
Q

## [1] 76

#(2)
b0 <- 11
b1 <- 5
Q <- sum((myData$Y-b0-b1*myData$X)^2)
Q

## [1] 60

Note that both Q’s are greater than the SSE for the least squares estimates.

i) Because of changes in airline routes, shipments may have to be transferred more frequently than in the past. Estimate the mean breakage for the following numbers of transfers: \(X=2,4\). Use separate 99 percent confidence intervals. Interpret your results.

yhat=predict(mod1,newdata=data.frame(X=2))
tstar=qt((1+.99)/2,nrow(myData)-2)
s=sqrt(MSE*(1/n+(2-xmean)^2/sum((myData$X-xmean)^2)))
lower=yhat-tstar*s
upper=yhat+tstar*s
c(lower,upper)

##        1        1 
## 15.97429 20.42571

#or just have R do everything in 1 step
predict(mod1,newdata=data.frame(X=2),interval="confidence",level=.99)

##    fit      lwr      upr
## 1 18.2 15.97429 20.42571

yhat=predict(mod1,newdata=data.frame(X=4))
tstar=qt((1+.99)/2,nrow(myData)-2)
s=sqrt(MSE*(1/n+(4-xmean)^2/sum((myData$X-xmean)^2)))
lower=yhat-tstar*s
upper=yhat+tstar*s
c(lower,upper)

##        1        1 
## 21.22316 31.17684

#or just have R do everything in 1 step
predict(mod1,newdata=data.frame(X=4),interval="confidence",level=.99)

##    fit      lwr      upr
## 1 26.2 21.22316 31.17684

We are 99 percent confident that the true mean breakage for 2 transfers is between 15.97 and 20.43. Similarly, we are 99 percent confident that the true mean breakage for 4 transfers is between 21.22 and 31.18.

j) Set up an ANOVA table.

yhat <- predict(mod1)
ybar <- mean(myData$Y)
yi <- myData$Y
n <- nrow(myData)

SOV <- c("Regression","Error","Total")
SSR <- sum((yhat-ybar)^2)
SSE <- sum((yi-yhat)^2)
SSTO <- sum((yi-ybar)^2) # <- SSR+SSE
SS <- c(SSR,SSE,SSTO)
df <- c(1,n-2,n-1)
MS <- c(SSR/1,SSE/(n-2), "")
Fstar <- round(SSR/1/(SSE/(n-2)), 3)
f <- c(Fstar, "", "")
p <- c(round(pf(Fstar, 1, nrow(myData)-2, lower.tail = FALSE), 8), "", "")

AOVTab <- cbind(SOV, SS, df, MS, f, p)

library(knitr)

## Warning: package 'knitr' was built under R version 3.5.3

kable(AOVTab)

SOV	SS	df	MS	f	p
Regression	160	1	160	72.727	2.749e-05
Error	17.6	8	2.2
Total	177.6	9

# or just have r do almost everything
anova(mod1)

## Analysis of Variance Table
## 
## Response: Y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## X          1  160.0   160.0  72.727 2.749e-05 ***
## Residuals  8   17.6     2.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

k) Calculate and interpret a 95 percent confidence interval for \(\beta_0\).

mod1 <- lm(Y~X, data = myData)
b0 <- mod1$coefficients[1]
b1 <- mod1$coefficients[2]
n <- nrow(myData)
MSE <- sum(mod1$residuals^2)/(n-2)
s_beta0 <- sqrt(MSE*(1/n+mean(myData$X)^2/sum((myData$X-mean(myData$X))^2)))
lower <- b0-qt((1+.95)/2,nrow(myData)-2)*s_beta0
upper <- b0+qt((1+.95)/2,nrow(myData)-2)*s_beta0
c(lower,upper)

## (Intercept) (Intercept) 
##     8.67037    11.72963

# or just have r do everything
confint(mod1, level = 0.95)

##                2.5 %    97.5 %
## (Intercept) 8.670370 11.729630
## X           2.918388  5.081612

We are 95 percent confident that the true \(\beta_0\) is between \(8.6704\) and \(11.7296\).

l) Calculate and interpret a 95 percent confidence interval for \(\beta_1\).

s_beta1 <- sqrt(MSE/sum((myData$X-mean(myData$X))^2))
lower <- b1-qt((1+.95)/2,nrow(myData)-2)*s_beta1
upper <- b1+qt((1+.95)/2,nrow(myData)-2)*s_beta1
c(lower,upper)

##        X        X 
## 2.918388 5.081612

# or just have r do everything
confint(mod1, level = 0.95)

##                2.5 %    97.5 %
## (Intercept) 8.670370 11.729630
## X           2.918388  5.081612

We are 95 percent confident that the true \(\beta_1\) is between \(2.9184\) and \(5.0816\).

Copier maintenance. The Tri-City Office Equipment Corporation sells an imported copier on a franchise basis and performs preventive maintenance and repair service on this copier. The data below have been collected from 45 recent calls on users to perform routine preventive maintenance service; for each call, \(X\) is the number of copiers serviced and \(Y\) is the total number of minutes spent by the service person. Assume that first-order regression model is appropriate.

a) Obtain the estimated regression function.

mod1 <- lm(Y~X,data=myData)
summary(mod1)

## 
## Call:
## lm(formula = Y ~ X, data = myData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.7723  -3.7371   0.3334   6.3334  15.4039 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.5802     2.8039  -0.207    0.837    
## X            15.0352     0.4831  31.123   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.914 on 43 degrees of freedom
## Multiple R-squared:  0.9575, Adjusted R-squared:  0.9565 
## F-statistic: 968.7 on 1 and 43 DF,  p-value: < 2.2e-16

The estimated regression function is \(\hat{Y}=-0.5802+15.0352X\).

b) Plot the estimated regression function and the data. How well does the estimated regression function fit the data?

ggplot(data = myData, aes(x = X, y = Y)) +
  geom_point() +
  geom_abline(aes(slope = coef(mod1)[2], intercept = coef(mod1)[1])) +
  labs(title = "Number of Copiers Serviced vs Number of Minutes Spent",
       x = "Number of Copiers Serviced",
       y = "Number of Minutes")

A linear regression function appears to give a good fit here.

c) Interpret \(b_0\) in your estimated regression function. Does \(b_0\) provide any relevant information here? Explain.

\(b_0 = -0.5802\), which tells us that for a service call of 0 copiers, it would take -0.5802 minutes to service. This does not tell us any relevant information.

d) Obtain a point estimate of the mean service time when \(X=5\) copiers are serviced.

predict(mod1, newdata = data.frame(X = 5))

##        1 
## 74.59608

That is, the expected mean service time when \(X=5\) copiers are serviced is 74.60 minutes.

e) Obtain the residuals \(e_i\) and the sum of the squared residuals \(\sum e_i^2\). What is the relation between the sum of the squared residuals here and the quantity \(Q\)?

resid(mod1)

##           1           2           3           4           5           6 
##  -9.4903394   0.4391645   1.4744125  11.5096606  -2.4550914 -12.7723238 
##           7           8           9          10          11          12 
##  -6.5960836  14.4039164 -10.4550914   2.5096606   9.2629243   6.2276762 
##          13          14          15          16          17          18 
##   3.3686684  -8.5255875  12.4391645 -19.7018277   0.3334204  11.2981723 
##          19          20          21          22          23          24 
## -22.7723238  -2.5608355  -8.5960836  -3.6665796   4.3334204  -0.5960836 
##          25          26          27          28          29          30 
##  -0.7370757   7.3334204 -11.4903394  -1.5960836   6.3334204   6.3686684 
##          31          32          33          34          35          36 
##   3.2981723  15.4039164  -9.4903394  -1.4903394 -11.4550914  -2.5608355 
##          37          38          39          40          41          42 
##  11.4039164  -2.7370757   7.3334204  12.5449086  -3.7370757   4.5096606 
##          43          44          45 
##  -2.4903394   1.4391645   2.4039164

SSE <- sum(resid(mod1)^2)
SSE

## [1] 3416.377

The SSE is the minimum value for Q.

f) Obtain a 90 percent confidence interval for the mean service time on calls in which six copiers are serviced. Interpret your confidence interval.

MSE <- SSE/(nrow(myData)-2)
yhat <- predict(mod1,newdata=data.frame(X=6))
tstar <- qt((1+.9)/2,nrow(myData)-2)
s <- sqrt(MSE*(1/nrow(myData)+(6-mean(myData$X))^2/sum((myData$X-mean(myData$X))^2)))
lower <- yhat-tstar*s
upper <- yhat+tstar*s
c(lower,upper)

##        1        1 
## 87.28387 91.97880

#or just have R do everything in 1 step
predict(mod1, newdata = data.frame(X = 6), interval = "confidence", level = 0.9)

##        fit      lwr     upr
## 1 89.63133 87.28387 91.9788

We are 90 percent confident that the true mean service time on calls in which six copiers are serviced is between 87.28 and 91.98 minutes.

g) Obtain a 90 percent prediction interval for the service time on the next call in which six copiers are serviced. Is your prediction interval wider than the corresponding confidence interval in part (f)? Should it be?

yhat <- predict(mod1,newdata = data.frame(X = 6))
tstar <- qt((1+.9)/2,nrow(myData)-2)
s <- sqrt(MSE*(1+1/nrow(myData)+(6-mean(myData$X))^2/sum((myData$X-mean(myData$X))^2)))
lower <- yhat-tstar*s
upper <- yhat+tstar*s
c(lower,upper)

##         1         1 
##  74.46433 104.79833

#or just have R do everything in 1 step
predict(mod1, newdata = data.frame(X = 6), interval = "prediction", level = 0.9)

##        fit      lwr      upr
## 1 89.63133 74.46433 104.7983

This prediction interval is wider than that of part (f). This is expected because we are trying to predict for a specific call, not the average of similar calls.

h) The company believes that service calls should take no more than 15 minutes per copier. Conduct a hypothesis test to determine if the company is not meeting this standard. Be sure to state the hypotheses, test statistic, p value, and conclusion.

\(H_0: \beta_1 = 15\) vs \(H_a:\beta_1 > 15\). The test statistic is \[t = \frac{b1-15}{s(b_1)}=0.07296,\] and \(p=0.471\).

mod1 <- lm(Y~X, data = myData)
b0 <- mod1$coefficients[1]
b1 <- mod1$coefficients[2]
n <- nrow(myData)
MSE <- sum(mod1$residuals^2)/(n-2)
s_beta1 <- sqrt(MSE/sum((myData$X-mean(myData$X))^2))

teststat <- (b1-15)/s_beta1
p <- pt(teststat, n-2, lower.tail = FALSE)

Since the p value is much larger than 0.05, we fail to reject \(H_0\). Therefore, we believe the company is meeting their standard.

Day 2 Practice Solution

Scott Manski

May 8, 2019