1. (6.3) A student stated: “Adding predictor variables to a regression model can never reduce \(R^2\), so we should include all available predictor variables in the model.” Comment.
  1. (6.4) Why is it not meaningful to attach a sign to the coefficient of multiple correlation R, although we do so for the coefficient of simple correlation \(r_{12}\)?
  1. (6.27) In a small-scale regression study, the following data were obtained,
| Y    | X1   | X2   |
|------|------|------|
| 42.0 | 7.0  | 33.0 |
| 33.0 | 4.0  | 41.0 |
| 75.0 | 16.0 | 7.0  |
| 28.0 | 3.0  | 49.0 |
| 91.0 | 21.0 | 5.0  |
| 55.0 | 8.0  | 31.0 |

Assume the standard multiple regression model with independent normal error terms. Compute b, e, H, SSErr, \(R^2\), \(s^2_b\), \(\hat{Y}\) for \(X1\) = 10, \(X2\) = 30. You can do the computations using software or by hand, although it would be lengthy to do them by hand.

scalar <- c(1, 1, 1, 1, 1, 1)  # column of ones for the intercept
Y <-  c(42, 33, 75, 28, 91, 55)
X1 <-  c(7, 4, 16, 3, 21, 8)
X2 <-  c(33, 41, 7, 49, 5, 31)

# design matrix X: column of ones, X1, X2
X <- matrix(c(scalar, X1, X2), 6, 3, byrow = FALSE)
X
##      [,1] [,2] [,3]
## [1,]    1    7   33
## [2,]    1    4   41
## [3,]    1   16    7
## [4,]    1    3   49
## [5,]    1   21    5
## [6,]    1    8   31
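As an optional aside (not part of the original solution), the same design matrix could be built with base R's model.matrix():

X_alt <- model.matrix(~ X1 + X2)  # intercept column plus X1 and X2
all.equal(X, X_alt, check.attributes = FALSE)  # TRUE: same values as X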

b: vector of estimated regression coefficients

\[ \boldsymbol{\hat{\beta}} = \boldsymbol{(X^TX)}^{-1}\boldsymbol{(X^TY)} \]

library(matlib)  # provides inv(); base R's solve() gives the same inverse
b <- inv(t(X) %*% X) %*% t(X) %*% Y
b
##            [,1]
## [1,] 33.9321020
## [2,]  2.7847707
## [3,] -0.2643979

H: hat matrix

H <- X %*% inv(t(X) %*% X) %*% t(X)
H
##             [,1]        [,2]        [,3]       [,4]        [,5]       [,6]
## [1,]  0.23143639  0.25168006  0.21178834  0.1488734 -0.05475455 0.21099418
## [2,]  0.25168006  0.31240977  0.09437951  0.2662835 -0.14787196 0.22314063
## [3,]  0.21178834  0.09437951  0.70442097 -0.3191731  0.10446756 0.20412257
## [4,]  0.14887339  0.26628346 -0.31917314  0.6142637  0.14143589 0.14834214
## [5,] -0.05475455 -0.14787196  0.10446756  0.1414359  0.94040059 0.01632796
## [6,]  0.21099418  0.22314063  0.20412257  0.1483421  0.01632796 0.19708945
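A few optional checks of standard hat-matrix properties (not required by the problem):

max(abs(H - t(H)))     # ~ 0: H is symmetric
max(abs(H %*% H - H))  # ~ 0 up to rounding: H is idempotent
sum(diag(H))           # ~ 3: trace equals the number of regression parameters p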

e: residuals

\[ e = Y - \hat{Y} = Y - HY \]

e <- Y - H %*% Y
e
##             [,1]
## [1,] -2.70036663
## [2,] -1.23087135
## [3,] -1.63764825
## [4,] -1.33091751
## [5,] -0.09029763
## [6,]  6.98606687
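As a quick check (not asked for here, but the same properties are used in #6.26 below), the residuals sum to zero and are orthogonal to the columns of X:

sum(e)      # ~ 0 up to rounding
t(X) %*% e  # ~ 0 vector up to rounding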

SSErr

\[ SSE = e^Te \]

SSE <- t(e) %*% e
SSE
##          [,1]
## [1,] 62.07354
# SSErr <- sum(e^2)
# SSErr

\(R^2\)

\[ R^2 = \frac{SSRegression}{SSTotal} = 1 - \frac{SSError}{SSTotal} \]

SST <- sum((Y - mean(Y))^2) # total sum of squares
SST
## [1] 3072
R2 <- 1 - SSE/SST # R^2
R2
##           [,1]
## [1,] 0.9797938

\(s^2_b\)

\[ MSE = \frac{SSE}{n-p} = \frac{SSE}{6-3} \]

MSE <- SSE / (6 - 3)  # error df: n - p = 6 - 3 = 3 (intercept and two slopes)
MSE
##          [,1]
## [1,] 20.69118

\[ s^2\{b\} = MSE \cdot (X^TX)^{-1} \]

s2b <- MSE[1,1] * inv(t(X) %*% X)
s2b
##           [,1]        [,2]        [,3]
## [1,] 715.47117 -34.1589184 -13.5949377
## [2,] -34.15892   1.6616665   0.6440675
## [3,] -13.59494   0.6440675   0.2624679
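If standard errors of the individual coefficients are wanted, they are the square roots of the diagonal entries of s2b (an optional extra, not asked for in the problem):

sqrt(diag(s2b))  # estimated standard errors of b0, b1, b2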

E{Y|X1 = 10, X2 = 30}

Xone <- c(1, 10, 30)
yhat <- t(Xone) %*% b
yhat
##          [,1]
## [1,] 53.84787
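As an optional cross-check of the matrix computations above (not part of the original hand solution), the same quantities can be obtained from lm(); values should agree up to small rounding.

fit <- lm(Y ~ X1 + X2)
coef(fit)                                   # compare with b
resid(fit)                                  # compare with e
sum(resid(fit)^2)                           # SSE
summary(fit)$r.squared                      # R^2
vcov(fit)                                   # compare with s2b (MSE uses n - p = 3 df)
predict(fit, data.frame(X1 = 10, X2 = 30))  # Yhat at X1 = 10, X2 = 30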
  1. (Computer project, #6.5—#6.8) Dataset “Brand preference” is available on our Blackboard, on http://statweb.lsu.edu/EXSTWeb/StatLab/DataSets/NKNWData/CH06PR05.txt, and here:
library(tidyverse)  # for %>% and rename()
brand <- read.table("./data/CH06PR05.txt")
brand %>%
  rename(brand = "V1", moisture = "V2", sweetness = "V3") -> brand
brand
# columns: brand liking (Y), moisture content (X1), sweetness (X2)

It was collected to study the relation between the degree of brand liking (Y) and the moisture content (X1) and sweetness (X2) of the product. (a) Fit a regression model to these data and state the estimated regression function. Interpret b1.
\[ \hat{Y} = 37.6500 + 4.4250X_1 + 4.3750X_2 \]
- The slope \(b_1 = 4.425\) means that the mean degree of brand liking is estimated to increase by 4.425 for each one-unit increase in moisture content, with sweetness held constant.

reg <- lm(brand ~ moisture + sweetness, data = brand)
summary(reg)
## 
## Call:
## lm(formula = brand ~ moisture + sweetness, data = brand)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.400 -1.762  0.025  1.587  4.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
## moisture      4.4250     0.3011  14.695 1.78e-09 ***
## sweetness     4.3750     0.6733   6.498 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9447 
## F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09
  1. Obtain residual plots. What information do they provide? Plot residuals against fitted values, against each predictor, and against the product of predictors.
par(mfrow = c(2, 2))
plot(reg)
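plot(reg) gives the standard diagnostic plots; the specific plots requested (residuals against fitted values, against each predictor, and against the moisture*sweetness product) can be drawn directly, for example:

par(mfrow = c(2, 2))
plot(fitted(reg), resid(reg), xlab = "Fitted values", ylab = "Residuals"); abline(h = 0, lty = 2)
plot(brand$moisture, resid(reg), xlab = "Moisture (X1)", ylab = "Residuals"); abline(h = 0, lty = 2)
plot(brand$sweetness, resid(reg), xlab = "Sweetness (X2)", ylab = "Residuals"); abline(h = 0, lty = 2)
plot(brand$moisture * brand$sweetness, resid(reg), xlab = "X1 * X2", ylab = "Residuals"); abline(h = 0, lty = 2)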

  1. Test homoscedasticity.
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
ncvTest(reg)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.6261627, Df = 1, p = 0.42877
  1. Conduct a formal lack of fit test.
# full model for the lack-of-fit test: a separate mean for each distinct (moisture, sweetness) combination
full <- lm(brand ~ as.factor(moisture) * as.factor(sweetness), data = brand)
anova(reg, full)
  1. Test whether the proposed linear regression model is significant. What do the results of the ANOVA F-test imply about the slopes?
anova(reg)
summary(reg)
## 
## Call:
## lm(formula = brand ~ moisture + sweetness, data = brand)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.400 -1.762  0.025  1.587  4.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
## moisture      4.4250     0.3011  14.695 1.78e-09 ***
## sweetness     4.3750     0.6733   6.498 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.693 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9447 
## F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09
  1. Estimate both slopes simultaneously using the Bonferroni procedure with at least a 99 percent confidence level.
confint(reg, level = 1 - 0.005)  # Bonferroni: each of the two intervals at 1 - 0.01/2 = 99.5%
##                0.25 %   99.75 %
## (Intercept) 27.545738 47.754262
## moisture     3.409483  5.440517
## sweetness    2.104236  6.645764
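Equivalently (an optional variant of the call above), since only the two slopes are estimated jointly, the intervals can be restricted to those coefficients:

confint(reg, parm = c("moisture", "sweetness"), level = 1 - 0.01/2)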
  1. Report R2 and adjusted R2. How are they interpreted here?
summary(reg)$r.square
## [1] 0.952059
summary(reg)$adj.r.square
## [1] 0.9446834
  1. Calculate the squared correlation coefficient between \(Yi\) and \(\hat{Y}\). Compare with part (g).
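One way to compute this with the fitted model above (not shown in the original write-up); the result should match the \(R^2\) from part (g):

cor(brand$brand, fitted(reg))^2  # squared correlation between observed Y and fitted values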
  1. Obtain a 99% confidence interval for E(Y) when X1 = 5 and X2 = 4. Interpret it.
predict(reg, tibble(moisture = 5, sweetness = 4), interval = "confidence", level = 0.99)
##      fit      lwr      upr
## 1 77.275 73.88111 80.66889
  1. Obtain a 99% prediction interval for a new observation Y when X1 = 5 and X2 = 4. Interpret it.
predict(reg, tibble(moisture = 5, sweetness = 4), interval = "prediction", level = 0.99)
##      fit      lwr      upr
## 1 77.275 68.48077 86.06923
  1. (# 6.26, Stat-615 only) Show that the squared sample correlation coefficient between \(Y\) and \(\hat Y\) equals \(R^2\).

Remark. Now you can check if you did #3h correctly.
Hints. First, show that the sample averages of \(Y_i\) and \(\widehat{Y}_i\) are the same. Then, write the sample correlation coefficient between \(Y\) and \(\hat Y\) as \[ r_{Y\widehat{Y}} = \frac{\sum (Y_i-\overline{Y})(\widehat{Y}_i-\overline{Y})}{\sqrt{\sum (Y_i-\overline{Y})^2\sum(\widehat{Y}_i-\overline{Y})^2}} = \frac{\sum (\widehat{Y}_i-\overline{Y}+Y_i-\widehat{Y}_i)(\widehat{Y}_i-\overline{Y})}{\sqrt{\sum (Y_i-\overline{Y})^2\sum(\widehat{Y}_i-\overline{Y})^2}} \] and use known properties of residuals \(\sum e_i=0\), \(\sum X_{ij} e_i=0\), \(\sum \widehat{Y}_i e_i=0\).
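A sketch of how the hint can be carried through: by the listed residual properties,
\[ \sum (Y_i-\widehat{Y}_i)(\widehat{Y}_i-\overline{Y}) = \sum e_i\,\widehat{Y}_i - \overline{Y}\sum e_i = 0, \]
so the numerator reduces to \(\sum(\widehat{Y}_i-\overline{Y})^2\), and therefore
\[ r_{Y\widehat{Y}} = \frac{\sum (\widehat{Y}_i-\overline{Y})^2}{\sqrt{\sum (Y_i-\overline{Y})^2\sum(\widehat{Y}_i-\overline{Y})^2}} = \sqrt{\frac{\sum (\widehat{Y}_i-\overline{Y})^2}{\sum (Y_i-\overline{Y})^2}}, \qquad r_{Y\widehat{Y}}^2 = \frac{SSR}{SSTO} = R^2, \]
where the first part of the hint (the averages of \(Y_i\) and \(\widehat{Y}_i\) coincide) is what makes \(\sum(\widehat{Y}_i-\overline{Y})^2\) equal to \(SSR\).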