A note about the notation: SSReg(A | B) is the extra sum of squares that arises from adding the variables A to a regression model that already contains the variables B. Thus it is used to compare the full model, with both A and B in it, against the reduced model with only B.
Ans: We can calculate the degrees of freedom by counting the number of variables to the left of the “|”:
- SSReg(X1 | X2) = 1
- SSReg(X2 | X1, X3) = 1
- SSReg(X1, X2 | X3, X4) = 2
- SSReg(X1, X2, X3 | X4, X5) = 3
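The counting rule follows from writing the extra sum of squares in terms of error sums of squares: \[
SSReg(A \mid B) = SSE(B) - SSE(A, B) = SSReg(A, B) - SSReg(B),
\] and the degrees of freedom is the number of parameters added, i.e. the number of variables in A.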
The model might be: \[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon_i
\] (a) whether or not \(\beta_5 = 0\)?
- SSR(X5 | X1, X2, X3, X4)
(b) whether or not \(\beta_2 = \beta_4 = 0\)?
- SSR(X2, X4 | X1, X3, X5)
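As a sketch of how these tests would be run in R (assuming the variables sit in a hypothetical data frame `df`, which is not part of the original data), each one is a general linear F-test comparing the reduced model against the full model:
# Hypothetical data frame `df` with columns Y, X1, ..., X5
full <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data = df)
# (a) H0: beta5 = 0 -- drop X5 from the full model
reduced.a <- lm(Y ~ X1 + X2 + X3 + X4, data = df)
anova(reduced.a, full)  # F-test based on SSR(X5 | X1, X2, X3, X4), 1 df
# (b) H0: beta2 = beta4 = 0 -- drop X2 and X4 from the full model
reduced.b <- lm(Y ~ X1 + X3 + X5, data = df)
anova(reduced.b, full)  # F-test based on SSR(X2, X4 | X1, X3, X5), 2 df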
Recall the variables: the data were collected to study the relation between degree of brand liking (Y) and the moisture content (X1) and sweetness (X2) of the product.
library(dplyr) # provides %>% and rename()
brand <- read.table("./data/CH06PR05.txt")
brand %>%
  rename(Y = V1, X1 = V2, X2 = V3) -> brand
# SSR(X1)
X1 <- lm(Y ~ X1, data = brand)
# SSR(X2|X1)
X2givenX1 <- lm(Y ~ X1 + X2, data = brand)
anova(X1)
anova(X2givenX1)
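Since `anova()` on an `lm` fit reports sequential (Type I) sums of squares, the X1 row of `anova(X2givenX1)` is SSR(X1) and the X2 row is SSR(X2 | X1). The two quantities can be pulled out directly:
anova(X2givenX1)["X1", "Sum Sq"] # SSR(X1)
anova(X2givenX1)["X2", "Sum Sq"] # SSR(X2 | X1)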
Consider dropping X2: the hypotheses are \(H_0: \beta_2 = 0\) vs \(H_a: \beta_2 \neq 0\). According to the analysis of variance table above, the p-value for X2 is 2.011e-05, so there is strong evidence that \(\beta_2 \neq 0\) and X2 cannot be removed from the model.
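Equivalently, the same conclusion follows from the partial F-test comparing the two fits (the F statistic is the square of X2's t statistic):
# Partial F-test for H0: beta2 = 0
anova(X1, X2givenX1)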
summary(X1)$coefficients[, 1]
## (Intercept) X1
## 50.775 4.425
\[ \hat{Y} = 50.775 + 4.425X_1 \]
In the X2givenX1 model, the estimated regression coefficient for X1 is 4.425. In the X1 model, the estimated regression coefficient for X1 is 4.425, too.
summary(X2givenX1)$coefficients[2, 1]
## [1] 4.425
summary(X1)$coefficients[2,1]
## [1] 4.425
# SSReg(X1)
anova(X1)
# SSReg(X1|X2)
X1givenX2 <- lm(Y ~ X2 + X1, data = brand)
anova(X1givenX2)
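As before, the X1 row of `anova(X1givenX2)` is SSR(X1 | X2), because X1 enters second in the formula `Y ~ X2 + X1`. In general SSR(X1) and SSR(X1 | X2) differ unless X1 and X2 are uncorrelated:
anova(X1)["X1", "Sum Sq"]        # SSR(X1)
anova(X1givenX2)["X1", "Sum Sq"] # SSR(X1 | X2)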
lm.fit.5fa <- lm(Y ~ X2 , data = brand)
lm.fit.5fa$residuals -> a
a
## 1 2 3 4 5 6 7 8 9 10
## -13.375 -13.125 -16.375 -10.125 -5.375 -6.125 -6.375 -3.125 5.625 2.875
## 11 12 13 14 15 16
## 8.625 6.875 10.625 8.875 16.625 13.875
lm.fit.5fb <- lm(X1 ~ X2, data = brand)
lm.fit.5fb$residuals -> b
b
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## -3 -3 -3 -3 -1 -1 -1 -1 1 1 1 1 3 3 3 3
Regress the residuals from the model “Y on X2” on the residuals from the model “X1 on X2”; compare the estimated slope and error sum of squares with those from #1. What about \(R^2\)?
The regression of Y on X2: the estimated slope is 4.375, SSE is 1660.75, and \(R^2\) is 0.1557.
The regression of the residuals of Y on X2 against the residuals of X1 on X2: the estimated slope is 4.425, SSE is 94.3, and \(R^2\) is 0.9432.
Because the two regressions have different responses, and hence different total sums of squares, the two \(R^2\) values are not comparable.
lm.fit.5fc <- lm(a ~ b)
summary(lm.fit.5fa) # lm(Y ~ X2)
##
## Call:
## lm(formula = Y ~ X2, data = brand)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.375 -7.312 -0.125 8.688 16.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.625 8.610 7.970 1.43e-06 ***
## X2 4.375 2.723 1.607 0.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.89 on 14 degrees of freedom
## Multiple R-squared: 0.1557, Adjusted R-squared: 0.09539
## F-statistic: 2.582 on 1 and 14 DF, p-value: 0.1304
anova(lm.fit.5fa)
summary(lm.fit.5fc)
##
## Call:
## lm(formula = a ~ b)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.400 -1.762 0.025 1.587 4.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.718e-17 6.488e-01 0.00 1
## b 4.425e+00 2.902e-01 15.25 4.09e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.595 on 14 degrees of freedom
## Multiple R-squared: 0.9432, Adjusted R-squared: 0.9392
## F-statistic: 232.6 on 1 and 14 DF, p-value: 4.089e-10
anova(lm.fit.5fc)
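A quick check of the equivalences from the output above: the slope of the residual regression matches the X1 coefficient in the model containing both predictors, and its \(R^2\) is the coefficient of partial determination SSR(X1 | X2) / SSE(X2):
coef(lm.fit.5fc)["b"]       # 4.425
coef(X2givenX1)["X1"]       # 4.425, the same slope
(1660.75 - 94.3) / 1660.75  # approximately 0.9432, the partial R-squared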
X2 and X3 are dummy variables; Y represents profit or loss, and X1 represents the size of the bank.
\(\beta_2\): The difference between the commercial bank’s and the savings and loan bank’s expected profit or loss.
\(\beta_3\): The difference between the mutual savings bank’s and the savings and loan bank’s expected profit or loss.
\(\beta_2 - \beta_3\): The difference between the commercial bank’s and the mutual savings bank’s expected profit or loss.
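These interpretations follow from the implied coding (an assumption here: X2 = 1 for a commercial bank, X3 = 1 for a mutual savings bank, with a savings and loan bank as the baseline): \[
E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3,
\] so at a given bank size the expected responses are \(\beta_0 + \beta_1 X_1\) (savings and loan), \(\beta_0 + \beta_1 X_1 + \beta_2\) (commercial), and \(\beta_0 + \beta_1 X_1 + \beta_3\) (mutual savings); subtracting the last two gives \(\beta_2 - \beta_3\).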
(8.16, 8.20) Refer to our old GPA data
An assistant to the director of admissions conjectured that the predictive power of the model could be improved by adding information on whether the student had chosen a major field of concentration at the time the application was submitted. Suppose that the first 10 students chose their major when they applied.
GPA <- read.table("./data/CH01PR19.txt")
GPA %>%
rename(Y = "V1", X1 = "V2") -> GPA
# Suppose that the first 10 students chose their major when they applied.
GPA %>%
mutate(X2 = 0) -> GPA
GPA$X2[1:10] <- 1
head(GPA)
lm.fit <- lm(Y ~ X1 + X2, data = GPA)
summary(lm.fit)
##
## Call:
## lm(formula = Y ~ X1 + X2, data = GPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.81035 -0.33271 0.02987 0.44702 1.15523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.11062 0.32220 6.551 1.6e-09 ***
## X1 0.03871 0.01282 3.018 0.00312 **
## X2 0.07728 0.20663 0.374 0.70910
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6254 on 117 degrees of freedom
## Multiple R-squared: 0.07373, Adjusted R-squared: 0.05789
## F-statistic: 4.656 on 2 and 117 DF, p-value: 0.01133
\[ \hat{Y} = 2.11062 + 0.03871X_1 + 0.07728X_2 \]
The significance of X2 is tested by \(H_0: \beta_2 = 0\) vs \(H_1: \beta_2 \neq 0\). With a large p-value of 0.7091 and a small test statistic F = 0.1399, we fail to reject the null hypothesis: there is no evidence that X2 is significant, so X2 may be removed from the model.
lm.fit.droppedX2 <- lm(Y ~ X1, data = GPA)
anova(lm.fit, lm.fit.droppedX2)
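The F statistic from this comparison equals the square of the t statistic reported for X2 in `summary(lm.fit)`:
summary(lm.fit)$coefficients["X2", "t value"]^2 # approximately 0.1399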
# interaction term
lm.fit.interaction <- lm(Y ~ X1 * X2, data = GPA)
summary(lm.fit.interaction)
##
## Call:
## lm(formula = Y ~ X1 * X2, data = GPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47832 -0.31337 0.04355 0.45001 1.07374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.83364 0.33492 5.475 2.57e-07 ***
## X1 0.04992 0.01336 3.738 0.00029 ***
## X2 2.49114 1.00135 2.488 0.01428 *
## X1:X2 -0.09635 0.03915 -2.461 0.01531 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6123 on 116 degrees of freedom
## Multiple R-squared: 0.1197, Adjusted R-squared: 0.09694
## F-statistic: 5.258 on 3 and 116 DF, p-value: 0.001947
\[ \hat{Y} = 1.83364 + 0.04992X_1 + 2.49114X_2 - 0.09635X_1X_2 \]
X2 = 1 if the student had indicated a major field of concentration at the time of application, and X2 = 0 otherwise. The estimated value of \(\beta_3\) is -0.09635, indicating that the slope of X1 differs between the two groups: the expected effect of X1 on GPA is smaller for students who had indicated a major.
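Concretely, \(\beta_3\) shifts the slope of X1; a short calculation with the fitted coefficients gives the two group-specific slopes:
b <- coef(lm.fit.interaction)
b["X1"]              # slope of X1 when X2 = 0: about 0.04992
b["X1"] + b["X1:X2"] # slope of X1 when X2 = 1: about 0.04992 - 0.09635 = -0.04643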