Robbie Beane
In multiple linear regression, we model the response variable as a linear combination of several predictors, plus a single error term. Assume that we have a single response \(Y\) and \(p\) predictors \(X^{(1)}, X^{(2)}, ..., X^{(p)}\). The hypothesized form of a multiple regression model is as follows:
\[Y = \beta_0 + \beta_1 X^{(1)} + \beta_2 X^{(2)} + ... + \beta_p X^{(p)} + \varepsilon\]
We make the same assumptions for the error term \(\varepsilon\) as we did for simple linear regression. The fitted model will have the form:
\[\hat Y = \hat\beta_0 + \hat\beta_1 X^{(1)} + \hat\beta_2 X^{(2)} + ... + \hat\beta_p X^{(p)}\]
The process for fitting a multiple regression model is very similar to that of a simple linear regression model. In either case, we score candidate models using the sum of squared errors (SSE) and choose the coefficient estimates that produce the lowest SSE.
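As in the simple case, SSE is computed from the residuals of the fitted model:
\[SSE = \sum_{i=1}^{n} \left(y_i - \hat y_i\right)^2 = \sum_{i=1}^{n} \left(y_i - \hat\beta_0 - \hat\beta_1 x_i^{(1)} - ... - \hat\beta_p x_i^{(p)}\right)^2\]
where \(y_i\) and \(x_i^{(j)}\) denote the response and predictor values for the \(i\)th training observation.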
We will now draw attention to several important similarities and differences between simple and multiple linear regression.
Recall that \(\sigma^2 = \mathrm{Var}[\varepsilon]\). For simple linear regression, the quantity \(s^2 = \frac{SSE}{n - 2}\) provides an unbiased estimate of \(\sigma^2\). For multiple regression, our unbiased estimate is given by:
\[s^2 = \frac{SSE}{n-p-1}\]
The square root of this quantity, \(s\), is called the residual standard error. It approximates the standard deviation of our error term.
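As a quick illustration, the residual standard error can be recomputed from the residuals of a fitted model. This is only a sketch; fit is a hypothetical fitted lm object.
# Sketch only: 'fit' is a hypothetical fitted lm object.
p <- length(coef(fit)) - 1                          # number of predictors
n <- length(residuals(fit))                         # number of observations
s <- sqrt(sum(residuals(fit)^2) / (n - p - 1))      # residual standard error
sigma(fit)                                          # R reports the same value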
To determine which predictors are relevant within the model, we will perform a t-test of the form
\[H_0: \beta_i = 0\] \[H_A: \beta_i \neq 0\]
for each coefficient estimate. To calculate the test statistic for any particular test of this form, we will have to calculate \(SE \left[ \beta_i \right]\). The formulas for the standard errors in multiple regression are a bit more complicated than those for simple linear regression. We will discuss the formulas for the standard errors in a later lecture. For now, we will rely on R to calculate these quantities, and to perform the t-Tests.
We need to be careful about conducting multiple t-tests when there are many predictors in a multiple regression problem. There is always a possibility that some predictors will pass their t-test by chance, and the likelihood increases with each additional predictor.
For example, assume that we have 100 predictors, none of which is actually relevant. If we conduct t-tests for each of these variables using a significance level of \(\alpha=0.05\), then each predictor has a 5% chance of passing its test at random. We would expect about 5 of the 100 variables to appear significant!
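In fact, if the 100 tests were independent, the probability of at least one false positive would be
\[1 - (1 - 0.05)^{100} \approx 0.994,\]
so it would be nearly certain that at least one irrelevant predictor passes its test.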
Prior to performing the t-tests, we should first perform the following hypothesis test, called an F-test, to check whether there is evidence that at least one predictor is relevant:
\[H_0 : \beta_1 = \beta_2 = ... = \beta_p = 0\]
\[H_A : \text{There is at least one } \beta_i \neq 0\]
We can test this hypothesis using the F statistic, defined by:
\[F = \frac{\left(SST - SSE\right) / p}{SSE / \left(n - p - 1 \right)}\] Assuming the Gauss-Markov assumptions hold, this test statistic follows an F distribution with \(p\) and \(n - p - 1\) degrees of freedom under the null hypothesis.
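As a sanity check, the F statistic reported by summary() can be recomputed from SSE and SST. A minimal sketch, again using a hypothetical fitted lm object named fit:
# Sketch only: 'fit' is a hypothetical fitted lm object.
y   <- model.response(model.frame(fit))             # observed response values
sse <- sum(residuals(fit)^2)                        # residual sum of squares
sst <- sum((y - mean(y))^2)                         # total sum of squares
p   <- length(coef(fit)) - 1                        # number of predictors
n   <- length(y)                                    # number of observations
F_stat <- ((sst - sse) / p) / (sse / (n - p - 1))   # matches summary(fit)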
Recall that for simple linear regression, we use the R-squared value to assess the predictive quality of the fit:
\[r^2 = 1 - \frac{SSE}{SST}\]
R-squared measures the proportion of variability in the response that the model is able to explain by taking the predictors into account. Note that the summary output for a multiple regression model reports two R-squared values: “Multiple R-squared” and “Adjusted R-squared”. The first value, multiple R-squared, is the standard version of R-squared defined above.
Using R-squared to assess a multiple regression model can be problematic. It can be shown that adding a predictor to a regression model will never decrease the value of R-squared, and will typically increase it, even if the new predictor is in no way related to the response.
When scoring a multiple regression model, it is therefore preferable to use a score called adjusted R-squared, which is similar to R-squared but applies a penalty for including many predictors in the model. The formula for adjusted R-squared is as follows:
\[r_{adj}^2 = 1 - \frac{s^2}{s_Y^2} = 1 - \frac{\frac{1}{n-p-1}SSE}{\frac{1}{n-1}SST} = 1 - \frac{(n-1) SSE}{(n-p-1)SST} \]
\[r_{adj}^2 = r^2 - (1-r^2) \frac{p}{n-p-1}\]
We can use adjusted R-squared to compare models with different numbers of predictors. This is an important tool for deciding which predictors to include in a model, and which to exclude.
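For example, adjusted R-squared can be computed directly from the ordinary R-squared using the last formula above. A minimal sketch, assuming r2, n, and p are already known for the model of interest:
# Sketch only: r2, n, and p are assumed known for the fitted model.
r2_adj <- r2 - (1 - r2) * p / (n - p - 1)
# For a fitted lm object named fit, R reports this as summary(fit)$adj.r.squared.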
To check the validity of the Gauss-Markov assumptions in simple linear regression, we create a residual plot by plotting the residuals for each training observation against the predictor. For multiple regression, we have many predictors, so a similar plot would be very high-dimensional and thus not feasible to use. Fortunately, we can apply a similar analysis by plotting the residuals against the fitted values for each observation in the training set. For simple linear regression, this approach produces a plot that is identical to our standard residual plot, except that the horizontal axis is on a different scale.
In this example, we will be exploring the “New York City Restaurant” dataset, which is stored in the file nyc.txt. This tab-separated text file contains information for 168 Italian restaurants in New York City. The dataset contains six variables for each restaurant:
| Variable | Description |
|---|---|
| Price | The average price, in USD, for a meal at the restaurant (including one drink and a tip). |
| Food | The average customer rating of the food, on a scale from 1 to 30. |
| Decor | The average customer rating of the decor, on a scale from 1 to 30. |
| Service | The average customer rating of the service, on a scale from 1 to 30. |
| Wait | The average wait time, in minutes, on a Saturday evening. |
| East | A binary variable. Equals 1 if the restaurant is east of 5th Avenue, and 0 otherwise. |
We begin by loading the data into a data frame using the read.table() function.
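A sketch of the loading step; the file path and the header argument are assumptions, but the resulting data frame is named df, as in the code later in this section:
df <- read.table("nyc.txt", sep = "\t", header = TRUE)   # tab-separated file
summary(df)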
## Price Food Decor Service
## Min. :19.0 Min. :16.0 Min. : 6.00 Min. :14.0
## 1st Qu.:36.0 1st Qu.:19.0 1st Qu.:16.00 1st Qu.:18.0
## Median :43.0 Median :20.5 Median :18.00 Median :20.0
## Mean :42.7 Mean :20.6 Mean :17.69 Mean :19.4
## 3rd Qu.:50.0 3rd Qu.:22.0 3rd Qu.:19.00 3rd Qu.:21.0
## Max. :65.0 Max. :25.0 Max. :25.00 Max. :24.0
## Wait East
## Min. : 0.00 Min. :0.000
## 1st Qu.:16.75 1st Qu.:0.000
## Median :23.00 Median :1.000
## Mean :22.92 Mean :0.631
## 3rd Qu.:29.00 3rd Qu.:1.000
## Max. :46.00 Max. :1.000
In order to explore the relationship between the categorical (factor) predictor East and the response Price, we will use the two variables to create a box plot.
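A sketch of one way to create this plot with base R graphics; the colors and labels are illustrative:
boxplot(Price ~ East, data = df, col = "salmon",
        xlab = "East", ylab = "Price", main = "Price by Location")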
Our goal is to create a regression model with Price as the response and some combination of the other five variables as predictors. We will begin with the “full” model that uses all five available predictors. Our fitted model will have the following form:
\[\hat {Price} = \hat\beta_0 + \hat\beta_1\cdot Food + \hat\beta_2\cdot Decor + \hat\beta_3\cdot Service + \hat\beta_4\cdot Wait + \hat\beta_5\cdot East\]
mod1 <- lm(Price ~ Food + Decor + Service + Wait + East, df)
#mod1 <- lm(Price ~ ., df)
summary(mod1)
##
## Call:
## lm(formula = Price ~ Food + Decor + Service + Wait + East, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3315 -3.9098 -0.2242 3.3561 17.7499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.16816 4.78350 -5.261 4.47e-07 ***
## Food 1.55401 0.36844 4.218 4.09e-05 ***
## Decor 1.91615 0.21663 8.845 1.49e-15 ***
## Service -0.04571 0.39688 -0.115 0.9085
## Wait 0.06796 0.05311 1.280 0.2025
## East 2.04599 0.94505 2.165 0.0319 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.727 on 162 degrees of freedom
## Multiple R-squared: 0.6316, Adjusted R-squared: 0.6202
## F-statistic: 55.55 on 5 and 162 DF, p-value: < 2.2e-16
Note that this model has an R-squared value of \(r^2 = 0.6316\) and an adjusted R-squared value of \(r^2_{adj} = 0.6202\). We will refer back to these values in a moment.
The coefficient estimates \(\hat\beta_0, ..., \hat\beta_5\) are shown in the first column of the section labeled “Coefficients”. We can use these to see that the fitted model is given by:
\[\hat {Price} = -25.168 + 1.554\cdot Food + 1.916\cdot Decor - 0.046\cdot Service + 0.068\cdot Wait + 2.046 \cdot East\]
Notice that some of the coefficients in this model are very small. That is not a huge concern, in and of itself. What we want to know is if any of the coefficients are so small that we cannot say that they are non-zero with a reasonably high degree of certainty.
We can see from the summary that the F-test resulted in a very low p-value. This means that there is strong support for rejecting the null hypothesis
\[H_0 : \beta_1 = \beta_2 = ... = \beta_p = 0\]
in favor of the alternative hypothesis
\[H_A : \text{There is at least one } \beta_i \neq 0\]
With this in mind, we move on to consider the t-tests.
Assume that we have opted to use a significance level of \(\alpha = 0.05\). Then we see that the results of the t-tests for the intercept, Food, Decor, and East are statistically significant, while the results of the tests for Service and Wait are NOT statistically significant. We can be quite confident that the true values of \(\beta_0\), \(\beta_1\), and \(\beta_2\) are non-zero, and relatively confident that the value of \(\beta_5\) is non-zero. We cannot say with a large degree of certainty that the true values of the parameters \(\beta_3\) and \(\beta_4\) are non-zero.
If we are not confident that \(\beta_3\) and \(\beta_4\) are non-zero, then we are not confident that Service and Wait have an effect on the price. We should consider removing these from the model.
Since the t-tests for the coefficients of Service and Wait were not statistically significant, we will remove those from the model.
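The reduced model, which we will call mod2 (as referenced in the code below), can be fit as follows:
mod2 <- lm(Price ~ Food + Decor + East, df)
summary(mod2)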
##
## Call:
## lm(formula = Price ~ Food + Decor + East, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.0451 -3.8809 0.0389 3.3918 17.7557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.0269 4.6727 -5.142 7.67e-07 ***
## Food 1.5363 0.2632 5.838 2.76e-08 ***
## Decor 1.9094 0.1900 10.049 < 2e-16 ***
## East 2.0670 0.9318 2.218 0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared: 0.6279, Adjusted R-squared: 0.6211
## F-statistic: 92.24 on 3 and 164 DF, p-value: < 2.2e-16
This produces the fitted model:
\[\hat {Price} = -24.027 + 1.536\cdot Food + 1.909\cdot Decor + 2.067 \cdot East\]
Notice that the t-tests for all three parameter estimates result in statistically significant p-values. Removing Service and Wait from the model resulted in a slightly decreased R-squared value, but a slightly increased adjusted R-squared value.
We see that in our second model, we have a residual standard error of \(s = 5.72\). Let’s compare that with the standard deviation of our response variable:
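The sample standard deviation of Price can be computed with R's sd() function:
sd(df$Price)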
## [1] 9.292814
We will now analyze the residuals from our second model to see if they support the Gauss-Markov assumptions.
res <- mod2$residuals
plot(res ~ mod2$fitted.values, pch=21, col="black", bg="salmon",
xlab="Fitted Value", ylab="Residuals", main="Residual Plot")
abline(h=0, col="darkred", lwd=2)
To test normality, we will conduct a Shapiro-Wilk hypothesis test of the form:
\[H_0 : \text{The residuals are normally distributed.}\]
\[H_A : \text{The residuals are not normally distributed.}\]
We will use a significance level of \(\alpha = 0.05\).
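The test can be run with R's shapiro.test() function, applied to the residuals computed above:
shapiro.test(res)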
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.98623, p-value = 0.09733
The p-value of 0.09733 is relatively low, but not significant at our selected value of \(\alpha\). While there does seem to be some evidence that the residuals are not normally distributed, it is not exceptionally strong.
To further our analysis, we will create a histogram and QQ-plot of the residuals.
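A sketch of these two diagnostic plots using base R graphics; the layout, colors, and labels are illustrative:
par(mfrow = c(1, 2))                       # display the two plots side by side
hist(res, col = "salmon", xlab = "Residuals", main = "Histogram of Residuals")
qqnorm(res, pch = 21, col = "black", bg = "salmon", main = "Normal QQ-Plot")
qqline(res, col = "darkred", lwd = 2)      # reference line for perfect normality
par(mfrow = c(1, 1))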
Taken together, these diagnostics provide reasonable support for the Gauss-Markov assumptions.
We will now use our final fitted model to generate predictions for three NYC restaurants with the following characteristics:
- Food = 21, Decor = 25, Service = 18, Wait = 35, East = 1
- Food = 24, Decor = 13, Service = 23, Wait = 16, East = 0
- Food = 18, Decor = 16, Service = 27, Wait = 27, East = 1

Since mod2 uses only Food, Decor, and East, we only need those three variables when constructing the new data frame.
nd <- data.frame(Food = c(21, 24, 18), Decor = c(25, 13, 16), East = c(1, 0, 1))
predict(mod2, newdata = nd, interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 58.03772 42.64722 73.42821
## 2 37.66726 22.04456 53.28997
## 3 36.24432 21.16980 51.31884