Multiple regression is a widely used method due to its relatively straightforward nature and its power in fitting linear relationships. The concepts explored in a previous post on simple regression carry over to multiple regression, as simple regression is a special case of the linear regression model (Winner, 2009, p. 142).
In this example, multiple regression with two predictor variables will be explored using the delivery dataset from the robustbase package. The calculations behind multiple regression will also be worked through with matrix algebra to verify the results and gain a deeper understanding of how the model is fit. Models with three or more predictor variables are much more tedious to calculate by hand and are best left to a computer; however, the case of two predictor variables is relatively straightforward and provides a view into how a multiple regression model is fit.
A regression model with two predictor variables \(x_1\) and \(x_2\) has the form:
\[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i \]
The formulation of the model will be examined more closely later in the post.
First, load the packages that will be needed along with the delivery dataset. The gridExtra package is handy for combining plots created by ggplot2.
library(ggplot2)
library(gridExtra)
library(robustbase)
data(delivery)
According to the dataset’s documentation, “The aim is to explain the time required to service a vending machine (Y) using the number of products stocked (X1) and the distance walked by the route driver (X2).”
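Before plotting, a quick look at the data is helpful (a minimal check, assuming the packages and dataset loaded above):
str(delivery)   # 25 observations of n.prod, distance and delTime
head(delivery)  # first few rows of the data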
Plot two scatterplots of the predictor variables with the response variable.
del.nprod <- ggplot(delivery, aes(n.prod, delTime)) +
  geom_point(color='#2980B9', size = 3)
del.dist <- ggplot(delivery, aes(distance, delTime)) +
  geom_point(color='#2980B9', size = 3) + labs(y=NULL)
grid.arrange(del.nprod, del.dist, ncol = 2)
There does appear to be a linear relationship between the total delivery time and both the number of products stocked and the distance walked. This relationship makes intuitive sense, as one would assume a delivery takes longer as the number of products stocked or the distance walked to the site increases. The observation in the far top right of both scatterplots is potentially an outlier, but treating it is outside the intended scope of this example, so it will be left as-is.
The multiple regression model is fitted with the lm() function. The . in the formula tells the function to use all of the variables in the dataset other than the response variable to the left of the ~ sign. The same model would be fit if the call were lm(delTime ~ n.prod + distance, data = delivery).
del.lm <- lm(delTime ~ ., data = delivery)
summary(del.lm)
##
## Call:
## lm(formula = delTime ~ ., data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## n.prod 1.615907 0.170735 9.464 3.25e-09 ***
## distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
The summary() function shows both predictor coefficients are significant, and thus there is indeed a linear relationship between the total delivery time and the two predictor variables. The coefficients are interpreted as the change in the response \(y\) per unit increase in \(x_1\) when \(x_2\) is held constant, and vice versa. For example, according to the model, the mean delivery time increases by 1.62 units for each additional product stocked while the distance walked remains constant. Due to this interpretation, the coefficients are also called partial regression coefficients. Please note that this is an example analysis; since the variables are measured in different units, it is often recommended to scale them before fitting a model.
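As a rough sketch of that recommendation (not part of the original analysis; the object name del.lm.scaled is made up here), the two predictors could be standardized with scale() before refitting, so the coefficients are expressed per standard deviation of each predictor rather than per raw unit:
delivery.scaled <- delivery
# Standardize only the predictor columns, leaving the response in its original units
delivery.scaled[, c('n.prod', 'distance')] <- scale(delivery.scaled[, c('n.prod', 'distance')])
del.lm.scaled <- lm(delTime ~ ., data = delivery.scaled)
coef(del.lm.scaled)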
Multiple regression is usually performed using matrix algebra. As mentioned previously, the linear regression model with two predictor variables is defined as:
\[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i \]
This is also called a first-order regression model, as the predictor variables enter the model linearly. The model can be extended to the general case with \(p - 1\) predictor variables, so that \(p\) parameters (including the intercept) are estimated.
\[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_{p-1} x_{i,p-1} + \epsilon_i \]
Which can also be written as:
\[ Y_i = \beta_0 + \sum^{p-1}_{k=1} \beta_k x_{ik} + \epsilon_i \]
In matrix terms, the multiple regression model can be expressed as:
\(\underset{n \times 1}{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}\)
\(\underset{n \times p}{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1,p-1} \\ 1 & x_{21} & x_{22} & \dots & x_{2,p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{n,p-1} \end{bmatrix}\)
\(\underset{p \times 1}{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}\)
\(\underset{n \times 1}{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}\)
Which can also be written as:
\[ Y = X\beta + \epsilon \]
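As a quick illustration (a sketch using the model fitted earlier), the design matrix \(X\) for the delivery data can be pulled straight from the lm fit with model.matrix(), which prepends the column of ones automatically:
X <- model.matrix(del.lm)  # n x p design matrix: intercept column plus the two predictors
head(X)
dim(X)  # 25 rows (observations) and 3 columns (parameters)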
As in the case of simple linear regression, the goal of least squares estimation is to find values for the parameters \(\beta_0, \beta_1, \dots, \beta_{p-1}\) that minimize the sum of squared deviations. This quantity will be denoted \(Q\) and is defined as:
\[ Q = \sum^n_{i=1} (Y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_{p-1} x_{i,p-1})^2 \]
The vector (a matrix with one column) of estimated parameters is denoted:
\(\underset{p \times 1}{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{p-1} \end{bmatrix}\)
Thus the least squares equations for the multiple regression model can be represented as:
\[ X'Xb = X'Y \]
With estimators:
\[ \underset{p \times 1}{b} = \underset{p \times p}{(X'X)^{-1}} \underset{p \times 1}{X'Y} \]
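Continuing the sketch from above, the estimator can be computed directly and compared with the coefficients reported by lm(). Using solve(A, b) solves the normal equations without forming the inverse of \(X'X\) explicitly:
y <- delivery$delTime
b <- solve(crossprod(X), crossprod(X, y))  # solves X'Xb = X'Y
drop(b)
coef(del.lm)  # should match the matrix-algebra estimates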
The fitted response can also be represented as a vector denoted \(\hat{Y}\) with values \(\hat{Y}_i\) and residuals \(\epsilon_i = Y_i - \hat{Y}_i\).
\(\underset{n \times 1}{\hat{Y}} = \begin{bmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{bmatrix}\)
\(\underset{n \times 1}{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}\)
The fitted values can also be written as:
\[ \underset{n \times 1}{\hat{Y}} = Xb \]
With residuals:
\[ \underset{n \times 1}{\epsilon} = Y - \hat{Y} = Y - Xb \]
The hat matrix is defined as:
\[ H = X(X'X)^{-1} X' \]
Therefore, the fitted values vector \(\hat{Y}\) can be expressed in terms of the hat matrix:
\[ \underset{n \times 1}{\hat{Y}} = HY \]
With the vector of residuals then:
\[ \underset{n \times 1}{\epsilon} = (I - H)Y \]
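These hat matrix relationships can be verified numerically as well (still a sketch; the objects X, y, and del.lm come from earlier in the post):
H <- X %*% solve(crossprod(X)) %*% t(X)  # hat matrix H = X(X'X)^{-1}X'
y.hat <- H %*% y                         # fitted values HY
e <- (diag(nrow(X)) - H) %*% y           # residuals (I - H)Y
all.equal(as.vector(y.hat), unname(fitted(del.lm)))
all.equal(as.vector(e), unname(residuals(del.lm)))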
Briefly touching on ANOVA in the multiple regression setting, the sum of squares and mean sum of squares from each source of variation are defined in matrix terms as:
\[ SSE = Y'Y - b'X'Y = Y'(I - H)Y \] \[ SST = Y'Y - \frac{\left(\sum Y_i\right)^2}{n} \] \[ SSR = SST - SSE \]
\[ MSE = \frac{SSE}{n - p} \] \[ MSR = \frac{SSR}{p - 1} \]
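As a quick check of these definitions (again continuing the sketch above, with \(p = 3\) parameters for the delivery model), the sums of squares can be computed from the matrices and compared with anova(del.lm), where the first two rows of the Sum Sq column add up to SSR and the Residuals row is SSE:
n <- nrow(delivery)
p <- length(coef(del.lm))  # number of parameters: intercept plus two slopes
SSE <- drop(crossprod(y) - crossprod(b, crossprod(X, y)))  # Y'Y - b'X'Y
SST <- drop(crossprod(y) - sum(y)^2 / n)
SSR <- SST - SSE
MSE <- SSE / (n - p)
MSR <- SSR / (p - 1)
anova(del.lm)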
The F-statistic for testing whether a regression relation exists between the response variable and the set of predictor variables is the same as in simple linear regression:
\[ F = \frac{MSR}{MSE} \]
The null hypothesis of no regression relation between the response and the predictor variables is rejected if the computed \(F\) statistic exceeds the critical \(F\) value at a given significance level with \(p - 1\) and \(n - p\) degrees of freedom.
\[ F > F_{\alpha, p - 1, n - p} \]
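Continuing the sketch, the observed F statistic can be compared with both the critical value from qf() and the p-value from pf():
F.stat <- MSR / MSE
F.stat
qf(0.95, df1 = p - 1, df2 = n - p)  # critical F value at alpha = 0.05
pf(F.stat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)  # p-value of the observed F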
In the multiple regression setting, \(R^2\) is designated the coefficient of multiple determination and is defined as:
\[ R^2 = 1 - \frac{SSE}{SST} \]
As in the case of simple linear regression, \(R^2\) measures the proportionate reduction in the total variation of the response variable associated with the predictor variables in the model. \(R^2\) can only increase as more predictor variables are added to the model. \(R_a^2\), the adjusted coefficient of multiple determination, corrects for this artificial inflation by dividing each sum of squares by its degrees of freedom.
\[ R_a^2 = 1 - \left(\frac{n - 1}{n - p}\right)\frac{SSE}{SST} \]
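Both quantities can be computed from the sums of squares above and compared with the values stored in summary(del.lm) (still part of the same sketch):
R2 <- 1 - SSE / SST
R2.adj <- 1 - ((n - 1) / (n - p)) * SSE / SST
c(R2, R2.adj)
c(summary(del.lm)$r.squared, summary(del.lm)$adj.r.squared)  # should agree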
Phew! Apologies for all of the definitions. A good portion of the above should be familiar from the simple linear regression setting. To summarize, a first-order regression model with two predictor variables appears in matrix form as:
\(Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}\)
\(X = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{bmatrix}\)
\(\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}\)
The following steps give a quick summary of how the regression parameters are calculated for a two-predictor-variable model. The parameter estimates are found by solving the normal equations:
\[ b = (X'X)^{-1} X'Y \]
The fitted response values are thus:
\[ \hat{Y} = Xb \]
The following function implements the above definitions and equations to fit a first order multiple regression model with two predictor variables.
two.predictor.regression <- function(x, y) {
  y <- as.matrix(y)
  if (ncol(y) > 1) {
    stop('y must be a vector')
  }
  x <- data.frame(x)
  if (ncol(x) != 2) {
    stop('x must have only two predictor variables')
  }
  x$intercept <- 1 # Add the intercept term to the X matrix
  col <- grep('intercept', names(x))
  x <- as.matrix(x[, c(col, (1:ncol(x))[-col])]) # Put the intercept term first in the X matrix
  n <- length(y)
  if (n != nrow(x)) {
    stop('x and y must be the same length')
  }
  # Number of estimated parameters. With an intercept and two predictor variables
  # this is always 3, giving n - 3 residual degrees of freedom
  df <- 3
  x.crossprod <- crossprod(x)     # Calculate the crossproduct X'X
  y.crossprod <- crossprod(y)     # Calculate the crossproduct Y'Y
  xy.crossprod <- crossprod(x, y) # Crossproduct X'Y
  x.inv <- solve(x.crossprod)     # Find the inverse of X'X
  b <- x.inv %*% xy.crossprod     # Get the b parameters by multiplying the inverse of X'X and X'Y
  fitted.y <- x %*% b             # Find the fitted response values
  resid.mat <- y - fitted.y       # Find the residual values
  min.resid <- min(resid.mat)
  max.resid <- max(resid.mat)
  med.resid <- median(resid.mat)
  Q1.resid <- quantile(resid.mat, .25)
  Q3.resid <- quantile(resid.mat, .75)
  resid.df <- data.frame(round(cbind(min.resid, Q1.resid, med.resid, Q3.resid, max.resid), 3), row.names = '')
  colnames(resid.df) <- c('Min', '1Q', 'Median', '3Q', 'Max')
  sse <- y.crossprod - crossprod(b, xy.crossprod) # SSE
  mse <- sse / (n - df)                           # MSE
  sst <- y.crossprod - sum(y)^2 / n               # SST
  ssr <- sst - sse                                # SSR
  msr <- ssr / 2                                  # MSR
  f.stat <- msr / mse                             # F-statistic
  p.val <- pf(f.stat, df1 = 2, df2 = n - df, lower.tail = FALSE) # p-value of the F-statistic
  R2 <- 1 - sse / sst                             # R-squared
  R2.adj <- 1 - ((n - 1) / (n - df)) * sse / sst  # Adjusted R-squared
  s2b <- mse[1] * x.inv # Estimated variance-covariance matrix of the regression parameters
  std.err <- matrix(sqrt(diag(s2b)), nrow = 3, ncol = 1) # Parameter standard errors from the diagonal
  t.values <- b / std.err # t statistics of the regression parameters
  p.values <- 2 * pt(abs(t.values), df = n - df, lower.tail = FALSE)
  res.std.err <- sqrt(mse)
  # The following collects and returns all of the results found above in a list
  coef <- data.frame(cbind(round(b, 3), round(std.err, 3), round(t.values, 3), format(p.values, digits = 3)))
  colnames(coef) <- c('Estimate', 'Std. Error', 't value', 'Pr(>|t|)')
  resids <- paste('Residual standard error:', round(res.std.err, 3), 'on', n - df, 'degrees of freedom')
  R.squared <- paste('Multiple R-squared:', round(R2, 4), 'Adjusted R-squared:', round(R2.adj, 4))
  f.stat <- paste('F-statistic:', round(f.stat, 1), 'on 2 and', n - df, 'DF')
  p <- paste('p-value:', format(p.val, digits = 4, scientific = TRUE))
  res <- list('Residuals' = resid.df, 'Coefficients' = coef, resids, R.squared, f.stat, p)
  return(res)
}
two.predictor.regression(delivery[,1:2], delivery[,3])
## $Residuals
## Min 1Q Median 3Q Max
## -5.788 -0.663 0.436 1.157 7.42
##
## $Coefficients
## Estimate Std. Error t value Pr(>|t|)
## intercept 2.341 1.097 2.135 4.42e-02
## n.prod 1.616 0.171 9.464 3.25e-09
## distance 0.014 0.004 3.981 6.31e-04
##
## [[3]]
## [1] "Residual standard error: 3.259 on 22 degrees of freedom"
##
## [[4]]
## [1] "Multiple R-squared: 0.9596 Adjusted R-squared: 0.9559"
##
## [[5]]
## [1] "F-statistic: 261.2 on 2 and 22 DF"
##
## [[6]]
## [1] "p-value: 4.687e-16"
summary(del.lm)
##
## Call:
## lm(formula = delTime ~ ., data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## n.prod 1.615907 0.170735 9.464 3.25e-09 ***
## distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
The function outputs the same results as the lm() function used earlier.
This post examined multiple regression in the two predictor variable setting in order to gain a deeper understanding of the method. Multiple regression is such a frequently used tool that it pays to have a fundamental knowledge of the reasoning and assumptions behind it. Fortunately, many of the concepts from simple linear regression carry over to multiple regression.
Kutner, M. H., Nachtsheim, C. J., Neter, J., Li, W., & Wasserman, W. (2004). Applied linear statistical models (5th ed.). Boston, MA: McGraw-Hill Higher Education.
5.4 - A matrix formulation of the multiple regression model. (2016). Retrieved from Regression Methods, https://onlinecourses.science.psu.edu/stat501/node/382
Brannick, M. (n.d.). Regression with two independent variables. Retrieved from http://faculty.cas.usf.edu/mbrannick/regression/Reg2IV.html
Winner, L. (2009). Applied Statistical Methods. University of Florida.