An Overview of Multiple Linear Regression

We have previously discussed simple linear regression, which has a single predictor and a single response variable. In the real world, many factors may influence a single output variable. We can model this using the multiple linear regression formula with \(n\) predictor variables: \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_1 + \hat{\beta_2} x_2 + \cdots + \hat{\beta_n} x_n\] where the \(\hat{\beta_j}\) are the estimated regression coefficients, and \(x_1, \ldots, x_n\) are the predictor variables.
\({\beta_0}\) is the intercept, and, as in simple linear regression, it may not have a practical interpretation, depending on the data set given.
Similar to simple linear regression, the error terms describe the effects on \({y_i}\) of all factors other than the predictors. The same assumptions on the error term apply as in simple linear regression. The residual term is still calculated as \((y_i - \hat{y_i})\), where \(y_i\) is the observed value and \(\hat{y_i}\) is the value predicted by the multiple linear regression.
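As in simple linear regression, the estimates \(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_n}\) are chosen by the method of least squares, which minimizes the sum of squared residuals \(\sum_i (y_i - \hat{y_i})^2\).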

A Motivating Example

Let’s take a look at the data set domedata1. This data set holds the results of an experiment in which a baseball was thrown in the Metrodome and the distance it travelled was recorded as the direction of the air conditioning fans was varied. According to the help page (?domedata1), there are 96 observations.

Exploratory Data Analysis

First, we load the package with library() and the data set with data(), then attach the data using attach(). Attaching the data lets us use the variable names from the data set without needing to refer back to the data set each time. To understand how our data are formatted and what they contain, we use the command head().

library(alr3)
## Loading required package: car
data(domedata1) 
attach(domedata1)
head(domedata1) 
##    Date     Cond Angle Velocity  BallWt BallDia  Dist
## 1 March Headwind  49.1    154.1 140.980    2.86 355.8
## 2 March Headwind  49.1    157.1 140.567    2.88 358.8
## 3 March Headwind  49.4    155.8 141.855    2.83 347.6
## 4 March Headwind  49.5    153.2 140.099    2.81 343.9
## 5 March Headwind  49.6    158.8 140.980    2.86 359.3
## 6 March Headwind  49.9    156.3 140.980    2.86 356.8

Our data set includes the following variables: date (by month), wind condition, angle of the throw in degrees, velocity in feet per second, ball weight in grams, ball diameter in inches, and distance of the throw in feet. We believe all of the quantitative factors may affect the distance of the throw. Wind condition, recorded as either headwind or tailwind, is the only categorical predictor that we think will impact the distance of the throw.
Another source of information to aid our understanding of the data set is the summary() command. It provides a quick overview of the distribution of each variable in our collected data.

summary(domedata1)
##     Date          Cond        Angle          Velocity         BallWt     
##  March:34   Headwind:49   Min.   :35.60   Min.   :121.8   Min.   :140.1  
##  May  :62   Tailwind:47   1st Qu.:42.60   1st Qu.:128.4   1st Qu.:141.0  
##                           Median :49.40   Median :145.2   Median :141.5  
##                           Mean   :45.46   Mean   :144.8   Mean   :141.4  
##                           3rd Qu.:50.10   3rd Qu.:155.7   3rd Qu.:141.9  
##                           Max.   :51.30   Max.   :162.5   Max.   :143.0  
##     BallDia           Dist      
##  Min.   :2.810   Min.   :273.9  
##  1st Qu.:2.837   1st Qu.:304.4  
##  Median :2.850   Median :347.6  
##  Mean   :2.847   Mean   :335.8  
##  3rd Qu.:2.860   3rd Qu.:358.6  
##  Max.   :2.880   Max.   :383.4

This helps us understand the range of each factor; for example, distance varies by over 100 feet, while ball weight varies by only about 3 grams.

Creating the Multiple Linear Regression Model

The variables we are looking at are condition, angle, velocity, ball weight, ball diameter, and the distance of the flight of the ball. We hypothesize that the first five of these will be useful predictors of the distance the ball travelled when it was thrown.
To test this hypothesis, we must first define our full multiple linear regression model. We do this using the lm() command, modeling distance as a function of (~) the other predictors. We then call summary() on our model to obtain many important statistics about the fit.

mymod <- lm(Dist ~ Velocity + Angle + BallWt + BallDia + Cond, data = domedata1)
summary(mymod)
## 
## Call:
## lm(formula = Dist ~ Velocity + Angle + BallWt + BallDia + Cond, 
##     data = domedata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.843  -5.841  -1.527   6.347  24.374 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -18.60964  238.27745  -0.078    0.938    
## Velocity       2.54854    0.08965  28.427  < 2e-16 ***
## Angle         -1.52905    0.20085  -7.613 2.54e-11 ***
## BallWt        -0.09274    1.39605  -0.066    0.947    
## BallDia       23.59297   53.72935   0.439    0.662    
## CondTailwind   1.79182    1.92312   0.932    0.354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.284 on 90 degrees of freedom
## Multiple R-squared:  0.9133, Adjusted R-squared:  0.9084 
## F-statistic: 189.5 on 5 and 90 DF,  p-value: < 2.2e-16

We note that this model summary looks very similar to the simple linear regression model summary. However, in the multiple linear regression summary, the coefficients section includes lines for the five predictors and the intercept, whereas the simple linear regression summary has only one predictor coefficient and the intercept term.

The multiple linear regression formula is \[\hat{Distance_i} = -18.60964 + 2.54854(Velocity) - 1.52905(Angle) - 0.09274(Ball Weight) + 23.59297(Ball Diameter) + 1.79182(CondTailwind).\]
We can interpret this by allowing one predictor to vary while holding the other predictors constant; for example, for every one foot per second increase in velocity, the ball is predicted to travel 2.54854 feet further. A similar interpretation applies to angle (for every one degree increase, the predicted distance decreases by 1.52905 feet), ball weight, and ball diameter. Condition is a categorical variable, so it is numerically treated as either 0 (headwind) or 1 (tailwind). If the throw was made with a tailwind, the predicted distance increases by 1.79182 feet, holding all else constant; with a headwind, the prediction is unchanged.

Verifying Error Assumptions

Now that the model has been built, we must confirm that the data set meets all required assumptions:

  1. The population of potential error term values has a mean equal to 0.

To confirm this, we will examine the residuals of our multiple linear regression model. Using the linear model mymod that we fit above, we take the sum of all of its residuals with the sum() and resid() commands; if the residuals sum to (essentially) zero, their mean is zero as well.

sum(resid(mymod))
## [1] -1.132427e-14

While the answer is not exactly zero, it differs from zero only by floating-point rounding; least squares fitting forces the residuals of a model with an intercept to sum to exactly zero, so we can confidently move on to checking the other assumptions.

  2. Constant variance assumption of potential error terms.

For this assumption, we need to rule out heteroscedasticity: the potential error terms should not depend on any of the \({x_i}\)’s. We check this by using the simple plot() function to plot distance against each quantitative variable. We do not need to plot condition because it is a qualitative variable.

plot(Dist ~ Velocity, xlab = "Velocity (fps)", ylab ="Distance (ft)",
                        main="Distance and Velocity")

plot(Dist ~ Angle, xlab = "Angle (degrees)", ylab = "Distance (ft)",
                        main="Distance and Angle")

plot(Dist ~ BallWt, xlab = "Ball Weight (g)", ylab ="Distance (ft)",
                        main="Distance and Ball Weight")

plot(Dist ~ BallDia, xlab = "Ball Diameter (in)", ylab ="Distance (ft)",
                        main="Distance and Ball Diameter")

None of these graphs are severely nonlinear, but there is slight heteroscedasticity in the variable ball weight. We will discuss how to deal with this later in this R guide.
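As a complementary check (a minimal sketch, not part of the original output), the residuals can also be plotted against the fitted values; a roughly constant vertical spread with no pattern supports the constant variance assumption.

plot(fitted(mymod), resid(mymod), xlab = "Fitted Distance (ft)", ylab = "Residual (ft)",
                        main = "Residuals vs Fitted Values")
abline(h = 0, lty = 2)   # horizontal reference line at zero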

  3. Normality of all potential error terms.

To check for normality of the error terms, we make a normal Q-Q plot of the residuals (extracted with mymod$residuals) using the qqnorm() command. We also add a reference line with qqline(). If the residuals are normally distributed, the points will follow this line fairly closely.

resids <- mymod$residuals
qqnorm(resids, main="Normal Q-Q Plot of Residuals from mymod")
qqline(resids)

The points show some curvature and do not fall perfectly along the Q-Q line. We hypothesize that a quadratic term may improve the normality of the error terms, which we will test later on.

  4. Independence

Since each data point comes from a separate throw, one observation should not directly impact the others. Therefore, the independence assumption is reasonable.
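If a numerical check is desired, and assuming the observations are stored in a meaningful (e.g., chronological) order, the Durbin-Watson test from the car package (already loaded as a dependency of alr3) can test for serial correlation in the residuals; this is a supplementary sketch rather than part of the original analysis.

durbinWatsonTest(mymod)   # tests for first-order autocorrelation in the residuals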

Mean Square Error (MSE)

The mean squared error (MSE) of an estimator measures the average of the squared errors, that is, the squared differences between the observed and fitted values. We can use the MSE to estimate \({\sigma}^2\). The linear model summary reports the ‘Residual standard error’ (9.284 for this model), which estimates \(\sigma\); squaring it gives an MSE of roughly 86.2. We hope to find a low MSE, which indicates a more accurate model. We will examine ways to decide whether one model is more accurate than another later in this guide.
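As a quick check (a minimal sketch, not part of the original output), the MSE can be recovered from the fitted model in two equivalent ways:

sum(resid(mymod)^2) / df.residual(mymod)   # residual sum of squares divided by residual degrees of freedom
summary(mymod)$sigma^2                     # the residual standard error, squared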

Choosing a Model

Multiple linear regression is a wonderful analysis tool to use when multiple predictors seem to have an effect on the response; however, it is easy to add many predictors that may or may not actually improve the predictive value of a model. We need to be mindful of which predictors we choose and act logically when building a model. Additionally, there are other tools available to us, including transforming the data, using polynomial regression, and adding an interaction term. Therefore, when choosing a model, we will scrutinize the number of predictors as well as the form of the model itself.

Transformation

Data sets may violate certain assumptions, as we have seen above in our own data set. Using a transformation on one of the variables may help us fit a multiple linear regression that better describes the response variable. The three transformations that are typically used are the log, square root, and inverse functions. In our data set, we noted slight heteroscedasticity involving ball weight, which may be reduced by using a log transformation. We will try this method below to see if the transformed data better predict the distance the ball travels. We transform the predictor ball weight and the response using log(), then define a new linear model using the log-transformed variables.

logBW <- log(BallWt)
logDist <- log(Dist)
plot(logDist ~ logBW, xlab = "Log Transformed Ball Weight (g)", ylab = "Log Transformed Distance (ft)",
      main = "Log Transformed Variables")

logmod <- lm(logDist ~ Velocity + Angle + logBW + BallDia + Cond, data = domedata1)
summary(logmod)
## 
## Call:
## lm(formula = logDist ~ Velocity + Angle + logBW + BallDia + Cond, 
##     data = domedata1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.054817 -0.018145 -0.004164  0.018301  0.075097 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.2346361  2.9677490   2.101   0.0385 *  
## Velocity      0.0078637  0.0002728  28.821  < 2e-16 ***
## Angle        -0.0050798  0.0006113  -8.310 9.35e-13 ***
## logBW        -0.2933013  0.6016646  -0.487   0.6271    
## BallDia       0.0421327  0.1635570   0.258   0.7973    
## CondTailwind  0.0048242  0.0058525   0.824   0.4119    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02826 on 90 degrees of freedom
## Multiple R-squared:  0.9152, Adjusted R-squared:  0.9105 
## F-statistic: 194.4 on 5 and 90 DF,  p-value: < 2.2e-16

While this does improve the spread of the data in the plot, it does not greatly change the overall picture. It also does not improve the p-values for ball diameter or condition. The p-value of the (log-transformed) ball weight predictor does improve, but ball weight is still not statistically significant, and the residual standard error cannot be compared directly with the original model because the response is now on the log scale. In our opinion, the mild improvement from this transformation makes the interpretation of the model considerably more complicated, especially for readers not trained in statistics. Since it does not make the predictor significant, we choose to return to the original model.

Interaction

Interaction terms allow us to capture additional structure without introducing new variables. If we believe two predictors may work together, we can alter the lm() command to include an interaction (replacing the + with a *). For our example, we include an interaction between condition and angle. We hypothesize this because a ball thrown at a given angle into a headwind would be hindered in how far it can travel, while the same angle paired with a tailwind should let the ball go further because it is not fighting the direction of the wind.

mymod1 <- lm(Dist ~ Velocity + Angle + BallWt + BallDia + Angle * Cond, data = domedata1)
summary(mymod1)
## 
## Call:
## lm(formula = Dist ~ Velocity + Angle + BallWt + BallDia + Angle * 
##     Cond, data = domedata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.089  -6.279  -1.158   6.193  24.515 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -22.59429  238.58764  -0.095    0.925    
## Velocity             2.54989    0.08976  28.406  < 2e-16 ***
## Angle               -1.67992    0.26262  -6.397 7.19e-09 ***
## BallWt              -0.29199    1.41532  -0.206    0.837    
## BallDia             37.25788   55.92392   0.666    0.507    
## CondTailwind       -12.19670   15.78163  -0.773    0.442    
## Angle:CondTailwind   0.30662    0.34334   0.893    0.374    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.295 on 89 degrees of freedom
## Multiple R-squared:  0.914,  Adjusted R-squared:  0.9082 
## F-statistic: 157.7 on 6 and 89 DF,  p-value: < 2.2e-16

This interaction did not turn out to be significant (as we can see from the high p value of .374), so we will return to our original model.

Quadratic Term

Additionally, we could alter the base model using quadratic terms. As we noted when checking the normality assumption, the residuals do not follow the Q-Q line as closely as we would hope for a linear regression model. A quadratic term is constructed by creating a new variable defined as an existing variable squared. These terms are then added to the model, and we check their significance using the p-values.
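Below, the squared terms are created as separate variables (we will also reuse these variables during model selection later). As a side note, an equivalent model can be specified directly in the formula using I(); a minimal sketch:

# Equivalent specification: write the squared terms directly in the formula with I()
quadmod_alt <- lm(Dist ~ Velocity + I(Velocity^2) + Angle + I(Angle^2) +
                    BallWt + I(BallWt^2) + BallDia + I(BallDia^2) + Cond,
                  data = domedata1)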

qvelocity <- Velocity^2
qangle <- Angle ^2
qbw <- BallWt^2
qbd <- BallDia ^2
quadmod <- lm(Dist ~ Velocity + qvelocity + Angle + qangle + BallWt + qbw + BallDia + qbd + Cond, data = domedata1)
summary(quadmod)
## 
## Call:
## lm(formula = Dist ~ Velocity + qvelocity + Angle + qangle + BallWt + 
##     qbw + BallDia + qbd + Cond, data = domedata1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.4271  -4.2400  -0.7508   5.0688  16.8316 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.272e+04  2.404e+04  -1.777  0.07907 .  
## Velocity      5.901e+00  2.414e+00   2.444  0.01656 *  
## qvelocity    -1.152e-02  8.536e-03  -1.349  0.18083    
## Angle         1.480e+01  3.510e+00   4.217 6.11e-05 ***
## qangle       -1.865e-01  4.088e-02  -4.563 1.67e-05 ***
## BallWt        1.118e+03  3.799e+02   2.944  0.00417 ** 
## qbw          -3.942e+00  1.339e+00  -2.945  0.00416 ** 
## BallDia      -2.610e+04  1.763e+04  -1.480  0.14244    
## qbd           4.586e+03  3.098e+03   1.480  0.14243    
## CondTailwind  6.283e-01  1.653e+00   0.380  0.70490    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.886 on 86 degrees of freedom
## Multiple R-squared:  0.9402, Adjusted R-squared:  0.9339 
## F-statistic: 150.2 on 9 and 86 DF,  p-value: < 2.2e-16

The addition of the quadratic terms is beneficial: the multiple \(R^2\) rises from 0.9133 to 0.9402, and at a significance level of .05 the quadratic terms for angle and ball weight are significant. Since they add statistical significance to the model, we will consider keeping the quadratic terms for angle and ball weight. We can use an ANOVA to compare the two models and test whether they are significantly different. This Analysis of Variance (ANOVA) comparison determines which of the two nested models should be used. The null hypothesis we test is that the difference between the reduced and full model is negligible, and we should drop the extra predictors. We are hoping for a high F-statistic and a low p-value, which would indicate that the complex model is an improvement over the reduced model.

anova(mymod, quadmod)
## Analysis of Variance Table
## 
## Model 1: Dist ~ Velocity + Angle + BallWt + BallDia + Cond
## Model 2: Dist ~ Velocity + qvelocity + Angle + qangle + BallWt + qbw + 
##     BallDia + qbd + Cond
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     90 7758.1                                  
## 2     86 5348.6  4    2409.5 9.6856 1.629e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At a significance level of .05, we can say that the more complex model is a better model because the p-value is \(1.629 \times 10^{-6}\). However, since the quadratic model’s summary shows that not all the quadratic terms are statistically significant, it is important to test which quadratic terms should be included.

Model Building

There are three common approaches to model building. The two we will demonstrate use the stepAIC() command, which automatically fits and compares models with different combinations of predictors.

  1. Forward Selection is the method by which a model is built with a single significant predictor and more predictors from our data set are added and tested for significance.

First, we load the MASS library since it contains the function we would like to use. Then we define a starting model with distance based on a single predictor. Lastly, we use the stepAIC() command with direction = "forward". R also needs to know which predictors it may add, so we supply the upper and lower model formulas in the scope argument. The lower bound Dist ~ 1 allows the procedure to fall back to an intercept-only model.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
mod2 <- lm(Dist ~ Velocity, data = domedata1)
stepAIC(mod2, direction = "forward", scope = list(upper = Dist ~ Velocity + Angle + qangle + qbw + BallWt + BallDia + Cond, lower = Dist~1))
## Start:  AIC=477.63
## Dist ~ Velocity
## 
##           Df Sum of Sq     RSS    AIC
## + qangle   1    5778.6  7554.4 425.09
## + Angle    1    5478.2  7854.8 428.84
## <none>                 13333.1 477.63
## + BallWt   1     217.2 13115.9 478.05
## + qbw      1     212.7 13120.4 478.09
## + BallDia  1     199.3 13133.8 478.18
## + Cond     1     175.9 13157.2 478.36
## 
## Step:  AIC=425.09
## Dist ~ Velocity + qangle
## 
##           Df Sum of Sq    RSS    AIC
## + Angle    1   1507.11 6047.3 405.73
## <none>                 7554.4 425.09
## + Cond     1     72.57 7481.9 426.17
## + BallDia  1     19.45 7535.0 426.84
## + qbw      1      3.88 7550.6 427.04
## + BallWt   1      3.45 7551.0 427.05
## 
## Step:  AIC=405.73
## Dist ~ Velocity + qangle + Angle
## 
##           Df Sum of Sq    RSS    AIC
## <none>                 6047.3 405.73
## + BallDia  1    85.751 5961.6 406.36
## + Cond     1    28.066 6019.3 407.28
## + qbw      1     2.503 6044.8 407.69
## + BallWt   1     2.223 6045.1 407.70
## 
## Call:
## lm(formula = Dist ~ Velocity + qangle + Angle, data = domedata1)
## 
## Coefficients:
## (Intercept)     Velocity       qangle        Angle  
##   -358.0055       2.6467      -0.2094      16.5067

The output R provides from this forward selection is the linear model with predictors velocity, angle, and the quadratic angle term. This result can be seen underneath the final Call:, where the selected linear model is shown. We can compare this to the output of backward elimination next.

  2. Backward Elimination is the method by which we start with all the predictors available to us and drop them one at a time. This is done by calling stepAIC() on the full linear model (here, the original model mymod) and setting the direction to "backward".

stepAIC(mymod, direction = "backward")
## Start:  AIC=433.65
## Dist ~ Velocity + Angle + BallWt + BallDia + Cond
## 
##            Df Sum of Sq   RSS    AIC
## - BallWt    1         0  7758 431.65
## - BallDia   1        17  7775 431.85
## - Cond      1        75  7833 432.57
## <none>                   7758 433.65
## - Angle     1      4996 12754 479.37
## - Velocity  1     69660 77418 652.49
## 
## Step:  AIC=431.65
## Dist ~ Velocity + Angle + BallDia + Cond
## 
##            Df Sum of Sq   RSS    AIC
## - BallDia   1        16  7775 429.85
## - Cond      1        78  7837 430.62
## <none>                   7758 431.65
## - Angle     1      5211 12969 478.97
## - Velocity  1     73431 81189 655.06
## 
## Step:  AIC=429.85
## Dist ~ Velocity + Angle + Cond
## 
##            Df Sum of Sq   RSS    AIC
## - Cond      1        80  7855 428.84
## <none>                   7775 429.85
## - Angle     1      5382 13157 478.36
## - Velocity  1     73831 81606 653.55
## 
## Step:  AIC=428.84
## Dist ~ Velocity + Angle
## 
##            Df Sum of Sq   RSS    AIC
## <none>                   7855 428.84
## - Angle     1      5478 13333 477.63
## - Velocity  1     74045 81900 651.90
## 
## Call:
## lm(formula = Dist ~ Velocity + Angle, data = domedata1)
## 
## Coefficients:
## (Intercept)     Velocity        Angle  
##      37.129        2.549       -1.548

Backward elimination produces a final model with velocity and angle as the only predictors of distance.

  3. All Subsets is the method of fitting every possible combination of the candidate predictors, comparing them with a criterion such as AIC, and selecting the best one. We do not carry this method out in full, since the other two methods gave us reliable models that we feel confident moving forward with; a brief sketch using the leaps package is shown below.
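
Although we do not pursue all subsets here, a minimal sketch is shown below using the leaps package (an assumption on our part; it is not loaded elsewhere in this guide). Its regsubsets() function performs an exhaustive search and reports criteria such as Mallows' Cp and BIC rather than AIC, but the idea is the same.

library(leaps)   # assumed to be installed; not used elsewhere in this guide
allsub <- regsubsets(Dist ~ Velocity + Angle + qangle + qbw + BallWt + BallDia + Cond,
                     data = domedata1, nvmax = 7)
allsub_sum <- summary(allsub)
allsub_sum$which[which.min(allsub_sum$cp), ]   # terms in the model with the smallest Mallows' Cp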

Since the backward elimination above did not include the quadratic angle term while forward selection did, we must decide whether to keep this term. The two procedures can give slightly different models because the intercorrelation among the regressors affects the order in which terms are entered and removed. We will define these two models as backmod and forwardmod, and then perform an ANOVA to help decide between the more complex model and the reduced model. Again, our null hypothesis is that the two models are not statistically different (and therefore we should use the reduced model).

backmod <- lm(Dist ~ Velocity + Angle, data = domedata1)
forwardmod <- lm(Dist ~ Velocity + Angle + qangle, data = domedata1)
anova(backmod, forwardmod)
## Analysis of Variance Table
## 
## Model 1: Dist ~ Velocity + Angle
## Model 2: Dist ~ Velocity + Angle + qangle
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     93 7854.8                                  
## 2     92 6047.3  1    1807.5 27.498 9.981e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA shows that these models are statistically different, as indicated by the large F-statistic and small p-value. Because of this, we feel it is important to keep the quadratic term in our model. We will define this multiple linear regression model as finalmod and finish the guide with this model in mind.

finalmod <- lm(formula = Dist ~ Velocity + Angle + qangle, data = domedata1)
summary(finalmod)
## 
## Call:
## lm(formula = Dist ~ Velocity + Angle + qangle, data = domedata1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.6668  -6.2650  -0.5578   5.8425  17.7035 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -358.00551   75.94251  -4.714 8.60e-06 ***
## Velocity       2.64671    0.07820  33.846  < 2e-16 ***
## Angle         16.50673    3.44727   4.788 6.41e-06 ***
## qangle        -0.20941    0.03993  -5.244 9.98e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.108 on 92 degrees of freedom
## Multiple R-squared:  0.9324, Adjusted R-squared:  0.9302 
## F-statistic: 422.9 on 3 and 92 DF,  p-value: < 2.2e-16

\[\hat{distance_i} = -358.00551 + 2.64671(Velocity) + 16.50673(Angle) - 0.20941({Angle}^2)\] We can interpret this by allowing one predictor to vary while holding the others constant; for example, for every one foot per second increase in velocity, the ball is predicted to travel 2.64671 feet further, holding angle constant. The effect of angle is more subtle because angle enters the model both linearly and squared: the change in predicted distance per additional degree is \(16.50673 - 2(0.20941)(Angle)\), which depends on the angle itself. Predicted distance increases with angle for smaller angles and begins to decrease once the angle exceeds roughly \(16.50673/(2 \times 0.20941) \approx 39.4\) degrees, so the gain from increasing the angle is not a fixed number of feet.
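To make this concrete, the short sketch below (using hypothetical values: velocity held at 145 feet per second, angles of 40 and 50 degrees) predicts distance at two angles with predict(); the difference shows that a 10-degree increase does not change the prediction by a fixed amount.

illus <- data.frame(Velocity = 145, Angle = c(40, 50), qangle = c(40, 50)^2)
predict(finalmod, illus)         # predicted distances at 40 and 50 degrees
diff(predict(finalmod, illus))   # change in predicted distance from this particular 10-degree increase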

Model Comparison Tests

As we have seen above, model selection removes predictors that do not contribute much to the model, and it is important to be able to test for and identify these less helpful predictors. Ideally, such tests will confirm the reduced models we identified above. The methods of model comparison include adjusted \(R^2\), F-tests, t-tests, Mallows' Cp, and AIC. Each comparison method may point to different conclusions about the predictors, so it is important to understand each of them.

Multiple Coefficient of Determination

\(R^2\) is the proportion of variability in the response that is explained by its linear relationship with the predictors. It never decreases when more predictors are added. Because of this, statisticians adjusted the statistic to discourage adding unnecessary predictors: adjusted \(R^2\) penalizes a multiple linear regression model for having more predictors. For the interpretation, we turn back to our summary output.
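The adjustment uses the sample size \(n\) and the number of predictors \(p\): \[R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\] As a check, for the final model (\(n = 96\), \(p = 3\), \(R^2 = 0.9324\)) this gives \(1 - (0.0676)(95)/92 \approx 0.9302\), matching the adjusted \(R^2\) in the summary.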

For the final model, our multiple \(R^2\) is 0.9324 and our adjusted \(R^2\) is 0.9302. This means that 93.24% (or 93.02% after the penalty for the number of predictors) of the variability in distance can be described by the predictors velocity, angle, and angle\(^2\) in our final model. If we look back at our first model with all predictors from the data set included, the adjusted \(R^2\) was .9084, so the new model explains more of the variability in distance than the original model with all predictors.

This statistic is not the most helpful model comparison test. Although adjusted \(R^2\) does penalize a model for having more predictors, the penalty is only slight. The other model comparison methods are more effective at balancing model complexity against the usefulness of the predictors.

F-test

F-tests are used to answer the question “does distance have a linear relationship with any of the predictors?” In our model, we hypothesized that distance should have a linear relationship with the predictors. However, if we had multiple models, we could use the p-value as an indication as to which model has a stronger linear relationship between the predictors used and distance. This statistic is found in our linear model summary output.

The p-value for the full model is less than \(2.2 \times 10^{-16}\). This small p-value indicates a statistically significant linear relationship between distance and at least one of our five predictors. P-values from this test tend to be smaller for models that predict the response well, and adding predictors related to the response will generally only strengthen the overall relationship. Therefore, it is important to use other statistics to verify the worth of having many predictors.
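For reference, the overall F-statistic reported in the summary can be written in terms of \(R^2\): \[F = \frac{R^2/p}{(1-R^2)/(n-p-1)}\] For the full model, \((0.9133/5)/((1-0.9133)/90) \approx 190\), which matches the reported 189.5 up to rounding of \(R^2\).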

T-test

T-tests allow us to decide whether we can drop one of the predictors in our model, because they evaluate whether an individual predictor adds significance to the model. For measuring the worth of the i-th predictor, our null hypothesis is that the predictor does not have a linear relationship with the response (\({\beta_i}\) = 0), and we test it against the hypothesis that a linear relationship does exist (\({\beta_i}\) \(\neq\) 0). We are hoping for small p-values, which let us reject the null hypothesis and conclude that the predictor does add significance to the model.
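Each reported t value is the estimate divided by its standard error, and it is compared against a t distribution with \(n-p-1\) degrees of freedom: \[t_i = \frac{\hat{\beta_i}}{SE(\hat{\beta_i})}\] For example, for velocity in the full model, \(2.54854/0.08965 \approx 28.43\), matching the summary output.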

Again, we go back to the summary to see each of the p values for the significance of an independent variable. We have listed the p-value for each predictor below:

Velocity: \(< 2 \times 10^{-16}\)

Angle: \(2.54 \times 10^{-11}\)

Ball Weight: 0.947

Ball Diameter: 0.662

Cond: 0.354

At the .05 significance level, velocity and angle are statistically significant predictors of the distance the ball will travel. However, we cannot say that ball weight, ball diameter, or condition are significant for predicting the distance. Our previous methods gave us the same result, which should inspire our confidence to continue!

Partial F-test

Partial F-tests allow us to test whether we can drop multiple predictors in our model, testing the significance of combinations of these independent variables. We can test a portion of the model, which is referred to as the “reduced” model. The original model with all predictors included is called the “complete” or “full” model.
We will perform another anova(), this time testing the full quadratic model against our final model.
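For reference, the partial F-statistic compares the residual sums of squares (RSS) of the reduced and full models, where \(q\) is the number of predictors being dropped: \[F = \frac{(RSS_{reduced} - RSS_{full})/q}{RSS_{full}/(n - p_{full} - 1)}\] The anova() command carries out this calculation for us; plugging in the values from the output below, \((698.75/6)/(5348.6/86) \approx 1.87\).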

finalmod <- lm(formula = Dist ~ Velocity + Angle + qangle, data = domedata1)
anova(quadmod, finalmod)
## Analysis of Variance Table
## 
## Model 1: Dist ~ Velocity + qvelocity + Angle + qangle + BallWt + qbw + 
##     BallDia + qbd + Cond
## Model 2: Dist ~ Velocity + Angle + qangle
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     86 5348.6                              
## 2     92 6047.3 -6   -698.75 1.8725 0.09473 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At a significance level of .05, we do not find a statistically significant difference between the full quadratic model and the final model. Therefore, we will continue with the reduced model, finalmod.

Further Analysis

It is important to consider other possible issues with our data set, such as correlation among the predictors and possible outliers. With our final model, we can also produce point estimates, confidence intervals, and prediction intervals for interpretation.

Multicollinearity

When using multiple predictors, it is important to check the assumption that the independent variables are not strongly correlated with one another. When predictors are highly correlated, the standard errors are inflated and the test statistics are deflated, which makes it harder to reject any null hypothesis about these predictors.
To check this assumption, we examine plots of the predictors against each other.

plot(Velocity ~ Angle, xlab = "Angle (degrees)", ylab ="Velocity (feet per second)",
                        main="Multicollinearity Check")

plot(Velocity ~ qangle, xlab = "Angle Squared (degrees^2)", ylab ="Velocity (feet per second)",
                        main="Multicollinearity Check")

Looking at these plots, we do not see any obvious signs of multicollinearity, because neither plot shows a strong linear relationship between the predictors. We will verify this numerically using the vif() function from the car package. The variance inflation factor (VIF) quantifies the severity of multicollinearity in an ordinary least squares regression analysis: it measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is inflated because of collinearity. We are looking for values less than 10, which indicate that collinearity is not a serious problem.
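In formula form, the VIF for the \(j\)-th predictor is \[VIF_j = \frac{1}{1 - R_j^2}\] where \(R_j^2\) is the \(R^2\) obtained from regressing the \(j\)-th predictor on all of the other predictors.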

vif(mymod)
## Velocity    Angle   BallWt  BallDia     Cond 
## 1.500821 1.510233 1.290144 1.076135 1.029261

Since these numbers are less than 10, we do not need to worry about multicollinearity within our predictors.

Confidence Intervals for Regression Coefficients

We can find 95% confidence intervals for the true parameters \({\beta_0}\), \({\beta_1}\), … , \({\beta_n}\) using the confint(model_name) command. In this example, we find confidence intervals for the intercept and the coefficients of velocity, angle, and angle\(^2\).
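Each interval has the familiar form of the estimate plus or minus a multiple of its standard error: \[\hat{\beta_i} \pm t_{\alpha/2,\, n-p-1} \cdot SE(\hat{\beta_i})\] For velocity, \(2.64671 \pm t_{0.025,\,92}(0.07820) \approx 2.64671 \pm 1.986(0.0782)\), which reproduces the interval shown below.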

confint(finalmod)
##                    2.5 %       97.5 %
## (Intercept) -508.8338930 -207.1771229
## Velocity       2.4913984    2.8020182
## Angle          9.6601456   23.3533139
## qangle        -0.2887197   -0.1300961

We are 95% confident that the true coefficient on velocity lies between 2.4913984 and 2.8020182; in other words, each additional foot per second of velocity is associated with roughly 2.49 to 2.80 additional feet of distance, holding angle constant. A similar interpretation can be drawn for angle.

Point Estimate

One helpful outcome of multiple linear regression is the ability to predict the response (distance) for a new observation. For example, Tommie the Tomcat wants to know how far the ball will travel in feet if he were to recreate this experiment. If he throws with a velocity of 155 feet per second at a 50 degree angle, we can estimate the distance using our multiple linear regression equation. Such a point prediction is only trustworthy when the chosen predictor values fall within the range of the data used to fit the model, as these do. The multiple linear regression equation is written below with the values of angle and velocity for Tommie’s predicted throw. \[\hat{distance_i} = -358.00551 + 2.64671(155) + 16.50673(50) - 0.20941(2500)\] In this case, we estimate that the distance of Tommie’s throw will be around 354.051 feet.

Confidence and Prediction Intervals

Next, we construct confidence and prediction intervals. As we have seen above, we can use multiple linear regression to produce a point estimate for a given set of predictor values. We will continue with the example of Tommie’s throw. Tommie may not need to know a specific landing point, but he may want a range for the distance the ball will travel. Prediction intervals provide such a range, with a given level of confidence, for a single new observation (the default level is 95%). Confidence intervals are similarly a range with a confidence level; however, they are defined for the true mean response at the given predictor values, so they are narrower than prediction intervals for an individual observation (which has more variability than the mean). We discussed this difference in the simple linear regression R guide.

After defining the linear model, we create a new data frame (newdata) with the chosen values of velocity, angle, and the squared angle. Then we use the command predict(model_name, newdata) and specify either a confidence or a prediction interval:

predictmod <- lm(Dist ~ Velocity + Angle + qangle, data = domedata1)
newdata <- data.frame(Velocity = 155, Angle = 50, qangle = 2500)
(predy <- predict(predictmod, newdata, interval="prediction") )
##       fit      lwr      upr
## 1 354.051 337.7958 370.3061
(confy <- predict(predictmod, newdata, interval="confidence") )
##       fit      lwr      upr
## 1 354.051 351.8264 356.2755

Given that the velocity is 155 feet per second and the angle is 50 degrees, we are 95% confident that Tommie’s individual trial of this experiment will result in a distance between 337.7958 and 370.3061 feet. We are also 95% confident that the true mean distance for throws with these predictor values lies between 351.8264 and 356.2755 feet. It is important to note that the “fit” in the output above is the same for both intervals because both intervals are centered on the same point estimate.

Outliers

In statistics, an outlier is an observation that is distant from the other observations. Outliers can be detected visually by applying the plot() command to the fitted model, which produces a set of diagnostic plots.

plot(finalmod)

On these diagnostic plots, R labels a few points: observations 55, 76, and 83. These points add some curvature to the Scale-Location graph and the Residuals vs. Fitted graph. On the Residuals vs. Leverage graph, the Cook’s distance contours do not even appear within the plotting region, which tells us that no points are influential enough to be concerning. This suggests that we do not have any serious outliers.
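For a numerical follow-up (a minimal sketch, not part of the original output), we can inspect the Cook’s distances directly; a common rule of thumb flags points whose distance exceeds \(4/n\).

cd <- cooks.distance(finalmod)
which(cd > 4 / nrow(domedata1))   # observations exceeding the common 4/n cutoff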

While the easiest solution would be to remove outliers, this is not always the best idea. Such points represent real events that may improve our understanding of the range of possible outcomes. If we decide a point is an outlier, we should investigate whether it was recorded incorrectly or why it is unusual (maybe the person throwing the ball did arm day the day before and was cramping up). Additionally, we should consider whether there is a predictor we could add that would explain this and possibly other outliers. We typically leave outlying points in our data, although we can decide to weight them differently if there is statistical reasoning to support doing so.

Conclusion

When performing multiple linear regression, it is important to balance the complexity of a model (the number of predictors used) against the accuracy of its predictions. Simplicity is often preferred to a small increase in accuracy. It is best to be mindful of the tools available in R for fitting a model (quadratic terms, transformations, interactions, etc.), but also to be aware of the audience of a project. If your audience does not have a background in statistical analysis, they will not easily interpret a transformed regression model, and it is best to keep things simple. However, if the project is intended for statisticians, a more complicated model can be advantageous if it results in a stronger fit.