Introduction

This R guide will focus on multiple linear regression. For multiple linear regression, we will have at least three variables in our models: one response and two or more predictors. The response variable will be quantitative, but the predictor variables can be quantitative or categorical.

First, we’ll investigate what the data set looks like and explore how we can visualize uncomplicated multiple linear regression equations. Next, we will demonstrate how R can assist us with model building, though we will be manually building different models throughout this R guide. Then, we’ll dig deeper into the process of creating a multiple linear regression model and how to interpret it. We will build a model using both quantitative and categorical variables, discussing how categorical variables are handled in a multiple linear regression setting. Additionally, we will interpret the coefficients of our variables and determine whether each of our predictor variables has a significant relationship with our response variable. We will also discuss the use and interpretation of interaction terms, F-tests and partial F-tests, and make predictions using our regression equation. Finally, we will demonstrate why and how to include polynomial terms or transformations in a model and how to interpret the resulting output.

Exploring the Data

Understanding the Data Set

The first data set we will use for this R Guide is \(\texttt{CollegeDistance}\) from the R package AER. This is a cross-sectional data set from a survey conducted by the Department of Education in 1980, with a follow-up in 1986, containing 14 variables relating to the characteristics of the students surveyed for this data set, their families and the area in which they live. We can see the names of all of the variables in the data set by using the command \(\texttt{names(dataset)}\) and see how the observations for each of these variables look by using the command \(\texttt{head(dataset)}\).

library(AER)
data(CollegeDistance)
names(CollegeDistance)
##  [1] "gender"    "ethnicity" "score"     "fcollege"  "mcollege" 
##  [6] "home"      "urban"     "unemp"     "wage"      "distance" 
## [11] "tuition"   "education" "income"    "region"
head(CollegeDistance)
##   gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1   male     other 39.15      yes       no  yes   yes   6.2 8.09      0.2
## 2 female     other 48.87       no       no  yes   yes   6.2 8.09      0.2
## 3   male     other 48.74       no       no  yes   yes   6.2 8.09      0.2
## 4   male      afam 40.40       no       no  yes   yes   6.2 8.09      0.2
## 5 female     other 40.48       no       no   no   yes   5.6 8.09      0.4
## 6   male     other 54.71       no       no  yes   yes   5.6 8.09      0.4
##   tuition education income region
## 1 0.88915        12   high  other
## 2 0.88915        12    low  other
## 3 0.88915        12    low  other
## 4 0.88915        12    low  other
## 5 0.88915        13    low  other
## 6 0.88915        12    low  other

Though the data set contains 14 different variables, in this R Guide we will only use five of them in any of the models we create: \(\texttt{score}\), the achievement test score obtained during the student’s senior year of high school; \(\texttt{urban}\), whether the student’s high school is located in an urban area; \(\texttt{distance}\), the distance the student lives from a 4-year college (in 10’s of miles); \(\texttt{tuition}\), the average 4-year college tuition in the student’s state (in 1000’s of dollars); and \(\texttt{education}\), the number of years of education attained 6 years after high school graduation. Of these chosen variables, \(\texttt{score}\), \(\texttt{distance}\), \(\texttt{tuition}\) and \(\texttt{education}\) are quantitative and \(\texttt{urban}\) is categorical.

Checking Multicollinearity

One of the very first things we want to do before building any model is check for multicollinearity between our quantitative predictor variables. Multicollinearity occurs when two or more of the quantitative predictors in a multiple linear regression model are highly correlated with one another. Multicollinearity is a problem because it inflates the standard errors of the coefficient estimates and makes the individual parameter estimates unstable and difficult to interpret.

So, to look at multicollinearity, let’s first look at the pairwise scatterplots for each of our quantitative predictor variables.

Pairwise Scatterplots

We want to look at pairwise scatterplots of our predictor variables so that we can determine visually whether any pairs of predictors appear highly correlated. To do this, we can pass a data frame containing the variables of interest to the \(\texttt{plot()}\) command, which produces a scatterplot matrix. The simplest way is to select the relevant columns of our existing data frame like this:

plot(CollegeDistance[c(3,10, 11)])

Looking at our plot, it does not appear that any of our quantitative predictor variables are highly correlated, or have a strong linear relationship with one another.

Correlation Matrices

Just to be certain there is no strong correlation, let’s calculate the correlation matrix of the three variables using \(\texttt{cor()}\) which takes the variables of interest as its arguments, similar to \(\texttt{plot()}\).

cor(CollegeDistance[c(3,10, 11)])
##                score    distance    tuition
## score     1.00000000 -0.06797927  0.1298585
## distance -0.06797927  1.00000000 -0.1009806
## tuition   0.12985848 -0.10098058  1.0000000

We see that the pairwise correlations between our quantitative predictor variables are very low. We only consider multicollinearity to be a problem when the absolute value of the correlation is greater than .9, so we would say that there is no multicollinearity issue in this model.

Variance Inflation Factor

There is a third way to detect multicollinearity in a model. We consider the variance inflation factor (VIF) of a variable by first making a model using that variable as the response and all other predictor variables in the original model as the predictors. For example, to calculate the VIF of achievement test score we need to create this linear model:

score <- lm(score ~ distance + tuition, data = CollegeDistance)

Then, we need to plug the R-squared value of that model into our VIF equation, \(VIF = \dfrac{1}{1-R^2}\).

1/(1-summary(score)$r.squared)
## [1] 1.020309

So the VIF of achievement test score for this model is around 1.02. This is great news since we only consider multicollinearity to be an issue when \(VIF>10\). We could then continue checking VIF by following the same process for our \(\texttt{distance}\) and \(\texttt{tuition}\) variables.
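A sketch of that process, using model names of my own choosing, looks like this: fit each remaining quantitative predictor against the others and plug each R-squared value into the same VIF formula.

# VIF for distance: regress distance on the other quantitative predictors
distance.model <- lm(distance ~ score + tuition, data = CollegeDistance)
1/(1-summary(distance.model)$r.squared)
# VIF for tuition: regress tuition on the other quantitative predictors
tuition.model <- lm(tuition ~ score + distance, data = CollegeDistance)
1/(1-summary(tuition.model)$r.squared)

If the car package is available, its \(\texttt{vif()}\) function will also compute these values for every predictor of a fitted model at once.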

Visualizing the Data

Now that we know there is no evidence of multicollinearity between our predictor variables, we can move on to visualizing some of the data. For simple linear regression, we easily built a scatterplot for exploratory data analysis since we only had two variables; however, in many multiple linear regression situations, the variables we are using cannot be simultaneously represented two-dimensionally so exploring the data visually is far more difficult. Nevertheless, using an example, I will demonstrate that we can represent a fairly uncomplicated multiple linear regression situation two-dimensionally. For this, we will need a quantitative response variable and two predictor variables, one quantitative and one categorical.

Plotting

We can plot the student’s achievement test score, the quantitative response, against the number of years of education they have attained, the quantitative predictor, varying the shape of each plotted point by whether or not the student’s high school is located in an urban area, the categorical predictor; this satisfies our variable requirements. We’ll use the \(\texttt{plot()}\) command, discussed in my simple linear regression R guide (http://rpubs.com/bensonsyd/364877), to make the scatterplot, adding one new argument, \(\texttt{pch = variable}\), which varies the shape of the points by the value of the given variable. The variable must be numeric, so we convert the factor with \(\texttt{as.numeric()}\).

plot(CollegeDistance$education, CollegeDistance$score, xlab = "Years of Education", ylab = "Achievement Test Score", pch = as.numeric(CollegeDistance$urban))

The circles indicate that the student represented went to a high school in a non-urban area, and the triangles indicate that the student went to a high school in an urban area (\(\texttt{as.numeric()}\) converts the factor levels no and yes to the plotting characters 1 and 2, an open circle and an open triangle, respectively).
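To make this coding explicit on the plot itself, we could add a legend right after the \(\texttt{plot()}\) call. This is just a sketch; the labels and the \(\texttt{"bottomright"}\) position are choices of mine rather than part of the original plot.

legend("bottomright", legend = c("non-urban school", "urban school"), pch = c(1, 2))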

From this plot we can’t really see whether there is any difference between the two groups because there are a lot of overlapping data points. We’ll try to fix this by adding noise to, or slightly increasing the variation of, the scatterplot so that we can see any patterns in the data more clearly.

Adding Noise to the Scatterplot

We can add more variation to the scatterplot by using the command \(\texttt{jitter(variable, factor = number)}\) on our predictor and response variables, where \(\texttt{factor}\) is the argument specifying how much noise to add to the plot, and then using these altered variables as our new predictor and response like this:

newscore <- jitter(CollegeDistance$score, factor = 2)
newed <- jitter(CollegeDistance$education, factor = 2)
plot(newed, newscore, xlab = "Years of Education", ylab = "Achievement Test Score", pch = as.numeric(CollegeDistance$urban))

Using the \(\texttt{jitter()}\) command has unveiled some of the hidden data points. We can now more distinctly see two types of data points, circles and triangles. However, there doesn’t appear to be a distinct separation of the data points. This might lead us to believe that there is not a significant difference between the educational attainment of students at urban and non-urban schools, but we can confirm whether or not this is true using a test later on.

Multiple Linear Regression

Building a Model

The first aspect of multiple linear regression we’ll focus on in this R guide is different types of model building and the numerous criteria that can be used to determine which is the best model. First, we’ll need to discuss the criteria and how they differ.

Criteria

For Comparing all Kinds of Models

There are many types of criteria that we can use to determine whether one model of a data set is better than another model of that same data set. First, when comparing any two models we can use the Akaike Information Criterion (AIC) or Mallows’ \(C_p\). When looking at AIC, we want to weigh the number of parameters in a model and the AIC value together. For example, if we have a model with an AIC value of \(X\) and 4 parameters, in order to prefer a model with more than 4 parameters, we would want the AIC value for that model to be at least 10 units less than \(X\). Generally, we can think of lower AIC values as corresponding to better models. If we instead use Mallows’ \(C_p\), a smaller value means the model fits better, so we would choose the model with the lowest Mallows’ \(C_p\). One important note on these comparison tools is that their values have no meaning on their own, so they can only be used to compare models fit to the same data set, not models fit to different data sets.

For Comparing Nested Models

If we are comparing nested models, i.e. one model’s predictors are a subset of the other model’s predictors, we have additional comparison options. For comparing nested models, we can use the p-values from t-tests or partial F-tests, only choosing the more complex model if the p-value of the t-test or F-test is below .05. We can also use adjusted R-squared as a comparison tool; however, I would discourage anyone from using adjusted R-squared as their comparison criteria because the other methods of comparison are more rigorous.

How to Create the Models

Stepwise Selection

Now, on to the topic of actually building the model. First, we can discuss the idea of building a model in a stepwise fashion. There are two ways of creating models stepwise, either forward selection or backwards elimination. In forward selection, you start with the simplest desired model and add predictors to the model until your chosen criteria indicate that adding more predictors to the model would actually worsen the model’s abilities. In backwards elimination, you start with the most complex model acceptable and remove predictors from the model until your chosen criteria indicate that removing more predictors from the model would worsen the model’s abilities.

By their nature, these models will always be nested so any of the comparison methods previously mentioned can be used.

As an example, I’ll be illustrating the backwards elimination method to create a model that predicts educational attainment 6 years after graduation, using all or a subset of the variables discussed earlier and AIC as my model comparison criterion. We’ll need to use the command \(\texttt{stepAIC(starting.model, scope = list(upper = complex.model, lower = simple.model), direction = "backward")}\) in the R package MASS to do this. \(\texttt{starting.model}\) is the model that we will begin the stepwise selection with, \(\texttt{complex.model}\) is the most complex model we would want to have (in the case of backward elimination this is the same as \(\texttt{starting.model}\)), and \(\texttt{simple.model}\) is the simplest model we would want to obtain. I will specify the simplest model as one that uses only an intercept to model educational attainment 6 years after graduation.

library(MASS)
starting.model <- lm(education ~ score + urban + distance + tuition, data = CollegeDistance)
simple.model <- lm(education ~ 1, data = CollegeDistance)
stepAIC(starting.model, scope = list(upper = starting.model, lower = simple.model), direction = "backward")
## Start:  AIC=4339.19
## education ~ score + urban + distance + tuition
## 
##            Df Sum of Sq   RSS    AIC
## - urban     1       0.5 11815 4337.4
## <none>                  11815 4339.2
## - tuition   1      10.8 11826 4341.5
## - distance  1      53.3 11868 4358.5
## - score     1    3178.2 14993 5466.2
## 
## Step:  AIC=4337.39
## education ~ score + distance + tuition
## 
##            Df Sum of Sq   RSS    AIC
## <none>                  11815 4337.4
## - tuition   1        11 11826 4339.8
## - distance  1        62 11877 4360.2
## - score     1      3205 15020 5472.8
## 
## Call:
## lm(formula = education ~ score + distance + tuition, data = CollegeDistance)
## 
## Coefficients:
## (Intercept)        score     distance      tuition  
##     9.15680      0.09547     -0.05013     -0.14369

We can see from the output that our final model contains all of the chosen variables except whether or not a student’s high school is located in an urban area. One thing to note about this process is that \(\texttt{stepAIC}\) dropped \(\texttt{urban}\) even though the two models’ AIC values differ by less than 2 units: removing the variable lowers the AIC slightly, and when two models fit about equally well, the criterion favors the one with fewer predictors because of the balance between accuracy and complexity that AIC is designed to strike.

We could easily use a forward selection method by specifying \(\texttt{starting.model}\) as the simplest model we would deem acceptable and changing \(\texttt{direction}\) to \(\texttt{"forward"}\). However, forward selection might produce a different final model than backward elimination because the two procedures take different paths through the set of candidate models.
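As a sketch (output not shown), the forward selection call would start from the intercept-only model and allow terms to be added up to the full model:

# forward selection: begin with the intercept-only model
stepAIC(simple.model, scope = list(upper = starting.model, lower = simple.model), direction = "forward")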

All Subsets

Lastly, we have the all subsets method. This method requires calculating the AIC value, or Mallows’ \(C_p\), for every model that can be built from any subset of the variables being considered. Because the number of possible models grows quickly with the number of variables, this method is time consuming and used infrequently. I will note that these models will not all be nested, so we can only use AIC or Mallows’ \(C_p\) as comparison methods.
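For completeness, here is a sketch of how the all subsets approach could be carried out with the leaps package; that package is not used elsewhere in this guide and would need to be installed separately. \(\texttt{regsubsets()}\) fits every subset of the candidate predictors and \(\texttt{summary()}\) reports Mallows’ \(C_p\) for the best model of each size.

library(leaps)
# fit every subset of the four candidate predictors
all.subsets <- regsubsets(education ~ score + urban + distance + tuition, data = CollegeDistance)
# Mallows' Cp for the best model of each size
summary(all.subsets)$cp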

Besides the automated method of model building, using \(\texttt{stepAIC()}\) for stepwise selection, we can also simply create models manually. Although these models may not be the best possible models with respect to any one of the numerous criteria we discussed, they do work well for demonstration purposes. Thus, throughout the rest of this R guide, the models used for examples will be created manually.

Creating a New Model

We’ve looked at our data set, gone through a short example of how one might visualize a relatively simple multiple linear regression situation and discussed some sophisticated techniques of model building as well as how to compare models. Now, we can create a multiple linear regression model manually. For this model, I will use achievement test score, average tuition for a four-year college in the student’s state, distance from a four year college and whether or not the student’s high school was located in an urban area to predict the number of years of education the student had attained six years after graduation.

We can create the model similarly to how we created the simple linear regression model, only adding more predictors. We use \(\texttt{lm(response ~ predictor + predictor + ... + predictor, data = dataset)}\) to create our model.

schoolmod <- lm(education ~ score + tuition + distance + urban, data = CollegeDistance)

Checking Model Assumptions

At this point, we need to check the assumptions of our model to be sure that the model provides accurate results. In my R guide on simple linear regression, we created the diagnostic plots for our model individually, but there is a function in R that allows us to create all of these plots at once. We create the plots using the function \(\texttt{plot(model)}\).

plot(schoolmod)
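Note that \(\texttt{plot(model)}\) draws the four diagnostic plots one after another. As a small convenience (not required), we could set up a 2-by-2 plotting layout first so that all four appear in a single panel:

par(mfrow = c(2, 2))  # arrange the next four plots in a 2 x 2 grid
plot(schoolmod)
par(mfrow = c(1, 1))  # reset the layout afterwards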

The first plot is the residual plot, a comparison of the residuals of our model against the fitted values produced by our model, and is the most important plot because it can tell us about trends in our residuals, evidence of heteroskedasticity and possible outliers. The plot for this model indicates that our model is systematically underpredicting the lower values of educational attainment and systematically overpredicting the higher values of educational attainment. Although the residuals do not seem to be evenly spread around 0 for all fitted values, the range of the residuals at each fitted value appears to be roughly the same, so we can conclude there is no evidence of heteroskedasticity. Finally, this plot indicates that there are likely no outliers because there are no points on the plot well-separated from the rest.

The next plot is the QQ-plot. Though most of the points seem to fall on the line which indicates that our residuals come from a normal distribution, there are some points that stray from the line in the lower and upper quantiles of the plot. It is possible that these points do not come from a normal distribution, but most of our points seem to come from a normal distribution so there is not a lot to worry about here.

The third plot created is the scale-location plot. This plot is similar to the residual plot, but uses the square root of the absolute value of the standardized residuals instead of the residuals themselves. This makes trends in the residuals more evident and, from this plot, we can see that there is likely a U-shaped trend in our residuals.

Finally, we see the leverage plot. This plot graphs the standardized residuals against their leverage and includes the Cook’s distance boundaries. Any point outside of those boundaries would be an influential point, typically one with high leverage (an outlier in the x direction). Since we cannot even see the boundaries on our plot, we can conclude that we have no such points.

Outliers

So, although our data did not appear to have any outliers, it’s still important to know what to do if we do have outliers. Unfortunately, this is not an easy question to answer.

The Definition

First, we should define what an outlier is. Generally, we define an outlier as any point well-separated from the rest. More specifically, we can define an outlier as any point with a studentized residual greater than 2 in absolute value; this criterion identifies outliers in the y direction. A data point outside the Cook’s distance boundaries of a leverage plot, on the other hand, indicates an influential point, typically an outlier in the x direction.
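As a quick sketch of how this rule of thumb could be applied to the model we fit above, we can count how many observations have a studentized residual larger than 2 in absolute value:

# number of observations flagged by the studentized residual rule
sum(abs(rstudent(schoolmod)) > 2)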

What can we do?

So, what can we do if we find that we have an outlier?

There is no simple answer to this question. However, the first thing you’ll want to check is whether the data was recorded correctly. If you have access to the original data collection materials, check to see if the data was inputted incorrectly. If you do not have access to this information, but the recorded value is not contextually possible, you may remove this observation.

On the other hand, if the observation was recorded correctly, or if the observation is possible in the context of the study, you should think about why the observation is an outlier. One possible explanation for the outlier is that there is a predictor missing from the model that could explain the outlier. To really understand your outlier, you’ll need to look at the specific context in which you are working and, hopefully, a reason for the outlier can be determined.

Indicator Variables

summary(schoolmod)
## 
## Call:
## lm(formula = education ~ score + tuition + distance + urban, 
##     data = CollegeDistance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.826 -1.181 -0.246  1.219  5.367 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.141015   0.148905  61.388  < 2e-16 ***
## score        0.095596   0.002679  35.686  < 2e-16 ***
## tuition     -0.142627   0.068517  -2.082   0.0374 *  
## distance    -0.048723   0.010539  -4.623 3.88e-06 ***
## urbanyes     0.025619   0.057090   0.449   0.6536    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.58 on 4734 degrees of freedom
## Multiple R-squared:  0.221,  Adjusted R-squared:  0.2203 
## F-statistic: 335.7 on 4 and 4734 DF,  p-value: < 2.2e-16

We should first take note of how our categorical variable \(\texttt{urban}\) is written in our summary of the model. In the model, this variable is treated as an indicator variable, meaning that it can take one of two values, either 0 or 1. In the summary, this variable is labeled as \(\texttt{urbanyes}\). This means that the variable is given the value 1 if a student’s school is in an urban area and 0 if it is not. Since this variable has only 2 categories, yes and no, it only needs one indicator variable. However, if a categorical variable has more than 2 categories, it will need more indicator variables; to be exact, a variable with \(k\) categories needs \(k - 1\) indicator variables.
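For illustration only, using a variable that is not in our model: \(\texttt{ethnicity}\) has three categories, so R creates two indicator variables for it. We can see the coding R would use by looking at the design matrix it builds.

# two indicator columns are created for the three ethnicity categories
head(model.matrix(~ ethnicity, data = CollegeDistance))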

The Regression Equation

From our summary, we can also get the multiple linear regression equation: \(\widehat{EducationalAttainment} = 9.141 + .0956 \cdot AchievementScore - .1426 \cdot Tuition - .0487 \cdot Distance + .0256 \cdot I(urban = yes)\)

Since we have an indicator variable, \(\texttt{urban}\), in our equation, this equation can actually be broken down into two separate equations, one for students whose school is located in an urban area and another for students whose school is in a non-urban area.

For students going to school in urban areas: \(\widehat{EducationalAttainment} = (9.141 + .0256) + .0956 \cdot AchievementScore - .1426 \cdot Tuition - .0487 \cdot Distance\)

For students going to school in non-urban areas: \(\widehat{EducationalAttainment} = 9.141 + .0956 \cdot AchievementScore - .1426 \cdot Tuition - .0487 \cdot Distance\)

Notice that the only change in the two equations is the intercept.

Interpreting Regression Coefficients

Returning again to our summary, we can interpret the coefficients for each of our variables in a way similar to how they were interpreted for simple linear regression. For example, for every increase in achievement test score by 1 point, the predicted years of education of a student increases by .0956 years when that student’s state’s average tuition for a 4-year college, distance from a 4-year college and location of school (urban or non-urban) all stay the same. You might notice that the only change in the interpretation that we made from the interpretation for simple linear regression is that you must state that all of the other predictors in the model are being held constant.

We can also interpret the coefficient of our indicator function in a similar way. For example, the predicted years of education of a student increases by .0256 when their high school is located in an urban location as opposed to a non-urban location when their achievement test score, average state tuition for a four-year college and distance from a four-year college all remain the same.

Finally, thinking about the intercept in context, we can see that it wouldn’t really make sense to interpret. Even a student who scored a zero on their achievement test, had an average state tuition for a 4-year college of zero dollars, lived zero miles from a 4-year college and went to a non-urban high school would still have at least 12 years of education, since they had to have graduated from high school in order to be part of this data set, yet our intercept is less than 12. We cannot use this model to determine anything about students who did not graduate.

Confidence Intervals for Regression Coefficients

We recognize that these regression coefficient values given by our summary are simply point estimates so we’ll need to create confidence intervals for them as well. We can create confidence intervals for them using the command \(\texttt{confint(model)}\). This will generate 95% confidence intervals for all of the coefficients in our model.

confint(schoolmod)
##                   2.5 %       97.5 %
## (Intercept)  8.84909261  9.432937106
## score        0.09034461  0.100848185
## tuition     -0.27695296 -0.008301033
## distance    -0.06938312 -0.028062221
## urbanyes    -0.08630305  0.137541361
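If a confidence level other than 95% is wanted, \(\texttt{confint()}\) also accepts a \(\texttt{level}\) argument; for example, 90% intervals would be obtained like this (output not shown):

confint(schoolmod, level = 0.90)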

As an example, we can interpret the confidence intervals for achievement test score and the indicator variable for whether a student’s high school is located in an urban area.

We interpret the confidence interval for achievement test score, a quantitative variable, like this: we are 95% confident that an increase by 1 point in achievement test score will increase the number of years of education a student achieves 6 years after high school graduation by between .09 and .1 years on average, all other variables held constant.

We interpret the confidence interval for our indicator variable, a categorical variable, like this: we are 95% confident that a student going to high school in an urban area will achieve between -.086 and .138 years of education more than a student going to a high school in a non-urban area on average, all other variables held constant.

Again, we cannot interpret our intercept because it does not make sense within the limits of our data set.

T-tests

From our summary, we can also determine the significance of each of our predictor variables through the results of a t-test. As I discussed in my simple linear regression R guide, a t-test tests the null hypothesis that \(\beta = 0\) against the alternative hypothesis that \(\beta \neq 0\), where \(\beta\) is the regression coefficient. The only difference for multiple linear regression is that we have multiple \(\beta\)s, so a separate t-test is run for each regression coefficient. In our summary table, we see the t-statistic for each of our predictor variables as well as the intercept, and a corresponding p-value for each of the t-statistics. The results of the t-tests tell us that all of the variables in our model are significant predictors of a student’s number of years of education except whether their high school is in an urban or non-urban area, since that is the only p-value greater than .05. This means we could remove the \(\texttt{urban}\) variable from the model because it does not have a significant relationship with the number of years of education a student achieves 6 years after high school graduation, once the other predictors are accounted for. This is similar to the conclusion drawn from our automated model building procedure earlier.

Making Predictions

Next, we can make some predictions using our model. We can use our regression equation to predict the years of education a student would have attained if they had a test score of 52, the average college tuition for their state was $900, they lived five miles from a four-year college and they attended an urban high school.

u <- (9.141+.0256) + .0956*52 - .1426*.900 - .0487*.5
u
## [1] 13.98511

From our equation, we can see that this student would have likely pursued just shy of 2 years of education after graduating high school. Now, let’s compare that value with a similar student who only differed by where they went to high school. This student did not go to high school in an urban area.

nu <- 9.141 + .0956*52 - .1426*.900 - .0487*.5
nu
## [1] 13.95951

From our equation, it seems that this student would have also pursued just shy of 2 years of education after graduating high school. However, take note of the difference between the two predicted years of education.

abs(u-nu)
## [1] 0.0256

The difference between the two educational attainment levels is the value of the regression coefficient estimated for the urban indicator variable. Since this coefficient is small, the difference between the two students is also small.

We know that providing a point estimate is usually not adequate when providing a prediction based on a regression model because it gives an answer that is very specific and likely not exactly correct. Therefore, we might want to make a prediction interval. We can do this using the command \(\texttt{predict(model, new.dataframe, interval)}\), similar to how we made prediction intervals for simple linear regression models. Our \(\texttt{interval}\) would be specified as \(\texttt{"predict"}\). The only difference for the multiple linear regression models is that the new data frame must contain specified values for all of the variables in your regression model. So, our new data frame, corresponding to the example student that attended a high school in an urban area, would look like this:

newdata <- data.frame(score = 52, tuition = .9, distance = .5, urban = 'yes')

Now, we can use this data frame to create a 95% prediction interval for the student attending a high school in an urban area that we made a point prediction for previously.

predict(schoolmod, newdata, interval = "predict")
##        fit      lwr      upr
## 1 13.98492 10.88636 17.08349

We can see that the point estimate given by this prediction interval is slightly different from the value we calculated by hand; this can be attributed to rounding the coefficients in the hand calculation. We see that our 95% prediction interval is [10.886, 17.083]; thus, I am 95% confident that an individual student with an achievement score of 52, an average state tuition for four-year colleges of $900, living five miles from a four-year college and going to a high school in an urban area will have achieved a total of between 10.886 and 17.083 years of education 6 years after graduating from high school.

In this case, we have a slightly odd interval because we know that any student graduating from high school will attain at least 12 years of education 6 years after graduating from high school.

Confidence Intervals

In addition to prediction intervals, we can also create confidence intervals for the mean years of education students with specific characteristics might attain 6 years after high school graduation. We can use a similar \(\texttt{predict(model, new.dataframe, interval)}\) command to do this, changing the \(\texttt{interval}\) argument from \(\texttt{"predict"}\) to \(\texttt{"confidence"}\). We will use the same data frame that we created previously for the prediction interval.

predict(schoolmod, newdata, interval = "confidence")
##        fit      lwr      upr
## 1 13.98492 13.89019 14.07965

We get the same point prediction as we did for the prediction interval, which is to be expected. Also as we would expect, we see a smaller interval than we did for the prediction interval. From this interval we can say we are 95% confident that the mean number of years of education a student would attain 6 years after high school graduation, given that they had an achievement test score of 52, an average state tuition for four-year colleges of $900, live five miles from the nearest four-year college and went to a high school in an urban area, is between 13.89 and 14.08 years.

Adding Interaction Terms

Creating a New Model

Another feature we can add to our models is an interaction term. For this model we will use an interaction between our \(\texttt{distance}\) and \(\texttt{urban}\) variables. By that I mean that the relationship between the education level a student achieves and the distance they live from a four-year college may depend on whether the student’s high school is located in an urban area or not. We can add the interaction term by using a \(\texttt{*}\) between the two predictor variables involved in the interaction in the \(\texttt{lm()}\) command instead of a \(\texttt{+}\), like this:

schoolmod2 <- lm(education ~ score + tuition + distance*urban, data = CollegeDistance)

In an R formula, the \(\texttt{*}\) between two predictors includes both of their main effects and their interaction (\(\texttt{distance*urban}\) is shorthand for \(\texttt{distance + urban + distance:urban}\)), while a \(\texttt{+}\) on its own simply adds a term with no interaction.
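As a sketch of what the formula expands to, the following call fits exactly the same model as \(\texttt{schoolmod2}\), writing the main effects and the interaction out explicitly (the name \(\texttt{schoolmod2b}\) is mine, just for illustration):

# equivalent to education ~ score + tuition + distance*urban
schoolmod2b <- lm(education ~ score + tuition + distance + urban + distance:urban, data = CollegeDistance)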

Checking Model Assumptions

Since we have created a second model, we need to check our model assumptions again. We interpreted all of the diagnostic plots for the first model so we will just focus on the most important plot, the residual plot, for this one. We always need to check the residual plot to make sure the mean function is appropriate. We will use the same \(\texttt{plot()}\) command as before and so all four plots will still be generated.

plot(schoolmod2)

We see a similar residual plot to our first residual plot, with the same over-/underprediction problem. Again, there appears to be no strong evidence for heteroskedasticity. Finally, we see no outliers on this plot.

Additionally, the other 3 diagnostic plots look very similar to the diagnostic plots generated for our first model.

The Regression Equation

summary(schoolmod2)
## 
## Call:
## lm(formula = education ~ score + tuition + distance * urban, 
##     data = CollegeDistance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8258 -1.1814 -0.2471  1.2193  5.3674 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.141319   0.149012  61.346  < 2e-16 ***
## score              0.095592   0.002680  35.665  < 2e-16 ***
## tuition           -0.142507   0.068555  -2.079   0.0377 *  
## distance          -0.048802   0.010627  -4.592  4.5e-06 ***
## urbanyes           0.022658   0.076409   0.297   0.7668    
## distance:urbanyes  0.004735   0.081204   0.058   0.9535    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.58 on 4733 degrees of freedom
## Multiple R-squared:  0.221,  Adjusted R-squared:  0.2201 
## F-statistic: 268.5 on 5 and 4733 DF,  p-value: < 2.2e-16

We see from our summary output that our new regression equation looks like this: \(\widehat{EducationalAttainment} = 9.141 + .096 \cdot AchievementScore - .143 \cdot Tuition - .049 \cdot Distance + .023 \cdot I(urban = yes) + .005 \cdot Distance \cdot I(urban = yes)\)

Again, we can split our regression equation into two equations, one for a student going to a high school in an urban area and one for a student going to a high school in a non-urban area.

For students going to school in urban areas: \(\widehat{EducationalAttainment} = (9.141 + .023) + .096 \cdot AchievementScore - .143 \cdot Tuition - (.049 - .005) \cdot Distance\)

For students not going to school in urban areas: \(\widehat{EducationalAttainment} = 9.141 + .096 \cdot AchievementScore - .143 \cdot Tuition - .049 \cdot Distance\)

This time we see that the difference between the two equations is not just the intercept but also the coefficient on the distance a student lives from a four-year college, since our \(\texttt{urban}\) variable affects both of those terms in the equation.

The T-test Interpretation

Our summary also tells us the results of the t-test performed for our interaction term (\(\texttt{distance:urbanyes}\)). We can see in the table that the t-statistic is .058 and the corresponding p-value is .9535. This means the relationship between a student’s number of years of education and distance from a four-year college does not significantly depend on whether that student’s high school is located in an urban area.

It’s important to note that, although in this example we used an interaction between a categorical and a quantitative variable, interaction terms can also be made using two categorical or two quantitative variables.
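For instance, a sketch of a model with an interaction between two quantitative variables, \(\texttt{score}\) and \(\texttt{tuition}\), would look like this; it is purely for illustration and not a model we analyze in this guide.

# interaction between two quantitative predictors
lm(education ~ score*tuition + distance + urban, data = CollegeDistance)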

F-tests

Another test whose results our summary output displays is the F-test for the model. The F-test tests the null hypothesis that \(\beta_1 = \beta_2 = \ldots = \beta_n = 0\) against the alternative hypothesis that at least one of the regression coefficients, the \(\beta\)s, is not equal to 0. We can see from our summary output that the F-statistic for this model is 268.5, and this is compared to an F-distribution with 5 and 4733 degrees of freedom to obtain a p-value of less than \(2.2 \cdot 10^{-16}\). This means that at least one of the variables in our model has a significant relationship with our response.

Partial F-tests

A final test we can look at to understand the significance of the variables in our model is the partial F-test. The partial F-test tests the null hypothesis that \(\beta_m = \ldots = \beta_p = 0\), i.e., that some subset of the regression coefficients is equal to 0, against the alternative hypothesis that at least one \(\beta\) from that subset is not equal to 0. This test allows us to test the significance of a subset of our model variables. It differs from a t-test in that a t-test only allows us to test the significance of one variable at a time.

We can use a partial F-test to understand the significance of our interaction term since our interaction term is a subset of our predictor variables. This is essentially the same as a t-test since our subset is only one term in our regression equation.

To run the test, we will use the command \(\texttt{anova(reduced.model, complete.model)}\), where \(\texttt{reduced.model}\) is the model without the variables whose significance we are interested in testing and \(\texttt{complete.model}\) is the model that contains them.

anova(schoolmod, schoolmod2)
## Analysis of Variance Table
## 
## Model 1: education ~ score + tuition + distance + urban
## Model 2: education ~ score + tuition + distance * urban
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1   4734 11815                           
## 2   4733 11815  1  0.008489 0.0034 0.9535

Our output tells us that our partial F-statistic is .0034 and, from that, we get a p-value of .9535. This p-value is not significant at even a high significance level of .1 so we can conclude that the relationship between the education level that a student achieves and the distance they are from a four-year college does not depend on whether the student’s high school is located in an urban area or not.

Polynomial Regression

It’s safe to say that quadratic regression is a variation on multiple linear regression; however, instead of using a regression equation with the form, \(\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2\), we use a regression equation with the form \(\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2\).

Creating the Model

To examine this type of regression, let’s create a new model using the data set \(\texttt{segreg}\) from the R package alr3. This data set contains observations on the per day electricity consumption and the mean monthly temperature for one building on the University of Minnesota - Twin Cities’ campus between 1988 and 1992. So, for this model, we’ll look at the relationship between the per day electricity consumption, \(\texttt{C}\), and the mean monthly temperature for this building, \(\texttt{Temp}\). First, we’ll create the model.

library(alr3)
## 
## Attaching package: 'alr3'
## The following object is masked from 'package:MASS':
## 
##     forbes
energy <- lm(C ~ Temp, data = segreg)

Examining the Fit of the Model

Now we can look at the residual plot for this model. The other diagnostic plots have been checked and assumptions have been mostly met, following the process outlined in my simple linear regression R guide (http://rpubs.com/bensonsyd/364877). However, there is one possible outlier in the model.

plot(energy$fitted, energy$residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)

There seems to be a fairly strong U-shaped trend in the residuals, meaning that we are systematically underpredicting the electricity consumption for the lowest and highest temperatures, while systematically overpredicting the electricity consumption for the moderate temperatures. This kind of trend in the residuals is a good indicator of when you should use polynomial regression or a transformation.

Fixing the Fit of the Model

We’ll try a quadratic model to fix this issue. We can create the quadratic model like this:

quadenergy <- lm(C ~ Temp + I(Temp^2), data = segreg)

The major change to this model is the addition of \(\texttt{I(variable^2)}\), which indicates that we want to add an \(x^2\) term to our regression equation. The \(\texttt{I()}\) wrapper is needed because \(\texttt{^}\) has a special meaning inside an R formula; \(\texttt{I()}\) tells R to square the variable arithmetically instead.

Examining the Fit of the New Model

Visualizing the Fit of the New Model

Now we can look at the residual plot for this altered model.

plot(quadenergy$fitted, quadenergy$residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)

Looking at the new residual plot, we see that the quadratic term made a big difference in our problem of over- and underpredicting the building’s electricity consumption. We see less of a trend in our residuals; however, we now see some evidence for heteroskedasticity since we have greater variability in the residuals for the lowest predicted electricity consumption values. One possibility for fixing this issue would be to add a cubic term to our model. We can continue adding polynomial terms to our model by adding more terms of the form \(\texttt{I(variable^k)}\), where \(\texttt{variable}\) is the variable you want to model with the polynomial term and \(\texttt{k}\) is the degree of the polynomial. You’ll want to also include all polynomial terms of a degree less than \(\texttt{k}\) in your model as well.
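A sketch of that cubic extension, keeping the lower-order terms as described above, would look like this (the name \(\texttt{cubicenergy}\) is mine; we do not pursue this model further here):

# cubic model: includes the linear and quadratic terms as well
cubicenergy <- lm(C ~ Temp + I(Temp^2) + I(Temp^3), data = segreg)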

Alternatively, we could try transforming the observations before putting them in the original model, but there is no guarantee that either of these options will be able to fix the problem. Nevertheless, I will demonstrate the transformation technique a little later.

The T-test Interpretation

One way of determining whether our quadratic term is significant is by looking at the results of our t-test in the summary of the model with the quadratic term.

summary(quadenergy)
## 
## Call:
## lm(formula = C ~ Temp + I(Temp^2), data = segreg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8610  -2.5547  -0.0028   2.3084  10.1016 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.939962   4.266966  18.969  < 2e-16 ***
## Temp        -0.490845   0.211815  -2.317 0.026285 *  
## I(Temp^2)    0.008919   0.002299   3.879 0.000429 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.388 on 36 degrees of freedom
## Multiple R-squared:  0.6669, Adjusted R-squared:  0.6484 
## F-statistic: 36.05 on 2 and 36 DF,  p-value: 2.542e-09

We see here that the t-test indicates our quadratic term is significant, since its p-value is less than .05, so it improves our model and we should keep it.

The Partial F-test Interpretation

Another way we can see if our quadratic term is necessary is by running a partial F-test, like we discussed above.

anova(energy, quadenergy)
## Analysis of Variance Table
## 
## Model 1: C ~ Temp
## Model 2: C ~ Temp + I(Temp^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     37 1481.7                                  
## 2     36 1044.9  1    436.74 15.046 0.0004285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, we see that the p-value of this test, .00043, is significant. Thus, we can conclude that the quadratic term does actually improve our model.

Visualizing the Regression Equation

One final thing we can do to compare our models is visualize the regression equation.

Creating the Components of the Polynomial Regression Line

First, we’ll need to look at our range of the mean monthly temperatures so that we can plug values within that range into our regression equation for the quadratic model. To look at the range we can use the command \(\texttt{range(variable)}\). Then we’ll use that information from the \(\texttt{range()}\) command to create a list of x values to plug into our equation. To create this list of x values we use the command \(\texttt{seq(from = lower.range, to = upper.range, by = interval)}\). In this case, \(\texttt{lower.range}\) is the minimum temperature in our dataset and \(\texttt{upper.range}\) is the maximum temperature in our dataset. The interval I’ve chosen is .1. Thus, a sequence of x values will be created starting at 8.5536 and going to 78.0484, increasing by .1. We then plug these x values into our regression equation to get the corresponding y values.

range(segreg$Temp)
## [1]  8.5536 78.0484
x <- seq(from=8.5536, to=78.0484, by=.1)
y <- coef(quadenergy)[1] + coef(quadenergy)[2]*x + coef(quadenergy)[3]*x^2
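Equivalently, rather than writing out the coefficients by hand, we could let \(\texttt{predict()}\) evaluate the fitted quadratic at the same sequence of temperatures; this is just a convenience, not a different method.

# same fitted curve, computed by predict() instead of by hand
y <- predict(quadenergy, newdata = data.frame(Temp = x))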

Plotting the Regression Lines

Now we’ve created everything that we need to visualize the line of our quadratic model, so let’s plot it, along with the line from our simple linear regression model, on our plot of the mean monthly temperature and the electricity consumption per day for a building at the University of Minnesota. We need to make sure we create the plot before we can add the lines.

To add the regression line for the quadratic model, we use the command \(\texttt{lines(x, y)}\), where \(\texttt{x}\) and \(\texttt{y}\) are the x and y vectors that we just created. We can add the regression line from the simple linear regression model by using the command \(\texttt{abline(model.name)}\). We also add the argument \(\texttt{lty = number}\), where \(\texttt{number}\) is our desired line type (written as a number), to change the look of the regression line for the simple model and differentiate it from the regression line from our quadratic model.

plot(segreg$Temp, segreg$C, xlab = "Monthly Mean Temperature", ylab = "Electricity Consumption (in KWH/day)")
lines(x, y)
abline(energy, lty = 2)

Comparing the Regression Lines

In our plot, the solid line represents the regression equation of our quadratic model and the dashed line represents the regression equation of our model without the quadratic term. As you can see, there is a noticeable difference between the fit of the equations to the data.

Visually, the model with the quadratic term definitely seems to fit the data better and we know from the two tests we have already run that the model does benefit from the additional quadratic term.

Transformations

Another tool we can use to create a better fitting mean function is transformations. The main transformations we use are logarithmic, inverse and square root. These transformations are used to create a more linear relationship between our predictor and response variables and eliminate heteroskedasticity issues.

Previously, we examined the residual plot for our model relating monthly mean temperature and electricity consumption for a specific building and saw a strong U-shaped trend. We saw that adding a quadratic term made a big difference in how our mean function fit the data, but let’s see whether we can achieve a similar effect using a transformation.

Creating the Model

Outside of this R guide, I have tested different transformations on this model and have chosen what seems like the optimal transformation to demonstrate in this example. For the example, we will be using a logarithmic transformation on our electricity consumption observations.

To add the transformation to the model, we simply calculate the transformed values of the variable within the model itself. For the logarithmic transformation, we use \(\texttt{log(variable)}\) instead of just \(\texttt{variable}\). To use the inverse transformation we can use \(\texttt{1/variable}\) and for the square root transformation we use \(\texttt{sqrt(variable)}\). So, to make our logarithmic transformation on electricity consumption, we can do this:

t.energy <- lm(log(C) ~ Temp, data = segreg)
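For comparison only, the inverse and square root transformations mentioned above would be specified in the same way; these are not models we will analyze, and the model names are mine. \(\texttt{I()}\) is used for the inverse so that the division is taken literally inside the formula.

# inverse and square root transformations of the response
inv.energy <- lm(I(1/C) ~ Temp, data = segreg)
sqrt.energy <- lm(sqrt(C) ~ Temp, data = segreg)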

Examining the Fit of the New Model

So, let’s look at the residual plot to determine whether or not our new model fixes our under-/overprediction problem.

plot(t.energy$fitted, t.energy$residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)

It looks like this transformation mostly fixed our over-/underprediction problem, but we still see a very slight U-shaped trend in our residuals. Additionally, this model seems to be less heteroskedastic than the quadratic model. Finally, take note of the one possible outlier on the plot.