Multiple linear regression (MLR) is an awesome tool. The response variable is a single continuous variable, while the predictor variables can be many and either categorical or continuous. Generally, MLR is used with more continuous predictor variables than categorical ones.
MLR gives you a lot of flexibility and is very powerful, but it can be a little different to work with than what many of us are used to. Because regression is more powerful than ANOVA, MLR can be a good replacement for certain 2- and 3-way ANOVAs.
We will use the iris dataset, a freely available example data set in R that we can call directly (instead of reading in data from a .csv like we did in Getting Started with R). So let’s take a look at the iris dataset.
View(iris) #view the iris dataset
colnames(iris) #variable names in the iris dataset
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
Looking at the data, we see the first 4 columns are continuous variables and the final (Species) is a categorical variable.
For this analysis, let’s see how sepal length is affected by the other variables.
library(Hmisc)
library(car)
MLR has the same assumptions as simple linear regression (so you may want to check out that tutorial) plus one more - no collinearity. We will discuss collinearity later.
There are six assumptions to test for MLR.
The observations were randomly sampled.
The observations are independent from one another.
The relationship between each predictor variable and the response is linear (no curve, just a nice straight line).
Homoscedasticity: the spread of the data is the same throughout; the residuals have the same variance across the range of the data.
See this link for a better explanation and a nice graph of heteroscedasticity.
The errors of the model are distributed normally.
No collinearity among predictor variables.
You have to determine whether your sampling was independent and random. The other assumptions can be tested with the code below.
We will examine linearity using the code below, which produces a scatterplot matrix. What we want to look at is the relationship of each predictor variable with the response variable, sepal length. At this point, we don’t want to look for linearity between the predictor variables; we will do that later.
pairs(iris) #scatterplot matrix
Sepal length and sepal width are not clearly linear, but let’s call it ok for now.
Sepal length and petal length are clearly linear, even if the relationship appears to have a shallower slope at low values.
Sepal length and petal width is rather linear, looks good.
Sepal length and species is pretty much linear, although species is a categorical variable, so it can be hard to judge. Let’s continue.
Note: calling linearity is always subjective. Ecological data is messy; almost always messier than this example dataset. Pretty much what I’m saying is try your best and follow your heart.
To test if the variances are the same throughout the data range, we again look at the scatterplot matrix. What we want to see is fairly equal scatter from left to right; funnel shapes are bad. An explanation of heteroscedasticity is at the same link as above.
Sepal length and sepal width’s relationship may be heteroscedastic, but not bad enough to be a major issue (at least in my opinion).
Sepal length and petal length is homoscedastic! It’s good to go.
Sepal length and petal width is also homoscedastic! It’s good to go.
Sepal length and species is probably homoscedastic, again it’s hard to tell with a categorical variable like species.
Judging homoscedasticity is again subjective, so again do your best. If there are problems, you can transform variables; log, square root, and exponent transformations are particularly useful.
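As a quick sketch (the new column names below are made up just for illustration), a transformed version of a predictor can be added as a new column and then re-checked for linearity and homoscedasticity:
iris2 <- iris #work on a copy so the original columns stay untouched
iris2$log.Sepal.Width <- log(iris2$Sepal.Width) #hypothetical example: log-transformed sepal width
iris2$sqrt.Sepal.Width <- sqrt(iris2$Sepal.Width) #hypothetical example: square-root-transformed sepal width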
A residual plot is a great way to look at homo vs. heteroscedasticity of the residual errors using the code below.
A qq plot is the best way to see if the errors are normally distributed.
The code below is a quick way to see residual and q-q norm plots for any linear model (the same kind of model object used for both ANOVAs and regressions).
The most useful (to me) is the q-q norm plot (the 2nd one). If the errors of the model are normal, the data points will fall closely on the line. In this particular example, it’s pretty much perfect!
For more information, google these nested functions: plot(lm()) and see https://drsimonj.svbtle.com/visualising-residuals.
lm.irisMLR<-lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width+Species,data=iris) #make your linear model
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(lm.irisMLR) # Plot the linear model information you made above
#Note, if 4 graphs aren't showing up for you all at once, run the par() line and plot() line together.
par(mfrow = c(1,1)) #Returns the plotting panel to normal (1x1)
Make sure to run the last par() line so you return your plotting view back to normal.
Below we make another scatterplot matrix, this time just with the predictor variables, leaving our response variable (sepal length) out.
pairs(iris[c(2:5)]) #scatterplot matrix with columns 2-5: without the response variable (don't need to check the response variable for collinearity)
We don’t want to include predictor variables that are linearly related to one another, otherwise called collinear. This sounds backwards, right? But if two variables are collinear, they are likely explaining the same thing, so including both inflates the model and makes it appear to fit better than it does in reality.
The takeaway is simple: don’t include collinear predictor variables in the same model.
So let’s get to it:
Sepal width and petal length are not collinear.
Sepal width and petal width are not collinear.
Petal length and petal width are collinear.
Species may be collinear with petal length and width, but it’s hard to tell because species is categorical.
This is once again a subjective call. But let’s check out a less subjective method below.
The code below produces a Pearson correlation matrix. The Pearson correlation (commonly known as r) is a measure of how closely two variables are linearly related. It ranges from -1 to 1: the closer |r| is to 1, the more strongly the two variables are linearly correlated.
The code below only includes the predictor variables, which are columns 2-4 (the [,c(2:4)] part does that). Because we are only using the predictor variables, we can investigate collinearity among them.
cor(iris[,c(2:4)],use="complete")
## Sepal.Width Petal.Length Petal.Width
## Sepal.Width 1.0000000 -0.4284401 -0.3661259
## Petal.Length -0.4284401 1.0000000 0.9628654
## Petal.Width -0.3661259 0.9628654 1.0000000
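If you want to look at a single pair in more detail, cor.test() reports the Pearson correlation along with a p-value and confidence interval. This is just an optional aside, not a required step:
cor.test(iris$Petal.Length, iris$Petal.Width) #Pearson correlation for one pair, with p-value and confidence interval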
So that you have options for correlation matrices, check out another method below. This one shows the same information, but with colored graphics. Remember to run install.packages("corrplot") ONCE to install the package.
library(corrplot)
correlations <- cor(iris[,2:4])
corrplot(correlations, method="circle")
Most of the correlations look low, except for petal width and petal length at r=0.96. We need to exclude either petal width or petal length.
Note: when researchers are dealing with many variables and looking for collinearity, they often set a maximum r value, usually between 0.7 and 0.9. Review the literature in your field to see what cutoff is most appropriate.
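As a rough sketch (the 0.7 below is only an example cutoff), you can flag the pairs that exceed your chosen cutoff directly from the correlation matrix:
abs(cor(iris[,c(2:4)])) > 0.7 #TRUE marks pairs above an example cutoff of 0.7; the diagonal is always TRUE (each variable with itself)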
So do we exclude petal width or petal length? Try to do this randomly, using a random number generator or similar method. For this example, let’s exclude petal width.
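One way to make that random choice (just a sketch; the seed value is arbitrary) is with sample():
set.seed(123) #arbitrary seed, only so the random pick is reproducible
sample(c("Petal.Width","Petal.Length"), 1) #randomly pick one of the two collinear variables to drop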
Ok, let’s finally fit the model, excluding petal width because it is collinear with petal length.
irisMod.lm <-lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species,data=iris)
summary(irisMod.lm)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Species,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82156 -0.20530 0.00638 0.22645 0.74999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.39039 0.26227 9.114 5.94e-16 ***
## Sepal.Width 0.43222 0.08139 5.310 4.03e-07 ***
## Petal.Length 0.77563 0.06425 12.073 < 2e-16 ***
## Speciesversicolor -0.95581 0.21520 -4.442 1.76e-05 ***
## Speciesvirginica -1.39410 0.28566 -4.880 2.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3103 on 145 degrees of freedom
## Multiple R-squared: 0.8633, Adjusted R-squared: 0.8595
## F-statistic: 228.9 on 4 and 145 DF, p-value: < 2.2e-16
Let’s first look at the coefficients section and focus on the p-values (column "Pr(>|t|)"). We see that each variable is significant. But you may be wondering why there are 2 species variables. These are called indicator variables or dummy variables. I’m not going to explain them in depth here, but please see this website for more information. An indicator variable is basically a way for certain models to incorporate categorical data.
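If you’re curious what those indicator variables look like, here is an optional peek: contrasts() shows how R codes the Species factor, and model.matrix() shows the design matrix that lm() actually fits.
contrasts(iris$Species) #setosa is the baseline level; versicolor and virginica each get a 0/1 indicator column
head(model.matrix(irisMod.lm)) #the design matrix used by lm(), including the indicator columns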
Now let’s look at the bottom section: we see that the adjusted r-squared is 0.8595, the F statistic is 228.9, the degrees of freedom are 4 and 145, and the model p-value is crazy small.
Our model looks like it’s pretty good! High r-squared and low p-value. But do we know that we need each and every variable? It’s hard to tell just looking at the p-values. To find the ‘best’ combination of predictor variables, we will use Akaike Information Criterion (AIC).
The function stepAIC from package MASS is one option we have. This function adds and/or takes away variables until it finds the “best” model. You have several options for direction.
forward: adds variables until the “best” model is found.
backward: starts with all variables and then kicks them out one by one until the “best” model is found.
both: the function adds and removes variables until the “best” model is found.
It defaults to backward. Other functions and packages may give you more control over your AIC iterations, including order of variables being added or discarded. For now, let’s stick with this function.
So let’s actually run a stepAIC! Hit that code below.
library(MASS)
stepAIC(irisMod.lm,direction="both")
## Start: AIC=-346.11
## Sepal.Length ~ Sepal.Width + Petal.Length + Species
##
## Df Sum of Sq RSS AIC
## <none> 13.966 -346.11
## - Species 2 2.3632 16.329 -326.66
## - Sepal.Width 1 2.7161 16.682 -321.45
## - Petal.Length 1 14.0382 28.004 -243.74
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Species,
## data = iris)
##
## Coefficients:
## (Intercept) Sepal.Width Petal.Length
## 2.3904 0.4322 0.7756
## Speciesversicolor Speciesvirginica
## -0.9558 -1.3941
Two results chunks are produced. At the top, we see the AIC of -346.11 and the formula of the model being tested. Below that, we see a table showing the degrees of freedom, sum of squares, residual sum of squares, and AIC for each variable.
At the bottom, following call, we are told the formula of the “best” model and the coefficients of that model.
Note: This was such a simple MLR that only one iteration was produced; usually multiple iterations will be performed and the final one is the best. Look for the Call: section; after it comes the formula of the "best" model.
In R, the "best" AIC is the lowest value (in other words, the most negative). For example, -16 is a better AIC than -14. A good practice is to re-fit the top few models (say, the top 3), making an lm for each and looking at its summary. Each summary contains many useful statistics, including the multiple R-squared.
When using AIC, it is easy to think that the best AIC is the best model. If a simpler model has only a slightly worse AIC, it could be better to use the simpler model instead. Why? Field ecology is expensive and collecting more variables than necessary makes it even more costly. There is always a balance between how much a variable improves the AIC and how much it costs in time or money.
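As a sketch of that kind of comparison (the simpler formula below is just an example candidate), you can re-fit candidate models and put their AICs and summary statistics side by side:
cand.full <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data=iris) #the model stepAIC kept
cand.simple <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris) #an example simpler candidate (drops Species)
AIC(cand.full, cand.simple) #compare AIC values side by side
summary(cand.simple)$adj.r.squared #pull a summary stat from one candidate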
There is a nice way to double check whether your variables are collinear, called Variance Inflation Factors (VIF). Remember the MLR assumption that the predictor variables are not collinear.
We use the function vif within the package car.
For each variable, a GVIF (generalized VIF), the Df, and the GVIF normalized by Df (GVIF^(1/(2*Df))) are produced. The accepted VIF cutoffs are 5 or 10, depending on who you ask. This is to say that if your VIF value is above 5 (or 10), you may consider that variable collinear and remove it.
But which VIF statistic to look at? According to this, we should use the normalized one, the 3rd column. None of our VIFs are above 5, so we are golden. If we had included petal width, the VIF would have been very high for that variable, and we would have removed it here and then re-run the analysis.
library(car) # for vif below
# Evaluate Collinearity
vif(irisMod.lm ) # variance inflation factors
## GVIF Df GVIF^(1/(2*Df))
## Sepal.Width 1.946902 1 1.395314
## Petal.Length 19.898536 1 4.460778
## Species 27.111945 2 2.281866
If you run this without any categorical variables (Species for us), the output will look different. It produces VIFs (instead of GVIFs), which do not need to be normalized by DF, so you can just compare them to your cutoff of 5 or 10.
If you wish, you can look at VIFs during your assumptions testing. I do it at the end only to double check my process, but it may be better to do earlier for some applications.
vif(lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris))
## Sepal.Width Petal.Length
## 1.224831 1.224831
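#All of the code from this tutorial, collected in one place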
library(Hmisc)
library(car)
#Linearity and homoscedasticity visual check
pairs(iris) #scatterplot matrix
#Residual and qqplots
lm.irisMLR<-lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width+Species,data=iris) #make your linear model
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(lm.irisMLR) # Plot the linear model information you made above
#Note, if 4 graphs aren't showing up for you all at once, run the par() line and plot() line together.
par(mfrow = c(1,1)) #Returns the plotting panel to normal (1x1)
#Looking at collinearity of predictor variables
pairs(iris[c(2:5)]) #scatterplot matrix with columns 2-5: without the response variable (don't need to check the response variable for collinearity), change columns as needed
#Pearson correlations, another way to look at collinearity
cor(iris[,c(2:4)],use="complete")
#Pearson correlations, but with colored graphics
library(corrplot)
correlations <- cor(iris[,2:4]) #only the 2nd through 4th columns
corrplot(correlations, method="circle")
#Fitting the model
irisMod.lm <-lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species,data=iris)
summary(irisMod.lm)
#Model selection by AIC
library(MASS)
stepAIC(irisMod.lm,direction="both") #can also be "backward" or "forward"
#last iteration of model should be "best"
#Double check for collinearity by Variable Inflation Factors (VIF)
library(car) # for vif below
vif(irisMod.lm ) # variance inflation factors
#For GVIFs, look at normalized GVIF^(1/(2*Df))
#For normal VIFs, no normalization needed
#Compare to cutoff of either 5 or 10