Multiple Linear Regression is an extension of Linear Regression. Instead of using one X variable to predict a response variable, Multiple Regression uses two or more X variables to predict a quantitative output, Y. The following equation is used for Multiple Regression.
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n + \epsilon \]
Y = Dependent (response) variable
\(X_n\) = Independent (predictor) variables
\(\beta_0\) = Intercept
\(\beta_n\) = Coefficient (slope) for each independent variable n
\(\epsilon\) = Mean-zero random error term (variability in Y the model is unable to explain)
library(usdm) #provides functions for exploring the different sources of uncertainties on distribution models
library(corrplot) #graphical display of correlation matrix and confidence intervals
library(knitr) #use of kable to create nice tables
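If any of these packages are missing from your library, a one-time install from CRAN should work:
install.packages(c("usdm", "corrplot", "knitr")) #one-time installation from CRAN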
Abalones are marine snails. The flesh of abalone is widely considered a desirable food and is consumed raw or cooked by a variety of cultures!
This data was collected in an attempt to predict the age of an abalone from certain physical measurements: height, length, diameter, weight, etc. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.¹
Name | Data Type | Description |
---|---|---|
Sex | nominal | M, F, and I (infant) |
Length | continuous | longest shell measurement (mm) |
Diameter | continuous | perpendicular to length (mm) |
Height | continuous | with meat in shell (mm) |
Whole Weight | continuous | whole abalone (grams) |
Shucked Weight | continuous | weight of meat (grams) |
Viscera Weight | continuous | gut weight after bleeding (grams) |
Shell Weight | continuous | after being dried (grams) |
Rings | integer | +1.5 gives the age in years |
The data set and more information are available here. Use the code below to import the data directly.
abalone <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",
sep = ",",
col.names = c("gender", "length", "diameter", "height", "whole_weight",
"shucked_weight", "viscera_weight", "shell_weight", "rings"),
fill = TRUE)
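To confirm the import worked, it can help to glance at the structure of the data frame:
str(abalone) #quick sanity check: 4177 observations of 9 variables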
Hypothesize which independent variables you think will be good predictors of your dependent variable. One way to decide which variables to choose is by checking the VIF (variance inflation factor) scores for each of the independent variables; these tell you whether variables are multicollinear. Using the `vif()` function from the `usdm` package, you can check VIF scores for your entire data set. You will get VIF scores for all of your numeric variables.
vif(abalone)
A high VIF score means that there is multicollinearity between certain independent variables, so we need to eliminate predictor variables that are highly correlated with each other. The abalone data set has four separate measurements of weight: whole weight, shucked weight, viscera weight, and shell weight. Because these variables all measure the weight of an abalone, they are highly correlated with each other, and their high VIF scores confirm this. Among the weights, we want to keep the one with the lowest VIF score: viscera weight, with a VIF of 17.725. Next, you'll notice that length and diameter both have high VIF scores because they are correlated with each other; again, we keep the predictor with the lower VIF score, which is length. This leaves us with four variables (viscera weight, length, height, and gender) that we believe might be good predictors of our dependent variable, rings.
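As a quick check, you can re-run `vif()` on just the numeric predictors we kept; with the redundant weight and size measurements removed, the scores should drop considerably (a sketch, assuming the column names from the import code above):
vif(abalone[, c("length", "height", "viscera_weight")]) #VIF scores for the retained numeric predictors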
If you're not familiar with VIF scores, you can instead check for correlation across the data set using `cor.test()` on pairs of variables or `corrplot()` for a graphical overview of the full correlation matrix.
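For example, a minimal sketch of the `corrplot()` approach, computing the correlation matrix of the numeric columns first:
num_vars <- abalone[sapply(abalone, is.numeric)] #drop the nominal gender column
corrplot(cor(num_vars), method = "number") #display pairwise correlations as numbers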
Exercise:
Use your data and your selected independent variables to create a model. This will give you your coefficients for each independent variable.
For this step, we used the `lm()` function. Our response variable is the age of the abalone, which is measured by the number of rings. Based on the VIF scores, the predictor variables we decided to use were viscera weight, length, height, and gender. Gender is categorical, so it received no VIF score, and there is no correlation-based reason to exclude it as a possible variable.
abalone_model <- lm(rings ~ length + height + viscera_weight + gender, data = abalone)
Evaluate each of the independent variables and the model as a whole. Use the `summary()` function to evaluate the abalone model.
summary(abalone_model)
##
## Call:
## lm(formula = rings ~ length + height + viscera_weight + gender,
## data = abalone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.4100 -1.5964 -0.5629 0.8421 16.5181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2252 0.3108 10.376 < 2e-16 ***
## length 8.9982 0.8524 10.557 < 2e-16 ***
## height 22.5943 1.7493 12.916 < 2e-16 ***
## viscera_weight -3.8011 0.8757 -4.340 1.46e-05 ***
## genderI -1.2741 0.1190 -10.705 < 2e-16 ***
## genderM -0.1723 0.0975 -1.767 0.0773 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.58 on 4171 degrees of freedom
## Multiple R-squared: 0.3607, Adjusted R-squared: 0.3599
## F-statistic: 470.6 on 5 and 4171 DF, p-value: < 2.2e-16
The numbers under the “Estimate” column are the coefficients for each of the predictor variables. To evaluate the significance of each variable, look at the “Pr(>|t|)” column. The significance codes are listed underneath the table; the more stars, the more significant the predictor variable. Notice that all of the p-values are less than 0.05 (the most common \(\alpha\) for determining significance) except for “genderM.” Because the gender variable has three levels, male (M), female (F), and infant (I), R created dummy variables: “genderI” and “genderM” act as binary indicator variables in the regression model, with female (F) as the reference level.
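If you're curious how R coded these dummy variables, `contrasts()` shows the coding matrix (a quick sketch; each non-reference level gets one dummy column):
contrasts(factor(abalone$gender)) #the F row is all zeros because female is the reference level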
In the bottom section of the summary output, look at the “Residual standard error,” “Multiple R-squared,” and “F-statistic.” The residual standard error is the estimate of the standard deviation of \(\epsilon\), so a smaller value is better. The Multiple R-squared, also called the coefficient of determination, is the proportion of the variance in the data that is explained by the model. Adding more variables always increases the Multiple R-squared, even if they don't help; the Adjusted R-squared compensates by penalizing for the number of variables in the model. The Multiple R-squared for the abalone data set is 0.3607, which indicates that roughly 36% of the variance is explained. A good model will have a high F-statistic and a small p-value based on the \(\alpha\) chosen for the model.
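If you would rather extract these quantities programmatically than read them off the printout, they are stored in the object returned by `summary()`:
abalone_summary <- summary(abalone_model)
abalone_summary$sigma #residual standard error
abalone_summary$r.squared #multiple R-squared
abalone_summary$adj.r.squared #adjusted R-squared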
Exercise:
Run the model again but leave out gender as a variable. Run the summary statistics and evaluate the new variables and model.
Is the R-squared better or worse?
Which variables are significant using an \(\alpha\) of 0.05?
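To get started on this exercise, the refit looks something like the sketch below; compare its summary output to the one above.
abalone_model_no_gender <- lm(rings ~ length + height + viscera_weight, data = abalone)
summary(abalone_model_no_gender) #evaluate the new coefficients and R-squared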
Check the assumptions. You want your residuals to be normally distributed. In the “Residuals” section of the summary output, the median should therefore hover around 0, and the first and third quartiles (1Q and 3Q) should be roughly equal in absolute value.
You can test all of the assumptions using the plot function and the model you created.
plot(abalone_model, which = 1, id.n = 5)
The first plot tests for linearity and heteroskedasticity. To learn more about heteroskedasticity, click here. Based on this plot, you can see that the residuals are heteroskedastic, which indicates dependency between the residuals and the fitted values.
This link shows an example of a good Residuals vs. Fitted plot.
plot(abalone_model, which = 2, id.n = 5)
The second plot tests for the normality of the residuals. This plot shows that our residuals are “heavy-tailed,” based on the description in this link.
plot(abalone_model, which = 3, id.n = 5)
The third plot also tests for heteroskedasticity. Like plot 1, the residuals are not evenly spread across the line.
plot(abalone_model, which = 4, id.n = 5)
The fourth plot, Cook’s distance, highlights outliers. Because we set id.n = 5, the five most extreme values are labeled. In our plot, we can see one clear outlier.
plot(abalone_model, which = 5, id.n = 5)
This plot helps us find whether any specific data points are influencing the model. The two points leveraging our outcome are observations 1418 and 2052, which were also flagged in the Cook’s distance plot above.
Click here for more information on Residuals vs Leverage plots.
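If you prefer to inspect the diagnostics side by side, you can draw the standard diagnostic plots in a single window:
par(mfrow = c(2, 2)) #arrange plots in a 2 x 2 grid
plot(abalone_model) #draws the standard lm diagnostic plots
par(mfrow = c(1, 1)) #reset the plotting layout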
Exercise:
Adjust the model based on steps 3 & 4.
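One possible adjustment, sketched here, is to refit the model without the influential observations flagged above; whether to drop them is a judgment call, so compare the two fits before committing.
#refit without observations 1418 and 2052 from the leverage diagnostics
abalone_model_trim <- lm(rings ~ length + height + viscera_weight + gender,
                         data = abalone[-c(1418, 2052), ])
summary(abalone_model_trim)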
Use the model to predict your dependent variable. For this model, we will use the fitted coefficients to build a prediction equation for the rings (age) of an abalone. You can use the `coefficients()` function to pull up the value for each variable.
kable(coefficients(abalone_model), col.names = "\u03b2~n~")
 | \(\beta_n\) |
---|---|
(Intercept) | 3.2252267 |
length | 8.9982090 |
height | 22.5942924 |
viscera_weight | -3.8010812 |
genderI | -1.2740566 |
genderM | -0.1722822 |
\[\widehat{Rings} = 3.2252 + 8.9982(Length) + 22.5943(Height) - 3.8011(Viscera\ Weight) - 1.2741(GenderI) - 0.1723(GenderM)\]
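Rather than plugging values into this equation by hand, you can let `predict()` do the arithmetic; the measurements below are made up purely for illustration.
new_abalone <- data.frame(length = 0.5, height = 0.15,
                          viscera_weight = 0.2, gender = "M") #hypothetical abalone measurements
predict(abalone_model, newdata = new_abalone) #predicted rings; add 1.5 for age in years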
Exercise:
Our R-squared value for predicting the age of an abalone was low (0.3607), though keep in mind that a high R-squared does not necessarily indicate a good fit either. Nearly all of our predictor variables were statistically significant, with p-values lower than the \(\alpha\) of 0.05. For this reason, we were still able to draw some conclusions about our variables, but perhaps a Multiple Linear Regression model is not the best way to predict the age of an abalone.
For this tutorial, we referenced the following individuals’ work.
Warwick Nash, Tracy Sellers, Simon Talbot, and Andrew Cawthorn. The Population Biology of Abalone in Tasmania from the North Coast and Islands of Bass Strait. 1994.