Multiple Linear Regression is an extension of Linear Regression. Instead of using one X variable to predict a response variable, Multiple Regression uses two or more X variables to predict a quantitative output, Y. The following equation is used for Multiple Regression.
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n + \epsilon \]
Y = Dependent (response) variable
\(X_n\) = Independent (predictor) variables
\(\beta_0\) = Intercept
\(\beta_n\) = Coefficient (slope) for each independent variable n
\(\epsilon\) = Mean-zero random error term (variability in Y the model is unable to explain)
library(usdm) #provides functions for exploring the different sources of uncertainties on distribution models
library(corrplot) #graphical display of correlation matrix and confidence intervals
library(knitr) #use of kable to create nice tables
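If any of these packages are missing from your library, a one-time install from CRAN should work:
install.packages(c("usdm", "corrplot", "knitr")) #one-time installation from CRAN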
Abalones are marine snails. The flesh of abalone is widely considered a desirable food and is consumed raw or cooked by a variety of cultures!
This data was collected in an attempt to predict the age of an abalone from certain physical measurements: height, length, diameter, weight, etc. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.¹
Name | Data Type | Description |
---|---|---|
Sex | nominal | M, F, and I (infant) |
Length | continuous | longest shell measurement (mm) |
Diameter | continuous | perpendicular to length (mm) |
Height | continuous | with meat in shell (mm) |
Whole Weight | continuous | whole abalone (grams) |
Shucked Weight | continuous | weight of meat (grams) |
Viscera Weight | continuous | gut weight after bleeding (grams) |
Shell Weight | continuous | after being dried (grams) |
Rings | integer | +1.5 gives the age in years |
The data set and more information are available here. Use the code below to import the data directly.
abalone <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",
sep = ",",
col.names = c("gender", "length", "diameter", "height", "whole_weight",
"shucked_weight", "viscera_weight", "shell_weight", "rings"),
fill = TRUE)
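To confirm the import worked, it can help to glance at the structure of the data frame:
str(abalone) #quick sanity check: 4177 observations of 9 variables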
Hypothesize which independent variables you think will be good predictors of your dependent variable. One way to decide which variables to choose is by checking the VIF (variance inflation factor) scores for each of the independent variables; these tell you whether variables are multicollinear. Using the `vif()` function from the `usdm` package, you can check VIF scores for your entire data set. You will get VIF scores for all of your numeric variables.
vif(abalone)
A high VIF score means that there is multicollinearity between certain independent variables, so we need to eliminate predictor variables that are highly correlated with each other. The abalone data set has four separate measurements of weight: whole weight, shucked weight, viscera weight, and shell weight. Because these variables all measure the weight of an abalone, they are highly correlated with each other, and their high VIF scores confirm this. Among the weights, we want to keep the one with the lowest VIF score: viscera weight, with a VIF of 17.725. Next, you'll notice that length and diameter both have high VIF scores because they are correlated with each other; again, we keep the predictor with the lower VIF score, which is length. This leaves us with four variables (viscera weight, length, height, and gender) that we believe might be good predictors of our dependent variable, rings.
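As a quick check, you can re-run `vif()` on just the numeric predictors we kept; with the redundant weight and size measurements removed, the scores should drop considerably (a sketch, assuming the column names from the import code above):
vif(abalone[, c("length", "height", "viscera_weight")]) #VIF scores for the retained numeric predictors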
If you're not familiar with VIF scores, you can instead check for correlation across the data set using `cor.test()` on pairs of variables or `corrplot()` for a graphical overview of the full correlation matrix.
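For example, a minimal sketch of the `corrplot()` approach, computing the correlation matrix of the numeric columns first:
num_vars <- abalone[sapply(abalone, is.numeric)] #drop the nominal gender column
corrplot(cor(num_vars), method = "number") #display pairwise correlations as numbers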
Exercise:
Use your data and your selected independent variables to create a model. This will give you your coefficients for each independent variable.
For this step, we used the `lm()` function. Our response variable is the age of the abalone, which is measured by the number of rings. Based on the VIF scores, the predictor variables we decided to use were viscera weight, length, height, and gender. Gender is categorical, so it received no VIF score, and there is no correlation-based reason to exclude it as a possible variable.
abalone_model <- lm(rings ~ length + height + viscera_weight + gender, data = abalone)
Evaluate each of the independent variables and the model as a whole. Use the `summary()` function to evaluate the abalone model.
summary(abalone_model)
##
## Call:
## lm(formula = rings ~ length + height + viscera_weight + gender,
## data = abalone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.4100 -1.5964 -0.5629 0.8421 16.5181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2252 0.3108 10.376 < 2e-16 ***
## length 8.9982 0.8524 10.557 < 2e-16 ***
## height 22.5943 1.7493 12.916 < 2e-16 ***
## viscera_weight -3.8011 0.8757 -4.340 1.46e-05 ***
## genderI -1.2741 0.1190 -10.705 < 2e-16 ***
## genderM -0.1723 0.0975 -1.767 0.0773 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.58 on 4171 degrees of freedom
## Multiple R-squared: 0.3607, Adjusted R-squared: 0.3599
## F-statistic: 470.6 on 5 and 4171 DF, p-value: < 2.2e-16
The numbers under the “Estimate” column are the coefficients for each of the predictor variables. To evaluate the significance of each variable, look at the “Pr(>|t|)” column. The significance codes are listed underneath the table; the more stars, the more significant the predictor variable. Notice that all of the p-values are less than 0.05 (the most common \(\alpha\) for determining significance) except for “genderM.” Because the gender variable has three levels, male (M), female (F), and infant (I), R created dummy variables: “genderI” and “genderM” act as binary indicator variables in the regression model, with female (F) as the reference level.
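If you're curious how R coded these dummy variables, `contrasts()` shows the coding matrix (a quick sketch; each non-reference level gets one dummy column):
contrasts(factor(abalone$gender)) #the F row is all zeros because female is the reference level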
In the bottom section of the summary output, look at the “Residual standard error,” “Multiple R-squared,” and “F-statistic.” The residual standard error is the estimate of the standard deviation of \(\epsilon\), so a smaller value is better. The Multiple R-squared, also called the coefficient of determination, is the proportion of the variance in the data that is explained by the model. Adding more variables always increases the Multiple R-squared, even if they don't help; the Adjusted R-squared compensates by penalizing for the number of variables in the model. The Multiple R-squared for the abalone data set is 0.3607, which indicates that roughly 36% of the variance is explained. A good model will have a high F-statistic and a small p-value based on the \(\alpha\) chosen for the model.
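If you would rather extract these quantities programmatically than read them off the printout, they are stored in the object returned by `summary()`:
abalone_summary <- summary(abalone_model)
abalone_summary$sigma #residual standard error
abalone_summary$r.squared #multiple R-squared
abalone_summary$adj.r.squared #adjusted R-squared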
Exercise:
Run the model again but leave out gender as a variable. Run the summary statistics and evaluate the new variables and model.
Is the R-squared better or worse?
Which variables are significant using an \(\alpha\) of 0.05?
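To get started on this exercise, the refit looks something like the sketch below; compare its summary output to the one above.
abalone_model_no_gender <- lm(rings ~ length + height + viscera_weight, data = abalone)
summary(abalone_model_no_gender) #evaluate the new coefficients and R-squared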
Check the assumptions. You want your residuals to be normally distributed. In the “Residuals” section of the summary output, the median should therefore hover around 0, and the first and third quartiles (1Q and 3Q) should be roughly equal in absolute value.
You can test all of the assumptions using the plot function and the model you created.
plot(abalone_model, which = 1, id.n = 5)
The first plot tests for linearity and heteroskedasticity. To learn more about heteroskedasticity, click here. Based on this plot, you can see that the residuals are heteroskedastic, which indicates dependency between the residuals and the fitted values.
This link shows an example of a good Residuals vs. Fitted plot.
plot(abalone_model, which = 2, id.n = 5)
The second plot tests for the normality of the residuals. This plot shows that our residuals are “heavy-tailed,” based on the description in this link.
plot(abalone_model, which = 3, id.n = 5)
The third plot also tests for heteroskedasticity. Like plot 1, the residuals are not evenly spread across the line.
plot(abalone_model, which = 4, id.n = 5)
The fourth plot, Cook’s distance, highlights outliers. Because we set id.n = 5, the five most extreme values are labeled. In our plot, we can see one clear outlier.
plot(abalone_model, which = 5, id.n = 5)
This plot helps us find whether any specific data points are influencing the model. The two points leveraging our outcome are observations 1418 and 2052, which were also flagged in the Cook’s distance plot above.
Click here for more information on Residuals vs Leverage plots.
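If you prefer to inspect the diagnostics side by side, you can draw the standard diagnostic plots in a single window:
par(mfrow = c(2, 2)) #arrange plots in a 2 x 2 grid
plot(abalone_model) #draws the standard lm diagnostic plots
par(mfrow = c(1, 1)) #reset the plotting layout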
Exercise:
Adjust the model based on steps 3 & 4.
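One possible adjustment, sketched here, is to refit the model without the influential observations flagged above; whether to drop them is a judgment call, so compare the two fits before committing.
#refit without observations 1418 and 2052 from the leverage diagnostics
abalone_model_trim <- lm(rings ~ length + height + viscera_weight + gender,
                         data = abalone[-c(1418, 2052), ])
summary(abalone_model_trim)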
Use the model to predict your dependent variable. For this model, we will use the fitted coefficients to build a prediction equation for the rings (age) of an abalone. You can use the `coefficients()` function to pull up the value for each variable.
kable(coefficients(abalone_model), col.names = "\u03b2~n~")
 | \(\beta_n\) |
---|---|
(Intercept) | 3.2252267 |
length | 8.9982090 |
height | 22.5942924 |
viscera_weight | -3.8010812 |
genderI | -1.2740566 |
genderM | -0.1722822 |
\[\widehat{Rings} = 3.2252 + 8.9982(Length) + 22.5943(Height) - 3.8011(Viscera\ Weight) - 1.2741(GenderI) - 0.1723(GenderM)\]
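Rather than plugging values into this equation by hand, you can let `predict()` do the arithmetic; the measurements below are made up purely for illustration.
new_abalone <- data.frame(length = 0.5, height = 0.15,
                          viscera_weight = 0.2, gender = "M") #hypothetical abalone measurements
predict(abalone_model, newdata = new_abalone) #predicted rings; add 1.5 for age in years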
Exercise:
Our R-squared value for predicting the age of an abalone was low (0.3607), though keep in mind that a high R-squared does not necessarily indicate a good fit either. Nearly all of our predictor variables were statistically significant, with p-values lower than the \(\alpha\) of 0.05. For this reason, we were still able to draw some conclusions about our variables, but perhaps a Multiple Linear Regression model is not the best way to predict the age of an abalone.
For this tutorial, we referenced the following individuals’ work.
Warwick Nash, Tracy Sellers, Simon Talbot, and Andrew Cawthorn. The Population Biology of Abalone in Tasmania from the North Coast and Islands of Bass Strait. 1994.