Multiple Linear Regression

RPub #3 in a series on data mining with R by

Karen Mazidi

Further information can be found on my blog: http://www.karenmazidi.com/blog.html

In the second RPub in this series we downloaded the brain head data set and fitted a simple linear regression model Head \( \sim \) Brain to see the relation between head size and brain weight. A simple linear model uses one predictor variable whereas multiple linear models use more than one predictor variable. We'll explore multiple linear regression in this demo, but first, let's reload that data and recreate our model.

Loading the data and simple model

# get the data if it is not already in the data folder
data_file = "data/brainhead.dat"
if (!file.exists(data_file)) {
#   dir.create(dirname(data_file), FALSE)
  download.file("http://www.stat.ufl.edu/~winner/data/brainhead.dat", destfile=data_file)
  }
# load the data into R
brain = read.table(data_file, header=FALSE)
# add column headings
colnames(brain) <- c("Gender", "Age", "Head","Brain")
attach(brain)  
model0 = lm(Brain~Head, data=brain)
summary(model0)

## 
## Call:
## lm(formula = Brain ~ Head, data = brain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -175.98  -49.76   -1.76   46.60  242.34 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 325.57342   47.14085   6.906 4.61e-11 ***
## Head          0.26343    0.01291  20.409  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.43 on 235 degrees of freedom
## Multiple R-squared:  0.6393, Adjusted R-squared:  0.6378 
## F-statistic: 416.5 on 1 and 235 DF,  p-value: < 2.2e-16

According to the \( R^2 \) statistic above, this simple linear regression model accounts for about 64% of the variation in brain weight. We will see if adding any of the remaining variables can account for more of this variation.

Qualitative Predictors

The one predictor we used in our simple linear model was quantitative - head size in cubic centimeters. The other two variables we have in the brain data set are Age and Gender which are both qualitative predictors.

Our two qualitative variables have only two possible values, 1 and 2. Coding schemes for two possible values typically either take values of 0 and 1, or 1 and -1. For an overview of coding schemes, see: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm

In our original data, Gender=1 for males and 2 for females. Let's change this to 0 for males and 1 for females. We'll do a similar thing for Age, and then make sure R treats them as categorical variables.

# this method demonstrates how to use the for and ifelse in R
for (i in 1:length(Gender)){  # change 1s to 0s and 2s to 1s
       Gender[i] = ifelse(Gender[i]==1, 0, 1)      
       }
Gender

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

# however, R is designed to work efficiently with vectors, so the following method is simpler and faster
#    and this explains why you don't see many for loops in R: you often don't need them
Age = Age - 1  # 2s become 1s and 1s become 0s
Age

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
## [141] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Age = as.factor(Age)
summary(Age)

##   0   1 
## 110 127

Gender = as.factor(Gender)
summary(Gender)

##   0   1 
## 134 103

Model formula with more that one predictor

To include the Gender variable in our model, we use the formula:

Brain~Head+Gender

In formulas,

Head+Gender means use both Head and Gender as predictors

Head:Gender means use interactions of Head and Gender

Head*Gender means use Head, Gender, and Head:Gender

Looking at the summary of the model, it appears that Gender was not a significant predictor of brain weight. Comparing to just using Head size, there does not seem to be much difference in the coefficient: 0.26343 for Head alone and 0.2509 for Head+Gender.

model1 = lm(Brain~Head+Gender, data=brain)
summary(model1)

## 
## Call:
## lm(formula = Brain ~ Head + Gender, data = brain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -176.671  -52.175    1.487   46.227  236.702 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 396.5754    64.2517   6.172 2.94e-09 ***
## Head          0.2509     0.0150  16.734  < 2e-16 ***
## Gender      -17.8615    11.0266  -1.620    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.18 on 234 degrees of freedom
## Multiple R-squared:  0.6433, Adjusted R-squared:  0.6403 
## F-statistic:   211 on 2 and 234 DF,  p-value: < 2.2e-16

Interpreting multiple regression with qualitative predictors

How do we interpret the above summary?

The linear model for two predictors is given by:

\( Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \)

For females (X2=1) this will be: \( Y = \beta_0 + \beta_1 X_1 + \beta_2 + \epsilon \)

For males (X2=0) this will be: \( Y = \beta_0 + \beta_1 X_1 + \epsilon \)

So we interpret \( \beta_0 + \beta_1 X_1 \) as the average brain weight among males, and \( \beta_0 + \beta_1 X_1 + \beta_2 \) as the average brain weight among females.

Notice that there is no * at the end of the Gender line. So in this model, Gender was not signficant.

\( R^2 \) for multiple predictors

Now that we have more than one predictor, R produces a multiple \( R^2 \) which is the square of the correlation between the observed values of Y and those predicted by the multiple regression model. A model that gets an \( R^2 \)=1 would mean that it perfectly predicts the observed data. That's not likely to happen of course.

AIC

The problem with \( R^2 \) is that as you add more variables it will always go up. So if you are choosing between a model with 4 predictors and one with 3 additional variables, the model with 7 predictors will have a higher \( R^2 \). For this reason some researchers prefer to look at the AIC. Akaike Information Criterion penalizes a model for having more variables. If the AIC is bigger, the fit is worse; if the AIC is smaller, the fit is better. Note that you can only compare AIC on models trained on the same data.

Let's refit the simple model from the last RPub, we'll call it model 0, and then compare the AIC numbers.

Wow, not much difference.

model0 = lm(Brain~Head, data=brain)
AIC(model0)

## [1] 2706.51

AIC(model1)

## [1] 2705.867

Visualizing categorical data

The plot below color codes the data points. The pinkish dots are males and the aqua dots are females. We can see that there are more pinkish dots at the top and more aqua at the bottom, but the majority of the middle space seems a roughly even mix of genders. Recall that we also had slightly more males in the data set.

require(ggplot2)

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.2.4

ggplot(data=brain, aes(x=Head, y=Brain, colour=factor(Gender))) +  geom_point()

plot of chunk unnamed-chunk-5

Interaction

Let's see what happens with the interaction of Head and Gender.

model2 = lm(Brain~Head:Gender, data=brain)
summary(model2)

## 
## Call:
## lm(formula = Brain ~ Head:Gender, data = brain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -321.51  -80.57   -6.64   73.41  344.68 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.385e+03  2.570e+01  53.889  < 2e-16 ***
## Head:Gender -1.993e-02  4.797e-03  -4.155 4.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 116.4 on 235 degrees of freedom
## Multiple R-squared:  0.06842,    Adjusted R-squared:  0.06446 
## F-statistic: 17.26 on 1 and 235 DF,  p-value: 4.565e-05

And what happens with Head, Gender and their interaction

model3 = lm(Brain~Head*Gender, data=brain)
summary(model3)

## 
## Call:
## lm(formula = Brain ~ Head * Gender, data = brain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -171.215  -47.721    0.182   47.768  237.690 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  574.51836  167.54992   3.429 0.000716 ***
## Head           0.20192    0.04519   4.468 1.23e-05 ***
## Gender      -144.21567  110.44279  -1.306 0.192910    
## Head:Gender    0.03544    0.03082   1.150 0.251403    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.13 on 233 degrees of freedom
## Multiple R-squared:  0.6453, Adjusted R-squared:  0.6408 
## F-statistic: 141.3 on 3 and 233 DF,  p-value: < 2.2e-16

Visualizing the effect of a variable

We have seen the interactio of head size and gender, but what would the data look like if we considered only one gender?

# don't know how to do this

Methods of adding predictors

We've gone through quite a bit on analysis looking at adding the Gender variable, and we haven't even talked about the Age variable yet. Imagine a data set with dozens of possible predictor variables. How can we go about selecting predictors in a principled manner? There are a few popular approaches and a future RPub will explore these in detail. For now, we'll throw them all in and see what happens.

In the function notation used below Brain~., the dot indicates that all variables should be used except the response. Below we see that running the linear regression function using all the variables indicates that they all are considered signficant. Gender and Age received one * and Head received three.

model_all = lm(Brain~., data=brain)
summary(model_all)

## 
## Call:
## lm(formula = Brain ~ ., data = brain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -163.772  -48.974    0.381   43.294  247.353 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 464.56281   68.98183   6.735 1.27e-10 ***
## Gender      -22.54325   11.05789  -2.039   0.0426 *  
## Age         -23.96845    9.48065  -2.528   0.0121 *  
## Head          0.24421    0.01506  16.212  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71.36 on 233 degrees of freedom
## Multiple R-squared:  0.6528, Adjusted R-squared:  0.6484 
## F-statistic:   146 on 3 and 233 DF,  p-value: < 2.2e-16

Adjusted \( R^2 \)

With all the variables in the model, we can interpret them as follows. Women's brain weight is less by about 23 grams compared to men, and the brains of people over 46 weigh less by about 24 grams. Does knowing the Age and Gender significantly change the model?

Even though R is telling us that Age and Gender were significant, are they signficant enough to warrant including in the model? Is model_all really better than model0?

Looking back at the summary for model0 and model_all we can compare their \( R^2 \) scores:

Model	\( R^2 \)	Adjusted \( R^2 \)
model0	0.6393	0.6378
model_all	0.6528	0.6484

The statistic \( R^2 \) gives the amount of variability in the outcome is explained by the predictors. The adjusted \( R^2 \) statistic gives an indication of how well the model will generalize and it is good to see that the \( R^2 \) and adjusted \( R^2 \) are close to each other. For model0 the difference between \( R^2 \) and adjusted \( R^2 \) is .0015 and for model_all it is .0044. These numbers suggest that both models would generalize well.

R provides a way to extract data from the model so we don't have to do it manually.

attributes(summary(model0))  # get a list of available attributes

## $names
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 
## 
## $class
## [1] "summary.lm"

summary(model0)$r.squared - summary(model0)$adj.r.squared

## [1] 0.001534844

summary(model_all)$r.squared - summary(model_all)$adj.r.squared

## [1] 0.00446994

anova

The anova() function computes an analysis of variance table for one or more linear model fits. When using the anova function you should make sure that the models were trained on the same data. You should also list the models in a hierarchical order from fewest to most predictors.

The results below show that there is a modestly significant improvement in model_all over model0.

anova(model0, model_all)

## Analysis of Variance Table
## 
## Model 1: Brain ~ Head
## Model 2: Brain ~ Gender + Age + Head
##   Res.Df     RSS Df Sum of Sq     F  Pr(>F)  
## 1    235 1232728                             
## 2    233 1186511  2     46217 4.538 0.01166 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Plots for model analysis

When we plot a model, as shown below, R gives us 4 graphs.

Residuals vs. Fitted
Normal Q-Q
Scale_Location
Residuals vs. Leverage

Next we discuss how to interpret them.

Residuals vs. Fitted

What we are looking for here is that the data points are somewhat evenly and randomly distributed above and below 0. If we see a curve in the graph then perhaps a linear model is not the best fit. If we see the points form a funnel shape then we suspect there is heteroscedasticity in the data.

Q-Q plot

The Q-Q plot shows deviations from normality. There is a faint line the represents a normal distribution and the points are the observed residuals. In this example the points are pretty much on the line. Another way to check is to use the histogram function, which we will talk about later.

Scale-Location plot

This is similar to the residuals vs. fitted plot but uses the square root of the standardized residuals. Like the first plot, we see no discernable pattern, which is what we hope for.

Residuals vs. Leverage plot

This plot helps us determine if our model was driven by a few influential cases. What we are looking for here is not the overall pattern, but whether there are outlying values at the upper right or lower left corners that are far from the dashed red lines, which are Cook's distance lines.

par(mfrow=c(2,2)) 
plot(model0)

plot of chunk unnamed-chunk-12

The plots for model0 look good, let's try them for model_all. They appear pretty good also.

par(mfrow=c(2,2)) 
plot(model0)

plot of chunk unnamed-chunk-13

Histogram check for skewness

If we are concerned about the results based on the residuals vs. fitted graph, we can plot a histogram of the residuals of the model, as shown below. The rstudent() function returns the standardized residuals. We hope to see a fairly normal distribution. This appears to have a little skew, but nothing too dramatic.

par(mfrow=c(2,1))
hist(rstudent(model0))
hist(rstudent(model_all))

plot of chunk unnamed-chunk-14

Looking ahead

In the next RPub we will talk more about model analysis. Specifically, we will discuss methods of selecting predictors, and outliers and influential cases which can bias our model.