Introduction to Multiple Linear Regression in R

In this recitation, we will: (1) load a dataset from the car package, (2) perform a multiple linear regression analysis using lm() and (3) analyze the distribution of the residuals.

1. Loading the dataset

In this recitation, we will use the Prestige dataset from the car package. First, install the car package using install.packages(). Then, load it into your library using library().

Variable name	Description
education	Average education of occupational incumbents in 1971 (in years)
income	Average income of incumbents in 1971 (in dollars)
women	Percentage of incumbents who are women
prestige	Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s
census	Canadian Census occupational code
type	Type of occupation: bc (Blue Collar), prof (Professional, Managerial, and Technical), wc (White Collar)

library(car) # Loading the car package using library()

## Warning: package 'car' was built under R version 3.4.4

## Loading required package: carData

## Warning: package 'carData' was built under R version 3.4.4

dim(Prestige) # Checking the dimensions of the Prestige dataset using dim()

## [1] 102   6

sapply(Prestige, FUN=class) # Viewing the class attribute of each variable using class()

## education    income     women  prestige    census      type 
## "numeric" "integer" "numeric" "numeric" "integer"  "factor"

head(Prestige,10) # Viewing the first 10 observations of the Prestige dataset using head()

##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof
## biologists              15.09   8258 25.65     72.6   2133 prof
## architects              15.44  14163  2.69     78.1   2141 prof
## civil.engineers         14.52  11377  1.03     73.1   2143 prof
## mining.engineers        14.64  11023  0.94     68.8   2153 prof

2. Determinants of occupational prestige

How strongly do education, income and gender correlate with occupational prestige? Using a multiple linear regression model, we will predict occupational prestige scores given the average number of years of education, the average income and the percentage of women within the respondents who have a certain occupation. We call this a multiple linear regression model because we are using more than one independent variable to predict one dependent variable (i.e. prestige).

2.1 Creating a matrix scatterplot using `plot()`

At first glance, here is what the relationship between each pair of variables looks like. Which variable seems more strongly correlated with prestige? Is this a positive or a negative relationship?

library(car)

ourSubset = Prestige[,1:4]

plot(ourSubset, pch=16, col="#f44182", main="Matrix Scatterplot (Education, Income, Women and Prestige)")

2.2 Performing a multiple linear regression analysis using `lm()`

Using lm(), we specify a multiple linear regression formula: dependent variable ~ independent variable #1 + independent variable #2 + independent variable #3. The main difference from simple linear models is that you have more than one predictors (or independent variables).

The lm() function will generate an overall intercept for the model and a regression coefficient for each variable. The intercept represents the mean of the dependent variable when all of the explanatory variables have a value equal to zero. The regression coefficient of a given independent variable represents the mean change in the dependent variable for one unit of change in that independent variable (in other words, the slope).

model <- lm(prestige ~ education + income + women, data=Prestige) # Creating our model using lm()

summary(model) # Viewing our regression results using summary()

## 
## Call:
## lm(formula = prestige ~ education + income + women, data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8246  -5.3332  -0.1364   5.1587  17.5045 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.7943342  3.2390886  -2.098   0.0385 *  
## education    4.1866373  0.3887013  10.771  < 2e-16 ***
## income       0.0013136  0.0002778   4.729 7.58e-06 ***
## women       -0.0089052  0.0304071  -0.293   0.7702    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.846 on 98 degrees of freedom
## Multiple R-squared:  0.7982, Adjusted R-squared:  0.792 
## F-statistic: 129.2 on 3 and 98 DF,  p-value: < 2.2e-16

Here is how to interpret these results:

The intercept is the average prestige score of a given job when its employees have on average 0 years of education, 0$ of income, and when 0% of them are women.
Each regression coefficient represents the additional effect of a one-unit increase for a given independent variable.

Visualization of the 4-dimensional relationship

## Warning: package 'bindrcpp' was built under R version 3.4.4

Don’t forget that linear regression models are ultimately designed to help us predict the value of the dependent variable given the values of the independent variable(s). Here is an example of such a prediction using predict().

# Predicting the prestige score of a theoretical job where education=10, income=15000 and women=40
predict(model,newdata=data.frame(education=10,income=15000,women=40))

##        1 
## 54.41924

2.3 Plotting the residuals using `plot()`, `residuals()` and `density()`

One of the assumptions of linear models is that residuals should be normally distributed. In other words, there should not be a systematic bias in the residuals, which would imply that our model is better at predicting certain types of observations than others.

summary(residuals(model)) # Summarizing the residuals

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -19.8246  -5.3332  -0.1364   0.0000   5.1587  17.5045

plot(density(residuals(model)), col="blue",
     main="Density Plot of the Residuals from our OLS Model") # Plotting the density of the residuals

Does this look normally distributed to you? Why or why not?

Exercises

Exercise 1

Perform a linear regression analysis of the relationship between income (DV) and the two following independent variables: years of education + percentage of women. Which is the strongest predictor, excluding the intercept?

Relevant function: lm().

Exercise 2

In your own words, how do you interpret the intercept and the regression coefficients? In terms of p-value, how convincing are these results?

Exercise 3

Generate a density plot of the residuals of your OLS model. Do the residuals look normally distributed? Why is it important to check for this?

Relevant functions: plot(), residuals() and density().

Exercise 4

Extract the intercept and the coefficients of your OLS model and store these values into a myCoef vector.