Multiple linear Regression is a useful approach for predicting a response based on more than one predictor variables. Here, we will learn how to build a mulitple linear regression model and how to visualise the results.
Occupational prestige is a way for sociologists to describe the relative social class positions people have. It refers to the consensual nature of rating a job based on the belief of its worthiness. But what really determines the prestige score? Education? Income? Let’s find out.
The Prestige of Canadian Occupations data “prestige” is obtained from the “car” package.
education= Average education (years) of occupational incumbents, in 1971. income = Average income (dollars) of incumbents, 1971. women= Percentage of incumbents who are women, 1971. prestige = Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s. census= Canadian Census occupational code. type= Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.
First you will need to install the “car” package
install.package(“car”)
library(car)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
#load the data "Prestige" as p
p<-data.frame(Prestige)
# Explore the data
# look at the structure of the dataset
str(p)
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
By exploring the data, we can see there are 102 observations and 6 variables. The Variables are’education, income, women, prestige, census and type.
‘data.frame’: 102 obs. of 6 variables: $ education: num 13.1 12.3 12.8 11.4 14.6 … $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 … $ women : num 11.16 4.02 15.7 9.11 11.68 … $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 … $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 … $ type : Factor w/ 3 levels “bc”,“prof”,“wc”: 2 2 2 2 2 2 2 2 2 2 …
summary(p)
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
summary() function provides a summary on each variable. Here, it will give us some insgihts of how dirty the data is and we will have clean the data if it is dirty. For example if the maximun value in education is 200 or 200 years old or having a negative value of -100, this may suggest that there maybe some invalid values in education as it doesnt make any sense for someone to study for 200 years or negative 100 years. In this case, the data seemes to look normal.
In this mulitple linear regression example, we will examine which of the 6 variables are good predictors of predicting the prestige score.
Before we do that, lets examine their correlation, by using the cor() function
#check correlation between education and income
cor(p$education,p$prestige)
## [1] 0.8501769
cor(p$income, p$prestige)
## [1] 0.7149057
cor(p$women, p$prestige)
## [1] -0.1183342
cor(p$census, p$prestige)
## [1] -0.6345103
Based on the above results, we can see that both income and education have a high postive correlation with the prestige score. The census is only a Canadian Census occupational code in numeric form, it is funny to see that there is a strong negative correlation with the prestiage code which i would assume there should be none. “Women” represents the Percentage of incumbents who are women, so this tells there is little relationship between Percentage of incumbents who are women and prestige score. As the values for “types” are not in numeric form, we cannot calculate the correlation.
Lets visualise these relationships. We can do this by using the ggplot in tidyverse package
#Create Scatter plot
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages ---------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.4.4
## Warning: package 'tibble' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'readr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
## Warning: package 'forcats' was built under R version 3.4.4
## -- Conflicts ------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::recode() masks car::recode()
## x purrr::some() masks car::some()
ggplot(data=p) + geom_point(mapping = aes(x=education, y=prestige))
ggplot(data=p) + geom_point(mapping = aes(x=income, y=prestige))
ggplot(data=p) + geom_point(mapping = aes(x=women, y=prestige))
ggplot(data=p) + geom_point(mapping = aes(x=type, y=prestige))
Seems like the scatter plots also reflect the correlation values that we calculated earlier. Even we cannot calculate the correlation between “type” and prestige, we can see from the scattor plot that professtionals and white collar have a higher prestige score than blue collar, which kind of make sense.
First: You will need to install the ISLR package.
install.package(“ISLR”)
Then will need to load the package
Lets build a multiple linear regression model, by setting “prestige” as a dependent variable and other variables as independent variables. We shouldnt be including the “census” in the model given that it is only a numeric occupational code, but let see if the model can identify that it is not a good predictor of prestige.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.4.4
#Run linear regression
p_lm=lm(formula = prestige ~., data=p)
summary(p_lm)
##
## Call:
## lm(formula = prestige ~ ., data = p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9863 -4.9813 0.6983 4.8690 19.2402
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.213e+01 8.018e+00 -1.513 0.13380
## education 3.933e+00 6.535e-01 6.019 3.64e-08 ***
## income 9.946e-04 2.601e-04 3.824 0.00024 ***
## women 1.310e-02 3.018e-02 0.434 0.66524
## census 1.156e-03 6.183e-04 1.870 0.06471 .
## typeprof 1.077e+01 4.676e+00 2.303 0.02354 *
## typewc 2.877e-01 3.139e+00 0.092 0.92718
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.037 on 91 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.841, Adjusted R-squared: 0.8306
## F-statistic: 80.25 on 6 and 91 DF, p-value: < 2.2e-16
The variables for “education”, “income” and “typerof” have a p -value of less than 0.05 which suggests that they are statistically significant in predicting the value “prestige”. The Adjusted R-squared is a measure of fit, a higher value close to 1 means better fit. In this case, the adjusted R square is very high at 0.8306. Let’s see if we can improve our model by removing the statiscally insignificant variables with p value over 0.05.
library(ISLR)
#Run linear regression
p_lm2=lm(formula = prestige ~ education + income + type, data=p)
summary(p_lm2)
##
## Call:
## lm(formula = prestige ~ education + income + type, data = p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9529 -4.4486 0.1678 5.0566 18.6320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6229292 5.2275255 -0.119 0.905
## education 3.6731661 0.6405016 5.735 1.21e-07 ***
## income 0.0010132 0.0002209 4.586 1.40e-05 ***
## typeprof 6.0389707 3.8668551 1.562 0.122
## typewc -2.7372307 2.5139324 -1.089 0.279
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.095 on 93 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.8349, Adjusted R-squared: 0.8278
## F-statistic: 117.5 on 4 and 93 DF, p-value: < 2.2e-16
The Adjusted R-squared is lowered and we can also see that the variables for “type” is statiscally insignificant with p-value over 0.05. Let’s remove “type” and see if we can improve this model
p_lm3=lm(formula = prestige ~ education + income, data=p)
summary(p_lm3)
##
## Call:
## lm(formula = prestige ~ education + income, data = p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4040 -5.3308 0.0154 4.9803 17.6889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.8477787 3.2189771 -2.127 0.0359 *
## education 4.1374444 0.3489120 11.858 < 2e-16 ***
## income 0.0013612 0.0002242 6.071 2.36e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.81 on 99 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.7939
## F-statistic: 195.6 on 2 and 99 DF, p-value: < 2.2e-16
##calcuate Mean squared error
#First model
mean(p_lm$residuals^2)
## [1] 45.97815
#Second Model
mean(p_lm2$residuals^2)
## [1] 47.76812
#Third Model
mean(p_lm3$residuals^2)
## [1] 59.20442
mean squared error (MSE) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and what is estimated. It is always non-negative, and values closer to zero are better.
We can also plot some graphs for all the models. This can be done by using the plot() function
#plot linear regression for regression 1
layout(matrix(1:4, 2, 2))
plot(p_lm)
#plot linear regression for regression 2
layout(matrix(1:4, 2, 2))
plot(p_lm2)
#plot linear regression for regression 3
layout(matrix(1:4, 2, 2))
plot(p_lm3)
Based on the above graphs, we can see that the dependent variable “prestige” is normally distributed.
To further check on outliers and influence point, we can also plot the influence plot graphs.
This function creates a “bubble” plot of Studentized residuals versus hat values, with the areas of the circles representing the observations proportional to the value Cook’s distance. Vertical reference lines are drawn at twice and three times the average hat value, horizontal reference lines at -2, 0, and 2 on the Studentized-residual scale.
#check influence for regression 1
influencePlot(p_lm, id.n=3)
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## StudRes Hat CookD
## general.managers -0.9199659 0.43760411 0.09423639
## medical.technicians 2.9700914 0.07964707 0.10042624
## electronic.workers 2.0655585 0.07809617 0.04984309
## pilots -0.9288104 0.34407247 0.06474496
#check influence for regression 2
influencePlot(p_lm2, id.n=3)
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## StudRes Hat CookD
## general.managers -1.3134574 0.33504477 0.172503975
## physicians -0.3953204 0.22420309 0.009115491
## medical.technicians 2.8210910 0.06858836 0.109052582
## electronic.workers 2.2251940 0.02701237 0.026372394
#check influence for regression 3
influencePlot(p_lm3, id.n=3)
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## StudRes Hat CookD
## general.managers -1.5101887 0.27146172 0.27965035
## physicians -0.9223321 0.20339599 0.07251181
## newsboys -2.5960873 0.03105248 0.06805145
## farmers 2.3517921 0.03011295 0.05473610
## to check which one has high residuals for regression 1
qqPlot(p_lm, id.n=3)
## medical.technicians electronic.workers
## 31 82
## to check which one has high residuals for regression 2
qqPlot(p_lm2, id.n=3)
## medical.technicians electronic.workers
## 31 82
## to check which one has high residuals for regression 3
qqPlot(p_lm3, id.n=3)
## newsboys farmers
## 53 67
# id.n – id observations with high residual
Cook’s distance measures how much an observation influences the overall model or predicted values
Studentizided residualis the quotient resulting from the division of a residual by an estimate of its standard deviation.
Bonferroni test is a type of multiple comparison test used in statistical analysis. When an experimenter performs enough hypothesis tests, he or she will eventually end up with a result that shows statistical significance of the dependent variable, even if there is none.
Hat-points identify influential observations (have a high impact on the predictor variables)
#Regression model 1
influenceIndexPlot(p_lm, id.n = 3)
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
#Regression model 2
influenceIndexPlot(p_lm2, id.n = 3)
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
#Regression model 3
influenceIndexPlot(p_lm3, id.n = 3)
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.window(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.n" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.n" is not
## a graphical parameter
## Warning in box(...): "id.n" is not a graphical parameter
## Warning in title(...): "id.n" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.n" is not a
## graphical parameter
Null for the Bonferonni adjusted outlier test is the observation is an outlier.
For Regression 1, observation related to ‘medical.technicians’ is an outlier. For Regression 2, observation related to ‘medical.technicians’ is an outlier. For Regression 3, observation related to ‘newsboys’ is an outlier.
outlierTest(p_lm)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## medical.technicians 2.970091 0.0038164 0.374
outlierTest(p_lm2)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## medical.technicians 2.821091 0.0058632 0.57459
outlierTest(p_lm3)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## newsboys -2.596087 0.010879 NA
Based on the results above, a person with a higher “income” and “education” will have a higher prestige score.
Now, you should be able to build a multiple linear regression model and how to visualize the results using R packages.
Fox, J. (2008) Applied Regression Analysis and Generalized Linear Models, Second Edition. Sage.
Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition, Sage.