Modified from "Simple Linear Regression - An example using R" and "Multiple Regression Analysis in R - First Steps" by Felipe Rego.
This notebook uses the Prestige dataset from the car package (loaded via library(car)) to illustrate the application of linear regression.
The Prestige dataset is a data frame with 102 rows and 6 columns. Each row is an observation that relates to an occupation. The columns relate to predictors such as average years of education, percentage of women in the occupation, prestige of the occupation, etc.
# Load the dataset and utilities libraries
library(car)
library(tidyverse)
library(corrplot)
library(rgl)
library(rglwidget)
library(scatterplot3d)
library(knitr)
# knit_hooks$set(webgl = hook_webgl)
# Inspect the data
head(Prestige)
# View the data structure
str(Prestige)
'data.frame': 102 obs. of 6 variables:
$ education: num 13.1 12.3 12.8 11.4 14.6 ...
$ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
$ women : num 11.16 4.02 15.7 9.11 11.68 ...
$ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
$ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
$ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
# Summarize the data
summary(Prestige)
   education         income           women          prestige        census
Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80 Min. :1113
1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23 1st Qu.:3120
Median :10.540 Median : 5930 Median :13.600 Median :43.60 Median :5135
Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83 Mean :5402
3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27 3rd Qu.:8312
Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20 Max. :9517
type
bc :44
prof:31
wc :23
NA's: 4
The first part of the notebook illustrates a simple linear regression model. Our goal is to predict income using a single predictor, education, which refers to the average number of years of education in each profession.
# Subset the data to capture only `income` and `education`
prestige <- Prestige %>% dplyr::select(income, education)
First, we’ll do some initial exploration of our variables.
We use histograms to inspect how the variables are distributed.
# Histogram of `education`
ggplot(prestige, aes(x = education)) +
geom_histogram(binwidth = 1) +
geom_vline(xintercept = mean(prestige$education),
color = "indianred") +
geom_vline(xintercept = median(prestige$education),
color = "cornflowerblue")# Histogram of `income`
ggplot(prestige, aes(x = income)) +
geom_histogram(binwidth = 1000) +
geom_vline(xintercept = mean(prestige$income),
color = "indianred") +
geom_vline(xintercept = median(prestige$income),
color = "cornflowerblue")Now we draw a scatterplot to visualize their relationship.
# Scatterplot of `income` and `education`
ggplot(prestige, aes(x = education, y = income)) +
geom_point()
Each point in the graph represents a profession. Observe how income changes as years of education increase.
Here, we’ll fit a linear regression and see how well this model fits the observed data. In the most simplistic form, the equation we want to solve is:
\[ income = B_0 + B_1 \times education \]
The model will estimate the coefficients: the intercept (\(B_0\)) and the slope (\(B_1\)). In our case, the intercept is the expected income at the average number of years of education, and the slope is the average increase in income associated with a one-unit increase in years of education. We want our model to fit a line across the observed relationship in a way that the fitted line is as close as possible to all data points.
Before we fit the model, we create a new variable called education.c, which centers education on its mean. This transformation is applied to the education variable so that the intercept estimate has a meaningful interpretation.
Had we not centered education, the model would have produced a negative intercept estimate with a nonsensical meaning (essentially saying that for zero years of education, income is estimated to be negative - or the same as saying that no education means you owe money!)
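As a quick illustration of this point (a small sketch, not part of the original analysis), we could fit the uncentered model and look at its coefficients; since centering leaves the slope unchanged, the uncentered intercept equals the mean of income minus the slope times the mean of education, which is negative here.
# Sketch: fit on the raw (uncentered) predictor to see the negative intercept
coef(lm(income ~ education, data = prestige))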
# Center `education` on its mean
education.c <- scale(prestige$education, center = TRUE, scale = FALSE)
We use the lm() function to fit our linear model. Here we are using the least squares approach.
# Fit a linear model and run a summary of its results
set.seed(1)
model <- lm(income ~ education.c, data = prestige)
summary(model)
Call:
lm(formula = income ~ education.c, data = prestige)
Residuals:
Min 1Q Median 3Q Max
-5493.2 -2433.8 -41.9 1491.5 17713.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6797.9 344.9 19.709 < 2e-16 ***
education.c 898.8 127.0 7.075 2.08e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3483 on 100 degrees of freedom
Multiple R-squared: 0.3336, Adjusted R-squared: 0.3269
F-statistic: 50.06 on 1 and 100 DF, p-value: 2.079e-10
Let’s examine how the fitted line looks in a scatterplot.
ggplot(prestige, aes(x = education.c, y = income)) +
geom_point() +
stat_smooth(method = "lm", col = "indianred") +
scale_y_continuous(breaks = seq(1000, 25000, by = 2000), minor_breaks = NULL)
From the model output and the scatterplot we can make some interesting observations:
Visually, the scatterplot indicates that the relationship between income and education does not follow a straight line. While this visual inspection alone is not sufficient evidence of non-linearity, it may suggest the relationship is in fact non-linear. Observe that our fitted line does not seem to follow the pattern observed across all points.
While we can see a significant p-value (very close to zero), the model does not yield a strong \(R^2\). \(R^2\) (the coefficient of determination) measures the proportion of the variance in income explained by education. The closer it is to 1, the better the model explains the observed variance. In our model results, the \(R^2\) we get is 0.33, a pretty low score. This suggests the linear model we just fit to the data explains only about 33% of the variance observed in the data.
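As an optional sanity check (a small sketch, not part of the original analysis), \(R^2\) can be recovered by hand as one minus the ratio of the residual sum of squares to the total sum of squares:
# Sketch: compute R-squared manually from the fitted simple regression model
rss <- sum(residuals(model)^2)
tss <- sum((prestige$income - mean(prestige$income))^2)
1 - rss / tss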
Another interesting point from the model output is the residual standard error which measures the average amount of income that will deviate from the true regression line for any given point. In our example, any prediction of income on the basis of education will be off by an average of $3,483! A fairly large number.
Given that the residual standard error for income is $3,483 and the mean income is $6,798, the average percentage error for any given point is more than 51%! Again, a pretty large error rate.
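That rough percentage can be computed directly from the model object (again, just a quick sketch):
# Sketch: residual standard error relative to mean income (roughly 0.51)
summary(model)$sigma / mean(prestige$income)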
To throw some further evidence supporting the lack of model fit, let’s plot the residuals against the predicted values:
plot(model, which = 1, pch = 16)
The graph above shows the model residuals (the differences between the observed income values and the fitted values) plotted against the fitted values (the model’s predicted values of income).
Ideally, when the model fits the data well, the residuals would be randomly scattered around the horizontal line. In our case here, there is strong evidence a non-linear pattern is present in the relationship. Also, there are points standing far away from the horizontal line. This could indicate the presence of outliers (note how the points for general managers, physicians and lawyers are way out there!).
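If we wanted to name the extreme points programmatically (a small sketch; the cut-off of 2 standardized residuals is just a rule of thumb, not a formal test), we could do something like:
# Sketch: flag observations whose standardized residuals exceed 2 in absolute value
std_res <- rstandard(model)
rownames(Prestige)[abs(std_res) > 2]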
Now we’ll extend our linear regression model to include multiple predictors.
Our goal is to predict income using three predictors: women, prestige, and education. Remember that education refers to the average number of years of education in each profession. The women variable is the percentage of women in the profession, and the prestige variable is a Pineo-Porter prestige score for each occupation, taken from a social survey conducted in the mid-1960s.
# Subset the data to capture only `income`, `women`, `prestige`, `education`
prestige <- Prestige %>% dplyr::select(income, women, prestige, education)
summary(prestige)
     income           women          prestige        education
Min. : 611 Min. : 0.000 Min. :14.80 Min. : 6.380
1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23 1st Qu.: 8.445
Median : 5930 Median :13.600 Median :43.60 Median :10.540
Mean : 6798 Mean :28.979 Mean :46.83 Mean :10.738
3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27 3rd Qu.:12.648
Max. :25879 Max. :97.510 Max. :87.20 Max. :15.970
We first use a matrix scatterplot to visualize the relationship between the variables.
plot(prestige, pch = 16, col = "cornflowerblue")
The matrix plot above allows us to visualize the relationships among all variables in one single image.
Here we see that average income increases with average years of education. Prestige shows a similar pattern to education when plotted against income. Interestingly, as the percentage of women in a profession increases, average income in the profession declines.
We’ll start by fitting a linear regression on this dataset and see how well it models the observed data. We’ll add all other predictors and give each of them a separate slope coefficient. We want to estimate the relationship and fit a plane (note that in a multi-dimensional setting, with two or more predictors and one response, the least squares regression line becomes a plane) that explains this relationship.
For our multiple linear regression example, we want to solve the following equation:
\[ income = B_0 + B_1 \times education + B_2 \times prestige + B_3 \times women \]
The model will estimate the value of the intercept (\(B_0\)) and each predictor’s slope (\(B_1\)) for education, (\(B_2\)) for prestige and (\(B_3\)) for women. The intercept is the average expected income value for the average value across all predictors. The value for each slope estimate will be the average increase in income associated with a one-unit increase in each predictor value, holding the others constant. We want our model to fit a line or plane across the observed relationship in a way that the line/plane created is as close as possible to all data points.
First we create a centered version of each predictor variable, with .c appended to its name. Each new variable is centered on its mean. This transformation is applied so that the intercept estimate has a meaningful interpretation.
# Center all predictors on their means
education.c <- scale(prestige$education, center = TRUE, scale = FALSE)
prestige.c <- scale(prestige$prestige, center = TRUE, scale = FALSE)
women.c <- scale(prestige$women, center = TRUE, scale = FALSE)
# Bind the new variables into a dataframe
centered_predictors <- cbind(education.c, prestige.c, women.c)
prestige <- cbind(prestige, centered_predictors)
names(prestige)[5:7] <- c("education.c", "prestige.c", "women.c")
summary(prestige)
     income           women          prestige        education       education.c
Min. : 611 Min. : 0.000 Min. :14.80 Min. : 6.380 Min. :-4.358
1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23 1st Qu.: 8.445 1st Qu.:-2.293
Median : 5930 Median :13.600 Median :43.60 Median :10.540 Median :-0.198
Mean : 6798 Mean :28.979 Mean :46.83 Mean :10.738 Mean : 0.000
3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27 3rd Qu.:12.648 3rd Qu.: 1.909
Max. :25879 Max. :97.510 Max. :87.20 Max. :15.970 Max. : 5.232
prestige.c women.c
Min. :-32.033 Min. :-28.98
1st Qu.:-11.608 1st Qu.:-25.39
Median : -3.233 Median :-15.38
Mean : 0.000 Mean : 0.00
3rd Qu.: 12.442 3rd Qu.: 23.22
Max. : 40.367 Max. : 68.53
Now we fit the linear regression model.
# Fit a linear model and run a summary of its results
model1 <- lm(income ~ education.c + prestige.c + women.c, data = prestige)
summary(model1)
Call:
lm(formula = income ~ education.c + prestige.c + women.c, data = prestige)
Residuals:
Min 1Q Median 3Q Max
-7715.3 -929.7 -231.2 689.7 14391.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6797.902 254.934 26.665 < 2e-16 ***
education.c 177.199 187.632 0.944 0.347
prestige.c 141.435 29.910 4.729 7.58e-06 ***
women.c -50.896 8.556 -5.948 4.19e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2575 on 98 degrees of freedom
Multiple R-squared: 0.6432, Adjusted R-squared: 0.6323
F-statistic: 58.89 on 3 and 98 DF, p-value: < 2.2e-16
Centering of the predictors on their mean allows us to say that the estimated income is $6,798 when we consider the average number of years of education, the average percent of women and the average prestige from the dataset.
From the model output and the scatterplot we can make some interesting observations:
For any given level of education and prestige, a one percentage point increase in the share of women in a profession is associated with an average income decline of about $50.9. Similarly, for any given level of education and percentage of women, a one point increase in prestige is associated with an extra $141.4 in average income.
Note also our adjusted R-squared value (we now look at adjusted R-squared as a more appropriate measure of explained variability, since it increases only if a newly added term improves the model more than would be expected by chance). In this model, we arrive at a larger adjusted R-squared of about 0.63 (compared to roughly 0.33 from our simple linear regression exercise).
Recall from our previous simple linear regression example that our centered education predictor had a significant p-value (close to zero). But in the multiple regression output above, education no longer displays a significant p-value. Here, education represents its average effect while holding the other variables, women and prestige, constant. From the matrix scatterplot shown above, we can see how closely the pattern income takes when plotted against education matches the pattern it takes against prestige. So in essence, when they are put together in the model, education is no longer significant after adjusting for prestige. When two or more predictor variables are strongly correlated, we face a problem of collinearity (the predictors are collinear).
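Another common check for collinearity (a small optional sketch, not part of the original write-up) is the variance inflation factor, available through the car package we already loaded:
# Sketch: variance inflation factors for the full model; values well above
# roughly 5 (or 10, depending on the rule of thumb) suggest a predictor is
# largely explained by the other predictors
car::vif(model1)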
Let’s validate this situation with a correlation plot:
prestige_cor <- cor(prestige[1:4])
corrplot(prestige_cor, method = "number")
The correlation matrix shown above highlights the situation we encountered with the model output. Notice that the correlation between education and prestige is very high at 0.85. This reveals that each profession’s level of education is strongly aligned with its level of prestige. So in essence, education’s high p-value indicates that women and prestige are related to income, but there is no evidence that education is associated with income once these other two predictors are also in the model.
The model output can also help answer whether there is a relationship between the response and the predictors used. We can use the F-statistic to test the null hypothesis that all slope coefficients are equal to zero. The F-statistic from our model is 58.89 on 3 and 98 degrees of freedom. Given this large value and the very small associated p-value, we have strong evidence that at least one of the predictors is associated with income.
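This overall F-test is equivalent to comparing the fitted model against an intercept-only model, which we could verify with a quick sketch:
# Sketch: the F statistic from this comparison should match the one in summary(model1)
null_model <- lm(income ~ 1, data = prestige)
anova(null_model, model1)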
Given that we have indications that at least one of the predictors is associated with income, and based on the fact that education here has a high p-value, we can consider removing education from the model and see how the model fit changes (we are not going to run a variable selection procedure such as forward, backward or mixed selection in this example):
# Fit a linear model excluding `education`
model2 <- lm(income ~ prestige.c + women.c, data = prestige)
summary(model2)
Call:
lm(formula = income ~ prestige.c + women.c, data = prestige)
Residuals:
Min 1Q Median 3Q Max
-7620.9 -1008.7 -240.4 873.1 14180.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6797.902 254.795 26.680 < 2e-16 ***
prestige.c 165.875 14.988 11.067 < 2e-16 ***
women.c -48.385 8.128 -5.953 4.02e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2573 on 99 degrees of freedom
Multiple R-squared: 0.64, Adjusted R-squared: 0.6327
F-statistic: 87.98 on 2 and 99 DF, p-value: < 2.2e-16
Excluding education has in fact improved our F-statistic from 58.89 to 87.98, but no substantial improvement was achieved in the residual standard error or the adjusted R-squared. This is possibly due to the presence of outlier points in the data.
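A more formal way to check that dropping education.c loses little explanatory power (a brief sketch, under the usual nested-model assumptions) is a partial F-test comparing the reduced and full models:
# Sketch: partial F-test for education.c; a large p-value supports dropping it
anova(model2, model1)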
Let’s plot this last model’s residuals:
plot(model2, which = 1, pch = 16)
Let’s visualize a three-dimensional interactive graph with both predictors, the target variable, and the linear fit from the last model:
new_data <- expand.grid(prestige.c = seq(-35, 45, by = 5),
women.c = seq(-25, 70, by = 5))
new_data$plane <- predict(model2, newdata = new_data)
with(prestige, plot3d(prestige.c, women.c, income, size = 1, type = "s"))
with(new_data, surface3d(unique(prestige.c), unique(women.c), plane,
alpha = 0.5, color = "tomato2",
front = "fill", back = "fill"))
rglwidget()
Note from the 3D graph above how this view more clearly highlights the pattern that exists across prestige and women relative to income. This interactive view also lets us see more clearly those three or four outlier points, as well as how well our last linear model fits the data.
At this stage we could try a few different transformations on both the predictors and the response variable to see how they would improve the model fit. For now, let’s apply a logarithmic transformation to the income variable with the log function (log uses the natural logarithm; if base 10 is desired, log10 is the function to use). We’ll also add squared terms for both predictors. Let’s apply these transformations directly inside the model formula and see what happens to the model fit.
# Fit a model excluding `education`, log the `income`
model3 = lm(log(income) ~ prestige.c + I(prestige.c^2) + women.c + I(women.c^2),
data = prestige)
summary(model3)
Call:
lm(formula = log(income) ~ prestige.c + I(prestige.c^2) + women.c +
I(women.c^2), data = prestige)
Residuals:
Min 1Q Median 3Q Max
-1.01614 -0.10973 0.00966 0.14479 0.80844
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.809e+00 5.944e-02 148.188 < 2e-16 ***
prestige.c 2.518e-02 1.787e-03 14.096 < 2e-16 ***
I(prestige.c^2) -2.605e-04 9.396e-05 -2.773 0.00666 **
women.c -6.306e-03 1.476e-03 -4.271 4.53e-05 ***
I(women.c^2) -7.194e-05 4.014e-05 -1.792 0.07620 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.293 on 97 degrees of freedom
Multiple R-squared: 0.7643, Adjusted R-squared: 0.7546
F-statistic: 78.64 on 4 and 97 DF, p-value: < 2.2e-16
# Plot model residuals.
plot(model3, which = 1, pch = 16)
new_data2 <- expand.grid(prestige.c = seq(-35, 45, by = 5),
women.c = seq(-25, 70, by = 5))
new_data2$plane <- predict(model3, newdata = new_data2)
with(prestige, plot3d(prestige.c, women.c, log(income), size = 1, type = "s"))
with(new_data2, surface3d(unique(prestige.c), unique(women.c), plane,
alpha = 0.5, color = "tomato2",
front = "fill", back = "fill"))
rglwidget()
By transforming both the predictors and the target variable, we achieve an improved model fit. Note how the adjusted R-squared has jumped to about 0.75. Most predictors’ p-values are significant. Here, the squared women.c term yields a weak p-value (perhaps an indication that, in the presence of the other predictors, it is not worth including and could be excluded from the model).
Let’s go on and remove the squared women.c variable from the model to see how it changes:
# Fit a model excluding `education`, log `income`.
model4 = lm(log(income) ~ prestige.c + I(prestige.c^2) + women.c , data = prestige)
summary(model4)
Call:
lm(formula = log(income) ~ prestige.c + I(prestige.c^2) + women.c,
data = prestige)
Residuals:
Min 1Q Median 3Q Max
-1.12302 -0.09650 0.02764 0.13913 0.78303
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.729e+00 4.023e-02 216.995 < 2e-16 ***
prestige.c 2.499e-02 1.803e-03 13.858 < 2e-16 ***
I(prestige.c^2) -2.351e-04 9.392e-05 -2.503 0.014 *
women.c -8.365e-03 9.376e-04 -8.922 2.64e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2963 on 98 degrees of freedom
Multiple R-squared: 0.7565, Adjusted R-squared: 0.7491
F-statistic: 101.5 on 3 and 98 DF, p-value: < 2.2e-16
# Plot model residuals.
plot(model4, which = 1, pch = 16)
new_data3 <- expand.grid(prestige.c = seq(-35, 45, by = 5),
women.c = seq(-25, 70, by = 5))
new_data3$plane <- predict(model4, newdata = new_data3)
with(prestige, plot3d(prestige.c, women.c, log(income), size = 1, type = "s"))
with(new_data3, surface3d(unique(prestige.c), unique(women.c), plane,
alpha = 0.5, color = "tomato2",
front = "fill", back = "fill"))
rglwidget()
Note that this updated model retains essentially the same adjusted R-squared (roughly 0.75), now with all predictor p-values highly significant and an improved F-statistic (101.5). The residuals also appear randomly scattered, indicating a relatively good fit given the transformations applied to address the non-linearity in the data.
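To confirm that dropping the squared women.c term costs little, we could run a partial F-test between the two nested models (a quick sketch):
# Sketch: a non-significant result here supports the simpler model4
anova(model4, model3)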
In summary, we’ve seen a few different linear regression models applied to the Prestige dataset. We started with a simple linear approach, then built a correlation matrix to understand how the variables were related. Subsequently, we transformed the variables to see the effect on the model, and we created three-dimensional plots to visualize the relationships among the variables and how well each model fit the data at hand.
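As a closing sketch (not in the original write-up), we could line up the adjusted R-squared values of the fitted models; keep in mind that model3 and model4 predict log(income), so their values are not strictly comparable to those of the models fit on the raw income scale.
# Sketch: adjusted R-squared for each model fitted above
sapply(list(simple = model, full = model1, reduced = model2,
            log_quadratic = model3, log_reduced = model4),
       function(m) summary(m)$adj.r.squared)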