\(~\)
The second blog demonstrated how to create a simple linear regression model to find relationships between two quantitative variables where there’s one dependent variable (y) and one independent variable (x). In this third blog I am introducing multiple linear regression, where there’s still one dependent variable but one or more independent variables.
Below I’ll demonstrate how to start a multiple linear regression
model using the diamonds
package from the library
ggplot2
. For a more in depth multiple regression model I’d
like to share the work my team and I create for a homework.
library(ggplot2)
library(tidyverse)
library(jtools)
library(hrbrthemes)
library(gtsummary)
Loading data using the head()
function
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
\(~\)
Using summary()
function to get the statistical
structure of our data
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
\(~\)
Using gtsummary()
function as well to view the data’s
structure differently
Characteristic | N = 53,9401 |
---|---|
carat | 0.70 (0.40, 1.04) |
cut | |
Fair | 1,610 (3.0%) |
Good | 4,906 (9.1%) |
Very Good | 12,082 (22%) |
Premium | 13,791 (26%) |
Ideal | 21,551 (40%) |
color | |
D | 6,775 (13%) |
E | 9,797 (18%) |
F | 9,542 (18%) |
G | 11,292 (21%) |
H | 8,304 (15%) |
I | 5,422 (10%) |
J | 2,808 (5.2%) |
clarity | |
I1 | 741 (1.4%) |
SI2 | 9,194 (17%) |
SI1 | 13,065 (24%) |
VS2 | 12,258 (23%) |
VS1 | 8,171 (15%) |
VVS2 | 5,066 (9.4%) |
VVS1 | 3,655 (6.8%) |
IF | 1,790 (3.3%) |
depth | 61.80 (61.00, 62.50) |
table | 57.00 (56.00, 59.00) |
price | 2,401 (950, 5,324) |
x | 5.70 (4.71, 6.54) |
y | 5.71 (4.72, 6.54) |
z | 3.53 (2.91, 4.04) |
1 Median (IQR); n (%) |
\(~\)
The diamonds
data set does not require any cleaning or
transformation in order to use but always keep in mind the data sets we
use in the future will.
\(~\)
Just like in simple linear regression, we start by creating a model to look at the overall data. This serves as a base model to compare our other models to. The equation for a multiple linear regression model is: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 ... + \beta_p x_p + \epsilon\)
# creating first model
<- lm(diamonds)
model summary(model)
##
## Call:
## lm(formula = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52251 -0.03060 -0.00102 0.02905 2.17188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.660e+00 2.388e-02 -69.512 < 2e-16 ***
## cut.L -2.057e-02 1.416e-03 -14.531 < 2e-16 ***
## cut.Q 9.257e-03 1.131e-03 8.186 2.76e-16 ***
## cut.C -6.879e-03 9.715e-04 -7.081 1.45e-12 ***
## cut^4 1.741e-03 7.762e-04 2.244 0.02487 *
## color.L 1.085e-01 1.115e-03 97.305 < 2e-16 ***
## color.Q 4.106e-02 9.904e-04 41.461 < 2e-16 ***
## color.C 5.772e-03 9.243e-04 6.245 4.28e-10 ***
## color^4 -5.063e-03 8.482e-04 -5.969 2.40e-09 ***
## color^5 5.713e-03 8.014e-04 7.130 1.02e-12 ***
## color^6 1.959e-03 7.285e-04 2.689 0.00716 **
## clarity.L -1.784e-01 2.058e-03 -86.670 < 2e-16 ***
## clarity.Q 1.081e-01 1.785e-03 60.536 < 2e-16 ***
## clarity.C -5.677e-02 1.518e-03 -37.391 < 2e-16 ***
## clarity^4 1.727e-02 1.211e-03 14.259 < 2e-16 ***
## clarity^5 -1.232e-02 9.885e-04 -12.464 < 2e-16 ***
## clarity^6 -1.793e-03 8.602e-04 -2.084 0.03717 *
## clarity^7 1.719e-04 7.595e-04 0.226 0.82093
## depth 1.212e-02 2.801e-04 43.278 < 2e-16 ***
## table 2.203e-03 1.825e-04 12.073 < 2e-16 ***
## price 4.428e-05 1.913e-07 231.494 < 2e-16 ***
## x 2.425e-01 1.800e-03 134.732 < 2e-16 ***
## y 5.961e-03 1.212e-03 4.917 8.82e-07 ***
## z 4.586e-03 2.100e-03 2.184 0.02899 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07088 on 53916 degrees of freedom
## Multiple R-squared: 0.9777, Adjusted R-squared: 0.9776
## F-statistic: 1.025e+05 on 23 and 53916 DF, p-value: < 2.2e-16
Plotting our models is the easiest part, first you’ll want to create
a matrix of nrows x ncols so that the plots are fitted into one page.
Using the plot()
function we can insert our model name and
we get 4 different plots:
# plots
par(mfrow=c(2,2)) # matrix of plots
plot(model) # create plots for model
\(~\)
With multiple linear regression there are assumptions to keep in mind:
\(~\)
The second model we create, similar to the first, will be with less variables to check if this model is fits better.
##
## Call:
## lm(formula = price ~ carat + cut + depth + table, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17513.5 -786.6 -39.9 521.4 12596.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4296.951 468.456 9.173 < 2e-16 ***
## carat 7890.772 14.047 561.745 < 2e-16 ***
## cut.L 1028.088 29.667 34.654 < 2e-16 ***
## cut.Q -483.298 23.749 -20.350 < 2e-16 ***
## cut.C 325.172 20.528 15.840 < 2e-16 ***
## cut^4 59.229 16.466 3.597 0.000322 ***
## depth -73.691 5.302 -13.898 < 2e-16 ***
## table -41.816 3.881 -10.775 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1508 on 53932 degrees of freedom
## Multiple R-squared: 0.8571, Adjusted R-squared: 0.8571
## F-statistic: 4.62e+04 on 7 and 53932 DF, p-value: < 2.2e-16
Plots for model2
\(~\)
The next two models (model3
and model4
) are
created using backwards elimination to view the differences in
statistical output we receive and compare the adjusted \(R^2\).
model3
##
## Call:
## lm(formula = price ~ carat + color + clarity + x + y, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21463.8 -598.8 -178.9 390.1 10627.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.25 79.13 0.534 0.5934
## carat 10968.05 47.96 228.710 < 2e-16 ***
## color.L -1964.82 17.61 -111.548 < 2e-16 ***
## color.Q -671.29 16.03 -41.885 < 2e-16 ***
## color.C -174.13 14.96 -11.642 < 2e-16 ***
## color^4 33.79 13.74 2.459 0.0139 *
## color^5 -96.31 12.98 -7.419 1.19e-13 ***
## color^6 -54.48 11.80 -4.616 3.91e-06 ***
## clarity.L 4344.21 30.11 144.271 < 2e-16 ***
## clarity.Q -2032.91 28.42 -71.531 < 2e-16 ***
## clarity.C 1076.63 24.38 44.162 < 2e-16 ***
## clarity^4 -418.98 19.51 -21.480 < 2e-16 ***
## clarity^5 266.60 15.98 16.680 < 2e-16 ***
## clarity^6 -13.07 13.92 -0.939 0.3478
## clarity^7 104.88 12.29 8.531 < 2e-16 ***
## x -958.86 27.58 -34.764 < 2e-16 ***
## y 39.06 19.39 2.015 0.0439 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1148 on 53923 degrees of freedom
## Multiple R-squared: 0.9172, Adjusted R-squared: 0.9171
## F-statistic: 3.731e+04 on 16 and 53923 DF, p-value: < 2.2e-16
Plots for model3
model4
##
## Call:
## lm(formula = price ~ carat + clarity + x, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20779.7 -570.8 -111.1 416.0 11419.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -48.08 88.06 -0.546 0.585
## carat 10165.82 52.82 192.464 <2e-16 ***
## clarity.L 4157.89 33.35 124.663 <2e-16 ***
## clarity.Q -2005.70 31.62 -63.433 <2e-16 ***
## clarity.C 1060.55 27.16 39.049 <2e-16 ***
## clarity^4 -487.67 21.72 -22.455 <2e-16 ***
## clarity^5 305.51 17.80 17.162 <2e-16 ***
## clarity^6 12.29 15.52 0.792 0.428
## clarity^7 186.09 13.68 13.605 <2e-16 ***
## x -755.46 22.46 -33.632 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1280 on 53930 degrees of freedom
## Multiple R-squared: 0.897, Adjusted R-squared: 0.897
## F-statistic: 5.219e+04 on 9 and 53930 DF, p-value: < 2.2e-16
Plots of model4
\(~\)
After plotting our models and analyzing the summary outputs we
conclude that the model that better fits our data is
model3
. Comparing statistical outputs, model3
had the highest \(R^2\) at .9171,
compared to model2
at .8571 and model4
at
.897. While this simple analysis works with our data, a deeper analysis
will always help you select the best fit model for your data, perform
predictions and more.
If you’d like to dig deeper into multiple linear regression I found a great article from R-Bloggers that takes you step by step on the process to create your own models.
\(~\)