For my second blog I will continue to demonstrate a linear regression model by showing how to do a multiple linear regression model. A multiple linear regression model is a model that shows the relationship between an dependent variable, y, and one or more independent variables.
I will be using the diamonds dataset again to show an example on how to create a multiple linear regression model. The diamond dataset is under the ggplot2 library.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
Use the lm model again to create the model.
model <- lm(price ~ carat + cut + color + clarity + depth + table + x + y + z, data = diamonds)
model
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
## table + x + y + z, data = diamonds)
##
## Coefficients:
## (Intercept) carat cut.L cut.Q cut.C cut^4
## 5753.762 11256.978 584.457 -301.908 148.035 -20.794
## color.L color.Q color.C color^4 color^5 color^6
## -1952.160 -672.054 -165.283 38.195 -95.793 -48.466
## clarity.L clarity.Q clarity.C clarity^4 clarity^5 clarity^6
## 4097.431 -1925.004 982.205 -364.918 233.563 6.883
## clarity^7 depth table x y z
## 90.640 -63.806 -26.474 -1008.261 9.609 -50.119
Here we see a model created with every variable in the diamond dataset and their corresponding coefficients and intercepts displayed.
summary(model)
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
## table + x + y + z, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21376.0 -592.4 -183.5 376.4 10694.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5753.762 396.630 14.507 < 2e-16 ***
## carat 11256.978 48.628 231.494 < 2e-16 ***
## cut.L 584.457 22.478 26.001 < 2e-16 ***
## cut.Q -301.908 17.994 -16.778 < 2e-16 ***
## cut.C 148.035 15.483 9.561 < 2e-16 ***
## cut^4 -20.794 12.377 -1.680 0.09294 .
## color.L -1952.160 17.342 -112.570 < 2e-16 ***
## color.Q -672.054 15.777 -42.597 < 2e-16 ***
## color.C -165.283 14.725 -11.225 < 2e-16 ***
## color^4 38.195 13.527 2.824 0.00475 **
## color^5 -95.793 12.776 -7.498 6.59e-14 ***
## color^6 -48.466 11.614 -4.173 3.01e-05 ***
## clarity.L 4097.431 30.259 135.414 < 2e-16 ***
## clarity.Q -1925.004 28.227 -68.197 < 2e-16 ***
## clarity.C 982.205 24.152 40.668 < 2e-16 ***
## clarity^4 -364.918 19.285 -18.922 < 2e-16 ***
## clarity^5 233.563 15.752 14.828 < 2e-16 ***
## clarity^6 6.883 13.715 0.502 0.61575
## clarity^7 90.640 12.103 7.489 7.06e-14 ***
## depth -63.806 4.535 -14.071 < 2e-16 ***
## table -26.474 2.912 -9.092 < 2e-16 ***
## x -1008.261 32.898 -30.648 < 2e-16 ***
## y 9.609 19.333 0.497 0.61918
## z -50.119 33.486 -1.497 0.13448
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 53916 degrees of freedom
## Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
## F-statistic: 2.688e+04 on 23 and 53916 DF, p-value: < 2.2e-16
The summary function is once again used to show additional information of the model.
Let’s create a model where we narrow down some of the variables. Let’s create a model using price, carat, and depth.
model2 <- lm(price ~ carat + depth, data = diamonds)
model2
##
## Call:
## lm(formula = price ~ carat + depth, data = diamonds)
##
## Coefficients:
## (Intercept) carat depth
## 4045.3 7765.1 -102.2
summary(model2)
##
## Call:
## lm(formula = price ~ carat + depth, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18238.9 -801.6 -19.6 546.3 12683.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4045.333 286.205 14.13 <2e-16 ***
## carat 7765.141 14.009 554.28 <2e-16 ***
## depth -102.165 4.635 -22.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1542 on 53937 degrees of freedom
## Multiple R-squared: 0.8507, Adjusted R-squared: 0.8507
## F-statistic: 1.536e+05 on 2 and 53937 DF, p-value: < 2.2e-16
This is the information for the model created by the price, carat, and depth variables.
Let’s plot the residuals to check for normality. This is done by using the qqnorm and qqline functions.
qqnorm(model2$residuals)
qqline(model2$residuals)
Here we see that a good amount of residuals follow the straight line, but there are a lot that deviate away from the line. We can say that the distribution is nearly normal. The model is an okay fit to the data, but there could be a model that is a better fit. This means that there is not really a strong relationship and having both carat and depth doesn’t affect the price of a diamond too much.