For my first blog I will be demonstrating how to create a simple linear regression model. A linear regression model is a model that shows the relationship between a dependent variable, y, and an independent variable, x.
I will be using the diamonds dataset to show an example on how to create a simple linear regression. The diamond dataset is under the ggplot2 library.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
The summary function is used to take a look at the dataset and to see what we are working with.
Use the lm function to create your regression model.
model <- lm(data = diamonds)
model
##
## Call:
## lm(data = diamonds)
##
## Coefficients:
## (Intercept) cut.L cut.Q cut.C cut^4 color.L
## -1.660e+00 -2.057e-02 9.257e-03 -6.879e-03 1.741e-03 1.085e-01
## color.Q color.C color^4 color^5 color^6 clarity.L
## 4.106e-02 5.772e-03 -5.063e-03 5.713e-03 1.959e-03 -1.784e-01
## clarity.Q clarity.C clarity^4 clarity^5 clarity^6 clarity^7
## 1.081e-01 -5.677e-02 1.727e-02 -1.232e-02 -1.793e-03 1.719e-04
## depth table price x y z
## 1.212e-02 2.203e-03 4.428e-05 2.425e-01 5.961e-03 4.586e-03
What is displayed is the coefficients and intercepts for each variable in the dataset.
summary(model)
##
## Call:
## lm(data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52251 -0.03060 -0.00102 0.02905 2.17188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.660e+00 2.388e-02 -69.512 < 2e-16 ***
## cut.L -2.057e-02 1.416e-03 -14.531 < 2e-16 ***
## cut.Q 9.257e-03 1.131e-03 8.186 2.76e-16 ***
## cut.C -6.879e-03 9.715e-04 -7.081 1.45e-12 ***
## cut^4 1.741e-03 7.762e-04 2.244 0.02487 *
## color.L 1.085e-01 1.115e-03 97.305 < 2e-16 ***
## color.Q 4.106e-02 9.904e-04 41.461 < 2e-16 ***
## color.C 5.772e-03 9.243e-04 6.245 4.28e-10 ***
## color^4 -5.063e-03 8.482e-04 -5.969 2.40e-09 ***
## color^5 5.713e-03 8.014e-04 7.130 1.02e-12 ***
## color^6 1.959e-03 7.285e-04 2.689 0.00716 **
## clarity.L -1.784e-01 2.058e-03 -86.670 < 2e-16 ***
## clarity.Q 1.081e-01 1.785e-03 60.536 < 2e-16 ***
## clarity.C -5.677e-02 1.518e-03 -37.391 < 2e-16 ***
## clarity^4 1.727e-02 1.211e-03 14.259 < 2e-16 ***
## clarity^5 -1.232e-02 9.885e-04 -12.464 < 2e-16 ***
## clarity^6 -1.793e-03 8.602e-04 -2.084 0.03717 *
## clarity^7 1.719e-04 7.595e-04 0.226 0.82093
## depth 1.212e-02 2.801e-04 43.278 < 2e-16 ***
## table 2.203e-03 1.825e-04 12.073 < 2e-16 ***
## price 4.428e-05 1.913e-07 231.494 < 2e-16 ***
## x 2.425e-01 1.800e-03 134.732 < 2e-16 ***
## y 5.961e-03 1.212e-03 4.917 8.82e-07 ***
## z 4.586e-03 2.100e-03 2.184 0.02899 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07088 on 53916 degrees of freedom
## Multiple R-squared: 0.9777, Adjusted R-squared: 0.9776
## F-statistic: 1.025e+05 on 23 and 53916 DF, p-value: < 2.2e-16
The summary function is used to show the coefficients and intercepts, as well as other information such as the r-squared and adjusted r-squared.
This model includes all the variables in the dataset. To get a much better model it is better to create a model based on a relationship between two variables. Let’s create a new model using the carat and price variables and the lm function again.
model2 <- lm(carat ~ price, data = diamonds)
model2
##
## Call:
## lm(formula = carat ~ price, data = diamonds)
##
## Coefficients:
## (Intercept) price
## 0.3672972 0.0001095
summary(model2)
##
## Call:
## lm(formula = carat ~ price, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35765 -0.11329 -0.02442 0.10344 2.66973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.673e-01 1.112e-03 330.2 <2e-16 ***
## price 1.095e-04 1.986e-07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.184 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
Here we can see the coefficients, intercepts, and other information from the summary function. This all comes from the model created from the carat and price variables. The equation of the model is y = 0.0001095x + 0.3672972.
Another good thing to do is plot the residuals to check for normality. Residuals are the distance between data points and the regression line. The qqnorm function is used to plot the residuals and the qqline function adds a line to the plot that passes through the first and third quartiles.
qqnorm(model2$residuals)
qqline(model2$residuals)
Here we see most of the residuals follow the straight line. There are a good amount of points that deviate from the line. We can say that this distribution follows a nearly normal distribution. We can say that the model fits the data. This means that there is a strong relationship between carat and price, meaning that a higher carat diamond would have a higher price to it.