Linear regression models the relationship between a continuous dependent variable and one or more independent variables, which may be categorical or numerical. The independent variables are known as predictor variables, and the dependent variable is known as the response variable. There are two types of linear regression:
Simple Linear Regression: there is only 1 independent variable
Multiple Linear Regression: there is more than 1 independent variable
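In symbols, the two cases can be written as follows. The notation is conventional rather than taken from the original text: \(\beta_0\) is the intercept, the remaining \(\beta\) terms are slopes, and \(\varepsilon\) is the error term.

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

\[ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon \]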
For this example, we will use the built-in “cars” dataset in R to build a linear model for stopping distance as a function of speed.
library(ggplot2)
library(skimr)
library(dplyr)
library(tidyr)
skim(cars)
| Name | cars |
| Number of rows | 50 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| speed | 0 | 1 | 15.40 | 5.29 | 4 | 12 | 15 | 19 | 25 | ▂▅▇▇▃ |
| dist | 0 | 1 | 42.98 | 25.77 | 2 | 26 | 36 | 56 | 120 | ▅▇▅▂▁ |
A general first step in this one-factor modeling process is to determine whether a linear relationship appears to exist between the predictor and the output value. A scatter plot of car speed against stopping distance shows a positive linear relationship: the greater the speed, the longer the stopping distance. We can therefore fit a regression model to see whether this visually observed relationship is statistically significant.
# Histogram of each variable, with Freedman-Diaconis bin widths
cars %>% gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scales = "free", ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill = "pink") +
  theme_minimal()
ggplot(cars, aes(x=speed, y=dist)) + geom_point() + labs(x="Car Speed", y="Stopping Distance")
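A numeric companion to the scatter plot is the Pearson correlation coefficient. The snippet below is an added sketch, not part of the original analysis, and the variable name `r` is ours. It uses only base R and the built-in `cars` data.

```r
# Pearson correlation between speed and stopping distance (built-in cars data)
r <- cor(cars$speed, cars$dist)
print(r)

# In simple linear regression, r squared equals the model's R-squared
print(r^2)
```

A value close to 1 is consistent with the positive linear trend seen in the plot, although correlation alone does not confirm that a straight line is the right functional form.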
cars.lm <- lm(cars$dist ~ cars$speed)
summary(cars.lm)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
ggplot(cars, aes(x = speed, y = dist)) + geom_point() +
  labs(x = "Car Speed", y = "Stopping Distance") +
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
The regression function is: \(y=3.9324x-17.5791\), where \(x\) is the speed (mph) and \(y\) is the predicted stopping distance (ft).
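As a worked example of using the regression function, take a hypothetical speed of 20. With the rounded coefficients, \(3.9324 \times 20 - 17.5791 = 61.0689\). The sketch below repeats the calculation with the unrounded coefficients. It refits the model with a formula and a `data` argument (our choice, under the name `fit`) so that `predict()` can accept new data; `speed_new`, `by_hand` and `by_predict` are also illustrative names.

```r
# Refit with a formula plus data argument so predict() can accept new data
fit <- lm(dist ~ speed, data = cars)

b <- coef(fit)    # named vector: "(Intercept)" and "speed"
speed_new <- 20   # hypothetical speed, chosen for illustration

# Apply y = slope * x + intercept by hand
by_hand <- unname(b["(Intercept)"] + b["speed"] * speed_new)

# The same prediction through predict()
by_predict <- unname(predict(fit, newdata = data.frame(speed = speed_new)))

print(c(by_hand = by_hand, by_predict = by_predict))
```

Both routes should agree, since `predict()` evaluates the same linear function.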
To determine whether a linear model is a good fit for the data, we can analyze the residuals. Some notable points from the text are:
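cars.res <- data.frame(fitted = fitted(cars.lm), residual = resid(cars.lm))
ggplot(cars.res, aes(x = fitted, y = residual)) + geom_point()  # residuals vs. fitted values
ggplot(cars.res, aes(sample = residual)) + geom_qq() + geom_qq_line()  # normal Q-Q plot of residuals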
Observations:
In our case:
Based on this, the model is a good fit for further analysis.
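The visual residual checks can be supplemented with a few numbers. The sketch below is an addition and rests on assumptions: it refits the model as `fit`, and it uses the Shapiro-Wilk test, which the original analysis does not use, as one possible formal check of normality.

```r
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Least-squares residuals average to zero when the model has an intercept
print(mean(res))

# Five-number summary; compare with the "Residuals" line of summary(fit)
print(quantile(res))

# Shapiro-Wilk test of normality (an added check, not used in the original)
sw <- shapiro.test(res)
print(sw$p.value)
```

A small p-value from the test would suggest some departure from normality. With only 50 observations, it is best read alongside the Q-Q plot rather than on its own.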
summary(cars.lm)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Once we determine that the data is a good fit for modeling, the model summary helps us determine whether the relationships found in the model are statistically significant and how much of the variation in the data is explained by the model. From the text, some key points are:
Observations:
In our case:
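The quantities in the summary can also be pulled out programmatically, which makes their relationships easier to see. In the output above, for instance, the t value for speed is the estimate divided by its standard error: \(3.9324 / 0.4155 \approx 9.464\). The sketch below is an addition; the names `fit`, `s`, `coefs` and `t_speed` are ours.

```r
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(s)

# The t value is the estimate divided by its standard error
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]
print(t_speed)

# p-value for the slope, then R-squared and adjusted R-squared
print(coefs["speed", "Pr(>|t|)"])
print(s$r.squared)
print(s$adj.r.squared)
```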
The decision to use a model is not always obvious and can be subjective. In our cars example, while the relationship between speed and stopping distance is statistically significant, it rests on only 50 observations. In real-world settings, a sample this small may not be representative of the entire population, which may call the credibility of the model into question. Additionally, the model accounts for only about 65% of the variation in stopping distance (R-squared of 0.6511; adjusted 0.6438), which is a further cause for doubt when predicting stopping distances for new observations.
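One way to put a number on this doubt is to look at interval estimates instead of point estimates. The sketch below is an addition under stated assumptions: the model is refit as `fit`, the speed of 20 in `new_obs` is a hypothetical new observation, and the 95% level is a conventional choice.

```r
fit <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 20)  # hypothetical new observation

# 95% confidence interval for the slope
print(confint(fit, "speed", level = 0.95))

# 95% prediction interval for one new stopping distance at that speed
pi95 <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
print(pi95)
```

With a residual standard error of 15.38, the prediction interval should extend on the order of 30 on either side of the fitted value, which shows how uncertain a single predicted stopping distance is.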