Linear Regression

Linear regression is the process of determining the relationship between two or more variables: specifically, it models the relationship between a continuous dependent variable and one or more independent variables that may be categorical or numerical. The independent variables are known as predictor variables, and the dependent variable is known as the response variable. There are two types of linear regression:

Simple Linear Regression: there is only 1 independent variable
Multiple Linear Regression: there is more than 1 independent variable
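
As a quick illustration of the difference, the sketch below fits both kinds of model with R's lm() function. The data frame df and the columns x1, x2, and y are made-up placeholders for this illustration only; they are not part of the cars example that follows.

# Hypothetical data frame with one response (y) and two predictors (x1, x2)
df <- data.frame(x1 = c(1, 2, 3, 4, 5),
                 x2 = c(2, 1, 4, 3, 5),
                 y  = c(2.1, 3.9, 6.2, 7.8, 10.1))

simple_fit   <- lm(y ~ x1, data = df)        # simple linear regression: one predictor
multiple_fit <- lm(y ~ x1 + x2, data = df)   # multiple linear regression: two predictors

summary(simple_fit)
summary(multiple_fit)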

For this example, we will use the “cars” dataset in R to build a linear model for stopping distance as a function of speed.

library(ggplot2)
library(skimr)
library(dplyr)
library(tidyr)
skim(cars)
Data summary

Name                      cars
Number of rows            50
Number of columns         2
Column type frequency:
  numeric                 2
Group variables           None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd  p0  p25  p50  p75  p100  hist
speed                  0              1  15.40   5.29   4   12   15   19    25  ▂▅▇▇▃
dist                   0              1  42.98  25.77   2   26   36   56   120  ▅▇▅▂▁

Visualize the Data

A general first step in this one-factor modeling process is to determine whether a linear relationship appears to exist between the predictor and the response. A scatter plot of car speed against stopping distance shows a positive linear relationship: the greater the speed, the larger the stopping distance. We can therefore fit a regression model to see whether this visually observed relationship is statistically significant.

# Histogram of each variable, using the Freedman-Diaconis rule for binwidth
cars %>% gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scales = "free", ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill = "pink") +
  theme_minimal()

ggplot(cars, aes(x=speed, y=dist)) + geom_point() + labs(x="Car Speed", y="Stopping Distance")
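
A quick numeric check of the visual impression is the sample correlation between the two variables. This call is a small addition to the original analysis; Pearson's correlation between speed and dist works out to roughly 0.81, the square root of the R-squared reported below.

# Pearson correlation between speed and stopping distance
cor(cars$speed, cars$dist)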

Linear Model

# Fit a simple linear regression of stopping distance on speed
cars.lm <- lm(dist ~ speed, data = cars)
summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
ggplot(cars, aes(x=speed, y=dist)) + geom_point() + 
  labs(x="Car Speed", y="Stopping Distance") + 
  geom_smooth(method=lm, se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

The fitted regression equation is \(\hat{y} = -17.5791 + 3.9324x\), where \(x\) is the car speed and \(\hat{y}\) is the predicted stopping distance.
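
With the model stored in cars.lm, R's predict() applies this equation to new speeds. The speed value of 21 below is an arbitrary example chosen for illustration, not a value from the original write-up.

# Predicted stopping distance for a speed of 21
predict(cars.lm, newdata = data.frame(speed = 21))

# Equivalent hand calculation from the fitted coefficients
-17.5791 + 3.9324 * 21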

Model Quality (Residuals)

To determine whether the data is a good fit for regression, we can analyze the residuals. Some notable points are:

When plotted against the fitted values, the residuals should be scattered randomly around zero, with no obvious pattern or funnel shape.
The residuals should be approximately normally distributed; in a normal Q-Q plot the points should fall close to the reference line.

# Residuals vs. fitted values (ggplot2 fortifies lm objects, exposing .fitted and .resid)
ggplot(cars.lm, aes(x = .fitted, y = .resid)) + geom_point() + labs(x="Fitted Values", y="Residuals")

# Normal Q-Q plot of the residuals
ggplot(cars.lm, aes(sample = .resid)) + geom_qq() + geom_qq_line()

Observations:

In the residuals vs. fitted plot, the points are scattered around zero with no strong pattern or funnel shape.
In the Q-Q plot, the points follow the reference line reasonably well, with some departure in the upper tail.

In our case:

The residuals look roughly random and roughly normal, so neither plot raises a serious concern about the model assumptions.

Based on this, the model is a good fit for further analysis.
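
For a quick numerical complement to these plots, base R offers a couple of standard checks. Neither appears in the original analysis, so treat the sketch below as an optional extra.

# Base R's built-in diagnostic panels (residuals vs. fitted, Q-Q, scale-location, leverage)
par(mfrow = c(2, 2))
plot(cars.lm)
par(mfrow = c(1, 1))

# Shapiro-Wilk test: a large p-value is consistent with normally distributed residuals
shapiro.test(residuals(cars.lm))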

Coefficients

summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Once we determine that the data is a good fit for modeling, the coefficients help us determine whether the relationships found in the model are statistically significant and how much of the variation in the data is explained by the model. Some key points are:

The p-value of each coefficient tests the null hypothesis that the true coefficient is zero; a small p-value (typically below 0.05) indicates the predictor is statistically significant.
The multiple R-squared (and adjusted R-squared) report the proportion of the variation in the response that the model explains.

Observations:

The coefficient for speed is 3.9324 with a p-value of 1.49e-12, and the intercept is -17.5791 with a p-value of 0.0123; both are significant at the 0.05 level.
The multiple R-squared is 0.6511 and the adjusted R-squared is 0.6438.

In our case:

Speed has a statistically significant positive effect on stopping distance, and the model explains roughly 64-65% of the variation in the data.
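
If you want the estimates and their uncertainty without reading the full summary() printout, the sketch below pulls them out directly; it is not part of the original write-up and simply restates numbers already shown above. confint() gives 95% confidence intervals by default.

# Point estimates of the intercept and slope
coef(cars.lm)

# 95% confidence intervals for the coefficients
confint(cars.lm)

# Proportion of variation explained
summary(cars.lm)$r.squared
summary(cars.lm)$adj.r.squared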

Conclusion

The decision to use a model is not always obvious and can be quite subjective. In our cars example, while the relationship established between speed and stopping distance is statistically significant, it is based on only 50 observations. In real-world applications a sample this small may not be representative of the entire population, which may call the credibility of the model into question. Additionally, the model accounts for only about 64% of the variation in stopping distance, which could cast doubt on predictions for new observations.