Blog 1 - Linear Regression

Linear Regression

Linear regression models the relationship between a continuous dependent variable and one or more independent variables, which may be categorical or numerical. The independent variables are known as predictor variables, and the dependent variable is known as the response variable. There are two types of linear regression:

Simple Linear Regression: there is only 1 independent variable
Multiple Linear Regression: there is more than 1 independent variable
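
In symbols, the two forms can be written as below. The notation is an assumption added here, not taken from the original post: \(y\) is the response, \(x\) or \(x_1, \dots, x_p\) are the predictors, the \(\beta\) terms are the coefficients to be estimated, and \(\varepsilon\) is the error term.

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon \]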

For this example, we will use the “cars” dataset that ships with R to build a linear model for stopping distance (in feet) as a function of speed (in miles per hour).

library(ggplot2)
library(skimr)
library(dplyr)
library(tidyr)
skim(cars)

Data summary
Name	cars
Number of rows	50
Number of columns	2
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
speed	0	1	15.40	5.29	4	12	15	19	25	▂▅▇▇▃
dist	0	1	42.98	25.77	2	26	36	56	120	▅▇▅▂▁

Visualize the Data

A general first step in this one-factor modeling process is to check whether a linear relationship appears to exist between the predictor and the output value. The histograms below show the distribution of each variable, and the scatter plot of car speed against stopping distance shows a positive, roughly linear relationship: the greater the speed, the longer the stopping distance. We can therefore fit a regression model to see whether this visually observed relationship is statistically significant.

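# Reshape to long format (key = variable name, value = observation), one histogram per variable
cars %>% gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scales = "free", ncol = 3) +
  # Bin width chosen with the Freedman-Diaconis rule: 2 * IQR / n^(1/3)
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill = "pink") +
  theme_minimal()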

ggplot(cars, aes(x=speed, y=dist)) + geom_point() + labs(x="Car Speed", y="Stopping Distance")
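
As a numeric complement to the scatter plot, the strength of the linear association can be summarized with the Pearson correlation coefficient. This is a minimal sketch using base R's cor(); the variable name r is chosen here purely for illustration.

```r
# Pearson correlation between speed and stopping distance (built-in cars data)
r <- cor(cars$speed, cars$dist)
r    # a value close to +1 indicates a strong positive linear association
r^2  # with a single predictor, this equals the R-squared of the fitted line
```

A correlation near 1 supports the visual impression, but it says nothing about whether a straight line is the right shape, so the plot is still worth inspecting.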

Linear Model

cars.lm <- lm(cars$dist ~ cars$speed)
summary(cars.lm)

## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

ggplot(cars, aes(x = speed, y = dist)) + geom_point() +
  labs(x = "Car Speed", y = "Stopping Distance") +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

The fitted regression function is \(y = 3.9324x - 17.5791\), where \(x\) is the speed and \(y\) is the predicted stopping distance.
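
To see the fitted line in use, the sketch below predicts the stopping distance for a hypothetical car travelling at 20 mph, once with predict() and once by hand from the coefficients. The names fit, new_car, pred, and manual are illustrative and not from the original post. The model is refit with a data argument because predict() can only match new data to the formula when the formula uses plain column names.

```r
# Refit with a data argument so that predict() can accept new speeds
fit <- lm(dist ~ speed, data = cars)

# Hypothetical new observation: a car travelling at 20 mph
new_car <- data.frame(speed = 20)
pred <- predict(fit, newdata = new_car)
pred

# The same calculation by hand: intercept + slope * speed
b <- coef(fit)
manual <- unname(b["(Intercept)"] + b["speed"] * 20)
manual
```

Both values should agree, at about 61 feet, which matches plugging 20 into the rounded equation: 3.9324 * 20 - 17.5791 = 61.0689.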

Model Quality (Residuals)

To determine whether the fitted line describes the data well, we can analyze the residuals. Some notable points from the text are:

If the line is a good fit with the data, we would expect residual values that are normally distributed around a mean of zero.
A good model would tend to have a median value near zero, minimum and maximum values of roughly the same magnitude, and first and third quartile values of roughly the same magnitude.
A model that fits the data well would tend to over-predict as often as it under-predicts. Thus, if we plot the residual values, we would expect to see them distributed uniformly around zero for a well-fitted model.

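# Collect fitted values and residuals in a data frame so aes() can refer to columns
cars.res <- data.frame(fitted = fitted(cars.lm), residual = resid(cars.lm))
ggplot(cars.res, aes(x = fitted, y = residual)) + geom_point() + labs(x = "Fitted Values", y = "Residuals")

ggplot(cars.res, aes(sample = residual)) + geom_qq() + geom_qq_line()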
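
The residual criteria listed above can also be checked numerically rather than only by eye. This is a rough sketch, assuming the model is refit under the illustrative name fit. The share of positive residuals is one simple way to see how often the model under-predicts.

```r
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)   # residual = observed distance - fitted distance

mean(res)       # essentially zero for least squares with an intercept
quantile(res)   # minimum, first quartile, median, third quartile, maximum
mean(res > 0)   # share of observations the model under-predicts
```

The output of quantile(res) should reproduce the Residuals line printed by summary().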

Observations:

In our case:

the residual median is -2.27, which is close to zero relative to residuals that range from about -29 to 43
the 1st and 3rd quartiles are roughly the same magnitude, at an absolute value of around 9
the residuals plotted against the fitted values look evenly scattered around zero, with no cyclic pattern
the QQ plot shows that the residuals roughly follow a normal distribution, deviating only slightly at the tails

Based on this, the model is a good enough fit for further analysis.

Coefficients

summary(cars.lm)

## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Once we have determined that the model is a reasonable fit, the coefficients help us determine whether the relationships found in the model are statistically significant and how much of the variation in the data is explained by the model. From the text, some key points are:

For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient.
Pr(>|t|) is the probability of observing a t value at least this extreme if the true coefficient were zero. A small value indicates that the coefficient is relevant in the model. This value is also known as the significance or p-value of the coefficient.
The residual standard error estimates the standard deviation of the residual values. If the residuals are normally distributed, the first and third quartiles of the residuals should each lie about 0.67 times this standard error away from zero.
The multiple R-squared value is a number between 0 and 1. It measures the proportion of the variation in the response that the model explains.
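
The t value and Pr(>|t|) columns can be reproduced from the estimates and standard errors, which may make their meaning more concrete. This is a minimal sketch, assuming the model is refit under the illustrative name fit. The names ctab, t_manual, and p_manual are likewise made up for this example.

```r
fit <- lm(dist ~ speed, data = cars)
ctab <- coef(summary(fit))   # matrix with Estimate, Std. Error, t value, Pr(>|t|)

# Ratio of each coefficient to its standard error (this is the t value)
t_manual <- ctab[, "Estimate"] / ctab[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = df.residual(fit))

cbind(t_manual, p_manual)
```

The "five to ten times smaller" rule of thumb is therefore the same as asking for a t value of at least 5 to 10 in absolute terms.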

Observations:
In our case:

the standard error for speed (0.4155) is about 9.5 times smaller than speed’s coefficient of 3.9324
speed’s p-value of 1.49e-12 is near 0, so the relationship is statistically significant
the adjusted R-squared of 0.6438 (multiple R-squared 0.6511) tells us that about 64% to 65% of the variation in stopping distance is explained by the model
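
The summary output reports two R-squared figures, a multiple R-squared of 0.6511 and an adjusted R-squared of 0.6438. The sketch below computes both from the sums of squares to show how they relate. The variable names (fit, rss, tss, r2, adj_r2) are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

rss <- sum(resid(fit)^2)                      # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
n <- nrow(cars)                               # number of observations
p <- 1                                        # number of predictors

r2 <- 1 - rss / tss                              # multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalized for model size
c(r2 = r2, adj_r2 = adj_r2)
```

With a single predictor and 50 observations the adjustment is small, so either figure leads to the same reading here.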

Conclusion

The decision to use a model is not always obvious and is often subjective. In our cars example, while the relationship between speed and stopping distance is statistically significant, it is based on only 50 observations. In real-world settings, a sample this small may not be representative of the entire population, which may call the credibility of the model into question. Additionally, the model accounts for only about 64% of the variation, which is a reason for caution when predicting stopping distances for new observations.
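
One way to make that uncertainty concrete is a prediction interval, which gives a range for an individual new observation rather than a single number. This is a rough sketch, assuming the model is refit under the illustrative name fit. The speeds of 10 and 20 mph are hypothetical values chosen only for demonstration.

```r
fit <- lm(dist ~ speed, data = cars)

# Hypothetical new speeds (mph) at which to predict stopping distance
new_cars <- data.frame(speed = c(10, 20))

# 95% prediction intervals for individual new observations
pi95 <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)
pi95   # columns: fit (point prediction), lwr and upr (interval bounds)
```

The width of these intervals reflects the residual standard error of 15.38, so a wide interval is expected for this model.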