A linear regression aims to model a data set using a straight line. It is modeled by the formula:
y = a + bx + e
where a is the y-intercept, b is the slope (how much y increases when x increases by one unit), and e is the error term, the part of y that the line does not explain.
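For a linear regression to be valid, the model must meet four assumptions: a linear relationship between the variables, independent observations, normally distributed residuals, and equal variance of the residuals.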
Before commencing analysis, use the code below to load the required packages. If any of them have not yet been installed, install them first using install.packages().
#Load required packages for analysis
library(tidyverse)
library(GGally)
library(gridExtra)
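If it is unclear which packages are already installed, a small check can install only the missing ones. This is a minimal sketch: the variable names and the CRAN mirror URL are illustrative choices, not part of the original workflow.

```r
# Packages used in this walkthrough
required_pkgs <- c("tidyverse", "GGally", "gridExtra")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

# Install only what is missing (the mirror URL is an assumption; any CRAN mirror works)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```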
For this simple demonstration, we will use the cars data set from base R, which records the speed of cars (in mph) and their stopping distance (in ft). Our hypothesis is that there is a linear relationship between speed and distance.
data(cars)
First, view the data to get an understanding of what variables are available.
#View the data
view(cars)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
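Alongside head() and str(), a numeric summary and a check for missing values can round out this first look. The following is a small sketch using only base R functions.

```r
# Minimum, quartiles, median, mean and maximum for each column
summary(cars)

# Count missing values per column; zero everywhere means no rows need handling
colSums(is.na(cars))
```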
In order to understand the data properly before modelling, we need to explore it.
Firstly, plot the explanatory and response variables to understand their distributions.
#Distribution of the response variable
ggplot(cars, aes(x = dist)) +
  geom_histogram(colour = "black", binwidth = 2)
#Distribution of the explanatory variable
ggplot(cars, aes(x = speed)) +
  geom_histogram(colour = "black", binwidth = 2)
Next, plot the response variable against the explanatory variable.
ggplot(aes(x = speed, y = dist), data = cars) +
geom_point() +
geom_smooth()
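With only 50 observations, geom_smooth() falls back to a loess curve by default. Since the hypothesis is about a straight-line relationship, it may help to overlay a least-squares line instead. A possible variant (the object name p_lm is just illustrative):

```r
library(ggplot2)

# Same scatter plot, but with a least-squares line instead of the default smoother
p_lm <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
p_lm
```

If the points scatter evenly around the line across the whole range of speed, a linear model is a reasonable starting point.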
If there are more variables in your dataset, it is a good idea to
summarise the correlations between variables. This can be done using
ggpairs.
ggpairs(data = cars)
There is a clear upward trend in the above plots, supporting our
hypothesis that speed is a predictor of distance: there is a high and
significant correlation between the two variables.
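The correlation and its significance can also be checked directly, without reading them off the plot. A short sketch using cor.test() from base R (the name cor_result is illustrative):

```r
# Pearson correlation between speed and stopping distance, with a significance test
cor_result <- cor.test(cars$speed, cars$dist)

cor_result$estimate  # correlation coefficient, about 0.81
cor_result$p.value   # p-value for the null hypothesis of zero correlation
```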
The next step is to build a model that sets distance as a function of speed. Linear models in R are built using lm(response ~ explanatory).
model_1 <- lm(dist ~ speed, cars)
summary(model_1)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Based on the above results, the p-value for speed (1.49e-12) is well below 0.05, indicating that its relationship to distance is statistically significant.
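To see where the coefficient estimates come from, they can be reproduced with the closed-form least-squares formulas for a single explanatory variable. This is a sketch for illustration; the model is refitted here so the block runs on its own.

```r
# Closed-form least-squares estimates for one explanatory variable
slope <- cov(cars$speed, cars$dist) / var(cars$speed)
intercept <- mean(cars$dist) - slope * mean(cars$speed)

# Compare with the coefficients reported by lm()
model_1 <- lm(dist ~ speed, data = cars)
c(intercept = intercept, slope = slope)
coef(model_1)
```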
Before evaluating the model, we need to check that it meets the assumptions of linear regression. We can do this by plotting the model, which produces four diagnostic plots.
par(mfrow=c(2,2))
plot(model_1)
Residuals vs. Fitted checks for linearity: the residuals should be
patternless around y = 0. Normal Q-Q checks that the residuals are
normally distributed, and should show few deviations from the straight line.
Scale-Location checks for equal variance: a positive
or negative trend suggests the variances are not equal. Finally,
Residuals vs. Leverage identifies influential points or
outliers.
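The visual checks can be supplemented with numeric ones. The sketch below uses two base R functions: shapiro.test() for normality of the residuals and cooks.distance() for influence. The 4/n cutoff is a common rule of thumb assumed here for illustration, not a fixed standard, and the model is refitted so the block runs on its own.

```r
model_1 <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normally distributed
shapiro.test(residuals(model_1))

# Cook's distance measures how much each observation influences the fit
cooks_d <- cooks.distance(model_1)

# Flag observations above 4/n (a rule-of-thumb cutoff, assumed here for illustration)
influential <- which(cooks_d > 4 / nrow(cars))
influential
```

Any flagged rows are worth inspecting in the data before deciding whether they are errors or genuine extreme observations.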
The formula for this model is:
distance = -17.58 + 3.93(speed)
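As an example of using this formula, the sketch below predicts the stopping distance for a hypothetical car travelling at 15 mph, once with predict() and once directly from the coefficients. The speed value and object names are illustrative.

```r
model_1 <- lm(dist ~ speed, data = cars)

# Hypothetical new observation: a car travelling at 15 mph
new_car <- data.frame(speed = 15)

# Predicted stopping distance from the fitted line
predicted <- predict(model_1, newdata = new_car)

# The same value computed directly from the coefficients
by_hand <- coef(model_1)[["(Intercept)"]] + coef(model_1)[["speed"]] * 15
c(predicted = unname(predicted), by_hand = by_hand)  # both about 41.4
```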
The model can be evaluated using the r-squared value, which indicates the proportion of the variance in the dependent variable that is explained by the independent variable. The r-squared value for this model is 0.6511, meaning that 65.11% of the variance in distance is explained by speed.
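The r-squared value can be extracted from the model summary rather than read off the printed output, and it can be recomputed from its definition as a check. A minimal sketch, with the model refitted so the block runs on its own:

```r
model_1 <- lm(dist ~ speed, data = cars)

# R-squared as stored in the model summary
r2_summary <- summary(model_1)$r.squared

# R-squared from its definition: 1 - (residual sum of squares / total sum of squares)
ss_residual <- sum(residuals(model_1)^2)
ss_total <- sum((cars$dist - mean(cars$dist))^2)
r2_manual <- 1 - ss_residual / ss_total

c(summary = r2_summary, manual = r2_manual)  # both about 0.651
```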