1. Linear Regression Models

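A simple linear regression models the relationship between two variables using a straight line. It is described by the formula:

y = a + bx + e

where a is the y-intercept, b is the slope (how much y changes, on average, when x increases by one unit), and e is the error term, the part of y that the line does not explain.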
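
To make the roles of a and b concrete, the sketch below estimates them by least squares on a small toy data set and compares the result with lm(). The x and y values are invented for illustration only and are not part of this tutorial's data.

```r
# Toy data, invented purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates of the slope (b) and intercept (a)
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)

# The same estimates from lm(), for comparison
fit <- lm(y ~ x)
coef(fit)

# Predicted value of y at x = 6 using the fitted line
y_hat <- a + b * 6
y_hat
```

The hand-computed a and b should agree with the coefficients reported by lm().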

2. Assumptions of Linear Regression

For its results to be trustworthy, a linear regression model should meet 4 assumptions.

  • Linearity: there is a linear relationship between the response (Y) and the explanatory variable (X)
  • Independence: errors are independent, so the distance of one point from the line tells you nothing about the distance of another
  • Normality: for any given value of x, the errors are normally distributed
  • Equal variance: the variance of the errors is the same for all values of x
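
As a rough illustration of what data meeting these assumptions looks like, the sketch below simulates a response from a straight line plus independent, normally distributed errors with constant variance. The true intercept (3), slope (1.5), error standard deviation (2), sample size and seed are all arbitrary choices made for this example.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

n <- 200
x <- runif(n, min = 0, max = 10)        # explanatory variable
errors <- rnorm(n, mean = 0, sd = 2)    # independent, normal, constant variance
y <- 3 + 1.5 * x + errors               # linear relationship plus error

sim_fit <- lm(y ~ x)
coef(sim_fit)  # estimates should land near the true values 3 and 1.5
```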

3. Required Packages

Before commencing analysis, use the code below to load the required packages. If a package has not yet been installed, install it first using install.packages().

#Load required packages for analysis
library(tidyverse)
library(GGally)
library(gridExtra)
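
One possible way to install only the packages that are missing is sketched below. The helper name missing_packages() is made up for this example; requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Usage (needs an internet connection, so it is left commented out here):
# to_install <- missing_packages(c("tidyverse", "GGally", "gridExtra"))
# if (length(to_install) > 0) install.packages(to_install)
```

Running the commented lines once before the library() calls should be enough; after that, only library() is needed in each session.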

4. Load the Data

For this simple demonstration, we will use the cars data set from base R. Our hypothesis is that there is a linear relationship between speed and distance.

data(cars)

5. Exploratory Data Analysis

First, view the data to get an understanding of what variables are available.

#View the data
view(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

In order to understand the data properly before modelling, we need to explore it.
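
A quick numeric check can complement the plots that follow. The sketch below, which is an addition to the original walkthrough, prints summary statistics for each column and counts missing values.

```r
# Minimum, quartiles, mean and maximum for each column
summary(cars)

# Number of missing values per column
na_counts <- colSums(is.na(cars))
na_counts
```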

Firstly, plot the explanatory and response variables to understand their distributions.

#distribution of response variable

ggplot(aes(x = dist), data = cars) +
  geom_histogram(colour = "black", binwidth = 2)

#distribution of explanatory variable

ggplot(aes(x = speed), data = cars) +
  geom_histogram(colour = "black", binwidth = 2)

Next, plot the response variable against the explanatory variable.

ggplot(aes(x = speed, y = dist), data = cars) +
  geom_point() +
  geom_smooth()

If there are more variables in your dataset, it is a good idea to summarise the correlations between variables. This can be done using ggpairs.

ggpairs(data = cars)

There is a clear upward trend in the above plots, which supports our hypothesis that speed is a predictor of distance: the correlation between the two variables is high and statistically significant.
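
The correlation can also be checked numerically. The sketch below uses cor.test() from base R, which is not part of the original walkthrough, to compute the Pearson correlation between the two variables and its p-value.

```r
# Pearson correlation between speed and distance
ct <- cor.test(cars$speed, cars$dist)

ct$estimate  # correlation coefficient
ct$p.value   # p-value for the test that the true correlation is 0
```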

6. Building a Model

The next step is to build a model that sets distance as a function of speed. Linear models in R are built using lm(response ~ explanatory).

model_1 <- lm(dist ~ speed, cars)
summary(model_1)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Based on the above results, the p-value for speed (1.49e-12) is well below 0.05, indicating that its relationship to distance is statistically significant.
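
If the individual numbers are needed rather than the printed summary, they can be extracted from the model object. The sketch below refits the same model so that it runs on its own, then pulls out the p-value for speed and 95% confidence intervals; the confidence intervals are an addition to the original analysis.

```r
model_1 <- lm(dist ~ speed, data = cars)

# Coefficient table as a matrix: estimate, std. error, t value, p-value
coef_table <- coef(summary(model_1))
coef_table["speed", "Pr(>|t|)"]  # p-value for the slope

# 95% confidence intervals for the intercept and slope
ci <- confint(model_1, level = 0.95)
ci
```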

7. Testing Model Assumptions

Before evaluating the model, we need to check that it meets the assumptions of linear regression. We can do this by plotting the model, which produces four diagnostic plots.

par(mfrow=c(2,2))
plot(model_1)

Residuals vs. Fitted checks linearity: the points should be scattered without pattern around the horizontal line at 0. Normal Q-Q checks that the residuals are normally distributed: the points should follow the straight line with few deviations. Scale-Location checks for equal variance: a positive or negative trend suggests the variances are not equal. Finally, Residuals vs. Leverage identifies influential points or outliers.
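
The plots can be backed up with a couple of numeric checks. The sketch below is one possible companion, not part of the original analysis: a Shapiro-Wilk test on the residuals for normality, and Cook's distance to flag influential points. It refits the model so that it runs on its own.

```r
model_1 <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test on the residuals; a small p-value suggests non-normality
sw <- shapiro.test(residuals(model_1))
sw$p.value

# Cook's distance for each observation; large values flag influential points
cd <- cooks.distance(model_1)
head(sort(cd, decreasing = TRUE), 3)
```

These checks are best read alongside the plots rather than instead of them, since formal tests can be sensitive to sample size.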

8. Model Evaluation

The formula for this model is: distance = -17.58 + 3.93(speed)
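
The formula can be used to predict distance for a given speed. The sketch below does this with predict(); the speeds 10 and 20 are arbitrary values chosen for illustration, and the model is refitted so that the block runs on its own.

```r
model_1 <- lm(dist ~ speed, data = cars)

# Speeds to predict for (arbitrary example values)
new_speeds <- data.frame(speed = c(10, 20))

predicted <- predict(model_1, newdata = new_speeds)
predicted

# By hand with the rounded coefficients, for speed = 20:
# -17.58 + 3.93 * 20 = 61.02
```

The small difference between predict() and the hand calculation comes from rounding the coefficients to two decimal places.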

The model can be evaluated using the R-squared value, which indicates the proportion of variance in the dependent variable that is explained by the independent variable. The R-squared value for this model is 0.6511, meaning that 65.11% of the variance in distance is explained by speed.
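
To see where this value comes from, the sketch below computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with the value stored in the model summary. The variable names are chosen for this example.

```r
model_1 <- lm(dist ~ speed, data = cars)

ss_res <- sum(residuals(model_1)^2)             # variation left unexplained
ss_tot <- sum((cars$dist - mean(cars$dist))^2)  # total variation in dist
r_squared <- 1 - ss_res / ss_tot

r_squared
summary(model_1)$r.squared  # should match the hand-computed value
```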