For the linear regression line to be useful for predictive purposes, certain conditions must hold:

1) The observations used to fit the line must be independent.
2) The relationship between the predictor (explanatory) and response variables must be linear.
3) The residuals from the regression line must be nearly normal.
4) The variability of the observations around the regression line must be constant (homoscedasticity).

Important Libraries

library(ggthemes)
library(tidyverse)
library(magrittr)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(fpp2)
library(forecast)
library(ggpubr)
library(boot)   # provides glm.diag.plots(), used for the diagnostic plots

Statistical Summary of the cars data

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Look into the Data

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Fitting a linear model

cor(cars$speed, cars$dist) # correlation between speed and stopping distance
## [1] 0.8068949

Observation 1:


Fig 1: Stopping distance (response variable) increases as speed (predictor variable) increases, with a correlation coefficient of about 0.81. Stopping distance as a function of speed therefore appears to be a good candidate for linear regression. It meets the 2nd condition, in that the predictor and response variables have a linear relationship.
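A scatter plot along the lines of the one described above could be reproduced with base R graphics. This is only a sketch: the title, axis labels, colours, and the variable name `r` are assumptions, not the settings of the original figure.

```r
# Sketch of a scatter plot like the one described as Fig 1.
# Labels and colours are assumed; cars records speed in mph and dist in ft.
r <- cor(cars$speed, cars$dist)   # Pearson correlation between the two variables

plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Fig 1: Stopping Distance vs. Speed",
     pch = 19, col = "blue")

# Overlay the least-squares line to make the linear trend easier to judge
abline(lm(dist ~ speed, data = cars), col = "red", lwd = 2)
```

If the points scatter evenly around the red line with no visible curvature, the linearity condition is plausible.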

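# Histogram of stopping distance, drawn on the density scale
hist(cars$dist, xlab = "Stopping distance", col = "blue",
     main = "Fig 2: Histogram of Stopping Distance", probability = TRUE)
s <- sd(cars$dist)
m <- mean(cars$dist)
# Overlay a normal density with the same mean and standard deviation
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 3)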

Observation 2:


Fig 2: The histogram of stopping distance exhibits a slight right skew, with a longer right tail.

Shapiro-Wilk Test to check for normality of data

shapiro.test(cars$dist)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars$dist
## W = 0.95144, p-value = 0.0391

Observation 3:


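The Shapiro-Wilk test (p-value = 0.0391, below 0.05) indicates that stopping distance is not normally distributed, which agrees with the right skew we saw in Fig 2.

Do Fig 2 and the Shapiro-Wilk test preclude the use of a linear regression model, since the distance data are not normal? No: the normality condition applies to the residuals of the fitted model, not to the response variable itself, so the residuals are what we check next.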

Q-Q Plot and Heteroscedasticity Test

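# Q-Q plot to check for normality of residuals
# Linear model: distance as a function of speed, from the base R "cars" dataset
# (a Gaussian glm is equivalent to an ordinary least-squares fit)
qq <- glm(dist ~ speed, data = cars, family = gaussian)
# Diagnostic plots of the model (glm.diag.plots() comes from the boot package)
glm.diag.plots(qq)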

Observation 4:


The Q-Q plot looks reasonably good: the points fall roughly along a straight line, apart from some "fraying" at the ends, which suggests the residuals are nearly normal. The residuals also scatter randomly around zero, with no stark pattern in either direction.
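The constant-variance condition can also be checked numerically rather than by eye. The sketch below computes Koenker's studentized form of the Breusch-Pagan test by hand in base R. It is an illustration, not part of the original analysis, and the names `fit`, `sq_res`, `aux`, `bp_stat`, and `bp_pval` are assumptions introduced here.

```r
# Manual Breusch-Pagan test (Koenker's studentized version), base R only.
fit <- lm(dist ~ speed, data = cars)       # same model as the Gaussian glm

# Auxiliary regression: squared residuals on the predictor
sq_res <- residuals(fit)^2
aux <- lm(sq_res ~ speed, data = cars)

# Test statistic is n * R^2 of the auxiliary regression;
# under constant variance it is approximately chi-squared with
# df = number of predictors in the auxiliary regression (here 1)
n <- nrow(cars)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_pval)
```

A large p-value would be consistent with constant variance; a small one would suggest that the spread of the residuals changes with speed.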

Summary and Conclusion:


- The cars data is a good candidate for a simple linear regression model. The predictor (speed) and the response (stopping distance) are strongly correlated, with a correlation coefficient of about 0.81. The diagnostics suggest that the conditions of linearity and near-normality of the residuals are reasonably met as well.

- Therefore, the linear equation stopping distance = 3.9 * speed - 18 (coefficients rounded) can be used for prediction.
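The coefficients in the equation above can be recovered, and the model used for prediction, with a few lines of base R. This is a minimal sketch: the object names `fit` and `new_speeds` and the example speeds of 10 and 20 mph are assumptions chosen for illustration.

```r
# Fit the simple linear regression and inspect the coefficients
fit <- lm(dist ~ speed, data = cars)
coef(fit)   # intercept is roughly -17.6, slope roughly 3.9

# Predict stopping distance for illustrative speeds,
# with 95% prediction intervals for individual observations
new_speeds <- data.frame(speed = c(10, 20))
predict(fit, newdata = new_speeds, interval = "prediction")
```

Predictions are most trustworthy within the observed speed range of 4 to 25 mph; extrapolating beyond it relies on the linear relationship continuing to hold.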