For the linear regression line to be useful for predictive purposes, certain conditions must hold:
1) The observations used to fit the line must be independent.
2) The relationship between the predictor and response variables must be linear.
3) The residuals from the regression line must be nearly normal.
4) The variability of our observations around our regression line must be constant.
library(ggthemes)
library(tidyverse)
library(magrittr)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(fpp2)
library(forecast)
library(ggpubr)
library(boot)
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
cor(cars$speed, cars$dist) # correlation between speed and stopping distance
## [1] 0.8068949
Observation 1:
Fig 1 shows that stopping distance (response variable) increases as speed (predictor variable) increases, with a correlation coefficient of about 0.81. Stopping distance as a function of speed therefore appears to be a good candidate for linear regression: it meets the 2nd condition, in that the predictor and response variables have an approximately linear relationship.
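The code that produced Fig 1 is not shown here. As a rough sketch only, a comparable scatter plot with a least-squares line overlaid could be drawn with base R graphics as below. The plot title and axis labels are assumptions, not the original figure's; the units (mph, ft) come from the `cars` dataset documentation.

```r
# Sketch of a Fig 1 style scatter plot (title and labels are assumed, not the original's)
fit <- lm(dist ~ speed, data = cars)   # least-squares line, used here only for the overlay
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed", pch = 19)
abline(fit, col = "red", lwd = 2)      # add the fitted regression line
r <- cor(cars$speed, cars$dist)        # Pearson correlation shown in the text
r
```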
# Histogram of stopping distance with a normal density curve overlaid
hist(cars$dist, xlab = "Stopping distance", col = "blue", main = "Fig 2: Histogram of Stopping Distance", probability = TRUE)
s <- sd(cars$dist)    # sample standard deviation
m <- mean(cars$dist)  # sample mean
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 3)
Observation 2:
Fig 2: The histogram of stopping distance exhibits a slight right skew, with a longer right tail.
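The visual impression of right skew can be backed by a number. The sketch below computes a moment-based sample skewness by hand in base R; the variable names are illustrative, and other skewness estimators (for example those in add-on packages) use slightly different corrections and will give slightly different values.

```r
# Moment-based sample skewness of stopping distance (illustrative helper code)
x <- cars$dist
n <- length(x)
m2 <- sum((x - mean(x))^2) / n   # second central moment
m3 <- sum((x - mean(x))^3) / n   # third central moment
skew <- m3 / m2^(3 / 2)
skew                             # a positive value indicates a longer right tail
```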
shapiro.test(cars$dist)
##
## Shapiro-Wilk normality test
##
## data: cars$dist
## W = 0.95144, p-value = 0.0391
Observation 3:
The Shapiro-Wilk test rejects normality of stopping distance at the 5% level (p = 0.0391), which is consistent with the right skew seen in Fig 2.
This does not rule out linear regression, however, because condition 3 states that it is the residuals, not the raw response data, that need to be nearly normal.
In addition to normality of the residuals, condition 4 states that the residuals should not show heteroscedasticity, i.e. their spread should be roughly constant across the range of fitted values.
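# Q-Q plot to check for normality of residuals
# Linear model: distance as a function of speed from the base R "cars" dataset
# (a Gaussian glm with the default identity link gives the same fit as lm)
qq <- glm(dist ~ speed, data = cars, family = gaussian)
# Diagnostic plots of the model (glm.diag.plots comes from the boot package)
glm.diag.plots(qq)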
Observation 4:
The Q-Q plot looks reasonably good and is consistent with approximately normal residuals: the points fall close to a straight line, apart from some "fraying" at the ends. The residuals also appear to be scattered randomly around zero, with no stark pattern one way or the other.
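For readers without the boot package, a similar pair of checks can be sketched with base R alone. This is an alternative to the diagnostics above, not the method used in this analysis: it fits the same model with `lm`, plots residuals against fitted values (condition 4) and draws a normal Q-Q plot of the residuals (condition 3). Plot titles and object names are assumptions.

```r
# Base R sketch of residual diagnostics (alternative to glm.diag.plots)
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

op <- par(mfrow = c(1, 2))                 # two panels side by side
plot(fitted(fit), res,
     xlab = "Fitted stopping distance", ylab = "Residual",
     main = "Residuals vs. fitted")        # look for a fan shape or other change in spread
abline(h = 0, lty = 2)
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res, col = "red")                   # reference line for a normal distribution
par(op)                                    # restore the previous plotting layout

# The Gaussian glm used above and lm give the same coefficients
glm_fit <- glm(dist ~ speed, data = cars, family = gaussian)
same_fit <- isTRUE(all.equal(coef(fit), coef(glm_fit)))
same_fit
```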
Summary and Conclusion:
- The cars data is a good candidate for a simple linear regression model. The predictor (speed) and the response (stopping distance) are strongly correlated, with a correlation coefficient of about 0.81. The diagnostic plots suggest that the conditions of linearity and near-normality of the residuals are reasonably met as well.
- Therefore the linear equation: stopping distance = 3.9*speed - 18 can be used for prediction.
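As a minimal sketch of how the fitted equation could be used for prediction, the code below refits the model with `lm`, prints the coefficients (which round to the slope and intercept quoted above) and predicts stopping distance for a few speeds. The speeds in `new_speeds` are hypothetical values chosen for illustration.

```r
# Fit the simple linear regression and use it for prediction
fit <- lm(dist ~ speed, data = cars)
coef(fit)                                         # intercept and slope of the fitted line

new_speeds <- data.frame(speed = c(10, 15, 20))   # hypothetical speeds (mph)
pred <- predict(fit, newdata = new_speeds, interval = "prediction")
pred                                              # columns: fit, lwr, upr (95% prediction interval)
```

Predictions are most trustworthy inside the observed speed range of 4 to 25 mph; the negative intercept shows that the line should not be extrapolated to very low speeds.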