Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

  1. INTRODUCTION
#display first six records of dataset
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
#check summary statistics
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
#dataset size
dim(cars)
## [1] 50  2

The data set contains 2 variables and 50 records (rows). Both variables, “speed” (in mph) and “dist” (stopping distance in ft), are quantitative and are recorded as whole numbers.
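
As a quick check of the variable types described above, the following is a minimal sketch using base R only. The helper name `is_whole` is introduced here purely for illustration and is not part of the original analysis.

```r
# Inspect the structure of the built-in cars data frame
str(cars)

# Hypothetical helper: TRUE when every value in x is a whole number
is_whole <- function(x) all(x == round(x))

# Check that both columns are numeric and hold only whole-number values
types <- sapply(cars, is.numeric)
whole <- sapply(cars, is_whole)
print(types)
print(whole)
```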

  2. VISUALIZATION
#create boxplots
boxplot(cars$speed, main = "Speed")

boxplot(cars$dist, main = "Distance")

The speed box plot suggests that the speed variable is roughly symmetric and consistent with a normal distribution: the plot shows no outliers, and the summary statistics give a mean (15.4) almost equal to the median (15.0). The distance box plot, in contrast, suggests that the distance variable is skewed to the right: the mean (42.98) is greater than the median (36.00), and the plot flags outlying values at the upper end.
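
The reading of the box plots above can be backed up numerically. The sketch below compares mean and median for each variable and uses `boxplot.stats()` from base R to list the values that `boxplot()` would draw as individual outlier points; the variable names are chosen here for illustration.

```r
# Compare mean and median for each variable (mean > median hints at right skew)
centers <- sapply(cars, function(x) c(mean = mean(x), median = median(x)))
print(centers)

# Values lying more than 1.5 box lengths beyond the box,
# which boxplot() draws as separate points
speed_out <- boxplot.stats(cars$speed)$out
dist_out <- boxplot.stats(cars$dist)$out
print(speed_out)
print(dist_out)
```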

  3. LINEAR REGRESSION MODEL
model <- lm(dist ~ speed, data=cars)

Linear regression should satisfy the following assumptions:

  1. Linear relationship. Linear regression requires the relationship between the independent and dependent variables to be linear.
#create a scatterplot with the least squares line laid on top
plot(dist ~ speed, data=cars, main="Distance vs Speed")
abline(model,col="red")

#calculate correlation
cor(cars)
##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000

The scatter plot shows a positive linear relationship between the response variable and the predictor: the points trend upwards from left to right at a fairly steady rate. The relationship between ‘dist’ and ‘speed’ can be considered linear, since most of the points do not deviate much from the regression line. The relationship is also strong, as the correlation coefficient is approximately 0.81.
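
To look at the strength of the relationship in a little more detail, a minimal sketch follows. It uses `cor.test()` from base R to test whether the correlation differs from zero, and checks the standard identity that the squared correlation equals R-squared in simple linear regression.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
print(round(r, 4))

# Test of the null hypothesis that the true correlation is zero
ct <- cor.test(cars$speed, cars$dist)
print(ct$p.value)

# In simple linear regression the squared correlation equals R-squared
model <- lm(dist ~ speed, data = cars)
print(all.equal(r^2, summary(model)$r.squared))
```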

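  2. Multivariate normality. Linear regression analysis is often described as requiring the variables to be approximately normally distributed. Strictly speaking, the normality assumption applies to the model's residuals, which are examined in the model diagnostics section, so the checks on the raw variables here are preliminary.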
#create histogram
par(mfrow=c(2,2))
hist(cars$dist, probability=TRUE, breaks=seq(0, 140, 20), col="gray", border="white", main="Distribution of distance")
d <- density(cars$dist)
lines(d, col="red")
#normal probability plot
qqnorm(cars$dist)
qqline(cars$dist)

#create histogram
hist(cars$speed, probability=TRUE, breaks=seq(0, 30, 3), col="gray", border="white", main="Distribution of speed")
d <- density(cars$speed)
lines(d, col="red")
#normal probability plot
qqnorm(cars$speed)
qqline(cars$speed)

Judging by the histogram, the speed data are approximately normally distributed: the distribution is unimodal (one clear peak) and bell-shaped, and most of the points on the normal Q-Q plot fall on the diagonal line. The histogram of distance, in contrast, suggests that the distance data are skewed to the right: the distribution is unimodal, but it has a longer right tail and the mean is greater than the median. However, only a few points on the normal Q-Q plot deviate from the diagonal line, so we can treat the distribution of distance as nearly normal.
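
The visual impressions above can be complemented with a formal test. The sketch below is one possible check, not part of the original analysis: it runs the Shapiro-Wilk test (`shapiro.test()` in base R) on each variable and computes a simple moment-based skewness. The helper `skewness` is defined here for illustration.

```r
# Shapiro-Wilk test: the null hypothesis is that the data are normal,
# so a small p-value is evidence against normality
sw_speed <- shapiro.test(cars$speed)
sw_dist <- shapiro.test(cars$dist)
print(sw_speed)
print(sw_dist)

# Hypothetical helper: moment-based sample skewness (positive = right skew)
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
print(sapply(cars, skewness))
```

With only 50 observations, the test and the plots are best read together rather than either one being treated as decisive.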

  3. No autocorrelation. Linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other. Here we assume that the residuals are independent.
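
Rather than relying on the assumption alone, one could inspect the residuals for first-order autocorrelation. The sketch below computes the Durbin-Watson statistic by hand and the lag-1 sample autocorrelation with `acf()`. Note the caveat that "consecutive" here means consecutive rows of `cars`, which are ordered by speed rather than by time, so this is only a rough check.

```r
model <- lm(dist ~ speed, data = cars)
e <- resid(model)

# Durbin-Watson statistic computed by hand: values near 2 suggest little
# first-order autocorrelation between consecutive residuals
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)

# Lag-1 sample autocorrelation of the residuals (no plot drawn)
lag1 <- acf(e, lag.max = 1, plot = FALSE)$acf[2]
print(lag1)
```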

  4. Homoscedasticity. The residuals should have constant variance across the range of the predictor.

summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The fitted linear regression is described by the following equation:

\(Distance = -17.5791 + 3.9324 \times Speed\)

The intercept of -17.5791 is the predicted stopping distance when speed is 0. A negative stopping distance is not physically meaningful; a speed of 0 lies outside the observed range of speeds (4 to 25), so the intercept should be read as an extrapolation that positions the line rather than as a realistic prediction. The independent variable ‘speed’ is statistically significant, since its p-value (1.49e-12) is far below the 5% significance level.

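The slope of 3.9324 indicates that each one-unit increase in speed increases the predicted stopping distance by 3.9324 units on average (about 3.93 ft per additional mph). The multiple R-squared of 0.6511 indicates that speed explains about 65% of the variability in stopping distance.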
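
To illustrate how the fitted equation is used, the following sketch extracts the coefficients and predicts stopping distance at a few speeds inside the observed range. The speeds 10, 15, and 20 are arbitrary values chosen for illustration.

```r
model <- lm(dist ~ speed, data = cars)

# Estimated intercept and slope
b <- coef(model)
print(round(b, 4))

# Predicted stopping distance at a few speeds inside the observed range (4 to 25)
new_speeds <- data.frame(speed = c(10, 15, 20))
pred <- predict(model, newdata = new_speeds)
print(pred)

# The same prediction by hand for speed = 15
by_hand <- b[["(Intercept)"]] + b[["speed"]] * 15
print(by_hand)

# 95% confidence intervals for the coefficients
print(confint(model))
```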

  4. MODEL DIAGNOSTICS

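To assess whether the linear model is reliable, we need to check the following conditions.

  a. Linearity. We already checked that the relationship between ‘dist’ and ‘speed’ is linear using a scatter plot. We should also verify this condition with a plot of the residuals against the fitted values. (Plotting the residuals against the response ‘dist’ itself would be misleading, because the residuals are correlated with the response by construction.)

plot(model$residuals ~ model$fitted.values, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0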

In the residual plot above, the residuals are spread fairly evenly and randomly around the horizontal zero line with no obvious curved pattern, which supports the linearity condition.
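
A common way to make this judgment less subjective is to overlay a smooth trend on a plot of residuals against fitted values. The sketch below does this with `lowess()` from base R; it is an illustration under that choice of smoother, not the textbook's prescribed method.

```r
model <- lm(dist ~ speed, data = cars)
fit <- fitted(model)
e <- resid(model)

# Residuals against fitted values, with a dashed zero line
plot(fit, e, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 3)

# Smooth trend through the residuals; strong curvature would
# suggest that a straight line is not adequate
lines(lowess(fit, e), col = "red")

# By construction, least-squares residuals average to zero and are
# uncorrelated with the fitted values
print(mean(e))
print(cor(fit, e))
```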

  b. Nearly normal residuals. Let’s check whether the residuals are normally distributed.
#create histogram
par(mfrow=c(1,2))
hist(model$residuals, probability=TRUE, col="gray", border="white", main="Distribution of residuals")
d <- density(model$residuals)
lines(d, col="red")
#normal probability plot
qqnorm(model$residuals)
qqline(model$residuals)

Judging by the histogram, the distribution of the residuals is close to normal: it is unimodal (one clear peak) and bell-shaped, though slightly skewed to the right, and most of the points on the normal Q-Q plot fall on the diagonal line.
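
As a numerical complement to the plots, the sketch below looks at standardized residuals (`rstandard()` in base R) and applies the Shapiro-Wilk test to the raw residuals. The 2-standard-deviation cutoff is a conventional rule of thumb assumed here, not something stated in the original analysis.

```r
model <- lm(dist ~ speed, data = cars)

# Standardized residuals: roughly 95% should lie within +/- 2
# if the residuals are approximately normal
z <- rstandard(model)
share_within_2 <- mean(abs(z) <= 2)
print(share_within_2)

# Observations with unusually large standardized residuals
print(which(abs(z) > 2))

# Shapiro-Wilk test on the residuals (small p-value = evidence against normality)
sw_resid <- shapiro.test(resid(model))
print(sw_resid$p.value)
```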

  c. Constant variability. The constant variability condition appears to be met: the variability of the points around the least squares line is reasonably constant, and so is the variability of the residuals around the zero line.
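
The constant variability condition can also be examined with a formal test. The sketch below computes, by hand and with base R only, Koenker's studentized variant of the Breusch-Pagan test: the squared residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution with 1 degree of freedom. The null hypothesis is constant variance, so a small p-value would be evidence of heteroscedasticity.

```r
model <- lm(dist ~ speed, data = cars)
e2 <- resid(model)^2

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(e2 ~ cars$speed)

# Studentized Breusch-Pagan statistic: n * R-squared of the auxiliary
# regression, compared with a chi-squared distribution (1 df)
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p.value = bp_p))
```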

  5. CONCLUSION

We found that the linear model appears to meet the assumptions of simple linear regression and is described by the following equation:

\(Distance = -17.5791 + 3.9324 \times Speed\)

Moreover, the model diagnostics indicate that the model is reasonably reliable, since it satisfies the conditions of linearity, nearly normal residuals, and constant variability.