Homework 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

(1) Dataset:

The \(cars\) dataset has 50 rows and 2 columns. Each row is one observation that pairs a car's speed with the distance the car needed to stop. The columns are “speed” (in miles per hour) and “dist” (in feet).

library(ggplot2)
Warning: package 'ggplot2' was built under R version 3.3.2
library(dplyr)

summary(cars)
     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  
colnames(cars)
[1] "speed" "dist" 
ncol(cars)
[1] 2
nrow(cars)
[1] 50
print(cars)
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17
11    11   28
12    12   14
13    12   20
14    12   24
15    12   28
16    13   26
17    13   34
18    13   34
19    13   46
20    14   26
21    14   36
22    14   60
23    14   80
24    15   20
25    15   26
26    15   54
27    16   32
28    16   40
29    17   32
30    17   40
31    17   50
32    18   42
33    18   56
34    18   76
35    18   84
36    19   36
37    19   46
38    19   68
39    20   32
40    20   48
41    20   52
42    20   56
43    20   64
44    22   66
45    23   54
46    24   70
47    24   92
48    24   93
49    24  120
50    25   85

(2) Data Visualization:

cars_speed_df = arrange(cars, speed)

# Refer to the column by name inside aes() rather than with `$`
ggplot(data = cars_speed_df, aes(x = speed)) +
  geom_histogram(aes(fill = ..count..)) +
  scale_fill_gradient("Count", low = "green", high = "brown") +
  labs(title = "Histogram - Speed", x = "speed", y = "Count")

cars_dist_df = arrange(cars, dist)

# Refer to the column by name inside aes() rather than with `$`
ggplot(data = cars_dist_df, aes(x = dist)) +
  geom_histogram(aes(fill = ..count..)) +
  scale_fill_gradient("Count", low = "blue", high = "purple") +
  labs(title = "Histogram - Distance", x = "dist", y = "Count")

(3) Statistical Analysis:

In this section we compute the correlation between speed and stopping distance to check for a linear relationship, and then fit a linear regression model of distance on speed.

(3.1) Create a function to calculate the correlation and round it to 4 decimal digits
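findCorrelation <- function(x = cars$speed, y = cars$dist) {
  # Pearson correlation between the two vectors, rounded to 4 decimal places
  corr = round(cor(x, y), 4)
  print(paste0("Correlation = ", corr))
  return(corr)
}
# Store the result under a name other than `c`, which is also base R's c() function
corr_value = findCorrelation()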
[1] "Correlation = 0.8069"
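
As a cross-check on what cor() reports, the Pearson correlation can be computed directly from its definition. The following is a minimal sketch using the built-in cars data; the variable names manual_corr and builtin_corr are introduced here only for illustration.

```r
# Pearson correlation computed from its definition, using the built-in cars data.
x <- cars$speed
y <- cars$dist

# Sum of cross-products of deviations from the means, divided by the
# square root of the product of the sums of squared deviations.
manual_corr <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Compare with the built-in implementation.
builtin_corr <- cor(x, y)
print(round(c(manual = manual_corr, builtin = builtin_corr), 4))
```

Both values should agree, which shows that the function above is simply reporting the standard Pearson coefficient.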
(3.2) Create a function for Linear Model
findStatsFunction <- function() {
  # Use bare column names in the formula; lm() looks them up in `data`,
  # which also lets predict() work with new data later
  m = lm(dist ~ speed, data = cars)
  s = summary(m)
  print(s)

  return (m)
}
m = findStatsFunction()

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

\[ \widehat{\text{dist}} = -17.5791 + 3.9324 \times \text{speed} \]
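
To illustrate how the fitted equation is used, the sketch below predicts stopping distances for a few speeds. It refits the model under the assumed name fit, and the speeds 10, 15, and 20 mph are hypothetical values chosen only for this example.

```r
# Refit the model with bare column names so predict() can match newdata columns.
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph) chosen only for illustration.
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions from the fitted line.
predicted <- predict(fit, newdata = new_speeds)

# The same values computed by hand: intercept + slope * speed.
by_hand <- coef(fit)[[1]] + coef(fit)[[2]] * new_speeds$speed

print(cbind(new_speeds, predicted, by_hand))
```

The two columns should match, since predict() for a simple linear model evaluates exactly the equation above.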

(3.3) Display the Linear Model
ggplot(cars, aes(speed, dist)) + geom_point(colour = "blue", size = 2) +
    # Draw the fitted line from the model's intercept and slope (constants, so outside aes())
    geom_abline(slope = coef(m)[[2]], intercept = coef(m)[[1]]) +
    labs(title = "speed vs dist") +
    xlab("speed") +
    ylab("dist")

(3.4) Regression Statistics
Statistic                     Value
Linear Regression Equation    dist = -17.5791 + (3.9324 * speed)
Correlation Coefficient       0.8069
Multiple R-Square             0.6511
R-Square                      0.6511


(4) Quality Evaluation of the Model:

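The Multiple R-squared value measures the proportion of the variation in the response that the model explains. The reported R-squared of 0.6511 means that speed accounts for 65.11 percent of the variation in stopping distance, leaving roughly 35 percent unexplained. The slope estimate is also highly significant (p-value 1.49e-12), and the residual standard error is 15.38 on 48 degrees of freedom.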
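
The R-squared value can be reproduced from the residuals, which makes the "proportion of variation explained" interpretation concrete. This is a minimal sketch; the names fit, ss_res, ss_tot, and r_squared are introduced here for illustration.

```r
# Fit the same simple linear model on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation left unexplained by the model.
ss_res <- sum(resid(fit)^2)

# Total sum of squares: variation of dist around its mean.
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared is the share of total variation the model accounts for.
r_squared <- 1 - ss_res / ss_tot

# For a one-predictor model this equals the squared correlation.
r_from_cor <- cor(cars$speed, cars$dist)^2

print(round(c(r_squared = r_squared, r_from_cor = r_from_cor), 4))
```

Both quantities should match the Multiple R-squared shown in the model summary.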

(5) Residual Analysis:

ggplot(m, aes(.fitted, .resid)) + 
  geom_point(color = "brown", size=2) +
  labs(title = "Fitted Values vs Residuals") +
  labs(x = "Fitted Values") +
  labs(y = "Residuals")

qqnorm(resid(m))
qqline(resid(m))

If the residuals are normally distributed, the points in the Q-Q plot should fall close to a straight line. For our model, the points at both ends diverge from the Q-Q line. This suggests that the residuals are not normally distributed: the tails of their distribution are “heavier” than we would expect from a normal distribution.
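
The visual impression from the Q-Q plot can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test from base R to the residuals; this test is not part of the original analysis and is shown only as one possible check, with fit and sw as assumed names.

```r
# Fit the same simple linear model on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution.
sw <- shapiro.test(resid(fit))

print(sw)
```

A small p-value would count as evidence against normality and would be consistent with the divergence seen in the tails of the Q-Q plot.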

(6) Conclusion:

We observe that speed and stopping distance have a correlation of 0.8069, and the model has a Multiple R-squared value of 0.6511. The Q-Q plot also suggests that the residuals are not normally distributed, so a model that uses only speed as a predictor does not fully explain the data. Therefore, other factors such as weather, road conditions, and brake pad condition may need to be considered to predict the stopping distance accurately.