The \(cars\) dataset has 50 rows and 2 columns. Each row is one observation that pairs a car's speed (in mph) with the distance (in ft) the car took to stop. The columns in the dataset are “speed” and “dist”.
library(ggplot2)
library(dplyr)
summary(cars)
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00
colnames(cars)
[1] "speed" "dist"
ncol(cars)
[1] 2
nrow(cars)
[1] 50
print(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
11 11 28
12 12 14
13 12 20
14 12 24
15 12 28
16 13 26
17 13 34
18 13 34
19 13 46
20 14 26
21 14 36
22 14 60
23 14 80
24 15 20
25 15 26
26 15 54
27 16 32
28 16 40
29 17 32
30 17 40
31 17 50
32 18 42
33 18 56
34 18 76
35 18 84
36 19 36
37 19 46
38 19 68
39 20 32
40 20 48
41 20 52
42 20 56
43 20 64
44 22 66
45 23 54
46 24 70
47 24 92
48 24 93
49 24 120
50 25 85
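# Sorting is not required for a histogram; it is kept so cars_speed_df matches the original workflow
cars_speed_df <- arrange(cars, speed)
ggplot(data = cars_speed_df, aes(x = speed)) +
  # Colour each bar by its count; after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(fill = after_stat(count))) +
  scale_fill_gradient("Count", low = "green", high = "brown") +
  labs(title = "Histogram - Speed", x = "speed", y = "Count")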
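# Same histogram for stopping distance
cars_dist_df <- arrange(cars, dist)
ggplot(data = cars_dist_df, aes(x = dist)) +
  # Colour each bar by its count; after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(fill = after_stat(count))) +
  scale_fill_gradient("Count", low = "blue", high = "purple") +
  labs(title = "Histogram - Distance", x = "dist", y = "Count")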
In this section we fit a linear regression model and calculate the correlation between speed and stopping distance to see whether the two variables are related.
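# Compute the Pearson correlation between speed and stopping distance
findCorrelation <- function() {
  x <- cars$speed
  y <- cars$dist
  corr <- round(cor(x, y), 4)
  print(paste0("Correlation = ", corr))
  return(corr)
}
# Avoid naming the result `c`, which would shadow the name of base R's c() function
corr_value <- findCorrelation()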
[1] "Correlation = 0.8069"
# Fit dist ~ speed by ordinary least squares, print the summary, and return the model
findStatsFunction <- function() {
  # Use bare column names in the formula so predict() works with new data
  m <- lm(dist ~ speed, data = cars)
  s <- summary(m)
  print(s)
  return(m)
}
m <- findStatsFunction()
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
\[ \widehat{\text{dist}} = -17.5791 + 3.9324 \times \text{speed} \]
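As an illustration of how the fitted equation can be used, the sketch below predicts the stopping distance for a single speed and compares it with a hand calculation from the rounded coefficients. The speed of 20 mph and the names `fit`, `new_obs`, `pred_model` and `pred_manual` are assumptions chosen for this example; they do not appear in the original analysis. The model is refitted with bare column names so that `predict()` can accept new data.

```r
# Refit the same simple linear model; the cars dataset ships with base R
fit <- lm(dist ~ speed, data = cars)

# Hypothetical input: a car travelling at 20 mph (value chosen only for illustration)
new_obs <- data.frame(speed = 20)

# Prediction from the fitted model
pred_model <- unname(predict(fit, newdata = new_obs))

# The same calculation by hand, using the rounded coefficients reported above
pred_manual <- -17.5791 + 3.9324 * 20

print(round(pred_model, 2))
print(round(pred_manual, 2))
```

Both values should agree to within rounding, at roughly 61 ft. Predictions of this kind are only reasonable inside the observed speed range of 4 to 25 mph; at very low speeds the negative intercept would produce a negative stopping distance.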
ggplot(cars, aes(speed, dist)) +
  geom_point(colour = "blue", size = 2) +
  # Overlay the fitted regression line using the model's intercept and slope
  geom_abline(slope = round(m$coefficients[2], 4), intercept = round(m$coefficients[1], 4)) +
  labs(title = "speed vs dist", x = "speed", y = "dist")
| Linear Regression Equation | Correlation Coefficient | Multiple R-squared | R-squared |
|---|---|---|---|
| dist = -17.5791 + (3.9324 * speed) | 0.8069 | 0.6511 | 0.6511 |
The Multiple R-squared value measures the proportion of the variation in the response that the model accounts for. The reported R-squared of 0.6511 means that speed explains 65.11 percent of the variation in stopping distance, leaving about 35 percent unexplained.
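For a simple linear regression with one predictor and an intercept, R-squared equals the square of the Pearson correlation coefficient, so the two reported figures can be cross-checked against each other:

\[ R^2 = r^2 = (0.8069)^2 \approx 0.6511 \]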
ggplot(m, aes(.fitted, .resid)) +
geom_point(color = "brown", size=2) +
labs(title = "Fitted Values vs Residuals") +
labs(x = "Fitted Values") +
labs(y = "Residuals")
qqnorm(resid(m))
qqline(resid(m))
If the residuals are normally distributed, we expect the points in the Q-Q plot to follow a straight line. With our model, the two ends diverge from the Q-Q line. This behavior suggests that the residuals are not normally distributed and that the distribution’s tails are “heavier” than we would expect from a normal distribution.
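A visual reading of the Q-Q plot can be supplemented with a formal test. The sketch below applies the Shapiro-Wilk test from base R to the residuals; it is not part of the original analysis, and the names `fit` and `sw` are assumptions introduced for this example.

```r
# Refit the model so this snippet is self-contained
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(resid(fit))
print(sw)

# A small p-value (for example, below 0.05) would count as evidence against normality
```

With only 50 observations the test has limited power, so it is best read alongside the Q-Q plot and not as a replacement for it.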
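We observe that speed and stopping distance have a correlation of 0.8069, and the model has a Multiple R-squared value of 0.6511, so speed alone leaves about 35 percent of the variation in stopping distance unexplained. The Q-Q plot also suggests that the residuals depart from normality in the tails. Together, these results indicate that using only speed as a predictor is insufficient to fully explain the data. Other factors, such as weather, road conditions and the condition of the brake pads, may therefore need to be considered to predict stopping distance accurately.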