head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) + geom_point() + ggtitle('Speed vs Stopping Distance')
By eye, the scatterplot suggests an association between the two variables. The points are not tightly clustered, but stopping distance tends to increase with speed. Let’s fit a linear model to see what is happening.
# Fit the linear model: x = speed (explanatory), y = dist (response)
cars_lm <- lm(dist ~ speed, data = cars)
ggplot(cars, aes(x = speed, y = dist)) + geom_point() + geom_smooth(method = lm, se = FALSE) +
  ggtitle(label = "Speed vs Stopping Distance with Regression Line",
          subtitle = paste("R^2 =", signif(summary(cars_lm)$r.squared, 5),
                           " Intercept =", signif(cars_lm$coefficients[[1]], 5),
                           " Slope =", signif(cars_lm$coefficients[[2]], 5),
                           " P =", signif(summary(cars_lm)$coefficients[2, 4], 5)))
As suspected, there is a positive linear pattern: stopping distance (dist) is positively associated with speed. To confirm this, since \(R^2 = 0.65108\), we have \(R = \sqrt{0.65108} \approx 0.807\) (taking the positive root because the slope is positive), which indicates a strong correlation between the response (dist) and the explanatory variable (speed).
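In a regression with a single predictor, the squared Pearson correlation equals \(R^2\), so the value of \(R\) above can be checked directly. A minimal sketch in base R, assuming only the built-in cars data; the names r and r_squared are illustrative and not part of the original analysis:

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)

# R^2 reported by the fitted simple linear regression
cars_lm <- lm(dist ~ speed, data = cars)
r_squared <- summary(cars_lm)$r.squared

# In simple linear regression, r^2 should equal R^2
all.equal(r^2, r_squared)

# Correlation rounded to three decimals
round(r, 3)
```

Computing the correlation with cor() also gives its sign directly, which the square root of \(R^2\) alone does not.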
summary(cars_lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The formula for the model is \[\widehat{\text{dist}} = -17.5791 + 3.9324 \times \text{speed}\]
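In other words, the predicted stopping distance increases by 3.9324 ft for each 1 mph increase in speed. The intercept says the predicted distance is -17.5791 ft at a speed of zero, which has no physical meaning: the observed speeds range from 4 to 25 mph, so the intercept is an extrapolation outside the data.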
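To make the interpretation of the slope and intercept concrete, the fitted line can be evaluated at specific speeds. The sketch below is illustrative only: the speeds 10 and 20 and the names new_speeds, pred, and by_hand are assumptions chosen for the example, not values from the original analysis.

```r
cars_lm <- lm(dist ~ speed, data = cars)

# Hypothetical speeds to predict for, both inside the observed range
new_speeds <- data.frame(speed = c(10, 20))

# Predicted stopping distances from the fitted line
pred <- predict(cars_lm, newdata = new_speeds)

# The same values by hand: intercept + slope * speed
by_hand <- coef(cars_lm)[["(Intercept)"]] +
  coef(cars_lm)[["speed"]] * new_speeds$speed

# A 10-unit difference in speed changes the prediction by 10 * slope
diff(pred)
```

Keeping the new speeds within the range of the data avoids the extrapolation problem that makes the intercept hard to interpret.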
The very small p-value for the slope (1.49e-12) gives strong evidence that the association between speed and stopping distance is statistically significant, which is consistent with the linear model describing the data.
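The p-value for the slope comes from a t test: the estimate is divided by its standard error and compared with a t distribution on the residual degrees of freedom. A short sketch of that arithmetic, using only base R; the names coefs, t_slope, and p_slope are illustrative:

```r
cars_lm <- lm(dist ~ speed, data = cars)
coefs <- summary(cars_lm)$coefficients

# t statistic for the slope: estimate divided by its standard error
t_slope <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p-value from a t distribution with the residual
# degrees of freedom (48 for this model)
p_slope <- 2 * pt(abs(t_slope), df = df.residual(cars_lm), lower.tail = FALSE)

c(t = t_slope, p = p_slope)
```

These should match the t value and Pr(>|t|) columns of the speed row in the summary output.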
The model also explains 65.11% of the variation in stopping distance, according to \(R^2\).
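The "variation explained" reading of \(R^2\) follows from its definition as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that calculation in base R; the names ss_res, ss_tot, and r_squared are illustrative:

```r
cars_lm <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation left unexplained by the model
ss_res <- sum(resid(cars_lm)^2)

# Total sum of squares: variation of dist around its mean
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# Proportion of the variation explained by the model
r_squared <- 1 - ss_res / ss_tot

# Expressed as a percentage
round(100 * r_squared, 2)
```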
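library(broom)  # provides augment()
# Add fitted values and residuals (.fitted, .resid, .std.resid) to the data
carslm_df <- augment(cars_lm)
ggplot(carslm_df, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0, color = 'orange') + ggtitle('Residuals vs Fitted')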
The residuals are scattered randomly around zero with no obvious pattern, which indicates that a linear model is a good fit.
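# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(carslm_df, aes(x = .std.resid)) + geom_histogram(aes(y = after_stat(density)), bins = 10, colour = "black") +
  geom_density(alpha = .2, fill = "blue") + ggtitle('Histogram of Residuals')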
The histogram of the standardized residuals looks nearly normal: it is roughly symmetric and bell-shaped.
# qplot() is deprecated in recent ggplot2; geom_qq() draws the same Q-Q points
ggplot(carslm_df, aes(sample = .std.resid)) + geom_qq() + geom_abline() + ggtitle('Normal Q-Q Plot')
The normal probability plot shows a strong linear pattern. There are only minor deviations from the line fit to the points on the probability plot. The normal distribution appears to be a good model for this data.
The relationship between stopping distance and speed is positive and approximately linear, and the residuals are approximately normally distributed. There may be a few outliers, but they are not far from the rest of the data points. Overall, the linear model seems to be a good fit for this data.