library(psych)  # provides describe()
describe(cars)
## vars n mean sd median trimmed mad min max range skew
## speed 1 50 15.40 5.29 15 15.47 5.93 4 25 21 -0.11
## dist 2 50 42.98 25.77 36 40.88 23.72 2 120 118 0.76
## kurtosis se
## speed -0.67 0.75
## dist 0.12 3.64
cars$time = cars$dist / cars$speed
library(ggplot2)
ggplot(cars, aes(dist, speed)) +
  geom_point(aes(colour = time)) +
  scale_colour_gradientn(colours = terrain.colors(20))
The graph above plots stopping distance (x-axis) against speed (y-axis) for each observation, with points coloured by time, defined here as dist / speed.
This follows partly from how time is defined, but the plot makes it visible: as distance goes down, so does the time needed to cover that distance.
lmf = lm(cars$dist~cars$speed)
lmf
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Coefficients:
## (Intercept) cars$speed
## -17.579 3.932
An intercept of -17.579 means the model predicts a negative distance at zero speed. The predicted distance only reaches zero at a speed of about 4.47 (17.579 / 3.932), which goes to show that linear regression isn’t perfect.
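The figure of roughly 4.47 is the intercept divided by the slope, with the sign flipped: the speed at which the fitted line crosses zero distance. Here is a minimal sketch that reproduces it from the model itself, assuming the built-in `cars` dataset; the names `fit`, `b` and `zero_speed` are chosen for this example only.

```r
# Refit on a fresh copy of the built-in dataset (dist in ft, speed in mph).
fit <- lm(dist ~ speed, data = datasets::cars)

# b[1] is the intercept, b[2] is the slope.
b <- unname(coef(fit))

# Solve 0 = b[1] + b[2] * speed for speed.
zero_speed <- -b[1] / b[2]
zero_speed
```

With the unrounded coefficients the result differs slightly from the value computed from the rounded ones, but both are close to 4.47.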
Our formula for distance becomes \(\text{dist} = -17.579 + 3.932 \times \text{speed}\). Let’s compare the model’s predictions with the actual data.
cars$regressed = -17.579 + (3.932*cars$speed)
comparing = melt(cars, id.vars = 1)  # long format, needs library(reshape2); not used by the plot below
ggplot() +
geom_point(data=cars, aes(speed, dist, color='Actual' )) +
geom_point(data=cars, aes(speed, regressed, color='Predicted'))
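The predicted values above are computed from coefficients rounded to three decimals. As a hedged alternative, `fitted()` returns the same predictions from the unrounded model. The sketch below assumes the built-in `cars` dataset and checks how far the two versions differ; `fit`, `by_hand` and `max_gap` are names invented for this example.

```r
# Refit on a fresh copy of the built-in dataset.
fit <- lm(dist ~ speed, data = datasets::cars)

# Predictions from the rounded coefficients used in the text.
by_hand <- -17.579 + 3.932 * datasets::cars$speed

# Largest absolute difference between the hand-computed and model predictions.
max_gap <- max(abs(by_hand - fitted(fit)))
max_gap
```

The gap comes only from rounding, so it stays far below one foot. Using `fitted(lmf)` directly avoids retyping coefficients if the model changes.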
summary(lmf)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
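I’d say a residual standard error of 15.38 is actually pretty bad: it is about 36% of the mean distance (42.98), and the model explains only about 65% of the variance (R-squared 0.6511). I would not use this model.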
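To put the residual standard error in context, it can be recomputed from the residuals and compared with the mean of the response. This is a minimal sketch assuming the built-in `cars` dataset; `fit`, `rse` and `relative` are illustrative names.

```r
# Refit on a fresh copy of the built-in dataset.
fit <- lm(dist ~ speed, data = datasets::cars)

# Residual standard error: sqrt(RSS / residual degrees of freedom).
rse <- sqrt(sum(resid(fit)^2) / df.residual(fit))

# Express it as a fraction of the mean stopping distance.
relative <- rse / mean(datasets::cars$dist)

c(rse = rse, relative = relative)
```

Whether an error of this relative size is acceptable depends on what the predictions are used for, so treat the ratio as a rough yardstick rather than a rule.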
plot(fitted(lmf),resid(lmf))
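This model does not undervalue the dependent variable consistently: with an intercept in the model, least-squares residuals average to zero. The summary above shows a lopsided spread instead, with a median residual of -2.272 against a maximum of 43.201, so the model slightly over-predicts many observations and badly under-predicts a few. This is probably because of the large negative intercept and the limited range of speeds.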
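One way to check for systematic bias is to look at the residuals directly instead of reading it off the plot. The sketch below assumes the built-in `cars` dataset, and uses the names `fit`, `r` and `share_over` for illustration only.

```r
# Refit on a fresh copy of the built-in dataset.
fit <- lm(dist ~ speed, data = datasets::cars)
r <- resid(fit)  # residual = actual - fitted

# For least squares with an intercept, this is zero up to floating-point noise.
mean(r)

# Share of observations where the model over-predicts (negative residual).
share_over <- mean(r < 0)
share_over

# A negative median with a zero mean points to a right-skewed residual distribution.
median(r)
```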
qqnorm(resid(lmf))
qqline(resid(lmf))
The heavy tails in the Q-Q plot indicate that a few extreme values are distorting the fit.
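To see which observations sit in those tails, the standardized residuals can be ranked by size. This is a rough sketch assuming the built-in `cars` dataset; the choice of the top three and the names `fit`, `std_res` and `extreme` are arbitrary.

```r
# Refit on a fresh copy of the built-in dataset.
fit <- lm(dist ~ speed, data = datasets::cars)

# Standardized residuals: residuals scaled by their estimated standard deviation.
std_res <- rstandard(fit)

# Row indices of the three largest standardized residuals in absolute value.
extreme <- order(abs(std_res), decreasing = TRUE)[1:3]

# Show those rows next to their standardized residuals.
cbind(datasets::cars[extreme, ], std_res = round(std_res[extreme], 2))
```

Values beyond roughly 2 in absolute size are commonly treated as worth a closer look, though that threshold is a convention and not a formal test.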