Using R’s built-in cars dataset, we are going to build a linear model to predict stopping distance (dist, in feet) from car speed (speed, in mph). First let’s view the data we’re working with:
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
Let’s see if there might be a linear relationship between the columns speed and dist.
* It appears there is a positive relationship between speed and distance
plot(cars$speed,cars$dist,main='Car Speed & Distance',xlab='Speed',ylab='Distance')
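We can also put a number on that relationship by computing the correlation coefficient. As a quick sketch, its square should match the R-squared we will see in the model summary below:
r <- cor(cars$speed, cars$dist)
r    # ~0.807: a fairly strong positive linear relationship
r^2  # ~0.651: matches the Multiple R-squared reported by summary() below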
Next we are going to create a linear model object and assign it to cars_lm, where the predictor is the column speed and the response is dist.
We see that the linear function is:
Distance = -17.579 + 3.932*Speed
cars_lm <- lm(cars$dist ~ cars$speed)
cars_lm
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Coefficients:
## (Intercept)   cars$speed
##     -17.579        3.932
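As a quick sanity check on that equation, we can pull the coefficients out of the model object and compute a prediction by hand. The speed of 15 mph here is just an arbitrary example value; note that because the model was fit with cars$ terms rather than a data argument, predict() with a newdata frame won’t pick up new speeds, so we compute directly:
b <- coef(cars_lm)
# Distance = -17.579 + 3.932 * 15, roughly 41.4 feet
unname(b[1] + b[2] * 15)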
Let’s add this relationship to the plot:
plot(cars$speed,cars$dist,main='Car Speed & Distance',xlab='Speed',ylab='Distance')
abline(cars_lm)
Let’s look at a further summary of the linear model:
* To gauge the variability in the slope, we check that the standard error is roughly 5-10 times smaller than the corresponding coefficient. Here (3.9324/0.4155) = 9.46, so the slope is about 9.5 standard errors from zero, meaning there is little uncertainty in the slope estimate.
Finally we check Pr(>|t|), the p-value: the probability of seeing a t value this extreme if the coefficient were actually irrelevant (zero). For speed that probability is 1.49e-12, which tells us speed is clearly relevant to the model. For the intercept it is 0.0123, still small enough to be significant at the 0.05 level.
The multiple R-squared value is 0.6511 (adjusted R-squared 0.6438). This means the model explains about 65% of the variability in stopping distance, which is good but not amazing.
summary(cars_lm)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
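These values can also be pulled straight out of the summary object, which makes checks like the coefficient-to-standard-error ratio above easy to script:
s <- summary(cars_lm)
# Estimate / Std. Error reproduces the t value, about 9.46
s$coefficients[2, "Estimate"] / s$coefficients[2, "Std. Error"]
# Multiple and adjusted R-squared, 0.6511 and 0.6438
s$r.squared
s$adj.r.squared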
A residual is how far a data point is from the regression line, measured vertically. A negative residual means the data point lies below the regression line. For the model’s inference to be valid, we need to make sure that our residuals are normally distributed.
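To make that definition concrete, the residuals stored in the model object are just the observed values minus the fitted values:
# residual = observed distance - distance predicted by the line
all.equal(resid(cars_lm), cars$dist - fitted(cars_lm))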
Let’s look at the residuals plotted as a histogram:
* We see the residuals are pretty normal except for a long right tail. There seem to be a few points that sit far above our regression line, with residuals greater than 40. Our model works well except for these points on the far right.
# Histogram of the residuals
hist(cars_lm$residuals)
# Residuals plotted against the fitted values
plot(fitted(cars_lm), resid(cars_lm))
Here we can see that there are points on the right side that are far away from the model. This is in line with what we saw before.
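To see exactly which observations these are, we can subset the data on the large residuals (the cutoff of 40 matches the max residual of 43.201 in the summary above):
# Rows whose residuals exceed 40, i.e. the points far above the line
cars[resid(cars_lm) > 40, ]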
# Normal Q-Q plot: points should fall along the reference line if the
# residuals are normally distributed
qqnorm(resid(cars_lm))
qqline(resid(cars_lm))
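As a complement to these visual checks, one option is a formal normality test on the residuals; this is just a sketch, with the Shapiro-Wilk test as one common choice:
# Shapiro-Wilk test of normality: a small p-value suggests the residuals
# are not normal, consistent with the long right tail we saw above
shapiro.test(resid(cars_lm))
R also provides a standard set of diagnostic plots for any fitted model: plot(cars_lm) cycles through residuals vs. fitted values, the normal Q-Q plot, and other checks.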