Using R’s built-in cars dataset, we are going to build a linear model to predict stopping distance from car speed. First, let’s view the data we’re working with:

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Let’s see if there might be a linear relationship between the columns speed and dist.
* It appears there is a positive relationship between speed and distance; the correlation computed after the plot quantifies it

plot(cars$speed, cars$dist, main = 'Car Speed & Distance', xlab = 'Speed', ylab = 'Distance')
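
Before fitting a model, we can quantify the strength of that linear relationship with Pearson’s correlation. A quick check (cor is base R; the value comes from the standard cars dataset):

cor(cars$speed, cars$dist)  # about 0.81, a fairly strong positive linear association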

Next we are going to create a linear model object and assign it to cars_lm, where the predictor is the column speed and the response is dist.

We see that the linear function is:
Distance = -17.579 + 3.932*Speed

cars_lm <- lm(cars$dist ~ cars$speed)
cars_lm
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932
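
As a sanity check on that fitted equation, we can plug a speed into the estimated coefficients by hand. A minimal sketch (the speed of 15 is an arbitrary illustrative value):

coefs <- coef(cars_lm)
unname(coefs[1] + coefs[2] * 15)  # -17.579 + 3.932 * 15, about 41.4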

Plots & Summary

Let’s add this fitted line to the plot:

plot(cars$speed, cars$dist, main = 'Car Speed & Distance', xlab = 'Speed', ylab = 'Distance')
abline(cars_lm)

Let’s look at a fuller summary of the linear model:
* To gauge how precisely the slope is estimated, compare the coefficient to its standard error: the t value is the estimate divided by the standard error, here 3.9324/0.4155 = 9.46. A t value this large means the slope estimate a1 is precise relative to its uncertainty.
* Pr(>|t|) is the p-value: the probability of observing a t value at least this extreme if the true coefficient were zero. For speed it is 1.49e-12, so speed is clearly relevant in the model; for the intercept it is 0.0123.
* The multiple R-squared is 0.6511 (adjusted R-squared 0.6438), meaning the model explains about 65% of the variability in stopping distance, which is good but not amazing.

summary(cars_lm)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
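
The same quantities can be pulled out of the summary object programmatically instead of read off the printout. A short sketch using the components returned by summary.lm:

s <- summary(cars_lm)
s$coefficients               # Estimate, Std. Error, t value, Pr(>|t|) for each term
s$coefficients[, 't value']  # equals Estimate / Std. Error row by row
s$r.squared                  # 0.6511
s$adj.r.squared              # 0.6438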

Residuals

A residual is the vertical distance from a data point to the regression line: the observed value minus the fitted value. A negative residual means the data point lies below the regression line. For the model’s inference to be trustworthy, we need to make sure that our residuals are roughly normally distributed.
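
We can confirm that definition directly: the residuals should equal the observed distances minus the fitted values.

all.equal(resid(cars_lm), cars$dist - fitted(cars_lm))  # should be TRUE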

Let’s look at the residuals plotted as a histogram:
* We see the residuals are roughly normal except for a long right tail. A few points sit well above our regression line, with residuals greater than 40.

Our model fits well except for these points in the right tail.

hist(cars_lm$residuals)
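
To see which observations make up that tail, we can filter the data on residual size (the cutoff of 40 mirrors the histogram):

cars[resid(cars_lm) > 40, ]  # the rows with the largest positive residuals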

We can also plot the residuals against the fitted values; ideally they scatter evenly around zero with no obvious pattern:

plot(fitted(cars_lm), resid(cars_lm))

QQ Plots

Here the points in the upper right fall well above the reference line, consistent with what we saw before: the residuals have a heavier right tail than a normal distribution would.

qqnorm(resid(cars_lm))
qqline(resid(cars_lm))
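
As a complement to the visual checks, we can run a formal normality test on the residuals. A minimal sketch using the Shapiro-Wilk test from base R (a small p-value is evidence against normality):

shapiro.test(resid(cars_lm))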