The purpose of the assignment was to explore linear regression.
More specifically, it was to use the “cars” dataset in R to build out a linear model for stopping distance as a function of speed and then replicate the analysis of Chapter 3 from the course text for visualization, quality evaluation of the model, and residual analysis.
First we familiarize ourselves with the dataset by exploring its summary statistics, column names, number of columns, number of rows, and first 6 entries:
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
## [1] "speed" "dist"
## [1] 2
## [1] 50
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
“Cars” is a 50-row, 2-column dataset with variables speed and dist (stopping distance). The average speed is 15.4, the max speed is 25.0, the min speed is 4.0, and the 1st and 3rd quartiles are 12.0 and 19.0 respectively. The average distance is 42.98, the max distance is 120.00, the min distance is 2.00, and the 1st and 3rd quartiles are 26.00 and 56.00 respectively.
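The exploration output above is consistent with a handful of base R calls. The exact commands used for the assignment are not shown, so the following is a minimal sketch of one way to reproduce it, assuming the built-in `cars` data frame:

```r
# 'cars' ships with base R (package 'datasets'), so nothing needs to be loaded.
summary(cars)  # min, quartiles, median, mean and max for each column
names(cars)    # column names
ncol(cars)     # number of columns
nrow(cars)     # number of rows
head(cars)     # first 6 rows (head() defaults to n = 6)
```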
Once we’ve familiarized ourselves with the data, we create an initial plot to observe the relationship between the two variables. Since our aim is to build a linear model for stopping distance as a function of speed, we take \(y = distance\) as the response and \(x = speed\) as the predictor.
#Plot distance as a function of speed for cars
attach(cars)
plot(speed, dist, main = "Distance as a function of speed", xlab = "speed", ylab = "distance")
Although it is not a perfect relationship, it appears there is some sort of linear relationship between the two variables. We can explore further via regression.
From pg 17 of the course text:
The simplest linear regression model finds the relationship between one input variable, which is called the predictor variable, and the output, which is called the system’s response. This type of model is known as a one-factor linear regression.
For the sake of our one-factor linear regression, the speed is the input / predictor variable and the distance is the output / response variable.
The simplest regression model is a straight line, and that’s exactly what we’ll fit to our data.
##
## Call:
## lm(formula = dist ~ speed)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
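We have a y-intercept of -17.579 and a slope of 3.932, so the fitted line is \(distance = -17.579 + 3.932 \cdot speed\): each additional unit of speed adds roughly 3.9 units of predicted stopping distance.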
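The call that produced these coefficients is not shown. As a sketch, the fit can be reproduced as below. The object name `cars.lm` is taken from the later `abline()` call; passing `data = cars` instead of relying on `attach(cars)` is an assumption made here for self-containment. The speed value of 15 is an arbitrary illustration.

```r
# Fit the one-factor model: stopping distance as a function of speed.
cars.lm <- lm(dist ~ speed, data = cars)

# Named vector of coefficients: "(Intercept)" and "speed".
b <- coef(cars.lm)
print(round(b, 3))

# Predicted distance at speed = 15, computed by hand from the coefficients ...
manual <- unname(b["(Intercept)"] + b["speed"] * 15)
# ... and with predict(), which should give the same number.
auto <- unname(predict(cars.lm, newdata = data.frame(speed = 15)))
print(c(manual = manual, predict = auto))
```

Both values come out at roughly 41.4, which sits close to the mean distance of 42.98, as expected for a speed near the mean of 15.4.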
Now we can begin exploring how well the regression line fits the data by re-plotting stopping distance as a function of speed, as we did before. This time we add the fitted line to the plot.
#Fitted line: distance = -17.579 + 3.932 * speed
plot(speed, dist, main = "Distance as a function of speed", xlab = "speed", ylab = "distance")
#cars.lm is the model fitted above with lm(dist ~ speed)
abline(cars.lm)
It appears we have a good fit. Let’s explore further …
We’ll analyze the credibility of our linear regression model by interpreting its summary data and residual plot:
##
## Call:
## lm(formula = dist ~ speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
First we observe the summary statistics and evaluate the quality of our model:
Residual values: for a good fit, we would expect the residuals to be roughly normally distributed around zero. Our 1Q and 3Q values are of similar magnitude (-9.525 and 9.215), but the median (-2.272) sits slightly below 0 and the max (43.201) is nearly 1.5 times the magnitude of the min (-29.069). A better model would have a median nearer 0 and min/max and 1Q/3Q values closer in magnitude … I would consider failing the fit based on these observations but will instead opt for a “conditional pass” to further analyze the data. PASS
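The residual figures quoted in this check can be pulled out of the model directly. This is a minimal sketch, assuming the model is refit as `cars.lm` with `data = cars`. Note that for a least-squares fit with an intercept the mean of the residuals is zero by construction, so the median and the symmetry of the tails are the informative quantities.

```r
cars.lm <- lm(dist ~ speed, data = cars)
res <- resid(cars.lm)

# Five-number summary of the residuals (Min, 1Q, Median, 3Q, Max).
q <- quantile(res)
print(round(q, 3))

# Mean residual: zero up to floating-point error.
print(mean(res))

# Asymmetry of the tails: largest positive residual relative to the
# magnitude of the largest negative one.
tail_ratio <- unname(q["100%"] / abs(q["0%"]))
print(tail_ratio)
```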
Coefficients: for a good model, we’d like to see a standard error roughly 5-10x smaller than the corresponding coefficient. The speed coefficient hits the mark (3.9324 / 0.4155 ≈ 9.5), but for the intercept the standard error is only ~2.6x smaller (17.5791 / 6.7584) … again I will give a “conditional pass” to further analyze the data. PASS
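The coefficient-to-standard-error ratio used in this check can be computed from the summary table. This is a sketch under the same assumption that the model is refit as `cars.lm`; the ratio is simply the absolute value of the t value reported by `summary()`.

```r
cars.lm <- lm(dist ~ speed, data = cars)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|).
coefs <- summary(cars.lm)$coefficients

# How many times larger each estimate is than its standard error.
ratio <- abs(coefs[, "Estimate"]) / coefs[, "Std. Error"]
print(round(ratio, 2))
```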
\(R^2\) value: values closer to 1 indicate a better fit and representation of the data set of interest. With a Multiple R-squared of 0.6511, our model explains ~65% of the variation in stopping distance and is a pretty good fit … PASS
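To make the "share of variation explained" reading concrete, \(R^2\) can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a sketch that assumes the model is refit as `cars.lm`; the variable names `ss_res`, `ss_tot`, `n`, and `p` are illustrative.

```r
cars.lm <- lm(dist ~ speed, data = cars)

ss_res <- sum(resid(cars.lm)^2)                  # variation left unexplained
ss_tot <- sum((cars$dist - mean(cars$dist))^2)   # total variation in dist
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors p (here p = 1).
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```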
Although we passed all checks thus far, two were “conditional passes”, and thus we may have to reconsider our fit …
Next, we take the analysis a step further and interpret our residuals plot:
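The residual plot itself is not reproduced here. A minimal sketch of how such a plot is commonly drawn, assuming the model is refit as `cars.lm` (the title and axis labels are illustrative choices):

```r
cars.lm <- lm(dist ~ speed, data = cars)

# Residuals against fitted values; for a well-behaved model the points
# should scatter evenly around the horizontal line at zero.
plot(fitted(cars.lm), resid(cars.lm),
     main = "Residuals vs fitted values",
     xlab = "fitted distance", ylab = "residual")
abline(h = 0, lty = 2)  # dashed reference line at zero
```

Patterns to look for include a funnel shape (residual spread growing with the fitted value) or curvature, either of which would suggest the straight-line model is missing something.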
Finally, even though the model did not pass the residual-plot check, we’ll display and interpret the Q-Q plot to complete the approach outlined in Chapter 3 of the course text:
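The Q-Q plot is likewise not reproduced here. One way to draw it, as a sketch that assumes the model is refit as `cars.lm`, uses base R's `qqnorm()` and `qqline()`:

```r
cars.lm <- lm(dist ~ speed, data = cars)
res <- resid(cars.lm)

# Sample quantiles of the residuals against theoretical normal quantiles.
# qqnorm() also returns the plotted coordinates as a list with x and y.
qq <- qqnorm(res)
# Reference line through the first and third quartiles.
qqline(res)
```

If the residuals were normally distributed, the points would follow the reference line; systematic departures at the ends indicate heavier or lighter tails than a normal distribution.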
It appears our linear regression model was “close but no cigar”. We cannot use it to confidently predict stopping distance from speed, so we rule out this model.