Background

The purpose of the assignment was to explore linear regression.

More specifically, it was to use the “cars” dataset in R to build a linear model for stopping distance as a function of speed, and then to replicate the analysis from Chapter 3 of the course text: visualization, evaluation of model quality, and residual analysis.

Data familiarization

First we familiarize ourselves with the dataset by exploring its summary statistics, column names, number of columns, number of rows, and first 6 entries:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
colnames(cars)
## [1] "speed" "dist"
ncol(cars)
## [1] 2
nrow(cars)
## [1] 50
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

“Cars” is a 50-row, 2-column dataset with the variables speed and dist (stopping distance). According to the R documentation for the dataset, speed is recorded in mph and dist in ft. The average speed is 15.4, the max speed is 25.0, the min speed is 4.0, and the 1st and 3rd quartiles are 12.0 and 19.0 respectively. The average distance is 42.98, the max distance is 120.00, the min distance is 2.00, and the 1st and 3rd quartiles are 26.00 and 56.00 respectively.

Once we’ve familiarized ourselves with the data, we create an initial plot to observe the relationship between the two variables. Since our aim is to build a linear model for stopping distance as a function of speed, \(y = distance\) and \(x = speed\).

#Plot distance as a function of speed for cars
attach(cars)
plot(speed, dist, main = "Distance as a function of speed", xlab = "speed", ylab = "distance")

Although the relationship is not perfect, stopping distance tends to increase with speed in a roughly linear way. We can explore further via regression.
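One way to put a number on this visual impression is the Pearson correlation coefficient. The snippet below is a minimal sketch using base R's cor(); the variable name r is ours and is not taken from the course text.

```r
# Pearson correlation between speed and stopping distance.
# Columns are referenced as cars$... so the snippet does not rely on attach(cars).
r <- cor(cars$speed, cars$dist)
r      # correlation coefficient, always between -1 and 1
r^2    # for a one-factor linear model this equals the model's R-squared
```

Values close to 1 indicate a strong positive linear association, while values near 0 indicate little linear association.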

Linear Regression Model

From pg 17 of the course text:

The simplest linear regression model finds the relationship between one input variable, which is called the predictor variable, and the output, which is called the system’s response. This type of model is known as a one-factor linear regression.

For the sake of our one-factor linear regression, the speed is the input / predictor variable and the distance is the output / response variable.

The simplest regression model is a straight line, and that’s exactly what we’ll fit to our data.

#Calculate a_0 and a_1 for our regression model
cars.lm <- lm(dist ~ speed)
cars.lm
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

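We have a y-intercept of -17.579 and a slope of 3.932, so the fitted line is \(distance = -17.579 + 3.932 \cdot speed\). In other words, each additional unit of speed adds about 3.9 units of predicted stopping distance.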
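To make the coefficients concrete, the sketch below extracts them from a fitted model and computes the predicted stopping distance for a single speed, once by hand and once with predict(). The speed of 15 is an arbitrary illustrative value, and the names fit, a0, a1, new_speed, by_hand, and by_model are ours.

```r
# Refit with an explicit data argument so the snippet is self-contained.
fit <- lm(dist ~ speed, data = cars)

a0 <- coef(fit)[["(Intercept)"]]  # intercept, about -17.579
a1 <- coef(fit)[["speed"]]        # slope, about 3.932

new_speed <- 15                   # illustrative value, chosen arbitrarily
by_hand   <- a0 + a1 * new_speed  # distance = a_0 + a_1 * speed
by_model  <- predict(fit, newdata = data.frame(speed = new_speed))

by_hand
unname(by_model)                  # same value as by_hand
```

Both approaches evaluate the same straight line, so they should agree up to floating-point rounding.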

Now we can begin exploring how well our regression line fits the data by re-plotting the stopping distance as a function of speed as we did before. This time we add the fitted line to our plot.

#distance = -17.579 + 3.932 * speed
plot(speed, dist, main = "Distance as a function of speed", xlab = "speed", ylab = "distance")
abline(cars.lm)

The line follows the overall trend of the points, so visually it appears we have a good fit. Let’s explore further …

Analysis

We’ll analyze the credibility of our linear regression model by interpreting its summary data and residual plot:

#Summary statistics for linear regression model
summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

First we observe the summary statistics and evaluate the quality of our model:

  • Residual values: for a good fit, we would expect the residuals to be roughly symmetric around zero, as they would be if normally distributed. Our 1Q and 3Q values are similar in magnitude (-9.525 and 9.215), but our median is slightly below 0 (-2.272) and our max (43.201) is nearly 1.5 times the magnitude of our min (-29.069). A better model would have a median nearer 0 and min/max and 1Q/3Q values closer in magnitude … I would consider failing the fit based on these observations but will instead opt for a “conditional pass” to further analyze the data. PASS

  • Coefficients: for a good model, we’d like to see a standard error 5-10x smaller than the corresponding coefficient. For speed, the ratio hits the mark (3.9324 / 0.4155 ≈ 9.5), but for the intercept the standard error is only ~2.6x smaller (17.5791 / 6.7584) … again I will give a “conditional pass” to further analyze the data. PASS

  • \(R^2\) value: values closer to 1 indicate a better fit. With a multiple \(R^2\) of 0.6511, our model explains ~65% of the variation in stopping distance and is a pretty good fit … PASS

Although we passed all checks thus far, 2 were “conditional passes” and thus we may have to reconsider our fit …
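The coefficient and \(R^2\) checks can be reproduced directly from the fitted model. The sketch below is one way to do it in base R, assuming the same model as above; the names fit, s, coefs, ratio, ss_res, ss_tot, and r2 are ours.

```r
# Refit with an explicit data argument so the snippet is self-contained.
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

# Ratio of each estimate to its standard error
# (the absolute value of the "t value" column in the summary).
coefs <- coef(s)
ratio <- abs(coefs[, "Estimate"] / coefs[, "Std. Error"])
ratio

# R-squared by hand: 1 - (residual sum of squares / total sum of squares).
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
r2     <- 1 - ss_res / ss_tot
r2
all.equal(r2, s$r.squared)   # TRUE if the hand calculation matches summary()
```

Computing the ratio explicitly avoids eyeballing the "5-10x smaller" rule of thumb from the printed table.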

Next, we take the analysis a step further and interpret our residuals plot:

#Residual plot
plot(fitted(cars.lm),resid(cars.lm))

  • Residual analysis: for a good fit, the residuals should be scattered evenly around zero across the whole range of fitted values. Ours are close to that, but their spread increases slightly as we move right and the density of points is uneven. That is enough to say “three strikes and you’re out”. FAIL
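The residual plot can be made easier to read with a zero reference line and a smoothed trend, and the change in spread can be checked roughly by comparing the two halves of the fitted range. This is a minimal sketch, not the course text's procedure: the median split is an ad hoc choice of ours, not a formal test, and the names fit, res, fv, lower, and upper are ours.

```r
# Refit with an explicit data argument so the snippet is self-contained.
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)
fv  <- fitted(fit)

# Residual plot with a zero reference line and a lowess trend line.
plot(fv, res, xlab = "fitted distance", ylab = "residual")
abline(h = 0, lty = 2)
lines(lowess(fv, res), col = "red")

# Rough spread comparison: standard deviation of the residuals
# below and above the median fitted value.
lower <- res[fv <= median(fv)]
upper <- res[fv >  median(fv)]
c(sd_lower = sd(lower), sd_upper = sd(upper))
```

A noticeably larger standard deviation in the upper half would be consistent with the residuals fanning out as the fitted distance grows.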

Finally, even though we failed, we’ll display and interpret the Q-Q plot to complete the approach outlined in Chapter 3 of the course text:

#Q-Q plot
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

  • Q-Q plot: if the residuals were normally distributed, the points would follow the straight reference line. With our model, we see divergence (albeit slight) at the ends, which suggests that the residuals are not normally distributed. FAIL
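The visual reading of the Q-Q plot can be complemented with a formal normality test. The sketch below uses the Shapiro-Wilk test from base R's stats package; this test is our addition and not part of the approach in the course text, and the names fit and sw are ours.

```r
# Refit with an explicit data argument so the snippet is self-contained.
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed.
sw <- shapiro.test(resid(fit))
sw$statistic   # W statistic; values closer to 1 are more consistent with normality
sw$p.value     # a small p-value (commonly below 0.05) is evidence against normality
```

With only 50 observations the test has limited power, so it is best read alongside the Q-Q plot rather than in place of it.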

It appears our linear regression model was “close but no cigar”. We cannot use it to confidently predict distance based on speed, so we rule out our model.