1. Introduction

The objective of this discussion is to explore, analyze, and build a simple linear regression model of quarter mile time (the time it takes to travel 1/4 of a mile) as a function of horsepower using R’s built-in mtcars data set. The data set contains the following variables:

Variable  Description
mpg       Miles per (US) gallon
cyl       Number of cylinders
disp      Displacement (cu. in.)
hp        Gross horsepower
drat      Rear axle ratio
wt        Weight (1000 lbs)
qsec      1/4 mile time (seconds)
vs        Engine (0 = V-shaped, 1 = straight)
am        Transmission (0 = automatic, 1 = manual)
gear      Number of forward gears
carb      Number of carburetors

In keeping with the week’s material on simple regression, I will use only horsepower as the predictor variable and quarter mile time as the response variable.

2. Data Exploration

We will commence with basic summary statistics.

2.1 Summary Statistics

The most important highlight of the summary table below is that there are no missing values in this data set, so there is no need for imputation. Also, the minimum value of each variable is greater than zero (a zero or negative quarter mile time or horsepower would not be acceptable in this context), so we do not need to discard any records initially.


Summary Table

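# Packages: psych (describe), dplyr (select), DT (datatable), ggplot2 (plots)
library(psych)
library(dplyr)
library(DT)
library(ggplot2)
data(mtcars)
# describe() reports the non-missing count n, so nrow - n is the number of NAs
summary_tbl <- describe(select(mtcars, qsec, hp))
summary_tbl$na <- nrow(mtcars) - summary_tbl$n
summary_tbl <- select(summary_tbl, vars, n, mean, sd, median, min, max, na)
names(summary_tbl) <- c("Variables", "Cases", "Mean", "SD", "Median", "Min", "Max", "NAs")
datatable(summary_tbl, options = list(filter = FALSE))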
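
The two checks described above (no missing values, strictly positive minima) can also be confirmed directly in base R. This is only a minimal sketch; the object names `vars`, `na_counts`, and `min_values` are illustrative and not part of the original analysis.

```r
# Minimal sketch: verify the two data-quality claims with base R only.
data(mtcars)

# Columns used in this analysis (illustrative object name)
vars <- mtcars[, c("qsec", "hp")]

# Number of missing values per column
na_counts <- colSums(is.na(vars))
print(na_counts)

# Smallest observed value per column
min_values <- sapply(vars, min)
print(min_values)

# Stop with an error if either condition fails
stopifnot(all(na_counts == 0), all(min_values > 0))
```

If either condition failed, `stopifnot()` would raise an error, which makes the check convenient to keep at the top of an analysis script.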


The Data

The data set has 32 records, which can be viewed in the table below.

# Print the data used in the model
datatable(select(mtcars, qsec, hp), options = list(filter = FALSE))


2.2 Data Visualization

The scatter plot below shows that there might be a linear relationship between quarter mile time and horsepower. It seems that generally as horsepower increases the quarter mile time decreases.

# Scatter plot of the response (qsec) against the predictor (hp)
ggplot(aes(y = qsec, x = hp), data = mtcars) +
  geom_point() +
  labs(x = "Horsepower", y = "1/4 Mile Time (seconds)") +
  ggtitle("Scatter plot of 1/4 Mile Time vs. Horsepower")
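
To put a number on the visual impression from the scatter plot, one option is the Pearson correlation between the two variables. The sketch below uses only base R; the names `r` and `ct` are illustrative. A clearly negative value would agree with the downward trend seen in the plot.

```r
# Minimal sketch: quantify the linear association seen in the scatter plot.
data(mtcars)

# Pearson correlation between horsepower and quarter mile time
r <- cor(mtcars$hp, mtcars$qsec)
print(r)

# Correlation test, which also provides a 95% confidence interval for r
ct <- cor.test(mtcars$hp, mtcars$qsec)
print(ct$conf.int)
```

In simple linear regression the squared correlation equals the model's R-squared, so this number is a quick preview of how much variability a straight line can explain.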


3. Model Building

There are no missing values in our data. There is also no evident need for transformation, and we will not be discarding any data. Thus we may proceed with building our linear model for quarter mile time as a function of horsepower.

lm <- lm(qsec ~ hp, data = mtcars)

summary(lm)
## 
## Call:
## lm(formula = qsec ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1766 -0.6975  0.0348  0.6520  4.0972 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.556354   0.542424  37.897  < 2e-16 ***
## hp          -0.018458   0.003359  -5.495 5.77e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.282 on 30 degrees of freedom
## Multiple R-squared:  0.5016, Adjusted R-squared:  0.485 
## F-statistic: 30.19 on 1 and 30 DF,  p-value: 5.766e-06

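Both the intercept and the horsepower coefficient are statistically significant. An R-squared value of 0.5016 means that horsepower alone explains about half of the variability in quarter mile time. This seems acceptable in this context, but I am not an expert in this area, so further research is necessary. The model itself is significant, with a p-value of 5.766e-06 for the F-statistic.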
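
To make the R-squared figure more concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below refits the same model under the illustrative name `fit` (the original code stores it in `lm`); `rss`, `tss`, and `r_squared` are likewise illustrative names.

```r
# Minimal sketch: recompute R-squared from its definition.
data(mtcars)
fit <- lm(qsec ~ hp, data = mtcars)  # same model as above, illustrative name

# Residual sum of squares: variability left unexplained by the model
rss <- sum(residuals(fit)^2)

# Total sum of squares: variability of qsec around its mean
tss <- sum((mtcars$qsec - mean(mtcars$qsec))^2)

r_squared <- 1 - rss / tss
print(r_squared)

# Should agree with the value reported by summary()
stopifnot(isTRUE(all.equal(r_squared, summary(fit)$r.squared)))

# 95% confidence intervals for the intercept and slope
print(confint(fit, level = 0.95))
```

The confidence interval for the slope is a useful companion to the p-value, since it shows the range of plausible effect sizes rather than only whether the effect differs from zero.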

The equation for this model is:

\[ \text{quarter mile time} = 20.556354 - 0.018458 \times \text{horsepower} \]

This may be interpreted as: “For each additional unit of horsepower, we can expect quarter mile time to decrease by 0.018458 seconds on average.”

Conventional wisdom would agree with the negative relationship between the power of the engine and the time it takes to travel 1/4 of a mile. However, the magnitude of this relationship is quite minuscule: even a 100 horsepower increase corresponds to a predicted decrease of only about 1.8 seconds.
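
As a worked illustration of the slope, the fitted model can be used to predict quarter mile times for two hypothetical cars. The horsepower values of 100 and 200 are assumptions chosen for illustration (both lie inside the observed range of the data), and `fit`, `new_cars`, `pred`, and `diff_100hp` are illustrative names.

```r
# Minimal sketch: predicted quarter mile times for two hypothetical cars.
data(mtcars)
fit <- lm(qsec ~ hp, data = mtcars)  # same model as above, illustrative name

# Hypothetical cars with 100 and 200 horsepower
new_cars <- data.frame(hp = c(100, 200))

# Point predictions from the fitted line
pred <- predict(fit, newdata = new_cars)
print(pred)

# Predicted time saved by the extra 100 horsepower
diff_100hp <- unname(pred[1] - pred[2])
print(diff_100hp)
```

Because the model is linear, the difference between the two predictions is exactly 100 times the absolute value of the slope, regardless of which starting horsepower is chosen.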

4. Model Visualization

Let’s overlay the least squares line for our model on the scatter plot to compare the model visually against the observed data.

# Scatter plot with the fitted least squares line drawn from the model coefficients
ggplot(aes(x = hp, y = qsec), data = mtcars) +
  geom_point() +
  labs(x = "Horsepower", y = "Quarter Mile Time (seconds)") +
  ggtitle("quarter mile time = 20.556354 - 0.018458*horsepower") +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_abline(intercept = lm$coefficients["(Intercept)"],
              slope = lm$coefficients["hp"],
              color = "blue",
              linewidth = 1.0)

This highlights the negative linear relationship between horsepower and quarter mile time. Is this a reliable model?

5. Model Diagnostics

To assess whether this model is reliable, we will check for linearity, nearly normal residuals, constant variability (homoscedasticity) and independence of errors.

5.1 Linearity

We have already produced a scatter plot of quarter mile time against horsepower in section 4, where the linear relationship between the two can be observed. Generally, as horsepower increases, quarter mile time decreases.

5.2 Nearly Normal Residuals

In the Q-Q plot below, we can see evidence of skewness: the largest residual deviates blatantly from the line, and a few residuals just above the smallest one also fall off the line near the lower bound. I would say that the residuals are not nearly normal.


# Put the residuals in a data frame so that aes() refers to a column
ggplot(data.frame(resid = residuals(lm)), aes(sample = resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
  ggtitle("Normal Q-Q Plot")
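
Since reading a Q-Q plot is somewhat subjective, a formal test can complement it. The sketch below applies the Shapiro-Wilk test from base R to the residuals; this test is not part of the original analysis, and `fit` and `sw` are illustrative names. A small p-value would point toward non-normal residuals, although with only 32 observations the test has limited power.

```r
# Minimal sketch: Shapiro-Wilk normality test on the model residuals.
data(mtcars)
fit <- lm(qsec ~ hp, data = mtcars)  # same model as above, illustrative name

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))
print(sw)
```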


5.3 Constant Variability and Independence

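The plot of the residuals against horsepower below shows no clear pattern. (In simple regression the fitted values are a linear function of horsepower, so a plot against the fitted values would show the same pattern mirrored.) I would say that the constant variability and independence criteria are met.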

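# Residuals against the predictor; aes() maps data frame columns instead of using $
ggplot(data.frame(hp = mtcars$hp, resid = residuals(lm)), aes(x = hp, y = resid)) +
  geom_point(shape = 1) +
  labs(x = "Horsepower", y = "Residuals") +
  ggtitle("Residuals vs. Horsepower") +
  # Horizontal reference line at zero (linewidth replaces size in ggplot2 >= 3.4.0)
  geom_hline(yintercept = 0,
             color = "blue",
             linewidth = 1.0,
             linetype = "dashed")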
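
The visual check for constant variability can be supplemented with a numeric one. The sketch below hand-computes a studentized Breusch-Pagan style statistic in base R: the squared residuals are regressed on horsepower, and the sample size times the R-squared of that auxiliary regression is compared with a chi-squared distribution with one degree of freedom. This test is an addition for illustration, not part of the original analysis, and all object names are illustrative.

```r
# Minimal sketch: studentized Breusch-Pagan style check for constant variance.
data(mtcars)
fit <- lm(qsec ~ hp, data = mtcars)  # same model as above, illustrative name

# Auxiliary data: squared residuals alongside the predictor
aux_data <- data.frame(sq_resid = residuals(fit)^2, hp = mtcars$hp)

# Auxiliary regression: does horsepower explain the size of the residuals?
aux_fit <- lm(sq_resid ~ hp, data = aux_data)

# Test statistic: n times the R-squared of the auxiliary regression
n <- nrow(aux_data)
bp_stat <- n * summary(aux_fit)$r.squared

# One predictor in the auxiliary regression, hence one degree of freedom
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_pvalue))
```

A large p-value would be consistent with constant variability. Note that this check says nothing about independence, which depends mainly on how the observations were collected.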

6. Conclusion

This model did not meet the criterion of nearly normal residuals, although this judgment may be subjective. The model aligns with conventional wisdom, although not at the magnitude one would expect. There are likely other factors beyond horsepower that affect quarter mile time, such as the weight of the vehicle, type of transmission, number of forward gears, etc. This may be best modeled using multiple linear regression with more predictors and observations.
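
As a rough starting point for that follow-up, the sketch below adds weight, transmission type, and number of forward gears to the model and compares adjusted R-squared values. The choice of predictors simply mirrors the factors listed above, `am` and `gear` are treated as numeric for simplicity, and the object names are illustrative. It is a sketch of the idea, not a vetted model.

```r
# Minimal sketch: compare the simple model with a multiple regression.
data(mtcars)

# Simple model from this report
simple_fit <- lm(qsec ~ hp, data = mtcars)

# Candidate multiple regression using the factors named in the conclusion
multi_fit <- lm(qsec ~ hp + wt + am + gear, data = mtcars)

# Adjusted R-squared penalizes the extra predictors
adj_r2 <- c(simple = summary(simple_fit)$adj.r.squared,
            multiple = summary(multi_fit)$adj.r.squared)
print(adj_r2)

# F-test comparing the two nested models
print(anova(simple_fit, multi_fit))
```

The same diagnostics used in section 5 would need to be repeated for any larger model before relying on it.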