The objective of this discussion is to explore, analyze, and build a simple linear regression model of quarter mile time (the time it takes to travel 1/4 of a mile) as a function of horsepower, using R’s built-in mtcars data set. The data set contains the following variables:
| variable | description |
|---|---|
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (1000 lbs) |
| qsec | 1/4 mile time |
| vs | V/S |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |
In keeping with the week’s focus on simple regression, I will use only horsepower as the predictor variable and quarter mile time as the response variable.
We will commence with basic summary statistics.
The key takeaway from the summary table below is that there are no missing values for either variable, so no imputation is needed. Also, the minimum values of both quarter mile time and horsepower are greater than zero (a zero or negative quarter mile time or horsepower would not be acceptable in this context), so we do not need to discard any records initially.
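The two data-quality claims above can be checked directly in base R. The following is a minimal sketch; the object names `model_vars`, `na_counts`, and `min_values` are illustrative and do not appear in the original analysis.

```r
# Sketch: verify the data-quality claims for the two modeling variables.
data(mtcars)
model_vars <- mtcars[, c("qsec", "hp")]

# Number of missing values per column (all zeros means no imputation is needed)
na_counts <- colSums(is.na(model_vars))
print(na_counts)

# Smallest observed value per column (both should be strictly positive)
min_values <- sapply(model_vars, min)
print(min_values)
```

If either check failed, the affected records would need to be imputed or dropped before fitting the model.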
Summary Table
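# Packages assumed by the code in this write-up: describe() from psych, select() from dplyr, datatable() from DT
library(psych)
library(dplyr)
library(DT)
library(ggplot2)
data(mtcars)
summary <- describe(select(mtcars, qsec, hp))
summary$na <- nrow(mtcars) - summary$n
summary <- select(summary, vars, n, mean, sd, median, min, max, na )
names(summary) <- c("Variables", "Cases", "Mean", "SD", "Median", "Min", "Max", "NAs")
datatable(summary, options = list(filter = FALSE))
The Data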
The data set has 32 records, which can be viewed in the table below.
# Print the data
datatable(select(mtcars, qsec, hp), options = list(filter = FALSE))
The scatter plot below suggests a roughly linear relationship between quarter mile time and horsepower: in general, as horsepower increases, quarter mile time decreases.
ggplot(aes(y = qsec, x = hp ), data = mtcars) +
geom_point() +
labs(x="Horsepower", y="1/4 Mile Time (seconds)") +
ggtitle("Scatter plot of 1/4 Mile Time vs. Horsepower")
There are no missing values in our data. There is also no evident need for transformation, and we will not be discarding any data. Thus we may proceed with building our linear model for quarter mile time as a function of horsepower.
lm <- lm(qsec ~ hp, data = mtcars)
summary(lm)
##
## Call:
## lm(formula = qsec ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1766 -0.6975 0.0348 0.6520 4.0972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.556354 0.542424 37.897 < 2e-16 ***
## hp -0.018458 0.003359 -5.495 5.77e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.282 on 30 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.485
## F-statistic: 30.19 on 1 and 30 DF, p-value: 5.766e-06
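Both the intercept and the horsepower coefficient are statistically significant. An R-squared value of 0.5016 means that horsepower alone explains about half of the variability in quarter mile time. This seems acceptable in this context, but I am not an expert in this area, so further research is necessary. The model itself is significant, with an F-test p-value of 5.766e-06.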
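The quantities discussed above can also be pulled out of the fitted model programmatically rather than read off the printed summary. This is a minimal sketch using base R only; the name `fit` is an assumption chosen to avoid reusing `lm` as a variable name, and the confidence interval is an addition not reported in the original output.

```r
# Sketch: extract the key statistics from the simple regression.
data(mtcars)
fit <- lm(qsec ~ hp, data = mtcars)   # `fit` is an illustrative name
fit_summary <- summary(fit)

# Proportion of variance in qsec explained by hp
r_squared <- fit_summary$r.squared

# Coefficient table: estimates, standard errors, t values, p-values
coef_table <- coef(fit_summary)
slope_p <- coef_table["hp", "Pr(>|t|)"]

# Overall F-test p-value, computed from the stored F statistic
f_stat <- fit_summary$fstatistic
f_p <- pf(f_stat[["value"]], f_stat[["numdf"]], f_stat[["dendf"]],
          lower.tail = FALSE)

# 95% confidence interval for the slope
slope_ci <- confint(fit, "hp", level = 0.95)

print(r_squared)
print(coef_table)
print(f_p)
print(slope_ci)
```

In a regression with a single predictor, the overall F-test and the t-test on the slope are equivalent, which is why the two p-values in the printed summary agree.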
The equation for this model is:
\[ \widehat{\text{quarter mile time}} = 20.556354 - 0.018458 \times \text{horsepower} \]
This may be interpreted as: “For each additional unit of horsepower, we can expect quarter mile time to decrease by 0.018458 seconds on average.”
Conventional wisdom would agree with the negative relationship between the power of the engine and the time it takes to travel 1/4 of a mile. However, the per-unit magnitude of this relationship is minuscule: a 100 horsepower increase corresponds to a predicted decrease of only about 1.85 seconds.
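To make the slope interpretation concrete, the fitted model can be used to predict quarter mile times for specific horsepower values. This is a small sketch; the two horsepower values are hypothetical cars chosen for illustration, and `fit`, `new_cars`, and `time_gap` are illustrative names.

```r
# Sketch: predicted quarter mile times for two hypothetical cars.
data(mtcars)
fit <- lm(qsec ~ hp, data = mtcars)

# Hypothetical cars with 100 and 200 gross horsepower
new_cars <- data.frame(hp = c(100, 200))
predicted <- predict(fit, newdata = new_cars)
print(predicted)

# Difference in predicted time for a 100 hp increase; this equals 100 * slope
time_gap <- predicted[[2]] - predicted[[1]]
print(time_gap)
```

Both horsepower values lie within the range observed in mtcars, so these are interpolations rather than extrapolations.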
Let’s overlay the least squares line for our model on the scatter plot to compare the model visually against the observed data.
ggplot(aes(x = hp, y = qsec ), data = mtcars) +
geom_point() +
labs(x="Horsepower", y="Quarter Mile Time (seconds)") +
ggtitle("quarter mile time = 20.556354 - 0.018458*horsepower") +
geom_abline(intercept = lm$coefficients["(Intercept)"],
slope = lm$coefficients["hp"],
color = "blue",
size = 1.0)
This highlights the negative linear relationship between horsepower and quarter mile time. Is this a reliable model?
To assess whether this model is reliable, we will check for linearity, nearly normal residuals, constant variability (homoscedasticity) and independence of errors.
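Visual checks can be supplemented with formal tests. The sketch below is not part of the original analysis: it applies a Shapiro-Wilk test to the residuals for normality, and a studentized Breusch-Pagan style check (Koenker's variant, computed by hand from an auxiliary regression) for constant variance. The names `fit`, `res`, `aux_data`, `bp_stat`, and `bp_p` are illustrative.

```r
# Sketch: formal companions to the graphical residual checks.
data(mtcars)
fit <- lm(qsec ~ hp, data = mtcars)
res <- residuals(fit)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests a departure from normality)
normality <- shapiro.test(res)
print(normality)

# Constant variance: regress the squared residuals on the predictor.
# Under homoscedasticity, n * R^2 from this auxiliary regression is
# approximately chi-squared with 1 degree of freedom (one predictor).
aux_data <- data.frame(sq_res = res^2, hp = mtcars$hp)
aux_fit <- lm(sq_res ~ hp, data = aux_data)
bp_stat <- nrow(aux_data) * summary(aux_fit)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```

With only 32 observations these tests have limited power, so they are best read alongside the plots rather than in place of them.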
We have already produced a scatter plot of quarter mile time against horsepower in section 4, where the linear relationship between the two can be observed. Generally, as horsepower increases, quarter mile time decreases.
In the Q-Q plot below, we can see evidence of skewness: the last residual deviates blatantly from the line, and a few residuals just after the first one deviate near the lower bound. I would say that the residuals are not nearly normal.
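# Q-Q plot of the model residuals (lm here is the fitted model object)
ggplot(data.frame(resid = residuals(lm)), aes(sample = resid)) +
stat_qq() +
stat_qq_line() +
labs(x="Theoretical Quantiles", y="Sample Quantiles") +
ggtitle("Normal Q-Q Plot")
The plot of the residuals against horsepower below shows no clear pattern. Because the fitted values in a simple regression are a linear function of horsepower, this plot carries the same information as a residuals vs. fitted values plot. I would say that the constant variability and independence criteria are met.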
# Residuals against the predictor; hp is taken from the data argument
ggplot(aes(x = hp, y = residuals(lm)), data = mtcars) +
geom_point(shape=1) +
labs(x="Horsepower", y="Residuals") +
ggtitle("Residuals vs. Horsepower") +
geom_abline(intercept = 0,
slope = 0,
color = "blue",
size = 1.0,
linetype="dashed")
This model did not meet the criterion of nearly normal residuals, although that judgment is somewhat subjective. The model aligns with conventional wisdom, although not at the magnitude one might expect. There are likely other factors beyond horsepower that affect quarter mile time, such as the weight of the vehicle, type of transmission, number of forward gears, etc. This may be better modeled using multiple linear regression with more predictors and observations.
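As a rough illustration of that suggestion, the sketch below compares the simple model with a multiple regression that adds the factors named above. The predictor set (`wt`, `am`, `gear`) and the names `simple_fit`, `multi_fit`, and `comparison` are assumptions made for illustration, not a recommended final model.

```r
# Sketch: compare the simple model with an illustrative multiple regression.
data(mtcars)
simple_fit <- lm(qsec ~ hp, data = mtcars)

# Predictors follow the factors mentioned in the text; the choice is illustrative
multi_fit <- lm(qsec ~ hp + wt + am + gear, data = mtcars)

# Side-by-side fit statistics
comparison <- data.frame(
  model = c("qsec ~ hp", "qsec ~ hp + wt + am + gear"),
  r_squared = c(summary(simple_fit)$r.squared,
                summary(multi_fit)$r.squared),
  adj_r_squared = c(summary(simple_fit)$adj.r.squared,
                    summary(multi_fit)$adj.r.squared)
)
print(comparison)

# Nested-model F-test: do the extra predictors improve the fit jointly?
print(anova(simple_fit, multi_fit))
```

R-squared can only increase as predictors are added, so adjusted R-squared and the nested F-test are the more informative comparisons, particularly with only 32 observations.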