library("tidyverse")
library("psych")
library("ggplot2")
library("DT")
library("MASS")
library("corrplot")

1. Introduction

I will expand on the previous discussion to build a multiple linear regression to model quarter-mile time using R’s built in mtcars data set. The goal is to include at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. For your reference, the data set contains the following variables:

variable description
mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (1000 lbs)
qsec 1/4 mile time
vs V/S (V engine or straight engine)
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburetors

2. Data Exploration

2.1 Summary Statistics

The most important highlight of the summary table of the data below is the fact that there are no missing values in this data set. This means that there is no need for imputation. Also, the minimum value for each variable is greater than zero. It seems we do not need to discard any records initially.


Summary Table

data(mtcars)
summary <- describe(mtcars)
summary$na <- nrow(mtcars) - summary$n
summary <- dplyr::select(summary, vars, n, mean, sd, median, min, max, na )
names(summary) <- c("Variables", "Cases", "Mean", "SD", "Median", "Min", "Max", "NAs")
datatable(summary, options = list(filter = FALSE))


The Data

The data set has 32 records, these can be viewed in the table below

# Pirnt summary 
datatable(dplyr::select(mtcars, qsec, hp), options = list(filter = FALSE))


2.2 Data Visualization

Based on the scatter plots below, there might be a negative linear relationship between quarter mile time and hp, disp, and wt. However these aren’t very strong relationships. There is a positive linear relationship between mpg and quarter mile time.

gather( mtcars, "VARIABLE", "VALUE", 1:6,8:11) %>%
ggplot( aes(x=VALUE, y=qsec)) + 
  geom_point() + facet_wrap(~VARIABLE, scale = "free") + 
  labs(x="", y="1/4 Mile Time", title = "Scatter Plot of 1/4 Mile Time vs. Predictors" )


3. Data Preparation

As predictors I will use hp, am as the dichotomous term, the product of disp and vs as the dichotomous vs. quantitative interaction term. I will use wt as the quadratic term. I cannot intuitively find the best exponent for this term based on the scatter plot or domain knowledge ,so I will use the box-cox transformation for assistance.

Box-Cox Transformation of wt

lmBoxCox <- lm( wt ~ qsec, data = mtcars )

bc <- boxcox(lmBoxCox, lambda = seq(-2,2))

wtLambda <- bc$x[which.max(bc$y)]

print( paste0( "The best transformation of wt is " , wtLambda )  )
## [1] "The best transformation of wt is 0.464646464646465"

This transformation will be rounded up to the square root transformation of 0.5.

4. Build Model

lm <- lm(qsec ~ hp  + vs  + sqrt(wt) + vs:cyl, data = mtcars)

summary(lm)
## 
## Call:
## lm(formula = qsec ~ hp + vs + sqrt(wt) + vs:cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0521 -0.3874 -0.1936  0.3262  2.4975 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.188146   1.280007   8.741 2.35e-09 ***
## hp          -0.017824   0.003185  -5.596 6.17e-06 ***
## vs           4.903943   1.288518   3.806 0.000738 ***
## sqrt(wt)     4.659763   0.728179   6.399 7.44e-07 ***
## vs:cyl      -0.566642   0.247907  -2.286 0.030344 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7485 on 27 degrees of freedom
## Multiple R-squared:  0.8472, Adjusted R-squared:  0.8246 
## F-statistic: 37.42 on 4 and 27 DF,  p-value: 1.204e-10

The resulting model is

\[ qsec = 11.1881 - 0.0178*hp + 4.9039*vs + 4.6598*\sqrt{wt} - 0.5666*vs:cyl\]

All of the predictors are significant.

Interpretation of coefficients:

hp: For each unit increase in horsepower, quarter-mile time decreeases by 0.0178 seconds, holding all else constant

vs: For each unit increase in vs, quarter-mile time increases by 4.9039 seconds, holding all else constant

\(\sqrt{wt}\): For each unit increase in the square root of weight, quarter-mile time increases by 4.6598 seconds, holding all else constant

\(vs*cyl\) For each unit increase in the product of the type of engine and the number of cylinders, quarter-mile time decreases by 0.5666 seconds.

5. Model Diagnostics

To assess whether this model is reliable, we will check for linearity, nearly normal residuals, constant variability (homoscedasticity), and issues with multicollinearity.

5.1 Linearity

We have already produced scatter plots in section 2 where the only non-categorical variable (used as a predictor) that did not show signs of linearity was wt. This variable has since been transformed using the square root transformation. The plot shows that this transformation did not improve the relationship significantly. This seems like a case where more observations were needed because one would expect and negative linear relationship between these two variables (heavier vehicles tend to accelerate slower).

ggplot(aes(y = qsec, x = sqrt(wt) ), data = mtcars) +
  geom_point() +
   labs(x="sqrt(Weight)", y="1/4 Mile Time (seconds)") + 
  ggtitle("Scatter of 1/4 Mile Time vs. sqrt(Weight)")

5.2 Nearly Normal Residuals

Investigating using the Q-Q plot below, we can see the evidence of skewness with a blatant deviation of the last residual from the line. The others show only small deviations from the line. I would say that the assumption of nearly normal residuals has been met, although reservedly so.


ggplot( lm, aes(sample = lm$residuals) ) +
  stat_qq() +
  stat_qq_line() +
  labs(x="Theoretical Quantiles", y="Sample Quantiles") + 
  ggtitle("Normal Q-Q Plot")


5.3 Constant Variability and Independence

The plot of the residuals vs. the fitted values below shows an outlier that may be influential. It seems that the variance increases towards the right. I would say that the constant variability is not met but it is difficult to say for sure.

ggplot(aes(x = lm$fitted.values, y = lm$residuals ), data = mtcars) +
  geom_point(shape=1) +
  labs(x="Fitted Values", y="Residuals") + 
  ggtitle("Residuals vs. Fitted Values") + 
  geom_abline(intercept = 0, 
              slope = 0, 
              color = "blue", 
              size = 1.0,
              linetype="dashed")

The correlation plot below shows there are strong correlations among predictors. This is not surprising due to the interactive term in which one of its operands are also a predictor.

data <- dplyr::select( mtcars, hp, vs, cyl, wt ) %>% 
  mutate( vs_X_cyl = vs * cyl, wt_sqrt = sqrt(wt) ) %>%
  dplyr::select( hp, vs, vs_X_cyl, wt_sqrt )
## Warning: package 'bindrcpp' was built under R version 3.4.4
corrplot( round( cor(data, method = "pearson", use = "complete.obs"), 2 ), 
          method="circle", 
          type="upper",
          tl.col="black")

6. Conclusion

I would say that although the predictors are significant and the R-squared value of 0.85 is very high, there is room for improvements. The dichotomous terms could be converted to factors and the data could be standardized before modeling. The outlier in the residual plot should be investigated further if not removed. Maybe a better transformation of wt and other variables could be an option to improve homoscedasticity. Other variables could also be explored in an effort to find a more simple model.