Hi,
I am going to use the mpg dataset from the UCI machine learning repository. It is the same dataset from discussion 11 but it allows for flexibility in order to meet the needs of this weeks discussion.
For this weeks discussion, the goal is to transfrom the data and build a multiple regression model and Include at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Lets look at some of the relationships that exist among the variables which are continous
What do the variables look like in terms of distributions?
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
## mpg cylinders displacement X.horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
## weight acceleration
## Min. :1613 Min. : 8.00
## 1st Qu.:2225 1st Qu.:13.78
## Median :2804 Median :15.50
## Mean :2978 Mean :15.54
## 3rd Qu.:3615 3rd Qu.:17.02
## Max. :5140 Max. :24.80
## mpg cylinders displacement X.horsepower weight
## 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## acceleration
## 0.4233285
The correlation matrix heatmap indicates some strong positive correlations such as displacement and cylinders. Cylinders is a discrete variable where as displacement is a measurment of how far an object has moved from its starting place. We will let that be our interaction term
mod <- lm(mpg ~ (cylinders*displacement)+I(X.horsepower^2)+weight+acceleration+factor(model.year), data=auto.df)
summary(mod)
##
## Call:
## lm(formula = mpg ~ (cylinders * displacement) + I(X.horsepower^2) +
## weight + acceleration + factor(model.year), data = auto.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1530 -1.4956 0.1002 1.5550 11.7602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.846e+01 2.244e+00 21.597 < 2e-16 ***
## cylinders -2.355e+00 4.222e-01 -5.577 4.69e-08 ***
## displacement -9.152e-02 1.257e-02 -7.280 1.99e-12 ***
## I(X.horsepower^2) -5.855e-05 4.396e-05 -1.332 0.183744
## weight -5.580e-03 5.696e-04 -9.797 < 2e-16 ***
## acceleration 1.918e-01 7.610e-02 2.521 0.012133 *
## factor(model.year)71 1.292e+00 8.282e-01 1.560 0.119709
## factor(model.year)72 -6.537e-01 8.201e-01 -0.797 0.425911
## factor(model.year)73 -6.129e-01 7.350e-01 -0.834 0.404869
## factor(model.year)74 1.609e+00 8.629e-01 1.864 0.063068 .
## factor(model.year)75 1.642e+00 8.453e-01 1.942 0.052909 .
## factor(model.year)76 1.969e+00 8.109e-01 2.428 0.015637 *
## factor(model.year)77 2.998e+00 8.299e-01 3.612 0.000345 ***
## factor(model.year)78 3.574e+00 7.868e-01 4.543 7.51e-06 ***
## factor(model.year)79 5.280e+00 8.347e-01 6.325 7.23e-10 ***
## factor(model.year)80 9.680e+00 8.722e-01 11.099 < 2e-16 ***
## factor(model.year)81 7.119e+00 8.521e-01 8.354 1.31e-15 ***
## factor(model.year)82 8.334e+00 8.403e-01 9.918 < 2e-16 ***
## cylinders:displacement 1.370e-02 1.686e-03 8.126 6.61e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.89 on 373 degrees of freedom
## Multiple R-squared: 0.8692, Adjusted R-squared: 0.8629
## F-statistic: 137.7 on 18 and 373 DF, p-value: < 2.2e-16
If we look at our p values first, our quadratic term of horsepower is not significant due to the high p value. Acceleration is also not significant. Our dummy variable model year has multiple levels with high p value until the year is higher than 1978. Also our interaction term is significant in this particular model. We have an adjusted r squared of roughly 80 percent meaning 80 percent of the variability in the data is accounted for. How do the resuduals look?
par(mfrow=c(2,2))
plot(mod)
hist(resid(mod), main="Histogram of Residuals");
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked _by_ '.GlobalEnv':
##
## auto
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(mod)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------
## Response : mpg
## Variables: fitted values of mpg
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 62.09954
## Prob > Chi2 = 3.265282e-15
First of all, a low p value in the Breusch Pagan Test for Heteroskedasticity allows us to reject the null hypothesis, meaning Heteroskedasticity is assumed. The histogram of the residuals is nearly normal with a slight skew. The QQ plot shows evidence of outliers in the data set. We can take a more zoomed in look at the constant variance check.
plot(fitted(mod), residuals(mod), xlab="fitted", ylab="residuals")
abline(h=0);
plot(fitted(mod), sqrt(abs(residuals(mod))), xlab="fitted", ylab=expression(sqrt(hat(epsilon))))
It appears that there is indeed some curvature and some sort of transformation is needed.