Hi,

I am going to use the mpg dataset from the UCI machine learning repository. It is the same dataset from discussion 11 but it allows for flexibility in order to meet the needs of this weeks discussion.

For this weeks discussion, the goal is to transfrom the data and build a multiple regression model and Include at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Lets look at some of the relationships that exist among the variables which are continous

What do the variables look like in terms of distributions?

## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths

##       mpg          cylinders      displacement    X.horsepower  
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##      weight      acceleration  
##  Min.   :1613   Min.   : 8.00  
##  1st Qu.:2225   1st Qu.:13.78  
##  Median :2804   Median :15.50  
##  Mean   :2978   Mean   :15.54  
##  3rd Qu.:3615   3rd Qu.:17.02  
##  Max.   :5140   Max.   :24.80
##          mpg    cylinders displacement X.horsepower       weight 
##    1.0000000   -0.7776175   -0.8051269   -0.7784268   -0.8322442 
## acceleration 
##    0.4233285

The correlation matrix heatmap indicates some strong positive correlations such as displacement and cylinders. Cylinders is a discrete variable where as displacement is a measurment of how far an object has moved from its starting place. We will let that be our interaction term

mod <- lm(mpg ~ (cylinders*displacement)+I(X.horsepower^2)+weight+acceleration+factor(model.year), data=auto.df)
summary(mod)
## 
## Call:
## lm(formula = mpg ~ (cylinders * displacement) + I(X.horsepower^2) + 
##     weight + acceleration + factor(model.year), data = auto.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1530  -1.4956   0.1002   1.5550  11.7602 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.846e+01  2.244e+00  21.597  < 2e-16 ***
## cylinders              -2.355e+00  4.222e-01  -5.577 4.69e-08 ***
## displacement           -9.152e-02  1.257e-02  -7.280 1.99e-12 ***
## I(X.horsepower^2)      -5.855e-05  4.396e-05  -1.332 0.183744    
## weight                 -5.580e-03  5.696e-04  -9.797  < 2e-16 ***
## acceleration            1.918e-01  7.610e-02   2.521 0.012133 *  
## factor(model.year)71    1.292e+00  8.282e-01   1.560 0.119709    
## factor(model.year)72   -6.537e-01  8.201e-01  -0.797 0.425911    
## factor(model.year)73   -6.129e-01  7.350e-01  -0.834 0.404869    
## factor(model.year)74    1.609e+00  8.629e-01   1.864 0.063068 .  
## factor(model.year)75    1.642e+00  8.453e-01   1.942 0.052909 .  
## factor(model.year)76    1.969e+00  8.109e-01   2.428 0.015637 *  
## factor(model.year)77    2.998e+00  8.299e-01   3.612 0.000345 ***
## factor(model.year)78    3.574e+00  7.868e-01   4.543 7.51e-06 ***
## factor(model.year)79    5.280e+00  8.347e-01   6.325 7.23e-10 ***
## factor(model.year)80    9.680e+00  8.722e-01  11.099  < 2e-16 ***
## factor(model.year)81    7.119e+00  8.521e-01   8.354 1.31e-15 ***
## factor(model.year)82    8.334e+00  8.403e-01   9.918  < 2e-16 ***
## cylinders:displacement  1.370e-02  1.686e-03   8.126 6.61e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.89 on 373 degrees of freedom
## Multiple R-squared:  0.8692, Adjusted R-squared:  0.8629 
## F-statistic: 137.7 on 18 and 373 DF,  p-value: < 2.2e-16

If we look at our p values first, our quadratic term of horsepower is not significant due to the high p value. Acceleration is also not significant. Our dummy variable model year has multiple levels with high p value until the year is higher than 1978. Also our interaction term is significant in this particular model. We have an adjusted r squared of roughly 80 percent meaning 80 percent of the variability in the data is accounted for. How do the resuduals look?

par(mfrow=c(2,2))
plot(mod)

hist(resid(mod), main="Histogram of Residuals");
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked _by_ '.GlobalEnv':
## 
##     auto
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(mod)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##              Data               
##  -------------------------------
##  Response : mpg 
##  Variables: fitted values of mpg 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    62.09954 
##  Prob > Chi2   =    3.265282e-15

First of all, a low p value in the Breusch Pagan Test for Heteroskedasticity allows us to reject the null hypothesis, meaning Heteroskedasticity is assumed. The histogram of the residuals is nearly normal with a slight skew. The QQ plot shows evidence of outliers in the data set. We can take a more zoomed in look at the constant variance check.

plot(fitted(mod), residuals(mod), xlab="fitted", ylab="residuals")
abline(h=0);

plot(fitted(mod), sqrt(abs(residuals(mod))), xlab="fitted", ylab=expression(sqrt(hat(epsilon))))

It appears that there is indeed some curvature and some sort of transformation is needed.