Introduction
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Libraries
library(tidyverse)
library(openintro)
library(car)
Data
head(nycflights)
## # A tibble: 6 x 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 6 30 940 15 1216 -4 VX N626VA 407
## 2 2013 5 7 1657 -3 2104 10 DL N3760C 329
## 3 2013 12 8 859 -1 1238 11 DL N712TW 422
## 4 2013 5 14 1841 -4 2122 -34 DL N914DL 2391
## 5 2013 7 21 1102 -3 1230 -8 9E N823AY 3652
## 6 2013 1 1 1817 -3 2008 3 AA N3AXAA 353
## # ... with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
Preliminary Models
nycflights.lm <- lm(arr_delay ~ origin * dep_delay * air_time * distance, data = nycflights)
anova(nycflights.lm)
## Analysis of Variance Table
##
## Response: arr_delay
## Df Sum Sq Mean Sq F value Pr(>F)
## origin 2 91477 45739 1.9152e+02 < 2.2e-16
## dep_delay 1 54785416 54785416 2.2940e+05 < 2.2e-16
## air_time 1 9228 9228 3.8641e+01 5.155e-10
## distance 1 2582129 2582129 1.0812e+04 < 2.2e-16
## origin:dep_delay 2 1950 975 4.0815e+00 0.016890
## origin:air_time 2 13872 6936 2.9043e+01 2.499e-13
## dep_delay:air_time 1 1 1 2.8000e-03 0.957603
## origin:distance 2 19346 9673 4.0503e+01 < 2.2e-16
## dep_delay:distance 1 18580 18580 7.7800e+01 < 2.2e-16
## air_time:distance 1 12875 12875 5.3910e+01 2.148e-13
## origin:dep_delay:air_time 2 13 7 2.7900e-02 0.972476
## origin:dep_delay:distance 2 592 296 1.2386e+00 0.289793
## origin:air_time:distance 2 18965 9482 3.9706e+01 < 2.2e-16
## dep_delay:air_time:distance 1 1727 1727 7.2316e+00 0.007167
## origin:dep_delay:air_time:distance 2 160 80 3.3410e-01 0.715999
## Residuals 32711 7812071 239
##
## origin ***
## dep_delay ***
## air_time ***
## distance ***
## origin:dep_delay *
## origin:air_time ***
## dep_delay:air_time
## origin:distance ***
## dep_delay:distance ***
## air_time:distance ***
## origin:dep_delay:air_time
## origin:dep_delay:distance
## origin:air_time:distance ***
## dep_delay:air_time:distance **
## origin:dep_delay:air_time:distance
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Isolating Variables and Interactions with Significant Values
nycflights.lm <- lm(arr_delay ~ dest + dep_delay + air_time + origin * distance, data = nycflights)
anova(nycflights.lm)
## Analysis of Variance Table
##
## Response: arr_delay
## Df Sum Sq Mean Sq F value Pr(>F)
## dest 101 844488 8361 37.335 < 2.2e-16 ***
## dep_delay 1 54277207 54277207 242357.933 < 2.2e-16 ***
## air_time 1 2898801 2898801 12943.692 < 2.2e-16 ***
## origin 2 9077 4539 20.266 1.600e-09 ***
## distance 1 19352 19352 86.409 < 2.2e-16 ***
## origin:distance 2 12729 6365 28.419 4.659e-13 ***
## Residuals 32626 7306747 224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
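Note that anova() applied to a single lm fit reports sequential (Type I) sums of squares: each row is tested after the terms listed before it, so the apparent importance of a term can change with its position in the formula. The sketch below illustrates this on simulated data. The variables a, b, and y are hypothetical stand-ins, not columns of nycflights, and drop1() is shown as one base R way to get a test of each term given all the others.

```r
# Simulated data (hypothetical): b is strongly correlated with a
set.seed(7)
n <- 500
a <- rnorm(n)
b <- 0.8 * a + rnorm(n, sd = 0.6)
y <- 1 + 2 * a + 1 * b + rnorm(n)

# Same model, two term orders
fit_ab <- lm(y ~ a + b)
fit_ba <- lm(y ~ b + a)

# Sequential sum of squares credited to a depends on where a enters
ss_a_first <- anova(fit_ab)["a", "Sum Sq"]
ss_a_last  <- anova(fit_ba)["a", "Sum Sq"]

# drop1() tests each term after all the others, independent of order
marginal <- drop1(fit_ab, test = "F")
ss_a_marginal <- marginal["a", "Sum of Sq"]

print(c(first = ss_a_first, last = ss_a_last, marginal = ss_a_marginal))
```

When predictors are correlated, as dest, distance, and air_time are likely to be, the sequential table should be read with the term order in mind.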
Transforming a Variable with the Box-Cox Transformation
hist(nycflights$air_time)
mytf <- powerTransform(nycflights$air_time)
mytf
## Estimated transformation parameter
## nycflights$air_time
## 0.1220552
hist(nycflights$air_time^.1220552)
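To make the estimated parameter less of a black box, here is a minimal sketch of the idea behind a Box-Cox estimate for a single positive variable: pick the power lambda that maximizes the normal profile log-likelihood of the transformed values. It uses simulated right-skewed data rather than air_time, and the helper names boxcox_loglik and skewness are hypothetical. It is intended as an approximation of what powerTransform() estimates, not a reproduction of its internals.

```r
# Simulated right-skewed, strictly positive data (hypothetical stand-in)
set.seed(1)
x <- rlnorm(5000, meanlog = 5, sdlog = 0.7)

# Profile log-likelihood of the Box-Cox parameter for an intercept-only model
boxcox_loglik <- function(lambda, x) {
  n <- length(x)
  z <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
  sigma2 <- mean((z - mean(z))^2)          # ML variance of transformed values
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(x))
}

# Maximize over a plausible range of powers
fit <- optimize(function(l) boxcox_loglik(l, x), interval = c(-2, 2), maximum = TRUE)
lambda_hat <- fit$maximum

# Sample skewness before and after applying the estimated power
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
print(lambda_hat)
print(c(raw = skewness(x), transformed = skewness(x^lambda_hat)))
```

Because the simulated data are log-normal, the estimate should land near 0, the value that corresponds to a log transformation.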
Final Linear Model
nycflights$air_time <- nycflights$air_time^0.1220552
nycflights.lm <- lm(arr_delay ~ dest + dep_delay + air_time + origin * distance, data = nycflights)
anova(nycflights.lm)
## Analysis of Variance Table
##
## Response: arr_delay
## Df Sum Sq Mean Sq F value Pr(>F)
## dest 101 844488 8361 35.564 < 2.2e-16 ***
## dep_delay 1 54277207 54277207 230862.285 < 2.2e-16 ***
## air_time 1 2480653 2480653 10551.191 < 2.2e-16 ***
## origin 2 12120 6060 25.775 6.53e-12 ***
## distance 1 37352 37352 158.871 < 2.2e-16 ***
## origin:distance 2 46000 23000 97.828 < 2.2e-16 ***
## Residuals 32626 7670582 235
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
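The ANOVA tables also allow a quick check on how much the transformation changed the fit. Since the response is the same in both models, the total sum of squares is shared, and R-squared is one minus the residual share of that total. The sketch below only re-uses the sums of squares printed above; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA table of the final (transformed) model
ss_terms_final <- c(dest = 844488, dep_delay = 54277207, air_time = 2480653,
                    origin = 12120, distance = 37352, origin_distance = 46000)
ss_resid_final <- 7670582

# Residual sum of squares from the earlier model with untransformed air_time
ss_resid_untransformed <- 7306747

# Total sum of squares of arr_delay (identical for both models)
ss_total <- sum(ss_terms_final) + ss_resid_final

r2_final         <- 1 - ss_resid_final / ss_total
r2_untransformed <- 1 - ss_resid_untransformed / ss_total

print(round(c(final = r2_final, untransformed = r2_untransformed), 4))
```

By this measure the model with the transformed air_time fits slightly worse than the untransformed one (residual mean square 235 versus 224), so the transformation is not justified by goodness of fit alone.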
summary(nycflights.lm)
##
## Call:
## lm(formula = arr_delay ~ dest + dep_delay + air_time + origin *
## distance, data = nycflights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.063 -9.437 -1.762 6.967 144.477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.602e+02 4.159e+01 -6.257 3.97e-10 ***
## destACK -4.329e+02 3.725e+01 -11.622 < 2e-16 ***
## destALB -4.407e+02 3.830e+01 -11.505 < 2e-16 ***
## destANC 5.478e+02 3.902e+01 14.039 < 2e-16 ***
## destATL -3.196e+02 2.451e+01 -13.041 < 2e-16 ***
## destAUS -9.790e+01 7.764e+00 -12.610 < 2e-16 ***
## destAVL -3.593e+02 2.841e+01 -12.646 < 2e-16 ***
## destBDL -4.314e+02 3.889e+01 -11.094 < 2e-16 ***
## destBGR -4.034e+02 3.318e+01 -12.157 < 2e-16 ***
## destBHM -3.007e+02 2.230e+01 -13.483 < 2e-16 ***
## destBNA -3.217e+02 2.447e+01 -13.148 < 2e-16 ***
## destBOS -4.375e+02 3.725e+01 -11.744 < 2e-16 ***
## destBQN -5.906e+01 6.606e+00 -8.940 < 2e-16 ***
## destBTV -4.239e+02 3.559e+01 -11.913 < 2e-16 ***
## destBUF -4.277e+02 3.490e+01 -12.257 < 2e-16 ***
## destBUR 2.140e+02 1.517e+01 14.109 < 2e-16 ***
## destBWI -4.404e+02 3.753e+01 -11.734 < 2e-16 ***
## destBZN 3.019e+01 1.145e+01 2.638 0.008342 **
## destCAE -3.523e+02 2.828e+01 -12.456 < 2e-16 ***
## destCAK -4.031e+02 3.272e+01 -12.321 < 2e-16 ***
## destCHO -4.250e+02 3.798e+01 -11.190 < 2e-16 ***
## destCHS -3.517e+02 2.725e+01 -12.905 < 2e-16 ***
## destCLE -4.057e+02 3.222e+01 -12.593 < 2e-16 ***
## destCLT -3.763e+02 2.942e+01 -12.787 < 2e-16 ***
## destCMH -3.898e+02 3.086e+01 -12.633 < 2e-16 ***
## destCRW -4.008e+02 3.194e+01 -12.550 < 2e-16 ***
## destCVG -3.743e+02 2.853e+01 -13.122 < 2e-16 ***
## destDAY -3.773e+02 2.939e+01 -12.837 < 2e-16 ***
## destDCA -4.406e+02 3.683e+01 -11.961 < 2e-16 ***
## destDEN -6.671e+01 5.791e+00 -11.520 < 2e-16 ***
## destDFW -1.432e+02 1.056e+01 -13.559 < 2e-16 ***
## destDSM -2.503e+02 1.862e+01 -13.443 < 2e-16 ***
## destDTW -3.957e+02 3.033e+01 -13.043 < 2e-16 ***
## destEGE -3.813e+01 4.870e+00 -7.829 5.08e-15 ***
## destEYW -1.879e+02 2.114e+01 -8.889 < 2e-16 ***
## destFLL -2.318e+02 1.746e+01 -13.275 < 2e-16 ***
## destGRR -3.580e+02 2.782e+01 -12.869 < 2e-16 ***
## destGSO -3.942e+02 3.136e+01 -12.571 < 2e-16 ***
## destGSP -3.635e+02 2.804e+01 -12.965 < 2e-16 ***
## destHDN -1.990e+01 1.148e+01 -1.732 0.083205 .
## destHNL 1.121e+03 7.220e+01 15.529 < 2e-16 ***
## destHOU -1.306e+02 9.740e+00 -13.404 < 2e-16 ***
## destIAD -4.412e+02 3.652e+01 -12.081 < 2e-16 ***
## destIAH -1.307e+02 9.973e+00 -13.101 < 2e-16 ***
## destILM -3.930e+02 3.071e+01 -12.798 < 2e-16 ***
## destIND -3.502e+02 2.680e+01 -13.071 < 2e-16 ***
## destJAC 3.698e+01 1.142e+01 3.238 0.001204 **
## destJAX -3.029e+02 2.294e+01 -13.208 < 2e-16 ***
## destLAS 1.432e+02 1.015e+01 14.106 < 2e-16 ***
## destLAX 2.218e+02 1.513e+01 14.654 < 2e-16 ***
## destLGB 2.146e+02 1.504e+01 14.275 < 2e-16 ***
## destMCI -2.267e+02 1.677e+01 -13.512 < 2e-16 ***
## destMCO -2.690e+02 2.029e+01 -13.261 < 2e-16 ***
## destMDW -3.387e+02 2.534e+01 -13.366 < 2e-16 ***
## destMEM -2.701e+02 2.004e+01 -13.477 < 2e-16 ***
## destMHT -4.294e+02 3.683e+01 -11.657 < 2e-16 ***
## destMIA -2.286e+02 1.701e+01 -13.442 < 2e-16 ***
## destMKE -3.334e+02 2.503e+01 -13.319 < 2e-16 ***
## destMSN -3.123e+02 2.345e+01 -13.315 < 2e-16 ***
## destMSP -2.533e+02 1.865e+01 -13.582 < 2e-16 ***
## destMSY -2.042e+02 1.507e+01 -13.547 < 2e-16 ***
## destMTJ -2.406e+01 1.133e+01 -2.123 0.033744 *
## destMVY -4.392e+02 3.789e+01 -11.589 < 2e-16 ***
## destMYR -3.822e+02 3.000e+01 -12.742 < 2e-16 ***
## destOAK 2.539e+02 1.766e+01 14.374 < 2e-16 ***
## destOKC -1.533e+02 1.193e+01 -12.851 < 2e-16 ***
## destOMA -2.151e+02 1.591e+01 -13.513 < 2e-16 ***
## destORD -3.378e+02 2.513e+01 -13.441 < 2e-16 ***
## destORF -4.263e+02 3.506e+01 -12.159 < 2e-16 ***
## destPBI -2.429e+02 1.837e+01 -13.218 < 2e-16 ***
## destPDX 2.089e+02 1.472e+01 14.193 < 2e-16 ***
## destPHL -4.553e+02 3.950e+01 -11.529 < 2e-16 ***
## destPHX 1.087e+02 8.133e+00 13.364 < 2e-16 ***
## destPIT -4.256e+02 3.406e+01 -12.497 < 2e-16 ***
## destPSE -4.980e+01 6.268e+00 -7.946 1.99e-15 ***
## destPSP 1.930e+02 1.691e+01 11.411 < 2e-16 ***
## destPVD -4.355e+02 3.791e+01 -11.487 < 2e-16 ***
## destPWM -4.219e+02 3.533e+01 -11.942 < 2e-16 ***
## destRDU -4.004e+02 3.197e+01 -12.525 < 2e-16 ***
## destRIC -4.331e+02 3.517e+01 -12.313 < 2e-16 ***
## destROC -4.336e+02 3.575e+01 -12.130 < 2e-16 ***
## destRSW -2.360e+02 1.738e+01 -13.581 < 2e-16 ***
## destSAN 2.129e+02 1.451e+01 14.679 < 2e-16 ***
## destSAT -8.140e+01 6.662e+00 -12.219 < 2e-16 ***
## destSAV -3.299e+02 2.547e+01 -12.950 < 2e-16 ***
## destSDF -3.533e+02 2.694e+01 -13.114 < 2e-16 ***
## destSEA 1.981e+02 1.397e+01 14.176 < 2e-16 ***
## destSFO 2.575e+02 1.761e+01 14.621 < 2e-16 ***
## destSJC 2.580e+02 1.744e+01 14.792 < 2e-16 ***
## destSJU -5.704e+01 6.067e+00 -9.401 < 2e-16 ***
## destSLC 5.300e+01 5.047e+00 10.501 < 2e-16 ***
## destSMF 2.429e+02 1.648e+01 14.744 < 2e-16 ***
## destSNA 2.072e+02 1.476e+01 14.034 < 2e-16 ***
## destSRQ -2.423e+02 1.817e+01 -13.335 < 2e-16 ***
## destSTL -2.921e+02 2.170e+01 -13.462 < 2e-16 ***
## destSTT -5.243e+01 5.892e+00 -8.898 < 2e-16 ***
## destSYR -4.392e+02 3.696e+01 -11.884 < 2e-16 ***
## destTPA -2.514e+02 1.894e+01 -13.274 < 2e-16 ***
## destTUL -1.834e+02 1.435e+01 -12.782 < 2e-16 ***
## destTVC -3.542e+02 2.759e+01 -12.838 < 2e-16 ***
## destTYS -3.538e+02 2.717e+01 -13.023 < 2e-16 ***
## destXNA -2.093e+02 1.596e+01 -13.113 < 2e-16 ***
## dep_delay 1.014e+00 2.110e-03 480.824 < 2e-16 ***
## air_time 4.905e+02 4.698e+00 104.397 < 2e-16 ***
## originJFK -1.555e+00 4.118e-01 -3.777 0.000159 ***
## originLGA 2.105e+00 5.298e-01 3.974 7.08e-05 ***
## distance -3.933e-01 2.296e-02 -17.132 < 2e-16 ***
## originJFK:distance 4.152e-03 2.970e-04 13.981 < 2e-16 ***
## originLGA:distance 2.990e-03 5.938e-04 5.036 4.78e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.33 on 32626 degrees of freedom
## Multiple R-squared: 0.8827, Adjusted R-squared: 0.8823
## F-statistic: 2272 on 108 and 32626 DF, p-value: < 2.2e-16
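As a worked example of reading the interaction terms: the distance coefficient is the slope for the reference origin, and each origin:distance coefficient is the difference between that origin's slope and the reference slope. The sketch below computes the implied slopes from the printed estimates. It assumes the reference level is EWR, the origin that has no coefficient of its own in the table. All slopes are conditional on the other terms in the model, including dest.

```r
# Estimates copied from the summary() output above
b_distance     <- -3.933e-01   # slope of distance for the reference origin
b_jfk_distance <-  4.152e-03   # originJFK:distance
b_lga_distance <-  2.990e-03   # originLGA:distance

# Implied distance slope per origin (EWR assumed to be the reference level)
slopes <- c(
  EWR = b_distance,
  JFK = b_distance + b_jfk_distance,
  LGA = b_distance + b_lga_distance
)
print(slopes)
```

The dep_delay coefficient reads more directly: holding the other terms fixed, each additional minute of departure delay is associated with about 1.014 additional minutes of arrival delay.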
Residual Analysis
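Based on the residual analysis, we conclude that a linear model is not appropriate. Although the adjusted R-squared is high (0.8823) and the overall F-test is significant (p-value < 2.2e-16), the residuals do not have constant variance and are not normally distributed. The residual summary already points to a right-skewed distribution: the residuals range from -65.06 to 144.48 around a median of -1.76.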
hist(nycflights.lm$residuals)
plot(nycflights.lm)
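The visual checks can be supplemented with numeric ones. The sketch below runs on simulated data, not on nycflights, with errors that are deliberately skewed and whose spread grows with the predictor. It computes a studentized Breusch-Pagan statistic by hand, by regressing the squared residuals on the fitted values, and applies the Shapiro-Wilk test for normality. All variable names are hypothetical.

```r
# Simulated data (hypothetical): skewed errors whose spread increases with x
set.seed(42)
n <- 2000
x <- runif(n, 0, 10)
y <- 2 + 3 * x + (0.5 + 0.5 * x) * (rexp(n) - 1)
fit <- lm(y ~ x)
res <- resid(fit)

# Constant-variance check: regress squared residuals on the fitted values.
# Under constant variance, n * R^2 is approximately chi-square with 1 df.
aux <- lm(res^2 ~ fitted(fit))
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Normality check: shapiro.test() accepts between 3 and 5000 observations
sw <- shapiro.test(res)

print(c(bp_p_value = bp_p, shapiro_p_value = sw$p.value))
```

Small p-values indicate non-constant variance and non-normal residuals, respectively. The flight model has more than 5000 residuals, so shapiro.test() would have to be run on a random subsample, and with samples this large even minor departures tend to be flagged, so the plots remain the primary evidence. ncvTest() from the already loaded car package is another option for the variance check.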