Computational Mathematics - Regression Analysis III

Euclides Rodriguez

2022-04-15

Introduction

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Libraries

library(tidyverse)
library(openintro)
library(car)

Data

head(nycflights)
## # A tibble: 6 x 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2013     6    30      940        15     1216        -4 VX      N626VA     407
## 2  2013     5     7     1657        -3     2104        10 DL      N3760C     329
## 3  2013    12     8      859        -1     1238        11 DL      N712TW     422
## 4  2013     5    14     1841        -4     2122       -34 DL      N914DL    2391
## 5  2013     7    21     1102        -3     1230        -8 9E      N823AY    3652
## 6  2013     1     1     1817        -3     2008         3 AA      N3AXAA     353
## # ... with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>

Preliminary Models

nycflights.lm <- lm(arr_delay ~ origin * dep_delay * air_time * distance, data = nycflights)

anova(nycflights.lm)
## Analysis of Variance Table
## 
## Response: arr_delay
##                                       Df   Sum Sq  Mean Sq    F value    Pr(>F)
## origin                                 2    91477    45739 1.9152e+02 < 2.2e-16
## dep_delay                              1 54785416 54785416 2.2940e+05 < 2.2e-16
## air_time                               1     9228     9228 3.8641e+01 5.155e-10
## distance                               1  2582129  2582129 1.0812e+04 < 2.2e-16
## origin:dep_delay                       2     1950      975 4.0815e+00  0.016890
## origin:air_time                        2    13872     6936 2.9043e+01 2.499e-13
## dep_delay:air_time                     1        1        1 2.8000e-03  0.957603
## origin:distance                        2    19346     9673 4.0503e+01 < 2.2e-16
## dep_delay:distance                     1    18580    18580 7.7800e+01 < 2.2e-16
## air_time:distance                      1    12875    12875 5.3910e+01 2.148e-13
## origin:dep_delay:air_time              2       13        7 2.7900e-02  0.972476
## origin:dep_delay:distance              2      592      296 1.2386e+00  0.289793
## origin:air_time:distance               2    18965     9482 3.9706e+01 < 2.2e-16
## dep_delay:air_time:distance            1     1727     1727 7.2316e+00  0.007167
## origin:dep_delay:air_time:distance     2      160       80 3.3410e-01  0.715999
## Residuals                          32711  7812071      239                     
##                                       
## origin                             ***
## dep_delay                          ***
## air_time                           ***
## distance                           ***
## origin:dep_delay                   *  
## origin:air_time                    ***
## dep_delay:air_time                    
## origin:distance                    ***
## dep_delay:distance                 ***
## air_time:distance                  ***
## origin:dep_delay:air_time             
## origin:dep_delay:distance             
## origin:air_time:distance           ***
## dep_delay:air_time:distance        ** 
## origin:dep_delay:air_time:distance    
## Residuals                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
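
The `*` operator in an R formula crosses its operands: it adds every main effect plus every interaction among them, which is why the four-variable formula produces fifteen rows in the table. A minimal sketch, using only base R's `terms()` and no data, that lists the expansion:

```r
# Expand crossing formulas into their individual terms without fitting a model.
two_way  <- attr(terms(arr_delay ~ origin * distance), "term.labels")
four_way <- attr(terms(arr_delay ~ origin * dep_delay * air_time * distance),
                 "term.labels")

print(two_way)            # main effects first, then the interaction
print(length(four_way))   # 2^4 - 1 terms for four crossed variables
```

Note that `anova()` on an `lm` object reports sequential (Type I) sums of squares, so each row is tested after the rows above it and the order of terms in the formula affects the table.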

Isolating Variables and Interactions with Significant Effects

nycflights.lm <- lm(arr_delay ~ dest + dep_delay + air_time + origin * distance, data = nycflights)
anova(nycflights.lm)
## Analysis of Variance Table
## 
## Response: arr_delay
##                    Df   Sum Sq  Mean Sq    F value    Pr(>F)    
## dest              101   844488     8361     37.335 < 2.2e-16 ***
## dep_delay           1 54277207 54277207 242357.933 < 2.2e-16 ***
## air_time            1  2898801  2898801  12943.692 < 2.2e-16 ***
## origin              2     9077     4539     20.266 1.600e-09 ***
## distance            1    19352    19352     86.409 < 2.2e-16 ***
## origin:distance     2    12729     6365     28.419 4.659e-13 ***
## Residuals       32626  7306747      224                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
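
The assignment asks for a quadratic term, a dichotomous term, and a dichotomous-by-quantitative interaction. Here `origin` has three levels, so it is not dichotomous, and no squared term appears in the formula. The sketch below shows one way such terms could be written in an `lm()` formula. It runs on synthetic data: the column names mirror `nycflights`, but the values, the `from_jfk` indicator, and the simulated coefficients are all made up for illustration.

```r
set.seed(1)
n <- 500

# Synthetic stand-in data; values are simulated, not taken from nycflights.
flights <- data.frame(
  dep_delay = rnorm(n, mean = 10, sd = 20),
  distance  = runif(n, min = 100, max = 2500),
  origin    = sample(c("EWR", "JFK", "LGA"), n, replace = TRUE)
)

# Hypothetical dichotomous variable: 1 if the flight left from JFK, 0 otherwise.
flights$from_jfk <- as.integer(flights$origin == "JFK")

# Simulated response with a known quadratic and interaction structure.
flights$arr_delay <- 2 + 1.0 * flights$dep_delay + 0.002 * flights$dep_delay^2 -
  0.003 * flights$distance + 3 * flights$from_jfk +
  0.001 * flights$from_jfk * flights$distance + rnorm(n, sd = 5)

# I() protects the square so it is treated as arithmetic, not formula syntax;
# from_jfk * distance adds both main effects and their interaction.
fit <- lm(arr_delay ~ dep_delay + I(dep_delay^2) + from_jfk * distance,
          data = flights)
print(names(coef(fit)))
```

With this layout the `from_jfk:distance` coefficient is the difference in the distance slope between JFK and non-JFK flights, and `I(dep_delay^2)` captures curvature in the effect of departure delay.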

Transforming a Variable Using the Box-Cox Transformation

hist(nycflights$air_time)

mytf <- powerTransform(nycflights$air_time)
mytf
## Estimated transformation parameter 
## nycflights$air_time 
##           0.1220552
hist(nycflights$air_time^.1220552)
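
The estimated parameter is the power that makes the variable look most nearly normal in the maximum-likelihood sense. The sketch below illustrates the idea in base R by maximizing the Box-Cox profile log-likelihood directly. It is not the `car` implementation; the helper `boxcox_loglik` and the log-normal test data are assumptions made for illustration.

```r
# Profile log-likelihood of the Box-Cox power lambda for a positive vector x.
boxcox_loglik <- function(lambda, x) {
  n <- length(x)
  # Box-Cox transform; the lambda = 0 case is defined as the logarithm.
  y <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
  sigma2 <- mean((y - mean(y))^2)   # maximum-likelihood variance estimate
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(x))
}

set.seed(42)
x <- exp(rnorm(5000, mean = 5, sd = 0.5))   # synthetic log-normal data

# Search for the lambda that maximizes the profile log-likelihood.
opt <- optimize(boxcox_loglik, interval = c(-2, 2), x = x, maximum = TRUE)
lambda_hat <- opt$maximum
print(lambda_hat)   # should land near 0, since log(x) is normal here
```

In the report itself, reading the estimate from the fitted object with `coef(mytf)` rather than retyping `0.1220552` would keep the transformation in sync if the data change.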

Final Linear Model

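# Overwrites air_time in place: run this line only once per session,
# otherwise the power is applied a second time.
nycflights$air_time <- nycflights$air_time^.1220552
nycflights.lm <- lm(arr_delay ~ dest + dep_delay + air_time + origin * distance, data = nycflights)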
anova(nycflights.lm)
## Analysis of Variance Table
## 
## Response: arr_delay
##                    Df   Sum Sq  Mean Sq    F value    Pr(>F)    
## dest              101   844488     8361     35.564 < 2.2e-16 ***
## dep_delay           1 54277207 54277207 230862.285 < 2.2e-16 ***
## air_time            1  2480653  2480653  10551.191 < 2.2e-16 ***
## origin              2    12120     6060     25.775  6.53e-12 ***
## distance            1    37352    37352    158.871 < 2.2e-16 ***
## origin:distance     2    46000    23000     97.828 < 2.2e-16 ***
## Residuals       32626  7670582      235                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(nycflights.lm)
## 
## Call:
## lm(formula = arr_delay ~ dest + dep_delay + air_time + origin * 
##     distance, data = nycflights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.063  -9.437  -1.762   6.967 144.477 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.602e+02  4.159e+01  -6.257 3.97e-10 ***
## destACK            -4.329e+02  3.725e+01 -11.622  < 2e-16 ***
## destALB            -4.407e+02  3.830e+01 -11.505  < 2e-16 ***
## destANC             5.478e+02  3.902e+01  14.039  < 2e-16 ***
## destATL            -3.196e+02  2.451e+01 -13.041  < 2e-16 ***
## destAUS            -9.790e+01  7.764e+00 -12.610  < 2e-16 ***
## destAVL            -3.593e+02  2.841e+01 -12.646  < 2e-16 ***
## destBDL            -4.314e+02  3.889e+01 -11.094  < 2e-16 ***
## destBGR            -4.034e+02  3.318e+01 -12.157  < 2e-16 ***
## destBHM            -3.007e+02  2.230e+01 -13.483  < 2e-16 ***
## destBNA            -3.217e+02  2.447e+01 -13.148  < 2e-16 ***
## destBOS            -4.375e+02  3.725e+01 -11.744  < 2e-16 ***
## destBQN            -5.906e+01  6.606e+00  -8.940  < 2e-16 ***
## destBTV            -4.239e+02  3.559e+01 -11.913  < 2e-16 ***
## destBUF            -4.277e+02  3.490e+01 -12.257  < 2e-16 ***
## destBUR             2.140e+02  1.517e+01  14.109  < 2e-16 ***
## destBWI            -4.404e+02  3.753e+01 -11.734  < 2e-16 ***
## destBZN             3.019e+01  1.145e+01   2.638 0.008342 ** 
## destCAE            -3.523e+02  2.828e+01 -12.456  < 2e-16 ***
## destCAK            -4.031e+02  3.272e+01 -12.321  < 2e-16 ***
## destCHO            -4.250e+02  3.798e+01 -11.190  < 2e-16 ***
## destCHS            -3.517e+02  2.725e+01 -12.905  < 2e-16 ***
## destCLE            -4.057e+02  3.222e+01 -12.593  < 2e-16 ***
## destCLT            -3.763e+02  2.942e+01 -12.787  < 2e-16 ***
## destCMH            -3.898e+02  3.086e+01 -12.633  < 2e-16 ***
## destCRW            -4.008e+02  3.194e+01 -12.550  < 2e-16 ***
## destCVG            -3.743e+02  2.853e+01 -13.122  < 2e-16 ***
## destDAY            -3.773e+02  2.939e+01 -12.837  < 2e-16 ***
## destDCA            -4.406e+02  3.683e+01 -11.961  < 2e-16 ***
## destDEN            -6.671e+01  5.791e+00 -11.520  < 2e-16 ***
## destDFW            -1.432e+02  1.056e+01 -13.559  < 2e-16 ***
## destDSM            -2.503e+02  1.862e+01 -13.443  < 2e-16 ***
## destDTW            -3.957e+02  3.033e+01 -13.043  < 2e-16 ***
## destEGE            -3.813e+01  4.870e+00  -7.829 5.08e-15 ***
## destEYW            -1.879e+02  2.114e+01  -8.889  < 2e-16 ***
## destFLL            -2.318e+02  1.746e+01 -13.275  < 2e-16 ***
## destGRR            -3.580e+02  2.782e+01 -12.869  < 2e-16 ***
## destGSO            -3.942e+02  3.136e+01 -12.571  < 2e-16 ***
## destGSP            -3.635e+02  2.804e+01 -12.965  < 2e-16 ***
## destHDN            -1.990e+01  1.148e+01  -1.732 0.083205 .  
## destHNL             1.121e+03  7.220e+01  15.529  < 2e-16 ***
## destHOU            -1.306e+02  9.740e+00 -13.404  < 2e-16 ***
## destIAD            -4.412e+02  3.652e+01 -12.081  < 2e-16 ***
## destIAH            -1.307e+02  9.973e+00 -13.101  < 2e-16 ***
## destILM            -3.930e+02  3.071e+01 -12.798  < 2e-16 ***
## destIND            -3.502e+02  2.680e+01 -13.071  < 2e-16 ***
## destJAC             3.698e+01  1.142e+01   3.238 0.001204 ** 
## destJAX            -3.029e+02  2.294e+01 -13.208  < 2e-16 ***
## destLAS             1.432e+02  1.015e+01  14.106  < 2e-16 ***
## destLAX             2.218e+02  1.513e+01  14.654  < 2e-16 ***
## destLGB             2.146e+02  1.504e+01  14.275  < 2e-16 ***
## destMCI            -2.267e+02  1.677e+01 -13.512  < 2e-16 ***
## destMCO            -2.690e+02  2.029e+01 -13.261  < 2e-16 ***
## destMDW            -3.387e+02  2.534e+01 -13.366  < 2e-16 ***
## destMEM            -2.701e+02  2.004e+01 -13.477  < 2e-16 ***
## destMHT            -4.294e+02  3.683e+01 -11.657  < 2e-16 ***
## destMIA            -2.286e+02  1.701e+01 -13.442  < 2e-16 ***
## destMKE            -3.334e+02  2.503e+01 -13.319  < 2e-16 ***
## destMSN            -3.123e+02  2.345e+01 -13.315  < 2e-16 ***
## destMSP            -2.533e+02  1.865e+01 -13.582  < 2e-16 ***
## destMSY            -2.042e+02  1.507e+01 -13.547  < 2e-16 ***
## destMTJ            -2.406e+01  1.133e+01  -2.123 0.033744 *  
## destMVY            -4.392e+02  3.789e+01 -11.589  < 2e-16 ***
## destMYR            -3.822e+02  3.000e+01 -12.742  < 2e-16 ***
## destOAK             2.539e+02  1.766e+01  14.374  < 2e-16 ***
## destOKC            -1.533e+02  1.193e+01 -12.851  < 2e-16 ***
## destOMA            -2.151e+02  1.591e+01 -13.513  < 2e-16 ***
## destORD            -3.378e+02  2.513e+01 -13.441  < 2e-16 ***
## destORF            -4.263e+02  3.506e+01 -12.159  < 2e-16 ***
## destPBI            -2.429e+02  1.837e+01 -13.218  < 2e-16 ***
## destPDX             2.089e+02  1.472e+01  14.193  < 2e-16 ***
## destPHL            -4.553e+02  3.950e+01 -11.529  < 2e-16 ***
## destPHX             1.087e+02  8.133e+00  13.364  < 2e-16 ***
## destPIT            -4.256e+02  3.406e+01 -12.497  < 2e-16 ***
## destPSE            -4.980e+01  6.268e+00  -7.946 1.99e-15 ***
## destPSP             1.930e+02  1.691e+01  11.411  < 2e-16 ***
## destPVD            -4.355e+02  3.791e+01 -11.487  < 2e-16 ***
## destPWM            -4.219e+02  3.533e+01 -11.942  < 2e-16 ***
## destRDU            -4.004e+02  3.197e+01 -12.525  < 2e-16 ***
## destRIC            -4.331e+02  3.517e+01 -12.313  < 2e-16 ***
## destROC            -4.336e+02  3.575e+01 -12.130  < 2e-16 ***
## destRSW            -2.360e+02  1.738e+01 -13.581  < 2e-16 ***
## destSAN             2.129e+02  1.451e+01  14.679  < 2e-16 ***
## destSAT            -8.140e+01  6.662e+00 -12.219  < 2e-16 ***
## destSAV            -3.299e+02  2.547e+01 -12.950  < 2e-16 ***
## destSDF            -3.533e+02  2.694e+01 -13.114  < 2e-16 ***
## destSEA             1.981e+02  1.397e+01  14.176  < 2e-16 ***
## destSFO             2.575e+02  1.761e+01  14.621  < 2e-16 ***
## destSJC             2.580e+02  1.744e+01  14.792  < 2e-16 ***
## destSJU            -5.704e+01  6.067e+00  -9.401  < 2e-16 ***
## destSLC             5.300e+01  5.047e+00  10.501  < 2e-16 ***
## destSMF             2.429e+02  1.648e+01  14.744  < 2e-16 ***
## destSNA             2.072e+02  1.476e+01  14.034  < 2e-16 ***
## destSRQ            -2.423e+02  1.817e+01 -13.335  < 2e-16 ***
## destSTL            -2.921e+02  2.170e+01 -13.462  < 2e-16 ***
## destSTT            -5.243e+01  5.892e+00  -8.898  < 2e-16 ***
## destSYR            -4.392e+02  3.696e+01 -11.884  < 2e-16 ***
## destTPA            -2.514e+02  1.894e+01 -13.274  < 2e-16 ***
## destTUL            -1.834e+02  1.435e+01 -12.782  < 2e-16 ***
## destTVC            -3.542e+02  2.759e+01 -12.838  < 2e-16 ***
## destTYS            -3.538e+02  2.717e+01 -13.023  < 2e-16 ***
## destXNA            -2.093e+02  1.596e+01 -13.113  < 2e-16 ***
## dep_delay           1.014e+00  2.110e-03 480.824  < 2e-16 ***
## air_time            4.905e+02  4.698e+00 104.397  < 2e-16 ***
## originJFK          -1.555e+00  4.118e-01  -3.777 0.000159 ***
## originLGA           2.105e+00  5.298e-01   3.974 7.08e-05 ***
## distance           -3.933e-01  2.296e-02 -17.132  < 2e-16 ***
## originJFK:distance  4.152e-03  2.970e-04  13.981  < 2e-16 ***
## originLGA:distance  2.990e-03  5.938e-04   5.036 4.78e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.33 on 32626 degrees of freedom
## Multiple R-squared:  0.8827, Adjusted R-squared:  0.8823 
## F-statistic:  2272 on 108 and 32626 DF,  p-value: < 2.2e-16
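
Reading the interaction terms: `originJFK` and `originLGA` appear in the table, so EWR is the baseline origin, and `distance` is the slope for EWR flights. The interaction coefficients are added to that baseline to get the slope for the other two airports. The `dep_delay` estimate of 1.014 means each extra minute of departure delay is associated with about 1.014 extra minutes of arrival delay, holding the other terms fixed. A small sketch of the slope arithmetic, with the coefficients copied by hand from the summary output:

```r
# Coefficients copied from the summary() output above.
b_distance     <- -3.933e-01   # distance slope for the baseline origin (EWR)
b_jfk_distance <-  4.152e-03   # originJFK:distance
b_lga_distance <-  2.990e-03   # originLGA:distance

# Slope of arr_delay on distance for each origin airport.
slopes <- c(
  EWR = b_distance,
  JFK = b_distance + b_jfk_distance,
  LGA = b_distance + b_lga_distance
)
print(round(slopes, 4))
```

These slopes should be read with caution. The model also has one coefficient per destination, and distance is likely almost fully determined by the origin and destination pair, so the distance and `dest` estimates may be hard to separate.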

Residual Analysis

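Based on the residual analysis, we determine that a linear model is not appropriate for these data. The fit looks strong on summary measures (adjusted R-squared of 0.8823, overall F-test p-value below 2.2e-16), but the residuals violate two assumptions of linear regression: their variability is not constant, and they are not normally distributed. The residual summary already suggests right skew, with a minimum of -65.063 against a maximum of 144.477 and a median of -1.762.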

hist(nycflights.lm$residuals)

plot(nycflights.lm)
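
The visual checks above can be complemented with numeric ones. The sketch below is a minimal illustration on synthetic data (the variables `x` and `y` and the error structure are made up), showing a normality test on the residuals and a rough comparison of residual spread across fitted values:

```r
set.seed(7)
n <- 2000
x <- runif(n, 0, 10)

# Synthetic right-skewed errors whose spread grows with x (made-up data).
y <- 1 + 2 * x + (rexp(n) - 1) * (0.5 + 0.5 * x)
fit <- lm(y ~ x)
res <- resid(fit)

# Normality: shapiro.test() accepts at most 5000 values, so subsample if needed.
sw <- shapiro.test(if (length(res) > 5000) sample(res, 5000) else res)

# Constant variance: compare residual spread in the lower and upper halves
# of the fitted values; a ratio far from 1 points to heteroscedasticity.
upper <- fitted(fit) > median(fitted(fit))
spread_ratio <- sd(res[upper]) / sd(res[!upper])

print(sw$p.value)
print(spread_ratio)
```

For the flight model, the same idea would apply to `resid(nycflights.lm)`, which has more than 5000 values and so would need the subsampling step. `plot(nycflights.lm, which = 1)` and `plot(nycflights.lm, which = 2)` isolate the residuals-versus-fitted and normal Q-Q panels, and `ncvTest()` from the already loaded `car` package is one option for a formal test of non-constant variance.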