MPG Car Data Multiple Linear Regression

Irving Cedillo

Import data

Note to reader: The number of cylinders an engine has determine its performance and efficiency; Engine displacement is the amount of air and fuel able to be moved by an engine (total cylinder capacity); Horsepower is an engines power output

'data.frame':   398 obs. of  10 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ model_year  : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ car_make    : Factor w/ 30 levels "amc","audi","bmw",..: 7 4 22 1 12 12 7 22 23 1 ...
 $ car_model   : Factor w/ 300 levels "","'cuda 340",..: 95 271 263 253 283 168 192 166 79 58 ...

EDA

Null values

Horsepower was originally of data type char even though its entries were numerical. Clearer investigation discovered that NULL values were stored as ‘?’ rather than true NA which caused confusion. 6 rows were removed since they accounted for less than 2% of data.

Data Types

Original column car name was removed to create two new columns : car_make & car_model. However, car_model was dropped in the beginning stages since it had ~300 levels and would not make sense for the data set of 399 observations. The column car_make had many spelling errors and inconsistencies, like Toyota:Touyota and vw:volkswagon. Due to the high number of errors (unreliability) and high level of Factor levels this column was also removed.

Finally, though origin and cylinders were originally both integers, I have decided to change them both into Factors. The column origin represents the continent of origin of a vehicle and it is nominal not continuous. Car origin can fall into three groups: 1 (US), 2 (Europe), 3 (Asia), for their country/continent of origin. The column cylinders being the number of cylinders in an engine is also an ordered discrete number and not continuous. Finally, model_year was left as an integer type, with the reasoning being if the model were to be used in the future the addition of more years would not create more factors.

'data.frame':   392 obs. of  8 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ model_year  : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "na.action")= 'omit' Named int [1:6] 33 127 331 337 355 375
  ..- attr(*, "names")= chr [1:6] "33" "127" "331" "337" ...

      mpg        cylinders  displacement     horsepower        weight    
 Min.   : 9.00   3:  4     Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   4:199     1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   5:  3     Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   6: 83     Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   8:103     3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60             Max.   :455.0   Max.   :230.0   Max.   :5140  
  acceleration     model_year    origin 
 Min.   : 8.00   Min.   :70.00   1:245  
 1st Qu.:13.78   1st Qu.:73.00   2: 68  
 Median :15.50   Median :76.00   3: 79  
 Mean   :15.54   Mean   :75.98          
 3rd Qu.:17.02   3rd Qu.:79.00          
 Max.   :24.80   Max.   :82.00

             vars   n    mean     sd  median trimmed    mad  min    max  range
mpg             1 392   23.45   7.81   22.75   22.99   8.60    9   46.6   37.6
cylinders*      2 392    3.21   1.33    2.00    3.15   0.00    1    5.0    4.0
displacement    3 392  194.41 104.64  151.00  183.83  90.44   68  455.0  387.0
horsepower      4 392  104.47  38.49   93.50   99.82  28.91   46  230.0  184.0
weight          5 392 2977.58 849.40 2803.50 2916.94 948.12 1613 5140.0 3527.0
acceleration    6 392   15.54   2.76   15.50   15.48   2.52    8   24.8   16.8
model_year      7 392   75.98   3.68   76.00   75.97   4.45   70   82.0   12.0
origin*         8 392    1.58   0.81    1.00    1.47   0.00    1    3.0    2.0
             skew kurtosis    se
mpg          0.45    -0.54  0.39
cylinders*   0.26    -1.69  0.07
displacement 0.70    -0.79  5.29
horsepower   1.08     0.65  1.94
weight       0.52    -0.83 42.90
acceleration 0.29     0.41  0.14
model_year   0.02    -1.18  0.19
origin*      0.91    -0.86  0.04

Descriptive Statistics

Horsepower is rightwarddly skewed with value of 1.08 and Displacement has moderate skew with a value of .70. Skews can cause problems with linear model assumptions like normality of residuals.

Correlation plots

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

The ggpairs demonstrate high level of correlation existing within the plots. The predictors horsepower, weight,displacement and accelaration are all highly correlated to each other, which affects the multicolinearity assumption. The response variable mpg hints at a nonlinear pattern with displacement, horsepower and weight forming a banana shaped pattern. The relationship between cylinders and displacement& horsepower align with real world scenario of, the more cylinders an engine has the more displacement (capacity) it has and therefore the more horsepower it has.

ANOVA

The relationship between origin, displacement and weight suggests that cars from the region 1: US tend to be heavier with more displacement (capacity). Therefore one can hypothesize that horsepower and acceleration would follow suit. When testing whether horsepower and acceleration are significantly different between origins we find that:

             Df Sum Sq Mean Sq F value Pr(>F)    
origin        2 138895   69447   61.34 <2e-16 ***
Residuals   389 440399    1132                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[1] "TUKEY"

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = horsepower ~ origin, data = df_clean)

$origin
           diff       lwr       upr     p adj
2-1 -38.4901561 -49.34074 -27.63957 0.0000000
3-1 -39.2135366 -49.45576 -28.97131 0.0000000
3-2  -0.7233805 -13.81849  12.37173 0.9907313

The mean horsepower differs significantly across origin groups (\(p=2e-16\)). Origin 1 (US) has significantly higher mean horsepower compared to the other regions.

             Df Sum Sq Mean Sq F value   Pr(>F)    
origin        2  212.6   106.3   14.96 5.49e-07 ***
Residuals   389 2763.5     7.1                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = acceleration ~ origin, data = df_clean)

$origin
          diff        lwr       upr     p adj
2-1  1.8039136  0.9443935 2.6634336 0.0000035
3-1  1.1819478  0.3706183 1.9932773 0.0019458
3-2 -0.6219657 -1.6592840 0.4153525 0.3363574

The mean acceleration differs significantly across origin groups (\(p=5.49e-07\)). Origin 1 (US) has the lowest mean acceleration.

MPG vs Numerical

`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

The individual plots further confirm earlier suspicion of skewness with a long tail. The non linear pattern is also more easily seen on the right end for horsepower. Since the relationship may be non-linear using ONLY a linear regression model without transformations may underestimate/overestimate at the extreme ends.

MPG vs Categorical

             Df Sum Sq Mean Sq F value Pr(>F)    
origin        2   7904    3952    96.6 <2e-16 ***
Residuals   389  15915      41                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = mpg ~ origin, data = df_clean)

$origin
         diff       lwr       upr     p adj
2-1  7.569472 5.5068042  9.632139 0.0000000
3-1 10.417164 8.4701431 12.364184 0.0000000
3-2  2.847692 0.3583458  5.337038 0.0202502

From the box plots, visually origin seems to be well separated with each region having distinct means.

The mean mpg differs significantly across origin groups (\(p=<2e-16\)). Specifically origin 3 (Asia) has statistically significant higher mean mpg compared to the other regions.

             Df Sum Sq Mean Sq F value Pr(>F)    
cylinders     4  15275    3819     173 <2e-16 ***
Residuals   387   8544      22                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = mpg ~ cylinders, data = df_clean)

$cylinders
          diff        lwr         upr     p adj
4-3   8.733920   2.230560  15.2372794 0.0024534
5-3   6.816667  -3.019020  16.6523535 0.3193286
6-3  -0.576506  -7.168805   6.0157927 0.9992685
8-3  -5.586893 -12.149699   0.9759130 0.1366880
5-4  -1.917253  -9.408167   5.5736612 0.9560878
6-4  -9.310426 -10.993120  -7.6277312 0.0000000
8-4 -14.320813 -15.883977 -12.7576485 0.0000000
6-5  -7.393173 -14.961429   0.1750839 0.0592140
8-5 -12.403560 -19.946141  -4.8609787 0.0000851
8-6  -5.010387  -6.909913  -3.1108618 0.0000000

From the box plots, visually cylinders seems to be well separated with each region having distinct means.

The mean mpg differs significantly across cylinder groups (\(p=<2e-16\))

It is important to note that though the difference of means is statistically significant, one must also be aware that the cylinder group is not well balanced. The group with 4 cylinders is overwhelmingly represented with a count of 199.

Full Model


Call:
lm(formula = mpg ~ ., data = df_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6797 -1.9373 -0.0678  1.6711 12.7756 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -2.208e+01  4.541e+00  -4.862 1.70e-06 ***
cylinders4    6.722e+00  1.654e+00   4.064 5.85e-05 ***
cylinders5    7.078e+00  2.516e+00   2.813  0.00516 ** 
cylinders6    3.351e+00  1.824e+00   1.837  0.06701 .  
cylinders8    5.099e+00  2.109e+00   2.418  0.01607 *  
displacement  1.870e-02  7.222e-03   2.590  0.00997 ** 
horsepower   -3.490e-02  1.323e-02  -2.639  0.00866 ** 
weight       -5.780e-03  6.315e-04  -9.154  < 2e-16 ***
acceleration  2.598e-02  9.304e-02   0.279  0.78021    
model_year    7.370e-01  4.892e-02  15.064  < 2e-16 ***
origin2       1.764e+00  5.513e-01   3.200  0.00149 ** 
origin3       2.617e+00  5.272e-01   4.964 1.04e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.098 on 380 degrees of freedom
Multiple R-squared:  0.8469,    Adjusted R-squared:  0.8425 
F-statistic: 191.1 on 11 and 380 DF,  p-value: < 2.2e-16

Before checking assumptions the full model performance is good

\(R^2\approx.84\) meaning the model can explain roughly \(84\%\) of variance
Residual standard error is small 3.098
F-Statistic is large \(191.1\) resulting in a significant p.val \(<2.2e-16\)

Assumptions:

Inflated VIF points to multicolinearity
Heteroscedasticity with slight funnel shape
Violation of normal residuals
Relatively flat line for linear assumption

Coefficient interpretation:

P value of displacement\(=.0097<.05\) meaning it is statistically significant. While holding all other variables constant, one unit increase in displacement results in \(.01870\) increase in mpg.
P value of horsepower\(=0.008667<.05\) meaning it is statistically significant. While holding all other variables constant, one unit increase in horsepower results in \(.0349\) decrease in mpg.
P value of weight\(< 2e-16<<.05\) meaning it is statistically significant. While holding all other variables constant, one unit increase in weight results in \(.005780\) decrease in mpg.
P value of acceleration\(=0.78021 >.05\) meaning it is NOT statistically significant. While holding all other variables constant, one unit increase in acceleration results in \(.02598\) increase in mpg.
P value of model_year\(< 2e-16<<.05\) meaning it is statistically significant. While holding all other variables constant, one unit increase in model_year results in \(.737\) increase in mpg.
P value of origin2\(=0.00149 <.05\) meaning it is statistically significant. While holding all other variables constant, when compared to the baseline origin1, origin2 increases mpg by \(1.764\)
P value of origin3\(=1.04e-06<<.05\) meaning it is statistically significant. While holding all other variables constant, when compared to the baseline origin1, origin3 increases mpg by \(2.617\)
P value of cylinders4\(=5.85e-05<<.05\) meaning it is statistically significant. While holding all other variables constant, when compared to the baseline cylinders3, cylinders4 increases mpg by \(6.722\)
P value of cylinders5\(=0.00516<.05\) meaning it is statistically significant. While holding all other variables constant, when compared to the baseline cylinders3, cylinders5 increases mpg by \(7.078\)
P value of cylinders6\(=0.06701>.05\) meaning it is NOT statistically significant. While holding all other variables constant, when compared to the baseline cylinders3, cylinders6 increases mpg by \(3.351\)
P value of cylinders8\(=0.01607<.05\) meaning it is statistically significant. While holding all other variables constant, when compared to the baseline cylinders3, cylinders8 increases mpg by \(5.099\)

Intuition

Multicolinearity must be present since displacement does not decrease as intended. Since p.value was significant, displacement results in \(.01870\) increase in mpg, however from correlation plot displacement and mpg are negatively correlated. Being multicolinear, the displacement \(\beta\) coefficient is inflated by the other columns it is colinear with. The logic follows that:

A heavier car has more horsepower

\(\implies\) increased horsepower and weight increases displacement

\(\therefore\) displacement results in \(.01870\) increase in mpg in the full model,

\(\unicode{x21af}\) since there exists a negative correlation of \(.81\) between displacement and mpg.

VIF further confirms the muticolinearity of the full model between weight, displacement and horsepower.

                  GVIF Df GVIF^(1/(2*Df))
cylinders    15.521205  4        1.408853
displacement 23.271418  1        4.824046
horsepower   10.559221  1        3.249495
weight       11.723047  1        3.423894
acceleration  2.684794  1        1.638534
model_year    1.323483  1        1.150427
origin        2.344607  2        1.237421

Partial Model

Partial model 1 (Fail)

The first partial model was done by taking all the significant variables in the full model. Since VIF demonstrated high multicolinearity I removed displacement, and since cylinders was heavily unbalanced I removed it as well.

\(PM1:\)mpg ~ horsepower + weight + model_year +origin


Call:
lm(formula = mpg ~ horsepower + weight + model_year + origin, 
    data = df_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-9.553 -2.110 -0.027  1.795 13.476 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.722e+01  4.193e+00  -4.107 4.89e-05 ***
horsepower  -8.480e-03  9.343e-03  -0.908  0.36463    
weight      -5.564e-03  4.408e-04 -12.622  < 2e-16 ***
model_year   7.544e-01  5.156e-02  14.631  < 2e-16 ***
origin2      1.955e+00  5.186e-01   3.769  0.00019 ***
origin3      2.283e+00  5.243e-01   4.353 1.72e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.338 on 386 degrees of freedom
Multiple R-squared:  0.8194,    Adjusted R-squared:  0.8171 
F-statistic: 350.3 on 5 and 386 DF,  p-value: < 2.2e-16

[1] "VIF"

               GVIF Df GVIF^(1/(2*Df))
horsepower 4.537936  1        2.130243
weight     4.919323  1        2.217955
model_year 1.266033  1        1.125181
origin     1.668152  2        1.136472

[1] "Model Fit"

Analysis of Variance Table

Model 1: mpg ~ cylinders + displacement + horsepower + weight + acceleration + 
    model_year + origin
Model 2: mpg ~ horsepower + weight + model_year + origin
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    380 3646.3                                  
2    386 4301.2 -6   -654.92 11.375 1.055e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fit:

The smaller model has an \(Adj.R^2 =0.8171\) which is very close to the full model value of \(.8425\)
Variable horsepower is not significant with p. value \(0.36463>.05\)
Nested partial F test reveals that variables cylinders, displacement, acceleration contain information that helps predict mpg and their removal increased RSS of the partial model to \(4301.2\)

Assumptions:

VIF values have decreased, pointing at the issue of multicolinearity handled
Heteroscedasticity is observed with funnel shape
Violation of normal residuals
Non flat line for linear assumption
Residuals are not normal

Partial model 2 (Fail)

I removed horsepower since it was not significant.

\(PM2:\)mpg ~ weight + model_year +origin

[1] "Partial Model 2 Summary"


Call:
lm(formula = mpg ~ weight + model_year + origin, data = df_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.6025 -2.1132 -0.0206  1.7617 13.5261 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.831e+01  4.017e+00  -4.557 6.96e-06 ***
weight      -5.887e-03  2.599e-04 -22.647  < 2e-16 ***
model_year   7.698e-01  4.867e-02  15.818  < 2e-16 ***
origin2      1.976e+00  5.180e-01   3.815 0.000158 ***
origin3      2.215e+00  5.188e-01   4.268 2.48e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.337 on 387 degrees of freedom
Multiple R-squared:  0.819, Adjusted R-squared:  0.8172 
F-statistic: 437.9 on 4 and 387 DF,  p-value: < 2.2e-16

[1] "VIF"

               GVIF Df GVIF^(1/(2*Df))
weight     1.711383  1        1.308199
model_year 1.128435  1        1.062278
origin     1.612360  2        1.126848

[1] "Model Fit"

Analysis of Variance Table

Model 1: mpg ~ cylinders + displacement + horsepower + weight + acceleration + 
    model_year + origin
Model 2: mpg ~ weight + model_year + origin
  Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
1    380 3646.3                                 
2    387 4310.4 -7    -664.1 9.887 2.397e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fit:

The smaller model has an \(Adj.R^2 =0.8172\) which is very close to the full model value of \(.8425\), (note partial model 1 had a value of \(0.8171\))
All variables are significant, with weight, showing a negative \(\beta\) coefficients.
- With all values held constant one unit increase in weight results in \(.005887\) decrease in mpg
Nested partial F test reveals that variables horsepower,cylinders, displacement, acceleration contain information that helps predict mpg and their removal increased RSS of the partial model to \(4310.4\)

Assumptions:

VIF values have decreased, pointing at the issue of multicolinearity handled
Heteroscedasticity with funnel shape
Violation of normal residuals
Non flat line hinting at nonlinear pattern
Residuals are not normal

Log model (Pass)

Model 2, though it failed many assumptions was mostly due to nonlinear interactions, not because of the selected variables. I tested a logarithmic operation of mpg to see if the model performance of assumptions improved.

\(PM2:\)log(mpg) ~ weight + model_year +origin

[1] "Log Model of PM2"


Call:
lm(formula = log(mpg) ~ weight + model_year + origin, data = df_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42352 -0.07273  0.01168  0.07197  0.38176 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.504e+00  1.448e-01  10.386  < 2e-16 ***
weight      -2.864e-04  9.370e-06 -30.565  < 2e-16 ***
model_year   3.189e-02  1.754e-03  18.180  < 2e-16 ***
origin2      7.031e-02  1.867e-02   3.766 0.000192 ***
origin3      5.750e-02  1.870e-02   3.074 0.002259 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1203 on 387 degrees of freedom
Multiple R-squared:  0.8761,    Adjusted R-squared:  0.8748 
F-statistic: 684.2 on 4 and 387 DF,  p-value: < 2.2e-16

[1] "VIF"

               GVIF Df GVIF^(1/(2*Df))
weight     1.711383  1        1.308199
model_year 1.128435  1        1.062278
origin     1.612360  2        1.126848

[1] "Model Fit"

            df       AIC
full_model  13 2012.6901
log_model_2  6 -540.9114

Fit:

The log model has an \(Adj.R^2 =0.8748\) which is an improvement when compared to the full model value of \(.8425\), and the highest value so far
All variables are significant
Nested partial F test cannot be done to compare since they are not nested models, but from AIC comparison is a huge indicator of how well the logarithmic model compares to the full model

Assumptions:

VIF values are stable, pointing at the issue of multicolinearity handled
Relatively flat line across residuals vs fitted
Homoscedasticity accomplished with equal spread of variance amongst residuals
Normalcy of residuals greatly improved but lower left has shown that extreme values are not reliable
Some points are near Cook’s distance and can be considered influential points

Interaction model (Pass)

Since there were few significant variables I identified from the log model, I wanted to see if there were interaction between the remaining variables: weight,model_year,origin. What I tested first was whether the interaction between the two continuous variables is significant. An interaction between weight*model_year has been shown to be significant. Then I tested whether there was interaction between continuous and categorical variables. The first plot demonstrated that mpg over years by origin groups, and this plot shows parallel lines. The second plot demonstrates mpg over weight by origin, and this plot demonstrates interaction of the lines.

[1] "Test interaction term between continous variables"


Call:
lm(formula = log(mpg) ~ weight * model_year, data = df_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43841 -0.07020  0.00361  0.06614  0.37638 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        3.660e-01  4.931e-01   0.742  0.45839    
weight             1.444e-04  1.681e-04   0.859  0.39080    
model_year         4.830e-02  6.543e-03   7.381 9.65e-13 ***
weight:model_year -6.069e-06  2.250e-06  -2.698  0.00728 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1216 on 388 degrees of freedom
Multiple R-squared:  0.8731,    Adjusted R-squared:  0.8721 
F-statistic: 889.7 on 3 and 388 DF,  p-value: < 2.2e-16

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

The interaction models were created:

inter_model1: log(mpg) ~ weight*origin
inter_model2: log(mpg) ~ weight*model_year
inter_model3: log(mpg) ~ weight*model_year + weight*origin

From the AIC analysis is has been shown that model 3 and model 2 significantly outperform model 1 and the full model. Closer inspection of model fit shows that both show relatively similar \(Adj.R^2\) values with model 2 having: \(0.8721\) and model 3: \(0.8777\). Following the principal of parsimony I decided to investigate further on model 2’s assumptions.

                   df       AIC
full_model         13 2012.6901
interaction_model1  7 -300.0444
interaction_model2  5 -533.4393
interaction_model3  9 -546.9564


Call:
lm(formula = log(mpg) ~ weight * model_year, data = df_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43841 -0.07020  0.00361  0.06614  0.37638 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        3.660e-01  4.931e-01   0.742  0.45839    
weight             1.444e-04  1.681e-04   0.859  0.39080    
model_year         4.830e-02  6.543e-03   7.381 9.65e-13 ***
weight:model_year -6.069e-06  2.250e-06  -2.698  0.00728 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1216 on 388 degrees of freedom
Multiple R-squared:  0.8731,    Adjusted R-squared:  0.8721 
F-statistic: 889.7 on 3 and 388 DF,  p-value: < 2.2e-16


Call:
lm(formula = log(mpg) ~ weight * model_year + weight * origin, 
    data = df_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.41810 -0.06971  0.00381  0.07508  0.39120 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        3.845e-01  4.963e-01   0.775  0.43897    
weight             9.493e-05  1.701e-04   0.558  0.57719    
model_year         4.657e-02  6.625e-03   7.029 9.53e-12 ***
origin2            1.290e-01  8.304e-02   1.554  0.12101    
origin3            2.683e-01  1.024e-01   2.620  0.00914 ** 
weight:model_year -5.029e-06  2.293e-06  -2.193  0.02890 *  
weight:origin2    -2.073e-05  3.195e-05  -0.649  0.51682    
weight:origin3    -9.446e-05  4.393e-05  -2.150  0.03215 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1189 on 384 degrees of freedom
Multiple R-squared:  0.8799,    Adjusted R-squared:  0.8777 
F-statistic: 401.8 on 7 and 384 DF,  p-value: < 2.2e-16

[1] "Assumptions Interaction Model 2"

Fit:

The log model has an \(Adj.R^2 =0.8721\) which is an improvement when compared to the full model value of \(.8425\) adn relatively close to model 3 where \(Adj.R^2 =0.8777\)
Weight was the only variables who’s beta coefficient was not significant
Nested partial F test cannot be done to compare since they are not nested models, but from AIC interaction model 2 may not performed the best, but was close enough to model 3 in terms of \(Adj.R^2\).

Assumptions:

VIF not advised for since interaction terms will have multicolinearity with main effects
Relatively flat line across residuals vs fitted
Homoscedasticity accomplished with equal spread of variance amongst residuals
Normalcy of residuals still has the same issue with extremes as the logarithmic model.
Some points are near Cook’s distance and can be considered influential points

Box-Cox Model (Pass)

Since logarithmic scale improved fit, an optimal power must be investigated. The Box-Cox method revealed that teh optimal power to partial_model_2 <- lm(mpg ~ weight + model_year +origin, data= df_clean) (what the log model was based on) is to be \(-0.29\) with a \(95%\) confidence interval it lies between \([-.44,-.12]\). For simplicity I will use \(\lambda = -0.25\).

Warning: package 'MASS' was built under R version 4.5.2


Attaching package: 'MASS'

The following object is masked from 'package:patchwork':

    area

The following object is masked from 'package:dplyr':

    select

                 [,1]
Original 9.5955611307
BoxCox   0.0001866694

BC model


Call:
lm(formula = I(mpg^(-0.25)) ~ weight + model_year + origin, data = df_clean)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.041526 -0.008113 -0.001365  0.008005  0.051565 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.411e-01  1.645e-02  38.983  < 2e-16 ***
weight       3.398e-05  1.064e-06  31.934  < 2e-16 ***
model_year  -3.650e-03  1.992e-04 -18.317  < 2e-16 ***
origin2     -7.656e-03  2.121e-03  -3.610 0.000346 ***
origin3     -5.455e-03  2.124e-03  -2.569 0.010587 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01366 on 387 degrees of freedom
Multiple R-squared:  0.8818,    Adjusted R-squared:  0.8806 
F-statistic: 722.1 on 4 and 387 DF,  p-value: < 2.2e-16

               GVIF Df GVIF^(1/(2*Df))
weight     1.711383  1        1.308199
model_year 1.128435  1        1.062278
origin     1.612360  2        1.126848

           df       AIC
full_model 13  2012.690
bc_model_2  6 -2246.364

Fit:

The log model has an \(Adj.R^2 =0.88\) which is an improvement when compared to the full model value of \(.8425\), and even higher than the previous logarithmic model.
All variables are significant
Nested partial F test cannot be done to compare since they are not nested models, but from AIC comparison the box-cox model is a significant improvement

Assumptions:

VIF values are stable, pointing at the issue of multicolinearity handled
Relatively flat line across residuals vs fitted, but a slight dip on the right half of the plot
Homoscedasticity accomplished with equal spread of variance amongst residuals
Normalcy of residuals greatly improved on the lower end, but extreme values on the right are still a problem
Some points are near Cook’s distance and can be considered influential points

Comparison Analysis

All three models performed well meeting all assumptions. Box-Cox model performed better when dealing with extreme values on the left end of the spectrum. This could explain why it has a significantly higher AIC score compared to the other models.

\[Box\_Cox: mpg^{-\frac{1}{4}} = weight + model\_year +origin\]

                   df        AIC
log_model_2         6  -540.9114
bc_model_2          6 -2246.3636
interaction_model2  5  -533.4393
full_model         13  2012.6901

Outlier Analysis

Cooks distance

    Obs CooksD Flag_gt1 Flag_gt4n
112 111 0.0370    FALSE      TRUE
276 274 0.0304    FALSE      TRUE
278 276 0.0296    FALSE      TRUE
328 326 0.0291    FALSE      TRUE
388 382 0.0230    FALSE      TRUE
45   44 0.0226    FALSE      TRUE
327 325 0.0200    FALSE      TRUE
29   29 0.0182    FALSE      TRUE
329 327 0.0171    FALSE      TRUE
323 321 0.0164    FALSE      TRUE

Further analysis of Cook’s distance for the Box-Cox model demonstrates a clear separation between a small subset of 4 highly influential points. Observations 111,274,276 and 326 are a small group that are highly separated (\(CD > .028\)) from the next highest distance observation 382 with a distance value of \(0.0230\).

Influence Point Analysis

There were 12 points identified who met at least 4 of the flags and are classified as potential extreme influential points.

[1] "n-rows: 392"

[1] "p-coefs: 5"

[1] "hat_cut: 0.0255102040816327"

[1] "cook_cut: 0.0102040816326531"

[1] "dffits_cut: 0.225876975726313"

[1] "dfbetas_cut: 0.101015254455221"

[1] "covratio_lo: 0.961734693877551"

[1] "covratio_hi: 1.03826530612245"

    obs    Hat CooksD  DFFITS DFBETA_Intercept DFBETA_nprod DFBETA_distance
112 111 0.0172 0.0370  0.4358           0.2222      -0.0849         -0.2224
276 274 0.0199 0.0304  0.3932          -0.1646       0.1795          0.1369
278 276 0.0231 0.0296  0.3871          -0.1759       0.2178          0.1401
328 326 0.0217 0.0291 -0.3842           0.2101      -0.1505         -0.1935
388 382 0.0121 0.0230 -0.3428           0.2324       0.0060         -0.2667
323 321 0.0139 0.0164 -0.2879           0.0788      -0.0041         -0.0851
125 124 0.0055 0.0150  0.2781           0.1104       0.0456         -0.1176
365 360 0.0123 0.0136 -0.2619           0.2048      -0.1167         -0.2046
167 165 0.0045 0.0129  0.2583           0.0745      -0.0704         -0.0503
361 356 0.0264 0.0124 -0.2493           0.1574      -0.1199         -0.1433
55   54 0.0262 0.0121 -0.2462          -0.1736       0.1117          0.1629
298 296 0.0266 0.0116 -0.2415           0.1297      -0.1462         -0.1067
216 214 0.0051 0.0069  0.1869          -0.0431       0.0823          0.0372
    CovRatio HighHat HighCook HighDFFITS HighDFBETA HighCovRatio TotalFlags
112   0.8972       0        1          1          1            1          4
276   0.9375       0        1          1          1            1          4
278   0.9557       0        1          1          1            1          4
328   0.9507       0        1          1          1            1          4
388   0.9074       0        1          1          1            1          4
323   0.9528       0        1          1          1            1          4
125   0.8539       0        1          1          1            1          4
365   0.9552       0        1          1          1            1          4
167   0.8429       0        1          1          1            1          4
361   1.0101       1        1          1          1            0          4
55    1.0104       1        1          1          1            0          4
298   1.0125       1        1          1          1            0          4
216   0.9330       0        0          0          0            1          1

Residual Analysis

Further confirmation of extreme influence points based on residual analysis. The R-Stud. revealed there were six values that are ‘STRONG’, signifying that the p.value indicates there is statistical evidence with a confidence of \(95\%\) that these observations are influence points. The observations : \([165,124,111,382,274,214]\) were all identified and are almost all extreme values of Cook’s distance, observation \(214\) only raised one flag in the previous influential point analysis.

[1] "n-rows: 392"

[1] "p-coefs: 5"

[1] "df:n-p: 387"

[1] "MSE: 0.00018666939832121"

[1] "s:sqrt(MSE): 0.0136627009892338"

  obs r_rstudent   p_rstudent flag_rstudent
1 165   3.849587 0.0001384471        STRONG
2 124   3.723128 0.0002260632        STRONG
3 111   3.295953 0.0010716960        STRONG
4 382  -3.091661 0.0021349388        STRONG
5 274   2.758326 0.0060861713        STRONG
6 214   2.609730 0.0094138672        STRONG

Removal of Outliers/Influential Points

From the following output it can be shown that removal of all observations does not significanlty affect the \(\beta\) coefficient values.

           (Intercept) weight model_year origin2 origin3
All_Data        0.6411      0    -0.0036 -0.0077 -0.0055
Remove_165      0.6399      0    -0.0036 -0.0074 -0.0052
Remove_124      0.6393      0    -0.0036 -0.0075 -0.0053
Remove_111      0.6375      0    -0.0036 -0.0076 -0.0060
Remove_382      0.6373      0    -0.0036 -0.0078 -0.0057
Remove_274      0.6438      0    -0.0037 -0.0084 -0.0056
Remove_all      0.6438      0    -0.0037 -0.0084 -0.0056

Conclusion

The Box-Cox model for mpg was chosen using the following predictors : weight, model_year, origin. The optimal \(\lambda\) was not chosen but for interpretability the closest value of \(-\frac{1}{4}\) was used. Since removal of influential point is not necessary the final model has a low residual standard error of \(.01366\) and high \(R^2Adj.\) of \(88\%.\)

[1] "All data BC"


Call:
lm(formula = I(mpg^(-0.25)) ~ weight + model_year + origin, data = df_clean)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.041526 -0.008113 -0.001365  0.008005  0.051565 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.411e-01  1.645e-02  38.983  < 2e-16 ***
weight       3.398e-05  1.064e-06  31.934  < 2e-16 ***
model_year  -3.650e-03  1.992e-04 -18.317  < 2e-16 ***
origin2     -7.656e-03  2.121e-03  -3.610 0.000346 ***
origin3     -5.455e-03  2.124e-03  -2.569 0.010587 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01366 on 387 degrees of freedom
Multiple R-squared:  0.8818,    Adjusted R-squared:  0.8806 
F-statistic: 722.1 on 4 and 387 DF,  p-value: < 2.2e-16

[1] "274 Influential point removed BC"


Call:
lm(formula = I(mpg^(-0.25)) ~ weight + model_year + origin, data = df_clean[-c(274), 
    ])

Residuals:
      Min        1Q    Median        3Q       Max 
-0.041421 -0.008273 -0.001328  0.008019  0.051511 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.438e-01  1.634e-02  39.409  < 2e-16 ***
weight       3.379e-05  1.057e-06  31.958  < 2e-16 ***
model_year  -3.677e-03  1.978e-04 -18.587  < 2e-16 ***
origin2     -8.387e-03  2.119e-03  -3.957 9.02e-05 ***
origin3     -5.623e-03  2.107e-03  -2.669  0.00793 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01355 on 386 degrees of freedom
Multiple R-squared:  0.884, Adjusted R-squared:  0.8828 
F-statistic: 735.1 on 4 and 386 DF,  p-value: < 2.2e-16

Predictions (Appendix)

One of the removed observations at the beginning for having a NA value in horsepower I decided to test as a prediction. The observation point has weight=1835, model_year=80, origin=2 with an mpg=40.9. The second observation point has weight=2905, model_year=80, origin=1 with an mpg=23.6. Both mpg values lie within their respective prediction interval.

[1] "mpg1=40.9,   mpg2=23.6"

  weight model_year origin
1   1835         80      2
2   2905         80      1

[1] "Prediction Intervals MPG:"

       fit      lwr      upr
1 37.59884 49.65181 28.99117
2 24.85709 31.86824 19.67230