'data.frame': 398 obs. of 10 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
$ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ model_year : int 70 70 70 70 70 70 70 70 70 70 ...
$ origin : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ car_make : Factor w/ 30 levels "amc","audi","bmw",..: 7 4 22 1 12 12 7 22 23 1 ...
$ car_model : Factor w/ 300 levels "","'cuda 340",..: 95 271 263 253 283 168 192 166 79 58 ...
MPG Car Data Multiple Linear Regression
Irving Cedillo
Import data
Note to reader: The number of cylinders an engine has determine its performance and efficiency; Engine displacement is the amount of air and fuel able to be moved by an engine (total cylinder capacity); Horsepower is an engines power output
EDA
Null values
Horsepower was originally of data type char even though its entries were numerical. Clearer investigation discovered that NULL values were stored as ‘?’ rather than true NA which caused confusion. 6 rows were removed since they accounted for less than 2% of data.
Data Types
Original column car name was removed to create two new columns : car_make & car_model. However, car_model was dropped in the beginning stages since it had ~300 levels and would not make sense for the data set of 399 observations. The column car_make had many spelling errors and inconsistencies, like Toyota:Touyota and vw:volkswagon. Due to the high number of errors (unreliability) and high level of Factor levels this column was also removed.
Finally, though origin and cylinders were originally both integers, I have decided to change them both into Factors. The column origin represents the continent of origin of a vehicle and it is nominal not continuous. Car origin can fall into three groups: 1 (US), 2 (Europe), 3 (Asia), for their country/continent of origin. The column cylinders being the number of cylinders in an engine is also an ordered discrete number and not continuous. Finally, model_year was left as an integer type, with the reasoning being if the model were to be used in the future the addition of more years would not create more factors.
'data.frame': 392 obs. of 8 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
$ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ model_year : int 70 70 70 70 70 70 70 70 70 70 ...
$ origin : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "na.action")= 'omit' Named int [1:6] 33 127 331 337 355 375
..- attr(*, "names")= chr [1:6] "33" "127" "331" "337" ...
mpg cylinders displacement horsepower weight
Min. : 9.00 3: 4 Min. : 68.0 Min. : 46.0 Min. :1613
1st Qu.:17.00 4:199 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
Median :22.75 5: 3 Median :151.0 Median : 93.5 Median :2804
Mean :23.45 6: 83 Mean :194.4 Mean :104.5 Mean :2978
3rd Qu.:29.00 8:103 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
Max. :46.60 Max. :455.0 Max. :230.0 Max. :5140
acceleration model_year origin
Min. : 8.00 Min. :70.00 1:245
1st Qu.:13.78 1st Qu.:73.00 2: 68
Median :15.50 Median :76.00 3: 79
Mean :15.54 Mean :75.98
3rd Qu.:17.02 3rd Qu.:79.00
Max. :24.80 Max. :82.00
vars n mean sd median trimmed mad min max range
mpg 1 392 23.45 7.81 22.75 22.99 8.60 9 46.6 37.6
cylinders* 2 392 3.21 1.33 2.00 3.15 0.00 1 5.0 4.0
displacement 3 392 194.41 104.64 151.00 183.83 90.44 68 455.0 387.0
horsepower 4 392 104.47 38.49 93.50 99.82 28.91 46 230.0 184.0
weight 5 392 2977.58 849.40 2803.50 2916.94 948.12 1613 5140.0 3527.0
acceleration 6 392 15.54 2.76 15.50 15.48 2.52 8 24.8 16.8
model_year 7 392 75.98 3.68 76.00 75.97 4.45 70 82.0 12.0
origin* 8 392 1.58 0.81 1.00 1.47 0.00 1 3.0 2.0
skew kurtosis se
mpg 0.45 -0.54 0.39
cylinders* 0.26 -1.69 0.07
displacement 0.70 -0.79 5.29
horsepower 1.08 0.65 1.94
weight 0.52 -0.83 42.90
acceleration 0.29 0.41 0.14
model_year 0.02 -1.18 0.19
origin* 0.91 -0.86 0.04
Descriptive Statistics
Horsepower is rightwarddly skewed with value of 1.08 and Displacement has moderate skew with a value of .70. Skews can cause problems with linear model assumptions like normality of residuals.
Correlation plots
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
The ggpairs demonstrate high level of correlation existing within the plots. The predictors horsepower, weight,displacement and accelaration are all highly correlated to each other, which affects the multicolinearity assumption. The response variable mpg hints at a nonlinear pattern with displacement, horsepower and weight forming a banana shaped pattern. The relationship between cylinders and displacement& horsepower align with real world scenario of, the more cylinders an engine has the more displacement (capacity) it has and therefore the more horsepower it has.
ANOVA
The relationship between origin, displacement and weight suggests that cars from the region 1: US tend to be heavier with more displacement (capacity). Therefore one can hypothesize that horsepower and acceleration would follow suit. When testing whether horsepower and acceleration are significantly different between origins we find that:
Df Sum Sq Mean Sq F value Pr(>F)
origin 2 138895 69447 61.34 <2e-16 ***
Residuals 389 440399 1132
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] "TUKEY"
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = horsepower ~ origin, data = df_clean)
$origin
diff lwr upr p adj
2-1 -38.4901561 -49.34074 -27.63957 0.0000000
3-1 -39.2135366 -49.45576 -28.97131 0.0000000
3-2 -0.7233805 -13.81849 12.37173 0.9907313
- The mean
horsepowerdiffers significantly acrossorigingroups (\(p=2e-16\)).Origin1 (US) has significantly higher mean horsepower compared to the other regions.
Df Sum Sq Mean Sq F value Pr(>F)
origin 2 212.6 106.3 14.96 5.49e-07 ***
Residuals 389 2763.5 7.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = acceleration ~ origin, data = df_clean)
$origin
diff lwr upr p adj
2-1 1.8039136 0.9443935 2.6634336 0.0000035
3-1 1.1819478 0.3706183 1.9932773 0.0019458
3-2 -0.6219657 -1.6592840 0.4153525 0.3363574
- The mean
accelerationdiffers significantly acrossorigingroups (\(p=5.49e-07\)).Origin1 (US) has the lowest mean acceleration.
MPG vs Numerical
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
The individual plots further confirm earlier suspicion of skewness with a long tail. The non linear pattern is also more easily seen on the right end for horsepower. Since the relationship may be non-linear using ONLY a linear regression model without transformations may underestimate/overestimate at the extreme ends.
MPG vs Categorical
Df Sum Sq Mean Sq F value Pr(>F)
origin 2 7904 3952 96.6 <2e-16 ***
Residuals 389 15915 41
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = mpg ~ origin, data = df_clean)
$origin
diff lwr upr p adj
2-1 7.569472 5.5068042 9.632139 0.0000000
3-1 10.417164 8.4701431 12.364184 0.0000000
3-2 2.847692 0.3583458 5.337038 0.0202502
From the box plots, visually origin seems to be well separated with each region having distinct means.
- The mean
mpgdiffers significantly acrossorigingroups (\(p=<2e-16\)). Specificallyorigin3 (Asia) has statistically significant higher meanmpgcompared to the other regions.
Df Sum Sq Mean Sq F value Pr(>F)
cylinders 4 15275 3819 173 <2e-16 ***
Residuals 387 8544 22
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = mpg ~ cylinders, data = df_clean)
$cylinders
diff lwr upr p adj
4-3 8.733920 2.230560 15.2372794 0.0024534
5-3 6.816667 -3.019020 16.6523535 0.3193286
6-3 -0.576506 -7.168805 6.0157927 0.9992685
8-3 -5.586893 -12.149699 0.9759130 0.1366880
5-4 -1.917253 -9.408167 5.5736612 0.9560878
6-4 -9.310426 -10.993120 -7.6277312 0.0000000
8-4 -14.320813 -15.883977 -12.7576485 0.0000000
6-5 -7.393173 -14.961429 0.1750839 0.0592140
8-5 -12.403560 -19.946141 -4.8609787 0.0000851
8-6 -5.010387 -6.909913 -3.1108618 0.0000000
From the box plots, visually cylinders seems to be well separated with each region having distinct means.
- The mean
mpgdiffers significantly acrosscylindergroups (\(p=<2e-16\))
It is important to note that though the difference of means is statistically significant, one must also be aware that the cylinder group is not well balanced. The group with 4 cylinders is overwhelmingly represented with a count of 199.
Full Model
Call:
lm(formula = mpg ~ ., data = df_clean)
Residuals:
Min 1Q Median 3Q Max
-8.6797 -1.9373 -0.0678 1.6711 12.7756
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.208e+01 4.541e+00 -4.862 1.70e-06 ***
cylinders4 6.722e+00 1.654e+00 4.064 5.85e-05 ***
cylinders5 7.078e+00 2.516e+00 2.813 0.00516 **
cylinders6 3.351e+00 1.824e+00 1.837 0.06701 .
cylinders8 5.099e+00 2.109e+00 2.418 0.01607 *
displacement 1.870e-02 7.222e-03 2.590 0.00997 **
horsepower -3.490e-02 1.323e-02 -2.639 0.00866 **
weight -5.780e-03 6.315e-04 -9.154 < 2e-16 ***
acceleration 2.598e-02 9.304e-02 0.279 0.78021
model_year 7.370e-01 4.892e-02 15.064 < 2e-16 ***
origin2 1.764e+00 5.513e-01 3.200 0.00149 **
origin3 2.617e+00 5.272e-01 4.964 1.04e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.098 on 380 degrees of freedom
Multiple R-squared: 0.8469, Adjusted R-squared: 0.8425
F-statistic: 191.1 on 11 and 380 DF, p-value: < 2.2e-16
Before checking assumptions the full model performance is good
- \(R^2\approx.84\) meaning the model can explain roughly \(84\%\) of variance
- Residual standard error is small 3.098
- F-Statistic is large \(191.1\) resulting in a significant p.val \(<2.2e-16\)
Assumptions:
Inflated VIF points to multicolinearity
Heteroscedasticity with slight funnel shape
Violation of normal residuals
Relatively flat line for linear assumption
Coefficient interpretation:
- P value of
displacement\(=.0097<.05\) meaning it is statistically significant. While holding all other variables constant, one unit increase indisplacementresults in \(.01870\) increase inmpg. - P value of
horsepower\(=0.008667<.05\) meaning it is statistically significant. While holding all other variables constant, one unit increase inhorsepowerresults in \(.0349\) decrease inmpg. - P value of
weight\(< 2e-16<<.05\) meaning it is statistically significant. While holding all other variables constant, one unit increase inweightresults in \(.005780\) decrease inmpg. - P value of
acceleration\(=0.78021 >.05\) meaning it is NOT statistically significant. While holding all other variables constant, one unit increase inaccelerationresults in \(.02598\) increase inmpg. - P value of
model_year\(< 2e-16<<.05\) meaning it is statistically significant. While holding all other variables constant, one unit increase inmodel_yearresults in \(.737\) increase inmpg. - P value of
origin2\(=0.00149 <.05\) meaning it is statistically significant. While holding all other variables constant, when compared to the baselineorigin1,origin2increasesmpgby \(1.764\) - P value of
origin3\(=1.04e-06<<.05\) meaning it is statistically significant. While holding all other variables constant, when compared to the baselineorigin1,origin3increasesmpgby \(2.617\) - P value of
cylinders4\(=5.85e-05<<.05\) meaning it is statistically significant. While holding all other variables constant, when compared to the baselinecylinders3,cylinders4increasesmpgby \(6.722\) - P value of
cylinders5\(=0.00516<.05\) meaning it is statistically significant. While holding all other variables constant, when compared to the baselinecylinders3,cylinders5increasesmpgby \(7.078\) - P value of
cylinders6\(=0.06701>.05\) meaning it is NOT statistically significant. While holding all other variables constant, when compared to the baselinecylinders3,cylinders6increasesmpgby \(3.351\) - P value of
cylinders8\(=0.01607<.05\) meaning it is statistically significant. While holding all other variables constant, when compared to the baselinecylinders3,cylinders8increasesmpgby \(5.099\)
Intuition
Multicolinearity must be present since displacement does not decrease as intended. Since p.value was significant, displacement results in \(.01870\) increase in mpg, however from correlation plot displacement and mpg are negatively correlated. Being multicolinear, the displacement \(\beta\) coefficient is inflated by the other columns it is colinear with. The logic follows that:
A heavier car has more horsepower
\(\implies\) increased horsepower and weight increases displacement
\(\therefore\) displacement results in \(.01870\) increase in mpg in the full model,
\(\unicode{x21af}\) since there exists a negative correlation of \(.81\) between displacement and mpg.
VIF further confirms the muticolinearity of the full model between weight, displacement and horsepower.
GVIF Df GVIF^(1/(2*Df))
cylinders 15.521205 4 1.408853
displacement 23.271418 1 4.824046
horsepower 10.559221 1 3.249495
weight 11.723047 1 3.423894
acceleration 2.684794 1 1.638534
model_year 1.323483 1 1.150427
origin 2.344607 2 1.237421
Partial Model
Partial model 1 (Fail)
The first partial model was done by taking all the significant variables in the full model. Since VIF demonstrated high multicolinearity I removed displacement, and since cylinders was heavily unbalanced I removed it as well.
\(PM1:\)mpg ~ horsepower + weight + model_year +origin
Call:
lm(formula = mpg ~ horsepower + weight + model_year + origin,
data = df_clean)
Residuals:
Min 1Q Median 3Q Max
-9.553 -2.110 -0.027 1.795 13.476
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.722e+01 4.193e+00 -4.107 4.89e-05 ***
horsepower -8.480e-03 9.343e-03 -0.908 0.36463
weight -5.564e-03 4.408e-04 -12.622 < 2e-16 ***
model_year 7.544e-01 5.156e-02 14.631 < 2e-16 ***
origin2 1.955e+00 5.186e-01 3.769 0.00019 ***
origin3 2.283e+00 5.243e-01 4.353 1.72e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.338 on 386 degrees of freedom
Multiple R-squared: 0.8194, Adjusted R-squared: 0.8171
F-statistic: 350.3 on 5 and 386 DF, p-value: < 2.2e-16
[1] "VIF"
GVIF Df GVIF^(1/(2*Df))
horsepower 4.537936 1 2.130243
weight 4.919323 1 2.217955
model_year 1.266033 1 1.125181
origin 1.668152 2 1.136472
[1] "Model Fit"
Analysis of Variance Table
Model 1: mpg ~ cylinders + displacement + horsepower + weight + acceleration +
model_year + origin
Model 2: mpg ~ horsepower + weight + model_year + origin
Res.Df RSS Df Sum of Sq F Pr(>F)
1 380 3646.3
2 386 4301.2 -6 -654.92 11.375 1.055e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fit:
The smaller model has an \(Adj.R^2 =0.8171\) which is very close to the full model value of \(.8425\)
Variable
horsepoweris not significant with p. value \(0.36463>.05\)Nested partial F test reveals that variables
cylinders,displacement,accelerationcontain information that helps predictmpgand their removal increased RSS of the partial model to \(4301.2\)
Assumptions:
VIF values have decreased, pointing at the issue of multicolinearity handled
Heteroscedasticity is observed with funnel shape
Violation of normal residuals
Non flat line for linear assumption
Residuals are not normal
Partial model 2 (Fail)
I removed horsepower since it was not significant.
\(PM2:\)mpg ~ weight + model_year +origin
[1] "Partial Model 2 Summary"
Call:
lm(formula = mpg ~ weight + model_year + origin, data = df_clean)
Residuals:
Min 1Q Median 3Q Max
-9.6025 -2.1132 -0.0206 1.7617 13.5261
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.831e+01 4.017e+00 -4.557 6.96e-06 ***
weight -5.887e-03 2.599e-04 -22.647 < 2e-16 ***
model_year 7.698e-01 4.867e-02 15.818 < 2e-16 ***
origin2 1.976e+00 5.180e-01 3.815 0.000158 ***
origin3 2.215e+00 5.188e-01 4.268 2.48e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.337 on 387 degrees of freedom
Multiple R-squared: 0.819, Adjusted R-squared: 0.8172
F-statistic: 437.9 on 4 and 387 DF, p-value: < 2.2e-16
[1] "VIF"
GVIF Df GVIF^(1/(2*Df))
weight 1.711383 1 1.308199
model_year 1.128435 1 1.062278
origin 1.612360 2 1.126848
[1] "Model Fit"
Analysis of Variance Table
Model 1: mpg ~ cylinders + displacement + horsepower + weight + acceleration +
model_year + origin
Model 2: mpg ~ weight + model_year + origin
Res.Df RSS Df Sum of Sq F Pr(>F)
1 380 3646.3
2 387 4310.4 -7 -664.1 9.887 2.397e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fit:
The smaller model has an \(Adj.R^2 =0.8172\) which is very close to the full model value of \(.8425\), (note partial model 1 had a value of \(0.8171\))
All variables are significant, with
weight, showing a negative \(\beta\) coefficients.- With all values held constant one unit increase in weight results in \(.005887\) decrease in
mpg
- With all values held constant one unit increase in weight results in \(.005887\) decrease in
Nested partial F test reveals that variables
horsepower,cylinders,displacement,accelerationcontain information that helps predictmpgand their removal increased RSS of the partial model to \(4310.4\)
Assumptions:
VIF values have decreased, pointing at the issue of multicolinearity handled
Heteroscedasticity with funnel shape
Violation of normal residuals
Non flat line hinting at nonlinear pattern
Residuals are not normal
Log model (Pass)
Model 2, though it failed many assumptions was mostly due to nonlinear interactions, not because of the selected variables. I tested a logarithmic operation of mpg to see if the model performance of assumptions improved.
\(PM2:\)log(mpg) ~ weight + model_year +origin
[1] "Log Model of PM2"
Call:
lm(formula = log(mpg) ~ weight + model_year + origin, data = df_clean)
Residuals:
Min 1Q Median 3Q Max
-0.42352 -0.07273 0.01168 0.07197 0.38176
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.504e+00 1.448e-01 10.386 < 2e-16 ***
weight -2.864e-04 9.370e-06 -30.565 < 2e-16 ***
model_year 3.189e-02 1.754e-03 18.180 < 2e-16 ***
origin2 7.031e-02 1.867e-02 3.766 0.000192 ***
origin3 5.750e-02 1.870e-02 3.074 0.002259 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1203 on 387 degrees of freedom
Multiple R-squared: 0.8761, Adjusted R-squared: 0.8748
F-statistic: 684.2 on 4 and 387 DF, p-value: < 2.2e-16
[1] "VIF"
GVIF Df GVIF^(1/(2*Df))
weight 1.711383 1 1.308199
model_year 1.128435 1 1.062278
origin 1.612360 2 1.126848
[1] "Model Fit"
df AIC
full_model 13 2012.6901
log_model_2 6 -540.9114
Fit:
The log model has an \(Adj.R^2 =0.8748\) which is an improvement when compared to the full model value of \(.8425\), and the highest value so far
All variables are significant
Nested partial F test cannot be done to compare since they are not nested models, but from AIC comparison is a huge indicator of how well the logarithmic model compares to the full model
Assumptions:
VIF values are stable, pointing at the issue of multicolinearity handled
Relatively flat line across residuals vs fitted
Homoscedasticity accomplished with equal spread of variance amongst residuals
Normalcy of residuals greatly improved but lower left has shown that extreme values are not reliable
Some points are near Cook’s distance and can be considered influential points
Interaction model (Pass)
Since there were few significant variables I identified from the log model, I wanted to see if there were interaction between the remaining variables: weight,model_year,origin. What I tested first was whether the interaction between the two continuous variables is significant. An interaction between weight*model_year has been shown to be significant. Then I tested whether there was interaction between continuous and categorical variables. The first plot demonstrated that mpg over years by origin groups, and this plot shows parallel lines. The second plot demonstrates mpg over weight by origin, and this plot demonstrates interaction of the lines.
[1] "Test interaction term between continous variables"
Call:
lm(formula = log(mpg) ~ weight * model_year, data = df_clean)
Residuals:
Min 1Q Median 3Q Max
-0.43841 -0.07020 0.00361 0.06614 0.37638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.660e-01 4.931e-01 0.742 0.45839
weight 1.444e-04 1.681e-04 0.859 0.39080
model_year 4.830e-02 6.543e-03 7.381 9.65e-13 ***
weight:model_year -6.069e-06 2.250e-06 -2.698 0.00728 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1216 on 388 degrees of freedom
Multiple R-squared: 0.8731, Adjusted R-squared: 0.8721
F-statistic: 889.7 on 3 and 388 DF, p-value: < 2.2e-16
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
The interaction models were created:
inter_model1:
log(mpg) ~ weight*origininter_model2:
log(mpg) ~ weight*model_yearinter_model3:
log(mpg) ~ weight*model_year + weight*origin
From the AIC analysis is has been shown that model 3 and model 2 significantly outperform model 1 and the full model. Closer inspection of model fit shows that both show relatively similar \(Adj.R^2\) values with model 2 having: \(0.8721\) and model 3: \(0.8777\). Following the principal of parsimony I decided to investigate further on model 2’s assumptions.
df AIC
full_model 13 2012.6901
interaction_model1 7 -300.0444
interaction_model2 5 -533.4393
interaction_model3 9 -546.9564
Call:
lm(formula = log(mpg) ~ weight * model_year, data = df_clean)
Residuals:
Min 1Q Median 3Q Max
-0.43841 -0.07020 0.00361 0.06614 0.37638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.660e-01 4.931e-01 0.742 0.45839
weight 1.444e-04 1.681e-04 0.859 0.39080
model_year 4.830e-02 6.543e-03 7.381 9.65e-13 ***
weight:model_year -6.069e-06 2.250e-06 -2.698 0.00728 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1216 on 388 degrees of freedom
Multiple R-squared: 0.8731, Adjusted R-squared: 0.8721
F-statistic: 889.7 on 3 and 388 DF, p-value: < 2.2e-16
Call:
lm(formula = log(mpg) ~ weight * model_year + weight * origin,
data = df_clean)
Residuals:
Min 1Q Median 3Q Max
-0.41810 -0.06971 0.00381 0.07508 0.39120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.845e-01 4.963e-01 0.775 0.43897
weight 9.493e-05 1.701e-04 0.558 0.57719
model_year 4.657e-02 6.625e-03 7.029 9.53e-12 ***
origin2 1.290e-01 8.304e-02 1.554 0.12101
origin3 2.683e-01 1.024e-01 2.620 0.00914 **
weight:model_year -5.029e-06 2.293e-06 -2.193 0.02890 *
weight:origin2 -2.073e-05 3.195e-05 -0.649 0.51682
weight:origin3 -9.446e-05 4.393e-05 -2.150 0.03215 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1189 on 384 degrees of freedom
Multiple R-squared: 0.8799, Adjusted R-squared: 0.8777
F-statistic: 401.8 on 7 and 384 DF, p-value: < 2.2e-16
[1] "Assumptions Interaction Model 2"
Fit:
The log model has an \(Adj.R^2 =0.8721\) which is an improvement when compared to the full model value of \(.8425\) adn relatively close to model 3 where \(Adj.R^2 =0.8777\)
Weight was the only variables who’s beta coefficient was not significant
Nested partial F test cannot be done to compare since they are not nested models, but from AIC interaction model 2 may not performed the best, but was close enough to model 3 in terms of \(Adj.R^2\).
Assumptions:
VIF not advised for since interaction terms will have multicolinearity with main effects
Relatively flat line across residuals vs fitted
Homoscedasticity accomplished with equal spread of variance amongst residuals
Normalcy of residuals still has the same issue with extremes as the logarithmic model.
Some points are near Cook’s distance and can be considered influential points
Box-Cox Model (Pass)
Since logarithmic scale improved fit, an optimal power must be investigated. The Box-Cox method revealed that teh optimal power to partial_model_2 <- lm(mpg ~ weight + model_year +origin, data= df_clean) (what the log model was based on) is to be \(-0.29\) with a \(95%\) confidence interval it lies between \([-.44,-.12]\). For simplicity I will use \(\lambda = -0.25\).
Warning: package 'MASS' was built under R version 4.5.2
Attaching package: 'MASS'
The following object is masked from 'package:patchwork':
area
The following object is masked from 'package:dplyr':
select
[,1]
Original 9.5955611307
BoxCox 0.0001866694
BC model
Call:
lm(formula = I(mpg^(-0.25)) ~ weight + model_year + origin, data = df_clean)
Residuals:
Min 1Q Median 3Q Max
-0.041526 -0.008113 -0.001365 0.008005 0.051565
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.411e-01 1.645e-02 38.983 < 2e-16 ***
weight 3.398e-05 1.064e-06 31.934 < 2e-16 ***
model_year -3.650e-03 1.992e-04 -18.317 < 2e-16 ***
origin2 -7.656e-03 2.121e-03 -3.610 0.000346 ***
origin3 -5.455e-03 2.124e-03 -2.569 0.010587 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01366 on 387 degrees of freedom
Multiple R-squared: 0.8818, Adjusted R-squared: 0.8806
F-statistic: 722.1 on 4 and 387 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
weight 1.711383 1 1.308199
model_year 1.128435 1 1.062278
origin 1.612360 2 1.126848
df AIC
full_model 13 2012.690
bc_model_2 6 -2246.364
Fit:
The log model has an \(Adj.R^2 =0.88\) which is an improvement when compared to the full model value of \(.8425\), and even higher than the previous logarithmic model.
All variables are significant
Nested partial F test cannot be done to compare since they are not nested models, but from AIC comparison the box-cox model is a significant improvement
Assumptions:
VIF values are stable, pointing at the issue of multicolinearity handled
Relatively flat line across residuals vs fitted, but a slight dip on the right half of the plot
Homoscedasticity accomplished with equal spread of variance amongst residuals
Normalcy of residuals greatly improved on the lower end, but extreme values on the right are still a problem
Some points are near Cook’s distance and can be considered influential points
Comparison Analysis
All three models performed well meeting all assumptions. Box-Cox model performed better when dealing with extreme values on the left end of the spectrum. This could explain why it has a significantly higher AIC score compared to the other models.
\[Box\_Cox: mpg^{-\frac{1}{4}} = weight + model\_year +origin\]
df AIC
log_model_2 6 -540.9114
bc_model_2 6 -2246.3636
interaction_model2 5 -533.4393
full_model 13 2012.6901
Outlier Analysis
Cooks distance
Obs CooksD Flag_gt1 Flag_gt4n
112 111 0.0370 FALSE TRUE
276 274 0.0304 FALSE TRUE
278 276 0.0296 FALSE TRUE
328 326 0.0291 FALSE TRUE
388 382 0.0230 FALSE TRUE
45 44 0.0226 FALSE TRUE
327 325 0.0200 FALSE TRUE
29 29 0.0182 FALSE TRUE
329 327 0.0171 FALSE TRUE
323 321 0.0164 FALSE TRUE
Further analysis of Cook’s distance for the Box-Cox model demonstrates a clear separation between a small subset of 4 highly influential points. Observations 111,274,276 and 326 are a small group that are highly separated (\(CD > .028\)) from the next highest distance observation 382 with a distance value of \(0.0230\).
Influence Point Analysis
There were 12 points identified who met at least 4 of the flags and are classified as potential extreme influential points.
[1] "n-rows: 392"
[1] "p-coefs: 5"
[1] "hat_cut: 0.0255102040816327"
[1] "cook_cut: 0.0102040816326531"
[1] "dffits_cut: 0.225876975726313"
[1] "dfbetas_cut: 0.101015254455221"
[1] "covratio_lo: 0.961734693877551"
[1] "covratio_hi: 1.03826530612245"
obs Hat CooksD DFFITS DFBETA_Intercept DFBETA_nprod DFBETA_distance
112 111 0.0172 0.0370 0.4358 0.2222 -0.0849 -0.2224
276 274 0.0199 0.0304 0.3932 -0.1646 0.1795 0.1369
278 276 0.0231 0.0296 0.3871 -0.1759 0.2178 0.1401
328 326 0.0217 0.0291 -0.3842 0.2101 -0.1505 -0.1935
388 382 0.0121 0.0230 -0.3428 0.2324 0.0060 -0.2667
323 321 0.0139 0.0164 -0.2879 0.0788 -0.0041 -0.0851
125 124 0.0055 0.0150 0.2781 0.1104 0.0456 -0.1176
365 360 0.0123 0.0136 -0.2619 0.2048 -0.1167 -0.2046
167 165 0.0045 0.0129 0.2583 0.0745 -0.0704 -0.0503
361 356 0.0264 0.0124 -0.2493 0.1574 -0.1199 -0.1433
55 54 0.0262 0.0121 -0.2462 -0.1736 0.1117 0.1629
298 296 0.0266 0.0116 -0.2415 0.1297 -0.1462 -0.1067
216 214 0.0051 0.0069 0.1869 -0.0431 0.0823 0.0372
CovRatio HighHat HighCook HighDFFITS HighDFBETA HighCovRatio TotalFlags
112 0.8972 0 1 1 1 1 4
276 0.9375 0 1 1 1 1 4
278 0.9557 0 1 1 1 1 4
328 0.9507 0 1 1 1 1 4
388 0.9074 0 1 1 1 1 4
323 0.9528 0 1 1 1 1 4
125 0.8539 0 1 1 1 1 4
365 0.9552 0 1 1 1 1 4
167 0.8429 0 1 1 1 1 4
361 1.0101 1 1 1 1 0 4
55 1.0104 1 1 1 1 0 4
298 1.0125 1 1 1 1 0 4
216 0.9330 0 0 0 0 1 1
Residual Analysis
Further confirmation of extreme influence points based on residual analysis. The R-Stud. revealed there were six values that are ‘STRONG’, signifying that the p.value indicates there is statistical evidence with a confidence of \(95\%\) that these observations are influence points. The observations : \([165,124,111,382,274,214]\) were all identified and are almost all extreme values of Cook’s distance, observation \(214\) only raised one flag in the previous influential point analysis.
[1] "n-rows: 392"
[1] "p-coefs: 5"
[1] "df:n-p: 387"
[1] "MSE: 0.00018666939832121"
[1] "s:sqrt(MSE): 0.0136627009892338"
obs r_rstudent p_rstudent flag_rstudent
1 165 3.849587 0.0001384471 STRONG
2 124 3.723128 0.0002260632 STRONG
3 111 3.295953 0.0010716960 STRONG
4 382 -3.091661 0.0021349388 STRONG
5 274 2.758326 0.0060861713 STRONG
6 214 2.609730 0.0094138672 STRONG
Removal of Outliers/Influential Points
From the following output it can be shown that removal of all observations does not significanlty affect the \(\beta\) coefficient values.
(Intercept) weight model_year origin2 origin3
All_Data 0.6411 0 -0.0036 -0.0077 -0.0055
Remove_165 0.6399 0 -0.0036 -0.0074 -0.0052
Remove_124 0.6393 0 -0.0036 -0.0075 -0.0053
Remove_111 0.6375 0 -0.0036 -0.0076 -0.0060
Remove_382 0.6373 0 -0.0036 -0.0078 -0.0057
Remove_274 0.6438 0 -0.0037 -0.0084 -0.0056
Remove_all 0.6438 0 -0.0037 -0.0084 -0.0056
Conclusion
The Box-Cox model for mpg was chosen using the following predictors : weight, model_year, origin. The optimal \(\lambda\) was not chosen but for interpretability the closest value of \(-\frac{1}{4}\) was used. Since removal of influential point is not necessary the final model has a low residual standard error of \(.01366\) and high \(R^2Adj.\) of \(88\%.\)
[1] "All data BC"
Call:
lm(formula = I(mpg^(-0.25)) ~ weight + model_year + origin, data = df_clean)
Residuals:
Min 1Q Median 3Q Max
-0.041526 -0.008113 -0.001365 0.008005 0.051565
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.411e-01 1.645e-02 38.983 < 2e-16 ***
weight 3.398e-05 1.064e-06 31.934 < 2e-16 ***
model_year -3.650e-03 1.992e-04 -18.317 < 2e-16 ***
origin2 -7.656e-03 2.121e-03 -3.610 0.000346 ***
origin3 -5.455e-03 2.124e-03 -2.569 0.010587 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01366 on 387 degrees of freedom
Multiple R-squared: 0.8818, Adjusted R-squared: 0.8806
F-statistic: 722.1 on 4 and 387 DF, p-value: < 2.2e-16
[1] "274 Influential point removed BC"
Call:
lm(formula = I(mpg^(-0.25)) ~ weight + model_year + origin, data = df_clean[-c(274),
])
Residuals:
Min 1Q Median 3Q Max
-0.041421 -0.008273 -0.001328 0.008019 0.051511
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.438e-01 1.634e-02 39.409 < 2e-16 ***
weight 3.379e-05 1.057e-06 31.958 < 2e-16 ***
model_year -3.677e-03 1.978e-04 -18.587 < 2e-16 ***
origin2 -8.387e-03 2.119e-03 -3.957 9.02e-05 ***
origin3 -5.623e-03 2.107e-03 -2.669 0.00793 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01355 on 386 degrees of freedom
Multiple R-squared: 0.884, Adjusted R-squared: 0.8828
F-statistic: 735.1 on 4 and 386 DF, p-value: < 2.2e-16
Predictions (Appendix)
One of the removed observations at the beginning for having a NA value in horsepower I decided to test as a prediction. The observation point has weight=1835, model_year=80, origin=2 with an mpg=40.9. The second observation point has weight=2905, model_year=80, origin=1 with an mpg=23.6. Both mpg values lie within their respective prediction interval.
[1] "mpg1=40.9, mpg2=23.6"
weight model_year origin
1 1835 80 2
2 2905 80 1
[1] "Prediction Intervals MPG:"
fit lwr upr
1 37.59884 49.65181 28.99117
2 24.85709 31.86824 19.67230