class: center, middle, inverse, title-slide

.title[
# Week 2 Assignment: Presentation
]
.subtitle[
## MLR of CO2 Emissions for Vehicles
]
.author[
### Alice Xiang & Angelo Saporito
]
.date[
### 2024-02-20
]

---
class: inverse1

<h2 align="center"> Table of Contents</h2>

.pull-left[
- Introduction to the Dataset
  - Dataset and variables
  - Research Question
- Full Model
  - Analysis
    - Residual analysis
    - Variance Inflation Factor (VIF)
  - Discussion
- Edited Model
  - Analysis
    - Residual analysis
    - Variance Inflation Factor (VIF)
  - Discussion
]

.pull-right[
- Transformed Model
  - Box-Cox Transform
  - Log Transform
  - Residual Analysis
  - Discussion
- The Bootstrap
- Model Selection
  - Goodness-of-fit
  - Final Model
- Conclusion
]

---
class: inverse center middle

## Introduction to the Dataset

---

.pull-left[
## Introduction

We chose [this dataset](https://www.kaggle.com/datasets/bhuviranga/co2-emissions) on CO2 emissions of different vehicles to do multiple linear regression analysis.
]

.pull-right[
## Variables

The following are the variables included in the dataset (6 continuous, 2 categorical):

- Engine.Size
- Cylinders
- Fuel.Type
- Fuel.Consumption.City
- Fuel.Consumption.Hwy
- Fuel.Consumption.Combined
- Fuel.Consumption.mpg
- CO2.Emissions
]

---
class: inverse center middle

## Research Question: How do different predictor variables relate to the CO2 emissions of the vehicle?

---
class: inverse center middle

## Full Model + Discussion

---

## The Full Model

Using R, we create the full model.

```r
full.model = lm(CO2.Emissions ~ ., data = emissions)
```
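The diagnostic plots for a fit like this can be drawn with base R alone. The block below is a minimal sketch, not the original chunk: the small synthetic `emissions` data frame is a placeholder standing in for the Kaggle data so that the code runs on its own.

```r
# Placeholder data: synthetic stand-in for the real `emissions` data frame
set.seed(1)
n = 200
emissions = data.frame(
  Engine.Size          = runif(n, 1, 6),
  Fuel.Consumption.mpg = runif(n, 15, 50)
)
emissions$CO2.Emissions = 450 + 8 * emissions$Engine.Size -
  6 * emissions$Fuel.Consumption.mpg + rnorm(n, sd = 10)

# Same call as on the slide: regress CO2.Emissions on every other column
full.model = lm(CO2.Emissions ~ ., data = emissions)

# Base R draws four diagnostic panels for an lm fit:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(full.model)
par(mfrow = c(1, 1))
```

Curvature or a funnel shape in the first panel suggests nonlinearity or nonconstant variance, and departures from the line in the Q-Q panel suggest non-normal residuals.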
---

## Residual Analysis

<img src="Untitled_files/figure-html/unnamed-chunk-4-1.png" width="100%" />

---

.pull-left[
## VIF

<table>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> GVIF </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> GVIF^(1/(2*Df)) </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Engine.Size </td> <td style="text-align:right;"> 11.67 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 3.42 </td> </tr>
<tr> <td style="text-align:left;"> Cylinders </td> <td style="text-align:right;"> 14.40 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 1.21 </td> </tr>
<tr> <td style="text-align:left;"> Fuel.Type </td> <td style="text-align:right;"> 2.48 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1.12 </td> </tr>
<tr> <td style="text-align:left;"> Fuel.Consumption.City </td> <td style="text-align:right;"> 2069.97 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 45.50 </td> </tr>
<tr> <td style="text-align:left;"> Fuel.Consumption.Hwy </td> <td style="text-align:right;"> 568.00 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 23.83 </td> </tr>
<tr> <td style="text-align:left;"> Fuel.Consumption.Combined </td> <td style="text-align:right;"> 4651.99 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 68.21 </td> </tr>
<tr> <td style="text-align:left;"> Fuel.Consumption.mpg </td> <td style="text-align:right;"> 10.26 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 3.20 </td> </tr>
</tbody>
</table>
]

.pull-right[
## Issues We See

- nonconstant variance
- residuals not normal (Q-Q plot)
- multicollinearity among the Fuel Consumption variables
]

---
class: inverse center middle

## Edited Model + Discussion

---

## Edited Model: Removing Predictors due to Multicollinearity

Of the Fuel Consumption variables, we keep only Fuel.Consumption.mpg and refit the model with Engine.Size, Cylinders, and Fuel.Type as the other predictors.
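
A minimal sketch of the edited model, inferred from the predictors listed in the edited model's VIF table rather than taken from the original chunk. The names `edited.model`, `aux` and `vif.mpg` are hypothetical, `emissions.edit` is borrowed from the later log-model call, and the synthetic data frame (including its factor levels) is a placeholder so that the block runs on its own.

```r
# Placeholder data: synthetic stand-in for the real `emissions.edit` data frame
set.seed(2)
n = 300
emissions.edit = data.frame(
  Engine.Size          = runif(n, 1, 6),
  Cylinders            = factor(sample(c(4, 6, 8), n, replace = TRUE)),
  Fuel.Type            = factor(sample(c("D", "E", "X", "Z"), n, replace = TRUE)),
  Fuel.Consumption.mpg = runif(n, 15, 50)
)
emissions.edit$CO2.Emissions = 450 + 8 * emissions.edit$Engine.Size -
  6 * emissions.edit$Fuel.Consumption.mpg + rnorm(n, sd = 10)

# Hypothetical name: keep Fuel.Consumption.mpg, drop City, Hwy and Combined
edited.model = lm(CO2.Emissions ~ Engine.Size + Cylinders + Fuel.Type +
                    Fuel.Consumption.mpg, data = emissions.edit)

# For a numeric predictor, VIF = 1 / (1 - R^2), where R^2 comes from
# regressing that predictor on all the other predictors
aux = lm(Fuel.Consumption.mpg ~ Engine.Size + Cylinders + Fuel.Type,
         data = emissions.edit)
vif.mpg = 1 / (1 - summary(aux)$r.squared)
vif.mpg
```

For a term with one degree of freedom this hand-computed value corresponds to the GVIF column of the VIF tables.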

---

## Residual Plots of Edited Model

<img src="Untitled_files/figure-html/unnamed-chunk-8-1.png" width="100%" />

---

.pull-left[
## VIF of Edited Model

<table>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> GVIF </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> GVIF^(1/(2*Df)) </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Engine.Size </td> <td style="text-align:right;"> 11.15 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 3.34 </td> </tr>
<tr> <td style="text-align:left;"> Cylinders </td> <td style="text-align:right;"> 11.79 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 1.19 </td> </tr>
<tr> <td style="text-align:left;"> Fuel.Type </td> <td style="text-align:right;"> 1.41 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1.04 </td> </tr>
<tr> <td style="text-align:left;"> Fuel.Consumption.mpg </td> <td style="text-align:right;"> 2.95 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1.72 </td> </tr>
</tbody>
</table>
]

.pull-right[
## Edited Model Discussion

We see that the residual plots improved, and the multicollinearity among the Fuel Consumption variables has been resolved.

### Remaining Issues:

- variance still nonconstant
- assumption of normality still violated
]

---
class: inverse center middle

## Transformed Model + Discussion

---

## Box-Cox Transformation

We proceed by performing several Box-Cox transformations on the data.

<img src="Untitled_files/figure-html/unnamed-chunk-10-1.png" width="100%" />

The plots show that log-transforming Fuel Consumption changes the estimated lambda.

---

## Log Transformed Model

Using log-transformed Fuel.Consumption.mpg and the log of the response variable CO2.Emissions, we fit the following model:

```r
log.model = lm(log(CO2.Emissions) ~ Engine.Size + Cylinders + Fuel.Type + log(Fuel.Consumption.mpg), data = emissions.edit)
```
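
A Box-Cox check of the kind that motivates a log response can be sketched with `boxcox()` from the MASS package, which profiles the log-likelihood over a grid of lambda values; a maximum near 0 points to a log transform. This is an illustration under assumptions, not the original chunk: the synthetic `emissions.edit` is a placeholder generated so that a log-log relationship holds (its constants loosely echo the fitted coefficients), and `bc` and `lambda.hat` are hypothetical names.

```r
library(MASS)  # provides boxcox()

# Placeholder data: synthetic stand-in built so log(CO2) is linear in log(mpg)
set.seed(3)
n = 300
emissions.edit = data.frame(Fuel.Consumption.mpg = runif(n, 15, 50))
emissions.edit$CO2.Emissions =
  exp(8.9 - 0.99 * log(emissions.edit$Fuel.Consumption.mpg) + rnorm(n, sd = 0.03))

# Profile the Box-Cox log-likelihood of the response over a grid of lambdas
bc = boxcox(CO2.Emissions ~ log(Fuel.Consumption.mpg), data = emissions.edit,
            lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Lambda with the highest profile log-likelihood
lambda.hat = bc$x[which.max(bc$y)]
lambda.hat
```

Setting `plotit = TRUE` instead draws the profile with its 95% interval for lambda.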
---

## Residual Plots of Transformed Model

<img src="Untitled_files/figure-html/unnamed-chunk-13-1.png" width="100%" />

---

## Transformed Model Discussion

- Significant improvements over earlier models
- Curvature in residual plot greatly reduced
- Q-Q plot closest to normal

---
class: inverse center middle

## The Bootstrap

---

<h2 align="center">Bootstrapping Coefficients</h2>

.pull-center[
<img src="Untitled_files/figure-html/unnamed-chunk-17-1.png" width="70%" height="70%" style="display: block; margin: auto;" />
]

---

<h2 align="center">Confidence Intervals</h2>

.pull-center[
<table>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> Std. Error </th> <th style="text-align:left;"> t value </th> <th style="text-align:left;"> Pr(>|t|) </th> <th style="text-align:left;"> btc.ci.95 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:left;"> 8.891 </td> <td style="text-align:left;"> 0.006 </td> <td style="text-align:left;"> 1414.010 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> [ 8.8792 , 8.9023 ] </td> </tr>
<tr> <td style="text-align:left;"> Engine.Size </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> 0.798 </td> <td style="text-align:left;"> 0.425 </td> <td style="text-align:left;"> [ -5e-04 , 0.0013 ] </td> </tr>
<tr> <td style="text-align:left;"> Cylinders4 </td> <td style="text-align:left;"> -0.002 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> -1.388 </td> <td style="text-align:left;"> 0.165 </td> <td style="text-align:left;"> [ -0.0051 , 1e-04 ] </td> </tr>
<tr> <td style="text-align:left;"> Cylinders5 </td> <td style="text-align:left;"> -0.009 </td> <td style="text-align:left;"> 0.004 </td> <td style="text-align:left;"> -2.359 </td> <td style="text-align:left;"> 0.018 </td> <td style="text-align:left;">
[ -0.014 , -0.0037 ] </td> </tr> <tr> <td style="text-align:left;"> Cylinders6 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> 0.856 </td> <td style="text-align:left;"> 0.392 </td> <td style="text-align:left;"> [ -0.0012 , 0.0045 ] </td> </tr> <tr> <td style="text-align:left;"> Cylinders8 </td> <td style="text-align:left;"> 0.001 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> 0.564 </td> <td style="text-align:left;"> 0.572 </td> <td style="text-align:left;"> [ -0.0026 , 0.0053 ] </td> </tr> <tr> <td style="text-align:left;"> Cylinders10 </td> <td style="text-align:left;"> 0.003 </td> <td style="text-align:left;"> 0.004 </td> <td style="text-align:left;"> 0.699 </td> <td style="text-align:left;"> 0.485 </td> <td style="text-align:left;"> [ -0.004 , 0.0089 ] </td> </tr> <tr> <td style="text-align:left;"> Cylinders12 </td> <td style="text-align:left;"> 0.007 </td> <td style="text-align:left;"> 0.003 </td> <td style="text-align:left;"> 2.181 </td> <td style="text-align:left;"> 0.029 </td> <td style="text-align:left;"> [ 0.0012 , 0.0127 ] </td> </tr> <tr> <td style="text-align:left;"> Fuel.TypeE </td> <td style="text-align:left;"> -0.492 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> -292.677 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> [ -0.4963 , -0.4878 ] </td> </tr> <tr> <td style="text-align:left;"> Fuel.TypeX </td> <td style="text-align:left;"> -0.141 </td> <td style="text-align:left;"> 0.001 </td> <td style="text-align:left;"> -108.385 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> [ -0.1425 , -0.1391 ] </td> </tr> <tr> <td style="text-align:left;"> Fuel.TypeZ </td> <td style="text-align:left;"> -0.142 </td> <td style="text-align:left;"> 0.001 </td> <td style="text-align:left;"> -108.189 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> [ 
-0.1439 , -0.1406 ] </td> </tr> <tr> <td style="text-align:left;"> log(Fuel.Consumption.mpg) </td> <td style="text-align:left;"> -0.988 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> -654.666 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> [ -0.9904 , -0.9848 ] </td> </tr> </tbody> </table> ] --- <h2 align="center">Bootstrapping Residuals</h2> <img src="Untitled_files/figure-html/unnamed-chunk-19-1.png" width="60%" height="60%" style="display: block; margin: auto;" /> - Residuals are largely symmetric - Presence of at least one outlier and some slight right skew --- <h2 align="center">Bootstrapping Residuals cont.</h2> <img src="Untitled_files/figure-html/unnamed-chunk-20-1.png" width="70%" height="70%" style="display: block; margin: auto;" /> --- <h2 align="center">Bootstrapped Coefficients & Residuals</h2> <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> Std. 
Error </th> <th style="text-align:left;"> Pr(>|t|) </th> <th style="text-align:left;"> btc.ci.95 </th> <th style="text-align:left;"> btr.ci.95 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:left;"> 8.891 </td> <td style="text-align:left;"> 0.006 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> [ 8.8792 , 8.9023 ] </td> <td style="text-align:left;"> [ 8.8788 , 8.9036 ] </td> </tr> <tr> <td style="text-align:left;"> Engine.Size </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> 0.425 </td> <td style="text-align:left;"> [ -5e-04 , 0.0013 ] </td> <td style="text-align:left;"> [ -6e-04 , 0.0013 ] </td> </tr> <tr> <td style="text-align:left;"> Cylinders4 </td> <td style="text-align:left;"> -0.002 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> 0.165 </td> <td style="text-align:left;"> [ -0.0051 , 1e-04 ] </td> <td style="text-align:left;"> [ -0.0056 , 0.0013 ] </td> </tr> <tr> <td style="text-align:left;"> Cylinders5 </td> <td style="text-align:left;"> -0.009 </td> <td style="text-align:left;"> 0.004 </td> <td style="text-align:left;"> 0.018 </td> <td style="text-align:left;"> [ -0.014 , -0.0037 ] </td> <td style="text-align:left;"> [ -0.0158 , -0.0016 ] </td> </tr> <tr> <td style="text-align:left;"> Cylinders6 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> 0.392 </td> <td style="text-align:left;"> [ -0.0012 , 0.0045 ] </td> <td style="text-align:left;"> [ -0.002 , 0.0058 ] </td> </tr> <tr> <td style="text-align:left;"> Cylinders8 </td> <td style="text-align:left;"> 0.001 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> 0.572 </td> <td style="text-align:left;"> [ -0.0026 , 0.0053 ] </td> <td style="text-align:left;"> [ -0.0033 , 0.0065 ] </td> </tr> <tr> <td style="text-align:left;"> 
Cylinders10 </td> <td style="text-align:left;"> 0.003 </td> <td style="text-align:left;"> 0.004 </td> <td style="text-align:left;"> 0.485 </td> <td style="text-align:left;"> [ -0.004 , 0.0089 ] </td> <td style="text-align:left;"> [ -0.0045 , 0.0098 ] </td> </tr> <tr> <td style="text-align:left;"> Cylinders12 </td> <td style="text-align:left;"> 0.007 </td> <td style="text-align:left;"> 0.003 </td> <td style="text-align:left;"> 0.029 </td> <td style="text-align:left;"> [ 0.0012 , 0.0127 ] </td> <td style="text-align:left;"> [ 9e-04 , 0.0132 ] </td> </tr> <tr> <td style="text-align:left;"> Fuel.TypeE </td> <td style="text-align:left;"> -0.492 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> [ -0.4963 , -0.4878 ] </td> <td style="text-align:left;"> [ -0.4951 , -0.4887 ] </td> </tr> <tr> <td style="text-align:left;"> Fuel.TypeX </td> <td style="text-align:left;"> -0.141 </td> <td style="text-align:left;"> 0.001 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> [ -0.1425 , -0.1391 ] </td> <td style="text-align:left;"> [ -0.1433 , -0.1381 ] </td> </tr> <tr> <td style="text-align:left;"> Fuel.TypeZ </td> <td style="text-align:left;"> -0.142 </td> <td style="text-align:left;"> 0.001 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> [ -0.1439 , -0.1406 ] </td> <td style="text-align:left;"> [ -0.1448 , -0.1397 ] </td> </tr> <tr> <td style="text-align:left;"> log(Fuel.Consumption.mpg) </td> <td style="text-align:left;"> -0.988 </td> <td style="text-align:left;"> 0.002 </td> <td style="text-align:left;"> 0.000 </td> <td style="text-align:left;"> [ -0.9904 , -0.9848 ] </td> <td style="text-align:left;"> [ -0.9908 , -0.9847 ] </td> </tr> </tbody> </table> --- class: inverse center middle ## Model Selection --- ## Comparison of the Models' Goodness of Fit <table> <caption>Goodness-of-fit Measures of Candidate Models</caption> <thead> <tr> <th 
style="text-align:left;"> </th> <th style="text-align:right;"> SSE </th> <th style="text-align:right;"> R.sq </th> <th style="text-align:right;"> R.adj </th> <th style="text-align:right;"> Cp </th> <th style="text-align:right;"> AIC </th> <th style="text-align:right;"> SBC </th> <th style="text-align:right;"> PRESS </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Full Model </td> <td style="text-align:right;"> 1673191 </td> <td style="text-align:right;"> 0.934 </td> <td style="text-align:right;"> 0.934 </td> <td style="text-align:right;"> 14.015 </td> <td style="text-align:right;"> 40077 </td> <td style="text-align:right;"> 40174 </td> <td style="text-align:right;"> 2293146 </td> </tr>
<tr> <td style="text-align:left;"> Edited Model </td> <td style="text-align:right;"> 1673191 </td> <td style="text-align:right;"> 0.934 </td> <td style="text-align:right;"> 0.934 </td> <td style="text-align:right;"> 14.015 </td> <td style="text-align:right;"> 40077 </td> <td style="text-align:right;"> 40174 </td> <td style="text-align:right;"> 2293146 </td> </tr>
<tr> <td style="text-align:left;"> Transformed Model </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.995 </td> <td style="text-align:right;"> 0.995 </td> <td style="text-align:right;"> Inf </td> <td style="text-align:right;"> -60600 </td> <td style="text-align:right;"> -60517 </td> <td style="text-align:right;"> 2 </td> </tr>
</tbody>
</table>

---

## Model Selection

Log Transformed Model selected

- highest adjusted R squared
- fewest violations of assumptions

## Variable Selection

- Most levels of Cylinders have large p-values (Cylinders5 and Cylinders12 are the exceptions at the 0.05 level)
- Engine.Size p-value also large (0.425)

We remove Engine.Size from the model.

---

## Final Model

```r
log.model = lm(log(CO2.Emissions) ~ Cylinders + Fuel.Type + log(Fuel.Consumption.mpg), data = emissions.edit)
```
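
Bootstrap intervals like those reported earlier could be recomputed for a log-log fit with a case-resampling bootstrap, sketched below. This is a minimal illustration, not the original code: the data frame is a synthetic placeholder with a single predictor, `B`, `boot.coefs` and `boot.ci` are hypothetical names, and percentile intervals are an assumption because the deck does not state which interval type was used.

```r
# Placeholder data: synthetic stand-in for the real `emissions.edit` data frame
set.seed(4)
n = 300
emissions.edit = data.frame(Fuel.Consumption.mpg = runif(n, 15, 50))
emissions.edit$CO2.Emissions =
  exp(8.9 - 0.99 * log(emissions.edit$Fuel.Consumption.mpg) + rnorm(n, sd = 0.03))

B = 500  # number of bootstrap replicates (hypothetical choice)

# Resample rows with replacement, refit, and keep the coefficients each time
boot.coefs = t(replicate(B, {
  idx = sample(n, replace = TRUE)
  coef(lm(log(CO2.Emissions) ~ log(Fuel.Consumption.mpg),
          data = emissions.edit[idx, ]))
}))

# 95% percentile interval for each coefficient (rows: 2.5% and 97.5%)
boot.ci = apply(boot.coefs, 2, quantile, probs = c(0.025, 0.975))
boot.ci
```

Because the rows are resampled as whole cases, this variant does not rely on the residuals being normal or having constant variance.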
---
class: inverse center middle

## Conclusions

---

## Conclusions

- Log transformed model chosen as best model due to residual analysis and goodness of fit
- Still shows violations of assumptions
  - nonconstant variance in residuals
  - assumption of normality
- Includes outliers

Further analysis through bootstrapping can mitigate some of these issues.

---
class: inverse center middle

## Contributions

Alice: First half of the presentation up until log-transformed model

Angelo: Second half (log-transform, bootstrap, model selection)