class: center, middle, inverse, title-slide

.title[
# Week 2 Assignment: Presentation
]
.subtitle[
## MLR of CO2 Emissions for Vehicles
]
.author[
### Alice Xiang
]
.date[
### 2024-02-18
]

---

## Table of Contents

- Introduction to the Dataset
- Research Question
- The Full Model + Discussion
- The Edited Model + Discussion
- The Transformed Model + Discussion
- Model Selection
- Conclusion

---
class: inverse center middle

## Introduction to the Dataset

---

## Introduction

I chose [this dataset](https://www.kaggle.com/datasets/bhuviranga/co2-emissions) on CO2 emissions of different cars to do multiple linear regression.

---

## Variables

The following are the variables included in the dataset (6 continuous, 2 categorical):

- Engine.Size
- Cylinders
- Fuel.Type
- Fuel.Consumption.City
- Fuel.Consumption.Hwy
- Fuel.Consumption.Combined
- Fuel.Consumption.mpg
- CO2.Emissions

---
class: inverse center middle

## Research Question: How do different predictor variables relate to the CO2 emissions of the vehicle?

---
class: inverse center middle

## Full Model + Discussion

---

## The Full Model

Using R, we create the full model.

```r
full.model = lm(CO2.Emissions ~ ., data = emissions)
```

---

## Summary of the Full Model
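
The coefficient table for this slide is not reproduced here. As a reading aid, the sketch below uses a small simulated data frame (the `toy` object and all of its values are made up, not the Kaggle data) to show how `CO2.Emissions ~ .` expands to every other column, and how a factor predictor contributes one coefficient per non-reference level. Assuming `Cylinders` was stored as a factor, this is why it appears with `Df = 7` in the VIF output that follows.

```r
# Simulated stand-in for the emissions data (all values are made up).
set.seed(42)
n <- 120
toy <- data.frame(
  Engine.Size = round(runif(n, 1, 6), 1),
  Cylinders   = factor(sample(c(4, 6, 8), n, replace = TRUE)),
  Fuel.Type   = factor(sample(c("X", "Z"), n, replace = TRUE))
)
toy$CO2.Emissions <- 100 + 30 * toy$Engine.Size +
  10 * (toy$Cylinders == "8") + rnorm(n, sd = 5)

# "~ ." uses every column except the response as a predictor.
toy.model <- lm(CO2.Emissions ~ ., data = toy)

# A factor with k levels yields k - 1 dummy coefficients
# (the first level is absorbed into the intercept).
names(coef(toy.model))
```

In this toy fit, the three-level `Cylinders` factor produces two coefficients and the two-level `Fuel.Type` factor produces one.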
---

## Residual Analysis

<img src="xaringan_files/figure-html/unnamed-chunk-4-1.png" width="100%" />

---

## VIF

```r
vif(full.model)
```

```
##                                  GVIF Df GVIF^(1/(2*Df))
## Engine.Size                 11.668643  1        3.415940
## Cylinders                   14.403671  7        1.209896
## Fuel.Type                    2.475681  4        1.119984
## Fuel.Consumption.City     2069.965111  1       45.496869
## Fuel.Consumption.Hwy       568.001039  1       23.832772
## Fuel.Consumption.Combined 4651.987253  1       68.205478
## Fuel.Consumption.mpg        10.261228  1        3.203315
```

---

## Issues We See

- nonconstant variance
- residuals not normal (Q-Q plot)
- multicollinearity among all Fuel Consumption variables

---
class: inverse center middle

## Edited Model + Discussion

---

## Edited Model: Removing Predictors due to Multicollinearity

Of the Fuel Consumption variables, we keep only Fuel.Consumption.mpg and create the following model.

```r
emissions.edit <- emissions %>%
  dplyr::select(-c(Fuel.Consumption.City, Fuel.Consumption.Hwy, Fuel.Consumption.Combined))
full.model.edit = lm(CO2.Emissions ~., data=emissions.edit)
```

---

## Summary of the Edited Model

```r
DT::datatable(summary(full.model.edit)$coef, fillContainer = FALSE, options = list(pageLength = 5))
```
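
---

## Note: Reading the GVIF Columns

For reference when comparing the two VIF tables, this note spells out how the numeric columns relate, assuming `vif` here is `car::vif` (the output format suggests so). The symbols $R_j^2$ and $\mathrm{Df}_j$ are notation introduced for this note. For a term with one degree of freedom, the GVIF equals the usual VIF. The last column rescales each term by its degrees of freedom so that factors and numeric predictors can be compared:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \qquad \text{adjusted value} = \mathrm{GVIF}_j^{\,1/(2\,\mathrm{Df}_j)}
$$

Checking two rows of the full-model output:

$$
\text{Engine.Size: } 11.668643^{1/2} \approx 3.416, \qquad \text{Cylinders: } 14.403671^{1/14} \approx 1.210
$$

A common rule of thumb (an assumption here, not something stated in the analysis) is to compare the adjusted value against $\sqrt{10} \approx 3.16$, the square root of the usual VIF cutoff of 10.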
---

## Residual Plots of Edited Model

<img src="xaringan_files/figure-html/unnamed-chunk-8-1.png" width="100%" />

---

## VIF of Edited Model

```r
vif(full.model.edit)
```

```
##                           GVIF Df GVIF^(1/(2*Df))
## Engine.Size          11.145876  1        3.338544
## Cylinders            11.786999  7        1.192693
## Fuel.Type             1.410630  4        1.043943
## Fuel.Consumption.mpg  2.951199  1        1.717905
```

---

## Edited Model Discussion

We see that the residual plots improved, and the severe multicollinearity among the Fuel Consumption variables has been resolved (Engine.Size and Cylinders still have GVIFs above 10).

### Remaining Issues:

- variance still nonconstant
- assumption of normality still violated

---
class: inverse center middle

## Transformed Model + Discussion

---

## Box-Cox Transformation

We proceed by performing several Box-Cox transformations on the data.

<img src="xaringan_files/figure-html/unnamed-chunk-10-1.png" width="100%" />

The plots show that the estimated lambda changes once Fuel Consumption is log transformed.

---

## Log Transformed Model

Using log-transformed Fuel.Consumption.mpg as a predictor and the log of the response variable CO2.Emissions, we create the following model:

```r
log.model = lm(log(CO2.Emissions) ~ Engine.Size + Cylinders + Fuel.Type + log(Fuel.Consumption.mpg), data = emissions.edit)
```

---

## Summary of the Transformed Model

```r
DT::datatable(summary(log.model)$coef, fillContainer = FALSE, options = list(pageLength = 5))
```
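
As a reading aid for the coefficient table, the fitted log-log form can be written out. The symbols $\beta_0$, $\beta_m$, and $\gamma$ are notation introduced here, not names from the R output:

$$
\log(\text{CO2.Emissions}) = \beta_0 + \beta_m \log(\text{Fuel.Consumption.mpg}) + \dots + \varepsilon
$$

Holding the other predictors fixed, scaling mpg by a factor $(1 + p)$ multiplies the fitted CO2 emissions by

$$
(1 + p)^{\beta_m},
$$

so for small $p$ a 1% change in mpg corresponds to roughly a $\beta_m$% change in emissions. Similarly, a dummy coefficient $\gamma$ for a factor level multiplies the fitted emissions by $e^{\gamma}$ relative to the reference level.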
---

## Residual Plots of Transformed Model

```r
par(mfrow=c(2,2), pin=c(1.5,1))
plot(log.model)
```

<img src="xaringan_files/figure-html/unnamed-chunk-13-1.png" width="100%" />

---

## Transformed Model Discussion

- Significant improvements from earlier models
- Curvature in residual plot greatly reduced
- Q-Q plot closest to normal

---
class: inverse center middle

## Model Selection

---

## Comparison of the Models' Goodness of Fit

Table: Goodness-of-fit Measures of Candidate Models

|                  |          SSE|      R.sq|     R.adj| Cp|       AIC|       SBC|   PRESS|
|:-----------------|------------:|---------:|---------:|--:|---------:|---------:|-------:|
|Full Model        | 1.673191e+06| 0.9338159| 0.9336992| 14|  40077.13|  40173.83| 2293146|
|Edited Model      | 1.673191e+06| 0.9338159| 0.9336992| 14|  40077.13|  40173.83| 2293146|
|Transformed Model | 2.031511e+00| 0.9950472| 0.9950385| 14| -60517.38| -60420.68|     Inf|

The transformed model's SSE, AIC, SBC, and PRESS are computed on the log scale, so they are not directly comparable with the first two rows.

---

## Model Selection

The log transformed model is selected:

- highest adjusted R squared
- fewest violations of assumptions

## Variable Selection

- Several levels of Cylinders have large p-values
- Engine.Size p-value also large

We remove Engine.Size from the model.

---

## Final Model

```r
log.model = lm(log(CO2.Emissions) ~ Cylinders + Fuel.Type + log(Fuel.Consumption.mpg), data = emissions.edit)
DT::datatable(summary(log.model)$coef, fillContainer = FALSE, options = list(pageLength = 5))
```
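
One way to check whether dropping `Engine.Size` is supported, beyond reading individual p-values, is a partial F-test between the two nested models. The sketch below is only an illustration: it uses simulated data with the same column names (the `sim` data frame and every value in it are made up, not the Kaggle data), so its numbers say nothing about the real fit.

```r
# Simulated stand-in data (hypothetical values, same column names as the slides).
set.seed(2024)
n <- 300
sim <- data.frame(
  Engine.Size = runif(n, 1, 6),
  Cylinders   = factor(sample(c(4, 6, 8), n, replace = TRUE)),
  Fuel.Type   = factor(sample(c("X", "Z"), n, replace = TRUE)),
  Fuel.Consumption.mpg = runif(n, 15, 50)
)
# Simulated response: depends on mpg and fuel type only.
sim$CO2.Emissions <- exp(8 - 0.9 * log(sim$Fuel.Consumption.mpg) +
                           0.05 * (sim$Fuel.Type == "Z") +
                           rnorm(n, sd = 0.03))

# Nested pair: the transformed model with and without Engine.Size.
with.engine <- lm(log(CO2.Emissions) ~ Engine.Size + Cylinders + Fuel.Type +
                    log(Fuel.Consumption.mpg), data = sim)
without.engine <- update(with.engine, . ~ . - Engine.Size)

# Partial F-test: a large p-value gives no evidence that Engine.Size is needed.
f.test <- anova(without.engine, with.engine)
f.test

# AIC is comparable here because both models share the same log response.
AIC(without.engine, with.engine)
```

With the real data, the same two calls would be applied to the models fitted on `emissions.edit`.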
---
class: inverse center middle

## Conclusions

---

## Conclusions

- Log transformed model chosen as best model due to residual analysis and goodness of fit
- Still shows violations of assumptions
  - nonconstant variance in the residuals
  - assumption of normality
- Includes outliers

Further analysis could use bootstrapping to address some of these issues.

---
class: inverse center middle

## Thank you
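
---

## Appendix: Bootstrap Sketch

The conclusions mention bootstrapping as a next step. Below is a minimal sketch of a case-resampling bootstrap with a percentile interval for the slope on `log(Fuel.Consumption.mpg)`. It assumes simulated data (the `dat` data frame and its values are made up, and the model is reduced to one predictor for brevity), so it illustrates the procedure only, not results for the real dataset.

```r
# Simulated stand-in data (hypothetical values).
set.seed(1)
n <- 200
mpg <- runif(n, 15, 50)
dat <- data.frame(
  Fuel.Consumption.mpg = mpg,
  CO2.Emissions = exp(8 - 0.9 * log(mpg) + rnorm(n, sd = 0.05))
)

# Resample rows with replacement, refit, and keep the slope each time.
B <- 500
boot.slopes <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)
  fit <- lm(log(CO2.Emissions) ~ log(Fuel.Consumption.mpg), data = dat[idx, ])
  unname(coef(fit)[2])
})

# Percentile 95% interval for the slope; no normality assumption is used here.
ci <- quantile(boot.slopes, c(0.025, 0.975))
ci
```

For the final model, the same loop would resample rows of `emissions.edit` and refit the full formula.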