Week 2 Assignment: Presentation

class: center, middle, inverse, title-slide

.title[
# Week 2 Assignment: Presentation
]
.subtitle[
## MLR of CO2 Emissions for Vehicles
]
.author[
### Alice Xiang
]
.date[
### 2024-02-18
]

---

## Table of Contents

- Introduction to the Dataset
- Research Question
- The Full Model + Discussion
- The Edited Model + Discussion
- The Transformed Model + Discussion
- Model Selection
- Conclusion

---
class: inverse center middle

## Introduction to the Dataset

---

## Introduction

I chose [this dataset](https://www.kaggle.com/datasets/bhuviranga/co2-emissions) on the CO2 emissions of different cars as the basis for a multiple linear regression (MLR) analysis.

---

## Variables

The following are the variables included in the dataset (6 continuous, 2 categorical):

- Engine.Size

- Cylinders

- Fuel.Type

- Fuel.Consumption.City

- Fuel.Consumption.Hwy

- Fuel.Consumption.Combined

- Fuel.Consumption.mpg

- CO2.Emissions

---
class: inverse center middle

## Research Question: How do different predictor variables relate to the CO2 emissions of the vehicle?

---
class: inverse center middle

## Full Model + Discussion

---

## The Full Model

Using R, we fit the full model, regressing CO2.Emissions on all of the other variables.

```r
full.model = lm(CO2.Emissions ~ ., data = emissions)
```
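
---

## Sketch: Categorical Predictors as Factors

The model summary reports one coefficient per level (`Cylinders4`, ..., `Fuel.TypeZ`), which indicates that Cylinders and Fuel.Type enter the model as factors. The data preparation is not shown in these slides, so the following is only a minimal sketch of how that conversion might look. The data frame `toy` and all of its values are made up for illustration.

```r
# Hypothetical toy data, NOT the emissions dataset
toy <- data.frame(
  Engine.Size   = c(2.0, 2.4, 1.5, 3.5, 3.5, 5.0, 2.0, 3.0),
  Cylinders     = c(4, 4, 4, 6, 6, 8, 4, 6),
  Fuel.Type     = c("Z", "Z", "X", "Z", "X", "D", "D", "X"),
  CO2.Emissions = c(196, 221, 136, 255, 244, 330, 180, 250)
)

# Treat both categorical variables as factors so lm() creates dummy variables
toy$Cylinders <- factor(toy$Cylinders)
toy$Fuel.Type <- factor(toy$Fuel.Type)

toy.model <- lm(CO2.Emissions ~ ., data = toy)
names(coef(toy.model))  # one coefficient per non-reference level
```

The first level of each factor (here 4 cylinders and fuel type D) serves as the reference level and gets no coefficient of its own.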

---

## Summary of the Full Model

|Term                      |  Estimate| Std. Error|  t value| Pr(>\|t\|)|
|:-------------------------|---------:|----------:|--------:|----------:|
|(Intercept)               |   93.8670|     1.6423|   57.156|          0|
|Engine.Size               |    0.8218|     0.1441|    5.704|   1.22e-08|
|Cylinders4                |   -0.9585|     0.5227|   -1.834|     0.0668|
|Cylinders5                |   -3.5501|     1.1034|   -3.218|    0.00130|
|Cylinders6                |   -0.0902|     0.5888|   -0.153|      0.878|
|Cylinders8                |    1.3083|     0.7377|    1.773|     0.0762|
|Cylinders10               |    6.3032|     1.0946|    5.759|   8.82e-09|
|Cylinders12               |    8.6057|     0.9572|    8.990|   3.09e-19|
|Cylinders16               |   28.8962|     3.0676|    9.420|   5.92e-21|
|Fuel.TypeE                | -137.7060|     0.5365| -256.687|          0|
|Fuel.TypeN                | -111.3336|     4.9259|  -22.602|  2.04e-109|
|Fuel.TypeX                |  -30.5260|     0.3844|  -79.414|          0|
|Fuel.TypeZ                |  -31.1123|     0.3888|  -80.027|          0|
|Fuel.Consumption.City     |    6.0962|     0.7424|    8.211|   2.57e-16|
|Fuel.Consumption.Hwy      |    5.4827|     0.6120|    8.959|   4.09e-19|
|Fuel.Consumption.Combined |    8.1948|     1.3469|    6.084|   1.23e-09|
|Fuel.Consumption.mpg      |   -0.9647|     0.0253|  -38.130|  1.67e-290|

---

## Residual Analysis
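
The diagnostic plots for a fitted model can be produced with base R's `plot` method for `lm` objects, as is done later for the transformed model. The sketch below uses the built-in `mtcars` data as a stand-in, since the emissions data are not bundled with these slides; in the analysis itself the fitted object would be `full.model`.

```r
# Stand-in model on built-in data (illustrative only)
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# 2 x 2 grid: residuals vs fitted, normal Q-Q, scale-location, leverage
old.par <- par(mfrow = c(2, 2))
plot(fit)
par(old.par)

# The quantities behind the first two panels
res     <- residuals(fit)   # raw residuals
fits    <- fitted(fit)      # fitted values
std.res <- rstandard(fit)   # standardized residuals used in the Q-Q plot
```

A funnel shape in the residuals-vs-fitted panel points to nonconstant variance, and systematic departures from the line in the Q-Q panel point to non-normal residuals.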

---

## VIF

```r
library(car)  # provides vif()
vif(full.model)
```

```
##                                  GVIF Df GVIF^(1/(2*Df))
## Engine.Size                 11.668643  1        3.415940
## Cylinders                   14.403671  7        1.209896
## Fuel.Type                    2.475681  4        1.119984
## Fuel.Consumption.City     2069.965111  1       45.496869
## Fuel.Consumption.Hwy       568.001039  1       23.832772
## Fuel.Consumption.Combined 4651.987253  1       68.205478
## Fuel.Consumption.mpg        10.261228  1        3.203315
```
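
---

## Sketch: Reading the GVIF Column

For a term with `Df` coefficients, the output above also lists GVIF^(1/(2*Df)), which puts terms with different numbers of coefficients on a comparable scale; for a term with one coefficient it is simply the square root of the usual VIF. A short sketch recomputing that column from the printed GVIF and Df values (the vector names are ours, for illustration):

```r
# GVIF and Df copied from the vif(full.model) output
gvif <- c(Engine.Size = 11.668643, Cylinders = 14.403671, Fuel.Type = 2.475681,
          Fuel.Consumption.City = 2069.965111, Fuel.Consumption.Hwy = 568.001039,
          Fuel.Consumption.Combined = 4651.987253, Fuel.Consumption.mpg = 10.261228)
df <- c(1, 7, 4, 1, 1, 1, 1)

# Adjusted value reported in the third column of the output
adj <- gvif^(1 / (2 * df))

# Squaring the adjusted value gives a number on the ordinary VIF scale
round(cbind(GVIF = gvif, Df = df, adjusted = adj, adjusted.sq = adj^2), 3)
```

On this scale the three raw fuel-consumption measures stand out far above every other term.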

---

## Issues We See

- nonconstant variance of the residuals
- residuals not normally distributed (Q-Q plot)
- severe multicollinearity among the Fuel Consumption variables (GVIF above 500 for City, Hwy, and Combined)

---
class: inverse center middle

## Edited Model + Discussion

---

## Edited Model: Removing Predictors due to Multicollinearity

Of the Fuel Consumption variables, we keep only Fuel.Consumption.mpg and create the following model.

```r
library(dplyr)  # provides %>% and select()
emissions.edit <- emissions %>%
  dplyr::select(-c(Fuel.Consumption.City, Fuel.Consumption.Hwy, Fuel.Consumption.Combined))
full.model.edit <- lm(CO2.Emissions ~ ., data = emissions.edit)
```

---

## Summary of the Edited Model

```r
DT::datatable(summary(full.model.edit)$coef, fillContainer = FALSE, options = list(pageLength = 5))
```

|Term                 |  Estimate| Std. Error|  t value| Pr(>\|t\|)|
|:--------------------|---------:|----------:|--------:|----------:|
|(Intercept)          |  425.4363|     2.7774|  153.178|          0|
|Engine.Size          |    7.4997|     0.4323|   17.350|   4.02e-66|
|Cylinders4           |   -8.9335|     1.6014|   -5.579|   2.51e-08|
|Cylinders5           |  -13.7774|     3.3834|   -4.072|   4.71e-05|
|Cylinders6           |   -5.9478|     1.8059|   -3.294|   9.94e-04|
|Cylinders8           |   11.7410|     2.2577|    5.200|   2.04e-07|
|Cylinders10          |   34.3157|     3.3337|   10.294|   1.11e-24|
|Cylinders12          |   44.3811|     2.8674|   15.478|   3.35e-53|
|Cylinders16          |  145.2516|     9.2652|   15.677|   1.63e-54|
|Fuel.TypeE           |  -77.7513|     1.4658|  -53.043|          0|
|Fuel.TypeN           | -100.3179|    15.1136|   -6.638|   3.41e-11|
|Fuel.TypeX           |  -25.9160|     1.1774|  -22.012|  4.68e-104|
|Fuel.TypeZ           |  -29.9944|     1.1930|  -25.142|  6.74e-134|
|Fuel.Consumption.mpg |   -6.0532|     0.0416| -145.335|          0|

---

## Residual Plots of Edited Model

---

## VIF of Edited Model

```r
vif(full.model.edit)
```

```
##                           GVIF Df GVIF^(1/(2*Df))
## Engine.Size          11.145876  1        3.338544
## Cylinders            11.786999  7        1.192693
## Fuel.Type             1.410630  4        1.043943
## Fuel.Consumption.mpg  2.951199  1        1.717905
```

---

## Edited Model Discussion

The residual plots improve, and the severe multicollinearity is resolved: Fuel.Consumption.mpg now has a GVIF of about 2.95, down from 10.26 in the full model.

### Remaining Issues:
- residual variance still nonconstant
- normality assumption still violated

---
class: inverse center middle

## Transformed Model + Discussion

---

## Box-Cox Transformation

We proceed by performing several Box-Cox transformations on the data.

<img src="xaringan_files/figure-html/unnamed-chunk-10-1.png" width="100%" />

The plots show that log-transforming Fuel Consumption changes the estimated lambda.
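
---

## Sketch: Estimating Lambda

As a rough illustration of how such plots can be produced, `MASS::boxcox` profiles the log-likelihood over lambda for a fitted `lm`. The sketch below uses simulated data (not the emissions dataset), generated so that log(y) is linear in x; the estimated lambda should therefore land near 0, the value that corresponds to a log transformation of the response.

```r
library(MASS)  # provides boxcox()

set.seed(1)
# Simulated stand-in data: log(y) is linear in x by construction
x <- runif(200, min = 1, max = 10)
y <- exp(1 + 0.3 * x + rnorm(200, sd = 0.1))
fit <- lm(y ~ x)

# Profile log-likelihood over a grid of lambda values (no plot drawn here)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Lambda with the highest profile log-likelihood
lambda.hat <- bc$x[which.max(bc$y)]
lambda.hat
```

Calling `boxcox(fit)` with the default `plotit = TRUE` draws the likelihood curve with a 95% interval for lambda.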

---

## Log Transformed Model

We log-transform both Fuel.Consumption.mpg and the response variable CO2.Emissions, and fit the following model:

```r
log.model = lm(log(CO2.Emissions) ~ Engine.Size + Cylinders + Fuel.Type + log(Fuel.Consumption.mpg), data = emissions.edit)
```

---

## Summary of the Transformed Model

```r
DT::datatable(summary(log.model)$coef, fillContainer = FALSE, options = list(pageLength = 5))
```

|Term                      |  Estimate| Std. Error|  t value| Pr(>\|t\|)|
|:-------------------------|---------:|----------:|--------:|----------:|
|(Intercept)               |  8.890900|   0.006287| 1414.202|          0|
|Engine.Size               |  0.000385|   0.000483|    0.798|      0.425|
|Cylinders4                | -0.002446|   0.001762|   -1.388|      0.165|
|Cylinders5                | -0.008787|   0.003725|   -2.359|     0.0184|
|Cylinders6                |  0.001702|   0.001987|    0.857|      0.392|
|Cylinders8                |  0.001407|   0.002493|    0.565|      0.572|
|Cylinders10               |  0.002575|   0.003683|    0.699|      0.484|
|Cylinders12               |  0.006916|   0.003171|    2.181|     0.0292|
|Cylinders16               |  0.039219|   0.010220|    3.837|   1.25e-04|
|Fuel.TypeE                | -0.491886|   0.001680| -292.716|          0|
|Fuel.TypeN                | -0.479867|   0.016655|  -28.813|  4.37e-173|
|Fuel.TypeX                | -0.140834|   0.001299| -108.400|          0|
|Fuel.TypeZ                | -0.142293|   0.001315| -108.204|          0|
|log(Fuel.Consumption.mpg) | -0.987638|   0.001508| -654.755|          0|
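
---

## Sketch: Reading the Log-Scale Coefficients

Because both CO2.Emissions and Fuel.Consumption.mpg are logged, the mpg coefficient can be read as an elasticity, and the coefficient of a dummy variable as a multiplicative effect relative to the reference level. A small worked calculation using two estimates from the transformed-model summary (the variable names are ours, for illustration):

```r
# Estimates copied from the transformed-model summary
b.log.mpg <- -0.9876379   # log(Fuel.Consumption.mpg)
b.fuel.E  <- -0.4918855   # Fuel.TypeE, relative to the reference fuel type

# Multiplicative change in predicted CO2 for a 1% increase in mpg
ratio.mpg <- 1.01^b.log.mpg
pct.mpg   <- 100 * (ratio.mpg - 1)

# Multiplicative difference between fuel type E and the reference level
ratio.E <- exp(b.fuel.E)
pct.E   <- 100 * (ratio.E - 1)

round(c(pct.mpg = pct.mpg, pct.E = pct.E), 2)
```

Holding the other predictors fixed, a 1% increase in mpg is associated with a decrease of roughly 1% in predicted CO2 emissions, and fuel type E with predicted emissions roughly 39% below the reference fuel type.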

---

## Residual Plots of Transformed Model

```r
par(mfrow=c(2,2), pin=c(1.5,1))
plot(log.model)
```

---

## Transformed Model Discussion

- Significant improvements from earlier models
- Curvature in residual plot greatly improved
- Q-Q plot closest to normal

---
class: inverse center middle

## Model Selection

---

## Comparison of the Models' Goodness of Fit

Table: Goodness-of-fit Measures of Candidate Models

|                  |          SSE|      R.sq|     R.adj| Cp|       AIC|       SBC|   PRESS|
|:-----------------|------------:|---------:|---------:|--:|---------:|---------:|-------:|
|Full Model        | 1.673191e+06| 0.9338159| 0.9336992| 14|  40077.13|  40173.83| 2293146|
|Edited Model      | 1.673191e+06| 0.9338159| 0.9336992| 14|  40077.13|  40173.83| 2293146|
|Transformed Model | 2.031511e+00| 0.9950472| 0.9950385| 14| -60517.38| -60420.68|     Inf|
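
---

## Sketch: Computing the Measures

The AIC and SBC values in the table are consistent with the textbook definitions AIC = n ln(SSE/n) + 2p and SBC = n ln(SSE/n) + p ln(n), which differ from R's built-in `AIC()` and `BIC()` by an additive constant. The sketch below shows one way to compute these measures from a fitted `lm`; it uses the built-in `mtcars` data as a stand-in, the helper name `fit.measures` is hypothetical, and Cp is left out because it needs the error variance of a reference model.

```r
# Hypothetical helper: goodness-of-fit measures for a fitted lm
fit.measures <- function(model) {
  e   <- residuals(model)
  h   <- hatvalues(model)        # leverages
  n   <- length(e)
  p   <- length(coef(model))     # number of estimated coefficients
  sse <- sum(e^2)
  s   <- summary(model)
  c(SSE   = sse,
    R.sq  = s$r.squared,
    R.adj = s$adj.r.squared,
    AIC   = n * log(sse / n) + 2 * p,
    SBC   = n * log(sse / n) + p * log(n),
    PRESS = sum((e / (1 - h))^2))  # leave-one-out prediction error
}

# Stand-in models on built-in data (not the emissions dataset)
m1 <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
m2 <- lm(log(mpg) ~ wt + hp + factor(cyl), data = mtcars)

round(rbind(raw = fit.measures(m1), logged = fit.measures(m2)), 4)
```

SSE, AIC, SBC, and PRESS depend on the scale of the response, so values for a model of log(y) are not directly comparable with those for a model of y. PRESS divides each residual by 1 minus its leverage, so an observation with leverage 1 (for example the only observation at some factor level) can make it infinite.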

---

## Model Selection

The log-transformed model is selected:

- highest adjusted R squared (0.9950, versus 0.9337 for the other two models)
- fewest violations of the model assumptions

## Variable Selection

- Most levels of Cylinders have large p-values
- The p-value for Engine.Size is also large (about 0.42)

We remove Engine.Size from the model.
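
---

## Sketch: Testing a Whole Term

The p-values in a coefficient table compare each Cylinders level with the reference level; they do not test the factor as a whole. One way to assess a whole term is a partial F-test, for example with `drop1`. The sketch below uses the built-in `mtcars` data as a stand-in, and the model is illustrative only.

```r
# Stand-in data: treat cyl as a factor, as Cylinders is in the slides
dat <- transform(mtcars, cyl = factor(cyl))
fit <- lm(log(mpg) ~ disp + cyl + log(hp), data = dat)

# Partial F-test for each term: the model with vs. without that term
tests <- drop1(fit, test = "F")
tests

# Equivalent explicit comparison for a single term
fit.no.disp <- update(fit, . ~ . - disp)
cmp <- anova(fit.no.disp, fit)
cmp
```

For a term with a single coefficient, such as Engine.Size, the partial F-test gives the same p-value as the t-test in the coefficient table.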

---

## Final Model

```r
log.model = lm(log(CO2.Emissions) ~ Cylinders + Fuel.Type + log(Fuel.Consumption.mpg), data = emissions.edit)

DT::datatable(summary(log.model)$coef, fillContainer = FALSE, options = list(pageLength = 5))
```

|Term                      |  Estimate| Std. Error|  t value| Pr(>\|t\|)|
|:-------------------------|---------:|----------:|--------:|----------:|
|(Intercept)               |  8.892922|   0.005754| 1545.601|          0|
|Cylinders4                | -0.002270|   0.001748|   -1.299|      0.194|
|Cylinders5                | -0.008492|   0.003706|   -2.291|     0.0220|
|Cylinders6                |  0.002310|   0.001836|    1.259|      0.208|
|Cylinders8                |  0.002630|   0.001967|    1.338|      0.181|
|Cylinders10               |  0.003959|   0.003250|    1.218|      0.223|
|Cylinders12               |  0.008530|   0.002443|    3.491|   4.84e-04|
|Cylinders16               |  0.041384|   0.009853|    4.200|   2.70e-05|
|Fuel.TypeE                | -0.491964|   0.001677| -293.273|          0|
|Fuel.TypeN                | -0.479848|   0.016654|  -28.813|  4.41e-173|
|Fuel.TypeX                | -0.140813|   0.001299| -108.409|          0|
|Fuel.TypeZ                | -0.142350|   0.001313| -108.411|          0|
|log(Fuel.Consumption.mpg) | -0.988046|   0.001419| -696.290|          0|

---
class: inverse center middle

## Conclusions

---

## Conclusions

- Log-transformed model chosen as the best model, based on residual analysis and goodness of fit
- Still shows violations of assumptions
  - nonconstant variance of the residuals
  - non-normal residuals
- Includes outliers

Further analysis, for example bootstrapping, could help address some of these issues.
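
---

## Sketch: Bootstrapping the Coefficients

As a sketch of what such a follow-up could look like, a case-resampling bootstrap refits the model on rows sampled with replacement and uses the spread of the refitted coefficients for inference, without relying on normally distributed residuals. The example uses the built-in `mtcars` data as a stand-in, the model formula is illustrative, and the number of resamples (`B = 500`) is arbitrary.

```r
set.seed(2024)
dat  <- mtcars                      # stand-in data, not the emissions dataset
form <- log(mpg) ~ log(hp) + wt     # illustrative model with a logged response
B    <- 500                         # number of bootstrap resamples (arbitrary)

# Refit the model on each resample of rows and collect the coefficients
boot.coefs <- t(replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)
  coef(lm(form, data = dat[idx, ]))
}))

# Percentile 95% confidence intervals, one column per coefficient
ci <- apply(boot.coefs, 2, quantile, probs = c(0.025, 0.975))
ci
```

With factor predictors, rare levels may be missing from some resamples, so those refits would need extra handling.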

---
class: inverse center middle

## Thank you