library(tidyverse)
This data set captures how a vehicle's CO2 emissions vary with its different features. The data were taken from the Government of Canada's official open data website; this is a compiled version found on Kaggle, covering a period of 7 years. I decided to look at the relationship between CO2 emissions and engine size and build a linear model.
# import data set
emissions <- read_csv("/Users/dirkhartog/Desktop/CUNY_MSDS/DATA_605/WK11/archive (6)/CO2 Emissions_Canada.csv")
## Rows: 7385 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Make, Model, Vehicle Class, Transmission, Fuel Type
## dbl (7): Engine Size(L), Cylinders, Fuel Consumption City (L/100 km), Fuel C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# inspect data set
head(emissions)
## # A tibble: 6 × 12
## Make Model `Vehicle Class` `Engine Size(L)` Cylinders Transmission
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 ACURA ILX COMPACT 2 4 AS5
## 2 ACURA ILX COMPACT 2.4 4 M6
## 3 ACURA ILX HYBRID COMPACT 1.5 4 AV7
## 4 ACURA MDX 4WD SUV - SMALL 3.5 6 AS6
## 5 ACURA RDX AWD SUV - SMALL 3.5 6 AS6
## 6 ACURA RLX MID-SIZE 3.5 6 AS6
## # ℹ 6 more variables: `Fuel Type` <chr>,
## # `Fuel Consumption City (L/100 km)` <dbl>,
## # `Fuel Consumption Hwy (L/100 km)` <dbl>,
## # `Fuel Consumption Comb (L/100 km)` <dbl>,
## # `Fuel Consumption Comb (mpg)` <dbl>, `CO2 Emissions(g/km)` <dbl>
glimpse(emissions)
## Rows: 7,385
## Columns: 12
## $ Make <chr> "ACURA", "ACURA", "ACURA", "ACURA",…
## $ Model <chr> "ILX", "ILX", "ILX HYBRID", "MDX 4W…
## $ `Vehicle Class` <chr> "COMPACT", "COMPACT", "COMPACT", "S…
## $ `Engine Size(L)` <dbl> 2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, …
## $ Cylinders <dbl> 4, 4, 4, 6, 6, 6, 6, 6, 6, 4, 4, 6,…
## $ Transmission <chr> "AS5", "M6", "AV7", "AS6", "AS6", "…
## $ `Fuel Type` <chr> "Z", "Z", "Z", "Z", "Z", "Z", "Z", …
## $ `Fuel Consumption City (L/100 km)` <dbl> 9.9, 11.2, 6.0, 12.7, 12.1, 11.9, 1…
## $ `Fuel Consumption Hwy (L/100 km)` <dbl> 6.7, 7.7, 5.8, 9.1, 8.7, 7.7, 8.1, …
## $ `Fuel Consumption Comb (L/100 km)` <dbl> 8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10…
## $ `Fuel Consumption Comb (mpg)` <dbl> 33, 29, 48, 25, 27, 28, 28, 25, 24,…
## $ `CO2 Emissions(g/km)` <dbl> 196, 221, 136, 255, 244, 230, 232, …
# Visualize the relationship between engine size and CO2 emissions
# geom_jitter() alone is enough: adding geom_point() as well draws every observation twice
emissions %>%
  ggplot(mapping = aes(x = `Engine Size(L)`,
                       y = `CO2 Emissions(g/km)`)) +
  geom_jitter()
We can clearly see that there is a positive relationship between engine size and CO2 emissions.
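One way to put a number on the strength of this relationship is Pearson's correlation coefficient. The sketch below is only an illustration: it uses the six rows shown in the `head()` output above (not the full data set), so its value should not be read as the correlation for all 7,385 vehicles. The vector names are chosen here for readability and are not part of the original analysis.

```r
# Engine size (L) and CO2 emissions (g/km) for the six rows printed by head()
engine_size <- c(2.0, 2.4, 1.5, 3.5, 3.5, 3.5)
co2         <- c(196, 221, 136, 255, 244, 230)

# Pearson correlation on this six-row sample only
r_toy <- cor(engine_size, co2)
r_toy

# On the full data the equivalent call would be:
# cor(emissions$`Engine Size(L)`, emissions$`CO2 Emissions(g/km)`)
```

A value close to 1 indicates a strong positive linear association, which is what the scatter plot suggests.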
engine_size_lm <- lm(emissions$`CO2 Emissions(g/km)` ~ emissions$`Engine Size(L)`, data = emissions)
summary(engine_size_lm)
##
## Call:
## lm(formula = emissions$`CO2 Emissions(g/km)` ~ emissions$`Engine Size(L)`,
## data = emissions)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113.309 -18.309 -1.442 19.079 142.914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.3659 0.9075 148.1 <2e-16 ***
## emissions$`Engine Size(L)` 36.7773 0.2640 139.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.72 on 7383 degrees of freedom
## Multiple R-squared: 0.7244, Adjusted R-squared: 0.7244
## F-statistic: 1.941e+04 on 1 and 7383 DF, p-value: < 2.2e-16
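We can interpret this model as follows: a hypothetical vehicle with a 0 L engine would still emit 134.37 g/km of CO2, and every 1 L increase in engine size is associated with an additional 36.78 g/km of CO2. Since a 0 L engine is not physically meaningful, the intercept is best read as an extrapolation rather than a real-world value.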
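As a quick worked example of this interpretation, the sketch below plugs the reported coefficients into the fitted line. The helper name `predict_co2` is hypothetical (it is not part of the original analysis), and the coefficients are copied by hand from the `summary()` output above rather than read from the model object.

```r
# Coefficients copied from the summary() output above
intercept <- 134.3659   # g/km at a (hypothetical) 0 L engine
slope     <- 36.7773    # additional g/km per 1 L of engine size

# Hypothetical helper: fitted CO2 emissions (g/km) for a given engine size (L)
predict_co2 <- function(engine_size_l) {
  intercept + slope * engine_size_l
}

# First row of the data: ACURA ILX, 2.0 L engine, observed 196 g/km
fitted_ilx   <- predict_co2(2.0)
residual_ilx <- 196 - fitted_ilx   # observed minus fitted

fitted_ilx     # 207.9205
residual_ilx   # -11.9205
```

So the model overestimates this particular vehicle by about 12 g/km. With the fitted model object available, `coef(engine_size_lm)` returns the same two coefficients.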
par(mfrow = c(2,2))
plot(engine_size_lm)
The \(R^2\) of 0.7244 is fair: engine size alone explains about 72% of the variance in CO2 emissions. The diagnostic plots give us some more information about the model. In the residuals vs. fitted plot, the points are generally scattered around the horizontal line at 0, although there are far fewer points on the left side of the graph. The Q-Q plot closely follows the diagonal line, which is evidence that the residuals are nearly normal. Many other factors plausibly contribute to the level of CO2 emissions from a motor vehicle, so a multiple linear regression is likely more appropriate here than relying solely on engine size.
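A minimal sketch of what such a multiple regression could look like is given below. It is not a result from this analysis: the data frame holds only the six rows visible in the `head()` and `glimpse()` output above, the shortened column names are an assumption made for readability, and combined fuel consumption is just one possible second predictor.

```r
# First six rows of the data, copied from the head()/glimpse() output above.
# Column names are shortened for readability (not the original names).
toy <- data.frame(
  engine_size = c(2.0, 2.4, 1.5, 3.5, 3.5, 3.5),      # Engine Size(L)
  fuel_comb   = c(8.5, 9.6, 5.9, 11.1, 10.6, 10.0),   # Fuel Consumption Comb (L/100 km)
  co2         = c(196, 221, 136, 255, 244, 230)       # CO2 Emissions(g/km)
)

# Simple model (engine size only) vs. a model with a second predictor
simple_fit   <- lm(co2 ~ engine_size, data = toy)
multiple_fit <- lm(co2 ~ engine_size + fuel_comb, data = toy)

# Compare the share of variance explained by each model
summary(simple_fit)$r.squared
summary(multiple_fit)$r.squared

# On the full data the analogous call would be:
# lm(`CO2 Emissions(g/km)` ~ `Engine Size(L)` + `Fuel Consumption Comb (L/100 km)`,
#    data = emissions)
```

Because adding a predictor can never lower \(R^2\), the adjusted \(R^2\) (or performance on held-out data) is the fairer basis for deciding whether the extra variable is worth including.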