library(tidyverse)

Exercise Week 11

Using R, build a regression model for data that interests you. Conduct residual analysis.

Was the linear model appropriate? Why or why not?

This data set records how a vehicle's CO2 emissions vary with its features. It comes from the Government of Canada's official open data website; this is a compiled version found on Kaggle that covers a period of 7 years. I decided to look at the relationship between CO2 emissions and engine size and build a linear model.

# import data set
emissions <- read_csv("/Users/dirkhartog/Desktop/CUNY_MSDS/DATA_605/WK11/archive (6)/CO2 Emissions_Canada.csv")
## Rows: 7385 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Make, Model, Vehicle Class, Transmission, Fuel Type
## dbl (7): Engine Size(L), Cylinders, Fuel Consumption City (L/100 km), Fuel C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# inspect data set
head(emissions)
## # A tibble: 6 × 12
##   Make  Model      `Vehicle Class` `Engine Size(L)` Cylinders Transmission
##   <chr> <chr>      <chr>                      <dbl>     <dbl> <chr>       
## 1 ACURA ILX        COMPACT                      2           4 AS5         
## 2 ACURA ILX        COMPACT                      2.4         4 M6          
## 3 ACURA ILX HYBRID COMPACT                      1.5         4 AV7         
## 4 ACURA MDX 4WD    SUV - SMALL                  3.5         6 AS6         
## 5 ACURA RDX AWD    SUV - SMALL                  3.5         6 AS6         
## 6 ACURA RLX        MID-SIZE                     3.5         6 AS6         
## # ℹ 6 more variables: `Fuel Type` <chr>,
## #   `Fuel Consumption City (L/100 km)` <dbl>,
## #   `Fuel Consumption Hwy (L/100 km)` <dbl>,
## #   `Fuel Consumption Comb (L/100 km)` <dbl>,
## #   `Fuel Consumption Comb (mpg)` <dbl>, `CO2 Emissions(g/km)` <dbl>
glimpse(emissions)
## Rows: 7,385
## Columns: 12
## $ Make                               <chr> "ACURA", "ACURA", "ACURA", "ACURA",…
## $ Model                              <chr> "ILX", "ILX", "ILX HYBRID", "MDX 4W…
## $ `Vehicle Class`                    <chr> "COMPACT", "COMPACT", "COMPACT", "S…
## $ `Engine Size(L)`                   <dbl> 2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, …
## $ Cylinders                          <dbl> 4, 4, 4, 6, 6, 6, 6, 6, 6, 4, 4, 6,…
## $ Transmission                       <chr> "AS5", "M6", "AV7", "AS6", "AS6", "…
## $ `Fuel Type`                        <chr> "Z", "Z", "Z", "Z", "Z", "Z", "Z", …
## $ `Fuel Consumption City (L/100 km)` <dbl> 9.9, 11.2, 6.0, 12.7, 12.1, 11.9, 1…
## $ `Fuel Consumption Hwy (L/100 km)`  <dbl> 6.7, 7.7, 5.8, 9.1, 8.7, 7.7, 8.1, …
## $ `Fuel Consumption Comb (L/100 km)` <dbl> 8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10…
## $ `Fuel Consumption Comb (mpg)`      <dbl> 33, 29, 48, 25, 27, 28, 28, 25, 24,…
## $ `CO2 Emissions(g/km)`              <dbl> 196, 221, 136, 255, 244, 230, 232, …

Exploratory Data Analysis

# Visualize the relationship between engine size and CO2 emissions

emissions %>% ggplot(mapping = aes(x = `Engine Size(L)`, 
                                   y = `CO2 Emissions(g/km)`)) +
  # geom_jitter() draws the points itself, so a separate geom_point() layer
  # would plot every observation twice
  geom_jitter()

We can clearly see that there is a positive relationship between engine size and CO2 emissions.
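
The strength of that relationship can also be put into a number with the Pearson correlation. The sketch below is only an illustration: it uses the six rows shown by head() above, copied into a small data frame whose name (toy) and column names (engine_size, co2) are my own and not part of the original data. On the full data the same call would use the backticked column names instead.

```r
# Six rows copied from the head(emissions) output above.
# The names toy, engine_size and co2 are illustrative, not from the data set.
toy <- data.frame(
  engine_size = c(2.0, 2.4, 1.5, 3.5, 3.5, 3.5),   # Engine Size(L)
  co2         = c(196, 221, 136, 255, 244, 230)    # CO2 Emissions(g/km)
)

# Pearson correlation between engine size and CO2 emissions
r <- cor(toy$engine_size, toy$co2)
r

# For a regression with a single predictor, r squared equals the model's R-squared
r^2
```

A positive r close to 1 agrees with what the scatter plot shows. Six rows are far too few to draw conclusions from; the point is only the mechanics of the call.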

Building the linear model - Original data set

engine_size_lm <- lm(emissions$`CO2 Emissions(g/km)` ~ emissions$`Engine Size(L)`, data = emissions)

summary(engine_size_lm)
## 
## Call:
## lm(formula = emissions$`CO2 Emissions(g/km)` ~ emissions$`Engine Size(L)`, 
##     data = emissions)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -113.309  -18.309   -1.442   19.079  142.914 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                134.3659     0.9075   148.1   <2e-16 ***
## emissions$`Engine Size(L)`  36.7773     0.2640   139.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.72 on 7383 degrees of freedom
## Multiple R-squared:  0.7244, Adjusted R-squared:  0.7244 
## F-statistic: 1.941e+04 on 1 and 7383 DF,  p-value: < 2.2e-16
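
One practical note on the call above: because the formula refers to emissions$ columns directly, predict() cannot substitute new engine sizes through newdata. A formula written with bare column names avoids this. The sketch below shows the idea on the six rows from head(); the data frame toy and the short column names are illustrative assumptions, and on the real data the backticked names (`CO2 Emissions(g/km)` ~ `Engine Size(L)`) would take their place.

```r
# Six rows copied from the head(emissions) output; names are illustrative.
toy <- data.frame(
  engine_size = c(2.0, 2.4, 1.5, 3.5, 3.5, 3.5),
  co2         = c(196, 221, 136, 255, 244, 230)
)

# Bare column names in the formula, with the data frame supplied via data =
fit <- lm(co2 ~ engine_size, data = toy)

# newdata must contain a column named exactly like the predictor
new_cars <- data.frame(engine_size = c(2.0, 3.0))
preds <- predict(fit, newdata = new_cars)
preds
```

The coefficients from this toy fit will differ from the ones reported above, since it uses 6 rows rather than 7,385.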

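We can interpret this model as follows: for every 1 liter increase in engine size, a vehicle is expected to emit an additional 36.78 g/km of CO2. The intercept of 134.37 g/km is the predicted emission for a car with a 0 L engine. Since no real vehicle has a 0 L engine, the intercept is an extrapolation that positions the line rather than a physically meaningful quantity.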
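
As a worked example of reading the fitted line, the sketch below plugs the first row of the data (ACURA ILX, 2.0 L, 196 g/km observed) into the coefficients reported by summary(). The variable names are my own; the numbers are copied from the output above.

```r
# Coefficients as reported in the summary() output above
intercept <- 134.3659   # g/km
slope     <- 36.7773    # g/km per liter of engine size

# First row of the data set: ACURA ILX
engine_size <- 2.0      # liters
observed    <- 196      # g/km

# Fitted value and residual (observed minus predicted) for this vehicle
predicted <- intercept + slope * engine_size
residual  <- observed - predicted

predicted   # 207.9205
residual    # -11.9205
```

The negative residual means the model overestimates this car's emissions by about 12 g/km, which is well within the residual standard error of 30.72.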

Model diagnostics

par(mfrow = c(2,2))
plot(engine_size_lm)
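
The two panels discussed in the conclusions, residuals vs. fitted and the normal Q-Q plot, can also be drawn by hand, which makes it clearer what each one contains. This is a minimal sketch on the six rows from head(), with illustrative names (toy, engine_size, co2); with the real model, residuals(engine_size_lm) and fitted(engine_size_lm) would be used instead.

```r
# Six rows copied from the head(emissions) output; names are illustrative.
toy <- data.frame(
  engine_size = c(2.0, 2.4, 1.5, 3.5, 3.5, 3.5),
  co2         = c(196, 221, 136, 255, 244, 230)
)
fit <- lm(co2 ~ engine_size, data = toy)

res         <- residuals(fit)   # observed minus fitted values
fitted_vals <- fitted(fit)      # values predicted by the line

par(mfrow = c(1, 2))

# Residuals vs. fitted: look for an even band around the dashed line at 0
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the reference line suggest near-normal residuals
qqnorm(res)
qqline(res)
```

A funnel shape or a curve in the first panel would point to non-constant variance or a non-linear relationship; heavy departures from the line in the second would point to non-normal residuals.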

Conclusions

The \(R^2\) of 0.7244 is fair: engine size alone explains about 72% of the variation in CO2 emissions. Looking at the residuals gives us more information about the model. The points in the residual plot are generally scattered around the horizontal line at 0, but there are far fewer points on the left side of the graph. The Q-Q plot closely follows the diagonal line, which gives us evidence that the residuals are nearly normal. It makes sense that many other factors contribute to the level of CO2 emissions from a motor vehicle, so a multiple linear regression is likely more appropriate in this case than relying solely on engine size.
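
A possible next step, sketched under assumptions: the code below adds combined fuel consumption as a second predictor and compares adjusted \(R^2\) between the two models. The choice of fuel consumption is mine, not something tested in this exercise, and the data frame is just the six rows from head() with illustrative column names, so the numbers say nothing about the full data set.

```r
# Six rows copied from the head()/glimpse() output; names are illustrative.
toy <- data.frame(
  engine_size = c(2.0, 2.4, 1.5, 3.5, 3.5, 3.5),     # Engine Size(L)
  fuel_comb   = c(8.5, 9.6, 5.9, 11.1, 10.6, 10.0),  # Fuel Consumption Comb (L/100 km)
  co2         = c(196, 221, 136, 255, 244, 230)      # CO2 Emissions(g/km)
)

simple_fit   <- lm(co2 ~ engine_size, data = toy)
multiple_fit <- lm(co2 ~ engine_size + fuel_comb, data = toy)

# Adjusted R-squared penalizes extra predictors, so it is the fairer comparison
summary(simple_fit)$adj.r.squared
summary(multiple_fit)$adj.r.squared

# F-test for whether the added predictor improves the fit
anova(simple_fit, multiple_fit)
```

Because CO2 output follows almost directly from the fuel burned, fuel consumption is likely to dominate such a model; predictors such as cylinders, vehicle class, or fuel type may be more informative choices for the full analysis.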