As the world’s population continues to grow, the use of automobiles has become an essential part of modern life. However, the emissions produced by automobiles have become a major concern for scientists and policymakers. Carbon dioxide (CO2) is one of the primary gases produced by cars and is known to have a significant impact on the environment.
In this research project, a linear regression model is presented that analyzes a dataset of cars from a Canadian dealership to identify the characteristics of cars that have the greatest impact on CO2 emissions. By doing so, valuable information can be provided to policymakers and car manufacturers to help them reduce the environmental impact of automobiles.
In the end, we will have a linear model with Box-Cox transformation that features CO2 emissions as the dependent variable and various car characteristics as independent variables.
Database used:
FuelConsumption
| Code | Fuel |
|---|---|
| D | Diesel |
| E | Ethanol |
| X | Gasoline |
| Z | Premium Gasoline |
| Code | Transmission |
|---|---|
| A | Automatic |
| AM | Automated Manual |
| AS | Sequential Automatic |
| AV | Continuously Variable Automatic |
| M | Manual |
pacotes <- c("kableExtra", "utils", "plotly", "dplyr", "rstatix", "jtools", "equatiomatic", "cowplot", "olsrr", "nortest", "car", "PerformanceAnalytics", "fastDummies", "ggplot2")
lapply(pacotes, library, character.only = TRUE)
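Loading fails with an error if any of these packages is not installed. The following is a minimal sketch, in base R only, of how the missing ones could be listed beforehand; the name `missing_pkgs` is introduced here purely for illustration, and the install call is left commented out.

```r
# Same package vector as used in the document
pacotes <- c("kableExtra", "utils", "plotly", "dplyr", "rstatix", "jtools",
             "equatiomatic", "cowplot", "olsrr", "nortest", "car",
             "PerformanceAnalytics", "fastDummies", "ggplot2")

# Packages from the vector that are not yet installed on this machine
missing_pkgs <- setdiff(pacotes, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```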
FuelConsumption <- read.csv("FuelConsumption.csv")
save(FuelConsumption, file = "FuelConsumption.RData")
View(FuelConsumption)
| MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |
| 2014 | ACURA | RLX | MID-SIZE | 3.5 | 6 | AS6 | Z | 11.9 | 7.7 | 10.0 | 28 | 230 |
| 2014 | ACURA | TL | MID-SIZE | 3.5 | 6 | AS6 | Z | 11.8 | 8.1 | 10.1 | 28 | 232 |
| 2014 | ACURA | TL AWD | MID-SIZE | 3.7 | 6 | AS6 | Z | 12.8 | 9.0 | 11.1 | 25 | 255 |
| 2014 | ACURA | TL AWD | MID-SIZE | 3.7 | 6 | M6 | Z | 13.4 | 9.5 | 11.6 | 24 | 267 |
| 2014 | ACURA | TSX | COMPACT | 2.4 | 4 | AS5 | Z | 10.6 | 7.5 | 9.2 | 31 | 212 |
The variable MODELYEAR, which represents the year of manufacture of the automobile, has the same value in all observations (the complete database can be viewed in the link provided at the beginning of this document). Therefore, we can conclude that it will not have any influence on the model and will result in unnecessary data processing. For this reason, we will remove it from our database.
dropyear <- FuelConsumption[-c(1)]
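Rather than relying on inspection, constant columns can also be detected programmatically. The sketch below uses a small made-up data frame (`toy`, not the real dataset) and drops every column that has a single distinct value; it illustrates the idea and is not the code used in this analysis.

```r
# Toy data standing in for the real dataset (values are illustrative)
toy <- data.frame(
  MODELYEAR = rep(2014L, 4),
  MAKE      = c("ACURA", "ACURA", "BMW", "FORD"),
  CYLINDERS = c(4L, 4L, 6L, 8L),
  stringsAsFactors = FALSE
)

# TRUE for columns that contain only one distinct value
is_constant <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))
names(toy)[is_constant]

# Keep only the columns that actually vary
toy_clean <- toy[, !is_constant, drop = FALSE]
names(toy_clean)
```

Dropping by detected name instead of by position also keeps the step correct if the column order of the CSV ever changes.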
Furthermore, the variable MODEL has levels that are rarely repeated across the 1067 observations. Consequently, it cannot form meaningful patterns or trends that would help the model make accurate predictions or inferences. Therefore, we will also eliminate this variable.
dropyear2 <- dropyear[-c(2)]
Viewing the new database:
View(dropyear2)
| MAKE | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|
| ACURA | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| ACURA | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| ACURA | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| ACURA | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| ACURA | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |
| ACURA | MID-SIZE | 3.5 | 6 | AS6 | Z | 11.9 | 7.7 | 10.0 | 28 | 230 |
| ACURA | MID-SIZE | 3.5 | 6 | AS6 | Z | 11.8 | 8.1 | 10.1 | 28 | 232 |
| ACURA | MID-SIZE | 3.7 | 6 | AS6 | Z | 12.8 | 9.0 | 11.1 | 25 | 255 |
| ACURA | MID-SIZE | 3.7 | 6 | M6 | Z | 13.4 | 9.5 | 11.6 | 24 | 267 |
| ACURA | COMPACT | 2.4 | 4 | AS5 | Z | 10.6 | 7.5 | 9.2 | 31 | 212 |
glimpse(dropyear2)
## Rows: 1,067
## Columns: 11
## $ MAKE <chr> "ACURA", "ACURA", "ACURA", "ACURA", "ACURA", …
## $ VEHICLECLASS <chr> "COMPACT", "COMPACT", "COMPACT", "SUV - SMALL…
## $ ENGINESIZE <dbl> 2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7, …
## $ CYLINDERS <int> 4, 4, 4, 6, 6, 6, 6, 6, 6, 4, 4, 6, 12, 12, 8…
## $ TRANSMISSION <chr> "AS5", "M6", "AV7", "AS6", "AS6", "AS6", "AS6…
## $ FUELTYPE <chr> "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", …
## $ FUELCONSUMPTION_CITY <dbl> 9.9, 11.2, 6.0, 12.7, 12.1, 11.9, 11.8, 12.8,…
## $ FUELCONSUMPTION_HWY <dbl> 6.7, 7.7, 5.8, 9.1, 8.7, 7.7, 8.1, 9.0, 9.5, …
## $ FUELCONSUMPTION_COMB <dbl> 8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10.1, 11.1, …
## $ FUELCONSUMPTION_COMB_MPG <int> 33, 29, 48, 25, 27, 28, 28, 25, 24, 31, 29, 2…
## $ CO2EMISSIONS <int> 196, 221, 136, 255, 244, 230, 232, 255, 267, …
Here, we can see that some of the variables are typed as “integer.” Since some functions to be used during this modeling require variables to be of type “numeric,” we will change the type of the “integer” variables to “numeric.”
FuelConsumption$CYLINDERS <- as.numeric(dropyear$CYLINDERS)
FuelConsumption$FUELCONSUMPTION_COMB_MPG <- as.numeric(dropyear$FUELCONSUMPTION_COMB_MPG)
FuelConsumption$CO2EMISSIONS <- as.numeric(dropyear$CO2EMISSIONS)
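Checking the result of the changes in variable types. Note that the assignments above write to FuelConsumption, so dropyear (inspected here) and dropyear2 (used for modeling) still list these columns as <int>; lm() accepts integer columns, so the results that follow are unaffected: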
glimpse(dropyear)
## Rows: 1,067
## Columns: 12
## $ MAKE <chr> "ACURA", "ACURA", "ACURA", "ACURA", "ACURA", …
## $ MODEL <chr> "ILX", "ILX", "ILX HYBRID", "MDX 4WD", "RDX A…
## $ VEHICLECLASS <chr> "COMPACT", "COMPACT", "COMPACT", "SUV - SMALL…
## $ ENGINESIZE <dbl> 2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7, …
## $ CYLINDERS <int> 4, 4, 4, 6, 6, 6, 6, 6, 6, 4, 4, 6, 12, 12, 8…
## $ TRANSMISSION <chr> "AS5", "M6", "AV7", "AS6", "AS6", "AS6", "AS6…
## $ FUELTYPE <chr> "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", …
## $ FUELCONSUMPTION_CITY <dbl> 9.9, 11.2, 6.0, 12.7, 12.1, 11.9, 11.8, 12.8,…
## $ FUELCONSUMPTION_HWY <dbl> 6.7, 7.7, 5.8, 9.1, 8.7, 7.7, 8.1, 9.0, 9.5, …
## $ FUELCONSUMPTION_COMB <dbl> 8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10.1, 11.1, …
## $ FUELCONSUMPTION_COMB_MPG <int> 33, 29, 48, 25, 27, 28, 28, 25, 24, 31, 29, 2…
## $ CO2EMISSIONS <int> 196, 221, 136, 255, 244, 230, 232, 255, 267, …
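If the conversion is wanted in the data frame that feeds the model, it has to be applied to that object. Here is a minimal base R sketch on a made-up data frame (`toy` and `int_cols` are illustrative names) that converts every integer column in place and leaves the others untouched:

```r
# Toy data with a mix of character, double and integer columns
toy <- data.frame(
  MAKE         = c("ACURA", "BMW"),
  ENGINESIZE   = c(2.0, 3.0),
  CYLINDERS    = c(4L, 6L),
  CO2EMISSIONS = c(196L, 255L),
  stringsAsFactors = FALSE
)

# Locate integer columns and convert only those to numeric (double)
int_cols <- vapply(toy, is.integer, logical(1))
toy[int_cols] <- lapply(toy[int_cols], as.numeric)

str(toy)
```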
summary(FuelConsumption)
## MODELYEAR MAKE MODEL VEHICLECLASS
## Min. :2014 Length:1067 Length:1067 Length:1067
## 1st Qu.:2014 Class :character Class :character Class :character
## Median :2014 Mode :character Mode :character Mode :character
## Mean :2014
## 3rd Qu.:2014
## Max. :2014
## ENGINESIZE CYLINDERS TRANSMISSION FUELTYPE
## Min. :1.000 Min. : 3.000 Length:1067 Length:1067
## 1st Qu.:2.000 1st Qu.: 4.000 Class :character Class :character
## Median :3.400 Median : 6.000 Mode :character Mode :character
## Mean :3.346 Mean : 5.795
## 3rd Qu.:4.300 3rd Qu.: 8.000
## Max. :8.400 Max. :12.000
## FUELCONSUMPTION_CITY FUELCONSUMPTION_HWY FUELCONSUMPTION_COMB
## Min. : 4.60 Min. : 4.900 Min. : 4.70
## 1st Qu.:10.25 1st Qu.: 7.500 1st Qu.: 9.00
## Median :12.60 Median : 8.800 Median :10.90
## Mean :13.30 Mean : 9.475 Mean :11.58
## 3rd Qu.:15.55 3rd Qu.:10.850 3rd Qu.:13.35
## Max. :30.20 Max. :20.500 Max. :25.80
## FUELCONSUMPTION_COMB_MPG CO2EMISSIONS
## Min. :11.00 Min. :108.0
## 1st Qu.:21.00 1st Qu.:207.0
## Median :26.00 Median :251.0
## Mean :26.44 Mean :256.2
## 3rd Qu.:31.00 3rd Qu.:294.0
## Max. :60.00 Max. :488.0
Categories of the categorical variables:
levels(factor(FuelConsumption$MAKE))
## [1] "ACURA" "ASTON MARTIN" "AUDI" "BENTLEY"
## [5] "BMW" "BUICK" "CADILLAC" "CHEVROLET"
## [9] "CHRYSLER" "DODGE" "FIAT" "FORD"
## [13] "GMC" "HONDA" "HYUNDAI" "INFINITI"
## [17] "JAGUAR" "JEEP" "KIA" "LAMBORGHINI"
## [21] "LAND ROVER" "LEXUS" "LINCOLN" "MASERATI"
## [25] "MAZDA" "MERCEDES-BENZ" "MINI" "MITSUBISHI"
## [29] "NISSAN" "PORSCHE" "RAM" "ROLLS-ROYCE"
## [33] "SCION" "SMART" "SRT" "SUBARU"
## [37] "TOYOTA" "VOLKSWAGEN" "VOLVO"
levels(factor(FuelConsumption$VEHICLECLASS))
## [1] "COMPACT" "FULL-SIZE"
## [3] "MID-SIZE" "MINICOMPACT"
## [5] "MINIVAN" "PICKUP TRUCK - SMALL"
## [7] "PICKUP TRUCK - STANDARD" "SPECIAL PURPOSE VEHICLE"
## [9] "STATION WAGON - MID-SIZE" "STATION WAGON - SMALL"
## [11] "SUBCOMPACT" "SUV - SMALL"
## [13] "SUV - STANDARD" "TWO-SEATER"
## [15] "VAN - CARGO" "VAN - PASSENGER"
levels(factor(FuelConsumption$TRANSMISSION))
## [1] "A4" "A5" "A6" "A7" "A8" "A9" "AM5" "AM6" "AM7" "AS4" "AS5" "AS6"
## [13] "AS7" "AS8" "AS9" "AV" "AV6" "AV7" "AV8" "M5" "M6" "M7"
levels(factor(FuelConsumption$FUELTYPE))
## [1] "D" "E" "X" "Z"
Frequency of the categorical variables:
table(FuelConsumption$MAKE)
##
## ACURA ASTON MARTIN AUDI BENTLEY BMW
## 12 7 49 8 64
## BUICK CADILLAC CHEVROLET CHRYSLER DODGE
## 16 32 86 19 39
## FIAT FORD GMC HONDA HYUNDAI
## 10 90 49 21 24
## INFINITI JAGUAR JEEP KIA LAMBORGHINI
## 21 22 31 33 3
## LAND ROVER LEXUS LINCOLN MASERATI MAZDA
## 19 22 11 6 27
## MERCEDES-BENZ MINI MITSUBISHI NISSAN PORSCHE
## 59 36 16 33 44
## RAM ROLLS-ROYCE SCION SMART SRT
## 13 7 9 2 2
## SUBARU TOYOTA VOLKSWAGEN VOLVO
## 23 49 42 11
table(FuelConsumption$VEHICLECLASS)
##
## COMPACT FULL-SIZE MID-SIZE
## 172 86 178
## MINICOMPACT MINIVAN PICKUP TRUCK - SMALL
## 47 14 12
## PICKUP TRUCK - STANDARD SPECIAL PURPOSE VEHICLE STATION WAGON - MID-SIZE
## 62 7 6
## STATION WAGON - SMALL SUBCOMPACT SUV - SMALL
## 36 65 154
## SUV - STANDARD TWO-SEATER VAN - CARGO
## 110 71 22
## VAN - PASSENGER
## 25
table(FuelConsumption$TRANSMISSION)
##
## A4 A5 A6 A7 A8 A9 AM5 AM6 AM7 AS4 AS5 AS6 AS7 AS8 AS9 AV AV6 AV7 AV8 M5
## 45 30 222 12 87 8 2 6 34 1 10 189 76 80 2 46 11 5 3 48
## M6 M7
## 141 9
table(FuelConsumption$FUELTYPE)
##
## D E X Z
## 27 92 514 434
chart.Correlation((FuelConsumption[, c(5, 6, 9:13)]), histogram = TRUE)
[Figure: correlation chart of ENGINESIZE, CYLINDERS, FUELCONSUMPTION_CITY, FUELCONSUMPTION_HWY, FUELCONSUMPTION_COMB, FUELCONSUMPTION_COMB_MPG, and CO2EMISSIONS]
As expected, the fuel consumption variables are highly correlated with one another, since they measure the same quantity in different driving contexts (city, highway, combined). A car that consumes more fuel in the city will generally also consume more on the highway, so these variables move together, while FUELCONSUMPTION_COMB_MPG, which measures distance travelled per unit of fuel, moves in the opposite direction. This is a typical example of collinearity, which we will address with the stepwise algorithm and VIF checks to avoid issues in the final model.
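Part of this collinearity appears to be built into the data. The ten sample rows shown earlier are consistent with FUELCONSUMPTION_COMB_MPG being FUELCONSUMPTION_COMB converted from litres per 100 km to miles per imperial gallon. That unit interpretation is an assumption inferred from the values, not something stated in this document; the sketch below simply checks it on those ten rows.

```r
# Combined consumption and mpg values copied from the ten sample rows above
comb <- c(8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10.1, 11.1, 11.6, 9.2)
mpg  <- c(33, 29, 48, 25, 27, 28, 28, 25, 24, 31)

# Assumed conversion: litres per 100 km -> miles per imperial gallon
litres_per_imp_gallon <- 4.54609
km_per_mile <- 1.609344
conv <- 100 * litres_per_imp_gallon / km_per_mile  # about 282.48

# Rounded conversion reproduces the mpg column on these rows
mpg_from_comb <- round(conv / comb)
all(mpg_from_comb == mpg)

# Strong negative correlation between the two versions of the same quantity
cor(comb, mpg)
```

If this holds for the full dataset, one of the two variables is a deterministic (non-linear) function of the other, which is why keeping both in a model inflates their VIF values.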
Before we proceed, we need to convert our categorical variables into numerical values. For this purpose, we will use dummy coding, also known as dummification: one-hot encoding in which one category per variable is dropped and serves as the reference. In this case, the most frequent category of each categorical variable will be used as the reference for estimating the parameter values of the dummy variables. You can check the frequency of these variables in the “Statistics of the variables” section above.
# Dummy-code all four categorical variables in one call; for each variable the
# original column and the dummy of its most frequent category are dropped
FuelConsumption_dummies <- dummy_columns(.data = dropyear2,
                                         select_columns = c("MAKE", "VEHICLECLASS", "TRANSMISSION", "FUELTYPE"),
                                         remove_selected_columns = TRUE,
                                         remove_most_frequent_dummy = TRUE)
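To make the reference-category idea concrete, here is a base R sketch that performs the same kind of encoding on a made-up FUELTYPE column. It uses `relevel()` and `model.matrix()` instead of `dummy_columns()`, so the column names differ from those produced above; the data and the names `toy`, `freq`, `ref` and `dummies` are illustrative.

```r
# Toy data: "X" is the most frequent fuel type
toy <- data.frame(
  FUELTYPE     = c("X", "X", "X", "Z", "Z", "D"),
  CO2EMISSIONS = c(200, 210, 190, 230, 240, 260)
)

# Pick the most frequent category as the reference level
freq <- table(toy$FUELTYPE)
ref  <- names(freq)[which.max(freq)]
toy$FUELTYPE <- relevel(factor(toy$FUELTYPE), ref = ref)

# model.matrix() creates one 0/1 column per non-reference level;
# the first column is the intercept, which we drop
dummies <- model.matrix(~ FUELTYPE, data = toy)[, -1, drop = FALSE]
colnames(dummies)
dummies
```

Rows belonging to the reference category have zeros in every dummy column, so each dummy coefficient in the regression is read as a difference relative to that category.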
With the categorical variables dummified, we can start building our model.
The dummification process significantly increased our number of
variables, from 11 to 90, so it’s unlikely that all of these variables
are significant for our model. Additionally, in our correlation
analysis, we observed the presence of collinearity among some variables.
To eliminate non-significant variables, we will create an initial OLS
linear model with all the variables that can be used as a baseline
during the step function.
modelo_FuelConsumption <- lm(formula = CO2EMISSIONS ~ ., data = FuelConsumption_dummies)
Now that we have our initial model, we can run the Stepwise algorithm
on it using the step function:
step_FuelConsumption <- step(modelo_FuelConsumption, k = qchisq(p = 0.05, df = 1, lower.tail = FALSE))
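In step(), “k” is the penalty charged per estimated parameter in the information criterion (k = 2 gives the usual AIC). Setting k to the chi-square critical value with 1 degree of freedom at the 5% level (about 3.84) means that a variable stays in the model only if it improves the fit by more than that threshold, which approximates variable selection at a 5% significance level (95% confidence).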
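The sketch below shows where the constant comes from and how it changes the behaviour of `step()`. It uses the built-in `mtcars` data rather than the fuel consumption dataset, so the selected variables are purely illustrative.

```r
# Chi-square critical value for 1 degree of freedom at the 5% level
k_5pct <- qchisq(p = 0.05, df = 1, lower.tail = FALSE)
k_5pct

# It equals the square of the two-sided 5% normal critical value (about 1.96)
all.equal(k_5pct, qnorm(0.975)^2)

# Stepwise selection on a built-in dataset with the default and the stricter penalty
full     <- lm(mpg ~ ., data = mtcars)
sel_aic  <- step(full, trace = 0)               # default k = 2 (AIC)
sel_5pct <- step(full, k = k_5pct, trace = 0)   # penalty used in this document

formula(sel_aic)
formula(sel_5pct)
```

Because the penalty per parameter is larger than 2, the second call demands a bigger improvement in fit before keeping a variable than plain AIC does.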
Let’s use the summary function to evaluate the parameters of our model:
summary(step_FuelConsumption)
##
## Call:
## lm(formula = CO2EMISSIONS ~ CYLINDERS + FUELCONSUMPTION_COMB +
## FUELCONSUMPTION_COMB_MPG + MAKE_BMW + MAKE_PORSCHE + VEHICLECLASS_COMPACT +
## `VEHICLECLASS_PICKUP TRUCK - STANDARD` + `VEHICLECLASS_SUV - STANDARD` +
## `VEHICLECLASS_TWO-SEATER` + TRANSMISSION_AV + FUELTYPE_D +
## FUELTYPE_E, data = FuelConsumption_dummies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.290 -2.161 -0.691 0.938 38.633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.42465 3.53340 24.459 < 2e-16
## CYLINDERS 1.70864 0.17206 9.930 < 2e-16
## FUELCONSUMPTION_COMB 17.93986 0.17884 100.310 < 2e-16
## FUELCONSUMPTION_COMB_MPG -1.49883 0.06918 -21.665 < 2e-16
## MAKE_BMW -2.47424 0.71285 -3.471 0.00054
## MAKE_PORSCHE -2.47898 0.84606 -2.930 0.00346
## VEHICLECLASS_COMPACT 1.25355 0.48312 2.595 0.00960
## `VEHICLECLASS_PICKUP TRUCK - STANDARD` 2.27570 0.74216 3.066 0.00222
## `VEHICLECLASS_SUV - STANDARD` 1.51029 0.58230 2.594 0.00963
## `VEHICLECLASS_TWO-SEATER` 1.52927 0.68209 2.242 0.02517
## TRANSMISSION_AV 2.71432 0.89150 3.045 0.00239
## FUELTYPE_D 32.92162 1.09034 30.194 < 2e-16
## FUELTYPE_E -110.37148 0.88755 -124.356 < 2e-16
##
## (Intercept) ***
## CYLINDERS ***
## FUELCONSUMPTION_COMB ***
## FUELCONSUMPTION_COMB_MPG ***
## MAKE_BMW ***
## MAKE_PORSCHE **
## VEHICLECLASS_COMPACT **
## `VEHICLECLASS_PICKUP TRUCK - STANDARD` **
## `VEHICLECLASS_SUV - STANDARD` **
## `VEHICLECLASS_TWO-SEATER` *
## TRANSMISSION_AV **
## FUELTYPE_D ***
## FUELTYPE_E ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.355 on 1054 degrees of freedom
## Multiple R-squared: 0.9929, Adjusted R-squared: 0.9929
## F-statistic: 1.236e+04 on 12 and 1054 DF, p-value: < 2.2e-16
We can see that the step function kept both FUELCONSUMPTION_COMB and
FUELCONSUMPTION_COMB_MPG despite the high correlation between them
shown in the correlation chart. Let’s check the values of the Variance
Inflation Factors (VIF) using the vif function.
vif(step_FuelConsumption)
## CYLINDERS FUELCONSUMPTION_COMB
## 3.556295 14.448422
## FUELCONSUMPTION_COMB_MPG MAKE_BMW
## 9.925956 1.066287
## MAKE_PORSCHE VEHICLECLASS_COMPACT
## 1.053243 1.174497
## `VEHICLECLASS_PICKUP TRUCK - STANDARD` `VEHICLECLASS_SUV - STANDARD`
## 1.121868 1.166800
## `VEHICLECLASS_TWO-SEATER` TRANSMISSION_AV
## 1.075468 1.220163
## FUELTYPE_D FUELTYPE_E
## 1.091233 2.309771
A VIF of 1 means that a variable is uncorrelated with the other predictors, and values above roughly 5 to 10 are commonly taken as a sign of problematic collinearity. By that standard, FUELCONSUMPTION_COMB (14.45) and FUELCONSUMPTION_COMB_MPG (9.93) are highly collinear. Since high collinearity can be detrimental to the model, let’s eliminate the variable with the highest VIF, FUELCONSUMPTION_COMB, to avoid multicollinearity.
step_FuelConsumption2 <- update(step_FuelConsumption, CO2EMISSIONS ~ .- FUELCONSUMPTION_COMB, data = FuelConsumption_dummies)
Looking at the new VIF values:
vif(step_FuelConsumption2)
## CYLINDERS FUELCONSUMPTION_COMB_MPG
## 2.766158 3.880821
## MAKE_BMW MAKE_PORSCHE
## 1.053868 1.033862
## VEHICLECLASS_COMPACT `VEHICLECLASS_PICKUP TRUCK - STANDARD`
## 1.174482 1.121858
## `VEHICLECLASS_SUV - STANDARD` `VEHICLECLASS_TWO-SEATER`
## 1.163437 1.073321
## TRANSMISSION_AV FUELTYPE_D
## 1.194467 1.091230
## FUELTYPE_E
## 1.340661
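For intuition about what these numbers mean, the VIF of a predictor can be computed by hand as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below does this in base R on the built-in `mtcars` data; for models with only single-column numeric terms this should agree with `car::vif`, but that equivalence is stated here as an expectation rather than verified against this document's model.

```r
# Illustrative model on a built-in dataset
predictors <- c("wt", "hp", "disp")
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# VIF_j = 1 / (1 - R2_j), with R2_j from regressing predictor j on the others
vif_manual <- vapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
}, numeric(1))

vif_manual
```

A predictor that the others explain almost perfectly has R² close to 1 and therefore a very large VIF, which is the pattern seen for the fuel consumption variables.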
We can see that the values have reduced significantly. Let’s use the
step function on the new model, and then we will test whether it
exhibits heteroscedasticity (presence of non-constant variance in the
model residuals) and whether its residuals adhere to normality. We will
use the Breusch-Pagan test (ols_test_breusch_pagan) and the
Shapiro-Francia test (sf.test) for these purposes,
respectively.
step_FuelConsumption3 <- step(step_FuelConsumption2, k = 3.841459)
ols_test_breusch_pagan(step_FuelConsumption3)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ----------------------------------------
## Response : CO2EMISSIONS
## Variables: fitted values of CO2EMISSIONS
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 99.15354
## Prob > Chi2 = 2.336652e-23
sf.test(step_FuelConsumption3$residuals)
##
## Shapiro-Francia normality test
##
## data: step_FuelConsumption3$residuals
## W = 0.88059, p-value < 2.2e-16
Based on the results of the tests, it can be concluded that the current model has heteroscedasticity in its residuals, because the p-value of the Breusch-Pagan test (2.336652e-23) is lower than the commonly used significance level of 0.05. It can also be concluded that the residuals do not follow a normal distribution, since the p-value of the Shapiro-Francia test (< 2.2e-16) is also lower than 0.05.
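To show what the two tests measure, here is a base R sketch on the built-in `mtcars` data. It computes the studentized (Koenker) variant of the Breusch-Pagan statistic and the Shapiro-Francia W statistic by hand. Both are assumptions about form: the statistic reported by `ols_test_breusch_pagan` may use a different variant and give different numbers, and the normal scores below use a common approximation (`ppoints` with `a = 3/8`) rather than constants confirmed from `sf.test`.

```r
# Illustrative model on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)
n   <- length(res)

# Breusch-Pagan idea: do the squared residuals depend on the fitted values?
e2   <- res^2
yhat <- fitted(fit)
aux  <- lm(e2 ~ yhat)
bp_stat <- n * summary(aux)$r.squared                 # studentized (Koenker) form
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Shapiro-Francia idea: squared correlation between the ordered residuals
# and approximate expected normal order statistics
normal_scores <- qnorm(ppoints(n, a = 3/8))
sf_W <- cor(sort(res), normal_scores)^2

c(bp_stat = bp_stat, bp_p = bp_p, sf_W = sf_W)
```

A small Breusch-Pagan p-value points to non-constant variance, and a W well below 1 points to residuals that depart from a normal distribution.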
Although a model with these characteristics still has predictive potential, such characteristics may indicate that the model could be further improved. We will attempt to enhance the predictive capability of our model by applying a Box-Cox transformation to its dependent variable.
The Box-Cox transformation is a statistical technique used to
stabilize variance and/or approximate the distribution of data to a
normal distribution. Therefore, it can help reduce model
heteroscedasticity and improve the adherence of its residuals to
normality. The Box-Cox transformation is defined by a parametric
equation where a parameter lambda (λ) is applied to the data. Let’s
determine the value of lambda (λ) for our dependent variable using the
powerTransform function from the car
package.
lambda_BC <- powerTransform(dropyear2$CO2EMISSIONS)
lambda_BC
## Estimated transformation parameter
## dropyear2$CO2EMISSIONS
## 0.1278504
With the value of lambda (λ) in hand, all that remains is to perform the transformation of our dependent variable. We know that the transformation to be applied varies according to the value of lambda (λ):
If lambda is equal to 0, a logarithmic transformation is applied to the data. If lambda is different from 0, the transformation is given by the formula: ((x^lambda) - 1) / lambda, where x is the original value of the data.
Since our lambda (λ) is different from 0, we just need to execute the following code to insert our transformed dependent variable into our database:
FuelConsumption_dummies$bcCO2EMISSIONS <- (((dropyear2$CO2EMISSIONS ^ lambda_BC$lambda) - 1) / lambda_BC$lambda)
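The transformation and its inverse can be wrapped in two small helper functions. The sketch below is illustrative (the names `box_cox` and `inv_box_cox` are not from any package); it uses the lambda estimated above and a few CO2 values from the sample rows to confirm that the inverse recovers the original data.

```r
# Box-Cox transformation: log for lambda = 0, power form otherwise
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Inverse transformation back to the original scale
inv_box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) exp(y) else (lambda * y + 1)^(1 / lambda)
}

lambda <- 0.1278504                    # value estimated by powerTransform above
co2    <- c(196, 221, 136, 255, 244)   # CO2EMISSIONS from the first sample rows

bc <- box_cox(co2, lambda)
bc

# Round trip: transforming and inverting returns the original values
all.equal(inv_box_cox(bc, lambda), co2)
```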
We will estimate a new model using the dependent variable with a Box-Cox transformation.
modelo_bc_FuelConsumption <- lm(formula = bcCO2EMISSIONS ~ .-CO2EMISSIONS,
data = FuelConsumption_dummies)
Since this is a new model, we need to run the Stepwise algorithm
again using the step function to eliminate non-significant
variables.
step_bc_FuelConsumption <- step(modelo_bc_FuelConsumption, k = 3.841459)
Checking the VIF (Variance Inflation Factor) values in our new model:
vif(step_bc_FuelConsumption)
## CYLINDERS FUELCONSUMPTION_CITY
## 4.533626 232.067828
## FUELCONSUMPTION_COMB FUELCONSUMPTION_COMB_MPG
## 226.916848 12.118356
## MAKE_BMW MAKE_CHRYSLER
## 1.055734 1.085185
## MAKE_DODGE MAKE_MINI
## 1.092760 1.100341
## MAKE_MITSUBISHI `VEHICLECLASS_PICKUP TRUCK - SMALL`
## 1.483501 1.075288
## `VEHICLECLASS_PICKUP TRUCK - STANDARD` `VEHICLECLASS_SUV - STANDARD`
## 1.298780 1.352225
## `VEHICLECLASS_VAN - PASSENGER` TRANSMISSION_A4
## 1.614018 1.572269
## TRANSMISSION_AV TRANSMISSION_AV6
## 1.336003 1.551947
## TRANSMISSION_M5 FUELTYPE_D
## 1.169629 1.120641
## FUELTYPE_E
## 2.739652
As expected, once again, the variables related to fuel consumption
exhibit high collinearity. Note also that the step function did not
eliminate the FUELCONSUMPTION_CITY variable, as it had in the model
without the Box-Cox transformation. We will remove the two variables
with the highest VIF values, FUELCONSUMPTION_CITY and
FUELCONSUMPTION_COMB, to reduce multicollinearity in the model:
step_bc_FuelConsumption2 <- update(step_bc_FuelConsumption, bcCO2EMISSIONS ~ .- FUELCONSUMPTION_COMB - FUELCONSUMPTION_CITY, data = FuelConsumption_dummies)
Since the variables in our model have changed, we need to use the
step function again:
step_bc_FuelConsumption3 <- step(step_bc_FuelConsumption2, k = 3.841459)
Checking the new VIF (Variance Inflation Factor) values:
vif(step_bc_FuelConsumption3)
## CYLINDERS FUELCONSUMPTION_COMB_MPG
## 2.818367 4.070345
## MAKE_BMW MAKE_CHRYSLER
## 1.030173 1.024935
## MAKE_DODGE MAKE_MITSUBISHI
## 1.025345 1.402096
## `VEHICLECLASS_PICKUP TRUCK - SMALL` `VEHICLECLASS_PICKUP TRUCK - STANDARD`
## 1.041477 1.166559
## `VEHICLECLASS_SUV - STANDARD` `VEHICLECLASS_VAN - PASSENGER`
## 1.222065 1.221950
## TRANSMISSION_A4 TRANSMISSION_AV
## 1.202790 1.182870
## TRANSMISSION_AV6 FUELTYPE_D
## 1.407417 1.093097
## FUELTYPE_E
## 1.386989
With significantly reduced VIF values, we can now proceed with the
Breusch-Pagan tests (ols_test_breusch_pagan) and
Shapiro-Francia tests (sf.test). We will also take the
opportunity to evaluate the model’s parameters using
summary:
summary(step_bc_FuelConsumption3)
##
## Call:
## lm(formula = bcCO2EMISSIONS ~ CYLINDERS + FUELCONSUMPTION_COMB_MPG +
## MAKE_BMW + MAKE_CHRYSLER + MAKE_DODGE + MAKE_MITSUBISHI +
## `VEHICLECLASS_PICKUP TRUCK - SMALL` + `VEHICLECLASS_PICKUP TRUCK - STANDARD` +
## `VEHICLECLASS_SUV - STANDARD` + `VEHICLECLASS_VAN - PASSENGER` +
## TRANSMISSION_A4 + TRANSMISSION_AV + TRANSMISSION_AV6 + FUELTYPE_D +
## FUELTYPE_E, data = FuelConsumption_dummies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23102 -0.03972 -0.00660 0.03047 0.41212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.324097 0.026732 348.793 < 2e-16
## CYLINDERS 0.050369 0.002085 24.160 < 2e-16
## FUELCONSUMPTION_COMB_MPG -0.059838 0.000603 -99.236 < 2e-16
## MAKE_BMW -0.036264 0.009537 -3.802 0.000151
## MAKE_CHRYSLER -0.052770 0.017080 -3.090 0.002057
## MAKE_DODGE -0.033871 0.012039 -2.813 0.004994
## MAKE_MITSUBISHI 0.065187 0.021738 2.999 0.002775
## `VEHICLECLASS_PICKUP TRUCK - SMALL` 0.070783 0.021593 3.278 0.001079
## `VEHICLECLASS_PICKUP TRUCK - STANDARD` 0.054160 0.010301 5.258 1.76e-07
## `VEHICLECLASS_SUV - STANDARD` 0.058240 0.008111 7.180 1.32e-12
## `VEHICLECLASS_VAN - PASSENGER` 0.263238 0.016305 16.145 < 2e-16
## TRANSMISSION_A4 0.102007 0.012175 8.379 < 2e-16
## TRANSMISSION_AV 0.063111 0.011947 5.282 1.55e-07
## TRANSMISSION_AV6 -0.076332 0.026205 -2.913 0.003656
## FUELTYPE_D 0.288201 0.014853 19.403 < 2e-16
## FUELTYPE_E -0.475942 0.009361 -50.841 < 2e-16
##
## (Intercept) ***
## CYLINDERS ***
## FUELCONSUMPTION_COMB_MPG ***
## MAKE_BMW ***
## MAKE_CHRYSLER **
## MAKE_DODGE **
## MAKE_MITSUBISHI **
## `VEHICLECLASS_PICKUP TRUCK - SMALL` **
## `VEHICLECLASS_PICKUP TRUCK - STANDARD` ***
## `VEHICLECLASS_SUV - STANDARD` ***
## `VEHICLECLASS_VAN - PASSENGER` ***
## TRANSMISSION_A4 ***
## TRANSMISSION_AV ***
## TRANSMISSION_AV6 **
## FUELTYPE_D ***
## FUELTYPE_E ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07288 on 1051 degrees of freedom
## Multiple R-squared: 0.9791, Adjusted R-squared: 0.9788
## F-statistic: 3286 on 15 and 1051 DF, p-value: < 2.2e-16
ols_test_breusch_pagan(step_bc_FuelConsumption3)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ------------------------------------------
## Response : bcCO2EMISSIONS
## Variables: fitted values of bcCO2EMISSIONS
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.6508984
## Prob > Chi2 = 0.4197917
sf.test(step_bc_FuelConsumption3$residuals)
##
## Shapiro-Francia normality test
##
## data: step_bc_FuelConsumption3$residuals
## W = 0.93977, p-value < 2.2e-16
From the results presented, we can conclude that the Box-Cox model
improves on the untransformed one with respect to the model assumptions.
It has a high R² (0.9791), and all of its independent variables are
statistically significant at the 5% level. The new model also produced a
p-value greater than 0.05 (0.4198) in the ols_test_breusch_pagan test,
so the hypothesis of constant variance is not rejected and there is no
evidence of heteroscedasticity. However, despite these efforts, the
residuals still do not adhere to normality, because the p-value
associated with the sf.test remains below 0.05.
That said, although normality of residuals is a common assumption in many statistical models, it is not always necessary for obtaining accurate predictions. According to Wilcox (1998), even if the residuals of a linear regression do not follow a normal distribution, predictions can be accurate, and hypothesis tests can still have satisfactory power, as long as other model assumptions are met.
We can assess the predictive capability of our model through the graph below:
FuelConsumption_dummies$fitted_step3 <- step_bc_FuelConsumption3$fitted.values
plot(FuelConsumption_dummies$CO2EMISSIONS ~ FuelConsumption_dummies$fitted_step3,
xlab = "Estimated values of CO2 emissions (Box-Cox)",
ylab = "Observed values of CO2 emissions")
The resulting values from our model need to undergo an inverse transformation to revert the Box-Cox transformation and obtain the actual values of CO2 emissions:
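# fitted_step3 holds the fitted values on the Box-Cox scale (bcCO2EMISSIONS)
((FuelConsumption_dummies$fitted_step3 * lambda_BC$lambda) + 1)^(1 / lambda_BC$lambda)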
\[
\begin{aligned}
\operatorname{\widehat{bcCO2EMISSIONS}} &= 9.32 + 0.05(\operatorname{CYLINDERS}) - 0.06(\operatorname{FUELCONSUMPTION\_COMB\_MPG}) - 0.04(\operatorname{MAKE\_BMW}) \\
&\quad - 0.05(\operatorname{MAKE\_CHRYSLER}) - 0.03(\operatorname{MAKE\_DODGE}) + 0.07(\operatorname{MAKE\_MITSUBISHI}) \\
&\quad + 0.07(\operatorname{VEHICLECLASS\_PICKUP\ TRUCK\ -\ SMALL}) + 0.05(\operatorname{VEHICLECLASS\_PICKUP\ TRUCK\ -\ STANDARD}) \\
&\quad + 0.06(\operatorname{VEHICLECLASS\_SUV\ -\ STANDARD}) + 0.26(\operatorname{VEHICLECLASS\_VAN\ -\ PASSENGER}) + 0.10(\operatorname{TRANSMISSION\_A4}) \\
&\quad + 0.06(\operatorname{TRANSMISSION\_AV}) - 0.08(\operatorname{TRANSMISSION\_AV6}) + 0.29(\operatorname{FUELTYPE\_D}) - 0.48(\operatorname{FUELTYPE\_E})
\end{aligned}
\]
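As a worked example of using the fitted equation, the sketch below predicts CO2 emissions for a car with 4 cylinders and a combined consumption of 33 mpg, with every dummy variable in the final model equal to zero. The coefficients and lambda are copied from the output above; the variable names are illustrative, and the simple back-transformation of the fitted value is an assumption about how the prediction would be reported.

```r
# Coefficients from summary(step_bc_FuelConsumption3) and lambda from powerTransform
lambda      <- 0.1278504
b_intercept <- 9.324097
b_cyl       <- 0.050369
b_mpg       <- -0.059838

# Example car: 4 cylinders, 33 mpg combined, all model dummies equal to 0
cylinders <- 4
comb_mpg  <- 33

# Prediction on the Box-Cox scale
bc_hat <- b_intercept + b_cyl * cylinders + b_mpg * comb_mpg
bc_hat

# Back-transform to the original CO2 emissions scale
co2_hat <- (bc_hat * lambda + 1)^(1 / lambda)
co2_hat
```

These inputs match the first sample row of the dataset (an ACURA ILX, whose make, class, transmission and fuel type are not among the dummies kept in the final model). The back-transformed prediction comes out at roughly 197, close to the 196 observed for that row.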
The analysis shows that the final model fits the data closely (adjusted R² of 0.9788 on the Box-Cox scale) and is a useful tool for estimating the CO2 emissions of automobiles across different types of vehicles. This assessment is based on the same data used to fit the model, since no separate validation set was used. The model also provides valuable insights into the factors associated with CO2 emissions, including the number of cylinders, car manufacturer, vehicle class, fuel type, transmission type, and combined fuel consumption in miles per gallon.
Overall, this document demonstrates the value of using data analysis and statistical modeling techniques to address important environmental issues. Future research can focus on expanding the model to include other factors that may influence CO2 emissions and on exploring the use of machine learning algorithms to improve the model’s accuracy and predictive power.
Wilcox, R. R. (1998). Do we really need the normality assumption for linear regression? The American Statistician, 52(2), 162-166.