As the world’s population continues to grow, the use of automobiles has become an essential part of modern life. However, the emissions produced by automobiles have become a major concern for scientists and policymakers. Carbon dioxide (CO2) is one of the primary gases produced by cars and is known to have a significant impact on the environment.

In this research project, a linear regression model is presented that analyzes a dataset of cars from a Canadian dealership to identify the characteristics of cars that have the greatest impact on CO2 emissions. By doing so, valuable information can be provided to policymakers and car manufacturers to help them reduce the environmental impact of automobiles.

In the end, we will have a linear model with Box-Cox transformation that features CO2 emissions as the dependent variable and various car characteristics as independent variables.

Database used:
FuelConsumption (please right-click and select ‘open in a new tab/window’)

Code	Fuel
D	Diesel
E	Ethanol
X	Gasoline
Z	Premium Gasoline

Code	Trasmission
A	Automatic
AM	Automatic Manual
AS	Sequential Automatic
AV	Continuous Variable Automatic
M	Manual

Loading the necessary packages.

pacotes <- c("kableExtra", "utils", "plotly", "dplyr", "rstatix", "jtools", "equatiomatic", "cowplot", "olsrr", "nortest", "car", "PerformanceAnalytics", "fastDummies", "ggplot2")
lapply(pacotes, library, character.only = TRUE)

Loading and saving the database in R

FuelConsumption <- read.csv("FuelConsumption.csv")
save(FuelConsumption, file = "FuelConsumption.RData")

Initial analysis of the variables

View(FuelConsumption)

MODELYEAR	MAKE	MODEL	VEHICLECLASS	ENGINESIZE	CYLINDERS	TRANSMISSION	FUELTYPE	FUELCONSUMPTION_CITY	FUELCONSUMPTION_HWY	FUELCONSUMPTION_COMB	FUELCONSUMPTION_COMB_MPG	CO2EMISSIONS
2014	ACURA	ILX	COMPACT	2.0	4	AS5	Z	9.9	6.7	8.5	33	196
2014	ACURA	ILX	COMPACT	2.4	4	M6	Z	11.2	7.7	9.6	29	221
2014	ACURA	ILX HYBRID	COMPACT	1.5	4	AV7	Z	6.0	5.8	5.9	48	136
2014	ACURA	MDX 4WD	SUV - SMALL	3.5	6	AS6	Z	12.7	9.1	11.1	25	255
2014	ACURA	RDX AWD	SUV - SMALL	3.5	6	AS6	Z	12.1	8.7	10.6	27	244
2014	ACURA	RLX	MID-SIZE	3.5	6	AS6	Z	11.9	7.7	10.0	28	230
2014	ACURA	TL	MID-SIZE	3.5	6	AS6	Z	11.8	8.1	10.1	28	232
2014	ACURA	TL AWD	MID-SIZE	3.7	6	AS6	Z	12.8	9.0	11.1	25	255
2014	ACURA	TL AWD	MID-SIZE	3.7	6	M6	Z	13.4	9.5	11.6	24	267
2014	ACURA	TSX	COMPACT	2.4	4	AS5	Z	10.6	7.5	9.2	31	212

The variable MODELYEAR, which represents the year of manufacture of the automobile, has the same value in all observations (the complete database can be viewed in the link provided at the beginning of this document). Therefore, we can conclude that it will not have any influence on the model and will result in unnecessary data processing. For this reason, we will remove it from our database.

dropyear <- FuelConsumption[-c(1)]

Furthermore, it is also noticeable that the variable MODEL has levels that are very infrequently repeated in the 1067 observations. Consequently, the variable is unable to form significant patterns or trends that can assist the model in making accurate predictions or inferences. Therefore, we will also eliminate this variable..

dropyear2 <- dropyear[-c(2)]

Viewing the new database:

View(dropyear2)

MAKE	VEHICLECLASS	ENGINESIZE	CYLINDERS	TRANSMISSION	FUELTYPE	FUELCONSUMPTION_CITY	FUELCONSUMPTION_HWY	FUELCONSUMPTION_COMB	FUELCONSUMPTION_COMB_MPG	CO2EMISSIONS
ACURA	COMPACT	2.0	4	AS5	Z	9.9	6.7	8.5	33	196
ACURA	COMPACT	2.4	4	M6	Z	11.2	7.7	9.6	29	221
ACURA	COMPACT	1.5	4	AV7	Z	6.0	5.8	5.9	48	136
ACURA	SUV - SMALL	3.5	6	AS6	Z	12.7	9.1	11.1	25	255
ACURA	SUV - SMALL	3.5	6	AS6	Z	12.1	8.7	10.6	27	244
ACURA	MID-SIZE	3.5	6	AS6	Z	11.9	7.7	10.0	28	230
ACURA	MID-SIZE	3.5	6	AS6	Z	11.8	8.1	10.1	28	232
ACURA	MID-SIZE	3.7	6	AS6	Z	12.8	9.0	11.1	25	255
ACURA	MID-SIZE	3.7	6	M6	Z	13.4	9.5	11.6	24	267
ACURA	COMPACT	2.4	4	AS5	Z	10.6	7.5	9.2	31	212

Viewing and changing the variable types:

glimpse(dropyear2)

## Rows: 1,067
## Columns: 11
## $ MAKE                     <chr> "ACURA", "ACURA", "ACURA", "ACURA", "ACURA", …
## $ VEHICLECLASS             <chr> "COMPACT", "COMPACT", "COMPACT", "SUV - SMALL…
## $ ENGINESIZE               <dbl> 2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7, …
## $ CYLINDERS                <int> 4, 4, 4, 6, 6, 6, 6, 6, 6, 4, 4, 6, 12, 12, 8…
## $ TRANSMISSION             <chr> "AS5", "M6", "AV7", "AS6", "AS6", "AS6", "AS6…
## $ FUELTYPE                 <chr> "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", …
## $ FUELCONSUMPTION_CITY     <dbl> 9.9, 11.2, 6.0, 12.7, 12.1, 11.9, 11.8, 12.8,…
## $ FUELCONSUMPTION_HWY      <dbl> 6.7, 7.7, 5.8, 9.1, 8.7, 7.7, 8.1, 9.0, 9.5, …
## $ FUELCONSUMPTION_COMB     <dbl> 8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10.1, 11.1, …
## $ FUELCONSUMPTION_COMB_MPG <int> 33, 29, 48, 25, 27, 28, 28, 25, 24, 31, 29, 2…
## $ CO2EMISSIONS             <int> 196, 221, 136, 255, 244, 230, 232, 255, 267, …

Here, we can see that some of the variables are typed as “integer.” Since some functions to be used during this modeling require variables to be of type “numeric,” we will change the type of the “integer” variables to “numeric.”

FuelConsumption$CYLINDERS <- as.numeric(dropyear$CYLINDERS)
FuelConsumption$FUELCONSUMPTION_COMB_MPG <- as.numeric(dropyear$FUELCONSUMPTION_COMB_MPG)
FuelConsumption$CO2EMISSIONS <- as.numeric(dropyear$CO2EMISSIONS)

Checking the result of the changes in variable types:

glimpse(dropyear)

## Rows: 1,067
## Columns: 12
## $ MAKE                     <chr> "ACURA", "ACURA", "ACURA", "ACURA", "ACURA", …
## $ MODEL                    <chr> "ILX", "ILX", "ILX HYBRID", "MDX 4WD", "RDX A…
## $ VEHICLECLASS             <chr> "COMPACT", "COMPACT", "COMPACT", "SUV - SMALL…
## $ ENGINESIZE               <dbl> 2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7, …
## $ CYLINDERS                <int> 4, 4, 4, 6, 6, 6, 6, 6, 6, 4, 4, 6, 12, 12, 8…
## $ TRANSMISSION             <chr> "AS5", "M6", "AV7", "AS6", "AS6", "AS6", "AS6…
## $ FUELTYPE                 <chr> "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", "Z", …
## $ FUELCONSUMPTION_CITY     <dbl> 9.9, 11.2, 6.0, 12.7, 12.1, 11.9, 11.8, 12.8,…
## $ FUELCONSUMPTION_HWY      <dbl> 6.7, 7.7, 5.8, 9.1, 8.7, 7.7, 8.1, 9.0, 9.5, …
## $ FUELCONSUMPTION_COMB     <dbl> 8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10.1, 11.1, …
## $ FUELCONSUMPTION_COMB_MPG <int> 33, 29, 48, 25, 27, 28, 28, 25, 24, 31, 29, 2…
## $ CO2EMISSIONS             <int> 196, 221, 136, 255, 244, 230, 232, 255, 267, …

Statistics of the variables

summary(FuelConsumption)

##    MODELYEAR        MAKE              MODEL           VEHICLECLASS      
##  Min.   :2014   Length:1067        Length:1067        Length:1067       
##  1st Qu.:2014   Class :character   Class :character   Class :character  
##  Median :2014   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2014                                                           
##  3rd Qu.:2014                                                           
##  Max.   :2014                                                           
##    ENGINESIZE      CYLINDERS      TRANSMISSION         FUELTYPE        
##  Min.   :1.000   Min.   : 3.000   Length:1067        Length:1067       
##  1st Qu.:2.000   1st Qu.: 4.000   Class :character   Class :character  
##  Median :3.400   Median : 6.000   Mode  :character   Mode  :character  
##  Mean   :3.346   Mean   : 5.795                                        
##  3rd Qu.:4.300   3rd Qu.: 8.000                                        
##  Max.   :8.400   Max.   :12.000                                        
##  FUELCONSUMPTION_CITY FUELCONSUMPTION_HWY FUELCONSUMPTION_COMB
##  Min.   : 4.60        Min.   : 4.900      Min.   : 4.70       
##  1st Qu.:10.25        1st Qu.: 7.500      1st Qu.: 9.00       
##  Median :12.60        Median : 8.800      Median :10.90       
##  Mean   :13.30        Mean   : 9.475      Mean   :11.58       
##  3rd Qu.:15.55        3rd Qu.:10.850      3rd Qu.:13.35       
##  Max.   :30.20        Max.   :20.500      Max.   :25.80       
##  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS  
##  Min.   :11.00            Min.   :108.0  
##  1st Qu.:21.00            1st Qu.:207.0  
##  Median :26.00            Median :251.0  
##  Mean   :26.44            Mean   :256.2  
##  3rd Qu.:31.00            3rd Qu.:294.0  
##  Max.   :60.00            Max.   :488.0

Categories of the categorical variables:

levels(factor(FuelConsumption$MAKE))

##  [1] "ACURA"         "ASTON MARTIN"  "AUDI"          "BENTLEY"      
##  [5] "BMW"           "BUICK"         "CADILLAC"      "CHEVROLET"    
##  [9] "CHRYSLER"      "DODGE"         "FIAT"          "FORD"         
## [13] "GMC"           "HONDA"         "HYUNDAI"       "INFINITI"     
## [17] "JAGUAR"        "JEEP"          "KIA"           "LAMBORGHINI"  
## [21] "LAND ROVER"    "LEXUS"         "LINCOLN"       "MASERATI"     
## [25] "MAZDA"         "MERCEDES-BENZ" "MINI"          "MITSUBISHI"   
## [29] "NISSAN"        "PORSCHE"       "RAM"           "ROLLS-ROYCE"  
## [33] "SCION"         "SMART"         "SRT"           "SUBARU"       
## [37] "TOYOTA"        "VOLKSWAGEN"    "VOLVO"

levels(factor(FuelConsumption$VEHICLECLASS))

##  [1] "COMPACT"                  "FULL-SIZE"               
##  [3] "MID-SIZE"                 "MINICOMPACT"             
##  [5] "MINIVAN"                  "PICKUP TRUCK - SMALL"    
##  [7] "PICKUP TRUCK - STANDARD"  "SPECIAL PURPOSE VEHICLE" 
##  [9] "STATION WAGON - MID-SIZE" "STATION WAGON - SMALL"   
## [11] "SUBCOMPACT"               "SUV - SMALL"             
## [13] "SUV - STANDARD"           "TWO-SEATER"              
## [15] "VAN - CARGO"              "VAN - PASSENGER"

levels(factor(FuelConsumption$TRANSMISSION))

##  [1] "A4"  "A5"  "A6"  "A7"  "A8"  "A9"  "AM5" "AM6" "AM7" "AS4" "AS5" "AS6"
## [13] "AS7" "AS8" "AS9" "AV"  "AV6" "AV7" "AV8" "M5"  "M6"  "M7"

levels(factor(FuelConsumption$FUELTYPE))

## [1] "D" "E" "X" "Z"

Frequency of the categorical variables:

table(FuelConsumption$MAKE)

## 
##         ACURA  ASTON MARTIN          AUDI       BENTLEY           BMW 
##            12             7            49             8            64 
##         BUICK      CADILLAC     CHEVROLET      CHRYSLER         DODGE 
##            16            32            86            19            39 
##          FIAT          FORD           GMC         HONDA       HYUNDAI 
##            10            90            49            21            24 
##      INFINITI        JAGUAR          JEEP           KIA   LAMBORGHINI 
##            21            22            31            33             3 
##    LAND ROVER         LEXUS       LINCOLN      MASERATI         MAZDA 
##            19            22            11             6            27 
## MERCEDES-BENZ          MINI    MITSUBISHI        NISSAN       PORSCHE 
##            59            36            16            33            44 
##           RAM   ROLLS-ROYCE         SCION         SMART           SRT 
##            13             7             9             2             2 
##        SUBARU        TOYOTA    VOLKSWAGEN         VOLVO 
##            23            49            42            11

table(FuelConsumption$VEHICLECLASS)

## 
##                  COMPACT                FULL-SIZE                 MID-SIZE 
##                      172                       86                      178 
##              MINICOMPACT                  MINIVAN     PICKUP TRUCK - SMALL 
##                       47                       14                       12 
##  PICKUP TRUCK - STANDARD  SPECIAL PURPOSE VEHICLE STATION WAGON - MID-SIZE 
##                       62                        7                        6 
##    STATION WAGON - SMALL               SUBCOMPACT              SUV - SMALL 
##                       36                       65                      154 
##           SUV - STANDARD               TWO-SEATER              VAN - CARGO 
##                      110                       71                       22 
##          VAN - PASSENGER 
##                       25

table(FuelConsumption$TRANSMISSION)

## 
##  A4  A5  A6  A7  A8  A9 AM5 AM6 AM7 AS4 AS5 AS6 AS7 AS8 AS9  AV AV6 AV7 AV8  M5 
##  45  30 222  12  87   8   2   6  34   1  10 189  76  80   2  46  11   5   3  48 
##  M6  M7 
## 141   9

table(FuelConsumption$FUELTYPE)

## 
##   D   E   X   Z 
##  27  92 514 434

Checking the correlations between the numerical variables to be used

chart.Correlation((FuelConsumption[, c(5, 6, 9:13)]), histogram = TRUE)

enginesize, cylinders, fuelconsumption_city, fuelcomsumption_hwy, fuelconsumption_comb, fuelconsumption_comb_mpg, co2emissions

As expected, the variables related to fuel consumption have a high degree of correlation with each other, as they are both related to fuel consumption in different driving contexts. If there is an increase in city fuel consumption, it is likely that there will be a corresponding increase in highway fuel consumption, and vice versa. This positive relationship between the two variables results in a high degree of correlation between them. This is a typical example of collinearity that we will address using the stepwise algorithm to avoid issues in the final model.

Dummification of the categorical variables

Before we proceed, we need to convert our categorical variables into numerical values. For this purpose, we will use the process of dummification, also known as one-hot encoding. In this case, the most frequent category of each categorical variable will be used as the reference for estimating the parameter values of the dummy variables. You can check the frequency of these variables in the “Statistics of the variables” section above.

FuelConsumption_dummies <- dummy_columns(.data = dropyear2,
                                    select_columns = "MAKE",
                                    remove_selected_columns = T,
                                    remove_most_frequent_dummy = T)

FuelConsumption_dummies <- dummy_columns(.data = FuelConsumption_dummies,
                                         select_columns = "VEHICLECLASS",
                                         remove_selected_columns = T,
                                         remove_most_frequent_dummy = T)

FuelConsumption_dummies <- dummy_columns(.data = FuelConsumption_dummies,
                                         select_columns = "TRANSMISSION",
                                         remove_selected_columns = T,
                                         remove_most_frequent_dummy = T)

FuelConsumption_dummies <- dummy_columns(.data = FuelConsumption_dummies,
                                         select_columns = "FUELTYPE",
                                         remove_selected_columns = T,
                                         remove_most_frequent_dummy = T)

With the categorical variables dummified, we can start building our model.

Modeling

The dummification process significantly increased our number of variables, from 11 to 90, so it’s unlikely that all of these variables are significant for our model. Additionally, in our correlation analysis, we observed the presence of collinearity among some variables. To eliminate non-significant variables, we will create an initial OLS linear model with all the variables that can be used as a baseline during the step function.

modelo_FuelConsumption <- lm(formula = CO2EMISSIONS ~ ., data = FuelConsumption_dummies)

Now that we have our initial model, we can run the Stepwise algorithm on it using the step function:

step_FuelConsumption <- step(modelo_FuelConsumption, k = qchisq(p = 0.05, df = 1, lower.tail = FALSE))

The value “k” is used as a critical threshold for the chi-square statistic, and in this case, we are seeking variable selection with a 95% confidence level.

Let’s use the summary function to evaluate the parameters of our model:

summary(step_FuelConsumption)

## 
## Call:
## lm(formula = CO2EMISSIONS ~ CYLINDERS + FUELCONSUMPTION_COMB + 
##     FUELCONSUMPTION_COMB_MPG + MAKE_BMW + MAKE_PORSCHE + VEHICLECLASS_COMPACT + 
##     `VEHICLECLASS_PICKUP TRUCK - STANDARD` + `VEHICLECLASS_SUV - STANDARD` + 
##     `VEHICLECLASS_TWO-SEATER` + TRANSMISSION_AV + FUELTYPE_D + 
##     FUELTYPE_E, data = FuelConsumption_dummies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.290  -2.161  -0.691   0.938  38.633 
## 
## Coefficients:
##                                          Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                              86.42465    3.53340   24.459  < 2e-16
## CYLINDERS                                 1.70864    0.17206    9.930  < 2e-16
## FUELCONSUMPTION_COMB                     17.93986    0.17884  100.310  < 2e-16
## FUELCONSUMPTION_COMB_MPG                 -1.49883    0.06918  -21.665  < 2e-16
## MAKE_BMW                                 -2.47424    0.71285   -3.471  0.00054
## MAKE_PORSCHE                             -2.47898    0.84606   -2.930  0.00346
## VEHICLECLASS_COMPACT                      1.25355    0.48312    2.595  0.00960
## `VEHICLECLASS_PICKUP TRUCK - STANDARD`    2.27570    0.74216    3.066  0.00222
## `VEHICLECLASS_SUV - STANDARD`             1.51029    0.58230    2.594  0.00963
## `VEHICLECLASS_TWO-SEATER`                 1.52927    0.68209    2.242  0.02517
## TRANSMISSION_AV                           2.71432    0.89150    3.045  0.00239
## FUELTYPE_D                               32.92162    1.09034   30.194  < 2e-16
## FUELTYPE_E                             -110.37148    0.88755 -124.356  < 2e-16
##                                           
## (Intercept)                            ***
## CYLINDERS                              ***
## FUELCONSUMPTION_COMB                   ***
## FUELCONSUMPTION_COMB_MPG               ***
## MAKE_BMW                               ***
## MAKE_PORSCHE                           ** 
## VEHICLECLASS_COMPACT                   ** 
## `VEHICLECLASS_PICKUP TRUCK - STANDARD` ** 
## `VEHICLECLASS_SUV - STANDARD`          ** 
## `VEHICLECLASS_TWO-SEATER`              *  
## TRANSMISSION_AV                        ** 
## FUELTYPE_D                             ***
## FUELTYPE_E                             ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.355 on 1054 degrees of freedom
## Multiple R-squared:  0.9929, Adjusted R-squared:  0.9929 
## F-statistic: 1.236e+04 on 12 and 1054 DF,  p-value: < 2.2e-16

We can see that the step function kept the variables “fuelconsumption_comb” and “fuelconsumption_comb_mpg” despite the high correlation between them as indicated in the correlation table. Let’s check the values of the Variance Inflation Factors (VIF) using the vif function.

vif(step_FuelConsumption)

##                              CYLINDERS                   FUELCONSUMPTION_COMB 
##                               3.556295                              14.448422 
##               FUELCONSUMPTION_COMB_MPG                               MAKE_BMW 
##                               9.925956                               1.066287 
##                           MAKE_PORSCHE                   VEHICLECLASS_COMPACT 
##                               1.053243                               1.174497 
## `VEHICLECLASS_PICKUP TRUCK - STANDARD`          `VEHICLECLASS_SUV - STANDARD` 
##                               1.121868                               1.166800 
##              `VEHICLECLASS_TWO-SEATER`                        TRANSMISSION_AV 
##                               1.075468                               1.220163 
##                             FUELTYPE_D                             FUELTYPE_E 
##                               1.091233                               2.309771

Knowing that values well above 1 indicate the presence of collinearity, we can assume that “fuelconsumption_comb” and “fuelconsumption_comb_mpg” have high collinearity. Since high collinearity can be detrimental to the model, let’s eliminate the variable with the highest VIF to avoid multicollinearity.

step_FuelConsumption2 <- update(step_FuelConsumption, CO2EMISSIONS ~ .- FUELCONSUMPTION_COMB, data = FuelConsumption_dummies)

Looking at the new VIF values:

vif(step_FuelConsumption2)

##                              CYLINDERS               FUELCONSUMPTION_COMB_MPG 
##                               2.766158                               3.880821 
##                               MAKE_BMW                           MAKE_PORSCHE 
##                               1.053868                               1.033862 
##                   VEHICLECLASS_COMPACT `VEHICLECLASS_PICKUP TRUCK - STANDARD` 
##                               1.174482                               1.121858 
##          `VEHICLECLASS_SUV - STANDARD`              `VEHICLECLASS_TWO-SEATER` 
##                               1.163437                               1.073321 
##                        TRANSMISSION_AV                             FUELTYPE_D 
##                               1.194467                               1.091230 
##                             FUELTYPE_E 
##                               1.340661

We can see that the values have reduced significantly. Let’s use the step function on the new model, and then we will test whether it exhibits heteroscedasticity (presence of non-constant variance in the model residuals) and whether its residuals adhere to normality. We will use the Breusch-Pagan test (ols_test_breusch_pagan) and the Shapiro-Francia test (sf.test) for these purposes, respectively.

step_FuelConsumption3 <- step(step_FuelConsumption2, k = 3.841459)

The value “k” is used as a critical threshold for the chi-square statistic, and in this case, we are seeking variable selection with a 95% confidence level.

ols_test_breusch_pagan(step_FuelConsumption3)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                   Data                   
##  ----------------------------------------
##  Response : CO2EMISSIONS 
##  Variables: fitted values of CO2EMISSIONS 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    99.15354 
##  Prob > Chi2   =    2.336652e-23

sf.test(step_FuelConsumption3$residuals)

## 
##  Shapiro-Francia normality test
## 
## data:  step_FuelConsumption3$residuals
## W = 0.88059, p-value < 2.2e-16

Based on the results of the tests, it can be concluded that the current model has heteroscedasticity in its residuals because the p-value (2.336652e-23) is lower than the commonly used significance level of 0.05. It can also be concluded that the residuals do not appear to follow a normal distribution since the p-value (2.2e-16) is also lower than the significance level of 0.05.

Although a model with these characteristics still has predictive potential, such characteristics may indicate that the model could be further improved. We will attempt to enhance the predictive capability of our model by applying a Box-Cox transformation to its dependent variable.

The Box-Cox transformation

The Box-Cox transformation is a statistical technique used to stabilize variance and/or approximate the distribution of data to a normal distribution. Therefore, it can help reduce model heteroscedasticity and improve the adherence of its residuals to normality. The Box-Cox transformation is defined by a parametric equation where a parameter lambda (λ) is applied to the data. Let’s determine the value of lambda (λ) for our dependent variable using the powerTransform function from the car package.

lambda_BC <- powerTransform(dropyear2$CO2EMISSIONS)
lambda_BC

## Estimated transformation parameter 
## dropyear2$CO2EMISSIONS 
##              0.1278504

With the value of lambda (λ) in hand, all that remains is to perform the transformation of our dependent variable. We know that the transformation to be applied varies according to the value of lambda (λ):

If lambda is equal to 0, a logarithmic transformation is applied to the data. If lambda is different from 0, the transformation is given by the formula: ((x^lambda) - 1) / lambda, where x is the original value of the data.

Since our lambda (λ) is different from 0, we just need to execute the following code to insert our transformed dependent variable into our database:

FuelConsumption_dummies$bcCO2EMISSIONS <- (((dropyear2$CO2EMISSIONS ^ lambda_BC$lambda) - 1) / lambda_BC$lambda)

Modeling with Box-Cox

We will estimate a new model using the dependent variable with a Box-Cox transformation.

modelo_bc_FuelConsumption <- lm(formula = bcCO2EMISSIONS ~ .-CO2EMISSIONS, 
                           data = FuelConsumption_dummies)

Since this is a new model, we need to run the Stepwise algorithm again using the step function to eliminate non-significant variables.

step_bc_FuelConsumption <- step(modelo_bc_FuelConsumption, k = 3.841459)

the value “k” is used as a critical threshold for the chi-squared statistic, and in this case, we are seeking variable selection with a confidence level of 95%.

Checking the VIF (Variance Inflation Factor) values in our new model:

vif(step_bc_FuelConsumption)

##                              CYLINDERS                   FUELCONSUMPTION_CITY 
##                               4.533626                             232.067828 
##                   FUELCONSUMPTION_COMB               FUELCONSUMPTION_COMB_MPG 
##                             226.916848                              12.118356 
##                               MAKE_BMW                          MAKE_CHRYSLER 
##                               1.055734                               1.085185 
##                             MAKE_DODGE                              MAKE_MINI 
##                               1.092760                               1.100341 
##                        MAKE_MITSUBISHI    `VEHICLECLASS_PICKUP TRUCK - SMALL` 
##                               1.483501                               1.075288 
## `VEHICLECLASS_PICKUP TRUCK - STANDARD`          `VEHICLECLASS_SUV - STANDARD` 
##                               1.298780                               1.352225 
##         `VEHICLECLASS_VAN - PASSENGER`                        TRANSMISSION_A4 
##                               1.614018                               1.572269 
##                        TRANSMISSION_AV                       TRANSMISSION_AV6 
##                               1.336003                               1.551947 
##                        TRANSMISSION_M5                             FUELTYPE_D 
##                               1.169629                               1.120641 
##                             FUELTYPE_E 
##                               2.739652

As expected, once again, the variables related to fuel consumption exhibit high collinearity. It’s also noted that the step function did not eliminate the “FUELCONSUMPTION_CITY” variable as it had in the model without the Box-Cox transformation. We will remove the highest VIF values to attempt to ensure the predictive effectiveness of the model:

step_bc_FuelConsumption2 <- update(step_bc_FuelConsumption, bcCO2EMISSIONS ~ .- FUELCONSUMPTION_COMB - FUELCONSUMPTION_CITY, data = FuelConsumption_dummies)

Since the variables in our model have changed, we need to use the step function again:

step_bc_FuelConsumption3 <- step(step_bc_FuelConsumption2, k = 3.841459)

The value “k” is used as a critical threshold for the chi-squared statistic, and in this case, we are seeking variable selection with a confidence level of 95%.

Checking the new VIF (Variance Inflation Factor) values:

vif(step_bc_FuelConsumption3)

##                              CYLINDERS               FUELCONSUMPTION_COMB_MPG 
##                               2.818367                               4.070345 
##                               MAKE_BMW                          MAKE_CHRYSLER 
##                               1.030173                               1.024935 
##                             MAKE_DODGE                        MAKE_MITSUBISHI 
##                               1.025345                               1.402096 
##    `VEHICLECLASS_PICKUP TRUCK - SMALL` `VEHICLECLASS_PICKUP TRUCK - STANDARD` 
##                               1.041477                               1.166559 
##          `VEHICLECLASS_SUV - STANDARD`         `VEHICLECLASS_VAN - PASSENGER` 
##                               1.222065                               1.221950 
##                        TRANSMISSION_A4                        TRANSMISSION_AV 
##                               1.202790                               1.182870 
##                       TRANSMISSION_AV6                             FUELTYPE_D 
##                               1.407417                               1.093097 
##                             FUELTYPE_E 
##                               1.386989

With significantly reduced VIF values, we can now proceed with the Breusch-Pagan tests (ols_test_breusch_pagan) and Shapiro-Francia tests (sf.test). We will also take the opportunity to evaluate the model’s parameters using summary:

summary(step_bc_FuelConsumption3)

## 
## Call:
## lm(formula = bcCO2EMISSIONS ~ CYLINDERS + FUELCONSUMPTION_COMB_MPG + 
##     MAKE_BMW + MAKE_CHRYSLER + MAKE_DODGE + MAKE_MITSUBISHI + 
##     `VEHICLECLASS_PICKUP TRUCK - SMALL` + `VEHICLECLASS_PICKUP TRUCK - STANDARD` + 
##     `VEHICLECLASS_SUV - STANDARD` + `VEHICLECLASS_VAN - PASSENGER` + 
##     TRANSMISSION_A4 + TRANSMISSION_AV + TRANSMISSION_AV6 + FUELTYPE_D + 
##     FUELTYPE_E, data = FuelConsumption_dummies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23102 -0.03972 -0.00660  0.03047  0.41212 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             9.324097   0.026732 348.793  < 2e-16
## CYLINDERS                               0.050369   0.002085  24.160  < 2e-16
## FUELCONSUMPTION_COMB_MPG               -0.059838   0.000603 -99.236  < 2e-16
## MAKE_BMW                               -0.036264   0.009537  -3.802 0.000151
## MAKE_CHRYSLER                          -0.052770   0.017080  -3.090 0.002057
## MAKE_DODGE                             -0.033871   0.012039  -2.813 0.004994
## MAKE_MITSUBISHI                         0.065187   0.021738   2.999 0.002775
## `VEHICLECLASS_PICKUP TRUCK - SMALL`     0.070783   0.021593   3.278 0.001079
## `VEHICLECLASS_PICKUP TRUCK - STANDARD`  0.054160   0.010301   5.258 1.76e-07
## `VEHICLECLASS_SUV - STANDARD`           0.058240   0.008111   7.180 1.32e-12
## `VEHICLECLASS_VAN - PASSENGER`          0.263238   0.016305  16.145  < 2e-16
## TRANSMISSION_A4                         0.102007   0.012175   8.379  < 2e-16
## TRANSMISSION_AV                         0.063111   0.011947   5.282 1.55e-07
## TRANSMISSION_AV6                       -0.076332   0.026205  -2.913 0.003656
## FUELTYPE_D                              0.288201   0.014853  19.403  < 2e-16
## FUELTYPE_E                             -0.475942   0.009361 -50.841  < 2e-16
##                                           
## (Intercept)                            ***
## CYLINDERS                              ***
## FUELCONSUMPTION_COMB_MPG               ***
## MAKE_BMW                               ***
## MAKE_CHRYSLER                          ** 
## MAKE_DODGE                             ** 
## MAKE_MITSUBISHI                        ** 
## `VEHICLECLASS_PICKUP TRUCK - SMALL`    ** 
## `VEHICLECLASS_PICKUP TRUCK - STANDARD` ***
## `VEHICLECLASS_SUV - STANDARD`          ***
## `VEHICLECLASS_VAN - PASSENGER`         ***
## TRANSMISSION_A4                        ***
## TRANSMISSION_AV                        ***
## TRANSMISSION_AV6                       ** 
## FUELTYPE_D                             ***
## FUELTYPE_E                             ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07288 on 1051 degrees of freedom
## Multiple R-squared:  0.9791, Adjusted R-squared:  0.9788 
## F-statistic:  3286 on 15 and 1051 DF,  p-value: < 2.2e-16

ols_test_breusch_pagan(step_bc_FuelConsumption3)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                    Data                    
##  ------------------------------------------
##  Response : bcCO2EMISSIONS 
##  Variables: fitted values of bcCO2EMISSIONS 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.6508984 
##  Prob > Chi2   =    0.4197917

sf.test(step_bc_FuelConsumption3$residuals)

## 
##  Shapiro-Francia normality test
## 
## data:  step_bc_FuelConsumption3$residuals
## W = 0.93977, p-value < 2.2e-16

From the results presented, we can conclude that a superior model was obtained compared to the model where the dependent variable had not undergone the Box-Cox transformation. We have a high R² value and independent variables with a significance level exceeding 95%. The new model also exhibited a p-value greater than 0.05 during the ols_test_breusch_pagan test, indicating the absence of heteroscedasticity. However, despite the efforts, it was not possible to fit the residuals to normality because the p-value associated with the sf.test remains less than 0.05.

However, despite the normality of residuals being a common assumption in many statistical models, it is not always necessary to obtain accurate predictions. According to Wilcox (1998), even if the residuals of a linear regression do not follow a normal distribution, predictions can be accurate, and hypothesis tests can still have satisfactory power, as long as other model assumptions are met.

We can assess the predictive capability of our model through the graph below:

FuelConsumption_dummies$fitted_step3 <- step_bc_FuelConsumption3$fitted.values
plot(FuelConsumption_dummies$CO2EMISSIONS ~ FuelConsumption_dummies$fitted_step3,
     xlab = "Estimated values of CO2 emissions (Box-Cox)",
     ylab = "Observed values of CO2 emissions")

The resulting values from our model need to undergo an inverse transformation to revert the Box-Cox transformation and obtain the actual values of CO2 emissions:

((bcCO2EMISSIONS*(lambda_BC$lambda))+
    1)^(1/(lambda_BC$lambda))

Final Model

x
\[ \begin{aligned} \operatorname{\widehat{bcCO2EMISSIONS}} &= 9.32 + 0.05(\operatorname{CYLINDERS}) - 0.06(\operatorname{FUELCONSUMPTION\_COMB\_MPG}) - 0.04(\operatorname{MAKE\_BMW})\ - \\ &\quad 0.05(\operatorname{MAKE\_CHRYSLER}) - 0.03(\operatorname{MAKE\_DODGE}) + 0.07(\operatorname{MAKE\_MITSUBISHI}) + 0.07(\operatorname{`VEHICLECLASS\_PICKUP\ TRUCK\ -\ SMALL`})\ + \\ &\quad 0.05(\operatorname{`VEHICLECLASS\_PICKUP\ TRUCK\ -\ STANDARD`}) + 0.06(\operatorname{`VEHICLECLASS\_SUV\ -\ STANDARD`}) + 0.26(\operatorname{`VEHICLECLASS\_VAN\ -\ PASSENGER`}) + 0.1(\operatorname{TRANSMISSION\_A4})\ + \\ &\quad 0.06(\operatorname{TRANSMISSION\_AV}) - 0.08(\operatorname{TRANSMISSION\_AV6}) + 0.29(\operatorname{FUELTYPE\_D}) - 0.48(\operatorname{FUELTYPE\_E}) \end{aligned} \]

The analysis results have shown that the final model is a reliable and accurate tool for predicting the amount of CO2 emissions from automobiles. The predictive power of the model has been validated by its ability to predict CO2 emissions from different types of vehicles with high precision. Furthermore, the model provides valuable insights into the factors influencing CO2 emissions, including the car manufacturer, car body type, fuel type, transmission type, and combined fuel consumption in miles.

Overall, this document demonstrates the value of using data analysis and statistical modeling techniques to address important environmental issues. Future research can focus on expanding the model to include other factors that may influence CO2 emissions and explore the potential use of machine learning algorithms to improve the model’s accuracy and predictive power..

References

Wilcox, R. R. (1998). Do we really need the normality assumption for linear regression? The American Statistician, 52(2), 162-166.

Developing a predictive model to estimate CO2 emissions from automobiles

Rafael

2023-05-05