1 Description of the Dataset

I chose a dataset from Kaggle (https://www.kaggle.com/datasets/bhuviranga/co2-emissions) that records the CO2 emissions of different cars. The author did not document how the data were collected. The dataset includes the following variables.

Engine.Size (continuous) - the engine size in liters

Cylinders (categorical) - the number of cylinders in the car

Fuel.Type (categorical) - the type of fuel used by the car: D = diesel, E = ethanol, N = natural gas, X = gasoline, Z = premium gasoline

Fuel.Consumption.City (continuous) - the fuel consumption of the car in L/100km when driving in the city

Fuel.Consumption.Hwy (continuous) - the fuel consumption of the car in L/100km when driving on highways

Fuel.Consumption.Combined (continuous) - the fuel consumption of the car in L/100km when driving in a combination of city roads and highways

Fuel.Consumption.mpg (continuous) - the fuel consumption of the car in miles per gallon when driving in a combination of city roads and highways

CO2.Emissions (continuous) - the emissions of the car in grams of CO2 per kilometer traveled

2 Research Question

The primary question is how different predictor variables relate to the CO2 emissions of the vehicle.

3 Multiple Linear Regression

We begin by fitting a multiple linear regression model with CO2.Emissions as the response variable.

3.1 Full Model

We look at the full model first.

Full Model examining CO2 Emissions
Term Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.8670059 1.6422899 57.1561726 0.0000000
Engine.Size 0.8218282 0.1440850 5.7037739 0.0000000
Cylinders4 -0.9584579 0.5227131 -1.8336215 0.0667505
Cylinders5 -3.5501045 1.1033696 -3.2175115 0.0012987
Cylinders6 -0.0901626 0.5888235 -0.1531234 0.8783052
Cylinders8 1.3082761 0.7376998 1.7734533 0.0761949
Cylinders10 6.3032070 1.0945633 5.7586500 0.0000000
Cylinders12 8.6056555 0.9572175 8.9902826 0.0000000
Cylinders16 28.8962479 3.0675567 9.4199556 0.0000000
Fuel.TypeE -137.7060177 0.5364755 -256.6865131 0.0000000
Fuel.TypeN -111.3336334 4.9258783 -22.6017834 0.0000000
Fuel.TypeX -30.5259613 0.3843917 -79.4136951 0.0000000
Fuel.TypeZ -31.1123288 0.3887748 -80.0266115 0.0000000
Fuel.Consumption.City 6.0962255 0.7424400 8.2110680 0.0000000
Fuel.Consumption.Hwy 5.4826752 0.6119735 8.9590083 0.0000000
Fuel.Consumption.Combined 8.1947979 1.3468737 6.0843108 0.0000000
Fuel.Consumption.mpg -0.9647167 0.0253006 -38.1302254 0.0000000

With this model, we do residual analysis to check if any of our assumptions have been violated.

## Warning: not plotting observations with leverage one:
##   2440

From these plots, we can see serious problems with the model. The variance of the residuals is not constant, and the plot of the residuals versus the fitted values shows a clear pattern. The Q-Q plot indicates that the residuals are not normally distributed.

We will perform a Box-Cox transformation to try to correct some of these issues.

We will also examine the VIF indices of the model to check for multicollinearity.

##                                  GVIF Df GVIF^(1/(2*Df))
## Engine.Size                 11.668643  1        3.415940
## Cylinders                   14.403671  7        1.209896
## Fuel.Type                    2.475681  4        1.119984
## Fuel.Consumption.City     2069.965111  1       45.496869
## Fuel.Consumption.Hwy       568.001039  1       23.832772
## Fuel.Consumption.Combined 4651.987253  1       68.205478
## Fuel.Consumption.mpg        10.261228  1        3.203315

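The VIF indices indicate severe multicollinearity among the fuel consumption variables: the GVIF values of Fuel.Consumption.City, Fuel.Consumption.Hwy, and Fuel.Consumption.Combined are all above 500. Engine.Size and Fuel.Consumption.mpg are more moderately affected, with GVIF values near 10.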

As a result, we remove the predictors Fuel.Consumption.City, Fuel.Consumption.Hwy, and Fuel.Consumption.Combined to reduce multicollinearity, keeping Fuel.Consumption.mpg as the single measure of fuel consumption.
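To put the GVIF values in perspective, the following is a minimal sketch that converts them into the R-squared obtained when a predictor is regressed on all the others, using the standard identity VIF = 1 / (1 - R^2). It covers only the single-degree-of-freedom terms, for which the GVIF equals the ordinary VIF. The numbers are copied from the full-model VIF table; the function and variable names are illustrative only.

```python
# GVIF values copied from the full-model VIF table (single-df terms only).
gvif = {
    "Engine.Size": 11.668643,
    "Fuel.Consumption.City": 2069.965111,
    "Fuel.Consumption.Hwy": 568.001039,
    "Fuel.Consumption.Combined": 4651.987253,
    "Fuel.Consumption.mpg": 10.261228,
}


def implied_r_squared(vif):
    """R^2 from regressing a predictor on the others, via VIF = 1 / (1 - R^2)."""
    return 1.0 - 1.0 / vif


implied = {name: implied_r_squared(v) for name, v in gvif.items()}
for name, r2 in implied.items():
    print(f"{name:28s} VIF = {gvif[name]:10.2f}   implied R^2 = {r2:.5f}")
```

Under this reading, the other predictors explain well over 99% of the variance in each of the three L/100km measures, which is why their individual coefficients are unstable when all of them are kept in the model.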

Edited Model examining CO2 Emissions
Term Estimate Std. Error t value Pr(>|t|)
(Intercept) 425.436323 2.7774036 153.177711 0.00e+00
Engine.Size 7.499682 0.4322624 17.349838 0.00e+00
Cylinders4 -8.933533 1.6013752 -5.578663 0.00e+00
Cylinders5 -13.777361 3.3833670 -4.072086 4.71e-05
Cylinders6 -5.947815 1.8059023 -3.293542 9.94e-04
Cylinders8 11.741026 2.2577261 5.200376 2.00e-07
Cylinders10 34.315686 3.3337059 10.293555 0.00e+00
Cylinders12 44.381076 2.8674310 15.477644 0.00e+00
Cylinders16 145.251628 9.2651721 15.677165 0.00e+00
Fuel.TypeE -77.751313 1.4658185 -53.042934 0.00e+00
Fuel.TypeN -100.317941 15.1136250 -6.637583 0.00e+00
Fuel.TypeX -25.916043 1.1773512 -22.012160 0.00e+00
Fuel.TypeZ -29.994388 1.1930155 -25.141659 0.00e+00
Fuel.Consumption.mpg -6.053156 0.0416497 -145.334966 0.00e+00
## Warning: not plotting observations with leverage one:
##   2440

##                           GVIF Df GVIF^(1/(2*Df))
## Engine.Size          11.145876  1        3.338544
## Cylinders            11.786999  7        1.192693
## Fuel.Type             1.410630  4        1.043943
## Fuel.Consumption.mpg  2.951199  1        1.717905

When we examine this model, we see that the VIF indices are much improved: the GVIF of Fuel.Consumption.mpg drops from 10.26 to 2.95, and no term has a GVIF^(1/(2*Df)) above 3.4. However, the variance of the residuals is still not constant and the assumption of normality is still violated. Therefore, we proceed with the Box-Cox transformations.

3.2 Box-Cox Transformations

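We apply the Box-Cox procedure to several versions of the model to choose a power transformation of the response.

Examining these plots, we can see that taking the log of Fuel.Consumption.mpg changes the estimated lambda. Values of lambda near 0.5 and near 0 correspond to the square-root and log transformations of the response considered in the next two sections.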
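For reference, the Box-Cox family of transformations can be sketched as below. This is a minimal illustration of the standard definition, not the code used to produce the plots; the function name `box_cox` and the sample value of 250 g/km are assumptions made for the example.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)  # limiting case as lambda -> 0
    return (y ** lam - 1.0) / lam


co2 = 250.0  # illustrative emissions value in g/km, not taken from the dataset

half = box_cox(co2, 0.5)        # equals 2 * (sqrt(y) - 1), a linear rescaling of sqrt(y)
zero = box_cox(co2, 0.0)        # equals log(y)
near_zero = box_cox(co2, 1e-6)  # approaches log(y) as lambda shrinks toward 0
print(half, zero, near_zero)
```

Because the lambda = 0.5 case is only a linear rescaling of the square root, fitting a model to sqrt(y) gives the same residual pattern as fitting it to the Box-Cox transform with lambda = 0.5.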

3.3 Square-root Transformation

Guided by the Box-Cox results, we take the square root of CO2.Emissions as the response, use log-transformed Fuel.Consumption.mpg as a predictor, and fit the following model:

sqrt-transformed model
Term Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.8253658 0.0659711 618.837289 0.0000000
Engine.Size 0.0411383 0.0050662 8.120089 0.0000000
Cylinders4 -0.1240613 0.0184903 -6.709526 0.0000000
Cylinders5 -0.2394471 0.0390861 -6.126148 0.0000000
Cylinders6 -0.1093716 0.0208543 -5.244550 0.0000002
Cylinders8 0.0425413 0.0261597 1.626214 0.1039468
Cylinders10 0.2527221 0.0386492 6.538875 0.0000000
Cylinders12 0.3834120 0.0332785 11.521310 0.0000000
Cylinders16 1.5391967 0.1072445 14.352216 0.0000000
Fuel.TypeE -3.6834953 0.0176334 -208.892974 0.0000000
Fuel.TypeN -3.6533055 0.1747640 -20.904222 0.0000000
Fuel.TypeX -1.0252207 0.0136332 -75.200343 0.0000000
Fuel.TypeZ -1.0793541 0.0137994 -78.217409 0.0000000
log(Fuel.Consumption.mpg) -7.3167119 0.0158285 -462.250481 0.0000000

The residual plots are as follows:

## Warning: not plotting observations with leverage one:
##   2440

Looking at these plots, the curvature in the residual plot is weaker and the Q-Q plot is closer to normal. Even so, the assumptions of constant variance and normality remain violated.

3.4 Log Transformation

In this transformation, we take the natural log of CO2.Emissions as the response, keeping log(Fuel.Consumption.mpg) as a predictor.

log-transformed model
Term Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.8909003 0.0062869 1414.2023188 0.0000000
Engine.Size 0.0003853 0.0004828 0.7981271 0.4248225
Cylinders4 -0.0024462 0.0017621 -1.3882273 0.1651098
Cylinders5 -0.0087866 0.0037248 -2.3589530 0.0183525
Cylinders6 0.0017023 0.0019874 0.8565724 0.3917091
Cylinders8 0.0014074 0.0024929 0.5645555 0.5723933
Cylinders10 0.0025751 0.0036832 0.6991585 0.4844750
Cylinders12 0.0069161 0.0031714 2.1808187 0.0292283
Cylinders16 0.0392187 0.0102201 3.8374033 0.0001254
Fuel.TypeE -0.4918855 0.0016804 -292.7164834 0.0000000
Fuel.TypeN -0.4798671 0.0166545 -28.8129962 0.0000000
Fuel.TypeX -0.1408340 0.0012992 -108.4000411 0.0000000
Fuel.TypeZ -0.1422930 0.0013150 -108.2037891 0.0000000
log(Fuel.Consumption.mpg) -0.9876379 0.0015084 -654.7545995 0.0000000

Once again, we will look at the residual plots.

## Warning: not plotting observations with leverage one:
##   2440

These plots are a clear improvement on the earlier models. The curvature and non-constant variance in the residuals versus fitted plot are much reduced, and the Q-Q plot is the closest to normal.

3.5 Goodness of Fit

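We compare three models: the untransformed model (labeled full.model; its Cp of 14 matches the 14 coefficients of the edited model rather than the 17 of the original full model), the square-root model, and the log model. Their goodness-of-fit statistics are summarized in the following table.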

Goodness-of-fit Measures of Candidate Models
Model SSE R.sq R.adj Cp AIC SBC PRESS
full.model 1.673191e+06 0.9338159 0.9336992 14 40077.13 40173.83 2293146
sqrt.model 2.236959e+02 0.9910218 0.9910060 14 -25796.75 -25700.04 Inf
log.model 2.031511e+00 0.9950472 0.9950385 14 -60517.38 -60420.68 Inf

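Looking at the goodness-of-fit statistics, the log-transformed model has the highest \(R^2\) and adjusted \(R^2\). SSE, AIC, and SBC also favor it, although these are computed on different response scales (CO2.Emissions, its square root, and its log), so they are not directly comparable across the three models. The infinite PRESS values are probably due to observation 2440, whose leverage of one puts a zero in the denominator of the PRESS residual \(e_i/(1-h_{ii})\). The log-transformed model also has the best residual plots and the fewest violations of our assumptions. As a result, we choose the log-transformed model as our final model.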
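The AIC and SBC columns appear to follow the usual least-squares forms, n log(SSE/n) + 2p and n log(SSE/n) + p log(n). The sketch below recomputes them from the SSE column. The sample size n = 7385 is an assumption: it is not stated in this report and is used here only because it reproduces the tabulated values closely. The value p = 14 is taken from the Cp column.

```python
import math

n = 7385  # ASSUMED sample size; not stated in the report
p = 14    # coefficients per model, matching the Cp column

# SSE values copied from the goodness-of-fit table.
sse = {
    "full.model": 1.673191e6,
    "sqrt.model": 2.236959e2,
    "log.model": 2.031511e0,
}


def aic(sse_value, n, p):
    """AIC for a least-squares fit, up to an additive constant."""
    return n * math.log(sse_value / n) + 2 * p


def sbc(sse_value, n, p):
    """Schwarz Bayesian criterion (BIC) in the same form."""
    return n * math.log(sse_value / n) + p * math.log(n)


results = {name: (aic(s, n, p), sbc(s, n, p)) for name, s in sse.items()}
for name, (a, b) in results.items():
    print(f"{name:10s}  AIC = {a:10.2f}  SBC = {b:10.2f}")
```

Since both criteria depend on SSE, and SSE is measured in the units of each model's response, much of the gap between rows reflects the change of scale rather than a difference in fit.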

4 Final Model

The following table summarizes our final model.

Inferential Statistics of Final Model
Term Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.8909003 0.0062869 1414.2023188 0.0000000
Engine.Size 0.0003853 0.0004828 0.7981271 0.4248225
Cylinders4 -0.0024462 0.0017621 -1.3882273 0.1651098
Cylinders5 -0.0087866 0.0037248 -2.3589530 0.0183525
Cylinders6 0.0017023 0.0019874 0.8565724 0.3917091
Cylinders8 0.0014074 0.0024929 0.5645555 0.5723933
Cylinders10 0.0025751 0.0036832 0.6991585 0.4844750
Cylinders12 0.0069161 0.0031714 2.1808187 0.0292283
Cylinders16 0.0392187 0.0102201 3.8374033 0.0001254
Fuel.TypeE -0.4918855 0.0016804 -292.7164834 0.0000000
Fuel.TypeN -0.4798671 0.0166545 -28.8129962 0.0000000
Fuel.TypeX -0.1408340 0.0012992 -108.4000411 0.0000000
Fuel.TypeZ -0.1422930 0.0013150 -108.2037891 0.0000000
log(Fuel.Consumption.mpg) -0.9876379 0.0015084 -654.7545995 0.0000000

When we consider the final model, we see that the p-values for most of the predictors are very close to zero. However, the p-values for Engine.Size and for several levels of Cylinders are large. To further refine our model, we eliminate the variable Engine.Size.

Inferential Statistics of Final Model
Term Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.8929222 0.0057537 1545.601271 0.0000000
Cylinders4 -0.0022700 0.0017482 -1.298518 0.1941502
Cylinders5 -0.0084924 0.0037064 -2.291271 0.0219758
Cylinders6 0.0023103 0.0018355 1.258666 0.2081908
Cylinders8 0.0026302 0.0019665 1.337512 0.1810969
Cylinders10 0.0039587 0.0032496 1.218185 0.2231925
Cylinders12 0.0085298 0.0024433 3.491121 0.0004838
Cylinders16 0.0413837 0.0098533 4.199975 0.0000270
Fuel.TypeE -0.4919640 0.0016775 -293.273004 0.0000000
Fuel.TypeN -0.4798481 0.0166541 -28.812591 0.0000000
Fuel.TypeX -0.1408128 0.0012989 -108.409037 0.0000000
Fuel.TypeZ -0.1423502 0.0013131 -108.411220 0.0000000
log(Fuel.Consumption.mpg) -0.9880461 0.0014190 -696.289784 0.0000000
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced


5 Summary of the Model

We write the final model explicitly as follows:

\(\log(\text{CO2.Emissions}) = 8.893 - 0.00227 \times \text{Cylinders4} - 0.00849 \times \text{Cylinders5} + 0.00231 \times \text{Cylinders6} + 0.00263 \times \text{Cylinders8} + 0.00396 \times \text{Cylinders10} + 0.00853 \times \text{Cylinders12} + 0.0414 \times \text{Cylinders16} - 0.492 \times \text{Fuel.TypeE} - 0.480 \times \text{Fuel.TypeN} - 0.141 \times \text{Fuel.TypeX} - 0.142 \times \text{Fuel.TypeZ} - 0.988 \times \log(\text{Fuel.Consumption.mpg})\)

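Because the response is on the natural-log scale, each coefficient describes a multiplicative change in CO2 emissions with all other predictors held constant. The coefficient of a dummy variable compares that level with the reference level, which is the cylinder count or fuel type that has no term in the model (D, diesel, for Fuel.Type). For example, the coefficient of -0.492 for Fuel.TypeE means that an ethanol car is predicted to emit \(\exp(-0.492) \approx 0.61\) times as much CO2 as a diesel car with the same cylinder count and the same fuel consumption in mpg. For the continuous predictor, an increase of one in log(Fuel.Consumption.mpg) corresponds to a decrease of 0.988 in the log of the CO2 emissions of the car in g/km. Since both sides are logged, this means that a 1% increase in mpg corresponds to a decrease of roughly 0.98% in predicted CO2 emissions.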
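As a rough illustration of how the fitted equation is used, the sketch below evaluates it for a hypothetical car and back-transforms the result to g/km. The coefficients are copied from the refined final-model table; the function name `predict_co2` and the example car (4 cylinders, fuel type X, 30 mpg) are assumptions made for the example. Levels without a coefficient are treated as the reference level.

```python
import math

# Coefficients copied from the refined final-model table (Engine.Size removed).
INTERCEPT = 8.8929222
CYLINDERS = {4: -0.0022700, 5: -0.0084924, 6: 0.0023103, 8: 0.0026302,
             10: 0.0039587, 12: 0.0085298, 16: 0.0413837}
FUEL_TYPE = {"E": -0.4919640, "N": -0.4798481, "X": -0.1408128, "Z": -0.1423502}
LOG_MPG = -0.9880461


def predict_co2(cylinders, fuel_type, mpg):
    """Predicted CO2 emissions in g/km, back-transformed from the log scale."""
    log_co2 = (INTERCEPT
               + CYLINDERS.get(cylinders, 0.0)  # reference level contributes 0
               + FUEL_TYPE.get(fuel_type, 0.0)  # "D" (diesel) is the reference
               + LOG_MPG * math.log(mpg))
    return math.exp(log_co2)


base = predict_co2(4, "X", 30.0)           # hypothetical 4-cylinder gasoline car
better = predict_co2(4, "X", 30.0 * 1.01)  # same car with 1% higher mpg
print(round(base, 1), round(100 * (better / base - 1), 3))
```

The second printed number is the percentage change in predicted emissions for a 1% gain in mpg, which does not depend on the cylinder count, fuel type, or starting mpg chosen.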

6 Discussion

We used the Box-Cox procedure to guide square-root and log transformations of the response, and chose the model that best explains CO2 emissions for this dataset. In the process, we removed several variables due to multicollinearity and then removed the variable Engine.Size due to its high p-value. We selected among the models based on their residual plots and goodness-of-fit measures.

Our final model still shows some non-constant variance in the residuals and slight departures from normality, and the data include some outliers, such as observation 4084. Further modeling and analysis, including bootstrapping, may produce a better model.
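One way the bootstrap could be applied is case resampling: redraw observations with replacement, refit the model, and read a confidence interval off the spread of the refitted coefficients, which avoids relying on the normality assumption. The sketch below shows the idea for a single slope. It uses synthetic data, not the CO2 dataset, and a simple one-predictor regression; all names and settings are illustrative assumptions.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x for paired lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic data (NOT the CO2 dataset): y = 2 + 3x + noise.
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

# Case-resampling bootstrap: redraw (x, y) pairs with replacement and refit.
slopes = []
for _ in range(1000):
    idx = [rng.randrange(len(xs)) for _ in xs]
    slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))

slopes.sort()
lo, hi = slopes[24], slopes[974]  # approximate 95% percentile interval
print(f"slope = {ols_slope(xs, ys):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

For the model in this report, the same loop would resample rows of the dataset and refit the full regression on each resample.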