I chose a dataset from this link: https://www.kaggle.com/datasets/bhuviranga/co2-emissions, which includes information about the CO2 emissions of different cars. The author did not include information about how the data were collected. The dataset includes the following variables:
Engine.Size (continuous) - the engine size in liters
Cylinders (categorical) - the number of cylinders in the car
Fuel.Type (categorical) - the type of fuel used by the car. D = diesel, E = ethanol, N = natural gas, X = gasoline, Z = premium gasoline
Fuel.Consumption.City (continuous) - the fuel consumption of the car in L/100km when driving in the city
Fuel.Consumption.Hwy (continuous) - the fuel consumption of the car in L/100km when driving on highways
Fuel.Consumption.Combined (continuous) - the fuel consumption of the car in L/100km when driving in a combination of city roads and highways
Fuel.Consumption.mpg (continuous) - the fuel consumption of the car in miles per gallon when driving in a combination of city roads and highways
CO2.Emissions (continuous) - the emissions of the car in grams of CO2 per kilometer traveled
The primary question is how different predictor variables relate to the CO2 emissions of the vehicle.
We begin by examining the predictors in a model with CO2.Emissions as the response variable, starting with the full model that includes every predictor.
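As a minimal sketch of how this full model could be fitted (the data frame name `co2`, the file name, and the explicit factor coding are assumptions, not the original code):

```r
# Read the Kaggle data and code the discrete predictors as factors
co2 <- read.csv("CO2 Emissions.csv")      # file name is an assumption
co2$Cylinders <- as.factor(co2$Cylinders)
co2$Fuel.Type <- as.factor(co2$Fuel.Type)

# Full model: every available predictor, CO2.Emissions as the response
full.model <- lm(CO2.Emissions ~ Engine.Size + Cylinders + Fuel.Type +
                   Fuel.Consumption.City + Fuel.Consumption.Hwy +
                   Fuel.Consumption.Combined + Fuel.Consumption.mpg,
                 data = co2)
summary(full.model)
```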
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
(Intercept) | 93.8670059 | 1.6422899 | 57.1561726 | 0.0000000 |
Engine.Size | 0.8218282 | 0.1440850 | 5.7037739 | 0.0000000 |
Cylinders4 | -0.9584579 | 0.5227131 | -1.8336215 | 0.0667505 |
Cylinders5 | -3.5501045 | 1.1033696 | -3.2175115 | 0.0012987 |
Cylinders6 | -0.0901626 | 0.5888235 | -0.1531234 | 0.8783052 |
Cylinders8 | 1.3082761 | 0.7376998 | 1.7734533 | 0.0761949 |
Cylinders10 | 6.3032070 | 1.0945633 | 5.7586500 | 0.0000000 |
Cylinders12 | 8.6056555 | 0.9572175 | 8.9902826 | 0.0000000 |
Cylinders16 | 28.8962479 | 3.0675567 | 9.4199556 | 0.0000000 |
Fuel.TypeE | -137.7060177 | 0.5364755 | -256.6865131 | 0.0000000 |
Fuel.TypeN | -111.3336334 | 4.9258783 | -22.6017834 | 0.0000000 |
Fuel.TypeX | -30.5259613 | 0.3843917 | -79.4136951 | 0.0000000 |
Fuel.TypeZ | -31.1123288 | 0.3887748 | -80.0266115 | 0.0000000 |
Fuel.Consumption.City | 6.0962255 | 0.7424400 | 8.2110680 | 0.0000000 |
Fuel.Consumption.Hwy | 5.4826752 | 0.6119735 | 8.9590083 | 0.0000000 |
Fuel.Consumption.Combined | 8.1947979 | 1.3468737 | 6.0843108 | 0.0000000 |
Fuel.Consumption.mpg | -0.9647167 | 0.0253006 | -38.1302254 | 0.0000000 |
With this model fitted, we perform residual analysis to check whether any of our assumptions have been violated.
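A sketch of the standard diagnostic plots, assuming the fitted object is named `full.model` as in the sketch above:

```r
# Residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(full.model)
```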
## Warning: not plotting observations with leverage one:
## 2440
From looking at these plots, we can see some substantial issues with our model: the variance of the residuals is not constant, a pattern is visible in the plot of residuals versus fitted values, and the Q-Q plot indicates that the residuals are not normally distributed.
We will perform a Box-Cox transformation to try to correct some of these issues.
We will also examine the VIF indices of the model to check for multicollinearity.
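The generalized VIFs reported below can be computed with `vif()` from the car package; the model object name is again an assumption:

```r
library(car)       # for vif()
vif(full.model)    # reports GVIF and GVIF^(1/(2*Df)) for factor terms
```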
## GVIF Df GVIF^(1/(2*Df))
## Engine.Size 11.668643 1 3.415940
## Cylinders 14.403671 7 1.209896
## Fuel.Type 2.475681 4 1.119984
## Fuel.Consumption.City 2069.965111 1 45.496869
## Fuel.Consumption.Hwy 568.001039 1 23.832772
## Fuel.Consumption.Combined 4651.987253 1 68.205478
## Fuel.Consumption.mpg 10.261228 1 3.203315
The VIF indices indicate significant multicollinearity issues for the variables Fuel.Consumption.City, Fuel.Consumption.Hwy, Fuel.Consumption.Combined, and Fuel.Consumption.mpg.
As a result, we remove the predictors Fuel.Consumption.City, Fuel.Consumption.Hwy, and Fuel.Consumption.Combined to reduce multicollinearity, keeping only Fuel.Consumption.mpg.
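One way to refit the model without the three collinear fuel-consumption predictors (object names are assumptions carried over from the earlier sketches):

```r
# Keep only Fuel.Consumption.mpg among the fuel-consumption measures
reduced.model <- update(full.model,
                        . ~ . - Fuel.Consumption.City - Fuel.Consumption.Hwy
                            - Fuel.Consumption.Combined)
summary(reduced.model)
```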
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
(Intercept) | 425.436323 | 2.7774036 | 153.177711 | 0.00e+00 |
Engine.Size | 7.499682 | 0.4322624 | 17.349838 | 0.00e+00 |
Cylinders4 | -8.933533 | 1.6013752 | -5.578663 | 0.00e+00 |
Cylinders5 | -13.777361 | 3.3833670 | -4.072086 | 4.71e-05 |
Cylinders6 | -5.947815 | 1.8059023 | -3.293542 | 9.94e-04 |
Cylinders8 | 11.741026 | 2.2577261 | 5.200376 | 2.00e-07 |
Cylinders10 | 34.315686 | 3.3337059 | 10.293555 | 0.00e+00 |
Cylinders12 | 44.381076 | 2.8674310 | 15.477644 | 0.00e+00 |
Cylinders16 | 145.251628 | 9.2651721 | 15.677165 | 0.00e+00 |
Fuel.TypeE | -77.751313 | 1.4658185 | -53.042934 | 0.00e+00 |
Fuel.TypeN | -100.317941 | 15.1136250 | -6.637583 | 0.00e+00 |
Fuel.TypeX | -25.916043 | 1.1773512 | -22.012160 | 0.00e+00 |
Fuel.TypeZ | -29.994388 | 1.1930155 | -25.141659 | 0.00e+00 |
Fuel.Consumption.mpg | -6.053156 | 0.0416497 | -145.334966 | 0.00e+00 |
## Warning: not plotting observations with leverage one:
## 2440
## GVIF Df GVIF^(1/(2*Df))
## Engine.Size 11.145876 1 3.338544
## Cylinders 11.786999 7 1.192693
## Fuel.Type 1.410630 4 1.043943
## Fuel.Consumption.mpg 2.951199 1 1.717905
When we examine this model, we see that the VIF indices are much improved and no longer indicate major issues with multicollinearity. However, the variance of the residuals is still not constant and the assumption of normality is still violated. Therefore, we proceed with the Box-Cox transformations.
We will perform several Box-Cox transformations on the model.
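A sketch of how the Box-Cox profiles could be produced with `boxcox()` from the MASS package (the lambda grid and the object names are assumptions):

```r
library(MASS)   # for boxcox()

# Box-Cox lambda profile for the reduced model as fitted
boxcox(reduced.model, lambda = seq(-1, 1, by = 0.05))

# Profile again with Fuel.Consumption.mpg entered on the log scale
boxcox(lm(CO2.Emissions ~ Engine.Size + Cylinders + Fuel.Type +
            log(Fuel.Consumption.mpg), data = co2),
       lambda = seq(-1, 1, by = 0.05))
```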
Examining these plots, we can see that taking the log of Fuel.Consumption.mpg changes the estimated value of lambda.
Using the square-root transformation of CO2.Emissions indicated by our Box-Cox analysis, together with log-transformed Fuel.Consumption.mpg, we create the following model:
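A sketch of this fit, under the assumption (consistent with the `sqrt.model` label in the comparison table below) that the response is square-root transformed:

```r
# Square-root response with log-transformed fuel economy
sqrt.model <- lm(sqrt(CO2.Emissions) ~ Engine.Size + Cylinders + Fuel.Type +
                   log(Fuel.Consumption.mpg), data = co2)
summary(sqrt.model)
```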
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
(Intercept) | 40.8253658 | 0.0659711 | 618.837289 | 0.0000000 |
Engine.Size | 0.0411383 | 0.0050662 | 8.120089 | 0.0000000 |
Cylinders4 | -0.1240613 | 0.0184903 | -6.709526 | 0.0000000 |
Cylinders5 | -0.2394471 | 0.0390861 | -6.126148 | 0.0000000 |
Cylinders6 | -0.1093716 | 0.0208543 | -5.244550 | 0.0000002 |
Cylinders8 | 0.0425413 | 0.0261597 | 1.626214 | 0.1039468 |
Cylinders10 | 0.2527221 | 0.0386492 | 6.538875 | 0.0000000 |
Cylinders12 | 0.3834120 | 0.0332785 | 11.521310 | 0.0000000 |
Cylinders16 | 1.5391967 | 0.1072445 | 14.352216 | 0.0000000 |
Fuel.TypeE | -3.6834953 | 0.0176334 | -208.892974 | 0.0000000 |
Fuel.TypeN | -3.6533055 | 0.1747640 | -20.904222 | 0.0000000 |
Fuel.TypeX | -1.0252207 | 0.0136332 | -75.200343 | 0.0000000 |
Fuel.TypeZ | -1.0793541 | 0.0137994 | -78.217409 | 0.0000000 |
log(Fuel.Consumption.mpg) | -7.3167119 | 0.0158285 | -462.250481 | 0.0000000 |
The residual plots are as follows:
## Warning: not plotting observations with leverage one:
## 2440
Looking at these plots, the curvature in the residual plot is weaker and the Q-Q plot is closer to normal. Even so, the assumptions of constant variance and normality remain violated.
For the next transformation, we take the log of CO2.Emissions instead.
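A sketch of the log-transformed fit, with the object names again assumed:

```r
# Log-transformed response with log-transformed fuel economy
log.model <- lm(log(CO2.Emissions) ~ Engine.Size + Cylinders + Fuel.Type +
                  log(Fuel.Consumption.mpg), data = co2)
summary(log.model)
```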
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
(Intercept) | 8.8909003 | 0.0062869 | 1414.2023188 | 0.0000000 |
Engine.Size | 0.0003853 | 0.0004828 | 0.7981271 | 0.4248225 |
Cylinders4 | -0.0024462 | 0.0017621 | -1.3882273 | 0.1651098 |
Cylinders5 | -0.0087866 | 0.0037248 | -2.3589530 | 0.0183525 |
Cylinders6 | 0.0017023 | 0.0019874 | 0.8565724 | 0.3917091 |
Cylinders8 | 0.0014074 | 0.0024929 | 0.5645555 | 0.5723933 |
Cylinders10 | 0.0025751 | 0.0036832 | 0.6991585 | 0.4844750 |
Cylinders12 | 0.0069161 | 0.0031714 | 2.1808187 | 0.0292283 |
Cylinders16 | 0.0392187 | 0.0102201 | 3.8374033 | 0.0001254 |
Fuel.TypeE | -0.4918855 | 0.0016804 | -292.7164834 | 0.0000000 |
Fuel.TypeN | -0.4798671 | 0.0166545 | -28.8129962 | 0.0000000 |
Fuel.TypeX | -0.1408340 | 0.0012992 | -108.4000411 | 0.0000000 |
Fuel.TypeZ | -0.1422930 | 0.0013150 | -108.2037891 | 0.0000000 |
log(Fuel.Consumption.mpg) | -0.9876379 | 0.0015084 | -654.7545995 | 0.0000000 |
Once again, we will look at the residual plots.
## Warning: not plotting observations with leverage one:
## 2440
These plots are significantly improved compared with the earlier models. The curvature and non-constant variance in the residuals-versus-fitted plot are greatly reduced, and the Q-Q plot is the closest to normal.
We will compare the three models based on different goodness-of-fit statistics that we will summarize in the following table.
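One way these statistics could be assembled in base R (a sketch; the helper function, and the assumption that the first row refers to the reduced untransformed fit, are mine rather than the original code):

```r
# Collect goodness-of-fit statistics for one fitted lm
fit.stats <- function(model) {
  p     <- length(coefficients(model))
  sse   <- sum(residuals(model)^2)
  # PRESS divides by (1 - leverage), so it blows up when an observation has leverage ~1
  press <- sum((residuals(model) / (1 - hatvalues(model)))^2)
  c(SSE   = sse,
    R.sq  = summary(model)$r.squared,
    R.adj = summary(model)$adj.r.squared,
    Cp    = p,          # Mallows' Cp equals p when the model's own MSE is the benchmark
    AIC   = AIC(model),
    SBC   = BIC(model),
    PRESS = press)
}

rbind(full.model = fit.stats(reduced.model),   # untransformed (reduced) fit
      sqrt.model = fit.stats(sqrt.model),
      log.model  = fit.stats(log.model))
```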
| | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
full.model | 1.673191e+06 | 0.9338159 | 0.9336992 | 14 | 40077.13 | 40173.83 | 2293146 |
sqrt.model | 2.236959e+02 | 0.9910218 | 0.9910060 | 14 | -25796.75 | -25700.04 | Inf |
log.model | 2.031511e+00 | 0.9950472 | 0.9950385 | 14 | -60517.38 | -60420.68 | Inf |
Looking at the goodness-of-fit statistics and the residual plots, we can see that the \(R^2\) and adjusted \(R^2\) of the third model are the highest. The log-transformed model also has the best residual plots and the fewest violations of our assumptions. As a result, we choose the log-transformed model as our final model.
The following table summarizes our final model.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
(Intercept) | 8.8909003 | 0.0062869 | 1414.2023188 | 0.0000000 |
Engine.Size | 0.0003853 | 0.0004828 | 0.7981271 | 0.4248225 |
Cylinders4 | -0.0024462 | 0.0017621 | -1.3882273 | 0.1651098 |
Cylinders5 | -0.0087866 | 0.0037248 | -2.3589530 | 0.0183525 |
Cylinders6 | 0.0017023 | 0.0019874 | 0.8565724 | 0.3917091 |
Cylinders8 | 0.0014074 | 0.0024929 | 0.5645555 | 0.5723933 |
Cylinders10 | 0.0025751 | 0.0036832 | 0.6991585 | 0.4844750 |
Cylinders12 | 0.0069161 | 0.0031714 | 2.1808187 | 0.0292283 |
Cylinders16 | 0.0392187 | 0.0102201 | 3.8374033 | 0.0001254 |
Fuel.TypeE | -0.4918855 | 0.0016804 | -292.7164834 | 0.0000000 |
Fuel.TypeN | -0.4798671 | 0.0166545 | -28.8129962 | 0.0000000 |
Fuel.TypeX | -0.1408340 | 0.0012992 | -108.4000411 | 0.0000000 |
Fuel.TypeZ | -0.1422930 | 0.0013150 | -108.2037891 | 0.0000000 |
log(Fuel.Consumption.mpg) | -0.9876379 | 0.0015084 | -654.7545995 | 0.0000000 |
When we consider the final model, we see that the vast majority of the p-values for the predictors are very close to zero. However, the p-values for Engine.Size and for several levels of the Cylinders variable are large. To further refine our model, we eliminate the variable Engine.Size.
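A sketch of the refit, again assuming the object names used above:

```r
# Drop Engine.Size from the log-transformed model
final.model <- update(log.model, . ~ . - Engine.Size)
summary(final.model)
```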
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
(Intercept) | 8.8929222 | 0.0057537 | 1545.601271 | 0.0000000 |
Cylinders4 | -0.0022700 | 0.0017482 | -1.298518 | 0.1941502 |
Cylinders5 | -0.0084924 | 0.0037064 | -2.291271 | 0.0219758 |
Cylinders6 | 0.0023103 | 0.0018355 | 1.258666 | 0.2081908 |
Cylinders8 | 0.0026302 | 0.0019665 | 1.337512 | 0.1810969 |
Cylinders10 | 0.0039587 | 0.0032496 | 1.218185 | 0.2231925 |
Cylinders12 | 0.0085298 | 0.0024433 | 3.491121 | 0.0004838 |
Cylinders16 | 0.0413837 | 0.0098533 | 4.199975 | 0.0000270 |
Fuel.TypeE | -0.4919640 | 0.0016775 | -293.273004 | 0.0000000 |
Fuel.TypeN | -0.4798481 | 0.0166541 | -28.812591 | 0.0000000 |
Fuel.TypeX | -0.1408128 | 0.0012989 | -108.409037 | 0.0000000 |
Fuel.TypeZ | -0.1423502 | 0.0013131 | -108.411220 | 0.0000000 |
log(Fuel.Consumption.mpg) | -0.9880461 | 0.0014190 | -696.289784 | 0.0000000 |
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
We write the final model explicitly as follows:
\(\log(\text{CO2.Emissions}) = 8.893 - 0.00227 \times \text{Cylinders4} - 0.00849 \times \text{Cylinders5} + 0.00231 \times \text{Cylinders6} + 0.00263 \times \text{Cylinders8} + 0.00396 \times \text{Cylinders10} + 0.00853 \times \text{Cylinders12} + 0.0414 \times \text{Cylinders16} - 0.492 \times \text{Fuel.TypeE} - 0.480 \times \text{Fuel.TypeN} - 0.141 \times \text{Fuel.TypeX} - 0.142 \times \text{Fuel.TypeZ} - 0.988 \times \log(\text{Fuel.Consumption.mpg})\)
Each coefficient means that, with all other predictors held constant, a one-unit increase in the corresponding predictor changes the response by the value of the coefficient (an increase if the coefficient is positive, a decrease if it is negative). For example, an increase of one in log(Fuel.Consumption.mpg) corresponds to a decrease of 0.988 in the log of the CO2 emissions of the car in g/km. Because both the response and Fuel.Consumption.mpg enter the model on the log scale, this coefficient can also be read approximately as an elasticity: a 1% increase in fuel economy (mpg) corresponds to roughly a 0.99% decrease in CO2 emissions.
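As an illustration of how the fitted equation can be used (the vehicle below is hypothetical, not a row from the dataset):

```r
# Hypothetical 4-cylinder car on regular gasoline (X) rated at 30 mpg combined
eta <- 8.893 - 0.00227 - 0.141 - 0.988 * log(30)   # predicted log(CO2 emissions)
exp(eta)                                           # back-transformed: roughly 219 g CO2/km
```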
We used Box-Cox, square-root, and log transformations to create a model that best explains CO2 emissions for this dataset. In the process of building our model, we removed several variables due to multicollinearity, and then removed the variable Engine.Size due to its high p-value. We selected our models based on their residual plots and goodness-of-fit measures.
Our final model still shows some non-constant variance in the residuals and slight violations of the assumption of normality, and the data include several outlying or influential observations, such as observation 4084. Further model building and analysis, as well as bootstrapping, could be used to create a better model.