1 Introduction

I chose a dataset from Kaggle (https://www.kaggle.com/datasets/bhuviranga/co2-emissions) that contains information about the CO2 emissions of different cars. The author did not describe how the data were collected.

1.1 Description of the Dataset

The dataset includes the following variables.

Engine.Size (continuous) - the engine size in liters

fac.Cylinders (categorical) - the number of cylinders in the car

Fuel.Type (categorical) - the type of fuel used by the car. D = diesel, E = ethanol, N = natural gas, X = gasoline, Z = premium gasoline

Fuel.Consumption.City (continuous) - the fuel consumption of the car in L/100km when driving in the city

Fuel.Consumption.Hwy (continuous) - the fuel consumption of the car in L/100km when driving on highways

Fuel.Consumption.Combined (continuous) - the fuel consumption of the car in L/100km when driving in a combination of city roads and highways

Fuel.Consumption.mpg (continuous) - the fuel consumption of the car in miles per gallon when driving in a combination of city roads and highways

CO2.Emissions (continuous) - the emissions of the car in grams of CO2 per kilometer traveled

1.2 Research Question

The primary question is how different predictor variables relate to the CO2 emissions of the vehicle. There should be sufficient information to address this question from the data provided.

2 Model Building through MLR

We begin by examining the predictors in a model with CO2.Emissions as the response variable.

2.1 Full Model

We look at the full model first.

Full Model examining CO2 Emissions
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.8670059 1.6422899 57.1561726 0.0000000
Engine.Size 0.8218282 0.1440850 5.7037739 0.0000000
fac.Cylinders4 -0.9584579 0.5227131 -1.8336215 0.0667505
fac.Cylinders5 -3.5501045 1.1033696 -3.2175115 0.0012987
fac.Cylinders6 -0.0901626 0.5888235 -0.1531234 0.8783052
fac.Cylinders8 1.3082761 0.7376998 1.7734533 0.0761949
fac.Cylinders10 6.3032070 1.0945633 5.7586500 0.0000000
fac.Cylinders12 8.6056555 0.9572175 8.9902826 0.0000000
fac.Cylinders16 28.8962479 3.0675567 9.4199556 0.0000000
Fuel.TypeE -137.7060177 0.5364755 -256.6865131 0.0000000
Fuel.TypeN -111.3336334 4.9258783 -22.6017834 0.0000000
Fuel.TypeX -30.5259613 0.3843917 -79.4136951 0.0000000
Fuel.TypeZ -31.1123288 0.3887748 -80.0266115 0.0000000
Fuel.Consumption.City 6.0962255 0.7424400 8.2110680 0.0000000
Fuel.Consumption.Hwy 5.4826752 0.6119735 8.9590083 0.0000000
Fuel.Consumption.Combined 8.1947979 1.3468737 6.0843108 0.0000000
Fuel.Consumption.mpg -0.9647167 0.0253006 -38.1302254 0.0000000

With this model, we do residual analysis to check if any of our assumptions have been violated.

## Warning: not plotting observations with leverage one:
##   2440

From these plots, we can see serious problems with our model. The variance of the residuals is not constant, and a pattern is visible in the plot of the residuals versus the fitted values. The Q-Q plot indicates that our residuals are not normally distributed.

We will perform a Box-Cox transformation to try and correct some of these issues.

We will also examine the VIF indices of the model to check for multicollinearity.

##                                  GVIF Df GVIF^(1/(2*Df))
## Engine.Size                 11.668643  1        3.415940
## fac.Cylinders               14.403671  7        1.209896
## Fuel.Type                    2.475681  4        1.119984
## Fuel.Consumption.City     2069.965111  1       45.496869
## Fuel.Consumption.Hwy       568.001039  1       23.832772
## Fuel.Consumption.Combined 4651.987253  1       68.205478
## Fuel.Consumption.mpg        10.261228  1        3.203315

The VIF indices indicate serious multicollinearity among the fuel-consumption variables: the GVIF values are in the hundreds to thousands for Fuel.Consumption.City, Fuel.Consumption.Hwy, and Fuel.Consumption.Combined, and about 10 for Fuel.Consumption.mpg.
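To make the VIF idea concrete, here is a minimal Python sketch (not the code used for this report, whose output appears to come from R). It computes the ordinary VIF of a predictor as 1/(1 - R^2), where R^2 comes from regressing that predictor on all the others. The data are synthetic, and the 55/45 weighting used to build the combined variable is an assumption chosen only to create near-collinearity.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    X holds the predictors only (no intercept column); each column is
    regressed on all the others plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data (an assumption for illustration, not the Kaggle data).
rng = np.random.default_rng(0)
city = rng.normal(12.0, 3.0, 500)
hwy = rng.normal(9.0, 2.0, 500)
combined = 0.55 * city + 0.45 * hwy + rng.normal(0.0, 0.05, 500)
unrelated = rng.normal(0.0, 1.0, 500)

vifs = vif(np.column_stack([city, hwy, combined, unrelated]))
for name, value in zip(["city", "hwy", "combined", "unrelated"], vifs):
    print(name, round(float(value), 2))
```

The three nearly collinear columns get very large VIFs while the unrelated column stays near 1. For a predictor with Df = 1, the last column of the GVIF table is simply the square root of the GVIF; for example, the square root of 11.668643 is about 3.4159 for Engine.Size.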

As a result, we remove the predictors of Fuel.Consumption.City, Fuel.Consumption.Hwy, and Fuel.Consumption.Combined to reduce multicollinearity.

Edited Model examining CO2 Emissions
Estimate Std. Error t value Pr(>|t|)
(Intercept) 425.436323 2.7774036 153.177711 0.00e+00
Engine.Size 7.499682 0.4322624 17.349838 0.00e+00
fac.Cylinders4 -8.933533 1.6013752 -5.578663 0.00e+00
fac.Cylinders5 -13.777361 3.3833670 -4.072086 4.71e-05
fac.Cylinders6 -5.947815 1.8059023 -3.293542 9.94e-04
fac.Cylinders8 11.741026 2.2577261 5.200376 2.00e-07
fac.Cylinders10 34.315686 3.3337059 10.293555 0.00e+00
fac.Cylinders12 44.381076 2.8674310 15.477644 0.00e+00
fac.Cylinders16 145.251628 9.2651721 15.677165 0.00e+00
Fuel.TypeE -77.751313 1.4658185 -53.042934 0.00e+00
Fuel.TypeN -100.317941 15.1136250 -6.637583 0.00e+00
Fuel.TypeX -25.916043 1.1773512 -22.012160 0.00e+00
Fuel.TypeZ -29.994388 1.1930155 -25.141659 0.00e+00
Fuel.Consumption.mpg -6.053156 0.0416497 -145.334966 0.00e+00
## Warning: not plotting observations with leverage one:
##   2440

##                           GVIF Df GVIF^(1/(2*Df))
## Engine.Size          11.145876  1        3.338544
## fac.Cylinders        11.786999  7        1.192693
## Fuel.Type             1.410630  4        1.043943
## Fuel.Consumption.mpg  2.951199  1        1.717905

When we examine this model, we see that the VIF indices are much improved and no longer show major issues with multicollinearity. However, the variance of the residuals is still not constant and the assumption of normality is still violated. Therefore, we proceed with the Box-Cox transformations.

2.2 Box-Cox Transformations

We will perform several Box-Cox transformations on the model.

Examining these plots, we can see that taking the log of Fuel.Consumption.mpg changes the value of lambda suggested by the Box-Cox procedure.
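As a rough illustration of what a Box-Cox plot shows, the following Python sketch evaluates the Box-Cox profile log-likelihood over a grid of lambda values and picks the maximizer. It is a sketch under stated assumptions, not the procedure used in this report: the data are synthetic, generated so that the log of the response is linear in the log of a made-up mpg variable, and the function name `boxcox_loglik` is hypothetical.

```python
import numpy as np

def boxcox_loglik(y, X, lam):
    """Profile log-likelihood of the Box-Cox parameter lam for an OLS fit.

    y must be positive; X must already contain the intercept column.
    """
    y = np.asarray(y, dtype=float)
    # Box-Cox transform: log(y) at lam = 0, (y^lam - 1) / lam otherwise.
    z = np.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    n = len(y)
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * np.log(resid @ resid / n) + (lam - 1.0) * np.sum(np.log(y))

# Synthetic data (assumption): log(response) is linear in log(mpg).
rng = np.random.default_rng(1)
n = 400
mpg = rng.uniform(15.0, 50.0, n)
y = np.exp(8.9 - 1.0 * np.log(mpg) + rng.normal(0.0, 0.05, n))
X = np.column_stack([np.ones(n), np.log(mpg)])

grid = np.linspace(-2.0, 2.0, 81)
loglik = np.array([boxcox_loglik(y, X, lam) for lam in grid])
best_lambda = float(grid[np.argmax(loglik)])
print("lambda maximizing the profile log-likelihood:", best_lambda)
```

A maximizing lambda near 0 points toward a log transformation of the response, and a value near 0.5 points toward a square root, which are the two transformations tried in the following sections.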

2.3 Square-root Transformation

Guided by the Box-Cox results, we take the square root of CO2.Emissions and log-transform Fuel.Consumption.mpg, which gives the following model:

sqrt-transformed model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.8253658 0.0659711 618.837289 0.0000000
Engine.Size 0.0411383 0.0050662 8.120089 0.0000000
fac.Cylinders4 -0.1240613 0.0184903 -6.709526 0.0000000
fac.Cylinders5 -0.2394471 0.0390861 -6.126148 0.0000000
fac.Cylinders6 -0.1093716 0.0208543 -5.244550 0.0000002
fac.Cylinders8 0.0425413 0.0261597 1.626214 0.1039468
fac.Cylinders10 0.2527221 0.0386492 6.538875 0.0000000
fac.Cylinders12 0.3834120 0.0332785 11.521310 0.0000000
fac.Cylinders16 1.5391967 0.1072445 14.352216 0.0000000
Fuel.TypeE -3.6834953 0.0176334 -208.892974 0.0000000
Fuel.TypeN -3.6533055 0.1747640 -20.904222 0.0000000
Fuel.TypeX -1.0252207 0.0136332 -75.200343 0.0000000
Fuel.TypeZ -1.0793541 0.0137994 -78.217409 0.0000000
log(Fuel.Consumption.mpg) -7.3167119 0.0158285 -462.250481 0.0000000

The residual plots are as follows:

## Warning: not plotting observations with leverage one:
##   2440

Looking at these plots, the curvature in the residual plot looks weaker and the Q-Q plot is closer to normal. Even so, the assumptions of constant variance and normality still remain violated.

2.4 Log Transformation

In this transformation, we will take the log of CO2 Emissions.

log-transformed model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.8909003 0.0062869 1414.2023188 0.0000000
Engine.Size 0.0003853 0.0004828 0.7981271 0.4248225
fac.Cylinders4 -0.0024462 0.0017621 -1.3882273 0.1651098
fac.Cylinders5 -0.0087866 0.0037248 -2.3589530 0.0183525
fac.Cylinders6 0.0017023 0.0019874 0.8565724 0.3917091
fac.Cylinders8 0.0014074 0.0024929 0.5645555 0.5723933
fac.Cylinders10 0.0025751 0.0036832 0.6991585 0.4844750
fac.Cylinders12 0.0069161 0.0031714 2.1808187 0.0292283
fac.Cylinders16 0.0392187 0.0102201 3.8374033 0.0001254
Fuel.TypeE -0.4918855 0.0016804 -292.7164834 0.0000000
Fuel.TypeN -0.4798671 0.0166545 -28.8129962 0.0000000
Fuel.TypeX -0.1408340 0.0012992 -108.4000411 0.0000000
Fuel.TypeZ -0.1422930 0.0013150 -108.2037891 0.0000000
log(Fuel.Consumption.mpg) -0.9876379 0.0015084 -654.7545995 0.0000000

Once again, we will look at the residual plots.

## Warning: not plotting observations with leverage one:
##   2440

These plots seem significantly improved from the earlier models. The curvature and variances of the residual versus fitted plot are greatly improved, and the Q-Q plot is also the closest to normal.

2.5 Goodness of Fit

We will compare the three models based on different goodness-of-fit statistics that we will summarize in the following table.

Goodness-of-fit Measures of Candidate Models
SSE R.sq R.adj Cp AIC SBC PRESS
full.model 1.673191e+06 0.9338159 0.9336992 14 40077.13 40173.83 2293146
sqrt.model 2.236959e+02 0.9910218 0.9910060 14 -25796.75 -25700.04 Inf
log.model 2.031511e+00 0.9950472 0.9950385 14 -60517.38 -60420.68 Inf

Looking at the goodness-of-fit statistics and the residual plots, we can see that the log-transformed model has the highest \(R^2\) and adjusted \(R^2\). It also has the best residual plots and the fewest violations of our assumptions. As a result, we choose the log-transformed model as our final model.
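The measures in the table can be sketched in Python as follows. This is an illustration under assumptions, not the code behind the table: AIC and SBC are assumed to take the form n*log(SSE/n) + 2p and n*log(SSE/n) + p*log(n), and the sample size n = 7385 is not stated in the report but is assumed here because it reproduces the tabulated AIC and SBC of the first model from its SSE with p = 14. The sketch also shows one plausible reason a PRESS value can be infinite: an observation with leverage one, such as the only row of a rare factor level, makes the deleted residual e/(1 - h) undefined.

```python
import math
import numpy as np

def aic_sbc(sse, n, p):
    """AIC and SBC in the form n*log(SSE/n) + penalty (assumed definition)."""
    base = n * math.log(sse / n)
    return base + 2 * p, base + p * math.log(n)

def fit_measures(y, X):
    """SSE, R^2, adjusted R^2, AIC, SBC and PRESS for an OLS fit of y on X.

    X must already contain the intercept column.
    """
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = float(resid @ resid)
    sst = float(np.sum((y - y.mean()) ** 2))
    q, _ = np.linalg.qr(X)            # thin QR of the design matrix
    lev = np.sum(q ** 2, axis=1)      # leverages are the squared row norms of Q
    if np.any(lev > 1.0 - 1e-10):
        press = math.inf              # deleted residual e/(1-h) is undefined
    else:
        press = float(np.sum((resid / (1.0 - lev)) ** 2))
    aic, sbc = aic_sbc(sse, n, p)
    return {"SSE": sse, "R.sq": 1 - sse / sst,
            "R.adj": 1 - (sse / (n - p)) / (sst / (n - 1)),
            "AIC": aic, "SBC": sbc, "PRESS": press,
            "max_leverage": float(lev.max())}

# Synthetic data (assumption): one continuous predictor plus a dummy
# variable that equals 1 for a single observation only.
rng = np.random.default_rng(2)
n = 60
x = rng.uniform(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.2, n)
rare = np.zeros(n)
rare[0] = 1.0

plain = fit_measures(y, np.column_stack([np.ones(n), x]))
with_rare = fit_measures(y, np.column_stack([np.ones(n), x, rare]))
print(plain["PRESS"], with_rare["PRESS"], with_rare["max_leverage"])

# Check of the assumed AIC/SBC form against the first row of the table.
print(aic_sbc(1.673191e6, 7385, 14))
```

Note that SSE, AIC, SBC, and PRESS are computed on the scale of each model's response, so they are most directly comparable between models that share the same response.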

3 Bootstrapping Our Final Model

The following table summarizes our final model.

Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.8909003 0.0062869 1414.2023188 0.0000000
Engine.Size 0.0003853 0.0004828 0.7981271 0.4248225
fac.Cylinders4 -0.0024462 0.0017621 -1.3882273 0.1651098
fac.Cylinders5 -0.0087866 0.0037248 -2.3589530 0.0183525
fac.Cylinders6 0.0017023 0.0019874 0.8565724 0.3917091
fac.Cylinders8 0.0014074 0.0024929 0.5645555 0.5723933
fac.Cylinders10 0.0025751 0.0036832 0.6991585 0.4844750
fac.Cylinders12 0.0069161 0.0031714 2.1808187 0.0292283
fac.Cylinders16 0.0392187 0.0102201 3.8374033 0.0001254
Fuel.TypeE -0.4918855 0.0016804 -292.7164834 0.0000000
Fuel.TypeN -0.4798671 0.0166545 -28.8129962 0.0000000
Fuel.TypeX -0.1408340 0.0012992 -108.4000411 0.0000000
Fuel.TypeZ -0.1422930 0.0013150 -108.2037891 0.0000000
log(Fuel.Consumption.mpg) -0.9876379 0.0015084 -654.7545995 0.0000000

When we consider the final model, we see that the vast majority of the p-values for the predictors are very close to zero. However, for some levels of fac.Cylinders and for the variable Engine.Size, the p-value is large. To further refine our model, we eliminate the variable Engine.Size.

Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.8929222 0.0057537 1545.601271 0.0000000
fac.Cylinders4 -0.0022700 0.0017482 -1.298518 0.1941502
fac.Cylinders5 -0.0084924 0.0037064 -2.291271 0.0219758
fac.Cylinders6 0.0023103 0.0018355 1.258666 0.2081908
fac.Cylinders8 0.0026302 0.0019665 1.337512 0.1810969
fac.Cylinders10 0.0039587 0.0032496 1.218185 0.2231925
fac.Cylinders12 0.0085298 0.0024433 3.491121 0.0004838
fac.Cylinders16 0.0413837 0.0098533 4.199975 0.0000270
Fuel.TypeE -0.4919640 0.0016775 -293.273004 0.0000000
Fuel.TypeN -0.4798481 0.0166541 -28.812591 0.0000000
Fuel.TypeX -0.1408128 0.0012989 -108.409037 0.0000000
Fuel.TypeZ -0.1423502 0.0013131 -108.411220 0.0000000
log(Fuel.Consumption.mpg) -0.9880461 0.0014190 -696.289784 0.0000000
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

We will also take a final look at our predictors. For the variable Fuel.Type, there is only one observation with fuel type N, and there are only three observations where Cylinders equals 16. This could be an issue during resampling: if none of these observations is selected in a bootstrap sample, the corresponding level is missing and the design matrix has different dimensions. As a result, we will remove these observations.
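How often a rare level would drop out of a resample can be estimated with a short calculation. The sketch below is illustrative only; the sample size n = 7385 is an assumption (the report does not state the number of rows), and the result barely depends on it. A bootstrap sample draws n rows with replacement, so the chance that none of k particular rows is drawn is (1 - k/n)^n, which is close to exp(-k).

```python
import math

def prob_level_missing(n, k):
    """Probability that none of k specific rows appears in a bootstrap
    sample of size n drawn with replacement from n rows."""
    return (1.0 - k / n) ** n

n = 7385  # assumed sample size, not stated in the report

p_one = prob_level_missing(n, 1)    # a level with a single observation
p_three = prob_level_missing(n, 3)  # a level with three observations

print(round(p_one, 3), round(math.exp(-1), 3))
print(round(p_three, 3), round(math.exp(-3), 3))
```

Under this approximation, a level with one observation is absent from roughly 37% of case-resampled bootstrap samples, and a level with three observations is absent from roughly 5% of them.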

3.1 Bootstrapping Coefficients

At this point, we can begin to bootstrap our model. We will use bootstrapping to construct confidence intervals for the coefficients of each of the predictors of our model.
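A minimal Python sketch of this kind of case (row) bootstrap is shown here. It is not the code used in this report: the function name `case_bootstrap_ci`, the number of replicates, and the synthetic data are all assumptions for illustration. Each replicate resamples whole rows with replacement, refits the model, and the 95% interval is taken from the 2.5% and 97.5% percentiles of the refitted coefficients.

```python
import numpy as np

def case_bootstrap_ci(y, X, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CIs for OLS coefficients, resampling whole rows.

    Returns an array with one row per coefficient: [lower, upper].
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coefs = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # row indices drawn with replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        coefs[b] = beta
    alpha = 1.0 - level
    lower, upper = np.quantile(coefs, [alpha / 2, 1 - alpha / 2], axis=0)
    return np.column_stack([lower, upper])

# Synthetic data (assumption): a log-log relationship plus one dummy variable.
rng = np.random.default_rng(3)
n = 300
log_mpg = rng.uniform(np.log(15.0), np.log(50.0), n)
fuel_dummy = (rng.random(n) < 0.5).astype(float)
log_co2 = 8.9 - 1.0 * log_mpg - 0.14 * fuel_dummy + rng.normal(0.0, 0.02, n)
X_case = np.column_stack([np.ones(n), log_mpg, fuel_dummy])

ci_case = case_bootstrap_ci(log_co2, X_case)
print(np.round(ci_case, 4))
```

Because whole rows are resampled, a level with very few observations can be absent from a replicate, which is why rare levels are a concern for this method.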

Then, we will use a function to construct histograms for each coefficient.

The histograms are displayed below.

The red density curve is based on the estimated regression coefficients and their corresponding standard errors. The p-values of the model are based on this normal curve.

The blue curve is non-parametric and based on the density of the bootstrap sampling distribution.

The two density curves in all of the histograms are close to each other, which indicates that there are no obvious errors that we must correct. Therefore, we continue to find the 95% confidence intervals for each of the coefficients.

The 95% confidence intervals are constructed and displayed in the following table.

Because the observations with Fuel.Type=N and Cylinders=16 were removed before bootstrapping, no bootstrap interval is available for those two levels (shown as NA).

Regression Coefficient Matrix
Estimate Std. Error t value Pr(>|t|) btc.ci.95
(Intercept) 8.8929 0.0058 1545.6013 0.0000 [ 8.883 , 8.9029 ]
fac.Cylinders4 -0.0023 0.0017 -1.2985 0.1942 [ -0.0049 , 0.0003 ]
fac.Cylinders5 -0.0085 0.0037 -2.2913 0.0220 [ -0.0132 , -0.0032 ]
fac.Cylinders6 0.0023 0.0018 1.2587 0.2082 [ -0.0004 , 0.005 ]
fac.Cylinders8 0.0026 0.0020 1.3375 0.1811 [ -0.0005 , 0.0056 ]
fac.Cylinders10 0.0040 0.0032 1.2182 0.2232 [ -0.0017 , 0.0093 ]
fac.Cylinders12 0.0085 0.0024 3.4911 0.0005 [ 0.0042 , 0.013 ]
fac.Cylinders16 0.0414 0.0099 4.2000 0.0000 NA
Fuel.TypeE -0.4920 0.0017 -293.2730 0.0000 [ -0.4965 , -0.4874 ]
Fuel.TypeN -0.4798 0.0167 -28.8126 0.0000 NA
Fuel.TypeX -0.1408 0.0013 -108.4090 0.0000 [ -0.1426 , -0.139 ]
Fuel.TypeZ -0.1424 0.0013 -108.4112 0.0000 [ -0.1441 , -0.1405 ]
log(Fuel.Consumption.mpg) -0.9880 0.0014 -696.2898 0.0000 [ -0.9906 , -0.9854 ]

3.2 Bootstrapping Residuals

Now, we can bootstrap the residuals. We begin by constructing a histogram of the residuals of the model.

This histogram displays that the residuals are largely symmetric, but there is the presence of at least one outlier and some slight right skew.

Similarly to above, we will generate bootstrap confidence intervals for the regression coefficients and then construct histograms of the residual bootstrap estimates.
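For comparison with the case bootstrap, a residual bootstrap can be sketched in Python as below. Again, this is an illustration under assumptions rather than the code used for this report; the function name `residual_bootstrap_ci` and the synthetic data are hypothetical. The design matrix is held fixed, the residuals of the original fit are resampled with replacement and added back to the fitted values, and the model is refitted on each synthetic response.

```python
import numpy as np

def residual_bootstrap_ci(y, X, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CIs for OLS coefficients, resampling residuals.

    Returns an array with one row per coefficient: [lower, upper].
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    resid = y - fitted
    coefs = np.empty((n_boot, p))
    for b in range(n_boot):
        # New response: fitted values plus residuals drawn with replacement.
        y_star = fitted + rng.choice(resid, size=n, replace=True)
        beta, *_ = np.linalg.lstsq(X, y_star, rcond=None)
        coefs[b] = beta
    alpha = 1.0 - level
    lower, upper = np.quantile(coefs, [alpha / 2, 1 - alpha / 2], axis=0)
    return np.column_stack([lower, upper])

# Synthetic data (assumption): a log-log relationship plus one dummy variable.
rng = np.random.default_rng(4)
n = 300
log_mpg_r = rng.uniform(np.log(15.0), np.log(50.0), n)
dummy_r = (rng.random(n) < 0.5).astype(float)
log_co2_r = 8.9 - 1.0 * log_mpg_r - 0.14 * dummy_r + rng.normal(0.0, 0.02, n)
X_resid = np.column_stack([np.ones(n), log_mpg_r, dummy_r])

ci_resid = residual_bootstrap_ci(log_co2_r, X_resid)
print(np.round(ci_resid, 4))
```

Since the predictors are never resampled, every factor level stays in every replicate. The method does, however, treat the residuals as exchangeable, so it relies on the constant-variance assumption more heavily than the case bootstrap.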

The plots show no significant disparities between the curves, so we proceed by creating the confidence intervals.

Regression Coefficient Matrix with 95% Residual Bootstrap CI
Estimate Std. Error t value Pr(>|t|) btr.ci.95
(Intercept) 8.8929 0.0058 1545.6013 0.0000 [ 8.8813 , 8.9041 ]
fac.Cylinders4 -0.0023 0.0017 -1.2985 0.1942 [ -0.0058 , 0.0016 ]
fac.Cylinders5 -0.0085 0.0037 -2.2913 0.0220 [ -0.0152 , -0.0012 ]
fac.Cylinders6 0.0023 0.0018 1.2587 0.2082 [ -0.0013 , 0.0063 ]
fac.Cylinders8 0.0026 0.0020 1.3375 0.1811 [ -0.0013 , 0.0068 ]
fac.Cylinders10 0.0040 0.0032 1.2182 0.2232 [ -0.0024 , 0.0105 ]
fac.Cylinders12 0.0085 0.0024 3.4911 0.0005 [ 0.0037 , 0.0134 ]
fac.Cylinders16 0.0414 0.0099 4.2000 0.0000 NA
Fuel.TypeE -0.4920 0.0017 -293.2730 0.0000 [ -0.4951 , -0.4886 ]
Fuel.TypeN -0.4798 0.0167 -28.8126 0.0000 NA
Fuel.TypeX -0.1408 0.0013 -108.4090 0.0000 [ -0.1433 , -0.1381 ]
Fuel.TypeZ -0.1424 0.0013 -108.4112 0.0000 [ -0.1448 , -0.1397 ]
log(Fuel.Consumption.mpg) -0.9880 0.0014 -696.2898 0.0000 [ -0.9906 , -0.9852 ]

The inferences from the residual bootstrap are very similar to those from the case bootstrap, as expected. We will finish by combining all of our inferential statistics into a table and comparing them.

4 Combining the Results: Regular and Bootstrap Models

Comparing our final results, we get:

Final Combined Inferential Statistics: p-values and Bootstrap CIs
Estimate Std. Error Pr(>|t|) btc.ci.95 btr.ci.95
(Intercept) 8.8929 0.0058 0.0000 [ 8.883 , 8.9029 ] [ 8.8813 , 8.9041 ]
fac.Cylinders4 -0.0023 0.0017 0.1942 [ -0.0049 , 0.0003 ] [ -0.0058 , 0.0016 ]
fac.Cylinders5 -0.0085 0.0037 0.0220 [ -0.0132 , -0.0032 ] [ -0.0152 , -0.0012 ]
fac.Cylinders6 0.0023 0.0018 0.2082 [ -0.0004 , 0.005 ] [ -0.0013 , 0.0063 ]
fac.Cylinders8 0.0026 0.0020 0.1811 [ -0.0005 , 0.0056 ] [ -0.0013 , 0.0068 ]
fac.Cylinders10 0.0040 0.0032 0.2232 [ -0.0017 , 0.0093 ] [ -0.0024 , 0.0105 ]
fac.Cylinders12 0.0085 0.0024 0.0005 [ 0.0042 , 0.013 ] [ 0.0037 , 0.0134 ]
fac.Cylinders16 0.0414 0.0099 0.0000 NA NA
Fuel.TypeE -0.4920 0.0017 0.0000 [ -0.4965 , -0.4874 ] [ -0.4951 , -0.4886 ]
Fuel.TypeN -0.4798 0.0167 0.0000 NA NA
Fuel.TypeX -0.1408 0.0013 0.0000 [ -0.1426 , -0.139 ] [ -0.1433 , -0.1381 ]
Fuel.TypeZ -0.1424 0.0013 0.0000 [ -0.1441 , -0.1405 ] [ -0.1448 , -0.1397 ]
log(Fuel.Consumption.mpg) -0.9880 0.0014 0.0000 [ -0.9906 , -0.9854 ] [ -0.9906 , -0.9852 ]

The agreement between the parametric inference and both sets of bootstrap intervals suggests that our final model has no serious violations of its assumptions. We can also compare the widths of the confidence intervals from the two methods.

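Width of the two bootstrap confidence intervals
btc.wd btr.wd
(Intercept) 0.0199 0.0228
fac.Cylinders4 0.0051 0.0074
fac.Cylinders5 0.0100 0.0140
fac.Cylinders6 0.0055 0.0076
fac.Cylinders8 0.0061 0.0081
fac.Cylinders10 0.0109 0.0128
fac.Cylinders12 0.0088 0.0097
Fuel.TypeE 0.0091 0.0065
Fuel.TypeX 0.0035 0.0052
Fuel.TypeZ 0.0036 0.0051
log(Fuel.Consumption.mpg) 0.0051 0.0054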

The confidence intervals are similar in width, which indicates that the two bootstrap methods give comparable results. We will use the parametric model for the remainder of this report.

Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.8929222 0.0057537 1545.601271 0.0000000
fac.Cylinders4 -0.0022700 0.0017482 -1.298518 0.1941502
fac.Cylinders5 -0.0084924 0.0037064 -2.291271 0.0219758
fac.Cylinders6 0.0023103 0.0018355 1.258666 0.2081908
fac.Cylinders8 0.0026302 0.0019665 1.337512 0.1810969
fac.Cylinders10 0.0039587 0.0032496 1.218185 0.2231925
fac.Cylinders12 0.0085298 0.0024433 3.491121 0.0004838
fac.Cylinders16 0.0413837 0.0098533 4.199975 0.0000270
Fuel.TypeE -0.4919640 0.0016775 -293.273004 0.0000000
Fuel.TypeN -0.4798481 0.0166541 -28.812591 0.0000000
Fuel.TypeX -0.1408128 0.0012989 -108.409037 0.0000000
Fuel.TypeZ -0.1423502 0.0013131 -108.411220 0.0000000
log(Fuel.Consumption.mpg) -0.9880461 0.0014190 -696.289784 0.0000000

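For our categorical variables fac.Cylinders and Fuel.Type, each level is compared to a baseline level: Cylinders=3 and Fuel.Type=D, respectively. Because the response is log(CO2.Emissions), the coefficient of the level Cylinders=5, with an estimated value of \(-0.0084924\), can be interpreted as follows, where \(CO2_5\) and \(CO2_3\) denote the predicted CO2 emissions of a 5-cylinder car and a 3-cylinder car with the same fuel type and fuel consumption:

\[\log(CO2_5) - \log(CO2_3) = -0.0084924\] \[\log\left(\frac{CO2_5}{CO2_3}\right) = -0.0084924\] \[\frac{CO2_5}{CO2_3} = e^{-0.0084924} \approx 0.9915\] \[CO2_5 \approx 0.9915 \cdot CO2_3\]

In other words, a 5-cylinder car is predicted to emit about 0.85% less CO2 than an otherwise comparable 3-cylinder car.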

The same can be done for the other predictors.
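The back-transformation can be checked numerically with a few lines of Python. This is only a convenience sketch; the helper name `pct_change` is hypothetical, and the coefficients are copied from the final model table. A coefficient b on the log scale corresponds to multiplying the predicted CO2 emissions by exp(b) relative to the baseline level, which is a change of (exp(b) - 1) * 100 percent.

```python
import math

def pct_change(coef):
    """Percent change in the predicted response implied by a coefficient
    in a model whose response is on the natural-log scale."""
    return (math.exp(coef) - 1.0) * 100.0

# Coefficients copied from the final model table.
cyl5 = -0.0084924
fuel_e = -0.4919640

ratio_cyl5 = math.exp(cyl5)
print("Cylinders=5 vs 3: ratio", round(ratio_cyl5, 4),
      "percent change", round(pct_change(cyl5), 2))
print("Fuel.Type=E vs D: percent change", round(pct_change(fuel_e), 1))
```

For small coefficients the percent change is close to 100 times the coefficient itself, while for larger ones such as Fuel.TypeE the exponential must be used.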

5 Summary and Discussion

Our final model uses fac.Cylinders, Fuel.Type, and log(Fuel.Consumption.mpg) as predictors of the log of a vehicle's CO2 emissions. fac.Cylinders and Fuel.Type are categorical variables, while Fuel.Consumption.mpg is continuous. Through transformations, we corrected some of the problems of our original full model, and bootstrapping supported the resulting inferences, giving a model that approximately satisfies our assumptions and meets our requirements.

This model can be used to predict and estimate the CO2 emissions of vehicles from the predictors in the model. The residual plots show that there are still outliers, and removing them could improve the model. There were also issues with the small number of observations with Fuel.Type=N and Cylinders=16; collecting more observations with these values would allow the model to account for them.