I chose a dataset from Kaggle (https://www.kaggle.com/datasets/bhuviranga/co2-emissions) that includes information about the CO2 emissions of different cars. The author did not include information about how the data were collected.
The dataset includes the following variables.
Engine.Size (continuous) - the engine size in liters
fac.Cylinders (categorical) - the number of cylinders in the car
Fuel.Type (categorical) - the type of fuel used for the car. D = diesel, E = ethanol, N = natural gas, X = gasoline, Z = premium gasoline
Fuel.Consumption.City (continuous) - the fuel consumption of the car in L/100km when driving in the city
Fuel.Consumption.Hwy (continuous) - the fuel consumption of the car in L/100km when driving on highways
Fuel.Consumption.Combined (continuous) - the fuel consumption of the car in L/100km when driving in a combination of city roads and highways
Fuel.Consumption.mpg (continuous) - the fuel consumption of the car in miles per gallon when driving in a combination of city roads and highways
CO2.Emissions (continuous) - The emissions of the car in grams of CO2/kilometers traveled
The primary question is how different predictor variables relate to the CO2 emissions of the vehicle. There should be sufficient information to address this question from the data provided.
We begin to examine the predictors in a model with a response variable of CO2.Emissions.
We look at the full model first.
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 93.8670059 | 1.6422899 | 57.1561726 | 0.0000000 |
Engine.Size | 0.8218282 | 0.1440850 | 5.7037739 | 0.0000000 |
fac.Cylinders4 | -0.9584579 | 0.5227131 | -1.8336215 | 0.0667505 |
fac.Cylinders5 | -3.5501045 | 1.1033696 | -3.2175115 | 0.0012987 |
fac.Cylinders6 | -0.0901626 | 0.5888235 | -0.1531234 | 0.8783052 |
fac.Cylinders8 | 1.3082761 | 0.7376998 | 1.7734533 | 0.0761949 |
fac.Cylinders10 | 6.3032070 | 1.0945633 | 5.7586500 | 0.0000000 |
fac.Cylinders12 | 8.6056555 | 0.9572175 | 8.9902826 | 0.0000000 |
fac.Cylinders16 | 28.8962479 | 3.0675567 | 9.4199556 | 0.0000000 |
Fuel.TypeE | -137.7060177 | 0.5364755 | -256.6865131 | 0.0000000 |
Fuel.TypeN | -111.3336334 | 4.9258783 | -22.6017834 | 0.0000000 |
Fuel.TypeX | -30.5259613 | 0.3843917 | -79.4136951 | 0.0000000 |
Fuel.TypeZ | -31.1123288 | 0.3887748 | -80.0266115 | 0.0000000 |
Fuel.Consumption.City | 6.0962255 | 0.7424400 | 8.2110680 | 0.0000000 |
Fuel.Consumption.Hwy | 5.4826752 | 0.6119735 | 8.9590083 | 0.0000000 |
Fuel.Consumption.Combined | 8.1947979 | 1.3468737 | 6.0843108 | 0.0000000 |
Fuel.Consumption.mpg | -0.9647167 | 0.0253006 | -38.1302254 | 0.0000000 |
With this model, we do residual analysis to check if any of our assumptions have been violated.
## Warning: not plotting observations with leverage one:
## 2440
From looking at these plots, we can see some pretty big issues with our model. The variance of the residuals is not constant, and a pattern is visible in the plot of the residuals versus the fitted values. The Q-Q plot indicates that our residuals are not distributed normally.
We will perform a Box-Cox transformation to try and correct some of these issues.
We will also examine the VIF indices of the model to check for multicollinearity.
## GVIF Df GVIF^(1/(2*Df))
## Engine.Size 11.668643 1 3.415940
## fac.Cylinders 14.403671 7 1.209896
## Fuel.Type 2.475681 4 1.119984
## Fuel.Consumption.City 2069.965111 1 45.496869
## Fuel.Consumption.Hwy 568.001039 1 23.832772
## Fuel.Consumption.Combined 4651.987253 1 68.205478
## Fuel.Consumption.mpg 10.261228 1 3.203315
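The VIF indices indicate severe multicollinearity among Fuel.Consumption.City, Fuel.Consumption.Hwy, and Fuel.Consumption.Combined. Fuel.Consumption.mpg is also closely related to these variables, since it measures combined consumption in different units, although its own VIF is far smaller.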
As a result, we remove the predictors of Fuel.Consumption.City, Fuel.Consumption.Hwy, and Fuel.Consumption.Combined to reduce multicollinearity.
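The third column of the VIF output, GVIF^(1/(2*Df)), rescales the generalized VIF so that terms with different degrees of freedom can be compared. As a minimal Python sketch (the function name `adjusted_gvif` is hypothetical, and the inputs are simply copied from the full-model output above), the column can be reproduced like this:

```python
def adjusted_gvif(gvif: float, df: int) -> float:
    """Return GVIF^(1/(2*Df)), the degrees-of-freedom-adjusted generalized VIF."""
    return gvif ** (1.0 / (2 * df))


# (GVIF, Df) pairs copied from the VIF output of the full model.
full_model_gvif = {
    "Engine.Size": (11.668643, 1),
    "fac.Cylinders": (14.403671, 7),
    "Fuel.Type": (2.475681, 4),
    "Fuel.Consumption.City": (2069.965111, 1),
    "Fuel.Consumption.Hwy": (568.001039, 1),
    "Fuel.Consumption.Combined": (4651.987253, 1),
    "Fuel.Consumption.mpg": (10.261228, 1),
}

adjusted = {term: adjusted_gvif(g, df) for term, (g, df) in full_model_gvif.items()}
for term, value in adjusted.items():
    print(f"{term:28s}{value:10.4f}")
```

For a term with one degree of freedom the adjusted value is just the square root of the VIF, which is why the three L/100km consumption variables stand out so clearly in both columns.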
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 425.436323 | 2.7774036 | 153.177711 | 0.00e+00 |
Engine.Size | 7.499682 | 0.4322624 | 17.349838 | 0.00e+00 |
fac.Cylinders4 | -8.933533 | 1.6013752 | -5.578663 | 0.00e+00 |
fac.Cylinders5 | -13.777361 | 3.3833670 | -4.072086 | 4.71e-05 |
fac.Cylinders6 | -5.947815 | 1.8059023 | -3.293542 | 9.94e-04 |
fac.Cylinders8 | 11.741026 | 2.2577261 | 5.200376 | 2.00e-07 |
fac.Cylinders10 | 34.315686 | 3.3337059 | 10.293555 | 0.00e+00 |
fac.Cylinders12 | 44.381076 | 2.8674310 | 15.477644 | 0.00e+00 |
fac.Cylinders16 | 145.251628 | 9.2651721 | 15.677165 | 0.00e+00 |
Fuel.TypeE | -77.751313 | 1.4658185 | -53.042934 | 0.00e+00 |
Fuel.TypeN | -100.317941 | 15.1136250 | -6.637583 | 0.00e+00 |
Fuel.TypeX | -25.916043 | 1.1773512 | -22.012160 | 0.00e+00 |
Fuel.TypeZ | -29.994388 | 1.1930155 | -25.141659 | 0.00e+00 |
Fuel.Consumption.mpg | -6.053156 | 0.0416497 | -145.334966 | 0.00e+00 |
## Warning: not plotting observations with leverage one:
## 2440
## GVIF Df GVIF^(1/(2*Df))
## Engine.Size 11.145876 1 3.338544
## fac.Cylinders 11.786999 7 1.192693
## Fuel.Type 1.410630 4 1.043943
## Fuel.Consumption.mpg 2.951199 1 1.717905
When we examine this model, we see that the VIF indices are much improved and no longer show major issues with multicollinearity. However, the variance of the residuals is still not constant and the assumption of normality is still violated. Therefore, we proceed with the Box-Cox transformations.
We will perform several Box-Cox transformations on the model.
Examining the Box-Cox plots, we can see that taking the log of Fuel.Consumption.mpg changes the value of lambda suggested for the response.
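For reference, the Box-Cox family maps a positive response y to (y^lambda - 1)/lambda, with the log as the limiting case when lambda is 0. A lambda of 0.5 is therefore a square-root transformation up to a linear rescaling. The following is a minimal Python sketch of the transform itself, not of the likelihood-based search for lambda; the function name and the emission values are hypothetical and used only for illustration.

```python
import math


def box_cox(y: float, lam: float) -> float:
    """Box-Cox transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if abs(lam) < 1e-12:
        return math.log(y)  # limiting case as lambda approaches 0
    return (y ** lam - 1.0) / lam


# Hypothetical CO2 emission values in g/km, for illustration only.
emissions = [110.0, 196.0, 250.0, 404.0]
for lam in (1.0, 0.5, 0.0):
    transformed = [round(box_cox(y, lam), 3) for y in emissions]
    print(f"lambda = {lam}: {transformed}")
```

Smaller values of lambda compress the upper tail of the response more strongly, which is the usual motivation for trying them when the residual variance grows with the fitted values.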
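Using the Box-Cox transformation of the response (a square root, labeled sqrt.model in the comparison table later in this report) together with log-transformed Fuel.Consumption.mpg, we create the following model: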
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 40.8253658 | 0.0659711 | 618.837289 | 0.0000000 |
Engine.Size | 0.0411383 | 0.0050662 | 8.120089 | 0.0000000 |
fac.Cylinders4 | -0.1240613 | 0.0184903 | -6.709526 | 0.0000000 |
fac.Cylinders5 | -0.2394471 | 0.0390861 | -6.126148 | 0.0000000 |
fac.Cylinders6 | -0.1093716 | 0.0208543 | -5.244550 | 0.0000002 |
fac.Cylinders8 | 0.0425413 | 0.0261597 | 1.626214 | 0.1039468 |
fac.Cylinders10 | 0.2527221 | 0.0386492 | 6.538875 | 0.0000000 |
fac.Cylinders12 | 0.3834120 | 0.0332785 | 11.521310 | 0.0000000 |
fac.Cylinders16 | 1.5391967 | 0.1072445 | 14.352216 | 0.0000000 |
Fuel.TypeE | -3.6834953 | 0.0176334 | -208.892974 | 0.0000000 |
Fuel.TypeN | -3.6533055 | 0.1747640 | -20.904222 | 0.0000000 |
Fuel.TypeX | -1.0252207 | 0.0136332 | -75.200343 | 0.0000000 |
Fuel.TypeZ | -1.0793541 | 0.0137994 | -78.217409 | 0.0000000 |
log(Fuel.Consumption.mpg) | -7.3167119 | 0.0158285 | -462.250481 | 0.0000000 |
The residual plots are as follows:
## Warning: not plotting observations with leverage one:
## 2440
Looking at these plots, the curvature in the residual plot looks weaker and the Q-Q plot is closer to normal. Even so, the assumptions of constant variance and normality still remain violated.
For the next transformation, we take the log of CO2.Emissions as the response, keeping log-transformed Fuel.Consumption.mpg as a predictor.
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 8.8909003 | 0.0062869 | 1414.2023188 | 0.0000000 |
Engine.Size | 0.0003853 | 0.0004828 | 0.7981271 | 0.4248225 |
fac.Cylinders4 | -0.0024462 | 0.0017621 | -1.3882273 | 0.1651098 |
fac.Cylinders5 | -0.0087866 | 0.0037248 | -2.3589530 | 0.0183525 |
fac.Cylinders6 | 0.0017023 | 0.0019874 | 0.8565724 | 0.3917091 |
fac.Cylinders8 | 0.0014074 | 0.0024929 | 0.5645555 | 0.5723933 |
fac.Cylinders10 | 0.0025751 | 0.0036832 | 0.6991585 | 0.4844750 |
fac.Cylinders12 | 0.0069161 | 0.0031714 | 2.1808187 | 0.0292283 |
fac.Cylinders16 | 0.0392187 | 0.0102201 | 3.8374033 | 0.0001254 |
Fuel.TypeE | -0.4918855 | 0.0016804 | -292.7164834 | 0.0000000 |
Fuel.TypeN | -0.4798671 | 0.0166545 | -28.8129962 | 0.0000000 |
Fuel.TypeX | -0.1408340 | 0.0012992 | -108.4000411 | 0.0000000 |
Fuel.TypeZ | -0.1422930 | 0.0013150 | -108.2037891 | 0.0000000 |
log(Fuel.Consumption.mpg) | -0.9876379 | 0.0015084 | -654.7545995 | 0.0000000 |
Once again, we will look at the residual plots.
## Warning: not plotting observations with leverage one:
## 2440
These plots seem significantly improved from the earlier models. The curvature and variances of the residual versus fitted plot are greatly improved, and the Q-Q plot is also the closest to normal.
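One way to see why the log-log form fits so well is to look at the coefficient of log(Fuel.Consumption.mpg), which is close to \(-1\). Holding the other predictors fixed and writing \(C\) for the sum of the intercept and the remaining terms (a symbol introduced here only for this sketch), the fitted model gives:

\[\log(CO2) = C - 0.9876 \cdot \log(mpg)\] \[CO2 = e^{C} \cdot mpg^{-0.9876} \approx \frac{e^{C}}{mpg}\]

In other words, for a given fuel type the predicted emissions per kilometer are roughly inversely proportional to miles per gallon, that is, roughly proportional to the amount of fuel burned per distance traveled.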
We will compare the three models based on different goodness-of-fit statistics that we will summarize in the following table.
Model | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
---|---|---|---|---|---|---|---|
full.model | 1.673191e+06 | 0.9338159 | 0.9336992 | 14 | 40077.13 | 40173.83 | 2293146 |
sqrt.model | 2.236959e+02 | 0.9910218 | 0.9910060 | 14 | -25796.75 | -25700.04 | Inf |
log.model | 2.031511e+00 | 0.9950472 | 0.9950385 | 14 | -60517.38 | -60420.68 | Inf |
Looking at the goodness-of-fit statistics and the residual plots, we can see that the \(R^2\) and adjusted \(R^2\) of the log-transformed model are the highest. Because the three models use responses on different scales, these statistics are only a rough guide, so the residual plots carry most of the weight in this comparison. The log-transformed model has the best residual plots and the fewest violations of our assumptions. As a result, we choose the log-transformed model as our final model.
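The AIC and SBC columns are consistent with the SSE-based forms n·log(SSE/n) + 2p and n·log(SSE/n) + p·log(n). The sketch below is a Python illustration under two assumptions that the report does not state: that these are the formulas used, and that the sample size is n = 7385, a value chosen because it reproduces the tabulated numbers. The function names are hypothetical.

```python
import math


def aic(sse: float, n: int, p: int) -> float:
    """SSE-based AIC for a linear model with p coefficients."""
    return n * math.log(sse / n) + 2 * p


def sbc(sse: float, n: int, p: int) -> float:
    """SSE-based SBC (also called BIC) for a linear model with p coefficients."""
    return n * math.log(sse / n) + p * math.log(n)


N_OBS = 7385   # assumption: not stated in the report, inferred from the table
N_PARAMS = 14  # number of coefficients, matching the Cp column

# SSE values copied from the comparison table.
sse_by_model = {
    "full.model": 1.673191e6,
    "sqrt.model": 2.236959e2,
    "log.model": 2.031511,
}

for name, sse in sse_by_model.items():
    print(f"{name:11s} AIC = {aic(sse, N_OBS, N_PARAMS):11.2f}  "
          f"SBC = {sbc(sse, N_OBS, N_PARAMS):11.2f}")
```

Since both criteria depend on SSE, and SSE is measured in the units of each model's own response, a lower AIC or SBC for a transformed response does not by itself show a better fit.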
The following table summarizes our final model.
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 8.8909003 | 0.0062869 | 1414.2023188 | 0.0000000 |
Engine.Size | 0.0003853 | 0.0004828 | 0.7981271 | 0.4248225 |
fac.Cylinders4 | -0.0024462 | 0.0017621 | -1.3882273 | 0.1651098 |
fac.Cylinders5 | -0.0087866 | 0.0037248 | -2.3589530 | 0.0183525 |
fac.Cylinders6 | 0.0017023 | 0.0019874 | 0.8565724 | 0.3917091 |
fac.Cylinders8 | 0.0014074 | 0.0024929 | 0.5645555 | 0.5723933 |
fac.Cylinders10 | 0.0025751 | 0.0036832 | 0.6991585 | 0.4844750 |
fac.Cylinders12 | 0.0069161 | 0.0031714 | 2.1808187 | 0.0292283 |
fac.Cylinders16 | 0.0392187 | 0.0102201 | 3.8374033 | 0.0001254 |
Fuel.TypeE | -0.4918855 | 0.0016804 | -292.7164834 | 0.0000000 |
Fuel.TypeN | -0.4798671 | 0.0166545 | -28.8129962 | 0.0000000 |
Fuel.TypeX | -0.1408340 | 0.0012992 | -108.4000411 | 0.0000000 |
Fuel.TypeZ | -0.1422930 | 0.0013150 | -108.2037891 | 0.0000000 |
log(Fuel.Consumption.mpg) | -0.9876379 | 0.0015084 | -654.7545995 | 0.0000000 |
When we consider the final model, we see that the p-values for most of the predictors are very close to zero. However, for Engine.Size and for several levels of fac.Cylinders, the p-value is large. To further refine our model, we eliminate the variable Engine.Size.
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 8.8929222 | 0.0057537 | 1545.601271 | 0.0000000 |
fac.Cylinders4 | -0.0022700 | 0.0017482 | -1.298518 | 0.1941502 |
fac.Cylinders5 | -0.0084924 | 0.0037064 | -2.291271 | 0.0219758 |
fac.Cylinders6 | 0.0023103 | 0.0018355 | 1.258666 | 0.2081908 |
fac.Cylinders8 | 0.0026302 | 0.0019665 | 1.337512 | 0.1810969 |
fac.Cylinders10 | 0.0039587 | 0.0032496 | 1.218185 | 0.2231925 |
fac.Cylinders12 | 0.0085298 | 0.0024433 | 3.491121 | 0.0004838 |
fac.Cylinders16 | 0.0413837 | 0.0098533 | 4.199975 | 0.0000270 |
Fuel.TypeE | -0.4919640 | 0.0016775 | -293.273004 | 0.0000000 |
Fuel.TypeN | -0.4798481 | 0.0166541 | -28.812591 | 0.0000000 |
Fuel.TypeX | -0.1408128 | 0.0012989 | -108.409037 | 0.0000000 |
Fuel.TypeZ | -0.1423502 | 0.0013131 | -108.411220 | 0.0000000 |
log(Fuel.Consumption.mpg) | -0.9880461 | 0.0014190 | -696.289784 | 0.0000000 |
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
We will also take a final look at our predictors. There is only one observation with Fuel.Type=N and only three observations with Cylinders=16. This could be an issue during resampling: if these observations are not selected in a bootstrap sample, the corresponding factor levels disappear from that sample and the dimensions of the design matrix change. As a result, we will remove these observations.
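To get a feel for how often this would happen, the chance that none of the k rows of a rare level appears in a resample of n rows drawn with replacement is (1 - k/n)^n. Here is a minimal Python sketch; the function name is hypothetical, and n = 7385 is an assumption because the report does not state the sample size (the result barely depends on n once n is large).

```python
def prob_level_missing(n_total: int, n_level: int) -> float:
    """Probability that a bootstrap resample of n_total rows, drawn with
    replacement, contains none of the n_level rows of a given factor level."""
    return (1.0 - n_level / n_total) ** n_total


N_OBS = 7385  # assumption: sample size is not stated in the report

p_fuel_n = prob_level_missing(N_OBS, 1)  # the single Fuel.Type = N row
p_cyl_16 = prob_level_missing(N_OBS, 3)  # the three 16-cylinder rows
print(f"P(no Fuel.Type=N row)  = {p_fuel_n:.4f}")
print(f"P(no Cylinders=16 row) = {p_cyl_16:.4f}")
```

Under these assumptions roughly 37% of case resamples would contain no Fuel.Type=N row and about 5% would contain no 16-cylinder row, so the problem would arise in a large share of the bootstrap fits.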
At this point, we can begin to bootstrap our model. We will use bootstrapping to construct confidence intervals for the coefficients of each of the predictors of our model.
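As an illustration of the idea, the following Python sketch runs a case (row) bootstrap for the slope of a one-predictor regression and reads off a percentile interval. It is a simplified stand-in rather than the code behind this report: the data are synthetic, and all function names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def case_bootstrap_ci(xs, ys, n_boot=2000, level=0.95, seed=1):
    """Percentile interval for the slope from resampling (x, y) rows."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        # Draw n row indices with replacement and refit on those rows.
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx])[1])
    slopes.sort()
    k = round((1 - level) / 2 * (n_boot - 1))
    return slopes[k], slopes[n_boot - 1 - k]


# Synthetic data loosely shaped like log(CO2) against log(mpg).
data_rng = random.Random(0)
xs = [math.log(m) for m in range(15, 55)]
ys = [8.9 - 1.0 * x + data_rng.gauss(0, 0.02) for x in xs]

a_hat, b_hat = fit_line(xs, ys)
lo, hi = case_bootstrap_ci(xs, ys)
print(f"slope = {b_hat:.4f}, 95% case-bootstrap CI = [{lo:.4f}, {hi:.4f}]")
```

Because whole rows are resampled, each refit sees a different design matrix, which is why rare factor levels had to be dealt with before this step.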
Then, we will use a function to construct histograms for each coefficient.
The histograms are displayed below.
The red density curve is based on the estimated regression coefficients and their corresponding standard errors. The p-values of the model are based on this normal curve.
The blue curve is non-parametric and based on the density of the bootstrap sampling distribution.
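The normal approximation behind the red curve also implies a conventional interval of the form estimate ± z·SE, which gives a useful reference point for the bootstrap intervals. A minimal Python sketch follows, using the intercept and log(Fuel.Consumption.mpg) rows of the reduced model; the function name is hypothetical, and the normal quantile is used in place of the t quantile on the assumption that the sample is large enough for the two to be practically identical.

```python
from statistics import NormalDist


def wald_interval(estimate: float, std_error: float, level: float = 0.95):
    """Normal-theory interval: estimate +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


# Estimates and standard errors copied from the reduced log model.
lo, hi = wald_interval(8.8929222, 0.0057537)
print(f"(Intercept): [{lo:.4f}, {hi:.4f}]")

slope_lo, slope_hi = wald_interval(-0.9880461, 0.0014190)
print(f"log(Fuel.Consumption.mpg): [{slope_lo:.4f}, {slope_hi:.4f}]")
```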
The two density curves in all of the histograms are close to each other, which indicates that there are no obvious errors that we must correct. Therefore, we continue to find the 95% confidence intervals for each of the coefficients.
The 95% confidence intervals are constructed and displayed in the following table.
## Warning in cbind(formatC(cmtrx, 4, format = "f"), btc.ci.95 = btc.ci): number
## of rows of result is not a multiple of vector length (arg 2)
Term | Estimate | Std. Error | t value | Pr(>|t|) | btc.ci.95 |
---|---|---|---|---|---|
(Intercept) | 8.8929 | 0.0058 | 1545.6013 | 0.0000 | [ 8.883 , 8.9029 ] |
fac.Cylinders4 | -0.0023 | 0.0017 | -1.2985 | 0.1942 | [ -0.0049 , 3e-04 ] |
fac.Cylinders5 | -0.0085 | 0.0037 | -2.2913 | 0.0220 | [ -0.0132 , -0.0032 ] |
fac.Cylinders6 | 0.0023 | 0.0018 | 1.2587 | 0.2082 | [ -4e-04 , 0.005 ] |
fac.Cylinders8 | 0.0026 | 0.0020 | 1.3375 | 0.1811 | [ -5e-04 , 0.0056 ] |
fac.Cylinders10 | 0.0040 | 0.0032 | 1.2182 | 0.2232 | [ -0.0017 , 0.0093 ] |
fac.Cylinders12 | 0.0085 | 0.0024 | 3.4911 | 0.0005 | [ 0.0042 , 0.013 ] |
fac.Cylinders16 | 0.0414 | 0.0099 | 4.2000 | 0.0000 | NA (level removed before bootstrapping) |
Fuel.TypeE | -0.4920 | 0.0017 | -293.2730 | 0.0000 | [ -0.4965 , -0.4874 ] |
Fuel.TypeN | -0.4798 | 0.0167 | -28.8126 | 0.0000 | NA (level removed before bootstrapping) |
Fuel.TypeX | -0.1408 | 0.0013 | -108.4090 | 0.0000 | [ -0.1426 , -0.139 ] |
Fuel.TypeZ | -0.1424 | 0.0013 | -108.4112 | 0.0000 | [ -0.1441 , -0.1405 ] |
log(Fuel.Consumption.mpg) | -0.9880 | 0.0014 | -696.2898 | 0.0000 | [ -0.9906 , -0.9854 ] |
Now, we can bootstrap the residuals. We begin by constructing a histogram of the residuals of the model.
This histogram shows that the residuals are largely symmetric, apart from at least one outlier and a slight right skew.
Similarly to above, we will generate bootstrap confidence intervals for the regression coefficients and then construct histograms of the residual bootstrap estimates.
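The residual bootstrap keeps the predictors fixed and resamples only the residuals, adding them back to the fitted values before refitting. The Python sketch below illustrates this for a one-predictor regression on synthetic data. It is a simplified illustration rather than the code used for this report, and all names in it are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def residual_bootstrap_ci(xs, ys, n_boot=2000, level=0.95, seed=1):
    """Percentile interval for the slope from resampling residuals."""
    rng = random.Random(seed)
    a, b = fit_line(xs, ys)
    fitted = [a + b * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    slopes = []
    for _ in range(n_boot):
        # New responses: original fitted values plus resampled residuals.
        y_star = [f + rng.choice(residuals) for f in fitted]
        slopes.append(fit_line(xs, y_star)[1])
    slopes.sort()
    k = round((1 - level) / 2 * (n_boot - 1))
    return slopes[k], slopes[n_boot - 1 - k]


# Synthetic data loosely shaped like log(CO2) against log(mpg).
data_rng = random.Random(0)
xs = [math.log(m) for m in range(15, 55)]
ys = [8.9 - 1.0 * x + data_rng.gauss(0, 0.02) for x in xs]

a_hat, b_hat = fit_line(xs, ys)
lo, hi = residual_bootstrap_ci(xs, ys)
print(f"slope = {b_hat:.4f}, 95% residual-bootstrap CI = [{lo:.4f}, {hi:.4f}]")
```

Since the design matrix is the same in every refit, this scheme relies on the residuals being roughly exchangeable, that is, on the constant-variance assumption holding reasonably well.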
The plots show no significant disparities between the curves, so we proceed by creating the confidence intervals.
## Warning in cbind(formatC(cmtrx, 4, format = "f"), btr.ci.95 = btr.ci): number
## of rows of result is not a multiple of vector length (arg 2)
Term | Estimate | Std. Error | t value | Pr(>|t|) | btr.ci.95 |
---|---|---|---|---|---|
(Intercept) | 8.8929 | 0.0058 | 1545.6013 | 0.0000 | [ 8.8813 , 8.9041 ] |
fac.Cylinders4 | -0.0023 | 0.0017 | -1.2985 | 0.1942 | [ -0.0058 , 0.0016 ] |
fac.Cylinders5 | -0.0085 | 0.0037 | -2.2913 | 0.0220 | [ -0.0152 , -0.0012 ] |
fac.Cylinders6 | 0.0023 | 0.0018 | 1.2587 | 0.2082 | [ -0.0013 , 0.0063 ] |
fac.Cylinders8 | 0.0026 | 0.0020 | 1.3375 | 0.1811 | [ -0.0013 , 0.0068 ] |
fac.Cylinders10 | 0.0040 | 0.0032 | 1.2182 | 0.2232 | [ -0.0024 , 0.0105 ] |
fac.Cylinders12 | 0.0085 | 0.0024 | 3.4911 | 0.0005 | [ 0.0037 , 0.0134 ] |
fac.Cylinders16 | 0.0414 | 0.0099 | 4.2000 | 0.0000 | NA (level removed before bootstrapping) |
Fuel.TypeE | -0.4920 | 0.0017 | -293.2730 | 0.0000 | [ -0.4951 , -0.4886 ] |
Fuel.TypeN | -0.4798 | 0.0167 | -28.8126 | 0.0000 | NA (level removed before bootstrapping) |
Fuel.TypeX | -0.1408 | 0.0013 | -108.4090 | 0.0000 | [ -0.1433 , -0.1381 ] |
Fuel.TypeZ | -0.1424 | 0.0013 | -108.4112 | 0.0000 | [ -0.1448 , -0.1397 ] |
log(Fuel.Consumption.mpg) | -0.9880 | 0.0014 | -696.2898 | 0.0000 | [ -0.9906 , -0.9852 ] |
The confidence intervals from the two bootstrap methods are very similar, as expected. We finish by combining all of our inferential statistics into one table for comparison:
## Warning in cbind(formatC(cmtrx[, -3], 4, format = "f"), btc.ci.95 = btc.ci, :
## number of rows of result is not a multiple of vector length (arg 2)
Term | Estimate | Std. Error | Pr(>|t|) | btc.ci.95 | btr.ci.95 |
---|---|---|---|---|---|
(Intercept) | 8.8929 | 0.0058 | 0.0000 | [ 8.883 , 8.9029 ] | [ 8.8813 , 8.9041 ] |
fac.Cylinders4 | -0.0023 | 0.0017 | 0.1942 | [ -0.0049 , 3e-04 ] | [ -0.0058 , 0.0016 ] |
fac.Cylinders5 | -0.0085 | 0.0037 | 0.0220 | [ -0.0132 , -0.0032 ] | [ -0.0152 , -0.0012 ] |
fac.Cylinders6 | 0.0023 | 0.0018 | 0.2082 | [ -4e-04 , 0.005 ] | [ -0.0013 , 0.0063 ] |
fac.Cylinders8 | 0.0026 | 0.0020 | 0.1811 | [ -5e-04 , 0.0056 ] | [ -0.0013 , 0.0068 ] |
fac.Cylinders10 | 0.0040 | 0.0032 | 0.2232 | [ -0.0017 , 0.0093 ] | [ -0.0024 , 0.0105 ] |
fac.Cylinders12 | 0.0085 | 0.0024 | 0.0005 | [ 0.0042 , 0.013 ] | [ 0.0037 , 0.0134 ] |
fac.Cylinders16 | 0.0414 | 0.0099 | 0.0000 | NA (level removed before bootstrapping) | NA (level removed before bootstrapping) |
Fuel.TypeE | -0.4920 | 0.0017 | 0.0000 | [ -0.4965 , -0.4874 ] | [ -0.4951 , -0.4886 ] |
Fuel.TypeN | -0.4798 | 0.0167 | 0.0000 | NA (level removed before bootstrapping) | NA (level removed before bootstrapping) |
Fuel.TypeX | -0.1408 | 0.0013 | 0.0000 | [ -0.1426 , -0.139 ] | [ -0.1433 , -0.1381 ] |
Fuel.TypeZ | -0.1424 | 0.0013 | 0.0000 | [ -0.1441 , -0.1405 ] | [ -0.1448 , -0.1397 ] |
log(Fuel.Consumption.mpg) | -0.9880 | 0.0014 | 0.0000 | [ -0.9906 , -0.9854 ] | [ -0.9906 , -0.9852 ] |
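The close agreement between the two sets of bootstrap intervals, and between them and the parametric estimates, suggests that our final model does not have any serious violations of our assumptions. We can also compare the widths of the confidence intervals from the two methods.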
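Term | btc.wd | btr.wd |
---|---|---|
(Intercept) | 0.0199 | 0.0228 |
fac.Cylinders4 | 0.0051 | 0.0074 |
fac.Cylinders5 | 0.0100 | 0.0140 |
fac.Cylinders6 | 0.0055 | 0.0076 |
fac.Cylinders8 | 0.0061 | 0.0081 |
fac.Cylinders10 | 0.0109 | 0.0128 |
fac.Cylinders12 | 0.0088 | 0.0097 |
Fuel.TypeE | 0.0091 | 0.0065 |
Fuel.TypeX | 0.0035 | 0.0052 |
Fuel.TypeZ | 0.0036 | 0.0051 |
log(Fuel.Consumption.mpg) | 0.0051 | 0.0054 |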
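The confidence intervals are similar in width, with the residual-bootstrap intervals slightly wider for every coefficient except Fuel.TypeE, indicating that the two methods give comparable results. We will therefore base the rest of this report on the parametric model.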
Term | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 8.8929222 | 0.0057537 | 1545.601271 | 0.0000000 |
fac.Cylinders4 | -0.0022700 | 0.0017482 | -1.298518 | 0.1941502 |
fac.Cylinders5 | -0.0084924 | 0.0037064 | -2.291271 | 0.0219758 |
fac.Cylinders6 | 0.0023103 | 0.0018355 | 1.258666 | 0.2081908 |
fac.Cylinders8 | 0.0026302 | 0.0019665 | 1.337512 | 0.1810969 |
fac.Cylinders10 | 0.0039587 | 0.0032496 | 1.218185 | 0.2231925 |
fac.Cylinders12 | 0.0085298 | 0.0024433 | 3.491121 | 0.0004838 |
fac.Cylinders16 | 0.0413837 | 0.0098533 | 4.199975 | 0.0000270 |
Fuel.TypeE | -0.4919640 | 0.0016775 | -293.273004 | 0.0000000 |
Fuel.TypeN | -0.4798481 | 0.0166541 | -28.812591 | 0.0000000 |
Fuel.TypeX | -0.1408128 | 0.0012989 | -108.409037 | 0.0000000 |
Fuel.TypeZ | -0.1423502 | 0.0013131 | -108.411220 | 0.0000000 |
log(Fuel.Consumption.mpg) | -0.9880461 | 0.0014190 | -696.289784 | 0.0000000 |
For our categorical variables fac.Cylinders and Fuel.Type, each level is compared to a baseline level, the baselines being Cylinders=3 and Fuel.Type=D respectively. Because the response is log(CO2.Emissions), the coefficient of the level Cylinders=5, with an estimated value of \(-0.0084924\), can be interpreted as follows, where \(CO2_5\) and \(CO2_3\) denote the predicted emissions of a 5-cylinder and a 3-cylinder vehicle with the same fuel type and fuel consumption:
\[\log(CO2_5) - \log(CO2_3) = -0.0084924\] \[\log\left(\frac{CO2_5}{CO2_3}\right) = -0.0084924\] \[\frac{CO2_5}{CO2_3} = e^{-0.0084924} \approx 0.99154\] \[CO2_5 \approx 0.99154 \cdot CO2_3\]
The same can be done for the other predictors.
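As a small worked example of that calculation, the Python sketch below exponentiates a few coefficients of the final model to express each level as a multiple of its baseline. The function name is hypothetical; the coefficient values are copied from the final model table.

```python
import math


def multiplicative_effect(coef: float) -> float:
    """Ratio of predicted CO2 emissions relative to the baseline level
    when the response is log(CO2.Emissions)."""
    return math.exp(coef)


# Coefficients copied from the final model (baselines: Cylinders=3, Fuel.Type=D).
coefs = {
    "fac.Cylinders5": -0.0084924,
    "Fuel.TypeE": -0.4919640,
    "Fuel.TypeX": -0.1408128,
    "Fuel.TypeZ": -0.1423502,
}

for term, b in coefs.items():
    ratio = multiplicative_effect(b)
    print(f"{term:15s} ratio = {ratio:.4f}  change = {(ratio - 1) * 100:+.2f}%")
```

Read this way, a vehicle running on ethanol is predicted to emit about 61% of the CO2 of a diesel vehicle with the same number of cylinders and the same miles per gallon.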
Our final model uses the predictors fac.Cylinders, Fuel.Type, and log(Fuel.Consumption.mpg) to predict the log of the CO2 emissions of a vehicle. fac.Cylinders and Fuel.Type are categorical variables, while Fuel.Consumption.mpg is continuous. Through transformations, we corrected some of the problems of our original full model, and bootstrapping gave no indication of serious remaining violations, so we arrived at a model that more closely satisfies our assumptions and meets our requirements.
This model can be used for prediction and estimation of the CO2 emissions of vehicles based on the parameters outlined in the model. Currently, looking at the residual plots, it is evident that there are still outliers that can be removed to improve our model. There were also issues with the small sample sizes for observations with Fuel.Type=N and Cylinder=16; adding more observations with these values could improve our model to account for them.