I chose a dataset from Kaggle (https://www.kaggle.com/datasets/bhuviranga/co2-emissions) that includes information about the CO2 emissions of different cars. The author did not include information about how the data was collected. The dataset includes the following variables.
Make (categorical) - the make of the car, or the company or brand that manufactured it (ex. Honda, Buick, Acura)
Model (categorical) - the specific model of the car within its make
Vehicle.Class (categorical) - this describes the type of car, regardless of its make or model, based on characteristics of its build. For example: SUV, Compact, Minivan
Engine.Size (continuous) - the engine size in liters
Cylinders (categorical) - the number of cylinders in the car
Transmission (categorical) - the type of transmission for the car. Often begins with A or M for automatic or manual transmission
Fuel.Type (categorical) - the type of fuel used for the car. D = Diesel, E = ethanol, X = gasoline, Z = premium gasoline
Fuel.Consumption.City (continuous) - the fuel consumption of the car in L/100km when driving in the city
Fuel.Consumption.Hwy (continuous) - the fuel consumption of the car in L/100km when driving on highways
Fuel.Consumption.Combined (continuous) - the fuel consumption of the car in L/100km when driving in a combination of city roads and highways
Fuel.Consumption.mpg (continuous) - the fuel consumption of the car in miles per gallon when driving in a combination of city roads and highways
CO2.Emissions (continuous) - the emissions of the car in grams of CO2 per kilometer traveled
I have questions about how the make of the car, the type of fuel consumed, the efficiency of fuel consumption, and driving on city roads versus highways affect the emissions of the car. These questions can be answered with this dataset. I also had questions about comparing the efficiency of EVs and hybrid cars, but these vehicles do not seem to be included in this dataset.
I chose to do a simple linear regression between engine size and CO2 emissions. The model can be represented as \[\text{CO2} = \beta_0 + \beta_1 \cdot \text{Engine Size} + \epsilon\]
##
## Call:
## lm(formula = CO2.Emissions ~ Engine.Size, data = emissions)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113.309 -18.309 -1.442 19.079 142.914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.3659 0.9075 148.1 <2e-16 ***
## Engine.Size 36.7773 0.2640 139.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.72 on 7383 degrees of freedom
## Multiple R-squared: 0.7244, Adjusted R-squared: 0.7244
## F-statistic: 1.941e+04 on 1 and 7383 DF, p-value: < 2.2e-16
According to this model, a 1-liter increase in engine size is associated with an estimated increase of 36.78 grams of CO2 emitted per kilometer driven. Engine size was a statistically significant predictor of CO2 emissions, \(p\) < 0.001, with \(R^2\) = 0.7244, meaning that engine size explains about 72.44% of the variation in CO2 emissions.
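To make the coefficient interpretation concrete, the fitted line can be evaluated by hand. The sketch below is illustrative only: it copies the estimates and the slope's standard error from the summary above, while the helper name `predict_co2` and the example engine sizes (2.0 L and 3.0 L) are chosen here for demonstration and are not part of the analysis.

```r
# Coefficients copied from the regression summary above
b0 <- 134.3659  # intercept (g/km)
b1 <- 36.7773   # slope (g/km per liter of engine size)

# Hypothetical helper: predicted CO2 emissions (g/km) for an engine size (L)
predict_co2 <- function(engine_size) b0 + b1 * engine_size

pred_2L <- predict_co2(2.0)
pred_3L <- predict_co2(3.0)

# Engines one liter apart differ in predicted emissions by exactly the slope
diff_1L <- pred_3L - pred_2L

# Consistency check: in simple linear regression the F statistic equals the
# square of the slope's t value (t = estimate / standard error)
t_slope <- b1 / 0.2640
f_stat  <- t_slope^2
```

With these estimates, a 2.0 L engine is predicted to emit about 207.9 g/km and a 3.0 L engine about 244.7 g/km. The squared t value agrees with the reported F statistic of 1.941e+04 up to rounding of the printed standard error.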
The model assumes that the residuals are randomly scattered with constant variance. The residual plot seems largely random, but the variance does appear to be higher at larger engine sizes.
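One rough way to examine the constant-variance assumption numerically is to compare the spread of the residuals for smaller and larger engines. The sketch below uses simulated data as a stand-in for the real dataset: the data frame `sim`, the seed, and the noise model are all assumptions made for illustration, so its numbers say nothing about the actual cars.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the illustration

# Simulated stand-in data (assumption): noise grows with engine size,
# so the data are heteroscedastic by construction
n <- 2000
sim <- data.frame(Engine.Size = runif(n, min = 1, max = 8))
sim$CO2.Emissions <- 134 + 37 * sim$Engine.Size +
  rnorm(n, mean = 0, sd = 5 * sim$Engine.Size)

fit <- lm(CO2.Emissions ~ Engine.Size, data = sim)

# Split at the median engine size and compare residual standard deviations
large    <- sim$Engine.Size > median(sim$Engine.Size)
sd_small <- sd(resid(fit)[!large])
sd_large <- sd(resid(fit)[large])
```

A clearly larger `sd_large` than `sd_small` points to non-constant variance. Calling `plot(fitted(fit), resid(fit))` on the same fit gives the visual version of this check, where a funnel shape would indicate the same problem.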
Bootstrap sampling distribution (with reference normal density curve)
| | 2.5% | 97.5% |
|---|---|---|
| boot.beta0.ci | 132.5 | 136.3 |
| boot.beta1.ci | 36.16 | 37.4 |
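Percentile intervals like those in the table could be produced by a case-resampling bootstrap along the following lines. This is a minimal sketch, not necessarily the procedure used for this report: the simulated data frame `sim`, the number of resamples `B`, and the seed are assumptions, and only the names `boot.beta0.ci` and `boot.beta1.ci` are borrowed from the table.

```r
set.seed(42)  # arbitrary seed

# Simulated stand-in for the emissions data (assumption, not the real dataset)
n <- 500
sim <- data.frame(Engine.Size = runif(n, min = 1, max = 8))
sim$CO2.Emissions <- 134 + 37 * sim$Engine.Size + rnorm(n, sd = 30)

B <- 1000  # number of bootstrap resamples (assumed)
boot.coefs <- matrix(NA_real_, nrow = B, ncol = 2,
                     dimnames = list(NULL, c("beta0", "beta1")))

for (b in seq_len(B)) {
  # Resample rows (cases) with replacement and refit the model
  idx <- sample.int(n, size = n, replace = TRUE)
  boot.coefs[b, ] <- coef(lm(CO2.Emissions ~ Engine.Size, data = sim[idx, ]))
}

# Percentile 95% confidence intervals for intercept and slope
boot.beta0.ci <- quantile(boot.coefs[, "beta0"], probs = c(0.025, 0.975))
boot.beta1.ci <- quantile(boot.coefs[, "beta1"], probs = c(0.025, 0.975))
```

Resampling whole rows, rather than residuals, does not rely on the constant-variance assumption, which is one reason it is a common choice when the residual plot shows some change in spread.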
When comparing the bootstrap to the original regression model, one finds that the results are extremely similar. The slope (36.78) and y-intercept (134.37) estimated by the original model fall within the 95% confidence intervals given by the bootstrap sample. The residual plot does not seem to show serious violations of the assumption of normality, so both the bootstrap and the original regression model are acceptable.
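The comparison can also be made interval against interval. As a sketch, the usual t-based 95% confidence intervals can be rebuilt from the estimates, standard errors, and residual degrees of freedom printed in the regression summary, and then set beside the bootstrap limits from the table. The variable names below are illustrative.

```r
# Values copied from the regression summary
est <- c(beta0 = 134.3659, beta1 = 36.7773)
se  <- c(beta0 = 0.9075,   beta1 = 0.2640)
df  <- 7383

# t-based 95% confidence intervals: estimate +/- critical value * SE
t_crit <- qt(0.975, df)
lm.ci  <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

# Bootstrap percentile limits copied from the table
boot.ci <- rbind(beta0 = c(lower = 132.5, upper = 136.3),
                 beta1 = c(lower = 36.16, upper = 37.4))

# Ratio of interval widths (bootstrap / t-based)
width_ratio <- (boot.ci[, "upper"] - boot.ci[, "lower"]) /
               (lm.ci[, "upper"] - lm.ci[, "lower"])

print(round(lm.ci, 2))
```

Under these numbers the t-based intervals come out at roughly (132.59, 136.14) for the intercept and (36.26, 37.29) for the slope. They are slightly narrower than the bootstrap intervals, which would be in line with the somewhat larger residual variance at larger engine sizes.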