1 Description of the Dataset

I chose a dataset, available at https://www.kaggle.com/datasets/bhuviranga/co2-emissions, that contains information about the CO2 emissions of different cars. The author did not describe how the data were collected. The dataset includes the following variables.

Make (categorical) - the make of the car, or the company or brand that manufactured it (ex. Honda, Buick, Acura)

Model (categorical) - the specific model of the car under its make

Vehicle.Class (categorical) - the type of car, regardless of its make or model, based on characteristics of its build (ex. SUV, Compact, Minivan)

Engine.Size (continuous) - the engine size in liters

Cylinders (categorical) - the number of cylinders in the car

Transmission (categorical) - the type of transmission for the car. Often begins with A or M for automatic or manual transmission

Fuel.Type (categorical) - the type of fuel used for the car. D = diesel, E = ethanol, X = gasoline, Z = premium gasoline

Fuel.Consumption.City (continuous) - the fuel consumption of the car in L/100km when driving in the city

Fuel.Consumption.Hwy (continuous) - the fuel consumption of the car in L/100km when driving on highways

Fuel.Consumption.Combined (continuous) - the fuel consumption of the car in L/100km when driving in a combination of city roads and highways

Fuel.Consumption.mpg (continuous) - the fuel consumption of the car in miles per gallon when driving in a combination of city roads and highways

CO2.Emissions (continuous) - the emissions of the car in grams of CO2 per kilometer traveled

I am interested in how the make of the car, the type of fuel consumed, the efficiency of fuel consumption, and driving on city roads versus highways affect the emissions of the car. These questions can be answered with this dataset. I also had questions about comparing the efficiency of EVs and hybrid cars, but these do not appear to be included in this dataset.
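As a minimal sketch of how the variable types listed above could be set up in R, the helper below converts the categorical columns to factors and leaves the continuous ones numeric. The function name `prepare_emissions` and the small example data frame are hypothetical and made up for illustration; only the column names come from the variable list.

```r
# Columns described above as categorical.
categorical_vars <- c("Make", "Model", "Vehicle.Class", "Cylinders",
                      "Transmission", "Fuel.Type")

# Hypothetical helper: convert the categorical columns that are present
# in `df` to factors, leaving all other columns unchanged.
prepare_emissions <- function(df) {
  present <- intersect(categorical_vars, names(df))
  df[present] <- lapply(df[present], factor)
  df
}

# Made-up rows for illustration only (not taken from the dataset).
toy <- data.frame(
  Make = c("Honda", "Buick", "Acura"),
  Cylinders = c(4, 6, 4),
  Fuel.Type = c("X", "X", "Z"),
  Engine.Size = c(1.5, 3.6, 2.4),
  stringsAsFactors = FALSE
)
toy <- prepare_emissions(toy)
str(toy)
```

Treating Cylinders as a factor matches the description above, where it is listed as categorical even though it is stored as a number.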

2 Simple Linear Regression

2.1 Pairwise Scatterplot

2.2 Simple Linear Regression

I chose to fit a simple linear regression of CO2 emissions on engine size. The model can be represented as \[\text{CO2} = \beta_0 + \beta_1 \cdot \text{Engine Size} + \epsilon\]

## 
## Call:
## lm(formula = CO2.Emissions ~ Engine.Size, data = emissions)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -113.309  -18.309   -1.442   19.079  142.914 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 134.3659     0.9075   148.1   <2e-16 ***
## Engine.Size  36.7773     0.2640   139.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.72 on 7383 degrees of freedom
## Multiple R-squared:  0.7244, Adjusted R-squared:  0.7244 
## F-statistic: 1.941e+04 on 1 and 7383 DF,  p-value: < 2.2e-16

According to this model, an increase of 1 liter in engine size is associated with an estimated increase of 36.78 grams of CO2 emitted per kilometer driven. Engine size was a statistically significant predictor of CO2 emissions, \(p\) < 0.001, with \(R^2\) = 0.7244, meaning that engine size explains about 72.44% of the variation in CO2 emissions.
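To make the interpretation concrete, the short sketch below plugs the coefficient estimates from the summary output into the regression equation. The 2.0 L and 3.0 L engine sizes are arbitrary values chosen for illustration, and `predict_co2` is a hypothetical helper name.

```r
# Coefficient estimates copied from the summary output above.
beta0 <- 134.3659  # intercept, g CO2 per km
beta1 <- 36.7773   # slope, g CO2 per km per liter of engine size

# Predicted mean CO2 emissions for a given engine size (in liters).
predict_co2 <- function(engine_size) beta0 + beta1 * engine_size

pred <- predict_co2(c(2.0, 3.0))
print(pred)

# The difference between predictions one liter apart equals the slope.
print(diff(pred))
```

With the fitted model object available, `predict()` with a `newdata` data frame would give the same predictions without copying the coefficients by hand.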

2.3 Residuals

The model assumes that the residuals are randomly scattered around zero with constant variance. The residual plot looks largely random, but the variance does appear to be larger at larger engine sizes.
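One rough way to check the impression of increasing variance is to compare the spread of the residuals across ranges of engine size. The sketch below does this on simulated data, since the dataset itself is not loaded here. The simulated values, the `sim` data frame, and the choice of three bins are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in for the emissions data: the noise standard deviation
# grows with engine size, mimicking the pattern described above.
n <- 2000
engine <- runif(n, min = 1, max = 8)
co2 <- 134 + 37 * engine + rnorm(n, sd = 5 * engine)
sim <- data.frame(Engine.Size = engine, CO2.Emissions = co2)

fit <- lm(CO2.Emissions ~ Engine.Size, data = sim)

# Split engine size into three equal-width bins and compute the standard
# deviation of the residuals within each bin.
bins <- cut(sim$Engine.Size, breaks = 3)
resid_sd <- tapply(resid(fit), bins, sd)
print(resid_sd)
```

Roughly equal standard deviations across the bins would be consistent with constant variance, while a clear upward trend would not. The same steps could be applied to the real data by replacing `sim` with `emissions`.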

3 Bootstrap

Bootstrap sampling distribution (with reference normal density curve)


Bootstrap 95% confidence intervals of regression coefficients.

Coefficient       2.5%     97.5%
boot.beta0.ci     132.5    136.3
boot.beta1.ci     36.16    37.4
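The code that produced these intervals is not shown, so the following is only a sketch of one common approach: a case-resampling bootstrap with percentile intervals, run on simulated data. The number of resamples `B`, the simulated data frame `sim`, and the helper name `boot_coefs` are assumptions. Only the names `boot.beta0.ci` and `boot.beta1.ci` come from the table above.

```r
set.seed(2)

# Simulated stand-in for the emissions data (values are made up).
n <- 500
sim <- data.frame(Engine.Size = runif(n, min = 1, max = 8))
sim$CO2.Emissions <- 134 + 37 * sim$Engine.Size + rnorm(n, sd = 30)

# Hypothetical helper: resample rows with replacement B times and refit
# the regression each time, returning a B x 2 matrix of coefficients.
boot_coefs <- function(data, B = 1000) {
  t(replicate(B, {
    idx <- sample(nrow(data), replace = TRUE)
    coef(lm(CO2.Emissions ~ Engine.Size, data = data[idx, ]))
  }))
}

coefs <- boot_coefs(sim)

# Percentile 95% confidence intervals for the intercept and slope.
boot.beta0.ci <- quantile(coefs[, 1], probs = c(0.025, 0.975))
boot.beta1.ci <- quantile(coefs[, 2], probs = c(0.025, 0.975))
print(rbind(boot.beta0.ci, boot.beta1.ci))
```

Resampling whole rows, rather than residuals, does not rely on the constant-variance assumption, which may matter given the pattern noted in the residual plot.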

Comparing the bootstrap results to the original regression model shows that they are very similar. The slope (36.78) and y-intercept (134.37) estimated by the original model both fall within the corresponding bootstrap confidence intervals. The residual plot does not seem to show serious violations of the assumption of normality, so both the bootstrap and the original regression model are acceptable.
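For a side-by-side comparison, the normal-theory 95% intervals can be computed from the estimates and standard errors in the regression summary. This is a sketch based on the reported (rounded) values. With the fitted model object available, `confint()` would give the same kind of interval directly.

```r
# Estimates and standard errors copied from the summary output.
est <- c(beta0 = 134.3659, beta1 = 36.7773)
se  <- c(beta0 = 0.9075,   beta1 = 0.2640)
df  <- 7383  # residual degrees of freedom

# t-based 95% confidence intervals: estimate +/- t quantile * SE.
t_crit <- qt(0.975, df)
lm_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(lm_ci, 2))
```

These intervals come out to roughly 132.6 to 136.1 for the intercept and 36.3 to 37.3 for the slope, which is close to the bootstrap intervals reported above.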