Part 1: Formulating a Research Question

1. Propose a research question that explores the relationship between a dependent variable and at least three independent variables.

2. The question should be of your own interest and in your analysis please explain why the question is of your interest.

3. Identify a real dataset that can provide insights into answering your question.

Research Question: How do a car’s brand, model, and class make a difference in its fuel efficiency?

Reasons for choice: A car is one of the most important means of transportation for most people, and it costs a lot of money to buy, use, and maintain. Since owning a car will remain necessary in the future, it is important to find one with high fuel efficiency for long-term use, especially because fossil fuel is limited and fuel may become more expensive and less affordable.

Data to use: I am going to use the “CO2 Emission by Vehicles” dataset from the data platform Kaggle. The data originally come from a Government of Canada website. The dataset includes all three variables that I need for each car, together with fuel consumption data for highway driving, city driving, and the two combined. Comparing the consumption data of different cars should suggest a possible answer.

Part 2: Data Exploration and Preparation

1. Describe the dataset, its source, and the variables under study.

2. Provide statistical summaries of the variables (tables and visualizations).

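library(dplyr)  # provides glimpse()

# Load the dataset and treat fuel type as a categorical variable
ca_car <- read.csv("/Users/yuhe/Downloads/archive/CO2 Emissions_Canada.csv")
ca_car$Fuel.Type <- as.factor(ca_car$Fuel.Type)

glimpse(ca_car)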

## Rows: 7,385
## Columns: 12
## $ Make                             <chr> "ACURA", "ACURA", "ACURA", "ACURA", "…
## $ Model                            <chr> "ILX", "ILX", "ILX HYBRID", "MDX 4WD"…
## $ Vehicle.Class                    <chr> "COMPACT", "COMPACT", "COMPACT", "SUV…
## $ Engine.Size.L.                   <dbl> 2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.…
## $ Cylinders                        <int> 4, 4, 4, 6, 6, 6, 6, 6, 6, 4, 4, 6, 4…
## $ Transmission                     <chr> "AS5", "M6", "AV7", "AS6", "AS6", "AS…
## $ Fuel.Type                        <fct> Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z…
## $ Fuel.Consumption.City..L.100.km. <dbl> 9.9, 11.2, 6.0, 12.7, 12.1, 11.9, 11.…
## $ Fuel.Consumption.Hwy..L.100.km.  <dbl> 6.7, 7.7, 5.8, 9.1, 8.7, 7.7, 8.1, 9.…
## $ Fuel.Consumption.Comb..L.100.km. <dbl> 8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10.1…
## $ Fuel.Consumption.Comb..mpg.      <int> 33, 29, 48, 25, 27, 28, 28, 25, 24, 3…
## $ CO2.Emissions.g.km.              <int> 196, 221, 136, 255, 244, 230, 232, 25…

Dataset Introduction: The “CO2 Emission by Vehicles” data come from the Government of Canada website on fuel consumption ratings (https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64#wb-auto-6). The dataset includes the car manufacturer, model, vehicle class, engine size in litres, number of cylinders, transmission type, fuel type, and the fuel consumption rating for highway driving, city driving, and the two combined (55% city, 45% highway), reported in L/100 km and in mpg. CO2 emissions are the tailpipe emissions of carbon dioxide (in grams per kilometre) for combined city and highway driving. The Model variable uses the codes 4WD/4X4 - four-wheel drive, AWD - all-wheel drive, FFV - flexible-fuel vehicle, SWB - short wheelbase, LWB - long wheelbase, and EWB - extended wheelbase. The Transmission variable uses the codes A - automatic, AM - automated manual, AS - automatic with select shift, AV - continuously variable, and M - manual; the number 3-10 after the letter code gives the number of gears. The Fuel type variable uses the codes X - regular gasoline, Z - premium gasoline, D - diesel, E - ethanol (E85), and N - natural gas.
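As a quick consistency check on these definitions, the sketch below recomputes the combined rating and the mpg column from the first six rows shown in the glimpse() output. It assumes that the mpg column uses imperial gallons (4.54609 L). That is an inference from the printed values, not something the dataset description states, and the vector names are only illustrative.

```r
# First six rows copied from the glimpse() output above
city <- c(9.9, 11.2, 6.0, 12.7, 12.1, 11.9)   # L/100 km
hwy  <- c(6.7, 7.7, 5.8, 9.1, 8.7, 7.7)       # L/100 km
comb <- c(8.5, 9.6, 5.9, 11.1, 10.6, 10.0)    # L/100 km, as reported
mpg  <- c(33, 29, 48, 25, 27, 28)             # as reported

# Combined rating: 55% city + 45% highway, rounded to one decimal
comb_check <- round(0.55 * city + 0.45 * hwy, 1)

# Assumption: imperial gallon (4.54609 L) and 1 mile = 1.609344 km
km_per_mile <- 1.609344
litres_per_gallon <- 4.54609
mpg_check <- round((100 / km_per_mile) * litres_per_gallon / comb)

# Reported and recomputed values side by side
data.frame(comb, comb_check, mpg, mpg_check)
```

If the recomputed columns match the reported ones, the 55/45 weighting and the imperial-gallon assumption are both consistent with these rows. Note also that mpg and L/100 km move in opposite directions: a higher mpg means better fuel efficiency.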

Statistical summaries
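The summaries asked for in this part could be produced along the following lines. This is only a sketch: it uses the first seven rows printed by glimpse() as a stand-in, and the data frame name `first_rows` and its shortened column names are hypothetical. The numbers it prints describe those seven rows, not the full dataset. The same calls applied to `ca_car` with its own column names would give the full summaries.

```r
# Stand-in data: first seven rows copied from the glimpse() output
first_rows <- data.frame(
  engine_size = c(2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5),
  cylinders   = c(4, 4, 4, 6, 6, 6, 6),
  fuel_type   = factor(rep("Z", 7)),
  mpg         = c(33, 29, 48, 25, 27, 28, 28)
)

# Minimum, quartiles, mean, and maximum for each numeric variable
summary(first_rows[, c("engine_size", "cylinders", "mpg")])

# Frequency table for the categorical variable
table(first_rows$fuel_type)

# Mean mpg by number of cylinders
mpg_by_cyl <- aggregate(mpg ~ cylinders, data = first_rows, FUN = mean)
mpg_by_cyl

# Plots (uncomment to draw): distribution of mpg, and mpg against engine size
# hist(first_rows$mpg, main = "Combined mpg", xlab = "mpg")
# plot(mpg ~ engine_size, data = first_rows)
```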

Part 3: Linear Regression Analysis

1. Define the regression model and justify the choice of variables.

2. Fit a multiple linear regression model and refine it using model selection techniques such as ANOVA.

3. Use residual plots and normality plots to check regression assumptions. Also check whether there are serious collinearity issues in your model.

4. Use residual plots to see whether it is necessary to include interaction terms or higher-order (quadratic, cubic) terms.

5. Interpret the model coefficients and explain how they answer the research question.

Regression model: fuel_efficiency (combined mpg) = β0 + β1(Engine Size) + β2(Cylinders) + β3(Fuel Type) + 𝜀 (multiple linear regression). Fuel Type is categorical, so β3 stands for a set of indicator coefficients, one for each fuel type other than the baseline.

Reasons: I want to find the relationship between a car's fuel efficiency and its engine specification and fuel type. Fuel efficiency is the dependent variable, while engine size, number of cylinders, and the fuel type the car uses are the independent variables.
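Because fuel type is categorical, R does not fit a single coefficient for it. It expands the factor into indicator (dummy) columns, one per level other than the baseline. A minimal illustration, using a hypothetical five-element vector `fuel` that holds the fuel codes described in the dataset introduction:

```r
# One observation per fuel code; levels are sorted alphabetically, so D is the baseline
fuel <- factor(c("D", "E", "N", "X", "Z"))

# Design matrix that lm() would build for a formula containing this factor
X <- model.matrix(~ fuel)
colnames(X)
X
```

The diesel row has zeros in every indicator column, which is why the regression output reports coefficients for E, N, X, and Z only, each measured relative to diesel.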

Linear Regression Model

model_0 <- lm(Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Cylinders + Fuel.Type, data = ca_car)

summary(model_0)

## 
## Call:
## lm(formula = Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Cylinders + 
##     Fuel.Type, data = ca_car)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.690 -2.602 -0.318  1.770 34.849 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     43.23228    0.37500 115.287  < 2e-16 ***
## Engine.Size.L.  -3.23264    0.10385 -31.127  < 2e-16 ***
## Cylinders       -0.38942    0.07822  -4.979 6.55e-07 ***
## Fuel.TypeE     -10.02471    0.40817 -24.560  < 2e-16 ***
## Fuel.TypeN      -7.25824    4.40114  -1.649   0.0992 .  
## Fuel.TypeX      -2.35166    0.34045  -6.908 5.34e-12 ***
## Fuel.TypeZ      -3.89143    0.34258 -11.359  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.388 on 7378 degrees of freedom
## Multiple R-squared:  0.6321, Adjusted R-squared:  0.6318 
## F-statistic:  2113 on 6 and 7378 DF,  p-value: < 2.2e-16

Check and Refine

Fit simpler models and compare them with ANOVA

# Drop the engine size variable
model_1 <- lm(Fuel.Consumption.Comb..mpg. ~ Cylinders + Fuel.Type, data = ca_car)

anova(model_1, model_0)

## Analysis of Variance Table
## 
## Model 1: Fuel.Consumption.Comb..mpg. ~ Cylinders + Fuel.Type
## Model 2: Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Cylinders + Fuel.Type
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   7379 160743                                  
## 2   7378 142085  1     18658 968.86 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: Dropping engine size raises the residual sum of squares by 18,658 (from 142,085 to 160,743), and the F-test p-value is far below 0.05. The simpler model_1 therefore fits significantly worse than the full model, so we reject model_1 and keep model_0.
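The F statistic in the table can be reproduced by hand from the two residual sums of squares, which makes it clearer what the test measures: the extra error per dropped parameter, relative to the error variance of the full model. The sketch below uses only numbers copied from the table; the variable names are illustrative.

```r
# Values copied from the anova(model_1, model_0) table
rss_reduced <- 160743   # model_1 (without engine size)
rss_full    <- 142085   # model_0
df_reduced  <- 7379
df_full     <- 7378

# F = (increase in RSS per dropped parameter) / (full-model mean squared error)
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

f_stat
p_value
```

The result agrees with the reported F of 968.86 up to the rounding of the printed sums of squares.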

# Drop the cylinders variable
model_2 <- lm(Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Fuel.Type, data = ca_car)

anova(model_2, model_0)

## Analysis of Variance Table
## 
## Model 1: Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Fuel.Type
## Model 2: Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Cylinders + Fuel.Type
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   7379 142562                                  
## 2   7378 142085  1    477.32 24.786 6.552e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: The ANOVA p-value is again below 0.05, so the test formally favours the full model. However, dropping cylinders increases the residual sum of squares by only 477.32, or about (477/142562)*100 = 0.3%. This is a far smaller loss of fit than for model_1, where dropping engine size increased it by 18,658.
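To put the two comparisons on the same footing, the increase in residual sum of squares can be expressed as a percentage of the full model's value in both cases. This is a small worked calculation from the printed ANOVA tables, with illustrative variable names. It also checks that, when a single term is dropped, the F statistic equals the square of that term's t value in summary(model_0).

```r
# Residual sums of squares copied from the two ANOVA tables
rss_model_0 <- 142085   # full model
rss_model_1 <- 160743   # without engine size
rss_model_2 <- 142562   # without cylinders

# Increase in RSS relative to the full model, in percent
pct_increase_model_2 <- 100 * (rss_model_2 - rss_model_0) / rss_model_0
pct_increase_model_1 <- 100 * (rss_model_1 - rss_model_0) / rss_model_0

# t value of Cylinders copied from summary(model_0)
t_cylinders <- -4.979

c(pct_model_2 = pct_increase_model_2,
  pct_model_1 = pct_increase_model_1,
  t_squared   = t_cylinders^2)
```

Dropping cylinders costs roughly a third of one percent of fit, while dropping engine size costs roughly 13 percent. With more than 7,000 observations, even a very small loss of fit can be statistically significant, so the size of the change matters as much as the p-value.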

Collinearity check

# VIF test on cylinders, engine size and fuel type
car::vif(model_0)

##                    GVIF Df GVIF^(1/(2*Df))
## Engine.Size.L. 7.583696  1        2.753851
## Cylinders      7.841933  1        2.800345
## Fuel.Type      1.186523  4        1.021609

Note: The VIF values for Cylinders and Engine Size (both above 5) show that the two variables are highly correlated with each other, which makes their individual coefficients hard to estimate and interpret. Compared with the potential problems caused by this collinearity, the roughly 0.3% improvement in fit from keeping Cylinders is not worth it. In this case, we keep model_2.
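For a term with one degree of freedom, the GVIF is the ordinary variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors in the model. Inverting the formula shows how much of each predictor the others already explain. The sketch below uses only the GVIF values printed above.

```r
# GVIF values copied from the car::vif(model_0) output
gvif <- c(Engine.Size.L. = 7.583696, Cylinders = 7.841933)

# Share of each predictor's variance explained by the other predictors
implied_r2 <- 1 - 1 / gvif
round(implied_r2, 3)
```

Roughly 87% of the variation in each of the two variables is explained by the other predictors, which is the sense in which they carry largely the same information.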

Residual Plot check

plot(model_2)

## Warning: not plotting observations with leverage one:
##   2440

Note: In the Residuals vs Fitted plot, the points lie roughly around the zero line but fan out widely at higher fitted mpg, which suggests non-constant variance. The Q-Q plot shows that the residuals are highly skewed, so the standard errors and p-values of this model may not be trustworthy.
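The warning about an observation with leverage one typically appears when a factor level occurs only once, because the model then fits that single observation exactly. A likely explanation here, not verified against the data, is that observation 2440 is the only natural gas (N) vehicle. The toy example below reproduces the effect with hypothetical values. Running `table(ca_car$Fuel.Type)` on the real data would confirm or rule out this explanation.

```r
# Toy data (hypothetical values): fuel level "N" occurs exactly once, in row 7
toy <- data.frame(
  y    = c(40, 36, 30, 28, 25, 27, 22),
  x    = c(1.5, 2.0, 2.4, 2.0, 3.5, 3.0, 3.6),
  fuel = factor(c("D", "D", "X", "X", "Z", "Z", "N"))
)

fit <- lm(y ~ x + fuel, data = toy)

# Leverage (hat value) of every observation; row 7 has leverage one
h <- hatvalues(fit)
round(h, 4)

# Its residual is zero, so it says nothing about how well the model fits
round(residuals(fit)[7], 10)
```

This also helps explain why the N coefficient has a much larger standard error than the other fuel types in the model summaries.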

Better residual plot

car::residualPlot(model_2)

Note: The residual plot does show a rough funnel shape.

# Log transformation 
model_3 <- lm(log(Fuel.Consumption.Comb..mpg.) ~ Engine.Size.L. + Fuel.Type, data = ca_car)

plot(model_3)

## Warning: not plotting observations with leverage one:
##   2440

Note: The situation improves, but the p-values are still not fully trustworthy. In the Q-Q plot the points still curve away from the reference line at the ends, which means the residuals remain skewed to some degree.

summary(model_3)

## 
## Call:
## lm(formula = log(Fuel.Consumption.Comb..mpg.) ~ Engine.Size.L. + 
##     Fuel.Type, data = ca_car)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44865 -0.07836  0.00129  0.07557  0.69816 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)     3.842068   0.010441  367.979  < 2e-16 ***
## Engine.Size.L. -0.140969   0.001183 -119.177  < 2e-16 ***
## Fuel.TypeE     -0.424872   0.012264  -34.645  < 2e-16 ***
## Fuel.TypeN     -0.243536   0.132527   -1.838   0.0662 .  
## Fuel.TypeX     -0.080575   0.010232   -7.875 3.89e-15 ***
## Fuel.TypeZ     -0.127437   0.010311  -12.359  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1321 on 7379 degrees of freedom
## Multiple R-squared:  0.7371, Adjusted R-squared:  0.7369 
## F-statistic:  4137 on 5 and 7379 DF,  p-value: < 2.2e-16

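Note: After the log transformation, the adjusted R-squared is 0.7369. Because the response is now log(mpg) rather than mpg, this value is not directly comparable with the adjusted R-squared of model_0 (0.6318), but together with the residual plots it suggests that the log model fits better.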
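One way to compare a log-response model with a raw-response model on equal terms is to back-transform the fitted values and compute R-squared on the original scale. The sketch below shows the idea on R's built-in `mtcars` data, used purely as a stand-in; the helper `r2_orig` is hypothetical and the numbers it prints say nothing about the CO2 dataset. The same steps could be applied to model_2 and model_3.

```r
# Stand-in models on built-in data: raw response and log response
fit_raw <- lm(mpg ~ disp, data = mtcars)
fit_log <- lm(log(mpg) ~ disp, data = mtcars)

# Hypothetical helper: R-squared of predictions yhat against observations y
r2_orig <- function(y, yhat) {
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
}

# Back-transform the log-scale fitted values to mpg before comparing
pred_raw <- fitted(fit_raw)
pred_log <- exp(fitted(fit_log))

c(raw_model = r2_orig(mtcars$mpg, pred_raw),
  log_model_backtransformed = r2_orig(mtcars$mpg, pred_log))
```

Simply exponentiating the fitted values gives predictions of the median rather than the mean of mpg, so this comparison is an approximation.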

Model coefficients

Note: model_3 is a log-linear regression model. In this case, each estimate in the summary is the change in log(Y) when X increases by one unit, so 100 × (e^estimate - 1) is the percentage change in Y. Diesel (fuel type D) is the baseline fuel type in the summary.

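Model: log(Fuel efficiency (mpg)) = 3.842068 - 0.140969(Engine.Size.L.) - 0.424872(Fuel.TypeE) - 0.243536(Fuel.TypeN) - 0.080575(Fuel.TypeX) - 0.127437(Fuel.TypeZ) -> Fuel efficiency (mpg) = 46.6 × 0.8685^(Engine.Size.L.) × 0.6539^(Fuel.TypeE) × 0.7839^(Fuel.TypeN) × 0.9226^(Fuel.TypeX) × 0.8803^(Fuel.TypeZ), where 46.6 = e^(3.842068) and each factor is e raised to the corresponding estimate. The effects are multiplicative on the mpg scale: a factor of 0.8685 means a change of 0.8685 - 1 = -13.15% per unit.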
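Because the response is log(mpg), a prediction in mpg is obtained by adding up the coefficients on the log scale and then exponentiating the total. The sketch below does this with the coefficients copied from summary(model_3); the helper `predict_mpg` and the short coefficient names are hypothetical.

```r
# Coefficients copied from summary(model_3); diesel (D) is the baseline
b <- c(intercept = 3.842068, engine = -0.140969,
       fuelE = -0.424872, fuelN = -0.243536,
       fuelX = -0.080575, fuelZ = -0.127437)

# Hypothetical helper: predicted mpg for a given engine size (litres) and fuel code
predict_mpg <- function(engine_size, fuel = c("D", "E", "N", "X", "Z")) {
  fuel <- match.arg(fuel)
  fuel_effect <- if (fuel == "D") 0 else b[[paste0("fuel", fuel)]]
  exp(b[["intercept"]] + b[["engine"]] * engine_size + fuel_effect)
}

predict_mpg(2.0, "Z")   # 2.0 L engine on premium gasoline
predict_mpg(2.0, "D")   # same engine size on diesel (baseline)
```

For a 2.0 L engine on premium gasoline the prediction is about 31 mpg. The first row of the dataset (a 2.0 L ACURA ILX on fuel type Z) reports 33 mpg.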

Interpretation: The intercept corresponds to a car with an engine size of 0 litres using the baseline fuel type D (diesel); its expected log(mpg) is 3.842, which is about 46.6 mpg. No real car has a 0-litre engine, so this value is only a reference point. For each one-litre increase in engine size, with fuel type held fixed, mpg decreases by approximately 13.15%. With engine size held fixed, a car using type E fuel has about 34.61% lower fuel efficiency than a diesel car, a car using type X fuel about 7.74% lower, and a car using type Z fuel about 11.97% lower. Since the type N coefficient has a p-value greater than 0.05, we cannot conclude that there is a difference between type D and type N fuel. The multiple R-squared of 0.7371 means that engine size and fuel type explain 73.71% of the variance in log fuel efficiency.
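The percentages in the interpretation come from converting each log-scale estimate with 100 × (e^estimate - 1). A short check using the coefficients copied from summary(model_3), with an illustrative vector name:

```r
# Slope and fuel-type estimates copied from summary(model_3)
b <- c(Engine.Size.L. = -0.140969, Fuel.TypeE = -0.424872,
       Fuel.TypeN = -0.243536, Fuel.TypeX = -0.080575,
       Fuel.TypeZ = -0.127437)

# Exact percentage change in mpg for a one-unit increase in each term
pct_change <- 100 * (exp(b) - 1)
round(pct_change, 2)

# Intercept converted back to the mpg scale
exp(3.842068)
```

For small estimates such as the type X coefficient, the raw estimate times 100 (about -8.1%) is close to the exact value. For larger ones such as type E the gap is substantial (-42.5% versus about -34.6%), so the exponentiated form should be used.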

Part 4: Discussion and Model Improvement

1. Discuss potential improvements to your model.

Answer: Not every engine can burn every type of fuel; in particular, diesel and gasoline engines are built differently. The current model cannot tell whether a given car’s engine can use a given fuel type. Accounting for this in the model would probably improve it further.

2. Suggest additional variables or transformations that could enhance the analysis.

Answer: I suggest separating diesel and gasoline cars and analysing each group separately.

3. Reflect on limitations and possible extensions for future research.

Answer: This model only contains variables describing the car itself, namely its engine and fuel type. However, more variables affect fuel efficiency, including the condition of the engine, the condition of the air filter, the type of engine oil used, and so on. Future research could add variables that capture these factors so that the model better explains what drives fuel efficiency.

Homework #2: Linear Regression

Yuhe Luo

2026-04-08