Research Question: How do a car’s brand, model, and class affect its fuel efficiency?
Reasons for choice: A car is one of the most important means of transportation, and buying, running, and maintaining one costs a lot of money. For anyone who will need a car in the future, choosing one with high fuel efficiency matters for long-term use, because fossil fuel reserves are limited and fuel may become more expensive and less affordable.
Data to use: The dataset I am going to use is “CO2 Emission by Vehicles” from the data platform Kaggle. The data originally come from a Government of Canada website. The dataset includes all three variables I need (make, model, and vehicle class) together with fuel consumption figures for highway driving, city driving, and the two combined. Comparing the consumption figures of different cars should point toward an answer.
library(dplyr)  # provides glimpse()
ca_car <- read.csv("/Users/yuhe/Downloads/archive/CO2 Emissions_Canada.csv")
ca_car$Fuel.Type <- as.factor(ca_car$Fuel.Type)
glimpse(ca_car)
## Rows: 7,385
## Columns: 12
## $ Make <chr> "ACURA", "ACURA", "ACURA", "ACURA", "…
## $ Model <chr> "ILX", "ILX", "ILX HYBRID", "MDX 4WD"…
## $ Vehicle.Class <chr> "COMPACT", "COMPACT", "COMPACT", "SUV…
## $ Engine.Size.L. <dbl> 2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.…
## $ Cylinders <int> 4, 4, 4, 6, 6, 6, 6, 6, 6, 4, 4, 6, 4…
## $ Transmission <chr> "AS5", "M6", "AV7", "AS6", "AS6", "AS…
## $ Fuel.Type <fct> Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z, Z…
## $ Fuel.Consumption.City..L.100.km. <dbl> 9.9, 11.2, 6.0, 12.7, 12.1, 11.9, 11.…
## $ Fuel.Consumption.Hwy..L.100.km. <dbl> 6.7, 7.7, 5.8, 9.1, 8.7, 7.7, 8.1, 9.…
## $ Fuel.Consumption.Comb..L.100.km. <dbl> 8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10.1…
## $ Fuel.Consumption.Comb..mpg. <int> 33, 29, 48, 25, 27, 28, 28, 25, 24, 3…
## $ CO2.Emissions.g.km. <int> 196, 221, 136, 255, 244, 230, 232, 25…
Dataset Introduction: The “CO2 Emission by Vehicles” data come from the Government of Canada's fuel consumption ratings (https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64#wb-auto-6). The dataset includes the car manufacturer, model, vehicle class, engine size in litres, number of cylinders, transmission type, fuel type, and fuel consumption for highway driving, city driving, and the combination of both (55% city, 45% hwy), reported in L/100 km, with the combined rating also given in mpg. CO2 emission is the tailpipe emission of carbon dioxide (in grams per kilometre) for combined city and highway driving. The Model variable uses the abbreviations 4WD/4X4 - Four-wheel drive, AWD - All-wheel drive, FFV - Flexible-fuel vehicle, SWB - Short wheelbase, LWB - Long wheelbase and EWB - Extended wheelbase. The Transmission variable uses A - automatic, AM - automated manual, AS - automatic with select shift, AV - continuously variable and M - manual; a trailing number from 3 to 10 gives the number of gears. The Fuel Type variable uses X - regular gasoline, Z - premium gasoline, D - diesel, E - ethanol (E85), and N - natural gas.
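The coding scheme described above can be checked against the first rows printed by glimpse(). The sketch below splits a transmission code into its type and gear count and recomputes the combined rating from the 55%/45% weighting. The vector names are illustrative, and the conversion constant 282.481 assumes that the mpg column uses imperial gallons; that is an assumption, although the first three rows are consistent with it.

```r
# Transmission codes: leading letters give the type, trailing digits the gears.
# "AV" is included to show that a code without digits yields NA gears.
transmission <- c("AS5", "M6", "AV7", "AV")
trans_type <- sub("[0-9]+$", "", transmission)
gears <- as.integer(sub("^[A-Z]+", "", transmission))

# City and highway ratings of the first three rows, copied from glimpse().
city <- c(9.9, 11.2, 6.0)
hwy  <- c(6.7, 7.7, 5.8)

# Combined rating: 55% city, 45% highway, rounded to one decimal.
comb_l100km <- round(0.55 * city + 0.45 * hwy, 1)

# Assumption: mpg is miles per imperial gallon (282.481 / (L/100 km)).
comb_mpg <- round(282.481 / comb_l100km)

print(data.frame(city, hwy, comb_l100km, comb_mpg))
```

The recomputed values can be compared with the Fuel.Consumption.Comb..L.100.km. and Fuel.Consumption.Comb..mpg. columns shown in the glimpse() output.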
Statistical summaries
Regression model: fuel_efficiency (combined mpg) = β0 + β1(Engine Size) + β2(Cylinders) + β3(Fuel Type) + 𝜀 (multiple linear regression), where Fuel Type is a categorical variable that R expands into one indicator term per non-baseline level.
Reasons: I want to find the relationship between a car's fuel efficiency and its engine specification and fuel type. Fuel efficiency is the dependent variable; engine size, number of cylinders, and the fuel type the car uses are the independent variables.
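Because fuel type is a factor, lm() does not estimate a single slope for it. A minimal sketch of the indicator (dummy) coding R applies by default is shown here; the variable name fuel_type is illustrative and the five levels are the fuel codes listed in the dataset introduction.

```r
# One observation per fuel code, in the default alphabetical level order.
fuel_type <- factor(c("D", "E", "N", "X", "Z"))

# Default treatment contrasts: the first level (D) becomes the baseline,
# and every other level gets its own 0/1 indicator column.
design <- model.matrix(~ fuel_type)
print(design)
```

The baseline level D has zeros in every indicator column, which is why the regression summaries report coefficients for E, N, X and Z only, each measured relative to diesel.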
model_0 <- lm(Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Cylinders + Fuel.Type, data = ca_car)
summary(model_0)
##
## Call:
## lm(formula = Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Cylinders +
## Fuel.Type, data = ca_car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.690 -2.602 -0.318 1.770 34.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.23228 0.37500 115.287 < 2e-16 ***
## Engine.Size.L. -3.23264 0.10385 -31.127 < 2e-16 ***
## Cylinders -0.38942 0.07822 -4.979 6.55e-07 ***
## Fuel.TypeE -10.02471 0.40817 -24.560 < 2e-16 ***
## Fuel.TypeN -7.25824 4.40114 -1.649 0.0992 .
## Fuel.TypeX -2.35166 0.34045 -6.908 5.34e-12 ***
## Fuel.TypeZ -3.89143 0.34258 -11.359 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.388 on 7378 degrees of freedom
## Multiple R-squared: 0.6321, Adjusted R-squared: 0.6318
## F-statistic: 2113 on 6 and 7378 DF, p-value: < 2.2e-16
# Drop the engine size variable
model_1 <- lm(Fuel.Consumption.Comb..mpg. ~ Cylinders + Fuel.Type, data = ca_car)
anova(model_1, model_0)
## Analysis of Variance Table
##
## Model 1: Fuel.Consumption.Comb..mpg. ~ Cylinders + Fuel.Type
## Model 2: Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Cylinders + Fuel.Type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 7379 160743
## 2 7378 142085 1 18658 968.86 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: Dropping engine size raises the residual sum of squares by 18,658, and the F-test p-value is far below 0.05. The reduced model_1 fits significantly worse than the full model, so we reject model_1 and keep model_0.
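The F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares. The sketch below uses the rounded values printed in the table, so the result may differ from the reported 968.86 in the second decimal; the variable names are illustrative.

```r
# Values copied from the anova(model_1, model_0) table.
rss_reduced <- 160743   # model_1, without engine size
rss_full    <- 142085   # model_0
df_reduced  <- 7379
df_full     <- 7378

# Nested-model F test: extra RSS per dropped parameter,
# divided by the residual variance estimate of the full model.
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) /
  (rss_full / df_full)

# Upper-tail probability under the F distribution.
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```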
model_2 <- lm(Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Fuel.Type, data = ca_car)
anova(model_2, model_0)
## Analysis of Variance Table
##
## Model 1: Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Fuel.Type
## Model 2: Fuel.Consumption.Comb..mpg. ~ Engine.Size.L. + Cylinders + Fuel.Type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 7379 142562
## 2 7378 142085 1 477.32 24.786 6.552e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: The ANOVA p-value is below 0.05, so the F-test formally prefers the full model. However, dropping cylinders increases the residual sum of squares by only (477/142562)*100 = 0.3%, a far smaller loss than the one caused by dropping engine size in model_1.
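To put the two reduced models side by side, the same relative-loss calculation can be applied to both ANOVA tables. This is a small sketch using the RSS values printed above, with each loss expressed relative to the reduced model's RSS as in the note; the variable names are illustrative.

```r
# Residual sums of squares copied from the two ANOVA tables.
rss_full    <- 142085   # model_0
rss_model_1 <- 160743   # engine size dropped
rss_model_2 <- 142562   # cylinders dropped

# Share of each reduced model's RSS that the dropped variable would remove.
loss_model_1 <- 100 * (rss_model_1 - rss_full) / rss_model_1
loss_model_2 <- 100 * (rss_model_2 - rss_full) / rss_model_2

print(round(c(model_1 = loss_model_1, model_2 = loss_model_2), 2))
```

Dropping engine size costs roughly 12% of the residual sum of squares, while dropping cylinders costs about 0.3%.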
# VIF test on cylinders, engine size and fuel type
car::vif(model_0)
## GVIF Df GVIF^(1/(2*Df))
## Engine.Size.L. 7.583696 1 2.753851
## Cylinders 7.841933 1 2.800345
## Fuel.Type 1.186523 4 1.021609
Note: The VIF scores for Cylinders and Engine Size (both above 5) show that the two predictors are highly correlated with each other. This collinearity inflates the standard errors and makes their individual coefficients hard to separate. Compared with the risk from that high correlation, the 0.3% improvement from keeping cylinders is not worth it, so we keep model_2.
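A VIF can be read as a statement about how well one predictor is explained by the others, since VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The sketch below inverts that formula for the two values reported above; for a term with one degree of freedom the GVIF coincides with the ordinary VIF.

```r
# GVIF values copied from the car::vif(model_0) output (both have Df = 1).
gvif <- c(Engine.Size.L. = 7.583696, Cylinders = 7.841933)

# Invert VIF = 1 / (1 - R^2) to get the implied R^2 of each predictor
# regressed on the other predictors in model_0.
r_squared <- 1 - 1 / gvif

print(round(r_squared, 3))
```

Both implied R-squared values are close to 0.87, meaning that most of the variation in each of the two engine variables is already explained by the other predictors.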
plot(model_2)
## Warning: not plotting observations with leverage one:
## 2440
Note: In the Residuals vs Fitted plot, the points lie roughly around the zero line but fan out widely at higher mpg. The Q-Q plot shows that the residuals are highly skewed, so the p-values are not trustworthy.
car::residualPlot(model_2)
Note: The residual plot does show a rough funnel shape.
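A funnel shape often appears when the noise scales with the mean of the response, and taking logs is a common remedy in that situation. The sketch below illustrates the idea on simulated data only; the coefficients, noise level and variable names are made up for the illustration and are not estimates from the car data.

```r
set.seed(1)
n <- 2000
engine_size <- runif(n, min = 1, max = 6)

# Simulated response with multiplicative noise: the spread grows with the mean.
mpg <- exp(3.8 - 0.14 * engine_size + rnorm(n, sd = 0.13))

fit_raw <- lm(mpg ~ engine_size)        # response on the original scale
fit_log <- lm(log(mpg) ~ engine_size)   # response on the log scale

# Crude funnel measure: correlation between |residual| and fitted value.
funnel_raw <- cor(abs(resid(fit_raw)), fitted(fit_raw))
funnel_log <- cor(abs(resid(fit_log)), fitted(fit_log))

print(c(raw = funnel_raw, log = funnel_log))
```

In this simulation the raw-scale fit shows a clearly positive correlation between residual size and fitted value, while the log-scale fit shows a correlation near zero.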
# Log transformation
model_3 <- lm(log(Fuel.Consumption.Comb..mpg.) ~ Engine.Size.L. + Fuel.Type, data = ca_car)
plot(model_3)
## Warning: not plotting observations with leverage one:
## 2440
Note: The situation improves, but the p-values are still not fully trustworthy. In the Q-Q plot the points still curve sharply away from the line at the end, which means the residuals remain somewhat skewed.
summary(model_3)
##
## Call:
## lm(formula = log(Fuel.Consumption.Comb..mpg.) ~ Engine.Size.L. +
## Fuel.Type, data = ca_car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44865 -0.07836 0.00129 0.07557 0.69816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.842068 0.010441 367.979 < 2e-16 ***
## Engine.Size.L. -0.140969 0.001183 -119.177 < 2e-16 ***
## Fuel.TypeE -0.424872 0.012264 -34.645 < 2e-16 ***
## Fuel.TypeN -0.243536 0.132527 -1.838 0.0662 .
## Fuel.TypeX -0.080575 0.010232 -7.875 3.89e-15 ***
## Fuel.TypeZ -0.127437 0.010311 -12.359 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1321 on 7379 degrees of freedom
## Multiple R-squared: 0.7371, Adjusted R-squared: 0.7369
## F-statistic: 4137 on 5 and 7379 DF, p-value: < 2.2e-16
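Note: The log transformation raised the adjusted R-squared to 0.7369, although because the response is now log(mpg) this value is not strictly comparable with the R-squared of the untransformed models.
Note: model_3 is a log-linear regression model. Each estimate in the summary is the change in log(mpg) for a one-unit change in X, which corresponds to a percentage change of (e^estimate - 1) * 100 in mpg. Fuel type D (diesel) is the baseline level in the summary.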
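The conversion from log-scale estimates to percentage changes can be done in one step. The sketch below uses the coefficients copied from summary(model_3); the vector name coefs is illustrative.

```r
# Estimates copied from the summary(model_3) output.
coefs <- c(
  Engine.Size.L. = -0.140969,
  Fuel.TypeE     = -0.424872,
  Fuel.TypeN     = -0.243536,
  Fuel.TypeX     = -0.080575,
  Fuel.TypeZ     = -0.127437
)

# Percentage change in mpg for a one-unit change in each term.
pct_change <- 100 * (exp(coefs) - 1)

print(round(pct_change, 1))
```

Rounded to one decimal, the changes are about -13.1% per litre of engine size and about -34.6%, -21.6%, -7.7% and -12.0% for fuel types E, N, X and Z relative to diesel. The N estimate is not significant at the 0.05 level, so its percentage should be read with caution.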
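Model: log(Fuel efficiency (mpg)) = 3.842068 - 0.140969(Engine.Size.L.) - 0.424872(Fuel.TypeE) - 0.243536(Fuel.TypeN) - 0.080575(Fuel.TypeX) - 0.127437(Fuel.TypeZ) -> Fuel efficiency (mpg) = 46.6 × 0.869^(Engine.Size.L.) × 0.654^(Fuel.TypeE) × 0.784^(Fuel.TypeN) × 0.923^(Fuel.TypeX) × 0.880^(Fuel.TypeZ), where each factor is e^(estimate). On the mpg scale the effects are multiplicative rather than additive, so each factor corresponds to a percentage change of (factor - 1) * 100.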
Interpretation: The “Intercept” estimate is the expected log(mpg), 3.84, for a car with an engine size of 0 L using the baseline fuel type D (diesel). This corresponds to about e^3.84 = 46.6 mpg and is an extrapolation, since no car has a 0 L engine. With fuel type held fixed, each one-litre increase in engine size decreases mpg by approximately 13.15%. With engine size held fixed, a car using type E fuel has approximately 34.61% lower fuel efficiency than a diesel car, a car using type X fuel approximately 7.74% lower, and a car using type Z fuel approximately 11.97% lower. Since type N fuel has a p-value greater than 0.05, we cannot draw a conclusion about the difference between type D and type N fuel. The multiple R-squared of 0.7371 means that fuel type and engine size explain 73.71% of the variance in log fuel efficiency.
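As a worked example of the interpretation, the fitted equation can be evaluated for a single car. The sketch below predicts the combined mpg for a 2.0 L engine running on premium gasoline (type Z), which matches the specification of the first row in the glimpse() output; the variable names are illustrative.

```r
# Estimates copied from the summary(model_3) output.
intercept <- 3.842068
b_engine  <- -0.140969
b_fuel_z  <- -0.127437

# Linear predictor on the log scale for a 2.0 L engine using fuel type Z.
log_mpg <- intercept + b_engine * 2.0 + b_fuel_z * 1

# Back-transform to the mpg scale.
pred_mpg <- exp(log_mpg)

print(round(pred_mpg, 1))
```

The prediction is about 31 mpg, against an observed 33 mpg for that first row. Exponentiating a log-scale prediction is usually read as an estimate of the median rather than the mean of mpg, so a small downward gap relative to the average is expected.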
Answer: Not every engine can burn every type of fuel; in particular, diesel and gasoline engines are built differently. This model cannot tell which engines can or cannot use a given fuel type. If we could build that constraint into the model, we could probably improve it further.
Answer: I suggest separating the diesel and gasoline cars and analyzing them separately.
Answer: This model only contains variables that describe the car itself: engine size and fuel type. However, more variables affect fuel efficiency, including the condition of the engine, the condition of the air filter, the type of engine oil used, and so on. Future research could add variables that capture these factors so that the model better explains what drives fuel efficiency.
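The suggestion to analyze diesel and gasoline cars separately could be sketched as follows. The data frame toy is simulated purely for illustration, and the grouping column engine_family is a hypothetical name; with the real data the same split() and lapply() pattern would be applied to ca_car after deciding how fuel types E and N should be grouped.

```r
set.seed(42)

# Simulated stand-in for the car data (not real observations).
toy <- data.frame(
  Engine.Size.L. = runif(200, min = 1, max = 6),
  Fuel.Type = factor(sample(c("D", "X", "Z"), 200, replace = TRUE))
)
toy$mpg <- exp(3.8 - 0.14 * toy$Engine.Size.L. + rnorm(200, sd = 0.1))

# Hypothetical grouping: D is diesel, X and Z are gasoline.
toy$engine_family <- ifelse(toy$Fuel.Type == "D", "diesel", "gasoline")

# Fit the log-linear model separately within each engine family.
fits <- lapply(
  split(toy, toy$engine_family),
  function(d) lm(log(mpg) ~ Engine.Size.L., data = d)
)

# Engine-size slope estimated within each family.
slopes <- sapply(fits, function(f) coef(f)[["Engine.Size.L."]])
print(round(slopes, 3))
```

Fitting within each family lets the engine-size effect differ between diesel and gasoline cars instead of forcing one shared slope.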