This data set captures how the CO2 emissions of a vehicle vary with its features. The data were taken from the Government of Canada's official open data website and compiled into a single file covering a period of 7 years. There are a total of 7385 rows and 12 columns.
The data has been taken and compiled from the below Canada Government official link: https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64#wb-auto-6
# load packages and the data set provided
library(tidyverse)
library(readr)
co2<-read.csv("https://raw.githubusercontent.com/kitadasmalley/MATH239/main/data/CO2_Emissions_Midterm.csv",
sep=",", quote="\"",
header = TRUE)
co2$Fuel.Type<-as.factor(co2$Fuel.Type)
Question 1:
(5 points) Fuel type should be a categorical variable. What are the levels and how many are there? Which group will R choose to be the reference group? Does this make sense in the context of these data? (Hint: contrasts)
There are 5 levels:
levels(co2$Fuel.Type)
## [1] "D" "E" "N" "X" "Z"
X = Regular gasoline
Z = Premium gasoline
D = Diesel
E = Ethanol (E85)
N = Natural gas
Using the contrasts function, I found that diesel (D) is the reference level, since it is the row with all zeros in the contrast matrix. By default, R orders factor levels alphabetically (or numerically), so D = Diesel comes first and becomes the reference level. The reference level is changed to X (regular gasoline) further down.
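As a small illustration of the default ordering described above, the sketch below builds a toy factor with the same five codes (the values are made up, not taken from the data) and shows how `relevel()` moves a chosen level to the reference position.

```r
# Toy factor with the same five codes as Fuel.Type (illustrative values only)
fuel <- factor(c("X", "Z", "D", "E", "N", "X", "Z"))

# Default level order is alphabetical, so "D" is the reference level
levels(fuel)

# relevel() moves one level to the front and keeps the others in order
fuel_x <- relevel(fuel, ref = "X")
levels(fuel_x)

# The reference level is the all-zero row of the contrast matrix
contrasts(fuel_x)
```

`relevel()` is an alternative to rebuilding the factor with an explicit `levels =` argument; either approach changes which group the other coefficients are compared against.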
Question 2: (5 points) In your own words, describe how categorical variables with multiple levels are coded.
First, since a regression model cannot use character values directly, a categorical variable has to be converted into binary indicator variables, also referred to as dummy variables. For a factor with k levels, R creates k - 1 dummy variables: each one is 1 for the observations in its own level and 0 for all the rest, and the reference level is the one that has 0 in every dummy variable. The contrast matrix below shows this coding:
contrasts(co2$Fuel.Type)
## E N X Z
## D 0 0 0 0
## E 1 0 0 0
## N 0 1 0 0
## X 0 0 1 0
## Z 0 0 0 1
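To see how this coding ends up in a model, the sketch below calls `model.matrix()` on a toy factor (made-up values, one row per observation). Each non-reference level gets its own 0/1 column, and the reference level is absorbed into the intercept.

```r
# Toy factor: one observation per level, plus a repeated "E" (illustrative only)
fuel <- factor(c("D", "E", "N", "X", "Z", "E"))

# Design matrix that lm() would build for the formula ~ fuel
X <- model.matrix(~ fuel)
X

# Row 1 is a "D" observation: intercept = 1, every dummy column = 0
X[1, ]
```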
# change the reference level to "X" (regular gasoline)
co2$Fuel.Type <- factor(co2$Fuel.Type,
levels = c("X", "Z", "D", "E", "N"))
Question 4: (5 points) Create a side-by-side box plot to compare the distributions of CO2 Emissions (g/km) across fuel types. Feel free to use color! What are your observations?
ggplot(co2,aes(Fuel.Type,CO2.Emissions.g.km.,fill = Fuel.Type ))+
geom_boxplot()
# I wanted to see why the box for N is flat, so I overlaid the raw data points on the box plot.
ggplot(co2,aes(Fuel.Type,CO2.Emissions.g.km.,fill = Fuel.Type ))+
geom_boxplot()+
geom_jitter(color = "grey")
# I found that there are very few data points in N (apparently only one), so I concluded that there is not a lot of data on N.
My observations are that, for the most part, the data look approximately normally distributed within each fuel type. I did notice that there are a lot of outliers. X and Z have most of their outliers towards the maximum values, and E has them in both the maximum and minimum directions. I am still working out how to deal with these outliers. They could be removed, but they might also hold information that we have not considered. I am quite surprised that E = Ethanol has high CO2 emissions, since I believe ethanol is also a component of gasoline. I would want to look further into the chemistry of the fuel types to see the different levels of ethanol in each one. I also added jittered points to this plot to check that the data for N = Natural gas were being displayed correctly. It looks like there is only one data point there, which would explain why the box collapses to a single line.
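A quick way to confirm how many observations sit behind each box is to tabulate the grouping variable. The sketch below uses a made-up vector of fuel codes; with the real data the same call would be `table(co2$Fuel.Type)`.

```r
# Made-up fuel codes standing in for co2$Fuel.Type
fuel <- c("X", "X", "X", "Z", "Z", "D", "D", "E", "E", "N")

# Number of observations per fuel type
counts <- table(fuel)
counts

# Levels with fewer than two observations cannot show any spread in a box plot
sparse <- names(counts)[as.vector(counts) < 2]
sparse
```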
Question 5: (5 points) Perform the appropriate hypothesis test whether there is a significant difference in the average carbon emissions across fuels types.
Mod1 <- lm(CO2.Emissions.g.km.~Fuel.Type, data = co2 )
summary(Mod1)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -147.092 -42.119 -8.043 35.881 255.957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 235.1193 0.9335 251.855 <2e-16 ***
## Fuel.TypeZ 30.9241 1.3643 22.666 <2e-16 ***
## Fuel.TypeD 2.4292 4.3571 0.558 0.577
## Fuel.TypeE 39.9726 3.0722 13.011 <2e-16 ***
## Fuel.TypeN -22.1193 56.3078 -0.393 0.694
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.3 on 7380 degrees of freedom
## Multiple R-squared: 0.0747, Adjusted R-squared: 0.0742
## F-statistic: 148.9 on 4 and 7380 DF, p-value: < 2.2e-16
anova(Mod1)
Please state the following components:
Name of the hypothesis test
Hypotheses (null and alternative)
H_0: the mean carbon emissions are the same for all fuel types (mu_X = mu_Z = mu_D = mu_E = mu_N)
H_A: at least one fuel type has a mean carbon emission that differs from the others
Provide the test statistic
Reference distribution
P-value
Then communicate your findings in sentence form (5 part summary).
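The components listed above can all be read off an ANOVA table. The sketch below is a minimal example on simulated data (the group means and sample sizes are assumptions chosen only for illustration) showing where the F statistic, its degrees of freedom, and the p-value live, and that the p-value is the upper tail of the F reference distribution.

```r
set.seed(1)

# Simulated toy data: three fuel groups with made-up means
toy <- data.frame(
  fuel = factor(rep(c("X", "Z", "E"), each = 20)),
  co2  = c(rnorm(20, 235, 30), rnorm(20, 266, 30), rnorm(20, 275, 30))
)

fit <- lm(co2 ~ fuel, data = toy)
tab <- anova(fit)

f_stat <- tab[["F value"]][1]   # test statistic
df     <- tab[["Df"]]           # numerator and denominator degrees of freedom
p_val  <- tab[["Pr(>F)"]][1]    # p-value

# The p-value is the upper-tail area of the F(df1, df2) distribution
p_check <- pf(f_stat, df[1], df[2], lower.tail = FALSE)
c(F = f_stat, df1 = df[1], df2 = df[2], p = p_val)
```

For the model in this report, the corresponding values are already printed in the last line of the `summary(Mod1)` output: F = 148.9 on 4 and 7380 degrees of freedom.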
Question 6: (5 points) Please write the form of a linear model and annotate the following components:
Question 7: (5 points) List the assumptions for fitting a linear model. Hint: There are five that we talked about in class
1.) Linear relationship between x and y
2.) The errors have a mean of zero
3.) The errors have a constant spread (constant variance)
4.) The errors are independent
5.) The errors are normally distributed
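One compact way to write these five assumptions, for a simple linear model with a single explanatory variable, is shown below. The symbols are the usual textbook notation, not something taken from the data set.

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
$$

The straight-line mean covers assumption 1, the zero mean of the normal distribution covers assumption 2, the single shared $\sigma^2$ covers assumption 3, "iid" covers assumption 4, and the normal distribution itself covers assumption 5.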
Question 8: (5 points) Create a pairs plot for the numeric variables in these dataset. What relationships do you observe between the potential explanatory variables and the response (carbon emissions)? Hint: In order to accomplish this, you might want to use the select function first for the numeric variables. Then call the pairs() function.
To me it looks like there is a positive trend and a strong correlation between CO2 emissions and most of the potential explanatory variables. A few of the relationships are curved (nonlinear) rather than linear, and I would not use those variables in a linear model without transforming them.
Num_co2 <- co2 %>%
select(Engine.Size.L.,Cylinders,Fuel.Consumption.City..L.100.km., Fuel.Consumption.Hwy..L.100.km., Fuel.Consumption.Comb..L.100.km.,Fuel.Consumption.Comb..mpg.,CO2.Emissions.g.km.)
summary(Num_co2)
## Engine.Size.L. Cylinders Fuel.Consumption.City..L.100.km.
## Min. :0.90 Min. : 3.000 Min. : 4.20
## 1st Qu.:2.00 1st Qu.: 4.000 1st Qu.:10.10
## Median :3.00 Median : 6.000 Median :12.10
## Mean :3.16 Mean : 5.615 Mean :12.56
## 3rd Qu.:3.70 3rd Qu.: 6.000 3rd Qu.:14.60
## Max. :8.40 Max. :16.000 Max. :30.60
## Fuel.Consumption.Hwy..L.100.km. Fuel.Consumption.Comb..L.100.km.
## Min. : 4.000 Min. : 4.10
## 1st Qu.: 7.500 1st Qu.: 8.90
## Median : 8.700 Median :10.60
## Mean : 9.042 Mean :10.98
## 3rd Qu.:10.200 3rd Qu.:12.60
## Max. :20.600 Max. :26.10
## Fuel.Consumption.Comb..mpg. CO2.Emissions.g.km.
## Min. :11.00 Min. : 96.0
## 1st Qu.:22.00 1st Qu.:208.0
## Median :27.00 Median :246.0
## Mean :27.48 Mean :250.6
## 3rd Qu.:32.00 3rd Qu.:288.0
## Max. :69.00 Max. :522.0
pairs(Num_co2)
Question 9: (5 points) In this dataset we have both Fuel Consumption Comb (L/100km) and Fuel Consumption Comb (mpg). Look at the pairs plot for the relationships between these variables and the response, CO2 Emissions (g/km). Which one of these would you want to include as an explanatory variable in your model? Why? Hint: Think about the model assumptions from Question 8
Num2_co2 <- co2 %>%
select(Fuel.Consumption.Comb..L.100.km.,Fuel.Consumption.Comb..mpg.,CO2.Emissions.g.km.)
summary(Num2_co2)
## Fuel.Consumption.Comb..L.100.km. Fuel.Consumption.Comb..mpg.
## Min. : 4.10 Min. :11.00
## 1st Qu.: 8.90 1st Qu.:22.00
## Median :10.60 Median :27.00
## Mean :10.98 Mean :27.48
## 3rd Qu.:12.60 3rd Qu.:32.00
## Max. :26.10 Max. :69.00
## CO2.Emissions.g.km.
## Min. : 96.0
## 1st Qu.:208.0
## Median :246.0
## Mean :250.6
## 3rd Qu.:288.0
## Max. :522.0
pairs(Num2_co2)
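We can see from the plot that the relationship between Fuel.Consumption.Comb..mpg. and CO2 emissions is negative and curved (nonlinear), which violates the linearity assumption. The relationship between Fuel.Consumption.Comb..L.100.km. and CO2 emissions is linear, so I would use Fuel.Consumption.Comb..L.100.km. rather than Fuel.Consumption.Comb..mpg. as my explanatory variable for a linear regression model.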
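The curvature has a simple explanation: mpg is a reciprocal transformation of L/100 km. Assuming the mpg column uses imperial gallons (an assumption, but one that matches the summary above, where the minimum of 4.1 L/100 km corresponds to the maximum of 69 mpg), the conversion is

$$
\text{mpg} = \frac{100 \times 4.54609}{1.609344 \times (\text{L/100 km})} \approx \frac{282.48}{\text{L/100 km}}
$$

So if CO2 emissions are roughly linear in fuel consumption, $\text{CO2} \approx a + b \cdot (\text{L/100 km})$, then in terms of mpg they follow

$$
\text{CO2} \approx a + \frac{282.48\, b}{\text{mpg}}
$$

which is a hyperbola rather than a straight line.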
Extra Credit: (5 points) In the data description, it says that Fuel Consumption Comb (L/100km) is a combined rating (55% city, 45% hwy). Why would it not be appropriate to include Fuel Consumption City (L/100km), Fuel Consumption Hwy (L/100km), and Fuel Consumption Comb (L/100km) in the same model? How is this related to VIF?
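Fuel Consumption Comb is, by definition, a weighted combination of the other two (55% city, 45% hwy), so the three variables are almost perfectly linearly related, and including all of them in the same model would cause severe multicollinearity. The regression below confirms this: the estimated coefficients are 0.5497 for city and 0.4507 for hwy, both with p-values < 2e-16, and the R-squared is 0.9998. VIF is defined as 1/(1 - R-squared) from regressing one explanatory variable on the others, so here it would be about 1/(1 - 0.9998) = 5000, far above the usual rule-of-thumb cutoffs of 5 or 10. The vif function was not working for me, possibly because it comes from the car package, which was not loaded.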
Mod3 <- lm(Fuel.Consumption.Comb..L.100.km.~Fuel.Consumption.City..L.100.km.+Fuel.Consumption.Hwy..L.100.km., data = co2)
summary(Mod3)
##
## Call:
## lm(formula = Fuel.Consumption.Comb..L.100.km. ~ Fuel.Consumption.City..L.100.km. +
## Fuel.Consumption.Hwy..L.100.km., data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41554 -0.02527 0.00026 0.02709 0.50961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0018983 0.0021136 -0.898 0.369
## Fuel.Consumption.City..L.100.km. 0.5496934 0.0004452 1234.838 <2e-16 ***
## Fuel.Consumption.Hwy..L.100.km. 0.4506590 0.0007005 643.368 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04254 on 7382 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 1.706e+07 on 2 and 7382 DF, p-value: < 2.2e-16
#vif(Mod3)
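When `vif()` is unavailable, the same quantity can be computed by hand from its definition, VIF = 1 / (1 - R-squared), where R-squared comes from regressing one explanatory variable on the others. The sketch below does this on simulated data built to mimic the 55/45 weighting; the variable names and noise levels are assumptions for illustration only.

```r
set.seed(42)
n <- 200

# Simulated ratings: comb is (almost) a fixed weighted average of city and hwy
city <- runif(n, 5, 25)
hwy  <- 0.7 * city + rnorm(n, 0, 1)
comb <- 0.55 * city + 0.45 * hwy + rnorm(n, 0, 0.04)

# R-squared from regressing one explanatory variable on the other two
r2 <- summary(lm(comb ~ city + hwy))$r.squared

# Variance inflation factor for comb
vif_comb <- 1 / (1 - r2)
vif_comb

# Same formula applied to the R-squared reported for Mod3 in this report
vif_reported <- 1 / (1 - 0.9998)
vif_reported
```

A VIF in the thousands means the standard error of that coefficient would be inflated enormously, so the three ratings should not appear together in one model.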
Question 10: (5 points) Create a scatter plot to look at the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km). Describe this scatter plot. Be sure to talk about the four characteristics of a scatter plot.
ggplot(co2,aes(Fuel.Consumption.Comb..L.100.km.,CO2.Emissions.g.km.,color = CO2.Emissions.g.km.))+
geom_point()
Direction: positive. Form: linear. Strength: strong. Unusual features: my only worry is that the points are not randomly scattered around a single line but seem to fall along a few distinct lines, and there might be some outliers on the bottom left.
Question 11: (5 points) Create a simple linear model for the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km). Write the equation for the estimated fitted model.
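y(hat) = -0.3971673 + 0.0453828x, where y(hat) is the predicted Fuel.Consumption.Comb (L/100 km) and x is CO2.Emissions (g/km). Note that Mod2, as fitted below, uses fuel consumption as the response and CO2 emissions as the explanatory variable, which is the reverse of the roles in the question.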
Mod2 <- lm(Fuel.Consumption.Comb..L.100.km.~CO2.Emissions.g.km., data = co2)
summary(Mod2)
##
## Call:
## lm(formula = Fuel.Consumption.Comb..L.100.km. ~ CO2.Emissions.g.km.,
## data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3069 -0.3462 -0.1892 -0.0678 7.5272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3971673 0.0586935 -6.767 1.42e-11 ***
## CO2.Emissions.g.km. 0.0453828 0.0002281 198.968 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.147 on 7383 degrees of freedom
## Multiple R-squared: 0.8428, Adjusted R-squared: 0.8428
## F-statistic: 3.959e+04 on 1 and 7383 DF, p-value: < 2.2e-16
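The question treats CO2 emissions as the response. As a minimal sketch of fitting in that orientation, the code below uses simulated data (the column names `fuel_comb` and `co2` and the true intercept and slope are made up for illustration) and prints the fitted equation from the coefficients.

```r
set.seed(7)

# Simulated toy data with a known linear relationship
toy <- data.frame(fuel_comb = runif(100, 4, 26))
toy$co2 <- 5 + 23 * toy$fuel_comb + rnorm(100, 0, 5)

# Response on the left of ~, explanatory variable on the right
fit <- lm(co2 ~ fuel_comb, data = toy)
b <- coef(fit)

# Write out the estimated fitted model
cat(sprintf("co2_hat = %.2f + %.2f * fuel_comb\n", b[1], b[2]))
```

With the real data the equivalent call would be `lm(CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km., data = co2)`. The R-squared would be the same as for Mod2, since in simple linear regression it equals the squared correlation, but the slope and intercept would differ.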
Question 12: (5 points) Perform a hypothesis test for the slope. Perform a test for the slope. State the reference distribution, degrees of freedom, the test statistic, and p-value in the form of a five-part conclusion in the context of the problem.
The hypotheses are H_0: the slope is zero (no linear relationship between Fuel.Consumption.Comb..L.100.km. and CO2 emissions) and H_A: the slope is not zero. The reference distribution is a t-distribution with 7383 degrees of freedom. The test statistic is t = 198.968, and the p-value is < 2e-16. Since the p-value is far below the 0.05 significance level, we reject the null hypothesis: there is convincing evidence of a positive linear relationship between Fuel.Consumption.Comb..L.100.km. and CO2 emissions, with an estimated slope of 0.0454 in Mod2.
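The test statistic and p-value can be reproduced by hand from the slope estimate and its standard error. The sketch below copies those two numbers and the degrees of freedom from the `summary(Mod2)` output and applies the usual t-test formulas; the confidence interval at the end is an extra that the question does not ask for.

```r
# Values copied from the summary(Mod2) output
est <- 0.0453828    # slope estimate
se  <- 0.0002281    # standard error of the slope
df  <- 7383         # residual degrees of freedom

# t statistic for H_0: slope = 0
t_stat <- est / se
t_stat

# Two-sided p-value from the t reference distribution
p_val <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_val

# 95% confidence interval for the slope
ci <- est + c(-1, 1) * qt(0.975, df) * se
ci
```

The p-value is so small that it is below what R prints, which is why the summary reports it as `< 2e-16`.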
Mod2$coefficients
## (Intercept) CO2.Emissions.g.km.
## -0.39716731 0.04538281
plot(Mod2)
Question 13: (5 points) Should we trust the inference we made in the previous step? To assess this check the model diagnostics.
I am only about 75% confident that these assumptions hold, and since I have not encountered this before, I would treat the inference from this model with some caution.
Constant variance (homoscedasticity) is a problem. The variability in the residuals increases with the fitted values.
There is a defined shape in the residuals, rather than random scatter, which worries me with this model.
The residuals are right-skewed with a longer upper tail, so the normality assumption is doubtful.
It looks like there are quite a few influential outliers in this graph. I would want to check whether there are errors in the data.
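The four default diagnostic plots are easier to compare when they are drawn on one page. The sketch below shows one way to do that, using simulated data in which the spread deliberately grows with x (an assumption made so that the non-constant variance is visible).

```r
set.seed(3)

# Simulated data whose error spread increases with x
x <- runif(200, 4, 26)
y <- 5 + 23 * x + rnorm(200, 0, 0.5 * x)
fit <- lm(y ~ x)

# Arrange the four default diagnostic plots in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)  # restore the previous plotting layout
```

In the residuals-vs-fitted and scale-location panels, a funnel shape like the one simulated here is the usual sign that the constant variance assumption fails.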
Question 14: (5 points) Create a scatter plot to look at the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km) but now color the points based on the Fuel Type. Discuss your observations.
With fuel consumption on the x-axis and CO2 emissions on the y-axis, I notice that E has a shallower positive slope than the other fuel types, while Z and X have a similar slope and relationship. There is a strong linear relationship overall, but the points form distinct lines by fuel type rather than a single cloud, which makes me want to investigate the data more. Within each fuel type there is a strong correlation between the response variable and the explanatory variable.
ggplot(co2,aes(Fuel.Consumption.Comb..L.100.km.,CO2.Emissions.g.km.,color = Fuel.Type))+
geom_point()
Question 15: (5 points) Create a parallel lines model for CO2 Emissions (g/km) using Fuel Consumption Comb (L/100km) and Fuel Type. Write the equations for the estimated fitted models for the fuel types.
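y(hat) = b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + b_4 * x_4 + b_5 * x_5, where x_1 is Fuel Consumption Comb (L/100km) and x_2, x_3, x_4, x_5 are the 0/1 indicators for fuel types Z, D, E, N (X is the reference level). The estimates come from the summary(Mod4) output below.
X: y(hat) = b_0 + b_1*x_1 = 5.342 + 22.785*x_1
Z: y(hat) = (b_0 + b_2) + b_1*x_1 = 5.775 + 22.785*x_1
D: y(hat) = (b_0 + b_3) + b_1*x_1 = 36.233 + 22.785*x_1
E: y(hat) = (b_0 + b_4) + b_1*x_1 = -109.095 + 22.785*x_1
N: y(hat) = (b_0 + b_5) + b_1*x_1 = -76.370 + 22.785*x_1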
Mod4 <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km.+Fuel.Type,data =co2)
summary(Mod4)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. +
## Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.595 -2.760 0.045 2.234 44.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.34154 0.27768 19.236 < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 22.78507 0.02601 875.998 < 2e-16 ***
## Fuel.TypeZ 0.43328 0.13763 3.148 0.00165 **
## Fuel.TypeD 30.89114 0.42649 72.432 < 2e-16 ***
## Fuel.TypeE -114.43678 0.34782 -329.016 < 2e-16 ***
## Fuel.TypeN -81.71198 5.49603 -14.867 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.495 on 7379 degrees of freedom
## Multiple R-squared: 0.9912, Adjusted R-squared: 0.9912
## F-statistic: 1.66e+05 on 5 and 7379 DF, p-value: < 2.2e-16
Question 16: (5 points) Create a graphic for your parallel lines model showing the fitted models for each type of fuel. Comment on which shifts of intercept are significant.
Mod4$coefficients[1]
## (Intercept)
## 5.341542
ggplot(co2, aes(x=Fuel.Consumption.Comb..L.100.km., y=CO2.Emissions.g.km., color= Fuel.Type))+
geom_point()+
geom_abline(intercept = Mod4$coefficients[1], slope = Mod4$coefficients[2],
color="red", lwd=1)+
geom_abline(intercept = Mod4$coefficients[1]+Mod4$coefficients[3], slope=Mod4$coefficients[2],
color="yellow", lwd=1)+
geom_abline(intercept = Mod4$coefficients[1]+Mod4$coefficients[4], slope=Mod4$coefficients[2],
color="blue", lwd=1)+
geom_abline(intercept = Mod4$coefficients[1]+Mod4$coefficients[5], slope=Mod4$coefficients[2],
color="forestgreen", lwd=1)
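An alternative to one `geom_abline()` call per fuel type is to put the intercepts in a small data frame and map color to fuel type, so that all five lines are drawn and the line colors share one legend. The sketch below uses the coefficient values printed in the `summary(Mod4)` output; the axis limits are assumptions based on the ranges in the variable summaries, and the data frame name `lines_df` is made up.

```r
library(ggplot2)

# Coefficients copied from the summary(Mod4) output
b0    <- 5.34154
slope <- 22.78507
shift <- c(X = 0, Z = 0.43328, D = 30.89114, E = -114.43678, N = -81.71198)

# One row per fuel type: shared slope, shifted intercept
lines_df <- data.frame(
  Fuel.Type = factor(names(shift), levels = names(shift)),
  intercept = b0 + shift,
  slope     = slope
)

# Draw all five parallel lines with colors taken from Fuel.Type
p <- ggplot(lines_df) +
  geom_abline(aes(intercept = intercept, slope = slope, color = Fuel.Type)) +
  xlim(4, 26) + ylim(0, 600) +
  labs(x = "Fuel consumption comb (L/100 km)", y = "CO2 emissions (g/km)")
p
```

To overlay the lines on the data, the same `geom_abline(data = lines_df, ...)` layer could be added to the scatter plot of `co2`.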
##MODEL C - Interactions (Unrelated) Lines
Question 17: (5 points) Create an unrelated lines (interaction) model for CO2 Emissions (g/km) using Fuel Consumption Comb (L/100km) and Fuel Type. Write the equations for the estimated fitted models for the fuel types.
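y(hat) = b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + b_4 * x_4 + b_5 * x_5 + b_6 * x_1*x_2 + b_7 * x_1*x_3 + b_8 * x_1*x_4 + b_9 * x_1*x_5, where x_1 is Fuel Consumption Comb (L/100km) and x_2, x_3, x_4, x_5 are the 0/1 indicators for fuel types Z, D, E, N (X is the reference level). The estimates come from the summary(Mod5) output below.
X: y(hat) = b_0 + b_1*x_1 = 0.429 + 23.272*x_1
Z: y(hat) = (b_0 + b_2) + (b_1 + b_6)*x_1 = 0.613 + 23.237*x_1
D: y(hat) = (b_0 + b_3) + (b_1 + b_7)*x_1 = -0.117 + 26.899*x_1
E: y(hat) = (b_0 + b_4) + (b_1 + b_8)*x_1 = 4.675 + 16.038*x_1
N: y(hat) = (b_0 + b_5) + (b_1 + b_9)*x_1, but b_9 could not be estimated (NA in the output, apparently because N has only one observation), which leaves -82.557 + 23.272*x_1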
Mod5 <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km.*Fuel.Type,data =co2)
summary(Mod5)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. *
## Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.127 -2.607 0.659 1.886 25.251
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 0.42897 0.20641 2.078
## Fuel.Consumption.Comb..L.100.km. 23.27221 0.01988 1170.440
## Fuel.TypeZ 0.18413 0.32445 0.568
## Fuel.TypeD -0.54618 1.30980 -0.417
## Fuel.TypeE 4.24637 0.92786 4.577
## Fuel.TypeN -82.98605 2.95545 -28.079
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ -0.03526 0.02923 -1.206
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD 3.62697 0.14556 24.918
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE -7.23455 0.05649 -128.077
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN NA NA NA
## Pr(>|t|)
## (Intercept) 0.0377 *
## Fuel.Consumption.Comb..L.100.km. < 2e-16 ***
## Fuel.TypeZ 0.5704
## Fuel.TypeD 0.6767
## Fuel.TypeE 4.8e-06 ***
## Fuel.TypeN < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ 0.2279
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.955 on 7376 degrees of freedom
## Multiple R-squared: 0.9975, Adjusted R-squared: 0.9975
## F-statistic: 3.611e+05 on 8 and 7376 DF, p-value: < 2.2e-16
Question 18: (5 points) Create a graphic for your unrelated lines model showing the fitted models for each type of fuel. Comment on which shifts of intercept and/or slope are significant.
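Fuel.TypeE has a significant change in slope (-7.235, p < 2e-16), and so does Fuel.TypeD (+3.627, p < 2e-16), while the change in slope for Fuel.TypeZ is not significant (p = 0.2279). The shifts of intercept are significant for Fuel.TypeE (p = 4.8e-06) and Fuel.TypeN (p < 2e-16), but not for Fuel.TypeZ (p = 0.5704) or Fuel.TypeD (p = 0.6767). The change in slope for Fuel.TypeN could not be estimated (NA).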
Mod4$coefficients[1]
## (Intercept)
## 5.341542
ggplot(co2, aes(x=Fuel.Consumption.Comb..L.100.km., y=CO2.Emissions.g.km., color= Fuel.Type))+
geom_point()+
geom_abline(intercept = Mod5$coefficients[1], slope = Mod5$coefficients[2],
color="red", lwd=1)+
geom_abline(intercept = Mod5$coefficients[1]+Mod5$coefficients[3], slope=Mod5$coefficients[2]+Mod5$coefficients[7],
color="yellow", lwd=1)+
geom_abline(intercept = Mod5$coefficients[1]+Mod5$coefficients[4], slope=Mod5$coefficients[2]+Mod5$coefficients[8],
color="blue", lwd=1)+
geom_abline(intercept = Mod5$coefficients[1]+Mod5$coefficients[5], slope=Mod5$coefficients[2]+Mod5$coefficients[9],
color="forestgreen", lwd=1)
##Model Selection
Question 19: (10 points) Compare models from #12 (A) , #15 (B), and #17 (C) by reporting their Adjusted R-squared values. Which model would you pick? Make an argument based on an assessment of the model assumptions, simplicity, and interpretability as well as model fit.
summary(Mod2) #Adjusted R-squared: 0.8428
##
## Call:
## lm(formula = Fuel.Consumption.Comb..L.100.km. ~ CO2.Emissions.g.km.,
## data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3069 -0.3462 -0.1892 -0.0678 7.5272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3971673 0.0586935 -6.767 1.42e-11 ***
## CO2.Emissions.g.km. 0.0453828 0.0002281 198.968 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.147 on 7383 degrees of freedom
## Multiple R-squared: 0.8428, Adjusted R-squared: 0.8428
## F-statistic: 3.959e+04 on 1 and 7383 DF, p-value: < 2.2e-16
summary(Mod4) #Adjusted R-squared: 0.9912
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. +
## Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.595 -2.760 0.045 2.234 44.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.34154 0.27768 19.236 < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 22.78507 0.02601 875.998 < 2e-16 ***
## Fuel.TypeZ 0.43328 0.13763 3.148 0.00165 **
## Fuel.TypeD 30.89114 0.42649 72.432 < 2e-16 ***
## Fuel.TypeE -114.43678 0.34782 -329.016 < 2e-16 ***
## Fuel.TypeN -81.71198 5.49603 -14.867 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.495 on 7379 degrees of freedom
## Multiple R-squared: 0.9912, Adjusted R-squared: 0.9912
## F-statistic: 1.66e+05 on 5 and 7379 DF, p-value: < 2.2e-16
summary(Mod5) #Adjusted R-squared: 0.9975
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. *
## Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.127 -2.607 0.659 1.886 25.251
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 0.42897 0.20641 2.078
## Fuel.Consumption.Comb..L.100.km. 23.27221 0.01988 1170.440
## Fuel.TypeZ 0.18413 0.32445 0.568
## Fuel.TypeD -0.54618 1.30980 -0.417
## Fuel.TypeE 4.24637 0.92786 4.577
## Fuel.TypeN -82.98605 2.95545 -28.079
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ -0.03526 0.02923 -1.206
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD 3.62697 0.14556 24.918
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE -7.23455 0.05649 -128.077
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN NA NA NA
## Pr(>|t|)
## (Intercept) 0.0377 *
## Fuel.Consumption.Comb..L.100.km. < 2e-16 ***
## Fuel.TypeZ 0.5704
## Fuel.TypeD 0.6767
## Fuel.TypeE 4.8e-06 ***
## Fuel.TypeN < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ 0.2279
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.955 on 7376 degrees of freedom
## Multiple R-squared: 0.9975, Adjusted R-squared: 0.9975
## F-statistic: 3.611e+05 on 8 and 7376 DF, p-value: < 2.2e-16
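I would pick Model C. The Adjusted R-squared value increases from 0.8428 (Model A) to 0.9912 (Model B) to 0.9975 (Model C), and since Adjusted R-squared penalizes extra terms, this suggests the added terms genuinely improve the fit. Looking at the graphs, once I was able to adjust the slope of each line, the lines became more aligned with the data for each fuel type. You can also notice that the residual standard error decreased, from 5.495 in Model B to 2.955 in Model C. The less error, the better. The trade-off is that Model C has more terms, so it is less simple to interpret than Models A and B.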
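Rather than reading the Adjusted R-squared values off three separate summaries, they can be collected in one step, and the two nested models can be compared with an F-test. The sketch below does this on simulated data (three made-up fuel groups, one with a different slope); the object names `mod_a`, `mod_b`, and `mod_c` are hypothetical stand-ins for Models A, B, and C.

```r
set.seed(11)
n <- 150

# Simulated toy data: group "E" has a shallower slope than "X" and "Z"
toy <- data.frame(
  fuel = factor(rep(c("X", "Z", "E"), each = n / 3), levels = c("X", "Z", "E")),
  x    = runif(n, 4, 26)
)
true_slope <- c(X = 23, Z = 23, E = 16)[as.character(toy$fuel)]
toy$y <- 1 + true_slope * toy$x + rnorm(n, 0, 3)

mod_a <- lm(y ~ x, data = toy)          # single line
mod_b <- lm(y ~ x + fuel, data = toy)   # parallel lines
mod_c <- lm(y ~ x * fuel, data = toy)   # unrelated lines (interaction)

# Adjusted R-squared for all three models in one vector
adj_r2 <- sapply(list(A = mod_a, B = mod_b, C = mod_c),
                 function(m) summary(m)$adj.r.squared)
adj_r2

# Nested F-test: do the interaction terms improve on parallel lines?
anova(mod_b, mod_c)
```

The nested F-test gives a formal check that the extra slope terms are worth their added complexity, which complements the Adjusted R-squared comparison.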