Motivating Question: What variables are most influential in predicting carbon emissions of cars?

About the Data:

This data set captures how a vehicle's CO2 emissions vary with its features. The data were taken from the Government of Canada's official open data website and compiled into a single file covering a period of 7 years. There are 7385 rows and 12 columns.

Data Source/Citation:

The data has been taken and compiled from the below Canada Government official link: https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64#wb-auto-6

# load packages and read in the data set
library(tidyverse)
library(readr)
co2<-read.csv("https://raw.githubusercontent.com/kitadasmalley/MATH239/main/data/CO2_Emissions_Midterm.csv",
              sep=",", quote="\"",
              header = TRUE)

co2$Fuel.Type<-as.factor(co2$Fuel.Type)

Part 1 (25 points): Working with a Categorical Explanatory Variable

How does fuel type affect CO2 emissions?

Question 1:

(5 points) Fuel type should be a categorical variable. What are the levels and how many are there? Which group will R choose to be the reference group? Does this make sense in the context of these data? (Hint: contrasts)

There are 5 levels:

levels(co2$Fuel.Type)
## [1] "D" "E" "N" "X" "Z"

X = Regular gasoline

Z = Premium gasoline

D = Diesel

E = Ethanol (E85)

N = Natural gas

Using the contrasts function, I found that diesel (D) is the reference group, since it is the row with all zeros in the contrast matrix. By default, R orders factor levels alphabetically (or numerically), so D = Diesel comes first and becomes the reference level.

Question 2: (5 points) In your own words, describe how categorical variables with multiple levels are coded.

Since we cannot use character values directly in a regression model, we have to convert them into binary indicator variables, also called dummy variables. For a factor with k levels, R creates k - 1 dummy variables: each one is 1 when the observation belongs to that level and 0 otherwise. The reference level is the one where every dummy variable is 0. The contrast matrix shows this coding:

contrasts(co2$Fuel.Type)
##   E N X Z
## D 0 0 0 0
## E 1 0 0 0
## N 0 1 0 0
## X 0 0 1 0
## Z 0 0 0 1
# change the reference level to "X" (regular gasoline)
co2$Fuel.Type <- factor(co2$Fuel.Type,
                        levels = c("X", "Z", "D", "E", "N"))
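
As a small illustration of the dummy coding described above, here is a sketch on a toy factor. The vector `fuel` is made up for this example and is not the real `co2` data; only base R functions are used.

```r
# Toy factor with the same five fuel codes (hypothetical values, not the co2 data)
fuel <- factor(c("D", "E", "N", "X", "Z", "X", "Z"))

# Default ordering is alphabetical, so the first level is the reference
default_ref <- levels(fuel)[1]

# model.matrix() shows the indicator columns lm() would build:
# an intercept plus one 0/1 column per non-reference level
dummies <- model.matrix(~ fuel)
colnames(dummies)

# relevel() moves one level to the front so it becomes the reference
fuel_x <- relevel(fuel, ref = "X")
levels(fuel_x)
```

`relevel()` is an alternative to respecifying all levels with `factor(..., levels = ...)`; both change only which group the other groups are compared against.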

Question 4: (5 points) Create a side-by-side box plot to compare the distributions of CO2 Emissions (g/km) across fuel types. Feel free to use color! What are your observations?

ggplot(co2,aes(Fuel.Type,CO2.Emissions.g.km.,fill = Fuel.Type ))+
  geom_boxplot()

# N looks flat in the box plot, so I overlaid the raw points to check, since a box plot hides how many observations each group has.


ggplot(co2,aes(Fuel.Type,CO2.Emissions.g.km.,fill = Fuel.Type ))+
  geom_boxplot()+
  geom_jitter(color = "grey")

# I found that there is only one data point in N, so I concluded that there is not enough data on N to draw a box.

My observations are that, for the most part, the emissions within each fuel type are roughly normally distributed. I did notice a lot of outliers: X and Z have outliers toward the maximum values, and E has them in both the maximum and minimum directions. I am still working out how to deal with these outliers. They could be removed, but they might also hold information that we have not considered. I am quite surprised that E (ethanol) has high CO2 emissions, since I believe ethanol is a component of gasoline. I would want to look further into the fuel chemistry to see how much ethanol each fuel type contains. I also added jittered points to the plot to check how N (natural gas) was being displayed. There is only one data point there, which is why it shows as a flat line with no box.
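
A quick way to confirm why a group appears as a flat line is to count observations per level. The sketch below uses a made-up factor and a made-up emission value (213), not the real data.

```r
# Hypothetical fuel labels: several X and Z vehicles, a single N vehicle
fuel <- factor(c(rep("X", 5), rep("Z", 4), "N"))

# Number of observations in each group
counts <- table(fuel)
counts

# With one observation, all five box plot statistics collapse to that value,
# so the "box" has zero height
single_stats <- boxplot.stats(213)$stats
single_stats
```

On the real data, `table(co2$Fuel.Type)` would give the same kind of count.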

Question 5: (5 points) Perform the appropriate hypothesis test whether there is a significant difference in the average carbon emissions across fuels types.

Mod1 <- lm(CO2.Emissions.g.km.~Fuel.Type, data = co2 )

summary(Mod1)
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Type, data = co2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -147.092  -42.119   -8.043   35.881  255.957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 235.1193     0.9335 251.855   <2e-16 ***
## Fuel.TypeZ   30.9241     1.3643  22.666   <2e-16 ***
## Fuel.TypeD    2.4292     4.3571   0.558    0.577    
## Fuel.TypeE   39.9726     3.0722  13.011   <2e-16 ***
## Fuel.TypeN  -22.1193    56.3078  -0.393    0.694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.3 on 7380 degrees of freedom
## Multiple R-squared:  0.0747, Adjusted R-squared:  0.0742 
## F-statistic: 148.9 on 4 and 7380 DF,  p-value: < 2.2e-16
anova(Mod1)

Please state the following components:

Name of the hypothesis test

  • One-way ANOVA (F-test), since we are comparing the mean of a numeric response across more than two groups (five fuel types).

Hypotheses (null and alternative)

  • H_0: the mean carbon emissions are the same for all fuel types (mu_X = mu_Z = mu_D = mu_E = mu_N)

  • H_A: at least one fuel type has a different mean carbon emissions

Provide the test statistic

  • F = 148.95

Reference distribution

  • F distribution with 4 and 7380 degrees of freedom (56.3 is the residual standard error, not the reference distribution)

P-value

  • < 2.2e-16

Then communicate your findings in sentence form (5 part summary).

  • Based on the one-way ANOVA (F = 148.95 on 4 and 7380 degrees of freedom), we reject H_0 (null) that the mean carbon emissions are the same across fuel types, since the p-value (< 2.2e-16) is below the 0.05 significance level. There is convincing evidence for H_A (alternative): at least one fuel type has a different mean CO2 emissions, so fuel type is associated with CO2 emissions.
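
To show where the p-value comes from, the sketch below plugs the F statistic and degrees of freedom printed in the summary output into the F distribution. Only the numbers are taken from the output; the variable names are my own.

```r
# Values read from the summary(Mod1) output
f_stat <- 148.9
df_num <- 4      # number of fuel types minus 1
df_den <- 7380   # residual degrees of freedom

# Upper-tail area of the F(4, 7380) distribution beyond the observed statistic
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

# Critical value at the 0.05 level, for comparison with the statistic
f_crit <- qf(0.95, df_num, df_den)

p_value < 0.05
f_stat > f_crit
```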

Part 2 (20 points): Properties of Linear Models

Question 6: (5 points) Please write the form of a linear model and annotate the following components:

Question 7: (5 points) List the assumptions for fitting a linear model. Hint: There are five that we talked about in class

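1.) Linearity: a linear relationship between x and y

2.) Mean zero: the errors average to zero

3.) Constant spread: the errors have constant variance (homoscedasticity)

4.) Independence: the errors are independent of each other

5.) Normality: the errors are normally distributed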
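
One compact way to write a simple linear model together with the five assumptions above is sketched here in standard textbook notation. The symbols are my own choice and are not taken from the assignment.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\quad i = 1, \dots, n
```

Here $\beta_0 + \beta_1 x_i$ is the linear mean (assumption 1), the errors $\varepsilon_i$ are centered at zero (2) with one common variance $\sigma^2$ (3), "iid" covers independence (4), and $N$ is the normal distribution (5).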

Question 8: (5 points) Create a pairs plot for the numeric variables in these dataset. What relationships do you observe between the potential explanatory variables and the response (carbon emissions)? Hint: In order to accomplish this, you might want to use the select function first for the numeric variables. Then call the pairs() function.

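To me it looks like most of the potential explanatory variables have a strong, positive relationship with carbon emissions. A few relationships are curved rather than linear (for example, Fuel.Consumption.Comb..mpg., which also trends downward), and I would not use those variables in a linear model as they are.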

Num_co2 <- co2 %>%
  select(Engine.Size.L.,
         Cylinders,
         Fuel.Consumption.City..L.100.km.,
         Fuel.Consumption.Hwy..L.100.km.,
         Fuel.Consumption.Comb..L.100.km.,
         Fuel.Consumption.Comb..mpg.,
         CO2.Emissions.g.km.)

summary(Num_co2)
##  Engine.Size.L.   Cylinders      Fuel.Consumption.City..L.100.km.
##  Min.   :0.90   Min.   : 3.000   Min.   : 4.20                   
##  1st Qu.:2.00   1st Qu.: 4.000   1st Qu.:10.10                   
##  Median :3.00   Median : 6.000   Median :12.10                   
##  Mean   :3.16   Mean   : 5.615   Mean   :12.56                   
##  3rd Qu.:3.70   3rd Qu.: 6.000   3rd Qu.:14.60                   
##  Max.   :8.40   Max.   :16.000   Max.   :30.60                   
##  Fuel.Consumption.Hwy..L.100.km. Fuel.Consumption.Comb..L.100.km.
##  Min.   : 4.000                  Min.   : 4.10                   
##  1st Qu.: 7.500                  1st Qu.: 8.90                   
##  Median : 8.700                  Median :10.60                   
##  Mean   : 9.042                  Mean   :10.98                   
##  3rd Qu.:10.200                  3rd Qu.:12.60                   
##  Max.   :20.600                  Max.   :26.10                   
##  Fuel.Consumption.Comb..mpg. CO2.Emissions.g.km.
##  Min.   :11.00               Min.   : 96.0      
##  1st Qu.:22.00               1st Qu.:208.0      
##  Median :27.00               Median :246.0      
##  Mean   :27.48               Mean   :250.6      
##  3rd Qu.:32.00               3rd Qu.:288.0      
##  Max.   :69.00               Max.   :522.0
pairs(Num_co2)

Question 9: (5 points) In this dataset we have both Fuel Consumption Comb (L/100km) and Fuel Consumption Comb (mpg). Look at the pairs plot for the relationships between these variables and the response, CO2 Emissions (g/km). Which one of these would you want to include as an explanatory variable in your model? Why? Hint: Think about the model assumptions from Question 7

Num2_co2 <- co2 %>%
select(Fuel.Consumption.Comb..L.100.km.,Fuel.Consumption.Comb..mpg.,CO2.Emissions.g.km.)

summary(Num2_co2)
##  Fuel.Consumption.Comb..L.100.km. Fuel.Consumption.Comb..mpg.
##  Min.   : 4.10                    Min.   :11.00              
##  1st Qu.: 8.90                    1st Qu.:22.00              
##  Median :10.60                    Median :27.00              
##  Mean   :10.98                    Mean   :27.48              
##  3rd Qu.:12.60                    3rd Qu.:32.00              
##  Max.   :26.10                    Max.   :69.00              
##  CO2.Emissions.g.km.
##  Min.   : 96.0      
##  1st Qu.:208.0      
##  Median :246.0      
##  Mean   :250.6      
##  3rd Qu.:288.0      
##  Max.   :522.0
pairs(Num2_co2)

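We can see from the graph that the relationship between Fuel.Consumption.Comb..mpg. and CO2 emissions is negative and nonlinear (curved), which violates the linearity assumption. The relationship between Fuel.Consumption.Comb..L.100.km. and CO2 emissions looks linear, so I would use Fuel.Consumption.Comb..L.100.km. rather than Fuel.Consumption.Comb..mpg. as my explanatory variable for a linear regression model.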
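
The curvature has a simple explanation: mpg is a reciprocal of L/100km. The sketch below assumes the mpg column uses imperial gallons (1 gallon = 4.54609 L), which gives the constant 282.481. That is my inference, not something stated in the data description, but applying it to the L/100km quartiles printed above reproduces the mpg quartiles.

```r
# Min, Q1, median, Q3, max of Fuel.Consumption.Comb..L.100.km. from summary()
l_per_100km <- c(4.1, 8.9, 10.6, 12.6, 26.1)

# Assumed conversion: miles per imperial gallon = 282.481 / (L per 100 km)
mpg_imperial <- 282.481 / l_per_100km
round(mpg_imperial)

# A reciprocal is curved: equal steps in L/100km give unequal steps in mpg
diff(282.481 / c(5, 10, 15, 20))
```

Because CO2 is close to linear in L/100km, it cannot also be linear in a reciprocal of L/100km.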

Extra Credit: (5 points) In the data description, it says that Fuel Consumption Comb (L/100km) is a combined rating (55% city, 45% hwy). Why would it not be appropriate to include Fuel Consumption City (L/100km), Fuel Consumption Hwy (L/100km), and Fuel Consumption Comb (L/100km) in the same model? How is this related to VIF?

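Fuel Consumption Comb is almost exactly a linear combination of the city and highway ratings (about 55% city and 45% highway), so the three variables are highly collinear. Regressing Comb on City and Hwy gives coefficients of about 0.55 and 0.45 with very small p-values and an R-squared of 0.9998, which corresponds to a VIF of roughly 1/(1 - 0.9998) = 5000, far above the usual rule-of-thumb cutoffs of 5 or 10. Including all three in the same model would inflate the standard errors of their coefficients and make them hard to interpret. The vif function was not working for me (it comes from the car package, which is not loaded above), so I show the regression summary instead.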

Mod3 <- lm(Fuel.Consumption.Comb..L.100.km.~Fuel.Consumption.City..L.100.km.+Fuel.Consumption.Hwy..L.100.km., data = co2)
summary(Mod3)
## 
## Call:
## lm(formula = Fuel.Consumption.Comb..L.100.km. ~ Fuel.Consumption.City..L.100.km. + 
##     Fuel.Consumption.Hwy..L.100.km., data = co2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41554 -0.02527  0.00026  0.02709  0.50961 
## 
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                      -0.0018983  0.0021136   -0.898    0.369    
## Fuel.Consumption.City..L.100.km.  0.5496934  0.0004452 1234.838   <2e-16 ***
## Fuel.Consumption.Hwy..L.100.km.   0.4506590  0.0007005  643.368   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04254 on 7382 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998 
## F-statistic: 1.706e+07 on 2 and 7382 DF,  p-value: < 2.2e-16
# vif(Mod3)  # vif() is from the car package, which is not loaded above
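
The VIF can also be computed by hand as 1 / (1 - R-squared), where R-squared comes from regressing one explanatory variable on the others. The sketch below first uses the R-squared printed above, then repeats the idea on simulated toy data. The variables `city`, `hwy`, and `comb` are made up and only mimic the 55/45 structure.

```r
# VIF for Comb given City and Hwy, using the R-squared from summary(Mod3)
r2_printed <- 0.9998
vif_printed <- 1 / (1 - r2_printed)
vif_printed

# Same calculation on simulated toy data (not the real co2 data)
set.seed(1)
city <- runif(50, 5, 20)
hwy  <- 0.7 * city + rnorm(50, sd = 0.5)
comb <- 0.55 * city + 0.45 * hwy + rnorm(50, sd = 0.05)

# Regress one predictor on the others and convert R-squared to a VIF
r2_toy  <- summary(lm(comb ~ city + hwy))$r.squared
vif_toy <- 1 / (1 - r2_toy)
vif_toy
```

Because R-squared is rounded in the printed output, 5000 is only an approximate value.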

Part 3 (55 points): Fitting Models

Model A: Simple Linear Regression (SLR)

Question 10: (5 points) Create a scatter plot to look at the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km). Describe this scatter plot. Be sure to talk about the four characteristics of a scatter plot.

ggplot(co2,aes(CO2.Emissions.g.km.,Fuel.Consumption.Comb..L.100.km.,color = CO2.Emissions.g.km.))+
  geom_point()

There is a strong, positive, linear relationship. My only worry is that the points are not randomly scattered: they fall along a few distinct lines. There might be some outliers on the bottom left.

Question 11: (5 points) Create a simple linear model for the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km). Write the equation for the estimated fitted model.

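y(hat) = -0.3971673 + 0.0453828x, where y(hat) is the predicted Fuel Consumption Comb (L/100km) and x is CO2 Emissions (g/km). Note that, as fit below, the model has fuel consumption as the response and CO2 emissions as the explanatory variable.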

Mod2 <- lm(Fuel.Consumption.Comb..L.100.km.~CO2.Emissions.g.km., data = co2)
summary(Mod2) 
## 
## Call:
## lm(formula = Fuel.Consumption.Comb..L.100.km. ~ CO2.Emissions.g.km., 
##     data = co2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3069 -0.3462 -0.1892 -0.0678  7.5272 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.3971673  0.0586935  -6.767 1.42e-11 ***
## CO2.Emissions.g.km.  0.0453828  0.0002281 198.968  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.147 on 7383 degrees of freedom
## Multiple R-squared:  0.8428, Adjusted R-squared:  0.8428 
## F-statistic: 3.959e+04 on 1 and 7383 DF,  p-value: < 2.2e-16
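
The formula above puts fuel consumption on the left, so it predicts fuel consumption from CO2. If CO2 is meant to be the response, the two sides of the formula swap. The sketch below shows both directions on simulated toy data; the columns `fuel_comb` and `co2_gkm` and the slope of 23 are made up for illustration.

```r
# Simulated toy data (not the real co2 data)
set.seed(239)
toy <- data.frame(fuel_comb = runif(40, 4, 26))
toy$co2_gkm <- 23 * toy$fuel_comb + rnorm(40, sd = 5)

# Response on the left of ~, explanatory variable on the right
mod_co2_response <- lm(co2_gkm ~ fuel_comb, data = toy)
mod_reversed     <- lm(fuel_comb ~ co2_gkm, data = toy)

coef(mod_co2_response)
coef(mod_reversed)

# In simple linear regression R-squared is the squared correlation,
# so it is the same in both directions; the coefficients are not
r2_forward <- summary(mod_co2_response)$r.squared
r2_reverse <- summary(mod_reversed)$r.squared
all.equal(r2_forward, r2_reverse)
```

The residual standard error, unlike R-squared, is in the units of the response, so it differs between the two directions.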

Question 12: (5 points) Perform a hypothesis test for the slope. State the reference distribution, degrees of freedom, the test statistic, and p-value in the form of a five-part conclusion in the context of the problem.

The hypotheses are H_0: the slope is zero (no linear relationship) and H_A: the slope is not zero. The reference distribution is a t distribution with 7383 degrees of freedom. The test statistic is t = 198.968 and the p-value is < 2e-16. Since the p-value is below the 0.05 significance level, we reject the null hypothesis. There is convincing evidence of a positive linear relationship between Fuel.Consumption.Comb..L.100.km. and CO2 emissions: each additional 1 g/km of CO2 is associated with an estimated 0.045 L/100km increase in combined fuel consumption.
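
The test statistic for the slope is the estimate divided by its standard error, compared against a t distribution. The sketch below redoes that arithmetic from the values printed in summary(Mod2); the variable names are my own.

```r
# Values read from the coefficient table of summary(Mod2)
estimate  <- 0.0453828
std_error <- 0.0002281
df_resid  <- 7383

# t statistic: how many standard errors the estimate is from zero
t_stat <- estimate / std_error
t_stat

# Two-sided p-value from the t distribution with 7383 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df_resid, lower.tail = FALSE)
p_value < 0.05
```

The small difference from the printed 198.968 comes from rounding in the displayed estimate and standard error.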

Mod2$coefficients
##         (Intercept) CO2.Emissions.g.km. 
##         -0.39716731          0.04538281
plot(Mod2)

Question 13: (5 points) Should we trust the inference we made in the previous step? To assess this check the model diagnostics.

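I am only about 75% sure of my reading of these diagnostics. Since I have not encountered these patterns before, I would tentatively assume that this model is okay, but the plots raise the concerns noted below.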

  • Create a residual plot - comment on mean zero assumption and homoscedasticity.

Heteroscedasticity is a problem: the spread of the residuals increases with the fitted values, so the constant spread assumption does not hold.

There is also a defined shape in the residuals, which worries me with this model.

  • Create a qq plot - comment on normality

The residuals are right-skewed, with a longer upper tail.

  • Create a leverage plot - comment on the presence of influential outliers

It looks like there are quite a few influential outliers in this graph. I would want to check whether there are errors in the data.
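
The three diagnostic plots can be requested one at a time through the `which` argument of `plot()` for an lm object. The sketch below uses simulated toy data whose spread grows with x, so the funnel shape is built in. The 4/n cutoff for Cook's distance is a common rule of thumb that I am assuming here, not one taken from the course.

```r
# Simulated toy data with non-constant spread (not the real co2 data)
set.seed(13)
x <- runif(100, 4, 26)
y <- 23 * x + rnorm(100, sd = 0.5 * x)
mod <- lm(y ~ x)

# pdf(NULL) sends the plots to a null device so no file is written;
# drop these two device lines when working interactively
pdf(NULL)
plot(mod, which = 1)  # residuals vs fitted: mean zero and constant spread
plot(mod, which = 2)  # normal QQ plot: normality
plot(mod, which = 5)  # residuals vs leverage: influential points
invisible(dev.off())

# Numeric companion to the leverage plot
cooks   <- cooks.distance(mod)
flagged <- which(cooks > 4 / length(cooks))
length(flagged)
```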

MODEL B - Parallel Lines

Question 14: (5 points) Create a scatter plot to look at the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km) but now color the points based on the Fuel Type. Discuss your observations.

I notice that E has a steeper positive slope than the other fuel types in this plot (which has fuel consumption on the y-axis), and that Z and X have a similar slope and relationship. There is a strong linear relationship between the two variables within each fuel type. However, the points fall along distinct lines rather than a random cloud, which makes me want to investigate the data more.

ggplot(co2,aes(CO2.Emissions.g.km.,Fuel.Consumption.Comb..L.100.km.,color = as.factor(Fuel.Type)))+
  geom_point()

Question 15: (5 points) Create a parallel lines model for CO2 Emissions (g/km) using Fuel Consumption Comb (L/100km) and Fuel Type. Write the equations for the estimated fitted models for the fuel types.

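y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3 + b_4*x_4 + b_5*x_5, where y is CO2 Emissions (g/km), x_1 is Fuel Consumption Comb (L/100km), and x_2 to x_5 are the indicator variables for fuel types Z, D, E, and N.

X, y = b_0 + b_1*x_1 = 5.34 + 22.79*x_1

Z, y = (b_0 + b_2) + b_1*x_1 = 5.77 + 22.79*x_1

D, y = (b_0 + b_3) + b_1*x_1 = 36.23 + 22.79*x_1

E, y = (b_0 + b_4) + b_1*x_1 = -109.10 + 22.79*x_1

N, y = (b_0 + b_5) + b_1*x_1 = -76.37 + 22.79*x_1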

Mod4 <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km.+Fuel.Type,data =co2)
summary(Mod4)
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. + 
##     Fuel.Type, data = co2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.595  -2.760   0.045   2.234  44.852 
## 
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                         5.34154    0.27768   19.236  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.   22.78507    0.02601  875.998  < 2e-16 ***
## Fuel.TypeZ                          0.43328    0.13763    3.148  0.00165 ** 
## Fuel.TypeD                         30.89114    0.42649   72.432  < 2e-16 ***
## Fuel.TypeE                       -114.43678    0.34782 -329.016  < 2e-16 ***
## Fuel.TypeN                        -81.71198    5.49603  -14.867  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.495 on 7379 degrees of freedom
## Multiple R-squared:  0.9912, Adjusted R-squared:  0.9912 
## F-statistic: 1.66e+05 on 5 and 7379 DF,  p-value: < 2.2e-16
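
To make the parallel lines concrete, the sketch below combines the printed coefficients into one intercept per fuel type and predicts CO2 at a combined consumption of 10 L/100km. The coefficient values are copied from the output above; the 10 L/100km value is an arbitrary example.

```r
# Coefficients copied from summary(Mod4)
b0    <- 5.34154     # intercept for the reference fuel type X
b1    <- 22.78507    # common slope for Fuel Consumption Comb
shift <- c(X = 0, Z = 0.43328, D = 30.89114, E = -114.43678, N = -81.71198)

# One intercept per fuel type; the slope is shared
intercepts <- b0 + shift
round(intercepts, 2)

# Predicted CO2 (g/km) at 10 L/100km for each fuel type
pred_at_10 <- intercepts + b1 * 10
round(pred_at_10, 1)
```

With `predict(Mod4, newdata = ...)` R would do the same arithmetic from the fitted model object.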

Question 16: (5 points) Create a graphic for your parallel lines model showing the fitted models for each type of fuel. Comment on which shifts of intercept are significant.

Mod4$coefficients[1]
## (Intercept) 
##    5.341542
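# Coefficient positions in Mod4: 1 = intercept (X), 2 = slope, 3 = Z, 4 = D, 5 = E, 6 = N
ggplot(co2, aes(x=Fuel.Consumption.Comb..L.100.km., y=CO2.Emissions.g.km., color= Fuel.Type))+
  geom_point()+
  # X (reference group): baseline intercept
  geom_abline(intercept = Mod4$coefficients[1], slope = Mod4$coefficients[2],
              color="red", lwd=1)+
  # Z: baseline intercept plus the Z shift
  geom_abline(intercept = Mod4$coefficients[1]+Mod4$coefficients[3], slope=Mod4$coefficients[2],
              color="yellow", lwd=1)+
  # D: baseline intercept plus the D shift
  geom_abline(intercept = Mod4$coefficients[1]+Mod4$coefficients[4], slope=Mod4$coefficients[2],
              color="blue", lwd=1)+
  # E: baseline intercept plus the E shift (no line is drawn for N, coefficient 6)
  geom_abline(intercept = Mod4$coefficients[1]+Mod4$coefficients[5], slope=Mod4$coefficients[2],
              color="forestgreen", lwd=1)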
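
One way to judge which intercept shifts are significant is to build 95% confidence intervals from the printed estimates and standard errors: a shift is significant at the 0.05 level when its interval excludes zero. This is a sketch using values copied from summary(Mod4); the variable names are my own.

```r
# Intercept shifts relative to X and their standard errors, from summary(Mod4)
est <- c(Z = 0.43328, D = 30.89114, E = -114.43678, N = -81.71198)
se  <- c(Z = 0.13763, D = 0.42649,  E = 0.34782,    N = 5.49603)

# Critical t value for a 95% interval with 7379 residual degrees of freedom
t_crit <- qt(0.975, df = 7379)

# Interval = estimate plus or minus t_crit standard errors
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 2)

# TRUE where the interval does not contain zero
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0
excludes_zero
```

On the fitted model itself, `confint(Mod4)` gives the same kind of intervals directly.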

MODEL C - Interactions (Unrelated) Lines

Question 17: (5 points) Create an unrelated lines (interaction) model for CO2 Emissions (g/km) using Fuel Consumption Comb (L/100km) and Fuel Type. Write the equations for the estimated fitted models for the fuel types.

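y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3 + b_4*x_4 + b_5*x_5 + b_6*x_1*x_2 + b_7*x_1*x_3 + b_8*x_1*x_4 + b_9*x_1*x_5, where y is CO2 Emissions (g/km), x_1 is Fuel Consumption Comb (L/100km), and x_2 to x_5 are the indicator variables for fuel types Z, D, E, and N.

X, y = b_0 + b_1*x_1 = 0.43 + 23.27*x_1

Z, y = (b_0 + b_2) + (b_1 + b_6)*x_1 = 0.61 + 23.24*x_1

D, y = (b_0 + b_3) + (b_1 + b_7)*x_1 = -0.12 + 26.90*x_1

E, y = (b_0 + b_4) + (b_1 + b_8)*x_1 = 4.68 + 16.04*x_1

N, y = (b_0 + b_5) + (b_1 + b_9)*x_1, where b_9 could not be estimated (NA) because N has only one observation; R drops that term, which leaves y = -82.56 + 23.27*x_1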

Mod5 <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km.*Fuel.Type,data =co2)
summary(Mod5)
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. * 
##     Fuel.Type, data = co2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.127  -2.607   0.659   1.886  25.251 
## 
## Coefficients: (1 not defined because of singularities)
##                                              Estimate Std. Error  t value
## (Intercept)                                   0.42897    0.20641    2.078
## Fuel.Consumption.Comb..L.100.km.             23.27221    0.01988 1170.440
## Fuel.TypeZ                                    0.18413    0.32445    0.568
## Fuel.TypeD                                   -0.54618    1.30980   -0.417
## Fuel.TypeE                                    4.24637    0.92786    4.577
## Fuel.TypeN                                  -82.98605    2.95545  -28.079
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ  -0.03526    0.02923   -1.206
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD   3.62697    0.14556   24.918
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE  -7.23455    0.05649 -128.077
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN        NA         NA       NA
##                                             Pr(>|t|)    
## (Intercept)                                   0.0377 *  
## Fuel.Consumption.Comb..L.100.km.             < 2e-16 ***
## Fuel.TypeZ                                    0.5704    
## Fuel.TypeD                                    0.6767    
## Fuel.TypeE                                   4.8e-06 ***
## Fuel.TypeN                                   < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ   0.2279    
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.955 on 7376 degrees of freedom
## Multiple R-squared:  0.9975, Adjusted R-squared:  0.9975 
## F-statistic: 3.611e+05 on 8 and 7376 DF,  p-value: < 2.2e-16
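
In the interaction model each fuel type gets its own intercept and its own slope. The sketch below assembles them from the printed coefficients, keeping the undefined N interaction as NA rather than guessing a value. The data frame `lines_c` is a name I made up for this example.

```r
# Coefficients copied from summary(Mod5)
b0 <- 0.42897    # intercept for the reference fuel type X
b1 <- 23.27221   # slope for the reference fuel type X

# Intercept shifts and slope changes relative to X; N's slope change is NA
shift        <- c(X = 0, Z = 0.18413,  D = -0.54618, E = 4.24637,  N = -82.98605)
slope_change <- c(X = 0, Z = -0.03526, D = 3.62697,  E = -7.23455, N = NA)

# One row per fuel type with its own intercept and slope
lines_c <- data.frame(
  fuel      = names(shift),
  intercept = unname(b0 + shift),
  slope     = unname(b1 + slope_change)
)
lines_c
```

The NA for N is what "1 not defined because of singularities" refers to: a single observation cannot determine both an intercept and a slope.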

Question 18: (5 points) Create a graphic for your unrelated lines model showing the fitted models for each type of fuel. Comment on which shifts of intercept and/or slope are significant.

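The slope changes for D (+3.63) and E (-7.23) are significant (p < 2e-16), while the slope change for Z is not (p = 0.2279). The intercept shifts for E (p = 4.8e-06) and N (p < 2e-16) are significant, while those for Z (p = 0.5704) and D (p = 0.6767) are not. The slope change for N could not be estimated (NA), since N has only one data point.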

Mod4$coefficients[1]
## (Intercept) 
##    5.341542
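# Coefficient positions in Mod5: 1 = intercept (X), 2 = slope (X), 3-6 = intercept shifts for Z, D, E, N,
# 7-9 = slope changes for Z, D, E (10, the slope change for N, is NA)
ggplot(co2, aes(x=Fuel.Consumption.Comb..L.100.km., y=CO2.Emissions.g.km., color= Fuel.Type))+
  geom_point()+
  # X (reference group): baseline intercept and slope
  geom_abline(intercept = Mod5$coefficients[1], slope = Mod5$coefficients[2],
              color="red", lwd=1)+
  # Z: shifted intercept and changed slope
  geom_abline(intercept = Mod5$coefficients[1]+Mod5$coefficients[3], slope=Mod5$coefficients[2]+Mod5$coefficients[7],
              color="yellow", lwd=1)+
  # D: shifted intercept and changed slope
  geom_abline(intercept = Mod5$coefficients[1]+Mod5$coefficients[4], slope=Mod5$coefficients[2]+Mod5$coefficients[8],
              color="blue", lwd=1)+
  # E: shifted intercept and changed slope (no line is drawn for N)
  geom_abline(intercept = Mod5$coefficients[1]+Mod5$coefficients[5], slope=Mod5$coefficients[2]+Mod5$coefficients[9],
              color="forestgreen", lwd=1)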

Model Selection

Question 19: (10 points) Compare models from #12 (A) , #15 (B), and #17 (C) by reporting their Adjusted R-squared values. Which model would you pick? Make an argument based on an assessment of the model assumptions, simplicity, and interpretability as well as model fit.

summary(Mod2) #Adjusted R-squared:  0.8428
## 
## Call:
## lm(formula = Fuel.Consumption.Comb..L.100.km. ~ CO2.Emissions.g.km., 
##     data = co2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3069 -0.3462 -0.1892 -0.0678  7.5272 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.3971673  0.0586935  -6.767 1.42e-11 ***
## CO2.Emissions.g.km.  0.0453828  0.0002281 198.968  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.147 on 7383 degrees of freedom
## Multiple R-squared:  0.8428, Adjusted R-squared:  0.8428 
## F-statistic: 3.959e+04 on 1 and 7383 DF,  p-value: < 2.2e-16
summary(Mod4) #Adjusted R-squared:  0.9912
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. + 
##     Fuel.Type, data = co2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.595  -2.760   0.045   2.234  44.852 
## 
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                         5.34154    0.27768   19.236  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.   22.78507    0.02601  875.998  < 2e-16 ***
## Fuel.TypeZ                          0.43328    0.13763    3.148  0.00165 ** 
## Fuel.TypeD                         30.89114    0.42649   72.432  < 2e-16 ***
## Fuel.TypeE                       -114.43678    0.34782 -329.016  < 2e-16 ***
## Fuel.TypeN                        -81.71198    5.49603  -14.867  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.495 on 7379 degrees of freedom
## Multiple R-squared:  0.9912, Adjusted R-squared:  0.9912 
## F-statistic: 1.66e+05 on 5 and 7379 DF,  p-value: < 2.2e-16
summary(Mod5) #Adjusted R-squared:  0.9975 
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. * 
##     Fuel.Type, data = co2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.127  -2.607   0.659   1.886  25.251 
## 
## Coefficients: (1 not defined because of singularities)
##                                              Estimate Std. Error  t value
## (Intercept)                                   0.42897    0.20641    2.078
## Fuel.Consumption.Comb..L.100.km.             23.27221    0.01988 1170.440
## Fuel.TypeZ                                    0.18413    0.32445    0.568
## Fuel.TypeD                                   -0.54618    1.30980   -0.417
## Fuel.TypeE                                    4.24637    0.92786    4.577
## Fuel.TypeN                                  -82.98605    2.95545  -28.079
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ  -0.03526    0.02923   -1.206
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD   3.62697    0.14556   24.918
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE  -7.23455    0.05649 -128.077
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN        NA         NA       NA
##                                             Pr(>|t|)    
## (Intercept)                                   0.0377 *  
## Fuel.Consumption.Comb..L.100.km.             < 2e-16 ***
## Fuel.TypeZ                                    0.5704    
## Fuel.TypeD                                    0.6767    
## Fuel.TypeE                                   4.8e-06 ***
## Fuel.TypeN                                   < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ   0.2279    
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.955 on 7376 degrees of freedom
## Multiple R-squared:  0.9975, Adjusted R-squared:  0.9975 
## F-statistic: 3.611e+05 on 8 and 7376 DF,  p-value: < 2.2e-16

I would pick Model C, since it has the highest adjusted R-squared (0.9975, compared with 0.9912 for Model B and 0.8428 for Model A). Looking at the graphs, once the slope of each line was allowed to vary, the lines aligned more closely with the data for each fuel type. The residual standard error also decreased, from 5.495 for Model B to 2.955 for Model C. The smaller the error, the better.
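
Besides comparing adjusted R-squared, nested models such as B and C can be compared with a partial F-test through `anova()`. The sketch below does both on simulated toy data with three made-up groups whose true slopes differ; none of the numbers come from the real co2 data.

```r
# Simulated toy data: three groups with different true slopes (hypothetical)
set.seed(19)
n <- 120
toy <- data.frame(
  x = runif(n, 4, 20),
  g = factor(rep(c("X", "D", "E"), each = n / 3), levels = c("X", "D", "E"))
)
true_slope <- c(X = 23, D = 27, E = 16)[as.character(toy$g)]
toy$y <- unname(true_slope) * toy$x + rnorm(n, sd = 3)

mod_a <- lm(y ~ x, data = toy)       # single line
mod_b <- lm(y ~ x + g, data = toy)   # parallel lines
mod_c <- lm(y ~ x * g, data = toy)   # unrelated lines

# Adjusted R-squared for each model
adj_r2 <- sapply(list(A = mod_a, B = mod_b, C = mod_c),
                 function(m) summary(m)$adj.r.squared)
adj_r2

# Partial F-test: do the extra slope terms in C improve on B?
cmp <- anova(mod_b, mod_c)
cmp
```

A small p-value in the second row of `cmp` suggests the separate slopes are worth the extra complexity.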