Part 1: Properties of linear models

Question 1: Please write the form of a linear model and annotate the following components: explanatory variables (x_i), response variable (y), parameters (β_0, β_1, …), and error term (ε_i).

Answer 1:

y_i = β_0 + β_1*x_i + ε_i. In a linear model, the response variable (y_i) equals the y-intercept (β_0, the first parameter) plus the slope (β_1, the second parameter) times the explanatory variable (x_i), plus the error term (ε_i).

Question 2: List the five assumptions for fitting a linear model.

Answer 2:

Most of our assumptions have to do with how error is handled.

We assume that the errors are random with a mean of zero, so that they fall evenly above and below the regression line. We also assume that the errors have constant spread across x_i, so that one end of x_i doesn’t have more variation than the other, or vice versa. We assume that errors are normally distributed, so that larger errors are less likely to occur than smaller errors. Finally, we assume that errors are independent of each other and that there isn’t an unaccounted-for variable that could explain the variance.

Most important of all, we assume that in the population our two variables actually have a linear relationship.

Question 3: Create a pairs plot for the numeric variables in this dataset. What relationships do you observe between the potential explanatory variables and the response (carbon emissions)?

Answer 3:

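library(dplyr)    # provides %>% and select_if()
library(ggplot2)  # used for the plots in later answers

co2 <- read.table("https://raw.githubusercontent.com/kitadasmalley/sp21_MATH239/main/co2.csv", sep=",", quote="\"", header=TRUE)

# Keep only the numeric columns, then draw the pairs plot
co2numeric <- co2 %>%
  select_if(is.numeric)
pairs(co2numeric)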

From the above chart it looks like there’s a linear relationship between engine size and CO2 per km, and a linear relationship between number of cylinders and CO2 per km. Also, not surprisingly, there’s a linear relationship between L of fuel consumed per 100km and CO2 per km. Finally, there appears to be a curved (non-linear) relationship between mi per gal and CO2 per km.

Question 4: In this data set we have both Fuel Consumption Comb (L/100km) and Fuel Consumption Comb (mpg). Look at the pairs plot for the relationships between these variables and the response. Which one of these would you want to include as an explanatory variable in your model? Why?

Answer 4:

I would want to include Fuel Consumption Comb (L/100km) over Fuel Consumption Comb (mpg) for two reasons. First, I want to make sure that both of my variables are in metric units. Secondly, in my pairs plot I can see that there is a non-linear relationship between Fuel Consumption Comb (mpg) and co2 per km.

Extra Credit: In the data description, it says that Fuel Consumption Comb (L/100km) is a combined rating (55% city, 45% hwy). Why would it not be appropriate to include Fuel Consumption City (L/100km), Fuel Consumption Hwy (L/100km), and Fuel Consumption Comb (L/100km) in the same model? How is this related to VIF?

Extra Credit Answer:

You would not want to include these variables together because they are almost perfectly collinear. The Variance Inflation Factor (VIF) measures how much the variance of a coefficient estimate is inflated by correlation among the explanatory variables. Since Fuel Consumption Comb is a weighted average of the other two variables, you’d essentially be accounting for Fuel Consumption City and Fuel Consumption Hwy twice. This would produce very large VIFs, which means inflated standard errors and unstable coefficient estimates that are hard to interpret.
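To make the link to VIF concrete, here is a minimal sketch on simulated data. The `city`, `hwy`, and `comb` vectors are made up for illustration and are not taken from the co2 data set. The sketch computes a VIF by hand as 1 / (1 - R²), where R² comes from regressing one explanatory variable on the others.

```r
# Simulated fuel consumption ratings (hypothetical values, not the real data)
set.seed(239)
n    <- 200
city <- rnorm(n, mean = 12, sd = 3)
hwy  <- 0.7 * city + rnorm(n, mean = 0, sd = 1)
# Combined rating: 55% city, 45% highway, plus a little rounding noise
comb <- 0.55 * city + 0.45 * hwy + rnorm(n, mean = 0, sd = 0.05)

# VIF for comb: regress it on the other explanatory variables
r2_comb  <- summary(lm(comb ~ city + hwy))$r.squared
vif_comb <- 1 / (1 - r2_comb)

# VIF for city when comb is left out of the model
r2_city  <- summary(lm(city ~ hwy))$r.squared
vif_city <- 1 / (1 - r2_city)

print(c(vif_comb = vif_comb, vif_city = vif_city))
```

In this simulation the VIF for the combined rating should come out far larger than the VIF for city alone, because the other two variables almost completely determine the combined rating.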

Part 2: Working with a categorical explanatory variable. How does fuel type affect CO2 emissions?

Question 5: Fuel type is a categorical variable. What are the levels and how many are there? Which group will R choose to be the reference group? Does this make sense in the context of these data?

Answer 5:

There are five levels of fuel type: “X” for regular gasoline, “Z” for premium gasoline, “D” for diesel, “E” for ethanol, and “N” for natural gas. By default, R chooses the alphabetically first level as the reference group. This doesn’t make a lot of sense here, since that’s “D” for diesel, while “X” for regular gasoline is the most common fuel type.

Question 6: In your own words, describe how categorical variables with multiple levels are coded.

Answer 6:

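Categorical variables are coded with “dummy” (indicator) variables. For a variable with k levels, R picks one level as the reference group (the alphabetically first level by default) and creates k - 1 new variables, one for each remaining level. Each dummy variable is 1 if the observation belongs to that level and 0 otherwise, so the reference group has 0 for every dummy variable.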

Question 7: Using the method you described above write out this coding.

Answer 7:

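Following from question six, this means that in this example “D” is the reference group and there are four dummy variables (E, N, X, Z). “D” is coded (0, 0, 0, 0), “E” is (1, 0, 0, 0), “N” is (0, 1, 0, 0), “X” is (0, 0, 1, 0), and “Z” is (0, 0, 0, 1).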
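The coding can be checked directly in R. This is a small sketch using a toy factor with one observation per fuel code (not the full data set); `model.matrix()` shows the columns that `lm()` would build, and `relevel()` shows how the coding changes when “X” is made the reference group.

```r
# The five fuel type codes, in R's default (alphabetical) level order
fuel <- factor(c("D", "E", "N", "X", "Z"))

# model.matrix() shows the columns lm() would build from the factor
coding <- model.matrix(~ fuel)
rownames(coding) <- as.character(fuel)
print(coding)

# Making "X" (regular gasoline) the reference level instead
fuel_x   <- relevel(fuel, ref = "X")
coding_x <- model.matrix(~ fuel_x)
rownames(coding_x) <- as.character(fuel_x)
print(coding_x)
```

Each matrix has an intercept column plus four 0/1 columns; the row for the reference level is the one with zeros in all four.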

Question 8: Create a side-by-side boxplot to compare the distributions of carbon emissions across fuel types.

Answer 8:

co2$Fuel.Type <- factor(co2$Fuel.Type, levels = c("X", "Z", "D", "E", "N"))

ggplot(co2, aes(Fuel.Type, CO2.Emissions.g.km., fill=Fuel.Type)) + geom_boxplot()

From the above plot I learned that ethanol has the highest CO2 per km on average, while regular gas on average has the lowest. Additionally, it appears that there is only one entry for a natural gas car.

Question 9: Perform the appropriate hypothesis test whether there is a significant difference in the average carbon emissions across fuel types. Please state the following components: Name of the hypothesis test, hypothesis (null and alternative), provide the test statistic, reference distribution, p-value. Then communicate your findings in sentence form.

Answer 9:

co2lm1 <- lm(CO2.Emissions.g.km. ~ Cylinders + Fuel.Type, co2)
summary(co2lm1)
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Cylinders + Fuel.Type, data = co2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -111.413  -20.178   -1.178   18.660  150.926 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 100.7516     1.2250  82.245   <2e-16 ***
## Cylinders    26.6653     0.2185 122.054   <2e-16 ***
## Fuel.TypeZ    0.1034     0.8249   0.125    0.900    
## Fuel.TypeD    3.9275     2.5079   1.566    0.117    
## Fuel.TypeE   -0.4978     1.7991  -0.277    0.782    
## Fuel.TypeN  -47.7436    32.4105  -1.473    0.141    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.41 on 7379 degrees of freedom
## Multiple R-squared:  0.6935, Adjusted R-squared:  0.6933 
## F-statistic:  3339 on 5 and 7379 DF,  p-value: < 2.2e-16
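The overall comparison the question asks for is usually run as a one-way ANOVA F-test, with the null hypothesis that all fuel types have the same mean emissions and the alternative that at least one mean differs. Below is a minimal sketch on simulated data (the `sim` data frame and its group means are hypothetical). With the real data the same calls would presumably use `CO2.Emissions.g.km. ~ Fuel.Type` and `co2`.

```r
# Hypothetical data: three groups with different mean emissions
set.seed(1)
sim <- data.frame(
  fuel = factor(rep(c("X", "Z", "D"), each = 30)),
  co2  = c(rnorm(30, 235, 20), rnorm(30, 265, 20), rnorm(30, 240, 20))
)

# One-way ANOVA: compare the intercept-only model to the fuel type model
fit_null <- lm(co2 ~ 1, data = sim)
fit_fuel <- lm(co2 ~ fuel, data = sim)
aov_tab  <- anova(fit_null, fit_fuel)
print(aov_tab)

# Pieces asked for in the question
f_stat <- aov_tab$F[2]                       # test statistic
p_val  <- aov_tab$`Pr(>F)`[2]                # p-value
dfs    <- c(aov_tab$Df[2], aov_tab$Res.Df[2])  # numerator and denominator df
```

The reference distribution is an F distribution with `dfs[1]` and `dfs[2]` degrees of freedom (number of groups minus one, and number of observations minus number of groups).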

Part 3: Fitting models

Question 10: Create a scatter plot to look at the relationship between Fuel Consumption Comb (L/100km) and co2 Emissions (g/km). Describe the scatter plot above (direction, form, strength, outliers).

Answer 10:

# alpha is a fixed setting, so it goes in geom_point() rather than aes()
scatterplot1 <- ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km.)) + geom_point(alpha=0.2)
scatterplot1

In the above scatter plot there is a strong positive linear relationship between CO2 emissions and fuel consumption. Interestingly, there seem to be two (or maybe three) distinct lines, possibly related to fuel type. Within each line the correlation seems very high, as the points show little scatter.

Model A: Simple Linear Regression (SLR)

Question 11: Create a SLR model for the relationship between Fuel Consumption Comb (L/100km) and co2 Emissions (g/km). Write the equation for the estimated fitted model.

Answer 11:

co2lm2 <- lm(CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km., co2)
summary(co2lm2)
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km., 
##     data = co2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -141.619   -6.048    1.952   11.667   62.954 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      46.76315    1.05937   44.14   <2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 18.57132    0.09334  198.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.2 on 7383 degrees of freedom
## Multiple R-squared:  0.8428, Adjusted R-squared:  0.8428 
## F-statistic: 3.959e+04 on 1 and 7383 DF,  p-value: < 2.2e-16

The equation of the fitted model is:

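CO2 Emissions (g/km) = 46.76 + 18.57 * Fuel Consumption Comb (L/100km)

This means that for every additional L of fuel a vehicle needs to travel 100km, predicted emissions increase by 18.57g/km. The intercept, 46.76g/km, is the predicted emissions for a vehicle that consumes 0L/100km, so it is an extrapolation rather than a physically meaningful value.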

Question 12: Perform a hypothesis test for the slope. State the reference distribution, degrees of freedom, the test statistic, and p-value in the form of a five-part conclusion in the context of the problem.

Answer 12:

From the output above I know the reference distribution is a t distribution with 7383 degrees of freedom (7385 observations minus 2 estimated parameters), which is very close to a Z distribution (aka standard normal distribution). The test statistic for the slope is a t statistic; since there is only one explanatory variable, the overall F-test would give the same conclusion. The t statistic is 198.97, with a p-value of <2e-16.

Using the output above I can write an informative conclusion. My null hypothesis is that there is no linear relationship between Fuel Consumption per 100km and CO2 emissions per km (β_1 = 0). My alternative hypothesis is that there is a relationship (β_1 ≠ 0). I then decided on an acceptable alpha value; I’ll choose .01, which means I’m accepting a 1% chance of a Type I error if the null hypothesis is true. I then find the t-critical value (two-tailed, alpha=.01, DF=7383), which is about +-2.576. The next step is to find the t-value of the slope, which is 198.97. Since our t-value is far greater than our t-critical, we reject our null hypothesis in favor of the alternative: there is strong evidence that CO2 emissions increase with fuel consumption.
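The numbers in this test can be reproduced from the summary output alone. The sketch below copies the slope estimate, standard error, and degrees of freedom from the `summary(co2lm2)` output; the variable names are my own.

```r
# Values copied from the summary(co2lm2) output
estimate <- 18.57132
std_err  <- 0.09334
df_resid <- 7383
alpha    <- 0.01

t_stat <- estimate / std_err                 # test statistic for H0: slope = 0
t_crit <- qt(1 - alpha / 2, df = df_resid)   # two-tailed critical value
p_val  <- 2 * pt(-abs(t_stat), df = df_resid)

print(c(t_stat = t_stat, t_crit = t_crit))
reject_null <- abs(t_stat) > t_crit
print(reject_null)
```

Because the estimate and standard error are rounded in the printed output, `t_stat` may differ from the reported 198.97 in the second decimal place.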

Question 13: Should we trust the inference we made in the previous step? To assess this check the model diagnostics (residual plot, qq plot, and leverage plot)

Answer 13:

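# Put fitted values and residuals in a data frame instead of using $ inside aes()
resid_df <- data.frame(fitted = fitted(co2lm2), residuals = resid(co2lm2))
ggplot(resid_df, aes(fitted, residuals)) + geom_point() + xlab("Fitted") + ylab("Residuals") + geom_hline(yintercept=0, color='red') + geom_smooth(se=FALSE)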
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

qqnorm(co2lm2$residuals)
qqline(co2lm2$residuals)

par(mfrow=c(2,2))
plot(co2lm2)

No, we should not trust the inferences made in the above question. The residual plot shows us that the errors are not distributed evenly above and below the line of the linear model. The Q-Q plot shows us that our residuals are not normally distributed, as a significant part of them departs from the line. Finally, our leverage plot shows us that there are many points whose influence is too great and that fall outside the Cook’s distance contours.
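Influence can also be checked numerically rather than by eye. The sketch below uses simulated data with one planted outlier (all values hypothetical) and flags points whose Cook’s distance exceeds 0.5, the lower of the two contour levels that `plot()` draws for an `lm` fit by default. The cutoff is a rule of thumb, not a formal test.

```r
# Simulated data with a roughly linear relationship (hypothetical values)
set.seed(42)
x <- runif(100, 4, 20)
y <- 5 + 23 * x + rnorm(100, sd = 5)

# Plant one high-leverage point that falls far from the line
x[100] <- 40
y[100] <- 200

fit <- lm(y ~ x)

# Cook's distance for every observation, then flag the large ones
cd      <- cooks.distance(fit)
flagged <- which(cd > 0.5)
print(flagged)
```

With a real model such as `co2lm2`, the same two lines (`cooks.distance()` followed by `which()`) would list the row numbers worth inspecting.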

Multiple Linear Regression

Question 14a: Create a scatter plot to look at the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km) but now color the points based on the fuel types.

# alpha is a fixed setting, so it goes in geom_point() rather than aes()
scatterplot2 <- ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color=Fuel.Type)) + geom_point(alpha=0.2)
scatterplot2

Answer 14a:

I totally called this: the three different lines correspond to different fuel types. It’s interesting that regular and premium gas appear to be identical. This got me interested in what the difference actually is. This MIT article claims that cars with higher compression ratios (whatever that means) can burn premium gasoline at higher temperatures and from that gain more mechanical advantage from each liter of gas. This data seems to contradict that: if the cars really were getting more mechanical advantage per unit of gas, they’d travel farther per liter and have lower CO2 emissions per km and lower fuel consumption per km. However, our data shows that premium gas cars generally have higher CO2 per km and higher fuel consumption per km.

Model B: Parallel Lines

Question 14b: Create parallel lines models for CO2 Emissions (g/km) using Fuel Consumption Comb (L/100km) and Fuel Type. Write the equations for the estimated fitted models for the fuel types.

Answer 14b:

co2lm3 <- lm(CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. + Fuel.Type, co2)
summary(co2lm3)
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. + 
##     Fuel.Type, data = co2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.595  -2.760   0.045   2.234  44.852 
## 
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                         5.34154    0.27768   19.236  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.   22.78507    0.02601  875.998  < 2e-16 ***
## Fuel.TypeZ                          0.43328    0.13763    3.148  0.00165 ** 
## Fuel.TypeD                         30.89114    0.42649   72.432  < 2e-16 ***
## Fuel.TypeE                       -114.43678    0.34782 -329.016  < 2e-16 ***
## Fuel.TypeN                        -81.71198    5.49603  -14.867  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.495 on 7379 degrees of freedom
## Multiple R-squared:  0.9912, Adjusted R-squared:  0.9912 
## F-statistic: 1.66e+05 on 5 and 7379 DF,  p-value: < 2.2e-16

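All of the lines share the slope 22.79; only the intercept changes with fuel type (the reference intercept 5.34 plus that fuel type’s coefficient).

Regular gas: CO2 (g/km) = 5.34 + 22.79 * Fuel Consumption (L/100km)

Premium gas: CO2 (g/km) = 5.77 + 22.79 * Fuel Consumption (L/100km)

Diesel: CO2 (g/km) = 36.23 + 22.79 * Fuel Consumption (L/100km)

Ethanol: CO2 (g/km) = -109.10 + 22.79 * Fuel Consumption (L/100km)

Natural gas: CO2 (g/km) = -76.37 + 22.79 * Fuel Consumption (L/100km)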
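In a parallel lines model, each fuel type’s intercept is the reference intercept plus that fuel type’s shift, while the slope is shared. A minimal sketch of that arithmetic, with coefficient values copied from the `summary(co2lm3)` output (the vector names and the 8 L/100km example vehicle are mine):

```r
# Coefficients copied from the summary(co2lm3) output
b0     <- 5.34154     # intercept for the reference level (X, regular gas)
slope  <- 22.78507    # slope shared by every fuel type
shifts <- c(X = 0, Z = 0.43328, D = 30.89114, E = -114.43678, N = -81.71198)

# Fuel-specific intercepts: reference intercept plus shift
intercepts <- b0 + shifts
print(round(intercepts, 2))

# Fitted value for a hypothetical diesel vehicle using 8 L/100km
pred_diesel <- unname(intercepts["D"] + slope * 8)
print(pred_diesel)
```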

Question 15: Create a graphic for your parallel lines model showing the fitted models for each type of fuel. Comment on which shifts of intercept are significant.

Answer 15:

# b[1] = intercept for X, b[2] = shared slope, b[3:5] = intercept shifts for Z, D, E
b <- coef(co2lm3)
# Each fuel type's intercept is the reference intercept plus its shift
scatterplot3 <- ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color=Fuel.Type)) + geom_point(alpha=0.2) + geom_abline(slope=b[2], intercept=b[1], color='orange') + geom_abline(slope=b[2], intercept=b[1]+b[3], color='brown') + geom_abline(slope=b[2], intercept=b[1]+b[4], color='green') + geom_abline(slope=b[2], intercept=b[1]+b[5], color='cyan')
scatterplot3

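Looking at the significance codes from the output of question 14b, all four shifts in intercept relative to regular gas are significant, so the variation between fuel types is unlikely to have happened by chance. For the shifts between regular and ethanol, regular and diesel, and regular and natural gas, the p-values are below 0.001: if there were truly no difference, a shift this large would show up less than 0.1% of the time. The shift between regular and premium has a p-value of 0.00165, which is still significant at the 0.01 level, although the shift itself is small (0.43g/km).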

Model C: Interactions (Unrelated) Lines

Question 16: Create an unrelated lines (interaction) model for CO2 Emission (g/km) using Fuel Consumption Comb (L/100km) and Fuel Type. Write the equations for the estimated fitted models for the fuel types.

Answer 16:

co2lm4 <- lm(CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km.:Fuel.Type + Fuel.Type, co2)
summary(co2lm4)
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km.:Fuel.Type + 
##     Fuel.Type, data = co2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.127  -2.607   0.659   1.886  25.251 
## 
## Coefficients: (1 not defined because of singularities)
##                                              Estimate Std. Error  t value
## (Intercept)                                   0.42897    0.20641    2.078
## Fuel.TypeZ                                    0.18413    0.32445    0.568
## Fuel.TypeD                                   -0.54618    1.30980   -0.417
## Fuel.TypeE                                    4.24637    0.92786    4.577
## Fuel.TypeN                                  212.57103    2.96179   71.771
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeX  23.27221    0.01988 1170.440
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ  23.23695    0.02143 1084.227
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD  26.89918    0.14419  186.550
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE  16.03766    0.05287  303.338
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN        NA         NA       NA
##                                             Pr(>|t|)    
## (Intercept)                                   0.0377 *  
## Fuel.TypeZ                                    0.5704    
## Fuel.TypeD                                    0.6767    
## Fuel.TypeE                                   4.8e-06 ***
## Fuel.TypeN                                   < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeX  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.955 on 7376 degrees of freedom
## Multiple R-squared:  0.9975, Adjusted R-squared:  0.9975 
## F-statistic: 3.611e+05 on 8 and 7376 DF,  p-value: < 2.2e-16

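Each intercept is the reference intercept (0.43) plus that fuel type’s shift; each fuel type has its own slope.

Regular gas: CO2 (g/km) = 0.43 + 23.27 * Fuel Consumption (L/100km)

Premium gas: CO2 (g/km) = 0.61 + 23.24 * Fuel Consumption (L/100km)

Ethanol: CO2 (g/km) = 4.68 + 16.04 * Fuel Consumption (L/100km)

Diesel: CO2 (g/km) = -0.12 + 26.90 * Fuel Consumption (L/100km)

Natural gas: there is only one natural gas vehicle, so no slope can be estimated (the NA row in the output).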

Question 17: Create a graphic for your unrelated lines model showing the fitted models for each type of fuel. Comment on which shifts of intercept and/or slope are significant.

# b[1] = intercept for X, b[2:4] = intercept shifts for Z, D, E, b[6:9] = slopes for X, Z, D, E
b <- coef(co2lm4)
# Each fuel type's intercept is the reference intercept plus its shift
scatterplot4 <- ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color=Fuel.Type)) + geom_point(alpha=0.2) + geom_abline(slope=b[6], intercept=b[1], color='orange') + geom_abline(slope=b[7], intercept=b[1]+b[2], color='brown') + geom_abline(slope=b[8], intercept=b[1]+b[3], color='green') + geom_abline(slope=b[9], intercept=b[1]+b[4], color='cyan')
scatterplot4

Answer 17:

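From the table in question 16, every slope is significantly different from zero (p < 2e-16). Because this model estimates a separate slope for each fuel type, those t-tests compare each slope to zero rather than to the slope for regular gas. Comparing the estimates with their standard errors, the slopes for diesel (26.90) and ethanol (16.04) are clearly different from regular gas (23.27), while the slopes for regular (23.27) and premium (23.24) are nearly identical. For the intercepts, the shifts for premium (p = 0.57) and diesel (p = 0.68) are not significant, which makes sense: regardless of fuel type, if fuel consumption is zero, so are emissions. The shift for ethanol (4.25, p = 4.8e-06) is significant. The natural gas shift is also flagged as significant, but it rests on a single vehicle and should not be interpreted.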
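Whether the slope rows test “slope differs from zero” or “slope differs from the reference group” depends on how the interaction is written in the formula. The sketch below, on simulated data with two hypothetical fuel types, fits the same lines two ways: `use:fuel + fuel` gives one slope per group, while `use * fuel` gives a reference slope plus a difference in slope, and the t-test on that difference is the one that compares groups.

```r
# Simulated data: two fuel types with different true slopes (hypothetical)
set.seed(7)
n   <- 60
sim <- data.frame(
  fuel = factor(rep(c("X", "E"), each = n), levels = c("X", "E")),
  use  = runif(2 * n, 5, 20)
)
true_slope <- ifelse(sim$fuel == "X", 23, 16)
sim$co2    <- true_slope * sim$use + rnorm(2 * n, sd = 3)

# Separate-slopes form: each slope row is tested against zero
fit_sep  <- lm(co2 ~ use:fuel + fuel, data = sim)
# Difference form: the use:fuelE row is the change in slope relative to X
fit_diff <- lm(co2 ~ use * fuel, data = sim)

# Same fitted lines, different questions answered by the t-tests
same_fit   <- all.equal(fitted(fit_sep), fitted(fit_diff))
slope_diff <- unname(coef(fit_diff)["use:fuelE"])
print(coef(summary(fit_diff)))
```

The two fits describe identical lines; only the meaning of the coefficient rows changes.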

Model Comparison and Selection

Question 18: Compare Models A, B, and C by reporting their adjusted R-squared values. Which model would you pick? Make an argument based on an assessment of the model assumptions, simplicity, and interpretability as well as model fit.

Answer 18:

First, the adjusted R-squared values are 0.8428 for model A, 0.9912 for model B, and 0.9975 for model C. Model A is not good: I ran diagnostics and it failed all of them, so it should not be considered. This is now a debate between model B and model C. Although model B is simpler, it does generate some strange predictions; for example, its ethanol intercept is negative, so it claims that an ethanol vehicle with zero fuel consumption would remove CO2 from the atmosphere. Model C has the advantage that the slope can change depending on fuel type. This makes sense, as it’s reasonable that different fuel types could emit more or less CO2 per liter burned. For the above reasons, and because it has the highest adjusted R-squared, I would pick model C.

Prediction

Question 19: Professor Smalley drives a 2001 Toyota Rav4. She calculated that her car gets a combined fuel consumption of 25 miles per gallon. She puts regular gasoline into her car. Predict the CO2 Emissions (g/km) for her car, using the model you chose in Question 18.

Answer 19:

25mpg is equal to about 9.4L/100km (235.2 / 25, using US gallons). Using the regular gas formula from model C (CO2 = 0.43 + 23.27 * Fuel Consumption), I predict that Professor Smalley’s car emits about 219.4g of CO2 per km.
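The unit conversion and prediction can be written out as a short calculation. This sketch assumes US gallons (the conversion constant 235.214583 is for miles per US gallon) and copies the regular gas intercept and slope from the `summary(co2lm4)` output; the function and variable names are mine.

```r
# Convert miles per US gallon to L/100km (assumes US, not imperial, gallons)
mpg_to_l100km <- function(mpg) 235.214583 / mpg

# Model C coefficients for regular gasoline (X), from summary(co2lm4)
intercept_x <- 0.42897
slope_x     <- 23.27221

fuel_use <- mpg_to_l100km(25)
co2_pred <- intercept_x + slope_x * fuel_use
print(round(c(fuel_use = fuel_use, co2_pred = co2_pred), 2))
```

Using unrounded coefficients shifts the prediction only slightly compared with the hand calculation.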