Part 0: Import data & libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)
co2 <- read.csv("https://raw.githubusercontent.com/kitadasmalley/MATH239/main/data/CO2_Emissions_Midterm.csv", sep = ",", quote = "\"", header = TRUE)

Part 1: Working with a categorical explanatory variable

1.

the levels are: D (diesel), E (ethanol), N (natural gas), X (regular), and Z (premium)
R will choose D as the reference group. This doesn’t really make sense in context, because the different levels should probably be compared to the regular gas, rather than to a specialty type.

contrasts(as.factor(co2$Fuel.Type))
##   E N X Z
## D 0 0 0 0
## E 1 0 0 0
## N 0 1 0 0
## X 0 0 1 0
## Z 0 0 0 1

2.

Categorical variables with multiple levels are coded with a system of multiple binary situations, using one less than the number of levels of the variable. For every non-reference group, that indicator is set to 1 if it is a specific value and 0 otherwise. The reference group is always 0.

3.

contrasts(as.factor(co2$Fuel.Type))
##   E N X Z
## D 0 0 0 0
## E 1 0 0 0
## N 0 1 0 0
## X 0 0 1 0
## Z 0 0 0 1

The contrasts function writes out this coding.
It could also be written as:
xi1 = {1 if E, 0 otherwise}
xi2 = {1 if N, 0 otherwise}
xi3 = {1 if X, 0 otherwise}
xi4 = {1 if Z, 0 otherwise}

4.

co2$Fuel.Type <- factor(co2$Fuel.Type, levels = c("X", "Z", "D", "E", "N"))
ggplot(co2, aes(Fuel.Type, CO2.Emissions.g.km., fill = Fuel.Type))+
  geom_boxplot()

Natural gas has the lowest median CO2 emissions and also the lowest range. Ethanol has the highest median CO2 emissions. Regular has the largest IQR, and a range from the lowest to the second highest. Overall, all of the fuel types have similar medians.

5.

mod1 <- lm(CO2.Emissions.g.km.~Fuel.Type, data = co2)
anova(mod1)
## Analysis of Variance Table
## 
## Response: CO2.Emissions.g.km.
##             Df   Sum Sq Mean Sq F value    Pr(>F)    
## Fuel.Type    4  1888452  472113  148.95 < 2.2e-16 ***
## Residuals 7380 23392397    3170                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I performed a one-way ANOVA F-test.
HO: μ1 = μ2 = μ3 = μ4 = μ5
HA: at least one μj is different
test statistic: 148.95 reference distribution: F4/7380 p-value: <2.2e-16

Conclusion: Because our p-value of <2.2e-16 is less than our alpha of 0.05, we reject the null hypothesis in favor of the alternative hypothesis. There is convincing evidence that at least one of the fuel types has a different mean co2 emission than the others.

6.

This question does not appear to exist

Part 2: Properties of linear models

7.

y = β0 + β1x1 + β2x2 + … + βpxp + εi

y: response variable (co2 emissions (g/km)) β0: intercept
βi: slope for given explanatory variable
xi: explanatory variable (fuel consumption, etc)
εi: error

8.

  1. There is a linear relationship between x and y
  2. average error is 0
  3. errors have a constant spread
  4. errors are independent (or uncorrelated)
  5. error are normally distributed

9.

co2_numeric <- co2 %>%
  select(where(is.numeric))
pairs(co2_numeric)

most of the relationships between potential explanatory variables and the response variable are positive and linear. However, the relationship between fuel consumption comb (mpg) and co2 emissions (g/km) is negative and exponential.

10.

I would include fuel consumption comb (L/100km) as an explanatory variable over fuel consumption comb (mpg), because using mpg doesn’t fit with the first of the linear model assumptions: that the relationship is linear.

Extra Credit:

It wouldn’t be appropriate to include fuel consumption city (L/100km), fuel consumption hwy (L/100km), and fuel consumption comb (L/100km) in the same model because they would definitely be collinear. This would drastically inflate the VIF, which makes a model less accurate. Having collinear variables will result in a larger p-value.

Part 3: Fitting models

Model A

10.

ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km.))+
  geom_point()

The relationship in the above scatterplot is positive, linear, and, while overall, moderately strong, has three distinct regions of extreme strength. Nothing stands out as a clear outlier, but the point at about (12, 130) could potentially be one.

11.

modA <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km., data = co2)
summary(modA)
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km., 
##     data = co2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -141.619   -6.048    1.952   11.667   62.954 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      46.76315    1.05937   44.14   <2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 18.57132    0.09334  198.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.2 on 7383 degrees of freedom
## Multiple R-squared:  0.8428, Adjusted R-squared:  0.8428 
## F-statistic: 3.959e+04 on 1 and 7383 DF,  p-value: < 2.2e-16

equation: y = 18.57132x + 46.76315

12.

using a T distribution with df 7383, we get a p-value of less than 2e-16. Since this p-value is less than 0.05, we reject the null hypothesis in favor of the alternative hypothesis. There is convincing evidence that the slope of the line for co2 emmisions as a function of fuel consumption is not 0.

13.

plot(modA)

The residuals do appear to have a mean of near 0, but there are distinct patterns in them, and there is definitely not a constant variance.

The residuals appear normal from theoretical quantiles -1 to 3, but below -1, they deviate strongly from the Q-Q line.

There do not appear to be any influential outliers.

Model B

14.

ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color = Fuel.Type, alpha = 0.25))+
  geom_point()

it appears that the fuel types each make their own line, except for regular and premium gasoline, which appear to be largely the same. Natural Gas appears to only have 1 data point.

15.

modB <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km. + Fuel.Type, data = co2)
summary(modB)
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. + 
##     Fuel.Type, data = co2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.595  -2.760   0.045   2.234  44.852 
## 
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                         5.34154    0.27768   19.236  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.   22.78507    0.02601  875.998  < 2e-16 ***
## Fuel.TypeZ                          0.43328    0.13763    3.148  0.00165 ** 
## Fuel.TypeD                         30.89114    0.42649   72.432  < 2e-16 ***
## Fuel.TypeE                       -114.43678    0.34782 -329.016  < 2e-16 ***
## Fuel.TypeN                        -81.71198    5.49603  -14.867  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.495 on 7379 degrees of freedom
## Multiple R-squared:  0.9912, Adjusted R-squared:  0.9912 
## F-statistic: 1.66e+05 on 5 and 7379 DF,  p-value: < 2.2e-16

Regular gasoline: y = 5.34154 + 22.78507x
Premium gasoline: y = 5.77482 + 22.78507x
Diesel: y = 36.23268 + 22.78507x
Ethanol: y = -109.09524 + 22.78507x
Natural gas: y = -76.37044 + 22.78507x

16.

ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color = Fuel.Type, alpha = 0.25))+
  geom_point()+
  geom_abline(slope = modB$coefficients[2], intercept = modB$coefficients[1], color = "red", lwd = 1)+
  geom_abline(slope = modB$coefficients[2], intercept = modB$coefficients[1] + modB$coefficients[3], color = "yellow", lwd = 1)+
  geom_abline(slope = modB$coefficients[2], intercept = modB$coefficients[1] + modB$coefficients[4], color = "green", lwd = 1)+
  geom_abline(slope = modB$coefficients[2], intercept = modB$coefficients[1] + modB$coefficients[5], color = "blue", lwd = 1)+
  geom_abline(slope = modB$coefficients[2], intercept = modB$coefficients[1] + modB$coefficients[6], color = "purple", lwd = 1)

the red and yellow lines are overlapping - they have very similar intercepts. However, other than those two, all the intercept shifts appear to be significant

Model C

17.

modC <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km. + Fuel.Type + Fuel.Consumption.Comb..L.100.km.*Fuel.Type, data = co2)
summary(modC)
## 
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. + 
##     Fuel.Type + Fuel.Consumption.Comb..L.100.km. * Fuel.Type, 
##     data = co2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.127  -2.607   0.659   1.886  25.251 
## 
## Coefficients: (1 not defined because of singularities)
##                                              Estimate Std. Error  t value
## (Intercept)                                   0.42897    0.20641    2.078
## Fuel.Consumption.Comb..L.100.km.             23.27221    0.01988 1170.440
## Fuel.TypeZ                                    0.18413    0.32445    0.568
## Fuel.TypeD                                   -0.54618    1.30980   -0.417
## Fuel.TypeE                                    4.24637    0.92786    4.577
## Fuel.TypeN                                  -82.98605    2.95545  -28.079
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ  -0.03526    0.02923   -1.206
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD   3.62697    0.14556   24.918
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE  -7.23455    0.05649 -128.077
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN        NA         NA       NA
##                                             Pr(>|t|)    
## (Intercept)                                   0.0377 *  
## Fuel.Consumption.Comb..L.100.km.             < 2e-16 ***
## Fuel.TypeZ                                    0.5704    
## Fuel.TypeD                                    0.6767    
## Fuel.TypeE                                   4.8e-06 ***
## Fuel.TypeN                                   < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ   0.2279    
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.955 on 7376 degrees of freedom
## Multiple R-squared:  0.9975, Adjusted R-squared:  0.9975 
## F-statistic: 3.611e+05 on 8 and 7376 DF,  p-value: < 2.2e-16

Regular: y = 0.42897 + 23.27221x
Premium: y = 0.6131 + 23.23695
Diesel: y = -0.11721 + 3.6297x
Ethanol: y = 4.67534 + 16.03766x
Natural Gas: -82.55708 + 23.27221x

18.

ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color = Fuel.Type, alpha = 0.25))+
  geom_point()+
  geom_abline(slope = modC$coefficients[2], intercept = modC$coefficients[1], color = "red", lwd = 1)+
  geom_abline(slope = modC$coefficients[2] + modC$coefficients[7], intercept = modC$coefficients[1] + modC$coefficients[3], color = "yellow", lwd = 1)+
  geom_abline(slope = modC$coefficients[2] + modC$coefficients[8], intercept = modC$coefficients[1] + modC$coefficients[4], color = "green", lwd = 1)+
  geom_abline(slope = modC$coefficients[2] + modC$coefficients[9], intercept = modC$coefficients[1] + modC$coefficients[5], color = "blue", lwd = 1)+
  geom_abline(slope = modC$coefficients[2], intercept = modC$coefficients[1] + modC$coefficients[6], color = "purple", lwd = 1)

Ethanol and natural gas had significant changes in intercept, while diesel and ethanol had significant changes in slope. Natural gas, if it had more data points, might have had a significant change in slope as well, but it is unknown.

Model Selection:

19.

Model A: 0.8428
Model B: 0.9912
Model C: 0.9975

plot(modA)

plot(modB)
## Warning: not plotting observations with leverage one:
##   2440

plot(modC)
## Warning: not plotting observations with leverage one:
##   2440

Model A has the lowest adjusted R2 value, and it also fits the least of the model assumptions. However, it is the easiest to interpret and is the simplest.

Model B has the middle adjusted R2 value, and fits the model assumptions more than model A, but still isn’t perfect. It is still not too hard to interpret, and is (obviously) the middle in terms of simplicity.

Model C has the highest adjusted R^2 value, and fits almost all the model assumptions very well. However, it is not easy to interpret, and not all the changes were very significant. It is also the most complicated.

I would overall use model B, because it fits the model very well and is easier to interpret than model C, and it does still hit most of the model assumptions.

Extra Credit:

A.

 FCC <- 25 * 235.21
 5.34154 + 22.78507* FCC
## [1] 133987.2

I would predict that Professor Smalley’s car would have a CO2 emission of 133987.2 g/km.

B.

newdata = data.frame(Fuel.Type = "X", Fuel.Consumption.Comb..L.100.km. = FCC)
predict(modB, newdata, interval = "predict")
##        fit      lwr      upr
## 1 133987.3 133687.8 134286.8

interval: [133687.8, 134286.8]