library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
co2 <- read.csv("https://raw.githubusercontent.com/kitadasmalley/MATH239/main/data/CO2_Emissions_Midterm.csv", sep = ",", quote = "\"", header = TRUE)
the levels are: D (diesel), E (ethanol), N (natural gas), X (regular), and Z (premium)
R will choose D as the reference group. This doesn’t really make sense in context, because the different levels should probably be compared to the regular gas, rather than to a specialty type.
contrasts(as.factor(co2$Fuel.Type))
## E N X Z
## D 0 0 0 0
## E 1 0 0 0
## N 0 1 0 0
## X 0 0 1 0
## Z 0 0 0 1
Categorical variables with multiple levels are coded with a system of multiple binary situations, using one less than the number of levels of the variable. For every non-reference group, that indicator is set to 1 if it is a specific value and 0 otherwise. The reference group is always 0.
contrasts(as.factor(co2$Fuel.Type))
## E N X Z
## D 0 0 0 0
## E 1 0 0 0
## N 0 1 0 0
## X 0 0 1 0
## Z 0 0 0 1
The contrasts function writes out this coding.
It could also be written as:
xi1 = {1 if E, 0 otherwise}
xi2 = {1 if N, 0 otherwise}
xi3 = {1 if X, 0 otherwise}
xi4 = {1 if Z, 0 otherwise}
co2$Fuel.Type <- factor(co2$Fuel.Type, levels = c("X", "Z", "D", "E", "N"))
ggplot(co2, aes(Fuel.Type, CO2.Emissions.g.km., fill = Fuel.Type))+
geom_boxplot()
Natural gas has the lowest median CO2 emissions and also the lowest range. Ethanol has the highest median CO2 emissions. Regular has the largest IQR, and a range from the lowest to the second highest. Overall, all of the fuel types have similar medians.
mod1 <- lm(CO2.Emissions.g.km.~Fuel.Type, data = co2)
anova(mod1)
## Analysis of Variance Table
##
## Response: CO2.Emissions.g.km.
## Df Sum Sq Mean Sq F value Pr(>F)
## Fuel.Type 4 1888452 472113 148.95 < 2.2e-16 ***
## Residuals 7380 23392397 3170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I performed a one-way ANOVA F-test.
HO: μ1 = μ2 = μ3 = μ4 = μ5
HA: at least one μj is different
test statistic: 148.95 reference distribution: F4/7380 p-value: <2.2e-16
Conclusion: Because our p-value of <2.2e-16 is less than our alpha of 0.05, we reject the null hypothesis in favor of the alternative hypothesis. There is convincing evidence that at least one of the fuel types has a different mean co2 emission than the others.
This question does not appear to exist
y = β0 + β1x1 + β2x2 + … + βpxp + εi
y: response variable (co2 emissions (g/km)) β0: intercept
βi: slope for given explanatory variable
xi: explanatory variable (fuel consumption, etc)
εi: error
co2_numeric <- co2 %>%
select(where(is.numeric))
pairs(co2_numeric)
most of the relationships between potential explanatory variables and the response variable are positive and linear. However, the relationship between fuel consumption comb (mpg) and co2 emissions (g/km) is negative and exponential.
I would include fuel consumption comb (L/100km) as an explanatory variable over fuel consumption comb (mpg), because using mpg doesn’t fit with the first of the linear model assumptions: that the relationship is linear.
It wouldn’t be appropriate to include fuel consumption city (L/100km), fuel consumption hwy (L/100km), and fuel consumption comb (L/100km) in the same model because they would definitely be collinear. This would drastically inflate the VIF, which makes a model less accurate. Having collinear variables will result in a larger p-value.
ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km.))+
geom_point()
The relationship in the above scatterplot is positive, linear, and, while overall, moderately strong, has three distinct regions of extreme strength. Nothing stands out as a clear outlier, but the point at about (12, 130) could potentially be one.
modA <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km., data = co2)
summary(modA)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km.,
## data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.619 -6.048 1.952 11.667 62.954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.76315 1.05937 44.14 <2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 18.57132 0.09334 198.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.2 on 7383 degrees of freedom
## Multiple R-squared: 0.8428, Adjusted R-squared: 0.8428
## F-statistic: 3.959e+04 on 1 and 7383 DF, p-value: < 2.2e-16
equation: y = 18.57132x + 46.76315
using a T distribution with df 7383, we get a p-value of less than 2e-16. Since this p-value is less than 0.05, we reject the null hypothesis in favor of the alternative hypothesis. There is convincing evidence that the slope of the line for co2 emmisions as a function of fuel consumption is not 0.
plot(modA)
The residuals do appear to have a mean of near 0, but there are distinct patterns in them, and there is definitely not a constant variance.
The residuals appear normal from theoretical quantiles -1 to 3, but below -1, they deviate strongly from the Q-Q line.
There do not appear to be any influential outliers.
ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color = Fuel.Type, alpha = 0.25))+
geom_point()
it appears that the fuel types each make their own line, except for regular and premium gasoline, which appear to be largely the same. Natural Gas appears to only have 1 data point.
modB <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km. + Fuel.Type, data = co2)
summary(modB)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. +
## Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.595 -2.760 0.045 2.234 44.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.34154 0.27768 19.236 < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 22.78507 0.02601 875.998 < 2e-16 ***
## Fuel.TypeZ 0.43328 0.13763 3.148 0.00165 **
## Fuel.TypeD 30.89114 0.42649 72.432 < 2e-16 ***
## Fuel.TypeE -114.43678 0.34782 -329.016 < 2e-16 ***
## Fuel.TypeN -81.71198 5.49603 -14.867 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.495 on 7379 degrees of freedom
## Multiple R-squared: 0.9912, Adjusted R-squared: 0.9912
## F-statistic: 1.66e+05 on 5 and 7379 DF, p-value: < 2.2e-16
Regular gasoline: y = 5.34154 + 22.78507x
Premium gasoline: y = 5.77482 + 22.78507x
Diesel: y = 36.23268 + 22.78507x
Ethanol: y = -109.09524 + 22.78507x
Natural gas: y = -76.37044 + 22.78507x
ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color = Fuel.Type, alpha = 0.25))+
geom_point()+
geom_abline(slope = modB$coefficients[2], intercept = modB$coefficients[1], color = "red", lwd = 1)+
geom_abline(slope = modB$coefficients[2], intercept = modB$coefficients[1] + modB$coefficients[3], color = "yellow", lwd = 1)+
geom_abline(slope = modB$coefficients[2], intercept = modB$coefficients[1] + modB$coefficients[4], color = "green", lwd = 1)+
geom_abline(slope = modB$coefficients[2], intercept = modB$coefficients[1] + modB$coefficients[5], color = "blue", lwd = 1)+
geom_abline(slope = modB$coefficients[2], intercept = modB$coefficients[1] + modB$coefficients[6], color = "purple", lwd = 1)
the red and yellow lines are overlapping - they have very similar intercepts. However, other than those two, all the intercept shifts appear to be significant
modC <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km. + Fuel.Type + Fuel.Consumption.Comb..L.100.km.*Fuel.Type, data = co2)
summary(modC)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. +
## Fuel.Type + Fuel.Consumption.Comb..L.100.km. * Fuel.Type,
## data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.127 -2.607 0.659 1.886 25.251
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 0.42897 0.20641 2.078
## Fuel.Consumption.Comb..L.100.km. 23.27221 0.01988 1170.440
## Fuel.TypeZ 0.18413 0.32445 0.568
## Fuel.TypeD -0.54618 1.30980 -0.417
## Fuel.TypeE 4.24637 0.92786 4.577
## Fuel.TypeN -82.98605 2.95545 -28.079
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ -0.03526 0.02923 -1.206
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD 3.62697 0.14556 24.918
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE -7.23455 0.05649 -128.077
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN NA NA NA
## Pr(>|t|)
## (Intercept) 0.0377 *
## Fuel.Consumption.Comb..L.100.km. < 2e-16 ***
## Fuel.TypeZ 0.5704
## Fuel.TypeD 0.6767
## Fuel.TypeE 4.8e-06 ***
## Fuel.TypeN < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ 0.2279
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.955 on 7376 degrees of freedom
## Multiple R-squared: 0.9975, Adjusted R-squared: 0.9975
## F-statistic: 3.611e+05 on 8 and 7376 DF, p-value: < 2.2e-16
Regular: y = 0.42897 + 23.27221x
Premium: y = 0.6131 + 23.23695
Diesel: y = -0.11721 + 3.6297x
Ethanol: y = 4.67534 + 16.03766x
Natural Gas: -82.55708 + 23.27221x
ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color = Fuel.Type, alpha = 0.25))+
geom_point()+
geom_abline(slope = modC$coefficients[2], intercept = modC$coefficients[1], color = "red", lwd = 1)+
geom_abline(slope = modC$coefficients[2] + modC$coefficients[7], intercept = modC$coefficients[1] + modC$coefficients[3], color = "yellow", lwd = 1)+
geom_abline(slope = modC$coefficients[2] + modC$coefficients[8], intercept = modC$coefficients[1] + modC$coefficients[4], color = "green", lwd = 1)+
geom_abline(slope = modC$coefficients[2] + modC$coefficients[9], intercept = modC$coefficients[1] + modC$coefficients[5], color = "blue", lwd = 1)+
geom_abline(slope = modC$coefficients[2], intercept = modC$coefficients[1] + modC$coefficients[6], color = "purple", lwd = 1)
Ethanol and natural gas had significant changes in intercept, while diesel and ethanol had significant changes in slope. Natural gas, if it had more data points, might have had a significant change in slope as well, but it is unknown.
Model A: 0.8428
Model B: 0.9912
Model C: 0.9975
plot(modA)
plot(modB)
## Warning: not plotting observations with leverage one:
## 2440
plot(modC)
## Warning: not plotting observations with leverage one:
## 2440
Model A has the lowest adjusted R2 value, and it also fits the least of the model assumptions. However, it is the easiest to interpret and is the simplest.
Model B has the middle adjusted R2 value, and fits the model assumptions more than model A, but still isn’t perfect. It is still not too hard to interpret, and is (obviously) the middle in terms of simplicity.
Model C has the highest adjusted R^2 value, and fits almost all the model assumptions very well. However, it is not easy to interpret, and not all the changes were very significant. It is also the most complicated.
I would overall use model B, because it fits the model very well and is easier to interpret than model C, and it does still hit most of the model assumptions.
FCC <- 25 * 235.21
5.34154 + 22.78507* FCC
## [1] 133987.2
I would predict that Professor Smalley’s car would have a CO2 emission of 133987.2 g/km.
newdata = data.frame(Fuel.Type = "X", Fuel.Consumption.Comb..L.100.km. = FCC)
predict(modB, newdata, interval = "predict")
## fit lwr upr
## 1 133987.3 133687.8 134286.8
interval: [133687.8, 134286.8]