#load the packages we need
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
library(ISLR)
library(ggplot2)
library(dplyr)
library(tidyr)
#load the data
co2<-read.csv("https://raw.githubusercontent.com/kitadasmalley/MATH239/main/data/CO2_Emissions_Midterm.csv",
sep=",", quote="\"",
header = TRUE)
head(co2)
## Make Model Vehicle.Class Engine.Size.L. Cylinders Transmission
## 1 ACURA ILX COMPACT 2.0 4 AS5
## 2 ACURA ILX COMPACT 2.4 4 M6
## 3 ACURA ILX HYBRID COMPACT 1.5 4 AV7
## 4 ACURA MDX 4WD SUV - SMALL 3.5 6 AS6
## 5 ACURA RDX AWD SUV - SMALL 3.5 6 AS6
## 6 ACURA RLX MID-SIZE 3.5 6 AS6
## Fuel.Type Fuel.Consumption.City..L.100.km. Fuel.Consumption.Hwy..L.100.km.
## 1 Z 9.9 6.7
## 2 Z 11.2 7.7
## 3 Z 6.0 5.8
## 4 Z 12.7 9.1
## 5 Z 12.1 8.7
## 6 Z 11.9 7.7
## Fuel.Consumption.Comb..L.100.km. Fuel.Consumption.Comb..mpg.
## 1 8.5 33
## 2 9.6 29
## 3 5.9 48
## 4 11.1 25
## 5 10.6 27
## 6 10.0 28
## CO2.Emissions.g.km.
## 1 196
## 2 221
## 3 136
## 4 255
## 5 244
## 6 230
QUESTION 1: Fuel type should be a categorical variable. What are the levels and how many are there? Which group will R choose to be the reference group? Does this make sense in the context of these data?
#There are five levels of fuel type: D (diesel), E (ethanol), N (natural gas), X (regular gasoline), and Z (premium gasoline). By default R orders the levels alphabetically, so it would choose D (diesel) as the reference group. That is not the most natural baseline for these data, so below I re-level the factor so that X (regular gasoline) is the reference. This makes sense in context because every other fuel type is then compared to the type that is most commonly used.
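As a quick check on which level R treats as the reference, the sketch below uses a small made-up vector of fuel codes (not the real data) to show the default alphabetical ordering and how `relevel()` moves X to the front.

```r
# Toy fuel codes for illustration only (not taken from the data set)
fuel <- c("Z", "X", "D", "E", "N", "X")

# By default, factor() sorts the levels alphabetically,
# so the first level (the reference group) is "D"
f_default <- factor(fuel)
print(levels(f_default))

# relevel() moves the chosen level to the front, making "X" the reference
f_releveled <- relevel(f_default, ref = "X")
print(levels(f_releveled))
```

Setting `levels = c(...)` inside `factor()`, as done for Question 3, achieves the same thing while also fixing the order of the remaining levels.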
QUESTION 2: In your own words, describe how categorical variables with multiple levels are coded.
#In R, you convert the categorical variable to a factor. A factor with k levels is coded with k-1 indicator (dummy) variables, each equal to 1 for its own level and 0 otherwise; the reference level is 0 on all of them. The contrasts() function displays this 0/1 coding.
QUESTION 3: Using the method you described above, write out this coding using 0’s and 1’s.
co2$Fuel.Type <- factor(co2$Fuel.Type,
levels = c("X", "Z", "D", "E", "N"))
contrasts(co2$Fuel.Type)
## Z D E N
## X 0 0 0 0
## Z 1 0 0 0
## D 0 1 0 0
## E 0 0 1 0
## N 0 0 0 1
# Z D E N
#X 0 0 0 0
#Z 1 0 0 0
#D 0 1 0 0
#E 0 0 1 0
#N 0 0 0 1
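To see how this coding ends up inside a regression, the following sketch builds the design matrix for a toy factor with the same five levels. The six observations are invented for illustration; only the level order matches the code above.

```r
# Toy observations, one per fuel type plus a repeat (illustration only)
fuel <- factor(c("X", "Z", "D", "E", "N", "Z"),
               levels = c("X", "Z", "D", "E", "N"))

# model.matrix() expands the factor into an intercept column
# plus one 0/1 indicator column per non-reference level
design <- model.matrix(~ fuel)
print(design)
```

Each row for X has zeros in every indicator column, which is why the intercept of the fitted model is the estimated mean for the reference group.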
QUESTION 4: Create a side-by-side box-plot to compare the distributions of CO2 Emissions (g/km) across fuel types. Feel free to use color! What are your observations?
ggplot(data=co2, aes(x=Fuel.Type, y=CO2.Emissions.g.km., color=Fuel.Type))+
geom_boxplot()
#From this box-plot, we can see that premium gasoline (Z) and ethanol (E) tend to have the highest CO2 emissions: they have the highest Q1, median, and Q3 values, with premium having the highest maximum values. Regular gasoline has the lowest minimum, Q1, and median values but the second-highest maximum, indicating that it has a fairly wide range. Regular and premium gasoline have wide ranges with multiple outliers, whereas diesel, ethanol, and natural gas have much narrower ranges (although ethanol also has outliers).
QUESTION 5: Perform the appropriate hypothesis test whether there is a significant difference in the average carbon emissions across fuels types.
mod1 <- lm(CO2.Emissions.g.km.~Fuel.Type, data = co2)
sumMod1<-summary(mod1)
anova(mod1)
## Analysis of Variance Table
##
## Response: CO2.Emissions.g.km.
## Df Sum Sq Mean Sq F value Pr(>F)
## Fuel.Type 4 1888452 472113 148.95 < 2.2e-16 ***
## Residuals 7380 23392397 3170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##Please state the following values:
#Name of hypothesis test- One-way ANOVA F-test
#Hypothesis (null & alternative)- H0: B1=B2=B3=B4=0 (all fuel types have the same mean CO2 emissions); HA: at least one Bj=/0. Where H0 is the null hypothesis, HA is the alternative hypothesis, Bj is beta sub j (the shift in mean for fuel type j relative to X), and =/ means not equal to
#Test statistic- F = 148.95
#(The t statistics for the individual shifts from X, taken from sumMod1, are Z: 22.666; D: 0.558; E: 13.011; N: -0.393)
#Reference distribution- F distribution with 4 and 7380 degrees of freedom
#P value- < 2.2e-16
##Communicate your findings in sentence form (5 part summary)
#We reject the null hypothesis with a p-value of < 2.2e-16 at the .05 significance level (F = 148.95 on 4 and 7380 degrees of freedom). There is convincing evidence that the mean CO2 emissions differ for at least one fuel type.
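The F statistic in the ANOVA table can be reproduced by hand from the sums of squares and degrees of freedom that `anova(mod1)` printed. This is only an arithmetic check using the rounded values shown above, so the last digits may differ slightly.

```r
# Values copied from the anova(mod1) table
ss_fuel  <- 1888452    # Sum Sq for Fuel.Type
df_fuel  <- 4
ss_resid <- 23392397   # Sum Sq for Residuals
df_resid <- 7380

# F = (mean square between groups) / (mean square within groups)
ms_fuel  <- ss_fuel / df_fuel
ms_resid <- ss_resid / df_resid
f_stat   <- ms_fuel / ms_resid

# Upper-tail probability from the F(4, 7380) reference distribution
p_val <- pf(f_stat, df_fuel, df_resid, lower.tail = FALSE)

print(f_stat)
print(p_val)
```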
QUESTION 6: Please write the form of a linear model and annotate the following components: explanatory variables (), response variable ( ), parameters ( ), error term ().
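#General form: y = B0 + B1*Z + B2*D + B3*E + B4*N + e, where Z, D, E, and N are 0/1 indicators for fuel type (X is the reference)
#Fitted model for each fuel type (x = 1 for that fuel type, 0 for regular gasoline):
#Z: y=235.1193 + 30.9241x
#D: y=235.1193 + 2.4292x
#E: y=235.1193 + 39.9726x
#N: y=235.1193 + -22.1193x
#Explanatory Variable: Fuel Type (through the indicators Z, D, E, N)
#Response Variable: CO2 Emissions (g/km), y
#Parameters: B0 (intercept, the mean for X), estimated as 235.1193; estimated shifts Z-> 30.9241; D-> 2.4292; E ->39.9726; N-> -22.1193
#Error Term: e, the random deviation of each vehicle from its group mean. The values Z-> 1.3643; D-> 4.3571; E ->3.0722; N-> 56.3078 are the standard errors of the estimated shifts, not the error term.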
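Because the indicators are 0/1, each fitted equation reduces to a single estimated group mean: the intercept plus that fuel type's shift. The sketch below does this arithmetic with the coefficient estimates quoted above; the vector names are just labels chosen for this example.

```r
# Coefficient estimates quoted from the fitted model (X is the reference)
intercept <- 235.1193
shifts <- c(X = 0, Z = 30.9241, D = 2.4292, E = 39.9726, N = -22.1193)

# Estimated mean CO2 emissions (g/km) for each fuel type
group_means <- intercept + shifts
print(group_means)
```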
QUESTION 7: List the assumptions for fitting a linear model.
#1- Linearity assumption: look for signs that the data might be non-linear (residual plot)
#2- Mean=0 assumption: even spread above and below the y=0 line (residual plot)
#3- Constant spread assumption: constant spread of the residuals (residual plot)
#4- Normality assumption: points would ideally fall on diagonal line (QQ Plot)
#5- Outliers assumption: identify problematic outliers that would highly influence a model (leverage vs. residual plot)
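The plots named in this list can all be drawn from a fitted `lm` object with base R. The sketch below uses simulated data (the coefficients and noise level are arbitrary assumptions, not estimates from the CO2 data) purely to show which call produces which diagnostic.

```r
# Simulated data for illustration only
set.seed(1)
x <- runif(200, min = 4, max = 20)
y <- 45 + 18 * x + rnorm(200, sd = 20)
fit <- lm(y ~ x)

# Residuals vs fitted: checks linearity, mean zero, and constant spread
plot(fit, which = 1)

# Normal QQ plot of standardized residuals: checks normality
plot(fit, which = 2)

# Residuals vs leverage: flags potentially influential points
plot(fit, which = 5)
```

Calling `plot(fit)` with no `which` argument cycles through the default set of diagnostic plots instead of a single one.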
QUESTION 8: Create a pairs plot for the numeric variables in this data set. What relationships do you observe between the potential explanatory variables and the response (carbon emissions)?
co2%>%
select("Engine.Size.L.", "Cylinders", "Fuel.Consumption.City..L.100.km.", "Fuel.Consumption.Hwy..L.100.km.", "Fuel.Consumption.Comb..L.100.km.", "Fuel.Consumption.Comb..mpg.", "CO2.Emissions.g.km.")%>%
pairs()
#There seems to be a linear relationship between CO2 emissions and each of the three fuel consumption variables measured in L/100km: Fuel Consumption Comb, Fuel Consumption Hwy, and Fuel Consumption City.
QUESTION 9: In this dataset we have both Fuel Consumption Comb (L/100km) and Fuel Consumption Comb (mpg). Look at the pairs plot for the relationships between these variables and the response, CO2 Emissions (g/km). Which one of these would you want to include as an explanatory variable in your model? Why?
#I would include Fuel Consumption Comb (L/100km) as the explanatory variable because its scatterplot against CO2 emissions shows a linear relationship, while the scatterplot for Fuel Consumption Comb (mpg) is clearly non-linear, which would violate the linearity assumption.
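One way to see why the mpg plot is curved: mpg is a reciprocal of L/100km, so if CO2 is roughly linear in one it cannot be linear in the other. The sketch below assumes the mpg column uses imperial gallons (an inference from the first six rows printed by `head(co2)`, not something the data source states here) and checks the conversion against those rows.

```r
# First six rows of the two combined-consumption columns, from head(co2)
l_per_100km <- c(8.5, 9.6, 5.9, 11.1, 10.6, 10.0)
mpg_listed  <- c(33, 29, 48, 25, 27, 28)

# Assumed conversion: 1 imperial gallon = 4.54609 L, 1 mile = 1.609344 km
const <- 100 * 4.54609 / 1.609344   # about 282.48

# mpg is inversely proportional to L/100km
mpg_computed <- round(const / l_per_100km)
print(mpg_computed)
print(all(mpg_computed == mpg_listed))
```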
QUESTION 10: Create a scatter plot to look at the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km). Describe this scatterplot.
ggplot(co2, aes(x = Fuel.Consumption.Comb..L.100.km., y = CO2.Emissions.g.km.))+
geom_point()
#We can see that there is a strong, positive, linear relationship between these two variables, and that there may be a few outliers.
QUESTION 11: Create a simple linear model for the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km). Write the equation for the estimated fitted model.
mod2 <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km., data = co2)
sumMod2<- summary(mod2)
# y= 46.76315 + 18.57132x, where y is the predicted CO2 emissions (g/km) and x is Fuel Consumption Comb (L/100km)
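As a usage example, the fitted equation can be wrapped in a small function to predict emissions for a given fuel consumption. The function name `predict_co2` is made up for this sketch; with the fitted object available, `predict(mod2, newdata = ...)` would do the same job.

```r
# Coefficients of the fitted simple linear model, copied from above
b0 <- 46.76315
b1 <- 18.57132

# Hypothetical helper: predicted CO2 (g/km) for a combined
# fuel consumption given in L/100km
predict_co2 <- function(l_per_100km) {
  b0 + b1 * l_per_100km
}

# Prediction for the first vehicle in head(co2), which uses 8.5 L/100km
pred_first <- predict_co2(8.5)
print(pred_first)

# Its observed emissions are 196 g/km, so the residual is:
print(196 - pred_first)
```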
QUESTION 12: Perform a hypothesis test for the slope. State the reference distribution, degrees of freedom, the test statistic, and p-value in the form of a five-part conclusion in the context of the problem.
print(sumMod2)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km.,
## data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.619 -6.048 1.952 11.667 62.954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.76315 1.05937 44.14 <2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 18.57132 0.09334 198.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.2 on 7383 degrees of freedom
## Multiple R-squared: 0.8428, Adjusted R-squared: 0.8428
## F-statistic: 3.959e+04 on 1 and 7383 DF, p-value: < 2.2e-16
anova(mod2)
## Analysis of Variance Table
##
## Response: CO2.Emissions.g.km.
## Df Sum Sq Mean Sq F value Pr(>F)
## Fuel.Consumption.Comb..L.100.km. 1 21307172 21307172 39588 < 2.2e-16 ***
## Residuals 7383 3973677 538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Reference distribution: t distribution
#df: 7383
#Test Statistic: 198.97
#p-value: < 2.2e-16
#We reject the null hypothesis that the slope is zero with a p-value of < 2.2e-16 at the 0.05 significance level (t = 198.97 on 7383 degrees of freedom). There is convincing evidence of a linear relationship between CO2 emissions and Fuel.Consumption.Comb..L.100.km..
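The t statistic for the slope is the estimate divided by its standard error, and for a simple linear model its square matches the F statistic in the ANOVA table. The check below uses the rounded values printed by `summary(mod2)`, so it agrees with the output only up to rounding.

```r
# Slope estimate and standard error, copied from summary(mod2)
slope    <- 18.57132
se_slope <- 0.09334
df_resid <- 7383

# Test statistic for H0: slope = 0
t_stat <- slope / se_slope

# Two-sided p-value from the t distribution with 7383 degrees of freedom
p_val <- 2 * pt(-abs(t_stat), df = df_resid)

print(t_stat)     # close to the reported 198.97
print(t_stat^2)   # close to the reported F value of 39588
print(p_val)
```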
QUESTION 13: Should we trust the inference we made in the previous step? To assess this check the model diagnostics.
#Create a residual plot
intercept <- mod2$coefficients[1]
intercept
## (Intercept)
## 46.76315
#Plot the residuals against the fitted values rather than against the response
ggplot(co2, aes(mod2$fitted.values, mod2$residuals, color=Fuel.Type))+
geom_point()+
geom_hline(yintercept = 0,
color = "red")
#Create a qq plot
qqnorm(mod2$residuals)
qqline(mod2$residuals)
#Create a leverage plot (which = 5 selects the residuals vs. leverage plot)
plot(mod2, which = 5)
#These plots indicate that the conditions may not all be met for this model. The residuals are not evenly scattered around y=0, the points on the QQ plot do not follow the diagonal line, and the leverage plot indicates that there are outliers that may be influential.
QUESTION 14: Create a scatter plot to look at the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km) but now color the points based on the Fuel Type. Discuss your observations.
ggplot(co2, aes(x = Fuel.Consumption.Comb..L.100.km., y = CO2.Emissions.g.km., color=Fuel.Type))+
geom_point()
#Each fuel type shows a strong, positive, linear relationship between CO2 emissions and Fuel.Consumption.Comb..L.100.km.. Ethanol (E) and premium fuel (Z) have the clearest outliers.
QUESTION 15: Create a parallel lines model for CO2 Emissions (g/km) using Fuel Consumption Comb (L/100km) and Fuel Type. Write the equations for the estimated fitted models for the fuel types.
mod3 <- lm(CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. + Fuel.Type, data=co2)
summary(mod3)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. +
## Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.595 -2.760 0.045 2.234 44.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.34154 0.27768 19.236 < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 22.78507 0.02601 875.998 < 2e-16 ***
## Fuel.TypeZ 0.43328 0.13763 3.148 0.00165 **
## Fuel.TypeD 30.89114 0.42649 72.432 < 2e-16 ***
## Fuel.TypeE -114.43678 0.34782 -329.016 < 2e-16 ***
## Fuel.TypeN -81.71198 5.49603 -14.867 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.495 on 7379 degrees of freedom
## Multiple R-squared: 0.9912, Adjusted R-squared: 0.9912
## F-statistic: 1.66e+05 on 5 and 7379 DF, p-value: < 2.2e-16
#X: y= 5.34154 + 22.78507x
#Z: y= 5.77482 + 22.78507x
#D: y= 36.23268 + 22.78507x
#E: y= -109.09524 + 22.78507x
#N: y= -76.37044 + 22.78507x
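In a parallel lines model every fuel type shares the slope, and each intercept is the reference intercept plus that fuel type's coefficient. The sketch below reproduces the intercepts from the `summary(mod3)` estimates; with the fitted object available, the same numbers could be read from `coef(mod3)`.

```r
# Estimates copied from summary(mod3); X is the reference fuel type
b0    <- 5.34154
slope <- 22.78507
shifts <- c(X = 0, Z = 0.43328, D = 30.89114, E = -114.43678, N = -81.71198)

# One intercept per fuel type, common slope for all
intercepts <- b0 + shifts
lines_tbl <- data.frame(fuel = names(shifts),
                        intercept = unname(intercepts),
                        slope = slope)
print(lines_tbl)
```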
QUESTION 16: Create a graphic for your parallel lines model showing the fitted models for each type of fuel. Comment on which shifts of intercept are significant.
ggplot(data=co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color = Fuel.Type))+
geom_point()+
ggtitle("co2 Emissions and Fuel Consumption")+
theme_bw()+
geom_abline(intercept = 5.34154, slope=22.78507, color="red", lwd=1)+
geom_abline(intercept = 5.77482, slope=22.78507, color="yellow", lwd=1)+
geom_abline(intercept = 36.23268, slope=22.78507, color="green", lwd=1)+
geom_abline(intercept = -109.09524, slope=22.78507, color="blue", lwd=1)+
geom_abline(intercept = -76.37044, slope=22.78507, color="purple", lwd=1)
#The shifts in intercept let lines with a common slope sit at the right height for each fuel type. Relative to regular gasoline (X), all four intercept shifts are significant at the 0.05 level: Z (p = 0.00165), and D, E, and N (each p < 2e-16).
QUESTION 17: Create an unrelated lines (interaction) model for CO2 Emissions (g/km) using Fuel Consumption Comb (L/100km) and Fuel Type. Write the equations for the estimated fitted models for the fuel types.
mod4 <- lm(CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. * Fuel.Type, data=co2)
summary(mod4)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. *
## Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.127 -2.607 0.659 1.886 25.251
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 0.42897 0.20641 2.078
## Fuel.Consumption.Comb..L.100.km. 23.27221 0.01988 1170.440
## Fuel.TypeZ 0.18413 0.32445 0.568
## Fuel.TypeD -0.54618 1.30980 -0.417
## Fuel.TypeE 4.24637 0.92786 4.577
## Fuel.TypeN -82.98605 2.95545 -28.079
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ -0.03526 0.02923 -1.206
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD 3.62697 0.14556 24.918
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE -7.23455 0.05649 -128.077
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN NA NA NA
## Pr(>|t|)
## (Intercept) 0.0377 *
## Fuel.Consumption.Comb..L.100.km. < 2e-16 ***
## Fuel.TypeZ 0.5704
## Fuel.TypeD 0.6767
## Fuel.TypeE 4.8e-06 ***
## Fuel.TypeN < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ 0.2279
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.955 on 7376 degrees of freedom
## Multiple R-squared: 0.9975, Adjusted R-squared: 0.9975
## F-statistic: 3.611e+05 on 8 and 7376 DF, p-value: < 2.2e-16
#X: y= 0.42897 + 23.27221x
#Z: y= 0.6131 + 23.23695x
#D: y= -0.11721 + 26.89918x
#E: y= 4.67534 + 16.03766x
#N: y= -82.55708 + 23.27221x (the slope shift for N is NA, i.e. not estimable, so the fit uses the reference slope)
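In an interaction model each fuel type gets its own slope as well as its own intercept: the reference value plus the matching main-effect or interaction coefficient. The sketch below does this arithmetic with the `summary(mod4)` estimates. Treating the undefined (NA) interaction for N as a zero shift is an assumption made here so that a line can still be written down.

```r
# Estimates copied from summary(mod4); X is the reference fuel type
b0 <- 0.42897
b1 <- 23.27221
int_shift   <- c(X = 0, Z = 0.18413, D = -0.54618, E = 4.24637, N = -82.98605)
slope_shift <- c(X = 0, Z = -0.03526, D = 3.62697, E = -7.23455, N = NA)

# Assumption: an NA (not estimable) slope shift is treated as zero
slope_shift_used <- ifelse(is.na(slope_shift), 0, slope_shift)

intercepts <- b0 + int_shift
slopes     <- b1 + slope_shift_used

print(data.frame(fuel = names(int_shift),
                 intercept = unname(intercepts),
                 slope = unname(slopes)))
```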
QUESTION 18: Create a graphic for your unrelated lines model showing the fitted models for each type of fuel. Comment on which shifts of intercept and/or slope are significant.
ggplot(data=co2, aes(x=Fuel.Consumption.Comb..L.100.km., y=CO2.Emissions.g.km., color=Fuel.Type))+
geom_point()+
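ggtitle("CO2 Emissions and Fuel Consumption")+
theme_bw()+
#X (reference): baseline intercept and slope
geom_abline(slope=mod4$coefficients[2], intercept = mod4$coefficients[1], col=2)+
#Z, D, E: add the intercept shift and the interaction (slope shift) for each fuel type
geom_abline(slope=mod4$coefficients[2]+mod4$coefficients[7], intercept = mod4$coefficients[1]+mod4$coefficients[3])+
geom_abline(slope=mod4$coefficients[2]+mod4$coefficients[8], intercept = mod4$coefficients[1]+mod4$coefficients[4])+
geom_abline(slope=mod4$coefficients[2]+mod4$coefficients[9], intercept = mod4$coefficients[1]+mod4$coefficients[5])+
#N: the slope shift is NA (not estimable), so only the intercept is shifted
geom_abline(slope=mod4$coefficients[2], intercept = mod4$coefficients[1]+mod4$coefficients[6])
#Relative to X, the intercept shifts for E (p = 4.8e-06) and N (p < 2e-16) are significant, while those for Z (p = 0.5704) and D (p = 0.6767) are not. The slope shifts for D and E (each p < 2e-16) are significant, the slope shift for Z is not (p = 0.2279), and the slope shift for N could not be estimated.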
QUESTION 19: Compare models from #12 (A) , #15 (B), and #17 (C) by reporting their Adjusted R-squared values. Which model would you pick? Make an argument based on an assessment of the model assumptions, simplicity, and interpretability as well as model fit.
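#Adjusted R-squared: A (mod2) = 0.8428; B (mod3) = 0.9912; C (mod4) = 0.9975.
#The second model, B (parallel lines), looks the best in this case because each of its lines fits its corresponding points well, with data both above and below the line, and its adjusted R-squared is far higher than that of A and nearly as high as that of C.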
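Besides adjusted R-squared, nested models such as B and C can be compared with a partial F-test. The sketch below approximates that test from the residual standard errors and degrees of freedom printed in the two summaries; because those values are rounded, the result is only approximate, and `anova(mod3, mod4)` on the fitted objects would be the exact version.

```r
# Residual standard error and residual df from summary(mod3) and summary(mod4)
rse_b <- 5.495; df_b <- 7379   # parallel lines model (B)
rse_c <- 2.955; df_c <- 7376   # interaction model (C)

# Residual sum of squares = (residual standard error)^2 * residual df
rss_b <- rse_b^2 * df_b
rss_c <- rse_c^2 * df_c

# Partial F: drop in RSS per extra parameter, relative to C's residual variance
extra_params <- df_b - df_c
f_partial <- ((rss_b - rss_c) / extra_params) / (rss_c / df_c)
p_val <- pf(f_partial, extra_params, df_c, lower.tail = FALSE)

print(f_partial)
print(p_val)
```

A large F value here only says the extra slope terms improve fit; whether that gain outweighs the loss in simplicity is still a judgment call.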