library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
co2<-read.table("https://raw.githubusercontent.com/kitadasmalley/sp21_MATH239/main/co2.csv",
sep=",", quote="\"",
header=TRUE)
#view(co2)
mod <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km., co2)
summary(mod)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km.,
## data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.619 -6.048 1.952 11.667 62.954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.76315 1.05937 44.14 <2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 18.57132 0.09334 198.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.2 on 7383 degrees of freedom
## Multiple R-squared: 0.8428, Adjusted R-squared: 0.8428
## F-statistic: 3.959e+04 on 1 and 7383 DF, p-value: < 2.2e-16
# x = Fuel consumption combined rating (55% city, 45% highway)
# y = CO2 emissions
# intercept = 46.76315
# slope = 18.57132
# error term = 0.09334
Assumptions for fitting a linear model: 1. The mean of our errors is 0. 2. There is constant variance (homogeneity) and constant spread (homoscedasticity) of error throughout the data. 3. Our explanatory variables are uncorrelated, but does not necessarily independent. 4. Our errors come from a normal distribution. 5. There is a linear relationship between x and y.
pairsco2 <- co2 %>%
select(Fuel.Consumption.City..L.100.km., Fuel.Consumption.Hwy..L.100.km., Fuel.Consumption.Comb..L.100.km., Fuel.Consumption.Comb..mpg.)
pairs(pairsco2)
Observations: The relationships between the fuel consumption rating in the city (L/km), the highway (L/km), and combined (L/km), all appear to have positive linear relationships in combination with one another. However, the combined fuel consumption rating (mpg) has a negative exponential relationship in combination with all the other variables.
When given the choice to choose between the combined fuel consumption rating in L/km or mpg as a variable in the model, I would chose L/km because based on the above pairs plot, the L/km has a linear relationship with the other variables while mpg does not. This is important to note because one of our model assumptions is that there is a linear relationship between x and y.
Fuel type is a categorical variable. According to the given data description, there are 5 levels, and they are denoted as follows: X = Regular gasoline, Z = Premium gasoline, D = Diesel, E = Ethanol(E85), N = Natural gas. R will chose Diesel to be the reference group because it comes first alphabetically. This does not make the most sense since a logical choice for the reference group would be regular gasoline. I think most people put regular gasoline in their cars, and so in the context of this data it would be easier loop at shifts in intercept and slope away from the reference of regular gasoline.
Categorical variables are coded using a matrix filled with binary code. The the rows of the matrix represent the categories and the columns represent the variables. There is one reference group, and so if there are n categories, then there needs to be n-1 variables. For example, if the categories are A, B, and C, then there will be 2 variables, x1 and x2. If A, x1 will be 1 and x2 will be 0. If C, x1 will be 0 and x2 will be 1. That means that B is the reference group because if B, both x1 and x2 will be 0.
# contrasts(co2$Fuel.Type)
# This code was working when I ran this chunk but then when I tried to knit the document it ran an error.
# setting regular gasoline as the reference group
co2$Fuel.Type <- factor(co2$Fuel.Type,
levels = c("X", "Z", "D", "E", "N"))
ggplot(co2, aes(Fuel.Type, CO2.Emissions.g.km., fill = Fuel.Type))+
geom_boxplot()+
theme_bw()
Observations: Natural gas seems to have the lowest median emissions, but the sample size is very small. So small, that there is not much that is observable about natural gas in this plot. Then, the median emissions by fuel type increase from regular gasoline, to diesel, then premium gasoline, and finally ethanol. Regular and premium gasoline have the largest spread of data, and the most outliers that exist near the upper bounds of the data spread.
modcat <- lm(CO2.Emissions.g.km.~Fuel.Type, co2)
summary(modcat)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -147.092 -42.119 -8.043 35.881 255.957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 235.1193 0.9335 251.855 <2e-16 ***
## Fuel.TypeZ 30.9241 1.3643 22.666 <2e-16 ***
## Fuel.TypeD 2.4292 4.3571 0.558 0.577
## Fuel.TypeE 39.9726 3.0722 13.011 <2e-16 ***
## Fuel.TypeN -22.1193 56.3078 -0.393 0.694
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.3 on 7380 degrees of freedom
## Multiple R-squared: 0.0747, Adjusted R-squared: 0.0742
## F-statistic: 148.9 on 4 and 7380 DF, p-value: < 2.2e-16
Hypothesis test: T test Hypotheses: Null: There is no relationship between fuel type and CO2 emissions. Alternative: There is a linear relationship between fuel type and CO2 emissions. Test statistic: 148.9 on 4 Reference Distribution: T distribution 5 part summary: We can reject the null hypothesis with a p-value of < 2.2e-16 at the α = 0 significance level.There is strong evidence to suggest that there could be a linear relationship between fuel type and CO2 emissions (i.e. the slope parameter is not equal to zero).
ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km.,CO2.Emissions.g.km.))+
geom_point()+
theme_bw()
Observations: There seems so be a positive linear relationship between the combined fuel consumption rating (L/km) and fuel type. The relationship appears to be very strong as the data is almost forming diagonal lines in the plot. You can see how the data is clustered into groups, which is likely because we are plotting categorical variables (clustering by fuel type). There are a few outliers which are visible in the plot and they appear to be outliers because they are not clustering in the same pattern as the rest of the data.
mod1 <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km., co2)
summary(mod1)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km.,
## data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.619 -6.048 1.952 11.667 62.954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.76315 1.05937 44.14 <2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 18.57132 0.09334 198.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.2 on 7383 degrees of freedom
## Multiple R-squared: 0.8428, Adjusted R-squared: 0.8428
## F-statistic: 3.959e+04 on 1 and 7383 DF, p-value: < 2.2e-16
# equation: y = 18.57132x + 46.76315
Hypothesis test: T test Test statistic: 3.959e+04 on 1 Reference Distribution: T distribution Degrees of Freedom: 7383 degrees of freedom P value: < 2.2e-16 5 part summary: We can reject the null hypothesis with a p-value of < 2.2e-16 at the α = 0 significance level.There is strong evidence to suggest that there could be a linear relationship between combined fuel consumption ratings and CO2 emissions (i.e. the slope parameter is not equal to zero).
plot(mod1)
Observations: Based on these plots, it may not be in our best interests to trust the conclusion that we made in question 12. In the residuals vs. fitted values plot, there appears to be a large cluster of the data which does not follow the y = 0 line. Additionally, this rogue cluster is trending negatively, meaning that the error is increasing and that this data is heteroscedastic rather than homoscedastic. The QQ plot also shows abnormality because a large section of the data lies off of the diagonal line. The residuals vs. leverage plot reveals again that the data is highly clustered, and that each cluster contains a handful of outliers. The cluster that is farthest from the y = 0 line has the most influential outliers.
ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color = Fuel.Type))+
geom_point()+
theme_bw()
Observations:Now, with the addition of color, I can easily see that the clusters in the data are caused by different fuel types.
mod2 <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km. + Fuel.Type, co2)
summary(mod2)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. +
## Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.595 -2.760 0.045 2.234 44.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.34154 0.27768 19.236 < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 22.78507 0.02601 875.998 < 2e-16 ***
## Fuel.TypeZ 0.43328 0.13763 3.148 0.00165 **
## Fuel.TypeD 30.89114 0.42649 72.432 < 2e-16 ***
## Fuel.TypeE -114.43678 0.34782 -329.016 < 2e-16 ***
## Fuel.TypeN -81.71198 5.49603 -14.867 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.495 on 7379 degrees of freedom
## Multiple R-squared: 0.9912, Adjusted R-squared: 0.9912
## F-statistic: 1.66e+05 on 5 and 7379 DF, p-value: < 2.2e-16
# equations:
# X: y = 22.78507x + 5.34154
# Z: y = 22.78507x + 5.77482
# D: y = 22.78507x + 36.23268
# E: y = 22.78507x - 109.0952
# N: y = 22.78507x - 76.37044
ggplot(co2, aes(x=Fuel.Consumption.Comb..L.100.km., y=CO2.Emissions.g.km., color=Fuel.Type))+
geom_point()+
theme_bw()+
# X (reference)
geom_abline(intercept = mod2$coefficients[1], slope=mod2$coefficients[2], alpha = 0.5)+
# Z
geom_abline(intercept = mod2$coefficients[1]+mod2$coefficients[3], slope=mod2$coefficients[2], alpha = 0.5)+
# D
geom_abline(intercept = mod2$coefficients[1]+mod2$coefficients[4], slope=mod2$coefficients[2], alpha = 0.5)+
# E
geom_abline(intercept = mod2$coefficients[1]+mod2$coefficients[5], slope=mod2$coefficients[2], alpha = 0.5)+
# N
geom_abline(intercept = mod2$coefficients[1]+mod2$coefficients[5], slope=mod2$coefficients[2], alpha = 0.5)
From looking at this plot, there appear to be significant shifts in the intercept for all fuel types except premium gasoline (Z). Visually, there appears to only be 3 lines because regular and premium gasoline appear to overlap and ethanol and natural gas also appear to overlap. I adjusted the transparency of the line in order to observe this.
mod3 <- lm(CO2.Emissions.g.km.~Fuel.Consumption.Comb..L.100.km. * Fuel.Type, co2)
summary(mod3)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. *
## Fuel.Type, data = co2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.127 -2.607 0.659 1.886 25.251
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 0.42897 0.20641 2.078
## Fuel.Consumption.Comb..L.100.km. 23.27221 0.01988 1170.440
## Fuel.TypeZ 0.18413 0.32445 0.568
## Fuel.TypeD -0.54618 1.30980 -0.417
## Fuel.TypeE 4.24637 0.92786 4.577
## Fuel.TypeN -82.98605 2.95545 -28.079
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ -0.03526 0.02923 -1.206
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD 3.62697 0.14556 24.918
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE -7.23455 0.05649 -128.077
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN NA NA NA
## Pr(>|t|)
## (Intercept) 0.0377 *
## Fuel.Consumption.Comb..L.100.km. < 2e-16 ***
## Fuel.TypeZ 0.5704
## Fuel.TypeD 0.6767
## Fuel.TypeE 4.8e-06 ***
## Fuel.TypeN < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeZ 0.2279
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeD < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeE < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.:Fuel.TypeN NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.955 on 7376 degrees of freedom
## Multiple R-squared: 0.9975, Adjusted R-squared: 0.9975
## F-statistic: 3.611e+05 on 8 and 7376 DF, p-value: < 2.2e-16
# equations:
# X: y = 22.78507x + 5.34154
# Z: y = 22.74981x + 5.95895
# D: y = 26.41204x + 35.6865
# E: y = 15.55052x - 104.8488
# N: y = 22.78507x - 159.3565
ggplot(co2, aes(x=Fuel.Consumption.Comb..L.100.km., y=CO2.Emissions.g.km., color=Fuel.Type))+
geom_point()+
theme_bw()+
# X (reference)
geom_abline(intercept = mod3$coefficients[1], slope=mod3$coefficients[2], alpha = 0.5)+
# Z
geom_abline(intercept = mod3$coefficients[1]+mod3$coefficients[3], slope=mod3$coefficients[2]+mod3$coefficients[7], alpha = 0.5)+
# D
geom_abline(intercept = mod3$coefficients[1]+mod3$coefficients[4], slope=mod3$coefficients[2]+mod3$coefficients[8], alpha = 0.5)+
# E
geom_abline(intercept = mod3$coefficients[1]+mod3$coefficients[5], slope=mod3$coefficients[2]+mod3$coefficients[9], alpha = 0.5)+
# N
geom_abline(intercept = mod3$coefficients[1]+mod3$coefficients[6], slope=mod3$coefficients[2], alpha = 0.5)
From this plot, one can see that there is not a noticeable shift of the slope or the intercept between regular and premium gasoline. Additionally, because the sample size for natural gas is so small, there is no change in the slope from the reference group. There are significant shifts in the slope in the diesel and ethanol categories away from the reference slope. Diesel has a larger slope coefficient than regular gasoline and ethanol has a smaller slope coefficient.
Adjusted r squared values: mod1: 0.8428 mod2: 0.9912 mod3: 0.9975 Adjusted r squared shows percentage of the outputs that are explained by the inputs (how well the model fits the data). According to the adjusted r squared values, the 3rd model fits the data the best. However, when I look at the model output from each of the 3 models, I can see that the 3rd model has more variables that are not significant than the first 2 models. In the first two models, each explanatory variable included in the model is significant at least at the alpha = 0.001 significance level. In the third model, some of the explanatory variables are not significant. Each of the 3 models have a p value of < 2.2e-16. Based on this information, I would chose the 3rd model to represent this data, but I would exclude some of the explanatory variables if I were to refine this model.
# y = 22.78507x + 5.34154
# y = 22.78507(58.8025) + 5.34154
# y = 1345.161 g/km