Part 1: Properties of linear models
Question 1: Please write the form of a linear model and annotate the following components: explanatory variables (x_i), response variable (y), parameters (β_0, β_1, …), and error term (ε_i).
Answer 1:
y_i = β_0 + β_1*x_i + ε_i. In a linear model, the response variable (y_i) equals the y-intercept (β_0, the first parameter) plus the slope (β_1, the second parameter) times the explanatory variable (x_i), plus the error term (ε_i).
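To make the pieces of the model concrete, here is a minimal sketch that simulates data from a known line and then recovers the parameters with lm(). The values of beta0, beta1, and the error standard deviation are made up for illustration and do not come from the co2 data.

```r
set.seed(1)                              # make the simulation reproducible
n     <- 200
beta0 <- 5                               # hypothetical intercept
beta1 <- 2                               # hypothetical slope
x     <- runif(n, min = 0, max = 10)     # explanatory variable x_i
eps   <- rnorm(n, mean = 0, sd = 1)      # error term (random noise around the line)
y     <- beta0 + beta1 * x + eps         # response variable y_i

fit <- lm(y ~ x)                         # estimate beta0 and beta1 from the data
est <- coef(fit)
print(est)
```

The estimates should land close to, but not exactly on, the true values, because the error term adds noise to every observation.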
Question 2: List the five assumptions for fitting a linear model.
Answer 2:
Most of our assumptions have to do with how error is handled.
We assume that the errors are random with a mean of zero, so that they fall evenly above and below the regression line. We also assume that the errors have constant variance across x_i, so that one end of the x_i range doesn’t have more spread than the other, or vice versa. We assume that the errors are normally distributed, so that larger errors are less likely to occur than smaller errors. Finally, we assume that the errors are independent of each other and that there isn’t an unaccounted-for variable that could explain the variance.
Most important of all we assume that in the population our two variables actually have a linear relationship.
Question 3: Create a pairs plot for the numeric variables in this dataset. What relationships do you observe between the potential explanatory variables and the response (carbon emissions)?
Answer 3:
library(dplyr)  # provides %>% and select_if()
co2 <- read.table("https://raw.githubusercontent.com/kitadasmalley/sp21_MATH239/main/co2.csv", sep=",", quote="\"", header=TRUE)
# Keep only the numeric columns for the pairs plot
co2numeric <- co2 %>%
  select_if(is.numeric)
pairs(co2numeric)
From the above chart it looks like there’s a linear relationship between engine size and CO2 per km, and a linear relationship between number of cylinders and CO2 per km. Also, not surprisingly, there’s a linear relationship between fuel consumed (L/100km) and CO2 per km. Finally, there appears to be a curved (non-linear) relationship between miles per gallon and CO2 per km.
Question 4: In this data set we have both Fuel Consumption Comb (L/100km) and Fuel Consumption Comb (mpg). Look at the pairs plot for the relationships between these variables and the response. Which one of these would you want to include as an explanatory variable in your model? Why?
Answer 4:
I would want to include Fuel Consumption Comb (L/100km) over Fuel Consumption Comb (mpg) for two reasons. First, I want to make sure that both of my variables are in metric units. Secondly, in my pairs plot I can see that there is a non-linear relationship between Fuel Consumption Comb (mpg) and co2 per km.
Extra Credit: In the data description, it says that Fuel Consumption Comb (L/100km) is a combined rating (55% city, 45% hwy). Why would it not be appropriate to include Fuel Consumption City (L/100km), Fuel Consumption Hwy (L/100km), and Fuel Consumption Comb (L/100km) in the same model? How is this related to VIF?
Extra Credit Answer:
You would not want to include these variables together because they are highly correlated with each other, which is what the Variance Inflation Factor (VIF) measures. Since Fuel Consumption Comb is derived from both other variables, you’d essentially be accounting for Fuel Consumption City and Fuel Consumption Hwy twice. This multicollinearity would produce very large VIFs, meaning the standard errors of the coefficient estimates are inflated and the individual estimates become unstable and hard to interpret.
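The VIF for one explanatory variable can be computed by hand as 1 / (1 - R^2), where R^2 comes from regressing that variable on the other explanatory variables. The sketch below uses made-up city and highway ratings (not the co2 data) and assumes the combined rating is the 55/45 blend rounded to one decimal place.

```r
set.seed(2)
n    <- 300
city <- rnorm(n, mean = 12, sd = 3)                 # hypothetical city rating (L/100km)
hwy  <- 0.7 * city + rnorm(n, mean = 0, sd = 0.8)   # hypothetical hwy rating, correlated with city
comb <- round(0.55 * city + 0.45 * hwy, 1)          # combined rating; rounding is an assumption

# VIF for comb: regress it on the other two explanatory variables
r2_comb  <- summary(lm(comb ~ city + hwy))$r.squared
vif_comb <- 1 / (1 - r2_comb)
print(vif_comb)
```

Because comb is almost an exact linear combination of city and hwy, R^2 is nearly 1 and the VIF is enormous, far above the value of 10 that is often used as a rule-of-thumb warning level.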
Part 2: Working with categorical explanatory variable. How does fuel type affect CO2 emissions?
Question 5: Fuel type is a categorical variable. What are the levels and how many are there? Which group will R choose to be the reference group? Does this make sense in the context of these data?
Answer 5:
There are five levels of fuel type: “X” for regular gasoline, “Z” for premium gasoline, “D” for diesel, “E” for ethanol, and “N” for natural gas. R chooses the alphabetically first level as the reference group. This doesn’t make a lot of sense here, since that’s “D” for diesel, while “X” for regular is the most common gas type.
Question 6: In your own words, describe how categorical variables with multiple levels are coded.
Answer 6:
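A categorical variable with k levels is coded using k - 1 “dummy” (indicator) variables. The alphabetically first level becomes the reference group, and each remaining level gets its own variable that equals 1 when an observation belongs to that level and 0 otherwise.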
Question 7: Using the method you described above write out this coding.
Answer 7:
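Following from question six, in this example “D” is the reference group and is coded as all zeros. The other four levels each get an indicator variable: “E” is coded (1, 0, 0, 0), “N” is coded (0, 1, 0, 0), “X” is coded (0, 0, 1, 0), and “Z” is coded (0, 0, 0, 1).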
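R’s own coding of a factor can be inspected with model.matrix(). This is a small sketch with one made-up observation per fuel type rather than the full co2 data.

```r
# One observation per fuel type, using the five codes from the data set
fuel   <- factor(c("D", "E", "N", "X", "Z"))  # levels default to alphabetical order
design <- model.matrix(~ fuel)                # intercept plus k - 1 indicator columns
print(design)
```

The first row (diesel, the reference group) has zeros in every indicator column, and each other row has a single 1 in the column for its own level.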
Question 8: Create a side-by-side boxplot to compare the distributions of carbon emissions across fuel types.
Answer 8:
library(ggplot2)
# Reorder the levels so regular gasoline ("X") becomes the reference group
co2$Fuel.Type <- factor(co2$Fuel.Type, levels = c("X", "Z", "D", "E", "N"))
ggplot(co2, aes(Fuel.Type, CO2.Emissions.g.km., fill=Fuel.Type)) + geom_boxplot()
From the above plot I learned that ethanol has the highest average CO2 per km, while regular gas has the lowest on average. Additionally, it appears that there is only one natural gas car in the data.
Question 9: Perform the appropriate hypothesis test whether there is a significant difference in the average carbon emissions across fuel types. Please state the following components: Name of the hypothesis test, hypothesis (null and alternative), provide the test statistic, reference distribution, p-value. Then communicate your findings in sentence form.
Answer 9:
co2lm1 <- lm(CO2.Emissions.g.km. ~ Cylinders + Fuel.Type, co2)
summary(co2lm1)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Cylinders + Fuel.Type, data = co2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -111.413  -20.178   -1.178   18.660  150.926
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  100.7516     1.2250  82.245   <2e-16 ***
## Cylinders     26.6653     0.2185 122.054   <2e-16 ***
## Fuel.TypeZ     0.1034     0.8249   0.125    0.900
## Fuel.TypeD     3.9275     2.5079   1.566    0.117
## Fuel.TypeE    -0.4978     1.7991  -0.277    0.782
## Fuel.TypeN   -47.7436    32.4105  -1.473    0.141
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.41 on 7379 degrees of freedom
## Multiple R-squared: 0.6935, Adjusted R-squared: 0.6933
## F-statistic: 3339 on 5 and 7379 DF, p-value: < 2.2e-16
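Question 9 asks for an overall test of whether mean emissions differ across groups, which is usually done with a one-way ANOVA F-test. The sketch below shows the mechanics on made-up data with three hypothetical groups. The group means and standard deviation are assumptions and do not come from the co2 data.

```r
set.seed(3)
# Hypothetical emissions (g/km) for three made-up groups, 30 vehicles each
group <- factor(rep(c("A", "B", "C"), each = 30))
y     <- rnorm(90, mean = rep(c(200, 250, 300), each = 30), sd = 20)

# H0: all group means are equal; HA: at least one group mean differs
tab    <- anova(lm(y ~ group))
f_stat <- tab[["F value"]][1]   # test statistic
p_val  <- tab[["Pr(>F)"]][1]    # p-value from the F reference distribution
print(tab)
```

The reference distribution is an F distribution whose degrees of freedom are the two values in the Df column (groups minus one, and observations minus groups).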
Part 3: Fitting models
Question 10: Create a scatter plot to look at the relationship between Fuel Consumption Comb (L/100km) and co2 Emissions (g/km). Describe the scatter plot above (direction, form, strength, outliers).
Answer 10:
scatterplot1 <- ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km.)) + geom_point(alpha=0.2)
scatterplot1
In the above scatter plot there is a positive correlation between CO2 emissions and fuel consumption. Interestingly, there seem to be two (or maybe three) distinct lines, possibly related to fuel type. The relationship within each line seems very strong, as the points show little scatter.
Model A: Simple Linear Regression (SLR)
Question 11: Create a SLR model for the relationship between Fuel Consumption Comb (L/100km) and co2 Emissions (g/km). Write the equation for the estimated fitted model.
Answer 11:
co2lm2 <- lm(CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km., co2)
summary(co2lm2)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km.,
##     data = co2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -141.619   -6.048    1.952   11.667   62.954
##
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                      46.76315    1.05937   44.14   <2e-16 ***
## Fuel.Consumption.Comb..L.100.km. 18.57132    0.09334  198.97   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.2 on 7383 degrees of freedom
## Multiple R-squared: 0.8428, Adjusted R-squared: 0.8428
## F-statistic: 3.959e+04 on 1 and 7383 DF, p-value: < 2.2e-16
The equation of the fitted model is:
CO2 Emissions (g/km) = 46.76 + 18.57 * Fuel Consumption Comb (L/100km)
This means that for every additional liter of gas it takes a vehicle to move 100km, the predicted CO2 emissions increase by 18.57 g/km, on top of a baseline of 46.76 g/km (the intercept) that applies regardless of how fuel efficient the vehicle is.
Question 12: Perform a hypothesis test for the slope. State the reference distribution, degrees of freedom, the test statistic, and p-value in the form of a five-part conclusion in the context of the problem.
Answer 12:
From the output above, the reference distribution is a t distribution with 7383 degrees of freedom (7385 observations minus 2 estimated parameters), which is very close to a Z distribution (aka standard normal distribution). Our test statistic is a t statistic, since we are testing a single slope and an F-test would be unnecessary. It’s equal to 198.97, with a p-value of <.001.
Using the data above I can write an informative conclusion. My null hypothesis is that there is no linear relationship between Fuel Consumption per 100km and CO2 emissions per km (β_1 = 0). My alternative hypothesis is that there is a relationship between Fuel Consumption per 100km and CO2 emissions per km (β_1 ≠ 0). I then decided on an acceptable alpha value. I’ll choose .01, which means I’m accepting a 1% chance of a Type I error. I then find the t-critical value (two-tailed, alpha=.01, DF=7383), which is about ±2.576. The next step is to find the t-value of the slope, which is 198.97. Since our t-value is greater than our t-critical value, we can reject our null hypothesis in favor of our alternative hypothesis, acknowledging a 1% chance of a Type I error.
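The critical value and p-value can be checked directly from the t distribution. This is a minimal sketch that plugs in the t value and degrees of freedom reported in the summary output; the variable names are just for illustration.

```r
alpha  <- 0.01
df     <- 7383
t_stat <- 198.97                       # t value for the slope, from the summary output

t_crit <- qt(1 - alpha / 2, df = df)   # two-tailed critical value
p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
reject <- abs(t_stat) > t_crit         # TRUE means we reject the null hypothesis
print(c(t_crit = t_crit, p_val = p_val))
```

With this many degrees of freedom the critical value is essentially the same as the standard normal one, which is why the Z approximation works here.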
Question 13: Should we trust the inference we made in the previous step? To assess this check the model diagnostics (residual plot, qq plot, and leverage plot)
Answer 13:
# Residuals vs. fitted values, with a zero reference line and a smoother
ggplot(data.frame(fitted=fitted(co2lm2), residuals=resid(co2lm2)), aes(fitted, residuals)) + geom_point() + xlab("Fitted") + ylab("Residuals") + geom_abline(slope=0, intercept=0, color='red') + geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
qqnorm(co2lm2$residuals)
qqline(co2lm2$residuals)
par(mfrow=c(2,2))
plot(co2lm2)
No, we should not trust the inferences made in the above question. The residual plot shows us that the errors are not distributed evenly above and below the line of the linear model. The Q-Q plot shows us that our residuals are not normally distributed, as a significant part of them departs from the line. Finally, our leverage plot shows us that there are many points with too much influence, falling outside the Cook’s distance contours.
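Influence can also be checked numerically with cooks.distance(), which returns one value per observation. The sketch below uses made-up data with a single planted outlier, so the numbers are illustrative only and say nothing about the co2 model.

```r
set.seed(4)
x <- 1:50
y <- 2 * x + rnorm(50, mean = 0, sd = 1)
y[50] <- y[50] + 40                    # plant one unusual point at the edge of x

fit   <- lm(y ~ x)
cd    <- cooks.distance(fit)           # one Cook's distance per observation
worst <- unname(which.max(cd))         # index of the most influential point
print(worst)
```

The planted point combines a large residual with high leverage (it sits at the end of the x range), which is exactly the combination Cook’s distance is designed to flag.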
Multiple Linear Regression
Question 14a: Create a scatter plot to look at the relationship between Fuel Consumption Comb (L/100km) and CO2 Emissions (g/km) but now color the points based on the fuel types.
scatterplot2 <- ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color=Fuel.Type)) + geom_point(alpha=0.2)
scatterplot2
Answer 14a:
I totally called this: the three different lines correspond to different fuel types. It’s interesting that regular and premium gas appear to be identical. This got me interested in what the difference actually is. This MIT article claims that cars with higher compression ratios (whatever that means) can burn premium gasoline at higher temperatures and from that gain more mechanical advantage from each liter of gas. This data seems to contradict that: if the cars really were getting more mechanical advantage per unit of gas, they’d travel farther per liter and have lower CO2 emissions per km and lower fuel consumption per km. However, our data shows that premium gas cars generally have higher CO2 per km and higher fuel consumption per km.
Model B: Parallel Lines
Question 14b: Create parallel lines models for CO2 Emissions (g/km) using Fuel Consumption Comb (L/100km) and Fuel Type. Write the equations for the estimated fitted models for the fuel types.
Answer 14b:
co2lm3 <- lm(CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. + Fuel.Type, co2)
summary(co2lm3)
##
## Call:
## lm(formula = CO2.Emissions.g.km. ~ Fuel.Consumption.Comb..L.100.km. +
##     Fuel.Type, data = co2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -67.595  -2.760   0.045   2.234  44.852
##
## Coefficients:
##                                    Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                         5.34154    0.27768   19.236  < 2e-16 ***
## Fuel.Consumption.Comb..L.100.km.   22.78507    0.02601  875.998  < 2e-16 ***
## Fuel.TypeZ                          0.43328    0.13763    3.148  0.00165 **
## Fuel.TypeD                         30.89114    0.42649   72.432  < 2e-16 ***
## Fuel.TypeE                       -114.43678    0.34782 -329.016  < 2e-16 ***
## Fuel.TypeN                        -81.71198    5.49603  -14.867  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.495 on 7379 degrees of freedom
## Multiple R-squared: 0.9912, Adjusted R-squared: 0.9912
## F-statistic: 1.66e+05 on 5 and 7379 DF, p-value: < 2.2e-16
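Regular gas: CO2 (g/km) = 5.34 + 22.79 * Fuel Consumption Comb (L/100km)
Premium gas: CO2 (g/km) = (5.34 + 0.43) + 22.79 * Fuel Consumption Comb (L/100km) = 5.77 + 22.79 * Fuel Consumption Comb (L/100km)
Diesel: CO2 (g/km) = (5.34 + 30.89) + 22.79 * Fuel Consumption Comb (L/100km) = 36.23 + 22.79 * Fuel Consumption Comb (L/100km)
Ethanol: CO2 (g/km) = (5.34 - 114.44) + 22.79 * Fuel Consumption Comb (L/100km) = -109.10 + 22.79 * Fuel Consumption Comb (L/100km)
Natural gas: CO2 (g/km) = (5.34 - 81.71) + 22.79 * Fuel Consumption Comb (L/100km) = -76.37 + 22.79 * Fuel Consumption Comb (L/100km)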
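In a parallel lines model every fuel type shares the same slope, and each type’s intercept is the reference intercept plus that level’s coefficient. The arithmetic can be sketched with the coefficients copied from the summary(co2lm3) output above; the vector names are only labels chosen for illustration.

```r
# Coefficients copied from the summary(co2lm3) output
b0     <- 5.34154                      # intercept for the reference group (regular, "X")
shifts <- c(Regular = 0, Premium = 0.43328, Diesel = 30.89114,
            Ethanol = -114.43678, NaturalGas = -81.71198)

intercepts <- round(b0 + shifts, 2)    # one intercept per fuel type; the slope is shared
print(intercepts)
```

Each fitted line is then its intercept plus the common slope of about 22.79 times Fuel Consumption Comb (L/100km).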
Question 15: Create a graphic for your parallel lines model showing the fitted models for each type of fuel. Comment on which shifts of intercept are significant.
Answer 15:
# Each fuel type's intercept is the reference intercept plus that type's shift
b <- coef(co2lm3)
scatterplot3 <- ggplot(co2, aes(Fuel.Consumption.Comb..L.100.km., CO2.Emissions.g.km., color=Fuel.Type)) + geom_point(alpha=0.2) +
  geom_abline(slope=b[2], intercept=b[1], color='orange') +        # regular (reference)
  geom_abline(slope=b[2], intercept=b[1] + b[3], color='brown') +  # premium
  geom_abline(slope=b[2], intercept=b[1] + b[4], color='green') +  # diesel
  geom_abline(slope=b[2], intercept=b[1] + b[5], color='cyan')     # ethanol
scatterplot3
Looking at the significance codes from the output of question 14b, all of the intercept shifts are statistically significant, so the variation between gas types is unlikely to have happened by chance. The shifts for diesel, ethanol, and natural gas relative to regular gas all have p-values below .001. The shift for premium relative to regular gas has a p-value of 0.00165, which is significant at the .01 level, although the shift itself is small (0.43 g/km).