Problem Set 2

Question 1

Explore the dataset and make sure all variables are stored in a manner that allows you to work with them correctly. Make any changes you need to make.

I converted the horsepower variable to numeric because we cannot perform meaningful computations on it while it is stored as a character. I converted year and cylinders to integers because the number of cylinders and the car model year can only be whole numbers, so conceptually this made the most sense to me.
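The conversions described above can be sketched as follows. This is a minimal illustration on a small made-up data frame (`cars_demo` and its values are hypothetical, not the actual dataset), and it assumes that missing horsepower values are encoded as the string "?", which the source does not confirm.

```r
# Hypothetical stand-in for the real data; values are illustrative only.
cars_demo <- data.frame(
  mpg        = c(18, 15, 25),
  cylinders  = c(8, 8, 4),
  horsepower = c("130", "?", "95"),  # stored as character; "?" is an assumed missing-value marker
  year       = c(70, 70, 71),
  stringsAsFactors = FALSE
)

# as.numeric() turns non-numeric strings such as "?" into NA (with a warning,
# suppressed here because the coercion is intentional).
cars_demo$horsepower <- suppressWarnings(as.numeric(cars_demo$horsepower))

# Cylinders and model year can only be whole numbers, so store them as integers.
cars_demo$cylinders <- as.integer(cars_demo$cylinders)
cars_demo$year      <- as.integer(cars_demo$year)

# Inspect the resulting column types.
str(cars_demo)
```

Rows with a missing horsepower value are then dropped automatically by `lm()` whenever horsepower enters a model.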

Question 2

Create a scatter plot with cylinders on the x-axis and mpg on the y-axis. Add a linear regression line to the plot. Make sure you include axis labels that can be read and correctly interpreted (i.e., include units of measurement where relevant).

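library(ggplot2)  # provides ggplot(), geom_point(), stat_smooth(), labs()
# Scatter plot of mpg against cylinders with a fitted linear regression line
ggplot(data = cars, aes(x = cylinders, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm") +
  labs(title = "Fuel Efficiency (MPG) by Number of Cylinders", x = "Number of Cylinders", y = "Fuel Efficiency in Miles Per Gallon (MPG)")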

Question 3

Run a linear regression with mpg as the dependent variable and cylinders as the only control variable. Interpret the coefficient. Is this coefficient in line with the graphical representation you found in question 2?

cars3 <- lm(mpg ~ cylinders, data = cars)
summary(cars3)

## 
## Call:
## lm(formula = mpg ~ cylinders, data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2607  -3.3841  -0.6478   2.5538  17.9022 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.9493     0.8330   51.56   <2e-16 ***
## cylinders    -3.5629     0.1458  -24.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.942 on 396 degrees of freedom
## Multiple R-squared:  0.6012, Adjusted R-squared:  0.6002 
## F-statistic: 597.1 on 1 and 396 DF,  p-value: < 2.2e-16

I would like to comment on the linear regression model and why it does not fully make sense to me. The number of cylinders is a discrete variable that takes only a few whole-number values, so a single straight line forces the same change in mpg for every additional cylinder. Linear regression can still be used with a discrete predictor, but treating cylinders as a categorical variable might describe this data better. Moreover, the intercept implies that a car with 0 cylinders would get about 42.9 mpg. No such car exists in the data, so the intercept is an extrapolation outside the observed range and has no practical interpretation.
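One way to explore the concern about cylinders taking only a few whole-number values is to compare a model that enters the variable linearly with one that treats it as a categorical factor. The sketch below uses R's built-in `mtcars` data (with its `cyl` column) purely as a stand-in, so its numbers do not correspond to the results in this problem set.

```r
# Built-in mtcars data used as a stand-in; this is NOT the problem set's dataset.

# Cylinders as a numeric predictor: one slope shared by every one-cylinder step.
fit_linear <- lm(mpg ~ cyl, data = mtcars)

# Cylinders as a factor: a separate mean shift for each cylinder count,
# measured relative to the baseline level (the smallest count).
fit_factor <- lm(mpg ~ factor(cyl), data = mtcars)

coef(fit_linear)
coef(fit_factor)

# F-test of whether the more flexible factor model fits significantly better
# than the straight-line model (the linear model is nested in the factor model).
anova(fit_linear, fit_factor)
```

If the F-test shows no significant improvement, the single-slope model is an adequate summary even though the predictor is discrete.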

Nonetheless, the coefficient is in line with the graphical representation I found in Question 2: both show a negative relationship between number of cylinders and fuel efficiency. Each additional cylinder is associated with a decrease of about 3.56 mpg on average, which matches the negative slope of the regression line in the plot. This is consistent with engines that have more cylinders consuming more fuel per mile.

Question 4

Run a linear regression with mpg as the dependent variable and cylinders, weight and year as control variables. Briefly interpret the coefficient for each. Is the coefficient on cylinders statistically significant in this case?

cars4 <- lm(mpg ~ cylinders + weight + year, data = cars)
summary(cars4)

## 
## Call:
## lm(formula = mpg ~ cylinders + weight + year, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9727 -2.3180 -0.0755  2.0138 14.3505 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.925603   4.037305  -3.449 0.000623 ***
## cylinders    -0.087402   0.232075  -0.377 0.706665    
## weight       -0.006511   0.000459 -14.185  < 2e-16 ***
## year          0.753286   0.049802  15.126  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.438 on 394 degrees of freedom
## Multiple R-squared:  0.8079, Adjusted R-squared:  0.8065 
## F-statistic: 552.4 on 3 and 394 DF,  p-value: < 2.2e-16

When controlling for car model year and car weight, each additional cylinder is associated with a decrease in fuel efficiency of about 0.09 mpg. The negative coefficient indicates a negative relationship between number of cylinders and mpg.

When controlling for cylinders and car model year, every 100-pound increase in car weight is associated with a decrease of about 0.65 mpg. The negative coefficient indicates a negative relationship between car weight and mpg.

When controlling for cylinders and car weight, each one-year increase in model year is associated with an increase in fuel efficiency of about 0.75 mpg. The positive coefficient indicates a positive relationship between car model year and mpg.

The coefficient on number of cylinders is not statistically significant in this case: its p-value of 0.707 is greater than any of the significance levels indicated in the summary. This means we cannot reject the hypothesis that the coefficient is zero, so the estimated effect of cylinders is statistically indistinguishable from zero once weight and year are controlled for.
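As a check on the significance statement, the t value and p-value for cylinders can be reproduced by hand from the estimate, standard error, and residual degrees of freedom printed in the Question 4 summary. The variable names below are illustrative; the inputs are the rounded values from that output, so the result matches the reported p-value only up to rounding.

```r
# Values copied from the Question 4 regression summary.
estimate  <- -0.087402  # coefficient on cylinders
std_error <-  0.232075  # its standard error
df_resid  <-  394       # residual degrees of freedom

# t statistic for the null hypothesis that the coefficient equals zero.
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

t_stat
p_value

# Compare against the conventional 5% significance level.
p_value < 0.05
```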

Question 5

What would explain the difference in your results in questions 3 and 4? What condition is necessary for this to occur? Explicitly verify that this condition is met.

The difference between my results in Q3 and Q4 is that Q3 is a bivariate linear regression model, whereas Q4 is a multivariate linear regression model. The multivariate model is more informative because it accounts for the other independent variables, whereas the bivariate model considers the association of only one variable (cylinders) with the dependent variable, mpg. The coefficient on cylinders is much larger in magnitude in the bivariate model because, as the only regressor, it also absorbs the effects of variables that are left out of Q3 but controlled for in Q4. The multivariate model separates the contributions of car model year, car weight, and number of cylinders, so we can see which of these variables mpg is most sensitive to. In the bivariate regression we are led to think that the number of cylinders has a large effect, when in reality its effect is small once weight and year are held constant. The difference is sizeable: in the bivariate model, each additional cylinder is associated with a decrease of 3.56 mpg, whereas in the multivariate model the estimated decrease is only 0.087 mpg and is not statistically significant.

The condition necessary for the cylinders coefficient to change significance between the bivariate and multivariate models is omitted variable bias. Omitted variable bias requires two things: the omitted variables must affect mpg, and they must be correlated with cylinders. It was at play in Q3 because that model omitted relevant variables that were correlated with cylinders, and this omission made the effect of cylinders on fuel efficiency appear significant. The first requirement is visible in Q4, where the effects of year and weight on mpg are statistically significant and the effect of cylinders becomes insignificant. To verify the second requirement, I make cylinders the dependent variable and regress it on year and weight to check for correlation. In this model both year and weight are statistically significant, so the correlation between the omitted variables and cylinders is verified. For every one-pound increase in car weight, the number of cylinders increases by 1.749e-03 on average. For every one-year increase in model year, the number of cylinders decreases by 3.760e-02 on average. The fact that year and weight are correlated with cylinders, and statistically significant when mpg is the dependent variable, verifies the conditions for omitted variable bias.

cars5 <- lm(cylinders ~ weight + year, data = cars)
summary(cars5)

## 
## Call:
## lm(formula = cylinders ~ weight + year, data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.97944 -0.51579 -0.02345  0.45658  2.20968 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.116e+00  8.612e-01   3.619 0.000334 ***
## weight       1.749e-03  4.642e-05  37.691  < 2e-16 ***
## year        -3.760e-02  1.063e-02  -3.537 0.000452 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7455 on 395 degrees of freedom
## Multiple R-squared:  0.8089, Adjusted R-squared:  0.8079 
## F-statistic:   836 on 2 and 395 DF,  p-value: < 2.2e-16
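The mechanics of omitted variable bias can also be checked numerically. In ordinary least squares, the coefficient from the short regression equals the coefficient from the long regression plus, for each omitted variable, its long-regression coefficient times the slope from regressing that omitted variable on the included one. The sketch below verifies this identity on simulated data; the data frame `sim` and every parameter used to generate it are made up for illustration and are not estimates from the real dataset.

```r
set.seed(1)
n <- 200

# Simulated data: heavier and older cars tend to have more cylinders,
# while mpg depends only on weight and year (not directly on cylinders).
cylinders <- sample(c(4, 6, 8), n, replace = TRUE)
weight    <- 1500 + 300 * cylinders + rnorm(n, sd = 200)
year      <- 82 - 0.5 * cylinders + rnorm(n, sd = 2)
mpg       <- -10 - 0.0065 * weight + 0.75 * year + rnorm(n, sd = 3)
sim <- data.frame(mpg, cylinders, weight, year)

short <- lm(mpg ~ cylinders, data = sim)                  # omits weight and year
long  <- lm(mpg ~ cylinders + weight + year, data = sim)  # includes them

# Auxiliary regressions: each omitted variable on the included variable.
aux_weight <- lm(weight ~ cylinders, data = sim)
aux_year   <- lm(year ~ cylinders, data = sim)

# Short coefficient implied by the omitted variable bias formula.
implied <- coef(long)["cylinders"] +
  coef(long)["weight"] * coef(aux_weight)["cylinders"] +
  coef(long)["year"]   * coef(aux_year)["cylinders"]

# The identity holds exactly in-sample, up to floating-point error.
all.equal(unname(implied), unname(coef(short)["cylinders"]))
```

Note that the auxiliary regressions here put the omitted variable on the left-hand side, which is the direction the bias formula uses. Regressing cylinders on weight and year, as done above, establishes the same correlation from the other direction.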

Question 6

A friend asserts that as years went by, car manufacturers became more conscious about producing cars with better fuel efficiency (higher levels of miles per gallon). Use the data to study this question. There is no right/wrong answer here and there are several ways to think about this question. I am interested in your rationale, how you think about data, represent information and justify your answer. Add a short, worded explanation accompanying your answer and feel free to include visualizations if needed.

cars6 <- lm(mpg ~ year, data = cars)
summary(cars6)

## 
## Call:
## lm(formula = mpg ~ year, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.024  -5.451  -0.390   4.947  18.200 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -69.55560    6.58911  -10.56   <2e-16 ***
## year          1.22445    0.08659   14.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.379 on 396 degrees of freedom
## Multiple R-squared:  0.3356, Adjusted R-squared:  0.3339 
## F-statistic:   200 on 1 and 396 DF,  p-value: < 2.2e-16

# Scatter plot of mpg against model year with a fitted linear regression line
ggplot(data = cars, aes(x = year, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm") +
  labs(title = "Fuel Efficiency (MPG) by Car Model Year", x = "Car Model Year", y = "Fuel Efficiency in Miles Per Gallon (MPG)")

cars7 <- lm(mpg ~ year + cylinders + displacement + weight + acceleration + horsepower, data = cars)
summary(cars7)

## 
## Call:
## lm(formula = mpg ~ year + cylinders + displacement + weight + 
##     acceleration + horsepower, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6927 -2.3864 -0.0801  2.0291 14.3607 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.454e+01  4.764e+00  -3.051  0.00244 ** 
## year          7.534e-01  5.262e-02  14.318  < 2e-16 ***
## cylinders    -3.299e-01  3.321e-01  -0.993  0.32122    
## displacement  7.678e-03  7.358e-03   1.044  0.29733    
## weight       -6.795e-03  6.700e-04 -10.141  < 2e-16 ***
## acceleration  8.527e-02  1.020e-01   0.836  0.40383    
## horsepower   -3.914e-04  1.384e-02  -0.028  0.97745    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.435 on 385 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.8063 
## F-statistic: 272.2 on 6 and 385 DF,  p-value: < 2.2e-16

cars8 <- lm(mpg ~ year + weight, data = cars)
summary(cars8)

## 
## Call:
## lm(formula = mpg ~ year + weight, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8777 -2.3140 -0.1211  2.0591 14.3330 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.420e+01  3.968e+00  -3.578 0.000389 ***
## year         7.566e-01  4.898e-02  15.447  < 2e-16 ***
## weight      -6.664e-03  2.139e-04 -31.161  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.435 on 395 degrees of freedom
## Multiple R-squared:  0.8079, Adjusted R-squared:  0.8069 
## F-statistic: 830.4 on 2 and 395 DF,  p-value: < 2.2e-16

cars9 <- lm(weight ~ year, data = cars)
summary(cars9)

## 
## Call:
## lm(formula = weight ~ year, data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1709.18  -674.60   -49.24   617.88  1817.82 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8307.11     833.65   9.965  < 2e-16 ***
## year          -70.21      10.95  -6.409 4.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 807.1 on 396 degrees of freedom
## Multiple R-squared:  0.09398,    Adjusted R-squared:  0.09169 
## F-statistic: 41.08 on 1 and 396 DF,  p-value: 4.164e-10

# Scatter plot of car weight against model year with a fitted linear regression line
ggplot(data = cars, aes(x = year, y = weight)) +
  geom_point() +
  stat_smooth(method = "lm") +
  labs(title = "Car Weight by Car Model Year", x = "Car Model Year", y = "Car Weight (in Pounds)")

To study this question, I first ran a bivariate regression of fuel efficiency on year and recognized that it does not account for other factors that could affect fuel efficiency, so it is subject to omitted variable bias. Therefore, it was important to run a multivariate regression that includes all the other variables in the dataset that could potentially affect fuel efficiency. Of all the variables I included, only weight and year were statistically significant in their relationship with fuel efficiency, so I focused on their respective relationships. To explore further, I used weight as the dependent variable and year as the independent variable, which showed that newer model years are associated with lighter cars: for every one-year increase in model year, weight decreases by about 70 pounds on average. Lighter cars are more fuel efficient, so part of the year-over-year improvement in mpg runs through reduced weight. Holding weight constant, the coefficient on year falls from 1.22 to 0.76 mpg per year but remains statistically significant.
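The link between these three regressions can be made explicit with a little arithmetic on the coefficients reported above. As a rough sketch (the variable names are illustrative, and the inputs are the rounded values from the printed summaries), the year coefficient from `mpg ~ year` should equal the year coefficient from `mpg ~ year + weight` plus the weight coefficient times the slope of weight on year:

```r
# Coefficients copied from the regression summaries in this question.
total_year     <-  1.22445    # year coefficient in mpg ~ year
direct_year    <-  0.7566     # year coefficient in mpg ~ year + weight
weight_coef    <- -0.006664   # weight coefficient in mpg ~ year + weight
weight_on_year <- -70.21      # year coefficient in weight ~ year

# Part of the yearly mpg gain that runs through lighter cars.
via_weight <- weight_coef * weight_on_year

# Direct part plus weight part should reproduce the bivariate coefficient.
implied_total <- direct_year + via_weight

# Share of the yearly improvement associated with weight reduction.
share_via_weight <- via_weight / implied_total

implied_total
share_via_weight
```

Under this decomposition, weight reduction accounts for somewhat more than a third of the yearly improvement, with the remainder associated with model year itself.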

The friend correctly identifies that newer car models have better fuel efficiency, but neglects to mention that an important driver of this improvement is the reduction in car weight from year to year.