This dataset, cars.csv, provides mileage, horsepower, model year, and other technical specifications for cars.
Create a scatter plot with cylinders on the x-axis and mpg on the y-axis. Add a linear regression line to the plot. Make sure you include axis labels that can be read and correctly interpreted (i.e., include units of measurement where relevant).
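A minimal sketch of one way to draw this plot in base R. It assumes cars.csv has been read into a data frame with numeric columns named cylinders and mpg (the names used in the regressions in this write-up). The function name plot_mpg_by_cylinders is hypothetical.

```r
# Sketch: scatter plot of mpg against cylinders with a fitted regression line.
# Assumes a data frame with numeric columns `cylinders` and `mpg`.
plot_mpg_by_cylinders <- function(df) {
  fit <- lm(mpg ~ cylinders, data = df)
  plot(df$cylinders, df$mpg,
       xlab = "Number of cylinders",
       ylab = "Fuel efficiency (miles per gallon)",
       main = "Fuel efficiency by number of cylinders")
  abline(fit, col = "red", lwd = 2)  # OLS line from lm(mpg ~ cylinders)
  invisible(fit)                     # return the fit so it can be inspected
}

# Usage, assuming cars.csv is in the working directory:
# data <- read.csv("cars.csv")
# plot_mpg_by_cylinders(data)
```

Because the line drawn by abline() comes from the same lm() call as the regression in the next question, its slope necessarily matches the reported coefficient.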
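Run a linear regression with mpg as the dependent variable and cylinders as the only control variable. Interpret the coefficient. Is this coefficient in line with the graphical representation you found in question 2?
Here, each additional cylinder is associated with a decrease in fuel efficiency of 3.5629 miles per gallon, on average. The coefficient is negative and statistically significant (p < 2e-16), so mpg and cylinders are negatively correlated. This is in line with question 2, since the regression line on the scatter plot is the same fitted line and therefore slopes downward.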
summary(lm(mpg ~ cylinders, data = data))
Call:
lm(formula = mpg ~ cylinders, data = data)
Residuals:
     Min       1Q   Median       3Q      Max
-14.2607  -3.3841  -0.6478   2.5538  17.9022
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  42.9493     0.8330   51.56   <2e-16 ***
cylinders    -3.5629     0.1458  -24.43   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.942 on 396 degrees of freedom
Multiple R-squared: 0.6012, Adjusted R-squared: 0.6002
F-statistic: 597.1 on 1 and 396 DF, p-value: < 2.2e-16
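To make the interpretation concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the output above. The helper name fitted_mpg is hypothetical, and the cylinder counts 4, 6, and 8 are illustrative values only.

```r
# Coefficients copied from the regression output above.
intercept <- 42.9493
slope     <- -3.5629

# Fitted mpg for a given number of cylinders under the simple model.
fitted_mpg <- function(cylinders) intercept + slope * cylinders

fitted_mpg(c(4, 6, 8))           # about 28.70, 21.57, 14.45 mpg
fitted_mpg(8) - fitted_mpg(4)    # four extra cylinders: 4 * slope
```

These are fitted values from a one-variable model, so they describe the average association in the sample and not a causal effect of adding cylinders.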
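Run a linear regression with mpg as the dependent variable and cylinders, weight, and year as control variables. Briefly interpret the coefficient for each. Is the coefficient on cylinders statistically significant in this case?
Holding the other two variables constant, each additional cylinder is associated with a 0.087 mpg decrease, each additional unit of weight with a 0.0065 mpg decrease, and each additional model year with a 0.75 mpg increase. The weight and year coefficients are statistically significant (p < 2e-16). The cylinders coefficient is not: its p-value is 0.707, well above the conventional 0.05 threshold, so we cannot reject the hypothesis that it is zero.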
summary(lm(mpg ~ cylinders+weight+year, data = data))
Call:
lm(formula = mpg ~ cylinders + weight + year, data = data)
Residuals:
    Min      1Q  Median      3Q     Max
-8.9727 -2.3180 -0.0755  2.0138 14.3505
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.925603   4.037305  -3.449 0.000623 ***
cylinders    -0.087402   0.232075  -0.377 0.706665
weight       -0.006511   0.000459 -14.185  < 2e-16 ***
year          0.753286   0.049802  15.126  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.438 on 394 degrees of freedom
Multiple R-squared: 0.8079, Adjusted R-squared: 0.8065
F-statistic: 552.4 on 3 and 394 DF, p-value: < 2.2e-16
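The significance result for cylinders can be reproduced by hand from the table above. This sketch uses only the reported estimate, standard error, and residual degrees of freedom. The variable names are illustrative.

```r
# Values copied from the cylinders row and the residual line of the output.
estimate  <- -0.087402
std_error <- 0.232075
df_resid  <- 394

# t statistic and two-sided p-value for H0: coefficient = 0.
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the coefficient.
ci_95 <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

t_value   # about -0.377, matching the table
p_value   # about 0.707, matching the table
ci_95     # the interval contains zero
```

A confidence interval that contains zero is another way of seeing that the coefficient is not statistically different from zero at the 5% level.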
What would explain the difference in your results in questions 3 and 4? What condition is necessary for this to occur? Explicitly verify that this condition is met.
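In question 3, cylinders is the only regressor, so every other determinant of mpg is left in the error term. If an omitted variable is correlated with cylinders, its effect is attributed to cylinders and the coefficient suffers from omitted variable bias. In question 4 we include three independent variables, so each coefficient measures the association with mpg while holding the other two constant instead of leaving them in the error term. This is what is known as “controlling for” a variable, and it reduces the bias. The condition necessary for this to occur is that the omitted variable both affects mpg and is correlated with cylinders. The first part is supported by the question 4 output, where weight and year are statistically significant. The second part requires that cylinders be correlated with weight or year in the data, which can be checked with, for example, cor(data$cylinders, data$weight).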
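A minimal sketch of how the omitted-variable condition could be verified explicitly. It assumes a data frame with numeric columns mpg, cylinders, weight, and year. The function name ovb_check is hypothetical. It reports the pairwise correlations and uses the standard OLS identity that the short-regression coefficient equals the long-regression coefficient plus the sum, over the omitted variables, of each one's long-regression coefficient times its slope on cylinders.

```r
# Sketch: check the omitted-variable-bias condition for cylinders.
# Assumes a data frame with numeric columns mpg, cylinders, weight, year.
ovb_check <- function(df) {
  short <- lm(mpg ~ cylinders, data = df)
  long  <- lm(mpg ~ cylinders + weight + year, data = df)

  # Auxiliary regressions: how each omitted variable moves with cylinders.
  delta_weight <- coef(lm(weight ~ cylinders, data = df))[["cylinders"]]
  delta_year   <- coef(lm(year ~ cylinders, data = df))[["cylinders"]]

  # Bias implied by leaving weight and year out of the short regression.
  bias <- coef(long)[["weight"]] * delta_weight +
          coef(long)[["year"]]   * delta_year

  list(
    correlations = cor(df[, c("mpg", "cylinders", "weight", "year")]),
    short_coef   = coef(short)[["cylinders"]],
    long_coef    = coef(long)[["cylinders"]],
    implied_bias = bias
  )
}

# Usage, assuming `data` holds cars.csv:
# res <- ovb_check(data)
# res$correlations
# res$short_coef - (res$long_coef + res$implied_bias)  # zero up to rounding
```

If the correlations of cylinders with weight and year are clearly non-zero, the condition is met. The identity holds exactly only when all regressions use the same rows, so rows with missing values in these columns would need to be dropped first.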
A friend asserts that as years went by, car manufacturers became more conscious about producing cars with better fuel efficiency (higher levels of miles per gallon). Use the data to study this question. There is no right/wrong answer here and there are several ways to think about this question. I am interested in your rationale, how you think about data, represent information, and justify your answer. Add a short, worded explanation accompanying your answer and feel free to include visualizations if needed.
We will first run a regression where mpg is the dependent variable and year, weight, horsepower, displacement, cylinders, and acceleration are the independent variables. Including the other specifications controls for them, so that the coefficient on year reflects the association between year and mpg with those specifications held constant.
lm(mpg ~ year+acceleration+horsepower+weight+displacement+cylinders, data= data)
Call:
lm(formula = mpg ~ year + acceleration + horsepower + weight +
displacement + cylinders, data = data)
Coefficients:
 (Intercept)          year  acceleration    horsepower        weight
  -1.454e+01     7.534e-01     8.527e-02    -3.914e-04    -6.795e-03
displacement     cylinders
   7.678e-03    -3.299e-01
summary(lm(mpg ~ year+acceleration+weight+horsepower+displacement+cylinders, data = data))
Call:
lm(formula = mpg ~ year + acceleration + weight + horsepower +
displacement + cylinders, data = data)
Residuals:
    Min      1Q  Median      3Q     Max
-8.6927 -2.3864 -0.0801  2.0291 14.3607
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.454e+01  4.764e+00  -3.051  0.00244 **
year          7.534e-01  5.262e-02  14.318  < 2e-16 ***
acceleration  8.527e-02  1.020e-01   0.836  0.40383
weight       -6.795e-03  6.700e-04 -10.141  < 2e-16 ***
horsepower   -3.914e-04  1.384e-02  -0.028  0.97745
displacement  7.678e-03  7.358e-03   1.044  0.29733
cylinders    -3.299e-01  3.321e-01  -0.993  0.32122
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.435 on 385 degrees of freedom
(6 observations deleted due to missingness)
Multiple R-squared: 0.8093, Adjusted R-squared: 0.8063
F-statistic: 272.2 on 6 and 385 DF, p-value: < 2.2e-16
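The regression shows that year and weight are statistically significant (p < 2e-16), while acceleration, horsepower, displacement, and cylinders are not, as their p-values are greater than 0.05. Holding the other specifications constant, each additional model year is associated with an increase in fuel efficiency of about 0.75 miles per gallon, which supports the friend's assertion.
The graph shows the relationship between year and mpg. Although a simple plot of mpg against year omits the other variables, their effect is accounted for in the regression above.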
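One way a graph of this kind could be produced is sketched below. It assumes a data frame with numeric columns year and mpg. The function name plot_mpg_by_year is hypothetical, and this is not necessarily how the original graph was drawn.

```r
# Sketch: average mpg by model year, with a fitted trend line.
# Assumes a data frame with numeric columns `year` and `mpg`.
plot_mpg_by_year <- function(df) {
  yearly <- aggregate(mpg ~ year, data = df, FUN = mean)
  plot(yearly$year, yearly$mpg, type = "b",
       xlab = "Model year",
       ylab = "Mean fuel efficiency (miles per gallon)",
       main = "Mean fuel efficiency by model year")
  abline(lm(mpg ~ year, data = df), lty = 2)  # unconditional linear trend
  invisible(yearly)                           # yearly means, for inspection
}

# Usage, assuming `data` holds cars.csv:
# plot_mpg_by_year(data)
```

The dashed line is the trend from a regression of mpg on year alone, so it can differ from the 0.75 coefficient in the full model, which holds the other specifications constant.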