Information about the dataset

This dataset, cars.csv, provides mileage, horsepower, model year, and other technical specifications for cars.

Variables and short description:

  • mpg: fuel efficiency measured in miles per gallon (mpg)
  • cylinders: number of cylinders in the engine
  • displacement: engine displacement (in cubic inches)
  • horsepower: engine horsepower
  • weight: vehicle weight (in pounds)
  • acceleration: time to accelerate from 0 to 60 mph (in seconds)
  • year: car model year

Question 2

Create a scatter plot with cylinders on the x-axis and mpg on the y-axis. Add a linear regression line to the plot. Make sure you include axis labels that can be read and correctly interpreted (i.e., include units of measurement where relevant).
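One way to produce this plot in base R is sketched below. It assumes cars.csv sits in the working directory and is read into a data frame called data, the name used in the regressions later in this document. When the file is not found, a small made-up data frame (not the real data) is used so that the code still runs.

```r
# Load the data if available; otherwise fall back to a small made-up
# data frame (NOT the real cars.csv) so the sketch runs on its own.
if (file.exists("cars.csv")) {
  data <- read.csv("cars.csv")
} else {
  set.seed(1)
  data <- data.frame(cylinders = rep(c(4, 6, 8), each = 10))
  data$mpg <- 43 - 3.5 * data$cylinders + rnorm(30, sd = 3)
}

# Simple regression of mpg on cylinders; this is the line drawn on the plot.
fit <- lm(mpg ~ cylinders, data = data)

# Scatter plot with readable axis labels, including units where relevant.
plot(data$cylinders, data$mpg,
     xlab = "Number of cylinders",
     ylab = "Fuel efficiency (miles per gallon)",
     main = "Fuel efficiency by number of cylinders")
abline(fit, col = "red", lwd = 2)  # least-squares regression line
```

Because cylinders takes only a few distinct values, the points stack in vertical bands; wrapping the x values in jitter() is an optional way to make overlapping points easier to see.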

Question 3

Run a linear regression with mpg as the dependent variable and cylinders as the only control variable. Interpret the coefficient. Is this coefficient in line with the graphical representation you found in Question 2?

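For each one-unit increase in cylinders, fuel efficiency decreases by about 3.56 miles per gallon on average (and increases by the same amount for each one-unit decrease). The coefficient is negative and highly significant (p < 2e-16), so the regression shows that mpg and cylinders are negatively correlated. This is in line with a downward-sloping regression line in the Question 2 scatter plot, since that line is the same least-squares fit.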

summary(lm(mpg ~ cylinders, data = data))

Call:
lm(formula = mpg ~ cylinders, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.2607  -3.3841  -0.6478   2.5538  17.9022 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  42.9493     0.8330   51.56   <2e-16 ***
cylinders    -3.5629     0.1458  -24.43   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.942 on 396 degrees of freedom
Multiple R-squared:  0.6012,    Adjusted R-squared:  0.6002 
F-statistic: 597.1 on 1 and 396 DF,  p-value: < 2.2e-16
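
To make the size of the Question 3 coefficient concrete, the fitted line can be evaluated at a few cylinder counts. The sketch below plugs the intercept and slope reported in the output above into the regression equation; the cylinder counts 4, 6, and 8 are illustrative choices, not values taken from the assignment.

```r
# Coefficients copied from the summary() output above.
intercept <- 42.9493
slope     <- -3.5629

# Illustrative cylinder counts (an assumption for this example).
cyl <- c(4, 6, 8)

# Predicted mpg = intercept + slope * cylinders
predicted <- intercept + slope * cyl
print(round(predicted, 2))  # approximately 28.70, 21.57, 14.45
```

Going from 4 to 8 cylinders therefore lowers the predicted fuel efficiency by 4 × 3.5629, or roughly 14.25 miles per gallon, under this simple model.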

Question 4

Run a linear regression with mpg as the dependent variable and cylinders, weight and year as control variables. Briefly interpret the coefficient for each. Is the coefficient on cylinders statistically significant in this case?

The coefficient on cylinders is not statistically significant: its p-value is 0.707, which is far above the conventional 0.05 threshold (a coefficient is usually considered significant only when its p-value is below 0.05).

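  • Holding weight and year constant, each additional cylinder is associated with a decrease in fuel efficiency of about 0.087 miles per gallon (not statistically significant).
  • Holding cylinders and year constant, each additional pound of weight is associated with a decrease in fuel efficiency of about 0.0065 miles per gallon.
  • Holding cylinders and weight constant, each additional model year is associated with an increase in fuel efficiency of about 0.75 miles per gallon.

summary(lm(mpg ~ cylinders+weight+year, data = data))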

Call:
lm(formula = mpg ~ cylinders + weight + year, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9727 -2.3180 -0.0755  2.0138 14.3505 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -13.925603   4.037305  -3.449 0.000623 ***
cylinders    -0.087402   0.232075  -0.377 0.706665    
weight       -0.006511   0.000459 -14.185  < 2e-16 ***
year          0.753286   0.049802  15.126  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.438 on 394 degrees of freedom
Multiple R-squared:  0.8079,    Adjusted R-squared:  0.8065 
F-statistic: 552.4 on 3 and 394 DF,  p-value: < 2.2e-16
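
The significance statement for cylinders can be reproduced by hand from the numbers in this output. The sketch below recomputes the t statistic and two-sided p-value from the reported estimate, standard error, and residual degrees of freedom, and adds a 95% confidence interval; the interval is an addition for illustration and is not part of the original output.

```r
# Values copied from the cylinders row and the residual df above.
est <- -0.087402
se  <- 0.232075
df  <- 394

t_stat  <- est / se                      # about -0.377
p_value <- 2 * pt(-abs(t_stat), df)      # about 0.707

# 95% confidence interval: estimate +/- critical t value * standard error
ci <- est + c(-1, 1) * qt(0.975, df) * se

print(round(c(t = t_stat, p = p_value), 3))
print(round(ci, 3))  # roughly -0.54 to 0.37, so the interval contains zero
```

An interval that contains zero is another way of seeing that the data cannot distinguish the cylinders coefficient from no effect once weight and year are controlled for.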

Question 5

What would explain the difference in your results in questions 3 and 4? What condition is necessary for this to occur? Explicitly verify that this condition is met.

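The difference comes from omitted variable bias. In Question 3, cylinders is the only regressor, so every other determinant of mpg, including weight and year, is left in the error term, and the cylinders coefficient picks up part of their effect. In Question 4, weight and year are included explicitly, so the cylinders coefficient measures the association with mpg while holding those two variables constant. This is what is known as “controlling for” a variable. Once weight and year are controlled for, the cylinders coefficient shrinks from -3.56 to -0.09 and is no longer statistically significant.

For this to occur, two conditions must hold: the omitted variables must affect mpg, and they must be correlated with cylinders. The first condition is met, since weight and year are both highly significant in Question 4. The second can be verified by computing the correlation between cylinders and each of weight and year.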
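
A minimal sketch of how the condition could be checked explicitly is given below. It assumes the same data frame as in the earlier questions; when cars.csv is not found, it falls back to a made-up data frame (not the real data, with year coded as a two-digit value purely for illustration), so no numbers printed from the fallback should be read as results.

```r
# Load the data if available; otherwise use made-up data (NOT cars.csv)
# in which weight is built to be correlated with cylinders.
if (file.exists("cars.csv")) {
  data <- read.csv("cars.csv")
} else {
  set.seed(1)
  n <- 200
  cylinders <- sample(c(4, 6, 8), n, replace = TRUE)
  data <- data.frame(
    cylinders = cylinders,
    weight    = 1000 + 350 * cylinders + rnorm(n, sd = 200),
    year      = sample(70:82, n, replace = TRUE)
  )
  data$mpg <- -14 - 0.0065 * data$weight + 0.75 * data$year + rnorm(n, sd = 3)
}

# Condition: the omitted variables are correlated with cylinders.
cm <- cor(data[, c("cylinders", "weight", "year")], use = "complete.obs")
print(round(cm, 3))

# Compare the cylinders coefficient without and with the controls.
short <- lm(mpg ~ cylinders, data = data)
long  <- lm(mpg ~ cylinders + weight + year, data = data)
print(c(short = coef(short)[["cylinders"]],
        long  = coef(long)[["cylinders"]]))
```

A correlation between cylinders and weight that is far from zero, together with the significant weight coefficient in Question 4, would confirm that leaving weight out of the Question 3 model biases the cylinders coefficient.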

Question 6

A friend asserts that as years went by, car manufacturers became more conscious about producing cars with better fuel efficiency (higher levels of miles per gallon). Use the data to study this question. There is no right/wrong answer here and there are several ways to think about this question. I am interested in your rationale, how you think about data, represent information and justify your answer. Add a short, worded explanation accompanying your answer and feel free to include visualizations if needed.

We will first run a regression where mpg is the dependent variable and year, weight, horsepower, displacement, cylinders and acceleration are the independent variables. Including the other specifications controls for them, so that the coefficient on year reflects the association between model year and mpg with those variables held constant.

lm(mpg ~ year+acceleration+horsepower+weight+displacement+cylinders, data= data)

Call:
lm(formula = mpg ~ year + acceleration + horsepower + weight + 
    displacement + cylinders, data = data)

Coefficients:
 (Intercept)          year  acceleration    horsepower        weight  
  -1.454e+01     7.534e-01     8.527e-02    -3.914e-04    -6.795e-03  
displacement     cylinders  
   7.678e-03    -3.299e-01  

summary(lm(mpg ~ year+acceleration+weight+horsepower+displacement+cylinders, data = data))

Call:
lm(formula = mpg ~ year + acceleration + weight + horsepower + 
    displacement + cylinders, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6927 -2.3864 -0.0801  2.0291 14.3607 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.454e+01  4.764e+00  -3.051  0.00244 ** 
year          7.534e-01  5.262e-02  14.318  < 2e-16 ***
acceleration  8.527e-02  1.020e-01   0.836  0.40383    
weight       -6.795e-03  6.700e-04 -10.141  < 2e-16 ***
horsepower   -3.914e-04  1.384e-02  -0.028  0.97745    
displacement  7.678e-03  7.358e-03   1.044  0.29733    
cylinders    -3.299e-01  3.321e-01  -0.993  0.32122    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.435 on 385 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.8093,    Adjusted R-squared:  0.8063 
F-statistic: 272.2 on 6 and 385 DF,  p-value: < 2.2e-16
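
The note about six deleted observations means this model is estimated on a slightly smaller sample than the earlier ones. A quick check, using only the residual degrees of freedom printed in the three outputs, recovers each sample size as residual degrees of freedom plus the number of estimated coefficients (intercept included):

```r
# Residual df and coefficient counts taken from the three summary() outputs.
n_q3 <- 396 + 2   # Question 3: intercept + cylinders
n_q4 <- 394 + 4   # Question 4: intercept + cylinders, weight, year
n_q6 <- 385 + 7   # Question 6: intercept + six regressors

print(c(q3 = n_q3, q4 = n_q4, q6 = n_q6))  # 398, 398, 392
print(n_q3 - n_q6)                          # 6 rows dropped for missing values
```

To see which column contains the missing values, colSums(is.na(data)) on the loaded data frame would show the count per variable.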

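The regression shows that acceleration, horsepower, displacement, and cylinders are not statistically significant, as their p-values are greater than 0.05, while year and weight are highly significant. Holding the other variables constant, each additional model year is associated with an increase in fuel efficiency of about 0.75 miles per gallon, which is consistent with the friend's assertion.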

The graph shows the relationship between year and mpg. Although it leaves out the other variables, their effect was accounted for in the regression above.
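
A graph of this kind could be drawn along the following lines. This is a sketch in base R rather than the exact code behind the original figure; it assumes the same data frame as above and falls back to made-up data (not the real cars.csv, with year coded as a two-digit value for illustration) when the file is not found.

```r
# Load the data if available; otherwise use made-up data (NOT cars.csv).
if (file.exists("cars.csv")) {
  data <- read.csv("cars.csv")
} else {
  set.seed(1)
  data <- data.frame(year = rep(70:82, each = 5))
  data$mpg <- -35 + 0.75 * data$year + rnorm(nrow(data), sd = 4)
}

# Average fuel efficiency for each model year.
yearly <- aggregate(mpg ~ year, data = data, FUN = mean)

# Individual cars in grey, yearly averages in blue, linear trend in red.
plot(data$year, data$mpg, col = "grey50",
     xlab = "Model year",
     ylab = "Fuel efficiency (miles per gallon)",
     main = "Fuel efficiency by model year")
lines(yearly$year, yearly$mpg, type = "b", pch = 19, col = "blue")
abline(lm(mpg ~ year, data = data), col = "red", lty = 2)
```

Showing the yearly averages alongside the raw points makes it easier to judge whether the upward trend is steady or driven by a few model years.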