This dataset, cars.csv, provides mileage, horsepower, model year, and other technical specifications for cars.

Variables and short description:

Questions 1) Explore the dataset and make sure all variables are stored in a manner that allows you to work with them correctly. Make any changes you need to make.

Answer Upon inspecting the dataset, I noticed that the horsepower variable was stored as character strings. I converted it to numeric values with the following code.

cars <- read.csv("cars.csv")  # assumes cars.csv is in the working directory
cars$horsepower <- as.numeric(cars$horsepower)
## Warning: NAs introduced by coercion
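
The coercion warning means that some horsepower entries could not be parsed as numbers. A minimal sketch of how one might locate such entries before converting is shown here. The vector `hp_raw` and its values are hypothetical, and the "?" placeholder is an assumption, since the actual non-numeric strings in cars.csv are not shown.

```r
# Hypothetical stand-in for cars$horsepower as read from the CSV
hp_raw <- c("130", "165", "?", "150", "?")

# Convert, suppressing the expected coercion warning
hp_num <- suppressWarnings(as.numeric(hp_raw))

# Entries that became NA only because of the conversion
bad <- is.na(hp_num) & !is.na(hp_raw)

unique(hp_raw[bad])  # the non-numeric strings
sum(bad)             # how many values were lost to coercion
```

Running the same check on the real column would show which strings caused the warning and how many rows will later be dropped by `lm()` for missingness.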
  2. Create a scatter plot with cylinders on the x-axis and mpg on the y-axis. Include a linear regression line in the plot. Make sure you include axis labels that can be read and correctly interpreted (i.e.: include units of measurement where relevant).
library(ggplot2)
ggplot(data = cars, aes(x = cylinders, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm") +
  labs(x = "No. of Cylinders", y = "Fuel Efficiency in Miles per Gallon")
## `geom_smooth()` using formula 'y ~ x'

  3. Run a linear regression with mpg as the dependent variable and cylinders as the only control variable. Interpret the coefficient. Is this coefficient in line with the graphical representation you found in question 2?

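Answer The coefficient on cylinders (β1) in the regression below is negative (-3.56), which is in line with the downward-sloping fitted line in question 2. On average, each additional cylinder is associated with about 3.56 fewer miles per gallon.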

reg1 <- lm(mpg ~ cylinders, data = cars)
summary(reg1)
## 
## Call:
## lm(formula = mpg ~ cylinders, data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2607  -3.3841  -0.6478   2.5538  17.9022 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.9493     0.8330   51.56   <2e-16 ***
## cylinders    -3.5629     0.1458  -24.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.942 on 396 degrees of freedom
## Multiple R-squared:  0.6012, Adjusted R-squared:  0.6002 
## F-statistic: 597.1 on 1 and 396 DF,  p-value: < 2.2e-16
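
To make the interpretation concrete, here is a small sketch that plugs the coefficients reported above into the fitted line. The cylinder counts chosen for illustration (4, 6, 8) are an assumption about typical values and are not taken from the data.

```r
# Coefficients copied from the summary(reg1) output above
b0 <- 42.9493   # intercept
b1 <- -3.5629   # change in mpg per additional cylinder

# Illustrative cylinder counts (assumed, not read from cars.csv)
cyl <- c(4, 6, 8)

# Predicted mpg from the fitted line: mpg = b0 + b1 * cylinders
pred_mpg <- b0 + b1 * cyl
data.frame(cylinders = cyl, predicted_mpg = pred_mpg)
```

With these numbers, the fitted line predicts about 28.7 mpg for a 4-cylinder car and about 14.4 mpg for an 8-cylinder car.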
  4. Run a linear regression with mpg as the dependent variable and cylinders, weight and year as control variables. Briefly interpret the coefficient for each. Is the coefficient on cylinders statistically significant in this case?

Answer In the regression below, the coefficient on cylinders is not statistically significant: its p-value (0.707) is well above even the 0.1 significance level. The coefficients on weight and year, however, are both statistically significant at the 0.001 level.

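Interpretation of Coefficients Holding the other two variables constant in each case: one additional cylinder is associated with 0.087 lower mpg, although this estimate is not statistically distinguishable from zero; a one-unit increase in weight is associated with 0.0065 lower mpg; and a car that is one model year newer gets about 0.75 more mpg on average.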

reg2 <- lm(mpg ~ cylinders + weight + year, data = cars)
summary(reg2)
## 
## Call:
## lm(formula = mpg ~ cylinders + weight + year, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9727 -2.3180 -0.0755  2.0138 14.3505 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.925603   4.037305  -3.449 0.000623 ***
## cylinders    -0.087402   0.232075  -0.377 0.706665    
## weight       -0.006511   0.000459 -14.185  < 2e-16 ***
## year          0.753286   0.049802  15.126  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.438 on 394 degrees of freedom
## Multiple R-squared:  0.8079, Adjusted R-squared:  0.8065 
## F-statistic: 552.4 on 3 and 394 DF,  p-value: < 2.2e-16
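
The significance judgement for cylinders can be checked by hand from the table above. This is a sketch that uses only the reported estimate, standard error, and residual degrees of freedom; the variable names are illustrative.

```r
# Values copied from the cylinders row of summary(reg2) above
est <- -0.087402   # coefficient estimate
se  <- 0.232075    # standard error
df  <- 394         # residual degrees of freedom

# t statistic and two-sided p-value from the t distribution
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df)

c(t = t_stat, p = p_val)
```

The result should reproduce the t value and p-value printed in the table, up to rounding, confirming that the coefficient is far from significant at the 0.1 level.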
  5. What would explain the difference in your results in questions 3 and 4? What condition is necessary for this to occur? Explicitly verify that this condition is met.

Answer The difference between the two results is explained by omitted variable bias (OVB). In Q3 the omitted variables are weight and year. Two conditions are necessary for OVB: the omitted variables must be correlated with the included regressor (cylinders), and they must be determinants of the dependent variable (mpg). The bivariate regression in Q3 leaves out weight and year, so the cylinders coefficient also picks up their effect. The multivariate regression in Q4 corrects this bias and gives a clearer picture of the relationship between cylinders and mpg. The code below checks the first condition by regressing cylinders on weight and year: both coefficients are statistically significant, so the number of cylinders is correlated with weight and year. The second condition is verified by the Q4 output, where weight and year are both statistically significant determinants of mpg.

reg3 <- lm(cylinders ~ weight + year, data = cars)
summary(reg3)
## 
## Call:
## lm(formula = cylinders ~ weight + year, data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.97944 -0.51579 -0.02345  0.45658  2.20968 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.116e+00  8.612e-01   3.619 0.000334 ***
## weight       1.749e-03  4.642e-05  37.691  < 2e-16 ***
## year        -3.760e-02  1.063e-02  -3.537 0.000452 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7455 on 395 degrees of freedom
## Multiple R-squared:  0.8089, Adjusted R-squared:  0.8079 
## F-statistic:   836 on 2 and 395 DF,  p-value: < 2.2e-16
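
The mechanics of omitted variable bias can be illustrated with a short simulation. This sketch uses simulated data (the data frame `sim` and all parameter values are hypothetical, not the cars dataset) and checks the standard OLS identity: the short-regression coefficient equals the long-regression coefficient plus, for each omitted variable, its long-regression coefficient times the slope from regressing it on the included variable.

```r
set.seed(42)
n <- 300

# Simulated (hypothetical) data, loosely shaped like the cars variables
weight    <- rnorm(n, mean = 3000, sd = 800)
year      <- sample(70:82, n, replace = TRUE)
cylinders <- 1 + weight / 600 - 0.03 * (year - 76) + rnorm(n, sd = 0.7)
mpg       <- -14 - 0.0065 * weight + 0.75 * year + rnorm(n, sd = 3)
sim <- data.frame(mpg, cylinders, weight, year)

# Short regression omits weight and year; long regression includes them
short <- lm(mpg ~ cylinders, data = sim)
long  <- lm(mpg ~ cylinders + weight + year, data = sim)

# Slopes from regressing each omitted variable on the included one
d_w <- coef(lm(weight ~ cylinders, data = sim))["cylinders"]
d_y <- coef(lm(year ~ cylinders, data = sim))["cylinders"]

# Bias term: sum over omitted variables of (long coefficient) * (slope)
bias <- coef(long)["weight"] * d_w + coef(long)["year"] * d_y

lhs <- unname(coef(short)["cylinders"])
rhs <- unname(coef(long)["cylinders"] + bias)
all.equal(lhs, rhs)
```

In the simulation, mpg does not depend on cylinders at all, yet the short regression attributes the weight and year effects to cylinders. The same logic is a plausible reading of why the cylinders coefficient shrinks between Q3 and Q4.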
  6. A friend asserts that as years went by, car manufacturers became more conscious about producing cars with better fuel efficiency (higher levels of miles per gallon). Use the data to study this question. There is no right/wrong answer here and there are several ways to think about this question. I am interested in your rationale, how you think about data, represent information and justify your answer. Add a short, worded explanation accompanying your answer and feel free to include visualizations if needed.

Ans: I use a multivariate regression to test this hypothesis. The null hypothesis is that fuel efficiency does not increase with model year; the alternative hypothesis is that it does. The regression estimates how much mpg changes with a one-year increase in model year. I control for horsepower and weight, as these variables also affect fuel efficiency. The year coefficient is positive and statistically significant at the 0.001 level: as model year increases, mpg increases. This can also be seen in the plot at the end of this answer. Holding horsepower and weight constant, a one-year increase in model year is associated with an increase of about 0.75 mpg.

Control Variables For this research, I am controlling for horsepower and weight in my regression of mpg on year. The regression below shows that horsepower and weight are significantly associated with year, so leaving them out would risk omitted variable bias.

reg8 <- lm(year ~ horsepower + acceleration + cylinders + weight + displacement, data = cars)
summary(reg8)
## 
## Call:
## lm(formula = year ~ horsepower + acceleration + cylinders + weight + 
##     displacement, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4071 -2.4749 -0.0637  2.8023  7.0804 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80.703752   2.088599  38.640  < 2e-16 ***
## horsepower   -0.059554   0.013037  -4.568 6.63e-06 ***
## acceleration -0.151822   0.098398  -1.543  0.12367    
## cylinders    -0.090353   0.321219  -0.281  0.77864    
## weight        0.002134   0.000639   3.340  0.00092 ***
## displacement -0.010302   0.007098  -1.451  0.14746    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.323 on 386 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.1967, Adjusted R-squared:  0.1863 
## F-statistic:  18.9 on 5 and 386 DF,  p-value: < 2.2e-16

Variables not controlled for I did not control for the number of cylinders, acceleration, or displacement in the mpg regression. In the regression above, none of these variables is significantly associated with year at the 0.1 level (p-values of 0.779, 0.124, and 0.147, respectively).

reg4 <- lm(mpg ~ year  + horsepower  + weight, data = cars)
summary(reg4)
## 
## Call:
## lm(formula = mpg ~ year + horsepower + weight, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7911 -2.3220 -0.1753  2.0595 14.3527 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.372e+01  4.182e+00  -3.281  0.00113 ** 
## year         7.487e-01  5.212e-02  14.365  < 2e-16 ***
## horsepower  -5.000e-03  9.439e-03  -0.530  0.59663    
## weight      -6.448e-03  4.089e-04 -15.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 388 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.8068 
## F-statistic: 545.4 on 3 and 388 DF,  p-value: < 2.2e-16
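
As a rough check on the size of the year effect, the sketch below builds a 95% confidence interval from the estimate and standard error reported above. It uses only numbers from the table; the ten-year extrapolation is an illustration that assumes the linear relationship holds across the whole period.

```r
# Values copied from the year row of summary(reg4) above
est <- 0.7487    # 7.487e-01, mpg per model year
se  <- 0.05212   # 5.212e-02
df  <- 388       # residual degrees of freedom

# 95% confidence interval based on the t distribution
crit <- qt(0.975, df)
ci   <- est + c(-1, 1) * crit * se
ci

# Implied change in mpg over ten model years, other controls held fixed
est * 10
```

The interval runs from roughly 0.65 to 0.85 mpg per model year and excludes zero, which supports the friend's claim for cars of comparable horsepower and weight.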

A visual representation of the regression above shows that fuel efficiency is in fact highest for the most recent model years.

ggplot(data = cars, aes(x = horsepower + weight, y = mpg, color = year)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  labs(x = "Control Variables: Weight, Horsepower", y = "Fuel Efficiency in Miles per Gallon")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).