This dataset, cars.csv, provides mileage, horsepower, model year, and other technical specifications for cars.
Variables and short description:
Questions 1) Explore the dataset and make sure all variables are stored in a manner that allows you to work with them correctly. Make any changes you need to make.
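Answer: Upon inspection of the dataset, I noticed that the horsepower variable was stored as character strings. To convert it to numeric values, I used the following code. The coercion warning indicates that entries that could not be parsed as numbers were set to NA.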
cars$horsepower <- as.numeric(cars$horsepower)
## Warning: NAs introduced by coercion
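Before accepting the coercion warning, it can help to see which raw entries failed to convert. The following is a minimal sketch on a made-up vector, not the actual cars.csv column. The name `hp_raw` and the "?" placeholder are assumptions for illustration only.

```r
# Hypothetical stand-in for the raw horsepower column.
# The "?" placeholder is an assumption, not taken from cars.csv.
hp_raw <- c("130", "165", "?", "150", "?", "95")

# Convert to numeric, silencing the expected coercion warning
hp_num <- suppressWarnings(as.numeric(hp_raw))

# Entries that were present as text but could not be parsed as numbers
failed <- is.na(hp_num) & !is.na(hp_raw)
bad_values <- unique(hp_raw[failed])
n_failed <- sum(failed)

print(bad_values)
print(n_failed)
```

Applying the same check to the real column would show whether the new NA values come from a single placeholder or from several different malformed entries.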
library(ggplot2)
ggplot(data = cars, aes(x = cylinders, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm") +
  labs(x = "No. of Cylinders", y = "Fuel Efficiency in Miles per Gallon")
## `geom_smooth()` using formula 'y ~ x'
Answer: The coefficient β1 on cylinders is negative (-3.56), just as the fitted line in the plot has a negative slope.
reg1 <- lm(mpg ~ cylinders, data = cars)
summary(reg1)
##
## Call:
## lm(formula = mpg ~ cylinders, data = cars)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.2607  -3.3841  -0.6478   2.5538  17.9022
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  42.9493     0.8330   51.56   <2e-16 ***
## cylinders    -3.5629     0.1458  -24.43   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.942 on 396 degrees of freedom
## Multiple R-squared: 0.6012, Adjusted R-squared: 0.6002
## F-statistic: 597.1 on 1 and 396 DF, p-value: < 2.2e-16
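The agreement between the sign of the coefficient and the slope of the plotted line is not a coincidence: with one predictor, the OLS slope equals cov(x, y) / var(x), and the line drawn by `stat_smooth(method = "lm")` is that same fit. A small sketch follows, using the built-in `mtcars` data as a stand-in because cars.csv is not available here. The variable names `cyl` and `mpg` belong to `mtcars`, not to the assignment data.

```r
# mtcars ships with base R; it is only a stand-in for cars.csv
fit <- lm(mpg ~ cyl, data = mtcars)

# Slope as reported by lm()
slope <- coef(fit)[["cyl"]]

# Slope computed by hand for a one-predictor OLS fit
slope_by_hand <- cov(mtcars$cyl, mtcars$mpg) / var(mtcars$cyl)

print(c(lm = slope, by_hand = slope_by_hand))
```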
Answer: From the regression analysis below, we see that the coefficient on cylinders is not statistically significant, as its p-value (0.707) is greater than the 0.1 significance level. However, the coefficients on weight and year are statistically significant.
Interpretation of Coefficients
reg2 <- lm(mpg ~ cylinders + weight + year, data = cars)
summary(reg2)
##
## Call:
## lm(formula = mpg ~ cylinders + weight + year, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.9727 -2.3180 -0.0755  2.0138 14.3505
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.925603   4.037305  -3.449 0.000623 ***
## cylinders    -0.087402   0.232075  -0.377 0.706665
## weight       -0.006511   0.000459 -14.185  < 2e-16 ***
## year          0.753286   0.049802  15.126  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.438 on 394 degrees of freedom
## Multiple R-squared: 0.8079, Adjusted R-squared: 0.8065
## F-statistic: 552.4 on 3 and 394 DF, p-value: < 2.2e-16
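The t value and p-value for cylinders can be reproduced by hand from the estimate, standard error, and residual degrees of freedom reported above. This sketch only redoes the arithmetic from the printed numbers; the variable names are illustrative.

```r
# Values copied from the summary(reg2) output for cylinders
estimate  <- -0.087402
std_error <- 0.232075
df_resid  <- 394

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# Compare against the 0.1 significance level used in the answer
significant_at_10pct <- p_value < 0.1

print(c(t = t_value, p = p_value))
print(significant_at_10pct)
```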
Answer: The main difference between the two results is due to omitted variable bias (OVB). The omitted variables in Q3 are weight and year. OVB arises when two conditions hold: the omitted variables are correlated with the included regressor (here, cylinders), and they are themselves determinants of the outcome (fuel efficiency). Both conditions hold here, so the bivariate regression in Q3 attributes part of the effect of weight and year to cylinders, giving a coefficient of -3.56. The multivariate regression in Q4 corrects for this bias by including weight and year, and we get a clearer picture of the relationship between cylinders and mpg: the coefficient shrinks to -0.09 and is no longer statistically significant. In the code below, I check for a relationship among the x variables by regressing cylinders on weight and year. Both coefficients are statistically significant, which shows that the number of cylinders is correlated with weight and year. We control for this correlation in Question 4.
reg3 <- lm(cylinders ~ weight + year, data = cars)
summary(reg3)
##
## Call:
## lm(formula = cylinders ~ weight + year, data = cars)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.97944 -0.51579 -0.02345  0.45658  2.20968
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.116e+00  8.612e-01   3.619 0.000334 ***
## weight       1.749e-03  4.642e-05  37.691  < 2e-16 ***
## year        -3.760e-02  1.063e-02  -3.537 0.000452 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7455 on 395 degrees of freedom
## Multiple R-squared: 0.8089, Adjusted R-squared: 0.8079
## F-statistic: 836 on 2 and 395 DF, p-value: < 2.2e-16
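The size of the omitted variable bias can be written out exactly. In sample, the bivariate coefficient equals the multivariate coefficient plus, for each omitted variable, its multivariate coefficient times the slope from regressing that omitted variable on the included one. The sketch below checks this identity on the built-in `mtcars` data, used as a stand-in because cars.csv is not available here. `mtcars` has no year variable, so `hp` plays the role of the second omitted variable. Note that the auxiliary regressions put the omitted variable on the left-hand side, which is the reverse of the regression above.

```r
# mtcars ships with base R; it is only a stand-in for cars.csv
short <- lm(mpg ~ cyl, data = mtcars)            # omits wt and hp
long  <- lm(mpg ~ cyl + wt + hp, data = mtcars)  # includes them

# Auxiliary regressions: each omitted variable on the included regressor
aux_wt <- lm(wt ~ cyl, data = mtcars)
aux_hp <- lm(hp ~ cyl, data = mtcars)

# Bias = sum over omitted variables of (long coefficient * auxiliary slope)
bias <- coef(long)[["wt"]] * coef(aux_wt)[["cyl"]] +
  coef(long)[["hp"]] * coef(aux_hp)[["cyl"]]

# Short coefficient reconstructed from the long regression plus the bias
reconstructed <- coef(long)[["cyl"]] + bias

print(c(short = coef(short)[["cyl"]], reconstructed = reconstructed))
```

The two printed numbers agree up to floating-point error, which is why adding the omitted variables moves the cylinders coefficient by exactly the bias term.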
Ans: I will use a multivariate regression analysis to test this hypothesis. The null hypothesis is that fuel efficiency does not increase as model year increases. The alternative hypothesis is that fuel efficiency does increase as model year increases. The regression estimates how much fuel efficiency changes for a one-unit (one-year) increase in year. I will control for horsepower and weight, as these variables also affect fuel efficiency. The year coefficient is statistically significant at the 0.001 level. Based on the summary of the regression analysis, there is a positive relationship between year and mpg: as model year increases, mpg increases. This can also be seen visually in regression graph 2. Holding horsepower and weight constant, a one-year increase in model year is associated with an increase in fuel efficiency of about 0.75 mpg.
Control Variables: For this research, I am controlling for horsepower and weight in my regression of mpg on year. This is because horsepower and weight are significantly associated with year, as the regression below shows. By controlling for these variables we can limit omitted variable bias.
reg8 <- lm(year ~ horsepower + acceleration + cylinders + weight + displacement, data = cars)
summary(reg8)
##
## Call:
## lm(formula = year ~ horsepower + acceleration + cylinders + weight +
## displacement, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.4071 -2.4749 -0.0637  2.8023  7.0804
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  80.703752   2.088599  38.640  < 2e-16 ***
## horsepower   -0.059554   0.013037  -4.568 6.63e-06 ***
## acceleration -0.151822   0.098398  -1.543  0.12367
## cylinders    -0.090353   0.321219  -0.281  0.77864
## weight        0.002134   0.000639   3.340  0.00092 ***
## displacement -0.010302   0.007098  -1.451  0.14746
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.323 on 386 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.1967, Adjusted R-squared: 0.1863
## F-statistic: 18.9 on 5 and 386 DF, p-value: < 2.2e-16
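Individual p-values can be misleading when candidate controls are correlated with one another, because variables that look insignificant one at a time can still matter jointly. A hedged sketch of both checks follows, using the built-in `mtcars` data as a stand-in. The choice of `qsec` as the outcome and the particular regressors are illustrative assumptions, not the assignment's model.

```r
# mtcars ships with base R; the model below is illustrative only
full       <- lm(qsec ~ hp + wt + cyl + disp + drat, data = mtcars)
restricted <- lm(qsec ~ hp + wt, data = mtcars)

# Individual p-values, dropping the intercept row
p_values <- summary(full)$coefficients[-1, "Pr(>|t|)"]
alpha <- 0.1
not_significant <- names(p_values)[p_values > alpha]

# Joint F-test: do cyl, disp, and drat add anything beyond hp and wt?
joint_test <- anova(restricted, full)
joint_p <- joint_test[2, "Pr(>F)"]

print(round(p_values, 4))
print(not_significant)
print(joint_p)
```

If the joint p-value is small while the individual ones are large, dropping the whole group of variables deserves a second look.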
Variables not controlled for: For this research I did not consider the impact of number of cylinders, acceleration, and displacement on mpg. At the 0.1 significance level, these variables are not significantly associated with year, since each of their p-values in the regression above exceeds 0.1 (cylinders 0.779, acceleration 0.124, displacement 0.147).
reg4 <- lm(mpg ~ year + horsepower + weight, data = cars)
summary(reg4)
##
## Call:
## lm(formula = mpg ~ year + horsepower + weight, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.7911 -2.3220 -0.1753  2.0595 14.3527
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.372e+01  4.182e+00  -3.281  0.00113 **
## year         7.487e-01  5.212e-02  14.365  < 2e-16 ***
## horsepower  -5.000e-03  9.439e-03  -0.530  0.59663
## weight      -6.448e-03  4.089e-04 -15.768  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 388 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.8068
## F-statistic: 545.4 on 3 and 388 DF, p-value: < 2.2e-16
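Because the hypothesis is directional (fuel efficiency increases with year), a one-sided p-value and a confidence interval for the year coefficient are natural companions to the table above. The sketch below only redoes the arithmetic from the printed estimate, standard error, and residual degrees of freedom. The variable names are illustrative.

```r
# Values copied from the summary(reg4) output for year
estimate  <- 0.7487    # printed as 7.487e-01
std_error <- 0.05212   # printed as 5.212e-02
df_resid  <- 388

# t statistic for H0: coefficient on year is zero
t_value <- estimate / std_error

# One-sided p-value for the alternative that the coefficient is positive
p_one_sided <- pt(t_value, df = df_resid, lower.tail = FALSE)

# 95% confidence interval for the coefficient
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

print(t_value)
print(p_one_sided)
print(ci)
```

The interval lies entirely above zero, which is consistent with rejecting the null hypothesis in favor of a positive relationship.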
A visual representation of the above regression analysis shows that fuel efficiency is in fact highest for the most recent model years.
ggplot(data = cars, aes(x = horsepower + weight, y = mpg, color = year)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  labs(x = "Control Variables: Weight, Horsepower", y = "Fuel Efficiency in Miles per Gallon")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).
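Another way to visualize the effect of year after controlling for horsepower and weight is a partial regression (added-variable) view: remove the part of mpg and the part of year explained by the controls, then relate the two sets of residuals. By the Frisch-Waugh-Lovell result, the slope of that residual regression equals the year coefficient from the full model. The sketch below uses simulated data, since cars.csv is not available here. All numbers in the simulation are assumptions chosen only to resemble the scale of the variables.

```r
# Synthetic data; every parameter below is an assumption for illustration
set.seed(1)
n <- 200
year <- sample(70:82, n, replace = TRUE)
weight <- 4500 - 60 * (year - 70) + rnorm(n, sd = 500)
horsepower <- 20 + 0.03 * weight + rnorm(n, sd = 15)
mpg <- -14 + 0.75 * year - 0.005 * horsepower - 0.0065 * weight + rnorm(n, sd = 3)
sim <- data.frame(mpg, year, horsepower, weight)

# Full model with controls
full <- lm(mpg ~ year + horsepower + weight, data = sim)

# Residualize the outcome and the variable of interest on the controls
res_mpg  <- resid(lm(mpg ~ horsepower + weight, data = sim))
res_year <- resid(lm(year ~ horsepower + weight, data = sim))

# Regress residual on residual; the slope matches the full-model coefficient
partial <- lm(res_mpg ~ res_year)

print(c(full = coef(full)[["year"]], partial = coef(partial)[["res_year"]]))
```

Plotting `res_year` against `res_mpg` gives a scatter whose fitted line has exactly the controlled slope, which is easier to read than a single axis that sums the control variables.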