This project evaluates the relationship between the outcome miles per gallon (MPG) and the other variables in the “mtcars” data set. Motor Trend, a magazine about the automobile industry, is interested in exploring this data set to analyze how the other variables relate to MPG. We are going to apply linear models, generalized linear models, binary outcome models, residuals, predictions, hat values, t-tests, interpretation of odds ratios and other regression techniques learned in this session. The “mtcars” data set was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). The data frame has 32 observations and 11 variables.
First, we load the mtcars data and convert the transmission variable am to a factor.
data("mtcars")
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars$am <- factor(mtcars$am)
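As a quick sanity check (a minimal sketch; the rest of the analysis keeps the original 0/1 level labels, where 0 = automatic and 1 = manual):
# Confirm that am is now a factor with the two levels "0" (automatic) and "1" (manual)
str(mtcars$am)
levels(mtcars$am)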
In this project we explore linear models between miles per gallon (MPG) and the rest of the variables, with MPG as the outcome for the Motor Trend cars data set. The pairs plot for these variables is shown in the appendix, Figure-1. In the rest of the analysis we want to determine whether an automatic or a manual transmission makes a difference in miles per gallon.
To start with, let’s analyze the impact of the other variables on mpg by fitting a linear model with all regressors:
fit <- lm(mpg ~ ., data = mtcars)
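To get a feel for this full model, its coefficient table can be printed (a quick check that only uses the fit object created above; the output is omitted here):
# Coefficient table of the full model: estimates, standard errors,
# t-values and p-values for all the regressors plus the intercept
round(summary(fit)$coefficients, 3)
With so many correlated regressors, the individual p-values are hard to interpret, which motivates the variable selection step below.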
Using this full model as a baseline, we start exploring the impact of the other variables. To see whether an automatic or a manual transmission is better for MPG, we use the step() function to select the variables that best explain the impact on mileage.
stepFit <- step(fit, direction = "both")
By now we can see that the step procedure keeps wt, qsec and am as the regression variables used to explain mileage. From this summary we can reject the null hypothesis for each term, because all the coefficients are significant at the 0.05 significance level. The residual standard error is 2.459 on 28 degrees of freedom, the R-squared is 0.85 and the adjusted R-squared is 0.83.
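As a sketch of where these figures come from, they can be read directly off the model summary (assuming the stepFit object created above):
s <- summary(stepFit)               # summary of the selected model: mpg ~ wt + qsec + am
round(s$coefficients, 3)            # coefficient table; all terms significant at the 0.05 level
c(sigma = s$sigma,                  # residual standard error (about 2.46)
  r.squared = s$r.squared,          # about 0.85
  adj.r.squared = s$adj.r.squared)  # about 0.83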
Now, let’s create a third model that uses only mpg and the am variable. This simple model gives us a baseline to compare against the other fitted models.
amFit <- lm(mpg ~ am, mtcars)
summary(amFit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
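This unadjusted model estimates that manual cars average about 7.2 MPG more than automatic cars. To attach an uncertainty range to that estimate, a small sketch using the amFit object above:
# 95% confidence interval for the transmission effect in the unadjusted model;
# it does not contain zero, in line with the significant t-value above
confint(amFit, "am1")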
As the final step, let’s compare all three fitted models to understand the impact. All of them use MPG as the outcome: the first model includes all the other variables, the second uses only wt, qsec and am, and the last one uses only am.
anova(fit,stepFit,amFit)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ wt + qsec + am
## Model 3: mpg ~ am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 147.49
## 2 28 169.29 -7 -21.79 0.4432 0.8636
## 3 30 720.90 -2 -551.61 39.2687 8.025e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the output, the comparison between the full model and the step-selected model has a p-value of 0.86, so the variables beyond wt, qsec and am add no significant impact. The comparison with the am-only model, however, is highly significant (p ≈ 8.0e-08), so we reject the null hypothesis that wt and qsec have no impact and keep them in the model.
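Another way to see the same trade-off is to compare the adjusted R-squared of the three models (a small sketch reusing the fitted objects above):
# Adjusted R-squared for the three candidate models: the step-selected model
# does at least as well as the full model with far fewer terms, while the
# am-only model explains much less of the variance
sapply(list(full = fit, step = stepFit, amOnly = amFit),
       function(m) summary(m)$adj.r.squared)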
In this section we investigate residuals and leverage to find any potential issues with the fitted models. The variation around the regression line is called residual variation. We use the hatvalues() function to find the data points with the most leverage in the fit, and the dfbetas() function to measure how much each observation influences the coefficients.
hatsV <- hatvalues(stepFit)
tail(hatsV)
## Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora
## 0.09484773 0.16064541 0.16774768 0.11382232 0.19098150
## Volvo 142E
## 0.12428491
dfbV <- dfbetas(stepFit)
tail(dfbV)
## (Intercept) wt qsec am1
## Porsche 914-2 0.07877530 -6.932899e-02 -0.06904492 0.02157174
## Lotus Europa 0.36660011 -4.291538e-01 -0.26692418 -0.13713643
## Ford Pantera L -0.15609429 -6.169769e-02 0.23847362 -0.14024508
## Ferrari Dino -0.05798534 3.523032e-06 0.07688146 -0.04804068
## Maserati Bora -0.03976530 -1.324597e-01 0.12036772 -0.16842534
## Volvo 142E 0.31198327 -2.543348e-01 -0.28378130 -0.41564820
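To single out the cars driving these diagnostics, one possible follow-up (hypothetical helper code, not part of the original analysis):
# Car with the highest leverage (hat value) in the selected model
hatsV[which.max(hatsV)]
# Car with the largest influence on each coefficient (largest absolute dfbeta)
apply(abs(dfbV), 2, function(x) rownames(dfbV)[which.max(x)])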
Figure-4 in the appendix shows the residual diagnostics. The first panel, Residuals vs Fitted, shows the residuals scattered randomly with no obvious pattern. The second panel, the Normal Q-Q plot, shows the residuals lying close to a straight line, supporting the normality assumption. The third panel, Scale-Location, is consistent with the constant-variance assumption. The fourth panel, Residuals vs Leverage, shows no influential outliers.
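Figure-4 was presumably produced with the standard diagnostic plot method for lm objects; a minimal sketch:
# Four standard diagnostic plots for the selected model:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(2, 2))
plot(stepFit)
par(mfrow = c(1, 1))  # restore the default plotting layout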
To reject the null hypothesis that transmission type makes no difference to MPG, we also run a t-test:
(t.test(mpg~am,data = mtcars))$p.value
## [1] 0.001373638
As the p-value is less than 0.05, we reject the null hypothesis.
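The same difference is visible directly in the group means (a quick check, assuming am is still the 0/1 factor created earlier):
# Mean MPG by transmission type (0 = automatic, 1 = manual);
# the difference is about 7.2 MPG, matching the am coefficient in amFit
tapply(mtcars$mpg, mtcars$am, mean)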
Manual transmission cars get about 1.8 more miles per gallon than automatic cars. Mileage is also affected by the weight of the car along with the number of cylinders:
- For weight, MPG decreases by about 2.5 for each additional 1000 pounds.
- For cylinders, each additional cylinder decreases MPG by about 3.0.