Motor Trend is a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome).
They are particularly interested in the following two questions:
Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions.
We performed exploratory analysis to visualize the data and to understand the relationships between the variables, including a correlation analysis. It showed: 1. A strong negative correlation between Miles Per Gallon and Number of Cylinders, Displacement (cu.in.) and Weight. 2. A moderate positive correlation between Miles Per Gallon and engine shape (V-shaped or straight), transmission (automatic or manual) and rear axle ratio.
Using a hypothesis test, I could reject H0 and conclude that there is a significant relationship between Transmission and Miles Per Gallon. In the t-tests of mpg against each other variable, only qsec was not significant at the 0.05 level (these tests compare column means rather than association).
In the simple regression, Transmission is significantly related to mpg: manual cars average 7.245 mpg more than automatic cars.
The multivariate regression on all variables showed no individually significant variable, so we tested for multicollinearity and found that carb (number of carburetors) had the highest generalized variance inflation factor, followed by cyl (number of cylinders).
A generalized linear model on all variables (glm with its default Gaussian family, which is a linear regression rather than a logistic regression) again showed no significant variable; wt and hp came closest, with p-values of about 0.09. When we reran the model without the collinear variables cyl and carb, weight was significant (p = 0.0239) with a negative coefficient. In that model, a manual transmission is associated with an estimated 2.92105 mpg increase with the other variables held fixed, although this coefficient is not significant on its own (p = 0.1584).
Further analysis with ANOVA suggested the model mpg ~ cyl + disp + wt, while stepwise selection on the raw data and the nested likelihood ratio tests pointed to mpg ~ am + wt + qsec.
So I couldn't clearly conclude which is the best fitted model, but Transmission clearly affects Miles Per Gallon. Manual cars average 7.245 mpg more than automatic cars in the unadjusted regression, and an estimated 2.92105 mpg more once the other variables are accounted for.
For this analysis we use the mtcars dataset, which was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Below is a brief description of the variables in the data set:
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (lb/1000)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
# Overview of data
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
modeldata <- mtcars
# take the categorical columns and convert them into factor variables.
modelcols <- c(2,8,9,10,11)
modeldata[modelcols] <- lapply(modeldata[modelcols],factor)
# convert the boolean values of factor into more meaningful names for visualization purpose.
modeldata <- transform(modeldata,
am = factor(am, levels = 0:1, labels = c("Automatic", "Manual")),
gear = factor(gear, levels = 3:5, labels = c("3 Gears", "4 Gears", "5 Gears")),
vs = factor(vs,levels = 0:1,labels = c("V-Engine","S-Engine")))
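# convert the factor variables back to numeric for the correlation analysis.
# as.numeric() on a factor returns level codes (1, 2, 3, ...), so cyl and carb
# go through as.character() first to recover their original values.
modeldata_cor <- transform(modeldata,
am = as.numeric(am),
gear = as.numeric(gear),
vs = as.numeric(vs),
carb = as.numeric(as.character(carb)),
cyl = as.numeric(as.character(cyl)))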
head(modeldata)
## mpg cyl disp hp drat wt qsec vs am
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 V-Engine Manual
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 V-Engine Manual
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 S-Engine Manual
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 S-Engine Automatic
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 V-Engine Automatic
## Valiant 18.1 6 225 105 2.76 3.460 20.22 S-Engine Automatic
## gear carb
## Mazda RX4 4 Gears 4
## Mazda RX4 Wag 4 Gears 4
## Datsun 710 4 Gears 1
## Hornet 4 Drive 3 Gears 1
## Hornet Sportabout 3 Gears 2
## Valiant 3 Gears 1
Focusing on the project goal of determining whether an automatic or manual transmission is better for mpg, we start with a violin plot of mpg by am.
library(ggplot2)
ggplot(data=modeldata,aes(am,mpg,fill = am)) + geom_violin(color = "black",size=1)
In the above graph, the violin for automatic transmission sits visibly lower than the one for manual transmission: automatic cars tend to have lower mpg than manual cars. A plot alone does not establish significance, but it gives us a hypothesis to test in the rest of the analysis.
Let us look at the correlation plot, which shows how strongly mpg is correlated with every other variable in the dataset.
library(corrplot)
corrplot.mixed(cor(modeldata_cor),lower="number",upper="pie")
In the above graph, the lower triangle shows the correlation values and the upper triangle shows the same correlations as pie charts on a scale between -1 and 1. Generally:
correlation value >= 0.8 means strong positive correlation
correlation value <= -0.8 means strong negative correlation
correlation value = 0 means no linear correlation between the respective two variables
So based on that, if we analyze the graph above:
There is a strong negative correlation between mpg (Miles Per Gallon) and cyl (# of cylinders), disp (Displacement cu.in.) and wt (weight). That is, cars with fewer cylinders, smaller displacement or lower weight tend to have higher miles per gallon.
No variable is completely uncorrelated with mpg.
There is a moderate positive correlation between Miles Per Gallon and vs (V-shaped or straight engine), am (automatic or manual transmission) and drat (rear axle ratio). Straight engines, manual transmissions and higher axle ratios are associated with higher Miles Per Gallon, though less strongly than the variables above.
Are we right? Well, we need to do further analysis.
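To read the same correlations as numbers rather than from the plot, they can be pulled straight from the correlation matrix. This is a minimal sketch on the raw mtcars data (not the recoded modeldata_cor); the name mpg_cor is ours, for illustration only.

```r
# Correlation of mpg with every other column of the raw mtcars data,
# sorted from most negative to most positive.
data(mtcars)
mpg_cor <- sort(cor(mtcars)["mpg", colnames(mtcars) != "mpg"])
round(mpg_cor, 2)
```

Sorting makes it easy to see which variables sit near -0.8 and which fall in the moderate positive range.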
Let us see what the distributions look like between mpg and the moderately to highly correlated variables.
library(ggplot2)
library(GGally)
# custom panel: scatter plot with a loess smoother (red) and a linear fit (blue)
my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method = "loess", fill = "red", color = "red", ...) +
    geom_smooth(method = "lm", fill = "blue", color = "blue", ...)
  p
}
g <- ggpairs(mtcars, lower = list(continuous = my_fn))
suppressWarnings(print(g))
In the above pairs diagram, we can see a strong, roughly linear relationship between Miles Per Gallon and Number of Cylinders, Displacement and Weight.
Hypothesis test
H0: Automatic or manual transmission is not related to the performance on mpg (the mean mpg is the same for both groups)
H1: Automatic or manual transmission is related to the performance on mpg (the mean mpg differs between the groups)
t.test(mpg~am,data=mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
Observation: In the above t-test, the p-value (0.001374) is < 0.05 and the 95% confidence interval (-11.28 to -3.21) does not contain zero, so we can REJECT H0 and say there is a significant relationship between Transmission and Miles Per Gallon.
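The size of the difference can be read off the same test object. A minimal sketch, assuming the raw mtcars data; the names tt and mean_diff are ours.

```r
data(mtcars)
tt <- t.test(mpg ~ am, data = mtcars)
# tt$estimate holds the two group means (automatic first, then manual)
mean_diff <- unname(diff(tt$estimate))
mean_diff                      # manual minus automatic, in mpg
-rev(as.numeric(tt$conf.int))  # 95% interval for manual minus automatic
```

The interval is flipped in sign because t.test() reports automatic minus manual, while the question asks how much better manual is.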
Let us run t-tests of mpg against each of the other variables
sapply(mtcars[,2:11], function(i) t.test(mtcars$mpg,i)$conf.int)
## cyl disp hp drat wt qsec vs
## [1,] 11.65034 -255.3602 -151.3964 14.31395 14.67644 -0.01100157 17.47381
## [2,] 16.15591 -165.9023 -101.7973 18.67417 19.07031 4.49475157 21.83244
## am gear carb
## [1,] 17.50519 14.21654 15.03984
## [2,] 21.86356 18.58971 19.51641
sapply(mtcars[,2:11], function(i) t.test(mtcars$mpg,i)$p.value)
## cyl disp hp drat wt
## 9.507708e-15 7.978234e-11 1.030354e-11 3.164364e-16 1.027903e-16
## qsec vs am gear carb
## 5.107103e-02 2.241293e-18 2.151228e-18 3.077106e-16 1.680654e-17
Observation: Except for qsec, every variable's confidence interval keeps the same sign between its lower and upper limits, so it does not contain zero. The p-value for each of these variables is also less than 0.05, except for qsec, which is slightly above the alpha value. Note, however, that t.test(mtcars$mpg, i) compares the mean of mpg with the mean of column i; it does not measure how strongly the two are related, so on its own this result is weak evidence for dropping qsec when fitting the models.
Omit Variables: qsec, but not confidently, as its p-value is very close to alpha and the lower limit of its confidence interval is just slightly below 0.
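One way to test association rather than a difference in column means is a correlation test per variable. This is a sketch of an alternative check, not something the original analysis ran; the names predictors and assoc are ours.

```r
data(mtcars)
predictors <- setdiff(names(mtcars), "mpg")
# Pearson correlation of mpg with each predictor, plus its p-value
assoc <- t(sapply(predictors, function(v) {
  ct <- cor.test(mtcars$mpg, mtcars[[v]])
  c(r = unname(ct$estimate), p.value = ct$p.value)
}))
round(assoc, 4)
```

Each row tests whether the correlation between mpg and that variable is zero, which is closer to the question of whether the variable belongs in a model for mpg.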
To check further whether our variable selection is correct, we need to look at the linear regression models.
Let us first see the relationship between our two variables of interest, Miles per Gallon and Transmission type.
fitmpgam <- lm(mpg~factor(am)-1,data=modeldata)
summary(fitmpgam)
##
## Call:
## lm(formula = mpg ~ factor(am) - 1, data = modeldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## factor(am)Automatic 17.147 1.125 15.25 1.13e-15 ***
## factor(am)Manual 24.392 1.360 17.94 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.9487, Adjusted R-squared: 0.9452
## F-statistic: 277.2 on 2 and 30 DF, p-value: < 2.2e-16
confint.lm(fitmpgam)
## 2.5 % 97.5 %
## factor(am)Automatic 14.85062 19.44411
## factor(am)Manual 21.61568 27.16894
Observation: Because the intercept is removed (the -1 in the formula), each coefficient is the mean mpg of its group rather than a comparison with automatic transmission: 17.147 for automatic and 24.392 for manual. Manual cars therefore average 7.245 mpg more (the difference between the two coefficients). The p-values and confidence intervals here only show that each group mean differs from 0; the significance of the difference itself comes from the t-test above (p = 0.001374). Therefore we can conclude that there is a significant relationship between Transmission and MPG, with higher MPG for manual than for automatic transmission.
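Keeping the intercept makes the model report the manual versus automatic difference directly, together with its own standard error and p-value. A minimal sketch on the raw mtcars data; fit_diff is a name introduced here for illustration.

```r
data(mtcars)
# With an intercept, automatic (am = 0) is the baseline and the
# factor(am)1 row is the manual minus automatic difference in mean mpg.
fit_diff <- lm(mpg ~ factor(am), data = mtcars)
coef(summary(fit_diff))
```

The intercept row is the automatic mean, and the second row is the quantity the report is trying to quantify.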
fitmpgall <- lm(mpg~.-1,data=modeldata)
summary(fitmpgall)
##
## Call:
## lm(formula = mpg ~ . - 1, data = modeldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## cyl4 23.87913 20.06582 1.190 0.2525
## cyl6 21.23044 18.33416 1.158 0.2650
## cyl8 23.54297 18.22250 1.292 0.2159
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vsS-Engine 1.93085 2.87126 0.672 0.5115
## amManual 1.21212 3.21355 0.377 0.7113
## gear4 Gears 1.11435 3.79952 0.293 0.7733
## gear5 Gears 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.9914, Adjusted R-squared: 0.9817
## F-statistic: 102 on 17 and 15 DF, p-value: 1.979e-12
confint.lm(fitmpgall)
## 2.5 % 97.5 %
## cyl4 -18.8901510 66.64841591
## cyl6 -17.8479101 60.30878447
## cyl8 -15.2973628 62.38330171
## disp -0.0324452 0.10353785
## hp -0.1545404 0.01352676
## drat -4.1105919 6.47625226
## wt -9.9409845 0.88143283
## qsec -1.6259039 2.36159354
## vsS-Engine -4.1890905 8.05079160
## amManual -5.6373936 8.06162502
## gear4 Gears -6.9841244 9.21283428
## gear5 Gears -5.4354626 10.49225456
## carb2 -5.9199999 3.96129129
## carb3 -6.1518381 12.15111565
## carb4 -8.3927175 10.57556323
## carb6 -9.1297377 18.08487616
## carb8 -10.5697142 25.07053667
Interpretation: In the multivariate model, the p-value of every variable is greater than alpha and every variable's confidence interval contains 0; hp (p = 0.0939) and wt (p = 0.0946) come closest to significance. This is a major problem and contradicts the study so far. It mostly occurs when there is multicollinearity among the variables, which means that two or more predictor variables are highly correlated, so we may be adding duplicate relationships when predicting the outcome. We can detect collinearity using the Variance Inflation Factor (VIF):
library(car)
fitvif <- lm(mpg ~., data=modeldata)
vif(fitvif)
## GVIF Df GVIF^(1/(2*Df))
## cyl 128.120962 2 3.364380
## disp 60.365687 1 7.769536
## hp 28.219577 1 5.312210
## drat 6.809663 1 2.609533
## wt 23.830830 1 4.881683
## qsec 10.790189 1 3.284842
## vs 8.088166 1 2.843970
## am 9.930495 1 3.151269
## gear 50.852311 2 2.670408
## carb 503.211851 5 1.862838
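Observation: The variables with the highest VIF are the most collinear ones. On the raw GVIF scale, carb is by far the highest (503.2), followed by cyl (128.1). Note that both are multi-level factors; on the degrees-of-freedom-adjusted scale GVIF^(1/(2*Df)), which is comparable across terms, disp (7.77) and hp (5.31) are the highest.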
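For a single numeric predictor, the VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. A minimal sketch for wt, assuming the raw numeric mtcars data, so the value will not match the GVIF table above, which uses the factor-coded modeldata; aux, r2 and vif_wt are our names.

```r
data(mtcars)
# Regress wt on all the other predictors (mpg is the outcome, so leave it out).
aux <- lm(wt ~ . - mpg, data = mtcars)
r2 <- summary(aux)$r.squared
# A large R^2 means wt is well explained by the other predictors,
# which inflates the variance of its coefficient.
vif_wt <- 1 / (1 - r2)
vif_wt
```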
glmfitall <- glm(mpg ~ ., data=modeldata)
summary(glmfitall)
##
## Call:
## glm(formula = mpg ~ ., data = modeldata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vsS-Engine 1.93085 2.87126 0.672 0.5115
## amManual 1.21212 3.21355 0.377 0.7113
## gear4 Gears 1.11435 3.79952 0.293 0.7733
## gear5 Gears 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 8.026845)
##
## Null deviance: 1126.0 on 31 degrees of freedom
## Residual deviance: 120.4 on 15 degrees of freedom
## AIC: 169.22
##
## Number of Fisher Scoring iterations: 2
Interpretation: None of the variables is statistically significant at the 0.05 level; hp (p = 0.0939) and wt (p = 0.0946) come closest.
Let us omit the collinear variables cyl and carb and run the generalized linear model one more time
glmfit <- glm(mpg ~ disp + hp + drat + wt + qsec + vs + am + gear, data = modeldata)
summary(glmfit)
##
## Call:
## glm(formula = mpg ~ disp + hp + drat + wt + qsec + vs + am +
## gear, data = modeldata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0548 -1.4564 -0.3425 1.2825 4.7168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.80934 13.36845 0.809 0.4274
## disp 0.01380 0.01341 1.029 0.3145
## hp -0.02721 0.01720 -1.582 0.1279
## drat 1.18599 1.74482 0.680 0.5038
## wt -3.68884 1.52013 -2.427 0.0239 *
## qsec 0.91001 0.64014 1.422 0.1692
## vsS-Engine 0.65015 1.93968 0.335 0.7407
## amManual 2.92105 2.00082 1.460 0.1584
## gear4 Gears -0.42897 2.43311 -0.176 0.8617
## gear5 Gears 0.88164 2.57587 0.342 0.7354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6.662079)
##
## Null deviance: 1126.05 on 31 degrees of freedom
## Residual deviance: 146.57 on 22 degrees of freedom
## AIC: 161.51
##
## Number of Fisher Scoring iterations: 2
Interpretation: After removing the highly collinear variables cyl (cylinders) and carb (carburetors), weight is significant at the 0.05 level (p = 0.0239) and has a negative coefficient. Note that glm() with its default Gaussian family and identity link, as reported in the output, is a linear regression and not a logistic regression, so the coefficients are on the mpg scale rather than log odds. Focusing on the project variable am: since amManual is a dummy variable, a manual transmission is associated with an estimated 2.92105 mpg increase over automatic with the other variables held fixed, although this coefficient is not significant on its own (p = 0.1584).
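That glm() without a family argument fits an ordinary linear regression can be checked by comparing it with lm() on the same formula. A minimal sketch on the raw numeric mtcars data (so the gear and vs terms are coded differently from modeldata); f, fit_glm and fit_lm are our names.

```r
data(mtcars)
f <- mpg ~ disp + hp + drat + wt + qsec + vs + am + gear
fit_glm <- glm(f, data = mtcars)  # default family is gaussian, identity link
fit_lm  <- lm(f, data = mtcars)
# Same coefficients, so each one is a change in mpg, not in log odds.
all.equal(coef(fit_glm), coef(fit_lm))
family(fit_glm)$family
```

A logistic regression would need family = binomial and a binary outcome, which mpg is not.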
Now let us run the ANOVA on the above model to analyze the table of deviance
anova(glm(mpg ~.,data=modeldata),test="Chisq")
## Analysis of Deviance Table
##
## Model: gaussian, link: identity
##
## Response: mpg
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 31 1126.05
## cyl 2 824.78 29 301.26 < 2.2e-16 ***
## disp 1 57.64 28 243.62 0.007367 **
## hp 1 18.50 27 225.12 0.128955
## drat 1 11.91 26 213.20 0.223098
## wt 1 55.79 25 157.42 0.008382 **
## qsec 1 1.52 24 155.89 0.662974
## vs 1 0.30 23 155.59 0.846179
## am 1 16.57 22 139.02 0.150825
## gear 2 5.02 20 134.00 0.731400
## carb 5 13.60 15 120.40 0.889633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. The table shows the drop in deviance as each variable is added one at a time, in the order listed, so the tests depend on that order. Adding cyl, disp and wt significantly reduces the residual deviance, while hp (p = 0.129) and drat (p = 0.223) do not. A large p-value here indicates that the model without the variable explains more or less the same amount of variation. The asterisks next to the p-values mark significance: the more asterisks, the more significant the variable. So in this case the model fits well with the variables cyl, disp and wt, and adding any other variable to the model gives no significant improvement in explaining mpg.
Suggested Model from above: mpg ~ disp + cyl + wt
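Because the deviance table adds terms sequentially, a useful cross-check is to test each term of the suggested model given all the others. A minimal sketch using drop1(), assuming cyl is treated as a factor as in the cleaned data; d, fit_sug and res are our names.

```r
data(mtcars)
d <- transform(mtcars, cyl = factor(cyl))
fit_sug <- lm(mpg ~ cyl + disp + wt, data = d)
# F test for removing each term while keeping the other two in the model
res <- drop1(fit_sug, test = "F")
res
```

Unlike the sequential table, these tests do not depend on the order in which the variables are written.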
To check this more carefully, we can fit a stepwise regression model.
stepfit=step(lm(data=modeldata, mpg ~ .),trace=0,steps=10000)
summary(stepfit)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = modeldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
Interpretation: Here the fitted model is mpg ~ cyl + hp + wt + am with an adjusted R-squared of 0.8401, which means the model explains about 84.01% of the variation in miles per gallon. The bad news is that this model differs from the one suggested by the ANOVA above. Note that this model is fitted against the cleaned data modeldata.
Let us also run the stepwise selection directly on mtcars instead of the cleaned dataset modeldata
stepfit=step(lm(data=mtcars, mpg ~ .),trace=0,steps=10000)
summary(stepfit)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Interpretation: Here the fitted model is mpg ~ wt + qsec + am with an adjusted R-squared of 0.8336, which means the model explains about 83.36% of the variation in miles per gallon.
Clearly the best fitted model differs between the two datasets, and both differ from the model suggested by the ANOVA before. To find out why ANOVA and stepwise selection each gave a different model, we run nested likelihood ratio tests on the stepwise selection model.
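The three candidates are not nested in one another, so one simple way to put them side by side is AIC, the criterion step() itself uses. A minimal sketch, assuming cyl is a factor as in the cleaned data; the list candidates and its element names are ours.

```r
data(mtcars)
d <- transform(mtcars, cyl = factor(cyl))
candidates <- list(
  anova_model  = lm(mpg ~ cyl + disp + wt, data = d),
  step_cleaned = lm(mpg ~ cyl + hp + wt + am, data = d),
  step_raw     = lm(mpg ~ wt + qsec + am, data = d)
)
# Lower AIC is better; differences of a point or two are not decisive.
aic <- sapply(candidates, AIC)
sort(aic)
```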
Let us fit nested models, starting from the one variable of interest and adding one more variable at each step
fitam <- lm(mpg ~ am,data=modeldata)
fitamwt <- lm(mpg ~ am + wt, data=modeldata)
fitamwtqsec <- lm(mpg ~ am + wt + qsec, data=modeldata)
fitamwtqsechp <- lm(mpg ~ am + wt + qsec + hp, data=modeldata)
fitamwtqsechpcyl <- lm(mpg ~ am + wt + qsec + hp + cyl, data=modeldata)
fitamwtqsechpcyldisp <- lm(mpg ~ am + wt + qsec + hp + cyl + disp, data=modeldata)
anova(fitam,fitamwt,fitamwtqsec,fitamwtqsechp,fitamwtqsechpcyl,fitamwtqsechpcyldisp)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + qsec
## Model 4: mpg ~ am + wt + qsec + hp
## Model 5: mpg ~ am + wt + qsec + hp + cyl
## Model 6: mpg ~ am + wt + qsec + hp + cyl + disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 74.6280 7.892e-09 ***
## 3 28 169.29 1 109.03 18.3854 0.000254 ***
## 4 27 160.07 1 9.22 1.5546 0.224489
## 5 25 143.98 2 16.08 1.3561 0.276702
## 6 24 142.33 1 1.65 0.2784 0.602585
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: Against the cleaned dataset modeldata, only the steps that add wt and qsec are significant in the above nested comparison; adding hp, cyl or disp gives no significant improvement. The model with am, wt and qsec is therefore preferred.
Now let us try the same nested likelihood ratio tests on the mtcars dataset
fitam <- lm(mpg ~ am,data=mtcars)
fitamwt <- lm(mpg ~ am + wt, data=mtcars)
fitamwtqsec <- lm(mpg ~ am + wt + qsec, data=mtcars)
fitamwtqsechp <- lm(mpg ~ am + wt + qsec + hp, data=mtcars)
fitamwtqsechpcyl <- lm(mpg ~ am + wt + qsec + hp + cyl, data=mtcars)
fitamwtqsechpcyldisp <- lm(mpg ~ am + wt + qsec + hp + cyl + disp, data=mtcars)
anova(fitam,fitamwt,fitamwtqsec,fitamwtqsechp,fitamwtqsechpcyl,fitamwtqsechpcyldisp)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + qsec
## Model 4: mpg ~ am + wt + qsec + hp
## Model 5: mpg ~ am + wt + qsec + hp + cyl
## Model 6: mpg ~ am + wt + qsec + hp + cyl + disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 278.32 1 442.58 73.2786 6.692e-09 ***
## 3 28 169.29 1 109.03 18.0530 0.0002609 ***
## 4 27 160.07 1 9.22 1.5265 0.2281245
## 5 26 159.82 1 0.25 0.0412 0.8407494
## 6 25 150.99 1 8.83 1.4614 0.2380164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: Against the raw dataset mtcars, again only the steps that add wt and qsec are significant in the above nested comparison. So both datasets show the same result. The p-values differ very slightly, but the best fitted model with clear significance is mpg ~ am + wt + qsec
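The F values differ slightly between the two tables because anova() divides each step's sum of squares by the residual mean square of the largest model in the list, and that largest model is not the same fit when cyl is a factor. A sketch reproducing the first F value of the mtcars table by hand; m1, m2, m_big and f_manual are our names.

```r
data(mtcars)
m1    <- lm(mpg ~ am, data = mtcars)
m2    <- lm(mpg ~ am + wt, data = mtcars)
m_big <- lm(mpg ~ am + wt + qsec + hp + cyl + disp, data = mtcars)
tab <- anova(m1, m2, m_big)
# F = (drop in RSS / 1 df) / (RSS of largest model / its residual df)
f_manual <- (deviance(m1) - deviance(m2)) /
  (deviance(m_big) / df.residual(m_big))
c(manual = f_manual, table = tab$F[2])
```

For lm fits, deviance() returns the residual sum of squares, which is why it can stand in for the RSS column of the table.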
ggplot(data=modeldata,aes(wt,mpg,color=gear)) + geom_point() + facet_grid(cyl~am) + labs(title = "Miles Per Gallon for given Weight, Transmission, Gears and Cylinders.")
I couldn't clearly conclude which is the best fitted model, but Transmission clearly affects Miles Per Gallon. Manual cars average 7.245 mpg more than automatic cars in the unadjusted regression, and an estimated 2.92105 mpg more once the other variables are accounted for.
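To attach an uncertainty range to the adjusted difference, the confidence interval for am can be taken from the model that the nested tests favoured, mpg ~ am + wt + qsec. A minimal sketch on the raw mtcars data; fit_final is our name.

```r
data(mtcars)
fit_final <- lm(mpg ~ am + wt + qsec, data = mtcars)
# Estimate and 95% confidence interval for manual versus automatic,
# holding weight and quarter-mile time fixed
cbind(estimate = coef(fit_final), confint(fit_final))["am", ]
```

This estimate comes from a different model than the 2.92105 figure above, so the two adjusted values are close but not identical.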
par(mfrow=c(2,2))
fit <- glm(mpg ~ disp + wt + factor(cyl), data = modeldata)
plot(fit)
par(mfrow=c(2,2))
plot(fitamwtqsec)
end of report