Executive Summary

Motor Trend is a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome).

They are particularly interested in the following two questions:

Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions.

@ Exploratory analysis and a correlation analysis were performed to visualize the data and to understand the relationships between the variables. They showed (1) a strong negative correlation between miles per gallon and number of cylinders, displacement (cu. in.) and weight, and (2) a moderate positive correlation between miles per gallon and engine type (V-shaped or straight), transmission type (automatic or manual) and rear axle ratio.

@ Using a Welch two-sample t test, I could reject H0 and conclude that mean miles per gallon differs significantly between transmission types (p = 0.0014). In the t tests run for mpg against every other variable, qsec was the only variable whose result was not significant at the 0.05 level (p = 0.051).

@ In the regression analysis, the simple model of mpg on transmission shows a significant relationship: mean mpg is 7.245 higher for manual cars than for automatic cars.

@@ The multivariate regression on all variables showed no individually significant predictor, so the model was tested for multicollinearity. The number of carburetors had by far the highest variance inflation factor, followed by the number of cylinders.

@@ The generalized linear model on all variables (fitted with glm() and its default Gaussian family, so it is a linear rather than a logistic regression) again showed no significant predictor; wt and hp came closest, with p-values of about 0.09. When the model was rerun without the collinear variables cyl and carb, weight was significant (p = 0.024, below alpha = 0.05) with a negative coefficient. In that model, manual transmission is associated with an estimated 2.92105 higher mpg with the other predictors held fixed, although this coefficient is not itself significant (p = 0.158).

@@ The ANOVA (analysis of deviance) suggested the model mpg ~ cyl + disp + wt. Stepwise selection chose mpg ~ cyl + hp + wt + am on the cleaned data and mpg ~ wt + qsec + am on the raw data, and the nested model comparison supported mpg ~ am + wt + qsec.

@@ So I couldn't clearly conclude which is the best fitted model, but transmission type is clearly associated with miles per gallon. Manual transmission cars average 7.245 mpg more than automatic cars in the simple regression; after adjusting for the other predictors, the estimated increase for manual transmission shrinks to 2.92105 mpg.

Overview of the Motor Trend Data for cars.

For this analysis we use the mtcars dataset, which was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Below is a brief description of the variables in the data set:

[, 1] mpg Miles/(US) gallon

[, 2] cyl Number of cylinders

[, 3] disp Displacement (cu.in.)

[, 4] hp Gross horsepower

[, 5] drat Rear axle ratio

[, 6] wt Weight (lb/1000)

[, 7] qsec 1/4 mile time

[, 8] vs V/S

[, 9] am Transmission (0 = automatic, 1 = manual)

[,10] gear Number of forward gears

[,11] carb Number of carburetors

# Overview of data
data(mtcars)
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

tail(mtcars)

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

The data are converted to suitable formats and data types for easier exploration and analysis.

modeldata <- mtcars
# The categorical columns are cyl, vs, am, gear and carb; convert them to factors.
modelcols <- c(2,8,9,10,11)
modeldata[modelcols] <- lapply(modeldata[modelcols],factor)
# Replace the numeric codes with meaningful labels for visualization.
modeldata <- transform(modeldata,
  am = factor(am, levels = 0:1, labels = c("Automatic", "Manual")),
  gear = factor(gear, levels = 3:5, labels = c("3 Gears", "4 Gears", "5 Gears")),
  vs = factor(vs,levels = 0:1,labels = c("V-Engine","S-Engine")))
# Convert the factors back to numeric for the correlation analysis.
# carb and cyl go through as.character() so their real values are kept;
# as.numeric() on a factor returns the level index instead.
modeldata_cor <- transform(modeldata,
  am = as.numeric(am),
  gear = as.numeric(gear),
  vs = as.numeric(vs),
  carb = as.numeric(as.character(carb)),
  cyl = as.numeric(as.character(cyl)))

head(modeldata)

##                    mpg cyl disp  hp drat    wt  qsec       vs        am
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46 V-Engine    Manual
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02 V-Engine    Manual
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61 S-Engine    Manual
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44 S-Engine Automatic
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02 V-Engine Automatic
## Valiant           18.1   6  225 105 2.76 3.460 20.22 S-Engine Automatic
##                      gear carb
## Mazda RX4         4 Gears    4
## Mazda RX4 Wag     4 Gears    4
## Datsun 710        4 Gears    1
## Hornet 4 Drive    3 Gears    1
## Hornet Sportabout 3 Gears    2
## Valiant           3 Gears    1
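
Converting a factor back to numbers is easy to get subtly wrong, so a small illustration may help. The sketch below uses a made-up factor (carb_factor is a hypothetical name, not part of the report) to show that as.numeric() on a factor returns the position of each level, while going through as.character() recovers the values the labels spell out.

```r
# Hypothetical factor holding carburetor counts.
carb_factor <- factor(c(4, 4, 1, 2, 8))

# as.numeric() on a factor gives the index of each value in levels(carb_factor),
# which here are "1", "2", "4", "8".
level_index <- as.numeric(carb_factor)

# Going through as.character() first recovers the original numbers.
real_values <- as.numeric(as.character(carb_factor))

print(level_index)
print(real_values)
```

For evenly spaced values such as gear (3, 4, 5) the two versions differ only by a linear shift, which leaves correlations unchanged; for unevenly spaced values such as carb they do not.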

Exploratory Analysis of Motor Trend cars data.

Since the goal is to determine whether automatic or manual transmission is better for mpg, we start with a violin plot of the two variables mpg and am.

library(ggplot2)
ggplot(data=modeldata,aes(am,mpg,fill = am)) + geom_violin(color = "black",size=1)

[Figure: violin plot of mpg by transmission type (am)]

In the above graph, the violin for automatic transmission sits lower on the mpg axis than the violin for manual transmission: automatic cars tend to have lower mpg than manual cars. This suggests a relationship between transmission type and mpg, which we take as the working hypothesis for the analysis below.
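
The visual impression can be backed with a few group summaries. This is a minimal sketch using base R only; the name by_am is an illustrative choice and does not appear in the original analysis.

```r
data(mtcars)

# Number of cars, mean mpg and standard deviation of mpg per transmission type.
# am is coded 0 = automatic, 1 = manual in mtcars.
by_am <- data.frame(
  am      = c("Automatic", "Manual"),
  n       = as.vector(table(mtcars$am)),
  mean    = as.vector(tapply(mtcars$mpg, mtcars$am, mean)),
  std_dev = as.vector(tapply(mtcars$mpg, mtcars$am, sd))
)
print(by_am)
```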

Correlation Analysis

Let us look at a correlation plot, which shows how strongly mpg is correlated with every other variable in the dataset.

library(corrplot)
corrplot.mixed(cor(modeldata_cor),lower="number",upper="pie")

[Figure: correlation matrix of modeldata_cor, numbers in the lower triangle and pie charts in the upper triangle]

In the above graph, the lower triangle gives the correlation value for each pair of variables, and the upper triangle shows the same values as pie charts on a scale from -1 to 1. As a rule of thumb:

A correlation value >= 0.8 means a strong positive correlation.
A correlation value <= -0.8 means a strong negative correlation.
A correlation value of 0 means no correlation between the two variables.

So based on that if we analyze the graph above,

There is a strong negative correlation between mpg (miles per gallon) and cyl (number of cylinders), disp (displacement, cu. in.) and wt (weight). That is, cars with fewer cylinders, smaller displacement or lower weight tend to have higher miles per gallon.
No variable is uncorrelated with mpg.
There is a moderate positive correlation between mpg and vs (V-shaped or straight engine), am (automatic or manual transmission) and drat (rear axle ratio). That is, straight-engine cars and manual-transmission cars tend to have higher miles per gallon, although the association is weaker than for the variables above.
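
The values read off the plot can also be listed directly. The sketch below works on the raw mtcars data with base R only; cor_with_mpg and sorted_cor are illustrative names.

```r
data(mtcars)

# Pearson correlation of mpg with every other variable, sorted from the
# most negative to the most positive.
cor_with_mpg <- cor(mtcars)["mpg", ]
cor_with_mpg <- cor_with_mpg[names(cor_with_mpg) != "mpg"]
sorted_cor <- sort(cor_with_mpg)
print(round(sorted_cor, 2))
```

Sorting puts the strongest negative correlations first and the strongest positive ones last, which makes it easy to compare the list with the three groups described above.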

Are we right? We need further analysis to find out.

Let us see how mpg is distributed against the moderately and highly correlated variables.

library(GGally)

# Custom panel for the lower triangle: scatter plot with a loess smoother (red)
# and a linear fit (blue).
my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) + 
    geom_point() + 
    geom_smooth(method="loess", fill="red", color="red", ...) +
    geom_smooth(method="lm", fill="blue", color="blue", ...)
  p
}

g <- ggpairs(mtcars, lower = list(continuous = my_fn))
suppressWarnings(print(g))

[Figure: pairs plot of mtcars with loess (red) and linear (blue) fits in the lower panels]

In the above pairs diagram, we can see:

A strong linear relationship between miles per gallon and number of cylinders, displacement and weight.
A much weaker linear relationship between miles per gallon and qsec (1/4 mile time), rear axle ratio (drat) and number of gears.

Variable Selection

T Test between MPG and AM

Hypothesis test

H0: Automatic or manual transmission is not related to the performance on mpg

H1: Automatic or manual transmission is related to the performance on mpg

t.test(mpg~am,data=mtcars)

## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

Observation: The p-value (0.001374) is below 0.05 and the 95% confidence interval for the difference in means (-11.28 to -3.21) does not contain zero, so we can REJECT H0 and say there is a significant difference in miles per gallon between transmission types.
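
To answer the "quantify the difference" question directly, the estimate and its interval can be pulled out of the test object. This is a small sketch; the variable names are illustrative, and the sign is flipped because t.test() reports group 0 (automatic) minus group 1 (manual).

```r
data(mtcars)
tt <- t.test(mpg ~ am, data = mtcars)   # Welch two-sample t test, as above

# Difference in mean mpg, expressed as manual minus automatic.
diff_manual_minus_auto <- unname(tt$estimate[2] - tt$estimate[1])

# 95% confidence interval on the same scale (negate and reverse the limits).
ci_manual_minus_auto <- rev(-as.numeric(tt$conf.int))

print(round(diff_manual_minus_auto, 3))
print(round(ci_manual_minus_auto, 3))
```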

Multi T Test

Let us run the t test for mpg against every other variable.

sapply(mtcars[,2:11], function(i) t.test(mtcars$mpg,i)$conf.int)

##           cyl      disp        hp     drat       wt        qsec       vs
## [1,] 11.65034 -255.3602 -151.3964 14.31395 14.67644 -0.01100157 17.47381
## [2,] 16.15591 -165.9023 -101.7973 18.67417 19.07031  4.49475157 21.83244
##            am     gear     carb
## [1,] 17.50519 14.21654 15.03984
## [2,] 21.86356 18.58971 19.51641

sapply(mtcars[,2:11], function(i) t.test(mtcars$mpg,i)$p.value)

##          cyl         disp           hp         drat           wt 
## 9.507708e-15 7.978234e-11 1.030354e-11 3.164364e-16 1.027903e-16 
##         qsec           vs           am         gear         carb 
## 5.107103e-02 2.241293e-18 2.151228e-18 3.077106e-16 1.680654e-17

Observation: Except for qsec, the confidence interval of every variable keeps the same sign at both limits, so it does not contain zero, and the p-value is below 0.05. For qsec the p-value (0.051) is slightly above alpha and the lower confidence limit is just below zero. Note that t.test(mtcars$mpg, i) compares the mean of mpg with the mean of variable i; it does not measure how strongly the two are related, so these results are only a rough screen.

Omit variables: qsec, but not confidently, because its p-value is very close to alpha and its confidence limit is only slightly below 0.
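
A test of association, rather than of equal means, is one way to cross-check this screen. The sketch below is an alternative to the procedure used in this report, not a replacement for it: it runs cor.test() (Pearson correlation) between mpg and each other variable. The names predictors and assoc are illustrative.

```r
data(mtcars)
predictors <- setdiff(names(mtcars), "mpg")

# One row per predictor: correlation with mpg and the p-value of the test
# that the true correlation is zero.
assoc <- t(sapply(predictors, function(v) {
  ct <- cor.test(mtcars$mpg, mtcars[[v]])
  c(r = unname(ct$estimate), p.value = ct$p.value)
}))
print(round(assoc, 4))
```

On this measure qsec is the most weakly correlated variable, but its correlation with mpg is still different from zero at the 0.05 level, which is one more reason not to drop it with confidence.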

Regression Analysis

Linear Regression

To check our variable selection further, we fit linear regression models.

Let us first see the relationship between our two variables of interest, Miles per Gallon and Transmission type.

fitmpgam <- lm(mpg~factor(am)-1,data=modeldata)
summary(fitmpgam)

## 
## Call:
## lm(formula = mpg ~ factor(am) - 1, data = modeldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## factor(am)Automatic   17.147      1.125   15.25 1.13e-15 ***
## factor(am)Manual      24.392      1.360   17.94  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.9487, Adjusted R-squared:  0.9452 
## F-statistic: 277.2 on 2 and 30 DF,  p-value: < 2.2e-16

confint(fitmpgam)

##                        2.5 %   97.5 %
## factor(am)Automatic 14.85062 19.44411
## factor(am)Manual    21.61568 27.16894

Observation: Because the intercept is removed (- 1), each coefficient is the mean mpg of its own group: 17.147 for automatic and 24.392 for manual. The difference between the two coefficients, 7.245, is the increase in mean mpg when moving from automatic to manual transmission. The p-values in this table test whether each group mean is zero, not whether the groups differ, and the R-squared of a model without an intercept is not comparable with the later models. The evidence for a difference is that the two 95% confidence intervals (14.85 to 19.44 and 21.62 to 27.17) do not overlap, which agrees with the t test above. Therefore we can conclude that there is a significant relationship between transmission and mpg, with higher mpg for manual than for automatic cars.
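
Keeping the intercept gives the difference and its test in one coefficient. This is a minimal sketch on the raw mtcars data; fit_with_intercept, am_effect and am_pvalue are illustrative names.

```r
data(mtcars)

# With an intercept, the intercept is the automatic-group mean and the
# factor(am)1 coefficient is the manual-minus-automatic difference.
fit_with_intercept <- lm(mpg ~ factor(am), data = mtcars)
coefs <- summary(fit_with_intercept)$coefficients

am_effect <- coefs["factor(am)1", "Estimate"]
am_pvalue <- coefs["factor(am)1", "Pr(>|t|)"]

print(round(coefs, 4))
```

The p-value on the factor(am)1 row is the one that tests whether the two transmission types differ.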

Multivariate Linear Regression Model

fitmpgall <- lm(mpg~.-1,data=modeldata)
summary(fitmpgall)

## 
## Call:
## lm(formula = mpg ~ . - 1, data = modeldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## cyl4        23.87913   20.06582   1.190   0.2525  
## cyl6        21.23044   18.33416   1.158   0.2650  
## cyl8        23.54297   18.22250   1.292   0.2159  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vsS-Engine   1.93085    2.87126   0.672   0.5115  
## amManual     1.21212    3.21355   0.377   0.7113  
## gear4 Gears  1.11435    3.79952   0.293   0.7733  
## gear5 Gears  2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.9914, Adjusted R-squared:  0.9817 
## F-statistic:   102 on 17 and 15 DF,  p-value: 1.979e-12

confint(fitmpgall)

##                   2.5 %      97.5 %
## cyl4        -18.8901510 66.64841591
## cyl6        -17.8479101 60.30878447
## cyl8        -15.2973628 62.38330171
## disp         -0.0324452  0.10353785
## hp           -0.1545404  0.01352676
## drat         -4.1105919  6.47625226
## wt           -9.9409845  0.88143283
## qsec         -1.6259039  2.36159354
## vsS-Engine   -4.1890905  8.05079160
## amManual     -5.6373936  8.06162502
## gear4 Gears  -6.9841244  9.21283428
## gear5 Gears  -5.4354626 10.49225456
## carb2        -5.9199999  3.96129129
## carb3        -6.1518381 12.15111565
## carb4        -8.3927175 10.57556323
## carb6        -9.1297377 18.08487616
## carb8       -10.5697142 25.07053667

Interpretation: In the multivariate model, every p-value is greater than alpha (0.05) and every confidence interval contains 0. This is a major problem and contradicts the analysis so far. It usually occurs when there is multicollinearity among the variables: two or more predictors are highly correlated with each other, so the model cannot separate their individual effects on the outcome. Collinearity can be detected with the variance inflation factor (VIF), which is examined next.

Multicollinearity

library(car)

fitvif <- lm(mpg ~., data=modeldata)
vif(fitvif)

##            GVIF Df GVIF^(1/(2*Df))
## cyl  128.120962  2        3.364380
## disp  60.365687  1        7.769536
## hp    28.219577  1        5.312210
## drat   6.809663  1        2.609533
## wt    23.830830  1        4.881683
## qsec  10.790189  1        3.284842
## vs     8.088166  1        2.843970
## am     9.930495  1        3.151269
## gear  50.852311  2        2.670408
## carb 503.211851  5        1.862838

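Observation: The variables with the highest VIF are the most collinear. Here the raw GVIF is by far the highest for carb (503.2), followed by cyl (128.1). Both are factors with several degrees of freedom, so the GVIF^(1/(2*Df)) column is the fairer comparison across terms; on that scale disp (7.77), hp (5.31) and wt (4.88) rank highest.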
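
For a numeric predictor with one degree of freedom, the VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes it by hand with base R for a small illustrative model; vif_manual is a hypothetical helper written for this example, not a function from the car package, and it does not handle multi-level factors.

```r
data(mtcars)

# VIF of each predictor: regress it on the remaining predictors and
# convert the resulting R-squared into 1 / (1 - R^2).
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(mtcars, c("wt", "qsec", "am"))
print(round(vifs, 2))
```

A VIF of 1 means the predictor is uncorrelated with the others; the larger the value, the more its coefficient's variance is inflated by collinearity.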

Generalized Linear Model (Gaussian family)

glmfitall <- glm(mpg ~ ., data=modeldata)
summary(glmfitall)

## 
## Call:
## glm(formula = mpg ~ ., data = modeldata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5087  -1.3584  -0.0948   0.7745   4.6251  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 23.87913   20.06582   1.190   0.2525  
## cyl6        -2.64870    3.04089  -0.871   0.3975  
## cyl8        -0.33616    7.15954  -0.047   0.9632  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vsS-Engine   1.93085    2.87126   0.672   0.5115  
## amManual     1.21212    3.21355   0.377   0.7113  
## gear4 Gears  1.11435    3.79952   0.293   0.7733  
## gear5 Gears  2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 8.026845)
## 
##     Null deviance: 1126.0  on 31  degrees of freedom
## Residual deviance:  120.4  on 15  degrees of freedom
## AIC: 169.22
## 
## Number of Fisher Scoring iterations: 2

Interpretation: None of the variables is statistically significant at the 0.05 level; wt and hp come closest, with p-values of about 0.09. Because glm() uses the Gaussian family by default, this fit is the same linear regression as above with an intercept added, not a logistic regression.

Let us omit the collinear variables cyl and carb and run the model one more time.

glmfit <- glm(mpg ~ disp + hp + drat + wt + qsec + vs + am + gear, data = modeldata)
summary(glmfit)

## 
## Call:
## glm(formula = mpg ~ disp + hp + drat + wt + qsec + vs + am + 
##     gear, data = modeldata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0548  -1.4564  -0.3425   1.2825   4.7168  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 10.80934   13.36845   0.809   0.4274  
## disp         0.01380    0.01341   1.029   0.3145  
## hp          -0.02721    0.01720  -1.582   0.1279  
## drat         1.18599    1.74482   0.680   0.5038  
## wt          -3.68884    1.52013  -2.427   0.0239 *
## qsec         0.91001    0.64014   1.422   0.1692  
## vsS-Engine   0.65015    1.93968   0.335   0.7407  
## amManual     2.92105    2.00082   1.460   0.1584  
## gear4 Gears -0.42897    2.43311  -0.176   0.8617  
## gear5 Gears  0.88164    2.57587   0.342   0.7354  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 6.662079)
## 
##     Null deviance: 1126.05  on 31  degrees of freedom
## Residual deviance:  146.57  on 22  degrees of freedom
## AIC: 161.51
## 
## Number of Fisher Scoring iterations: 2

Interpretation: After removing the collinear variables cyl and carb, weight is significant (p = 0.0239, below alpha = 0.05) and has a negative coefficient (-3.69): heavier cars have lower mpg. For the project variable am, note that this model uses the Gaussian family with the identity link (see the dispersion line in the output), so the coefficients are on the mpg scale; the log-odds form ln(p/(1-p)) = a*x1 + b*x2 + … + z*xn would apply only to a binomial model with a binary response. Since manual is a dummy variable, the coefficient means that manual transmission is associated with an estimated 2.92105 higher mpg than automatic, with the other predictors held fixed. This coefficient is not significant at the 0.05 level (p = 0.1584).
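
Which family and link a glm() call actually used can be checked on the fitted object. The sketch below uses a small illustrative formula (not one of the models in this report) to show that glm() without a family argument reproduces lm().

```r
data(mtcars)
f <- mpg ~ wt + hp + factor(am)   # illustrative formula

fit_lm  <- lm(f, data = mtcars)
fit_glm <- glm(f, data = mtcars)  # no family given, so the default is used

# The default family is gaussian with the identity link, so the
# coefficients match the ordinary least squares fit.
family_used <- family(fit_glm)$family
link_used   <- family(fit_glm)$link
same_coefs  <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))

print(c(family = family_used, link = link_used))
print(same_coefs)
```

A logistic regression would need family = binomial and a binary response, which mpg is not.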

ANOVA analysis

Now let us run the ANOVA on the full model to analyze the table of deviance.

anova(glm(mpg ~.,data=modeldata),test="Chisq")

## Analysis of Deviance Table
## 
## Model: gaussian, link: identity
## 
## Response: mpg
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    31    1126.05              
## cyl   2   824.78        29     301.26 < 2.2e-16 ***
## disp  1    57.64        28     243.62  0.007367 ** 
## hp    1    18.50        27     225.12  0.128955    
## drat  1    11.91        26     213.20  0.223098    
## wt    1    55.79        25     157.42  0.008382 ** 
## qsec  1     1.52        24     155.89  0.662974    
## vs    1     0.30        23     155.59  0.846179    
## am    1    16.57        22     139.02  0.150825    
## gear  2     5.02        20     134.00  0.731400    
## carb  5    13.60        15     120.40  0.889633    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. The table shows the drop in deviance as each variable is added, one at a time, in the order listed. Adding cyl, disp and wt reduces the residual deviance significantly, while hp and drat do not (p = 0.129 and p = 0.223). A large p-value here indicates that the model without the variable explains more or less the same amount of variation. The asterisks next to the p-values mark significance: the more asterisks, the more significant the variable. So in this case the model fits well with cyl, disp and wt, and adding further variables brings no significant improvement.
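
Because the terms are added sequentially, the result for each variable depends on what was entered before it. The sketch below illustrates this with a two-variable example on mtcars; the variable names are illustrative.

```r
data(mtcars)

# Same two predictors, entered in opposite orders.
a_am_first <- anova(lm(mpg ~ am + wt, data = mtcars))
a_wt_first <- anova(lm(mpg ~ wt + am, data = mtcars))

# Sum of squares attributed to am in each ordering.
ss_am_first <- a_am_first["am", "Sum Sq"]
ss_am_last  <- a_wt_first["am", "Sum Sq"]

print(c(am_entered_first = ss_am_first, am_entered_after_wt = ss_am_last))
```

Entered first, am absorbs a large share of the variation; entered after wt, it adds almost nothing. This order dependence is one reason a sequential table and a stepwise search can point to different models.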

Suggested Model from above: mpg ~ disp + cyl + wt

To check this more carefully, we can fit a stepwise regression model.

Stepwise selection model

stepfit=step(lm(data=modeldata, mpg ~ .),trace=0,steps=10000)
summary(stepfit)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = modeldata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

Interpretation: Here the fitted model is mpg ~ cyl + hp + wt + am with an adjusted R-squared of 0.8401, which means the model explains about 84.01% of the variation in miles per gallon. The bad news is that this model differs from the one suggested by the ANOVA above. Note that this model is fitted against the cleaned data modeldata.

Now let us run the stepwise selection directly on mtcars instead of the cleaned dataset modeldata.

stepfit=step(lm(data=mtcars, mpg ~ .),trace=0,steps=10000)
summary(stepfit)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Interpretation: Here the fitted model is mpg ~ wt + qsec + am with an adjusted R-squared of 0.8336, which means the model explains about 83.36% of the variation in miles per gallon.

Clearly the selected model differs between the two datasets, and both differ from the model suggested by the ANOVA. To find out why the ANOVA and the stepwise selection each gave a different model, we compare a sequence of nested models to test the stepwise result.
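
The two stepwise results can be put side by side on the same criteria. This is a sketch under the assumption that refitting them on mtcars with factor() around cyl and am reproduces the cleaned-data model; fit_clean, fit_raw and comparison are illustrative names.

```r
data(mtcars)

# Model chosen on the cleaned data (cyl and am as factors) and
# model chosen on the raw data (all predictors numeric).
fit_clean <- lm(mpg ~ factor(cyl) + hp + wt + factor(am), data = mtcars)
fit_raw   <- lm(mpg ~ wt + qsec + am, data = mtcars)

comparison <- data.frame(
  model  = c("cyl + hp + wt + am", "wt + qsec + am"),
  adj_r2 = c(summary(fit_clean)$adj.r.squared, summary(fit_raw)$adj.r.squared),
  aic    = c(AIC(fit_clean), AIC(fit_raw))
)
print(comparison)
```

step() selects by AIC, so treating cyl as a three-level factor (two coefficients) rather than a single number changes the penalty each candidate model pays, and with it the path the search takes.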

Nested Likelihood Ratio test

Let us fit a sequence of nested models, starting from the one variable of interest and adding one more variable at each step.

fitam <- lm(mpg ~ am,data=modeldata)
fitamwt <- lm(mpg ~ am + wt, data=modeldata)
fitamwtqsec <- lm(mpg ~ am + wt + qsec, data=modeldata)
fitamwtqsechp <- lm(mpg ~ am + wt + qsec + hp, data=modeldata)
fitamwtqsechpcyl <- lm(mpg ~ am + wt + qsec + hp + cyl, data=modeldata)
fitamwtqsechpcyldisp <- lm(mpg ~ am + wt + qsec + hp + cyl + disp, data=modeldata)
anova(fitam,fitamwt,fitamwtqsec,fitamwtqsechp,fitamwtqsechpcyl,fitamwtqsechpcyldisp)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + qsec
## Model 4: mpg ~ am + wt + qsec + hp
## Model 5: mpg ~ am + wt + qsec + hp + cyl
## Model 6: mpg ~ am + wt + qsec + hp + cyl + disp
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 74.6280 7.892e-09 ***
## 3     28 169.29  1    109.03 18.3854  0.000254 ***
## 4     27 160.07  1      9.22  1.5546  0.224489    
## 5     25 143.98  2     16.08  1.3561  0.276702    
## 6     24 142.33  1      1.65  0.2784  0.602585    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: Against the cleaned dataset modeldata, only the additions of wt and qsec are significant, which supports the model with the variables am, wt and qsec.

Now let us try the same nested comparison on the mtcars dataset.

fitam <- lm(mpg ~ am,data=mtcars)
fitamwt <- lm(mpg ~ am + wt, data=mtcars)
fitamwtqsec <- lm(mpg ~ am + wt + qsec, data=mtcars)
fitamwtqsechp <- lm(mpg ~ am + wt + qsec + hp, data=mtcars)
fitamwtqsechpcyl <- lm(mpg ~ am + wt + qsec + hp + cyl, data=mtcars)
fitamwtqsechpcyldisp <- lm(mpg ~ am + wt + qsec + hp + cyl + disp, data=mtcars)
anova(fitam,fitamwt,fitamwtqsec,fitamwtqsechp,fitamwtqsechpcyl,fitamwtqsechpcyldisp)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + qsec
## Model 4: mpg ~ am + wt + qsec + hp
## Model 5: mpg ~ am + wt + qsec + hp + cyl
## Model 6: mpg ~ am + wt + qsec + hp + cyl + disp
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 73.2786 6.692e-09 ***
## 3     28 169.29  1    109.03 18.0530 0.0002609 ***
## 4     27 160.07  1      9.22  1.5265 0.2281245    
## 5     26 159.82  1      0.25  0.0412 0.8407494    
## 6     25 150.99  1      8.83  1.4614 0.2380164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: Against the raw dataset mtcars, again only the additions of wt and qsec are significant. Both datasets give the same result, with only slight differences in the p-values: the best-supported model is mpg ~ am + wt + qsec.
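
Within the model mpg ~ am + wt + qsec, the transmission effect and its uncertainty can be read from the am coefficient. A minimal sketch on the raw mtcars data follows; fit_final, am_estimate and am_ci are illustrative names.

```r
data(mtcars)
fit_final <- lm(mpg ~ am + wt + qsec, data = mtcars)

# Estimated mpg difference, manual minus automatic, with weight and
# quarter-mile time held fixed, and its 95% confidence interval.
am_estimate <- unname(coef(fit_final)["am"])
am_ci <- confint(fit_final)["am", ]

print(round(am_estimate, 4))
print(round(am_ci, 4))
```

The estimate matches the am coefficient in the stepwise output for mtcars, and the interval shows how wide the range of plausible adjusted differences is with only 32 cars.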

Appendix 1 Data visualization with respect to mpg

ggplot(data=modeldata,aes(wt,mpg,color=gear)) + geom_point() + facet_grid(cyl~am) + labs(title = "Miles Per Gallon for given Weight, Transmission, Gears and Cylinders.")

[Figure: scatter plots of mpg against weight, colored by gear, faceted by cylinders (rows) and transmission (columns)]

Result:

I couldn't clearly conclude which is the best fitted model, but transmission type is clearly associated with miles per gallon. Manual transmission cars average 7.245 mpg more than automatic cars in the simple regression; after adjusting for the other predictors, the estimated increase for manual transmission shrinks to 2.92105 mpg.

Appendix 2 Residual Diagnostics of both the models

Model 1: mpg ~ cyl + disp + wt

par(mfrow=c(2,2))
fit <- glm(mpg ~ disp + wt + factor(cyl), data = modeldata)
plot(fit)

[Figure: residual diagnostic plots for Model 1]

Model 2: mpg ~ am + wt + qsec

par(mfrow=c(2,2))
plot(fitamwtqsec)

[Figure: residual diagnostic plots for Model 2]
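
The diagnostic plots can be complemented with a few numbers. The sketch below, for Model 2 refitted on mtcars, runs a Shapiro-Wilk normality test on the residuals and lists the cars with the highest leverage. Using these particular checks is a suggestion for illustration, not part of the original analysis.

```r
data(mtcars)
fit_m2 <- lm(mpg ~ am + wt + qsec, data = mtcars)

res <- residuals(fit_m2)
sw  <- shapiro.test(res)   # H0: the residuals are normally distributed
lev <- hatvalues(fit_m2)   # leverage of each car; the values sum to the number of coefficients

top_leverage <- names(sort(lev, decreasing = TRUE))[1:3]

print(sw)
print(top_leverage)
```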

end of report