Synopsis

      In this project we explore the relationship between miles per gallon (MPG) and a set of variables from the Motor Trends car dataset. We are particularly interested in the following two questions:

      A variety of models are possible, it is quickly apparent that manual is significantly better for gas mileage than automatic. The initial t.test shows this. Specifically quantifying the amount requires accounting for all other regressor while maintaining low variance and high accuracy.

      Adjusted R^2 is a good way to account for the change in variance when adding reggressors. By optimizing our model for adjusted R^2 we arrive at a statistically significant model which accounts the highest percentage of variance and has a lower standard error then one including all variables which achieves a significant P-value.

Hypothesis Testing

       Below is the outcome of a t.test showing that manual has a significanlty higher mean than automatic for MPG. Therefore, based on this alone we can reliably conclude manual transmission is better than automatic with 95% confidence. The data-splitting is included in the appendix

t.test(manual$mpg,auto$mpg,alternative = "less")$p.value
## [1] 0.0006868192


Model/feature Selection

       This graph tells us that there is a lot of conflation going on between the variables and we will need to take care to chose the right regressors. We can’t assume independence and msut balance variance with model fit.

      To optimize for adjusted R^2 I wrote a code to choose regressors (including am) with the highest adjusted R^2 value. This is an equivalent process of minimizing the standard error for the am variable as a regressor.

sumall# fitall <- ln(mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear + carb, mtcars)
##   Estimate Std. Error    t value   Pr(>|t|)    Adj R^2        R^2 
##      2.520      2.057      1.225      0.234      0.807      0.869

sum1 # fit1 <- ln(mpg ~ am, mtcars)
##   Estimate Std. Error    t value   Pr(>|t|)    Adj R^2        R^2 
##   7.244939   1.764422   4.106127   0.000285   0.338459   0.359799

sumopt # fitopt <- Ln(mpg ~ am + disp + hp + wt + qsec, mtcars)
##   Estimate Std. Error    t value   Pr(>|t|)    Adj R^2        R^2 
##     3.4705     1.4858     2.3358     0.0275     0.8375     0.8637

Conclusions

Our investigation concludes based on this data that manual transmissions are more efficient than automatic by an average of 3.407 MPG holding all other variables constant. The model appears to have some level of homoscedasticity with outliers being the Corolla, Fiat 128, and Chrysler Imperial. All performing above predicted values

Appendix

Required Data and packages

require(datasets); require(plyr); require(ggplot2); 
require(GGally);require(car);data(mtcars);attach(mtcars)

Splitting data for t.tests and aggregation

auto <- mtcars[which(am==1),]
manual <- mtcars[which(am==0),]

T.Test

t.test(manual$mpg,auto$mpg,alternative = "less")$p.value

Plot

g2 <- ggplot(mtcars, aes(y=mpg,x=am)) + geom_boxplot(aes(fill=factor(am))) + annotate("text", x = .5, y = 10, label = c("p = .00068"))+ geom_jitter(width = 0.2)+ annotate("text", x = 0, y = 16.85, label = c("mean = 17.15"))+ annotate("text", x = 1, y = 24.39, label = c("mean = 24.39")) + annotate("text", x = .5, y = 11, label = c("Corr = .6")) + annotate("text", x = .5, y = 12, label = c("Beta = 7.23"))

g2

Plot2 Pairwise

pair_loess <- function(data, mapping, method="loess", ...){
       p <- ggplot(data = data, mapping = mapping) + 
              geom_point() + 
              geom_smooth(method=method, ...)
       p
}

g <- ggpairs(mtcars, lower = list(continuous=pair_loess))

Extracting useful info for presenting model features

options(digits=3)

fitall <-lm(mpg~.,mtcars)
fit1 <- lm(mpg~am,mtcars)
fitopt <- df[[memdex]]

sumopt <- append(summary(df[[memdex]])$coef[2,],c(summary(df[[memdex]])$adj.r.squared, summary(df[[memdex]])$r.squared))
names(sumopt)[5:6] <- c("Adj R^2", "R^2")

sumall <- append(summary(fitall)$coef[2,],c(summary(fitall)$adj.r.squared, summary(fitall)$r.squared))
names(sumall)[5:6] <- c("Adj R^2", "R^2")

sum1 <- append(summary(fit1)$coef[2,],c(summary(fit1)$adj.r.squared, summary(fit1)$r.squared))
names(sum1)[5:6] <- c("Adj R^2", "R^2")

Optimizing standard error for am estimated coefficient

fitmem = {0}
memdex = {0}
       #Removing mpg and am from regressor list since they are necessary for our purposes
       regressors <- names(mtcars)[c(2:8,10:11)]
       
       #Generating a binary matrix for using each regressor or not in a grid matrix
       usematrix <- expand.grid(c(TRUE,FALSE), c(TRUE,FALSE),
                             c(TRUE,FALSE), c(TRUE,FALSE),c(TRUE,FALSE), c(TRUE,FALSE),
                             c(TRUE,FALSE), c(TRUE,FALSE)) 
       
       #Creates a list of all possible regressor cominations, there are 256
       Models <- apply(usematrix, 1, function(x) as.formula(paste(c("mpg ~ am", regressors[x]), collapse=" + ")))
       #Applies linear models to all of the possible functions and stores in list
       df <- lapply(Models,lm,data =mtcars)
       
       #Loops over the list storing the max value for adjusted R^2 and the corresponding index
       for (i in 1:256){
              if (summary(df[[i]])$adj.r.squared > fitmem) {
                     fitmem <- summary(df[[i]])$adj.r.squared
                     memdex <- i
              } 
              fitmem
              memdex
       }