In this project we explore the relationship between miles per gallon (MPG) and a set of variables from the Motor Trends car dataset. We are particularly interested in the following two questions:
      A variety of models are possible, it is quickly apparent that manual is significantly better for gas mileage than automatic. The initial t.test shows this. Specifically quantifying the amount requires accounting for all other regressor while maintaining low variance and high accuracy.
      Adjusted R^2 is a good way to account for the change in variance when adding reggressors. By optimizing our model for adjusted R^2 we arrive at a statistically significant model which accounts the highest percentage of variance and has a lower standard error then one including all variables which achieves a significant P-value.
      Below is the outcome of a t.test showing that manual has a significanlty higher mean than automatic for MPG. Therefore, based on this alone we can reliably conclude manual transmission is better than automatic with 95% confidence. The data-splitting is included in the appendix
t.test(manual$mpg,auto$mpg,alternative = "less")$p.value
## [1] 0.0006868192
      This graph tells us that there is a lot of conflation going on between the variables and we will need to take care to chose the right regressors. We can’t assume independence and msut balance variance with model fit.
      To optimize for adjusted R^2 I wrote a code to choose regressors (including am) with the highest adjusted R^2 value. This is an equivalent process of minimizing the standard error for the am variable as a regressor.
sumall# fitall <- ln(mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear + carb, mtcars)
## Estimate Std. Error t value Pr(>|t|) Adj R^2 R^2
## 2.520 2.057 1.225 0.234 0.807 0.869
sum1 # fit1 <- ln(mpg ~ am, mtcars)
## Estimate Std. Error t value Pr(>|t|) Adj R^2 R^2
## 7.244939 1.764422 4.106127 0.000285 0.338459 0.359799
sumopt # fitopt <- Ln(mpg ~ am + disp + hp + wt + qsec, mtcars)
## Estimate Std. Error t value Pr(>|t|) Adj R^2 R^2
## 3.4705 1.4858 2.3358 0.0275 0.8375 0.8637
Required Data and packages
require(datasets); require(plyr); require(ggplot2);
require(GGally);require(car);data(mtcars);attach(mtcars)
Splitting data for t.tests and aggregation
auto <- mtcars[which(am==1),]
manual <- mtcars[which(am==0),]
T.Test
t.test(manual$mpg,auto$mpg,alternative = "less")$p.value
Plot
g2 <- ggplot(mtcars, aes(y=mpg,x=am)) + geom_boxplot(aes(fill=factor(am))) + annotate("text", x = .5, y = 10, label = c("p = .00068"))+ geom_jitter(width = 0.2)+ annotate("text", x = 0, y = 16.85, label = c("mean = 17.15"))+ annotate("text", x = 1, y = 24.39, label = c("mean = 24.39")) + annotate("text", x = .5, y = 11, label = c("Corr = .6")) + annotate("text", x = .5, y = 12, label = c("Beta = 7.23"))
g2
Plot2 Pairwise
pair_loess <- function(data, mapping, method="loess", ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=method, ...)
p
}
g <- ggpairs(mtcars, lower = list(continuous=pair_loess))
Extracting useful info for presenting model features
options(digits=3)
fitall <-lm(mpg~.,mtcars)
fit1 <- lm(mpg~am,mtcars)
fitopt <- df[[memdex]]
sumopt <- append(summary(df[[memdex]])$coef[2,],c(summary(df[[memdex]])$adj.r.squared, summary(df[[memdex]])$r.squared))
names(sumopt)[5:6] <- c("Adj R^2", "R^2")
sumall <- append(summary(fitall)$coef[2,],c(summary(fitall)$adj.r.squared, summary(fitall)$r.squared))
names(sumall)[5:6] <- c("Adj R^2", "R^2")
sum1 <- append(summary(fit1)$coef[2,],c(summary(fit1)$adj.r.squared, summary(fit1)$r.squared))
names(sum1)[5:6] <- c("Adj R^2", "R^2")
Optimizing standard error for am estimated coefficient
fitmem = {0}
memdex = {0}
#Removing mpg and am from regressor list since they are necessary for our purposes
regressors <- names(mtcars)[c(2:8,10:11)]
#Generating a binary matrix for using each regressor or not in a grid matrix
usematrix <- expand.grid(c(TRUE,FALSE), c(TRUE,FALSE),
c(TRUE,FALSE), c(TRUE,FALSE),c(TRUE,FALSE), c(TRUE,FALSE),
c(TRUE,FALSE), c(TRUE,FALSE))
#Creates a list of all possible regressor cominations, there are 256
Models <- apply(usematrix, 1, function(x) as.formula(paste(c("mpg ~ am", regressors[x]), collapse=" + ")))
#Applies linear models to all of the possible functions and stores in list
df <- lapply(Models,lm,data =mtcars)
#Loops over the list storing the max value for adjusted R^2 and the corresponding index
for (i in 1:256){
if (summary(df[[i]])$adj.r.squared > fitmem) {
fitmem <- summary(df[[i]])$adj.r.squared
memdex <- i
}
fitmem
memdex
}