Executive Summary: Any “Gear Head” will tell you that a manual transmission gets better gas mileage than an automatic, but that claim is anecdotal at best. In this paper, I use simple and multiple linear regression to test the alternative hypothesis that the two transmission types differ in miles per gallon. They do indeed differ. I also investigated the other variables in the data set to check for confounding. Weight of the vehicle has more influence over gas mileage than its transmission. However, the transmission still plays a part in the equation that predicts gas mileage.

library(dplyr); library(ggplot2); library(plotly); library(broom)  # packages used below
data("mtcars")
cars <- mtcars

head(cars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The column names are terse abbreviations, so they were renamed for easier interpretation.

colnames(cars)<-c("miles_per_gallon", "cylinder","displacement", "horse_power",
    "rear_axle_ratio","weight","quarter_mile_time","Engine_type",
    "transmission","number_of_forward_gears", "carburetors")
# Keep numeric copies (transmission still coded 0/1) for the correlation computed later
y <- cars$miles_per_gallon
x <- cars$transmission


# Recode the two 0/1 indicator columns as readable labels
cars <- cars %>%
  mutate(transmission = ifelse(transmission == 0, "Automatic", "Manual"),
         Engine_type  = ifelse(Engine_type == 0, "V-shaped", "Straight"))

The four major assumptions

A linear regression model rests on four assumptions that must be met:

-Independence: Observations are independent of each other. Each row of the data is a different car model, so no car's mileage is derived from another car's measurement. The outcome is also not a direct calculation from the predictor (a paycheck, for example, is a function of hours worked and the hourly wage, so there y is determined by x). Miles per gallon is measured separately from transmission type, so this criterion is met.

-Normality: For any given x (in this case transmission type), y (miles per gallon) must be normally distributed.

# binwidth overrides bins in geom_histogram, so only binwidth is kept
g <- ggplot(data = cars, aes(x = miles_per_gallon)) +
  geom_histogram(binwidth = 1.5)
g

Pooled over both transmission types, the histogram looks approximately normal.
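Because the normality assumption is stated per transmission type, a per-group check can complement the pooled histogram. The following is a minimal sketch, not part of the original analysis: it uses the original `mtcars` column names (`mpg`, `am`) so that it runs on its own, and it uses the Shapiro-Wilk test as one possible check. The name `shapiro_p` is introduced here for illustration only.

```r
# Sketch: Shapiro-Wilk normality test of mpg within each transmission group.
# Assumes the original mtcars coding: am = 0 (automatic), am = 1 (manual).
data("mtcars")

shapiro_p <- sapply(split(mtcars$mpg, mtcars$am),
                    function(v) shapiro.test(v)$p.value)
names(shapiro_p) <- c("Automatic", "Manual")

# A large p-value means the test found no evidence against normality
print(shapiro_p)
```

With only 19 and 13 cars in the two groups, such a test has limited power, so it is best read alongside the plot rather than instead of it.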

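-Linearity: The relationship between X and the mean of Y is linear. With a two-level predictor such as transmission, a straight line through the two group means always fits, so the correlation here mainly measures the strength of the association.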

cars_cor <-cor(x,y)

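The Pearson correlation coefficient was 0.5998324. Because transmission is coded 0 for automatic and 1 for manual, this means that manual transmission has a moderate positive association with miles per gallon.

Now to perform a two-sample t-test: the miles_per_gallon values for automatic and for manual transmission are subset from the data frame and passed to t.test() as two separate samples.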

cars_test <-t.test(cars[cars$transmission=="Automatic",]$miles_per_gallon,
       cars[cars$transmission=="Manual",]$miles_per_gallon)
cars_test
## 
##  Welch Two Sample t-test
## 
## data:  cars[cars$transmission == "Automatic", ]$miles_per_gallon and cars[cars$transmission == "Manual", ]$miles_per_gallon
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231
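The same Welch test can also be written with the formula interface of `t.test()`, which avoids the manual subsetting. This is a sketch on the original `mtcars` columns (`mpg`, `am`) so it is self-contained; `tt` is a name introduced here for illustration.

```r
# Sketch: Welch two-sample t-test via the formula interface.
# am = 0 (automatic) is the first group, so the difference is automatic minus manual.
data("mtcars")

tt <- t.test(mpg ~ am, data = mtcars)

print(tt$statistic)  # t statistic
print(tt$p.value)    # two-sided p-value
print(tt$conf.int)   # 95% confidence interval for the difference in means
```

Because the group order matches the subsetting used in the report, the statistic and interval should agree with the output shown there.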

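From here we see that the p-value is 0.0013736, so the difference in mean miles per gallon between automatic (17.15) and manual (24.39) cars is statistically significant at the 0.05 level. Next comes the linear model for the rest of the evaluation.

This is a simple linear model with one predictor, transmission, and the outcome miles per gallon. The -1 in the formula drops the intercept, so each coefficient is simply the mean miles per gallon for that transmission type.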

fit <-lm(miles_per_gallon~transmission-1, cars)
fit
## 
## Call:
## lm(formula = miles_per_gallon ~ transmission - 1, data = cars)
## 
## Coefficients:
## transmissionAutomatic     transmissionManual  
##                 17.15                  24.39
summary(fit)$coeff
##                       Estimate Std. Error  t value     Pr(>|t|)
## transmissionAutomatic 17.14737   1.124603 15.24749 1.133983e-15
## transmissionManual    24.39231   1.359578 17.94109 1.376283e-17
levels(cars$transmission)=c("Automatic","Manual")
plot_ly(data=cars, x=~transmission, y=~miles_per_gallon, type="scatter")%>%
  add_boxplot()%>%
  layout(showlegend=FALSE)
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
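In the no-intercept fit, the p-value on each coefficient tests whether that group mean is zero, not whether the two means differ. A minimal sketch of the alternative parameterisation, with an intercept, is given here. It is built on the original `mtcars` columns so it runs on its own; `mt`, `fit_diff`, and `coef_table` are names introduced for illustration.

```r
# Sketch: simple regression with an intercept.
# The intercept is the automatic-group mean; the slope is manual minus automatic.
data("mtcars")

mt <- transform(mtcars,
                transmission = ifelse(am == 0, "Automatic", "Manual"))

fit_diff <- lm(mpg ~ transmission, data = mt)
coef_table <- summary(fit_diff)$coefficients

print(coef_table)
```

The row for `transmissionManual` estimates the difference between the two group means, and its p-value addresses the question the report is actually asking.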

Homoscedasticity of a simple linear model:

Is the variance of the residuals the same for every value of X? To check, we plot the residual diagnostics for the fitted model and then inspect the residuals alongside miles_per_gallon.

par(mfrow=c(1,1))
plot(fit)

fit.aug <-augment(fit)
head(fit.aug)
## # A tibble: 6 x 9
##   .rownames   miles_per_gallon transmission .fitted .resid   .hat .sigma .cooksd
##   <chr>                  <dbl> <chr>          <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
## 1 Mazda RX4               21   Manual          24.4 -3.39  0.0769   4.94 0.0216 
## 2 Mazda RX4 ~             21   Manual          24.4 -3.39  0.0769   4.94 0.0216 
## 3 Datsun 710              22.8 Manual          24.4 -1.59  0.0769   4.98 0.00476
## 4 Hornet 4 D~             21.4 Automatic       17.1  4.25  0.0526   4.92 0.0221 
## 5 Hornet Spo~             18.7 Automatic       17.1  1.55  0.0526   4.98 0.00294
## 6 Valiant                 18.1 Automatic       17.1  0.953 0.0526   4.98 0.00111
## # ... with 1 more variable: .std.resid <dbl>
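With a single two-level predictor, equal residual variance amounts to equal spread of miles per gallon within each transmission group, which can also be checked numerically. The sketch below is an addition, not part of the original analysis. It uses the original `mtcars` columns and the base R `var.test()` F test, which itself assumes normality within groups; `group_sd` and `vt` are illustrative names.

```r
# Sketch: compare the spread of mpg between the two transmission groups.
# am = 0 (automatic), am = 1 (manual) in the original mtcars coding.
data("mtcars")

group_sd <- tapply(mtcars$mpg, mtcars$am, sd)
print(group_sd)

# F test for equality of the two group variances
vt <- var.test(mpg ~ am, data = mtcars)
print(vt$p.value)
```

If the spreads differ noticeably, the Welch t-test used earlier remains appropriate because it does not assume equal variances.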

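Multiple-variable linear regression

While the type of transmission seems to predict the miles per gallon, the model has a relative error of ~0.25: its residual standard error is about a quarter of the mean miles per gallon. This does not mean the model is wrong 25% of the time. It means a typical prediction misses by roughly 25% of the average mileage. What about all the other variables collected in this table?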
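The relative error quoted for the transmission-only model can be reproduced as the residual standard error divided by the mean of the outcome. This sketch assumes that definition, which matches how the error of the reduced model is computed later in the report. It uses the original `mtcars` columns, and `relative_error` is a hypothetical helper introduced here.

```r
# Sketch: residual standard error as a fraction of the mean outcome.
data("mtcars")

# Hypothetical helper: sigma(model) is the residual standard error of an lm fit
relative_error <- function(model, y) sigma(model) / mean(y)

fit_simple <- lm(mpg ~ factor(am), data = mtcars)
rel_simple <- relative_error(fit_simple, mtcars$mpg)

print(round(rel_simple, 2))
```

The same helper can be applied to any later model to compare fits on a common scale.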

model_all <-lm(miles_per_gallon~., data=cars)
summary(model_all)
## 
## Call:
## lm(formula = miles_per_gallon ~ ., data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             12.62114   19.02842   0.663   0.5144  
## cylinder                -0.11144    1.04502  -0.107   0.9161  
## displacement             0.01334    0.01786   0.747   0.4635  
## horse_power             -0.02148    0.02177  -0.987   0.3350  
## rear_axle_ratio          0.78711    1.63537   0.481   0.6353  
## weight                  -3.71530    1.89441  -1.961   0.0633 .
## quarter_mile_time        0.82104    0.73084   1.123   0.2739  
## Engine_typeV-shaped     -0.31776    2.10451  -0.151   0.8814  
## transmissionManual       2.52023    2.05665   1.225   0.2340  
## number_of_forward_gears  0.65541    1.49326   0.439   0.6652  
## carburetors             -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

In the full model no single coefficient is significant at the 0.05 level, even though the model as a whole is (F-statistic p-value 3.793e-07); weight comes closest, with p = 0.0633. The next model therefore keeps only four candidate predictors:
-rear axle ratio
-weight
-quarter mile time
-transmission
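One common reason for a high R-squared combined with large p-values on individual coefficients is strong correlation among the predictors. The sketch below is an illustrative addition that checks a few of them, using the original `mtcars` column names (`wt`, `disp`, `hp`, `cyl`); `predictor_cor` is a name introduced here.

```r
# Sketch: pairwise Pearson correlations among some of the predictors.
data("mtcars")

predictor_cor <- round(cor(mtcars[, c("wt", "disp", "hp", "cyl")]), 2)
print(predictor_cor)
```

Correlations close to 1 indicate that these variables carry overlapping information, which inflates the standard errors when they are all entered together.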

model2 <- lm(miles_per_gallon ~ I(transmission=="Manual")+ weight+ rear_axle_ratio + quarter_mile_time,data = cars)
model_all_coeff <-round(model2$coefficients,2)
anova(model2)
## Analysis of Variance Table
## 
## Response: miles_per_gallon
##                             Df Sum Sq Mean Sq F value    Pr(>F)    
## I(transmission == "Manual")  1 405.15  405.15 65.1576 1.132e-08 ***
## weight                       1 442.58  442.58 71.1766 4.763e-09 ***
## rear_axle_ratio              1  11.33   11.33  1.8225 0.1882294    
## quarter_mile_time            1  99.10   99.10 15.9378 0.0004518 ***
## Residuals                   27 167.89    6.22                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model2)
## 
## Call:
## lm(formula = miles_per_gallon ~ I(transmission == "Manual") + 
##     weight + rear_axle_ratio + quarter_mile_time, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3046 -1.6260 -0.6634  1.2097  4.6626 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       7.6277     8.2103   0.929 0.361095    
## I(transmission == "Manual")TRUE   2.5729     1.6225   1.586 0.124446    
## weight                           -3.8040     0.7592  -5.010 2.96e-05 ***
## rear_axle_ratio                   0.6429     1.3551   0.474 0.639003    
## quarter_mile_time                 1.1958     0.2995   3.992 0.000452 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.494 on 27 degrees of freedom
## Multiple R-squared:  0.8509, Adjusted R-squared:  0.8288 
## F-statistic: 38.52 on 4 and 27 DF,  p-value: 8.673e-11
par(mfrow=c(2,2))
plot(model2)

In the first two plots we see that the residuals fall in a mostly random pattern around zero. This suggests that the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) has roughly the same variance across all values of the independent variables.

What about the error in this new equation?

new_error <-round(sigma(model2)/mean(cars$miles_per_gallon),2)

The relative error for the new pared-down model (residual standard error divided by mean miles per gallon) is 0.12, which is about half that of the transmission-only model.

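Finally, the variables used to predict a car’s miles per gallon are transmission (Manual), weight, rear axle ratio (aka a differential for all you Rovolio Clockbergs out there), and quarter mile time. In this model weight and quarter mile time are the statistically significant terms; the transmission coefficient (2.57, p = 0.12) and the rear axle ratio coefficient (0.64, p = 0.64) are not significant at the 0.05 level. The fitted equation is: \(\hat{y} = 7.63 + 2.57x_0 - 3.80x_1 + 0.64x_2 + 1.20x_3\), where \(x_0\) is 1 for a manual transmission and 0 for an automatic, \(x_1\) is weight (in 1000 lbs), \(x_2\) is rear axle ratio, and \(x_3\) is quarter mile time in seconds.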
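To show how the fitted equation is used, the sketch below refits the four-predictor model on the original `mtcars` columns (`am`, `wt`, `drat`, `qsec`) and predicts the mileage of one car by hand and with `predict()`. The example car uses the Mazda RX4 values from the first row of the data; `model`, `new_car`, `by_hand`, and `by_predict` are names introduced for illustration.

```r
# Sketch: evaluate the fitted equation for one car, by hand and with predict().
data("mtcars")

# am is already coded 0/1, which is equivalent to I(transmission == "Manual")
model <- lm(mpg ~ am + wt + drat + qsec, data = mtcars)
b <- coef(model)

# Mazda RX4: manual, 2.62 (1000 lbs), rear axle ratio 3.90, quarter mile 16.46 s
new_car <- data.frame(am = 1, wt = 2.62, drat = 3.90, qsec = 16.46)

by_hand <- b[["(Intercept)"]] +
  b[["am"]]   * new_car$am +
  b[["wt"]]   * new_car$wt +
  b[["drat"]] * new_car$drat +
  b[["qsec"]] * new_car$qsec

by_predict <- unname(predict(model, newdata = new_car))

print(round(b, 2))
print(c(by_hand = by_hand, by_predict = by_predict))
```

The two predictions agree because `predict()` evaluates the same linear combination of coefficients and inputs.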