Executive Summary: Any “Gear Head” will tell you that a manual transmission gets better gas mileage than an automatic, but that claim is anecdotal at best. In this paper, I use simple and multiple linear regression to determine whether the alternative hypothesis, that mean gas mileage differs between the two transmission types, is supported. The two were indeed different. I also investigated other variables to determine whether any were confounders. Weight of the vehicle has more influence over gas mileage than its transmission. However, the transmission does play an important part in the equation that predicts gas mileage.
data("mtcars")
cars <-mtcars
head(cars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
The column names are not meaningful, so they were changed for easier interpretability.
colnames(cars)<-c("miles_per_gallon", "cylinder","displacement", "horse_power",
"rear_axle_ratio","weight","quarter_mile_time","Engine_type",
"transmission","number_of_forward_gears", "carburetors")
y =cars$miles_per_gallon
x =cars$transmission
library(dplyr)  # provides %>% and mutate()
# Recode the 0/1 indicators as labels (mtcars: am 0 = automatic, 1 = manual; vs 0 = V-shaped, 1 = straight)
cars<-cars%>%
mutate(transmission=ifelse(transmission==0, "Automatic", "Manual"),
Engine_type=ifelse(Engine_type==0, "V-shaped", "Straight"))
When fitting a linear regression model, there are 4 assumptions that must be met:
-Independence: Observations are independent of each other, and the outcome is not a direct calculation from the predictor (e.g., pay computed from an hourly wage is a function of the time spent at work, so y is not independent of x). Here each row is a different car model, and miles per gallon is measured rather than calculated from the transmission type, so this criterion is met.
-Normality: For any given x (in this case, transmission type), y (miles per gallon) must be normally distributed.
library(ggplot2)
# binwidth takes precedence over bins, so only binwidth is set
g<-ggplot(data=cars, aes(x=miles_per_gallon)) +
geom_histogram(binwidth = 1.5)
g
The histogram looks roughly normal. Note that it pools both transmission types rather than showing each group separately.
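Because the normality assumption is stated per transmission type, a per-group check may be a useful complement to the pooled histogram. The following is a minimal sketch, not part of the original analysis: it works on the built-in mtcars columns directly (mpg and am), and the names mpg_by_am and shapiro_p are illustrative.

```r
# Sketch: Shapiro-Wilk normality test of mpg within each transmission group.
# Uses the raw mtcars columns (am: 0 = automatic, 1 = manual).
data("mtcars")

# Split mpg into one vector per transmission type
mpg_by_am <- split(mtcars$mpg, mtcars$am)
names(mpg_by_am) <- c("Automatic", "Manual")

# A large p-value gives no evidence against normality for that group
shapiro_p <- sapply(mpg_by_am, function(v) shapiro.test(v)$p.value)
print(round(shapiro_p, 3))
```

With only 19 and 13 cars in the two groups, a test like this has limited power, so it is best read alongside the histogram rather than in place of it.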
-Linearity: The relationship between X and the mean of Y is linear.
cars_cor <-cor(x,y)
The Pearson’s correlation coefficient was 0.5998324. This means that there is a moderate positive association between transmission type (coded 0 for automatic and 1 for manual) and miles per gallon. Now to perform a two-sample t-test: the miles_per_gallon values for automatic and for manual transmissions are subset from the data and passed to t.test.
cars_test <-t.test(cars[cars$transmission=="Automatic",]$miles_per_gallon,
cars[cars$transmission=="Manual",]$miles_per_gallon)
cars_test
##
## Welch Two Sample t-test
##
## data: cars[cars$transmission == "Automatic", ]$miles_per_gallon and cars[cars$transmission == "Manual", ]$miles_per_gallon
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
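The Welch statistic in the output above can be reproduced by hand, which makes explicit what the test is doing. This is a sketch using the raw mtcars columns; the variable names (auto, manual, se_diff, df_welch) are illustrative and not from the original analysis.

```r
# Sketch: reproduce the Welch two-sample t-test step by step.
data("mtcars")
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Per-group variance of the sample mean (no equal-variance assumption)
va <- var(auto) / length(auto)
vm <- var(manual) / length(manual)

# t statistic: difference in means over its standard error
se_diff <- sqrt(va + vm)
t_stat  <- (mean(auto) - mean(manual)) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (va + vm)^2 / (va^2 / (length(auto) - 1) + vm^2 / (length(manual) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df_welch)
print(c(t = t_stat, df = df_welch, p = p_value))
```

The three values should agree with the t, df, and p-value lines printed by t.test.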
The t-test output shows a p-value of 0.0013736, so the difference in mean miles per gallon between automatic (17.15) and manual (24.39) cars is statistically significant. Next, we fit the linear model for the next part of the evaluation.
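This is a simple linear model with one predictor, transmission, and the outcome miles per gallon. The -1 in the formula removes the intercept, so each coefficient is the mean miles per gallon for that transmission type.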
fit <-lm(miles_per_gallon~transmission-1, cars)
fit
##
## Call:
## lm(formula = miles_per_gallon ~ transmission - 1, data = cars)
##
## Coefficients:
## transmissionAutomatic transmissionManual
## 17.15 24.39
summary(fit)$coeff
## Estimate Std. Error t value Pr(>|t|)
## transmissionAutomatic 17.14737 1.124603 15.24749 1.133983e-15
## transmissionManual 24.39231 1.359578 17.94109 1.376283e-17
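In the no-intercept fit above, each p-value tests whether a group mean differs from zero, not whether the two groups differ from each other. A sketch of the same model with the intercept kept shows the difference directly; cars_demo, fit_means, and fit_diff are illustrative names, and the data are rebuilt from mtcars so the block runs on its own.

```r
# Sketch: compare the no-intercept and with-intercept parameterisations.
data("mtcars")
cars_demo <- transform(mtcars,
                       transmission = ifelse(am == 0, "Automatic", "Manual"))

# Without intercept: one coefficient per group, each equal to the group mean
fit_means <- lm(mpg ~ transmission - 1, data = cars_demo)

# With intercept: the intercept is the Automatic mean, and the slope is the
# Manual-minus-Automatic difference in mean mpg
fit_diff <- lm(mpg ~ transmission, data = cars_demo)

print(coef(fit_means))
print(coef(fit_diff))
```

The transmissionManual coefficient in the second fit equals the difference between the two group means reported above (24.39 - 17.15, about 7.24), and its test addresses the question the report is actually asking.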
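library(plotly)
# Convert to a factor so the two groups have an explicit order (assigning levels to a character vector does not create a factor)
cars$transmission <- factor(cars$transmission, levels=c("Automatic","Manual"))
plot_ly(data=cars, x=~transmission, y=~miles_per_gallon, type="scatter")%>%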
add_boxplot()%>%
layout(showlegend=FALSE)
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
-Homoscedasticity: The variance of the residuals is the same for any value of X. To check this, we now plot the residuals of the miles_per_gallon model.
par(mfrow=c(1,1))
plot(fit)
library(broom)  # provides augment()
fit.aug <-augment(fit)
head(fit.aug)
## # A tibble: 6 x 9
## .rownames miles_per_gallon transmission .fitted .resid .hat .sigma .cooksd
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 Manual 24.4 -3.39 0.0769 4.94 0.0216
## 2 Mazda RX4 ~ 21 Manual 24.4 -3.39 0.0769 4.94 0.0216
## 3 Datsun 710 22.8 Manual 24.4 -1.59 0.0769 4.98 0.00476
## 4 Hornet 4 D~ 21.4 Automatic 17.1 4.25 0.0526 4.92 0.0221
## 5 Hornet Spo~ 18.7 Automatic 17.1 1.55 0.0526 4.98 0.00294
## 6 Valiant 18.1 Automatic 17.1 0.953 0.0526 4.98 0.00111
## # ... with 1 more variable: .std.resid <dbl>
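One way to summarise how well transmission alone predicts miles per gallon is the residual standard error relative to the mean. The sketch below assumes that this ratio is the intended measure of error, since the same ratio is computed for the reduced model later in this report; fit_am and rel_error are illustrative names.

```r
# Sketch: relative error of the transmission-only model.
data("mtcars")

# Same model as the simple fit: one mean per transmission type
fit_am <- lm(mpg ~ factor(am) - 1, data = mtcars)

# Residual standard error as a fraction of the average mpg
rel_error <- sigma(fit_am) / mean(mtcars$mpg)
print(round(rel_error, 2))
```

This ratio describes the typical size of a prediction miss relative to the average mileage; it is not the share of predictions that are wrong.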
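While the type of transmission seems to predict the miles per gallon, it has a relative error of ~0.25, apparently the residual standard error divided by the mean miles per gallon (the same ratio computed for the reduced model below). That does not mean the model is wrong 25% of the time; it means a typical prediction misses by roughly a quarter of the average mileage. What about all the variables collected in this table?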
model_all <-lm(miles_per_gallon~., data=cars)
summary(model_all)
##
## Call:
## lm(formula = miles_per_gallon ~ ., data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.62114 19.02842 0.663 0.5144
## cylinder -0.11144 1.04502 -0.107 0.9161
## displacement 0.01334 0.01786 0.747 0.4635
## horse_power -0.02148 0.02177 -0.987 0.3350
## rear_axle_ratio 0.78711 1.63537 0.481 0.6353
## weight -3.71530 1.89441 -1.961 0.0633 .
## quarter_mile_time 0.82104 0.73084 1.123 0.2739
## Engine_typeV-shaped -0.31776 2.10451 -0.151 0.8814
## transmissionManual 2.52023 2.05665 1.225 0.2340
## number_of_forward_gears 0.65541 1.49326 0.439 0.6652
## carburetors -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
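From here we can see that no single variable is significant at the 0.05 level once all the others are included; weight comes closest (p = 0.0633). The four variables with the largest coefficient estimates in absolute value are:
-rear axle ratio
-weight
-quarter mile time
-transmission
So the next model includes only these variables.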
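Since the summary raises the question of confounding, it may help to see how strongly transmission type is tied to the other retained predictors. This is a minimal sketch on the raw mtcars columns (am, wt, drat, qsec); cor_with_am is an illustrative name and the check is not part of the original analysis.

```r
# Sketch: correlation of transmission (am) with the other retained predictors.
data("mtcars")

# 1 x 3 matrix of Pearson correlations between am and each predictor
cor_with_am <- cor(mtcars$am, mtcars[, c("wt", "drat", "qsec")])
print(round(cor_with_am, 2))
```

A strong correlation between transmission and weight would mean that part of the raw manual-versus-automatic gap reflects manual cars being lighter, which is why the transmission coefficient shrinks once weight enters the model.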
model2 <- lm(miles_per_gallon ~ I(transmission=="Manual")+ weight+ rear_axle_ratio + quarter_mile_time,data = cars)
model_all_coeff <-round(model2$coefficients,2)
anova(model2)
## Analysis of Variance Table
##
## Response: miles_per_gallon
## Df Sum Sq Mean Sq F value Pr(>F)
## I(transmission == "Manual") 1 405.15 405.15 65.1576 1.132e-08 ***
## weight 1 442.58 442.58 71.1766 4.763e-09 ***
## rear_axle_ratio 1 11.33 11.33 1.8225 0.1882294
## quarter_mile_time 1 99.10 99.10 15.9378 0.0004518 ***
## Residuals 27 167.89 6.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model2)
##
## Call:
## lm(formula = miles_per_gallon ~ I(transmission == "Manual") +
## weight + rear_axle_ratio + quarter_mile_time, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3046 -1.6260 -0.6634 1.2097 4.6626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6277 8.2103 0.929 0.361095
## I(transmission == "Manual")TRUE 2.5729 1.6225 1.586 0.124446
## weight -3.8040 0.7592 -5.010 2.96e-05 ***
## rear_axle_ratio 0.6429 1.3551 0.474 0.639003
## quarter_mile_time 1.1958 0.2995 3.992 0.000452 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.494 on 27 degrees of freedom
## Multiple R-squared: 0.8509, Adjusted R-squared: 0.8288
## F-statistic: 38.52 on 4 and 27 DF, p-value: 8.673e-11
par(mfrow=c(2,2))
plot(model2)
In the first two plots, the residuals fall mostly in a random pattern around zero. This suggests that the variance of the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is roughly the same across all values of the independent variables.
What about the error in this new equation?
new_error <-round(sigma(model2)/mean(cars$miles_per_gallon),2)
The relative error (residual standard error divided by mean miles per gallon) for the new pared-down model is 0.12, which is about half the original error.
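The drop in error can also be checked formally with an F-test between the two nested models. The sketch below is an addition rather than the author's method; it refits both models on the raw mtcars columns, and fit_small, fit_big, and cmp are illustrative names.

```r
# Sketch: F-test comparing the transmission-only model with the reduced model.
data("mtcars")

fit_small <- lm(mpg ~ factor(am), data = mtcars)
fit_big   <- lm(mpg ~ factor(am) + wt + drat + qsec, data = mtcars)

# anova() on nested fits tests whether the three extra terms
# reduce the residual sum of squares more than chance would
cmp <- anova(fit_small, fit_big)
print(cmp)
```

A small p-value in the second row indicates that weight, rear axle ratio, and quarter mile time together explain variation that transmission alone does not.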
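Finally, the variables that best determine a car’s miles per gallon are transmission (Manual), weight, quarter mile time, and rear axle ratio (aka a differential for all you Rovolio Clockbergs out there). The equation that best describes it is: \(\hat{y} = 7.63 + 2.57x_0 - 3.80x_1 + 0.64x_2 + 1.20x_3\), where \(x_0\) is 1 for a manual transmission and 0 for an automatic, \(x_1\) is weight (in 1000 lbs), \(x_2\) is rear axle ratio, and \(x_3\) is quarter mile time.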
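As a worked example of the fitted equation, the sketch below plugs one car's values into the coefficients and compares the result with the model's own fitted value. It refits the model on the raw mtcars columns (am is already 0/1, so no recoding is needed); fit_eq, rx4, and by_hand are illustrative names.

```r
# Sketch: evaluate the regression equation by hand for one car.
data("mtcars")
fit_eq <- lm(mpg ~ am + wt + drat + qsec, data = mtcars)
b <- coef(fit_eq)

# Predictor values for the Mazda RX4 (a manual car, so am = 1)
rx4 <- mtcars["Mazda RX4", ]

# intercept + transmission + weight + rear axle ratio + quarter mile time
by_hand <- unname(b["(Intercept)"] + b["am"] * rx4$am + b["wt"] * rx4$wt +
                  b["drat"] * rx4$drat + b["qsec"] * rx4$qsec)

# The model's own fitted value for the same car, for comparison
from_model <- unname(fitted(fit_eq)["Mazda RX4"])
print(c(by_hand = by_hand, from_model = from_model, observed = rx4$mpg))
```

The hand calculation and the fitted value should agree, and the gap between either of them and the observed mileage is that car's residual.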