## Import libraries
library(knitr)
library(readr)
library(dplyr)
library(ggplot2)
library(GGally)
library(cowplot)
library(ggfortify)
In this analysis, we want to better understand the relationship between the given set of variables and miles per gallon. We also want to be able to answer a set of more specific questions, such as the following:
Research has shown that there is a correlation between a car’s type of transmission and its fuel consumption. A simple linear regression of mpg (miles per gallon) on am (transmission type) shows that cars with a manual transmission average 7.25 more miles per gallon than cars with an automatic transmission. However, including additional variables in the model can help us understand the reason for this difference in means. For example, automatic cars are heavier than manual cars in our dataset, which could make them less efficient. Additionally, automatic cars tend to have a faster quarter mile time, which could imply that they have more horsepower, as well. A multiple linear regression model that includes a car’s transmission type, quarter mile time, and weight performs well. Throu
## Summary of the data
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## Summarize the differences between means and sd for transmission types
mtcars %>% group_by(am) %>%
summarise(Mean = mean(mpg, na.rm = TRUE),
SD = sd(mpg, na.rm = TRUE),
Count = n())
## # A tibble: 2 x 4
## am Mean SD Count
## <dbl> <dbl> <dbl> <int>
## 1 0 17.1 3.83 19
## 2 1 24.4 6.17 13
## Plot the differences in types of transmissions
ggplot(mtcars, aes(factor(am), mpg)) +
geom_boxplot(fill = "blue") +
labs(x = "Transmission Type", y = "MPG") +
scale_x_discrete(labels = c("0" = "Automatic", "1" = "Manual"))
According to the boxplot and the summary table, there appears to be a noticeable difference between the mean miles per gallon for automatic cars and manual cars. Next, we should back up this difference with statistics by performing a t-test comparing the mean miles per gallon of automatic cars and manual cars.
## Check whether mpg for automatic cars is approximately normally distributed
aut <- filter(mtcars, am == 0)
qqnorm(aut$mpg)
qqline(aut$mpg, col = 2, lwd = 2, lty = 2)
## Check whether mpg for manual cars is approximately normally distributed
man <- filter(mtcars, am == 1)
qqnorm(man$mpg)
qqline(man$mpg, col = 2, lwd = 2, lty = 2)
## Welch two sample t-test to compare mean mpg between transmission types
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
Before we rely on any statistical tests on the “mpg” variable, we should first examine the distributions of the “mpg” variable for both automatic and manual cars. By analyzing the Q-Q plots, we’re able to see that they are not exactly normally distributed, so the t-test results should be read with some caution. With that caveat in mind, we compare the means of the two groups to see if there is a statistically significant difference between them. The Welch two sample t-test, which does not assume equal variances, indicates a significant difference between the two means, since the p-value is small (about 0.0014) and the 95 percent confidence interval for the difference (-11.28 to -3.21) excludes 0. We should keep in mind that there are very few observations in the two groups (19 automatic and 13 manual), so the sample means may be imprecise estimates of the population means.
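Because the normality of the two groups is in question, one way to cross-check the result is a formal normality test plus a rank-based comparison that does not assume normality. The following is only a sketch, not part of the original analysis: the object names (`cars`, `mpg_auto`, `mpg_manual`, `rank_test`) are made up for illustration, and it works on a fresh copy of the built-in data so that `am` is still coded 0/1.

```r
# Sketch only: fresh copy of the built-in data, am still coded 0/1
cars <- datasets::mtcars
mpg_auto <- cars$mpg[cars$am == 0]
mpg_manual <- cars$mpg[cars$am == 1]

# Shapiro-Wilk test for each group (null hypothesis: the data are normal)
sw_auto <- shapiro.test(mpg_auto)
sw_manual <- shapiro.test(mpg_manual)

# Wilcoxon rank-sum test as a non-parametric alternative to the t-test;
# exact = FALSE because mpg contains tied values
rank_test <- wilcox.test(mpg ~ am, data = cars, exact = FALSE)

# Collect the p-values side by side
c(shapiro_auto = sw_auto$p.value,
  shapiro_manual = sw_manual$p.value,
  wilcoxon = rank_test$p.value)
```

If the rank-based test agrees with the t-test, the conclusion about the difference in mpg does not hinge on the normality assumption.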
## Fit a simple linear regression (SLR) model
cars.lm <- lm(mpg ~ am, data = mtcars)
## Model summary
summary(cars.lm)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
After fitting our linear model and reviewing the model summary, we can see that an average manual car gets around 7.245 more miles per gallon than an average automatic car. The standard error of this estimate is about 1.76 miles per gallon. Since the adjusted R-squared value is only about 0.34, the car’s type of transmission alone explains only about a third of the variance of mpg. We will need to include other variables in our model if our goal is to accurately predict the car’s miles per gallon.
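With a single 0/1 predictor, the fitted slope is the difference between the two group means, which is why the coefficient matches the gap seen in the t-test output. A minimal sketch of that check follows, using a fresh copy of the data; the names `cars`, `fit`, `group_means`, `diff_means`, and `slope` are illustrative only.

```r
# Sketch only: fresh copy of the built-in data, am still coded 0/1
cars <- datasets::mtcars
fit <- lm(mpg ~ am, data = cars)

# Mean mpg within each transmission group
group_means <- tapply(cars$mpg, cars$am, mean)
diff_means <- unname(group_means["1"] - group_means["0"])

# The slope on am should equal the difference in group means
slope <- unname(coef(fit)["am"])
all.equal(slope, diff_means)

# 95% confidence interval for the slope, reflecting its standard error
confint(fit, "am")
```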
## Reformat variable classes
mtcars[,c(2,8:11)] <- lapply(mtcars[,c(2,8:11)], as.factor)
mtcars$am <- factor(mtcars$am, labels = c("Automatic","Manual"))
## Fit a multiple linear regression model with all predictors
cars.mlm <- lm(mpg ~ ., data = mtcars)
## Model summary
summary(cars.mlm)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## amManual 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
It seems like
## ANOVA table
cars.lm <- lm(mpg ~ am + qsec + wt + hp + disp, data = mtcars)
anova(cars.lm)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## am 1 405.15 405.15 68.6527 9.104e-09 ***
## qsec 1 368.26 368.26 62.4021 2.240e-08 ***
## wt 1 183.35 183.35 31.0682 7.442e-06 ***
## hp 1 9.22 9.22 1.5622 0.2225
## disp 1 6.63 6.63 1.1232 0.2990
## Residuals 26 153.44 5.90
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
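According to the sequential ANOVA table, transmission type, quarter mile time, and weight each account for a large share of the variability in “mpg” while using only one degree of freedom apiece. Together they explain about 85% of the total sum of squares (956.76 of 1126.05), compared with about 36% for transmission type alone. Horsepower and displacement add little once those three variables are in the model (p-values of about 0.22 and 0.30). Because the table is sequential, each row is conditional on the terms listed above it, so the order of the variables matters. We should therefore include transmission type, quarter mile time, and weight in the model and prefer it over the original simple linear regression model.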
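The comparison between the models can also be made explicit with nested F-tests and adjusted R-squared values. This is a sketch under the assumption that the three candidate models below are the ones of interest; the names `fit_am`, `fit_3`, `fit_5`, and `adj_r2` are hypothetical and do not appear in the original analysis.

```r
# Sketch only: fresh copy of the built-in data
cars <- datasets::mtcars

# Three nested candidate models
fit_am <- lm(mpg ~ am, data = cars)
fit_3 <- lm(mpg ~ am + qsec + wt, data = cars)
fit_5 <- lm(mpg ~ am + qsec + wt + hp + disp, data = cars)

# F-tests for each step up in model size
anova(fit_am, fit_3, fit_5)

# Adjusted R-squared penalizes extra predictors, so it is comparable across sizes
adj_r2 <- sapply(
  list(am = fit_am, am_qsec_wt = fit_3, all_five = fit_5),
  function(m) summary(m)$adj.r.squared
)
adj_r2
```

Unlike the sequential table, this comparison tests hp and disp jointly, which gives a single answer to whether the two extra terms are worth keeping.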
## Plot of mpg at the different levels of am
g1 <- ggplot(mtcars, aes(am, mpg)) +
geom_boxplot() +
labs(y = "mpg")
## Plot of qsec at the different levels of am
g2 <- ggplot(mtcars, aes(am, qsec)) +
geom_boxplot() +
labs(y = "Quarter Mile Time")
## Plot of wt at the different levels of am
g3 <- ggplot(mtcars, aes(am, wt)) +
geom_boxplot() +
labs(y = "Weight")
## Final plots
plot_grid(g1, g2, g3, ncol = 3, nrow = 1)
According to the plots, we’re able to confirm that manual cars get more miles per gallon than automatic cars. We can also see that the distributions of weight and quarter mile time differ between automatic and manual cars, which may indicate a potential for using an interaction term between transmission type and these variables.
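One way to follow up on the interaction idea is to fit a model with an interaction and compare it with the additive version. The sketch below assumes the interaction of interest is between transmission type and weight; that choice, and the names `fit_add` and `fit_int`, are illustrative rather than taken from the original analysis.

```r
# Sketch only: fresh copy of the built-in data, am still coded 0/1
cars <- datasets::mtcars

# Additive model: the effect of weight is the same for both transmission types
fit_add <- lm(mpg ~ am + wt + qsec, data = cars)

# Interaction model: am * wt expands to am + wt + am:wt,
# so the slope on weight is allowed to differ by transmission type
fit_int <- lm(mpg ~ am * wt + qsec, data = cars)

# F-test for whether the interaction term improves the fit
anova(fit_add, fit_int)

# Coefficient table, including the am:wt term
summary(fit_int)$coefficients
```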
## Diagnostic plots
autoplot(cars.lm, data = mtcars, colour = 'am')
The residuals vs. fitted plot shows whether the residuals have any non-linear patterns. A non-linear relationship between the predictor variables in the model and the response variable would show up as a systematic pattern, such as a curve, around the dashed horizontal line located at 0 on the y-axis. Residuals that are spread evenly around the horizontal line without distinct patterns are a good indication that there are no non-linear relationships, which does not seem to be the case here. Looking more closely at the plot, the residuals appear to follow a parabolic pattern, which suggests a curved relationship between the predictors and the response variable.
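A parabolic residual pattern can be probed by adding a squared term and testing whether it improves the fit. The sketch below assumes weight is the predictor responsible for the curvature, which the plot alone cannot confirm; `fit` and `fit_quad` are hypothetical names.

```r
# Sketch only: fresh copy of the built-in data
cars <- datasets::mtcars

# Same predictors as the model being diagnosed
fit <- lm(mpg ~ am + qsec + wt + hp + disp, data = cars)

# Add a quadratic term in weight; I() keeps ^ as arithmetic inside the formula
fit_quad <- update(fit, . ~ . + I(wt^2))

# F-test for the added curvature term
anova(fit, fit_quad)
```

A small p-value for the added term would support the reading of the residual plot; a large one would suggest looking at a different predictor or a transformation of mpg instead.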
The normal Q-Q plot shows whether the residuals are normally distributed, which is an assumption behind inference in a linear regression model. If the residuals deviate substantially from the reference line, they are probably not normally distributed, which can be a symptom of a non-linear relationship between the predictor variables and the response variable or of heteroscedasticity. According to the Q-Q plot, the residuals seem to deviate from the reference line, so a non-linear relationship between the predictor variables and the response variable may exist.
The Scale-Location plot shows whether the residuals are spread equally along the range of fitted values. This is how you can check the assumption of equal variance, otherwise known as homoscedasticity. It’s good if you see a horizontal line with randomly spread points. Here the line seems to trend upward rather than staying flat, meaning heteroscedasticity most likely exists in our model.
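The visual impression of unequal variance can be backed by a rough numerical check. The sketch below computes a studentized Breusch-Pagan style statistic by hand in base R: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The names `fit`, `aux`, `bp_stat`, and `bp_p` are illustrative, and this is a simplified stand-in for a dedicated test function.

```r
# Sketch only: fresh copy of the built-in data
cars <- datasets::mtcars
fit <- lm(mpg ~ am + qsec + wt + hp + disp, data = cars)

# Auxiliary regression: do the predictors explain the squared residuals?
cars$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ am + qsec + wt + hp + disp, data = cars)

# Test statistic: n * R-squared, with degrees of freedom = number of predictors
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 5, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```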
Lastly, the residuals vs. leverage plot helps identify influential observations, meaning points that combine a large residual with high leverage. We can see that the Toyota Corolla and Chrysler Imperial stand out as distinct outliers.
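The points flagged in the plot can also be listed numerically with Cook's distance and leverage values. This is a sketch only: the 4/n cutoff is a common rule of thumb rather than a formal test, and the names `fit`, `cd`, `lev`, and `threshold` are illustrative.

```r
# Sketch only: fresh copy of the built-in data
cars <- datasets::mtcars
fit <- lm(mpg ~ am + qsec + wt + hp + disp, data = cars)

# Cook's distance measures how much the fit changes when a car is dropped
cd <- cooks.distance(fit)

# Leverage (hat values) measures how unusual a car's predictor values are
lev <- hatvalues(fit)

# Cars with the largest influence on the fit
head(sort(cd, decreasing = TRUE), 3)

# Rule-of-thumb cutoff for flagging influential observations
threshold <- 4 / nrow(cars)
names(cd)[cd > threshold]
```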