Looking at the mtcars dataset, we found that transmission type does affect a car's mileage, and that adding two regressors, weight and quarter-mile time, fits the data better. The adjusted R-squared was 33.85% for the transmission-only model and 83.36% for the multiple-regressor model, indicating that the multiple-regressor model explains about 83% of the variance in mpg. The residuals of the multiple-regressor model appeared approximately normal and unsystematic, which is consistent with a good, unbiased model.
Data Summary: The documentation for the data source, https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars, describes how the data was obtained. The structure of the data, from str(), gives a quick overview of the variables, their types, and their values. The str() output for this data can be found in Appendix A, Figure 1.
Looking at the structure, we see that the data has 11 variables and 32 observations, as expected from the documentation. The two variables we will start with are mpg and am (transmission type: 0 = automatic, 1 = manual). From our own understanding and the documentation for the data (?mtcars), we take transmission type to be a factor variable and will treat it as such for the rest of this analysis.
Boxplot: The boxplot in Appendix A, Figure 2 compares the distribution of mpg for each transmission type. The mean mpg for the two levels of this factor variable is mean(mtcars$mpg[mtcars$am == 0]) = 17.15 mpg for Level 0: Automatic, and mean(mtcars$mpg[mtcars$am == 1]) = 24.39 mpg for Level 1: Manual. The boxplot and the means both suggest that transmission type has an effect on mileage.
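The two group means quoted above can be recomputed with a few lines of R. This is a minimal sketch rather than the original analysis code; the variable names `mean_auto`, `mean_manual`, and `group_means` are illustrative assumptions.

```r
# Mean mpg by transmission type (0 = automatic, 1 = manual).
data(mtcars)

# Logical indexing with square brackets selects the mpg values per group.
mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

# tapply() computes both group means in one call.
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
print(round(group_means, 2))
```

Using `tapply()` avoids repeating the subsetting expression and scales to factors with more than two levels.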
Mpg with only transmission type as a regressor: A model is fit with mpg as the outcome and transmission type as a factor variable, to estimate the difference in means. Its summary statistics are printed below.
Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## factor(am)1 7.244939 1.764422 4.106127 2.850207e-04
R-Squared:
## [1] 0.3597989
Adjusted R-squared:
## [1] 0.3384589
The model gives an intercept of 17.15 mpg, the mean for automatic transmissions. The coefficient for manual transmissions is 7.24 mpg, so their mean is 17.15 + 7.24 = 24.39 mpg, which confirms the two means from earlier. The R-squared shows that the model accounts for 35.98% of the variance in mpg. We will, however, use the adjusted R-squared to compare with our multiple-regressor model, because adjusted R-squared accounts for the number of regressors. The adjusted R-squared shows that the model accounts for 33.85% of the variance. These low values show that transmission type alone leaves most of the variance unexplained.
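A sketch of how the single-regressor fit and its two R-squared values might be obtained is shown here. The object names (`fit_am`, `fit_am_summary`, `adj_by_hand`) are assumptions for illustration, and the hand calculation uses the standard adjusted R-squared formula with n observations and p regressors.

```r
# Fit mpg on transmission type treated as a factor.
data(mtcars)
fit_am <- lm(mpg ~ factor(am), data = mtcars)
fit_am_summary <- summary(fit_am)

# Coefficient table: intercept = automatic mean, factor(am)1 = manual minus automatic.
print(fit_am_summary$coefficients)

# R-squared and adjusted R-squared as reported by summary().
r2     <- fit_am_summary$r.squared
adj_r2 <- fit_am_summary$adj.r.squared

# Adjusted R-squared by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1).
n <- nrow(mtcars)
p <- 1
adj_by_hand <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(c(r2 = r2, adj_r2 = adj_r2, adj_by_hand = adj_by_hand))
```

The penalty factor (n - 1) / (n - p - 1) grows with the number of regressors, which is why the adjusted value is the fairer one for comparing models of different sizes.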
Interaction with other variables: A stepwise regression is performed to find which regressors to add to mpg ~ am. Stepwise regression compares candidate models by their AIC (Akaike information criterion) and retains the model with the lowest AIC; only differences in AIC between models are meaningful, not the absolute value. This is performed backwards by default, but can be done forwards as well.
*I have since learned that stepwise regression is an outdated method, and will find another method for regressor selection in the future.
Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## am 2.935837 1.4109045 2.080819 4.671551e-02
Adjusted R-Squared:
## [1] 0.8335561
This leaves us with wt (weight), qsec (quarter-mile time) and am (transmission type) as significant predictors, at a significance level of alpha = .05 for am and alpha = .001 for wt and qsec, with an adjusted R-squared value of 0.8335561. To look at this from another side, I will add these variables one at a time and compare the resulting nested models with anova; the final model in that sequence is the same as the stepwise model, so its statistics are the same.
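The stepwise selection described above could be reproduced roughly as follows. This is a sketch under assumptions: the report does not show its code, so starting from the full model with every variable, using `step()` from base R, and the names `full_model` and `step_model` are all illustrative choices.

```r
# Backward stepwise selection by AIC, starting from all ten candidate regressors.
data(mtcars)
full_model <- lm(mpg ~ ., data = mtcars)

# trace = 0 suppresses the per-step log; direction = "backward" only drops terms.
step_model <- step(full_model, direction = "backward", trace = 0)

# Inspect which regressors were retained and how well the model fits.
print(formula(step_model))
print(summary(step_model)$coefficients)
print(summary(step_model)$adj.r.squared)
```

Here am enters as a numeric 0/1 column, which matches the row label `am` in the coefficient table; for a two-level variable the estimate is the same as with `factor(am)`.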
Finding the best model: To confirm our findings using stepwise regression, a series of models is run through anova. Each successive model adds a regressor term suggested by the stepwise regression. In the anova table, Pr(>F) is the p-value associated with the F-statistic. These p-values show that both the quarter-mile time and the weight are highly significant additions to the model (alpha = .001), confirming our findings from the stepwise regression. A Shapiro test is now needed, because these p-values rely on the residuals being approximately normal.
Shapiro Test: A Shapiro-Wilk test, with the null hypothesis that the residuals are normally distributed, is run to check whether our model stands.
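A nested-model comparison of this kind might look like the sketch below. The name `lm3` follows the report's Model 3; `lm1`, `lm2`, `nested`, and the choice to add wt before qsec are assumptions made for illustration.

```r
# Nested models: each one adds a regressor to the previous model.
data(mtcars)
lm1 <- lm(mpg ~ am, data = mtcars)
lm2 <- lm(mpg ~ am + wt, data = mtcars)
lm3 <- lm(mpg ~ am + wt + qsec, data = mtcars)

# anova() runs an F-test for each added term; Pr(>F) is the p-value.
nested <- anova(lm1, lm2, lm3)
print(nested)

# p-values for adding wt (row 2) and then qsec (row 3).
added_term_p <- nested[["Pr(>F)"]][2:3]
print(added_term_p)
```

The first row of the table has no p-value because there is no smaller model to compare it against.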
The p-value from the Shapiro test is 0.0804277 > .05, and therefore we cannot reject the null hypothesis. The residuals are consistent with approximate normality, and our model stands.
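As a sketch, the normality check on the residuals can be run with `shapiro.test()` from base R. The model formula is the report's Model 3; the object names `lm3`, `sw`, and `reject_normality` are assumptions.

```r
# Shapiro-Wilk test on the residuals of the three-regressor model.
data(mtcars)
lm3 <- lm(mpg ~ am + wt + qsec, data = mtcars)
sw <- shapiro.test(residuals(lm3))
print(sw)

# Failing to reject at alpha = .05 means no evidence against normality;
# it does not prove that the residuals are normal.
reject_normality <- sw$p.value < 0.05
print(reject_normality)
```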
Residuals plot: One last step is to check our model for bias by looking for any apparent pattern in the plot of 'Residuals vs the Fitted Values', found in Appendix A, Figure 3. There is no apparent pattern, so the model appears to be a good fit. It is now time to move on to the summary statistics for our multi-regressor model.
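A residuals-versus-fitted plot like the one described could be drawn with base graphics as sketched here. The axis labels, title, and object names are illustrative assumptions, not the report's actual plotting code.

```r
# Residuals vs fitted values for the three-regressor model.
data(mtcars)
lm3 <- lm(mpg ~ am + wt + qsec, data = mtcars)
fitted_vals <- fitted(lm3)
resid_vals  <- residuals(lm3)

plot(fitted_vals, resid_vals,
     xlab = "Fitted mpg", ylab = "Residuals",
     main = "Residuals vs Fitted Values")
# Reference line at zero: points should scatter evenly around it.
abline(h = 0, lty = 2)
```

Curvature or a funnel shape around the zero line would point to a missing term or non-constant variance; an even scatter is what an unbiased fit looks like.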
Summary Analysis for multi-regressor model: The summary statistics are printed below for Model 3: mpg ~ am + wt + qsec.
Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## am 2.935837 1.4109045 2.080819 4.671551e-02
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
Adjusted R-Squared:
## [1] 0.8335561
We see the multiple-regressor model lm3 accounts for 83.36% of the total variance, which is much better than our original model with just transmission type, which accounted for 33.85% of the variance. We also see that with qsec and wt included in the model, the estimated difference in mean mpg between manual and automatic is smaller: it is now 2.94 mpg instead of 7.24 mpg. The original estimate was confounded by wt and qsec; with these terms included, the am coefficient measures the transmission effect with weight and quarter-mile time held constant, and more of the variance is explained.
Transmission type does affect mileage, but the regressors weight and qsec must be included in the model to give an accurate portrayal of the system.
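The change in the transmission effect can be seen by putting the am coefficient from both models side by side. This is a minimal sketch; `lm1`, `am_unadjusted`, and `am_adjusted` are assumed names, and am is left as a numeric 0/1 column as in the coefficient tables.

```r
# Compare the am coefficient with and without adjusting for wt and qsec.
data(mtcars)
lm1 <- lm(mpg ~ am, data = mtcars)
lm3 <- lm(mpg ~ am + wt + qsec, data = mtcars)

am_unadjusted <- coef(lm1)[["am"]]  # raw manual-minus-automatic difference
am_adjusted   <- coef(lm3)[["am"]]  # difference with wt and qsec held fixed

print(round(c(unadjusted = am_unadjusted, adjusted = am_adjusted), 2))
```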
Appendix A, Figure 1: str() output for mtcars.
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...