This project looks at the mtcars data set, which describes a collection of cars from 1974, and asks two main questions about how transmission type relates to fuel economy (mpg).
Before running the code, load the libraries used for obtaining the data, plotting, and statistical evaluation.
# load datasets to obtain the mtcars within the library.
library(datasets)
# load ggplot2 for the graphing algorithm.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.1
# load the car package to evaluate the VIFs for the mtcars model
library(car)
## Warning: package 'car' was built under R version 4.1.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.1.1
# Copy mtcars into a shorter working name
mtc <- mtcars
# Understand the data structure
sapply(mtc, class)
## mpg cyl disp hp drat wt qsec vs
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## am gear carb
## "numeric" "numeric" "numeric"
Looking at the data structure, the variable am is the transmission type (0 = automatic, 1 = manual, according to ?mtcars). The model therefore has to include it.
For this model, am is converted to a labelled factor so that the two groups can be compared directly. The plot and the code that produces it are given below:
# Use factor to label transmission
mtc$am <- factor(mtc$am, labels = c("Auto", "Manual"))
# Use ggplot to draw a boxplot of mpg for the two transmission types
ggplot(data = mtc, aes(x = am, y = mpg)) +
  geom_boxplot(aes(fill = am)) +
  ggtitle("Transmission Type's Miles Per Gallon Comparison Plot")
# Create model for mpg and transmission type ONLY.
fit_t0 <- lm(mpg ~ am, data = mtc)
summary(fit_t0)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## amManual 7.244939 1.764422 4.106127 2.850207e-04
anova(fit_t0)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## am 1 405.15 405.15 16.86 0.000285 ***
## Residuals 30 720.90 24.03
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The hypotheses tested by this ANOVA are given below.
H0: Manual and automatic cars have the same mean mpg. H1: Manual and automatic cars have different mean mpg.
Based on the ANOVA summary (F = 16.86, p = 0.000285) and the box plot, H0 is rejected at the 5% level: mean mpg differs between the two transmission types, with manual cars averaging about 7.24 mpg more than automatic cars.
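With only two groups, the ANOVA above is equivalent to a pooled-variance two-sample t-test, and the regression coefficients are simply the group means. The following is a minimal sketch of that check, assuming the untouched mtcars data (where am is still coded 0/1); the names grp_means, diff_means, tt and f_from_t are illustrative only.

```r
# Group means of mpg: the intercept of mpg ~ am is the automatic mean,
# and the amManual coefficient is the manual mean minus the automatic mean.
grp_means <- tapply(mtcars$mpg, mtcars$am, mean)
diff_means <- unname(grp_means["1"] - grp_means["0"])

# Pooled-variance t-test; its squared t statistic equals the ANOVA F value.
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
f_from_t <- unname(tt$statistic)^2

print(grp_means)
print(diff_means)
print(f_from_t)
print(tt$p.value)
```

The difference in means, the squared t statistic, and the p-value should match the amManual estimate, the F value, and Pr(>F) reported above.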
Before evaluating collinearity among the input variables, the pairwise correlations between the variables are examined.
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.1
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
cor_plot <- ggpairs(mtcars, lower = list(continuous = "smooth"))
cor_plot
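The pairs plot can be hard to read with eleven variables, so a numeric summary may help. The sketch below, which assumes the numeric mtcars data and uses the illustrative names r and r_sorted, ranks the variables by the strength of their Pearson correlation with mpg.

```r
# Pearson correlation of every variable with mpg, strongest first.
r <- cor(mtcars)[, "mpg"]
r <- r[names(r) != "mpg"]                      # drop the trivial self-correlation
r_sorted <- r[order(abs(r), decreasing = TRUE)]
print(round(r_sorted, 3))
```

Variables near the top of this list are the natural candidates for the model, but several of them are also strongly correlated with each other, which is what the collinearity check addresses.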
In principle, every variable in the data frame is a candidate predictor for the model. Because many of them are correlated with one another, this section uses the variance inflation factor (VIF) to check for collinearity among the predictors.
The usual rule-of-thumb criteria for the VIF are given below:
| VIF | Status of Parameters |
|---|---|
| VIF = 1 | Uncorrelated (no variance inflation) |
| 1 < VIF < 5 | Moderately correlated |
| VIF > 5 (threshold varies from 5 to 10) | Highly correlated |
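The VIF of a predictor j is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing that predictor on all the other predictors. The following is a minimal sketch for wt, assuming the numeric mtcars data; the names aux_fit, r2_wt and vif_wt are illustrative only.

```r
# Auxiliary regression: wt explained by every other predictor (mpg excluded).
aux_fit <- lm(wt ~ . - mpg, data = mtcars)
r2_wt <- summary(aux_fit)$r.squared

# Variance inflation factor for wt.
vif_wt <- 1 / (1 - r2_wt)
print(vif_wt)
```

This should agree with what car::vif reports for wt in a model of mpg on all the other variables.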
Fitting the full model with mpg as the response and every other variable as a predictor gives the VIFs below.
# Generate original setup for the full dataset model.
fit_full <- lm(mpg ~. , data = mtc)
# Check collinearity for full data set.
vif(fit_full)
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
cyl and disp have the two highest VIFs in the full model (15.37 and 21.62), so removing them should reduce the VIFs of the remaining predictors.
# Remove cyl and disp first for the adjusted model
fit_t1 <- lm(mpg ~. -cyl -disp , data = mtc)
summary(fit_t1)
##
## Call:
## lm(formula = mpg ~ . - cyl - disp, data = mtc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8187 -1.3903 -0.3045 1.2269 4.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.80810 12.88582 1.072 0.2950
## hp -0.01225 0.01649 -0.743 0.4650
## drat 0.88894 1.52061 0.585 0.5645
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185
## vs 0.08786 1.88992 0.046 0.9633
## amManual 2.42418 1.91227 1.268 0.2176
## gear 0.69390 1.35294 0.513 0.6129
## carb -0.61286 0.59109 -1.037 0.3106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08
# Re-check collinearity after removing the high-VIF predictors
vif(fit_t1)
## hp drat wt qsec vs am gear carb
## 6.015788 3.111501 6.051127 5.918682 4.270956 4.285815 4.690187 4.290468
After removing the two predictors with the highest collinearity, the remaining predictors that show no significant relationship with mpg are eliminated one at a time, dropping the term with the largest p-value at each step until every remaining predictor is significant at the 5% level (95% confidence).
# Remove cyl, disp and vs
fit_t2 <- lm(mpg ~. -cyl -disp -vs, data = mtc)
# Remove cyl, disp, vs and gear
fit_t3 <- lm(mpg ~. -cyl -disp -vs -gear, data = mtc)
# Remove cyl, disp, vs ,gear and hp
fit_t4 <- lm(mpg ~. -cyl -disp -vs -gear -hp, data = mtc)
# Remove cyl, disp, vs, gear, hp and drat
fit_t5 <- lm(mpg ~. -cyl -disp -vs -gear -hp -drat, data = mtc)
# Remove cyl, disp, vs, gear, hp, drat and carb
fit_t6 <- lm(mpg ~. -cyl -disp -vs -gear -hp -drat -carb, data = mtc)
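The manual sequence above can also be written as a loop. The following is a minimal sketch of backward elimination by p-value, not the code used for this report. It assumes the numeric mtcars data (so the coefficient name for am matches its term name) and a 5% cutoff; the names fit, alpha, dropped and worst are illustrative only.

```r
# Start from the model that already excludes cyl and disp.
fit <- lm(mpg ~ hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars)
alpha <- 0.05
dropped <- character(0)

repeat {
  # Coefficient table without the intercept row.
  coefs <- summary(fit)$coefficients[-1, , drop = FALSE]
  if (nrow(coefs) == 0) break
  p <- setNames(coefs[, "Pr(>|t|)"], rownames(coefs))

  # Stop once the least significant predictor passes the cutoff.
  worst <- names(p)[which.max(p)]
  if (p[[worst]] <= alpha) break

  # Otherwise drop it and refit.
  dropped <- c(dropped, worst)
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

print(dropped)
print(formula(fit))
```

The loop should drop the predictors in the same order as the fits above and stop at a model containing wt, qsec and am.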
This section plots the residuals against the fitted values for two models: the simple model of mpg on transmission type (fit_t0) and the final model that predicts mpg from transmission type, weight and qsec (fit_t6).
# Collect the fitted values and residuals of both models for plotting.
fit_t0s <- data.frame(Fitted = predict(fit_t0), Residuals = resid(fit_t0), transmission = mtc$am)
fit_t6s <- data.frame(Fitted = predict(fit_t6), Residuals = resid(fit_t6), transmission = mtc$am)
ggplot(data = fit_t0s, aes(x = Fitted, y = Residuals, color = transmission)) +
geom_point(size = 3, alpha = 0.7)+
ggtitle("Transmission vs MPG Residual Plot")
ggplot(data = fit_t6s, aes(x = Fitted, y = Residuals, color = transmission)) +
geom_point(size = 3, alpha = 0.7)+
ggtitle("MPG Residual Plot vs Prediction from Transmission, Weight & qsec")
The following code computes the influence diagnostics for both models.
# Change in outcome
dffits(fit_t0)
dffits(fit_t6)
# Change in individual coefficients
dfbetas(fit_t0)
dfbetas(fit_t6)
# Overall change in coefficients
cooks.distance(fit_t0)
cooks.distance(fit_t6)
# Leverage Comparison
hatvalues(fit_t0)
hatvalues(fit_t6)
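Each of these functions returns one value per car (or per car and coefficient), so some screening is needed to find the cars worth a closer look. The sketch below assumes the final model refit directly on mtcars and uses the common 2p/n rule of thumb for leverage, which is an assumption here and not a cutoff taken from this report; the names fit, h, cutoff, high_leverage and top_cook are illustrative only.

```r
# Final model refit on the numeric mtcars data.
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Leverage: flag cars whose hat value exceeds twice the average (2p/n).
h <- hatvalues(fit)
cutoff <- 2 * length(coef(fit)) / nrow(mtcars)
high_leverage <- names(h)[h > cutoff]

# Cook's distance: list the three most influential cars.
top_cook <- head(sort(cooks.distance(fit), decreasing = TRUE), 3)

print(cutoff)
print(high_leverage)
print(top_cook)
```

A car that appears in both lists combines an unusual set of predictor values with a noticeable pull on the fitted coefficients.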
Based on the results above, the following conclusions can be drawn.
Transmission type does affect mpg: automatic and manual cars behave as two separate populations within the mtcars data set. The plots and the p-value support this hypothesis.
After accounting for the correlated predictors, the best predictors of miles per gallon besides transmission type are the wt and qsec variables.
The 95% confidence and prediction intervals for mpg by transmission type are given below.
| Type | Mean | Lower | Upper |
|---|---|---|---|
| Automatic (Confidence Interval) | 17.15 | 14.85 | 19.44 |
| Automatic (Prediction) | 17.15 | 6.88 | 27.42 |
| Manual (Confidence Interval) | 24.39 | 21.62 | 27.17 |
| Manual (Prediction) | 24.39 | 14.00 | 34.78 |
# Summarize all fit models respectively
summary(fit_t2)
##
## Call:
## lm(formula = mpg ~ . - cyl - disp - vs, data = mtc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7967 -1.4077 -0.2955 1.2099 4.5072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.49215 10.71794 1.259 0.2202
## hp -0.01215 0.01599 -0.760 0.4549
## drat 0.89764 1.47732 0.608 0.5491
## wt -2.62772 1.06891 -2.458 0.0216 *
## qsec 0.65845 0.47292 1.392 0.1766
## amManual 2.41351 1.85858 1.299 0.2064
## gear 0.70547 1.30189 0.542 0.5929
## carb -0.61460 0.57750 -1.064 0.2978
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 24 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8262
## F-statistic: 22.06 on 7 and 24 DF, p-value: 5.308e-09
summary(fit_t3)
##
## Call:
## lm(formula = mpg ~ . - cyl - disp - vs - gear, data = mtc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7903 -1.3426 -0.1935 1.1624 4.2998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.55110 10.38833 1.401 0.17359
## hp -0.01174 0.01575 -0.746 0.46292
## drat 1.07285 1.42100 0.755 0.45731
## wt -2.82895 0.98807 -2.863 0.00837 **
## qsec 0.70712 0.45770 1.545 0.13493
## amManual 2.85861 1.64350 1.739 0.09427 .
## carb -0.45445 0.48908 -0.929 0.36168
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.477 on 25 degrees of freedom
## Multiple R-squared: 0.8638, Adjusted R-squared: 0.8311
## F-statistic: 26.43 on 6 and 25 DF, p-value: 1.122e-09
summary(fit_t4)
##
## Call:
## lm(formula = mpg ~ . - cyl - disp - vs - gear - hp, data = mtc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9355 -1.2134 -0.3151 1.0669 4.2271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.9243 8.2592 1.202 0.24035
## drat 1.2071 1.3975 0.864 0.39562
## wt -3.1108 0.9050 -3.437 0.00199 **
## qsec 0.9145 0.3603 2.538 0.01748 *
## amManual 2.9639 1.6234 1.826 0.07939 .
## carb -0.6023 0.4432 -1.359 0.18583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.455 on 26 degrees of freedom
## Multiple R-squared: 0.8608, Adjusted R-squared: 0.834
## F-statistic: 32.15 on 5 and 26 DF, p-value: 2.423e-10
summary(fit_t5)
##
## Call:
## lm(formula = mpg ~ . - cyl - disp - vs - gear - hp - drat, data = mtc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1184 -1.5414 -0.1392 1.2917 4.3604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.8972 7.4725 1.726 0.095784 .
## wt -3.4343 0.8200 -4.188 0.000269 ***
## qsec 1.0191 0.3378 3.017 0.005507 **
## amManual 3.5114 1.4875 2.361 0.025721 *
## carb -0.4886 0.4212 -1.160 0.256212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.444 on 27 degrees of freedom
## Multiple R-squared: 0.8568, Adjusted R-squared: 0.8356
## F-statistic: 40.39 on 4 and 27 DF, p-value: 5.064e-11
summary(fit_t6)
##
## Call:
## lm(formula = mpg ~ . - cyl - disp - vs - gear - hp - drat - carb,
## data = mtc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## amManual 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
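Because the transmission-only model is nested inside the final model, the two can be compared with a partial F-test. The following is a minimal sketch, assuming both models are refit on the numeric mtcars data; the names fit_small, fit_final, cmp and adj_r2 are illustrative only.

```r
# Nested models: transmission only vs. transmission plus wt and qsec.
fit_small <- lm(mpg ~ am, data = mtcars)
fit_final <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Partial F-test: do wt and qsec jointly improve the fit?
cmp <- anova(fit_small, fit_final)
print(cmp)

# Adjusted R-squared of each model for a quick side-by-side comparison.
adj_r2 <- c(small = summary(fit_small)$adj.r.squared,
            final = summary(fit_final)$adj.r.squared)
print(round(adj_r2, 4))
```

A small p-value in the second row of the table indicates that adding wt and qsec explains significantly more of the variation in mpg than transmission type alone.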
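The inference analysis below refits the transmission model without an intercept, so that each coefficient is the mean mpg of one transmission type.
# Store the coefficient summary of the no-intercept model (am must be the factor version in mtc)
fit_t0t <- lm(mpg ~ am - 1, data = mtc)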
fit_t0c <- summary(fit_t0t)$coef
# Use confidence and prediction intervals to evaluate the model
predict(fit_t0, newdata = data.frame(am = as.factor(c("Auto", "Manual"))), interval = "confidence")
## fit lwr upr
## 1 17.14737 14.85062 19.44411
## 2 24.39231 21.61568 27.16894
predict(fit_t0, newdata = data.frame(am = as.factor(c("Auto", "Manual"))), interval = "prediction")
## fit lwr upr
## 1 17.14737 6.876013 27.41872
## 2 24.39231 14.003113 34.78150
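Both intervals can be reproduced by hand, which shows why the prediction interval is so much wider: it adds the residual variance of a single car to the uncertainty of the estimated mean. The following is a minimal sketch for the automatic group, assuming the model is refit on mtcars with am converted to a factor; the names est, se, tcrit, conf_int and pred_int are illustrative only.

```r
# Transmission-only model; the intercept is the automatic-group mean.
fit <- lm(mpg ~ factor(am), data = mtcars)
coefs <- coef(summary(fit))
est <- coefs["(Intercept)", "Estimate"]
se  <- coefs["(Intercept)", "Std. Error"]

# Two-sided 95% critical value on the residual degrees of freedom.
tcrit <- qt(0.975, df.residual(fit))

# Confidence interval for the mean mpg of automatic cars.
conf_int <- est + c(-1, 1) * tcrit * se

# Prediction interval for one new automatic car:
# the standard error also includes the residual variance.
sigma2 <- summary(fit)$sigma^2
pred_int <- est + c(-1, 1) * tcrit * sqrt(se^2 + sigma2)

print(conf_int)
print(pred_int)
```

The two results should match the first row of each predict() output above.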