You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG” “Quantify the MPG difference between automatic and manual transmissions”
There are, 25 values in mpg. * H0 = The mpg is consistent with a specified reference distribution. * H1 = The mpg is NOT consistent with a specified reference distribution
library('zoo')
p1 <-hist(mtcars$mpg, include.lowest=FALSE, right=FALSE)
x=seq(min(mtcars$mpg), max(mtcars$mpg),length = length(p1$counts))
breaks_cdf <- dnorm(x, mean(mtcars$mpg), sd(mtcars$mpg))
a <- chisq.test(x = p1$counts, p=breaks_cdf, rescale.p=TRUE, simulate.p.value=TRUE)
a$p.value
## [1] 0.005497251374
library("Hmisc")
library("dplyr")
library(kableExtra)
library(formattable)
flat_cor_mat <- function(cor_r, cor_p){
#This function provides a simple formatting of a correlation matrix
#into a table with 4 columns containing :
# Column 1 : row names (variable 1 for the correlation test)
# Column 2 : column names (variable 2 for the correlation test)
# Column 3 : the correlation coefficients
# Column 4 : the p-values of the correlations
library(tidyr)
library(tibble)
cor_r <- rownames_to_column(as.data.frame(round(cor_r,2)), var = "row")
cor_r <- gather(cor_r, column, cor, -1)
cor_p <- rownames_to_column(as.data.frame( round(cor_p,2)), var = "row")
cor_p <- gather(cor_p, column, p, -1)
cor_p_matrix <- left_join(cor_r, cor_p, by = c("row", "column"))
cor_p_matrix
}
cor_3 <- rcorr(as.matrix(mtcars[, 1:7]))
my_cor_matrix <- flat_cor_mat(cor_3$r, cor_3$P)
kable(my_cor_matrix,"latex",longtable = T, caption = "Correlation Table") %>%
kable_styling(fixed_thead = T, latex_options = c("striped", "repeat_header"), full_width = F, position = "center")
Opacity of each label is dependent on the absolute value of the coefficient of correlation.
library(GGally)
ggcorr(mtcars, label = TRUE, label_alpha = TRUE)
Figure 1. Correlations
Variances of the estimated coefficients increase as more regressors with higher correlation are included. To get a more accurate result, it is important to leave these regressors out
lm_all <- lm(mpg ~ ., data = mtcars)
summary(lm_all)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506441 -1.6044018 -0.1196051 1.2192678 4.6270942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.65731 0.518124
## cyl -0.11144048 1.04502336 -0.10664 0.916087
## disp 0.01333524 0.01785750 0.74676 0.463489
## hp -0.02148212 0.02176858 -0.98684 0.334955
## drat 0.78711097 1.63537307 0.48130 0.635278
## wt -3.71530393 1.89441430 -1.96119 0.063252 .
## qsec 0.82104075 0.73084480 1.12341 0.273941
## vs 0.31776281 2.10450861 0.15099 0.881423
## am 2.52022689 2.05665055 1.22540 0.233990
## gear 0.65541302 1.49325996 0.43891 0.665206
## carb -0.19941925 0.82875250 -0.24063 0.812179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.650197 on 21 degrees of freedom
## Multiple R-squared: 0.8690158, Adjusted R-squared: 0.8066423
## F-statistic: 13.93246 on 10 and 21 DF, p-value: 0.0000003793152
library(car)
makelms <- function(){
# Simulate a dependent variable, y, as x1
# plus a normally distributed error of mean 0 and
# standard deviation .3.
col_len <- dim(mtcars)[2]
viff <- c()
for(i in 2: col_len){
temp <- capture.output(cat(na.omit(colnames(mtcars)[-which(colnames(mtcars)=="am")][i:col_len]),sep="+"))
if(length(temp)){
formula = paste0("mpg ~ factor(am)+",temp)
mdl <- lm(formula = formula, data = mtcars)
print(vif(mdl))
viff = c(viff, vif(mdl))
}
}
viff
}
#compute bias - overfitting and underfitting
vifs <- makelms()
## factor(am) cyl disp hp drat
## 4.648487456 15.373833403 21.620241029 9.832036844 3.374620008
## wt qsec vs gear carb
## 15.164886964 7.527958225 4.965873466 5.357452106 7.908746751
## factor(am) disp hp drat wt
## 4.332285930 20.088643356 9.499794893 3.118061510 14.971794853
## qsec vs gear carb
## 6.960353472 4.454934513 4.691535866 7.497053712
## factor(am) hp drat wt qsec vs
## 4.285814561 6.015787688 3.111500773 6.051126642 5.918682154 4.270956073
## gear carb
## 4.690187139 4.290467707
## factor(am) drat wt qsec vs gear
## 4.258479132 3.043073328 5.104823316 4.139107281 4.191817892 4.688164464
## carb
## 3.826242795
## factor(am) wt qsec vs gear carb
## 3.974216136 4.779644615 4.068140457 4.145482549 4.519156734 3.768859449
## factor(am) qsec vs gear carb
## 3.402052744 3.740871649 3.641623856 3.999256760 2.453053992
## factor(am) vs gear carb
## 3.109049828 1.990702430 3.985747305 2.228001604
## factor(am) gear carb
## 2.926043153 3.153213924 1.168889369
## factor(am) carb
## 1.003321195 1.003321195
The gear and drat should be left out.
#find correlation between variables
cor.test(mtcars$am, mtcars$gear, method=c("pearson", "kendall", "spearman"))
Pearson's product-moment correlation
data: mtcars\(am and mtcars\)gear t = 7.1552247, df = 30, p-value = 0.00000005834043 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.6158963198 0.8949545741 sample estimates: cor 0.7940587603
cor.test(mtcars$am, mtcars$drat, method=c("pearson", "kendall", "spearman"))
Pearson's product-moment correlation
data: mtcars\(am and mtcars\)drat t = 5.5650966, df = 30, p-value = 0.00000472679 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.4843990837 0.8501319450 sample estimates: cor 0.7127111272
cor.test(mtcars$drat, mtcars$carb, method=c("pearson", "kendall", "spearman"))
Pearson's product-moment correlation
data: mtcars\(drat and mtcars\)carb t = -0.49933844, df = 30, p-value = 0.6211834 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.4259975986 0.2663357912 sample estimates: cor -0.09078979887 ### Shapiro-Wilk test can be performed as follow: Null hypothesis: the data are normally distributed Alternative hypothesis: the data are not normally distributed
#plot(mtcars$mpg,mtcars$am,pch=19,col="darkgrey",xlab="MPG",ylab="Automatic Transmission")
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756473, p-value = 0.1228814
shapiro.test(mtcars$am)
##
## Shapiro-Wilk normality test
##
## data: mtcars$am
## W = 0.62507437, p-value = 0.00000007836354
Conclusion: Normality cannot be assumed for am column
anova(fit_no_wt, fit_simple, fit_interact)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ wt + factor(am)
## Model 3: mpg ~ wt * factor(am)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.89660
## 2 29 278.31970 1 442.57690 65.91302 0.0000000077171 ***
## 3 28 188.00767 1 90.31203 13.45018 0.0010171 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Linear
\[ RW_i = b_0 + b_1 RS_i + e_i \]
or
\[ E[RW_i | RS_i, b_0, b_1] = b_0 + b_1 RS_i\]
Logistic
\[ \rm{Pr}(RW_i | RS_i, b_0, b_1) = \frac{\exp(b_0 + b_1 RS_i)}{1 + \exp(b_0 + b_1 RS_i)}\]
or
\[ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i \]
## Test_MPG pred_test
## Mazda RX4 21.0 23.075
## Valiant 18.1 17.300
## Duster 360 14.3 17.300
## Merc 240D 24.4 17.300
## Merc 230 22.8 17.300
## Merc 280C 17.8 17.300
### Odds ratios and confidence intervals
exp(mod_glm$coeff)
## (Intercept) am
## 17.300000000 1.333815029
exp(confint(mod_glm))
## 2.5 % 97.5 %
## (Intercept) 13.830637458 20.769182899
## am 1.043921992 1.735534932
mod_glm2 <- glm(mpg ~ factor(am)*wt, family=gaussian(link="log"), data= train_cars)
summary(mod_glm2)
##
## Call:
## glm(formula = mpg ~ factor(am) * wt, family = gaussian(link = "log"),
## data = train_cars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5332767 -1.4022925 -0.1536827 0.9869617 4.0139082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.44003032 0.29068633 11.83417 0.000000056461 ***
## factor(am)1 0.72227921 0.31492523 2.29349 0.040674 *
## wt -0.15562829 0.07749517 -2.00823 0.067670 .
## factor(am)1:wt -0.26121984 0.09368849 -2.78817 0.016400 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 4.069524889)
##
## Null deviance: 484.29750 on 15 degrees of freedom
## Residual deviance: 48.83429 on 12 degrees of freedom
## AIC: 73.259537
##
## Number of Fisher Scoring iterations: 5
pred_test2 <- predict(mod_glm,newdata=test_cars, type = "response")
par(mar=c(10,5,2,1), mfcol=c(1,2), cex=1, ps=12)
plot(mod_glm2, which = 1)
plot(mod_glm2, which = 2)
Figure 3. Residual Plots with Interactions
anova(mod_glm,mod_glm2,test="LRT")
## Analysis of Deviance Table
##
## Model 1: mpg ~ (am)
## Model 2: mpg ~ factor(am) * wt
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 14 350.89500
## 2 12 48.83429 2 302.06071 < 0.000000000000000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The residual deviance and AIC of, mpg ~ factor(am) * wt are, 48.83 and 73.26 which are much lower than those of mpg ~ (am) which are, 350.89 and 100.81
## Start: AIC=163.71
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Deviance AIC
## - cyl 1 147.57430 161.72713
## - vs 1 147.65456 161.74453
## - carb 1 147.90110 161.79792
## - gear 1 148.84749 162.00203
## - drat 1 149.12146 162.06087
## - disp 1 151.41110 162.54847
## - hp 1 154.33434 163.16040
## - qsec 1 156.35855 163.57737
## <none> 147.49443 163.70981
## - am 1 158.04108 163.91988
## - wt 1 174.50881 167.09173
##
## Step: AIC=161.73
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Deviance AIC
## - vs 1 147.84282 159.78531
## - carb 1 148.09444 159.83972
## - gear 1 149.39545 160.11962
## - drat 1 149.55692 160.15418
## - disp 1 151.47524 160.56203
## - hp 1 154.93747 161.28521
## <none> 147.57430 161.72713
## - qsec 1 157.66756 161.84416
## - am 1 159.41025 162.19591
## + cyl 1 147.49443 163.70981
## - wt 1 174.60235 165.10887
##
## Step: AIC=159.79
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Deviance AIC
## - carb 1 148.52829 157.93333
## - gear 1 149.98651 158.24597
## - drat 1 150.05672 158.26094
## - disp 1 151.48947 158.56503
## - hp 1 154.94878 159.28755
## <none> 147.84282 159.78531
## - am 1 159.41221 160.19630
## - qsec 1 163.52587 161.01160
## + vs 1 147.57430 161.72713
## + cyl 1 147.65456 161.74453
## - wt 1 175.22277 163.22238
##
## Step: AIC=157.93
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Deviance AIC
## - gear 1 150.09325 156.26873
## - drat 1 150.46041 156.34692
## <none> 148.52829 157.93333
## - disp 1 158.63855 158.04062
## - am 1 160.85150 158.48393
## - hp 1 163.35392 158.97793
## + carb 1 147.84282 159.78531
## + vs 1 148.09444 159.83972
## + cyl 1 148.11386 159.84392
## - qsec 1 174.93634 161.17003
## - wt 1 217.65521 168.16171
##
## Step: AIC=156.27
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Deviance AIC
## - drat 1 153.43781 154.97397
## - disp 1 158.63861 156.04064
## <none> 150.09325 156.26873
## - hp 1 163.37791 156.98263
## + gear 1 148.52829 157.93333
## + cyl 1 149.08986 158.05409
## + vs 1 149.44777 158.13082
## + carb 1 149.98651 158.24597
## - am 1 170.12913 158.27837
## - qsec 1 175.66766 159.30352
## - wt 1 217.66522 166.16318
##
## Step: AIC=154.97
## mpg ~ disp + hp + wt + qsec + am
##
## Df Deviance AIC
## - disp 1 160.06646 154.32737
## <none> 153.43781 154.97397
## - hp 1 166.00986 155.49403
## + drat 1 150.09325 156.26873
## + gear 1 150.46041 156.34692
## + cyl 1 150.99111 156.45959
## + vs 1 152.31700 156.73936
## + carb 1 153.42638 156.97158
## - qsec 1 179.90760 158.06671
## - am 1 185.63532 159.06961
## - wt 1 222.48085 164.86343
##
## Step: AIC=154.33
## mpg ~ hp + wt + qsec + am
##
## Df Deviance AIC
## - hp 1 169.28593 154.11937
## <none> 160.06646 154.32737
## + disp 1 153.43781 154.97397
## + carb 1 156.83927 155.67561
## + drat 1 158.63861 156.04064
## - qsec 1 180.29107 156.13484
## + cyl 1 159.81748 156.27756
## + vs 1 159.81791 156.27764
## + gear 1 159.89534 156.29314
## - am 1 186.05930 157.14261
## - wt 1 238.56023 165.09642
##
## Step: AIC=154.12
## mpg ~ wt + qsec + am
##
## Df Deviance AIC
## <none> 169.28593 154.11937
## + hp 1 160.06646 154.32737
## + carb 1 161.24999 154.56311
## + disp 1 166.00986 155.49403
## + cyl 1 167.78487 155.83436
## + drat 1 167.88631 155.85370
## + gear 1 169.16321 156.09617
## + vs 1 169.28546 156.11928
## - am 1 195.46363 156.72050
## - qsec 1 278.31970 168.02917
## - wt 1 352.63319 175.60223
##
## Call:
## glm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4810670 -1.5555254 -0.7257309 1.4110122 4.6609983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6177805 6.9595930 1.38195 0.17791517
## wt -3.9165037 0.7112016 -5.50688 0.0000069527 ***
## qsec 1.2258860 0.2886696 4.24668 0.00021617 ***
## am 2.9358372 1.4109045 2.08082 0.04671551 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6.045926055)
##
## Null deviance: 1126.04719 on 31 degrees of freedom
## Residual deviance: 169.28593 on 28 degrees of freedom
## AIC: 154.11937
##
## Number of Fisher Scoring iterations: 2
The residual plots for the model selected by the AIC are:
par(mar=c(10,5,2,1), mfcol=c(1,2), cex=1, ps=12)
plot(result.step.1, which = 1)
plot(result.step.1, which = 2)
Figure 1. Plots
The mpg ~ wt + qsec + am with the residual plots at: @ref(fig:residual_plots_final_model) , is the most apt. model for predicting the mpg with the am as one of the predictors. The Automatic transmission is better for the MPG. I have used the family=gaussian(link=“log”) this is the interpretation of the coefficient of the transmission on the MPG. Keeping the rest of the coefficients constant, a change, from manual to automatic in the transmission, would result in a change of mpg by \(exp(coef(am))\) which is, 18.84