Use the cars_new.csv. See HW1 for detailed information of variables.
# Read data
q4.data <- read.csv("cars_new.csv")
# Check the structure
str(q4.data)
## 'data.frame': 180 obs. of 4 variables:
## $ type : chr "Sedan" "Sedan" "Sedan" "Sedan" ...
## $ origin : chr "Asia" "Asia" "Asia" "Asia" ...
## $ cylinders : int 4 4 6 6 6 6 6 6 6 6 ...
## $ mpg_highway: int 31 29 28 24 24 24 30 29 30 28 ...
# Convert int to factor independent variables
q4.data$type <- factor(q4.data$type)
q4.data$origin <- factor(q4.data$origin)
q4.data$cylinders <- factor(q4.data$cylinders)
#check the result
str(q4.data)
## 'data.frame': 180 obs. of 4 variables:
## $ type : Factor w/ 2 levels "Sedan","Sports": 1 1 1 1 1 2 1 1 1 1 ...
## $ origin : Factor w/ 2 levels "Asia","USA": 1 1 1 1 1 1 2 2 2 2 ...
## $ cylinders : Factor w/ 2 levels "4","6": 1 1 2 2 2 2 2 2 2 2 ...
## $ mpg_highway: int 31 29 28 24 24 24 30 29 30 28 ...
table(q4.data$cylinders)
##
## 4 6
## 86 94
table(q4.data$origin)
##
## Asia USA
## 104 76
table(q4.data$type)
##
## Sedan Sports
## 164 16
All terms are unbalanced.
q4.aov.full.a <- aov(data= q4.data, mpg_highway ~ cylinders + origin + type )
summary(q4.aov.full.a)
## Df Sum Sq Mean Sq F value Pr(>F)
## cylinders 1 1470.8 1470.8 137.497 < 2e-16 ***
## origin 1 8.6 8.6 0.801 0.37212
## type 1 108.1 108.1 10.102 0.00175 **
## Residuals 176 1882.7 10.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(q4.aov.full.a, type=3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 69548 1 6501.6715 < 2e-16 ***
## cylinders 1453 1 135.8499 < 2e-16 ***
## origin 1 1 0.0786 0.77948
## type 108 1 10.1018 0.00175 **
## Residuals 1883 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Origin p-value is very large (insignificant), thus we will eliminate it.
q4.aov.a.temp1 <- aov(data=q4.data, mpg_highway ~ cylinders + type )
Anova(q4.aov.a.temp1, type=3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 88449 1 8311.96 < 2.2e-16 ***
## cylinders 1482 1 139.27 < 2.2e-16 ***
## type 116 1 10.88 0.001175 **
## Residuals 1883 177
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All variables have significant difference, thus we stop elimination (cut-off)
The model is: (mpg_highway ~ cylinders + type)
(According to question part (a), we only checked the main effects, no need to check the interaction in this part)
Our final model (based on main effects) is:
mpg_highway ~ cylinders + type
Now we will add interaction to our model:
q4.aov.a.temp2 <- aov(data=q4.data, mpg_highway ~ cylinders + type + cylinders * type )
Anova(q4.aov.a.temp2, type=3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 85471 1 8358.838 < 2.2e-16 ***
## cylinders 1558 1 152.397 < 2.2e-16 ***
## type 198 1 19.392 1.844e-05 ***
## cylinders:type 84 1 8.201 0.004696 **
## Residuals 1800 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-value of interaction is small, thus we will add interaction to model. our final model is:
mpg_highway ~ cylinders + type + cylinders * type
we will start two-way anova test
already is done.
q4.aov.b <-aov(data=q4.data, mpg_highway ~ cylinders*type )
summary(q4.aov.b)
## Df Sum Sq Mean Sq F value Pr(>F)
## cylinders 1 1470.8 1470.8 143.839 < 2e-16 ***
## type 1 115.8 115.8 11.323 0.00094 ***
## cylinders:type 1 83.9 83.9 8.201 0.00470 **
## Residuals 176 1799.6 10.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cylinders:
H0: there is no cylinders effect
H1: there exist cylinders effect
type:
H0: there is no type effect
H1: there exist type effect
interaction:
H0: there is no interaction effect
H1: there exist interaction effect
q4.lm.b <- lm(data=q4.data, mpg_highway ~ cylinders + type + cylinders*type)
#aov(q4.lm.b)
summary(q4.lm.b)$r.square
## [1] 0.4813821
The R Square is the percentage of variation in a response variable that is explained by the model.
Anova(q4.aov.b,type=3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 85471 1 8358.838 < 2.2e-16 ***
## cylinders 1558 1 152.397 < 2.2e-16 ***
## type 198 1 19.392 1.844e-05 ***
## cylinders:type 84 1 8.201 0.004696 **
## Residuals 1800 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# q4.a.t3 <- Anova(q4.aov.b,type=3)
# q4.a.t3
Regarding to P-Values:
- There exist significant cylinders effect.
- There exist significant type effect.
- There exist significant interaction effect.
leveneTest(q4.aov.b)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 4.9694 0.002465 **
## 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-value is small, then variances are not equal, we must check it by residuals plot too.
par(mfrow=c(2,2))
plot(q4.aov.b)
##### Result:
According to QQ plot, normality assumption is reasonable, although there are some extreme values out of the line, but it’s acceptable, because ANOVA works on the semi-normal style data.
Regarding to Residuals plot, the variances are not equal (data are heterogeneous). thus we must run Welch test.
TukeyHSD(q4.aov.b)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mpg_highway ~ cylinders * type, data = q4.data)
##
## $cylinders
## diff lwr upr p adj
## 6-4 -5.722662 -6.664343 -4.780981 0
##
## $type
## diff lwr upr p adj
## Sports-Sedan -2.817931 -4.470787 -1.165075 0.0009407
##
## $`cylinders:type`
## diff lwr upr p adj
## 6:Sedan-4:Sedan -6.1723315 -7.469178 -4.875485 0.0000000
## 4:Sports-4:Sedan -5.2275641 -8.306639 -2.148489 0.0001079
## 6:Sports-4:Sedan -6.6025641 -9.681639 -3.523489 0.0000006
## 4:Sports-6:Sedan 0.9447674 -2.120956 4.010491 0.8546517
## 6:Sports-6:Sedan -0.4302326 -3.495956 2.635491 0.9834567
## 6:Sports-4:Sports -1.3750000 -5.521993 2.771993 0.8253946
µ(4:Sedan) > µ(6:Sedan)
&
µ(4:Sedan) > µ(4:sports) > µ(6:sports)
therefore, it means µ(4:Sedan) are different with others.
In the other word, sedan type with 4 cylinders has most fuel efficiency.