Analysis of Variance

Use the cars_new.csv. See HW1 for detailed information of variables.

# Read data 
q4.data <- read.csv("cars_new.csv")

# Check the structure
str(q4.data)
## 'data.frame':    180 obs. of  4 variables:
##  $ type       : chr  "Sedan" "Sedan" "Sedan" "Sedan" ...
##  $ origin     : chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ cylinders  : int  4 4 6 6 6 6 6 6 6 6 ...
##  $ mpg_highway: int  31 29 28 24 24 24 30 29 30 28 ...
# Convert int to factor independent variables

q4.data$type <- factor(q4.data$type)
q4.data$origin <- factor(q4.data$origin)
q4.data$cylinders <- factor(q4.data$cylinders)


#check the result

str(q4.data)
## 'data.frame':    180 obs. of  4 variables:
##  $ type       : Factor w/ 2 levels "Sedan","Sports": 1 1 1 1 1 2 1 1 1 1 ...
##  $ origin     : Factor w/ 2 levels "Asia","USA": 1 1 1 1 1 1 2 2 2 2 ...
##  $ cylinders  : Factor w/ 2 levels "4","6": 1 1 2 2 2 2 2 2 2 2 ...
##  $ mpg_highway: int  31 29 28 24 24 24 30 29 30 28 ...

Question (a):

  • Start with a three-way main effects ANOVA and choose the best main effects ANOVA model for mpg_highway as a function of cylinders, origin, and type for the cars in this set. Comment on which terms should be kept in a model for mpg_highway and why based on Type 3 SS. For the model with just predictors you decide to keep, comment on the significant effects in the model and comment on how much variation in highway fuel efficiency the model describes.

Answer -:

Step 1) check balancing data
table(q4.data$cylinders)
## 
##  4  6 
## 86 94
table(q4.data$origin)
## 
## Asia  USA 
##  104   76
table(q4.data$type)
## 
##  Sedan Sports 
##    164     16
Result:

All terms are unbalanced.

Step 2) Make the full model
q4.aov.full.a <- aov(data= q4.data, mpg_highway ~ cylinders + origin + type )
summary(q4.aov.full.a)
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## cylinders     1 1470.8  1470.8 137.497 < 2e-16 ***
## origin        1    8.6     8.6   0.801 0.37212    
## type          1  108.1   108.1  10.102 0.00175 ** 
## Residuals   176 1882.7    10.7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step 3) Type III
Anova(q4.aov.full.a, type=3)
## Anova Table (Type III tests)
## 
## Response: mpg_highway
##             Sum Sq  Df   F value  Pr(>F)    
## (Intercept)  69548   1 6501.6715 < 2e-16 ***
## cylinders     1453   1  135.8499 < 2e-16 ***
## origin           1   1    0.0786 0.77948    
## type           108   1   10.1018 0.00175 ** 
## Residuals     1883 176                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result:

Origin p-value is very large (insignificant), thus we will eliminate it.

Step 4) Eliminate first item
q4.aov.a.temp1 <- aov(data=q4.data, mpg_highway ~ cylinders + type )
Anova(q4.aov.a.temp1, type=3)
## Anova Table (Type III tests)
## 
## Response: mpg_highway
##             Sum Sq  Df F value    Pr(>F)    
## (Intercept)  88449   1 8311.96 < 2.2e-16 ***
## cylinders     1482   1  139.27 < 2.2e-16 ***
## type           116   1   10.88  0.001175 ** 
## Residuals     1883 177                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result:

All variables have significant difference, thus we stop elimination (cut-off)

The model is: (mpg_highway ~ cylinders + type)

(According to question part (a), we only checked the main effects, no need to check the interaction in this part)


Question (b):

  • Starting with main effects chosen in part (a), find your best ANOVA model by adding in any additional interaction terms that will significantly improve the model. For your final model, comment on the significant effects and variation explained by the model.

Answer -:

Our final model (based on main effects) is:
mpg_highway ~ cylinders + type

Now we will add interaction to our model:

q4.aov.a.temp2 <- aov(data=q4.data, mpg_highway ~ cylinders + type + cylinders * type )
Anova(q4.aov.a.temp2, type=3)
## Anova Table (Type III tests)
## 
## Response: mpg_highway
##                Sum Sq  Df  F value    Pr(>F)    
## (Intercept)     85471   1 8358.838 < 2.2e-16 ***
## cylinders        1558   1  152.397 < 2.2e-16 ***
## type              198   1   19.392 1.844e-05 ***
## cylinders:type     84   1    8.201  0.004696 ** 
## Residuals        1800 176                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result:

P-value of interaction is small, thus we will add interaction to model. our final model is:
mpg_highway ~ cylinders + type + cylinders * type

we will start two-way anova test

Step 1) check balancing

already is done.

Step 2) Run Two-Way ANOVA
q4.aov.b <-aov(data=q4.data, mpg_highway ~ cylinders*type )
summary(q4.aov.b)
##                 Df Sum Sq Mean Sq F value  Pr(>F)    
## cylinders        1 1470.8  1470.8 143.839 < 2e-16 ***
## type             1  115.8   115.8  11.323 0.00094 ***
## cylinders:type   1   83.9    83.9   8.201 0.00470 ** 
## Residuals      176 1799.6    10.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step 3) Chack significant of effect
  • cylinders:
    H0: there is no cylinders effect
    H1: there exist cylinders effect

  • type:
    H0: there is no type effect
    H1: there exist type effect

  • interaction:
    H0: there is no interaction effect
    H1: there exist interaction effect

Step 4) Calculate the R Square
q4.lm.b <- lm(data=q4.data, mpg_highway ~ cylinders + type + cylinders*type)
#aov(q4.lm.b)
summary(q4.lm.b)$r.square
## [1] 0.4813821
Result:

The R Square is the percentage of variation in a response variable that is explained by the model.

Step 5) Run Type III
Anova(q4.aov.b,type=3)
## Anova Table (Type III tests)
## 
## Response: mpg_highway
##                Sum Sq  Df  F value    Pr(>F)    
## (Intercept)     85471   1 8358.838 < 2.2e-16 ***
## cylinders        1558   1  152.397 < 2.2e-16 ***
## type              198   1   19.392 1.844e-05 ***
## cylinders:type     84   1    8.201  0.004696 ** 
## Residuals        1800 176                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# q4.a.t3 <- Anova(q4.aov.b,type=3)
# q4.a.t3
Result Type III and conclusion:

Regarding to P-Values:
- There exist significant cylinders effect.
- There exist significant type effect.
- There exist significant interaction effect.

Step 6) Chack variance equality by “Levene Test”
leveneTest(q4.aov.b)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   3  4.9694 0.002465 **
##       176                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result:

P-value is small, then variances are not equal, we must check it by residuals plot too.

Step 7) Diagnostic plot
par(mfrow=c(2,2))
plot(q4.aov.b)

##### Result:
According to QQ plot, normality assumption is reasonable, although there are some extreme values out of the line, but it’s acceptable, because ANOVA works on the semi-normal style data.

Regarding to Residuals plot, the variances are not equal (data are heterogeneous). thus we must run Welch test.


Question (c):

  • Comment on any significant group differences through the post-hoc test. What does this tell us about fuel efficiency differences across cylinders, origin, or type groups? See Hint in Exercise 3.

Answer -:

TukeyHSD(q4.aov.b)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mpg_highway ~ cylinders * type, data = q4.data)
## 
## $cylinders
##          diff       lwr       upr p adj
## 6-4 -5.722662 -6.664343 -4.780981     0
## 
## $type
##                   diff       lwr       upr     p adj
## Sports-Sedan -2.817931 -4.470787 -1.165075 0.0009407
## 
## $`cylinders:type`
##                         diff       lwr       upr     p adj
## 6:Sedan-4:Sedan   -6.1723315 -7.469178 -4.875485 0.0000000
## 4:Sports-4:Sedan  -5.2275641 -8.306639 -2.148489 0.0001079
## 6:Sports-4:Sedan  -6.6025641 -9.681639 -3.523489 0.0000006
## 4:Sports-6:Sedan   0.9447674 -2.120956  4.010491 0.8546517
## 6:Sports-6:Sedan  -0.4302326 -3.495956  2.635491 0.9834567
## 6:Sports-4:Sports -1.3750000 -5.521993  2.771993 0.8253946
Result:
  • Diff of “6:Sedan-4:Sedan” is “Negative”, P-Value is small ==> µ(4:Sedan) > µ(6:Sedan)
  • Diff of “4:Sports-4:Sedan” is “Negative”, P-Value is small ==> µ(4:Sedan) > µ(4:Sports)
  • Diff of “6:Sports-4:Sedan” is “Negative”, P-Value is small ==> µ(4:Sedan) > µ(6:Sports)
  • Diff of “4:Sports-6:Sedan” is “Positive”, P-Value is large ==> µ(4:Sports) = µ(6:Sedan)
  • Diff of “6:Sports-6:Sedan” is “Negative”, P-Value is large ==> µ(6:Sports) = µ(6:Sedan)
  • Diff of “6:Sports-4:Sports” is “Negative”, P-Value is large ==> µ(6:Sports) = µ(4:Sports)

Conclussion:

µ(4:Sedan) > µ(6:Sedan)
&
µ(4:Sedan) > µ(4:sports) > µ(6:sports)

therefore, it means µ(4:Sedan) are different with others.
In the other word, sedan type with 4 cylinders has most fuel efficiency.