Setting Working Directory

setwd("/Users/Kristianto_97/Downloads/MSDA/Fall 2020/DA Algorithms/HW 2")

Packages Used

library(DescTools)
library(MASS)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:DescTools':
## 
##     Recode

Exercise 1: Analysis of Variance

The heartbpchol.csv data set contains continuous cholesterol (Cholesterol) and blood pressure status (BP_Status) (category: High/ Normal/ Optimal) for alive patients.

For the heartbpchol.xlsx data set, consider a one-way ANOVA model to identify differences between group cholesterol means. The normality assumption is reasonable, so you can proceed without testing normality.

Exercise 1A

Perform a one-way ANOVA for Cholesterol with BP_Status as the categorical predictor. Comment on statistical significance of BP_Status, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.

Importing data and converting formats

heart <- read.csv("heartbpchol.csv") 
heart$Cholesterol <- as.numeric(heart$Cholesterol)
str(heart)
## 'data.frame':    541 obs. of  2 variables:
##  $ Cholesterol: num  221 188 292 319 205 247 202 150 228 280 ...
##  $ BP_Status  : Factor w/ 3 levels "High","Normal",..: 3 1 1 2 2 1 3 2 1 1 ...

Checking balance

table(heart$BP_Status) ## Checking balance for "BP_Status" ###
## 
##    High  Normal Optimal 
##     229     245      67
boxplot(Cholesterol ~ BP_Status, heart, 
        main = "Cholesterol Distribution by BP Status",
        xlab = "BP Status",
        ylab = "Cholesterol",
        col = "mistyrose",
        horizontal = FALSE)

  • Observation: The distribution of BP_Status is unbalanced. Each group has a different number of observations.

One-way ANOVA (Cholesterol ~ BP_Status)

aov.heart <- aov(Cholesterol ~ BP_Status, heart)
summary(aov.heart)
##              Df  Sum Sq Mean Sq F value  Pr(>F)   
## BP_Status     2   25211   12605   6.671 0.00137 **
## Residuals   538 1016631    1890                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Conclusion: Since p-value is 0.00317 (below significance level of 0.05), it’s safe to say that we can reject the Ho. This means that at least one observation of BP Status has a significance effect on Cholesterol (at least one of them has a different mean of Cholesterol)

R-Square (Variation of response variable explained by BP_Status)

lm.res <- lm(Cholesterol ~ BP_Status, heart)
summary(lm.res)$r.squared
## [1] 0.02419833
  • Observation: Only 2.4% of the variation of Cholesterol can be explained by BP_Status

Equal Variance Assumption (Levene Test)

LeveneTest(aov.heart)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  0.1825 0.8332
##       538
  • Conclusion: Since the p-value is 0.8332 (above significance level of 0.05), it’s safe to say that we cannot reject the Ho. This means that we can assume all BP_Status groups have equal variance

Checking for normality

par(mfrow=c(1,1)) # Diagnostic plot
plot(aov.heart, 2)

  • Observation: From the analysis of QQPlot above, we can conclude that a normal distribution is visible

Exercise 1B

Comment on any significantly different cholesterol means as determined by the post-hoc test comparing all pairwise differences. Specifically explain what that tells us about differences in cholesterol levels across blood pressure status groups, like which group has the highest or lowest mean values of Cholesterol.

ScheffeTest(aov.heart)
## 
##   Posthoc multiple comparisons of means : Scheffe Test 
##     95% family-wise confidence level
## 
## $BP_Status
##                      diff    lwr.ci    upr.ci   pval    
## Normal-High    -11.543481 -21.35092 -1.736038 0.0159 *  
## Optimal-High   -18.646679 -33.46702 -3.826341 0.0089 ** 
## Optimal-Normal  -7.103198 -21.81359  7.607194 0.4958    
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion:

  • From the Scheffe Test above, we can see that the Normal-High group and the Optimal-High group have different mean values (below significance level of 0.05) of Cholesterol, and thus we can reject Ho.

  • From the Scheffe Test above, we can see that the Optimal-Normal group, however, has the least significant mean value of Cholesterol (above significance level of 0.05), which means that Optimal and Normal BP_Status both have equal means

  • More specifically, below are some of the effects of BP_Status on Cholesterol that can be seen:

    • High > Normal (Mean value of Cholesterol for High is higher than mean Cholesterol of Normal)
    • High > Optimal (Mean value of Cholesterol for High is higher than mean Cholesterol of Optimal)
    • Optimal = Normal (mean Cholesterol of Optimal is the same as mean Cholesterol of Normal)

Exercise 2: Analysis of Variance

For this problem use the bupa.csv data set. Check UCI Machine Learning Repository for more information (http://archive.ics.uci.edu/ml/datasets/Liver+Disorders). The mean corpuscular volume and alkaline phosphatase are blood tests thought to be sensitive to liver disorder related to excessive alcohol consumption. We assume that normality and independence assumptions are valid.

Exercise 2A

Perform a one-way ANOVA for mcv as a function of drinkgroup. Comment on significance of the drinkgroup, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.

Importing data and converting formats

bupa <- read.csv("bupa.csv")
bupa$mcv <- as.numeric(bupa$mcv)
bupa$alkphos <- as.numeric(bupa$alkphos)
bupa$drinkgroup <- as.factor(bupa$drinkgroup)
str(bupa)
## 'data.frame':    345 obs. of  3 variables:
##  $ mcv       : num  85 85 86 91 87 98 88 88 92 90 ...
##  $ alkphos   : num  92 64 54 78 70 55 62 67 54 60 ...
##  $ drinkgroup: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...

Checking balance

table(bupa$drinkgroup) # Checking balance for drinkgroup
## 
##   1   2   3   4   5 
## 117  52  88  67  21
boxplot(mcv ~ drinkgroup, bupa, 
        main = "Mean Corpuscular Volume Distribution by Drink Group",
        xlab = "Drink Group",
        ylab = "Mean Corpuscular Volume",
        col = "powderblue",
        horizontal = FALSE)

  • Observation: The distribution of drinkgroup is unbalanced. Each group has a different number of observations.

One-way ANOVA (mcv ~ drinkgroup)

aov.mcv <- aov(mcv ~ drinkgroup, bupa)
summary(aov.mcv)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## drinkgroup    4    733  183.29   10.26 7.43e-08 ***
## Residuals   340   6073   17.86                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Conclusion: Since p-value is 7.43e-08 (below significance level of 0.05), it’s safe to say that we can reject the Ho. This means that at least one observation of drinkgroup has a significance effect on mcv (at least one of them has a different mean of mcv)

R-Square (Variation of response variable explained by drinkgroup)

lm.res_mcv <- lm(mcv ~ drinkgroup, bupa)
summary(lm.res_mcv)$r.squared
## [1] 0.1077214
  • Observation: Only 10.78% of the variation of mcv can be explained by drinkgroup

Equal Variance Assumption (Levene Test)

LeveneTest(aov.mcv)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   4  0.3053 0.8744
##       340

Conclusion: Since the p-value is 0.8744 (above significance level of 0.05), we cannot reject the Ho. This means that we can assume all drinkgroups groups have equal variance.

Checking for normality

par(mfrow=c(1,1)) # Diagnostic plot 
plot(aov.mcv, 2)

  • Observation: From the analysis of QQPlot, we can conclude that a normal distribution is visible

Exercise 2B

Perform a one-way ANOVA for alkphos as a function of drinkgroup. Comment on statistical significance of the drinkgroup, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.

Checking Balance

table(bupa$drinkgroup) 
## 
##   1   2   3   4   5 
## 117  52  88  67  21
boxplot(alkphos ~ drinkgroup, bupa,
        main = "Alkaline Phosphatase Distribution by Drink Group",
        xlab = "Alkaline Phosphatase",
        ylab = "Drink Group",
        col = "#C3CFFF",
        horizontal = FALSE)

  • Observation: The distribution of drinkgroup is unbalanced. Each group has a different number of observations.

One-way ANOVA (alkphos ~ drinkgroup)

aov.alkphos <- aov(alkphos ~ drinkgroup, bupa)
summary(aov.alkphos)
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## drinkgroup    4   4946  1236.4   3.792 0.00495 **
## Residuals   340 110858   326.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Conclusion: Since p-value is 0.00495 (below significance level of 0.05), it’s safe to say that we can reject the Ho. This means that at least one observation of drinkgroup has a significance effect on alkphos (at least one of them has a different mean of alkphos)

R-Square (Variation of response variable explained by drinkgroup)

lm.alkphos <- lm(alkphos ~ drinkgroup, bupa)
summary(lm.alkphos)$r.squared
## [1] 0.04270721
  • Observation: Only 4.27% of variation alkphos can be explained by drinkgroup

Equal Variance Assumption (Levene Test)

LeveneTest(aov.alkphos)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   4  0.8089 0.5201
##       340

Conclusion: Since the p-value is 0.5201 (above significance level of 0.05), we cannot reject the Ho. This means that we can assume all drinkgroup groups have equal variance.

Checking for normality

par(mfrow=c(1,1)) # Diagnostic plot 
plot(aov.alkphos, 2)

  • Observation: From the analysis of QQPlot, we can conclude that a normal distribution is visible

Exercise 2C

Perform post-hoc tests for models in a) and b). Comment on any similarities or differences you observe from their results.

mcv

ScheffeTest(aov.mcv)
## 
##   Posthoc multiple comparisons of means : Scheffe Test 
##     95% family-wise confidence level
## 
## $drinkgroup
##             diff      lwr.ci   upr.ci    pval    
## 2-1  1.241452991 -0.94020481 3.423111  0.5410    
## 3-1  0.938131313 -0.90892674 2.785189  0.6495    
## 4-1  3.744610282  1.73913894 5.750082 1.9e-06 ***
## 5-1  3.746031746  0.64379565 6.848268  0.0081 ** 
## 3-2 -0.303321678 -2.59291786 1.986275  0.9966    
## 4-2  2.503157290  0.08395442 4.922360  0.0380 *  
## 5-2  2.504578755 -0.87987039 5.889028  0.2646    
## 4-3  2.806478969  0.68408993 4.928868  0.0025 ** 
## 5-3  2.807900433 -0.37116998 5.986971  0.1151    
## 5-4  0.001421464 -3.27222796 3.275071  1.0000    
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion:

  • From the Scheffe Test above, we can see that the groups 4-1, 5-1, 4-2, and 4-3 have different mean values (below significance level of 0.05) of mcv, and thus we can reject the null.

  • From the Scheffe Test above, we can see that groups 2-1, 3-1, 3-2, 5-2, 5-3, and 5-4 do NOT have different mean values (above significance level of 0.05) of mcv, and thus we do not reject the null

  • More specifically, below are some of the effects of drinkgroup on mcv that can be seen:

    • 4 > 1 (mean mcv of 4 is greater than mean mcv of 1)
    • 5 > 1 (mean mcv of 5 is greater than mean mcv of 1)
    • 4 > 2 (mean mcv of 4 is greater than mean mcv of 2)
    • 4 > 3 (mean mcv of 4 is greater than mean mcv of 3)

alkphos

ScheffeTest(aov.alkphos)
## 
##   Posthoc multiple comparisons of means : Scheffe Test 
##     95% family-wise confidence level
## 
## $drinkgroup
##          diff      lwr.ci    upr.ci   pval    
## 2-1 -2.645299 -11.9663647  6.675766 0.9419    
## 3-1 -4.056138 -11.9476367  3.835360 0.6389    
## 4-1 -1.148743  -9.7170578  7.419571 0.9965    
## 5-1 12.572650  -0.6815582 25.826857 0.0734 .  
## 3-2 -1.410839 -11.1930681  8.371390 0.9953    
## 4-2  1.496556  -8.8394138 11.832525 0.9952    
## 5-2 15.217949   0.7579944 29.677903 0.0329 *  
## 4-3  2.907395  -6.1604467 11.975236 0.9117    
## 5-3 16.628788   3.0463078 30.211268 0.0069 ** 
## 5-4 13.721393  -0.2651729 27.707959 0.0578 .  
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion:

  • From the Scheffe Test above, we can see that groups 5-2 and 5-3 have significantly different mean values (below significance level of 0.05) of alkphos, and thus we can reject the null

  • From the Scheffe Test above, we can see that groups 2-1, 3-1, 4-1, 5-1, 3-2, 4-2, 4-3 and 5-4 do NOT have significantly different mean values of alkphos (above significance level of 0.05), and thus we cannot reject the null

  • More specifically, below are some of the effects of drinkgroup on alkphos that can be seen:

    • 5 > 3 (mean alkphos of 5 is higher than mean alkphos of 3)
    • 5 > 2 (mean alkphos of 5 is higher than mean alkphos of 2)
    • Equivalent mean values can be found in drinkgroup pairs 2-1, 3-1, 4-1, 5-1, 3-2, 4-2, 4-3, 5-4

Similarities:

  • Group pairs 2-1, 3-1, 3-2 and 5-4 have equal mean values, since their p-values are above the significance level of 0.05. This means that they don’t have an effect on either mcv and alkphos

Exercise 3

The psychology department at a hypothetical university has been accused of underpaying female faculty members. The data represent salary (in thousands of dollars) for all 22 professors in the department. This problem is from Maxwell and Delaney (2004).

Importing data and converting formats

psych <- read.csv("psych.csv")
psych$salary <- as.numeric(psych$salary)
str(psych)
## 'data.frame':    22 obs. of  3 variables:
##  $ sex   : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 2 2 2 2 ...
##  $ rank  : Factor w/ 2 levels "Assist","Assoc": 1 1 1 1 1 1 1 1 1 1 ...
##  $ salary: num  33 36 35 38 42 37 39 38 40 44 ...

Exercise 3A:

Fit a two-way ANOVA model including sex (F, M) and rank (Assistant, Associate) the interaction term. What do the Type 1 and Type 3 sums of squares tell us about significance of effects? Is the interaction between sex and rank significant? Also comment on the variation explained by the model.

Two-way ANOVA (Type 1)

aov.psych1 <- aov(salary ~ sex * rank, psych)
summary(aov.psych1)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## sex          1 155.15  155.15  17.007 0.000637 ***
## rank         1 169.82  169.82  18.616 0.000417 ***
## sex:rank     1   0.63    0.63   0.069 0.795101    
## Residuals   18 164.21    9.12                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov.psych2 <- aov(salary ~ rank * sex, psych)
summary(aov.psych2)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## rank         1 252.22  252.22  27.647 5.33e-05 ***
## sex          1  72.76   72.76   7.975   0.0112 *  
## rank:sex     1   0.63    0.63   0.069   0.7951    
## Residuals   18 164.21    9.12                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation:

  • Since both groups have low p-values (below the significance level of 0.05), we reject the null hypothesis and conclude that these groups have a significance effect on salary.

  • We can see interaction between sex and rank result in a much higher p-value (above the significance level of 0.05), which means that the interaction of sex and rank doesn’t have significance effect on salary ( don’t reject the null hypothesis)

Two-way ANOVA (Type 3)

Anova(aov.psych1, type = 3)
## Anova Table (Type III tests)
## 
## Response: salary
##             Sum Sq Df  F value  Pr(>F)    
## (Intercept) 8140.2  1 892.2994 < 2e-16 ***
## sex           28.0  1   3.0711 0.09671 .  
## rank          70.4  1   7.7189 0.01240 *  
## sex:rank       0.6  1   0.0695 0.79510    
## Residuals    164.2 18                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation:

  • Since rank is the only group that has a p-value below the significance level of 0.05, we reject the null hypothesis and conclude that the rank group has a significant effect on salary

  • sex and interaction groups have p-values above the significance level of 0.05, which means we cannot reject Ho and conclude that these groups don’t have a sig effect on salary

Variation Explained by the model (sex and rank)

lm.psych1 <- lm(salary ~ sex*rank, psych)
summary(lm.psych1)$r.squared
## [1] 0.6647566

Observation: 66.4% of salary variation can be explained by the model (sex and rank)

Exercise 3B

Refit the model without the interaction term. Comment on the significance of effects and variation explained. Report and interpret the Type 1 and Type 3 tests of the main effects. Are the main effects of rank and sex significant?

Two-way ANOVA (Type 1)

aov.psych3 <- aov(salary ~ rank + sex, psych)
summary(aov.psych3)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## rank         1 252.22  252.22  29.071 3.34e-05 ***
## sex          1  72.76   72.76   8.386  0.00926 ** 
## Residuals   19 164.84    8.68                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov.psych4 <- aov(salary ~ sex + rank, psych)
summary(aov.psych4)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## sex          1  155.2  155.15   17.88 0.000454 ***
## rank         1  169.8  169.82   19.57 0.000291 ***
## Residuals   19  164.8    8.68                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation:

  • Without any interaction, we conclude that both groups have significant effect on salary since their p-values are below the significance level of 0.05. Thus, we reject the null hypothesis, which means that at least one group in sex and rank have different mean values.

Two-way ANOVA (Type 3)

Anova(aov.psych3, type = 3)
## Anova Table (Type III tests)
## 
## Response: salary
##              Sum Sq Df   F value    Pr(>F)    
## (Intercept) 10227.6  1 1178.8469 < 2.2e-16 ***
## rank          169.8  1   19.5743 0.0002912 ***
## sex            72.8  1    8.3862 0.0092618 ** 
## Residuals     164.8 19                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation:

  • Without any interaction, we conclude that both groups have significant effect on salary since their p-values are below the significance level of 0.05. Thus, we reject the null hypothesis, which means that at least one group in sex and rank have different mean values, which means that at least one group in sex and rank have different mean values.

Variance Explained by the Model

lm.psych3 <- lm(salary ~ rank + sex, psych)
summary(lm.psych3)$r.squared
## [1] 0.6634627

Observation: Only 66.3% of salary variation can be explained by the model (sex and rank)

Exercise 3C

Obtain model diagnostics to validate your Normality assumptions.

par(mfrow=c(1,1)) # Diagnostic plot 
plot(aov.psych3, 2)

  • Observation: From the analysis of QQPlot, we can conclude that a normal distribution is visible

Exercise 3D

Choose a final model based on your results from parts (a) and (b). Comment on any significant group differences through the post-hoc test. State the differences in salary across different main effect groups and interaction (if included) between them.

Based on the result (a) and (b), we can conclude that interaction effect does NOT exist. Thus, the two-way ANOVA - Type 3 is selected to see unique contribution of each categorical variable. As every effect is adjusted for all other effects, this model is the best option.

Post-hoc Test

TukeyHSD(aov.psych4)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = salary ~ sex + rank, data = psych)
## 
## $sex
##         diff      lwr      upr     p adj
## M-F 5.333333 2.693648 7.973019 0.0004544
## 
## $rank
##                  diff      lwr      upr     p adj
## Assoc-Assist 5.377778 2.738092 8.017463 0.0004193

Conclusion:

sex

  • Because the p-value is below the significance level of 0.05, it’s apparent that sex has a significant effect on salary.
  • M > F (mean salary of Male is higher than mean salary of Female)

rank

  • Because the p-value is below the significance level of 0.05, it’s apparent that rank also has a significant effect on

salary.

  • Associate > Assistant (mean salary of an Associate is higher than mean salary of an Assistant)

Exercise 4

Use the cars_new.csv. See HW1 for detailed information of variables.

Exercise 4A

Start with a three-way main effects ANOVA and choose the best main effects ANOVA model for mpg_highway as a function of cylinders, origin, and type for the cars in this set. Comment on which terms should be kept in a model for mpg_highway and why based on Type 3 SS. For the model with just predictors you decide to keep, comment on the significant effects in the model and comment on how much variation in highway fuel efficiency the model describes.

Importing data and converting formats

## Importing data and converting formats ###
cars <- read.csv("cars_new.csv")
cars$mpg_highway <- as.numeric(cars$mpg_highway)
cars$cylinders <- as.factor(cars$cylinders)
cars$origin <- as.factor(cars$origin)
cars$type <- as.factor(cars$type)
str(cars)
## 'data.frame':    180 obs. of  4 variables:
##  $ type       : Factor w/ 2 levels "Sedan","Sports": 1 1 1 1 1 2 1 1 1 1 ...
##  $ origin     : Factor w/ 2 levels "Asia","USA": 1 1 1 1 1 1 2 2 2 2 ...
##  $ cylinders  : Factor w/ 2 levels "4","6": 1 1 2 2 2 2 2 2 2 2 ...
##  $ mpg_highway: num  31 29 28 24 24 24 30 29 30 28 ...

Three-way ANOVA (Type 3)

aov.cars <- aov(mpg_highway ~ cylinders + origin + type, cars)
Anova(aov.cars, type=3)
## Anova Table (Type III tests)
## 
## Response: mpg_highway
##             Sum Sq  Df   F value  Pr(>F)    
## (Intercept)  69548   1 6501.6715 < 2e-16 ***
## cylinders     1453   1  135.8499 < 2e-16 ***
## origin           1   1    0.0786 0.77948    
## type           108   1   10.1018 0.00175 ** 
## Residuals     1883 176                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation:

  • From the ANOVA table above, we can see that cylinders and type have p-values below the significant level of 0.05. This means that we reject the null hypothesis and conclude that these two groups have a significant effect on mpg_highway.
  • Additionally, at least one group has different mean values
  • Since origin has a p-value higher than than significant level of 0.05, we can say that we don’t reject the the null hypothesis and conclude that this group dosn’t have a significant effect on mpg_highway. Thus, we will remove origin from the model since it has insignificant effect.

Variation Explained by the Model (cylinders and type)

lm.cars <- lm(mpg_highway ~ cylinders + type, cars)
summary(lm.cars)$r.squared
## [1] 0.4572163

Observation: 45.7% of the variation of mpg_highway can be explained by cylinders and type

Exercise 4B

Starting with main effects chosen in part (a), find your best ANOVA model by adding in any additional interaction terms that will significantly improve the model. For your final model, comment on the significant effects and variation explained by the model.

aov.cars2 <- aov(mpg_highway ~ cylinders + type + cylinders * type, cars)
Anova(aov.cars2, type = 3)
## Anova Table (Type III tests)
## 
## Response: mpg_highway
##                Sum Sq  Df  F value    Pr(>F)    
## (Intercept)     85471   1 8358.838 < 2.2e-16 ***
## cylinders        1558   1  152.397 < 2.2e-16 ***
## type              198   1   19.392 1.844e-05 ***
## cylinders:type     84   1    8.201  0.004696 ** 
## Residuals        1800 176                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation:

*The best model is two-way ANOVA - Type 3 with the implementation of cylinders, type and its interaction. All three of them have p-values lower than the significance level of 0.05, which means that we can reject the null hypothesis and conclude that these have a significant effect on mpg_highway

Variation Explained by the Model

lm.cars2 <- lm(mpg_highway ~ cylinders + type + cylinders * type, cars)
summary(lm.cars2)$r.squared
## [1] 0.4813821

Observation: 48.1% of mpg_highway can be explained by cylinders, type, and its interaction.

Normality Check

par(mfrow=c(1,1)) # Diagnostic plot 
plot(aov.cars2, 2)

  • Observation: From the analysis of QQPlot above, we can conclude that a normal distribution is visible

Exercise 4C

Comment on any significant group differences through the post-hoc test. What does this tell us about fuel efficiency differences across cylinders, origin, or type groups? See Hint in Exercise 3.

Post-hoc Test

TukeyHSD(aov.cars2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mpg_highway ~ cylinders + type + cylinders * type, data = cars)
## 
## $cylinders
##          diff       lwr       upr p adj
## 6-4 -5.722662 -6.664343 -4.780981     0
## 
## $type
##                   diff       lwr       upr     p adj
## Sports-Sedan -2.817931 -4.470787 -1.165075 0.0009407
## 
## $`cylinders:type`
##                         diff       lwr       upr     p adj
## 6:Sedan-4:Sedan   -6.1723315 -7.469178 -4.875485 0.0000000
## 4:Sports-4:Sedan  -5.2275641 -8.306639 -2.148489 0.0001079
## 6:Sports-4:Sedan  -6.6025641 -9.681639 -3.523489 0.0000006
## 4:Sports-6:Sedan   0.9447674 -2.120956  4.010491 0.8546517
## 6:Sports-6:Sedan  -0.4302326 -3.495956  2.635491 0.9834567
## 6:Sports-4:Sports -1.3750000 -5.521993  2.771993 0.8253946
  • Conclusion:

    • type:

      • Since the p-value for type is smaller than the significance level of 0.05, we reject the null hypothesis and conclude that the type has a significant effect on mpg_highway

      • Sports < Sedan (mean mpg_highway of Sedan is higher than mean mpg_highway of Sports)

    • cylinders:

      • Since the p-value for cylinder is smaller than the significance level 0.05, we reject the null hypothesis and conclude that the cylinders has a sig effect on mpg_highway

      • 6 < 4 (mean mpg_highway of 6 is lower than mean mpg_highway of 4)

    • cylinders and type interaction

      • Because the p-value is below the significance level of 0.05, type groups 6:Sedan-4:Sedan, 4:Sports-4:Sedan, and 6:Sports-4:Sedan have a significant effect on mpg_highway

      • Groups 4:Sports-6:Sedan, 6:Sports-6:Sedan, and 6:Sports-4:Sports all have p-values above the significant level of 0.05, and we conclude that they have insignificant effect on mpg_highway

      • 6:Sedan < 4:Sedan (mean mpg_highway of 6:Sedan is smaller than mean mpg_highway of 4:Sedan)

      • 4:Sports < 4:Sedan (mean mpg_highway of 4:Sports is smaller than mean mpg_highway of 4:Sedan)

      • 6:Sports < 4:Sedan (mean mpg_highway of 6:Sports is smaller than mean mpg_highway of 4:Sedan)

    • Final Conclusion:

      • Sedans have higher mpg_highway than sport cars (Sedans have better fuel efficienty)

      • 4 Cylinder cars have a higher mpg_highway than 6 cylinder cars (4 cylinder cars have better fuel efficiency)

      • 4 cylinder sedans have a higher mpg_highway than 6 cylinder sedans (4 cylinder cars have better fuel efficiency)

      • 4 cylinder sedans have a higher mpg_highway than 4 cylinder sports (4 cylinder cars have better fuel efficiency)

      • 4 cylinder sedans have a higher mpg_highway than 6 cylinder sports (4 cylinder cars have better fuel efficiency)