setwd("/Users/Kristianto_97/Downloads/MSDA/Fall 2020/DA Algorithms/HW 2")
library(DescTools)
library(MASS)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:DescTools':
##
## Recode
The heartbpchol.csv data set contains continuous cholesterol (Cholesterol) and blood pressure status (BP_Status) (category: High/ Normal/ Optimal) for alive patients.
For the heartbpchol.xlsx data set, consider a one-way ANOVA model to identify differences between group cholesterol means. The normality assumption is reasonable, so you can proceed without testing normality.
Perform a one-way ANOVA for Cholesterol with BP_Status as the categorical predictor. Comment on statistical significance of BP_Status, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.
heart <- read.csv("heartbpchol.csv")
heart$Cholesterol <- as.numeric(heart$Cholesterol)
str(heart)
## 'data.frame': 541 obs. of 2 variables:
## $ Cholesterol: num 221 188 292 319 205 247 202 150 228 280 ...
## $ BP_Status : Factor w/ 3 levels "High","Normal",..: 3 1 1 2 2 1 3 2 1 1 ...
table(heart$BP_Status) ## Checking balance for "BP_Status" ###
##
## High Normal Optimal
## 229 245 67
boxplot(Cholesterol ~ BP_Status, heart,
main = "Cholesterol Distribution by BP Status",
xlab = "BP Status",
ylab = "Cholesterol",
col = "mistyrose",
horizontal = FALSE)
BP_Status is unbalanced. Each group has a different number of observations.aov.heart <- aov(Cholesterol ~ BP_Status, heart)
summary(aov.heart)
## Df Sum Sq Mean Sq F value Pr(>F)
## BP_Status 2 25211 12605 6.671 0.00137 **
## Residuals 538 1016631 1890
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BP Status has a significance effect on Cholesterol (at least one of them has a different mean of Cholesterol)lm.res <- lm(Cholesterol ~ BP_Status, heart)
summary(lm.res)$r.squared
## [1] 0.02419833
Cholesterol can be explained by BP_StatusLeveneTest(aov.heart)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.1825 0.8332
## 538
BP_Status groups have equal variancepar(mfrow=c(1,1)) # Diagnostic plot
plot(aov.heart, 2)
Comment on any significantly different cholesterol means as determined by the post-hoc test comparing all pairwise differences. Specifically explain what that tells us about differences in cholesterol levels across blood pressure status groups, like which group has the highest or lowest mean values of Cholesterol.
ScheffeTest(aov.heart)
##
## Posthoc multiple comparisons of means : Scheffe Test
## 95% family-wise confidence level
##
## $BP_Status
## diff lwr.ci upr.ci pval
## Normal-High -11.543481 -21.35092 -1.736038 0.0159 *
## Optimal-High -18.646679 -33.46702 -3.826341 0.0089 **
## Optimal-Normal -7.103198 -21.81359 7.607194 0.4958
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion:
From the Scheffe Test above, we can see that the Normal-High group and the Optimal-High group have different mean values (below significance level of 0.05) of Cholesterol, and thus we can reject Ho.
From the Scheffe Test above, we can see that the Optimal-Normal group, however, has the least significant mean value of Cholesterol (above significance level of 0.05), which means that Optimal and Normal BP_Status both have equal means
More specifically, below are some of the effects of BP_Status on Cholesterol that can be seen:
For this problem use the bupa.csv data set. Check UCI Machine Learning Repository for more information (http://archive.ics.uci.edu/ml/datasets/Liver+Disorders). The mean corpuscular volume and alkaline phosphatase are blood tests thought to be sensitive to liver disorder related to excessive alcohol consumption. We assume that normality and independence assumptions are valid.
Perform a one-way ANOVA for mcv as a function of drinkgroup. Comment on significance of the drinkgroup, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.
bupa <- read.csv("bupa.csv")
bupa$mcv <- as.numeric(bupa$mcv)
bupa$alkphos <- as.numeric(bupa$alkphos)
bupa$drinkgroup <- as.factor(bupa$drinkgroup)
str(bupa)
## 'data.frame': 345 obs. of 3 variables:
## $ mcv : num 85 85 86 91 87 98 88 88 92 90 ...
## $ alkphos : num 92 64 54 78 70 55 62 67 54 60 ...
## $ drinkgroup: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
table(bupa$drinkgroup) # Checking balance for drinkgroup
##
## 1 2 3 4 5
## 117 52 88 67 21
boxplot(mcv ~ drinkgroup, bupa,
main = "Mean Corpuscular Volume Distribution by Drink Group",
xlab = "Drink Group",
ylab = "Mean Corpuscular Volume",
col = "powderblue",
horizontal = FALSE)
drinkgroup is unbalanced. Each group has a different number of observations.aov.mcv <- aov(mcv ~ drinkgroup, bupa)
summary(aov.mcv)
## Df Sum Sq Mean Sq F value Pr(>F)
## drinkgroup 4 733 183.29 10.26 7.43e-08 ***
## Residuals 340 6073 17.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drinkgroup has a significance effect on mcv (at least one of them has a different mean of mcv)drinkgroup)lm.res_mcv <- lm(mcv ~ drinkgroup, bupa)
summary(lm.res_mcv)$r.squared
## [1] 0.1077214
mcv can be explained by drinkgroupLeveneTest(aov.mcv)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.3053 0.8744
## 340
Conclusion: Since the p-value is 0.8744 (above significance level of 0.05), we cannot reject the Ho. This means that we can assume all drinkgroups groups have equal variance.
par(mfrow=c(1,1)) # Diagnostic plot
plot(aov.mcv, 2)
Perform a one-way ANOVA for alkphos as a function of drinkgroup. Comment on statistical significance of the drinkgroup, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.
table(bupa$drinkgroup)
##
## 1 2 3 4 5
## 117 52 88 67 21
boxplot(alkphos ~ drinkgroup, bupa,
main = "Alkaline Phosphatase Distribution by Drink Group",
xlab = "Alkaline Phosphatase",
ylab = "Drink Group",
col = "#C3CFFF",
horizontal = FALSE)
drinkgroup is unbalanced. Each group has a different number of observations.aov.alkphos <- aov(alkphos ~ drinkgroup, bupa)
summary(aov.alkphos)
## Df Sum Sq Mean Sq F value Pr(>F)
## drinkgroup 4 4946 1236.4 3.792 0.00495 **
## Residuals 340 110858 326.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drinkgroup has a significance effect on alkphos (at least one of them has a different mean of alkphos)drinkgroup)lm.alkphos <- lm(alkphos ~ drinkgroup, bupa)
summary(lm.alkphos)$r.squared
## [1] 0.04270721
alkphos can be explained by drinkgroupLeveneTest(aov.alkphos)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.8089 0.5201
## 340
Conclusion: Since the p-value is 0.5201 (above significance level of 0.05), we cannot reject the Ho. This means that we can assume all drinkgroup groups have equal variance.
par(mfrow=c(1,1)) # Diagnostic plot
plot(aov.alkphos, 2)
Perform post-hoc tests for models in a) and b). Comment on any similarities or differences you observe from their results.
mcv
ScheffeTest(aov.mcv)
##
## Posthoc multiple comparisons of means : Scheffe Test
## 95% family-wise confidence level
##
## $drinkgroup
## diff lwr.ci upr.ci pval
## 2-1 1.241452991 -0.94020481 3.423111 0.5410
## 3-1 0.938131313 -0.90892674 2.785189 0.6495
## 4-1 3.744610282 1.73913894 5.750082 1.9e-06 ***
## 5-1 3.746031746 0.64379565 6.848268 0.0081 **
## 3-2 -0.303321678 -2.59291786 1.986275 0.9966
## 4-2 2.503157290 0.08395442 4.922360 0.0380 *
## 5-2 2.504578755 -0.87987039 5.889028 0.2646
## 4-3 2.806478969 0.68408993 4.928868 0.0025 **
## 5-3 2.807900433 -0.37116998 5.986971 0.1151
## 5-4 0.001421464 -3.27222796 3.275071 1.0000
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion:
From the Scheffe Test above, we can see that the groups 4-1, 5-1, 4-2, and 4-3 have different mean values (below significance level of 0.05) of mcv, and thus we can reject the null.
From the Scheffe Test above, we can see that groups 2-1, 3-1, 3-2, 5-2, 5-3, and 5-4 do NOT have different mean values (above significance level of 0.05) of mcv, and thus we do not reject the null
More specifically, below are some of the effects of drinkgroup on mcv that can be seen:
mcv of 4 is greater than mean mcv of 1)mcv of 5 is greater than mean mcv of 1)mcv of 4 is greater than mean mcv of 2)mcv of 4 is greater than mean mcv of 3)alkphos
ScheffeTest(aov.alkphos)
##
## Posthoc multiple comparisons of means : Scheffe Test
## 95% family-wise confidence level
##
## $drinkgroup
## diff lwr.ci upr.ci pval
## 2-1 -2.645299 -11.9663647 6.675766 0.9419
## 3-1 -4.056138 -11.9476367 3.835360 0.6389
## 4-1 -1.148743 -9.7170578 7.419571 0.9965
## 5-1 12.572650 -0.6815582 25.826857 0.0734 .
## 3-2 -1.410839 -11.1930681 8.371390 0.9953
## 4-2 1.496556 -8.8394138 11.832525 0.9952
## 5-2 15.217949 0.7579944 29.677903 0.0329 *
## 4-3 2.907395 -6.1604467 11.975236 0.9117
## 5-3 16.628788 3.0463078 30.211268 0.0069 **
## 5-4 13.721393 -0.2651729 27.707959 0.0578 .
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion:
From the Scheffe Test above, we can see that groups 5-2 and 5-3 have significantly different mean values (below significance level of 0.05) of alkphos, and thus we can reject the null
From the Scheffe Test above, we can see that groups 2-1, 3-1, 4-1, 5-1, 3-2, 4-2, 4-3 and 5-4 do NOT have significantly different mean values of alkphos (above significance level of 0.05), and thus we cannot reject the null
More specifically, below are some of the effects of drinkgroup on alkphos that can be seen:
alkphos of 5 is higher than mean alkphos of 3)alkphos of 5 is higher than mean alkphos of 2)drinkgroup pairs 2-1, 3-1, 4-1, 5-1, 3-2, 4-2, 4-3, 5-4Similarities:
mcv and alkphosThe psychology department at a hypothetical university has been accused of underpaying female faculty members. The data represent salary (in thousands of dollars) for all 22 professors in the department. This problem is from Maxwell and Delaney (2004).
psych <- read.csv("psych.csv")
psych$salary <- as.numeric(psych$salary)
str(psych)
## 'data.frame': 22 obs. of 3 variables:
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 2 2 2 2 ...
## $ rank : Factor w/ 2 levels "Assist","Assoc": 1 1 1 1 1 1 1 1 1 1 ...
## $ salary: num 33 36 35 38 42 37 39 38 40 44 ...
Fit a two-way ANOVA model including sex (F, M) and rank (Assistant, Associate) the interaction term. What do the Type 1 and Type 3 sums of squares tell us about significance of effects? Is the interaction between sex and rank significant? Also comment on the variation explained by the model.
aov.psych1 <- aov(salary ~ sex * rank, psych)
summary(aov.psych1)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 155.15 155.15 17.007 0.000637 ***
## rank 1 169.82 169.82 18.616 0.000417 ***
## sex:rank 1 0.63 0.63 0.069 0.795101
## Residuals 18 164.21 9.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov.psych2 <- aov(salary ~ rank * sex, psych)
summary(aov.psych2)
## Df Sum Sq Mean Sq F value Pr(>F)
## rank 1 252.22 252.22 27.647 5.33e-05 ***
## sex 1 72.76 72.76 7.975 0.0112 *
## rank:sex 1 0.63 0.63 0.069 0.7951
## Residuals 18 164.21 9.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
Since both groups have low p-values (below the significance level of 0.05), we reject the null hypothesis and conclude that these groups have a significance effect on salary.
We can see interaction between sex and rank result in a much higher p-value (above the significance level of 0.05), which means that the interaction of sex and rank doesn’t have significance effect on salary ( don’t reject the null hypothesis)
Anova(aov.psych1, type = 3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 8140.2 1 892.2994 < 2e-16 ***
## sex 28.0 1 3.0711 0.09671 .
## rank 70.4 1 7.7189 0.01240 *
## sex:rank 0.6 1 0.0695 0.79510
## Residuals 164.2 18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
Since rank is the only group that has a p-value below the significance level of 0.05, we reject the null hypothesis and conclude that the rank group has a significant effect on salary
sex and interaction groups have p-values above the significance level of 0.05, which means we cannot reject Ho and conclude that these groups don’t have a sig effect on salary
sex and rank)lm.psych1 <- lm(salary ~ sex*rank, psych)
summary(lm.psych1)$r.squared
## [1] 0.6647566
Observation: 66.4% of salary variation can be explained by the model (sex and rank)
Refit the model without the interaction term. Comment on the significance of effects and variation explained. Report and interpret the Type 1 and Type 3 tests of the main effects. Are the main effects of rank and sex significant?
aov.psych3 <- aov(salary ~ rank + sex, psych)
summary(aov.psych3)
## Df Sum Sq Mean Sq F value Pr(>F)
## rank 1 252.22 252.22 29.071 3.34e-05 ***
## sex 1 72.76 72.76 8.386 0.00926 **
## Residuals 19 164.84 8.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov.psych4 <- aov(salary ~ sex + rank, psych)
summary(aov.psych4)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 155.2 155.15 17.88 0.000454 ***
## rank 1 169.8 169.82 19.57 0.000291 ***
## Residuals 19 164.8 8.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
salary since their p-values are below the significance level of 0.05. Thus, we reject the null hypothesis, which means that at least one group in sex and rank have different mean values.Anova(aov.psych3, type = 3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 10227.6 1 1178.8469 < 2.2e-16 ***
## rank 169.8 1 19.5743 0.0002912 ***
## sex 72.8 1 8.3862 0.0092618 **
## Residuals 164.8 19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
salary since their p-values are below the significance level of 0.05. Thus, we reject the null hypothesis, which means that at least one group in sex and rank have different mean values, which means that at least one group in sex and rank have different mean values.lm.psych3 <- lm(salary ~ rank + sex, psych)
summary(lm.psych3)$r.squared
## [1] 0.6634627
Observation: Only 66.3% of salary variation can be explained by the model (sex and rank)
Obtain model diagnostics to validate your Normality assumptions.
par(mfrow=c(1,1)) # Diagnostic plot
plot(aov.psych3, 2)
Choose a final model based on your results from parts (a) and (b). Comment on any significant group differences through the post-hoc test. State the differences in salary across different main effect groups and interaction (if included) between them.
Based on the result (a) and (b), we can conclude that interaction effect does NOT exist. Thus, the two-way ANOVA - Type 3 is selected to see unique contribution of each categorical variable. As every effect is adjusted for all other effects, this model is the best option.
TukeyHSD(aov.psych4)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = salary ~ sex + rank, data = psych)
##
## $sex
## diff lwr upr p adj
## M-F 5.333333 2.693648 7.973019 0.0004544
##
## $rank
## diff lwr upr p adj
## Assoc-Assist 5.377778 2.738092 8.017463 0.0004193
Conclusion:
sex
sex has a significant effect on salary.salary of Male is higher than mean salary of Female)rank
rank also has a significant effect onsalary.
salary of an Associate is higher than mean salary of an Assistant)Use the cars_new.csv. See HW1 for detailed information of variables.
Start with a three-way main effects ANOVA and choose the best main effects ANOVA model for mpg_highway as a function of cylinders, origin, and type for the cars in this set. Comment on which terms should be kept in a model for mpg_highway and why based on Type 3 SS. For the model with just predictors you decide to keep, comment on the significant effects in the model and comment on how much variation in highway fuel efficiency the model describes.
## Importing data and converting formats ###
cars <- read.csv("cars_new.csv")
cars$mpg_highway <- as.numeric(cars$mpg_highway)
cars$cylinders <- as.factor(cars$cylinders)
cars$origin <- as.factor(cars$origin)
cars$type <- as.factor(cars$type)
str(cars)
## 'data.frame': 180 obs. of 4 variables:
## $ type : Factor w/ 2 levels "Sedan","Sports": 1 1 1 1 1 2 1 1 1 1 ...
## $ origin : Factor w/ 2 levels "Asia","USA": 1 1 1 1 1 1 2 2 2 2 ...
## $ cylinders : Factor w/ 2 levels "4","6": 1 1 2 2 2 2 2 2 2 2 ...
## $ mpg_highway: num 31 29 28 24 24 24 30 29 30 28 ...
aov.cars <- aov(mpg_highway ~ cylinders + origin + type, cars)
Anova(aov.cars, type=3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 69548 1 6501.6715 < 2e-16 ***
## cylinders 1453 1 135.8499 < 2e-16 ***
## origin 1 1 0.0786 0.77948
## type 108 1 10.1018 0.00175 **
## Residuals 1883 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
cylinders and type have p-values below the significant level of 0.05. This means that we reject the null hypothesis and conclude that these two groups have a significant effect on mpg_highway.origin has a p-value higher than than significant level of 0.05, we can say that we don’t reject the the null hypothesis and conclude that this group dosn’t have a significant effect on mpg_highway. Thus, we will remove origin from the model since it has insignificant effect.cylinders and type)lm.cars <- lm(mpg_highway ~ cylinders + type, cars)
summary(lm.cars)$r.squared
## [1] 0.4572163
Observation: 45.7% of the variation of mpg_highway can be explained by cylinders and type
Starting with main effects chosen in part (a), find your best ANOVA model by adding in any additional interaction terms that will significantly improve the model. For your final model, comment on the significant effects and variation explained by the model.
aov.cars2 <- aov(mpg_highway ~ cylinders + type + cylinders * type, cars)
Anova(aov.cars2, type = 3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 85471 1 8358.838 < 2.2e-16 ***
## cylinders 1558 1 152.397 < 2.2e-16 ***
## type 198 1 19.392 1.844e-05 ***
## cylinders:type 84 1 8.201 0.004696 **
## Residuals 1800 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
*The best model is two-way ANOVA - Type 3 with the implementation of cylinders, type and its interaction. All three of them have p-values lower than the significance level of 0.05, which means that we can reject the null hypothesis and conclude that these have a significant effect on mpg_highway
lm.cars2 <- lm(mpg_highway ~ cylinders + type + cylinders * type, cars)
summary(lm.cars2)$r.squared
## [1] 0.4813821
Observation: 48.1% of mpg_highway can be explained by cylinders, type, and its interaction.
par(mfrow=c(1,1)) # Diagnostic plot
plot(aov.cars2, 2)
Comment on any significant group differences through the post-hoc test. What does this tell us about fuel efficiency differences across cylinders, origin, or type groups? See Hint in Exercise 3.
TukeyHSD(aov.cars2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mpg_highway ~ cylinders + type + cylinders * type, data = cars)
##
## $cylinders
## diff lwr upr p adj
## 6-4 -5.722662 -6.664343 -4.780981 0
##
## $type
## diff lwr upr p adj
## Sports-Sedan -2.817931 -4.470787 -1.165075 0.0009407
##
## $`cylinders:type`
## diff lwr upr p adj
## 6:Sedan-4:Sedan -6.1723315 -7.469178 -4.875485 0.0000000
## 4:Sports-4:Sedan -5.2275641 -8.306639 -2.148489 0.0001079
## 6:Sports-4:Sedan -6.6025641 -9.681639 -3.523489 0.0000006
## 4:Sports-6:Sedan 0.9447674 -2.120956 4.010491 0.8546517
## 6:Sports-6:Sedan -0.4302326 -3.495956 2.635491 0.9834567
## 6:Sports-4:Sports -1.3750000 -5.521993 2.771993 0.8253946
Conclusion:
type:
Since the p-value for type is smaller than the significance level of 0.05, we reject the null hypothesis and conclude that the type has a significant effect on mpg_highway
Sports < Sedan (mean mpg_highway of Sedan is higher than mean mpg_highway of Sports)
cylinders:
Since the p-value for cylinder is smaller than the significance level 0.05, we reject the null hypothesis and conclude that the cylinders has a sig effect on mpg_highway
6 < 4 (mean mpg_highway of 6 is lower than mean mpg_highway of 4)
cylinders and type interaction
Because the p-value is below the significance level of 0.05, type groups 6:Sedan-4:Sedan, 4:Sports-4:Sedan, and 6:Sports-4:Sedan have a significant effect on mpg_highway
Groups 4:Sports-6:Sedan, 6:Sports-6:Sedan, and 6:Sports-4:Sports all have p-values above the significant level of 0.05, and we conclude that they have insignificant effect on mpg_highway
6:Sedan < 4:Sedan (mean mpg_highway of 6:Sedan is smaller than mean mpg_highway of 4:Sedan)
4:Sports < 4:Sedan (mean mpg_highway of 4:Sports is smaller than mean mpg_highway of 4:Sedan)
6:Sports < 4:Sedan (mean mpg_highway of 6:Sports is smaller than mean mpg_highway of 4:Sedan)
Final Conclusion:
Sedans have higher mpg_highway than sport cars (Sedans have better fuel efficienty)
4 Cylinder cars have a higher mpg_highway than 6 cylinder cars (4 cylinder cars have better fuel efficiency)
4 cylinder sedans have a higher mpg_highway than 6 cylinder sedans (4 cylinder cars have better fuel efficiency)
4 cylinder sedans have a higher mpg_highway than 4 cylinder sports (4 cylinder cars have better fuel efficiency)
4 cylinder sedans have a higher mpg_highway than 6 cylinder sports (4 cylinder cars have better fuel efficiency)