library(DescTools)
library(MASS)
library(car)
The heartbpchol.csv data set contains continuous cholesterol (Cholesterol) and blood pressure status (BP_Status) (category: High/ Normal/ Optimal) for alive patients. For the heartbpchol.xlsx data set, consider a one-way ANOVA model to identify differences between group cholesterol means. The normality assumption is reasonable, so you can proceed without testing normality.
Perform a one-way ANOVA for Cholesterol with BP_Status as the categorical predictor. Comment on statistical significance of BP_Status, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.
Importing data
heart <- read.csv("heartbpchol.csv")
str(heart)
## 'data.frame': 541 obs. of 2 variables:
## $ Cholesterol: int 221 188 292 319 205 247 202 150 228 280 ...
## $ BP_Status : chr "Optimal" "High" "High" "Normal" ...
Since, the normality assumptions is reasonable, we will proceed with the one way ANOVA test
One-way ANOVA (Cholesterol ~ BP_Status)
aov.heart <- aov(Cholesterol ~ BP_Status, heart)
summary(aov.heart)
## Df Sum Sq Mean Sq F value Pr(>F)
## BP_Status 2 25211 12605 6.671 0.00137 **
## Residuals 538 1016631 1890
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: Since p-value is 0.00317 (below significance level of 0.05), we have enough evidence to reject Ho. This means that at least one observation of BP Status has a significant effect on Cholesterol (at least one of them has a different mean of Cholesterol)
R-Square (Variation of response variable explained by BP_Status)
lm.res <- lm(Cholesterol ~ BP_Status, heart)
summary(lm.res)$r.squared
## [1] 0.02419833
Observation: Only 2.4% of the variation of Cholesterol can be explained by the model. This means that the remaining 97.58% is from the Sum of Squares of residuals, and the main effect of the mean of the Cholesterol, doesn’t come from the Blood Pressure, but from the Residuals.
Levene Test for Equal Variance Assumption
LeveneTest(aov.heart)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.1825 0.8332
## 538
Conclusion: Since the p-value is 0.8332 (above significance level of 0.05), it’s safe to say that we do not have enough evidence to reject the Ho. This means that we can assume all BP_Status groups have equal variance.
Normality Check
par(mfrow=c(1,1)) # Diagnostic plot
plot(aov.heart, 2)
Observation: From the analysis of QQPlot above, we can conclude that the variance is not much different between the groups, hence the variance equality assumption is reasonable.
Comment on any significantly different cholesterol means as determined by the post-hoc test comparing all pairwise differences. Specifically explain what that tells us about differences in cholesterol levels across blood pressure status groups, like which group has the highest or lowest mean values of Cholesterol.
ScheffeTest(aov.heart)
##
## Posthoc multiple comparisons of means: Scheffe Test
## 95% family-wise confidence level
##
## $BP_Status
## diff lwr.ci upr.ci pval
## Normal-High -11.543481 -21.35092 -1.736038 0.0159 *
## Optimal-High -18.646679 -33.46702 -3.826341 0.0089 **
## Optimal-Normal -7.103198 -21.81359 7.607194 0.4958
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion:
More specifically, below are some of the effects of BP_Status on Cholesterol that can be seen:
For this problem use the bupa.csv data set. Check UCI Machine Learning Repository for more information (http://archive.ics.uci.edu/ml/datasets/Liver+Disorders). The mean corpuscular volume and alkaline phosphate are blood tests thought to be sensitive to liver disorder related to excessive alcohol consumption. We assume that normality and independence assumptions are valid.
Perform a one-way ANOVA for mcv as a function of drinkgroup. Comment on significance of the drinkgroup, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.
Importing data
bupa <- read.csv("bupa.csv")
bupa$drinkgroup <- as.factor(bupa$drinkgroup)
str(bupa)
## 'data.frame': 345 obs. of 3 variables:
## $ mcv : int 85 85 86 91 87 98 88 88 92 90 ...
## $ alkphos : int 92 64 54 78 70 55 62 67 54 60 ...
## $ drinkgroup: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
One-way ANOVA (mcv ~ drinkgroup)
aov.mcv <- aov(mcv ~ drinkgroup, bupa)
summary(aov.mcv)
## Df Sum Sq Mean Sq F value Pr(>F)
## drinkgroup 4 733 183.29 10.26 7.43e-08 ***
## Residuals 340 6073 17.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: Since p-value is 7.43e-08 (below significance level of 0.05), it’s safe to say that we can reject the Ho. This means that at least one observation of drinkgroup has a significance effect on mcv (at least one of them has a different mean of mcv)
R-Square (Variation of response variable explained by drinkgroup)
lm.res_mcv <- lm(mcv ~ drinkgroup, bupa)
summary(lm.res_mcv)$r.squared
## [1] 0.1077214
Observation: Only 10.78% of the variation of mcv can be explained by the drinkgroup
Levene Test for Equal Variance Assumption
LeveneTest(aov.mcv)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.3053 0.8744
## 340
Conclusion: Since the p-value is 0.8744 (above significance level of 0.05), we do not have enough evidence to reject the Ho. This means that we can assume all drinkgroups groups have equal variance.
Normality Check
par(mfrow=c(1,1)) # Diagnostic plot
plot(aov.mcv, 2)
Observation: From the analysis of QQPlot, the variance in the groups is equal, hence the variance equality assumption is reasonable.
Perform a one-way ANOVA for alkphos as a function of drinkgroup. Comment on statistical significance of the drinkgroup, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.
One-way ANOVA (alkphos ~ drinkgroup)
aov.alkphos <- aov(alkphos ~ drinkgroup, bupa)
summary(aov.alkphos)
## Df Sum Sq Mean Sq F value Pr(>F)
## drinkgroup 4 4946 1236.4 3.792 0.00495 **
## Residuals 340 110858 326.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: Since p-value is 0.00495 (below significance level of 0.05), we have enough evidence to reject the Ho. This means that at least one observation of drinkgroup has a significance effect on alkphos (at least one of them has a different mean of alkphos)
R-Square (Variation of response variable explained by drinkgroup)
lm.alkphos <- lm(alkphos ~ drinkgroup, bupa)
summary(lm.alkphos)$r.squared
## [1] 0.04270721
Observation: Only 4.27% of variation alkphos can be explained by the drinkgroup.This means that the remaining 95.73% is from the Sum of Squares of residuals, and the main effect of the mean of the alkphos, doesn’t come from the drinkgroup, but from the Residuals.
Levene Test for Equal Variance Assumption
LeveneTest(aov.alkphos)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.8089 0.5201
## 340
Conclusion: Since the p-value is 0.5201 (above significance level of 0.05), we cannot reject the Ho. This means that we can assume all drinkgroup groups have equal variance.
Normality Check
par(mfrow=c(1,1)) # Diagnostic plot
plot(aov.alkphos, 2)
Observation: From the analysis of QQPlot, we see that the variance in the groups is not that different, hence the variance equality assumption is reasonable.
Perform post-hoc tests for models in a) and b). Comment on any similarities or differences you observe from their results.
mcv:
ScheffeTest(aov.mcv)
##
## Posthoc multiple comparisons of means: Scheffe Test
## 95% family-wise confidence level
##
## $drinkgroup
## diff lwr.ci upr.ci pval
## 2-1 1.241452991 -0.94020481 3.423111 0.5410
## 3-1 0.938131313 -0.90892674 2.785189 0.6495
## 4-1 3.744610282 1.73913894 5.750082 1.9e-06 ***
## 5-1 3.746031746 0.64379565 6.848268 0.0081 **
## 3-2 -0.303321678 -2.59291786 1.986275 0.9966
## 4-2 2.503157290 0.08395442 4.922360 0.0380 *
## 5-2 2.504578755 -0.87987039 5.889028 0.2646
## 4-3 2.806478969 0.68408993 4.928868 0.0025 **
## 5-3 2.807900433 -0.37116998 5.986971 0.1151
## 5-4 0.001421464 -3.27222796 3.275071 1.0000
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion:
From the Scheffe Test above, we can see that the groups 4-1, 5-1, 4-2, and 4-3 have different mean values (below significance level of 0.05) of mcv, and thus we have enough evidence to reject the null. We can also see that groups 2-1, 3-1, 3-2, 5-2, 5-3, and 5-4 do not have different mean values (above significance level of 0.05) of mcv, and thus we do not have enough evidence to reject the null.
Below are some of the effects of drinkgroup on mcv that can be seen:
alkphos:
ScheffeTest(aov.alkphos)
##
## Posthoc multiple comparisons of means: Scheffe Test
## 95% family-wise confidence level
##
## $drinkgroup
## diff lwr.ci upr.ci pval
## 2-1 -2.645299 -11.9663647 6.675766 0.9419
## 3-1 -4.056138 -11.9476367 3.835360 0.6389
## 4-1 -1.148743 -9.7170578 7.419571 0.9965
## 5-1 12.572650 -0.6815582 25.826857 0.0734 .
## 3-2 -1.410839 -11.1930681 8.371390 0.9953
## 4-2 1.496556 -8.8394138 11.832525 0.9952
## 5-2 15.217949 0.7579944 29.677903 0.0329 *
## 4-3 2.907395 -6.1604467 11.975236 0.9117
## 5-3 16.628788 3.0463078 30.211268 0.0069 **
## 5-4 13.721393 -0.2651729 27.707959 0.0578 .
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion:
From the Scheffe Test above, we can see that groups 5-2 and 5-3 have significantly different mean values (below significance level of 0.05) of alkphos, and thus we have enough evidence to reject the null. The groups 5-1 and 5-4 are only a little higher than our alpha (0.05) which shows that they might be statistically significantly different as well. We can also see that groups 2-1, 3-1, 4-1, 3-2, 4-2, 4-3 and do not have significantly different mean values of alkphos (above significance level of 0.05), and thus we do not have enough evidence to reject the null.
Below are some of the effects of drinkgroup on alkphos that can be seen:
Similarities:
Group pairs 2-1, 3-1, 3-2 and 5-4 have equal mean values, since their p-values are above the significance level of 0.05. This means that they don’t have an effect on either mcv and alkphos.
The psychology department at a hypothetical university has been accused of underpaying female faculty members. The data represent salary (in thousands of dollars) for all 22 professors in the department. This problem is from Maxwell and Delaney (2004).
Fit a two-way ANOVA model including sex (F, M) and rank (Assistant, Associate) the interaction term. What do the Type 1 and Type 3 sums of squares tell us about significance of effects? Is the interaction between sex and rank significant? Also comment on the variation explained by the model.
Importing data
psych <- read.csv("psych.csv")
str(psych)
## 'data.frame': 22 obs. of 3 variables:
## $ sex : chr "F" "F" "F" "F" ...
## $ rank : chr "Assist" "Assist" "Assist" "Assist" ...
## $ salary: int 33 36 35 38 42 37 39 38 40 44 ...
Two-way ANOVA (Type 1)
aov.psych1 <- aov(salary ~ sex * rank, psych)
summary(aov.psych1)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 155.15 155.15 17.007 0.000637 ***
## rank 1 169.82 169.82 18.616 0.000417 ***
## sex:rank 1 0.63 0.63 0.069 0.795101
## Residuals 18 164.21 9.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Two-way ANOVA (Type 1)
aov.psych2 <- aov(salary ~ rank * sex, psych)
summary(aov.psych2)
## Df Sum Sq Mean Sq F value Pr(>F)
## rank 1 252.22 252.22 27.647 5.33e-05 ***
## sex 1 72.76 72.76 7.975 0.0112 *
## rank:sex 1 0.63 0.63 0.069 0.7951
## Residuals 18 164.21 9.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
Since both groups have low p-values (below the significance level of 0.05), we have enough evidence to reject the null hypothesis and conclude that these groups have a significant effect on salary.
We can see interaction between sex and rank result has a much higher p-value (above the significance level of 0.05), which means that the interaction of sex and rank doesn’t have significant effect on salary (we do not have enough evidence to reject the null hypothesis)
Two-way ANOVA (Type 3)
Anova(aov.psych1, type = 3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 8140.2 1 892.2994 < 2e-16 ***
## sex 28.0 1 3.0711 0.09671 .
## rank 70.4 1 7.7189 0.01240 *
## sex:rank 0.6 1 0.0695 0.79510
## Residuals 164.2 18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
Since rank is the only group that has a p-value below the significance level of 0.05, we have enough evidence to reject the null hypothesis and conclude that the rank group has a significant effect on salary.
sex and interaction groups have p-values above the significance level of 0.05, which means we do not have enough evidence to reject Ho and conclude that these groups don’t have a significant effect on salary.
Variation Explained by the model (sex and rank)
lm.psych1 <- lm(salary ~ sex*rank, psych)
summary(lm.psych1)$r.squared
## [1] 0.6647566
Observation: 66.47% of salary variation can be explained by the model (sex and rank)
Refit the model without the interaction term. Comment on the significance of effects and variation explained. Report and interpret the Type 1 and Type 3 tests of the main effects. Are the main effects of rank and sex significant?
Two-way ANOVA (Type 1)
aov.psych3 <- aov(salary ~ rank + sex, psych)
summary(aov.psych3)
## Df Sum Sq Mean Sq F value Pr(>F)
## rank 1 252.22 252.22 29.071 3.34e-05 ***
## sex 1 72.76 72.76 8.386 0.00926 **
## Residuals 19 164.84 8.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Two-way ANOVA (Type 1)
aov.psych4 <- aov(salary ~ sex + rank, psych)
summary(aov.psych4)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 155.2 155.15 17.88 0.000454 ***
## rank 1 169.8 169.82 19.57 0.000291 ***
## Residuals 19 164.8 8.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
Without any interaction, we conclude that both groups have significant effect on salary since their p-values are below the significance level of 0.05. Thus, we have enough evidence to reject the null hypothesis, which means that at least one group in sex and rank have different mean values.
Two-way ANOVA (Type 3)
Anova(aov.psych3, type = 3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 10227.6 1 1178.8469 < 2.2e-16 ***
## rank 169.8 1 19.5743 0.0002912 ***
## sex 72.8 1 8.3862 0.0092618 **
## Residuals 164.8 19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
Without any interaction, we conclude that both groups have significant effect on salary since their p-values are below the significance level of 0.05. Thus, we have enough evidence to reject the null hypothesis, which means that at least one group in sex and rank have different mean values.
Variance Explained by the Model
lm.psych3 <- lm(salary ~ rank + sex, psych)
summary(lm.psych3)$r.squared
## [1] 0.6634627
Observation: Only 66.34% of salary variation can be explained by the model (sex and rank)
Obtain model diagnostics to validate your Normality assumptions.
Normality Check
par(mfrow=c(1,1)) # Diagnostic plot
plot(aov.psych3, 2)
Observation: From the analysis of QQPlot, we can conclude that a normal distribution is visible.
Choose a final model based on your results from parts (a) and (b). Comment on any significant group differences through the post-hoc test. State the differences in salary across different main effect groups and interaction (if included) between them.
Based on the result (a) and (b), we can conclude that interaction effect does NOT exist. Thus, the two-way ANOVA - Type 3 is selected to see unique contribution of each categorical variable. As every effect is adjusted for all other effects, this model is the best option.
Post-hoc Test
TukeyHSD(aov.psych4)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = salary ~ sex + rank, data = psych)
##
## $sex
## diff lwr upr p adj
## M-F 5.333333 2.693648 7.973019 0.0004544
##
## $rank
## diff lwr upr p adj
## Assoc-Assist 5.377778 2.738092 8.017463 0.0004193
Conclusion:
sex - Because the p-value is below the significance level of 0.05, it’s apparent that sex has a significant effect on salary. M > F (mean salary of Male is higher than mean salary of Female)
rank - Because the p-value is below the significance level of 0.05, it’s apparent that rank also has a significant effect on salary.
Associate > Assistant (Sum of square difference = 5.3777784. Mean salary of an Associate is higher than mean salary of an Assistant)
Use the cars_new.csv. See HW1 for detailed information of variables
Start with a three-way main effects ANOVA and choose the best main effects ANOVA model for mpg_highway as a function of cylinders, origin, and type for the cars in this set. Comment on which terms should be kept in a model for mpg_highway and why based on Type 3 SS. For the model with just predictors you decide to keep, comment on the significant effects in the model and comment on how much variation in highway fuel efficiency the model describes.
Importing data
cars <- read.csv("cars_new.csv")
cars$cylinders <- as.factor(cars$cylinders)
str(cars)
## 'data.frame': 180 obs. of 4 variables:
## $ type : chr "Sedan" "Sedan" "Sedan" "Sedan" ...
## $ origin : chr "Asia" "Asia" "Asia" "Asia" ...
## $ cylinders : Factor w/ 2 levels "4","6": 1 1 2 2 2 2 2 2 2 2 ...
## $ mpg_highway: int 31 29 28 24 24 24 30 29 30 28 ...
Three-way ANOVA (Type 3)
aov.cars <- aov(mpg_highway ~ cylinders + origin + type, cars)
Anova(aov.cars, type=3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 69548 1 6501.6715 < 2e-16 ***
## cylinders 1453 1 135.8499 < 2e-16 ***
## origin 1 1 0.0786 0.77948
## type 108 1 10.1018 0.00175 **
## Residuals 1883 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
From the ANOVA table above, we can see that cylinders and type have p-values below the significant level of 0.05. This means that we have enough evidence to reject the null hypothesis and conclude that these two groups have a significant effect on mpg_highway.
Additionally, at least one group has different mean values
Since origin has a p-value higher than than significant level of 0.05, we can say that we do not have enough evidence to reject the the null hypothesis and conclude that this group doesn’t have a significant effect on mpg_highway. Thus, we will remove origin from the model since it has insignificant effect.
Variation Explained by the Model (cylinders and type)
lm.cars <- lm(mpg_highway ~ cylinders + type, cars)
summary(lm.cars)$r.squared
## [1] 0.4572163
Observation: 45.72% of the variation of mpg_highway can be explained by the model
Starting with main effects chosen in part (a), find your best ANOVA model by adding in any additional interaction terms that will significantly improve the model. For your final model, comment on the significant effects and variation explained by the model.
aov.cars2 <- aov(mpg_highway ~ cylinders + type + cylinders * type, cars)
Anova(aov.cars2, type = 3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 85471 1 8358.838 < 2.2e-16 ***
## cylinders 1558 1 152.397 < 2.2e-16 ***
## type 198 1 19.392 1.844e-05 ***
## cylinders:type 84 1 8.201 0.004696 **
## Residuals 1800 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation:
The best model is two-way ANOVA - Type 3 with the implementation of cylinders, type and its interaction. All three of them have p-values lower than the significance level of 0.05, which means that we have enough evidence to reject the null hypothesis and conclude that these have a significant effect on mpg_highway.
Variation Explained by the Model
lm.cars2 <- lm(mpg_highway ~ cylinders + type + cylinders * type, cars)
summary(lm.cars2)$r.squared
## [1] 0.4813821
Observation: 48.13% of mpg_highway can be explained by cylinders, type, and its interaction.
Normality Check
par(mfrow=c(1,1)) # Diagnostic plot
plot(aov.cars2, 2)
Observation: From the analysis of QQPlot above, we can conclude that a normal distribution is visible
Comment on any significant group differences through the post-hoc test. What does this tell us about fuel efficiency differences across cylinders, origin, or type groups? See Hint in Exercise 3.
Post-hoc Test
TukeyHSD(aov.cars2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mpg_highway ~ cylinders + type + cylinders * type, data = cars)
##
## $cylinders
## diff lwr upr p adj
## 6-4 -5.722662 -6.664343 -4.780981 0
##
## $type
## diff lwr upr p adj
## Sports-Sedan -2.817931 -4.470787 -1.165075 0.0009407
##
## $`cylinders:type`
## diff lwr upr p adj
## 6:Sedan-4:Sedan -6.1723315 -7.469178 -4.875485 0.0000000
## 4:Sports-4:Sedan -5.2275641 -8.306639 -2.148489 0.0001079
## 6:Sports-4:Sedan -6.6025641 -9.681639 -3.523489 0.0000006
## 4:Sports-6:Sedan 0.9447674 -2.120956 4.010491 0.8546517
## 6:Sports-6:Sedan -0.4302326 -3.495956 2.635491 0.9834567
## 6:Sports-4:Sports -1.3750000 -5.521993 2.771993 0.8253946
Conclusion:
type: Since the p-value for type is smaller than the significance level of 0.05, we have enough evidence to reject the null hypothesis and conclude that the type has a significant effect on mpg_highway. Also, Sports < Sedan (mean mpg_highway of Sedan is higher than mean mpg_highway of Sports)
cylinders: Since the p-value for cylinder is smaller than the significance level 0.05, we have enough evidence to reject the null hypothesis and conclude that the cylinders has a significant effect on mpg_highway. Also, 6 < 4 (mean mpg_highway of 6 is lower than mean mpg_highway of 4)
cylinders and type interaction:
We can finally conclude that: