Due: Saturday, Sep 30, at 11:59 pm
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(fBasics)
## Warning: package 'fBasics' was built under R version 4.3.2
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:fBasics':
##
## densityPlot
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(DescTools)
## Warning: package 'DescTools' was built under R version 4.3.2
##
## Attaching package: 'DescTools'
##
## The following object is masked from 'package:car':
##
## Recode
library(dplyr)
Data: heartbpchol.csv for Exercise 1, bupa.csv for Exercise 2, psych.csv for Exercise 3, cars_new.csv for Exercise 4.
Use the significance level of .05.
.bordered{
border-style: solid;
border-color: teal;
padding: 5px;
background-color: #DCDCDC;
}
The heartbpchol.csv data set contains continuous cholesterol (Cholesterol) and blood pressure status (BP_Status) (category: High/ Normal/ Optimal) for alive patients.
For the heartbpchol data set, consider a one-way ANOVA model to identify differences between group cholesterol means. The normality assumption is reasonable, so you can proceed without testing normality.
(A) Perform a one-way ANOVA for Cholesterol with BP_Status as the categorical predictor. Comment on statistical significance of BP_Status, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.
The analysis of variance table has a p-value of 0.00137 for BP_Status, so BP_Status is statistically significant. The r-squared value of 0.0242, however, is very small. Only 2.42% of the variation in cholesterol can be described by the blood pressure levels. The Levene’s test for homogeneity of variance gives p-value of 0.8332 and it is highly insignificant. Thus, an equal variance assumption is fine here.
BP=read_csv("heartbpchol.csv")
## Rows: 541 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): BP_Status
## dbl (1): Cholesterol
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(BP)
## spc_tbl_ [541 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Cholesterol: num [1:541] 221 188 292 319 205 247 202 150 228 280 ...
## $ BP_Status : chr [1:541] "Optimal" "High" "High" "Normal" ...
## - attr(*, "spec")=
## .. cols(
## .. Cholesterol = col_double(),
## .. BP_Status = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
BP$Cholesterol= as.numeric(BP$Cholesterol)
str(BP)
## spc_tbl_ [541 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Cholesterol: num [1:541] 221 188 292 319 205 247 202 150 228 280 ...
## $ BP_Status : chr [1:541] "Optimal" "High" "High" "Normal" ...
## - attr(*, "spec")=
## .. cols(
## .. Cholesterol = col_double(),
## .. BP_Status = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
table(BP$BP_Status)
##
## High Normal Optimal
## 229 245 67
aov.bp <- aov(Cholesterol ~ BP_Status, BP)
summary(aov.bp)
## Df Sum Sq Mean Sq F value Pr(>F)
## BP_Status 2 25211 12605 6.671 0.00137 **
## Residuals 538 1016631 1890
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm.res = lm(Cholesterol~BP_Status,BP)
anova(lm.res)
## Analysis of Variance Table
##
## Response: Cholesterol
## Df Sum Sq Mean Sq F value Pr(>F)
## BP_Status 2 25211 12605.4 6.6708 0.001375 **
## Residuals 538 1016631 1889.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.res)$r.squared
## [1] 0.02419833
leveneTest(aov.bp)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.1825 0.8332
## 538
par(mfrow=c(2,2))
plot(aov.bp)
(B) Comment on any significantly different cholesterol means as determined by the post-hoc test comparing all pairwise differences. Specifically explain what that tells us about differences in cholesterol levels across blood pressure status groups, like which group has the highest or lowest mean values of Cholesterol.
We perform all pairwise comparisons via Tukey’s multi-comparison test. Expected cholesterol levels are significantly different for high and normal blood pressures and for high and optimal blood pressures. We expect individuals with high blood pressures to have cholesterol levels 11.54 higher than those with normal blood pressures on average and 18.65 higher than those with optimal blood pressure on average. The difference between normal and optimal blood pressure groups is not significant. Simply speaking, mean of High > mean of (Normal, Optimal).
# performing the post-hoc test, using the Tukey Test.
TukeyHSD(aov.bp)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Cholesterol ~ BP_Status, data = BP)
##
## $BP_Status
## diff lwr upr p adj
## Normal-High -11.543481 -20.93394 -2.153023 0.0111929
## Optimal-High -18.646679 -32.83690 -4.456460 0.0059898
## Optimal-Normal -7.103198 -21.18815 6.981749 0.4624869
For this problem use the bupa.csv data set. Check UCI Machine Learning Repository for more information (http://archive.ics.uci.edu/ml/datasets/Liver+Disorders). The mean corpuscular volume and alkaline phosphatase are blood tests thought to be sensitive to liver disorder related to excessive alcohol consumption. We assume that normality and independence assumptions are valid.
bupa=read_csv("bupa.csv")
## Rows: 345 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): mcv, alkphos, drinkgroup
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Drink group categorization of the half-pint equivalents of alcoholic beverages drunk per day:
(A) Perform a one-way ANOVA for mcv as a function of drink group. Comment on significance of the drink group, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.
bupa$mcv <- as.numeric(bupa$mcv)
bupa$drinkgroup <- as.factor(bupa$drinkgroup)
aov.bupa <- aov(mcv~drinkgroup,bupa)
summary(aov.bupa)
## Df Sum Sq Mean Sq F value Pr(>F)
## drinkgroup 4 733 183.29 10.26 7.43e-08 ***
## Residuals 340 6073 17.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm.bupa<-lm(mcv~drinkgroup, bupa)
anova(lm.bupa)
## Analysis of Variance Table
##
## Response: mcv
## Df Sum Sq Mean Sq F value Pr(>F)
## drinkgroup 4 733.2 183.294 10.262 7.429e-08 ***
## Residuals 340 6073.1 17.862
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.bupa)$r.squared
## [1] 0.1077214
leveneTest(aov.bupa)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.3053 0.8744
## 340
(B) Perform a one-way ANOVA for alkphos as a function of drinkgroup. Comment on statistical significance of the drinkgroup, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.
bupa$alkphos <- as.numeric(bupa$alkphos)
bupa$drinkgroup <- as.factor(bupa$drinkgroup)
aov.alkphos <- aov(alkphos~drinkgroup,bupa)
summary(aov.alkphos)
## Df Sum Sq Mean Sq F value Pr(>F)
## drinkgroup 4 4946 1236.4 3.792 0.00495 **
## Residuals 340 110858 326.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm.alkphos<-lm(alkphos~drinkgroup, bupa)
anova(lm.alkphos)
## Analysis of Variance Table
##
## Response: alkphos
## Df Sum Sq Mean Sq F value Pr(>F)
## drinkgroup 4 4946 1236.41 3.7921 0.00495 **
## Residuals 340 110858 326.05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.alkphos)$r.squared
## [1] 0.04270721
LeveneTest(aov.alkphos)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.8089 0.5201
## 340
(C) Perform post-hoc tests for models in a) and b). Comment on any similarities or differences you observe from their results.
For the anova model with mcv and drinkgroup, Group5 has significantly greater mcv than group1. And group4 has significantly greater mcv than group1, group2, and group 3. From multi-comparison result for the anova model with alkphos and drinkgroup, group5 has significantly greater alkphos than group1, group2, group3, and group 4. Multi-comparison results indicate that more alcoholic beverages drunk per day associates with higher mean corpuscular volume and higher alkphos. Especially we see higher risk from group4 and 5 like around 4 or more 9 drinks per day.
Tukey Post-hoc for MCV
TukeyHSD(aov.bupa)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mcv ~ drinkgroup, data = bupa)
##
## $drinkgroup
## diff lwr upr p adj
## 2-1 1.241452991 -0.690316778 3.173223 0.3973587
## 3-1 0.938131313 -0.697363908 2.573627 0.5157202
## 4-1 3.744610282 1.968846495 5.520374 0.0000002
## 5-1 3.746031746 0.999127120 6.492936 0.0020039
## 3-2 -0.303321678 -2.330666517 1.724023 0.9940309
## 4-2 2.503157290 0.361050967 4.645264 0.0127884
## 5-2 2.504578755 -0.492214115 5.501372 0.1499436
## 4-3 2.806478969 0.927189295 4.685769 0.0005031
## 5-3 2.807900433 -0.007037875 5.622839 0.0509321
## 5-4 0.001421464 -2.897262735 2.900106 1.0000000
Tukey Post-hoc for Alkphos
TukeyHSD(aov.alkphos)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = alkphos ~ drinkgroup, data = bupa)
##
## $drinkgroup
## diff lwr upr p adj
## 2-1 -2.645299 -10.8987258 5.608127 0.9045115
## 3-1 -4.056138 -11.0437411 2.931464 0.5035761
## 4-1 -1.148743 -8.7356393 6.438152 0.9937509
## 5-1 12.572650 0.8365845 24.308715 0.0288892
## 3-2 -1.410839 -10.0726073 7.250929 0.9917361
## 4-2 1.496556 -7.6555274 10.648639 0.9916117
## 5-2 15.217949 2.4142438 28.021654 0.0107154
## 4-3 2.907395 -5.1218122 10.936602 0.8583931
## 5-3 16.628788 4.6020510 28.655525 0.0016498
## 5-4 13.721393 1.3368544 26.105932 0.0214375
The psychology department at a hypothetical university has been accused of underpaying female faculty members. The data represent salary (in thousands of dollars) for all 22 professors in the department. This problem is from Maxwell and Delaney (2004).
This includes the interaction term: an interaction term is a multiplication term, not an addition.
(A) Fit a two-way ANOVA model including sex (F, M) and rank (Assistant, Associate) the interactionterm. What do the Type 1 and Type 3 sums of squares tell us about significance of effects? Is the interaction between sex and rank significant? Also comment on the variation explained by the model.
Type I test tells us that sex and rank main effects are both significant since the p-values are all less than 0.05. But Type III test shows significant rank effect but insignificant gender effect based on significance level 0.05. The interaction term between rank and sex is not significant with a p-value of 0.7951 from both Type I and Type III tests. The value of R-square is 0.6648, meaning that around 66.48% of the variation in salary (the response variable) is explained by the model (two-way ANOVA with interaction effect).
#reviewing for balanced
pd<-read.csv("psych.csv")
Analysis of Variance Table - Type 1 Test.
#Type 1 test
#1
aov.pd <- aov(salary ~ sex * rank, pd)
summary(aov.pd)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 155.15 155.15 17.007 0.000637 ***
## rank 1 169.82 169.82 18.616 0.000417 ***
## sex:rank 1 0.63 0.63 0.069 0.795101
## Residuals 18 164.21 9.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table - Type 3 Test.
# Type 3 test
Anova(aov.pd, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 8140.2 1 892.2994 < 2e-16 ***
## sex 28.0 1 3.0711 0.09671 .
## rank 70.4 1 7.7189 0.01240 *
## sex:rank 0.6 1 0.0695 0.79510
## Residuals 164.2 18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-Squared
lm.res<-lm(salary~sex*rank,pd)
summary(lm.res)$r.squared
## [1] 0.6647566
Note: This does not include the interaction term - it means that you are fitting a model where the effects of the predictor variables are considered independently, without accounting for any interaction between them.
(B) Refit the model without the interaction term. Comment on the significance of effects and variation explained. Report and interpret the Type 1 and Type 3 tests of the main effects. Are the main effects of rank and sex significant?
#1
aov.pd2 <- aov(salary ~ sex + rank, pd)
summary(aov.pd2)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 155.2 155.15 17.88 0.000454 ***
## rank 1 169.8 169.82 19.57 0.000291 ***
## Residuals 19 164.8 8.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Type 3 test
Anova(aov.pd2, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 10227.6 1 1178.8469 < 2.2e-16 ***
## sex 72.8 1 8.3862 0.0092618 **
## rank 169.8 1 19.5743 0.0002912 ***
## Residuals 164.8 19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#2
aov.pd3 <- aov(salary ~rank + sex, pd)
summary(aov.pd3)
## Df Sum Sq Mean Sq F value Pr(>F)
## rank 1 252.22 252.22 29.071 3.34e-05 ***
## sex 1 72.76 72.76 8.386 0.00926 **
## Residuals 19 164.84 8.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(aov.pd3, type=3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 10227.6 1 1178.8469 < 2.2e-16 ***
## rank 169.8 1 19.5743 0.0002912 ***
## sex 72.8 1 8.3862 0.0092618 **
## Residuals 164.8 19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(C) Obtain model diagnostics to validate your Normality assumptions.
lm.res2<-lm(salary~sex+rank,pd)
summary(lm.res2)$r.squared
## [1] 0.6634627
par(mfrow=c(2,2))
plot(aov.pd)
(D) Choose a final model based on your results from parts (a) and (b). Comment on any significant group differences through the post-hoc test. State the differences in salary across different main effect groups and interaction (if included) between them.
Use the cars_new.csv. See HW1 for detailed information of variables.
(A) Start with a three-way main effects ANOVA and choose the best main effects ANOVA model for mpg_highway as a function of cylinders, origin, and type for the cars in this set. Comment on which terms should be kept in a model for mpg_highway and why based on Type 3 SS. For the model with just predictors you decide to keep, comment on the significant effects in the model and comment on how much variation in highway fuel efficiency the model describes.
cars_new<- read.csv("cars_new.csv")
cars_new$type <- as.factor(cars_new$type)
cars_new$origin <- as.factor(cars_new$origin)
cars_new$cylinders <- as.factor(cars_new$cylinders)
cars_new$mpg_highway <- as.numeric(cars_new$mpg_highway)
str(cars_new)
## 'data.frame': 180 obs. of 4 variables:
## $ type : Factor w/ 2 levels "Sedan","Sports": 1 1 1 1 1 2 1 1 1 1 ...
## $ origin : Factor w/ 2 levels "Asia","USA": 1 1 1 1 1 1 2 2 2 2 ...
## $ cylinders : Factor w/ 2 levels "4","6": 1 1 2 2 2 2 2 2 2 2 ...
## $ mpg_highway: num 31 29 28 24 24 24 30 29 30 28 ...
aov.cars <- aov(mpg_highway ~ cylinders + origin + type, cars_new)
Anova(aov.cars, type=3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 69548 1 6501.6715 < 2e-16 ***
## cylinders 1453 1 135.8499 < 2e-16 ***
## origin 1 1 0.0786 0.77948
## type 108 1 10.1018 0.00175 **
## Residuals 1883 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm.cars<-lm(mpg_highway~cylinders+type,cars_new)
summary(lm.cars)$r.squared
## [1] 0.4572163
(B) Starting with main effects chosen in part (a), find your best ANOVA model by adding in any additional interaction terms that will significantly improve the model. For your final model, comment on the significant effects and variation explained by the model.
aov.car1 <- aov(mpg_highway ~ cylinders*type, cars_new)
Anova(aov.car1, type=3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 85471 1 8358.838 < 2.2e-16 ***
## cylinders 1558 1 152.397 < 2.2e-16 ***
## type 198 1 19.392 1.844e-05 ***
## cylinders:type 84 1 8.201 0.004696 **
## Residuals 1800 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm.car1<-lm(mpg_highway~cylinders*type,cars_new)
summary(lm.car1)$r.squared
## [1] 0.4813821
plot(aov.car1, 2)
(C) Comment on any significant group differences through the post-hoc test. What does this tell us about fuel efficiency differences across cylinders, origin, or type groups? See Hint in Exercise 3.
The following differences of least squares means for main effects tell us that 4 cylinder cars are significantly more efficient than 6 cylinder cars, with 4 cylinders expected to get 5.74 mpg more than 6 cylinders. Sedans are significantly more efficient and expected to get 2.678 mpg more than sports cars with p-value less than 0.05.
TukeyHSD(aov.car1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mpg_highway ~ cylinders * type, data = cars_new)
##
## $cylinders
## diff lwr upr p adj
## 6-4 -5.722662 -6.664343 -4.780981 0
##
## $type
## diff lwr upr p adj
## Sports-Sedan -2.817931 -4.470787 -1.165075 0.0009407
##
## $`cylinders:type`
## diff lwr upr p adj
## 6:Sedan-4:Sedan -6.1723315 -7.469178 -4.875485 0.0000000
## 4:Sports-4:Sedan -5.2275641 -8.306639 -2.148489 0.0001079
## 6:Sports-4:Sedan -6.6025641 -9.681639 -3.523489 0.0000006
## 4:Sports-6:Sedan 0.9447674 -2.120956 4.010491 0.8546517
## 6:Sports-6:Sedan -0.4302326 -3.495956 2.635491 0.9834567
## 6:Sports-4:Sports -1.3750000 -5.521993 2.771993 0.8253946