Exercise 1: Analysis of Variance

heartBP <- read.csv("C:/Users/lacyb/Documents/2021 Fall/STA 6443 - DA Algorithms I/heartbpchol.csv", header = TRUE)

Ex. 1 Part (a)

Convert format

heartBP$Cholesterol <- as.numeric(heartBP$Cholesterol)
str(heartBP)
## 'data.frame':    541 obs. of  2 variables:
##  $ Cholesterol: num  221 188 292 319 205 247 202 150 228 280 ...
##  $ BP_Status  : chr  "Optimal" "High" "High" "Normal" ...

Check balance

table(heartBP$BP_Status)
## 
##    High  Normal Optimal 
##     229     245      67

Boxplot

Shows BP_Status is unbalance

boxplot(Cholesterol ~ BP_Status, heartBP, 
        main = "Cholesterol Distribution by BP Status",
        xlab = "BP Status",
        ylab = "Cholesterol")

One-Way Anova Test

P-value is less than significant value of 0.05 – we can reject null hypothesis at least one has a different mean and an effect on cholesterol

aov.heartBP <- aov(Cholesterol ~ BP_Status, heartBP)
summary(aov.heartBP)
##              Df  Sum Sq Mean Sq F value  Pr(>F)   
## BP_Status     2   25211   12605   6.671 0.00137 **
## Residuals   538 1016631    1890                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-Square

*~2.4% of variation in Cholesterol is explained by BP_Status

lm.res <- lm(Cholesterol ~ BP_Status, heartBP)
summary(lm.res)$r.squared
## [1] 0.02419833

Equal Variance

p-value is greater than significant value of 0.05 – we cannot reject null can assume all groups have equal variance

leveneTest(aov.heartBP)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  0.1825 0.8332
##       538

Ex. 1 Part (b)

Normal-High and Optimal-High groups have different mean values - we can reject the null Optimal-Normal group has the least significant mean value – Optimal and Normal BP_Status have equal means mean of High Cholesterol is greater than mean of Normal mean of High Cholesterol is greater than mean of Optimal *mean of Optimal Cholesterol is equal to mean of Normal

ScheffeTest(aov.heartBP)
## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $BP_Status
##                      diff    lwr.ci    upr.ci   pval    
## Normal-High    -11.543481 -21.35092 -1.736038 0.0159 *  
## Optimal-High   -18.646679 -33.46702 -3.826341 0.0089 ** 
## Optimal-Normal  -7.103198 -21.81359  7.607194 0.4958    
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise 2: Analysis of Variance

Ex. 2 Part (a)

bupa <- read.csv("C:/Users/lacyb/Documents/2021 Fall/STA 6443 - DA Algorithms I/bupa.csv")

Convert format

bupa$mcv <- as.numeric(bupa$mcv)
bupa$alkphos <- as.numeric(bupa$alkphos)
bupa$drinkgroup <- as.factor(bupa$drinkgroup)
str(bupa)
## 'data.frame':    345 obs. of  3 variables:
##  $ mcv       : num  85 85 86 91 87 98 88 88 92 90 ...
##  $ alkphos   : num  92 64 54 78 70 55 62 67 54 60 ...
##  $ drinkgroup: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...

Check balance

table(bupa$drinkgroup)
## 
##   1   2   3   4   5 
## 117  52  88  67  21

Boxplot

*Shows drinkgroup is unbalance

boxplot(mcv ~ drinkgroup, bupa, 
        main = "Mean Corpuscular Volume Distribution by Drink Group",
        xlab = "Drink Group",
        ylab = "Mean Corpuscular Volume")

One-Way Anova Test

P-value is less than significant value of 0.05 – we can reject null hypothesis at least one has a different mean and an effect on mcv

aov.mcv <- aov(mcv ~ drinkgroup, bupa)
summary(aov.mcv)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## drinkgroup    4    733  183.29   10.26 7.43e-08 ***
## Residuals   340   6073   17.86                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-Square

*~10.78% of variation in mcv is explained by drinkgroup

lm.res_mcv <- lm(mcv ~ drinkgroup, bupa)
summary(lm.res_mcv)$r.squared
## [1] 0.1077214

Equal Variance

p-value is greater than significant value of 0.05 – we cannot reject null can assume all groups have equal variance

leveneTest(aov.mcv)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   4  0.3053 0.8744
##       340

Ex. 2 Part (b)

Check balance

table(bupa$drinkgroup)
## 
##   1   2   3   4   5 
## 117  52  88  67  21

Boxplot

*Shows drinkgroup is unbalance

boxplot(alkphos ~ drinkgroup, bupa,
        main = "Alkaline Phosphatase Distribution by Drink Group",
        xlab = "Alkaline Phosphatase",
        ylab = "Drink Group")

One-Way Anova Test

P-value is less than significant value of 0.05 – we can reject null hypothesis at least one has a different mean and an effect on alkphos

aov.alkphos <- aov(alkphos ~ drinkgroup, bupa)
summary(aov.alkphos)
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## drinkgroup    4   4946  1236.4   3.792 0.00495 **
## Residuals   340 110858   326.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-Square

*~4.27% of variation in alkphos is explained by drinkgroup

lm.res_alkphos <- lm(alkphos ~ drinkgroup, bupa)
summary(lm.res_alkphos)$r.squared
## [1] 0.04270721

Equal Variance

p-value is greater than significant value of 0.05 – we cannot reject null can assume all groups have equal variance

leveneTest(aov.alkphos)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   4  0.8089 0.5201
##       340

Ex. 2 Part (c)

mcv Post-hoc

groups ‘4-1’ ‘5-1’ ‘4-2’ and ‘4-3’ have different means - we can reject null groups ‘2-1’ ‘3-1’ ‘3-2’ ‘5-2’ ‘5-3’and ’5-4’ have equal means - cannot reject null

ScheffeTest(aov.mcv)
## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $drinkgroup
##             diff      lwr.ci   upr.ci    pval    
## 2-1  1.241452991 -0.94020481 3.423111  0.5410    
## 3-1  0.938131313 -0.90892674 2.785189  0.6495    
## 4-1  3.744610282  1.73913894 5.750082 1.9e-06 ***
## 5-1  3.746031746  0.64379565 6.848268  0.0081 ** 
## 3-2 -0.303321678 -2.59291786 1.986275  0.9966    
## 4-2  2.503157290  0.08395442 4.922360  0.0380 *  
## 5-2  2.504578755 -0.87987039 5.889028  0.2646    
## 4-3  2.806478969  0.68408993 4.928868  0.0025 ** 
## 5-3  2.807900433 -0.37116998 5.986971  0.1151    
## 5-4  0.001421464 -3.27222796 3.275071  1.0000    
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

alkphos Post-hoc

groups ‘5-2’ and ‘5-3’ have different means - we can reject the null groups ‘2-1’ ‘3-1’ ‘4-1’ ‘5-1’ ‘3-2’ ‘4-2’ ‘4-3’ and ‘5-4’ have equal means - cannot reject null

ScheffeTest(aov.alkphos)
## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $drinkgroup
##          diff      lwr.ci    upr.ci   pval    
## 2-1 -2.645299 -11.9663647  6.675766 0.9419    
## 3-1 -4.056138 -11.9476367  3.835360 0.6389    
## 4-1 -1.148743  -9.7170578  7.419571 0.9965    
## 5-1 12.572650  -0.6815582 25.826857 0.0734 .  
## 3-2 -1.410839 -11.1930681  8.371390 0.9953    
## 4-2  1.496556  -8.8394138 11.832525 0.9952    
## 5-2 15.217949   0.7579944 29.677903 0.0329 *  
## 4-3  2.907395  -6.1604467 11.975236 0.9117    
## 5-3 16.628788   3.0463078 30.211268 0.0069 ** 
## 5-4 13.721393  -0.2651729 27.707959 0.0578 .  
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SIMILARITIES

*groups ‘2-1’ ‘3-1’ ‘3-2’ and ‘5-4’ have equal means – they do not have effect on either mcv or alkphos

Exercise 3:

psych <- read.csv("C:/Users/lacyb/Documents/2021 Fall/STA 6443 - DA Algorithms I/psych.csv")

Convert format

psych$salary <- as.numeric(psych$salary)
str(psych)
## 'data.frame':    22 obs. of  3 variables:
##  $ sex   : chr  "F" "F" "F" "F" ...
##  $ rank  : chr  "Assist" "Assist" "Assist" "Assist" ...
##  $ salary: num  33 36 35 38 42 37 39 38 40 44 ...

Ex. 3 Part (a)

Two-way ANOVA Type 1

both have p-values less than 0.05 – we can reject the null both groups have significant effect on salary interaction between sex and rank result in higher p-value - does not have effect on salary cannot reject the null

aov.psych_1 <- aov(salary ~ sex * rank, psych)
summary(aov.psych_1)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## sex          1 155.15  155.15  17.007 0.000637 ***
## rank         1 169.82  169.82  18.616 0.000417 ***
## sex:rank     1   0.63    0.63   0.069 0.795101    
## Residuals   18 164.21    9.12                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov.psych_2 <- aov(salary ~ rank * sex, psych)
summary(aov.psych_2)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## rank         1 252.22  252.22  27.647 5.33e-05 ***
## sex          1  72.76   72.76   7.975   0.0112 *  
## rank:sex     1   0.63    0.63   0.069   0.7951    
## Residuals   18 164.21    9.12                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two-way ANOVA Type 3

rank is only one below 0.05 - can reject null - rank has effect on salary sex is greater than 0.05 - cannot reject null - sex has no significant effect on salary

Anova(aov.psych_1, type = 3)
## Anova Table (Type III tests)
## 
## Response: salary
##             Sum Sq Df  F value  Pr(>F)    
## (Intercept) 8140.2  1 892.2994 < 2e-16 ***
## sex           28.0  1   3.0711 0.09671 .  
## rank          70.4  1   7.7189 0.01240 *  
## sex:rank       0.6  1   0.0695 0.79510    
## Residuals    164.2 18                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variation

*~66.4% of salary variation can be explained by sex and rank

lm.psych_1 <- lm(salary ~ sex * rank, psych)
summary(lm.psych_1)$r.squared
## [1] 0.6647566

Ex. 3 Part (b)

Two-way ANOVA Type 1

no interaction is shown - both groups have significant effect on salary (p-value less than 0.05) we can reject the null

aov.psych_3 <- aov(salary ~ rank + sex, psych)
summary(aov.psych_3)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## rank         1 252.22  252.22  29.071 3.34e-05 ***
## sex          1  72.76   72.76   8.386  0.00926 ** 
## Residuals   19 164.84    8.68                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov.psych_4 <- aov(salary ~ sex + rank, psych)
summary(aov.psych_4)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## sex          1  155.2  155.15   17.88 0.000454 ***
## rank         1  169.8  169.82   19.57 0.000291 ***
## Residuals   19  164.8    8.68                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two-way ANOVA Type 3

no interaction is shown - both groups have significant effect on salary (p-value less than 0.05) we can reject the null

Anova(aov.psych_3, type = 3)
## Anova Table (Type III tests)
## 
## Response: salary
##              Sum Sq Df   F value    Pr(>F)    
## (Intercept) 10227.6  1 1178.8469 < 2.2e-16 ***
## rank          169.8  1   19.5743 0.0002912 ***
## sex            72.8  1    8.3862 0.0092618 ** 
## Residuals     164.8 19                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variation

*~66.3% of salary variation can be explained by sex and rank

lm.psych_3 <- lm(salary ~ rank + sex, psych)
summary(lm.psych_3)$r.squared
## [1] 0.6634627

Ex. 3 Part (c)

Model Diagnostics to validate Normality Assumptions

*from QQPlot - demonstrates normal distribution

par(mfrow=c(1,1))
plot(aov.psych_3, 2)

Ex. 3 Part (d)

Post-Hoc Test

sex has significant effect on salary - p-value is less than 0.05 mean salary of Male is greater than mean salary of Female rank has significant effect on salary - p-value is less than 0.05 mean salary of Associate is greater than mean salary of Assistant

TukeyHSD(aov.psych_4)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = salary ~ sex + rank, data = psych)
## 
## $sex
##         diff      lwr      upr     p adj
## M-F 5.333333 2.693648 7.973019 0.0004544
## 
## $rank
##                  diff      lwr      upr     p adj
## Assoc-Assist 5.377778 2.738092 8.017463 0.0004193

Exercise 4:

cars <- read.csv("C:/Users/lacyb/Documents/2021 Fall/STA 6443 - DA Algorithms I/cars_new.csv")

Convert format

cars$mpg_highway <- as.numeric(cars$mpg_highway)
cars$cylinders <- as.factor(cars$cylinders)
cars$origin <- as.factor(cars$origin)
cars$type <- as.factor(cars$type)
str(cars)
## 'data.frame':    180 obs. of  4 variables:
##  $ type       : Factor w/ 2 levels "Sedan","Sports": 1 1 1 1 1 2 1 1 1 1 ...
##  $ origin     : Factor w/ 2 levels "Asia","USA": 1 1 1 1 1 1 2 2 2 2 ...
##  $ cylinders  : Factor w/ 2 levels "4","6": 1 1 2 2 2 2 2 2 2 2 ...
##  $ mpg_highway: num  31 29 28 24 24 24 30 29 30 28 ...

Ex. 4 Part (a)

Three-way ANOVA Type 3

cylinders and type have p-values less than 0.05 – can reject null & they have effect on mpg_highway at least one group has different mean origin has p-value greater than 0.05 – cannot reject the null & has no effect on mpg_highway will remove origin from the model

aov.cars <- aov(mpg_highway ~ cylinders + origin + type, cars)
Anova(aov.cars, type=3)
## Anova Table (Type III tests)
## 
## Response: mpg_highway
##             Sum Sq  Df   F value  Pr(>F)    
## (Intercept)  69548   1 6501.6715 < 2e-16 ***
## cylinders     1453   1  135.8499 < 2e-16 ***
## origin           1   1    0.0786 0.77948    
## type           108   1   10.1018 0.00175 ** 
## Residuals     1883 176                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variation

*~45.7% of mpg_highway variation can be explained by cylinders and type

lm.cars <- lm(mpg_highway ~ cylinders + type, cars)
summary(lm.cars)$r.squared
## [1] 0.4572163

Ex. 4 Part (b)

all three have p-values less than 0.05 – can reject the null & all have significant effect on mpg_highway two-way ANOVA type 3 is best model

aov.cars_2 <- aov(mpg_highway ~ cylinders + type + cylinders * type, cars)
Anova(aov.cars_2, type = 3)
## Anova Table (Type III tests)
## 
## Response: mpg_highway
##                Sum Sq  Df  F value    Pr(>F)    
## (Intercept)     85471   1 8358.838 < 2.2e-16 ***
## cylinders        1558   1  152.397 < 2.2e-16 ***
## type              198   1   19.392 1.844e-05 ***
## cylinders:type     84   1    8.201  0.004696 ** 
## Residuals        1800 176                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variation

*~48.1% of mpg_highway variation can be explained by cylinders and type & the interaction

lm.cars_2 <- lm(mpg_highway ~ cylinders + type + cylinders * type, cars)
summary(lm.cars_2)$r.squared
## [1] 0.4813821

Post-Hoc Test

type and cylinders have p-value less than 0.05 - can reject the null & they have effect on mpg_highway mean of sedan is higher than mean of sports mean of 6 cylinder is less than mean of 4 type groups 6:sedan-4:sedan 4:sports-4:sedan and 6:sports-4:sedan have effect *groups 4:sports-6:sedan 6:sports-6:sedan and 6:sports-4:sports no effect

TukeyHSD(aov.cars_2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mpg_highway ~ cylinders + type + cylinders * type, data = cars)
## 
## $cylinders
##          diff       lwr       upr p adj
## 6-4 -5.722662 -6.664343 -4.780981     0
## 
## $type
##                   diff       lwr       upr     p adj
## Sports-Sedan -2.817931 -4.470787 -1.165075 0.0009407
## 
## $`cylinders:type`
##                         diff       lwr       upr     p adj
## 6:Sedan-4:Sedan   -6.1723315 -7.469178 -4.875485 0.0000000
## 4:Sports-4:Sedan  -5.2275641 -8.306639 -2.148489 0.0001079
## 6:Sports-4:Sedan  -6.6025641 -9.681639 -3.523489 0.0000006
## 4:Sports-6:Sedan   0.9447674 -2.120956  4.010491 0.8546517
## 6:Sports-6:Sedan  -0.4302326 -3.495956  2.635491 0.9834567
## 6:Sports-4:Sports -1.3750000 -5.521993  2.771993 0.8253946

FINAL REMARKS

sedans have higher mpg_highway than sports 4 cylinder cars have higher mpg_highway than 6 cylinder cars