heartBP <- read.csv("C:/Users/lacyb/Documents/2021 Fall/STA 6443 - DA Algorithms I/heartbpchol.csv", header = TRUE)
heartBP$Cholesterol <- as.numeric(heartBP$Cholesterol)
str(heartBP)
## 'data.frame': 541 obs. of 2 variables:
## $ Cholesterol: num 221 188 292 319 205 247 202 150 228 280 ...
## $ BP_Status : chr "Optimal" "High" "High" "Normal" ...
table(heartBP$BP_Status)
##
## High Normal Optimal
## 229 245 67
Shows BP_Status is unbalance
boxplot(Cholesterol ~ BP_Status, heartBP,
main = "Cholesterol Distribution by BP Status",
xlab = "BP Status",
ylab = "Cholesterol")
P-value is less than significant value of 0.05 – we can reject null hypothesis at least one has a different mean and an effect on cholesterol
aov.heartBP <- aov(Cholesterol ~ BP_Status, heartBP)
summary(aov.heartBP)
## Df Sum Sq Mean Sq F value Pr(>F)
## BP_Status 2 25211 12605 6.671 0.00137 **
## Residuals 538 1016631 1890
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*~2.4% of variation in Cholesterol is explained by BP_Status
lm.res <- lm(Cholesterol ~ BP_Status, heartBP)
summary(lm.res)$r.squared
## [1] 0.02419833
p-value is greater than significant value of 0.05 – we cannot reject null can assume all groups have equal variance
leveneTest(aov.heartBP)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.1825 0.8332
## 538
Normal-High and Optimal-High groups have different mean values - we can reject the null Optimal-Normal group has the least significant mean value – Optimal and Normal BP_Status have equal means mean of High Cholesterol is greater than mean of Normal mean of High Cholesterol is greater than mean of Optimal *mean of Optimal Cholesterol is equal to mean of Normal
ScheffeTest(aov.heartBP)
##
## Posthoc multiple comparisons of means: Scheffe Test
## 95% family-wise confidence level
##
## $BP_Status
## diff lwr.ci upr.ci pval
## Normal-High -11.543481 -21.35092 -1.736038 0.0159 *
## Optimal-High -18.646679 -33.46702 -3.826341 0.0089 **
## Optimal-Normal -7.103198 -21.81359 7.607194 0.4958
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bupa <- read.csv("C:/Users/lacyb/Documents/2021 Fall/STA 6443 - DA Algorithms I/bupa.csv")
bupa$mcv <- as.numeric(bupa$mcv)
bupa$alkphos <- as.numeric(bupa$alkphos)
bupa$drinkgroup <- as.factor(bupa$drinkgroup)
str(bupa)
## 'data.frame': 345 obs. of 3 variables:
## $ mcv : num 85 85 86 91 87 98 88 88 92 90 ...
## $ alkphos : num 92 64 54 78 70 55 62 67 54 60 ...
## $ drinkgroup: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
table(bupa$drinkgroup)
##
## 1 2 3 4 5
## 117 52 88 67 21
*Shows drinkgroup is unbalance
boxplot(mcv ~ drinkgroup, bupa,
main = "Mean Corpuscular Volume Distribution by Drink Group",
xlab = "Drink Group",
ylab = "Mean Corpuscular Volume")
P-value is less than significant value of 0.05 – we can reject null hypothesis at least one has a different mean and an effect on mcv
aov.mcv <- aov(mcv ~ drinkgroup, bupa)
summary(aov.mcv)
## Df Sum Sq Mean Sq F value Pr(>F)
## drinkgroup 4 733 183.29 10.26 7.43e-08 ***
## Residuals 340 6073 17.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*~10.78% of variation in mcv is explained by drinkgroup
lm.res_mcv <- lm(mcv ~ drinkgroup, bupa)
summary(lm.res_mcv)$r.squared
## [1] 0.1077214
p-value is greater than significant value of 0.05 – we cannot reject null can assume all groups have equal variance
leveneTest(aov.mcv)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.3053 0.8744
## 340
table(bupa$drinkgroup)
##
## 1 2 3 4 5
## 117 52 88 67 21
*Shows drinkgroup is unbalance
boxplot(alkphos ~ drinkgroup, bupa,
main = "Alkaline Phosphatase Distribution by Drink Group",
xlab = "Alkaline Phosphatase",
ylab = "Drink Group")
P-value is less than significant value of 0.05 – we can reject null hypothesis at least one has a different mean and an effect on alkphos
aov.alkphos <- aov(alkphos ~ drinkgroup, bupa)
summary(aov.alkphos)
## Df Sum Sq Mean Sq F value Pr(>F)
## drinkgroup 4 4946 1236.4 3.792 0.00495 **
## Residuals 340 110858 326.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*~4.27% of variation in alkphos is explained by drinkgroup
lm.res_alkphos <- lm(alkphos ~ drinkgroup, bupa)
summary(lm.res_alkphos)$r.squared
## [1] 0.04270721
p-value is greater than significant value of 0.05 – we cannot reject null can assume all groups have equal variance
leveneTest(aov.alkphos)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.8089 0.5201
## 340
groups ‘4-1’ ‘5-1’ ‘4-2’ and ‘4-3’ have different means - we can reject null groups ‘2-1’ ‘3-1’ ‘3-2’ ‘5-2’ ‘5-3’and ’5-4’ have equal means - cannot reject null
ScheffeTest(aov.mcv)
##
## Posthoc multiple comparisons of means: Scheffe Test
## 95% family-wise confidence level
##
## $drinkgroup
## diff lwr.ci upr.ci pval
## 2-1 1.241452991 -0.94020481 3.423111 0.5410
## 3-1 0.938131313 -0.90892674 2.785189 0.6495
## 4-1 3.744610282 1.73913894 5.750082 1.9e-06 ***
## 5-1 3.746031746 0.64379565 6.848268 0.0081 **
## 3-2 -0.303321678 -2.59291786 1.986275 0.9966
## 4-2 2.503157290 0.08395442 4.922360 0.0380 *
## 5-2 2.504578755 -0.87987039 5.889028 0.2646
## 4-3 2.806478969 0.68408993 4.928868 0.0025 **
## 5-3 2.807900433 -0.37116998 5.986971 0.1151
## 5-4 0.001421464 -3.27222796 3.275071 1.0000
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
groups ‘5-2’ and ‘5-3’ have different means - we can reject the null groups ‘2-1’ ‘3-1’ ‘4-1’ ‘5-1’ ‘3-2’ ‘4-2’ ‘4-3’ and ‘5-4’ have equal means - cannot reject null
ScheffeTest(aov.alkphos)
##
## Posthoc multiple comparisons of means: Scheffe Test
## 95% family-wise confidence level
##
## $drinkgroup
## diff lwr.ci upr.ci pval
## 2-1 -2.645299 -11.9663647 6.675766 0.9419
## 3-1 -4.056138 -11.9476367 3.835360 0.6389
## 4-1 -1.148743 -9.7170578 7.419571 0.9965
## 5-1 12.572650 -0.6815582 25.826857 0.0734 .
## 3-2 -1.410839 -11.1930681 8.371390 0.9953
## 4-2 1.496556 -8.8394138 11.832525 0.9952
## 5-2 15.217949 0.7579944 29.677903 0.0329 *
## 4-3 2.907395 -6.1604467 11.975236 0.9117
## 5-3 16.628788 3.0463078 30.211268 0.0069 **
## 5-4 13.721393 -0.2651729 27.707959 0.0578 .
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*groups ‘2-1’ ‘3-1’ ‘3-2’ and ‘5-4’ have equal means – they do not have effect on either mcv or alkphos
psych <- read.csv("C:/Users/lacyb/Documents/2021 Fall/STA 6443 - DA Algorithms I/psych.csv")
psych$salary <- as.numeric(psych$salary)
str(psych)
## 'data.frame': 22 obs. of 3 variables:
## $ sex : chr "F" "F" "F" "F" ...
## $ rank : chr "Assist" "Assist" "Assist" "Assist" ...
## $ salary: num 33 36 35 38 42 37 39 38 40 44 ...
both have p-values less than 0.05 – we can reject the null both groups have significant effect on salary interaction between sex and rank result in higher p-value - does not have effect on salary cannot reject the null
aov.psych_1 <- aov(salary ~ sex * rank, psych)
summary(aov.psych_1)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 155.15 155.15 17.007 0.000637 ***
## rank 1 169.82 169.82 18.616 0.000417 ***
## sex:rank 1 0.63 0.63 0.069 0.795101
## Residuals 18 164.21 9.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov.psych_2 <- aov(salary ~ rank * sex, psych)
summary(aov.psych_2)
## Df Sum Sq Mean Sq F value Pr(>F)
## rank 1 252.22 252.22 27.647 5.33e-05 ***
## sex 1 72.76 72.76 7.975 0.0112 *
## rank:sex 1 0.63 0.63 0.069 0.7951
## Residuals 18 164.21 9.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rank is only one below 0.05 - can reject null - rank has effect on salary sex is greater than 0.05 - cannot reject null - sex has no significant effect on salary
Anova(aov.psych_1, type = 3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 8140.2 1 892.2994 < 2e-16 ***
## sex 28.0 1 3.0711 0.09671 .
## rank 70.4 1 7.7189 0.01240 *
## sex:rank 0.6 1 0.0695 0.79510
## Residuals 164.2 18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*~66.4% of salary variation can be explained by sex and rank
lm.psych_1 <- lm(salary ~ sex * rank, psych)
summary(lm.psych_1)$r.squared
## [1] 0.6647566
no interaction is shown - both groups have significant effect on salary (p-value less than 0.05) we can reject the null
aov.psych_3 <- aov(salary ~ rank + sex, psych)
summary(aov.psych_3)
## Df Sum Sq Mean Sq F value Pr(>F)
## rank 1 252.22 252.22 29.071 3.34e-05 ***
## sex 1 72.76 72.76 8.386 0.00926 **
## Residuals 19 164.84 8.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov.psych_4 <- aov(salary ~ sex + rank, psych)
summary(aov.psych_4)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 155.2 155.15 17.88 0.000454 ***
## rank 1 169.8 169.82 19.57 0.000291 ***
## Residuals 19 164.8 8.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
no interaction is shown - both groups have significant effect on salary (p-value less than 0.05) we can reject the null
Anova(aov.psych_3, type = 3)
## Anova Table (Type III tests)
##
## Response: salary
## Sum Sq Df F value Pr(>F)
## (Intercept) 10227.6 1 1178.8469 < 2.2e-16 ***
## rank 169.8 1 19.5743 0.0002912 ***
## sex 72.8 1 8.3862 0.0092618 **
## Residuals 164.8 19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*~66.3% of salary variation can be explained by sex and rank
lm.psych_3 <- lm(salary ~ rank + sex, psych)
summary(lm.psych_3)$r.squared
## [1] 0.6634627
*from QQPlot - demonstrates normal distribution
par(mfrow=c(1,1))
plot(aov.psych_3, 2)
sex has significant effect on salary - p-value is less than 0.05 mean salary of Male is greater than mean salary of Female rank has significant effect on salary - p-value is less than 0.05 mean salary of Associate is greater than mean salary of Assistant
TukeyHSD(aov.psych_4)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = salary ~ sex + rank, data = psych)
##
## $sex
## diff lwr upr p adj
## M-F 5.333333 2.693648 7.973019 0.0004544
##
## $rank
## diff lwr upr p adj
## Assoc-Assist 5.377778 2.738092 8.017463 0.0004193
cars <- read.csv("C:/Users/lacyb/Documents/2021 Fall/STA 6443 - DA Algorithms I/cars_new.csv")
cars$mpg_highway <- as.numeric(cars$mpg_highway)
cars$cylinders <- as.factor(cars$cylinders)
cars$origin <- as.factor(cars$origin)
cars$type <- as.factor(cars$type)
str(cars)
## 'data.frame': 180 obs. of 4 variables:
## $ type : Factor w/ 2 levels "Sedan","Sports": 1 1 1 1 1 2 1 1 1 1 ...
## $ origin : Factor w/ 2 levels "Asia","USA": 1 1 1 1 1 1 2 2 2 2 ...
## $ cylinders : Factor w/ 2 levels "4","6": 1 1 2 2 2 2 2 2 2 2 ...
## $ mpg_highway: num 31 29 28 24 24 24 30 29 30 28 ...
cylinders and type have p-values less than 0.05 – can reject null & they have effect on mpg_highway at least one group has different mean origin has p-value greater than 0.05 – cannot reject the null & has no effect on mpg_highway will remove origin from the model
aov.cars <- aov(mpg_highway ~ cylinders + origin + type, cars)
Anova(aov.cars, type=3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 69548 1 6501.6715 < 2e-16 ***
## cylinders 1453 1 135.8499 < 2e-16 ***
## origin 1 1 0.0786 0.77948
## type 108 1 10.1018 0.00175 **
## Residuals 1883 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*~45.7% of mpg_highway variation can be explained by cylinders and type
lm.cars <- lm(mpg_highway ~ cylinders + type, cars)
summary(lm.cars)$r.squared
## [1] 0.4572163
all three have p-values less than 0.05 – can reject the null & all have significant effect on mpg_highway two-way ANOVA type 3 is best model
aov.cars_2 <- aov(mpg_highway ~ cylinders + type + cylinders * type, cars)
Anova(aov.cars_2, type = 3)
## Anova Table (Type III tests)
##
## Response: mpg_highway
## Sum Sq Df F value Pr(>F)
## (Intercept) 85471 1 8358.838 < 2.2e-16 ***
## cylinders 1558 1 152.397 < 2.2e-16 ***
## type 198 1 19.392 1.844e-05 ***
## cylinders:type 84 1 8.201 0.004696 **
## Residuals 1800 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*~48.1% of mpg_highway variation can be explained by cylinders and type & the interaction
lm.cars_2 <- lm(mpg_highway ~ cylinders + type + cylinders * type, cars)
summary(lm.cars_2)$r.squared
## [1] 0.4813821
type and cylinders have p-value less than 0.05 - can reject the null & they have effect on mpg_highway mean of sedan is higher than mean of sports mean of 6 cylinder is less than mean of 4 type groups 6:sedan-4:sedan 4:sports-4:sedan and 6:sports-4:sedan have effect *groups 4:sports-6:sedan 6:sports-6:sedan and 6:sports-4:sports no effect
TukeyHSD(aov.cars_2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mpg_highway ~ cylinders + type + cylinders * type, data = cars)
##
## $cylinders
## diff lwr upr p adj
## 6-4 -5.722662 -6.664343 -4.780981 0
##
## $type
## diff lwr upr p adj
## Sports-Sedan -2.817931 -4.470787 -1.165075 0.0009407
##
## $`cylinders:type`
## diff lwr upr p adj
## 6:Sedan-4:Sedan -6.1723315 -7.469178 -4.875485 0.0000000
## 4:Sports-4:Sedan -5.2275641 -8.306639 -2.148489 0.0001079
## 6:Sports-4:Sedan -6.6025641 -9.681639 -3.523489 0.0000006
## 4:Sports-6:Sedan 0.9447674 -2.120956 4.010491 0.8546517
## 6:Sports-6:Sedan -0.4302326 -3.495956 2.635491 0.9834567
## 6:Sports-4:Sports -1.3750000 -5.521993 2.771993 0.8253946
sedans have higher mpg_highway than sports 4 cylinder cars have higher mpg_highway than 6 cylinder cars