Analysis of Variance

For this problem use the bupa.csv data set. Check UCI Machine Learning Repository for more information (http://archive.ics.uci.edu/ml/datasets/Liver+Disorders). The mean corpuscular volume and alkaline phosphatase are blood tests thought to be sensitive to liver disorder related to excessive alcohol consumption. We assume that normality and independence assumptions are valid.

# Read data 
q2.data <- read.csv("bupa.csv")

# Check the structure
str(q2.data)
## 'data.frame':    345 obs. of  3 variables:
##  $ mcv       : int  85 85 86 91 87 98 88 88 92 90 ...
##  $ alkphos   : int  92 64 54 78 70 55 62 67 54 60 ...
##  $ drinkgroup: int  1 1 1 1 1 1 1 1 1 1 ...
table(q2.data$drinkgroup)
## 
##   1   2   3   4   5 
## 117  52  88  67  21
# Convert int to factor for the drink group variable

q2.data$drinkgroup <- factor(q2.data$drinkgroup)

#check the result

str(q2.data)
## 'data.frame':    345 obs. of  3 variables:
##  $ mcv       : int  85 85 86 91 87 98 88 88 92 90 ...
##  $ alkphos   : int  92 64 54 78 70 55 62 67 54 60 ...
##  $ drinkgroup: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
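
The conversion to a factor matters for the models that follow. As a minimal sketch on synthetic data (the data frame `toy` and its values are made up for illustration and are not taken from bupa.csv), an integer-coded group is fitted as a single slope with 1 degree of freedom, while a factor gets one mean per group and k - 1 degrees of freedom:

```r
# Synthetic data (NOT bupa.csv): five groups coded 1..5, like drinkgroup
set.seed(1)
toy <- data.frame(
  group = rep(1:5, each = 20),
  y     = rnorm(100, mean = rep(c(90, 90, 91, 93, 93), each = 20), sd = 4)
)

# Integer predictor: treated as numeric, one slope, 1 degree of freedom
df.numeric <- anova(lm(y ~ group, data = toy))$Df[1]

# Factor predictor: one mean per group, k - 1 = 4 degrees of freedom
df.factor <- anova(lm(y ~ factor(group), data = toy))$Df[1]

print(c(numeric = df.numeric, factor = df.factor))
```

Only the factor version corresponds to the one-way ANOVA asked for in the questions below.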

Question (a):

  • Perform a one-way ANOVA for mcv as a function of drinkgroup. Comment on significance of the drinkgroup, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.

Answer -:

  • H0:
    There is no difference between drink groups; in other words, the means of all groups are equal, so drinking group has no effect on MCV (mean corpuscular volume).

  • H1:
    At least one group has a different mean, so drinking group has an effect on MCV (mean corpuscular volume).
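
In symbols, writing \mu_i for the population mean of the response in drink group i (this notation is an assumption added here, not part of the assignment), the two hypotheses can be stated as:

```latex
\[
H_0:\ \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5
\qquad\text{versus}\qquad
H_1:\ \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j .
\]
```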

Step 1) Check Balancing
table(q2.data$drinkgroup)
## 
##   1   2   3   4   5 
## 117  52  88  67  21
Result:

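The “drinkgroup” data are not balanced. Because this is a one-way ANOVA with a single independent variable, the imbalance does not change how the sums of squares are computed, so we can continue with the ANOVA test. Unequal group sizes do make the F test more sensitive to unequal variances, which is why the equal-variance check in Step 3 matters.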
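
The balance check can also be done programmatically. This is a small sketch that uses the group sizes printed above; the variable names `is.balanced` and `size.ratio` are illustrative only:

```r
# Group sizes copied from the table() output above
n <- c("1" = 117, "2" = 52, "3" = 88, "4" = 67, "5" = 21)

# A design is balanced when every group has the same size
is.balanced <- length(unique(n)) == 1

# Ratio of largest to smallest group, a rough measure of imbalance
size.ratio <- max(n) / min(n)

print(is.balanced)
print(round(size.ratio, 2))
```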

Step 2) Run One-Way ANOVA
q2.aov.a <- aov(data=q2.data, mcv ~ drinkgroup)
summary(q2.aov.a)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## drinkgroup    4    733  183.29   10.26 7.43e-08 ***
## Residuals   340   6073   17.86                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
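
To see where the F value and P-Value in this table come from, the following sketch recomputes them from the printed mean squares and degrees of freedom. Because the inputs are the rounded values from the table, the result only matches the table up to rounding:

```r
# Values copied from the ANOVA table above (rounded as printed)
ms.group <- 183.29   # Mean Sq for drinkgroup
ms.resid <- 17.86    # Mean Sq for Residuals
df.group <- 4
df.resid <- 340

# F statistic: between-group mean square over within-group mean square
f.value <- ms.group / ms.resid

# P-Value: upper tail of the F distribution with (4, 340) degrees of freedom
p.value <- pf(f.value, df.group, df.resid, lower.tail = FALSE)

print(f.value)
print(p.value)
```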
Step 3) Check variance equality with the “Levene Test”

We check equality of variances in two ways:
1) Levene test
2) Residuals plot

Here we run the Levene test; at the diagnostic step, we will check the residuals plot.

  • H0: Variances are equal (Homogeneity of Variances)
  • H1: Variances are not equal (heterogeneity of variances)
library(car)  # provides leveneTest()
leveneTest(q2.aov.a)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   4  0.3053 0.8744
##       340
Result:

P-Value is very large (0.8744), so we do not have enough evidence to reject the null. The data are consistent with equal variances across groups (homogeneity of variances).
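
For intuition, the median-centered Levene test can be sketched in base R as a one-way ANOVA on the absolute deviations of each observation from its group median. The data below are synthetic (the data frame `toy` is an assumption for illustration, not bupa.csv), so the numbers will not match the output above:

```r
# Synthetic data (NOT bupa.csv): three unbalanced groups with equal variance
set.seed(42)
toy <- data.frame(
  group = factor(rep(1:3, times = c(30, 20, 10))),
  y     = rnorm(60, mean = 90, sd = 4)
)

# Absolute deviation of each observation from its own group median
toy$abs.dev <- abs(toy$y - ave(toy$y, toy$group, FUN = median))

# One-way ANOVA on the deviations; its F test is the Levene statistic
levene.tab <- anova(lm(abs.dev ~ group, data = toy))
print(levene.tab)
```

A large P-Value in this table means the typical spread around the median is similar from group to group.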

Step 4) Diagnostic plot
  • We can judge the normality assumption through the QQ plot
par(mfrow=c(2,2))
plot(q2.aov.a)

Result:

According to the QQ plot, the normality assumption is reasonable.

According to the residuals plot, the variances look equal across groups (same as the result of the Levene test).
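
The two plots used here can also be drawn one at a time from the residuals, which makes it easier to see what each one shows. This is a sketch on synthetic data (the objects `toy` and `fit` are illustrative assumptions, not the model fitted above):

```r
# Synthetic data (NOT bupa.csv)
set.seed(7)
toy <- data.frame(
  group = factor(rep(1:3, each = 25)),
  y     = rnorm(75, mean = 90, sd = 4)
)
fit <- aov(y ~ group, data = toy)
res <- residuals(fit)

pdf(NULL)  # discard graphics so the sketch runs non-interactively; remove to see the plots
par(mfrow = c(1, 2))
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)   # residuals should scatter evenly around zero
qqnorm(res)
qqline(res)              # points close to the line support normality
dev.off()

# Per-group residual standard deviations: similar values support equal variances
res.sd <- tapply(res, toy$group, sd)
print(round(res.sd, 2))
```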

Step 5) Calculate the R Square
q2.lm.a <- lm(q2.data$mcv~q2.data$drinkgroup)
aov(q2.lm.a)
## Call:
##    aov(formula = q2.lm.a)
## 
## Terms:
##                 q2.data$drinkgroup Residuals
## Sum of Squares             733.177  6073.055
## Deg. of Freedom                  4       340
## 
## Residual standard error: 4.226337
## Estimated effects may be unbalanced
summary(q2.lm.a)$r.squared
## [1] 0.1077214
Result:

The R Square is the proportion of variation in the response variable that is explained by the model. Here it is about 0.108, so drinkgroup explains roughly 10.8% of the variation in mcv.
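
The same value follows directly from the sums of squares printed by aov() in this step, as the short sketch below shows (inputs are the printed values, so agreement is up to rounding):

```r
# Sums of squares copied from the aov() output above
ss.model <- 733.177    # q2.data$drinkgroup
ss.resid <- 6073.055   # Residuals

# R squared = explained variation / total variation
r.squared <- ss.model / (ss.model + ss.resid)
print(r.squared)
```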

Conclusion:

According to Step (2), the P-Value (7.43e-08) is very small and significant, so we reject the null.

Final Result:
At least one group has a different mean, so drinking group has an effect on MCV (mean corpuscular volume).


Question (b):

  • Perform a one-way ANOVA for alkphos as a function of drinkgroup. Comment on statistical significance of the drinkgroup, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.

Answer -:

  • H0:
    There is no difference between drink groups; in other words, the means of all groups are equal, so drinking group has no effect on “alkaline phosphatase”.

  • H1:
    At least one group has a different mean, so drinking group has an effect on “alkaline phosphatase”.

Step 1) Check Balancing
table(q2.data$drinkgroup)
## 
##   1   2   3   4   5 
## 117  52  88  67  21
Result:

As in the previous part, the “drinkgroup” data are not balanced. Because this is a one-way ANOVA with a single independent variable, the imbalance does not change how the sums of squares are computed, so we can continue with the ANOVA test.

Step 2) Run One-Way ANOVA
q2.aov.b <- aov(data=q2.data, alkphos ~ drinkgroup)
summary(q2.aov.b)
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## drinkgroup    4   4946  1236.4   3.792 0.00495 **
## Residuals   340 110858   326.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step 3) Check variance equality with the “Levene Test”

We check equality of variances in two ways:
1) Levene test
2) Residuals plot

Here we run the Levene test; at the diagnostic step, we will check the residuals plot.

  • H0: Variances are equal (Homogeneity of Variances)
  • H1: Variances are not equal (heterogeneity of variances)
leveneTest(q2.aov.b)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   4  0.8089 0.5201
##       340
Result:

P-Value is large (0.5201), so we do not have enough evidence to reject the null. The data are consistent with equal variances across groups (homogeneity of variances).

Step 4) Diagnostic plot
  • We can judge the normality assumption through the QQ plot
par(mfrow=c(2,2))
plot(q2.aov.b)

Result:

According to the QQ plot, the normality assumption is reasonable. Some extreme values fall off the line, but this is acceptable because ANOVA is fairly robust to moderate departures from normality.

According to the residuals plot, the variances look equal across groups (same as the result of the Levene test).

Step 5) Calculate the R Square
q2.lm.b <- lm(q2.data$alkphos ~ q2.data$drinkgroup)
#aov(q2.lm.b)
summary(q2.lm.b)$r.squared
## [1] 0.04270721
Result:

The R Square is the proportion of variation in the response variable that is explained by the model. Here it is about 0.043, so drinkgroup explains roughly 4.3% of the variation in alkphos.

Conclusion:

According to Step (2), the P-Value (0.00495) is small and significant, so we reject the null.

Final Result:
- At least one group has a different mean, so drinking group has an effect on “alkaline phosphatase”.


Question (c):

  • Perform post-hoc tests for models in a) and b). Comment on any similarities or differences you observe from their results.

Answer -:

Post-hoc on Part (a)
library(DescTools)  # provides ScheffeTest()
ScheffeTest(q2.aov.a)
## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $drinkgroup
##             diff      lwr.ci   upr.ci    pval    
## 2-1  1.241452991 -0.94020481 3.423111  0.5410    
## 3-1  0.938131313 -0.90892674 2.785189  0.6495    
## 4-1  3.744610282  1.73913894 5.750082 1.9e-06 ***
## 5-1  3.746031746  0.64379565 6.848268  0.0081 ** 
## 3-2 -0.303321678 -2.59291786 1.986275  0.9966    
## 4-2  2.503157290  0.08395442 4.922360  0.0380 *  
## 5-2  2.504578755 -0.87987039 5.889028  0.2646    
## 4-3  2.806478969  0.68408993 4.928868  0.0025 ** 
## 5-3  2.807900433 -0.37116998 5.986971  0.1151    
## 5-4  0.001421464 -3.27222796 3.275071  1.0000    
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result (here “=” means no significant difference at the 5% level, not that the means are proven equal):
  • Diff of “group 2-group 1” is “Positive”, P-Value is large ==> µ(group 2) = µ(group 1)
  • Diff of “group 3-group 1” is “Positive”, P-Value is large ==> µ(group 3) = µ(group 1)
  • Diff of “group 4-group 1” is “Positive”, P-Value is small ==> µ(group 4) > µ(group 1)
  • Diff of “group 5-group 1” is “Positive”, P-Value is small ==> µ(group 5) > µ(group 1)
  • Diff of “group 3-group 2” is “Negative”, P-Value is large ==> µ(group 2) = µ(group 3)
  • Diff of “group 4-group 2” is “Positive”, P-Value is small ==> µ(group 4) > µ(group 2)
  • Diff of “group 5-group 2” is “Positive”, P-Value is large ==> µ(group 5) = µ(group 2)
  • Diff of “group 4-group 3” is “Positive”, P-Value is small ==> µ(group 4) > µ(group 3)
  • Diff of “group 5-group 3” is “Positive”, P-Value is large ==> µ(group 5) = µ(group 3)
  • Diff of “group 5-group 4” is “Positive”, P-Value is large ==> µ(group 5) = µ(group 4)

Conclusion:

µ(group 4) > µ(group 1) AND µ(group 4) > µ(group 2) AND µ(group 4) > µ(group 3)
µ(group 5) > µ(group 1)

Groups 4 and 5 have means that differ from the lower-drinking groups. This means groups 4 and 5 (drinking 6 and more per day) show the largest effect on MCV (mean corpuscular volume).
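
To show how one row of the Scheffe table is built, the sketch below recomputes the interval and P-Value for the “4-1” comparison using the standard Scheffe formula and numbers printed earlier in this report (group sizes, residual sum of squares, and the estimated difference). It is a hand calculation for illustration, not the internal code of ScheffeTest():

```r
k        <- 5                    # number of drink groups
df.resid <- 340                  # residual degrees of freedom
mse      <- 6073.055 / df.resid  # residual mean square from the ANOVA
n1       <- 117                  # size of group 1
n4       <- 67                   # size of group 4
diff.41  <- 3.744610282          # estimated mean difference, group 4 - group 1

# Squared standard error of the difference between the two group means
se2 <- mse * (1 / n1 + 1 / n4)

# Half-width of the 95% Scheffe interval
half.width <- sqrt((k - 1) * qf(0.95, k - 1, df.resid)) * sqrt(se2)

# Scheffe P-Value: diff^2 / ((k - 1) * se2) compared to F(k - 1, df.resid)
f.scheffe <- diff.41^2 / ((k - 1) * se2)
p.scheffe <- pf(f.scheffe, k - 1, df.resid, lower.tail = FALSE)

print(c(lower = diff.41 - half.width, upper = diff.41 + half.width))
print(p.scheffe)
```

The multiplier (k - 1) is what makes Scheffe conservative: it protects against all possible contrasts among the five groups, not only the ten pairwise ones.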


Post-hoc on Part (b)
ScheffeTest(q2.aov.b)
## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $drinkgroup
##          diff      lwr.ci    upr.ci   pval    
## 2-1 -2.645299 -11.9663647  6.675766 0.9419    
## 3-1 -4.056138 -11.9476367  3.835360 0.6389    
## 4-1 -1.148743  -9.7170578  7.419571 0.9965    
## 5-1 12.572650  -0.6815582 25.826857 0.0734 .  
## 3-2 -1.410839 -11.1930681  8.371390 0.9953    
## 4-2  1.496556  -8.8394138 11.832525 0.9952    
## 5-2 15.217949   0.7579944 29.677903 0.0329 *  
## 4-3  2.907395  -6.1604467 11.975236 0.9117    
## 5-3 16.628788   3.0463078 30.211268 0.0069 ** 
## 5-4 13.721393  -0.2651729 27.707959 0.0578 .  
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result (here “=” means no significant difference at the 5% level, not that the means are proven equal):
  • Diff of “group 2-group 1” is “Negative”, P-Value is large ==> µ(group 2) = µ(group 1)
  • Diff of “group 3-group 1” is “Negative”, P-Value is large ==> µ(group 3) = µ(group 1)
  • Diff of “group 4-group 1” is “Negative”, P-Value is large ==> µ(group 4) = µ(group 1)
  • Diff of “group 5-group 1” is “Positive”, P-Value is large ==> µ(group 5) = µ(group 1)
  • Diff of “group 3-group 2” is “Negative”, P-Value is large ==> µ(group 2) = µ(group 3)
  • Diff of “group 4-group 2” is “Positive”, P-Value is large ==> µ(group 4) = µ(group 2)
  • Diff of “group 5-group 2” is “Positive”, P-Value is small ==> µ(group 5) > µ(group 2)
  • Diff of “group 4-group 3” is “Positive”, P-Value is large ==> µ(group 4) = µ(group 3)
  • Diff of “group 5-group 3” is “Positive”, P-Value is small ==> µ(group 5) > µ(group 3)
  • Diff of “group 5-group 4” is “Positive”, P-Value is large ==> µ(group 5) = µ(group 4)

Conclusion:

µ(group 5) > µ(group 2) AND µ(group 5) > µ(group 3)

Group 5 has a mean that differs from groups 2 and 3, which means drinking at the level of group 5 shows the largest effect on “alkaline phosphatase”.
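
To compare the two post-hoc results side by side, the following sketch lists which pairs are significant at the 5% level in each model. The P-Values are copied from the two Scheffe tables above; the vector names are illustrative:

```r
pairs <- c("2-1", "3-1", "4-1", "5-1", "3-2", "4-2", "5-2", "4-3", "5-3", "5-4")

# Scheffe P-Values copied from the output for model (a), mcv
p.mcv <- setNames(c(0.5410, 0.6495, 1.9e-06, 0.0081, 0.9966,
                    0.0380, 0.2646, 0.0025, 0.1151, 1.0000), pairs)

# Scheffe P-Values copied from the output for model (b), alkphos
p.alkphos <- setNames(c(0.9419, 0.6389, 0.9965, 0.0734, 0.9953,
                        0.9952, 0.0329, 0.9117, 0.0069, 0.0578), pairs)

sig.mcv     <- names(p.mcv)[p.mcv < 0.05]
sig.alkphos <- names(p.alkphos)[p.alkphos < 0.05]

print(sig.mcv)                           # pairs significant for mcv
print(sig.alkphos)                       # pairs significant for alkphos
print(intersect(sig.mcv, sig.alkphos))   # pairs significant in both models
```

No pair is significant in both models. The significant mcv differences all set group 4 or group 5 against a lower group, while the significant alkphos differences all involve group 5.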