Exercise 1: Analysis of Variance


The heartbpchol.csv data set contains continuous cholesterol (Cholesterol) and blood pressure status (BP_Status; categories: High, Normal, Optimal) for living patients.

For the heartbpchol.csv data set, consider a one-way ANOVA model to identify differences between group cholesterol means. The normality assumption is reasonable, so you can proceed without testing normality.

# Read the CSV file
q1.data <- read.csv("heartbpchol.csv") 

# Check data format
str(q1.data)
## 'data.frame':    541 obs. of  2 variables:
##  $ Cholesterol: int  221 188 292 319 205 247 202 150 228 280 ...
##  $ BP_Status  : chr  "Optimal" "High" "High" "Normal" ...
# change BP_Status to Factor
q1.data$BP_Status <- factor(q1.data$BP_Status)

str(q1.data)
## 'data.frame':    541 obs. of  2 variables:
##  $ Cholesterol: int  221 188 292 319 205 247 202 150 228 280 ...
##  $ BP_Status  : Factor w/ 3 levels "High","Normal",..: 3 1 1 2 2 1 3 2 1 1 ...

View data

head(q1.data)
##   Cholesterol BP_Status
## 1         221   Optimal
## 2         188      High
## 3         292      High
## 4         319    Normal
## 5         205    Normal
## 6         247      High

Question (a):

  • Perform a one-way ANOVA for Cholesterol with BP_Status as the categorical predictor.
    Comment on statistical significance of BP_Status, the amount of variation described by the model, and whether or not the equal variance assumption can be trusted.

Answer -:

  • H0:
    There is no difference in mean cholesterol between the blood pressure groups; in other words, the cholesterol means of all BP_Status groups are equal.

  • H1:
    At least one blood pressure group has a different mean cholesterol, so BP_Status has an effect on cholesterol.

Step 1) Check Balancing
# Check the balancing data

table(q1.data$BP_Status)
## 
##    High  Normal Optimal 
##     229     245      67
Result:

The “BP_Status” groups are not balanced (229, 245, and 67 observations). Because this is a one-way ANOVA with a single factor, the imbalance does not change how the sums of squares are computed, so we can continue with the ANOVA test.

Step 2) Run One-Way ANOVA
# Fit the one-way ANOVA of Cholesterol on BP_Status
q1.aov <- aov(q1.data$Cholesterol ~ q1.data$BP_Status)
summary(q1.aov)
##                    Df  Sum Sq Mean Sq F value  Pr(>F)   
## q1.data$BP_Status   2   25211   12605   6.671 0.00137 **
## Residuals         538 1016631    1890                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
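
As a quick check on how the table entries relate to each other, the F value and its p-value can be recomputed from the sums of squares and degrees of freedom printed above. This is only a sketch: the inputs are the rounded values from the table, and the variable names are illustrative.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_model <- 25211
df_model <- 2
ss_resid <- 1016631
df_resid <- 538

# Mean squares are sums of squares divided by their degrees of freedom
ms_model <- ss_model / df_model
ms_resid <- ss_resid / df_resid

# F is the ratio of the mean squares; the p-value is the upper tail of F(2, 538)
f_value <- ms_model / ms_resid
p_value <- pf(f_value, df_model, df_resid, lower.tail = FALSE)

round(c(F = f_value, p = p_value), 5)
```

The result should agree with the printed F value (6.671) and p-value (0.00137) up to rounding.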
Step 3) Check variance equality with the Levene test

We check equality of variances in two ways:
1) Levene test
2) Residual plots

Here we run the Levene test; the residual plots are examined in the diagnostic step below.

  • H0: Variances are equal (Homogeneity of Variances)
  • H1: Variances are not equal (heterogeneity of variances)
library(DescTools)  # provides LeveneTest() and ScheffeTest()
LeveneTest(q1.aov)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  0.1825 0.8332
##       538
Result:

The p-value (0.8332) is large, so we fail to reject the null hypothesis. There is no evidence against homogeneity of variances, and the equal variance assumption can be trusted.
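
To make clear what the median-centered Levene test measures, the sketch below reproduces its logic in base R: a one-way ANOVA on the absolute deviations of each observation from its group median. The data frame `sim` is simulated and purely hypothetical (only the group sizes are taken from the table in Step 1); it stands in for q1.data so the example runs without the CSV file.

```r
# Hypothetical simulated data, NOT the real heartbpchol.csv values.
# Group sizes follow the table in Step 1; means and SD are made up.
set.seed(1)
sim <- data.frame(
  Cholesterol = c(rnorm(229, 240, 43), rnorm(245, 230, 43), rnorm(67, 220, 43)),
  BP_Status   = factor(rep(c("High", "Normal", "Optimal"), times = c(229, 245, 67)))
)

# Absolute deviation of each observation from its own group median
abs_dev <- abs(sim$Cholesterol - ave(sim$Cholesterol, sim$BP_Status, FUN = median))

# A one-way ANOVA on these deviations is the median-centered Levene test:
# a large F means the typical spread differs between groups
levene_tab <- anova(lm(abs_dev ~ sim$BP_Status))
levene_tab
```

Running the same two steps on q1.data should give a table comparable to the LeveneTest() output above.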

Step 4) Diagnostic plot
  • According to the question, the normality assumption is reasonable (the Q-Q plot below is consistent with this), so the diagnostic plots are used here only to check the spread of the residuals.
par(mfrow=c(2,2))
plot(q1.aov)

Result:

In the residual plots, the spread of the residuals is similar across the groups, so the variances appear equal (the same conclusion as the Levene test).

Step 5) Calculate R-squared
q1.lm <- lm(Cholesterol ~ BP_Status, data = q1.data)
anova(q1.lm)
## Analysis of Variance Table
## 
## Response: Cholesterol
##            Df  Sum Sq Mean Sq F value   Pr(>F)   
## BP_Status   2   25211 12605.4  6.6708 0.001375 **
## Residuals 538 1016631  1889.6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(q1.lm)$r.squared
## [1] 0.02419833
Result:

R-squared is the proportion of variation in the response variable that is explained by the model. Here it is 0.0242, so BP_Status explains only about 2.4% of the variation in Cholesterol.
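
The same figure can be recovered directly from the ANOVA table, since R-squared is the model sum of squares divided by the total sum of squares. A minimal sketch, using the rounded sums of squares printed above and illustrative variable names:

```r
# Sums of squares copied from the ANOVA table above (rounded as printed)
ss_model <- 25211     # BP_Status
ss_resid <- 1016631   # Residuals

# R-squared = explained variation / total variation
r_squared <- ss_model / (ss_model + ss_resid)

round(100 * r_squared, 2)  # percent of variation explained, about 2.42
```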

Conclusion:

From Step 2, the p-value (0.00137) is well below the usual 0.05 level, so the effect of BP_Status is statistically significant and we reject the null hypothesis.

Final Result:
- At least one blood pressure group has a different mean cholesterol, so mean cholesterol differs across blood pressure status groups.


Question (b):

  • Comment on any significantly different cholesterol means as determined by the post-hoc test comparing all pairwise differences. Specifically explain what that tells us about differences in cholesterol levels across blood pressure status groups, like which group has the highest or lowest mean values of Cholesterol.

Answer -:

  • Run the Scheffe test to find which group means differ
ScheffeTest(q1.aov)
## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $`q1.data$BP_Status`
##                      diff    lwr.ci    upr.ci   pval    
## Normal-High    -11.543481 -21.35092 -1.736038 0.0159 *  
## Optimal-High   -18.646679 -33.46702 -3.826341 0.0089 ** 
## Optimal-Normal  -7.103198 -21.81359  7.607194 0.4958    
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result:
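  • Diff of “Normal-High” is negative and the p-value (0.0159) is small ==> µ(High) > µ(Normal)
  • Diff of “Optimal-High” is negative and the p-value (0.0089) is small ==> µ(High) > µ(Optimal)
  • Diff of “Optimal-Normal” is negative, but the p-value (0.4958) is large and the confidence interval contains 0 ==> no significant difference between µ(Optimal) and µ(Normal)

Conclusion:

  • µ(High) > µ(Normal) AND µ(High) > µ(Optimal)

In other words, the mean of the “High” group is significantly larger than the means of both the “Normal” and “Optimal” groups, so the High group has the highest mean cholesterol. The Optimal group has the lowest sample mean, but it is not significantly different from the Normal group.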
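
To show where the Scheffe interval and p-value come from, the sketch below rebuilds the “Normal-High” row by hand from numbers already reported: the group sizes from Step 1, the residual sum of squares from Step 2, and the estimated difference from the Scheffe output. It uses the standard Scheffe formula for a pairwise contrast; the variable names are illustrative, and small discrepancies from the printed values are due to rounding of the inputs.

```r
# Inputs taken from the output above
n_high   <- 229
n_normal <- 245
n_total  <- 541
k        <- 3                        # number of BP_Status groups
mse      <- 1016631 / (n_total - k)  # residual mean square
diff_nh  <- -11.543481               # Normal - High

# Standard error of the difference between two group means
se <- sqrt(mse * (1 / n_normal + 1 / n_high))

# Scheffe critical value: sqrt((k - 1) * F quantile) replaces the usual t quantile
crit <- sqrt((k - 1) * qf(0.95, k - 1, n_total - k))

# 95% family-wise confidence interval and Scheffe p-value
ci   <- diff_nh + c(-1, 1) * crit * se
pval <- pf((diff_nh / se)^2 / (k - 1), k - 1, n_total - k, lower.tail = FALSE)

round(c(lwr = ci[1], upr = ci[2], pval = pval), 4)
```

Because the interval lies entirely below zero, the High group mean is significantly larger than the Normal group mean, which matches the ScheffeTest() row.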


End of report