코드는 곽기영, R을 이용한 통계데이터분석, 도서출판 청람을 참고했다.

library(dplyr) #data manipulation
library(gplots) #graphics
library(car) #aplied regression
library(HH) #사후분석을 위해 statistical analysis and data display
library(faraway) #functions in Julian Faraway
library(effects) #effects display
library(heplots) #visualizing hypothesis tests in multivariate linear models

간단한 ANOVA

adhd <- data.frame(
        score=c(95,105,98,103,107,110,125,105,113,120),
        therapy=c(rep("A", 5), rep("B", 5)))

adhd

그룹이 2개이다. therapy = {A,B} 각각의 그룹에 따라 score에 차이가 있나?

#anova model
#independent: therapy
#dependent: score
adhv.aov <- aov(score ~ therapy, data=adhd)
summary(adhv.aov)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## therapy      1  422.5   422.5   9.591 0.0147 *
## Residuals    8  352.4    44.0                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

결과를 보면 F-value = 9.591, p-value < 0.05로 유의미한 차이를 보인다. 약하지만. therapy의 효과가 있다(score의 차이)고 말할 수 있을까?

adhd %>% 
        group_by(therapy) %>% 
        summarize(mean=mean(score),sd=sd(score)) %>% 
        ungroup()

평균을 보면 B의 score가 좀더 크다.

One-way ANOVA

데이터 InsectSprays를 사용한다.

str(InsectSprays)

## 'data.frame':    72 obs. of  2 variables:
##  $ count: num  10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

head(InsectSprays)

기초 통계를 살펴보자.

InsectSprays %>% 
        group_by(spray) %>% 
        summarize(length=n(),
                  mean=mean(count) %>% round(2),
                  sd=sd(count) %>% round(2)) %>% 
        ungroup()

gplots의 plotmeans()를 사용해 그룹별 평균의 차이를 시각적으로 살펴보자.

plotmeans(
        count ~ spray, #값 ~ 그룹
        data=InsectSprays, #데이터
        col="cornflowerblue", #연결선의 색
        #분포 시각화:
        barcol="tomato", #색
        barwidth=3, #막대의 넓이
        lwd=2,
        #라벨과 타이틀:
        xlab="Type of Sprays", 
        ylab="Insect Count", 
        main="Performance of Insect Sprays\nwith 95% CI of Mean")

C,D,E가 상대적으로 insect count가 낮다. 즉 살충효과가 크다.

Analysis of Variance의 결과를 보자.

sprays.aov <- aov(count ~ spray, data=InsectSprays)
sprays.aov

## Call:
##    aov(formula = count ~ spray, data = InsectSprays)
## 
## Terms:
##                    spray Residuals
## Sum of Squares  2668.833  1015.167
## Deg. of Freedom        5        66
## 
## Residual standard error: 3.921902
## Estimated effects may be unbalanced

summary(sprays.aov)

##             Df Sum Sq Mean Sq F value Pr(>F)    
## spray        5   2669   533.8    34.7 <2e-16 ***
## Residuals   66   1015    15.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

그룹간 차이가 존재한다(F-value = 34.7, p-value < 0.001). Aov model의 fit을 바탕으로 tables 분석을 수행하자.

cat("Effects:\n")

## Effects:

model.tables(sprays.aov,se=TRUE) #standard errors?=true

## Tables of effects
## 
##  spray 
## spray
##      A      B      C      D      E      F 
##  5.000  5.833 -7.417 -4.583 -6.000  7.167 
## 
## Standard errors of effects
##         spray
##         1.132
## replic.    12

각각의 경우에 있어 coefficients를 관찰하니 C,D,E가 확실히 낮다.

model.tables(sprays.aov, type="mean")

## Tables of means
## Grand mean
##     
## 9.5 
## 
##  spray 
## spray
##      A      B      C      D      E      F 
## 14.500 15.333  2.083  4.917  3.500 16.667

평균으로 살펴봐도 (그래프에서 처럼) 차이를 확인할 수 있다.

이제 사후분석을 Tukey HSD(honest significant differences)로 살펴보자. aov 결과를 입력값으로 한다.

sprays.compare <- sprays.aov %>% TukeyHSD() 
sprays.compare

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = count ~ spray, data = InsectSprays)
## 
## $spray
##            diff        lwr       upr     p adj
## B-A   0.8333333  -3.866075  5.532742 0.9951810
## C-A -12.4166667 -17.116075 -7.717258 0.0000000
## D-A  -9.5833333 -14.282742 -4.883925 0.0000014
## E-A -11.0000000 -15.699409 -6.300591 0.0000000
## F-A   2.1666667  -2.532742  6.866075 0.7542147
## C-B -13.2500000 -17.949409 -8.550591 0.0000000
## D-B -10.4166667 -15.116075 -5.717258 0.0000002
## E-B -11.8333333 -16.532742 -7.133925 0.0000000
## F-B   1.3333333  -3.366075  6.032742 0.9603075
## D-C   2.8333333  -1.866075  7.532742 0.4920707
## E-C   1.4166667  -3.282742  6.116075 0.9488669
## F-C  14.5833333   9.883925 19.282742 0.0000000
## E-D  -1.4166667  -6.116075  3.282742 0.9488669
## F-D  11.7500000   7.050591 16.449409 0.0000000
## F-E  13.1666667   8.467258 17.866075 0.0000000

C,D,E가 낮은 편이다.

plot(TukeyHSD(sprays.aov), col="blue", las=1)

그림을 그려보면 0이 포함된 것은 차이가 없는 것들이다. C-A, D-A, E-A, C-B, D-B, E-B, F-C, F-D, F-E에서 차이가 뚜렷하다.

가정 확인

car에 포함된 qqPlot()으로 정규분포 가정을 확인해보자.

qqPlot(InsectSprays$count, 
       pch=20, 
       col="deepskyblue", 
       id=FALSE,
       main="Q-Q Plot", 
       xlab="Theoretical Quantiles",
       ylab="Sample Quantiles")

정규분포를 위배하는 일부 값들이 보인다(점선 바깥). 샤피로 테스트로 통계값을 구해보자.

shapiro.test(InsectSprays$count)

## 
##  Shapiro-Wilk normality test
## 
## data:  InsectSprays$count
## W = 0.9216, p-value = 0.0002525

나머지 테스트도 해보자.

leveneTest(count ~ spray, data=InsectSprays)

bartlett.test(count ~ spray, data=InsectSprays)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  count by spray
## Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05

이분산을 가정해야 하기 때문에 손쉬운 oneway.test()함수를 사용하자. 기본값은 이분산이다. 만약 등분산 결과를 얻으려면 var.equal=TRUE로 하자.

#이분산
oneway.test(count ~ spray, data=InsectSprays)

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  count and spray
## F = 36.065, num df = 5.000, denom df = 30.043, p-value = 7.999e-12

#등분산
oneway.test(count ~ spray, data=InsectSprays, var.equal=TRUE)

## 
##  One-way analysis of means
## 
## data:  count and spray
## F = 34.702, num df = 5, denom df = 66, p-value < 2.2e-16

Two-way ANOVA

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

#데이터가 숫자로 되어 있는데 factor()로 고쳐보자.
ToothGrowth$dose <- factor(ToothGrowth$dose, 
                           levels=c(0.5, 1.0, 2.0), labels=c("low", "med", "high"))
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: Factor w/ 3 levels "low","med","high": 1 1 1 1 1 1 1 1 1 1 ...

치아 길이(len)에 대해 2차원 factor인 supp, dose를 list()로 묶고, 항목 개수, 평균, 표준편차를 계산한 테이블을 얻어보자.

with(ToothGrowth, tapply(len, list(supp, dose), length))

##    low med high
## OJ  10  10   10
## VC  10  10   10

with(ToothGrowth, tapply(len, list(supp, dose), mean))

##      low   med  high
## OJ 13.23 22.70 26.06
## VC  7.98 16.77 26.14

with(ToothGrowth, tapply(len, list(supp, dose), sd))

##         low      med     high
## OJ 4.459709 3.910953 2.655058
## VC 2.746634 2.515309 4.797731

Twoway ANOVA는 교차효과를 모델링한다(*).

ToothGrowth.aov <- aov(len ~ supp * dose, data=ToothGrowth)
summary(ToothGrowth.aov)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4  15.572 0.000231 ***
## dose         2 2426.4  1213.2  92.000  < 2e-16 ***
## supp:dose    2  108.3    54.2   4.107 0.021860 *  
## Residuals   54  712.1    13.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

supp, dose에 따라 각각 차이가 있고, 교차효과도 유의미하다.

model.tables(ToothGrowth.aov, type="means")

## Tables of means
## Grand mean
##          
## 18.81333 
## 
##  supp 
## supp
##     OJ     VC 
## 20.663 16.963 
## 
##  dose 
## dose
##    low    med   high 
## 10.605 19.735 26.100 
## 
##  supp:dose 
##     dose
## supp low   med   high 
##   OJ 13.23 22.70 26.06
##   VC  7.98 16.77 26.14

평균효과를 살펴보면 OJ가 더 길고, dose에 따라서는 high가 가장 효과적이다. OJ나 VC 모두 dose가 증가할수록 치아 길이에 긍정적이다. 다만 VC는 low와 high의 간격이 더 크다.

교차효과를 시각적으로 확인해보자.

interaction.plot(
        x.factor=ToothGrowth$dose, 
        trace.factor=ToothGrowth$supp, 
        response=ToothGrowth$len, 
        las=1, 
        type="b", 
        pch=c(1, 19),
        col=c("blue", "red"),
        trace.label="Supplement",
        xlab="Dose Level",
        ylab="Tooth Length",
        main="Interaction Plot for Tooth Growth of Guinea Pigs")

OJ의 경우 med와 high 사이의 차이 기울기가 감소된다. 다른 대안은 gplots의 plotmeans() 함수를 사용하는 것. 신뢰구간도 그려주기 때문에 유용하다.

plotmeans(len ~ interaction(supp, dose, sep=" "), data=ToothGrowth,
          connect=list(c(1,3,5), c(2,4,6)),
          col=c("red", "green3"),
          xlab="Supplement and Dose Combination", ylab="Tooth Length",
          main="Means Plot for Tooth Growth of Guinea Pigs\nwith 95% CI of Mean")

또 다른 대안으로 heplots 패키지의 coplot()이 있다.

coplot(len ~ dose | supp, data=ToothGrowth, 
       col="steelblue", pch=19, 
       panel=panel.smooth, lwd=2, col.smooth="darkorange",
       xlab="Dose Level", ylab="Tooth Length")

다른 대안은 HH 패키지의 interaction2wt()가 있다. 가장 간단하고 시각적으로 이해가 쉽다.

interaction2wt(len ~ supp * dose,  data=ToothGrowth)

사후 분석을 해보자.

TukeyHSD(ToothGrowth.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp * dose, data = ToothGrowth)
## 
## $supp
##       diff       lwr       upr     p adj
## VC-OJ -3.7 -5.579828 -1.820172 0.0002312
## 
## $dose
##            diff       lwr       upr   p adj
## med-low   9.130  6.362488 11.897512 0.0e+00
## high-low 15.495 12.727488 18.262512 0.0e+00
## high-med  6.365  3.597488  9.132512 2.7e-06
## 
## $`supp:dose`
##                  diff        lwr        upr     p adj
## VC:low-OJ:low   -5.25 -10.048124 -0.4518762 0.0242521
## OJ:med-OJ:low    9.47   4.671876 14.2681238 0.0000046
## VC:med-OJ:low    3.54  -1.258124  8.3381238 0.2640208
## OJ:high-OJ:low  12.83   8.031876 17.6281238 0.0000000
## VC:high-OJ:low  12.91   8.111876 17.7081238 0.0000000
## OJ:med-VC:low   14.72   9.921876 19.5181238 0.0000000
## VC:med-VC:low    8.79   3.991876 13.5881238 0.0000210
## OJ:high-VC:low  18.08  13.281876 22.8781238 0.0000000
## VC:high-VC:low  18.16  13.361876 22.9581238 0.0000000
## VC:med-OJ:med   -5.93 -10.728124 -1.1318762 0.0073930
## OJ:high-OJ:med   3.36  -1.438124  8.1581238 0.3187361
## VC:high-OJ:med   3.44  -1.358124  8.2381238 0.2936430
## OJ:high-VC:med   9.29   4.491876 14.0881238 0.0000069
## VC:high-VC:med   9.37   4.571876 14.1681238 0.0000058
## VC:high-OJ:high  0.08  -4.718124  4.8781238 1.0000000

TukeyHSD(ToothGrowth.aov, which=c("dose"), conf.level=0.99)

##   Tukey multiple comparisons of means
##     99% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp * dose, data = ToothGrowth)
## 
## $dose
##            diff       lwr       upr   p adj
## med-low   9.130  5.637681 12.622319 0.0e+00
## high-low 15.495 12.002681 18.987319 0.0e+00
## high-med  6.365  2.872681  9.857319 2.7e-06

ANCOVA

연속형 값이 포함된 경우 ANCOVA를 사용해보자. Faraway의 책에 사용된 성범죄 이후 여성의 외상후 장애 데이터를 사용해보자.

cpa: Childhood physical abuse on standard scale
ptsd: Post-traumatic stress disorder on standard scale
csa: Childhood sexual abuse - abused or not abused

str(sexab)

## 'data.frame':    76 obs. of  3 variables:
##  $ cpa : num  2.048 0.839 -0.241 -1.115 2.015 ...
##  $ ptsd: num  9.71 6.17 15.16 11.31 9.95 ...
##  $ csa : Factor w/ 2 levels "Abused","NotAbused": 1 1 1 1 1 1 1 1 1 1 ...

sexab %>% 
        group_by(csa) %>% 
        summarize(mean(ptsd),
                  sd(ptsd))

2개의 수치형 변수를 사용한다.

sexab.aov <- aov(ptsd ~ cpa + csa, data=sexab)
summary(sexab.aov)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cpa          1  449.8   449.8   41.98 9.46e-09 ***
## csa          1  624.0   624.0   58.25 6.91e-11 ***
## Residuals   73  782.1    10.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

효과를 각각 추려보자.

effect("csa", sexab.aov)

## 
##  csa effect
## csa
##    Abused NotAbused 
## 11.544429  5.271677

effect("cpa", sexab.aov)

## 
##  cpa effect
## cpa
##        -3      -0.2         3         6         9 
##  6.037757  7.579303  9.341070 10.992727 12.644383

HH 패키지의 ancova() 함수는 결과와 함께 시각화 자료도 제공한다.

ancova(ptsd ~ cpa + csa, data=sexab)

## Analysis of Variance Table
## 
## Response: ptsd
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## cpa        1 449.80  449.80  41.984 9.462e-09 ***
## csa        1 624.03  624.03  58.247 6.907e-11 ***
## Residuals 73 782.08   10.71                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cpa에 따라 ptsd가 함께 증가한다. 그런데 cas가 abused인 경우에 전체적으로 그 강도가 높다. 기울기에 있어서는 큰 차이가 없다.

Repeated Measures ANOVA

식물의 이산화탄소 흡수 데이터인 CO2를 사용해서 RM-ANOVA를 사용해보자.

uptake: a numeric vector of carbon dioxide uptake rates (umol/m^2 sec).
Type: a factor with levels Quebec Mississippi giving the origin of the plant
conc: a numeric vector of ambient carbon dioxide concentrations (mL/L).
Plant: an ordered factor with levels Qn1 < Qn2 < Qn3 < … < Mc1 giving a unique identifier for each plant.

CO2sub <- CO2 %>% 
        filter(Treatment=='chilled') %>% 
        mutate(conc = as.factor(conc))

conc는 반복 측정되었다.

table(CO2sub$conc)

## 
##   95  175  250  350  500  675 1000 
##    6    6    6    6    6    6    6

각 식물에 따른 차이는 통제되어야 하기 때문에 Plant와 conc가 모두 랜덤항(Error)로 처리되어야 한다. 이때 반복측정 부분은 통제항에 균일하게 분포되어야 하므로 나눠준다. 즉, Error(Plant/conc). 이제 Two-way ANOVA를 Type과 conc 별로 보자.

CO2sub.aov <- aov(uptake ~ Type * conc + Error(Plant/conc), data=CO2sub)
summary(CO2sub.aov)

## 
## Error: Plant
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## Type       1 2667.2  2667.2   60.41 0.00148 **
## Residuals  4  176.6    44.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Plant:conc
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## conc       6 1472.4  245.40   52.52 1.26e-12 ***
## Type:conc  6  428.8   71.47   15.30 3.75e-07 ***
## Residuals 24  112.1    4.67                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

결과를 보면 conc 별 차이가 있다. 또한 Type과 conc 간에 interaction effect도 존재한다.

boxplot(uptake ~ Type * conc, data=CO2sub,
        col=c("deepskyblue", "violet"), las=2, cex.axis=0.75,
        ylab="Carbon dioxide uptake rate",
        main="Effects of Plant Type and CO2 on Carbon Dioxide Uptake")
legend("topleft", inset=0.02, 
       legend=c("Quebec", "Mississippi"), fill=c("deepskyblue", "violet"))

interaction2wt(uptake ~ conc * Type, data=CO2sub)

ANOVA는 범주형 자료 분석에서 좋은 반응을 보인다.