Visualize the distribution of price with box plots and inspect the main descriptive statistics for each level.
# boxplot by color
boxplot(price ~ color,
        data = diamonds,
        main = "Boxplot of price by each color",
        xlab = "Factor Levels : color",
        ylab = "price")
## $D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 357 911 1838 3170 4214 18693
##
## $E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 882 1739 3077 4003 18731
##
## $F
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 342 982 2344 3725 4868 18791
##
## $G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 354 931 2242 3999 6048 18818
##
## $H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 337 984 3460 4487 5980 18803
##
## $I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1120 3730 5092 7202 18823
##
## $J
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 335 1860 4234 5324 7695 18710
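The per-level summaries above are printed without the call that produced them. Output of this shape, with one `$level` entry per group, is what `tapply()` combined with `summary` returns, so the hidden call was presumably something along these lines. The small `toy` data frame is a hypothetical stand-in so that the sketch runs on its own; with the tutorial's data, `diamonds` would take its place.

```r
# Hypothetical stand-in for `diamonds`; only `price` and `color` are needed here.
toy <- data.frame(
  price = c(357, 911, 1838, 326, 882, 1739, 342, 982, 2344),
  color = rep(c("D", "E", "F"), each = 3)
)

# One six-number summary (Min, 1st Qu., Median, Mean, 3rd Qu., Max) per level.
by_color <- tapply(toy$price, toy$color, summary)
by_color
```

Replacing `color` with `cut` or `clarity` would give the corresponding summaries for the other two factors.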
# boxplot by cut
boxplot(price ~ cut,
        data = diamonds,
        main = "Boxplot of price by each cut",
        xlab = "Factor Levels : cut",
        ylab = "price")
## $Fair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 337 2050 3282 4359 5206 18574
##
## $Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 327 1145 3050 3929 5028 18788
##
## $Ideal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 878 1810 3458 4678 18806
##
## $Premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 1046 3185 4584 6296 18823
##
## $`Very Good`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 336 912 2648 3982 5373 18818
# boxplot by clarity
boxplot(price ~ clarity,
        data = diamonds,
        main = "Boxplot of price by each clarity",
        xlab = "Factor Levels : clarity",
        ylab = "price")
## $I1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 345 2080 3344 3924 5161 18531
##
## $IF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 369 895 1080 2865 2388 18806
##
## $SI1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 1089 2822 3996 5250 18818
##
## $SI2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 2264 4072 5063 5777 18804
##
## $VS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 327 876 2005 3839 6023 18795
##
## $VS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 900 2054 3925 6024 18823
##
## $VVS1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 336 816 1093 2523 2379 18777
##
## $VVS2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 336.0 794.2 1311.0 3283.7 3638.2 18768.0
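Because every group has at least 30 observations, the normality test is skipped; with samples of that size the group means are treated as approximately normal.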
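The n >= 30 rule of thumb is easy to confirm by tabulating the factor. The following is a minimal sketch on made-up group labels and sizes; `grp` is a hypothetical stand-in for a column such as `diamonds$color`.

```r
# Made-up factor with three levels of unequal size (illustrative numbers only).
grp <- factor(rep(c("a", "b", "c"), times = c(45, 60, 120)))

n_per_group <- table(grp)        # number of observations in each level
n_per_group

# TRUE when every level satisfies the n >= 30 rule of thumb.
all(n_per_group >= 30)
```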
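When homogeneity of variance is tested for a factor with two levels, the difference between the two population variances can be tested with the F distribution; that case was handled with var.test(). For a factor with three or more levels, bartlett.test() is used instead.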
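The distinction between the two functions can be sketched on simulated data. The data frame `d`, its column names, and the chosen standard deviations are assumptions made for illustration only.

```r
set.seed(42)
# Simulated data: three groups with standard deviations 1, 2 and 4.
d <- data.frame(
  y = c(rnorm(50, sd = 1), rnorm(50, sd = 2), rnorm(50, sd = 4)),
  g = factor(rep(c("a", "b", "c"), each = 50))
)

# Two levels: F test for the ratio of two variances.
two <- droplevels(subset(d, g %in% c("a", "b")))
vt  <- var.test(y ~ g, data = two)

# Three or more levels: Bartlett's test of homogeneity of variances.
bt <- bartlett.test(y ~ g, data = d)

c(var.test = vt$p.value, bartlett = bt$p.value)
```

`var.test()` only accepts a grouping factor with exactly two levels, which is why the subset is passed through `droplevels()` first.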
# Bartlett test of the null hypothesis of equal group variances, by color
bartlett.test(price ~ color, data = diamonds)
##
## Bartlett test of homogeneity of variances
##
## data: price by color
## Bartlett's K-squared = 1402.4, df = 6, p-value < 2.2e-16
# Bartlett test of the null hypothesis of equal group variances, by cut
bartlett.test(price ~ cut, data = diamonds)
##
## Bartlett test of homogeneity of variances
##
## data: price by cut
## Bartlett's K-squared = 406.7, df = 4, p-value < 2.2e-16
# Bartlett test of the null hypothesis of equal group variances, by clarity
bartlett.test(price ~ clarity, data = diamonds)
##
## Bartlett test of homogeneity of variances
##
## data: price by clarity
## Bartlett's K-squared = 502.89, df = 7, p-value < 2.2e-16
For all three categorical variables, the test of homogeneity of variances gives p < 0.05, so the assumption of equal variances is rejected. Welch's one-way ANOVA is therefore used in this case.
To check whether at least one level of color, cut, or clarity differs in mean price, Welch's one-way ANOVA and the Brown-Forsythe one-way ANOVA are carried out.
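To show what Welch's test computes, here is a minimal sketch that rebuilds the statistic by hand and compares it with `oneway.test()`. The simulated data frame `d` and all variable names are assumptions for illustration; the formulas are the standard Welch weights, where each group is weighted by its size divided by its variance.

```r
set.seed(7)
# Simulated data: unequal means, variances and group sizes.
d <- data.frame(
  y = c(rnorm(30, 10, 1), rnorm(40, 11, 3), rnorm(50, 12, 6)),
  g = factor(rep(c("a", "b", "c"), times = c(30, 40, 50)))
)

n <- tapply(d$y, d$g, length)
m <- tapply(d$y, d$g, mean)
v <- tapply(d$y, d$g, var)
k <- length(n)

w      <- n / v                                  # weights n_i / s_i^2
m_w    <- sum(w * m) / sum(w)                    # weighted grand mean
lambda <- sum((1 - w / sum(w))^2 / (n - 1))

# Welch's F statistic and its approximate denominator degrees of freedom.
f_welch <- (sum(w * (m - m_w)^2) / (k - 1)) /
  (1 + 2 * (k - 2) * lambda / (k^2 - 1))
df2     <- (k^2 - 1) / (3 * lambda)
p_welch <- pf(f_welch, k - 1, df2, lower.tail = FALSE)

# Reference result from base R.
ref <- oneway.test(y ~ g, data = d, var.equal = FALSE)
c(by_hand = f_welch, oneway.test = unname(ref$statistic))
```

Groups with small variance get large weights, so a noisy group cannot dominate the comparison the way it can in the classical F test.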
# Welch's one-way ANOVA by color
oneway.test(price ~ color, data = diamonds, var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: price and color
## F = 280.55, num df = 6, denom df = 18316, p-value < 2.2e-16
# Welch's one-way ANOVA by cut
oneway.test(price ~ cut, data = diamonds, var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: price and cut
## F = 166.04, num df = 4.0, denom df = 9398.6, p-value < 2.2e-16
# Welch's one-way ANOVA by clarity
oneway.test(price ~ clarity, data = diamonds, var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: price and clarity
## F = 224.38, num df = 7.0, denom df = 8560.2, p-value < 2.2e-16
# install.packages("onewaytests")
library("onewaytests")
# Brown-Forsythe one-way ANOVA by color
diamonds$color <- as.factor(diamonds$color)
bf.test(price ~ color, data = diamonds)
##
## Brown-Forsythe Test (alpha = 0.05)
## -------------------------------------------------------------
## data : price and color
##
## statistic : 275.2607
## num df : 6
## denom df : 34219.87
## p.value : 0
##
## Result : Difference is statistically significant.
## -------------------------------------------------------------
# Brown-Forsythe one-way ANOVA by cut
diamonds$cut <- as.factor(diamonds$cut)
bf.test(price ~ cut, data = diamonds)
##
## Brown-Forsythe Test (alpha = 0.05)
## -------------------------------------------------------------
## data : price and cut
##
## statistic : 185.7974
## num df : 4
## denom df : 22814.69
## p.value : 5.604546e-157
##
## Result : Difference is statistically significant.
## -------------------------------------------------------------
# Brown-Forsythe one-way ANOVA by clarity
diamonds$clarity <- as.factor(diamonds$clarity)
bf.test(price ~ clarity, data = diamonds)
##
## Brown-Forsythe Test (alpha = 0.05)
## -------------------------------------------------------------
## data : price and clarity
##
## statistic : 236.6057
## num df : 7
## denom df : 28656.45
## p.value : 0
##
## Result : Difference is statistically significant.
## -------------------------------------------------------------
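A `p.value` of `0` in the Brown-Forsythe output cannot be an exact zero; presumably the tail probability is smaller than a double can represent. One way to see its order of magnitude, sketched here from the statistic and degrees of freedom reported for color, is to ask `pf()` for the upper tail on the log scale.

```r
# Values copied from the Brown-Forsythe output for color.
f_stat <- 275.2607
df1    <- 6
df2    <- 34219.87

# Upper-tail probability on the natural-log scale, which avoids underflow to 0.
log_p   <- pf(f_stat, df1, df2, lower.tail = FALSE, log.p = TRUE)

# The same quantity expressed as a base-10 exponent.
log10_p <- log_p / log(10)
log10_p
```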
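Both analyses indicate that price differs between at least one pair of levels for each of color, cut, and clarity (p < 0.05).
As with ordinary ANOVA, once Welch's or the Brown-Forsythe one-way ANOVA rejects the null hypothesis of equal means, the next step is to find out which levels are responsible for the rejection.
When the equal-variance assumption is violated, however, the applicable post hoc methods are limited to tests such as Tamhane's T2, Games-Howell, and Dunnett's T3.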
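To make concrete what such a post hoc test computes, here is a minimal base-R sketch of the Games-Howell procedure on simulated data. The data frame `d` and every variable name are assumptions for illustration, and the tutorial itself relies on PMCMRplus rather than on this hand computation. Each pair gets a Welch-type t statistic with its own Satterthwaite degrees of freedom, and the p-value comes from the studentized range distribution.

```r
set.seed(3)
# Simulated data: three groups with unequal sizes and variances.
d <- data.frame(
  y = c(rnorm(30, 10, 1), rnorm(40, 11, 3), rnorm(50, 14, 6)),
  g = factor(rep(c("a", "b", "c"), times = c(30, 40, 50)))
)

groups <- split(d$y, d$g)
n <- lengths(groups)
m <- vapply(groups, mean, numeric(1))
v <- vapply(groups, var, numeric(1))
k <- length(groups)

pairs <- combn(names(groups), 2)      # every unordered pair of levels
gh_p <- apply(pairs, 2, function(p) {
  i <- p[1]; j <- p[2]
  a <- v[[i]] / n[[i]]                # squared standard error of mean i
  b <- v[[j]] / n[[j]]                # squared standard error of mean j
  t_ij  <- abs(m[[i]] - m[[j]]) / sqrt(a + b)
  # Welch-Satterthwaite degrees of freedom for this pair
  df_ij <- (a + b)^2 / (a^2 / (n[[i]] - 1) + b^2 / (n[[j]] - 1))
  # Studentized range statistic is sqrt(2) times the t statistic
  ptukey(sqrt(2) * t_ij, nmeans = k, df = df_ij, lower.tail = FALSE)
})
names(gh_p) <- paste(pairs[1, ], pairs[2, ], sep = "-")
gh_p
```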
# install.packages("PMCMRplus")
library("PMCMRplus")
#1) Tamhane's T2 by color
tamhaneT2Test(price ~ color, data = diamonds)
##
## Pairwise comparisons using Tamhane's T2-test for unequal variances
## data: price by color
## D E F G H I
## E 0.82 - - - - -
## F < 2e-16 < 2e-16 - - - -
## G < 2e-16 < 2e-16 9.6e-06 - - -
## H < 2e-16 < 2e-16 < 2e-16 9.3e-15 - -
## I < 2e-16 < 2e-16 < 2e-16 < 2e-16 4.5e-13 -
## J < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16 0.45
##
## P value adjustment method: T2 (Sidak)
## alternative hypothesis: two.sided
# Tamhane's T2 by cut
tamhaneT2Test(price ~ cut, data = diamonds)
##
## Pairwise comparisons using Tamhane's T2-test for unequal variances
## data: price by cut
## Fair Good Ideal Premium
## Good 0.00032 - - -
## Ideal < 2e-16 1e-14 - -
## Premium 0.17541 < 2e-16 < 2e-16 -
## Very Good 0.00084 0.99449 < 2e-16 < 2e-16
##
## P value adjustment method: T2 (Sidak)
## alternative hypothesis: two.sided
# Tamhane's T2 by clarity
tamhaneT2Test(price ~ clarity, data = diamonds)
##
## Pairwise comparisons using Tamhane's T2-test for unequal variances
## data: price by clarity
## I1 IF SI1 SI2 VS1 VS2 VVS1
## IF 9.5e-13 - - - - - -
## SI1 1.0000 < 2e-16 - - - - -
## SI2 < 2e-16 < 2e-16 < 2e-16 - - - -
## VS1 1.0000 < 2e-16 0.1251 < 2e-16 - - -
## VS2 1.0000 < 2e-16 0.9896 < 2e-16 0.9837 - -
## VVS1 < 2e-16 0.0424 < 2e-16 < 2e-16 < 2e-16 < 2e-16 -
## VVS2 1.2e-06 0.0026 < 2e-16 < 2e-16 4.7e-14 < 2e-16 < 2e-16
##
## P value adjustment method: T2 (Sidak)
## alternative hypothesis: two.sided
#2) Games-Howell by color
gamesHowellTest(price ~ color, data = diamonds)
##
## Pairwise comparisons using Games-Howell test
## data: price by color
## D E F G H I
## E 0.58 - - - - -
## F 1.7e-08 2.2e-08 - - - -
## G 2.5e-08 < 2e-16 9.4e-06 - - -
## H 1.2e-08 1.9e-08 3.0e-08 3.4e-08 - -
## I < 2e-16 2.0e-11 < 2e-16 < 2e-16 < 2e-16 -
## J 3.6e-08 1.3e-08 3.0e-08 2.8e-08 4.0e-08 0.30
##
## P value adjustment method: none
## alternative hypothesis: two.sided
# Games-Howell by cut
gamesHowellTest(price ~ cut, data = diamonds)
##
## Pairwise comparisons using Games-Howell test
## data: price by cut
## Fair Good Ideal Premium
## Good 0.00031 - - -
## Ideal 2.4e-11 1.2e-11 - -
## Premium 0.13117 < 2e-16 < 2e-16 -
## Very Good 0.00080 0.92085 < 2e-16 < 2e-16
##
## P value adjustment method: none
## alternative hypothesis: two.sided
# Games-Howell by clarity
gamesHowellTest(price ~ clarity, data = diamonds)
##
## Pairwise comparisons using Games-Howell test
## data: price by clarity
## I1 IF SI1 SI2 VS1 VS2 VVS1
## IF 2.8e-11 - - - - - -
## SI1 0.9979 4.1e-11 - - - - -
## SI2 1.5e-13 < 2e-16 3.1e-08 - - - -
## VS1 0.9952 < 2e-16 0.0893 3.3e-08 - - -
## VS2 1.0000 < 2e-16 0.8396 1.4e-08 0.8140 - -
## VVS1 < 2e-16 0.0332 < 2e-16 2.4e-11 2.3e-11 5.0e-12 -
## VVS2 1.2e-06 0.0024 < 2e-16 < 2e-16 < 2e-16 < 2e-16 2.5e-11
##
## P value adjustment method: none
## alternative hypothesis: two.sided
#3) Dunnett's T3 by color
dunnettT3Test(price ~ color, data = diamonds)
##
## Pairwise comparisons using Dunnett's T3 test for multiple comparisons
## with unequal variances
## data: price by color
## D E F G H I
## E 0.82 - - - - -
## F < 2e-16 < 2e-16 - - - -
## G < 2e-16 < 2e-16 9.6e-06 - - -
## H < 2e-16 < 2e-16 < 2e-16 9.8e-15 - -
## I < 2e-16 < 2e-16 < 2e-16 < 2e-16 4.5e-13 -
## J < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16 0.45
##
## P value adjustment method: single-step
## alternative hypothesis: two.sided
# Dunnett's T3 by cut
dunnettT3Test(price ~ cut, data = diamonds)
##
## Pairwise comparisons using Dunnett's T3 test for multiple comparisons
## with unequal variances
## data: price by cut
## Fair Good Ideal Premium
## Good 0.00032 - - -
## Ideal < 2e-16 1.1e-14 - -
## Premium 0.17528 < 2e-16 < 2e-16 -
## Very Good 0.00084 0.99449 < 2e-16 < 2e-16
##
## P value adjustment method: single-step
## alternative hypothesis: two.sided
# Dunnett's T3 by clarity
dunnettT3Test(price ~ clarity, data = diamonds)
##
## Pairwise comparisons using Dunnett's T3 test for multiple comparisons
## with unequal variances
## data: price by clarity
## I1 IF SI1 SI2 VS1 VS2 VVS1
## IF 9.5e-13 - - - - - -
## SI1 1.0000 < 2e-16 - - - - -
## SI2 < 2e-16 < 2e-16 < 2e-16 - - - -
## VS1 1.0000 < 2e-16 0.1250 < 2e-16 - - -
## VS2 1.0000 < 2e-16 0.9896 < 2e-16 0.9836 - -
## VVS1 < 2e-16 0.0423 < 2e-16 < 2e-16 < 2e-16 < 2e-16 -
## VVS2 1.2e-06 0.0026 < 2e-16 < 2e-16 5.0e-14 < 2e-16 < 2e-16
##
## P value adjustment method: single-step
## alternative hypothesis: two.sided
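The three unlabeled vectors that follow are the mean price per level of color, cut, and clarity. The call is not shown, but named vectors of this form are what `tapply()` with `mean` returns. A sketch on a hypothetical stand-in data frame (`toy` and its values are made up):

```r
# Hypothetical stand-in for `diamonds`; only `price` and `cut` are needed here.
toy <- data.frame(
  price = c(300, 500, 1000, 1400, 4000, 6000),
  cut   = rep(c("Fair", "Good", "Ideal"), each = 2)
)

# Mean price per level, returned as a named vector.
mean_by_cut <- tapply(toy$price, toy$cut, mean)
mean_by_cut

# Ordering the means from highest to lowest helps when reading the
# pairwise p-value tables, since it shows the direction of each difference.
ranked <- mean_by_cut[order(mean_by_cut, decreasing = TRUE)]
ranked
```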
## D E F G H I J
## 3169.954 3076.752 3724.886 3999.136 4486.669 5091.875 5323.818
## Fair Good Ideal Premium Very Good
## 4358.758 3928.864 3457.542 4584.258 3981.760
## I1 IF SI1 SI2 VS1 VS2 VVS1 VVS2
## 3924.169 2864.839 3996.001 5063.029 3839.455 3924.989 2523.115 3283.737
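Taking all of the comparisons above together with the group means, the levels are judged to differ in the following order, where = marks levels whose difference is not significant at the 0.05 level.
\[J = I > H > G > F > D = E\]
\[\text{Premium} = \text{Fair} > \text{Very Good} = \text{Good} > \text{Ideal}\]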
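\[\text{SI2} > \text{SI1} = \text{VS2} = \text{I1} = \text{VS1} > \text{VVS2} > \text{IF} > \text{VVS1}\]
IF and VVS1 are separated here because their pairwise p-values (0.0424, 0.0332, and 0.0423 in the three tests) fall below 0.05, although only narrowly.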