1 Loading the data


# install.packages('readxl')
library(readxl)
diamonds <- read_excel("diamonds.xlsx", sheet = "diamonds")


2 Data visualization and key descriptive statistics


We visualize the price distribution with box plots and review the key descriptive statistics for each group.


2.1 Visualizing price by color


# boxplot by color

boxplot(price ~ color, 
        data = diamonds,
        main = "Boxplot of price by each color", 
        xlab = "Factor Levels : color", 
        ylab = "price")

# descriptive statistics by color
tapply(diamonds$price, diamonds$color, summary)
## $D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     357     911    1838    3170    4214   18693 
## 
## $E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     882    1739    3077    4003   18731 
## 
## $F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     342     982    2344    3725    4868   18791 
## 
## $G
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     354     931    2242    3999    6048   18818 
## 
## $H
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     337     984    3460    4487    5980   18803 
## 
## $I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1120    3730    5092    7202   18823 
## 
## $J
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     335    1860    4234    5324    7695   18710


2.2 Visualizing price by cut


# boxplot by cut

boxplot(price ~ cut, 
        data = diamonds,
        main = "Boxplot of price by each cut", 
        xlab = "Factor Levels : cut", 
        ylab = "price")

# descriptive statistics by cut
tapply(diamonds$price, diamonds$cut, summary)
## $Fair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     337    2050    3282    4359    5206   18574 
## 
## $Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     327    1145    3050    3929    5028   18788 
## 
## $Ideal
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     878    1810    3458    4678   18806 
## 
## $Premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326    1046    3185    4584    6296   18823 
## 
## $`Very Good`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     336     912    2648    3982    5373   18818


2.3 Visualizing price by clarity


# boxplot by clarity

boxplot(price ~ clarity, 
        data = diamonds,
        main = "Boxplot of price by each clarity", 
        xlab = "Factor Levels : clarity", 
        ylab = "price")

# descriptive statistics by clarity
tapply(diamonds$price, diamonds$clarity, summary)
## $I1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     345    2080    3344    3924    5161   18531 
## 
## $IF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     369     895    1080    2865    2388   18806 
## 
## $SI1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326    1089    2822    3996    5250   18818 
## 
## $SI2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326    2264    4072    5063    5777   18804 
## 
## $VS1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     327     876    2005    3839    6023   18795 
## 
## $VS2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334     900    2054    3925    6024   18823 
## 
## $VVS1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     336     816    1093    2523    2379   18777 
## 
## $VVS2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   336.0   794.2  1311.0  3283.7  3638.2 18768.0


3 Normality test


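Because the sample size is at least 30, the normality test is omitted: by the central limit theorem, the sample means are approximately normal at this size.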
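The large-sample reasoning can be checked with a small simulation. The sketch below is not part of the original analysis: it draws from a right-skewed exponential population (chosen only for illustration, as are the names `skewness` and `sample_means`) and compares the skewness of the raw values with that of means of samples of size 30.

```r
set.seed(1)

# Sample skewness (moment estimator); helper defined here for illustration only
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Raw draws from a right-skewed population (theoretical skewness = 2)
raw <- rexp(10000, rate = 1)

# Means of 10000 independent samples, each of size n = 30
n <- 30
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))

skewness(raw)           # close to 2
skewness(sample_means)  # much closer to 0 (theory: 2 / sqrt(30), about 0.37)
```

The skewness shrinks with the square root of the sample size rather than disappearing, so the approximation improves as the groups get larger.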

4 Homogeneity of variance test


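When testing homogeneity of variance for a factor with two levels, the difference between the two population variances can be tested with the F distribution; in that case the test is carried out with var.test(). When the factor has three or more levels, bartlett.test() is used instead.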
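As a minimal sketch of that choice, the example below runs both functions on simulated data (not the diamonds data). The data frame `sim` and its column names are assumptions made for illustration.

```r
set.seed(42)

# Simulated example: three groups with the same mean but different spreads
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 50)),
  value = c(rnorm(50, mean = 10, sd = 1),
            rnorm(50, mean = 10, sd = 2),
            rnorm(50, mean = 10, sd = 4))
)

# Two levels: F test for the ratio of two variances
two_levels <- droplevels(subset(sim, group %in% c("A", "B")))
f_res <- var.test(value ~ group, data = two_levels)

# Three or more levels: Bartlett's test
b_res <- bartlett.test(value ~ group, data = sim)

f_res$p.value
b_res$p.value
unname(b_res$parameter)  # degrees of freedom = number of levels - 1
```

The degrees of freedom reported by bartlett.test() equal the number of levels minus one, which matches df = 6 for the seven color levels in this analysis.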


# Bartlett test of the null hypothesis of equal group variances, by color
bartlett.test(price ~ color, data = diamonds)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  price by color
## Bartlett's K-squared = 1402.4, df = 6, p-value < 2.2e-16
# Bartlett test of the null hypothesis of equal group variances, by cut
bartlett.test(price ~ cut, data = diamonds)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  price by cut
## Bartlett's K-squared = 406.7, df = 4, p-value < 2.2e-16
# Bartlett test of the null hypothesis of equal group variances, by clarity
bartlett.test(price ~ clarity, data = diamonds)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  price by clarity
## Bartlett's K-squared = 502.89, df = 7, p-value < 2.2e-16


For all three categorical variables, the homogeneity-of-variance test on price gives p < 0.05, so the equal-variance assumption is rejected. We therefore use Welch's one-way ANOVA.
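The decision rule above can be written as a small helper. This is a sketch under assumptions: `compare_group_means` is a hypothetical name, and the data are simulated with deliberately unequal variances so that the Welch branch is taken.

```r
# Hypothetical helper: choose the one-way test from the Bartlett result
compare_group_means <- function(formula, data, alpha = 0.05) {
  equal_var <- bartlett.test(formula, data = data)$p.value >= alpha
  # var.equal = TRUE gives the classical one-way ANOVA, FALSE gives Welch's version
  oneway.test(formula, data = data, var.equal = equal_var)
}

set.seed(7)

# Simulated groups whose standard deviations differ (1, 3 and 6)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 40)),
  value = c(rnorm(40, mean = 10, sd = 1),
            rnorm(40, mean = 12, sd = 3),
            rnorm(40, mean = 15, sd = 6))
)

res <- compare_group_means(value ~ group, data = sim)
res$method   # reports whether equal variances were assumed
res$p.value
```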


5 Welch's one-way ANOVA and related tests


To check whether at least one level differs in mean price within each of color, cut, and clarity, we run Welch's one-way ANOVA and Brown-Forsythe's one-way ANOVA.
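For reference, Welch's statistic can be computed by hand and compared with oneway.test(). The sketch below uses the built-in PlantGrowth data rather than the diamonds data, and `welch_f` is an illustrative name, not a library function.

```r
# Hand computation of Welch's F statistic (helper name is illustrative)
welch_f <- function(y, g) {
  g <- factor(g)
  k <- nlevels(g)
  n <- tapply(y, g, length)
  m <- tapply(y, g, mean)
  v <- tapply(y, g, var)
  w <- n / v                      # precision weights n_i / s_i^2
  m_w <- sum(w * m) / sum(w)      # variance-weighted grand mean
  lambda <- sum((1 - w / sum(w))^2 / (n - 1)) / (k^2 - 1)
  f <- sum(w * (m - m_w)^2) / ((k - 1) * (1 + 2 * (k - 2) * lambda))
  c(F = f, num_df = k - 1, denom_df = 1 / (3 * lambda))
}

manual <- welch_f(PlantGrowth$weight, PlantGrowth$group)
builtin <- oneway.test(weight ~ group, data = PlantGrowth, var.equal = FALSE)

manual
all.equal(unname(manual["F"]), unname(builtin$statistic))
```

The denominator degrees of freedom are estimated from the group variances, which is why oneway.test() reports non-integer values for them.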

# Welch's one-way ANOVA by color
oneway.test(price ~ color, data = diamonds, var.equal = FALSE)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  price and color
## F = 280.55, num df = 6, denom df = 18316, p-value < 2.2e-16
# Welch's one-way ANOVA by cut
oneway.test(price ~ cut, data = diamonds, var.equal = FALSE)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  price and cut
## F = 166.04, num df = 4.0, denom df = 9398.6, p-value < 2.2e-16
# Welch's one-way ANOVA by clarity
oneway.test(price ~ clarity, data = diamonds, var.equal = FALSE)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  price and clarity
## F = 224.38, num df = 7.0, denom df = 8560.2, p-value < 2.2e-16
# install.packages("onewaytests")
library("onewaytests")

# Brown-Forsythe's one-way ANOVA by color
# (the grouping column is converted to a factor before calling bf.test())
diamonds$color <- as.factor(diamonds$color)
bf.test(price ~ color, data = diamonds)
## 
##   Brown-Forsythe Test (alpha = 0.05) 
## ------------------------------------------------------------- 
##   data : price and color 
## 
##   statistic  : 275.2607 
##   num df     : 6 
##   denom df   : 34219.87 
##   p.value    : 0 
## 
##   Result     : Difference is statistically significant. 
## -------------------------------------------------------------
# Brown-Forsythe's one-way ANOVA by cut
diamonds$cut <- as.factor(diamonds$cut)
bf.test(price ~ cut, data = diamonds)
## 
##   Brown-Forsythe Test (alpha = 0.05) 
## ------------------------------------------------------------- 
##   data : price and cut 
## 
##   statistic  : 185.7974 
##   num df     : 4 
##   denom df   : 22814.69 
##   p.value    : 5.604546e-157 
## 
##   Result     : Difference is statistically significant. 
## -------------------------------------------------------------
# Brown-Forsythe's one-way ANOVA by clarity
diamonds$clarity <- as.factor(diamonds$clarity)
bf.test(price ~ clarity, data = diamonds)
## 
##   Brown-Forsythe Test (alpha = 0.05) 
## ------------------------------------------------------------- 
##   data : price and clarity 
## 
##   statistic  : 236.6057 
##   num df     : 7 
##   denom df   : 28656.45 
##   p.value    : 0 
## 
##   Result     : Difference is statistically significant. 
## -------------------------------------------------------------


The results indicate that, for each of color, cut, and clarity, at least one pair of levels differs in mean price (p < 0.05).


As with the classical ANOVA, once Welch's or Brown-Forsythe's one-way ANOVA rejects the null hypothesis of equal means, we need to examine which levels are responsible for the rejection.

When the equal-variance assumption is violated, however, the available post hoc methods are limited to roughly Tamhane's T2, Games-Howell, and Dunnett's T3.
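To show what such a procedure does, here is a sketch of the Games-Howell comparison in base R, following the usual textbook formulation: an unpooled standard error, Welch-Satterthwaite degrees of freedom, and the studentized range distribution. The function name `games_howell` is illustrative, the built-in PlantGrowth data stand in for the diamonds data, and the PMCMRplus implementation used in this analysis may differ in detail.

```r
# Hand computation of Games-Howell p-values (function name is illustrative)
games_howell <- function(y, g) {
  grp <- split(y, factor(g))
  k <- length(grp)
  n <- lengths(grp)
  m <- vapply(grp, mean, numeric(1))
  v <- vapply(grp, var, numeric(1))
  pairs <- combn(names(grp), 2)           # every pair of levels
  p <- apply(pairs, 2, function(ab) {
    a <- ab[1]
    b <- ab[2]
    se2 <- v[a] / n[a] + v[b] / n[b]      # unpooled squared standard error
    t_stat <- abs(m[a] - m[b]) / sqrt(se2)
    # Welch-Satterthwaite degrees of freedom for this pair
    df <- se2^2 / ((v[a] / n[a])^2 / (n[a] - 1) + (v[b] / n[b])^2 / (n[b] - 1))
    # Upper tail of the studentized range distribution with k groups
    ptukey(t_stat * sqrt(2), nmeans = k, df = df, lower.tail = FALSE)
  })
  data.frame(group1 = pairs[1, ], group2 = pairs[2, ], p_value = unname(p))
}

gh <- games_howell(PlantGrowth$weight, PlantGrowth$group)
gh
```

Each pair gets its own standard error and degrees of freedom, which is what makes the method usable when the group variances differ.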

6 Multiple comparisons


# install.packages("PMCMRplus")
library("PMCMRplus")

#1) Tamhane's T2 by color
tamhaneT2Test(price ~ color, data = diamonds)
## 
##  Pairwise comparisons using Tamhane's T2-test for unequal variances
## data: price by color
##   D       E       F       G       H       I   
## E 0.82    -       -       -       -       -   
## F < 2e-16 < 2e-16 -       -       -       -   
## G < 2e-16 < 2e-16 9.6e-06 -       -       -   
## H < 2e-16 < 2e-16 < 2e-16 9.3e-15 -       -   
## I < 2e-16 < 2e-16 < 2e-16 < 2e-16 4.5e-13 -   
## J < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16 0.45
## 
## P value adjustment method: T2 (Sidak)
## alternative hypothesis: two.sided
#1) Tamhane's T2 by cut
tamhaneT2Test(price ~ cut, data = diamonds)
## 
##  Pairwise comparisons using Tamhane's T2-test for unequal variances
## data: price by cut
##           Fair    Good    Ideal   Premium
## Good      0.00032 -       -       -      
## Ideal     < 2e-16 1e-14   -       -      
## Premium   0.17541 < 2e-16 < 2e-16 -      
## Very Good 0.00084 0.99449 < 2e-16 < 2e-16
## 
## P value adjustment method: T2 (Sidak)
## alternative hypothesis: two.sided
#1) Tamhane's T2 by clarity
tamhaneT2Test(price ~ clarity, data = diamonds)
## 
##  Pairwise comparisons using Tamhane's T2-test for unequal variances
## data: price by clarity
##      I1      IF      SI1     SI2     VS1     VS2     VVS1   
## IF   9.5e-13 -       -       -       -       -       -      
## SI1  1.0000  < 2e-16 -       -       -       -       -      
## SI2  < 2e-16 < 2e-16 < 2e-16 -       -       -       -      
## VS1  1.0000  < 2e-16 0.1251  < 2e-16 -       -       -      
## VS2  1.0000  < 2e-16 0.9896  < 2e-16 0.9837  -       -      
## VVS1 < 2e-16 0.0424  < 2e-16 < 2e-16 < 2e-16 < 2e-16 -      
## VVS2 1.2e-06 0.0026  < 2e-16 < 2e-16 4.7e-14 < 2e-16 < 2e-16
## 
## P value adjustment method: T2 (Sidak)
## alternative hypothesis: two.sided
#2) Games-Howell by color
gamesHowellTest(price ~ color, data = diamonds)
## 
##  Pairwise comparisons using Games-Howell test
## data: price by color
##   D       E       F       G       H       I   
## E 0.58    -       -       -       -       -   
## F 1.7e-08 2.2e-08 -       -       -       -   
## G 2.5e-08 < 2e-16 9.4e-06 -       -       -   
## H 1.2e-08 1.9e-08 3.0e-08 3.4e-08 -       -   
## I < 2e-16 2.0e-11 < 2e-16 < 2e-16 < 2e-16 -   
## J 3.6e-08 1.3e-08 3.0e-08 2.8e-08 4.0e-08 0.30
## 
## P value adjustment method: none
## alternative hypothesis: two.sided
#2) Games-Howell by cut
gamesHowellTest(price ~ cut, data = diamonds)
## 
##  Pairwise comparisons using Games-Howell test
## data: price by cut
##           Fair    Good    Ideal   Premium
## Good      0.00031 -       -       -      
## Ideal     2.4e-11 1.2e-11 -       -      
## Premium   0.13117 < 2e-16 < 2e-16 -      
## Very Good 0.00080 0.92085 < 2e-16 < 2e-16
## 
## P value adjustment method: none
## alternative hypothesis: two.sided
#2) Games-Howell by clarity
gamesHowellTest(price ~ clarity, data = diamonds)
## 
##  Pairwise comparisons using Games-Howell test
## data: price by clarity
##      I1      IF      SI1     SI2     VS1     VS2     VVS1   
## IF   2.8e-11 -       -       -       -       -       -      
## SI1  0.9979  4.1e-11 -       -       -       -       -      
## SI2  1.5e-13 < 2e-16 3.1e-08 -       -       -       -      
## VS1  0.9952  < 2e-16 0.0893  3.3e-08 -       -       -      
## VS2  1.0000  < 2e-16 0.8396  1.4e-08 0.8140  -       -      
## VVS1 < 2e-16 0.0332  < 2e-16 2.4e-11 2.3e-11 5.0e-12 -      
## VVS2 1.2e-06 0.0024  < 2e-16 < 2e-16 < 2e-16 < 2e-16 2.5e-11
## 
## P value adjustment method: none
## alternative hypothesis: two.sided
#3) Dunnett's T3 by color
dunnettT3Test(price ~ color, data = diamonds)
## 
##  Pairwise comparisons using Dunnett's T3 test for multiple comparisons
##      with unequal variances
## data: price by color
##   D       E       F       G       H       I   
## E 0.82    -       -       -       -       -   
## F < 2e-16 < 2e-16 -       -       -       -   
## G < 2e-16 < 2e-16 9.6e-06 -       -       -   
## H < 2e-16 < 2e-16 < 2e-16 9.8e-15 -       -   
## I < 2e-16 < 2e-16 < 2e-16 < 2e-16 4.5e-13 -   
## J < 2e-16 < 2e-16 < 2e-16 < 2e-16 < 2e-16 0.45
## 
## P value adjustment method: single-step
## alternative hypothesis: two.sided
#3) Dunnett's T3 by cut
dunnettT3Test(price ~ cut, data = diamonds)
## 
##  Pairwise comparisons using Dunnett's T3 test for multiple comparisons
##      with unequal variances
## data: price by cut
##           Fair    Good    Ideal   Premium
## Good      0.00032 -       -       -      
## Ideal     < 2e-16 1.1e-14 -       -      
## Premium   0.17528 < 2e-16 < 2e-16 -      
## Very Good 0.00084 0.99449 < 2e-16 < 2e-16
## 
## P value adjustment method: single-step
## alternative hypothesis: two.sided
#3) Dunnett's T3 by clarity
dunnettT3Test(price ~ clarity, data = diamonds)
## 
##  Pairwise comparisons using Dunnett's T3 test for multiple comparisons
##      with unequal variances
## data: price by clarity
##      I1      IF      SI1     SI2     VS1     VS2     VVS1   
## IF   9.5e-13 -       -       -       -       -       -      
## SI1  1.0000  < 2e-16 -       -       -       -       -      
## SI2  < 2e-16 < 2e-16 < 2e-16 -       -       -       -      
## VS1  1.0000  < 2e-16 0.1250  < 2e-16 -       -       -      
## VS2  1.0000  < 2e-16 0.9896  < 2e-16 0.9836  -       -      
## VVS1 < 2e-16 0.0423  < 2e-16 < 2e-16 < 2e-16 < 2e-16 -      
## VVS2 1.2e-06 0.0026  < 2e-16 < 2e-16 5.0e-14 < 2e-16 < 2e-16
## 
## P value adjustment method: single-step
## alternative hypothesis: two.sided
# group means by color
tapply(diamonds$price, diamonds$color, mean)
##        D        E        F        G        H        I        J 
## 3169.954 3076.752 3724.886 3999.136 4486.669 5091.875 5323.818
# group means by cut
tapply(diamonds$price, diamonds$cut, mean)
##      Fair      Good     Ideal   Premium Very Good 
##  4358.758  3928.864  3457.542  4584.258  3981.760
# group means by clarity
tapply(diamonds$price, diamonds$clarity, mean)
##       I1       IF      SI1      SI2      VS1      VS2     VVS1     VVS2 
## 3924.169 2864.839 3996.001 5063.029 3839.455 3924.989 2523.115 3283.737


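Combining all of the comparisons above with the group means, the price differences are judged to follow the orderings below, where = marks a pair with no significant difference at the 5% level.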


6.1 Price differences and ordering by color


\[J=I>H>G>F>D=E\]

6.2 Price differences and ordering by cut


\[\text{Premium}=\text{Fair}>\text{Very Good}=\text{Good}>\text{Ideal}\]


6.3 Price differences and ordering by clarity


\[\text{SI2}>\text{SI1}=\text{VS2}=\text{I1}=\text{VS1}>\text{VVS2}>\text{IF}>\text{VVS1}\]
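An ordering of this kind can be assembled from the group means and the p-values of neighbouring groups. The sketch below does this for color, using the means and the Tamhane's T2 p-values printed above, with "< 2e-16" entered as 2e-16 as an upper bound. The helper `order_groups` is hypothetical, and it checks only neighbouring pairs after sorting, so the full pairwise tables should still be consulted.

```r
# Hypothetical helper: build an ordering string from means and neighbour p-values
order_groups <- function(means, p_adjacent, alpha = 0.05) {
  # means: named vector of group means
  # p_adjacent: p-values for neighbouring groups once sorted from highest to lowest mean
  ord <- names(sort(means, decreasing = TRUE))
  stopifnot(length(p_adjacent) == length(ord) - 1)
  signs <- ifelse(p_adjacent < alpha, ">", "=")
  paste0(ord[1], paste0(signs, ord[-1], collapse = ""))
}

# Mean price by color, copied from the tapply() output above
color_means <- c(D = 3169.954, E = 3076.752, F = 3724.886, G = 3999.136,
                 H = 4486.669, I = 5091.875, J = 5323.818)

# Tamhane's T2 p-values for the pairs J-I, I-H, H-G, G-F, F-D, D-E
p_neighbours <- c(0.45, 4.5e-13, 9.3e-15, 9.6e-06, 2e-16, 0.82)

color_order <- order_groups(color_means, p_neighbours)
color_order
```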