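Statistical Hypothesis Testing

t-test: tests whether a mean differs from a reference value or between two groups (one-sample, independent two-sample, paired).
F-test (ANOVA): compares the means of three or more groups to test whether they differ.
Chi-squared test: tests independence or goodness of fit for categorical (factor) variables.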
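The rest of this document only walks through the t-test, so here is a minimal sketch of what the other two tests might look like in R. The choice of mtcars columns (cyl, am) is an assumption made for illustration, not something taken from the examples below.

```r
# One-way ANOVA (F-test): does mean mpg differ across the three cylinder groups?
fit <- aov(mpg ~ factor(cyl), data = mtcars)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]  # p-value of the F statistic
anova_p

# Chi-squared test of independence between two categorical variables.
# Some expected cell counts are small here, so R warns that the
# approximation may be inaccurate; the warning is silenced for the sketch.
tab <- table(am = mtcars$am, cyl = mtcars$cyl)
chi <- suppressWarnings(chisq.test(tab))
chi$p.value
```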

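t-test #############################################

Assumptions of the t-test
1) Normality: the data are approximately normally distributed (with about 30 or more observations the sample mean is usually close enough to normal, by the central limit theorem)
2) Equal variances: the two groups have the same variance

Example 1> Checking equal variances with var.test(a, b)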
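With fewer than 30 observations, normality has to be checked rather than assumed. One common option is the Shapiro-Wilk test from base R, sketched below on the same vector a used in Example 1. A p-value above 0.05 means normality is not rejected; it does not prove the data are normal.

```r
# Height-like sample reused from Example 1
a <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)

# Shapiro-Wilk test: the null hypothesis is that the data come from a normal distribution
sw <- shapiro.test(a)
sw$p.value

# A normal Q-Q plot is a useful visual companion to the test
qqnorm(a)
qqline(a)
```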

a <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b <- c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)
 
var.test(a,b)

## 
##  F test to compare two variances
## 
## data:  a and b
## F = 2.1028, num df = 9, denom df = 9, p-value = 0.2834
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5223017 8.4657950
## sample estimates:
## ratio of variances 
##           2.102784

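In this example the p-value (0.2834) is greater than 0.05, so we cannot reject the null hypothesis that the two variances are equal. (The p-value is the probability, under the null hypothesis, of a result at least as extreme as the one observed.)

------- One-sample t-test -------

A one-sample t-test checks whether the mean of a single group is significantly different from, greater than, or less than a given reference value. It uses the following syntax.

Syntax: t.test(observations, alternative = direction, mu = reference value, conf.level = confidence level). alternative can be "greater", "less", or "two.sided", which test whether the mean is greater than, less than, or different from mu, respectively.

Example 2> The mean final-exam score is 25.1. Let's check whether these students' final-exam scores are significantly higher than 24.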
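To see where the F value in the output comes from, the statistic can be reproduced by hand: it is simply the ratio of the two sample variances, compared against an F distribution. This is a sketch for illustration; the variable names f_stat and p_two_sided are made up here.

```r
a <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b <- c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

# F statistic: ratio of the sample variances
f_stat <- var(a) / var(b)
df1 <- length(a) - 1
df2 <- length(b) - 1

# Two-sided p-value: twice the smaller tail probability
lower <- pf(f_stat, df1, df2)
p_two_sided <- 2 * min(lower, 1 - lower)

# Compare with var.test()
res <- var.test(a, b)
c(by_hand = f_stat, var_test = unname(res$statistic))
c(by_hand = p_two_sided, var_test = res$p.value)
```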

final = c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)
t.test(final, alternative="greater", mu=24, conf.level = .95)

## 
##  One Sample t-test
## 
## data:  final
## t = 1.0093, df = 9, p-value = 0.1696
## alternative hypothesis: true mean is greater than 24
## 95 percent confidence interval:
##  23.10218      Inf
## sample estimates:
## mean of x 
##      25.1

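The p-value is 0.1696, which is greater than 0.05, so the null hypothesis cannot be rejected. At the 5% significance level we cannot say that the students' final-exam scores are higher than 24.

Example 3> What about a reference value of 23? Here the p-value (0.04305) is below 0.05, so we can say that the final-exam scores are significantly higher than 23.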
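The t and p-value above can be reproduced from the definition of the one-sample t statistic, t = (mean - mu) / (sd / sqrt(n)), with n - 1 degrees of freedom. The sketch below is only meant to show what t.test() computes; mu0, t_stat and p_greater are names chosen for this illustration.

```r
final <- c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)
mu0 <- 24                      # reference value under the null hypothesis
n <- length(final)

# t statistic: distance of the sample mean from mu0 in standard-error units
t_stat <- (mean(final) - mu0) / (sd(final) / sqrt(n))

# One-sided p-value for alternative = "greater": upper tail of the t distribution
p_greater <- pt(t_stat, df = n - 1, lower.tail = FALSE)

res <- t.test(final, alternative = "greater", mu = mu0)
c(by_hand = t_stat, t_test = unname(res$statistic))
c(by_hand = p_greater, t_test = res$p.value)
```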

final = c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)
t.test(final, alternative="greater", mu=23, conf.level = .95)

## 
##  One Sample t-test
## 
## data:  final
## t = 1.9269, df = 9, p-value = 0.04305
## alternative hypothesis: true mean is greater than 23
## 95 percent confidence interval:
##  23.10218      Inf
## sample estimates:
## mean of x 
##      25.1

Example 4> Test whether a new manufacturing process gives higher steel strength than the existing process (reference value 12)

final = c(11,12,15,14,17,20,18,14,18,11,17,14,16,13,15,19)
t.test(final, alternative="greater", mu=12, conf.level = .95)

## 
##  One Sample t-test
## 
## data:  final
## t = 4.695, df = 15, p-value = 0.0001438
## alternative hypothesis: true mean is greater than 12
## 95 percent confidence interval:
##  14.03651      Inf
## sample estimates:
## mean of x 
##     15.25

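The p-value (0.0001438) is below 0.05, so we can conclude that steel strength is higher with the new process.

------- Independent two-sample t-test -------

This t-test judges whether the difference between the means of two different groups is significant. For the two samples to be "independent", the following conditions must hold.

A. The two samples are drawn from unrelated populations
B. There is no relationship between the observations in one sample and those in the other

Example 5> Let's use a t-test to check whether the difference in mpg between transmission types (automatic/manual) is statistically significant, using the mtcars data. First we need to check whether the two samples satisfy the equal-variance assumption.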

var.test(mtcars[mtcars$am==1,1 ], mtcars[mtcars$am==0, 1])

## 
##  F test to compare two variances
## 
## data:  mtcars[mtcars$am == 1, 1] and mtcars[mtcars$am == 0, 1]
## F = 2.5869, num df = 12, denom df = 18, p-value = 0.06691
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.934280 8.040391
## sample estimates:
## ratio of variances 
##           2.586911

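## mtcars[mtcars$am == 1, 1] selects the rows that satisfy the condition and the first column (mpg).

The p-value is 0.06691, which is greater than 0.05, so equal variances are not rejected. We can move on to the t-test itself.

There are two ways to run an independent two-sample t-test in R. The first is to pass the observations of the two groups as two separate vectors.
Form 1 syntax: t.test(observations of group 1, observations of group 2, t-test type, confidence level)

The second is to give a grouping variable within a single data frame.
Form 2 syntax: t.test(observations ~ grouping variable, data frame, t-test type, confidence level)

Example 6> Returning to the mtcars dataset, let's run the analysis. Setting var.equal = TRUE requests the pooled-variance (Student's) two-sample t-test; without it, t.test() defaults to Welch's test, which does not assume equal variances. The confidence level defaults to 0.95, so it does not have to be given explicitly.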

t.test(mtcars[mtcars$am==0,1], mtcars[mtcars$am==1,1], var.equal = TRUE, conf.level = 0.95)

## 
##  Two Sample t-test
## 
## data:  mtcars[mtcars$am == 0, 1] and mtcars[mtcars$am == 1, 1]
## t = -4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.84837  -3.64151
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

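Let's interpret the result. The group means of mpg at the bottom of the output are 17.15 for automatic (am = 0) and 24.39 for manual (am = 1), so there appears to be a difference. To judge whether this difference is significant, look at the p-value.

The p-value is 0.000285, well below 0.05, so the difference in mpg between automatic and manual cars is significant.

Example 7> Form 2 syntax: t.test(observations ~ grouping variable, data frame, t-test type, confidence level)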

t.test(mpg ~ am, data=mtcars, var.equal=TRUE, conf.level = 0.95)

## 
##  Two Sample t-test
## 
## data:  mpg by am
## t = -4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -10.84837  -3.64151
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231
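The F test in Example 5 was borderline (p = 0.06691), so it may be worth seeing how much the conclusion depends on the equal-variance assumption. The sketch below runs the same formula call with and without var.equal = TRUE; leaving it out gives Welch's t-test, R's default. The object names pooled and welch are illustrative.

```r
# Pooled-variance (Student's) t-test, as in Examples 6 and 7
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# Welch's t-test: R's default, does not assume equal variances
welch <- t.test(mpg ~ am, data = mtcars)

# Welch adjusts the degrees of freedom downward when the variances differ
data.frame(
  test    = c("pooled", "welch"),
  df      = c(unname(pooled$parameter), unname(welch$parameter)),
  p_value = c(pooled$p.value, welch$p.value)
)
```

If both versions lead to the same decision, as one would hope here, the result does not hinge on the variance assumption.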

Example 8> Do the lengths of steel processed at two factories differ?

h <-c(22,19,16,17,19,16,26,24,18,19,13,16,22,18,19,26)
s <-c(22,20,28,24,22,28,22,19,25,21,23,24,23,23,29,23)
var.test(h, s)

## 
##  F test to compare two variances
## 
## data:  h and s
## F = 1.7312, num df = 15, denom df = 15, p-value = 0.2988
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6048896 4.9549977
## sample estimates:
## ratio of variances 
##            1.73125

t.test(h, s,  paired =FALSE, var.equal = TRUE, conf.level = 0.95)

## 
##  Two Sample t-test
## 
## data:  h and s
## t = -3.5299, df = 30, p-value = 0.001364
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.511599 -1.738401
## sample estimates:
## mean of x mean of y 
##    19.375    23.500

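The F test does not reject equal variances (p = 0.2988), and the t-test shows that the mean product lengths of the two factories differ (p = 0.001364).

------- Paired t-test -------

A paired t-test is used to compare before and after measurements on the same group.

For example, it is typically used to measure the effect of a specific factor, such as whether eating 30 g of chocolate a day affects sleep time, or whether private tutoring affects school grades. Note that because a paired test compares each subject before and after, the two inputs must contain the same number of observations, in the same subject order.

Example 9> Let's make up midterm and final-exam scores for 10 students who received tutoring after the midterm, and compare them.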
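The two-step workflow used in Examples 5 to 8 (var.test first, then t.test) could be wrapped in a small helper. two_sample_t below is a hypothetical function written for this sketch, not part of base R. Pre-testing variances like this is a classroom convention; simply using Welch's test in all cases is a common alternative.

```r
# Hypothetical helper: choose var.equal from the F test, then run the t-test
two_sample_t <- function(x, y, alpha = 0.05, ...) {
  equal_var <- var.test(x, y)$p.value > alpha  # TRUE if equal variances are not rejected
  t.test(x, y, var.equal = equal_var, ...)
}

h <- c(22, 19, 16, 17, 19, 16, 26, 24, 18, 19, 13, 16, 22, 18, 19, 26)
s <- c(22, 20, 28, 24, 22, 28, 22, 19, 25, 21, 23, 24, 23, 23, 29, 23)

res <- two_sample_t(h, s)
res$method   # shows which variant was chosen
res$p.value
```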

mid = c(16, 20, 21, 22, 23, 22, 27, 25, 27, 28)
final = c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)
t.test(mid,final, paired=TRUE)

## 
##  Paired t-test
## 
## data:  mid and final
## t = -4.4721, df = 9, p-value = 0.00155
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -3.0116674 -0.9883326
## sample estimates:
## mean difference 
##              -2

The p-value is 0.00155, so the difference in mean score before and after tutoring is statistically significant.

------- Correlation analysis (cor.test(a, b)) -------
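A paired t-test is the same computation as a one-sample t-test on the per-student differences with mu = 0, which is why both vectors must line up student by student. A short sketch to confirm this on the data above:

```r
mid   <- c(16, 20, 21, 22, 23, 22, 27, 25, 27, 28)
final <- c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)

# Paired test on the two vectors
paired <- t.test(mid, final, paired = TRUE)

# One-sample test on the differences against 0
one_sample <- t.test(mid - final, mu = 0)

c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
c(paired = paired$p.value, one_sample = one_sample$p.value)
```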

cor.test(mtcars$am,mtcars$mpg )

## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$am and mtcars$mpg
## t = 4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3175583 0.7844520
## sample estimates:
##       cor 
## 0.5998324

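am and mpg show a moderate positive correlation (r = 0.60), and it is statistically significant (p = 0.000285).

Visualizing correlations with corrplot, using the corrplot package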
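The t value here (4.1061) has the same magnitude as in the two-sample t-test of Example 6. That is expected: for a two-level variable such as am, the Pearson correlation test and the pooled-variance t-test are equivalent. The sketch below derives t from r using t = r * sqrt(n - 2) / sqrt(1 - r^2); the variable names are illustrative.

```r
r <- cor(mtcars$am, mtcars$mpg)
n <- nrow(mtcars)

# t statistic implied by the correlation coefficient, with n - 2 degrees of freedom
t_from_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Pooled-variance two-sample t-test on the same data (sign depends on group order)
t_two_sample <- unname(t.test(mpg ~ am, data = mtcars, var.equal = TRUE)$statistic)

c(from_correlation = t_from_r, two_sample = t_two_sample)
```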

library(corrplot)

## corrplot 0.92 loaded

cor_data <- cor(mtcars)
head(cor_data)

##             mpg        cyl       disp         hp       drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.6999381  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.7102139  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.4487591  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.0000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.7124406  1.0000000
##             qsec         vs         am       gear       carb
## mpg   0.41868403  0.6640389  0.5998324  0.4802848 -0.5509251
## cyl  -0.59124207 -0.8108118 -0.5226070 -0.4926866  0.5269883
## disp -0.43369788 -0.7104159 -0.5912270 -0.5555692  0.3949769
## hp   -0.70822339 -0.7230967 -0.2432043 -0.1257043  0.7498125
## drat  0.09120476  0.4402785  0.7127111  0.6996101 -0.0907898
## wt   -0.17471588 -0.5549157 -0.6924953 -0.5832870  0.4276059

corrplot(cor_data)
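The default plot draws a circle for every pair of variables. As a possible variation, corrplot accepts arguments such as method, type and order to change the display; the particular combination below is just one choice for illustration. The requireNamespace() guard is added here so the sketch also runs where the package is not installed.

```r
cor_data <- cor(mtcars)

if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(
    cor_data,
    method = "number",   # print the coefficients instead of circles
    type   = "upper",    # show only the upper triangle
    order  = "hclust",   # group similar variables via hierarchical clustering
    tl.col = "black"     # label color
  )
}
```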
