chap2

카이검정 두 범주형 변수의 차이를 검정하기위해 카이제곱검정을 할 수 있다. 이전에 불러온 데이터의 범주형 임금 데이터와 성별의 차이를 알아보자. 일단 table을 통해 빈도수를 볼 수 있다.

table(a$wage_category, a$gender2)

##           
##            남자 여자
##   고소득자   38   48
##   저소득자   62   52

이 빈도가 기대빈도와 차이가 나는지 알아보자. 검정결과 p-value가 유의하지 않다.

chisq.test(a$wage_category, a$gender2)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  a$wage_category and a$gender2
## X-squared = 1.6524, df = 1, p-value = 0.1986

범주형 변수에 따라 연속형 변수의 차이를 알아보기 위해 t검정을 해 보았다. 첫번째로 임금범주형에 따라 임금이 차이나는지 알아보았다. 사실 당연히 유의하겠지만 그래도 해보자

t.test(wage ~ wage_category , data= a , paired = F)

## 
##  Welch Two Sample t-test
## 
## data:  wage by wage_category
## t = 19.05, df = 191.88, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  757.8941 932.9628
## sample estimates:
## mean in group 고소득자 mean in group 저소득자 
##               3370.270               2524.842

두번째는 대응 표본 검정을 하였다.

t.test(a$ran1, a$ran2 , paired = T)

## 
##  Paired t-test
## 
## data:  a$ran1 and a$ran2
## t = 0.44423, df = 199, p-value = 0.6574
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1579617  0.2498244
## sample estimates:
## mean of the differences 
##              0.04593135

anovoa의 경우 aov나 anova 함수를 많이 사용한다. 아래는 onway anova다. 함수를 입력한뒤 해당 모델을 m1으로 정의 해 보았다. 제곱합과 자유도가 나온다.

m1<-aov(wage ~ wage_category3, data=a)
m1

## Call:
##    aov(formula = wage ~ wage_category3, data = a)
## 
## Terms:
##                 wage_category3 Residuals
## Sum of Squares        39287950  15439080
## Deg. of Freedom              2       197
## 
## Residual standard error: 279.9481
## Estimated effects may be unbalanced

모델 m1을 summary 함수 안에 넣어보자. 모델 m1을 summary 함수 안에 넣어보자. 일반적이 anova table이 나온다.

summary(m1)

##                 Df   Sum Sq  Mean Sq F value Pr(>F)    
## wage_category3   2 39287950 19643975   250.7 <2e-16 ***
## Residuals      197 15439080    78371                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

twoway anova도 동일하다. 상호작용항에 “:”를 넣어도 되고 "*"을 넣어도 된다.

m2<-aov(wage ~ wage_category3*gender2, data=a)
summary(m2)

##                         Df   Sum Sq  Mean Sq F value Pr(>F)    
## wage_category3           2 39287950 19643975 254.034 <2e-16 ***
## gender2                  1   271669   271669   3.513 0.0624 .  
## wage_category3:gender2   2   165729    82865   1.072 0.3445    
## Residuals              194 15001682    77328                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.1) rm ANOVA…반복측정 anova를 해보자. 추가 패키지가 필요하다.

library(DescTools)
library(reshape2)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

일단 데이터를 만든다. m은 id를 기준으로 t에 따라 반복된 데이터를 만든 것이다. 450행이 된다. 반복측정을 위해 150행 데이터를 id를 기준으로 새롭게 데이터를 만들었다. 이전에 y 에 해당하는 열 3가지를 하나로 합치기 위해서 변수 t1~t3를 변수 t의 3가지 level로 만들었다.

t1<-rnorm(150,0,1)
t2<-rnorm(150,2,1)
t3<-rnorm(150,3,1)
id<-seq(1,150)
a1<-data.frame(id,t1, t2,t3)
m<-melt(a1, id.vars = c("id"))
colnames(m)<-c("id", "t", "y")
head(m)

##   id  t          y
## 1  1 t1  2.2525273
## 2  2 t1 -0.8490278
## 3  3 t1  0.2967672
## 4  4 t1 -0.7014945
## 5  5 t1 -0.7881701
## 6  6 t1 -0.6499908

t1<-rnorm(150,0,1)
t2<-rnorm(150,2,1)
t3<-rnorm(150,3,1)
id<-seq(1,150)
a1<-data.frame(id,t1, t2,t3)
m<-melt(a1, id.vars = c("id"))
colnames(m)<-c("id", "t", "y")
head(m)

##   id  t          y
## 1  1 t1  0.5341176
## 2  2 t1 -0.8695181
## 3  3 t1 -0.1639379
## 4  4 t1  0.6203844
## 5  5 t1 -0.1480483
## 6  6 t1 -0.9590788

#one-way rmanova one-way rmanova t1~3가지 3번 측정했다고 가정해보자. 아래는 rstatix패키지를 사용 하였다. 일단 구형성(Sphericity)을 위반하지 않아 구형성을 가정한 F검정이 가능하다. 사실 위에 편차를 모두 1로 해 두어서 당연한 결과이다.

rm1<-anova_test(m,dv=y, wid = id, within = t, effect.size = "ges")
rm1

## ANOVA Table (type III tests)
## 
## $ANOVA
##   Effect DFn DFd       F        p p<.05   ges
## 1      t   2 298 330.699 2.19e-76     * 0.599
## 
## $`Mauchly's Test for Sphericity`
##   Effect    W     p p<.05
## 1      t 0.99 0.479      
## 
## $`Sphericity Corrections`
##   Effect  GGe       DF[GG]    p[GG] p[GG]<.05   HFe       DF[HF]    p[HF]
## 1      t 0.99 1.98, 295.08 1.15e-75         * 1.003 2.01, 299.03 2.19e-76
##   p[HF]<.05
## 1         *

사후검정은 본페로니를 사용하였다.

prm1<-pairwise_t_test(y ~ t,data = m ,paired = T, p.adjust.method = "bonferroni")
prm1

## # A tibble: 3 x 10
##   .y.   group1 group2    n1    n2 statistic    df        p    p.adj p.adj.signif
## * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>    <dbl>    <dbl> <chr>       
## 1 y     t1     t2       150   150    -17.6    149 3.77e-38 1.13e-37 ****        
## 2 y     t1     t3       150   150    -24.5    149 3.47e-54 1.04e-53 ****        
## 3 y     t2     t3       150   150     -8.45   149 2.39e-14 7.17e-14 ****

#wto-way rmanova 집단 변인을 추가한다음 새로운 data를 만들었다

xx1<-rep("i",50)
xx2<-rep("o",50)
xx3<-rep("u",50)
x1<-c(xx1, xx2, xx3)
a<-data.frame(id,x1,t1, t2,t3)
m<-melt(a, id.vars = c("id","x1"))
colnames(m)<-c("id","g" ,"t", "y")
head(m)

##   id g  t          y
## 1  1 i t1  0.5341176
## 2  2 i t1 -0.8695181
## 3  3 i t1 -0.1639379
## 4  4 i t1  0.6203844
## 5  5 i t1 -0.1480483
## 6  6 i t1 -0.9590788

동일하게 검정과 더불어 구형성 검정도 해보자. 뭐 당연히 구형성을 충족하여 이를 가정한 자유도와 F값을 사용해도 된다.

rm2<-anova_test(data = m, dv = y, wid = id, within = c(t), between = c(g), effect.size = "pes")
rm2

## ANOVA Table (type II tests)
## 
## $ANOVA
##   Effect DFn DFd       F        p p<.05   pes
## 1      g   2 147   0.280 7.56e-01       0.004
## 2      t   2 294 329.011 9.68e-76     * 0.691
## 3    g:t   4 294   0.620 6.49e-01       0.008
## 
## $`Mauchly's Test for Sphericity`
##   Effect     W     p p<.05
## 1      t 0.991 0.529      
## 2    g:t 0.991 0.529      
## 
## $`Sphericity Corrections`
##   Effect   GGe       DF[GG]    p[GG] p[GG]<.05   HFe       DF[HF]    p[HF]
## 1      t 0.991 1.98, 291.47 4.09e-75         * 1.005 2.01, 295.44 9.68e-76
## 2    g:t 0.991 3.97, 291.47 6.47e-01           1.005 4.02, 295.44 6.49e-01
##   p[HF]<.05
## 1         *
## 2

회귀 회귀는 lm함수를 쓰면 된다. 일단 데이터를 만들어보자. 정규분포를 따르는 변수를 만들때는 rnorm을 쓰면 된다.

x1<-rnorm(150)
x2<-x1+rnorm(150)
x3<-(x1/x2)
y<-x1+x2+x3 + rnorm(150,0,5)
d<-data.frame(x1,x2,x3,y)

그다음은 anova랑 같다. 회귀 식을 쓰면 된다.

m2<-lm(y ~ x1, data = d)
summary(m2)

## 
## Call:
## lm(formula = y ~ x1, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.5121  -2.9257  -0.2375   3.2600  31.1737 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.4233     0.5243   0.807 0.420779    
## x1            2.3164     0.5875   3.943 0.000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.422 on 148 degrees of freedom
## Multiple R-squared:  0.09505,    Adjusted R-squared:  0.08893 
## F-statistic: 15.54 on 1 and 148 DF,  p-value: 0.000124

회귀식의 항을 추가하는 방법도 동일하다

m3<-lm(y ~ x1 + x2 +x3, data = d)
summary(m3)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9852  -2.8798  -0.0202   2.9984  13.8496 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.3689     0.3986   0.926  0.35613    
## x1            1.6492     0.5989   2.754  0.00664 ** 
## x2            0.8751     0.3989   2.194  0.02985 *  
## x3            1.0829     0.1039  10.420  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.827 on 146 degrees of freedom
## Multiple R-squared:  0.4956, Adjusted R-squared:  0.4852 
## F-statistic: 47.81 on 3 and 146 DF,  p-value: < 2.2e-16

조절 변인도 가능하다.

m4<-lm(y ~ x2 +x3+  x2:x3, data = d)
summary(m4)

## 
## Call:
## lm(formula = y ~ x2 + x3 + x2:x3, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9852  -2.8798  -0.0202   2.9984  13.8496 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.3689     0.3986   0.926  0.35613    
## x2            0.8751     0.3989   2.194  0.02985 *  
## x3            1.0829     0.1039  10.420  < 2e-16 ***
## x2:x3         1.6492     0.5989   2.754  0.00664 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.827 on 146 degrees of freedom
## Multiple R-squared:  0.4956, Adjusted R-squared:  0.4852 
## F-statistic: 47.81 on 3 and 146 DF,  p-value: < 2.2e-16

chap2

LHJ

2021 8 24