Linear models and linear mixed effects models in R with linguistic applications.

1. Linear modeling

linear regression(선형 회귀)이란?: 관측치들을 x축을 독립변수, y축을 종속변수라고 했을 때의 좌표평면 상의 점으로 나타냈을 때 그 점들을 가지고 직선 방정식을 예측하는 것.
왜 하필 regression인가?: 점들을 가지고 선을 예측하기 때문에…?
어떻게 직선 방정식을 예측하는 가?: 점들의 error(residual)들이 최솟값을 갖는 직선방정식을 예측.
y=ax+b+e x: 독립변수, y: 종속변수, a: slope, b: intercept, e(epsilon): error(오차) or residual(잔차)
error(residual)란?: 예측한 회귀 방정식과 관측치 사이의 거리.
Residuals과 error의 차이:
error는 모집단에서 회귀식을 얻었을 때 나온 예측값과 관측값의 차이
Residuals는 표본집단에서 회귀식을 얻었을 때 나온 예측값과 관측값의 차이
-> 전 세계 모든 남녀의 목소리 pitch 값을 얻는 것(모집단)은 불가능하므로 표본집단을 사용. -> 때문에 error가 아닌 residual!!

Residuals: 잔차의 5 points(Min, 1Q, Median, 3Q, Max)

Coefficients: 왼쪽부터 intercept/slope, 표준오차의 평균, t-value, p-value

Multiple R-squared: cor(x, y)^2, 독립변수에 의해 설명 가능한 종속변수의 분산의 양 -> 92.1%의 종속변수가 독립변수에 의해 설명 가능
Adjusted R-squared: 독립변수에 의해 설명 가능한 종속변수의 분산의 양 + 설명을 하는데 사용한 독립변수의 개수(나이, 공손성, 방언 등의 독립변수의 개수가 많아질수록 수치가 낮아질 수 있음.)

#독립변수가 categorical variable일 때 

pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
my.df = data.frame(sex,pitch)

summary(lm(pitch ~ sex, data = my.df)) #F(1,4)=46.61, p<0.01, 여성의 pitch의 예측치가 226.33Hz이고 남성 pitch의 예측치는 그보다 98.33Hz 유의미하게 작다.

## 
## Call:
## lm(formula = pitch ~ sex, data = my.df)
## 
## Residuals:
##       1       2       3       4       5       6 
##   6.667 -22.333  15.667   2.000 -16.000  14.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   226.33      10.18  22.224 2.43e-05 ***
## sexmale       -98.33      14.40  -6.827  0.00241 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.64 on 4 degrees of freedom
## Multiple R-squared:  0.921,  Adjusted R-squared:  0.9012 
## F-statistic: 46.61 on 1 and 4 DF,  p-value: 0.002407

#독립변수가 continueous variable일 때 
age = c(14,23,35,48,52,67)
pitch2 = c(252,244,240,233,212,204)
my.df2 = data.frame(age, pitch2)
plot(age, pitch2, data = my.df2)

## Warning in plot.window(...): "data"는 그래픽 매개변수가 아닙니다

## Warning in plot.xy(xy, type, ...): "data"는 그래픽 매개변수가 아닙니다

## Warning in axis(side = side, at = at, labels = labels, ...): "data"는 그래픽 매
## 개변수가 아닙니다

## Warning in axis(side = side, at = at, labels = labels, ...): "data"는 그래픽 매
## 개변수가 아닙니다

## Warning in box(...): "data"는 그래픽 매개변수가 아닙니다

## Warning in title(...): "data"는 그래픽 매개변수가 아닙니다

abline(lm(pitch2 ~ age, data = my.df2))

summary(lm(pitch2 ~ age, data = my.df2)) #F(1,4) = 33.64, p<0.01, intercept: 267.0765/slope:-0.9099 -> 나이가 한 살 올라갈 때마다 0.9099Hz씩 pitch가 낮아진다.

## 
## Call:
## lm(formula = pitch2 ~ age, data = my.df2)
## 
## Residuals:
##      1      2      3      4      5      6 
## -2.338 -2.149  4.769  9.597 -7.763 -2.115 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 267.0765     6.8522   38.98 2.59e-06 ***
## age          -0.9099     0.1569   -5.80  0.00439 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.886 on 4 degrees of freedom
## Multiple R-squared:  0.8937, Adjusted R-squared:  0.8672 
## F-statistic: 33.64 on 1 and 4 DF,  p-value: 0.004395

#intercept는 중요하지 않음 (0살 일때 267.0765Hz인 게 말이 안 됨) 여기서 intercept가 유의미하려면 나이 독립변수의 **분산**값을 독립변수로 하여 linear model을 세우면 됨
my.df2$age.c = my.df2$age - mean(my.df2$age)
summary(lm(pitch2 ~ age.c, data = my.df2)) #표본집단의 평균 나이에서 1살 올라갈 때마다 0.9099Hz씩 pitch가 낮아진다.(분산이 0이다 = 관측치가 평균값과 같다.이므로)

## 
## Call:
## lm(formula = pitch2 ~ age.c, data = my.df2)
## 
## Residuals:
##      1      2      3      4      5      6 
## -2.338 -2.149  4.769  9.597 -7.763 -2.115 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 230.8333     2.8113   82.11 1.32e-07 ***
## age.c        -0.9099     0.1569   -5.80  0.00439 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.886 on 4 degrees of freedom
## Multiple R-squared:  0.8937, Adjusted R-squared:  0.8672 
## F-statistic: 33.64 on 1 and 4 DF,  p-value: 0.004395

linear regression을 사용하기 위한 조건

1. Linearity: 관측치는 선형적이어야 한다. (잔차를 y축으로 하는 좌표평면 상의 점들의 패턴이 없어야 한다…?)
잔차 그래프가 비선형성을 나타내는 이유:
첫째. 선형 모델에 넣은 독립변수들 중 매우 중요하게 interact하는 다른 독립변수를 넣지 않아서
둘째. log 변환과 같은 비선형적인 변환을 실행시켜서
셋째. 실제로 관측치가 비선형적이어서 ex)잔차의 그래프가 U자 모양일 경우, 이때는 독립변수의 제곱을 또 다른 독립변수로 추가할 수 있음
넷째. 범주형 데이터를 다루는 경우(잔차 그래프가 줄무늬를 띄는 경우) 이때는 다른 class의 모델을 세워야 함 ex) logistic model

plot(fitted(lm(pitch2 ~ age, data = my.df2)),residuals(lm(pitch2 ~ age, data = my.df2)))^2

## numeric(0)

abline(0,0)

2. Absence of collinearity: 둘 이상의 독립변수가 상관관계여서는 안된다.
ex) intelligence ratings ~ talking speed를 알아보기 위해 초당 말하는 음절의 개수, 단어의 개수, 문장의 개수를 모두 구하여 모두 모델에 집어 넣으면 서로의 effect를 훔치거나 누가 더 effect가 큰 지 알 수 없음
독립변수의 상관관계를 없애는 방법
첫째. 애초에 상관관계가 없는 독립변수들을 설정
둘째. 이미 상관관계가 있는 여러 측정법을 사용한 경우 가장 중요한 걸 골라서 다른 걸 떨어뜨리기
셋째. Principal Component Analysis같은 dimension-reduction techniques을 고려할 수 있음.

3. Homoskedasticity… or “absence of heteroskedasticity”: 데이터의 가변성은 예측 값의 범위에서 대략적으로 동일해야 한다.
ex) 잔차의 그래프가 동그라미 형태를 하고 있는 게 이상적인 잔차 그래프, 나팔 형태를 하고 있다면 Homoskedastic하지 않은 그래프

plot(rnorm(100), rnorm(100))

4. Normality of residuals: 잔차는 정규성을 띄어야 한다.

#잔차의 정규성을 확인하는 법:
hist(residuals(lm(pitch2 ~ age, data = my.df2))) #histogram 그리기

qqnorm(residuals(lm(pitch2 ~ age, data = my.df2))) #qqplot을 그려봤을 때 y = x의 형태를 따를 경우 정규 분포를 띈다고 볼 수 있다.
qqline(residuals(lm(pitch2 ~ age, data = my.df2)))

shapiro.test(residuals(lm(pitch2 ~ age, data = my.df2))) #shapiro wilk test를 시행했을 때 p-value가 0.05보다 크면 정규성을 띈다고 볼 수 있다.

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lm(pitch2 ~ age, data = my.df2))
## W = 0.9131, p-value = 0.4571

5. Absence of influential data points: 영향력 있는 데이터 관측치가 없어야 한다.
dfbeta(): 만약 해당 행의 관측치를 제외할 경우 intercept과 slope가 얼마나 조정되는가?
만약 해당 행의 관측치를 제외할 경우 slope의 부호가 바뀐다면 influential data points라고 할 수 있음
(원래 slope - dfbeta function을 사용해서 얻는 조정값의 결과가 slope의 부호가 바뀐 경우)

dfbeta(lm(pitch2 ~ age, data = my.df2))

##   (Intercept)         age
## 1  -3.3645662  0.06437573
## 2  -1.6119656  0.02736278
## 3   1.5481303 -0.01456709
## 4  -0.0259835  0.05092767
## 5   0.8707699 -0.06479736
## 6   1.8551808 -0.06622744

6. Independence!!!: 관측치들은 독립적이어야 한다. ex) 만약 같은 참여자에게서 여러 개의 관측치를 이끌어 낸 경우, 동일한 참여자에게서 나온 관측치들은 서로 독립적이라고 할 수 없음.

2: A very basic tutorial for performing linear mixed effects analyses

fixed effect(독립변수): 모집단의 모든 가능한 레벨을 포괄하는 변수에 의한 효과 random effect(ε): fixed effect가 모집단의 모든 가능한 레벨을 포괄하지 못하는 변수에 의한 효과

pitch ~ politeness + sex + ε -> 사람들의 pitch는 사람마다 조금씩 다르지만 그 이유를 공손성과 성별만으로 설명할 수는 없음
-> pitch ~ politeness + sex + (1|subject) + (1|item) + ε: 각 참여자와 아이템의 random effect를 넣은 모델을 세우면 됨
Std.Dev: random effect의 가변성 Residual: fixed effect에 의해 생긴 가변성

install.packages("lme4")

library(lme4)

## Loading required package: Matrix

politeness= read.csv("http://www.bodowinter.com/tutorial/politeness_data.csv")

#random effect의 표준편자를 봤을 때 item보다 subject사이의 편차가 더 큰 것을 알 수 있음 

boxplot(frequency ~ attitude*gender, col=c("white","lightgray"),politeness)

summary(lmer(frequency ~ attitude + (1|subject) + (1|scenario), data = politeness))

## Linear mixed model fit by REML ['lmerMod']
## Formula: frequency ~ attitude + (1 | subject) + (1 | scenario)
##    Data: politeness
## 
## REML criterion at convergence: 793.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2006 -0.5817 -0.0639  0.5625  3.4385 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  scenario (Intercept)  219     14.80   
##  subject  (Intercept) 4015     63.36   
##  Residual              646     25.42   
## Number of obs: 83, groups:  scenario, 7; subject, 6
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  202.588     26.754   7.572
## attitudepol  -19.695      5.585  -3.527
## 
## Correlation of Fixed Effects:
##             (Intr)
## attitudepol -0.103

#gender라는 fixed effect를 넣었기 때문에 subject random effect의 표준편자가 줄어듦
summary(lmer(frequency ~ attitude + gender + (1|subject) + (1|scenario), data=politeness))

## Linear mixed model fit by REML ['lmerMod']
## Formula: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
##    Data: politeness
## 
## REML criterion at convergence: 775.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2591 -0.6236 -0.0772  0.5388  3.4795 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  scenario (Intercept) 219.5    14.81   
##  subject  (Intercept) 615.6    24.81   
##  Residual             645.9    25.41   
## Number of obs: 83, groups:  scenario, 7; subject, 6
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  256.846     16.116  15.938
## attitudepol  -19.721      5.584  -3.532
## genderM     -108.516     21.013  -5.164
## 
## Correlation of Fixed Effects:
##             (Intr) atttdp
## attitudepol -0.173       
## genderM     -0.652  0.004

Statistical significance

Likelihood Ratio Test로 model이 유의미한지 검사하는 방법
1. effect가 있는지 의심이 가는 fixed effect를 뺀 model(null model)과 넣은 model, 두 가지의 model(full model)을 만든다.
2. anova(null model, full model)
3. 나온 결괏값에서 Chi square, p-value를 report한다.

politeness.null <- lmer(frequency ~ attitude + (1|subject) + (1|scenario), data=politeness, REML = F) #null model
politeness.full <- lmer(frequency ~ attitude + gender + (1|subject) + (1|scenario), data=politeness, REML = F) #full model
anova(politeness.null, politeness.full) #politeness affected pitch (χ2(1)=11.62, p=0.00065), lowering it by about   19.7 Hz ± 5.6   (standard   errors)

## Data: politeness
## Models:
## politeness.null: frequency ~ attitude + (1 | subject) + (1 | scenario)
## politeness.full: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
##                 npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## politeness.null    5 817.04 829.13 -403.52   807.04                         
## politeness.full    6 807.10 821.61 -397.55   795.10 11.938  1    0.00055 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Likelihood Ratio Test로 interaction model이 유의미한지 검사하는 방법
1. 위와 마찬가지로 interaction을 뺀 model(null model)과 넣은 model, 두 가지의 model(full model)을 만든다.
2. anova(null model, full model)
3. 나온 결괏값에서 Chi square, p-value를 report한다.

politeness.full <- lmer(frequency ~ attitude + gender + (1|subject) + (1|scenario), data=politeness, REML = F) #full model
politeness.inter <- lmer(frequency ~ attitude * gender + (1|subject) + (1|scenario), data=politeness, REML = F) #interaction model
anova(politeness.full, politeness.inter) #politeness affected pitch (χ2(1)=11.62, p=0.00065), lowering it by    about   19.7 Hz ± 5.6   (standard   errors)

## Data: politeness
## Models:
## politeness.full: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
## politeness.inter: frequency ~ attitude * gender + (1 | subject) + (1 | scenario)
##                  npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## politeness.full     6 807.10 821.61 -397.55   795.10                     
## politeness.inter    7 807.11 824.04 -396.55   793.11 1.9963  1     0.1577

Random slopes versus random intercepts

random intercept: 애초에 fixed effect를 적용시키지 않아도 모든 종속변수는 동일하지 않고 subject와 item에 따라 다를 것이다.
random slope: fixed effect의 영향은 subject, item마다 각각 다를 것이다.

coef(politeness.full) #random intercept model의 경우 subject와 item마다 intercept가 다른 것을 알 수 있다.

## $scenario
##   (Intercept) attitudepol   genderM
## 1    243.4860   -19.72207 -108.5173
## 2    263.3592   -19.72207 -108.5173
## 3    268.1322   -19.72207 -108.5173
## 4    277.2546   -19.72207 -108.5173
## 5    254.9319   -19.72207 -108.5173
## 6    244.8015   -19.72207 -108.5173
## 7    245.9618   -19.72207 -108.5173
## 
## $subject
##    (Intercept) attitudepol   genderM
## F1    243.3684   -19.72207 -108.5173
## F2    266.9442   -19.72207 -108.5173
## F3    260.2276   -19.72207 -108.5173
## M3    284.3535   -19.72207 -108.5173
## M4    262.0575   -19.72207 -108.5173
## M7    224.1293   -19.72207 -108.5173
## 
## attr(,"class")
## [1] "coef.mer"

#random intercept + slope -> subject와 item마다 intercept와 slope가 다른 것을 알 수 있다. 
coef(lmer(frequency ~ attitude + gender + (1+attitude|subject) + (1+attitude|scenario), data=politeness, REML=FALSE))

## boundary (singular) fit: see ?isSingular

## $scenario
##   (Intercept) attitudepol   genderM
## 1    245.2630   -20.43990 -110.8062
## 2    263.3044   -15.94550 -110.8062
## 3    269.1439   -20.63124 -110.8062
## 4    276.8329   -16.30057 -110.8062
## 5    256.0602   -19.40628 -110.8062
## 6    246.8625   -21.94848 -110.8062
## 7    248.4714   -23.55653 -110.8062
## 
## $subject
##    (Intercept) attitudepol   genderM
## F1    243.8065   -20.68446 -110.8062
## F2    266.7314   -19.16924 -110.8062
## F3    260.1482   -19.60436 -110.8062
## M3    285.6972   -17.91570 -110.8062
## M4    264.2019   -19.33643 -110.8062
## M7    227.3618   -21.77138 -110.8062
## 
## attr(,"class")
## [1] "coef.mer"

mixed model을 사용하기 위한 조건

linear regression과 같으나 마지막 항목인 independence를 지킬 필요는 없음