1. Starting R and finding your way around

Exercise 1.1. You want to do a ‘Multivariate ANOVA’, but you do not know the exact command that does it. Use help.search() (or ??) to find the name of the command.

help.search("Multivariate ANOVA")

Exercise 1.2. The R function shapiro.test() implements well-known Shapiro-Wilk normality test.

How many of the initial characters you need type before R can complete the function name using tab completion?
> ’sha’까지만 쳐도 나온다.
How many R functions start with sh?
> 14개 정도?

Exercise 1.3. Perform the following actions in R:

store 20 in the variable x
store 10 in the variable m
store 5 in the variable s
subtract m from x, store the result in t
divide t by s, store the result in variable z
What is the value of the variable z?

x <- 20
m <- 10
s <- 5
t <- x-m
(z <- t/s)

## [1] 2

Exercise 1.4. Redo the calculations in Exercise 1.3 steps 4 and 5 without the use of the temporary variable t. Use parentheses if necessary to get the same result

(z <- (x-m)/s) #z-score(표준 점수)

## [1] 2

Exercise 1.5. R supports a wide range of mathematical functions. A function that is often useful in statistical analysis is the logarithm function. Using the shortcut you learned in Section 1 search for the function name that returns logarithm of a given number, and use it to calculate logarithm of 2.7.

log(2.7)

## [1] 0.9932518

Exercise 1.6. You should have obtained a value close to 1 in Exercise 1.5. This is because of the fact that R calculates natural logarithm (base e=2.718282…) by default. Often we use base-2 logarithm. The function you used above can be used to calculate base-2 logarithm as well. Using the shortcut you learned in Section 1 read the help for the logarithm function to learn how to specify which base to use. Calculate base-2 logarithm of 2.7

str(log)

## function (x, base = exp(1))

log(2.7, 2)

## [1] 1.432959

Exercise 1.7. After completing the exercises in this section, you should have a set of variables that we will not use in the future. List the variables in your R session, and delete the variables t and x (If you wish, you can delete all variables except the vectors nwords and nwords2. We will not use the other variables in the rest of the tutorial.). List your variables again to see if you have achieved the desired result

rm(x, m, r, t, z)

Exercise 1.8. Calculate the standard error of the mean for the word count data stored in nwords. You can calculate the standard deviation using the function sd(). It is easy to just count the number of elements in nwords, but you can use the length() function to get the number of elements in a vector.

s/sqrt(n): standard error of the mean(SE, 표준 오차)
s: standard division
n: size of the sample

nwords = c(3510,3508,3468,3520,3516,3525,3505,3519,3558,3487)
sd(nwords)/sqrt(length(nwords))

## [1] 7.485096

Exercise 1.9. Calculate z-scores of the values in the vector nwords, and assign it to a new vector variable named znwords. Display the resulting vector, its mean and the standard deviation
z-score(표준 점수): (x-m)/s s: standard division
x: sample
m: mean

(znwords <- (nwords - mean(nwords))/sd(nwords))

##  [1] -0.06759625 -0.15209156 -1.84199776  0.35488030  0.18588968  0.56611858
##  [7] -0.27883452  0.31263265  1.96029119 -1.03929231

mean(znwords)

## [1] 3.852463e-15

sd(znwords)

## [1] 1

Exercise 1.10. In a (hypothetical) country, four political parties got 36,35,8 and 71 seats in the parliament with 150 seats. Sort the numbers of seats in reverse order, with the largest element first and the smallest as last.

parties <- c(36, 35, 8, 71)
sort(parties, decreasing = T)

## [1] 71 36 35  8

Exercise 1.11. Use the seat counts in Exercise 1.10 to calculate the percentages of seats for each party. Use a single expression, and do not hard code the number of seats in the parliament into your expression. You may want to store the data in a variable for convenience.

parties <- c(36, 35, 8, 71)
(pro_parties <- parties/sum(parties) * 100)

## [1] 24.000000 23.333333  5.333333 47.333333

Exercise 1.12. You realized that the word counts we used in this section included the essay title and the student’s name by mistake. For each essay, we would like to discount word counts by 6,8,6,5,7,5,9,7,10,9. Store these values in a new vector variable wdiff.

wdiff <- c(6,8,6,5,7,5,9,7,10,9)

Exercise 1.13. Create a backup copy of the data stored in the vector nwords in the vector nwords2 (this may sound like serious work, but you can simply use the assignment operator). Subtract the vector wdiff from the vector nwords and store the result again in the vector nwords. Display the contents of nwords and nwords2.

nwords2 <- nwords
(nwords <- nwords - wdiff)

##  [1] 3504 3500 3462 3515 3509 3520 3496 3512 3548 3478

nwords2

##  [1] 3510 3508 3468 3520 3516 3525 3505 3519 3558 3487

Exercise 1.14. Find the differences of means of the values stored in nwords2 and nwords. Is it the same as the mean of the vector wdiff? (In other words is the difference of the means the mean of the differences?)

mean(nwords)

## [1] 3504.4

mean(nwords2)

## [1] 3511.6

mean(wdiff)

## [1] 7.2

mean(nwords2)-mean(nwords)

## [1] 7.2

2. Basic data exploration and inference

2.1 Summarizing and visualizing one-dimensional data

Exercise 2.1. Produce a stem-and-leaf plot of words. Does the plot indicate an outlier? Can you say more about the distribution of the word counts?

words.f <- c(17667, 15347, 14401, 5037, 20845, 11211, 6008, 17140, 13284, 10930)
words.m <- c(5599, 19776, 13961, 10144, 6107, 16776, 31955, 21140, 5482, 2152)
words <- c(words.f, words.m)
age <- c(24, 31, 28, 21, 29, 29, 25, 32, 30, 31, 33, 26, 22, 24, 23, 23, 20, 21, 29, 27)
stem(words) # 줄기와 잎 그림 / 줄기: 만의 자리, 잎: 천의 자리

## 
##   The decimal point is 4 digit(s) to the right of the |
## 
##   0 | 255666
##   1 | 0113445778
##   2 | 011
##   3 | 2

Exercise 2.2. Produce a histogram of words.

hist(words)

Exercise 2.3. Display a box plot of words.

boxplot(words)

Exercise 2.4. Normally, box plots are more useful for comparing two or more samples, or groups. Plot box plots for words.f, and words.m side by side. Do women talk more than men?

data.frame(words.f, words.m) -> words_df
boxplot(words_df)

2.2 Summarizing and visualizing two-dimensional data

Exercise 2.5. Find the correlation between words and age. Do people speak more as they get older? Is this a strong correlation?
correlation coefficient(상관계수): 상관계수의 절댓값이 0에 가까울수록 약하고 1에 가까울 수록 강함

cor(words, age) #상관계수의 절댓값이 약 0.26으로 두 변수간 상관관계가 약하다.

## [1] -0.2604937

Exercise 2.6. Do you expect any correlation between the variables words.f and words.m? Calculate the correlation coefficient between these two variables. Can you explain your findings?

cor(words.f, words.m) #상관계수의 절댓값이 약 0.32로 두 변수간 상관관계가 약하다.

## [1] -0.3212523

Exercise 2.7. Create a scatter plot for visualizing relationship between words and age. Which variable should be plotted along the x-axis?
y = ax + b x: 독립변수
y: 종속변수
a: intercept
b: slope

plot(age, words)

lm(words ~ age) #intercept: 25610.3 slope: -468.3

## 
## Call:
## lm(formula = words ~ age)
## 
## Coefficients:
## (Intercept)          age  
##     25610.3       -468.3

Exercise 2.8. The command abline(lm(words ~ age)) plots least-squares regression line over the existing graph. Plot the regression line. Does the regression line agree with the correlation coefficient you have calculated earlier? ▷

plot(age, words)
abline(lm(words ~ age))

Exercise 2.9. Create a scatter plot of words.f against words.m, and also draw the corresponding regression line.

plot(words.f, words.m)
abline(lm(words.m ~ words.f))

2.3 Simple inference

Exercise 2.10. Calculate the SE for the complete words data set. Store the result in a variable and display it.
standard error of the mean(SE, 표준 오차): s/sqrt(n) s: standard division
n: size of the sample

sd(words)/sqrt(length(words))

## [1] 1620.478

Exercise 2.11. Calculate the 95% confidence interval for words data set with the above approximation.
confidence interval(신뢰구간): 모수가 어느 범위 안에 있는지를 확률적으로 보여주는 방법이다.
95% 신뢰구간 구하는 공식 1: 표본의 평균 ± 2 * 표준 오차

sd(words)/sqrt(length(words)) -> s.e_words
mean(words) + 2 * s.e_words

## [1] 16489.06

mean(words) - 2 * s.e_words

## [1] 10007.14

Exercise 2.12. Calculate the 95% confidence interval for words data set using values obtained with qnorm() and qt(). Compare the results to each other and the values obtained with the approximate calculation above.
95% 신뢰구간 구하는 공식 2: 표본의 평균 + qnorm(0.025~0.975) * 표준 오차
qnorm(p): 분위수 -> 평균이 0이고 표준편차가 1인 정규분포에서 p만큼의 확률을 가지는 분위수
qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE): 자유도 df, 비중심모수가 ncp인 t-분포에서 백분위수
t 분포에서 degree of freedom: n(size of the sample)-1

mean(words) + qnorm(c(0.025, 0.975)) * s.e_words

## [1] 10072.02 16424.18

mean(words) + qt(c(0.025,0.975), df = 19) * s.e_words

## [1]  9856.401 16639.799

Exercise 2.13. Assume that it is known that average number of words spoken by a Dutch speaker per day is 16000. Using confidence intervals you have calculated above on words, test the alternative hypothesis that the number of words spoken per day is different for Dutch and English speakers. null hypothesis: 독일어 화자의 평균 발화단어와 영어 화자의 평균 발화단어는 다르지 않을 것이다.
alternative hypothesis: 독일어 화자의 평균 발화단어와 영어 화자의 평균 발화단어는 다를 것이다.
영어 화자의 발화 단어의 신뢰구간 95% 안에 독일 화자의 평균 발화 단어의 수가 포함되기 때문에 null hypothesis를 기각할 수 없다.

Exercise 2.14. Perform the same test in Exercise 2.13 using the function t.test() One sample t-test: 집단의 평균이 특정 기준과 유의미하게 다른가
t.test(x, mu)
x: sample
mu: 특정 기준

t.test(words, mu=16000) #p-value가 0.05를 초과하므로 null hypothesis를 기각할 수 없다.

## 
##  One Sample t-test
## 
## data:  words
## t = -1.6982, df = 19, p-value = 0.1058
## alternative hypothesis: true mean is not equal to 16000
## 95 percent confidence interval:
##   9856.401 16639.799
## sample estimates:
## mean of x 
##   13248.1

Exercise 2.15. A popular book claims that, on average, women speak 20000 words per day. Can you reject this hypothesis using the data (words.f) above?

t.test(words.f, mu=20000) #p-value가 0.05미만이므로 null hypothesis를 기각할 수 있다.

## 
##  One Sample t-test
## 
## data:  words.f
## t = -4.2857, df = 9, p-value = 0.002033
## alternative hypothesis: true mean is not equal to 20000
## 95 percent confidence interval:
##   9590.798 16783.202
## sample estimates:
## mean of x 
##     13187

Exercise 2.16. Finally: do women speak more than men? Welch Two Sample t-test: 두 집단의 평균이 유의미하게 다른가
t.test(x, y, alternative = ) x, y: sample alternative = “greater”: x의 평균이 y의 평균보다 유의미하게 큰가

t.test(words.f, words.m, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  words.f and words.m
## t = -0.036701, df = 13.889, p-value = 0.5144
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -5990.058       Inf
## sample estimates:
## mean of x mean of y 
##   13187.0   13309.2

3. Linear regression: a first introduction

3.1 Some preliminaries

Exercise 3.1. Plot a scatter plot for visualizing the effect of the child’s MLU on the mother’s MLU. Remember that we put the predictor (or independent variable) on the x-axis.
y = ax + b + e
e: error, 선형 회귀 모델 y = ax + b에서 실제 관측치에 생긴 error

mlu <- data.frame(
age=seq(25,48),
chi=c(1.46, 1.41, 1.66, 1.74, 1.90, 1.91, 1.85, 2.06, 2.27, 2.43, 2.70, 2.81, 2.69, 2.72, 2.64, 3.05, 3.22, 3.42, 3.70, 3.90, 3.57, 3.49, 3.66, 3.64),
mot=c(5.42, 5.69, 6.27, 6.10, 6.06, 5.98, 6.10, 6.09, 6.10, 6.14, 6.42, 6.35, 6.21, 6.07, 5.84, 6.17, 5.74, 6.11, 6.41, 5.50, 6.00, 6.90, 6.65, 6.40))

plot(mlu$chi, mlu$mot)

Exercise 3.2. Using the scatter plot you have produced in Exercise 3.1, try to draw the line that you think fits the data the best. Do not use lm() to estimate the line yet, draw multiple straight lines using abline() until you are convinced that you have the best line
abline(a, b): a: intercept, b: slope

plot(mlu$chi, mlu$mot)
abline(5.7,0.1)

Exercise 3.3. Many plotting commands in R accept a ‘col’ argument for specifying the color of the objects drawn. Similarly, you can specify the width of a line using the argument ‘lwd’. Redraw the best-fitting line from Exercise 3.2. Make sure the line is ‘red’ and it is twice as thick as the standard lines
abline(a, b, col, lwd): col: 선의 색, lwd: 선의 굵기

plot(mlu$chi, mlu$mot)
abline(5.7,0.1, col = "red", lwd = 2)

Exercise 3.4. Draw the estimated regression line in color blue over the ones you have drawn in Exercise 3.2 and Exercise 3.3. Compare it with the line you estimated.

plot(mlu$chi, mlu$mot)
abline(5.7,0.1, col = "red", lwd = 2)
abline(lm(mlu$mot~mlu$chi), col = "blue", lwd = 2)

Exercise 3.5. Find the Pearson correlation coefficient between the mother’s and the child’s MLU. Compare square of the correlation coefficient with the R-squared reported in Listing 5.
summary(lm(종속~독립))
Residuals: 잔차의 5 points(Min, 1Q, Median, 3Q, Max)
Coefficients: 왼쪽부터 intercept/slope, 표준오차의 평균, t-value, p-value
Multiple R-squared: cor(x, y)^2, 독립변수에 의해 설명 가능한 종속변수의 분산의 양

summary(lm(mlu$mot~mlu$chi))

## 
## Call:
## lm(formula = mlu$mot ~ mlu$chi)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79928 -0.14665  0.06142  0.14003  0.66232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.7133     0.2326  24.559   <2e-16 ***
## mlu$chi       0.1503     0.0839   1.791   0.0871 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3182 on 22 degrees of freedom
## Multiple R-squared:  0.1272, Adjusted R-squared:  0.08757 
## F-statistic: 3.207 on 1 and 22 DF,  p-value: 0.08708

cor(mlu$chi, mlu$mot)^2

## [1] 0.1272399

3.2 Some model diagnostics

통계 추론은 대부분이 모집단이 정규분포를 따른다는 가정하에 진행이 됩니다.

Exercise 3.6. Extract residuals from the model fitted in Listing 5,
선형 회귀 방정식이 유의미하기 위한 조건
1. error가 독립적이지 않을 것
2. 잔차가 평균이 0인 정규분포를 따를 것.: qqnorm, qqline을 그려봤을 때 y = x의 형태를 따를 경우 정규 분포를 띈다고 볼 수 있다.
3. 잔차의 분산이 연속적일 것.

summary(resid(lm(mlu$mot~mlu$chi)))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.79928 -0.14665  0.06142  0.00000  0.14003  0.66232

#visualize the residuals using a histogram,
hist(resid(lm(mlu$mot~mlu$chi)))

# Check normality of the residuals using a normal quantile-quantile plot. A normal Q-Q plot can be used to visually check whether a given sample follows the normal distribution or not. The R function qqnorm() produces a normal Q-Q plot, and qqline() plots the theoretical normal line.
qqnorm(resid(lm(mlu$mot~mlu$chi)))
qqline(resid(lm(mlu$mot~mlu$chi)))

Exercise 3.7. We wonder whether MLU is a good measure of a child’s language ability. For a normally developing child, we expect the age of the child to be a good predictor of his/her language ability. Using age as a proxy to child’s language ability, investigate the relation between the MLU and the language ability.

#1. What is the best choice of predictor and response variables for this problem?: 독립변수: 아이의 MLU 종속변수: 나이 

#2. What is the estimates of intercept and the slope, and how do you interpret them?: 아이의 mlu가 1 올라갈 때 나이가 약 8.7개월 씩 증가한다.
lm(mlu$age~mlu$chi)

## 
## Call:
## lm(formula = mlu$age ~ mlu$chi)
## 
## Coefficients:
## (Intercept)      mlu$chi  
##      13.289        8.718

#3. Produce a scatter plot of the data, and draw the regression line.
plot(mlu$chi, mlu$age)
abline(lm(mlu$age~mlu$chi))

#4. Is the slope estimated statistically significant at level 0.05?: slope의 p-value가 8.00e-16 ***으로 95% 신뢰구간 내에서 통계적으로 유의미하다. 
summary(lm(mlu$age~mlu$chi))

## 
## Call:
## lm(formula = mlu$age ~ mlu$chi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2881 -0.8938 -0.0091  0.8140  2.9785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.2892     1.1795   11.27 1.32e-10 ***
## mlu$chi       8.7177     0.4254   20.49 8.00e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.613 on 22 degrees of freedom
## Multiple R-squared:  0.9502, Adjusted R-squared:  0.948 
## F-statistic:   420 on 1 and 22 DF,  p-value: 7.999e-16

#5. Do you observe any clear outliers in the scatter plot?: 잔차의 boxplot을 그려본 결과 outlier는 발견되지 않았다.
boxplot(resid(lm(mlu$age~mlu$chi)))

#6. Extract residuals form the model, and check whether they are normally distributed or not: qqplot이 y = x의 형태를 따르므로 잔차가 정규성을 가진다고 볼 수 있다. 
qqnorm(resid(lm(mlu$age~mlu$chi)))
qqline(resid(lm(mlu$age~mlu$chi)))

4. Linear models with categorical predictors

Exercise 4.1. Create a data frame containing only the variables words and age. Name your data frame ‘talk’.

data.frame(words, age) -> talk
talk$gender <- factor(c(rep('F', 10), rep('M', 10)))
mlu.ses <- read.csv('http://coltekin.net/cagri/R/data/mlu-ses.csv')
mlu.ses$ses <- factor(mlu.ses$ses , levels=c('low', 'medium', 'high'))

Exercise 4.2. The function str() prints a readable summary of a variable (an R object). Inspect the data frame using str(). Does the summary match what you expect from the description above?

str(mlu.ses)

## 'data.frame':    60 obs. of  4 variables:
##  $ subject: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ses    : Factor w/ 3 levels "low","medium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender : chr  "male" "male" "male" "male" ...
##  $ mlu    : num  1.81 1.91 1.35 2.84 2.54 2.92 1.2 1.2 1.73 2.21 ...

Exercise 4.3. You should have seen in Exercise 4.2 that R took the variable subject (which corresponds to the arbitrary ID of participant in the study) as a numeric value. Convert subject variable in mlu.ses to a factor variable.

as.factor(mlu.ses$subject) -> mlu.ses$subject

Exercise 4.4. Choose only the mlu values from mlu.ses where the gender is male

mlu.ses[mlu.ses$mlu & mlu.ses$gender == "male",colnames(mlu.ses) == "mlu"]

##  [1] 1.81 1.91 1.35 2.84 2.54 2.92 1.20 1.20 1.73 2.21 2.96 4.18 4.09 3.53 2.63
## [16] 2.61 1.81 2.91 1.76 3.18 3.02 3.62 3.06 3.12 2.26 3.63 1.20 3.02 1.85 2.84

4.1 Comparing two means

t.test(talk$words ~ talk$gender, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  talk$words by talk$gender
## t = -0.036701, df = 13.889, p-value = 0.5144
## alternative hypothesis: true difference in means between group F and group M is greater than 0
## 95 percent confidence interval:
##  -5990.058       Inf
## sample estimates:
## mean in group F mean in group M 
##         13187.0         13309.2

4.2 Checking assumptions of t test and ANOVA

t.test와 ANOVA가 유의미하기 위한 조건
1. 관측치는 독립적일 것: 데이터를 어떻게 수집했느냐와 관련이 있음 2. 각 집단의 관측치가 (유사하게) 정규분포를 따를 것 3. 각 집단의 분산이 (거의) 같을 것

Exercise 4.5. One of the ways of visualizing the distribution of a set of data points is to plot a histogram. Plot necessary histograms to inspect whether the talk data meets the normality assumption required by the t test in Listing 6. Note that you will need to check both subsets.

hist(mlu.ses[mlu.ses$mlu & mlu.ses$gender == "male",colnames(mlu.ses) == "mlu"])

hist(mlu.ses[mlu.ses$mlu & mlu.ses$gender == "female",colnames(mlu.ses) == "mlu"]) #각 집단의 관측치가 정규분포를 따르는 지는 잘 모르겠음

Exercise 4.6. Use normal Q-Q plots to inspect whether the data meets the normality requirement.

qqnorm(mlu.ses[mlu.ses$mlu & mlu.ses$gender == "male",colnames(mlu.ses) == "mlu"])
qqline(mlu.ses[mlu.ses$mlu & mlu.ses$gender == "male",colnames(mlu.ses) == "mlu"])

qqnorm(mlu.ses[mlu.ses$mlu & mlu.ses$gender == "female",colnames(mlu.ses) == "mlu"])
qqline(mlu.ses[mlu.ses$mlu & mlu.ses$gender == "female",colnames(mlu.ses) == "mlu"])

Exercise 4.7. Verify whether Shapiro-Wilk test of normality agrees with your earlier judgments from the histograms and Q-Q plots.
Shapiro Wilk test: 관측치가 정규분포를 따르는 지 알아보는 검사, 귀무가설: 관측치는 정규분포를 따른다

shapiro.test(mlu.ses[mlu.ses$mlu & mlu.ses$gender == "male",colnames(mlu.ses) == "mlu"])

## 
##  Shapiro-Wilk normality test
## 
## data:  mlu.ses[mlu.ses$mlu & mlu.ses$gender == "male", colnames(mlu.ses) == "mlu"]
## W = 0.95627, p-value = 0.2479

shapiro.test(mlu.ses[mlu.ses$mlu & mlu.ses$gender == "female",colnames(mlu.ses) == "mlu"])

## 
##  Shapiro-Wilk normality test
## 
## data:  mlu.ses[mlu.ses$mlu & mlu.ses$gender == "female", colnames(mlu.ses) == "mlu"]
## W = 0.93422, p-value = 0.06363

#남성 집단의 관측치는 95%의 신뢰구간에서 귀무가설을 기각할 수 없으므로 정규성을 띈다.
#여성 집단의 관측치는 95%의 신뢰구간에서 귀무가설을 아슬아슬하게 기각할 수 없으므로 정규성을 띈다.(marginally significnat)

Exercise 4.8. Run an independent samples t test to test the alternative hypothesis that women talk more than man, using the words data set as before, but instruct t.test() to assume equal variances, and use the formula notation.
Welch Two Sample t-test: 두 집단의 분산이 다를 때 시행하는 t-test(default)
Two Sample t-test: 두 집단의 분산이 같을 때 시행하는 t-test(var.equal = T)

Do the t-test results with or without corrections differ substantially?: p-value가 동일한 것으로 봐서 분산의 보정에 따른 큰 차이는 발견되지 않음
Can you reason about the change in the ‘degrees of freedom’ reported by the t tests with and without the Welch correction?
Two Sample t-test: 20개의 data에서 평균을 두 개 구했으므로 degree of freedom은 18이 된다. Welch Two Sample t-test의 경우, 분산에 보정이 들어가기 때문에 degree of freedom에 변화가 일어나는데 자유도가 더 작으면 좀 더 보수적인 검정 결과가 나온다.

t.test(talk$words ~ talk$gender, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  talk$words by talk$gender
## t = -0.036701, df = 13.889, p-value = 0.5144
## alternative hypothesis: true difference in means between group F and group M is greater than 0
## 95 percent confidence interval:
##  -5990.058       Inf
## sample estimates:
## mean in group F mean in group M 
##         13187.0         13309.2

t.test(talk$words ~ talk$gender, alternative = "greater", var.equal = T)

## 
##  Two Sample t-test
## 
## data:  talk$words by talk$gender
## t = -0.036701, df = 18, p-value = 0.5144
## alternative hypothesis: true difference in means between group F and group M is greater than 0
## 95 percent confidence interval:
##  -5896.008       Inf
## sample estimates:
## mean in group F mean in group M 
##         13187.0         13309.2

Exercise 4.9. A good way of visualizing whether variances of two (or more) samples are equal is to use box plots. We have already produced the plot necessary in Exercise 2.4. Produce the same box plot, but use the formula notation this time. Do the variances of male and female subsets look equal?

boxplot(mlu.ses[mlu.ses$mlu & mlu.ses$gender == "male",colnames(mlu.ses) == "mlu"], mlu.ses[mlu.ses$mlu & mlu.ses$gender == "female",colnames(mlu.ses) == "mlu"]) #분산이 같아 보이는 지는 모르겠음

Exercise 4.10. Another (non-visual) way of checking equality of variances is to use var.test() (which basically performs an F-test). Do the variances of male and female subsets differ according to var.test()?
F-test(var.test): 두 집단의 분산이 같은 지 판단하는 테스트
귀무가설: 두 집단의 분산이 같다.

var.test(talk$words ~ talk$gender) #95% 신뢰구간 내에서 귀무가설을 반박할 수 없기 때문에 두 분산이 같다고 할 수 있다.

## 
##  F test to compare two variances
## 
## data:  talk$words by talk$gender
## F = 0.29526, num df = 9, denom df = 9, p-value = 0.08358
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.07333846 1.18871596
## sample estimates:
## ratio of variances 
##          0.2952602

Exercise 4.11. In R, the non-parametric alternatives of all flavors of the t test are performed using the function wilcox.test(). Perform a Mann-Whitney test, and compare the results with the results from the t tests performed earlier. Mann-Whitney test: t-test와 다른 점은 분산이나 정규성을 몰라서 검정이 가능하다. 단 t-test에 비해 검정력은 낮은편이다.

wilcox.test(talk$words ~ talk$gender, alternative = "greater")

## 
##  Wilcoxon rank sum exact test
## 
## data:  talk$words by talk$gender
## W = 53, p-value = 0.4267
## alternative hypothesis: true location shift is greater than 0

4.3 Single ANOVA

11 Logistic Regression

Logistic Regression이란?: generalized linear model(GLM, 잔차가 정규분포가 아닌 linear model) 중 하나로, binary or binomial response data에 적합.

Exercise 11.1. Read the CSV file http://coltekin.net/cagri/R/data/seg-large.csv into your R environment. Name the resulting data set seg (overwriting the earlier one if exists). Make sure that the variables utterance and phoneme are factor variables.

#PMI: pointwise mutual information
#SV: successor variety 
#H: boundary entropy
#RH reverse boundary entropy 
read.csv("http://coltekin.net/cagri/R/data/seg-large.csv") -> seg
as.factor(seg$utterance) -> seg$utterance

Exercise 11.2. Read the CSV file http://coltekin.net/cagri/R/data/past-tense-or.csv into your R environment. Name the resulting data set or. Add a new variable correct to the or data frame that contains the rate of correct (not overregularized) production of the past-tense forms.

#age: 개월 수
#n.past: 발화한 불규칙 동사 과거형
#n.or: 발화한 불규칙 동사 과거형 중 과잉 일반화한 단어 수 
#correct: 과잉일반화하지 않은 단어의 비율

read.csv("http://coltekin.net/cagri/R/data/past-tense-or.csv") -> or
or$correct <- 1 - (or$n.or/or$n.past)

11.1 Regression and binomial response variables

Exercise 11.3. Fit an ordinary least squares regression model predicting the rate of correct past tense production from the age of the child using the data set or.
1. Check the model assumptions through diagnostic plots, and refit a model excluding the most influential data point. Save the resulting model as or.ols.
2. Repeat the diagnostic plots. Do you see any other problems with the model diagnostics?
ordinary least squares regression: 잔차의 제곱이 최솟값으로 나오게 하는 회귀 모델

1번 plot(linearity)
- 예측값(fitted)과 잔차(residual)의 비교
- 모든 예측값에서 잔차가 비슷하게 있어야 함(가운데 점선)
- 빨간 실선은 잔차의 추세를 나타냄
- 빨간 실선이 점선에서 크게 벗어난다면 예측값에 따라 잔차가 크게 달라진다는 것

2번 plot(normality)
- 잔차가 정규분포를 따른다는 가정
- Q-Q 플롯으로 확인할 수 있음
- 잔차가 정규분포를 띄면 Q-Q 플롯에서 점들이 점선을 따라 배치되어 있어야 함

3번 plot(homogeneity of variance)
- 회귀모형을 통햬 예측된 값이 크던 작던, 모든 값들에 대하여 잔차의 분산이 동일하다는 가정
- 아래 그래프는 예측값(가로축)에 따라 잔차가 어떻게 달라지는지 보여줌
- 빨간색 실선이 수평선을 그리는 것이 이상적

4번 plot(Cook’s distance, 극단값)

plot(lm(correct ~ age, data=or))

or.ols <- lm(correct ~ age, data=or, subset=-68) #68이 영향을 많이 주는 outlier이므로 제외하고 모델을 세움

Exercise 11.4. What is the predicted correct past-tense rate for children at ages 1.5 and 8? Note that age in our data is in months.

predict(or.ols, newdata=data.frame(age=c(1.5*12, 8*12))) #확률값을 예측하는데 보통 0~1 사이를 예측하지만 linear model은 -∞ ~ +∞ 예측 가능

##         1         2 
## 0.7809471 1.1195738

Exercise 11.5. Plot the curve of the logit function defined above.
odds ratio: 성공확률이 실패확률에 비해 몇 배 더 높은지를 나타내는 수식 -> p/(1-p)
logit transform: 로지스틱 회귀분석을 위해 odds ratio에 로그를 취한 함수 -> log(p/(1-p))
log변환을 쓰는 이유: 데이터는 변하지 않으나 숫자가 큰 관측지와 작은 관측치 사이의 gap을 줄일 수 있음

p <- seq(0,1, 0.01)
plot(p/(1-p))

plot(log(p/(1-p)))
abline(0,0)

Exercise 11.6. Add a new variable, log.odds, to the or data frame which contains the log odds for the probability of correct responses.
1. fit a linear regression model predicting the transformed variable log.odds from age. Exclude the outlier we identified earlier. Save the model as or.logit.
2. Produce the diagnostic plots and inspect the results. How do you interpret the coefficients?

log(or$correct/(1 - or$correct)) -> or$log.odd
lm(data = or, log.odd ~ age) -> or.logit
plot(or.logit)

Exercise 11.7. What are the expected rate of overregularization for children at ages 1 to 10 (years) according to the model fit in Exercise 11.6?

predict(or.logit, newdata = data.frame(age=c((1:10)*12)))

##        1        2        3        4        5        6        7        8 
## 1.117592 1.558244 1.998897 2.439549 2.880202 3.320854 3.761507 4.202159 
##        9       10 
## 4.642812 5.083465

11.2 Binomial data and generalized linear models

glm(data, b ~ a, famliy = ): generalized linear model, famliy: default는 gauissian 즉, lm()과 동일

Exercise 11.8. Although we concluded that there is no overdispersion with the model presented in Listing 11, fit the same model with quasibinomial family, and compare the summary with the model or.glm above.

or.glm <- glm(cbind(n.past-n.or, n.or) ~ age, family='binomial', data=or, subset=-68)
summary(or.glm)

## 
## Call:
## glm(formula = cbind(n.past - n.or, n.or) ~ age, family = "binomial", 
##     data = or, subset = -68)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.93872  -0.54742   0.01848   0.75018   2.31268  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.558637   0.153005   3.651 0.000261 ***
## age         0.037836   0.003878   9.756  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.765  on 90  degrees of freedom
## Residual deviance:  93.036  on 89  degrees of freedom
## AIC: 385.24
## 
## Number of Fisher Scoring iterations: 4

or.glm.q <- glm(cbind(n.past-n.or, n.or) ~ age, family='quasibinomial', data=or, subset=-68)
summary(or.glm.q)

## 
## Call:
## glm(formula = cbind(n.past - n.or, n.or) ~ age, family = "quasibinomial", 
##     data = or, subset = -68)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.93872  -0.54742   0.01848   0.75018   2.31268  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.558637   0.152026   3.675 0.000407 ***
## age         0.037836   0.003853   9.819 7.58e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 0.9872408)
## 
##     Null deviance: 190.765  on 90  degrees of freedom
## Residual deviance:  93.036  on 89  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Exercise 11.9. What are the expected rate of overregularization for children at ages 1 to 10 (years) according to the model or.glm from Listing 11?

1 - predict(or.glm, type='response', newdata=data.frame(age=12*(1:10)))

##           1           2           3           4           5           6 
## 0.266457804 0.187444516 0.127779805 0.085117305 0.055787702 0.036164953 
##           7           8           9          10 
## 0.023274172 0.014907166 0.009518764 0.006066081

Exercise 11.10. Plot the scatter plot of observed correct past tense rates against age in the or data. Plot the curve representing the expected correct past tense rate against the age range in the data using the model or.glm.

plot(correct ~ age, data = or)
x <- min(or$age):max(or$age)
y <- predict(or.glm, type='response', newdata=data.frame(age=x))
lines(x, y)

11.3 Binary data

Exercise 11.11. Using the seg data set, fit a logistic regression model predicting whether there is a boundary or not for a given pmi value. Save the model as seg.pmi. Plot the observed data, and the prediction curve on the same graph.

glm(data = seg, boundary ~ pmi, family = "binomial") -> seg.pmi
plot(boundary ~ pmi, data = seg)
x <- seq(min(seg$pmi), max(seg$pmi), by=0.2)
y <- predict(seg.pmi, type='response', newdata=data.frame(pmi=x)) 
lines(x, y)

★. Linear models and linear mixed effects models in R with linguistic applications.

1. Linear modeling

linear regression(선형 회귀)이란?: 관측치들을 x축을 독립변수, y축을 종속변수라고 했을 때의 좌표평면 상의 점으로 나타냈을 때 그 점들을 가지고 직선 방정식을 예측하는 것.

왜 하필 regression인가?: 점들을 가지고 선을 예측하기 때문에…?

어떻게 직선 방정식을 예측하는 가?: 점들의 error(residual)들이 최솟값을 갖는 직선방정식을 예측.

y=ax+b+e x: 독립변수, y: 종속변수, a: slope, b: intercept, e(epsilon): error(오차) or residual(잔차)

dependent variable: 연구자가 측정한 관측값
Independent variable: 독립변수, 설명 변수

error(residual)란?: 예측한 회귀 방정식과 관측치 사이의 거리.

Residuals과 error의 차이:
error는 모집단에서 회귀식을 얻었을 때 나온 예측값과 관측값의 차이
Residuals는 표본집단에서 회귀식을 얻었을 때 나온 예측값과 관측값의 차이
-> 전 세계 모든 남녀의 목소리 pitch 값을 얻는 것(모집단)은 불가능하므로 표본집단을 사용. -> 때문에 error가 아닌 residual!!

Residuals: 잔차의 5 points(Min, 1Q, Median, 3Q, Max)

Coefficients: 왼쪽부터 intercept/slope, 표준오차의 평균, t-value, p-value

Multiple R-squared: cor(x, y)^2, 독립변수에 의해 설명 가능한 종속변수의 분산의 양 Adjusted R-squared: 독립변수에 의해 설명 가능한 종속변수의 분산의 양 + 설명을 하는데 사용한 독립변수의 개수(나이, 공손성, 방언 등의 독립변수의 개수가 많아질수록 수치가 낮아질 수 있음.)

## pitch predict by sex?

#독립변수가 categorical variable일 때 

pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
my.df = data.frame(sex,pitch)

summary(lm(pitch ~ sex, data = my.df))  #y = -98.33*x + 226.33 + e

## 
## Call:
## lm(formula = pitch ~ sex, data = my.df)
## 
## Residuals:
##       1       2       3       4       5       6 
##   6.667 -22.333  15.667   2.000 -16.000  14.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   226.33      10.18  22.224 2.43e-05 ***
## sexmale       -98.33      14.40  -6.827  0.00241 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.64 on 4 degrees of freedom
## Multiple R-squared:  0.921,  Adjusted R-squared:  0.9012 
## F-statistic: 46.61 on 1 and 4 DF,  p-value: 0.002407

#F(1,4)=46.61, p<0.01, 여성의 pitch의 예측치가 226.33Hz이고 남성 pitch의 예측치는 그보다 98.33Hz 유의미하게 작다.  
#자유도 값이 두 개 나오는 이유: 집단 내 자유도, 집단 간 자유도   
#Multiple R-squared를 봤을 때 92.1%의 종속변수가 독립변수에 의해 설명 가능

#독립변수가 continueous variable일 때 
age = c(14,23,35,48,52,67)
pitch2 = c(252,244,240,233,212,204)
my.df2 = data.frame(age, pitch2)
plot(age, pitch2, data = my.df2)
abline(lm(pitch2 ~ age, data = my.df2))

summary(lm(pitch2 ~ age, data = my.df2)) # y = -0.9099*(age) + 267.0965 + e

## 
## Call:
## lm(formula = pitch2 ~ age, data = my.df2)
## 
## Residuals:
##      1      2      3      4      5      6 
## -2.338 -2.149  4.769  9.597 -7.763 -2.115 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 267.0765     6.8522   38.98 2.59e-06 ***
## age          -0.9099     0.1569   -5.80  0.00439 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.886 on 4 degrees of freedom
## Multiple R-squared:  0.8937, Adjusted R-squared:  0.8672 
## F-statistic: 33.64 on 1 and 4 DF,  p-value: 0.004395

#F(1,4) = 33.64, p<0.01, intercept: 267.0765/slope:-0.9099 -> 나이가 한 살 올라갈 때마다 0.9099Hz씩 pitch가 낮아진다.  
#intercept는 중요하지 않음 (0살 일때 267.0765Hz인 게 말이 안 됨) 여기서 intercept가 유의미하려면 나이 독립변수의 **분산**값을 독립변수로 하여 linear model을 세우면 됨
my.df2$age.c = my.df2$age - mean(my.df2$age) # y = -0.9099*(age.c) + 230.0965 + e  
summary(lm(pitch2 ~ age.c, data = my.df2)) #표본집단의 평균 나이에서 1살 올라갈 때마다 0.9099Hz씩 pitch가 낮아진다.(분산이 0이다 = 관측치가 평균값과 같다.이므로)

## 
## Call:
## lm(formula = pitch2 ~ age.c, data = my.df2)
## 
## Residuals:
##      1      2      3      4      5      6 
## -2.338 -2.149  4.769  9.597 -7.763 -2.115 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 230.8333     2.8113   82.11 1.32e-07 ***
## age.c        -0.9099     0.1569   -5.80  0.00439 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.886 on 4 degrees of freedom
## Multiple R-squared:  0.8937, Adjusted R-squared:  0.8672 
## F-statistic: 33.64 on 1 and 4 DF,  p-value: 0.004395

linear regression을 사용하기 위한 조건

1. Linearity: 관측치는 선형적이어야 한다. (잔차를 y축으로 하는 좌표평면 상의 점들의 패턴이 없어야 한다…?)
잔차 그래프가 비선형성을 나타내는 이유:
첫째. 선형 모델에 넣은 독립변수들 중 매우 중요하게 interact하는 다른 독립변수를 넣지 않아서
둘째. log 변환과 같은 비선형적인 변환을 실행시켜서
셋째. 실제로 관측치가 비선형적이어서 ex)잔차의 그래프가 U자 모양일 경우, 이때는 독립변수의 제곱을 또 다른 독립변수로 추가할 수 있음
넷째. 범주형 데이터를 다루는 경우(잔차 그래프가 줄무늬를 띄는 경우) 이때는 다른 class의 모델을 세워야 함 ex) logistic model

plot(fitted(lm(pitch2 ~ age, data = my.df2)),residuals(lm(pitch2 ~ age, data = my.df2)))^2

## numeric(0)

abline(0,0)

2. Absence of collinearity: 둘 이상의 독립변수가 상관관계여서는 안된다.
ex) intelligence ratings ~ talking speed를 알아보기 위해 초당 말하는 음절의 개수, 단어의 개수, 문장의 개수를 모두 구하여 모두 모델에 집어 넣으면 서로의 effect를 훔치거나 누가 더 effect가 큰 지 알 수 없음
독립변수의 상관관계를 없애는 방법
첫째. 애초에 상관관계가 없는 독립변수들을 설정
둘째. 이미 상관관계가 있는 여러 측정법을 사용한 경우 가장 중요한 걸 골라서 다른 걸 떨어뜨리기
셋째. Principal Component Analysis같은 dimension-reduction techniques을 고려할 수 있음.

3. Homoskedasticity… or “absence of heteroskedasticity”: 데이터의 가변성은 예측 값의 범위에서 대략적으로 동일해야 한다.
ex) 잔차의 그래프가 동그라미 형태를 하고 있는 게 이상적인 잔차 그래프, 나팔 형태를 하고 있다면 Homoskedastic하지 않은 그래프

plot(rnorm(100), rnorm(100))

4. Normality of residuals: 잔차는 정규성을 띄어야 한다.

#잔차의 정규성을 확인하는 법:
hist(residuals(lm(pitch2 ~ age, data = my.df2))) #histogram 그리기

qqnorm(residuals(lm(pitch2 ~ age, data = my.df2))) #qqplot을 그려봤을 때 y = x의 형태를 따를 경우 정규 분포를 띈다고 볼 수 있다.
qqline(residuals(lm(pitch2 ~ age, data = my.df2)))

shapiro.test(residuals(lm(pitch2 ~ age, data = my.df2))) #shapiro wilk test를 시행했을 때 p-value가 0.05보다 크면 정규성을 띈다고 볼 수 있다.

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lm(pitch2 ~ age, data = my.df2))
## W = 0.9131, p-value = 0.4571

5. Absence of influential data points: 영향력 있는 데이터 관측치가 없어야 한다. dfbeta(): 만약 해당 행의 관측치를 제외할 경우 intercept과 slope가 얼마나 조정되는가?
만약 해당 행의 관측치를 제외할 경우 slope의 부호가 바뀐다면 influential data points라고 할 수 있음
(원래 slope - dfbeta function을 사용해서 얻는 조정값의 결과가 slope의 부호가 바뀐 경우)

dfbeta(lm(pitch2 ~ age, data = my.df2))

##   (Intercept)         age
## 1  -3.3645662  0.06437573
## 2  -1.6119656  0.02736278
## 3   1.5481303 -0.01456709
## 4  -0.0259835  0.05092767
## 5   0.8707699 -0.06479736
## 6   1.8551808 -0.06622744

6. Independence!!!: 관측치들은 독립적이어야 한다. ex) 만약 같은 참여자에게서 여러 개의 관측치를 이끌어 낸 경우, 동일한 참여자에게서 나온 관측치들은 서로 독립적이라고 할 수 없음.

2: A very basic tutorial for performing linear mixed effects analyses

fixed effect(독립변수): 모집단의 모든 가능한 레벨을 포괄하는 변수에 의한 효과 random effect(ε): fixed effect가 모집단의 모든 가능한 레벨을 포괄하지 못하는 변수에 의한 효과

pitch ~ politeness + sex + ε -> 사람들의 pitch는 사람마다 조금씩 다르지만 그 이유를 공손성과 성별만으로 설명할 수는 없음
-> pitch ~ politeness + sex + (1|subject) + (1|item) + ε: 각 참여자와 아이템의 random effect를 넣은 모델을 세우면 됨
Std.Dev: random effect의 가변성 Residual: fixed effect에 의해 생긴 가변성

install.packages("lme4")
install.packages("lmerTest")

library(lme4)

## Warning: 패키지 'lme4'는 R 버전 4.1.2에서 작성되었습니다

## 필요한 패키지를 로딩중입니다: Matrix

library(lmerTest)

## Warning: 패키지 'lmerTest'는 R 버전 4.1.2에서 작성되었습니다

## 
## 다음의 패키지를 부착합니다: 'lmerTest'

## The following object is masked from 'package:lme4':
## 
##     lmer

## The following object is masked from 'package:stats':
## 
##     step

politeness= read.csv("http://www.bodowinter.com/tutorial/politeness_data.csv")

#random effect의 표준편자를 봤을 때 item보다 subject사이의 편차가 더 큰 것을 알 수 있음 
as.factor(politeness$scenario) -> politeness$scenario
boxplot(frequency ~ attitude*gender, col=c("white","lightgray"),politeness)

boxplot(frequency ~ subject, col=c("white","lightgray"),politeness)

summary(lmer(frequency ~ attitude + gender + (1|subject) + (1|scenario), data = politeness))

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
##    Data: politeness
## 
## REML criterion at convergence: 775.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2591 -0.6236 -0.0772  0.5388  3.4795 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  scenario (Intercept) 219.5    14.81   
##  subject  (Intercept) 615.6    24.81   
##  Residual             645.9    25.41   
## Number of obs: 83, groups:  scenario, 7; subject, 6
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  256.846     16.116    5.432  15.938 9.06e-06 ***
## attitudepol  -19.721      5.584   70.054  -3.532 0.000735 ***
## genderM     -108.516     21.013    4.007  -5.164 0.006647 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) atttdp
## attitudepol -0.173       
## genderM     -0.652  0.004

#gender라는 fixed effect를 넣었기 때문에 subject random effect의 표준편자가 줄어듦
summary(lmer(frequency ~ attitude + gender + (1|subject) + (1|scenario), data=politeness))

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
##    Data: politeness
## 
## REML criterion at convergence: 775.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2591 -0.6236 -0.0772  0.5388  3.4795 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  scenario (Intercept) 219.5    14.81   
##  subject  (Intercept) 615.6    24.81   
##  Residual             645.9    25.41   
## Number of obs: 83, groups:  scenario, 7; subject, 6
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  256.846     16.116    5.432  15.938 9.06e-06 ***
## attitudepol  -19.721      5.584   70.054  -3.532 0.000735 ***
## genderM     -108.516     21.013    4.007  -5.164 0.006647 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) atttdp
## attitudepol -0.173       
## genderM     -0.652  0.004

###Statistical significance

Likelihood Ratio Test로 model이 유의미한지 검사하는 방법 1. effect가 있는지 의심이 가는 fixed effect를 뺀 model(null model)과 넣은 model, 두 가지의 model(full model)을 만든다. 2. anova(null model, full model) 3. 나온 결괏값에서 Chi square, p-value를 report한다.

politeness.null <- lmer(frequency ~ attitude + (1|subject) + (1|scenario), data=politeness, REML = F) #null model
politeness.full <- lmer(frequency ~ attitude + gender + (1|subject) + (1|scenario), data=politeness, REML = F) #full model
anova(politeness.null, politeness.full) #politeness affected pitch (χ2(1)=11.62, p=0.00065), lowering it by about   19.7 Hz ± 5.6   (standard   errors)

## Data: politeness
## Models:
## politeness.null: frequency ~ attitude + (1 | subject) + (1 | scenario)
## politeness.full: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
##                 npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## politeness.null    5 817.04 829.13 -403.52   807.04                         
## politeness.full    6 807.10 821.61 -397.55   795.10 11.938  1    0.00055 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Likelihood Ratio Test로 interaction model이 유의미한지 검사하는 방법 1. 위와 마찬가지로 interaction을 뺀 model(null model)과 넣은 model, 두 가지의 model(full model)을 만든다. 2. anova(null model, full model) 3. 나온 결괏값에서 Chi square, p-value를 report한다.

politeness.full <- lmer(frequency ~ attitude + gender + (1|subject) + (1|scenario), data=politeness, REML = F) #full model
politeness.inter <- lmer(frequency ~ attitude * gender + (1|subject) + (1|scenario), data=politeness, REML = F) #interaction model
anova(politeness.full, politeness.inter) #politeness affected pitch (χ2(1)=11.62, p=0.00065), lowering it by    about   19.7 Hz ± 5.6   (standard   errors)

## Data: politeness
## Models:
## politeness.full: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
## politeness.inter: frequency ~ attitude * gender + (1 | subject) + (1 | scenario)
##                  npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## politeness.full     6 807.10 821.61 -397.55   795.10                     
## politeness.inter    7 807.11 824.04 -396.55   793.11 1.9963  1     0.1577

Random slopes versus random intercepts

random intercept: 애초에 fixed effect를 적용시키지 않아도 모든 종속변수는 동일하지 않고 subject와 item에 따라 다를 것이다.
random slope: fixed effect의 영향은 subject, item마다 각각 다를 것이다.

coef(politeness.full) #random intercept model의 경우 subject와 item마다 intercept가 다른 것을 알 수 있다.

## $scenario
##   (Intercept) attitudepol   genderM
## 1    243.4860   -19.72207 -108.5173
## 2    263.3592   -19.72207 -108.5173
## 3    268.1322   -19.72207 -108.5173
## 4    277.2546   -19.72207 -108.5173
## 5    254.9319   -19.72207 -108.5173
## 6    244.8015   -19.72207 -108.5173
## 7    245.9618   -19.72207 -108.5173
## 
## $subject
##    (Intercept) attitudepol   genderM
## F1    243.3684   -19.72207 -108.5173
## F2    266.9442   -19.72207 -108.5173
## F3    260.2276   -19.72207 -108.5173
## M3    284.3535   -19.72207 -108.5173
## M4    262.0575   -19.72207 -108.5173
## M7    224.1293   -19.72207 -108.5173
## 
## attr(,"class")
## [1] "coef.mer"

#random intercept + slope -> subject와 item마다 intercept와 slope가 다른 것을 알 수 있다. 
coef(lmer(frequency ~ attitude + gender + (1+attitude|subject) + (1+attitude|scenario), data=politeness, REML=FALSE))

## boundary (singular) fit: see ?isSingular

## $scenario
##   (Intercept) attitudepol   genderM
## 1    245.2630   -20.43990 -110.8062
## 2    263.3044   -15.94550 -110.8062
## 3    269.1439   -20.63124 -110.8062
## 4    276.8329   -16.30057 -110.8062
## 5    256.0602   -19.40628 -110.8062
## 6    246.8625   -21.94848 -110.8062
## 7    248.4714   -23.55653 -110.8062
## 
## $subject
##    (Intercept) attitudepol   genderM
## F1    243.8065   -20.68446 -110.8062
## F2    266.7314   -19.16924 -110.8062
## F3    260.1482   -19.60436 -110.8062
## M3    285.6972   -17.91570 -110.8062
## M4    264.2019   -19.33643 -110.8062
## M7    227.3618   -21.77138 -110.8062
## 
## attr(,"class")
## [1] "coef.mer"

mixed model을 사용하기 위한 조건

linear regression과 같으나 마지막 항목인 independence를 지킬 필요는 없음

A hands-on tutorial on using R for (mostly) linguistics research

김수한