2/27

1 Installing R

Site：https://www.r-project.org/
Step1：使用以上連結進入R下載連結網頁，並點擊“download R”

Step2：選擇國家“Taiwan”，三個連結都可選擇

Step3：選擇載點Mac/Windows/Linux

Step4：選擇“install R for the first time”

Step5：選擇“Download R 3.4.3 for Windows”後執行

2 Installing RStudio

Site：https://www.rstudio.com/

Step1：使用以上連結進入RStudio網頁，並點擊“Download”

Step2：選擇RStudio Desktop並點擊“DOWNLOAD”

Step3：選擇電腦版本，下載後安裝執行

Step4：RStudio基本介面介紹

Step4-1：套件安裝可使用Install.packages此命令函式或在右下角Packages窗格按下“install”後輸入套件名稱安裝套件

3 Data Input&Output

3.1 資料讀取

在讀取資料前可先用setwd()以及getwd()來改變和檢視工作目錄

setwd("D:/")
getwd()

## [1] "D:/"

我們在政府資料開放平臺上找到一筆資料，主要是關於腸病毒健保門診以及住院就診人數統計
資料介紹&下載網址：https://data.gov.tw/dataset/14590
我們使用read.table()來讀取資料，當中sep表示資料的分隔形式，常見的csv檔分隔形式為逗號分隔形式，header表示資料是否有變數名稱，encoding表示資料檔案格式，Windows作業系統為“Big5”，Mac則是“UTF-8”

Enterovirus <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Enterovirus.csv", sep = ",", header = T, encoding = "Big5")
head(Enterovirus) #使用head()觀察部分資料

##     年 週 就診類別 年齡別   縣市 腸病毒健保就診人次 健保就診總人次
## 1 2008 14     住院    0-2 台中市                  0            105
## 2 2008 14     住院    0-2 台北市                  2            151
## 3 2008 14     住院    0-2 台東縣                  0             14
## 4 2008 14     住院    0-2 台南市                  0             20
## 5 2008 14     住院    0-2 宜蘭縣                  0             44
## 6 2008 14     住院    0-2 花蓮縣                  0             17

我們也能直接在Rstudio上直接檢視完整資料，在Global Environment上點選該筆資料就會出現，可參考下圖

3.2 資料輸出

若需儲存資料或分析結果至外部檔案，可使用write.table()函數

#write.table(Enterovirus, "D:/Enterovirus.csv", row.names = FALSE, col.names = TRUE, sep = ",", fileEncoding = "Big5")

4 Basic R

4.1 Vector-向量

4.1.1 向量基本介紹

x <- c(1, 2, 3, 4, 5) #用c()建構向量
x

## [1] 1 2 3 4 5

y <- c(75, 60, 85, 80, 70)
y

## [1] 75 60 85 80 70

z <- c("男", "女", "男", "男", "女")
z

## [1] "男" "女" "男" "男" "女"

「#」符號後面表示是注釋，寫程式時清楚表明注釋能提升可讀性
「<-」是指定值符號，意思是把後面的內容值指定給前面的內容，除了用「<-」也可使用「=」

4.1.2 向量基本運算

x * y #「*」表示乘法，對應元素相乘

## [1]  75 120 255 320 350

x / y #「/」表示除法

## [1] 0.01333333 0.03333333 0.03529412 0.05000000 0.07142857

x ^ 2 #「^」表示次方

## [1]  1  4  9 16 25

x + y #「+」表示加法

## [1] 76 62 88 84 75

x - y #「-」表示減法

## [1] -74 -58 -82 -76 -65

y %/% x #「%/%」表示整數除法的商

## [1] 75 30 28 20 14

y %% x #「%%」表示整數除法的餘數

## [1] 0 0 1 0 0

4.1.3 向量常見的函數

sum(x) #求總和

## [1] 15

max(x) #求最大值

## [1] 5

min(x) #求最小值

## [1] 1

mean(x) #求平均值

## [1] 3

range(x) #求全距

## [1] 1 5

median(x) #求中位數

## [1] 3

var(x) #求變異數

## [1] 2.5

sort(x) #排序

## [1] 1 2 3 4 5

4.1.4 向量索引

x[3] #抓出x中第三個元素

## [1] 3

x[c(1:3)] #抓出x中一到三個元素，「1:3」表示第一到第三個

## [1] 1 2 3

x[2] <- 5 #將x中第二個元素指定為5
x

## [1] 1 5 3 4 5

x[-c(1, 4, 5)] #刪除x中第1,4,5個元素，「-」可表示刪除

## [1] 5 3

4.2 Factor-因子

4.2.1 因子基本介紹

類別型資料經常要把資料轉成不同的水準/因子(factor)，如性別、職業、是否已婚……等。當判斷此資料中變數為類別型時，建議再次檢查變數在資料中是否是以因子方式存在，若不是則須經過轉換，再繼續分析。

class(z) #以之前的z當作例子，先以class()此函數觀察z的物件類型

## [1] "character"

z <- factor(z) #觀察到原本z中的物件是以字元(character)型式存放，故我們要將他轉換成因子
class(z) #再次檢視z的物件類型

## [1] "factor"

levels(z) #提取因子的水準，觀察因子中有哪些類別

## [1] "女" "男"

table(z) #計算z中各因子出現次數

## z
## 女 男 
##  2  3

3/6

4.3 Matrix-矩陣

4.3.1 矩陣基本介紹

矩陣是將資料用行和列排列的長方形表格，矩陣內元素必須是相同資料類型

4.3.2 矩陣基本運算

m <- matrix(1:16, nrow = 4, ncol = 4) #生成一個4*4的矩陣
m

##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16

n <- matrix(sample(1:100, 16), nrow = 4, ncol = 4) #從1到100中抽16個數字生成矩陣
n

##      [,1] [,2] [,3] [,4]
## [1,]   32   64   13   45
## [2,]   93   50   27   51
## [3,]   78   87  100   82
## [4,]   99   75   79   70

m %*% n #矩陣間乘法用「%*%」表示

##      [,1] [,2] [,3] [,4]
## [1,] 2486 2072 2075 1948
## [2,] 2788 2348 2294 2196
## [3,] 3090 2624 2513 2444
## [4,] 3392 2900 2732 2692

m[3, 4] #擷取矩陣第三列第四行的元素，m[列, 行]

## [1] 15

4.3.3 矩陣常見函數

dim(m) #矩陣維度

## [1] 4 4

t(m) #矩陣轉置

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16

det(m) #求行列式值

## [1] 0

nrow(m) #觀察矩陣有幾列

## [1] 4

rowSums(m) #矩陣每列總和

## [1] 28 32 36 40

colSums(m) #矩陣每行總和

## [1] 10 26 42 58

4.4 Data Frame-資料框

4.4.1 Data Frame基本介紹

Data Frame是一種矩陣形式的資料，但資料框中各變數可以以不同類型的方式存在，每一行表示一個變數，每一列可表示一筆資料

Student <- data.frame(class = x, score = y, sex = z) #將x,y,z三個變數合為一個Data，並將變數命名
Student

##   class score sex
## 1     1    75  男
## 2     5    60  女
## 3     3    85  男
## 4     4    80  男
## 5     5    70  女

接著我們想對資料框的每一列命名

row.names(Student) <- c("陳xx", "王xx", "林xx", "張xx", "黃xx")
Student

##      class score sex
## 陳xx     1    75  男
## 王xx     5    60  女
## 林xx     3    85  男
## 張xx     4    80  男
## 黃xx     5    70  女

4.4.2 Data Frame基本運算

Student[, "score"] #觀察資料中變數score的所有樣本，也可用 Student[, 2] or Student$score

## [1] 75 60 85 80 70

Student["林xx", ] #觀察資料中第三位學生所有資料，也可用 Student[3, ]

##      class score sex
## 林xx     3    85  男

還可以篩選出符合我們的條件，例如：篩選出性別為男性且成績大於等於80分

Student[Student$sex == "男" & Student$score >= 80, ] #R中的等號為「==」非「=」需要注意一下！

##      class score sex
## 林xx     3    85  男
## 張xx     4    80  男

接著我們想了解一下這筆資料的基本敘述統計

table(Student$sex) #統計資料中男女各占的數量

## 
## 女 男 
##  2  3

table(Student$sex, Student$class) #也能以列聯表方式觀察不同班級中的男女人數

##     
##      1 3 4 5
##   女 0 0 0 2
##   男 1 1 1 0

plot(Student$class, Student$score) #觀察不同班級下的成績分布

5 Distribution

5.1 Binomial

在n次獨立的試驗中，每次成功機率接固定為p，X代表成功的次數，以下介紹在R中計算binomial distribution

dbinom(5, 50, 0.5) #求機率密度

## [1] 1.881837e-09

pbinom(5, 50, 0.5, lower.tail = T) #求累積機率，lower.tail = T表示求 Pr(X <= x)

## [1] 2.104926e-09

rbinom(5, 50, 0.5) #產生二項分布的隨機亂數

## [1] 26 26 20 26 26

qbinom(0.3, 50, 0.5) #求機率p的對應分位數，解釋為成功小於等於幾次下，累積機率為 0.3

## [1] 23

5.2 Normal

dnorm(5, mean = 25, sd = sqrt(50*0.5*0.5)) #sqrt表示開根號

## [1] 1.269823e-08

pnorm(5, mean = 25, sd = sqrt(50*0.5*0.5), lower.tail = T) #lower.tail = T表示求 P(X <= x)

## [1] 7.708629e-09

rnorm(5, mean = 25, sd = sqrt(50*0.5*0.5))

## [1] 25.52206 22.69812 25.33545 22.75618 28.24591

qnorm(0.975, mean = 25, sd = sqrt(50*0.5*0.5))

## [1] 31.92952

5.3 Poisson

dpois(2, lambda = 5) #在單位時間內平均而言發生次數為5次，求單位時間內發生2次的機率

## [1] 0.08422434

ppois(2, lambda = 5) #單位時間內，發生次數小於等於2次的機率

## [1] 0.124652

qpois(0.084, lambda = 5) #求累積發生幾次事件下的機率是 0.084

## [1] 2

x <- rpois(10000, lambda = 5) #隨機回傳10000個符合機率分布的隨機值
hist(x) #可藉由繪圖檢視是否剛剛抽出10000個值符合poisson分配

3/13

6 Test

6.1 Proportion test

6.1.1 One sample proportion test

使用Enterovirus的資料，預檢定住院的人中10-14歲比例是否為0.5

table(Enterovirus$就診類別, Enterovirus$年齡別)

##       
##          0-2 10-14   15+   3-4   5-9
##   住院 10278 10090 11247 10112 10269
##   門診 11263 11263 11263 11263 11263

prop.test(x = table(Enterovirus[, 3:4])["住院", "10-14"], p = 0.5,
          n = sum(table(Enterovirus[, 3:4])[1,]), alternative = "two.sided")

## 
##  1-sample proportions test with continuity correction
## 
## data:  table(Enterovirus[, 3:4])["住院", "10-14"] out of sum(table(Enterovirus[, 3:4])[1, ]), null probability 0.5
## X-squared = 19467, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.1906673 0.1974848
## sample estimates:
##         p 
## 0.1940534

結論：reject Ho，我們推論住院的人中10到14歲所占的比例不為0.5

6.1.2 Two sample proportion test

檢定10-14歲的人中，住院與門診的比例是否相同

prop.test(x = c(table(Enterovirus[, 3:4])["住院", "10-14"], 
                table(Enterovirus[, 3:4])["門診", "10-14"]),
          n = c(sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, 3:4])[2, 2]),
                sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, 3:4])[2, 2])))

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(table(Enterovirus[, 3:4])["住院", "10-14"], table(Enterovirus[,  out of c(sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, 3:4])[2,     3:4])["門診", "10-14"]) out of     2]), sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, c(table(Enterovirus[, 3:4])["住院", "10-14"], table(Enterovirus[,  out of     3:4])[2, 2]))
## X-squared = 128.65, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.06445051 -0.04541696
## sample estimates:
##    prop 1    prop 2 
## 0.4725331 0.5274669

結論：reject Ho，我們推論在10-14歲的族群中，住院與門診的比例不相近

6.1.3 Fisher Exact test

我們使用一筆新的資料，內容為戴安全帽與否與是否發生事故，透過檢定我們想了解是否戴安全帽與否發生事故有關聯

Helmet_Injury <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Injury.csv", header = T, sep = ",")
Helmet_Injury$Helmet <- as.factor(Helmet_Injury$Helmet)
Helmet_Injury$Injury <- as.factor(Helmet_Injury$Injury)
xtabs(~ Helmet + Injury, data = Helmet_Injury)

##       Injury
## Helmet  0  1
##      0  5 11
##      1 31  3

fisher.test(xtabs(~ Helmet_Injury$Helmet + Helmet_Injury$Injury, 
                  data = Helmet_Injury))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  xtabs(~Helmet_Injury$Helmet + Helmet_Injury$Injury, data = Helmet_Injury)
## p-value = 2.898e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.006295768 0.261238397
## sample estimates:
## odds ratio 
## 0.04846721

結論：reject Ho，我們推論是否戴安全帽與是否發生事故有關聯

3/20

6.1.4 Chisq Test

使用是否帶安全帽與是否發生事故的資料做卡方檢定

chisq.test(xtabs(~ Helmet_Injury$Helmet + Helmet_Injury$Injury), correct = F)

## Warning in chisq.test(xtabs(~Helmet_Injury$Helmet + Helmet_Injury
## $Injury), : Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  xtabs(~Helmet_Injury$Helmet + Helmet_Injury$Injury)
## X-squared = 19.381, df = 1, p-value = 1.071e-05

chisq.test(xtabs(~ Helmet_Injury$Helmet + Helmet_Injury$Injury))

## Warning in chisq.test(xtabs(~Helmet_Injury$Helmet + Helmet_Injury$Injury)):
## Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  xtabs(~Helmet_Injury$Helmet + Helmet_Injury$Injury)
## X-squared = 16.522, df = 1, p-value = 4.808e-05

結論：reject H0，我們推論是否戴安全帽與是否發生事故有關連

結果中多顯示了一行Warning，原因就是在計算列連表中各格的期望次數時有小於等於5的值出現，因此用卡方檢定是可能會出現誤差

4/10

6.1.5 Mantel-Haenszel Test

Mantel-Haenszel Test是一種分析分層列連表的檢定方法，此方法優點是可以在調控分層變數後，檢定兩個類別變數間的關聯性。
資料包含三個變數(Sex、Place、Accident)

Data <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Example_Data.csv", sep = ",", header = T)
head(Data)

##   Sex Place Accident
## 1   M     Z        0
## 2   M     Y        0
## 3   F     Y        0
## 4   F     Z        0
## 5   F     X        0
## 6   M     X        0

檢定性別與是否發生車禍有無關聯-使用之前教過的Fisher exact test

fisher.test(table(Data$Sex, Data$Accident), alternative = "two.sided", conf.level = 0.95)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(Data$Sex, Data$Accident)
## p-value = 0.04846
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9925868 2.0770631
## sample estimates:
## odds ratio 
##   1.434571

檢定結果拒絕Ho，推論性別與是否發生車禍有關聯。

但是我們想知道此差異是否真的是因為性別所影響，或是有第三因子所造成的，因此我們加入這些人居住地區做考量，使用的方法就可用Mantel-Haenszel Test

mantelhaen.test(table(Data$Sex, Data$Accident, Data$Place), alternative = "two.sided", correct = T, conf.level = 0.95)

## 
##  Mantel-Haenszel chi-squared test with continuity correction
## 
## data:  table(Data$Sex, Data$Accident, Data$Place)
## Mantel-Haenszel X-squared = 3.7504, df = 1, p-value = 0.0528
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.012565 2.058560
## sample estimates:
## common odds ratio 
##          1.443754

p-value為0.0528，結果為不顯著，在合併表格中，雖然性別與是否發生車禍看似相關，但是如果我們調控了居住地區時，各區內性別與是否發生車禍並不相關，發現此結果我們可能會推斷是因為地區與車禍是否發生有關而不是性別。

7 GLM

4/17

在線性迴歸中，我們對於依變數的假設為需要服從常態分配，但在現實狀況中不見得如此，這時候我們就可以藉著使用廣義線性模型來幫我們解決這個問題，廣義線性模型允許我們依變數擁有不同的分配，而藉著不同的連結函數(link function)我們就能將Systemetic component與Random component做連結，常用的link function有logit link、log link。
在本週課程會先介紹當依變數為Categorical data & Count Data時可使用的方法，並當遇到overdispersion問題時的處理，使用資料為Injury的資料檔。

Data <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Injury.csv", sep = ",", header = T)
set.seed(45)
Data$Injury_count <- sample(0:15, 50, replace = T) #增加新變數為觀察時間內發生事件(受傷)次數
model.logit <- glm(Injury ~ Helmet, data = Data, family = binomial("logit"))
model.probit <- glm(Injury ~ Helmet, data = Data, family = binomial("probit"))
summary(model.logit)

## 
## Call:
## glm(formula = Injury ~ Helmet, family = binomial("logit"), data = Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5252  -0.4298  -0.4298   0.8657   2.2035  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.7885     0.5394   1.462 0.143785    
## Helmet       -3.1238     0.8102  -3.855 0.000116 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 59.295  on 49  degrees of freedom
## Residual deviance: 40.168  on 48  degrees of freedom
## AIC: 44.168
## 
## Number of Fisher Scoring iterations: 5

summary(model.probit)

## 
## Call:
## glm(formula = Injury ~ Helmet, family = binomial("probit"), data = Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5252  -0.4298  -0.4298   0.8657   2.2035  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4888     0.3273   1.493    0.135    
## Helmet       -1.8405     0.4467  -4.120 3.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 59.295  on 49  degrees of freedom
## Residual deviance: 40.168  on 48  degrees of freedom
## AIC: 44.168
## 
## Number of Fisher Scoring iterations: 4

當使用link function為logit時，有繫安全帶受傷的機會是沒繫安全帶的exp(-3.1238) = 0.044倍
而使用link function為probit時，有繫安全帶受傷的機會是沒繫安全帶的exp(-1.8405) = 0.159倍

7.1 Overdispersion

當我們的依變數為Count Data時，所使用的link function為log link

model.log <- glm(Injury_count ~ Helmet, data = Data, family = poisson("log"))
summary(model.log)

## 
## Call:
## glm(formula = Injury_count ~ Helmet, family = poisson("log"), 
##     data = Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4895  -1.1420  -0.0358   1.1009   2.4290  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.11021    0.08704  24.245  < 2e-16 ***
## Helmet      -0.30385    0.11139  -2.728  0.00637 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 128.26  on 49  degrees of freedom
## Residual deviance: 121.01  on 48  degrees of freedom
## AIC: 300.6
## 
## Number of Fisher Scoring iterations: 5

當所使用分配為poisson時，我們期望依變數的期望值與變異數差異不要過大，因此檢查一下是否出現overdispersion的問題

mean(Data$Injury_count)

## [1] 6.78

var(Data$Injury_count)

## [1] 16.54245

發現出現平均數小於變異數的情況，使用negative binomial做配適或許更為保險

library(MASS)
model.nb <- glm.nb(Injury_count ~ Helmet, data = Data)
summary(model.nb)

## 
## Call:
## glm.nb(formula = Injury_count ~ Helmet, data = Data, init.theta = 4.799298545, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80405  -0.73926  -0.02383   0.70338   1.49171  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.1102     0.1435   14.70   <2e-16 ***
## Helmet       -0.3039     0.1776   -1.71   0.0872 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.7993) family taken to be 1)
## 
##     Null deviance: 59.351  on 49  degrees of freedom
## Residual deviance: 56.400  on 48  degrees of freedom
## AIC: 279.87
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.80 
##           Std. Err.:  1.75 
## 
##  2 x log-likelihood:  -273.869

在這邊現暫時忽略顯著性問題，把重點放在結果解釋上，結論告訴我們有戴安全帽的人比沒戴安全帽的受傷的次數平均而言少了exp(-0.3039) = 0.738次。

5/1

本次內容將先補充上次上課介紹的如何依變數為binary時，在family = binomial狀況下，下identity link，以及使用quasipoisson觀察overdispersion狀況。使用資料為上次作業使用的crabs data

Crabs <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/crab.csv", header = F, sep = ",")
colnames(Crabs) <- c("c", "s", "w", "wt", "sa")
Crabs$Y <- ifelse(Crabs$sa >= 1, 1, 0)
model.1 <- glm(Y ~ wt, data = Crabs, family = binomial("identity"), start = c(0.0001, 0.0001))
summary(model.1) #start point通常為一個很小的數值，若設太大可能會無法收斂

## 
## Call:
## glm(formula = Y ~ wt, family = binomial("identity"), data = Crabs, 
##     start = c(1e-04, 1e-04))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4466  -1.0445   0.9738   1.1634   1.4591  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.08675    0.06879   1.261    0.207    
## wt           0.17563    0.01323  13.275   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 217.19  on 171  degrees of freedom
## AIC: 221.19
## 
## Number of Fisher Scoring iterations: 25

model.2 <- glm(sa ~ wt, data = Crabs, family = quasipoisson)
summary(model.2)

## 
## Call:
## glm(formula = sa ~ wt, family = quasipoisson, data = Crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9306  -1.9981  -0.5627   0.9299   4.9992  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.4282     0.3167  -1.352    0.178    
## wt            0.5892     0.1151   5.121 8.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.133896)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 560.84  on 171  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

本週課程內容要介紹交互作用以及模型診斷，在配適模型時我們可能會懷疑所有變項中某兩項多項間可能會互相影響，因此在配適模型時可以加入交互作用項做考量。
Case：我們懷疑螃蟹寬度與重量間可能有交互作用存在，因此藉由檢定來觀察他們之間是否真的有交互作用存在。

model.3 <- glm(Y ~ wt + w + w*wt, data = Crabs, family = binomial())
summary(model.3)

## 
## Call:
## glm(formula = Y ~ wt + w + w * wt, family = binomial(), data = Crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2385  -1.0298   0.4853   0.9486   1.5160  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.7002    12.2558   0.139    0.890
## wt           -4.2441     5.5113  -0.770    0.441
## w            -0.1134     0.4826  -0.235    0.814
## wt:w          0.1912     0.2065   0.926    0.355
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 191.98  on 169  degrees of freedom
## AIC: 199.98
## 
## Number of Fisher Scoring iterations: 5

在本例中，可以由配適結果看出，當同時在模型裡面放入寬度與重量以及兩者交互作用項，發現每個變項都未顯著，但是在處理交互作用項問題時可以藉由這些方法來觀察。

7.2 Model Selection

接著介紹模型診斷，介紹的方法有LR Test、ROC、AIC以及Hosmer-Lemeshow goodness of fit test

library(lmtest)

## Warning: package 'zoo' was built under R version 3.5.1

model.all <- glm(Y ~ c + s + w + wt, data = Crabs, family = binomial(link = "logit"))
lrtest(model.1, model.all) #也可直接放lrtest(model.1)

## Likelihood ratio test
## 
## Model 1: Y ~ wt
## Model 2: Y ~ c + s + w + wt
##   #Df   LogLik Df  Chisq Pr(>Chisq)    
## 1   2 -108.595                         
## 2   5  -93.329  3 30.532  1.066e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(Ho\)：model.1 is satisfied with this data
\(H1\)：model.all is satisfied with this data
結論：reject \(Ho\)，我們有足夠證據推論model.all配適較佳

接下來試著繪製ROC Curve，在繪製ROC Curve前，我們先把資料切成訓練資料集(Training Data)以及測試資料集(Testing Data)，訓練資料集目的是用於配適Logistic Regression而測試資料集目的則是用於驗證模型準確度

library(caret)

## Warning: package 'caret' was built under R version 3.5.1

library(ROCR)

intrain <- createDataPartition(y = Crabs$Y, p = 0.7, list = F) #將資料切成30%(Testing) & 70%(Training)
Training <- Crabs[intrain, ]
Testing <- Crabs[-intrain, ]
model.roc <- glm(Y ~ c + s + w + wt, data = Training, family = binomial(link = "logit"))
result <- predict(model.roc, newdata = Testing, type = "response")
pred <- prediction(result, Testing$Y)
per <- performance(pred, measure = "tpr", x.measure = "fpr")
AUC <- performance(pred, "auc")
plot(per, col = rainbow(7), main = "ROC Curve") +
  abline(0, 1) +
  text(0.5, 0.5, paste0("AUC=", as.character(round(AUC@y.values[[1]], 3))))

## integer(0)

cost.perf <- performance(pred, "cost")
pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])] #Find Cutpoint

##        35 
## 0.6554858

藉由ROC Curve發現AUC=0.783，模型預測表現還不錯

AIC的部分直接在R中使用AIC的function即可，EX:AIC(model.1)，以下使用apply家族中常用的函數sapply，藉由sapply我們可以將程式更加簡化

sapply(list(model.1, model.3, model.all), AIC)

## [1] 221.1909 199.9806 196.6589

結果觀察到當我們把所有變數都加入模型時AIC會最小。

最後一部分介紹另一個模型檢定的方式Hosmer-Lemeshow goodness of fit test，用於檢定模型所產生的期望值與實際資料觀察值間的配適程度，此檢定用於評估模型預測表現是否符合模型本身的預期(如果符合預期 p-value>0.1)

library(ResourceSelection)

hoslem.test(Crabs$Y, model.1$fitted.values)

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  Crabs$Y, model.1$fitted.values
## X-squared = 24, df = 8, p-value = 0.002292

5/22

8 Multicategory Model

在先前課程中提到，當依變數為二元時，可使用Logistic regression做配適，但是在實際情況中往往不是像我們想的這麼簡單，因此就會需要用到這禮拜所介紹的Multicategory Logit Model，而在這種Model中，我們又可以根據依變數不同的類型(名目or順序尺度)而有不同的配適方式，以下將作介紹。

8.1 Nomial

使用的資料為課本上Table6.1的資料，包含三個變數為(id,length,primary.food)，其中primary food type分為三組分別是Fish(1)，Invertebrate(2)，Other(3)，以Other為baseline category進行模型配適

library(VGAM)
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.1

Alligator_1 <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Alligator-1.txt", header = T, sep = " ")
vglm(primary.food ~ length, data = Alligator_1, reflevel = "3", 
             family = multinomial) %>% summary()

## 
## Call:
## vglm(formula = primary.food ~ length, family = multinomial, data = Alligator_1, 
##     reflevel = "3")
## 
## 
## Pearson residuals:
##                       Min      1Q  Median     3Q   Max
## log(mu[,1]/mu[,3]) -2.330 -0.5075  0.5538 0.6836 1.452
## log(mu[,2]/mu[,3]) -2.687 -0.4821 -0.1653 0.7093 3.439
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept):1   1.6177     1.3073   1.237  0.21591   
## (Intercept):2   5.6974     1.7937   3.176  0.00149 **
## length:1       -0.1101     0.5171  -0.213  0.83137   
## length:2       -2.4654     0.8996      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  2 
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 98.3412 on 114 degrees of freedom
## 
## Log-likelihood: -49.1706 on 114 degrees of freedom
## 
## Number of iterations: 5 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## 'length:2'
## 
## Reference group is level  3  of the response

寫下兩個model：
\(log(\hat\pi_1/\hat\pi_3)=1.6177-0.1101*length\)
\(log(\hat\pi_1/\hat\pi_3)=5.6974-2.4654*length\)
可以看出鱷魚身長每增加一單位(公尺)，以魚類為主要食物型態的機會會是食物類型為其他種類的\(e^{-0.1101}=0.9\)倍，而以無脊椎動物為主要食物型態的機會則會是食物類型為其他種類的\(e^{-2.4654}=0.085\)倍，若今天我們想看的是以魚類與無脊椎動物為主食的鱷魚間的關係，則可以把兩個模型相減得到 \(log(\hat\pi_1/\hat\pi_2)=-4.08+2.355*length\)
從此式可以得到鱷魚身長每增加一單位(公尺)，以魚類為主要食物型態的機會會是以無脊椎動物為主食的\(e^{2.355}=10.538\)倍。

Alligator <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Alligator.txt", header = T)
contrasts(Alligator$Lake) <- contr.treatment(levels(Alligator$Lake), base = 1)
contrasts(Alligator$Lake)

##          hancock oklawaha trafford
## george         0        0        0
## hancock        1        0        0
## oklawaha       0        1        0
## trafford       0        0        1

contrasts(Alligator$Size) <- contr.treatment(levels(Alligator$Size), base = 2)
contrasts(Alligator$Size)

##      <2.3
## <2.3    1
## >2.3    0

vglm(cbind(Invertebrate, Reptile, Bird, Other, Fish) ~ Size + Lake, data = Alligator, reflevel = "Fish", family = multinomial) %>% summary()

## 
## Call:
## vglm(formula = cbind(Invertebrate, Reptile, Bird, Other, Fish) ~ 
##     Size + Lake, family = multinomial, data = Alligator, reflevel = "Fish")
## 
## 
## Pearson residuals:
##                        Min      1Q  Median     3Q   Max
## log(mu[,1]/mu[,5]) -1.3716 -0.4379 -0.0248 0.2436 1.995
## log(mu[,2]/mu[,5]) -0.8298 -0.5850 -0.2309 0.2225 2.237
## log(mu[,3]/mu[,5]) -0.9873 -0.5082 -0.1144 0.2373 3.994
## log(mu[,4]/mu[,5]) -1.5873 -0.3189 -0.0159 1.0330 1.413
## 
## Coefficients: 
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1  -1.549019   0.424922  -3.645 0.000267 ***
## (Intercept):2  -3.314533   1.053068      NA       NA    
## (Intercept):3  -2.093077   0.662235  -3.161 0.001574 ** 
## (Intercept):4  -1.904272   0.525827  -3.621 0.000293 ***
## Size<2.3:1      1.458205   0.395945   3.683 0.000231 ***
## Size<2.3:2     -0.351263   0.580032  -0.606 0.544785    
## Size<2.3:3     -0.630660   0.642473  -0.982 0.326291    
## Size<2.3:4      0.331550   0.448252   0.740 0.459511    
## Lakehancock:1  -1.658359   0.612877  -2.706 0.006813 ** 
## Lakehancock:2   1.242777   1.185418   1.048 0.294461    
## Lakehancock:3   0.695118   0.781263   0.890 0.373608    
## Lakehancock:4   0.826196   0.557541   1.482 0.138378    
## Lakeoklawaha:1  0.937219   0.471905   1.986 0.047030 *  
## Lakeoklawaha:2  2.458872   1.118113   2.199 0.027869 *  
## Lakeoklawaha:3 -0.653208   1.201917  -0.543 0.586805    
## Lakeoklawaha:4  0.005653   0.776571   0.007 0.994192    
## Laketrafford:1  1.121985   0.490513   2.287 0.022174 *  
## Laketrafford:2  2.935253   1.116395   2.629 0.008558 ** 
## Laketrafford:3  1.087767   0.841669   1.292 0.196221    
## Laketrafford:4  1.516369   0.621435   2.440 0.014683 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  4 
## 
## Names of linear predictors: 
## log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]), log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
## 
## Residual deviance: 52.4785 on 44 degrees of freedom
## 
## Log-likelihood: -74.4295 on 44 degrees of freedom
## 
## Number of iterations: 5 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):2'
## 
## Reference group is level  5  of the response

使用講義中p.12 Example的Data，欲了解考量更多變數後鱷魚主要食物型態為魚類與無脊椎動物間的關係。首先，先將鱷魚的Size與位於的湖泊(Lake)轉成dummy variable藉由contrasts()將\(Size\geq2.3=0\)&\(Lake(George)=0\)配適出的模型為
\(log(\hat\pi_I/\hat\pi_F)=-1.55+1.46*Size-1.66*Lake_H+0.94*Lake_O+1.12*Lake_T\)
例如有隻鱷魚位於Lake George且\(Size\geq2.3\)則鱷魚以無脊椎動物為主食的機會是以魚類為主食的\(e^{-1.55}=0.212\)倍。

8.2 Ordinal

接下來使用Mental Impairment的資料集，介紹當依變數為順序尺度時要如何配適Cumulative Logit Model

Mental<-read.table("http://www.stat.ufl.edu/~aa/glm/data/Mental.dat", header = T) #可直接讀取網頁檔案，也可自行下載Ceiba檔案讀取
vglm(impair ~ life + ses, data = Mental, family = cumulative(parallel = T)) %>% summary() #parallel=T,表示有符合Proportional odds

## 
## Call:
## vglm(formula = impair ~ life + ses, family = cumulative(parallel = T), 
##     data = Mental)
## 
## 
## Pearson residuals:
##                   Min      1Q  Median     3Q   Max
## logit(P[Y<=1]) -1.568 -0.7048 -0.2102 0.8070 2.713
## logit(P[Y<=2]) -2.328 -0.4666  0.2657 0.6904 1.615
## logit(P[Y<=3]) -3.688  0.1198  0.2039 0.4194 1.892
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept):1  -0.2819     0.6231  -0.452  0.65096   
## (Intercept):2   1.2128     0.6511   1.863  0.06251 . 
## (Intercept):3   2.2094     0.7171   3.081  0.00206 **
## life           -0.3189     0.1194  -2.670  0.00759 **
## ses             1.1112     0.6143   1.809  0.07045 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  3 
## 
## Names of linear predictors: 
## logit(P[Y<=1]), logit(P[Y<=2]), logit(P[Y<=3])
## 
## Residual deviance: 99.0979 on 115 degrees of freedom
## 
## Log-likelihood: -49.5489 on 115 degrees of freedom
## 
## Number of iterations: 5 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## Exponentiated coefficients:
##      life       ses 
## 0.7269742 3.0380707

Model可寫成\(logit[P(Y\leq{j})]=\alpha_j+\beta_1*life\,event+\beta_2*SES\)的型式
舉例來說在固定life event的情況下，社經地位高者不管在給定哪個level( j )下，有精神障礙的風險是低社經地位的\(e^{1.1112}=3\)倍。

8.3 Paired-Category Ordial Logits

在先前課程中提到Cumulative Logits中，我們所關注的是累積機率，因此使用的Logit link為\(logit(P(Y\leq{j}))\)，而今天所要介紹的則是只使用部分依變數的資訊，可以做兩個類別的比較，使用的方法為Adjacent-Categories Logit，而Logit Link為\(logit(P(Y=j+1)/P(Y=j))\)，所使用的R套件為VGAM，以下整理出此套件中包含的Family function

同學可參照此圖依照使用得目的選擇適合的Family function，這邊將以課本中例子介紹

library(VGAM)
library(dplyr)
Political_Ideology <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Political_Ideology.csv", header = T, sep = ",")
Political_Ideology$Ideology <- as.factor(Political_Ideology$Ideology)
vglm(Ideology ~ Gender + Party, data = Political_Ideology, family = acat(reverse = F, parallel = F)) %>% summary()

## Warning in eval(slot(family, "initialize")): response should be ordinal---
## see ordered()

## 
## Call:
## vglm(formula = Ideology ~ Gender + Party, family = acat(reverse = F, 
##     parallel = F), data = Political_Ideology)
## 
## 
## Pearson residuals:
##                        Min      1Q  Median      3Q   Max
## loge(P[Y=2]/P[Y=1]) -3.371  0.0712  0.1591  0.2380 1.358
## loge(P[Y=3]/P[Y=2]) -2.408 -0.6983  0.2491  0.8383 1.009
## loge(P[Y=4]/P[Y=3]) -1.172 -0.6687 -0.3114  0.6870 2.732
## loge(P[Y=5]/P[Y=4]) -1.785 -0.2853 -0.2242 -0.1143 2.451
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1  0.06687    0.19024   0.352 0.725190    
## (Intercept):2  0.83169    0.15668   5.308 1.11e-07 ***
## (Intercept):3 -1.63822    0.19734  -8.302  < 2e-16 ***
## (Intercept):4  0.33236    0.23302   1.426 0.153780    
## GenderM:1     -0.13565    0.26457  -0.513 0.608152    
## GenderM:2     -0.23610    0.21592  -1.093 0.274191    
## GenderM:3      0.53441    0.21580   2.476 0.013273 *  
## GenderM:4     -0.08632    0.24141  -0.358 0.720666    
## PartyR:1       0.42419    0.28331   1.497 0.134329    
## PartyR:2       0.43680    0.21674   2.015 0.043876 *  
## PartyR:3       0.82594    0.22226   3.716 0.000202 ***
## PartyR:4      -0.12353    0.25468  -0.485 0.627644    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  4 
## 
## Names of linear predictors: 
## loge(P[Y=2]/P[Y=1]), loge(P[Y=3]/P[Y=2]), loge(P[Y=4]/P[Y=3]), loge(P[Y=5]/P[Y=4])
## 
## Residual deviance: 2462.332 on 3328 degrees of freedom
## 
## Log-likelihood: -1231.166 on 3328 degrees of freedom
## 
## Number of iterations: 4 
## 
## No Hauck-Donner effect found in any of the estimates

以\(log(P[Y=2]/P[Y=1])\)舉例，給定相同性別下，較自由主義與非常自由主義的odds在共和主義的人中是民主主義的人的\(exp(0.42419)=1.528352\)倍。

8.4 Continuation-Ratio Logits

以課本中Toxicity Study Data示範，簡單來說此研究想看懷孕老鼠在暴露於某一工業用溶劑，在不同劑量下(此處給不同劑量Score)所生下老鼠的狀態，分為三種(死亡、畸形、正常)，並去觀察他們之間的關係，在這邊舉例，在給定所生下老鼠是活著的情況，我們想了解畸形與正常老鼠間的關係

library(VGAM)
library(dplyr)
Toxicity <- data.frame(Concentration = c(0, 62.5, 125, 250, 500),
                       Response_Dead = c(15, 17, 22, 38, 144),
                       Response_Mal = c(1, 0, 7, 59, 132),
                       Response_Nor = c(281, 225, 283, 202, 9))
vglm(cbind(Response_Mal, Response_Nor) ~ Concentration, data = Toxicity, family = sratio()) %>% summary()

## 
## Call:
## vglm(formula = cbind(Response_Mal, Response_Nor) ~ Concentration, 
##     family = sratio(), data = Toxicity)
## 
## 
## Pearson residuals:
##          [,1]
## [1,]  0.06344
## [2,] -1.49190
## [3,] -0.44346
## [4,]  0.86216
## [5,] -0.87367
## 
## Coefficients: 
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -5.701902   0.332248  -17.16   <2e-16 ***
## Concentration  0.017375   0.001227   14.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  1 
## 
## Name of linear predictor: logit(P[Y=1|Y>=1]) 
## 
## Residual deviance: 6.0609 on 3 degrees of freedom
## 
## Log-likelihood: -10.7452 on 3 degrees of freedom
## 
## Number of iterations: 4 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## 'Concentration'

結果可以看出，在給定老鼠皆是存活的狀態下，當母鼠每日暴露於此工業用溶劑的劑量每增加\(100mg/kg\)，生下的老鼠會是畸形的可能性會增加\(exp(0.017375*100)=5.7\)倍。

9 Match-Paired Data

在現實狀況中，我們可能會想去看試驗者在接受某種治療後，試驗前與試驗後的反應，而此種資料型式就是成對資料，在當資料為連續型時，可以使用Pair t test做檢定，而當資料為類別型時，也有其適合的方法，本章將會做介紹。

9.1 Mcnemar Test

首先介紹當資料為類別型時，可以使用Mcnemar Test幫助我們做檢定，我們需把資料整理成2*2 Table，以下舉例：此筆資料主要是訪問800人對於某位候選人就任前與就任後的滿意度，70位在就任前後皆滿意，80位就任前不滿意就任後變為滿意，其餘依此類推。

Performance <- matrix(c(70, 80, 150, 500), nrow = 2, 
                      dimnames = list("1st Survey" = c("Satidfied", "Dissatisfied"),
                                      "2nd Survey" = c("Satidfied", "Dissatisfied")))
mcnemar.test(Performance)

## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  Performance
## McNemar's chi-squared = 20.7, df = 1, p-value = 5.372e-06

虛無假設為
\(H_0：就任前後滿意與不滿意比例相同\)
\(H_1：就任前後滿意與不滿意比例不同\)
檢定結果：\(Reject\,H_0\)有足夠證據證明就任前後滿意與不滿意比例不同

9.2 Marginal Logistic Regression

在先前所提到的羅吉斯迴歸，資料間彼此互相獨立，因此可以使用，但當資料具有相依性，而我們仍然想要去量化解釋變數對依變數的影響，這時我們就可以使用Marginal logistic regression or Conditional logistic regression幫助我們。以講義中CH8 p.4 table舉例。

Data <- data.frame(CutLS_YES = c(227, 107), CutLS_NO = c(132, 678))
row.names(Data) <- c("HigherTaxes_Yes", "HigherTaxes_No")
Data

##                 CutLS_YES CutLS_NO
## HigherTaxes_Yes       227      132
## HigherTaxes_No        107      678

由於我們要建立的是Marginal Logistic Regression因此要先建立Marginal Data

Data.marginal <- data.frame(Q = c("HigherTaxes", "CutLS"), 
                            yes = c(359, 334), 
                            no = c(785, 810))
Data.marginal

##             Q yes  no
## 1 HigherTaxes 359 785
## 2       CutLS 334 810

接著建立Model

library(dplyr)
glm(cbind(Data.marginal$yes, Data.marginal$no) ~ Q, data = Data.marginal, 
    family = binomial(link = "logit")) %>% summary()

## 
## Call:
## glm(formula = cbind(Data.marginal$yes, Data.marginal$no) ~ Q, 
##     family = binomial(link = "logit"), data = Data.marginal)
## 
## Deviance Residuals: 
## [1]  0  0
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.88589    0.06503 -13.623   <2e-16 ***
## QHigherTaxes  0.10353    0.09104   1.137    0.255    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.2939e+00  on 1  degrees of freedom
## Residual deviance: 2.7489e-13  on 0  degrees of freedom
## AIC: 18.649
## 
## Number of Fisher Scoring iterations: 2

9.3 Conditional Logistic Regression

當我們想了解的是個人對於這兩個問題看法的差異，這時就可以使用Conditional Logistic Regresion，在建立條件羅吉斯回歸時，需要先建構出Subject-Specific table，也就是我們所說的Conditional Model，之所以叫Conditional原因為我們針對每個人都建立table，概念其實很像針對個人去做分層後下去建立模型，也就是說我們是“條件於每個人之下”，去觀察解釋變數對依變數的影響。

首先先將資料轉為Individual Data，當中queestion=1代表是否願意付較高稅收，question=0代表降低生活品質

Data.raw=data.frame(ID = rep(1:1144, each=2), 
                    question = rep(c(1, 0), 1144),
                    answer = c(rep(c(1, 1), 227), rep(c(1, 0), 132), 
                               rep(c(0, 1), 107), rep(c(0, 0), 678)))

接著建立Model

library(dplyr)
library(survival)
clogit(answer ~ question + strata(ID), data = Data.raw) %>% summary()

## Call:
## coxph(formula = Surv(rep(1, 2288L), answer) ~ question + strata(ID), 
##     data = Data.raw, method = "exact")
## 
##   n= 2288, number of events= 693 
## 
##            coef exp(coef) se(coef)     z Pr(>|z|)
## question 0.2100    1.2336   0.1301 1.614    0.106
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## question     1.234     0.8106     0.956     1.592
## 
## Rsquare= 0.001   (max possible= 0.135 )
## Likelihood ratio test= 2.62  on 1 df,   p=0.1055
## Wald test            = 2.61  on 1 df,   p=0.1065
## Score (logrank) test = 2.62  on 1 df,   p=0.1059

根據以上結果可以看出，針對個人而言回答願意付較高稅的機會是願意降低生活品質的1.234倍。也就是說個人而言大家還是願意付較高的稅而較不願意降低生活品質。

9.4 Cox Model

除了使用Conditional Logistic Regression做配適，我們也可以藉由存活分析中常用到著Cox Proportional Hazard Model達到相同目標，以下以講義CH8 p.18的table做示範。

Individual <- data.frame(Matched = rep(1:144, each = 2), T = rep(c(1:2), 144), 
                         Y = rep(c(1, 0), 144),
                         X = c(rep(1, 18), rep(c(1, 0), 37), rep(c(0, 1), 16), rep(0, 164)))
head(Individual)

##   Matched T Y X
## 1       1 1 1 1
## 2       1 2 0 1
## 3       2 1 1 1
## 4       2 2 0 1
## 5       3 1 1 1
## 6       3 2 0 1

library(survival)
my.surv <- Surv(Individual$T, Individual$Y == 1)
Model <- coxph(my.surv ~ X + strata(Matched), data = Individual)
summary(Model)

## Call:
## coxph(formula = my.surv ~ X + strata(Matched), data = Individual)
## 
##   n= 288, number of events= 144 
## 
##     coef exp(coef) se(coef)     z Pr(>|z|)   
## X 0.8383    2.3125   0.2992 2.802  0.00508 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##   exp(coef) exp(-coef) lower .95 upper .95
## X     2.312     0.4324     1.286     4.157
## 
## Concordance= 0.573  (se = 0.184 )
## Rsquare= 0.029   (max possible= 0.5 )
## Likelihood ratio test= 8.55  on 1 df,   p=0.003449
## Wald test            = 7.85  on 1 df,   p=0.005082
## Score (logrank) test = 8.32  on 1 df,   p=0.003919

\(h(t|Diabetes)=h_0(t)+exp(0.8383*Diabetes)\)
Time=T
STATUS=Y
Matched=Stratified Variable
使用 Cox Proportional Hazard Model 做出來結果與使用 Conditional Logistic Regression 相同，表示有糖尿病的患者罹患心肌梗塞的風險為沒有糖尿病者的 2.312 倍。

10 GEE

使用Depression Study Data用R code示範，當Working Correlation Matrix為Independent時，比較使用GLM與GEE估計出係數的差別

library(dplyr)
library(gee)
Depression <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Depression.txt", 
                         sep = "", header = T)
glm(outcome ~ severity + drug + time + time*drug, data = Depression, family = binomial(link = "logit")) %>% 
  summary() #GLM

## 
## Call:
## glm(formula = outcome ~ severity + drug + time + time * drug, 
##     family = binomial(link = "logit"), data = Depression)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4352  -1.0220   0.3254   0.9915   1.8009  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.02799    0.16391  -0.171    0.864    
## severity    -1.31391    0.14641  -8.974  < 2e-16 ***
## drug        -0.05960    0.22221  -0.268    0.789    
## time         0.48241    0.11476   4.204 2.63e-05 ***
## drug:time    1.01744    0.18879   5.389 7.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1411.9  on 1019  degrees of freedom
## Residual deviance: 1161.9  on 1015  degrees of freedom
## AIC: 1171.9
## 
## Number of Fisher Scoring iterations: 4

gee(outcome ~ severity + drug + time + time*drug, id = case, data = Depression, corstr = "independence", 
    family = binomial(link = "logit")) %>% summary()

## (Intercept)    severity        drug        time   drug:time 
## -0.02798843 -1.31391092 -0.05960381  0.48241209  1.01744498

## 
##  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
##  gee S-function, version 4.13 modified 98/01/27 (1998) 
## 
## Model:
##  Link:                      Logit 
##  Variance to Mean Relation: Binomial 
##  Correlation Structure:     Independent 
## 
## Call:
## gee(formula = outcome ~ severity + drug + time + time * drug, 
##     id = case, data = Depression, family = binomial(link = "logit"), 
##     corstr = "independence")
## 
## Summary of Residuals:
##         Min          1Q      Median          3Q         Max 
## -0.94844242 -0.40683252  0.05155758  0.38830952  0.80242231 
## 
## 
## Coefficients:
##                Estimate Naive S.E.    Naive z Robust S.E.   Robust z
## (Intercept) -0.02798843  0.1627083 -0.1720160   0.1741865 -0.1606808
## severity    -1.31391092  0.1453432 -9.0400569   0.1459845 -9.0003423
## drug        -0.05960381  0.2205812 -0.2702126   0.2285385 -0.2608042
## time         0.48241209  0.1139224  4.2345663   0.1199350  4.0222784
## drug:time    1.01744498  0.1874132  5.4288855   0.1876938  5.4207709
## 
## Estimated Scale Parameter:  0.9854113
## Number of Iterations:  1
## 
## Working Correlation
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

由於Outcome為Binary因此在GEE中也要記得設定family，否則default設定為Identity，結果會有誤差，而GEE在解釋上也與GLM相同

請同學自行下載套件catdata，此套件就有提供Teratology的Data，有Grouped Data(teratology)及Individual Data(teratology2)，這邊使用Individual Data，分別使用GLM & GEE(Exchangeable Working Correlation Matrix)並與CH9講義P.25對照

library(catdata)
library(dplyr)
library(gee)
data("teratology2")
glm(y ~ Grp, data = teratology2, family = binomial(link = "logit")) %>% summary()

## 
## Call:
## glm(formula = y ~ Grp, family = binomial(link = "logit"), data = teratology2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6855  -0.4631  -0.2649   0.7437   2.5951  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.1440     0.1292   8.855   <2e-16 ***
## GrpG2        -3.3225     0.3308 -10.043   <2e-16 ***
## GrpG3        -4.4762     0.7307  -6.126    9e-10 ***
## GrpG4        -4.1297     0.4762  -8.672   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 832.68  on 606  degrees of freedom
## Residual deviance: 496.70  on 603  degrees of freedom
## AIC: 504.7
## 
## Number of Fisher Scoring iterations: 5

gee(y ~ Grp, id = Rat, data = teratology2, corstr = "exchangeable", family = binomial(link = "logit")) %>% summary()

## (Intercept)       GrpG2       GrpG3       GrpG4 
##    1.143981   -3.322513   -4.476184   -4.129663

## 
##  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
##  gee S-function, version 4.13 modified 98/01/27 (1998) 
## 
## Model:
##  Link:                      Logit 
##  Variance to Mean Relation: Binomial 
##  Correlation Structure:     Exchangeable 
## 
## Call:
## gee(formula = y ~ Grp, id = Rat, data = teratology2, family = binomial(link = "logit"), 
##     corstr = "exchangeable")
## 
## Summary of Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.7705629 -0.1036164 -0.0331754  0.2294371  0.9668246 
## 
## 
## Coefficients:
##              Estimate Naive S.E.   Naive z Robust S.E.  Robust z
## (Intercept)  1.211492  0.2245137  5.396073   0.2695606  4.494323
## GrpG2       -3.369165  0.5660161 -5.952419   0.4304211 -7.827600
## GrpG3       -4.583701  1.3091173 -3.501368   0.6235402 -7.351091
## GrpG4       -4.247410  0.8527494 -4.980842   0.6047894 -7.022957
## 
## Estimated Scale Parameter:  1.035308
## Number of Iterations:  3
## 
## Working Correlation
##            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
##  [1,] 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [2,] 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379
##  [3,] 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379
##  [4,] 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379
##  [5,] 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379
##  [6,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000
##  [7,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [8,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [9,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [10,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [11,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [12,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [13,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [14,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [15,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [16,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [17,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##            [,7]      [,8]      [,9]     [,10]     [,11]     [,12]
##  [1,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [2,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [3,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [4,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [5,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [6,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [7,] 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [8,] 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379
##  [9,] 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379
## [10,] 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379
## [11,] 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379
## [12,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000
## [13,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [14,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [15,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [16,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [17,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##           [,13]     [,14]     [,15]     [,16]     [,17]
##  [1,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [2,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [3,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [4,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [5,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [6,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [7,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [8,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
##  [9,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [10,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [11,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [12,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [13,] 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379
## [14,] 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379
## [15,] 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379
## [16,] 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379
## [17,] 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000

11 References

1: R語言資料分析從機器學習、資料探勘、文字探勘到巨量資料分析-李仁鐘、李秋緣
2: R語言資料分析活用範例詳解-方匡南、朱建平、姜葉飛著，H&C 譯
3: An Introduction to Categorical Data Analysis, Second Edition-Alan Agresti

CDA R Class

Hsien-Chou, Yeh

Update 2018/6/9

1 Installing R

2 Installing RStudio

3 Data Input&Output

3.1 資料讀取

3.2 資料輸出

4 Basic R

4.1 Vector-向量

4.1.1 向量基本介紹

4.1.2 向量基本運算

4.1.3 向量常見的函數

4.1.4 向量索引

4.2 Factor-因子

4.2.1 因子基本介紹

4.3 Matrix-矩陣

4.3.1 矩陣基本介紹

4.3.2 矩陣基本運算

4.3.3 矩陣常見函數

4.4 Data Frame-資料框

4.4.1 Data Frame基本介紹

4.4.2 Data Frame基本運算

5 Distribution

5.1 Binomial

5.2 Normal

5.3 Poisson

6 Test

6.1 Proportion test

6.1.1 One sample proportion test

6.1.2 Two sample proportion test

6.1.3 Fisher Exact test

6.1.4 Chisq Test

6.1.5 Mantel-Haenszel Test

7 GLM

7.1 Overdispersion

7.2 Model Selection

8 Multicategory Model

8.1 Nomial

8.2 Ordinal

8.3 Paired-Category Ordial Logits

8.4 Continuation-Ratio Logits

9 Match-Paired Data

9.1 Mcnemar Test

9.2 Marginal Logistic Regression

9.3 Conditional Logistic Regression

9.4 Cox Model

10 GEE

11 References