2/27
Site:https://www.r-project.org/
Step1:使用以上連結進入R下載連結網頁,並點擊“download R”
Step2:選擇國家“Taiwan”,三個連結都可選擇
Step3:選擇載點Mac/Windows/Linux
Step4:選擇“install R for the first time”
Step5:選擇“Download R 3.4.3 for Windows”後執行
Site:https://www.rstudio.com/
Step1:使用以上連結進入RStudio網頁,並點擊“Download”
Step2:選擇RStudio Desktop並點擊“DOWNLOAD”
Step3:選擇電腦版本,下載後安裝執行
Step4:RStudio基本介面介紹
Step4-1:套件安裝可使用Install.packages此命令函式或在右下角Packages窗格按下“install”後輸入套件名稱安裝套件
在讀取資料前可先用setwd()以及getwd()來改變和檢視工作目錄
## [1] "D:/"
我們在政府資料開放平臺上找到一筆資料,主要是關於腸病毒健保門診以及住院就診人數統計
資料介紹&下載網址:https://data.gov.tw/dataset/14590
我們使用read.table()來讀取資料,當中sep表示資料的分隔形式,常見的csv檔分隔形式為逗號分隔形式,header表示資料是否有變數名稱,encoding表示資料檔案格式,Windows作業系統為“Big5”,Mac則是“UTF-8”
Enterovirus <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Enterovirus.csv", sep = ",", header = T, encoding = "Big5")
head(Enterovirus) #使用head()觀察部分資料## 年 週 就診類別 年齡別 縣市 腸病毒健保就診人次 健保就診總人次
## 1 2008 14 住院 0-2 台中市 0 105
## 2 2008 14 住院 0-2 台北市 2 151
## 3 2008 14 住院 0-2 台東縣 0 14
## 4 2008 14 住院 0-2 台南市 0 20
## 5 2008 14 住院 0-2 宜蘭縣 0 44
## 6 2008 14 住院 0-2 花蓮縣 0 17
我們也能直接在Rstudio上直接檢視完整資料,在Global Environment上點選該筆資料就會出現,可參考下圖
若需儲存資料或分析結果至外部檔案,可使用write.table()函數
## [1] 1 2 3 4 5
## [1] 75 60 85 80 70
## [1] "男" "女" "男" "男" "女"
「#」符號後面表示是注釋,寫程式時清楚表明注釋能提升可讀性
「<-」是指定值符號,意思是把後面的內容值指定給前面的內容,除了用「<-」也可使用「=」
## [1] 75 120 255 320 350
## [1] 0.01333333 0.03333333 0.03529412 0.05000000 0.07142857
## [1] 1 4 9 16 25
## [1] 76 62 88 84 75
## [1] -74 -58 -82 -76 -65
## [1] 75 30 28 20 14
## [1] 0 0 1 0 0
## [1] 15
## [1] 5
## [1] 1
## [1] 3
## [1] 1 5
## [1] 3
## [1] 2.5
## [1] 1 2 3 4 5
## [1] 3
## [1] 1 2 3
## [1] 1 5 3 4 5
## [1] 5 3
類別型資料經常要把資料轉成不同的水準/因子(factor),如性別、職業、是否已婚……等。當判斷此資料中變數為類別型時,建議再次檢查變數在資料中是否是以因子方式存在,若不是則須經過轉換,再繼續分析。
## [1] "character"
## [1] "factor"
## [1] "女" "男"
## z
## 女 男
## 2 3
3/6
矩陣是將資料用行和列排列的長方形表格,矩陣內元素必須是相同資料類型
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
## [,1] [,2] [,3] [,4]
## [1,] 32 64 13 45
## [2,] 93 50 27 51
## [3,] 78 87 100 82
## [4,] 99 75 79 70
## [,1] [,2] [,3] [,4]
## [1,] 2486 2072 2075 1948
## [2,] 2788 2348 2294 2196
## [3,] 3090 2624 2513 2444
## [4,] 3392 2900 2732 2692
## [1] 15
## [1] 4 4
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 14 15 16
## [1] 0
## [1] 4
## [1] 28 32 36 40
## [1] 10 26 42 58
Data Frame是一種矩陣形式的資料,但資料框中各變數可以以不同類型的方式存在,每一行表示一個變數,每一列可表示一筆資料
## class score sex
## 1 1 75 男
## 2 5 60 女
## 3 3 85 男
## 4 4 80 男
## 5 5 70 女
接著我們想對資料框的每一列命名
## class score sex
## 陳xx 1 75 男
## 王xx 5 60 女
## 林xx 3 85 男
## 張xx 4 80 男
## 黃xx 5 70 女
## [1] 75 60 85 80 70
## class score sex
## 林xx 3 85 男
還可以篩選出符合我們的條件,例如:篩選出性別為男性且成績大於等於80分
## class score sex
## 林xx 3 85 男
## 張xx 4 80 男
接著我們想了解一下這筆資料的基本敘述統計
##
## 女 男
## 2 3
##
## 1 3 4 5
## 女 0 0 0 2
## 男 1 1 1 0
在n次獨立的試驗中,每次成功機率接固定為p,X代表成功的次數,以下介紹在R中計算binomial distribution
## [1] 1.881837e-09
## [1] 2.104926e-09
## [1] 26 26 20 26 26
## [1] 23
## [1] 1.269823e-08
## [1] 7.708629e-09
## [1] 25.52206 22.69812 25.33545 22.75618 28.24591
## [1] 31.92952
## [1] 0.08422434
## [1] 0.124652
## [1] 2
3/13
使用Enterovirus的資料,預檢定住院的人中10-14歲比例是否為0.5
##
## 0-2 10-14 15+ 3-4 5-9
## 住院 10278 10090 11247 10112 10269
## 門診 11263 11263 11263 11263 11263
prop.test(x = table(Enterovirus[, 3:4])["住院", "10-14"], p = 0.5,
n = sum(table(Enterovirus[, 3:4])[1,]), alternative = "two.sided")##
## 1-sample proportions test with continuity correction
##
## data: table(Enterovirus[, 3:4])["住院", "10-14"] out of sum(table(Enterovirus[, 3:4])[1, ]), null probability 0.5
## X-squared = 19467, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.1906673 0.1974848
## sample estimates:
## p
## 0.1940534
結論:reject Ho,我們推論住院的人中10到14歲所占的比例不為0.5
檢定10-14歲的人中,住院與門診的比例是否相同
prop.test(x = c(table(Enterovirus[, 3:4])["住院", "10-14"],
table(Enterovirus[, 3:4])["門診", "10-14"]),
n = c(sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, 3:4])[2, 2]),
sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, 3:4])[2, 2])))##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(table(Enterovirus[, 3:4])["住院", "10-14"], table(Enterovirus[, out of c(sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, 3:4])[2, 3:4])["門診", "10-14"]) out of 2]), sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, c(table(Enterovirus[, 3:4])["住院", "10-14"], table(Enterovirus[, out of 3:4])[2, 2]))
## X-squared = 128.65, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.06445051 -0.04541696
## sample estimates:
## prop 1 prop 2
## 0.4725331 0.5274669
結論:reject Ho,我們推論在10-14歲的族群中,住院與門診的比例不相近
我們使用一筆新的資料,內容為戴安全帽與否與是否發生事故,透過檢定我們想了解是否戴安全帽與否發生事故有關聯
Helmet_Injury <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Injury.csv", header = T, sep = ",")
Helmet_Injury$Helmet <- as.factor(Helmet_Injury$Helmet)
Helmet_Injury$Injury <- as.factor(Helmet_Injury$Injury)
xtabs(~ Helmet + Injury, data = Helmet_Injury)## Injury
## Helmet 0 1
## 0 5 11
## 1 31 3
##
## Fisher's Exact Test for Count Data
##
## data: xtabs(~Helmet_Injury$Helmet + Helmet_Injury$Injury, data = Helmet_Injury)
## p-value = 2.898e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.006295768 0.261238397
## sample estimates:
## odds ratio
## 0.04846721
結論:reject Ho,我們推論是否戴安全帽與是否發生事故有關聯
3/20
使用是否帶安全帽與是否發生事故的資料做卡方檢定
## Warning in chisq.test(xtabs(~Helmet_Injury$Helmet + Helmet_Injury
## $Injury), : Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: xtabs(~Helmet_Injury$Helmet + Helmet_Injury$Injury)
## X-squared = 19.381, df = 1, p-value = 1.071e-05
## Warning in chisq.test(xtabs(~Helmet_Injury$Helmet + Helmet_Injury$Injury)):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: xtabs(~Helmet_Injury$Helmet + Helmet_Injury$Injury)
## X-squared = 16.522, df = 1, p-value = 4.808e-05
結論:reject H0,我們推論是否戴安全帽與是否發生事故有關連
結果中多顯示了一行Warning,原因就是在計算列連表中各格的期望次數時有小於等於5的值出現,因此用卡方檢定是可能會出現誤差
4/10
Mantel-Haenszel Test是一種分析分層列連表的檢定方法,此方法優點是可以在調控分層變數後,檢定兩個類別變數間的關聯性。
資料包含三個變數(Sex、Place、Accident)
Data <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Example_Data.csv", sep = ",", header = T)
head(Data)## Sex Place Accident
## 1 M Z 0
## 2 M Y 0
## 3 F Y 0
## 4 F Z 0
## 5 F X 0
## 6 M X 0
檢定性別與是否發生車禍有無關聯-使用之前教過的Fisher exact test
##
## Fisher's Exact Test for Count Data
##
## data: table(Data$Sex, Data$Accident)
## p-value = 0.04846
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9925868 2.0770631
## sample estimates:
## odds ratio
## 1.434571
檢定結果拒絕Ho,推論性別與是否發生車禍有關聯。
但是我們想知道此差異是否真的是因為性別所影響,或是有第三因子所造成的,因此我們加入這些人居住地區做考量,使用的方法就可用Mantel-Haenszel Test
##
## Mantel-Haenszel chi-squared test with continuity correction
##
## data: table(Data$Sex, Data$Accident, Data$Place)
## Mantel-Haenszel X-squared = 3.7504, df = 1, p-value = 0.0528
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.012565 2.058560
## sample estimates:
## common odds ratio
## 1.443754
p-value為0.0528,結果為不顯著,在合併表格中,雖然性別與是否發生車禍看似相關,但是如果我們調控了居住地區時,各區內性別與是否發生車禍並不相關,發現此結果我們可能會推斷是因為地區與車禍是否發生有關而不是性別。
4/17
在線性迴歸中,我們對於依變數的假設為需要服從常態分配,但在現實狀況中不見得如此,這時候我們就可以藉著使用廣義線性模型來幫我們解決這個問題,廣義線性模型允許我們依變數擁有不同的分配,而藉著不同的連結函數(link function)我們就能將Systemetic component與Random component做連結,常用的link function有logit link、log link。
在本週課程會先介紹當依變數為Categorical data & Count Data時可使用的方法,並當遇到overdispersion問題時的處理,使用資料為Injury的資料檔。
Data <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Injury.csv", sep = ",", header = T)
set.seed(45)
Data$Injury_count <- sample(0:15, 50, replace = T) #增加新變數為觀察時間內發生事件(受傷)次數
model.logit <- glm(Injury ~ Helmet, data = Data, family = binomial("logit"))
model.probit <- glm(Injury ~ Helmet, data = Data, family = binomial("probit"))
summary(model.logit)##
## Call:
## glm(formula = Injury ~ Helmet, family = binomial("logit"), data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5252 -0.4298 -0.4298 0.8657 2.2035
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7885 0.5394 1.462 0.143785
## Helmet -3.1238 0.8102 -3.855 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59.295 on 49 degrees of freedom
## Residual deviance: 40.168 on 48 degrees of freedom
## AIC: 44.168
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = Injury ~ Helmet, family = binomial("probit"), data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5252 -0.4298 -0.4298 0.8657 2.2035
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4888 0.3273 1.493 0.135
## Helmet -1.8405 0.4467 -4.120 3.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59.295 on 49 degrees of freedom
## Residual deviance: 40.168 on 48 degrees of freedom
## AIC: 44.168
##
## Number of Fisher Scoring iterations: 4
當使用link function為logit時,有繫安全帶受傷的機會是沒繫安全帶的exp(-3.1238) = 0.044倍
而使用link function為probit時,有繫安全帶受傷的機會是沒繫安全帶的exp(-1.8405) = 0.159倍
當我們的依變數為Count Data時,所使用的link function為log link
##
## Call:
## glm(formula = Injury_count ~ Helmet, family = poisson("log"),
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4895 -1.1420 -0.0358 1.1009 2.4290
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.11021 0.08704 24.245 < 2e-16 ***
## Helmet -0.30385 0.11139 -2.728 0.00637 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 128.26 on 49 degrees of freedom
## Residual deviance: 121.01 on 48 degrees of freedom
## AIC: 300.6
##
## Number of Fisher Scoring iterations: 5
當所使用分配為poisson時,我們期望依變數的期望值與變異數差異不要過大,因此檢查一下是否出現overdispersion的問題
## [1] 6.78
## [1] 16.54245
發現出現平均數小於變異數的情況,使用negative binomial做配適或許更為保險
##
## Call:
## glm.nb(formula = Injury_count ~ Helmet, data = Data, init.theta = 4.799298545,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80405 -0.73926 -0.02383 0.70338 1.49171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1102 0.1435 14.70 <2e-16 ***
## Helmet -0.3039 0.1776 -1.71 0.0872 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.7993) family taken to be 1)
##
## Null deviance: 59.351 on 49 degrees of freedom
## Residual deviance: 56.400 on 48 degrees of freedom
## AIC: 279.87
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.80
## Std. Err.: 1.75
##
## 2 x log-likelihood: -273.869
在這邊現暫時忽略顯著性問題,把重點放在結果解釋上,結論告訴我們有戴安全帽的人比沒戴安全帽的受傷的次數平均而言少了exp(-0.3039) = 0.738次。
5/1
本次內容將先補充上次上課介紹的如何依變數為binary時,在family = binomial狀況下,下identity link,以及使用quasipoisson觀察overdispersion狀況。使用資料為上次作業使用的crabs data
Crabs <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/crab.csv", header = F, sep = ",")
colnames(Crabs) <- c("c", "s", "w", "wt", "sa")
Crabs$Y <- ifelse(Crabs$sa >= 1, 1, 0)
model.1 <- glm(Y ~ wt, data = Crabs, family = binomial("identity"), start = c(0.0001, 0.0001))
summary(model.1) #start point通常為一個很小的數值,若設太大可能會無法收斂##
## Call:
## glm(formula = Y ~ wt, family = binomial("identity"), data = Crabs,
## start = c(1e-04, 1e-04))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4466 -1.0445 0.9738 1.1634 1.4591
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.08675 0.06879 1.261 0.207
## wt 0.17563 0.01323 13.275 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 217.19 on 171 degrees of freedom
## AIC: 221.19
##
## Number of Fisher Scoring iterations: 25
##
## Call:
## glm(formula = sa ~ wt, family = quasipoisson, data = Crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9306 -1.9981 -0.5627 0.9299 4.9992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4282 0.3167 -1.352 0.178
## wt 0.5892 0.1151 5.121 8.14e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.133896)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 560.84 on 171 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
本週課程內容要介紹交互作用以及模型診斷,在配適模型時我們可能會懷疑所有變項中某兩項多項間可能會互相影響,因此在配適模型時可以加入交互作用項做考量。
Case:我們懷疑螃蟹寬度與重量間可能有交互作用存在,因此藉由檢定來觀察他們之間是否真的有交互作用存在。
##
## Call:
## glm(formula = Y ~ wt + w + w * wt, family = binomial(), data = Crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2385 -1.0298 0.4853 0.9486 1.5160
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.7002 12.2558 0.139 0.890
## wt -4.2441 5.5113 -0.770 0.441
## w -0.1134 0.4826 -0.235 0.814
## wt:w 0.1912 0.2065 0.926 0.355
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 191.98 on 169 degrees of freedom
## AIC: 199.98
##
## Number of Fisher Scoring iterations: 5
在本例中,可以由配適結果看出,當同時在模型裡面放入寬度與重量以及兩者交互作用項,發現每個變項都未顯著,但是在處理交互作用項問題時可以藉由這些方法來觀察。
接著介紹模型診斷,介紹的方法有LR Test、ROC、AIC以及Hosmer-Lemeshow goodness of fit test
## Warning: package 'zoo' was built under R version 3.5.1
model.all <- glm(Y ~ c + s + w + wt, data = Crabs, family = binomial(link = "logit"))
lrtest(model.1, model.all) #也可直接放lrtest(model.1)## Likelihood ratio test
##
## Model 1: Y ~ wt
## Model 2: Y ~ c + s + w + wt
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -108.595
## 2 5 -93.329 3 30.532 1.066e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(Ho\):model.1 is satisfied with this data
\(H1\):model.all is satisfied with this data
結論:reject \(Ho\),我們有足夠證據推論model.all配適較佳
接下來試著繪製ROC Curve,在繪製ROC Curve前,我們先把資料切成訓練資料集(Training Data)以及測試資料集(Testing Data),訓練資料集目的是用於配適Logistic Regression而測試資料集目的則是用於驗證模型準確度
## Warning: package 'caret' was built under R version 3.5.1
intrain <- createDataPartition(y = Crabs$Y, p = 0.7, list = F) #將資料切成30%(Testing) & 70%(Training)
Training <- Crabs[intrain, ]
Testing <- Crabs[-intrain, ]
model.roc <- glm(Y ~ c + s + w + wt, data = Training, family = binomial(link = "logit"))
result <- predict(model.roc, newdata = Testing, type = "response")
pred <- prediction(result, Testing$Y)
per <- performance(pred, measure = "tpr", x.measure = "fpr")
AUC <- performance(pred, "auc")
plot(per, col = rainbow(7), main = "ROC Curve") +
abline(0, 1) +
text(0.5, 0.5, paste0("AUC=", as.character(round(AUC@y.values[[1]], 3))))## integer(0)
cost.perf <- performance(pred, "cost")
pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])] #Find Cutpoint## 35
## 0.6554858
藉由ROC Curve發現AUC=0.783,模型預測表現還不錯
AIC的部分直接在R中使用AIC的function即可,EX:AIC(model.1),以下使用apply家族中常用的函數sapply,藉由sapply我們可以將程式更加簡化
## [1] 221.1909 199.9806 196.6589
結果觀察到當我們把所有變數都加入模型時AIC會最小。
最後一部分介紹另一個模型檢定的方式Hosmer-Lemeshow goodness of fit test,用於檢定模型所產生的期望值與實際資料觀察值間的配適程度,此檢定用於評估模型預測表現是否符合模型本身的預期(如果符合預期 p-value>0.1)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: Crabs$Y, model.1$fitted.values
## X-squared = 24, df = 8, p-value = 0.002292
5/22
在先前課程中提到,當依變數為二元時,可使用Logistic regression做配適,但是在實際情況中往往不是像我們想的這麼簡單,因此就會需要用到這禮拜所介紹的Multicategory Logit Model,而在這種Model中,我們又可以根據依變數不同的類型(名目or順序尺度)而有不同的配適方式,以下將作介紹。
使用的資料為課本上Table6.1的資料,包含三個變數為(id,length,primary.food),其中primary food type分為三組分別是Fish(1),Invertebrate(2),Other(3),以Other為baseline category進行模型配適
## Warning: package 'dplyr' was built under R version 3.5.1
Alligator_1 <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Alligator-1.txt", header = T, sep = " ")
vglm(primary.food ~ length, data = Alligator_1, reflevel = "3",
family = multinomial) %>% summary()##
## Call:
## vglm(formula = primary.food ~ length, family = multinomial, data = Alligator_1,
## reflevel = "3")
##
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## log(mu[,1]/mu[,3]) -2.330 -0.5075 0.5538 0.6836 1.452
## log(mu[,2]/mu[,3]) -2.687 -0.4821 -0.1653 0.7093 3.439
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 1.6177 1.3073 1.237 0.21591
## (Intercept):2 5.6974 1.7937 3.176 0.00149 **
## length:1 -0.1101 0.5171 -0.213 0.83137
## length:2 -2.4654 0.8996 NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 2
##
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
##
## Residual deviance: 98.3412 on 114 degrees of freedom
##
## Log-likelihood: -49.1706 on 114 degrees of freedom
##
## Number of iterations: 5
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## 'length:2'
##
## Reference group is level 3 of the response
寫下兩個model:
\(log(\hat\pi_1/\hat\pi_3)=1.6177-0.1101*length\)
\(log(\hat\pi_1/\hat\pi_3)=5.6974-2.4654*length\)
可以看出鱷魚身長每增加一單位(公尺),以魚類為主要食物型態的機會會是食物類型為其他種類的\(e^{-0.1101}=0.9\)倍,而以無脊椎動物為主要食物型態的機會則會是食物類型為其他種類的\(e^{-2.4654}=0.085\)倍,若今天我們想看的是以魚類與無脊椎動物為主食的鱷魚間的關係,則可以把兩個模型相減得到 \(log(\hat\pi_1/\hat\pi_2)=-4.08+2.355*length\)
從此式可以得到鱷魚身長每增加一單位(公尺),以魚類為主要食物型態的機會會是以無脊椎動物為主食的\(e^{2.355}=10.538\)倍。
Alligator <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Alligator.txt", header = T)
contrasts(Alligator$Lake) <- contr.treatment(levels(Alligator$Lake), base = 1)
contrasts(Alligator$Lake)## hancock oklawaha trafford
## george 0 0 0
## hancock 1 0 0
## oklawaha 0 1 0
## trafford 0 0 1
contrasts(Alligator$Size) <- contr.treatment(levels(Alligator$Size), base = 2)
contrasts(Alligator$Size)## <2.3
## <2.3 1
## >2.3 0
##
## Call:
## vglm(formula = cbind(Invertebrate, Reptile, Bird, Other, Fish) ~
## Size + Lake, family = multinomial, data = Alligator, reflevel = "Fish")
##
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## log(mu[,1]/mu[,5]) -1.3716 -0.4379 -0.0248 0.2436 1.995
## log(mu[,2]/mu[,5]) -0.8298 -0.5850 -0.2309 0.2225 2.237
## log(mu[,3]/mu[,5]) -0.9873 -0.5082 -0.1144 0.2373 3.994
## log(mu[,4]/mu[,5]) -1.5873 -0.3189 -0.0159 1.0330 1.413
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -1.549019 0.424922 -3.645 0.000267 ***
## (Intercept):2 -3.314533 1.053068 NA NA
## (Intercept):3 -2.093077 0.662235 -3.161 0.001574 **
## (Intercept):4 -1.904272 0.525827 -3.621 0.000293 ***
## Size<2.3:1 1.458205 0.395945 3.683 0.000231 ***
## Size<2.3:2 -0.351263 0.580032 -0.606 0.544785
## Size<2.3:3 -0.630660 0.642473 -0.982 0.326291
## Size<2.3:4 0.331550 0.448252 0.740 0.459511
## Lakehancock:1 -1.658359 0.612877 -2.706 0.006813 **
## Lakehancock:2 1.242777 1.185418 1.048 0.294461
## Lakehancock:3 0.695118 0.781263 0.890 0.373608
## Lakehancock:4 0.826196 0.557541 1.482 0.138378
## Lakeoklawaha:1 0.937219 0.471905 1.986 0.047030 *
## Lakeoklawaha:2 2.458872 1.118113 2.199 0.027869 *
## Lakeoklawaha:3 -0.653208 1.201917 -0.543 0.586805
## Lakeoklawaha:4 0.005653 0.776571 0.007 0.994192
## Laketrafford:1 1.121985 0.490513 2.287 0.022174 *
## Laketrafford:2 2.935253 1.116395 2.629 0.008558 **
## Laketrafford:3 1.087767 0.841669 1.292 0.196221
## Laketrafford:4 1.516369 0.621435 2.440 0.014683 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 4
##
## Names of linear predictors:
## log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]), log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
##
## Residual deviance: 52.4785 on 44 degrees of freedom
##
## Log-likelihood: -74.4295 on 44 degrees of freedom
##
## Number of iterations: 5
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):2'
##
## Reference group is level 5 of the response
使用講義中p.12 Example的Data,欲了解考量更多變數後鱷魚主要食物型態為魚類與無脊椎動物間的關係。首先,先將鱷魚的Size與位於的湖泊(Lake)轉成dummy variable藉由contrasts()將\(Size\geq2.3=0\)&\(Lake(George)=0\)配適出的模型為
\(log(\hat\pi_I/\hat\pi_F)=-1.55+1.46*Size-1.66*Lake_H+0.94*Lake_O+1.12*Lake_T\)
例如有隻鱷魚位於Lake George且\(Size\geq2.3\)則鱷魚以無脊椎動物為主食的機會是以魚類為主食的\(e^{-1.55}=0.212\)倍。
接下來使用Mental Impairment的資料集,介紹當依變數為順序尺度時要如何配適Cumulative Logit Model
Mental<-read.table("http://www.stat.ufl.edu/~aa/glm/data/Mental.dat", header = T) #可直接讀取網頁檔案,也可自行下載Ceiba檔案讀取
vglm(impair ~ life + ses, data = Mental, family = cumulative(parallel = T)) %>% summary() #parallel=T,表示有符合Proportional odds##
## Call:
## vglm(formula = impair ~ life + ses, family = cumulative(parallel = T),
## data = Mental)
##
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## logit(P[Y<=1]) -1.568 -0.7048 -0.2102 0.8070 2.713
## logit(P[Y<=2]) -2.328 -0.4666 0.2657 0.6904 1.615
## logit(P[Y<=3]) -3.688 0.1198 0.2039 0.4194 1.892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.2819 0.6231 -0.452 0.65096
## (Intercept):2 1.2128 0.6511 1.863 0.06251 .
## (Intercept):3 2.2094 0.7171 3.081 0.00206 **
## life -0.3189 0.1194 -2.670 0.00759 **
## ses 1.1112 0.6143 1.809 0.07045 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 3
##
## Names of linear predictors:
## logit(P[Y<=1]), logit(P[Y<=2]), logit(P[Y<=3])
##
## Residual deviance: 99.0979 on 115 degrees of freedom
##
## Log-likelihood: -49.5489 on 115 degrees of freedom
##
## Number of iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
## Exponentiated coefficients:
## life ses
## 0.7269742 3.0380707
Model可寫成\(logit[P(Y\leq{j})]=\alpha_j+\beta_1*life\,event+\beta_2*SES\)的型式
舉例來說在固定life event的情況下,社經地位高者不管在給定哪個level( j )下, 有精神障礙的風險是低社經地位的\(e^{1.1112}=3\)倍。
在先前課程中提到Cumulative Logits中,我們所關注的是累積機率,因此使用的Logit link為\(logit(P(Y\leq{j}))\),而今天所要介紹的則是只使用部分依變數的資訊,可以做兩個類別的比較,使用的方法為Adjacent-Categories Logit,而Logit Link為\(logit(P(Y=j+1)/P(Y=j))\),所使用的R套件為VGAM,以下整理出此套件中包含的Family function
同學可參照此圖依照使用得目的選擇適合的Family function,這邊將以課本中例子介紹
library(VGAM)
library(dplyr)
Political_Ideology <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Political_Ideology.csv", header = T, sep = ",")
Political_Ideology$Ideology <- as.factor(Political_Ideology$Ideology)
vglm(Ideology ~ Gender + Party, data = Political_Ideology, family = acat(reverse = F, parallel = F)) %>% summary()## Warning in eval(slot(family, "initialize")): response should be ordinal---
## see ordered()
##
## Call:
## vglm(formula = Ideology ~ Gender + Party, family = acat(reverse = F,
## parallel = F), data = Political_Ideology)
##
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## loge(P[Y=2]/P[Y=1]) -3.371 0.0712 0.1591 0.2380 1.358
## loge(P[Y=3]/P[Y=2]) -2.408 -0.6983 0.2491 0.8383 1.009
## loge(P[Y=4]/P[Y=3]) -1.172 -0.6687 -0.3114 0.6870 2.732
## loge(P[Y=5]/P[Y=4]) -1.785 -0.2853 -0.2242 -0.1143 2.451
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 0.06687 0.19024 0.352 0.725190
## (Intercept):2 0.83169 0.15668 5.308 1.11e-07 ***
## (Intercept):3 -1.63822 0.19734 -8.302 < 2e-16 ***
## (Intercept):4 0.33236 0.23302 1.426 0.153780
## GenderM:1 -0.13565 0.26457 -0.513 0.608152
## GenderM:2 -0.23610 0.21592 -1.093 0.274191
## GenderM:3 0.53441 0.21580 2.476 0.013273 *
## GenderM:4 -0.08632 0.24141 -0.358 0.720666
## PartyR:1 0.42419 0.28331 1.497 0.134329
## PartyR:2 0.43680 0.21674 2.015 0.043876 *
## PartyR:3 0.82594 0.22226 3.716 0.000202 ***
## PartyR:4 -0.12353 0.25468 -0.485 0.627644
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 4
##
## Names of linear predictors:
## loge(P[Y=2]/P[Y=1]), loge(P[Y=3]/P[Y=2]), loge(P[Y=4]/P[Y=3]), loge(P[Y=5]/P[Y=4])
##
## Residual deviance: 2462.332 on 3328 degrees of freedom
##
## Log-likelihood: -1231.166 on 3328 degrees of freedom
##
## Number of iterations: 4
##
## No Hauck-Donner effect found in any of the estimates
以\(log(P[Y=2]/P[Y=1])\)舉例,給定相同性別下,較自由主義與非常自由主義的odds在共和主義的人中是民主主義的人的\(exp(0.42419)=1.528352\)倍。
以課本中Toxicity Study Data示範,簡單來說此研究想看懷孕老鼠在暴露於某一工業用溶劑,在不同劑量下(此處給不同劑量Score)所生下老鼠的狀態,分為三種(死亡、畸形、正常),並去觀察他們之間的關係,在這邊舉例,在給定所生下老鼠是活著的情況,我們想了解畸形與正常老鼠間的關係
library(VGAM)
library(dplyr)
Toxicity <- data.frame(Concentration = c(0, 62.5, 125, 250, 500),
Response_Dead = c(15, 17, 22, 38, 144),
Response_Mal = c(1, 0, 7, 59, 132),
Response_Nor = c(281, 225, 283, 202, 9))
vglm(cbind(Response_Mal, Response_Nor) ~ Concentration, data = Toxicity, family = sratio()) %>% summary()##
## Call:
## vglm(formula = cbind(Response_Mal, Response_Nor) ~ Concentration,
## family = sratio(), data = Toxicity)
##
##
## Pearson residuals:
## [,1]
## [1,] 0.06344
## [2,] -1.49190
## [3,] -0.44346
## [4,] 0.86216
## [5,] -0.87367
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.701902 0.332248 -17.16 <2e-16 ***
## Concentration 0.017375 0.001227 14.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 1
##
## Name of linear predictor: logit(P[Y=1|Y>=1])
##
## Residual deviance: 6.0609 on 3 degrees of freedom
##
## Log-likelihood: -10.7452 on 3 degrees of freedom
##
## Number of iterations: 4
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## 'Concentration'
結果可以看出,在給定老鼠皆是存活的狀態下,當母鼠每日暴露於此工業用溶劑的劑量每增加\(100mg/kg\),生下的老鼠會是畸形的可能性會增加\(exp(0.017375*100)=5.7\)倍。
在現實狀況中,我們可能會想去看試驗者在接受某種治療後,試驗前與試驗後的反應,而此種資料型式就是成對資料,在當資料為連續型時,可以使用Pair t test做檢定,而當資料為類別型時,也有其適合的方法,本章將會做介紹。
首先介紹當資料為類別型時,可以使用Mcnemar Test幫助我們做檢定,我們需把資料整理成2*2 Table,以下舉例:此筆資料主要是訪問800人對於某位候選人就任前與就任後的滿意度,70位在就任前後皆滿意,80位就任前不滿意就任後變為滿意,其餘依此類推。
Performance <- matrix(c(70, 80, 150, 500), nrow = 2,
dimnames = list("1st Survey" = c("Satidfied", "Dissatisfied"),
"2nd Survey" = c("Satidfied", "Dissatisfied")))
mcnemar.test(Performance)##
## McNemar's Chi-squared test with continuity correction
##
## data: Performance
## McNemar's chi-squared = 20.7, df = 1, p-value = 5.372e-06
虛無假設為
\(H_0:就任前後滿意與不滿意比例相同\)
\(H_1:就任前後滿意與不滿意比例不同\)
檢定結果:\(Reject\,H_0\)有足夠證據證明就任前後滿意與不滿意比例不同
在先前所提到的羅吉斯迴歸,資料間彼此互相獨立,因此可以使用,但當資料具有相依性,而我們仍然想要去量化解釋變數對依變數的影響,這時我們就可以使用Marginal logistic regression or Conditional logistic regression幫助我們。以講義中CH8 p.4 table舉例。
Data <- data.frame(CutLS_YES = c(227, 107), CutLS_NO = c(132, 678))
row.names(Data) <- c("HigherTaxes_Yes", "HigherTaxes_No")
Data## CutLS_YES CutLS_NO
## HigherTaxes_Yes 227 132
## HigherTaxes_No 107 678
由於我們要建立的是Marginal Logistic Regression因此要先建立Marginal Data
Data.marginal <- data.frame(Q = c("HigherTaxes", "CutLS"),
yes = c(359, 334),
no = c(785, 810))
Data.marginal## Q yes no
## 1 HigherTaxes 359 785
## 2 CutLS 334 810
接著建立Model
library(dplyr)
glm(cbind(Data.marginal$yes, Data.marginal$no) ~ Q, data = Data.marginal,
family = binomial(link = "logit")) %>% summary()##
## Call:
## glm(formula = cbind(Data.marginal$yes, Data.marginal$no) ~ Q,
## family = binomial(link = "logit"), data = Data.marginal)
##
## Deviance Residuals:
## [1] 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.88589 0.06503 -13.623 <2e-16 ***
## QHigherTaxes 0.10353 0.09104 1.137 0.255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.2939e+00 on 1 degrees of freedom
## Residual deviance: 2.7489e-13 on 0 degrees of freedom
## AIC: 18.649
##
## Number of Fisher Scoring iterations: 2
當我們想了解的是個人對於這兩個問題看法的差異,這時就可以使用Conditional Logistic Regresion,在建立條件羅吉斯回歸時,需要先建構出Subject-Specific table,也就是我們所說的Conditional Model,之所以叫Conditional原因為我們針對每個人都建立table,概念其實很像針對個人去做分層後下去建立模型,也就是說我們是“條件於每個人之下”,去觀察解釋變數對依變數的影響。
首先先將資料轉為Individual Data,當中queestion=1代表是否願意付較高稅收,question=0代表降低生活品質
Data.raw=data.frame(ID = rep(1:1144, each=2),
question = rep(c(1, 0), 1144),
answer = c(rep(c(1, 1), 227), rep(c(1, 0), 132),
rep(c(0, 1), 107), rep(c(0, 0), 678)))接著建立Model
library(dplyr)
library(survival)
clogit(answer ~ question + strata(ID), data = Data.raw) %>% summary()## Call:
## coxph(formula = Surv(rep(1, 2288L), answer) ~ question + strata(ID),
## data = Data.raw, method = "exact")
##
## n= 2288, number of events= 693
##
## coef exp(coef) se(coef) z Pr(>|z|)
## question 0.2100 1.2336 0.1301 1.614 0.106
##
## exp(coef) exp(-coef) lower .95 upper .95
## question 1.234 0.8106 0.956 1.592
##
## Rsquare= 0.001 (max possible= 0.135 )
## Likelihood ratio test= 2.62 on 1 df, p=0.1055
## Wald test = 2.61 on 1 df, p=0.1065
## Score (logrank) test = 2.62 on 1 df, p=0.1059
根據以上結果可以看出,針對個人而言回答願意付較高稅的機會是願意降低生活品質的1.234倍。也就是說個人而言大家還是願意付較高的稅而較不願意降低生活品質。
除了使用Conditional Logistic Regression做配適,我們也可以藉由存活分析中常用到著Cox Proportional Hazard Model達到相同目標,以下以講義CH8 p.18的table做示範。
Individual <- data.frame(Matched = rep(1:144, each = 2), T = rep(c(1:2), 144),
Y = rep(c(1, 0), 144),
X = c(rep(1, 18), rep(c(1, 0), 37), rep(c(0, 1), 16), rep(0, 164)))
head(Individual)## Matched T Y X
## 1 1 1 1 1
## 2 1 2 0 1
## 3 2 1 1 1
## 4 2 2 0 1
## 5 3 1 1 1
## 6 3 2 0 1
library(survival)
my.surv <- Surv(Individual$T, Individual$Y == 1)
Model <- coxph(my.surv ~ X + strata(Matched), data = Individual)
summary(Model)## Call:
## coxph(formula = my.surv ~ X + strata(Matched), data = Individual)
##
## n= 288, number of events= 144
##
## coef exp(coef) se(coef) z Pr(>|z|)
## X 0.8383 2.3125 0.2992 2.802 0.00508 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## X 2.312 0.4324 1.286 4.157
##
## Concordance= 0.573 (se = 0.184 )
## Rsquare= 0.029 (max possible= 0.5 )
## Likelihood ratio test= 8.55 on 1 df, p=0.003449
## Wald test = 7.85 on 1 df, p=0.005082
## Score (logrank) test = 8.32 on 1 df, p=0.003919
\(h(t|Diabetes)=h_0(t)+exp(0.8383*Diabetes)\)
Time=T
STATUS=Y
Matched=Stratified Variable
使用 Cox Proportional Hazard Model 做出來結果與使用 Conditional Logistic Regression 相同,表示有 糖尿病的患者罹患心肌梗塞的風險為沒有糖尿病者的 2.312 倍。
使用Depression Study Data用R code示範,當Working Correlation Matrix為Independent時,比較使用GLM與GEE估計出係數的差別
library(dplyr)
library(gee)
Depression <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Depression.txt",
sep = "", header = T)
glm(outcome ~ severity + drug + time + time*drug, data = Depression, family = binomial(link = "logit")) %>%
summary() #GLM##
## Call:
## glm(formula = outcome ~ severity + drug + time + time * drug,
## family = binomial(link = "logit"), data = Depression)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4352 -1.0220 0.3254 0.9915 1.8009
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.02799 0.16391 -0.171 0.864
## severity -1.31391 0.14641 -8.974 < 2e-16 ***
## drug -0.05960 0.22221 -0.268 0.789
## time 0.48241 0.11476 4.204 2.63e-05 ***
## drug:time 1.01744 0.18879 5.389 7.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1411.9 on 1019 degrees of freedom
## Residual deviance: 1161.9 on 1015 degrees of freedom
## AIC: 1171.9
##
## Number of Fisher Scoring iterations: 4
gee(outcome ~ severity + drug + time + time*drug, id = case, data = Depression, corstr = "independence",
family = binomial(link = "logit")) %>% summary()## (Intercept) severity drug time drug:time
## -0.02798843 -1.31391092 -0.05960381 0.48241209 1.01744498
##
## GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
## gee S-function, version 4.13 modified 98/01/27 (1998)
##
## Model:
## Link: Logit
## Variance to Mean Relation: Binomial
## Correlation Structure: Independent
##
## Call:
## gee(formula = outcome ~ severity + drug + time + time * drug,
## id = case, data = Depression, family = binomial(link = "logit"),
## corstr = "independence")
##
## Summary of Residuals:
## Min 1Q Median 3Q Max
## -0.94844242 -0.40683252 0.05155758 0.38830952 0.80242231
##
##
## Coefficients:
## Estimate Naive S.E. Naive z Robust S.E. Robust z
## (Intercept) -0.02798843 0.1627083 -0.1720160 0.1741865 -0.1606808
## severity -1.31391092 0.1453432 -9.0400569 0.1459845 -9.0003423
## drug -0.05960381 0.2205812 -0.2702126 0.2285385 -0.2608042
## time 0.48241209 0.1139224 4.2345663 0.1199350 4.0222784
## drug:time 1.01744498 0.1874132 5.4288855 0.1876938 5.4207709
##
## Estimated Scale Parameter: 0.9854113
## Number of Iterations: 1
##
## Working Correlation
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
由於Outcome為Binary因此在GEE中也要記得設定family,否則default設定為Identity,結果會有誤差,而GEE在解釋上也與GLM相同
請同學自行下載套件catdata,此套件就有提供Teratology的Data, 有Grouped Data(teratology)及Individual Data(teratology2), 這邊使用Individual Data,分別使用GLM & GEE(Exchangeable Working Correlation Matrix)並與CH9講義P.25對照
library(catdata)
library(dplyr)
library(gee)
data("teratology2")
glm(y ~ Grp, data = teratology2, family = binomial(link = "logit")) %>% summary()##
## Call:
## glm(formula = y ~ Grp, family = binomial(link = "logit"), data = teratology2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6855 -0.4631 -0.2649 0.7437 2.5951
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1440 0.1292 8.855 <2e-16 ***
## GrpG2 -3.3225 0.3308 -10.043 <2e-16 ***
## GrpG3 -4.4762 0.7307 -6.126 9e-10 ***
## GrpG4 -4.1297 0.4762 -8.672 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.68 on 606 degrees of freedom
## Residual deviance: 496.70 on 603 degrees of freedom
## AIC: 504.7
##
## Number of Fisher Scoring iterations: 5
## (Intercept) GrpG2 GrpG3 GrpG4
## 1.143981 -3.322513 -4.476184 -4.129663
##
## GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
## gee S-function, version 4.13 modified 98/01/27 (1998)
##
## Model:
## Link: Logit
## Variance to Mean Relation: Binomial
## Correlation Structure: Exchangeable
##
## Call:
## gee(formula = y ~ Grp, id = Rat, data = teratology2, family = binomial(link = "logit"),
## corstr = "exchangeable")
##
## Summary of Residuals:
## Min 1Q Median 3Q Max
## -0.7705629 -0.1036164 -0.0331754 0.2294371 0.9668246
##
##
## Coefficients:
## Estimate Naive S.E. Naive z Robust S.E. Robust z
## (Intercept) 1.211492 0.2245137 5.396073 0.2695606 4.494323
## GrpG2 -3.369165 0.5660161 -5.952419 0.4304211 -7.827600
## GrpG3 -4.583701 1.3091173 -3.501368 0.6235402 -7.351091
## GrpG4 -4.247410 0.8527494 -4.980842 0.6047894 -7.022957
##
## Estimated Scale Parameter: 1.035308
## Number of Iterations: 3
##
## Working Correlation
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [2,] 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379
## [3,] 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379
## [4,] 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379
## [5,] 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379
## [6,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000
## [7,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [8,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [9,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [10,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [11,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [12,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [13,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [14,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [15,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [16,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [17,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [2,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [3,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [4,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [5,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [6,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [7,] 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [8,] 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379
## [9,] 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379
## [10,] 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379
## [11,] 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379
## [12,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000
## [13,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [14,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [15,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [16,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [17,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [,13] [,14] [,15] [,16] [,17]
## [1,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [2,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [3,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [4,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [5,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [6,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [7,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [8,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [9,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [10,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [11,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [12,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [13,] 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379
## [14,] 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379
## [15,] 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379
## [16,] 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379
## [17,] 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000
1: R語言資料分析 從機器學習、資料探勘、文字探勘到巨量資料分析-李仁鐘、李秋緣
2: R語言資料分析活用範例詳解-方匡南、朱建平、姜葉飛 著,H&C 譯
3: An Introduction to Categorical Data Analysis, Second Edition-Alan Agresti