2/19
Site:https://www.r-project.org/
Step1:使用以上連結進入R下載連結網頁,並點擊“download R”
Step2:選擇國家“Taiwan”,三個連結都可選擇
Step3:選擇載點Mac/Windows/Linux
Step4:選擇“install R for the first time”
Step5:選擇“Download R 3.5.2 for Windows”後執行
Site:https://www.rstudio.com/
Step1:使用以上連結進入RStudio網頁,並點擊“Download”
Step2:選擇RStudio Desktop並點擊“DOWNLOAD”
Step3:選擇電腦版本,下載後安裝執行
Step4:RStudio基本介面介紹
Step4-1:套件安裝可使用Install.packages此命令函式或在右下角Packages窗格按下“install”後輸入套件名稱安裝套件
在讀取資料前可先用setwd()以及getwd()來改變和檢視工作目錄
## [1] "D:/"
我們在政府資料開放平臺上找到一筆資料,主要是關於腸病毒健保門診以及住院就診人數統計
資料介紹&下載網址:https://data.gov.tw/dataset/14590
我們使用read.table()來讀取資料,當中sep表示資料的分隔形式,常見的csv檔分隔形式為逗號分隔形式,header表示資料是否有變數名稱,encoding表示資料檔案格式,Windows作業系統為“Big5”,Mac則是“UTF-8”
Enterovirus <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Enterovirus.csv", sep = ",", header = T, encoding = "Big5")
head(Enterovirus) #使用head()觀察部分資料## 年 週 就診類別 年齡別 縣市 腸病毒健保就診人次 健保就診總人次
## 1 2008 14 住院 0-2 台中市 0 105
## 2 2008 14 住院 0-2 台北市 2 151
## 3 2008 14 住院 0-2 台東縣 0 14
## 4 2008 14 住院 0-2 台南市 0 20
## 5 2008 14 住院 0-2 宜蘭縣 0 44
## 6 2008 14 住院 0-2 花蓮縣 0 17
我們也能直接在Rstudio上直接檢視完整資料,在Global Environment上點選該筆資料就會出現,可參考下圖
若需儲存資料或分析結果至外部檔案,可使用write.table()函數
2/26
## [1] 1 2 3 4 5
## [1] 75 60 85 80 70
## [1] "男" "女" "男" "男" "女"
「#」符號後面表示是注釋,寫程式時清楚表明注釋能提升可讀性
「<-」是指定值符號,意思是把後面的內容值指定給前面的內容,除了用「<-」也可使用「=」
## [1] 75 120 255 320 350
## [1] 0.01333333 0.03333333 0.03529412 0.05000000 0.07142857
## [1] 1 4 9 16 25
## [1] 76 62 88 84 75
## [1] -74 -58 -82 -76 -65
## [1] 75 30 28 20 14
## [1] 0 0 1 0 0
## [1] 15
## [1] 5
## [1] 1
## [1] 3
## [1] 1 5
## [1] 3
## [1] 2.5
## [1] 1 2 3 4 5
## [1] 3
## [1] 1 2 3
## [1] 1 5 3 4 5
## [1] 5 3
類別型資料經常要把資料轉成不同的水準/因子(factor),如性別、職業、是否已婚……等。當判斷此資料中變數為類別型時,建議再次檢查變數在資料中是否是以因子方式存在,若不是則須經過轉換,再繼續分析。
## [1] "character"
## [1] "factor"
## [1] "女" "男"
## z
## 女 男
## 2 3
矩陣是將資料用行和列排列的長方形表格,矩陣內元素必須是相同資料類型
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
## [,1] [,2] [,3] [,4]
## [1,] 49 9 82 62
## [2,] 5 11 6 98
## [3,] 31 55 13 45
## [4,] 76 23 79 39
## [,1] [,2] [,3] [,4]
## [1,] 1341 858 1256 1464
## [2,] 1502 956 1436 1708
## [3,] 1663 1054 1616 1952
## [4,] 1824 1152 1796 2196
## [1] 15
## [1] 4 4
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 14 15 16
## [1] 0
## [1] 4
## [1] 28 32 36 40
## [1] 10 26 42 58
Data Frame是一種矩陣形式的資料,但資料框中各變數可以以不同類型的方式存在,每一行表示一個變數,每一列可表示一筆資料
## class score sex
## 1 1 75 男
## 2 5 60 女
## 3 3 85 男
## 4 4 80 男
## 5 5 70 女
接著我們想對資料框的每一列命名
## class score sex
## 陳xx 1 75 男
## 王xx 5 60 女
## 林xx 3 85 男
## 張xx 4 80 男
## 黃xx 5 70 女
## [1] 75 60 85 80 70
## class score sex
## 林xx 3 85 男
還可以篩選出符合我們的條件,例如:篩選出性別為男性且成績大於等於80分
## class score sex
## 林xx 3 85 男
## 張xx 4 80 男
接著我們想了解一下這筆資料的基本敘述統計
##
## 女 男
## 2 3
##
## 1 3 4 5
## 女 0 0 0 2
## 男 1 1 1 0
在n次獨立的試驗中,每次成功機率接固定為p,X代表成功的次數,以下介紹在R中計算binomial distribution
## [1] 1.881837e-09
## [1] 2.104926e-09
## [1] 24 20 26 23 23
## [1] 23
## [1] 1.269823e-08
## [1] 7.708629e-09
## [1] 27.56153 23.39406 25.88309 25.28619 24.26078
## [1] 31.92952
## [1] 0.08422434
## [1] 0.124652
## [1] 2
*補充:Normal & Poisson approximation to binomial distribution plot
若想要了解函數內參數設定可使用“?plot”方式取得資訊
由於Binomial以及Poisson分配為間斷型分布,故在繪製圖形時使用長條圖表示,type設定為“h”,lwd為線的寬度
而Normal為連續型分布故使用曲線表示,參數部分同學可自行調整觀察
plot(0:30, dbinom(0:30, 200, 0.1), col = "red", ylim = c(0, 0.13), type = "h", lwd = 10,
xlab = "X", ylab = "Probability", main = "Binomial(200, 0.1), Poisson(20) & Normal Plot")
lines(0:30, dpois(0:30, 20), col = "green", lwd = 3, type = "h")
normden <- function(x){dnorm(x, mean = 20, sd = sqrt(200 * 0.1 * 0.9))}
curve(normden, from=0, to=35, add=TRUE, col="blue", lwd = 5)
legend("topright", col = c("red", "green", "blue"), lwd = 3, bty = "o",
legend = c("Binomial", "Poisson", "Normal")) 3/5
本次課程將以實際資料計算最大概似估計值,課程中已介紹過理論的概念,在這邊將以程式示範
在R當中其實已經存在許多可以直接使用的資料,本次使用資料為空氣品質資料,讀入方法如下
在此筆資料中依變數(y)為臭氧濃度(Ozone),自變數(x)為太陽輻射(Solar.R)、平均風速(Wind)、最高溫度(Temp)、月份(Month)、日(Day)
我們想探討最高溫度對於臭氧濃度的影響程度,以簡單線性迴歸進行模型配適,結果如下
##
## Call:
## lm(formula = Ozone ~ Temp, data = Air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.729 -17.409 -0.587 11.306 118.271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -146.9955 18.2872 -8.038 9.37e-13 ***
## Temp 2.4287 0.2331 10.418 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.71 on 114 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.4877, Adjusted R-squared: 0.4832
## F-statistic: 108.5 on 1 and 114 DF, p-value: < 2.2e-16
由結果可觀察到最高溫度每增加一度,臭氧濃度會增加2.4287ppm,接著我們自行實際計算MLE比對結果
首先先估計出迴歸模型中參數估計值,若對於解根方式不熟悉者,可參考以下公式
\[y = b_{0}+b_{1}x \]
如果有多個共變數也是一樣,只是用矩陣表達比較方便,以hers資料為例,我們將L3第52頁的模式以矩陣表達 \[ \left( \begin{matrix} E[Ozone_{1}|X_{1}]\\ E[Ozone_{2}|X_{2}] \\ ...\\ E[Ozone_{n}|X_{n}]\\ \end{matrix} \right )= \left[ \begin{matrix} 1\quad Temp_{1} \\ 1\quad Temp_{2} \\ ...\\ 1\quad Temp_{n} \\ \end{matrix} \right ] \left[ \begin{matrix} \beta_{0}\\ \beta_{1} \\ \end{matrix} \right ] \]
要求迴歸係數也一樣用到最小平方法概念,只是我們需要用到反矩陣 \[ b_{2\times1}= \left( \begin{matrix} b_{0}\\ b_{1} \\ \end{matrix} \right )= \mathbf{(X^{'}X)^{-1}X^{'}Y} \]
了解概念後,我們從實際資料著手
Air <- Air[!is.na(Air$Ozone) == T, ] #移除依變數(y)為遺失值的資料
n <- length(Air$Temp)
X <- matrix(c(rep(1, n), Air$Temp), nrow = n) #建立解根需要的X矩陣
Y <- matrix(Air$Ozone, nrow = n) #建立解根需要的Y矩陣
Beta <- solve(t(X) %*% X) %*% t(X) %*% Y #解根
Beta## [,1]
## [1,] -146.995491
## [2,] 2.428703
解出Beta後,以繪製Likelihood Function來檢查是否解出來的是MLE
Beta0 <- Beta[1, 1]
Beta1 <- seq(0, 5, 0.1)
Loglik <- NULL
for(i in 1:length(Beta1)) {
Loglik[i] <- sum(dnorm(Y, mean = Beta0 + Beta1[i] * Air$Temp, sd = sd(Y), log = T))
}
{plot(Beta1, Loglik, xlab = "Beta1", ylab = "Log Likelihood", type = "l", lwd = 3, col = "blue")
points(x = Beta[2, 1], max(Loglik), col = "red", pch = 19, type = "o", cex = 1.2)}## [1] -541.8762
圖中紅色點可看出,當Beta為我們一開始解根得到的答案時(即Beta = 2.43),概似函數為最大
03/12
使用Enterovirus的資料,預檢定住院的人中10-14歲比例是否為0.5
##
## 0-2 10-14 15+ 3-4 5-9
## 住院 10278 10090 11247 10112 10269
## 門診 11263 11263 11263 11263 11263
prop.test(x = table(Enterovirus[, 3:4])["住院", "10-14"], p = 0.5,
n = sum(table(Enterovirus[, 3:4])[1,]), alternative = "two.sided")##
## 1-sample proportions test with continuity correction
##
## data: table(Enterovirus[, 3:4])["住院", "10-14"] out of sum(table(Enterovirus[, 3:4])[1, ]), null probability 0.5
## X-squared = 19467, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.1906673 0.1974848
## sample estimates:
## p
## 0.1940534
結論:reject Ho,我們推論住院的人中10到14歲所占的比例不為0.5
檢定10-14歲的人中,住院與門診的比例是否相同
prop.test(x = c(table(Enterovirus[, 3:4])["住院", "10-14"],
table(Enterovirus[, 3:4])["門診", "10-14"]),
n = c(sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, 3:4])[2, 2]),
sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, 3:4])[2, 2])))##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(table(Enterovirus[, 3:4])["住院", "10-14"], table(Enterovirus[, out of c(sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, 3:4])[2, 3:4])["門診", "10-14"]) out of 2]), sum(table(Enterovirus[, 3:4])[1, 2], table(Enterovirus[, c(table(Enterovirus[, 3:4])["住院", "10-14"], table(Enterovirus[, out of 3:4])[2, 2]))
## X-squared = 128.65, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.06445051 -0.04541696
## sample estimates:
## prop 1 prop 2
## 0.4725331 0.5274669
結論:reject Ho,我們推論在10-14歲的族群中,住院與門診的比例不相近
我們使用一筆新的資料,內容為戴安全帽與否與是否發生事故,透過檢定我們想了解是否戴安全帽與否發生事故有關聯
Helmet_Injury <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Injury.csv", header = T, sep = ",")
Helmet_Injury$Helmet <- as.factor(Helmet_Injury$Helmet)
Helmet_Injury$Injury <- as.factor(Helmet_Injury$Injury)
xtabs(~ Helmet + Injury, data = Helmet_Injury)## Injury
## Helmet 0 1
## 0 5 11
## 1 31 3
##
## Fisher's Exact Test for Count Data
##
## data: xtabs(~Helmet_Injury$Helmet + Helmet_Injury$Injury, data = Helmet_Injury)
## p-value = 2.898e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.006295768 0.261238397
## sample estimates:
## odds ratio
## 0.04846721
結論:reject Ho,我們推論是否戴安全帽與是否發生事故有關聯
Likelihood ratio test概念為藉由計算最大概似函數以及實際觀察到資料的概似函數間之比值進行檢定,若以模式的概念來看即藉由觀察最大的模式(Full Model)以及我們所建構的模式間概似函數的比值來檢定模式配適的適合程度。
以課本中例子示範,在10次伯努力試驗中得到9次成功1次失敗的結果,欲檢定
\[H_0{:成功機率=0.5}\] \[H_1{:成功機率不等於0.5}\]
L0 <- dbinom(9, 10, 0.5)
L1 <- dbinom(9, 10, 0.9)
-2 * log(L0 / L1) # Likelihood-ratio test statistics## [1] 7.361284
若以模式來看,以上週空氣品質的資料舉例
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.5.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
Air <- airquality
Air <- na.omit(Air)
Lm_Full <- lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = Air)
Lm <- lm(Ozone ~ Temp, data = Air)
lrtest(Lm_Full, Lm)## Likelihood ratio test
##
## Model 1: Ozone ~ Solar.R + Wind + Temp + Month + Day
## Model 2: Ozone ~ Temp
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -491.61
## 2 3 -508.89 -4 34.556 5.729e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[H_0{:兩個模式沒有差異}\] \[H_1{:兩模式有差異}\]
結論:拒絕虛無假設,我們可以推論兩模式間有差異,即較小的模式對於資料的解釋程度仍然無法取代大模式
在空氣汙染指標的標準中,當臭氧濃度低於60ppm,我們則認為空氣品質是好的,反之則認為會對人體有影響,因此在此處我們先將臭氧濃度轉換為類別變數,若高於60ppm則視為1,反之為0。而風力部分也將之做分類,風速0-7.9mph定義為較小風勢(紀錄為0),而8-20.7定義為較大風勢(紀錄為1),欲檢定風勢大小是否與臭氧濃度有關係,我們可以先藉由列聯表觀察分布狀況
Air$Ozone_Cat <- ifelse(Air$Ozone > 60, 1, 0)
Air$Wind_Cat <- ifelse(Air$Wind < 8, 0, 1)
table(Air$Wind_Cat, Air$Ozone_Cat)##
## 0 1
## 0 13 20
## 1 69 9
其實藉由列聯表已經可以觀察到,風勢越大臭氧濃度越小,風勢越小臭氧濃度則越大,但我們仍藉由檢定觀察,定義虛無假設
\[H_0{:風勢大小與臭氧濃度無關}\] \[H_1{:風勢大小與臭氧濃度有關}\]
Air$Ozone_Cat <- as.factor(Air$Ozone_Cat)
Air$Wind_Cat <- as.factor(Air$Wind_Cat) #這兩個變數為類別型,建議都自行轉換較不容易出錯
chisq.test(xtabs(~ Air$Ozone_Cat + Air$Wind_Cat), correct = F)##
## Pearson's Chi-squared test
##
## data: xtabs(~Air$Ozone_Cat + Air$Wind_Cat)
## X-squared = 28.927, df = 1, p-value = 7.514e-08
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: xtabs(~Air$Ozone_Cat + Air$Wind_Cat)
## X-squared = 26.441, df = 1, p-value = 2.717e-07
結論:reject H0,我們推論風勢大小與臭氧濃度高低是有關係的
結果中多顯示了一行warning,原因就是在計算列連表中各格的期望次數時有小於等於5的值出現,因此用卡方檢定是可能會出現誤差
以下再用另一個例子,是否戴安全帽與是否發生車禍受傷有無關係,可由下圖列聯表觀察兩者間統計數字,再由右邊式子自行計算後觀察與程式執行結果是否相似
*補充:Odds ratio在R當中計算
平常在計算Odds ratio時,需要先建立出列聯表再做計算,但以程式操作時函數內可放table或是直接放入變數名稱做計算,根據空氣品質資料舉例
請先下載套件“questionr”並讀入
library(questionr)
Air$Ozone_Cat <- as.factor(Air$Ozone_Cat)
Air$Wind_Cat <- as.factor(Air$Wind_Cat) #這兩個變數為類別型,建議都自行轉換較不容易出錯
odds.ratio(Air$Ozone_Cat, Air$Wind_Cat)## OR 2.5 % 97.5 % p
## Fisher's test 0.087472 0.027957 0.2509 2.703e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## OR 2.5 % 97.5 % p
## Fisher's test 0.087472 0.027957 0.2509 2.703e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
根據呈現的結果,將此筆資料例子以之前課堂上fisher.test的函數實際操作一次,觀察看看結果是如何
03/19
Mantel-Haenszel Test是一種分析分層列連表的檢定方法,此方法優點是可以在調控分層變數後,檢定兩個類別變數間的關聯性。
資料包含三個變數(Sex、Place、Accident)
Data <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Example_Data.csv", sep = ",", header = T)
head(Data)## Sex Place Accident
## 1 M Z 0
## 2 M Y 0
## 3 F Y 0
## 4 F Z 0
## 5 F X 0
## 6 M X 0
檢定性別與是否發生車禍有無關聯-使用之前教過的Fisher exact test
##
## Fisher's Exact Test for Count Data
##
## data: table(Data$Sex, Data$Accident)
## p-value = 0.04846
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9925868 2.0770631
## sample estimates:
## odds ratio
## 1.434571
檢定結果拒絕Ho,推論性別與是否發生車禍有關聯。
但是我們想知道此差異是否真的是因為性別所影響,或是有第三因子所造成的,因此我們加入這些人居住地區做考量,使用的方法就可用Mantel-Haenszel Test
##
## Mantel-Haenszel chi-squared test with continuity correction
##
## data: table(Data$Sex, Data$Accident, Data$Place)
## Mantel-Haenszel X-squared = 3.7504, df = 1, p-value = 0.0528
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.012565 2.058560
## sample estimates:
## common odds ratio
## 1.443754
p-value為0.0528,結果為不顯著,在合併表格中,雖然性別與是否發生車禍看似相關,但是如果我們調控了居住地區時,各區內性別與是否發生車禍並不相關,發現此結果我們可能會推斷是因為地區與車禍是否發生有關而不是性別。
在線性迴歸中,我們對於依變數的假設為需要服從常態分配,但在現實狀況中不見得如此,這時候我們就可以藉著使用廣義線性模型來幫我們解決這個問題,廣義線性模型允許我們依變數擁有不同的分配,而藉著不同的連結函數(link function)我們就能將Systemetic component與Random component做連結,常用的link function有logit link、log link。
在本週課程會先介紹當依變數為Categorical data & Count Data時可使用的方法,並當遇到overdispersion問題時的處理,使用資料為Injury的資料檔。
Data <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Injury.csv", sep = ",", header = T)
set.seed(45)
Data$Injury_count <- sample(0:15, 50, replace = T) #增加新變數為觀察時間內發生事件(受傷)次數
model.logit <- glm(Injury ~ Helmet, data = Data, family = binomial("logit"))
model.probit <- glm(Injury ~ Helmet, data = Data, family = binomial("probit"))
summary(model.logit)##
## Call:
## glm(formula = Injury ~ Helmet, family = binomial("logit"), data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5252 -0.4298 -0.4298 0.8657 2.2035
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7885 0.5394 1.462 0.143785
## Helmet -3.1238 0.8102 -3.855 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59.295 on 49 degrees of freedom
## Residual deviance: 40.168 on 48 degrees of freedom
## AIC: 44.168
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = Injury ~ Helmet, family = binomial("probit"), data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5252 -0.4298 -0.4298 0.8657 2.2035
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4888 0.3273 1.493 0.135
## Helmet -1.8405 0.4467 -4.120 3.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59.295 on 49 degrees of freedom
## Residual deviance: 40.168 on 48 degrees of freedom
## AIC: 44.168
##
## Number of Fisher Scoring iterations: 4
當使用link function為logit時,有繫安全帶受傷的機會是沒繫安全帶的exp(-3.1238) = 0.044倍
而使用link function為probit時,有繫安全帶受傷的機會是沒繫安全帶的exp(-1.8405) = 0.159倍
3/25
當我們的依變數為Count Data時,所使用的link function為log link
##
## Call:
## glm(formula = Injury_count ~ Helmet, family = poisson("log"),
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4895 -1.1420 -0.0358 1.1009 2.4290
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.11021 0.08704 24.245 < 2e-16 ***
## Helmet -0.30385 0.11139 -2.728 0.00637 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 128.26 on 49 degrees of freedom
## Residual deviance: 121.01 on 48 degrees of freedom
## AIC: 300.6
##
## Number of Fisher Scoring iterations: 5
當所使用分配為poisson時,我們期望依變數的期望值與變異數差異不要過大,因此檢查一下是否出現overdispersion的問題
## [1] 6.78
## [1] 16.54245
發現出現平均數小於變異數的情況,使用negative binomial做配適或許更為保險
##
## Call:
## glm.nb(formula = Injury_count ~ Helmet, data = Data, init.theta = 4.799298545,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80405 -0.73926 -0.02383 0.70338 1.49171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1102 0.1435 14.70 <2e-16 ***
## Helmet -0.3039 0.1776 -1.71 0.0872 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.7993) family taken to be 1)
##
## Null deviance: 59.351 on 49 degrees of freedom
## Residual deviance: 56.400 on 48 degrees of freedom
## AIC: 279.87
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.80
## Std. Err.: 1.75
##
## 2 x log-likelihood: -273.869
在這邊現暫時忽略顯著性問題,把重點放在結果解釋上,結論告訴我們有戴安全帽的人比沒戴安全帽的受傷的次數平均而言少了exp(-0.3039) = 0.738次。
本次內容將先補充上次上課介紹的如何依變數為binary時,在family = binomial狀況下,下identity link,以及使用quasipoisson觀察overdispersion狀況。使用資料為上次作業使用的crabs data
Crabs <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/crab.csv", header = F, sep = ",")
colnames(Crabs) <- c("c", "s", "w", "wt", "sa")
Crabs$Y <- ifelse(Crabs$sa >= 1, 1, 0)
model.1 <- glm(Y ~ wt, data = Crabs, family = binomial("identity"), start = c(0.0001, 0.0001))
summary(model.1) #start point通常為一個很小的數值,若設太大可能會無法收斂##
## Call:
## glm(formula = Y ~ wt, family = binomial("identity"), data = Crabs,
## start = c(1e-04, 1e-04))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4466 -1.0445 0.9738 1.1634 1.4591
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.08675 0.06879 1.261 0.207
## wt 0.17563 0.01323 13.275 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 217.19 on 171 degrees of freedom
## AIC: 221.19
##
## Number of Fisher Scoring iterations: 25
##
## Call:
## glm(formula = sa ~ wt, family = quasipoisson, data = Crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9306 -1.9981 -0.5627 0.9299 4.9992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4282 0.3167 -1.352 0.178
## wt 0.5892 0.1151 5.121 8.14e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.133896)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 560.84 on 171 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
本週課程內容要介紹交互作用以及模型診斷,在配適模型時我們可能會懷疑所有變項中某兩項多項間可能會互相影響,因此在配適模型時可以加入交互作用項做考量。
Case:我們懷疑螃蟹寬度與重量間可能有交互作用存在,因此藉由檢定來觀察他們之間是否真的有交互作用存在。
##
## Call:
## glm(formula = Y ~ wt + w + w * wt, family = binomial(), data = Crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2385 -1.0298 0.4853 0.9486 1.5160
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.7002 12.2558 0.139 0.890
## wt -4.2441 5.5113 -0.770 0.441
## w -0.1134 0.4826 -0.235 0.814
## wt:w 0.1912 0.2065 0.926 0.355
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 191.98 on 169 degrees of freedom
## AIC: 199.98
##
## Number of Fisher Scoring iterations: 5
在本例中,可以由配適結果看出,當同時在模型裡面放入寬度與重量以及兩者交互作用項,發現每個變項都未顯著,但是在處理交互作用項問題時可以藉由這些方法來觀察。
接著介紹模型診斷,介紹的方法有LR Test、ROC、AIC以及Hosmer-Lemeshow goodness of fit test
model.all <- glm(Y ~ c + s + w + wt, data = Crabs, family = binomial(link = "logit"))
lrtest(model.1, model.all) #也可直接放lrtest(model.1)## Likelihood ratio test
##
## Model 1: Y ~ wt
## Model 2: Y ~ c + s + w + wt
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -108.595
## 2 5 -93.329 3 30.532 1.066e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(Ho\):model.1 is satisfied with this data
\(H1\):model.all is satisfied with this data
結論:reject \(Ho\),我們有足夠證據推論model.all配適較佳
接下來試著繪製ROC Curve,在繪製ROC Curve前,我們先把資料切成訓練資料集(Training Data)以及測試資料集(Testing Data),訓練資料集目的是用於配適Logistic Regression而測試資料集目的則是用於驗證模型準確度
intrain <- createDataPartition(y = Crabs$Y, p = 0.7, list = F) #將資料切成30%(Testing) & 70%(Training)
Training <- Crabs[intrain, ]
Testing <- Crabs[-intrain, ]
model.roc <- glm(Y ~ c + s + w + wt, data = Training, family = binomial(link = "logit"))
result <- predict(model.roc, newdata = Testing, type = "response")
pred <- prediction(result, Testing$Y)
per <- performance(pred, measure = "tpr", x.measure = "fpr")
AUC <- performance(pred, "auc")
plot(per, col = rainbow(7), main = "ROC Curve") +
abline(0, 1) +
text(0.5, 0.5, paste0("AUC=", as.character(round(AUC@y.values[[1]], 3))))## integer(0)
cost.perf <- performance(pred, "cost")
pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])] #Find Cutpoint## 35
## 0.6554858
藉由ROC Curve發現AUC=0.783,模型預測表現還不錯
AIC的部分直接在R中使用AIC的function即可,EX:AIC(model.1),以下使用apply家族中常用的函數sapply,藉由sapply我們可以將程式更加簡化
## [1] 221.1909 199.9806 196.6589
結果觀察到當我們把所有變數都加入模型時AIC會最小。
最後一部分介紹另一個模型檢定的方式Hosmer-Lemeshow goodness of fit test,用於檢定模型所產生的期望值與實際資料觀察值間的配適程度,此檢定用於評估模型預測表現是否符合模型本身的預期(如果符合預期 p-value>0.1)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: Crabs$Y, model.1$fitted.values
## X-squared = 24, df = 8, p-value = 0.002292
04/16
在先前課程中提到,當依變數為二元時,可使用Logistic regression做配適,但是在實際情況中往往不是像我們想的這麼簡單,因此就會需要用到這禮拜所介紹的Multicategory Logit Model,而在這種Model中,我們又可以根據依變數不同的類型(名目or順序尺度)而有不同的配適方式,以下將作介紹。
使用的資料為課本上Table6.1的資料,包含三個變數為(id,length,primary.food),其中primary food type分為三組分別是Fish(1),Invertebrate(2),Other(3),以Other為baseline category進行模型配適
library(VGAM)
library(dplyr)
Alligator_1 <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Alligator-1.txt", header = T, sep = " ")
vglm(primary.food ~ length, data = Alligator_1, reflevel = "3",
family = multinomial) %>% summary()##
## Call:
## vglm(formula = primary.food ~ length, family = multinomial, data = Alligator_1,
## reflevel = "3")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## log(mu[,1]/mu[,3]) -2.330 -0.5075 0.5538 0.6836 1.452
## log(mu[,2]/mu[,3]) -2.687 -0.4821 -0.1653 0.7093 3.439
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 1.6177 1.3073 1.237 0.21591
## (Intercept):2 5.6974 1.7937 3.176 0.00149 **
## length:1 -0.1101 0.5171 -0.213 0.83137
## length:2 -2.4654 0.8996 NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
##
## Residual deviance: 98.3412 on 114 degrees of freedom
##
## Log-likelihood: -49.1706 on 114 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## 'length:2'
##
##
## Reference group is level 3 of the response
寫下兩個model:
\(log(\hat\pi_1/\hat\pi_3)=1.6177-0.1101*length\)
\(log(\hat\pi_1/\hat\pi_3)=5.6974-2.4654*length\)
可以看出鱷魚身長每增加一單位(公尺),以魚類為主要食物型態的機會會是食物類型為其他種類的\(e^{-0.1101}=0.9\)倍,而以無脊椎動物為主要食物型態的機會則會是食物類型為其他種類的\(e^{-2.4654}=0.085\)倍,若今天我們想看的是以魚類與無脊椎動物為主食的鱷魚間的關係,則可以把兩個模型相減得到 \(log(\hat\pi_1/\hat\pi_2)=-4.08+2.355*length\)
從此式可以得到鱷魚身長每增加一單位(公尺),以魚類為主要食物型態的機會會是以無脊椎動物為主食的\(e^{2.355}=10.538\)倍。
Alligator <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Alligator.txt", header = T)
contrasts(Alligator$Lake) <- contr.treatment(levels(Alligator$Lake), base = 1)
contrasts(Alligator$Lake)## hancock oklawaha trafford
## george 0 0 0
## hancock 1 0 0
## oklawaha 0 1 0
## trafford 0 0 1
contrasts(Alligator$Size) <- contr.treatment(levels(Alligator$Size), base = 2)
contrasts(Alligator$Size)## <2.3
## <2.3 1
## >2.3 0
##
## Call:
## vglm(formula = cbind(Invertebrate, Reptile, Bird, Other, Fish) ~
## Size + Lake, family = multinomial, data = Alligator, reflevel = "Fish")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## log(mu[,1]/mu[,5]) -1.3716 -0.4379 -0.0248 0.2436 1.995
## log(mu[,2]/mu[,5]) -0.8298 -0.5850 -0.2309 0.2225 2.237
## log(mu[,3]/mu[,5]) -0.9873 -0.5082 -0.1144 0.2373 3.994
## log(mu[,4]/mu[,5]) -1.5873 -0.3189 -0.0159 1.0330 1.413
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -1.549019 0.424922 -3.645 0.000267 ***
## (Intercept):2 -3.314533 1.053068 NA NA
## (Intercept):3 -2.093077 0.662235 -3.161 0.001574 **
## (Intercept):4 -1.904272 0.525827 -3.621 0.000293 ***
## Size<2.3:1 1.458205 0.395945 3.683 0.000231 ***
## Size<2.3:2 -0.351263 0.580032 -0.606 0.544785
## Size<2.3:3 -0.630660 0.642473 -0.982 0.326291
## Size<2.3:4 0.331550 0.448252 0.740 0.459511
## Lakehancock:1 -1.658359 0.612877 -2.706 0.006813 **
## Lakehancock:2 1.242777 1.185418 1.048 0.294461
## Lakehancock:3 0.695118 0.781263 0.890 0.373608
## Lakehancock:4 0.826196 0.557541 1.482 0.138378
## Lakeoklawaha:1 0.937219 0.471905 1.986 0.047030 *
## Lakeoklawaha:2 2.458872 1.118113 2.199 0.027869 *
## Lakeoklawaha:3 -0.653208 1.201917 -0.543 0.586805
## Lakeoklawaha:4 0.005653 0.776571 0.007 0.994192
## Laketrafford:1 1.121985 0.490513 2.287 0.022174 *
## Laketrafford:2 2.935253 1.116395 2.629 0.008558 **
## Laketrafford:3 1.087767 0.841669 1.292 0.196221
## Laketrafford:4 1.516369 0.621435 2.440 0.014683 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,5]), log(mu[,2]/mu[,5]),
## log(mu[,3]/mu[,5]), log(mu[,4]/mu[,5])
##
## Residual deviance: 52.4785 on 44 degrees of freedom
##
## Log-likelihood: -74.4295 on 44 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):2'
##
##
## Reference group is level 5 of the response
使用講義中p.12 Example的Data,欲了解考量更多變數後鱷魚主要食物型態為魚類與無脊椎動物間的關係。首先,先將鱷魚的Size與位於的湖泊(Lake)轉成dummy variable藉由contrasts()將\(Size\geq2.3=0\)&\(Lake(George)=0\)配適出的模型為
\(log(\hat\pi_I/\hat\pi_F)=-1.55+1.46*Size-1.66*Lake_H+0.94*Lake_O+1.12*Lake_T\)
例如有隻鱷魚位於Lake George且\(Size\geq2.3\)則鱷魚以無脊椎動物為主食的機會是以魚類為主食的\(e^{-1.55}=0.212\)倍。
接下來使用Mental Impairment的資料集,介紹當依變數為順序尺度時要如何配適Cumulative Logit Model
Mental<-read.table("http://www.stat.ufl.edu/~aa/glm/data/Mental.dat", header = T) #可直接讀取網頁檔案,也可自行下載Ceiba檔案讀取
vglm(impair ~ life + ses, data = Mental, family = cumulative(parallel = T)) %>% summary() #parallel=T,表示有符合Proportional odds##
## Call:
## vglm(formula = impair ~ life + ses, family = cumulative(parallel = T),
## data = Mental)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## logitlink(P[Y<=1]) -1.568 -0.7048 -0.2102 0.8070 2.713
## logitlink(P[Y<=2]) -2.328 -0.4666 0.2657 0.6904 1.615
## logitlink(P[Y<=3]) -3.688 0.1198 0.2039 0.4194 1.892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.2819 0.6231 -0.452 0.65096
## (Intercept):2 1.2128 0.6511 1.863 0.06251 .
## (Intercept):3 2.2094 0.7171 3.081 0.00206 **
## life -0.3189 0.1194 -2.670 0.00759 **
## ses 1.1112 0.6143 1.809 0.07045 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
## logitlink(P[Y<=3])
##
## Residual deviance: 99.0979 on 115 degrees of freedom
##
## Log-likelihood: -49.5489 on 115 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Exponentiated coefficients:
## life ses
## 0.7269742 3.0380707
Model可寫成\(logit[P(Y\leq{j})]=\alpha_j+\beta_1*life\,event+\beta_2*SES\)的型式
舉例來說在固定life event的情況下,社經地位高者不管在給定哪個level( j )下, 有精神障礙的風險是低社經地位的\(e^{1.1112}=3\)倍。
04/30
在先前課程中提到Cumulative Logits中,我們所關注的是累積機率,因此使用的Logit link為\(logit(P(Y\leq{j}))\),而今天所要介紹的則是只使用部分依變數的資訊,可以做兩個類別的比較,使用的方法為Adjacent-Categories Logit,而Logit Link為\(logit(P(Y=j+1)/P(Y=j))\),所使用的R套件為VGAM,以下整理出此套件中包含的Family function
同學可參照此圖依照使用得目的選擇適合的Family function,這邊將以課本中例子介紹
library(VGAM)
library(dplyr)
Political_Ideology <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Political_Ideology.csv", header = T, sep = ",")
Political_Ideology$Ideology <- as.factor(Political_Ideology$Ideology)
vglm(Ideology ~ Gender + Party, data = Political_Ideology, family = acat(reverse = F, parallel = F)) %>% summary()## Warning in eval(slot(family, "initialize")): response should be ordinal---
## see ordered()
##
## Call:
## vglm(formula = Ideology ~ Gender + Party, family = acat(reverse = F,
## parallel = F), data = Political_Ideology)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## loglink(P[Y=2]/P[Y=1]) -3.371 0.0712 0.1591 0.2380 1.358
## loglink(P[Y=3]/P[Y=2]) -2.408 -0.6983 0.2491 0.8383 1.009
## loglink(P[Y=4]/P[Y=3]) -1.172 -0.6687 -0.3114 0.6870 2.732
## loglink(P[Y=5]/P[Y=4]) -1.785 -0.2853 -0.2242 -0.1143 2.451
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 0.06687 0.19024 0.352 0.725190
## (Intercept):2 0.83169 0.15668 5.308 1.11e-07 ***
## (Intercept):3 -1.63822 0.19734 -8.302 < 2e-16 ***
## (Intercept):4 0.33236 0.23302 1.426 0.153780
## GenderM:1 -0.13565 0.26457 -0.513 0.608152
## GenderM:2 -0.23610 0.21592 -1.093 0.274191
## GenderM:3 0.53441 0.21580 2.476 0.013273 *
## GenderM:4 -0.08632 0.24141 -0.358 0.720666
## PartyR:1 0.42419 0.28331 1.497 0.134329
## PartyR:2 0.43680 0.21674 2.015 0.043876 *
## PartyR:3 0.82594 0.22226 3.716 0.000202 ***
## PartyR:4 -0.12353 0.25468 -0.485 0.627644
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: loglink(P[Y=2]/P[Y=1]),
## loglink(P[Y=3]/P[Y=2]), loglink(P[Y=4]/P[Y=3]), loglink(P[Y=5]/P[Y=4])
##
## Residual deviance: 2462.332 on 3328 degrees of freedom
##
## Log-likelihood: -1231.166 on 3328 degrees of freedom
##
## Number of Fisher scoring iterations: 4
##
## No Hauck-Donner effect found in any of the estimates
以\(log(P[Y=2]/P[Y=1])\)舉例,給定相同性別下,較自由主義與非常自由主義的odds在共和主義的人中是民主主義的人的\(exp(0.42419)=1.528352\)倍。
以課本中Toxicity Study Data示範,簡單來說此研究想看懷孕老鼠在暴露於某一工業用溶劑,在不同劑量下(此處給不同劑量Score)所生下老鼠的狀態,分為三種(死亡、畸形、正常),並去觀察他們之間的關係,在這邊舉例,在給定所生下老鼠是活著的情況,我們想了解畸形與正常老鼠間的關係
library(VGAM)
library(dplyr)
Toxicity <- data.frame(Concentration = c(0, 62.5, 125, 250, 500),
Response_Dead = c(15, 17, 22, 38, 144),
Response_Mal = c(1, 0, 7, 59, 132),
Response_Nor = c(281, 225, 283, 202, 9))
vglm(cbind(Response_Mal, Response_Nor) ~ Concentration, data = Toxicity, family = sratio()) %>% summary()##
## Call:
## vglm(formula = cbind(Response_Mal, Response_Nor) ~ Concentration,
## family = sratio(), data = Toxicity)
##
## Pearson residuals:
## [,1]
## [1,] 0.06344
## [2,] -1.49190
## [3,] -0.44346
## [4,] 0.86216
## [5,] -0.87367
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.701902 0.332248 -17.16 <2e-16 ***
## Concentration 0.017375 0.001227 14.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Name of linear predictor: logitlink(P[Y=1|Y>=1])
##
## Residual deviance: 6.0609 on 3 degrees of freedom
##
## Log-likelihood: -10.7452 on 3 degrees of freedom
##
## Number of Fisher scoring iterations: 4
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## 'Concentration'
結果可以看出,在給定老鼠皆是存活的狀態下,當母鼠每日暴露於此工業用溶劑的劑量每增加\(100mg/kg\),生下的老鼠會是畸形的可能性會增加\(exp(0.017375*100)=5.7\)倍。
05/14
在現實狀況中,我們可能會想去看試驗者在接受某種治療後,試驗前與試驗後的反應,而此種資料型式就是成對資料,在當資料為連續型時,可以使用Pair t test做檢定,而當資料為類別型時,也有其適合的方法,本章將會做介紹。
首先介紹當資料為類別型時,可以使用Mcnemar Test幫助我們做檢定,我們需把資料整理成2*2 Table,以下舉例:此筆資料主要是訪問800人對於某位候選人就任前與就任後的滿意度,70位在就任前後皆滿意,80位就任前不滿意就任後變為滿意,其餘依此類推。
Performance <- matrix(c(70, 80, 150, 500), nrow = 2,
dimnames = list("1st Survey" = c("Satidfied", "Dissatisfied"),
"2nd Survey" = c("Satidfied", "Dissatisfied")))
mcnemar.test(Performance)##
## McNemar's Chi-squared test with continuity correction
##
## data: Performance
## McNemar's chi-squared = 20.7, df = 1, p-value = 5.372e-06
虛無假設為
\(H_0:就任前後滿意與不滿意比例相同\)
\(H_1:就任前後滿意與不滿意比例不同\)
檢定結果:\(Reject\,H_0\)有足夠證據證明就任前後滿意與不滿意比例不同
在先前所提到的羅吉斯迴歸,資料間彼此互相獨立,因此可以使用,但當資料具有相依性,而我們仍然想要去量化解釋變數對依變數的影響,這時我們就可以使用Marginal logistic regression or Conditional logistic regression幫助我們。以講義中CH8 p.4 table舉例。
Data <- data.frame(CutLS_YES = c(227, 107), CutLS_NO = c(132, 678))
row.names(Data) <- c("HigherTaxes_Yes", "HigherTaxes_No")
Data## CutLS_YES CutLS_NO
## HigherTaxes_Yes 227 132
## HigherTaxes_No 107 678
由於我們要建立的是Marginal Logistic Regression因此要先建立Marginal Data
Data.marginal <- data.frame(Q = c("HigherTaxes", "CutLS"),
yes = c(359, 334),
no = c(785, 810))
Data.marginal## Q yes no
## 1 HigherTaxes 359 785
## 2 CutLS 334 810
接著建立Model
library(dplyr)
glm(cbind(Data.marginal$yes, Data.marginal$no) ~ Q, data = Data.marginal,
family = binomial(link = "logit")) %>% summary()##
## Call:
## glm(formula = cbind(Data.marginal$yes, Data.marginal$no) ~ Q,
## family = binomial(link = "logit"), data = Data.marginal)
##
## Deviance Residuals:
## [1] 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.88589 0.06503 -13.623 <2e-16 ***
## QHigherTaxes 0.10353 0.09104 1.137 0.255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.2939e+00 on 1 degrees of freedom
## Residual deviance: 2.7489e-13 on 0 degrees of freedom
## AIC: 18.649
##
## Number of Fisher Scoring iterations: 2
當我們想了解的是個人對於這兩個問題看法的差異,這時就可以使用Conditional Logistic Regresion,在建立條件羅吉斯回歸時,需要先建構出Subject-Specific table,也就是我們所說的Conditional Model,之所以叫Conditional原因為我們針對每個人都建立table,概念其實很像針對個人去做分層後下去建立模型,也就是說我們是“條件於每個人之下”,去觀察解釋變數對依變數的影響。
首先先將資料轉為Individual Data,當中queestion=1代表是否願意付較高稅收,question=0代表降低生活品質
Data.raw=data.frame(ID = rep(1:1144, each=2),
question = rep(c(1, 0), 1144),
answer = c(rep(c(1, 1), 227), rep(c(1, 0), 132),
rep(c(0, 1), 107), rep(c(0, 0), 678)))接著建立Model
library(dplyr)
library(survival)
clogit(answer ~ question + strata(ID), data = Data.raw) %>% summary()## Call:
## coxph(formula = Surv(rep(1, 2288L), answer) ~ question + strata(ID),
## data = Data.raw, method = "exact")
##
## n= 2288, number of events= 693
##
## coef exp(coef) se(coef) z Pr(>|z|)
## question 0.2100 1.2336 0.1301 1.614 0.106
##
## exp(coef) exp(-coef) lower .95 upper .95
## question 1.234 0.8106 0.956 1.592
##
## Concordance= 0.552 (se = 0.045 )
## Rsquare= 0.001 (max possible= 0.135 )
## Likelihood ratio test= 2.62 on 1 df, p=0.1
## Wald test = 2.61 on 1 df, p=0.1
## Score (logrank) test = 2.62 on 1 df, p=0.1
根據以上結果可以看出,針對個人而言回答願意付較高稅的機會是願意降低生活品質的1.234倍。也就是說個人而言大家還是願意付較高的稅而較不願意降低生活品質。
除了使用Conditional Logistic Regression做配適,我們也可以藉由存活分析中常用到著Cox Proportional Hazard Model達到相同目標,以下以講義CH8 p.18的table做示範。
Individual <- data.frame(Matched = rep(1:144, each = 2), T = rep(c(1:2), 144),
Y = rep(c(1, 0), 144),
X = c(rep(1, 18), rep(c(1, 0), 37), rep(c(0, 1), 16), rep(0, 164)))
head(Individual)## Matched T Y X
## 1 1 1 1 1
## 2 1 2 0 1
## 3 2 1 1 1
## 4 2 2 0 1
## 5 3 1 1 1
## 6 3 2 0 1
library(survival)
my.surv <- Surv(Individual$T, Individual$Y == 1)
Model <- coxph(my.surv ~ X + strata(Matched), data = Individual)
summary(Model)## Call:
## coxph(formula = my.surv ~ X + strata(Matched), data = Individual)
##
## n= 288, number of events= 144
##
## coef exp(coef) se(coef) z Pr(>|z|)
## X 0.8383 2.3125 0.2992 2.802 0.00508 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## X 2.312 0.4324 1.286 4.157
##
## Concordance= 0.573 (se = 0.035 )
## Rsquare= 0.029 (max possible= 0.5 )
## Likelihood ratio test= 8.55 on 1 df, p=0.003
## Wald test = 7.85 on 1 df, p=0.005
## Score (logrank) test = 8.32 on 1 df, p=0.004
\(h(t|Diabetes)=h_0(t)+exp(0.8383*Diabetes)\)
Time=T
STATUS=Y
Matched=Stratified Variable
使用 Cox Proportional Hazard Model 做出來結果與使用 Conditional Logistic Regression 相同,表示有 糖尿病的患者罹患心肌梗塞的風險為沒有糖尿病者的 2.312 倍。
5/21
使用Depression Study Data用R code示範,當Working Correlation Matrix為Independent時,比較使用GLM與GEE估計出係數的差別
library(dplyr)
library(gee)
Depression <- read.table("D:/NTU Class Lectures/Categorical Data Analysis/R_Class/Data/Depression.txt",
sep = "", header = T)
glm(outcome ~ severity + drug + time + time*drug, data = Depression, family = binomial(link = "logit")) %>%
summary() #GLM##
## Call:
## glm(formula = outcome ~ severity + drug + time + time * drug,
## family = binomial(link = "logit"), data = Depression)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4352 -1.0220 0.3254 0.9915 1.8009
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.02799 0.16391 -0.171 0.864
## severity -1.31391 0.14641 -8.974 < 2e-16 ***
## drug -0.05960 0.22221 -0.268 0.789
## time 0.48241 0.11476 4.204 2.63e-05 ***
## drug:time 1.01744 0.18879 5.389 7.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1411.9 on 1019 degrees of freedom
## Residual deviance: 1161.9 on 1015 degrees of freedom
## AIC: 1171.9
##
## Number of Fisher Scoring iterations: 4
gee(outcome ~ severity + drug + time + time*drug, id = case, data = Depression, corstr = "independence",
family = binomial(link = "logit")) %>% summary()## (Intercept) severity drug time drug:time
## -0.02798843 -1.31391092 -0.05960381 0.48241209 1.01744498
##
## GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
## gee S-function, version 4.13 modified 98/01/27 (1998)
##
## Model:
## Link: Logit
## Variance to Mean Relation: Binomial
## Correlation Structure: Independent
##
## Call:
## gee(formula = outcome ~ severity + drug + time + time * drug,
## id = case, data = Depression, family = binomial(link = "logit"),
## corstr = "independence")
##
## Summary of Residuals:
## Min 1Q Median 3Q Max
## -0.94844242 -0.40683252 0.05155758 0.38830952 0.80242231
##
##
## Coefficients:
## Estimate Naive S.E. Naive z Robust S.E. Robust z
## (Intercept) -0.02798843 0.1627083 -0.1720160 0.1741865 -0.1606808
## severity -1.31391092 0.1453432 -9.0400569 0.1459845 -9.0003423
## drug -0.05960381 0.2205812 -0.2702126 0.2285385 -0.2608042
## time 0.48241209 0.1139224 4.2345663 0.1199350 4.0222784
## drug:time 1.01744498 0.1874132 5.4288855 0.1876938 5.4207709
##
## Estimated Scale Parameter: 0.9854113
## Number of Iterations: 1
##
## Working Correlation
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
由於Outcome為Binary因此在GEE中也要記得設定family,否則default設定為Identity,結果會有誤差,而GEE在解釋上也與GLM相同
請同學自行下載套件catdata,此套件就有提供Teratology的Data, 有Grouped Data(teratology)及Individual Data(teratology2), 這邊使用Individual Data,分別使用GLM & GEE(Exchangeable Working Correlation Matrix)並與CH9講義P.25對照
library(catdata)
library(dplyr)
library(gee)
data("teratology2")
glm(y ~ Grp, data = teratology2, family = binomial(link = "logit")) %>% summary()##
## Call:
## glm(formula = y ~ Grp, family = binomial(link = "logit"), data = teratology2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6855 -0.4631 -0.2649 0.7437 2.5951
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1440 0.1292 8.855 <2e-16 ***
## GrpG2 -3.3225 0.3308 -10.043 <2e-16 ***
## GrpG3 -4.4762 0.7307 -6.126 9e-10 ***
## GrpG4 -4.1297 0.4762 -8.672 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 832.68 on 606 degrees of freedom
## Residual deviance: 496.70 on 603 degrees of freedom
## AIC: 504.7
##
## Number of Fisher Scoring iterations: 5
## (Intercept) GrpG2 GrpG3 GrpG4
## 1.143981 -3.322513 -4.476184 -4.129663
##
## GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
## gee S-function, version 4.13 modified 98/01/27 (1998)
##
## Model:
## Link: Logit
## Variance to Mean Relation: Binomial
## Correlation Structure: Exchangeable
##
## Call:
## gee(formula = y ~ Grp, id = Rat, data = teratology2, family = binomial(link = "logit"),
## corstr = "exchangeable")
##
## Summary of Residuals:
## Min 1Q Median 3Q Max
## -0.7705629 -0.1036164 -0.0331754 0.2294371 0.9668246
##
##
## Coefficients:
## Estimate Naive S.E. Naive z Robust S.E. Robust z
## (Intercept) 1.211492 0.2245137 5.396073 0.2695606 4.494323
## GrpG2 -3.369165 0.5660161 -5.952419 0.4304211 -7.827600
## GrpG3 -4.583701 1.3091173 -3.501368 0.6235402 -7.351091
## GrpG4 -4.247410 0.8527494 -4.980842 0.6047894 -7.022957
##
## Estimated Scale Parameter: 1.035308
## Number of Iterations: 3
##
## Working Correlation
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [2,] 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379
## [3,] 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379
## [4,] 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379
## [5,] 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379
## [6,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000
## [7,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [8,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [9,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [10,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [11,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [12,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [13,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [14,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [15,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [16,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [17,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [2,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [3,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [4,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [5,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [6,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [7,] 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [8,] 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379
## [9,] 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379
## [10,] 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379
## [11,] 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379
## [12,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000
## [13,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [14,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [15,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [16,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [17,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [,13] [,14] [,15] [,16] [,17]
## [1,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [2,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [3,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [4,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [5,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [6,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [7,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [8,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [9,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [10,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [11,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [12,] 0.1853379 0.1853379 0.1853379 0.1853379 0.1853379
## [13,] 1.0000000 0.1853379 0.1853379 0.1853379 0.1853379
## [14,] 0.1853379 1.0000000 0.1853379 0.1853379 0.1853379
## [15,] 0.1853379 0.1853379 1.0000000 0.1853379 0.1853379
## [16,] 0.1853379 0.1853379 0.1853379 1.0000000 0.1853379
## [17,] 0.1853379 0.1853379 0.1853379 0.1853379 1.0000000
前面討論到廣義估計方程式(GEE),但GEE並無法考量到隨機效果,所以當outcome為類別型資料時,又須考慮到隨機效應時,則考慮GLMM
GLMM Model
\[g(E(Y))=X\beta+Z\gamma\]
其中\(\beta\)為固定效應(fixed effect),表示Population-averaged effect,用於推估群體效應
而\(\gamma\)為隨機效應(random effect),表示Subject-specific effect,只能說明個人效應,不能外推至他人
g表示link function
X&Z分別對應到固定效應&隨機效應的設計矩陣(Design matrix)。\(\gamma~N(0,D)\) 在R中操作GLMM時,使用lme4套件中glmer函式,以下以wheeze資料實際示範,紀錄16位孩童於不同年齡下的氣喘狀況(1 = 有氣喘, 0 = 無氣喘)、孩童居住城市、母親抽菸程度
## Loading required package: Matrix
data("cbpp")
Wheeze <- read.table("D:/NTU Class Lectures/Categorical Data Analysis107-2/R_Class/Data/wheeze.csv", header = T, sep = ",")
summary(glmer(wheeze ~ smoke + city + age + (1|case), data = Wheeze, family = binomial()))## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: wheeze ~ smoke + city + age + (1 | case)
## Data: Wheeze
##
## AIC BIC logLik deviance df.resid
## 84.7 95.5 -37.4 74.7 59
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.2209 -0.5487 -0.4302 0.9042 2.2155
##
## Random effects:
## Groups Name Variance Std.Dev.
## case (Intercept) 1.09 1.044
## Number of obs: 64, groups: case, 16
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.68229 2.91109 0.578 0.563
## smoke -0.07979 0.52283 -0.153 0.879
## cityportage -0.16503 0.81173 -0.203 0.839
## age -0.24944 0.28029 -0.890 0.374
##
## Correlation of Fixed Effects:
## (Intr) smoke ctyprt
## smoke 0.036
## cityportage -0.144 -0.046
## age -0.969 -0.176 0.014
\[logit(P(Y = 1))=1.68-0.07*Smoke-0.165*City_{portage}-0.25*Age\]
在固定居住城市以及孩童年齡下,若母親煙癮每增加一單位,孩童有氣喘的odds會是原本的0.07倍,但參數並未顯著。
1: R語言資料分析 從機器學習、資料探勘、文字探勘到巨量資料分析-李仁鐘、李秋緣
2: R語言資料分析活用範例詳解-方匡南、朱建平、姜葉飛 著,H&C 譯
3: An Introduction to Categorical Data Analysis, Second Edition-Alan Agresti