1 課程目標

本週上課將介紹如何檢證連續變數的平均值是否不等於0或者其他特定值，以及類別變數的各類別相對連續變數的平均值。例如我們想知道五種樹木的平均樹圍是否有差異：

m1<-aov(circumference ~ Tree, data=Orange )
summary(m1)

##             Df Sum Sq Mean Sq F value Pr(>F)
## Tree         4  11841    2960   0.883  0.486
## Residuals   30 100525    3351

結果顯示五種樹木的平均樹圍並沒有統計上的顯著差異。

2 母體與樣本

研究某個現象時，我們假設這個現象來自於看不到的母體，要形容這個看不到的母體，我們會用這個母體的平均值以及離散程度這兩個參數。我們相信雖然我們永遠無法真正知道母體平均值以及離散程度，但是透過隨機抽樣的樣本，我們可以估計這兩個參數。例如丟硬幣100次，其中得到52次正面、48次反面。我們對於母體的平均數估計是\(\frac{52}{100}\cdot 1+\frac{48}{100}\cdot 0=0.52\)。如果我們相信母體平均數是0.52，那麼擲這枚硬幣10次，其中會得到6次正面的機率是：\(\frac{10!}{6!4!}(0.52)^6\cdot (1-0.52)^4=\)

p=0.52; n=10; x=6
(factorial(n))/(factorial(x)*factorial(n-x))*p^x*(1-p)^4

## [1] 0.2203963

又例如身高、體重、智商來自於常態分佈的母體，平均數為\(\bar{X}=\frac{X_{1}+\ldots+X_{n}}{n}\)，機率分佈為\(f(X)=\frac{1}{\sigma\sqrt{2\pi}}\cdot e^{\frac{-(x-\mu)^2}{2\sigma^2}}\)

回到二項分布的例子。假設我們抽樣100個同學，抽50次，而且我們相信女生佔了52%，那麼這50套樣本當中，有多少套的樣本，女生剛好52人？

set.seed(116)
m = 50; n= 100; p = .52;
phat = rbinom(m,n,p)/n
phat

##  [1] 0.48 0.51 0.50 0.58 0.51 0.57 0.50 0.57 0.58 0.55 0.54 0.53 0.54 0.54
## [15] 0.51 0.44 0.55 0.42 0.56 0.61 0.54 0.47 0.59 0.56 0.54 0.53 0.48 0.57
## [29] 0.57 0.52 0.60 0.45 0.48 0.62 0.52 0.59 0.52 0.56 0.51 0.57 0.63 0.51
## [43] 0.57 0.48 0.44 0.62 0.55 0.51 0.50 0.46

我們可以排序50次的結果，然後以點狀圖表示：

dotchart(sort(phat), pch=16, col='#CCFF00', yaxt='n', xaxt='n',   ylab='',xlab=expression(paste("", hat(p))), cex.lab=1.8)
abline(v = 0.52, lwd=1.5, col='red3', lty = 2)
mtext('0.52',1)

這個圖指出抽樣的結果，涵蓋小於0.45到大於0.6，但是只有幾個等於0.52。換句話說，抽樣得到的估計點，雖然理論上都是無偏估計，但是實際上與母體參數可能有一些不同。

接下來我們用區間估計表示抽樣的結果，也就是點估計加上一定的區間。

set.seed(116)
m = 50; n= 100; p = .52;
phat = rbinom(m,n,p)/n
SE = sqrt(phat*(1-phat)/n)
alpha = 0.05;zstar = qnorm(1-alpha/2)
cat("zstar=", zstar)

## zstar= 1.959964

matplot(rbind(phat - zstar*SE, phat + zstar*SE),
  rbind(1:m,1:m), type="l", lty=1,
  xlab=expression(paste(hat(p)-z,"*",SE,",  ", hat(p)+z,"*",SE)), ylab="")
abline(v=p)

上面這個圖是由以下幾個參數所組成：

\(\hat{p}\) 表示每一套樣本當中，女生的比例

SE表示樣本的標準誤=\(\sqrt{\frac{p(1-p)}{n}}\)

\(\alpha\)(alpha)表示我們估計時容許的抽樣誤差，也稱為顯著水準(significance level)，顯著水準越小，代表誤判虛無假設成立的機會越小。一般設定為0.05或是\(5\%\)，分成左右兩邊各是0.025

\(z^{*}\)表示一定的顯著水準下，對應的z值。顯著水準越大，z值越小。z值來自平均為0，變異數為1的分佈。

\(z^{*}\cdot SE\)表示一定顯著水準下，涵蓋母體參數的上下限。顯著水準越小、範圍越大、區間越容易涵蓋母體參數。

從這個圖可以看出，大部分的線跨過0.52這條垂直線，代表大部分抽樣的結果，涵蓋了母體為0.52這個參數。
因為區間估計比點估計來得周全，所以我們在檢定平均數或者平均數差異時，將使用區間估計。
標準誤表示為\(\frac{\sigma}{\sqrt{n}}\)。連續變數的\(\sigma\)的無偏估計是\(s=\frac{\sum(X_{i}-\bar{X})^2}{n-1}\)。二元變數則是\(\sqrt{p(1-p)}\)。

3 平均數檢定

雖然\(s\)是標準誤的無偏估計，但是\(s\)只是隨機亂數，並不是母體分佈。但是我們可以改用\(t\)分佈。\(t\)分布近似常態分佈，樣本數越大越接近常態分佈。

以下的圖顯示\(t\)分佈的自由度為1, 5, 10, 30時的機率分佈，\(t\)分佈的自由度是\(n-1\)。：

curve(dt(x, 30), from = -5, to = 5, col = "orange", 
      xlab = "quantile", ylab = "density", lwd = 2)
curve(dt(x, 10), from = -5, to = 5, col = "green2", add = TRUE, lwd = 2)
curve(dt(x, 5), from = -5, to = 5, col = "navyblue", add = TRUE, lwd = 2)
curve(dt(x, 1), from = -5, to = 5, col = "grey40", add = TRUE, lwd = 2)
legend("topleft", legend = paste0("DF = ", c(1, 5, 10, 30)),
       col = c("grey40", "navyblue", "green2", "orange"),
       lty = 1, lwd = 2)

\(t\)分佈形狀取決於自由度，自由度代表已知統計值之後，觀察值可以變動的數目。\(t\)分佈的形狀代表它的離散程度，或者是樣本變異數，而變異數來自於樣本平均值(\(\bar{X}\))。當我們已知平均值，只要有一個樣本不變，其他\(n-1\)個觀察值改變，還是可以得到相同的平均值。所以自由度是\(n-1\)。

以下的式子表示母體參數的上下區間的機率為\(1-\alpha\)： \(P(\bar{X}-t_{\alpha/2,n-1}S/\sqrt{n}<\mu<\bar{X}+t_{\alpha/2,n-1}S/\sqrt{n})\) \(=1-\alpha\)

三種信賴區間的\(t\)值表示如下：

信賴區間	alpha	df.1	df.4	df.15	df.3000
90%	0.10	6.314	2.132	1.753	1.645
95%	0.05	12.710	2.776	2.131	1.960
99%	0.01	63.660	4.604	2.947	2.576

R可以求得以上的機率，例如信賴區間為95%時，\(\alpha\)除以2，加上自由度：

qt(0.025, 3000)

## [1] -1.960755

\(t\)檢定的步驟：

1. 假設\(\mu=c\)。

2. 決定顯著水準\(\alpha\)或者信賴區間\((1-\alpha)\%\)

3. 計算自由度\(n-1\)

4. 計算\(T=\frac{X-c}{s/\sqrt{n}}\)

5. 計算顯著水準\(\alpha\)與自由度相對應的\(t^{*}\)值。\(\alpha=5\%\)、自由度等於1,000時，約為1.962

qt(0.975, 1000)

## [1] 1.962339

6. 當\(p(T)<\alpha\)，表示會發生c的機會非常小，可以拒斥\(\mu=c\)的假設。

7. 同樣的當\(T>t_{*}\)，表示會發生c的機會非常小，可以拒斥\(\mu=c\)的假設。

假設10位受試者的體重資料如下，請問平均體重等於173磅嗎？

weight <-c(175, 176, 173, 175, 174, 173, 173, 176, 173, 179)
t.test(weight, mu=173)

## 
##  One Sample t-test
## 
## data:  weight
## t = 2.7618, df = 9, p-value = 0.02205
## alternative hypothesis: true mean is not equal to 173
## 95 percent confidence interval:
##  173.3076 176.0924
## sample estimates:
## mean of x 
##     174.7

分析顯示p-value <0.05，也就是說我們可以拒斥\(\mu=173\)的假設。那麼我們要如何一步步計算\(T\)值以及\(t^{*}\)呢？先計算\(T\)：

n=10
x_bar=mean(weight)
s2=sum((weight-x_bar)^2)/(n-1)
c=173
T=(x_bar-c)/sqrt(s2/n)
T

## [1] 2.761805

已知\(T\)值，用pt(T, df)求p-value：

T=2.761805
pvalue=(1-pt(T, df=9))*2
pvalue

## [1] 0.02204698

之所以要先用1減pt(T, df)再除以2是因為T對應的累積機率值是從最左邊一直累積過來，因此右邊的部分需要用1減。減完之後除以2代表左右兩邊各一半。或者比較\(T\)與\(t^{*}\)

T=2.761805
t.star=qt(0.975, 9)
T-t.star

## [1] 0.4996478

我們用以下的圖表示顯著水準0.05的雙尾檢定：

plot.new()
x <- seq(-4, 4, length = 1000)
y <- dt(x, 100)
plot(x, y, type="n", xlab = "", ylab = "", main = "", axes = FALSE)
axis(1)
 lines(x, y)
# Returns a vector of boolean values representing whether the x value is between the two bounds then
# filters the values so that only the ones within the bounds are returned
alpha=0.05; df=100
lower_bound <- 0-qt(1-alpha/2, df)
upper_bound <- 0+qt(1-alpha/2, df)
bounds_filter <- x >= lower_bound & x <= upper_bound
x_within_bounds <- x[bounds_filter]
y_within_bounds <- y[bounds_filter]
 
x_polygon <- c(lower_bound, x_within_bounds, upper_bound)
y_polygon <- c(0, y_within_bounds, 0)
 
polygon(x_polygon, y_polygon, col = "red")
probability_within_bounds <- pt(upper_bound,df) - pt(lower_bound,df)
text <- paste("p(", lower_bound, "< p <", upper_bound, ") =", signif(probability_within_bounds, digits = 3))
 
# Display the text on the plot. The default "side" parameter is 3, representing the top of the plot.
mtext(text)

當T=2.78，右邊的區域剩下非常小的區域：

scale <- 0.1 
x <- seq(-4, 4, scale) 
df=100
y <- dt(x, df) 
plot(x, y, type = "l", main="t-Test, t = 2.76") 

linepos <- 2.76 
abline(v = linepos) 
cutpoint <- (max(x) - linepos) / scale 

xt <- x[(length(x)-cutpoint):length(x)] 
yt <- y[(length(y)-cutpoint):length(y)] 

# draw the polygon 

n <- length(xt) 
xt <- c(xt[1], xt, xt[n]) 
yt <- c(0,yt,0) 
polygon(xt, yt, col="red" )

回到上述女生比例的例子，假設第一套樣本抽到102人，其中女生有54人，請問這一套樣本是否來自於\(p=0.52\)的母體？以prop.test(x, n, p, conf.level)

prop.test(54, 102, p=0.52, conf.level=0.95)

## 
##  1-sample proportions test with continuity correction
## 
## data:  54 out of 102, null probability 0.52
## X-squared = 0.0083113, df = 1, p-value = 0.9274
## alternative hypothesis: true p is not equal to 0.52
## 95 percent confidence interval:
##  0.4284751 0.6281074
## sample estimates:
##         p 
## 0.5294118

因為p-value>0.05，因此我們可以接受\(c=0.52\)的假設。換句話說，這一套樣本來自於p=0.52的母體。

因為伯努利分佈的樣本變異數是\(p(1-p)\)，所以樣本標準誤的公式為：

\(SE=\sqrt{\hat{p}(1-\hat{p})}/n\)

T值的公式為：

\(T=\frac{\hat{p}-c}{\sqrt{\hat{p}(1-\hat{p})}/n}\)

n=102
p=54/102
c=0.52
T=(p-c)/sqrt(p*(1-p)/n)
T

## [1] 0.1904381

T=0.190
pvalue=(1-pt(T, df-1))*2
pvalue

## [1] 0.8496982

由於R的prop.test()用Z分數與T值相比，所以p-value會與自行運算的結果不太相同。

3.0.1 單尾檢定

有時候我們需要檢定樣本平均數是否大於或者小於某一個數。這時候我們可以用單尾檢定。

單尾檢定需要把拒斥的區域集中在其中一邊，如果顯著水準為0.05，那麼就是右邊0.05或是左邊0.05，而不是0.025。可以想像顯著水準0.05除以2所對應的\(t^{*}\)要比0.05對應的\(t^{*}\)來的大。下圖表示右側的區域為0.05。

plot.new()
x <- seq(-4, 4, length = 1000)
y <- dt(x, 100)
plot(x, y, type="n", xlab = "", ylab = "", main = "", axes = FALSE)
axis(1)
 lines(x, y)
# Returns a vector of boolean values representing whether the x value is between the two bounds then
# filters the values so that only the ones within the bounds are returned
alpha=0.05; df=100
lower_bound <- -4
upper_bound <- 0+qt(1-alpha, df)
bounds_filter <- x >= lower_bound & x <= upper_bound
x_within_bounds <- x[bounds_filter]
y_within_bounds <- y[bounds_filter]
 
x_polygon <- c(lower_bound, x_within_bounds, upper_bound)
y_polygon <- c(0, y_within_bounds, 0)
 
polygon(x_polygon, y_polygon, col = "red")
probability_within_bounds <- pt(upper_bound,df) - pt(lower_bound,df)
text <- paste("p(", lower_bound, "< p <", upper_bound, ") =", signif(probability_within_bounds, digits = 3))
 
# Display the text on the plot. The default "side" parameter is 3, representing the top of the plot.
mtext(text)

不同於雙尾檢定的對立假設是\[\mu \neq c\]，單尾檢定的虛無假設仍然是\[\mu=c\]，對立假設則是\[\mu>c\]或是\[\mu<c\]。

與雙尾檢定相同，單尾檢定也可以從\(95\%\)信賴區間來確認欲檢定的統計值是否位於信賴區間。如果大於信賴區間的最大值，應該就是高於檢定的統計值。如果小於信賴區間的最小值，應該就是小於檢定的統計值。

以10名受訪者的體重的資料為例，假設平均體重大於173磅。當我們計算\(T\)值後，可以計算自由度為9的情況下的\(p\)值：

T=2.761805
pvalue=pt(T, df=9)
pvalue

## [1] 0.9889765

因為0.988>0.05，可以結論接受平均體重等於173磅的假設，也就是不接受平均體重大於173磅的對立假設。

或者用t.test()函數，記得加上alternative=’greater’設定，也就是：

\(H_{0}: \mu = 173\)

\(H_{1}: \mu \ge 173\)

t.test(weight, mu=173, alternative = 'greater')

## 
##  One Sample t-test
## 
## data:  weight
## t = 2.7618, df = 9, p-value = 0.01102
## alternative hypothesis: true mean is greater than 173
## 95 percent confidence interval:
##  173.5716      Inf
## sample estimates:
## mean of x 
##     174.7

因為\(p>0.05\)，不拒斥樣本的平均值等於173磅的虛無假設，也就是不接受大於173的對立假設。從\(95\%\)信賴區間包含負無限大到175.82來看，不能排除母體平均值小於173的可能性，也就是不一定會大於173。

3.1 雙樣本檢定

我們經常需要知道兩套樣本的差異是否達到統計上的顯著水準，例如兩種教學方法的成績、兩個餐廳的評價、兩個國家人民對於民主的滿意程度等等。

假設有兩種油漆，我們想知道這兩種油漆的效果是不是相似。兩種油漆的資料如下，

sample1<-c(19.7475, 30.5562, 11.0734, 18.1730, 19.8387, 14.5291, 19.4998, 18.8374, 12.6873, 14.7627,18.3869, 17.9287, 17.6973, 14.3439, 10.7374, 15.3563, 19.0878, 12.5745, 18.0030, 18.6004)

sample2<-c(17.4715, 9.8613, 23.4827, 13.6029, 20.0386, 19.6289, 24.9357, 17.8812, 12.6012, 9.7741, 19.9265, 16.4178, 20.4401, 15.1119, 7.9955, 5.1385, 22.4969, 17.4448, 17.6675, 7.0984)

先用箱形圖顯示兩套樣本的分佈，可以看出第一套樣本比較集中，第二套比較分散：

library(reshape2)
dt=data.frame(sample1, sample2)
dt.m <- melt(dt)

## No id variables; using all as measure variables

library(ggplot2)
ggplot(data=dt.m, aes(x=variable, y=value)) +
  geom_boxplot()  +
   labs(x="Groups", y='') +
  stat_summary(fun.y=mean, geom="point", shape=16, size=2, col='red')

以t.test()檢定假設：兩筆資料的平均數相等，或者是相減等於0?

 t.test(sample1, sample2, var.equal = T)

## 
##  Two Sample t-test
## 
## data:  sample1 and sample2
## t = 0.7348, df = 38, p-value = 0.467
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.053835  4.394365
## sample estimates:
## mean of x mean of y 
##  17.12107  15.95080

由以上的結果可以看出，\(p>0.05\)，因此我們無法拒斥兩筆資料的平均數相等的假設。結論是兩種油漆的效果非常相近。那麼如何計算兩套樣本的標準誤？公式為： \(SE=\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{1}^{2}}{n_{1}}}\)

如果\(n_{1}=n_{2}=n\)， \(SE=\sqrt{\frac{s_{1}^{2}+s_{2}^{2}}{n}}\)

如果兩者樣本數不相等， \(SE=\frac{s_{1}+s_{2}}{2}\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}\)

\(T\)值的公式為：

\(T=\frac{\bar{Y_{1}}-\bar{Y_{2}}}{SE}\)

自由度：\(n_{1}+n_{2}-2\)

根據以上公式，我們再驗算一次：

n1<-length(sample1); n2<-length(sample2)
s1<-sd(sample1); s2<-sd(sample2)
se <-((s1+s2)/2)*(sqrt(n1^-1+n2^-1))
T=(mean(sample1)-mean(sample2))/se
cat('t =',T, '\n')

## t = 0.7414099

pvalue=(1-pt(0.74, n1+n2-2))*2
cat('p value =', pvalue)

## p value = 0.4638476

除了檢驗兩個樣本有相同的平均值，也可以檢驗一個大於另一個平均值。以上面資料為例，

 t.test(sample1, sample2, var.equal = T, alternative = 'less')

## 
##  Two Sample t-test
## 
## data:  sample1 and sample2
## t = 0.7348, df = 38, p-value = 0.7665
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 3.855357
## sample estimates:
## mean of x mean of y 
##  17.12107  15.95080

可以看出\(p\geq 0.05\)。我們不拒斥第一套樣本等於第二套樣本的平均數的假設。

又或者用隨機抽樣的資料驗證第一套樣本平均值小於第二套樣本平均值。

set.seed(29393091)
x=rnorm(30)
set.seed(116)
y=rnorm(100, sqrt(2), 1)
t.test(x, y, alternative = 'greater')

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -5.4224, df = 50.425, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -1.433189       Inf
## sample estimates:
## mean of x mean of y 
##  0.247371  1.342228

結果顯示\(p>0.05\)，因此以\(\alpha=0.05\)的標準，我們不拒斥第一套樣本平均值小於第二套樣本平均值的假設。

3.1.1 雙樣本的比例檢定

如果兩個樣本都是來自二項分布，那麼我們可以檢驗比例是否相同。例如一輛捷運，第一個車廂有50人，第二個車廂有61人，第一個車廂有20個男性，第二個則有13個，請問兩個車廂有相同的男性比例嗎？用R計算如下：

x1=20; n1=50; x2=13; n2=61
prop.test(c(x1,x2), c(n1, n2))

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(x1, x2) out of c(n1, n2)
## X-squared = 3.7427, df = 1, p-value = 0.05304
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.001604362  0.375374854
## sample estimates:
##    prop 1    prop 2 
## 0.4000000 0.2131148

可以看出\(p>0.05\)，因此以\(\alpha=0.05\)的標準，我們接受兩個車廂有一樣男性比例的假設。

3.2 成對樣本檢定

前面的雙樣本檢定假定兩套樣本互相獨立，但是有不同的平均值與標準差。如果是想要確定兩個變數來自於同一套樣本，兩次測量的平均值相等，則是進行成對樣本檢定。

用R的t.test函數，可設定paired=T，進行成對樣本檢定，虛無假設為兩次測量的平均值相等。

例如比較民進黨候選人在台北市第2選區在2016年的立委選舉與2019年的立委補選的各里的平均得票率是否相同：

taipei<-read.csv('taipei201619.csv', header=T, fileEncoding = 'BIG5')
library(dplyr)
taipei<-taipei %>% 
          mutate(DPP2016.p=100*DPP2016/total2016, DPP2019.p=100*DPP2019/total2019)
with(taipei, t.test(DPP2016.p, DPP2019.p, paired = T))

## 
##  Paired t-test
## 
## data:  DPP2016.p and DPP2019.p
## t = 25.822, df = 196, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  11.06842 12.89895
## sample estimates:
## mean of the differences 
##                11.98368

結果顯示兩者有很大的差異，拒斥兩次選舉的平均值相減等於0的假設。

為了確認這個檢定結果，我們可以畫圖顯示兩次選舉的平均值。先整理成為長資料：

library(reshape2)
dat<- taipei %>% 
       select('2016'='DPP2016.p', '2019'='DPP2019.p')
dat.new <- melt(dat, variable.name = "election", value.name = "DPP")

## No id variables; using all as measure variables

dat.new <- dat.new %>% 
           group_by(election) %>%
           mutate(grp.mean=mean(DPP)) %>%
           mutate(Year=as.factor(election))

然後使用ggplot2畫分佈圖：

library(ggplot2)
ggplot(dat.new, aes(x=DPP, color=Year)) +
  geom_density(position="identity", alpha=.4) +
  geom_vline(data=dat.new, aes(xintercept=grp.mean, color=Year),
             linetype="dashed") +
  labs(x="DPP's Vote Share", y="")

圖形顯示發現民進黨候選人在台北市第二選區的兩次選舉的得票率，的確有相當不同的分佈以及集中趨勢。

4 變異數分析

檢驗一個類別變數與其它連續變數(continuous variable)之間的關係，可運用One-way Anova(單因子變異數分析)，分析類別之間是否存在平均值的差異。

Two-way Anova(雙因子變異數分析)則是有兩個以上的類別變數，分析類別之間是否存在連續變數的平均值的差異。

Anova分析建立在三個假設：

類別變數與連續變數互相獨立
每一個類別內的連續變數呈常態分佈
每一個類別內的連續變數的變異數相同

Anova 的目標是比較每一組之間的離散程度跟每一類別（或是每一組）內部的離散程度。兩者相比越小，表示組跟組之間的差異越小於每一組內部的差異，因此分組的意義也越小，類別變數與連續變數的關係也越小。反之則越大。組間變異(SSB, sum of square between)為各組的樣本數乘以組平均減總平均的平方的加總。

\(SSB=\sum n_{k}(\bar{x_{k}}-\bar{x})^2\)

組內變異(SSW, sum of square within)為各組的變異數的加總。

\(SSW=\sum_{k}\sum (x_{ki}-\bar{x_{k}})^2\)

全部的變異量等於這兩者相加： \(SST=SSB+SSW\)

每一組之間的平均離散程度用MSB(mean square between)表示，每一類別（或是每一組）內部的離散程度用MSW(mean square within)表示。計算方式為：

\(MSB=SSB/df(B) \hspace{.4cm} df(B)=k-1\)

\(MSW=SSW/df(W) \hspace{.4cm} df(W)=n-1-(k-1)=n-k\)

\(F=\frac{MSB}{MSW}\)

使用\(F\)分佈而不是\(t\)分佈檢驗假設成立與否。英國統計學家兼生物學家羅納德·費雪（Ronald Aylmer Fisher）發明了\(F\)檢驗。 F 分佈的參數是兩個自由度，可畫圖如下：

curve(df(x, df1=1, df2=2), from=0, to=5, ylab='')

curve(df(x, df1=5, df2=6), from=0, to=5, col='red', add=T)

curve(df(x, df1=20, df2=30), from=0, to=5, col='blue2', add=T)
abline(v=1, lwd=1.5, lty=2, col='green2')
legend('topright', c('df=1;df=2','df=5;df=6',
                     'df=20;df=30'),
            col=c('black',"red", "blue2"),
            lty=c(1,1,1))

圖片顯示自由度越大、\(F\)分佈圖形越接近常態分佈。

R的\(F\)分佈有以下指令：

df(x, df1, df1)回傳x百分比的機率 (y軸的高度)
pf(q, df1, df1)回傳q百分比的累積機率
qf(p, df1, df1)回傳p值的\(F\)值
rf(n, df1, df1)回傳從\(F\)分佈抽樣的n個樣本

以Orange這筆資料為例，我們想檢驗每一種類型的樹的樹圍是否相等。先整理資料，計算平均樹圍以及標準差：

Orange$tree<-factor(Orange$Tree, levels=c("1","2", "3", "4","5"))
library(dplyr)
mydata <- Orange %>%
          group_by (tree) %>%
          summarize(
         count = n(),
         avg = mean(circumference, na.rm = TRUE),
         sd = sd(circumference, na.rm = TRUE)
  ) 
kable(mydata) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

tree	count	avg	sd
1	7	99.57143	43.29302
2	7	135.28571	66.32424
3	7	94.00000	42.98062
4	7	139.28571	71.89741
5	7	111.14286	58.85980

可以看出五種平均樹圍之間有一些差異。

用箱形圖檢視五種類型的樹圍：

library(ggplot2)
p=ggplot(Orange, aes(x=tree, y=circumference)) 
p+geom_boxplot(data=Orange, aes(color=tree)) +
  theme_bw()

箱形圖顯示平均數可能不相等。再用密度圖觀察是否常態分佈？

G=ggplot(Orange, aes(x=circumference, fill=tree, alpha=0.2))
G+geom_density() +
   xlim(c(0, 300)) +
   theme_bw()

部分類別可能不是成常態分佈，而且離散程度可能不相等。

最後用aov函數進行Anova分析：

res.aov <- aov(circumference ~ tree, data = Orange)
# Summary of the analysis
summary(res.aov)

##             Df Sum Sq Mean Sq F value Pr(>F)
## tree         4  11841    2960   0.883  0.486
## Residuals   30 100525    3351

結果顯示\(p>0.05\)，無法拒斥五種平均樹圍相等的假設。

我們可以畫圖檢視變異數相等的假設：

plot(res.aov, 1, cex=1.5, lwd=1.5)

這一張圖顯示，每一類別的平均值跟觀察值與該平均值之間的殘差(residual)並沒有明顯的關聯。因此，變異數相等的假設成立。

R另外提供Bartlett以及Levene兩種檢定變異數相等的函數。Bartlett檢驗的做法是：

bartlett.test(circumference ~ tree, data = Orange)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  circumference by tree
## Bartlett's K-squared = 2.4607, df = 4, p-value = 0.6517

Levene檢驗則是：

library(car)
leveneTest(circumference ~ tree, data = Orange)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  4  0.5907  0.672
##       30

兩個檢驗結果都是不拒斥變異數相等的假設。可以先進行這兩個檢驗，再進行變異數分析。

雖然我們無法拒斥每一組平均數相等的假設，但是我們仍然可以用TukeyHSD函數檢驗哪些組別之間可能有差異：

TukeyHSD(res.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = circumference ~ tree, data = Orange)
## 
## $tree
##           diff        lwr       upr     p adj
## 2-1  35.714286  -54.03528 125.46385 0.7765641
## 3-1  -5.571429  -95.32099  84.17813 0.9997503
## 4-1  39.714286  -50.03528 129.46385 0.7030223
## 5-1  11.571429  -78.17813 101.32099 0.9956073
## 3-2 -41.285714 -131.03528  48.46385 0.6725728
## 4-2   4.000000  -85.74956  93.74956 0.9999331
## 5-2 -24.142857 -113.89242  65.60671 0.9343443
## 4-3  45.285714  -44.46385 135.03528 0.5930257
## 5-3  17.142857  -72.60671 106.89242 0.9805850
## 5-4 -28.142857 -117.89242  61.60671 0.8909860

結果發現每一組之間都沒有顯著的差異。

5 作業

1. 請問UsingR中的<span style:“color=blue”>rat這筆資料適不適合進行平均數的檢定？請檢定平均數是否大於110。

2. 請問UsingR中的<span style:“color=blue”>iq這筆資料中，iq的平均數大於100嗎？

3. 某一調查中，18位受訪者對2位候選人的表現打成績如下表，老師想知道這兩位候選人得到的成績是否相同，請問如何檢定？

Respondent	Candidate.1	Candidate.2
1	75	66
2	80	74
3	86	74
4	70	74
5	80	89
6	69	79
7	59	83
8	86	85
9	74	80
10	80	80
11	88	60
12	88	59
13	76	49
14	90	42
15	91	30
16	70	72
17	68	60
18	55	88

4. 請用Moodle上面的「立委-A05-2（台北市）.ods」資料以及「候選人在各投開票所得票數一覽表」，整理台北市第二選區國民黨候選人在2016立委選舉以及2019補選的資料料，然後先畫圖比較國民黨候選人在台北市第2選區在2016年的立委選舉與2019年的立委補選的各里的分佈，最後檢驗平均得票率是否相同。

5. 請建立以下的模擬資料，然後檢驗各組的平均數是否相等。如果不相等，請找出哪兩組有不同的平均數。

set.seed(377)
DV<-rnorm(1000, 0, 3)
X.1<-round(runif(1000, 1, 4))
X <- as.factor(X.1)

6 更新日期

## 最後更新日期 06/09/2019

社會科學統計方法

平均數檢定與變異數分析(v2.0)