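The population variance \(\sigma^2 = \displaystyle{\sum_{i=1}^{N} \frac{(x_i-\mu)^2}{N}}\) has \(N\) in the denominator,
while the sample variance \(s^2 = \displaystyle{\sum_{i=1}^{n} \frac{(x_i-\overline{x})^2}{n-1}}\) has \(n-1\) in the denominator.
Both are averages in spirit, yet the divisor is not always the "total count", which often confuses beginners.
To make this easier to understand, and to verify it by computer (note that this is only a verification, not a proof), we use a simple example: suppose a very small population \(pop= \{ 1,2,3,4,5 \}\) has only \(N=5\) elements. Its mean is \(\mu=3\), so the exact population variance can be computed in R as follows: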
pop <- c(1,2,3,4,5) # or pop <- 1:5
sig2 <- 0
for ( i in 1:length(pop)) {sig2 <- sig2 + (pop[i]-3)^2/5}
sig2
## [1] 2
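The loop above hard-codes the mean 3 and the size 5. As a minimal sketch, the same quantity can be computed in vectorized form for any numeric vector. Note that R's built-in `var()` divides by the length minus one, so it does not return the divisor-\(N\) population variance directly. The names `N` and `var_builtin` are illustrative.

```r
pop <- 1:5
N <- length(pop)

# Population variance: average squared deviation from the population mean (divisor N)
sig2 <- sum((pop - mean(pop))^2) / N
sig2         # 2

# Built-in var() uses the divisor N - 1, so it gives a different number here
var_builtin <- var(pop)
var_builtin  # 2.5
```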
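Next, consider drawing a sample of size \(n=3\) from the population.
When sampling with replacement, every element of \(pop\) has the same fixed probability \(\frac{1}{5}\) of being selected on each draw. Since the population mean is treated as unknown, we replace it with the sample mean and compute the sample variance of each of the 125 possible ordered samples (\(5 \times 5 \times 5\); each sample has probability \(\frac{1}{5^3}=\frac{1}{125}\) of being drawn):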
for (i in 1:5) {
  for (j in 1:5) {
    for (k in 1:5) {m <- c(i,j,k); cat("sample variance of sample",i,j,k,"=", sum((m-mean(m))^2)/2,"\n")} # note: the divisor is n-1=3-1=2
  }
}
## sample variance of sample 1 1 1 = 0
## sample variance of sample 1 1 2 = 0.3333333
## sample variance of sample 1 1 3 = 1.333333
## sample variance of sample 1 1 4 = 3
## sample variance of sample 1 1 5 = 5.333333
## sample variance of sample 1 2 1 = 0.3333333
## sample variance of sample 1 2 2 = 0.3333333
## sample variance of sample 1 2 3 = 1
## sample variance of sample 1 2 4 = 2.333333
## sample variance of sample 1 2 5 = 4.333333
## sample variance of sample 1 3 1 = 1.333333
## sample variance of sample 1 3 2 = 1
## sample variance of sample 1 3 3 = 1.333333
## sample variance of sample 1 3 4 = 2.333333
## sample variance of sample 1 3 5 = 4
## sample variance of sample 1 4 1 = 3
## sample variance of sample 1 4 2 = 2.333333
## sample variance of sample 1 4 3 = 2.333333
## sample variance of sample 1 4 4 = 3
## sample variance of sample 1 4 5 = 4.333333
## sample variance of sample 1 5 1 = 5.333333
## sample variance of sample 1 5 2 = 4.333333
## sample variance of sample 1 5 3 = 4
## sample variance of sample 1 5 4 = 4.333333
## sample variance of sample 1 5 5 = 5.333333
## sample variance of sample 2 1 1 = 0.3333333
## sample variance of sample 2 1 2 = 0.3333333
## sample variance of sample 2 1 3 = 1
## sample variance of sample 2 1 4 = 2.333333
## sample variance of sample 2 1 5 = 4.333333
## sample variance of sample 2 2 1 = 0.3333333
## sample variance of sample 2 2 2 = 0
## sample variance of sample 2 2 3 = 0.3333333
## sample variance of sample 2 2 4 = 1.333333
## sample variance of sample 2 2 5 = 3
## sample variance of sample 2 3 1 = 1
## sample variance of sample 2 3 2 = 0.3333333
## sample variance of sample 2 3 3 = 0.3333333
## sample variance of sample 2 3 4 = 1
## sample variance of sample 2 3 5 = 2.333333
## sample variance of sample 2 4 1 = 2.333333
## sample variance of sample 2 4 2 = 1.333333
## sample variance of sample 2 4 3 = 1
## sample variance of sample 2 4 4 = 1.333333
## sample variance of sample 2 4 5 = 2.333333
## sample variance of sample 2 5 1 = 4.333333
## sample variance of sample 2 5 2 = 3
## sample variance of sample 2 5 3 = 2.333333
## sample variance of sample 2 5 4 = 2.333333
## sample variance of sample 2 5 5 = 3
## sample variance of sample 3 1 1 = 1.333333
## sample variance of sample 3 1 2 = 1
## sample variance of sample 3 1 3 = 1.333333
## sample variance of sample 3 1 4 = 2.333333
## sample variance of sample 3 1 5 = 4
## sample variance of sample 3 2 1 = 1
## sample variance of sample 3 2 2 = 0.3333333
## sample variance of sample 3 2 3 = 0.3333333
## sample variance of sample 3 2 4 = 1
## sample variance of sample 3 2 5 = 2.333333
## sample variance of sample 3 3 1 = 1.333333
## sample variance of sample 3 3 2 = 0.3333333
## sample variance of sample 3 3 3 = 0
## sample variance of sample 3 3 4 = 0.3333333
## sample variance of sample 3 3 5 = 1.333333
## sample variance of sample 3 4 1 = 2.333333
## sample variance of sample 3 4 2 = 1
## sample variance of sample 3 4 3 = 0.3333333
## sample variance of sample 3 4 4 = 0.3333333
## sample variance of sample 3 4 5 = 1
## sample variance of sample 3 5 1 = 4
## sample variance of sample 3 5 2 = 2.333333
## sample variance of sample 3 5 3 = 1.333333
## sample variance of sample 3 5 4 = 1
## sample variance of sample 3 5 5 = 1.333333
## sample variance of sample 4 1 1 = 3
## sample variance of sample 4 1 2 = 2.333333
## sample variance of sample 4 1 3 = 2.333333
## sample variance of sample 4 1 4 = 3
## sample variance of sample 4 1 5 = 4.333333
## sample variance of sample 4 2 1 = 2.333333
## sample variance of sample 4 2 2 = 1.333333
## sample variance of sample 4 2 3 = 1
## sample variance of sample 4 2 4 = 1.333333
## sample variance of sample 4 2 5 = 2.333333
## sample variance of sample 4 3 1 = 2.333333
## sample variance of sample 4 3 2 = 1
## sample variance of sample 4 3 3 = 0.3333333
## sample variance of sample 4 3 4 = 0.3333333
## sample variance of sample 4 3 5 = 1
## sample variance of sample 4 4 1 = 3
## sample variance of sample 4 4 2 = 1.333333
## sample variance of sample 4 4 3 = 0.3333333
## sample variance of sample 4 4 4 = 0
## sample variance of sample 4 4 5 = 0.3333333
## sample variance of sample 4 5 1 = 4.333333
## sample variance of sample 4 5 2 = 2.333333
## sample variance of sample 4 5 3 = 1
## sample variance of sample 4 5 4 = 0.3333333
## sample variance of sample 4 5 5 = 0.3333333
## sample variance of sample 5 1 1 = 5.333333
## sample variance of sample 5 1 2 = 4.333333
## sample variance of sample 5 1 3 = 4
## sample variance of sample 5 1 4 = 4.333333
## sample variance of sample 5 1 5 = 5.333333
## sample variance of sample 5 2 1 = 4.333333
## sample variance of sample 5 2 2 = 3
## sample variance of sample 5 2 3 = 2.333333
## sample variance of sample 5 2 4 = 2.333333
## sample variance of sample 5 2 5 = 3
## sample variance of sample 5 3 1 = 4
## sample variance of sample 5 3 2 = 2.333333
## sample variance of sample 5 3 3 = 1.333333
## sample variance of sample 5 3 4 = 1
## sample variance of sample 5 3 5 = 1.333333
## sample variance of sample 5 4 1 = 4.333333
## sample variance of sample 5 4 2 = 2.333333
## sample variance of sample 5 4 3 = 1
## sample variance of sample 5 4 4 = 0.3333333
## sample variance of sample 5 4 5 = 0.3333333
## sample variance of sample 5 5 1 = 5.333333
## sample variance of sample 5 5 2 = 3
## sample variance of sample 5 5 3 = 1.333333
## sample variance of sample 5 5 4 = 0.3333333
## sample variance of sample 5 5 5 = 0
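Alternatively, compute directly the mean of the sample variances over these 125 samples, which is the exact expected value of the sample variance: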
sums <- 0
for (i in 1:5) {
  for (j in 1:5) {
    for (k in 1:5) {m <- c(i,j,k); sums <- sums + sum((m-mean(m))^2)/2} # note: the divisor is n-1=3-1=2
  }
}
sums/125
## [1] 2
The result equals the population variance! So dividing by \(n-1=3-1=2\) is indeed the right choice here.
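To see what would go wrong with the seemingly natural divisor \(n\), the following sketch enumerates the same 125 equally likely samples with base R's `expand.grid()` and averages both versions. The names `samples` and `ss` are illustrative and not part of the original code.

```r
pop <- 1:5
n <- 3

# All 5^3 = 125 ordered samples drawn with replacement, one per row
samples <- as.matrix(expand.grid(pop, pop, pop))

# Sum of squared deviations from the sample mean, for each sample
ss <- apply(samples, 1, function(m) sum((m - mean(m))^2))

mean(ss / (n - 1))  # divisor n - 1: 2, equal to the population variance
mean(ss / n)        # divisor n: 4/3, which underestimates the population variance
```

With the divisor \(n\), the average falls short of \(\sigma^2=2\) by exactly the factor \(\frac{n-1}{n}=\frac{2}{3}\) in this example.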
Simulation provides a further check (the program below simulates 10,000 draws) and gives a result very close to the exact value.
pop <- c(1,2,3,4,5) # or pop <- 1:5
avgs <- c()
for (i in 1:10000) {
  sa <- sample(pop, 3, replace = TRUE) # draw three values at random (the sample x_1, x_2, x_3; values may repeat)
  avgs <- rbind(avgs, sum((sa-mean(sa))^2)/2) # compute sum((x_i - xbar)^2) / (n-1) and append it (rbind)
}
mean(avgs)
## [1] 2.0018
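The simulation above grows its result one row at a time. As a hedged alternative sketch, `replicate()` collects the 10,000 values in one call, and `var()` already applies the divisor \(n-1\). The seed value is an arbitrary choice made here only so that reruns give the same number, and the name `s2` is illustrative.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility
pop <- 1:5

# One sample variance (divisor n - 1) per simulated sample drawn with replacement
s2 <- replicate(10000, var(sample(pop, 3, replace = TRUE)))

mean(s2)  # should be close to the population variance, 2
```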
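Now suppose instead that the sample is drawn without replacement. By basic combinatorics, there are \(\binom{N}{n}\) ways to choose \(n\) items from \(N\) distinct items; in this example that is \(\binom{5}{3}=\frac{5!}{2!3!}=10\) possible combinations, each occurring with probability \(\frac{1}{10}\). In R: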
choose(5,3)
## [1] 10
Because there are so few cases, we can list all ten combinations: \((1,2,3), (1,2,4), (1,2,5), (1,3,4), (1,3,5), (1,4,5), (2,3,4), (2,3,5), (2,4,5), (3,4,5)\). The “combinations” command of the R package “gtools” also produces these ten results:
# "gtools" is a CRAN package, not part of base R; install it once before loading it
install.packages("gtools", repos = "http://cran.us.r-project.org")
## package 'gtools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\user\AppData\Local\Temp\Rtmp0MjobH\downloaded_packages
library(gtools)
## Warning: package 'gtools' was built under R version 4.1.1
x <- combinations(5,3)
x
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 1 2 4
## [3,] 1 2 5
## [4,] 1 3 4
## [5,] 1 3 5
## [6,] 1 4 5
## [7,] 2 3 4
## [8,] 2 3 5
## [9,] 2 4 5
## [10,] 3 4 5
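If installing an extra package is not desirable, base R's `combn()` (from the utils package that ships with R) offers the same enumeration. It returns one combination per column, so the sketch below transposes the result to match the row layout shown above. The name `x_base` is illustrative.

```r
# combn() puts each combination in a column; t() turns them into rows
x_base <- t(combn(5, 3))

dim(x_base)    # 10 rows (combinations), 3 columns (sample values)
x_base[1, ]    # first combination: 1 2 3
x_base[10, ]   # last combination: 3 4 5
```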
We can compute the sample variance for each of these ten results. First, check the class of \(x\):
class(x)
## [1] "matrix" "array"
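Since the population mean is treated as unknown, we replace it with the sample mean and compute the sample variance (for the \(i\)-th combination, subtract that combination's mean, i.e. the sample mean, from each of its values, square the differences, sum them, and divide by \(2\)):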
for (i in 1:length(x[,1])) {cat("sample variance of sample",i,"=",sum((x[i,]-mean(x[i,]))^2)/2,"\n")}
## sample variance of sample 1 = 1
## sample variance of sample 2 = 2.333333
## sample variance of sample 3 = 4.333333
## sample variance of sample 4 = 2.333333
## sample variance of sample 5 = 4
## sample variance of sample 6 = 4.333333
## sample variance of sample 7 = 1
## sample variance of sample 8 = 2.333333
## sample variance of sample 9 = 2.333333
## sample variance of sample 10 = 1
Alternatively, compute directly the mean of the sample variances over these ten samples:
sums <- 0
for (i in 1:length(x[,1])) {sums <- sums + sum((x[i,]-mean(x[i,]))^2)/2}
sums/10
## [1] 2.5
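The same average can be obtained without an explicit loop. As a sketch, base R's `combn()` accepts a function through its `FUN` argument and applies it to every combination, and `var()` already divides by \(n-1\). The name `s2_all` is illustrative.

```r
pop <- 1:5

# Sample variance (divisor n - 1) of each of the 10 combinations of size 3
s2_all <- combn(pop, 3, FUN = var)

s2_all
mean(s2_all)  # 2.5, matching the loop result above
```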
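This exact value differs from the population variance!
The reason is that when sampling without replacement from a finite population, the draws within a sample are not independent of one another, unlike sampling with replacement, where the selection probabilities stay fixed.
It can be shown that in this case the sample variance \(s^2 = \displaystyle{\sum_{i=1}^{n} \frac{(x_i-\overline{x})^2}{n-1}}\) is an unbiased estimator of \(\displaystyle{\sum_{i=1}^{N} \frac{(x_i-\mu)^2}{N-1}}=\frac{N}{N-1} \ \sigma^2\) (that is, the former can be used to estimate the latter, and its expected value equals the latter exactly). Multiplying the computed variance by \(\frac{N-1}{N}=\frac{4}{5}\) therefore gives the population variance: \(2.5 \times \frac{4}{5}=2.0\)
(for a detailed proof, see the article by 丁村成 published in 數學傳播, vol. 29, no. 1, 2005: https://web.math.sinica.edu.tw/math_media/d291/29102.pdf).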
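The correction factor can be checked numerically for this population. The following is only a minimal sketch of the arithmetic, not a proof; the names `S2` and `sigma2` are illustrative.

```r
pop <- 1:5
N <- length(pop)

# Divisor N - 1: the quantity that s^2 estimates without bias when sampling without replacement
S2 <- sum((pop - mean(pop))^2) / (N - 1)  # 2.5

# Rescale by (N - 1) / N to recover the divisor-N population variance
sigma2 <- S2 * (N - 1) / N                # 2

c(S2 = S2, sigma2 = sigma2)
```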
Simulation provides a further check here as well (the program below simulates 10,000 draws) and gives a result very close to the exact value.
pop <- c(1,2,3,4,5) # or pop <- 1:5
avgs <- c()
for (i in 1:10000) {
  sa <- sample(pop, 3, replace = FALSE) # draw three values at random (the sample x_1, x_2, x_3; no value can repeat)
  avgs <- rbind(avgs, sum((sa-mean(sa))^2)/2) # compute sum((x_i - xbar)^2) / (n-1) and append it (rbind)
}
mean(avgs)
## [1] 2.5125