政府用統計來了解全國的生產力以及土地、森林等資源,所以每一個單位都要納入範圍中。
社會科學假設可以測量個人或者群體的行為跟態度,並且假設這些觀察值來自於某種分布,例如常態分佈、均等分佈、博桑分佈等等,然後用樣本統計來推論全體,並且找出哪些變數之間有相關。
不論是母體或者樣本統計,我們關心中央趨勢以及離散程度這兩個統計,可以說是這些資料的摘要。
如果我們想研究哪些國家發生民主倒退,假設倒退是0或者1的二元變數,也就是來自二元分佈,我們可以計算民主倒退的比例以及離散程度,然後找出跟這個現象相關的變數。
又例如我們想研究台北市的便利商店在午夜12點到凌晨5點之間,有多少人進入便利商店,我們可以抽樣100家超商,然後統計人數,推論到全部的超商,並且用超商的各種位置特徵,解釋這些超商樣本的人數,然後用來預測其他超商在深夜的來客數。人數屬於計數(count)資料,需要用Poisson分佈來計算其平均數與離散程度。
有時候我們訪問母體裡的所有個人,這時候我們就不需要樣本。例如我們研究的是一本書裡面所有的文字背後的標題,只要我們有電子檔案,就可以分析所有的文字,而標題應該來自於Latent Dirichlet Allocation (LDA)或者多元分佈(multivariate distribution)。每一個標題都是一個選項,加總這些選項發生的機率等於1。
\[ E[Y]=E[E(\hat{Y})] \]
Figure 1.1: 分層抽樣
Figure 1.2: 一階段集群抽樣
Figure 1.3: 兩階段集群抽樣
假設一個社區分成15棟大樓,想要推估每一戶平均在7-8月份繳的電費。研究者抽出3棟樓,每棟樓訪問成功10戶,詢問每戶7-8月份繳的電費,模擬資料如下(詳細計算過程參考Lohr(2010:170-172)):
模擬每一群的訪問結果如下:
clusters <- 3 # Number of selected clusters
n_per_cluster <- 10
# Generate sample data
set.seed(02138)
data <- tibble(
cluster = rep(1:clusters, n_per_cluster),
value = round(rnorm(cluster*n_per_cluster, mean = 800, sd = 100), digits = 0)
# Simulated values
)
#change to 3X12 table
dm <- matrix(data$value, ncol = 3, nrow = 10)
row.names(dm) <- c(paste("Household ", 1:10))
colnames(dm) <- c(paste("Cluster ", 1:3))
#Make kable table
kable(dm, booktabs=T, caption="模擬集群抽樣") %>%
kable_styling(full_width=F, bootstrap_options = "striped", font_size=14)
Cluster 1 | Cluster 2 | Cluster 3 | |
---|---|---|---|
Household 1 | 803 | 773 | 853 |
Household 2 | 768 | 774 | 898 |
Household 3 | 720 | 786 | 813 |
Household 4 | 630 | 769 | 873 |
Household 5 | 921 | 830 | 656 |
Household 6 | 846 | 576 | 773 |
Household 7 | 530 | 692 | 728 |
Household 8 | 787 | 779 | 849 |
Household 9 | 686 | 929 | 806 |
Household 10 | 971 | 843 | 765 |
# Compute means of each cluster
cluster_totals <- aggregate(value ~ cluster, data, sum)
kable(cluster_totals, booktabs=T, caption="集群總和") %>%
kable_styling(full_width=F, bootstrap_options = "striped", font_size=14)
cluster | value |
---|---|
1 | 7628 |
2 | 7945 |
3 | 7854 |
n <- 3
bar.y <- sum(cluster_totals)/n
cat("Mean of electricity expense of 3 clusters:", bar.y, "\n")
Mean of electricity expense of 3 clusters: 7811
n <- 3; M = 10
hat.bar.y <- sum(cluster_totals)/(n*M)
cat("Mean of electricity expense of population:", hat.bar.y, "\n")
Mean of electricity expense of population: 781.1
# compute total electricity expense for each cluster
sum.k <- data %>% group_by (cluster) %>%
summarise(x = sum(value))
M <- 10
# compute the sample variance
s_2= (1/(M-1))*sum((sum.k$x - bar.y)^2)
cat("Sample Variance=", s_2)
Sample Variance= 5922
\[\text{se} = \frac{1}{M}\sqrt{\left(1-\frac{n}{N}\right)}\frac{s^2}{n}\]
n = 3
N = 15
M = 10
se <- (1/M)*sqrt((1-(n/N))*(s_2/n))
cat('Standard error of mean:', se)
Standard error of mean: 3.974
cat("Upper estimate of electricity expense:", hat.bar.y+2*se, "\n")
Upper estimate of electricity expense: 789
cat("Lower estimate of electricity expense:", hat.bar.y-2*se, "\n")
Lower estimate of electricity expense: 773.2
UsingR::alltime.movies
有電影的票房資料。我們可以統計這筆資料的中央趨勢: movies<-UsingR::alltime.movies; attach(movies)
cat("mean:", mean(Gross))
## mean: 240.2
movies<-UsingR::alltime.movies; attach(movies)
dt1 <- movies[which(Release.Year < 2000),]
dt2 <- movies[which(Release.Year >= 2000),]
cat("mean before 2000:", mean(dt1$Gross), "\n")
## mean before 2000: 243.5
cat("mean after 2000:", mean(dt2$Gross))
## mean after 2000: 234.6
library(gtsummary)
movies <- movies %>% mutate (D = ifelse (Release.Year < 2000, 'Before 2000', 'After 2000')) %>%
mutate (D = as.factor(D)) %>%
mutate (D = factor(D, levels=c("Before 2000", "After 2000"))) %>%
dplyr::select (D, Gross)
tblmovies<- movies %>%
tbl_summary(
by = D,
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c(
"{mean}",
"{median} ({p25}, {p75})",
"{min}, {max}"
),
missing = "no",
digits = all_continuous() ~ 2
) %>%
modify_header(label ~ "**Release Year**") %>%
modify_caption("**Boxoffice and Release Year**") %>%
bold_labels()
tblmovies
Release Year | Before 2000, N = 50 | After 2000, N = 29 |
---|---|---|
Gross | ||
Mean | 243.46 | 234.55 |
Median (IQR) | 216.50 (181.75, 257.75) | 215.00 (190.00, 260.00) |
Range | 172.00, 601.00 | 176.00, 404.00 |
library(sjPlot)
H.movies<- movies %>%
group_by(D) %>%
plot_frq(Gross, type = "histogram", show.mean = TRUE, normal.curve = TRUE) %>%
plot_grid()
Figure 2.1: 2000年前後的電影票房分佈
H.movies
TableGrob (2 x 1) “arrange”: 2 grobs z cells name grob 1 1 (1-1,1-1) arrange gtable[layout] 2 2 (2-2,1-1) arrange gtable[layout]
p.sex <- starwars %>% sjPlot::plot_frq(sex)
p.sex
Figure 2.2: Eye Color of Characters in Star Wars
\[ p_{i}=\frac{freq_{i}}{N}\times\frac{100}{100} \]
t.sex <- starwars %>% janitor::tabyl(sex)
t.sex
## sex n percent valid_percent
## female 16 0.18391 0.19277
## hermaphroditic 1 0.01149 0.01205
## male 60 0.68966 0.72289
## none 6 0.06897 0.07229
## <NA> 4 0.04598 NA
table(starwars$sex)
##
## female hermaphroditic male none
## 16 1 60 6
prop.table(table(starwars$sex))
##
## female hermaphroditic male none
## 0.19277 0.01205 0.72289 0.07229
library(ggplot2); library(dplyr)
eyecolors <- unique(starwars$eye_color)
p.eye <- starwars %>% ggplot2::ggplot(aes(x = eye_color, fill = eye_color)) +
geom_bar() +
scale_fill_manual(values= c('black','blue','#AAAAFF', 'brown',
'gray20', 'gold', 'greenyellow','#BB2211',
'orange','pink', 'red', '#CCEEFF', '#EE11EE',
'white', 'yellow')) +
ggplot2::labs(caption = "Star Wars", y = 'Frequency', x = 'Eye Color')+
theme(legend.position = 'none') +
coord_flip()
p.eye
Figure 2.3: 星際大戰角色的眼睛顏色
n <- 390+350
p1 <- 390/n
p=0.5
s1 <- sqrt(p*(1-p)/n)
z <- (p1-p)/s1; z
[1] 1.47
1-pnorm(z)
[1] 0.07072
p2 <-350/n
s2 <- sqrt(p2*(1-p2)/n)
z2 <- (p1-p2)/s2; z2
[1] 2.945
p1<-0.36; p2<-0.40; n=1000;
s.error = 1.96*sqrt((p1+p2-(p1-p2)^2)/n); s.error
[1] 0.05398
眾數
中位數
百分位數
平均數
從這四個統計值,可以大致判斷資料的分佈狀況。
R
的mode()
函數會回傳向量儲藏資料的性質,並不會告訴我們眾數。例如我們讀了一筆以SPSS格式儲存的民調資料,然後我們看一下Q1的分佈:b2<-here::here('data','PP0797B2.sav')
dt <- sjlabelled::read_spss(b2)
mode(dt$Q1)
## [1] "numeric"
#tabyl
gtable <- dt %>% janitor::tabyl(Q1)
gtable
## Q1 n percent
## 1 617 0.299806
## 2 684 0.332362
## 3 443 0.215258
## 4 91 0.044218
## 95 10 0.004859
## 96 57 0.027697
## 97 52 0.025267
## 98 104 0.050534
可以看出Q1的眾數是2。\(\texttt{mode}\)回傳的是變數的性質,不是眾數。
我們可以自己寫一個函數來得到眾數,首先我們創造一個向量,呈現變數的表格,然後用names()
找出這個表格的首行,進一步篩選首行的元素,條件為該表格的最大值,符合這個條件的就是該變數的眾數:
tmp <- table(as.vector(dt$Q1))
tmp
##
## 1 2 3 4 95 96 97 98
## 617 684 443 91 10 57 52 104
names(tmp)
## [1] "1" "2" "3" "4" "95" "96" "97" "98"
names(tmp)[tmp == max(tmp)]
## [1] "2"
☛請練習用ISLR
套件中的Carseats資料,找出
UsingR
的套件中,Boston
這筆資料有房價中位數的變數 library(ggplot2)
ggplot(data=MASS::Boston, aes(y=medv, x=ptratio)) +
geom_point(aes(color=lstat))
Figure 2.4: 波士頓各區的房價中位數與生師比及低社會地位人口比例散佈圖
A <- c(0:10)
cat('Md = ', median(A), '\n')
Md = 5
A <- c(0, 1, 10, 100, 1000, 100000)
cat('Md = ', median(A), '\n')
Md = 55
☛請問studentsfull.txt這筆資料中,學生的中位數成績是多少?
# Generate some random data
set.seed(02138)
data <- c(rnorm(500, mean = 40, sd = 12), rnorm(500, mean=60, sd=5))
# Calculate median, mean, and mode
median_val <- median(data)
mean_val <- mean(data)
# Density
plot(density(data), col = "black",
main = "Distribution with Central Tendency Measures",
xlab = "Value")
# Add lines for median, mean, and mode
abline(v = median_val, col = "red", lwd = 2, lty = 2)
abline(v = mean_val, col = "blue", lwd = 2, lty = 2)
# Add legend
legend("topright", legend = c("Median", "Mean"), col = c("red", "blue"), lwd = 2, lty = 2)
Figure 2.5: 平均數在中位數的左邊
可以看到分佈向右偏,一半的數在平緩的左邊,一半的數在高聳的右邊。平均數在左邊,代表左邊的數雖然分散,但是總和等於右邊的總和,
用均等分佈模擬資料,畫圖2.6表示平均數在中位數的右邊:
# Generate some random data
set.seed(02138)
data <- c(runif(600, 10, 40), runif(400, 65, 250))
# Calculate median, mean, and mode
median_val <- median(data)
mean_val <- mean(data)
# Density
plot(density(data), col = "black",
main = "Distribution with Central Tendency Measures",
xlab = "Value")
# Add lines for median, mean, and mode
abline(v = median_val, col = "red", lwd = 2, lty = 2)
abline(v = mean_val, col = "blue", lwd = 2, lty = 2)
# Add legend
legend("topright", legend = c("Median", "Mean"), col = c("red", "blue"), lwd = 2, lty = 2)
Figure 2.6: 平均數在中位數的右邊
四分位數是數列分成四份之後的三個點:25分位、50分數、75分為其中的25與75分位數。
對於數列的分佈有不同的假設,就有不同計算百分位數的方式。
四分位數是依序排列觀察值,分成四等份的分位數\(Q_{i}\),\(i=\{1,2,3\}\),\(Q_{1}\)代表有\(\frac{1}{4}\)的觀察值小於\(Q_{1}\),\(Q_{3}\)代表有\(\frac{3}{4}\)的觀察值小於\(Q_{3}\)。
例如資料為:\(X=(1, 1001, 1002, 1003)\)
25 百分位所在位置\(=\frac{4\times 25}{100}=1\)。因此 25百分位為 1。
50 百分位所在位置為:\(\frac{4\times 50}{100}=2\)。因此 50百分位為 1001。
75 百分位所在位置為:\(\frac{4\times 75}{100}=3\)。因此 75百分位為 1002。
X <- c(1, 1001, 1002, 1003)
qd <- quantile(X, c(.25, .5, .75), type=1)
knitr::kable(qd, col.names = c('quantiles', 'value')) %>%
kable_styling(full_width=F)
quantiles | value |
---|---|
25% | 1 |
50% | 1001 |
75% | 1002 |
例如:隨機抽出11位大學生,調查他們最近吃最貴的午餐費用為:195,220, 250,250,305,311,350,371,420,473,650,分別取25, 50, 75分位如下:
\(Q_{1}=\frac{11}{4}=2.75\)。進位之後取第3個數,得到250。
\(Q_{2}=\frac{2\times11}{4}=5.5\)。進位之後取第6個數,得到311。
\(Q_{3}=\frac{3\times11}{4}=8.25\)。進位之後取第9個數,得到420。
R
提供9種計算方法,每種方法來自於不同的母體分佈。前面3種適用於間斷變數,後面6種則是連續變數。我們以第1種方法計算。
m <- c(195,220, 250,250,305,311,350,371,420,473,650)
qd <- quantile(m, c(0.25, 0.5, 0.75), type = 1)
knitr::kable(qd, col.names = c('quantiles', 'value'))%>%
kable_styling(full_width=F)
quantiles | value |
---|---|
25% | 250 |
50% | 311 |
75% | 420 |
set.seed(1000)
# data
tmp1<-rnorm(100, 40, 10)
# quantiles
qmp1 <- quantile(as.integer(tmp1), c(0.25, 0.5, 0.75), type=1)
# table
knitr::kable(qmp1, col.names = c('quantiles', 'value'))%>%
kable_styling(full_width = F)
quantiles | value |
---|---|
25% | 34 |
50% | 40 |
75% | 45 |
set.seed(1000)
tmp2<-rnorm(100, 40, 14)
qmp2 <- quantile(as.integer(tmp2), c(0.25, 0.5, 0.75), type=1)
knitr::kable(qmp2, col.names = c('quantiles', 'value'))%>%
kable_styling(full_width = F)
quantiles | value |
---|---|
25% | 31 |
50% | 40 |
75% | 47 |
\[m_{i}=n\cdot \frac{i}{100}\]
上面的公式中,\(m\)變數的\(i\)百分位數等於\(i\)除以\(m\)變數的觀察值總數\(n\)再乘以100。如果\(m_{i}\)不是整數,則\(k\)為該百分位數,且\(m_{i+1}\ge k\ge m_{i}\)。
換句話說,當\(m_{i}\)不是整數,我們可以將\(m_{i}\)無條件進位加1的數當做\(m_{i}\)。
另一種算法是當\(m_{i}\)是整數,則排在第\(m\)位與\(m+1\)位資料值的算術平均數就是這群資料的第\(k\)百分位數。
用實際資料驗證手算以及R
的結果:
full<-here::here('data','studentsfull.txt')
dt <- read.csv(full,sep="",header=T)
dt$Score<-sort(dt$Score)
dt$Score
[1] 60 62 66 66 69 70 75 77 78 80 80 81 82 83 85 85 86 87 88 88 88 89 91 92 92 [26] 93
dt$Score[floor(length(dt$Score)*0.25)+1]
[1] 75
dt$Score[floor(length(dt$Score)*0.75)+1]
[1] 88
dt$Score[floor(length(dt$Score)*0.9)+1]
[1] 92
qd1 <- quantile(dt$Score, c(0.25, .75, 0.9), type=1)
qd1 <- knitr::kable(qd1, col.names = c('quantiles', 'value'))%>%
kable_styling(latex_options = "scale_down")
qd2 <- quantile(dt$Score, c(0.25, .75, 0.9), type=4)
knitr::kable(qd2, col.names = c('quantiles', 'value'))%>%
kable_styling(full_width = F)
quantiles | value |
---|---|
25% | 72.5 |
75% | 88.0 |
90% | 91.4 |
R
的輸出跟SPSS類似,我們可以加以對照(圖2.7)。SPSS的統計值等於是R
的quantile()的第六種計算方式。scores<-c(15, 22, 26, 32, 33,36, 36, 41, 42, 44,
44, 45, 47, 48, 61,63, 63, 65, 65, 65,
66, 66, 68, 69, 70,71, 74, 74, 76, 77,
78, 78, 80, 85)
qscore <- quantile(scores, c(.25,.5,.75,.9), type=6)
knitr::kable(qscore, col.names = c('quantiles', 'value')) %>%
kable_styling(latex_options = "scale_down")
quantiles | value |
---|---|
25% | 41.75 |
50% | 64.00 |
75% | 71.75 |
90% | 78.00 |
v1<-here::here('Fig','v1_quantile.png')
knitr::include_graphics(v1)
Figure 2.7: 四分位統計
# Set seed for reproducibility
set.seed(02138)
# Simulate 5000 observations from a chi-squared distribution with 3 degrees of freedom
sim <- data.frame(x = rchisq(5000, df = 3))
# Compute the 5th, 50th, and 95th percentiles
quantiles <- quantile(sim$x, probs = c(0.50, 0.75, 0.95))
# Print the results
print(quantiles)
50% 75% 95% 2.321 4.025 7.598
#sjplot
#theme
sjPlot::set_theme(base = theme_blank(), geom.label.color = 'white')
#plot
plot1 <- sim %>% mutate(x.sort=sort(x)) %>%
sjPlot::plot_frq(x.sort, type='density', geom.colors = 'white')
plot1 + ggplot2::scale_x_continuous(breaks = c(0:10))
trim
刪除若干百分比的數。算術平均數(arithmetic mean)的公式如下:\[\bar{y}=\frac{\sum y_{i}}{n}\]
例如: \[y={6, 7, 8, 8, 9, 10, 13, 15, 16, 45}\]
平均數為: \[\bar{y}=\frac{\sum (6+7+\cdots , +45)}{10}=13.7\]
如果\(x_{i}=y_{i}+10\),請問平均數\(\bar{x}\)會比\(\bar{y}\)大、小、不變?
如果\(x_{i}=10\times y_{i}\),請問平均數\(\bar{x}\)會比\(\bar{y}\)大、小、不變?
\[\bar{y}=\frac{\sum n_{k}\cdot \bar{y_{k}}}{n}\]
換句話說,每一組的平均數乘以每一組在全部資料中佔的比例,類似加權,得到的平均數和就是全體樣本的平均數。
例如有三個空氣品質的觀測站的資料,要計算總平均數,首先從總和除以全部個案數計算:
A<-list(station1=c(25, 33, 44),
station2=c(43, 66, 78, 81),
station3=c(90, 76, 105, 110, 121))
#n of each group
group.n=sapply(A, length); group.n
## station1 station2 station3
## 3 4 5
#n of data
total.n=sum(sapply(A, length));
#proportion of each group
group_p <- group.n/total.n
#mean of each group
submean=sapply(A, FUN=mean); cat("air pollution of each station:", submean,"\n")
## air pollution of each station: 34 67 100.4
cat("Average air pollution=", sum(group_p*submean),"\n")
## Average air pollution= 72.67
#sum of data
totalair=sum(sapply(A, sum));
cat("Sum of air pollution=", totalair,"\n")
## Sum of air pollution= 872
cat("average air pollution=", totalair/total.n)
## average air pollution= 72.67
有偏態時須注意平均值是否會誤導。
正偏表示:右邊的尾巴較左邊長,眾數<中位數<平均數,偏態係數大於0。
負偏表示:左邊的尾巴較右邊長,眾數>中位數>平均數,偏態係數小於0。不過眾數的位置不一定在最右邊。
常態分佈的偏態值=0
樣本偏態值\(=\frac{n}{(n-1)(n-2)} \frac{\sum (x_{i}-\bar{x})^3}{s^3}\)
其中\(s\)是樣本標準差,\(s=\sqrt{\frac{\sum (x-\bar{x})^2}{n}}\)
以圖2.8表示如下:
spss<-here::here('Fig','week3_skewness.jpg')
knitr::include_graphics(spss)
Figure 2.8: 偏態統計
R
的計算公式1與2分別與Stata以及SPSS得到的結果相同。library(foreign)
hs<-here::here('data','hsb2.dta')
hsb2<-read.dta(hs)
library(e1071)
cat('skewness of writing:', skewness(hsb2$write, type=1), '\n')
skewness of writing: -0.4784
cat('skewness of writing:', skewness(hsb2$write, type=2))
skewness of writing: -0.482
stata<-here::here('Fig','write_stata.png')
knitr::include_graphics(stata)
Figure 2.9: Stata的偏態統計
spss<-here::here('Fig','write_spss.png')
knitr::include_graphics(spss)
Figure 2.10: SPSS的偏態統計
# Sample data
set.seed(02138)
A <- c(runif(700, 20, 30), runif(300, 30, 70))
B <- rnorm(1000, 30, 4)
C <- c(runif(200, 0, 80), rnorm(800, 50, 10))
cat('A skewness:', skewness(A, type=1), '\n')
A skewness: 1.396
cat('B skewness:', skewness(B, type=1), '\n')
B skewness: 0.01334
cat('C skewness:', skewness(C, type=1), '\n')
C skewness: -0.7783
library(gtsummary)
library(dplyr)
library(e1071)
# Create the dataset
tmp <- data.frame(A = A, B = B, C = C)
# Create tbl_summary object
tblm1 <- tmp %>%
tbl_summary(
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c(
"{mean}", "{median}", "{sd}","{N_obs}"
),
missing = "no",
digits = all_continuous() ~ 2
) %>%
modify_caption("**三筆資料的平均值與中位數**") |>
bold_labels()
tblm1
Characteristic | N = 1,000 |
---|---|
A | |
Mean | 32.30 |
Median | 27.04 |
SD | 13.11 |
No. obs. | 1,000.00 |
B | |
Mean | 29.96 |
Median | 29.87 |
SD | 3.94 |
No. obs. | 1,000.00 |
C | |
Mean | 48.15 |
Median | 49.55 |
SD | 13.88 |
No. obs. | 1,000.00 |
#finding mode
my_mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
#df
df <- list(A, B, C)
sapply(df, FUN=my_mode)
[1] 25.12 26.72 50.36
我們可以用橫的盒型圖如圖2.11表示分佈:
data <- data.frame(group = rep(c("A", "B", "C"), each = 1000),
value = c(A, B, C))
# Define colors for each group
colors <- c("#EE22AA", "#EBC311", "#C1F210")
# Draw horizontal box plot
data$group <- ordered(as.factor(data$group), levels = c('C','B','A'))
boxplot(value ~ group, data = data, horizontal = TRUE, col=colors,
ylim=c(0, 80))
# Adjust y-axis ticks
axis(2, at = seq(0, 80, by = 10))
Figure 2.11: 表示偏態的箱型圖
x <- seq(min(A), max(A), length = 1000)
df <- data.frame(x=c(x, x, x), y=c(sort(A), sort(B), sort(C)),
dataset = c(rep('A', 1000), rep('B', 1000), rep('C', 1000)))
ggplot2::ggplot(df, aes(x = y, group = dataset)) +
geom_density(aes(fill = dataset), alpha = 0.5) +
scale_fill_manual(values=c("#C1F210", "#EBC311", "#EE22AA"))
Figure 2.12: 3個模擬資料的密度圖
#Read data
path <- here::here('data','expected_stats.csv')
batter <- read.csv(path, header = T, sep=',')
#Histogram
ggplot2::ggplot(batter, aes(slg, group=1)) +
geom_histogram(colour="#AC100E", fill = 'white') +
labs(x = 'Slugging Percentage')
Figure 2.13: 2024年美國職棒大聯盟的打者壘打率
cat('skewness=', skewness(batter$slg), '\n')
## skewness= 0.8833
md.slg <- median(batter$slg)
cat('Md=', md.slg, '\n')
## Md= 0.4045
mean.slg <- mean(batter$slg)
cat('Mean=', mean.slg)
## Mean= 0.4148
height <- c(112, 114, 116, 117, 117, 118, 119, 120, 120, 121, 122, 122, 123, 124, 125, 125, 126, 128, 130)
x <- c(112, 114, 116, 117, 117, 118, 119, 120, 120, 121, 122, 122, 123, 124, 125, 125, 126, 128, 130)
tmp <- data.frame(height = x)
#stargazer::stargazer(tmp, type = 'html', digits = 0)
#median, 95th
q.scores <- quantile(x, c(.5, 0.95), type=6)
cat("Median=", q.scores[1], '\n')
Median= 121
cat("95th percentile=", q.scores[2], '\n')
95th percentile= 130
#skewness
sk.scores <- skewness(tmp$height)
cat("skewness=", sk.scores, '\n')
skewness= 0
峰度的用途是測量資料點集中在前後兩端的程度,越集中在兩邊,峰度值越大於0,反之,峰度值小於0。標準常態分佈則是0。
計算峰度的公式為: \[\frac{m4}{m2^{2}}-3\] \[s_{2}=\sum (x_{i}-\bar{x})^2\] \[s_{4}=\sum (x_{i}-\bar{x})^4\] \[m2=\frac{s_{2}^2}{n}\] \[m4=\frac{s_{4}}{n}\]
不同統計軟體計算峰度的公式略有不同,如果用R
,可以用e1071
這個套件裡面的kurtosis
,用type
指令選擇2可得到跟SPSS一樣的答案。
Stata的計算峰度公式為 \[\frac{m^4}{m^2_{2}}\]
樣本數目越大,理論上各種計算方式的結果越接近。
#m4
s4=sum((hsb2$write-mean(hsb2$write))^4)
m4=s4/200; print(m4)
## [1] 17889
#m2
s2=sum((hsb2$write-mean(hsb2$write))^2)
m2=s2/200; print(m2)
## [1] 89.39
#kurtosis
cat('kurtosis=', m4/m2^2)
## kurtosis= 2.239
set.seed(02138)
uni_dist <- runif(1000, min = 0, max = 10)
k_uni <- kurtosis(uni_dist, type=1)
#kurtosis
cat('kurtosis=', k_uni)
kurtosis= -1.197
uni_df <- data.frame(x= sort(uni_dist))
plot_uni <- uni_df %>% sjPlot::plot_frq(x, type = 'density');
plot_uni + ggplot2::theme_minimal()
set.seed(02138)
laplace_dist <- VGAM::rlaplace(1000, location = 0, scale = 1)
l_uni <- kurtosis(laplace_dist, type=1)
#kurtosis
cat('kurtosis=', l_uni)
kurtosis= 2.885
lac_df <- data.frame(x= sort(laplace_dist))
plot_lac <- lac_df %>% sjPlot::plot_frq(x, type = 'density');
plot_lac + ggplot2::theme_minimal()
tmp <- data.frame(x1= sort(uni_dist), x2= sort(laplace_dist))
q1.l <- mean(tmp$x1) - 3*sd(tmp$x1); q1.u <- mean(tmp$x1) + 3*sd(tmp$x1)
q2.l <- mean(tmp$x2) - 3*sd(tmp$x2); q2.u <- mean(tmp$x2) + 3*sd(tmp$x2)
#create new variable
div <- tmp %>% mutate(uni.l=ifelse(x1 < q1.l, 1, 0)) %>%
mutate(uni.u=ifelse(x1 > q1.u, 1, 0)) %>%
mutate(lap.l =ifelse(x2 < q2.l, 1, 0)) %>%
mutate(lap.u = ifelse(x2 > q2.u, 1, 0)) %>%
mutate(uni = uni.l + uni.u) %>%
mutate(n.uni = sum(uni)) %>%
group_by(lap = lap.l + lap.u) %>%
mutate(n.lap = sum(lap))
#Obs meets conditions
cat("< or > 3 standard deviations in uniform distribution=", div$n.uni[1], '\n')
< or > 3 standard deviations in uniform distribution= 0
cat("< or > 3 standard deviations in Laplace distribution=",div$n.lap[1])
< or > 3 standard deviations in Laplace distribution= 16
R
計算結果:kurtosis(hsb2$write, type=1)+3
[1] 2.239
kurtosis(hsb2$write, type=2)+3
[1] 2.25
stata<-here::here('Fig','write_stata.png')
knitr::include_graphics(stata)
Figure 2.14: Stata的峰度統計
spss<-here::here('Fig','write_spss.png')
knitr::include_graphics(spss)
Figure 2.15: SPSS的峰度統計
R
的彈性比較大。set.seed(02139)
# Generate random samples from the normal distribution
normal_dist <- rnorm(1000, 0, 1)
e1071::kurtosis(normal_dist, type=1)
## [1] -0.109
# Generate random samples from the t-distribution with 3 degrees of freedom
t_dist <- rt(1000, df = 3)
e1071::kurtosis(t_dist, type=1)
## [1] 7.532
x <- seq(min(A), max(A), length = 1000)
df <- data.frame(x=c(x, x), y=c(normal_dist, t_dist),
dataset = c(rep('Standard Normal', 1000), rep('t', 1000)))
ggplot2::ggplot(df, aes(x = y, group = dataset)) +
geom_density(aes(fill = dataset), alpha = 0.5) +
scale_fill_manual(values=c("#EBC311", "#EE22AA"))
Figure 2.16: 標準常態分佈與t分佈
setwd(here::here('data','DSS'))
dt <- read.csv('UA_precincts.csv', header = T)
range(dt$pro_russian)
[1] 0.0 78.9
全距相同,但是離散程度可能不同;有的變數比較離散,集中在兩端,有的則集中在中間,兩端很少。
全距容易受到最大值與最小值的影響;最大值會拉大全距,離散程度變大。因此,解釋離散程度時要小心最大值與最小值。
library(ggplot2)
sv<-c(15, 22, 26, 32, 33,36, 36, 41, 42, 44,
44, 45, 47, 48, 61,63, 63, 65, 65, 65,
66, 66, 68, 69, 70,71, 74, 74, 76, 77,
78, 78, 80, 85)
quantile(sv, c(.25,.5,.75), type=6)
## 25% 50% 75%
## 41.75 64.00 71.75
dt <- data.frame(scores=sv)
ggplot(data=dt, aes(y=scores)) +
geom_boxplot(fill="#FF22EE11")
Figure 3.1: 學生成績盒鬚圖
☛請計算MASS::Animals
這筆資料中的腦容量的四分位距。
每一個觀察值與平均數的差距稱為變異數,用來衡量連續變數資料離散的程度。
常態分佈的母體的變異數公式為:
\[\sigma^2=\frac{\sum (X-\mu)^2}{n}\]
\[S^2=\frac{\sum (X-\bar{X})^2}{n-1}\]
\[n\cdot p\cdot(1-p)\]
\[\sqrt{\frac{np(1-p)}{n-1}}\]
某10家上市公司的股票價格分別為13.5, 22.2, 31.2, 15.2, 20.3, 21.9, 18.3, 25.3, 21.3, 19.8。請問這些公司的股價的變異數是多少?
如果這10家的股票價格都漲了1塊錢,請問變異數變大、變小或者不變?為什麼?
如果這10家的股票價格都漲了1倍,請問變異數變大、變小或者不變?為什麼?
某戶人家收集過去12個月的水費帳單,想知道每個月水費有沒有超過300元,發現7, 8, 9月之外,其他月份都沒有超過。請問水費有無超過300元的變異數是多少?
如果該住戶發現其實10月份的水費也超過300元,請問變異數變大、變小還是不變?為什麼?
如果該住戶統計到隔年1月,發現還是3個月的水費超過300元,請問變異數變大、變小還是不變?為什麼?
如果隨機變數屬於常態分配,大部分的值應該聚集在平均值加減一個標準差的範圍內,因此,樣本標準差的大小特別重要。
當樣本來自於常態分配的母體,利用微積分可求出平均數的加減1個標準差包含約\(68\%\)的樣本。2個標準差包含約\(95\%\)的樣本。3個標準差包含約\(99\%\)的樣本。
先寫程式計算標準差,再用R
的
#hsb2 data
file <- here::here('data', 'hsb2.txt')
hsb2 <- read.table(file, header = T)
#variance
v.write<-var(hsb2$write); sqrt(v.write)
## [1] 9.479
#standard deviation
std = function(x) sqrt(var(x))
std(hsb2$write)
## [1] 9.479
#self-defined function
sd<-function(V)sqrt( sum((V - mean(V))^2 /(length(V)-1)))
sd(hsb2$write)
## [1] 9.479
H<-c(15000,7000,19000,3000,15000,19000,4000,12000,
17000, 9000)
h<-c(15,7,19,3,15,19, 4,12,17, 9)
sd(H); sd(h)
## [1] 5963
## [1] 5.963
我們用一個分數的正負判斷某個樣本的連續變數是否高於平均數,用絕對值判斷距離平均數的差距有多遠,這就是Z分數或者Z值。
Z值的優點是將原始分數轉換為標準分數後,不同單位的數量或者不同測量的分數,均可直接比較,不會受測量困難程度的影響。
如果轉換原始分數轉換為Z值得到小數或負數,也可以乘以一個常數,再加一常數,變成整數。也就是整個分佈向右邊移動。
\[f(Z)=\frac{1}{2\pi}e^{-\frac{Z^2}{2}}\\ Z=\frac{X-\mu}{\sigma}\]
\[Z\sim N(0,1)\]
\[Z=\frac{x-\bar{x}}{s}\] - s是標準差。
\[Z=\frac{X-\mu}{\sigma}\] 其中:
\[\sigma\neq 0\]
pnorm(6, 0, 1)
## [1] 1
pnorm(-6, 0, 1)
## [1] 9.866e-10
pnorm(-1.96,0,1)
## [1] 0.025
curve(dnorm(x),
xlim = c(-3, 3),
ylab = "Density",
#main = "機率密度與區域",
col='red', lwd=2, xlab='Z')
cord.1x <- c(-3,seq(-3, -1.96,0.01),-1.96)
cord.1y <- c(0,dnorm(seq(-3, -1.96,0.01)),0)
polygon(cord.1x,cord.1y,col='grey80')
Figure 4.1: 標準常態分佈曲線下的左邊2.5%區域
curve(dnorm(x),
xlim = c(-3, 3),
ylab = "Density",
#main = "機率密度與區域",
col='red', lwd=2, xlab='Z')
cord.1x <- c(1.96 ,seq(1.96, 3, 0.01), 3)
cord.1y <- c(0, dnorm(seq(1.96, 3, 0.01)), 0)
polygon(cord.1x,cord.1y,col='grey80')
Figure 4.2: 標準常態分佈曲線下的右邊2.5%區域
Z = 1.96
1-pnorm(Z, 0, 1)
## [1] 0.025
q = 0.975
qnorm(q)
## [1] 1.96
q = 0.95
qnorm(q)
## [1] 1.645
★請問在
head(alr4::Heights, 4)
mheight dheight 1 59.7 55.1 2 58.2 56.5 3 60.6 56.0 4 60.7 56.8
m.i<-mean(alr4::Heights$mheight)
m.i
[1] 62.45
s.i<-var(alr4::Heights$mheight)
s.i
[1] 5.547
zstar1=(63-m.i)/sqrt(s.i);zstar2=(65-m.i)/sqrt(s.i)
cat("63in=",zstar1,"\n","65in=",zstar2,"\n")
63in= 0.2323 65in= 1.082
pnorm(zstar2, 0, 1)-pnorm(zstar1, 0, 1)
[1] 0.2684
curve(dnorm(x),
xlim = c(-3, 3),
ylab = "Density",
#main = "機率密度與區域",
col='red', lwd=2, xlab='Z')
cord.1x <- c(0.232,seq(0.232, 1.081,0.01), 1.081)
cord.1y <- c(0,dnorm(seq(0.232, 1.081,0.01)),0)
polygon(cord.1x,cord.1y,col='grey80')
Figure 4.3: 標準常態分佈曲線下的特定區域
★ 有一位員工的今年月薪為8.5萬,去年則為8萬。今年的全體薪水標準差為2.3萬,平均值為6.4萬,而去年的全體員工薪水標準差為2萬,平均值為6.2萬。假設員工薪水常態分配。請問該員工月薪相較於全體員工,今年比去年有增加嗎?
\[z_{1}=\frac{8.5-6.4}{2.3}=0.91\]
\[z_{2}=\frac{8-6.2}{2}=0.75\] - 因為\(z_{1}\geq z_{2}\),因此該員工月薪相較於去年有增加。
score
中位數、90百分位數以及男性跟女性的平均數:
score
平均數以及標準差:
UsingR
套件中的faithful
資料,請問要看噴泉最少要等幾分鐘?平均要等幾分鐘?最多跟最少等的時間差距多少分鐘?
airquality
這筆資料的Wind
這個變數,計算前後兩個資料點的差異,以分析兩天之間風速的差異。例如:
A <- c(35, 61, 69)
d.A <- c(26, 8)
airquality
這筆資料的Wind
的偏態為何?峰度是多少?
councilor
這筆資料中,請問平均工程預算是多少?樣本標準差多少?
2008Election
這筆資料,請問馬英九的得票數的25分位數、中位數、75分位數分別是多少?請問25與75分位數之間差別多少?
BES
這筆資料,計算表示英國應該離開歐盟(leave
)的比例。
vote
)的次數最多?相對頻率多少?
最後更新時間: 2025-03-28 13:08:10