如果我們也想研究哪些國家發生民主倒退,假設倒退是0或者1的二元變數,也就是來自二元分佈,我們可以計算民主倒退的比例以及離散程度,然後找出跟這個現象相關的變數。
又例如我們想研究台北市的便利商店在午夜12點到凌晨5點之間,有多少人進入便利商店,我們可以抽樣100家超商,然後統計人數,推論到全部的超商,並且用超商的各種位置特徵,解釋這些超商樣本的人數,然後用來預測其他超商在深夜的來客數。人數屬於計數(count)資料,需要用博桑分佈來計算其平均數與離散程度。
有時候我們可以訪問母體裡的所有個人,這時候我們就不需要樣本。例如我們研究的是一本書裡面所有的文字背後的標題,只要我們有電子檔案,就可以分析所有的文字,而標題應該來自於Latent Dirichlet Allocation (LDA)或者多元分佈(multivariate distribution)。每一個標題都是一個選項,加總這些選項發生的機率等於1。
\[ E[Y]=E[E(\hat{Y})] \]
\[ p(x)=P(X=x)\\ 0\le p(x)\le 1 \\ \sum p(x) =1 \]
\[ p(1)=P(X=1)= \frac{1}{6}\\ p(2)=P(X=2)= \frac{1}{6}\\ p(3)=P(X=3)= \frac{1}{6}\\ p(4)=P(X=4)= \frac{1}{6}\\ p(5)=P(X=5)= \frac{1}{6}\\ p(6)=P(X=6)= \frac{1}{6} \]
# Load ggplot2 package
library(ggplot2)
# Define the range for x values
x_range <- seq(-3, 3, by = 0.01)
# Define the standard normal distribution function
y <- dnorm(x_range, mean = 0, sd = 1)
# Create the data frame for plotting
data <- data.frame(x = x_range, y = y)
# Plot the normal distribution curve
p <- ggplot(data, aes(x = x, y = y)) +
geom_line() + # Draw the curve
geom_area(data = subset(data, x >= -1 & x <= 1), aes(x = x, y = y), fill = "blue", alpha = 0.5) +
labs(title = "Normal Distribution with Shaded Area Between z = -1 and z = 1",
x = "Z value",
y = "Density") +
theme_minimal()
# Display the plot
print(p)
Figure 2.1: 身高常態分佈
\[ f(x | \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 } \]
其中:
R
表示為:e <- exp(1)
e
## [1] 2.718
#simulate
n <- 10
e_powers <- exp(-n: n)
#data
df <- data.frame(x = c(-n:n), y = e_powers)
#graphic
xg <- ggplot2::ggplot(data = df, aes(x = x, y = y)) +
geom_line(col = '#CC11EE') +
geom_point(col = '#1100FF')
xg
Figure 2.2: 自然對數的底
可以看出\(e\)有緩慢上升的趨勢,與常態分佈有關。
假設政大學生身高成常態分佈,平均值是171,標準差是7公分。今天抽樣統計,得知樣本當中身高最矮是160,最高175,請問這套樣本的身高在全體政大學生的當中,出現機率有多高?
# Probability that a randomly selected adult male is between 168 cm and 182 cm
lower_bound <- 160
upper_bound <- 175
# Mean and standard deviation
mean_height <- 171
std_dev <- 7
# Calculating the probabilities
prob_lower = pnorm(lower_bound, mean = mean_height, sd = std_dev)
prob_upper = pnorm(upper_bound, mean = mean_height, sd = std_dev)
# Probability of a randomly selected adult male being between 168 cm and 182 cm
prob_between = prob_upper - prob_lower
prob_between
## [1] 0.6581
# Mean and standard deviation
mean_height <- 171
std_dev <- 7
lbound <- 160
ubound <- 175
# Generate data
height <- rnorm(1000, 171, 7)
# Create a data frame with the density values
density <- data.frame(x = seq(min(height), max(height), length.out = 1000),
y = dnorm(seq(min(height), max(height), length.out = 1000), mean(height), sd(height)))
# Plot the density curve
hp <- ggplot(density, aes(x = x, y = y)) +
geom_line() +
geom_ribbon(data = subset(density, x >= 160 & x <= 175),
aes(ymax = y, ymin = 0), fill = "#C11100", alpha = 0.3) +
labs(title = "Normal Distribution with Shaded Area",
x = "Height", y = "Density")
hp
Figure 2.3: 身高分佈
UsingR::alltime.movies
有電影的票房資料。我們可以找出這筆資料的中央趨勢,例如計算2000年(含)前後的票房平均數: movies<-UsingR::alltime.movies; attach(movies)
dt1 <- movies[which(Release.Year < 2000),]
dt2 <- movies[which(Release.Year >= 2000),]
cat("mean before 2000:", mean(dt1$Gross), "\n")
## mean before 2000: 243.5
cat("mean after 2000:", mean(dt2$Gross))
## mean after 2000: 234.6
movies <- movies %>% mutate (D = ifelse (Release.Year < 2000, 'Before 2000', 'After 2000')) %>%
mutate (D = as.factor(D)) %>%
mutate (D = factor(D, levels=c("Before 2000", "After 2000"))) %>%
dplyr::select (D, Gross)
tblmovies<- movies %>%
tbl_summary(
by = D,
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c(
"{mean}",
"{median} ({p25}, {p75})",
"{min}, {max}"
),
missing = "no",
digits = all_continuous() ~ 2
) %>%
modify_header(label ~ "**Release Year**") %>%
modify_caption("**Boxoffice and Release Year**") %>%
bold_labels()
tblmovies
Release Year | Before 2000, N = 50 | After 2000, N = 29 |
---|---|---|
Gross | ||
Mean | 243.46 | 234.55 |
Median (IQR) | 216.50 (181.75, 257.75) | 215.00 (190.00, 260.00) |
Range | 172.00, 601.00 | 176.00, 404.00 |
library(sjPlot)
H.movies<- movies %>%
group_by(D) %>%
plot_frq(Gross, type = "histogram", show.mean = TRUE, normal.curve = TRUE) %>%
plot_grid()
Figure 2.4: 2000年前後的電影票房分佈
H.movies
TableGrob (2 x 1) “arrange”: 2 grobs z cells name grob 1 1 (1-1,1-1) arrange gtable[layout] 2 2 (2-2,1-1) arrange gtable[layout]
p.sex <- starwars %>% sjPlot::plot_frq(sex)
p.sex
Figure 2.5: Eye Color of Characters in Star Wars
\[ p_{i}=\frac{freq_{i}}{N}\times\frac{100}{100} \]
t.sex <- starwars %>% janitor::tabyl(sex)
t.sex
## sex n percent valid_percent
## female 16 0.18391 0.19277
## hermaphroditic 1 0.01149 0.01205
## male 60 0.68966 0.72289
## none 6 0.06897 0.07229
## <NA> 4 0.04598 NA
table(starwars$sex)
##
## female hermaphroditic male none
## 16 1 60 6
prop.table(table(starwars$sex))
##
## female hermaphroditic male none
## 0.19277 0.01205 0.72289 0.07229
library(ggplot2); library(dplyr)
eyecolors <- unique(starwars$eye_color)
p.eye <- starwars %>% ggplot2::ggplot(aes(x = eye_color, fill = eye_color)) +
geom_bar() +
scale_fill_manual(values= c('black','blue','#AAAAFF', 'brown',
'gray20', 'gold', 'greenyellow','#BB2211',
'orange','pink', 'red', '#CCEEFF', '#EE11EE',
'white', 'yellow')) +
ggplot2::labs(caption = "Star Wars", y = 'Frequency', x = 'Eye Color')+
theme(legend.position = 'none') +
coord_flip()
p.eye
Figure 2.6: 星際大戰角色的眼睛顏色
n <- 390+350
p1 <- 390/n
p=0.5
s1 <- sqrt(p*(1-p)/n)
z <- (p1-p)/s1; z
[1] 1.47
1-pnorm(z)
[1] 0.07072
p2 <-350/n
s2 <- sqrt(p2*(1-p2)/n)
z2 <- (p1-p2)/s2; z2
[1] 2.945
p1<-0.36; p2<-0.40; n=1000;
s.error = 1.96*sqrt((p1+p2-(p1-p2)^2)/n); s.error
[1] 0.05398
眾數
中位數
百分位數
平均數
從這四個統計值,可以大致判斷資料的分佈狀況。
R
的mode()
函數會回傳向量儲藏資料的性質,並不會告訴我們眾數。例如我們讀了一筆以SPSS格式儲存的民調資料:b2<-here::here('data','PP0797B2.sav')
dt <- haven::read_spss(b2)
mode(dt$Q1)
## [1] "numeric"
table(dt$Q1)
##
## 1 2 3 4 95 96 97 98
## 617 684 443 91 10 57 52 104
names()
找出這個表格的首行,進一步篩選首行的元素,條件為該表格的最大值,符合這個條件的就是該變數的眾數: tmp <- table(as.vector(dt$Q1))
tmp
##
## 1 2 3 4 95 96 97 98
## 617 684 443 91 10 57 52 104
names(tmp)
## [1] "1" "2" "3" "4" "95" "96" "97" "98"
names(tmp)[tmp == max(tmp)]
## [1] "2"
☛請練習用ISLR
套件中的Carseats資料,找出
UsingR
的套件中,Boston
這筆資料有房價中位數的變數 library(ggplot2)
ggplot(data=MASS::Boston, aes(y=medv, x=ptratio)) +
geom_point(aes(color=lstat))
Figure 2.7: 波士頓各區的房價中位數與生師比及低社會地位人口比例散佈圖
A<-c(1:10); B<-c(0:10)
median(A); median(B)
[1] 5.5 [1] 5
☛請問studentsfull.txt這筆資料中,學生的中位數成績是多少?
# Generate some random data
set.seed(02138)
data <- c(rnorm(500, mean = 40, sd = 12), rnorm(500, mean=60, sd=5))
# Calculate median, mean, and mode
median_val <- median(data)
mean_val <- mean(data)
# Density
plot(density(data), col = "black",
main = "Distribution with Central Tendency Measures",
xlab = "Value")
# Add lines for median, mean, and mode
abline(v = median_val, col = "red", lwd = 2, lty = 2)
abline(v = mean_val, col = "blue", lwd = 2, lty = 2)
# Add legend
legend("topright", legend = c("Median", "Mean"), col = c("red", "blue"), lwd = 2, lty = 2)
Figure 2.8: 平均數在中位數的左邊
可以看到分佈向右偏,一半的數在平緩的左邊,一半的數在高聳的右邊。平均數在左邊,代表左邊的數雖然分散,但是總和等於右邊的總和,
我們用均等分佈模擬資料,畫圖2.9表示平均數在中位數的右邊:
# Generate some random data
set.seed(02138)
data <- c(runif(600, 10, 40), runif(400, 65, 250))
# Calculate median, mean, and mode
median_val <- median(data)
mean_val <- mean(data)
# Density
plot(density(data), col = "black",
main = "Distribution with Central Tendency Measures",
xlab = "Value")
# Add lines for median, mean, and mode
abline(v = median_val, col = "red", lwd = 2, lty = 2)
abline(v = mean_val, col = "blue", lwd = 2, lty = 2)
# Add legend
legend("topright", legend = c("Median", "Mean"), col = c("red", "blue"), lwd = 2, lty = 2)
Figure 2.9: 平均數在中位數的右邊
四分位數是數列分成四份之後的三個點:25分位、50分數、75分為其中的25與75分位數。
對於數列的分佈有不同的假設,就有不同計算百分位數的方式。
四分位數是依序排列觀察值,分成四等份的分位數\(Q_{i}\),\(i=\{1,2,3\}\),\(Q_{1}\)代表有\(\frac{1}{4}\)的觀察值小於\(Q_{1}\),\(Q_{3}\)代表有\(\frac{3}{4}\)的觀察值小於\(Q_{3}\)。
例如資料為:\(X=(1, 1001, 1002, 1003)\)
25 百分位所在位置\(=\frac{4\times 25}{100}=1\)。因此 25百分位為 1。
50 百分位所在位置為:\(\frac{4\times 50}{100}=2\)。因此 50百分位為 1001。
75 百分位所在位置為:\(\frac{4\times 75}{100}=3\)。因此 75百分位為 1002。
X <- c(1, 1001, 1002, 1003)
qd <- quantile(X, c(.25, .5, .75), type=1)
knitr::kable(qd, col.names = c('quantiles', 'value')) %>%
kable_styling(latex_options = "scale_down")
quantiles | value |
---|---|
25% | 1 |
50% | 1001 |
75% | 1002 |
例如:11位大學生的手機資費如下:195,220, 250,250,305,311,350,371,420,473,650。
\(Q_{1}=\frac{11}{4}=2.75\)。進位之後取第3個數,得到250。
\(Q_{2}=\frac{2\times11}{4}=5.5\)。進位之後取第6個數,得到311。
\(Q_{3}=\frac{3\times11}{4}=8.25\)。進位之後取第9個數,得到420。
R
提供9種計算方法,前面3種適用於間斷變數,後面6種則是連續變數。我們以第1種方法計算。
m <- c(195,220, 250,250,305,311,350,371,420,473,650)
qd <- quantile(m, c(0.25, 0.5, 0.75), type = 1)
knitr::kable(qd, col.names = c('quantiles', 'value'))%>%
kable_styling(latex_options = "scale_down")
quantiles | value |
---|---|
25% | 250 |
50% | 311 |
75% | 420 |
set.seed(1000)
tmp1<-rnorm(100, 40, 10)
qmp1 <- quantile(as.integer(tmp1), c(0.25, 0.5, 0.75), type=1)
knitr::kable(qmp1, col.names = c('quantiles', 'value'))%>%
kable_styling(latex_options = "scale_down")
quantiles | value |
---|---|
25% | 34 |
50% | 40 |
75% | 45 |
set.seed(1000)
tmp2<-rnorm(100, 40, 14)
qmp2 <- quantile(as.integer(tmp2), c(0.25, 0.5, 0.75), type=1)
knitr::kable(qmp2, col.names = c('quantiles', 'value'))%>%
kable_styling(latex_options = "scale_down")
quantiles | value |
---|---|
25% | 31 |
50% | 40 |
75% | 47 |
\[m_{i}=100\cdot \frac{i}{n}\]
上面的公式中,\(m\)變數的\(i\)百分位數等於\(i\)除以\(m\)變數的觀察值總數\(n\)再乘以100。如果\(m_{i}\)不是整數,則\(k\)為該百分位數,且\(m_{i+1}\ge k\ge m_{i}\)。
換句話說,當\(m_{i}\)不是整數,我們可以將\(m_{i}\)無條件進位加1的數當做\(m_{i}\)。
另一種算法是當\(m_{i}\)是整數,則排在第\(m\)位與\(m+1\)位資料值的算術平均數就是這群資料的第\(k\)百分位數。
我們也可以用另一筆資料驗證手算以及R
的結果:
full<-here::here('data','studentsfull.txt')
dt <- read.csv(full,sep="",header=T)
dt$Score<-sort(dt$Score)
dt$Score
[1] 60 62 66 66 69 70 75 77 78 80 80 81 82 83 85 85 86 87 88 88 88 89 91 92 92 [26] 93
dt$Score[floor(length(dt$Score)*0.25)+1]
[1] 75
dt$Score[floor(length(dt$Score)*0.75)+1]
[1] 88
dt$Score[floor(length(dt$Score)*0.9)+1]
[1] 92
qd1 <- quantile(dt$Score, c(0.25, .75, 0.9), type=1)
qd1 <- knitr::kable(qd1, col.names = c('quantiles', 'value'))%>%
kable_styling(latex_options = "scale_down")
qd2 <- quantile(dt$Score, c(0.25, .75, 0.9), type=4)
knitr::kable(qd2, col.names = c('quantiles', 'value'))%>%
kable_styling(latex_options = "scale_down")
quantiles | value |
---|---|
25% | 72.5 |
75% | 88.0 |
90% | 91.4 |
R
的輸出跟SPSS類似,我們可以加以對照(圖2.10)。SPSS的統計值等於是R
的quantile()的第六種計算方式。scores<-c(15, 22, 26, 32, 33,36, 36, 41, 42, 44,
44, 45, 47, 48, 61,63, 63, 65, 65, 65,
66, 66, 68, 69, 70,71, 74, 74, 76, 77,
78, 78, 80, 85)
qscore <- quantile(scores, c(.25,.5,.75,.9), type=6)
knitr::kable(qscore, col.names = c('quantiles', 'value')) %>%
kable_styling(latex_options = "scale_down")
quantiles | value |
---|---|
25% | 41.75 |
50% | 64.00 |
75% | 71.75 |
90% | 78.00 |
v1<-here::here('Fig','v1_quantile.png')
knitr::include_graphics(v1)
Figure 2.10: 四分位統計
trim
刪除若干百分比的數。算術平均數(arithmetic mean)的公式如下:\[\bar{y}=\frac{\sum y_{i}}{n}\]
例如: \[y={6, 7, 8, 8, 9, 10, 13, 15, 16, 45}\]
平均數為: \[\bar{y}=\frac{\sum (6+7+\cdots , +45)}{10}=13.7\]
如果\(x_{i}=y_{i}+10\),請問平均數\(\bar{x}\)會比\(\bar{y}\)大、小、不變?
如果\(x_{i}=10\times y_{i}\),請問平均數\(\bar{x}\)會比\(\bar{y}\)大、小、不變?
\[\bar{y}=\frac{\sum n_{k}\cdot \bar{y_{k}}}{n}\]
換句話說,每一組的平均數乘以每一組在全部資料中佔的比例,類似加權,得到的平均數和就是全體樣本的平均數。
例如有三個空氣品質的觀測站的資料,要計算總平均數,首先從總和除以全部個案數計算:
A<-list(station1=c(25, 33, 44),
station2=c(43, 66, 78, 81),
station3=c(90, 76, 105, 110, 121))
#n of each group
group.n=sapply(A, length); group.n
## station1 station2 station3
## 3 4 5
#n of data
total.n=sum(sapply(A, length));
#proportion of each group
group_p <- group.n/total.n
#mean of each group
submean=sapply(A, mean); cat("air pollution of each station:", submean,"\n")
## air pollution of each station: 34 67 100.4
cat("Average air pollution=", sum(group_p*submean),"\n")
## Average air pollution= 72.67
#sum of data
totalair=sum(sapply(A, sum));
cat("Sum of air pollution=",totalair,"\n")
## Sum of air pollution= 872
cat("average air pollution=", totalair/total.n)
## average air pollution= 72.67
sapply()
函數可套用函數在列表的每一個向量。spss<-here::here('Fig','week3_skewness.jpg')
knitr::include_graphics(spss)
Figure 2.11: 偏態統計
R
的計算公式1與2分別與Stata以及SPSS得到的結果相同。library(foreign)
hs<-here::here('data','hsb2.dta')
hsb2<-read.dta(hs)
library(e1071)
cat('skewness of writing:', skewness(hsb2$write, type=1), '\n')
skewness of writing: -0.4784
cat('skewness of writing:', skewness(hsb2$write, type=2))
skewness of writing: -0.482
stata<-here::here('Fig','write_stata.png')
knitr::include_graphics(stata)
Figure 2.12: Stata的偏態統計
spss<-here::here('Fig','write_spss.png')
knitr::include_graphics(spss)
Figure 2.13: SPSS的偏態統計
# Sample data
set.seed(02138)
A <- c(runif(700, 20, 30), runif(300, 30, 70))
B <- rnorm(1000, 30, 4)
C <- c(runif(200, 0, 80), rnorm(800, 50, 10))
cat('A skewness:', skewness(A, type=1), '\n')
A skewness: 1.396
cat('B skewness:', skewness(B, type=1), '\n')
B skewness: 0.01334
cat('C skewness:', skewness(C, type=1), '\n')
C skewness: -0.7783
library(gtsummary)
library(dplyr)
library(e1071)
# Create the dataset
tmp <- data.frame(A = A, B = B, C = C)
# Create tbl_summary object
tblm1 <- tmp %>%
tbl_summary(
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ c(
"{mean}",
"{median}"
),
missing = "no",
digits = all_continuous() ~ 2
) %>%
bold_labels()
tblm1
Characteristic | N = 1,000 |
---|---|
A | |
Mean | 32.30 |
Median | 27.04 |
B | |
Mean | 29.96 |
Median | 29.87 |
C | |
Mean | 48.15 |
Median | 49.55 |
#mode
varmode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
cat('Mode of A:', varmode(A), '\n')
Mode of A: 25.12
cat('Mode of B:', varmode(B), '\n')
Mode of B: 26.72
cat('Mode of C:', varmode(C), '\n')
Mode of C: 50.36
#tblm2 <- tmp %>%
# tbl_custom_summary(
#stat_fns = ~varmode,
#statistic = ~'{mode}'
# )
#tbl_merge_ex2 <-
# tbl_merge(tbls = list(tblm1, tblm2)) %>%
# modify_spanning_header(everything() ~ NA_character_)
我們可以用橫的盒型圖如圖2.14表示分佈:
data <- data.frame(group = rep(c("A", "B", "C"), each = 1000),
value = c(A, B, C))
# Define colors for each group
colors <- c("#EE22AA", "#EBC311", "#C1F210")
# Draw horizontal box plot
data$group <- ordered(as.factor(data$group), levels = c('C','B','A'))
boxplot(value ~ group, data = data, horizontal = TRUE, col=colors,
ylim=c(0, 80))
# Adjust y-axis ticks
axis(2, at = seq(0, 80, by = 10))
Figure 2.14: 表示偏態的箱型圖
x <- seq(min(A), max(A), length = 1000)
df <- data.frame(x=c(x, x, x), y=c(sort(A), sort(B), sort(C)),
dataset = c(rep('A', 1000), rep('B', 1000), rep('C', 1000)))
ggplot2::ggplot(df, aes(x = y, group = dataset)) +
geom_density(aes(fill = dataset), alpha = 0.5) +
scale_fill_manual(values=c("#C1F210", "#EBC311", "#EE22AA"))
Figure 2.15: 3個模擬資料的密度圖
R
,可以用e1071
這個套件裡面的kurtosis
,用type
指令選擇2可得到跟SPSS一樣的答案。#m4
s4=sum((hsb2$write-mean(hsb2$write))^4)
m4=s4/200; print(m4)
## [1] 17889
#m2
s2=sum((hsb2$write-mean(hsb2$write))^2)
m2=s2/200; print(m2)
## [1] 89.39
#kurtosis
cat('kurtosis=', m4/m2^2)
## kurtosis= 2.239
R
計算結果:kurtosis(hsb2$write, type=1)+3
[1] 2.239
kurtosis(hsb2$write, type=2)+3
[1] 2.25
stata<-here::here('Fig','write_stata.png')
knitr::include_graphics(stata)
Figure 2.16: Stata的峰度統計
spss<-here::here('Fig','write_spss.png')
knitr::include_graphics(spss)
Figure 2.17: SPSS的峰度統計
R
的彈性比較大。# Generate random samples from the normal distribution
normal_dist <- rnorm(1000)
e1071::kurtosis(normal_dist, type=1)
## [1] 0.02436
# Generate random samples from the t-distribution with 3 degrees of freedom
t_dist <- rt(1000, df = 3)
e1071::kurtosis(t_dist, type=1)
## [1] 46.38
hist(t_dist, freq = FALSE, main = "t-Distribution (df = 3)",
xlab = "Value", ylab = "Density", col = "lightgreen", ylim=c(0, 0.33),
breaks=40)
lines(density(t_dist), col = "#AA11AA", lwd = 2)
Figure 2.18: T分佈模擬PDF
range(hsb2$write)
[1] 31 67
range(hsb2$math)
[1] 33 75
par(bg="#33110011")
x <- seq(-3, 3, length=100)
hx <- dnorm(x)
plot(x, hx, type="l", lty=2, xlab="x value", lwd=2,
ylab="Density", main="Comparison of t Distributions", ylim=c(0,0.6))
y <- seq(-3, 3, length=600)
curve(dnorm(x, 0, 0.75), type="l", add=T, lwd=2, col="red")
Figure 3.1: 相同全距離散程度不同的機率密度圖
library(ggplot2)
sv<-c(15, 22, 26, 32, 33,36, 36, 41, 42, 44,
44, 45, 47, 48, 61,63, 63, 65, 65, 65,
66, 66, 68, 69, 70,71, 74, 74, 76, 77,
78, 78, 80, 85)
quantile(sv, c(.25,.5,.75), type=6)
## 25% 50% 75%
## 41.75 64.00 71.75
dt <- data.frame(scores=sv)
ggplot(data=dt, aes(y=scores)) +
geom_boxplot(fill="#FF22EE11")
Figure 3.2: 學生成績盒鬚圖
☛請計算MASS::Animals
這筆資料中的腦容量的四分位距。
每一個觀察值與平均數的差距稱為變異數,用來衡量連續變數資料離散的程度。
母體的變異數公式為:
\[\sigma^2=\frac{\sum (X-\mu)^2}{n}\]
\[S^2=\frac{\sum (X-\bar{X})^2}{n-1}\]
\[n\cdot p\cdot(1-p)\]
\[\sqrt{\frac{np(1-p)}{n-1}}\]
某10家上市公司的股票價格分別為13.5, 22.2, 31.2, 15.2, 20.3, 21.9, 18.3, 25.3, 21.3, 19.8。請問這些公司的股價的變異數是多少?
如果這10家的股票價格都漲了1塊錢,請問變異數變大、變小或者不變?為什麼?
如果這10家的股票價格都漲了1倍,請問變異數變大、變小或者不變?為什麼?
某戶人家收集過去12個月的水費帳單,想知道每個月水費有沒有超過300元,發現7, 8, 9月之外,其他月份都沒有超過。請問水費有無超過300元的變異數是多少?
如果該住戶發現其實10月份的水費也超過300元,請問變異數變大、變小還是不變?為什麼?
如果該住戶統計到隔年1月,發現還是3個月的水費超過300元,請問變異數變大、變小還是不變?為什麼?
如果隨機變數屬於常態分配,大部分的值應該聚集在平均值加減一個標準差的範圍內,因此,樣本標準差的大小特別重要。
當樣本來自於常態分配的母體,利用微積分可求出平均數的加減1個標準差包含約\(68\%\)的樣本。2個標準差包含約\(95\%\)的樣本。3個標準差包含約\(99\%\)的樣本。
先寫程式計算標準差,再用R
的
#hsb2 data
file <- here::here('data', 'hsb2.txt')
hsb2 <- read.table(file, header = T)
#variance
v.write<-var(hsb2$write); sqrt(v.write)
## [1] 9.479
#standard deviation
std = function(x) sqrt(var(x))
std(hsb2$write)
## [1] 9.479
#self-defined function
sd<-function(V)sqrt( sum((V - mean(V))^2 /(length(V)-1)))
sd(hsb2$write)
## [1] 9.479
H<-c(15000,7000,19000,3000,15000,19000,4000,12000,
17000, 9000)
h<-c(15,7,19,3,15,19, 4,12,17, 9)
sd(H); sd(h)
## [1] 5963
## [1] 5.963
我們用一個分數的正負判斷某個樣本的連續變數是否高於平均數,用絕對值判斷距離平均數的差距有多遠,這就是Z分數或者Z值。
Z值的優點是將原始分數轉換為標準分數後,不同單位的數量或者不同測量的分數,均可直接比較,不會受測量困難程度的影響。
如果轉換原始分數轉換為Z值得到小數或負數,也可以乘以一個常數,再加一常數,變成整數。也就是整個分佈向右邊移動。
\[f(Z)=\frac{1}{2\pi}e^{-\frac{Z^2}{2}}\\ Z=\frac{X-\mu}{\sigma}\]
\[Z\sim N(0,1)\]
\[Z=\frac{x-\bar{x}}{s}\] - s是標準差。
\[Z=\frac{X-\mu}{\sigma}\] \[\sigma\neq 0\]
pnorm(6, 0, 1)
## [1] 1
pnorm(-6, 0, 1)
## [1] 9.866e-10
pnorm(-1.96,0,1)
## [1] 0.025
curve(dnorm(x),
xlim = c(-3, 3),
ylab = "Density",
#main = "機率密度與區域",
col='red', lwd=2, xlab='Z')
cord.1x <- c(-3,seq(-3, -1.96,0.01),-1.96)
cord.1y <- c(0,dnorm(seq(-3, -1.96,0.01)),0)
polygon(cord.1x,cord.1y,col='grey80')
Figure 4.1: 標準常態分佈曲線下的左邊2.5%區域
curve(dnorm(x),
xlim = c(-3, 3),
ylab = "Density",
#main = "機率密度與區域",
col='red', lwd=2, xlab='Z')
cord.1x <- c(1.96 ,seq(1.96, 3, 0.01), 3)
cord.1y <- c(0, dnorm(seq(1.96, 3, 0.01)), 0)
polygon(cord.1x,cord.1y,col='grey80')
Figure 4.2: 標準常態分佈曲線下的右邊2.5%區域
Z = 1.96
1-pnorm(Z, 0, 1)
## [1] 0.025
q = 0.975
qnorm(q)
## [1] 1.96
q = 0.95
qnorm(q)
## [1] 1.645
★請問在
head(alr4::Heights, 4)
mheight dheight 1 59.7 55.1 2 58.2 56.5 3 60.6 56.0 4 60.7 56.8
m.i<-mean(alr4::Heights$mheight)
m.i
[1] 62.45
s.i<-var(alr4::Heights$mheight)
s.i
[1] 5.547
zstar1=(63-m.i)/sqrt(s.i);zstar2=(65-m.i)/sqrt(s.i)
cat("63in=",zstar1,"\n","65in=",zstar2,"\n")
63in= 0.2323 65in= 1.082
pnorm(zstar2, 0, 1)-pnorm(zstar1, 0, 1)
[1] 0.2684
curve(dnorm(x),
xlim = c(-3, 3),
ylab = "Density",
#main = "機率密度與區域",
col='red', lwd=2, xlab='Z')
cord.1x <- c(0.232,seq(0.232, 1.081,0.01), 1.081)
cord.1y <- c(0,dnorm(seq(0.232, 1.081,0.01)),0)
polygon(cord.1x,cord.1y,col='grey80')
Figure 4.3: 標準常態分佈曲線下的特定區域
★ 有一位員工的今年月薪為8.5萬,去年則為8萬。今年的全體薪水標準差為2.3萬,平均值為6.4萬,而去年的全體員工薪水標準差為2萬,平均值為6.2萬。假設員工薪水常態分配。請問該員工月薪相較於全體員工,今年比去年有增加嗎?
\[z_{1}=\frac{8.5-6.4}{2.3}=0.91\]
\[z_{2}=\frac{8-6.2}{2}=0.75\] - 因為\(z_{1}\geq z_{2}\),因此該員工月薪相較於全體員工有增加。
score =78
mu = 67.09
sigma = 11.08
(score-mu)/sigma
[1] 0.9847
score
中位數、90百分位數以及男性跟女性的平均數:
score
平均數以及標準差:
UsingR
套件中的faithful
資料,請問要看噴泉最少要等幾分鐘?平均要等幾分鐘?最多跟最少等的時間差距多少分鐘?
airquality
這筆資料的Wind
這個變數,計算前後兩個資料點的差異,以分析兩天之間風速的差異。例如:
A <- c(35, 61, 69)
d.A <- c(26, 8)
airquality
這筆資料的Wind
的偏態為何?峰度是多少?
councilor
這筆資料中,請問平均工程預算是多少?樣本標準差多少?
2008Election
這筆資料,請問馬英九的得票數的25分位數、中位數、75分位數分別是多少?請問25與75分位數之間差別多少?
BES
這筆資料,計算表示英國應該離開歐盟(leave
)的比例。
vote
)的次數最多?相對頻率多少?
最後更新時間: 2024-03-22 13:10:45