Presentation

어떤 부부의 첫 번째 아이가 여자일 때, 두 번째 아이가 여자일 확률은 얼마인가?

Solution 1. Mendelian’s law

Mendelian law에 따라서 두 번째 아이가 여아일 확률을 계산하면 다음과 같이 계산할 수 있다.

Px <- 0.5 #첫 번째 아이가 여자일 확률
Py <- 0.5 #두 번째 아이가 여자일 확률
#P(y|x) = P(y&x)/P(x) = P(y)*P(x)/p(x)
Pyx <- Py*Px/Px
print(Pyx)

## [1] 0.5

위의 식은 일반적인 상황에서 자명해 보이지만 다음과 같은 가정을 내포하고 있다.

#1. P(x) = P(y) = 0.5, 즉, 아이를 낳을 때 여자아이를 낳을 확률은 0.5로 정해져 있다.  
#2. P(y&x) = P(y)*P(x), 즉 아이를 낳는 사건이 서로 독립이다.

하지만 실제 상황에서 위의 가정이 항상 적용되는 것은 아니다. 부부에 따라 여자아이를 낳을 확률이 0.5가 아닌 다른 값을 가질 수도 있기 때문이다. 따라서 Solution 2.에서부터는 베이즈 통계학을 이용해 적절한 사전확률이나 사전분포를 가정하고 첫 번째 아이가 여자라는 정보를 이용해 두 번째 아이가 여자일 확률에 대해 추론해 볼 것이다.

Solution 2. 균등분포

여아를 가질 확률이 각각 0.4, 0.5, 0.6인 3그룹의 부부를 생각해 볼 수 있다. 이유불충분의 원리에 의해 각 그룹에 대한 확률을 1/3로 할당하였다.

type <- c(0.4,0.5,0.6)
probability <- c(1/3,1/3,1/3)
data <- data.frame(x=type, y=probability)
ggplot(data, aes(x=x,y=y))+geom_bar(stat="identity", width=0.05, fill="lightblue1")+ggtitle("Probability  function")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 20, color = "gray1"))

첫 번째 아이가 여자인 정보를 이용해 사후 확률을 구하면 다음과 같다.

probability2 <- type*probability/sum(type*probability)
data2 <- data.frame(x=type, y=probability2)
ggplot(data2, aes(x=x,y=y))+geom_bar(stat="identity", width=0.05, fill="lightblue1")+ggtitle("Probability function w/info")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 20, color = "gray1"))

이 사후 확률을 이용해 두 번째 아이가 여자일 기댓값을 구해준다.

py <- sum(type*probability2)
print(py)

## [1] 0.5133333

Solution 3. 베타분포 1

Solution 2.에서 3개의 그룹으로 나누었던 것을 확장해 여자아이를 가질 확률이 0이상 1이하인 무수히 많은 그룹의 부부를 가정하였다. 여기서는 먼저 이유불충분의 원리에 따라 사전 분포를 동일하게 가정하였다.

P_density <- function(x){1} #확률밀도함수
ggplot(data.frame(x=c(0,1)),aes(x=x))+stat_function(fun=P_density)+ geom_hline(yintercept = 0)+ggtitle("Probability density function")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 20, color = "gray1"))

첫 번째 아이가 여자인 정보를 이용해 사후 분포를 구하면 다음과 같다.

P_density2 <- function(x){1*x}*2 #정규화 조건을 충족하기 위해 2를 곱해준다.
integrate(P_density2,0,1)

## 1 with absolute error < 1.1e-14

ggplot(data.frame(x=c(0,1)),aes(x=x))+stat_function(fun=P_density2)+ggtitle("Probability density function w/info")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 20, color = "gray1"))

이 사후 분포를 이용해 두 번째 아이가 여자일 기댓값을 구해준다.

P_density3 <- function(x){2*x*x}
py <- integrate(P_density3,0,1)
print(py)

## 0.6666667 with absolute error < 7.4e-15

Solution 4. 베타분포 2

Solution3.는 여자아이를 가지는 부부에 대해 몇 개의 그룹으로 한정하지 않고 연속적인 값으로 표현했다는 데에서 의미가 있지만 그 값에 대해 모두 동일한 확률 분포로 설정했다는 것은 현실과 많이 동떨어진다. 따라서 일반적으로 여아를 낳을 확률이 0.5일 가능성이 가장 크다고 사전 분포를 설정하고 그 값을 구해주었다. 이 때 사용한 것은 베타분포로 알파와 베타가 각각 1일 때의 값이다. 참고로 Solution 3.에서 설정한 사전 분포 또한 알파와 베타가 모두 0인 베타분포를 이용하였다고 할 수 있다.

#베타분포 = a * (x)^(alpha)*(1-x)^(beta)

P_density <- function(x){6*x*(1-x)}
ggplot(data.frame(x=c(0,1)),aes(x=x))+stat_function(fun=P_density)+ggtitle("Probability density function")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 20, color = "gray1"))

P_density2 <- function(x){x*(1-x)*x*12} #정규화 조건을 충족하기 위해 12를 곱해준다. 
integrate(P_density2,0,1)

## 1 with absolute error < 1.1e-14

ggplot(data.frame(x=c(0,1)),aes(x=x))+stat_function(fun=P_density2)+ggtitle("Probability density function w/info")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 20, color = "gray1"))

P_density3 <- function(x){x*(1-x)*x*12*x}
py <- integrate(P_density3,0,1)
print(py)

## 0.6 with absolute error < 6.7e-15

Solution 5. 표준정규분포 변형

베이즈 통계학을 이용한 추론에서는 사전확률이나 사전분포가 매우 중요하다. 때로는 편의에 따라 이유불충분의 원리를 이용해 똑같이 설정하거나 분석자 임의로 설정하기도 하지만 정확한 모델을 세울수록 추론하는 값이 참값에 가까울 확률이 높아진다. 2016년 통계청이 발표한 남녀 성비는 105이다. 여전히 유산, 낙태 등과 같이 여러 가지 변수가 작용하지만 이는 Mendelian’s law가 실제와는 다르게 남아를 낳을 확률이 조금 더 높은 것일 수도 있다. 따라서 이번에는 표준정교분포를 이용해 2016년 통계청이 발표한 남녀 성비를 평균으로 설정하고 적절한 표준편차를 넣어 사전 분포를 만들어 보았다.

P_density <- function(x){dnorm(x,mean=100/205,sd=1)}
ggplot(data.frame(x=c(0,1)),aes(x=x))+stat_function(fun=P_density)+ggtitle("Normal distribution w/adjusted mean")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 20, color = "gray1"))

integrate(P_density,0,1)

## 0.3828987 with absolute error < 4.3e-15

P_density11 <- function(x){dnorm(x,mean=100/205,sd=0.1)}
P_density12 <- function(x){dnorm(x,mean=100/205,sd=0.05)}
integrate(P_density11,0,1)

## 0.9999993 with absolute error < 1.1e-09

integrate(P_density12,0,1)

## 1 with absolute error < 1.8e-09

p11 <- ggplot(data.frame(x=c(0,1)),aes(x=x))+stat_function(fun=P_density11)+ggtitle("Normal distribution, sd= 0.1")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 15, color = "gray1"))
p21 <- ggplot(data.frame(x=c(0,1)),aes(x=x))+stat_function(fun=P_density12)+ggtitle("Normal distribution. sd=0.05")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 15, color = "gray1"))
grid.arrange(p11,p21,ncol=2)

P_density21 <- function(x){dnorm(x,mean=100/205,sd=0.1)*x*2.05} #정규화 조건을 충족하기 위해 2.05를 곱해준다.
P_density22 <- function(x){dnorm(x,mean=100/205,sd=0.05)*x*2.05} #정규화 조건을 충족하기 위해 2.05를 곱해준다.
integrate(P_density21,0,1)

## 0.9999997 with absolute error < 9.8e-10

integrate(P_density22,0,1)

## 1 with absolute error < 8.6e-05

p21 <- ggplot(data.frame(x=c(0,1)),aes(x=x))+stat_function(fun=P_density21)+ggtitle("Probability density function  \n w/info ,sd=0.1")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 15, color = "gray1"))
p22 <- ggplot(data.frame(x=c(0,1)),aes(x=x))+stat_function(fun=P_density22)+ggtitle("Probability density function  \n w/info ,sd=0.1")+theme_bw()+
theme(plot.title = element_text(family = "serif", face = "bold", hjust = 0.5, size = 15, color = "gray1"))
grid.arrange(p21,p22,ncol=2)

P_density31 <- function(x){dnorm(x,mean=100/205,sd=0.1)*x*2.05*x}
P_density32 <- function(x){dnorm(x,mean=100/205,sd=0.05)*x*2.05*x}
integrate(P_density31,0,1)

## 0.5083046 with absolute error < 1.7e-09

integrate(P_density32,0,1)

## 0.4929299 with absolute error < 4.2e-05