Simulate \(X_1,X_2\) (each of size 1,000) from a normal distribution with mean 0 and variance 1 and a normal distribution with mean 3 and variance 1, respectively. (a) Calculate \(M = \min(X_1,X_2)\). Plot the histogram of \(M\). Calculate \(P (M > 1)\). (b) Combine the two samples (a mixture of normals) and plot the histogram of the combined sample.
require(ggplot2)
## Loading required package: ggplot2
sampa <- rnorm(1000, mean = 0, sd = 1) ## X1 ~ N(0, 1)
sampb <- rnorm(1000, mean = 3, sd = 1) ## X2 ~ N(3, 1)
dat <- data.frame(sampa, sampb)
dat$M <- pmin(dat$sampa, dat$sampb) ## element-wise minimum
hist(dat$M)
head(dat$M)
## [1] -0.74696036 0.73620795 -0.04287846 -2.23937369 0.39100085 -1.95050911
length(which(dat$M>1)) / length(dat$M)
## [1] 0.129
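Since \(X_1\) and \(X_2\) are independent, \(P (M > 1) = P (X_1 > 1)\,P (X_2 > 1)\), which gives an exact value to check the simulated estimate against (no seed was set above, so the estimate varies from run to run):
## Exact probability: P(min(X1, X2) > 1) = P(X1 > 1) * P(X2 > 1)
pnorm(1, mean = 0, sd = 1, lower.tail = FALSE) *
  pnorm(1, mean = 3, sd = 1, lower.tail = FALSE)
## approximately 0.155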
###############
combined = c(sampa, sampb)
plt2 <- ggplot(data.frame(combined), aes(x = combined)) +
  geom_histogram(binwidth = 0.25, position = "identity")
plt2
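As a quick sanity check, the histogram can be compared against the theoretical mixture density, which has equal weights because both samples have size 1,000. A minimal base-graphics sketch (mix_dens is a small helper defined here):
## Theoretical density of the equal-weight mixture 0.5*N(0,1) + 0.5*N(3,1)
mix_dens <- function(x) 0.5 * dnorm(x, 0, 1) + 0.5 * dnorm(x, 3, 1)
hist(combined, breaks = 40, freq = FALSE, main = "Mixture of N(0,1) and N(3,1)")
curve(mix_dens(x), add = TRUE, col = "red", lwd = 2)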
Set the seed to be your student number. Simulate \(U_1,U_2,U_3\) (each of size 10,000), all from a \(\chi^2(1)\) distribution. (a) Calculate \(M = \max(U_1,U_2,U_3)\) using the simulated observations, and print out the first 6 values of \(M\). (b) Plot the histogram of \(M\). (c) Estimate (to 3 significant digits) the probability \(P (M > 1)\). (Hint: divide the number of elements of \(M\) greater than 1 by the total number of values.)
Your R code, results and plots:
set.seed(500910066)
U_1 <- rchisq(10000, df = 1)
U_2 <- rchisq(10000, df = 1)
U_3 <- rchisq(10000, df = 1)
dat <- data.frame(U_1, U_2, U_3)
dat$M <- pmax(dat$U_1, dat$U_2, dat$U_3) ## element-wise maximum
head(dat$M)
## [1] 0.8571184 0.7987459 7.5230561 0.1848106 0.1108221 2.9862671
hist(dat$M)
round(length(which(dat$M>1))/length(dat$M),3)
## [1] 0.684
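Because the \(U_i\) are independent and identically distributed, \(P (M > 1) = 1 - P (U \le 1)^3\), so the estimate can be checked exactly with pchisq:
## Exact probability: P(max(U1, U2, U3) > 1) = 1 - P(U <= 1)^3
round(1 - pchisq(1, df = 1)^3, 3)
## 0.682, consistent with the simulated 0.684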
Set the seed to be your student number. Simulate 3,000 points from a mixture of two normal distributions. Normal distribution 1 (N1) is the standard normal. Normal distribution 2 (N2) has mean 1 and variance twice that of distribution 1 (i.e., variance 2). Moreover, sample 1 from distribution 1 should be twice the size of sample 2 from distribution 2. (a) Plot the histogram of the combined sample of the mixture normal (MN). (b) Estimate (to 3 significant digits) the probabilities \(P (0.5 < MN < 1)\), \(P (0.5 < N1 < 1)\) and \(P (0.5 < N2 < 1)\).
Your R code, results and plots:
set.seed(500910066)
N1 <- rnorm(2000, mean = 0, sd = 1)       ## standard normal; sample 1 is twice the size of sample 2
N2 <- rnorm(1000, mean = 1, sd = sqrt(2)) ## mean 1, variance 2 (twice the variance of N1)
MN <- c(N1, N2)                           ## combined mixture sample of 3,000 points
hist(MN)
\(P (0.5 < MN < 1)\)
round(length(which(MN > 0.5 & MN < 1)) / length(MN), 3) ## should be close to the exact value 0.146
\(P (0.5 < N1 < 1)\)
round(length(which(N1 > 0.5 & N1 < 1)) / length(N1), 3) ## should be close to the exact value 0.150
\(P (0.5 < N2 < 1)\)
round(length(which(N2 > 0.5 & N2 < 1)) / length(N2), 3) ## should be close to the exact value 0.138
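These estimates can be checked exactly with pnorm; the mixture probability is the size-weighted average of the two component probabilities, with weights 2/3 and 1/3:
p1 <- pnorm(1, 0, 1) - pnorm(0.5, 0, 1)             ## P(0.5 < N1 < 1) = 0.150
p2 <- pnorm(1, 1, sqrt(2)) - pnorm(0.5, 1, sqrt(2)) ## P(0.5 < N2 < 1) = 0.138
(2/3) * p1 + (1/3) * p2                             ## P(0.5 < MN < 1) = 0.146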
Let’s generate some data \(y\) from a binomial distribution with size 1 (i.e., a Bernoulli distribution), where \(\theta = P (Y=1) = 0.4\). Start with a small sample size of \(n = 20\).
set.seed(500910066)
n <- 20
theta <- 0.4
y <- rbinom(n, size = 1, prob = theta)
print(y)
## [1] 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 1 0
It’s easy to compute the maximum likelihood estimate in this case, since \(\hat{\theta} = \bar{y}\). From a Bayesian perspective, we assume that \(\theta\) follows a Beta distribution. Because the Beta prior is conjugate to the binomial likelihood, the posterior is also a Beta distribution \(Beta (a_n, b_n)\), with parameters \(a_n = a_0 + \sum_i y_i\) and \(b_n = b_0 + n - \sum_i y_i\). First, we need to specify a prior distribution. For simplicity, let’s assume \(a_0 = b_0 = 1\) (a uniform prior).
yhat <- mean(y)
yhat ## should be close to 0.4
## [1] 0.35
############
a0 <- b0 <- 1
a_n <- sum(y) + a0
b_n <- n - sum(y) + b0
a_n / (a_n + b_n) ## posterior expected value of theta; should be close to 0.4
## [1] 0.3636364
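One way to see where the posterior mean 0.364 comes from: for a Beta prior, it is a weighted average of the prior mean \(a_0/(a_0+b_0)\) and the MLE \(\bar{y}\), with the data receiving weight \(n/(n + a_0 + b_0)\). A quick check of this standard identity using the quantities already computed:
## Posterior mean as a weighted average of the prior mean and the MLE
w <- n / (n + a0 + b0)              ## weight on the data: 20/22
w * yhat + (1 - w) * a0 / (a0 + b0) ## = 0.3636364, matching a_n / (a_n + b_n)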
One way to interpret the posterior distribution \(p(\theta|y)\) is to visualize its density using dbeta. Regions of relatively high density tell us which values of \(\theta\) are most probable. We can visualize the data using base graphics or ggplot2; both approaches can produce very high quality visualizations. Let’s use base graphics first. The plot function produces a variety of plots, depending on the specified type argument. To add to an existing plot, we can use the lines function (or points, for point markers). We’ll plot the posterior and compare it with the prior, i.e., compare what we have learned about \(\theta\) since observing the data \(y\). We’ll add the true value of \(\theta\) to the plot using abline, and labels using the text function.
x <- seq(0, 1, length.out= 100)
f_y1 <- dbeta(x, shape1 = a0, shape2 = b0)
f_y2 <- dbeta(x, shape1 = a_n, shape2 = b_n)
## Plot the beta density curve
plot(x, f_y1, ylim = range(f_y1, f_y2), type = "l",
     main = "Posterior and Prior Comparison", sub = "Simulated Data Result",
     xlab = expression(theta), ylab = "Probability Density", lwd = 2)
lines(x, f_y2, lwd = 2, col = 2)
abline(v = theta, col = 3)
text(c(0.8, 0.3, theta), c(1.2, 3.5, 2), labels = c("p(theta)", "p(theta | y)", "True Value"), col = 1:3) ## These values chosen with some trial and error
We can construct a very similar plot using ggplot. There are a couple of differences this time. First, we need to supply the data to ggplot as a data frame. Then we build the plot up in layers using different geoms; the ggplot2 syntax adds these layers together explicitly.
library("ggplot2")
## Need to supply data to be plotted as a data frame
plot_dat <- data.frame(x, f_y1, f_y2)
## We'll build the plot in layers - first add the prior and posterior
p <- ggplot(data = plot_dat) + geom_line(aes(x, f_y1)) + geom_line(aes(x, f_y2), col = "red")
## Now the true value
p <- p + geom_vline(mapping = aes(xintercept = 0.4), col = "green")
## Add text - note we use a different data frame here
p <- p + geom_text(data = data.frame(x = c(0.8, 0.3, theta), y = c(1.2, 3.5, 2)), aes(x, y, label = c("p(theta)", "p(theta | y)", "True Value")), col = c("Black", "Red", "Green") )
## Add title and axis labels
p <- p + ggtitle("Posterior and Prior Comparison") + ylab("Posterior Density") + xlab(expression(theta))
p
We can assess the posterior distribution \(p(\theta|y)\) in other ways. We can construct probability statements such as \(P(\theta > 0.5)\) based on the posterior distribution of \(\theta\) using pbeta:
pbeta(0.5, a_n, b_n, lower.tail = FALSE)
## [1] 0.09462357
1 - pbeta(0.5, a_n, b_n)
## [1] 0.09462357
We can also construct intervals that \(\theta\) will lie in with high probability, e.g., find \(\theta_1, \theta_2\) such that \(P(\theta_1 < \theta < \theta_2) = 0.95\) using qbeta:
qbeta(c(0.025, 0.975), a_n, b_n)
## [1] 0.1810716 0.5696755
Often it can be convenient to estimate quantities of interest from simulated values rather than by evaluating them exactly (Monte Carlo simulation). We can generate posterior samples of \(\theta\) using rbeta. Let’s compare the estimates we obtain from the simulated values to the exact quantities we computed in the previous section:
## Simulate a random sample of 5000 observations from the Beta distribution with parameters a_n, b_n
sim_theta <- rbeta(5000, a_n, b_n)
## estimate expected value of theta
mean(sim_theta)
## [1] 0.3667312
## estimate probabilities using empirical sample
## compare this to pbeta(0.5, a_n, b_n, lower.tail = FALSE)
mean( sim_theta > 0.5)
## [1] 0.097
## we can use quantile to estimate lower and upper bounds for theta
## compare this to qbeta(c(0.025, 0.975), a_n, b_n)
quantile(sim_theta, c(0.025, 0.975))
## 2.5% 97.5%
## 0.1801387 0.5759823
We can also visualise the distribution of the samples using hist or density. The plot of the simulated distribution and the exact Beta density curve should look similar.
## Plot the histogram of the simulated data
hist(sim_theta, freq=FALSE, main = "Histogram of Monte Carlo samples", xlab = expression(theta), ylab = "Posterior density")
lines(density(sim_theta))
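Since the point of the plot is to compare the simulated distribution with the exact Beta density, we can also overlay the dbeta curve directly on the histogram (a one-line addition to the plot above):
## Overlay the exact posterior density for comparison with the kernel estimate
curve(dbeta(x, a_n, b_n), add = TRUE, col = 2, lwd = 2)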
Set the seed to be your student number. Conduct the same analysis as I have done, using sample size \(n=100\) and real value \(\theta = 0.4\) for the binomial data \(y\). (a) Calculate the MLE \(\hat{\theta} = \bar{y}\). Calculate \(a_n, b_n\), and the expected value of \(\theta\) using \(a_n / (a_n + b_n)\). Is the MLE or the expected value close to 0.4? (b) Plot and describe the behaviour of the prior and posterior distributions using base graphics or ggplot2. Add a line to the plot indicating the position of the MLE \(\hat{\theta}\). (c) Comparing the MLE with sample size 100 and the MLE with sample size 20, which is closer to the real value \(\theta = 0.4\)? (d) Calculate the probability \(P(\theta > 0.45)\). Find a 99% lower and upper bound for \(\theta\) using qbeta.
You should write the R code, results, plots and interpretations as a case study report like the example:
set.seed(500910066)
n <- 100
theta <- 0.4
y <- rbinom(n, size = 1, prob = theta)
print(y)
## [1] 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 1 0 0 1 0 0 1 0
## [38] 1 1 1 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 1 0 1
## [75] 0 0 1 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 1 0 1 1
y_hat <- mean(y)
y_hat
## [1] 0.41
a0 <- b0 <- 1
a_n <- sum(y) + a0
b_n <- n - sum(y) + b0
(post_mean <- a_n / (a_n + b_n)) ## posterior expected value of theta
## [1] 0.4117647
The MLE (0.41) and the posterior expected value of \(\theta\) (0.412) are both very close to the real value \(\theta = 0.4\).
x <- seq(0, 1, length.out= 100)
f_y1 <- dbeta(x, shape1 = a0, shape2 = b0)
f_y2 <- dbeta(x, shape1 = a_n, shape2 = b_n)
plot_dat <- data.frame(x, f_y1, f_y2)
p <- ggplot(data = plot_dat) + geom_line(aes(x, f_y1)) + geom_line(aes(x, f_y2), col = "red")
p <- p + geom_vline(mapping = aes(xintercept = 0.4), col = "Green") +
  geom_vline(mapping = aes(xintercept = y_hat), col = "Blue") ## MLE
p <- p + geom_text(data = data.frame(x = c(0.8, 0.3, theta, y_hat), y = c(1.2, 3.5, 2, 1.35)),
                   aes(x, y, label = c("p(theta)", "p(theta | y)", "True Value", "MLE")),
                   col = c("Black", "Red", "Green", "Blue"))
p <- p + ggtitle("Posterior and Prior Comparison") + ylab("Posterior Density") + xlab(expression(theta))
p
The plot shows that the true value and the MLE are almost identical, and the posterior density is concentrated around 0.4, so \(\theta\) is very likely to be close to 0.4.
The MLE with sample size 100 (0.41) is closer to the real value \(\theta = 0.4\) than the MLE with sample size 20 (0.35).
pbeta(0.45, a_n, b_n, lower.tail = FALSE)
## [1] 0.2151954
qbeta(c(0.005, 0.995), a_n, b_n)
## [1] 0.2911872 0.5387836
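Beyond the point estimates, the larger sample also sharpens the posterior. A quick check using the standard Beta standard-deviation formula (beta_sd is a small helper defined here; the \(n = 20\) parameters \(a_n = 8, b_n = 14\) come from the worked example above):
## Posterior standard deviation of Beta(a, b): sqrt(a*b / ((a+b)^2 * (a+b+1)))
beta_sd <- function(a, b) sqrt(a * b / ((a + b)^2 * (a + b + 1)))
beta_sd(8, 14)    ## n = 20 posterior: about 0.100
beta_sd(a_n, b_n) ## n = 100 posterior (a_n = 42, b_n = 60): about 0.048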
Following on from Question 1, and using the \(a_n\) and \(b_n\) from Question 1, conduct the same Monte Carlo simulation (sample size 5000) analysis as I have done for the posterior distribution of \(\theta\). (a) Visualize the distribution of the sample of simulated observations. Compare the histogram/approximate density from the Monte Carlo simulation with the exact density curve you have already plotted using dbeta in Question 1 (b). (b) Calculate the sample mean. Estimate the probability \(P (\theta > 0.45)\) using the empirical sample and compare this to the result using pbeta in Question 1 (d). (c) Find a 99% lower and upper bound for \(\theta\) using the Monte Carlo simulation. Compare the results with those obtained in Question 1 (d).
You should write the R code, results, plots and interpretations as a case study report like the example:
sim_theta <- rbeta(5000, a_n, b_n)
hist(sim_theta, freq=FALSE, main = "Histogram of Monte Carlo Samples",
xlab = expression(theta), ylab = "Posterior Density")
lines(density(sim_theta), col = "red",lwd = 2)
The histogram and kernel density estimate are almost identical to the exact posterior density curve plotted with dbeta in Question 1 (b).
mean(sim_theta)
## [1] 0.4133478
mean( sim_theta > 0.45)
## [1] 0.2214
The Monte Carlo sample mean (0.413) is almost identical to the exact posterior mean (0.412) and close to the true value \(\theta = 0.4\). The empirical estimate of \(P (\theta > 0.45)\) (0.221) is close to the pbeta result from Question 1 (d) (0.215).
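The small gap between the empirical estimate (0.2214) and the exact value (0.2152) is about one Monte Carlo standard error, which for an estimated probability \(\hat{p}\) based on 5000 draws is \(\sqrt{\hat{p}(1-\hat{p})/5000}\). A quick check:
## Monte Carlo standard error of the estimated probability
p_hat <- mean(sim_theta > 0.45)
sqrt(p_hat * (1 - p_hat) / length(sim_theta)) ## about 0.006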
quantile(sim_theta, c(0.005, 0.995))
## 0.5% 99.5%
## 0.2928881 0.5399449
The 99% lower and upper bounds for \(\theta\) from the Monte Carlo simulation are almost identical to those produced by the qbeta function in Question 1 (d).