The question is asking to generate 1000 samples from the distribution of \(X\). However, each time \(X\) takes on different n or p values. \(X\)~\(Binom(n,p)\) would take on different n and p values each time and a different graph will be plotted for every condition. The values for \(n\) and \(p\) are as follow:
\(p = 0.2, 0.3, 0.4, 0.5\)
\(n = 10, 20, 100, 200, 1,000, 2,000, 10,000, 20,000, 100,000, 200,000\)
The code chunk below computes normal distribution for all the 40 possibilities and plots a Q-Q graph for each. Also the mean and variance of the binomial distributions were calculated as follow: \(mean = n * p\) and \(variance = n * p * (1-p)\).
bi.p <- as.array(c(0.2,0.3,0.4,0.5))
bi.n <- as.array(c(10,20,100,200,1000,2000,10000,20000,100000,200000))
bi.mean <- vector("list")
bi.var <- vector("list")
bi.counter = 1
par(mfrow = c(3,3))
for (j in bi.n) {
for (i in bi.p) {
RandomData <- rbinom(1000, j, i)
qqnorm(RandomData, main = bquote("Q-Q Plot of Binom(" ~ .(as.character(j)) ~ "," ~ .(as.character(i)) ~ ")"))
qqline(RandomData, col = "red")
bi.mean[bi.counter] <- j * i #mean = n * p
bi.var[bi.counter] <- j * i * (1-i) #variance = n * p * q
bi.counter = bi.counter + 1
}
}
To interpret a Q-Q plot we must observe the cluster of points on the graph as close as possible to the diagonal line (Q-Q line) on the graph. We have 1000 samples from the distribution of \(X\) for each graph. As the number of observations (\(n\)) increases, we observe more points around the diagonal line. The close proximity of the observation to the Q-Q line is a measurement of how well the sample was approximated by the normal distribution. It intuitively makes sense to have a better approximation if the number of observations is bigger. This would basically give us more chances of predicting a closer observation (depending on the probability of the observation).
bi.mean = matrix(bi.mean, ncol = 10, byrow = TRUE)
colnames(bi.mean) = c("n=10","n=20","n=100","n=200","n=1000","n=2000","n=10000","n=20000","n=100000","n=200000")
rownames(bi.mean) = c("p=0.2","p=0.3","p=0.4","p=0.5")
bi.var = matrix(bi.var, ncol = 10, byrow = FALSE)
colnames(bi.var) = c("n=10","n=20","n=100","n=200","n=1000","n=2000","n=10000","n=20000","n=100000","n=200000")
rownames(bi.var) = c("p=0.2","p=0.3","p=0.4","p=0.5")
| n=10 | n=20 | n=100 | n=200 | n=1000 | n=2000 | n=10000 | n=20000 | n=100000 | n=200000 | |
|---|---|---|---|---|---|---|---|---|---|---|
| p=0.2 | 2 | 3 | 4 | 5 | 4 | 6 | 8 | 10 | 20 | 30 |
| p=0.3 | 40 | 50 | 40 | 60 | 80 | 100 | 200 | 300 | 400 | 500 |
| p=0.4 | 400 | 600 | 800 | 1000 | 2000 | 3000 | 4000 | 5000 | 4000 | 6000 |
| p=0.5 | 8000 | 10000 | 20000 | 30000 | 40000 | 50000 | 40000 | 60000 | 80000 | 1e+05 |
| n=10 | n=20 | n=100 | n=200 | n=1000 | n=2000 | n=10000 | n=20000 | n=100000 | n=200000 | |
|---|---|---|---|---|---|---|---|---|---|---|
| p=0.2 | 1.6 | 3.2 | 16 | 32 | 160 | 320 | 1600 | 3200 | 16000 | 32000 |
| p=0.3 | 2.1 | 4.2 | 21 | 42 | 210 | 420 | 2100 | 4200 | 21000 | 42000 |
| p=0.4 | 2.4 | 4.8 | 24 | 48 | 240 | 480 | 2400 | 4800 | 24000 | 48000 |
| p=0.5 | 2.5 | 5 | 25 | 50 | 250 | 500 | 2500 | 5000 | 25000 | 50000 |
The mean and variance above was computed to give a better undestading of the data.
Just like for question 1 we must iterate through different values. Poisson distribution was calculated for different values of \(n\) and \(λ\) The values are as follow:
\(λ = 10, 15, 20\)
\(n=10, 100, 1,000, 10,000, 100,000\)
The code chunk below iterates through possibilities of \(λ\) and \(n\) to generate \(Y\)~\(Poisson(λ)\) distribution. The specific command to generate the plots was commented out, however, a supplementary pdf file was attached to the assignment including all the plotted graphs.
The unbiased estimate of Poisson distribution is the mean of the distribution. Within the same loop, the 95% confidence interval of unbiased estimate was computed. It’s noteworthy to mention, in order to compute the 95% CI, this formula was used: \(95\% \ CI = λ - 1.96 * \sqrt{λ/n}\) and \(95\% \ CI = λ + 1.96 * \sqrt{λ/n}\). Unlike the exactpoissonCI() function, this formula would take into account the different n values. Otherwise, the exactpoissonCI() function would give the same result for all the distributions that have same \(λ\) values but different \(n\).
library(exactci)
lambda <- c(10,15,20)
poiss.n <- c(10,100,1000,10000,100000)
poiss.data <- vector("list", length(poiss.n))
CI <- vector("list")
poiss.estimate <- vector("list")
poiss.counter = 1
par(mfrow = c(2,2))
for(x in 1:length(poiss.n)){
for(y in lambda){
poiss.data <- rpois(n = poiss.n[x], lambda = y)
poiss.estimate[poiss.counter] <- mean(poiss.data)
poiss.data <- table(poiss.data)
#plot(poiss.data, main = bquote("Poisson(" ~ .(as.character(poiss.n[x])) ~ "," ~ .(as.character(y)) ~ ")"), xlab = "X", ylab = "Density")
CI[[poiss.counter]] <- toString(formatC(c(y-1.96*sqrt(y/poiss.n[x]),y+1.96*sqrt(y/poiss.n[x]))), digits = 2, format = "f")
poiss.counter = poiss.counter + 1
}
}
This code chunk below organizes all the confidence intervals and unbiased estimates into two tables according to the appropriate \(n\) and \(λ\) values.
CInterval <- matrix(c(CI[1], CI[2], CI[3], CI[4], CI[5], CI[6], CI[7], CI[8], CI[9], CI[10], CI[11], CI[12],CI[13], CI[14], CI[15]), ncol = 3, byrow = TRUE)
colnames(CInterval) <- c("λ=10","λ=15","λ=20")
rownames(CInterval) <- c("n=10","n=100","n=1000","n=10000","n=100000")
poiss.estimate <- matrix(poiss.estimate, ncol = 3, byrow = TRUE)
colnames(poiss.estimate) <- c("λ=10","λ=15","λ=20")
rownames(poiss.estimate) <- c("n=10","n=100","n=1000","n=10000","n=100000")
| λ=10 | λ=15 | λ=20 | |
|---|---|---|---|
| n=10 | 10.5 | 14.9 | 21.5 |
| n=100 | 9.64 | 15.27 | 19.66 |
| n=1000 | 10.093 | 14.811 | 20.065 |
| n=10000 | 10.0302 | 15.0396 | 20.0123 |
| n=100000 | 10.01345 | 14.99205 | 20.0137 |
The table above shows that the more we increase the \(n\), the closer the estimator gets to \(λ\) value.
| λ=10 | λ=15 | λ=20 | |
|---|---|---|---|
| n=10 | 8.04, 11.96 | 12.6, 17.4 | 17.23, 22.77 |
| n=100 | 9.38, 10.62 | 14.24, 15.76 | 19.12, 20.88 |
| n=1000 | 9.804, 10.2 | 14.76, 15.24 | 19.72, 20.28 |
| n=10000 | 9.938, 10.06 | 14.92, 15.08 | 19.91, 20.09 |
| n=100000 | 9.98, 10.02 | 14.98, 15.02 | 19.97, 20.03 |
It can be inferred from the table above that for a specific \(λ\) value, as the \(n\) increases the 95% confidence interval gets narrower. Mathematically speaking, increase in n would make \(\sqrt{λ/n}\) smaller, thus the narrower smaller interval. Intuitively, increasing the number of values to be predicted (\(n\)) would ensure there will be more chances of getting a number closer to \(λ\) every time, thus narrowing the interval.