Assignment 2

Question 1

The question is asking to generate 1000 samples from the distribution of \(X\). However, each time \(X\) takes on different n or p values. \(X\)~\(Binom(n,p)\) would take on different n and p values each time and a different graph will be plotted for every condition. The values for \(n\) and \(p\) are as follow:

\(p = 0.2, 0.3, 0.4, 0.5\)

\(n = 10, 20, 100, 200, 1,000, 2,000, 10,000, 20,000, 100,000, 200,000\)

The code chunk below computes normal distribution for all the 40 possibilities and plots a Q-Q graph for each. Also the mean and variance of the binomial distributions were calculated as follow: \(mean = n * p\) and \(variance = n * p * (1-p)\).

bi.p <- as.array(c(0.2,0.3,0.4,0.5))
bi.n <- as.array(c(10,20,100,200,1000,2000,10000,20000,100000,200000))
bi.mean <- vector("list")
bi.var <- vector("list")
bi.counter = 1
par(mfrow = c(3,3))
for (j in bi.n) {
  for (i in bi.p) {
    RandomData <- rbinom(1000, j, i)
    qqnorm(RandomData, main = bquote("Q-Q Plot of Binom(" ~ .(as.character(j)) ~ "," ~ .(as.character(i)) ~ ")"))
    qqline(RandomData, col = "red")
    bi.mean[bi.counter] <- j * i #mean = n * p
    bi.var[bi.counter] <- j * i * (1-i) #variance = n * p * q
    bi.counter = bi.counter + 1
  }
}

To interpret a Q-Q plot we must observe the cluster of points on the graph as close as possible to the diagonal line (Q-Q line) on the graph. We have 1000 samples from the distribution of \(X\) for each graph. As the number of observations (\(n\)) increases, we observe more points around the diagonal line. The close proximity of the observation to the Q-Q line is a measurement of how well the sample was approximated by the normal distribution. It intuitively makes sense to have a better approximation if the number of observations is bigger. This would basically give us more chances of predicting a closer observation (depending on the probability of the observation).

bi.mean = matrix(bi.mean, ncol = 10, byrow = TRUE)
colnames(bi.mean) = c("n=10","n=20","n=100","n=200","n=1000","n=2000","n=10000","n=20000","n=100000","n=200000")
rownames(bi.mean) = c("p=0.2","p=0.3","p=0.4","p=0.5")
bi.var = matrix(bi.var, ncol = 10, byrow = FALSE)
colnames(bi.var) = c("n=10","n=20","n=100","n=200","n=1000","n=2000","n=10000","n=20000","n=100000","n=200000")
rownames(bi.var) = c("p=0.2","p=0.3","p=0.4","p=0.5")

Mean

	n=10	n=20	n=100	n=200	n=1000	n=2000	n=10000	n=20000	n=100000	n=200000
p=0.2	2	3	4	5	4	6	8	10	20	30
p=0.3	40	50	40	60	80	100	200	300	400	500
p=0.4	400	600	800	1000	2000	3000	4000	5000	4000	6000
p=0.5	8000	10000	20000	30000	40000	50000	40000	60000	80000	1e+05

Variance

	n=10	n=20	n=100	n=200	n=1000	n=2000	n=10000	n=20000	n=100000	n=200000
p=0.2	1.6	3.2	16	32	160	320	1600	3200	16000	32000
p=0.3	2.1	4.2	21	42	210	420	2100	4200	21000	42000
p=0.4	2.4	4.8	24	48	240	480	2400	4800	24000	48000
p=0.5	2.5	5	25	50	250	500	2500	5000	25000	50000

The mean and variance above was computed to give a better undestading of the data.

Question 2

Just like for question 1 we must iterate through different values. Poisson distribution was calculated for different values of \(n\) and \(λ\) The values are as follow:

\(λ = 10, 15, 20\)

\(n=10, 100, 1,000, 10,000, 100,000\)

The code chunk below iterates through possibilities of \(λ\) and \(n\) to generate \(Y\)~\(Poisson(λ)\) distribution. The specific command to generate the plots was commented out, however, a supplementary pdf file was attached to the assignment including all the plotted graphs.

The unbiased estimate of Poisson distribution is the mean of the distribution. Within the same loop, the 95% confidence interval of unbiased estimate was computed. It’s noteworthy to mention, in order to compute the 95% CI, this formula was used: \(95\% \ CI = λ - 1.96 * \sqrt{λ/n}\) and \(95\% \ CI = λ + 1.96 * \sqrt{λ/n}\). Unlike the exactpoissonCI() function, this formula would take into account the different n values. Otherwise, the exactpoissonCI() function would give the same result for all the distributions that have same \(λ\) values but different \(n\).

library(exactci)
lambda <- c(10,15,20)
poiss.n <- c(10,100,1000,10000,100000)
poiss.data <- vector("list", length(poiss.n))
CI <- vector("list")
poiss.estimate <- vector("list")
poiss.counter = 1
par(mfrow = c(2,2))
for(x in 1:length(poiss.n)){
  for(y in lambda){
    poiss.data <- rpois(n = poiss.n[x], lambda = y)
    poiss.estimate[poiss.counter] <- mean(poiss.data)
    poiss.data <- table(poiss.data)
    #plot(poiss.data, main = bquote("Poisson(" ~ .(as.character(poiss.n[x])) ~ "," ~ .(as.character(y)) ~ ")"), xlab = "X", ylab = "Density")
    CI[[poiss.counter]] <- toString(formatC(c(y-1.96*sqrt(y/poiss.n[x]),y+1.96*sqrt(y/poiss.n[x]))), digits = 2, format = "f")
    poiss.counter = poiss.counter + 1
  }
}

This code chunk below organizes all the confidence intervals and unbiased estimates into two tables according to the appropriate \(n\) and \(λ\) values.

CInterval <- matrix(c(CI[1], CI[2], CI[3], CI[4], CI[5], CI[6], CI[7], CI[8], CI[9], CI[10], CI[11], CI[12],CI[13], CI[14], CI[15]), ncol = 3, byrow = TRUE)
colnames(CInterval) <- c("λ=10","λ=15","λ=20")
rownames(CInterval) <- c("n=10","n=100","n=1000","n=10000","n=100000")
poiss.estimate <- matrix(poiss.estimate, ncol = 3, byrow = TRUE)
colnames(poiss.estimate) <- c("λ=10","λ=15","λ=20")
rownames(poiss.estimate) <- c("n=10","n=100","n=1000","n=10000","n=100000")

Unbiased Estimator

	λ=10	λ=15	λ=20
n=10	10.5	14.9	21.5
n=100	9.64	15.27	19.66
n=1000	10.093	14.811	20.065
n=10000	10.0302	15.0396	20.0123
n=100000	10.01345	14.99205	20.0137

The table above shows that the more we increase the \(n\), the closer the estimator gets to \(λ\) value.

95% Confidence Interval

	λ=10	λ=15	λ=20
n=10	8.04, 11.96	12.6, 17.4	17.23, 22.77
n=100	9.38, 10.62	14.24, 15.76	19.12, 20.88
n=1000	9.804, 10.2	14.76, 15.24	19.72, 20.28
n=10000	9.938, 10.06	14.92, 15.08	19.91, 20.09
n=100000	9.98, 10.02	14.98, 15.02	19.97, 20.03

It can be inferred from the table above that for a specific \(λ\) value, as the \(n\) increases the 95% confidence interval gets narrower. Mathematically speaking, increase in n would make \(\sqrt{λ/n}\) smaller, thus the narrower smaller interval. Intuitively, increasing the number of values to be predicted (\(n\)) would ensure there will be more chances of getting a number closer to \(λ\) every time, thus narrowing the interval.