QUESTION 1 a)
# Sample 1: n = 25
set.seed(123)
x25 <- rnorm(25, mean = 0, sd = 1)
# Sample 2: n = 250
set.seed(123)
x250 <- rnorm(250, mean = 0, sd = 1)
# x values over which to evaluate the true CDF (covers the plotting range)
x_vals <- seq(-4, 4, length.out = 400)
# ECDF for S1
plot(ecdf(x25),
     main = "ECDF vs CDF (n = 25)",
     xlab = "x", ylab = "F(x)", col = "blue", lwd = 2)
lines(x_vals, pnorm(x_vals), col = "red", lwd = 2, lty = 2)
# ECDF for S2
plot(ecdf(x250),
     main = "ECDF vs CDF (n = 250)",
     xlab = "x", ylab = "F(x)", col = "blue", lwd = 2)
lines(x_vals, pnorm(x_vals), col = "red", lwd = 2, lty = 2)
b) The Empirical Cumulative Distribution Function (ECDF) is a sample-based estimate of the Cumulative Distribution Function (CDF). Its formula is
\[
\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),
\]
where the indicator \(I(X_i \le x)\) equals 1 if the observation \(X_i\) is less than or equal to the threshold \(x\), and 0 otherwise.
The result is the proportion of observed values at or below the threshold, i.e. the number of such observations divided by the total number of observations.
The ECDF is non-decreasing: it either stays constant or jumps upward, and in practice its shape mirrors that of the CDF.
However, at small sample sizes the ECDF looks irregular and less smooth than the CDF. As the sample size increases, by the law of large numbers, the ECDF converges to the shape of the CDF.
This is intuitive: the CDF is a smooth, continuous function, and although the ECDF remains a discrete step function, as the sample size grows it begins to take on that behavior.
This makes the ECDF useful as an estimate for the CDF, with increasing reliability as the sample size increases.
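As a quick check of the formula, the ECDF value at any threshold is simply the proportion of sample values at or below it. A minimal sketch, reusing x25 from part a) with an arbitrary threshold of 0.5:
x0 <- 0.5                 # arbitrary threshold
mean(x25 <= x0)           # (1/n) * sum of I(X_i <= x0), computed by hand
ecdf(x25)(x0)             # the same value returned by R's ecdf()
pnorm(x0)                 # the true CDF value being estimated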
For n = 25, the ECDF looks like a step function that roughly follows the CDF, but it deviates frequently, dipping below the CDF in places and exceeding it in others.
This is because the sample size is small: natural variation in a random sample is more pronounced with fewer observations.
The jumps are most noticeable around the population mean, where most observations fall, so a large share of the cumulative proportion has to be accounted for by relatively few threshold points, each producing a visible step.
The CDF, by comparison, is continuous and increases smoothly across the same range.
For n = 250, however, the ECDF is almost overlaid on the CDF and appears much smoother. This shows that, as the sample size increases, the ECDF becomes a more accurate estimate of the CDF, illustrating the effect of the law of large numbers.
Natural variation is mitigated by the larger sample size, but some deviation remains, especially near the mean. Because the ECDF is still discrete and accumulates proportion step by step, the many small steps are still visible even as the curve closes in on the CDF.
Around the mean the number of observations is highest, so the proportion accrued at each step is larger, which shows up as small wiggles, and occasional overshoot, around the mean.
The slight undershoot in the lower tail can be explained by the scarcity of observations at those thresholds, since the proportion of observations decreases as we move away from the mean.
The stabilization towards the end of the curve indicates that most of the cumulative proportion has already been accrued, so the few remaining observations have little influence on the deviation of the ECDF from the CDF.
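To make the visual comparison concrete, the largest vertical gap between each ECDF and the true CDF can be computed over the plotting grid (a rough Kolmogorov-Smirnov-style distance). A minimal sketch, reusing x25, x250 and x_vals from part a), which should show the gap shrinking for the larger sample:
# Maximum vertical distance between each ECDF and the true CDF over the grid
max(abs(ecdf(x25)(x_vals) - pnorm(x_vals)))    # typically larger for n = 25
max(abs(ecdf(x250)(x_vals) - pnorm(x_vals)))   # typically smaller for n = 250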
QUESTION 2
set.seed(123)
lambda <- 4
samples <- c(5, 30, 100)
rep_samp <- 1000
for (n in samples) {
  # Each iteration generates 1000 sample means from Poisson(lambda) for this sample size
  means <- replicate(rep_samp, mean(rpois(n, lambda)))
  # Histogram of the sample means
  hist(means, probability = TRUE,
       main = paste("Sample size n =", n),
       xlab = "Sample mean", col = "lightblue", border = "white")
  # Overlay the Normal approximation N(lambda, lambda / n)
  curve(dnorm(x, mean = lambda, sd = sqrt(lambda / n)),
        col = "red", lwd = 2, add = TRUE)
}
The Central Limit Theorem states that as the sample size increases, the distribution of sample means approaches a Normal distribution, regardless of the population’s shape.
In this simulation, each histogram shows the sampling distribution of means from Poisson(lambda = 4) samples of size n = 5, 30, and 100. For n = 5, the distribution is still skewed and its center is slightly below 4.
As n increases, the sample means become more tightly clustered around the true mean and the shape becomes increasingly symmetric.
By n = 100, the histogram closely follows the overlaid Normal curve N(4,4/n).
This confirms the Central Limit Theorem: even though the Poisson distribution is discrete and right-skewed, the distribution of its sample means becomes approximately Normal as n grows.
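The spread of the simulated means can also be checked numerically against the standard error sqrt(lambda / n) implied by the CLT. A minimal sketch, re-running the simulation with the same lambda, samples and rep_samp as above:
set.seed(123)
for (n in samples) {
  means <- replicate(rep_samp, mean(rpois(n, lambda)))
  cat("n =", n,
      "| sd of sample means =", round(sd(means), 3),
      "| sqrt(lambda / n) =", round(sqrt(lambda / n), 3), "\n")
}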
QUESTION 3 a)
For a Binomial(\(n, p\)) random variable: \[E[X] = np, \quad Var(X) = np(1 - p)\]
Setting the sample mean and variance equal to their theoretical counterparts gives: \[\bar{X} = np, \quad s^2 = np(1 - p)\]
Solving for \(p\) and \(n\) gives:
\[p = 1 - \frac{s^2}{\bar{X}}, \quad n = \frac{\bar{X}}{p}.\]
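The key step is that dividing the variance equation by the mean equation cancels \(n\):
\[
\frac{s^2}{\bar{X}} = \frac{np(1 - p)}{np} = 1 - p,
\]
which rearranges to the estimators above.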
# Generate 100 observations from a Binomial(size = 15, prob = 0.2) distribution
x <- rbinom(100, 15, 0.2)
# Sample mean and variance
sampmean <- mean(x)
sampvar <- var(x)
# Method-of-moments estimates of p and n
ph <- 1 - (sampvar / sampmean)
nh <- sampmean / ph
cat("Sample mean =", sampmean, "\n")
## Sample mean = 3.06
cat("Sample variance =", sampvar, "\n")
## Sample variance = 3.147879
cat("Estimated p =", ph, "\n")
## Estimated p = -0.02871856
cat("Estimated n =", nh, "\n")
## Estimated n = -106.5513
After repeating the simulation roughly 30 times, the method of moments (MOM) produced a wide range of estimates for n and p.
Most estimates were close to the true values, with p typically between 0.15 and 0.3 and n between 10 and 20. Across the 30 trials, however, estimates of p ranged from about -0.02 to 0.37, and estimates of n from about -130 to 42.
While the majority were reasonably representative of the true parameters, a noticeable number were far off or negative, which is completely implausible.
The MOM depends on the sample mean and variance: if the sample variance exceeds the sample mean, the estimate of p becomes negative, which in turn makes the estimate of n negative, and both are then meaningless as estimates.
Overall, the results show that the method of moments, while simple to compute, is unreliable for this model and is prone to producing inaccurate or nonsensical estimates.
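To put a rough number on how often this failure occurs, the same experiment can be repeated many times and the proportion of implausible (non-positive) estimates of p tallied. A minimal sketch using the same settings as above (100 observations from Binomial(15, 0.2)):
set.seed(123)
phat <- replicate(1000, {
  xs <- rbinom(100, 15, 0.2)   # a fresh sample each run
  1 - var(xs) / mean(xs)       # method-of-moments estimate of p
})
mean(phat <= 0)                # proportion of runs giving a nonsensical estimate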
# Load the PlantGrowth data
data("PlantGrowth")
x <- PlantGrowth$weight
# Define the negative log-likelihood for a Normal(mu, sigma2) model
neglog <- function(mu, sigma2) {
  -sum(dnorm(x, mean = mu, sd = sqrt(sigma2), log = TRUE))
}
library(stats4)
# Find MLEs (starting guesses for mu and sigma2)
mlem <- mle(neglog, start = list(mu = 5, sigma2 = 0.5))
## Warning in sqrt(sigma2): NaNs produced
## Warning in sqrt(sigma2): NaNs produced
# Display the results
summary(mlem)
## Maximum likelihood estimation
##
## Call:
## mle(minuslogl = neglog, start = list(mu = 5, sigma2 = 0.5))
##
## Coefficients:
## Estimate Std. Error
## mu 5.0729848 0.1258666
## sigma2 0.4752721 0.1227108
##
## -2 log L: 62.82084
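The NaN warnings above arise because the optimizer occasionally tries negative values of sigma2, making sqrt(sigma2) undefined. They can be avoided, if desired, by constraining sigma2 to stay positive; a minimal sketch using box-constrained optimization (the estimates themselves are not required to change):
# Same fit, but bound sigma2 away from zero so sqrt() never sees a negative value
mlem2 <- mle(neglog, start = list(mu = 5, sigma2 = 0.5),
             method = "L-BFGS-B", lower = c(mu = -Inf, sigma2 = 1e-6))
coef(mlem2)   # should closely match the estimates above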
# Compute the theoretical (closed-form) MLEs from R&T
the_mu <- mean(x)
the_var <- mean((x - the_mu)^2)
cat("R Code MLE for MU: ", coef(mlem)["mu"],
"|| Theoretical Mu: ",the_mu, "\n")
## R Code MLE for MU: 5.072985 || Theoretical Mu: 5.073
cat("R Code MLE for Variance: ", coef(mlem)["sigma2"],
"|| Theoretical Variance: ",the_var, "\n")
## R Code MLE for Variance: 0.4752721 || Theoretical Variance: 0.475281
Computing the theoretical MLEs derived in R&T (Example 5.2.10, part c) with our data yields almost exactly the same results as the mle() function applied to our negative log-likelihood.
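As a final check, note that the MLE of the variance divides by n rather than n - 1, so it can also be recovered from R's var(), which uses the n - 1 denominator:
n <- length(x)
var(x) * (n - 1) / n   # equals the_var, the closed-form MLE of sigma^2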