X Wang - Assignment 2

PSCI 7108: Advanced Data Analysis III

1: The Multinomial Distribution

Candidates A, B, and C received 50%, 25%, and 25% of the vote in an election, respectively.

a: If we draw 10 voters at random, what is the probability that there will be exactly six supporters of candidate A, two supporters of candidate B, and two supporters of candidate C? Compute the exact probability analytically.

The probability mass function for a multinomial distribution is: \( p(y|n,\pi) = {\large{\frac{n!}{y_1!y_2!...y_k!}}}\pi_1^{y_1}\pi_2^{y_2}...\pi_k^{y_k} \), where \( n \) is the number of independent trials, \( k \) is the number of possible outcomes, \( y_i \) is the number of times outcome \( i \) is observed over \( n \) trials, and \( \pi_i \) is the probability of each outcome.

n <- 10
pi_A <- 0.5
pi_B <- 0.25
pi_C <- 0.25
y_A <- 6
y_B <- 2
y_C <- 2
Pr1 <- (factorial(n)/(factorial(y_A) * factorial(y_B) * factorial(y_C))) * (pi_A^(y_A) * 
    pi_B^(y_B) * pi_C^(y_C))
Pr1  # 0.0769043
## [1] 0.0769

b: Compute the probability that there will be exactly six supporters of candidate A, two supporters of candidate B, and two supporters of candidate C using simulation in R.

# Set variables and counter
sample.obs <- 10  # Number of observations for each simulation
sims <- 1e+05  # Number of simulations
candidates <- letters[seq(3)]  # Create sample space
pi_cands <- c(pi_A, pi_B, pi_C)  # Candidates' probabilities
desired <- c(6, 2, 2)  # Desired outcome
match <- 0  # Counter

# Set for loop
set.seed(39587)
for (i in 1:sims) {
    sample_results <- sample(candidates, sample.obs, replace = T, prob = pi_cands)
    if (length(which(sample_results == candidates[1])) == desired[1] & length(which(sample_results == 
        candidates[2])) == desired[2] & length(which(sample_results == candidates[3])) == 
        desired[3]) {
        # Evaluates whether the simulation matched with the desired outome; which()
        # returns the vector index where the logic statement is true
        match <- match + 1
    }
}
Pr2 <- match/sims  # Probability using 100,000 simulations
Pr2  # 0.07545
## [1] 0.07545

# Just for kicks, increasing simulations to 1M and coding a cleaner for loop
sims2 <- 1e+06
match <- 0  # Reset counter
sample_outcome <- rep(NA, 3)  # Vector to hold sample outcome counts, Y_i

set.seed(10987)
for (i in 1:sims2) {
    sample_results <- sample(candidates, sample.obs, replace = T, prob = pi_cands)
    sample_outcome <- c(sum(sample_results == candidates[1]), sum(sample_results == 
        candidates[2]), sum(sample_results == candidates[3]))  # Counts results, evaluates against desired outcome for each candidate, returns 1 for each match
    if (sum(sample_outcome == desired) == 3) {
        # All 3 needs to match
        match <- match + 1
    }
}
Pr3 <- match/sims2  # Probability using 1,000,000 simulations 
Pr3  # 0.07647, much closer to Pr1
## [1] 0.07647

2: The Multivariate Normal Distribution

a: Draw 1,000 random numbers from a multivariate normal distribution with two variables (i.e., bivariate normal). Set the variance of each variable to 1. Choose the covariance (correlation). Report the mean, variances, and correlation of the sample that drawn.

# Install and load libraries
install.packages("MASS")
## Error: trying to use CRAN without setting a mirror
library("MASS")

# Define the bivariate normal samples
sigma_bi <- matrix(c(1, 0.25, 0.25, 1), nrow = 2, ncol = 2)  # Defining covariance matrix sigma for k=2: variance of each variable is 1 and covariance is 0.25
set.seed(23410)
bivn <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = sigma_bi)  # 1000 samples, mean at (0,0) 

# Create a matrix to save the results from parts (b) and (c)
compare <- matrix(NA, nrow = 4, ncol = 3)
colnames(compare) <- c("Mean", "Variance", "Correlation")
rownames(compare) <- c("Bivariate Var 1", "Bivariate Var 2", "Univariate Var 1", 
    "Univariate Var 2")

# Find the mean of the two variables (by column)
compare[1, 1] <- apply(bivn, 2, mean)[1]
compare[2, 1] <- apply(bivn, 2, mean)[2]
# Find the variance
compare[1, 2] <- apply(bivn, 2, var)[1]
compare[2, 2] <- apply(bivn, 2, var)[2]
# Find correlation between the variables
compare[1, 3] <- cor(bivn)[1, 2]
compare[2, 3] <- cor(bivn)[2, 1]
compare
##                       Mean Variance Correlation
## Bivariate Var 1  -0.027328   0.9964      0.2658
## Bivariate Var 2   0.008198   0.9868      0.2658
## Univariate Var 1        NA       NA          NA
## Univariate Var 2        NA       NA          NA

Now draw two separate univariate random normal variables of 1,000 observations each using rnorm() with the same means and variance used in part (a). Report the mean, variances, and correlation of the sample.

# Define the univariate normal samples
set.seed(48571)
n1 <- rnorm(1000, mean = 0, sd = 1)  # Since sd^2 is variance
set.seed(12009)
n2 <- rnorm(1000, mean = 0, sd = 1)

# Saves the results to compare
compare[3, 1] <- mean(n1)
compare[3, 2] <- var(n1)
compare[4, 1] <- mean(n2)
compare[4, 2] <- var(n2)
compare[3:4, 3] <- cor(n1, n2)  # Correlation is same for both variables
compare
##                       Mean Variance Correlation
## Bivariate Var 1  -0.027328   0.9964    0.265849
## Bivariate Var 2   0.008198   0.9868    0.265849
## Univariate Var 1 -0.027784   1.0327   -0.009359
## Univariate Var 2 -0.023249   0.9762   -0.009359

c: Are the sample means, variances, and correlations that you computed in parts (a) and (b) similar to one another? Why or why not?

See matrix above. The means and variances are very similar for samples produced in parts (a) and (b)–around 0 and 1, respectively–because they are defined identically in the mvrnorm() and rnorm() functions. The correlations are different because \( \Sigma \) defines the covariance between the variables in multivariate normal distributions–in this case, it is 0.25 between the two variables. The correlation between the two variables in the univariate samples are near 0–as expected–because they are generated independent of each other and no covariance can be defined.

d: Create a 3D density plot of the bivariate normal distribution from part (a). Explain what information the figure conveys.

bivn.kde <- kde2d(bivn[, 1], bivn[, 2], n = 100)  # Two-dimensional kernal density estimation 
persp(bivn.kde, phi = 30, theta = 30, xlab = "Dimension 1", ylab = "Dimension 2", 
    zlab = "Density", cex.lab = 1.5)

plot of chunk unnamed-chunk-5

The density plot shows that there is some covariance (0.25) between the two variables, though it is not high. For example, if we pick a high value along dimension 1 and look down the plane along dimension 2, it is slightly more dense for higher values of dimension 2, though there is still a good amount of uncertainty (since there is density for lower values of dimension 2 as well).