We will have you run two sets of simulations, similar to the above but with N = 500 and N = 5,000. Prepare graphs illustrating the relative frequency distributions for the results of each simulation. Put the graphs on one page and hand this in; the command par(mfrow = c(2,1)) helps do this in R.
# Set the random seed for reproducibility
set.seed(123)
# N = 500 simulation
Die1_500 <- sample(1:6, 500, replace = TRUE)
Die2_500 <- sample(1:6, 500, replace = TRUE)
Die3_500 <- sample(1:6, 500, replace = TRUE)
DiceSum_500 <- Die1_500 + Die2_500 + Die3_500
# N = 5000 simulation
Die1_5000 <- sample(1:6, 5000, replace = TRUE)
Die2_5000 <- sample(1:6, 5000, replace = TRUE)
Die3_5000 <- sample(1:6, 5000, replace = TRUE)
DiceSum_5000 <- Die1_5000 + Die2_5000 + Die3_5000
# Put both graphs on one page (as the question suggests) and adjust the margins
par(mfrow = c(2, 1), mar = c(4, 4, 2, 2))
# Plotting the N = 500 result
barplot(table(DiceSum_500)/500, main="N = 500", xlab="Dice Sum", ylab="Relative Frequency", col="skyblue")
# Plotting the N = 5000 result
barplot(table(DiceSum_5000)/5000, main="N = 5000", xlab="Dice Sum", ylab="Relative Frequency", col="lightcoral")
# Reset the plotting layout
par(mfrow = c(1, 1))
Both graphs are bell shaped; as the sample size increases from N = 500 to N = 5,000, the relative frequencies become smoother and more tightly concentrated around the middle values of the sum.
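For reference (not required by the question), the exact distribution of the sum of three fair dice can be obtained by enumerating all 216 equally likely outcomes, which makes explicit the bell shape that the simulations approximate:
# Exact probabilities for the sum of three fair dice
all_sums <- rowSums(expand.grid(1:6, 1:6, 1:6))
round(table(all_sums) / 216, 4)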
What is the mean (expected value) and variance of the theoretical probability distribution of X, the sum of three (randomly tossed) fair dice? Show how you obtained these values.
# Theoretical Mean = 3 * (1+2+3+4+5+6)/6
mean_theoretical <- 3 * sum(1:6)/6
# Variance = 3 * variance of a single die
# Variance of single die = E(X^2) - E(X)^2
var_theoretical <- 3 * ((sum((1:6)^2)/6) - (sum(1:6)/6)^2)
mean_theoretical
## [1] 10.5
var_theoretical
## [1] 8.75
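In symbols: a single fair die D has E(D) = (1 + 2 + ... + 6)/6 = 3.5 and Var(D) = E(D^2) − [E(D)]^2 = 91/6 − 3.5^2 = 35/12 ≈ 2.9167. Because the three dice are independent, the sum X has mean 3 × 3.5 = 10.5 and variance 3 × 35/12 = 8.75, matching the values computed above.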
Obtain the sample mean and sample variance for the data from each of your simulations. How do these compare to the theoretical values?
# N = 500
mean_500 <- mean(DiceSum_500)
var_500 <- var(DiceSum_500)
# N = 5000
mean_5000 <- mean(DiceSum_5000)
var_5000 <- var(DiceSum_5000)
mean_500
## [1] 10.468
var_500
## [1] 8.790557
mean_5000
## [1] 10.4374
var_5000
## [1] 8.5958
Both simulations give sample means and variances close to the theoretical values of 10.5 and 8.75; in this run every estimate is within about 0.16 of its theoretical value.
What outcomes result in a sum of 12? Obtain the individual counts for each outcome that results in a sum of 12 and show how these result in the relative frequency (= empirical probability) for the simulation with N = 5000.
count_500 <- length(DiceSum_500[DiceSum_500 == 12])
count_5000 <- length(DiceSum_5000[DiceSum_5000 == 12])
relative_freq_500 <- count_500 / 500
relative_freq_5000 <- count_5000 / 5000
count_500
## [1] 55
relative_freq_500
## [1] 0.11
count_5000
## [1] 573
relative_freq_5000
## [1] 0.1146
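The count of 573 above is an aggregate over every dice combination that sums to 12 (1-5-6, 2-4-6, 2-5-5, 3-3-6, 3-4-5, and 4-4-4; the exact probability is 25/216 ≈ 0.116, close to the relative frequency of 0.1146). A minimal sketch, reusing the N = 5000 simulation objects, of how the individual counts can be tabulated so that they add back up to the relative frequency:
# Break the N = 5000 count down by individual dice combination
is12 <- DiceSum_5000 == 12
rolls12 <- cbind(Die1_5000, Die2_5000, Die3_5000)[is12, ]
combo_counts <- table(apply(rolls12, 1, function(x) paste(sort(x), collapse = "-")))
combo_counts                 # count for each combination that gives a sum of 12
sum(combo_counts) / 5000     # the counts add back up to the relative frequency above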
Obtain a barplot of the empirical (observed) frequencies for litters of size n = 10.
males <- c(0,1,2,3,4,5,6,7,8,9,10)
frequencies <- c(0,3,15,72,101,75,83,33,18,1,0)
barplot(frequencies, names.arg = males, main = "Empirical Frequencies for Litters of Size 10",
xlab = "Number of Males", ylab = "Frequency", col = "skyblue")
Obtain a barplot of the expected frequencies if we use a binomial distribution with n = 10 and p estimated from the data in litters of size 10 only. To estimate p, divide the number of male pigs by the number of pigs in litters of size 10. How do the two graphs in B1 and B2 compare? Put the graphs on one page and hand this in; the command par(mfrow = c(2,1)) helps do this in R.
total_males_10 <- sum(males * frequencies)
total_litters_10 <- sum(frequencies)
p_estimated <- total_males_10 / (total_litters_10 * 10)
expected_frequencies <- dbinom(males, 10, p_estimated) * total_litters_10
barplot(expected_frequencies, names.arg = males, main = "Expected Frequencies for Litters of Size 10 (Binomial)",
xlab = "Number of Males", ylab = "Expected Frequency", col = "lightcoral")
Report the value you used as your estimate of p. What is the mean and variance for a binomial distribution with this value of p and n = 10? What are the values of the sample mean and sample variance for the number of male offspring in the litters of size n = 10?
# Data
males <- 0:10
frequencies <- c(0, 3, 15, 79, 101, 75, 33, 18, 13, 1, 0)
# Estimate of p
total_litters <- sum(frequencies)
total_males <- sum(males * frequencies)
p_estimate <- total_males / (10 * total_litters)
# Mean and variance for binomial distribution with p and n=10
binom_mean <- 10 * p_estimate
binom_variance <- 10 * p_estimate * (1 - p_estimate)
# Sample mean and variance for number of male offspring
sample_mean <- total_males / total_litters
sample_variance <- sum(frequencies * (males - sample_mean)^2) / total_litters
cat("Estimated p:", p_estimate, "\n")
## Estimated p: 0.439645
cat("Binomial Mean:", binom_mean, "\n")
## Binomial Mean: 4.39645
cat("Binomial Variance:", binom_variance, "\n")
## Binomial Variance: 2.463573
cat("Sample Mean:", sample_mean, "\n")
## Sample Mean: 4.39645
cat("Sample Variance:", sample_variance, "\n")
## Sample Variance: 2.115017
If we define Y as the number of female offspring in a litter, what is the sample mean and variance of Y for the 277 litters of size n = 10? [Hint: the calculations required here should be minimal!]
# Sample mean and variance for Y (number of female offspring)
female_offspring <- 10 - males
sample_mean_Y <- sum(female_offspring * frequencies) / total_litters
sample_variance_Y <- sum(frequencies * (female_offspring - sample_mean_Y)^2) / total_litters
cat("Sample Mean for Y (females):", sample_mean_Y, "\n")
## Sample Mean for Y (females): 5.60355
cat("Sample Variance for Y (females):", sample_variance_Y, "\n")
## Sample Variance for Y (females): 2.115017
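The minimal calculation suggested by the hint follows from Y = 10 − X: subtracting each count from the constant 10 gives a sample mean of 10 − mean(X) and leaves the sample variance unchanged, since every deviation from the mean only changes sign. A one-line check against the values printed above:
10 - sample_mean     # 5.60355, the sample mean of Y
sample_variance      # 2.115017, identical to the sample variance of Y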
Use the dbinom() function to generate binomial probability distributions, all with n = 10 but varying p = 0.05, 0.10, 0.25, 0.75, 0.90, 0.95. Use barplot() to graph each distribution. You do not need to hand the graphs in, but you are to describe two important patterns you should see as p changes.
# Define varying p values and n
p_values <- c(0.05, 0.10, 0.25, 0.75, 0.90, 0.95)
n <- 10
# Create a function to plot the binomial distribution for a given p
plot_binom <- function(p) {
probs <- dbinom(0:n, size = n, prob = p)
# names.arg labels the bars 0-10; ylim covers the largest probability (about 0.60 at p = 0.05 or 0.95)
barplot(probs, names.arg = 0:n, main = paste("Binomial Distribution: p =", p),
xlab = "Number of Successes", ylab = "Probability", ylim = c(0, 0.6))
}
# Use par() to lay out the six plots on one page
par(mfrow = c(2, 3))
# Plot the binomial distribution for each p value
invisible(lapply(p_values, plot_binom))
# Reset the plotting layout
par(mfrow = c(1, 1))
Centering: As p increases from 0.05 to 0.95, the peak (or mode) of the distribution shifts to the right. This reflects the increasing likelihood of more successes (i.e., more male offspring) as p becomes larger.
Spread: The spread depends on p. For p near 0 or 1 (such as 0.05 and 0.95) the distribution is narrow and sharply peaked, while for p closer to 0.5 it is widest; the binomial variance np(1 − p) is largest at p = 0.5.
The skewness also changes: for small p the distribution is positively (right) skewed, for large p it is negatively (left) skewed, and for p near 0.5 it is approximately symmetric.
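The spread pattern can be checked numerically, since a binomial variable with n = 10 has variance 10 * p * (1 - p), which is largest at p = 0.5 and shrinks toward the extremes:
# Variance of a binomial(n = 10, p) variable for each p used above
data.frame(p = p_values, variance = 10 * p_values * (1 - p_values))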
# Get the current working directory
current_dir <- getwd()
# Print the current working directory
print(current_dir)
## [1] "/cloud/project"
# Set the working directory to "/cloud/project" (if not already set)
setwd("/cloud/project")
# Load the necessary package to read Excel files
library(readxl)
# Read the Excel file
data <- read_excel("elderly.xlsx")
# View the first few rows of the data
head(data)
tail(data)
# Create a density histogram with a rescaled y-axis (probability density)
hist(data$Height.cm, xlim = c(140, 180), ylim = c(0, 0.07),
main = "Elderly Female Heights Fit to Normal Distribution",
xlab = "Height (cm)", prob = TRUE)
# Generate a sequence of x-values for the normal curve
xfit <- seq(140, 180, 0.01)
# Calculate the density values for the normal distribution based on sample mean and standard deviation
yfit <- dnorm(xfit, mean(data$Height.cm), sd(data$Height.cm))
# Overlay the normal distribution curve on the histogram
lines(xfit, yfit, col = "red", lwd = 2)
# Five uniform(0, 1) random values
URS1 <- c(0.66600, 0.00532, 0.32755, 0.98300, 0.64661)
# Transform them into a normal(mean = 80, sd = 12) sample via the inverse CDF (qnorm)
NormRS1 <- qnorm(URS1, mean = 80, sd = 12)
# Sample mean and standard deviation
xbar <- mean(NormRS1)
s <- sd(NormRS1)
print(xbar)
## [1] 79.81798
print(s)
## [1] 20.38502
# Standard error of the mean for a sample of size n = 5
SE <- s / sqrt(5)
print(SE)
## [1] 9.116456
Calculate the confidence interval for your sample data using the multiplier 2.7764 as mentioned:
lower_limit <- xbar - 2.7764 * SE
upper_limit <- xbar + 2.7764 * SE
print(lower_limit)
## [1] 54.50705
print(upper_limit)
## [1] 105.1289
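The multiplier 2.7764 is the two-sided 95% critical value of the t distribution with n − 1 = 4 degrees of freedom, which can be verified in R:
qt(0.975, df = 4)  # 2.776445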
if (lower_limit <= 80 && 80 <= upper_limit) {
cat("The population mean μ = 80 is included in the interval.\n")
} else {
cat("The population mean μ = 80 is not included in the interval.\n")
}
## The population mean μ = 80 is included in the interval.
You can generate random samples (actually pseudorandom samples) very quickly in R with the rnorm() function. To get a better sense of how estimation of the population mean is influenced by sample size, obtain random samples of sizes n = 100, 10,000, and 1,000,000 from a normal distribution with mean μ = 80 and standard deviation σ = 12. For each data set obtain the mean, standard deviation, and calculate the 95% confidence interval for the population mean. The 95% confidence intervals can be calculated as (xbar − 1.96SE, xbar + 1.96SE) since the sample sizes are all quite large. Did your confidence intervals capture μ = 80? (Most of the time the answer will be yes, but sometimes the answer will be no!)
# Set the parameters
mu <- 80
sigma <- 12
sample_sizes <- c(100, 10000, 1000000)
# Initialize vectors to store results
sample_means <- numeric(length(sample_sizes))
sample_sds <- numeric(length(sample_sizes))
conf_intervals <- matrix(nrow = length(sample_sizes), ncol = 2)
# Generate random samples and calculate statistics for each sample size
for (i in 1:length(sample_sizes)) {
n <- sample_sizes[i]
# Generate random sample
random_sample <- rnorm(n, mean = mu, sd = sigma)
# Calculate sample mean and standard deviation
sample_means[i] <- mean(random_sample)
sample_sds[i] <- sd(random_sample)
# Calculate the 95% confidence interval (SE uses the known population sigma = 12)
SE <- sigma / sqrt(n)
conf_intervals[i, ] <- c(sample_means[i] - 1.96 * SE, sample_means[i] + 1.96 * SE)
}
# Print the results
for (i in 1:length(sample_sizes)) {
cat("Sample Size:", sample_sizes[i], "\n")
cat("Sample Mean:", sample_means[i], "\n")
cat("Sample Standard Deviation:", sample_sds[i], "\n")
cat("95% Confidence Interval:", conf_intervals[i, 1], "-", conf_intervals[i, 2], "\n")
cat("\n")
}
## Sample Size: 100
## Sample Mean: 80.35688
## Sample Standard Deviation: 10.97905
## 95% Confidence Interval: 78.00488 - 82.70888
##
## Sample Size: 10000
## Sample Mean: 79.89491
## Sample Standard Deviation: 12.0252
## 95% Confidence Interval: 79.65971 - 80.13011
##
## Sample Size: 1e+06
## Sample Mean: 80.00077
## Sample Standard Deviation: 11.99178
## 95% Confidence Interval: 79.97725 - 80.02429
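In this run all three intervals contain μ = 80 (every lower limit is below 80 and every upper limit is above it). A quick programmatic check, reusing conf_intervals from the loop above:
# TRUE for each sample size whose interval captured the true mean of 80
conf_intervals[, 1] <= 80 & 80 <= conf_intervals[, 2]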
Generate a random sample of size 600 from a normal distribution with mean μ = 80 and standard deviation σ = 12 with the R command rnorm(600, 80, 12). Make sure to assign your generated values to a variable name. Obtain the mean and standard deviation for your sample. Next obtain a density histogram of your sample data and then overlay the density function for a normal random variable with the mean and standard deviation parameters estimated by the sample mean and standard deviation. The method for producing this graph is very close to the example in question C1. Hand this graph in.
# Generate a random sample of size 600
random_sample <- rnorm(600, mean = 80, sd = 12)
# Calculate the mean and standard deviation for the sample
sample_mean <- mean(random_sample)
sample_sd <- sd(random_sample)
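Since the question asks to report the sample mean and standard deviation, it may help to print them explicitly (the exact values vary from run to run because no seed is set for this sample):
cat("Sample mean:", sample_mean, "\n")
cat("Sample standard deviation:", sample_sd, "\n")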
# Create a density histogram of the sample data
hist(random_sample, probability = TRUE, main = "Density Histogram of Random Sample",
xlab = "Value", ylab = "Density")
# Generate a sequence of x-values for the normal density function
x_values <- seq(min(random_sample), max(random_sample), length.out = 1000)
# Calculate the density values for the normal distribution based on sample mean and standard deviation
y_values <- dnorm(x_values, mean = sample_mean, sd = sample_sd)
# Overlay the density function on the histogram
lines(x_values, y_values, col = "red", lwd = 2)
# Save the graph as a file (optional)
# png("density_histogram.png")
# plot(...)
# dev.off()