We will have you run two sets of simulations, similar to the above but with N = 500 and N = 5,000. Prepare graphs illustrating the relative frequency distributions for the results of each simulation. Put the graphs on one page and hand this in; the command par(mfrow = c(2,1)) helps do this in R.
# Set the random seed for reproducibility
set.seed(123)
# N = 500 simulation
Die1_500 <- sample(1:6, 500, replace = TRUE)
Die2_500 <- sample(1:6, 500, replace = TRUE)
Die3_500 <- sample(1:6, 500, replace = TRUE)
DiceSum_500 <- Die1_500 + Die2_500 + Die3_500
# N = 5000 simulation
Die1_5000 <- sample(1:6, 5000, replace = TRUE)
Die2_5000 <- sample(1:6, 5000, replace = TRUE)
Die3_5000 <- sample(1:6, 5000, replace = TRUE)
DiceSum_5000 <- Die1_5000 + Die2_5000 + Die3_5000
# Put both graphs on one page (as the question suggests) and adjust the margins
par(mfrow = c(2, 1), mar = c(4, 4, 2, 2))
# Plotting the N = 500 result
barplot(table(DiceSum_500)/500, main="N = 500", xlab="Dice Sum", ylab="Relative Frequency", col="skyblue")
# Plotting the N = 5000 result
barplot(table(DiceSum_5000)/5000, main="N = 5000", xlab="Dice Sum", ylab="Relative Frequency", col="lightcoral")
# Reset the plotting layout
par(mfrow = c(1, 1))
Both graphs are bell shaped; as the sample size increases from N = 500 to N = 5,000, the relative frequencies become smoother and more tightly concentrated around the middle values of the sum.
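For reference (not required by the question), the exact distribution of the sum of three fair dice can be obtained by enumerating all 216 equally likely outcomes, which makes explicit the bell shape that the simulations approximate:
# Exact probabilities for the sum of three fair dice
all_sums <- rowSums(expand.grid(1:6, 1:6, 1:6))
round(table(all_sums) / 216, 4)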
What is the mean (expected value) and variance of the theoretical probability distribution of X, the sum of three (randomly tossed) fair dice? Show how you obtained these values.
# Theoretical Mean = 3 * (1+2+3+4+5+6)/6
mean_theoretical <- 3 * sum(1:6)/6
# Variance = 3 * variance of a single die
# Variance of single die = E(X^2) - E(X)^2
var_theoretical <- 3 * ((sum((1:6)^2)/6) - (sum(1:6)/6)^2)
mean_theoretical
## [1] 10.5
var_theoretical
## [1] 8.75
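In symbols: a single fair die D has E(D) = (1 + 2 + ... + 6)/6 = 3.5 and Var(D) = E(D^2) − [E(D)]^2 = 91/6 − 3.5^2 = 35/12 ≈ 2.9167. Because the three dice are independent, the sum X has mean 3 × 3.5 = 10.5 and variance 3 × 35/12 = 8.75, matching the values computed above.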
Obtain the sample mean and sample variance for the data from each of your simulations. How do these compare to the theoretical values?
# N = 500
mean_500 <- mean(DiceSum_500)
var_500 <- var(DiceSum_500)
# N = 5000
mean_5000 <- mean(DiceSum_5000)
var_5000 <- var(DiceSum_5000)
mean_500
## [1] 10.468
var_500
## [1] 8.790557
mean_5000
## [1] 10.4374
var_5000
## [1] 8.5958
Both simulations give sample means and variances close to the theoretical values of 10.5 and 8.75; in this run every estimate is within about 0.16 of its theoretical value.
What outcomes result in a sum of 12? Obtain the individual counts for each outcome that results in a sum of 12 and show how these result in the relative frequency (= empirical probability) for the simulation with N = 5000.
count_500 <- length(DiceSum_500[DiceSum_500 == 12])
count_5000 <- length(DiceSum_5000[DiceSum_5000 == 12])
relative_freq_500 <- count_500 / 500
relative_freq_5000 <- count_5000 / 5000
count_500
## [1] 55
relative_freq_500
## [1] 0.11
count_5000
## [1] 573
relative_freq_5000
## [1] 0.1146
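The count of 573 above is an aggregate over every dice combination that sums to 12 (1-5-6, 2-4-6, 2-5-5, 3-3-6, 3-4-5, and 4-4-4; the exact probability is 25/216 ≈ 0.116, close to the relative frequency of 0.1146). A minimal sketch, reusing the N = 5000 simulation objects, of how the individual counts can be tabulated so that they add back up to the relative frequency:
# Break the N = 5000 count down by individual dice combination
is12 <- DiceSum_5000 == 12
rolls12 <- cbind(Die1_5000, Die2_5000, Die3_5000)[is12, ]
combo_counts <- table(apply(rolls12, 1, function(x) paste(sort(x), collapse = "-")))
combo_counts                 # count for each combination that gives a sum of 12
sum(combo_counts) / 5000     # the counts add back up to the relative frequency above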
Obtain a barplot of the empirical (observed) frequencies for litters of size n = 10.
males <- c(0,1,2,3,4,5,6,7,8,9,10)
frequencies <- c(0,3,15,72,101,75,83,33,18,1,0)
barplot(frequencies, names.arg = males, main = "Empirical Frequencies for Litters of Size 10",
xlab = "Number of Males", ylab = "Frequency", col = "skyblue")
Obtain a barplot of the expected frequencies if we use a binomial distribution with n = 10 and p estimated from the data in litters of size 10 only. To estimate p, divide the number of male pigs by the number of pigs in litters of size 10. How do the two graphs in B1 and B2 compare? Put the graphs on one page and hand this in; the command par(mfrow = c(2,1)) helps do this in R.
total_males_10 <- sum(males * frequencies)
total_litters_10 <- sum(frequencies)
p_estimated <- total_males_10 / (total_litters_10 * 10)
expected_frequencies <- dbinom(males, 10, p_estimated) * total_litters_10
barplot(expected_frequencies, names.arg = males, main = "Expected Frequencies for Litters of Size 10 (Binomial)",
xlab = "Number of Males", ylab = "Expected Frequency", col = "lightcoral")
Report the value you used as your estimate of p. What is the mean and variance for a binomial distribution with this value of p and n = 10? What are the values of the sample mean and sample variance for the number of male offspring in the litters of size n = 10?
# Data
males <- 0:10
frequencies <- c(0, 3, 15, 79, 101, 75, 33, 18, 13, 1, 0)
# Estimate of p
total_litters <- sum(frequencies)
total_males <- sum(males * frequencies)
p_estimate <- total_males / (10 * total_litters)
# Mean and variance for binomial distribution with p and n=10
binom_mean <- 10 * p_estimate
binom_variance <- 10 * p_estimate * (1 - p_estimate)
# Sample mean and variance for number of male offspring
sample_mean <- total_males / total_litters
sample_variance <- sum(frequencies * (males - sample_mean)^2) / total_litters
cat("Estimated p:", p_estimate, "\n")
## Estimated p: 0.439645
cat("Binomial Mean:", binom_mean, "\n")
## Binomial Mean: 4.39645
cat("Binomial Variance:", binom_variance, "\n")
## Binomial Variance: 2.463573
cat("Sample Mean:", sample_mean, "\n")
## Sample Mean: 4.39645
cat("Sample Variance:", sample_variance, "\n")
## Sample Variance: 2.115017
If we define Y as the number of female offspring in a litter, what is the sample mean and variance of Y for the 277 litters of size n = 10? [Hint: the calculations required here should be minimal!]
# Sample mean and variance for Y (number of female offspring)
female_offspring <- 10 - males
sample_mean_Y <- sum(female_offspring * frequencies) / total_litters
sample_variance_Y <- sum(frequencies * (female_offspring - sample_mean_Y)^2) / total_litters
cat("Sample Mean for Y (females):", sample_mean_Y, "\n")
## Sample Mean for Y (females): 5.60355
cat("Sample Variance for Y (females):", sample_variance_Y, "\n")
## Sample Variance for Y (females): 2.115017
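The minimal calculation suggested by the hint follows from Y = 10 − X: subtracting each count from the constant 10 gives a sample mean of 10 − mean(X) and leaves the sample variance unchanged, since every deviation from the mean only changes sign. A one-line check against the values printed above:
10 - sample_mean     # 5.60355, the sample mean of Y
sample_variance      # 2.115017, identical to the sample variance of Y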
Use the dbinom() function to generate binomial probability distributions, all with n = 10 but varying p = 0.05, 0.10, 0.25, 0.75, 0.90, 0.95. Use barplot() to graph each distribution. You do not need to hand the graphs in, but you are to describe two important patterns you should see as p changes.
# Define varying p values and n
p_values <- c(0.05, 0.10, 0.25, 0.75, 0.90, 0.95)
n <- 10
# Create a function to plot the binomial distribution for a given p
plot_binom <- function(p) {
probs <- dbinom(0:n, size = n, prob = p)
# names.arg labels the bars 0-10; ylim covers the largest probability (about 0.60 at p = 0.05 or 0.95)
barplot(probs, names.arg = 0:n, main = paste("Binomial Distribution: p =", p),
xlab = "Number of Successes", ylab = "Probability", ylim = c(0, 0.6))
}
# Use par() to lay out the six plots on one page
par(mfrow = c(2, 3))
# Plot the binomial distribution for each p value
invisible(lapply(p_values, plot_binom))
# Reset the plotting layout
par(mfrow = c(1, 1))
Centering: As p increases from 0.05 to 0.95, the peak (or mode) of the distribution shifts to the right. This reflects the increasing likelihood of more successes (i.e., more male offspring) as p becomes larger.
Spread: The spread depends on p. For p near 0 or 1 (such as 0.05 and 0.95) the distribution is narrow and sharply peaked, while for p closer to 0.5 it is widest; the binomial variance np(1 − p) is largest at p = 0.5.
The skewness also changes: for small p the distribution is positively (right) skewed, for large p it is negatively (left) skewed, and for p near 0.5 it is approximately symmetric.
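The spread pattern can be checked numerically, since a binomial variable with n = 10 has variance 10 * p * (1 - p), which is largest at p = 0.5 and shrinks toward the extremes:
# Variance of a binomial(n = 10, p) variable for each p used above
data.frame(p = p_values, variance = 10 * p_values * (1 - p_values))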
# Get the current working directory
current_dir <- getwd()
# Print the current working directory
print(current_dir)
## [1] "/cloud/project"
# Set the working directory to "/cloud/project" (if not already set)
setwd("/cloud/project")
# Load the necessary package to read Excel files
library(readxl)
# Read the Excel file
data <- read_excel("elderly.xlsx")
# View the first few rows of the data
head(data)
tail(data)
# Create a density histogram with a rescaled y-axis (probability density)
hist(data$Height.cm, xlim = c(140, 180), ylim = c(0, 0.07),
main = "Elderly Female Heights Fit to Normal Distribution",
xlab = "Height (cm)", prob = TRUE)
# Generate a sequence of x-values for the normal curve
xfit <- seq(140, 180, 0.01)
# Calculate the density values for the normal distribution based on sample mean and standard deviation
yfit <- dnorm(xfit, mean(data$Height.cm), sd(data$Height.cm))
# Overlay the normal distribution curve on the histogram
lines(xfit, yfit, col = "red", lwd = 2)
# Five uniform(0, 1) random values
URS1 <- c(0.66600, 0.00532, 0.32755, 0.98300, 0.64661)
# Transform them into a normal(mean = 80, sd = 12) sample via the inverse CDF (qnorm)
NormRS1 <- qnorm(URS1, mean = 80, sd = 12)
# Sample mean and standard deviation
xbar <- mean(NormRS1)
s <- sd(NormRS1)
print(xbar)
## [1] 79.81798
print(s)
## [1] 20.38502
# Standard error of the mean for a sample of size n = 5
SE <- s / sqrt(5)
print(SE)
## [1] 9.116456
Calculate the confidence interval for your sample data using the multiplier 2.7764 as mentioned:
lower_limit <- xbar - 2.7764 * SE
upper_limit <- xbar + 2.7764 * SE
print(lower_limit)
## [1] 54.50705
print(upper_limit)
## [1] 105.1289
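The multiplier 2.7764 is the two-sided 95% critical value of the t distribution with n − 1 = 4 degrees of freedom, which can be verified in R:
qt(0.975, df = 4)  # 2.776445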
if (lower_limit <= 80 && 80 <= upper_limit) {
cat("The population mean μ = 80 is included in the interval.\n")
} else {
cat("The population mean μ = 80 is not included in the interval.\n")
}
## The population mean μ = 80 is included in the interval.
You can generate random samples (actually pseudorandom samples) very quickly in R with the rnorm() function. To get a better sense of how estimation of the population mean is influenced by sample size, obtain random samples of sizes n = 100, 10,000, and 1,000,000 from a normal distribution with mean μ = 80 and standard deviation σ = 12. For each data set obtain the mean, standard deviation, and calculate the 95% confidence interval for the population mean. The 95% confidence intervals can be calculated as (xbar − 1.96SE, xbar + 1.96SE) since the sample sizes are all quite large. Did your confidence intervals capture μ = 80? (Most of the time the answer will be yes, but sometimes the answer will be no!)
# Set the parameters
mu <- 80
sigma <- 12
sample_sizes <- c(100, 10000, 1000000)
# Initialize vectors to store results
sample_means <- numeric(length(sample_sizes))
sample_sds <- numeric(length(sample_sizes))
conf_intervals <- matrix(nrow = length(sample_sizes), ncol = 2)
# Generate random samples and calculate statistics for each sample size
for (i in 1:length(sample_sizes)) {
n <- sample_sizes[i]
# Generate random sample
random_sample <- rnorm(n, mean = mu, sd = sigma)
# Calculate sample mean and standard deviation
sample_means[i] <- mean(random_sample)
sample_sds[i] <- sd(random_sample)
# Calculate the 95% confidence interval (SE uses the known population sigma = 12)
SE <- sigma / sqrt(n)
conf_intervals[i, ] <- c(sample_means[i] - 1.96 * SE, sample_means[i] + 1.96 * SE)
}
# Print the results
for (i in 1:length(sample_sizes)) {
cat("Sample Size:", sample_sizes[i], "\n")
cat("Sample Mean:", sample_means[i], "\n")
cat("Sample Standard Deviation:", sample_sds[i], "\n")
cat("95% Confidence Interval:", conf_intervals[i, 1], "-", conf_intervals[i, 2], "\n")
cat("\n")
}
## Sample Size: 100
## Sample Mean: 80.35688
## Sample Standard Deviation: 10.97905
## 95% Confidence Interval: 78.00488 - 82.70888
##
## Sample Size: 10000
## Sample Mean: 79.89491
## Sample Standard Deviation: 12.0252
## 95% Confidence Interval: 79.65971 - 80.13011
##
## Sample Size: 1e+06
## Sample Mean: 80.00077
## Sample Standard Deviation: 11.99178
## 95% Confidence Interval: 79.97725 - 80.02429
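In this run all three intervals contain μ = 80 (every lower limit is below 80 and every upper limit is above it). A quick programmatic check, reusing conf_intervals from the loop above:
# TRUE for each sample size whose interval captured the true mean of 80
conf_intervals[, 1] <= 80 & 80 <= conf_intervals[, 2]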
Generate a random sample of size 600 from a normal distribution with mean μ = 80 and standard deviation σ = 12 with the R command rnorm(600, 80, 12). Make sure to assign your generated values to a variable name. Obtain the mean and standard deviation for your sample. Next obtain a density histogram of your sample data and then overlay the density function for a normal random variable with the mean and standard deviation parameters estimated by the sample mean and standard deviation. The method for producing this graph is very close to the example in question C1. Hand this graph in.
# Generate a random sample of size 600
random_sample <- rnorm(600, mean = 80, sd = 12)
# Calculate the mean and standard deviation for the sample
sample_mean <- mean(random_sample)
sample_sd <- sd(random_sample)
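Since the question asks to report the sample mean and standard deviation, it may help to print them explicitly (the exact values vary from run to run because no seed is set for this sample):
cat("Sample mean:", sample_mean, "\n")
cat("Sample standard deviation:", sample_sd, "\n")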
# Create a density histogram of the sample data
hist(random_sample, probability = TRUE, main = "Density Histogram of Random Sample",
xlab = "Value", ylab = "Density")
# Generate a sequence of x-values for the normal density function
x_values <- seq(min(random_sample), max(random_sample), length.out = 1000)
# Calculate the density values for the normal distribution based on sample mean and standard deviation
y_values <- dnorm(x_values, mean = sample_mean, sd = sample_sd)
# Overlay the density function on the histogram
lines(x_values, y_values, col = "red", lwd = 2)
# Save the graph as a file (optional)
# png("density_histogram.png")
# plot(...)
# dev.off()