Worked with Brogan Pietrocini and Ryan Perez
Continuing Problems 1 and 2 from HW7
Compute \(\text{MSE}(\hat{p})\) when \(\mu = 2.3\) and \(n=5\). (You can do the next part first if you want, but it helps to work with specific numbers first.)
\(\text{MSE}(\hat{p}) = \text{Var}(\hat{p}) + \text{Bias}(\hat{p})^2\)
\(\text{MSE}(\hat{p}) = 0.0355 + 0^2\)
\(\text{MSE}(\hat{p}) = 0.0355\)
Derive the MSE function of \(\hat{p}\). (Hint: use facts from previous parts.)
\(\text{MSE}(\hat{p}) = \text{Var}(\hat{p}) + \text{Bias}(\hat{p})^2 = \text{Var}(\hat{p}) + 0\)
\(\text{Var}(X) = \text{E}[X^2] - (\text{E}[X])^2\)
\(\text{E}[X] = \theta\)
\(\text{E}[X^2] = (0^2)(1-\theta) + (1^2)(\theta) = \theta\)
\(\text{Var}(X) = \theta - \theta^2 = \theta(1-\theta)\)
\(\text{Var}(\hat{p}) = \text{Var}(X)/n = \theta(1-\theta)/n\), so \(\text{MSE}(\hat{p}) = \theta(1-\theta)/n\).
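To see the shape of this MSE function, here is a minimal sketch that plots \(\theta(1-\theta)/n\) for \(n=5\):
# MSE of p hat as a function of theta, for n = 5 (p hat is unbiased, so MSE = variance)
n <- 5
theta <- seq(0, 1, by = 0.01)
mse_p_hat <- theta * (1 - theta) / n
plot(theta, mse_p_hat, type = "l",
xlab = expression(theta), ylab = "MSE",
main = "MSE of p hat when n = 5")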
Suppose \(\mu = 2.3\) (and \(n=5\)). Explain in full detail how you would use simulation to approximate the MSE of \(\hat{\theta}\).
First, generate a large number of random samples of size \(n=5\) from a normal distribution with \(\mu = 2.3\) (and some assumed SD). For each sample, compute the estimate \(\hat{\theta}\) (here, the sample mean) and its squared error \((\hat{\theta} - \mu)^2\). The average of these squared errors across all simulated samples approximates the MSE of \(\hat{\theta}\).
Coding required. Conduct the simulation from the previous part and approximate the MSE of \(\hat{\theta}\) when \(\mu =2.3\) (and \(n=5\)).
# Set parameters
n <- 5
mu <- 2.3
# Number of simulations
num_simulations <- 10000
# Initialize a vector to store squared errors
squared_errors <- numeric(num_simulations)
# Perform simulations
for (i in 1:num_simulations) {
# Generate a random sample from a normal distribution
sample_data <- rnorm(n, mean = mu, sd = 1) # You may need to specify an appropriate standard deviation (sd)
# Estimate (sample mean)
theta_hat <- mean(sample_data)
# Squared error
squared_error <- (theta_hat - mu)^2
squared_errors[i] <- squared_error
}
estimated_mse <- mean(squared_errors)
estimated_mse
## [1] 0.1975845
The approximated MSE of \(\hat{\theta}\) is about 0.198, which agrees with the theoretical value \(\sigma^2/n = 1/5 = 0.2\) (the sample mean is unbiased, and the simulation used \(\sigma = 1\)).
Which estimator has smaller MSE when \(\mu = 2.3\) (and \(n=5\))? Answer, but then explain why this information alone is not really useful.
I’m not entirely sure.
Explain in full detail how you would use simulation to approximate the MSE function of \(\hat{\theta}\) (if \(n=5\)).
Same idea as above, but repeated over a range of \(\mu\) values rather than the single value 2.3: for each \(\mu\) in a grid, generate many samples of size 5, compute the squared error of \(\hat{\theta}\) for each, and average them. Plotting the approximated MSE against \(\mu\) gives the approximate MSE function.
Coding required. Conduct the simulation from the previous part and plot the approximate MSE function. Compare to the MSE function of \(\hat{p}\).
# Set parameters
n <- 5
num_simulations <- 10000
# Create a range of mu values
mu_values <- seq(1, 4, by = 0.1) # Adjust the range and step size as needed
# Initialize a vector to store estimated MSE values for each mu
mse_values <- numeric(length(mu_values))
# Perform simulations for each mu
for (j in 1:length(mu_values)) {
mu <- mu_values[j]
squared_errors <- numeric(num_simulations)
for (i in 1:num_simulations) {
# Generate a random sample from a normal distribution
sample_data <- rnorm(n, mean = mu, sd = 1) # You may need to specify an appropriate standard deviation (sd)
# Estimate (sample mean)
theta_hat <- mean(sample_data)
# Squared error
squared_error <- (theta_hat - mu)^2
squared_errors[i] <- squared_error
}
# Calculate the estimated MSE for the current mu
mse_values[j] <- mean(squared_errors)
}
# Create a plot
plot(mu_values, mse_values, type = "l",
xlab = "μ", ylab = "Estimated MSE",
main = "Estimated MSE vs. μ")
Compare the MSEs of the two estimators for \(n=5\) and a few other values of \(n\). Is there a clear preference between these two estimators? Discuss.
Again, I am not sure what the two estimators are.
In roulette, a bet on a single number has a 1/38 probability of success and pays 35-to-1. That is, if you bet 1 dollar, your net winnings are -1 with probability 37/38 and +35 with probability 1/38. Consider betting on a single number on each of \(n\) spins of a roulette wheel. Let \(\bar{X}_n\) be your average net winnings per bet.
For each of the values \(n = 10\), \(n = 100\), \(n = 1000\):
Compute \(\text{E}(\bar{X}_n)\)
\(\text{E}(\bar{X}_n) = \text{E}(X) = (-1)\cdot\frac{37}{38} + 35\cdot\frac{1}{38} = -\frac{2}{38} \approx -0.0526\) for all values of \(n\).
Compute \(\text{SD}(\bar{X}_n)\)
\(\text{Var}(X) = \text{E}[X^2] - (\text{E}[X])^2 = \frac{(-1)^2(37) + 35^2(1)}{38} - (-0.0526)^2 \approx 33.21\)
\(\text{SD}(X) \approx 5.76\)
\(\text{SD}(\bar{X}_n) = \text{SD}(X)/\sqrt{n}\):
For \(n=10\): \(5.76/\sqrt{10} \approx 1.82\)
For \(n=100\): \(5.76/\sqrt{100} \approx 0.576\)
For \(n=1000\): \(5.76/\sqrt{1000} \approx 0.182\)
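As a quick check, the same quantities can be computed directly in R (a small sketch):
# Net winnings distribution for a $1 single-number bet
x <- c(-1, 35)
p <- c(37/38, 1/38)
mu_bet <- sum(x * p) # expected value, about -0.0526
sd_bet <- sqrt(sum(x^2 * p) - mu_bet^2) # SD of a single bet, about 5.76
sd_bet / sqrt(c(10, 100, 1000)) # SD of the average for n = 10, 100, 1000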
Run a simulation to determine if the distribution of \(\bar{X}_n\) is approximately Normal
library(MASS)
# Set parameters
n <- 100 # Number of spins
M <- 10000 # Number of simulations
# Simulate and store average net winnings
average_winnings <- replicate(M, mean(ifelse(sample(1:38, n, replace = TRUE) == 1, 35, -1)))
# Plot histogram
hist(average_winnings, breaks = 50, main = "Histogram of Average Net Winnings", xlab = "Average Net Winnings", prob = TRUE, col = "lightblue")
# Fit normal distribution and plot
fit <- fitdistr(average_winnings, "normal")
curve(dnorm(x, mean = fit$estimate[1], sd = fit$estimate[2]), add = TRUE, col = "red", lwd = 2)
For \(n=100\), the distribution is skewed right and not well approximated by a Normal curve.
# Set parameters
n <- 1000 # Number of spins
M <- 10000 # Number of simulations
# Simulate and store average net winnings
average_winnings <- replicate(M, mean(ifelse(sample(1:38, n, replace = TRUE) == 1, 35, -1)))
# Plot histogram
hist(average_winnings, breaks = 50, main = "Histogram of Average Net Winnings", xlab = "Average Net Winnings", prob = TRUE, col = "lightblue")
# Fit normal distribution and plot
fit <- fitdistr(average_winnings, "normal")
curve(dnorm(x, mean = fit$estimate[1], sd = fit$estimate[2]), add = TRUE, col = "red", lwd = 2)
When \(n=1000\), the distribution looks approximately Normal; this is the Central Limit Theorem kicking in.
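A Normal QQ plot gives a sharper check than the histogram alone (a small sketch, reusing average_winnings from the chunk above):
# Points close to the reference line indicate approximate Normality
qqnorm(average_winnings, main = "Normal QQ Plot of Average Net Winnings (n = 1000)")
qqline(average_winnings, col = "red", lwd = 2)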
Use simulation to approximate \(\text{P}(\bar{X}_n >0)\), the probability that you come out ahead after \(n\) bets
# Set parameters
n <- 100 # Number of spins
M <- 10000 # Number of simulations
# Simulate and store average net winnings
average_winnings <- replicate(M, mean(ifelse(sample(1:38, n, replace = TRUE) == 1, 35, -1)))
# Calculate the proportion of trials where X-bar is greater than 0
probability_X_bar_positive <- sum(average_winnings > 0) / M
# Print the result
cat("P(X-bar > 0) =", probability_X_bar_positive, "\n")
## P(X-bar > 0) = 0.492
# Set parameters
n <- 1000 # Number of spins
M <- 10000 # Number of simulations
# Simulate and store average net winnings
average_winnings <- replicate(M, mean(ifelse(sample(1:38, n, replace = TRUE) == 1, 35, -1)))
# Calculate the proportion of trials where X-bar is greater than 0
probability_X_bar_positive <- sum(average_winnings > 0) / M
# Print the result
cat("P(X-bar > 0) =", probability_X_bar_positive, "\n")
## P(X-bar > 0) = 0.4006
As \(n\) increases, \(\text{P}(\bar{X}_n > 0)\) decreases, because the average concentrates around its negative expected value of about \(-0.0526\).
If \(n=1000\) use the Central Limit Theorem to approximate \(\text{P}(\bar{X}_{1000} >0)\), the probability that you come out ahead after 1000 bets.
Because the distribution of \(\bar{X}_{1000}\) is approximately Normal with mean \(-0.0526\) and SD \(\approx 0.182\), we can use pnorm to find the answer.
1 - pnorm(0, -0.0526, 0.182)
## [1] 0.3862865
\(\text{P}(\bar{X}_{1000} > 0) \approx 0.386\); about 39% of the time you would come out ahead after 1000 bets.
The casino wants to determine how many bets on a single number are needed before they have (at least) a 99% probability of making a profit. (Remember, the casino profits if you lose; that is, if \(\bar{X}_n <0\).) Use the Central Limit Theorem to determine the minimum number of bets (keeping in mind that \(n\) must be an integer). You can assume that whatever \(n\) is, it’s large for the CLT to kick in.
The casino profits when \(\bar{X}_n < 0\). By the CLT, \(\bar{X}_n\) is approximately Normal with mean \(-0.0526\) and SD \(5.76/\sqrt{n}\), so we need
\(\text{P}(\bar{X}_n < 0) = \text{P}\!\left(Z < \frac{0 - (-0.0526)}{5.76/\sqrt{n}}\right) \ge 0.99,\)
which holds when \(\frac{0.0526\sqrt{n}}{5.76} \ge 2.326\), where 2.326 is the 0.99 Normal quantile (qnorm(0.99)). Solving, \(\sqrt{n} \ge 2.326 \times 5.76/0.0526 \approx 254.7\), so \(n \ge 254.7^2 \approx 64{,}900\). Carrying full precision through the calculation (see the check below), the minimum integer is \(n = 64{,}878\) bets.
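Here is that check (a small sketch in R, using the corrected mean and SD from above):
# Solve P(Xbar_n < 0) >= 0.99 under the Normal (CLT) approximation,
# i.e. |mu| * sqrt(n) / sigma >= qnorm(0.99)
mu_bet <- -2/38 # expected net winnings per $1 bet
sd_bet <- sqrt((37 + 35^2)/38 - mu_bet^2) # SD of a single bet, about 5.76
n_min <- ceiling((qnorm(0.99) * sd_bet / abs(mu_bet))^2)
n_min # 64878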
The standard measurement of the alcohol content of drinks is alcohol by volume (ABV), which is given as the volume of ethanol as a percent of the total volume of the drink. A report states that on average the ABV for beer is 4.5%. Is this true? In a sample of 67 brands of beer, the mean ABV is 4.61 percent and the standard deviation of ABV is 0.754 percent. Note: the data set and some R summaries are provided in Canvas so you can check your answers. But you should answer these questions by hand, using only the information provided here.
Use R to summarize the sample data. Then describe the main features in a sentence or two.
beer <- read.csv("beer.csv")
summary(beer)
## Brand Brewery ABV Calories
## Length:67 Length:67 Min. :0.40 Min. : 70.0
## Class :character Class :character 1st Qu.:4.20 1st Qu.:110.0
## Mode :character Mode :character Median :4.60 Median :142.0
## Mean :4.61 Mean :135.4
## 3rd Qu.:5.00 3rd Qu.:151.0
## Max. :6.50 Max. :210.0
## Carbohydrates
## Min. : 2.60
## 1st Qu.: 7.00
## Median :11.20
## Mean :10.30
## 3rd Qu.:12.95
## Max. :23.90
ABV has a minimum value of 0.4, mean of 4.61, and maximum value of 6.50. Calories has a minimum value of 70, mean value of 135.4, and maximum value of 210. Carbohydrates has a minimum value of 2.6, mean of 10.3, and maximum value of 23.9.
Compute and interpret a 95% confidence interval for the appropriate population mean.
The standard error is \(s/\sqrt{n} = 0.754/\sqrt{67} \approx 0.092\), so the 95% confidence interval is
\(4.61 \pm 2(0.092) = (4.43, 4.79)\)
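The same arithmetic in R (a small sketch, using only the summary numbers given in the problem):
# 95% CI for the mean ABV from the reported summary statistics
xbar <- 4.61; s <- 0.754; n <- 67
me <- 2 * s / sqrt(n) # margin of error with multiplier 2
c(xbar - me, xbar + me) # about (4.43, 4.79)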
Write a clearly worded sentence reporting your confidence interval in context.
We are 95% confident that the true population mean ABV of beer is between 4.43 and 4.79 percent.
One of the brands of beer in the sample is O’Doul’s, a non-alcoholic beer. The ABV for O’Doul’s is 0.4 (it has a bit of alcohol.) Suppose O’Doul’s is removed from the data set. Compute the sample mean ABV of the remaining 66 brands. (You can do this with filter in tidyverse; see how I filtered the 2-week babies in the feeding length data.)
library(tidyverse)
new_beer <- slice(beer, -43) # drop row 43, which is O'Doul's
summary(new_beer)
## Brand Brewery ABV Calories
## Length:66 Length:66 Min. :3.800 Min. : 94.0
## Class :character Class :character 1st Qu.:4.200 1st Qu.:110.0
## Mode :character Mode :character Median :4.600 Median :142.5
## Mean :4.674 Mean :136.4
## 3rd Qu.:5.000 3rd Qu.:151.5
## Max. :6.500 Max. :210.0
## Carbohydrates
## Min. : 2.60
## 1st Qu.: 7.00
## Median :11.20
## Mean :10.25
## 3rd Qu.:12.85
## Max. :23.90
The new sample mean ABV is 4.674 after removing O'Doul's.
Continuing the previous part, compute the sample SD of ABV of the remaining 66 brands. Compare to the original sample mean; which mean is larger — with or without O’Doul’s? Explain briefly.
oldSD <- sd(beer$ABV)
newSD <- sd(new_beer$ABV)
print(oldSD)
## [1] 0.7540063
print(newSD)
## [1] 0.5480906
The mean increases (from 4.61 to 4.674) because the low outlier was removed.
The sample SD of ABV of the remaining 66 brands is 0.55 percent. Compare to the original sample SD; which SD is larger — with or without O’Doul’s? Explain briefly.
The standard deviation is smaller without O'Doul's (0.55 vs. 0.75 percent), which makes sense because O'Doul's was a low outlier and removing it reduces the spread.
Compute the 95% confidence interval based on the sample with O'Doul's removed. Compare to the original interval, both in terms of the center of the CI and its width. Explain briefly.
The standard error is \(0.548/\sqrt{66} \approx 0.067\), so the 95% confidence interval is \(4.674 \pm 2(0.067) = (4.54, 4.81)\).
We are 95% confident that the true population mean ABV, after removing O'Doul's, is between 4.54 and 4.81 percent. The new interval is narrower (the sample SD dropped from 0.754 to 0.548) and has a higher center, which makes sense because the low outlier was removed.
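The interval can also be checked directly from the data (a small sketch, reusing the new_beer data frame from above; the t-based multiplier is close to 2, so it should agree with the hand calculation to about two decimals):
# 95% t-based confidence interval for mean ABV without O'Doul's
t.test(new_beer$ABV)$conf.int # about (4.54, 4.81)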
Based on the interval based on the sample with O’Doul’s removed, is 4.5% a plausible value of the parameter? Explain briefly.
With O'Doul's removed, 4.5% is not a plausible value of the population mean, since it falls just below the interval (4.54, 4.81).
Which of the analyses is more appropriate: with or without O’Doul’s? Explain your reasoning.
I would say it's more appropriate to leave out O'Doul's: it is a non-alcoholic beer, so it obviously has a very low ABV and is a clear outlier in the data set.
(I had much longer instructions written, but I think they weren’t helpful. If you have questions about how the applet is working, don’t hesitate to ask.)
You are going to use a simulation applet that randomly generates confidence intervals for a population mean to help you understand some ideas. The applet calculates a confidence interval for each set of randomly generated data.
Experiment with different simulations; be sure to change the inputs (the population distribution, population mean, SD, sample size, confidence level, and number of intervals).
Then write a paragraph or two or some bullet points of your main observations. Based on this applet, what can you say about how confidence intervals work? How does changing the inputs affect the confidence intervals and the Results?
The first thing I noticed when playing with the number of intervals for Normal, exponential, and uniform populations is that a small fraction of the intervals failed to capture the true population mean. This shows that not every sample produces an interval containing the population mean; over many samples, roughly the stated confidence level of intervals do. Changing the population mean did not change this behavior. Increasing the population SD made the intervals wider, while decreasing it made them narrower. Sample size worked in the opposite direction: larger samples produced narrower intervals.
When you decrease the confidence level, far more of the intervals fail to capture the true population mean. This makes sense, as we are less confident in each individual interval.