Section 1.1

Question 1.1.1:
I am interested in knowing the proportion of people in Berkshire county that wear glasses. I randomly sample 100 people every day of January.
The population is:

  1. The 100 people I sample every day of January
  2. The people in Berkshire county
  3. All the people in all my samples
  4. The people in the United States

Answer:
b

Question 1.1.2:
From the previous question, the sample statistics are:

  a. The 100 people I sample every day of January
  b. The people in Berkshire county
  c. All the people in all my samples
  d. The proportion of people in each sample that wear glasses

Answer:
d

Question 1.1.3:
The distribution of the sample statistic (the proportion of people in each sample that wear glasses) will be

  1. N(p, √(p(1−p)/n))
  2. N(p, p(1−p)/√n)
  3. N(μ, √(σ/n))
  4. N(μ, σ/n)

Answer:
a

Question 1.1.4:
What type of distribution is this called:

  1. Proportion distribution
  2. Sampling distribution
  3. Population distribution
  4. Statistic distribution

Answer:
b

Question 1.1.5:
If we increase the sample size n by a factor of 4 (e.g., from 100 to 400), what happens to the standard error of the sample proportion?

  1. It doubles
  2. It is cut in half
  3. It stays the same
  4. It quadruples

Answer:
b
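
A quick numerical check of this answer, as a minimal sketch. The value p = 0.3 is an assumption chosen only for illustration (the question does not give p); the ratio of the two standard errors does not depend on it.

```r
# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
# p = 0.3 is an assumed value used only for illustration.
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

p <- 0.3
se_100 <- se_prop(p, 100)
se_400 <- se_prop(p, 400)

# Quadrupling n divides the standard error by sqrt(4) = 2
se_400 / se_100
```

Because n sits under a square root, any fourfold increase in n halves the standard error, whatever the value of p.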

Question 1.1.6 (NEED HELP):
The Central Limit Theorem (CLT) states that for a large enough sample size n, the sampling distribution of the sample mean x̄ will be approximately normal. For which would the required n be smallest?

  1. Population is highly left skewed
  2. Population is highly right skewed
  3. Population is symmetric with heavy tails
  4. Population is symmetric with light tails

Answer:
d

Question 1.1.7:
A researcher takes 1000 different random samples of size n = 50 from the same population and calculates the sample mean x̄ for each. If they plot a histogram of these 1000 sample means, what are they visualizing?

  1. The population distribution
  2. The sample distribution
  3. The sampling distribution of the mean
  4. The distribution of the standard deviation

Answer:
c
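
The procedure in this question can be sketched as a small simulation. The population here is an assumption for illustration only (Exponential with rate 1, so mean 1 and SD 1); the question does not specify one.

```r
set.seed(42)
n <- 50        # size of each sample
reps <- 1000   # number of samples

# Assumed population for illustration: Exponential(rate = 1), mean 1, SD 1
xbar <- replicate(reps, mean(rexp(n, rate = 1)))

mean(xbar)   # should be near the population mean, 1
sd(xbar)     # should be near sigma / sqrt(n) = 1 / sqrt(50)

# The histogram of the 1000 sample means is the (simulated) sampling distribution
hist(xbar,
     main = "Sampling distribution of the mean (n = 50)",
     xlab = "x-bar")
```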

Section 1.2

Question 1.2.1:
Which of the following is the correct interpretation of a 95% confidence interval for a population mean μ?

  1. There is a 95% probability that the true mean μ falls within this specific interval.
  2. 95% of the data points in the population fall within this interval.
  3. If we repeated the sampling process many times, approximately 95% of the intervals we constructed would contain the true population mean μ.
  4. We are 95% sure that the sample mean x̄ is equal to the population mean μ.

Answer:
c
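
The repeated-sampling interpretation can be checked with a short simulation. This is a sketch under assumed values: the Normal(50, 10) population, n = 30, and 2000 repetitions are all chosen for illustration and do not come from the question.

```r
set.seed(1)
mu <- 50       # assumed true population mean
sigma <- 10    # assumed population SD
n <- 30        # assumed sample size
reps <- 2000   # number of repeated samples

# For each sample, build a 95% t-interval and record whether it contains mu
covered <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

# Share of intervals that contain the true mean; should be near 0.95
mean(covered)
```

Any single interval either contains μ or it does not; the 95% describes the long-run behavior of the procedure.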

Question 1.2.2:
How does increasing the confidence level (e.g., from 90% to 99%) affect the width of a confidence interval, assuming the sample size and standard deviation remain the same?

  1. The interval becomes narrower.
  2. The interval becomes wider.
  3. The width of the interval does not change.
  4. The interval shifts to the right.

Answer:
b
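
A small sketch of why the interval widens: the critical value grows with the confidence level while the standard error stays fixed. The values sigma = 10 and n = 25 are assumptions for illustration.

```r
# Two-sided z critical values
z90 <- qnorm(0.95)    # 90% confidence leaves 5% in each tail
z99 <- qnorm(0.995)   # 99% confidence leaves 0.5% in each tail

# Assumed values for illustration only
sigma <- 10
n <- 25
se <- sigma / sqrt(n)

# Interval width is 2 * critical value * standard error
width90 <- 2 * z90 * se
width99 <- 2 * z99 * se
c(width90 = width90, width99 = width99)
```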

Question 1.2.3:
When constructing a confidence interval for a population mean μ and the population standard deviation σ is unknown, which distribution should we use for the critical value?

  1. Standard Normal (z) distribution
  2. t-distribution with n − 1 degrees of freedom
  3. Binomial distribution
  4. Uniform distribution

Answer:
b
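
To see how the t critical value compares with the z critical value, here is a brief sketch at the 95% level. The sample sizes are assumed for illustration.

```r
n <- c(5, 15, 30, 100)            # assumed sample sizes for illustration
t_crit <- qt(0.975, df = n - 1)   # 95% two-sided t critical values
z_crit <- qnorm(0.975)            # 95% two-sided z critical value

# t critical values exceed z and shrink toward it as n grows
data.frame(n = n, t_crit = t_crit, z_crit = z_crit)
```

The larger t critical values account for the extra uncertainty from estimating σ with the sample standard deviation.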

Section 2.1

Question 2.1a:
State the distribution of the number of people who prefer tea (tea drinkers) in a sample of size n.

Answer:
Let X be the number of tea drinkers in the sample. Since each person independently prefers tea with probability p = 0.10,
X ~ Binomial(n, 0.10).

Question 2.1b:
What is the expected number of tea drinkers in a sample of n people? What is the standard deviation of the count of tea drinkers?

Answer:
E[X] = np = n(0.10) = 0.1n
SD(X) = sqrt(np(1-p)) = sqrt(n(0.10)(0.90)) = sqrt(0.09n) = 0.3 sqrt(n)
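
As a quick check of these formulas, the sketch below evaluates them for the two sample sizes used later in this section (n = 10 and n = 100).

```r
p <- 0.10
n <- c(10, 100)

expected <- n * p                   # E[X] = np
sd_count <- sqrt(n * p * (1 - p))   # SD(X) = sqrt(np(1-p))

data.frame(n = n, expected = expected, sd_count = sd_count)
```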

Question 2.1c:
Simulate 1000 samples of 10 people and compute p̂.

Answer:
The simulated mean of p̂ should be close to 0.10 because p̂ is an unbiased estimator of p.
The simulated SD of p̂ should be close to sqrt(p(1-p)/n) = sqrt((0.10)(0.90)/10) ≈ 0.095.

Question 2.1d:
Simulate 1000 samples of 100 people and compute p̂.

Answer:
Again the mean of p̂ should be near 0.10 and the SD near sqrt(p(1-p)/100) = sqrt((0.10)(0.90)/100) = 0.03.

Question 2.1e:
Comment on the similarity or difference between the two distributions.

Answer:
Both sampling distributions are centered around 0.10.
The n = 100 distribution is narrower because the standard error decreases as sample size increases.
The n = 10 distribution is more spread out and discrete.

Question 2.1f:
Do the results match the theoretical sampling distribution?

Answer:
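Yes for the center and spread: in both simulations the mean is close to p = 0.10 and the SD is close to sqrt(p(1-p)/n).
The shape differs. With n = 10, np = 1 is well below 10, so the distribution is right-skewed, takes only a few distinct values, and is poorly approximated by a normal curve.
With n = 100, np = 10, so the distribution is less skewed and less discrete and looks closer to normal.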

Question 2.1g:
Give a 95% range for the sample proportions.

Answer:
For n = 10 the middle 95% of p̂ values is approximately 0.00 to 0.30.
For n = 100 the middle 95% is approximately 0.05 to 0.17.

Question 2.1h:
Is it worth sampling 100 people instead of 10?

Answer:
Yes. Sampling 100 people produces a much narrower range for p̂, meaning the estimate is more precise.

set.seed(2026)

N <- 1000
p <- 0.10

# n = 10
n10 <- 10
x10 <- rbinom(N, size = n10, prob = p)
phat10 <- x10 / n10

mean(phat10)
## [1] 0.0969
sd(phat10)
## [1] 0.09224369
sqrt(p*(1-p)/n10)
## [1] 0.09486833
quantile(phat10, c(0.025,0.975))
##  2.5% 97.5% 
##   0.0   0.3
# n = 100
n100 <- 100
x100 <- rbinom(N, size = n100, prob = p)
phat100 <- x100 / n100

mean(phat100)
## [1] 0.10017
sd(phat100)
## [1] 0.02964034
sqrt(p*(1-p)/n100)
## [1] 0.03
quantile(phat100, c(0.025,0.975))
##  2.5% 97.5% 
##  0.05  0.17
hist(phat10,
     main="Sampling distribution of p-hat (n=10)",
     xlab="p-hat")

hist(phat100,
     main="Sampling distribution of p-hat (n=100)",
     xlab="p-hat")


Section 2.2

Question 2.2a

Population parameter:
p = the true proportion of Florida voters who support the Republican ticket.

Sample statistic:
p̂ = the proportion of voters in the poll sample who support the Republican ticket.

Sampling method:
A random representative sample of n = 806 Florida voters.

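Assumptions required for the normal approximation to the sampling distribution of p̂:

  1. Independence: the voters are sampled at random, and n = 806 is far less than 10% of all Florida voters.
  2. Success-failure condition: the expected numbers of successes and failures are both at least 10. Here

n p̂ = 806(0.48) = 386.88
n(1 − p̂) = 806(0.52) = 419.12

are both far greater than 10, so the normal approximation is appropriate.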
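
The success-failure check above can be written as a short sketch in R, using the poll values p̂ = 0.48 and n = 806. The variable names are chosen here for illustration.

```r
phat <- 0.48
n <- 806

successes <- n * phat         # expected count supporting the ticket
failures  <- n * (1 - phat)   # expected count not supporting it

c(successes = successes, failures = failures)

# Both counts should be at least 10 for the normal approximation
successes >= 10 && failures >= 10
```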

Question 2.2b

We compute a confidence interval using

p̂ ± z × √(p̂(1 − p̂) / n)

Using p̂ = 0.48, n = 806, and critical values z ≈ 2.326 (98%) and z ≈ 1.645 (90%):

98% confidence interval: (0.439, 0.521)
90% confidence interval: (0.451, 0.509)

These intervals mean that if we repeatedly took random samples of size 806 and constructed confidence intervals in this way, about 98% (or 90%) of those intervals would contain the true population proportion.

Question 2.2c

Two reasons the poll estimate may differ from the true population value:

  1. Sampling variability
    Random samples naturally fluctuate around the true population proportion.

  2. Polling bias
    The sample might not perfectly represent the population due to nonresponse bias, selection bias, or measurement error.

Question 2.2d

The sampling distribution of the sample proportion is approximately

p̂ ~ Normal(p, √(p(1 − p) / n))

where
p = 0.512 and n = 806.

Question 2.2e

Using the true population proportion p = 0.512, we compute the range that contains the middle 98% of possible sample estimates, with z ≈ 2.326:

p ± z × √(p(1 − p) / n)

This produces the range

(0.471, 0.553)

We then check whether the observed sample proportion p̂ = 0.48 falls inside this range.

Question 2.2f

Since the observed sample proportion 0.48 lies inside the 98% range (0.471, 0.553), the poll result is not surprising under random sampling.
This means the observed difference between the poll estimate and the true population value could reasonably occur due to normal sampling variability.

Question 2.2g

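To determine the sample size required for a margin of error of 0.01 we use

n = (z² × 0.25) / (0.01²)

using the conservative estimate p = 0.5, which maximizes p(1 − p) at 0.25.
At 95% confidence (z ≈ 1.96, the level used in the R code for this section), this gives n = 9604 after rounding up.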

phat <- 0.48
n <- 806

SE <- sqrt(phat*(1-phat)/n)

z98 <- qnorm(0.99)
z90 <- qnorm(0.95)

CI_98 <- phat + c(-1,1)*z98*SE
CI_90 <- phat + c(-1,1)*z90*SE

CI_98
## [1] 0.4390617 0.5209383
CI_90
## [1] 0.4510544 0.5089456
p <- 0.512
SE_true <- sqrt(p*(1-p)/n)

range98 <- p + c(-1,1)*z98*SE_true
range98
## [1] 0.4710407 0.5529593
phat_obs <- 0.48
phat_obs >= range98[1] & phat_obs <= range98[2]
## [1] TRUE
ME <- 0.01
z95 <- qnorm(0.975)

n_needed <- ceiling((z95^2*0.25)/(ME^2))
n_needed
## [1] 9604

Section 2.3

Question 2.3a

First we compute the sample correlation between miles per gallon (mpg) and car weight (wt) using the mtcars dataset.

The correlation coefficient measures the strength and direction of the linear relationship between the two variables.
The sample correlation is r ≈ −0.868, a strong negative value: as weight increases, fuel efficiency tends to decrease.

Question 2.3b

To understand the variability of the correlation estimate, we use bootstrap resampling.
Bootstrapping repeatedly samples rows from the dataset with replacement and recomputes the statistic of interest (in this case the correlation).

This allows us to approximate the sampling distribution of the correlation coefficient without needing additional real data.

Question 2.3c

The histogram of the bootstrap correlations approximates the sampling distribution of the correlation between mpg and wt.

If the histogram appears roughly symmetric and bell shaped, it suggests that the sampling distribution of the correlation is approximately normal.

Question 2.3d

We compute the mean and standard deviation of the bootstrap correlations: the mean is about −0.870 and the standard deviation is about 0.035.

Question 2.3e

To evaluate a claim about the population correlation, we shift the bootstrap distribution so that its center is −0.85 and compute a z-score for the observed sample correlation.

The z-score measures how many standard deviations the observed correlation is from the hypothesized population value.

Question 2.3f

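If the observed correlation is many standard deviations away from −0.85, the claim that the true correlation equals −0.85 would be unlikely.
If the observed value lies within a reasonable range of the bootstrap distribution (roughly within 2 standard deviations), then the claim is consistent with the data.
Here the z-score is about −0.51, so the observed correlation of −0.868 is only about half a standard deviation from −0.85, and the claim is consistent with the data.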
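
One way to make "a reasonable range" concrete is sketched below. It reuses the observed correlation and bootstrap SD reported in the R output for this section. The two-sided p-value is an addition for illustration and assumes the bootstrap distribution is roughly normal.

```r
r_obs   <- -0.8676594   # observed correlation, from the output in this section
boot_sd <- 0.03465381   # bootstrap SD, from the output in this section
r_claim <- -0.85        # hypothesized population correlation

# Number of bootstrap SDs between the observed and claimed values
z <- (r_obs - r_claim) / boot_sd

# Two-sided p-value under an assumed normal approximation
p_two_sided <- 2 * pnorm(-abs(z))

z
p_two_sided

# TRUE means the claim is not rejected at the 5% level
abs(z) < qnorm(0.975)
```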


# observed sample correlation
cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
# bootstrap resampling
set.seed(202)
B <- nrow(mtcars)  # number of rows (cars) in the dataset
N <- 1000          # number of bootstrap resamples

versions <- lapply(1:N, function(i) {
  mtcars[sample(1:B, B, replace = TRUE), ]
})

corrs <- unlist(lapply(versions, function(df) {
  cor(df$mpg, df$wt)
}))

# histogram of bootstrap correlations
hist(corrs,
     main="Bootstrap Sampling Distribution of Correlation",
     xlab="Correlation")

# mean and standard deviation
mean(corrs)
## [1] -0.8702986
sd(corrs)
## [1] 0.03465381
# shift distribution so that its mean is -0.85 (shifting leaves the SD unchanged)
corrs_shifted <- corrs - mean(corrs) + (-0.85)

# compute z-score
r_obs <- cor(mtcars$mpg, mtcars$wt)
z_score <- (r_obs - (-0.85)) / sd(corrs)

z_score
## [1] -0.5095941