Essentials of Probability

Assignment Week 13


1 Introduction

A confidence interval is a fundamental tool in inferential statistics for estimating population parameters from sample data. Rather than providing a single point estimate, a confidence interval offers a range of values that is likely to contain the true population parameter at a specified confidence level. This approach allows analysts to quantify the uncertainty associated with statistical estimates.

In data analytics and decision-making processes, confidence intervals play an important role in evaluating the reliability of estimated means, proportions, and experimental outcomes. Different confidence levels, such as 90%, 95%, and 99%, reflect varying degrees of certainty, where higher confidence levels generally result in wider intervals.

The case studies in this assignment demonstrate the application of confidence intervals across multiple real-world scenarios, including mean estimation with known and unknown population standard deviations, proportion estimation in A/B testing, precision comparison between Z-based and t-based intervals, and the use of one-sided confidence intervals for business target evaluation. Through these analyses, this study aims to enhance understanding of confidence interval concepts and their practical implications in data-driven decision making.

2 Case Study 1

2.1 Task 1

The appropriate statistical method for this analysis is a confidence interval for the population mean using the Z-distribution. This approach is suitable because the population standard deviation is known and the sample size is large (\(n = 100\)). Under these conditions, the central limit theorem implies that the sampling distribution of the sample mean is approximately normal, allowing the use of critical values from the standard normal distribution.

2.2 Task 2

This task computes the confidence interval for the population mean using the Z-distribution. Since the population standard deviation is known, the Z-based confidence interval formula is applied. The interval provides a range of plausible values for the true mean at different confidence levels.

The confidence interval for the population mean with known population standard deviation is given by:

\[ \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \]

Where:

\(\bar{x}\) : sample mean.
\(z_{\alpha/2}\) : critical value from the standard normal distribution corresponding to the chosen confidence level.
\(\sigma\) : population standard deviation (used for the Z-interval).
\(n\) : sample size.

\[ \bar{x} = 12.6,\quad \sigma = 3.2,\quad n = 100 \]

The standard error is: \[ \frac{3.2}{\sqrt{100}} = 0.32 \]

The resulting confidence intervals are:

\[ CI_{90\%} = 12.6 \pm 1.645 \times 0.32 \]

\[ CI_{95\%} = 12.6 \pm 1.96 \times 0.32 \]

\[ CI_{99\%} = 12.6 \pm 2.576 \times 0.32 \]
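Carrying out the arithmetic gives the numeric limits below. This is only an evaluation of the expressions above with the stated values, rounded to three decimals:

\[ CI_{90\%} = 12.6 \pm 0.526 = (12.074,\ 13.126) \]

\[ CI_{95\%} = 12.6 \pm 0.627 = (11.973,\ 13.227) \]

\[ CI_{99\%} = 12.6 \pm 0.824 = (11.776,\ 13.424) \]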

2.3 Task 3

The visualization illustrates how the width of the confidence interval changes across different confidence levels. As the confidence level increases, the interval becomes wider, because a larger critical value is needed to capture the population mean with higher confidence. The gain in confidence comes at the cost of precision.

# Given values
mean <- 12.6
sigma <- 3.2
n <- 100
se <- sigma / sqrt(n)

# Z values and confidence levels
z_values <- c(1.645, 1.96, 2.576)
levels <- c("90%", "95%", "99%")

# Confidence intervals
lower <- mean - z_values * se
upper <- mean + z_values * se

# Data frame
ci_data <- data.frame(
  Level = levels,
  Lower = lower,
  Upper = upper
)

# Visualization (ggplot2 new standard)
library(ggplot2)

ggplot(ci_data, aes(y = Level, x = mean)) +
  geom_point(size = 3) +
  geom_errorbar(
    aes(xmin = Lower, xmax = Upper),
    orientation = "y",
    width = 0.2
  ) +
  labs(
    title = "Comparison of Confidence Intervals for the Mean",
    x = "Average Daily Transactions",
    y = "Confidence Level"
  ) +
  theme_minimal()
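The critical values hard-coded in z_values can also be derived from the standard normal quantile function. The following is a minimal sketch using base R's qnorm(); the names conf_levels, alpha, and z_two_sided are illustrative and do not appear in the original code.

```r
# Two-sided critical values z_{alpha/2} for each confidence level.
# conf_levels, alpha and z_two_sided are illustrative names (assumptions).
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels

# qnorm() returns the standard normal quantile; the upper alpha/2 tail
# corresponds to the cumulative probability 1 - alpha/2.
z_two_sided <- qnorm(1 - alpha / 2)

round(z_two_sided, 3)
# 1.645 1.960 2.576
```

Deriving the values this way avoids transcription errors and extends directly to other confidence levels.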

2.4 Task 4

Interpretation

At the 95% confidence level, the true average number of daily transactions per user is estimated to lie between roughly 11.97 and 13.23. The interval is narrow relative to the sample mean of 12.6, so the platform can estimate user transaction behavior with good precision. Higher confidence levels result in wider intervals, reflecting increased assurance at the cost of reduced precision. These results provide valuable insight for evaluating the impact of newly launched features on user engagement.

3 Case Study 2

3.1 Task 1

The appropriate statistical method for this case is a confidence interval for the population mean using the Student’s t-distribution. This method is required because the population standard deviation is unknown and the sample size is small. Assuming the underlying population is approximately normal, the standardized sample mean follows a t-distribution with \(n-1\) degrees of freedom.

3.2 Task 2

This task computes the confidence interval for the population mean using the Student’s t-distribution. Since the population standard deviation is unknown, the sample standard deviation is used to estimate variability. The resulting interval reflects additional uncertainty compared to a Z-based interval.

The confidence interval for the population mean with unknown population standard deviation is given by:

\[ \bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}} \]

Where:

\(\bar{x}\) : sample mean.
\(t_{\alpha/2,\,n-1}\) : critical value from the t-distribution with \(n-1\) degrees of freedom.
\(s\) : sample standard deviation (used for the t-interval).
\(n\) : sample size.

\[ \bar{x} = 68,\quad s = 5,\quad n = 12 \]

The standard error is: \[ \frac{5}{\sqrt{12}} \approx 1.443 \]

The resulting confidence intervals are:

\[ CI_{90\%} = 68 \pm t_{0.05,11} \times 1.443 \]

\[ CI_{95\%} = 68 \pm t_{0.025,11} \times 1.443 \]

\[ CI_{99\%} = 68 \pm t_{0.005,11} \times 1.443 \]
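To make the intervals concrete, the critical values for 11 degrees of freedom can be substituted. The values below are rounded figures as returned by qt() or a standard t-table, and the limits are rounded to two decimals:

\[ t_{0.05,11} \approx 1.796,\quad t_{0.025,11} \approx 2.201,\quad t_{0.005,11} \approx 3.106 \]

\[ CI_{90\%} \approx 68 \pm 2.59 = (65.41,\ 70.59) \]

\[ CI_{95\%} \approx 68 \pm 3.18 = (64.82,\ 71.18) \]

\[ CI_{99\%} \approx 68 \pm 4.48 = (63.52,\ 72.48) \]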

3.3 Task 3

The visualization demonstrates how the confidence interval widens as the confidence level increases. Because t critical values with 11 degrees of freedom are larger than the corresponding standard normal values, each interval is also wider than a Z-based interval with the same standard error would be.

# Given values
mean <- 68
s <- 5
n <- 12
se <- s / sqrt(n)
df <- n - 1

# t values
t_values <- c(
  qt(0.95, df),
  qt(0.975, df),
  qt(0.995, df)
)

levels <- c("90%", "95%", "99%")

# Confidence intervals
lower <- mean - t_values * se
upper <- mean + t_values * se

# Data frame
ci_data <- data.frame(
  Level = levels,
  Lower = lower,
  Upper = upper
)

# Plot
library(ggplot2)

ggplot(ci_data, aes(y = Level, x = mean)) +
  geom_point(size = 3) +
  geom_errorbar(
    aes(xmin = Lower, xmax = Upper),
    orientation = "y",
    width = 0.2
  ) +
  labs(
    title = "Confidence Intervals for the Mean (t-Distribution)",
    x = "Average Score",
    y = "Confidence Level"
  ) +
  theme_minimal()

3.4 Task 4

Interpretation

Using the t-distribution, the confidence intervals account for additional uncertainty caused by the unknown population standard deviation and small sample size. As the confidence level increases, the interval becomes wider, reflecting greater certainty at the expense of precision. These intervals provide a reasonable estimate of the true population mean while acknowledging sampling variability.

4 Case Study 3

4.1 Task 1

The appropriate method for this analysis is a confidence interval for a population proportion using the normal approximation (Z-distribution). This approach is suitable because the sample size is sufficiently large, allowing the sampling distribution of the sample proportion to be approximated by a normal distribution.
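Whether the sample is "sufficiently large" can be checked with a common rule of thumb: both the number of successes and the number of failures should be at least 10. The sketch below assumes the example counts used in the Task 3 code (x = 120, n = 500); the threshold of 10 is a convention (some texts use 5), and the names successes, failures, and approx_ok are illustrative.

```r
# Example counts taken from the Task 3 code (illustrative values).
x <- 120   # number of successes
n <- 500   # sample size

# Rule-of-thumb check for the normal approximation:
# n * p_hat = x successes and n * (1 - p_hat) = n - x failures.
successes <- x
failures <- n - x

# The threshold 10 is a convention, not a strict requirement.
approx_ok <- successes >= 10 && failures >= 10
approx_ok
# TRUE
```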

4.2 Task 2

This task calculates the confidence interval for a population proportion based on the sample data. The interval is constructed using a normal approximation and provides a range of plausible values for the true population proportion at different confidence levels.

The confidence interval for a population proportion is given by:

\[ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

with: \[ \hat{p} = \frac{x}{n} \]

Where:

\(\hat{p}\) : sample proportion, used as an estimate of the population proportion.
\(x\) : number of observations with the characteristic of interest.
\(n\) : sample size.
\(z_{\alpha/2}\) : critical value from the standard normal distribution corresponding to the chosen confidence level.
\(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\) : standard error of the sample proportion.

The confidence intervals for different confidence levels are:

\[ CI_{90\%} = \hat{p} \pm 1.645 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

\[ CI_{95\%} = \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

\[ CI_{99\%} = \hat{p} \pm 2.576 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
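As a worked illustration, the example counts from the Task 3 code (\(x = 120\), \(n = 500\), marked there as example values) can be substituted into these formulas. The results are rounded to four decimals:

\[ \hat{p} = \frac{120}{500} = 0.24,\quad \sqrt{\frac{0.24 \times 0.76}{500}} \approx 0.0191 \]

\[ CI_{90\%} \approx 0.24 \pm 0.0314 = (0.2086,\ 0.2714) \]

\[ CI_{95\%} \approx 0.24 \pm 0.0374 = (0.2026,\ 0.2774) \]

\[ CI_{99\%} \approx 0.24 \pm 0.0492 = (0.1908,\ 0.2892) \]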

4.3 Task 3

The plot illustrates the confidence intervals for the population proportion across multiple confidence levels. Higher confidence levels result in wider intervals, since a larger critical value is required to achieve the higher coverage.

# Given values
x <- 120     # number of successes (example)
n <- 500     # total users
p_hat <- x / n

# Standard error
se <- sqrt(p_hat * (1 - p_hat) / n)

# Z values and levels
z_values <- c(1.645, 1.96, 2.576)
levels <- c("90%", "95%", "99%")

# Confidence intervals
lower <- p_hat - z_values * se
upper <- p_hat + z_values * se

# Data frame
ci_data <- data.frame(
  Level = levels,
  Lower = lower,
  Upper = upper
)

# Plot
library(ggplot2)

ggplot(ci_data, aes(y = Level, x = p_hat)) +
  geom_point(size = 3) +
  geom_errorbar(
    aes(xmin = Lower, xmax = Upper),
    orientation = "y",
    width = 0.2
  ) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(
    title = "Confidence Intervals for Conversion Rate",
    x = "Conversion Rate",
    y = "Confidence Level"
  ) +
  theme_minimal()

4.4 Task 4

Interpretation

The confidence intervals indicate the range in which the true conversion rate is likely to fall at different confidence levels. Higher confidence levels produce wider intervals, reflecting increased certainty but reduced precision. These results help product teams evaluate the reliability of A/B testing outcomes before making data-driven decisions.

5 Case Study 4

5.1 Task 1

Team A uses a confidence interval for the population mean based on the Z-distribution because the population standard deviation is known. Team B uses a confidence interval for the population mean based on the Student’s t-distribution because the population standard deviation is unknown and is estimated using the sample standard deviation.

5.2 Task 2

This task computes confidence intervals for the population mean using both the Z-distribution and the t-distribution. Although the sample statistics are identical, the resulting intervals differ in width due to the underlying distributional assumptions.

For Team A (Z-distribution), the confidence interval is:

\[ \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \]

For Team B (t-distribution), the confidence interval is:

\[ \bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}} \]

Where:

\(\bar{x}\) represents the sample mean.

\(n\) denotes the sample size.

\(\sigma\) is the population standard deviation (used for the Z-interval).

\(s\) is the sample standard deviation (used for the t-interval).

\(\frac{\sigma}{\sqrt{n}}\) or \(\frac{s}{\sqrt{n}}\) represents the standard error of the mean.

\(z_{\alpha/2}\) is the critical value from the standard normal distribution.

\(t_{\alpha/2,\,n-1}\) is the critical value from the t-distribution with \(n-1\) degrees of freedom.

\[ \bar{x} = 210,\quad n = 36,\quad \sigma = 24,\quad s = 24 \]
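Substituting these values gives a common standard error of \(24/\sqrt{36} = 4\) for both teams. The limits below are a worked evaluation of the two formulas, using rounded critical values (the t values for 35 degrees of freedom are approximate figures as returned by qt()):

\[ t_{0.05,35} \approx 1.690,\quad t_{0.025,35} \approx 2.030,\quad t_{0.005,35} \approx 2.724 \]

\[ \text{Team A (Z):}\quad CI_{90\%} = (203.42,\ 216.58),\quad CI_{95\%} = (202.16,\ 217.84),\quad CI_{99\%} = (199.70,\ 220.30) \]

\[ \text{Team B (t):}\quad CI_{90\%} \approx (203.24,\ 216.76),\quad CI_{95\%} \approx (201.88,\ 218.12),\quad CI_{99\%} \approx (199.10,\ 220.90) \]

At every confidence level the t-interval is slightly wider, and the gap is small because 35 degrees of freedom already bring the t-distribution close to the standard normal.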

5.3 Task 3

The visualization compares the widths of Z-based and t-based confidence intervals at various confidence levels. The plot highlights that t-intervals are wider, reflecting greater uncertainty when the population standard deviation is unknown.

# Given values
mean <- 210
n <- 36
sigma <- 24
s <- 24
se <- sigma / sqrt(n)
df <- n - 1

# Z and t values
z_values <- c(1.645, 1.96, 2.576)
t_values <- c(
  qt(0.95, df),
  qt(0.975, df),
  qt(0.995, df)
)

levels <- c("90%", "95%", "99%")

# Confidence intervals
lower_z <- mean - z_values * se
upper_z <- mean + z_values * se

# The t-interval uses the sample standard deviation s (here equal to sigma)
lower_t <- mean - t_values * s / sqrt(n)
upper_t <- mean + t_values * s / sqrt(n)

# Data frame
ci_data <- data.frame(
  Method = rep(c("Z-interval", "t-interval"), each = 3),
  Level = rep(levels, times = 2),
  Lower = c(lower_z, lower_t),
  Upper = c(upper_z, upper_t)
)

# Plot
library(ggplot2)

ggplot(ci_data, aes(y = Level, x = mean, color = Method)) +
  geom_point(size = 3) +
  geom_errorbar(
    aes(xmin = Lower, xmax = Upper),
    orientation = "y",
    width = 0.2
  ) +
  labs(
    title = "Comparison of Z and t Confidence Intervals",
    x = "API Latency (ms)",
    y = "Confidence Level"
  ) +
  theme_minimal()

5.4 Task 4

Explanation

Although both teams report identical sample means, sample sizes, and standard deviation values, the confidence intervals differ in width due to the underlying distribution used. The t-distribution accounts for additional uncertainty when the population standard deviation is unknown and must be estimated, resulting in wider intervals compared to the Z-distribution. This demonstrates how distributional assumptions influence the precision of statistical estimates.

6 Case Study 5

6.1 Task 1

This problem uses a one-sided confidence interval for a population proportion. The appropriate method is a Z-based confidence interval for proportions, focusing on the lower bound only.

6.2 Task 2

This task computes the lower bound of a one-sided confidence interval for the population proportion. Unlike a two-sided interval, this calculation focuses solely on the minimum plausible value of the proportion at a given confidence level.

\[ \hat{p} - z_{\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

\[ \hat{p} = \frac{x}{n} \]

Where:

\(\hat{p}\) : sample proportion, used as an estimate of the population proportion.
\(x\) : number of observations with the characteristic of interest.
\(n\) : sample size.
\(z_{\alpha}\) : one-sided critical value from the standard normal distribution corresponding to the chosen confidence level.
\(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\) : standard error of the sample proportion.
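As a worked illustration, the counts used in the Task 3 code (\(x = 185\), \(n = 250\)) can be substituted into the lower-bound formula. The one-sided critical values are rounded standard normal quantiles, and the bounds are rounded to four decimals:

\[ \hat{p} = \frac{185}{250} = 0.74,\quad \sqrt{\frac{0.74 \times 0.26}{250}} \approx 0.0277 \]

\[ z_{0.10} \approx 1.282,\quad z_{0.05} \approx 1.645,\quad z_{0.01} \approx 2.326 \]

\[ LB_{90\%} \approx 0.7044,\quad LB_{95\%} \approx 0.6944,\quad LB_{99\%} \approx 0.6755 \]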

6.3 Task 3

The visualization displays the lower confidence bounds across different confidence levels. The dashed reference line represents the target threshold, allowing for an assessment of whether the minimum requirement is satisfied.

# Data
n <- 250
x <- 185
p_hat <- x / n

# Z values for one-sided CI (upper-tail standard normal quantiles, 3 decimals)
z_values <- c(1.282, 1.645, 2.326)
levels <- c("90%", "95%", "99%")

# Lower bounds
lower_bounds <- p_hat - z_values * sqrt(p_hat * (1 - p_hat) / n)

# Data frame
ci_data <- data.frame(
  Confidence_Level = levels,
  Lower_Bound = lower_bounds
)

# Plot
library(ggplot2)

ggplot(ci_data, aes(x = Confidence_Level, y = Lower_Bound)) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0.7, linetype = "dashed") +
  labs(
    title = "One-Sided Lower Confidence Bounds for Premium Usage",
    y = "Lower Confidence Bound",
    x = "Confidence Level"
  ) +
  theme_minimal()

6.4 Task 4

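Only the 90% lower bound (about 0.704) lies above the 70% threshold; the 95% and 99% lower bounds (about 0.694 and 0.675) fall below it. The company can therefore conclude that the target is statistically satisfied at the 90% confidence level, but the data do not support that conclusion at the 95% or 99% level.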