HW03

2024-02-04

Interval Estimation

Introduction: Welcome to the realm of statistical estimation, where uncertainty meets precision. Interval estimation plays a crucial role in providing a holistic understanding of population parameters. Unlike point estimates, which offer a single value as our “best guess,” interval estimation goes a step further, acknowledging the inherent variability in our estimations.

Why Interval Estimation? In statistical endeavors, our aim is to capture the essence of the true population parameter. However, relying solely on point estimates can be deceptive. Interval estimation comes to our rescue by addressing the need to quantify the uncertainty associated with our estimations. It provides a range of values within which the true parameter is likely to reside.

Confidence Intervals: A prominent technique within interval estimation is the construction of confidence intervals. These intervals offer a statistical safety net, giving us a plausible range around our point estimate.

Confidence Intervals and their components

The choice of a confidence level, often expressed as a percentage (e.g., 95% or 99%), reflects our willingness to embrace uncertainty. A 95% confidence interval suggests that, in repeated sampling, we expect about 95% of these intervals to contain the true parameter. The key components that construct the framework for reliable statistical estimation are given below:

Point Estimate: At the heart of every confidence interval lies the point estimate. This is our initial best guess, often represented by the sample mean ($\bar{x}$) or the sample proportion ($\hat{p}$).

Margin of Error: The margin of error is the statistical wiggle room we allow around our point estimate. It accounts for the variability inherent in sampling. A larger margin of error indicates greater uncertainty, while a smaller one suggests a more precise estimate.

Critical Value (z): Determining the critical value is akin to setting the boundaries of our confidence. It depends on the chosen confidence level (e.g., 1.96 for a 95% confidence level). This value is derived from the standard normal distribution or other distributions depending on the context.
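The relationship between confidence level, critical value, and margin of error can be sketched in a few lines of R. This is a minimal illustration: the three confidence levels and the `standard_error` value of 0.5 are hypothetical choices made only to show how the numbers scale.

```r
# Confidence levels to compare (illustrative choices)
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical value: the (1 + level) / 2 quantile of the standard normal
z_values <- qnorm((1 + conf_levels) / 2)

# Hypothetical standard error, used only to show how the margin of error scales
standard_error <- 0.5
margin_of_error <- z_values * standard_error

print(data.frame(conf_level = conf_levels,
                 z = round(z_values, 3),
                 margin_of_error = round(margin_of_error, 3)))
```

The critical values come out at roughly 1.645, 1.960, and 2.576, so a higher confidence level widens the margin of error for the same data.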

Confidence Intervals for mean

At the heart of constructing a confidence interval for the mean lies the sample mean, with the margin of error extending on either side of it. The sample mean serves as our best guess. The margin of error, which is the critical value multiplied by the standard error of the mean, captures the sampling variability and sets the width of the interval. Together, they empower us to make informed statistical inferences.

It is crucial to know whether the population standard deviation, which is a population parameter rather than a point estimate, is available. This determines which of two types of confidence interval we calculate: the z-interval or the t-interval. If the population standard deviation is known, we use the z-interval; if it is unknown and must be estimated by the sample standard deviation, we use the t-interval.
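To see why the distinction matters, the sketch below compares the two critical values at a 95% confidence level. The sample sizes in `n` are hypothetical and chosen only for illustration.

```r
conf_level <- 0.95
alpha <- 1 - conf_level

# z critical value (used when the population SD is known)
z_crit <- qnorm(1 - alpha / 2)

# t critical values (used when the population SD is unknown),
# for several illustrative sample sizes
n <- c(5, 10, 30, 100, 1000)
t_crit <- qt(1 - alpha / 2, df = n - 1)

print(data.frame(n = n, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3)))
```

The t critical value is always larger than the z critical value, which makes the t-interval wider, and the gap shrinks as the sample size grows.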

Now let's see the formulas for each case: when the population standard deviation is known and when it is not.

Formulas of z-interval

Understanding z-intervals for the mean.
In the realm of Confidence Intervals for the Mean, the lower and upper bounds play a crucial role in defining the range within which we can reasonably expect the true population mean to lie. Let’s break down the components of the lower and upper bound formulas:

\[ \text{Lower Bound} = \bar{x} - z \times \frac{\sigma}{\sqrt{n}} \]

\[ \text{Upper Bound} = \bar{x} + z \times \frac{\sigma}{\sqrt{n}} \]

$\bar{x}$ : sample mean

z : critical value

$\sigma$ : population standard deviation (known)

$\frac{\sigma}{\sqrt{n}}$ : standard error of the mean
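As a rough sketch, the z-interval bounds can be wrapped in a small helper function. The function name `z_interval`, its argument names, and the input values are all hypothetical; the population standard deviation is assumed known and passed in as `sigma`.

```r
# z-interval for a mean, assuming the population SD (sigma) is known
z_interval <- function(xbar, sigma, n, conf_level = 0.95) {
  z <- qnorm((1 + conf_level) / 2)   # two-sided critical value
  margin <- z * sigma / sqrt(n)      # critical value times standard error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Hypothetical inputs: sample mean 50, known SD 10, sample size 100
ci_z <- z_interval(xbar = 50, sigma = 10, n = 100)
print(round(ci_z, 2))
```

With these inputs the standard error is 1, so the interval is roughly 50 plus or minus 1.96.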

Formulas of t-interval

Understanding T-Intervals for the Mean.
In the realm of Confidence Intervals for the Mean, T-Intervals provide a robust alternative, especially when dealing with smaller sample sizes or unknown population standard deviations.

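\[ \text{Lower Bound} = \bar{x} - t_{\frac{\alpha}{2}, n-1} \times \frac{s}{\sqrt{n}} \]

\[ \text{Upper Bound} = \bar{x} + t_{\frac{\alpha}{2}, n-1} \times \frac{s}{\sqrt{n}} \]

$\bar{x}$ : sample mean

$t_{\frac{\alpha}{2}, n-1}$ : critical value from the t-distribution with $n-1$ degrees of freedom

$s$ : sample standard deviation

$\frac{s}{\sqrt{n}}$ : standard error of the mean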
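The formulas above can be checked against R's built-in `t.test()`. The helper name `t_interval` and the data vector `x` below are hypothetical and serve only as an illustration.

```r
# t-interval for a mean, computed directly from the formulas
t_interval <- function(x, conf_level = 0.95) {
  n <- length(x)
  xbar <- mean(x)
  se <- sd(x) / sqrt(n)                           # standard error of the mean
  t_crit <- qt((1 + conf_level) / 2, df = n - 1)  # t critical value, n - 1 df
  c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
}

# Hypothetical small sample
x <- c(12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 12.5)

ci_manual <- t_interval(x)
ci_builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_builtin)
```

Both approaches should return the same bounds, since `t.test()` uses the same one-sample t-interval.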

ggplot on z-interval for mean

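Here, we visualize the confidence interval for the mean eruption duration of the Old Faithful geyser. In this example, the standard deviation of the data is used as a stand-in for the population standard deviation, which is treated as known. From the plot, we are 95% confident that the true mean eruption duration lies between 3.35 and 3.62 minutes. Note that this is a statement about the mean, not about where 95% of individual eruptions fall.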

code for the z-interval ggplot

library(ggplot2)
data(faithful)
sample_data <- faithful$eruptions
# The SD of the data stands in for the "known" population SD
population_sd <- sd(sample_data)
sample_mean <- mean(sample_data)

confidence_level <- 0.95
# Two-sided critical value from the standard normal distribution
z <- qnorm((1 + confidence_level) / 2)
margin_of_error <- z * (population_sd / sqrt(length(sample_data)))

lower_bound <- sample_mean - margin_of_error
upper_bound <- sample_mean + margin_of_error

# One row per bar in the plot
plot_data_z_known_sd <- data.frame(x = c("Sample Mean", "Lower Bound", "Upper Bound"), y = c(sample_mean, lower_bound, upper_bound))

cont. of code for z-interval

ggplot(plot_data_z_known_sd, aes(x, y)) +
  geom_bar(stat = "identity", fill = c("blue", "lightgray", "lightgray"), width = 0.5) +
  geom_errorbar(aes(ymin = lower_bound, ymax = upper_bound), width = 0.2, color = "red") +
  labs(title = "Z-Interval for the Mean (Known Population SD)", y = "Values",
       caption = sprintf("95%% Confidence Interval: (%.2f, %.2f)", lower_bound, upper_bound))

ggplot on t-interval for mean

Here, we visualize the confidence interval for the mean miles per gallon of the cars in the mtcars dataset. The t-interval provides a range within which we can reasonably expect the true population mean to lie. From the plot, we are 95% confident that the true mean miles per gallon lies between 17.92 and 22.26. This is a statement about the mean, not about where 95% of individual car models fall.

code for the t-interval ggplot

library(ggplot2)
data(mtcars)
sample_data <- mtcars$mpg

# t.test() returns the 95% t-interval for the mean by default
t_interval <- t.test(sample_data)$conf.int
plot_data_t <- data.frame(x = c("Sample Mean", "Lower Bound", "Upper Bound"), y = c(mean(sample_data), t_interval[1], t_interval[2]))

ggplot(plot_data_t, aes(x, y)) +
  geom_bar(stat = "identity", fill = c("blue", "lightgray", "lightgray"), width = 0.5) +
  geom_errorbar(aes(ymin = t_interval[1], ymax = t_interval[2]), width = 0.2, color = "red") +
  labs(title = "T-Interval for the Mean", y = "Values",
       caption = sprintf("95%% Confidence Interval: (%.2f, %.2f)", t_interval[1], t_interval[2]))

Proportion and Confidence Intervals for proportion

The sample proportion, $\hat{p}$, is a statistic that represents the proportion of successes or occurrences in a sample with respect to a specific binary outcome. It is an estimate of the true population proportion based on the information observed in the sample.

Understanding the Concept
When we collect a sample from a population to estimate a proportion (e.g., the proportion of successes in a binary outcome), we often want to quantify the uncertainty surrounding our estimate. This is where confidence intervals come into play. A confidence interval provides a range of values that is likely to contain the true proportion with a certain level of confidence.

Construction of Confidence Intervals for proportion

The construction of a confidence interval for a proportion involves using sample data to calculate the sample proportion $\hat{p}$, and then applying a formula that accounts for the variability in the estimate. The margin of error is determined by the standard error of the proportion, and the confidence interval is calculated as $\hat{p} \pm z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, where z is the critical value from the standard normal distribution and n is the sample size.

Interpreting the Interval
A 95% confidence interval, for example, suggests that if we were to take many random samples and construct intervals using the same method, we would expect approximately 95% of those intervals to contain the true population proportion. It quantifies the uncertainty associated with our estimate, offering a more informative picture of the underlying population parameter.
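The repeated-sampling interpretation can be illustrated with a small simulation. This is only a sketch: the true proportion `true_p`, the sample size `n`, the number of simulated samples `n_sims`, and the seed are all hypothetical choices.

```r
set.seed(1)        # arbitrary seed, for reproducibility

true_p <- 0.3      # hypothetical true population proportion
n <- 200           # size of each simulated sample
n_sims <- 2000     # number of simulated samples
z <- qnorm(0.975)  # critical value for a 95% interval

# Draw the number of successes in each simulated sample
successes <- rbinom(n_sims, size = n, prob = true_p)
p_hat <- successes / n

# Build the interval for every sample and check whether it contains true_p
se <- sqrt(p_hat * (1 - p_hat) / n)
covered <- (p_hat - z * se <= true_p) & (true_p <= p_hat + z * se)

coverage <- mean(covered)
print(coverage)
```

The printed share of intervals that contain `true_p` should land close to 0.95, though it will vary slightly with the seed and the settings.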

Formulas of C.I. for proportion

Understanding the CI for proportion.
In the realm of Confidence Intervals for Proportions, the lower and upper bounds are crucial in defining the range within which we can reasonably expect the true population proportion to lie. Let’s break down the components of the lower bound formula:

\[ \text{Lower Bound} = \hat{p} - z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

\[ \text{Upper Bound} = \hat{p} + z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

$\hat{p}$ : sample proportion

z : critical value

$\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$ : standard error of proportion
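A minimal sketch of these formulas as a helper function is given below. The name `prop_interval`, its arguments, and the counts in the example are hypothetical.

```r
# Confidence interval for a proportion, following the formulas above
prop_interval <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n
  z <- qnorm((1 + conf_level) / 2)       # critical value
  se <- sqrt(p_hat * (1 - p_hat) / n)    # standard error of the proportion
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# Hypothetical example: 60 successes out of 200 trials
ci_p <- prop_interval(successes = 60, n = 200)
print(round(ci_p, 4))
```

Here the sample proportion is 0.30 and the interval extends about 0.0635 on either side of it.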

plotly plot on intervals for proportion

Here, we visualize the confidence interval for the proportion of the setosa species in the iris dataset. From the plot, we are 95% confident that the true proportion of setosa among all the species lies between about 0.26 and 0.41.

code for the plot on CI for proportion

library(plotly)
data(iris)

iris$binary_variable <- ifelse(iris$Species == "setosa", 1, 0)
sample_proportion <- mean(iris$binary_variable)
confidence_level <- 0.95
z <- qnorm((1 + confidence_level) / 2)
se_proportion <- sqrt((sample_proportion * (1 - sample_proportion)) / nrow(iris))
margin_of_error <- z * se_proportion

lower_bound <- sample_proportion - margin_of_error
upper_bound <- sample_proportion + margin_of_error

cont. code for CI for proportion

plot_ly() %>%
  add_trace(
    type = "bar",
    x = c("Sample Proportion", "Lower Bound", "Upper Bound"),
    y = c(sample_proportion, lower_bound, upper_bound),
    marker = list(color = c("blue", "lightgray", "lightgray")),
    error_y = list(type = "data", array = c(0, margin_of_error, margin_of_error))
  ) %>%
  layout(
    title = "Confidence Interval for Proportion of Setosa Species in Iris Dataset",
    yaxis = list(title = "Values"),
    showlegend = FALSE
  )