October 24, 2025

Interval Estimation for a Population Proportion

Interval estimation is a type of statistical inference used to estimate a population’s true proportion. Because of sampling error (the amount by which one sample differs from another), a range of plausible values is used to represent the estimated proportion instead of a single point.

Four Steps:

  1. Identify \(\hat{p}\), n, and the confidence level of interest

  2. Verify the conditions, then use \(\hat{p}\) and n to approximate the mean and standard error of the sampling distribution

  3. Calculate the interval using a normal distribution curve and the margin of error

  4. Interpret

Step 1: Identifying

\(\hat{p}\) is the sample proportion and is considered a point estimate of the population proportion or the parameter represented by \(p\).

n is the sample’s size.

The confidence level is the percentage of intervals, built this way from repeated samples, that would include the population’s true proportion.

\(\mu\) is the mean of the entire population. \(\mu_\hat{p}\) is the mean of the sampling distribution of \(\hat{p}\).

\(SE\) is the standard error, the standard deviation of a sampling distribution. \(SE_\hat{p}\) is the standard error of the sample proportion \(\hat{p}\).

Graphical representation of a confidence level

Using a confidence level of 90% as an example: about 90% of sample proportions fall within 1.645 standard errors (z-scores) of the population proportion. This means that, of all the possible 90% confidence intervals constructed from a population’s sampling distribution, 90% will include the true proportion.
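The z-score that goes with a confidence level can be looked up with base R's `qnorm()`. The sketch below is only illustrative; the helper name `z_for_level` is hypothetical and not part of the original material.

```r
# Hypothetical helper: two-sided critical value for a given confidence level
z_for_level <- function(level) {
  # Split the leftover probability evenly between the two tails
  qnorm(1 - (1 - level) / 2)
}

z_for_level(0.90)  # about 1.645
z_for_level(0.95)  # about 1.960
```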

To demonstrate this, 10 samples were taken from a binomial distribution with size 20 and success rate of 0.5 and 90% confidence intervals were constructed. Of the 10 that were constructed, 9 include the true population proportion of 0.5, indicated by the red line.

library(plotly)
library(dplyr)
set.seed(5)

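# Draw 10 samples of size 20 from a population with true proportion 0.5
confidenceExample = data.frame(seperators = 1:10, values = rbinom(n = 10,
    size = 20, prob = 0.5))

# Build a 90% interval (z = 1.645) around each sample proportion; the
# standard error here uses the known true proportion of 0.5
confidenceLvls = confidenceExample %>%
    group_by(seperators) %>%
    summarise(proportion = values/20, lower = values/20 - 1.645 *
        sqrt((0.5 * (1 - 0.5))/20), higher = values/20 + 1.645 *
        sqrt((0.5 * (1 - 0.5))/20))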

examplePlot = plot_ly(confidenceLvls, x = ~seperators, y = ~proportion,
    type = "scatter", name = "Sample Proportion") %>%
    add_segments(x = ~seperators, xend = ~seperators, y = ~lower,
        yend = ~higher, line = list(color = "blue"), name = "90% Confidence Interval") %>%
    layout(shapes = list(type = "line", x0 = 0, x1 = 1, xref = "paper",
        y0 = 0.5, y1 = 0.5, line = list(color = "red", dash = "dot"))) %>%
    layout(title = "90% Confidence Intervals for a Population Proportion",
        xaxis = list(title = "Samples"), yaxis = list(title = "Predicted Proportion")) %>%
    layout(legend = list(x = 0, y = 1.025, orientation = "h"))
examplePlot
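Ten intervals give only a rough picture of what a confidence level means. As a hedged sketch, the same setup can be repeated many times to see what share of intervals contain the true proportion. The number of repetitions and the seed below are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

n_reps <- 10000   # number of simulated samples (an assumption)
n <- 20           # sample size, as in the plot
p <- 0.5          # true population proportion
z <- qnorm(0.95)  # critical value for a 90% interval

# One sample proportion per simulated sample
p_hat <- rbinom(n_reps, size = n, prob = p) / n

# Margin of error using the known-p standard error, as in the plot
moe <- z * sqrt(p * (1 - p) / n)

# TRUE when the interval p_hat +/- moe contains the true proportion
covered <- (p_hat - moe <= p) & (p <= p_hat + moe)
mean(covered)  # share of intervals that contain p
```

With n = 20 the sample proportion can take only 21 distinct values, so the long-run share lands a little under 90% rather than exactly on it.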

Step 2: Conditions

The Central Limit Theorem states that if the observations in the sample are independent and the sample is sufficiently large (checked with the success-failure condition), the sample proportion \(\hat{p}\) will have an approximately normal sampling distribution.

Given that the sample is independent, \(np\geq10\), and \(n(1-p)\geq10\), the theorem applies. The mean and standard error of the sampling distribution of \(\hat{p}\) can then be approximated from the sample, with \(\hat{p}\) standing in for the unknown \(p\).

\(\mu_\hat{p} = p \approx \hat{p}\)

\(SE_\hat{p} = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)

Another thing to consider is whether the sample is larger than 10% of the population. This rarely happens, but when sampling without replacement it weakens the independence assumption and can affect the standard error.
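The success-failure condition is easy to check in code. The following is a minimal sketch in which \(\hat{p}\) stands in for the unknown \(p\); the function name `check_success_failure` and its `threshold` argument are hypothetical.

```r
# Hypothetical helper: success-failure check using p_hat in place of p
check_success_failure <- function(p_hat, n, threshold = 10) {
  successes <- n * p_hat        # expected number of successes
  failures <- n * (1 - p_hat)   # expected number of failures
  successes >= threshold && failures >= threshold
}

check_success_failure(0.5, 20)    # TRUE: 10 successes and 10 failures
check_success_failure(0.5, 10)    # FALSE: only 5 of each
check_success_failure(0.05, 100)  # FALSE: only about 5 expected successes
```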

Graphical representation of sample size

Sample size is tested with the success-failure condition to ensure that the sample data has enough observations that inferences can be made about the sampling distribution using a normal distribution.

To demonstrate this, four binomial sampling distributions have been made below with 1000 samples each and a success rate of 0.5. The distributions differ by the size of the sample: n = 10, 20, 100, and 150. From the graphs, the greater the n value, the closer to normal and less discrete the distribution is, and the less variability it shows.

library(ggplot2)
set.seed(6)

temp = NULL
for (n in c(10, 20, 100, 150)) {
    temp = c(temp, (rbinom(n = 1000, size = n, prob = 0.5)/n))
}

sizeEx = data.frame(identifier = rep(c("A", "B", "C", "D"), each = 1000),
    data = temp)

graph1 = ggplot(sizeEx, aes(x = data)) + geom_bar(fill = "navy") +
    facet_wrap(~identifier, labeller = labeller(identifier = c(A = "n = 10",
        B = "n = 20", C = "n = 100", D = "n = 150")), scale = "free_y") +
    scale_x_continuous(breaks = seq(0, 1, 0.1), limits = c(0,
        1)) + labs(title = "Sample Distributions for a Population's Proportion",
    x = "", y = "Frequency of the Proportion")

graph1
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
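The shrinking spread across the four panels follows from the standard error formula \(\sqrt{p(1-p)/n}\). As a small sketch, the theoretical standard error and the expected number of successes can be tabulated for the same four sample sizes (the variable names are illustrative).

```r
n <- c(10, 20, 100, 150)  # sample sizes used in the four panels
p <- 0.5                  # success rate used in the simulation

se <- sqrt(p * (1 - p) / n)  # theoretical standard error of p_hat

# Standard error falls as n grows; n * p is the success-failure count
data.frame(n = n, se = round(se, 4), expected_successes = n * p)
```

The standard error drops from about 0.158 at n = 10 to about 0.041 at n = 150, which matches the narrowing of the distributions in the plot.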

Step 3: Calculations

Formula for a confidence interval:

\(\hat{p} \pm z \cdot SE_\hat{p}\)

Things to note:

\(z\) is the z-score (critical value) corresponding to the chosen confidence level. Any confidence level can be used; a higher level gives more confidence that the interval captures the true proportion, but it also gives a wider interval.

\(z \cdot SE\) is considered the margin of error or \(MOE\).
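The formula can be wrapped in a short function. This is a sketch of the \(\hat{p} \pm z \cdot SE_\hat{p}\) interval only; the name `prop_ci` and its arguments are hypothetical, and the inputs in the usage lines are made-up values for illustration.

```r
# Hypothetical helper implementing p_hat +/- z * SE
prop_ci <- function(p_hat, n, level = 0.90) {
  z <- qnorm(1 - (1 - level) / 2)      # critical value for the level
  se <- sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
  moe <- z * se                        # margin of error
  c(lower = p_hat - moe, upper = p_hat + moe, moe = moe)
}

prop_ci(0.5, 100, level = 0.90)
prop_ci(0.5, 100, level = 0.99)  # same data, wider interval
```

Comparing the two calls shows the trade-off: raising the confidence level increases the margin of error.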

Step 4: Example Data HairEyeColor with Interpretation

The data set used is HairEyeColor, which ships with R. It comes from a survey of students at the University of Delaware reported by Snee (1974); the split by sex was added by Friendly (1992). As a data frame, it is made up of 4 columns giving the hair color, eye color, sex, and number of students in each row.

library(ggplot2)

data("HairEyeColor")
HairEyeColor = as.data.frame(HairEyeColor)
head(HairEyeColor)
   Hair   Eye  Sex Freq
1 Black Brown Male   32
2 Brown Brown Male   53
3   Red Brown Male   10
4 Blond Brown Male    3
5 Black  Blue Male   11
6 Brown  Blue Male   50

Step 4: HairEyeColor Manipulation

Example: Create a 90% confidence interval from the sample to infer the population’s proportion of blue-eyed students. The population is statistics students at the University of Delaware. For this example, only the eye color and frequency are needed, and the proportions are calculated out of the total number of observations.

BlueEyes = HairEyeColor %>%
    select("Eye", "Freq") %>%
    group_by(Eye) %>%
    summarise(total = sum(Freq)) %>%
    mutate(proportion = total/sum(total))

The data is now a data frame of three variables: eye color, total frequency, and proportion calculated out of the total sample size.
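The two inputs needed for the interval, \(n\) and \(\hat{p}\), can also be pulled straight from the built-in table. This is a base R sketch and an alternative to the dplyr pipeline, not a replacement for it; it reads `datasets::HairEyeColor` explicitly because the name `HairEyeColor` has been reassigned to a data frame in this document.

```r
# Sum the 3-way table over hair color and sex; margin 2 is eye color
eye_totals <- apply(datasets::HairEyeColor, 2, sum)

n <- sum(eye_totals)               # total number of students surveyed
p_hat <- eye_totals[["Blue"]] / n  # sample proportion with blue eyes

n
round(p_hat, 3)
```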

Step 4: Visualization of the data values

The graph shows the frequency of each eye color, giving a sense of how the data breaks down across the categories.

Step 4: Example Data with Interpretation

Variables: \(\hat{p}\) = 0.363, \(n\) = 592, and a confidence level of 90%, which corresponds to a z-score of 1.645

Conditions:

\(0.363\cdot592 = 214.896 \geq10\) and \(592(1-0.363) = 377.104 \geq10\)

The sample is assumed independent and less than 10% of the population

Therefore

\(\mu_\hat{p} = 0.363\) and \(SE_\hat{p} = \sqrt{\frac{0.363(1-0.363)}{592}} \approx 0.01976\)

Calculations: \(0.363 + 1.645 \cdot 0.01976 \approx 0.3955\) and \(0.363 - 1.645 \cdot 0.01976 \approx 0.3305\)

Conclusion: Given that the sample meets the conditions for the central limit theorem, we can be 90% confident that the proportion of statistics students at the University of Delaware with blue eyes is between 0.3305 and 0.3955.
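The arithmetic for this example can be re-run in R as a check. The sketch below uses the rounded inputs stated in the example (\(\hat{p}\) = 0.363, \(n\) = 592, z = 1.645); the variable names are illustrative.

```r
p_hat <- 0.363  # sample proportion of blue-eyed students
n <- 592        # sample size
z <- 1.645      # z-score for a 90% confidence level

# Success-failure condition: both counts should be at least 10
c(successes = p_hat * n, failures = n * (1 - p_hat))

se <- sqrt(p_hat * (1 - p_hat) / n)  # standard error, roughly 0.0198
moe <- z * se                        # margin of error, roughly 0.0325

c(se = se, lower = p_hat - moe, upper = p_hat + moe)
```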