Proportion Sampling

Harold Nelson

2025-03-17

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The Sampling Distribution of the Proportion

To think about a proportion, we focus on the parameter \(p\) of a binomial distribution. The sample size in this context is the parameter \(n\). We have the following basic theoretical results.

The estimate of a proportion, \(\hat{p}\), based on a sample of size \(n\) is approximately normally distributed, with the mean and standard deviation given below, provided that

\(n\hat{p} > 10\) and \(n(1-\hat{p}) > 10\).

\[\mu_{\hat{p}} = p\] and

\[\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}\]

This is all based on the theory of the binomial distribution. In that theory we talked about having a known value of \(p\), the probability of success on a single trial. When a single trial is repeated \(n\) times, the expected number of successes is given by \(np\). Now we are trying to estimate the value of \(p\) based on the results of \(n\) trials.
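
As a short sketch of where these two formulas come from, let \(X\) denote the number of successes in \(n\) trials (the symbol \(X\) is introduced here only for illustration), so that \(\hat{p} = X/n\). The standard binomial results \(E(X) = np\) and \(\mathrm{Var}(X) = np(1-p)\) then give:

\[\mu_{\hat{p}} = E\left(\frac{X}{n}\right) = \frac{np}{n} = p\]

\[\sigma_{\hat{p}} = \sqrt{\mathrm{Var}\left(\frac{X}{n}\right)} = \sqrt{\frac{np(1-p)}{n^2}} = \sqrt{\frac{p(1-p)}{n}}\]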

We’ll use the diamonds dataframe again to illustrate the theory. Let’s look at a relative frequency table of the variable cut.

Solution

table(diamonds$cut)/nrow(diamonds)
## 
##       Fair       Good  Very Good    Premium      Ideal 
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652

A single trial consists of selecting a diamond at random. For our first exercise, we’ll consider getting one with a Fair cut as a success. From the table we know that the true population value of \(p\) is \(0.02984798\).

Let’s set up a code snippet to look at the distribution of estimates based on different sample sizes. We can change the sample size and re-run the snippet.

Solution

# First create a vector where we can put our estimates
Results = rep(0, 1000)

# Set the sample size
n = 1000

# Draw 1,000 samples of this size (without replacement), estimate p for
# each one and place each result into the Results vector
for (i in 1:1000) {
  Sample = sample(diamonds$cut, n)
  p.estimate = mean(Sample == "Fair")
  Results[i] = p.estimate
}

# Look at the distribution of results
mean(Results)
## [1] 0.030279
hist(Results)
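
To connect the simulation back to the theory, the following minimal sketch computes what the formulas predict for this case. It assumes the values quoted in the text (\(p = 0.02984798\) and \(n = 1000\)); the names `conditions` and `sigma_p_hat` are made up for this illustration.

```r
# Values taken from the text above: proportion of Fair cuts and sample size
p <- 0.02984798
n <- 1000

# Check the conditions for the normal approximation (both should exceed 10)
conditions <- c(np = n * p, nq = n * (1 - p))
print(conditions)

# Theoretical standard deviation of the estimates
sigma_p_hat <- sqrt(p * (1 - p) / n)
print(sigma_p_hat)

# Under the normal approximation, about 95% of estimates should fall here
print(p + c(-2, 2) * sigma_p_hat)
```

Running `sd(Results)` after the loop above gives a simulated value to compare with `sigma_p_hat`. The loop samples without replacement from a finite data frame, so a small discrepancy from the binomial formula is to be expected.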

Simulation with proportion estimates

The following code chunk simulates 1,000 estimates of a proportion, each based on a single sample of size n, and then summarizes their distribution. The code allows for varying the true population proportion and the sample size.

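look = function(n, p) {
  # Number of simulated samples
  expsize = 1000
  results = rep(0, expsize)

  for (i in 1:expsize) {
    # Draw n trials, coding a success as 1 and a failure as 0
    x = sample(c(1, 0), size = n, replace = TRUE,
               prob = c(p, 1 - p))
    # The sample mean of the 0/1 values is the estimate of p
    results[i] = mean(x)
  }

  # print() is needed here: inside a function, summary() alone is not displayed
  print(summary(results))
  hist(results)
}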
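
As a possible alternative, here is a minimal sketch that produces the same kind of estimates without an explicit loop. It relies on the fact that the number of successes in \(n\) trials is binomial, so `rbinom()` can draw all the counts at once. The function name `look_binom` and its `plot` argument are hypothetical and not part of the original code.

```r
look_binom <- function(n, p, expsize = 1000, plot = TRUE) {
  # Each draw is a count of successes in n trials; dividing by n gives p-hat
  results <- rbinom(expsize, size = n, prob = p) / n

  print(summary(results))
  if (plot) hist(results)

  # Return the estimates invisibly so they can be examined further
  invisible(results)
}

est <- look_binom(100, 0.2, plot = FALSE)
sd(est)   # compare with sqrt(0.2 * 0.8 / 100)
```

Returning the estimates makes it easy to compare their standard deviation with the theoretical value for any choice of n and p.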

Good Cases

Run the code in some cases where the constraints on n and p are not violated.

Solution

look(100,.2)

Bad Cases

Run the code in some cases where the constraints on n and p are violated.

Solution

look(10,.05)
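
To see why the second case counts as bad, the sketch below checks the two conditions for both runs and lists the exact binomial probabilities of each possible estimate when n = 10 and p = 0.05. The variable names are chosen only for this illustration.

```r
# The two cases run above
n_good <- 100; p_good <- 0.2
n_bad  <- 10;  p_bad  <- 0.05

# Expected successes and failures; the rule asks for both to exceed 10
good_counts <- c(n_good * p_good, n_good * (1 - p_good))
bad_counts  <- c(n_bad * p_bad, n_bad * (1 - p_bad))
print(good_counts)
print(bad_counts)

# Exact distribution of the estimate in the bad case: only 11 values are possible
bad_dist <- data.frame(p_hat = (0:n_bad) / n_bad,
                       prob  = dbinom(0:n_bad, size = n_bad, prob = p_bad))
print(round(bad_dist, 4))
```

In the bad case more than half of the probability sits on \(\hat{p} = 0\), so the histogram is strongly skewed rather than bell-shaped.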