Setup

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The Sampling Distribution of the Proportion

To think about a proportion, we focus on the parameter p of a binomial distribution. The sample size in this context is the parameter n. We have the following basic theoretical results.

The estimate $\hat{p}$ of a proportion based on a sample of size $n$ is approximately normally distributed, provided that

$n\hat{p} > 10$ and $n(1-\hat{p}) > 10$.

Its mean and standard deviation are

$\mu_{\hat{p}} = p$ and

$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$
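As a quick worked example of these formulas, the sketch below computes the theoretical standard deviation and checks the two conditions. The helper names `se_prop` and `normal_ok` and the values p = 0.2 and n = 100 are illustrative assumptions, not part of the original material. The check is applied to a known p here; with real data you would plug in $\hat{p}$.

```r
# Hypothetical helper: theoretical standard deviation of p-hat
se_prop <- function(p, n) {
  sqrt(p * (1 - p) / n)
}

# Hypothetical helper: TRUE when both conditions for the normal
# approximation hold (expected successes and failures above 10)
normal_ok <- function(p, n) {
  n * p > 10 && n * (1 - p) > 10
}

# Illustrative values (assumed for this example)
p <- 0.2
n <- 100

se_prop(p, n)    # sqrt(0.2 * 0.8 / 100) = 0.04
normal_ok(p, n)  # TRUE: 20 expected successes, 80 expected failures
```

Because the standard deviation has $n$ under a square root, quadrupling the sample size only halves it.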

This is all based on the theory of the binomial distribution. There we assumed a known value of $p$, the probability of success on a single trial: when a single trial is repeated $n$ times, the expected number of successes is $np$. Now we reverse the problem and estimate the value of $p$ from the results of $n$ trials.

We’ll use the diamonds dataframe again to illustrate the theory. Let’s look at a relative frequency table of the variable cut.

Solution

table(diamonds$cut)/nrow(diamonds)

## 
##       Fair       Good  Very Good    Premium      Ideal 
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652

A single trial consists of selecting a diamond at random. For our first exercise, we’ll consider getting one with a Fair cut as a success. From the table we know that the true population value of $p$ is $0.02984798$.
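Because Fair cuts are rare, it is worth checking how large a sample must be before the normal approximation is reasonable. The sketch below copies the value of p from the table and finds the smallest n with np > 10. The variable names and the sample size of 1,000 used in the last line are illustrative assumptions.

```r
# True proportion of Fair diamonds, copied from the table above
p_fair <- 0.02984798

# Smallest whole number n with n * p > 10
# (the other condition, n * (1 - p) > 10, is met much sooner)
min_n <- floor(10 / p_fair) + 1
min_n

# Expected successes and failures for an illustrative sample of 1,000
c(successes = 1000 * p_fair, failures = 1000 * (1 - p_fair))
```

Samples smaller than `min_n` expect fewer than 10 Fair diamonds, so their estimates should not be treated as approximately normal.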

Let’s set up a code snippet to look at the distribution of estimates based on different sample sizes. We can change the sample size and re-run the snippet.

Solution

# First create a vector to hold our estimates
Results <- numeric(1000)

# Set the sample size
n <- 1000

# Draw 1,000 samples of this size, estimate p for each one, and place
# each result into the Results vector
for (i in seq_along(Results)) {
  Sample <- sample(diamonds$cut, n)      # n diamonds, without replacement
  p.estimate <- mean(Sample == "Fair")   # share of Fair cuts in the sample
  Results[i] <- p.estimate
}

# Look at the distribution of results
mean(Results)

## [1] 0.029705

hist(Results)
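The histogram shows the shape and the mean shows the center. The spread can be compared with the theoretical standard deviation in the same way. The sketch below is self-contained, so instead of loading diamonds it builds a stand-in population: a logical vector of 53,940 values with 1,610 TRUE, counts assumed here because they reproduce the proportion of Fair cuts in the table. The seed and variable names are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Stand-in population (assumption): TRUE marks a Fair diamond
N <- 53940
population <- rep(c(TRUE, FALSE), times = c(1610, N - 1610))
p <- mean(population)

# Same experiment as above: 1,000 samples of size 1,000, drawn
# without replacement, keeping the estimated proportion each time
n <- 1000
sim <- replicate(1000, mean(sample(population, n)))

# Compare the simulated spread with sqrt(p(1-p)/n)
se_theory <- sqrt(p * (1 - p) / n)
c(simulated_sd = sd(sim), theoretical_sd = se_theory)
```

The two values should be close. Sampling without replacement from a finite population makes the simulated spread very slightly smaller than the formula suggests, but with 1,000 draws out of 53,940 the difference is about one percent.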

Simulation with proportion estimates

The following code chunk simulates the sampling distribution of a proportion estimate: it draws many samples, records the estimate from each one, and summarizes the results. The function lets us vary the true population proportion p and the sample size n.

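look <- function(n, p) {
  expsize <- 1000                 # number of simulated samples
  results <- numeric(expsize)

  for (i in seq_len(expsize)) {
    # n Bernoulli trials with success probability p
    x <- sample(c(1, 0), size = n, replace = TRUE,
                prob = c(p, 1 - p))
    results[i] <- mean(x)         # estimate of p from this sample
  }

  # print() is needed: values are not auto-printed inside a function
  print(summary(results))
  hist(results)
}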
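The same experiment can be written without a loop, because the number of successes in n independent trials follows a binomial distribution. The sketch below is one possible variant, not a replacement for the function above. The name `sim_props`, the seed, and the values n = 100 and p = 0.2 are assumptions made for illustration.

```r
# Hypothetical variant: returns the estimates instead of plotting them
sim_props <- function(n, p, expsize = 1000) {
  # rbinom() draws expsize success counts; dividing by n gives proportions
  rbinom(expsize, size = n, prob = p) / n
}

set.seed(42)  # arbitrary seed for reproducibility
estimates <- sim_props(100, 0.2)
summary(estimates)
```

Returning the vector keeps the simulation separate from the display, so the estimates can be passed to `hist()`, `sd()`, or any other summary afterwards.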

Good Cases

Run the code in some cases where the constraints on n and p are satisfied. For example, n = 100 and p = 0.2 give np = 20 and n(1-p) = 80, both above 10.

Solution

look(100,.2)
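One way to judge a good case numerically, rather than by eye, is to count how many estimates fall within 1.96 theoretical standard deviations of p. For a normal distribution that share is 95%. The sketch below uses `rbinom()` as a stand-in for the sampling step inside the function. The seed and variable names are arbitrary assumptions.

```r
set.seed(2024)  # arbitrary seed for reproducibility

n <- 100
p <- 0.2
c(n * p, n * (1 - p))   # 20 and 80, both above 10

# 1,000 simulated estimates of p
estimates <- rbinom(1000, size = n, prob = p) / n

# Theoretical standard deviation of the estimate
se <- sqrt(p * (1 - p) / n)

# Share of estimates within 1.96 standard deviations of p
mean(abs(estimates - p) <= 1.96 * se)
```

The share will not be exactly 0.95, because the estimates can only take values that are multiples of 1/n, but it should land near it when the conditions hold.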

Bad Cases

Run the code in some cases where the constraints on n and p are violated. For example, n = 10 and p = 0.05 give np = 0.5, well below 10.

Solution

look(10,.05)
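For a case this small, the exact distribution of the estimate can be computed directly, which makes it clear why the normal approximation fails. The sketch below uses `dbinom()` to list the probability of each possible estimate. The variable names are illustrative assumptions.

```r
n <- 10
p <- 0.05
c(n * p, n * (1 - p))   # 0.5 and 9.5, both below 10

# Exact probability of each possible estimate 0/10, 1/10, ..., 10/10
exact <- dbinom(0:n, size = n, prob = p)
names(exact) <- (0:n) / n

# The four smallest estimates carry almost all of the probability
round(exact[1:4], 4)
```

About 60% of samples contain no successes at all, so the estimate is exactly 0 more often than not. The distribution is strongly right-skewed and piled up against zero, which is nothing like a bell curve.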
