Harold Nelson
2025-03-17
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
To think about a proportion, we focus on the parameter p of a binomial distribution. The sample size in this context is the parameter n. We have the following basic theoretical results.
The estimate \(\hat{p}\) of a proportion, based on a sample of size n, is approximately normally distributed with the following mean and standard deviation, provided that
\(n\hat{p} > 10\) and \(n(1-\hat{p}) > 10\).
\[\mu_{\hat{p}} = p\] and
\[\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}\]
This is all based on the theory of the binomial distribution. In that theory we talked about having a known value of \(p\), the probability of success on a single trial. When a single trial is repeated \(n\) times, the expected number of successes is given by \(np\). Now we are trying to estimate the value of \(p\) based on the results of \(n\) trials.
We’ll use the diamonds dataframe again to illustrate the theory. Let’s look at a relative frequency table of the variable cut.
##
## Fair Good Very Good Premium Ideal
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652
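The code that produced this table is not shown in these notes. As a minimal sketch, assuming the diamonds dataframe comes from the ggplot2 package (loaded with the tidyverse), a table like it can be built from the counts with `prop.table()`. The names `cut_counts` and `cut_props` are illustrative.

```r
# Assumes ggplot2 is installed; it provides the diamonds dataframe
library(ggplot2)

# Count the diamonds in each cut category
cut_counts <- table(diamonds$cut)

# Convert the counts to relative frequencies (they sum to 1)
cut_props <- prop.table(cut_counts)
cut_props
```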
A single trial consists of selecting a diamond at random. For our first exercise, we’ll consider getting one with a Fair cut as a success. From the table, we know that the true population value of \(p\) is \(0.02984798\).
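With a known value of \(p\), the theoretical results can be evaluated directly. The sketch below plugs the Fair proportion into the formulas for \(\mu_{\hat{p}}\) and \(\sigma_{\hat{p}}\). The sample size of 1,000 is an illustrative choice, and the conditions are checked with the known \(p\) rather than an estimate.

```r
# True proportion of Fair diamonds, taken from the table above
p <- 0.02984798
# Sample size (an illustrative choice)
n <- 1000

# Conditions for the normal approximation, checked here with the known p
n * p          # expected number of successes
n * (1 - p)    # expected number of failures
conditions_ok <- (n * p > 10) && (n * (1 - p) > 10)
conditions_ok

# Theoretical mean and standard deviation of the estimate
mu_phat <- p
sigma_phat <- sqrt(p * (1 - p) / n)
sigma_phat
```

For these values the standard deviation is roughly 0.0054, so most estimates from samples of this size should land within about 0.01 of the true proportion.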
Let’s set up a code snippet to look at the distribution of estimates based on different sample sizes. We can change the sample size and re-run the snippet.
# First create a vector to hold the estimates
Results = rep(0, 1000)
# Set the sample size
n = 1000
# Draw 1,000 samples of this size, estimate p for each one, and place
# each result into the Results vector
for (i in 1:1000) {
  # Select n diamonds at random (without replacement)
  Sample = sample(diamonds$cut, n)
  # The estimate is the fraction of sampled diamonds with a Fair cut
  p.estimate = mean(Sample == "Fair")
  Results[i] = p.estimate
}
# Look at the mean of the estimates
mean(Results)
## [1] 0.030279
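The mean of the estimates is close to the true value of \(p\). The spread and shape can be compared with the theory in the same way. The following is a sketch, not the original chunk: it repeats the simulation with `replicate()` instead of a loop, uses an arbitrary seed, and assumes diamonds comes from ggplot2.

```r
library(ggplot2)  # assumed installed; provides diamonds

set.seed(123)     # arbitrary seed so the run is repeatable
n <- 1000
p <- mean(diamonds$cut == "Fair")   # true population proportion

# Same simulation as the loop, written with replicate()
Results <- replicate(1000, mean(sample(diamonds$cut, n) == "Fair"))

# Simulated versus theoretical mean and standard deviation
c(simulated = mean(Results), theory = p)
c(simulated = sd(Results), theory = sqrt(p * (1 - p) / n))

# Shape of the sampling distribution
hist(Results, main = "Estimates of p", xlab = "Estimate")
```

Because `sample()` draws without replacement from a finite set of diamonds, the simulated standard deviation tends to be slightly smaller than the binomial formula suggests. With 1,000 diamonds drawn from 53,940 the difference is small.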
The following code chunk allows us to simulate an estimate of a proportion based on a single sample. The code allows for varying the true population proportion and the sample size.
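The chunk itself is not reproduced in these notes. A minimal sketch of such a simulation, assuming the single sample is a binomial draw from `rbinom()`, could look like this. The helper name `simulate_phat` and the values of p and n are hypothetical and can be changed freely.

```r
# Hypothetical helper, not from the original notes:
# draw one sample of n trials with success probability p
# and return the estimated proportion
simulate_phat <- function(p, n) {
  successes <- rbinom(1, size = n, prob = p)
  successes / n
}

set.seed(1)   # arbitrary seed so the run is repeatable
p <- 0.3      # true population proportion (illustrative)
n <- 200      # sample size (illustrative)

p_hat <- simulate_phat(p, n)
p_hat

# Quantities used in the conditions for the normal approximation
c(n * p_hat, n * (1 - p_hat))
```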
Run the code in some cases where the conditions \(n\hat{p} > 10\) and \(n(1-\hat{p}) > 10\) are satisfied.
## Bad Cases
Run the code in some cases where at least one of the conditions \(n\hat{p} > 10\) and \(n(1-\hat{p}) > 10\) is violated.
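As one illustration of a bad case, the sketch below uses a small sample with a small proportion, so the expected number of successes is well under 10. The values of p and n and the seed are illustrative assumptions, and the estimates are generated with `rbinom()` rather than by sampling from diamonds.

```r
set.seed(2)   # arbitrary seed so the run is repeatable
p <- 0.03     # true population proportion (illustrative)
n <- 20       # sample size (illustrative)

# Expected number of successes is far below 10
n * p

# 1,000 estimates of p, each based on one sample of size n
estimates <- rbinom(1000, size = n, prob = p) / n

# Only a handful of distinct values occur, piled up at zero
table(estimates)

# Fraction of samples with no successes at all
mean(estimates == 0)
```

In theory about 54% of such samples contain no Fair diamonds at all, since \(0.97^{20} \approx 0.54\). The estimates are therefore strongly skewed and take only a few distinct values, which a normal curve cannot describe well.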