Harold Nelson
2024-03-25
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
To think about a proportion, we focus on the parameter p of a binomial distribution. The sample size in this context is the parameter n. We have the following basic theoretical results.
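The estimate of a proportion, $\hat{p}$, based on a sample of size $n$ is approximately normally distributed, with the following mean and standard deviation, provided that $n\hat{p} > 10$ and $n(1-\hat{p}) > 10$:

$$\mu_{\hat{p}} = p \quad \text{and} \quad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$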
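As a quick numerical check of these formulas, here is a minimal sketch in base R. The helper names `sd_prop` and `normal_ok` are hypothetical (they do not come from the slides), and the condition check plugs in a known p, whereas the rule above is stated in terms of the estimate.

```r
# Hypothetical helper: theoretical standard deviation of the estimate
sd_prop = function(p, n) {
  sqrt(p * (1 - p) / n)
}

# Hypothetical helper: rule-of-thumb check for the normal approximation.
# The slides state the rule using the estimate; here a known p is plugged in.
normal_ok = function(p, n) {
  n * p > 10 && n * (1 - p) > 10
}

sd_prop(0.5, 100)     # sqrt(0.25 / 100) = 0.05
normal_ok(0.5, 100)   # TRUE: 50 expected successes and 50 expected failures
normal_ok(0.02, 100)  # FALSE: only 2 expected successes
```

Note that the standard deviation shrinks with the square root of n, so quadrupling the sample size only halves it.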
This is all based on the theory of the binomial distribution. In that theory we talked about having a known value of p, the probability of success on a single trial. When a single trial is repeated n times, the expected number of successes is given by np. Now we are trying to estimate the value of p based on the results of n trials.
We’ll use the diamonds dataframe again to illustrate the theory. Let’s look at a relative frequency table of the variable cut.
##
## Fair Good Very Good Premium Ideal
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652
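The code that produced this table is not shown. One way to get a table like it, assuming the ggplot2 package (which supplies `diamonds`) is installed, is sketched here; the name `cut_props` is my own.

```r
library(ggplot2)  # provides the diamonds data frame

# Counts per cut, converted to proportions of all diamonds
cut_props = prop.table(table(diamonds$cut))
cut_props
```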
A single trial consists of selecting a diamond at random. For our first exercise, we’ll count getting a diamond with a Fair cut as a success. From the table, we know that the true population value of p is 0.02984798.
Let’s set up a code snippet to look at the distribution of estimates based on different sample sizes. We can change the sample size and re-run the snippet.
# First create a vector to hold our estimates
Results = rep(0, 1000)
# Set the sample size
n = 1000
# Draw 1,000 samples of this size, estimate p for each one and place
# each result into the Results vector
for (i in 1:1000) {
  Sample = sample(diamonds$cut, n)
  p.estimate = mean(Sample == "Fair")
  Results[i] = p.estimate
}
# Look at the mean of the results
mean(Results)
## [1] 0.029705
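The mean of the estimates lands close to the true p. The theory also predicts their spread, and a sketch like the following could check that. It is an assumption-laden stand-in rather than the snippet above: it uses `rbinom()` (sampling with replacement from an idealized population) instead of drawing from `diamonds`, the variable names are my own, and p is copied from the relative frequency table.

```r
set.seed(1)  # arbitrary seed so the run is reproducible

p = 0.02984798  # proportion of Fair cuts from the relative frequency table
n = 1000        # sample size

# 1,000 simulated estimates: binomial success counts divided by n
estimates = rbinom(1000, size = n, prob = p) / n

sd_simulated = sd(estimates)
sd_theory = sqrt(p * (1 - p) / n)

c(simulated = sd_simulated, theory = sd_theory)
```

The two values should agree closely; changing n and re-running shows the spread shrinking as the sample size grows.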
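The following function simulates the distribution of proportion estimates for a chosen sample size n and true population proportion p. It draws 1,000 samples, records the estimate from each one, then prints a summary of the estimates and plots their histogram.
look = function(n, p) {
  # Number of simulated samples
  expsize = 1000
  results = rep(0, expsize)
  for (i in 1:expsize) {
    # One sample of n trials: 1 = success, 0 = failure
    x = sample(c(1, 0), size = n, replace = TRUE,
               prob = c(p, 1 - p))
    results[i] = mean(x)
  }
  # print() is needed here: a bare summary() call that is not the last
  # expression in a function body displays nothing
  print(summary(results))
  hist(results)
}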
Run the code in some cases where the constraints on n and p are not violated.
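For instance, n = 200 and p = 0.3 would satisfy the constraints. The sketch below is one hedged way to examine such a case; it uses `rbinom()` rather than calling `look()` so that it runs on its own, and the variable names are my own.

```r
set.seed(2)  # arbitrary seed for reproducibility

n = 200
p = 0.3
c(n * p, n * (1 - p))  # 60 and 140, both above 10

# 1,000 simulated estimates of p
estimates = rbinom(1000, size = n, prob = p) / n

# Share of estimates within two theoretical standard deviations of p;
# a normal distribution would put roughly 95% there
sd_theory = sqrt(p * (1 - p) / n)
within_2sd = mean(abs(estimates - p) <= 2 * sd_theory)
within_2sd
```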
## Bad Cases
Run the code in some cases where the constraints on n and p are violated.
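As one possible bad case, take n = 50 and p = 0.02, so that only one success is expected per sample. This is a sketch under the same assumptions as before: `rbinom()` stands in for `look()`, and the names are my own.

```r
set.seed(3)  # arbitrary seed for reproducibility

n = 50
p = 0.02
n * p  # 1 expected success, far below 10

# 1,000 simulated estimates of p
estimates = rbinom(1000, size = n, prob = p) / n

# The estimates pile up on a few values starting at 0, with a long
# right tail, so the histogram is skewed rather than bell-shaped
table(estimates)
```

With so few expected successes, the estimate can only take values such as 0, 0.02, 0.04 and so on, which is why the normal approximation is a poor description here.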