We are interested in the Palmer Penguins data set, specifically the body mass of the 342 penguins involved. In this work, we identify two variables of interest: body_mass_g and species. We will construct a sampling distribution for a mean (the mean body mass of a sample) and a sampling distribution for a proportion (the proportion of penguins in a sample that are gentoo penguins). We do this for sample size n=50 (provided) and sample size n=100 (provided by you, the student!). We will answer questions about our sampling distributions using normal percentiles.
We begin by examining our penguin data set, with only the variables of interest included:
head(peng)
## # A tibble: 6 × 3
## species body_mass_g is_gentoo
## <fct> <int> <lgl>
## 1 Adelie 3750 FALSE
## 2 Adelie 3800 FALSE
## 3 Adelie 3250 FALSE
## 4 Adelie 3450 FALSE
## 5 Adelie 3650 FALSE
## 6 Adelie 3625 FALSE
names(peng)
## [1] "species" "body_mass_g" "is_gentoo"
table(peng$species)
##
## Adelie Chinstrap Gentoo
## 151 68 123
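The code that builds peng is not shown in this report. A minimal sketch of one way it could be reconstructed is given here, assuming the data come from the palmerpenguins R package and that penguins with a missing body mass were dropped (both are assumptions, not confirmed by the report):

```r
# Assumption: peng is derived from palmerpenguins::penguins (not confirmed by the report).
# The guard lets the sketch run without error when the package is not installed.
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  raw <- palmerpenguins::penguins
  # Keep only penguins with a recorded body mass
  raw <- raw[!is.na(raw$body_mass_g), ]
  peng <- data.frame(
    species = raw$species,
    body_mass_g = raw$body_mass_g,
    is_gentoo = raw$species == "Gentoo"
  )
  print(table(peng$species))
}
```

If the assumption holds, the species counts should match the table above.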
Next, we calculate the following parameters, treating the data set as our population: the mean body mass; the standard deviation of body mass; and the proportion of penguins that are gentoo penguins.
# SAMPLE MEAN "POPULATION" VALUES (from the dataset)
mu <- mean(peng$body_mass_g)
sigma <- sd(peng$body_mass_g)
mu
## [1] 4201.754
sigma
## [1] 801.9545
# SAMPLE PROPORTION "POPULATION" VALUE (from the dataset)
p <- mean(peng$is_gentoo)
p
## [1] 0.3596491
We will construct sampling distributions of sample size 50, using the Central Limit Theorem. We will do this for mean body mass, and we will also do it for proportion of gentoo penguins. In each case, we will answer a question about the probability of drawing a sample of a particular makeup, using normal percentiles to arrive at our answer.
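For reference, the two normal models that the CLT provides can be written as follows. The symbols mu, sigma, p, and n are the ones defined in the code; the notation for the sample mean and sample proportion is assumed here, and the second argument of N is the standard deviation (the standard error), not the variance:

```latex
\bar{X} \;\approx\; N\!\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right),
\qquad
\hat{p} \;\approx\; N\!\left(p,\ \sqrt{\frac{p(1-p)}{n}}\right)
```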
n <- 50
Although we will use the CLT for our calculations, we include a simulated sampling distribution below for sample size n=50. This will help us imagine the sampling distribution that the CLT provides.
library(ggplot2)
B <- 1000  # number of simulated samples
# simulated sampling distribution for proportions
sample_props <- replicate(B, mean(sample(peng$is_gentoo, size = n, replace = TRUE)))
# simulated sampling distribution for means
sample_means <- replicate(B, mean(sample(peng$body_mass_g, size = n, replace = TRUE)))
# histogram of the simulated sample means
ggplot(data.frame(xbar = sample_means), aes(x = xbar)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Simulated Sampling Distribution of Sample Means (n = 50)",
    x = "Sample mean body mass (g)", y = "Count"
  )
# histogram of the simulated sample proportions
ggplot(data.frame(phat = sample_props), aes(x = phat)) +
  geom_histogram(bins = 20) +
  labs(
    title = "Simulated Sampling Distribution of Sample Proportions (n = 50)",
    x = "Sample proportion (phat)", y = "Count"
  )
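To see how closely a simulation tracks the CLT, the center and spread of simulated sample proportions can be compared with p and the CLT standard error. The sketch below is self-contained: it rebuilds the is_gentoo column from the species table (123 gentoo out of 342), and the seed and the larger number of replicates are arbitrary choices made here, not part of the original analysis.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Rebuild is_gentoo from the species table: 123 gentoo out of 342 penguins
is_gentoo <- rep(c(TRUE, FALSE), times = c(123, 342 - 123))
p <- mean(is_gentoo)
n <- 50
B <- 5000  # more replicates than the 1000 used above, to reduce simulation noise

sim_props <- replicate(B, mean(sample(is_gentoo, size = n, replace = TRUE)))

# Compare the simulated center and spread with the CLT values
comparison <- c(
  sim_mean = mean(sim_props),
  clt_mean = p,
  sim_sd   = sd(sim_props),
  clt_se   = sqrt(p * (1 - p) / n)
)
print(round(comparison, 4))
```

The simulated values should land close to the CLT values, though they will differ slightly from run to run if the seed is changed.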
We will create a sampling distribution for means, using the variable body_mass_g. We seek to answer the question: if we draw a sample of 50 penguins at random, what is the chance that the sample's average body mass is 4000 g or less?
# The center of the model is also the population parameter, mu
mu
## [1] 4201.754
# Compute the standard error for sample means
SE <- sigma / sqrt(n)
SE
## [1] 113.4135
We see that our sampling distribution has a mean of 4202, and a standard deviation (i.e. standard error) of 113.4.
We can use the CLT to answer the question: what is the probability that a randomly drawn sample has a mean less than 4000 when n=50?
(4000-mu)/SE
## [1] -1.778927
#z-score was -1.78
pnorm(-1.78)
## [1] 0.03753798
We see that the normal percentile is 0.0375. Therefore, if we draw a sample of 50 penguins at random, the chance of the average penguin weight being 4000 or less is quite small: it will happen about 3.75% of the time.
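The same probability can be computed without rounding the z-score by hand, because pnorm() accepts the mean and standard deviation of the normal model directly. A minimal sketch, with mu and sigma copied from the printed output above so that it runs on its own:

```r
# Population values copied from the printed output above
mu <- 4201.754
sigma <- 801.9545
n <- 50

SE <- sigma / sqrt(n)

# P(sample mean < 4000) under the normal model N(mu, SE)
prob_below_4000 <- pnorm(4000, mean = mu, sd = SE)
round(prob_below_4000, 4)
```

The result should be very close to the value obtained from the rounded z-score; any small difference comes from rounding -1.778927 to -1.78.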
We will create a sampling distribution for proportions, using the variable is_gentoo. We seek to answer the question: if we draw a sample of 50 penguins at random, what are the chances that 50% or more of our penguins will be gentoo?
# The center of the model is also the population parameter, p
p
## [1] 0.3596491
# Compute the standard error for sample proportions
SE_prop <- sqrt(p * (1 - p) / n)
SE_prop
## [1] 0.06786776
We see that our sampling distribution is centered at a proportion of 0.36, with a standard deviation (i.e. standard error) of 0.068.
We can use the CLT to answer the question: what is the probability that a randomly drawn sample of sample size 50 has a proportion greater than 50% gentoo?
(0.5-p)/SE_prop
## [1] 2.068005
#z-score was 2.068005
pnorm(2.068005)
## [1] 0.9806802
The normal percentile is 0.9807. If we draw a sample of 50 penguins at random, the chance of the sample being less than 50% gentoo is quite large: it will happen about 98.07% of the time. This means the chance of our sample being 50% or more gentoo is quite small: it will happen about 1.93% of the time.
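The upper-tail probability can also be computed in one step from the unrounded z-score (0.5-p)/SE_prop, using the lower.tail argument of pnorm(). The sketch below recomputes p from the species table so that it is self-contained. The check that n*p and n*(1-p) are both at least 10 is a common rule of thumb for the normal model; it is an assumption added here, not something the report states.

```r
p <- 123 / 342  # proportion of gentoo penguins, from the species table
n <- 50

# Rule-of-thumb check (an assumption, not from the report):
# both expected counts should be at least 10
expected_counts <- c(gentoo = n * p, other = n * (1 - p))
print(expected_counts)

SE_prop <- sqrt(p * (1 - p) / n)

# lower.tail = FALSE returns P(phat > 0.5) directly
prob_above_half <- pnorm(0.5, mean = p, sd = SE_prop, lower.tail = FALSE)
round(prob_above_half, 4)
```

Working from the unrounded z-score avoids transcription slips when a z-score is copied by hand into pnorm().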
n <- 100
We will construct sampling distributions of sample size 100, using the Central Limit Theorem. We will do this for mean body mass, and we will also do it for proportion of gentoo penguins. In each case, we will answer a question about the probability of drawing a sample of a particular makeup, using normal percentiles to arrive at our answer. We will answer:
What is the probability that a randomly drawn sample of sample size 100 has an average weight less than 4000?
SE <- sigma/sqrt(n)
mu
## [1] 4201.754
SE
## [1] 80.19545
(4000-mu)/SE
## [1] -2.515783
#The Z score was -2.515783
pnorm(-2.515783)
## [1] 0.005938414
The normal percentile is 0.005938414. The probability that a randomly drawn sample of size 100 has an average weight less than 4000 is about 0.59%.
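To compare the two sample sizes side by side, the calculation can be wrapped in a small function. This is a sketch: prob_mean_below is a hypothetical helper introduced here for illustration, and mu and sigma are copied from the printed output above.

```r
# Population values copied from the printed output above
mu <- 4201.754
sigma <- 801.9545

# Hypothetical helper (not part of the original report):
# P(sample mean < cutoff) under the CLT normal model
prob_mean_below <- function(cutoff, n) {
  pnorm(cutoff, mean = mu, sd = sigma / sqrt(n))
}

sizes <- c(50, 100)
probs <- prob_mean_below(4000, sizes)  # pnorm() is vectorized over sd
data.frame(n = sizes, SE = sigma / sqrt(sizes), prob = round(probs, 4))
```

Doubling the sample size shrinks the standard error by a factor of sqrt(2), which is why a sample mean below 4000 becomes much less likely at n=100.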
What is the probability that a randomly drawn sample of sample size 100 has a proportion greater than 50% gentoo?
# Recompute the standard error for sample proportions with n = 100
SE_prop <- sqrt(p * (1 - p) / n)
SE_prop
## [1] 0.04798975
(0.5-p)/SE_prop
## [1] 2.924601
#The Z score was 2.924601
round(pnorm(2.924601), 4)
## [1] 0.9983
The normal percentile is about 0.9983. If we draw a sample of 100 penguins at random, the chance of the sample being less than 50% gentoo is about 99.83%. This means that the chance of the sample being more than 50% gentoo is about 0.17%.
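Because the standard error of a proportion depends on n, it has to be recomputed whenever the sample size changes. The sketch below makes that explicit by computing the standard error inside a function; prob_prop_above is a hypothetical helper introduced here for illustration, and p is recomputed from the species table.

```r
p <- 123 / 342  # proportion of gentoo penguins, from the species table

# Hypothetical helper (not part of the original report):
# P(sample proportion > cutoff) under the CLT normal model
prob_prop_above <- function(cutoff, n) {
  se <- sqrt(p * (1 - p) / n)  # standard error is recomputed for each n
  pnorm(cutoff, mean = p, sd = se, lower.tail = FALSE)
}

sizes <- c(50, 100)
prop_probs <- prob_prop_above(0.5, sizes)
data.frame(
  n = sizes,
  SE_prop = sqrt(p * (1 - p) / sizes),
  prob = round(prop_probs, 4)
)
```

The probability for n=100 should come out well below the one for n=50, since the larger sample has a smaller standard error.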
You are encouraged, but not required, to make a simulated sampling distribution for n=100 and include its graph as well.
FILL IN THIS SECTION YOURSELF
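One possible way to fill in this optional section, shown for the proportion only. It is a sketch under stated assumptions: is_gentoo is rebuilt from the species table so the code runs on its own, the seed is arbitrary, and base R's hist() is used to keep the example dependency-free (the ggplot code used for n=50 would work equally well).

```r
set.seed(100)  # arbitrary seed so the sketch is reproducible

# Rebuild is_gentoo from the species table: 123 gentoo out of 342 penguins
is_gentoo <- rep(c(TRUE, FALSE), times = c(123, 219))
p <- mean(is_gentoo)
n <- 100
B <- 1000

sample_props_100 <- replicate(B, mean(sample(is_gentoo, size = n, replace = TRUE)))

# Histogram on the density scale, with the CLT normal curve overlaid
hist(sample_props_100, breaks = 20, freq = FALSE,
     main = "Simulated sampling distribution of proportions (n = 100)",
     xlab = "Sample proportion (phat)")
curve(dnorm(x, mean = p, sd = sqrt(p * (1 - p) / n)), add = TRUE)

# Share of simulated samples that are more than 50% gentoo
mean(sample_props_100 > 0.5)
```

The simulated share is only a rough check on the CLT answer, since very few of the 1000 simulated samples are expected to exceed 50% gentoo.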