Scenario: Studying Penguin Body Mass using Normal Percentiles

We are interested in the Palmer Penguins data set, specifically the body mass of its 342 penguins. We work with two variables of interest: body_mass_g and species. We will construct a sampling distribution for a mean (the mean body mass of a sample) and a sampling distribution for a proportion (the proportion of penguins in a sample that are gentoo penguins). We do this for sample size n=50 (provided) and for sample size n=100 (provided by you, the student!). We will then answer questions about these sampling distributions using normal percentiles.

We begin by examining our penguin data set, with only the variables of interest included:

head(peng)
## # A tibble: 6 × 3
##   species body_mass_g is_gentoo
##   <fct>         <int> <lgl>    
## 1 Adelie         3750 FALSE    
## 2 Adelie         3800 FALSE    
## 3 Adelie         3250 FALSE    
## 4 Adelie         3450 FALSE    
## 5 Adelie         3650 FALSE    
## 6 Adelie         3625 FALSE
names(peng)
## [1] "species"     "body_mass_g" "is_gentoo"
table(peng$species)
## 
##    Adelie Chinstrap    Gentoo 
##       151        68       123

Next, treating the full data set as our population, we calculate three population values: the mean body mass; the standard deviation of body mass; and the proportion of penguins that are gentoo penguins.

# SAMPLE MEAN "POPULATION" VALUES (from the dataset)
mu <- mean(peng$body_mass_g)
sigma <- sd(peng$body_mass_g)

mu
## [1] 4201.754
sigma
## [1] 801.9545
# SAMPLE PROPORTION "POPULATION" VALUE (from the dataset)
p <- mean(peng$is_gentoo)
p
## [1] 0.3596491

Central Limit Theorem Calculations for n=50

We will construct sampling distributions of sample size 50, using the Central Limit Theorem. We will do this for mean body mass, and we will also do it for proportion of gentoo penguins. In each case, we will answer a question about the probability of drawing a sample of a particular makeup, using normal percentiles to arrive at our answer.
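For reference, the two normal models used in the rest of this section can be written with the same symbols as the code (mu, sigma, p, and n). This is the usual textbook statement of the CLT approximation, not something specific to this data set; the second entry in each model is the standard error (a standard deviation, not a variance).

```latex
\bar{x} \;\text{is approximately}\; N\!\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right),
\qquad
\hat{p} \;\text{is approximately}\; N\!\left(p,\ \sqrt{\frac{p(1-p)}{n}}\right)
```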

n <- 50

Although we will use the CLT for our calculations, we include a simulated sampling distribution below for sample size n=50. This will help us imagine the sampling distribution that the CLT provides.

library(ggplot2)  # provides ggplot() for the histograms below

B <- 1000  # number of simulated samples
# simulated sampling distribution for proportions
sample_props <- replicate(B,mean(sample(peng$is_gentoo, size = n, replace = TRUE)))

# simulated sampling distribution for means
sample_means <- replicate(B, mean(sample(peng$body_mass_g, size = n, replace = TRUE)))
ggplot(data.frame(xbar = sample_means), aes(x = xbar)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Simulated Sampling Distribution of Sample Means",
    x = "Sample Mean", y = "Count"
  )

ggplot(data.frame(phat = sample_props), aes(x = phat)) +
  geom_histogram(bins = 20) +
  labs(
    title = "Sampling distribution of proportions",
    x = "Sample proportion (phat)", y = "Count"
  )
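One way to check that the simulation and the CLT agree is to compare the spread of the simulated sample proportions with the standard error formula. The sketch below is meant to run on its own, so it does not use peng: it rebuilds a stand-in for is_gentoo from the species counts shown earlier (123 Gentoo out of 342). The name gentoo_pop and the seed are assumptions added for illustration.

```r
# Stand-in for peng$is_gentoo, rebuilt from the species table:
# 123 Gentoo and 151 + 68 = 219 non-Gentoo penguins.
gentoo_pop <- rep(c(TRUE, FALSE), times = c(123, 219))

set.seed(1)  # arbitrary seed so the run is repeatable
n <- 50
B <- 1000

# Draw B samples of size n (with replacement) and record each sample proportion
sim_props <- replicate(B, mean(sample(gentoo_pop, size = n, replace = TRUE)))

p_pop   <- mean(gentoo_pop)               # population proportion
se_theo <- sqrt(p_pop * (1 - p_pop) / n)  # CLT standard error
se_sim  <- sd(sim_props)                  # spread of the simulated proportions

c(theoretical = se_theo, simulated = se_sim)
```

The two numbers should be close but not identical, since the simulated value depends on the random samples drawn.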

CLT for sample means, using body_mass_g

We will create a sampling distribution for means, using the variable body_mass_g. We seek to answer the question: if we draw a sample of 50 penguins at random, what is the chance that the average penguin weight is 4000 g or less?

# The center of the model is also the population parameter, mu
mu
## [1] 4201.754
# Compute the standard error for sample means
SE <- sigma / sqrt(n)
SE
## [1] 113.4135

We see that our sampling distribution has a mean of 4202, and a standard deviation (i.e. standard error) of 113.4.

We can use the CLT to answer the question: what is the probability that a randomly drawn sample has a mean less than 4000 when n=50?

(4000-mu)/SE
## [1] -1.778927
#z-score was -1.78
pnorm(-1.78)
## [1] 0.03753798

We see that the normal percentile is 0.0375. Therefore, if we draw a sample of 50 penguins at random, the chance of the average penguin weight being 4000 or less is quite small: it will happen about 3.75% of the time.
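The same probability can be computed without rounding the z-score by hand, because pnorm() accepts the mean and standard deviation of the normal model directly. A minimal sketch, with mu and sigma typed in from the output above (rather than computed from peng) so that it runs on its own; the names prob_direct and prob_z are illustrative.

```r
# Values copied from the output above
mu    <- 4201.754
sigma <- 801.9545
n     <- 50

SE <- sigma / sqrt(n)

# P(xbar <= 4000) under the normal model with mean mu and sd SE
prob_direct <- pnorm(4000, mean = mu, sd = SE)

# Equivalent calculation through the unrounded z-score
z <- (4000 - mu) / SE
prob_z <- pnorm(z)

c(direct = prob_direct, via_z = prob_z)
```

Both entries agree with each other, and they differ only slightly from pnorm(-1.78) because the z-score is not rounded to two decimals first.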

CLT for sample proportions, using is_gentoo

We will create a sampling distribution for proportions, using the variable is_gentoo. We seek to answer the question: if we draw a sample of 50 penguins at random, what are the chances that 50% or more of our penguins will be gentoo?

# The center of the model is also the population parameter, p
p
## [1] 0.3596491
# Compute the standard error for sample proportions
SE_prop <- sqrt(p * (1 - p) / n)
SE_prop
## [1] 0.06786776

We see that our sampling distribution is centered at a proportion of 0.36, with a standard deviation (i.e. standard error) of 0.068.

We can use the CLT to answer the question: what is the probability that a randomly drawn sample of sample size 50 has a proportion greater than 50% gentoo?

(0.5-p)/SE_prop
## [1] 2.068005
#z-score was 2.07
pnorm(2.07)
## [1] 0.9807738

The normal percentile is 0.9808. If we draw a sample of 50 penguins at random, the chance of the sample being less than 50% gentoo is quite large: it will happen about 98.08% of the time. This means the chance of our sample being more than 50% gentoo is quite small: it will happen about 1.92% of the time.
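Before relying on the normal model for a proportion, it is common to check that the expected numbers of successes and failures are both at least 10. That threshold is a widely used rule of thumb and an assumption here, not something stated in this assignment. The sketch below runs the check for n = 50 and then computes the upper-tail probability in one step with lower.tail = FALSE, using the unrounded z-score. The value of p is typed in from the output above so the block runs on its own.

```r
p <- 0.3596491  # proportion of gentoo penguins, copied from the output above
n <- 50

# Success-failure check: both expected counts should be at least 10
expected_counts <- c(gentoo = n * p, not_gentoo = n * (1 - p))
expected_counts
all(expected_counts >= 10)

# Upper tail P(phat > 0.5) directly, instead of computing 1 - pnorm(z)
SE_prop <- sqrt(p * (1 - p) / n)
z <- (0.5 - p) / SE_prop
upper_tail <- pnorm(z, lower.tail = FALSE)
upper_tail
```

Using lower.tail = FALSE avoids the separate subtraction step and makes it explicit that the question is about the right-hand tail.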

Central Limit Theorem Calculations for n=100

We will construct sampling distributions of sample size 100, using the Central Limit Theorem. We will do this for mean body mass, and we will also do it for proportion of gentoo penguins. In each case, we will answer a question about the probability of drawing a sample of a particular makeup, using normal percentiles to arrive at our answer. We will answer:

What is the probability that a randomly drawn sample of sample size 100 has an average weight less than 4000?

What is the probability that a randomly drawn sample of sample size 100 has a proportion greater than 50% gentoo?

You are encouraged, but not required, to make a simulated sampling distribution for n=100 and include it in your graph as well.


n <- 100

Although we will use the CLT for our calculations, we include a simulated sampling distribution below for sample size n=100. This will help us imagine the sampling distribution that the CLT provides.

B = 4000
# simulated sampling distribution for proportions
sample_props <- replicate(B,mean(sample(peng$is_gentoo, size = n, replace = TRUE)))

# simulated sampling distribution for means
sample_means <- replicate(B, mean(sample(peng$body_mass_g, size = n, replace = TRUE)))
ggplot(data.frame(xbar = sample_means), aes(x = xbar)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Simulated Sampling Distribution of Sample Means",
    x = "Sample Mean", y = "Count"
  )

ggplot(data.frame(phat = sample_props), aes(x = phat)) +
  geom_histogram(bins = 20) +
  labs(
    title = "Sampling distribution of proportions",
    x = "Sample proportion (phat)", y = "Count"
  )

CLT for sample means, using body_mass_g

We will create a sampling distribution for means, using the variable body_mass_g. We seek to answer the question: if we draw a sample of 100 penguins at random, what are the chances of the average penguin weight being 4000 or less?

# The center of the model is also the population parameter, mu
mu
## [1] 4201.754
# Compute the standard error for sample means
SE <- sigma / sqrt(n)
SE
## [1] 80.19545

We see that our sampling distribution has a mean of 4202, and a standard deviation (i.e. standard error) of 80.2.

We can use the CLT to answer the question: what is the probability that a randomly drawn sample has a mean less than 4000 when n=100?

(4000-mu)/SE
## [1] -2.515783
#z-score was -2.515783
pnorm(-2.515783)
## [1] 0.005938414

We see that the normal percentile is 0.0059. Therefore, if we draw a sample of 100 penguins at random, the chance of the average penguin weight being 4000 or less is very small: it will happen about 0.59% of the time.

CLT for sample proportions, using is_gentoo

We will create a sampling distribution for proportions, using the variable is_gentoo. We seek to answer the question: if we draw a sample of 100 penguins at random, what are the chances that 50% or more of our penguins will be gentoo?

# The center of the model is also the population parameter, p
p
## [1] 0.3596491
# Compute the standard error for sample proportions
SE_prop <- sqrt(p * (1 - p) / n)
SE_prop
## [1] 0.04798975

We see that our sampling distribution is centered at a proportion of 0.36, with a standard deviation (i.e. standard error) of 0.048.

We can use the CLT to answer the question: what is the probability that a randomly drawn sample of sample size 100 has a proportion greater than 50% gentoo?

# z-score for a sample proportion of 0.5 (approximately 2.92)
z <- (0.5 - p) / SE_prop

# upper-tail probability: P(phat > 0.5)
1 - pnorm(z)
## [1] 0.001724491

The upper-tail probability is 0.0017. If we draw a sample of 100 penguins at random, the chance of the sample being more than 50% gentoo is very small: it will happen about 0.17% of the time. Equivalently, the sample will be less than 50% gentoo about 99.83% of the time.
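To see how the sample size drives these answers, the two calculations can be wrapped in small helper functions and evaluated at both n = 50 and n = 100. This is only a sketch: the function names prob_mean_below and prob_prop_above are hypothetical, and mu, sigma, and p are typed in from the output above so that the block runs without peng.

```r
# Population values copied from the output above
mu    <- 4201.754
sigma <- 801.9545
p     <- 0.3596491

# P(xbar < cutoff) for samples of size n, using the CLT normal model
prob_mean_below <- function(cutoff, n) {
  pnorm(cutoff, mean = mu, sd = sigma / sqrt(n))
}

# P(phat > cutoff) for samples of size n, using the CLT normal model
prob_prop_above <- function(cutoff, n) {
  pnorm(cutoff, mean = p, sd = sqrt(p * (1 - p) / n), lower.tail = FALSE)
}

sizes <- c(50, 100)
results <- data.frame(
  n            = sizes,
  mean_lt_4000 = prob_mean_below(4000, sizes),
  prop_gt_half = prob_prop_above(0.5, sizes)
)
results
```

Doubling the sample size shrinks each standard error by a factor of sqrt(2), so both cutoffs sit further out in the tails and both probabilities drop.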