We are interested in the Palmer Penguins data set, specifically studying body mass of the 333 penguins involved. In this work, we will identify two penguins of interest – one of which is a Gentoo penguin, and one of which is a Chinstrap penguin. We calculate the z-scores of each penguin, both within the larger data set (compared to all penguins) and within the species-specific data set. When it is appropriate to discuss normal percentiles, we will do so.
We begin by plotting body mass of all penguins. We include a histogram and a density plot.
ggplot(penguins2, aes(x=body_mass_g))+
geom_histogram(bins=30, fill="lightblue", color="black")+
labs(title = "Penguin Body Mass (measured across all species)", x = "grams", y = "Count")
ggplot(penguins2, aes(x = body_mass_g)) +
geom_density(alpha = 0.4)
As we can see, the data set is unimodal and skewed to the right, meaning that the majority of our data values are closer to the minimum, with a handful of data values (the skew/tail of the distribution) at the higher end of the distribution. Since our observational units are penguins, our right-skewed body mass data indicates that most of our penguins have lower body mass (closer to the minimum value), while relatively fewer have a higher body mass. Importantly, because this data set does not have the same shape as the normal distribution, we do NOT expect normal percentiles to apply. Thus, the 68-95-99.7 rule, as well as the percentiles provided by the z-table and/or the pnorm() command, will not be accurate for this data set.
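One way to check this claim numerically is to compare the observed share of values within one, two, and three standard deviations of the mean against 68%, 95%, and 99.7%. The sketch below is only an illustration: `coverage_within` is a hypothetical helper name that does not appear in the original analysis, and the call on `penguins2` is left commented out because it assumes that data frame is already loaded.

```r
# Hypothetical helper (not part of the original analysis):
# proportion of values lying within k standard deviations of the mean.
coverage_within <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

# Benchmarks from the 68-95-99.7 rule, for comparison.
normal_rule <- c(0.68, 0.95, 0.997)

# Assumes penguins2 is loaded as in the rest of this report:
# observed <- sapply(1:3, function(k) coverage_within(penguins2$body_mass_g, k))
# rbind(observed, normal_rule)
```

Agreement on these three numbers alone would not prove normality, but clear disagreement would be a further warning sign against using normal percentiles.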
However, we continue by plotting a density plot across species.
ggplot(penguins2, aes(x = body_mass_g, fill = species)) +
geom_density(alpha = 0.4) +
labs(title = "Body Mass by Species", x = "Body mass (g)", y = "Density") +
theme_minimal()
We see that each species, by itself, is (roughly) unimodal and symmetric. Therefore, the pnorm() command, the 68-95-99.7 rule, and the z-table are all appropriate to use when considering z-scores strictly within a single species.
We decide to focus on the Gentoo penguin that is 150th on the list. We look up the penguin’s body mass, and then calculate the penguin’s z-score (when compared to ALL penguins). Note that, to do this, we need the mean and standard deviation of the entire data set.
penguins2$body_mass_g[150]
## [1] 5700
mu <- mean(penguins2$body_mass_g)
sigma <- sd(penguins2$body_mass_g)
mu
## [1] 4207.057
sigma
## [1] 805.2158
(5700-4207.057)/805.2158
## [1] 1.854091
pnorm(1.854091)
## [1] 0.9681369
We see that the penguin’s body mass was 5700 grams, and its z-score was 1.854091. This means the penguin’s body mass was 1.854091 standard deviations above the average penguin body mass. If this data set were approximately normal (unimodal, symmetric, and bell-shaped), we could convert this z-score into a normal percentile. Doing so (using the pnorm() command) gives us an answer of 0.9681369. HOWEVER: We saw that this data set was skewed to the right, and so we cannot conclude anything from the pnorm() command. In particular, we can NOT conclude that this penguin is larger than 96.8% of all other penguins.
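When the normal model is not appropriate, a percentile can still be read directly from the data by counting how many observations fall below the value of interest. The following is a minimal sketch of that idea, not something the original analysis computes; `empirical_percentile` is a hypothetical helper name, and the call on `penguins2` is commented out because it assumes the data frame is already loaded.

```r
# Hypothetical helper (not part of the original analysis):
# share of observations strictly below a given value.
empirical_percentile <- function(x, value) {
  mean(x < value)
}

# Assumes penguins2 is loaded as in the rest of this report:
# empirical_percentile(penguins2$body_mass_g, 5700)
```

Unlike pnorm(), this calculation makes no assumption about the shape of the distribution, so it remains meaningful for skewed data.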
We now focus solely on a single species, Gentoo, by creating a copy of the data set that filters for that species. We also calculate the mean and standard deviation of body mass within this species.
penguins_gentoo <- penguins2 %>%
filter(species == "Gentoo")
mu_gentoo <- mean(penguins_gentoo$body_mass_g)
sigma_gentoo <- sd(penguins_gentoo$body_mass_g)
We graph the distribution of Gentoos, finding that body mass is unimodal and symmetric, and so we can model the data using the normal model:
ggplot(penguins_gentoo, aes(x = body_mass_g)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 200, fill="red", color="black") +
stat_function(fun = dnorm, args = list(mean = mu_gentoo, sd = sigma_gentoo)) +
labs(
title = "Body mass with Normal curve overlay (Gentoo Penguins Only)",
x = "body_mass_g",
y = "density"
)
Therefore, we shift our focus to examine only the Gentoo penguins. We report the mean body mass of the Gentoos; the standard deviation of the Gentoos; and finally, the z-score (and normal percentile) for our specific Gentoo penguin.
mu_gentoo
## [1] 5092.437
sigma_gentoo
## [1] 501.4762
penguins2$body_mass_g[150]
## [1] 5700
(5700-5092.437)/501.4762
## [1] 1.211549
pnorm(1.211549)
## [1] 0.8871575
We conclude that our 150th penguin has a z-score of 1.211549 when compared against the other Gentoo penguins. Since our Gentoo body mass data are unimodal and symmetric, we can expect the normal probability calculated by pnorm() to be accurate for this z-score. Therefore, we expect this penguin to be larger than 88.7% of the other Gentoo penguins.
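As a cross-check on the Gentoo calculation, pnorm() also accepts `mean` and `sd` arguments, so the percentile can be obtained without standardizing by hand. The sketch below uses the rounded summary statistics reported above; the names `mu_g`, `sigma_g`, `p_from_z`, and `p_direct` are introduced here only for illustration.

```r
# Rounded Gentoo summary statistics as reported above
# (illustrative names, chosen to avoid overwriting mu_gentoo / sigma_gentoo).
mu_g <- 5092.437
sigma_g <- 501.4762

# Route 1: standardize by hand, then use the standard normal.
z_gentoo <- (5700 - mu_g) / sigma_g
p_from_z <- pnorm(z_gentoo)

# Route 2: let pnorm() standardize via its mean and sd arguments.
p_direct <- pnorm(5700, mean = mu_g, sd = sigma_g)

# The two routes should agree up to floating-point error.
all.equal(p_from_z, p_direct)
```

Either route is fine; computing the z-score explicitly has the advantage of also showing how many standard deviations the penguin sits from the species mean.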
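We now turn to our second penguin of interest, the Chinstrap penguin that is 299th on the list. We repeat the same steps: first we compare this penguin against ALL penguins, and then against only the other Chinstrap penguins.
penguins2$body_mass_g[299]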
## [1] 4100
mu <- mean(penguins2$body_mass_g)
sigma <- sd(penguins2$body_mass_g)
mu
## [1] 4207.057
sigma
## [1] 805.2158
(4100-4207.057)/805.2158
## [1] -0.1329544
pnorm(-0.1329544)
## [1] 0.4471147
penguins_Chinstrap <- penguins2 %>%
filter(species == "Chinstrap")
mu_Chinstrap <- mean(penguins_Chinstrap$body_mass_g)
sigma_Chinstrap <- sd(penguins_Chinstrap$body_mass_g)
(4100-3733.08)/384.34
## [1] 0.9546755
pnorm(0.9546755)
## [1] 0.8301291
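Our Chinstrap penguin (the 299th penguin on the list) has a body mass of 4100 grams. Compared against ALL penguins, its z-score is (4100-4207.057)/805.2158 = -0.1329544, so its body mass is about 0.13 standard deviations below the average penguin body mass. Because the full data set is skewed to the right, we again cannot conclude anything from the pnorm() value of 0.4471147. Compared against only the Chinstrap penguins, its z-score is (4100-3733.08)/384.34 = 0.9546755. Since body mass within a single species is roughly unimodal and symmetric, the normal percentile from pnorm() is appropriate here, and we expect this penguin to be heavier than about 83% of the other Chinstrap penguins.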
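The two z-scores for penguin 299 use the same arithmetic with different reference groups, which a small helper makes explicit. This is a sketch using the rounded summary statistics quoted in the calculations above; `z_from_summary` is a hypothetical helper name that does not appear in the original analysis.

```r
# Hypothetical helper (not part of the original analysis):
# z-score from a value and a reference group's mean and standard deviation.
z_from_summary <- function(x, mu, sigma) {
  (x - mu) / sigma
}

# Penguin 299 (4100 g) against all penguins.
z_all <- z_from_summary(4100, mu = 4207.057, sigma = 805.2158)

# Penguin 299 against Chinstrap penguins only
# (rounded values as used in the hand calculation).
z_chinstrap <- z_from_summary(4100, mu = 3733.08, sigma = 384.34)

# Only the within-species z-score is converted to a normal percentile,
# since only that distribution is roughly symmetric.
p_chinstrap <- pnorm(z_chinstrap)
```

The same penguin is slightly below average among all penguins but well above average among Chinstraps, which is why the choice of reference group matters.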
In this report, we analyzed penguin body mass data collected from several penguin species on islands in the Palmer Archipelago. Our initial work focused on graphing body mass for the entire penguin data set, and also separately for each species. We analyzed two specific penguins in our data set: one Gentoo penguin and one Chinstrap penguin. In each case, we calculated appropriate z-scores and, when looking within a single species, we calculated the normal percentiles for our chosen penguins.