Scenario

When studying the penguins on the Palmer islands, how do penguin body mass, penguin species, and penguin sex all interact with each other? We will explore these questions by looking at an appropriate data set. Furthermore (going beyond what we did last time), we will explore how to use the dplyr package to work with our data.
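
The commands in this report assume that the plotting and data-wrangling packages are already loaded and that the data are stored in an object called Penguins. That setup step is not shown in the report, so the following is only a minimal sketch of what it might look like. It assumes the data come from the palmerpenguins package; that source is an assumption, although the column names and dimensions reported in the next section match that package's penguins data set.

```r
# Assumed setup (not shown in the original report)
library(ggplot2)         # plotting: ggplot(), geom_bar(), geom_histogram(), ...
library(dplyr)           # data wrangling: select(), filter(), count(), ...
library(tidyr)           # drop_na()
library(palmerpenguins)  # assumption: source of the penguins data

# Store the data under the name used throughout this report
Penguins <- penguins
```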

The Data (Penguins)

We begin with some exploratory data analysis to learn more about this data set.

names(Penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
dim(Penguins)
## [1] 344   8
head(Penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

We first want to know how many penguins there are of each species. The following bar graph displays the counts by species:

ggplot(Penguins, aes(x = species, fill=species)) +
  geom_bar() +
  labs(title = "How many penguins of each species?",
       x = "Species", y = "Count")

We wish to break this down further by seeing how many penguins of each species appear on each island. We summarize this with a table-like summary using dplyr, and with two bar charts: a side-by-side chart of counts and a segmented chart of the species mix within each island:

# Table-like summary with dplyr:
Penguins %>%
  count(species, island)
## # A tibble: 5 × 3
##   species   island        n
##   <fct>     <fct>     <int>
## 1 Adelie    Biscoe       44
## 2 Adelie    Dream        56
## 3 Adelie    Torgersen    52
## 4 Chinstrap Dream        68
## 5 Gentoo    Biscoe      124
# side-by-side bar (counts)
Penguins %>%
  ggplot(aes(x = island, fill = species)) +
  geom_bar(position = "dodge") +
  labs(title = "Counts by island and species (side-by-side)",
       x = "Island", y = "Count")

# Segmented bar chart (conditional proportion within island)
Penguins %>%
  ggplot(aes(x = island, fill = species)) +
  geom_bar(position = "fill") +
  labs(title = "Within each island, proportion by species (segmented)",
       x = "Island", y = "Proportion")
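
The segmented bar chart displays conditional proportions: within each island, the share of penguins belonging to each species. As a sketch, the same numbers can be computed as a table by dividing each count by its island total. To keep the example self-contained, it retypes the counts from the count(species, island) output above into a small data frame; the names island_counts and island_props are only illustrative.

```r
library(dplyr)

# Counts copied from the count(species, island) output above
island_counts <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo"),
  island  = c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe"),
  n       = c(44, 56, 52, 68, 124)
)

# Within each island, divide each species count by the island total
island_props <- island_counts %>%
  group_by(island) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  arrange(island, species)

island_props
```

Torgersen has only Adelie penguins, so its single proportion is 1; on Biscoe, 44 of the 168 penguins (about 26%) are Adelie.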

Cleaning our data (giving us Penguins2)

We decide to restrict our attention to the variables for species, island, sex, flipper length, and body mass. We also wish to remove any rows with NA (missing) values in those variables. We can do this in one piped bit of code, using select() from dplyr and drop_na() from the tidyr package, giving us a new data set called Penguins2. We do that here:

# Choose key variables, then drop rows with NAs in those columns
Penguins2 <- Penguins %>%
  select(species, island, sex, 
         flipper_length_mm, body_mass_g) %>%
  drop_na()
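
Because select() runs before drop_na(), a row is removed only if it has an NA in one of the selected columns; an NA in a column that was already dropped does not matter. The following sketch illustrates this on a small made-up data frame (toy and its values are invented for illustration and are not taken from the penguin data).

```r
library(dplyr)
library(tidyr)

# Made-up data: row 2 is missing sex, row 3 is missing flipper length,
# and row 4 is missing only bill length (a column we do not keep)
toy <- data.frame(
  species           = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  sex               = c("male", NA, "female", "female"),
  flipper_length_mm = c(181, 186, NA, 195),
  bill_length_mm    = c(39.1, 39.5, 46.1, NA)
)

# Count the NAs in each column before cleaning
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# Same pattern as above: choose columns first, then drop incomplete rows
toy_clean <- toy %>%
  select(species, sex, flipper_length_mm) %>%
  drop_na()

# Number of rows removed by the cleaning step
rows_removed <- nrow(toy) - nrow(toy_clean)
```

Here rows 2 and 3 are removed, while row 4 is kept even though its bill length is missing. The same comparison, nrow(Penguins) - nrow(Penguins2), would show how many penguins the cleaning step removed from the real data.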

In the new data set, we can look at the distributions of flipper length and body mass. Both histograms below show that most penguins fall around the middle of the range, with flipper length peaking near 190–195 mm and body mass near 3,500–3,700 g. Both distributions are also somewhat skewed to the right: some penguins with longer flippers or heavier bodies stretch the upper tail.

# Distribution of flipper length after data cleaning
ggplot(Penguins2, aes(x = flipper_length_mm)) +
  geom_histogram(bins = 25) +
  labs(title = "Flipper length", x = "mm", y = "Count")

# Distribution of body mass after data cleaning
ggplot(Penguins2, aes(x = body_mass_g)) +
  geom_histogram(bins = 25) +
  labs(title = "Body mass", x = "grams", y = "Count")

Filtering to focus on Female Adelie Penguins

We decide to focus on the female Adelie penguins. We can do this by using a couple of piped filter() commands and storing the result as a new data set to work with. We do this here:

# Keep only Adelie penguins
Adelie <- Penguins2 %>% filter(species == "Adelie")

# Keep only the female Adelie penguins (on all islands)
Adelie_F <- Penguins2 %>%
  filter(species == "Adelie") %>%
  filter(sex == "female")
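
Two piped filter() calls keep only the rows that pass both conditions. As a sketch of an equivalent form, filter() also accepts several conditions separated by commas, which are combined with a logical AND. The small data frame toy below is made up for illustration, as are the names two_steps and one_step.

```r
library(dplyr)

# Made-up data for illustration
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Adelie"),
  sex     = c("female", "male", "female", "female", "female"),
  island  = c("Dream", "Dream", "Biscoe", "Dream", "Torgersen")
)

# Two piped filters, as in the code above
two_steps <- toy %>%
  filter(species == "Adelie") %>%
  filter(sex == "female")

# One filter with two comma-separated conditions (combined with AND)
one_step <- toy %>%
  filter(species == "Adelie", sex == "female")

# Check that both approaches keep the same rows
all.equal(two_steps, one_step)
```

Which form to use is a matter of taste; the piped version makes it easy to comment out one condition at a time.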

Within this group (female Adelie penguins), we are interested in the mean and standard deviation, both for the group as a whole and grouped by island. But which variable should we look at? We choose flipper length.

Based on this choice, we compute the following summaries. For context, we first find the mean and standard deviation of flipper length for each species (across all penguins in Penguins2). Then, for the female Adelie penguins, we find the mean and standard deviation for the whole group and grouped by island. The lines of code we need to do this are provided below.

# Mean and SD of flipper length by species
Penguins2 %>%
  group_by(species) %>%
  summarize(mean_fl = mean(flipper_length_mm),
            sd_fl   = sd(flipper_length_mm))
## # A tibble: 3 × 3
##   species   mean_fl sd_fl
##   <fct>       <dbl> <dbl>
## 1 Adelie       190.  6.52
## 2 Chinstrap    196.  7.13
## 3 Gentoo       217.  6.59
# Adelie females only: overall summary statistics
Adelie_F %>% summarize(mean_flipper = mean(flipper_length_mm),
                        sd_flipper   = sd(flipper_length_mm))
## # A tibble: 1 × 2
##   mean_flipper sd_flipper
##          <dbl>      <dbl>
## 1         188.       5.60
# Adelie females only: summary statistics, grouped by island
Adelie_F %>%
  group_by(island) %>%
  summarize(mean_flipper = mean(flipper_length_mm),
            sd_flipper = sd(flipper_length_mm))
## # A tibble: 3 × 3
##   island    mean_flipper sd_flipper
##   <fct>            <dbl>      <dbl>
## 1 Biscoe            187.       6.74
## 2 Dream             188.       5.51
## 3 Torgersen         188.       4.64
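
When comparing group means and standard deviations, it helps to know how many observations each group contains, since a summary based on only a few penguins is less stable. The following is a minimal sketch, on a made-up data frame, of adding a group size with n() to the same group_by() and summarize() pattern (toy, toy_summary, and the column name n_penguins are illustrative).

```r
library(dplyr)

# Made-up data for illustration
toy <- data.frame(
  island            = c("Biscoe", "Biscoe", "Dream", "Dream", "Dream"),
  flipper_length_mm = c(180, 190, 185, 187, 192)
)

# Same grouped summary as above, with the group size added via n()
toy_summary <- toy %>%
  group_by(island) %>%
  summarize(n_penguins   = n(),
            mean_flipper = mean(flipper_length_mm),
            sd_flipper   = sd(flipper_length_mm))

toy_summary
```

The same n_penguins = n() line could be added to the Adelie_F summaries above.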

Additionally, based on this variable, we create the following graphs of the female Adelie penguins: (a) a histogram of flipper length, and (b) a series of boxplots of flipper length, grouped by island. The code for doing this is below.

# Adelie females only: distribution of flipper length
ggplot(Adelie_F, aes(x = flipper_length_mm)) +
  geom_histogram(bins = 25) +
  labs(title = "Adelie females: flipper length", x = "mm", y = "Count")

# Adelie females only: flipper length by island
ggplot(Adelie_F, aes(x = island, y = flipper_length_mm, fill = island)) +
  geom_boxplot() +
  labs(title = "Adelie females: flipper length by island",
       x = "Island", y = "Flipper length (mm)")

Finally, to put the female Adelie data in context (and compare the females to the males), we prepare the following side-by-side boxplot, which breaks flipper length down by island and by sex:

# Boxplot of flipper length of Adelie penguins, grouped by island and sex
Penguins2 %>%
  filter(species == "Adelie") %>%
  ggplot(aes(x = island, y = flipper_length_mm, fill = sex)) +
  geom_boxplot(position = position_dodge(width = 0.8)) +
  labs(title = "Flipper length of Adelie penguins by island and sex",
       x = "Island", y = "Flipper length (mm)")

In Conclusion

This data set allows us to explore the interplay between species and island, and between species, sex, body mass, and flipper length, of the Palmer penguins. By using the dplyr functions select(), count(), filter(), group_by(), and summarize() (along with drop_na() from tidyr), we could break down our data along various lines to make analysis easier.

We ultimately focused on the female Adelie penguins, and primarily on the quantitative variable flipper length. We summarized and graphed this variable for the whole group and across islands. There are many other comparisons we could make in this data set, and we may return to it in the future!