When studying the penguins of the Palmer Archipelago, how do body mass, species, and sex interact with one another? We explore these questions using an appropriate data set. Going beyond what we did last time, we also explore how to use the dplyr package to work with our data.
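The code in this report relies on a data frame called Penguins and on functions from dplyr, tidyr, and ggplot2, but the setup step is not shown. A minimal sketch of that setup, assuming Penguins is simply a copy of the penguins data from the palmerpenguins package (an assumption, since the source of Penguins is not stated here):

```r
# Packages used throughout the report
library(dplyr)    # select(), filter(), summarize(), group_by()
library(tidyr)    # drop_na()
library(ggplot2)  # plotting

# Assumption: the data come from the palmerpenguins package
library(palmerpenguins)
Penguins <- penguins

# Quick sanity check on the size of the data
dim(Penguins)
```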
We begin with some exploratory data analysis to learn more about this data set.
names(Penguins)
## [1] "species"           "island"            "bill_length_mm"
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"
## [7] "sex"               "year"
dim(Penguins)
## [1] 344 8
head(Penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
We want to know how many penguins there are of each species. The following bar graph displays the counts by species:
ggplot(Penguins, aes(x = species, fill = species)) +
  geom_bar() +
  labs(title = "How many penguins of each species?",
       x = "Species", y = "Count")
We break this down further by counting how many penguins of each species appear on each island. This can be summarized in a table with dplyr, and it is displayed here as a side-by-side bar chart:
Penguins %>%
  ggplot(aes(x = island, fill = species)) +
  geom_bar(position = "dodge") +
  labs(title = "Counts by island and species (side-by-side)",
       x = "Island", y = "Count")
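The table-like summary of species by island can be produced with dplyr's count(). This is a minimal sketch rather than the original code, and it assumes Penguins is a copy of palmerpenguins::penguins; the column name n_penguins is chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Assumption: Penguins is a copy of the palmerpenguins data
Penguins <- penguins

# One row per island/species combination that actually occurs,
# with the number of penguins in that combination
species_by_island <- Penguins %>%
  count(island, species, name = "n_penguins")

species_by_island
```

Each row of this table corresponds to one bar in the side-by-side chart, so the two views can be checked against each other.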
We restrict our attention to the variables for species, island, sex, flipper length, and body mass, and we remove any rows containing NA values. We can do this in one piped chain, using select() from dplyr and drop_na() from tidyr, giving us a new data set called Penguins2:
Penguins2 <- Penguins %>%
  select(species, island, sex, flipper_length_mm, body_mass_g) %>%
  drop_na()
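Before relying on Penguins2, it can be worth checking how much data drop_na() removed and which columns held the missing values. A minimal sketch, again assuming Penguins is a copy of palmerpenguins::penguins; the names na_counts and rows_dropped are introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Assumption: Penguins is a copy of the palmerpenguins data
Penguins <- penguins

Penguins2 <- Penguins %>%
  select(species, island, sex, flipper_length_mm, body_mass_g) %>%
  drop_na()

# Number of missing values in each selected column, before dropping
na_counts <- Penguins %>%
  select(species, island, sex, flipper_length_mm, body_mass_g) %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# Number of rows removed by drop_na()
rows_dropped <- nrow(Penguins) - nrow(Penguins2)
rows_dropped
```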
In the new data set, we look at the variables body mass and flipper length. Both flipper length and body mass appear to skew right, with the majority of the data sitting just left of the center of the range. This is illustrated in the histograms below:
ggplot(Penguins2, aes(x = flipper_length_mm)) +
  geom_histogram(bins = 25) +
  labs(title = "Flipper length", x = "mm", y = "Count")
ggplot(Penguins2, aes(x = body_mass_g)) +
  geom_histogram(bins = 25) +
  labs(title = "Body mass", x = "grams", y = "Count")
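One rough numeric companion to the histograms is to compare the mean with the median of each variable. A mean noticeably above the median is consistent with a right skew, though this is only a heuristic and not a formal test. The sketch below assumes Penguins is a copy of palmerpenguins::penguins; skew_check is a name introduced here.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Assumption: Penguins is a copy of the palmerpenguins data
Penguins <- penguins

Penguins2 <- Penguins %>%
  select(species, island, sex, flipper_length_mm, body_mass_g) %>%
  drop_na()

# Mean and median side by side for both quantitative variables
skew_check <- Penguins2 %>%
  summarize(
    flipper_mean   = mean(flipper_length_mm),
    flipper_median = median(flipper_length_mm),
    mass_mean      = mean(body_mass_g),
    mass_median    = median(body_mass_g)
  )

skew_check
```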
We now focus on the female Adelie penguins. We can do this with a pair of piped filter() commands, creating a new data set to work with:
Adelie_F <- Penguins2 %>%
  filter(species == "Adelie") %>%
  filter(sex == "female")
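The two piped filter() calls can also be written as a single filter() with both conditions, since conditions separated by commas are combined with a logical AND. The sketch below checks that the two forms agree; it assumes Penguins is a copy of palmerpenguins::penguins, and Adelie_F_alt is a name introduced here.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Assumption: Penguins is a copy of the palmerpenguins data
Penguins <- penguins

Penguins2 <- Penguins %>%
  select(species, island, sex, flipper_length_mm, body_mass_g) %>%
  drop_na()

# Two piped filters, as in the report
Adelie_F <- Penguins2 %>%
  filter(species == "Adelie") %>%
  filter(sex == "female")

# One filter with both conditions (combined with AND)
Adelie_F_alt <- Penguins2 %>%
  filter(species == "Adelie", sex == "female")

# Both approaches should select the same rows
all.equal(Adelie_F, Adelie_F_alt)
```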
Within this group (female Adelie penguins), we are interested in the summary statistics of mean and standard deviation, both for the group as a whole and grouped by island. But which variable should we look at? We choose body mass.
Based on this choice, we compute the following: the mean (across all female Adelie penguins); the standard deviation (across all female Adelie penguins); the mean (grouped by island); and the standard deviation (grouped by island). The code for this is provided below.
Adelie_F %>%
  summarize(mean(body_mass_g))
## # A tibble: 1 × 1
##   `mean(body_mass_g)`
##                 <dbl>
## 1               3369.
Adelie_F %>%
  summarize(sd(body_mass_g))
## # A tibble: 1 × 1
##   `sd(body_mass_g)`
##               <dbl>
## 1              269.
Adelie_F %>%
  group_by(island) %>%
  summarize(mean_Body = mean(body_mass_g),
            sd_Body = sd(body_mass_g))
## # A tibble: 3 × 3
##   island    mean_Body sd_Body
##   <fct>         <dbl>   <dbl>
## 1 Biscoe        3369.    343.
## 2 Dream         3344.    212.
## 3 Torgersen     3396.    259.
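Since a mean and standard deviation are easier to interpret alongside the group size, the summaries above could also carry a count from n(). This is a sketch of one way to do that, not the original code; it assumes Penguins is a copy of palmerpenguins::penguins, and the names overall and by_island are introduced here.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Assumption: Penguins is a copy of the palmerpenguins data
Penguins <- penguins

Adelie_F <- Penguins %>%
  select(species, island, sex, flipper_length_mm, body_mass_g) %>%
  drop_na() %>%
  filter(species == "Adelie", sex == "female")

# Whole-group summary in a single call, with named columns
overall <- Adelie_F %>%
  summarize(n = n(),
            mean_Body = mean(body_mass_g),
            sd_Body = sd(body_mass_g))

# The same summary, grouped by island
by_island <- Adelie_F %>%
  group_by(island) %>%
  summarize(n = n(),
            mean_Body = mean(body_mass_g),
            sd_Body = sd(body_mass_g))

overall
by_island
```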
Additionally, for this variable, we create the following graphs of the female Adelie penguins: (a) histograms of the data, faceted by island, and (b) a series of boxplots of the data, grouped by island. The code for doing this is below.
ggplot(data = Adelie_F, aes(x = body_mass_g, fill = island)) +
  geom_histogram(binwidth = 150) +
  facet_wrap(~island)
ggplot(data = Adelie_F, aes(y = island, x = body_mass_g)) +
  geom_boxplot()
Finally, to contextualize the female Adelie penguin data (and compare them to the males), we prepare the following side-by-side boxplot, which breaks body mass down by sex and island:
Penguins2 %>%
  filter(species == "Adelie") %>%
  ggplot(aes(x = island, y = body_mass_g, fill = sex)) +
  geom_boxplot(position = position_dodge(width = 0.8)) +
  labs(title = "Body Mass of Adelie by island and sex",
       x = "Island", y = "Body Mass")
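The comparison shown in the boxplot can be backed by a small numeric table grouped by both island and sex. A minimal sketch, assuming Penguins is a copy of palmerpenguins::penguins; adelie_by_sex and its column names are introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Assumption: Penguins is a copy of the palmerpenguins data
Penguins <- penguins

Penguins2 <- Penguins %>%
  select(species, island, sex, flipper_length_mm, body_mass_g) %>%
  drop_na()

# Count, median, and mean body mass for each island/sex combination
adelie_by_sex <- Penguins2 %>%
  filter(species == "Adelie") %>%
  group_by(island, sex) %>%
  summarize(n = n(),
            median_Body = median(body_mass_g),
            mean_Body = mean(body_mass_g),
            .groups = "drop")

adelie_by_sex
```

The median column corresponds to the center line of each box in the plot, which makes it straightforward to read the table and the graph together.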
This data set allows us to explore the interplay between species and island, and between species, sex, body mass, and flipper length, of the Palmer penguins. By using the dplyr functions select(), filter(), summarize(), and group_by(), along with drop_na() from tidyr, we could break down our data along various lines to make analysis easier.
We ultimately focused on the female Adelie penguins, and primarily on the quantitative variable body mass. We summarized and graphed this variable for the whole group and across islands. There are many other comparisons we could make in this data set, and we may return to it in the future!