When studying the penguins of the Palmer Archipelago, how do body mass, species, and sex relate to one another? We will try to explore these questions by looking at an appropriate data set.
We do some exploratory data analysis to determine more about this data set.
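library(palmerpenguins) # provides the penguins data set
library(tidyverse) # provides %>%, drop_na(), and ggplot()
names(penguins)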
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
dim(penguins)
## [1] 344 8
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
As we can see, there are eight variables:
Categorical variables: species, island, sex
Quantitative variables (continuous): body mass, flipper length, bill length, bill depth
Quantitative variables (discrete): year
We might be interested in knowing what the “categories” in our categorical variables might be. We enter the following R code to check this:
levels(as.factor(penguins$species))
## [1] "Adelie" "Chinstrap" "Gentoo"
levels(as.factor(penguins$island))
## [1] "Biscoe" "Dream" "Torgersen"
levels(as.factor(penguins$sex))
## [1] "female" "male"
We see that each categorical variable has the following categories:
Species: Adelie, Chinstrap, Gentoo
Island: Biscoe, Dream, Torgersen
Sex: female, male
The head() output above showed that this data set has some NA values in it. We first count the NAs in each column, then “clean” the data by removing every row that contains an NA, naming the new data set “penguins2”. We will only work with penguins2 for the rest of this report.
colSums(is.na(penguins))
##           species            island    bill_length_mm     bill_depth_mm
##                 0                 0                 2                 2
## flipper_length_mm       body_mass_g               sex              year
##                 2                 2                11                 0
penguins2 <- penguins %>% drop_na()
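As a rough illustration of what this cleaning step does, the sketch below rebuilds the six rows printed by head(penguins) above, using only the six columns that were displayed, and drops the incomplete row with base R's complete.cases(). The names `first_six`, `na_counts`, and `first_six_clean` are made up for this example, and the toy data ignores the sex and year columns, so it is not a substitute for running drop_na() on the full data set.

```r
# Hypothetical toy data: the six rows shown by head(penguins),
# restricted to the six columns that were printed.
first_six <- data.frame(
  species = rep("Adelie", 6),
  island = rep("Torgersen", 6),
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  bill_depth_mm = c(18.7, 17.4, 18.0, NA, 19.3, 20.6),
  flipper_length_mm = c(181L, 186L, 195L, NA, 193L, 190L),
  body_mass_g = c(3750L, 3800L, 3250L, NA, 3450L, 3650L)
)

# Count missing values per column, in the same way as colSums(is.na(penguins)).
na_counts <- colSums(is.na(first_six))

# Keep only the rows that have no missing value in any column.
first_six_clean <- first_six[complete.cases(first_six), ]

nrow(first_six)        # rows before cleaning
nrow(first_six_clean)  # rows after cleaning
```

In the full data set, the same kind of step takes us from the 344 rows reported by dim(penguins) to the 333 rows that appear as the grand total in the contingency table that follows.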
We are interested in checking the independence of the categorical variables species and island. We can do this by making a contingency table and by checking appropriate graphs. We include a few of those below:
table(penguins2$species, penguins2$island)
##
##             Biscoe Dream Torgersen
##   Adelie        44    55        47
##   Chinstrap      0    68         0
##   Gentoo       119     0         0
addmargins(table(penguins2$species, penguins2$island))
##
##             Biscoe Dream Torgersen Sum
##   Adelie        44    55        47 146
##   Chinstrap      0    68         0  68
##   Gentoo       119     0         0 119
##   Sum          163   123        47 333
# Basic bar graph of species counts
ggplot(data = penguins2, aes(x = species)) +
  geom_bar()
# Basic bar graphs, “facet wrapped” for island
ggplot(data = penguins2, aes(x = species)) +
  geom_bar() +
  facet_wrap(~island)
# Stacked bars: species counts filled by island
ggplot(data = penguins2, aes(x = species, fill = island)) +
  geom_bar()
# Side-by-side bars
ggplot(data = penguins2, aes(x = species, fill = island)) +
  geom_bar(position = "dodge")
# Segmented (100%) bars
ggplot(data = penguins2, aes(x = species, fill = island)) +
  geom_bar(position = "fill")
# We can add more details! For example, we can “fill” by island and “facet wrap” by sex.
ggplot(data = penguins2, aes(x = species, fill = island)) +
  geom_bar(position = "fill") +
  facet_wrap(~sex)
In the table and bar graphs above, we see that the distribution of species differs from island to island. On Biscoe Island, most penguins are Gentoo (119 of 163), the rest are Adelie (44), and there are no Chinstrap. On Dream Island, the penguins are split fairly evenly between Chinstrap (68) and Adelie (55), with no Gentoo. On Torgersen Island, all 47 penguins are Adelie. Because the species distribution changes so much across islands, the two variables appear to be associated in this sample; that is, they are NOT independent.
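One way to make this comparison concrete is to turn the counts into proportions. If species and island were independent, the species proportions within each island would be roughly equal to the overall species proportions. The sketch below copies the counts from the contingency table into a base R matrix; the names `counts`, `within_island`, and `overall` are chosen only for this illustration.

```r
# Counts copied from the contingency table above
# (rows: species, columns: island).
counts <- matrix(
  c(44, 55, 47,
    0, 68, 0,
    119, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island = c("Biscoe", "Dream", "Torgersen")
  )
)

# Species proportions within each island (each column sums to 1).
within_island <- prop.table(counts, margin = 2)

# Overall species proportions, ignoring island.
overall <- rowSums(counts) / sum(counts)

round(within_island, 3)
round(overall, 3)
```

On the real penguins2 data, prop.table(table(penguins2$species, penguins2$island), margin = 2) should give the same within-island proportions without retyping the counts.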
We might also be interested in comparing body mass (a quantitative variable) across categories of species and of sex. We can do this by calculating summary statistics (mean, standard deviation, and so on) for each species and/or sex category. We include a few of those below:
# Overall mean body mass (penguins2 has no NAs, so na.rm is not needed)
penguins2 %>%
  summarize(mean(body_mass_g))
## # A tibble: 1 × 1
## `mean(body_mass_g)`
## <dbl>
## 1 4207.
# Mean by sex
penguins2 %>%
  group_by(sex) %>%
  summarize(mean_body_mass = mean(body_mass_g))
## # A tibble: 2 × 2
## sex mean_body_mass
## <fct> <dbl>
## 1 female 3862.
## 2 male 4546.
# Mean by species
penguins2 %>%
  group_by(species) %>%
  summarize(mean_body_mass = mean(body_mass_g))
## # A tibble: 3 × 2
## species mean_body_mass
## <fct> <dbl>
## 1 Adelie 3706.
## 2 Chinstrap 3733.
## 3 Gentoo 5092.
# Mean by species and sex
penguins2 %>%
  group_by(species, sex) %>%
  summarize(
    n = n(),
    mean_body_mass = mean(body_mass_g),
    sd_body_mass = sd(body_mass_g)
  )
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## # Groups:   species [3]
##   species   sex        n mean_body_mass sd_body_mass
##   <fct>     <fct>  <int>          <dbl>        <dbl>
## 1 Adelie    female    73          3369.         269.
## 2 Adelie    male      73          4043.         347.
## 3 Chinstrap female    34          3527.         285.
## 4 Chinstrap male      34          3939.         362.
## 5 Gentoo    female    58          4680.         282.
## 6 Gentoo    male      61          5485.         313.
We can draw the following conclusions:
Adelie penguins: females average about 3369 g, while males are heavier at about 4043 g.
Chinstrap penguins: females average about 3527 g, while males are heavier at about 3939 g.
Gentoo penguins: females average about 4680 g, while males are much heavier at about 5485 g.
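To put numbers on these gaps, and to see how the six group means relate to the earlier means by sex and overall, here is a small base R sketch. It uses the group sizes and rounded means copied from the summary table, so its results are only as precise as that rounding. The names `by_group`, `females`, `males`, and `sex_gap` are made up for this illustration.

```r
# Group sizes and (rounded) means copied from the summary table above.
by_group <- data.frame(
  species = rep(c("Adelie", "Chinstrap", "Gentoo"), each = 2),
  sex = rep(c("female", "male"), times = 3),
  n = c(73, 73, 34, 34, 58, 61),
  mean_body_mass = c(3369, 4043, 3527, 3939, 4680, 5485)
)

females <- by_group[by_group$sex == "female", ]
males <- by_group[by_group$sex == "male", ]

# Male minus female mean body mass within each species, in grams.
sex_gap <- setNames(males$mean_body_mass - females$mean_body_mass,
                    males$species)
sex_gap

# A mean over pooled groups is the group means weighted by group size.
female_mean <- weighted.mean(females$mean_body_mass, females$n)
male_mean <- weighted.mean(males$mean_body_mass, males$n)
overall_mean <- weighted.mean(by_group$mean_body_mass, by_group$n)
round(c(female = female_mean, male = male_mean, overall = overall_mean))
```

With these rounded inputs, the male-minus-female gaps come out to 674 g for Adelie, 412 g for Chinstrap, and 805 g for Gentoo. The weighted means round to 3862 g, 4546 g, and 4207 g, which agrees with the means by sex and the overall mean computed earlier.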
To further explore the difference in body mass across categories of sex and/or species, we can create displays of the data! We do so here, creating appropriate histograms and boxplots:
# Body mass histogram for all penguins
ggplot(data = penguins2, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
# Stacked histogram, filled by species
ggplot(data = penguins2, aes(x = body_mass_g, fill = species)) +
  geom_histogram(binwidth = 200)
# Separate histogram panels for each species
ggplot(data = penguins2, aes(x = body_mass_g, fill = species)) +
  geom_histogram(binwidth = 200) +
  facet_wrap(~species)
############################################################
### One Quantitative and One Categorical Variable – Graphing Boxplots
############################################################
# A single overall boxplot for body mass
ggplot(data = penguins2, aes(x = body_mass_g)) +
  geom_boxplot()
# Boxplots of body mass by species (horizontal)
ggplot(data = penguins2, aes(y = species, x = body_mass_g)) +
  geom_boxplot()
# Boxplots of body mass by sex (horizontal)
ggplot(data = penguins2, aes(y = sex, x = body_mass_g)) +
  geom_boxplot()
# Boxplots by species and sex together
ggplot(data = penguins2, aes(y = interaction(species, sex), x = body_mass_g, fill = sex)) +
  geom_boxplot()
By looking at the histograms and the boxplots, we see the following:
Gentoo penguins stand out as having much higher body mass compared to Adelie and Chinstrap. Within each species, males are consistently heavier than females, and the difference between sexes is clear in the distributions. The overall spread of body mass is fairly similar across groups, though males tend to show slightly greater variation than females.
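The statement about spread can also be checked against the standard deviations reported in the species-by-sex summary table. The sketch below copies those rounded values into a small data frame and compares males with females; `sd_table` and its column names are hypothetical and used only here.

```r
# Standard deviations (g) copied from the species-by-sex summary table.
sd_table <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  sd_female = c(269, 285, 282),
  sd_male = c(347, 362, 313)
)

# A ratio above 1 means males vary more than females in that species.
sd_table$male_to_female <- sd_table$sd_male / sd_table$sd_female
sd_table$males_more_variable <- sd_table$male_to_female > 1
sd_table
```

All three ratios are above 1, which is consistent with males showing somewhat more variation than females, with the smallest difference among Gentoo penguins.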
This data set allows us to explore the interplay between species and island, and between species, sex, and body mass, of the Palmer Penguins. By using a contingency table and some bar charts, we see that there is a relationship between species and island (with different islands having vastly different distributions of penguin species). By comparing summary statistics and looking at histograms and boxplots, we see that there is a relationship between body mass and species / sex (with males typically having more body mass than females for each species, and Gentoo penguins typically having more body mass than the other two species of penguin). There are many other comparisons we could make in this data set, and we may return to it in the future!