Scenario

When studying the penguins of the Palmer Archipelago, how do body mass, species, and sex interact with each other? We will explore this question by looking at an appropriate data set.

The Data

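We begin with some exploratory data analysis to learn more about this data set.

# Assumed setup: the penguins data come from the palmerpenguins package,
# and %>%, drop_na(), and ggplot() come from the tidyverse
library(palmerpenguins)
library(tidyverse)
names(penguins)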
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
dim(penguins)
## [1] 344   8
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

As we can see, there are eight variables:

Categorical variables: species, island, sex
Quantitative variables (continuous): body mass, flipper length, bill length, bill depth
Quantitative variables (discrete): year

We might be interested in knowing what the “categories” in our categorical variables might be. We enter the following R code to check this:

levels(as.factor(penguins$species))
## [1] "Adelie"    "Chinstrap" "Gentoo"
levels(as.factor(penguins$island))
## [1] "Biscoe"    "Dream"     "Torgersen"
levels(as.factor(penguins$sex))
## [1] "female" "male"

We see that each categorical variable has the following categories:

Species: Adelie, Chinstrap, Gentoo
Island: Torgersen, Dream, Biscoe
Sex: male, female

We also saw that this data set has some NA values in it. We first count the NAs in each column, then “clean” the data by removing every row that contains an NA, naming the new data set “penguins2”. We will work only with penguins2 for the rest of this report.

colSums(is.na(penguins))
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 2                 2 
## flipper_length_mm       body_mass_g               sex              year 
##                 2                 2                11                 0
penguins2 <- penguins %>% drop_na()
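
The column counts above add up to 19 missing values, yet the contingency table later in this report totals 333 penguins, so only 11 of the 344 rows were dropped. A row with several NAs is still removed only once. The sketch below illustrates this on a small made-up data frame (the name toy and its values are hypothetical, not part of the penguins data), using base R's complete.cases() as a stand-in for drop_na():

```r
# Hypothetical toy data, not the penguins data set
toy <- data.frame(
  mass = c(3750, NA, 3250, 3450),
  sex  = c("male", NA, NA, "female")
)

# NAs per column: mass has 1, sex has 2 (3 missing values in total)
colSums(is.na(toy))

# Keep only rows with no NA in any column (base R counterpart of drop_na())
toy2 <- toy[complete.cases(toy), ]

# Only 2 rows are dropped, because row 2 is missing both values
nrow(toy2)
```

Comparing the row count before and after cleaning is a quick way to confirm how many observations were actually removed.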

Comparing species and island

We are interested in checking the independence of the categorical variables species and island. We can do this by making a contingency table and by checking appropriate graphs. We include a few of those below:

table(penguins2$species, penguins2$island)
##            
##             Biscoe Dream Torgersen
##   Adelie        44    55        47
##   Chinstrap      0    68         0
##   Gentoo       119     0         0
addmargins(table(penguins2$species, penguins2$island))
##            
##             Biscoe Dream Torgersen Sum
##   Adelie        44    55        47 146
##   Chinstrap      0    68         0  68
##   Gentoo       119     0         0 119
##   Sum          163   123        47 333
# Basic bar graph with species
ggplot(data = penguins2, aes(x = species)) +
  geom_bar()

# Basic bar graphs, “facet wrapped” for island
ggplot(data = penguins2, aes(x = species)) +
  geom_bar() +
  facet_wrap(~island)

ggplot(data = penguins2, aes(x = species, fill = island)) +
  geom_bar()

ggplot(data = penguins2, aes(x = species, fill = island)) +
  geom_bar(position = "dodge")

ggplot(data = penguins2, aes(x = species, fill = island)) +
  geom_bar(position = "fill")    # segmented (100%) bars

# We can add more details! For example, we can “fill” by island and “facet wrap” by sex.

ggplot(data = penguins2, aes(x = species, fill = island)) +
  geom_bar(position = "fill") +
  facet_wrap(~sex)

In the table and graphs above, we see that the distribution of species changes from island to island. Specifically, on Biscoe Island most penguins are Gentoo (119 of 163), with a smaller number of Adelie (44) and no Chinstrap. On Dream Island, the penguins are split fairly evenly between Adelie (55) and Chinstrap (68), while no Gentoo penguins are present. On Torgersen Island, only Adelie penguins are observed. If species and island were independent, the species distribution would look roughly the same on every island. This difference (in species) across groups (of island) suggests that these two variables are NOT independent.
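
One way to make this comparison concrete is to turn the counts into proportions. The sketch below re-enters the counts from the contingency table above by hand (so it runs without any packages) and compares the species distribution within each island to the overall species distribution. The object names counts, within_island, and overall are only illustrative.

```r
# Counts copied from the contingency table above (penguins2, NAs removed)
counts <- matrix(
  c( 44, 55, 47,
      0, 68,  0,
    119,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

# Distribution of species within each island (each column sums to 1)
within_island <- prop.table(counts, margin = 2)
round(within_island, 3)

# Overall distribution of species, ignoring island
overall <- rowSums(counts) / sum(counts)
round(overall, 3)
```

Under independence, every column of within_island would be close to overall. Here the columns differ sharply from one another, which matches what the bar graphs show.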

Comparing Body Mass, Species, and Sex

We might also be interested in comparing body mass (a quantitative variable) across categories of species and of sex. We can do this by calculating summary statistics (mean, standard deviation, and so on) for each species and/or sex category. We include a few of those below:

# Overall mean body mass (NAs were already removed in penguins2)
penguins2 %>%
  summarize(mean(body_mass_g))
## # A tibble: 1 × 1
##   `mean(body_mass_g)`
##                 <dbl>
## 1               4207.
# Mean by sex
penguins2 %>%
  group_by(sex) %>%
  summarize(mean_body_mass = mean(body_mass_g))
## # A tibble: 2 × 2
##   sex    mean_body_mass
##   <fct>           <dbl>
## 1 female          3862.
## 2 male            4546.
# Mean by species
penguins2 %>%
  group_by(species) %>%
  summarize(mean_body_mass = mean(body_mass_g))
## # A tibble: 3 × 2
##   species   mean_body_mass
##   <fct>              <dbl>
## 1 Adelie             3706.
## 2 Chinstrap          3733.
## 3 Gentoo             5092.
# Mean by species and sex
penguins2 %>%
  group_by(species, sex) %>%
  summarize(
    n = n(),
    mean_body_mass = mean(body_mass_g),
    sd_body_mass   = sd(body_mass_g)
  )
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## # Groups:   species [3]
##   species   sex        n mean_body_mass sd_body_mass
##   <fct>     <fct>  <int>          <dbl>        <dbl>
## 1 Adelie    female    73          3369.         269.
## 2 Adelie    male      73          4043.         347.
## 3 Chinstrap female    34          3527.         285.
## 4 Chinstrap male      34          3939.         362.
## 5 Gentoo    female    58          4680.         282.
## 6 Gentoo    male      61          5485.         313.

We can draw the following conclusions:

Adelie penguins: females average about 3369 g, while males are heavier at about 4043 g.
Chinstrap penguins: females average about 3527 g, while males are heavier at about 3939 g.
Gentoo penguins: females average about 4680 g, while males are much heavier at about 5485 g.
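
To see how large the male-female gap is within each species, we can subtract the group means. The sketch below types in the rounded means from the summary table above rather than recomputing them from penguins2, so the results are approximate. The data frame name means and its column names are only illustrative.

```r
# Group means (in grams) copied from the rounded summary output above
means <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  female  = c(3369, 3527, 4680),
  male    = c(4043, 3939, 5485)
)

# Male-minus-female gap in grams, and as a percentage of the female mean
means$diff_g   <- means$male - means$female
means$diff_pct <- round(100 * means$diff_g / means$female, 1)
means
```

With these rounded means, the gap is 674 g for Adelie, 412 g for Chinstrap, and 805 g for Gentoo, so the size of the sex difference itself varies by species.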

To further explore the difference in body mass across categories of sex and/or species, we can create displays of the data! We do so here, creating appropriate histograms and boxplots:

# Body mass histograms
ggplot(data = penguins2, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

ggplot(data = penguins2, aes(x = body_mass_g, fill = species)) +
  geom_histogram(binwidth = 200)

ggplot(data = penguins2, aes(x = body_mass_g, fill = species)) +
  geom_histogram(binwidth = 200) +
  facet_wrap(~species)

############################################################
### One Quantitative and One Categorical Variable – Graphing Boxplots
############################################################

# A single overall boxplot for body mass
ggplot(data = penguins2, aes(x = body_mass_g)) +
  geom_boxplot()

# Boxplots of body mass by species (horizontal, since body mass is on the x-axis)
ggplot(data = penguins2, aes(y = species, x = body_mass_g)) +
  geom_boxplot()

# Boxplots of body mass by sex (horizontal)
ggplot(data = penguins2, aes(y = sex, x = body_mass_g)) +
  geom_boxplot()

# Boxplots by species and sex together
ggplot(data = penguins2, aes(y = interaction(species, sex), x = body_mass_g, fill = sex)) +
  geom_boxplot()

By looking at the histograms and the boxplots, we see the following:

Gentoo penguins stand out as having much higher body mass compared to Adelie and Chinstrap. Within each species, males are consistently heavier than females, and the difference between sexes is clear in the distributions. The overall spread of body mass is fairly similar across groups, though males tend to show slightly greater variation than females.
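
The remark about variation can be checked against the standard deviations reported in the species-by-sex summary table earlier in this section. The sketch below re-enters those rounded values by hand (the data frame name sds and its column names are only illustrative) and computes the ratio of the male to the female standard deviation for each species.

```r
# Standard deviations (in grams) copied from the rounded summary table
sds <- data.frame(
  species   = c("Adelie", "Chinstrap", "Gentoo"),
  female_sd = c(269, 285, 282),
  male_sd   = c(347, 362, 313)
)

# A ratio above 1 means males vary more than females within that species
sds$ratio <- round(sds$male_sd / sds$female_sd, 2)
sds
```

All three ratios are above 1, which is consistent with males showing somewhat more variation than females in each species.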

In Conclusion

This data set allows us to explore the interplay between species and island, and between species, sex, and body mass, of the Palmer Penguins. By using a contingency table and some bar charts, we see that there is a relationship between species and island (with different islands having vastly different distributions of penguin species). By comparing summary statistics and looking at histograms and boxplots, we see that there is a relationship between body mass and species / sex (with males typically having more body mass than females for each species, and Gentoo penguins typically having more body mass than the other two species of penguin). There are many other comparisons we could make in this data set, and we may return to it in the future!