Exploring the Penguin Dataset

Author

S.Matsumoto

This Quarto document serves as a practical illustration of the concepts covered in the productive workflow online course. It’s designed primarily for educational purposes, so the focus is on demonstrating Quarto techniques rather than on the rigor of its scientific content. The callout reference page is here.

1 Introduction

This case study shows how to build an HTML report with Quarto. You can read more about the penguin dataset here.

Let’s load libraries before we start!

# Load the tidyverse, plus patchwork (plot composition) and ggthemes (extra themes)
library(tidyverse)
library(patchwork)
library(ggthemes)

2 Loading data

The dataset has already been loaded and cleaned in the previous step of this pipeline.

Let’s load the clean version, together with a few functions available in functions.R!

# Source functions
source(file = "functions.R")

# Read the clean dataset
data <- readRDS(file = "../input/clean_data.rds")
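
The chunks that follow rely on the columns `species`, `bill_length_mm`, and `bill_depth_mm`, so a quick check right after loading can catch a stale or malformed file early. The sketch below is only an illustration: it round-trips a small made-up data frame through a temporary file standing in for `clean_data.rds`, and `check_columns()` is a hypothetical helper, not something defined in functions.R.

```r
# Hypothetical helper: stop with a clear message if expected columns are missing
check_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up rows standing in for the cleaned penguin data (not real measurements)
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  bill_depth_mm = c(18.7, 13.2)
)

# Round trip through a temporary .rds file, mirroring the readRDS() call above
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, file = tmp)
reloaded <- readRDS(file = tmp)

# Passes silently when every required column is present
check_columns(reloaded, c("species", "bill_length_mm", "bill_depth_mm"))
```

In the report itself, the same call could be applied to `data` directly after `readRDS()`.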

Figure: bill measurement explanation.

3 Bill Length and Bill Depth

Now, let’s run a descriptive analysis, including summary statistics and graphs.

What’s striking is the slightly negative relationship between bill length and bill depth:

data %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = "#69b3a2") +
  labs(
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    title = "Surprising relationship?"
  ) +
  theme_classic()

Relationship between bill length and bill depth. All data points included.
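
One way to put a number on the relationship seen in the scatterplot is Pearson's correlation coefficient. The following is a minimal sketch in base R, run on a small made-up data frame rather than the penguin data; `bill_correlation()` is a hypothetical helper name, and the values are chosen only so that the overall trend is negative.

```r
# Hypothetical helper: Pearson correlation between two numeric columns,
# dropping rows where either value is missing
bill_correlation <- function(df, x = "bill_length_mm", y = "bill_depth_mm") {
  cor(df[[x]], df[[y]], use = "complete.obs")
}

# Made-up values for illustration only (the last row has a missing length)
toy <- data.frame(
  bill_length_mm = c(36, 38, 40, 42, 44, 46, 48, 50, NA),
  bill_depth_mm  = c(17, 18.5, 18, 20, 13, 14.5, 14, 16, 18)
)

r_overall <- bill_correlation(toy)
r_overall  # negative for this toy data
```

Applied to the report's own data, the call would be `bill_correlation(data)`, assuming both columns are stored as numeric.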

It is also interesting to note that bill length and bill depth differ considerably from one species to another. The average of a variable a, observed n times, can be computed as follows:

\[Avg = \frac{1}{n} \displaystyle\sum_{i=1}^{n} a_i = \frac{a_1+a_2+\dots+a_n}{n}\]
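
As a quick check of the formula, the sketch below computes the average by hand and compares it with R's built-in `mean()`. The vector `a` holds made-up values for illustration, not measurements taken from the dataset.

```r
# Made-up values standing in for a_1, ..., a_n
a <- c(39.1, 39.5, 40.3, 36.7)
n <- length(a)

# Average computed directly from the formula: (a_1 + ... + a_n) / n
avg_by_formula <- sum(a) / n

# Built-in equivalent
avg_builtin <- mean(a)

# TRUE when both approaches agree (up to floating-point tolerance)
same_result <- isTRUE(all.equal(avg_by_formula, avg_builtin))

# With missing values, na.rm = TRUE drops them before averaging,
# so n becomes the number of non-missing observations
avg_with_na <- mean(c(a, NA), na.rm = TRUE)
```

This is why the summary code in this report passes `na.rm = TRUE`: without it, a single missing measurement would make the whole average `NA`.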

Bill length and bill depth averages per species are summarized in the two tables below.

# Mean bill length per species, using dplyr
data %>%
  group_by(species) %>%
  summarize(mean_bill_length = round(mean(as.numeric(bill_length_mm), na.rm = TRUE), 2))

# Mean bill depth per species, using dplyr
data %>%
  group_by(species) %>%
  summarize(mean_bill_depth = round(mean(as.numeric(bill_depth_mm), na.rm = TRUE), 2))

# Mean bill length for the Adelie species only
adelie_avg <-
  data %>%
  filter(species == "Adelie") %>%
  summarize(mean_bill_length = round(mean(as.numeric(bill_length_mm), na.rm = TRUE), 2))
# A tibble: 3 × 2
  species   mean_bill_length
  <chr>                <dbl>
1 Adelie                38.8
2 Chinstrap             48.8
3 Gentoo                47.5
# A tibble: 3 × 2
  species   mean_bill_depth
  <chr>               <dbl>
1 Adelie               18.3
2 Chinstrap            18.4
3 Gentoo               15.0

For instance, the average bill length for the Adelie species is 38.81 mm.

Now, let’s check the relationship between bill depth and bill length within each species:

# Use the create_scatterplot() function defined in functions.R
p1 <- create_scatterplot(data, "Adelie", "#6689c6")
p2 <- create_scatterplot(data, "Chinstrap", "#e85252")
p3 <- create_scatterplot(data, "Gentoo", "#9a6fb0")

# Combine the three plots into one figure with patchwork
p1 + p2 + p3
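
The body of `create_scatterplot()` lives in functions.R and is not shown in this report. Judging only from how it is called (a data frame, a species name, and a color), a helper of this kind might look roughly like the sketch below. The name `create_scatterplot_sketch()`, its argument names, and the toy data are all assumptions made for illustration; the real function may differ.

```r
library(ggplot2)

# Hypothetical stand-in for the helper in functions.R (the real one is not shown)
create_scatterplot_sketch <- function(data, selected_species, color) {
  # Keep only the rows for the requested species, ignoring missing species
  subset_data <- data[!is.na(data$species) & data$species == selected_species, ]

  ggplot(subset_data, aes(x = bill_length_mm, y = bill_depth_mm)) +
    geom_point(color = color) +
    labs(
      x = "Bill Length (mm)",
      y = "Bill Depth (mm)",
      title = selected_species
    ) +
    theme_classic()
}

# Made-up rows for illustration only, not the penguin data
toy <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(36, 38, 40, 46, 48),
  bill_depth_mm = c(17, 18.5, 18, 14.5, 14)
)

p_adelie <- create_scatterplot_sketch(toy, "Adelie", "#6689c6")
```

Wrapping the plot in a function keeps the three calls above short and guarantees that the panels differ only in species and color.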

There is actually a positive correlation when split by species.
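
A negative trend overall combined with a positive trend inside each group is often called Simpson's paradox. The sketch below reproduces the pattern numerically with dplyr on a small made-up data frame (not the penguin data): two groups with different typical bill sizes, each with a positive internal trend.

```r
library(dplyr)

# Made-up values for illustration only
toy <- data.frame(
  species = rep(c("Adelie", "Gentoo"), each = 4),
  bill_length_mm = c(36, 38, 40, 42, 44, 46, 48, 50),
  bill_depth_mm  = c(17, 18.5, 18, 20, 13, 14.5, 14, 16)
)

# Correlation across all rows, ignoring species: negative for this toy data
overall <- cor(toy$bill_length_mm, toy$bill_depth_mm)

# Correlation computed separately within each species: positive for both
by_species <- toy %>%
  group_by(species) %>%
  summarize(correlation = cor(bill_length_mm, bill_depth_mm))

overall
by_species
```

The sign flips because the group with longer bills also has shallower bills, so the difference between groups dominates the pooled correlation. On the report's data, the same `group_by()` and `summarize()` pattern could be applied to `data`, adding `use = "complete.obs"` to `cor()` if missing values remain.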