Exploratory Data Analysis Activity

Learning Objectives

  • Create and interpret univariate statistics & visualizations

  • Create and interpret bivariate statistics & visualizations

First, let’s load the libraries needed for this activity.

# Load necessary packages
library(tidyverse)
library(ggthemes)
library(flextable)
library(corrr)
# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few())

# Set options for flextables
set_flextable_defaults(na_str = "NA")

# Load function for printing tables nicely
source("https://raw.githubusercontent.com/dilernia/STA323/main/Functions/make_flex.R")

Importing data

# Load Palmer penguins data
penguins <- readr::read_csv("https://raw.githubusercontent.com/dilernia/STA323/main/Data/penguins.csv")

Univariate statistics

# Calculating descriptive statistics
quant1Stats <- penguins %>% 
  dplyr::summarize(
  Minimum = min(flipper_length_mm, na.rm = TRUE),
  Q1 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.25),
  M = median(flipper_length_mm, na.rm = TRUE),
  Q3 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.75),
  Maximum = max(flipper_length_mm, na.rm = TRUE),
  Mean = mean(flipper_length_mm, na.rm = TRUE),
  R = Maximum - Minimum,
  s = sd(flipper_length_mm, na.rm = TRUE)
)

# Printing table of statistics
quant1Stats %>% 
  make_flex(caption = "Quantitative summary statistics for penguin flipper lengths (mm).")

Table 1: Quantitative summary statistics for penguin flipper lengths (mm).

Minimum  Q1      M       Q3      Maximum  Mean    R      s
172.00   190.00  197.00  213.00  231.00   200.92  59.00  14.06

What are the largest and smallest flipper lengths for penguins in this data set?

The largest was 231 mm and the smallest was 172 mm.

Provide and interpret the value of the sample median flipper length for the penguins.

M = 197 mm, so about half of the penguins in the data set had a flipper length of 197 mm or less, and about half had a flipper length of 197 mm or more.
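As a small illustration of how a median splits a sample, the sketch below uses a hypothetical vector of flipper lengths (not the penguins data) and checks the share of values on each side of the median.

```r
# Hypothetical flipper lengths (mm), for illustration only
x <- c(181, 186, 190, 193, 195, 197, 203, 210, 215)

# Sample median
m <- median(x)
m

# Proportion of values at or below, and at or above, the median
prop_at_or_below <- mean(x <= m)
prop_at_or_above <- mean(x >= m)
c(at_or_below = prop_at_or_below, at_or_above = prop_at_or_above)
```

Both proportions are at least one half, which is what the interpretation of the median relies on.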

Provide the value of the sample variance of the flipper length for the penguins.

Since s = 14.06 mm, the sample variance is s^2 = 14.06^2 ≈ 197.7 mm^2. Because s is itself rounded, only the first decimal place or so is reliable.
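The arithmetic can be checked directly in R. The first part squares the rounded standard deviation from Table 1; the second part uses a small hypothetical vector only to show that var() agrees with sd() squared, which is what one would use on the raw data.

```r
# Sample standard deviation as reported (rounded) in Table 1
s <- 14.06

# The variance is the square of the standard deviation
s_squared <- s^2
round(s_squared, 1)  # 197.7

# Hypothetical values (mm), not the penguins data:
# var() on raw data equals sd() squared
x <- c(181, 186, 195, 193, 190)
all.equal(var(x), sd(x)^2)
```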

Which statistic is more sensitive to outliers: the range or the interquartile-range (IQR)?

The range is more sensitive to outliers - the IQR is more robust to outliers than the range.
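A quick way to see this is to add one extreme value to a sample and compare how much each statistic moves. The vector below is hypothetical and is not taken from the penguins data.

```r
# Hypothetical flipper lengths (mm), for illustration only
lengths_mm <- c(185, 188, 190, 192, 195, 197, 200, 204)

# The same values with one extreme observation appended
with_outlier <- c(lengths_mm, 300)

# Range = maximum - minimum
range_before <- diff(range(lengths_mm))
range_after  <- diff(range(with_outlier))

# IQR = Q3 - Q1 (default quantile method in R)
iqr_before <- IQR(lengths_mm)
iqr_after  <- IQR(with_outlier)

c(range_before = range_before, range_after = range_after)
c(iqr_before = iqr_before, iqr_after = iqr_after)
```

In this sketch the range jumps from 19 to 115, while the IQR changes by less than 2.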

Single categorical variable

# Printing frequency tables
penguins %>% 
  dplyr::count(species) %>% 
  make_flex(caption = "Number of penguins by species.")

Table 2: Number of penguins by species.

species    n
Adelie     152
Chinstrap  68
Gentoo     124

# Printing frequency tables
penguins %>% 
  dplyr::count(island) %>% 
  make_flex(caption = "Number of penguins by island")

Table 3: Number of penguins by island

island     n
Biscoe     168
Dream      124
Torgersen  52

Making a histogram

# Creating a histogram
penguins %>% 
  ggplot(aes(x = flipper_length_mm)) + 
  geom_histogram(color = "white") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
  labs(title = "Distribution of penguin flipper lengths",
       x = "Flipper length (mm)",
       y = "Frequency",
       caption = "Data source: palmerpenguins R package")

What can we say about the penguin flipper lengths based on the histogram?

The histogram shows that the distribution of penguin flipper lengths is bimodal and fairly symmetric.

Creating a box plot

# Creating a box plot
penguins %>% 
  ggplot(aes(x = flipper_length_mm)) + 
  geom_boxplot() +
  scale_y_discrete(breaks = NULL) +
    labs(title = "Distribution of penguin flipper lengths",
       x = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package")

What can we say about the penguin flipper lengths based on the box plot? Are there any outliers present?

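  • Q1 = 190 mm

  • Q3 = 213 mm

  • Median = 197 mm (the line inside the box marks the median, not the mean)

  • Minimum = 172 mm

  • Maximum = 231 mm

  • There are no outliers: every value lies within 1.5 × IQR of the quartiles.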
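The outlier check can be reproduced by hand with the 1.5 × IQR rule, which is the default rule geom_boxplot() uses to flag points beyond the whiskers. The sketch below plugs in the quartiles and extremes from Table 1.

```r
# Quartiles and extremes reported in Table 1 (mm)
q1 <- 190
q3 <- 213
minimum <- 172
maximum <- 231

# Interquartile range and the outlier fences
iqr <- q3 - q1                 # 23
lower_fence <- q1 - 1.5 * iqr  # 155.5
upper_fence <- q3 + 1.5 * iqr  # 247.5

# TRUE if the smallest or largest value falls outside the fences
any_outliers <- minimum < lower_fence | maximum > upper_fence
any_outliers  # FALSE
```

Since both the minimum and the maximum sit inside the fences, no observation can be flagged as an outlier.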

Creating a bar chart

# Creating a bar chart
penguins %>% dplyr::count(species, .drop = FALSE) %>% 
  mutate(species = fct_reorder(species, n)) %>% 
  ggplot(aes(x = species, y = n,
             fill = species)) + 
  geom_col(color = "black") +
  scale_fill_manual(values = c("#c05ccb", "#067075", "#ff7600")) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
    labs(title = "Distribution of penguin species",
       x = "Species",
       y = "Frequency",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "none")

Adelie has the most penguins in the data set (152).

Chinstrap has the fewest penguins (68).

Calculating correlations

# Calculating correlations
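corTable <- penguins %>% 
  # Keep only numeric columns so correlate() is not passed text variables
  dplyr::select(where(is.numeric)) %>% 
  corrr::correlate(diagonal = 1)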

# Printing table of correlations
corTable %>%
  make_flex(caption = "Table of pairwise correlations.")

Table 4: Table of pairwise correlations.

term               bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  year
bill_length_mm     1.00            -0.24          0.66               0.60         0.05
bill_depth_mm      -0.24           1.00           -0.58              -0.47        -0.06
flipper_length_mm  0.66            -0.58          1.00               0.87         0.17
body_mass_g        0.60            -0.47          0.87               1.00         0.04
year               0.05            -0.06          0.17               0.04         1.00

Which variables have the strongest correlation?

Flipper length and body mass (r = 0.87).

Which variables have the weakest correlation?

Body mass and year (r = 0.04), since strength is judged by the absolute value of r.
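Rather than scanning the table by eye, the strongest and weakest pairs can be picked out programmatically. This is only a sketch: it retypes the rounded values from Table 4 into a matrix instead of reusing corTable, and the helper names (pair_table, strongest, weakest) are made up for the example.

```r
vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm",
          "body_mass_g", "year")

# Correlations copied from Table 4 (rounded to two decimals)
r <- matrix(c( 1.00, -0.24,  0.66,  0.60,  0.05,
              -0.24,  1.00, -0.58, -0.47, -0.06,
               0.66, -0.58,  1.00,  0.87,  0.17,
               0.60, -0.47,  0.87,  1.00,  0.04,
               0.05, -0.06,  0.17,  0.04,  1.00),
            nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once by looking only at the upper triangle
pairs <- which(upper.tri(r), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = vars[pairs[, "row"]],
  var2 = vars[pairs[, "col"]],
  r    = r[pairs]
)

# Strength is judged by the absolute value of r
strongest <- pair_table[which.max(abs(pair_table$r)), ]
weakest   <- pair_table[which.min(abs(pair_table$r)), ]
strongest
weakest
```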

Calculating descriptive statistics

# Calculating descriptive statistics
quant2Stats <- penguins %>% 
  group_by(species) %>% 
  summarize(
  Minimum = min(flipper_length_mm, na.rm = TRUE),
  Q1 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.25),
  M = median(flipper_length_mm, na.rm = TRUE),
  Q3 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.75),
  Maximum = max(flipper_length_mm, na.rm = TRUE),
  Mean = mean(flipper_length_mm, na.rm = TRUE),
  R = Maximum - Minimum,
  s = sd(flipper_length_mm, na.rm = TRUE),
  n = n()
)

# Printing table of statistics
quant2Stats %>% 
  make_flex(caption = "Summary statistics for penguin flipper lengths by species.")

Table 5: Summary statistics for penguin flipper lengths by species.

species    Minimum  Q1      M       Q3      Maximum  Mean    R      s     n
Adelie     172.00   186.00  190.00  195.00  210.00   189.95  38.00  6.54  152
Chinstrap  178.00   191.00  196.00  201.00  212.00   195.82  34.00  7.13  68
Gentoo     203.00   212.00  216.00  221.00  231.00   217.19  28.00  6.48  124

Which penguin species typically had the largest flipper lengths?

Gentoo, which had the largest mean (217.19 mm) and median (216 mm).

Which penguin species had the most variability in their flipper lengths?

Chinstrap, because it has the largest standard deviation (s = 7.13 mm).

Based on the range, however, Adelie has the most variability (R = 38 mm).

Other measures of spread or variability are:

  • Variance = s^2

  • IQR = Q3 - Q1
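These two extra measures can be computed from the summary values already in Table 5. The sketch below retypes those values into a small data frame (the name spread is made up for the example) and reports which species is most variable under each measure.

```r
# Summary statistics copied from Table 5 (flipper length, mm)
spread <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  Q1 = c(186, 191, 212),
  Q3 = c(195, 201, 221),
  R  = c(38, 34, 28),
  s  = c(6.54, 7.13, 6.48)
)

# Two further measures of spread
spread$IQR      <- spread$Q3 - spread$Q1
spread$variance <- round(spread$s^2, 2)
spread

# Species with the most variability under each measure
most_by_s   <- as.character(spread$species[which.max(spread$s)])
most_by_R   <- as.character(spread$species[which.max(spread$R)])
most_by_IQR <- as.character(spread$species[which.max(spread$IQR)])
c(s = most_by_s, R = most_by_R, IQR = most_by_IQR)
```

The answer depends on the measure: the standard deviation and IQR point to Chinstrap, while the range points to Adelie.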

Modifying the previously provided code, recreate the table of statistics for the body masses of the penguins stratified by sex below.

# Calculating descriptive statistics
quant2Stats <- penguins %>% 
  group_by(sex) %>% 
  summarize(
  Minimum = min(body_mass_g, na.rm = TRUE),
  Q1 = quantile(body_mass_g, na.rm = TRUE, probs = 0.25),
  M = median(body_mass_g, na.rm = TRUE),
  Q3 = quantile(body_mass_g, na.rm = TRUE, probs = 0.75),
  Maximum = max(body_mass_g, na.rm = TRUE),
  Mean = mean(body_mass_g, na.rm = TRUE),
  R = Maximum - Minimum,
  s = sd(body_mass_g, na.rm = TRUE),
  n = n()
)

# Printing table of statistics
quant2Stats %>% 
  make_flex(caption = "Summary statistics for penguin body masses (g) by sex")

Table 6: Summary statistics for penguin body masses (g) by sex

sex     Minimum   Q1        M         Q3        Maximum   Mean      R         s       n
female  2,700.00  3,350.00  3,650.00  4,550.00  5,200.00  3,862.27  2,500.00  666.17  165
male    3,250.00  3,900.00  4,300.00  5,312.50  6,300.00  4,545.68  3,050.00  787.63  168
NA      2,975.00  3,475.00  4,100.00  4,650.00  4,875.00  4,005.56  1,900.00  679.36  11

Table of counts

For two categorical variables, a table of counts is commonly calculated as below.

# Creating frequency table
speciesIslandCounts <- penguins %>% 
  dplyr::count(species, island)

# Printing frequency table
speciesIslandCounts %>% 
  make_flex(caption = "Number of penguins by island and species.")

Table 7: Number of penguins by island and species.

species    island     n
Adelie     Biscoe     44
Adelie     Dream      56
Adelie     Torgersen  52
Chinstrap  Dream      68
Gentoo     Biscoe     124

Was any penguin species found on more than one island?

– Yes, but only one: Adelie, which was found on Biscoe, Dream, and Torgersen.

How many penguins were found on Dream island?

– 124 penguins (56 Adelie + 68 Chinstrap).
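Both answers are easier to read off a two-way layout with row and column totals. The sketch below retypes the counts from Table 7 and uses base R's xtabs(); the object names are made up for the example.

```r
# Counts copied from Table 7
counts <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo"),
  island  = c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe"),
  n       = c(44, 56, 52, 68, 124)
)

# Two-way table of counts: species in rows, islands in columns
two_way <- xtabs(n ~ species + island, data = counts)

# Same table with row and column totals added
addmargins(two_way)

# Number of islands on which each species was observed
islands_per_species <- rowSums(two_way > 0)
islands_per_species

# Total number of penguins on each island
penguins_per_island <- colSums(two_way)
penguins_per_island
```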

Bivariate visualizations

Scatter plot

# Creating a scatter plot
penguins %>% 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) + 
  geom_point(pch = 21, color = "white", fill = "black") +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package")

We can also add a straight line of best fit to a scatter plot.

# Creating a scatter plot with a line of best fit
penguins %>% 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) + 
  geom_point(pch = 21, color = "white", fill = "black") +
  geom_smooth(method = 'lm', se = FALSE) +
      labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package")

The scatter plot shows a strong, positive, linear association between body mass and flipper length (r = 0.87).

Side-by-side box plot

One quantitative and one categorical variable

For one quantitative and one categorical variable, side-by-side box plots are a useful visualization. We can use a side-by-side boxplot to explore the penguin flipper lengths by species as below.

# Creating side-by-side box plots
penguins %>% 
  ggplot(aes(x = species, y = flipper_length_mm, fill = species)) + 
  geom_boxplot() + 
  scale_fill_manual(values = c("#ff7600", "#c05ccb", "#067075")) +
        labs(title = "Penguin flipper lengths by species",
       x = "Species",
       y = "Flipper length (mm)",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "none")

  • All three species had a similar level of variability.

  • Gentoo typically had the longest flippers.

  • Adelie typically had the shortest flippers.

Modifying the previously provided code, recreate the side-by-side box plots below.

# Creating side-by-side box plots
penguins %>% 
  ggplot(aes(x = island, y = body_mass_g, fill = island)) + 
  geom_boxplot() + 
  scale_fill_manual(values = c("#1F77B4", "#2CA02C", "#D62728")) +
        labs(title = "Penguin body masses by island",
       x = "Island",
       y = "Body mass (g)",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "none")

Two categorical variables

Lastly, for two categorical variables, clustered bar charts and dumbbell charts are commonly used visualizations.

Clustered bar chart

# Creating a clustered bar chart
penguins %>% dplyr::count(species, sex, .drop = FALSE) %>% 
    dplyr::filter(!is.na(species), !is.na(sex)) %>% 
  mutate(sex = fct_reorder(sex, n)) %>% 
  ggplot(aes(x = sex, y = n,
             fill = species)) + 
  geom_col(position="dodge", color = "black") +
  scale_fill_manual(values = c("#ff7600", "#c05ccb", "#067075")) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
    labs(title = "Distribution of penguin species by sex",
       x = "Sex",
       y = "Frequency",
       caption = "Data source: palmerpenguins R package",
       fill = "Species")

Dumbbell chart - conveys the same information as the clustered bar chart.

# Creating dumbbell chart
penguins %>% dplyr::count(species, sex, .drop = FALSE) %>% 
  dplyr::filter(!is.na(species), !is.na(sex)) %>% 
  dplyr::mutate(species_sex = str_c(species, "_", sex)) %>% 
  ggplot(aes(x = n, y = sex,
             color = species, fill = species)) + 
  geom_line(aes(group = sex), color = "black") +
    geom_point(pch = 21, color = "black", size = 5) +
  scale_fill_manual(values = c("#ff7600", "#c05ccb", "#067075")) +
      labs(title = "Distribution of penguin species by sex",
           x = "Frequency",
           y = "Sex",
           caption = "Data source: palmerpenguins R package",
       fill = "Species") +
  theme(legend.position = "bottom")

Multivariate analyses

Frequency table

# Creating frequency table
speciesIslandSexCounts <- penguins %>% 
  dplyr::count(species, island, sex, .drop = FALSE)

# Printing frequency table
speciesIslandSexCounts %>% 
  make_flex(caption = "Number of penguins by island, species, and sex.")

Table 8: Number of penguins by island, species, and sex.

species    island     sex     n
Adelie     Biscoe     female  22
Adelie     Biscoe     male    22
Adelie     Dream      female  27
Adelie     Dream      male    28
Adelie     Dream      NA      1
Adelie     Torgersen  female  24
Adelie     Torgersen  male    23
Adelie     Torgersen  NA      5
Chinstrap  Dream      female  34
Chinstrap  Dream      male    34
Gentoo     Biscoe     female  58
Gentoo     Biscoe     male    61
Gentoo     Biscoe     NA      5

Multivariate visualizations

Scatter plot

# Creating a scatter plot
penguins %>% 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = species)) + 
  geom_point(pch = 21, color = "white") +
    scale_fill_manual(values = c("#ff7600", "#c05ccb", "#067075")) +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       fill = "Species",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom")

What can we say about the relationship between the penguin flipper lengths (mm) and body masses (g) for each penguin species based on the scatter plot?

– Within each species, flipper length and body mass have a positive, roughly linear relationship.

Faceted scatter plot

# Creating a faceted scatter plot
penguins %>% 
  ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = species)) + 
  geom_point(pch = 21, color = "white") +
    scale_fill_manual(values = c("#ff7600", "#c05ccb", "#067075")) +
  facet_grid(species ~ .) +
    labs(title = "Penguin flipper lengths by body mass",
       x = "Body mass (g)",
       y = "Flipper length (mm)",
       fill = "Species",
       caption = "Data source: palmerpenguins R package") +
  theme(legend.position = "bottom",
        strip.background.y = element_rect(linetype = "solid", color = "black"))