Exploratory Data Analysis Activity
Learning Objectives
Create and interpret univariate statistics & visualizations
Create and interpret bivariate statistics & visualizations
First, let's load the necessary libraries for this activity.
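The setup chunk itself is not shown here. A minimal sketch of what it might contain, inferred from the functions called in the rest of the activity, follows. The package list is an assumption, and `make_flex()` is never defined in this document, so the version below is a hypothetical stand-in built on the flextable package rather than the original helper.

```r
# Assumed packages, inferred from the functions used later in this activity
library(tidyverse)       # dplyr, ggplot2, forcats (fct_reorder), stringr (str_c)
library(palmerpenguins)  # the penguins data set
library(corrr)           # correlate()
library(flextable)       # table formatting used by the stand-in helper below

# Hypothetical stand-in for make_flex(): the original helper is not shown
make_flex <- function(df, caption = NULL, digits = 2) {
  ft <- flextable(as.data.frame(df))
  # Show doubles with two decimals and a thousands separator
  ft <- colformat_double(ft, digits = digits, big.mark = ",")
  if (!is.null(caption)) {
    ft <- set_caption(ft, caption = caption)
  }
  autofit(ft)
}
```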
Univariate statistics
# Calculating descriptive statistics
quant1Stats <- penguins %>%
dplyr::summarize(
Minimum = min(flipper_length_mm, na.rm = TRUE),
Q1 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.25),
M = median(flipper_length_mm, na.rm = TRUE),
Q3 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.75),
Maximum = max(flipper_length_mm, na.rm = TRUE),
Mean = mean(flipper_length_mm, na.rm = TRUE),
R = Maximum - Minimum,
s = sd(flipper_length_mm, na.rm = TRUE)
)
# Printing table of statistics
quant1Stats %>%
make_flex(caption = "Quantitative summary statistics for penguin flipper lengths (mm).")
Minimum | Q1 | M | Q3 | Maximum | Mean | R | s |
---|---|---|---|---|---|---|---|
172.00 | 190.00 | 197.00 | 213.00 | 231.00 | 200.92 | 59.00 | 14.06 |
What are the largest and smallest flipper lengths for penguins in this data set?
The largest was 231 mm and the smallest was 172 mm.
Provide and interpret the value of the sample median flipper length for the penguins.
M = 197mm, so 50% of penguins in the data set had a flipper length of at least 197mm.
Provide the value of the sample variance of the flipper length for the penguins.
Since s = 14.06 mm, the sample variance is s² = 14.06² ≈ 197.68 mm².
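The arithmetic can be checked in R. This sketch uses the rounded standard deviation from the table, so the result is only approximate; applying `var()` to the raw flipper lengths would give the unrounded value.

```r
# Sample standard deviation from the summary table (rounded to 2 decimals)
s <- 14.06

# The sample variance is the square of the sample standard deviation
s2 <- s^2
print(s2)
```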
Which statistic is more sensitive to outliers: the range or the interquartile-range (IQR)?
The range is more sensitive to outliers - the IQR is more robust to outliers than the range.
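A small illustration of this point, using made-up values rather than the penguin data: adding one extreme observation changes the range far more than the IQR.

```r
# Hypothetical flipper lengths (mm), chosen only for illustration
x <- c(172, 185, 190, 197, 205, 213, 231)

# The same values plus one extreme, made-up observation
xOutlier <- c(x, 400)

# Range = maximum - minimum
rangeBefore <- diff(range(x))
rangeAfter <- diff(range(xOutlier))

# IQR = Q3 - Q1
iqrBefore <- IQR(x)
iqrAfter <- IQR(xOutlier)

# Side-by-side comparison of the two statistics
data.frame(
  statistic = c("Range", "IQR"),
  before = c(rangeBefore, iqrBefore),
  after = c(rangeAfter, iqrAfter)
)
```

Here the single extreme value more than triples the range (59 to 228), while the IQR changes by less than 10.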
Single categorical variable
# Printing frequency tables
penguins %>%
dplyr::count(species) %>%
make_flex(caption = "Number of penguins by species.")
species | n |
---|---|
Adelie | 152 |
Chinstrap | 68 |
Gentoo | 124 |
# Printing frequency tables
penguins %>%
dplyr::count(island) %>%
make_flex(caption = "Number of penguins by island")
island | n |
---|---|
Biscoe | 168 |
Dream | 124 |
Torgersen | 52 |
Making a histogram
# Creating a histogram
penguins %>%
ggplot(aes(x = flipper_length_mm)) +
geom_histogram(color = "white") +
scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
labs(title = "Distribution of penguin flipper lengths",
x = "Flipper length (mm)",
y = "Frequency",
caption = "Data source: palmerpenguins R package")
What can we say about the penguin flipper lengths based on the histogram?
The histogram shows that the distribution of penguin flipper lengths is bimodal and fairly symmetric.
Creating a box plot
# Creating a box plot
penguins %>%
ggplot(aes(x = flipper_length_mm)) +
geom_boxplot() +
scale_y_discrete(breaks = NULL) +
labs(title = "Distribution of penguin flipper lengths",
x = "Flipper length (mm)",
caption = "Data source: palmerpenguins R package")
What can we say about the penguin flipper lengths based on the box plot? Are there any outliers present?
Minimum = 172 mm
Q1 = 190 mm
Median = 197 mm
Q3 = 213 mm
Maximum = 231 mm
There are no outliers.
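The "no outliers" conclusion can be checked with the usual 1.5 × IQR rule, which is also the rule `geom_boxplot()` uses by default to decide which points to draw beyond the whiskers. The sketch below plugs in the values from the summary table earlier in the activity; the quartiles a box plot draws can differ slightly from those reported by `quantile()`, so treat this as an approximate check.

```r
# Values taken from the flipper length summary table
q1 <- 190
q3 <- 213
minimum <- 172
maximum <- 231

# Interquartile range and the 1.5 * IQR fences
iqr <- q3 - q1
lowerFence <- q1 - 1.5 * iqr
upperFence <- q3 + 1.5 * iqr

# If even the extremes are inside the fences, no observation is flagged
anyOutliers <- minimum < lowerFence || maximum > upperFence
print(c(lower = lowerFence, upper = upperFence))
print(anyOutliers)
```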
Creating a bar chart
# Creating a bar chart
penguins %>% dplyr::count(species, .drop = FALSE) %>%
mutate(species = fct_reorder(species, n)) %>%
ggplot(aes(x = species, y = n,
fill = species)) +
geom_col(color = "black") +
scale_fill_manual(values = c("#c05ccb", "#067075", "#ff7600")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
labs(title = "Distribution of penguin species",
x = "Species",
y = "Frequency",
caption = "Data source: palmerpenguins R package") +
theme(legend.position = "none")
Adelie had the most penguins in the data set (152).
Chinstrap had the fewest penguins (68).
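Counts are often easier to compare as percentages. A minimal sketch, with the species counts typed in from the frequency table above rather than recomputed from the data:

```r
# Species counts copied from the frequency table
speciesCounts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)

# Relative frequencies, shown as percentages
speciesProps <- speciesCounts / sum(speciesCounts)
print(round(100 * speciesProps, 1))

# Most and least common species
mostCommon <- names(which.max(speciesCounts))
leastCommon <- names(which.min(speciesCounts))
print(c(most = mostCommon, least = leastCommon))
```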
Calculating correlations
# Calculating correlations
corTable <- penguins %>%
corrr::correlate(diagonal = 1)
# Printing table of correlations
corTable %>%
make_flex(caption = "Table of pairwise correlations.")
term | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
---|---|---|---|---|---|
bill_length_mm | 1.00 | -0.24 | 0.66 | 0.60 | 0.05 |
bill_depth_mm | -0.24 | 1.00 | -0.58 | -0.47 | -0.06 |
flipper_length_mm | 0.66 | -0.58 | 1.00 | 0.87 | 0.17 |
body_mass_g | 0.60 | -0.47 | 0.87 | 1.00 | 0.04 |
year | 0.05 | -0.06 | 0.17 | 0.04 | 1.00 |
Which variables have the strongest correlation?
Flipper length and body mass (r = 0.87).
Which variables have the weakest correlation?
Body mass and year (r = 0.04).
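With larger tables, scanning by eye gets error-prone. The sketch below finds the strongest and weakest pairs programmatically, comparing absolute values so that strong negative correlations are not overlooked. The matrix is typed in from the rounded table above, not recomputed from the data.

```r
vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm",
          "body_mass_g", "year")

# Correlations copied from the table (rounded to 2 decimals)
corMat <- matrix(c(
   1.00, -0.24,  0.66,  0.60,  0.05,
  -0.24,  1.00, -0.58, -0.47, -0.06,
   0.66, -0.58,  1.00,  0.87,  0.17,
   0.60, -0.47,  0.87,  1.00,  0.04,
   0.05, -0.06,  0.17,  0.04,  1.00
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once: blank out the diagonal and the lower triangle
absCor <- abs(corMat)
absCor[lower.tri(absCor, diag = TRUE)] <- NA

# Locate the largest and smallest absolute correlations
strongest <- which(absCor == max(absCor, na.rm = TRUE), arr.ind = TRUE)
weakest <- which(absCor == min(absCor, na.rm = TRUE), arr.ind = TRUE)

strongestPair <- c(vars[strongest[1, "row"]], vars[strongest[1, "col"]])
weakestPair <- c(vars[weakest[1, "row"]], vars[weakest[1, "col"]])
print(strongestPair)
print(weakestPair)
```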
Calculating descriptive statistics
# Calculating descriptive statistics
quant2Stats <- penguins %>%
group_by(species) %>%
summarize(
Minimum = min(flipper_length_mm, na.rm = TRUE),
Q1 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.25),
M = median(flipper_length_mm, na.rm = TRUE),
Q3 = quantile(flipper_length_mm, na.rm = TRUE, probs = 0.75),
Maximum = max(flipper_length_mm, na.rm = TRUE),
Mean = mean(flipper_length_mm, na.rm = TRUE),
R = Maximum - Minimum,
s = sd(flipper_length_mm, na.rm = TRUE),
n = n()
)
# Printing table of statistics
quant2Stats %>%
make_flex(caption = "Summary statistics for penguin flipper lengths by species.")
species | Minimum | Q1 | M | Q3 | Maximum | Mean | R | s | n |
---|---|---|---|---|---|---|---|---|---|
Adelie | 172.00 | 186.00 | 190.00 | 195.00 | 210.00 | 189.95 | 38.00 | 6.54 | 152 |
Chinstrap | 178.00 | 191.00 | 196.00 | 201.00 | 212.00 | 195.82 | 34.00 | 7.13 | 68 |
Gentoo | 203.00 | 212.00 | 216.00 | 221.00 | 231.00 | 217.19 | 28.00 | 6.48 | 124 |
Which penguin species typically had the largest flipper lengths?
Gentoo had the largest mean (217.19 mm) and median (216 mm).
Which penguin species had the most variability in their flipper lengths?
Chinstrap, because they have the largest standard deviation (s = 7.13 mm).
Based on the range, Adelie has the largest variability (R = 38 mm).
Other measures of spread or variability are:
Variance = s²
IQR = Q3 - Q1
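Both measures can be derived from the by-species table. This sketch copies the quartiles and standard deviations from that table, so the variances are based on rounded values.

```r
# Quartiles and standard deviations copied from the by-species table
flipperStats <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  Q1 = c(186, 191, 212),
  Q3 = c(195, 201, 221),
  s = c(6.54, 7.13, 6.48)
)

# IQR = Q3 - Q1 and Variance = s^2 for each species
flipperStats$IQR <- flipperStats$Q3 - flipperStats$Q1
flipperStats$Variance <- flipperStats$s^2
print(flipperStats)
```

By IQR and variance, Chinstrap is the most variable species; only the range points to Adelie.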
Modifying the previously provided code, recreate the table of statistics for the body masses of the penguins stratified by sex below.
# Calculating descriptive statistics
quant2Stats <- penguins %>%
group_by(sex) %>%
summarize(
Minimum = min(body_mass_g, na.rm = TRUE),
Q1 = quantile(body_mass_g, na.rm = TRUE, probs = 0.25),
M = median(body_mass_g, na.rm = TRUE),
Q3 = quantile(body_mass_g, na.rm = TRUE, probs = 0.75),
Maximum = max(body_mass_g, na.rm = TRUE),
Mean = mean(body_mass_g, na.rm = TRUE),
R = Maximum - Minimum,
s = sd(body_mass_g, na.rm = TRUE),
n = n()
)
# Printing table of statistics
quant2Stats %>%
make_flex(caption = "Summary statistics for penguin body masses (g) by sex.")
sex | Minimum | Q1 | M | Q3 | Maximum | Mean | R | s | n |
---|---|---|---|---|---|---|---|---|---|
female | 2,700.00 | 3,350.00 | 3,650.00 | 4,550.00 | 5,200.00 | 3,862.27 | 2,500.00 | 666.17 | 165 |
male | 3,250.00 | 3,900.00 | 4,300.00 | 5,312.50 | 6,300.00 | 4,545.68 | 3,050.00 | 787.63 | 168 |
NA | 2,975.00 | 3,475.00 | 4,100.00 | 4,650.00 | 4,875.00 | 4,005.56 | 1,900.00 | 679.36 | 11 |
Table of counts
For two categorical variables, a table of counts is commonly calculated as below.
# Creating frequency table
speciesIslandCounts <- penguins %>%
dplyr::count(species, island)
# Printing frequency table
speciesIslandCounts %>%
make_flex(caption = "Number of penguins by island and species.")
species | island | n |
---|---|---|
Adelie | Biscoe | 44 |
Adelie | Dream | 56 |
Adelie | Torgersen | 52 |
Chinstrap | Dream | 68 |
Gentoo | Biscoe | 124 |
Was any penguin species found on more than one island?
– Yes, but only one: Adelie, which was found on Biscoe, Dream, and Torgersen.
How many penguins were found on Dream island?
– 124 penguins (56 Adelie + 68 Chinstrap).
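Both answers can be read off the table of counts programmatically. A minimal base R sketch, with the counts typed in from the table above:

```r
# Species-by-island counts copied from the table
speciesIslandCounts <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo"),
  island = c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe"),
  n = c(44, 56, 52, 68, 124)
)

# Each row is one species-island combination, so counting rows per
# species gives the number of islands that species was found on
islandsPerSpecies <- table(speciesIslandCounts$species)
print(islandsPerSpecies)

# Total number of penguins on each island
penguinsPerIsland <- tapply(speciesIslandCounts$n,
                            speciesIslandCounts$island, sum)
print(penguinsPerIsland)
```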
Bivariate visualizations
Scatter plot
# Creating a scatter plot
penguins %>%
ggplot(aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point(pch = 21, color = "white", fill = "black") +
labs(title = "Penguin flipper lengths by body mass",
x = "Body mass (g)",
y = "Flipper length (mm)",
caption = "Data source: palmerpenguins R package")
We can also add a straight line of best fit to a scatter plot.
# Creating a scatter plot with a line of best fit
penguins %>%
ggplot(aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point(pch = 21, color = "white", fill = "black") +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = "Penguin flipper lengths by body mass",
x = "Body mass (g)",
y = "Flipper length (mm)",
caption = "Data source: palmerpenguins R package")
A strong, positive, linear relationship is observed (r = 0.87 in the correlation table).
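The line drawn by `geom_smooth(method = 'lm')` is an ordinary least squares fit, the same one `lm()` produces. The sketch below uses a small made-up data set (not the penguin data) to show how the slope of that line relates to the correlation: slope = r × (s_y / s_x).

```r
# Hypothetical body masses (g) and flipper lengths (mm), for illustration only
toy <- data.frame(
  mass = c(3000, 3500, 4000, 4500, 5000, 5500),
  flipper = c(185, 190, 196, 205, 212, 220)
)

# Least squares line of best fit
fit <- lm(flipper ~ mass, data = toy)
slope <- coef(fit)[["mass"]]

# Pearson correlation and the slope implied by it
r <- cor(toy$mass, toy$flipper)
slopeFromR <- r * sd(toy$flipper) / sd(toy$mass)

print(c(slope = slope, slopeFromR = slopeFromR, r = r))
```

A positive correlation therefore always corresponds to an upward-sloping line of best fit.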
Side-by-side box plot
One quantitative and one categorical variable
For one quantitative and one categorical variable, side-by-side box plots are a useful visualization. We can use a side-by-side boxplot to explore the penguin flipper lengths by species as below.
# Creating side-by-side box plots
penguins %>%
ggplot(aes(x = species, y = flipper_length_mm, fill = species)) +
geom_boxplot() +
scale_fill_manual(values = c("#ff7600", "#c05ccb", "#067075")) +
labs(title = "Penguin flipper lengths by species",
x = "Species",
y = "Flipper length (mm)",
caption = "Data source: palmerpenguins R package") +
theme(legend.position = "none")
All the species had a similar level of variability.
Gentoo typically had the longest flippers.
Adelie typically had the shortest flippers.
Modifying the previously provided code, recreate the side-by-side box plots below.
# Creating side-by-side box plots
penguins %>%
ggplot(aes(x = island, y = body_mass_g, fill = island)) +
geom_boxplot() +
scale_fill_manual(values = c("#1F77B4", "#2CA02C", "#D62728")) +
labs(title = "Penguin body masses by island",
x = "Island",
y = "Body mass (g)",
caption = "Data source: palmerpenguins R package") +
theme(legend.position = "none")
Two categorical variables
Lastly, for two categorical variables, clustered bar charts and dumbbell charts are commonly used visualizations.
Clustered bar chart
# Creating a clustered bar chart
penguins %>% dplyr::count(species, sex, .drop = FALSE) %>%
dplyr::filter(!is.na(species), !is.na(sex)) %>%
mutate(sex = fct_reorder(sex, n)) %>%
ggplot(aes(x = sex, y = n,
fill = species)) +
geom_col(position="dodge", color = "black") +
scale_fill_manual(values = c("#ff7600", "#c05ccb", "#067075")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
labs(title = "Distribution of penguin species by sex",
x = "Sex",
y = "Frequency",
caption = "Data source: palmerpenguins R package",
fill = "Species")
Dumbbell chart - conveys the same information as the clustered bar chart.
# Creating dumbbell chart
penguins %>% dplyr::count(species, sex, .drop = FALSE) %>%
dplyr::filter(!is.na(species), !is.na(sex)) %>%
dplyr::mutate(species_sex = str_c(species, "_", sex)) %>%
ggplot(aes(x = n, y = sex,
color = species, fill = species)) +
geom_line(aes(group = sex), color = "black") +
geom_point(pch = 21, color = "black", size = 5) +
scale_fill_manual(values = c("#ff7600", "#c05ccb", "#067075")) +
labs(title = "Distribution of penguin species by sex",
x = "Frequency",
y = "Sex",
caption = "Data source: palmerpenguins R package",
fill = "Species") +
theme(legend.position = "bottom")
Multivariate analyses
Frequency table
# Creating frequency table
speciesIslandSexCounts <- penguins %>%
dplyr::count(species, island, sex, .drop = FALSE)
# Printing frequency table
speciesIslandSexCounts %>%
make_flex(caption = "Number of penguins by island, species, and sex.")
species | island | sex | n |
---|---|---|---|
Adelie | Biscoe | female | 22 |
Adelie | Biscoe | male | 22 |
Adelie | Dream | female | 27 |
Adelie | Dream | male | 28 |
Adelie | Dream | NA | 1 |
Adelie | Torgersen | female | 24 |
Adelie | Torgersen | male | 23 |
Adelie | Torgersen | NA | 5 |
Chinstrap | Dream | female | 34 |
Chinstrap | Dream | male | 34 |
Gentoo | Biscoe | female | 58 |
Gentoo | Biscoe | male | 61 |
Gentoo | Biscoe | NA | 5 |
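A three-way table like this can be collapsed back to a single variable as a consistency check. The sketch below, with counts typed in from the table above, sums over species and island and recovers the group sizes reported in the body mass by sex table (165 female, 168 male, 11 missing).

```r
# Counts copied from the species-island-sex table (NA = sex not recorded)
speciesIslandSexCounts <- data.frame(
  species = c(rep("Adelie", 8), rep("Chinstrap", 2), rep("Gentoo", 3)),
  island = c("Biscoe", "Biscoe", "Dream", "Dream", "Dream",
             "Torgersen", "Torgersen", "Torgersen",
             "Dream", "Dream", "Biscoe", "Biscoe", "Biscoe"),
  sex = c("female", "male", "female", "male", NA,
          "female", "male", NA,
          "female", "male", "female", "male", NA),
  n = c(22, 22, 27, 28, 1, 24, 23, 5, 34, 34, 58, 61, 5)
)

# Give missing sex an explicit label so it is kept as its own group
sexLabel <- ifelse(is.na(speciesIslandSexCounts$sex), "(missing)",
                   speciesIslandSexCounts$sex)

# Collapse over species and island
sexTotals <- tapply(speciesIslandSexCounts$n, sexLabel, sum)
print(sexTotals)
```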
Multivariate visualizations
Scatter plot
# Creating a scatter plot
penguins %>%
ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = species)) +
geom_point(pch = 21, color = "white") +
scale_fill_manual(values = c("#ff7600", "#c05ccb", "#067075")) +
labs(title = "Penguin flipper lengths by body mass",
x = "Body mass (g)",
y = "Flipper length (mm)",
fill = "Species",
caption = "Data source: palmerpenguins R package") +
theme(legend.position = "bottom")
What can we say about the relationship between the penguin flipper lengths (mm) and body masses (g) for each penguin species based on the scatter plot?
– Positive linear relationship for each species
Faceted scatter plot
# Creating a faceted scatter plot
penguins %>%
ggplot(aes(x = body_mass_g, y = flipper_length_mm, fill = species)) +
geom_point(pch = 21, color = "white") +
scale_fill_manual(values = c("#ff7600", "#c05ccb", "#067075")) +
facet_grid(species ~ .) +
labs(title = "Penguin flipper lengths by body mass",
x = "Body mass (g)",
y = "Flipper length (mm)",
fill = "Species",
caption = "Data source: palmerpenguins R package") +
theme(legend.position = "bottom",
strip.background.y = element_rect(linetype = "solid", color = "black"))