We first need to load the tidyverse and palmerpenguins libraries. The
latter provides the dataset used to reproduce the main charts. For some
extra graphical resources, we’ll use patchwork.

library(tidyverse)
library(patchwork)
library(palmerpenguins)

Now we are good to go. But first, let’s check our dataset and its structure:

| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
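
The preview above shows only the first rows. As a minimal sketch, assuming the packages loaded earlier are installed, the dimensions and column types of the dataset could be checked like this (`glimpse()` comes from dplyr, which is loaded with the tidyverse; the names `dims` and `col_types` are only illustrative):

```r
library(dplyr)
library(palmerpenguins)

# First six rows, as in the table above
head(penguins)

# Dimensions: number of rows (observations) and columns (variables)
dims <- dim(penguins)
dims

# Compact view of column names, types, and first values
glimpse(penguins)

# Storage class of each column: factor, numeric (double), or integer
col_types <- sapply(penguins, function(col) class(col)[1])
col_types
```

Factors correspond to the categorical variables, numeric columns to the continuous ones, and integer columns to the discrete ones.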
So, we have 344 observations and 8 variables: 3 categorical, 2 continuous, and 3 discrete.
We also have 19 NA values, and the sex variable is the one
with the most of them.
s = colSums(is.na(penguins)) |> data.frame() |>
  rownames_to_column(var = 'Column') |>
  rename(NA_Freq = 'colSums.is.na.penguins..') # NA values by column

| Column | NA_Freq |
|---|---|
| species | 0 |
| island | 0 |
| bill_length_mm | 2 |
| bill_depth_mm | 2 |
| flipper_length_mm | 2 |
| body_mass_g | 2 |
| sex | 11 |
| year | 0 |
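
The per-column counts above add up to the total number of missing cells, which is not the same as the number of incomplete rows, because a single row can have several NA values. A small sketch of that distinction, using base R plus `drop_na()` from tidyr (the variable names are only illustrative):

```r
library(tidyr)
library(palmerpenguins)

# Total number of missing cells in the whole table
total_na <- sum(is.na(penguins))
total_na

# Number of rows that have at least one missing value
rows_with_na <- sum(!complete.cases(penguins))
rows_with_na

# Number of rows left after dropping every incomplete observation
complete_rows <- nrow(drop_na(penguins))
complete_rows
```

This matters later on: `drop_na()` removes whole rows, so a penguin with a missing sex also loses its flipper and body mass measurements.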
Motivation: is there any relation between the penguins’ flipper length
(mm) and their body mass (g)? Remember that the data has NA
values!

penguins |> ggplot(aes(flipper_length_mm, body_mass_g)) +
  geom_point(na.rm = T)

Motivation: how would the previous chart look if we wanted to see the spread by species? Use the color property.

penguins |> ggplot(aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point(na.rm = T)

Motivation: and if, instead of different colors, we wanted different shapes to represent each species?

penguins |> ggplot(aes(flipper_length_mm, body_mass_g, shape = species)) +
  geom_point(na.rm = T)

Motivation: can we use both the shape and color properties to represent the species?

penguins |> ggplot(aes(flipper_length_mm, body_mass_g, shape = species, color = species)) +
  geom_point(na.rm = T)

Motivation: how can we add a fade effect to the points? Consider the first plot you made.

penguins |> ggplot(aes(flipper_length_mm, body_mass_g, alpha = species)) +
  geom_point(na.rm = T)

## Warning: Using alpha for a discrete variable is not advised.

Note that ggplot yields a warning: mapping alpha
to a discrete variable is not recommended.
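
One way to get a fade effect without triggering that warning is to set alpha as a fixed value outside `aes()`, so it applies to every point instead of being mapped to a variable. This is only a sketch of an alternative, not a replacement for the exercise above; the value 0.4 and the name `p_fade` are arbitrary choices:

```r
library(ggplot2)
library(palmerpenguins)

# alpha is set outside aes(), so it is a constant for all points
# rather than a mapping to a variable; 0.4 is an arbitrary choice
p_fade <- ggplot(penguins, aes(flipper_length_mm, body_mass_g)) +
  geom_point(na.rm = TRUE, alpha = 0.4)
p_fade
```

With a constant alpha, overlapping points appear darker, which helps show where the observations are concentrated.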
Motivation: considering the first plot, how can we change the color of the points?

penguins |> ggplot(aes(flipper_length_mm, body_mass_g)) +
  geom_point(na.rm = T, colour = 'purple')

Motivation: it’s cool to see the spread of the penguins’ body mass along with their flipper length, all in one chart. But how could we see each species separately?

penguins |> ggplot(aes(flipper_length_mm, body_mass_g)) +
  geom_point(na.rm = T) +
  facet_wrap(~species)

It’s also possible to use the facet_grid() function,
although it’s more often used with two variables.

penguins |> ggplot(aes(flipper_length_mm, body_mass_g)) +
  geom_point(na.rm = T) +
  facet_grid(~species)

With two variables:

penguins |> ggplot(aes(flipper_length_mm, body_mass_g)) +
  geom_point(na.rm = T) +
  facet_grid(sex~species)

Note that we have NA values in the sex variable, and they are
observations from the Adelie and Gentoo species.
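
The same observation can be checked numerically rather than read off the facets. A minimal sketch with dplyr (the name `na_sex_by_species` is only illustrative):

```r
library(dplyr)
library(palmerpenguins)

# Keep only the rows where sex is missing, then count them per species
na_sex_by_species <- penguins |>
  filter(is.na(sex)) |>
  count(species)
na_sex_by_species
```

Species that do not appear in the result have no missing values in the sex variable.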
And if we wanted to hide the NA values:

penguins |> drop_na() |> ggplot(aes(flipper_length_mm, body_mass_g)) +
  geom_point() +
  facet_grid(sex~species)

Motivation: judging by the charts above, it looks reasonable to fit a regression model with these two variables (body mass and flipper length). How can we add a regression line to the previous chart?

penguins |> ggplot(aes(flipper_length_mm, body_mass_g)) +
  geom_point(na.rm = T) +
  geom_smooth(na.rm = T)

By default, geom_smooth() fits a smooth curve using the
loess method (for datasets with fewer than 1,000 observations), but this
behavior can be changed through the method parameter:

penguins |> ggplot(aes(flipper_length_mm, body_mass_g)) +
  geom_point(na.rm = T) +
  geom_smooth(na.rm = T, method = 'lm')

Motivation: finally, what if we wanted to plot a separate regression line for each species?

penguins |> ggplot(aes(flipper_length_mm, body_mass_g, linetype = species)) +
  geom_smooth(na.rm = T, method = 'lm')

Motivation: create a bar chart showing the frequency of each penguin species.

penguins |> ggplot(aes(x = species)) +
  geom_bar()

We could also make a horizontal bar chart:

penguins |> ggplot(aes(y = species)) +
  geom_bar()

Motivation: note that the previous charts display the species
sorted alphabetically. To sort them by frequency, we can
use the forcats library (loaded with the tidyverse), which helps a lot when we’re
working with categorical variables (factors).

penguins |> ggplot(aes(y = fct_infreq(species))) +
  geom_bar()

Now, let’s reverse the order so that the most frequent species appears at the top:

penguins |> ggplot(aes(y = fct_rev(fct_infreq(species)))) +
  geom_bar()

Motivation: what would a bar chart of species, stacked by
island, look like? Use the fill parameter in aes().

penguins |> ggplot(aes(y = fct_rev(fct_infreq(species)), fill = island)) +
  geom_bar()

Note that all Gentoos are from Biscoe island, while all Chinstraps are from Dream, and Adelies are spread across all islands!

Motivation: and if we wanted a ‘side-by-side’ bar plot for each species and island?

penguins |> ggplot(aes(species, fill = island)) + geom_bar(position = 'dodge')

Motivation: pie charts are generally discouraged, especially when a variable has several categories. But just for practice, let’s make a pie chart of the penguin species.

penguins |> count(species) |> ggplot(aes(x = '', y = n, fill = species)) +
  geom_col() +
  coord_polar("y", start = 0) +
  theme_void()

Note that even its implementation is a bit harder compared to other plots.

Motivation: and what about a donut chart?

penguins |> count(species) |> ggplot(aes(2, n, fill = species)) +
  geom_col() +
  coord_polar("y") +
  xlim(c(0.2, 3)) +
  theme_void()

Motivation: consider now that we want to know the distribution of the penguins’ bill length (mm). A good start is to plot a histogram, since it is a continuous variable.

penguins |> ggplot(aes(bill_length_mm)) +
  geom_histogram(na.rm = T)

By default, the number and width of the bins are chosen
automatically by ggplot, but if we want to specify them, we can
use the bins and/or binwidth parameters:

p1 = penguins |> ggplot(aes(bill_length_mm)) +
  geom_histogram(na.rm = T)
p2 = penguins |> ggplot(aes(bill_length_mm)) +
  geom_histogram(na.rm = T, binwidth = 0.3)
p3 = penguins |> ggplot(aes(bill_length_mm)) +
  geom_histogram(na.rm = T, binwidth = 2)
p3/p1/p2 # patchwork is awesome!

And with the bins parameter:

penguins |> ggplot(aes(bill_length_mm)) +
  geom_histogram(na.rm = T, bins = 20)

Motivation: and finally, what if we wanted to plot the density curve of the sample together with the histogram?

penguins |> ggplot(aes(bill_length_mm)) +
  geom_histogram(na.rm = T, aes(y = after_stat(density))) +
  geom_density(na.rm = T, color = 'red', lwd = .8)

Note that in geom_histogram() it was necessary to
set the y-axis to the density scale, so that the bars and the curve share the same scale.
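
To see what the density scale means, the sketch below pulls the values computed for the histogram layer with `layer_data()` and checks that the bar areas sum to 1, which is what makes the bars comparable to a density curve. Setting `bins = 30` explicitly and the names `p_hist`, `bars`, and `total_area` are assumptions made for this illustration:

```r
library(ggplot2)
library(palmerpenguins)

p_hist <- ggplot(penguins, aes(bill_length_mm)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, na.rm = TRUE)

# layer_data() returns the values the stat computed for the first layer
bars <- layer_data(p_hist, 1)

# Area of each bar = height (density) * width; the areas should sum to 1
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```

With the default count scale, the bar heights would add up to the number of non-missing observations instead, and the density curve would look flat at the bottom of the plot.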
Motivation: take a look at the distribution of flipper length for each species in the penguins dataset, using a boxplot. What are the properties of these distributions?

penguins |> ggplot(aes(species, flipper_length_mm)) +
  stat_boxplot(geom = 'errorbar', na.rm = T, width = .25) +
  geom_boxplot(na.rm = T)

Their ranges are basically the same, and they’re all approximately symmetric. We see only two outliers, both in the Adelie species.
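
The numbers behind the boxplot can also be computed directly. The sketch below assumes the 1.5 × IQR rule that `geom_boxplot()` uses by default for its whiskers, and counts how many observations per species fall outside those limits; the column names in `flipper_summary` are only illustrative:

```r
library(dplyr)
library(palmerpenguins)

flipper_summary <- penguins |>
  filter(!is.na(flipper_length_mm)) |>
  group_by(species) |>
  summarise(
    q1 = quantile(flipper_length_mm, 0.25),
    median = median(flipper_length_mm),
    q3 = quantile(flipper_length_mm, 0.75),
    iqr = q3 - q1,
    # Points beyond 1.5 * IQR from the quartiles are drawn as outliers
    n_outliers = sum(flipper_length_mm < q1 - 1.5 * iqr |
                       flipper_length_mm > q3 + 1.5 * iqr)
  )
flipper_summary
```

Comparing `iqr` across species gives a numeric counterpart to the visual impression of similar spread.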
Motivation: species is a categorical variable. How can we draw the individual points on top of the boxplot, adding a little noise so that they don’t all fall on the same vertical line?

penguins |> ggplot(aes(species, flipper_length_mm)) +
  stat_boxplot(geom = 'errorbar', na.rm = T, width = .25) +
  geom_boxplot(na.rm = T) +
  geom_jitter(na.rm = T)

The jitter function adds a small amount of random
variation to the location of each point, which makes plots
of discrete variables easier to read (in this case, the flipper
lengths are grouped by species).
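
The amount of jitter can be controlled. The sketch below is one possible variation: it jitters only horizontally, so the flipper lengths themselves are not altered, fixes a seed so the plot is reproducible, and hides the boxplot’s own outlier points so they are not drawn twice. The specific values (0.2, 42, 0.5) and the name `p_jitter` are arbitrary choices for illustration:

```r
library(ggplot2)
library(palmerpenguins)

p_jitter <- ggplot(penguins, aes(species, flipper_length_mm)) +
  # outlier.shape = NA hides the boxplot outliers, since the jittered
  # layer already draws every observation
  geom_boxplot(na.rm = TRUE, outlier.shape = NA) +
  geom_jitter(
    na.rm = TRUE,
    alpha = 0.5,
    # width, height, and seed are illustrative values
    position = position_jitter(width = 0.2, height = 0, seed = 42)
  )
p_jitter
```

Setting `height = 0` keeps each point at its measured flipper length; only its horizontal position within the species column is randomized.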
Line charts in ggplot2 work best with time series. Therefore,
let’s use the economics dataset, from ggplot2, to plot some
line charts.

Motivation: plot the number of unemployed over time.

economics |> ggplot(aes(date, unemploy)) + geom_line()

Motivation: plot the personal savings rate together with the median duration of unemployment over time.

economics |> ggplot(aes(date)) +
  geom_line(aes(y = psavert), colour = 'darkred') +
  geom_line(aes(y = uempmed), colour = 'steelblue')

Motivation: now, using the previous variables, plot an area chart.

economics |> ggplot(aes(date)) +
  geom_area(aes(y = psavert), colour = 'orange', alpha = 0.4, fill = 'orange') +
  geom_area(aes(y = uempmed), colour = 'steelblue', alpha = 0.4, fill = 'steelblue')

Motivation: we looked at the histograms and boxplots for the flipper and bill length variables and concluded that these data are not normally distributed. But what if we wanted to check this with a normal Q-Q plot?

penguins |> ggplot(aes(sample = flipper_length_mm)) + stat_qq(na.rm = T) + stat_qq_line(na.rm = T)

Motivation: choose any chart above and add labels to it, such as a title, subtitle, axis labels, legend labels and colors, etc.

penguins |> ggplot(aes(y = fct_rev(fct_infreq(species)), fill = island)) +
  geom_bar() +
  labs(title = "Penguin species by island", subtitle = 'Analyzing penguin distribution',
       caption = 'Dataset: palmerpenguins', x = 'Penguins', y = 'Species', fill = 'Island') +
  geom_text(aes(label = after_stat(count)),
            stat = 'count',
            position = position_stack(vjust = 0.5),
            fontface = 'bold')

library(knitr)      # provides kable()
library(kableExtra) # provides kable_styling()

penguins |> group_by(species, island) |> count() |> arrange(species, desc(n)) |> kable() |>
  kable_styling(bootstrap_options = c('hover', 'striped', 'condensed', 'responsive'),
                full_width = F)

| species | island | n |
|---|---|---|
| Adelie | Dream | 56 |
| Adelie | Torgersen | 52 |
| Adelie | Biscoe | 44 |
| Chinstrap | Dream | 68 |
| Gentoo | Biscoe | 124 |