Getting started


Install packages and load libraries


To install a new package, use the function install.packages("name.package")

To load a library, use the function library(name.package)


### Install packages 
# install.packages("ggplot2")
# install.packages("ggridges")
# install.packages("tidyverse")
# install.packages("janitor")
# install.packages("kableExtra")
# install.packages("unikn")
# install.packages("ggpubr")
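# install.packages("sjPlot")
# install.packages("ggcorrplot") # used below as ggcorrplot::ggcorrplot()
# install.packages("maps") # required by map_data()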

### Load libraries 
library(tidyverse)
library(ggplot2)
library(ggridges)
library(janitor) # helpful to clean col names `clean_names()`
library(kableExtra) # to display and edit tables 
library(unikn) # for uni Konstanz theme 
library(ggpubr) # to arrange plots 
library(sjPlot) 

# set options for tables 
bs_style <- c("striped", "hover", "condensed", "responsive")
options(kable_styling_bootstrap_options = bs_style)

Load data


### Other data 
### Run the following chunk 
data("state")
state.x77 %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename(
    state = rowname
  ) %>% 
  janitor::clean_names() -> df
rm(state.abb, state.area, state.center, state.division, state.name, state.region, state.x77)
state       population  income  illiteracy  life_exp  murder  hs_grad  frost    area
Alabama           3615    3624         2.1     69.05    15.1     41.3     20   50708
Alaska             365    6315         1.5     69.31    11.3     66.7    152  566432
Arizona           2212    4530         1.8     70.55     7.8     58.1     15  113417
Arkansas          2110    3378         1.9     70.66    10.1     39.9     65   51945
California       21198    5114         1.1     71.71    10.3     62.6     20  156361
Colorado          2541    4884         0.7     72.06     6.8     63.9    166  103766

Building a plot: Basic steps


  1. Define plot aesthetics with ggplot(aes(x = ..., y = ...))

df %>% 
  ### pass the data to the ggplot function 
  ggplot(
    ### define aesthetics 
    aes(x = income, y = illiteracy)
  ) 


  2. Define the type of plot with a geom. Here, we use geom_point().
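No code is shown for this step, so here is a minimal sketch of it, assuming the geom is geom_point() as in the steps that follow. The names states_df and p are illustrative; states_df is a stand-in for df, rebuilt from state.x77 with only the two columns this step needs, so that the chunk runs on its own.

```r
library(ggplot2)

# stand-in for df (illustrative name): only the columns used in this step
states_df <- data.frame(
  income     = state.x77[, "Income"],
  illiteracy = state.x77[, "Illiteracy"]
)

p <- ggplot(states_df, aes(x = income, y = illiteracy)) +
  ### define geom: one point per state
  geom_point()

p
```

Without a geom, ggplot() only draws the empty coordinate system; the geom layer is what adds the points.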


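  3. Add colors to the plot using the color argument

Here, we want to color by low- vs. high-population states, using the column population2 (states above vs. below the mean population of 4246; its definition is described in the Box plot section).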

df %>% 
  ### pass the data to the ggplot function 
  ggplot(
    ### define aesthetics 
    aes(x = income, y = illiteracy, color = population2)
  ) + 
  ### define geom 
  geom_point()


  4. Modify the plot labels with xlab() and ylab() or with labs(x = "...", y = "...")

df %>% 
  ### pass the data to the ggplot function 
  ggplot(
    ### define aesthetics 
    aes(x = income, y = illiteracy, color = population2)
  ) + 
  ### define geom (geom_point)
  geom_point() + 
  ### add or edit labs 
  labs(x = "Income", y = "Illiteracy")


  5. Add a title with ggtitle()

df %>% 
  ### pass the data to the ggplot function 
  ggplot(
    ### define aesthetics 
    aes(x = income, y = illiteracy, color = population2)
  ) + 
  ### define geom (geom_point)
  geom_point() + 
  ### add or edit labs 
  labs(x = "Income", y = "Illiteracy") + 
  ### add title
  ggtitle("Illiteracy by income and population size")


  6. Change the theme of the plot with theme_...().

A list of themes is provided in the section customize your plot > change plot theme

df %>% 
  ### pass the data to the ggplot function 
  ggplot(
    ### define aesthetics 
    aes(x = income, y = illiteracy, color = population2)
  ) + 
  ### define geom (geom_point)
  geom_point() + 
  ### add or edit labs 
  labs(x = "Income", y = "Illiteracy") + 
  ### add title
  ggtitle("Illiteracy by income and population size") + 
  ### change plot theme 
  theme_unikn()


  7. Change the axis limits
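No code accompanies this step. The sketch below shows two common ways to change the axis limits in ggplot2: xlim()/ylim(), and coord_cartesian(). The limit values are assumptions chosen to cover the full range of the data, and states_df is an illustrative stand-in for df.

```r
library(ggplot2)

# stand-in for df (illustrative name)
states_df <- data.frame(
  income     = state.x77[, "Income"],
  illiteracy = state.x77[, "Illiteracy"]
)

p <- ggplot(states_df, aes(x = income, y = illiteracy)) +
  geom_point()

### option 1: set the scale limits; observations outside the range are dropped with a warning
p_limits <- p + xlim(3000, 6500) + ylim(0, 3)

### option 2: zoom in on a region without dropping any observation
p_zoom <- p + coord_cartesian(xlim = c(3000, 6500), ylim = c(0, 3))
```

The difference matters when a statistic is computed from the data (e.g., with geom_smooth() or geom_boxplot()): with xlim()/ylim() the dropped points no longer enter the computation, with coord_cartesian() they still do.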


Most used geometrics for data visualization


Box plot


geom_boxplot(): “The boxplot compactly displays the distribution of a continuous variable. It visualises five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually”.


Usage:

geom_boxplot(
  outlier.color = NULL, # if NULL, inherit colors from the ggplot() aesthetics 
  outlier.fill = NULL, # if NULL, inherit colors from the ggplot() aesthetics 
  outlier.shape = 19, # change the number to change the outlier shape 
  outlier.size = 1.5, # change the number to change the outlier size 
  outlier.alpha = NULL, # modify the transparency of the outlier color 
  varwidth = FALSE, # if TRUE, box widths are proportional to the square roots of the number of observations
  na.rm = FALSE, # if FALSE, NA values are removed with a warning; if TRUE, they are removed silently 
  inherit.aes = TRUE # default 
)

In the following, we create a new column in df based on whether the state population is above or below the mean (4246). The new column is called population2. We create it with an if_else() statement inside mutate(), the function used to create or modify columns in a data frame.
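The chunk that creates population2 is described but not shown. The following is a minimal sketch, assuming the threshold is the rounded mean (4246) and that states at or above it are labelled "high", which mirrors how income2 is built later in this document. df is rebuilt from state.x77 here only so that the chunk runs on its own.

```r
library(dplyr)

### rebuild df as in "Load data" (only needed if df is not in your session)
df <- state.x77 %>%
  as.data.frame() %>%
  tibble::rownames_to_column("state") %>%
  janitor::clean_names()

### assumption: "high" = population at or above the rounded mean (4246)
df <- df %>%
  mutate(population2 = if_else(population >= 4246, "high", "low"))

### number of states in each group
table(df$population2)
```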

Now, we can plot the percentage of illiteracy by low and high-population states

df %>% 
  ggplot(aes(x = population2, y = illiteracy, fill = population2)) + 
  geom_boxplot() + 
  ggtitle("% illiteracy by low/high-population states") 

How to read a boxplot:

  • The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles).

  • The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles).

  • The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. Data beyond the end of the whiskers are called “outlying” points and are plotted individually.
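The rules above can be reproduced by hand. The sketch below applies them to the illiteracy values in base R; it follows the description given here and is not ggplot2's internal code, so small differences in how quantiles are computed are possible. All variable names are illustrative.

```r
# illiteracy values from the built-in state.x77 matrix
x <- state.x77[, "Illiteracy"]

# hinges: first and third quartiles
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# whiskers: most extreme observations within 1.5 * IQR of the hinges
upper_whisker <- max(x[x <= q3 + 1.5 * iqr])
lower_whisker <- min(x[x >= q1 - 1.5 * iqr])

# "outlying" points: observations beyond the whiskers
outliers <- x[x > upper_whisker | x < lower_whisker]

c(lower_whisker = lower_whisker, q1 = q1, median = median(x),
  q3 = q3, upper_whisker = upper_whisker)
```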

Run the following code for more information about geom_boxplot()

help("geom_boxplot")

[Reference: R documentation]


Bar chart


“There are two types of bar charts: geom_bar() and geom_col().

  • geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights).

  • geom_bar() uses stat_count() by default: it counts the number of cases at each x position

  • If you want the heights of the bars to represent values in the data, use geom_col() instead.

  • geom_col() uses stat_identity(): it leaves the data as is”


geom_col()


Usage:

geom_col(
  position = "stack", # define position, stack is default
  just = 0.5, # default: bars are centered on the axis breaks; set to 0 or 1 to place bars to the left/right of the breaks
  width = NULL, # define bar width 
  na.rm = FALSE,
  show.legend = NA,
  ... # for other options see section customize your plots 
)

In the following, we are going to plot the population of the first 5 states of df

df %>% 
  ### select first 5 rows 
  slice_head(n = 5) %>% 
  ggplot(aes(x = state, y = population, fill = state)) + 
  geom_col(color = "black", width = .8) + 
  ggtitle("Population by state")


geom_bar(): Creates bar charts, useful for visualizing the count or frequency of categorical data.


Usage:

geom_bar(
  position = "stack", # define position, stack is default
  stat = "count", # default - if you wish to define the y value, either use geom_col() or set stat = "identity"
  just = 0.5, # default: bars are centered on the axis breaks; set to 0 or 1 to place bars to the left/right of the breaks
  width = NULL, # define bar width 
  na.rm = FALSE,
  show.legend = NA,
  ... # for other options see section customize your plots 
)

Similarly to the example above, we are going to plot the population of the first 5 states of df, this time with geom_bar(stat = "identity")

df %>% 
  ### select first 5 rows 
  slice_head(n = 5) %>% 
  
  ggplot(aes(x = state, y = population, fill = state)) + 
  geom_bar(stat = "identity", color = "black") + 
  ggtitle("Population by state")

“A bar chart uses height to represent a value, and so the base of the bar must always be shown to produce a valid visual comparison. Proceed with caution when using transformed scales with a bar chart. It’s important to always use a meaningful reference point for the base of the bar. For example, for log transformations the reference point is 1. In fact, when using a log scale, geom_bar() automatically places the base of the bar at 1. Furthermore, never use stacked bars with a transformed scale, because scaling happens before stacking. As a consequence, the height of bars will be wrong when stacking occurs with a transformed scale.

By default, multiple bars occupying the same x position will be stacked atop one another by position_stack(). If you want them to be dodged side-to-side, use position_dodge() or position_dodge2(). Finally, position_fill() shows relative proportions at each x by stacking the bars and then standardising each bar to have the same height”.
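To illustrate the default counting behaviour of geom_bar() and the position options described above, here is a small sketch. It assumes the same high/low thresholds used elsewhere in this document (4246 for population, 4436 for income); states_df and the plot names are illustrative.

```r
library(ggplot2)
library(dplyr)

# stand-in for df (illustrative name) with the two grouping columns
states_df <- data.frame(
  population = state.x77[, "Population"],
  income     = state.x77[, "Income"]
) %>%
  mutate(
    population2 = if_else(population >= 4246, "high", "low"),
    income2     = if_else(income >= 4436, "high", "low")
  )

### geom_bar() counts the states in each income group; no y aesthetic is needed
p_stack <- ggplot(states_df, aes(x = income2, fill = population2)) +
  geom_bar()                     # default: position = "stack"

p_dodge <- ggplot(states_df, aes(x = income2, fill = population2)) +
  geom_bar(position = "dodge")   # bars side by side

p_fill <- ggplot(states_df, aes(x = income2, fill = population2)) +
  geom_bar(position = "fill")    # each bar standardised to the same height

### the counts that geom_bar() draws
group_counts <- count(states_df, income2, population2)
group_counts
```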

Run the following code for more information about geom_bar() and geom_col()

help("geom_col")
help("geom_bar")

[Reference: R documentation]


Density plot


geom_density(): Display a smooth estimate of the distribution of continuous data.


Usage:

geom_density(
  stat = "density", # default
  position = "identity",
  ...,
  na.rm = FALSE, # remove na values
  outline.type = "upper"
)

In the following, we will plot the distribution of illiteracy. Note: geom_density() only requires either x or y aesthetics.

df %>% 
  ggplot(aes(x = illiteracy)) + 
  geom_density(fill = "deepskyblue4") + 
  ggtitle("Distribution of illiteracy")

We can also verify the distribution of illiteracy by states with high vs. low population using the fill argument within the aes() function

df %>% 
  ggplot(aes(x = illiteracy, fill = population2)) + 
  geom_density() + 
  ggtitle("Distribution of illiteracy by high vs. low-population states")

In this case, it is better to lower the color transparency to see how the two groups are distributed. We can do this with the alpha argument within the geom_density function

df %>% 
  ggplot(aes(x = illiteracy, fill = population2)) + 
  geom_density(alpha = .7) + 
  ggtitle("Distribution of illiteracy by high vs. low-population states")


geom_density_ridges(): Create ridge plots to visualize the distribution of continuous data along one or more categorical variables.


Usage:

geom_density_ridges(
  mapping = NULL,
  data = NULL,
  stat = "density_ridges",
  position = "points_sina",
  panel_scaling = TRUE, # scaling is calculated for each panel 
  na.rm = FALSE, # removes na values 
  ...
)

In the following, we plot the distributions of illiteracy, murder and life_exp. In order to use ggridges::geom_density_ridges(), we first have to reshape the data into long format. To do this, we will use the function pivot_longer().

Note: the function geom_density_ridges() is part of the package ggridges.

df %>% 
  dplyr::select(illiteracy, murder, life_exp) %>% 
  pivot_longer(names_to = "measure", values_to = "value", 1:3) %>% 
  
  ggplot(aes(x = value, y = measure, fill = measure)) + 
  ggridges::geom_density_ridges(alpha = .7)

We can also plot the distribution of illiteracy, murder and life_exp based on states with high and low income.

First, we have to create a new column (income2) based on whether the income of the state is above or below 4436 (mean). To do this, we use again the if_else() function within the mutate() function.

df %>% 
  mutate(income2 = if_else(income >= 4436, "high", "low")) -> df

Now we can create a plot with geom_density_ridges()

df %>% 
  dplyr::select(income2, illiteracy, murder, life_exp) %>% 
  pivot_longer(names_to = "measure", values_to = "value", 2:4) %>% 
  
  ggplot(aes(x = value, y = measure, fill = income2)) + 
  ggridges::geom_density_ridges(alpha = .7)

Another option is to use the function facet_wrap(~variable) instead of the fill = argument

df %>% 
  dplyr::select(income2, illiteracy, murder, life_exp) %>% 
  pivot_longer(names_to = "measure", values_to = "value", 2:4) %>% 
  
  ggplot(aes(x = value, y = measure, fill = measure)) + 
  ggridges::geom_density_ridges(alpha = .7) + 
  facet_wrap(~income2)

Run the following code for more information about geom_density() and geom_density_ridges()

help("geom_density")
help("geom_density_ridges")

[Reference: R documentation]


Histograms


geom_histogram(): Display the distribution of continuous data using bars.


“Visualise the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars; frequency polygons (geom_freqpoly()) display the counts with lines. Frequency polygons are more suitable when you want to compare the distribution across the levels of a categorical variable”.

Usage:

geom_histogram(
  mapping = NULL,
  data = NULL,
  stat = "bin",
  position = "stack", # or "jitter"
  ...,
  binwidth = NULL, # width of the bins
  bins = NULL, # number of bins overridden by binwidth
  na.rm = FALSE, # remove NAs values 
)

Similarly to geom_density(), geom_histogram() also only requires either x or y aesthetics

ggarrange(df %>% 
  ggplot(aes(x = illiteracy)) + 
  geom_histogram(fill = "deepskyblue4", color = "black", alpha = .5, position = "jitter", bins = 30) + 
  ggtitle("Distribution of illiteracy") + 
    annotate("label", x = 2.7, y = 10, label = "jitter"), 

df %>% 
  ggplot(aes(x = illiteracy)) + 
  geom_histogram(fill = "indianred3", color = "black", alpha = .5, position = "stack", bins = 10) + ### default
  ggtitle("Distribution of illiteracy") + 
    annotate("label", x = 2.7, y = 12.8, label = "stack"))

Run the following code for more information about geom_histogram()

help("geom_histogram")

[Reference: R documentation]


Scatterplots


geom_point() and geom_jitter(): Display scatterplots. geom_jitter() avoids overlapping points.


geom_point() and geom_jitter() are used to create scatterplots. geom_point() is most useful for displaying the relationship between two continuous variables. It can be used to compare one continuous and one categorical variable, or two categorical variables, but a variation like geom_jitter() is usually more appropriate.

Usage:

geom_point(
  mapping = NULL,
  data = NULL,
  stat = "identity", 
  position = "identity", # or jitter 
  ...,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)

geom_jitter(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "jitter",
  ...,
  width = NULL,
  height = NULL,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)

Let’s compare the two plots by plotting the percentage of illiteracy by states with low vs. high income (income2)

df %>% 
  ggplot(aes(x = income2, y = illiteracy, color = income2)) + 
  geom_point() + 
  ggtitle("Scatter plot with geom_point()") 

“geom_jitter() is a convenient shortcut for geom_point(position = "jitter"). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets.”

df %>% 
  ggplot(aes(x = income2, y = illiteracy, color = income2)) + 
  geom_jitter() + 
  ggtitle("Scatter plot with geom_jitter()") 
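The amount of jitter can be controlled. In the sketch below, points are spread only horizontally (height = 0), so the illiteracy values on the y axis stay exact, and a seed makes the random spread reproducible. The width of 0.15 and the seed are arbitrary choices for illustration; states_df is an illustrative stand-in for df.

```r
library(ggplot2)

# stand-in for df (illustrative name)
states_df <- data.frame(
  income     = state.x77[, "Income"],
  illiteracy = state.x77[, "Illiteracy"]
)
states_df$income2 <- ifelse(states_df$income >= 4436, "high", "low")

p_jitter <- ggplot(states_df, aes(x = income2, y = illiteracy, color = income2)) +
  ### jitter along x only, with a fixed seed so the plot looks the same on every run
  geom_point(position = position_jitter(width = 0.15, height = 0, seed = 1)) +
  ggtitle("Horizontal jitter only")

p_jitter
```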

Run the following code for more information about geom_point() and geom_jitter()

help("geom_point")
help("geom_jitter")

[Reference: R documentation]


Connect observations


geom_line(): Connect data points with lines, useful for showing trends.


Usage:

geom_line(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "identity",
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)

### Alternatives 
geom_path()
geom_step()
### see help()

df %>%
  slice_head(n = 5) %>%
  ggplot(aes(x = income, y = illiteracy)) + 
  geom_line(linewidth = .8, linetype = 3, color = "deepskyblue4") +
  geom_point(size = 2, color = "deepskyblue4") +
  ggtitle("Change in illiteracy by income")

Run the following code for more information about geom_line()

help("geom_line")

[Reference: R documentation]


Smoothed conditional means


geom_smooth(): Adds a smooth trend line to a scatter plot


geom_smooth() calculates the predicted value of y, the lower and upper pointwise confidence interval around the mean, and the standard error.

Usage:

geom_smooth(
  mapping = NULL,
  data = NULL,
  stat = "smooth",
  position = "identity",
  ...,
  method = NULL, # IMPORTANT: Use method "lm" for plotting linear regression
  formula = NULL,
  se = TRUE,
  na.rm = FALSE,
  ...
)

Let’s plot the relationship between murders and illiteracy:

df %>% 
  ggplot(aes(x = murder, y = illiteracy)) + 
  geom_jitter(alpha = .5, color = "indianred") + 
  geom_smooth(color = "indianred", fill = "lightgrey") + 
  ggtitle("Relationship between illiteracy and murder")

If you specify method = "lm" within the geom_smooth() function, you get a regression line

df %>% 
  ggplot(aes(x = murder, y = illiteracy)) +
  geom_jitter(alpha = .5, color = "indianred") + 
  geom_smooth(color = "indianred", fill = "lightgrey", method = "lm") + 
  ggtitle("Smooth plot with (method = lm)")
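The values that geom_smooth() computes (predicted y, confidence bounds and standard error) can be inspected with layer_data(). This is a small sketch under the assumption that the smoother is the second layer of the plot; states_df, p and smooth_values are illustrative names.

```r
library(ggplot2)

# stand-in for df (illustrative name)
states_df <- data.frame(
  murder     = state.x77[, "Murder"],
  illiteracy = state.x77[, "Illiteracy"]
)

p <- ggplot(states_df, aes(x = murder, y = illiteracy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

### layer 2 is geom_smooth(): y = fitted value, ymin/ymax = confidence bounds, se = standard error
smooth_values <- layer_data(p, 2)
head(smooth_values[, c("x", "y", "ymin", "ymax", "se")])
```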

Run the following code for more information about geom_smooth()

help("geom_smooth")

[Reference: R documentation]


Vertical intervals: lines, crossbars & errorbars


geom_errorbar(): Adds error bars to a plot. They allow you to visually represent the variability or uncertainty associated with the data points.


Usage:

geom_errorbar(
  mapping = NULL,
  data = NULL,
  stat = "identity", # default
  position = "identity", # default
  ...,
  na.rm = FALSE,
)

In the following plot, we are using geom_point() to plot the average income for states with high vs. low population. We will use geom_errorbar() to plot the standard error of the mean.

### first, let's compute the mean and the standard error 
df %>% 
  summarize(
    mean = mean(income), 
    sd = sd(income),
    se = sd/sqrt(n()),
    .by = "population2"
  ) -> summary_income
  
summary_income %>% 
  ggplot(aes(x = population2, y = mean, color = population2)) + 
  geom_point(size = 3) + 
  geom_errorbar(aes(x = population2, ymin = mean-se, ymax = mean+se), width = 0.1, linewidth = 0.5) + 
  ggtitle("Income by population (high vs. low)")

Run the following code for more information about geom_errorbar()

help("geom_errorbar")

[Reference: R documentation]


Maps


geom_polygon(): the start and end points are connected and the inside is coloured by fill.


  • First, you have to get the map of the states/regions with the function map_data() (it requires the maps package). See help for more information.
# download map states using `map_data` 
map_data("state") -> usa.map
df %>% mutate(state = tolower(state)) -> df.map

states <- c("alabama", "alaska", "arizona", "arkansas", "california", "colorado", "connecticut", "delaware", "florida", "georgia", "hawaii", "idaho", "illinois", "indiana", "iowa", "kansas", "kentucky", "louisiana", 
            "maine", "maryland", "massachusetts", "michigan", "minnesota", "mississippi", "missouri", "montana", 
            "nebraska", "nevada", "new hampshire", "new jersey", "new mexico", "new york", "north carolina", 
            "north dakota", "ohio", "oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", 
            "south dakota", "tennessee", "texas", "utah", "vermont", "virginia", "washington", "west virginia", 
            "wisconsin", "wyoming")


usa.map %>% 
  ### filter only the states of our df 
  filter(region %in% states) -> usa.map

### join dfs 

df.map %>% rename(region = state) -> df.map
common.cols <- intersect(names(df.map), names(usa.map))
left_join(df.map, usa.map, by = common.cols) -> df.map

In the following, we plot the murder rate by state (called region in the joined data frame)

df.map %>% 
  ggplot(aes(x = long, y = lat, group = group, fill = murder)) + 
  geom_polygon(color = "black") + 
  scale_fill_gradient(low="white", high="indianred3")  + 
  ggtitle("Murder by state") + 
  labs(x = "", y = "")

Run the following code for more information about geom_polygon()

??geom_polygon

[Reference: R documentation]


Violin plot


geom_violin():


“A violin plot is a compact display of a continuous distribution. It is a blend of geom_boxplot() and geom_density(): a violin plot is a mirrored density plot displayed in the same way as a boxplot”

geom_violin(
  mapping = NULL,
  data = NULL,
  stat = "ydensity",
  position = "dodge",
  ...,
  draw_quantiles = NULL, # NULL by default; if not NULL, draw horizontal lines at the given quantiles of the density estimate.
  trim = TRUE, # If TRUE (default), trim the tails of the violins to the range of the data. If FALSE, don't trim the tails.
  scale = "area", # if "area" (default), all violins have the same area (before trimming the tails). If "count", areas are scaled proportionally to the number of observations. If "width", all violins have the same maximum width.
  na.rm = FALSE,
  ... 
)

An example: the illiteracy rate (%) by income (low/high)

df %>% 
  ggplot(aes(x = income2, y = illiteracy, fill = income2)) + 
  geom_violin(alpha = .5) + 
  ggtitle("% illiteracy by low/high-income states") 

Run the following code for more information about geom_violin()

help("geom_violin")

References:

  • R documentation;

  • Hintze, J. L., Nelson, R. D. (1998) Violin Plots: A Box Plot-Density Trace Synergism. The American Statistician 52, 181-184.


Plotting correlations


ggcorrplot(): Plots a correlation matrix

cor_pmat(): Computes the matrix of correlation p-values


Usage:

ggcorrplot(
  corr,
  method = c("square", "circle"),
  type = c("full", "lower", "upper"),
  ggtheme = ggplot2::theme_minimal, 
  title = "",
  show.legend = TRUE, 
  legend.title = "Corr",
  show.diag = NULL,
  colors = c("blue", "white", "red"), # set colors 
  outline.color = "gray",
  hc.order = FALSE, 
  hc.method = "complete",
  
  lab = FALSE, # add correlation coefficient to the plot 
  lab_col = "black",
  lab_size = 4,
  
  p.mat = NULL,
  sig.level = 0.05, # p-value significance level 
  insig = c("pch", "blank"), # if pch = add characters, if blank = remove correlation
  pch = 4, # shape 
  pch.col = "black",
  pch.cex = 5, # size pch 
  # the size, the color and the string rotation of text label (variable names).
  tl.cex = 12,
  tl.col = "black",
  tl.srt = 45,
  
  digits = 2,
  as.is = FALSE
)

cor_pmat(x, ...)

Let’s plot the correlations between all the continuous variables in df

### select the continuous variables (columns 2 to 9)
df %>% 
  dplyr::select(2:9) -> df.num

### correlation matrix (Pearson correlations are unaffected by standardizing, so no scaling is needed)
cor(df.num, method = "pearson") -> cor.mat

### p-values are computed from the data, not from the correlation matrix
ggcorrplot::cor_pmat(df.num) -> p.values
round(p.values, 3) -> p.values

ggcorrplot::ggcorrplot(cor.mat, hc.order = TRUE,
                       type = "lower",
                       lab = TRUE,
                       lab_size = 4,
                       method = "square",
                       colors = c("grey", "white", "turquoise3"),
                       p.mat = p.values,
                       pch.col = "grey50",
                       pch = 4,
                       show.legend = FALSE,
                       insig = "pch",
                       title = "Correlation", 
                       ggtheme = unikn::theme_unikn(),
                       outline.color = "black", 
                       tl.cex = 10,
                       tl.col = "black",
                       tl.srt = 90) +
  labs(caption = "non-significant correlations (p > .05) are crossed out")

Run the following code for more information about ggcorrplot::ggcorrplot()

help("ggcorrplot")

[References: R documentation]


Likert plots


plot_likert(): Plot likert scales as centered stacked bars.


Usage:

plot_likert(
  items,
  groups = NULL,
  groups.titles = "auto",
  title = NULL,
  legend.title = NULL,
  legend.labels = NULL,
  axis.titles = NULL,
  axis.labels = NULL,
  # optional, amount of categories of items (e.g. "strongly disagree", "disagree", "agree" and "strongly agree" would be catcount = 4). 
  catcount = NULL,
  # If there's a neutral category (like "don't know" etc.), specify the index number (value) for this category.
  cat.neutral = NULL,
  sort.frq = NULL,
  weight.by = NULL,
  title.wtd.suffix = NULL,
  wrap.title = 50,
  wrap.labels = 30,
  wrap.legend.title = 30,
  wrap.legend.labels = 28,
  geom.size = 0.6,
  geom.colors = "BrBG",
  cat.neutral.color = "grey70",
  intercept.line.color = "grey50",
  reverse.colors = FALSE,
  values = "show",
  show.n = TRUE,
  show.legend = TRUE,
  show.prc.sign = FALSE,
  grid.range = 1,
  grid.breaks = 0.2,
  expand.grid = TRUE,
  digits = 1,
  reverse.scale = FALSE,
  coord.flip = TRUE,
  sort.groups = TRUE,
  legend.pos = "bottom",
  rel_heights = 1,
  group.legend.options = list(nrow = NULL, byrow = TRUE),
  cowplot.options = list(label_x = 0.01, hjust = 0, align = "v")
)

Example:

TrustInGovernment  PoliticalKnowledge  SatisfactionDemocracy  InterestPolitics  ApprovalLeaders
                5                   4                      4                 1                2
                4                   5                      2                 5                3
                4                   2                      3                 3                5
                2                   4                      5                 1                1
                2                   4                      4                 5                1
                3                   1                      3                 1                5
### make sure to mutate variables into "factors"
df1 %>% 
  mutate_all(~as.factor(.x)) -> df1

sjPlot::plot_likert(df1, 
                    catcount = 5, 
                    geom.colors = c("#993F00", "#FF8E32", "#FFE5CC", "#B2FCFF", "#51C3CC", "grey"),
                   # legend.labels = c()
                    reverse.scale = T,
                    title = "Likert plot", 
                    geom.size = .5) + 
  theme(aspect.ratio = 1/2)

df2 %>% 
  mutate_all(~as.factor(.x)) -> df2

sjPlot::plot_likert(df2, 
                    catcount = 5, # there are 5 categories (1 to 5 and 1 neutral)
                    cat.neutral = 1, # the position of the neutral category, here is the first as it is 0
                    geom.colors = c("#993F00", "#FF8E32", "#FFE5CC", "#B2FCFF", "#51C3CC", "grey"),
                    legend.labels = c("I don't know","1", "2", "3", "4", "5"),
                    reverse.scale = T,
                    title = "Likert plot", 
                    geom.size = .6) + 
  theme(aspect.ratio = 1/2)

sjPlot::plot_likert(df2, 
                    catcount = 5, # there are 5 categories (1 to 5 and 1 neutral)
                    cat.neutral = 1, # the position of the neutral category, here is the first as it is 0
                    geom.colors = c("#993F00", "#FF8E32", "#FFE5CC", "#B2FCFF", "#51C3CC", "grey"),
                    legend.labels = c("I don't know","1", "2", "3", "4", "5"),
                    reverse.scale = T,
                    title = "Likert plot", 
                    geom.size = .6, 
                    values = "sum.inside") + 
  theme(aspect.ratio = 1/2)

Run the following code for more information about sjPlot::plot_likert()

??sjPlot::plot_likert

[References: R documentation]


Customize your plot


In the following sections you will learn how to customize your plots. You can find this information by typing ggplot2-specs in the help panel.


Colour and fill


Almost every geom has either colour, fill, or both.

  • Colours and fills can be specified in the following ways:

  • A name, e.g., “red”. R has 657 built-in named colours, which can be listed with colours().

head(colours()) # only print first rows with "head()"
## [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3"
  • An rgb specification, i.e., a string of the form "#RRGGBB" (see https://r-charts.com/color-palettes/)

  • The transparency of the colors can be modified with the argument alpha. A lower value sets a more transparent colour.
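The three ways of specifying a colour listed above can be tried directly in base R. This is a small sketch; the colour components are chosen to reproduce the hex code "#51C3CC" used in the examples of this document, and the variable names are illustrative.

```r
### named colours: how many are built in?
length(colours())   # 657

### rgb specification: build a "#RRGGBB" string from red, green and blue components (0-255)
teal <- rgb(81, 195, 204, maxColorValue = 255)
teal                # "#51C3CC"

### transparency: adjustcolor() appends an alpha channel to the hex string
teal_half <- adjustcolor(teal, alpha.f = 0.5)
teal_half
```

Both teal and teal_half can be passed to the color or fill argument of a geom; inside a geom, the alpha argument achieves the same effect as the alpha channel.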

The arguments fill and colour are specified within the aes() function when they should depend on a variable, for example when you plot a categorical variable against a continuous one and want a different colour/fill for each level of the categorical variable. To set a single fixed colour, specify them outside aes(), within the geom.


Examples:

  1. Set the color of the plot
### example of a geom with only the color argument 
df %>% 
  ggplot(aes(x = income, y = illiteracy)) + 
  geom_jitter(color = "indianred3") + 
  ggtitle("Set the color of the plot")

### example of a geom with both color and fill argument 
ggarrange(
df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(fill = "#51C3CC") + 
  annotate("label", x = 1.5, y = 3, label = "fill", fill = "#51C3CC"), 

df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(color = "#51C3CC") + 
  annotate("label", x = 1.5, y = 3, label = "colour", color = "#51C3CC")
)


  2. Define colour and fill based on levels of categorical variables. In this case, the two arguments have to be specified within the aes(x = ..., y = ..., fill = ..., colour = ...) function. This can be done either within the ggplot() function or within the geom_X() function.
### option 1
ggarrange(
  
df %>% 
  ggplot(aes(x = income2, y = illiteracy, fill = income2)) + 
  geom_violin() + 
  annotate("label", x = 1.5, y = 3, label = "fill"), 

df %>% 
  ggplot(aes(x = income2, y = illiteracy, colour = income2)) + 
  geom_violin() + 
  annotate("label", x = 1.5, y = 3, label = "colour")
)

### option 2
ggarrange(
  
df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(fill = income2)) + 
  annotate("label", x = 1.5, y = 3, label = "fill"), 

df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(colour = income2)) + 
  annotate("label", x = 1.5, y = 3, label = "colour")
)

Of course, you may want to set your favourite colors instead of using the default ones. For this, there are several functions that you can use depending on the type of variables (i.e., discrete/continuous).

For discrete variables


[Reference: https://stackoverflow.com/questions/70942728/understanding-color-scales-in-ggplot2]


ggarrange(
  
df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(fill = income2)) + 
  annotate("label", x = 1.5, y = 3, label = "fill") + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")), 

df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(colour = income2), linewidth = 1) + 
  annotate("label", x = 1.5, y = 3, label = "colour") + 
  scale_color_manual(values = c("#0073C2", "#EFC000"))
)


For continuous variables


[Reference: https://stackoverflow.com/questions/70942728/understanding-color-scales-in-ggplot2]


df %>% 
  ggplot(aes(x = income, y = illiteracy, color = income)) + 
  geom_jitter() + 
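  ### midpoint defaults to 0; set it to the mean income so that all three colours are used
  scale_color_gradient2(low = "#BCFFB2", mid = "#8AE67E", high = "#1F990F", midpoint = mean(df$income)) + 
  ggtitle("Example with scale_color_gradient2()")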


Edit lines and dots shapes


The appearance of a line is affected by linewidth, linetype, lineend.

Line types can be specified with:

An integer or name:

  • 0 = blank,

  • 1 = solid,

  • 2 = dashed,

  • 3 = dotted,

  • 4 = dotdash,

  • 5 = longdash,

  • 6 = twodash

  • The appearance of the line end is controlled by the lineend parameter, and can be one of “round”, “butt” (the default), or “square”.

Example:

ggarrange(
  
df %>%
  slice_head(n = 5) %>%
  ggplot(aes(x = income, y = illiteracy)) + 
  geom_line(linewidth = .8, linetype = 3, color = "deepskyblue4") +
  geom_point(size = 2, color = "deepskyblue4") +
  ggtitle("Dotted line"),

df %>%
  slice_head(n = 5) %>%
  ggplot(aes(x = income, y = illiteracy)) + 
  geom_line(linewidth = .8, linetype = 6, color = "indianred") +
  geom_point(size = 2, color = "indianred") +
  ggtitle("Twodash line")
)


You can add vertical and horizontal lines with geom_vline() and geom_hline() respectively

Let’s plot the distribution of illiteracy by income (high vs. low)

df %>% 
  ggplot(aes(x = illiteracy, fill = income2)) + 
  geom_density(alpha = .7) + 
  ggtitle("Distribution of illiteracy by high vs. low-income states") + 
  scale_fill_manual(values = c("#0073C2", "#EFC000"))

Let’s now compute the mean score of illiteracy by income (high vs. low). To do this, we can use the function summarize() from the dplyr package (part of the tidyverse).

df %>% 
  summarize(
    mean = mean(illiteracy), 
    .by = "income2"
  ) -> summary_illiteracy
kable(summary_illiteracy) %>% kable_styling()
income2      mean
low      1.457143
high     0.962069

Now we can add a vertical line to the density plot, to signal the mean value for each state group (high vs. low income)

df %>% 
  ggplot(aes(x = illiteracy, fill = income2)) + 
  geom_density(alpha = .7) + 
  ggtitle("geom_vline()") + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  ### ADD VERTICAL LINE
  geom_vline(xintercept = 1.457143, color = "#EFC000") + 
  geom_vline(xintercept = 0.962069, color = "#0073C2") 

The following is a more robust alternative that avoids typing the values manually. Above, we saved the summary table in a new variable, summary_illiteracy. We can retrieve the xintercept values directly from that table as follows:

df %>% 
  ggplot(aes(x = illiteracy, fill = income2)) + 
  geom_density(alpha = .7) + 
  ggtitle("geom_vline()") + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  ### ADD VERTICAL LINE
  geom_vline(xintercept = summary_illiteracy$mean[summary_illiteracy$income2 == "low"], color = "#EFC000") + 
  geom_vline(xintercept = summary_illiteracy$mean[summary_illiteracy$income2 == "high"], color = "#0073C2") 
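
Another option is to hand the whole summary table to geom_vline() and map xintercept inside aes(), so that one layer draws one line per group. The sketch below rebuilds a small stand-in data frame from state.x77. The median split used to create income2 is an assumption made for illustration and may differ from the definition used earlier in this tutorial. The names states and group_means are also illustrative.

```r
library(ggplot2)

# Stand-in for df (assumption: income2 is a median split of income)
states <- as.data.frame(state.x77)
states$illiteracy <- states$Illiteracy
states$income2 <- ifelse(states$Income > median(states$Income), "high", "low")

# One row per group with its mean illiteracy
group_means <- aggregate(illiteracy ~ income2, data = states, FUN = mean)
names(group_means)[2] <- "mean"

p_vline <- ggplot(states, aes(x = illiteracy, fill = income2)) +
  geom_density(alpha = .7) +
  # one vertical line per row of group_means, colored by group
  geom_vline(data = group_means, aes(xintercept = mean, color = income2), linetype = 2) +
  scale_fill_manual(values = c("#0073C2", "#EFC000")) +
  scale_color_manual(values = c("#0073C2", "#EFC000"))

p_vline
```

With this approach the line colors follow the same scale as the fill, so they stay correct if the groups or their order change.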

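You can edit the line type and width with linetype and size, respectively. (In ggplot2 3.4.0 and later, linewidth is the preferred argument for the width of lines; size still works for lines but triggers a deprecation warning.)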

df %>% 
  ggplot(aes(x = illiteracy, fill = income2)) + 
  geom_density(alpha = .7) + 
  ggtitle("geom_vline()") + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  ### ADD VERTICAL LINE
  geom_vline(xintercept = summary_illiteracy$mean[summary_illiteracy$income2 == "low"], color = "#EFC000", linetype = 2, size = 1.5) + 
  geom_vline(xintercept = summary_illiteracy$mean[summary_illiteracy$income2 == "high"], color = "#0073C2", linetype = 6, size = 1.5) 


Similarly, we can add a horizontal line to the plot.

In the following, we plot the population of the first 5 states in df and add a horizontal line showing the mean population of these 5 states (i.e., 5900):

df %>% 
  slice_head(n = 5) %>% 
  
  ggplot(aes(x = state, y = population, fill = state)) + 
  geom_col(color = "black") + 
  geom_hline(yintercept = 5900, color = "indianred", linetype = 6, size = 1) + 
  ggtitle("geom_hline()")
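
The value 5900 does not have to be typed by hand. As a minimal sketch, the mean can be computed first and then passed to yintercept. The data frame first5 below is a hand-typed copy of the first five rows shown in the data preview, so that the example runs on its own.

```r
library(ggplot2)

# First five states and their population, copied from the data preview
first5 <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California"),
  population = c(3615, 365, 2212, 2110, 21198)
)

# Compute the mean once and reuse it
mean_pop <- mean(first5$population)

p_hline <- ggplot(first5, aes(x = state, y = population, fill = state)) +
  geom_col(color = "black") +
  geom_hline(yintercept = mean_pop, color = "indianred", linetype = 6) +
  ggtitle("geom_hline()")

p_hline
```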


Change point shape


Use shape or pch to edit point shape




Example:

df %>% 
  ggplot(aes(x = income, y = illiteracy, pch = income2, color = income2)) + 
  geom_jitter(size = 3) + 
  scale_shape_manual(values = c(15, 17)) + 
  scale_color_manual(values = c("#0073C2", "#EFC000")) + 
  ggtitle("Edit point shape")
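
Point shapes 21 to 25 differ from the others in that they have both a border (color) and an interior (fill). The following is a small sketch on made-up data. scale_shape_identity() tells ggplot2 to use the numbers in shape_id directly as shape codes.

```r
library(ggplot2)

# Hypothetical toy data: one point per fillable shape code
shapes <- data.frame(x = 1:5, y = 1, shape_id = 21:25)

p_shapes <- ggplot(shapes, aes(x = x, y = y, shape = shape_id)) +
  # color sets the border, fill sets the interior
  geom_point(size = 5, color = "black", fill = "#EFC000", stroke = 1) +
  scale_shape_identity() +
  ggtitle("Shapes 21 to 25")

p_shapes
```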


Further customizations with theme()


Usage:

theme(
  line,
  rect,
  text,
  title,
  aspect.ratio,
  axis.title,
  axis.title.x,
  axis.title.x.top,
  axis.title.x.bottom,
  axis.title.y,
  axis.title.y.left,
  axis.title.y.right,
  axis.text,
  axis.text.x,
  axis.text.x.top,
  axis.text.x.bottom,
  axis.text.y,
  axis.text.y.left,
  axis.text.y.right,
  axis.ticks,
  axis.ticks.x,
  axis.ticks.x.top,
  axis.ticks.x.bottom,
  axis.ticks.y,
  axis.ticks.y.left,
  axis.ticks.y.right,
  axis.ticks.length,
  axis.ticks.length.x,
  axis.ticks.length.x.top,
  axis.ticks.length.x.bottom,
  axis.ticks.length.y,
  axis.ticks.length.y.left,
  axis.ticks.length.y.right,
  axis.line,
  axis.line.x,
  axis.line.x.top,
  axis.line.x.bottom,
  axis.line.y,
  axis.line.y.left,
  axis.line.y.right,
  legend.background,
  legend.margin,
  legend.spacing,
  legend.spacing.x,
  legend.spacing.y,
  legend.key,
  legend.key.size,
  legend.key.height,
  legend.key.width,
  legend.text,
  legend.text.align,
  legend.title,
  legend.title.align,
  legend.position,
  legend.direction,
  legend.justification,
  legend.box,
  legend.box.just,
  legend.box.margin,
  legend.box.background,
  legend.box.spacing,
  panel.background,
  panel.border,
  panel.spacing,
  panel.spacing.x,
  panel.spacing.y,
  panel.grid,
  panel.grid.major,
  panel.grid.minor,
  panel.grid.major.x,
  panel.grid.major.y,
  panel.grid.minor.x,
  panel.grid.minor.y,
  panel.ontop,
  plot.background,
  plot.title,
  plot.title.position,
  plot.subtitle,
  plot.caption,
  plot.caption.position,
  plot.tag,
  plot.tag.position,
  plot.margin,
  strip.background,
  strip.background.x,
  strip.background.y,
  strip.clip,
  strip.placement,
  strip.text,
  strip.text.x,
  strip.text.x.bottom,
  strip.text.x.top,
  strip.text.y,
  strip.text.y.left,
  strip.text.y.right,
  strip.switch.pad.grid,
  strip.switch.pad.wrap,
  ...,
  complete = FALSE,
  validate = TRUE
)

Arguments

  • line: all line elements (element_line())

  • rect: all rectangular elements (element_rect())

  • text: all text elements (element_text())

  • title: all title elements: plot, axes, legends (element_text(); inherits from text)

  • aspect.ratio: aspect ratio of the panel

  • axis.title, axis.title.x, axis.title.y, axis.title.x.top, axis.title.x.bottom, axis.title.y.left, axis.title.y.right: labels of axes (element_text()).

  • axis.text, axis.text.x, axis.text.y, axis.text.x.top, axis.text.x.bottom, axis.text.y.left, axis.text.y.right: tick labels along axes (element_text()).

  • axis.ticks, axis.ticks.x, axis.ticks.x.top, axis.ticks.x.bottom, axis.ticks.y, axis.ticks.y.left, axis.ticks.y.right: tick marks along axes (element_line()).

  • axis.ticks.length, axis.ticks.length.x, axis.ticks.length.x.top, axis.ticks.length.x.bottom, axis.ticks.length.y, axis.ticks.length.y.left, axis.ticks.length.y.right: length of tick marks (unit)

  • axis.line, axis.line.x, axis.line.x.top, axis.line.x.bottom, axis.line.y, axis.line.y.left, axis.line.y.right: lines along axes (element_line()).

  • legend.background: background of legend (element_rect(); inherits from rect)

  • legend.margin: the margin around each legend (margin())

  • legend.spacing, legend.spacing.x, legend.spacing.y: the spacing between legends (unit).

  • legend.key: background underneath legend keys

  • legend.key.size, legend.key.height, legend.key.width: size of legend keys (unit)

  • legend.text: legend item labels (element_text(); inherits from text)

  • legend.text.align: alignment of legend labels (number from 0 (left) to 1 (right))

  • legend.title: title of legend (element_text(); inherits from title)

  • legend.title.align: alignment of legend title (number from 0 (left) to 1 (right))

  • legend.position: the position of legends (“none”, “left”, “right”, “bottom”, “top”, or two-element numeric vector)

  • legend.direction: layout of items in legends (“horizontal” or “vertical”)

  • legend.justification: anchor point for positioning legend inside plot (“center” or two-element numeric vector) or the justification according to the plot area when positioned outside the plot

  • legend.box: arrangement of multiple legends (“horizontal” or “vertical”)

  • legend.box.just: justification of each legend within the overall bounding box, when there are multiple legends (“top”, “bottom”, “left”, or “right”)

  • legend.box.margin: margins around the full legend area, as specified using margin()

  • legend.box.background: background of legend area (element_rect(); inherits from rect)

  • plot.title: plot title (text appearance) (element_text(); inherits from title) left-aligned by default

  • plot.title.position, plot.caption.position: Alignment of the plot title/subtitle and caption.

  • plot.subtitle: plot subtitle (text appearance) (element_text(); inherits from title) left-aligned by default

  • plot.caption: caption below the plot (text appearance) (element_text(); inherits from title) right-aligned by default

For more information, run the following code:

help(theme)

Example:

df %>% 
  ggplot(aes(x = hs_grad, y = murder, pch = income2, color = income2)) + 
  geom_jitter(size = 2) + 
  geom_smooth(method = "lm", fill = "lightgrey") + 
  scale_color_manual(values = c("#0073C2", "#EFC000")) + 
  ggtitle("theme()") + 
  
  ### customize plot
  
  theme(
    ### change axis title x and y 
    axis.title.x = element_text(family = "mono", color = "deepskyblue4"), 
    axis.title.y = element_text(family = "mono", color = "deepskyblue4"), 
    
    ### change title 
    plot.title = element_text(face = "italic", size = 20, color = "indianred"), 
    ### legend title
    legend.title = element_text(face = "italic", family = "mono", size = 12), 
    
    ### edit legend position 
    legend.position = c(1,1),
    legend.justification = c("right", "top"), 
    legend.background = element_rect(color = "black"),
    legend.key = element_rect(color = "black")
  )
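
If the same customizations are needed in several plots, the result of theme() can be stored in a variable and added to each plot. This is a minimal sketch; the name my_theme and the use of mtcars as stand-in data are illustrative choices, not part of the tutorial.

```r
library(ggplot2)

# A theme() call returns an object that can be stored and reused
my_theme <- theme(
  plot.title = element_text(face = "italic", size = 16),
  legend.position = "bottom"
)

# mtcars is used as stand-in data
p_themed <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  ggtitle("Reusable theme") +
  my_theme

p_themed
```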


Change plot theme


There are several built-in themes in ggplot2. Here is a helpful link: https://r-charts.com/ggplot2/themes/


Here are themes from ggplot2::


The following are themes from the ggpubr:: package:


Here are themes from sjPlot:: and unikn:: packages:
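
A complete theme is applied by adding it to a plot like any other layer. The sketch below uses only themes that ship with ggplot2 and mtcars as stand-in data; themes from ggpubr, sjPlot, and unikn are added in the same way once the packages are loaded.

```r
library(ggplot2)

# Stand-in plot to which each theme is applied
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# A few of the complete themes built into ggplot2
builtin_themes <- list(
  theme_gray = theme_gray(),       # the default
  theme_bw = theme_bw(),
  theme_minimal = theme_minimal(),
  theme_classic = theme_classic()
)

# One plot per theme, titled with the theme name
themed_plots <- lapply(names(builtin_themes), function(nm) {
  base_plot + builtin_themes[[nm]] + ggtitle(nm)
})

themed_plots[[2]]
```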


Annotate plots


Plots can be annotated with geom_text() or with the annotate() function.


Let’s see some examples

Imagine we want to annotate the mean of illiteracy for both high- and low-income states. Here is how you can do it with geom_text():

df %>%
  ggplot(aes(x = income2, y = illiteracy, fill = income2)) +
  geom_boxplot() + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  theme_unikn() +
  geom_text(data = summary_illiteracy, aes(x = income2, y = mean, label = mean), vjust = -0.5, size = 5)


And here is how you can use annotate():

df %>%
  ggplot(aes(x = income2, y = illiteracy, fill = income2)) +
  geom_boxplot() + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  theme_unikn() +
  annotate("label", x = summary_illiteracy$income2, y = summary_illiteracy$mean, label = summary_illiteracy$mean)

With annotate() you can choose between “text” and “label”:

df %>%
  ggplot(aes(x = income2, y = illiteracy, fill = income2)) +
  geom_boxplot() + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  theme_unikn() +
  annotate("text", x = summary_illiteracy$income2, y = summary_illiteracy$mean, label = summary_illiteracy$mean)

You can also edit the text color, the background fill, and the size with the color, fill, and size arguments:

df %>%
  ggplot(aes(x = income2, y = illiteracy, fill = income2)) +
  geom_boxplot() + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  theme_unikn() +
  annotate("label", x = summary_illiteracy$income2, y = summary_illiteracy$mean, label = summary_illiteracy$mean, fill = "grey", size = 6)
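
The labels above print the means with all their decimals. As a small sketch, the values can be formatted before they are passed to label. The data frame means_tbl is a hand-typed copy of the summary table shown earlier, and the bar chart only serves as a simple stand-in plot.

```r
library(ggplot2)

# Hand-typed copy of the summary table printed earlier
means_tbl <- data.frame(
  income2 = c("low", "high"),
  mean = c(1.457143, 0.962069)
)

# Format the means with two decimals for display
means_tbl$label <- sprintf("%.2f", means_tbl$mean)

p_label <- ggplot(means_tbl, aes(x = income2, y = mean)) +
  geom_col(fill = "grey80") +
  annotate("label", x = means_tbl$income2, y = means_tbl$mean,
           label = means_tbl$label, size = 5)

p_label
```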

For more information run the following code:

help(annotate)

Arrange plots


ggpubr::ggarrange() is used to arrange multiple plots together


Usage:

ggarrange(
  ...,
  plotlist = NULL,
  ncol = NULL,
  nrow = NULL,
  labels = NULL,
  label.x = 0,
  label.y = 1,
  hjust = -0.5,
  vjust = 1.5,
  font.label = list(size = 14, color = "black", face = "bold", family = NULL),
  align = c("none", "h", "v", "hv"),
  widths = 1,
  heights = 1,
  legend = NULL,
  common.legend = FALSE,
  legend.grob = NULL
)

Example:

ggarrange(
df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(fill = income2), alpha = .9) + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  theme_sjplot(), 

df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(fill = income2), alpha = .9) + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  theme_sjplot2(), 

df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(fill = income2), alpha = .9) + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  theme_unikn(), 
  
  df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(fill = income2), alpha = .9) + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  theme_grau(), 

common.legend = T, 

### edit legend position
legend = "bottom"
)


If the plots share the same x and y axes, you can blank out the axis titles in each plot with xlab() and ylab(), and then add them once to the arranged figure with the function annotate_figure().

annotate_figure(
  p,
  top = NULL,
  bottom = NULL,
  left = NULL,
  right = NULL,
  fig.lab = NULL,
  fig.lab.pos = c("top.left", "top", "top.right", "bottom.left", "bottom",
    "bottom.right"),
  fig.lab.size,
  fig.lab.face
)

Example:

ggarrange(
df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(fill = income2), alpha = .9) + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) + 
  ylab(" ") + xlab(" ") + 
  theme_sjplot2(), 

df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(fill = income2), alpha = .9) + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) +  
  ylab(" ") + xlab(" ") + 
  theme_sjplot2(), 

df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(fill = income2), alpha = .9) + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) +  
  ylab(" ") + xlab(" ") + 
  theme_sjplot2(), 
  
  df %>% 
  ggplot(aes(x = income2, y = illiteracy)) + 
  geom_violin(aes(fill = income2), alpha = .9) + 
  scale_fill_manual(values = c("#0073C2", "#EFC000")) +  
  ylab(" ") + xlab(" ") + 
  theme_sjplot2(), 

common.legend = T, 

### edit legend position
legend = "right"
) -> p

annotate_figure(p,
               top = text_grob("annotate_figure()", color = "black", face = "bold", size = 14),
               bottom = text_grob("Income", color = "darkgrey", face = "italic"),
               left = text_grob("Illiteracy", color = "darkgrey", face = "italic", rot = 90))


Exercises: Day 1


Data


index_code expenditure_on_education_pct_gdp mortality_rate_infant gini_index gdp_per_capita_ppp inflation_consumer_prices intentional_homicides unemployment gross_fixed_capital_formation population_density suicide_mortality_rate tax_revenue taxes_on_income_profits_capital alcohol_consumption_per_capita government_health_expenditure_pct_gdp urban_population_pct_total country time sex rating
AUS-2003 5.246357 4.9 33.5 30121.82 2.732596 1.533073 5.933 26.05029 2.567035 10.5 24.29997 62.72655 NA 5.623778 84.343 AUS 2003 BOY 527
AUS-2003 5.246357 4.9 33.5 30121.82 2.732596 1.533073 5.933 26.05029 2.567035 10.5 24.29997 62.72655 NA 5.623778 84.343 AUS 2003 GIRL 522
AUS-2003 5.246357 4.9 33.5 30121.82 2.732596 1.533073 5.933 26.05029 2.567035 10.5 24.29997 62.72655 NA 5.623778 84.343 AUS 2003 TOT 524
AUS-2006 4.738430 4.7 NA 34846.72 3.555288 1.372940 4.785 27.78913 2.662089 10.6 24.51177 65.23156 NA 5.719998 84.700 AUS 2006 BOY 527
AUS-2006 4.738430 4.7 NA 34846.72 3.555288 1.372940 4.785 27.78913 2.662089 10.6 24.51177 65.23156 NA 5.719998 84.700 AUS 2006 GIRL 513
AUS-2006 4.738430 4.7 NA 34846.72 3.555288 1.372940 4.785 27.78913 2.662089 10.6 24.51177 65.23156 NA 5.719998 84.700 AUS 2006 TOT 520

[Source: https://www.kaggle.com/datasets/walassetomaz/pisa-results-2000-2022-economics-and-education?resource=download]


Dataset structure

  • index_code: Index code.

  • expenditure_on_education_pct_gdp: Expenditure on education as a percentage of GDP.

  • mortality_rate_infant: Infant mortality rate.

  • gini_index: Gini index.

  • gdp_per_capita_ppp: GDP per capita in terms of purchasing power parity.

  • inflation_consumer_prices: Consumer price inflation.

  • intentional_homicides: Intentional homicides.

  • unemployment: Unemployment rate.

  • gross_fixed_capital_formation: Gross fixed capital formation as a percentage of GDP.

  • population_density: Population density.

  • suicide_mortality_rate: Suicide mortality rate.

  • tax_revenue: Tax revenue.

  • taxes_on_income_profits_capital: Taxes on income, profits, and capital gains.

  • alcohol_consumption_per_capita: Total alcohol consumption per capita.

  • government_health_expenditure_pct_gdp: Government health expenditure as a percentage of GDP.

  • urban_population_pct_total: Urban population percentage of the total population.

  • country: Country.

  • time: Years.

  • sex: Sex.

  • rating: Value of PISA (Programme for International Student Assessment) Results.


Exercise A1: Choose the right plot


Plot the relationship between alcohol_consumption_per_capita and suicide_mortality_rate


The purpose of this exercise is to select the right plot. Here, the two variables are continuous.

df %>% 
  ggplot(aes(x = suicide_mortality_rate, y = alcohol_consumption_per_capita)) + 
  geom_point() 


Exercise A2: Choose the right plot


Plot the government_health_expenditure_pct_gdp over the years (time) for Australia only (AUS)


The purpose of this exercise is to select the right plot. Here, the two variables are continuous. However, we have time data, so we want to see the trend of government_health_expenditure_pct_gdp over time.

### Run this 
df %>% 
  filter(country == "AUS") -> ex.a2
ex.a2 %>% 
  ggplot(aes(x = time, y = government_health_expenditure_pct_gdp)) + 
  geom_point() + 
  geom_line()


Exercise A3: Choose the right plot


Plot the PISA rating by sex


The purpose of this exercise is to select the right plot. Here, we have a categorical variable and a continuous variable. What is the best type of plot?

Note: You do not have to plot the “TOT” level from the sex column. To do this, add the following code before the ggplot function: df %>% filter(sex != "TOT").

ggarrange(
df %>% 
  filter(sex != "TOT") %>% 
  ggplot(aes(x = sex, y = rating)) + 
  geom_boxplot(),

df %>% 
  filter(sex != "TOT") %>% 
  ggplot(aes(x = sex, y = rating)) + 
  geom_violin()
)


Exercise A4: Choose the right plot


Plot the distribution of PISA rating


df %>% 
  ggplot(aes(x = rating)) + 
  geom_density()


Exercise A5: Choose the right plot


Plot the distribution of population_density for Italy (ITA) and Germany (DEU)

  • To filter only Italy and Germany, add this code before the ggplot() function: df %>% filter(country %in% c("ITA", "DEU"))

df %>% 
  filter(country %in% c("ITA", "DEU")) %>% 
  ggplot(aes(x = population_density, fill = country)) + 
  geom_density()


Exercise B1: geom_point()


Plot the relationship between expenditure_on_education_pct_gdp and rating using geom_point()

  • The dots should be red

df %>% 
  ggplot(aes(x = expenditure_on_education_pct_gdp, y = rating)) + 
  geom_point(color = "indianred") 


Exercise B2: geom_point() & geom_line()


Plot the PISA rating over the years (time) for Italy only (ITA) by sex using geom_point() and geom_line()

  • Set the color and shape/type of the points and lines by sex
  • Change the color according to your preferences
  • To filter only Italy and remove sex = TOT, run this before the ggplot function: df %>% filter(country == "ITA" & sex != "TOT")

df %>% 
  filter(country == "ITA" & sex != "TOT") %>% 
  ggplot(aes(x = time, y = rating, color = sex, pch = sex, linetype = sex)) + 
  geom_point(size = 3) + 
  geom_line(size = .5) + 
  scale_color_manual(values = c("indianred3", "deepskyblue4"))


Exercise B3: geom_point() & theme()


Plot the PISA rating by country in 2018 using geom_point()

  • Set the point color by rating
  • Remove the plot legend using theme()
  • Make the x axis text vertical using theme()
  • Remove the axis labels
  • To filter only sex = TOT and time = 2018, add this before the ggplot function: df %>% filter(sex == "TOT" & time == 2018)

df %>% 
  filter(sex == "TOT" & time == 2018) %>% 
  ggplot(aes(x = country, y = rating, color = rating)) + 
  geom_point() +
  xlab("") + ylab("") + 
  theme(
    ### make the x axis text vertical 
    axis.text.x = element_text(angle = 90), 
    ### remove the legend 
    legend.position = "none"
  ) 


Exercise B4: geom_density_ridges()


Plot the distribution of PISA rating and expenditure_on_education_pct_gdp in 2015 with geom_density_ridges()

  • Remove x and y lab and the legend using theme()
  • Set the color of the plot by measure

Note: You have to edit the df first, so run the following code

### Run this code 
df %>% 
  filter(time == 2015) %>% 
  distinct(country, .keep_all = T) %>% 
  dplyr::select(expenditure_on_education_pct_gdp, rating) %>% 
  pivot_longer(names_to = "measure", values_to = "value", 1:2) %>% 
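  ### standardize each measure separately so that both are on a comparable scale 
  mutate(value = as.numeric(scale(value)), .by = measure) -> ex.B4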

ex.B4 %>% 
  ggplot(aes(x = value, y = measure, fill = measure)) + 
  geom_density_ridges(alpha = .5) + 
  xlab("") + ylab("") + 
  theme(legend.position = "none")
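
The scale() function used above standardizes a variable: it subtracts the mean and divides by the standard deviation, so that variables measured in very different units can be shown on one axis. The following base R sketch checks this on a made-up vector.

```r
# Made-up values for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Standardize by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() turns it into a plain vector
z_scale <- as.numeric(scale(x))

# Both approaches give the same result
all.equal(z_manual, z_scale)
```

After standardizing, the values have a mean of 0 and a standard deviation of 1.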


Exercise B5: geom_point() & geom_errorbar()


Plot the mean and standard deviation of expenditure_on_education_pct_gdp for Italy and Germany using geom_point() and geom_errorbar()

  • Edit point and line color and type by country
  • Set the colors you prefer
  • Change the axis limits

### Run this code 

df %>% 
  filter(country == "ITA" | country == "DEU") %>% 
  summarize(
    mean = mean(expenditure_on_education_pct_gdp), 
    sd = sd(expenditure_on_education_pct_gdp), 
    .by = "country"
  ) -> ex.b5
ex.b5 %>% 
  
  ggplot(aes(x = country, y = mean, color = country, linetype = country)) + 
  geom_point(size = 4) + 
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = .1, size = .6) + 
  ylim(3, 6)
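
Some columns of this dataset contain missing values (NA), as the data preview shows. If the summarized column has any, mean() and sd() return NA unless na.rm = TRUE is set. Whether this affects expenditure_on_education_pct_gdp for these two countries is not shown here, so the following is only a sketch on a made-up vector.

```r
# Made-up values with one missing observation
expenditure <- c(4.1, 4.5, NA, 4.9)

# Without na.rm, a single NA makes the result NA
mean_with_na <- mean(expenditure)

# With na.rm = TRUE, missing values are dropped before computing
mean_clean <- mean(expenditure, na.rm = TRUE)
sd_clean <- sd(expenditure, na.rm = TRUE)

c(mean_with_na = mean_with_na, mean_clean = mean_clean, sd_clean = sd_clean)
```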


Challenge


Plot the distribution of the data on the plot of exercise B5

  • Edit the color and linetype as you prefer and reduce the opacity of the dots in geom_jitter()

### Run this code
df %>% 
  filter(country == "ITA" | country == "DEU") %>% 
  summarize(
    mean = mean(expenditure_on_education_pct_gdp), 
    sd = sd(expenditure_on_education_pct_gdp), 
    .by = "country"
  ) -> challenge
challenge %>% 
  
  ggplot(aes(x = country, y = mean, color = country, linetype = country)) + 
  geom_point(size = 4) + 
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = .1, size = .6) + 
  ylim(3, 6) + 
  geom_jitter(data = df %>% filter(country %in% c("ITA", "DEU")), aes(x = country, y = expenditure_on_education_pct_gdp), alpha = .4) + 
  scale_color_manual(values = c("orange", "deepskyblue4"))


Exercises: Day 2


Exercise C1: geom_density_ridges()


Plot the distribution of government_health_expenditure_pct_gdp and mortality_rate_infant for the following countries: Australia, United States and Canada. Use geom_density_ridges()

  • The name of the country should be on the y axis and each country should have two density lines within the same panel
  • government_health_expenditure_pct_gdp and mortality_rate_infant should be scaled
  • Edit the colors and transparency
  • Look at the exercises of day 1 for how to manipulate the data
  • Add a title and remove labels

### Run this code 
df %>% 
  dplyr::select(country, mortality_rate_infant, government_health_expenditure_pct_gdp) %>% 
  mutate(across(2:3, ~scale(.x))) %>% 
  filter(country %in% c("AUS", "USA", "CAN")) %>% 
  pivot_longer(names_to = "measure", values_to = "value", 2:3) -> ex.c1
ex.c1 %>% 
  
  ggplot(aes(x = value, y = country, fill = measure)) + 
  geom_density_ridges(alpha = .8) + 
  scale_fill_manual(values = c("orange", "deepskyblue3")) + 
  ggtitle("Distribution plot") + 
  xlab("") + ylab("")


Exercise C2: theme() and annotate()


In the plot of exercise C1, remove the legend using theme() and manually write it inside the plot using annotate()

  • The name of the country should be on the y axis and each country should have two density lines within the same panel
  • government_health_expenditure_pct_gdp and mortality_rate_infant should be scaled
  • Remove the legend, edit the colors and transparency
  • Look at the exercises of day 1 for how to manipulate the data
  • Add a title and remove labels

ex.c1 %>% 
  
  ggplot(aes(x = value, y = country, fill = measure)) + 
  geom_density_ridges(alpha = .8) + 
  scale_fill_manual(values = c("orange", "deepskyblue4")) + 
  ggtitle("Distribution plot") + 
  xlab("") + ylab("") + 
  
  ### C2
  theme(legend.position = "none") + 
  annotate("label", x = 1, y = 4, label = "Government health expenditure pct gdp", fill = "orange") + 
  annotate("label", x = 1, y = 4.25, label = "Mortality rate infant", fill = "deepskyblue4") 


Exercise C3: geom_histogram() & ggarrange()


Plot the distribution of all continuous variables in 2018 with geom_histogram() and display them in a single plot using ggarrange()

  • Remove ylab
  • Reduce the size of the x label

### Run this code 
df %>% 
  filter(time == 2018) %>%  
  dplyr::select(2:16, 20) %>% 
  mutate_all(~scale(.x)) -> ex.c3
ggarrange(
  
  ex.c3 %>% 
    ggplot(aes(x = expenditure_on_education_pct_gdp)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = mortality_rate_infant)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = gini_index)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = gdp_per_capita_ppp)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = inflation_consumer_prices)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = intentional_homicides)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = unemployment)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = gross_fixed_capital_formation)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = population_density)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = suicide_mortality_rate)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = tax_revenue)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = taxes_on_income_profits_capital)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = government_health_expenditure_pct_gdp)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = urban_population_pct_total)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = rating)) + 
    geom_histogram() + theme(axis.title.x = element_text(size = 7)) + ylab(" ")
)
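
The solution above repeats the same three lines for every variable. As a possible shortcut, the plots can be built in a loop and passed to ggarrange() through its plotlist argument. The sketch below uses mtcars and four of its columns as stand-ins for ex.c3, so that it runs on its own; .data[[v]] looks up the column whose name is stored in v.

```r
library(ggplot2)
library(ggpubr)

# Stand-in variable names (with ex.c3 one could use names(ex.c3) instead)
vars <- c("mpg", "hp", "wt", "qsec")

# Build one histogram per variable
hist_list <- lapply(vars, function(v) {
  ggplot(mtcars, aes(x = .data[[v]])) +
    geom_histogram(bins = 10) +
    xlab(v) + ylab(" ") +
    theme(axis.title.x = element_text(size = 7))
})

# Arrange all plots in one grid
ggarrange(plotlist = hist_list)
```

Changing the fill color or the label size then requires editing a single line instead of fifteen.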


Exercise C4: annotate_figure()


Use the plot of exercise C3 and change the color of the histograms, and annotate the final arranged grid adding a caption using annotate_figure()


ggarrange(
  
  ex.c3 %>% 
    ggplot(aes(x = expenditure_on_education_pct_gdp)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = mortality_rate_infant)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = gini_index)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = gdp_per_capita_ppp)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = inflation_consumer_prices)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = intentional_homicides)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = unemployment)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = gross_fixed_capital_formation)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = population_density)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = suicide_mortality_rate)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = tax_revenue)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = taxes_on_income_profits_capital)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = government_health_expenditure_pct_gdp)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = urban_population_pct_total)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" "), 
  ex.c3 %>% 
    ggplot(aes(x = rating)) + 
    geom_histogram(fill = "indianred3") + theme(axis.title.x = element_text(size = 7)) + ylab(" ")
) %>% 
  
  annotate_figure(bottom = text_grob("Exercise C3. Distribution plot", color = "black", face = "italic", size = 10, hjust = 1.5))


Exercise C5: geom_bar()


Plot the mean value of rating by country with a bar chart. Use geom_bar(stat = "identity")

  • Each bar should have a different color based on country
  • x axis labels should be vertical and bold
  • Remove the legend
  • Add a title

### Run this code first 
df %>% 
  filter(sex == "TOT") %>% 
  summarize(
    mean = round(mean(rating),2), 
    .by = c("country")
  ) -> ex.c5
ex.c5 %>%   
  ggplot(aes(x = country, y = mean, fill = country)) + 
  geom_bar(stat = "identity", color = "black") + 
  theme(axis.text.x = element_text(size = 8, angle = 90, face = "bold"), 
        legend.position = "none")  + 
  xlab(" ") + ylab(" ") + 
  ggtitle("Mean value of rating")
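
geom_col() is a shorthand for geom_bar(stat = "identity"): both draw bars whose height is taken directly from the y variable. The sketch below compares the two on a made-up table (the country labels and values are invented for illustration).

```r
library(ggplot2)

# Made-up summary table
toy <- data.frame(
  country = c("A", "B", "C"),
  mean = c(480, 505, 520)
)

p_bar <- ggplot(toy, aes(x = country, y = mean)) +
  geom_bar(stat = "identity")

p_col <- ggplot(toy, aes(x = country, y = mean)) +
  geom_col()

# Both layers compute the same bar heights
all.equal(layer_data(p_bar)$y, layer_data(p_col)$y)
```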


Exercise C6: geom_bar() & annotate()


Use the plot of exercise C5 and annotate the mean value as a label above each column using annotate()

  • Use ex.c5

ex.c5 %>% 
  
  ggplot(aes(x = country, y = mean, fill = country)) + 
  geom_bar(stat = "identity", color = "black", width = .5) + 
  theme(axis.text.x = element_text(size = 8, angle = 90, face = "bold"), 
        legend.position = "none")  + 
  xlab(" ") + ylab(" ") + 
  ggtitle("Mean value of rating") + 
  annotate("label", 
           x = ex.c5$country, 
           y = ex.c5$mean, 
           label = ex.c5$mean, 
           size = 2)


Exercise C7: geom_point() & geom_line()


Plot the change in the trend of unemployment in Germany, France, Ireland and Italy over time using geom_point() and geom_line().

The trend for each country should be displayed in a separate panel using facet_wrap().

  • Edit the color as you prefer and remove unnecessary information to facilitate visualization (e.g., the legend, in this case)

### Run this code first 
df %>% 
  filter(country %in% c("ITA", "DEU", "FRA", "IRL")) %>% 
  ### reframe() is used instead of summarize() because each group returns several rows 
  reframe(
    unemployment = unemployment, 
    unemployement.mean = round(mean(unemployment),0),
    sd.un = sd(unemployment), 
    suic.mean = round(mean(suicide_mortality_rate),0),
    sd.su = sd(suicide_mortality_rate), 
    .by = c("country", "time")
  ) -> ex.c7
ex.c7 %>% 
  
  ggplot(aes(x = time, y = unemployment, fill = country, color = country, pch = country, linetype = country)) + 
  geom_point(size = 3) + 
  geom_line(size = 1) + 
  theme(legend.position = "none") + 
  scale_color_manual(values = c("indianred3", "darkgreen", "orange", "deepskyblue4")) + 
  facet_wrap(~country) 


Exercise C8: ggarrange() & annotate_figure()


In a plot similar to the one from exercise C7, annotate the mean of unemployment at each time point for each country. Use ggarrange() here.

  • Add a y-axis label, a title, and a caption using annotate_figure().

### Run this code first 
df %>% 
  filter(country %in% c("ITA", "DEU", "FRA", "IRL")) %>% 
  ### reframe() is used instead of summarize() because each group returns several rows 
  reframe(
    unemployment = unemployment, 
    unemployement.mean = round(mean(unemployment),0),
    sd.un = sd(unemployment), 
    suic.mean = round(mean(suicide_mortality_rate),0),
    sd.su = sd(suicide_mortality_rate), 
    .by = c("country", "time")
  ) -> ex.c8
ggarrange(
  
ex.c8 %>% 
  filter(country == "ITA") %>% 
  ggplot(aes(x = time, y = unemployment)) + 
  geom_point(size = 3, pch = 16, color = "indianred3") + 
  geom_line(size = 1, linetype = 1, color = "indianred3") + 
  theme(legend.position = "none") + 
  annotate("label", x = ex.c8$time[ex.c8$country == "ITA"], y = ex.c8$unemployement.mean[ex.c8$country == "ITA"], label = ex.c8$unemployement.mean[ex.c8$country == "ITA"], color = "black", size = 2) + 
  annotate("label", x = 2017, y = 8, label = "Italy", color = "indianred3", size = 3) + 
  xlab(" ") + ylab(" "), 

ex.c8 %>% 
  filter(country == "DEU") %>% 
  ggplot(aes(x = time, y = unemployment)) + 
  geom_point(size = 3, pch = 16, color = "darkgreen") + 
  geom_line(size = 1, linetype = 3, color = "darkgreen") + 
  theme(legend.position = "none") + 
  annotate("label", x = ex.c8$time[ex.c8$country == "DEU"], y = ex.c8$unemployement.mean[ex.c8$country == "DEU"], label = ex.c8$unemployement.mean[ex.c8$country == "DEU"], color = "black", size = 2) + 
  annotate("label", x = 2017, y = 10, label = "Germany", color = "darkgreen", size = 3) + 
  xlab(" ") + ylab(" "), 

ex.c8 %>% 
  filter(country == "FRA") %>% 
  ggplot(aes(x = time, y = unemployment)) + 
  geom_point(size = 3, pch = 16, color = "orange") + 
  geom_line(size = 1, linetype = 16, color = "orange") + 
  theme(legend.position = "none") + 
  annotate("label", x = ex.c8$time[ex.c8$country == "FRA"], y = ex.c8$unemployement.mean[ex.c8$country == "FRA"], label = ex.c8$unemployement.mean[ex.c8$country == "FRA"], color = "black", size = 2) + 
  annotate("label", x = 2017, y = 8.5, label = "France", color = "orange", size = 3) + 
  xlab(" ") + ylab(" "), 

ex.c8 %>% 
  filter(country == "IRL") %>% 
  ggplot(aes(x = time, y = unemployment)) + 
  geom_point(size = 3, pch = 16, color = "deepskyblue4") + 
  geom_line(size = 1, linetype = 11, color = "deepskyblue4") + 
  theme(legend.position = "none") + 
  annotate("label", x = ex.c8$time[ex.c8$country == "IRL"], y = ex.c8$unemployement.mean[ex.c8$country == "IRL"], label = ex.c8$unemployement.mean[ex.c8$country == "IRL"], color = "black", size = 2) + 
  annotate("label", x = 2017, y = 14, label = "Ireland", color = "deepskyblue4", size = 3) + 
  xlab(" ") + ylab(" ")

) %>% 
  
  annotate_figure(
    top = text_grob(label = "Trend of unemployment", face = "bold", hjust = 1.7), 
    left = text_grob(label = "unemployment", face = "italic", rot = 90, size = 10)
  )
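
The exercise also asks for a caption, which the code above does not add. annotate_figure() takes a bottom argument alongside top and left. A minimal sketch, assuming made-up data and a placeholder caption (both hypothetical, not taken from the workshop data):

```r
library(ggplot2)
library(ggpubr)

# Toy data (hypothetical values, only to make the sketch runnable)
toy <- data.frame(
  time = rep(2015:2018, times = 2),
  country = rep(c("A", "B"), each = 4),
  unemployment = c(12, 11, 10, 9, 5, 5, 4, 4)
)

# One panel per country; blank axis labels so the figure-level labels take over
p_a <- ggplot(subset(toy, country == "A"), aes(x = time, y = unemployment)) +
  geom_point() + geom_line() + xlab(" ") + ylab(" ")
p_b <- ggplot(subset(toy, country == "B"), aes(x = time, y = unemployment)) +
  geom_point() + geom_line() + xlab(" ") + ylab(" ")

# Combine the panels, then add title, y-axis label and caption to the whole figure
fig <- annotate_figure(
  ggarrange(p_a, p_b, ncol = 2),
  top = text_grob("Trend of unemployment", face = "bold"),
  left = text_grob("unemployment", face = "italic", rot = 90, size = 10),
  bottom = text_grob("Caption: placeholder text", size = 8, hjust = 1, x = 1)
)
fig
```

Here hjust = 1 together with x = 1 right-aligns the caption; drop both to center it.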


Exercise C9: geom_smooth() & geom_jitter()


Plot the change in taxes_on_income_profits_capital based on gdp_per_capita_ppp in 2018 using geom_smooth().

Plot the distribution of the data using geom_jitter().


### Run this code
df %>% 
  mutate(taxes_on_income_profits_capital = as.numeric(scale(taxes_on_income_profits_capital)),
         gdp_per_capita_ppp = as.numeric(scale(gdp_per_capita_ppp))) %>% 
  filter(time == 2018) -> ex.c9
ex.c9 %>% 
  
  ggplot(aes(x = gdp_per_capita_ppp, y = taxes_on_income_profits_capital)) + 
  geom_jitter(color = "orange", alpha = .4) + 
  geom_smooth(method = "lm", color = "deepskyblue4", fill = "lightgrey") 
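
For reference, scale() centers a variable on its mean and divides it by its standard deviation, and it returns a one-column matrix rather than a plain vector. Because the code above standardizes before filtering, the z-scores are relative to all years, not only 2018. A minimal sketch on made-up numbers (hypothetical values):

```r
# Made-up values, only for illustration
x <- c(10, 20, 30, 40, 50)

z_matrix <- scale(x)        # one-column matrix carrying centering/scaling attributes
z <- as.numeric(z_matrix)   # plain numeric vector

# Same result computed by hand: (x - mean) / sd
z_manual <- (x - mean(x)) / sd(x)

is.matrix(z_matrix)         # scale() output is a matrix
all.equal(z, z_manual)      # both computations agree
```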


Exercise C10: geom_boxplot() & geom_hline()


Using geom_boxplot(), plot rating over time.

  • The distribution of the data should be overlaid on the boxplot using geom_jitter().
  • Draw a horizontal line to indicate the mean level of rating using geom_hline().
  • Add this code before ggplot(): df %>% mutate(time = as.factor(time))

df %>% 
  mutate(time = as.factor(time)) %>% 
  ggplot(aes(x = time, y = gdp_per_capita_ppp)) + 
  geom_boxplot(aes(fill = time), alpha = .1) + 
  geom_jitter(aes(color = time), size = 2, alpha = .4) + 
  geom_hline(yintercept = mean(df$gdp_per_capita_ppp, na.rm = TRUE), size = 1, color = "indianred3", linetype = 2) + 
  scale_color_manual(values = c("#F3A360", "#EFB27E", "#DFCEBA", "#BEC8CC", "#7FAFD2", "#5082B0")) + 
  scale_fill_manual(values = c("#F3A360", "#EFB27E", "#DFCEBA", "#BEC8CC", "#7FAFD2", "#5082B0")) + 
  xlab(" ") + ylab("GDP per capita") + 
  theme(legend.position = "none") 
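
One thing to watch with the horizontal line: mean() returns NA as soon as the variable contains a missing value, and geom_hline() then has nothing to draw. A minimal sketch on made-up data (hypothetical values and column names):

```r
library(ggplot2)

# Made-up data with one missing value (hypothetical)
toy <- data.frame(
  time = factor(rep(c(2015, 2018), each = 4)),
  value = c(1, 2, 3, NA, 4, 5, 6, 7)
)

mean(toy$value)                                # NA, because one value is missing
overall_mean <- mean(toy$value, na.rm = TRUE)  # mean of the observed values only

p <- ggplot(toy, aes(x = time, y = value)) +
  # hide boxplot outliers, since the jitter layer already shows every point
  geom_boxplot(alpha = .1, outlier.shape = NA, na.rm = TRUE) +
  # jitter horizontally only, so the y values stay exact
  geom_jitter(width = .15, height = 0, alpha = .6, na.rm = TRUE) +
  geom_hline(yintercept = overall_mean, linetype = 2, color = "indianred3")
p
```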


Challenge


  1. Plot where the PISA ratings were collected in 2018 on a map using geom_polygon()

# Run this code first
map_data("world") -> world
country_mapping <- c(
  AUS = "Australia", AUT = "Austria", BEL = "Belgium", CAN = "Canada", CHL = "Chile",
  COL = "Colombia", CRI = "Costa Rica", CZE = "Czech Republic", DNK = "Denmark", EST = "Estonia",
  FIN = "Finland", FRA = "France", DEU = "Germany", GRC = "Greece", HUN = "Hungary", ISL = "Iceland",
  IRL = "Ireland", ISR = "Israel", ITA = "Italy", JPN = "Japan", KOR = "South Korea", LVA = "Latvia",
  LTU = "Lithuania", LUX = "Luxembourg", MEX = "Mexico", NLD = "Netherlands", NZL = "New Zealand",
  NOR = "Norway", POL = "Poland", PRT = "Portugal", SVK = "Slovakia", SVN = "Slovenia", ESP = "Spain",
  # map_data("world") labels these two regions "USA" and "UK"
  SWE = "Sweden", CHE = "Switzerland", TUR = "Turkey", USA = "USA", GBR = "UK",
  BRA = "Brazil"
)

# Add full country name column
df %>%
  mutate(full_country_name = country_mapping[country]) -> df 

# Keep the 2018 totals, then join them to the map data by region
df %>%
  filter(sex == "TOT" & time == 2018) %>%
  rename(region = full_country_name) %>%
  dplyr::select(region, rating) -> df.map.2018

common.cols <- intersect(names(df.map.2018), names(world))
left_join(world, df.map.2018, by = common.cols) -> df.map.2018
df.map.2018 %>%
  mutate(rating = if_else(is.na(rating), 0, rating)) %>%
  ggplot(aes(x = long, y = lat, group = group, fill = rating)) +
  geom_polygon(color = "black") +
  scale_fill_gradient2(low = "white", mid = "lightgrey", high = "orange") +
  ggtitle("PISA rating in 2018") +
  labs(x = "", y = "") 
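
The join matches on region, so any country whose name is spelled differently in the map data silently ends up without a rating. A quick check, sketched with a handful of example names (chosen for illustration; map_data() needs the maps package installed):

```r
library(ggplot2)

# Region names as they appear in the world map data
world_regions <- unique(map_data("world")$region)

# A few example names to test (a subset, chosen for illustration)
candidate_names <- c("Germany", "France", "United States", "USA",
                     "United Kingdom", "UK")

# Names that would find no match in a join on `region`
unmatched <- setdiff(candidate_names, world_regions)
unmatched
```

Running the same setdiff() on the full vector of country names before joining shows which entries need to be renamed.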


References


Useful links: