1 Introduction

Data, computing power and open-source software have never been as readily available as they are today. Programming, mostly in R or Python, is now one of the most requested skills in actuarial job ads, and it almost always includes producing effective visualizations.

This is understandable, since actuaries need to be able to clearly explain complex technical information.

In this course, you will learn how to create effective and elegant plots with the R package ggplot2.

While R provides several systems for producing graphs, ggplot2 is currently the most popular, flexible and elegant graphical framework available, and its popularity is still increasing, as shown by the monthly downloads of ggplot2 and tidyverse (a bundle of packages that includes ggplot2).

This tutorial assumes just a basic working knowledge of R and includes an introduction to data wrangling.

1.1 Setup: R, RStudio, Packages and Course Materials

Before we start, there are a few things you need to download and install on your device, namely: R, RStudio, packages/libraries and the Course Materials.

R is the programming language that we will be using. If you haven’t done so already, download and install R by following the link:
https://cloud.r-project.org/
RStudio is an integrated development environment (IDE) that makes working with R a much more pleasant experience, and it is completely free of charge. To install it go to:
https://www.rstudio.com/products/rstudio/download/#download
Packages and Libraries
Packages are collections of R functions, data, and compiled code whereas libraries are the directories where these packages are stored.
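
To make the distinction concrete, the short sketch below (base R only) lists the library directories on your machine and checks whether a package is installed without loading it. The package 'stats' is used here only because it ships with R; the variable name is arbitrary.

```r
# Directories (libraries) where R looks for installed packages
lib_dirs <- .libPaths()
lib_dirs

# Check whether a package is installed, without loading it
# ('stats' ships with every R installation, so this should be TRUE)
stats_installed <- requireNamespace("stats", quietly = TRUE)
stats_installed
```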

Install the packages and load the libraries below - it is assumed that these are loaded during this tutorial:

# run in case you want to install the package
install.packages("tidyverse") #bundle of packages including ggplot2
install.packages("scales")
install.packages("ggpubr")
install.packages("plotly")
install.packages("hexbin")

# Load the libraries
library(tidyverse)
library(readxl) #although it's part of 'tidyverse' it is not a core package
library(scales)
library(ggpubr)
library(plotly)

If a package is installed but not loaded, its functions can still be accessed by preceding the function with the package’s name e.g.: dplyr::group_by(). This syntax also helps to identify which package a function comes from, and avoids masking issues in case its name is not unique to a specific package.
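
As a small illustration of masking, the sketch below calls two unrelated functions that share the name filter(): one from dplyr and one from the stats package that ships with R. The data are made up for this example, and dplyr is assumed to be installed.

```r
# Made-up data, for illustration only
df_demo <- data.frame(x = 1:5)

# dplyr::filter() keeps the rows that meet a condition
rows_kept <- dplyr::filter(df_demo, x > 3)
rows_kept

# stats::filter() is an unrelated function that applies a linear
# filter (here a 2-term moving average) to a numeric series
smoothed <- stats::filter(1:5, rep(1 / 2, 2))
smoothed
```

Writing the package name in front removes any doubt about which of the two is being called.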

Course Materials
Finally, the course material includes R scripts and data sources. To download these, click on Course Materials, unzip the file, and set the root folder as your working directory.

You may use the file ‘canvas.R’, which loads the necessary libraries, and then set the working directory from the menu ‘Session - Set Working Directory - To Source File Location’.

And that is it, you are all set to start!

2 Data wrangling

It is highly unlikely that your data comes in a format that is ready to be used by a plotting function without the need to manipulate it first. Some examples of data manipulations include filtering, aggregating, renaming variables, merging tables, changing the type of variable, etc…

Therefore, we start with step zero of plotting - data wrangling. We will focus mainly on the core tidyverse libraries dplyr and tidyr.

From dplyr we will explore the following manipulations:

filter() to select or exclude rows based on specified conditions
mutate() to add or change existing columns
group_by() and summarise() to summarize data, similar to Excel pivot tables
left_join() to merge two tables, think of it as a super-vlookup
select() to select and/or remove columns
rename() to rename existing columns
arrange() for sorting the data-frame based on specified columns

See https://dplyr.tidyverse.org/ for more details.

From tidyr we’ll explore

pivot_longer() to gather 2 or more columns under one column (ggplot2 takes full advantage of long data formats)
pivot_wider() to spread one column into several columns

See https://tidyr.tidyverse.org/ for more details.

Some tips:
- To get help for a given function, type a question mark followed by the function’s name e.g.: ?dplyr::group_by.
- To see all declared arguments for a given function use args() e.g.: args(tidyr::pivot_longer).
- To assign a value to a variable use <- rather than = (both work, but <- is the R convention); for example, x <- 10 assigns the value 10 to x. In RStudio use the shortcut Alt+-.
- To insert a comment start the line with a hash sign #, for example # this is just a comment. In RStudio use the shortcut Ctrl+Shift+C to comment or uncomment the selected lines.

For this section we will use the data from the Excel file ‘technical_results.xlsx’, available in the folder ‘data’. It contains Swiss Re’s publicly available technical results for Liability, Property and Motor for treaty years 2011 to 2020.

To import the data from this Excel file we use the function read_excel() from the package readxl, which is part of the tidyverse bundle.

df_tr <- readxl::read_excel(
  path = "data/technical_results.xlsx", #path of the file, relative to the working directory
  sheet = "tech_result") #name of the sheet

df_tr

## # A tibble: 30 × 6
##    line         ty premium   paid  case  ibnr
##    <chr>     <dbl>   <dbl>  <dbl> <dbl> <dbl>
##  1 liability  2011   1729.  938.   118.  122.
##  2 liability  2012   2217. 1537.   205.  232.
##  3 liability  2013   2276. 1186.   197.  316.
##  4 liability  2014   2752. 1808.   362.  636.
##  5 liability  2015   2560. 1566.   495.  716.
##  6 liability  2016   3355. 1547.   937. 1369.
##  7 liability  2017   3400. 1312.   805. 1724.
##  8 liability  2018   3597.  747.   729. 2467.
##  9 liability  2019   4389.  288.   462. 3824.
## 10 liability  2020   2008.   20.6  134. 1757.
## # … with 20 more rows

Description of each column:

line: is the ‘line of business’ - there are three: Liability, Property and Motor
ty: stands for ‘Treaty Year’
premium: refers to ‘Gross Written Premium’
paid: are ‘Paid Losses’
case: are the ‘Case Reserves’ for each open claim
ibnr: stands for Incurred But Not Reported losses

The sum of paid + case = reported loss, also known as incurred loss.
The sum of reported loss + ibnr = total loss, also known as ultimate loss.

2.1 Filter

The first verb we’ll discuss is filter(). As the name suggests, we use it to include or exclude rows based on given criteria.

For example, if we are only interested in the last Treaty Year i.e.: 2020 then we can use the following code to apply the filter:

filter(.data = df_tr, 
       ty == 2020)

## # A tibble: 3 × 6
##   line         ty premium  paid  case  ibnr
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 liability  2020   2008.  20.6  134. 1757.
## 2 motor      2020   1184. 151.   148.  732.
## 3 property   2020   5937. 415.  1048. 3667.

The first argument asks for the data frame, and the second argument contains the condition we want to apply. In this case, we set ty (treaty year) equal to 2020.

The equal operator is ==, and not =. Hence, filter(.data = df_tr, ty = 2020) would return an error.

It was not necessary to refer to the column’s name ty with quotes, which is a nice feature of tidy evaluation (if the column’s name includes a space or special characters then it needs to be surrounded by back ticks).

The argument .data = can be omitted if it occupies the first position i.e.: filter(df_tr, ty == 2020) works fine.

If .data is the second argument then it needs to be named e.g.: filter(ty == 2020, .data = df_tr) works but filter(ty == 2020, df_tr) does not.

We can also use the pipe operator %>% using the shortcut Ctrl+Shift+M as:

df_tr %>% filter(ty == 2020)

The pipe operator %>% takes the preceding variable and inserts it into the first argument of the function that follows. This is a great feature since we can avoid messy nesting when applying several steps in one go, making the code much more readable.
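
The equivalence between the piped and the direct call can be checked on a small made-up tibble. This is only a sketch: the object names toy, piped and nested are invented for the illustration.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(line = c("motor", "property", "motor"),
              ty   = c(2019, 2020, 2020))

piped  <- toy %>% filter(ty == 2020)  # 'toy' becomes the first argument of filter()
nested <- filter(toy, ty == 2020)     # the same call written directly

# Both forms should give the same result
identical(piped, nested)
```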

To exclude Liability, use the Not Equal operator. An Excel user may think this is given by <> but it is not.
The Not Equal operator in R is given by !=. The exclamation sign (called “bang”) is the Not operator!

df_tr %>% 
  filter(line != "liability") %>%  #'line' not equal to 'liability'
  head() #`head()` returns the first 6 observations of a data-frame, so that the output is not too long

## # A tibble: 6 × 6
##   line     ty premium  paid  case  ibnr
##   <chr> <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 motor  2011   2001. 1757.  88.7  38.7
## 2 motor  2012   2581. 2199. 151.  131. 
## 3 motor  2013   2421. 2210. 147.   60.8
## 4 motor  2014   2201. 1967. 160.   72.6
## 5 motor  2015   2550. 2355. 235.  120. 
## 6 motor  2016   3173. 2654. 479.  286.

The code above can be read as:

take the ‘df_tr’ data-frame
and then keep all rows where line is not “liability”
and then return the first 6 rows.

Compare it with when the functions are nested without indentation:

head(filter(df_tr, line != "liability"))

Even in this simple example there is a big difference in readability between the two.

The string “liability” needs to be quoted since it is a value stored in the data-frame and not the name of a column. The reason we did not quote 2020 in the prior example is that ty (treaty year) is a numeric variable (double) while line is a character variable.

To return the last 2 years we can use the Or operator | as:

df_tr %>% 
  filter(ty == 2019 | ty == 2020)

## # A tibble: 6 × 6
##   line         ty premium   paid  case  ibnr
##   <chr>     <dbl>   <dbl>  <dbl> <dbl> <dbl>
## 1 liability  2019   4389.  288.   462. 3824.
## 2 liability  2020   2008.   20.6  134. 1757.
## 3 motor      2019   2707.  989.   487. 1095.
## 4 motor      2020   1184.  151.   148.  732.
## 5 property   2019   8025. 3415.  2277. 1746.
## 6 property   2020   5937.  415.  1048. 3667.

What if, for some strange reason, we only wanted even years? It is not very efficient to separate each year with the Or operator | as:

df_tr %>% 
  filter(
    ty == 2012 | 
    ty == 2014 | 
    ty == 2016 | 
    ty == 2018 | 
    ty == 2020)

A better approach is to use the %in% operator which selects all items that are in a vector:

df_tr %>% 
  filter(ty %in% c(2012, 2014, 2016, 2018, 2020))

Or even better:

df_tr %>% 
  filter(ty %in% seq(2012, 2020, by = 2))

which does the job with less typing.
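
Another option, sketched below on made-up treaty years, avoids listing the years altogether by using the modulo operator %% of base R, which returns the remainder of a division. The names toy and even_years are invented for this example.

```r
library(dplyr)

# Made-up treaty years, for illustration only
toy <- tibble(ty = 2011:2020)

# ty %% 2 is the remainder of ty divided by 2, which is 0 for even years
even_years <- toy %>%
  filter(ty %% 2 == 0)

even_years$ty
```

Unlike the vector of years, this condition keeps working if more treaty years are added to the data later.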

Inequalities can be used to filter numerical values such as: df_tr %>% filter(ty >= 2016) or df_tr %>% filter(ty > 2015).

To add two or more filters, just separate each condition by a comma, for example:

df_tr %>% 
  filter(ty %in% c(2013, 2015, 2020),
         line == "property",
         premium > 6000)

## # A tibble: 2 × 6
##   line        ty premium  paid  case  ibnr
##   <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 property  2013   6569. 3251.  40.0 -3.66
## 2 property  2015   6612. 3533. 137.  51.8
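
Conditions separated by commas must all hold, which is the same as combining them with the And operator &. The sketch below, on made-up data, compares the two forms and contrasts them with the Or operator |.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(line    = c("property", "property", "motor", "motor"),
              ty      = c(2019, 2020, 2019, 2020),
              premium = c(8000, 5900, 2700, 1200))

# Commas and & are interchangeable: every condition must be TRUE
with_commas <- toy %>% filter(line == "property", premium > 6000)
with_and    <- toy %>% filter(line == "property" & premium > 6000)
identical(with_commas, with_and)

# With | a row is kept when at least one condition is TRUE
either <- toy %>% filter(line == "property" | premium > 2000)
nrow(either)
```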

2.2 Mutate (add or change columns)

To add or change a column for a given data-frame we use the mutate() function from dplyr.

For example, the code below adds a column with reported values (reported = paid + case) and another with ultimate values (ultimate = reported + ibnr):

df_tr <- df_tr %>% 
  mutate(
    reported = paid + case,
    ultimate = reported + ibnr)

head(df_tr)

## # A tibble: 6 × 8
##   line         ty premium  paid  case  ibnr reported ultimate
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 liability  2011   1729.  938.  118.  122.    1057.    1179.
## 2 liability  2012   2217. 1537.  205.  232.    1742.    1974.
## 3 liability  2013   2276. 1186.  197.  316.    1383.    1699.
## 4 liability  2014   2752. 1808.  362.  636.    2170.    2805.
## 5 liability  2015   2560. 1566.  495.  716.    2061.    2777.
## 6 liability  2016   3355. 1547.  937. 1369.    2484.    3853.

In the example above, the two columns reported and ultimate were added by separating each with a comma. What is great is that ultimate uses the reported column which was created in the same mutate() step!

To move the reported column after case use relocate():

df_tr <- df_tr %>%
  relocate(reported, .after = case)

head(df_tr)

## # A tibble: 6 × 8
##   line         ty premium  paid  case reported  ibnr ultimate
##   <chr>     <dbl>   <dbl> <dbl> <dbl>    <dbl> <dbl>    <dbl>
## 1 liability  2011   1729.  938.  118.    1057.  122.    1179.
## 2 liability  2012   2217. 1537.  205.    1742.  232.    1974.
## 3 liability  2013   2276. 1186.  197.    1383.  316.    1699.
## 4 liability  2014   2752. 1808.  362.    2170.  636.    2805.
## 5 liability  2015   2560. 1566.  495.    2061.  716.    2777.
## 6 liability  2016   3355. 1547.  937.    2484. 1369.    3853.

mutate() can also be used to change an existing column. For example, to change the data type for ty (treaty year) from double to integer:

df_tr <- df_tr %>%
  mutate(ty = as.integer(ty))

head(df_tr)

## # A tibble: 6 × 8
##   line         ty premium  paid  case reported  ibnr ultimate
##   <chr>     <int>   <dbl> <dbl> <dbl>    <dbl> <dbl>    <dbl>
## 1 liability  2011   1729.  938.  118.    1057.  122.    1179.
## 2 liability  2012   2217. 1537.  205.    1742.  232.    1974.
## 3 liability  2013   2276. 1186.  197.    1383.  316.    1699.
## 4 liability  2014   2752. 1808.  362.    2170.  636.    2805.
## 5 liability  2015   2560. 1566.  495.    2061.  716.    2777.
## 6 liability  2016   3355. 1547.  937.    2484. 1369.    3853.

It is often necessary to change the variable type from character (string) to factor (a categorical variable whose levels have a defined order). Columns such as ‘countries’ or ‘line of business’ are usually stored as character and, as such, are sorted in alphabetical order. But most of the time we need to sort categorical columns by other criteria; for example, months should be sorted in chronological rather than alphabetical order.

Sorting the data-frame by line of business shows that the order is ‘liability’, ‘motor’ and last ‘property’:

df_tr %>% 
  filter(ty == 2020) %>% 
  arrange(line)

## # A tibble: 3 × 8
##   line         ty premium  paid  case reported  ibnr ultimate
##   <chr>     <int>   <dbl> <dbl> <dbl>    <dbl> <dbl>    <dbl>
## 1 liability  2020   2008.  20.6  134.     155. 1757.    1911.
## 2 motor      2020   1184. 151.   148.     299.  732.    1031.
## 3 property   2020   5937. 415.  1048.    1463. 3667.    5130.

arrange(line) sorts the data-frame by the column line from smallest to largest i.e.: in ascending alphabetical order. But the largest line by premium is ‘property’.

Changing the variable type of the column line from character to factor allows us to define the order as we please e.g. as: ‘property’, ‘liability’ and ‘motor’.
The levels of the factor are specified, in the desired order, in a vector:

df_tr <- df_tr %>%
  mutate(line = 
           factor(line,
                  levels = c("property", #first entry first level
                             "liability", #second
                             "motor")))  #third

df_tr %>% 
  filter(ty == 2020) %>% 
  arrange(line)

## # A tibble: 3 × 8
##   line         ty premium  paid  case reported  ibnr ultimate
##   <fct>     <int>   <dbl> <dbl> <dbl>    <dbl> <dbl>    <dbl>
## 1 property   2020   5937. 415.  1048.    1463. 3667.    5130.
## 2 liability  2020   2008.  20.6  134.     155. 1757.    1911.
## 3 motor      2020   1184. 151.   148.     299.  732.    1031.

The function factor() is part of base R. However, there are several functions from the (core tidyverse) package forcats to help us handle factors. To know more, follow the link https://forcats.tidyverse.org/.
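
As one example of what forcats offers, the sketch below uses fct_reorder() to order the levels by total premium instead of typing them by hand. The premiums are made up for the illustration, and the object name toy is arbitrary.

```r
library(dplyr)
library(forcats)

# Made-up premiums, for illustration only
toy <- tibble(line    = c("liability", "motor", "property",
                          "liability", "motor", "property"),
              premium = c(2000, 1200, 5900, 4400, 2700, 8000))

# Order the levels of 'line' by total premium, largest first
toy <- toy %>%
  mutate(line = fct_reorder(line, premium, .fun = sum, .desc = TRUE))

levels(toy$line)
```

This approach can be handy when the ranking of the categories depends on the data and may change between data sets.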

2.3 Group and Summarise

If you have ever used a pivot table in Excel then you have grouped and summarized data.

In Excel, to display the total premium per line of business, we would use a pivot table as:

The same (and much more) can be done with group_by() and summarise():

df_tr %>% 
  group_by(line) %>% 
  summarise(premium = sum(premium))

## # A tibble: 3 × 2
##   line      premium
##   <fct>       <dbl>
## 1 property   65712.
## 2 liability  28282.
## 3 motor      23923.

group_by() is equivalent to adding variables to the rows field of an Excel pivot table, while summarise() (which can also be spelled with a ‘z’) performs the aggregation. In this case we summed the premium, with sum(premium), for each line of business.
It is similar to dragging line into the ‘Rows’ field of pivot table and premium into the ‘Values’ field with ‘Summarize values By: Sum’.

To obtain the average values of the last 5 years use the function mean(), and see how it compares to a pivot table:

Average Premium per treaty year from 2016 to 2020:

df_tr %>% 
  filter(ty >= 2016) %>% 
  group_by(ty) %>% 
  summarize(avg_prem = mean(premium))

## # A tibble: 5 × 2
##      ty avg_prem
##   <int>    <dbl>
## 1  2016    4367.
## 2  2017    4168.
## 3  2018    4201.
## 4  2019    5040.
## 5  2020    3043.

The equivalent of a pivot table in Excel:

Exercise

Load the data-frame ‘df_tr.rds’ from the folder ‘data’ and name it ‘df_tr’, then:
Show the percentage of premium for each line of business excluding treaty year 2020.

Complete the exercise by overwriting the gaps indicated with ‘______’ in the code below:

# loading data 'df_tr.rds'
# (note the slash bar that separates folder/files is tilted to the right '/')
df_tr <- read_rds(file = "data/______")

df_tr %>%
  filter(ty ______ 2020) %>%
  ______(line) %>%
  ______(premium = sum(______)) %>% 
  mutate(prem_pct = premium/______(premium))

Solution

# loading data 'df_tr.rds'
# (note the slash bar that separates folder/files is tilted to the right '/')
df_tr <- read_rds(file = "data/df_tr.rds")

df_tr %>%
  filter(ty != 2020) %>%
  group_by(line) %>%
  summarise(premium = sum(premium)) %>% 
  mutate(prem_pct = premium/sum(premium))

## # A tibble: 3 × 3
##   line      premium prem_pct
##   <fct>       <dbl>    <dbl>
## 1 property   59775.    0.549
## 2 liability  26274.    0.242
## 3 motor      22739.    0.209
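
The last step of the solution works because, with a single grouping variable, the result of summarise() is no longer grouped, so sum(premium) inside mutate() is the total over all lines. A minimal sketch with made-up numbers (the names toy and shares are invented for this example):

```r
library(dplyr)

# Made-up premiums, for illustration only
toy <- tibble(line    = c("property", "property", "motor", "motor"),
              premium = c(60, 90, 30, 20))

shares <- toy %>%
  group_by(line) %>%
  summarise(premium = sum(premium)) %>%      # one row per line, no longer grouped
  mutate(prem_pct = premium / sum(premium))  # sum() now runs over all lines

shares
sum(shares$prem_pct)  # the shares add up to 1
```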

To aggregate several columns under one summarise() just separate each aggregation with a comma:

df_tr %>% 
  group_by(ty) %>% 
  summarise(
    premium = sum(premium),
    paid = sum(paid),
    case = sum(case),
    reported = sum(reported),
    ibnr = sum(ibnr),
    ultimate = sum(ultimate)
    )

## # A tibble: 10 × 7
##       ty premium  paid  case reported  ibnr ultimate
##    <int>   <dbl> <dbl> <dbl>    <dbl> <dbl>    <dbl>
##  1  2011   9343. 6697.  254.    6951.  172.    7123.
##  2  2012  12005. 7514.  495.    8009.  368.    8377.
##  3  2013  11266. 6648.  384.    7032.  373.    7405.
##  4  2014  11123. 6492.  604.    7095.  728.    7824.
##  5  2015  11722. 7454.  867.    8321.  888.    9208.
##  6  2016  13102. 8386. 1751.   10137. 1712.   11848.
##  7  2017  12505. 9507. 2128.   11636. 2275.   13910.
##  8  2018  12602. 6984. 2378.    9362. 3331.   12692.
##  9  2019  15121. 4692. 3226.    7918. 6665.   14583.
## 10  2020   9129.  587. 1330.    1917. 6156.    8073.

What if the data-frame contained hundreds of numerical columns and we wanted to aggregate them all? No problem!
The function across() applies the aggregation across several variables at once.
Its two main arguments are .cols and .fns:

args(across)

## function (.cols = everything(), .fns = NULL, ..., .names = NULL) 
## NULL

.cols defines which columns to summarize. The default is everything(), i.e.: all columns.
.fns takes the function that performs the aggregation (sum, mean, max…). If the function needs no extra arguments, pass its name on its own (e.g. sum, without a tilde). To pass extra arguments, write it as a formula starting with a tilde ~, as in ~ sum(., na.rm = TRUE).

df_tr %>% 
  group_by(ty) %>%
  summarise(
    across(
      .cols = where(is.numeric),
      .fns = sum)
    )

This yields the same output as before.

Or in case we use any sum() arguments:

df_tr_ty <- df_tr %>%
  group_by(ty) %>% 
  summarise(
    across(
      .cols = where(is.numeric),
      .fns = ~ sum(., na.rm = TRUE)  #if using arguments inside the function precede the function with '~'
      )
    )

df_tr_ty

## # A tibble: 10 × 7
##       ty premium  paid  case reported  ibnr ultimate
##    <int>   <dbl> <dbl> <dbl>    <dbl> <dbl>    <dbl>
##  1  2011   9343. 6697.  254.    6951.  172.    7123.
##  2  2012  12005. 7514.  495.    8009.  368.    8377.
##  3  2013  11266. 6648.  384.    7032.  373.    7405.
##  4  2014  11123. 6492.  604.    7095.  728.    7824.
##  5  2015  11722. 7454.  867.    8321.  888.    9208.
##  6  2016  13102. 8386. 1751.   10137. 1712.   11848.
##  7  2017  12505. 9507. 2128.   11636. 2275.   13910.
##  8  2018  12602. 6984. 2378.    9362. 3331.   12692.
##  9  2019  15121. 4692. 3226.    7918. 6665.   14583.
## 10  2020   9129.  587. 1330.    1917. 6156.    8073.

The tilde ~ creates a formula, which dplyr converts into a function. This is necessary because we added arguments inside sum().
The dot . represents any column that meets the condition in .cols while na.rm = TRUE instructs the sum to remove all values with NA (in case they are present).

across() can also be used with other dplyr functions such as mutate().
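
A minimal sketch of across() inside mutate(), on made-up data: it divides two loss columns by premium in one step. The .names argument is used so that the original columns are kept and the new ones get the suffix '_rt'; the suffix and the object names are choices made for this example.

```r
library(dplyr)

# Made-up data, for illustration only
toy <- tibble(ty      = c(2019L, 2020L),
              premium = c(1000, 2000),
              paid    = c(400, 100),
              case    = c(200, 300))

ratios <- toy %>%
  mutate(
    across(
      .cols  = c(paid, case),
      .fns   = ~ . / premium,   # '.' is the column currently being processed
      .names = "{.col}_rt"      # new columns: 'paid_rt' and 'case_rt'
    )
  )

ratios
```

Without .names, the columns paid and case would be overwritten by the ratios.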

2.4 Left join - a super Vlookup

It is not uncommon to merge two different data sets to create a plot. In Excel we would use a ‘lookup’ function or, if you’re familiar with Power Query, ‘Merge Queries’. With dplyr you may use any of the mutating join functions. We briefly showcase the most popular one: left_join().

left_join() joins two data sets x and y keeping all records of the x (left) data set.

If any record in the x data is not present in the y data, the columns coming from y are filled with the missing value NA for that record.
If any record in the y data is not present in the x data, that record is not merged.
The by argument specifies which columns are used to match the two tables.

Let’s see a simple example. Below are two tibbles (the tidyverse version of data-frame) created with the tibble() function.

df1 <- tibble(color = c("red", "yellow", "blue", "grey"),
              type = c("warm", "warm", "cold", "neutral"))

df2 <- tibble(color = c("red", "yellow", "blue", "black"),
              hex = c("#FF0000", "#FFFF00", "#0000FF", "#000000"),
              rgb = c("255,0,0", "(255,255,0)", "(0,0,255)", "(0, 0, 0)"))
# see df1
df1

## # A tibble: 4 × 2
##   color  type   
##   <chr>  <chr>  
## 1 red    warm   
## 2 yellow warm   
## 3 blue   cold   
## 4 grey   neutral

# see df2
df2

## # A tibble: 4 × 3
##   color  hex     rgb        
##   <chr>  <chr>   <chr>      
## 1 red    #FF0000 255,0,0    
## 2 yellow #FFFF00 (255,255,0)
## 3 blue   #0000FF (0,0,255)  
## 4 black  #000000 (0, 0, 0)

# left_join() df1 and df2
left_join(x = df1, 
          y = df2, 
          by = "color")

## # A tibble: 4 × 4
##   color  type    hex     rgb        
##   <chr>  <chr>   <chr>   <chr>      
## 1 red    warm    #FF0000 255,0,0    
## 2 yellow warm    #FFFF00 (255,255,0)
## 3 blue   cold    #0000FF (0,0,255)  
## 4 grey   neutral <NA>    <NA>

Since ‘grey’ is not in df2, the columns hex and rgb are populated with NA, while the color ‘black’ in df2 was not merged into the result.

Try to use right_join(), inner_join() to see how each behaves.
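
As a starting point, here is a sketch of both joins using cut-down copies of df1 and df2 (the rgb column is left out to keep the output short).

```r
library(dplyr)

# Cut-down copies of df1 and df2 from the example above
df1 <- tibble(color = c("red", "yellow", "blue", "grey"),
              type  = c("warm", "warm", "cold", "neutral"))
df2 <- tibble(color = c("red", "yellow", "blue", "black"),
              hex   = c("#FF0000", "#FFFF00", "#0000FF", "#000000"))

# inner_join() keeps only the colors present in both tables
in_both <- inner_join(df1, df2, by = "color")
in_both

# right_join() keeps all rows of df2: 'black' appears with type = NA,
# while 'grey' (only in df1) is dropped
all_df2 <- right_join(df1, df2, by = "color")
all_df2
```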

If the names of the columns used for matching in the by argument are not the same (for example: ‘color’ for df1 and ‘color_name’ for df2) then, use the following syntax: by = c("color" = "color_name").

To match two tables based on several columns, just separate each column (or column pair) inside the vector passed to the by argument with a comma, e.g.: by = c("line", "ty").
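
The sketch below combines both cases: two made-up tables are matched on two columns whose names differ between the tables. The table names premiums and losses and the column names lob and year are hypothetical.

```r
library(dplyr)

# Made-up tables, for illustration only
premiums <- tibble(line    = c("motor", "motor", "property"),
                   ty      = c(2019, 2020, 2020),
                   premium = c(2700, 1200, 5900))
losses <- tibble(lob  = c("motor", "property"),
                 year = c(2020, 2020),
                 paid = c(150, 415))

# Match 'line' with 'lob' and 'ty' with 'year'
joined <- left_join(premiums, losses,
                    by = c("line" = "lob", "ty" = "year"))
joined
```

The result keeps the column names of the left table, and motor 2019 gets NA in paid because it has no match in losses.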

2.5 Long and wide data-frames

Often we need to store several columns under one variable - the function pivot_longer(), from the tidyr package, does exactly that.

For example, we may need to keep the values for premium, ultimate, paid, case, ibnr under one variable called ‘type of value’. We will see that ggplot2 takes full advantage of such data-frames.

One column stores the values while another column stores the names of the columns that were gathered.
By default these two new columns are named ‘value’ and ‘name’, as in the output below; to change these names use the arguments values_to and names_to.

df_tr %>%
  pivot_longer(cols = premium:ultimate)

## # A tibble: 180 × 4
##    line         ty name     value
##    <fct>     <int> <chr>    <dbl>
##  1 liability  2011 premium  1729.
##  2 liability  2011 paid      938.
##  3 liability  2011 case      118.
##  4 liability  2011 reported 1057.
##  5 liability  2011 ibnr      122.
##  6 liability  2011 ultimate 1179.
##  7 liability  2012 premium  2217.
##  8 liability  2012 paid     1537.
##  9 liability  2012 case      205.
## 10 liability  2012 reported 1742.
## # … with 170 more rows

The cols argument defines which columns are to be gathered: premium:ultimate means all columns from premium to ultimate.
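
A short sketch of names_to and values_to on a made-up tibble; the new column names 'tov' and 'amount' are arbitrary choices for this example.

```r
library(dplyr)
library(tidyr)

# Made-up data, for illustration only
toy <- tibble(line    = c("motor", "property"),
              premium = c(1200, 5900),
              paid    = c(150, 415))

toy_long <- toy %>%
  pivot_longer(cols      = premium:paid,
               names_to  = "tov",      # instead of the default 'name'
               values_to = "amount")   # instead of the default 'value'

toy_long
```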

Often, it is better to select the columns not to pivot. This is because the order of the columns, or even their presence, may change from one data-frame to another.
Simply place the NOT operator ! before the vector with the columns not to pivot:

df_tr_long <- df_tr %>% 
  pivot_longer(cols = !c(line, ty), #pivot all columns except 'line' and 'ty'
               names_to = "tov") #'tov' stands for 'type of value'

df_tr_long

## # A tibble: 180 × 4
##    line         ty tov      value
##    <fct>     <int> <chr>    <dbl>
##  1 liability  2011 premium  1729.
##  2 liability  2011 paid      938.
##  3 liability  2011 case      118.
##  4 liability  2011 reported 1057.
##  5 liability  2011 ibnr      122.
##  6 liability  2011 ultimate 1179.
##  7 liability  2012 premium  2217.
##  8 liability  2012 paid     1537.
##  9 liability  2012 case      205.
## 10 liability  2012 reported 1742.
## # … with 170 more rows

We can also go from a ‘long data-frame’ to ‘wide data-frame’ with pivot_wider():

df_tr_long %>% 
  pivot_wider(names_from = tov,
              values_from = value)

## # A tibble: 30 × 8
##    line         ty premium   paid  case reported  ibnr ultimate
##    <fct>     <int>   <dbl>  <dbl> <dbl>    <dbl> <dbl>    <dbl>
##  1 liability  2011   1729.  938.   118.    1057.  122.    1179.
##  2 liability  2012   2217. 1537.   205.    1742.  232.    1974.
##  3 liability  2013   2276. 1186.   197.    1383.  316.    1699.
##  4 liability  2014   2752. 1808.   362.    2170.  636.    2805.
##  5 liability  2015   2560. 1566.   495.    2061.  716.    2777.
##  6 liability  2016   3355. 1547.   937.    2484. 1369.    3853.
##  7 liability  2017   3400. 1312.   805.    2117. 1724.    3841.
##  8 liability  2018   3597.  747.   729.    1477. 2467.    3944.
##  9 liability  2019   4389.  288.   462.     750. 3824.    4574.
## 10 liability  2020   2008.   20.6  134.     155. 1757.    1911.
## # … with 20 more rows

Exercise

Add a column to ‘df_tr’

with: ultimate divided by premium - and name it ultimate_rt (for Ultimate Loss Ratio)

Next, create a new data-frame named ‘df_tr_long’:

gather all columns under one variable, except for ty and line, and name the column that stores the name of the variables as tov for ‘type of value’

# adding the ultimate_rt column to 'df_tr'
df_tr <- ______ %>% 
  ______(ultimate_rt = ______ / ______) 
  
# check the result for df_tr, first 6 rows
head(df_tr)

df_tr_long <- ______ %>%   
  ______(cols = !c(______, ______), #all except these columns
               names_to = ______)

# View df_tr_long data
View(df_tr_long)

Solution

Part 1:

# adding the ultimate_rt column to df_tr
df_tr <- df_tr %>% 
  mutate(ultimate_rt = ultimate / premium) 

# check the result for df_tr, first 6 rows
head(df_tr)

## # A tibble: 6 × 9
##   line         ty premium  paid  case reported  ibnr ultimate ultimate_rt
##   <fct>     <int>   <dbl> <dbl> <dbl>    <dbl> <dbl>    <dbl>       <dbl>
## 1 liability  2011   1729.  938.  118.    1057.  122.    1179.       0.682
## 2 liability  2012   2217. 1537.  205.    1742.  232.    1974.       0.890
## 3 liability  2013   2276. 1186.  197.    1383.  316.    1699.       0.747
## 4 liability  2014   2752. 1808.  362.    2170.  636.    2805.       1.02 
## 5 liability  2015   2560. 1566.  495.    2061.  716.    2777.       1.08 
## 6 liability  2016   3355. 1547.  937.    2484. 1369.    3853.       1.15

Part 2:

df_tr_long <- df_tr %>%   
  pivot_longer(cols = !c(ty, line), #all except these columns
               names_to = "tov")

# View df_tr_long data
View(df_tr_long)

2.6 Select and Rename columns

The function select() from dplyr is used to select or remove columns.

df_tr %>% 
  select(ty, premium, ultimate) %>% 
  head()

## # A tibble: 6 × 3
##      ty premium ultimate
##   <int>   <dbl>    <dbl>
## 1  2011   1729.    1179.
## 2  2012   2217.    1974.
## 3  2013   2276.    1699.
## 4  2014   2752.    2805.
## 5  2015   2560.    2777.
## 6  2016   3355.    3853.

Preceding the variable with a minus - will remove it.

df_tr %>% 
  select(-ultimate, -ibnr) %>% 
  head()

## # A tibble: 6 × 7
##   line         ty premium  paid  case reported ultimate_rt
##   <fct>     <int>   <dbl> <dbl> <dbl>    <dbl>       <dbl>
## 1 liability  2011   1729.  938.  118.    1057.       0.682
## 2 liability  2012   2217. 1537.  205.    1742.       0.890
## 3 liability  2013   2276. 1186.  197.    1383.       0.747
## 4 liability  2014   2752. 1808.  362.    2170.       1.02 
## 5 liability  2015   2560. 1566.  495.    2061.       1.08 
## 6 liability  2016   3355. 1547.  937.    2484.       1.15

The function rename() is used to rename variables.

df_tr %>% 
  rename(incurred = reported) %>% 
  head()

## # A tibble: 6 × 9
##   line         ty premium  paid  case incurred  ibnr ultimate ultimate_rt
##   <fct>     <int>   <dbl> <dbl> <dbl>    <dbl> <dbl>    <dbl>       <dbl>
## 1 liability  2011   1729.  938.  118.    1057.  122.    1179.       0.682
## 2 liability  2012   2217. 1537.  205.    1742.  232.    1974.       0.890
## 3 liability  2013   2276. 1186.  197.    1383.  316.    1699.       0.747
## 4 liability  2014   2752. 1808.  362.    2170.  636.    2805.       1.02 
## 5 liability  2015   2560. 1566.  495.    2061.  716.    2777.       1.08 
## 6 liability  2016   3355. 1547.  937.    2484. 1369.    3853.       1.15

2.7 Format numbers

The function number() from the scales package is great for formatting values of data-frames. The same package offers an equivalent for ggplot2 axis labels, number_format() (superseded by label_number() in recent versions of scales).

To make sure we are using the right function, we’ll precede it with the package’s name i.e.: scales::number() (in case there is any other package with a function named ‘number’).

The arguments are:

args(scales::number)

## function (x, accuracy = NULL, scale = 1, prefix = "", suffix = "", 
##     big.mark = " ", decimal.mark = ".", style_positive = c("none", 
##         "plus"), style_negative = c("hyphen", "minus", "parens"), 
##     scale_cut = NULL, trim = TRUE, ...) 
## NULL

A brief description of some of the arguments:

x the column or number vector to format
accuracy sets the number of decimal places: 1 shows no decimal place; 0.1 shows 1 decimal place; 0.01 shows two decimal places and so on.
scale is the number to multiply x, e.g.: scale = 1 / 1000000 divides all values by a million, while scale = 100 multiplies all values by 100, the default is 1 as shown in args(scales::number).
prefix and suffix are the characters to display before and after the value
big.mark the thousand separator

The code below changes the ultimate_rt column from ‘df_tr’ to percentage with one decimal place:

df_tr %>%
  mutate(ultimate_rt = 
           scales::number(x = ultimate_rt,
                          accuracy = 0.1,
                          scale = 100,
                          suffix = "%"
                        )) %>%
  head()

## # A tibble: 6 × 9
##   line         ty premium  paid  case reported  ibnr ultimate ultimate_rt
##   <fct>     <int>   <dbl> <dbl> <dbl>    <dbl> <dbl>    <dbl> <chr>      
## 1 liability  2011   1729.  938.  118.    1057.  122.    1179. 68.2%      
## 2 liability  2012   2217. 1537.  205.    1742.  232.    1974. 89.0%      
## 3 liability  2013   2276. 1186.  197.    1383.  316.    1699. 74.7%      
## 4 liability  2014   2752. 1808.  362.    2170.  636.    2805. 102.0%     
## 5 liability  2015   2560. 1566.  495.    2061.  716.    2777. 108.5%     
## 6 liability  2016   3355. 1547.  937.    2484. 1369.    3853. 114.8%

However, the variable type for ultimate_rt is no longer a double but a character.

We may need to keep the original numerical column - to filter data for example. So adding a column instead of changing the existing one is often a good idea.

Also, instead of scales::number() we can use scales::percent() and avoid the need to scale and add a suffix.

df_tr %>% 
  mutate(txt_ultimate_rt = 
           scales::percent(x = ultimate_rt,
                           accuracy = 0.1))
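
To see why keeping the numerical column matters, here is a minimal sketch on a made-up tibble (‘df_demo’ and its values are assumptions for illustration, not the course data). The formatted text column is used for display, while the original column can still be used to filter:

```r
library(dplyr)

# made-up ratios, for illustration only
df_demo <- tibble(ty = 2011:2013,
                  ultimate_rt = c(0.682, 0.890, 1.020))

df_out <- df_demo %>%
  # add a formatted text column and keep the numeric one
  mutate(txt_ultimate_rt = scales::percent(ultimate_rt, accuracy = 0.1)) %>%
  # filtering on numbers still works because ultimate_rt is intact
  filter(ultimate_rt > 1)

df_out
```

Filtering on the text column instead would compare strings, which rarely gives the intended result.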

While we have only touched the surface of data wrangling - the examples given here already provide several useful techniques to tackle real world challenges.

There are several other powerful packages (such as data.table or even base R) to import and transform data. The key take-away is that almost every plot starts with data wrangling.

3 The Basics of ggplot2

We are now ready to start producing plots with ggplot2 - a core tidyverse package.

See https://ggplot2.tidyverse.org/ for the official documentation.

ggplot2 works by adding layers and components to create a visual.

The basic commands to create a plot are:

data: the argument for the data-frame underlying the plot
mapping: the argument to map variables from data, e.g.: to the x and y axes
- all mapping is done with aesthetics function aes() e.g.: mapping = aes(x = year, y = loss_ratio) maps the columns ‘year’ and ‘loss_ratio’ to the x and y axis respectively
geom_*(): family of functions to add geometric objects to the plot such as dots, lines, bars, densities…

There are many more options and commands but the following can be used as a basic blue-print:

ggplot(
  data = <DATA>,
  mapping = aes(
      x = <X-AXIS>,
      y = <Y-AXIS>)
      ) +
  geom_*()

The first argument of ggplot() is data and the second is mapping.
The x and y variables are mapped with the aes() function in the mapping argument, and any variable in aes() can be called simply by typing the name of the column without reference to the underlying data-frame.
A geom_*() function such as geom_point() or geom_line() defines what to draw in the plot area.

All ggplot2 components are combined with the + sign.
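
The blue-print can be filled in with any data-frame. As a minimal sketch, the code below uses ‘mtcars’, a data set that ships with R (it is not part of the course materials), mapping the car weight ‘wt’ to the x-axis and the fuel consumption ‘mpg’ to the y-axis:

```r
library(ggplot2)

# mtcars ships with R: wt = weight, mpg = miles per gallon
p_demo <- ggplot(data = mtcars,
                 mapping = aes(x = wt, y = mpg)) +
  geom_point()   # draw one dot per row of the data-frame

p_demo
```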

If it all sounds a little esoteric, don’t worry, a couple of examples should help demystify these concepts.

3.1 A basic plot

The data-frame ‘df_gender_height_weight.rds’ contains a fictional random sample of the gender, height and weight of high school students.

df_ghw <- readr::read_rds(file = "data/df_gender_height_weight.rds")

head(df_ghw)

## # A tibble: 6 × 3
##   gender height weight
##   <chr>   <dbl>  <dbl>
## 1 m        171.   76.2
## 2 m        175.   81.4
## 3 m        174.   79.3
## 4 m        171.   75.1
## 5 m        174.   80.6
## 6 m        172.   80.1

The code below assigns the data-frame ‘df_ghw’ to ggplot() and allocates height to the x-axis and weight to the y-axis:

ggplot(data = df_ghw, 
       mapping = aes(
         x = height, 
         y = weight))

The plot has height and weight as the x and y axis respectively but nothing is displayed in the plot area - this is because we did not specify any geom.

A couple of comments before adding a geom:

First: as long as the first argument in ggplot() is data and the second is mapping, the argument names can be omitted.

The same applies for the x and y arguments of the aes() function.

The code above can be rewritten as:

ggplot(df_ghw, 
       aes(height, weight))

Second: the data frame, being the first argument of the function ggplot(), can be inserted with the pipe operator:

gg_ghw1 <- df_ghw %>%
  ggplot(aes(height, weight))

The object ‘gg_ghw1’ is a ggplot() with data and mapping defined, so let’s add a geom to it:

gg_ghw1 + geom_point()

We can use the arguments color and size in geom_point() to change its appearance:

gg_ghw1 +
  geom_point(size = 3, 
             color = "red")

3.1.1 Alpha (transparency)

A simple yet important feature is the alpha argument, which controls the transparency of the geoms.

There is some over-plotting in this chart, i.e.: too many dots in the center, which makes it difficult to see them properly. The plot can be improved by making the dots transparent:

gg_ghw1 +
  geom_point(size = 3,
             color = "red",
             alpha = 0.2)

alpha accepts values between zero and one; alpha = 0 means fully transparent i.e.: invisible and alpha = 1 means fully opaque i.e.: no transparency applied (which is the default).

Exercise

Recreate the plot from scratch, including importing ‘df_gender_height_weight.rds’ but with:

dots size 1,
color skyblue4 and
alpha of 0.5

df_ghw <- readr::read_rds(file = "data/______")

______ %>% 
  ggplot(aes(______, ______)) +
  geom_point(______ = ______, 
             ______ = ______,
             ______ = ______)

Solution

df_ghw <- readr::read_rds(file = "data/df_gender_height_weight.rds")

df_ghw %>% 
  ggplot(aes(height, weight)) +
  geom_point(size = 1, 
             color = "skyblue4",
             alpha = 0.5)

3.2 Basic behaviour

One for all: When a data-frame or a mapping is defined inside the function ggplot() it is transferred to all geoms:

ggplot(head(df_ghw),
       aes(height, weight)) +
  
  geom_point(size = 5,
             color = "red") +
  geom_line()

First come first served: The order in which the geoms appear matters - when geom_line() is after geom_point(), the line is placed on top of the dots. Compare it with the next plot, where geom_line() is before geom_point():

ggplot(head(df_ghw),
       aes(height, weight)) +
  geom_line() +
  geom_point(size = 5,
             color = "red")

Each geom for itself: The data-frame and aesthetics can also be defined inside a geom.

ggplot() +
  geom_point(data = df_ghw,
             aes(height, weight))

However, in this case - the data and mapping are only accessible to the geom that contains them.
Adding geom_line() to the code above results in an error because there is no data or mapping that it can access.

Thanks but no thanks: When a data-frame is present both in ggplot() and in a geom_*(), the geom uses its own data and ignores the data in ggplot(), for example:

ggplot(data = df_ghw,
       aes(x = height,
           y = weight)) +
  geom_point() + #geom_point() gets the data from the function ggplot()
  geom_line(data = tibble(height = c(165, 175), #geom_line() has its own data-frame
                          weight = c(70, 80)), 
            color = "red", 
            size = 4,
            alpha = 0.5)

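The data in geom_line() was created with the tibble() function and overrides the data ‘df_ghw’ supplied in ggplot(). The mapping aes(height, weight) is still inherited, which is why the new data-frame uses the same column names.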
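
This behaviour can be checked with layer_data(), a ggplot2 helper that returns the data actually used by a given layer. The sketch below uses two small made-up data-frames (‘df_main’ and ‘df_line’ are assumptions for illustration only):

```r
library(ggplot2)

# made-up data, for illustration only
df_main <- data.frame(height = c(160, 170, 180),
                      weight = c(60, 70, 80))
df_line <- data.frame(height = c(165, 175),
                      weight = c(70, 80))

p_demo <- ggplot(df_main, aes(height, weight)) +
  geom_point() +                            # layer 1: inherits df_main
  geom_line(data = df_line, color = "red")  # layer 2: uses its own data

n_points <- nrow(layer_data(p_demo, 1))  # 3 rows, from df_main
n_line   <- nrow(layer_data(p_demo, 2))  # 2 rows, from df_line
```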

Setting a bad example
One way (a very bad one, by the way) to color the dots by gender is:

  ggplot(mapping = aes(height, weight)) +
  
  geom_point(data = filter(df_ghw, gender == "m"),
             color = "red") + 
  
  geom_point(data = filter(df_ghw, gender == "f"),
             color = "blue")

This approach should be avoided at all costs!

Setting a good example
One of the many great features of ggplot is that it takes full advantage of long-data. In this case, male and female are under one column - gender - which is perfect for ggplot.

Mapping color to gender inside aes() is the way to go:

gg_ghw2 <- ggplot(df_ghw, 
       aes(height, weight,
           color = gender)) +
  geom_point(size = 1,
             alpha = 0.5)

gg_ghw2

Since gender is mapped to color inside aes() it is colored automatically!
Moreover, it also appears as a legend. You can avoid the legend with geom_point(show.legend = FALSE).
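
The difference between mapping a color (inside aes()) and setting a color (outside aes()) can be sketched on a small made-up data-frame (‘df_demo’ is an assumption for illustration only):

```r
library(ggplot2)

# made-up data, for illustration only
df_demo <- data.frame(height = c(160, 165, 175, 180),
                      weight = c(55, 60, 75, 82),
                      gender = c("f", "f", "m", "m"))

# mapped: the color depends on the data and a legend is drawn
p_mapped <- ggplot(df_demo, aes(height, weight, color = gender)) +
  geom_point()

# set: one constant color for all dots and no legend
p_set <- ggplot(df_demo, aes(height, weight)) +
  geom_point(color = "red")

cols_mapped <- unique(layer_data(p_mapped)$colour)  # two colors, one per gender
cols_set    <- unique(layer_data(p_set)$colour)     # "red"
```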

Since the mapping is done in ggplot() - it will be applied to every geom:

gg_ghw2 +
  geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

geom_smooth() chooses its method automatically: with 1,000 or more observations, as is the case here, it fits a gam (an extension of glms useful to capture non-linear relations), otherwise it uses loess. It also adds a shaded band showing the 95% confidence interval.

With color mapped to gender we have one regression for males and another for females. Compare it with:

gg_ghw1 +
  geom_point(color = "indianred") +
  geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

which applies one regression to all data.

As a side note - the type of regression is defined by the argument method and confidence level by the argument level. In this case, a linear regression would be a more parsimonious approach:

gg_ghw2 +
  geom_smooth(method = "lm", #apply a linear model instead of the gam
              level = 0.99) #99% confidence band; 'se = FALSE' would remove the band altogether

## `geom_smooth()` using formula 'y ~ x'
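
To confirm that the mapping of color drives the number of regressions, the sketch below counts the groups fitted by geom_smooth() on simulated data (the data-frame ‘df_demo’ and the relationship between height and weight are made up for illustration):

```r
library(ggplot2)

# simulated data, for illustration only
set.seed(1)
df_demo <- data.frame(height = rep(seq(160, 180, by = 2), times = 2),
                      gender = rep(c("f", "m"), each = 11))
df_demo$weight <- 0.9 * df_demo$height - 90 + rnorm(22)

# color mapped to gender: one regression per gender
p_by_gender <- ggplot(df_demo, aes(height, weight, color = gender)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# no grouping variable: one regression for all data
p_overall <- ggplot(df_demo, aes(height, weight)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# layer 2 is geom_smooth(); count the fitted groups
n_fits_gender  <- length(unique(layer_data(p_by_gender, 2)$group))  # 2
n_fits_overall <- length(unique(layer_data(p_overall, 2)$group))    # 1
```

Passing formula = y ~ x explicitly also silences the message printed by geom_smooth().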

3.3 Saving Mr. Plot

To save a ggplot you can use the Export menu from the ‘Plots’ pane, which has the option to save the plot as an image, as a pdf or to copy it to the clipboard.

Another solution, which allows for finer control, is to use the ggsave() function:

ggsave(filename = "my_plots/nice_plot.png", #location and name to save the plot 
       plot = gg_ghw2, #plot object to save
       width = 18,
       height = 10,
       dpi = 200 # image resolution (default is 300) 
       )

filename = "my_plots/nice_plot.png" assumes there is a folder named ‘my_plots’ in the working directory. If that’s the case, the plot gg_ghw2 will be saved with the name ‘nice_plot.png’.
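
If the folder may not exist yet, it can be created first. The sketch below is one way to do it; it writes to a temporary folder (an assumption made here so the example does not touch your working directory) and uses a plot built from ‘mtcars’, a data set that ships with R:

```r
library(ggplot2)

p_demo <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()

# a temporary folder stands in for 'my_plots' in this sketch
out_dir <- file.path(tempdir(), "my_plots")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

out_file <- file.path(out_dir, "nice_plot.png")

ggsave(filename = out_file,
       plot = p_demo,
       width = 18,
       height = 10,
       units = "cm",  # width and height are in inches unless units is given
       dpi = 200)

file.exists(out_file)
```

Being explicit about units avoids surprises, since 18 inches and 18 centimeters give very different images.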

4 Control the appearance (basics)

The nice thing about ggplot is that you only need to provide the essential information for it to create a plot. This is because there are default values and algorithms that take care of the rest.

However, most of the time we want to change these - either to create a more effective visual and/or to align with corporate branding. In the next sections we discuss some ways to change the appearance of a ggplot.

4.1 Scale the x and y axes

We can change the scale of the x and y axes with scale_x_*() and scale_y_*() functions respectively.

Below are some of the most common ones:

# scale_x_continuous()
# scale_x_discrete()
# scale_x_datetime()
# scale_x_log10()

all with an equivalent for the y-axis.

In our example, both the x and y axis are continuous, so we use scale_x_continuous() and scale_y_continuous().

#check its arguments
args(scale_x_continuous)

## function (name = waiver(), breaks = waiver(), minor_breaks = waiver(), 
##     n.breaks = NULL, labels = waiver(), limits = NULL, expand = waiver(), 
##     oob = censor, na.value = NA_real_, trans = "identity", guide = waiver(), 
##     position = "bottom", sec.axis = waiver()) 
## NULL

For example, to change the title of the x-axis:

gg_ghw2 + scale_x_continuous(name = "Height in cm")

When labeling numerical values, it is important to include the units in which the values are measured.

To remove the x-axis title set the name to NULL, i.e.: name = NULL.

The argument breaks controls where the axis is marked. For example, breaks = seq(162, 180, by = 2) places a break every 2 centimeters from 162 to 180:

gg_ghw2 +
  scale_x_continuous(name = "Height in cm",
                     breaks = seq(162, 180, by = 2))

The smallest value for height is 163cm, hence the first label is 164cm and not 162cm. To force the axis to cover a given range, for example between 162 and 180, use the argument limits:

gg_ghw2 +
  scale_x_continuous(name = "Height in cm",
                     breaks = seq(162, 180, by = 2),
                     limits = c(162, 180))

Or to zoom into a particular area:

gg_ghw2 +
  scale_x_continuous(name = "Height in cm",
                     breaks = seq(160, 180, by = 1),
                     limits = c(170, 175))

## Warning: Removed 855 rows containing missing values (geom_point).
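
The warning appears because limits set on a scale discard the observations that fall outside the range. An alternative, not used elsewhere in this tutorial, is coord_cartesian(), which only changes the visible area and keeps all the data. A minimal sketch on made-up data (‘df_demo’ is an assumption for illustration):

```r
library(ggplot2)

# made-up data, for illustration only
df_demo <- data.frame(height = 160:180,
                      weight = 60:80)

p_demo <- ggplot(df_demo, aes(height, weight)) +
  geom_point()

# limits on the scale: values outside 170-175 become NA and are dropped
p_limits <- p_demo +
  scale_x_continuous(limits = c(170, 175))

# coord_cartesian(): only the view changes, the data stay intact
p_zoom <- p_demo +
  coord_cartesian(xlim = c(170, 175))

n_dropped_limits <- sum(is.na(layer_data(p_limits)$x))  # 15 values outside the range
n_dropped_zoom   <- sum(is.na(layer_data(p_zoom)$x))    # 0
```

The distinction matters when a statistic such as geom_smooth() is part of the plot: with scale limits the regression is fitted only on the remaining data, while with coord_cartesian() it is fitted on all data.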

The argument labels is used to format the numbers of the axes.
The good news is that this can be done with the scales package, which is what we used to format numbers for data-frames.
The functions to format the axes have the suffix *_format(), and work exactly like their data-frame counterparts (except that the x argument is omitted as there is no need to specify a variable).

gg_ghw2 +
  scale_x_continuous(name = "Height",
                     breaks = seq(160, 180, by = 2.5),
                     limits = c(160, 180),
                     labels = scales::number_format(accuracy = 0.1,
                                                    suffix = " cm"))

It is better to show the unit “cm” in the axis’ title to avoid redundancy and unnecessary clutter.
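
The reason the x argument is omitted is that the *_format() functions do not format anything themselves: they return a function, which ggplot2 later applies to the breaks. A short sketch (the name ‘fmt_cm’ is just an illustrative choice):

```r
library(scales)

# number_format() returns a formatting function
fmt_cm <- number_format(accuracy = 0.1, suffix = " cm")

is.function(fmt_cm)                 # TRUE

# applying it by hand gives the labels ggplot2 would display
lbl <- fmt_cm(c(160, 162.5))
lbl                                 # "160.0 cm" "162.5 cm"
```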

Another way to define the number of breaks is with the argument n.breaks:

gg_ghw2 + 
  scale_x_continuous(n.breaks = 10)

The number of breaks will not always match the number it is given because the algorithm of n.breaks may choose a different number to ensure nice labels.

The same effect can be achieved with breaks using the function pretty_breaks() from the scales package:

gg_ghw2 +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

We can use the function labs() to add a title, subtitle and/or a caption:

gg_ghw3 <- gg_ghw2 +
  scale_x_continuous(name = "Height in cm",
                     breaks = scales::pretty_breaks(8)) +
  
  labs(title = "Height vs weight by gender",
       subtitle = str_wrap("The positive correlation between height and weight is higher for males than for females", 50),
       caption = "source: figures produced by a random number generator")

A subtitle should immediately convey the message. A subtitle that starts with “the chart below shows the relationship between bla bla bla” quickly loses the reader’s attention.

The function str_wrap(), from the core tidyverse package stringr, limits the length of each text line to 50 characters.

Exercise

Update the plot gg_ghw3 by:

changing the y-axis title to “Weight in kg”
setting the y-axis 10 breaks with n.breaks

# load gg_ghw3
source(file = "aux_ex_gg/ex_gg_ghw3.R")
# have a look
gg_ghw3

gg_ghw3 <- gg_ghw3 + #no blank spaces to fill

Solution

# load gg_ghw3
source(file = "aux_ex_gg/ex_gg_ghw3.R")

gg_ghw3 <- gg_ghw3 + 
  scale_y_continuous(name = "Weight in kg",
                     n.breaks = 10)

gg_ghw3

To learn more about the continuous scale functions go to
https://ggplot2.tidyverse.org/reference/scale_continuous.html

4.2 scale_color_*() functions

The scale_color_*() functions are used to control the color scheme.

A very popular color scheme for discrete data is the brewer scale - included in ggplot2 with scale_color_brewer().

The brewer scale provides sequential, diverging and qualitative color schemes. While it is designed for discrete data it can, sometimes, provide good results for continuous data as well.

Since color is mapped to the discrete variable gender, which has no specific order - we choose a qualitative color scheme, for example ‘Set1’:

gg_ghw3 + scale_color_brewer(palette = "Set1")

The brewer color scheme “Set1” is a nice improvement over the default colors.

To learn more about the brewer palettes go to:
https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html

Colors can be defined manually with scale_color_manual():

gg_ghw3 + scale_color_manual(values = c("gold2", "turquoise3" ))

To see all accepted color strings go to:
http://sape.inf.usi.ch/quick-reference/ggplot2/colour

Colors can also be set in hexadecimal form.

To change the labels associated with each color use the argument labels:

gg_ghw4 <- gg_ghw3 + 
  scale_color_manual(values = c("#E7B800", "#00aFBB"),
                     labels = c("Female", "Male"))

gg_ghw4
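
Unnamed values and labels are matched to the levels by position (here alphabetically: ‘f’ then ‘m’), which is easy to get wrong. A safer variant, sketched below on a made-up data-frame (‘df_demo’ is an assumption for illustration), is to name the vectors so that each color and label is tied explicitly to a level:

```r
library(ggplot2)

# made-up data, for illustration only
df_demo <- data.frame(height = c(160, 165, 175, 180),
                      weight = c(55, 60, 75, 82),
                      gender = c("f", "f", "m", "m"))

p_named <- ggplot(df_demo, aes(height, weight, color = gender)) +
  geom_point() +
  # names tie each color and label to a level of gender
  scale_color_manual(values = c(f = "#E7B800", m = "#00AFBB"),
                     labels = c(f = "Female", m = "Male"))

cols_used <- layer_data(p_named)$colour
cols_used
```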

Another very popular color scheme is Viridis and it is specially designed to improve readability for readers with color blindness and/or color vision deficiency.

It is included in ggplot2, with scale_color_viridis_c(), scale_color_viridis_d() and scale_color_viridis_b() for continuous, discrete and binned data.

For more details visit https://cran.r-project.org/web/packages/viridis/index.html.

scale_color_*() functions are for geometric objects that have no defined area, such as dots or lines. To color geometric objects with defined area, such as a bar from a bar plot, we use scale_fill_*() functions.
Every scale_color_*() function has a scale_fill_*() equivalent.
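
As a minimal sketch of the fill counterpart, the code below colors the bars of a small made-up bar plot (‘df_demo’ is an assumption for illustration) with the brewer palette:

```r
library(ggplot2)

# made-up data, for illustration only
df_demo <- data.frame(region = c("A", "B", "C"),
                      loss = c(3, 5, 2))

p_fill <- ggplot(df_demo, aes(region, loss, fill = region)) +
  geom_col() +                          # bars have an area, so fill applies
  scale_fill_brewer(palette = "Set1")   # fill counterpart of scale_color_brewer()

fills_used <- layer_data(p_fill)$fill   # one fill color per region
```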

4.3 Theme

The function theme() is used to control non-data elements of the plot.

A very handy feature is that ggplot2 comes with a series of predefined themes to give your plots a consistent look.

A theme does not change how the data is displayed by geoms or how it is transformed by scales.

# minimalist design
gg_ghw4 + 
  theme_minimal()

theme_minimal() provides a nice clear background giving it a cleaner and more appealing look.

While ggplot2 provides a set of predefined themes, there are several other packages that provide many more. One of my favorite themes is theme_pubclean() from the package ggpubr. Besides providing additional themes, ggpubr also includes other handy ggplot2 functions.

# clean theme from the ggpubr package
gg_ghw4 +
  ggpubr::theme_pubclean()

However, theme_pubclean() is more suitable when the x-axis is categorical - since the grid is only displayed for the y-axis.

Other packages such as ggthemes or ggthemr are worth exploring as well.

The function theme() contains several arguments:

  args(theme)

## function (line, rect, text, title, aspect.ratio, axis.title, 
##     axis.title.x, axis.title.x.top, axis.title.x.bottom, axis.title.y, 
##     axis.title.y.left, axis.title.y.right, axis.text, axis.text.x, 
##     axis.text.x.top, axis.text.x.bottom, axis.text.y, axis.text.y.left, 
##     axis.text.y.right, axis.ticks, axis.ticks.x, axis.ticks.x.top, 
##     axis.ticks.x.bottom, axis.ticks.y, axis.ticks.y.left, axis.ticks.y.right, 
##     axis.ticks.length, axis.ticks.length.x, axis.ticks.length.x.top, 
##     axis.ticks.length.x.bottom, axis.ticks.length.y, axis.ticks.length.y.left, 
##     axis.ticks.length.y.right, axis.line, axis.line.x, axis.line.x.top, 
##     axis.line.x.bottom, axis.line.y, axis.line.y.left, axis.line.y.right, 
##     legend.background, legend.margin, legend.spacing, legend.spacing.x, 
##     legend.spacing.y, legend.key, legend.key.size, legend.key.height, 
##     legend.key.width, legend.text, legend.text.align, legend.title, 
##     legend.title.align, legend.position, legend.direction, legend.justification, 
##     legend.box, legend.box.just, legend.box.margin, legend.box.background, 
##     legend.box.spacing, panel.background, panel.border, panel.spacing, 
##     panel.spacing.x, panel.spacing.y, panel.grid, panel.grid.major, 
##     panel.grid.minor, panel.grid.major.x, panel.grid.major.y, 
##     panel.grid.minor.x, panel.grid.minor.y, panel.ontop, plot.background, 
##     plot.title, plot.title.position, plot.subtitle, plot.caption, 
##     plot.caption.position, plot.tag, plot.tag.position, plot.margin, 
##     strip.background, strip.background.x, strip.background.y, 
##     strip.placement, strip.text, strip.text.x, strip.text.y, 
##     strip.switch.pad.grid, strip.switch.pad.wrap, ..., complete = FALSE, 
##     validate = TRUE) 
## NULL

Many arguments in theme() work with element_*() functions, such as:

element_text() to format text e.g.: labels/titles
element_line() to format line elements e.g.: grids
element_rect() to format borders and backgrounds
element_blank() to remove the element referenced by the argument - it works for every type of element!

For example:

theme_custom <- theme(
  #removing the grey background panel
  # panel.background is an element_rect() argument
  # but to remove it just use element_blank()
  panel.background = element_blank(),
  
  # grids are a line element, hence use element_line() to change them
  panel.grid.major = element_line(linetype = "dotted",
                                  color = "grey80"),
  
  # axis.ticks don't look good here so let's get rid of them
  # to format axis.ticks use element_line(), but we want to simply remove it
  axis.ticks = element_blank(),
  
  # the legend key is the grey area around the dots in the legend
  # to format it use element_rect() for borders and backgrounds, but to remove it element_blank()
  legend.key = element_blank(),
  
  # to format the title use element_text() but we want to remove it
  legend.title = element_blank(),
  
  # increase the size of the axis title
  axis.title = element_text(size = 14),
  
  # Not all arguments require element_*(), legend.position accepts the strings "left", "right", "bottom", "top" or "none"
  # and coordinates for finer control, for example c(0.5, 0.5) would set the legend in the middle of the plot
  legend.position = "top"
  )

gg_ghw4 +
  geom_smooth(method = "lm",
              se = FALSE) +
  theme_custom

## `geom_smooth()` using formula 'y ~ x'

There is no need for a legend title since it is clear it is referring to gender.

The font of the axis title was increased as often they are too small. It is a good idea to visualize the plot in the scale it will be published.

We can always override individual settings of a predefined theme, for example - changing the position of the legend for theme_minimal:

gg_ghw4 +
  theme_minimal() +
  theme(legend.title = element_blank(), 
        legend.position = "top")

In the next sections we will make use of these functions and explore additional ggplot2 features.

5 Dot plot - categorical variable

In the prior example we created a plot with geom_point() with two continuous variables. In the next plot, we will continue to use geom_point() but with one categorical (discrete) variable.

The data in ‘df_loss_by_region.csv’ contains reported losses for each region: US, Europe, Latin America and Asia.

# importing csv file and setting the variables types for each column 
#readr::read_csv() usually does a good job guessing the type of variable but staying in control is good-practice
df_reg1 <- read_csv("data/df_loss_by_region.csv", 
                    col_types = list(region = col_character(), 
                                     loss = col_double()))

head(df_reg1)

## # A tibble: 6 × 2
##   region  loss
##   <chr>  <dbl>
## 1 US     1768.
## 2 US     1801.
## 3 US      464.
## 4 US      479.
## 5 US      795.
## 6 US      298.

Exercise

Run the code above and create a dot plot from ‘df_reg1’, with region on the x-axis
Add the title “Reported Losses in USD by region”, with ‘txt_title’ in labs()
Color the dots by region
Apply scale_color_brewer with the “Dark2” palette
Use the theme theme_pubclean() from the ggpubr package
Assign the plot to a variable named gg_reg1

txt_title <- "Reported Losses in USD by region"

gg_reg1 <- ______ %>% 
  ggplot(
    aes(______, 
        ______, 
        color = ______)) +
  
  ______ + # <---the appropriate geom
  
  ______(______ = txt_title) +
  
  ______(palette = ______) +
  
  ______ #<---- theme

Solution

txt_title <- "Reported Losses in USD by region"

gg_reg1 <- df_reg1 %>% 
  ggplot(
    aes(region, 
        loss, 
        color = region)) +
  
  geom_point() +
  
  labs(title = txt_title) +
  
  scale_color_brewer(palette = "Dark2") +
  
  theme_pubclean()

gg_reg1

The regions are sorted by alphabetical order with ‘Asia’ appearing first. However, it would be better if the order was defined by the number (or amount) of reported losses instead:

df_aux <- df_reg1 %>% 
  group_by(region) %>% 
  summarise(loss = sum(loss),
            n = n()) %>% # the function n() counts the number of observations for each group
  arrange(desc(loss)) #arrange losses in descending order

df_aux

## # A tibble: 4 × 3
##   region            loss     n
##   <chr>            <dbl> <int>
## 1 US            4513102.  2463
## 2 Europe        3189086.  2060
## 3 Latin America 1203860.  1064
## 4 Asia            80987.   115

By count and total amount ‘US’ should be first followed by ‘Europe’, ‘Latin America’ and lastly ‘Asia’.

This means we have to change the variable type for region, from character to factor, with the appropriate levels.

v_levels <- df_aux %>% pull(region) #pull from dplyr takes the column from a data-frame as a vector
v_levels

## [1] "US"            "Europe"        "Latin America" "Asia"

df_reg2 <- 
  df_reg1 %>% 
  mutate(region = factor(region, 
                         levels = v_levels))

# the order should now be according to the level of the factors

gg_reg2 <- df_reg2 %>% 
  ggplot(
    aes(region, 
        loss, 
        color = region)) +
  
  labs(title = txt_title) +
  
  scale_color_brewer(palette = "Dark2") +
  
  theme_pubclean()

# adding geom_point() to gg_reg2
gg_reg2 + geom_point()
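
The re-levelling steps can be checked on a small made-up data-frame (‘df_demo’ and its values are assumptions for illustration, not the course data):

```r
library(dplyr)

# made-up losses, for illustration only
df_demo <- tibble(region = c("Asia", "US", "Europe", "US", "Europe", "US"),
                  loss   = c(10, 500, 300, 400, 200, 100))

# regions sorted by total loss in descending order
v_demo_levels <- df_demo %>%
  group_by(region) %>%
  summarise(loss = sum(loss)) %>%
  arrange(desc(loss)) %>%
  pull(region)

# convert region to a factor with the levels in that order
df_demo2 <- df_demo %>%
  mutate(region = factor(region, levels = v_demo_levels))

levels(df_demo2$region)  # "US" "Europe" "Asia"
```

ggplot2 draws discrete axes and legends in the order of the factor levels, which is why this step changes the plot.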

There is no need for an x-axis title or a legend title since it is clear we are referring to regions.

The title of the y-axis can also be removed - the plot title already shows the unit and the type of amount and there is only one continuous variable.

In addition, the number of breaks of the y-axis seems low, the following adjustments shouldn’t look too strange at this point:

gg_reg3 <- 
  
  gg_reg2 + 
    scale_x_discrete(name = NULL) +
    scale_y_continuous(name = NULL,
                       n.breaks = 8,
                       labels = scales::number_format(big.mark = "'")) +
  
  theme(legend.title = element_blank())

gg_reg3 + geom_point()

5.1 Shake it - geom_jitter()

With a discrete x-axis, all dots are vertically aligned making the plot difficult to read.
This is a perfect scenario to use geom_jitter() - it shuffles/jitters the dots to significantly improve the visual!

gg_reg3 + geom_jitter()

The amount of jitter can be controlled both horizontally and vertically with the arguments width and height.

The same effect can be accomplished with geom_point() using the position argument along with the position_jitter() function.

  gg_reg3 +
  geom_point(position = position_jitter(seed = 3))

Since jitter is based on a random shuffle, we can set a seed to ensure reproducibility.
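
A minimal sketch of what the seed buys us, on made-up data (‘df_demo’ is an assumption for illustration): two plots created separately with the same seed place the dots in exactly the same positions.

```r
library(ggplot2)

# made-up data, for illustration only
df_demo <- data.frame(region = rep(c("A", "B"), each = 5),
                      loss = 1:10)

# helper for this sketch: build the same jittered plot with a given seed
make_plot <- function(seed) {
  ggplot(df_demo, aes(region, loss)) +
    geom_point(position = position_jitter(seed = seed))
}

# horizontal positions of the dots after jittering
x_first  <- layer_data(make_plot(seed = 3))$x
x_second <- layer_data(make_plot(seed = 3))$x

same_positions <- identical(x_first, x_second)
same_positions  # TRUE
```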

There is some over-plotting for lower values. Fix it in the next exercise.

Exercise

From the plot gg_reg3

add a geom_point() with size 1
apply jitter with default settings and;
set the alpha to 0.25

and name the updated plot ‘gg_reg4’.

# load gg_reg3 
source(file = "aux_ex_gg/ex_gg_reg3.R")
# have a look
gg_reg3

gg_reg4 <- _____ +
  _____(_____ = ______,
        _____ = 0.25,
        size = ______) 

gg_reg4

Solution

# load gg_reg3 
source(file = "aux_ex_gg/ex_gg_reg3.R")

gg_reg4 <-   gg_reg3 +
  geom_point(position = position_jitter(),
             alpha = 0.25,
             size = 1) 
  
gg_reg4

5.2 Guides - guide_legend()

There is an annoying problem with the legend of gg_reg4 - the transparency/alpha was also applied to the legend, which is not what we want.

You may be tempted to assume that we can fix this with theme(). But recall that theme() does not change how the data is displayed by geoms or how it is transformed by scales. So actually we require an additional function: guides().

Guides are used mainly to manipulate the axis or legend by converting visual properties back to the data.

Although the code below won’t win any beauty contest, it is very useful since the need for this type of adjustment is quite common.

gg_reg4 +
  guides(color = guide_legend(override.aes = list(alpha = 1)))

Since there’s a legend due to the mapping of aes(color = region) - the argument to use in guides is color i.e.: guides(color = ....).

Also, the argument override.aes needs to be in a list() because there are several aesthetics we can override.

guide_legend() provides several other options to format the legend:

gg_reg4 + 
  guides(color = guide_legend(nrow = 2,
                              override.aes = list(alpha = 1, 
                                                  size = 4)))

The plot displays the claims activity for each region. Assuming homogeneous policies, it also provides an empirical assessment of the attritional/large loss threshold: values under +/- USD 10k can be considered attritional, whereas losses above USD 25k can be considered extremely large.

As a final comment: the legend should actually be removed, since the regions are clearly identified on the x-axis. The goal here was to introduce the guides() function:

gg_reg4 + 
  # the legend is redundant
  theme(legend.position = "none")

6 Bar plot with geom_col()

In this section we will use geom_col() to create bar plots by exploring an insurance policy database.

df_pol1 <- read_rds(file = "data/df_pol.rds")

View(df_pol1) shows the data-frame in a dedicated tab.

The first 30 entries are:

6.1 Bar plot - plain vanilla

The function geom_col() plots a bar chart with one categorical (discrete) variable, with the height of each bar based on some numerical attribute, for example ‘premium’:

df_pol1 %>% 
  ggplot(aes(x = uwy,
             y = premium)) +
  geom_col()

Because uwy (underwriting year) is an integer it is treated as a continuous variable and the default scale unit is applied.

However, every underwriting year should be displayed on the x-axis, so we change it from integer to factor. This can be done directly in ggplot:

df_pol1 %>% 
  ggplot(aes(x = as.factor(uwy),
             y = premium)) +
  geom_col()

We will use this data several more times so it is worth changing the data-frame itself:

df_pol2 <- 
  df_pol1 %>%
  mutate(uwy_f = as.factor(uwy))

# it is always a good idea to check the order of the levels!!!
df_pol2$uwy_f %>% levels()

## [1] "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021"

Because bars are geometric objects with non-zero area, unlike dots and lines, the argument to define its color is fill and not color (the color argument is for the borders of the bars):

gg_pol_uwy1 <- df_pol2 %>% 
  ggplot(aes(x = uwy_f, 
             y = premium)) +
  geom_col(fill = "skyblue2")

gg_pol_uwy1

The argument expand of scale_y_continuous() controls the padding between the data and the ends of the y-axis.
The argument expand works along with its companion, the function expansion(), which accepts either a multiplicative or additive value.

gg_pol_uwy1 + 
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))

The code above makes the y-axis start exactly at zero (no padding at the bottom) and adds padding of 5% of the axis range at the top.
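
The effect of expansion() can be verified by looking at the range of the y-axis in the built plot. The sketch below uses made-up premiums (‘df_demo’ is an assumption for illustration); with bars going from 0 to 10, a 5% multiplicative expansion at the top should give an axis from 0 to 10.5:

```r
library(ggplot2)

# made-up data, for illustration only
df_demo <- data.frame(uwy = factor(2019:2021),
                      premium = c(4, 10, 6))

p_expand <- ggplot(df_demo, aes(uwy, premium)) +
  geom_col() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))

# range of the y-axis after the expansion is applied
y_range <- ggplot_build(p_expand)$layout$panel_params[[1]]$y.range
y_range  # 0.0 10.5
```

An additive expansion works the same way: expansion(add = c(0, 1)) would add one unit at the top regardless of the range.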

This chart could use some adjustments:

fmt_num <- scales::number_format(accuracy = 0.1, 
                                 big.mark = "'", 
                                 scale = 1/1000000)

scale_y <- scale_y_continuous(n.breaks = 12,
                              labels = fmt_num,
                              expand = expansion(mult = c(0, 0.05))
                              )

gg_pol_uwy2 <- gg_pol_uwy1 +
  scale_y +
  xlab("Underwriting Year") + #another way to set the x-axis' title
  ylab("Gross Written Premium [USDm]") + #another way to set the y-axis' title
  theme_pubclean()

gg_pol_uwy2

One way to add labels to the plot is with geom_text(). All we need is a data-frame with total premium by underwriting year including a column with formatted values:

df_txt_premium <- df_pol2 %>% 
  group_by(uwy_f) %>% 
  summarise(premium = sum(premium)) %>% 
  mutate(txt_premium = scales::number(x = premium, accuracy = 0.1, scale = 1/1e6))

# look at the data-frame
head(df_txt_premium)

## # A tibble: 6 × 3
##   uwy_f   premium txt_premium
##   <fct>     <dbl> <chr>      
## 1 2014   3654449. 3.7        
## 2 2015   5274022. 5.3        
## 3 2016   9400845. 9.4        
## 4 2017   9816300. 9.8        
## 5 2018  11611880. 11.6       
## 6 2019  15293190. 15.3

# adding geom_text() to the plot
gg_pol_uwy2 +
  geom_text(data = df_txt_premium,
            aes(label = txt_premium))

Since the label is given by a column of the data-frame it needs to be called from the aes() function.

The vertical alignment of the text is not ideal. To change it use the argument vjust (there is also hjust):

gg_pol_uwy2 +
  geom_text(data = df_txt_premium,
            aes(label = txt_premium),
            vjust = -0.5)

A negative value for vjust moves the text upwards. A bit counter-intuitive, in my opinion, but that’s how it is.

6.2 Time to flip the plot

Let’s plot the premium for each industry:

df_pol2 %>% 
  ggplot(aes(industry, premium)) +
  geom_col()

The labels for industries are totally illegible! There are too many and their titles are too long.

A common solution is to tilt the angle of the x-axis labels:

df_pol2 %>% 
  ggplot(aes(industry, premium)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90))

Unless we want our readers to end-up with a stiff neck, we should avoid this approach.

Another popular, but not recommended solution is to color each bar and add a legend. But this approach gets messy with only a few categories and we have 29 industries! See the chart below from my favorite online tech retailer:

digitec

Just 5 categories are enough to cause trouble, so there is no hope that this approach works for 29 long-named categories!

In case you haven’t already, it is time to flip the plot.
Switching the x with the y axis with coord_flip() makes all the difference in the world:

df_pol2 %>% 
  ggplot(aes(x = industry, 
             y = premium)) +
  geom_col() +
  coord_flip()

Now all industries are clearly visible and coloring each bar adds no value (quite the contrary).

Only vertical grid lines are needed, so let’s apply theme_pubclean():

df_pol2 %>% 
  ggplot(aes(x = industry, 
             y = premium)) +
  geom_col(fill = "skyblue2") +
  scale_y +
  xlab(NULL) +  
  ylab("Gross Written Premium [USDm]") +
  theme_pubclean(flip = TRUE) +
  coord_flip()

When adding coord_flip(), theme() does not follow along. So theme_pubclean() will keep the horizontal grid-lines even though we flipped the plot.

Fortunately, theme_pubclean() has an argument flip - when set to TRUE it flips the grid-lines as well. Most theme pre-sets do not have this option but we can always overwrite any setting, such as grid-lines, with theme().
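For themes without a flip argument, the grid lines can be overwritten by hand. The sketch below uses made-up data (df_toy is an assumption for illustration) and theme_minimal() as a stand-in for any preset theme:

```r
library(ggplot2)

# hypothetical toy data, not from the course materials
df_toy <- data.frame(industry = c("Fishery", "Sports", "Mining"),
                     premium = c(2, 5, 3))

gg_flip <- ggplot(df_toy, aes(industry, premium)) +
  geom_col(fill = "skyblue2") +
  coord_flip() +
  theme_minimal() +
  theme(
    # theme elements refer to what is drawn on screen:
    # ".y" grid lines are the horizontal ones, ".x" the vertical ones
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "grey80", linetype = "dotted")
  )

gg_flip
```

Because theme() is added after the preset theme, its settings take precedence over those of the preset.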

Simply switching the columns for x and y directly in aes() is another way to flip the plot:

df_pol2 %>% 
  ggplot(aes(y = industry, 
             x = premium)) +
  geom_col()

Industry is sorted by alphabetical order from bottom-up which makes it difficult to identify the rank of each industry.

Sorting by value will improve the plot quite a bit. This requires industry to be a factor/categorical variable.

# group total premium by industry and 
# sort industry by premium in ascending order

df_pol_ind1 <- df_pol2 %>% 
  group_by(industry) %>% 
  summarise(premium = sum(premium)) %>% 
  arrange(premium) %>% #sort by premium
  mutate(industry = fct_inorder(industry)) #sets the level according to the order in the data-frame

gg_pol_ind1 <- df_pol_ind1 %>% 
  ggplot(aes(x = industry, 
             y = premium)) +
  scale_y +
  xlab(NULL) +
  ylab("Gross Written Premium [USDm]") +
  geom_col(fill = "skyblue2") +
  theme_pubclean(flip = TRUE) +
  coord_flip()

gg_pol_ind1

The chart displays the premium for 29 categories in a clear and simple way.
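As a side note, forcats also offers fct_reorder(), which sets the levels by the values of another column without sorting the rows first. The sketch below compares both routes on made-up data (df_toy is an assumption, not the course data):

```r
library(dplyr)
library(forcats)

# hypothetical toy data, not from the course materials
df_toy <- tibble(industry = c("Sports", "Fishery", "Mining"),
                 premium = c(5, 2, 3))

# route used in the text: sort the rows, then freeze that order as levels
lv_inorder <- df_toy %>%
  arrange(premium) %>%
  mutate(industry = fct_inorder(industry)) %>%
  pull(industry) %>%
  levels()

# alternative: order the levels by premium directly, no arrange() needed
lv_reorder <- df_toy %>%
  mutate(industry = fct_reorder(industry, premium)) %>%
  pull(industry) %>%
  levels()

lv_inorder
lv_reorder
```

Both vectors list the industries from the smallest to the largest premium, which is the order ggplot2 uses from the bottom up on a flipped axis.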

6.3 Stacked bar plot

The type of bar plot (e.g.: side-by-side or stacked) is defined by the argument position in geom_col().

A side-by-side bar plot requires position = "dodge" while position = "stack" creates a stacked bar plot.
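The difference between the two positions can be checked on a small made-up example (df_toy and its values are assumptions for illustration). layer_data() returns the data ggplot2 actually draws, so the top of the tallest bar tells the two positions apart:

```r
library(ggplot2)

# hypothetical toy data, not from the course materials
df_toy <- data.frame(
  region = rep(c("North", "South"), each = 2),
  tov = rep(c("paid", "ibnr"), times = 2),
  value = c(4, 6, 3, 2)
)

gg_base <- ggplot(df_toy, aes(region, value, fill = tov))

gg_dodge <- gg_base + geom_col(position = "dodge") # bars next to each other
gg_stack <- gg_base + geom_col(position = "stack") # bars on top of each other (the default)

# top of the tallest bar: a single value when dodged, a regional total when stacked
max(layer_data(gg_dodge)$ymax)
max(layer_data(gg_stack)$ymax)
```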

In this section, we will create a plot with ‘paid’, ‘case’ and ‘ibnr’ stacked on top of each other so they add-up to the ultimate. This is a good way to assess the composition of the ultimate loss by a given category, in this case region.

The key is to group the data by region and gather under one variable: ‘paid’, ‘case’ and ‘ibnr’ values.

# start with policy data 
head(df_pol1)

## # A tibble: 6 × 12
## # Groups:   pol_id, uwy, insured, lob, region [6]
##   pol_id     uwy insured lob   region indus…¹ premium  paid  case repor…²   ibnr
##   <chr>    <dbl> <chr>   <chr> <chr>  <chr>     <dbl> <dbl> <dbl>   <dbl>  <dbl>
## 1 pol_101…  2014 Marco … Gene… Cigol… Sports  212141.     0     0       0  5967.
## 2 pol_101…  2015 Marco … Gene… Cigol… Sports  219132.     0     0       0 11433.
## 3 pol_102…  2014 Javi M… Gene… Lapla… Fishery  89646.     0     0       0 14638.
## 4 pol_102…  2015 Javi M… Gene… Lapla… Fishery  92417.     0     0       0 13574.
## 5 pol_102…  2016 Javi M… Gene… Lapla… Fishery  88270.     0     0       0 20757.
## 6 pol_102…  2017 Javi M… Gene… Lapla… Fishery 158108.     0     0       0 67423.
## # … with 1 more variable: ultimate <dbl>, and abbreviated variable names
## #   ¹industry, ²reported

# summarise by region paid, case, ibnr
# and gather paid, case, ibnr under 'tov'
# change tov and region to factor with appropriate levels
df_ult_tov_reg1 <- df_pol1 %>% 
  filter(uwy >= 2018) %>% 
  group_by(region) %>% 
  summarise(
    paid = sum(paid),
    case = sum(case),
    ibnr = sum(ibnr)) %>% 
  pivot_longer(!region, names_to = "tov") %>% 
  group_by(region) %>% 
  mutate(tov = factor(tov, levels = c("paid", "case", "ibnr")),
         ultimate = sum(value)) %>% 
  arrange(desc(ultimate)) %>% 
  ungroup() %>% 
  mutate(region = fct_inorder(region))

# check the data-frame
head(df_ult_tov_reg1)

## # A tibble: 6 × 4
##   region     tov       value  ultimate
##   <fct>      <fct>     <dbl>     <dbl>
## 1 Disneyland paid   1977297. 22448563.
## 2 Disneyland case   8848827. 22448563.
## 3 Disneyland ibnr  11622439. 22448563.
## 4 Connyland  paid   1917787. 17099892.
## 5 Connyland  case   1061268. 17099892.
## 6 Connyland  ibnr  14120837. 17099892.

df_ult_tov_reg1 %>%
  ggplot(aes(region, value, fill = tov)) +
  geom_col(position = position_stack())

We sorted the type of value by level of certainty rather than by alphabetical order: ‘paid’ is the most deterministic value, followed by ‘case’ and finally ‘ibnr’, which is the most uncertain amount.

A color sequence works well again, with the darker shade reserved for ‘paid’ and the lighter for ‘ibnr’.

The vector v_greens has manually assigned hexadecimal colors. To print colors use show_col() from the scales package (the scales package is really a gem):

v_greens <- c("#007934", "#7fbc99", "#cce4d6")
scales::show_col(v_greens, ncol = 3)

gg_ult_tov_reg1 <- df_ult_tov_reg1 %>% 
  ggplot(aes(region, 
             value,
             fill = tov)) +
  geom_col(position = position_stack(reverse = TRUE)) + #default is top-down, reverse = TRUE sorts it from bottom-up
  
  # cosmetics
  scale_fill_manual(values = v_greens,
                    labels = c("Paid Losses", "Case Reserves", "IBNR")) +
  scale_y +
  xlab(NULL) +
  ylab("Loss Amounts in USD") +
  theme_pubclean()

gg_ult_tov_reg1

The default order of position_stack() is top-down, i.e. it sets the first level (‘paid’) at the top and the last level (‘ibnr’) at the bottom.
Thankfully, position_stack() has an argument called reverse: setting position_stack(reverse = TRUE) stacks the levels from the bottom up.

Besides displaying the ultimate values by region, it also shows how much of it is Paid, Case Reserved and IBNR.

Exercise

Change the legend title of gg_ult_tov_reg1 to “Ultimate Loss”.
Hint: use guides() and think of what aesthetic created the legend.

# load gg_ult_tov_reg1
source(file = "aux_ex_gg/ex_gg_ult_tov_reg1.R")
# have a look
gg_ult_tov_reg1 

gg_ult_tov_reg1 +
  ______(______ = ______(______ = "Ultimate Loss"))

Solution

# load gg_ult_tov_reg1
source(file = "aux_ex_gg/ex_gg_ult_tov_reg1.R")

gg_ult_tov_reg1 +
  guides(fill = guide_legend(title = "Ultimate Loss"))

The legend is due to the mapping aes(fill = tov), so the function guide_legend() is placed in the fill argument of guides().

Home challenge

Replicate the plot below.

Hint: set fill = tov in geom_col() not in ggplot()! And use the argument group for geom_line():

6.4 Fill it up - 100% stack plot

The previous plot compared the components of the ultimate loss in absolute values.

But we might be more interested in comparing the proportions instead of absolute amounts. The most popular way to do this is via pie-charts. While a pie chart with 3 slices is perfectly fine - trying to compare 6 with each other is far from ideal.

However, a 100% stack plot works great, as it compares the proportions for each region in one single plot. In addition, it is easier to assess the area of a rectangular shape than a circular shape.

To create a 100% stack plot set position = "fill" or the equivalent position = position_fill().
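Under the hood, position_fill() rescales each stack so that it adds up to 1. The same shares can be computed by hand, which is handy if you want to print them as labels or in a table. A minimal sketch on made-up data (df_toy is an assumption for illustration):

```r
library(dplyr)

# hypothetical toy data, not from the course materials
df_toy <- tibble(
  region = rep(c("North", "South"), each = 3),
  tov = rep(c("paid", "case", "ibnr"), times = 2),
  value = c(2, 3, 5, 1, 1, 2)
)

# share of each component within its region
df_share <- df_toy %>%
  group_by(region) %>%
  mutate(share = value / sum(value)) %>%
  ungroup()

df_share
```

Within each region the column share adds up to 1, exactly like the bars of a 100% stack plot.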

We will also flip the plot, and reverse the order of the regions so that Disneyland stays on top and Lapland at the bottom.

df_ult_tov_reg2 <- 
  df_ult_tov_reg1 %>%
  mutate(region = fct_rev(region)) # fct_rev() reverses the order of the factors

df_ult_tov_reg2 %>% 
  # the important bit
  ggplot(aes(region, 
             value)) +
  geom_col(aes(fill = tov),
           color = "white", #color the border of the bars
           size = 0.5, #width of the borders (called linewidth from ggplot2 3.4.0 onwards)
           position = position_fill(reverse = TRUE)) + #reverse the order of 'tov'
  
  # cosmetics and labeling
  scale_fill_manual(values = v_greens,
                     labels = c("Paid Losses", "Case Reserves", "IBNR")) +
  scale_y_continuous(name = "Proportion of Ultimate Loss",
                     n.breaks = 12,
                     labels = scales::percent_format(accuracy = 1),
                     expand = expansion(mult = c(0, 0.04)),
                     position = "right") + #move the y axis to the right (top when flipped)
  scale_x_discrete(expand = expansion(add = 0)) +
  xlab(NULL) +
  theme_pubclean(flip = TRUE) +
  guides(fill = guide_legend(title = "Ultimate Loss")) +
  coord_flip()

6.5 Bar plots - What else?

There is also a geom called geom_bar() which is often confused with geom_col(). The difference is that geom_bar() makes the height of the bar proportional to the number of observations, while geom_col(), as we have seen, sets the height of the bars based on a given numerical column.

Hence, geom_col() requires a categorical variable and a continuous variable while geom_bar() requires only one categorical variable.

For example:

df_pol2 %>% 
  ggplot(aes(x = region)) + #map only one axis, mapping y will flip the plot
  geom_bar()

To create a similar plot but with a continuous variable, instead of a discrete one like region, use geom_histogram().
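A minimal histogram sketch follows. The premiums are simulated from a lognormal distribution, which is purely an assumption for illustration and has nothing to do with the course data:

```r
library(ggplot2)

# simulated data, purely illustrative
set.seed(1)
df_toy <- data.frame(premium = rlnorm(500, meanlog = 11, sdlog = 1))

gg_hist <- ggplot(df_toy, aes(x = premium)) + # only x is mapped, as with geom_bar()
  geom_histogram(bins = 30, # number of intervals the x-axis is cut into
                 fill = "skyblue2",
                 color = "white")

gg_hist
```

Just like geom_bar(), geom_histogram() counts observations; the only difference is that it first cuts the continuous variable into intervals (bins).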

7 Interactive plot with ggplotly

In this section we will create an interactive plot using the function ggplotly() from the plotly package.

7.1 Pre-work - dot plot with two continuous variables

First we need a standard ggplot.

We’ll use the data in ‘df_loss_mov.rds’ from the ‘data’ folder.

The goal is to visualize reported losses as at Q3 and Q4 of 2021.

# Import data
df_loss_mov1 <- read_rds("data/df_loss_mov.rds")

# have a look
head(df_loss_mov1)

## # A tibble: 6 × 12
##   pol_id         uwy insured region le    indus…¹ claim…² claim…³ prem_cq  rl_cq
##   <chr>        <dbl> <chr>   <fct>  <fct> <fct>   <chr>   <chr>     <dbl>  <dbl>
## 1 pol_157.1.2…  2016 Thibau… Conny… Rema… Govern… claim_… allege… 618992. 1.63e5
## 2 pol_164.1.2…  2016 Gonzal… Cigol… Rema… Graphi… claim_… potent… 506857. 9.07e4
## 3 pol_164.1.2…  2016 Gonzal… Cigol… Rema… Graphi… claim_… potent… 506857. 9.07e4
## 4 pol_164.1.2…  2016 Gonzal… Cigol… Rema… Graphi… claim_… potent…      0  7.03e3
## 5 pol_164.1.2…  2016 Gonzal… Cigol… Rema… Graphi… claim_… potent…      0  7.03e3
## 6 pol_145.4.2…  2017 Azpili… Disne… RR I… Entert… claim_… <NA>     61080. 3.00e4
## # … with 2 more variables: prem_lq <dbl>, rl_lq <dbl>, and abbreviated variable
## #   names ¹industry, ²claim_id, ³claim_desc

With rl_cq and rl_lq standing for ‘reported loss as at current quarter’ and ‘reported losses as at last quarter’ respectively.

One good way to compare movements from one period to another is to draw a scatter plot, with the last period on the x-axis and the current period on the y-axis.

It also helps to color the dots by a given category - we will use underwriting year (uwy), but region would also be a good option.

A qualitative palette of clearly distinct colors is the most appropriate, since we treat underwriting year as a category with no particular order in this case.
We must take care not to end up with too many different categories. So we will group 2017 and prior - with an ‘if else then’ condition:

#uwy_agg aggregates underwriting years 2017 & prior 
df_loss_mov2 <- df_loss_mov1 %>% 
    mutate(uwy_agg = if_else(condition = uwy <=2017, 
                             true = "2017&prior", 
                             false = as.character(uwy)))
# false = uwy, returns an error
# because a column can only have one variable type
# uwy is numeric (double) while "2017&prior" is a character
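The strictness of if_else() can be verified on a small vector. In this sketch the vector uwy is made up for illustration, and tryCatch() is only there so that the failing call does not stop the script:

```r
library(dplyr)

# hypothetical underwriting years
uwy <- c(2016, 2017, 2018, 2021)

# if_else() refuses to mix a character with a number
res_err <- tryCatch(
  if_else(uwy <= 2017, "2017&prior", uwy),
  error = function(e) "error"
)

# converting the numeric branch to character gives one consistent type
res_ok <- if_else(uwy <= 2017, "2017&prior", as.character(uwy))

res_err
res_ok
```

The base function ifelse() would silently convert the numbers to text instead of raising an error; the stricter behaviour of if_else() tends to surface type problems earlier.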

Base plot:

gg_rl_mov1 <- ggplot(df_loss_mov2,
         aes(rl_lq,
             rl_cq,
             color = uwy_agg)) +
  geom_point(size = 4)

gg_rl_mov1

We can do several things to improve the plot:

# custom theme
theme_rl_mov <- theme(
  panel.background = element_blank(), 
  panel.grid.major = element_line(color = "grey80"), 
  axis.ticks = element_blank(),
  legend.key = element_blank(),
  plot.title = element_text(face = "plain")) #default is bold

# x and y continuous  scale
scale_x_rl_mov <- scale_x_continuous(breaks = scales::pretty_breaks(8),
                                     labels = scales::number_format(accuracy = 0.1, scale = 1/1e6))

scale_y_rl_mov <- scale_y_continuous(breaks = scales::pretty_breaks(8),
                                     labels = scales::number_format(accuracy = 0.1, scale = 1/1e6))


# color scale
colors_rl_mov <- c("#2E2B21", "#E00034", "#F2B401", "#0F4DBC", "#007934")
scale_color_rl_mov <- scale_color_manual(values = colors_rl_mov)

# labels
labs_rl_mov <- labs(title = "Reported Losses in USDm as at Q3 and Q4 2021",
                    x = "As at September 30, 2021",
                    y = "As at December 31, 2021")

guides_rl_mov <- guides(color = guide_legend(title = "Underwriting Year"))

The code above should not look too strange at this point.

Applying cosmetics and labels:

gg_rl_mov2 <- gg_rl_mov1 +
  
  # cosmetics and labels
  scale_x_rl_mov +
  scale_y_rl_mov +
  theme_rl_mov +
  labs_rl_mov +
  scale_color_rl_mov +
  guides_rl_mov
  
gg_rl_mov2

Any reported loss (dot) that has remained unchanged from one quarter to the other will lie on the identity line i.e.: x = y.
Therefore, it is a good idea to draw an identity line à la qq-plot:

The function geom_abline() with the following arguments:

slope = 1
intercept = 0
color = "black"
size = 0.5

is exactly what we need.

gg_rl_mov3 <- gg_rl_mov2 +
  geom_abline(slope = 1,
              intercept = 0,
              color = "black",
              size = 0.5)

gg_rl_mov3

The slope and intercept could be omitted since we gave them their default values.

We can draw more general functions with geom_function(); however, at the time of writing it is not supported by ggplotly(), the function from the plotly package that we will use to make the plot interactive.

One unit should have the same length on both axes, since they measure the same quantity.
A rectangular shape would provide a misleading picture because the identity line would not be at a 45 degree angle. In other words, the identity line should be at the same distance from the x and y axes.

The function coord_fixed() controls the ratio of the axes to one another - the argument ratio has a default value of 1, which is what we want here.

gg_rl_mov4 <- gg_rl_mov3 + coord_fixed()
gg_rl_mov4

The grid is now a perfect square!

This plot clearly shows that the largest reported loss in the portfolio is about USD 4.5m and second largest loss is just under USD 2.0m. Both of these losses are from underwriting year 2020 and have hardly moved during Q4 2021.

Any loss that lies on the y-axis is a new loss since it was zero at the end of the prior quarter. Thus, the largest new reported loss is just below USD 1.0m from underwriting year 2021.

7.2 Look who’s talking - Interactive plot

To turn the plot into an interactive one, we will use the function ggplotly() from the plotly package.

Simply wrap the ggplot object in ggplotly(). The result will appear in the ‘viewer’ pane, which shows and saves local web content (that’s pretty handy!):

ggplotly(gg_rl_mov4)

The top right menu displays a series of tools, the magnifier allows the user to zoom in on any particular area for more detail.
Explore these features to see what they do. Just note that ‘Show closest data on hover’ and ‘Compare data on hover’ require one categorical variable so they won’t work in this case.

Hovering over any dot will display a pop-up message and you can use the legend as a filter!

While the pop-up message is a great feature, ggplotly() displays, by default, all the aesthetics of the plot.
But most of the time, we want to show other variables, apply labels with friendlier names, and format the numbers.
In sum, it is more useful to create a custom pop-up message.

For that, we’ll add a column to the data-frame with the message we want to display.

df_loss_mov3 <- df_loss_mov2 %>%
  mutate(
    #formatting numbers
    txt_rl_cq = scales::number(x = rl_cq, 
                               accuracy = 1, 
                               scale = 1/1e3, 
                               suffix = "k", 
                               big.mark = "'"
                               ),
    txt_rl_lq = scales::number(x = rl_lq, 
                               accuracy = 1, 
                               scale = 1/1e3, 
                               suffix = "k", 
                               big.mark = "'"
                               ),
    # creating tool_tip
    txt_tip = str_c("Reported current: ", txt_rl_cq, "\n",
                    "Reported last: ", txt_rl_lq, "\n",
                    "Policy ID: ", pol_id, "\n",
                    "Claim ID: ", claim_id,"\n",
                    "Region: ", region,"\n",
                    "Industry: ", industry,"\n",
                    "Claim description: **Confidential**"
                    )
    )

# see the first pop-up message
df_loss_mov3$txt_tip[[1]] %>% writeLines()

## Reported current: 163k
## Reported last: 174k
## Policy ID: pol_157.1.2016
## Claim ID: claim_011
## Region: Connyland
## Industry: Government Administration
## Claim description: **Confidential**

The function str_c() from the core tidyverse package stringr concatenates strings similar to the base function paste0(), whereas “\n” inserts a new line.
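There is one difference between the two functions worth knowing when building tooltips: how they treat missing values. A minimal sketch, where the vector region is made up for illustration:

```r
library(stringr)

# hypothetical values, the second one is missing
region <- c("Connyland", NA)

tip_str_c  <- str_c("Region: ", region, "\n", "Industry: n/a")
tip_paste0 <- paste0("Region: ", region, "\n", "Industry: n/a")

# without missing values both functions return the same string
writeLines(tip_paste0[1])

# str_c() returns NA as soon as one piece is missing,
# paste0() turns the missing piece into the text "NA"
is.na(tip_str_c)
tip_paste0[2]
```

If a column used in the message can be missing, consider replacing the NA first, for example with tidyr::replace_na(), so that the whole tooltip is not lost.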

There are two tricks to create a custom message.

The first is to map the newly created column txt_tip to the obscure argument text in aes():

gg_rl_mov5 <- ggplot(data = df_loss_mov3, 
       aes(rl_lq, 
           rl_cq,
           color = uwy_agg,
           text = txt_tip)) + #text = txt_tip, will force ggplotly to display this message
  geom_point(size = 4) +
  
  
  # cosmetics and labels
  scale_x_rl_mov + 
  scale_y_rl_mov +
  labs_rl_mov +
  scale_color_rl_mov +
  theme_rl_mov +
  geom_abline() +
  coord_fixed() +
  guides_rl_mov

The second - and last trick - is to use the ggplotly argument tooltip = "txt_tip".

ggplotly(gg_rl_mov5,
         tooltip = "txt_tip")

Now you have all the loss amounts and loss movements at your fingertips!

8 Multipanel plots with facet_wrap()

We will continue to use the data in ‘df_pol.rds’.
The code below groups the premium by region and underwriting year and then creates a simple bar plot with geom_col():

# automated data wrangling
df_facet_land <- 
  read_rds(file = "data/df_pol.rds") %>% 
  mutate(uwy_f = as.factor(uwy)) %>%
  group_by(region, uwy_f, uwy) %>% 
  summarise(premium = sum(premium)) %>% 
  group_by(region) %>% #aux to sort the region by largest latest premium
  mutate(sort_prem = sum(premium[uwy == max(uwy)])) %>% #premium 2021 (so it always works)
  arrange(desc(sort_prem)) %>% 
  ungroup() %>% #ungroup before fct_inorder()
  mutate(region = fct_inorder(region)) #factor levels in order of appearance

# see the output
df_facet_land

## # A tibble: 48 × 5
##    region     uwy_f   uwy   premium sort_prem
##    <fct>      <fct> <dbl>     <dbl>     <dbl>
##  1 Connyland  2014   2014  1318032. 11696819.
##  2 Connyland  2015   2015  2104860. 11696819.
##  3 Connyland  2016   2016  3937245. 11696819.
##  4 Connyland  2017   2017  2820563. 11696819.
##  5 Connyland  2018   2018  4013459. 11696819.
##  6 Connyland  2019   2019  3176144. 11696819.
##  7 Connyland  2020   2020  4812121. 11696819.
##  8 Connyland  2021   2021 11696819. 11696819.
##  9 Disneyland 2014   2014   871245. 10938131.
## 10 Disneyland 2015   2015  1180515. 10938131.
## # … with 38 more rows

# creating a simple bar plot
gg_facet <- df_facet_land %>% 
  ggplot(aes(uwy_f, premium)) +
  geom_col(fill = "skyblue2",
           alpha = .7) +
  
  scale_y_continuous(
    name = NULL,
    labels = scales::number_format(accuracy = 0.1,
                                   scale = 1/1e6),
    n.breaks = 8
    ) +
  theme_pubclean() +
  # no ticks!
  theme(axis.ticks = element_blank()) +
  labs(title = "Gross Written Premium in USDm",
       x = "Underwriting Year")

# let's see the plot
gg_facet

Since the data-frame ‘df_facet_land’ is grouped by region and underwriting year, we can create a side-by-side bar plot: just use position = "dodge" in geom_col().

However, there are 6 regions (a bit too many), which requires some effort to match the colors to each region.
In addition, the coloring is too aggressive - a sequential palette improves the looks but increases the effort of matching colors to region even more!

A side-by-side plot puts the focus on comparing the regions with each other over time - but if there are more than 4 categories it is not an easy read.

If we care more about how each individual region developed over time, and/or if there are too many categories, then there is a better solution. With only one line of code we can produce a bar plot for each region separately, which simplifies readability.

For this amazing feature use the function facet_wrap().

The key argument is facets - setting it to region preceded by a tilde ~ results in the multi-panel plot:

gg_facet + facet_wrap(facets = ~ region)

ggplot2 again takes full advantage of the long-data format.

Next financial period, just replace the data in the ‘data’ folder and all plots are updated. If, by any chance, there is a new region it will automatically be included in the output.

By default, the scale is the same for all charts. With argument scales we can free the x and/or y axis for each individual plot:

scales = "free" - both the x and y axes are independent for each plot
scales = "free_x" - only the x-axis is independent for each plot, while the y-axis is common to all plots
scales = "free_y" - only the y-axis is independent for each plot, while the x-axis is common to all plots

A free y-scale may be misleading when there are many facets:

gg_facet + 
  facet_wrap(facets = ~ region, 
             scales = "free_y")

On the other hand, having the y-scale fixed may hide important information for smaller regions.

For the x-axis to appear on each row (and not only at the bottom), it needs to have a free scale:

gg_facet + 
  facet_wrap(facets = ~ region, 
             scales = "free_x")

There are other arguments, for example to define the number of columns:

gg_facet + 
  facet_wrap(facets = ~ region, 
             scales = "free_x", 
             ncol = 2)

The same can be done with any categorical variable, for example industry.

facet_grid() is another function to create multi-panel plots, to learn more about it go to
https://ggplot2.tidyverse.org/reference/facet_grid.html.
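As a quick taste, facet_grid() lays the panels out in a matrix defined by two categorical variables, one for the rows and one for the columns. The sketch below uses made-up data; df_toy, the regions and the lines of business are assumptions for illustration:

```r
library(ggplot2)

# hypothetical toy data: 2 regions x 2 lines of business x 3 underwriting years
df_toy <- expand.grid(
  region = c("North", "South"),
  lob = c("Property", "Casualty"),
  uwy = factor(2019:2021)
)
df_toy$premium <- seq_len(nrow(df_toy))

gg_grid <- ggplot(df_toy, aes(uwy, premium)) +
  geom_col(fill = "skyblue2") +
  # one row of panels per line of business, one column per region
  facet_grid(rows = vars(lob), cols = vars(region))

gg_grid
```

While facet_wrap() wraps a single sequence of panels into rows and columns, facet_grid() always shows every combination of the two variables, even when a combination has no data.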

9 Throw the 3D in the bin

The data set of the next (and last!!!) example contains the normalized weight and the age in years of heavy industrial machines for a given industry.

9.1 One at a time

Histograms and density plots are good options to see how the weight and/or age are distributed. Not surprisingly, to draw a density plot use geom_density().

Which leads us to our next exercise.

Exercise

Import the data frame ‘df_bin.rds’ from the ‘data’ folder using read_rds()
Create a density plot for weight using geom_density()
Set blue_light as the fill color
Set blue_dark as the line color
Set alpha to 0.6
Use theme_pubclean() from the ggpubr package

Bonus points to those who eliminate one of my pet peeves: the axis-ticks! Hint: use theme().

Your plot should look like this:

Give it a go!

# read data
df_bin1 <- ______(file = "data/df_bin.rds")

# defining colors
blue_dark <-  "#0F4DBC"
blue_light <-  "#66CBEC"

# Sorry no hints or pre-populated code for the plot!

Solution

# read data
df_bin1 <- read_rds(file = "data/df_bin.rds")

# define colors
blue_dark <-  "#0F4DBC"
blue_light <-  "#66CBEC"

df_bin1 %>%
  ggplot(aes(x = weight)) +
  geom_density(fill = blue_light,
               color = blue_dark,
               alpha = .6) +
  theme_pubclean() +
  theme(axis.ticks = element_blank())

Another way to inspect this data is by creating a table. Since the variables are continuous, we need to group them into categories.

Age categories	Count	Percent
0-5	17'067	33%
5-10	11'504	22%
10-15	8'526	17%
15-20	5'004	10%
20-30	5'678	11%
30-40	3'751	7%
Total	51'530	100%

Weight categories	Count	Percent
-2 to -1	3'086	6%
-1 to 0	30'300	59%
0 to 1	12'151	24%
1 to 2	5'993	12%
Total	51'530	100%

These simple tables give us an idea of how the data is distributed for each variable. We could also create a bar chart from each table.

As a side note, the grouping influences the results:

too much grouping and details are lost
too little grouping results in too many categories with little data

Use your judgment on a case-by-case basis.
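One way to build such a table is with the base function cut(). The sketch below uses a handful of made-up ages, and it assumes that each category includes its lower bound and excludes its upper bound, which the tables above do not specify:

```r
# hypothetical ages in years
age <- c(1, 4, 7, 12, 18, 25, 33, 5, 10)

# category limits as in the age table above
breaks_age <- c(0, 5, 10, 15, 20, 30, 40)

# right = FALSE makes the intervals closed on the left: [0,5), [5,10), ...
age_cat <- cut(age, breaks = breaks_age, right = FALSE)

tab_age <- table(age_cat)
tab_age

# proportions instead of counts
round(prop.table(tab_age), 2)
```

Changing breaks_age is all it takes to try a coarser or finer grouping.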

9.2 Both at the same time

But what if we want to see the combined distribution of this data? It is tempting to draw a 3D plot, after all they look fancy!
The intersection could represent the third dimension on the z-axis.

However, bar some special cases - it is a good idea to avoid 3D plots, so just throw them in the bin!

A nice solution is to group - or if you prefer - to bin the intersection of these two continuous variables and represent this third dimension as a color scale.

For this we use geom_hex().

Let’s start with a base plot gg_bin0.

gg_bin0 <- df_bin1 %>%
  ggplot(aes(weight, age)) +
  
  labs(title = "Count: Age by Normalized Weight",
       x = "Normalized weight",
       y = "Age\n(in years)")

Now let’s add the geom geom_hex() and see what we get:

gg_bin0 +
  geom_hex()

While we get the idea - it is not very inspiring. But a few tweaks make all the difference.

First: there are too many bins i.e.: the aggregation is too granular. The argument bins in geom_hex() controls the number of bins in each direction, we’ll set it to 15.

Second: The color scheme is not ideal. We can use the color scheme “inferno” from viridis with scale_fill_viridis_c(option = "inferno"), or equivalently option = "B". This will help create greater contrast between the bins.

Third: a custom theme (but this is old news by now):

theme_bin <- theme(panel.grid = element_blank(), 
        panel.background = element_blank(),
        # make the letters bigger and change the angle 
        axis.title.y = element_text(hjust = 0, angle = 0, size = 13),
        axis.title.x = element_text(hjust = 0, size = 13),
        axis.text = element_text(size = 12),
        axis.ticks = element_blank())
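To preview a viridis color scheme before applying it, the scales package can generate and print its colors. This is a small sketch; the number of colors (5) is an arbitrary choice:

```r
library(scales)

# the option can be given by letter or by name
pal_a <- viridis_pal(option = "A")(5)             # "A" is magma
pal_b <- viridis_pal(option = "B")(5)             # "B" is inferno
pal_inferno <- viridis_pal(option = "inferno")(5)

# letter and name return the same colors
identical(pal_b, pal_inferno)

# print the colors
show_col(pal_inferno, ncol = 5)
```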

Applying these 3 concepts results in a much better plot:

gg_bin1 <- gg_bin0 +
  geom_hex(bins = 15,
           color = "slategrey") +
  scale_fill_viridis_c(option = "inferno" ) +
  theme_bin
  
gg_bin1

It is clear where the hot spots lie!

A quick glance tells us that the weight is skewed to the right and that the very heavy machines are mostly below 10 years old, while older machines hover just below the mean weight.

So if you’re running a regression, do not expect great results for heavy and old machines!

You may want to control the colors of the two count extremes. This is done with scale_fill_gradient(), which has two arguments, low and high. Let’s set them to “white” and “red” respectively:

gg_bin1 +  
  scale_fill_gradient(low = "white", 
                      high = "red")

## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

Home challenge

We finish this webcast with another home challenge.

Perhaps you would like to display the percentage values inside each hexagon.

Try to replicate the plot below which has bins = 8. This will test your Google and StackOverflow skills!

10 Conclusion and free online resources

In this web session you were introduced to ggplot2 as well as to some techniques to help your plots resonate with your audience.

Not surprisingly, we have only scratched the surface of what ggplot2 is capable of. By the same token - we only scratched the surface of what constitutes an effective visual.

The good news is that there are amazing free online resources that you can follow to keep making progress on these two fronts.

The list below is kept purposely short so that these texts are given the priority they deserve.

Syntax focussed

This is the book I have by my side every time I use ggplot2. It goes straight to the point with its problem/solution style - and it avoids many Google and Stack Overflow searches:

“R Graphics Cookbook, 2nd edition” by Winston Chang
https://r-graphics.org/

For a deeper dive into ggplot2 I wholeheartedly recommend the text straight from the horse’s mouth:

“ggplot2: Elegant Graphics for Data Analysis” by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen
https://ggplot2-book.org/

For a broader introduction to R and RStudio - including R Markdown (authoring documents) and Shiny (web applications):

“R for Data Science” by Hadley Wickham and Garrett Grolemund
https://r4ds.had.co.nz/index.html

Design focussed

Regardless of what software you will end up using in the future, this masterful text will significantly improve the way you design plots and communicate data:

“Fundamentals of Data Visualization” by Claus Wilke
https://clauswilke.com/dataviz/

The following book is not available for free but its website contains tons of free content (and the books are worth every penny):

“Storytelling with Data: A Data Visualization Guide for Business Professionals” by Cole Nussbaumer Knaflic
https://www.storytellingwithdata.com/books

In fact, once you start using R you won’t stop at producing plots: Automating Reports, Creating Web Apps, and Machine Learning are waiting in line.

Contact and links

Claudio Rebelo

claudio_rebelo@swissre.com

linkedin-claudio_rebelo

rstudio-webinar-rethink-reporting-with-automation

Presentation Disclaimer

Presentations are intended for educational purposes only and do not replace independent professional judgment.

Statements of fact and opinions expressed are those of the participants individually and, unless expressly stated to the contrary, are not the opinion or position of the Society of Actuaries, its cosponsors or its committees.

The Society of Actuaries does not endorse or approve, and assumes no responsibility for, the content, accuracy or completeness of the information presented. Attendees should note that the sessions are audio-recorded and may be published in various media, including print, audio and video formats without further notice.

Introduction to Effective Visuals with R (ggplot2) - SOA Webcast

Claudio Rebelo
