BlackRock ESG ETF Data: Part 3 - Finding Insight Through Aggregation

Author

Teal Emery

Part I: Setup

We’re going to keep the setup short this time, since we’ve done it before.

First, we load the tidyverse:

library(tidyverse) 
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Then we will load and clean our data the same way we did in the last lesson.

# assign the url to `github_raw_csv_url`
github_raw_csv_url <- "https://raw.githubusercontent.com/t-emery/sais-susfin_data/main/datasets/blackrock_etf_screener_2022-08-30.csv"

# read in the data, and assign it to the object `blackrock_etf_data`
blackrock_etf_data <- read_csv(github_raw_csv_url)
Rows: 393 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): ticker, name, incept_date, net_assets_as_of, asset_class, sub_asse...
dbl  (8): gross_expense_ratio_percent, net_expense_ratio_percent, net_assets...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
blackrock_etf_data <- blackrock_etf_data |> 
  # we are transforming both date columns (currently character strings) into date objects
  # so we can work with them.
  # this syntax is a bit confusing, but it selects all columns containing `date` and applies
  # the lubridate::mdy() function to them to turn them into date objects. 
  mutate(across(contains("date"), lubridate::mdy)) |>
  # Billions is a more useful magnitude than millions, so we'll create a column with 
  # the assets in billions by dividing `net_assets_usd_mn` by 1,000 (10^3).
  # If we wanted trillions, we could divide by 1,000,000 (10^6).
  mutate(net_assets_usd_bn = net_assets_usd_mn/10^3) |> 
  # this column doesn't add anything to our analysis - it says that the data is from 8/30/22
  select(-net_assets_as_of)

mini_blackrock_data <- blackrock_etf_data |> 
  # group by whether the fund is an ESG fund or not
  group_by(is_esg) |> 
  # take the top 5 from each group, by net assets
  slice_max(order_by = net_assets_usd_bn, n = 5) |> 
  # select the following columns 
  select(ticker, fund_name = name_wo_ishares_etf, asset_class, sub_asset_class, region, incept_date, net_assets_usd_bn,
         msci_weighted_average_carbon_intensity_tons_co2e_m_sales) |> 
  # rename to `co2_intensity` because the full name is a mouthful, if descriptive.
  rename(co2_intensity = msci_weighted_average_carbon_intensity_tons_co2e_m_sales) |> 
  # always good to ungroup() if you've used a group_by().  We'll discuss later.
  ungroup()
Adding missing grouping variables: `is_esg`

Part II: Finding Insights Through Aggregation

Understanding the Workflow

We’re going to preview a workflow on our mini dataset that we can use to understand our larger dataset.

  • What is Our Goal?: We’re looking at BlackRock’s ETF screener to uncover actionable insights that help explain their ESG business. The financial press writes a lot about it – but how big a part of its business is it? How fast is it growing?

  • Exploratory Data Analysis (EDA): attempts to uncover initial, relevant insights about our dataset. It may uncover facts (such as the total assets in ESG funds) that stand alone, or it may help generate hypotheses that can be examined through more rigorous statistical methods later (like the relationship between fund size & fund age). Here are key relationships we’ll explore in EDA:

    • Relative Size & Magnitude: If we are comparing two things, how big is one compared to the other? Is one the size of a dog and the other the size of a cat (different, but of similar magnitude)? Or is one the size of a mouse and the other the size of a giraffe (a big difference in magnitude)?

    • Ranking: What is the relative order? Everybody loves a ranking. Top 5 & Top 10 lists always draw eyes and are perpetual clickbait.

    • Growth: Maybe something is small now, but growing quickly. Or large, but shrinking? (A quick taste of this is sketched after this list.)

  • Finding Narrative Through Aggregation: By aggregating together our data into meaningful groups (ESG vs. Regular Funds, by asset class, by year launched) we can unlock a hidden narrative of larger trends implicit in our dataset.

  • Outputs:

    • Data visualizations: a picture is worth 1,000 words, right? But for real, data visualizations are very efficient ways to convey a lot of information.

    • Emotive factoids: we’re looking for numbers that help make the data real to you and other readers. Emotive factoids are the way that you’ll convey information when you are writing and speaking and can’t use data visualizations. Here’s an OpEd I wrote for a South African financial newspaper – spot the emotive factoids.
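
As a quick taste of the growth angle, here’s a minimal sketch that counts how many of today’s funds were launched each year, by fund type, using the blackrock_etf_data we loaded above. Counting surviving funds by inception year isn’t a true growth rate, but it hints at how quickly the ESG lineup has been built out.

blackrock_etf_data |> 
  # pull the launch year out of the inception date
  mutate(incept_year = lubridate::year(incept_date)) |> 
  # count funds launched per year, by fund type
  count(incept_year, is_esg) |> 
  # show the most recent years first
  arrange(desc(incept_year))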

Looking at Our Mini Dataset

mini_blackrock_data |>
  glimpse()
Rows: 10
Columns: 9
$ is_esg            <chr> "ESG Fund", "ESG Fund", "ESG Fund", "ESG Fund", "ESG…
$ ticker            <chr> "ESGU", "ESGD", "ICLN", "ESGE", "DSI", "IVV", "IEFA"…
$ fund_name         <chr> "ESG Aware MSCI USA", "ESG Aware MSCI EAFE", "Global…
$ asset_class       <chr> "Equity", "Equity", "Equity", "Equity", "Equity", "E…
$ sub_asset_class   <chr> "Large/Mid Cap", "Large/Mid Cap", "All Cap", "Large/…
$ region            <chr> "North America", "Global", "Global", "Global", "Nort…
$ incept_date       <date> 2016-12-01, 2016-06-28, 2008-06-24, 2016-06-28, 200…
$ net_assets_usd_bn <dbl> 22.376688, 6.425673, 5.628011, 4.233544, 3.688930, 2…
$ co2_intensity     <dbl> 103.60, 103.83, 265.82, 167.71, 72.73, 148.34, 126.6…

Let’s select just the columns we need to compare fund assets between ESG funds and Regular Funds, and by region.

funds_by_assets_bn <- mini_blackrock_data |> 
  select(fund_name, is_esg, region, net_assets_usd_bn) 

funds_by_assets_bn
# A tibble: 10 × 4
   fund_name                  is_esg       region        net_assets_usd_bn
   <chr>                      <chr>        <chr>                     <dbl>
 1 ESG Aware MSCI USA         ESG Fund     North America             22.4 
 2 ESG Aware MSCI EAFE        ESG Fund     Global                     6.43
 3 Global Clean Energy        ESG Fund     Global                     5.63
 4 ESG Aware MSCI EM          ESG Fund     Global                     4.23
 5 MSCI KLD 400 Social        ESG Fund     North America              3.69
 6 Core S&P 500               Regular Fund North America            298.  
 7 Core MSCI EAFE             Regular Fund Global                    84.2 
 8 Core U.S. Aggregate Bond   Regular Fund North America             82.3 
 9 Core S&P Small-Cap         Regular Fund North America             66.5 
10 Core MSCI Emerging Markets Regular Fund Global                    64.9 

Feature Engineering

Feature engineering is a fancy way to say that you’re making new variables from the existing variables using dplyr::mutate().

In exploratory data analysis, finding meaningful ways to describe datapoints is essential. We will focus on five common forms here.

  • Rank: Who is first? Who is last? Everybody wants to know!
  • Z-score: A great way to standardize data because it has an intuitive meaning. 0 is the average. 1 is 1 standard deviation above the average. -1 is 1 standard deviation below the average.
  • Percentile: Can be a helpful way to quantify where a value is in a distribution, especially when relative order matters most.
  • Quantile: Can be great for creating groupings based on relative rank. We will use 5 groups by default (quintiles), but use whatever number makes most sense in your situation. Quartiles (4) and deciles (10) are also commonly used.
  • Percent of total: How much of the total assets does the largest fund account for?

funds_by_assets_bn |> 
  mutate(value_z_score = (net_assets_usd_bn-mean(net_assets_usd_bn))/sd(net_assets_usd_bn),
         value_rank = rank(net_assets_usd_bn),
         value_percentile = percent_rank(net_assets_usd_bn),
         value_ntile_5 = ntile(net_assets_usd_bn, 5),
         value_pct_total = net_assets_usd_bn/sum(net_assets_usd_bn)) |> 
  arrange(value_rank)
# A tibble: 10 × 9
   fund_name       is_esg region net_a…¹ value…² value…³ value…⁴ value…⁵ value…⁶
   <chr>           <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <int>   <dbl>
 1 MSCI KLD 400 S… ESG F… North…    3.69 -0.677        1   0           1 0.00578
 2 ESG Aware MSCI… ESG F… Global    4.23 -0.671        2   0.111       1 0.00664
 3 Global Clean E… ESG F… Global    5.63 -0.655        3   0.222       2 0.00882
 4 ESG Aware MSCI… ESG F… Global    6.43 -0.646        4   0.333       2 0.0101 
 5 ESG Aware MSCI… ESG F… North…   22.4  -0.466        5   0.444       3 0.0351 
 6 Core MSCI Emer… Regul… Global   64.9   0.0126       6   0.556       3 0.102  
 7 Core S&P Small… Regul… North…   66.5   0.0307       7   0.667       4 0.104  
 8 Core U.S. Aggr… Regul… North…   82.3   0.209        8   0.778       4 0.129  
 9 Core MSCI EAFE  Regul… Global   84.2   0.230        9   0.889       5 0.132  
10 Core S&P 500    Regul… North…  298.    2.63        10   1           5 0.467  
# … with abbreviated variable names ¹​net_assets_usd_bn, ²​value_z_score,
#   ³​value_rank, ⁴​value_percentile, ⁵​value_ntile_5, ⁶​value_pct_total

You’ll notice that by default, the smallest value comes first: it gets rank 1, percentile 0, and lands in the first quantile group. If you’re measuring which company has the smallest carbon footprint, this is what you want. If you’re ranking countries by GDP, you’d want the opposite. Always check this.

funds_by_assets_bn |> 
  mutate(value_z_score = (net_assets_usd_bn-mean(net_assets_usd_bn))/sd(net_assets_usd_bn) 
         * -1, # multiply by negative 1
         value_rank = rank(net_assets_usd_bn * -1), # multiply by negative 1
         value_percentile = percent_rank(net_assets_usd_bn * -1), # multiply by negative 1
         value_ntile_5 = ntile(net_assets_usd_bn * -1, 5), # multiply by negative 1
         value_pct_total = net_assets_usd_bn/sum(net_assets_usd_bn) # not relevant
         ) |> 
  arrange(value_rank)
# A tibble: 10 × 9
   fund_name       is_esg region net_a…¹ value…² value…³ value…⁴ value…⁵ value…⁶
   <chr>           <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <int>   <dbl>
 1 Core S&P 500    Regul… North…  298.   -2.63         1   0           1 0.467  
 2 Core MSCI EAFE  Regul… Global   84.2  -0.230        2   0.111       1 0.132  
 3 Core U.S. Aggr… Regul… North…   82.3  -0.209        3   0.222       2 0.129  
 4 Core S&P Small… Regul… North…   66.5  -0.0307       4   0.333       2 0.104  
 5 Core MSCI Emer… Regul… Global   64.9  -0.0126       5   0.444       3 0.102  
 6 ESG Aware MSCI… ESG F… North…   22.4   0.466        6   0.556       3 0.0351 
 7 ESG Aware MSCI… ESG F… Global    6.43  0.646        7   0.667       4 0.0101 
 8 Global Clean E… ESG F… Global    5.63  0.655        8   0.778       4 0.00882
 9 ESG Aware MSCI… ESG F… Global    4.23  0.671        9   0.889       5 0.00664
10 MSCI KLD 400 S… ESG F… North…    3.69  0.677       10   1           5 0.00578
# … with abbreviated variable names ¹​net_assets_usd_bn, ²​value_z_score,
#   ³​value_rank, ⁴​value_percentile, ⁵​value_ntile_5, ⁶​value_pct_total

You’ll have to exercise judgment as to which direction makes sense for any given indicator. Whatever you choose, be clear about what larger/smaller means and whether it will be intuitive to you and the people you’re presenting your analysis to.
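
As an aside, dplyr’s desc() helper is an equivalent way to flip the direction, and min_rank() is dplyr’s version of base R’s rank() (they differ only in how ties are handled: rank() averages them, min_rank() takes the minimum). Here’s a minimal sketch of the same descending rank:

funds_by_assets_bn |> 
  # desc() flips the sort order, so the largest fund gets rank 1
  mutate(value_rank = min_rank(desc(net_assets_usd_bn))) |> 
  arrange(value_rank)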

Building Functions: Work Hard to Be Lazy

If you find yourself copying and pasting code more than 3 times, it’s time to make a function. The Functions chapter in R for Data Science (2e) provides more details.

You don’t need to worry about the details for now, but here’s a brief guide.

# add the argument
plus_one <- function(x) {
  # define what you want it to do
  x + 1
}

plus_one(7)
[1] 8

The function above is called plus_one(). It takes a number you provide and adds 1 to it. When we provide 7 as an argument, it returns 8.
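
One more building block before we get to the real function: arguments can have default values that are used unless you override them. Here’s a minimal sketch (the function name plus_n is just for illustration):

plus_n <- function(x, n = 1) {
  # n defaults to 1 unless you provide another value
  x + n
}

plus_n(7)        # uses the default, returns 8
plus_n(7, n = 3) # overrides the default, returns 10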

Function 1: add_rank_features()

Our function, add_rank_features(), is just a little more complicated. Don’t worry about understanding everything right now. Here are the important features:

  • Like all tidyverse-compliant functions, the data is the first argument so that it can be added to a data processing pipeline using the pipe operator |>.

  • The second argument, value_var, will by default look for a column named value, unless you provide the name of another column holding the numeric values you want to use.

  • The third argument lets you choose whether to rank by lowest value (the default) or by highest value (set rank_by_lowest = FALSE).

  • The fourth argument sets the number of quantiles. The default is set to 5.

add_rank_features <- function(data, value_var = value, rank_by_lowest = TRUE, 
                              quantile_n = 5) {
  
  # If rank_by_lowest is set to TRUE, as it is by default, it will use this first code chunk
  if (rank_by_lowest) {
    data |> 
      mutate("{{value_var}}_z_score" := ({{ value_var }}-mean({{ value_var }}))/
               sd({{ value_var }}),
             "{{value_var}}_rank" := rank({{ value_var }}), 
             "{{value_var}}_percentile" := percent_rank({{ value_var }}),
             "{{value_var}}_quantile_{{quantile_n}}" := ntile({{ value_var }}, 
                                                              {{ quantile_n }}),
             "{{value_var}}_pct_total" := {{ value_var }}/sum({{ value_var }})) |> 
      arrange({{ value_var }})
    
  # If rank_by_lowest is set to FALSE, it will use this second code chunk
  } else {
    data |> 
      mutate("{{value_var}}_z_score" := ({{ value_var }}-mean({{ value_var }}))/
               sd({{ value_var }}) * -1, # multiply by -1
             # put negative sign
             "{{value_var}}_rank" := rank(-{{ value_var }}), 
             # put negative sign
             "{{value_var}}_percentile" := percent_rank(-{{ value_var }}),
             # put negative sign
             "{{value_var}}_quantile_{{quantile_n}}" := ntile(-{{ value_var }}, 
                                                              {{ quantile_n }}), 
             # no change, not relevant for pct_total
             "{{value_var}}_pct_total" := {{ value_var }}/sum({{ value_var }})) |> 
      arrange({{ value_var }} * -1)
  }
}
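
If you’re curious about the unfamiliar syntax inside the function: {{ value_var }} is tidyverse “embracing,” which tells dplyr to use whichever column you pass in as the argument, and "{{value_var}}_rank" := builds the new column’s name from that argument. Here’s a stripped-down sketch of the same pattern (the function name add_rank is just for illustration):

add_rank <- function(data, value_var = value) {
  data |> 
    # {{ }} injects the column you passed in; := lets the new name be built from it
    mutate("{{value_var}}_rank" := rank({{ value_var }}))
}

funds_by_assets_bn |> 
  add_rank(value_var = net_assets_usd_bn)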

And now we can be deservedly lazy. If we just want the default behavior of the function, we can simply add it onto a data pipeline.

funds_by_assets_bn |> 
  add_rank_features(value_var = net_assets_usd_bn)
# A tibble: 10 × 9
   fund_name       is_esg region net_a…¹ net_a…² net_a…³ net_a…⁴ net_a…⁵ net_a…⁶
   <chr>           <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <int>   <dbl>
 1 MSCI KLD 400 S… ESG F… North…    3.69 -0.677        1   0           1 0.00578
 2 ESG Aware MSCI… ESG F… Global    4.23 -0.671        2   0.111       1 0.00664
 3 Global Clean E… ESG F… Global    5.63 -0.655        3   0.222       2 0.00882
 4 ESG Aware MSCI… ESG F… Global    6.43 -0.646        4   0.333       2 0.0101 
 5 ESG Aware MSCI… ESG F… North…   22.4  -0.466        5   0.444       3 0.0351 
 6 Core MSCI Emer… Regul… Global   64.9   0.0126       6   0.556       3 0.102  
 7 Core S&P Small… Regul… North…   66.5   0.0307       7   0.667       4 0.104  
 8 Core U.S. Aggr… Regul… North…   82.3   0.209        8   0.778       4 0.129  
 9 Core MSCI EAFE  Regul… Global   84.2   0.230        9   0.889       5 0.132  
10 Core S&P 500    Regul… North…  298.    2.63        10   1           5 0.467  
# … with abbreviated variable names ¹​net_assets_usd_bn,
#   ²​net_assets_usd_bn_z_score, ³​net_assets_usd_bn_rank,
#   ⁴​net_assets_usd_bn_percentile, ⁵​net_assets_usd_bn_quantile_5,
#   ⁶​net_assets_usd_bn_pct_total
funds_by_assets_bn |> 
  add_rank_features(value_var = net_assets_usd_bn, 
                    # Now it will rank by highest value equals 1
                    rank_by_lowest = FALSE, 
                    # let's divide it into 10 quantiles (deciles)
                    quantile_n = 10)
# A tibble: 10 × 9
   fund_name       is_esg region net_a…¹ net_a…² net_a…³ net_a…⁴ net_a…⁵ net_a…⁶
   <chr>           <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <int>   <dbl>
 1 Core S&P 500    Regul… North…  298.   -2.63         1   0           1 0.467  
 2 Core MSCI EAFE  Regul… Global   84.2  -0.230        2   0.111       2 0.132  
 3 Core U.S. Aggr… Regul… North…   82.3  -0.209        3   0.222       3 0.129  
 4 Core S&P Small… Regul… North…   66.5  -0.0307       4   0.333       4 0.104  
 5 Core MSCI Emer… Regul… Global   64.9  -0.0126       5   0.444       5 0.102  
 6 ESG Aware MSCI… ESG F… North…   22.4   0.466        6   0.556       6 0.0351 
 7 ESG Aware MSCI… ESG F… Global    6.43  0.646        7   0.667       7 0.0101 
 8 Global Clean E… ESG F… Global    5.63  0.655        8   0.778       8 0.00882
 9 ESG Aware MSCI… ESG F… Global    4.23  0.671        9   0.889       9 0.00664
10 MSCI KLD 400 S… ESG F… North…    3.69  0.677       10   1          10 0.00578
# … with abbreviated variable names ¹​net_assets_usd_bn,
#   ²​net_assets_usd_bn_z_score, ³​net_assets_usd_bn_rank,
#   ⁴​net_assets_usd_bn_percentile, ⁵​net_assets_usd_bn_quantile_10,
#   ⁶​net_assets_usd_bn_pct_total

Creating Features by Group

Sometimes you want to compare features by groupings. What is the largest ESG Fund?

funds_by_assets_bn |> 
  group_by(is_esg) |> 
  add_rank_features(value_var = net_assets_usd_bn, 
                    # Now it will rank by highest value equals 1
                    rank_by_lowest = FALSE)
# A tibble: 10 × 9
# Groups:   is_esg [2]
   fund_name       is_esg region net_a…¹ net_a…² net_a…³ net_a…⁴ net_a…⁵ net_a…⁶
   <chr>           <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <int>   <dbl>
 1 Core S&P 500    Regul… North…  298.    -1.78        1    0          1  0.500 
 2 Core MSCI EAFE  Regul… Global   84.2    0.348       2    0.25       2  0.141 
 3 Core U.S. Aggr… Regul… North…   82.3    0.367       3    0.5        3  0.138 
 4 Core S&P Small… Regul… North…   66.5    0.525       4    0.75       4  0.112 
 5 Core MSCI Emer… Regul… Global   64.9    0.541       5    1          5  0.109 
 6 ESG Aware MSCI… ESG F… North…   22.4   -1.77        1    0          1  0.528 
 7 ESG Aware MSCI… ESG F… Global    6.43   0.261       2    0.25       2  0.152 
 8 Global Clean E… ESG F… Global    5.63   0.362       3    0.5        3  0.133 
 9 ESG Aware MSCI… ESG F… Global    4.23   0.540       4    0.75       4  0.100 
10 MSCI KLD 400 S… ESG F… North…    3.69   0.609       5    1          5  0.0871
# … with abbreviated variable names ¹​net_assets_usd_bn,
#   ²​net_assets_usd_bn_z_score, ³​net_assets_usd_bn_rank,
#   ⁴​net_assets_usd_bn_percentile, ⁵​net_assets_usd_bn_quantile_5,
#   ⁶​net_assets_usd_bn_pct_total
Understanding the Output

What is the percent of the grouped total for the largest ESG fund and for the largest non-ESG fund?

funds_by_assets_bn |> 
  group_by(is_esg, region) |> 
  add_rank_features(value_var = net_assets_usd_bn, 
                    # Now it will rank by highest value equals 1
                    rank_by_lowest = FALSE) |> 
  # arrange first by is_esg, second by region, and third by rank so you can easily
  # see the ranking within each sub-group
  arrange(is_esg, region, net_assets_usd_bn_rank)
# A tibble: 10 × 9
# Groups:   is_esg, region [4]
   fund_name       is_esg region net_a…¹ net_a…² net_a…³ net_a…⁴ net_a…⁵ net_a…⁶
   <chr>           <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <int>   <dbl>
 1 ESG Aware MSCI… ESG F… Global    6.43  -0.898       1     0         1   0.395
 2 Global Clean E… ESG F… Global    5.63  -0.179       2     0.5       2   0.346
 3 ESG Aware MSCI… ESG F… Global    4.23   1.08        3     1         3   0.260
 4 ESG Aware MSCI… ESG F… North…   22.4   -0.707       1     0         1   0.858
 5 MSCI KLD 400 S… ESG F… North…    3.69   0.707       2     1         2   0.142
 6 Core MSCI EAFE  Regul… Global   84.2   -0.707       1     0         1   0.565
 7 Core MSCI Emer… Regul… Global   64.9    0.707       2     1         2   0.435
 8 Core S&P 500    Regul… North…  298.    -1.15        1     0         1   0.667
 9 Core U.S. Aggr… Regul… North…   82.3    0.515       2     0.5       2   0.184
10 Core S&P Small… Regul… North…   66.5    0.637       3     1         3   0.149
# … with abbreviated variable names ¹​net_assets_usd_bn,
#   ²​net_assets_usd_bn_z_score, ³​net_assets_usd_bn_rank,
#   ⁴​net_assets_usd_bn_percentile, ⁵​net_assets_usd_bn_quantile_5,
#   ⁶​net_assets_usd_bn_pct_total
Understanding the Output
  • What is the second largest ESG fund with a global focus?

  • What other grouping might be useful to try with this data? Try it.

Aggregation: Summarizing Data By Group

In exploratory data analysis we’re trying to make sense of the data: starting from a large number of individual datapoints and looking for patterns that show structure in the data and tell a story. We’re trying to generate hypotheses that can be tested more rigorously later.

What is the average fund size by fund type?

funds_by_assets_bn |> 
  group_by(is_esg) |> 
  # always default to using na.rm = TRUE, otherwise NA values 
  # (very common in the wild) will trip up your calculations
  summarize(avg_assets = mean(net_assets_usd_bn, na.rm = TRUE))
# A tibble: 2 × 2
  is_esg       avg_assets
  <chr>             <dbl>
1 ESG Fund           8.47
2 Regular Fund     119.  

We can expand this to use a few more common summary statistics:

funds_by_assets_bn |> 
  group_by(is_esg) |> 
  summarize(avg = mean(net_assets_usd_bn, na.rm = TRUE),
            std_dev = sd(net_assets_usd_bn, na.rm = TRUE),
            min = min(net_assets_usd_bn, na.rm = TRUE),
            max = max(net_assets_usd_bn, na.rm = TRUE),
            pctile_25 = quantile(net_assets_usd_bn, .25, na.rm = TRUE),
            # the median is the same as the 50th percentile
            median = median(net_assets_usd_bn, na.rm = TRUE),
            pctile_75 = quantile(net_assets_usd_bn, .75, na.rm = TRUE))
# A tibble: 2 × 8
  is_esg          avg std_dev   min   max pctile_25 median pctile_75
  <chr>         <dbl>   <dbl> <dbl> <dbl>     <dbl>  <dbl>     <dbl>
1 ESG Fund       8.47    7.85  3.69  22.4      4.23   5.63      6.43
2 Regular Fund 119.    100.   64.9  298.      66.5   82.3      84.2 

Function 2: calculate_summary_stats()

Similar to above, we will work hard to be lazy and write a function.

calculate_summary_stats <- function(data, value_var = value) {
  data |> 
    summarize(avg = mean({{ value_var }}, na.rm = TRUE),
              std_dev = sd({{ value_var }}, na.rm = TRUE),
              min = min({{ value_var }}, na.rm = TRUE),
              max = max({{ value_var }}, na.rm = TRUE),
              pctile_25 = quantile({{ value_var }}, .25, na.rm = TRUE),
              median = median({{ value_var }}, na.rm = TRUE),
              pctile_75 = quantile({{ value_var }}, .75, na.rm = TRUE)) |> 
  # ungrouping here is helpful so we don't have to remember to do it afterwards. Saves time. 
  ungroup()
}

Let’s give this a try:

funds_by_assets_bn |> 
  group_by(is_esg) |> 
  calculate_summary_stats(value_var = net_assets_usd_bn)
# A tibble: 2 × 8
  is_esg          avg std_dev   min   max pctile_25 median pctile_75
  <chr>         <dbl>   <dbl> <dbl> <dbl>     <dbl>  <dbl>     <dbl>
1 ESG Fund       8.47    7.85  3.69  22.4      4.23   5.63      6.43
2 Regular Fund 119.    100.   64.9  298.      66.5   82.3      84.2 
funds_by_assets_bn |> 
  group_by(is_esg, region) |> 
  calculate_summary_stats(value_var = net_assets_usd_bn)
`summarise()` has grouped output by 'is_esg'. You can override using the
`.groups` argument.
# A tibble: 4 × 9
  is_esg       region           avg std_dev   min    max pctile…¹ median pctil…²
  <chr>        <chr>          <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>
1 ESG Fund     Global          5.43    1.11  4.23   6.43     4.93   5.63    6.03
2 ESG Fund     North America  13.0    13.2   3.69  22.4      8.36  13.0    17.7 
3 Regular Fund Global         74.6    13.6  64.9   84.2     69.7   74.6    79.4 
4 Regular Fund North America 149.    129.   66.5  298.      74.4   82.3   190.  
# … with abbreviated variable names ¹​pctile_25, ²​pctile_75
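
(The `summarise()` message above isn’t an error: with two grouping variables, summarize() keeps the output grouped by the first one, and the ungroup() inside our function then removes it.)
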
Understanding the Output
  • What is the size of the median North American ESG fund in our sample?

Comparing Data Aggregates

pivot_longer() and pivot_wider() are your friends. Practice them, and get to know them well.

What you need to know:

  • pivot_longer(): you need to specify the columns you want pivoted longer. By default the column names will be put in a column called name, and the values will be put in a column called value. You can specify your own column names if you’d like.
  • pivot_wider(): you need to specify the column the new column names will come from using names_from, and the column the values will come from using values_from. (A toy example of both follows this list.)
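
Here’s a toy example of both pivots on a made-up tibble (the names toy and toy_long are just for illustration):

toy <- tibble(group = c("a", "b"), x = 1:2, y = 3:4)

# pivot x and y into the default name/value columns
toy_long <- toy |> 
  pivot_longer(cols = x:y)

# and back: column names come from `name`, values from `value`
toy_long |> 
  pivot_wider(names_from = name, values_from = value)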

Spend some time going through the vignette explaining the features of the pivot_ functions. It’s worth your time and will enable you to do cool stuff faster.

vignette("pivot")
starting httpd help server ... done

Now we’re going to walk through the workflow step-by-step:

Step 1: Calculate Summary Statistics

We’ve already done this above. Easy.

funds_by_assets_bn |> 
  group_by(is_esg) |> 
  calculate_summary_stats(value_var = net_assets_usd_bn) 
# A tibble: 2 × 8
  is_esg          avg std_dev   min   max pctile_25 median pctile_75
  <chr>         <dbl>   <dbl> <dbl> <dbl>     <dbl>  <dbl>     <dbl>
1 ESG Fund       8.47    7.85  3.69  22.4      4.23   5.63      6.43
2 Regular Fund 119.    100.   64.9  298.      66.5   82.3      84.2 

Step 2: Pivot the Data into Long Form Using pivot_longer()

Looking at the data above, we want to compare the two groups: ESG Funds vs. Regular Funds. To do that more easily, we need the two categories in is_esg to become the column headers, with the summary statistics as rows, so we can use mutate() to calculate comparisons.

This is a two-step process. The first step is gathering the data aggregates into a name column and a value column.

funds_by_assets_bn |> 
  # group by the feature(s) you want summary statistics for
  group_by(is_esg) |> 
  # use the function we built to calculate those summary stats.
  calculate_summary_stats(value_var = net_assets_usd_bn) |> 
  # pivot the data longer (tidy format) so that all of the summary stat names are in one column and the values are in another.
  pivot_longer(cols = avg:pctile_75, names_to = "aggregates")
# A tibble: 14 × 3
   is_esg       aggregates  value
   <chr>        <chr>       <dbl>
 1 ESG Fund     avg          8.47
 2 ESG Fund     std_dev      7.85
 3 ESG Fund     min          3.69
 4 ESG Fund     max         22.4 
 5 ESG Fund     pctile_25    4.23
 6 ESG Fund     median       5.63
 7 ESG Fund     pctile_75    6.43
 8 Regular Fund avg        119.  
 9 Regular Fund std_dev    100.  
10 Regular Fund min         64.9 
11 Regular Fund max        298.  
12 Regular Fund pctile_25   66.5 
13 Regular Fund median      82.3 
14 Regular Fund pctile_75   84.2 

Step 3: Pivot the Groups You Want to Compare into Wide Format Using pivot_wider()

funds_by_assets_bn |> 
  # group by the feature(s) you want summary statistics for
  group_by(is_esg) |> 
  # use the function we built to calculate those summary stats.
  calculate_summary_stats(value_var = net_assets_usd_bn) |> 
  # pivot the data longer (tidy format) so that all of the summary stat names are in one column and the values are in another.
  pivot_longer(cols = avg:pctile_75, names_to = "aggregates") |> 
  # then pivot the data into wide format where the two groups you are comparing are columns
  pivot_wider(names_from = is_esg, values_from = value)
# A tibble: 7 × 3
  aggregates `ESG Fund` `Regular Fund`
  <chr>           <dbl>          <dbl>
1 avg              8.47          119. 
2 std_dev          7.85          100. 
3 min              3.69           64.9
4 max             22.4           298. 
5 pctile_25        4.23           66.5
6 median           5.63           82.3
7 pctile_75        6.43           84.2

Step 4: Use janitor::clean_names() to Make the Column Names into snake_case

It’s a lot easier to work with column names in R if you put them into snake_case (lowercase, connected by underscores). Otherwise you need to use lots of quotation marks, and it’s easy to mess up. The janitor package has a lot of great helper functions for cleaning your data, and clean_names() is an especially useful one to know. If you haven’t yet installed it, run install.packages("janitor") in your console before running the code below.

funds_by_assets_bn |> 
  # group by the feature(s) you want summary statistics for
  group_by(is_esg) |> 
  # use the function we built to calculate those summary stats.
  calculate_summary_stats(value_var = net_assets_usd_bn) |> 
  # pivot the data longer (tidy format) so that all of the summary stat names are in one column and the values are in another.
  pivot_longer(cols = avg:pctile_75, names_to = "aggregates") |> 
  # then pivot the data into wide format where the two groups you are comparing are columns
  pivot_wider(names_from = is_esg, values_from = value) |> 
   # make names snake_case which is easier to work with in the mutate function
  janitor::clean_names() 
# A tibble: 7 × 3
  aggregates esg_fund regular_fund
  <chr>         <dbl>        <dbl>
1 avg            8.47        119. 
2 std_dev        7.85        100. 
3 min            3.69         64.9
4 max           22.4         298. 
5 pctile_25      4.23         66.5
6 median         5.63         82.3
7 pctile_75      6.43         84.2

Step 5: Use mutate() to Make Meaningful Comparisons of the Groups

You’ll need to use your judgment to figure out the best manner to compare the two groups given the context of the data.

Here, one group is clearly much larger than the other. So two ways of comparing the groups come to mind:

  1. How many times larger are the regular funds versus the ESG funds?

  2. What percentage of the size of the regular funds do the ESG funds constitute?

funds_by_assets_bn |> 
  # group by the feature(s) you want summary statistics for
  group_by(is_esg) |> 
  # use the function we built to calculate those summary stats.
  calculate_summary_stats(value_var = net_assets_usd_bn) |> 
  # pivot the data longer (tidy format) so that all of the summary stat names are in one column and the values are in another.
  pivot_longer(cols = avg:pctile_75, names_to = "aggregates") |> 
  # then pivot the data into wide format where the two groups you are comparing are columns
  pivot_wider(names_from = is_esg, values_from = value) |> 
   # make names snake_case which is easier to work with in the mutate function
  janitor::clean_names() |> 
  # use your judgment to figure out the best way to compare the two groups
  mutate(esg_fund_pct_of_regular = esg_fund/regular_fund * 100,
         regular_times_larger_than_esg = regular_fund/esg_fund)
# A tibble: 7 × 5
  aggregates esg_fund regular_fund esg_fund_pct_of_regular regular_times_large…¹
  <chr>         <dbl>        <dbl>                   <dbl>                 <dbl>
1 avg            8.47        119.                     7.11                  14.1
2 std_dev        7.85        100.                     7.83                  12.8
3 min            3.69         64.9                    5.68                  17.6
4 max           22.4         298.                     7.52                  13.3
5 pctile_25      4.23         66.5                    6.36                  15.7
6 median         5.63         82.3                    6.83                  14.6
7 pctile_75      6.43         84.2                    7.63                  13.1
# … with abbreviated variable name ¹​regular_times_larger_than_esg
Understanding the Output
  • How much larger is the largest regular fund than the largest ESG fund?

  • How much bigger is the average regular fund in our sample than the largest ESG fund?

Which ones seem the most relevant to you?

Practice writing these as “emotive factoids” in sentences. For example: in our sample, the average regular fund is roughly 14 times larger than the average ESG fund.

Part III: Your Turn

Challenge 1: Use These Tools on the Full Dataset

Now that you understand the toolkit, try it out on the full blackrock_etf_data dataset.

Challenge 2: Comparing the Holdings of the Largest ESG Fund Versus the Largest Regular Fund

The class GitHub data repository has a dataset comparing the holdings of the largest ESG fund (ESGU) and the largest standard fund (IVV). Test out your newfound skills by doing initial exploratory data analysis on this dataset.

etf_comparison_data_github_url <- "https://raw.githubusercontent.com/t-emery/sais-susfin_data/main/datasets/etf_comparison-2022-10-03.csv"

etf_comparison <- read_csv(etf_comparison_data_github_url)
Rows: 537 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ticker, company_name, sector, esg_uw_ow
dbl (7): esg_etf, standard_etf, esg_tilt, esg_tilt_z_score, esg_tilt_rank, e...
lgl (3): in_esg_only, in_standard_only, in_on_index_only

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
etf_comparison
# A tibble: 537 × 14
   ticker compa…¹ sector esg_etf stand…² in_es…³ in_st…⁴ in_on…⁵ esg_t…⁶ esg_u…⁷
   <chr>  <chr>   <chr>    <dbl>   <dbl> <lgl>   <lgl>   <lgl>     <dbl> <chr>  
 1 PRU    PRUDEN… Finan…   0.537  0.106  FALSE   FALSE   FALSE     0.431 Overwe…
 2 GIS    GENERA… Consu…   0.552  0.151  FALSE   FALSE   FALSE     0.401 Overwe…
 3 K      KELLOGG Consu…   0.453  0.0592 FALSE   FALSE   FALSE     0.394 Overwe…
 4 ADP    AUTOMA… Infor…   0.649  0.312  FALSE   FALSE   FALSE     0.337 Overwe…
 5 ECL    ECOLAB… Mater…   0.441  0.118  FALSE   FALSE   FALSE     0.322 Overwe…
 6 JCI    JOHNSO… Indus…   0.416  0.112  FALSE   FALSE   FALSE     0.304 Overwe…
 7 ES     EVERSO… Utili…   0.392  0.0896 FALSE   FALSE   FALSE     0.302 Overwe…
 8 PEG    PUBLIC… Utili…   0.376  0.0929 FALSE   FALSE   FALSE     0.284 Overwe…
 9 RTX    RAYTHE… Indus…   0.677  0.401  FALSE   FALSE   FALSE     0.277 Overwe…
10 LNG    CHENIE… Energy   0.274  0      TRUE    FALSE   TRUE      0.274 Overwe…
# … with 527 more rows, 4 more variables: esg_tilt_z_score <dbl>,
#   esg_tilt_rank <dbl>, esg_tilt_percentile <dbl>, esg_tilt_quantile_5 <dbl>,
#   and abbreviated variable names ¹​company_name, ²​standard_etf, ³​in_esg_only,
#   ⁴​in_standard_only, ⁵​in_on_index_only, ⁶​esg_tilt, ⁷​esg_uw_ow