BlackRock ESG ETF Data, Part 2: Exploratory Data Analysis Using Data Visualization

Author

Teal Emery

Part I: Setup

Load R Packages

At the start of any R session, load the packages you are going to use. Here, we are just using the tidyverse, which automatically loads a set of packages useful for working with your data.

library(tidyverse) 
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Load our data

Like, last time, we’re going to:

  1. Read in full the dataset from the GitHub repo.

  2. Create a mini version of the dataset for illustrating key ideas.

# assign the url to `github_raw_csv_url`
github_raw_csv_url <- "https://raw.githubusercontent.com/t-emery/sais-susfin_data/main/datasets/blackrock_etf_screener_2022-08-30.csv"

# read in the data, and assign it to the object `blackrock_etf_data`
blackrock_etf_data <- read_csv(github_raw_csv_url)
Rows: 393 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): ticker, name, incept_date, net_assets_as_of, asset_class, sub_asse...
dbl  (8): gross_expense_ratio_percent, net_expense_ratio_percent, net_assets...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# print out the tibble to take a look at the data
blackrock_etf_data
# A tibble: 393 × 22
   ticker name    incep…¹ gross…² net_e…³ net_a…⁴ net_a…⁵ asset…⁶ sub_a…⁷ region
   <chr>  <chr>   <chr>     <dbl>   <dbl>   <dbl> <chr>   <chr>   <chr>   <chr> 
 1 IVV    iShare… 5/15/00    0.03    0.03 297663. 8/30/22 Equity  Large … North…
 2 IEFA   iShare… 10/18/…    0.07    0.07  84222. 8/30/22 Equity  All Cap Global
 3 AGG    iShare… 9/22/03    0.04    0.03  82344. 8/30/22 Fixed … Multi … North…
 4 IJR    iShare… 5/22/00    0.06    0.06  66533. 8/30/22 Equity  Small … North…
 5 IEMG   iShare… 10/18/…    0.09    0.09  64920. 8/30/22 Equity  All Cap Global
 6 IWF    iShare… 5/22/00    0.18    0.18  61831. 8/30/22 Equity  Large/… North…
 7 IJH    iShare… 5/22/00    0.05    0.05  61424. 8/30/22 Equity  Mid Cap North…
 8 IWM    iShare… 5/22/00    0.19    0.19  53048. 8/30/22 Equity  Small … North…
 9 IWD    iShare… 5/22/00    0.18    0.18  51913. 8/30/22 Equity  Large/… North…
10 EFA    iShare… 8/14/01    0.32    0.32  44144. 8/30/22 Equity  Large/… Global
# … with 383 more rows, 12 more variables: market <chr>, location <chr>,
#   investment_style <chr>,
#   sustainability_characteristics_msci_esg_fund_ratings_msci_esg_fund_rating_aaa_ccc <chr>,
#   msci_esg_quality_score_0_10 <dbl>,
#   msci_weighted_average_carbon_intensity_tons_co2e_m_sales <dbl>,
#   msci_esg_percent_coverage <dbl>, sustainable_classification <chr>,
#   name_wo_ishares_etf <chr>, is_esg <chr>, years_from_inception <dbl>, …

An Initial Taste of Data Cleaning

One small difference: this time we’re going to do some initial data cleaning and make sure that all of our dates are coded as date objects because that makes them more useful for our analysis.

You can use across() inside mutate to select multiple columns. The syntax is a little confusing at first, but it’s worth looking at this vignette to learn it because it’s super useful. In short, the first function argument is the columns you want to select, and the second function argument is the function you want to apply to those columns.

blackrock_etf_data <- blackrock_etf_data |> 
  # we are transforming both date columns (currently character strings) into date objects
  # so we can work with them.
  # this syntax is a bit confusing, but selects all columns containing `date` and applies
  # lubridate::mdy() function to them to turn them into date objects. 
  mutate(across(contains("date"), lubridate::mdy)) |>
  # Billions is a more useful magnitude than millions, so we'll create a column with 
  # the assets in billions by dividing by `net_assets_millions` by 1,000 (10^3)
  # If we wanted trillions, we could divide by 1,000,000 (10^6)
  mutate(net_assets_usd_bn = net_assets_usd_mn/10^3) |> 
  # this column doesn't add anything to our analysis - it says that the data is from 8/30/22
  select(-net_assets_as_of)

Here’s what we did in the code above, in plain English:

  1. we took the code object blackrock_etf_data and assigned it back to itself, so that the blackrock_etf_data object will inherit the changes we’re about to make.

  2. And then (how you read |> in spoken English) we change all columns containing “date” into date objects by applying lubridate’s mdy() function to it. We use mdy() because the dates are written in month-day-year format: “5/22/00”. The lubridate package contains lots of useful helper functions for working with dates.

  3. And then we make a column with net assets in billions of dollars, because that is a more useful magnitude for our dataset which has a lot of funds with many billions of dollars. Reading $10 billion is much more intuitive than $10,000 million.

  4. And then we got rid of the column net_assets_as_of because it wasn’t going to add information to our analysis.

Create Mini Dataset

Now create our mini dataset. We just copy and paste this code from our last lesson, except we’ll switch our assets column to the one in billions we just created.

mini_blackrock_data <- blackrock_etf_data |> 
  # group by whether the fund is an ESG fund or not
  group_by(is_esg) |> 
  # take the top 5 from each group, by net assets
  slice_max(order_by = net_assets_usd_bn, n = 5) |> 
  # select the following columns 
  select(ticker, fund_name = name_wo_ishares_etf, asset_class, sub_asset_class, region, incept_date, net_assets_usd_bn,
         msci_weighted_average_carbon_intensity_tons_co2e_m_sales) |> 
  # rename to `co2_intensity` because the full name is a mouthful, if descriptive.
  rename(co2_intensity = msci_weighted_average_carbon_intensity_tons_co2e_m_sales) |> 
  # always good to ungroup() if you've used a group_by().  We'll discuss later.
  ungroup()
Adding missing grouping variables: `is_esg`
mini_blackrock_data
# A tibble: 10 × 9
   is_esg       ticker fund_…¹ asset…² sub_a…³ region incept_d…⁴ net_a…⁵ co2_i…⁶
   <chr>        <chr>  <chr>   <chr>   <chr>   <chr>  <date>       <dbl>   <dbl>
 1 ESG Fund     ESGU   ESG Aw… Equity  Large/… North… 2016-12-01   22.4    104. 
 2 ESG Fund     ESGD   ESG Aw… Equity  Large/… Global 2016-06-28    6.43   104. 
 3 ESG Fund     ICLN   Global… Equity  All Cap Global 2008-06-24    5.63   266. 
 4 ESG Fund     ESGE   ESG Aw… Equity  Large/… Global 2016-06-28    4.23   168. 
 5 ESG Fund     DSI    MSCI K… Equity  Large/… North… 2006-11-14    3.69    72.7
 6 Regular Fund IVV    Core S… Equity  Large … North… 2000-05-15  298.     148. 
 7 Regular Fund IEFA   Core M… Equity  All Cap Global 2012-10-18   84.2    127. 
 8 Regular Fund AGG    Core U… Fixed … Multi … North… 2003-09-22   82.3    283. 
 9 Regular Fund IJR    Core S… Equity  Small … North… 2000-05-22   66.5    133. 
10 Regular Fund IEMG   Core M… Equity  All Cap Global 2012-10-18   64.9    369. 
# … with abbreviated variable names ¹​fund_name, ²​asset_class, ³​sub_asset_class,
#   ⁴​incept_date, ⁵​net_assets_usd_bn, ⁶​co2_intensity

Part II: Looking at Our Data

glimpse() It

Like last time, let’s use dplyr::gimpse() to look at all of our variables and remember what we’re working with:

blackrock_etf_data |> 
  glimpse()
Rows: 393
Columns: 22
$ ticker                                                                            <chr> …
$ name                                                                              <chr> …
$ incept_date                                                                       <date> …
$ gross_expense_ratio_percent                                                       <dbl> …
$ net_expense_ratio_percent                                                         <dbl> …
$ net_assets_usd_mn                                                                 <dbl> …
$ asset_class                                                                       <chr> …
$ sub_asset_class                                                                   <chr> …
$ region                                                                            <chr> …
$ market                                                                            <chr> …
$ location                                                                          <chr> …
$ investment_style                                                                  <chr> …
$ sustainability_characteristics_msci_esg_fund_ratings_msci_esg_fund_rating_aaa_ccc <chr> …
$ msci_esg_quality_score_0_10                                                       <dbl> …
$ msci_weighted_average_carbon_intensity_tons_co2e_m_sales                          <dbl> …
$ msci_esg_percent_coverage                                                         <dbl> …
$ sustainable_classification                                                        <chr> …
$ name_wo_ishares_etf                                                               <chr> …
$ is_esg                                                                            <chr> …
$ years_from_inception                                                              <dbl> …
$ year_launched                                                                     <dbl> …
$ net_assets_usd_bn                                                                 <dbl> …

And the same for our mini dataset:

mini_blackrock_data |> 
  glimpse()
Rows: 10
Columns: 9
$ is_esg            <chr> "ESG Fund", "ESG Fund", "ESG Fund", "ESG Fund", "ESG…
$ ticker            <chr> "ESGU", "ESGD", "ICLN", "ESGE", "DSI", "IVV", "IEFA"…
$ fund_name         <chr> "ESG Aware MSCI USA", "ESG Aware MSCI EAFE", "Global…
$ asset_class       <chr> "Equity", "Equity", "Equity", "Equity", "Equity", "E…
$ sub_asset_class   <chr> "Large/Mid Cap", "Large/Mid Cap", "All Cap", "Large/…
$ region            <chr> "North America", "Global", "Global", "Global", "Nort…
$ incept_date       <date> 2016-12-01, 2016-06-28, 2008-06-24, 2016-06-28, 200…
$ net_assets_usd_bn <dbl> 22.376688, 6.425673, 5.628011, 4.233544, 3.688930, 2…
$ co2_intensity     <dbl> 103.60, 103.83, 265.82, 167.71, 72.73, 148.34, 126.6…

Find Unique Values

Finding the unique values associated with your variables is very helpful. This is especially true for non-numeric data. Some variables will likely be unique per observation, like

blackrock_etf_data |>
  # the column you want to examine
  select(asset_class) |> 
  # selects the unique values
  distinct()
# A tibble: 5 × 1
  asset_class 
  <chr>       
1 Equity      
2 Fixed Income
3 Commodity   
4 Real Estate 
5 Multi Asset 
Now You Try

Try examining the unique values in two other columns using the code above as a model.

Now we can use a dplyr pipeline to create a sorted list showing the number of unique values per variable.

blackrock_etf_data |> 
  # select all character columns and find the number of distinct values in each
  summarize(across(where(is.character), n_distinct)) |> 
  # rename this column because the original name is unweildy and makes all the rest of the formatting awkward
  rename(msci_esg_rating = sustainability_characteristics_msci_esg_fund_ratings_msci_esg_fund_rating_aaa_ccc) |> 
  # pivot into long format to make it easier to work with.
  pivot_longer(cols = everything()) |> 
  # arrange from highest to lowest (descending order)
  arrange(value |> desc())
# A tibble: 12 × 2
   name                       value
   <chr>                      <int>
 1 ticker                       393
 2 name                         393
 3 name_wo_ishares_etf          393
 4 location                      41
 5 sub_asset_class               18
 6 region                         7
 7 msci_esg_rating                7
 8 asset_class                    5
 9 sustainable_classification     5
10 market                         3
11 investment_style               2
12 is_esg                         2

Let’s discuss the output. We can see that ticker, name, and name_wo_ishares_etf all have unique variables for each row of our dataset. This makes sense: each fund is one observation (row).

The rest have less. It’s worth examining each of these to see if they contain values that might be useful for subsetting our dataset during analysis

blackrock_etf_data |> 
  # select columns that are character data types. Take out the columns discussed above.
  select(where(is.character), -ticker, -name, -name_wo_ishares_etf) |> 
  # for each column, find the unique values (we'll cover map more later in the course)
  map(unique)
$asset_class
[1] "Equity"       "Fixed Income" "Commodity"    "Real Estate"  "Multi Asset" 

$sub_asset_class
 [1] "Large Cap"              "All Cap"                "Multi Sectors"         
 [4] "Small Cap"              "Large/Mid Cap"          "Mid Cap"               
 [7] "Credit"                 "Inflation"              "Municipals"            
[10] "Precious Metals"        "Government"             "Mortgages"             
[13] "High Yield"             "Real Estate Securities" "Multi Commodity"       
[16] "Static Allocation"      "Mid/Small Cap"          "Multi Strategy"        

$region
[1] "North America"          "Global"                 "Asia Pacific"          
[4] "Latin America"          "Europe"                 "Middle East and Africa"
[7] "Kuwait"                

$market
[1] "Developed" "Emerging"  "Frontier" 

$location
 [1] "United States"        "Broad"                "Japan"               
 [4] "China"                "Brazil"               "India"               
 [7] "Taiwan"               "Canada"               "United Kingdom"      
[10] "South Korea"          "Australia"            "Switzerland"         
[13] "Germany"              "Saudi Arabia"         "Mexico"              
[16] "France"               "Hong Kong"            "Singapore"           
[19] "Chile"                "Indonesia"            "Spain"               
[22] "Sweden"               "Thailand"             "Netherlands"         
[25] "South Africa"         "Turkey"               "Malaysia"            
[28] "Italy"                "Denmark"              "Israel"              
[31] "Poland"               "New Zealand"          "Philippines"         
[34] "Qatar"                "Austria"              "United Arab Emirates"
[37] "Norway"               "Kuwait"               "Finland"             
[40] "Belgium"              "Russia"              

$investment_style
[1] "Index"  "Active"

$sustainability_characteristics_msci_esg_fund_ratings_msci_esg_fund_rating_aaa_ccc
[1] "AA"  "AAA" "A"   "BBB" NA    "BB"  "B"  

$sustainable_classification
[1] NA             "Broad ESG"    "Thematic ESG" "Impact"       "Screened"    

$is_esg
[1] "Regular Fund" "ESG Fund"    

Now this is useful! These are all useful features for our analysis, as you’ll see below. We’ll start by using is_esg which tells us whether a fund is and ESG fund or a regular, non-ESG fund.

Part III: Looking at Our Data Using ggplot2

Univariate Distributions

What does our data distribution look like?

Histograms with geom_histogram()

blackrock_etf_data |> 
  ggplot(aes(x = net_assets_usd_bn)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Adjust the binwidth argument in geom_histogram() to a width that makes sense to you.

blackrock_etf_data |> 
  ggplot(aes(x = net_assets_usd_bn)) + 
  geom_histogram(binwidth = 3)

How do we interpret these charts?

  • The X-axis shows the size of the fund in billions of USD.

  • The Y-axis is the count of the number of funds in each bin (set by the binwidth argument, or by a default you can read about in the documentation).

Now You Try

Try adjusting the binwidth. If the binwidth is too large, you can miss important elements of the distribution of your data. If it is too small, the data will be noisy and may obscure meaningful trends.

Density Plots with geom_density()

Density plots are a close cousin of histograms

blackrock_etf_data |> 
  ggplot(aes(x = net_assets_usd_bn)) + 
  geom_density()

The Y axis in a density plot represents the “density” (proportion of total). Depending on your data, the count or the density may be more meaningful. it’s often worth looking at both.

Box plots with geom_boxplot()

blackrock_etf_data |> 
  ggplot(aes(y = net_assets_usd_bn)) +
  geom_boxplot()

Box plots are a great way of visualizing summary statistics in a compact visual manner. It shows you the median, the 1st & 3rd quartiles, and outliers in one graph.

In this case, our box is flattened. It’s still useful — it shows that even the 75th percentile of funds (the top of the box) is really small. It allows us to confidently say that almost all of BlackRock’s ETF assets are in a few large outlier funds. This has practical implications: it means that we can confidently focus most of our analytical energy on a few large funds instead of looking at all of the hundreds of funds.

If you haven’t used box plots, take a look at the function help:

?geom_boxplot

Here is the description of the summary statistics included in the

Summary statistics

The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). This differs slightly from the method used by the boxplot() function, and may be apparent with small samples. See boxplot.stats() for more information on how hinge positions are calculated for boxplot().

The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. Data beyond the end of the whiskers are called “outlying” points and are plotted individually.

Let’s look at a box plot looking at the mtcars dataset built into R to illustrate some key features:

mtcars |> 
  ggplot(aes(y = mpg)) + 
  geom_boxplot()

The mtcars dataset measures features of a set of cars from the mid 1970s. In this case, we’re looking at the miles-per-gallon (mpg) of the cars in the dataset. From the box plot, we can tell the following things, just by eyeballing the chart:

  • The median (50th percentile) car gets approximately 19 miles-per-gallon.

  • The 25th percentile car gets just over 15 miles-per-gallon.

  • The 75th percentile car gets about 23 miles-per-gallon.

  • One outlier gets about 34 miles-per-gallon.

That’s a lot of information conveyed in one little chart!

People have very strong feelings about the best charts to look at univariate distributions. Take a look at some options, and try a few out.

Using Color/Fill and Facets to Look at Distributions for Groups

Color & Fill

Check the specific ggplot2 geom you are using (such as by entering ?geom_histogram) to see whether it uses fill (e.g. geom_histogram() or geom_boxplot()) or color (e.g. geom_point() or geom_line()).

blackrock_etf_data |> 
  ggplot(aes(x = net_assets_usd_bn, fill = is_esg)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

blackrock_etf_data |> 
  ggplot(aes(x = net_assets_usd_bn, fill = is_esg)) + 
  # alpha (0 to 1) makes the fill color translucent so you can see both
  # position = "identity" makes it so the two overlap, unlike the default of "stack"
  geom_histogram(binwidth = 5, alpha = .6, position = "identity")

This page explains more about different options for grouped histograms.

What can we tell from this?

  • There are a lot more “Regular Funds” than ESG Funds”.

  • All the large outlier funds are “Regular funds”

Now You Try

Try using color and fill with geom_density() and geom_boxplot()

Faceting with facet_wrap()

We can use facet_wrap() to compare the two distributions on top of each other (by specifying ncol = 1 )

blackrock_etf_data |> 
  ggplot(aes(x = net_assets_usd_bn)) + 
  geom_histogram(binwidth = 3) +
  # stick a ~ in front of the variable name you want to facet by. 
  #ncol = 1 makes it 1 column with the two charts stacked on top of each other
  facet_wrap(~is_esg, ncol = 1)