Then we will load and clean our data the same way we did in the last lesson.
# assign the url to `github_raw_csv_url`
github_raw_csv_url <- "https://raw.githubusercontent.com/t-emery/sais-susfin_data/main/datasets/blackrock_etf_screener_2022-08-30.csv"

# read in the data, and assign it to the object `blackrock_etf_data`
blackrock_etf_data <- read_csv(github_raw_csv_url)
Rows: 393 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): ticker, name, incept_date, net_assets_as_of, asset_class, sub_asse...
dbl (8): gross_expense_ratio_percent, net_expense_ratio_percent, net_assets...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
blackrock_etf_data <- blackrock_etf_data |>
  # we are transforming both date columns (currently character strings) into date objects
  # so we can work with them.
  # this syntax is a bit confusing, but it selects all columns containing `date` and applies
  # the lubridate::mdy() function to them to turn them into date objects.
  mutate(across(contains("date"), lubridate::mdy)) |>
  # Billions is a more useful magnitude than millions, so we'll create a column with
  # the assets in billions by dividing `net_assets_usd_mn` by 1,000 (10^3).
  # If we wanted trillions, we could divide by 1,000,000 (10^6).
  mutate(net_assets_usd_bn = net_assets_usd_mn / 10^3) |>
  # this column doesn't add anything to our analysis - it says that the data is from 8/30/22
  select(-net_assets_as_of)

mini_blackrock_data <- blackrock_etf_data |>
  # group by whether the fund is an ESG fund or not
  group_by(is_esg) |>
  # take the top 5 from each group, by net assets
  slice_max(order_by = net_assets_usd_bn, n = 5) |>
  # select the following columns
  select(ticker, fund_name = name_wo_ishares_etf, asset_class, sub_asset_class, region,
         incept_date, net_assets_usd_bn, msci_weighted_average_carbon_intensity_tons_co2e_m_sales) |>
  # rename to `co2_intensity` because the full name is a mouthful, if descriptive.
  rename(co2_intensity = msci_weighted_average_carbon_intensity_tons_co2e_m_sales) |>
  # always good to ungroup() if you've used a group_by(). We'll discuss later.
  ungroup()
Adding missing grouping variables: `is_esg`
Part II: Finding Insights Through Aggregation
Understanding the Workflow
We’re going to preview a workflow on our mini dataset that we can use to understand our larger dataset.
What is Our Goal?: We’re looking at BlackRock’s ETF screener to uncover actionable insights that help explain their ESG business. The financial press writes a lot about it – but how big a part of its business is it? How fast is it growing?
Exploratory Data Analysis (EDA): attempts to uncover initial, relevant insights about our dataset. It may uncover facts that stand alone (such as the total assets in ESG funds), or it may help generate hypotheses that can be examined through more rigorous statistical methods later (like the relationship between fund size and fund age). Here are key relationships we’ll explore in EDA:
Relative Size & Magnitude: If we are comparing two things, how big is one compared to the other? Is one the size of a dog and the other the size of a cat (different, but of similar magnitude)? Or is one the size of a mouse and the other the size of a giraffe (a big difference in magnitude)?
Ranking: What is the relative order? Everybody loves a ranking. Top 5 & Top 10 lists always draw eyes and are perpetual clickbait.
Growth: Maybe something is small now, but growing quickly. Or large but shrinking?
Finding Narrative Through Aggregation: By aggregating together our data into meaningful groups (ESG vs. Regular Funds, by asset class, by year launched) we can unlock a hidden narrative of larger trends implicit in our dataset.
Outputs:
Data visualizations: a picture is worth 1,000 words, right? But for real, data visualizations are very efficient ways to convey a lot of information.
Emotive factoids: we’re looking for numbers that help make the data real to you and other readers. Emotive factoids are the way that you’ll convey information when you are writing and speaking and can’t use data visualizations. Here’s an OpEd I wrote for a South African financial newspaper – spot the emotive factoids.
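The table below shows the mini dataset we’ll work with, keeping only the columns we need for the size comparison. The chunk that produced it isn’t shown in this excerpt; a minimal sketch of what it presumably looks like, assuming `funds_by_assets_bn` is just a column selection from `mini_blackrock_data`:

# assumed reconstruction: build `funds_by_assets_bn` by selecting the size-related
# columns from the mini dataset created above, then print it
funds_by_assets_bn <- mini_blackrock_data |>
  select(fund_name, is_esg, region, net_assets_usd_bn)

funds_by_assets_bn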
# A tibble: 10 × 4
fund_name is_esg region net_assets_usd_bn
<chr> <chr> <chr> <dbl>
1 ESG Aware MSCI USA ESG Fund North America 22.4
2 ESG Aware MSCI EAFE ESG Fund Global 6.43
3 Global Clean Energy ESG Fund Global 5.63
4 ESG Aware MSCI EM ESG Fund Global 4.23
5 MSCI KLD 400 Social ESG Fund North America 3.69
6 Core S&P 500 Regular Fund North America 298.
7 Core MSCI EAFE Regular Fund Global 84.2
8 Core U.S. Aggregate Bond Regular Fund North America 82.3
9 Core S&P Small-Cap Regular Fund North America 66.5
10 Core MSCI Emerging Markets Regular Fund Global 64.9
Feature Engineering
Feature engineering is a fancy way to say that you’re making new variables from the existing variables using dplyr::mutate().
In exploratory data analysis, finding meaningful ways to describe datapoints is essential. We will focus on five common forms here.
Rank: Who is first? Who is last? Everybody wants to know!
Z-score: A great way to standardize data because it has an intuitive meaning. 0 is the average. 1 is 1 standard deviation above the average. -1 is 1 standard deviation below the average.
Percentile: Can be a helpful way to quantify where a value is in a distribution, especially when relative order matters most.
Quantile: Great for creating groupings based on relative rank. We will use 5 groups by default (quintiles), but use whatever number makes the most sense in your situation. Quartiles (4 groups) and deciles (10 groups) are also commonly used.
Percent of total: How much of the total assets does the largest fund account for?
You’ll notice that by default, the smallest value gets the top rank (rank 1). The same is true for percentile and for quantile. If you’re measuring which company has the smallest carbon footprint, this is good. If you’re ranking countries by GDP, you’d want the opposite. Always check this.
funds_by_assets_bn |>
  mutate(value_z_score = (net_assets_usd_bn - mean(net_assets_usd_bn)) / sd(net_assets_usd_bn) * -1, # multiply by negative 1
         value_rank = rank(net_assets_usd_bn * -1),               # multiply by negative 1
         value_percentile = percent_rank(net_assets_usd_bn * -1), # multiply by negative 1
         value_ntile_5 = ntile(net_assets_usd_bn * -1, 5),        # multiply by negative 1
         value_pct_total = net_assets_usd_bn / sum(net_assets_usd_bn)) |> # not relevant
  arrange(value_rank)
You’ll have to exercise judgment as to which direction any given indicator makes sense. Whatever you choose, make sure you’re clear what larger/smaller means and whether it will intuitively make sense to you and the people you’re presenting your analysis to.
Building Functions: Work Hard to Be Lazy
If you find yourself copying and pasting code more than 3 times, it’s time to make a function. Chapter 27 in R for Data Science 2e provides more details.
You don’t need to worry about the details for now, but here’s a brief guide.
# add the argument
plus_one <- function(x) {
  # define what you want it to do
  x + 1
}

plus_one(7)
[1] 8
The function above is called plus_one(). It takes a number you provide and adds 1 to it. When we provide 7 as an argument, it returns 8.
Function 1: add_rank_features()
Our function, add_rank_features(), is just a little more complicated. Don’t worry about understanding everything right now. Here are the important features:
Like all tidyverse-compliant functions, the data is the first argument so that it can be added to a data processing pipeline using the pipe operator |>.
The second argument, value_var, will by default select a column named value, unless you provide another column name holding the numeric values you want to use.
The third argument, rank_by_lowest, lets you choose whether rank 1 goes to the lowest value (the default, TRUE) or to the highest value (set it to FALSE).
The fourth argument sets the number of quantiles. The default is set to 5.
add_rank_features <- function(data, value_var = value, rank_by_lowest = TRUE, quantile_n = 5) {
  # If rank_by_lowest is set to TRUE, as it is by default, then it will use this first code chunk
  if (rank_by_lowest) {
    data |>
      mutate("{{value_var}}_z_score" := ({{ value_var }} - mean({{ value_var }})) / sd({{ value_var }}),
             "{{value_var}}_rank" := rank({{ value_var }}),
             "{{value_var}}_percentile" := percent_rank({{ value_var }}),
             "{{value_var}}_quantile_{{quantile_n}}" := ntile({{ value_var }}, {{ quantile_n }}),
             "{{value_var}}_pct_total" := {{ value_var }} / sum({{ value_var }})) |>
      arrange({{ value_var }})
  # If rank_by_lowest is set to FALSE, then it will use this second code chunk
  } else {
    data |>
      mutate("{{value_var}}_z_score" := ({{ value_var }} - mean({{ value_var }})) / sd({{ value_var }}) * -1, # multiply by -1
             "{{value_var}}_rank" := rank(-{{ value_var }}),               # put negative sign
             "{{value_var}}_percentile" := percent_rank(-{{ value_var }}), # put negative sign
             "{{value_var}}_quantile_{{quantile_n}}" := ntile(-{{ value_var }}, {{ quantile_n }}), # put negative sign
             # no change, not relevant for pct_total
             "{{value_var}}_pct_total" := {{ value_var }} / sum({{ value_var }})) |>
      arrange({{ value_var }} * -1)
  }
}
And now we can be deservedly lazy. If you just want the default behavior of the function, you can simply add it onto a data pipeline.
funds_by_assets_bn |>
  add_rank_features(value_var = net_assets_usd_bn,
                    # now the highest value gets rank 1
                    rank_by_lowest = FALSE,
                    # let's divide it into 10 quantiles (deciles)
                    quantile_n = 10)
Sometimes you want to compare features by groupings. What is the largest ESG Fund?
funds_by_assets_bn |>
  group_by(is_esg) |>
  add_rank_features(value_var = net_assets_usd_bn,
                    # now the highest value gets rank 1
                    rank_by_lowest = FALSE)
What is the percent of the grouped total for the largest ESG fund and for the largest non-ESG fund?
funds_by_assets_bn |>
  group_by(is_esg, region) |>
  add_rank_features(value_var = net_assets_usd_bn,
                    # now the highest value gets rank 1
                    rank_by_lowest = FALSE) |>
  # arrange first by is_esg, second by region, and third by rank so you can easily
  # see the ranking within each sub-group
  arrange(is_esg, region, net_assets_usd_bn_rank)
What is the second largest ESG fund with a global focus?
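One way to check (a sketch, not from the original lesson, reusing the grouped pipeline above and the group labels shown in the earlier output):

# hypothetical sketch: filter the grouped ranking down to the global ESG fund ranked 2
funds_by_assets_bn |>
  group_by(is_esg, region) |>
  add_rank_features(value_var = net_assets_usd_bn, rank_by_lowest = FALSE) |>
  filter(is_esg == "ESG Fund", region == "Global", net_assets_usd_bn_rank == 2) |>
  select(is_esg, region, fund_name, net_assets_usd_bn)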
What other grouping might be useful to try with this data? Try it.
Aggregation: Summarizing Data By Group
In exploratory data analysis we’re trying to make sense of the data. We start with a large number of individual datapoints and try to find patterns that reveal structure in the data and tell a story. We’re trying to generate hypotheses that can be tested more rigorously later.
What is the average fund size by fund type?
funds_by_assets_bn |>
  group_by(is_esg) |>
  # always default to using na.rm = TRUE, otherwise NA values
  # (very common in the wild) will trip up your calculations
  summarize(avg_assets = mean(net_assets_usd_bn, na.rm = TRUE))
# A tibble: 2 × 2
is_esg avg_assets
<chr> <dbl>
1 ESG Fund 8.47
2 Regular Fund 119.
We can expand this to use a few more common summary statistics:
funds_by_assets_bn |>
  group_by(is_esg) |>
  summarize(avg = mean(net_assets_usd_bn, na.rm = TRUE),
            std_dev = sd(net_assets_usd_bn, na.rm = TRUE),
            min = min(net_assets_usd_bn, na.rm = TRUE),
            max = max(net_assets_usd_bn, na.rm = TRUE),
            pctile_25 = quantile(net_assets_usd_bn, .25, na.rm = TRUE),
            # median is the same as the 50th percentile above
            median = median(net_assets_usd_bn, na.rm = TRUE),
            pctile_75 = quantile(net_assets_usd_bn, .75, na.rm = TRUE))
# A tibble: 2 × 8
is_esg avg std_dev min max pctile_25 median pctile_75
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ESG Fund 8.47 7.85 3.69 22.4 4.23 5.63 6.43
2 Regular Fund 119. 100. 64.9 298. 66.5 82.3 84.2
Function 2: calculate_summary_stats()
Similar to above, we will work hard to be lazy and write a function.
calculate_summary_stats <- function(data, value_var = value) {
  data |>
    summarize(avg = mean({{ value_var }}, na.rm = TRUE),
              std_dev = sd({{ value_var }}, na.rm = TRUE),
              min = min({{ value_var }}, na.rm = TRUE),
              max = max({{ value_var }}, na.rm = TRUE),
              pctile_25 = quantile({{ value_var }}, .25, na.rm = TRUE),
              median = median({{ value_var }}, na.rm = TRUE),
              pctile_75 = quantile({{ value_var }}, .75, na.rm = TRUE)) |>
    # ungrouping is helpful so we don't have to do so afterwards. Save time.
    ungroup()
}
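The grouped output below was presumably produced by grouping by both is_esg and region before calling the new function; the chunk isn’t shown in this excerpt, but a sketch of that call would be:

# assumed call that produced the grouped output below
funds_by_assets_bn |>
  group_by(is_esg, region) |>
  calculate_summary_stats(value_var = net_assets_usd_bn)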
`summarise()` has grouped output by 'is_esg'. You can override using the
`.groups` argument.
# A tibble: 4 × 9
  is_esg       region          avg std_dev   min   max pctile_25 median pctile_75
  <chr>        <chr>         <dbl>   <dbl> <dbl> <dbl>     <dbl>  <dbl>     <dbl>
1 ESG Fund     Global         5.43    1.11  4.23  6.43      4.93   5.63      6.03
2 ESG Fund     North America 13.0    13.2   3.69 22.4       8.36  13.0      17.7 
3 Regular Fund Global        74.6    13.6  64.9  84.2      69.7   74.6      79.4 
4 Regular Fund North America 149.    129.   66.5 298.       74.4   82.3     190.  
Understanding the Output
What is the size of the median North American ESG fund in our sample?
Comparing Data Aggregates
pivot_longer() and pivot_wider() are your friends. Practice them, and get to know them well.
What you need to know:
pivot_longer(): you need to specify the columns you want pivoted longer. By default the column names will be put in a column called name, and the values will be put in a column called value. You can specify your own column names if you’d like (a short round-trip sketch follows this list).
pivot_wider(): you need to specify the column where the column names are going to come from using names_from, and the column where the values come from using values_from.
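Here’s a minimal round-trip sketch on a toy tibble (not the lesson data) to make those defaults concrete:

library(tidyverse)

toy <- tibble(group = c("a", "b"), x = c(1, 2), y = c(3, 4))

# pivot x and y into long form; the defaults create `name` and `value` columns
toy_long <- toy |>
  pivot_longer(cols = c(x, y))

# and back: column names come from `name`, cell values from `value`
toy_long |>
  pivot_wider(names_from = name, values_from = value)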
Spend some time going through the vignette explaining the features of the pivot_ functions. It’s worth your time and will enable you to do cool stuff faster.
vignette("pivot")
starting httpd help server ... done
Now we’re going to walk through the workflow step-by-step:
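Step 1: Calculate Summary Statistics by Group
The first step repeats what we did above: group by is_esg and apply calculate_summary_stats(). The chunk isn’t shown in this excerpt, but it is presumably the same call as before:

# assumed call that produced the output below
funds_by_assets_bn |>
  group_by(is_esg) |>
  calculate_summary_stats(value_var = net_assets_usd_bn)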
# A tibble: 2 × 8
is_esg avg std_dev min max pctile_25 median pctile_75
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ESG Fund 8.47 7.85 3.69 22.4 4.23 5.63 6.43
2 Regular Fund 119. 100. 64.9 298. 66.5 82.3 84.2
Step 2: Pivot the Data into Long Form Using pivot_longer()
Looking at the data above, we want to compare the two groups: ESG Funds vs. Regular Funds. To do that more easily, we need the values from is_esg (ESG Fund, Regular Fund) as the column headers and the summary statistics as rows, so we can use mutate() to calculate comparisons.
This is a two-step process. The first step is pivoting the summary-statistic columns into a single aggregates column.
funds_by_assets_bn |>
  # group by the feature(s) you want summary statistics for
  group_by(is_esg) |>
  # use the function we built to calculate those summary stats.
  calculate_summary_stats(value_var = net_assets_usd_bn) |>
  # pivot the data longer (tidy format) so that all of the summary stat names are in one column and the values are in another.
  pivot_longer(cols = avg:pctile_75, names_to = "aggregates")
# A tibble: 14 × 3
is_esg aggregates value
<chr> <chr> <dbl>
1 ESG Fund avg 8.47
2 ESG Fund std_dev 7.85
3 ESG Fund min 3.69
4 ESG Fund max 22.4
5 ESG Fund pctile_25 4.23
6 ESG Fund median 5.63
7 ESG Fund pctile_75 6.43
8 Regular Fund avg 119.
9 Regular Fund std_dev 100.
10 Regular Fund min 64.9
11 Regular Fund max 298.
12 Regular Fund pctile_25 66.5
13 Regular Fund median 82.3
14 Regular Fund pctile_75 84.2
Step 3: Pivot the Groups You Want to Compare into Wide Format Using pivot_wider()
funds_by_assets_bn |>
  # group by the feature(s) you want summary statistics for
  group_by(is_esg) |>
  # use the function we built to calculate those summary stats.
  calculate_summary_stats(value_var = net_assets_usd_bn) |>
  # pivot the data longer (tidy format) so that all of the summary stat names are in one column and the values are in another.
  pivot_longer(cols = avg:pctile_75, names_to = "aggregates") |>
  # then pivot the data into wide format where the two groups you are comparing are columns
  pivot_wider(names_from = is_esg, values_from = value)
Step 4: Use janitor::clean_names() to Make the Column Names into snake_case
It’s a lot easier to work with column names in R if you put them into snake_case (lowercase, with words connected by underscores). Otherwise you need to use lots of quotation marks, and it’s easy to mess up. The janitor package has a lot of great helper functions for cleaning your data, and clean_names() is an especially useful one to know. If you haven’t yet installed it, run install.packages("janitor") in your console before running the code below.
funds_by_assets_bn |>
  # group by the feature(s) you want summary statistics for
  group_by(is_esg) |>
  # use the function we built to calculate those summary stats.
  calculate_summary_stats(value_var = net_assets_usd_bn) |>
  # pivot the data longer (tidy format) so that all of the summary stat names are in one column and the values are in another.
  pivot_longer(cols = avg:pctile_75, names_to = "aggregates") |>
  # then pivot the data into wide format where the two groups you are comparing are columns
  pivot_wider(names_from = is_esg, values_from = value) |>
  # make names snake_case, which is easier to work with in the mutate function
  janitor::clean_names()
# A tibble: 7 × 3
aggregates esg_fund regular_fund
<chr> <dbl> <dbl>
1 avg 8.47 119.
2 std_dev 7.85 100.
3 min 3.69 64.9
4 max 22.4 298.
5 pctile_25 4.23 66.5
6 median 5.63 82.3
7 pctile_75 6.43 84.2
Step 5: Use mutate() to Make Meaningful Comparisons of the Groups
You’ll need to use your judgment to figure out the best way to compare the two groups given the context of the data.
Here, one group is clearly much larger than the other. So two ways of comparing the groups come to mind:
How many times larger are the regular funds versus the ESG funds?
What percentage of the size of the regular funds do the ESG funds constitute?
funds_by_assets_bn |>
  # group by the feature(s) you want summary statistics for
  group_by(is_esg) |>
  # use the function we built to calculate those summary stats.
  calculate_summary_stats(value_var = net_assets_usd_bn) |>
  # pivot the data longer (tidy format) so that all of the summary stat names are in one column and the values are in another.
  pivot_longer(cols = avg:pctile_75, names_to = "aggregates") |>
  # then pivot the data into wide format where the two groups you are comparing are columns
  pivot_wider(names_from = is_esg, values_from = value) |>
  # make names snake_case, which is easier to work with in the mutate function
  janitor::clean_names() |>
  # use your judgment to figure out the best way to compare the two groups
  mutate(esg_fund_pct_of_regular = esg_fund / regular_fund * 100,
         regular_times_larger_than_esg = regular_fund / esg_fund)
How much larger is the largest regular fund than the largest ESG fund?
How much bigger is the average regular fund in our sample than the largest ESG fund?
Which ones seem the most relevant to you?
Practice writing these as “emotive factoids” in sentences.
Part III: Your Turn
Challenge 1: Use These Tools on the Full Dataset
Now that you understand the toolkit, try it out on the full blackrock_etf_data dataset.
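A possible starting point (a sketch, not a prescribed answer), reusing the helpers from this lesson on the full dataset:

# summary stats by fund type across all 393 funds
blackrock_etf_data |>
  group_by(is_esg) |>
  calculate_summary_stats(value_var = net_assets_usd_bn)

# then rank every fund by size, largest first
blackrock_etf_data |>
  add_rank_features(value_var = net_assets_usd_bn, rank_by_lowest = FALSE)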
Challenge 2: Comparing the Holdings of the Largest ESG Fund Versus the Largest Regular Fund
The class GitHub data repository has a dataset comparing the holdings of the largest ESG fund (ESGU) and the largest standard fund (IVV). Test out your newfound skills by doing an initial exploratory data analysis on this dataset.
Rows: 537 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ticker, company_name, sector, esg_uw_ow
dbl (7): esg_etf, standard_etf, esg_tilt, esg_tilt_z_score, esg_tilt_rank, e...
lgl (3): in_esg_only, in_standard_only, in_on_index_only
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
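A possible starting point once the file is read in (the path below is a placeholder, not the real location; substitute the raw-CSV url from the class data repository):

# hypothetical placeholder path -- replace with the actual file location
holdings_data <- read_csv("<path-to-esgu-vs-ivv-holdings-csv>")

# e.g., compare the ESG tilt by sector using the summary-stats helper from this lesson
holdings_data |>
  group_by(sector) |>
  calculate_summary_stats(value_var = esg_tilt)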