Data Gathered

I randomly selected 20 companies from the S&P 500 list. For each company, I downloaded the following documents:

10K (annual SEC filing)
Annual report. These overlapped significantly with the 10K reports, so I coded only new material (not shown in the 10K). All 10K content is coded as 10K, with the annual report used for any additional content (often a letter to investors).
Investor presentation. I retrieved Q4 2022, or Q1 2023 if the former wasn’t available.
CSR. These documents were often labeled differently (DEI, ESG, Sustainability, etc.). All CSR-related documents were combined for analysis.
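The random selection described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the ticker vector is a hypothetical placeholder for the full S&P 500 constituent list, and the seed is an assumption added only to make the draw reproducible.

```r
# Hypothetical stand-in for the full S&P 500 constituent list;
# in practice this would be read from a file of all tickers.
sp500_tickers <- sprintf("ticker%03d", 1:500)

set.seed(2023)  # assumed seed, only so the draw can be repeated
sampled_companies <- sample(sp500_tickers, size = 20, replace = FALSE)

print(sampled_companies)
```

Sampling without replacement guarantees 20 distinct companies.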

Table of documents

print(t_pages_by_type_and_organization_pivot)

## # A tibble: 20 × 8
## # Groups:   organization_name [20]
##    organization_name `10k ` `10k (2012)` `annualreport ` `annualreport (2012)`
##    <chr>              <dbl>        <dbl>           <dbl>                 <dbl>
##  1 algn                   1            1               0                     0
##  2 amcr                   1            0               1                     1
##  3 amd                    1            1               0                     0
##  4 avb                    1            1               1                     1
##  5 bac                    1            1               0                     0
##  6 bfb                    1            1               1                     1
##  7 cdns                   1            1               0                     1
##  8 cmg                    1            1               0                     0
##  9 cost                   1            1               1                     1
## 10 dxcm                   1            1               1                     0
## 11 fcx                    1            1               1                     1
## 12 glw                    1            1               1                     1
## 13 mdt                    1            1               1                     1
## 14 now                    1            0               0                     0
## 15 oke                    1            1               1                     1
## 16 oxy                    1            1               1                     0
## 17 psx                    1            1               1                     1
## 18 ter                    1            1               1                     1
## 19 txt                    1            1               0                     1
## 20 unp                    1            1               0                     0
## # ℹ 3 more variables: `csr ` <dbl>, `csr (2012)` <dbl>,
## #   `investorpresentation ` <dbl>
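A table of this shape could be produced roughly as sketched below. The values in the printed table look like 0/1 indicators of whether a document type was collected for each company; that reading is an assumption. The long-format data frame `t_pages` is hypothetical (its rows mirror the algn and amcr entries above), and base R is used only to keep the sketch free of dependencies, since the tibble output suggests the original code uses the tidyverse.

```r
# Hypothetical long-format data: one row per collected document.
t_pages <- data.frame(
  organization_name   = c("algn", "algn", "amcr", "amcr", "amcr"),
  page_classification = c("10k ", "10k (2012)", "10k ",
                          "annualreport ", "annualreport (2012)"),
  stringsAsFactors = FALSE
)

# Cross-tabulate organizations against document types,
# then collapse the counts to a 0/1 presence indicator.
page_counts  <- table(t_pages$organization_name, t_pages$page_classification)
has_document <- (page_counts > 0) * 1

print(has_document)
```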

Charts

All charts were individually coded. This includes:

classification: what type of graphic?
topic: what is the content area of the graphic?
errors: a range of different problems.

Chart Classifications

print(t_graphics)

## # A tibble: 821 × 31
##     year organization_name page_classification     id     x     y classification
##    <int> <chr>             <chr>                <int> <int> <int> <chr>         
##  1  2022 mdt               "10k "                1509   108  1493 line          
##  2  2022 mdt               "10k "                1510   180   245 bar           
##  3  2012 mdt               "annualreport (2012…  1511  3498   194 donut         
##  4  2012 mdt               "annualreport (2012…  1512  4279   190 donut         
##  5  2012 mdt               "annualreport (2012…  1513  3483  1261 bar stacked   
##  6  2012 mdt               "annualreport (2012…  1514  3883  1254 bar           
##  7  2012 mdt               "annualreport (2012…  1515  4227  1257 bar           
##  8  2012 mdt               "annualreport (2012…  1516  4605  1276 bar           
##  9  2012 mdt               "annualreport (2012…  1518   674   186 donut         
## 10  2012 mdt               "annualreport (2012…  1519  1910   238 donut         
## # ℹ 811 more rows
## # ℹ 24 more variables: topic <chr>, title <chr>,
## #   problem_nonzeroverticalaxis <int>, problem_bad3d <int>, page_id <int>,
## #   problem_nonproportionalarea <int>,
## #   problem_unlabelednonzeroverticalaxis <int>, h <int>, w <int>,
## #   has_externalbenchmark <int>, has_yearsofhistory <int>, p0_pixels <int>,
## #   p0_value <int>, p1_pixels <int>, p1_value <int>, …
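To summarize the classifications, the charts can be counted by type. The sketch below uses only the ten rows printed above as toy data (`t_graphics_sample` is a hypothetical name), so the counts describe that sample and not the full 821 charts.

```r
# Toy data: id and classification of the ten rows printed above.
t_graphics_sample <- data.frame(
  id = c(1509, 1510, 1511, 1512, 1513, 1514, 1515, 1516, 1518, 1519),
  classification = c("line", "bar", "donut", "donut", "bar stacked",
                     "bar", "bar", "bar", "donut", "donut"),
  stringsAsFactors = FALSE
)

# Count charts per classification, most frequent first.
classification_counts <- sort(table(t_graphics_sample$classification),
                              decreasing = TRUE)
print(classification_counts)
```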

Chart Topic

Note that the topics are still pretty broad, and the coding scheme needs further refinement. This is just a good first step to get a rough idea of the data.

## # A tibble: 285 × 2
##    topic                           n
##    <chr>                       <int>
##  1 asset capitalprogram            1
##  2 asset capitalprogrambyasset     4
##  3 asset corebusinesses            1
##  4 asset creditlossprovision       4
##  5 asset exportcapability          1
##  6 asset loanstresstest            1
##  7 asset rails                     1
##  8 asset receivables leases        1
##  9 asset segment                   1
## 10 asset storelocations            2
## # ℹ 275 more rows
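One possible way to get an overview while the coding scheme is refined is to roll each topic up to a broad category. The sketch below assumes that the first word of a topic (for example "asset") is its broad category; that convention is inferred from the rows shown and is not confirmed. The data frame `t_topics_sample` is a hypothetical name holding only the ten rows printed above.

```r
# Toy data: the ten topic rows printed above.
t_topics_sample <- data.frame(
  topic = c("asset capitalprogram", "asset capitalprogrambyasset",
            "asset corebusinesses", "asset creditlossprovision",
            "asset exportcapability", "asset loanstresstest",
            "asset rails", "asset receivables leases",
            "asset segment", "asset storelocations"),
  n = c(1, 4, 1, 4, 1, 1, 1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Assumption: the first word of each topic is its broad category.
t_topics_sample$category <- sub(" .*$", "", t_topics_sample$topic)

# Sum chart counts within each broad category.
category_totals <- aggregate(n ~ category, data = t_topics_sample, FUN = sum)
print(category_totals)
```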

Chart Errors

My main focus so far has been finding problems with the charts. These range from fairly minor (such as not starting an axis at 0) to more severe. Severe errors include visual designs that don’t match the underlying numbers, labels attached to the wrong pie chart segments, etc. The most common errors are rgd_errors, where a chart element (usually a bar) isn’t sized to match the underlying values.
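The idea behind a sizing error can be illustrated with a simple proportionality check: for bars drawn from a zero baseline, the ratio of drawn heights should equal the ratio of the underlying values. This is a sketch of the concept only, not the coding procedure used here. All names, measurements, and the tolerance are hypothetical.

```r
# Hypothetical measurements for two bars in one chart:
# heights as drawn (in pixels) and the values printed on the bars.
bar_pixels <- c(120, 180)
bar_values <- c(100, 200)

# For bars drawn to scale from a zero baseline, these ratios match.
pixel_ratio <- bar_pixels[2] / bar_pixels[1]
value_ratio <- bar_values[2] / bar_values[1]

# Flag the chart when the ratios differ by more than a tolerance.
tolerance   <- 0.05  # assumed threshold, not taken from the source
is_misdrawn <- abs(pixel_ratio / value_ratio - 1) > tolerance
print(is_misdrawn)
```

In this made-up case the second bar is drawn 1.5 times as tall as the first while its value is twice as large, so the chart is flagged.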

Note that charts can have multiple problem tags added.

As shown by the table below, I found a lot of broken or badly designed charts.

print(t_errors_by_doc)

## # A tibble: 7 × 14
## # Groups:   page_classification [7]
##   page_classification     year2012 problem_rgd_error problem_nonzeroverticalaxis
##   <chr>                      <dbl>             <dbl>                       <int>
## 1 "10k "                         0                 2                           6
## 2 "10k (2012)"                   1                 0                           2
## 3 "annualreport "                0                 7                           1
## 4 "annualreport (2012)"          1                38                           2
## 5 "csr "                         0                34                           3
## 6 "csr (2012)"                   1                 2                           0
## 7 "investorpresentation "        0                66                          13
## # ℹ 10 more variables: problem_non_arith_or_broken_scale <int>,
## #   problem_negative_value_truncated <int>, problem_nonproportionalarea <int>,
## #   problem_unlabelednonzeroverticalaxis <int>, problem_bad3d <int>,
## #   problem_no_data_point_value_labels_or_scale <int>,
## #   total_distinct_problems <dbl>, total_charts <int>, avg_problems <dbl>,
## #   problem_none <dbl>
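Because a chart can carry several problem tags, a per-document summary such as avg_problems presumably needs the tags summed per chart before averaging. The sketch below shows one way to do that; how the original columns were actually computed is not stated, so this is an assumption. The data frame `t_charts` and its values are hypothetical.

```r
# Hypothetical chart-level data with two 0/1 problem flags.
t_charts <- data.frame(
  page_classification         = c("10k ", "10k ", "csr ", "csr ", "csr "),
  problem_rgd_error           = c(0, 1, 1, 1, 0),
  problem_nonzeroverticalaxis = c(1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# A chart can carry several tags, so sum the flags per chart.
problem_cols        <- grep("^problem_", names(t_charts), value = TRUE)
t_charts$n_problems <- rowSums(t_charts[, problem_cols])

# Average number of problem tags per chart, by document type.
avg_problems <- aggregate(n_problems ~ page_classification,
                          data = t_charts, FUN = mean)
print(avg_problems)
```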

Topics v. Organizations

I think that there are some interesting options for comparing the types of information being shared in different types of documents. Do companies reveal ESG information only in their ESG report, or also for analysts?

Trend Data

Lastly, there’s a lot of evidence of managers selectively revealing information. The coding is really rough on this, but I tagged some charts as showing either an increasing or a decreasing trend.

This could be linked up with a larger dataset, showing all company trends (for example, NI or waste). Is there a correlation between the revealing of a trend and the underlying value? Does this relationship hold the same between ESG documents v. 10K v. company annual reports?

This data is still really rough and only gives an idea of how the data could be analyzed. I picked some of the most common items to include, but topics should also include key financials (such as NI, EPS, etc.).

Trend Correlations

Beyond individual data, is there selectivity in what data is being shown?

Hypothesis: Firms with negative trends don’t chart that data.

Hypothesis: Firms with negative trends chart that data with more distortions.
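The first hypothesis could be tested with a 2 x 2 table of trend direction against whether the metric was charted. The sketch below uses Fisher's exact test as one reasonable choice for small counts; it is not a method the analysis has committed to. The counts are made up for illustration and are not results from this study.

```r
# Made-up counts for illustration only (not results from this study):
# rows = direction of the firm's underlying trend,
# columns = whether the firm charted that metric.
trend_by_charted <- matrix(
  c(12,  8,   # positive trend: charted, not charted
     3, 17),  # negative trend: charted, not charted
  nrow = 2, byrow = TRUE,
  dimnames = list(trend   = c("positive", "negative"),
                  charted = c("yes", "no"))
)

# Fisher's exact test for association between trend and charting.
result <- fisher.test(trend_by_charted)
print(result$p.value)
```

The second hypothesis would need a per-chart distortion measure compared across the two trend groups.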

NI

NI is very rarely shown. When it is shown, there are frequently errors.

Revenue

Revenue is more commonly shown in investor presentations. There appears to be a positive relationship, with higher revenue associated with more errors.

## Warning: Removed 4 rows containing missing values (`geom_point()`).
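The warning above means some observations had missing values and were dropped from the plot. A correlation on the same data needs the same treatment. The sketch below drops incomplete pairs and computes a Spearman rank correlation, which is an assumed choice of statistic. The variable names and values are made up for illustration.

```r
# Made-up values for illustration only; NA marks a missing observation.
revenue      <- c(10, 25, NA, 40, 55, NA, 70)
chart_errors <- c( 1,  2,  3, NA,  4,  2,  6)

# Keep only firms where both values are present, as the plot does.
complete  <- complete.cases(revenue, chart_errors)
n_dropped <- sum(!complete)

# Spearman rank correlation on the remaining pairs.
rho <- cor(revenue[complete], chart_errors[complete], method = "spearman")
print(c(n_dropped = n_dropped, rho = rho))
```

Reporting the number of dropped rows alongside the correlation makes clear how much data the estimate rests on.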

EPS

Dividends aren’t very common. EPS is more common.

## Warning: Removed 4 rows containing missing values (`geom_point()`).

Data source

Financials were manually input from Yahoo Finance. Mergent Online (from the WVU library database page).

Possible ESG Data Sources

Requested 7/11 https://go.factset.com/ Country-focused

https://www.esgbook.com/analytics-solutions/ Really limited companies followed. Solution online kind of broken.

https://www.refinitiv.com/en/financial-data/company-data#benefits Emailed 7/12

https://www.reprisk.com/ Emailed 7/12 for access to reports.

https://www.kaggle.com/code/kingkarth/esg-countries https://www.kaggle.com/datasets/tunguz/environment-social-and-governance-data Country only.

Baseline for data?

https://www.iea.org/data-and-statistics/charts/number-of-companies-in-the-s-and-p-500-reporting-energy-and-emissions-related-metrics How many companies report GHG?

https://www.thecaq.org/sp-500-and-esg-reporting Audit information about ESG information. Very good.

https://www.conference-board.org/topics/sustainability-practices/environmental-practices-dashboard ESG information, requires signup.

Charts in Company Annual Reports