I randomly selected 20 companies from the S&P 500 list. For each company, I downloaded the following documents:
print(t_pages_by_type_and_organization_pivot)
## # A tibble: 20 × 8
## # Groups: organization_name [20]
## organization_name `10k ` `10k (2012)` `annualreport ` `annualreport (2012)`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 algn 1 1 0 0
## 2 amcr 1 0 1 1
## 3 amd 1 1 0 0
## 4 avb 1 1 1 1
## 5 bac 1 1 0 0
## 6 bfb 1 1 1 1
## 7 cdns 1 1 0 1
## 8 cmg 1 1 0 0
## 9 cost 1 1 1 1
## 10 dxcm 1 1 1 0
## 11 fcx 1 1 1 1
## 12 glw 1 1 1 1
## 13 mdt 1 1 1 1
## 14 now 1 0 0 0
## 15 oke 1 1 1 1
## 16 oxy 1 1 1 0
## 17 psx 1 1 1 1
## 18 ter 1 1 1 1
## 19 txt 1 1 0 1
## 20 unp 1 1 0 0
## # ℹ 3 more variables: `csr ` <dbl>, `csr (2012)` <dbl>,
## # `investorpresentation ` <dbl>
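A minimal sketch of how an availability table like this can be built, assuming a long-format inventory with one row per downloaded document. The data frame `documents` and its layout are hypothetical stand-ins; only the two example rows (algn and amcr) mirror values from the table above.

```r
# Hypothetical long-format inventory: one row per downloaded document.
# Only algn and amcr are included, mirroring their rows in the table above.
documents <- data.frame(
  organization_name   = c("algn", "algn", "amcr", "amcr", "amcr"),
  page_classification = c("10k", "10k (2012)",
                          "10k", "annualreport", "annualreport (2012)")
)

# Cross-tabulate organization against document type.
# Each cell counts the documents of that type for that company.
doc_counts <- table(documents$organization_name,
                    documents$page_classification)

print(doc_counts)
```

This sketch uses base R's `table()` rather than a tidyverse pivot, so it runs without extra packages.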
All charts were individually coded. This includes:
print(t_graphics)
## # A tibble: 821 × 31
## year organization_name page_classification id x y classification
## <int> <chr> <chr> <int> <int> <int> <chr>
## 1 2022 mdt "10k " 1509 108 1493 line
## 2 2022 mdt "10k " 1510 180 245 bar
## 3 2012 mdt "annualreport (2012… 1511 3498 194 donut
## 4 2012 mdt "annualreport (2012… 1512 4279 190 donut
## 5 2012 mdt "annualreport (2012… 1513 3483 1261 bar stacked
## 6 2012 mdt "annualreport (2012… 1514 3883 1254 bar
## 7 2012 mdt "annualreport (2012… 1515 4227 1257 bar
## 8 2012 mdt "annualreport (2012… 1516 4605 1276 bar
## 9 2012 mdt "annualreport (2012… 1518 674 186 donut
## 10 2012 mdt "annualreport (2012… 1519 1910 238 donut
## # ℹ 811 more rows
## # ℹ 24 more variables: topic <chr>, title <chr>,
## # problem_nonzeroverticalaxis <int>, problem_bad3d <int>, page_id <int>,
## # problem_nonproportionalarea <int>,
## # problem_unlabelednonzeroverticalaxis <int>, h <int>, w <int>,
## # has_externalbenchmark <int>, has_yearsofhistory <int>, p0_pixels <int>,
## # p0_value <int>, p1_pixels <int>, p1_value <int>, …
Note that the topics are still quite broad, and the coding scheme needs further refinement. This is only a first pass to get a rough idea of the data.
## # A tibble: 285 × 2
## topic n
## <chr> <int>
## 1 asset capitalprogram 1
## 2 asset capitalprogrambyasset 4
## 3 asset corebusinesses 1
## 4 asset creditlossprovision 4
## 5 asset exportcapability 1
## 6 asset loanstresstest 1
## 7 asset rails 1
## 8 asset receivables leases 1
## 9 asset segment 1
## 10 asset storelocations 2
## # ℹ 275 more rows
My main focus so far has been finding problems with the charts. These range from fairly minor (such as a vertical axis that does not start at 0) to more severe. Severe errors include visual designs that don’t match the underlying numbers and labels attached to the wrong pie chart segments. The most common errors are rgd_errors, where a chart element (usually a bar) isn’t sized to match the underlying values.
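One way a sizing error of this kind could be checked numerically is sketched below. It assumes that `p0_pixels`/`p0_value` and `p1_pixels`/`p1_value` describe two marks in the same chart, each measured from a zero baseline. That reading of the columns is my assumption, and the function names and the 5% tolerance are hypothetical.

```r
# Ratio of drawn sizes divided by ratio of underlying values.
# A result of 1 means the two marks are drawn in proportion to their values.
# ASSUMPTION: pixel sizes are measured from a zero baseline.
size_distortion <- function(p0_pixels, p0_value, p1_pixels, p1_value) {
  pixel_ratio <- p1_pixels / p0_pixels
  value_ratio <- p1_value / p0_value
  pixel_ratio / value_ratio
}

# Flag a chart when the distortion is further from 1 than the tolerance.
# The 5% default is an arbitrary illustrative threshold.
is_sizing_error <- function(p0_pixels, p0_value, p1_pixels, p1_value,
                            tolerance = 0.05) {
  distortion <- size_distortion(p0_pixels, p0_value, p1_pixels, p1_value)
  abs(distortion - 1) > tolerance
}

# Illustrative values only, not taken from the coded data.
proportional <- size_distortion(100, 50, 200, 100)  # value doubles, bar doubles
exaggerated  <- size_distortion(100, 50, 300, 100)  # value doubles, bar triples
```

A ratio-based measure is used here because it does not depend on the chart's resolution or units; it breaks down when a value is zero or negative, which would need separate handling.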
Note that charts can have multiple problem tags added.
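Because a chart can carry several tags, the per-chart problem count is a row sum over the `problem_*` indicator columns. The sketch below uses a toy stand-in for `t_graphics` with invented values; the derived columns `n_problems` and `no_problems` are hypothetical names.

```r
# Toy stand-in for t_graphics; the 0/1 values are invented for illustration.
charts <- data.frame(
  id                          = c(1L, 2L, 3L),
  problem_nonzeroverticalaxis = c(1L, 0L, 0L),
  problem_bad3d               = c(1L, 0L, 0L),
  problem_nonproportionalarea = c(0L, 1L, 0L)
)

# Select every indicator column by its shared prefix.
problem_cols <- grep("^problem_", names(charts), value = TRUE)

# Number of distinct problem tags on each chart.
charts$n_problems <- rowSums(charts[problem_cols], na.rm = TRUE)

# 1 when a chart has no problem tags at all, 0 otherwise.
charts$no_problems <- as.integer(charts$n_problems == 0)

print(charts[c("id", "n_problems", "no_problems")])
```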
As the table below shows, I found a lot of broken or badly designed charts.
print(t_errors_by_doc)
## # A tibble: 7 × 14
## # Groups: page_classification [7]
## page_classification year2012 problem_rgd_error problem_nonzeroverticalaxis
## <chr> <dbl> <dbl> <int>
## 1 "10k " 0 2 6
## 2 "10k (2012)" 1 0 2
## 3 "annualreport " 0 7 1
## 4 "annualreport (2012)" 1 38 2
## 5 "csr " 0 34 3
## 6 "csr (2012)" 1 2 0
## 7 "investorpresentation " 0 66 13
## # ℹ 10 more variables: problem_non_arith_or_broken_scale <int>,
## # problem_negative_value_truncated <int>, problem_nonproportionalarea <int>,
## # problem_unlabelednonzeroverticalaxis <int>, problem_bad3d <int>,
## # problem_no_data_point_value_labels_or_scale <int>,
## # total_distinct_problems <dbl>, total_charts <int>, avg_problems <dbl>,
## # problem_none <dbl>
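A rough sketch of how per-document-type summaries like `total_charts`, `avg_problems`, and `problem_none` could be derived from per-chart counts. The definitions used here (mean tags per chart, number of charts with no tags) are my assumptions about what those columns hold, and the input data is invented.

```r
# Toy per-chart data; n_problems is the number of problem tags on each chart.
# All values are invented for illustration.
charts <- data.frame(
  page_classification = c("10k", "10k", "csr", "csr", "csr"),
  n_problems          = c(0, 2, 1, 0, 3)
)

# Summarize one document type at a time, then stack the results.
summarize_doc_type <- function(d) {
  data.frame(
    page_classification = d$page_classification[1],
    total_charts        = nrow(d),
    avg_problems        = mean(d$n_problems),    # ASSUMED: mean tags per chart
    problem_none        = sum(d$n_problems == 0) # ASSUMED: charts with no tags
  )
}

by_doc <- do.call(
  rbind,
  lapply(split(charts, charts$page_classification), summarize_doc_type)
)

print(by_doc)
```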
I think there are some interesting options for comparing the kinds of information shared in different types of documents. Do companies reveal ESG information only in their ESG report, or also in documents aimed at analysts?
Lastly, there’s a lot of evidence that managers selectively reveal information. The coding is really rough on this, but I tagged some charts as showing either an increasing or a decreasing trend.
This could be linked up with a larger dataset showing all company trends (for example, NI or waste). Is there a correlation between whether a trend is revealed and the underlying value? Does this relationship hold equally across ESG documents vs. 10-Ks vs. company annual reports?
This data is still really rough and only gives an idea of how it could be analyzed. I picked some of the most common items to include, but topics should also include key financials (such as NI, EPS, etc.).
Beyond individual data, is there selectivity in what data is being shown?
Hypothesis: Firms with negative trends don’t chart that data.
Hypothesis: Firms with negative trends chart that data with more distortions.
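A sketch of how the first hypothesis could be tested once the chart coding is linked to a dataset of underlying trends. It assumes each firm-metric pair is classified by trend direction and by whether the metric was charted. The counts below are invented placeholders, not results, and Fisher's exact test is just one reasonable choice for a small 2×2 table.

```r
# INVENTED counts for illustration only; these are not findings.
# Rows: direction of the underlying trend. Columns: whether it was charted.
trend_by_charted <- matrix(
  c(12,  3,
     5, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    trend   = c("increasing", "decreasing"),
    charted = c("yes", "no")
  )
)

# Fisher's exact test of independence between trend direction and charting.
result <- fisher.test(trend_by_charted)

# A small p-value would suggest charting depends on trend direction.
print(result$p.value)
```

The second hypothesis would need a different setup, comparing problem counts between charts of increasing and decreasing trends.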
NI is very rarely shown. When it is shown, there are frequently errors.
Revenue is more commonly shown in investor presentations. There appears to be a positive relationship, with higher revenue associated with more errors.
## Warning: Removed 4 rows containing missing values (`geom_point()`).
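The revenue relationship noted above could be checked with a rank correlation, which is less sensitive to a few very large firms than a linear fit. This is a minimal sketch on invented numbers; the data frame `firms` and its columns are hypothetical, and the `NA` stands in for the kind of missing value that triggers the plotting warning.

```r
# INVENTED firm-level data for illustration; one revenue value is missing.
firms <- data.frame(
  revenue  = c(10, 25, 40, NA, 80, 120),
  n_errors = c(0, 1, 1, 2, 3, 5)
)

# Spearman rank correlation, using only rows where both values are present.
rho <- cor(firms$revenue, firms$n_errors,
           method = "spearman", use = "complete.obs")

print(rho)
```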
Dividends aren’t very common. EPS is more common.
## Warning: Removed 4 rows containing missing values (`geom_point()`).
Financials were entered manually from Yahoo Finance. Mergent Online (from WVU library database page)
Requested 7/11 https://go.factset.com/ Country-focused
https://www.esgbook.com/analytics-solutions/ Really limited companies followed. Solution online kind of broken.
https://www.refinitiv.com/en/financial-data/company-data#benefits Emailed 7/12
https://www.reprisk.com/ Emailed 7/12 for access to reports.
https://www.kaggle.com/code/kingkarth/esg-countries https://www.kaggle.com/datasets/tunguz/environment-social-and-governance-data Country only.
https://www.iea.org/data-and-statistics/charts/number-of-companies-in-the-s-and-p-500-reporting-energy-and-emissions-related-metrics How many companies report ghg?
https://www.thecaq.org/sp-500-and-esg-reporting Audit information about ESG information. Very good.
https://www.conference-board.org/topics/sustainability-practices/environmental-practices-dashboard ESG information, requires signup.