Eco-Alpha: Unearthing Value in ESG Scores

Introduction

ESG Investing is increasingly becoming a crucial part of investment strategies.This analysis leverages S&P 500 ESG Scores to uncover trends and insights, guiding the creation of investment portfolios that not only promise returns but also reflect responsible stewardship of environmental, social, and governance values.”

Statement of Business Task

The core mission of of this project is to harness the power of data analytics to forge a path for ESG investment strategies. By meticulously analyzing S&P 500 ESG Scores, this project aims to pinpoint key trends and actionable insights that will empower investors to build portfolios with a conscience, aligning financial goals with environmental, social, and governance benchmarks.

Data Source Description

This dataset exclusively showcases companies from the S&P 500 index. Researchers, investors, analysts, and policy-makers can utilize this dataset to gain insights into the ESG performance and risk profiles of these major corporations. Whether exploring trends, conducting ESG assessments, or making informed investment decisions, this dataset serves as a valuable resource for comprehending the sustainability and governance practices of S&P 500 companies. Fore more information on the dataset !click here(https://www.kaggle.com/datasets/pritish509/s-and-p-500-esg-risk-ratings)

Loading the Required Libraries

# Loading required libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(knitr)
library(kableExtra)

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(viridis) # For a vibrant color palette

## Loading required package: viridisLite

library(RColorBrewer)
library(forcats) # For factor releveling
library(cluster) # For cluster analysis
library(reshape2) # for melt function

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Reading the Dataset

esg_data <- read_csv("SP 500 ESG Risk Ratings.csv")

## Rows: 503 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Symbol, Name, Address, Sector, Industry, Description, Controversy L...
## dbl (5): Total ESG Risk score, Environment Risk Score, Governance Risk Score...
## num (1): Full Time Employees
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(esg_data)

## # A tibble: 6 × 15
##   Symbol Name          Address Sector Industry `Full Time Employees` Description
##   <chr>  <chr>         <chr>   <chr>  <chr>                    <dbl> <chr>      
## 1 A      Agilent Tech… "5301 … Healt… Diagnos…                 18000 Agilent Te…
## 2 AAL    American Air… "1 Sky… Indus… Airlines                132500 American A…
## 3 AAP    Advance Auto… "4200 … Consu… Special…                 40000 Advance Au…
## 4 AAPL   Apple Inc     "One A… Techn… Consume…                164000 Apple Inc.…
## 5 ABBV   Abbvie Inc    "1 Nor… Healt… Drug Ma…                 50000 AbbVie Inc…
## 6 ABC    Amerisourceb… "1 Wes… Healt… Medical…                 46000 Amerisourc…
## # ℹ 8 more variables: `Total ESG Risk score` <dbl>,
## #   `Environment Risk Score` <dbl>, `Governance Risk Score` <dbl>,
## #   `Social Risk Score` <dbl>, `Controversy Level` <chr>,
## #   `Controversy Score` <dbl>, `ESG Risk Percentile` <chr>,
## #   `ESG Risk Level` <chr>

colnames(esg_data)

##  [1] "Symbol"                 "Name"                   "Address"               
##  [4] "Sector"                 "Industry"               "Full Time Employees"   
##  [7] "Description"            "Total ESG Risk score"   "Environment Risk Score"
## [10] "Governance Risk Score"  "Social Risk Score"      "Controversy Level"     
## [13] "Controversy Score"      "ESG Risk Percentile"    "ESG Risk Level"

glimpse(esg_data)

## Rows: 503
## Columns: 15
## $ Symbol                   <chr> "A", "AAL", "AAP", "AAPL", "ABBV", "ABC", "AB…
## $ Name                     <chr> "Agilent Technologies Inc", "American Airline…
## $ Address                  <chr> "5301 Stevens Creek Boulevard\nSanta Clara, C…
## $ Sector                   <chr> "Healthcare", "Industrials", "Consumer Cyclic…
## $ Industry                 <chr> "Diagnostics & Research", "Airlines", "Specia…
## $ `Full Time Employees`    <dbl> 18000, 132500, 40000, 164000, 50000, 46000, 1…
## $ Description              <chr> "Agilent Technologies, Inc. provides applicat…
## $ `Total ESG Risk score`   <dbl> 15, 29, 12, 17, 28, 12, 25, 21, 10, 12, 24, 3…
## $ `Environment Risk Score` <dbl> 0.3, 12.0, 0.0, 0.6, 1.1, 1.3, 3.0, 1.0, 0.3,…
## $ `Governance Risk Score`  <dbl> 6.3, 5.0, 3.0, 9.2, 9.9, 5.2, 8.4, 12.0, 4.8,…
## $ `Social Risk Score`      <dbl> 8.6, 12.0, 8.0, 6.9, 16.8, 5.6, 13.6, 7.0, 4.…
## $ `Controversy Level`      <chr> "Low", "Moderate", "Moderate", "Significant",…
## $ `Controversy Score`      <dbl> 1, 2, 2, 3, 3, 3, 3, 2, 2, 1, 0, 3, 1, 0, 2, …
## $ `ESG Risk Percentile`    <chr> "11th percentile", "62nd percentile", "4th pe…
## $ `ESG Risk Level`         <chr> "Low", NA, "Negligible", "Low", "Medium", "Lo…

Data Cleaning

# Data Cleaning Steps

# 1. Remove rows with any missing values
esg_data_clean <- na.omit(esg_data)

# 2. Remove unnecessary columns, e.g., 'Address'
esg_data_clean <- select(esg_data_clean, -Address)

# 3. Convert 'Full Time Employees' to integer
esg_data_clean$`Full Time Employees` <- as.integer(esg_data_clean$`Full Time Employees`)

# 4. Rename 'Symbol' to 'Ticker'
esg_data_clean <- rename(esg_data_clean, Ticker = Symbol)

# 5. Address outliers in 'Total ESG Risk score'
esg_outliers <- boxplot.stats(esg_data_clean$`Total ESG Risk score`)$out
esg_data_clean <- esg_data_clean[!esg_data_clean$`Total ESG Risk score` %in% esg_outliers, ]

# 6. Normalize 'Total ESG Risk score'
esg_data_clean$`Total ESG Risk score` <- scale(esg_data_clean$`Total ESG Risk score`)

Summary Statistics

# Summary Statistics

# Create a summary statistics table, excluding the 'count' equivalent row
esg_data_summary <- summary(esg_data)

# Convert the summary to a data frame for better control over styling
esg_data_summary_df <- as.data.frame(t(esg_data_summary))

# Use kable from knitr to create a basic table
esg_data_describe <- kable(esg_data_summary_df[-1, ], "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# Add a header above with the correct number of columns
# The number '4' should match the actual number of columns in your table
esg_data_describe <- esg_data_describe %>%
  add_header_above(c("Summary Statistics" = 4)) %>%
  row_spec(0, background = "white", color = "black") %>%
  column_spec(1, bold = TRUE, border_right = TRUE) %>%
  scroll_box(width = "100%", height = "500px")

# Display the table
esg_data_describe

Summary Statistics
	Var1	Var2	Freq
2	Name		Length:503
3	Address		Length:503
4	Sector		Length:503
5	Industry		Length:503
6	Full Time Employees		Min. : 23
7	Description		Length:503
8	Total ESG Risk score		Min. : 7.00
9	Environment Risk Score		Min. : 0.000
10	Governance Risk Score		Min. : 3.000
11	Social Risk Score		Min. : 1.100
12	Controversy Level		Length:503
13	Controversy Score		Min. :0.000
14	ESG Risk Percentile		Length:503
15	ESG Risk Level		Length:503
16	Symbol		Class :character
17	Name		Class :character
18	Address		Class :character
19	Sector		Class :character
20	Industry		Class :character
21	Full Time Employees		1st Qu.: 10076
22	Description		Class :character
23	Total ESG Risk score		1st Qu.:16.00
24	Environment Risk Score		1st Qu.: 1.500
25	Governance Risk Score		1st Qu.: 5.000
26	Social Risk Score		1st Qu.: 6.600
27	Controversy Level		Class :character
28	Controversy Score		1st Qu.:1.000
29	ESG Risk Percentile		Class :character
30	ESG Risk Level		Class :character
31	Symbol		Mode :character
32	Name		Mode :character
33	Address		Mode :character
34	Sector		Mode :character
35	Industry		Mode :character
36	Full Time Employees		Median : 20000
37	Description		Mode :character
38	Total ESG Risk score		Median :21.00
39	Environment Risk Score		Median : 3.800
40	Governance Risk Score		Median : 6.000
41	Social Risk Score		Median : 8.700
42	Controversy Level		Mode :character
43	Controversy Score		Median :2.000
44	ESG Risk Percentile		Mode :character
45	ESG Risk Level		Mode :character
46	Symbol		NA
47	Name		NA
48	Address		NA
49	Sector		NA
50	Industry		NA
51	Full Time Employees		Mean : 57060
52	Description		NA
53	Total ESG Risk score		Mean :21.42
54	Environment Risk Score		Mean : 5.679
55	Governance Risk Score		Mean : 6.674
56	Social Risk Score		Mean : 9.045
57	Controversy Level		NA
58	Controversy Score		Mean :1.896
59	ESG Risk Percentile		NA
60	ESG Risk Level		NA
61	Symbol		NA
62	Name		NA
63	Address		NA
64	Sector		NA
65	Industry		NA
66	Full Time Employees		3rd Qu.: 53400
67	Description		NA
68	Total ESG Risk score		3rd Qu.:26.00
69	Environment Risk Score		3rd Qu.: 8.900
70	Governance Risk Score		3rd Qu.: 7.700
71	Social Risk Score		3rd Qu.:11.600
72	Controversy Level		NA
73	Controversy Score		3rd Qu.:2.000
74	ESG Risk Percentile		NA
75	ESG Risk Level		NA
76	Symbol		NA
77	Name		NA
78	Address		NA
79	Sector		NA
80	Industry		NA
81	Full Time Employees		Max. :2100000
82	Description		NA
83	Total ESG Risk score		Max. :46.00
84	Environment Risk Score		Max. :25.000
85	Governance Risk Score		Max. :15.500
86	Social Risk Score		Max. :21.000
87	Controversy Level		NA
88	Controversy Score		Max. :5.000
89	ESG Risk Percentile		NA
90	ESG Risk Level		NA
91	Symbol		NA
92	Name		NA
93	Address		NA
94	Sector		NA
95	Industry		NA
96	Full Time Employees		NA’s :7
97	Description		NA
98	Total ESG Risk score		NA’s :70
99	Environment Risk Score		NA’s :70
100	Governance Risk Score		NA’s :70
101	Social Risk Score		NA’s :70
102	Controversy Level		NA
103	Controversy Score		NA’s :70
104	ESG Risk Percentile		NA
105	ESG Risk Level		NA

ESG Score Distribution Analysis

## ESG Score Distribution Analysis

# Select the ESG score columns
esg_scores <- esg_data %>% 
  select(`Total ESG Risk score`, `Environment Risk Score`, `Governance Risk Score`, `Social Risk Score`)

# Define a vibrant color palette
colors <- viridis::viridis(4)

# Create histograms and box plots for each ESG score column
plots <- list()
for (col in c("Total ESG Risk score", "Environment Risk Score", "Governance Risk Score", "Social Risk Score")) {
  
  # Histogram with KDE
  p1 <- ggplot(esg_data, aes(x = .data[[col]], fill = I(colors[which(names(esg_scores) == col)]))) +
    geom_histogram(bins=20, alpha=0.6) +
    geom_density(alpha=0.7, color="white") +
    theme_minimal() +
    labs(title=paste("Distribution of", col), y="Frequency") +
    theme(legend.position="none", plot.title = element_text(hjust = 0.5))
  
  # Box plot
  p2 <- ggplot(esg_data, aes(y = .data[[col]], fill = I(colors[which(names(esg_scores) == col)]))) +
    geom_boxplot(notch=TRUE, outlier.shape=4, outlier.colour="coral") +
    theme_minimal() +
    labs(title=paste("Box Plot of", col)) +
    theme(legend.position="none", plot.title = element_text(hjust = 0.5))
  
  plots[[paste("histogram", col)]] <- p1
  plots[[paste("boxplot", col)]] <- p2
}

# Reshape the data using gather() before plotting
esg_scores_long <- gather(esg_scores, key = "score_type", value = "score_value", factor_key=TRUE)

# Combined KDE plot for all ESG scores
p3 <- ggplot(esg_scores_long, aes(x = score_value, fill = score_type)) +
  geom_density(alpha = 0.7) +
  scale_fill_viridis(discrete = TRUE) +
  labs(x = "ESG Scores", y = "Density", title = "Combined Distribution of ESG Scores") +
  theme_minimal() +
  theme(legend.position = "right", plot.title = element_text(hjust = 0.5))

plots[["combined_kde"]] <- p3

# Display all the plots
plots

## $`histogram Total ESG Risk score`

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## 
## $`boxplot Total ESG Risk score`

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## 
## $`histogram Environment Risk Score`

## Warning: Removed 70 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## 
## $`boxplot Environment Risk Score`

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## 
## $`histogram Governance Risk Score`

## Warning: Removed 70 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## 
## $`boxplot Governance Risk Score`

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## 
## $`histogram Social Risk Score`

## Warning: Removed 70 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## 
## $`boxplot Social Risk Score`

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## 
## $combined_kde

## Warning: Removed 280 rows containing non-finite outside the scale range
## (`stat_density()`).

# Create a copy of the original dataset without missing values
esg_data_no_na <- na.omit(esg_data)

# Identify and remove outliers
esg_data_no_outliers <- esg_data_no_na
for (col in c("Total ESG Risk score", "Environment Risk Score", "Governance Risk Score", "Social Risk Score")) {
  Q1 <- quantile(esg_data_no_na[[col]], 0.25, na.rm = TRUE)
  Q3 <- quantile(esg_data_no_na[[col]], 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  # Remove outliers from the copy of the dataset
  esg_data_no_outliers <- esg_data_no_outliers %>%
    filter(.data[[col]] >= lower_bound & .data[[col]] <= upper_bound)
}

# Group by 'ESG Risk Level' and calculate the mean 'Total ESG Risk Score' for each level
grouped_data <- esg_data_no_outliers %>%
  group_by(`ESG Risk Level`) %>%
  summarise(Mean_Total_ESG_Risk_Score = mean(`Total ESG Risk score`, na.rm = TRUE)) %>%
  ungroup()

# Define the ESG risk levels
risk_levels <- c('Negligible', 'Low', 'Medium', 'High', 'Severe')

# Reorder the factor levels of 'ESG Risk Level' based on the defined risk levels
grouped_data <- grouped_data %>%
  mutate(`ESG Risk Level` = factor(`ESG Risk Level`, levels = risk_levels))

# Create a bar chart for 'Total ESG Risk Score' for each risk level
ggplot(grouped_data, aes(x = `ESG Risk Level`, y = Mean_Total_ESG_Risk_Score, fill = `ESG Risk Level`)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set3") + # Using a Color Brewer palette
  labs(x = 'ESG Risk Level', y = 'Mean Total ESG Risk Score', title = 'Total ESG Risk Score by ESG Risk Level') +
  theme_light(base_size = 14) + # Using a light theme with base font size set
  theme(legend.position = "none") + # Remove legend
  geom_text(aes(label = sprintf("%.2f", Mean_Total_ESG_Risk_Score)), vjust = -0.5, size = 3.5) # Add data labels

# Convert the ESG Risk Level column to a factor if it's not already
esg_data$`ESG Risk Level` <- factor(esg_data$`ESG Risk Level`)

# Calculate the count of companies at each ESG risk level
risk_level_counts <- table(esg_data$`ESG Risk Level`)

# Convert the table to a dataframe for plotting
risk_level_df <- data.frame(Risk_Level = names(risk_level_counts), Count = as.numeric(risk_level_counts))

# Create a donut chart
ggplot(risk_level_df, aes(x = 2, y = Count, fill = Risk_Level)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  theme(legend.position = "bottom") +
  labs(fill = "ESG Risk Level") +
  # Use geom_text to add labels to each slice of the donut
  geom_text(aes(label = Count), position = position_stack(vjust = 0.5))

Sector & Industry Analysis

## Sector & Industry Analysis

# Sector & Industry Analysis

# Count the number of unique sectors
sectors_count <- length(unique(esg_data$Sector))
cat("Number of Unique Sectors:", sectors_count, "\n")

## Number of Unique Sectors: 12

# Count the number of unique industries
industries_count <- length(unique(esg_data$Industry))
cat("Number of Unique Industries:", industries_count, "\n")

## Number of Unique Industries: 115

# Get unique sectors and their frequencies
sector_counts <- esg_data %>%
  count(Sector, name = "Frequency") %>%
  arrange(desc(Frequency))

# Create a bar chart using ggplot2
ggplot(sector_counts, aes(x = reorder(Sector, Frequency), y = Frequency, fill = Frequency)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip coordinates for horizontal bar chart
  scale_fill_viridis_c(option = "viridis", direction = -1) +  # Use viridis color scale
  labs(title = 'S&P 500 Sectors', x = 'Sector', y = 'Frequency Count') +
  theme_minimal() +  # Use a minimal theme
  theme(legend.position = "none") +  # Remove legend
  geom_text(aes(label = Frequency), hjust = -0.1)  # Add text labels

# Calculate sector-wise average ESG scores
sector_avg_scores <- esg_data %>%
  group_by(Sector) %>%
  summarise(
    Average_Environment_Risk_Score = mean(`Environment Risk Score`, na.rm = TRUE),
    Average_Governance_Risk_Score = mean(`Governance Risk Score`, na.rm = TRUE),
    Average_Social_Risk_Score = mean(`Social Risk Score`, na.rm = TRUE)
  ) %>%
  ungroup()

# Calculate industry-wise average ESG scores
industry_avg_scores <- esg_data %>%
  group_by(Industry) %>%
  summarise(
    Average_Environment_Risk_Score = mean(`Environment Risk Score`, na.rm = TRUE),
    Average_Governance_Risk_Score = mean(`Governance Risk Score`, na.rm = TRUE),
    Average_Social_Risk_Score = mean(`Social Risk Score`, na.rm = TRUE)
  ) %>%
  ungroup()

# Define custom colors
custom_colors <- c("Environment Risk Score" = "#008080",
                   "Governance Risk Score" = "#FF00FF",
                   "Social Risk Score" = "#FFD700")

# Create a visual plot for sector-wise average ESG scores with custom colors
ggplot(sector_avg_scores, aes(x = Sector)) +
  geom_bar(aes(y = Average_Environment_Risk_Score, fill = "Environment Risk Score"), stat = "identity") +
  geom_bar(aes(y = Average_Governance_Risk_Score, fill = "Governance Risk Score"), stat = "identity") +
  geom_bar(aes(y = Average_Social_Risk_Score, fill = "Social Risk Score"), stat = "identity") +
  scale_fill_manual(values = custom_colors) +
  labs(title = 'Sector-wise Average ESG Scores', x = 'Sector', y = 'Average Score') +
  theme_minimal() +
  theme(legend.position = "bottom")

# Create a visual plot for industry-wise average ESG scores with custom colors
ggplot(industry_avg_scores, aes(x = Industry)) +
  geom_bar(aes(y = Average_Environment_Risk_Score, fill = "Environment Risk Score"), stat = "identity") +
  geom_bar(aes(y = Average_Governance_Risk_Score, fill = "Governance Risk Score"), stat = "identity") +
  geom_bar(aes(y = Average_Social_Risk_Score, fill = "Social Risk Score"), stat = "identity") +
  scale_fill_manual(values = custom_colors) +
  labs(title = 'Industry-wise Average ESG Scores', x = 'Industry', y = 'Average Score') +
  theme_minimal() +
  theme(legend.position = "bottom")

## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).

# Create a pair plot with a vibrant color palette
p <- ggpairs(esg_scores, 
             upper = list(continuous = wrap("points", color = I("#E0B0FF"), size = 3)), 
             lower = list(continuous = wrap("points", color = I("#CBC3E3"), size = 3)), 
             diag = list(continuous = wrap("densityDiag", fill = I("#7F00FF"))),
             aes(color = Symbol)
) 

# Customize the color palette
p <- p + scale_color_brewer(palette = "Set3")

# Add title and adjust theme
p <- p + ggtitle('Pairwise Scatter Plots of ESG Scores') +
  theme(
    plot.title = element_text(hjust = 0.5),  # Center the plot title
    strip.background = element_blank(),     # Remove background of facet labels
    strip.text = element_text(face = "bold") # Bold facet labels
  )

# Print the plot
print(p)

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## Warning: Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## Warning: Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## Warning: Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

Cluster Analysis

## Cluster Analysis

# Prepare data for clustering
esg_for_clustering <- esg_data %>%
  select(`Environment Risk Score`, `Governance Risk Score`, `Social Risk Score`) %>%
  na.omit() # Remove NA values

# Perform k-means clustering
set.seed(123) # Set seed for reproducibility
esg_clusters <- kmeans(scale(esg_for_clustering), centers = 5)

# Create a cluster assignment column in the original data
# Initialize the column with NA to account for the omitted rows
esg_data$Cluster <- NA

# Assign the cluster results to the rows that were not omitted
esg_data$Cluster[!is.na(esg_data$`Environment Risk Score`) &
                 !is.na(esg_data$`Governance Risk Score`) &
                 !is.na(esg_data$`Social Risk Score`)] <- esg_clusters$cluster

# Display the data with clusters
print(esg_data)

## # A tibble: 503 × 16
##    Symbol Name         Address Sector Industry `Full Time Employees` Description
##    <chr>  <chr>        <chr>   <chr>  <chr>                    <dbl> <chr>      
##  1 A      Agilent Tec… "5301 … Healt… Diagnos…                 18000 Agilent Te…
##  2 AAL    American Ai… "1 Sky… Indus… Airlines                132500 American A…
##  3 AAP    Advance Aut… "4200 … Consu… Special…                 40000 Advance Au…
##  4 AAPL   Apple Inc    "One A… Techn… Consume…                164000 Apple Inc.…
##  5 ABBV   Abbvie Inc   "1 Nor… Healt… Drug Ma…                 50000 AbbVie Inc…
##  6 ABC    Amerisource… "1 Wes… Healt… Medical…                 46000 Amerisourc…
##  7 ABT    Abbott Labo… "100 A… Healt… Medical…                115000 Abbott Lab…
##  8 ACGL   Arch Capita… "Water… Finan… Insuran…                  5800 Arch Capit…
##  9 ACN    Accenture P… "1 Gra… Techn… Informa…                732000 Accenture …
## 10 ADBE   Adobe Inc    "345 P… Techn… Softwar…                 29239 Adobe Inc.…
## # ℹ 493 more rows
## # ℹ 9 more variables: `Total ESG Risk score` <dbl>,
## #   `Environment Risk Score` <dbl>, `Governance Risk Score` <dbl>,
## #   `Social Risk Score` <dbl>, `Controversy Level` <chr>,
## #   `Controversy Score` <dbl>, `ESG Risk Percentile` <chr>,
## #   `ESG Risk Level` <fct>, Cluster <int>

Does Size Matter?

# Create scatter plots for each ESG score against the number of full-time employees

# Total ESG Risk score vs Full Time Employees
ggplot(esg_data, aes(x = `Full Time Employees`, y = `Total ESG Risk score`)) +
  geom_point(aes(color = `Total ESG Risk score`), alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Total ESG Risk score vs Full Time Employees",
       x = "Number of Full Time Employees",
       y = "Total ESG Risk score") +
  theme_minimal()

## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Repeat the plot for Environment, Governance, and Social Risk Scores
# Environment Risk Score vs Full Time Employees
ggplot(esg_data, aes(x = `Full Time Employees`, y = `Environment Risk Score`)) +
  geom_point(aes(color = `Environment Risk Score`), alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Environment Risk Score vs Full Time Employees",
       x = "Number of Full Time Employees",
       y = "Environment Risk Score") +
  theme_minimal()

## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Governance Risk Score vs Full Time Employees
ggplot(esg_data, aes(x = `Full Time Employees`, y = `Governance Risk Score`)) +
  geom_point(aes(color = `Governance Risk Score`), alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Governance Risk Score vs Full Time Employees",
       x = "Number of Full Time Employees",
       y = "Governance Risk Score") +
  theme_minimal()

## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Social Risk Score vs Full Time Employees
ggplot(esg_data, aes(x = `Full Time Employees`, y = `Social Risk Score`)) +
  geom_point(aes(color = `Social Risk Score`), alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Social Risk Score vs Full Time Employees",
       x = "Number of Full Time Employees",
       y = "Social Risk Score") +
  theme_minimal()

## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_point()`).

Controversy Analysis

# Create a violin plot to examine the distribution of Total ESG Risk score across different controversy levels
ggplot(esg_data, aes(x = `Controversy Level`, y = `Total ESG Risk score`, fill = `Controversy Level`)) +
  geom_violin(trim = FALSE) + # Use trim=FALSE to show the full distribution
  scale_fill_brewer(palette = "Set3") + # Vibrant color palette
  labs(title = "Violin Plot of Total ESG Risk Scores by Controversy Level",
       x = "Controversy Level",
       y = "Total ESG Risk Score") +
  theme_minimal() +
  theme(legend.position = "none") # Hide the legend if not needed

## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.

Correlation Analysis

# Select relevant columns for correlation analysis
esg_subset <- esg_data[, c("Total ESG Risk score", "Environment Risk Score", "Governance Risk Score", "Social Risk Score")]

# Compute the correlation matrix
cor_matrix <- cor(esg_subset)

# Melt the correlation matrix into a long format
cor_melted <- melt(cor_matrix)

# Create a vibrant heatmap
ggplot(data = cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradientn(colors = brewer.pal(9, "Spectral")) +
  theme_minimal() +
  labs(title = "Correlation Heatmap of ESG Scores",
       x = "ESG Scores",
       y = "ESG Scores") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Regression Analysis

# Run a linear regression model
model <- lm(`Total ESG Risk score` ~ `Environment Risk Score` + `Governance Risk Score` + `Social Risk Score`, data = esg_data)

# Summary of the model to see coefficients and statistics
summary(model)

## 
## Call:
## lm(formula = `Total ESG Risk score` ~ `Environment Risk Score` + 
##     `Governance Risk Score` + `Social Risk Score`, data = esg_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11381 -0.22486 -0.03485  0.20012  1.18280 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.212983   0.072291   2.946  0.00339 ** 
## `Environment Risk Score` 1.001930   0.003710 270.068  < 2e-16 ***
## `Governance Risk Score`  0.964371   0.009754  98.873  < 2e-16 ***
## `Social Risk Score`      1.004262   0.005565 180.453  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4034 on 429 degrees of freedom
##   (70 observations deleted due to missingness)
## Multiple R-squared:  0.9969, Adjusted R-squared:  0.9969 
## F-statistic: 4.655e+04 on 3 and 429 DF,  p-value: < 2.2e-16

# To check for model assumptions, create diagnostic plots
par(mfrow = c(2, 2))

# Scatter plot of residuals vs. fitted values
plot(model, which = 1, col = "#FF5733", pch = 16)

# Normal Q-Q plot
plot(model, which = 2, col = "#4CBB17", pch = 16)

# Scale-location plot (square root of standardized residuals vs. fitted values)
plot(model, which = 3, col = "#0074D9", pch = 16)

# Residuals vs. leverage plot
plot(model, which = 5, col = "#FFC300", pch = 16)

Inferences from the Data Analysis

In the landscape of ESG risk assessment, it is observed that the Energy, Utility, and Materials sectors stand out as the leading trio, each bearing significant ESG risks. These sectors are intrinsically linked to the extraction, production, and consumption of raw materials, placing them at the forefront of environmental scrutiny and sustainability challenges. Their operations are resource-intensive, often directly impacting ecological systems, which heightens their exposure to ESG-related risks. Contrastingly, the Real Estate sector operates on a different paradigm. It is less directly tied to the immediate use of raw materials, as its primary transactions—renting and purchasing existing properties—do not inherently require new construction. This sector offers an alternative that circumvents the initial phase of raw material consumption, distinguishing it from those industries where the procurement and processing of raw goods are fundamental to their business models. The Real Estate sector’s ESG risk profile is thus shaped by different factors, such as property management practices, tenant relations, and investment in sustainable buildings, rather than the direct use of natural resources.

A nuanced landscape emerges where company size, as measured by the number of full-time employees, does not consistently predict ESG risk scores. For instance, industry giants such as Apple Inc, boasting a workforce of 164,000, demonstrate a relatively modest ESG risk score of 17. This indicates that despite their vast operations, they manage to maintain lower levels of ESG-related risks. On the contrary, smaller entities like Albemarle Corp, with a workforce of 7,400, exhibit a higher ESG risk score of 29, suggesting that smaller scale does not inherently equate to lower ESG risk. The spectrum of Controversy Levels further adds complexity to the ESG narrative. These levels span from ‘None’ to ‘Significant’, painting a picture of the varied challenges companies face that could potentially tarnish their reputations and affect their ESG standings. Notably, companies such as Apple Inc and Abbvie Inc find themselves in the ‘Significant’ controversy bracket, hinting at past or ongoing issues that have caught public attention and may influence their ESG ratings. Such controversies could stem from a range of issues, including environmental incidents, social injustices, or governance failures, and they underscore the importance of robust risk management and transparent corporate practices. These insights reveal that ESG performance is influenced by a complex interplay of factors beyond just company size. They underscore the importance for investors and stakeholders to look beyond the surface and consider a holistic view of a company’s operations, policies, and the broader impact of its actions when evaluating ESG performance. The data suggests that effective ESG management can occur at any scale and that vigilance is key in maintaining a positive public image and strong ESG profile. Also, there is no correlation between environment, social and governance scores. Regression analysis is also of no use in this case.

Recommendations for ESG Investors and Portfolio Managers

Diversify Across Sectors: Balance portfolios by investing in a mix of sectors with varying ESG risk profiles, including those with high raw material dependency and those like Real Estate with different risk factors.
Assess Beyond Size: Evaluate ESG risks without preconceived notions about company size. Both large and small companies can exhibit a range of ESG risk scores, and each should be assessed on its individual merits and practices.
Monitor Controversy Levels: Keep a close watch on controversy levels as they can significantly impact a company’s ESG rating and public perception. Engage in active dialogue with companies that have higher controversy levels to understand their risk mitigation strategies.
Holistic ESG Evaluation: Adopt a comprehensive approach to ESG evaluation that considers the interplay of environmental, social, and governance factors, rather than relying solely on regression analysis or individual score correlations.
Sustainable Real Estate Practices: For investments in the Real Estate sector, prioritize companies that demonstrate sustainable property management, strong tenant relations, and investments in eco-friendly buildings.
Risk Management: Implement robust risk management practices to navigate the complexities of ESG risks, especially for sectors heavily reliant on raw materials.
Transparent Corporate Practices: Advocate for transparency in corporate practices, as this can mitigate the risks associated with controversies and improve a company’s ESG standing.
Active Engagement: Engage proactively with companies to encourage improvements in ESG practices, regardless of their size or sector.
Continuous Education: Stay informed about the latest developments in ESG criteria and adjust investment strategies accordingly to reflect the evolving landscape.
Positive Impact Investments: Consider the broader impact of investments and support companies that not only have strong ESG scores but also contribute positively to societal and environmental goals.

These recommendations aim to guide ESG-focused investment decisions, ensuring that portfolios are not only financially sound but also aligned with sustainable and ethical practices.