Introduction

ESG Investing is increasingly becoming a crucial part of investment strategies.This analysis leverages S&P 500 ESG Scores to uncover trends and insights, guiding the creation of investment portfolios that not only promise returns but also reflect responsible stewardship of environmental, social, and governance values.”

Statement of Business Task

The core mission of of this project is to harness the power of data analytics to forge a path for ESG investment strategies. By meticulously analyzing S&P 500 ESG Scores, this project aims to pinpoint key trends and actionable insights that will empower investors to build portfolios with a conscience, aligning financial goals with environmental, social, and governance benchmarks.

Data Source Description

This dataset exclusively showcases companies from the S&P 500 index. Researchers, investors, analysts, and policy-makers can utilize this dataset to gain insights into the ESG performance and risk profiles of these major corporations. Whether exploring trends, conducting ESG assessments, or making informed investment decisions, this dataset serves as a valuable resource for comprehending the sustainability and governance practices of S&P 500 companies. Fore more information on the dataset !click here(https://www.kaggle.com/datasets/pritish509/s-and-p-500-esg-risk-ratings)

Loading the Required Libraries

# Loading required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(viridis) # For a vibrant color palette
## Loading required package: viridisLite
library(RColorBrewer)
library(forcats) # For factor releveling
library(cluster) # For cluster analysis
library(reshape2) # for melt function
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Reading the Dataset

esg_data <- read_csv("SP 500 ESG Risk Ratings.csv")
## Rows: 503 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Symbol, Name, Address, Sector, Industry, Description, Controversy L...
## dbl (5): Total ESG Risk score, Environment Risk Score, Governance Risk Score...
## num (1): Full Time Employees
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(esg_data)
## # A tibble: 6 × 15
##   Symbol Name          Address Sector Industry `Full Time Employees` Description
##   <chr>  <chr>         <chr>   <chr>  <chr>                    <dbl> <chr>      
## 1 A      Agilent Tech… "5301 … Healt… Diagnos…                 18000 Agilent Te…
## 2 AAL    American Air… "1 Sky… Indus… Airlines                132500 American A…
## 3 AAP    Advance Auto… "4200 … Consu… Special…                 40000 Advance Au…
## 4 AAPL   Apple Inc     "One A… Techn… Consume…                164000 Apple Inc.…
## 5 ABBV   Abbvie Inc    "1 Nor… Healt… Drug Ma…                 50000 AbbVie Inc…
## 6 ABC    Amerisourceb… "1 Wes… Healt… Medical…                 46000 Amerisourc…
## # ℹ 8 more variables: `Total ESG Risk score` <dbl>,
## #   `Environment Risk Score` <dbl>, `Governance Risk Score` <dbl>,
## #   `Social Risk Score` <dbl>, `Controversy Level` <chr>,
## #   `Controversy Score` <dbl>, `ESG Risk Percentile` <chr>,
## #   `ESG Risk Level` <chr>
colnames(esg_data)
##  [1] "Symbol"                 "Name"                   "Address"               
##  [4] "Sector"                 "Industry"               "Full Time Employees"   
##  [7] "Description"            "Total ESG Risk score"   "Environment Risk Score"
## [10] "Governance Risk Score"  "Social Risk Score"      "Controversy Level"     
## [13] "Controversy Score"      "ESG Risk Percentile"    "ESG Risk Level"
glimpse(esg_data)
## Rows: 503
## Columns: 15
## $ Symbol                   <chr> "A", "AAL", "AAP", "AAPL", "ABBV", "ABC", "AB…
## $ Name                     <chr> "Agilent Technologies Inc", "American Airline…
## $ Address                  <chr> "5301 Stevens Creek Boulevard\nSanta Clara, C…
## $ Sector                   <chr> "Healthcare", "Industrials", "Consumer Cyclic…
## $ Industry                 <chr> "Diagnostics & Research", "Airlines", "Specia…
## $ `Full Time Employees`    <dbl> 18000, 132500, 40000, 164000, 50000, 46000, 1…
## $ Description              <chr> "Agilent Technologies, Inc. provides applicat…
## $ `Total ESG Risk score`   <dbl> 15, 29, 12, 17, 28, 12, 25, 21, 10, 12, 24, 3…
## $ `Environment Risk Score` <dbl> 0.3, 12.0, 0.0, 0.6, 1.1, 1.3, 3.0, 1.0, 0.3,…
## $ `Governance Risk Score`  <dbl> 6.3, 5.0, 3.0, 9.2, 9.9, 5.2, 8.4, 12.0, 4.8,…
## $ `Social Risk Score`      <dbl> 8.6, 12.0, 8.0, 6.9, 16.8, 5.6, 13.6, 7.0, 4.…
## $ `Controversy Level`      <chr> "Low", "Moderate", "Moderate", "Significant",…
## $ `Controversy Score`      <dbl> 1, 2, 2, 3, 3, 3, 3, 2, 2, 1, 0, 3, 1, 0, 2, …
## $ `ESG Risk Percentile`    <chr> "11th percentile", "62nd percentile", "4th pe…
## $ `ESG Risk Level`         <chr> "Low", NA, "Negligible", "Low", "Medium", "Lo…

Data Cleaning

# Data Cleaning Steps

# 1. Remove rows with any missing values
esg_data_clean <- na.omit(esg_data)

# 2. Remove unnecessary columns, e.g., 'Address'
esg_data_clean <- select(esg_data_clean, -Address)

# 3. Convert 'Full Time Employees' to integer
esg_data_clean$`Full Time Employees` <- as.integer(esg_data_clean$`Full Time Employees`)

# 4. Rename 'Symbol' to 'Ticker'
esg_data_clean <- rename(esg_data_clean, Ticker = Symbol)

# 5. Address outliers in 'Total ESG Risk score'
esg_outliers <- boxplot.stats(esg_data_clean$`Total ESG Risk score`)$out
esg_data_clean <- esg_data_clean[!esg_data_clean$`Total ESG Risk score` %in% esg_outliers, ]

# 6. Normalize 'Total ESG Risk score'
esg_data_clean$`Total ESG Risk score` <- scale(esg_data_clean$`Total ESG Risk score`)

Summary Statistics

# Summary Statistics

# Create a summary statistics table, excluding the 'count' equivalent row
esg_data_summary <- summary(esg_data)

# Convert the summary to a data frame for better control over styling
esg_data_summary_df <- as.data.frame(t(esg_data_summary))

# Use kable from knitr to create a basic table
esg_data_describe <- kable(esg_data_summary_df[-1, ], "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# Add a header above with the correct number of columns
# The number '4' should match the actual number of columns in your table
esg_data_describe <- esg_data_describe %>%
  add_header_above(c("Summary Statistics" = 4)) %>%
  row_spec(0, background = "white", color = "black") %>%
  column_spec(1, bold = TRUE, border_right = TRUE) %>%
  scroll_box(width = "100%", height = "500px")

# Display the table
esg_data_describe
Summary Statistics
Var1 Var2 Freq
2 Name Length:503
3 Address Length:503
4 Sector Length:503
5 Industry Length:503
6 Full Time Employees Min. : 23
7 Description Length:503
8 Total ESG Risk score Min. : 7.00
9 Environment Risk Score Min. : 0.000
10 Governance Risk Score Min. : 3.000
11 Social Risk Score Min. : 1.100
12 Controversy Level Length:503
13 Controversy Score Min. :0.000
14 ESG Risk Percentile Length:503
15 ESG Risk Level Length:503
16 Symbol Class :character
17 Name Class :character
18 Address Class :character
19 Sector Class :character
20 Industry Class :character
21 Full Time Employees 1st Qu.: 10076
22 Description Class :character
23 Total ESG Risk score 1st Qu.:16.00
24 Environment Risk Score 1st Qu.: 1.500
25 Governance Risk Score 1st Qu.: 5.000
26 Social Risk Score 1st Qu.: 6.600
27 Controversy Level Class :character
28 Controversy Score 1st Qu.:1.000
29 ESG Risk Percentile Class :character
30 ESG Risk Level Class :character
31 Symbol Mode :character
32 Name Mode :character
33 Address Mode :character
34 Sector Mode :character
35 Industry Mode :character
36 Full Time Employees Median : 20000
37 Description Mode :character
38 Total ESG Risk score Median :21.00
39 Environment Risk Score Median : 3.800
40 Governance Risk Score Median : 6.000
41 Social Risk Score Median : 8.700
42 Controversy Level Mode :character
43 Controversy Score Median :2.000
44 ESG Risk Percentile Mode :character
45 ESG Risk Level Mode :character
46 Symbol NA
47 Name NA
48 Address NA
49 Sector NA
50 Industry NA
51 Full Time Employees Mean : 57060
52 Description NA
53 Total ESG Risk score Mean :21.42
54 Environment Risk Score Mean : 5.679
55 Governance Risk Score Mean : 6.674
56 Social Risk Score Mean : 9.045
57 Controversy Level NA
58 Controversy Score Mean :1.896
59 ESG Risk Percentile NA
60 ESG Risk Level NA
61 Symbol NA
62 Name NA
63 Address NA
64 Sector NA
65 Industry NA
66 Full Time Employees 3rd Qu.: 53400
67 Description NA
68 Total ESG Risk score 3rd Qu.:26.00
69 Environment Risk Score 3rd Qu.: 8.900
70 Governance Risk Score 3rd Qu.: 7.700
71 Social Risk Score 3rd Qu.:11.600
72 Controversy Level NA
73 Controversy Score 3rd Qu.:2.000
74 ESG Risk Percentile NA
75 ESG Risk Level NA
76 Symbol NA
77 Name NA
78 Address NA
79 Sector NA
80 Industry NA
81 Full Time Employees Max. :2100000
82 Description NA
83 Total ESG Risk score Max. :46.00
84 Environment Risk Score Max. :25.000
85 Governance Risk Score Max. :15.500
86 Social Risk Score Max. :21.000
87 Controversy Level NA
88 Controversy Score Max. :5.000
89 ESG Risk Percentile NA
90 ESG Risk Level NA
91 Symbol NA
92 Name NA
93 Address NA
94 Sector NA
95 Industry NA
96 Full Time Employees NA’s :7
97 Description NA
98 Total ESG Risk score NA’s :70
99 Environment Risk Score NA’s :70
100 Governance Risk Score NA’s :70
101 Social Risk Score NA’s :70
102 Controversy Level NA
103 Controversy Score NA’s :70
104 ESG Risk Percentile NA
105 ESG Risk Level NA

ESG Score Distribution Analysis

## ESG Score Distribution Analysis

# Select the ESG score columns
esg_scores <- esg_data %>% 
  select(`Total ESG Risk score`, `Environment Risk Score`, `Governance Risk Score`, `Social Risk Score`)

# Define a vibrant color palette
colors <- viridis::viridis(4)

# Create histograms and box plots for each ESG score column
plots <- list()
for (col in c("Total ESG Risk score", "Environment Risk Score", "Governance Risk Score", "Social Risk Score")) {
  
  # Histogram with KDE
  p1 <- ggplot(esg_data, aes(x = .data[[col]], fill = I(colors[which(names(esg_scores) == col)]))) +
    geom_histogram(bins=20, alpha=0.6) +
    geom_density(alpha=0.7, color="white") +
    theme_minimal() +
    labs(title=paste("Distribution of", col), y="Frequency") +
    theme(legend.position="none", plot.title = element_text(hjust = 0.5))
  
  # Box plot
  p2 <- ggplot(esg_data, aes(y = .data[[col]], fill = I(colors[which(names(esg_scores) == col)]))) +
    geom_boxplot(notch=TRUE, outlier.shape=4, outlier.colour="coral") +
    theme_minimal() +
    labs(title=paste("Box Plot of", col)) +
    theme(legend.position="none", plot.title = element_text(hjust = 0.5))
  
  plots[[paste("histogram", col)]] <- p1
  plots[[paste("boxplot", col)]] <- p2
}

# Reshape the data using gather() before plotting
esg_scores_long <- gather(esg_scores, key = "score_type", value = "score_value", factor_key=TRUE)

# Combined KDE plot for all ESG scores
p3 <- ggplot(esg_scores_long, aes(x = score_value, fill = score_type)) +
  geom_density(alpha = 0.7) +
  scale_fill_viridis(discrete = TRUE) +
  labs(x = "ESG Scores", y = "Density", title = "Combined Distribution of ESG Scores") +
  theme_minimal() +
  theme(legend.position = "right", plot.title = element_text(hjust = 0.5))

plots[["combined_kde"]] <- p3

# Display all the plots
plots
## $`histogram Total ESG Risk score`
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## 
## $`boxplot Total ESG Risk score`
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## 
## $`histogram Environment Risk Score`
## Warning: Removed 70 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## 
## $`boxplot Environment Risk Score`
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## 
## $`histogram Governance Risk Score`
## Warning: Removed 70 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## 
## $`boxplot Governance Risk Score`
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## 
## $`histogram Social Risk Score`
## Warning: Removed 70 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

## 
## $`boxplot Social Risk Score`
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## 
## $combined_kde
## Warning: Removed 280 rows containing non-finite outside the scale range
## (`stat_density()`).

# Create a copy of the original dataset without missing values
esg_data_no_na <- na.omit(esg_data)

# Identify and remove outliers
esg_data_no_outliers <- esg_data_no_na
for (col in c("Total ESG Risk score", "Environment Risk Score", "Governance Risk Score", "Social Risk Score")) {
  Q1 <- quantile(esg_data_no_na[[col]], 0.25, na.rm = TRUE)
  Q3 <- quantile(esg_data_no_na[[col]], 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  # Remove outliers from the copy of the dataset
  esg_data_no_outliers <- esg_data_no_outliers %>%
    filter(.data[[col]] >= lower_bound & .data[[col]] <= upper_bound)
}

# Group by 'ESG Risk Level' and calculate the mean 'Total ESG Risk Score' for each level
grouped_data <- esg_data_no_outliers %>%
  group_by(`ESG Risk Level`) %>%
  summarise(Mean_Total_ESG_Risk_Score = mean(`Total ESG Risk score`, na.rm = TRUE)) %>%
  ungroup()

# Define the ESG risk levels
risk_levels <- c('Negligible', 'Low', 'Medium', 'High', 'Severe')

# Reorder the factor levels of 'ESG Risk Level' based on the defined risk levels
grouped_data <- grouped_data %>%
  mutate(`ESG Risk Level` = factor(`ESG Risk Level`, levels = risk_levels))

# Create a bar chart for 'Total ESG Risk Score' for each risk level
ggplot(grouped_data, aes(x = `ESG Risk Level`, y = Mean_Total_ESG_Risk_Score, fill = `ESG Risk Level`)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set3") + # Using a Color Brewer palette
  labs(x = 'ESG Risk Level', y = 'Mean Total ESG Risk Score', title = 'Total ESG Risk Score by ESG Risk Level') +
  theme_light(base_size = 14) + # Using a light theme with base font size set
  theme(legend.position = "none") + # Remove legend
  geom_text(aes(label = sprintf("%.2f", Mean_Total_ESG_Risk_Score)), vjust = -0.5, size = 3.5) # Add data labels

# Convert the ESG Risk Level column to a factor if it's not already
esg_data$`ESG Risk Level` <- factor(esg_data$`ESG Risk Level`)

# Calculate the count of companies at each ESG risk level
risk_level_counts <- table(esg_data$`ESG Risk Level`)

# Convert the table to a dataframe for plotting
risk_level_df <- data.frame(Risk_Level = names(risk_level_counts), Count = as.numeric(risk_level_counts))

# Create a donut chart
ggplot(risk_level_df, aes(x = 2, y = Count, fill = Risk_Level)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  theme(legend.position = "bottom") +
  labs(fill = "ESG Risk Level") +
  # Use geom_text to add labels to each slice of the donut
  geom_text(aes(label = Count), position = position_stack(vjust = 0.5))

Sector & Industry Analysis

## Sector & Industry Analysis
# Sector & Industry Analysis

# Count the number of unique sectors
sectors_count <- length(unique(esg_data$Sector))
cat("Number of Unique Sectors:", sectors_count, "\n")
## Number of Unique Sectors: 12
# Count the number of unique industries
industries_count <- length(unique(esg_data$Industry))
cat("Number of Unique Industries:", industries_count, "\n")
## Number of Unique Industries: 115
# Get unique sectors and their frequencies
sector_counts <- esg_data %>%
  count(Sector, name = "Frequency") %>%
  arrange(desc(Frequency))

# Create a bar chart using ggplot2
ggplot(sector_counts, aes(x = reorder(Sector, Frequency), y = Frequency, fill = Frequency)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip coordinates for horizontal bar chart
  scale_fill_viridis_c(option = "viridis", direction = -1) +  # Use viridis color scale
  labs(title = 'S&P 500 Sectors', x = 'Sector', y = 'Frequency Count') +
  theme_minimal() +  # Use a minimal theme
  theme(legend.position = "none") +  # Remove legend
  geom_text(aes(label = Frequency), hjust = -0.1)  # Add text labels

# Calculate sector-wise average ESG scores
sector_avg_scores <- esg_data %>%
  group_by(Sector) %>%
  summarise(
    Average_Environment_Risk_Score = mean(`Environment Risk Score`, na.rm = TRUE),
    Average_Governance_Risk_Score = mean(`Governance Risk Score`, na.rm = TRUE),
    Average_Social_Risk_Score = mean(`Social Risk Score`, na.rm = TRUE)
  ) %>%
  ungroup()

# Calculate industry-wise average ESG scores
industry_avg_scores <- esg_data %>%
  group_by(Industry) %>%
  summarise(
    Average_Environment_Risk_Score = mean(`Environment Risk Score`, na.rm = TRUE),
    Average_Governance_Risk_Score = mean(`Governance Risk Score`, na.rm = TRUE),
    Average_Social_Risk_Score = mean(`Social Risk Score`, na.rm = TRUE)
  ) %>%
  ungroup()

# Define custom colors
custom_colors <- c("Environment Risk Score" = "#008080",
                   "Governance Risk Score" = "#FF00FF",
                   "Social Risk Score" = "#FFD700")

# Create a visual plot for sector-wise average ESG scores with custom colors
ggplot(sector_avg_scores, aes(x = Sector)) +
  geom_bar(aes(y = Average_Environment_Risk_Score, fill = "Environment Risk Score"), stat = "identity") +
  geom_bar(aes(y = Average_Governance_Risk_Score, fill = "Governance Risk Score"), stat = "identity") +
  geom_bar(aes(y = Average_Social_Risk_Score, fill = "Social Risk Score"), stat = "identity") +
  scale_fill_manual(values = custom_colors) +
  labs(title = 'Sector-wise Average ESG Scores', x = 'Sector', y = 'Average Score') +
  theme_minimal() +
  theme(legend.position = "bottom")

# Create a visual plot for industry-wise average ESG scores with custom colors
ggplot(industry_avg_scores, aes(x = Industry)) +
  geom_bar(aes(y = Average_Environment_Risk_Score, fill = "Environment Risk Score"), stat = "identity") +
  geom_bar(aes(y = Average_Governance_Risk_Score, fill = "Governance Risk Score"), stat = "identity") +
  geom_bar(aes(y = Average_Social_Risk_Score, fill = "Social Risk Score"), stat = "identity") +
  scale_fill_manual(values = custom_colors) +
  labs(title = 'Industry-wise Average ESG Scores', x = 'Industry', y = 'Average Score') +
  theme_minimal() +
  theme(legend.position = "bottom")
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).

# Create a pair plot with a vibrant color palette
p <- ggpairs(esg_scores, 
             upper = list(continuous = wrap("points", color = I("#E0B0FF"), size = 3)), 
             lower = list(continuous = wrap("points", color = I("#CBC3E3"), size = 3)), 
             diag = list(continuous = wrap("densityDiag", fill = I("#7F00FF"))),
             aes(color = Symbol)
) 

# Customize the color palette
p <- p + scale_color_brewer(palette = "Set3")

# Add title and adjust theme
p <- p + ggtitle('Pairwise Scatter Plots of ESG Scores') +
  theme(
    plot.title = element_text(hjust = 0.5),  # Center the plot title
    strip.background = element_blank(),     # Remove background of facet labels
    strip.text = element_text(face = "bold") # Bold facet labels
  )

# Print the plot
print(p)
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 70 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_density()`).

Cluster Analysis

## Cluster Analysis

# Prepare data for clustering
esg_for_clustering <- esg_data %>%
  select(`Environment Risk Score`, `Governance Risk Score`, `Social Risk Score`) %>%
  na.omit() # Remove NA values

# Perform k-means clustering
set.seed(123) # Set seed for reproducibility
esg_clusters <- kmeans(scale(esg_for_clustering), centers = 5)

# Create a cluster assignment column in the original data
# Initialize the column with NA to account for the omitted rows
esg_data$Cluster <- NA

# Assign the cluster results to the rows that were not omitted
esg_data$Cluster[!is.na(esg_data$`Environment Risk Score`) &
                 !is.na(esg_data$`Governance Risk Score`) &
                 !is.na(esg_data$`Social Risk Score`)] <- esg_clusters$cluster

# Display the data with clusters
print(esg_data)
## # A tibble: 503 × 16
##    Symbol Name         Address Sector Industry `Full Time Employees` Description
##    <chr>  <chr>        <chr>   <chr>  <chr>                    <dbl> <chr>      
##  1 A      Agilent Tec… "5301 … Healt… Diagnos…                 18000 Agilent Te…
##  2 AAL    American Ai… "1 Sky… Indus… Airlines                132500 American A…
##  3 AAP    Advance Aut… "4200 … Consu… Special…                 40000 Advance Au…
##  4 AAPL   Apple Inc    "One A… Techn… Consume…                164000 Apple Inc.…
##  5 ABBV   Abbvie Inc   "1 Nor… Healt… Drug Ma…                 50000 AbbVie Inc…
##  6 ABC    Amerisource… "1 Wes… Healt… Medical…                 46000 Amerisourc…
##  7 ABT    Abbott Labo… "100 A… Healt… Medical…                115000 Abbott Lab…
##  8 ACGL   Arch Capita… "Water… Finan… Insuran…                  5800 Arch Capit…
##  9 ACN    Accenture P… "1 Gra… Techn… Informa…                732000 Accenture …
## 10 ADBE   Adobe Inc    "345 P… Techn… Softwar…                 29239 Adobe Inc.…
## # ℹ 493 more rows
## # ℹ 9 more variables: `Total ESG Risk score` <dbl>,
## #   `Environment Risk Score` <dbl>, `Governance Risk Score` <dbl>,
## #   `Social Risk Score` <dbl>, `Controversy Level` <chr>,
## #   `Controversy Score` <dbl>, `ESG Risk Percentile` <chr>,
## #   `ESG Risk Level` <fct>, Cluster <int>

Does Size Matter?

# Create scatter plots for each ESG score against the number of full-time employees

# Total ESG Risk score vs Full Time Employees
ggplot(esg_data, aes(x = `Full Time Employees`, y = `Total ESG Risk score`)) +
  geom_point(aes(color = `Total ESG Risk score`), alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Total ESG Risk score vs Full Time Employees",
       x = "Number of Full Time Employees",
       y = "Total ESG Risk score") +
  theme_minimal()
## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Repeat the plot for Environment, Governance, and Social Risk Scores
# Environment Risk Score vs Full Time Employees
ggplot(esg_data, aes(x = `Full Time Employees`, y = `Environment Risk Score`)) +
  geom_point(aes(color = `Environment Risk Score`), alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Environment Risk Score vs Full Time Employees",
       x = "Number of Full Time Employees",
       y = "Environment Risk Score") +
  theme_minimal()
## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Governance Risk Score vs Full Time Employees
ggplot(esg_data, aes(x = `Full Time Employees`, y = `Governance Risk Score`)) +
  geom_point(aes(color = `Governance Risk Score`), alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Governance Risk Score vs Full Time Employees",
       x = "Number of Full Time Employees",
       y = "Governance Risk Score") +
  theme_minimal()
## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Social Risk Score vs Full Time Employees
ggplot(esg_data, aes(x = `Full Time Employees`, y = `Social Risk Score`)) +
  geom_point(aes(color = `Social Risk Score`), alpha = 0.6) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Social Risk Score vs Full Time Employees",
       x = "Number of Full Time Employees",
       y = "Social Risk Score") +
  theme_minimal()
## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_point()`).

Controversy Analysis

# Create a violin plot to examine the distribution of Total ESG Risk score across different controversy levels
ggplot(esg_data, aes(x = `Controversy Level`, y = `Total ESG Risk score`, fill = `Controversy Level`)) +
  geom_violin(trim = FALSE) + # Use trim=FALSE to show the full distribution
  scale_fill_brewer(palette = "Set3") + # Vibrant color palette
  labs(title = "Violin Plot of Total ESG Risk Scores by Controversy Level",
       x = "Controversy Level",
       y = "Total ESG Risk Score") +
  theme_minimal() +
  theme(legend.position = "none") # Hide the legend if not needed
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Groups with fewer than two datapoints have been dropped.
## ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.

Correlation Analysis

# Select relevant columns for correlation analysis
esg_subset <- esg_data[, c("Total ESG Risk score", "Environment Risk Score", "Governance Risk Score", "Social Risk Score")]

# Compute the correlation matrix
cor_matrix <- cor(esg_subset)

# Melt the correlation matrix into a long format
cor_melted <- melt(cor_matrix)

# Create a vibrant heatmap
ggplot(data = cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradientn(colors = brewer.pal(9, "Spectral")) +
  theme_minimal() +
  labs(title = "Correlation Heatmap of ESG Scores",
       x = "ESG Scores",
       y = "ESG Scores") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Regression Analysis

# Run a linear regression model
model <- lm(`Total ESG Risk score` ~ `Environment Risk Score` + `Governance Risk Score` + `Social Risk Score`, data = esg_data)

# Summary of the model to see coefficients and statistics
summary(model)
## 
## Call:
## lm(formula = `Total ESG Risk score` ~ `Environment Risk Score` + 
##     `Governance Risk Score` + `Social Risk Score`, data = esg_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11381 -0.22486 -0.03485  0.20012  1.18280 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.212983   0.072291   2.946  0.00339 ** 
## `Environment Risk Score` 1.001930   0.003710 270.068  < 2e-16 ***
## `Governance Risk Score`  0.964371   0.009754  98.873  < 2e-16 ***
## `Social Risk Score`      1.004262   0.005565 180.453  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4034 on 429 degrees of freedom
##   (70 observations deleted due to missingness)
## Multiple R-squared:  0.9969, Adjusted R-squared:  0.9969 
## F-statistic: 4.655e+04 on 3 and 429 DF,  p-value: < 2.2e-16
# To check for model assumptions, create diagnostic plots
par(mfrow = c(2, 2))

# Scatter plot of residuals vs. fitted values
plot(model, which = 1, col = "#FF5733", pch = 16)

# Normal Q-Q plot
plot(model, which = 2, col = "#4CBB17", pch = 16)

# Scale-location plot (square root of standardized residuals vs. fitted values)
plot(model, which = 3, col = "#0074D9", pch = 16)

# Residuals vs. leverage plot
plot(model, which = 5, col = "#FFC300", pch = 16)

Inferences from the Data Analysis

In the landscape of ESG risk assessment, it is observed that the Energy, Utility, and Materials sectors stand out as the leading trio, each bearing significant ESG risks. These sectors are intrinsically linked to the extraction, production, and consumption of raw materials, placing them at the forefront of environmental scrutiny and sustainability challenges. Their operations are resource-intensive, often directly impacting ecological systems, which heightens their exposure to ESG-related risks. Contrastingly, the Real Estate sector operates on a different paradigm. It is less directly tied to the immediate use of raw materials, as its primary transactions—renting and purchasing existing properties—do not inherently require new construction. This sector offers an alternative that circumvents the initial phase of raw material consumption, distinguishing it from those industries where the procurement and processing of raw goods are fundamental to their business models. The Real Estate sector’s ESG risk profile is thus shaped by different factors, such as property management practices, tenant relations, and investment in sustainable buildings, rather than the direct use of natural resources.

A nuanced landscape emerges where company size, as measured by the number of full-time employees, does not consistently predict ESG risk scores. For instance, industry giants such as Apple Inc, boasting a workforce of 164,000, demonstrate a relatively modest ESG risk score of 17. This indicates that despite their vast operations, they manage to maintain lower levels of ESG-related risks. On the contrary, smaller entities like Albemarle Corp, with a workforce of 7,400, exhibit a higher ESG risk score of 29, suggesting that smaller scale does not inherently equate to lower ESG risk. The spectrum of Controversy Levels further adds complexity to the ESG narrative. These levels span from ‘None’ to ‘Significant’, painting a picture of the varied challenges companies face that could potentially tarnish their reputations and affect their ESG standings. Notably, companies such as Apple Inc and Abbvie Inc find themselves in the ‘Significant’ controversy bracket, hinting at past or ongoing issues that have caught public attention and may influence their ESG ratings. Such controversies could stem from a range of issues, including environmental incidents, social injustices, or governance failures, and they underscore the importance of robust risk management and transparent corporate practices. These insights reveal that ESG performance is influenced by a complex interplay of factors beyond just company size. They underscore the importance for investors and stakeholders to look beyond the surface and consider a holistic view of a company’s operations, policies, and the broader impact of its actions when evaluating ESG performance. The data suggests that effective ESG management can occur at any scale and that vigilance is key in maintaining a positive public image and strong ESG profile. Also, there is no correlation between environment, social and governance scores. Regression analysis is also of no use in this case.

Recommendations for ESG Investors and Portfolio Managers

These recommendations aim to guide ESG-focused investment decisions, ensuring that portfolios are not only financially sound but also aligned with sustainable and ethical practices.