Table of Contents


Synopsis

Working Directory and Required Packages

Data Imporation

Merge Data by Country Code, CPI, and GDP

Feature Engineering

Statistical Analysis of Factors Affecting ESG Scores, via Linear Regression Modeling

First Statistical Analysis
Second Statistical Analysis



Synopsis



This document examines the economic factors affecting State-Owned Enterprises and Non-State-Owned Enterprises, combined with the Corruption Perceptions Index rating of the Countries where the Enterprises originate, in order to predict the Environmental, Social, and Governance Investing score of the Enterprise.

To accomplish that objective, this document firstly examines if there is a different association between corruption and ESG Scores for State-owned enterprises (SOE) vs. non-state-owned enterprises (non-SOE). The data gathered for this analysis spans the years from 2015 to 2019, (i.e. 2015, 2016, 2017, 2018, and 2019).

Scaling and Winsorizing are applied to the dataset’s numeric variables, in order to normalize the variable ranges and limit extreme values to prevent outliers.

The trial interaction used for linear regression of the dataset’s independent variables is ESG = dummy SOE (SOE=1, non-SOE=0) + Corruption + SOE x Corruption + the effects of control variables. Secondly, this document examines to what degree Corruption affects the ESG score of those SOEs. The hypothesis would thus be whether corruption affects ESG scores in a positive or negative way, or ESG = Corruption score of SOE + control variables.

This document also seeks to examine year fixed effects and country fixed effects within the dataset. The results are presented as descriptive statistics, regression tables, correlation tables and Quantile-Quantile analysis of residuals, (that also displays possible outliers).

The main independent variable, Corruption Perceptions Index (CPI), ranks countries by the degree to which corruption is perceived to exist among public officials and politicians. That perception is determined via different assessments and opinion surveys carried out by reliably independent institutions. The assessments capture information about the administrative and political corruption. The surveys/assessments used to calculate the index include questions relating to bribery of public officials, kickbacks in public procurement, embezzlement of public funds, and probe the effectiveness of public sector anti-corruption efforts. The results are given on a score scale of 0 to 100, where 0 is highly corrupt and 100 is very clean.

The control variables are as follows, (including logarithms):
ROE in %
ln (Total assets in dollars)
Total debt to total capital = leverage
ln (Market Capitalization)
ln (GDP per capita)
Market to book value

This RMarkdown document is scripted in the R programming language, and imports spreadsheets that contain the following information:
Data for State-owned companies
Data for non state-owned companies
A list of Corruption Perceptions Index ratings
A list of GDP per capita



Working Directory and Required Packages



# Set working directory to the folder with the project files
# setwd("C:/Users/johna/Dropbox/Programming/CorrelationAnalysis")

# Required R language packages
library(data.table)
library(readxl)
library(DescTools)
library(corrplot)
library(lmtest)
library(sandwich)
library(knitr)



Data Importation



The objective of this section of the document is to import the source spreadsheet of data for statistical analysis. The source spreadsheet contained four tabs of spreadsheet forms with State-Owned Enterprise data, Non-State Owned Enterprise data, CPI ratings and GDP totals.

After importing the source data, this document section matches the four data forms with the objective to create clean datasets where every column is an observation, and every row is a record.

The steps to accomplish this involved import State-Owned Enterprise data, sheet 1, then removing the first three rows of blank cells. The new first row then contained the column names for the dataset. After column name formatting, the remaining first row of data was removed. Then, a separate dataset of esg scores was created from the sheet 1 data. Other separate datasets were created, including return on equity data, total_assets, total_debt, market capitalization, and market value to book.

Sheet 2 with non-state owned enterprise data was then processed in the same manner as the sheet 1 data. Column names were formatted, then separate datasets for non-soe esg scores, non-soe return on equity, non-soe total assets, non-soe total debt, non-soe market capitalization, and non-soe market value to book were created.

Sheet 3 with corruption perceptions index ratings was then processed to created column names for a separate CPI dataset. Sheet 4 with gross domestic product totals was also processed to created column names for a separate GDP dataset.





Merge Data by Country Code, CPI, and GDP



The country codes found in the columnar data of the datasets allowed for the merging of the economic datasets of sheet 1 and 2, and the CPI/GDP data of sheets 3 and 4. The beginning of this process was the column binding of the economic datasets of the previous section, thereby eliminating redundant spreadsheet row and column names. The SOE data and the non-SOE data was then row binded to create a final dataset containing all of the data of the four spreadsheets with precise matching of the data values.





Feature Engineering



The all_data dataset is divided into yearly dataset’s with the clean data format of one observation per column and one record per row. A numeric Country Code variable is added to the dataset. Missing values in the datasets are imputted with the mean of the columnar data. The dataset’s values are normalized for number range compatibility, and then Winsorized to limit extreme values to prevent outliers. After each dataset has been feature engineered, the first five rows of data is printed in table format.



# Subset 2015 Data
AllData2015 <- all_data[,c(1:3,9,21,33,45,57,69,75,77,87)]

colnames(AllData2015) <- c("Country Code", "ISIN", "Enterprise_Name", "ESG", "ROE", "Total_Assets", "Total_Debt", "MarketCap", "MarketVal", "Type_of_Enterprise", "CPI", "GDP")

# Create a numeric 'country' variable
AllData2015$Country_Number <- as.numeric(as.factor(AllData2015$`Country Code`))

# Replace missing values with the mean
AllData2015 <- data.frame(sapply(AllData2015,
    function(x) ifelse(is.na(x), mean(x, na.rm = TRUE),
                       x)))

# Convert columns to numeric
cols.num <- c(4:13)
AllData2015[cols.num] <- sapply(AllData2015[cols.num],as.numeric)

# Normalization of the dataset's numeric ranges with MinMax Scaler
cols.num <- c(4:12)
norm_minmax <- function(x){ (x- min(x)) /(max(x)-min(x)) }
AllData2015[cols.num] <- sapply(AllData2015[cols.num], norm_minmax)

# Winsorize() data, to limit extreme values to prevent outliers
AllData2015[cols.num] <- sapply(AllData2015[cols.num], Winsorize)

# Subset 2016 Data
AllData2016 <- all_data[,c(1:3,10,22,34,46,58,70,75,78,88)]
colnames(AllData2016) <- c("Country Code", "ISIN", "Enterprise_Name","ESG","ROE","Total_Assets", "Total_Debt",
"MarketCap", "MarketVal", "Type_of_Enterprise", "CPI", "GDP")

# Create a numeric 'country' variable
AllData2016$Country_Number <- as.numeric(as.factor(AllData2016$`Country Code`))

# Replace missing values with the mean
AllData2016 <- data.frame(sapply(AllData2016, function(x) ifelse(is.na(x),mean(x, na.rm = TRUE), x)))

# Convert columns back to numeric
cols.num <- c(4:13)
AllData2016[cols.num] <- sapply(AllData2016[cols.num],as.numeric)

# Normalization the dataset's numeric ranges with MinMax Scaler
cols.num <- c(4:12)
AllData2016[cols.num] <- sapply(AllData2016[cols.num], norm_minmax)

# Winsorize() data, to limit extreme values to prevent outliers
AllData2016[cols.num] <- sapply(AllData2016[cols.num], Winsorize)

# Subset 2017 Data
AllData2017 <- all_data[,c(1:3,11,23,35,47,59,71,75,79,89)]

colnames(AllData2017) <- c("Country Code", "ISIN", "Enterprise_Name","ESG","ROE","Total_Assets", "Total_Debt",
"MarketCap", "MarketVal", "Type_of_Enterprise", "CPI", "GDP")

# Create a numeric 'country' variable
AllData2017$Country_Number <- as.numeric(as.factor(AllData2017$`Country Code`))

# Replace missing values with the mean
AllData2017 <- data.frame(sapply(AllData2017,
function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)))

# Convert columns back to numeric
cols.num <- c(4:13)
AllData2017[cols.num] <- sapply(AllData2017[cols.num],as.numeric)

# Normalization the dataset's numeric ranges with MinMax Scaler
cols.num <- c(4:12)
AllData2017[cols.num] <- sapply(AllData2017[cols.num], norm_minmax)

# Winsorize() data, to limit extreme values to prevent outliers
AllData2017[cols.num] <- sapply(AllData2017[cols.num], Winsorize)

# Subset 2018 Data
AllData2018 <- all_data[,c(1:3,12,24,36,48,60,72,75,80,90)]

colnames(AllData2018) <- c("Country Code", "ISIN", "Enterprise_Name","ESG","ROE","Total_Assets", "Total_Debt",
 "MarketCap", "MarketVal", "Type_of_Enterprise", "CPI", "GDP")

# Create a numeric 'country' variable
AllData2018$Country_Number <- as.numeric(as.factor(AllData2018$`Country Code`))

# Replace missing values with the mean
AllData2018 <- data.frame(sapply(AllData2018,
 function(x) ifelse(is.na(x), mean(x, na.rm = TRUE),
                                                    x)))
# Convert columns back to numeric
cols.num <- c(4:13)
AllData2018[cols.num] <- sapply(AllData2018[cols.num],as.numeric)

# Normalization the dataset's numeric ranges with MinMax Scaler
cols.num <- c(4:12)
AllData2018[cols.num] <- sapply(AllData2018[cols.num], norm_minmax)

# Winsorize() data, to limit extreme values to prevent outliers
AllData2018[cols.num] <- sapply(AllData2018[cols.num], Winsorize)

# Subset 2019 Data
AllData2019 <- all_data[,c(1:3,13,25,37,49,61,73,75,81,91)]

colnames(AllData2019) <- c("Country Code", "ISIN", "Enterprise_Name","ESG","ROE","Total_Assets", "Total_Debt", "MarketCap", "MarketVal", "Type_of_Enterprise", "CPI", "GDP")

# Create a numeric 'country' variable
AllData2019$Country_Number <- as.numeric(as.factor(AllData2019$`Country Code`))

# Replace missing values with the mean
AllData2019 <- data.frame(sapply(AllData2019,
function(x) ifelse(is.na(x),mean(x, na.rm = TRUE), x)))

# Convert columns back to numeric
cols.num <- c(4:13)
AllData2019[cols.num] <- sapply(AllData2019[cols.num],as.numeric)

# Normalization the dataset's numeric ranges with MinMax Scaler
cols.num <- c(4:12)
AllData2019[cols.num] <- sapply(AllData2019[cols.num], norm_minmax)

# Winsorize data to limit extreme values to prevent outliers
AllData2019[cols.num] <- sapply(AllData2019[cols.num], Winsorize)

# Row bind all years dataset
AllYears <- rbind(AllData2015, AllData2016, AllData2017,
                  AllData2018, AllData2019)

# Format natural logarithms of total_assets and gdp
AllYears$Total_Assets <- log(AllYears$Total_Assets)
AllYears$GDP <- log(AllYears$GDP)

kable(head(AllYears, 5), caption ="Table 1. First Five Rows of Final Data")
Table 1. First Five Rows of Final Data
Country.Code ISIN Enterprise_Name ESG ROE Total_Assets Total_Debt MarketCap MarketVal Type_of_Enterprise CPI GDP Country_Number
AUS AU000000SXY7 SENEX ENERGY - ESG Score 0.2440964 0.8809707 -6.878128 0.0000000 0.0008051 0.8442004 1 0.8064516 -0.675431 1
AUT AT0000746409 VERBUND - ESG Score 0.7857831 0.8866388 -3.292851 0.0458114 0.0153824 0.8449882 1 0.7580645 -0.675431 3
AUT AT0000743059 OMV - ESG Score 0.8730120 0.8828275 -2.297833 0.0486550 0.0318451 0.8434589 1 0.7580645 -0.675431 3
BRA BRTRPLACNPR1 CTEEP CPAD.TMO.DE ENELA. - ESG Score 0.4528916 0.8881718 -4.958805 0.0208737 0.0085498 0.8462394 1 0.1451613 -2.311783 5
BRA BRCPLEACNPB9 CIA PARANAENSE DE - ESG Score 0.7920482 0.8877177 -3.579418 0.0436976 0.0077140 0.8431345 1 0.1451613 -2.311783 5



Statistical Analysis of Factors Affecting ESG Scores, via Linear Regression Modeling



The primary interaction used for statistical analysis of the factors affecting ESG scores, via linear regression of the dataset’s independent variables, is ESG = dummy SOE (SOE=1, non-SOE=0) + Corruption + SOE x Corruption + the effects of control variables. This calculates the descriptive statistics to examines to what degree Corruption affects the ESG score of SOEs and non-SOEs. The hypothesis would thus be whether corruption affects ESG scores in a positive or negative way, or ESG = Corruption score of SOE + control variables.



First Statistical Analysis



# Set the random number seed for consistent results
set.seed(123)

# Create train and test datasets for cross-validation
Train_AllYears <- AllYears[1:1394,]
Test_AllYears <- AllYears[1395:1640,]

# Create linear regression model of ESG data
LM_AllYears <-  lm(ESG ~ Type_of_Enterprise + CPI + ROE +
                      Total_Assets + Total_Debt +
                      MarketVal + GDP, data = Train_AllYears)

# View linear model
table_AllYears <- coeftest(LM_AllYears, vcov = vcovHC(LM_AllYears, type = "HC0"))
cat("Table 2. Linear Model of ESG Data")
## Table 2. Linear Model of ESG Data
table_AllYears
## 
## t test of coefficients:
## 
##                      Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)         0.9624385  0.0451031 21.3386 < 2.2e-16 ***
## Type_of_Enterprise  0.0382032  0.0135039  2.8291 0.0047359 ** 
## CPI                -0.0928382  0.0279498 -3.3216 0.0009183 ***
## ROE                -0.0038350  0.0172756 -0.2220 0.8243531    
## Total_Assets        0.0733216  0.0032667 22.4455 < 2.2e-16 ***
## Total_Debt          0.0133941  0.0183209  0.7311 0.4648508    
## MarketVal          -0.0279281  0.0183445 -1.5224 0.1281318    
## GDP                 0.0782599  0.0160841  4.8657 1.271e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Figure 1. Q-Q Plots of Residuals

Table 3. Accuracy of Linear Regression of ESG Data
R2 MAPE MSE
0.3562727 47.73462 0.0381979
Table 4. ESG Data Correlations
ESG ROE Total_Assets Total_Debt MarketVal Type_of_Enterprise CPI GDP Country_Number
ESG 1.0000000 0.0222453 0.5519685 0.0077814 -0.0533267 0.2154619 -0.2296139 -0.0546685 0.1467418
ROE 0.0222453 1.0000000 0.0400514 -0.6169041 -0.2080202 0.0058859 -0.0044527 -0.0059762 0.0002307
Total_Assets 0.5519685 0.0400514 1.0000000 0.0066023 0.0029277 0.2653908 -0.3289491 -0.1910761 0.2160715
Total_Debt 0.0077814 -0.6169041 0.0066023 1.0000000 0.0461991 -0.0007432 0.0084901 0.0030058 0.0021280
MarketVal -0.0533267 -0.2080202 0.0029277 0.0461991 1.0000000 -0.0026466 -0.0063423 -0.0536503 0.0027455
Type_of_Enterprise 0.2154619 0.0058859 0.2653908 -0.0007432 -0.0026466 1.0000000 -0.5218536 -0.3609720 -0.1331846
CPI -0.2296139 -0.0044527 -0.3289491 0.0084901 -0.0063423 -0.5218536 1.0000000 0.6107058 -0.0209468
GDP -0.0546685 -0.0059762 -0.1910761 0.0030058 -0.0536503 -0.3609720 0.6107058 1.0000000 0.0301822
Country_Number 0.1467418 0.0002307 0.2160715 0.0021280 0.0027455 -0.1331846 -0.0209468 0.0301822 1.0000000



Second Statistical Analysis



# Set the random number seed for consistent results
set.seed(123)

# Subset the data for SOEs only
AllYearsSOE <- AllYears[AllYears$Type_of_Enterprise=="1",]

# Create train and test datasets for cross-validation
Train_AllYearsSOE <- AllYearsSOE[1:323,]
Test_AllYearsSOE <- AllYearsSOE[323:430,]

# Create linear regression model of SOE only ESG data
LM_AllYearsSOE <-  lm(ESG ~ Type_of_Enterprise + CPI +
                         ROE + Total_Assets + Total_Debt +
                         MarketVal + GDP,
                      data = Train_AllYearsSOE)

# View linear model
table_AllYearsSOE <- coeftest(LM_AllYearsSOE,
                              vcov = vcovHC(LM_AllYearsSOE,
                                            type = "HC0"))
cat("Table 5. Linear Model of SOE-only ESG Data")
## Table 5. Linear Model of SOE-only ESG Data
table_AllYearsSOE
## 
## t test of coefficients:
## 
##                Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   1.0834273  0.0761951 14.2191 < 2.2e-16 ***
## CPI          -0.3973674  0.0764491 -5.1978 3.627e-07 ***
## ROE           0.0212066  0.0340739  0.6224  0.534147    
## Total_Assets  0.0235336  0.0090154  2.6104  0.009475 ** 
## Total_Debt    0.0133106  0.0373556  0.3563  0.721837    
## MarketVal    -0.0439978  0.0403540 -1.0903  0.276414    
## GDP           0.1602593  0.0256953  6.2369 1.430e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Figure 2. Q-Q Plots of Residuals

Table 6. Accuracy of Linear Regression of SOE-only ESG Data
R2 MAPE MSE
0.0983792 40.59049 0.036536
Table 7. SOE-only ESG Data Correlations
ESG ROE Total_Assets Total_Debt MarketVal CPI GDP Country_Number
ESG 1.0000000 0.0162377 0.2250581 0.0201756 -0.0463844 -0.0878779 0.0465789 -0.0399526
ROE 0.0162377 1.0000000 0.0494469 -0.6151376 -0.2045771 0.0196242 -0.0031050 0.0031607
Total_Assets 0.2250581 0.0494469 1.0000000 0.0047786 -0.0201710 -0.3299395 -0.2447386 -0.0631686
Total_Debt 0.0201756 -0.6151376 0.0047786 1.0000000 0.0413891 -0.0134918 0.0059345 -0.0067086
MarketVal -0.0463844 -0.2045771 -0.0201710 0.0413891 1.0000000 -0.0471787 -0.0329667 -0.0015522
CPI -0.0878779 0.0196242 -0.3299395 -0.0134918 -0.0471787 1.0000000 0.8869413 -0.0527573
GDP 0.0465789 -0.0031050 -0.2447386 0.0059345 -0.0329667 0.8869413 1.0000000 -0.0515096
Country_Number -0.0399526 0.0031607 -0.0631686 -0.0067086 -0.0015522 -0.0527573 -0.0515096 1.0000000