Text Mining of Year 2020 ESG Report from a Power Generator Company

Background

This post applies text mining to the Year 2020 ESG (Environmental, Social and Governance) report of VPower, a power generation and engineering company.
The company was selected at random, for demonstration only.
The goal is to find the most common words in the report, which is a quick way to see what the company mentions or emphasizes.

File Import and Extraction

The English version of the Year 2020 ESG report can be downloaded from the company's official website: http://vpower.com/wp-content/uploads/2021/06/E_SustainabilityReport_2020.pdf

The PDF file has 50 pages in total.
Some parts are removed before text mining:

1) Pages 1 to 3 are the cover and contents pages, and pages 34 to 50 contain the data summary, index and compliance information. These pages are excluded from the analysis.

2) Each page also has a footer, which can be treated as the last 3 lines of the page. The footer text is removed by splitting each page into lines and dropping those lines.
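
The two exclusions above could be sketched as follows. This is a minimal sketch, not the code used for this report: the helper name `drop_footer` and the toy `pages` vector are hypothetical. It assumes each page arrives as one string with lines separated by "\n", which is the shape returned by `pdftools::pdf_text()` (one string per page).

```r
# Hypothetical helper: remove the last `n_footer` lines from one page of text.
drop_footer <- function(page, n_footer = 3) {
  lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
  keep <- head(lines, max(length(lines) - n_footer, 0))
  paste(keep, collapse = "\n")
}

# Toy stand-in for the result of pdftools::pdf_text() on the report:
# 50 pages, each with two body lines followed by three footer lines.
pages <- sprintf(
  "Body A of page %d\nBody B of page %d\nFooter 1\nFooter 2\nFooter 3",
  1:50, 1:50
)

# Keep pages 4 to 33 only (drops the cover/contents pages and pages 34 to 50).
body_pages <- pages[4:33]

# Strip the footer from every remaining page.
clean_pages <- vapply(body_pages, drop_footer, character(1), USE.NAMES = FALSE)
```

With a real PDF, `pages` would be replaced by the output of `pdf_text()`, and the footer length may need checking page by page, since treating the last 3 lines as the footer is only an approximation.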

Data Cleaning Process

Cleaning can be done in two ways:
Possible Solution 1: Use the built-in transformations of the tm package: removeNumbers, removePunctuation, removeWords, stemDocument and stripWhitespace.
Possible Solution 2: Create content transformers, i.e., functions which modify the content of an R object.
Format : f <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
Example : to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x)), where x is the text content of each document in the corpus.
e.g. In the following code, any character that is not a digit or an English letter (pattern "[^0-9a-zA-Z]") is changed to a space.
docs <- tm_map(docs, to_space, "[^0-9a-zA-Z]")
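
Both approaches can be combined in one pipeline. The sketch below runs on a two-document toy corpus rather than the report, and assumes the tm package is installed. The choice of steps is an assumption for illustration (in particular, removing English stop words with `stopwords("english")`), since the exact steps used for the report are not listed above.

```r
library(tm)

# Toy corpus standing in for the cleaned report pages (hypothetical text).
docs <- VCorpus(VectorSource(c(
  "Energy & power: 2020 emissions!",
  "Waste/water @ the plant"
)))

# Solution 2: a custom content transformer that replaces a pattern with a space.
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, to_space, "[^0-9a-zA-Z]")

# Solution 1: built-in transformations.
docs <- tm_map(docs, content_transformer(tolower))       # lower-case everything
docs <- tm_map(docs, removeNumbers)                      # drop digits
docs <- tm_map(docs, removeWords, stopwords("english"))  # assumption: English stop words
docs <- tm_map(docs, stripWhitespace)                    # collapse repeated spaces

# Pull the cleaned text back out as a plain character vector.
cleaned <- vapply(docs, function(d) trimws(as.character(d)), character(1))
```

Replacing non-alphanumeric characters with a space, rather than deleting them, keeps words such as "Waste/water" from being glued together.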

Create Document Term Matrix and Data Frame

dtm_esg_2020=TermDocumentMatrix(corpus_1)
matrix_esg_2020=as.matrix(dtm_esg_2020)
# Sort the word result in descending order
matrix_esg_2020=sort(rowSums(matrix_esg_2020),decreasing=TRUE)
df_esg_2020=data.frame(word=names(matrix_esg_2020),freq=matrix_esg_2020)
# Convert back to Title (only the first letter is uppercase)
df_esg_2020$word=str_to_title(df_esg_2020$word)
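
To see why `rowSums` yields the word totals, here is the same sequence on a toy corpus (hypothetical text, assuming the tm package is installed). In a term-document matrix each row is a term and each column is a document, so summing across a row adds up that term's count over all documents.

```r
library(tm)

# Two tiny documents standing in for report pages (hypothetical text).
toy <- VCorpus(VectorSource(c("energy power energy", "power waste energy")))

# Rows are terms, columns are documents.
tdm <- TermDocumentMatrix(toy)
m <- as.matrix(tdm)

# Total count of each term across all documents, largest first.
counts <- sort(rowSums(m), decreasing = TRUE)
toy_df <- data.frame(word = names(counts), freq = counts)
```

By default, `TermDocumentMatrix` discards words shorter than three characters, so very short tokens never reach the frequency table unless the `wordLengths` control option is changed.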

Further Cleaning

Take a look at the data frame after text mining.

head(df_esg_2020)

##                        word freq
## energy               Energy   60
## power                 Power   59
## emissions         Emissions   58
## waste                 Waste   48
## environmental Environmental   41
## business           Business   40

Some words can be converted or combined.

# Abbreviation
df_esg_2020$word[df_esg_2020$word=="Ghg"]="Greenhouse Gas"
df_esg_2020$word[df_esg_2020$word=="Gri"]="Global Reporting Initiative"
df_esg_2020$word[df_esg_2020$word=="Esg"]="Environmental Social and Governance"
# Location
df_esg_2020$word[df_esg_2020$word=="Hong"]="Hong Kong"
df_esg_2020$word[df_esg_2020$word=="Kong"]=""
# Similar word or different part of speech
# Environment vs Environmental
df_esg_2020$freq[df_esg_2020$word=="Environment"]=df_esg_2020$freq[df_esg_2020$word=="Environment"]+df_esg_2020$freq[df_esg_2020$word=="Environmental"]
df_esg_2020$freq[df_esg_2020$word=="Environmental"]=0
# Sustainability vs Sustainable
df_esg_2020$freq[df_esg_2020$word=="Sustainability"]=df_esg_2020$freq[df_esg_2020$word=="Sustainability"]+df_esg_2020$freq[df_esg_2020$word=="Sustainable"]
df_esg_2020$freq[df_esg_2020$word=="Sustainable"]=0

# Drop the merged-away words and the blank entry left by "Kong", then sort by frequency
df_esg_2020= df_esg_2020 %>% filter(!(word %in% c("Environmental","Sustainable",""))) %>% arrange(desc(freq))
esg_2020_top_word=head(df_esg_2020,30)
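
The pairwise merges above all follow the same pattern, so they could also be written with a small helper. This is a sketch only: `merge_terms` is a hypothetical name, the counts in `toy` are made up, and the helper assumes a data frame with `word` and `freq` columns like `df_esg_2020`.

```r
# Hypothetical helper: rename every word in `from` to `to`, then add up
# the frequencies of rows that now share the same word.
merge_terms <- function(df, from, to) {
  df$word[df$word %in% from] <- to
  merged <- aggregate(freq ~ word, data = df, FUN = sum)
  merged[order(merged$freq, decreasing = TRUE), ]
}

# Toy frequency table (made-up counts, not taken from the report).
toy <- data.frame(
  word = c("Energy", "Environment", "Environmental", "Sustainability", "Sustainable"),
  freq = c(6, 2, 4, 3, 1)
)

toy <- merge_terms(toy, from = "Environmental", to = "Environment")
toy <- merge_terms(toy, from = "Sustainable", to = "Sustainability")
```

Because the rows are renamed and then summed, this also works when the target word is not yet in the table, and no zero-frequency rows are left behind to filter out.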

Plot a column chart

esg_2020_label=as.vector(esg_2020_top_word$word)
ggplotly(
ggplot(esg_2020_top_word,aes(x=factor(word,levels=esg_2020_label),y=freq,text=paste(word,freq)))+
  geom_col(fill="darkgreen")+
  xlab("")+ylab("Counts")+labs(title="Top 30 most common words in the 2020 ESG Report")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
,tooltip = c("text")
)

Make a word cloud

set.seed(1234)
esg_2020_top_word$angle=30*sample(-3:3,30,replace = TRUE, prob = c(1,1,2,4,2,1,1))
ggplot(esg_2020_top_word, aes(label = word, size = freq,angle=angle/4,color=word)) +
  geom_text_wordcloud_area(shape="circle")+scale_size_area(max_size =20) +
  theme_minimal()