Text Mining of the 2020 ESG Report of VPower, a Power Generation and
Engineering Company.
The company was selected at random and is used for
demonstration only.
I would like to find the most common
words in the company's ESG (Environmental, Social and Governance) annual report. This
is a quick way to see what the company mentions or
emphasizes.
The English version of the 2020 ESG report can be downloaded
from the company's official website: http://vpower.com/wp-content/uploads/2021/06/E_SustainabilityReport_2020.pdf
The PDF file has 50 pages in total.
Some parts will be removed before text mining:
1) Pages 1 to 3 are the cover and contents pages, and pages 34 to 50 are the data
summary, index and compliance sections. These pages are excluded from the analysis.
2) There is also a footer at the bottom of each page (it can be treated as the last
3 lines of each page). The words in the footer are removed by
splitting each page into lines and dropping those lines.
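The two exclusion steps above can be sketched in base R as follows. This is a minimal illustration, not the code used for the report: `pages` is hypothetical toy data standing in for a character vector with one element per PDF page (for example, the kind of vector returned by `pdftools::pdf_text()`, a package the original text does not name), and `drop_footer` is a helper name introduced here.

```r
# Hypothetical toy data: 50 "pages", each with 2 body lines and a 3-line footer.
pages <- vapply(1:50, function(i) {
  paste(c(paste("Body text of page", i), "More body text",
          "Footer line 1", "Footer line 2", paste("Page", i)),
        collapse = "\n")
}, character(1))

# 1) Keep only the main content: pages 4 to 33.
content_pages <- pages[4:33]

# 2) Split a page into lines and drop the last n_footer lines (the footer).
drop_footer <- function(page, n_footer = 3) {
  lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
  paste(head(lines, -n_footer), collapse = " ")
}

clean_pages <- vapply(content_pages, drop_footer, character(1),
                      USE.NAMES = FALSE)
```

Treating the footer as a fixed number of trailing lines is an assumption that should be checked against a few real pages, since pages with fewer lines would lose body text.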
Cleaning can be done in two ways:
Possible Solution 1: Use the built-in functions: removeNumbers, removePunctuation,
removeWords, stemDocument, stripWhitespace, ...
Possible Solution 2: Create content transformers, i.e., functions which modify the content
of an R object.
Format : f <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
Example : to_space = content_transformer(function(x, pattern) gsub(pattern, " ", x))
where x is the text content of each document in the corpus.
For example, in the following code, any character matching "[^0-9a-zA-Z]"
(anything other than a digit or an English letter) is changed to a space.
docs <- tm_map(docs, to_space, "[^0-9a-zA-Z]")
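Both approaches can be combined in one small pipeline. The sketch below assumes the `tm` package is installed and uses two made-up sentences (`toy_text`) in place of the report pages. The order of the steps is one reasonable choice, not necessarily the order used for the results shown later.

```r
library(tm)

# Hypothetical toy input standing in for the report pages.
toy_text <- c("Energy & power: 60% of GHG emissions!",
              "Waste/water use fell in 2020.")
docs <- VCorpus(VectorSource(toy_text))

# Solution 2: a custom transformer that replaces a pattern with a space.
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, to_space, "[^0-9a-zA-Z]")

# Solution 1: built-in transformations.
docs <- tm_map(docs, content_transformer(tolower))       # lower-case everything
docs <- tm_map(docs, removeNumbers)                      # drop digits
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop common stop words
docs <- tm_map(docs, stripWhitespace)                    # collapse repeated spaces

# Pull the cleaned text back out as a plain character vector.
cleaned <- trimws(unname(sapply(docs, as.character)))
```

Replacing punctuation with a space first (rather than deleting it) keeps words such as "Waste/water" from being glued together.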
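# Packages used by the code in this section
library(tm)
library(stringr)
library(dplyr)
library(ggplot2)
library(plotly)
library(ggwordcloud)
# corpus_1 is the cleaned corpus
dtm_esg_2020=TermDocumentMatrix(corpus_1)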
matrix_esg_2020=as.matrix(dtm_esg_2020)
# Sort the word result in descending order
matrix_esg_2020=sort(rowSums(matrix_esg_2020),decreasing=TRUE)
df_esg_2020=data.frame(word=names(matrix_esg_2020),freq=matrix_esg_2020)
# Convert back to title case (only the first letter is uppercase)
df_esg_2020$word=str_to_title(df_esg_2020$word)
Take a look at the data frame after text mining:
head(df_esg_2020)
## word freq
## energy Energy 60
## power Power 59
## emissions Emissions 58
## waste Waste 48
## environmental Environmental 41
## business Business 40
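To make the counting step concrete, here is a base R sketch of the same idea (count each word, sort in descending order, then capitalize the first letter). It uses two hypothetical toy documents and simple splitting on spaces. `TermDocumentMatrix` applies tm's own tokenization and defaults, so its counts on real text may differ.

```r
# Hypothetical, already-cleaned toy documents.
toy_docs <- c("energy power energy", "power emissions energy")

# Split into words and count the occurrences of each word.
tokens <- unlist(strsplit(toy_docs, " ", fixed = TRUE))
counts <- sort(table(tokens), decreasing = TRUE)

# Build a word/frequency data frame, most frequent word first.
df_toy <- data.frame(word = names(counts), freq = as.vector(counts),
                     stringsAsFactors = FALSE)

# Capitalize the first letter of each word.
df_toy$word <- paste0(toupper(substring(df_toy$word, 1, 1)),
                      substring(df_toy$word, 2))
```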
Some words can be converted or combined.
# Abbreviation
df_esg_2020$word[df_esg_2020$word=="Ghg"]="Greenhouse Gas"
df_esg_2020$word[df_esg_2020$word=="Gri"]="Global Reporting Initiative"
df_esg_2020$word[df_esg_2020$word=="Esg"]="Environmental Social and Governance"
# Location
df_esg_2020$word[df_esg_2020$word=="Hong"]="Hong Kong"
df_esg_2020$word[df_esg_2020$word=="Kong"]=""
# Similar word or different part of speech
# Environment vs Environmental
df_esg_2020$freq[df_esg_2020$word=="Environment"]=df_esg_2020$freq[df_esg_2020$word=="Environment"]+df_esg_2020$freq[df_esg_2020$word=="Environmental"]
df_esg_2020$freq[df_esg_2020$word=="Environmental"]=0
# Sustainability vs Sustainable
df_esg_2020$freq[df_esg_2020$word=="Sustainability"]=df_esg_2020$freq[df_esg_2020$word=="Sustainability"]+df_esg_2020$freq[df_esg_2020$word=="Sustainable"]
df_esg_2020$freq[df_esg_2020$word=="Sustainable"]=0
# Drop the merged words and the blank label left over from "Kong"
df_esg_2020=df_esg_2020 %>%
  filter(!(word %in% c("Environmental","Sustainable",""))) %>%
  arrange(desc(freq))
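The merge pattern above repeats for each pair of words, so it can be wrapped in a small helper. The function below is a sketch in base R. `merge_terms` is a name introduced here, and the toy data frame reuses two counts from the output above plus a made-up count for "Environment".

```r
# Add the frequency of `drop` to `keep`, remove the `drop` row, and re-sort.
# Nothing is merged unless both words are present in the data frame.
merge_terms <- function(df, keep, drop) {
  if (keep %in% df$word && drop %in% df$word) {
    df$freq[df$word == keep] <- df$freq[df$word == keep] +
      df$freq[df$word == drop]
    df <- df[df$word != drop, ]
  }
  df[order(-df$freq), ]
}

# Toy data: the count of 10 for "Environment" is hypothetical.
toy <- data.frame(word = c("Environmental", "Energy", "Environment"),
                  freq = c(41, 60, 10),
                  stringsAsFactors = FALSE)

merged <- merge_terms(toy, keep = "Environment", drop = "Environmental")
```

Checking that both words exist avoids a silent failure when one of the variants never appears in the report.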
esg_2020_top_word=head(df_esg_2020,30)
esg_2020_label=as.vector(esg_2020_top_word$word)
ggplotly(
ggplot(esg_2020_top_word,aes(x=factor(word,levels=esg_2020_label),y=freq,text=paste(word,freq)))+
geom_col(fill="darkgreen")+
xlab("")+ylab("Counts")+labs(title="Top 30 most common wordings from 2020 ESG Reports")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
,tooltip = c("text")
)
set.seed(1234)
esg_2020_top_word$angle=30*sample(-3:3,30,replace = TRUE, prob = c(1,1,2,4,2,1,1))
ggplot(esg_2020_top_word, aes(label = word, size = freq,angle=angle/4,color=word)) +
geom_text_wordcloud_area(shape="circle")+scale_size_area(max_size =20) +
theme_minimal()
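The rotation used in the word cloud can be checked with a little arithmetic. The sketch below repeats the sampling step on its own: each word gets a step from -3 to 3, the weights make a step of 0 the most likely (4 out of 12), and because the plot uses `angle/4`, the final rotation is a multiple of 7.5 degrees between -22.5 and 22.5. The variable names are introduced here for illustration.

```r
set.seed(1234)
weights <- c(1, 1, 2, 4, 2, 1, 1)

# One step per word, drawn with the same weights as in the plotting code.
steps <- sample(-3:3, 30, replace = TRUE, prob = weights)

# angle = 30 * step, and the plot divides by 4, so the rotation is 7.5 * step.
angle_deg <- 30 * steps / 4

# Every rotation is one of -22.5, -15, -7.5, 0, 7.5, 15, 22.5 degrees.
all_valid <- all(angle_deg %in% (7.5 * (-3:3)))

# Probability that a given word is drawn horizontally (step 0).
p_zero <- weights[4] / sum(weights)
```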