AUTHORS
- Carla Parrucci, cp113330 (Student Leader). Contribution to the creation of the Code and Report
- Marina Gabriela Mititelu, mm113327. Contribution to the creation of the Code

STATEMENT OF NON-INFRINGEMENT OF COPYRIGHT.
The Work shall not violate or infringe upon the rights of any third party, including, without limitation, any patent rights, copyright rights, trademark rights, trade secret rights, or other proprietary rights of any kind.

CC LICENSE STATEMENT FOR THE PROJECT
This work is licensed under a Creative Commons Attribution 4.0 International License. This license allows a work to be copied, modified, distributed, presented, and performed only if the author is credited.

CLASS
Auditing Course, year 2022, spring semester, Warsaw School of Economics (SGH), Warsaw (PL).

Load necessary packages
require(rvest)#read data from webpage
require(dplyr)#library required for data frame manipulation
require(tm)#text mining
require(wordcloud)#word-cloud generator
require(RColorBrewer)#color palettes
require(syuzhet)#sentiment analysis
require(ggplot2)#plot graph
Load the HTML pages of the 10 External Audit Reports to analyze.
The results of the analysis make it possible to judge whether the
Audit Reports are optimistic or pessimistic.
sample1<-read_html("https://annualreport.dsm.com/ar2020/other-information/independent-auditor-s-report.html")
audit_report1<-sample1 %>% html_nodes("p") %>% html_text()
# %>% (the pipe) is used several times to "chain" functions together
sample2<-read_html("http://www.pta.wa.gov.au/Portals/0/annualreports/2011/pta2011v4/independent-audit-opinion.html")
audit_report2<-sample2 %>% html_nodes("p") %>% html_text()
sample3<-read_html("https://report.basf.com/2021/en/financial-statements/audit/auditors-report.html")
audit_report3<-sample3 %>% html_nodes("p") %>% html_text()
sample4<-read_html("https://coca-colafemsa.com/reportes/reporte-anual-2016/eng/independent-auditors.html")
audit_report4<-sample4 %>% html_nodes("p") %>% html_text()
sample5<-read_html("https://www.canada.ca/en/revenue-agency/corporate/about-canada-revenue-agency-cra/departmental-performance-reports/2018-19-departmental-results-report/iar-agency-activities.html")
audit_report5<-sample5 %>% html_nodes("p") %>% html_text()
sample6<-read_html("https://2020.annualreport.sbmoffshore.com/corporate-statements-2020/other-information/independent-auditor-s-report")
audit_report6<-sample6 %>% html_nodes("p") %>% html_text()
sample7<-read_html("https://annual-report.puma.com/2020/en/consolidated-financial-statement/independent-auditor-s-report.html")
audit_report7<-sample7 %>% html_nodes("p") %>% html_text()
sample8<-read_html("https://www.remgro.com/ar2021/financial-report/report-independent-auditor.php")
audit_report8<-sample8 %>% html_nodes("p") %>% html_text()
sample9<-read_html("https://www.vodacom-reports.co.za/integrated-reports/ir-2018/independent-auditors-report-on-the-consolidated-annual-financial-statements.php")
audit_report9<-sample9 %>% html_nodes("p") %>% html_text()
sample10<-read_html("https://www.massmart.co.za/iar2020/our-performance/independent-auditors-report/")
audit_report10<-sample10 %>% html_nodes("p") %>% html_text()
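The ten read_html() / html_nodes() / html_text() pairs above follow one pattern, which can be sketched as a small helper. This is a minimal sketch and not the authors' code: the name `extract_paragraphs` and the inline demo page are assumptions made for illustration, and the demo parses an HTML string so that it runs without network access.

```r
library(rvest)  # also provides read_html() and the %>% pipe

# Hypothetical helper: return the text of every <p> element of a parsed page.
extract_paragraphs <- function(page) {
  page %>% html_nodes("p") %>% html_text()
}

# Demo on an inline HTML string (made-up text, no network access needed).
demo_page <- read_html(
  "<html><body><h1>Report</h1><p>We have audited the statements.</p><p>In our opinion they are fair.</p></body></html>"
)
demo_paragraphs <- extract_paragraphs(demo_page)
print(demo_paragraphs)
```

With the ten report addresses stored in a character vector (say `report_urls`, a hypothetical name), `lapply(report_urls, function(u) extract_paragraphs(read_html(u)))` would return a list with one character vector per report.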
Conversion of the vectors to a Corpus for each Audit Report
audit_document1<- Corpus(VectorSource(audit_report1))#create a corpus from the character vector
audit_document2<- Corpus(VectorSource(audit_report2))
audit_document3<- Corpus(VectorSource(audit_report3))
audit_document4<- Corpus(VectorSource(audit_report4))
audit_document5<- Corpus(VectorSource(audit_report5))
audit_document6<- Corpus(VectorSource(audit_report6))
audit_document7<- Corpus(VectorSource(audit_report7))
audit_document8<- Corpus(VectorSource(audit_report8))
audit_document9<- Corpus(VectorSource(audit_report9))
audit_document10<- Corpus(VectorSource(audit_report10))
Before the sentiment analysis, the text of each Audit Report must be
cleaned: all text is converted to lower case, and numbers, stop words,
punctuation and extra spaces are removed. This leaves only the words
that carry meaning for the frequency and sentiment steps that follow.
audit_document1<-tm_map(audit_document1, content_transformer(tolower))#convert all text to lower case
audit_document1<-tm_map(audit_document1, removeNumbers)#remove numbers from document
audit_document1<-tm_map(audit_document1,removeWords,stopwords(kind = "en"))#remove stop words in English
audit_document1<-tm_map(audit_document1,removePunctuation)#remove punctuation
audit_document1<- tm_map(audit_document1,stripWhitespace)#eliminate extra spaces
audit_document2<-tm_map(audit_document2, content_transformer(tolower))
audit_document2<-tm_map(audit_document2, removeNumbers)
audit_document2<-tm_map(audit_document2,removeWords,stopwords(kind = "en"))
audit_document2<-tm_map(audit_document2,removePunctuation)
audit_document2<- tm_map(audit_document2,stripWhitespace)
audit_document3<-tm_map(audit_document3, content_transformer(tolower))
audit_document3<-tm_map(audit_document3, removeNumbers)
audit_document3<-tm_map(audit_document3,removeWords,stopwords(kind = "en"))
audit_document3<-tm_map(audit_document3,removePunctuation)
audit_document3<- tm_map(audit_document3,stripWhitespace)
audit_document4<-tm_map(audit_document4, content_transformer(tolower))
audit_document4<-tm_map(audit_document4, removeNumbers)
audit_document4<-tm_map(audit_document4,removeWords,stopwords(kind = "en"))
audit_document4<-tm_map(audit_document4,removePunctuation)
audit_document4<- tm_map(audit_document4,stripWhitespace)
audit_document5<-tm_map(audit_document5, content_transformer(tolower))
audit_document5<-tm_map(audit_document5, removeNumbers)
audit_document5<-tm_map(audit_document5,removeWords,stopwords(kind = "en"))
audit_document5<-tm_map(audit_document5,removePunctuation)
audit_document5<- tm_map(audit_document5,stripWhitespace)
audit_document6<-tm_map(audit_document6, content_transformer(tolower))
audit_document6<-tm_map(audit_document6, removeNumbers)
audit_document6<-tm_map(audit_document6,removeWords,stopwords(kind = "en"))
audit_document6<-tm_map(audit_document6,removePunctuation)
audit_document6<- tm_map(audit_document6,stripWhitespace)
audit_document7<-tm_map(audit_document7, content_transformer(tolower))
audit_document7<-tm_map(audit_document7, removeNumbers)
audit_document7<-tm_map(audit_document7,removeWords,stopwords(kind = "en"))
audit_document7<-tm_map(audit_document7,removePunctuation)
audit_document7<- tm_map(audit_document7,stripWhitespace)
audit_document8<-tm_map(audit_document8, content_transformer(tolower))
audit_document8<-tm_map(audit_document8, removeNumbers)
audit_document8<-tm_map(audit_document8,removeWords,stopwords(kind = "en"))
audit_document8<-tm_map(audit_document8,removePunctuation)
audit_document8<- tm_map(audit_document8,stripWhitespace)
audit_document9<-tm_map(audit_document9, content_transformer(tolower))
audit_document9<-tm_map(audit_document9, removeNumbers)
audit_document9<-tm_map(audit_document9,removeWords,stopwords(kind = "en"))
audit_document9<-tm_map(audit_document9,removePunctuation)
audit_document9<- tm_map(audit_document9,stripWhitespace)
audit_document10<-tm_map(audit_document10, content_transformer(tolower))
audit_document10<-tm_map(audit_document10, removeNumbers)
audit_document10<-tm_map(audit_document10,removeWords,stopwords(kind = "en"))
audit_document10<-tm_map(audit_document10,removePunctuation)
audit_document10<- tm_map(audit_document10,stripWhitespace)
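The five cleaning steps are identical for every report, so they can be sketched as one function. This is a hedged illustration rather than the authors' code: `clean_corpus` is a hypothetical name, the demo sentence is made up, and the demo uses `VCorpus()` instead of `Corpus()` only to keep the example free of warnings.

```r
library(tm)

# Hypothetical helper: apply the five cleaning steps in the same order as above.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))         # lower case
  corpus <- tm_map(corpus, removeNumbers)                        # drop digits
  corpus <- tm_map(corpus, removeWords, stopwords(kind = "en"))  # drop English stop words
  corpus <- tm_map(corpus, removePunctuation)                    # drop punctuation
  tm_map(corpus, stripWhitespace)                                # collapse repeated spaces
}

# Demo on a single made-up sentence.
demo_corpus <- VCorpus(VectorSource("The Audit of 2020 Statements, was FAIR!"))
demo_clean <- clean_corpus(demo_corpus)
demo_text <- as.character(demo_clean[[1]])
print(demo_text)
```

Note that `stripWhitespace` collapses runs of spaces to a single space but does not trim a leading space left behind by a removed stop word.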
To perform the sentiment analysis, it is necessary to compute the
frequency of each word. Using this code, we built a table with the
words and their frequencies for each Audit Report analyzed.
dtm1<-TermDocumentMatrix(audit_document1)#term-document matrix: counts of each word in each paragraph of the report
m1<-as.matrix(dtm1)#convert the term-document matrix to a plain matrix
v1<-sort(rowSums(m1),decreasing = TRUE)#total frequency of each word, sorted from most to least frequent
d1<-data.frame(word=names(v1),frequency=v1)#data frame with one row per word and its frequency
head(d1,20)#show the 20 most used words in the first Audit Report
dtm2<-TermDocumentMatrix(audit_document2)
m2<-as.matrix(dtm2)
v2<-sort(rowSums(m2),decreasing = TRUE)
d2<-data.frame(word=names(v2),frequency=v2)
head(d2,20)
dtm3<-TermDocumentMatrix(audit_document3)
m3<-as.matrix(dtm3)
v3<-sort(rowSums(m3),decreasing = TRUE)
d3<-data.frame(word=names(v3),frequency=v3)
head(d3,20)
dtm4<-TermDocumentMatrix(audit_document4)
m4<-as.matrix(dtm4)
v4<-sort(rowSums(m4),decreasing = TRUE)
d4<-data.frame(word=names(v4),frequency=v4)
head(d4,20)
dtm5<-TermDocumentMatrix(audit_document5)
m5<-as.matrix(dtm5)
v5<-sort(rowSums(m5),decreasing = TRUE)
d5<-data.frame(word=names(v5),frequency=v5)
head(d5,20)
dtm6<-TermDocumentMatrix(audit_document6)
m6<-as.matrix(dtm6)
v6<-sort(rowSums(m6),decreasing = TRUE)
d6<-data.frame(word=names(v6),frequency=v6)
head(d6,20)
dtm7<-TermDocumentMatrix(audit_document7)
m7<-as.matrix(dtm7)
v7<-sort(rowSums(m7),decreasing = TRUE)
d7<-data.frame(word=names(v7),frequency=v7)
head(d7,20)
dtm8<-TermDocumentMatrix(audit_document8)
m8<-as.matrix(dtm8)
v8<-sort(rowSums(m8),decreasing = TRUE)
d8<-data.frame(word=names(v8),frequency=v8)
head(d8,20)
dtm9<-TermDocumentMatrix(audit_document9)
m9<-as.matrix(dtm9)
v9<-sort(rowSums(m9),decreasing = TRUE)
d9<-data.frame(word=names(v9),frequency=v9)
head(d9,20)
dtm10<-TermDocumentMatrix(audit_document10)
m10<-as.matrix(dtm10)
v10<-sort(rowSums(m10),decreasing = TRUE)
d10<-data.frame(word=names(v10),frequency=v10)
head(d10,20)
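The four steps repeated above (term-document matrix, plain matrix, sorted row sums, data frame) can be sketched as a single function. This is an illustrative sketch under assumptions: `word_frequencies` is a hypothetical name and the two short "paragraphs" are made up to stand in for a cleaned report.

```r
library(tm)

# Hypothetical helper: word-frequency table for one corpus,
# sorted from the most to the least frequent word.
word_frequencies <- function(corpus) {
  term_matrix <- as.matrix(TermDocumentMatrix(corpus))
  totals <- sort(rowSums(term_matrix), decreasing = TRUE)
  data.frame(word = names(totals), frequency = unname(totals),
             stringsAsFactors = FALSE)
}

# Two made-up "paragraphs" standing in for a cleaned report.
demo_corpus <- VCorpus(VectorSource(c("audit report audit",
                                      "financial audit statements")))
demo_freq <- word_frequencies(demo_corpus)
print(head(demo_freq, 3))
```

Applied to the ten corpora, `lapply()` over a list of them would yield the ten data frames d1 to d10 in one call.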
The 10 data frames show that the most used words are “Audit”,
“Financial”, “Statements”, “Report” and “Consolidated”. For the
analysis we deliberately chose Audit Reports of different lengths,
which is visible in the differing word frequencies.
Audit Report 1: The most frequent words have a neutral meaning,
except for the word “Fraud”. However, the word is not used with a
negative connotation: it is frequent because the auditors completed an
analysis specific to fraud risks for this Company. Based on the first
20 words, our opinion about Report 1 is Neutral. To increase the number
of words displayed, we can use the code:
head(d1,40)
Audit Report 2: Unlike Report 1, the words have lower frequencies,
and the word “Fair” appears among the 20 most frequent words.
Opinion: Neutral/Optimistic
Audit Report 3: The report leads to a very optimistic outcome, since
four of the 20 most frequent words are “appropriate”, “material”,
“accordance” and “assurance”. Opinion: Optimistic
Audit Report 4: As in Report 1, the listed words have a neutral
meaning. For this reason, we suggest increasing the number of words
displayed. Opinion: Neutral
Audit Report 5: Like Report 3, the report uses several positive
words, such as “accordance”, “material” and “assurance”. Opinion:
Optimistic
Audit Report 6: We suggest displaying more words.
Opinion: Neutral.
Audit Report 7: We suggest displaying more words.
Opinion: Neutral.
Audit Report 8: Opinion: Neutral/Optimistic
Audit Report 9: Opinion: Neutral/Optimistic
Audit Report 10: Opinion: Neutral/Optimistic
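To show words across different frequency levels, we used the
‘subset()’ function on each data frame (d1 to d10). The filter keeps a
word only if at most 15 words in the report share its frequency value,
which typically drops the long tail of words that occur only once or
twice.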
subset(d1, with(d1, frequency %in% names(which(table(frequency)<=15))))
subset(d2, with(d2, frequency %in% names(which(table(frequency)<=15))))
subset(d3, with(d3, frequency %in% names(which(table(frequency)<=15))))
subset(d4, with(d4, frequency %in% names(which(table(frequency)<=15))))
subset(d5, with(d5, frequency %in% names(which(table(frequency)<=15))))
subset(d6, with(d6, frequency %in% names(which(table(frequency)<=15))))
subset(d7, with(d7, frequency %in% names(which(table(frequency)<=15))))
subset(d8, with(d8, frequency %in% names(which(table(frequency)<=15))))
subset(d9, with(d9, frequency %in% names(which(table(frequency)<=15))))
subset(d10, with(d10, frequency %in% names(which(table(frequency)<=15))))
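The filter inside `subset()` is compact, so a small worked example may help. The sketch below uses base R only and a made-up frequency table (`toy`); the threshold is lowered from 15 to 2 so that the effect is visible on six words.

```r
# Made-up frequency table in the same shape as d1 ... d10.
toy <- data.frame(
  word = c("audit", "financial", "report", "fair", "risk", "fraud"),
  frequency = c(5, 3, 3, 1, 1, 1),
  stringsAsFactors = FALSE
)

# How many words share each frequency value?
words_per_value <- table(toy$frequency)

# Frequency values shared by at most 2 words (the original code uses 15).
kept_values <- names(which(words_per_value <= 2))

# Keep only the words whose frequency is one of those values.
toy_kept <- subset(toy, frequency %in% kept_values)
print(toy_kept)
```

Three words share the frequency 1, which is more than the threshold of 2, so “fair”, “risk” and “fraud” are dropped while the other three words are kept.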
To list every word of each Audit Report, we used the ‘names()’
function on the frequency vectors.
names(v1)
names(v2)
names(v3)
names(v4)
names(v5)
names(v6)
names(v7)
names(v8)
names(v9)
names(v10)
To find frequent terms, we used the findFreqTerms function with a
threshold of 10 (lowfreq = 10). In this way, words that occur at least
10 times are classified as frequent.
findFreqTerms(dtm1,lowfreq = 10)
findFreqTerms(dtm2,lowfreq = 10)
findFreqTerms(dtm3,lowfreq = 10)
findFreqTerms(dtm4,lowfreq = 10)
findFreqTerms(dtm5,lowfreq = 10)
findFreqTerms(dtm6,lowfreq = 10)
findFreqTerms(dtm7,lowfreq = 10)
findFreqTerms(dtm8,lowfreq = 10)
findFreqTerms(dtm9,lowfreq = 10)
findFreqTerms(dtm10,lowfreq = 10)
For the analysis, it is also interesting to examine the association
between terms. Since the most important part of the Audit Report is
the Opinion and the fairness of the documents, we show which words are
usually associated with the words “opinion” and “fair”.
findAssocs(dtm1,terms="opinion", corlimit = 0.4)
findAssocs(dtm1,terms="fair", corlimit = 0.4)
findAssocs(dtm2,terms="opinion", corlimit = 0.4)
findAssocs(dtm2,terms="fair", corlimit = 0.4)
findAssocs(dtm3,terms="opinion", corlimit = 0.4)
findAssocs(dtm3,terms="fair", corlimit = 0.4)
findAssocs(dtm4,terms="opinion", corlimit = 0.4)
findAssocs(dtm4,terms="fair", corlimit = 0.4)
findAssocs(dtm5,terms="opinion", corlimit = 0.4)
findAssocs(dtm5,terms="fair", corlimit = 0.4)
findAssocs(dtm6,terms="opinion", corlimit = 0.4)
findAssocs(dtm6,terms="fair", corlimit = 0.4)
findAssocs(dtm7,terms="opinion", corlimit = 0.4)
findAssocs(dtm7,terms="fair", corlimit = 0.4)
findAssocs(dtm8,terms="opinion", corlimit = 0.4)
findAssocs(dtm8,terms="fair", corlimit = 0.4)
findAssocs(dtm9,terms="opinion", corlimit = 0.4)
findAssocs(dtm9,terms="fair", corlimit = 0.4)
findAssocs(dtm10,terms="opinion", corlimit = 0.4)
findAssocs(dtm10,terms="fair", corlimit = 0.4)
The associations show that the word “opinion” tends to occur in the
same paragraphs as positive words such as “sufficient”, “appropriate”,
“assurance”, “reasonable”, “fairly”, “significance” and “fulfilled”.
The words associated with “opinion” also outline some of the basic
features of the Audit Process: “independent”, “material”, “standard”,
“statement”, “responsibility” and “evidence”. The words associated
with “fair” are “presentation”, “preparation”, “reporting”, “control”,
“considerations”, “assumptions” and “value”. In Audit Report number 6
the word “fair” is not used, so the function returns ‘numeric(0)’.
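As far as we understand the tm documentation, `findAssocs()` reports the correlation between the count vector of the chosen term and the count vectors of all other terms across the documents (here, paragraphs), keeping those at or above `corlimit`. The base R sketch below illustrates that idea on a made-up term-document matrix; it is not the package's implementation.

```r
# Made-up counts of three terms (rows) in four paragraphs (columns).
toy_tdm <- rbind(
  opinion = c(2, 0, 1, 0),
  fair    = c(1, 0, 1, 0),
  fraud   = c(0, 1, 0, 1)
)

# Correlate the counts of "opinion" with the counts of every other term.
others <- setdiff(rownames(toy_tdm), "opinion")
assoc <- sapply(others, function(term) {
  cor(toy_tdm["opinion", ], toy_tdm[term, ])
})

# Keep only the terms at or above the limit, as corlimit = 0.4 does.
assoc_kept <- assoc[assoc >= 0.4]
print(round(assoc_kept, 2))
```

In this toy matrix “fair” appears in the same paragraphs as “opinion” and is kept, while “fraud” appears only in the other paragraphs, is negatively correlated and is filtered out.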
To visualize the 10 most frequent words of each report, we used a
bar plot.
barplot(d1[1:10,]$frequency, las=2,names.arg=d1[1:10,]$word,
col = "lightblue",main = "Most frequent words in auditing report 1",
ylab = "word frequencies")
barplot(d2[1:10,]$frequency, las=2,names.arg=d2[1:10,]$word,
col = "lightyellow",main = "Most frequent words in auditing report 2",
ylab = "word frequencies")
barplot(d3[1:10,]$frequency, las=2,names.arg=d3[1:10,]$word,
col = "red",main = "Most frequent words in auditing report 3",
ylab = "word frequencies")
barplot(d4[1:10,]$frequency, las=2,names.arg=d4[1:10,]$word,
col = "lightgreen",main = "Most frequent words in auditing report 4",
ylab = "word frequencies")
barplot(d5[1:10,]$frequency, las=2,names.arg=d5[1:10,]$word,
col = "lightgrey",main = "Most frequent words in auditing report 5",
ylab = "word frequencies")
barplot(d6[1:10,]$frequency, las=2,names.arg=d6[1:10,]$word,
col = "orange",main = "Most frequent words in auditing report 6",
ylab = "word frequencies")
barplot(d7[1:10,]$frequency, las=2,names.arg=d7[1:10,]$word,
col = "Green",main = "Most frequent words in auditing report 7",
ylab = "word frequencies")
barplot(d8[1:10,]$frequency, las=2,names.arg=d8[1:10,]$word,
col = "blue",main = "Most frequent words in auditing report 8",
ylab = "word frequencies")
barplot(d9[1:10,]$frequency, las=2,names.arg=d9[1:10,]$word,
col = "purple",main = "Most frequent words in auditing report 9",
ylab = "word frequencies")
barplot(d10[1:10,]$frequency, las=2,names.arg=d10[1:10,]$word,
col = "pink",main = "Most frequent words in auditing report 10",
ylab = "word frequencies")
Thanks to the word frequencies, it is possible to create Word
Clouds
set.seed(1234)
wordcloud(words = d1$word,freq = d1$freq,min.freq = 5,
max.words = 100,random.order = FALSE,rot.per = 0.40,
colors = brewer.pal(4,"Spectral"))
wordcloud(words = d2$word,freq = d2$freq,min.freq = 5,
max.words = 100,random.order = FALSE,rot.per = 0.40,
colors = brewer.pal(4,"Blues"))
wordcloud(words = d3$word,freq = d3$freq,min.freq = 11,
max.words = 100,random.order = FALSE,rot.per = 0.40,
colors = brewer.pal(4,"Accent"))
wordcloud(words = d4$word,freq = d4$freq,min.freq = 5,
max.words = 100,random.order = FALSE,rot.per = 0.40,
colors = brewer.pal(4,"Set3"))
wordcloud(words = d5$word,freq = d5$freq,min.freq = 3,
max.words = 100,random.order = FALSE,rot.per = 0.4,
colors = brewer.pal(4,"Set2"))
wordcloud(words = d6$word,freq = d6$freq,min.freq = 5,
max.words = 100,random.order = FALSE,rot.per = 0.40,
colors = brewer.pal(4,"Oranges"))
wordcloud(words = d7$word,freq = d7$freq,min.freq = 8,
max.words = 100,random.order = FALSE,rot.per = 0.40,
colors = brewer.pal(4,"Pastel2"))
wordcloud(words = d8$word,freq = d8$freq,min.freq = 3,
max.words = 100,random.order = FALSE,rot.per = 0.30,
colors = brewer.pal(4,"Pastel1"))
wordcloud(words = d9$word,freq = d9$freq,min.freq = 5,
max.words = 100,random.order = FALSE,rot.per = 0.40,
colors = brewer.pal(4,"Dark2"))
wordcloud(words = d10$word,freq = d10$freq,min.freq = 5,
max.words = 100,random.order = FALSE,rot.per = 0.5,
colors = brewer.pal(4,"BrBG"))
Several methods, each with its own scale, can be used to assess
whether the overall sentiment of each Audit Report is positive or
negative. If the computed mean is positive, the overall sentiment
expressed in the audit report is considered optimistic. The three
methods used are Syuzhet, Bing and Afinn.
syuzhet_vector1<-get_sentiment(audit_report1,method = "syuzhet") #analyze text based on syuzhet; get_sentiment expects a character vector, not the data frame d1
summary(syuzhet_vector1)#see summary statistics of the vector
bing_vector1<-get_sentiment(audit_report1,method="bing")#analyze text based on bing
summary(bing_vector1)#see summary statistics of the vector
afinn_vector1<-get_sentiment(audit_report1,method="afinn")#analyze text based on afinn, min of -5 (most neg) to 5 (most pos)
summary(afinn_vector1)#see summary statistics of the vector
syuzhet_vector2<-get_sentiment(audit_report2,method = "syuzhet")
summary(syuzhet_vector2)
bing_vector2<-get_sentiment(audit_report2,method="bing")
summary(bing_vector2)
afinn_vector2<-get_sentiment(audit_report2,method="afinn")
summary(afinn_vector2)
syuzhet_vector3<-get_sentiment(audit_report3,method = "syuzhet")
summary(syuzhet_vector3)
bing_vector3<-get_sentiment(audit_report3,method="bing")
summary(bing_vector3)
afinn_vector3<-get_sentiment(audit_report3,method="afinn")
summary(afinn_vector3)
syuzhet_vector4<-get_sentiment(audit_report4,method = "syuzhet")
summary(syuzhet_vector4)
bing_vector4<-get_sentiment(audit_report4,method="bing")
summary(bing_vector4)
afinn_vector4<-get_sentiment(audit_report4,method="afinn")
summary(afinn_vector4)
syuzhet_vector5<-get_sentiment(audit_report5,method = "syuzhet")
summary(syuzhet_vector5)
bing_vector5<-get_sentiment(audit_report5,method="bing")
summary(bing_vector5)
afinn_vector5<-get_sentiment(audit_report5,method="afinn")
summary(afinn_vector5)
syuzhet_vector6<-get_sentiment(audit_report6,method = "syuzhet")
summary(syuzhet_vector6)
bing_vector6<-get_sentiment(audit_report6,method="bing")
summary(bing_vector6)
afinn_vector6<-get_sentiment(audit_report6,method="afinn")
summary(afinn_vector6)
syuzhet_vector7<-get_sentiment(audit_report7,method = "syuzhet")
summary(syuzhet_vector7)
bing_vector7<-get_sentiment(audit_report7,method="bing")
summary(bing_vector7)
afinn_vector7<-get_sentiment(audit_report7,method="afinn")
summary(afinn_vector7)
syuzhet_vector8<-get_sentiment(audit_report8,method = "syuzhet")
summary(syuzhet_vector8)
bing_vector8<-get_sentiment(audit_report8,method="bing")
summary(bing_vector8)
afinn_vector8<-get_sentiment(audit_report8,method="afinn")
summary(afinn_vector8)
syuzhet_vector9<-get_sentiment(audit_report9,method = "syuzhet")
summary(syuzhet_vector9)
bing_vector9<-get_sentiment(audit_report9,method="bing")
summary(bing_vector9)
afinn_vector9<-get_sentiment(audit_report9,method="afinn")
summary(afinn_vector9)
syuzhet_vector10<-get_sentiment(audit_report10,method = "syuzhet")
summary(syuzhet_vector10)
bing_vector10<-get_sentiment(audit_report10,method="bing")
summary(bing_vector10)
afinn_vector10<-get_sentiment(audit_report10,method="afinn")
summary(afinn_vector10)
As we can see from the results, the Mean of each Audit Report is
Positive, meaning that the overall sentiment is Optimistic.
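The comparison of the three lexicons can be sketched on a tiny input to show how the mean is read. The two sentences below are made up for illustration, and the loop over `methods` is an assumption about how one might organize the calls, not the authors' code.

```r
library(syuzhet)

# Made-up sentences: one clearly positive, one clearly negative.
demo_text <- c("The results are excellent and fair.",
               "The fraud caused a terrible loss.")

methods <- c("syuzhet", "bing", "afinn")

# One column per lexicon, one row per sentence.
scores <- sapply(methods, function(m) get_sentiment(demo_text, method = m))
print(scores)

# Mean score per lexicon; a positive mean is read as an optimistic text.
print(colMeans(scores))
```

Because each lexicon uses its own scale, the means are comparable in sign but not in magnitude across the three columns.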
To assess whether the words used in each report have a positive or
negative meaning, we use ‘get_nrc_sentiment()’. After building a data
frame with the function ‘rowSums()’, it is possible to see how many
words are assigned to each sentiment.
vector.nrc1<-get_nrc_sentiment(audit_report1)#analyze text based on nrc: word counts per sentiment for every paragraph
df.nrc1<-data.frame(t(vector.nrc1))#transpose: one row per sentiment, one column per paragraph of the report
#rowSums() adds up, for each sentiment (row), the counts over the selected paragraphs (columns)
td_new1<-data.frame(rowSums(df.nrc1[6:87]))#specify the column range based on the number of paragraphs in the report
vector.nrc2<-get_nrc_sentiment(audit_report2)
df.nrc2<-data.frame(t(vector.nrc2))
td_new2<-data.frame(rowSums(df.nrc2[1:20]))
vector.nrc3<-get_nrc_sentiment(audit_report3)
df.nrc3<-data.frame(t(vector.nrc3))
td_new3<-data.frame(rowSums(df.nrc3[1:69]))
vector.nrc4<-get_nrc_sentiment(audit_report4)
df.nrc4<-data.frame(t(vector.nrc4))
td_new4<-data.frame(rowSums(df.nrc4[1:44]))
vector.nrc5<-get_nrc_sentiment(audit_report5)
df.nrc5<-data.frame(t(vector.nrc5))
td_new5<-data.frame(rowSums(df.nrc5[1:18]))
vector.nrc6<-get_nrc_sentiment(audit_report6)
df.nrc6<-data.frame(t(vector.nrc6))
td_new6<-data.frame(rowSums(df.nrc6[1:136]))
vector.nrc7<-get_nrc_sentiment(audit_report7)
df.nrc7<-data.frame(t(vector.nrc7))
td_new7<-data.frame(rowSums(df.nrc7[1:51]))
vector.nrc8<-get_nrc_sentiment(audit_report8)
df.nrc8<-data.frame(t(vector.nrc8))
td_new8<-data.frame(rowSums(df.nrc8[1:11]))
vector.nrc9<-get_nrc_sentiment(audit_report9)
df.nrc9<-data.frame(t(vector.nrc9))
td_new9<-data.frame(rowSums(df.nrc9[1:51]))
vector.nrc10<-get_nrc_sentiment(audit_report10)
df.nrc10<-data.frame(t(vector.nrc10))
td_new10<-data.frame(rowSums(df.nrc10[1:33]))
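The column ranges above (for example `1:33`) are typed by hand for each report. A hedged alternative is to sum over all paragraphs with `colSums()`, which needs no range at all. The sketch below uses two made-up sentences; note that it sums every paragraph, so it does not reproduce a deliberately narrowed range such as `6:87` for the first report.

```r
library(syuzhet)

# Made-up paragraphs standing in for a scraped report.
demo_text <- c("We trust the fair and reliable statements.",
               "The fraud caused a terrible loss.")

# One row per paragraph, one column per NRC category.
nrc_scores <- get_nrc_sentiment(demo_text)

# Sum over all paragraphs without hard-coding how many there are.
sentiment_totals <- data.frame(sentiment = names(nrc_scores),
                               count = colSums(nrc_scores),
                               row.names = NULL)
print(sentiment_totals)
```

The result already has the `sentiment` and `count` columns that the plotting step expects.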
To visualize the result of our Sentiment Analysis, we create a plot
for each Audit Report. In this way, we observe the frequency of each
sentiment in every report.
names(td_new1)[1]<-"count"
td_new1<-cbind("sentiment"=rownames(td_new1),td_new1)
rownames(td_new1)<-NULL
td_1<-td_new1[1:10,]
#plot count of words associated with each sentiment
quickplot(sentiment,data = td_1,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
names(td_new2)[1]<-"count"
td_new2<-cbind("sentiment"=rownames(td_new2),td_new2)
rownames(td_new2)<-NULL
td_2<-td_new2[1:10,]
quickplot(sentiment,data = td_2,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
names(td_new3)[1]<-"count"
td_new3<-cbind("sentiment"=rownames(td_new3),td_new3)
rownames(td_new3)<-NULL
td_3<-td_new3[1:10,]
quickplot(sentiment,data = td_3,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
names(td_new4)[1]<-"count"
td_new4<-cbind("sentiment"=rownames(td_new4),td_new4)
rownames(td_new4)<-NULL
td_4<-td_new4[1:10,]
quickplot(sentiment,data = td_4,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
names(td_new5)[1]<-"count"
td_new5<-cbind("sentiment"=rownames(td_new5),td_new5)
rownames(td_new5)<-NULL
td_5<-td_new5[1:10,]
quickplot(sentiment,data = td_5,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
names(td_new6)[1]<-"count"
td_new6<-cbind("sentiment"=rownames(td_new6),td_new6)
rownames(td_new6)<-NULL
td_6<-td_new6[1:10,]
quickplot(sentiment,data = td_6,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
names(td_new7)[1]<-"count"
td_new7<-cbind("sentiment"=rownames(td_new7),td_new7)
rownames(td_new7)<-NULL
td_7<-td_new7[1:10,]
quickplot(sentiment,data = td_7,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
names(td_new8)[1]<-"count"
td_new8<-cbind("sentiment"=rownames(td_new8),td_new8)
rownames(td_new8)<-NULL
td_8<-td_new8[1:10,]
quickplot(sentiment,data = td_8,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
names(td_new9)[1]<-"count"
td_new9<-cbind("sentiment"=rownames(td_new9),td_new9)
rownames(td_new9)<-NULL
td_9<-td_new9[1:10,]
quickplot(sentiment,data = td_9,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
names(td_new10)[1]<-"count"
td_new10<-cbind("sentiment"=rownames(td_new10),td_new10)
rownames(td_new10)<-NULL
td_10<-td_new10[1:10,]
quickplot(sentiment,data = td_10,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
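Recent ggplot2 releases mark `quickplot()`/`qplot()` as deprecated, so the same bar chart can also be sketched with `ggplot()` and `geom_col()`. The helper name `plot_sentiments` and the counts below are made up for illustration; adding the report number to the title is likewise an assumption, meant to tell the ten plots apart.

```r
library(ggplot2)

# Made-up sentiment counts in the same shape as td_1 ... td_10.
demo_counts <- data.frame(
  sentiment = c("positive", "trust", "negative", "fear"),
  count = c(40, 35, 12, 5)
)

# Hypothetical helper: bar chart of the sentiment counts of one report.
plot_sentiments <- function(counts, report_number) {
  ggplot(counts, aes(x = sentiment, y = count, fill = sentiment)) +
    geom_col() +
    ggtitle(paste("Audit Report", report_number, "sentiments"))
}

demo_plot <- plot_sentiments(demo_counts, 1)
print(demo_plot)
```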
The plots show that the two most frequent sentiments are “Positive”
and “Trust”. Therefore, we can assume that most of the analyzed Audit
Reports are optimistic. Given the presence of other sentiments, we
conclude that the auditors were not overly optimistic, so the results
of each report can be considered objective and realistic. However, in
the second, eighth and tenth reports, the frequency of “trust” is high
(almost equal to the frequency of “positive”). An auditor who places
too much trust in a company risks bias and errors in further analysis.
Therefore, the auditor should always respect the principles of
integrity, objectivity and independence in order to work properly.
Inherent risk can be defined as the risk posed by an omission or
error in a financial statement that does not arise from a failure of
internal control. The most common inherent risk factors are
susceptibility to obsolescence and fraudulent reporting, difficulty in
creating disclosures and the need for judgment. For this reason, we
can assume that the second Audit Report is more susceptible to inherent
risk.
---
title: "Task 2"
Creator: Carla Parrucci (cp113330) & Marina Gabriela Mititelu (mm113327)
output: html_notebook
---
AUTHORS
- Carla Parrucci, cp113330 (Student Leader). Contribution to the creation of the Code and Report
- Marina Gabriela Mititelu, mm113327. Contribution to the creation of the Code

STATEMENT OF NON-INFRINGEMENT OF COPYRIGHT.
The Work shall not violate or infringe upon the rights of any third party, including, without limitation, any patent rights, copyright rights, trademark rights, trade secret rights, or other proprietary rights of any kind.

CC LICENSE STATEMENT FOR THE PROJECT
This work is licensed under a Creative Commons Attribution 4.0 International License. This license allows a work to be copied, modified, distributed, presented, and performed only if the author is credited.

CLASS
Auditing Course, year 2022, spring semester, Warsaw School of Economics (SGH), Warsaw (PL).

Load necessary packages
```{r}

require(rvest)#read data from webpage
require(dplyr)#library required for data frame manipulation
require(tm)#text mining
require(wordcloud)#word-cloud generator
require(RColorBrewer)#color palettes
require(syuzhet)#sentiment analysis
require(ggplot2)#plot graph

```
Load HTML pages to analyze 10 External Audit Reports.

The results of the analysis make it possible to judge whether the Audit Reports are optimistic or pessimistic.
```{r}
sample1<-read_html("https://annualreport.dsm.com/ar2020/other-information/independent-auditor-s-report.html")
audit_report1<-sample1 %>%  html_nodes("p") %>%  html_text()
# %>% (the pipe) is used several times to "chain" functions together

sample2<-read_html("http://www.pta.wa.gov.au/Portals/0/annualreports/2011/pta2011v4/independent-audit-opinion.html")
audit_report2<-sample2 %>%  html_nodes("p") %>%  html_text()

sample3<-read_html("https://report.basf.com/2021/en/financial-statements/audit/auditors-report.html")
audit_report3<-sample3 %>%  html_nodes("p") %>%  html_text()

sample4<-read_html("https://coca-colafemsa.com/reportes/reporte-anual-2016/eng/independent-auditors.html")
audit_report4<-sample4 %>%  html_nodes("p") %>%  html_text()

sample5<-read_html("https://www.canada.ca/en/revenue-agency/corporate/about-canada-revenue-agency-cra/departmental-performance-reports/2018-19-departmental-results-report/iar-agency-activities.html")
audit_report5<-sample5 %>%  html_nodes("p") %>%  html_text()

sample6<-read_html("https://2020.annualreport.sbmoffshore.com/corporate-statements-2020/other-information/independent-auditor-s-report")
audit_report6<-sample6 %>%  html_nodes("p") %>%  html_text()

sample7<-read_html("https://annual-report.puma.com/2020/en/consolidated-financial-statement/independent-auditor-s-report.html")
audit_report7<-sample7 %>%  html_nodes("p") %>%  html_text()

sample8<-read_html("https://www.remgro.com/ar2021/financial-report/report-independent-auditor.php")
audit_report8<-sample8 %>%  html_nodes("p") %>%  html_text()

sample9<-read_html("https://www.vodacom-reports.co.za/integrated-reports/ir-2018/independent-auditors-report-on-the-consolidated-annual-financial-statements.php")
audit_report9<-sample9 %>%  html_nodes("p") %>%  html_text()

sample10<-read_html("https://www.massmart.co.za/iar2020/our-performance/independent-auditors-report/")
audit_report10<-sample10 %>%  html_nodes("p") %>%  html_text()
```

Conversion of the vectors to a Corpus for each Audit Report
```{r}
audit_document1<- Corpus(VectorSource(audit_report1))#create a corpus from the character vector
audit_document2<- Corpus(VectorSource(audit_report2))
audit_document3<- Corpus(VectorSource(audit_report3))
audit_document4<- Corpus(VectorSource(audit_report4))
audit_document5<- Corpus(VectorSource(audit_report5))
audit_document6<- Corpus(VectorSource(audit_report6))
audit_document7<- Corpus(VectorSource(audit_report7))
audit_document8<- Corpus(VectorSource(audit_report8))
audit_document9<- Corpus(VectorSource(audit_report9))
audit_document10<- Corpus(VectorSource(audit_report10))
```

Before the sentiment analysis, the text of each Audit Report must be cleaned: all text is converted to lower case, and numbers, stop words, punctuation and extra spaces are removed. This leaves only the words that carry meaning for the frequency and sentiment steps that follow.
```{r}
audit_document1<-tm_map(audit_document1, content_transformer(tolower))#convert all text to lower case
audit_document1<-tm_map(audit_document1, removeNumbers)#remove numbers from document
audit_document1<-tm_map(audit_document1,removeWords,stopwords(kind = "en"))#remove stop words in English
audit_document1<-tm_map(audit_document1,removePunctuation)#remove punctuation
audit_document1<- tm_map(audit_document1,stripWhitespace)#eliminate extra spaces

audit_document2<-tm_map(audit_document2, content_transformer(tolower))
audit_document2<-tm_map(audit_document2, removeNumbers)
audit_document2<-tm_map(audit_document2,removeWords,stopwords(kind = "en"))
audit_document2<-tm_map(audit_document2,removePunctuation)
audit_document2<- tm_map(audit_document2,stripWhitespace)

audit_document3<-tm_map(audit_document3, content_transformer(tolower))
audit_document3<-tm_map(audit_document3, removeNumbers)
audit_document3<-tm_map(audit_document3,removeWords,stopwords(kind = "en"))
audit_document3<-tm_map(audit_document3,removePunctuation)
audit_document3<- tm_map(audit_document3,stripWhitespace)

audit_document4<-tm_map(audit_document4, content_transformer(tolower))
audit_document4<-tm_map(audit_document4, removeNumbers)
audit_document4<-tm_map(audit_document4,removeWords,stopwords(kind = "en"))
audit_document4<-tm_map(audit_document4,removePunctuation)
audit_document4<- tm_map(audit_document4,stripWhitespace)

audit_document5<-tm_map(audit_document5, content_transformer(tolower))
audit_document5<-tm_map(audit_document5, removeNumbers)
audit_document5<-tm_map(audit_document5,removeWords,stopwords(kind = "en"))
audit_document5<-tm_map(audit_document5,removePunctuation)
audit_document5<- tm_map(audit_document5,stripWhitespace)

audit_document6<-tm_map(audit_document6, content_transformer(tolower))
audit_document6<-tm_map(audit_document6, removeNumbers)
audit_document6<-tm_map(audit_document6,removeWords,stopwords(kind = "en"))
audit_document6<-tm_map(audit_document6,removePunctuation)
audit_document6<- tm_map(audit_document6,stripWhitespace)

audit_document7<-tm_map(audit_document7, content_transformer(tolower))
audit_document7<-tm_map(audit_document7, removeNumbers)
audit_document7<-tm_map(audit_document7,removeWords,stopwords(kind = "en"))
audit_document7<-tm_map(audit_document7,removePunctuation)
audit_document7<- tm_map(audit_document7,stripWhitespace)

audit_document8<-tm_map(audit_document8, content_transformer(tolower))
audit_document8<-tm_map(audit_document8, removeNumbers)
audit_document8<-tm_map(audit_document8,removeWords,stopwords(kind = "en"))
audit_document8<-tm_map(audit_document8,removePunctuation)
audit_document8<- tm_map(audit_document8,stripWhitespace)

audit_document9<-tm_map(audit_document9, content_transformer(tolower))
audit_document9<-tm_map(audit_document9, removeNumbers)
audit_document9<-tm_map(audit_document9,removeWords,stopwords(kind = "en"))
audit_document9<-tm_map(audit_document9,removePunctuation)
audit_document9<- tm_map(audit_document9,stripWhitespace)

audit_document10<-tm_map(audit_document10, content_transformer(tolower))
audit_document10<-tm_map(audit_document10, removeNumbers)
audit_document10<-tm_map(audit_document10,removeWords,stopwords(kind = "en"))
audit_document10<-tm_map(audit_document10,removePunctuation)
audit_document10<- tm_map(audit_document10,stripWhitespace)
```
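Because the same five cleaning steps are repeated for every report, they could also be collected in one function. The following is a minimal sketch, assuming the `tm` package is installed; `clean_corpus()` and the one-sentence demo corpus are hypothetical and not part of the original analysis.

```{r}
library(tm)

# Hypothetical helper that applies the five cleaning steps in the same order as above
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords(kind = "en"))
  corpus <- tm_map(corpus, removePunctuation)
  tm_map(corpus, stripWhitespace)
}

# One-sentence demo corpus (assumption: real input would be audit_document1, etc.)
demo_corpus <- Corpus(VectorSource("The Audit of 2020 is FAIR!"))
demo_clean <- clean_corpus(demo_corpus)

# stripWhitespace collapses runs of spaces but can leave one at the edges, hence trimws()
demo_text <- trimws(as.character(demo_clean[[1]]))
demo_text
```

Keeping the steps in one place makes it harder for one report to be cleaned differently from the others by accident.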

To perform the sentiment analysis, the frequency of each word must be computed.
With this code, we build a table of the words and their frequencies for each Audit Report analyzed.
```{r}
dtm1<-TermDocumentMatrix(audit_document1)#term-document matrix: counts of each word in each paragraph of the Report
m1<-as.matrix(dtm1)#convert the term-document matrix to a regular matrix
v1<-sort(rowSums(m1),decreasing = TRUE)#total frequency of each word, sorted from most to least frequent
d1<-data.frame(word=names(v1),frequency=v1) #data frame with the words and their frequencies
head(d1,20) #display the 20 most used words in the first Audit Report

dtm2<-TermDocumentMatrix(audit_document2)
m2<-as.matrix(dtm2)
v2<-sort(rowSums(m2),decreasing = TRUE)
d2<-data.frame(word=names(v2),frequency=v2)
head(d2,20)

dtm3<-TermDocumentMatrix(audit_document3)
m3<-as.matrix(dtm3)
v3<-sort(rowSums(m3),decreasing = TRUE)
d3<-data.frame(word=names(v3),frequency=v3)
head(d3,20)

dtm4<-TermDocumentMatrix(audit_document4)
m4<-as.matrix(dtm4)
v4<-sort(rowSums(m4),decreasing = TRUE)
d4<-data.frame(word=names(v4),frequency=v4)
head(d4,20)

dtm5<-TermDocumentMatrix(audit_document5)
m5<-as.matrix(dtm5)
v5<-sort(rowSums(m5),decreasing = TRUE)
d5<-data.frame(word=names(v5),frequency=v5)
head(d5,20)

dtm6<-TermDocumentMatrix(audit_document6)
m6<-as.matrix(dtm6)
v6<-sort(rowSums(m6),decreasing = TRUE)
d6<-data.frame(word=names(v6),frequency=v6)
head(d6,20)

dtm7<-TermDocumentMatrix(audit_document7)
m7<-as.matrix(dtm7)
v7<-sort(rowSums(m7),decreasing = TRUE)
d7<-data.frame(word=names(v7),frequency=v7)
head(d7,20)

dtm8<-TermDocumentMatrix(audit_document8)
m8<-as.matrix(dtm8)
v8<-sort(rowSums(m8),decreasing = TRUE)
d8<-data.frame(word=names(v8),frequency=v8)
head(d8,20)

dtm9<-TermDocumentMatrix(audit_document9)
m9<-as.matrix(dtm9)
v9<-sort(rowSums(m9),decreasing = TRUE)
d9<-data.frame(word=names(v9),frequency=v9)
head(d9,20)

dtm10<-TermDocumentMatrix(audit_document10)
m10<-as.matrix(dtm10)
v10<-sort(rowSums(m10),decreasing = TRUE)
d10<-data.frame(word=names(v10),frequency=v10)
head(d10,20)
```
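To make the frequency table less of a black box, the counting step can be reproduced in base R on a toy text. This sketch only illustrates the idea; splitting on single spaces and the `toy_` names are assumptions, and `TermDocumentMatrix()` applies its own rules (for example, by default it ignores words shorter than three characters).

```{r}
# Toy cleaned text (assumption: already lower-cased, without punctuation)
toy_text <- c("audit report financial statements", "audit opinion financial audit")

# Split into words, count them and sort from most to least frequent
toy_words <- unlist(strsplit(toy_text, " ", fixed = TRUE))
toy_counts <- sort(table(toy_words), decreasing = TRUE)

# Same two-column layout as d1: word and frequency
toy_freq <- data.frame(word = names(toy_counts),
                       frequency = as.integer(toy_counts))
head(toy_freq, 2)
```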
From the 10 different Data Frames, we can observe that the most used words are "Audit", "Financial", "Statements", "Report" and "Consolidated". For the analysis we chose Audit Reports of different lengths, as the word frequencies show.

Audit Report 1: The most frequent words have a neutral meaning, except for the word "Fraud". However the word is not used with a negative connotation since the auditors completed an analysis specific to fraud risks for this Company. Therefore, the word "Fraud" is frequent. Based on the first 20 words, our opinion about Report 1 is Neutral.
To increase the number of words displayed, we can use the code: head(d1,40)

Audit Report 2: Unlike Report 1, the words have lower frequencies, and the word "Fair" appears among the 20 most frequent words. Opinion: Neutral/Optimistic

Audit Report 3: The report leads to a very optimistic outcome, since four of the 20 most frequent words are "appropriate", "material", "accordance" and "assurance". Opinion: Optimistic

Audit Report 4: As in Report 1, the listed words have a neutral meaning. For this reason, we suggest increasing the number of words displayed. Opinion: Neutral

Audit Report 5: As in Report 3, the report uses several positive words, such as "accordance", "material" and "assurance". Opinion: Optimistic

Audit Report 6: Displaying more words is suggested. Opinion: Neutral.

Audit Report 7: Displaying more words is suggested. Opinion: Neutral.

Audit Report 8: Opinion: Neutral/Optimistic

Audit Report 9: Opinion: Neutral/Optimistic

Audit Report 10: Opinion: Neutral/Optimistic

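To display words with less common frequencies, we used the 'subset()' function. For each data frame (d1 to d10), the subset keeps the words whose frequency value is shared by at most 15 words, which filters out the long tail of words that all occur the same small number of times.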
```{r}
subset(d1, with(d1, frequency %in% names(which(table(frequency)<=15))))
subset(d2, with(d2, frequency %in% names(which(table(frequency)<=15))))
subset(d3, with(d3, frequency %in% names(which(table(frequency)<=15))))
subset(d4, with(d4, frequency %in% names(which(table(frequency)<=15))))
subset(d5, with(d5, frequency %in% names(which(table(frequency)<=15))))
subset(d6, with(d6, frequency %in% names(which(table(frequency)<=15))))
subset(d7, with(d7, frequency %in% names(which(table(frequency)<=15))))
subset(d8, with(d8, frequency %in% names(which(table(frequency)<=15))))
subset(d9, with(d9, frequency %in% names(which(table(frequency)<=15))))
subset(d10, with(d10, frequency %in% names(which(table(frequency)<=15))))
```
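The `subset()` expression is compact, so a toy example may help to show what it keeps. In this sketch the data frame `toy_d` and its values are hypothetical, and the threshold is lowered from 15 to 2 so that the effect is visible on six words.

```{r}
toy_d <- data.frame(word = c("audit", "financial", "report", "fair", "risk", "basis"),
                    frequency = c(9, 5, 5, 1, 1, 1))

# How many words share each frequency value: 1 -> 3 words, 5 -> 2 words, 9 -> 1 word
shared_by <- table(toy_d$frequency)

# Same pattern as above, with a threshold of 2 instead of 15:
# keep the words whose frequency value is shared by at most 2 words
toy_kept <- subset(toy_d, frequency %in% names(which(shared_by <= 2)))
toy_kept$word
```

The three words that occur once are dropped because three words share that frequency. If the goal were instead a plain cut-off on the frequency itself, a condition such as `frequency <= 15` would be the simpler choice; which of the two was intended is an assumption here.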
To display every word of each Audit Report, we used the 'names()' function.
```{r}
names(v1)
names(v2)
names(v3)
names(v4)
names(v5)
names(v6)
names(v7)
names(v8)
names(v9)
names(v10)
```
To find frequent terms, we used the 'findFreqTerms()' function with 10 as the threshold. In this way, words that occur at least 10 times are classified as frequent.
```{r}
findFreqTerms(dtm1,lowfreq = 10)
findFreqTerms(dtm2,lowfreq = 10)
findFreqTerms(dtm3,lowfreq = 10)
findFreqTerms(dtm4,lowfreq = 10)
findFreqTerms(dtm5,lowfreq = 10)
findFreqTerms(dtm6,lowfreq = 10)
findFreqTerms(dtm7,lowfreq = 10)
findFreqTerms(dtm8,lowfreq = 10)
findFreqTerms(dtm9,lowfreq = 10)
findFreqTerms(dtm10,lowfreq = 10)
```
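The threshold passed as `lowfreq` is a lower bound that includes the value itself. A base R sketch of the same selection, using a hypothetical frequency vector in the shape of `v1`:

```{r}
# Toy frequency vector (hypothetical values)
toy_v <- c(audit = 25, financial = 12, report = 10, fair = 4)

# Terms at or above the threshold; a word occurring exactly 10 times is included
toy_frequent <- names(toy_v)[toy_v >= 10]
toy_frequent
```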
For the Analysis, it is also interesting to examine the associations between terms.
Since the most important parts of the Audit Report are the Opinion and the fairness of the documents, we display the words that are most strongly associated with the words "opinion" and "fair".
```{r}
findAssocs(dtm1,terms="opinion", corlimit = 0.4)
findAssocs(dtm1,terms="fair", corlimit = 0.4)

findAssocs(dtm2,terms="opinion", corlimit = 0.4)
findAssocs(dtm2,terms="fair", corlimit = 0.4)

findAssocs(dtm3,terms="opinion", corlimit = 0.4)
findAssocs(dtm3,terms="fair", corlimit = 0.4)

findAssocs(dtm4,terms="opinion", corlimit = 0.4)
findAssocs(dtm4,terms="fair", corlimit = 0.4)

findAssocs(dtm5,terms="opinion", corlimit = 0.4)
findAssocs(dtm5,terms="fair", corlimit = 0.4)

findAssocs(dtm6,terms="opinion", corlimit = 0.4)
findAssocs(dtm6,terms="fair", corlimit = 0.4)

findAssocs(dtm7,terms="opinion", corlimit = 0.4)
findAssocs(dtm7,terms="fair", corlimit = 0.4)

findAssocs(dtm8,terms="opinion", corlimit = 0.4)
findAssocs(dtm8,terms="fair", corlimit = 0.4)

findAssocs(dtm9,terms="opinion", corlimit = 0.4)
findAssocs(dtm9,terms="fair", corlimit = 0.4)

findAssocs(dtm10,terms="opinion", corlimit = 0.4)
findAssocs(dtm10,terms="fair", corlimit = 0.4)
```
From the associations, we can observe that the word "opinion" usually appears in the same paragraphs as positive words such as "sufficient", "appropriate", "assurance", "reasonable", "fairly", "significance" and "fulfilled". The words associated with "opinion" also outline some of the basic features of the Audit Process: "independent", "material", "standard", "statement", "responsibility" and "evidence".
The words associated with "fair" are: "presentation", "preparation", "reporting", "control", "considerations", "assumptions" and "value".
In Audit Report number 6, the word "fair" is not used; for this reason the function returns 'numeric(0)'.
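The association score is a correlation between the counts of two terms across the documents of the corpus, which here are the paragraphs of a report. The base R sketch below reproduces that idea on a toy matrix; the counts are hypothetical, and the exact rounding and boundary handling of `findAssocs()` are not reproduced.

```{r}
# Toy term-by-paragraph counts (hypothetical); rows are terms, columns are paragraphs
toy_tdm <- rbind(opinion     = c(2, 0, 1, 0),
                 appropriate = c(1, 0, 1, 0),
                 inventory   = c(0, 3, 0, 1))

# Correlation of every term with "opinion" across the paragraphs
toy_cor <- apply(toy_tdm, 1, function(term_counts) cor(term_counts, toy_tdm["opinion", ]))

# Keep the other terms whose correlation is at least 0.4, similar to corlimit = 0.4
toy_assoc <- toy_cor[names(toy_cor) != "opinion" & toy_cor >= 0.4]
round(toy_assoc, 2)
```

In this toy matrix "appropriate" rises and falls together with "opinion" and is kept, while "inventory" appears in the other paragraphs, gets a negative correlation and is dropped. A high score therefore means that two words tend to occur in the same paragraphs, not that one directly follows the other.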

To visualize the ten most frequent words of each report, we used a Bar Plot.
```{r}
barplot(d1[1:10,]$frequency, las=2,names.arg=d1[1:10,]$word,
        col = "lightblue",main = "Most frequent words in auditing report 1",
        ylab = "word frequencies")

barplot(d2[1:10,]$frequency, las=2,names.arg=d2[1:10,]$word,
        col = "lightyellow",main = "Most frequent words in auditing report 2",
        ylab = "word frequencies")

barplot(d3[1:10,]$frequency, las=2,names.arg=d3[1:10,]$word,
        col = "red",main = "Most frequent words in auditing report 3",
        ylab = "word frequencies")

barplot(d4[1:10,]$frequency, las=2,names.arg=d4[1:10,]$word,
        col = "lightgreen",main = "Most frequent words in auditing report 4",
        ylab = "word frequencies")

barplot(d5[1:10,]$frequency, las=2,names.arg=d5[1:10,]$word,
        col = "lightgrey",main = "Most frequent words in auditing report 5",
        ylab = "word frequencies")

barplot(d6[1:10,]$frequency, las=2,names.arg=d6[1:10,]$word,
        col = "orange",main = "Most frequent words in auditing report 6",
        ylab = "word frequencies")

barplot(d7[1:10,]$frequency, las=2,names.arg=d7[1:10,]$word,
        col = "Green",main = "Most frequent words in auditing report 7",
        ylab = "word frequencies")

barplot(d8[1:10,]$frequency, las=2,names.arg=d8[1:10,]$word,
        col = "blue",main = "Most frequent words in auditing report 8",
        ylab = "word frequencies")

barplot(d9[1:10,]$frequency, las=2,names.arg=d9[1:10,]$word,
        col = "purple",main = "Most frequent words in auditing report 9",
        ylab = "word frequencies")

barplot(d10[1:10,]$frequency, las=2,names.arg=d10[1:10,]$word,
        col = "pink",main = "Most frequent words in auditing report 10",
        ylab = "word frequencies")

```
Thanks to the word frequencies, it is possible to create Word Clouds
```{r}
set.seed(1234)
wordcloud(words = d1$word,freq = d1$freq,min.freq = 5,
          max.words = 100,random.order = FALSE,rot.per = 0.40,
          colors = brewer.pal(4,"Spectral"))

wordcloud(words = d2$word,freq = d2$freq,min.freq = 5,
          max.words = 100,random.order = FALSE,rot.per = 0.40,
          colors = brewer.pal(4,"Blues"))

wordcloud(words = d3$word,freq = d3$freq,min.freq = 11,
          max.words = 100,random.order = FALSE,rot.per = 0.40,
          colors = brewer.pal(4,"Accent"))

wordcloud(words = d4$word,freq = d4$freq,min.freq = 5,
          max.words = 100,random.order = FALSE,rot.per = 0.40,
          colors = brewer.pal(4,"Set3"))

wordcloud(words = d5$word,freq = d5$freq,min.freq = 3,
          max.words = 100,random.order = FALSE,rot.per = 0.4,
          colors = brewer.pal(4,"Set2"))

wordcloud(words = d6$word,freq = d6$freq,min.freq = 5,
          max.words = 100,random.order = FALSE,rot.per = 0.40,
          colors = brewer.pal(4,"Oranges"))

wordcloud(words = d7$word,freq = d7$freq,min.freq = 8,
          max.words = 100,random.order = FALSE,rot.per = 0.40,
          colors = brewer.pal(4,"Pastel2"))

wordcloud(words = d8$word,freq = d8$freq,min.freq = 3,
          max.words = 100,random.order = FALSE,rot.per = 0.30,
          colors = brewer.pal(4,"Pastel1"))

wordcloud(words = d9$word,freq = d9$freq,min.freq = 5,
          max.words = 100,random.order = FALSE,rot.per = 0.40,
          colors = brewer.pal(4,"Dark2"))

wordcloud(words = d10$word,freq = d10$freq,min.freq = 5,
          max.words = 100,random.order = FALSE,rot.per = 0.5,
          colors = brewer.pal(4,"BrBG"))
```
Several methods, each with its own scale, can be used to assess whether the overall sentiment of each Audit Report is positive or negative. If the computed mean is POSITIVE, the overall sentiment expressed in the audit report is OPTIMISTIC.
The three methods used are Syuzhet, Bing and Afinn.
```{r}
syuzhet_vector1<-get_sentiment(audit_report1,method = "syuzhet") #analyze text based on syuzhet (the function expects the report text, not the data frame)
summary(syuzhet_vector1)#see summary statistics of the vector

bing_vector1<-get_sentiment(audit_report1,method="bing")#analyze text based on bing
summary(bing_vector1)#see summary statistics of the vector

afinn_vector1<-get_sentiment(audit_report1,method="afinn")#analyze text based on afinn, min of -5 (most neg) to 5 (most pos)
summary(afinn_vector1)#see summary statistics of the vector

syuzhet_vector2<-get_sentiment(audit_report2,method = "syuzhet")
summary(syuzhet_vector2)

bing_vector2<-get_sentiment(audit_report2,method="bing")
summary(bing_vector2)

afinn_vector2<-get_sentiment(audit_report2,method="afinn")
summary(afinn_vector2)

syuzhet_vector3<-get_sentiment(audit_report3,method = "syuzhet")
summary(syuzhet_vector3)

bing_vector3<-get_sentiment(audit_report3,method="bing")
summary(bing_vector3)

afinn_vector3<-get_sentiment(audit_report3,method="afinn")
summary(afinn_vector3)

syuzhet_vector4<-get_sentiment(audit_report4,method = "syuzhet")
summary(syuzhet_vector4)

bing_vector4<-get_sentiment(audit_report4,method="bing")
summary(bing_vector4)

afinn_vector4<-get_sentiment(audit_report4,method="afinn")
summary(afinn_vector4)

syuzhet_vector5<-get_sentiment(audit_report5,method = "syuzhet")
summary(syuzhet_vector5)

bing_vector5<-get_sentiment(audit_report5,method="bing")
summary(bing_vector5)

afinn_vector5<-get_sentiment(audit_report5,method="afinn")
summary(afinn_vector5)

syuzhet_vector6<-get_sentiment(audit_report6,method = "syuzhet")
summary(syuzhet_vector6)

bing_vector6<-get_sentiment(audit_report6,method="bing")
summary(bing_vector6)

afinn_vector6<-get_sentiment(audit_report6,method="afinn")
summary(afinn_vector6)

syuzhet_vector7<-get_sentiment(audit_report7,method = "syuzhet")
summary(syuzhet_vector7)

bing_vector7<-get_sentiment(audit_report7,method="bing")
summary(bing_vector7)

afinn_vector7<-get_sentiment(audit_report7,method="afinn")
summary(afinn_vector7)

syuzhet_vector8<-get_sentiment(audit_report8,method = "syuzhet")
summary(syuzhet_vector8)

bing_vector8<-get_sentiment(audit_report8,method="bing")
summary(bing_vector8)

afinn_vector8<-get_sentiment(audit_report8,method="afinn")
summary(afinn_vector8)

syuzhet_vector9<-get_sentiment(audit_report9,method = "syuzhet")
summary(syuzhet_vector9)

bing_vector9<-get_sentiment(audit_report9,method="bing")
summary(bing_vector9)

afinn_vector9<-get_sentiment(audit_report9,method="afinn")
summary(afinn_vector9)

syuzhet_vector10<-get_sentiment(audit_report10,method = "syuzhet")
summary(syuzhet_vector10)

bing_vector10<-get_sentiment(audit_report10,method="bing")
summary(bing_vector10)

afinn_vector10<-get_sentiment(audit_report10,method="afinn")
summary(afinn_vector10)
```
As we can see from the results, the Mean of each Audit Report is Positive, meaning that the overall sentiment is Optimistic. 
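The mean reported by `summary()` is an average of per-paragraph scores. How such a score arises can be illustrated with a toy lexicon in base R. This is only a sketch of the general lexicon approach: `toy_lexicon`, its values and the three paragraphs are hypothetical, and the real Syuzhet, Bing and AFINN lexicons are far larger and use their own scales.

```{r}
# Hypothetical mini-lexicon: positive words score 1, negative words score -1
toy_lexicon <- c(fair = 1, appropriate = 1, reasonable = 1, fraud = -1, misstatement = -1)

toy_paragraphs <- c("the statements give a true and fair view",
                    "risk of material misstatement due to fraud",
                    "the evidence is appropriate and the estimates are reasonable")

# Score of a paragraph = sum of the lexicon values of its words (unknown words count 0)
score_paragraph <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  sum(toy_lexicon[words], na.rm = TRUE)
}

toy_scores <- vapply(toy_paragraphs, score_paragraph, numeric(1), USE.NAMES = FALSE)
toy_scores
mean(toy_scores)
```

The second paragraph scores negatively, yet the mean stays above zero. A positive mean therefore describes the report as a whole and does not rule out individual paragraphs with a negative score.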

To assess whether the words used in each report have a positive or negative meaning, we use 'get_nrc_sentiment()'. After the result is converted to a data frame, the function 'rowSums()' gives the total number of words assigned to each sentiment.
```{r}
vector.nrc1<-get_nrc_sentiment(audit_report1)#analyze text based on nrc: one row per paragraph, one column per sentiment
df.nrc1<-data.frame(t(vector.nrc1))#transpose so that each row is a sentiment and each column is a paragraph
#rowSums() adds up the counts of each sentiment across the selected paragraphs
td_new1<-data.frame(rowSums(df.nrc1[6:87]))#specify the range of columns (paragraphs) to include

vector.nrc2<-get_nrc_sentiment(audit_report2)
df.nrc2<-data.frame(t(vector.nrc2))
td_new2<-data.frame(rowSums(df.nrc2[1:20]))

vector.nrc3<-get_nrc_sentiment(audit_report3)
df.nrc3<-data.frame(t(vector.nrc3))
td_new3<-data.frame(rowSums(df.nrc3[1:69]))

vector.nrc4<-get_nrc_sentiment(audit_report4)
df.nrc4<-data.frame(t(vector.nrc4))
td_new4<-data.frame(rowSums(df.nrc4[1:44]))

vector.nrc5<-get_nrc_sentiment(audit_report5)
df.nrc5<-data.frame(t(vector.nrc5))
td_new5<-data.frame(rowSums(df.nrc5[1:18]))

vector.nrc6<-get_nrc_sentiment(audit_report6)
df.nrc6<-data.frame(t(vector.nrc6))
td_new6<-data.frame(rowSums(df.nrc6[1:136]))

vector.nrc7<-get_nrc_sentiment(audit_report7)
df.nrc7<-data.frame(t(vector.nrc7))
td_new7<-data.frame(rowSums(df.nrc7[1:51]))

vector.nrc8<-get_nrc_sentiment(audit_report8)
df.nrc8<-data.frame(t(vector.nrc8))
td_new8<-data.frame(rowSums(df.nrc8[1:11]))

vector.nrc9<-get_nrc_sentiment(audit_report9)
df.nrc9<-data.frame(t(vector.nrc9))
td_new9<-data.frame(rowSums(df.nrc9[1:51]))

vector.nrc10<-get_nrc_sentiment(audit_report10)
df.nrc10<-data.frame(t(vector.nrc10))
td_new10<-data.frame(rowSums(df.nrc10[1:33]))
```
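The column ranges above are typed in by hand and depend on how many paragraphs each page returned when it was downloaded. When all paragraphs of a report are to be included, the totals can be obtained without a fixed range. The sketch below uses a hypothetical table `toy_nrc` with only four of the ten NRC columns.

```{r}
# Toy table in the shape returned by get_nrc_sentiment(): one row per paragraph
# (hypothetical counts)
toy_nrc <- data.frame(fear = c(0, 2, 0), trust = c(1, 0, 2),
                      negative = c(0, 3, 0), positive = c(2, 0, 3))

# Transpose, then sum over all paragraphs using ncol() instead of a typed range
toy_transposed <- data.frame(t(toy_nrc))
toy_totals_a <- rowSums(toy_transposed[seq_len(ncol(toy_transposed))])

# colSums() on the untransposed table gives the same totals in one step
toy_totals_b <- colSums(toy_nrc)
toy_totals_b
```

A range that deliberately skips leading paragraphs, such as `6:87` for the first report, would still need an explicit starting point.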
To visualize the result of our sentiment analysis, we create a plot for each Audit Report. In this way, we observe the frequency of each sentiment in every report.
```{r}
names(td_new1)[1]<-"count"
td_new1<-cbind("sentiment"=rownames(td_new1),td_new1)
rownames(td_new1)<-NULL
td_1<-td_new1[1:10,]
#plot count of words associated with each sentiment
quickplot(sentiment,data = td_1,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")

names(td_new2)[1]<-"count"
td_new2<-cbind("sentiment"=rownames(td_new2),td_new2)
rownames(td_new2)<-NULL
td_2<-td_new2[1:10,]
quickplot(sentiment,data = td_2,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")

names(td_new3)[1]<-"count"
td_new3<-cbind("sentiment"=rownames(td_new3),td_new3)
rownames(td_new3)<-NULL
td_3<-td_new3[1:10,]
quickplot(sentiment,data = td_3,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")

names(td_new4)[1]<-"count"
td_new4<-cbind("sentiment"=rownames(td_new4),td_new4)
rownames(td_new4)<-NULL
td_4<-td_new4[1:10,]
quickplot(sentiment,data = td_4,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")

names(td_new5)[1]<-"count"
td_new5<-cbind("sentiment"=rownames(td_new5),td_new5)
rownames(td_new5)<-NULL
td_5<-td_new5[1:10,]
quickplot(sentiment,data = td_5,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")

names(td_new6)[1]<-"count"
td_new6<-cbind("sentiment"=rownames(td_new6),td_new6)
rownames(td_new6)<-NULL
td_6<-td_new6[1:10,]
quickplot(sentiment,data = td_6,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")

names(td_new7)[1]<-"count"
td_new7<-cbind("sentiment"=rownames(td_new7),td_new7)
rownames(td_new7)<-NULL
td_7<-td_new7[1:10,]
quickplot(sentiment,data = td_7,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")

names(td_new8)[1]<-"count"
td_new8<-cbind("sentiment"=rownames(td_new8),td_new8)
rownames(td_new8)<-NULL
td_8<-td_new8[1:10,]
quickplot(sentiment,data = td_8,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")

names(td_new9)[1]<-"count"
td_new9<-cbind("sentiment"=rownames(td_new9),td_new9)
rownames(td_new9)<-NULL
td_9<-td_new9[1:10,]
quickplot(sentiment,data = td_9,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")

names(td_new10)[1]<-"count"
td_new10<-cbind("sentiment"=rownames(td_new10),td_new10)
rownames(td_new10)<-NULL
td_10<-td_new10[1:10,]
quickplot(sentiment,data = td_10,weight=count, geom = "bar", fill=sentiment, ylab = "count")+ggtitle("Audit Report sentiments")
```
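Recent ggplot2 releases mark `qplot()` and `quickplot()` as deprecated, so the same chart may eventually need to be written with `ggplot()`. The following is a minimal sketch of an equivalent bar chart; the data frame `toy_td` and its counts are hypothetical, and adding the report number to the title is a suggestion rather than part of the original code.

```{r}
library(ggplot2)

# Toy sentiment counts in the same shape as td_1 (hypothetical values)
toy_td <- data.frame(sentiment = c("trust", "positive", "negative", "fear"),
                     count = c(30, 42, 12, 8))

# Bar chart of the count per sentiment; the title names the report
toy_plot <- ggplot(toy_td, aes(x = sentiment, y = count, fill = sentiment)) +
  geom_col() +
  ggtitle("Audit Report 1 sentiments")
toy_plot
```

`geom_col()` uses the `count` column directly as the bar height, which replaces the `weight = count` argument of the original call.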
From the plots, we can observe that the two most frequent sentiments are "Positive" and "Trust". Therefore, we can assume that most of the analyzed Audit Reports are optimistic. Considering the presence of other sentiments, we conclude that the auditors were not overly optimistic. For this reason, the results of each report can be considered objective and realistic. However, in the second, eighth and tenth reports, the frequency of "trust" is high (almost equal to the frequency of "positive"). An auditor who places too much trust in a company may introduce bias and errors into further analysis. Therefore, the auditor should always respect the principles of integrity, objectivity and independence in order to work properly.
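The comparison between "trust" and "positive" can be made explicit by dividing one count by the other. The sketch below uses hypothetical counts, not the values from the reports, and the name `trust_to_positive` is only illustrative.

```{r}
# Hypothetical sentiment counts for one report
toy_sentiment_counts <- c(positive = 40, trust = 36, negative = 10, fear = 6)

# A ratio close to 1 means "trust" is almost as frequent as "positive"
trust_to_positive <- toy_sentiment_counts[["trust"]] / toy_sentiment_counts[["positive"]]
trust_to_positive
```

Computing this ratio for all ten reports would give a single number per report to support the statement about the second, eighth and tenth reports.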

Inherent risk can be defined as the risk of an omission or error in a financial statement that does not arise from a failure of internal control. The most common inherent risk factors are susceptibility to obsolescence and fraudulent reporting, difficulty in preparing disclosures and the need for judgment. For this reason, we can assume that the second Audit Report is more susceptible to inherent risk.



REFERENCES
Budhiraja G., Gupta S., Sharma C., Rastogi Y., "Latent Topic Analysis". 2016, 20 March. Retrieved from: http://rstudio-pubs-static.s3.amazonaws.com/163169_c79385802d2c4448aae913abb3e0dd9e.html

Data Centric Inc. (video), "How to create natural language processing in r on Web page data". 2021, 28 October. Retrieved from YouTube: https://www.youtube.com/watch?v=onacC9OTYv8

Edureka (video), "Introduction to Sentiment Analysis in R". 2021. Retrieved from YouTube: https://www.youtube.com/watch?v=tFPk5Nln3aA

finnstats, "Sentiment analysis in R". 2021, 16 May. Retrieved from R-bloggers: https://www.r-bloggers.com/2021/05/sentiment-analysis-in-r-3/

Feuerriegel S. & Proellochs N., "SentimentAnalysis Vignette". 2021, 18 February. Retrieved from CRAN: https://cran.r-project.org/web/packages/SentimentAnalysis/vignettes/SentimentAnalysis.html

"Math of tm::findAssocs how does this function work?". 2013. Retrieved from Stack Overflow: https://stackoverflow.com/questions/14267199/math-of-tmfindassocs-how-does-this-function-work

Mhatre S., "Text Mining and Sentiment Analysis: Analysis with R". 2020, 13 May. Retrieved from Redgate: https://www.red-gate.com/simple-talk/databases/sql-server/bi-sql-server/text-mining-and-sentiment-analysis-with-r/

Schweinberger M., "Sentiment Analysis in R". 2022, 3 April. Retrieved from LADAL: https://slcladal.github.io/sentiment.html

Silge J. & Robinson D., "Text Mining with R: A Tidy Approach". 2022, 3 May. Retrieved from https://www.tidytextmining.com/sentiment.html
