In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from Chapter 2 working in an R Markdown document. You should provide a citation to this base code. You are then asked to extend the code in two ways:
Work with a different corpus of your choosing, and incorporate at least one additional sentiment lexicon (possibly from another R package that you've found through research). As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work in a small team on this assignment.
After some research, I found there are many ways to approach this. I first tried using the gutenbergr library to get a different corpus, and there were many novels and texts available. However, I wanted to try something different: sentiment analysis on financial data, tweets, or emails. The easiest to get access to was financial data. I tried finding financial documents through the gutenbergr library, but only novels and texts about those subjects were available, so instead I downloaded an Amazon 10K filing and uploaded it to GitHub to use as my corpus.
library(tidytext)
library(textdata)
library(janeaustenr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(tidyr)
library(wordcloud)
## Loading required package: RColorBrewer
Here we recreate the primary example from Chapter 2 of Text Mining with R.
Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. https://www.tidytextmining.com/tidytext.html.
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,875 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,865 more rows
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ... with 291 more rows
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # ... with 122,194 more rows
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining, by = "word"
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
I took a different approach for the new corpus. I couldn't find a financial statement in the gutenbergr library, so instead I downloaded the Amazon 10K released on October 29, 2021 (the document is actually Amazon's Form 10-Q quarterly filing, as the header in the output below shows, but I refer to it as the Amazon 10K throughout). Amazon had performed below analysts' expectations, and I wanted to apply the Loughran lexicon to this filing.
https://ir.aboutamazon.com/sec-filings/sec-filings-details/default.aspx?FilingId=15311356
library(gutenbergr)
Amazon10K= "https://github.com/mianshariq/SPS/raw/10668f542cef868339300d9278f69bf6cd12dcf2/Data%20607/Assignments/Amazon10k.txt"
Amazon10K=readLines(Amazon10K)
Amazon10K <- tibble(text = Amazon10K)
Amazon10K
## # A tibble: 6,193 x 1
## text
## <chr>
## 1 ""
## 2 "Table of Contents"
## 3 ""
## 4 ""
## 5 ""
## 6 ""
## 7 "UNITED STATES"
## 8 "SECURITIES AND EXCHANGE COMMISSION"
## 9 "Washington, D.C. 20549"
## 10 " ____________________________________"
## # ... with 6,183 more rows
Count_Amazon10K <- Amazon10K[c(1:nrow(Amazon10K)),]
Amazon10K_Chapters <- Count_Amazon10K %>%
filter(text != "") %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("CHAPTER [\\dIVXLC]", ignore_case = TRUE))))
Amazon10K_Chapters
## # A tibble: 2,686 x 3
## text linenumber chapter
## <chr> <int> <int>
## 1 Table of Contents 1 0
## 2 UNITED STATES 2 0
## 3 SECURITIES AND EXCHANGE COMMISSION 3 0
## 4 Washington, D.C. 20549 4 0
## 5 ____________________________________ 5 0
## 6 FORM 10-Q 6 0
## 7 ____________________________________ 7 0
## 8 (Mark One) 8 0
## 9 ? 9 0
## 10 QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE S~ 10 0
## # ... with 2,676 more rows
According to https://sraf.nd.edu/textual-analysis/resources/, the Loughran lexicon is designed for accounting and finance. Interestingly, the site states that "A growing literature finds significant relations between financial phenomena (e.g., stock returns, commodity prices, bankruptcies, governance) and the sentiment of financial disclosures as measured by word classifications such as those provided below." The categories used to label the sentiments are "negative", "positive", "litigious", "uncertainty", "constraining", and "superfluous".
get_sentiments("loughran")
## # A tibble: 4,150 x 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # ... with 4,140 more rows
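Before applying the lexicon to the filing, it helps to see how many lexicon words fall into each of those categories. The snippet below is a small sketch that only summarizes the lexicon itself, not our corpus:
get_sentiments("loughran") %>%
  count(sentiment, sort = TRUE)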
Amazon10K_tidy <- Amazon10K_Chapters %>%
unnest_tokens(word, text) %>%
inner_join(get_sentiments("loughran")) %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>% ungroup() %>% mutate(word = reorder(word, n)) %>%
anti_join(stop_words)
names(Amazon10K_tidy)<-c("word", "sentiment", "Freq")
Amazon10K_tidy
## # A tibble: 55 x 3
## word sentiment Freq
## <fct> <chr> <int>
## 1 obligations constraining 44
## 2 risks uncertainty 36
## 3 losses negative 31
## 4 loss negative 30
## 5 jurisdictions litigious 28
## 6 laws litigious 28
## 7 regulations litigious 28
## 8 risk uncertainty 23
## 9 commitments constraining 21
## 10 requirements constraining 20
## # ... with 45 more rows
ggplot(data = Amazon10K_tidy, aes(x = word, y = Freq, fill = sentiment)) +
geom_bar(stat = "identity") + coord_flip() + facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",x = NULL)
Amazon10K_tidy %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): required could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): improvements could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): efficiently could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): prevent could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): unable could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): approximately could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): assumptions could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): expose could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): regulation could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): controversies could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): contracts could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): regulations could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): adverse could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): difficult could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): opportunities could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): jurisdictions could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): comply could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): adequately could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): investigations could not be fit
## on page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): claims could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): requires could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): require could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): favorable could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): contractual could not be fit on
## page. It will not be plotted.
Amazon10K_tidy %>%
inner_join(get_sentiments("loughran")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
I want to compare the sentiment with a different quarter and see whether there is a difference, since the most recent quarter's results were lower than expected. You can see only a small difference in the sentiments, which is predictable: Amazon is a big company and does not want to spook its investors because it missed expectations in the last quarter. A direct side-by-side comparison of the two quarters' top sentiment words is sketched after the two frequency tables below.
Amazon10KQ2= "https://github.com/mianshariq/SPS/raw/3790a3bf2750dd6cb34548d50cc6b3507eb0e904/Data%20607/Assignments/Amazon10KQ3.txt"
Amazon10KQ2 =readLines(Amazon10KQ2)
Amazon10KQ2 <- tibble(text = Amazon10KQ2)
Amazon10KQ2
## # A tibble: 8,861 x 1
## text
## <chr>
## 1 ""
## 2 "Table of Contents"
## 3 " "
## 4 ""
## 5 ""
## 6 ""
## 7 ""
## 8 "UNITED STATES"
## 9 "SECURITIES AND EXCHANGE COMMISSION"
## 10 "Washington, D.C. 20549"
## # ... with 8,851 more rows
Count_Amazon10KQ2 <- Amazon10KQ2[c(1:nrow(Amazon10KQ2)),]
Amazon10KQ2_Chapters <- Count_Amazon10KQ2 %>%
filter(text != "") %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("CHAPTER [\\dIVXLC]", ignore_case = TRUE))))
Amazon10KQ2_tidy <- Amazon10KQ2_Chapters %>%
unnest_tokens(word, text) %>%
inner_join(get_sentiments("loughran")) %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>% ungroup() %>% mutate(word = reorder(word, n)) %>%
anti_join(stop_words)
names(Amazon10KQ2_tidy)<-c("word", "sentiment", "Freq")
Amazon10K_tidy
## # A tibble: 55 x 3
## word sentiment Freq
## <fct> <chr> <int>
## 1 obligations constraining 44
## 2 risks uncertainty 36
## 3 losses negative 31
## 4 loss negative 30
## 5 jurisdictions litigious 28
## 6 laws litigious 28
## 7 regulations litigious 28
## 8 risk uncertainty 23
## 9 commitments constraining 21
## 10 requirements constraining 20
## # ... with 45 more rows
Amazon10KQ2_tidy
## # A tibble: 47 x 3
## word sentiment Freq
## <fct> <chr> <int>
## 1 obligations constraining 51
## 2 losses negative 46
## 3 loss negative 45
## 4 risks uncertainty 38
## 5 jurisdictions litigious 35
## 6 laws litigious 32
## 7 required constraining 30
## 8 regulations litigious 28
## 9 restricted constraining 28
## 10 risk uncertainty 28
## # ... with 37 more rows
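As a quick side-by-side check of the two quarters, we can join the two top-word tables on word and sentiment. This is only a sketch: the Freq_Q3 and Freq_Q2 column names are labels I am introducing here, treating Amazon10K as the most recent quarter and Amazon10KQ2 as the earlier one, following the narrative above.
quarter_compare <- Amazon10K_tidy %>%
  rename(Freq_Q3 = Freq) %>%
  full_join(Amazon10KQ2_tidy %>% rename(Freq_Q2 = Freq),
            by = c("word", "sentiment")) %>%
  arrange(desc(Freq_Q3))
quarter_compare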
ggplot(data = Amazon10KQ2_tidy, aes(x = word, y = Freq, fill = sentiment)) +
geom_bar(stat = "identity") + coord_flip() + facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",x = NULL)
Amazon10KQ2_tidy %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): impairment could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): adversely could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): assumptions could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): complaint could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): improvements could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): successfully could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): requirements could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): claims could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): damages could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): effective could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): required could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): alliances could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): enable could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): prevent could not be fit on
## page. It will not be plotted.
Amazon10K_tidy %>%
inner_join(get_sentiments("loughran")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
Next I want to apply the bing lexicon to our corpus. This classifies words as simply positive or negative. It should be interesting to see how some of the words are classified, since this is a more technical document.
Amazon10K_tidy <- Amazon10K_Chapters %>%
unnest_tokens(word, text) %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>% ungroup() %>% mutate(word = reorder(word, n)) %>%
anti_join(stop_words)
names(Amazon10K_tidy)<-c("word", "sentiment", "Freq")
It is interesting to see "fulfillment" classified as positive here. However, we know that in this filing "fulfillment" refers to the fulfillment centers Amazon uses as warehouses, not to a positive sentiment.
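One way to handle a domain-specific term like this is to add it to a custom stop-word list before joining, so it does not inflate the positive counts. This is only a sketch, and custom_stop_words is a name I am introducing for illustration:
custom_stop_words <- bind_rows(
  tibble(word = "fulfillment", lexicon = "custom"),
  stop_words)
Amazon10K_Chapters %>%
  unnest_tokens(word, text) %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)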
ggplot(data = Amazon10K_tidy, aes(x = word, y = Freq, fill = sentiment)) +
geom_bar(stat = "identity") + coord_flip() + facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",x = NULL)
Amazon10K_tidy %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Warning in wordcloud(word, n, max.words = 100): effective could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): gross could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): losses could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): adverse could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): fulfillment could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): outstanding could not be fit on
## page. It will not be plotted.
Amazon10K_tidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words = 100):
## outstanding could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words = 100):
## protection could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words = 100):
## significant could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words = 100):
## sufficient could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words = 100):
## liability could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words = 100):
## restricted could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words = 100):
## unable could not be fit on page. It will not be plotted.
Sentiment analysis makes it easier to show how opinions are expressed in a text, or to classify a text toward a certain attitude. In our Amazon 10K example, sentiment analysis could help relate the language of the filing to stock prices, business effectiveness, and restructuring. In this assignment we added a new corpus, the Amazon 10K, and applied sentiment analysis to it using the Loughran lexicon. We found that "obligations" had a high frequency under the constraining sentiment, and that "losses" and "loss" had high frequencies under the negative sentiment. Comparing the two Amazon filings with the Loughran lexicon, one thing I found interesting was that in Q2, when Amazon posted higher gains than in Q3, "gains" was the most frequent positive word, versus second in Q3, with frequencies of 20 and 12 respectively. For financial documents the Loughran lexicon seems more useful because it gives a more in-depth view of the 10K than the bing lexicon, whose most frequent positive word was "fulfillment", which in this case is not a positive word at all but a reference to Amazon's fulfillment centers.
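To make that quarter-to-quarter comparison easy to reproduce, here is a small sketch that recounts a single Loughran word in each filing. The count_word helper is a name I am introducing, and the counts are recomputed from the chapter tables because Amazon10K_tidy was later overwritten with the bing counts above:
count_word <- function(chapters, target) {
  chapters %>%
    unnest_tokens(word, text) %>%
    inner_join(get_sentiments("loughran"), by = "word") %>%
    count(word, sentiment, sort = TRUE) %>%
    filter(word == target)
}
count_word(Amazon10K_Chapters, "gains")
count_word(Amazon10KQ2_Chapters, "gains")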