Let's first load the required libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(syuzhet)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(tidytext)
library(forcats)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ lubridate 1.9.3 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tibble 3.2.1
## ✔ readr 2.1.5 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::annotate() masks NLP::annotate()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(textdata)
## Warning: package 'textdata' was built under R version 4.4.2
Introduction
In today's day and age, enormous volumes of textual content are generated everywhere: in-app messages on services like WhatsApp and Telegram, posts on social media sites like Facebook and Instagram, news publishing sites, Google searches, and many other sources. All of these sources generate huge volumes of text data every second, and because of these volumes NLP becomes a vital resource for understanding textual content. In this paper, the main focus is on the popular NLP task of sentiment analysis. Sentiment analysis is contextual mining of text which identifies and extracts subjective information from textual data. It proves to be an incredible asset for users to extract essential information.
Purpose
The aim of sentiment analysis is the computational study of people's opinions, sentiments, emotions, and attitudes towards entities such as products, services, issues, events, topics, and their attributes (Liu 2015). As such, sentiment analysis makes it possible to track the public's mood about a particular entity and turn it into actionable knowledge. This type of knowledge can also be used to understand, explain, and predict social phenomena (Pozzi et al. 2017). In the business domain, sentiment analysis plays a vital role in enabling businesses to improve strategy and gain insight into customers' feedback about their products. In today's customer-oriented business culture, understanding the customer is increasingly important.
Assumptions
To ensure that the task is meaningful in practice, existing research makes the following implicit assumptions.
Sentiment analysis assumes that the opinion document d (e.g., a product review) expresses opinions on a single entity e and contains opinions from a single opinion holder h. In practice, if an opinion document evaluates more than one entity, then the sentiments on the entities can be different. For example, the opinion holder may be positive about some entities and negative about others. Thus, it does not make practical sense to assign one sentiment orientation to the entire document in this case.
It also does not make much sense if multiple opinion holders express opinions in a single document, because their opinions can differ too. This assumption holds for reviews of products and services, since each review usually evaluates a single product or service and is written by a single reviewer. However, it may not hold for forum and blog posts, because there the author may express opinions on multiple entities and compare them using comparative sentences.
Description of Data Set
Let's read our CSV file with the help of the read.csv() function. Additionally, let's look at the data using the glimpse() function.
data <- read.csv("news.csv")
glimpse(data)
## Rows: 26,706
## Columns: 4
## $ Story.Heading <chr> "Federation of Pakistan v Gen Pervez Musharraf: Treason …
## $ Story.Excerpt <chr> "After the special court's verdict in the high treason c…
## $ Timestamp <chr> "01 Jan, 2020 09:55pm", "01 Jan, 2020 08:35pm", "01 Jan,…
## $ Section <chr> "", "Pakistan", "World", "Pakistan", "Sport", "Pakistan"…
From the above output we can see that we have 4 columns and 26,706 observations. The columns are self-explanatory. We don't require all of them for our analysis, so it's better to either remove the rest or select only the desired column, which in our case is Story.Excerpt: it describes the content of each news story in more detail.
Data Preparation, Cleaning and transformations
Prior to any analysis, the first step is to clean and transform the data according to the model's requirements. Text data usually contains special characters, extra white space, and stop words; it's recommended to get rid of these, as they don't add any value to the analysis.
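Although the pipeline below relies on tidytext, the tm package loaded at the start illustrates these cleaning steps on a single string. A minimal sketch, using an invented sentence:
# illustration only: cleaning one made-up string with tm helpers
txt <- "  The PM's Covid-19 briefing,   held today!  "
txt <- tolower(txt)                       # lowercase
txt <- removePunctuation(txt)             # drop punctuation
txt <- removeWords(txt, stopwords("en"))  # drop English stop words
stripWhitespace(txt)                      # collapse repeated spaces
# expected: roughly " pms covid19 briefing held today "
In our pipeline we instead let tidytext handle this during tokenization.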
# using unnest_tokens()
data_wr <- data %>%
unnest_tokens(word, Story.Excerpt) # What to create (word) from where (Story.Excerpt)
head(data_wr$word)
## [1] "after" "the" "special" "court's" "verdict" "in"
As these outputs show, unnest_tokens() has already done some cleaning: it removed punctuation (keeping intra-word apostrophes, as in court's) and white space, and transformed everything to lowercase. Each row now contains exactly one word. Since this is a really large data set, let's look at the total number of rows using the dim() function.
dim(data_wr)
## [1] 262497 4
In total we have 262,497 rows. Let's count the words and arrange them in descending order to see which ones occur most frequently.
# counting words
data_wr%>%
count(word) %>%
arrange(desc(n)) %>%
head()
## word n
## 1 to 8783
## 2 in 7723
## 3 of 5565
## 4 the 4596
## 5 for 4091
## 6 on 2610
We still see that common words such as to, in, of, and the occur most frequently. To analyze distinctive word use, we want to remove these words. That can be done with an anti_join() against tidytext's list of stop_words.
# using unnest_tokens() with stopwords
data_st<- data %>%
unnest_tokens(word,Story.Excerpt) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
Let's count the words again to see whether we managed to resolve the issue.
# counting words again
data_st %>%
count(word) %>%
arrange(desc(n)) %>%
head()
## word n
## 1 covid 1421
## 2 pakistan 1417
## 3 govt 1223
## 4 19 1172
## 5 coronavirus 857
## 6 virus 756
We can see from the above output that the most frequent words are now covid, pakistan, govt, etc., which reflect the actual, meaningful content of the news.
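One artifact does remain: the token 19, split off from "Covid-19", is frequent but survives the standard stop-word list. If we wanted to drop such corpus-specific tokens too, one option is to extend stop_words with a custom list (a sketch; the choice of extra tokens is ours):
# extend the stop-word list with corpus-specific tokens (here only "19")
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = "19", lexicon = "custom")
)
data_custom <- data %>%
  unnest_tokens(word, Story.Excerpt) %>%
  anti_join(custom_stop_words, by = "word")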
Visualization
Instead of looking at the data frame, it is more appealing and understandable to visualize our cleaned data. Let's count the words, keep only those occurring more than 300 times, and arrange them in descending order.
word_counts <- data_st %>%
count(word) %>%
filter(n>300) %>%
arrange(desc(n))
# using coord_flip()
# when data are hard to read
# on the x axis
ggplot(word_counts, aes(x=word, y=n)) +
geom_col() +
coord_flip() +
ggtitle("Review Word Counts")
The plot above shows each word against its count, but the bars are not ordered. Let's reorder the words by count and visualize them from largest to smallest.
# reorder what (word) by what (n)
word_counts <- data_st %>%
count(word) %>%
filter(n>300) %>%
mutate(word2 = fct_reorder(word, n))
word_counts
## word n word2
## 1 19 1172 19
## 2 chief 320 chief
## 3 china 434 china
## 4 coronavirus 857 coronavirus
## 5 court 406 court
## 6 covid 1421 covid
## 7 day 324 day
## 8 govt 1223 govt
## 9 imran 397 imran
## 10 india 617 india
## 11 indian 352 indian
## 12 karachi 530 karachi
## 13 kp 361 kp
## 14 lahore 301 lahore
## 15 lockdown 411 lockdown
## 16 minister 392 minister
## 17 pakistan 1417 pakistan
## 18 pm 735 pm
## 19 police 437 police
## 20 punjab 410 punjab
## 21 sc 317 sc
## 22 senate 337 senate
## 23 sindh 547 sindh
## 24 test 372 test
## 25 time 370 time
## 26 trump 409 trump
## 27 virus 756 virus
## 28 world 408 world
# now this plot
# with new ordered column x = word2
# is arranged by word count
# and is far better to read:
ggplot(word_counts, aes(x=word2, y=n)) +
geom_col() +
coord_flip() +
ggtitle("Review Word Counts")
Sentiment Analysis
Now that we are done with data cleaning, transformation, and visualization, let's turn to sentiment analysis. Sentiment analysis is an automated way of understanding the opinion expressed about a given subject. It is quite useful for market analysis, customer feedback, product reviews, etc. It can be done at different levels of scope: document level, sentence level, and sub-sentence level.
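As a quick illustration of sentence-level scope, the syuzhet package loaded at the start can score whole sentences directly. A sketch on invented sentences:
# one numeric score per sentence; positive values indicate positive tone
get_sentiment(c("The verdict was welcomed across the country.",
                "The attack killed several people."),
              method = "syuzhet")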
Sentiment analysis can be modeled as a classification problem in which two sub-problems are tackled: subjectivity classification (classifying a sentence as subjective or objective) and polarity classification (classifying a sentence as positive, neutral, or negative). We will start with the Loughran sentiment lexicon. It is quite common and used mainly in financial contexts, but we can apply it to our data as well. It labels words with six possible sentiments: "negative", "positive", "litigious", "uncertainty", "constraining", or "superfluous".
# Sentiment Analysis using loughran ----
# using inner_join()
data_st%>%
inner_join(get_sentiments("loughran")) %>% head()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 393 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## Story.Heading
## 1 Federation of Pakistan v Gen Pervez Musharraf: Treason now accountable
## 2 Federation of Pakistan v Gen Pervez Musharraf: Treason now accountable
## 3 Federation of Pakistan v Gen Pervez Musharraf: Treason now accountable
## 4 Chinese national held for beating traffic police constable in Karachi
## 5 Chinese national held for beating traffic police constable in Karachi
## 6 Chinese national held for beating traffic police constable in Karachi
## Timestamp Section word sentiment
## 1 01 Jan, 2020 09:55pm verdict negative
## 2 01 Jan, 2020 09:55pm verdict litigious
## 3 01 Jan, 2020 09:55pm dismissed negative
## 4 01 Jan, 2020 08:35pm Pakistan suspect negative
## 5 01 Jan, 2020 08:35pm Pakistan assaulted negative
## 6 01 Jan, 2020 08:35pm Pakistan prevented constraining
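The warning above appears because some words (e.g. verdict, which is both negative and litigious) match more than one row of the lexicon. Since that is expected here, it can be silenced by declaring the relationship explicitly (a sketch; requires dplyr >= 1.1.0):
# declare the expected many-to-many match to silence the warning
data_st %>%
  inner_join(get_sentiments("loughran"), relationship = "many-to-many") %>%
  head()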
# counting sentiment
sentiment_review <- data_st%>%
inner_join(get_sentiments("loughran"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 393 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
sentiment_review %>%
count(sentiment)
## sentiment n
## 1 constraining 508
## 2 litigious 2913
## 3 negative 10750
## 4 positive 1635
## 5 superfluous 10
## 6 uncertainty 529
We can see in the above output that the number of words has drastically decreased, because inner_join() retained only those words that appear in the Loughran lexicon. Now let's count which words occur most often for a given sentiment.
sentiment_review %>%
count(word, sentiment) %>%
arrange(desc(n)) %>% head()
## word sentiment n
## 1 court litigious 406
## 2 opposition negative 285
## 3 positive positive 218
## 4 protest negative 200
## 5 law litigious 185
## 6 warns negative 179
We can also look at the Loughran lexicon itself and plot how many words are assigned to each sentiment category.
sentiment_counts <- get_sentiments("loughran") %>%
count(sentiment) %>%
mutate(sentiment2 = fct_reorder(sentiment, n))
ggplot(sentiment_counts, aes(x=sentiment2, y=n)) +
geom_col() +
coord_flip() +
labs(
title = "Sentiment Counts in Loughran",
x = "Counts",
y = "Sentiment"
)
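Note that the chunk above counts words in the Loughran lexicon itself. To plot the sentiment distribution found in our news data instead, count sentiment_review rather than the lexicon (a sketch):
# same plot, but counting sentiment matches in the data
data_sentiment_counts <- sentiment_review %>%
  count(sentiment) %>%
  mutate(sentiment2 = fct_reorder(sentiment, n))
ggplot(data_sentiment_counts, aes(x = sentiment2, y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Sentiment Counts in the News Data (Loughran)",
    x = "Sentiment",
    y = "Counts"
  )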
In both the lexicon and our data, negative is by far the most frequent sentiment. However, instead of keeping all six sentiments, it is simpler to restrict the analysis to positive and negative and see which words are considered positive and which negative.
sentiment_review2 <- sentiment_review %>%
filter(sentiment %in% c("positive", "negative"))
word_counts <- sentiment_review2 %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
# visualization
ggplot(word_counts, aes(x=word2, y=n, fill=sentiment)) +
geom_col(show.legend=FALSE) +
facet_wrap(~sentiment, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts loughran",
x = "Words"
)
Now let's try the NRC method. NRC classifies words in a binary manner (yes/no) into eight emotion categories (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as negative and positive sentiment. We will perform the same steps as above, but using the NRC lexicon.
# Sentiment Analysis using NRC
# counting sentiment
sentiment_review_nrc <- data_st%>%
inner_join(get_sentiments("nrc"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 9468 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
sentiment_review_nrc %>%
count(sentiment)
## sentiment n
## 1 anger 7554
## 2 anticipation 7499
## 3 disgust 3016
## 4 fear 10239
## 5 joy 4163
## 6 negative 15602
## 7 positive 14397
## 8 sadness 6544
## 9 surprise 3959
## 10 trust 10465
# let's count which words occur most often
# for a given sentiment
sentiment_review_nrc %>%
count(word, sentiment) %>%
arrange(desc(n)) %>% head()
## word sentiment n
## 1 virus negative 756
## 2 police fear 437
## 3 police positive 437
## 4 police trust 437
## 5 trump surprise 409
## 6 court anger 406
# Pull in the nrc dictionary, count the sentiments and reorder them by count
sentiment_counts <- get_sentiments("nrc") %>%
count(sentiment) %>%
mutate(sentiment2 = fct_reorder(sentiment, n))
# Visualize sentiment_counts
# using the new sentiment2 factor column
ggplot(sentiment_counts, aes(x=sentiment2, y=n)) +
geom_col() +
coord_flip() +
# Change the title to "Sentiment Counts in NRC", x-axis to "Sentiment", and y-axis to "Counts"
labs(
title = "Sentiment Counts in NRC",
x = "Sentiment",
y = "Counts"
)
# filter sentiment review
# and keep only sentiment words
# that are positive or negative
sentiment_review_nrc2 <- sentiment_review_nrc %>%
filter(sentiment %in% c("positive", "negative"))
word_counts_nrc2 <- sentiment_review_nrc2 %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
# visualize the sentiment
ggplot(word_counts_nrc2, aes(x=word2, y=n, fill=sentiment)) +
geom_col(show.legend=FALSE) +
facet_wrap(~sentiment, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts NRC",
x = "Words"
)
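As an aside, the syuzhet package also exposes the NRC lexicon directly: get_nrc_sentiment() returns one column per category. A quick sketch on an invented sentence:
# returns a data frame with one column per NRC emotion plus negative/positive
get_nrc_sentiment("The court dismissed the appeal after violent protests.")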
We see that the NRC method gives different results compared to the Loughran method, because the two lexicons contain different sentiment categories and words. Now let's try the Bing method. To be clear, the Bing lexicon categorises words into positive and negative sentiment only. We repeat the same steps as above.
# Sentiment Analysis using bing ----
# counting sentiment
sentiment_review_bing <- data_st %>%
inner_join(get_sentiments("bing"))
## Joining with `by = join_by(word)`
sentiment_review_bing %>%
count(sentiment)
## sentiment n
## 1 negative 13403
## 2 positive 6182
# let's count which words occur most often
# for a given sentiment
sentiment_review_bing %>%
count(word, sentiment) %>%
arrange(desc(n)) %>% head()
## word sentiment n
## 1 virus negative 756
## 2 trump positive 409
## 3 opposition negative 285
## 4 death negative 268
## 5 killed negative 258
## 6 positive positive 218
# filter sentiment review
# and keep only sentiment words
# that are positive or negative
sentiment_review_bing2 <- sentiment_review_bing %>%
filter(sentiment %in% c("positive", "negative"))
word_counts_bing2 <- sentiment_review_bing2 %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
# visualize the sentiment
ggplot(word_counts_bing2, aes(x=word2, y=n, fill=sentiment)) +
geom_col(show.legend=FALSE) +
facet_wrap(~sentiment, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts Bing",
x = "Words"
)
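Because Bing is purely binary, it also lends itself to a simple net score per story (positive minus negative word counts). A sketch, assuming Story.Heading identifies a story:
# net sentiment per story: positive minus negative word counts
net_by_story <- sentiment_review_bing %>%
  count(Story.Heading, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative) %>%
  arrange(net)
head(net_by_story)  # most negative stories first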
Lastly, we can try the AFINN method. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
# Sentiment Analysis using afinn ----
# counting sentiment
sentiment_review_afinn <- data_st %>%
inner_join(get_sentiments("afinn"))
## Joining with `by = join_by(word)`
sentiment_review_afinn %>%
count(value)
## value n
## 1 -4 199
## 2 -3 2381
## 3 -2 5928
## 4 -1 3614
## 5 1 2213
## 6 2 3031
## 7 3 518
## 8 4 354
## 9 5 4
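Because AFINN scores are numeric, we can also summarise the overall tone of the corpus directly. A sketch:
# average and total AFINN score across all matched words
sentiment_review_afinn %>%
  summarise(mean_score = mean(value),
            total_score = sum(value))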
# let's count which words occur most often
# for a given sentiment
sentiment_review_afinn %>%
count(word, value) %>%
arrange(desc(n)) %>% head()
## word value n
## 1 death -2 268
## 2 killed -3 258
## 3 positive 2 218
## 4 attack -1 207
## 5 protest -2 200
## 6 top 2 185
# filter sentiment review
# and keep only sentiment words
# that are strongly positive or negative
sentiment_review_afinn2 <- sentiment_review_afinn %>%
  filter(value %in% c(5, -5))
word_counts_afinn2 <- sentiment_review_afinn2 %>%
count(word, value) %>%
group_by(value) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
# visualize the sentiment
ggplot(word_counts_afinn2, aes(x=word2, y=n, fill=value)) +
geom_col(show.legend=FALSE) +
facet_wrap(~value, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts Afinn",
x = "Words"
)
We can also filter for the words that have been assigned scores of 4, -4, 5, and -5.
# Let's try with 4 -4, 5, -5
sentiment_review_afinn4 <- sentiment_review_afinn %>%
  filter(value %in% c(4, -4, 5, -5))
word_counts_afinn4 <- sentiment_review_afinn4 %>%
count(word, value) %>%
group_by(value) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
# visualize the sentiment
ggplot(word_counts_afinn4, aes(x=word2, y=n, fill=value)) +
geom_col(show.legend=FALSE) +
facet_wrap(~value, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts Afinn",
x = "Words"
)
The above bar plots are easy to understand: they show which words have been assigned positive or negative sentiment based on their scores. Additionally, apart from using individual words to perform sentiment analysis, we can also derive sentiment from ratings.
Conclusions
The objective of the above analysis was to perform sentiment analysis on news excerpts. After cleaning the data, we applied the Loughran, NRC, AFINN, and Bing methods to identify positive and negative sentiments, and we compared and evaluated the results across methods. The results differ from method to method, as each lexicon contains different sentiment words. Nevertheless, one can conclude from the results that all methods were able to correctly classify words into their respective sentiments.