Let's first load the required libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(syuzhet)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(tidytext)
library(forcats)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ lubridate 1.9.3 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tibble 3.2.1
## ✔ readr 2.1.5 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::annotate() masks NLP::annotate()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(textdata)
## Warning: package 'textdata' was built under R version 4.4.2
Introduction
In today's day and age, enormous volumes of textual content are generated everywhere: in-app messages on services like WhatsApp and Telegram, posts on social media sites like Facebook and Instagram, news publishing sites, Google searches, and many other sources. All of these sources generate huge volumes of text data every second, and because of these volumes NLP becomes a vital resource for understanding textual content. In this paper, the main focus is on the popular NLP task of sentiment analysis. Sentiment analysis is contextual mining of text which identifies and extracts subjective information from textual data. It proves to be an incredible asset for users to extract essential information.
Purpose
The aim of sentiment analysis is the computational study of people's opinions, sentiments, emotions, and attitudes towards entities such as products, services, issues, events, topics, and their attributes (Liu 2015). As such, sentiment analysis makes it possible to track the public's mood about a particular entity and turn it into actionable knowledge. This type of knowledge can also be used to understand, explain, and predict social phenomena (Pozzi et al. 2017). In the business domain, sentiment analysis plays a vital role in enabling businesses to improve strategy and gain insight into customers' feedback about their products. In today's customer-oriented business culture, understanding the customer is increasingly important.
Assumptions
To ensure that the task is meaningful in practice, existing research makes the following implicit assumptions.
Sentiment analysis assumes that the opinion document d (e.g., a product review) expresses opinions on a single entity e and contains opinions from a single opinion holder h. In practice, if an opinion document evaluates more than one entity, then the sentiments on the entities can be different. For example, the opinion holder may be positive about some entities and negative about others. Thus, it does not make practical sense to assign one sentiment orientation to the entire document in this case.
It also does not make much sense if multiple opinion holders express opinions in a single document, because their opinions can differ too. This assumption holds for reviews of products and services, since each review usually evaluates a single product or service and is written by a single reviewer. However, it may not hold for forum and blog posts, because there the author may express opinions on multiple entities and compare them using comparative sentences.
Description of Data Set
Let's read our CSV file with the help of the read.csv() function. Additionally, let's look at the data using the glimpse() function.
data <- read.csv("news.csv")
glimpse(data)
## Rows: 26,706
## Columns: 4
## $ Story.Heading <chr> "Federation of Pakistan v Gen Pervez Musharraf: Treason …
## $ Story.Excerpt <chr> "After the special court's verdict in the high treason c…
## $ Timestamp <chr> "01 Jan, 2020 09:55pm", "01 Jan, 2020 08:35pm", "01 Jan,…
## $ Section <chr> "", "Pakistan", "World", "Pakistan", "Sport", "Pakistan"…
From the above output we can see that we have 4 columns and 26,706 observations. The columns are self-explanatory. We don't require all of them for our analysis, so it's better to either remove the rest or select only the desired column, which in our case is Story.Excerpt: it describes the content of each news story in more detail.
Data Preparation, Cleaning and transformations
Prior to any analysis, the first step is to clean and transform the data according to the model's requirements. Text data usually contains special characters, extra white space, and stop words; it's recommended to get rid of these, as they don't add any value to the analysis.
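Although the pipeline below relies on tidytext, the tm package loaded at the start illustrates these cleaning steps on a single string. A minimal sketch, using an invented sentence:
# illustration only: cleaning one made-up string with tm helpers
txt <- "  The PM's Covid-19 briefing,   held today!  "
txt <- tolower(txt)                       # lowercase
txt <- removePunctuation(txt)             # drop punctuation
txt <- removeWords(txt, stopwords("en"))  # drop English stop words
stripWhitespace(txt)                      # collapse repeated spaces
# expected: roughly " pms covid19 briefing held today "
In our pipeline we instead let tidytext handle this during tokenization.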
# using unnest_tokens()
data_wr <- data %>%
unnest_tokens(word, Story.Excerpt) # What to create (word) from where (Story.Excerpt)
head(data_wr$word)
## [1] "after" "the" "special" "court's" "verdict" "in"
As these outputs show, unnest_tokens() has already done some cleaning: it removed punctuation (keeping intra-word apostrophes, as in court's) and white space, and transformed everything to lowercase. Each row now contains exactly one word. Since this is a really large data set, let's look at the total number of rows using the dim() function.
dim(data_wr)
## [1] 262497 4
In total we have 262,497 rows. Let's count the words and arrange them in descending order to see which ones occur most frequently.
# counting words
data_wr%>%
count(word) %>%
arrange(desc(n)) %>%
head()
## word n
## 1 to 8783
## 2 in 7723
## 3 of 5565
## 4 the 4596
## 5 for 4091
## 6 on 2610
We still see that common words such as to, in, of, and the occur most frequently. To analyze distinctive word use, we want to remove these words. That can be done with an anti_join() against tidytext's list of stop_words.
# using unnest_tokens() with stopwords
data_st<- data %>%
unnest_tokens(word,Story.Excerpt) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
Let's count the words again to see whether we managed to resolve the issue.
# counting words again
data_st %>%
count(word) %>%
arrange(desc(n)) %>%
head()
## word n
## 1 covid 1421
## 2 pakistan 1417
## 3 govt 1223
## 4 19 1172
## 5 coronavirus 857
## 6 virus 756
We can see from the above output that the most frequent words are now covid, pakistan, govt, etc., which reflect the actual, meaningful content of the news.
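One artifact does remain: the token 19, split off from "Covid-19", is frequent but survives the standard stop-word list. If we wanted to drop such corpus-specific tokens too, one option is to extend stop_words with a custom list (a sketch; the choice of extra tokens is ours):
# extend the stop-word list with corpus-specific tokens (here only "19")
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = "19", lexicon = "custom")
)
data_custom <- data %>%
  unnest_tokens(word, Story.Excerpt) %>%
  anti_join(custom_stop_words, by = "word")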
Visualization
Instead of looking at the data frame, it is more appealing and understandable to visualize our cleaned data. Let's count the words, keep only those occurring more than 300 times, and arrange them in descending order.
word_counts <- data_st %>%
count(word) %>%
filter(n>300) %>%
arrange(desc(n))
# using coord_flip()
# when data are hard to read
# on the x axis
ggplot(word_counts, aes(x=word, y=n)) +
geom_col() +
coord_flip() +
ggtitle("Review Word Counts")
The plot above shows each word against its count, but the bars are not ordered. Let's reorder the words by count and visualize them from largest to smallest.
# reorder what (word) by what (n)
word_counts <- data_st %>%
count(word) %>%
filter(n>300) %>%
mutate(word2 = fct_reorder(word, n))
word_counts
## word n word2
## 1 19 1172 19
## 2 chief 320 chief
## 3 china 434 china
## 4 coronavirus 857 coronavirus
## 5 court 406 court
## 6 covid 1421 covid
## 7 day 324 day
## 8 govt 1223 govt
## 9 imran 397 imran
## 10 india 617 india
## 11 indian 352 indian
## 12 karachi 530 karachi
## 13 kp 361 kp
## 14 lahore 301 lahore
## 15 lockdown 411 lockdown
## 16 minister 392 minister
## 17 pakistan 1417 pakistan
## 18 pm 735 pm
## 19 police 437 police
## 20 punjab 410 punjab
## 21 sc 317 sc
## 22 senate 337 senate
## 23 sindh 547 sindh
## 24 test 372 test
## 25 time 370 time
## 26 trump 409 trump
## 27 virus 756 virus
## 28 world 408 world
# now this plot
# with new ordered column x = word2
# is arranged by word count
# and is far better to read:
ggplot(word_counts, aes(x=word2, y=n)) +
geom_col() +
coord_flip() +
ggtitle("Review Word Counts")
Sentiment Analysis
Now that we are done with data cleaning, transformation, and visualization, let's turn to sentiment analysis. Sentiment analysis is an automated way of understanding the opinion expressed about a given subject. It is quite useful for market analysis, customer feedback, product reviews, etc. It can be done at different levels of scope: document level, sentence level, and sub-sentence level.
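As a quick illustration of sentence-level scope, the syuzhet package loaded at the start can score whole sentences directly. A sketch on invented sentences:
# one numeric score per sentence; positive values indicate positive tone
get_sentiment(c("The verdict was welcomed across the country.",
                "The attack killed several people."),
              method = "syuzhet")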
Sentiment analysis can be modeled as a classification problem in which two sub-problems are tackled: subjectivity classification (classifying a sentence as subjective or objective) and polarity classification (classifying a sentence as positive, neutral, or negative). We will start with the Loughran sentiment lexicon. It is quite common and used mainly in financial contexts, but we can apply it to our data as well. It labels words with six possible sentiments: "negative", "positive", "litigious", "uncertainty", "constraining", or "superfluous".
# Sentiment Analysis using loughran ----
# using inner_join()
data_st%>%
inner_join(get_sentiments("loughran")) %>% head()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 393 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## Story.Heading
## 1 Federation of Pakistan v Gen Pervez Musharraf: Treason now accountable
## 2 Federation of Pakistan v Gen Pervez Musharraf: Treason now accountable
## 3 Federation of Pakistan v Gen Pervez Musharraf: Treason now accountable
## 4 Chinese national held for beating traffic police constable in Karachi
## 5 Chinese national held for beating traffic police constable in Karachi
## 6 Chinese national held for beating traffic police constable in Karachi
## Timestamp Section word sentiment
## 1 01 Jan, 2020 09:55pm verdict negative
## 2 01 Jan, 2020 09:55pm verdict litigious
## 3 01 Jan, 2020 09:55pm dismissed negative
## 4 01 Jan, 2020 08:35pm Pakistan suspect negative
## 5 01 Jan, 2020 08:35pm Pakistan assaulted negative
## 6 01 Jan, 2020 08:35pm Pakistan prevented constraining
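The warning above appears because some words (e.g. verdict, which is both negative and litigious) match more than one row of the lexicon. Since that is expected here, it can be silenced by declaring the relationship explicitly (a sketch; requires dplyr >= 1.1.0):
# declare the expected many-to-many match to silence the warning
data_st %>%
  inner_join(get_sentiments("loughran"), relationship = "many-to-many") %>%
  head()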
# counting sentiment
sentiment_review <- data_st%>%
inner_join(get_sentiments("loughran"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 393 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
sentiment_review %>%
count(sentiment)
## sentiment n
## 1 constraining 508
## 2 litigious 2913
## 3 negative 10750
## 4 positive 1635
## 5 superfluous 10
## 6 uncertainty 529
We can see in the above output that the number of words has drastically decreased, because inner_join() retained only those words that appear in the Loughran lexicon. Now let's count which words occur most often for a given sentiment.
sentiment_review %>%
count(word, sentiment) %>%
arrange(desc(n)) %>% head()
## word sentiment n
## 1 court litigious 406
## 2 opposition negative 285
## 3 positive positive 218
## 4 protest negative 200
## 5 law litigious 185
## 6 warns negative 179
We can also look at the Loughran lexicon itself and plot how many words are assigned to each sentiment category.
sentiment_counts <- get_sentiments("loughran") %>%
count(sentiment) %>%
mutate(sentiment2 = fct_reorder(sentiment, n))
ggplot(sentiment_counts, aes(x=sentiment2, y=n)) +
geom_col() +
coord_flip() +
labs(
title = "Sentiment Counts in Loughran",
x = "Counts",
y = "Sentiment"
)
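Note that the chunk above counts words in the Loughran lexicon itself. To plot the sentiment distribution found in our news data instead, count sentiment_review rather than the lexicon (a sketch):
# same plot, but counting sentiment matches in the data
data_sentiment_counts <- sentiment_review %>%
  count(sentiment) %>%
  mutate(sentiment2 = fct_reorder(sentiment, n))
ggplot(data_sentiment_counts, aes(x = sentiment2, y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Sentiment Counts in the News Data (Loughran)",
    x = "Sentiment",
    y = "Counts"
  )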
In both the lexicon and our data, negative is by far the most frequent sentiment. However, instead of keeping all six sentiments, it is simpler to restrict the analysis to positive and negative and see which words are considered positive and which negative.
sentiment_review2 <- sentiment_review %>%
filter(sentiment %in% c("positive", "negative"))
word_counts <- sentiment_review2 %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
# visualization
ggplot(word_counts, aes(x=word2, y=n, fill=sentiment)) +
geom_col(show.legend=FALSE) +
facet_wrap(~sentiment, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts loughran",
x = "Words"
)
Now let's try the NRC method. NRC classifies words in a binary manner (yes/no) into eight emotion categories (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as negative and positive sentiment. We will perform the same steps as above, but using the NRC lexicon.
# Sentiment Analysis using NRC
# counting sentiment
sentiment_review_nrc <- data_st%>%
inner_join(get_sentiments("nrc"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 9468 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
sentiment_review_nrc %>%
count(sentiment)
## sentiment n
## 1 anger 7554
## 2 anticipation 7499
## 3 disgust 3016
## 4 fear 10239
## 5 joy 4163
## 6 negative 15602
## 7 positive 14397
## 8 sadness 6544
## 9 surprise 3959
## 10 trust 10465
# let's count which words occur most often
# for a given sentiment
sentiment_review_nrc %>%
count(word, sentiment) %>%
arrange(desc(n)) %>% head()
## word sentiment n
## 1 virus negative 756
## 2 police fear 437
## 3 police positive 437
## 4 police trust 437
## 5 trump surprise 409
## 6 court anger 406
# Pull in the nrc dictionary, count the sentiments and reorder them by count
sentiment_counts <- get_sentiments("nrc") %>%
count(sentiment) %>%
mutate(sentiment2 = fct_reorder(sentiment, n))
# Visualize sentiment_counts
# using the new sentiment2 factor column
ggplot(sentiment_counts, aes(x=sentiment2, y=n)) +
geom_col() +
coord_flip() +
# Change the title to "Sentiment Counts in NRC", x-axis to "Sentiment", and y-axis to "Counts"
labs(
title = "Sentiment Counts in NRC",
x = "Sentiment",
y = "Counts"
)
# filter sentiment review
# and keep only sentiment words
# that are positive or negative
sentiment_review_nrc2 <- sentiment_review_nrc %>%
filter(sentiment %in% c("positive", "negative"))
word_counts_nrc2 <- sentiment_review_nrc2 %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
# visualize the sentiment
ggplot(word_counts_nrc2, aes(x=word2, y=n, fill=sentiment)) +
geom_col(show.legend=FALSE) +
facet_wrap(~sentiment, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts NRC",
x = "Words"
)
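As an aside, the syuzhet package also exposes the NRC lexicon directly: get_nrc_sentiment() returns one column per category. A quick sketch on an invented sentence:
# returns a data frame with one column per NRC emotion plus negative/positive
get_nrc_sentiment("The court dismissed the appeal after violent protests.")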
We see that the NRC method gives different results compared to the Loughran method, because the two lexicons contain different sentiment categories and words. Now let's try the Bing method. To be clear, the Bing lexicon categorises words into positive and negative sentiment only. We repeat the same steps as above.
# Sentiment Analysis using bing ----
# counting sentiment
sentiment_review_bing <- data_st %>%
inner_join(get_sentiments("bing"))
## Joining with `by = join_by(word)`
sentiment_review_bing %>%
count(sentiment)
## sentiment n
## 1 negative 13403
## 2 positive 6182
# let's count which words occur most often
# for a given sentiment
sentiment_review_bing %>%
count(word, sentiment) %>%
arrange(desc(n)) %>% head()
## word sentiment n
## 1 virus negative 756
## 2 trump positive 409
## 3 opposition negative 285
## 4 death negative 268
## 5 killed negative 258
## 6 positive positive 218
# filter sentiment review
# and keep only sentiment words
# that are positive or negative
sentiment_review_bing2 <- sentiment_review_bing %>%
filter(sentiment %in% c("positive", "negative"))
word_counts_bing2 <- sentiment_review_bing2 %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
# visualize the sentiment
ggplot(word_counts_bing2, aes(x=word2, y=n, fill=sentiment)) +
geom_col(show.legend=FALSE) +
facet_wrap(~sentiment, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts Bing",
x = "Words"
)
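Because Bing is purely binary, it also lends itself to a simple net score per story (positive minus negative word counts). A sketch, assuming Story.Heading identifies a story:
# net sentiment per story: positive minus negative word counts
net_by_story <- sentiment_review_bing %>%
  count(Story.Heading, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative) %>%
  arrange(net)
head(net_by_story)  # most negative stories first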
Lastly, we can try the AFINN method. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
# Sentiment Analysis using afinn ----
# counting sentiment
sentiment_review_afinn <- data_st %>%
inner_join(get_sentiments("afinn"))
## Joining with `by = join_by(word)`
sentiment_review_afinn %>%
count(value)
## value n
## 1 -4 199
## 2 -3 2381
## 3 -2 5928
## 4 -1 3614
## 5 1 2213
## 6 2 3031
## 7 3 518
## 8 4 354
## 9 5 4
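Because AFINN scores are numeric, we can also summarise the overall tone of the corpus directly. A sketch:
# average and total AFINN score across all matched words
sentiment_review_afinn %>%
  summarise(mean_score = mean(value),
            total_score = sum(value))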
# let's count which words occur most often
# for a given sentiment
sentiment_review_afinn %>%
count(word, value) %>%
arrange(desc(n)) %>% head()
## word value n
## 1 death -2 268
## 2 killed -3 258
## 3 positive 2 218
## 4 attack -1 207
## 5 protest -2 200
## 6 top 2 185
# filter sentiment review
# and keep only sentiment words
# that are strongly positive or negative
sentiment_review_afinn2 <- sentiment_review_afinn %>%
  filter(value %in% c(5, -5))
word_counts_afinn2 <- sentiment_review_afinn2 %>%
count(word, value) %>%
group_by(value) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
# visualize the sentiment
ggplot(word_counts_afinn2, aes(x=word2, y=n, fill=value)) +
geom_col(show.legend=FALSE) +
facet_wrap(~value, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts Afinn",
x = "Words"
)
We can also filter for the words that have been assigned scores of 4, -4, 5, and -5.
# Let's try with 4 -4, 5, -5
sentiment_review_afinn4 <- sentiment_review_afinn %>%
  filter(value %in% c(4, -4, 5, -5))
word_counts_afinn4 <- sentiment_review_afinn4 %>%
count(word, value) %>%
group_by(value) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
# visualize the sentiment
ggplot(word_counts_afinn4, aes(x=word2, y=n, fill=value)) +
geom_col(show.legend=FALSE) +
facet_wrap(~value, scales="free") +
coord_flip() +
labs(
title = "Sentiment Word Counts Afinn",
x = "Words"
)
The above bar plots are easy to understand: they show which words have been assigned positive or negative sentiment based on their scores. Additionally, apart from using individual words to perform sentiment analysis, we can also derive sentiment from ratings.
Conclusions
The objective of the above analysis was to perform sentiment analysis on news excerpts. After cleaning the data, we applied the Loughran, NRC, AFINN, and Bing methods to identify positive and negative sentiments, and we compared and evaluated the results across methods. The results differ from method to method, as each lexicon contains different sentiment words. Nevertheless, one can conclude from the results that all methods were able to correctly classify words into their respective sentiments.