Introduction

The sentometrics package introduces simple functions to quickly compute the sentiment of texts within a corpus. This ease of use does not prevent more advanced analysis, and the sentometrics functions remain a solid choice for cutting-edge research. In this tutorial, we will review how to replicate this kind of research.

Does the position of positive and negative words within a text matter? That is the question investigated by Boudt & Thewissen (2019) in their research on the sentiment conveyed by CEO letters. Based on a large dataset of letters, they analyzed how sentiment-bearing words are positioned within the text. They found that CEOs emphasize sentiment at the beginning and the end of the letter, in the hope of leaving a positive impression on the reader.

Their results confirm generally accepted linguistic theories stating that readers remember the first and last portions of a text best, and that the end of a text contributes the most to the reader’s final impression.

One can wonder whether other types of texts follow a similar structure. The world is full of different text media, from Twitter posts to news articles, and most of them are written less carefully than CEO letters. Let’s investigate one of these together with the help of the sentometrics package!

As part of this guide, you will learn how to split the documents of a corpus into bins and compute a sentiment value per bin, build a customised lexicon from several sources, weight intratextual sentiment values against a human-labelled response, and aggregate customised sentiment measures into time series.

Preparing data

Let’s load the packages required for what follows in this guide:

library("rio")             ### Package for extracting data from Github
library("sentometrics")    ### Package containing sentiment computation tools
library("quanteda")        ### Package useful for text and document manipulation
library("lexicon")         ### Package of multiple lexicons
library("data.table")      ### Package bringing the data.table objects

In this tutorial, we will use a slight variation of the built-in usnews object from the sentometrics package. We would like to compare our computed sentiment measure against a benchmark, but the built-in usnews does not include one. Fortunately, the GitHub repository of sentometrics makes available the raw data used to build usnews, in which sentiment assessments are present. We are going to retrieve the raw data from GitHub and transform it as needed.

usnews2 <- import("https://raw.githubusercontent.com/sborms/sentometrics/master/data-raw/US_economic_news_1951-2014.csv")
usnews2$texts <- stringi::stri_replace_all(usnews2$text, replacement = " ", regex = "</br></br>")
usnews2$texts <- stringi::stri_replace_all(usnews2$texts, replacement = "", regex = '[\\"]')
usnews2$texts <- stringi::stri_replace_all(usnews2$texts, replacement = "", regex = "[^-a-zA-Z0-9,&.' ]")
usnews2$text <- NULL

usnews2$id <- usnews2$`_unit_id`

### Rebuild proper dates: the raw file stores dates as m/d/yy, so expand the two-digit years
months <- lapply(stringi::stri_split(usnews2$date, regex = "/"), "[", 1)
days <- lapply(stringi::stri_split(usnews2$date, regex = "/"), "[", 2)
years <- lapply(stringi::stri_split(usnews2$date, regex = "/"), "[", 3)
yearsLong <- lapply(years, function(x) if (as.numeric(x) > 14) paste0("19", x) else paste0("20", x))
datesNew <- paste0(unlist(months), "/", unlist(days), "/", unlist(yearsLong))
datesNew <- as.character(as.Date(datesNew, format = "%m/%d/%Y"))
usnews2$date <- datesNew


usnews2 <- subset(usnews2, date >= "1971-01-01") ### Exclude documents affected by a date bug (dates falling in 1970)
usnews2 <- subset(usnews2, !is.na(positivity))
usnews2 <- subset(usnews2, positivity != 5 & positivity != 6) ### Remove neutral responses

usnews2$s <- ifelse(usnews2$positivity > 5, 1, -1)

### Delete obsolete columns
usnews2$`_last_judgment_at` <- usnews2$`_trusted_judgments` <-
usnews2$`positivity:confidence` <- usnews2$`relevance:confidence` <- usnews2$relevance_gold <-
usnews2$articleid <- usnews2$`_unit_state` <- usnews2$`_golden` <- usnews2$positivity_gold <-
usnews2$relevance <- usnews2$positivity <- usnews2$headline <- usnews2$`_unit_id`<-  NULL

usnews2 <- usnews2[order(usnews2$id),]
usnews2 <- as.data.table(usnews2)
table(usnews2$s)
## 
##  -1   1 
## 605 344

We just created a smaller corpus of US news with a response variable s that indicates whether the news is more positive or negative. This will be used to check if our sentiment computation provides accurate results.

Computing sentiments with bins

The compute_sentiment() function from the sentometrics package allows sentiment computation with a large number of settings. In the simplest case, the intratextual structure of sentiment does not matter and sentiment is computed simply based on word frequencies.
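To get a feel for this simplest behaviour, below is a minimal sketch (an illustration only, not needed for the rest of the tutorial) that computes a document-level sentiment for a few raw texts of the built-in usnews dataset, using the Loughran-McDonald lexicon shipped with sentometrics:

### Minimal sketch (illustration only): default document-level sentiment for a few raw texts,
### computed from word frequencies with the built-in Loughran-McDonald lexicon ("LM_en")
simpleLexicon <- sento_lexicons(list_lexicons["LM_en"])
head(compute_sentiment(usnews$texts[1:5], simpleLexicon, how = "proportional"))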

In the current analysis, we want to investigate how sentiment is positioned within the text. Is it concentrated at the beginning, in the middle or at the end of the text? To investigate this, we create containers (we will call them bins) into which we split the original text. Sentiment can then be computed for each bin with the help of compute_sentiment(), giving us some insight into the intratextual sentiment structure.

To do so, we are going to create a list object that will contain one list of character vectors for each document (each character vector will represent one bin). That list will then be used in compute_sentiment() to compute sentiments for our bins.

usnews2Sento <- sento_corpus(usnews2)   ### Note that the feature 's' is automatically rescaled from {-1;1} to {0;1}

### Adding some features
usnews2Sento <- add_features(usnews2Sento, data.frame(s0  = 1-usnews2Sento$s,
                                                      dummyFeature = rep(1,length(usnews2Sento))))

### Cleaning
usnews2Toks <- tokens(usnews2Sento, remove_punct = TRUE)
usnews2Toks <- tokens_tolower(usnews2Toks)

nBins <- 10  ### Number of bins into which each document will be split
usnews2Bins <- list()

for (i in seq_along(usnews2Toks)) {
  usnews2Bins[[i]] <- lapply(parallel::splitIndices(length(usnews2Toks[[i]]), nBins),
                             function(x) usnews2Toks[[i]][x])
}
names(usnews2Bins) <- names(usnews2Toks)
head(usnews2Bins[[1]],2)
## [[1]]
##  [1] "a"            "mild"         "stock"        "rally"        "fizzled"     
##  [6] "late"         "today"        "overwhelmed"  "by"           "computerized"
## [11] "selling"      "strategies"   "and"          "anxiety"      "that"        
## [16] "has"          "afflicted"    "investors"    "since"        "the"         
## 
## [[2]]
##  [1] "historic"  "market"    "collapse"  "exactly"   "six"       "months"   
##  [7] "ago"       "the"       "dow"       "jones"     "average"   "of"       
## [13] "30"        "blue-chip" "stocks"    "up"        "more"      "than"     
## [19] "32"

The usnews2Bins object will allow us to compute sentiment with our own defined measure. But before that, we need to create a sento_lexicons object to be used in the computation. This time, instead of using the built-in lexicons found in the package, we will construct a unified lexicon from various sources. Using a single lexicon for the sentiment computation will make the analysis clearer.

Our customised lexicon will be created as the union of the built-in sentometrics lexicons and several lexicons from the lexicon package. To measure overall sentiment intensity, we also define an absolute-value lexicon using the powerful [] operations of data.table.

lexicons <- available_data("hash_sentiment_[hn]")   ### look up matching lexicons from the lexicon package
lexiconsNames <- lexicons$Data
lexicons <- lapply(lexiconsNames, get)
names(lexicons) <- lexiconsNames

### Add the English lexicons shipped with sentometrics
lexicons <- c(lexicons, list_lexicons[grep("en", names(list_lexicons))])

### Merge all lexicons, averaging the scores of words that appear in several of them
mergedLexicon <- Reduce(funion, lexicons)
mergedLexicon <- data.table(aggregate(mergedLexicon$y, by = list(word = mergedLexicon$x), FUN = mean))

sentoLexicon <- sento_lexicons(list(mergedLexicon = mergedLexicon,
                                    absoluteLexicon = mergedLexicon[, .(word, x = abs(x))]))

Let’s now observe what happens when computing net sentiment for the bins. Note the use of the tokens and do.sentence arguments in the compute_sentiment() call: they are what triggers a sentiment computation per bin. We will come back to these arguments in more detail later.

binsSentiments <- compute_sentiment(usnews2Sento, sentoLexicon, how = "proportional", tokens = usnews2Bins,
                                    do.sentence = TRUE)

par(mfrow=c(1,2))
plot(binsSentiments[, .(s = mean(`mergedLexicon--dummyFeature`)), by = sentence_id], type = "l",
     ylab = "Average sentiment", xlab = "Bin")

boxplot(binsSentiments$`mergedLexicon--dummyFeature` ~ binsSentiments$sentence_id, ylab = "Sentiment", xlab = "Bin",
        outline = FALSE)

These two graphs compare specific bins across documents (e.g., the first bin of every document). The left graph shows the average sentiment per bin, while the right graph presents the corresponding boxplots, useful to observe the dispersion of the sentiment values.

While the left graph seems to indicate a lower sentiment at the beginning of the news articles, the boxplot on the right hardly shows any difference between the bins. Under these conditions, we cannot claim that sentiment is concentrated at the start of news articles, as we hypothesized earlier.

However, it might be interesting to study another measure of sentiment: the absolute sentiment conveyed by each bin. This is done using the altered lexicon absoluteLexicon.

par(mfrow=c(1,2))
plot(binsSentiments[, .(s = mean(`absoluteLexicon--dummyFeature`)), by = sentence_id], type = "l",
     ylab = "Average absolute sentiment", xlab = "Bin")

boxplot(binsSentiments$`absoluteLexicon--dummyFeature` ~ binsSentiments$sentence_id, ylab = "Absolute sentiment",
        xlab = "Bin", outline = FALSE)

This time, we observe that the absolute sentiment is higher at the beginning of the news articles, a result that also appears in the corresponding boxplot.

Note, however, that this pattern is far less pronounced than the one reported by Boudt & Thewissen (2019) in their study of CEO letters, suggesting that news articles may be less subject to a strong intratextual sentiment structure.

Another way to support this last statement is to compute the average Herfindahl–Hirschman Index across all documents, a popular measure of concentration applied here to each bin’s share of the document’s absolute sentiment. The low resulting value indicates that sentiment is spread across the different bins.

herfindahl <- binsSentiments[, .(s = `absoluteLexicon--dummyFeature`/sum(`absoluteLexicon--dummyFeature`)), by = id]
herfindahl <- herfindahl[, .(s = sum(s^2)), by = id]
mean(herfindahl$s)
## [1] 0.1135131
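
To put this number in perspective, here is a quick sanity check on toy shares (not taken from our data): with ten bins, a perfectly even spread yields an index of 1/10 = 0.1, while full concentration in a single bin yields 1.

### Toy sanity check (hypothetical shares, not from the data): bounds of the index for 10 bins
hhi <- function(x) sum((x / sum(x))^2)
hhi(rep(1, 10))        ### perfectly even spread -> 0.1
hhi(c(1, rep(0, 9)))   ### fully concentrated in one bin -> 1

Our average of about 0.11 is thus very close to the lower bound, confirming that absolute sentiment is spread rather evenly across the bins.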

Optimising weights

Do these inconclusive results mean that we should disregard word order when analyzing news sentiment? Well, not yet! Even if we found that sentiment is well spread within a text, human readers are known to pay more attention to the beginning or the end of a text. In the case of news, one could expect readers to draw their conclusions mostly from the headlines. In terms of bins, this translates into saying that readers give more weight to the first bins of a news article.

Thus, customizing the weighting of intratextual sentiments can be beneficial in supervised applications. Let’s look at a very simple and naive example: our dataset, usnews2, contains a response variable s which reflects the view of a human annotator on each article’s sentiment. The response variable is binary: a value of -1 denotes a negative sentiment, while a value of 1 denotes a positive sentiment. The question is as follows: are we able to predict the outcome of the human assessment with our lexicon-based analysis?

Let’s see what happens when we try to predict the text sentiments from the mean of the bin sentiments. We rank the documents by computed sentiment and label the lowest-ranked ones as negative, choosing the threshold so that the predicted proportion of negative documents matches the observed one:

computedVsHuman <- merge.data.frame(binsSentiments[, .(computed = mean(`mergedLexicon--dummyFeature`)), by = id],
                                    usnews2[, c("id", "s")])

computedVsHuman <- data.table(computedVsHuman)
computedVsHuman <- computedVsHuman[order(computed)]

index <- sum(computedVsHuman$s == -1)
threshold <- computedVsHuman[index, computed]

accuracy <- nrow(computedVsHuman[(computed <= threshold & s == -1) | (computed > threshold & s == 1)])/nrow(computedVsHuman)
accuracy
## [1] 0.6817703

About 68% accuracy is far from impressive, but keep in mind that this is a very naive analysis. Can we do something to improve this result? So far, we considered the average sentiment value per bin, which means we gave equal weight to all of them. We can try to improve our prediction by altering those weights, for example by putting 50% more importance on the first bin’s sentiment:

w <- rep(1/(nBins + 0.5), nBins)  ### Customised weights: equal weights, with the first bin scaled up by 50%
w[1] <- w[1]*1.5

computedVsHuman <- merge.data.frame(binsSentiments[, .(computed = mean(`mergedLexicon--dummyFeature`*w)),
                                                   by = id], usnews2[, c("id", "s")])

computedVsHuman <- data.table(computedVsHuman)
computedVsHuman <- computedVsHuman[order(computed)]

index <- sum(computedVsHuman$s == -1)
threshold <- computedVsHuman[index, computed]

accuracy <- nrow(computedVsHuman[(computed <= threshold & s == -1) | (computed > threshold & s == 1)])/nrow(computedVsHuman)
accuracy
## [1] 0.6838778

A marginal improvement, but the idea is there. In a supervised training setting, the weights can be optimized with respect to the response variable on a sample of the training set. A more elaborate example can be found in the paper of Boudt & Thewissen (2019), where bin weights are optimized to predict firm performance.
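
As a rough illustration of this idea (and not the procedure used by Boudt & Thewissen, 2019), the sketch below runs a simple grid search over the extra weight given to the first bin, evaluated in-sample against the human labels; a real application would tune the weights on a training sample and evaluate them on held-out documents.

### Minimal sketch (hypothetical, in-sample only): grid search over the multiplier applied to
### the first bin's weight, scored against the human labels s
accuracyFor <- function(mult) {
  w <- rep(1/(nBins - 1 + mult), nBins)   ### weights summing to one once the first bin is scaled
  w[1] <- w[1]*mult
  scores <- binsSentiments[, .(computed = sum(`mergedLexicon--dummyFeature`*w)), by = id]
  scores <- merge(scores, usnews2[, c("id", "s")], by = "id")
  threshold <- sort(scores$computed)[sum(scores$s == -1)]
  mean((scores$computed <= threshold & scores$s == -1) |
         (scores$computed > threshold & scores$s == 1))
}

multipliers <- seq(0.5, 3, by = 0.25)
accuracies <- sapply(multipliers, accuracyFor)
multipliers[which.max(accuracies)]   ### multiplier giving the highest in-sample accuracy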

Time series with customised weights

Note that we deviated from the standard sentometrics workflow by using a customised way to aggregate the bin sentiments. However, we can still use the sentometrics functions to create time series based on customised weighting schemes. Let us review how to incorporate customised weighting in a sentiment time series workflow.

Consider four different weighting schemes to compute sentiment: standard proportional weighting over the whole document, U-shaped weighting over the whole document, proportional weighting per sentence (later aggregated with equal weights across sentences), and proportional weighting per bin (later aggregated with our customised bin weights).

Computing sentiment under these weighting schemes is done with the compute_sentiment() function and the appropriate arguments.

sentimentValues <- list()

sentimentValues$default <- compute_sentiment(usnews2Sento, sentoLexicon, how = "proportional")
sentimentValues$uShaped <- compute_sentiment(usnews2Sento, sentoLexicon, how = "UShaped")
sentimentValues$sentences <- compute_sentiment(usnews2Sento, sentoLexicon, how = "proportional", do.sentence = TRUE)
sentimentValues$bins <- compute_sentiment(usnews2Sento, sentoLexicon, tokens = usnews2Bins, how = "proportional",
                                          do.sentence = TRUE) 

lapply(sentimentValues, function(x){head(x[,1:5],3)})
## $default
##           id       date word_count mergedLexicon--s mergedLexicon--s0
## 1: 830981632 1971-01-12        192                0       0.005208333
## 2: 830981642 1971-08-04        243                0       0.156378601
## 3: 830981666 1971-08-24        326                0       0.101226994
## 
## $uShaped
##           id       date word_count mergedLexicon--s mergedLexicon--s0
## 1: 830981632 1971-01-12        192                0       -0.04711660
## 2: 830981642 1971-08-04        243                0        0.14895686
## 3: 830981666 1971-08-24        326                0        0.08043824
## 
## $sentences
##           id sentence_id       date word_count mergedLexicon--s
## 1: 830981632           1 1971-01-12         28                0
## 2: 830981632           2 1971-01-12         37                0
## 3: 830981632           3 1971-01-12          6                0
## 
## $bins
##           id sentence_id       date word_count mergedLexicon--s
## 1: 830981632           1 1971-01-12         20                0
## 2: 830981632           2 1971-01-12         19                0
## 3: 830981632           3 1971-01-12         19                0

From this output, we can develop a better understanding of what is returned by compute_sentiment(). With the default settings, the sentiment of each word within a text is determined according to the provided lexicons. These word sentiments are then aggregated using the method defined by the how argument, up to the document level, forming one sentiment value per document. This is the output stored in sentimentValues$default and sentimentValues$uShaped.

By contrast, specifying do.sentence = TRUE or providing an object to the tokens argument changes this aggregation level. When do.sentence = TRUE, the aggregation happens at the sentence level, returning a sentiment value for each sentence. This is the result stored in sentimentValues$sentences.

When using the tokens argument, on the other hand, we define ourselves the units over which the word sentiments are aggregated. As we divided our documents into equal-sized bins, a sentiment value is computed for each bin. Note that this method returns a result with the same shape as the one obtained with do.sentence = TRUE, as visible in sentimentValues$bins; in this case, the sentence_id field refers to the index of each bin within a given document.

Only the default computation returns an aggregated sentiment value at the document level. If we wish to analyze document sentiment with the help of sentences or of the tokens argument, we have to aggregate again up to the document level.

This is powerful, as it allows bypassing the usual aggregation schemes offered by the how argument, as we did for the bin computation earlier. For example, with the help of the tokens argument, it is possible to have the sentiment computation return only word-level sentiment values, thus allowing complete freedom in the subsequent aggregation into document sentiment (see the sketch below).
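
As a sketch of this last point (shown for illustration only, and slow on a large corpus), one can pass every token as its own one-word “bin”, so that compute_sentiment() returns one value per word:

### Minimal sketch: every token becomes its own one-word "bin", so sentence_id now indexes
### individual words; the document-level aggregation is then entirely up to us
wordBins <- lapply(usnews2Toks, as.list)
wordSentiments <- compute_sentiment(usnews2Sento, sentoLexicon, tokens = wordBins,
                                    do.sentence = TRUE)
head(wordSentiments[, 1:5], 3)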

Let’s aggregate the sentence and bin sentiment values into document sentiment. For sentences, this can be done using the sentometrics version of aggregate(). Note the use of do.full = FALSE, which stops the aggregation at the document level (otherwise, it would aggregate directly up to a time series). For bins, we can implement our customised weighting with the help of data.table functionalities and convert the result back to a sentiment object.

sentimentValues$sentences <- aggregate(sentimentValues$sentences, ctr_agg(howDocs = "equal_weight"),
                                       do.full = FALSE)

w <- rep(1/(nBins+0.5),nBins)
w[1] <-  w[1]*1.5
aggregate_bins <- function(x){sum(x*w)}

sentimentValues$bins <- sentimentValues$bins[, c(word_count = sum(word_count), lapply(.SD, aggregate_bins)),
                                             by = .(id, date),
                                             .SDcols = names(sentimentValues$bins)[5:ncol(sentimentValues$bins)]]

sentimentValues$bins <- as.sentiment(sentimentValues$bins)

The only remaining step is to aggregate our four different sentiment measures up to time series. Since we have four different sentiment objects, we aggregate each of them to obtain four sento_measures objects. Because these objects contain multiple combinations of lexicons and features, we clean up and remove the unwanted time series using the subset() function.

With only one time series remaining in each sento_measures object, we can plot them using base R.

ctr <- ctr_agg(by = "year", lag = 5, howTime = "linear")

timeSeries <- lapply(sentimentValues, aggregate, ctr)
timeSeries <- lapply(timeSeries, subset, select = c("mergedLexicon"), delete = c("s","s0"))

plot(timeSeries[[1]]$measures,  type = 'n', ylab = "Sentiment", ylim = c(0.04,0.11))
for (i in 1:length(timeSeries)){
  lines(timeSeries[[i]]$measures, col = i)
}
legend("bottomright", legend = names(timeSeries), col = 1:length(timeSeries), lwd = 2)

This gives us a view of how the different weighting methods evolve through time. Although the differences between the weighting schemes remain fairly constant for most measures, we can observe that the U-shaped weights vary much more than the other schemes. This is not surprising, as this approach implies a completely different weighting curve, while the other measures remain at least partially based on equal weighting.

Intratextual sentiment analysis is an active topic in current research; feel free to investigate how sentiment is positioned within other types of documents!

Acknowledgements

Thanks to Samuel Borms for providing a basis for this tutorial. Thanks to Kris Boudt and James Thewissen for their article about sentiment in CEO letters, from which this tutorial is heavily inspired.