A Corpus-based Sentiment Analysis of Sarcasm
1. Introduction
Sarcasm has recently become a topic of interest in both linguistic discussion and computational application. Broadly defined, sarcasm expresses the opposite of the meaning of its form. The study of sarcasm therefore often involves its context and the speaker's intention. For example, if a person says 'what lovely weather' when it is clearly rainy and windy, the listener has to know what the weather is actually like in order to interpret the comment as sarcastic. Clift (1999) proposes that sarcasm is a type of irony, distinct from other irony in that the speaker intends to express a hostile statement. Speakers reach certain pragmatic goals through such non-literal expressions: according to Brown and Levinson (1987), verbal irony can reduce threat and highlight shared knowledge among the interlocutors.
Clark and Gerrig (1984) propose that a speaker expresses irony by pretending to be an injudicious person, expecting the hearer to see through the pretense and recognize the speaker's attitude. In this study, we treat the use of degree adverbs as a cue of such pretense in expressing sarcasm and examine this feature in terms of sentiment analysis.
On the other hand, several researchers have pointed out the elusive forms of irony and sarcasm (Muecke, 1986; Nunberg, 2001; Dress et al., 2008). In view of this, we do not assume any formal structure for sarcasm in the current study, but accept an expression as sarcastic as long as it is tagged #sarcasm by human users.
In recent years, sentiment analysis of product reviews has been a focus of computational application; however, sarcasm is a problem for sentiment analysis because what the speaker expresses differs from what the speaker intends. In massive online data, it is difficult to consider context or to recover the true intention of the speaker. This can make sentiment analysis inaccurate, because the sentiment score of the form does not correspond to the intended sentiment. Though much related research has tackled this problem, there is still room for further study.
In this paper, we first review previous research on sentiment analysis, especially sarcasm detection. In Section 3, we investigate four features of sarcasm and carry out a first classification task based on the results. In Section 4, we propose a special feature within sarcastic tweets, introduce it into the task of sarcasm detection, and compare the performance with previous work. Finally, we draw conclusions from our results and suggest future work.
2. Literature Review
In previous work on sentiment analysis, Reyes et al. (2010) evaluated five humor features (sexuality, polarity, ambiguity, emotions, and slang/emoticons) with a Bayes classifier, a decision tree, and an SVM. Their results indicate that emotion, when used as a feature, does not improve the accuracy of the classifier, and that the five features have only limited power to distinguish funny comments. Go and Bhayani (2011) make use of hashtags, one or more words immediately preceded by a hash symbol (#) that indicate the intended tone of a message, in their classification of sarcastic tweets, and report a performance of up to 69% with maximum entropy. Burfoot and Baldwin (2009) introduce the feature of validity and point out that sarcasm may be based on exaggeration or unusual collocations. In addition, Gonzalez-Ibanez et al. (2011) extracted tweets tagged with #sarcasm to ascertain the speaker's intention, and used lexical factors along with pragmatic factors to assist classification. The pragmatic factor used is the "@To User" feature, which was found to be a useful indicator of sarcasm.
For model training, Tsur et al. (2010) propose a semi-supervised framework for recognizing sarcasm. They utilize features specific to product reviews on Amazon and achieve a precision of 77%. Davidov et al. (2010) use patterns extracted from Twitter and Amazon datasets to train their classification model; they combine pattern extraction with punctuation features and propose the model SASI.
The characteristics of social media, such as the typical length of a message (Twitter even limits a tweet to 140 characters), pose a challenge to sentiment analysis because fewer clues are available for classification. While most studies detect sarcasm at the sentence level, Filatova (2012) examines sarcastic product reviews at the document level, suggesting the importance of context in sarcasm detection. In this study, we look for a contextual feature by examining the previous tweet to which a sarcastic tweet replies. We also investigate the sentiment change within sarcastic tweets, in order to explore more features for the sentiment analysis of social media.
In the current study, we examine four features that relate to our hypotheses about the characteristics of sarcastic expressions, in order to facilitate sarcasm detection through the sentiment analysis of tweets tagged with #sarcastic. The features explored in the current data are: (1) the emotional profile of sarcastic tweets, (2) the relationship between the original tweets and the "@To User" tweets, (3) the use of degree adverbs in sarcastic tweets, and (4) the most frequent words in sarcastic tweets. We also examine a pattern of sentiment scores within a sarcastic tweet.
3. Sarcasm Analysis
This section examines four features of sarcastic tweets: the sentiment score, the difference in sentiment score from the tweet being replied to, the count of degree adverbs, and word frequency. The data consist of 1000 English tweets tagged with #sarcastic and 1000 tagged with #happy or #sad (serving as the random tweets), all created between 2013-07-04 and 2013-07-14 and obtained via the twitteR package.
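For illustration, the collection step could be sketched as follows with the twitteR package; the OAuth setup and the exact query arguments are assumptions for the sketch, not a record of the actual retrieval.
library(twitteR)
# an OAuth handshake (e.g. via registerTwitterOAuth) is assumed to have
# been completed beforehand
sarcasm.tweets = searchTwitter("#sarcastic", n = 1000, lang = "en",
                               since = "2013-07-04", until = "2013-07-14")
sarcasm.df = twListToDF(sarcasm.tweets)  # flatten the status list into a data frame
# the random set (#happy, #sad) is collected in the same way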
3.1 Experiment 1: How does sarcasm behave in language use?
In Experiment 1, the issue we want to understand is how sarcasm behaves in its linguistic form. Do Twitter users use positive words to express negative opinions, or is it more often the other way around? Rockwell (2006) found that using a negative form to express a positive message is less common than using a positive form to express a negative message, but it is not clear whether this difference reaches significance, or whether it holds consistently both for individual words and for whole statements.
Method. We explore this issue by comparing the sentiment scores of 40 random tweets, mixing the positive hashtag (#happy) and the negative hashtag (#sad), with those of 40 tweets tagged #sarcasm, to see whether the language form of sarcasm differs from that of non-sarcastic tweets. The sentiment score of a tweet is estimated as the number of positive words minus the number of negative words it contains (Hu & Liu, 2004; Breen, 2012). The hypothesis is that the average sentiment score of sarcastic tweets is not equal to that of random tweets.
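For concreteness, the word-counting scorer can be sketched as follows, assuming pos.words and neg.words hold the positive and negative word lists of the Hu & Liu opinion lexicon (the function name is ours):
score.sentiment = function(text, pos.words, neg.words) {
    words = unlist(strsplit(tolower(text), "[^a-z']+"))  # crude word tokenization
    sum(words %in% pos.words) - sum(words %in% neg.words)
}
score.sentiment("what lovely weather", pos.words, neg.words)  # 1, if 'lovely' is the only lexicon hit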
3.2 Experiment 2: Do reply tweets help in detecting sarcasm?
Gonzalez-Ibanez et al. (2011) propose that the "@To User" feature is rather effective in sarcasm detection. In the second experiment, we further examine the relation between the tweet user A writes in reply to user B and the original tweet B stated. If sarcasm is built on the tweet being replied to, then the reply and the original tweet should have the same sentiment scores, because the reply, though twisted in intention, should not be twisted in expression. If this hypothesis holds, sarcastic tweets can be detected more efficiently by comparing the sentiment scores of replies with those of the original tweets.
Method. We explore this issue by calculating the sentiment scores of the 40 randomly sampled tweets tagged with #sad or #happy and of the 40 #sarcasm tweets, together with the scores of the tweets they reply to. The hypothesis is that an original tweet should have the same sentiment score as the @touser sarcasm tweet replying to it, whereas for non-sarcastic tweets no such correspondence should be observed.
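The reply scores could be obtained roughly as follows; this is a sketch only, where sarcasm.tweets is the list of status objects from the collection step, replyToSID is the twitteR status field holding the id of the tweet being replied to, and score.sentiment() is the scorer sketched above.
get.reply.score = function(tweet) {
    sid = tweet$replyToSID                           # id of the tweet being replied to
    if (length(sid) == 0 || is.na(sid)) return(NA)   # skip tweets that are not replies
    original = showStatus(sid)                       # fetch the replied-to tweet
    score.sentiment(original$text, pos.words, neg.words)
}
sarcasm.df$replySentiScore = sapply(sarcasm.tweets, get.reply.score)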
3.3 Experiment 3: How influential are degree adverbs?
To examine the usage of degree adverbs in sarcastic expressions, 1000 random tweets and 1000 sarcastic tweets were compared in terms of their counts of degree adverbs, using the degree adverb list of Dong (2007).
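A sketch of the counting step follows; the short vector below merely stands in for Dong's (2007) degree adverb list, which we do not reproduce here.
degree.adverbs = c("very", "so", "really", "quite", "totally", "extremely")  # placeholder list
count.adverbs = function(text) {
    words = unlist(strsplit(tolower(text), "[^a-z']+"))
    sum(words %in% degree.adverbs)  # number of degree adverbs in one tweet
}
random.df$adverbCount = sapply(random.df$text, count.adverbs)
sarcasm.df$adverbCount = sapply(sarcasm.df$text, count.adverbs)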
3.4 Experiment 4: Is the selection of words significantly different?
This experiment investigates word usage in random versus sarcastic tweets: 1000 random tweets and 1000 sarcastic tweets were compared in terms of their word frequencies.
4. Results
Result and Discussion 1.
An independent t-test shows that the average sentiment score of sarcastic tweets is not significantly different from that of random tweets [t = 0.3871, p = 0.7001; μrandom = -0.325, μsarcasm = -0.450]. Thus we cannot conclude that sarcastic tweets are more negative than random tweets. In the case of Twitter, sarcastic tweets indeed pose great difficulty for sentiment analysis, because they often mix negative and positive words. This result underlines the need for further work on sarcasm detection.
load("random.df")
load("sarcasm.df")
t.test(random.df$sentiScore, sarcasm.df$sentiScore)
##
## Welch Two Sample t-test
##
## data: random.df$sentiScore and sarcasm.df$sentiScore
## t = 0.3871, df = 58.2, p-value = 0.7001
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5214 0.7714
## sample estimates:
## mean of x mean of y
## -0.325 -0.450
# boxplot(random.df$sentiScore, sarcasm.df$sentiScore, notch = TRUE)  # optional visual check
Result and Discussion 2.
A paired t-test within the sarcastic tweets shows that the @touser sarcasm tweets and the original tweets they reply to in fact differ slightly but significantly in sentiment score [t = -2.047, p = 0.04741; μ@touser = 0.05, μo = -0.425], contrary to the hypothesis that the two should be the same.
t.test(sarcasm.df$sentiScore, sarcasm.df$replySentiScore, paired = TRUE)  # sarcastic tweets vs. the tweets they reply to
##
## Paired t-test
##
## data: sarcasm.df$sentiScore and sarcasm.df$replySentiScore
## t = -2.047, df = 39, p-value = 0.04741
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.944291 -0.005709
## sample estimates:
## mean of the differences
## -0.475
For non-sarcastic tweets, on the other hand, the difference between the original tweets and the @touser tweets falls just short of significance [t = -2.0096, p = 0.05143], so their sentiment scores cannot be distinguished statistically.
t.test(random.df$sentiScore, random.df$replySentiScore, paired = TRUE)  # same comparison for random tweets
##
## Paired t-test
##
## data: random.df$sentiScore and random.df$replySentiScore
## t = -2.01, df = 39, p-value = 0.05143
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.852777 0.002777
## sample estimates:
## mean of the differences
## -0.425
This contrast sheds light on sarcasm detection: the sentiment scores of the @touser sarcasm tweets differ from those of the original tweets they answer, while the random replies do not differ significantly from theirs. This again underlines the divergence between speaker utterance and speaker intention, and it suggests that taking the original replied-to tweet into consideration can facilitate sarcasm detection.
Result and Discussion 3.
Over the full samples, the average count of degree adverbs is 0.227 per random tweet and 0.345 per sarcastic tweet, a significant difference [t = -4.5286, p = 6.295e-06]. The chunk below re-runs the comparison on the smaller samples loaded above, where the same trend appears (0.225 vs. 0.350) but does not reach significance [t = -1.006, p = 0.318].
t.test(random.df$adverbCount, sarcasm.df$adverbCount)  # compare degree adverb counts
##
## Welch Two Sample t-test
##
## data: random.df$adverbCount and sarcasm.df$adverbCount
## t = -1.006, df = 66.28, p-value = 0.318
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.373 0.123
## sample estimates:
## mean of x mean of y
## 0.225 0.350
boxplot(random.df$adverbCount, sarcasm.df$adverbCount)  # left: random, right: sarcastic
From the boxplot, we can see that sarcastic tweets tend to contain more degree adverbs, which are comparatively rare in random tweets.
Result and Discussion 4.
Over the full samples, the result indicates a significant difference [t = -3.3612, p = 0.0008], suggesting that people tend to choose different words when expressing sarcasm, so a sarcastic word list may be useful in sarcasm detection. As in Experiment 3, the chunk below illustrates the comparison on the smaller samples, where the difference over the shared words is not significant [t = 0.7153, p = 0.4774].
word = unlist(strsplit(sarcasm.df$text, " "))               # tokenize sarcastic tweets
sarcasm.table = as.data.frame(table(word))                  # word frequencies (sarcastic)
word = unlist(strsplit(random.df$text, " "))                # tokenize random tweets
random.table = as.data.frame(table(word))                   # word frequencies (random)
freq.df = merge(random.table, sarcasm.table, by = "word")   # keep only words shared by both sets
t.test(freq.df$Freq.x, freq.df$Freq.y, paired = TRUE)       # Freq.x = random, Freq.y = sarcastic
##
## Paired t-test
##
## data: freq.df$Freq.x and freq.df$Freq.y
## t = 0.7153, df = 56, p-value = 0.4774
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5054 1.0668
## sample estimates:
## mean of the differences
## 0.2807
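Finally, as a pointer toward the classification task mentioned in the introduction, the features above could be combined in a simple logistic regression. The following is a sketch under the assumption that the two data frames share their columns and are stacked with a binary is.sarcasm label; it is not the classifier evaluated in the cited work.
all.df = rbind(transform(random.df, is.sarcasm = 0),
               transform(sarcasm.df, is.sarcasm = 1))   # stack with a binary label
fit = glm(is.sarcasm ~ sentiScore + adverbCount + I(sentiScore - replySentiScore),
          data = all.df, family = binomial)
summary(fit)  # inspect which features carry weight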