library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.3.3
library(ggplot2)
library(knitr)

Task: Analyze documents or scraped web pages that have already been classified (movie reviews as positive/negative, email as spam/ham, etc.) in order to predict how new documents should be categorized.

The data come from a CrowdFlower sentiment dataset called “Emotion in Text”, which labels tweets with their emotional content (such as happiness, sadness, and anger). The content of the tweets was compared against two tidytext sentiment lexicons, “afinn” and “bing”: “afinn” scores words numerically from -5 to 5 as having positive or negative associations, while “bing” classifies them categorically as positive or negative.

# Tidytext sentiments lexicons
get_sentiments("afinn") # -5 to 5
## # A tibble: 2,476 × 2
##          word score
##         <chr> <int>
## 1     abandon    -2
## 2   abandoned    -2
## 3    abandons    -2
## 4    abducted    -2
## 5   abduction    -2
## 6  abductions    -2
## 7       abhor    -3
## 8    abhorred    -3
## 9   abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows
get_sentiments("bing") # negative or positive
## # A tibble: 6,788 × 2
##           word sentiment
##          <chr>     <chr>
## 1      2-faced  negative
## 2      2-faces  negative
## 3           a+  positive
## 4     abnormal  negative
## 5      abolish  negative
## 6   abominable  negative
## 7   abominably  negative
## 8    abominate  negative
## 9  abomination  negative
## 10       abort  negative
## # ... with 6,778 more rows

After importing the data, the original sentiment labels (13 in total) were collapsed into positive, negative, or neutral, to make the predictions broader and therefore more manageable. Six sentiments were mapped to negative, five to positive and two to neutral.

# Import data from GitHub
TextEm = read.csv("https://raw.githubusercontent.com/Galanopoulog/DATA607-Project-4/master/TextEmotion.csv",
           header = T,
           sep = ",",
           stringsAsFactors = F)
kable(head(TextEm))
| tweet_id | sentiment | author | content |
|---|---|---|---|
| 1956967341 | empty | xoshayzers | @tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =[ |
| 1956967666 | sadness | wannamama | Layin n bed with a headache ughhhh…waitin on your call… |
| 1956967696 | sadness | coolfunky | Funeral ceremony…gloomy friday… |
| 1956967789 | enthusiasm | czareaquino | wants to hang out with friends SOON! |
| 1956968416 | neutral | xkilljoyx | @dannycastillo We want to trade with someone who has Houston tickets, but no one will. |
| 1956968477 | worry | xxxPEACHESxxx | Re-pinging @ghostridah14: why didn’t you go to prom? BC my bf didn’t like my friends |
# Changing sentiments into positive/negative/neutral
TextEm$sentiment[TextEm$sentiment == "anger"] = "negative"
TextEm$sentiment[TextEm$sentiment == "boredom"] = "negative"
TextEm$sentiment[TextEm$sentiment == "empty"] = "negative"
TextEm$sentiment[TextEm$sentiment == "enthusiasm"] = "positive"
TextEm$sentiment[TextEm$sentiment == "fun"] = "positive"
TextEm$sentiment[TextEm$sentiment == "happiness"] = "positive"
TextEm$sentiment[TextEm$sentiment == "hate"] = "negative"
TextEm$sentiment[TextEm$sentiment == "love"] = "positive"
TextEm$sentiment[TextEm$sentiment == "neutral"] = "neutral"
TextEm$sentiment[TextEm$sentiment == "relief"] = "positive"
TextEm$sentiment[TextEm$sentiment == "sadness"] = "negative"
TextEm$sentiment[TextEm$sentiment == "surprise"] = "neutral"
TextEm$sentiment[TextEm$sentiment == "worry"] = "negative"
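
The same recoding can be written more compactly with a named lookup vector; a minimal sketch that is equivalent to the assignments above (the sentiment.map name is only for illustration):

# Named lookup vector: each original label maps to its broader group
sentiment.map = c(anger = "negative", boredom = "negative", empty = "negative",
                  hate = "negative", sadness = "negative", worry = "negative",
                  enthusiasm = "positive", fun = "positive", happiness = "positive",
                  love = "positive", relief = "positive",
                  neutral = "neutral", surprise = "neutral")
TextEm$sentiment = unname(sentiment.map[TextEm$sentiment])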

The data labeled “neutral” was omitted because it is difficult to define and, therefore, to classify. For example, “surprise” can be either positive or negative, unlike “relief”, which is a feeling of reassurance after experiencing anxiety or worry and is therefore, overall, a positive sentiment. From there, the data was split into thirds: two-thirds to use for the analysis and one-third to test the predictions.

TextEm = filter(TextEm, sentiment != "neutral")
dim(TextEm)
## [1] 29175     4
testText = TextEm[1:19450,]       # two-thirds
predText = TextEm[19451:29175,]   # one-third

The first step in the sentiment analysis was to clean the tweets and organize them so that each word could be compared against both lexicons.

all.content = as.list(testText$content)

collection = tibble()

for(i in 1:nrow(testText)) {
  
  # Split the i-th tweet into one lowercase word per row and tag it with its author
  clean = tibble(stuff = all.content[[i]]) %>%
    unnest_tokens(word, stuff) %>%
    mutate(user = testText$author[i]) %>%
    select(user, everything())
  
  # Append this tweet's words to the running collection
  collection = rbind(collection, clean)
}

kable(head(collection))
| user | word |
|---|---|
| xoshayzers | tiffanylue |
| xoshayzers | i |
| xoshayzers | know |
| xoshayzers | i |
| xoshayzers | was |
| xoshayzers | listenin |
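
The loop above grows the tibble one tweet at a time, which works but is slow on a large split; tidytext can also tokenize the whole data frame in one pass. A minimal sketch of the equivalent (collection2 is only an illustrative name):

# Tokenize every tweet at once; unnest_tokens keeps the other columns (here, user)
collection2 = testText %>%
  select(user = author, content) %>%
  unnest_tokens(word, content)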

Once the data is arranged, we compare each word against the “afinn” lexicon. After the words are scored, we find the mean score of the lexicon-matched words for each user.

# Using "afinn" to find scores of words
af.sent = data.frame(get_sentiments("afinn"))

af.sen = data.frame(merge(collection, af.sent, by.x="word", by.y="word") %>% # scoring the words
  group_by(user) %>%      # group by user
  mutate(mean.af.sent = mean(score)) %>%  # find the mean score per user
  slice(which.max(mean.af.sent))  # retain only one output per user
)

kable(head(af.sen))
| word | user | score | mean.af.sent |
|---|---|---|---|
| sick | __arual | -2 | -2.000000 |
| like | DalekCaan | 2 | -0.750000 |
| nice | __Kizzle | 3 | 2.000000 |
| depressing | __laurenS | -2 | -1.666667 |
| chilling | __lozzy | -1 | 1.000000 |
| miss | __LucifersAngel | -2 | -2.000000 |

We repeat the method used for the “afinn” lexicon with the “bing” lexicon. However, since “bing” classifies words only as positive or negative, we convert the negative labels to -1 and the positive ones to 1 before taking the mean.

# Using bing to find pos/neg of words
bing.sent = data.frame(get_sentiments("bing"))

# To find if the overall message was pos/neg, turn values into 1 or -1, find mean
bi.sen = data.frame(merge(collection, bing.sent, by.x="word", by.y="word"))
bi.sen$sentiment[bi.sen$sentiment == "negative"] = -1
bi.sen$sentiment[bi.sen$sentiment == "positive"] = 1

bi.sen = bi.sen %>%
  group_by(user) %>% 
  mutate(posneg = mean(as.numeric(sentiment)))%>% 
  group_by(user) %>%
  slice(which.max(posneg))

kable(head(bi.sen))
| word | user | sentiment | posneg |
|---|---|---|---|
| sick | __arual | -1 | -1.0000000 |
| like | DalekCaan | 1 | -0.5000000 |
| cold | __Jesssicaa | -1 | -1.0000000 |
| nice | __Kizzle | 1 | 1.0000000 |
| depressing | __laurenS | -1 | -0.3333333 |
| love | __lozzy | 1 | 1.0000000 |
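
Both lexicon passes follow the same pattern: join the tokenized words against the lexicon, then average per user. A compact sketch of that pattern with inner_join and summarise, keeping only the per-user means and assuming the afinn column is named score as shown above (af.means and bi.means are illustrative names):

# afinn: average the numeric scores of the matched words per user
af.means = collection %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(user) %>%
  summarise(mean.af.sent = mean(score))

# bing: recode positive/negative to +1/-1, then average per user
bi.means = collection %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(word.sign = ifelse(sentiment == "positive", 1, -1)) %>%
  group_by(user) %>%
  summarise(posneg = mean(word.sign))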

Since the two lexicons may not place a tweet in the same category, the afinn and bing results were merged by user and compared. A new column called “match” records whether the “bing” and “afinn” values agree (yes) or not (no) and, if “yes”, a verdict is made on whether the tweet is overall positive or negative.

# Combining by user
all.sen = data.frame(merge(bi.sen, af.sen, by.x="user", by.y="user"))

all.sen2 = all.sen[,c(1,4,7)] %>% 
  group_by(user) %>%
  mutate(match = ifelse(posneg < 0 & mean.af.sent < 0, "yes", ifelse(posneg > 0 & mean.af.sent > 0, "yes", "no")))%>%
  mutate(verdict = ifelse(match == "yes" & posneg < 0, "negative", ifelse(match == "yes" & posneg > 0, "positive", "other")))

kable(head(all.sen))
| user | word.x | sentiment | posneg | word.y | score | mean.af.sent |
|---|---|---|---|---|---|---|
| __arual | sick | -1 | -1.0000000 | sick | -2 | -2.000000 |
| DalekCaan | like | 1 | -0.5000000 | like | 2 | -0.750000 |
| __Kizzle | nice | 1 | 1.0000000 | nice | 3 | 2.000000 |
| __laurenS | depressing | -1 | -0.3333333 | depressing | -2 | -1.666667 |
| __lozzy | love | 1 | 1.0000000 | chilling | -1 | 1.000000 |
| __LucifersAngel | miss | -1 | -1.0000000 | miss | -2 | -2.000000 |

Once a verdict had been reached, it was compared against the original categorization (sentiment) of the tweet.

final = merge(testText, all.sen2, by.x="author", by.y="user")[, c(1,3,7,8,5,6)] %>% 
  group_by(author) %>%
  slice(which.max(mean.af.sent))

kable(head(final))
| author | sentiment | match | verdict | posneg | mean.af.sent |
|---|---|---|---|---|---|
| __arual | negative | yes | negative | -1.0000000 | -2.000000 |
| DalekCaan | negative | yes | negative | -0.5000000 | -0.750000 |
| __Kizzle | negative | yes | positive | 1.0000000 | 2.000000 |
| __laurenS | negative | yes | negative | -0.3333333 | -1.666667 |
| __lozzy | positive | yes | positive | 1.0000000 | 1.000000 |
| __LucifersAngel | negative | yes | negative | -1.0000000 | -2.000000 |

To assess whether the predicted conclusions matched the original classifications, only the tweets that both lexicons agreed were positive or negative were used. From there, a confusion matrix was created to compare the predicted values against the actual values and to calculate the accuracy of this approach.

final.yes = filter(final, match == "yes")

pos.pos = filter(final.yes, sentiment == "positive" & verdict == "positive")
pos.neg = filter(final.yes, sentiment == "positive" & verdict == "negative")
neg.pos = filter(final.yes, sentiment == "negative" & verdict == "positive")
neg.neg = filter(final.yes, sentiment == "negative" & verdict == "negative")

conf.matrix = matrix(c(nrow(pos.pos), 
                       nrow(pos.neg), 
                       nrow(neg.pos), 
                       nrow(neg.neg)),ncol=2)
rownames(conf.matrix) = c("pred_pos", "pred_neg")      # rows are the model's verdict
colnames(conf.matrix) = c("actual_pos", "actual_neg")  # columns are the original labels

conf.matrix
##          actual_pos actual_neg
## pred_pos       2425       1687
## pred_neg        364       4352
# Number of values that match the original sentiment
table(final.yes[,2] == final.yes[,4])
## 
## FALSE  TRUE 
##  2051  6777
# Proportion of values that match the original sentiment (accuracy)
prop.table(table(final.yes[,2] == final.yes[,4]))
## 
##    FALSE     TRUE 
## 0.232329 0.767671
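
As a cross-check, the same confusion matrix and accuracy can be read directly from the agreed-upon rows with base R (rows of the table are the original sentiment, columns the verdict):

# Cross-tabulate the original label against the verdict, then compute accuracy
# as the proportion of rows where the two agree
table(final.yes$sentiment, final.yes$verdict)
mean(final.yes$sentiment == final.yes$verdict)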

Before running the prediction dataset through a sentiment-analysis function built on this approach, I was curious about which values had been omitted from the final analysis, especially after seeing that roughly one-fourth of the final dataset was discarded because the lexicons did not agree.

# proportion of values omitted
1-nrow(final.yes)/nrow(final)
## [1] 0.2514838
# omitted data
final.no = filter(final, match == "no")
kable(head(final.no))
| author | sentiment | match | verdict | posneg | mean.af.sent |
|---|---|---|---|---|---|
| _annella | negative | no | other | 0.0000000 | -0.6666667 |
| DESiMO | negative | no | other | -0.1428571 | 1.4444444 |
| Jazzi3 | positive | no | other | 0.0000000 | 1.0000000 |
| Just_Jen | negative | no | other | 0.0000000 | 0.2500000 |
| _kwaz | negative | no | other | 0.0000000 | 0.5000000 |
| _lips_xD | negative | no | other | 0.0000000 | 2.2000000 |

Looking at the rows where the lexicons did not match, it became evident that the major source of disagreement (approximately 72% of the omitted data) was one lexicon assigning a user a neutral score of 0 while the other assigned a positive or negative score. To address this, the mean of the two lexicon scores was taken and its sign (positive or negative) was compared against the original classification. The proportion of these averaged verdicts that matched the original dataset was then derived: approximately 60% were correct.

# proportion of omitted data that had at least one zero
nrow(filter(final.no, posneg == 0 | mean.af.sent == 0))/nrow(final.no)
## [1] 0.7255563
# Average the means of the data and see if they are more positive or negative
final.no2 = final.no %>%
       filter(posneg == 0 | mean.af.sent == 0) %>% 
       filter(posneg != mean.af.sent) %>%   # Drop rows where both scores are zero, since the average would give no verdict
       mutate(both.mean = (posneg+mean.af.sent)/2) %>% 
       mutate(verdict = ifelse(both.mean  < 0, "negative", "positive"))

prop.table(table(final.no2[,2] == final.no2[,4]))
## 
##     FALSE      TRUE 
## 0.3990536 0.6009464

A 60% correct prediction rate is not ideal, especially considering that the data that was not omitted scored approximately 16% higher; however, that higher accuracy can be attributed to the whittling down of the data. Adding this variability to the model makes it applicable to a wider range of tweets. As such, the omitted, re-evaluated data was added back to the previous conclusions.

all.final = bind_rows(final.yes, final.no2)

# accuracy of combined table
prop.table(table(all.final[,2] == all.final[,4]))
## 
##     FALSE      TRUE 
## 0.2618826 0.7381174

Finally, a function was created that takes the tweets from the prediction set (one-third of the original data) and returns the counts from the confusion matrix, along with the model’s percent accuracy and inaccuracy.

sent.analysis = function(x){
  all.content = as.list(x$content)
  
  collection = tibble()
  
  for(i in 1:nrow(x)) {
    
    clean = tibble(stuff = all.content[[i]]) %>%
      unnest_tokens(word, stuff) %>%
      mutate(user = x$author[i]) %>%
      select(user, everything())
    
    collection = rbind(collection, clean)
  }
 
    af.sent = data.frame(get_sentiments("afinn"))
    af.sen = data.frame(merge(collection, af.sent, by.x="word", by.y="word") %>% 
                          group_by(user) %>% 
                          mutate(mean.af.sent = mean(score)) %>%
                          slice(which.max(mean.af.sent))
    )
    
    bing.sent = data.frame(get_sentiments("bing"))
    bi.sen = data.frame(merge(collection, bing.sent, by.x="word", by.y="word"))
    bi.sen$sentiment[bi.sen$sentiment == "negative"] = -1
    bi.sen$sentiment[bi.sen$sentiment == "positive"] = 1
    
    
    bi.sen = bi.sen %>%
      group_by(user) %>% 
      mutate(posneg = mean(as.numeric(sentiment)))%>% 
      group_by(user) %>%
      slice(which.max(posneg))
    
    all.sen = data.frame(merge(bi.sen, af.sen, by.x="user", by.y="user")[,c(1,4,7)]) %>% 
      group_by(user) %>%
      mutate(match = ifelse(posneg < 0 & mean.af.sent < 0, "yes", ifelse(posneg > 0 & mean.af.sent > 0, "yes", "no")))%>%
      mutate(verdict = ifelse(match == "yes" & posneg < 0, "negative", ifelse(match == "yes" & posneg > 0, "positive", "other")))

    final = merge(x, all.sen, by.x="author", by.y="user")[, c(1,3,7,8,5,6)] %>% 
      group_by(author) %>%
      slice(which.max(mean.af.sent))
    
    final.yes = filter(final, match == "yes")
    final.no = filter(final, match == "no")%>%
      filter(posneg == 0 | mean.af.sent == 0) %>% 
      filter(posneg != mean.af.sent) %>% 
      mutate(both.mean = (posneg+mean.af.sent)/2) %>% 
      mutate(verdict = ifelse(both.mean  < 0, "negative", "positive"))
    # Combine the agreed-upon rows with the re-evaluated disagreements
    # (final.no here, not the final.no2 object from the global environment)
    all.final = bind_rows(final.yes, final.no)
    
    pos.pos = filter(all.final, sentiment == "positive" & verdict == "positive")
    pos.neg = filter(all.final, sentiment == "positive" & verdict == "negative")
    neg.pos = filter(all.final, sentiment == "negative" & verdict == "positive")
    neg.neg = filter(all.final, sentiment == "negative" & verdict == "negative")
    
    accuracy = (nrow(pos.pos)+nrow(neg.neg))/(nrow(pos.pos)+ nrow(pos.neg)+ nrow(neg.pos)+ nrow(neg.neg))*100
    
    conf.matrix = matrix(c(nrow(pos.pos), 
                           nrow(pos.neg), 
                           nrow(neg.pos), 
                           nrow(neg.neg),
                           accuracy,
                           100-accuracy), ncol=6)
    colnames(conf.matrix) = c("true_pos", "false_neg", "false_pos", "true_neg", "percent_accuracy", "percent_inaccuracy")

    return(conf.matrix)
}

sent.analysis(predText)
##      true_pos false_neg false_pos true_neg percent_accuracy
## [1,]     4320       344      1306     1094         76.64213
##      percent_inaccuracy
## [1,]           23.35787

Using the larger analysis split (testText), since it had more values, the confusion matrix proportions were plotted as a mosaic plot. The plot showed that the model was better at correctly identifying negative tweets, whereas positive tweets had approximately a fifty percent chance of being categorized correctly.

sen.verd = table(all.final$verdict, all.final$sentiment)
mosaicplot(sen.verd, main = "Confusion Matrix Proportions ", xlab = "sentiment", ylab = "verdict", col = c(2,4))

**Conclusion**

Using the “afinn” and “bing” lexicons to analyze tweets led to a model that, after the necessary omissions, predicted the correct category for roughly 75% of the data. However, by the end of the analysis it became clear that this approach was not optimal for various reasons, the main one being the omissions. The omissions fell mostly into three categories: Pre-evaluation, Lexicon and Disagreeable.

In the Pre-evaluation stage, tweets that were deemed “neutral” could not be arbitrarily assigned to either the positive or the negative category, simply because they did not fit in either, for the reasons explained at the beginning of the assignment. As such, they were removed. A future endeavor could counter this by building a sentiment analysis that first separates the data into emotive (positive/negative) tweets and neutral ones.

In the Lexicon stage, tweets whose words appeared in neither (or only one) of the lexicons were automatically dropped from the data. The issue stems from the way tweets are written: the majority of the content includes colloquialisms and purposely or accidentally misspelled words. Few if any of these are found in the lexicons and, thus, a sizeable portion of the data was lost. If a lexicon were created that included phrases often used in comment sections, or one that recognized words regardless (for example) of whether consonants or vowels had been deliberately elongated, the approach used in this assignment would be more appropriate; a rough sketch of the second idea follows.
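
As a rough sketch of that second idea (a possible preprocessing step, not something used in this analysis), runs of three or more repeated letters could be collapsed before the lexicon join, so that elongated spellings at least have a chance of matching:

# Collapse runs of 3+ identical letters: "sooooo" -> "so", "yayyyy" -> "yay"
# (crude: it can over-shorten words such as "coooool" -> "col", but it recovers
# some elongated spellings that the lexicons would otherwise miss)
collection$word = gsub("(.)\\1{2,}", "\\1", collection$word)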

Finally, in the Disagreeable stage, when the lexicons were at odds over whether a term was positive or negative, the data was removed because there was no way to determine which lexicon was more accurate. On some occasions each lexicon matched a different word from the same tweet, which made comparing the lexicons’ stances even harder.

Overall, considering the data that was selected and the lexicons used, this method performed admirably. Its main drawback is that it is best suited to more formal text, and it would benefit from a third lexicon to break the tie when the other two disagree; a rough sketch of that idea follows.
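
A rough sketch of the three-lexicon idea, adding tidytext’s “nrc” lexicon (filtered to its positive/negative terms; in newer tidytext versions this lexicon may require the textdata package) as a tie-breaker. The object and column names here are illustrative only, and this was not run as part of the analysis:

# Score each user with a third lexicon, then take a sign-based majority vote
nrc.pn = get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  mutate(nrc.sign = ifelse(sentiment == "positive", 1, -1)) %>%
  select(word, nrc.sign)

vote = collection %>%
  inner_join(nrc.pn, by = "word") %>%
  group_by(user) %>%
  summarise(nrc.mean = mean(nrc.sign)) %>%
  inner_join(af.sen, by = "user") %>%   # afinn per-user means from above
  inner_join(bi.sen, by = "user") %>%   # bing per-user means from above
  mutate(votes = sign(nrc.mean) + sign(mean.af.sent) + sign(posneg),
         verdict = ifelse(votes > 0, "positive",
                          ifelse(votes < 0, "negative", "other")))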