library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.3.3
library(ggplot2)
library(knitr)

Task: Analyze documents or scraped web pages that have already been classified (movie reviews as positive/negative, email as spam/ham, etc.) in order to predict how new documents should be categorized.

The data come from a CrowdFlower sentiment dataset called “Emotion in Text”, which labels tweets with their emotional content (such as happiness, sadness, and anger). The content of the tweets was compared against two tidytext sentiment lexicons, “afinn” and “bing”: “afinn” scores words numerically from -5 to 5 as having positive or negative associations, while “bing” classifies them categorically as positive or negative.

# Tidytext sentiments lexicons
get_sentiments("afinn") # -5 to 5
## # A tibble: 2,476 × 2
##          word score
##         <chr> <int>
## 1     abandon    -2
## 2   abandoned    -2
## 3    abandons    -2
## 4    abducted    -2
## 5   abduction    -2
## 6  abductions    -2
## 7       abhor    -3
## 8    abhorred    -3
## 9   abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows
get_sentiments("bing") # negative or positive
## # A tibble: 6,788 × 2
##           word sentiment
##          <chr>     <chr>
## 1      2-faced  negative
## 2      2-faces  negative
## 3           a+  positive
## 4     abnormal  negative
## 5      abolish  negative
## 6   abominable  negative
## 7   abominably  negative
## 8    abominate  negative
## 9  abomination  negative
## 10       abort  negative
## # ... with 6,778 more rows

After importing the data, the original sentiment labels (13 in total) were collapsed into positive, negative, or neutral, to make the predictions broader and therefore more manageable. Six sentiments were mapped to negative, five to positive and two to neutral.

# Import data from GitHub
TextEm = read.csv("https://raw.githubusercontent.com/Galanopoulog/DATA607-Project-4/master/TextEmotion.csv",
           header = T,
           sep = ",",
           stringsAsFactors = F)
kable(head(TextEm))
| tweet_id | sentiment | author | content |
|---|---|---|---|
| 1956967341 | empty | xoshayzers | @tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =[ |
| 1956967666 | sadness | wannamama | Layin n bed with a headache ughhhh…waitin on your call… |
| 1956967696 | sadness | coolfunky | Funeral ceremony…gloomy friday… |
| 1956967789 | enthusiasm | czareaquino | wants to hang out with friends SOON! |
| 1956968416 | neutral | xkilljoyx | @dannycastillo We want to trade with someone who has Houston tickets, but no one will. |
| 1956968477 | worry | xxxPEACHESxxx | Re-pinging @ghostridah14: why didn’t you go to prom? BC my bf didn’t like my friends |
# Changing sentiments into positive/negative/neutral
TextEm$sentiment[TextEm$sentiment == "anger"] = "negative"
TextEm$sentiment[TextEm$sentiment == "boredom"] = "negative"
TextEm$sentiment[TextEm$sentiment == "empty"] = "negative"
TextEm$sentiment[TextEm$sentiment == "enthusiasm"] = "positive"
TextEm$sentiment[TextEm$sentiment == "fun"] = "positive"
TextEm$sentiment[TextEm$sentiment == "happiness"] = "positive"
TextEm$sentiment[TextEm$sentiment == "hate"] = "negative"
TextEm$sentiment[TextEm$sentiment == "love"] = "positive"
TextEm$sentiment[TextEm$sentiment == "neutral"] = "neutral"
TextEm$sentiment[TextEm$sentiment == "relief"] = "positive"
TextEm$sentiment[TextEm$sentiment == "sadness"] = "negative"
TextEm$sentiment[TextEm$sentiment == "surprise"] = "neutral"
TextEm$sentiment[TextEm$sentiment == "worry"] = "negative"
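
The same recoding can be written more compactly with a named lookup vector; a minimal sketch that is equivalent to the assignments above (the sentiment.map name is only for illustration):

# Named lookup vector: each original label maps to its broader group
sentiment.map = c(anger = "negative", boredom = "negative", empty = "negative",
                  hate = "negative", sadness = "negative", worry = "negative",
                  enthusiasm = "positive", fun = "positive", happiness = "positive",
                  love = "positive", relief = "positive",
                  neutral = "neutral", surprise = "neutral")
TextEm$sentiment = unname(sentiment.map[TextEm$sentiment])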

The data labeled “neutral” was omitted because it is difficult to define and, therefore, to classify. For example, “surprise” can be either positive or negative, unlike “relief”, which is a feeling of reassurance after experiencing anxiety or worry and is therefore, overall, a positive sentiment. From there, the data was split into thirds: two-thirds to use for the analysis and one-third to test the predictions.

TextEm = filter(TextEm, sentiment != "neutral")
dim(TextEm)
## [1] 29175     4
testText = TextEm[1:19450,]       # two-thirds
predText = TextEm[19451:29175,]   # one-third

The first step in the sentiment analysis was to clean the tweets and organize them so that each word could be compared against both lexicons.

all.content = as.list(testText$content)

collection = tibble()

for(i in 1:nrow(testText)) {
  
  # Split the i-th tweet into one lowercase word per row and tag it with its author
  clean = tibble(stuff = all.content[[i]]) %>%
    unnest_tokens(word, stuff) %>%
    mutate(user = testText$author[i]) %>%
    select(user, everything())
  
  # Append this tweet's words to the running collection
  collection = rbind(collection, clean)
}

kable(head(collection))
| user | word |
|---|---|
| xoshayzers | tiffanylue |
| xoshayzers | i |
| xoshayzers | know |
| xoshayzers | i |
| xoshayzers | was |
| xoshayzers | listenin |
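
The loop above grows the tibble one tweet at a time, which works but is slow on a large split; tidytext can also tokenize the whole data frame in one pass. A minimal sketch of the equivalent (collection2 is only an illustrative name):

# Tokenize every tweet at once; unnest_tokens keeps the other columns (here, user)
collection2 = testText %>%
  select(user = author, content) %>%
  unnest_tokens(word, content)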

Once the data is arranged, we compare each word against the “afinn” lexicon. After the words are scored, we find the mean score of the lexicon-matched words for each user.

# Using "afinn" to find scores of words
af.sent = data.frame(get_sentiments("afinn"))

af.sen = data.frame(merge(collection, af.sent, by.x="word", by.y="word") %>% # scoring the words
  group_by(user) %>%      # group by user
  mutate(mean.af.sent = mean(score)) %>%  # find the mean score per user
  slice(which.max(mean.af.sent))  # retain only one output per user
)

kable(head(af.sen))
| word | user | score | mean.af.sent |
|---|---|---|---|
| sick | __arual | -2 | -2.000000 |
| like | DalekCaan | 2 | -0.750000 |
| nice | __Kizzle | 3 | 2.000000 |
| depressing | __laurenS | -2 | -1.666667 |
| chilling | __lozzy | -1 | 1.000000 |
| miss | __LucifersAngel | -2 | -2.000000 |

We repeat the method used for the “afinn” lexicon with the “bing” lexicon. However, since “bing” classifies words only as positive or negative, we convert the negative labels to -1 and the positive ones to 1 before taking the mean.

# Using bing to find pos/neg of words
bing.sent = data.frame(get_sentiments("bing"))

# To find if the overall message was pos/neg, turn values into 1 or -1, find mean
bi.sen = data.frame(merge(collection, bing.sent, by.x="word", by.y="word"))
bi.sen$sentiment[bi.sen$sentiment == "negative"] = -1
bi.sen$sentiment[bi.sen$sentiment == "positive"] = 1

bi.sen = bi.sen %>%
  group_by(user) %>% 
  mutate(posneg = mean(as.numeric(sentiment)))%>% 
  group_by(user) %>%
  slice(which.max(posneg))

kable(head(bi.sen))
| word | user | sentiment | posneg |
|---|---|---|---|
| sick | __arual | -1 | -1.0000000 |
| like | DalekCaan | 1 | -0.5000000 |
| cold | __Jesssicaa | -1 | -1.0000000 |
| nice | __Kizzle | 1 | 1.0000000 |
| depressing | __laurenS | -1 | -0.3333333 |
| love | __lozzy | 1 | 1.0000000 |
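
Both lexicon passes follow the same pattern: join the tokenized words against the lexicon, then average per user. A compact sketch of that pattern with inner_join and summarise, keeping only the per-user means and assuming the afinn column is named score as shown above (af.means and bi.means are illustrative names):

# afinn: average the numeric scores of the matched words per user
af.means = collection %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(user) %>%
  summarise(mean.af.sent = mean(score))

# bing: recode positive/negative to +1/-1, then average per user
bi.means = collection %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(word.sign = ifelse(sentiment == "positive", 1, -1)) %>%
  group_by(user) %>%
  summarise(posneg = mean(word.sign))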

Since the two lexicons may not place a tweet in the same category, the afinn and bing results were merged by user and compared. A new column called “match” records whether the “bing” and “afinn” values agree (yes) or not (no) and, if “yes”, a verdict is made on whether the tweet is overall positive or negative.

# Combining by user
all.sen = data.frame(merge(bi.sen, af.sen, by.x="user", by.y="user"))

all.sen2 = all.sen[,c(1,4,7)] %>% 
  group_by(user) %>%
  mutate(match = ifelse(posneg < 0 & mean.af.sent < 0, "yes", ifelse(posneg > 0 & mean.af.sent > 0, "yes", "no")))%>%
  mutate(verdict = ifelse(match == "yes" & posneg < 0, "negative", ifelse(match == "yes" & posneg > 0, "positive", "other")))

kable(head(all.sen))
| user | word.x | sentiment | posneg | word.y | score | mean.af.sent |
|---|---|---|---|---|---|---|
| __arual | sick | -1 | -1.0000000 | sick | -2 | -2.000000 |
| DalekCaan | like | 1 | -0.5000000 | like | 2 | -0.750000 |
| __Kizzle | nice | 1 | 1.0000000 | nice | 3 | 2.000000 |
| __laurenS | depressing | -1 | -0.3333333 | depressing | -2 | -1.666667 |
| __lozzy | love | 1 | 1.0000000 | chilling | -1 | 1.000000 |
| __LucifersAngel | miss | -1 | -1.0000000 | miss | -2 | -2.000000 |

Once a verdict had been reached, it was compared against the original categorization (sentiment) of the tweet.

final = merge(testText, all.sen2, by.x="author", by.y="user")[, c(1,3,7,8,5,6)] %>% 
  group_by(author) %>%
  slice(which.max(mean.af.sent))

kable(head(final))
| author | sentiment | match | verdict | posneg | mean.af.sent |
|---|---|---|---|---|---|
| __arual | negative | yes | negative | -1.0000000 | -2.000000 |
| DalekCaan | negative | yes | negative | -0.5000000 | -0.750000 |
| __Kizzle | negative | yes | positive | 1.0000000 | 2.000000 |
| __laurenS | negative | yes | negative | -0.3333333 | -1.666667 |
| __lozzy | positive | yes | positive | 1.0000000 | 1.000000 |
| __LucifersAngel | negative | yes | negative | -1.0000000 | -2.000000 |

To assess whether the predicted conclusions matched the original classifications, only the tweets that both lexicons agreed were positive or negative were used. From there, a confusion matrix was created to compare the predicted values against the actual values and to calculate the accuracy of this approach.

final.yes = filter(final, match == "yes")

pos.pos = filter(final.yes, sentiment == "positive" & verdict == "positive")
pos.neg = filter(final.yes, sentiment == "positive" & verdict == "negative")
neg.pos = filter(final.yes, sentiment == "negative" & verdict == "positive")
neg.neg = filter(final.yes, sentiment == "negative" & verdict == "negative")

conf.matrix = matrix(c(nrow(pos.pos), 
                       nrow(pos.neg), 
                       nrow(neg.pos), 
                       nrow(neg.neg)),ncol=2)
rownames(conf.matrix) = c("pred_pos", "pred_neg")      # rows are the model's verdict
colnames(conf.matrix) = c("actual_pos", "actual_neg")  # columns are the original labels

conf.matrix
##          actual_pos actual_neg
## pred_pos       2425       1687
## pred_neg        364       4352
# Number of values that match the original sentiment
table(final.yes[,2] == final.yes[,4])
## 
## FALSE  TRUE 
##  2051  6777
# Proportion of values that match the original sentiment (accuracy)
prop.table(table(final.yes[,2] == final.yes[,4]))
## 
##    FALSE     TRUE 
## 0.232329 0.767671
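
As a cross-check, the same confusion matrix and accuracy can be read directly from the agreed-upon rows with base R (rows of the table are the original sentiment, columns the verdict):

# Cross-tabulate the original label against the verdict, then compute accuracy
# as the proportion of rows where the two agree
table(final.yes$sentiment, final.yes$verdict)
mean(final.yes$sentiment == final.yes$verdict)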

Before running the prediction dataset through a sentiment-analysis function built on this approach, I was curious about which values had been omitted from the final analysis, especially after seeing that roughly one-fourth of the final dataset was discarded because the lexicons did not agree.

# proportion of values omitted
1-nrow(final.yes)/nrow(final)
## [1] 0.2514838
# omitted data
final.no = filter(final, match == "no")
kable(head(final.no))
| author | sentiment | match | verdict | posneg | mean.af.sent |
|---|---|---|---|---|---|
| _annella | negative | no | other | 0.0000000 | -0.6666667 |
| DESiMO | negative | no | other | -0.1428571 | 1.4444444 |
| Jazzi3 | positive | no | other | 0.0000000 | 1.0000000 |
| Just_Jen | negative | no | other | 0.0000000 | 0.2500000 |
| _kwaz | negative | no | other | 0.0000000 | 0.5000000 |
| _lips_xD | negative | no | other | 0.0000000 | 2.2000000 |

Looking at the rows where the lexicons did not match, it became evident that the major source of disagreement (approximately 72% of the omitted data) was one lexicon assigning a user a neutral score of 0 while the other assigned a positive or negative score. To address this, the mean of the two lexicon scores was taken and its sign (positive or negative) was compared against the original classification. The proportion of these averaged verdicts that matched the original dataset was then derived: approximately 60% were correct.

# proportion of omitted data that had at least one zero
nrow(filter(final.no, posneg == 0 | mean.af.sent == 0))/nrow(final.no)
## [1] 0.7255563
# Average the means of the data and see if they are more positive or negative
final.no2 = final.no %>%
       filter(posneg == 0 | mean.af.sent == 0) %>% 
       filter(posneg != mean.af.sent) %>%   # Drop rows where both scores are zero, since the average would give no verdict
       mutate(both.mean = (posneg+mean.af.sent)/2) %>% 
       mutate(verdict = ifelse(both.mean  < 0, "negative", "positive"))

prop.table(table(final.no2[,2] == final.no2[,4]))
## 
##     FALSE      TRUE 
## 0.3990536 0.6009464

A 60% correct prediction rate is not ideal, especially considering that the data that was not omitted scored approximately 16% higher; however, that higher accuracy can be attributed to the whittling down of the data. Adding this variability to the model makes it applicable to a wider range of tweets. As such, the omitted, re-evaluated data was added back to the previous conclusions.

all.final = bind_rows(final.yes, final.no2)

# accuracy of combined table
prop.table(table(all.final[,2] == all.final[,4]))
## 
##     FALSE      TRUE 
## 0.2618826 0.7381174

Finally, a function was created that takes the tweets from the prediction set (one-third of the original data) and returns the counts from the confusion matrix, along with the model’s percent accuracy and inaccuracy.

sent.analysis = function(x){
  all.content = as.list(x$content)
  
  collection = tibble()
  
  for(i in 1:nrow(x)) {
    
    clean = tibble(stuff = all.content[[i]]) %>%
      unnest_tokens(word, stuff) %>%
      mutate(user = x$author[i]) %>%
      select(user, everything())
    
    collection = rbind(collection, clean)
  }
 
    af.sent = data.frame(get_sentiments("afinn"))
    af.sen = data.frame(merge(collection, af.sent, by.x="word", by.y="word") %>% 
                          group_by(user) %>% 
                          mutate(mean.af.sent = mean(score)) %>%
                          slice(which.max(mean.af.sent))
    )
    
    bing.sent = data.frame(get_sentiments("bing"))
    bi.sen = data.frame(merge(collection, bing.sent, by.x="word", by.y="word"))
    bi.sen$sentiment[bi.sen$sentiment == "negative"] = -1
    bi.sen$sentiment[bi.sen$sentiment == "positive"] = 1
    
    
    bi.sen = bi.sen %>%
      group_by(user) %>% 
      mutate(posneg = mean(as.numeric(sentiment)))%>% 
      group_by(user) %>%
      slice(which.max(posneg))
    
    all.sen = data.frame(merge(bi.sen, af.sen, by.x="user", by.y="user")[,c(1,4,7)]) %>% 
      group_by(user) %>%
      mutate(match = ifelse(posneg < 0 & mean.af.sent < 0, "yes", ifelse(posneg > 0 & mean.af.sent > 0, "yes", "no")))%>%
      mutate(verdict = ifelse(match == "yes" & posneg < 0, "negative", ifelse(match == "yes" & posneg > 0, "positive", "other")))

    final = merge(x, all.sen, by.x="author", by.y="user")[, c(1,3,7,8,5,6)] %>% 
      group_by(author) %>%
      slice(which.max(mean.af.sent))
    
    final.yes = filter(final, match == "yes")
    final.no = filter(final, match == "no")%>%
      filter(posneg == 0 | mean.af.sent == 0) %>% 
      filter(posneg != mean.af.sent) %>% 
      mutate(both.mean = (posneg+mean.af.sent)/2) %>% 
      mutate(verdict = ifelse(both.mean  < 0, "negative", "positive"))
    # Combine the agreed-upon rows with the re-evaluated disagreements
    # (final.no here, not the final.no2 object from the global environment)
    all.final = bind_rows(final.yes, final.no)
    
    pos.pos = filter(all.final, sentiment == "positive" & verdict == "positive")
    pos.neg = filter(all.final, sentiment == "positive" & verdict == "negative")
    neg.pos = filter(all.final, sentiment == "negative" & verdict == "positive")
    neg.neg = filter(all.final, sentiment == "negative" & verdict == "negative")
    
    accuracy = (nrow(pos.pos)+nrow(neg.neg))/(nrow(pos.pos)+ nrow(pos.neg)+ nrow(neg.pos)+ nrow(neg.neg))*100
    
    conf.matrix = matrix(c(nrow(pos.pos), 
                           nrow(pos.neg), 
                           nrow(neg.pos), 
                           nrow(neg.neg),
                           accuracy,
                           100-accuracy), ncol=6)
    colnames(conf.matrix) = c("true_pos", "false_neg", "false_pos", "true_neg", "percent_accuracy", "percent_inaccuracy")

    return(conf.matrix)
}

sent.analysis(predText)
##      true_pos false_neg false_pos true_neg percent_accuracy
## [1,]     4320       344      1306     1094         76.64213
##      percent_inaccuracy
## [1,]           23.35787

Using the larger analysis split (testText), since it had more values, the confusion matrix proportions were plotted as a mosaic plot. The plot showed that the model was better at correctly identifying negative tweets, whereas positive tweets had approximately a fifty percent chance of being categorized correctly.

sen.verd = table(all.final$verdict, all.final$sentiment)
mosaicplot(sen.verd, main = "Confusion Matrix Proportions ", xlab = "sentiment", ylab = "verdict", col = c(2,4))

**Conclusion**

Using the “afinn” and “bing” lexicons to analyze tweets led to a model that, after the necessary omissions, predicted the correct category for roughly 75% of the data. However, by the end of the analysis it became clear that this approach was not optimal for various reasons, the main one being the omissions. The omissions fell mostly into three categories: Pre-evaluation, Lexicon and Disagreeable.

In the Pre-evaluation stage, tweets that were deemed “neutral” could not be arbitrarily assigned to either the positive or the negative category, simply because they did not fit in either, for the reasons explained at the beginning of the assignment. As such, they were removed. A future endeavor could counter this by building a sentiment analysis that first separates the data into emotive (positive/negative) tweets and neutral ones.

In the Lexicon stage, tweets whose words appeared in neither (or only one) of the lexicons were automatically dropped from the data. The issue stems from the way tweets are written: the majority of the content includes colloquialisms and purposely or accidentally misspelled words. Few if any of these are found in the lexicons and, thus, a sizeable portion of the data was lost. If a lexicon were created that included phrases often used in comment sections, or one that recognized words regardless (for example) of whether consonants or vowels had been deliberately elongated, the approach used in this assignment would be more appropriate; a rough sketch of the second idea follows.
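
As a rough sketch of that second idea (a possible preprocessing step, not something used in this analysis), runs of three or more repeated letters could be collapsed before the lexicon join, so that elongated spellings at least have a chance of matching:

# Collapse runs of 3+ identical letters: "sooooo" -> "so", "yayyyy" -> "yay"
# (crude: it can over-shorten words such as "coooool" -> "col", but it recovers
# some elongated spellings that the lexicons would otherwise miss)
collection$word = gsub("(.)\\1{2,}", "\\1", collection$word)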

Finally, in the Disagreeable stage, when the lexicons were at odds over whether a term was positive or negative, the data was removed because there was no way to determine which lexicon was more accurate. On some occasions each lexicon matched a different word from the same tweet, which made comparing the lexicons’ stances even harder.

Overall, considering the data that was selected and the lexicons used, this method performed admirably. Its main drawback is that it is best suited to more formal text, and it would benefit from a third lexicon to break the tie when the other two disagree; a rough sketch of that idea follows.
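
A rough sketch of the three-lexicon idea, adding tidytext’s “nrc” lexicon (filtered to its positive/negative terms; in newer tidytext versions this lexicon may require the textdata package) as a tie-breaker. The object and column names here are illustrative only, and this was not run as part of the analysis:

# Score each user with a third lexicon, then take a sign-based majority vote
nrc.pn = get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  mutate(nrc.sign = ifelse(sentiment == "positive", 1, -1)) %>%
  select(word, nrc.sign)

vote = collection %>%
  inner_join(nrc.pn, by = "word") %>%
  group_by(user) %>%
  summarise(nrc.mean = mean(nrc.sign)) %>%
  inner_join(af.sen, by = "user") %>%   # afinn per-user means from above
  inner_join(bi.sen, by = "user") %>%   # bing per-user means from above
  mutate(votes = sign(nrc.mean) + sign(mean.af.sent) + sign(posneg),
         verdict = ifelse(votes > 0, "positive",
                          ifelse(votes < 0, "negative", "other")))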