Comparing sentiment scores between the HONY website and Twitter

Blog post 6, comparing sentiment scores between HONY website and Twitter data, written as part of the course “Text as Data”

Rahul Gundeti (graduate student, Data Analytics & Computational Social Sciences (DACSS), UMass Amherst)
2022-05-03
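
The packages below are assumed to be loaded before any of the code that follows. The original post doesn't show its setup chunk, so this list is inferred from the functions used later (e.g., chartJSRadar() comes from radarchart, kable() from knitr); treat it as an assumption rather than the exact setup.

# Packages assumed for this analysis (inferred from the function calls below)
library(tm)                 # VCorpus, tm_map, DocumentTermMatrix, TermDocumentMatrix
library(quanteda)           # corpus, tokens, dfm, fcm
library(quanteda.textplots) # textplot_network
library(tidytext)           # unnest_tokens, get_sentiments, stop_words
library(tibble)             # tibble, view
library(dplyr)
library(tidyr)              # spread
library(ggplot2)
library(wordcloud)          # wordcloud, comparison.cloud
library(RColorBrewer)       # brewer.pal
library(reshape2)           # acast
library(knitr)              # kable
library(radarchart)         # chartJSRadar
library(rtweet)             # get_timeline
library(ggpubr)             # ggpie (called below as ggpubr::ggpie)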

Creation of VCorpus

hony_unclean <- VCorpus(VectorSource(readLines("C:/Users/gunde/Documents/hony.txt")))
hony_unclean
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1607

Inspecting a document from the corpus

writeLines(head(strwrap(hony_unclean[[4]]), 7))
A few months after Eduardos case I went to a music festival.
It wasnt normally my kind of scene. It was on the Jersey
Shore. There were a lot of glowsticks and temporary tattoos.
But I was twenty-six. I had to do something on the weekends.
Everyone in my group seemed to know each other except for me
and a girl named Kristen. We were the bring-alongs, so we
kinda got stuck together. Kristens only 53. And shes

Pre-processing the corpus

# Clean text file and pre-process for word cloud
# Convert to lowercase
hony_clean_corpus <- tm_map(hony_unclean, content_transformer(tolower))
# Remove numbers
hony_clean_corpus <- tm_map(hony_clean_corpus, removeNumbers)
# Remove English stop words (e.g., "and", "the", "of") plus custom terms
hony_clean_corpus <- tm_map(hony_clean_corpus, removeWords, c(stopwords("english"), "im", "didnt", "couldnt","wasnt", "id", "ive", "everi", "tri", "hed", "hes", "everyth", "wed", "someth", "togeth", "noth", "rememb", "cri", "â", "anoth", "marri", "eventu", "especi", "emot", "isnt", "dont", "mother"))
# Remove words like "you'll", "will", "anyways", etc.
hony_clean_corpus <- tm_map(hony_clean_corpus, removeWords, stopwords("SMART"))
# Remove commas, periods, etc.
hony_clean_corpus <- tm_map(hony_clean_corpus, removePunctuation)
#Strip unnecessary whitespace
hony_clean_corpus <- tm_map(hony_clean_corpus, stripWhitespace)
class(hony_clean_corpus)
[1] "VCorpus" "Corpus" 
inspect(hony_clean_corpus[3])
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 1047

Inspecting the cleaned corpus after pre-processing

writeLines(head(strwrap(hony_clean_corpus[[7]])))
john makes fun mom humility put letter billboard put initials
license plate porsche doesnt understand real estate works good
humility mouse hiding rug talks mouse rug list house mouse
list house boss lady began ten years ago million sales past
years number real estate agent augusta finally paid house
bought houses put kids private school paid sports activities

Creation of DFM

docs1 <- c(hony_clean_corpus)
doc_corpus <- corpus(docs1)
docs_tokens <- tokens(doc_corpus)
docs_tokens
Tokens consisting of 1,607 documents and 7 docvars.
text1 :
[1] "hony"    "stories" "dataset"

text2 :
 [1] "early"   "days"    "kristen" "write"   "single"  "email"  
 [7] "type"    "hit"     "send"    "late"    "night"   "glass"  
[ ... and 123 more ]

text3 :
 [1] "quit"          "jobs"          "nervewracking" "remember"     
 [5] "day"           "wore"          "suit"          "meeting"      
 [9] "coffee"        "shop"          "wore"          "suit"         
[ ... and 141 more ]

text4 :
 [1] "months"     "eduardos"   "case"       "music"      "festival"  
 [6] "kind"       "scene"      "jersey"     "shore"      "lot"       
[11] "glowsticks" "temporary" 
[ ... and 150 more ]

text5 :
 [1] "eduardo"     "nervous"     "office"      "barely"     
 [5] "spoke"       "english"     "told"        "story"      
 [9] "interpreter" "explained"   "hometown"    "colombia"   
[ ... and 133 more ]

text6 :
 [1] "tripp"     "prison"    "sat"       "kids"      "told"     
 [6] "loved"     "chosen"    "conceived" "sperm"     "donor"    
[11] "thought"   "hard"     
[ ... and 121 more ]

[ reached max_ndoc ... 1,601 more documents ]
docs_dfm <- docs_tokens %>%
  tokens_wordstem() %>%
  dfm()
docs_dfm
Document-feature matrix of: 1,607 documents, 8,725 features (99.28% sparse) and 7 docvars.
       features
docs    honi stori dataset earli day kristen write singl email type
  text1    1     1       1     0   0       0     0     0     0    0
  text2    0     0       0     1   1       3     1     1     1    1
  text3    0     0       0     0   1       2     0     0     0    0
  text4    0     1       0     0   0       5     0     0     1    0
  text5    0     1       0     0   1       0     0     0     0    0
  text6    0     0       0     0   0       0     0     0     0    0
[ reached max_ndoc ... 1,601 more documents, reached max_nfeat ... 8,715 more features ]

Creating DTM

dtm = DocumentTermMatrix(hony_clean_corpus)
dtm
<<DocumentTermMatrix (documents: 1607, terms: 13070)>>
Non-/sparse entries: 103798/20899692
Sparsity           : 100%
Maximal term length: 36
Weighting          : term frequency (tf)

Creating a word frequency data frame

# Create data frame with words and frequency of occurrence
tdm = TermDocumentMatrix(docs1)

tdm2 = as.matrix(tdm)
words = sort(rowSums(tdm2), decreasing = TRUE)
df = data.frame(word = names(words), freq = words)
dim(df)
[1] 13070     2

Top 50 words by frequency

# Word frequency table
head(df, 50)
             word freq
time         time 1311
people     people 1101
years       years  875
day           day  831
back         back  795
life         life  786
told         told  776
things     things  640
wanted     wanted  586
started   started  551
school     school  548
home         home  541
lot           lot  535
work         work  502
make         make  467
night       night  467
family     family  462
feel         feel  449
thing       thing  445
love         love  434
knew         knew  432
felt         felt  420
good         good  419
made         made  415
thought   thought  409
mom           mom  404
shes         shes  370
house       house  362
money       money  358
friends   friends  348
shed         shed  344
kids         kids  340
hard         hard  329
job           job  328
father     father  325
world       world  303
long         long  292
year         year  291
asked       asked  288
called     called  288
left         left  278
dad           dad  275
remember remember  271
finally   finally  268
gave         gave  268
youre       youre  264
working   working  262
ill           ill  261
man           man  258
entire     entire  253

Final word cloud after cleaning

# Create word cloud
set.seed(5000)
wordcloud(docs1
    , scale=c(2,0.5)     
    , max.words=300      
    , random.order=FALSE 
    , rot.per=0.20       
    , use.r.layout=FALSE 
    , colors=brewer.pal(8, "Set2"))

Barplot of Top 50 Most Frequent Words

# Plot of most frequently used words
barplot(df[1:50,]$freq, las=2, names.arg = df[1:50,]$word,
        col="white", main="Top 50 Most Frequent Words",
        ylab="Word frequencies")

Plotting an NRC radar chart of the sentiment distribution

df %>%
  # implement sentiment analysis using the "nrc" lexicon
  inner_join(get_sentiments("nrc")) %>%
  # remove "positive/negative" sentiments
  filter(!sentiment %in% c("positive", "negative", "neutral")) %>%
  #get the frequencies of sentiments
  count(sentiment,sort = T) %>% 
  #calculate the proportion
  mutate(percent=100*n/sum(n)) %>%
  select(sentiment, percent) %>%
  #plot the result
  chartJSRadar(showToolTipLabel = TRUE, main = "NRC Radar")

Creating FCM

# create fcm from dfm
smaller_fcm <- fcm(docs_dfm)

# check the dimensions (i.e., the number of rows and the number of columns)
# of the matrix we created
dim(smaller_fcm)
[1] 8725 8725

Creating a smaller FCM for the network plot

# pull the top features
myFeatures <- names(topfeatures(smaller_fcm, 25))

# retain only those top features as part of our matrix
even_smaller_fcm <- fcm_select(smaller_fcm, pattern = myFeatures, selection = "keep")

# check dimensions
dim(even_smaller_fcm)
[1] 25 25
# compute size weight for vertices in network
size <- log(colSums(even_smaller_fcm))

# create plot
textplot_network(even_smaller_fcm, 
                 min_freq = 5, 
                 edge_alpha = 0.5, 
                 edge_size = 1,
                 edge_color = "black",
                 vertex_labelsize = log(rowSums(even_smaller_fcm))*0.75)

Accessing Twitter API tokens
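
The actual API keys are not shown here. As a rough sketch, authentication with rtweet's create_token() (a pre-1.0 version of rtweet, consistent with the 90-column get_timeline() output below) could look like the following; the app name and all key and secret strings are placeholders, not real credentials.

# Authenticate with the Twitter API via rtweet (placeholder credentials)
twitter_token <- create_token(
  app             = "my_hony_app",          # hypothetical app name
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)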

Scraping the timeline and retweets of the HONY Twitter handle

t_hony <- get_timeline("humansofny", n = 3200, retweets = TRUE)

Printing Tweets

print(t_hony)
# A tibble: 3,199 x 90
   user_id   status_id    created_at          screen_name text  source
   <chr>     <chr>        <dttm>              <chr>       <chr> <chr> 
 1 237548529 15185585768~ 2022-04-25 11:52:32 humansofny  "Our~ Twitt~
 2 237548529 15158700641~ 2022-04-18 01:49:21 humansofny  "@TE~ Twitt~
 3 237548529 15085431545~ 2022-03-28 20:34:50 humansofny  "(4/~ Twitt~
 4 237548529 15085216216~ 2022-03-28 19:09:16 humansofny  "(3/~ Twitt~
 5 237548529 15085038151~ 2022-03-28 17:58:30 humansofny  "(2/~ Twitt~
 6 237548529 15084785526~ 2022-03-28 16:18:07 humansofny  "(1/~ Twitt~
 7 237548529 15026978255~ 2022-03-12 17:27:35 humansofny  "@mk~ Twitt~
 8 237548529 15026619784~ 2022-03-12 15:05:08 humansofny  "@cr~ Twitt~
 9 237548529 14995731454~ 2022-03-04 02:31:13 humansofny  "(13~ Twitt~
10 237548529 14995592910~ 2022-03-04 01:36:10 humansofny  "(12~ Twitt~
# ... with 3,189 more rows, and 84 more variables:
#   display_text_width <dbl>, reply_to_status_id <chr>,
#   reply_to_user_id <chr>, reply_to_screen_name <chr>,
#   is_quote <lgl>, is_retweet <lgl>, favorite_count <int>,
#   retweet_count <int>, quote_count <int>, reply_count <int>,
#   hashtags <list>, symbols <list>, urls_url <list>,
#   urls_t.co <list>, urls_expanded_url <list>, media_url <list>, ...

Preprocessing & Tokenization

# Restructure the tweets into a one-token-per-row format
tidy_tweets <- t_hony %>% # pipe the data frame
  filter(is_retweet == TRUE) %>% # keep rows flagged as retweets
  select(status_id, text) %>% # select variables of interest
  unnest_tokens(word, text) # split the text column into one token per row
tidy_tweets
# A tibble: 3,195 x 2
   status_id           word        
   <chr>               <chr>       
 1 1518558576847077378 our         
 2 1518558576847077378 radio       
 3 1518558576847077378 podcast     
 4 1518558576847077378 host        
 5 1518558576847077378 chionwolf   
 6 1518558576847077378 will        
 7 1518558576847077378 be          
 8 1518558576847077378 sharing     
 9 1518558576847077378 a           
10 1518558576847077378 conversation
# ... with 3,185 more rows

Inspecting the built-in stop words

stop_words
# A tibble: 1,149 x 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
# ... with 1,139 more rows

Creating a custom stop word data frame

my_stop_words <- tibble( #construct a dataframe
  word = c(
    "https",
    "t.co",
    "rt",
    "amp",
    "rstats",
    "gt"
  ),
  lexicon = "twitter"
)

Combining the stop word data frames

# Connect stop words
all_stop_words <- stop_words %>%
  bind_rows(my_stop_words) # here we are connecting two data frames

# Let's see if it worked
view(all_stop_words)

# Remove numbers
tidy_tweets <- tidy_tweets %>%
    filter(is.na(as.numeric(word))) # remember filter() returns rows where conditions are true

Converting tokens to a vector

tidytweetsText <- as.vector(tidy_tweets$word) # after unnest_tokens() the tokenized column is "word", not "text"
head(tidy_tweets,5)
# A tibble: 5 x 2
  status_id           word     
  <chr>               <chr>    
1 1518558576847077378 our      
2 1518558576847077378 radio    
3 1518558576847077378 podcast  
4 1518558576847077378 host     
5 1518558576847077378 chionwolf

Removing stopwords

tweets_final <- tidy_tweets %>%
  anti_join(all_stop_words, by = "word")

Sentiment Analysis using the NRC dictionary

# NRC Lexicon terms
# Get the negative and positive sentiments word list 
nrc_sent <-get_sentiments("nrc") %>%
    filter(sentiment %in% c("positive", "negative", "anger", "sadness", "trust", "fear", "disgust", "joy", "surprise")) %>%
    count(word, sentiment, sort=T) %>%
    ungroup()
# Inner join words with NRC lexicon
nrc_df <- df %>% inner_join(nrc_sent)
# Plot of negative and positive sentiments
nrc_df %>%
    group_by(sentiment) %>%
    do(head(., n=10)) %>% # top 10 words
    ungroup() %>%
    mutate(word = reorder(word, freq)) %>%
    ggplot(aes(word, freq, fill=sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~sentiment, scales = "free_y") +
    labs(y = "Contribution to sentiment (NRC lexicon)", x=NULL) +coord_flip()

Measuring the net sentiment score using NRC

# NRC: net sentiment for the website stories
nrc_df %>% group_by(sentiment) %>%
    summarize(total = sum(n)) %>%
    spread(sentiment, total) %>%
    mutate(net.sentiment = positive - negative) %>%
    kable(align = 'r')

| anger| disgust| fear| joy| negative| positive| sadness| surprise| trust| net.sentiment|
|-----:|-------:|----:|---:|--------:|--------:|-------:|--------:|-----:|-------------:|
|   477|     344|  581| 370|     1099|     1078|     497|      241|   611|           -21|

NRC comparison word cloud

# Generate a comparison word cloud
set.seed(123)
nrc_df %>% 
    acast(word ~ sentiment, value.var = "freq", fill=0) %>%
    comparison.cloud(colors = brewer.pal(8,"Set1")
      ,scale =c(5,0.5), rot.per=0.75, title.size=0.75, max.words=5000)

Sentiment Analysis of tweets using NRC

nrc <- get_sentiments("nrc")%>%  # get specific sentiment lexicons in a tidy format
    filter(sentiment %in% c("positive", "negative", "anger", "sadness", "trust", "fear", "disgust", "joy", "surprise")) 

view(nrc)

nrc_words <- tweets_final %>%
  inner_join(nrc, by="word")

view(nrc_words)

pie_words<- nrc_words %>%
  group_by(sentiment) %>% # group by sentiment type
  tally %>% # counts number of rows
  arrange(desc(n)) # arrange sentiments in descending order based on frequency

ggpubr::ggpie(pie_words, "n", label = "sentiment", 
      fill = "sentiment", color = "white", 
      palette = "Spectral")
# NRC: net sentiment for the tweets
pie_words %>% group_by(sentiment) %>%
    summarize(total = sum(n)) %>%
    spread(sentiment, total) %>%
    mutate(net.sentiment = (positive + joy + surprise + trust) - (negative + anger + disgust + fear + sadness)) %>%
    kable(align = 'r')

| anger| disgust| fear| joy| negative| positive| sadness| surprise| trust| net.sentiment|
|-----:|-------:|----:|---:|--------:|--------:|-------:|--------:|-----:|-------------:|
|    23|      10|   32|  80|       43|      150|      27|       33|    84|           212|

Sentiment Analysis using Bing

# Bing Lexicon terms
bing_sent <- df %>%
    inner_join(get_sentiments("bing")) %>%
    count(word, sentiment, sort=T) %>%
    ungroup()
# Inner join words with Bing lexicon
bing_df <- df %>% inner_join(bing_sent)
# Plot positive and negative sentiments
bing_df %>%
    group_by(sentiment) %>%
    do(head(., n=20)) %>% # top 20 words
    ungroup() %>%
    mutate(word = reorder(word, freq)) %>%
    ggplot(aes(word, freq, fill=sentiment)) +
    geom_col(show.legend = F) +
    facet_wrap(~sentiment, scales = "free_y") +
    labs(y = "Contribution to sentiment (Bing lexicon)", x=NULL) +
    coord_flip()

Net sentiment score using Bing

bing_df %>% group_by(sentiment) %>%
    summarize(total = sum(n)) %>%
    spread(sentiment, total) %>%
    mutate(net.sentiment = positive - negative) %>%
    kable(align = 'l')

|negative |positive |net.sentiment |
|:--------|:--------|:-------------|
|1148     |674      |-474          |

Bing comparison word cloud

# Generate a comparison word cloud
set.seed(12345)
bing_df %>% 
    acast(word ~ sentiment, value.var = "freq", fill=0) %>%
    comparison.cloud(colors = brewer.pal(8,"Set1")
      ,scale =c(5,.5), rot.per=0.1, title.size=2, max.words=1000)

Sentiment Analysis of Tweets using Bing

bing <- get_sentiments("bing")%>%
  count(word, sentiment, sort=T)
bing
# A tibble: 6,786 x 3
   word        sentiment     n
   <chr>       <chr>     <int>
 1 2-faces     negative      1
 2 abnormal    negative      1
 3 abolish     negative      1
 4 abominable  negative      1
 5 abominably  negative      1
 6 abominate   negative      1
 7 abomination negative      1
 8 abort       negative      1
 9 aborted     negative      1
10 aborts      negative      1
# ... with 6,776 more rows
bing_words <- tweets_final %>%
  inner_join(bing, by="word")

view(bing_words)


pie_words<- bing_words %>%
  group_by(sentiment) %>% # group by sentiment type
  tally %>% # counts number of rows
  arrange(desc(n)) # arrange sentiments in descending order based on frequency

ggpubr::ggpie(pie_words, "n", label = "sentiment", 
      fill = "sentiment", color = "white", 
      palette = "Spectral")
bing_words %>% group_by(sentiment) %>%
    summarize(total = sum(n)) %>%
    spread(sentiment, total) %>%
    mutate(net.sentiment = positive - negative) %>%
    kable(align = 'l')

|negative |positive |net.sentiment |
|:--------|:--------|:-------------|
|37       |95       |58            |

Sentiment Analysis using AFINN

# AFINN lexicon terms
afinn_df <- df %>% inner_join(get_sentiments("afinn")) %>%
    mutate(sentiment = case_when(value < 0 ~ 'negative', 
                                 value > 0 ~ 'positive'))
# Plot positive and negative sentiments   
afinn_df %>%
    group_by(sentiment) %>%
    do(head(., n=20)) %>% # top 20 words
    ungroup() %>%
    mutate(word = reorder(word, freq)) %>%
    ggplot(aes(word, freq, fill=sentiment)) +
    geom_col(show.legend = F) +
    facet_wrap(~sentiment, scales = "free_y") +
    labs(y = "Contribution to sentiment (AFINN lexicon)", x=NULL) +
    coord_flip()

AFINN comparison word cloud and net sentiment score

# Generate a comparison word cloud
set.seed(12345)
afinn_df %>% 
    acast(word ~ sentiment, value.var = "freq", fill=0) %>%
    comparison.cloud(colors = brewer.pal(7,"Set1")
      ,scale =c(5,.5), rot.per=0.10, title.size=2, max.words=1000)
afinn_net <- afinn_df %>%
    group_by(sentiment) %>%
    summarize(total = sum(value)) %>%
    spread(sentiment, total) %>%
    mutate(net.sentiment = positive - abs(negative)) %>%
    kable(align = 'l')
afinn_net

|negative |positive |net.sentiment |
|:--------|:--------|:-------------|
|-1488    |995      |-493          |

Sentiment Analysis of tweets using AFINN

afinn_df <- tweets_final %>% inner_join(get_sentiments("afinn")) %>%
    mutate(sentiment = case_when(value < 0 ~ 'negative', 
                                 value > 0 ~ 'positive'))

view(afinn_df)


pie_words <- afinn_df %>%
  group_by(sentiment) %>% # group by sentiment type
  tally %>% # counts number of rows
  arrange(desc(n)) # arrange sentiments in descending order based on frequency

ggpubr::ggpie(pie_words, "n", label = "sentiment", 
      fill = "sentiment", color = "white", 
      palette = "Spectral")
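
For completeness, the net AFINN score for the tweets could be computed the same way as for the website stories above. A minimal sketch (output not shown, since it was not part of the original post):

# Net AFINN sentiment for the tweets, mirroring the website calculation
afinn_df %>%
    group_by(sentiment) %>%
    summarize(total = sum(value)) %>%
    spread(sentiment, total) %>%
    mutate(net.sentiment = positive - abs(negative)) %>%
    kable(align = 'l')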

Conclusion:

The website skews toward negative sentiment, while Twitter skews positive. The stories featured on the website are one-way communication in which the reader's opinion isn't heard, which suggests two possibilities: either the author emphasizes the negative aspects of the stories, or the stories themselves are filled with sad, difficult lives and the tough times people were facing. Twitter, on the other hand, is a two-way medium where the stories are shared and the audience's tweeted responses are also taken into account. On close inspection, the stories remain the same, but the response to them is the opposite of what we saw on the website. Even though the stories largely carry negative sentiment, their influence on the audience who read them turned out to be positive, which is a very noble thing if you ask me!
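
To put the numbers behind this comparison in one place, here is a small sketch that collects the net sentiment scores reported earlier into a single table. The values are copied from the outputs above; the tweets/AFINN cell is left as NA because that score was not computed in this post.

# Net sentiment scores reported above, side by side
net_scores <- tibble(
  lexicon = c("NRC", "Bing", "AFINN"),
  website = c(-21, -474, -493),
  tweets  = c(212, 58, NA)   # tweets AFINN net score not computed above
)
kable(net_scores, align = 'r')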