Blog post 6: comparing sentiment scores between the HONY website and Twitter data, as part of the course “Text as Data”.
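The post does not show the package setup, so here is a minimal sketch of the libraries the code below appears to rely on (an assumption based on the functions that are called; exact versions may differ):
library(tm)                  # VCorpus, tm_map, DocumentTermMatrix, TermDocumentMatrix
library(quanteda)            # corpus, tokens, dfm, fcm
library(quanteda.textplots)  # textplot_network (a separate package in recent quanteda releases)
library(wordcloud)           # wordcloud, comparison.cloud
library(RColorBrewer)        # brewer.pal color palettes
library(tidytext)            # unnest_tokens, get_sentiments, stop_words
library(dplyr)               # pipes, filter, mutate, count
library(tidyr)               # spread
library(ggplot2)             # bar charts and facets
library(reshape2)            # acast
library(knitr)               # kable
library(radarchart)          # chartJSRadar
library(rtweet)              # get_timeline
(ggpubr is called below with the ggpubr:: prefix, so it only needs to be installed.)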
Creation of VCorpus
hony_unclean <- VCorpus(VectorSource(readLines("C:/Users/gunde/Documents/hony.txt")))
hony_unclean
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1607
Calling a text from the corpus
writeLines(head(strwrap(hony_unclean[[4]]), 7))
A few months after Eduardos case I went to a music festival.
It wasnt normally my kind of scene. It was on the Jersey
Shore. There were a lot of glowsticks and temporary tattoos.
But I was twenty-six. I had to do something on the weekends.
Everyone in my group seemed to know each other except for me
and a girl named Kristen. We were the bring-alongs, so we
kinda got stuck together. Kristens only 53. And shes
Preprocessing the corpus
# Clean text file and pre-process for word cloud
# Convert to lowercase
hony_clean_corpus <- tm_map(hony_unclean, content_transformer(tolower))
# Remove numbers
hony_clean_corpus <- tm_map(hony_clean_corpus, removeNumbers)
# Remove common stop words such as "and", "the", "of", along with some leftover contractions
hony_clean_corpus <- tm_map(hony_clean_corpus, removeWords, c(stopwords("english"), "im", "didnt", "couldnt","wasnt", "id", "ive", "everi", "tri", "hed", "hes", "everyth", "wed", "someth", "togeth", "noth", "rememb", "cri", "â", "anoth", "marri", "eventu", "especi", "emot", "isnt", "dont", "mother"))
# Remove words like "you'll", "will", "anyways", etc.
hony_clean_corpus <- tm_map(hony_clean_corpus, removeWords, stopwords("SMART"))
# Remove commas, periods, etc.
hony_clean_corpus <- tm_map(hony_clean_corpus, removePunctuation)
# Strip unnecessary whitespace
hony_clean_corpus <- tm_map(hony_clean_corpus, stripWhitespace)
class(hony_clean_corpus)
[1] "VCorpus" "Corpus"
inspect(hony_clean_corpus[3])
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 1047
Calling the cleaned corpus after pre-processing
writeLines(head(strwrap(hony_clean_corpus[[7]])))
john makes fun mom humility put letter billboard put initials
license plate porsche doesnt understand real estate works good
humility mouse hiding rug talks mouse rug list house mouse
list house boss lady began ten years ago million sales past
years number real estate agent augusta finally paid house
bought houses put kids private school paid sports activities
Creation of DFM
docs1 <- c(hony_clean_corpus)
doc_corpus <- corpus(docs1)
docs_tokens <- tokens(doc_corpus)
docs_tokens
Tokens consisting of 1,607 documents and 7 docvars.
text1 :
[1] "hony" "stories" "dataset"
text2 :
[1] "early" "days" "kristen" "write" "single" "email"
[7] "type" "hit" "send" "late" "night" "glass"
[ ... and 123 more ]
text3 :
[1] "quit" "jobs" "nervewracking" "remember"
[5] "day" "wore" "suit" "meeting"
[9] "coffee" "shop" "wore" "suit"
[ ... and 141 more ]
text4 :
[1] "months" "eduardos" "case" "music" "festival"
[6] "kind" "scene" "jersey" "shore" "lot"
[11] "glowsticks" "temporary"
[ ... and 150 more ]
text5 :
[1] "eduardo" "nervous" "office" "barely"
[5] "spoke" "english" "told" "story"
[9] "interpreter" "explained" "hometown" "colombia"
[ ... and 133 more ]
text6 :
[1] "tripp" "prison" "sat" "kids" "told"
[6] "loved" "chosen" "conceived" "sperm" "donor"
[11] "thought" "hard"
[ ... and 121 more ]
[ reached max_ndoc ... 1,601 more documents ]
docs_dfm <- docs_tokens %>%
tokens_wordstem() %>%
dfm()
docs_dfm
Document-feature matrix of: 1,607 documents, 8,725 features (99.28% sparse) and 7 docvars.
features
docs honi stori dataset earli day kristen write singl email type
text1 1 1 1 0 0 0 0 0 0 0
text2 0 0 0 1 1 3 1 1 1 1
text3 0 0 0 0 1 2 0 0 0 0
text4 0 1 0 0 0 5 0 0 1 0
text5 0 1 0 0 1 0 0 0 0 0
text6 0 0 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 1,601 more documents, reached max_nfeat ... 8,715 more features ]
Creating DTM
dtm = DocumentTermMatrix(hony_clean_corpus)
dtm
<<DocumentTermMatrix (documents: 1607, terms: 13070)>>
Non-/sparse entries: 103798/20899692
Sparsity : 100%
Maximal term length: 36
Weighting : term frequency (tf)
Creating DataFrame
# Create data frame with words and frequency of occurrence
tdm = TermDocumentMatrix(docs1)
tdm2 = as.matrix(tdm)
words = sort(rowSums(tdm2), decreasing = TRUE)
df = data.frame(word = names(words), freq = words)
dim(df)
[1] 13070 2
Featuring the top 50 words by frequency
# Word frequency table
head(df, 50)
word freq
time time 1311
people people 1101
years years 875
day day 831
back back 795
life life 786
told told 776
things things 640
wanted wanted 586
started started 551
school school 548
home home 541
lot lot 535
work work 502
make make 467
night night 467
family family 462
feel feel 449
thing thing 445
love love 434
knew knew 432
felt felt 420
good good 419
made made 415
thought thought 409
mom mom 404
shes shes 370
house house 362
money money 358
friends friends 348
shed shed 344
kids kids 340
hard hard 329
job job 328
father father 325
world world 303
long long 292
year year 291
asked asked 288
called called 288
left left 278
dad dad 275
remember remember 271
finally finally 268
gave gave 268
youre youre 264
working working 262
ill ill 261
man man 258
entire entire 253
Final word cloud after cleaning
# Create word cloud
set.seed(5000)
wordcloud(docs1
, scale=c(2,0.5)
, max.words=300
, random.order=FALSE
, rot.per=0.20
, use.r.layout=FALSE
, colors=brewer.pal(8, "Set2"))
Barplot of Top 50 Most Frequent Words
# Plot of most frequently used words
barplot(df[1:50,]$freq, las=2, names.arg = df[1:50,]$word,
col="white", main="Top 50 Most Frequent Words",
ylab="Word frequencies")
Plotting an NRC radar chart of the sentiments
df %>%
# implement sentiment analysis using the "nrc" lexicon
inner_join(get_sentiments("nrc")) %>%
# remove "positive/negative" sentiments
filter(!sentiment %in% c("positive", "negative", "neutral")) %>%
#get the frequencies of sentiments
count(sentiment,sort = T) %>%
#calculate the proportion
mutate(percent=100*n/sum(n)) %>%
select(sentiment, percent) %>%
#plot the result
chartJSRadar(showToolTipLabel = TRUE, main = "NRC Radar")
Creating FCM
# create fcm from dfm
smaller_fcm <- fcm(docs_dfm)
# check the dimensions (i.e., the number of rows and the number of columns)
# of the matrix we created
dim(smaller_fcm)
[1] 8725 8725
Creating a smaller FCM to plot a text network
# pull the top features
myFeatures <- names(topfeatures(smaller_fcm, 25))
# retain only those top features as part of our matrix
even_smaller_fcm <- fcm_select(smaller_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(even_smaller_fcm)
[1] 25 25
# compute size weight for vertices in network
size <- log(colSums(even_smaller_fcm))
# create plot
textplot_network(even_smaller_fcm,
min_freq = 5,
edge_alpha = 0.5,
edge_size = 1,
edge_color = "black",
vertex_labelsize = log(rowSums(even_smaller_fcm))*0.75)
Accessing Twitter API tokens
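The actual credentials are not included in the post; the sketch below shows how an rtweet token could be created, with placeholder values standing in for the real keys (the app name and all key strings are hypothetical):
# Authenticate with the Twitter API via rtweet (placeholder credentials)
token <- rtweet::create_token(
  app = "hony_sentiment_app",        # hypothetical app name
  consumer_key = "API_KEY",          # placeholder
  consumer_secret = "API_SECRET",    # placeholder
  access_token = "ACCESS_TOKEN",     # placeholder
  access_secret = "ACCESS_SECRET")   # placeholder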
Scraping the timeline and retweets of the HONY Twitter handle
t_hony <- get_timeline("humansofny", n = 3200, retweets =T )
Printing Tweets
print(t_hony)
# A tibble: 3,199 x 90
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 237548529 15185585768~ 2022-04-25 11:52:32 humansofny "Our~ Twitt~
2 237548529 15158700641~ 2022-04-18 01:49:21 humansofny "@TE~ Twitt~
3 237548529 15085431545~ 2022-03-28 20:34:50 humansofny "(4/~ Twitt~
4 237548529 15085216216~ 2022-03-28 19:09:16 humansofny "(3/~ Twitt~
5 237548529 15085038151~ 2022-03-28 17:58:30 humansofny "(2/~ Twitt~
6 237548529 15084785526~ 2022-03-28 16:18:07 humansofny "(1/~ Twitt~
7 237548529 15026978255~ 2022-03-12 17:27:35 humansofny "@mk~ Twitt~
8 237548529 15026619784~ 2022-03-12 15:05:08 humansofny "@cr~ Twitt~
9 237548529 14995731454~ 2022-03-04 02:31:13 humansofny "(13~ Twitt~
10 237548529 14995592910~ 2022-03-04 01:36:10 humansofny "(12~ Twitt~
# ... with 3,189 more rows, and 84 more variables:
# display_text_width <dbl>, reply_to_status_id <chr>,
# reply_to_user_id <chr>, reply_to_screen_name <chr>,
# is_quote <lgl>, is_retweet <lgl>, favorite_count <int>,
# retweet_count <int>, quote_count <int>, reply_count <int>,
# hashtags <list>, symbols <list>, urls_url <list>,
# urls_t.co <list>, urls_expanded_url <list>, media_url <list>, ...
Preprocessing & Tokenization
# Restructure the tweets into a one-token-per-row format
tidy_tweets <- t_hony %>% # pipe data frame
filter(is_retweet == TRUE) %>% # keep retweets only (set to FALSE to keep original tweets instead)
select(status_id,
text)%>% # select variables of interest
unnest_tokens(word, text) # splits column in one token per row format
tidy_tweets
# A tibble: 3,195 x 2
status_id word
<chr> <chr>
1 1518558576847077378 our
2 1518558576847077378 radio
3 1518558576847077378 podcast
4 1518558576847077378 host
5 1518558576847077378 chionwolf
6 1518558576847077378 will
7 1518558576847077378 be
8 1518558576847077378 sharing
9 1518558576847077378 a
10 1518558576847077378 conversation
# ... with 3,185 more rows
Calling stopwords
stop_words
# A tibble: 1,149 x 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# ... with 1,139 more rows
Creating a DataFrame
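The custom stop-word data frame my_stop_words that is bound to stop_words below is not shown in the original post; a minimal sketch, assuming it collects a few Twitter-specific noise terms (the exact words are an assumption):
# Hypothetical custom stop words for the tweet text; same columns as tidytext::stop_words
my_stop_words <- tibble::tibble(
  word = c("rt", "https", "t.co", "amp"),  # assumed Twitter-specific tokens
  lexicon = "custom")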
Connecting stopwords to DF
# Connect stop words
all_stop_words <- stop_words %>%
bind_rows(my_stop_words) # here we are connecting two data frames
# Let's see if it worked
view(all_stop_words)
# Remove numbers
tidy_tweets <- tidy_tweets %>%
filter(is.na(as.numeric(word))) # remember filter() returns rows where conditions are true
Converting to vector
# A tibble: 5 x 2
status_id word
<chr> <chr>
1 1518558576847077378 our
2 1518558576847077378 radio
3 1518558576847077378 podcast
4 1518558576847077378 host
5 1518558576847077378 chionwolf
Removing stopwords
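The step that actually removes the stop words and creates the tweets_final object used in the sentiment sections below is missing from the post; a plausible sketch, assuming an anti_join against the combined stop-word list:
# Drop stop words from the tokenized tweets (assumed step; tweets_final is the
# object the NRC, Bing, and AFINN analyses below join against)
tweets_final <- tidy_tweets %>%
  anti_join(all_stop_words, by = "word")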
Sentiment Analysis using the NRC dictionary
# NRC Lexicon terms
# Get the negative and positive sentiments word list
nrc_sent <-get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative", "anger", "sadness", "trust", "fear", "disgust", "joy", "surprise")) %>%
count(word, sentiment, sort=T) %>%
ungroup()
# Inner join words with NRC lexicon
nrc_df <- df %>% inner_join(nrc_sent)
# Plot of negative and positive sentiments
nrc_df %>%
group_by(sentiment) %>%
do(head(., n=10)) %>% # top 10 words
ungroup() %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq, fill=sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment (NRC lexicon)", x=NULL) +coord_flip()
Measuring the net sentiment score using NRC
# NRC:
nrc_df %>% group_by(sentiment) %>%
summarize(total=sum(n)) %>%
spread(sentiment, total) %>%
mutate((net.sentiment=positive-negative)) %>%
kable(align = 'r')
| anger | disgust | fear | joy | negative | positive | sadness | surprise | trust | (net.sentiment = positive - negative) |
|---|---|---|---|---|---|---|---|---|---|
| 477 | 344 | 581 | 370 | 1099 | 1078 | 497 | 241 | 611 | -21 |
NRC wordcloud
# Generate a comparison word cloud
set.seed(123)
nrc_df %>%
acast(word ~ sentiment, value.var = "freq", fill=0) %>%
comparison.cloud(colors = brewer.pal(8,"Set1")
,scale =c(5,0.5), rot.per=0.75, title.size=0.75, max.words=5000)
Sentiment Analysis of tweets using NRC
nrc <- get_sentiments("nrc")%>% # get specific sentiment lexicons in a tidy format
filter(sentiment %in% c("positive", "negative", "anger", "sadness", "trust", "fear", "disgust", "joy", "surprise"))
view(nrc)
nrc_words <- tweets_final %>%
inner_join(nrc, by="word")
view(nrc_words)
pie_words<- nrc_words %>%
group_by(sentiment) %>% # group by sentiment type
tally %>% # counts number of rows
arrange(desc(n)) # arrange sentiments in descending order based on frequency
ggpubr::ggpie(pie_words, "n", label = "sentiment",
fill = "sentiment", color = "white",
palette = "Spectral")
# NRC:
pie_words %>% group_by(sentiment) %>%
summarize(total=sum(n)) %>%
spread(sentiment, total) %>%
mutate((net.sentiment=(positive+joy+surprise+trust)-(negative+anger+disgust+fear+sadness))) %>%
kable(align = 'r')
| anger | disgust | fear | joy | negative | positive | sadness | surprise | trust | (net.sentiment = (positive + joy + surprise + trust) - (negative + anger + disgust + fear + sadness)) |
|---|---|---|---|---|---|---|---|---|---|
| 23 | 10 | 32 | 80 | 43 | 150 | 27 | 33 | 84 | 212 |
Sentiment Analysis using BING
# Bing Lexicon terms
bing_sent <- df %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=T) %>%
ungroup()
# Inner join words with Bing lexicon
bing_df <- df %>% inner_join(bing_sent)
# Plot positive and negative sentiments
bing_df %>%
group_by(sentiment) %>%
do(head(., n=20)) %>% # top 20 words
ungroup() %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq, fill=sentiment)) +
geom_col(show.legend = F) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment (Bing lexicon)", x=NULL) +
coord_flip()
Net sentiment score using BING
bing_df %>% group_by(sentiment) %>%
summarize(total=sum(n)) %>%
spread(sentiment, total) %>%
mutate((net.sentiment=positive-negative)) %>%
kable(align = 'l')
| negative | positive | (net.sentiment = positive - negative) |
|---|---|---|
| 1148 | 674 | -474 |
BING wordcloud
# Generate a comparison word cloud
set.seed(12345)
bing_df %>%
acast(word ~ sentiment, value.var = "freq", fill=0) %>%
comparison.cloud(colors = brewer.pal(8,"Set1")
,scale =c(5,.5), rot.per=0.1, title.size=2, max.words=1000)
Sentiment Analysis of Tweets using BING
bing <- get_sentiments("bing")%>%
count(word, sentiment, sort=T)
bing
# A tibble: 6,786 x 3
word sentiment n
<chr> <chr> <int>
1 2-faces negative 1
2 abnormal negative 1
3 abolish negative 1
4 abominable negative 1
5 abominably negative 1
6 abominate negative 1
7 abomination negative 1
8 abort negative 1
9 aborted negative 1
10 aborts negative 1
# ... with 6,776 more rows
bing_words <- tweets_final %>%
inner_join(bing, by="word")
view(bing_words)
pie_words<- bing_words %>%
group_by(sentiment) %>% # group by sentiment type
tally %>% # counts number of rows
arrange(desc(n)) # arrange sentiments in descending order based on frequency
ggpubr::ggpie(pie_words, "n", label = "sentiment",
fill = "sentiment", color = "white",
palette = "Spectral")
bing_words %>% group_by(sentiment) %>%
summarize(total=sum(n)) %>%
spread(sentiment, total) %>%
mutate((net.sentiment=positive-negative)) %>%
kable(align = 'l')
| negative | positive | (net.sentiment = positive - negative) |
|---|---|---|
| 37 | 95 | 58 |
Sentiment Analysis using AFINN
# AFINN lexicon terms
afinn_df <- df %>% inner_join(get_sentiments("afinn")) %>%
mutate(sentiment = case_when(value < 0 ~ 'negative',
value > 0 ~ 'positive'))
# Plot positive and negative sentiments
afinn_df %>%
group_by(sentiment) %>%
do(head(., n=20)) %>% # top 20 words
ungroup() %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq, fill=sentiment)) +
geom_col(show.legend = F) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment (AFINN lexicon)", x=NULL) +
coord_flip()
AFINN wordcloud
# Generate a comparison word cloud
set.seed(12345)
afinn_df %>%
acast(word ~ sentiment, value.var = "freq", fill=0) %>%
comparison.cloud(colors = brewer.pal(7,"Set1")
,scale =c(5,.5), rot.per=0.10, title.size=2, max.words=1000)
afinn_net <- afinn_df %>%
group_by(sentiment) %>%
summarize(total=sum(value)) %>%
spread(sentiment, total) %>%
mutate((net.sentiment=positive - abs(negative))) %>%
kable(align = 'l')
afinn_net
| negative | positive | (net.sentiment = positive - abs(negative)) |
|---|---|---|
| -1488 | 995 | -493 |
Sentiment Analysis of tweets using AFINN
afinn_df <- tweets_final %>% inner_join(get_sentiments("afinn")) %>%
mutate(sentiment = case_when(value < 0 ~ 'negative',
value > 0 ~ 'positive'))
view(afinn_df)
pie_words <- afinn_df %>%
group_by(sentiment) %>% # group by sentiment type
tally %>% # counts number of rows
arrange(desc(n)) # arrange sentiments in descending order based on frequency
ggpubr::ggpie(pie_words, "n", label = "sentiment",
fill = "sentiment", color = "white",
palette = "Spectral")
The website leans toward negative sentiment, while Twitter leans toward positive sentiment: across the three lexicons the website text comes out net negative (NRC -21, Bing -474, AFINN -493), whereas the tweets come out net positive (NRC +212, Bing +58). The stories featured on the website are one-way communication, where the reader's opinion isn't heard. This suggests two things: either the author stresses the negative aspects of the stories for some reason, or the stories themselves are filled with the sad, gloomy lives and tough times the people were facing. Twitter, on the other hand, is a two-way medium, where the stories are shared and the responses tweeted by the audience are also taken into account. If we observe carefully, the stories remain the same, but the response to them turns out to be the opposite of what we saw with the website. Even though the stories largely carry negative sentiment, their influence on the audience who read them turned out to be positive, which is a very noble thing if you ask me!