Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

1.Introduction

In the previous study, I tried to gain insight into the anxiety patients forum by finding the most frequent words in each group and plotting them either in bar plots or word clouds.

Such a bag-of-words model may serve as a first step, but I want to go further. In this study, the sentiments of the words are analyzed using two lexicons: nrc and afinn. To do so, the ready-to-use lexicon data frames are joined with the tokenized text, where each word is a token, and the overall sentiment is evaluated.
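In rough outline, the pipeline looks like the sketch below. This is a condensed version of the full code at the end of the post; it assumes a data frame named data with a posts column holding the cleaned forum text.

library(dplyr)
library(tidytext)

data("stop_words")

# tokenize each post into words, drop stop words, attach lexicon sentiments
data_sentiment <- data %>%
        unnest_tokens(output = word, input = posts, token = "words") %>%
        anti_join(stop_words, by = "word") %>%
        inner_join(get_sentiments("nrc"), by = "word")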

2.Sentiment Analysis using nrc lexicon

An interesting lexicon is nrc, which includes sentiments such as negative, positive, fear, anger, joy, disgust and so on. Hence, it is fairly explanatory for the text. Some words have more than one sentiment value attached to them. What I do here is calculate the most frequent sentiments in the texts of each membership category. Below, one can see some examples of the nrc sentiment dataset, the sentiments it contains, and the frequent sentiments of each membership category.

## # A tibble: 6 x 2
##        word sentiment
##       <chr>     <chr>
## 1    abacus     trust
## 2   abandon      fear
## 3   abandon  negative
## 4   abandon   sadness
## 5 abandoned     anger
## 6 abandoned      fear
##  [1] "trust"        "fear"         "negative"     "sadness"     
##  [5] "anger"        "surprise"     "positive"     "disgust"     
##  [9] "joy"          "anticipation"
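The frequent sentiments of each membership category are obtained by counting the sentiment labels within each group; a minimal sketch, using the data_sentiment object from the sketch above (the full code in the appendix plots these counts per group with ggplot2):

data_sentiment %>%
        count(membership, sentiment, sort = TRUE)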

What can be seen? While Junior and Middle members are quite similar, the Senior members have “positive” sentiment at the top, and “trust” is above “fear”. It seems that the senior members are literally more positive, and more successful in coping with anxiety.

3.Sentiment Analysis using afinn lexicon

Now, let’s change the lexicon for the sentiment analysis. Each lexicon may give us new insight into the data, so the afinn lexicon is chosen for the next step. The lexicon assigns integer scores from -5 to +5 to words, from the most negative to the most positive sentiments.

Here I decide to use this lexicon in this way:
1. Assign an afinn score to each word posted by each membership group.
2. Divide the sum of the sentiment scores of each group by the number of words in that group.

In other words, I calculate the average sentiment score per word used by each group, so one average is computed for every group (see the sketch below).
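A minimal sketch of that averaging, assuming the afinn-joined data_sentiment data frame built in the code appendix (the full code reaches the same result with mutate, select and distinct):

data_sentiment %>%
        group_by(membership) %>%
        summarise(avg_score_per_word = mean(score),
                  sd = sd(score))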

## # A tibble: 6 x 2
##         word score
##        <chr> <int>
## 1    abandon    -2
## 2  abandoned    -2
## 3   abandons    -2
## 4   abducted    -2
## 5  abduction    -2
## 6 abductions    -2
## # A tibble: 4 x 3
## # Groups:   membership [4]
##      membership avg_score_per_word       sd
##          <fctr>              <dbl>    <dbl>
## 1 Senior Member         -0.3640351 2.056119
## 2 Junior Member         -0.6321774 1.979563
## 3        Member         -0.6843575 1.915097
## 4         Guest         -0.9000000 1.943951

The result is interesting. While in general the sentiment per word of every category is negative, the Senior members have the most positive (least negative) score per word, followed by the junior and middle members, whose average scores are very close, and the most negative sentiment score belongs to the guests. Since the number of guest posts is very low, it is better to ignore this group. Nonetheless, the ranking of the members based on average sentiment score is interesting, and compatible with our former insights. Previously, we had found that the language of the senior members is different, according to the word clouds and word frequencies.

4.Conclusion

Having studied the anxiety forum using a bag-of-words model, and specifically word frequencies, I took one step further and used sentiment analysis to evaluate the text and compare the membership groups. Two lexicons were used, nrc and afinn, and the results show the more positive language of the senior members compared to the junior and middle members. The other available lexicons, loughran and bing, do not seem very useful here: the former is aimed at financial text analysis, and the latter has only two sentiments, positive and negative.

While such sentiment analysis definitely sheds light on the data from new perspectives, it is still a reductionist approach to the text, since it works at the word level without considering the relations among words in phrases, passages and so on. Hence, the results should be evaluated very cautiously, and not taken as certain evidence.

In the next study, I venture into a more holistic approach and use bi-grams and tri-grams to analyze the anxiety forum text.

Code

library(rvest)
library(stringr)
library(dplyr)
library(ggplot2)
library(colorRamps)
library(SnowballC)

library(tidytext)
library(RColorBrewer)
library(wordcloud)
library(gridExtra)
#url = "http://www.ibsgroup.org/forums/topic/141800-16-and-suffering/"
#url = "https://www.r-bloggers.com/scraping-web-pages-with-r/"
url = "http://anxietyforum.net/forum/showthread.php?8-How-are-you-coping-with-anxiety/"

posts = character()
location = character()
membership = character()
posts2 = character()

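# scrape the first 50 pages of the thread: post text, poster location and membership level per post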
for (i in 1:50){
        url = "http://anxietyforum.net/forum/showthread.php?8-How-are-you-coping-with-anxiety/"
        url = paste0(url,"page",i)
        #print(url)
        html <- read_html(url)
        
        content_node <- html_nodes(html , ".postcontent")
        posts <- append(posts,html_text(content_node)) 
        
        location_node <- html_nodes(html, ".post_field:nth-child(2) dd")
        location <- append(location, html_text(location_node) )
        
        member_node <- html_nodes(html, ".usertitle")
         membership <- append(membership, html_text(member_node) )
} 



print(paste0("The number of retrieved posts: ",length(posts)))
print("-----------------")
#length(location)
#length(membership)

#sum(membership == "\nGuest\n")
print("A few samples of the forum's posts:")
posts[1:5]

posts = str_replace_all(posts,"\n","")
#posts[1:5]
#now let's remove the emojis from the text. The emojis happen to have a pattern starting with a colon and ending with a colon.

# [^:]+ keeps each match within a single colon-delimited token, so surrounding text is not stripped
posts = str_replace_all(posts,":[^:]+:","")
# at last there is a "Cath" word regarding some other emojis presumably. let's remove it as well
posts = str_replace_all(posts,"Cath","")

#str_extract(posts, "(.){10}")
# drop posts flagged as "Last edited", keeping posts and membership aligned
keep_posts <- !str_detect(posts, "Last edited")
posts <- posts[keep_posts]
membership <- membership[keep_posts]
membership = str_replace_all(membership,"\n","")

data <- data.frame(membership = membership, posts = posts)
data$posts <- as.character(data$posts)

data <- data %>% filter(membership %in% c("Guest","Junior Member","Member","Senior Member"))

data$posts <- str_replace_all(string = data$posts , 
                              pattern = "\\W",
                              replacement = " ")

#replacing the numbers with white space 
data$posts <- str_replace_all(string = data$posts ,
                              pattern = "[0-9]+",
                              replacement = " ")
data("stop_words")
nrc<-get_sentiments("nrc")
head(nrc)
unique(nrc$sentiment)

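# tokenize the posts into words, remove stop words and attach nrc sentiments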
data_sentiment <- data %>% 
        group_by(membership) %>% 
        unnest_tokens(output = word , input = posts , token = "words") %>%
        anti_join(stop_words) %>% 
        inner_join(nrc) 

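# bar plots of nrc sentiment frequencies, one per membership group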
g1<- data_sentiment %>%
        group_by(membership) %>% 
        count(sentiment, sort = TRUE) %>% 
        filter(membership == "Junior Member") %>% 
        ggplot() + 
        geom_col(aes(y = n , x = reorder(sentiment,n)),
                 fill = "green") +
        coord_flip() + 
        theme_linedraw() + 
        xlab(label = "sentiments") + 
        ggtitle("Junior Member Posts")

g2 <- data_sentiment %>%
        group_by(membership) %>% 
        count(sentiment, sort = TRUE) %>% 
        filter(membership == "Member") %>% 
        ggplot() + 
        geom_col(aes(y = n , x = reorder(sentiment,n)),
                 fill = "orange") +
        coord_flip() + 
        theme_linedraw() + 
        xlab(label = "sentiments") + 
        ggtitle("Middle Member Posts")


g3<- data_sentiment %>%
        group_by(membership) %>% 
        count(sentiment, sort = TRUE) %>% 
        filter(membership == "Senior Member") %>% 
        ggplot() + 
        geom_col(aes(y = n , x = reorder(sentiment,n)),
                 fill = "blue") +
        coord_flip() + 
        theme_linedraw() + 
        xlab(label = "sentiments") + 
        ggtitle("Senior Member Posts")

grid.arrange(g1,g2,g3, nrow = 2)
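# load the afinn lexicon (one integer sentiment score per word)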
afinn <- get_sentiments("afinn")
head(afinn)

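# tokenize the posts, remove stop words and attach afinn scores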
data_sentiment <- data %>% 
        group_by(membership) %>% 
        unnest_tokens(output = word , input = posts , token = "words") %>%
        anti_join(stop_words) %>% 
        inner_join(afinn) 

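# average afinn score per word and its standard deviation for each membership group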
data_sentiment %>% 
        group_by(membership) %>% 
        mutate(avg_score_per_word = sum(score)/n() , sd = sd(score) ) %>%
        select(membership,avg_score_per_word,sd) %>% 
        distinct() 


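# bar plot of the average afinn score per word by membership group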
data_sentiment %>% 
        group_by(membership) %>% 
        mutate(avg_score_per_word = sum(score)/n() ) %>%
        select(membership,avg_score_per_word) %>% 
        distinct() %>% 
        arrange(avg_score_per_word) %>%
        ggplot() + 
        geom_col(aes(y = avg_score_per_word ,
                     x = reorder(membership,avg_score_per_word)),
                 fill = "darkred") +
        coord_flip() + 
        theme_linedraw() + 
        xlab(label = "Membership") + 
        ggtitle("Avergate Afinn Score per Word")