This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is the third part of a text analysis of anxiety-related posts scraped from a public forum. In the previous studies, the data was analyzed using word frequency and sentiment evaluation. In this study, I venture into n-gram analysis, specifically bigrams and trigrams, as well as bigram network visualization. Let's read some quotes from Julia Silge on n-gram analysis:
So far we’ve considered words as individual units, and considered their relationships to sentiments or to documents. However, many interesting text analyses are based on the relationships between words, whether examining which words tend to follow others immediately, or that tend to co-occur within the same documents.
In this chapter, we’ll explore some of the methods tidytext offers for calculating and visualizing relationships between words in your text dataset.
This includes the token = “ngrams” argument, which tokenizes by pairs of adjacent words rather than by individual ones.
We’ll also introduce two new packages: ggraph, which extends ggplot2 to construct network plots, and widyr, which calculates pairwise correlations and distances within a tidy data frame.
We’ve been using the unnest_tokens function to tokenize by word, or sometimes by sentence, which is useful for the kinds of sentiment and frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called n-grams.
By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.
We do this by adding the token = “ngrams” option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of two consecutive words, often called “bigrams”:
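Before turning to the forum data, here is a minimal sketch of what that tokenization produces, using a hypothetical toy data frame rather than the forum posts:

library(dplyr)
library(tidytext)
# a one-row toy data frame standing in for the forum posts
toy <- data.frame(text = "i am coping with anxiety one day at a time",
                  stringsAsFactors = FALSE)
toy %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
# returns one row per pair of adjacent words:
# "i am", "am coping", "coping with", "with anxiety", ...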
So the bigrams are ready, and "panic attack" is at the top for the junior and middle groups. In contrast, for the senior members "tea form" is on top, followed by "form exercise", "panic attack", "cbt book" and "sam obitz". The latter is a writer of behavioural therapy books! The first two groups talk about the problem, either panic attack or anxiety attack, and the medications for it. The middle members mention natural remedies in their posts, a shift from the junior members' posts. In the posts of senior members, there is no medication at the top of the list; instead they have more therapy, tea, and breathing!
A new, and very important, insight! The "tea" is indeed TEA: Thoughts Error Analysis! I was so ignorant at the start of this analysis that I thought it was the beverage! The "mint breath spray" and "natural pen" are commercial remedies that seemingly help with anxiety. It is interesting that the most common trigram of the junior members is about the problem, "anxiety panic attack", while the senior members talk about the solution, "TEA form exercise".
It is also possible to filter the trigrams, for instance keeping only those that include "med" in the second or third word; the filter below produced the table that follows.
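This uses the same separate-then-filter idiom as the rest of the post (the full code is at the end):

triwords_df %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  # substring match, so words like "named" also pass the filter
  filter(str_detect(word2, "med") | str_detect(word3, "med")) %>%
  head(10)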
## # A tibble: 10 x 4
##       membership     word1      word2      word3
##           <fctr>     <chr>      <chr>      <chr>
##  1 Junior Member     music     zoloft meditation
##  2 Junior Member    zoloft meditation   computer
##  3 Junior Member     daily       type        med
##  4 Junior Member   talking    writing meditating
##  5 Junior Member character      named         ir
##  6 Junior Member       ear medication     called
##  7 Junior Member   healthy       yoga meditation
##  8 Junior Member      yoga meditation    therapy
##  9 Junior Member    taking medication     classe
## 10 Junior Member       sig       file medication
It is not very accurate to evaluate words out of context. Perhaps the most basic way to bring some context in is to consider the negation words in sentences, which can flip the sentiment of a sentence or word. For instance, the word "not" can negate the sentiment of the following words; sentiment analysis at the single-word level does not account for this.
Here, I am going to see which frequent words come after "not", and what their total sentiment score is. Using such totals, it would be possible to adjust the per-word-token sentiment scores of the previous study.
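A sketch of this computation, assuming the bigrams have already been split into word1/word2 columns as in the full code at the end of the post; not_words and contribution are my labels for this illustration, not names from the earlier studies:

afinn <- get_sentiments("afinn")   # newer tidytext versions name the score column "value"
not_words <- not_freq_bigrams %>%
  filter(word1 == "not") %>%
  inner_join(afinn, by = c(word2 = "word")) %>%   # sentiment of the word following "not"
  count(word2, score, sort = TRUE) %>%
  mutate(contribution = n * score)                # total misattributed sentiment per word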
As we can see, words should be evaluated in their context. For instance, the word "good" is evaluated as positive in one-gram sentiment analysis; however, many occurrences of "good" follow negation words such as "not". This result can be used to adjust the average sentiment scores per word that we computed earlier. Nevertheless, here the adjustment per word seems infinitesimal.
At last, we can get a better, more holistic perspective of the relations between pairs of words: the bigrams can be visualized as a network of words.
bigram_net <- biwords_df %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)
summary(bigram_net$n)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   1.000   1.347   1.000  74.000
The distribution of bigram counts is extremely right-skewed. The mean is 1.35, and we only want the frequent pairs, which hopefully carry the meaning of the text.
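Keeping only the pairs seen more than four times, the counts convert directly into a graph object and a network plot (this condenses the full ggraph code at the end of the post):

library(igraph)
library(ggraph)
bigram_igraph <- bigram_net %>%
  filter(n > 4) %>%            # drop the long tail of one-off pairs
  graph_from_data_frame()      # columns: from, to, plus n as an edge attribute
set.seed(7)
ggraph(bigram_igraph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()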
Very interesting! A much better perspective than the previous approaches. It is possible to get insight into the posts by examining this map. For example, "baby step" is probably the tempo of change that the users suggest to each other, "close friend" is a remedy for anxiety, "stopped taking lexapro" is self-evident, and "breathing", "form", and "cbt" are related concepts.
As a final step, let's draw the same bigram network for each membership category.
We supposed that senior members are more "positive", according to our previous analyses. Here, we can see that they have probably "stopped taking" various medicines, and most of their graph is about "coping mechanism", "deep breathing", "tea form exercise", "cbt" and "Sam Obitz", the author of the TEA exercise book!
What about the middle members?
Middle members have very few posts in the forum; it seems their main concern is the "panic attack", and they mention some natural remedies, "natural pen" and "cbd oil", as well as the "beta blocker" medication.
The majority of forum posts are from junior members. In contrast to senior members, they seemingly have not yet found their way to cope with anxiety, and their approaches are very different: for instance, look at "started taking lexapro" versus "stopped taking lexapro". They apparently talk about their experiences: "feel anxious", "panic attack", "bad anxiety", "mood trigger", "heart rate"... and they have some suggestions: "breathing technique", "breathing exercise", "video game".
This study is the third part of a text analysis of posts from a medical forum, related to "coping with anxiety". In the previous studies, I used word frequency, word clouds, and sentiment analysis. This study focused on n-grams and network visualization of bigrams.
N-grams can give a much more holistic insight into the text, since they retain more of the relations between words. Among all the results, the networks are the most revealing: they clearly show the different approaches the membership groups take to control and cope with anxiety.
library(rvest)
library(stringr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(tidytext)
library(igraph)       # graph_from_data_frame(), used for the bigram networks
library(ggraph)       # ggplot2 extension for network plots
library(colorRamps)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)
#url = "http://www.ibsgroup.org/forums/topic/141800-16-and-suffering/"
#url = "https://www.r-bloggers.com/scraping-web-pages-with-r/"
base_url = "http://anxietyforum.net/forum/showthread.php?8-How-are-you-coping-with-anxiety/"
posts = character()
location = character()
membership = character()
for (i in 1:50){
  url = paste0(base_url, "page", i)
  #print(url)
  html <- read_html(url)
  # post bodies
  content_node <- html_nodes(html, ".postcontent")
  posts <- append(posts, html_text(content_node))
  # poster's location field
  location_node <- html_nodes(html, ".post_field:nth-child(2) dd")
  location <- append(location, html_text(location_node))
  # membership level (Guest / Junior Member / Member / Senior Member)
  member_node <- html_nodes(html, ".usertitle")
  membership <- append(membership, html_text(member_node))
}
#print(paste0("The number of retrieved posts: ",length(posts)))
#print("-----------------")
#length(location)
#length(membership)
#sum(membership == "\nGuest\n")
#print("A few samples of the forum's posts:")
#posts[1:5]
posts = str_replace_all(posts, "\n", "")
#posts[1:5]
# now let's remove the emojis from the text; they happen to follow a pattern starting and ending with a colon
posts = str_replace_all(posts, ":[^:]+:", "")   # [^:]+ matches one emoji code at a time; ".+" would greedily eat everything between the first and last colon
# at last there is a "Cath" word regarding some other emojis presumably; let's remove it as well
posts = str_replace_all(posts, "Cath", "")
#str_extract(posts, "(.){10}")
# drop posts that carry a "Last edited" footer
last_edited <- which(str_detect(posts, "Last edited"))
if (length(last_edited) > 0) posts <- posts[-last_edited]
membership = str_replace_all(membership, "\n", "")
data <- data.frame(membership, posts)
data$posts <- as.character(data$posts)
data <- data %>% filter(membership %in% c("Guest", "Junior Member", "Member", "Senior Member"))
# replace non-word characters with white space
data$posts <- str_replace_all(string = data$posts,
                              pattern = "\\W",
                              replacement = " ")
# replace the numbers with white space
data$posts <- str_replace_all(string = data$posts,
                              pattern = "[0-9]+",
                              replacement = " ")
data("stop_words")
bigram_df <- data %>%
  # strip punctuation and the quote header inserted by the forum software
  mutate(posts = str_replace_all(string = posts,
                                 pattern = "[[:punct:]]",
                                 " ")) %>%
  mutate(posts = str_replace_all(string = posts,
                                 pattern = "Originally Posted",
                                 "")) %>%
  unnest_tokens(output = bigram,
                input = posts,
                token = "ngrams",
                n = 2)
biwords_df <- bigram_df %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # drop bigrams that contain a stop word
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word) %>%
  # crude singularization: strip a trailing "s"
  mutate(word2 = str_replace_all(string = word2, pattern = "s$", "")) %>%
  # normalize the misspelling "panick" to "panic"
  mutate(word1 = str_replace_all(string = word1, pattern = "panick", "panic")) %>%
  mutate(word2 = str_replace_all(string = word2, pattern = "panick", "panic")) %>%
  unite(bigram, word1, word2, sep = " ")
junior_bigram <- biwords_df %>%
  filter(membership == "Junior Member") %>%
  count(bigram, sort = TRUE)
filter_bigrams <- c("originally posted", "ve found")
junior_bigram <- junior_bigram %>%
  filter(!bigram %in% filter_bigrams)
junior_plot <- junior_bigram %>%
  head(10) %>%
  ggplot() +
  geom_col(aes(y = n, x = reorder(bigram, n)),
           fill = "green") +
  coord_flip() +
  theme_linedraw() +
  xlab(label = "Bigrams") +
  ggtitle("Junior Member Bigrams")
#----- Middle members
member_bigram <- biwords_df %>%
  filter(membership == "Member") %>%
  count(bigram, sort = TRUE)
member_filter_bigrams <- c("originally posted")
member_bigram <- member_bigram %>%
  filter(!bigram %in% member_filter_bigrams)
member_plot <- member_bigram %>%
  head(10) %>%
  ggplot() +
  geom_col(aes(y = n, x = reorder(bigram, n)),
           fill = "orange") +
  coord_flip() +
  theme_linedraw() +
  xlab(label = "Bigrams") +
  ggtitle("Middle Member Bigrams")
#---- senior members
senior_bigram <- biwords_df %>%
  filter(membership == "Senior Member") %>%
  count(bigram, sort = TRUE)
senior_filter_bigrams <- c("originally posted")
senior_bigram <- senior_bigram %>%
  filter(!bigram %in% senior_filter_bigrams)
senior_plot <- senior_bigram %>%
  head(10) %>%
  ggplot() +
  geom_col(aes(y = n, x = reorder(bigram, n)),
           fill = "blue") +
  coord_flip() +
  theme_linedraw() +
  xlab(label = "Bigrams") +
  ggtitle("Senior Member Bigrams")
grid.arrange(junior_plot, member_plot, senior_plot, nrow = 2)
trigram_df <- data %>%
  mutate(posts = str_replace_all(string = posts,
                                 pattern = "[[:punct:]]",
                                 " ")) %>%
  mutate(posts = str_replace_all(string = posts,
                                 pattern = "Originally Posted",
                                 "")) %>%
  unnest_tokens(output = trigram,
                input = posts,
                token = "ngrams",
                n = 3)
triwords_df <- trigram_df %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word &
         !word2 %in% stop_words$word &
         !word3 %in% stop_words$word) %>%
  mutate(word2 = str_replace_all(string = word2, pattern = "s$", "")) %>%
  mutate(word1 = str_replace_all(string = word1, pattern = "s$", "")) %>%
  mutate(word3 = str_replace_all(string = word3, pattern = "s$", "")) %>%
  unite(trigram, word1, word2, word3, sep = " ") %>%
  mutate(trigram = str_replace_all(string = trigram,
                                   pattern = "panick",
                                   "panic"))
# ---- Junior Members
junior_trigram <- triwords_df %>%
  filter(membership == "Junior Member") %>%
  count(trigram, sort = TRUE)
junior_plot <- junior_trigram %>%
  head(10) %>%
  ggplot() +
  geom_col(aes(y = n, x = reorder(trigram, n)),
           fill = "green") +
  coord_flip() +
  theme_linedraw() +
  xlab(label = "Trigrams") +
  ggtitle("Junior Member Trigrams")
# ----- Middle Members
member_trigram <- triwords_df %>%
  filter(membership == "Member") %>%
  count(trigram, sort = TRUE)
member_plot <- member_trigram %>%
  head(10) %>%
  ggplot() +
  geom_col(aes(y = n, x = reorder(trigram, n)),
           fill = "orange") +   # orange to match the middle members' bigram plot
  coord_flip() +
  theme_linedraw() +
  xlab(label = "Trigrams") +
  ggtitle("Middle Member Trigrams")
# ----- Senior Members
senior_trigram <- triwords_df %>%
  filter(membership == "Senior Member") %>%
  count(trigram, sort = TRUE)
senior_plot <- senior_trigram %>%
  head(10) %>%
  ggplot() +
  geom_col(aes(y = n, x = reorder(trigram, n)),
           fill = "blue") +
  coord_flip() +
  theme_linedraw() +
  xlab(label = "Trigrams") +
  ggtitle("Senior Member Trigrams")
grid.arrange(junior_plot, member_plot, senior_plot, nrow = 2)
triwords_df %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(str_detect(word2, "med") | str_detect(word3, "med")) %>%
  head(10)
not_freq_bigrams <- data %>%
  # drop the standalone article "a", so that e.g. "not a fan" yields the bigram "not fan"
  mutate(posts = str_replace_all(string = posts,
                                 pattern = "\\sa\\s",
                                 " ")) %>%
  mutate(posts = str_replace_all(string = posts,
                                 pattern = "[[:punct:]]",
                                 " ")) %>%
  mutate(posts = str_replace_all(string = posts,
                                 pattern = "Originally Posted",
                                 "")) %>%
  unnest_tokens(output = bigram,
                input = posts,
                token = "ngrams",
                n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ")
# AFINN assigns each word an integer sentiment score
# (newer tidytext versions name this column "value" instead of "score")
afinn <- get_sentiments("afinn")
# keep only bigrams that start with a negation word
not_freq_bigrams <- not_freq_bigrams %>%
  filter(word1 == "dont" |
         word1 == "not" |
         word1 == "wouldnt" |
         word1 == "shouldnt")
not_freq_bigrams_stem <- not_freq_bigrams %>%
  mutate(word2 = str_replace_all(word2,
                                 pattern = "ing$", replacement = ""))
not_freq_bigrams_stem <- not_freq_bigrams_stem %>%
  anti_join(stop_words, by = c(word2 = "word")) %>%
  count(word2, sort = TRUE)
top_10 <- not_freq_bigrams %>%
  inner_join(afinn, by = c(word2 = "word")) %>%
  count(word2, score, sort = TRUE) %>%
  mutate(x = n * score) %>%   # total contribution = frequency * sentiment score
  arrange(desc(x)) %>%
  head(10)
bottom_10 <- not_freq_bigrams %>%
  inner_join(afinn, by = c(word2 = "word")) %>%
  count(word2, score, sort = TRUE) %>%
  mutate(x = n * score) %>%
  arrange(x) %>%
  head(10)
rbind(top_10, bottom_10) %>%
  mutate(fill = ifelse(x > 0, "positive score", "negative score")) %>%
  ggplot() +
  geom_col(aes(y = x, x = reorder(word2, x), fill = factor(fill))) +
  coord_flip() +
  theme_linedraw() +
  xlab(label = "words preceded by not") +
  ggtitle("Sentiment Score * Frequency of the word")
bigram_igraph <- bigram_net %>%
  filter(n > 4) %>%
  graph_from_data_frame()
#ggraph(bigram_igraph, layout = "fr") +
#  geom_edge_link() +
#  geom_node_point() +
#  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
#  theme_linedraw()
a <- grid::arrow(type = "closed", length = unit(.1, "inches"))
set.seed(7)
ggraph(bigram_igraph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1, size = 3) +
  theme_void() +
  ggtitle("bigram network, n>4")
bigram_igraph <- biwords_df %>%
  filter(membership == "Senior Member") %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n > 3) %>%
  graph_from_data_frame()
a <- grid::arrow(type = "closed", length = unit(.1, "inches"))
set.seed(7)
senior_network <- ggraph(bigram_igraph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1, size = 3) +
  theme_void() +
  ggtitle("Senior Members: n>3")
senior_network
bigram_igraph <- biwords_df %>%
  filter(membership == "Member") %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n > 3) %>%
  graph_from_data_frame()
set.seed(7)
middle_network <- ggraph(bigram_igraph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1, size = 3) +
  theme_void() +
  ggtitle("Middle Members: n>3")
middle_network
bigram_igraph <- biwords_df %>%
  filter(membership == "Junior Member") %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n > 3) %>%
  graph_from_data_frame()
set.seed(7)
junior_network <- ggraph(bigram_igraph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1, size = 3) +
  theme_void() +
  ggtitle("Junior Members: n>3")
junior_network