This is a personal project to analyze patient discussions in an anxiety-related forum. More specifically, I am interested in whether the members of a forum differ from one another in the most frequent words of their posts. The forum categorizes its members into four groups: Guest, Junior Member, Member, and Senior Member.
The topic of the forum is a question: How do you cope with your social anxiety?
The first step in any text analysis is gathering the text, so first we need to scrape the web pages using R. The data is collected from AnxietyForum.net.
This is the forum thread that I am going to scrape: http://anxietyforum.net/forum/showthread.php?8-How-are-you-coping-with-anxiety/
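The scraping boils down to a short rvest loop over the thread's 50 pages; a condensed sketch (the complete code is listed at the end of this report):

library(rvest)

base_url <- "http://anxietyforum.net/forum/showthread.php?8-How-are-you-coping-with-anxiety/"
posts <- character()
for (i in 1:50) {
  html <- read_html(paste0(base_url, "page", i))
  # ".postcontent" is the CSS class this forum uses for the post bodies
  posts <- append(posts, html_text(html_nodes(html, ".postcontent")))
}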
## [1] "The number of retrieved posts: 505"
## [1] "-----------------"
## [1] "A few samples of the forum's posts:"
## [1] "\nIn an effort to see how people are coping with anxiety, I have setup this poll.\n"
## [2] "\nI was taking paxil cr, but im off of it for the time being, but it does really seem to help a lot. I've also tried other medications without any success.\n"
## [3] "\nI'm on Welbutrin. I know some people haven't had luck with this drug, but it seems to be helping me with some of it. I think it is time for me to talk to the dr. about upping my dose though. :roll: \nCath\n"
## [4] "\nI was taking Zoloft, and that helped get me through a kind of depressed period, but it didn't do much good for the anxiety, so I stopped taking it. Now I am looking at alternative therapies to help me with the anxiety, because I just don't want to be on medication forever. There's got to be a way to deal with this without drugs!\n"
## [5] "\nHi I agree with you about the Meds. Im not saying they are bad and that people shouldnt take them, but I know there are medication free ways of beating this anxiety. Every person has a differnt story tho...and im only sayin what my doctor has said to me. I get the impression im not the worst off patient who visits him...but i have my moments "
As we can see, the scraping was successful, and the number of retrieved posts is 505. Now we need to clean the posts and extract the meaningful words from each. Besides the post texts, i.e. the user replies to the topic question, the users' locations (when available) and membership statuses are retrieved as well.
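The cleaning itself is a handful of stringr substitutions, roughly:

library(stringr)

posts <- str_replace_all(posts, "\n", "")          # strip the newline padding
posts <- str_replace_all(posts, ":.+?:", "")       # drop the :emoji: codes
posts <- posts[!str_detect(posts, "Last edited")]  # drop "Last edited" notices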
## [1] "A few sample of cleaned posts:"
## [1] "In an effort to see how people are coping with anxiety, I have setup this poll."
## [2] "I was taking paxil cr, but im off of it for the time being, but it does really seem to help a lot. I've also tried other medications without any success."
## [3] "I'm on Welbutrin. I know some people haven't had luck with this drug, but it seems to be helping me with some of it. I think it is time for me to talk to the dr. about upping my dose though. "
## [4] "I was taking Zoloft, and that helped get me through a kind of depressed period, but it didn't do much good for the anxiety, so I stopped taking it. Now I am looking at alternative therapies to help me with the anxiety, because I just don't want to be on medication forever. There's got to be a way to deal with this without drugs!"
## [5] "Hi I agree with you about the Meds. Im not saying they are bad and that people shouldnt take them, but I know there are medication free ways of beating this anxiety. Every person has a differnt story tho...and im only sayin what my doctor has said to me. I get the impression im not the worst off patient who visits him...but i have my moments "
The posts are now cleaner and closer to natural language. Some scraped texts were not actually members' posts, and some included emojis as well as leftover HTML markup; I removed these impurities.
## [1] "Senior Member" "Junior Member" "Senior Member" "Senior Member"
## [5] "Senior Member"
So now the membership text is cleaned as well. At this point we can move on to the text analysis.
Now we have the data. What can I do with it?
Ideas:
1. Word clouds for the different member groups
2. Sentiment analysis
In this report, the first part of this forum study, I build the word clouds, which are essentially a word frequency exploration. The sentiment analysis will come in the next report.
First I will put the memberships and the corresponding posts into one data frame. Let's have another look at the data then.
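Building the data frame is a one-liner, plus a filter down to the four membership levels:

library(dplyr)

data <- data.frame(membership, posts, stringsAsFactors = FALSE)
data <- data %>% filter(membership %in% c("Guest", "Junior Member", "Member", "Senior Member"))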
## membership
## 1 Senior Member
## 2 Junior Member
## 3 Senior Member
## 4 Senior Member
## 5 Senior Member
## 6 Junior Member
## posts
## 1 In an effort to see how people are coping with anxiety, I have setup this poll.
## 2 I was taking paxil cr, but im off of it for the time being, but it does really seem to help a lot. I've also tried other medications without any success.
## 3 I'm on Welbutrin. I know some people haven't had luck with this drug, but it seems to be helping me with some of it. I think it is time for me to talk to the dr. about upping my dose though.
## 4 I was taking Zoloft, and that helped get me through a kind of depressed period, but it didn't do much good for the anxiety, so I stopped taking it. Now I am looking at alternative therapies to help me with the anxiety, because I just don't want to be on medication forever. There's got to be a way to deal with this without drugs!
## 5 Hi I agree with you about the Meds. Im not saying they are bad and that people shouldnt take them, but I know there are medication free ways of beating this anxiety. Every person has a differnt story tho...and im only sayin what my doctor has said to me. I get the impression im not the worst off patient who visits him...but i have my moments
## 6 I do not really cope with my anxiety at all. I have been given some advice on how to handle it but it has not worked. So far I have yet to find anything that helps me to calm down when I am extremely anxious.
Here I noticed that I had not removed the non-letter characters from the text, so I do a little further cleaning.
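The extra cleaning replaces every non-word character and every run of digits with a space:

data$posts <- str_replace_all(data$posts, pattern = "\\W", replacement = " ")     # punctuation
data$posts <- str_replace_all(data$posts, pattern = "[0-9]+", replacement = " ")  # numbers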
## membership
## 1 Senior Member
## 2 Junior Member
## 3 Senior Member
## 4 Senior Member
## 5 Senior Member
## 6 Junior Member
## posts
## 1 In an effort to see how people are coping with anxiety I have setup this poll
## 2 I was taking paxil cr but im off of it for the time being but it does really seem to help a lot I ve also tried other medications without any success
## 3 I m on Welbutrin I know some people haven t had luck with this drug but it seems to be helping me with some of it I think it is time for me to talk to the dr about upping my dose though
## 4 I was taking Zoloft and that helped get me through a kind of depressed period but it didn t do much good for the anxiety so I stopped taking it Now I am looking at alternative therapies to help me with the anxiety because I just don t want to be on medication forever There s got to be a way to deal with this without drugs
## 5 Hi I agree with you about the Meds Im not saying they are bad and that people shouldnt take them but I know there are medication free ways of beating this anxiety Every person has a differnt story tho and im only sayin what my doctor has said to me I get the impression im not the worst off patient who visits him but i have my moments
## 6 I do not really cope with my anxiety at all I have been given some advice on how to handle it but it has not worked So far I have yet to find anything that helps me to calm down when I am extremely anxious
OK, the data is now hopefully in the format we want! The first step of text analysis in the tidy approach is tokenization. A token is a meaningful unit of a passage; here the unit is the word, i.e. every word is a token.
After tokenization, I remove the stop words: words that are not informative for understanding the main message of a text, such as "the", "at", "on", and so on.
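With tidytext both steps are short: unnest_tokens() splits the posts into one word per row, and an anti_join() against the bundled stop_words lexicon drops the uninformative words:

library(dplyr)
library(tidytext)

data_token <- data %>%
  unnest_tokens(output = "word", input = posts, token = "words")

data("stop_words")
data_token <- data_token %>% anti_join(stop_words, by = "word")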
Now we can start getting some insight into the data.
I have not stemmed the words yet. Stemming converts the different forms of a word into a common stem; for instance, in the above graph, "exercises" and "exercise" are counted as two different words. Shall I stem? I think so. There is another problem as well: some words do not add anything to our insight. For instance, "post" and "originally" come from quoted user replies, and even "anxiety" is uninformative here. Also, some words have identical meanings, such as "medications" and "meds", and this latter problem cannot be resolved by stemming, so it needs manual attention!
Now let's merge some identical words and remove some useless ones.
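Concretely, I stem with SnowballC's wordStem() and then patch up the stems by hand:

library(dplyr)
library(SnowballC)
library(stringr)

data_token_stemmed <- data_token
data_token_stemmed$word <- wordStem(data_token_stemmed$word, language = "en")
# merge stems with the same meaning: "medic" (medication/medicine) -> "med"
data_token_stemmed$word <- str_replace_all(data_token_stemmed$word, "medic", "med")
data_token_stemmed$word <- str_replace_all(data_token_stemmed$word, "anxi.+", "anxiety")
# drop words that add no insight in this forum
data_token_stemmed <- data_token_stemmed %>% filter(!word %in% c("post", "origin", "cope", "lot"))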
A question: is "don't take medication" a phrase in the seniors' posts, or is it something else? I would rather remove the word "don", which is in fact "don't".
Now the same for Junior Members, before stemming.
And the frequency plot after stemming.
And now, after merging some words and removing some others.
We do the same for Members, the middle category.
Since Guest users published too few posts, a common-words plot for them is not meaningful, so I focus on the three categories of Junior Member, Member, and Senior Member.
At last, let's draw the word clouds as well.
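Each cloud takes the top-20 counts of a group and feeds them to the wordcloud package, along these lines:

library(wordcloud)
library(colorRamps)

data_token_stemmed %>%
  count(membership, word, sort = TRUE) %>%
  filter(membership == "Senior Member") %>%
  head(20) %>%
  with(wordcloud(word, n, random.order = FALSE, colors = matlab.like(20)))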
I noticed that the most frequent word, "anxiety", is omitted from the clouds because of the cloud size and the available space. It is possible to rectify this, but I think it is not a bad idea to leave the word out! After all, this whole forum is about anxiety, so knowing that the most frequent word of each group is "anxiety" adds nothing to our information.
The most frequent words of the different groups are interesting. Senior Members emphasize exercise, CBT (cognitive behavioral therapy), and tea, besides medication; "book" and "read" are also exclusive to Senior Members. Members, the middle category in terms of activity on the website, have "panic" and "attack" as their two most frequent words, and "friend" and "nature"/"natural" are their exclusive frequent words. Junior Members have "panic" and "attack" in their cloud too, but less frequently than the middle group; in contrast, "depression" is their exclusive word.
Now the question is: are these differences, in both the exclusive words and the emphasis, meaningful? Do Senior Members believe in exercising, drinking tea, reading books, and cognitive behavioral therapy more than the middle and junior members do? Are Junior Members more prone to depression?
There is one obstacle to resolve before drawing clearer conclusions. The seniority level reflects the time the data was extracted, not the time of posting on the forum, so the current Senior Members may have been Junior Members when they participated in the discussion.
Junior Members were certainly juniors at the time of posting, since they are still Junior Members. The Senior Members, however, are harder to evaluate: they might have been seniors when they posted in this thread, or they might have become seniors later on. It should be possible to distinguish these two groups from their joining dates, the dates of their posts, and their posts-per-day rates, and to adjust the seniority level accordingly, as sketched below.
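Suppose we had also scraped each member's joining date and each post's date (hypothetical join_date and post_date columns that this report does not collect). Then a rough reclassification could look like this:

library(dplyr)

# hypothetical reclassification: a "Senior Member" posting within 90 days of
# joining was presumably not senior yet at posting time
data_adjusted <- data %>%
  mutate(seniority_at_post = if_else(
    membership == "Senior Member" &
      as.numeric(post_date - join_date, units = "days") < 90,
    "Junior at posting time",
    membership))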
More information is needed, especially about the forum's mechanics. If the seniority level is an up-to-date measure, as I assume it is, and the posts are old, then the seniority reflects the current membership level, not the level at the time of posting. Thus the Senior Members are the members who are still active on the forum, while the juniors are the ones who joined the website at some point, posted a few times, and then became inactive.
If we instead assume that the seniority level reflects the time of posting, new questions emerge from these clouds. Do people with anxiety disorder form new habits, such as exercise and reading, after a while? Are the Senior Members' suggestions actually helpful for dealing with anxiety?
From the text analysis perspective, we should ask what combinations of words the members use. For instance, "don't" was one of the frequent words, so it is entirely reasonable to ask: don't do what? Maybe the Senior Members, for instance, are telling us to exercise or to read books! An n-gram analysis may help clarify this issue.
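A bigram tokenization is a one-line change with tidytext; a minimal sketch:

library(dplyr)
library(tidytext)

# split the posts into pairs of consecutive words instead of single words
data_bigrams <- data %>%
  unnest_tokens(output = "bigram", input = posts, token = "ngrams", n = 2) %>%
  count(membership, bigram, sort = TRUE)
head(data_bigrams)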
This report is my very first effort in text analysis. I scraped a medical forum and tried to gain insight into the forum members' posts using word frequencies. This is the simplest text analysis method, and in the next steps I am going to learn and implement more sophisticated methods as well. Nevertheless, the result of this word frequency analysis is interesting, as the Senior Members do seem different from the Junior Members.
It is important to emphasize that this study is not a scientific effort, but a programming exercise by a text analysis enthusiast.
library(rvest)
library(stringr)
library(dplyr)
library(ggplot2)
library(colorRamps)
#url = "http://www.ibsgroup.org/forums/topic/141800-16-and-suffering/"
#url = "https://www.r-bloggers.com/scraping-web-pages-with-r/"
url = "http://anxietyforum.net/forum/showthread.php?8-How-are-you-coping-with-anxiety/"
posts = character()
location = character()
membership = character()
posts2 = character()
for (i in 1:50){
url = "http://anxietyforum.net/forum/showthread.php?8-How-are-you-coping-with-anxiety/"
url = paste0(url,"page",i)
#print(url)
html <- read_html(url)
content_node <- html_nodes(html , ".postcontent")
posts <- append(posts,html_text(content_node))
location_node <- html_nodes(html, ".post_field:nth-child(2) dd")
location <- append(location, html_text(location_node) )
member_node <- html_nodes(html, ".usertitle")
membership <- append(membership, html_text(member_node) )
}
print(paste0("The number of retrieved posts: ",length(posts)))
print("-----------------")
#length(location)
#length(membership)
#sum(membership == "\nGuest\n")
print("A few samples of the forum's posts:")
posts[1:5]
#posts[1:5]
#the first repeating pattern is \n signs at the start and sometimes at the end of the string elements. Let's remove them
str_replace_all(posts[1:5],"\n","")
#
#now let's do it to all the posts
posts = str_replace_all(posts,"\n","")
#posts[1:5]
#now let's remove the emojis from the text; they happen to follow a pattern starting and ending with a colon
posts = str_replace_all(posts,":.+?:","") # lazy quantifier, so only the code between one pair of colons is removed
# there is also a stray "Cath" word, presumably related to some other emoji; let's remove it as well
posts = str_replace_all(posts,"Cath","")
#str_extract(posts, "(.){10}") # quick peek at the first ten characters of each post
# drop the entries that are "Last edited" notices rather than replies
last_edited <- which(str_detect(posts, "Last edited"))
if (length(last_edited) > 0) posts <- posts[-last_edited]
print("A few sample of cleaned posts:")
posts[1:5]
#membership[1:50]
membership = str_replace_all(membership,"\n","")
#joint_member_location <- membership
#joint_member_location[str_detect(membership , "Member")] <- location
#joint_member_location[joint_member_location=="Guest"] <- "NA"
membership[1:5]
require(SnowballC) # needed for wordStem() below
require(tidytext)
require(RColorBrewer)
require(wordcloud)
require(gridExtra)
# combine the membership and post vectors into a single data frame
data <- data.frame(cbind(membership, posts))
data$posts <- as.character(data$posts) # data.frame coerces strings to factors; convert back
data <- data %>% filter(membership %in% c("Guest","Junior Member","Member","Senior Member"))
#juniors <- data %>% filter(membership == "Junior Member") %>% select(posts)
#dim(juniors)
#members <- data %>% filter(membership == "Member") %>% select(posts)
#dim(members)
#seniors <- data %>% filter(membership == "Senior Member") %>% select(posts)
#dim(seniors)
#guests <- data %>% filter(membership == "Guest") %>% select(posts)
#dim(guests)
head(data)
#j_tokens<-juniros %>% unnest_tokens(output = word, input = posts , token = "words")
#str_replace_all(pattern = "[0-9]" ,replacement = "" )
#replacing the punctuations with a white space
data$posts <- str_replace_all(string = data$posts ,
pattern = "\\W",
replacement = " ")
#replacing the numbers with a white space
data$posts <- str_replace_all(string = data$posts ,
pattern = "[0-9]+",
replacement = " ")
head(data)
#tokenization, token = word
data_token <- data %>%
group_by(membership) %>%
mutate(n_posts = n()) %>%
unnest_tokens(output = "word" , input = posts , token = "words") %>%
ungroup()
# removal of the stop words
data("stop_words")
data_token <- data_token %>% anti_join(stop_words)
data_token %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Senior Member" ) %>%
head(20) %>%
ggplot() +
geom_col(aes(y = n , x = reorder(word,n)),
fill = "skyblue") +
coord_flip() +
theme_linedraw() +
xlab(label = "20 Common words of Senior Members") +
ggtitle("20 Common Words of Senior Members - No Stemming")
data_token_stemmed <- data_token
data_token_stemmed$word <- wordStem(words = data_token$word , language = "en")
data_token_stemmed %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Senior Member" ) %>%
head(20) %>%
ggplot() +
geom_col(aes(y = n , x = reorder(word,n)),
fill = "skyblue") +
coord_flip() +
theme_linedraw() +
xlab(label = "20 Common words of Senior Members")+
ggtitle("20 Common Words of Senior Members - Stemmed")
# merge stems with the same meaning: "medic" (from medication/medicine) becomes "med"
data_token_stemmed$word <- str_replace_all(string = data_token_stemmed$word,
pattern = "medic" ,
replacement = "med")
# collapse all "anxi..." stems (anxious, anxieti, ...) into "anxiety"
data_token_stemmed$word <- str_replace_all(data_token_stemmed$word ,
pattern = "anxi.+" ,
replacement = "anxiety")
# remove words that add no insight in this forum
data_token_stemmed <- data_token_stemmed %>% filter(!word %in% c("post","origin","cope","lot"))
data_token_stemmed %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Senior Member" ) %>%
head(20) %>%
ggplot() +
geom_col(aes(y = n , x = reorder(word,n)),
fill = "skyblue") +
coord_flip() +
theme_linedraw() +
xlab(label = "20 Common words of Senior Members")+
ggtitle("20 Common Words of Senior Members - Stemmed and Improved")
data_token %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Junior Member" ) %>%
head(20) %>%
ggplot() +
geom_col(aes(y = n , x = reorder(word,n)),
fill = "orange") +
coord_flip() +
theme_linedraw() +
xlab(label = "20 Common words of Senior Members") +
ggtitle("Junior Members 20 Most Common Words - Before Stemming")
#data_token_stemmed <- data_token
#data_token_stemmed$word <- wordStem(words = data_token$word , language = "en")
data_token_stemmed %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Junior Member" ) %>%
head(20) %>%
ggplot() +
geom_col(aes(y = n , x = reorder(word,n)),
fill = "orange") +
coord_flip() +
theme_linedraw() +
xlab(label = "20 Common words of Junior Members") +
ggtitle("Junior Members 20 Most Common Words - Stemmed")
data_token_stemmed %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Junior Member" ) %>%
filter(!word %in% c("ve","im")) %>%
head(20) %>%
ggplot() +
geom_col(aes(y = n , x = reorder(word,n)),
fill = "orange") +
coord_flip() +
theme_linedraw() +
xlab(label = "20 Common words of Junior Members") +
ggtitle("Junior Members 20 Most Common Words - Stemmed&Improved")
data_token %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Member" ) %>%
head(20) %>%
ggplot() +
geom_col(aes(y = n , x = reorder(word,n)),
fill = "green") +
coord_flip() +
theme_linedraw() +
xlab(label = "20 Common words of Members") +
ggtitle("Members 20 Most Common Words - Before Stemming")
data_token_stemmed %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Member" ) %>%
head(20) %>%
ggplot() +
geom_col(aes(y = n , x = reorder(word,n)),
fill = "green") +
coord_flip() +
theme_linedraw() +
xlab(label = "20 Common words of Members") +
ggtitle("Members 20 Most Common Words - Stemmed")
data_token_stemmed %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Member" ) %>%
filter(!word %in% c("ve","im")) %>%
head(20) %>%
ggplot() +
geom_col(aes(y = n , x = reorder(word,n)),
fill = "green") +
coord_flip() +
theme_linedraw() +
xlab(label = "20 Common words of Members") +
ggtitle("Members 20 Most Common Words - Stemmed&Improved")
par(mfrow = c(2,2))
data_token_stemmed %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
filter(!word %in% c("don")) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Senior Member" ) %>%
head(20) %>%
with(wordcloud(word,n,random.order = FALSE, colors = matlab.like(20)))
text(x=0.5, y=1, "Senior Members")
data_token_stemmed %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Junior Member" ) %>%
filter(!word %in% c("ve","im","don")) %>%
head(20) %>%
with(wordcloud(word,n,random.order = FALSE, colors = matlab.like(20)))
text(x=0.5, y=1, "Junior Members")
data_token_stemmed %>%
group_by(membership) %>%
count(word, sort = TRUE) %>%
#mutate(prop = round(n/sum(n),6)*100) %>%
filter(membership == "Member" ) %>%
filter(!word %in% c("ve","im","don")) %>%
head(20) %>%
with(wordcloud(word,n,random.order = FALSE, colors = matlab.like(20)))
text(x=0.5, y=1, "Members")