Understanding patient needs is important for improving healthcare systems. Web forums are a common way for patients around the world to ask questions, request support and learn about their condition. Scraping these forums could give researchers and clinicians useful information about different patient populations. For this assignment, I selected a forum from a website (https://amyloidosis.org.uk/) centered on Amyloidosis, a rare disease in which excess accumulation of an abnormal protein (amyloid) causes various system-wide pathologies. I retained data showing the topics users posted and the number of views and replies each topic received. I reviewed the site's Terms and Policies, which state that data may be used for personal, informational and non-commercial purposes. I believe this course assignment fits those criteria and have determined that scraping this data would be considered ethical.
#install.packages("rvest")
#load libraries
library(tidyverse)
library(rvest)
library(stringr)
library(reshape2)
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(tidyr)
amyloidosis <- read_html("https://amyloidosis.org.uk/forum/index.php?PHPSESSID=757da3021d3971c86b11e01c4f8f06c8&board=4.0")
topic <- amyloidosis %>% html_nodes("#messageindex span a") %>% html_text()
topic_creator <- amyloidosis %>% html_nodes("p a") %>% html_text()
last_topic_author <- amyloidosis %>% html_nodes(".lastpost a") %>% html_text()
stats <- amyloidosis %>% html_nodes(".stats") %>% html_text()
dryeye_db <- data.frame(topic,topic_creator,last_topic_author,stats)
dryeye_db_adj <- colsplit(stats, "\n\t\t\t\t\t\t\n\t\t\t\t\t\t", names = c("Replies", "Views"))
dryeye_master <- cbind(dryeye_db,dryeye_db_adj)
# Strip everything except digits so the counts can be converted to numbers
dryeye_master$Reply <- str_replace_all(dryeye_master$Replies, "[^0-9]", "")
dryeye_master$View <- str_replace_all(dryeye_master$Views, "[^0-9]", "")
de_msr <- dryeye_master[1:20, ]
de_msr$topic_type <- c("Information","Treatment","Diagnosis","Diagnosis","Symptoms","Treatment","Symptoms","Symptoms","Diagnosis","Treatment","Treatment","Symptoms","Symptoms","Symptoms","Diagnosis","Treatment","Symptoms","Symptoms","Information","Information")
de_msr$Reply <- as.numeric(de_msr$Reply)
de_msr$View <- as.numeric(de_msr$View)
num_reply<- aggregate(Reply~topic_type,de_msr, length)
names(num_reply)[2] <- 'num'
total_reply<- aggregate(Reply~topic_type,de_msr, sum)
names(total_reply)[2] <-'total_reply'
num_view<- aggregate(View~topic_type,de_msr, length)
names(num_view)[2] <- 'num'
total_view<- aggregate(View~topic_type,de_msr, sum)
names(total_view)[2] <-'total_view'
reply_stat <- merge(num_reply, total_reply)
view_stat <- merge(num_view, total_view)
topic_stats <-merge(reply_stat,view_stat,by="topic_type")
de_analysis <- topic_stats %>% select("Topics"=topic_type, "Replies" = total_reply, "Views" = "total_view") %>%
kbl(., caption = "Table 1: Summary of the number of replies and views for each topic in the Amyloidosis Forum") %>%
kable_minimal() %>%
kable_material(c("striped", "hover"))
de_analysis
| Topics | Replies | Views |
|---|---|---|
| Diagnosis | 8 | 872 |
| Information | 6 | 575 |
| Symptoms | 28 | 1584 |
| Treatment | 15 | 1107 |
de_plot_bc1 <- de_msr %>% select("Topic"=topic_type, "Replies"= Reply, "Views"= View)
de_long <- gather(de_plot_bc1, condition, measurement, Replies:Views, factor_key=TRUE)
ggplot(data = de_long, aes(x = Topic, y = measurement, fill = condition)) + geom_bar(stat = "identity")+ facet_wrap(~ condition) + ggtitle("Figure 1: Amyloidosis Forum Topics: Replies vs Views")
ggplot(de_plot_bc1, aes(x=Topic, y=Replies, fill=Topic))+
geom_bar(stat="identity", position=position_dodge()) + ggtitle("Figure 2: Number of Replies to Each Topic")
ggplot(de_plot_bc1, aes(x=Topic, y=Views, fill=Topic))+
geom_bar(stat="identity", position=position_dodge()) + ggtitle("Figure 3: Number of Views of Each Topic")
Several insights into the Amyloidosis patient experience were gained through this web scraping exercise. The analysis showed that topics received far more views than replies (in Table 1, for example, Symptoms topics drew 1,584 views but only 28 replies). This suggests that many visitors use the forum to read rather than to post, with likely motivations including finding new information about the disease, diagnosis, treatments and the general patient experience.

In the initial phase of this assignment, I evaluated several other healthcare forums. Surprisingly, many of them appeared to contain made-up content: although different usernames were displayed, the posts looked as though they had all been entered by the same person, or at least within the same very narrow window of time (1-2 days, 5 years ago, followed by no other forum activity). Another observation was that many of these disease-specific forums were hosted by pharmaceutical or other biomedical companies. It seems reasonable that hosting a forum helps these companies drive their target population to their website and advertise their products or services. It would also make sense for a new forum to be seeded initially, possibly with fake users and topic posts, because people may not want to be the first to post but may be more willing after seeing others do so.

Even if a user never posts, the content they view still indicates what users most want to see. In Table 1, Symptoms topics had both the most total views (1,584) and the most replies (28), but they were also the largest category (8 of the 20 threads). Per thread, Treatment topics averaged the most views (about 221, against about 198 for Symptoms), while Symptoms topics averaged the most replies (3.5 per thread). These preferences might support posting more content on treatment and starting a doctor-hosted forum or Q&A module about treatment and symptoms.
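Because the four categories contain different numbers of threads, the totals in Table 1 partly reflect category size. The following is a minimal base R sketch of a per-thread comparison. It assumes the totals shown in Table 1 and thread counts tallied by hand from the `topic_type` labels (4 Diagnosis, 3 Information, 8 Symptoms, 5 Treatment). The column names `Threads`, `Replies_per_thread`, `Views_per_thread` and `Views_per_reply` are introduced here for illustration only.

```r
# Totals copied from Table 1
totals <- data.frame(
  Topics  = c("Diagnosis", "Information", "Symptoms", "Treatment"),
  Replies = c(8, 6, 28, 15),
  Views   = c(872, 575, 1584, 1107)
)

# Assumption: threads per category, counted from the topic_type labels
totals$Threads <- c(4, 3, 8, 5)

# Per-thread averages remove the effect of category size
totals$Replies_per_thread <- round(totals$Replies / totals$Threads, 1)
totals$Views_per_thread   <- round(totals$Views / totals$Threads, 1)

# Views per reply: a rough indicator of how much reading happens per post
totals$Views_per_reply <- round(totals$Views / totals$Replies, 1)

print(totals)
```

On these numbers, Treatment has the highest views per thread (221.4) and Symptoms the highest replies per thread (3.5). With only 20 threads the differences between categories are small, so they should be read as suggestive rather than conclusive.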
Most of the difficulties I encountered in this assignment came from looking for a website to scrape. There were several topics I wanted to explore and many websites I looked at. However, some websites, such as Petfinder.com, appear to prevent web scraping. Other forums, as previously mentioned, appeared to contain bad or fake data, so the metrics of their entries were all quite similar when analyzed. While this in itself led to some interesting insights, it would have made for really boring plots.
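One way to check in advance whether a site asks crawlers to stay away from certain pages is to read its robots.txt file. The base R sketch below is a simplified illustration, not a full implementation of the robots.txt rules. The helper `is_path_allowed` and the sample rules are hypothetical and are not taken from any site mentioned above. It only does plain prefix matching on `Disallow` lines in the `User-agent: *` group and ignores `Allow` rules and wildcards.

```r
# Hypothetical helper: TRUE if `path` is not covered by any Disallow rule
# in the "User-agent: *" group of a robots.txt file.
# Simplification: prefix matching only; Allow rules and wildcards are ignored.
is_path_allowed <- function(robots_lines, path) {
  lines <- trimws(sub("#.*$", "", robots_lines))  # drop comments
  lines <- lines[nzchar(lines)]                   # drop empty lines
  in_star_group <- FALSE
  for (line in lines) {
    parts <- strsplit(line, ":", fixed = TRUE)[[1]]
    field <- tolower(trimws(parts[1]))
    value <- trimws(paste(parts[-1], collapse = ":"))
    if (field == "user-agent") {
      in_star_group <- identical(value, "*")
    } else if (field == "disallow" && in_star_group && nzchar(value)) {
      if (startsWith(path, value)) return(FALSE)
    }
  }
  TRUE
}

# Made-up rules for illustration
robots <- c(
  "User-agent: *",
  "Disallow: /admin/",
  "Disallow: /search",
  "",
  "User-agent: examplebot",
  "Disallow: /"
)

is_path_allowed(robots, "/forum/index.php")  # TRUE
is_path_allowed(robots, "/admin/login")      # FALSE
```

For a real site, the lines could be read with `readLines()` from the site's robots.txt URL. A permissive robots.txt does not replace reading the site's terms of use, and a site can also block scraping by other means.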