This project is done for academic purposes and my own personal interest, and does not reflect my professional opinion about any of the medications or conditions discussed.
For this project, I’m going to use available datasets to help understand patient versus professional perceptions of drug therapy benefit and tolerability. Using patient reviews, I intend to determine the most commonly reviewed classes of prescription drugs, as well as their relative tolerability. The data come from reviews on Drugs.com (.csv), each with a rating, a brief written review, and other users’ feedback on whether the review was helpful. The bulk of my time will be spent performing a sentiment analysis of the drug reviews, looking for words associated with positive or negative opinion.
In addition to this, I’ll be using the US Food and Drug Administration’s API for adverse drug reactions (JSON) to offer a contrasting regulatory/professional perspective. Based on this comparison, I hope to identify any discrepancies between reported tolerability and firsthand patient experience. My purpose in doing this is to enhance my own ability to warn about side effects, and to identify any barriers to successful drug therapy before they arise.
Sources:
Original Drugs.com paper: http://kdd.cs.ksu.edu/Publications/Student/kallumadi2018aspect.pdf
Kaggle / UC Irvine dataset: https://www.kaggle.com/jessicali9530/kuc-hackathon-winter-2018
FDA API website: https://open.fda.gov/apis/drug/event/
FDA Github with R package: https://github.com/ropenhealth/openfda
I’ll start by loading the training set of the original data, which contains over 160,000 drug reviews. I’ve read two columns, drug name and condition, as factors to better understand the most commonly reviewed medicines and diseases.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages ---------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.6.3
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:dplyr':
##
## intersect, setdiff, union
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
reviews_orig <- read_csv("https://raw.githubusercontent.com/hillt5/DATA607_Final_Project/master/drugsComTrain_raw.csv", col_types = "dffcdcd") #Set column datatypes
reviews_orig$date <- as.Date(reviews_orig$date, format = "%d-%b-%y") #set date format
head(reviews_orig)
## # A tibble: 6 x 7
## uniqueID drugName condition review rating date usefulCount
## <dbl> <fct> <fct> <chr> <dbl> <date> <dbl>
## 1 206461 Valsartan Left Ventr~ "\"It has no s~ 9 2012-05-20 27
## 2 95260 Guanfacine ADHD "\"My son is h~ 8 2010-04-27 192
## 3 92703 Lybrel Birth Cont~ "\"I used to t~ 5 2009-12-14 17
## 4 138000 Ortho Evra Birth Cont~ "\"This is my ~ 8 2015-11-03 10
## 5 35696 Buprenorph~ Opiate Dep~ "\"Suboxone ha~ 9 2016-11-27 37
## 6 155963 Cialis Benign Pro~ "\"2nd day on ~ 2 2015-11-28 43
I’ll start by looking at the two variables defined as factors, drugName and condition. I’ve limited the summary to the 25 most frequent levels of each.
num_obs <- nrow(reviews_orig) #Number of reviews
summary(reviews_orig$drugName, 25) #Top 25 drugs by review frequency
## Levonorgestrel Etonogestrel
## 3657 3336
## Ethinyl estradiol / norethindrone Nexplanon
## 2850 2156
## Ethinyl estradiol / norgestimate Ethinyl estradiol / levonorgestrel
## 2117 1888
## Phentermine Sertraline
## 1543 1360
## Escitalopram Mirena
## 1292 1242
## Implanon Gabapentin
## 1102 1047
## Bupropion Venlafaxine
## 1022 1016
## Miconazole Medroxyprogesterone
## 1000 995
## Citalopram Lexapro
## 995 952
## Bupropion / naltrexone Duloxetine
## 950 934
## Metronidazole Contrave
## 922 920
## Drospirenone / ethinyl estradiol Depo-Provera
## 890 882
## (Other)
## 126229
round(100*(num_obs-126229)/num_obs, 1) #Percent of reviews that fall within the top 25 drugs
## [1] 21.7
For drugs, it’s evident that many of the commonly reviewed medications are progestins used for oral contraception (OC). Levonorgestrel is an emergency contraceptive and also the active ingredient in many different brands of OC. There are also several brand name contraceptives, as well as long-acting injections and implants, like Implanon and Depo-Provera, which further complicates the definition of contraceptives as a drug class. Together, the drugs listed account for about 21.7% of all reviews. Beyond contraceptives, phentermine and bupropion/naltrexone are used for weight loss, metronidazole is an antimicrobial, and many of the others are antidepressants.
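The share covered by the most frequent levels can also be computed directly, without copying the "(Other)" count out of the printed summary by hand. The following is a minimal base-R sketch on made-up data; `drug` and `top_share` are illustrative names, not objects from this project.

```r
# Toy vector standing in for a factor column such as drugName (made-up values)
drug <- factor(c("A", "A", "A", "B", "B", "C", "D", "E"))

# Percent of observations that fall within the k most frequent levels
top_share <- function(x, k) {
  counts <- sort(table(x), decreasing = TRUE)  # frequency of each level, largest first
  round(100 * sum(head(counts, k)) / length(x), 1)
}

share_top2 <- top_share(drug, 2)  # A (3 reviews) + B (2 reviews) out of 8
share_top2
```

Note that `table()` drops missing values while `length()` counts them, so in this sketch the NAs stay in the denominator.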
summary(reviews_orig$condition, 25) #top 25 conditions being treated
## Birth Control Depression Pain
## 28788 9069 6145
## Anxiety Acne Bipolar Disorde
## 5904 5588 4224
## Insomnia Weight Loss Obesity
## 3673 3609 3568
## ADHD Diabetes, Type 2 Emergency Contraception
## 3383 2554 2463
## High Blood Pressure Vaginal Yeast Infection Abnormal Uterine Bleeding
## 2321 2274 2096
## Bowel Preparation ibromyalgia Smoking Cessation
## 1859 1791 1780
## Migraine Anxiety and Stress Major Depressive Disorde
## 1694 1663 1607
## Constipation Panic Disorde (Other)
## 1595 1463 61287
## NA's
## 899
round(100*(num_obs-61287)/num_obs, 2) #Percent of reviews that fall within the top 25 conditions
## [1] 62
For conditions, the most common reason given is ‘Birth Control’, which is not surprising considering the most common medications. The same medications can also be used for other top conditions, including migraine, acne, emergency contraception, and abnormal uterine bleeding. There are also many psychiatric and neurological illnesses, including anxiety, insomnia, ADHD, and depression. Chronic illnesses associated with aging, like high blood pressure and type 2 diabetes, are also present. Finally, the top 25 conditions comprise about 62% of the data, with 899 values missing - listed as “NA”.
Next, I’ll correct two spelling errors: ‘ibromyalgia’ is presumably fibromyalgia, a neuropsychiatric syndrome characterized by chronic pain, and ‘atigue’ is presumably fatigue. I’ll also recode as “NA” some values that I noticed while looking further into the dataset. Some of these are entered as ‘Not listed’, while others appear to be coding errors where ‘usefulCount’ data was shifted into the condition column. I also scanned the drug names for spelling errors and found no issues.
reviews_edit <- reviews_orig #Create new file from raw input
reviews_edit$condition <- reviews_edit$condition %>%
recode_factor(ibromyalgia = "Fibromyalgia", atigue = "Fatigue") %>% #Fix two spelling errors
na_if("Not Listed / Othe") #Recode values as 'NA'
error_span <- str_detect(reviews_edit$condition, pattern = "</span>") #Identify erroneous entries
reviews_edit$condition <- replace(reviews_edit$condition, list = error_span, NA) #Replace with NA
summary(reviews_edit$condition, 25) #First 25 entries
## Birth Control Depression Pain
## 28788 9069 6145
## Anxiety Acne Bipolar Disorde
## 5904 5588 4224
## Insomnia Weight Loss Obesity
## 3673 3609 3568
## ADHD Diabetes, Type 2 Emergency Contraception
## 3383 2554 2463
## High Blood Pressure Vaginal Yeast Infection Abnormal Uterine Bleeding
## 2321 2274 2096
## Bowel Preparation Fibromyalgia Smoking Cessation
## 1859 1791 1780
## Migraine Anxiety and Stress Major Depressive Disorde
## 1694 1663 1607
## Constipation Panic Disorde (Other)
## 1595 1463 59948
## NA's
## 2238
yrly_reviews <- reviews_edit %>%
mutate(year = year(date)) %>% #Find year of review
group_by(year) %>% #Group by year of review
count()
ggplot(yrly_reviews) +
geom_line(aes(x = year, y = n)) +
geom_point(aes(x = year, y = n)) +
labs(title = "Number of reviews over time", x = "Year", y = "Number of Reviews") + #Change in reviews over time
scale_x_continuous(breaks = c(2008, 2010, 2012, 2014, 2016))
Reviews over time significantly increased from 2014 to 2016.
For the next step of this analysis, I’ll be subsetting the reviews data into groups of comparable medications. The best way of doing this is to look at the most common conditions and then at which medicines are used to manage them. I’ll be defining four groups in the next section: medicines used for contraception, anxiety/depression, weight loss, and smoking cessation.
My goal for the contraceptive class is for it to contain all available pharmacological agents for contraception, including oral and implanted forms. I will omit emergency contraception, as this is typically a one-time use at a much higher dose. As mentioned earlier, I’ll subset the reviews by the ‘Birth Control’ condition to generate the shorter list of agents.
oc_reviews <- reviews_edit %>%
filter(condition == 'Birth Control')
head(oc_reviews)
## # A tibble: 6 x 7
## uniqueID drugName condition review rating date usefulCount
## <dbl> <fct> <fct> <chr> <dbl> <date> <dbl>
## 1 92703 Lybrel Birth Con~ "\"I used to t~ 5 2009-12-14 17
## 2 138000 Ortho Evra Birth Con~ "\"This is my ~ 8 2015-11-03 10
## 3 48928 Ethinyl est~ Birth Con~ "\"I had been ~ 8 2016-12-08 1
## 4 98494 Nexplanon Birth Con~ "\"Started Nex~ 3 2014-08-07 10
## 5 227020 Etonogestrel Birth Con~ "\"Nexplanon d~ 9 2014-08-11 11
## 6 106703 Implanon Birth Con~ "\"Never again~ 2 2015-08-20 1
n_oc_reviews <-nrow(oc_reviews) #Number of reviews of birth control
oc_reviews %>%
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_oc_reviews),1)) %>% #Percent of all birth control reviews
arrange(desc(n)) #Highest frequency first
## # A tibble: 172 x 3
## # Groups: drugName [172]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Etonogestrel 3314 11.5
## 2 Ethinyl estradiol / norethindrone 2337 8.1
## 3 Nexplanon 2149 7.5
## 4 Levonorgestrel 2129 7.4
## 5 Ethinyl estradiol / levonorgestrel 1600 5.6
## 6 Ethinyl estradiol / norgestimate 1580 5.5
## 7 Implanon 1095 3.8
## 8 Mirena 965 3.4
## 9 Skyla 822 2.9
## 10 Lo Loestrin Fe 667 2.3
## # ... with 162 more rows
This confirms my suspicion that many of the most common drugs were being used for birth control. Several observations jump out - the highest ranked drug, etonogestrel, is the generic name for the active ingredient in Implanon and Nexplanon, so in all likelihood this is by far the most common contraceptive reviewed. The same generic/brand relationship exists for levonorgestrel and Mirena, and for copper and ParaGard. Also, several oral contraceptives exist as combinations with a synthetic estrogen, ethinyl estradiol.
Next, let’s take a first look at the favorability of contraceptives, based on the numeric rating.
ggplot(oc_reviews) +
geom_bar(aes(x = rating), fill = "#042f66")+
labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Oral Contraceptives") +
scale_x_continuous(breaks = c(1:10))
There appear to be many poorly rated birth control products, likely related to some sort of intolerable effect or treatment failure. I’ll take a look at these reviews specifically, with rating of ‘1’ out of ten.
oc_reviews_bad <- oc_reviews %>%
filter(rating == 1) %>% #Looking closer at the anomaly with rating '1'
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_oc_reviews),1)) %>% #As percent of total reviews
arrange(desc(n))
oc_reviews_bad
## # A tibble: 133 x 3
## # Groups: drugName [133]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Etonogestrel 524 1.8
## 2 Ethinyl estradiol / norethindrone 467 1.6
## 3 Nexplanon 366 1.3
## 4 Ethinyl estradiol / norgestimate 252 0.9
## 5 Ethinyl estradiol / levonorgestrel 239 0.8
## 6 Levonorgestrel 199 0.7
## 7 Implanon 152 0.5
## 8 Medroxyprogesterone 119 0.4
## 9 Depo-Provera 116 0.4
## 10 Mirena 111 0.4
## # ... with 123 more rows
The contraceptives associated with poor ratings appear to be the same as the most popular items.
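Because raw counts of ‘1’ ratings will naturally track how often a drug is reviewed, one way to separate the two is to look at the share of each drug’s own reviews that are rated 1. This is a hedged sketch in base R on made-up data, not the approach taken in this report; `toy`, `n_reviews`, and `pct_rated_1` are hypothetical names.

```r
# Made-up reviews: drug name and 1-10 rating
toy <- data.frame(
  drugName = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C"),
  rating   = c(1, 1, 9, 10, 1, 8, 1, 1, 1, 10)
)

# Number of reviews per drug
n_reviews <- tapply(toy$rating, toy$drugName, length)

# Percent of each drug's own reviews that carry a rating of 1
pct_rated_1 <- tapply(toy$rating == 1, toy$drugName,
                      function(x) round(100 * mean(x), 1))

data.frame(drugName = names(n_reviews),
           n = as.vector(n_reviews),
           pct_rated_1 = as.vector(pct_rated_1))
```

In this toy example drug A has the most reviews tied with C, but C has the highest share of ‘1’ ratings, which a count-only table would not distinguish.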
oc_reviews_popular <- oc_reviews %>%
filter(drugName %in% c('Etonogestrel', 'Ethinyl estradiol / norethindrone', 'Nexplanon', 'Ethinyl estradiol / norgestimate', 'Ethinyl estradiol / levonorgestrel')) #%in% keeps every review of any of the five drugs; == would compare element by element
ggplot(oc_reviews_popular) +
geom_bar(aes(x = rating, fill = drugName))+
labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Top 5 Contraceptives") +
scale_fill_viridis_d() +
scale_x_continuous(breaks = c(1:10))
I plotted the top five graphically, and there appear to be no major anomalies in the ratings. Since many birth control products hit the same pharmacological target, I feel safe in assuming that looking at negative sentiment as a whole is appropriate.
Several conditions encompass symptoms of depression and anxiety. Within the top 25 there were some partial hits, including ‘Insomnia’, ‘Anxiety and Stress’, ‘Panic Disorder’, and ‘Major Depressive Disorder’.
ad_reviews <- reviews_edit %>%
filter(condition == c('Depression', 'Insomnia', 'Anxiety', 'Anxiety and Stress', 'Major Depressive Disorder', 'Panic Disorder'))
## Warning in `==.default`(condition, c("Depression", "Insomnia", "Anxiety", :
## longer object length is not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
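The warning above arises because `==` compares two vectors element by element and recycles the shorter one, so a row is kept only when its value happens to match the entry at the same recycled position. The tallies that follow reflect the filter as it was run. A membership test with `%in%` instead asks whether each row matches any of the listed values. The sketch below uses made-up data to show the difference. Separately, the levels printed by `summary()` earlier read ‘Major Depressive Disorde’ and ‘Panic Disorde’, so strings with the trailing ‘r’ would presumably not match those levels either way.

```r
# Made-up condition column and the values we want to keep
condition <- c("Depression", "Anxiety", "Insomnia", "Depression", "Anxiety", "Pain")
wanted <- c("Anxiety", "Depression")

# `==` recycles `wanted` along `condition` (Anxiety, Depression, Anxiety, ...),
# so each row is only compared with the value at its own position
keep_eq <- condition == wanted

# `%in%` checks, for every row, whether its value appears anywhere in `wanted`
keep_in <- condition %in% wanted

sum(keep_eq)  # rows kept by the element-wise comparison
sum(keep_in)  # rows kept by the membership test
```

Here the element-wise comparison keeps fewer rows than the membership test even though four of the six rows are Anxiety or Depression.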
n_ad_reviews <- nrow(ad_reviews) #number of reviews for depression/ anxiety meds
ad_reviews %>%
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_ad_reviews),1)) %>% #Percent of all ax/dep reviews
arrange(desc(n)) #Highest first
## # A tibble: 153 x 3
## # Groups: drugName [153]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Escitalopram 145 4.4
## 2 Sertraline 129 3.9
## 3 Citalopram 127 3.8
## 4 Bupropion 104 3.1
## 5 Fluoxetine 101 3.1
## 6 Lexapro 101 3.1
## 7 Venlafaxine 99 3
## 8 Pristiq 82 2.5
## 9 Cymbalta 81 2.5
## 10 Mirtazapine 80 2.4
## # ... with 143 more rows
The most popular medications used for management of anxiety and depression each make up a small percentage of the overall category of drugs used to manage symptoms. Let’s look at the data more visually.
ggplot(ad_reviews) +
geom_bar(aes(x = rating), fill= "#042f66") +
labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Anxiety and Depression")
ad_reviews_popular <- ad_reviews %>%
filter(drugName %in% c('Escitalopram', 'Sertraline', 'Citalopram', 'Bupropion', 'Fluoxetine')) #%in% keeps every review of any of the five drugs
ggplot(ad_reviews_popular) +
geom_bar(aes(x = rating, fill = drugName)) +
labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Top 5 Treatments for Anxiety or Depression") +
scale_fill_viridis_d() +
scale_x_continuous(breaks = c(1:10))
The bar plot follows the overall pattern of relatively good marks, with an uptick in the lowest score. It appears that there’s no set pattern in the ratings, so I’ll take a quick look at the differences in ratings of ‘1’ versus ‘10’.
ad_reviews_bad <- ad_reviews %>%
filter(rating == 1) %>% #Worst ratings
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_ad_reviews),1)) %>% #Percent of total ratings
arrange(desc(n)) #Most frequently given first
ad_reviews_bad
## # A tibble: 76 x 3
## # Groups: drugName [76]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Venlafaxine 19 0.6
## 2 Buspirone 18 0.5
## 3 Belsomra 18 0.5
## 4 Bupropion 15 0.5
## 5 Cymbalta 15 0.5
## 6 Duloxetine 15 0.5
## 7 Sertraline 14 0.4
## 8 Escitalopram 14 0.4
## 9 Suvorexant 13 0.4
## 10 Prozac 12 0.4
## # ... with 66 more rows
ad_reviews_good <- ad_reviews %>%
filter(rating == 10) %>% #Best ratings
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_ad_reviews),1)) %>% #Percent of total ratings
arrange(desc(n)) #Most frequently first
ad_reviews_good
## # A tibble: 117 x 3
## # Groups: drugName [117]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Escitalopram 58 1.8
## 2 Alprazolam 53 1.6
## 3 Clonazepam 39 1.2
## 4 Xanax 37 1.1
## 5 Bupropion 35 1.1
## 6 Citalopram 35 1.1
## 7 Sertraline 31 0.9
## 8 Venlafaxine 28 0.8
## 9 Fluoxetine 28 0.8
## 10 Lexapro 28 0.8
## # ... with 107 more rows
Some especially badly rated medications include venlafaxine, buspirone, and Belsomra, while the highest rated include the antidepressants escitalopram and bupropion, as well as the sedatives alprazolam (also Xanax) and clonazepam. For one last look at the medications used for anxiety and depression, I’m going to omit ‘Insomnia’, as this may be weighting the ratings too heavily in favor of one class of sedatives.
ad_reviews_wo_insomnia <- ad_reviews %>%
filter(condition != 'Insomnia') #Omit treatment for insomnia
n_ad_reviews_wo_insomnia <-nrow(ad_reviews_wo_insomnia) #Number of reviews for anxiety and depression, not insomnia
ad_reviews_wo_insomnia %>%
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_ad_reviews_wo_insomnia),1)) %>% #Percent of reviews
arrange(desc(n)) #Most frequent first
## # A tibble: 115 x 3
## # Groups: drugName [115]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Escitalopram 145 5.4
## 2 Sertraline 129 4.8
## 3 Citalopram 127 4.7
## 4 Bupropion 104 3.9
## 5 Fluoxetine 101 3.7
## 6 Lexapro 101 3.7
## 7 Venlafaxine 99 3.7
## 8 Pristiq 82 3
## 9 Cymbalta 81 3
## 10 Alprazolam 79 2.9
## # ... with 105 more rows
ad_reviews_wo_insomnia_bad <- ad_reviews_wo_insomnia %>%
filter(rating == 1) %>% #Worst ratings
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_ad_reviews_wo_insomnia),1)) %>% #Percent of reviews
arrange(desc(n)) #Most frequent first
ad_reviews_wo_insomnia_bad
## # A tibble: 55 x 3
## # Groups: drugName [55]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Venlafaxine 19 0.7
## 2 Buspirone 18 0.7
## 3 Bupropion 15 0.6
## 4 Cymbalta 15 0.6
## 5 Duloxetine 15 0.6
## 6 Sertraline 14 0.5
## 7 Escitalopram 14 0.5
## 8 Prozac 12 0.4
## 9 Vilazodone 11 0.4
## 10 Lexapro 10 0.4
## # ... with 45 more rows
ad_reviews_wo_insomnia_good <- ad_reviews_wo_insomnia %>%
filter(rating == 10) %>% #Best ratings
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_ad_reviews_wo_insomnia),1)) %>% #Percent of reviews
arrange(desc(n)) #Most frequent first
ad_reviews_wo_insomnia_good
## # A tibble: 91 x 3
## # Groups: drugName [91]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Escitalopram 58 2.2
## 2 Alprazolam 53 2
## 3 Xanax 37 1.4
## 4 Bupropion 35 1.3
## 5 Citalopram 35 1.3
## 6 Clonazepam 34 1.3
## 7 Sertraline 31 1.1
## 8 Venlafaxine 28 1
## 9 Fluoxetine 28 1
## 10 Lexapro 28 1
## # ... with 81 more rows
Contrary to my expectation, removing insomnia left the list of highest-rated medications essentially unchanged. I’m going to take one last look at ratings, this time looking at the favorability of sedatives only.
ad_reviews_edit <- ad_reviews #Create duplicate of anxiety/depression reviews dataframe
#Below, I've recoded the medication names, which are factors, as their generic equivalent. This is largely possible because there are only a handful reviewed
ad_reviews_edit$drugName <- ad_reviews$drugName %>%
recode_factor('Klonopin' = 'Clonazepam', 'Xanax' = 'Alprazolam', 'Xanax XR' =
'Alprazolam', 'Ambien' = 'Zolpidem', 'Ambien CR' = 'Zolpidem', 'Lunesta' = 'Eszopiclone', 'Ativan' = 'Lorazepam', 'Restoril'= 'Temazepam')
#Below are the generic names for the sedatives I was talking about earlier: benzodiazepines (names usually end in '-zepam' or '-zolam') and two hypnotics that act on the same receptors
ad_reviews_bzd <- ad_reviews_edit %>% #Use the recoded data so brand names are counted under their generic
filter(drugName %in% c('Clonazepam', 'Alprazolam', 'Zolpidem', 'Temazepam', 'Eszopiclone', 'Diazepam', 'Lorazepam', 'Oxazepam', 'Triazolam', 'Chlordiazepoxide')) #%in% keeps every review of any listed sedative
ggplot(ad_reviews_bzd) +
geom_bar(aes(x = rating, fill = drugName)) +
labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings of Sedatives", fill = "Drug") +
scale_fill_viridis_d() +
scale_x_continuous(breaks = c(1:10))
I’ve grouped weight loss and smoking cessation together, as they’re less complicated to define than the other two classes. Therapeutically, only a handful of medicines are used to manage these, and the conditions are relatively straightforward.
wl_reviews <- reviews_edit %>%
filter(condition == 'Weight Loss')
n_wl_reviews <-nrow(wl_reviews) #Number of reviews for weight loss drugs
head(wl_reviews)
## # A tibble: 6 x 7
## uniqueID drugName condition review rating date usefulCount
## <dbl> <fct> <fct> <chr> <dbl> <date> <dbl>
## 1 164952 Phentermin~ Weight L~ "\"I have been o~ 8 2015-12-25 38
## 2 145900 Qsymia Weight L~ "\"My Dr agreed ~ 9 2013-02-24 46
## 3 52117 Adipex-P Weight L~ "\"I just starte~ 8 2011-03-10 7
## 4 145785 Qsymia Weight L~ "\"Began taking ~ 10 2014-05-26 61
## 5 145840 Qsymia Weight L~ "\"Made me jitte~ 1 2013-08-21 59
## 6 145495 Qsymia Weight L~ "\"I am a 28 yea~ 10 2017-07-01 21
wl_reviews %>%
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_wl_reviews),1)) %>%
arrange(desc(n))
## # A tibble: 21 x 3
## # Groups: drugName [21]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Phentermine 1211 33.6
## 2 Lorcaserin 387 10.7
## 3 Belviq 381 10.6
## 4 Bupropion / naltrexone 267 7.4
## 5 Contrave 263 7.3
## 6 Adipex-P 255 7.1
## 7 Phentermine / topiramate 249 6.9
## 8 Qsymia 207 5.7
## 9 Liraglutide 166 4.6
## 10 Saxenda 95 2.6
## # ... with 11 more rows
Looking at the initial data, one non-intuitive finding is the inclusion of megestrol, which is used in illnesses like cancer and AIDS when patients lose weight. Here it is listed under ‘Weight Loss’, but it is taken to gain weight rather than lose it, so I will take the extra step of removing these reviews.
wl_reviews_edit <- wl_reviews #Create duplicate dataframe for edits
#Recode brand products to generic
wl_reviews_edit$drugName <- recode_factor(wl_reviews$drugName, 'Megace' = 'Megestrol', 'Megace ES' = 'Megestrol', 'Megestrol' = 'Megestrol')
#Remove medication not used for weight loss
wl_reviews_edit <- wl_reviews_edit%>%
filter(drugName != 'Megestrol')
n_wl_reviews <- nrow(wl_reviews_edit) #Find true number of reviews
#Recode the rest, set default generic to phentermine as this is the most common entry
wl_reviews_edit$drugName <- wl_reviews_edit$drugName %>%
recode_factor(Belviq = "Locaserin", "Belviq XR" = "Locaserin", Locaserin = "Locaserin", Contrave = "Bupropion / naltrexone", "Bupropion / naltrexone" = "Bupropion / naltrexone", Qsymia = "Phentermine/ topiramate", "Phentermine/ topiramate" = "Phentermine/ topiramate", Saxenda = "Liraglutide",Victoza = "Liraglutide", Liraglutide = "Liraglutide", .default = "Phentermine")
wl_reviews_edit %>%
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_wl_reviews),1)) %>% #Percent of total reviews
arrange(desc(n)) #Highest frequency first
## # A tibble: 5 x 3
## # Groups: drugName [5]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Phentermine 2149 59.8
## 2 Bupropion / naltrexone 530 14.7
## 3 Locaserin 384 10.7
## 4 Liraglutide 324 9
## 5 Phentermine/ topiramate 207 5.8
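One thing worth checking in a recode like this is that every key matches an existing level exactly, since anything unmatched is assigned to `.default`. In the earlier table the levels are printed as ‘Lorcaserin’ and ‘Phentermine / topiramate’, whereas the keys in the call are `Locaserin` and `"Phentermine/ topiramate"`, so those reviews appear to have been counted under Phentermine. The base-R check below is a sketch on an illustrative subset of names; `present_levels` and `recode_keys` are hypothetical.

```r
# Levels as printed in the frequency table (illustrative subset)
present_levels <- c("Phentermine", "Lorcaserin", "Belviq",
                    "Phentermine / topiramate", "Qsymia")

# Keys as spelled in the recode call (illustrative subset)
recode_keys <- c("Belviq", "Locaserin", "Qsymia", "Phentermine/ topiramate")

# Keys that match no level: these recodes have no effect
unmatched_keys <- setdiff(recode_keys, present_levels)

# Levels not covered by any key: these would fall through to the default
defaulted_levels <- setdiff(present_levels, recode_keys)

unmatched_keys
defaulted_levels
```

Running a check of this kind against `levels(wl_reviews_edit$drugName)` before recoding would show which names need their spelling or spacing aligned.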
ggplot(wl_reviews_edit) +
geom_bar(aes(x = rating, fill = drugName)) +
labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Weight Loss") +
scale_fill_viridis_d() +
scale_x_continuous(breaks = c(1:10))
ggplot(wl_reviews_edit) +
geom_bar(aes(x = rating)) +
labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Weight Loss") +
facet_wrap(~drugName) + #For each separate drug
scale_x_continuous(breaks = c(1:10))
wl_yrly_reviews <- wl_reviews_edit %>%
mutate(year = year(date)) %>% #Find year of review
group_by(drugName, year) %>% #Group by drug, date of review
count()
ggplot(wl_yrly_reviews) +
geom_line(aes(x = year, y = n, color = drugName)) +
geom_point(aes(x = year, y = n, color = drugName)) +
labs(title = "Number of ratings for weight loss products over time", x = "Year", y = "Number of Reviews", color = "Drug") + #Change in reviews over time
scale_x_continuous(breaks = c(2008, 2010, 2012, 2014, 2016))
I will apply the same techniques to look at smoking cessation product reviews.
cs_reviews <- reviews_edit %>%
filter(condition == 'Smoking Cessation')
n_cs_reviews <-nrow(cs_reviews) #Number of reviews for smoking cessation drugs
head(cs_reviews)
## # A tibble: 6 x 7
## uniqueID drugName condition review rating date usefulCount
## <dbl> <fct> <fct> <chr> <dbl> <date> <dbl>
## 1 225508 Bupropion Smoking C~ "\"Love this, no ~ 10 2014-08-22 15
## 2 62773 Nicoderm~ Smoking C~ "\"I smoked for 3~ 8 2014-02-08 50
## 3 200059 Varenicl~ Smoking C~ "\"Worked great!!~ 10 2015-10-26 26
## 4 200287 Varenicl~ Smoking C~ "\"Have smoked fo~ 10 2014-12-03 1
## 5 152006 Chantix Smoking C~ "\"I put off tryi~ 10 2011-04-01 16
## 6 200478 Varenicl~ Smoking C~ "\"Amazing. Yes,~ 10 2012-11-08 13
cs_reviews %>%
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_cs_reviews),1)) %>%
arrange(desc(n))
## # A tibble: 17 x 3
## # Groups: drugName [17]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Varenicline 785 44.1
## 2 Chantix 633 35.6
## 3 Bupropion 147 8.3
## 4 Nicotine 84 4.7
## 5 Zyban 58 3.3
## 6 Nicoderm CQ 26 1.5
## 7 Commit 14 0.8
## 8 Nicotrol Inhaler 14 0.8
## 9 Nicorette 5 0.3
## 10 Habitrol 4 0.2
## 11 Nortriptyline 2 0.1
## 12 Pamelor 2 0.1
## 13 Buproban 2 0.1
## 14 Topiramate 1 0.1
## 15 Topamax 1 0.1
## 16 Nicotrol NS 1 0.1
## 17 Leader Nicotine Polacrilex 1 0.1
Smoking cessation products are much less numerous, so generic-brand duplication will have a significant impact. I will be recoding drugName to generic before visualizing ratings.
cs_reviews_edit <- cs_reviews
#Recode products as generic drug names
cs_reviews_edit$drugName <- cs_reviews$drugName %>%
recode_factor(Chantix = "Varenicline", Varenicline = "Varenicline", Bupropion = "Bupropion", Buproban = "Bupropion", Zyban = "Bupropion", Nortriptyline = "Nortriptyline", Pamelor = "Nortriptyline", Topiramate = "Topiramate", Topamax = "Topiramate", .default = "Nicotine")
cs_reviews_edit %>%
group_by(drugName) %>%
count() %>%
mutate(pct_reviews = round((100*n/n_cs_reviews),1)) %>% #Percent of total reviews
arrange(desc(n)) #Most frequent first
## # A tibble: 5 x 3
## # Groups: drugName [5]
## drugName n pct_reviews
## <fct> <int> <dbl>
## 1 Varenicline 1418 79.7
## 2 Bupropion 207 11.6
## 3 Nicotine 149 8.4
## 4 Nortriptyline 4 0.2
## 5 Topiramate 2 0.1
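An alternative way to express a brand-to-generic mapping is a named lookup vector, which leaves any name it does not recognize unchanged instead of assigning it to a default group. This is a minimal base-R sketch, not the method used above; `brand_to_generic` and `to_generic` are hypothetical names and the lookup covers only a few brands for illustration.

```r
# Hypothetical brand-to-generic lookup (small subset, for illustration only)
brand_to_generic <- c(Chantix = "Varenicline",
                      Zyban = "Bupropion",
                      Buproban = "Bupropion")

to_generic <- function(drug) {
  drug <- as.character(drug)                 # factors would otherwise index by integer code
  generic <- unname(brand_to_generic[drug])  # NA where the name is not a listed brand
  ifelse(is.na(generic), drug, generic)      # keep unlisted names (including generics) as-is
}

recoded <- to_generic(c("Chantix", "Varenicline", "Zyban", "Nicotine"))
recoded
```

The trade-off is that every brand has to be listed explicitly, whereas a `.default` value groups all remaining names in one step.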
ggplot(cs_reviews_edit) +
geom_bar(aes(x = rating, fill = drugName)) +
labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Smoking Cessation") +
scale_x_continuous(breaks = c(1:10)) + #rating is numeric, so use a continuous scale with one tick per rating
scale_fill_viridis_d()
Varenicline (Chantix) was approved in 2006, so my expectation is that a time series would show it gaining popularity, especially by the end of the review period.
cs_yrly_reviews <- cs_reviews_edit %>%
mutate(year = year(date)) %>% #Find the year of review
group_by(drugName, year) %>% #Group by drug and year of review
count()
ggplot(cs_yrly_reviews) +
geom_line(aes(x = year, y = n, color = drugName)) +
geom_point(aes(x = year, y = n, color = drugName)) +
labs(title = "Number of ratings for smoking cessation products over time", x = "Year", y = "Number of Reviews", color = "Drug") +
scale_x_continuous(breaks = c(2008, 2010, 2012, 2014, 2016))
This concludes the original exploration of the drug reviews. For the second part, I’ll be looking at sentiment reflected within the reviews.
For this sentiment analysis, I’ll be using the ‘nrc’ lexicon of sentiments used in a previous assignment. I originally tried the other two lexicons loaded below (afinn and bing), in addition to sentiword and some medical lexicons. Ultimately, I decided to use the nrc positive and negative sentiments for the best initial results and for consistency.
library(textdata)
## Warning: package 'textdata' was built under R version 3.6.3
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.6.3
library(lexicon)
## Warning: package 'lexicon' was built under R version 3.6.3
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.3
## Loading required package: RColorBrewer
sentiword <- hash_sentiment_sentiword
names(sentiword)[names(sentiword) == "x"] <- "word"
names(sentiword)[names(sentiword) == "y"] <- "score"
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
nrc_anger <- get_sentiments("nrc") %>%
filter(sentiment == 'anger')
tidy_oc_reviews <- oc_reviews %>%
unnest_tokens(word, review)
library(viridisLite)
color_pal <- viridis(n = 9, direction = -1)
custom_stop_words <- bind_rows(tibble(word = c('bad', 'awful', 'horrible', 'terrible', 'feeling', 'lose'),
lexicon = c("custom")),
stop_words)
tidy_oc_reviews %>%
filter(rating == 1) %>% #The worst ratings
anti_join(custom_stop_words) %>%
inner_join(nrc_anger) %>%
count(word) %>%
with(wordcloud(word, n, colors = color_pal, max.words = 75))
## Joining, by = "word"
## Joining, by = "word"
For the worst ratings, many of the words point to ongoing mental health issues, in addition to irritability, pain, and the word ‘shot’ - likely because some forms of birth control are given as an injection.
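The word clouds rest on a simple pipeline: split each review into words, keep the words tagged with a given sentiment, and count them. The base-R sketch below illustrates that idea on made-up reviews with a tiny hypothetical lexicon; it is not the real NRC list and does not use tidytext.

```r
# Made-up reviews and a tiny hypothetical lexicon (not the real NRC list)
reviews <- c("Terrible cramps and constant pain",
             "No pain, great results",
             "Mood swings and pain")
lexicon <- data.frame(word = c("terrible", "pain", "great"),
                      sentiment = c("negative", "negative", "positive"))

# Tokenize: lower-case, then split on anything that is not a letter
tokens <- unlist(strsplit(tolower(reviews), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Keep only tokens tagged negative in the lexicon, then count them
neg_words <- lexicon$word[lexicon$sentiment == "negative"]
neg_counts <- sort(table(tokens[tokens %in% neg_words]), decreasing = TRUE)
neg_counts
```

The second toy review shows a limitation of word-level matching: "no pain" still contributes a count for ‘pain’, because negation is ignored.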
nrc_pos <- get_sentiments("nrc") %>%
filter(sentiment == 'positive')
nrc_neg <- get_sentiments("nrc") %>%
filter(sentiment == 'negative')
tidy_ad_reviews <- ad_reviews %>%
unnest_tokens(word, review)
custom_stop_words_ad <- bind_rows(tibble(word = c('bad', 'awful', 'horrible', 'terrible', 'anxiety', 'depression', 'taking', 'pill', 'effect', 'feeling', 'lose', 'anxious', 'panic', 'disorder', 'medication', 'medicine', 'don'),
lexicon = c("custom")),
stop_words)
tidy_ad_reviews %>%
filter(rating == 10, condition != 'Insomnia') %>% #omitting medications for insomnia
anti_join(custom_stop_words_ad) %>%
inner_join(nrc_pos) %>%
count(word) %>%
with(wordcloud(word, n, colors = color_pal, max.words = 75))
## Joining, by = "word"
## Joining, by = "word"
tidy_ad_reviews %>%
filter(rating == 1, condition != 'Insomnia') %>%
anti_join(custom_stop_words_ad) %>%
inner_join(nrc_neg) %>%
count(word) %>%
with(wordcloud(word, n, colors = color_pal, max.words = 75))
## Joining, by = "word"
## Joining, by = "word"
The most notable finding in this word frequency is ‘worse’, which may indicate that the treatment is ineffective and people are not satisfied with the antidepressant’s effects. Additionally, there are side effects like nausea, dizziness, and potential effects on weight.
One medication each for weight loss and smoking cessation encompassed the vast majority of reviews: varenicline (Chantix) for smoking cessation and phentermine (Adipex-P) for weight loss. I’m going to take an overall look at the word frequency of their reviews, as well as identify any reasons why they are so popular.
tidy_cs_reviews <- cs_reviews_edit %>%
unnest_tokens(word, review)
custom_stop_words_cs <- bind_rows(tibble(word = c('bad', 'awful', 'horrible', 'terrible', 'taking', 'pill', 'effect', 'feeling', 'haven', 'medication', 'nicotine', 'quit', 'don', 'recommend', 'medicine', 'cold', 'doctor'),
lexicon = c("custom")),
stop_words)
tidy_cs_reviews %>%
filter(rating == 10, drugName == 'Varenicline') %>%
anti_join(custom_stop_words_cs) %>%
inner_join(nrc_pos) %>%
count(word) %>%
with(wordcloud(word, n, colors = color_pal, max.words = 75))
## Joining, by = "word"
## Joining, by = "word"
tidy_cs_reviews %>%
filter(rating == 1, drugName == 'Varenicline') %>%
anti_join(custom_stop_words_cs) %>%
inner_join(nrc_neg) %>%
count(word) %>%
with(wordcloud(word, n, colors = color_pal, max.words = 75))
## Joining, by = "word"
## Joining, by = "word"
For positive ratings, many of the words indicate positive benefit, but also the word ‘vivid’ shows up, an interesting finding. For negative ratings, nausea is the number one hit, in addition to dizziness, lowered mood, and the word ‘nightmare’.
tidy_wl_reviews <- wl_reviews_edit %>%
unnest_tokens(word, review)
custom_stop_words_wl <- bind_rows(tibble(word = c('bad', 'awful', 'horrible', 'terrible', 'taking', 'pill', 'effect', 'feeling', 'medication', 'day', 'weight', 'lose', 'lost', 'nicotine', 'quit', 'haven', 'don', 'recommend', 'medicine', 'doctor'),
lexicon = c("custom")),
stop_words)
tidy_wl_reviews %>%
filter(rating == 10, drugName == 'Phentermine') %>%
anti_join(custom_stop_words_wl) %>%
inner_join(sentiword) %>%
count(word) %>%
with(wordcloud(word, n, colors = color_pal, max.words = 75))
## Joining, by = "word"
## Joining, by = "word"
tidy_wl_reviews %>%
filter(rating == 1, drugName == 'Phentermine') %>%
anti_join(custom_stop_words_wl) %>%
inner_join(nrc_neg) %>%
count(word) %>%
with(wordcloud(word, n, colors = color_pal, max.words = 75))
## Joining, by = "word"
## Joining, by = "word"
For the positive ratings, many of the most frequent words suggest that reviewers are also following a diet alongside the medication. Negative ratings mention headache, pain, dizziness, and continued hunger.
For the second section, I’ll look more closely at formal side effect, or adverse drug event (ADE), data provided by the US Food and Drug Administration. The FDA’s records are available through their API in JSON format. There is an R package available on the FDA’s GitHub, installable with devtools, and extensive documentation on how to form a query. I’ll start by loading the package and looking at the 10 years of records that correspond to the drug review time period (2008 - 2017).
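As a rough sketch of what such a query looks like underneath, the URL can be assembled by hand in base R. `build_openfda_url` is a hypothetical helper, not part of the openfda package, and the environment variable name `OPENFDA_API_KEY` is an assumption. Reading the key from the environment keeps it out of the script; as far as I know the API also answers without a key, at a lower rate limit.

```r
# Hypothetical helper: assemble an openFDA query URL from its parts
build_openfda_url <- function(endpoint, filters, count_field, api_key = "") {
  # Each filter becomes field:value; multiple filters are joined with +AND+
  search <- paste(paste0(names(filters), ":", unlist(filters)), collapse = "+AND+")
  url <- paste0("https://api.fda.gov", endpoint, "?search=", search)
  # Only append the key when one was supplied
  if (nzchar(api_key)) url <- paste0(url, "&api_key=", api_key)
  paste0(url, "&count=", count_field)
}

# Assumed variable name; Sys.getenv() returns "" when it is not set
api_key <- Sys.getenv("OPENFDA_API_KEY")

age_url <- build_openfda_url(
  endpoint    = "/drug/event.json",
  filters     = list(receivedate = "[2008-01-01+TO+2017-12-31]"),
  count_field = "patient.patientonsetage",
  api_key     = api_key
)
print(age_url)
```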
library(devtools)
## Warning: package 'devtools' was built under R version 3.6.3
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 3.6.3
devtools::install_github('ropenhealth/openfda')
## Skipping install of 'openfda' from a github remote, the SHA1 (ace7ef93) has not changed since last install.
## Use `force = TRUE` to force installation
library(openfda)
adr_demos = fda_query("/drug/event.json") %>%
fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
fda_count("patient.patientonsetage") %>% #Age frequency of all events
fda_exec()
## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.patientonsetage
ggplot(adr_demos, aes(x = term, y = count)) +
geom_point() +
geom_vline(xintercept = 65, color = "red") +
labs(title = "Ages of Patients with Reported ADE's, 2008 - 2017", x = "Patient Age", y = "Count")
The reports appear to be skewed to the left, with the most frequent age somewhere in the sixties. I have marked age 65 with a red line, as it is the age generally accepted as ‘elderly’ in the United States and is also when many patients gain access to government prescription insurance via Medicare Part D.
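One way to check the impression of skew numerically is to summarise the count table directly. The sketch below uses invented counts in the same shape as `adr_demos` (a `term` column for age and a `count` column), so the numbers illustrate the method only and are not FDA data.

```r
# Invented counts in the same shape as adr_demos; rows must be sorted by age
toy_ages <- data.frame(
  term  = c(20, 40, 60, 65, 70, 80),
  count = c(5, 10, 30, 25, 20, 10)
)

# Most frequently reported age
modal_age <- toy_ages$term[which.max(toy_ages$count)]

# Mean age, weighting each age by its number of reports
mean_age <- weighted.mean(toy_ages$term, toy_ages$count)

# Weighted median: first age at which the cumulative share reaches 50%
cum_share  <- cumsum(toy_ages$count) / sum(toy_ages$count)
median_age <- toy_ages$term[which(cum_share >= 0.5)[1]]

# Share of reports at or above the red line at age 65
share_65_plus <- sum(toy_ages$count[toy_ages$term >= 65]) / sum(toy_ages$count)

print(c(mode = modal_age, mean = mean_age, median = median_age, share_65_plus = share_65_plus))
```

A mean that falls below the median is the usual signature of a left-skewed distribution, so comparing the two on the real table would support or contradict the visual reading.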
Next, I’m going to look at the half dozen cases we’ve established in the previous section. I’ll be looking at the commonly reported side effects associated with birth control/contraception, anxiety/depression, phentermine, and varenicline. Using this context, my goal is to interpret any unusual findings in the sentiment analysis.
ade_oc = fda_query("/drug/event.json") %>%
fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
fda_filter("patient.drug.drugindication", 'contraception+"birth+control"') %>% #Indication
fda_count("patient.reaction.reactionmeddrapt.exact") %>% #Reported reaction
fda_exec()
## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]+AND+patient.drug.drugindication:contraception+"birth+control"&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.reaction.reactionmeddrapt.exact
n_ade_oc <- sum(ade_oc$count)
ade_oc_drugs = fda_query("/drug/event.json") %>%
fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
fda_filter("patient.drug.drugindication", 'contraception+"birth+control"') %>% #Indication
fda_count("patient.drug.openfda.generic_name.exact") %>% #Drug name
fda_exec()
## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]+AND+patient.drug.drugindication:contraception+"birth+control"&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.drug.openfda.generic_name.exact
ade_oc_vis <- ade_oc %>%
mutate(pct_reviews = round((100*count/n_ade_oc),2)) %>% #Percent of all contraception reports
arrange(desc(pct_reviews)) %>% #Highest frequency first
head(10)
ade_oc_vis %>%
mutate(term = factor(term, levels = term)) %>%
arrange(desc(pct_reviews)) %>%
ggplot(aes(x = term, y = pct_reviews)) +
geom_bar(stat = "identity", fill= "#042f66") +
labs(title = "Frequency of ADE's associated with Birth Control Use, 2008 - 2017", x = "Reaction", y = "Percent of Total Reports") +
coord_flip()
The most commonly reported reaction is ‘device expulsion,’ which in this context likely refers to intrauterine devices, or IUDs. This would be a significant event and would compromise the effectiveness of treatment. I think this is the most common report because it is serious and requires the intervention of a medical professional to fix. Coming in at number 8 is anxiety, which was mentioned as a symptom in the word frequency analysis. The number 10 result, injury, is likely an error, as I was unable to completely remove some queries associated with a common pain medication, ibuprofen.
ade_ad = fda_query("/drug/event.json") %>%
fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
fda_filter("patient.drug.drugindication", 'anxiety+depression+"major+depressive+disorder"+"panic+disorder"') %>% #Indication
fda_count("patient.reaction.reactionmeddrapt.exact") %>% #Reported reaction
fda_exec()
## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]+AND+patient.drug.drugindication:anxiety+depression+"major+depressive+disorder"+"panic+disorder"&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.reaction.reactionmeddrapt.exact
n_ade_ad <- sum(ade_ad$count)
ade_ad_vis <- ade_ad %>%
mutate(pct_reviews = round((100*count/n_ade_ad),2)) %>% #Percent of all antidepressant reports
arrange(desc(pct_reviews)) %>% #Highest frequency first
head(10)
ade_ad_vis
## term count pct_reviews
## 1 ANXIETY 9069 4.41
## 2 DRUG INEFFECTIVE 8034 3.91
## 3 NAUSEA 7042 3.42
## 4 DIZZINESS 5628 2.74
## 5 HEADACHE 5585 2.71
## 6 FATIGUE 5545 2.70
## 7 INSOMNIA 5343 2.60
## 8 DEPRESSION 5108 2.48
## 9 PAIN 4608 2.24
## 10 OFF LABEL USE 4536 2.20
ade_ad_vis %>%
mutate(term = factor(term, levels = term)) %>%
arrange(desc(pct_reviews)) %>%
ggplot(aes(x = term, y = pct_reviews)) +
geom_bar(stat = "identity", fill= "#042f66") +
labs(title = "ADE's associated with Anxiety and Depression Treatment, 2008 - 2017", x = "Reaction", y = "Percent of Total Reports") +
coord_flip()
For medications used for managing anxiety and depression, the top two results are associated with worsening or ineffectiveness of therapy. Beyond that, nausea and dizziness came up in our original sentiment analysis as well.
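It helps to be explicit about what the percentages are a share of. The denominator `n_ade_ad` is the sum of counts over every reaction term the API returned, and a single report can list several reactions, so the figures are shares of reaction mentions rather than of reports. My understanding is that openFDA count queries return only the most frequent terms (100 by default), which I have not verified for this query. The sketch below re-enters the printed top 10 and backs out the implied denominator.

```r
# Top 10 rows exactly as printed in the output above
top10 <- data.frame(
  term  = c("ANXIETY", "DRUG INEFFECTIVE", "NAUSEA", "DIZZINESS", "HEADACHE",
            "FATIGUE", "INSOMNIA", "DEPRESSION", "PAIN", "OFF LABEL USE"),
  count = c(9069, 8034, 7042, 5628, 5585, 5545, 5343, 5108, 4608, 4536),
  pct_reviews = c(4.41, 3.91, 3.42, 2.74, 2.71, 2.70, 2.60, 2.48, 2.24, 2.20),
  stringsAsFactors = FALSE
)

# Each row gives an estimate of the denominator; they differ slightly
# because pct_reviews was rounded to two decimals
implied_total <- round(top10$count / (top10$pct_reviews / 100))
print(range(implied_total))

# Combined share of all returned reaction mentions covered by the top 10
top10_share <- sum(top10$pct_reviews)
print(top10_share)
```

The implied denominator is a little under 206,000 reaction mentions, and the ten terms shown cover about 29% of them, so most mentions fall outside the chart.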
ade_wl = fda_query("/drug/event.json") %>%
fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
fda_filter("patient.drug.openfda.generic_name", "phentermine") %>% #Drug name
fda_count("patient.reaction.reactionmeddrapt.exact") %>% #Reported reaction
fda_exec()
## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]+AND+patient.drug.openfda.generic_name:phentermine&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.reaction.reactionmeddrapt.exact
n_ade_wl <- sum(ade_wl$count) #Number of ADE's reported for phentermine
ade_wl_vis <- ade_wl %>%
mutate(pct_reviews = round((100*count/n_ade_wl),2)) %>% #Percent of all phentermine reports
arrange(desc(pct_reviews)) %>% #Highest frequency first
head(10)
ade_wl_vis %>%
mutate(term = factor(term, levels = term)) %>%
arrange(desc(pct_reviews)) %>%
ggplot(aes(x = term, y = pct_reviews)) +
geom_bar(stat = "identity", fill= "#042f66") +
labs(title = "Frequency of ADE's associated with Phentermine Use, 2008 - 2017", x = "Reaction", y = "Percent of Total Reports") +
coord_flip()
It is worth pointing out that there are fewer reports for phentermine side effects than for the other classes examined. This could be because it remains a treatment that is given to otherwise healthy people looking for weight loss. Two symptoms that stand out are pain and paraesthesia, which is a tingling, prickling, or burning sensation of the skin. Dizziness was also mentioned in the original sentiment analysis.
ade_cs = fda_query("/drug/event.json") %>%
fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
fda_filter("patient.drug.openfda.generic_name", "varenicline") %>% #Drug name
fda_count("patient.reaction.reactionmeddrapt.exact") %>% #Reported reaction
fda_exec()
## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]+AND+patient.drug.openfda.generic_name:varenicline&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.reaction.reactionmeddrapt.exact
n_ade_cs <- sum(ade_cs$count) #Number of ADE's reported for varenicline
ade_cs_vis <- ade_cs %>%
mutate(pct_reviews = round((100*count/n_ade_cs),1)) %>% #Percent of all varenicline reports
arrange(desc(pct_reviews)) %>% #Highest frequency first
head(10)
ade_cs_vis %>%
mutate(term = factor(term, levels = term)) %>%
arrange(desc(pct_reviews)) %>%
ggplot(aes(x = term, y = pct_reviews)) +
geom_bar(stat = "identity", fill= "#042f66") +
labs(title = "Frequency of ADE's associated with Chantix Use, 2008 - 2017", x = "Reaction", y = "Percent of Total Reports") +
coord_flip()
Finally, for varenicline the most common side effects reported are nausea and depression. Interestingly, insomnia and abnormal dreams are also mentioned. Based on my own clinical knowledge, this fits with the sentiment analysis identifying ‘vivid’ as a word, because one idiosyncratic reaction associated with varenicline is nightmares or vivid dreaming. This may be related to its treatment effect on nicotinic acetylcholine receptors. It has a much longer half-life than nicotine, 24 hours versus 1-2 hours, so in all likelihood patients are receiving constant stimulation even while sleeping.
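The half-life argument can be put into rough numbers. This sketch assumes simple first-order elimination, an 8-hour night, and the half-lives quoted above (taking 2 hours as the upper end for nicotine). `fraction_remaining` is a hypothetical helper, and the result is a back-of-the-envelope illustration rather than a pharmacokinetic model.

```r
# Fraction of a dose remaining after `hours`, assuming first-order elimination
fraction_remaining <- function(hours, half_life) 0.5^(hours / half_life)

sleep_hours <- 8  # assumed length of a night's sleep

varenicline_left <- fraction_remaining(sleep_hours, 24)  # 24-hour half-life from the text
nicotine_left    <- fraction_remaining(sleep_hours, 2)   # upper end of the 1-2 hour range

print(round(c(varenicline = varenicline_left, nicotine = nicotine_left), 3))
```

Under these assumptions roughly 79% of the varenicline is still present after a night’s sleep, against about 6% of the nicotine, which is consistent with the idea of continued receptor stimulation overnight.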
Patient-generated user reviews provide a complementary perspective on the tolerability of medication therapy. The medications reviewed were not surprising and encompass many of the most popular medications used for each condition. Analysis of reviews over time shows a large spike between 2014 and 2016. Using internet reviews in combination with sentiment analysis, many of the most common side effects were correlated with adverse drug events reported to the FDA. There were some differences between reports, as the FDA tended to have more serious and treatment-modifying effects, while user reviews tended to accentuate the mental health impacts of starting a new therapy.
One limitation I identified early in the project is the popularity of birth control as a reviewed drug, which indicates that the reviews are biased towards younger women. On the one hand, these patients are more likely to be on one or two medications, while older patients tend to be on up to four if they are being treated for chronic diseases. This means that the side effects are likely to be from the single medication and not a complex interaction between medications. On the other hand, I would not generalize any of the findings to older populations, as age has an important impact on drug therapy. Further to this point, younger reviewers are also likely healthier, and it’s not clear whether birth control, for instance, is exacerbating an existing mood disorder or causing a de novo one.
For future directions, the openFDA API has much more functionality than what I utilized, including the consideration of multiple medications and conditions, demographic information, and severity of the reaction (death, hospitalization, discontinuation of therapy). The reviews data could be made much more tidy, and my treatment here reflects a minimal exercise to make the data intelligible. Additionally, some features of the reviews went unused, including the all-important ‘Users found this helpful’, which is a ubiquitous feature of user feedback in the Web 2.0 environment. Finally, these dataframes are ripe for treatment in a relational database, especially if I were able to find an existing table with all synonyms for a medication (brand, generic, nicknames, identifiers for billing and commercial purposes). These already exist in regulatory and clinical records, and their implementation would diminish the need for recoding of the thousands of names of FDA-approved drugs.
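To illustrate the synonym-table idea, here is a small base R sketch that maps brand and generic names onto one generic name with a join. The table holds only the two brand/generic pairs mentioned earlier in this report; the column names and the toy `reviews` frame are assumptions, and a real lookup table would come from an existing regulatory or clinical source.

```r
# Hypothetical synonym table: every known name (lower-cased) points to one generic name
drug_synonyms <- data.frame(
  synonym = c("chantix", "varenicline", "adipex-p", "phentermine"),
  generic = c("varenicline", "varenicline", "phentermine", "phentermine"),
  stringsAsFactors = FALSE
)

# Toy reviews table with inconsistent drug naming (invented for illustration)
reviews <- data.frame(
  review_id = 1:4,
  drugName  = c("Chantix", "Varenicline", "Adipex-P", "Phentermine"),
  stringsAsFactors = FALSE
)

# Normalise case, then left-join so unmatched names are kept with generic = NA
reviews$synonym <- tolower(reviews$drugName)
normalised <- merge(reviews, drug_synonyms, by = "synonym", all.x = TRUE)

# merge() sorts by the join key, so restore the original review order
normalised <- normalised[order(normalised$review_id), c("review_id", "drugName", "generic")]
print(normalised)
```

Keeping unmatched rows, rather than dropping them, makes gaps in the synonym table visible instead of silently shrinking the dataset.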