Disclaimer

This project is done for academic purposes and my own personal interest, and does not reflect my professional opinon about any of the medications or conditions discussed.

Introduction

For this project, I’m going to use available datasets to help understand patient versus professional perceptions of drug therapy benefit and tolerability. Using patient reviews, I intend to determine the most common classes of prescription drugs, as well as their relative tolerability. These data are based off reviews from Drugs.com (.csv), complete with rating, a brief written review, as well as other user feedback - yes or no to how helpful a review was. The bulk of my time will be spent performing a sentiment analysis of the drug reviews, looking for words associated with positive or negative opinion.

In addition to this, I’ll be using the US Food and Drug Administration’s API for adverse drug reactions (JSON) to offer a contrasting regulatory/ professional perspective. Based on this comparison, I hope to identify any discrepancies in reported tolerability versus firsthand patient experience. My purpose in doing this is to enhance my own ability to warn about side effects, and identify any boundaries to successful drug therapy before they happen.

Sources:

Original Drugs.com paper: http://kdd.cs.ksu.edu/Publications/Student/kallumadi2018aspect.pdf

Kaggle / UC Irvine dataset: https://www.kaggle.com/jessicali9530/kuc-hackathon-winter-2018

FDA API website: https://open.fda.gov/apis/drug/event/

FDA Github with R package: https://github.com/ropenhealth/openfda

Initial analysis

Uploading data

I’ll start by uploading the training set of the original data, which contains over 160,000 different drug reviews. I’ve specified some two rows, drug name and condition, as factors to better understand some of the most commonly reviewed medicines and diseases.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.6.3

## -- Attaching packages ---------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## Warning: package 'ggplot2' was built under R version 3.6.3

## Warning: package 'tibble' was built under R version 3.6.3

## Warning: package 'tidyr' was built under R version 3.6.3

## Warning: package 'purrr' was built under R version 3.6.3

## Warning: package 'forcats' was built under R version 3.6.3

## -- Conflicts ------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(lubridate)

## Warning: package 'lubridate' was built under R version 3.6.3

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:dplyr':
## 
##     intersect, setdiff, union

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)

reviews_orig <- read_csv("https://raw.githubusercontent.com/hillt5/DATA607_Final_Project/master/drugsComTrain_raw.csv", col_types = "dffcdcd") #Set column datatypes

reviews_orig$date <- as.Date(reviews_orig$date, format = "%d-%b-%y") #set date format

head(reviews_orig)

## # A tibble: 6 x 7
##   uniqueID drugName    condition   review          rating date       usefulCount
##      <dbl> <fct>       <fct>       <chr>            <dbl> <date>           <dbl>
## 1   206461 Valsartan   Left Ventr~ "\"It has no s~      9 2012-05-20          27
## 2    95260 Guanfacine  ADHD        "\"My son is h~      8 2010-04-27         192
## 3    92703 Lybrel      Birth Cont~ "\"I used to t~      5 2009-12-14          17
## 4   138000 Ortho Evra  Birth Cont~ "\"This is my ~      8 2015-11-03          10
## 5    35696 Buprenorph~ Opiate Dep~ "\"Suboxone ha~      9 2016-11-27          37
## 6   155963 Cialis      Benign Pro~ "\"2nd day on ~      2 2015-11-28          43

I’ll start by looking at the two variables defined as factors, drugName and condition. I’ve limited to looking at the top 25 factors for each.

num_obs <- nrow(reviews_orig) #Number of reviews

summary(reviews_orig$drugName, 25) #Top 25 drugs by review frequency

##                     Levonorgestrel                       Etonogestrel 
##                               3657                               3336 
##  Ethinyl estradiol / norethindrone                          Nexplanon 
##                               2850                               2156 
##   Ethinyl estradiol / norgestimate Ethinyl estradiol / levonorgestrel 
##                               2117                               1888 
##                        Phentermine                         Sertraline 
##                               1543                               1360 
##                       Escitalopram                             Mirena 
##                               1292                               1242 
##                           Implanon                         Gabapentin 
##                               1102                               1047 
##                          Bupropion                        Venlafaxine 
##                               1022                               1016 
##                         Miconazole                Medroxyprogesterone 
##                               1000                                995 
##                         Citalopram                            Lexapro 
##                                995                                952 
##             Bupropion / naltrexone                         Duloxetine 
##                                950                                934 
##                      Metronidazole                           Contrave 
##                                922                                920 
##   Drospirenone / ethinyl estradiol                       Depo-Provera 
##                                890                                882 
##                            (Other) 
##                             126229

round(100*(num_obs-126229)/num_obs, 1) #Percent of reviews that fall within the top 25 drugs

## [1] 21.7

For drugs, it’s evident that many of the commonly reviewed medications are progestins used for oral contraception (OC). Levonorgestrel is an emergency contraceptive and also the active ingredient in many different brands of OC. There are also several brand name contraceptives, as well as long-acting injections and implants, like Implanon and Depo-Provera, which further complcicates definition of contraceptives as a drug class. The top 25 results also represent 21% of all reviews. Beyond contraceptives, phentermine and bupropion/naltrexone are used for weight loss, metronidazole is an antimicrobial, and many of the others are antidepressants.

summary(reviews_orig$condition, 25) #top 25 conditions being treated

##             Birth Control                Depression                      Pain 
##                     28788                      9069                      6145 
##                   Anxiety                      Acne           Bipolar Disorde 
##                      5904                      5588                      4224 
##                  Insomnia               Weight Loss                   Obesity 
##                      3673                      3609                      3568 
##                      ADHD          Diabetes, Type 2   Emergency Contraception 
##                      3383                      2554                      2463 
##       High Blood Pressure   Vaginal Yeast Infection Abnormal Uterine Bleeding 
##                      2321                      2274                      2096 
##         Bowel Preparation               ibromyalgia         Smoking Cessation 
##                      1859                      1791                      1780 
##                  Migraine        Anxiety and Stress  Major Depressive Disorde 
##                      1694                      1663                      1607 
##              Constipation             Panic Disorde                   (Other) 
##                      1595                      1463                     61287 
##                      NA's 
##                       899

round(100*(num_obs-61287)/num_obs, 2) #Percent of reviews that fall within the top 25 conditions

## [1] 62

For conditions, the most common reason given is ‘Birth Control’, not surprising considering the most common medications. The same medications can also be used for other top conditions, including migraine, acne, emergency contraception, and abnormal uterine bleeding. There are also many psychiatric and neurological illnesses, including anxiety, insomnia, ADHD, and derpession. Chronic illnesses associated with aging, like high blood pressure and type 2 diabetes are also present. Finally, the top 25 conditions comprise just over 62% of the data, with 899 values missing - listed as “NA”.

Next, I’ll correct a spelling error - ‘ibromyalgia’ is likely supposed to mean fibromyalgia, a neuropsychiatric syndrome characterized by chronic pain. I’ll also recode some values as “NA” that I noticed while looking further into the dataset. Some of these are put as ‘Not listed’, while others appear to be coding errors where ‘usefulCount’ data was shifted into the condition column. I also scanned drug name for spelling errors and found no issues.

reviews_edit <- reviews_orig #Create new file from raw input

reviews_edit$condition <- reviews_edit$condition %>%
  recode_factor(ibromyalgia = "Fibromyalgia", atigue = "Fatigue") %>% #Fix two spelling errors
  na_if("Not Listed / Othe") #Recode values as 'NA'

error_span <- str_detect(reviews_edit$condition, pattern = "</span>") #Identify erroneous entries
reviews_edit$condition <- replace(reviews_edit$condition, list = error_span, NA) #Replace with NA

summary(reviews_edit$condition, 25) #First 25 entries

##             Birth Control                Depression                      Pain 
##                     28788                      9069                      6145 
##                   Anxiety                      Acne           Bipolar Disorde 
##                      5904                      5588                      4224 
##                  Insomnia               Weight Loss                   Obesity 
##                      3673                      3609                      3568 
##                      ADHD          Diabetes, Type 2   Emergency Contraception 
##                      3383                      2554                      2463 
##       High Blood Pressure   Vaginal Yeast Infection Abnormal Uterine Bleeding 
##                      2321                      2274                      2096 
##         Bowel Preparation              Fibromyalgia         Smoking Cessation 
##                      1859                      1791                      1780 
##                  Migraine        Anxiety and Stress  Major Depressive Disorde 
##                      1694                      1663                      1607 
##              Constipation             Panic Disorde                   (Other) 
##                      1595                      1463                     59948 
##                      NA's 
##                      2238

yrly_reviews <- reviews_edit %>%
  mutate(year = year(date)) %>% #Find year of review
  group_by(year) %>% #Group by drug, date of review
  count()


ggplot(yrly_reviews) +
  geom_line(aes(x = year, y = n)) +
  geom_point(aes(x = year, y = n)) +
  labs(title = "Number of reviews over time", x = "Year", y = "Number of Reviews") + #Change in reviews over time
  scale_x_continuous(breaks = c(2008, 2010, 2012, 2014, 2016))

Reviews over time significantly increased from 2014 to 2016.

Building a Drug Class from Reviews Data

For the next step of this analysis, I’ll be subsetting the reviews data to offer the closest other medications. The best way of doing this is by looking at the most common conditions and then look at what other medicines are used to manage them. I’ll be defining four groups in the next section: medicines used for oral contraceptives, anxiety/depression, weight loss, and smoking cessation.

Contraception

My goal for the contraceptive class is for it to contain all available pharmacological agents for contraception, including oral and implanted forms. I will omit the use of emergency contraception, as this is typically one-time use at a much higher dose. As mentioned earlier, I’ll subset the reviews by ‘Birth Control’ condition to generate the shorter list of agents.

oc_reviews <- reviews_edit %>%
  filter(condition == 'Birth Control')

head(oc_reviews)

## # A tibble: 6 x 7
##   uniqueID drugName     condition  review          rating date       usefulCount
##      <dbl> <fct>        <fct>      <chr>            <dbl> <date>           <dbl>
## 1    92703 Lybrel       Birth Con~ "\"I used to t~      5 2009-12-14          17
## 2   138000 Ortho Evra   Birth Con~ "\"This is my ~      8 2015-11-03          10
## 3    48928 Ethinyl est~ Birth Con~ "\"I had been ~      8 2016-12-08           1
## 4    98494 Nexplanon    Birth Con~ "\"Started Nex~      3 2014-08-07          10
## 5   227020 Etonogestrel Birth Con~ "\"Nexplanon d~      9 2014-08-11          11
## 6   106703 Implanon     Birth Con~ "\"Never again~      2 2015-08-20           1

n_oc_reviews <-nrow(oc_reviews) #Number of reviews of birth control

oc_reviews %>%
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_oc_reviews),1)) %>% #Percent of all birth control reviews
  arrange(desc(n)) #Highest frequency first

## # A tibble: 172 x 3
## # Groups:   drugName [172]
##    drugName                               n pct_reviews
##    <fct>                              <int>       <dbl>
##  1 Etonogestrel                        3314        11.5
##  2 Ethinyl estradiol / norethindrone   2337         8.1
##  3 Nexplanon                           2149         7.5
##  4 Levonorgestrel                      2129         7.4
##  5 Ethinyl estradiol / levonorgestrel  1600         5.6
##  6 Ethinyl estradiol / norgestimate    1580         5.5
##  7 Implanon                            1095         3.8
##  8 Mirena                               965         3.4
##  9 Skyla                                822         2.9
## 10 Lo Loestrin Fe                       667         2.3
## # ... with 162 more rows

This confirms my suspicions that many of the most common drugs were being used for birth control. Several observations jump out - the highest ranked drug etonorgestrel is the generic name for the active ingredient in Implanon and Explanon, so in all likelihood this is by far the most common contraceptive reviewed. The same generic/brand relationship exists for levonorgestrel and Mirena, copper and ParaGard. Also, several oral contraceptives exist as combinations with a synthetic estrogen, or ethinyl estradiol.

Next, lets take an exciting first look at the favorability of contraceptives, based on the numeric rating.

ggplot(oc_reviews) +
  geom_bar(aes(x = rating), fill = "#042f66")+
  labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Oral Contraceptives") +
  scale_x_continuous(breaks = c(1:10))

There appear to be many poorly rated birth control products, likely related to some sort of intolerable effect or treatment failure. I’ll take a look at these reviews specifically, with rating of ‘1’ out of ten.

oc_reviews_bad <- oc_reviews %>%
  filter(rating == 1) %>% #Looking closer at the anomaly with rating '1'
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_oc_reviews),1)) %>% #As percent of total reviews
  arrange(desc(n)) 

oc_reviews_bad

## # A tibble: 133 x 3
## # Groups:   drugName [133]
##    drugName                               n pct_reviews
##    <fct>                              <int>       <dbl>
##  1 Etonogestrel                         524         1.8
##  2 Ethinyl estradiol / norethindrone    467         1.6
##  3 Nexplanon                            366         1.3
##  4 Ethinyl estradiol / norgestimate     252         0.9
##  5 Ethinyl estradiol / levonorgestrel   239         0.8
##  6 Levonorgestrel                       199         0.7
##  7 Implanon                             152         0.5
##  8 Medroxyprogesterone                  119         0.4
##  9 Depo-Provera                         116         0.4
## 10 Mirena                               111         0.4
## # ... with 123 more rows

The contraceptives associated with poor ratings appear to be the same as the most popular items.

oc_reviews_popular <- oc_reviews %>%
  filter(drugName == c('Etonogestrel', 'Ethinyl estradiol / norethindrone', 'Nexplanon', 'Ethinyl estradiol / norgestimate', 'Ethinyl estradiol / levonorgestrel'))

## Warning in `==.default`(drugName, c("Etonogestrel", "Ethinyl estradiol /
## norethindrone", : longer object length is not a multiple of shorter object
## length

## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length

ggplot(oc_reviews_popular) +
  geom_bar(aes(x = rating, fill = drugName))+
  labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Top 5 Contraceptives") +
  scale_fill_viridis_d() +
  scale_x_continuous(breaks = c(1:10))

Graphically, I chose the top five and it appears that there are no major anomalies in the ratings. Since many birth control products hit the same pharmacological target, I feel safe in assuming that looking at negative sentiment as a whole is appropriate.

Anxiety/ Depression

There are several conditions that encompass symptoms of depression and anxiety, within the top 25 there were some partial hits including ‘Insomnia’, ‘Anxiety and Stress’, ‘Panic Disorder’, and "Major Depressive Disorder’.

ad_reviews <- reviews_edit %>%
  filter(condition == c('Depression', 'Insomnia', 'Anxiety', 'Anxiety and Stress', 'Major Depressive Disorder', 'Panic Disorder'))

## Warning in `==.default`(condition, c("Depression", "Insomnia", "Anxiety", :
## longer object length is not a multiple of shorter object length

## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length

n_ad_reviews <- nrow(ad_reviews) #number of reviews for depression/ anxiety meds

ad_reviews %>%
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_ad_reviews),1)) %>% #Percent of all ax/dep reviews
  arrange(desc(n)) #Highest first

## # A tibble: 153 x 3
## # Groups:   drugName [153]
##    drugName         n pct_reviews
##    <fct>        <int>       <dbl>
##  1 Escitalopram   145         4.4
##  2 Sertraline     129         3.9
##  3 Citalopram     127         3.8
##  4 Bupropion      104         3.1
##  5 Fluoxetine     101         3.1
##  6 Lexapro        101         3.1
##  7 Venlafaxine     99         3  
##  8 Pristiq         82         2.5
##  9 Cymbalta        81         2.5
## 10 Mirtazapine     80         2.4
## # ... with 143 more rows

The most popular medications used for management of anxiety and depression make up a small percent of the overall category of drugs used to manage symptoms. Lets look at the data more visually.

ggplot(ad_reviews) +
  geom_bar(aes(x = rating), fill= "#042f66") +
  labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Anxiety and Depression")

ad_reviews_popular <- ad_reviews %>%
  filter(drugName == c('Escitalopram', 'Sertraline', 'Citalopram', 'Bupropion', 'Fluoxetine'))
  
ggplot(ad_reviews_popular) +
  geom_bar(aes(x = rating, fill = drugName)) +
  labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Top 5 Treatments for Anxiety or Depression") +
  scale_fill_viridis_d() +
  scale_x_continuous(breaks = c(1:10))

The bar plot follows the overall pattern of relatively good marks, with an uptick in the lowest score. It appears that there’s no set pattern in the ratings, so I’ll take a quick look at the differences in ratings of ‘1’ versus ‘10’.

ad_reviews_bad <- ad_reviews %>%
  filter(rating == 1) %>% #Worst ratings
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_ad_reviews),1)) %>% #Percent of total ratings
  arrange(desc(n)) #Most frequently given first

ad_reviews_bad

## # A tibble: 76 x 3
## # Groups:   drugName [76]
##    drugName         n pct_reviews
##    <fct>        <int>       <dbl>
##  1 Venlafaxine     19         0.6
##  2 Buspirone       18         0.5
##  3 Belsomra        18         0.5
##  4 Bupropion       15         0.5
##  5 Cymbalta        15         0.5
##  6 Duloxetine      15         0.5
##  7 Sertraline      14         0.4
##  8 Escitalopram    14         0.4
##  9 Suvorexant      13         0.4
## 10 Prozac          12         0.4
## # ... with 66 more rows

ad_reviews_good <- ad_reviews %>% 
  filter(rating == 10) %>% #Best ratings
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_ad_reviews),1)) %>% #Perecnt of total ratings
  arrange(desc(n)) #Most frequently first

ad_reviews_good

## # A tibble: 117 x 3
## # Groups:   drugName [117]
##    drugName         n pct_reviews
##    <fct>        <int>       <dbl>
##  1 Escitalopram    58         1.8
##  2 Alprazolam      53         1.6
##  3 Clonazepam      39         1.2
##  4 Xanax           37         1.1
##  5 Bupropion       35         1.1
##  6 Citalopram      35         1.1
##  7 Sertraline      31         0.9
##  8 Venlafaxine     28         0.8
##  9 Fluoxetine      28         0.8
## 10 Lexapro         28         0.8
## # ... with 107 more rows

Some especially badly rated medications include venlafaxine, buspirone, and Belsomra, while the highest rated include the antidepressants escitalopram and bupropion, as well as the sedatives alprazolam (also Xanax) and clonazepam. For one last look at the medications used for anxiety and depression, I’m going to omit ‘insmonia’, as this may be weighing the ratings too heavily in favor of one class of sedatives.

ad_reviews_wo_insomnia <- ad_reviews %>%
  filter(condition != 'Insomnia') #Omit treatment for insomnia

n_ad_reviews_wo_insomnia <-nrow(ad_reviews_wo_insomnia) #Number of reviews for anxiety and depression, not insomnia

ad_reviews_wo_insomnia %>%
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_ad_reviews_wo_insomnia),1)) %>% #Percent of reviews
  arrange(desc(n)) #Most frequent first

## # A tibble: 115 x 3
## # Groups:   drugName [115]
##    drugName         n pct_reviews
##    <fct>        <int>       <dbl>
##  1 Escitalopram   145         5.4
##  2 Sertraline     129         4.8
##  3 Citalopram     127         4.7
##  4 Bupropion      104         3.9
##  5 Fluoxetine     101         3.7
##  6 Lexapro        101         3.7
##  7 Venlafaxine     99         3.7
##  8 Pristiq         82         3  
##  9 Cymbalta        81         3  
## 10 Alprazolam      79         2.9
## # ... with 105 more rows

ad_reviews_wo_insomnia_bad <- ad_reviews_wo_insomnia %>%
  filter(rating == 1) %>% #Worst ratings
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_ad_reviews_wo_insomnia),1)) %>% #Percent of reviews 
  arrange(desc(n)) #Most frequent first

ad_reviews_wo_insomnia_bad

## # A tibble: 55 x 3
## # Groups:   drugName [55]
##    drugName         n pct_reviews
##    <fct>        <int>       <dbl>
##  1 Venlafaxine     19         0.7
##  2 Buspirone       18         0.7
##  3 Bupropion       15         0.6
##  4 Cymbalta        15         0.6
##  5 Duloxetine      15         0.6
##  6 Sertraline      14         0.5
##  7 Escitalopram    14         0.5
##  8 Prozac          12         0.4
##  9 Vilazodone      11         0.4
## 10 Lexapro         10         0.4
## # ... with 45 more rows

ad_reviews_wo_insomnia_good <- ad_reviews_wo_insomnia %>%
  filter(rating == 10) %>% #Best ratings
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_ad_reviews_wo_insomnia),1)) %>% #Percent of reviews
  arrange(desc(n)) #Most frequent first

ad_reviews_wo_insomnia_good

## # A tibble: 91 x 3
## # Groups:   drugName [91]
##    drugName         n pct_reviews
##    <fct>        <int>       <dbl>
##  1 Escitalopram    58         2.2
##  2 Alprazolam      53         2  
##  3 Xanax           37         1.4
##  4 Bupropion       35         1.3
##  5 Citalopram      35         1.3
##  6 Clonazepam      34         1.3
##  7 Sertraline      31         1.1
##  8 Venlafaxine     28         1  
##  9 Fluoxetine      28         1  
## 10 Lexapro         28         1  
## # ... with 81 more rows

Removing insomnia did not change the highest ratings as expected. I’m going to take one last look at ratings, this time looking at the favorability of sedatives only.

ad_reviews_edit <- ad_reviews #Create duplicate of anxiety/depression reviews dataframe

#Below, I've recoded the medication names, which are factors, as their generic equivalent. This is largely possible because there are only a handful reviewed

ad_reviews_edit$drugName <- ad_reviews$drugName %>% 
  recode_factor('Klonopin' = 'Clonazepam', 'Xanax' = 'Alprazolam', 'Xanax XR' = 
  'Alprazolam', 'Ambien' = 'Zolpidem', 'Ambien CR' = 'Zolpidem',  'Lunesta' = 'Eszopiclone', 'Ativan' = 'Lorazepam', 'Restoril'= 'Temazepam')

#Below are the generic names for the sedatives I was talking about earlier: benzodiazepines --usualy ends in '-zolam' -- and two hyptnotics that affect the same receptors 

ad_reviews_bzd <- ad_reviews %>%
  filter(drugName == c('Clonazepam', 'Alprazolam', 'Zolpidem', 'Temazepam', 'Eszopiclone',  'Diazepam', 'Lorazepam', 'Oxazepam', 'Triazolam', 'Chlordiazepoxide'))

## Warning in `==.default`(drugName, c("Clonazepam", "Alprazolam", "Zolpidem", :
## longer object length is not a multiple of shorter object length

## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length

ggplot(ad_reviews_bzd) +
  geom_bar(aes(x = rating, fill = drugName)) +
  labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings of Sedatives", fill = "Drug") +
  scale_fill_viridis_d() +
  scale_x_continuous(breaks = c(1:10))

Weight Loss

I’ve grouped these two together as they’re less complicated than defining the other two classes. Therapeutically, only a handful of medicines are used to manage these, and the conditions are relatively straightforward.

wl_reviews <- reviews_edit %>%
  filter(condition == 'Weight Loss')

n_wl_reviews <-nrow(wl_reviews) #Number of reviews for weight loss drugs

head(wl_reviews)

## # A tibble: 6 x 7
##   uniqueID drugName    condition review            rating date       usefulCount
##      <dbl> <fct>       <fct>     <chr>              <dbl> <date>           <dbl>
## 1   164952 Phentermin~ Weight L~ "\"I have been o~      8 2015-12-25          38
## 2   145900 Qsymia      Weight L~ "\"My Dr agreed ~      9 2013-02-24          46
## 3    52117 Adipex-P    Weight L~ "\"I just starte~      8 2011-03-10           7
## 4   145785 Qsymia      Weight L~ "\"Began taking ~     10 2014-05-26          61
## 5   145840 Qsymia      Weight L~ "\"Made me jitte~      1 2013-08-21          59
## 6   145495 Qsymia      Weight L~ "\"I am a 28 yea~     10 2017-07-01          21

wl_reviews %>%
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_wl_reviews),1)) %>%
  arrange(desc(n))

## # A tibble: 21 x 3
## # Groups:   drugName [21]
##    drugName                     n pct_reviews
##    <fct>                    <int>       <dbl>
##  1 Phentermine               1211        33.6
##  2 Lorcaserin                 387        10.7
##  3 Belviq                     381        10.6
##  4 Bupropion / naltrexone     267         7.4
##  5 Contrave                   263         7.3
##  6 Adipex-P                   255         7.1
##  7 Phentermine / topiramate   249         6.9
##  8 Qsymia                     207         5.7
##  9 Liraglutide                166         4.6
## 10 Saxenda                     95         2.6
## # ... with 11 more rows

Looking at the initial data, one non-intuitive finding is the addition of the medication megestrol, which is used in illnesses like cancer and AIDS when patients lose weight. So in this case, it is being used for ‘weight loss’, but to gain weight and not lose it. I will take the extra step of removing these reviews.

wl_reviews_edit <- wl_reviews #Create duplicate dataframe for edits


#Recode brand products to generic
wl_reviews_edit$drugName <- recode_factor(wl_reviews$drugName, 'Megace' = 'Megestrol', 'Megace ES' = 'Megestrol', 'Megestrol' = 'Megestrol')

#Remove medication not used for weight loss

wl_reviews_edit <- wl_reviews_edit%>%
  filter(drugName != 'Megestrol')

n_wl_reviews <- nrow(wl_reviews_edit) #Find true number of reviews
         
#Recode the rest, set default generic to phentermine as this is the most common entry

wl_reviews_edit$drugName <- wl_reviews_edit$drugName %>%
  recode_factor(Belviq = "Locaserin", "Belviq XR" = "Locaserin", Locaserin = "Locaserin", Contrave = "Bupropion / naltrexone", "Bupropion / naltrexone" = "Bupropion / naltrexone", Qsymia = "Phentermine/ topiramate", "Phentermine/ topiramate" = "Phentermine/ topiramate", Saxenda = "Liraglutide",Victoza = "Liraglutide", Liraglutide = "Liraglutide", .default = "Phentermine")

wl_reviews_edit %>%
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_wl_reviews),1)) %>% #Percent of total reviews
  arrange(desc(n)) #Highest frequency first

## # A tibble: 5 x 3
## # Groups:   drugName [5]
##   drugName                    n pct_reviews
##   <fct>                   <int>       <dbl>
## 1 Phentermine              2149        59.8
## 2 Bupropion / naltrexone    530        14.7
## 3 Locaserin                 384        10.7
## 4 Liraglutide               324         9  
## 5 Phentermine/ topiramate   207         5.8

ggplot(wl_reviews_edit) +
  geom_bar(aes(x = rating, fill = drugName)) +
  labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Weight Loss") +
  scale_fill_viridis_d() +
  scale_x_continuous(breaks = c(1:10))

ggplot(wl_reviews_edit) +
  geom_bar(aes(x = rating)) +
  labs(x = "Rating (out of 10)", y = "Number of ratings", title = "Ratings for Weight Loss") + 
  facet_wrap(~drugName) + #For each separate drug 
  scale_x_continuous(breaks = c(1:10))

wl_yrly_reviews <- wl_reviews_edit %>%
  mutate(year = year(date)) %>% #Find year of review
  group_by(drugName, year) %>% #Group by drug, date of review
  count()


ggplot(wl_yrly_reviews) +
  geom_line(aes(x = year, y = n, color = drugName)) +
  geom_point(aes(x = year, y = n, color = drugName)) +
  labs(title = "Number of ratings for weight loss products over time", x = "Year", y = "Number of Reviews", color = "Drug") + #Change in reviews over time
  scale_x_continuous(breaks = c(2008, 2010, 2012, 2014, 2016))

Smoking Cessation

I will provide the same techniques to look at smoking cessation product reviews.

cs_reviews <- reviews_edit %>%
  filter(condition == 'Smoking Cessation')

n_cs_reviews <-nrow(cs_reviews) #Number of smoking cessation drugs reviewed

head(cs_reviews)

## # A tibble: 6 x 7
##   uniqueID drugName  condition  review             rating date       usefulCount
##      <dbl> <fct>     <fct>      <chr>               <dbl> <date>           <dbl>
## 1   225508 Bupropion Smoking C~ "\"Love this, no ~     10 2014-08-22          15
## 2    62773 Nicoderm~ Smoking C~ "\"I smoked for 3~      8 2014-02-08          50
## 3   200059 Varenicl~ Smoking C~ "\"Worked great!!~     10 2015-10-26          26
## 4   200287 Varenicl~ Smoking C~ "\"Have smoked fo~     10 2014-12-03           1
## 5   152006 Chantix   Smoking C~ "\"I put off tryi~     10 2011-04-01          16
## 6   200478 Varenicl~ Smoking C~ "\"Amazing.  Yes,~     10 2012-11-08          13

cs_reviews %>%
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_cs_reviews),1)) %>%
  arrange(desc(n))

## # A tibble: 17 x 3
## # Groups:   drugName [17]
##    drugName                       n pct_reviews
##    <fct>                      <int>       <dbl>
##  1 Varenicline                  785        44.1
##  2 Chantix                      633        35.6
##  3 Bupropion                    147         8.3
##  4 Nicotine                      84         4.7
##  5 Zyban                         58         3.3
##  6 Nicoderm CQ                   26         1.5
##  7 Commit                        14         0.8
##  8 Nicotrol Inhaler              14         0.8
##  9 Nicorette                      5         0.3
## 10 Habitrol                       4         0.2
## 11 Nortriptyline                  2         0.1
## 12 Pamelor                        2         0.1
## 13 Buproban                       2         0.1
## 14 Topiramate                     1         0.1
## 15 Topamax                        1         0.1
## 16 Nicotrol NS                    1         0.1
## 17 Leader Nicotine Polacrilex     1         0.1

Smoking cessation products are much less numerous, so generic-brand duplication will have a significant impact. I will be recoding drugName to generic before visualizing ratings.

cs_reviews_edit <- cs_reviews

#Recode products as generic drug names

cs_reviews_edit$drugName <- cs_reviews$drugName %>% 
  recode_factor(Chantix = "Varenicline", Varenicline = "Varenicline", Bupropion = "Bupropion",  Buproban = "Bupropion", Zyban = "Bupropion", Nortriptyline = "Nortriptyline", Pamelor = "Nortriptyline", Topiramate = "Topiramate", Topamax = "Topiramate", .default = "Nicotine")

cs_reviews_edit %>%
  group_by(drugName) %>%
  count() %>%
  mutate(pct_reviews = round((100*n/n_cs_reviews),1)) %>% #Percent of total reviews
  arrange(desc(n)) #Most frequent first

## # A tibble: 5 x 3
## # Groups:   drugName [5]
##   drugName          n pct_reviews
##   <fct>         <int>       <dbl>
## 1 Varenicline    1418        79.7
## 2 Bupropion       207        11.6
## 3 Nicotine        149         8.4
## 4 Nortriptyline     4         0.2
## 5 Topiramate        2         0.1

ggplot(cs_reviews_edit) +
  geom_bar(aes(x = rating, fill = drugName)) +
  labs(x = "Rating (out of 10)", y = "Number of ratings, log scale", title = "Ratings for Smoking Cessation") +
  scale_x_discrete(breaks = 10) +
  scale_fill_viridis_d()

In the case of varenicline (Chantix), this was approved in 2006, so my expectation is that a time series would show this gaining popularity, especially by the end of the review period.

cs_yrly_reviews <- cs_reviews_edit %>% 
  mutate(year = year(date)) %>% #Find the year of review
  group_by(drugName, year) %>% #
  count()

ggplot(cs_yrly_reviews) +
  geom_line(aes(x = year, y = n, color = drugName)) +
  geom_point(aes(x = year, y = n, color = drugName)) +
  labs(title = "Number of ratings for smoking cessation products over time", x = "Year", y = "Number of Reviews", color = "Drug") +
  scale_x_continuous(breaks = c(2008, 2010, 2012, 2014, 2016))

This concludes the original exploration of the drug reviews. For the second part, I’ll be looking at sentiment reflected within the reviews.

Sentiment Analysis of Selected Medications

For this sentiment analysis, I’ll be using the ‘nrc’ lexicon of sentiments used in a previous assignement. I originally tried the other two sentiments, in addition to sentiword and some other medical lexicons. Ultimately, I decided to use nrc positve and negative sentiments for the best initial results and for consistency.

library(textdata)

## Warning: package 'textdata' was built under R version 3.6.3

library(tidytext)

## Warning: package 'tidytext' was built under R version 3.6.3

library(lexicon)

## Warning: package 'lexicon' was built under R version 3.6.3

library(wordcloud)

## Warning: package 'wordcloud' was built under R version 3.6.3

## Loading required package: RColorBrewer

sentiword <- hash_sentiment_sentiword

names(sentiword)[names(sentiword) == "x"] <- "word"
names(sentiword)[names(sentiword) == "y"] <- "score"


get_sentiments("afinn")

## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows

get_sentiments("bing")

## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows

get_sentiments("nrc")

## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

Contraception

nrc_anger <- get_sentiments("nrc") %>%
  filter(sentiment == 'anger')

tidy_oc_reviews <- oc_reviews %>% 
  unnest_tokens(word, review)

library(wordcloud)
library(viridisLite)
color_pal <- viridis(n = 9, direction = -1)

custom_stop_words <- bind_rows(tibble(word = c('bad', 'awful', 'horrible', 'terrible', 'feeling', 'lose'), 
                                          lexicon = c("custom")), 
                               stop_words)

tidy_oc_reviews %>%
  filter(rating == 1) %>% #The worst ratings
  anti_join(custom_stop_words) %>% 
  inner_join(nrc_anger) %>%
  count(word) %>%
  with(wordcloud(word, n, colors = color_pal, max.words = 75))

## Joining, by = "word"
## Joining, by = "word"

For the worst ratings, many of the findings endorse ongoing mental health issues, addition to irritability, pain, and the word shot - likely because many forms of birth control are given as an injection.

Treatments for depression and anxiety

nrc_pos <- get_sentiments("nrc") %>%
  filter(sentiment == 'positive')

nrc_neg <- get_sentiments("nrc") %>%
  filter(sentiment == 'negative')

tidy_ad_reviews <- ad_reviews %>%
  unnest_tokens(word, review)


custom_stop_words_ad <- bind_rows(tibble(word = c('bad', 'awful', 'horrible', 'terrible', 'anxiety', 'depression', 'taking', 'pill', 'effect', 'feeling', 'lose', 'anxious', 'panic', 'disorder', 'medication', 'medicine', 'don'), 
                                          lexicon = c("custom")), 
                               stop_words)

tidy_ad_reviews %>%
  filter(rating == 10, condition != 'Insomnia') %>% #omitting medications for insomnia
  anti_join(custom_stop_words_ad) %>% 
  inner_join(nrc_pos) %>%
  count(word) %>%
  with(wordcloud(word, n, colors = color_pal, max.words = 75))

## Joining, by = "word"
## Joining, by = "word"

tidy_ad_reviews %>%
  filter(rating == 1, condition != 'Insomnia') %>%
  anti_join(custom_stop_words_ad) %>% 
  inner_join(nrc_neg) %>%
  count(word) %>%
  with(wordcloud(word, n, colors = color_pal, max.words = 75))

## Joining, by = "word"
## Joining, by = "word"

The most notable finding in this word frequency is ‘worse’, which may indicate that the treatment is ineffective and people are not satisfied with the antidepressant’s effects. Additionally, there are side effects like nausea, dizziness, and potential effects on weight.

Weight Loss and Smoking Cessation

One medication each for weight loss and smoking cessation encompassed the vast majority of reviews: varenicline (Chantix) for smoking cessation and phentermine (Adipex-P) for weight loss. I’m going to take an overall look at the word frequency of their reviews, as well as identify any reasons why they are so popular.

tidy_cs_reviews <- cs_reviews_edit %>%
  unnest_tokens(word, review)


custom_stop_words_cs <- bind_rows(tibble(word = c('bad', 'awful', 'horrible', 'terrible', 'taking', 'pill', 'effect', 'feeling', 'haven', 'medication', 'nicotine', 'quit', 'don', 'recommend', 'medicine', 'cold', 'doctor'), 
                                          lexicon = c("custom")), 
                               stop_words)


tidy_cs_reviews %>%
  filter(rating == 10, drugName == 'Varenicline') %>%
  anti_join(custom_stop_words_cs) %>% 
  inner_join(nrc_pos) %>%
  count(word) %>%
  with(wordcloud(word, n, colors = color_pal, max.words = 75))

## Joining, by = "word"
## Joining, by = "word"

tidy_cs_reviews %>%
  filter(rating == 1, drugName == 'Varenicline') %>%
  anti_join(custom_stop_words_cs) %>% 
  inner_join(nrc_neg) %>%
  count(word) %>%
  with(wordcloud(word, n, colors = color_pal, max.words = 75))

## Joining, by = "word"
## Joining, by = "word"

For positive ratings, many of the words indicate positive benefit, but also the word ‘vivid’ shows up, an interesting finding. For negative ratings, nausea is the number one hit, in addition to dizziness, lowered mood, and the word ‘nightmare’.

tidy_wl_reviews <- wl_reviews_edit %>%
  unnest_tokens(word, review)


custom_stop_words_wl <- bind_rows(tibble(word = c('bad', 'awful', 'horrible', 'terrible', 'taking', 'pill', 'effect', 'feeling', 'medication', 'day', 'weight', 'lose', 'lost', 'nicotine', 'quit', 'haven', 'don', 'recommend', 'medicine', 'doctor'), 
                                          lexicon = c("custom")), 
                               stop_words)


tidy_wl_reviews %>%
  filter(rating == 10, drugName == 'Phentermine') %>%
  anti_join(custom_stop_words_wl) %>% 
  inner_join(sentiword) %>%
  count(word) %>%
  with(wordcloud(word, n, colors = color_pal, max.words = 75))

## Joining, by = "word"
## Joining, by = "word"

tidy_wl_reviews %>%
  filter(rating == 1, drugName == 'Phentermine') %>%
  anti_join(custom_stop_words_wl) %>% 
  inner_join(nrc_neg) %>%
  count(word) %>%
  with(wordcloud(word, n, colors = color_pal, max.words = 75))

## Joining, by = "word"
## Joining, by = "word"

For the positive ratings, many of the most frequent words indicate that reviewers are on an adjunctive diet. For negative ratings, bad ratings mention headache, pain, dizziness, and continued hunger.

One final extension - Correlating Negative Sentiments with FDA Adverse Events

For the second section, I’ll look more closely at formal side effect or adverse drug event (ADE) data provided by the US Food and Drug Administration. The FDA’s records are available through their API in JSON format. There is a devtool available on the FDA’s Github, and extensive information on the appropriate formation of a query. I’ll start by loading the tool and look at the 10 years of records that correlate to the drug review time period (2008 - 2017).

library(devtools)

## Warning: package 'devtools' was built under R version 3.6.3

## Loading required package: usethis

## Warning: package 'usethis' was built under R version 3.6.3

devtools::install_github('ropenhealth/openfda')

## Skipping install of 'openfda' from a github remote, the SHA1 (ace7ef93) has not changed since last install.
##   Use `force = TRUE` to force installation

library(openfda)

adr_demos = fda_query("/drug/event.json") %>%
  fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
  fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
  fda_count("patient.patientonsetage") %>% #Age frequency of all events
  fda_exec()

## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.patientonsetage

ggplot(adr_demos, aes(x = term, y = count)) +
  geom_point() +
  geom_vline(xintercept = 65, color = "red") +
  labs(title = "Ages of Patients with Reported ADE's, 2008 - 2017", x = "Patient Age", y = "Count")

It appears that reports are skewed to the left, with the most frequent age somewhere in the sixties. I have illustrated age 65 with a red line to indicate the age generally accepted as ‘elderly’ in the United States, also when many patients have access to government prescription insurance via Medicare Part D.

Next, I’m going to look at the half dozen cases we’ve established in the previous section. I’ll be looking at the commonly reported side effects associated with birth control/contraception, anxiety/depression, phentermine and varenicline. Using this context, my goal is to interpret any unsual findings in the sentiment analysis.

Contraceptives

ade_oc = fda_query("/drug/event.json") %>%
fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
fda_filter("patient.drug.drugindication", 'contraception+"birth+control"') %>% #Indication
fda_count("patient.reaction.reactionmeddrapt.exact") %>% #Reported reaction
fda_exec()

## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]+AND+patient.drug.drugindication:contraception+"birth+control"&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.reaction.reactionmeddrapt.exact

n_ade_oc <- sum(ade_oc$count)

ade_oc_drugs = fda_query("/drug/event.json") %>%
fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
fda_count("patient.reaction.reactionmeddrapt.exact") %>% #Reported reaction
fda_filter("patient.drug.drugindication", 'contraception+"birth+control"') %>% #Indication
fda_count("patient.drug.openfda.generic_name.exact") %>% #Drug name
fda_exec()

## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]+AND+patient.drug.drugindication:contraception+"birth+control"&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.drug.openfda.generic_name.exact

ade_oc_vis <- ade_oc %>%
  mutate(pct_reviews = round((100*count/n_ade_oc),2)) %>% #Percent of all contraception reports
  arrange(desc(pct_reviews)) %>% #Highest frequency first
  head(10)


ade_oc_vis %>%
  mutate(term = factor(term, levels = term)) %>%
  arrange(desc(pct_reviews)) %>%
ggplot(aes(x = term, y = pct_reviews)) +
  geom_bar(stat = "identity", fill= "#042f66") +
  labs(title = "Frequency of ADE's associated with Birth Control Use, 2008 - 2017", x = "Reaction", y = "Percent of Total Reports") +
  coord_flip()

The reported events indicate the most common reaction reported is ‘device expulsion,’ which in this context likely refers to intrauterine devices, or IUD’s. This would be a significant event and compromise the effectiveness of treatment. I think this is the most common report as it is serious and requires intervention of a medical professional to fix. Coming in at number 8 is anxiety, which was mentioned as a symptom in the word frequency analysis. The number 10 result, injury, is likekly an error as I was unable to completely remove some queries associated with a common pain medication, ibuprofen.

Anxiety/ Depression Medications

ade_ad = fda_query("/drug/event.json") %>%
fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
fda_filter("patient.drug.drugindication", 'anxiety+depression+"major+depressive+disorder"+"panic+disorder"') %>% #Indication
fda_count("patient.reaction.reactionmeddrapt.exact") %>% #Reported reaction
fda_exec()

## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]+AND+patient.drug.drugindication:anxiety+depression+"major+depressive+disorder"+"panic+disorder"&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.reaction.reactionmeddrapt.exact

n_ade_ad <- sum(ade_ad$count)

ade_ad_vis <- ade_ad %>%
  mutate(pct_reviews = round((100*count/n_ade_ad),2)) %>% #Percent of all antidepressant reports
  arrange(desc(pct_reviews)) %>% #Highest frequency first
  head(10)

ade_ad_vis

##                term count pct_reviews
## 1           ANXIETY  9069        4.41
## 2  DRUG INEFFECTIVE  8034        3.91
## 3            NAUSEA  7042        3.42
## 4         DIZZINESS  5628        2.74
## 5          HEADACHE  5585        2.71
## 6           FATIGUE  5545        2.70
## 7          INSOMNIA  5343        2.60
## 8        DEPRESSION  5108        2.48
## 9              PAIN  4608        2.24
## 10    OFF LABEL USE  4536        2.20

ade_ad_vis %>%
  mutate(term = factor(term, levels = term)) %>%
  arrange(desc(pct_reviews)) %>%
ggplot(aes(x = term, y = pct_reviews)) +
  geom_bar(stat = "identity", fill= "#042f66") +
  labs(title = "ADE's associated with Anxiety and Depression Treatment, 2008 - 2017", x = "Reaction", y = "Percent of Total Reports") +
  coord_flip()

For medications used for managing anxiety and depression, the top two results are associated with worsening or ineffectiveness of therapy. Beyond that, nausea and dizziness came up in our original sentiment analysis as well.

Weight Loss

ade_wl = fda_query("/drug/event.json") %>%
  fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
  fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
  fda_filter("patient.drug.openfda.generic_name", "phentermine") %>% #Drug name
  fda_count("patient.reaction.reactionmeddrapt.exact") %>% #Reported reaction
  fda_exec()

## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]+AND+patient.drug.openfda.generic_name:phentermine&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.reaction.reactionmeddrapt.exact

n_ade_wl <- sum(ade_wl$count) #Number of ADE's reported for phentermine

ade_wl_vis <- ade_wl %>%
  mutate(pct_reviews = round((100*count/n_ade_wl),2)) %>% #Percent of all phentermine reports
  arrange(desc(pct_reviews)) %>% #Highest frequency first
head(10)
  
ade_wl_vis %>%
  mutate(term = factor(term, levels = term)) %>%
  arrange(desc(pct_reviews)) %>%
ggplot(aes(x = term, y = pct_reviews)) +
  geom_bar(stat = "identity", fill= "#042f66") +
  labs(title = "Frequency of ADE's associated with Phentermine Use, 2008 - 2017", x = "Reaction", y = "Percent of Total Reports") +
  coord_flip()

It is worth pointing out that there are fewer reports for phentermine side effects than for other classes examined. This could because it remains a treatment that is given to otherwise healthy people looking for weight loss. Two symptoms that stand out are pain and paraesthesia, which is feeling of skin burning and itching. Dizziness is also mentioned in the original sentiment analysis as well.

Smoking Cessation

ade_cs = fda_query("/drug/event.json") %>%
  fda_api_key("5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM") %>% #API key
  fda_filter("receivedate", "[2008-01-01+TO+2017-12-31]") %>% #Date range
  fda_filter("patient.drug.openfda.generic_name", "varenicline") %>% #Drug name
  fda_count("patient.reaction.reactionmeddrapt.exact") %>% #Reported reaction
  fda_exec()

## Fetching: https://api.fda.gov/drug/event.json?search=receivedate:[2008-01-01+TO+2017-12-31]+AND+patient.drug.openfda.generic_name:varenicline&api_key=5jiTQwpGEERnaPEvpBmTwVW55CXricnKEKXZbcAM&count=patient.reaction.reactionmeddrapt.exact

n_ade_cs <- sum(ade_cs$count) #Number of ADE's reported for varenicline

ade_cs_vis <- ade_cs %>%
  mutate(pct_reviews = round((100*count/n_ade_cs),1)) %>% #Percent of all varenicline reports
  arrange(desc(pct_reviews)) %>% #Highest frequency first
  head(10)

ade_cs_vis %>%
  mutate(term = factor(term, levels = term)) %>%
  arrange(desc(pct_reviews)) %>%
ggplot(aes(x = term, y = pct_reviews)) +
  geom_bar(stat = "identity", fill= "#042f66") +
  labs(title = "Frequency of ADE's associated with Chantix Use, 2008 - 2017", x = "Reaction", y = "Percent of Total Reports") +
  coord_flip()

Finally, for varenicline the most common side effects reported are nausea and depression. Interestingly, insomnia and abnormal dreams are also mentioned. Based on my own clinical knowledge, this is an interesting finding in addition to the sentiment analysis identifying ‘vivid’ as a word because one idiosyncratic reaction associated with varenicline is nightmares or vivid dreaming. This may be related to its treatment effect on nicotinic acid receptors. It has a much longer half-life than nicotine, 24 hours versus 1-2 hours, so in all likelihood patients are receving constant stimulation even while sleeping.

Conclusions

Patient-generated user reviews provide a complementary perspective of the tolerability of medication therapy. The medications reviewed were not surprising and ecompass many of the most popular medications used for each condition. Analysis of reviews over time show a large spike betewen 2014 and 2016. Using internet reviews in combination with sentiment analysis, many of the most common side effects were correlated with adverse drug events reported to the FDA. There were some differences between reports, as the FDA tended to have more serious and treatment-modifying effects, while user reviews tended to accentuate the mental health impacts of starting a new therapy.

Limitations and Future Directions

Some limitations I identified early in the beginning of the project is the popularity of birth control as a drug review, which indicates that the reviews are biased towards younger women. On the one hand, these patients are more likely to be on one or two medications, while older patients tend to be on up to four if they are being treated for chronic diseases. This means that the side effects are likely to be from the single medication and not a complex interaction between the medications. However, I would not generalize any of the findings to older populations as has an important impact on drug therapy. Further to this point, younger reviewers are also likely healthier, and it’s not clear whether birth control for instance is exacerbating an existing mood disorder, or a de novo finding.

For future directions, the openFDA API has much more operability than what I utilized, including the consideration of multiple medications and conditions, demographic information, and severity of the reaction (death, hospitalization, discontinuation of therapy). The reviews data could be made much more tidy and my treatment here reflects a minimal exercise to make the data intelligible. Additionally, there was some operability not used from the reviews, including the all-important ‘Users found this helpful’, which is a ubiquitous finding in the Web 2.0 environment of user feedback. Finally, these dataframes are ripe for treatment in a relational database, especially if I were able to find an existing table with all synonyms for a medication (brand, generic, nicknames, identifiers for billing and commerical purposes). These already exist in regulatory and clinical records, and their implementation would diminish the need for recoding of the thousands of names of FDA-approved drugs.

DATA 607 Final Project - Drug Reviews Sentiment Analysis and Safety Profiles

Thomas Hill

5/10/2020

Disclaimer

Introduction

Initial analysis

Uploading data

Building a Drug Class from Reviews Data

Contraception

Anxiety/ Depression

Weight Loss

Smoking Cessation

Sentiment Analysis of Selected Medications

Contraception

Treatments for depression and anxiety

Weight Loss and Smoking Cessation

One final extension - Correlating Negative Sentiments with FDA Adverse Events

Contraceptives

Anxiety/ Depression Medications

Weight Loss

Smoking Cessation

Conclusions

Limitations and Future Directions