setwd('C:/Users/nguye/ITEC4220/Project')
getwd()
## [1] "C:/Users/nguye/ITEC4220/Project"
drugs <- read.csv("realistic_drug_labels_side_effects.csv")
head(drugs)
##      drug_name      manufacturer approval_year     drug_class
## 1 Seroxetine50       AstraZeneca          1996 Antidepressant
## 2  Mecoparin93       AstraZeneca          2018        Vaccine
## 3   Daxozole89       Merck & Co.          1997  Antipsychotic
## 4 Viracillin84  Roche Holding AG          2004     Antifungal
## 5 Amoxstatin62       Pfizer Inc.          2003 Antidepressant
## 6 Loxocillin72 Johnson & Johnson          2023     Antifungal
##              indications                        side_effects dosage_mg
## 1         Allergy relief                     Fatigue, Nausea       260
## 2         Allergy relief                              Nausea       470
## 3         Allergy relief Diarrhea, Blurred vision, Dizziness       330
## 4 Inflammation reduction                  Fatigue, Dry mouth       450
## 5      Psychosis control        Insomnia, Dry mouth, Fatigue       430
## 6       Viral infections                     Rash, Dizziness       180
##   administration_route   contraindications                 warnings price_usd
## 1               Rectal  Bleeding disorders            Avoid alcohol    192.43
## 2           Inhalation   Allergic reaction           Take with food    397.82
## 3           Sublingual High blood pressure           Take with food    131.69
## 4                 Oral   Kidney impairment Do not operate machinery    372.82
## 5              Topical  Bleeding disorders Do not operate machinery    281.48
## 6          Intravenous  Bleeding disorders     May affect fertility    463.28
##   batch_number expiry_date side_effect_severity approval_status
## 1      MV388Pl  2026-11-29                 Mild         Pending
## 2      UR279ZN  2027-07-14                 Mild        Approved
## 3      we040kH  2028-06-02             Moderate         Pending
## 4      hO060rh  2026-07-07                 Mild        Rejected
## 5      Fa621Sw  2027-12-28             Moderate         Pending
## 6      Nl465Ez  2025-12-15             Moderate        Rejected

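Plot a histogram of drug approval years from 1990 to 2024. The plot counts every entry regardless of approval_status, and because breaks=20 is only a suggested number of bins, each bar may cover more than one year.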

hist(drugs$approval_year, breaks=20, 
     main= "Number of Drugs Approved Each Year", 
     xlab = "Year", ylab = "Number of Drugs")
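
If the goal is one bar per year for approved entries only, a bar plot of yearly counts is one option. The sketch below is illustrative: drugs_demo is a hypothetical stand-in with made-up values, not the real dataset, and the 1990 to 2024 range is taken from the description above.

```r
# Hypothetical stand-in for the drugs data frame; all values are made up.
set.seed(1)
drugs_demo <- data.frame(
  approval_year   = sample(1990:2024, 200, replace = TRUE),
  approval_status = sample(c("Approved", "Pending", "Rejected"), 200, replace = TRUE)
)

# Keep only entries whose status is "Approved".
approved <- drugs_demo[drugs_demo$approval_status == "Approved", ]

# Count entries per year; fixing the factor levels keeps years with zero entries.
counts <- table(factor(approved$approval_year, levels = 1990:2024))

# One bar per year, with rotated and slightly smaller axis labels.
barplot(counts, las = 2, cex.names = 0.7,
        main = "Approved Drugs per Year",
        xlab = "Year", ylab = "Number of Drugs")
```

Using table() on a factor with explicit levels, rather than hist(), guarantees exactly one bar per calendar year.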

Compute summary statistics for the price column.

summary(drugs$price_usd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.21  128.83  255.13  251.84  372.21  499.06

Hypothesis 1: Drug prices increase over the years due to inflation.

Regression analysis:

model <- lm(drugs$price_usd ~ drugs$approval_year)
summary(model)
## 
## Call:
## lm(formula = drugs$price_usd ~ drugs$approval_year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -253.901 -123.027    2.367  121.538  252.081 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         1230.7470   743.0764   1.656   0.0979 .
## drugs$approval_year   -0.4878     0.3703  -1.317   0.1879  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142.6 on 1434 degrees of freedom
## Multiple R-squared:  0.001209,   Adjusted R-squared:  0.0005123 
## F-statistic: 1.735 on 1 and 1434 DF,  p-value: 0.1879

The regression estimates that price falls by about $0.49 for each later approval year, but the slope is not statistically significant (p = 0.1879) and the model explains almost none of the variance (R-squared = 0.0012). Because the p-value is greater than 0.05, we cannot reject the null hypothesis of no linear relationship between approval year and price. This does not prove that no relationship exists; it only means the data provide no evidence for one. The result therefore does not support the assumption that newly developed drugs are more expensive due to factors such as inflation.
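
A confidence interval makes the uncertainty in the slope easier to read than the p-value alone. From the output above, an approximate 95% interval is -0.4878 ± 1.96 × 0.3703, roughly -1.21 to 0.24, which includes zero. The sketch below shows how the same quantities could be pulled from a fitted model; drugs_demo and model_demo are hypothetical names and the data are made up.

```r
# Hypothetical stand-in data; values are random, not the real dataset.
set.seed(2)
drugs_demo <- data.frame(approval_year = sample(1990:2024, 300, replace = TRUE))
drugs_demo$price_usd <- runif(300, min = 1, max = 500)

# Passing data = keeps the coefficient names short (approval_year).
model_demo <- lm(price_usd ~ approval_year, data = drugs_demo)

# p-value of the slope, read from the coefficient table.
coefs   <- summary(model_demo)$coefficients
slope_p <- coefs["approval_year", "Pr(>|t|)"]

# 95% confidence interval for the slope.
slope_ci <- confint(model_demo, "approval_year", level = 0.95)

print(slope_p)
print(slope_ci)
```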

One possible explanation is that one or more outliers in the data distort the fit.

boxplot(drugs$price_usd, main = "Boxplot of Drug Prices")

While the box plot doesn’t show any outliers, that doesn’t necessarily mean the regression result reflects the full picture. For a more accurate price comparison, it may be better to compare drugs within the same class or from the same manufacturer.
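
One way to compare within classes is to fit a separate regression for each drug class and look at the slopes side by side. This is only a sketch under assumptions: drugs_demo is a hypothetical data frame with made-up values and three example classes.

```r
# Hypothetical stand-in data with three example classes; values are made up.
set.seed(3)
classes <- c("Antidepressant", "Antifungal", "Vaccine")
drugs_demo <- data.frame(
  approval_year = sample(1990:2024, 300, replace = TRUE),
  drug_class    = sample(classes, 300, replace = TRUE),
  price_usd     = runif(300, min = 1, max = 500)
)

# Split the rows by class, fit price ~ year within each class,
# and keep only the slope for approval_year.
slopes <- sapply(split(drugs_demo, drugs_demo$drug_class), function(d) {
  coef(lm(price_usd ~ approval_year, data = d))[["approval_year"]]
})

print(slopes)
```

Each slope comes from fewer observations than the pooled model, so the per-class estimates are noisier.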

Plot a histogram of the drug price distribution.

hist(drugs$price_usd, breaks=20, main="Price distribution in USD", 
     xlab = "Price", ylab= "Frequency")

The prices shown in the histogram do not appear to be normally distributed, so we use non-parametric tests for the hypothesis tests that follow.
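
The visual impression can be backed up with a Shapiro-Wilk test and a normal Q-Q plot. The sketch below uses a hypothetical price_demo vector of made-up values. With large samples the test flags even small departures from normality, so the Q-Q plot is a useful companion.

```r
# Hypothetical stand-in for the price column; values are made up.
set.seed(4)
price_demo <- runif(500, min = 1, max = 500)

# Shapiro-Wilk test: the null hypothesis is that the data are normal.
sw <- shapiro.test(price_demo)
print(sw)

# Normal Q-Q plot: points far from the reference line suggest non-normality.
qqnorm(price_demo)
qqline(price_demo)
```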

table(drugs$drug_class)
## 
##         Analgesic Anti-inflammatory        Antibiotic    Antidepressant 
##               126               158               143               156 
##        Antifungal     Antihistamine     Antipsychotic       Antipyretic 
##               144               143               147               134 
##         Antiviral           Vaccine 
##               141               144
boxplot(price_usd ~ drug_class, data = drugs, main = "Price by Drug Type", cex.axis=0.7, las=2)

Create a dataset which contains only antidepressant drug entries.

antidepressant <- drugs[drugs$drug_class == "Antidepressant",]
head(antidepressant)
##       drug_name         manufacturer approval_year     drug_class
## 1  Seroxetine50          AstraZeneca          1996 Antidepressant
## 5  Amoxstatin62          Pfizer Inc.          2003 Antidepressant
## 17    Cefzole42      GlaxoSmithKline          1996 Antidepressant
## 32  Fenoparin78    Johnson & Johnson          2007 Antidepressant
## 40 Mecoprofen83    Johnson & Johnson          1999 Antidepressant
## 57    Daxomab70 Moderna Therapeutics          2009 Antidepressant
##               indications                 side_effects dosage_mg
## 1          Allergy relief              Fatigue, Nausea       260
## 5       Psychosis control Insomnia, Dry mouth, Fatigue       430
## 17            Pain relief               Blurred vision       770
## 32         Allergy relief                      Fatigue       460
## 40            Pain relief       Blurred vision, Nausea        50
## 57 Inflammation reduction            Fatigue, Diarrhea       120
##    administration_route  contraindications                 warnings price_usd
## 1                Rectal Bleeding disorders            Avoid alcohol    192.43
## 5               Topical Bleeding disorders Do not operate machinery    281.48
## 17               Rectal             Asthma           Take with food     85.40
## 32           Sublingual             Asthma Do not operate machinery    451.54
## 40        Intramuscular           Diabetes            Avoid alcohol    369.83
## 57           Sublingual           Diabetes  Avoid sunlight exposure     41.95
##    batch_number expiry_date side_effect_severity approval_status
## 1       MV388Pl  2026-11-29                 Mild         Pending
## 5       Fa621Sw  2027-12-28             Moderate         Pending
## 17      JW654ck  2027-06-07               Severe        Rejected
## 32      lR165PH  2027-11-28                 Mild        Approved
## 40      ZF007mi  2028-01-10                 Mild        Rejected
## 57      OW258yR  2027-01-06                 Mild        Rejected

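Divide the antidepressant dataset into three groups based on approval year. Note that the ranges are inclusive at both ends, so drugs approved in 2004 or 2014 fall into two groups, and drugs approved before 1994 are not assigned to any group.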

antidepressant_1994_to_2004 <- antidepressant[antidepressant$approval_year >= 1994 & antidepressant$approval_year <= 2004,]

antidepressant_2004_to_2014 <- antidepressant[antidepressant$approval_year >= 2004 & antidepressant$approval_year <= 2014,]

antidepressant_2014_to_2024 <- antidepressant[antidepressant$approval_year >= 2014 & antidepressant$approval_year <= 2024,]
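
Because the three subsets above use inclusive bounds, they share the boundary years. One way to get non-overlapping groups is cut() with half-open intervals. This is a sketch only: antidepressant_demo, the period column, and the labels are hypothetical, and the data are made up.

```r
# Hypothetical stand-in for the antidepressant subset; values are made up.
set.seed(5)
antidepressant_demo <- data.frame(
  approval_year = sample(1990:2024, 150, replace = TRUE),
  price_usd     = runif(150, min = 1, max = 500)
)

# right = FALSE gives the intervals [1994, 2004), [2004, 2014), [2014, 2025),
# so every year from 1994 to 2024 belongs to exactly one group.
# Years before 1994 fall outside the breaks and become NA.
antidepressant_demo$period <- cut(
  antidepressant_demo$approval_year,
  breaks = c(1994, 2004, 2014, 2025),
  labels = c("1994-2003", "2004-2013", "2014-2024"),
  right  = FALSE
)

# Group sizes, including the unassigned (NA) entries.
print(table(antidepressant_demo$period, useNA = "ifany"))
```

Regrouping this way would change the group sizes, so any test run on the original subsets would need to be rerun.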

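Use the Wilcoxon rank sum test to check whether prices in the earliest group (1994 to 2004) and the latest group (2014 to 2024) differ significantly. The test compares the location of the two price distributions rather than their means. The one-sided alternative used here (greater) asks whether prices in the earlier group tend to be higher than in the later group; Hypothesis 1 predicts the opposite direction, which would correspond to alternative = "less".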

wilcox.test(antidepressant_1994_to_2004$price_usd, antidepressant_2014_to_2024$price_usd, alternative = "greater")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  antidepressant_1994_to_2004$price_usd and antidepressant_2014_to_2024$price_usd
## W = 1221, p-value = 0.5752
## alternative hypothesis: true location shift is greater than 0

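Since the p-value (0.5752) is larger than 0.05, we cannot reject the null hypothesis. There is no evidence that prices in the 1994 to 2004 group tend to be higher than prices in the 2014 to 2024 group. Because the test was one-sided in that direction, it does not directly test whether newer drugs are more expensive.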
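
To compare all three periods at once, a Kruskal-Wallis test is a common non-parametric choice, and a one-sided Wilcoxon test with alternative = "less" matches the direction of Hypothesis 1. The sketch below illustrates both on made-up data; demo, period, early, and late are hypothetical names.

```r
# Hypothetical stand-in data: three periods with 40 made-up prices each.
set.seed(6)
demo <- data.frame(
  period    = factor(rep(c("1994-2003", "2004-2013", "2014-2024"), each = 40)),
  price_usd = runif(120, min = 1, max = 500)
)

# Kruskal-Wallis: do the price distributions differ across the three periods?
kw <- kruskal.test(price_usd ~ period, data = demo)
print(kw)

# One-sided Wilcoxon rank sum test: are earlier prices lower than later prices?
early <- demo$price_usd[demo$period == "1994-2003"]
late  <- demo$price_usd[demo$period == "2014-2024"]
wt <- wilcox.test(early, late, alternative = "less")
print(wt)
```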