setwd('C:/Users/nguye/ITEC4220/Project')
getwd()
## [1] "C:/Users/nguye/ITEC4220/Project"
drugs <- read.csv("realistic_drug_labels_side_effects.csv")
head(drugs)
## drug_name manufacturer approval_year drug_class
## 1 Seroxetine50 AstraZeneca 1996 Antidepressant
## 2 Mecoparin93 AstraZeneca 2018 Vaccine
## 3 Daxozole89 Merck & Co. 1997 Antipsychotic
## 4 Viracillin84 Roche Holding AG 2004 Antifungal
## 5 Amoxstatin62 Pfizer Inc. 2003 Antidepressant
## 6 Loxocillin72 Johnson & Johnson 2023 Antifungal
## indications side_effects dosage_mg
## 1 Allergy relief Fatigue, Nausea 260
## 2 Allergy relief Nausea 470
## 3 Allergy relief Diarrhea, Blurred vision, Dizziness 330
## 4 Inflammation reduction Fatigue, Dry mouth 450
## 5 Psychosis control Insomnia, Dry mouth, Fatigue 430
## 6 Viral infections Rash, Dizziness 180
## administration_route contraindications warnings price_usd
## 1 Rectal Bleeding disorders Avoid alcohol 192.43
## 2 Inhalation Allergic reaction Take with food 397.82
## 3 Sublingual High blood pressure Take with food 131.69
## 4 Oral Kidney impairment Do not operate machinery 372.82
## 5 Topical Bleeding disorders Do not operate machinery 281.48
## 6 Intravenous Bleeding disorders May affect fertility 463.28
## batch_number expiry_date side_effect_severity approval_status
## 1 MV388Pl 2026-11-29 Mild Pending
## 2 UR279ZN 2027-07-14 Mild Approved
## 3 we040kH 2028-06-02 Moderate Pending
## 4 hO060rh 2026-07-07 Mild Rejected
## 5 Fa621Sw 2027-12-28 Moderate Pending
## 6 Nl465Ez 2025-12-15 Moderate Rejected
hist(drugs$approval_year, breaks=20,
main= "Number of Drugs Approved Each Year",
xlab = "Year", ylab = "Number of Drugs")
summary(drugs$price_usd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.21 128.83 255.13 251.84 372.21 499.06
Regression analysis:
model <- lm(drugs$price_usd ~ drugs$approval_year)
summary(model)
##
## Call:
## lm(formula = drugs$price_usd ~ drugs$approval_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -253.901 -123.027 2.367 121.538 252.081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1230.7470 743.0764 1.656 0.0979 .
## drugs$approval_year -0.4878 0.3703 -1.317 0.1879
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142.6 on 1434 degrees of freedom
## Multiple R-squared: 0.001209, Adjusted R-squared: 0.0005123
## F-statistic: 1.735 on 1 and 1434 DF, p-value: 0.1879
The fitted slope is -0.49, meaning the estimated price falls by about $0.49 for each later approval year. However, the slope is not statistically significant (p = 0.188, which is greater than 0.05), and approval year explains almost none of the variation in price (R-squared = 0.0012). We therefore cannot reject the null hypothesis of no linear relationship between approval year and price. This does not prove that no relationship exists, but the data give no support for the assumption that newer drugs are more expensive due to factors such as inflation.
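A confidence interval for the slope shows how uncertain the estimate is. The following is a minimal sketch, not part of the original analysis: `slope_summary` is a hypothetical helper name, and it assumes a data frame with `price_usd` and `approval_year` columns like `drugs`.

```r
# Hypothetical helper: fit price ~ approval year and report the slope,
# its 95% confidence interval, and R-squared.
slope_summary <- function(data) {
  fit <- lm(price_usd ~ approval_year, data = data)
  ci <- confint(fit, "approval_year", level = 0.95)
  c(
    estimate  = unname(coef(fit)["approval_year"]),
    lower     = unname(ci[1, 1]),
    upper     = unname(ci[1, 2]),
    r_squared = summary(fit)$r.squared
  )
}
```

Calling `slope_summary(drugs)` would return the slope together with its interval. If the interval contains zero, the data are compatible with both a small increase and a small decrease in price per year.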
The lack of a clear relationship could be due to one or more outliers in the price data.
boxplot(drugs$price_usd, main = "Boxplot of Drug Prices")
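The box plot can also be checked numerically. The sketch below is an illustration rather than part of the original analysis: `find_price_outliers` is a hypothetical helper name. It relies on `boxplot.stats()`, which applies the same 1.5 x IQR whisker rule that `boxplot()` uses by default.

```r
# Hypothetical helper: return the values that a default box plot would
# draw as individual outlier points (beyond 1.5 * IQR from the hinges).
find_price_outliers <- function(x) {
  boxplot.stats(x, coef = 1.5)$out
}
```

For example, `length(find_price_outliers(drugs$price_usd))` would give the number of flagged prices, and a result of zero would agree with a plot that shows no outlier points.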
The box plot shows no outliers, so outliers do not explain the regression result. Even so, a single regression across all drugs may not reflect the full picture. For a more accurate price comparison, it may be better to compare drugs within the same class or from the same manufacturer.
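One way to compare within a class is to fit the same regression separately for each drug class. This is a minimal sketch under assumptions: `slope_by_class` is a hypothetical helper name, and the data frame is assumed to have `drug_class`, `price_usd` and `approval_year` columns like `drugs`.

```r
# Hypothetical helper: estimate the price-vs-approval-year slope
# separately within each drug class.
slope_by_class <- function(data) {
  groups <- split(data, data$drug_class)
  vapply(groups, function(d) {
    fit <- lm(price_usd ~ approval_year, data = d)
    unname(coef(fit)["approval_year"])
  }, numeric(1))
}
```

`slope_by_class(drugs)` would return one slope per class, which makes it easy to see whether any single class shows a price trend that the pooled regression hides.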
hist(drugs$price_usd, breaks=20, main="Price distribution in USD",
xlab = "Price", ylab= "Frequency")
The histogram shows that prices are not normally distributed, so we use non-parametric tests for the hypothesis tests that follow.
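The visual impression from the histogram can be backed up with a formal test. The sketch below is an illustration, not part of the original analysis: `check_normality` is a hypothetical helper name. It uses the Shapiro-Wilk test from base R, which accepts between 3 and 5000 observations.

```r
# Hypothetical helper: Shapiro-Wilk normality test, with an optional
# normal Q-Q plot for a visual check.
check_normality <- function(x, plot = FALSE) {
  x <- x[!is.na(x)]
  if (plot) {
    qqnorm(x)  # sample quantiles against normal quantiles
    qqline(x)  # reference line through the quartiles
  }
  shapiro.test(x)
}
```

A small p-value from `check_normality(drugs$price_usd)` would indicate a departure from normality, consistent with choosing rank-based tests.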
table(drugs$drug_class)
##
## Analgesic Anti-inflammatory Antibiotic Antidepressant
## 126 158 143 156
## Antifungal Antihistamine Antipsychotic Antipyretic
## 144 143 147 134
## Antiviral Vaccine
## 141 144
boxplot(price_usd ~ drug_class, data = drugs, main = "Price by Drug Type", cex.axis=0.7, las=2)
antidepressant <- drugs[drugs$drug_class == "Antidepressant",]
head(antidepressant)
## drug_name manufacturer approval_year drug_class
## 1 Seroxetine50 AstraZeneca 1996 Antidepressant
## 5 Amoxstatin62 Pfizer Inc. 2003 Antidepressant
## 17 Cefzole42 GlaxoSmithKline 1996 Antidepressant
## 32 Fenoparin78 Johnson & Johnson 2007 Antidepressant
## 40 Mecoprofen83 Johnson & Johnson 1999 Antidepressant
## 57 Daxomab70 Moderna Therapeutics 2009 Antidepressant
## indications side_effects dosage_mg
## 1 Allergy relief Fatigue, Nausea 260
## 5 Psychosis control Insomnia, Dry mouth, Fatigue 430
## 17 Pain relief Blurred vision 770
## 32 Allergy relief Fatigue 460
## 40 Pain relief Blurred vision, Nausea 50
## 57 Inflammation reduction Fatigue, Diarrhea 120
## administration_route contraindications warnings price_usd
## 1 Rectal Bleeding disorders Avoid alcohol 192.43
## 5 Topical Bleeding disorders Do not operate machinery 281.48
## 17 Rectal Asthma Take with food 85.40
## 32 Sublingual Asthma Do not operate machinery 451.54
## 40 Intramuscular Diabetes Avoid alcohol 369.83
## 57 Sublingual Diabetes Avoid sunlight exposure 41.95
## batch_number expiry_date side_effect_severity approval_status
## 1 MV388Pl 2026-11-29 Mild Pending
## 5 Fa621Sw 2027-12-28 Moderate Pending
## 17 JW654ck 2027-06-07 Severe Rejected
## 32 lR165PH 2027-11-28 Mild Approved
## 40 ZF007mi 2028-01-10 Mild Rejected
## 57 OW258yR 2027-01-06 Mild Rejected
# Split antidepressants into approval periods. The boundary years 2004 and 2014
# are kept in the first and last groups, so the middle group excludes them.
antidepressant_1994_to_2004 <- antidepressant[antidepressant$approval_year >= 1994 & antidepressant$approval_year <= 2004,]
antidepressant_2004_to_2014 <- antidepressant[antidepressant$approval_year > 2004 & antidepressant$approval_year < 2014,]
antidepressant_2014_to_2024 <- antidepressant[antidepressant$approval_year >= 2014 & antidepressant$approval_year <= 2024,]
# One-sided test: are 1994-2004 prices shifted higher than 2014-2024 prices?
wilcox.test(antidepressant_1994_to_2004$price_usd, antidepressant_2014_to_2024$price_usd, alternative = "greater")
##
## Wilcoxon rank sum test with continuity correction
##
## data: antidepressant_1994_to_2004$price_usd and antidepressant_2014_to_2024$price_usd
## W = 1221, p-value = 0.5752
## alternative hypothesis: true location shift is greater than 0
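Since the p-value (0.5752) is larger than 0.05, we cannot reject the null hypothesis. The Wilcoxon rank sum test compares the location of the two price distributions rather than their means, so we cannot conclude that antidepressants approved in 1994-2004 tend to be priced higher than those approved in 2014-2024.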
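The same question can be asked of all three approval periods at once. The following is a minimal sketch, not the method used above: `compare_periods` is a hypothetical helper name, and the non-overlapping bins (1994-2004, 2005-2014, 2015-2024) are an assumption that differs slightly from the groups defined earlier. It reports the median price per period and runs a Kruskal-Wallis test, the rank-based analogue of one-way ANOVA.

```r
# Hypothetical helper: median price per approval period plus a
# Kruskal-Wallis test across the periods.
compare_periods <- function(data) {
  # Right-closed bins: (1993, 2004], (2004, 2014], (2014, 2024].
  # Years outside this range become NA and are dropped below.
  period <- cut(
    data$approval_year,
    breaks = c(1993, 2004, 2014, 2024),
    labels = c("1994-2004", "2005-2014", "2015-2024")
  )
  list(
    medians = tapply(data$price_usd, period, median),
    test    = kruskal.test(x = data$price_usd, g = period)
  )
}
```

`compare_periods(antidepressant)` would return the per-period medians alongside the test result, so the size of any difference can be read next to its p-value.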