For this project, you will explore drug prices associated with different treatments. You will need to load the following libraries to complete the assignment.
library(dplyr)
library(ggplot2)
library(stringr)
Two data sets are used: “h20_cond.rda” and “h20_medicine.rda”. Both
can be found in Canvas, in the Data folder under Files. Download each
data set, then load it using the load function.
(Hint: You can use the “Open File…” command in the “File” menu to search for the downloaded files. Then, copy and paste the command from the console below.)
#load("/path/to/data/h20_cond.rda")
#load("/path/to/data/h20_medicine.rda")
#head(h20_cond)
#head(h20_medicine)
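As a minimal sketch of what load does, the block below saves a small made-up data frame to a temporary .rda file and loads it back. The object toy_cond and the temporary file are hypothetical stand-ins, not the MEPS files. The point is that load restores objects under the names they were saved with, and returns those names.

```r
# Hypothetical toy object standing in for a downloaded .rda file
toy_cond <- data.frame(ID_cond = c(1, 2), icd = c("E11", "I10"))
tmp <- tempfile(fileext = ".rda")
save(toy_cond, file = tmp)

rm(toy_cond)                # remove it so we can see load() bring it back
loaded_names <- load(tmp)   # load() returns the names of the restored objects
loaded_names
head(toy_cond)
```

If a load call seems to do nothing, checking the returned names (or calling ls()) shows which objects were actually created.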
The data come from the Medical Expenditure Panel Survey
(MEPS), one of the most trusted government sources of medical
expenditure data. The file h20_cond contains 2 columns:
ID_cond, which identifies a particular condition diagnosed
in a particular individual, and icd, which contains the
ICD-10 codes that label a particular disease.
The h20_medicine data set contains observations on drug
expenditures. Each row is associated with a particular person, the
drug they purchased and its quantity, the total amount spent, and an ID
for the condition.
Since there are too many types of diagnoses (icd) within the data set, we need to recode each diagnosis to a broader category. We will do this via a custom function.
Create a function that takes a character value x. If the
value starts with the letter “E”, the function should return
“endocrine”, and so forth. The list below shows what the function should
return based on the starting letter.
Use if and else statements to test the
conditions within the function. Test your function on “I65” and
“K07”.
(See Section 5.2 for creating functions, 5.1 for using if / else statements, and 2.6 for extracting substrings.)
## Broader Diagnoses Categories
# the function code
BroaderDiagnosis <- function(x) {
  if (substring(x, 1, 1) == "E") {
    return("endocrine")
  } else if (substring(x, 1, 1) == "I") {
    return("circulatory")
  } else if (substring(x, 1, 1) == "J") {
    return("respiratory")
  } else if (substring(x, 1, 1) == "K") {
    return("digestive")
  } else {
    return("unknown category")
  }
}
# testing the function
TestOne <- "I65"
TestTwo <- "K07"
# testing the code
BroaderDiagnosis(TestOne) # is I based so should return "circulatory"
## [1] "circulatory"
BroaderDiagnosis(TestTwo) #is K based should return "digestive"
## [1] "digestive"
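The if / else version above handles one value at a time. As a hedged alternative sketch, the same mapping can be written as a named lookup vector, which works on a whole character vector at once. The names icd_lookup and BroaderDiagnosisVec are hypothetical and are not part of the assignment.

```r
# Named vector mapping the first letter of an ICD-10 code to a broad category
icd_lookup <- c(E = "endocrine", I = "circulatory",
                J = "respiratory", K = "digestive")

BroaderDiagnosisVec <- function(x) {
  first_letter <- substr(x, 1, 1)
  out <- unname(icd_lookup[first_letter])   # NA for letters not in the lookup
  out[is.na(out)] <- "unknown category"
  out
}

BroaderDiagnosisVec(c("I65", "K07", "Z99"))
```

The assignment asks for if and else statements, so this is only a cross-check for the function's output, not a replacement for it.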
Use a for loop to change each value in the
icd column inside of the h20_cond data set.
Show the first 10 observations of the column after the change.
(See Section 5.4 for loops.)
## Loop Code
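#print(h20_cond)
# using a loop on the "icd" column
# the earlier version did not run because for ("icd" in h20_cond) is not valid:
# the loop variable must be a bare name (i) that steps through the row indices
# still behind #s because the load() calls above are commented out
#for (i in seq_len(nrow(h20_cond))) {
#  h20_cond$icd[i] <- BroaderDiagnosis(h20_cond$icd[i])
#}
#head(h20_cond$icd, 10)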
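To see the loop pattern run without the MEPS files, here is a small self-contained sketch. The data frame toy_cond and the helper recode_one are hypothetical. recode_one is a compact copy of the recoding rule so that the block runs on its own.

```r
# Hypothetical stand-in for h20_cond; the real data has many more rows
toy_cond <- data.frame(ID_cond = 1:4,
                       icd = c("E11", "I10", "J45", "Z00"),
                       stringsAsFactors = FALSE)

# Compact copy of the recoding rule (the last unnamed value is the default)
recode_one <- function(x) {
  switch(substr(x, 1, 1),
         E = "endocrine", I = "circulatory",
         J = "respiratory", K = "digestive",
         "unknown category")
}

# seq_len(nrow(...)) yields the row indices 1, 2, ..., n
for (i in seq_len(nrow(toy_cond))) {
  toy_cond$icd[i] <- recode_one(toy_cond$icd[i])
}

head(toy_cond$icd, 10)
```

The loop variable i is an index, and each pass overwrites one element of the icd column with its broader category.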
Join the h20_cond and h20_medicine
data.frames on the ID_cond column, making sure that all
rows in h20_medicine are preserved. Show the dimensions of
the new data set using the dim(...) function.
After joining the data set, add together the total amount that is
spent by each person and the total quantity of medication on each
condition from which they suffer, and show the head of the
new data set.
(See Section 4.3 for joining data sets, and 4.2 for summarizing across groups.)
## joining the data sets
#JoinedData <- merge(h20_medicine, h20_cond, by = "ID_cond", all.x = TRUE)
# print(dim(JoinedData))
# icd is included in group_by so that it is still available in the later problems
#Totals <- JoinedData %>%
#  group_by(ID_cond, icd) %>%
#  summarize(TotalAmountSpentEach = sum(AmountSpent, na.rm = TRUE),
#            TotalQuantityEach = sum(quantity, na.rm = TRUE),
#            .groups = "drop")
# print(head(Totals))
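The effect of all.x = TRUE can be checked on toy data. In the sketch below, the data frames toy_medicine and toy_cond are made up. The column names AmountSpent and quantity are copied from the code above and are assumptions about the real h20_medicine columns. The block uses base R only (merge and aggregate) so that it runs without extra packages.

```r
# Hypothetical toy data; ID_cond 3 has no matching row in toy_cond
toy_medicine <- data.frame(ID_cond = c(1, 1, 2, 3),
                           quantity = c(30, 60, 10, 5),
                           AmountSpent = c(12.5, 20, 8, 40))
toy_cond <- data.frame(ID_cond = c(1, 2),
                       icd = c("endocrine", "circulatory"))

# all.x = TRUE keeps every row of the first (left) data frame
left_joined  <- merge(toy_medicine, toy_cond, by = "ID_cond", all.x = TRUE)
inner_joined <- merge(toy_medicine, toy_cond, by = "ID_cond")

dim(left_joined)    # unmatched rows are kept, with icd set to NA
dim(inner_joined)   # unmatched rows are dropped

# Total spending and quantity per condition ID
totals <- aggregate(cbind(AmountSpent, quantity) ~ ID_cond,
                    data = left_joined, FUN = sum)
totals
```

Comparing the two dim results is a quick way to confirm that no rows from the medicine data were lost in the join.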
Show a five number summary of total expenditure on drugs by condition
/ person from the new data set created in Problem 3 (Hint: use the
summary command) and plot a histogram. Describe the
distribution.
Find the average expenditure on drugs by icd code. Then,
filter out expenditures on drugs over $150 (outliers), and
create a boxplot with icd on the x-axis and
expenditure on the y-axis. Which diseases are associated
with greater spending on drugs?
(See Section 3.4 for boxplots and histograms, 4.1 for filtering, and 4.2 for taking the mean across groups.)
Ran into more issues with code not running so I had to put it behind #s going forward. I don’t really understand why I keep having this trouble with R, but hopefully the code is still gradeable.
## Expenditure Distribution
#is not letting my code run because I had to use #s before
#summary(Totals$TotalAmountSpentEach)
#plotting a histogram (this is the code I would use)
#hist(Totals$TotalAmountSpentEach, main = "Histogram of Total Expenditure on Drugs", xlab = "Total Expenditure")
#if it was letting me run these, now I would be able to describe the distribution here
#finding the average: here is the code I would use:
#AverageExpenditureByIcd <- Totals %>%
#group_by(icd) %>%
#summarize(AverageExpenditure = mean(TotalAmountSpentEach, na.rm = TRUE))
#filter out the outliers
#FilteredData <- Totals %>%
#  filter(TotalAmountSpentEach <= 150)
#creating a boxplot with icd on x axis and expenditure on y axis
#boxplot(TotalAmountSpentEach ~ icd, data = FilteredData,
#  main = "Boxplot of Expenditures on Drugs by ICD Code",
#  xlab = "ICD Code",
#  ylab = "Expenditure")
# if this code was running, here is where I'd be able to list which diseases are associated with greater spending on drugs
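The numeric steps in this problem (summary, group means, filtering at $150) can be tried on toy data before the real data loads. The data frame toy_totals and its values are invented for illustration only and say nothing about the MEPS results.

```r
# Hypothetical per-condition totals
toy_totals <- data.frame(
  icd   = c("endocrine", "endocrine", "circulatory", "circulatory", "digestive"),
  spent = c(20, 400, 35, 90, 10)
)

summary(toy_totals$spent)   # min, quartiles, median, mean, max

# Mean expenditure per category
avg_by_icd <- tapply(toy_totals$spent, toy_totals$icd, mean)
avg_by_icd

# Keep only rows at or below the $150 cutoff
toy_filtered <- subset(toy_totals, spent <= 150)
nrow(toy_filtered)
```

In this toy example the mean is well above the median, which is the usual sign of a right-skewed distribution, and the histogram is the place to confirm that on the real data.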
Are greater quantities of medications associated with greater expenditures?
Using the data set from Problem 4 where large expenditures were filtered out, create a scatterplot with quantity on the x axis, expenditures on the y axis, and colored by icd code. Include a regression line. What do you see?
Next, perform the associated linear regression with expenditure as the y variable and both quantity and icd as the x variables. Are the relationships statistically significant at the 5% level? Interpret what you find.
(See Section 3.4 for scatterplots, 4.4 for linear regressions.)
DISCUSS HERE
## Expenditure vs. Quantity
# creating a scatterplot
# again, still not running because something from before wasn't being called so I can't use that in my code
# the quantity column in Totals is named TotalQuantityEach; color is mapped to icd
#scatterplot <- ggplot(FilteredData, aes(x = TotalQuantityEach, y = TotalAmountSpentEach)) +
#  geom_point(aes(color = icd)) +
#  geom_smooth(method = "lm", se = FALSE) +
#  labs(title = "Scatterplot with Regression",
#       x = "quantity",
#       y = "expenditure",
#       color = "ICD Code")
# print(scatterplot)
#performing linear regression
#LinearRegression <- lm(TotalAmountSpentEach ~ TotalQuantityEach + icd, data = FilteredData)
#summary(LinearRegression)
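As a rough sketch of how to read the regression output, the block below fits the same kind of model on simulated data. Everything in toy is invented: spending is generated as roughly 2 per unit of quantity plus a fixed amount for one category, with a little random noise, so the numbers carry no meaning for the MEPS data.

```r
set.seed(1)

# Hypothetical simulated data: two categories, five quantities each
toy <- data.frame(
  quantity = rep(c(10, 20, 30, 40, 50), times = 2),
  icd      = rep(c("circulatory", "endocrine"), each = 5)
)
toy$spent <- 5 + 2 * toy$quantity +
  ifelse(toy$icd == "endocrine", 15, 0) +
  rnorm(nrow(toy), sd = 1)

fit <- lm(spent ~ quantity + icd, data = toy)

# Columns: estimate, standard error, t value, p-value
coef_table <- coef(summary(fit))
coef_table

# A relationship is significant at the 5% level when its p-value is below 0.05
coef_table[, "Pr(>|t|)"] < 0.05
```

Because icd is categorical, lm reports it as a difference from a baseline category (here the row named icdendocrine is the gap relative to circulatory), while the quantity row is the change in spending per additional unit.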