Project 5

Introduction

For this project, you will explore drug prices associated with different treatments. You will need to load the following libraries to complete the assignment.

library(dplyr)
library(ggplot2)
library(stringr)

Two data sets are used: “h20_cond.rda” and “h20_medicine.rda”. Both can be found on Canvas in the Data folder under Files. Download each data set, then load it using the load function.

(Hint: You can use the “Open File…” command in the “File” menu to search for the downloaded files. Then, copy and paste the command from the console below.)

#load("/path/to/data/h20_cond.rda")
#load("/path/to/data/h20_medicine.rda")
#head(h20_cond)
#head(h20_medicine)

The data come from the Medical Expenditure Panel Survey (MEPS), one of the most trusted government sources of medical expenditure data. The file h20_cond contains two columns: ID_cond, which identifies a particular condition diagnosed in a particular individual, and icd, which contains the ICD-10 code that labels a particular disease.

The h20_medicine data set contains observations on drug expenditures. Each row is associated with a particular person, the drug they purchased and its quantity, the total amount spent, and an ID for the condition.

1. Functions

Since there are too many types of diagnoses (icd) within the data set, we need to recode each diagnosis to a broader category. We will do this via a custom function.

Create a function that takes a character value x. If the value starts with the letter “E”, the function should return “endocrine”, and so forth. The list below shows what the function should return based on the starting letter.

“E” – “endocrine”
“I” – “circulatory”
“J” – “respiratory”
“K” – “digestive”

Use if and else statements to test the conditions within the function. Test your function on “I65” and “K07”.

(See Section 5.2 for creating functions, 5.1 for using if / else statements, and 2.6 for extracting substrings.)

## Broader Diagnoses Categories

# the function code
BroaderDiagnosis <- function(x) {
  # extract the first letter once, then test it against each category
  first_letter <- substring(x, 1, 1)
  if (first_letter == "E") {
    return("endocrine")
  } else if (first_letter == "I") {
    return("circulatory")
  } else if (first_letter == "J") {
    return("respiratory")
  } else if (first_letter == "K") {
    return("digestive")
  } else {
    # any other starting letter is not in the list above
    return("unknown category")
  }
}

# test inputs
TestOne <- "I65"
TestTwo <- "K07"

# testing the function
BroaderDiagnosis(TestOne) # starts with "I", so it should return "circulatory"

## [1] "circulatory"

BroaderDiagnosis(TestTwo) # starts with "K", so it should return "digestive"

## [1] "digestive"
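
As a point of comparison with the if / else version above, the same letter-to-category mapping can be sketched as a named lookup vector. This is only an alternative illustration, not the approach the assignment asks for. The names icd_categories and BroaderDiagnosisLookup are hypothetical and do not appear in the course materials.

```r
# Named character vector: names are starting letters, values are categories
icd_categories <- c(E = "endocrine", I = "circulatory",
                    J = "respiratory", K = "digestive")

# Hypothetical helper: recodes a whole character vector in one call
BroaderDiagnosisLookup <- function(x) {
  first_letter <- substr(x, 1, 1)
  category <- unname(icd_categories[first_letter])
  # letters that are not in the table index to NA, so label them explicitly
  category[is.na(category)] <- "unknown category"
  category
}

BroaderDiagnosisLookup(c("I65", "K07", "Z99"))
## [1] "circulatory"      "digestive"        "unknown category"
```

Because the lookup works on a whole vector, it does not need a loop. The if / else version only handles one value at a time, which is why Problem 2 pairs it with a for loop.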

2. Loops

Use a for loop to change each value in the icd column of the h20_cond data set to its broader category, using the function from Problem 1. Show the first 10 observations of the column after the change.

(See Section 5.4 for loops.)

## Loop Code

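# requires h20_cond to be loaded (see the load() calls above)

# make sure icd is character, so recoded values do not become NA if it was stored as a factor
h20_cond$icd <- as.character(h20_cond$icd)

# recode each value of the icd column, one row at a time
for (i in seq_along(h20_cond$icd)) {
  h20_cond$icd[i] <- BroaderDiagnosis(h20_cond$icd[i])
}

# first 10 observations of the recoded column
head(h20_cond$icd, 10)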
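
The loop pattern for this problem can be tried without the real data. The sketch below uses a small made-up data frame. toy_cond and its values are invented for illustration and are not MEPS data, and recode_icd is a compact hypothetical stand-in for the function from Problem 1.

```r
# Toy stand-in for h20_cond (made-up values)
toy_cond <- data.frame(ID_cond = 1:4,
                       icd = c("E11", "I10", "J45", "K21"),
                       stringsAsFactors = FALSE)

# Compact stand-in for the recoding function from Problem 1
recode_icd <- function(x) {
  switch(substring(x, 1, 1),
         E = "endocrine", I = "circulatory",
         J = "respiratory", K = "digestive",
         "unknown category")
}

# seq_along() yields 1, 2, ..., n, so i is always a valid row index
for (i in seq_along(toy_cond$icd)) {
  toy_cond$icd[i] <- recode_icd(toy_cond$icd[i])
}

head(toy_cond$icd, 10)
## [1] "endocrine"   "circulatory" "respiratory" "digestive"
```

The loop variable has to be a bare name such as i, and it iterates over positions in the column rather than over the data frame itself.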

3. Joining Data Sets

Join the h20_cond and h20_medicine data.frames on the ID_cond column, making sure that all rows in h20_medicine are preserved. Show the dimensions of the new data set using the dim(...) function.

After joining the data sets, sum the total amount spent and the total quantity of medication for each person and each condition from which they suffer, and show the head of the new data set.

(See Section 4.3 for joining data sets, and 4.2 for summarizing across groups.)

## joining the data sets

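# left join: all.x = TRUE keeps every row of h20_medicine
JoinedData <- merge(h20_medicine, h20_cond, by = "ID_cond", all.x = TRUE)
print(dim(JoinedData))

# total spending and total quantity per person / condition
# AmountSpent and quantity are assumed column names; check names(h20_medicine)
# icd is included in the grouping so that it is kept for Problems 4 and 5
Totals <- JoinedData %>%
  group_by(ID_cond, icd) %>%
  summarize(TotalAmountSpentEach = sum(AmountSpent, na.rm = TRUE),
            TotalQuantityEach = sum(quantity, na.rm = TRUE),
            .groups = "drop")

print(head(Totals))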
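
To see what "all rows in h20_medicine are preserved" means in practice, here is a minimal sketch on made-up data. toy_med and toy_cond are invented tables, and the column names AmountSpent and quantity are assumptions rather than the confirmed names in the MEPS files. Base R merge is used so the sketch runs without extra packages.

```r
# Toy stand-ins (made-up values): ID_cond 9 has no match in toy_cond
toy_med <- data.frame(ID_cond = c(1, 1, 2, 3, 9),
                      AmountSpent = c(10, 20, 5, 40, 15),
                      quantity = c(30, 30, 10, 60, 90))
toy_cond <- data.frame(ID_cond = c(1, 2, 3),
                       icd = c("endocrine", "circulatory", "digestive"))

# all.x = TRUE makes this a left join: unmatched rows of toy_med are kept
toy_joined <- merge(toy_med, toy_cond, by = "ID_cond", all.x = TRUE)

dim(toy_joined)
## [1] 5 4

# the unmatched row is kept, with icd filled in as NA
toy_joined[toy_joined$ID_cond == 9, ]
```

A quick check after a left join is that nrow() of the result equals nrow() of the left table. If it is larger, ID_cond is duplicated in the right table.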

4. Expenditure Distribution

Show a five number summary of total expenditure on drugs by condition / person from the new data set created in Problem 3 (Hint: use the summary command) and plot a histogram. Describe the distribution.

Find the average expenditure on drugs by icd code. Then, filter out expenditures on drugs over $150 (outliers), and create a boxplot with icd on the x-axis and expenditure on the y-axis. Which diseases are associated with greater spending on drugs?

(See Section 3.4 for boxplots and histograms, 4.1 for filtering, and 4.2 for taking the mean across groups.)

The code in this problem depends on the Totals data set from Problem 3, so the earlier code must run first.

## Expenditure Distribution

# five number summary of total expenditure (summary() also reports the mean)
summary(Totals$TotalAmountSpentEach)

# histogram of total expenditure
hist(Totals$TotalAmountSpentEach,
     main = "Histogram of Total Expenditure on Drugs",
     xlab = "Total Expenditure")

# the description of the distribution goes here once the output above is available

# average expenditure by icd code
# (Totals must keep the icd column, e.g. by grouping on both ID_cond and icd in Problem 3)
AverageExpenditureByIcd <- Totals %>%
  group_by(icd) %>%
  summarize(AverageExpenditure = mean(TotalAmountSpentEach, na.rm = TRUE))

print(AverageExpenditureByIcd)

# filter out the outliers (expenditures over $150)
FilteredData <- Totals %>%
  filter(TotalAmountSpentEach <= 150)

# boxplot with icd on the x axis and expenditure on the y axis
boxplot(TotalAmountSpentEach ~ icd, data = FilteredData,
        main = "Boxplot of Expenditures on Drugs by ICD Code",
        xlab = "ICD Code",
        ylab = "Expenditure")

# the diseases associated with greater spending on drugs are listed here once the boxplot is available
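
As a small illustration of the summary and filtering steps, the sketch below uses a made-up vector of expenditures. toy_spend is invented and is not taken from the data. It shows that summary() reports six values (the five number summary plus the mean), while fivenum() returns exactly five, and how one large value pulls the mean above the median.

```r
# Made-up expenditures with one large outlier
toy_spend <- c(5, 12, 20, 35, 48, 90, 400)

# summary() gives Min, 1st Qu., Median, Mean, 3rd Qu., Max
summary(toy_spend)

# fivenum() gives Tukey's five numbers only
fivenum(toy_spend)

# a mean well above the median is a sign of right skew
mean(toy_spend) > median(toy_spend)
## [1] TRUE

# keep only expenditures of $150 or less, as in the filtering step
toy_filtered <- toy_spend[toy_spend <= 150]
length(toy_filtered)
## [1] 6
```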

5. Expenditure vs. Quantity

Are greater quantities of medications associated with greater expenditures?

Using the data set from Problem 4 where large expenditures were filtered out, create a scatterplot with quantity on the x axis, expenditures on the y axis, and colored by icd code. Include a regression line. What do you see?

Next, perform the associated linear regression with expenditure as the y variable and both quantity and icd. Are the relationships statistically significant at the 5% level? Interpret what you find.

(See Section 3.4 for scatterplots, 4.4 for linear regressions.)

DISCUSS HERE

## Expenditure vs. Quantity


# scatterplot of quantity vs. expenditure, colored by icd, with one regression line
# (uses FilteredData from Problem 4, which must keep the icd and TotalQuantityEach columns)
scatterplot <- ggplot(FilteredData, aes(x = TotalQuantityEach, y = TotalAmountSpentEach)) +
  geom_point(aes(color = icd)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot with Regression",
       x = "Quantity",
       y = "Expenditure",
       color = "ICD Code")

print(scatterplot)

# linear regression of expenditure on quantity and icd
LinearRegression <- lm(TotalAmountSpentEach ~ TotalQuantityEach + icd, data = FilteredData)

summary(LinearRegression)
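
To check significance at the 5% level, the p-values can be read from the coefficient table of the fitted model. The sketch below runs on simulated data. The data frame toy, its column names, and the true slope of 0.8 are all invented for illustration and say nothing about the MEPS results.

```r
set.seed(1)
n <- 60

# Simulated data: expenditure rises with quantity, with no true icd effect
toy <- data.frame(
  quantity = runif(n, 10, 100),
  icd = factor(rep(c("circulatory", "digestive", "endocrine"), each = n / 3))
)
toy$expenditure <- 20 + 0.8 * toy$quantity + rnorm(n, sd = 10)

fit <- lm(expenditure ~ quantity + icd, data = toy)

# coefficient table: one row per term, including one row per non-baseline icd level
coefs <- coef(summary(fit))
p_values <- coefs[, "Pr(>|t|)"]

# TRUE where a coefficient is significant at the 5% level
p_values < 0.05
```

Because icd is categorical, lm() reports each category relative to a baseline level (the first level alphabetically, here "circulatory"). Each icd p-value therefore tests a difference from that baseline, not the category's own spending level.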