Homework 2 — Solutions

About The Data

Homework 1 (the previous assignment) focused on a simulated data set related to electronic health records and long-run outcomes for cardiology patients. That data set included a variety of messy variables. For this homework, we will use a cleaned up version of this data set while drilling down deeper into questions about the utlization, effectiveness, and costs of medicines.

File: Homework 2 Data – 2017.csv

Delimiter: Each column of the data set is separated with a comma , delimiter.

Header The first row of the data set includes the column names, and each subsequent row includes one observation of values. Here is a selection of 100 lines from the data set:

The data is written in long format (e.g. panel data). Each patient’s records are collected over time in one or more rows. Each row corresponds to a period of time. During this time, the patient’s status is recorded in terms of medications, hospitalizations, and complications. Each patient is followed until either death or the end of the follow-up period.

Here is a brief description of each variable:

id: This is a unique identifier for each patient. Because of strict privacy regulations, this identifier is anonymous. All records with the same value of id correspond to the same patient. This patient’s medical history is recorded in all of the rows with this id value. Some patients may have only a single row, while others may have many rows of updates.
begin: This is the beginning of the observation interval. This is defined as the number of days since the patient entered the study (see the definition of age above). The patient’s age at the beginning of the interval is the age variable (in years) plus the begin variable (in days).
end: This is the end of the observation interval. This is defined as the number of days since the patient entered the study (see the definition of age above). The observation interval is half open. This means that the begin date is included, while the end date is excluded. For patients with more than one row of records, the beginning of the next row should correspond to the end of the previous row. Any mismatches between these values constitute gaps in coverage, when we lack records on a patient. (For instance, if a patient switches insurance companies and then switches back, then we might lose a year’s worth of records.) The length of an interval in one row is therefore end - begin days. The patient’s age at the end of the interval is the age variable (in years) plus the end variable (in days).
age: This is the patient’s age in (rounded) years at the time of entry into the study – at the first diagnosis of coronary heart disease. For patients with multiple records in different rows, the age should be the same in every entry. For the purpose of this study, all of the patients should be at least 18 years old.
diabetes: This is an indicator of whether the patient had a diagnosed case of diabetes.
hypertension: This is an indicator of whether the patient had a diagnosed case of hypertension.
kidney_disease This is an indicator of whether the patient had a diagnosed case of kidney disease.
ace: This is an indicator of whether the patient has filled a prescription for ACE Inhibitors, a common cardiovascular drug. While we cannot say when a patient actually took a medication, our pharmacy’s records tell us when the patient filled a prescription and therefore possessed it. Therefore, we have the following coding for the values of ace:
- 1: Possession;
- 0: No possession.
beta.blocker: This is an indicator for possession of Beta Blockers, a cardiovascular medicine. It has the same coding as that of ace.
statin: This is an indicator for possession of Statins, another cardiovascular medicine. It has the same coding as that of ace and beta.blocker.
hospital: This is an indicator of whether the patient was in the hospital during the interval. Its values are coded as:
- 1: Hospitalized;
- 0: Not Hospitalized.
heart.attack: This is an indicator of whether the patient suffered a heart attack. When this occurs, the patient is assumed to go to the hospital and stay for some period of time (e.g. 1-7 days). The heart attack is assumed to happen at the beginning of the interval, and the remainder of this time is considered a recovery period. The values are coded as:
- 1: Suffered a heart attack.
- 0: No heart attack.
death: This is an indicator of the end of the patient’s life. Its values are coded as:
- 1: End of life.
- 0: Patient is still alive.

Each patient is followed until either death or the end of the observation. Many patients with coronary disease were still alive at the end of follow-up.

Note: The description above tells you the intended structure of the data set. However, it’s possible that there could be problems lurking in the records. In the course of doing this assignment, you may uncover some issues. For instance, you may find an erroneous value in some of the variables. In this circumstance, it will be necessary to resolve the situation. Here are some guidelines for doing so:

If the issue has an obvious solution, then you may recode the data. For instance, if you see a value of TRUE for the heart.attack variable, then you may safely assume that this value should have been coded as a 1.
If the issue does not have an obvious solution, then you can replace the erroneous value with NA to denote a missing value.

In either circumstance, note the problem in your solution and briefly describe the work you did to clean the data.

Question 1: Number of Medicines

At any one time, each patient is taking some combination of ACE inhibitors, Beta Blockers, and Statins – anywhere from 0 to 3 of them. We are curious about how often these patients are taking at least 2 medications at once. How often does this happen for the average patient? To answer this question, follow these steps:

For each patient, across the full range of our observation, compute the percentage of the time that the patient is taking at least 2 medicines at once.
Across all of the patients, compute the average of these percentages.

Note: Using the function rowSums is one way to calculate the number of medicines.

meds <- c("ace", "beta.blocker", "statin")
med.values <- dat[, .SD, .SDcols = meds]

dat[, Two.Plus := 1*(rowSums(x = med.values, na.rm = TRUE) >= 2)]

percent.by.patient <- dat[, .(Percent = sum(Two.Plus * (end-begin))/sum(end-begin)), by = "id"]

print(sprintf("The average patient was on at least 2 medicines about %.1f%% of the time.", percent.by.patient[, 100*mean(Percent)]))

[1] "The average patient was on at least 2 medicines about 80.8% of the time."

Question 2: Hospitalization Stays

Now let’s look at how long patients stay in the hospital. What is the minimum, average, median, maximum, and standard deviation of the length of a trip to the hospital? Round your answers to 1 decimal point.

Do the numbers make sense? What do you think?

Note: Keep in mind that if two consecutive records for a single id both have hospital = 1, then the patient remained in the hospital over both periods.

# This function checks for consecutive records in which hospital has the value 1.  It is intended to be used on a single patient's data.

check.consecutive.hospitalizations <- function(hospital){
  consecutives <- FALSE
  
  num.records <- length(hospital)
  if(num.records > 1){
    for(i in 2:num.records){
      if(hospital[i] == 1 & hospital[i-1] == 1){
        consecutives <- TRUE
      }
    }
  }
  return(as.numeric(consecutives))
}

check.consecutive.hospitalizations2 <- function(hospital){
  next.record <- c(0, hospital[1:(length(hospital)-1)])
  
  consecutives <- 1*(hospital == 1 & next.record == 1)
  return(max(consecutives, na.rm=TRUE))
}

toc <- Sys.time()
consec.check <- dat[, lapply(X = .SD, FUN = "check.consecutive.hospitalizations"), .SDcols = "hospital", by = id]
tic <- Sys.time()

toc2 <- Sys.time()
consec.check2 <- dat[, lapply(X = .SD, FUN = "check.consecutive.hospitalizations2"), .SDcols = "hospital", by = id]
tic2 <- Sys.time()


setnames(x = consec.check, old = names(consec.check), new = c("id", "consecutive.hospitalizations"))

print(sprintf("%d patients had consecutive records with hospital = 1.", sum(consec.check[, consecutive.hospitalizations])))

[1] "0 patients had consecutive records with hospital = 1."

length.of.stay <- dat[, (end - begin)[hospital == 1]]

length.of.stay.stats <- data.table(min = min(length.of.stay, na.rm=TRUE), mean = mean(length.of.stay, na.rm = TRUE), median = median(length.of.stay, na.rm = TRUE), max = max(length.of.stay, na.rm=TRUE), standard.deviation = sd(length.of.stay, na.rm = TRUE))

print(round(length.of.stay.stats, 1))

   min mean median  max standard.deviation
1:   0  6.6      3 2106               24.1

print("This mostly makes sense.  The stays of zero days corresponded to patient deaths.  The maximum stay was several years long, which occasionally happens.  The typical patient stayed about 4-8 days, which seems reasonable for this patients with cardiovascular complications.")

[1] "This mostly makes sense.  The stays of zero days corresponded to patient deaths.  The maximum stay was several years long, which occasionally happens.  The typical patient stayed about 4-8 days, which seems reasonable for this patients with cardiovascular complications."

Question 3: Initiating Medicines

What percentage of the patients started each medicine (ACE inhibitors, Beta Blockers, and Statins) within 14 days of diagnosis?

Note: This includes anyone who filled a prescription at any point within the first 14 days (with the begin variable less than 14), even if they later stopped taking it. Round your percentage answer to 1 decimal place (e.g. 36.2 percent).

Hint: If a patient has entirely missing records for a variable (e.g. a medicine like a statin), then this patient’s initiation status shouldu also be missing.

augmented_max <- function(x){
  if(mean(is.na(x)) == 1){
    return(NA_integer_)
  }
  else{
    return(max(x, na.rm=TRUE))
  }
}


initiation <- dat[begin < 14, lapply(X = .SD, FUN = "augmented_max"), .SDcols = meds, by = "id"]

initiation.results <- 100*initiation[, lapply(X = .SD,FUN = "mean", na.rm=TRUE), .SDcols = meds]

datatable(data = round(x = initiation.results, digits = 1), rownames = FALSE)

Question 4: 5 Year Outcomes

How many patients a) had a heart attack and b) died within 5 years of baseline?

Hint: Some patients may have had more than 1 heart attack within this time period. (Hopefully there were not multiple deaths for a single patient, although this is a good exercise in cleaning the data.) For this question, we are asking about the number of patients who experienced an event rather than the overall number of events.

outcomes <- c("heart.attack", "death")
one.year <- 365.25
time.frame <- 5 * one.year

had.event.by.patient <- dat[begin < time.frame, lapply(X = .SD, FUN = "max", na.rm=TRUE), .SDcols = outcomes, by = "id"]

outcomes.table <- had.event.by.patient[, lapply(X = .SD, FUN = "sum", na.rm=TRUE), .SDcols = outcomes]

datatable(data = outcomes.table, rownames = FALSE)

Question 5: Modeling 5-Year Survival

How do all of the clinical factors that we have measured impact a patient’s survival in the first 5 years after the baseline diagnosis of cardiovascular disease? To answer this question, we will construct a multivariable model. Let’s follow these steps:

Construct a data.table object with 1 row for each unique patient.
The columns of this data set will include:
- age (at baseline)
- diabetes (at baseline)
- hypertension (at baseline)
- kidney_disease (at baseline)
- ace (initiated within 14 days of baseline)
- beta_blocker (initiated within 14 days of baseline)
- statin (initiated within 14 days of baseline)
- death (within 5 years, defining 1 year as 365.25 days).
Limit the table to those who a) died within 5 years, or b) survived at least 5 years.
Fit a logistic regression of death against the predictor variables. To do use, use the glm function with family = “binomial”.
Extract the summary of the coefficients in a table. If the logistic regression model’s result is stored in a variable called mod, then this can be displayed with summary(mod)$coefficients.
Compute the Odds Ratios for the predictor variables by exponentiating the coefficients.
Compute confidence intervals for the Odds Ratios for the predictor variables by exponentiating (coef plus or minus 1.96 * standard error).
Display the results. Round your answers to 3 decimal places.
Briefly comment on the model’s results.

baseline.dat <- dat[, .SD[1], .SDcols = c("age", "diabetes", "hypertension", "kidney_disease"), by = "id"]

baseline.plus.meds.initiated <- merge(x = baseline.dat, y = initiation, by = "id")

the.outcomes <- dat[begin < time.frame, .(death = max(death, na.rm=TRUE), Survived.5 = 1*(max(end) >= time.frame)), by = "id"]

merged.dat <- merge(x = baseline.plus.meds.initiated, y = the.outcomes, by = "id")

analysis.dat <- merged.dat[death == 1 | Survived.5 == 1,]

the.formula <- "death ~ age + diabetes + hypertension + kidney_disease + ace + beta.blocker + statin"
mod <- glm(formula = the.formula, family = "binomial", data = analysis.dat)

alpha = 0.05
z <- qnorm(p = 1-alpha/2, mean = 0, sd = 1)
the.coefs <- as.data.table(x = summary(mod)$coefficients, keep.rownames = TRUE)
the.coefs[, Odds.Ratio := exp(Estimate)]
the.coefs[, Odds.Ratio.95.Lower := exp(Estimate - z * `Std. Error`)]
the.coefs[, Odds.Ratio.95.Upper := exp(Estimate + z * `Std. Error`)]

setnames(x = the.coefs, old = c("rn", "Pr(>|z|)"), new =  c("Variable", "p.value"))
setcolorder(x = the.coefs, neworder = c("Variable", "Estimate", "Odds.Ratio", "Std. Error", "z value", "Odds.Ratio.95.Lower", "Odds.Ratio.95.Upper", "p.value"))

datatable(data = the.coefs[, lapply(X = .SD, FUN = "round.numerics", digits = 3)], rownames = FALSE)

print("The evidence seems to suggest that the clinical factors -- age, diabetes, hypertension, and kidney disease -- significantly increase the odds of dying within 5 years of a baseline diagnosis of heart disease.  Meanwhile, beta blockers appear to have a significant harmful effect, while ACE inhibitors and statins -- don't appear to have a significant impact on mortality at 5 years.")

[1] "The evidence seems to suggest that the clinical factors -- age, diabetes, hypertension, and kidney disease -- significantly increase the odds of dying within 5 years of a baseline diagnosis of heart disease.  Meanwhile, beta blockers appear to have a significant harmful effect, while ACE inhibitors and statins -- don't appear to have a significant impact on mortality at 5 years."

Question 6: Identifying a Cohort

Your organization is going to perform a research study about the overall costs of medical treatment. You want to get a sense of how much it costs to treat patients with cardiovascular problems in the first year after their initial diagnoses. For the study, we will exclude any records beyond the first year (365.25 days) of follow-up.

We will start by focusing on a cohort of patients who meet all of the following criteria:

Baseline Age at least 65 years old.
Survived for at least 1 year after baseline.
Did not have any interruptions in their records for the first year of follow-up (no time gaps).

How many patients meet these criteria?

dat[, Followup.Time := max(end), by = "id"]
cohort <- dat[age >= 65 & Followup.Time >= one.year & begin < one.year, ]
cohort[, end := pmin(end, one.year)]
cohort[, Followup.Time := NULL]

cohort[, Followup.Time := as.numeric(sum(end-begin)), by = "id"]

cohort <- cohort[Followup.Time == one.year,]

print(sprintf("A total of %d patients meet the inclusion criteria for the study.", cohort[, length(unique(id))]))

[1] "A total of 5129 patients meet the inclusion criteria for the study."

Note: The remainder of the questions will rely on the research cohort identified in the previous question. Do not use the full set of patients to answer these questions.

Question 7: Medication Adherence

For the patients identified in the research cohort, what percentage of the time did they take their medicines? To answer this question, follow these steps:

For each medicine (ACE inhibitors, beta blockers, and statins), compute the medication adherence for each patient in the first year.
Compute the average adherence across all of the patients in the research cohort.
Express your answers as a percentage that is rounded to 1 decimal place, e.g. 36.1%.

compute.adherence <- function(x, begin, end, digits){
  adherence <- 100 * sum(x * (end-begin), na.rm=TRUE) / sum(end-begin, na.rm = TRUE)
  return(adherence)
}

adherence.by.patient <- cohort[, lapply(X = .SD, FUN = "compute.adherence", begin = begin, end = end, digits = 1), .SDcols = meds, by = "id"]

mean.adherence <- adherence.by.patient[, lapply(X = .SD, FUN = "mean", na.rm=TRUE), .SDcols = meds]

datatable(data = round(x = mean.adherence, digits = 1), rownames = FALSE)

Question 8: Costs

For the study on the research cohort, you have estimated the following costs:

ACE Inhibitors: $80 per month.
Beta Blockers: $200 per month.
Statins: $120 per month.
Hospitalization: $2500 per day.
Heart Attack: $15000 in addition to hospitalization costs.

For each of these patient in the research cohort, compute the total cost of their treatments, including all medications, hospitalizations, and complications. Then, use the summary function to display the minimum, 25th percentile, average, median, 75th percentile, and maximum value of the costs.

Note: For any missing value of a variable, treat that cost as a zero for the time period.

ace.cost <- 80
beta.blocker.cost <- 200
statin.cost <- 120
hospitalization.cost <- 2500
heart.attack.cost <- 15000

cohort[, Cost := ((end-begin)/(one.year/12)) * (ace.cost * ace + beta.blocker.cost * beta.blocker + statin.cost * statin) + (end-begin)*(hospitalization.cost * hospital) + (heart.attack.cost * heart.attack)]

patient.costs <- cohort[, .(Total_Cost = sum(Cost, na.rm = TRUE)), by = "id"]

print(patient.costs[, summary(Total_Cost)])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1440    5919   18535   23631   31173  915525

Question 9: Cost Ranges

We have learned about two methods for computing confidence intervals:

Normal Distribution: Mean plus or minus 1.96 * SD / sqrt(n)
Bootstrap: Use resampling to simulate a distribution of values.

For the bootstrap method, compute B = 10000 simulated versions of the average cost. Then, compute the confidence interval by taking the empirical 2.5th and 97.5th quantiles of these values. The sampling can be performed using the sample function, and the quantile function will extract the selected quantiles. (Alternatively, the bootstrapped means can be sorted in increasing order, and the percentiles can be selected by choosing the correct index.)

For each method, compute a 95% confidence interval for the average cost of care in the research cohort. Round your results to the nearest 100 dollars. If the results are not similar, comment briefly on why that might be the case.

alpha = 0.05
z <- qnorm(p = (1-alpha/2), mean = 0, sd = 1)

the.mean <- mean(patient.costs[, Total_Cost], na.rm=TRUE)
the.sd <- sd(patient.costs[, Total_Cost], na.rm=TRUE)
n <- length(unique(patient.costs[, id]))
moe <- z * the.sd/sqrt(n)

normal.lower <- the.mean - moe
normal.upper <- the.mean + moe

B <- 10000

resampled.costs <- as.data.table(matrix(data = sample(x = patient.costs[, Total_Cost], size = n*B, replace = TRUE), nrow = n))

bootstrapped.means <- colMeans(resampled.costs, na.rm = TRUE)

bootstrap.lower <- quantile(x = bootstrapped.means, probs = alpha/2)
bootstrap.upper <- quantile(x = bootstrapped.means, probs =  1-alpha/2)

ci.results <- data.table(Method = c("Normal", "Bootstrap"), Lower = c(normal.lower, bootstrap.lower), Upper = c(normal.upper, bootstrap.upper))


datatable(data = ci.results[, lapply(X = .SD, FUN = "round.numerics", nearest = 100)], rownames = FALSE)

Question 10: Super Utilizers

We can define a super utilizer as anyone with an excessively high overall cost of medical care. What percentage of the research cohort had a cost of at least $100,000 in the first year? For these patients, what was the average number of days spent in the hospital? Round your answers to 1 decimal place.

cost.threshold <- 100000
Num_Super_Utilizers = patient.costs[, length(unique(id[Total_Cost >= cost.threshold]))]

hospital.days.by.patient <- cohort[, .(Hospital_Days = sum((end-begin)[hospital == 1], na.rm=TRUE)), by = "id"]
  
Ave_Hospital_Days <- hospital.days.by.patient[id %in% patient.costs[Total_Cost >= cost.threshold, id], mean(Hospital_Days, na.rm=TRUE)]

super.utilizers <- data.table(Num_Super_Utilizers, Ave_Hospital_Days)

datatable(data = round(x = super.utilizers, digits = 1), rownames = FALSE)