About The Data

Homework 1 (the previous assignment) focused on a simulated data set related to electronic health records and long-run outcomes for cardiology patients. That data set included a variety of messy variables. For this homework, we will use a cleaned up version of this data set while drilling down deeper into questions about the utlization, effectiveness, and costs of medicines.

File: Homework 2 Data – 2017.csv

Delimiter: Each column of the data set is separated with a comma , delimiter.

Header The first row of the data set includes the column names, and each subsequent row includes one observation of values. Here is a selection of 100 lines from the data set:

The data is written in long format (e.g. panel data). Each patient’s records are collected over time in one or more rows. Each row corresponds to a period of time. During this time, the patient’s status is recorded in terms of medications, hospitalizations, and complications. Each patient is followed until either death or the end of the follow-up period.

Here is a brief description of each variable:

Each patient is followed until either death or the end of the observation. Many patients with coronary disease were still alive at the end of follow-up.

Note: The description above tells you the intended structure of the data set. However, it’s possible that there could be problems lurking in the records. In the course of doing this assignment, you may uncover some issues. For instance, you may find an erroneous value in some of the variables. In this circumstance, it will be necessary to resolve the situation. Here are some guidelines for doing so:

In either circumstance, note the problem in your solution and briefly describe the work you did to clean the data.

Question 1: Number of Medicines

At any one time, each patient is taking some combination of ACE inhibitors, Beta Blockers, and Statins – anywhere from 0 to 3 of them. We are curious about how often these patients are taking at least 2 medications at once. How often does this happen for the average patient? To answer this question, follow these steps:

  1. For each patient, across the full range of our observation, compute the percentage of the time that the patient is taking at least 2 medicines at once.

  2. Across all of the patients, compute the average of these percentages.

Note: Using the function rowSums is one way to calculate the number of medicines.

meds <- c("ace", "beta.blocker", "statin")
med.values <- dat[, .SD, .SDcols = meds]

dat[, Two.Plus := 1*(rowSums(x = med.values, na.rm = TRUE) >= 2)]

percent.by.patient <- dat[, .(Percent = sum(Two.Plus * (end-begin))/sum(end-begin)), by = "id"]

print(sprintf("The average patient was on at least 2 medicines about %.1f%% of the time.", percent.by.patient[, 100*mean(Percent)]))
[1] "The average patient was on at least 2 medicines about 80.8% of the time."

Question 2: Hospitalization Stays

Now let’s look at how long patients stay in the hospital. What is the minimum, average, median, maximum, and standard deviation of the length of a trip to the hospital? Round your answers to 1 decimal point.

Do the numbers make sense? What do you think?

Note: Keep in mind that if two consecutive records for a single id both have hospital = 1, then the patient remained in the hospital over both periods.

# This function checks for consecutive records in which hospital has the value 1.  It is intended to be used on a single patient's data.

check.consecutive.hospitalizations <- function(hospital){
  consecutives <- FALSE
  
  num.records <- length(hospital)
  if(num.records > 1){
    for(i in 2:num.records){
      if(hospital[i] == 1 & hospital[i-1] == 1){
        consecutives <- TRUE
      }
    }
  }
  return(as.numeric(consecutives))
}

check.consecutive.hospitalizations2 <- function(hospital){
  next.record <- c(0, hospital[1:(length(hospital)-1)])
  
  consecutives <- 1*(hospital == 1 & next.record == 1)
  return(max(consecutives, na.rm=TRUE))
}

toc <- Sys.time()
consec.check <- dat[, lapply(X = .SD, FUN = "check.consecutive.hospitalizations"), .SDcols = "hospital", by = id]
tic <- Sys.time()

toc2 <- Sys.time()
consec.check2 <- dat[, lapply(X = .SD, FUN = "check.consecutive.hospitalizations2"), .SDcols = "hospital", by = id]
tic2 <- Sys.time()


setnames(x = consec.check, old = names(consec.check), new = c("id", "consecutive.hospitalizations"))

print(sprintf("%d patients had consecutive records with hospital = 1.", sum(consec.check[, consecutive.hospitalizations])))
[1] "0 patients had consecutive records with hospital = 1."
length.of.stay <- dat[, (end - begin)[hospital == 1]]

length.of.stay.stats <- data.table(min = min(length.of.stay, na.rm=TRUE), mean = mean(length.of.stay, na.rm = TRUE), median = median(length.of.stay, na.rm = TRUE), max = max(length.of.stay, na.rm=TRUE), standard.deviation = sd(length.of.stay, na.rm = TRUE))

print(round(length.of.stay.stats, 1))
   min mean median  max standard.deviation
1:   0  6.6      3 2106               24.1
print("This mostly makes sense.  The stays of zero days corresponded to patient deaths.  The maximum stay was several years long, which occasionally happens.  The typical patient stayed about 4-8 days, which seems reasonable for this patients with cardiovascular complications.")
[1] "This mostly makes sense.  The stays of zero days corresponded to patient deaths.  The maximum stay was several years long, which occasionally happens.  The typical patient stayed about 4-8 days, which seems reasonable for this patients with cardiovascular complications."

Question 3: Initiating Medicines

What percentage of the patients started each medicine (ACE inhibitors, Beta Blockers, and Statins) within 14 days of diagnosis?

Note: This includes anyone who filled a prescription at any point within the first 14 days (with the begin variable less than 14), even if they later stopped taking it. Round your percentage answer to 1 decimal place (e.g. 36.2 percent).

Hint: If a patient has entirely missing records for a variable (e.g. a medicine like a statin), then this patient’s initiation status shouldu also be missing.

augmented_max <- function(x){
  if(mean(is.na(x)) == 1){
    return(NA_integer_)
  }
  else{
    return(max(x, na.rm=TRUE))
  }
}


initiation <- dat[begin < 14, lapply(X = .SD, FUN = "augmented_max"), .SDcols = meds, by = "id"]

initiation.results <- 100*initiation[, lapply(X = .SD,FUN = "mean", na.rm=TRUE), .SDcols = meds]

datatable(data = round(x = initiation.results, digits = 1), rownames = FALSE)

Question 4: 5 Year Outcomes

How many patients a) had a heart attack and b) died within 5 years of baseline?

Hint: Some patients may have had more than 1 heart attack within this time period. (Hopefully there were not multiple deaths for a single patient, although this is a good exercise in cleaning the data.) For this question, we are asking about the number of patients who experienced an event rather than the overall number of events.

outcomes <- c("heart.attack", "death")
one.year <- 365.25
time.frame <- 5 * one.year

had.event.by.patient <- dat[begin < time.frame, lapply(X = .SD, FUN = "max", na.rm=TRUE), .SDcols = outcomes, by = "id"]

outcomes.table <- had.event.by.patient[, lapply(X = .SD, FUN = "sum", na.rm=TRUE), .SDcols = outcomes]

datatable(data = outcomes.table, rownames = FALSE)

Question 5: Modeling 5-Year Survival

How do all of the clinical factors that we have measured impact a patient’s survival in the first 5 years after the baseline diagnosis of cardiovascular disease? To answer this question, we will construct a multivariable model. Let’s follow these steps:

  1. Construct a data.table object with 1 row for each unique patient.

  2. The columns of this data set will include:

    • age (at baseline)
    • diabetes (at baseline)
    • hypertension (at baseline)
    • kidney_disease (at baseline)
    • ace (initiated within 14 days of baseline)
    • beta_blocker (initiated within 14 days of baseline)
    • statin (initiated within 14 days of baseline)
    • death (within 5 years, defining 1 year as 365.25 days).
  3. Limit the table to those who a) died within 5 years, or b) survived at least 5 years.

  4. Fit a logistic regression of death against the predictor variables. To do use, use the glm function with family = “binomial”.

  5. Extract the summary of the coefficients in a table. If the logistic regression model’s result is stored in a variable called mod, then this can be displayed with summary(mod)$coefficients.

  6. Compute the Odds Ratios for the predictor variables by exponentiating the coefficients.

  7. Compute confidence intervals for the Odds Ratios for the predictor variables by exponentiating (coef plus or minus 1.96 * standard error).

  8. Display the results. Round your answers to 3 decimal places.

  9. Briefly comment on the model’s results.

baseline.dat <- dat[, .SD[1], .SDcols = c("age", "diabetes", "hypertension", "kidney_disease"), by = "id"]

baseline.plus.meds.initiated <- merge(x = baseline.dat, y = initiation, by = "id")

the.outcomes <- dat[begin < time.frame, .(death = max(death, na.rm=TRUE), Survived.5 = 1*(max(end) >= time.frame)), by = "id"]

merged.dat <- merge(x = baseline.plus.meds.initiated, y = the.outcomes, by = "id")

analysis.dat <- merged.dat[death == 1 | Survived.5 == 1,]
the.formula <- "death ~ age + diabetes + hypertension + kidney_disease + ace + beta.blocker + statin"
mod <- glm(formula = the.formula, family = "binomial", data = analysis.dat)
alpha = 0.05
z <- qnorm(p = 1-alpha/2, mean = 0, sd = 1)
the.coefs <- as.data.table(x = summary(mod)$coefficients, keep.rownames = TRUE)
the.coefs[, Odds.Ratio := exp(Estimate)]
the.coefs[, Odds.Ratio.95.Lower := exp(Estimate - z * `Std. Error`)]
the.coefs[, Odds.Ratio.95.Upper := exp(Estimate + z * `Std. Error`)]

setnames(x = the.coefs, old = c("rn", "Pr(>|z|)"), new =  c("Variable", "p.value"))
setcolorder(x = the.coefs, neworder = c("Variable", "Estimate", "Odds.Ratio", "Std. Error", "z value", "Odds.Ratio.95.Lower", "Odds.Ratio.95.Upper", "p.value"))

datatable(data = the.coefs[, lapply(X = .SD, FUN = "round.numerics", digits = 3)], rownames = FALSE)
print("The evidence seems to suggest that the clinical factors -- age, diabetes, hypertension, and kidney disease -- significantly increase the odds of dying within 5 years of a baseline diagnosis of heart disease.  Meanwhile, beta blockers appear to have a significant harmful effect, while ACE inhibitors and statins -- don't appear to have a significant impact on mortality at 5 years.")
[1] "The evidence seems to suggest that the clinical factors -- age, diabetes, hypertension, and kidney disease -- significantly increase the odds of dying within 5 years of a baseline diagnosis of heart disease.  Meanwhile, beta blockers appear to have a significant harmful effect, while ACE inhibitors and statins -- don't appear to have a significant impact on mortality at 5 years."

Question 6: Identifying a Cohort

Your organization is going to perform a research study about the overall costs of medical treatment. You want to get a sense of how much it costs to treat patients with cardiovascular problems in the first year after their initial diagnoses. For the study, we will exclude any records beyond the first year (365.25 days) of follow-up.

We will start by focusing on a cohort of patients who meet all of the following criteria:

How many patients meet these criteria?

dat[, Followup.Time := max(end), by = "id"]
cohort <- dat[age >= 65 & Followup.Time >= one.year & begin < one.year, ]
cohort[, end := pmin(end, one.year)]
cohort[, Followup.Time := NULL]

cohort[, Followup.Time := as.numeric(sum(end-begin)), by = "id"]

cohort <- cohort[Followup.Time == one.year,]

print(sprintf("A total of %d patients meet the inclusion criteria for the study.", cohort[, length(unique(id))]))
[1] "A total of 5129 patients meet the inclusion criteria for the study."

Note: The remainder of the questions will rely on the research cohort identified in the previous question. Do not use the full set of patients to answer these questions.

Question 7: Medication Adherence

For the patients identified in the research cohort, what percentage of the time did they take their medicines? To answer this question, follow these steps:

  1. For each medicine (ACE inhibitors, beta blockers, and statins), compute the medication adherence for each patient in the first year.

  2. Compute the average adherence across all of the patients in the research cohort.

  3. Express your answers as a percentage that is rounded to 1 decimal place, e.g. 36.1%.

compute.adherence <- function(x, begin, end, digits){
  adherence <- 100 * sum(x * (end-begin), na.rm=TRUE) / sum(end-begin, na.rm = TRUE)
  return(adherence)
}

adherence.by.patient <- cohort[, lapply(X = .SD, FUN = "compute.adherence", begin = begin, end = end, digits = 1), .SDcols = meds, by = "id"]

mean.adherence <- adherence.by.patient[, lapply(X = .SD, FUN = "mean", na.rm=TRUE), .SDcols = meds]

datatable(data = round(x = mean.adherence, digits = 1), rownames = FALSE)

Question 8: Costs

For the study on the research cohort, you have estimated the following costs:

For each of these patient in the research cohort, compute the total cost of their treatments, including all medications, hospitalizations, and complications. Then, use the summary function to display the minimum, 25th percentile, average, median, 75th percentile, and maximum value of the costs.

Note: For any missing value of a variable, treat that cost as a zero for the time period.

ace.cost <- 80
beta.blocker.cost <- 200
statin.cost <- 120
hospitalization.cost <- 2500
heart.attack.cost <- 15000
cohort[, Cost := ((end-begin)/(one.year/12)) * (ace.cost * ace + beta.blocker.cost * beta.blocker + statin.cost * statin) + (end-begin)*(hospitalization.cost * hospital) + (heart.attack.cost * heart.attack)]

patient.costs <- cohort[, .(Total_Cost = sum(Cost, na.rm = TRUE)), by = "id"]

print(patient.costs[, summary(Total_Cost)])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1440    5919   18535   23631   31173  915525 

Question 9: Cost Ranges

We have learned about two methods for computing confidence intervals:

For the bootstrap method, compute B = 10000 simulated versions of the average cost. Then, compute the confidence interval by taking the empirical 2.5th and 97.5th quantiles of these values. The sampling can be performed using the sample function, and the quantile function will extract the selected quantiles. (Alternatively, the bootstrapped means can be sorted in increasing order, and the percentiles can be selected by choosing the correct index.)

For each method, compute a 95% confidence interval for the average cost of care in the research cohort. Round your results to the nearest 100 dollars. If the results are not similar, comment briefly on why that might be the case.

alpha = 0.05
z <- qnorm(p = (1-alpha/2), mean = 0, sd = 1)

the.mean <- mean(patient.costs[, Total_Cost], na.rm=TRUE)
the.sd <- sd(patient.costs[, Total_Cost], na.rm=TRUE)
n <- length(unique(patient.costs[, id]))
moe <- z * the.sd/sqrt(n)

normal.lower <- the.mean - moe
normal.upper <- the.mean + moe
B <- 10000

resampled.costs <- as.data.table(matrix(data = sample(x = patient.costs[, Total_Cost], size = n*B, replace = TRUE), nrow = n))

bootstrapped.means <- colMeans(resampled.costs, na.rm = TRUE)

bootstrap.lower <- quantile(x = bootstrapped.means, probs = alpha/2)
bootstrap.upper <- quantile(x = bootstrapped.means, probs =  1-alpha/2)

ci.results <- data.table(Method = c("Normal", "Bootstrap"), Lower = c(normal.lower, bootstrap.lower), Upper = c(normal.upper, bootstrap.upper))


datatable(data = ci.results[, lapply(X = .SD, FUN = "round.numerics", nearest = 100)], rownames = FALSE)

Question 10: Super Utilizers

We can define a super utilizer as anyone with an excessively high overall cost of medical care. What percentage of the research cohort had a cost of at least $100,000 in the first year? For these patients, what was the average number of days spent in the hospital? Round your answers to 1 decimal place.

cost.threshold <- 100000
Num_Super_Utilizers = patient.costs[, length(unique(id[Total_Cost >= cost.threshold]))]

hospital.days.by.patient <- cohort[, .(Hospital_Days = sum((end-begin)[hospital == 1], na.rm=TRUE)), by = "id"]
  
Ave_Hospital_Days <- hospital.days.by.patient[id %in% patient.costs[Total_Cost >= cost.threshold, id], mean(Hospital_Days, na.rm=TRUE)]

super.utilizers <- data.table(Num_Super_Utilizers, Ave_Hospital_Days)

datatable(data = round(x = super.utilizers, digits = 1), rownames = FALSE)