This is a data report on the “Drug Prices” data received from Blue Shield. Original file (Drug Pricing Data Combined Final - V2.xlsx) rearranged and saved as (Updated BlueShield.xlsx) - by splitting into 3 tables and rearranging some columns.

Some Data Wrangling

drugs <- read_excel("/cloud/project/DATA/Updated BlueShieldMain.xlsx", sheet = "Table 1")  %>% unique() 

drug.prices.wide <- read_excel("/cloud/project/DATA/Updated BlueShieldMain.xlsx", sheet = "Table 2") 

Glimpse of Data

kable(head(drugs))
NDC LABEL_NAME MFR_NAME DRUG_GENERIC_NAME DOSAGE_FORM_DESC ROUTE_OF_ADMIN_DESC FORM_TYPE_CODE_DESC PACKAGE_SIZE PACKAGE_DESC MAINTENANCE_CODE_DESC DRUG_GROUPING DRUG_BRAND_OR_GENERIC
00002323560 CYMBALTA CAP 20MG LILLY DULOXETINE HCL Capsule Delayed Release Particles Oral Solid 00060000 BOTTLE Maintenance Drug Antidepressant Brand
00002324030 CYMBALTA CAP 30MG LILLY DULOXETINE HCL Capsule Delayed Release Particles Oral Solid 00030000 BOTTLE Maintenance Drug Antidepressant Brand
00002324090 CYMBALTA CAP 30MG LILLY DULOXETINE HCL Capsule Delayed Release Particles Oral Solid 00090000 BOTTLE Maintenance Drug Antidepressant Brand
00002327004 CYMBALTA CAP 60MG LILLY DULOXETINE HCL Capsule Delayed Release Particles Oral Solid 01000000 BOTTLE Maintenance Drug Antidepressant Brand
00002327030 CYMBALTA CAP 60MG LILLY DULOXETINE HCL Capsule Delayed Release Particles Oral Solid 00030000 BOTTLE Maintenance Drug Antidepressant Brand
00008083321 EFFEXOR XR CAP 75MG PFIZER U.S. VENLAFAXINE HCL Capsule Extended Release 24 Hour Oral Solid 00030000 BOTTLE Maintenance Drug Antidepressant Brand
kable(head(drug.prices.wide))
NDC LABEL_NAME LOB DOSAGE_FORM_DESC PACKAGE_SIZE AWP_UNIT_PRICE1 AWP_EFFECTIVE_DATE1 AWP_UNIT_PRICE2 AWP_EFFECTIVE_DATE2 AWP_UNIT_PRICE3 AWP_EFFECTIVE_DATE3 AWP_UNIT_PRICE4 AWP_EFFECTIVE_DATE4 AWP_UNIT_PRICE5 AWP_EFFECTIVE_DATE5 AWP_UNIT_PRICE6 AWP_EFFECTIVE_DATE6 awp_per_30ds WAC_UNIT_PRICE1 WAC_EFFECTIVE_DATE1 WAC_UNIT_PRICE2 WAC_EFFECTIVE_DATE2 WAC_UNIT_PRICE3 WAC_EFFECTIVE_DATE3 WAC_UNIT_PRICE4 WAC_EFFECTIVE_DATE4 WAC_UNIT_PRICE5 WAC_EFFECTIVE_DATE5 WAC_UNIT_PRICE6 WAC_EFFECTIVE_DATE6 wac_per_30ds
00002323560 CYMBALTA CAP 20MG Commercial Capsule Delayed Release Particles 00060000 9.4 2021-01-01 9.1560000000000006 43861 8.8919999999999995 43515 8.5440000000000005 43084 7.7759999999999998 41453 7.0739999999999998 41255 428 7.9 2021-01-01 7.63 43861 7.41 43515 7.12 43084 6.48 41453 5.8949999999999996 41255 357
00002324030 CYMBALTA CAP 30MG Commercial Capsule Delayed Release Particles 00030000 10.6 2021-01-01 10.272 43861 9.9719999999999995 43515 9.5879999999999992 43084 8.7240000000000002 41453 7.9379999999999997 41255 437 8.8 2021-01-01 8.56 43861 8.31 43515 7.99 43084 7.27 41453 6.6150000000000002 41255 364
00002324090 CYMBALTA CAP 30MG Commercial Capsule Delayed Release Particles 00090000 10.6 2021-01-01 10.272 43861 9.9719999999999995 43515 9.5879999999999992 43084 8.7240000000000002 41453 7.9379999999999997 41255 565 8.8 2021-01-01 8.56 43861 8.31 43515 7.99 43084 7.27 41453 6.6150000000000002 41255 470
00002327004 CYMBALTA CAP 60MG Commercial Capsule Delayed Release Particles 01000000 10.6 2021-01-01 10.272 43861 9.9719999999999995 43515 9.5879999999999992 43084 8.7240000000000002 41453 7.9379999999999997 41255 358 8.8 2021-01-01 8.56 43861 8.31 43515 7.99 43084 7.27 41453 6.6150000000000002 41255 298
00002327030 CYMBALTA CAP 60MG Commercial Capsule Delayed Release Particles 00030000 10.6 2021-01-01 10.272 43861 9.9719999999999995 43515 9.5879999999999992 43084 8.7240000000000002 41453 7.9379999999999997 41255 401 8.8 2021-01-01 8.56 43861 8.31 43515 7.99 43084 7.27 41453 6.6150000000000002 41255 334
00008083321 EFFEXOR XR CAP 75MG Commercial Capsule Extended Release 24 Hour 00030000 19.5 2021-01-01 18.601330000000001 43831 17.732330000000001 43480 16.888000000000002 43297 18.482330000000001 43282 16.888000000000002 43101 764 16.3 2021-01-01 15.500999999999999 43831 14.776999999999999 43480 14.07333 43297 15.401999999999999 43282 14.07333 43101 637

The drug prices are stated as up to 6 (Date,Price) pairs. That is there are 6 dates for which prices are given for each item. For each date, there are two types of prices: awp (average wholesale price) and wac (wholesale acqusition cost). In addition there are two additional price numbers: awp_per_30ds and wac_per_30ds which state the average of the 30-day-supply prices (using most recent price).

Renaming Columns

First rename some columns to make it more convenient for coding. In the drugs table, this is straightforward, we just want to make some names shorter and simpler.

names(drugs) <- c("NDC", "Label_Name", "Mfr_Name", "Generic_Name", "Form", "How_Taken", "Drug_State", "Qty", "Package_Type", "Maintenance", "Category", "Brand")

In the drug prices table, there’s a little more. The first few “drug” columns can be renamed similar to above.

The remaining - repeating - “price and date” column names (of the form AWP_UNIT_PRICE1) can be simplified (to Price1_AWP), and they can be combined for simplification.

# Names for initial columns of the drug.prices.wide table
drug.prices.wide <- drug.prices.wide %>% rename("NDC" = 1,  "Label_Name" = 2, "LOB" =3, "Form"=4, "Qty"=5) 

# Simplify the price and date column names
#eg convert AWP_UNIT_PRICE1 to AWPPrice_1
colnames(drug.prices.wide) <- gsub('_UNIT_PRICE','Price_',colnames(drug.prices.wide))

colnames(drug.prices.wide) <- gsub('_EFFECTIVE_DATE','Date_',colnames(drug.prices.wide))

Notice the use of “.” and “" in the names … this will be useful in the wide-to-long conversion below. The "." is just to make the name readable, and the "” will help separate out the repeated readings of Price and Date.

Fixing Data Types

Excel data format does not always preserve data type (class), so before doing any operations it is useful to coerce columns to the right type. Strangely the first date columns (among AWP and WAC) are both transmitted correctly, but the rest are not.

Let’s first identify and coerce the price columns.

# class(drug.prices.wide$AWPPrice_2)
# class(drug.prices.wide$AWPDate_2)

price.cols <- names(drug.prices.wide %>% select(contains("PRICE")) )

drug.prices.wide <- drug.prices.wide %>% mutate(across(all_of(price.cols), as.numeric)) # (cols = awp.cols, numeric) 

And now identify and coerce the date columns (except the 1st and 7th), converting from character to numeric, then to Date.

date.cols <- names(drug.prices.wide %>% select(contains("DATE")) )

drug.prices.wide <- drug.prices.wide %>%
  mutate(across(all_of(date.cols[c(-1,-7)]), as.numeric)) %>%
mutate(across(all_of(date.cols[c(-1,-7)]), as.Date, origin="1900-01-01")) %>%
 mutate(across(all_of(date.cols), as.Date))  

# as_datetime(x, origin = lubridate::origin, tz = "UTC")

Convert Qty column, which seems to be displayed with a multiple of 1000, into numeric. Finally convert character columns to factor.

drugs2 <- drugs %>% mutate(Qty = as.numeric(Qty)/1000)
drug.prices.wide2 <- drug.prices.wide %>% mutate(Qty = as.numeric(Qty)/1000)

drugs <- drugs %>% mutate_if(is.character,as.factor)
drug.prices.wide <- drug.prices.wide %>% mutate_if(is.character,as.factor)

Split the Form into Form + Release Mechanism

(** This is not necessary because the BSC database has more detailed columns identifying Form **)

The “Form” values in the data set are sometimes compound. For instance, “Capsule Extended Release 12 Hour”, mixes the Form (“Capsule”) and release mechanism (“Extended Release 12 Hour”). We will split these, so that then one can compare prices for “Capsule Extended Release 12 Hour” and “Tablet Extended Release 12 Hour”, but not compare “Capsule Extended Release 12 Hour” against “Tablet Extended Release 24 Hour”.

drug.prices.wide$Mechanism <- str_extract(drug.prices.wide$Form, "\\s\\w*") #eg Tab Cap

Split the Label_Name Column into Name, FormulationType, Dosage

The Label_Name column has compound values. For instance “CYMBALTA CAP 30M” is the medicine CYMBALTA in CAP type and 30MG Dosage. We’ll split it into 3 new columns: DrugName, FormType, and Dosage

# remove extra whitespaces
drug.prices.wide <- drug.prices.wide %>% mutate(Label_Name = gsub("\\s+"," ",drug.prices.wide$Label_Name) ) 

# Drug Name: first remove FormType ("\\s\\w+") and Dosage ("\\d+[^\ ]*")
# drug.prices.wide <- drug.prices.wide %>% mutate(DrugName = sub("\\s\\w+", " ", drug.prices.wide$Label_Name)) 

# drug.prices.wide <- drug.prices.wide %>% mutate(DrugName = sub("\\d+[^\ ]*","", drug.prices.wide$DrugName)) 

# get the "main form" by pulling the first word in Label_Name before a whitespace and followed by 0 or more "word" characters

drug.prices.wide$FormType <- str_extract(drug.prices.wide$Label_Name, "\\s\\w+") #eg Tab Cap

# For Dosage: match digit+nonspacealpha
# https://stackoverflow.com/questions/2192316/extract-a-regular-expression-match
drug.prices.wide$Dosage <- str_extract(drug.prices.wide$Label_Name, "\\d+[^\ ]*")

From Wide to Long Form

The goal is to take the 6 pairs of columns of AWP Prices and Dates and reduce them to 2 columns (AWP Price and AWP Date) and up to 6 rows; do the same for WAC. To apply the pivot_longer command efficiently these columns should be named like AWP1_AWP.Price and AWP1_AWP.Date etc.).

drugs.def <- drug.prices.wide %>% select(c(NDC:Qty), awp_per_30ds, wac_per_30ds, Dosage) #, DrugName, FormType, Dosage)

drugs.prices.only <- drug.prices.wide %>% select(c(NDC, contains("Price")))

drugs.dates.only <- drug.prices.wide %>% select(c(NDC, contains("Date")))
drug.prices.long <- pivot_longer(
   drugs.prices.only,
   cols = AWPPrice_1:WACPrice_6,  
   names_to = c(".value", "Type"), # notice the swap in order  
   names_pattern = "(.+)_(.+)",
#   names_sep = "_",
   values_drop_na = TRUE)

drug.dates.long <- pivot_longer(
   drugs.dates.only,
   cols = AWPDate_1:WACDate_6,  
   names_to = c(".value", "Type"), 
   names_sep = "_",
   values_drop_na = TRUE)

Now re-join the 3 split tables.

drugs.long <- merge(drug.dates.long, drug.prices.long, by=c("NDC", "Type")) %>% unique()

drugs.long <- merge(drugs.def, drugs.long, by="NDC")

# drugs.long.1 <- drugs.long %>% filter(Type == "1")

Finally, add the Generic_Name and Mfr_Name columns from the drugs table.

drugs.final <- merge(drugs.long, drugs[, c("NDC", "Category", "Generic_Name", "Mfr_Name", "Brand")], by = "NDC", all.x=TRUE) 

nameorder <- c("NDC", "Category", "Generic_Name", "Qty", "Dosage", "Mfr_Name", "Form", "LOB", "AWPPrice", "WACPrice", "AWPDate", "WACDate", "awp_per_30ds", "wac_per_30ds",  "Label_Name", "Brand") #, "DrugName", "FormType", "FormType")

drugs.final <- drugs.final %>% mutate_if(is.character,as.factor) %>% select(nameorder) %>% ungroup() 

drugs.final <- drugs.final %>% 
   mutate(NDC = as.character(NDC), Generic_Name = as.character(Generic_Name), 
          Mfr_Name = as.character(Mfr_Name), Dosage = as.character(Dosage), 
          Label_Name = as.character(Label_Name))  %>% arrange(Generic_Name, Qty, Dosage, Mfr_Name, Form) 

Save

Finally, save the data frames needed for analysis into an R data object.

kable(head(drugs.final)) 
NDC Category Generic_Name Qty Dosage Mfr_Name Form LOB AWPPrice WACPrice AWPDate WACDate awp_per_30ds wac_per_30ds Label_Name Brand
47781034201 Antidiabetics ACARBOSE 00100000 100MG ALVOGEN Tablet Commercial 1.2 0.62 2013-06-07 2013-06-07 104 56 ACARBOSE TAB 100MG Generic
00054014225 Antidiabetics ACARBOSE 00100000 100MG HIKMA Tablet Medicare 1.2 0.60 2008-05-07 2013-12-20 97 50 ACARBOSE TAB 100MG Generic
00054014225 Antidiabetics ACARBOSE 00100000 100MG HIKMA Tablet Medicare NA 0.94 NA 2012-01-12 97 50 ACARBOSE TAB 100MG Generic
00054014225 Antidiabetics ACARBOSE 00100000 100MG HIKMA Tablet Commercial 1.2 0.60 2008-05-07 2013-12-20 96 50 ACARBOSE TAB 100MG Generic
00054014225 Antidiabetics ACARBOSE 00100000 100MG HIKMA Tablet Commercial NA 0.94 NA 2012-01-12 96 50 ACARBOSE TAB 100MG Generic
69543012210 Antidiabetics ACARBOSE 00100000 100MG VIRTUS PHARMACEUTICALS Tablet Medicare 1.2 0.60 2015-05-20 2015-05-20 101 53 ACARBOSE TAB 100MG Generic
save(drugs.final,  drug.prices.wide, file="/cloud/project/DATA/BSC-GH-2021.RData")