MATH13224 - Introduction to Statistics
Assignment 4 - Semester 1, 2017

An analysis of the cost of the Australian
Pharmaceutical Benefits Scheme (PBS)
2011 - 2015

Parikshit Sreedhar (#s3643796)
Matt Bolton (#s8501702)

Last updated: 05 June 2017

Introduction

The Australian Pharmaceutical Benefits Scheme (PBS)

The Australian health care system provides essential medical care to Australians, regardless of their financial circumstances. An important component of this is the Pharmaceutical Benefits Scheme (PBS), which provides subsidised prescription drugs to Australians.

PBS costs are increasing every year, costing nearly eight billion dollars in 2015-16.* (PBS:2017) Ways of reducing this cost to the Australian taxpayer are worth considering.

Detailed data on a part of the cost of the PBS (approximately $300 million per annum) was made available recently via the Melbourne Datathon. Using this data, we will analyse the following components to look for patterms that may point to high cost areas:

Anatomical Therapeutic Chemical (ATC) Level 1
Gender
Socio-economic status of patients, based on Australian Bureau of Statistics Statistical Area 4 areas (ABS SA4)

Any patterns found could form the basis of cost reduction via, for example, negotiation with drug manufacturers or additional targetted health eduction in the areas (ATC or socio-economic/geographic) highlighted.

** Note: The data analysed, being only a subset of the full cost of the PBS, doesn’t show this increasing cost, so our conclusions will keep that in mind.

Problem Statement

Are there factors influencing the cost of the PBS that could be used to help reduce its cost?

To answer this question, we will be assessing the data in the following areas:

ATC Level 1

Drugs are categorised at five increasingly detailed levels using the Anatomical Therapeutic Chemical system. At level 1, there are 13 groups based on the broad area of the human body on which the chemicals work, e.g. Dermatologicals. The level 2 is more detailed, e.g. Anti-acne preparations and level 3 more detailed still e.g. Anti-acne topical use, Anti-acne systemic use. Our study will look at ATC Level 1 only.

Gender

Gender could be an important factor in the use of drugs and therefore cost to the taxpayer. We will analyse whether there is significant difference between genders for drug cost. We will conduct hypothesis testing of the per capita cost of drugs to male & female patients across each of the ATC 1 levels.

Socio-economic factors

The drug data provided contains patient postcode. By combining this with ABS data, we can link drug costs to ABS statistical areas. We will study the data at ABS SA4 level (88 areas in Australia) to look for patterms which may be useful in understanding areas of higher cost. This will be done via scatter plots. If any interesting relationships are seen, these will be explored further using regression analysis.

Data

Data was obtained from two sources - the Melbourne Datathon 2017 (NOSTRADATA limited) and the Australian Bureau of Statistics (ABS).

The Datathon data consists of 64 million drug purchase transactions between 2010 and 2016. Only 1 week of data was available for 2010 and 6 months for 2016, so we limited our analysis to the period from 2011-2015.

The data was summarised in SQL server grouping transactions on:
- ATC Level 1
- Year of transaction
- Gender of patient
- Drug

with counts and costs totalled or averaged as appropriate. This left 61,578 data points (Drugs_SA4).

In order to link this data to population and economic factors, ABS Income data (Catalog 1410.0) data was used. This data contains data at national and state levels as well as SA1 - SA4 (standard areas the ABS uses to aggregate data). We used SA4, giving 88 areas to study. To join this data to the drug data, a postcode to SA4 mapping was required and ABS Catalog 1270.0.55.003 was used for this purpose. This enabled a table of the SA4 areas to be populated with summary drug cost data, both total and per capita. (ABS_SA4)

Note: The match between postcodes and SA4 areas is not exact - some postcodes falling into multiple SA4 areas. A ratio of the area of the postcode in each SA is provided, so the postcode was allocated for our purposes to the SA4 which contained the largest area.

A joining table containing data from both these tables was created to use for the gender comparisons.

This data is aggregated on:
- SA4
- ATC
- Year
- Gender

Descriptive Statistics and Visualisation

Annual Cost by ATC Level 1

-Now we look at the total govt. spending across the dispense years broken down by ATC Level 1 Names.

Drugs_SA4 <- read.csv('Drugs_SA4.csv')
#Group Drugs data by ATC & Year
SA4_ATC_Year <- Drugs_SA4 %>% group_by(ATC, Disp_Year) %>% summarise_each(funs(sum), Govt_Reclaim_Amt)
# Adjust factors by value for more meaningful display (sorted by cost)
SA4_ATC_Year$ATC <- factor(SA4_ATC_Year$ATC, levels = c("Antiparisitic","Dermatological","Hormones","Various","Sensory",
                  "Genito_Urinary","Antiinfectives","Musculo-Skeletal","Respiratory",
                  "Blood","Antineoplastic","Alimentary","Nervous_System","Cardiovascular"))

# Need a large colour palette for chart
SA4_ATC <- SA4_ATC_Year %>% group_by(ATC) %>% summarise_each(funs(sum), Govt_Reclaim_Amt)
colourCount = length(unique(SA4_ATC$ATC))
cp = colorRampPalette(brewer.pal(11, "Spectral"))(colourCount)

ggplot(SA4_ATC_Year, aes(x = Disp_Year, y = Govt_Reclaim_Amt / 1000000, fill = ATC)) + theme(plot.title = element_text(hjust = 0.5)) + 
    geom_bar(stat="identity") +
    ggtitle ("Annual Drug cost to Government by ATC Level 1") + 
    labs(x = "Year", y = "Cost ($ million)") +
    scale_fill_manual(values=cp)

Over the years,heart related medication have cost the government dearly.

Annual Cost by ATC Level 1 (cont.)

-To assess the distribution of Govt. spending by ATC Level 1 Names, below are some crisp stats furnished at the ATC Level 1 Name level.

Summary Statistics

SA4_ATC %>% group_by(ATC) %>% summarise(
  min = min(Govt_Reclaim_Amt
  , na.rm = TRUE),
  Q1 = quantile(Govt_Reclaim_Amt, probs = .25, na.rm = TRUE),
  Median = median(Govt_Reclaim_Amt, na.rm = TRUE),
  Q3 = quantile(Govt_Reclaim_Amt, probs = .75, na.rm = TRUE),
  Max = max(Govt_Reclaim_Amt, na.rm = TRUE),
  Mean = mean(Govt_Reclaim_Amt, na.rm = TRUE),
  SD = sd(Govt_Reclaim_Amt, na.rm = TRUE),
  n = n(),
  Missing = sum(is.na(Govt_Reclaim_Amt))
  )

#Group by ATC only 
#SA4_ATCY <- spread(SA4_ATC_Year, Disp_Year, Govt_Reclaim_Amt)
#SA4_ATC
#%>% prop.table(margin = 2) * 100

Box Plots

-We now look at the spread of govt. reclaim amount across ATC Level 1 levels -The chart on the left represents the given Govt. Reclaim amounts (on the vertical axis) -The chart on the right represents the log scaled value of the Govt. Reclaim amounts (on the vertical axis ) -The following ATC Level 1 levels (below) possess considerable variation in Govt. Reclaim Amounts and hence we go through them with a fine tooth comb in the

The following ATC Level 1 levels (below) possess considerable variation in Govt. Reclaim Amounts:
- ALIMENTARY
- ANTINEOPLASTIC
- CARDIOVASCULAR
- NERVOUS SYSTEM

fill <- "gold1"
line <- "goldenrod2"

p1 <- ggplot(Drugs_SA4, aes(x = ATC, y = Govt_Reclaim_Amt /1000)) +
    geom_boxplot(fill = fill, colour = line, outlier.shape = 1) +
    ggtitle ("Drug cost to Government 2011-2015 by ATC Level 1") + theme(plot.title = element_text(hjust = 0.5)) + 
    scale_x_discrete(name = "Anatomical Therapeutic Chemical (ATC) Level 1") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    scale_y_continuous(name = "Cost per drug ($ thousand)") 

p2 <- ggplot(Drugs_SA4, aes(x = ATC, y = log(Govt_Reclaim_Amt))) +
    geom_boxplot(fill = fill, colour = line, outlier.shape = 1) +
    ggtitle ("Drug cost to Government 2011-2015 by ATC Level 1") + theme(plot.title = element_text(hjust = 0.5)) + 
    scale_x_discrete(name = "Anatomical Therapeutic Chemical (ATC) Level 1") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    scale_y_continuous(name = "Cost per drug (log)") 

multiplot(p1, p2, cols=2)

ATC_SA4 <- read.csv('ATC_SA4.csv')
drug_hyp <- ATC_SA4

class (drug_hyp$Gender)

## [1] "factor"

drug_hyp <- drug_hyp %>% filter(Gender=="M"|Gender=="F")
drug_hyp$Gender <- str_trim(drug_hyp$Gender)
drug_hyp <- drug_hyp[complete.cases(drug_hyp),]

drug_hyp_subset <- drug_hyp %>% group_by(ATC, Gender) %>% summarise_each(funs(sum), Govt_Reclaim_Amt)
drug_hyp_subset <- spread(drug_hyp_subset, Gender, Govt_Reclaim_Amt)

drug_atc_gender <- Drugs_SA4 %>% group_by(Drug_ID, ATC, Gender) %>% summarise_each(funs(sum), Govt_Reclaim_Amt) %>% filter(Gender=="M"|Gender=="F")
drug_atc_gender <- spread(drug_atc_gender, Gender, Govt_Reclaim_Amt)
drug_atc_gender <- drug_atc_gender[complete.cases(drug_atc_gender), ]

Hypothesis Testing

*Assumptions: +After subsetting the data at the ATC Level 1 Names, we assume that there are equal number of male and female patients. If the result is close, we could take into account the actual ratio of 98.8 males per 100 females (ABS) +Patients with “Unidentified” gender will be excluded from the analysis and not deleted since they account to 19% of the total govt. spending +Significance level of 5% will be used to conduct the test

*With regard to total government spending on Males vs Females: +Since the same sample of ATC Level1 Names are taken for both the groups, a paired t-test is carried out +The data is at the ATC Level1 Name level

The null hypothesis is as follows:
There is NO difference in govt. subsidies to Males and Females. \[H_0 : di=0\]
The alternative hypothesis is as follows:
There is a statistically significant difference in govt. subsidies to Males and Females. \[H_a: di \neq 0\]

t.test(drug_hyp_subset$F,drug_hyp_subset$M,alternative = "two.sided",paired = TRUE)

## 
##  Paired t-test
## 
## data:  drug_hyp_subset$F and drug_hyp_subset$M
## t = 2.4967, df = 13, p-value = 0.02675
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   1817133 25160034
## sample estimates:
## mean of the differences 
##                13488584

Since the p-value is less than the significance level (0.05), we reject \[H_0\]. We conclude that there is a statistically significant difference in govt. spending towards Males and Females

Hypothesis testing at each ATC level

p_values<-list()
atclevels<-unique(drug_atc_gender$ATC)
for(i in 1:length(atclevels)){
  dag<-drug_atc_gender%>%filter(ATC==atclevels[i])
  p_values[[i]]<-list(print(paste(atclevels[i],round(t.test(dag$F,dag$M,paired=TRUE,alternative="two.sided")$p.value,3))))
}

## [1] "Nervous_System 0"
## [1] "Antiinfectives 0.001"
## [1] "Cardiovascular 0"
## [1] "Alimentary 0.003"
## [1] "Various 0.822"
## [1] "Musculo-Skeletal 0.002"
## [1] "Dermatological 0.061"
## [1] "Respiratory 0.004"
## [1] "Sensory 0"
## [1] "Antiparisitic 0.234"
## [1] "Antineoplastic 0.003"
## [1] "Genito_Urinary 0.687"
## [1] "Blood 0.051"
## [1] "Hormones 0.036"

p_values_unlist<-unlist(p_values)

p_values<-strsplit(p_values_unlist,split = " ")

This code performs a t-test for each ATC Level 1, by looping through each ATC Level.

p_values_df<-data.frame()
for(i in 1:length(p_values)){
  p_values_df<-rbind(p_values_df,data.frame(ATC=p_values[[i]][1],P_Values=p_values[[i]][2]))
}
p_values_df$P_Values<-as.numeric(as.character(p_values_df$P_Values))
p_values_df<-p_values_df[order(p_values_df$P_Values),]
barplot(p_values_df$P_Values,names.arg = p_values_df$ATC)
abline(h=0.05)

This chart is a bar plot of p-values for the hypothesis test conducted for each ATC Level 1 Name. The horizontal black line show the significant level of 0.05

Economic Data

We will look here at relationships between selected economic indicators and the cost of the PBS.

Economic Data (cont.)

All charts show visually some potential correlation. We will look at one of these (Income) to establish the strength of the linear relationship.

Income per capita

We will test the following hypothesis:

There is NO relationship between Income (per capita) and PBS cost.

Intercept: \[H_0 : \alpha =0\] \[H_A : \alpha \neq 0\]

Slope: \[H_0 : \beta =0\] \[H_A : \beta \neq 0\]

Assumptions:
- Independence
- Linearity
- Normality of residuals
- Homoscedasticity
Independence
- The data were obtained from separate sources - actual drug purchases and ABS data based on census and Tax Office data, so we can be confident of independence.
Linearity
- The chart on the previous page show visually a reasonable inverse linear relationship, though there is significant data at high levels of cost for low income areas.

df = length(unique(ABS_SA4$SA4_Income_ID))
LR <- lm(Drug_Govt_Cost_perCap ~ Income_per_cap, data = ABS_SA4)
LR %>% summary()

## 
## Call:
## lm(formula = Drug_Govt_Cost_perCap ~ Income_per_cap, data = ABS_SA4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -127.43  -34.18  -10.18   24.39  192.77 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.765e+02  2.087e+01   8.455 6.32e-13 ***
## Income_per_cap -1.458e-03  3.062e-04  -4.762 7.70e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.97 on 86 degrees of freedom
## Multiple R-squared:  0.2087, Adjusted R-squared:  0.1994 
## F-statistic: 22.68 on 1 and 86 DF,  p-value: 7.7e-06

LR %>% confint()

##                        2.5 %        97.5 %
## (Intercept)    134.973680268  2.179518e+02
## Income_per_cap  -0.002066876 -8.494108e-04

The p-value for the F-test is very small (<0.001), so we can reject the Null hypothesis and conclude that there is a statistically significant relationship between between Income (per capita) and PBS cost with an r-squared of 0.21.

The confidence intervals for the Intercept and slope both do not contain the null hypothesised value of 0, so we can conclude a estimated linear regression with an intercept of 176.5 (Confidence intervals 135, 218) and slope of -1.458e-03 (CI: -0.0021 -8.50-04).

r=cor(ABS_SA4$Drug_Govt_Cost_perCap,ABS_SA4$Income_per_cap)
r

## [1] -0.4567835

CIr(r = r, n = df, level = .95)

## [1] -0.6080534 -0.2735076

A Pearson’s correlation was calculated to measure the strength of the linear relationship between Income (per capita) and PBS cost. The negative correlation was statistically significant, r=0.21, p<.001, 95% CI [-0.608 -0.273].

Check assumptions

LR %>% plot(which = 1)

LR %>% plot(which = 2)

Findings

The cost to the Australian taxpayer of the PBS is significant. The significant portion of the costs are associated with medications dealing with the areas of Cardiovascular, Alimentary, Antineoplastic and Nervous System.

There is a statistically significant disparity in the government reclaim amount between males and females. Viewing te data, it seems that females cost more, but this wasn’t investigated. Further study on this would be useful, but the variability of the gender ratio across age groups would need to be taken into consideration. Refer to the chart below that shows the sex ratio decreasing (more females) with age, and subsequent likely medical cost increases.

There relationship between income and PBS cost, which could lead to the conclusion that those in lower income brackets tend to be a higher drain on the system.

MATH13224 - Introduction to Statistics
Assignment 4 - Semester 1, 2017

An analysis of the cost of the Australian
Pharmaceutical Benefits Scheme (PBS)
2011 - 2015

Introduction

The Australian Pharmaceutical Benefits Scheme (PBS)

Problem Statement

Are there factors influencing the cost of the PBS that could be used to help reduce its cost?

ATC Level 1

Gender

Socio-economic factors

Data

Descriptive Statistics and Visualisation

Annual Cost by ATC Level 1

Annual Cost by ATC Level 1 (cont.)

Box Plots

Hypothesis Testing

Hypothesis testing at each ATC level

Economic Data

Economic Data (cont.)

Income per capita

Findings

References

PBS

ABS Data

MATH13224 - Introduction to Statistics Assignment 4 - Semester 1, 2017

An analysis of the cost of the Australian Pharmaceutical Benefits Scheme (PBS) 2011 - 2015

Introduction

The Australian Pharmaceutical Benefits Scheme (PBS)

Problem Statement

Are there factors influencing the cost of the PBS that could be used to help reduce its cost?

ATC Level 1

Gender

Socio-economic factors

Data

Descriptive Statistics and Visualisation

Annual Cost by ATC Level 1

Annual Cost by ATC Level 1 (cont.)

Box Plots

Hypothesis Testing

Hypothesis testing at each ATC level

Economic Data

Economic Data (cont.)

Income per capita

Findings

References

PBS

ABS Data

MATH13224 - Introduction to Statistics
Assignment 4 - Semester 1, 2017

An analysis of the cost of the Australian
Pharmaceutical Benefits Scheme (PBS)
2011 - 2015