R for Health Data Analytics

Muhammad Fikru Rizal

Part I. Preparation

Installing package (could be skipped)

This step is used to install the necessary package for our ‘project’
This code if you want to install only one package.

install.packages("tidyverse")

Use this if you want to install a list of packages.

install.packages(c("tidyverse", "dplyr", "haven", "ggplot2", 
                "modelsummary", "gtsummary","gt", "huxtable", 
                "flextable","benchmarking", "frontier",
                "AER", "truncreg"))

Loading the installed packages

Before using any packages, you need to activate them first.

library(tidyverse)
library(haven)
library(modelsummary)
library(gtsummary)
library(flextable)
library(huxtable)
library(labelled)
library(AER)

Open the required dataset

The required dataset is already uploaded to the project directory in the format of a CSV (comma separated value)
R can also import another document format such as xls, dta (from Stata), or dat (from SPSS)

You can use this code to import the dataset and label the variable

# Importing dataset from csv file
hospital <- read_csv("hospital.csv") %>% as_tibble()

# Labeling the dataset
hospital <- hospital %>% 
  set_variable_labels(
    id_rs = "Hospital ID",
    gp_FTE = "Number of General Practitioners",
    spec_FTE = "Number of Medical Specialists",
    nurse_total = "Number of Nurses",
    other_prof = "Number of other employees",
    beds = "Number of beds",
    outpatients = "Number of outpatient services",
    bed_days = "Total Bed Days",
    admissions1 = "Number of admissions",
    bed_occ = "Bed Occupancy",
    throughput = "Throughput",
    alos = "Average Length of Stay",
    totalcost_op = "Total cost of outpatient services",
    total_cost_bd = "Total cost of beddays",
    totalcost_ip = "Total cost of admission/inpatient",
    publichospital = "Public Hospital",
    poorins = "Proportion with Social Health Insurance",
    classCD = "Class C/D",
    non_JavaBali = "Non Java-Bali") %>%
  set_value_labels(
    publichospital = c("Public"=1, "Private"=0),
    classCD = c("Class C/D"=1, "Class A/B"=0),
    non_JavaBali = c("Non Java-Bali"=1, "Java-Bali"=0))

Part 2. Descriptive analysis

Summary statistics table

In this section, we use the gtsummary package that can create a publication-ready table.
As an example, we create a table describing several variables namely: number of GPs, number of specialists, and ownership.

hospital %>% 
  select(gp_FTE,spec_FTE,publichospital) %>% # Select variable
  to_factor() %>%  # Set categorical variable to "factor"
  tbl_summary(digits = all_continuous() ~ 2, # Number of digits for cont. variables
  statistic = all_continuous() ~ "{mean} ({sd})") %>% # What type of summary statistics
  as_flex_table() # Presented as flex_table, easy to copy to Word processing software

Characteristic	N = 198
Number of General Practitioners	15.41 (9.95)
Number of Medical Specialists	19.86 (20.52)
Public Hospital
Public	120 (61%)
Private	78 (39%)
Mean (SD); n (%)

Exercise

Create a descriptive table for number of outpatient services (outpatients), number of admissions (admissions1), and location of hospital (non_JavaBali)

Descriptive Plot

We will use the ggplot package to make a versatile and nice data visualisation. Source: https://ggplot2.tidyverse.org/index.html
Some basic plot like histogram, scatter plot, and box-plot will be used as an example.

# Load package
library(ggplot2)

Histogram of number of GPs using base function

# Using base function
hist(hospital$gp_FTE)

Histogram of number of GPs using ggplot

# Using ggplot
hospital %>% 
  ggplot(aes(x=gp_FTE)) +
  geom_histogram(color="black", 
                 fill="darkseagreen")

Scatter plot of number of medical specialist vs number of outpatients using base function

# Using base function
plot(hospital$spec_FTE,hospital$outpatients)

Scatter plot of number of medical specialist vs number of outpatients using ggplot

# Using ggplot
hospital %>% 
  ggplot(aes(x=spec_FTE,y=outpatients)) +
  geom_point(color="darkseagreen") +
  labs(x="Number of specialists",
       y="Number of outpatient services")

Using ggplot + color code based on hospital class [EXTRA]

Box Plot of total cost of outpatient based on hospital class using base function

# Using base function
boxplot(hospital$classCD,hospital$totalcost_op)

Box Plot of total cost of outpatient based on hospital class using ggplot

# Using ggplot
hospital %>% to_factor() %>%
  ggplot(aes(x=classCD,y=totalcost_op)) +
  geom_boxplot(color="darkseagreen") +
  labs(x="Hospital Class",
       y="Total cost of outpatient")

Exercise

Histogram of number of specialists
Scatter plot of number of nurses (nurse_total) vs number of admissions (admissions1)
Boxplot of total cost of inpatients based on hospital ownership (publichospital)

Part 3. Concept of Efficiency

Part 4. Efficiency Analysis

Cost analysis

Simple calculation and creating new variables using mutate function from the dplyr package.
Source: https://dplyr.tidyverse.org/reference/mutate.html

# Calculating unit cost per service
hospital <- hospital %>% mutate(
  unitcost_ip = totalcost_ip/admissions1,
  unitcost_bd = total_cost_bd/bed_days)

After creating new variables, we may want to label them.

## Adding variable label to the newly created variables
hospital <- hospital %>% 
  set_variable_labels(
    unitcost_ip = "Inpatient Unit Cost",
    unitcost_bd = "Bed days Unit Cost")

Again, we will create a table using the tbl_summary function from the gtsummary package.

# Create summary table of the unit cost
hospital %>% select(unitcost_ip, unitcost_bd) %>%
  tbl_summary(digits = all_continuous() ~ 2,
            statistic = all_continuous() ~ "{mean} ({sd})") %>%
  as_flex_table()

This is the result.

Characteristic	N = 198
Inpatient Unit Cost	297.58 (194.62)
Bed days Unit Cost	82.35 (54.61)
Mean (SD)

Exercise

Please calculate the unit cost of outpatient service (totalcost_op divided by outpatients) and present the table

Pabon-Lasso Model/Diagram

It is basically a scatter plot comparing bed turnover ratio (throughput) and bed occupancy ratio (bed-occ).
Hospitals are then divided into 4 sectors based on the mean of bed turnover and bed occupancy ratios.
First, let’s calculate them

# Calculating the sample mean of bed turnover and bed occupancy ratio
mean_btr <-mean(hospital$throughput)
mean_bor <-mean(hospital$bed_occ)

This is a simple PL diagram using ggplot for all hospital

hospital %>% 
  ggplot(aes(x=bed_occ, y=throughput)) +
  geom_point() +
  geom_hline(yintercept=mean_btr) +
  geom_vline(xintercept=mean_bor)

This is a simple PL diagram using ggplot for public hospital

hospital %>% filter(publichospital==1) %>% # Filtering the dataset based on hospital type
  ggplot(aes(x=bed_occ, y=throughput)) +   
  geom_point() +
  geom_hline(yintercept=mean_btr) +
  geom_vline(xintercept=mean_bor) +
  labs(x="Bed occupancy ratio",
       y="Bed turnover")

This is a more advanced plot [EXTRA]

hospital %>% to_factor() %>% 
  ggplot(aes(x=bed_occ, y=throughput, colour=publichospital, size=beds)) + 
  geom_point(alpha=0.5) +
  geom_hline(yintercept=mean_btr) +
  geom_vline(xintercept=mean_bor) +
  labs(title="Model Pabon-Lasso",
       x="Bed Occupancy Rate", 
       y="Bed Turnover",
       colour="Ownership",
       size= "Number of beds")

Efficiency indicator

Based on the Pabon-Lasso model, we will define whether hospitals are considered efficient or not.
The efficient hospitals are those who have both higher than average of bed turnover and bed occupancy rate.

# Construct a binary indicator of hospitals efficiency based on the PL model
hospital <- hospital %>% 
mutate(eff_pl = if_else (
  (throughput > mean_btr) & (bed_occ > mean_bor),1,0))

This code is used to give label to the new variable eff_pl.

# Labelling variable using the labelled package
var_label(hospital$eff_pl) <- "Efficiency (Pabon-Lasso)"
val_labels(hospital$eff_pl) <- c("Efficient"=1,"Not efficient"=0)

Exercise

Create a simple Pabon-Lasso diagram for private hospitals.

Data Envelopment Analysis (DEA)

Load the benchmarking and frontier package.

library(Benchmarking)
library(frontier)

Select input and output variables

# Input variables
x <- with(hospital, cbind(gp_FTE, nurse_total, other_prof,  beds)) 
 
# Output variables
y <- with(hospital, cbind(outpatients, admissions1))

Calculate the input and output-oriented efficiency score

# Input-oriented efficiency
dea_input <- dea.boot(x,y,NREP=100, RTS = "vrs", 
            ORIENTATION = "in")

# Output-oriented efficiency
dea_output <- dea.boot(x,y,NREP=100, RTS = "vrs", 
            ORIENTATION = "out")

Combining the estimates to the main dataset

# Combining results from DEA 
dea_id <- data.frame(
  cbind(id_rs=hospital$id_rs, 
        EI=c(dea_input$eff), 
        EI_bc=c(dea_input$eff.bc), 
        EI_CI_low=c(dea_input$conf.int[,1]), 
        EI_CI_hi=c(dea_input$conf.int[,2]),
        EO=c(dea_output$eff), 
        EO_bc=c(dea_output$eff.bc), 
        EO_CI_low=c(dea_output$conf.int[,1]), 
        EO_CI_hi=c(dea_output$conf.int[,2])))
 

# Merge with hospital data
hospital_dea <- merge(
  hospital, dea_id, by="id_rs", all = TRUE)

Part 5. Bivariate analysis

Introduction

In efficiency studies, bivariate analysis can be used to explore factors associated with some efficiency measures or scores.
For example, we can test whether public hospitals and private hospitals have different unit cost for their services. In this case, cost variable might be converted to a log format.

Converting variables to log form

First, we may want to convert the cost variable to a log form for further analyses

# Converting cost variable to log format
hospital <- hospital %>% mutate(
  lnuc_ip = log(unitcost_ip),
  lnuc_bd = log(unitcost_bd))

Next, we can label the newly created variables

## Adding variable label to the newly created variables
hospital <- hospital %>% 
  set_variable_labels(
    lnuc_ip = "Log (Inpatient Unit Cost)",
    lnuc_bd = "Log (Bed days Unit Cost)")

Visualise the cost difference

A boxplot can be used to visualise the difference in unit cost

# A boxplot comparing inpatient unit cost between public and private hospital
hospital %>% to_factor() %>%
  ggplot(aes(x=publichospital, y=unitcost_ip)) +
  xlab("Hospital type") + ylab("Inpatient unit cost") +
  geom_boxplot()

We can also present the cost variable in a log form

# Cost in a log form
hospital %>% to_factor() %>%
  ggplot(aes(x=publichospital, y=lnuc_ip)) +
  geom_boxplot() +
  labs(x="Hospital type",
       y="Inpatient unit cost")

Statistical test

Is the difference in means statistically significant?
In this case, we can use a two-sample t-test and produce the table at the same time using tbl_summary function.

# Two-sample t-test (unequal variance, Welch two-sample t-test)
hospital %>% 
  to_factor() %>%
  select(unitcost_ip, lnuc_ip, publichospital) %>%
  tbl_summary(by=publichospital, digits = all_continuous() ~ 3,
            statistic = all_continuous() ~ "{mean} ({sd})") %>%
  add_p(test=all_continuous() ~ "t.test") %>%
  as_flex_table()

Two-sample t-test (unequal variance, Welch two-sample t-test)

Characteristic	Public, N = 120¹	Private, N = 78¹	p-value²
Inpatient Unit Cost	273.839 (186.983)	334.099 (201.611)	0.036
Log (Inpatient Unit Cost)	5.423 (0.626)	5.650 (0.566)	0.009
¹ Mean (SD)
² Welch Two Sample t-test

Exercise

Do a statistical test comparing the difference between inpatient unit cost of hospitals located in Java/Bali and in Non-Java/Bali

Part 6. Multivariable analysis

Linear regressions

When the outcome (y) variable is continues and we want to explore multiple factors (x) that are associated with the outcome variable, we might use an Ordinary Least Square (OLS) or linear regression.
For example, we want to analyse the relationship between some contextual factors and the number of GPs.

# Estimating a linear model
ols_gp <- hospital %>% 
  lm(gp_FTE ~ publichospital + classCD + non_JavaBali + poorins, data=.)

Logistic regression

If the outcome is binary (yes or no, 1 or 0, efficient or not), another approach that can be used is a logistic regression.
You can find some intuitive explanations here: https://argoshare.is.ed.ac.uk/healthyr_book/binary-logistic-regression.html
Here, we estimate factors associated with whether hospitals are efficient or not based on the PL model

# Estimating a logistic regression
logit_pl <- hospital %>% 
  glm(formula = eff_pl ~ publichospital + classCD + non_JavaBali + poorins, data=., family=binomial )

Presenting regression results

To present the results/regression coefficients, R has several alternatives.
For example, we can use the base function (summary), tbl_regression function from the gtsummary package, modelsummary function, or many other options.

Linear regression coefficient summary using base function

# Linear regression coefficient summary using base function
summary(ols_gp)


Call:
lm(formula = gp_FTE ~ publichospital + classCD + non_JavaBali + 
    poorins, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.489  -5.460  -1.102   2.868  46.368 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     17.7474     1.7312  10.252  < 2e-16 ***
publichospital   7.1797     1.3301   5.398 1.96e-07 ***
classCD         -4.2448     1.5165  -2.799 0.005647 ** 
non_JavaBali    -0.8601     1.3301  -0.647 0.518658    
poorins        -14.3360     4.2856  -3.345 0.000988 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.725 on 193 degrees of freedom
Multiple R-squared:  0.247, Adjusted R-squared:  0.2314 
F-statistic: 15.83 on 4 and 193 DF,  p-value: 3.206e-11

Logistic regression coefficient summary using base function

# Logistic regression coefficient summary using base function
summary(logit_pl)


Call:
glm(formula = eff_pl ~ publichospital + classCD + non_JavaBali + 
    poorins, family = binomial, data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5355  -0.9421  -0.7739   1.1929   1.8367  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)   
(Intercept)     -0.5624     0.4243  -1.326  0.18500   
publichospital   0.5235     0.3344   1.565  0.11748   
classCD         -0.3133     0.3715  -0.843  0.39899   
non_JavaBali    -0.8621     0.3277  -2.631  0.00852 **
poorins          2.1924     1.0382   2.112  0.03470 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 262.73  on 197  degrees of freedom
Residual deviance: 247.62  on 193  degrees of freedom
AIC: 257.62

Number of Fisher Scoring iterations: 4

Linear regression table using gtsummary

# Linear regression coefficient table using tbl_regression
ols_gp %>% tbl_regression() %>% as_flex_table()

Characteristic	Beta	95% CI¹	p-value
Public Hospital	7.2	4.6, 9.8	<0.001
Class C/D	-4.2	-7.2, -1.3	0.006
Non Java-Bali	-0.86	-3.5, 1.8	0.5
Proportion with Social Health Insurance	-14	-23, -5.9	<0.001
¹ CI = Confidence Interval

# Logistic regression Odds Ratios (ORs) table using tbl_regression
logit_pl %>% tbl_regression(exponentiate = TRUE)

Characteristic	OR	95% CI	p-value
Public Hospital	1.69	0.88, 3.29	0.12
Class C/D	0.73	0.35, 1.53	0.4
Non Java-Bali	0.42	0.22, 0.80	0.009
Proportion with Social Health Insurance	8.96	1.17, 70.3	0.035
OR = Odds Ratio, CI = Confidence Interval

Linear regression table using modelsummary

# Linear regression table using modelsummary
# Labelling the covariates or regressors
ols_gp_label <- ols_gp
names(ols_gp_label$coefficients) <- 
  c("Constant", "Class C/D", "Public", "Non Java-Bali", "Proportion with SHI")

# Produce table
modelsummary(list("Number of GP" = ols_gp_label), # List of the model(s) and their label
             fmt=3,  # Number of digits
             estimate = "{estimate}{stars}", # Format of coefficients
             gof_omit = ".*")    # Goodness of fit statistics are excluded

Linear regression table using modelsummary

	Number of GP
Constant	17.747***
	(1.731)
Class C/D	7.180***
	(1.330)
Public	-4.245**
	(1.517)
Non Java-Bali	-0.860
	(1.330)
Proportion with SHI	-14.336***
	(4.286)

Logistic regression table using modelsummary (code)

# Labelling the covariates or regressors
logit_pl_label <- logit_pl
names(logit_pl_label$coefficients) <- 
  c("Constant", "Class C/D", "Public", "Non Java-Bali", "Proportion with SHI")

# Produce table
modelsummary(list("Efficient Hospital (PL)" = logit_pl_label), # List of the model(s) and their label
             fmt=NULL,  # Number of digits
             estimate = "{round(exp(estimate), 3)}{stars}", # Format of coefficients
             statistic = "({round(exp(estimate) * std.error, 3)})",
             gof_omit = ".*")    # Goodness of fit statistics are excluded

Logistic regression table using modelsummary (result)

	Efficient Hospital (PL)
Constant	0.57
	(0.242)
Class C/D	1.688
	(0.564)
Public	0.731
	(0.272)
Non Java-Bali	0.422**
	(0.138)
Proportion with SHI	8.957*
	(9.299)

Exercise

Run a linear regression with number of specialists as the outcome variable (y) and hospital ownership, location, class, and district-level proportion of population with social health insurance as the explanatory variables (x).
Present the result/coefficient using either base function or other functions.

R for Health Data Analytics

Part I. Preparation

Installing package (could be skipped)

Loading the installed packages

Open the required dataset

Part 2. Descriptive analysis

Summary statistics table

Exercise

Descriptive Plot

Exercise

Part 3. Concept of Efficiency

Part 4. Efficiency Analysis

Cost analysis

Exercise

Pabon-Lasso Model/Diagram

Efficiency indicator

Exercise

Data Envelopment Analysis (DEA)

Part 5. Bivariate analysis

Introduction

Converting variables to log form

Visualise the cost difference

Statistical test

Exercise

Part 6. Multivariable analysis

Linear regressions

Logistic regression

Presenting regression results

Exercise

Thank you!