R for Health Data Analytics

Muhammad Fikru Rizal

Part I. Preparation

Installing package (could be skipped)

  • This step is used to install the necessary package for our ‘project’

  • This code if you want to install only one package.

install.packages("tidyverse")
  • Use this if you want to install a list of packages.
install.packages(c("tidyverse", "dplyr", "haven", "ggplot2", 
                "modelsummary", "gtsummary","gt", "huxtable", 
                "flextable","benchmarking", "frontier",
                "AER", "truncreg"))

Loading the installed packages

  • Before using any packages, you need to activate them first.
library(tidyverse)
library(haven)
library(modelsummary)
library(gtsummary)
library(flextable)
library(huxtable)
library(labelled)
library(AER)

Open the required dataset

  • The required dataset is already uploaded to the project directory in the format of a CSV (comma separated value)
  • R can also import another document format such as xls, dta (from Stata), or dat (from SPSS)
  • You can use this code to import the dataset and label the variable
# Importing dataset from csv file
hospital <- read_csv("hospital.csv") %>% as_tibble()

# Labeling the dataset
hospital <- hospital %>% 
  set_variable_labels(
    id_rs = "Hospital ID",
    gp_FTE = "Number of General Practitioners",
    spec_FTE = "Number of Medical Specialists",
    nurse_total = "Number of Nurses",
    other_prof = "Number of other employees",
    beds = "Number of beds",
    outpatients = "Number of outpatient services",
    bed_days = "Total Bed Days",
    admissions1 = "Number of admissions",
    bed_occ = "Bed Occupancy",
    throughput = "Throughput",
    alos = "Average Length of Stay",
    totalcost_op = "Total cost of outpatient services",
    total_cost_bd = "Total cost of beddays",
    totalcost_ip = "Total cost of admission/inpatient",
    publichospital = "Public Hospital",
    poorins = "Proportion with Social Health Insurance",
    classCD = "Class C/D",
    non_JavaBali = "Non Java-Bali") %>%
  set_value_labels(
    publichospital = c("Public"=1, "Private"=0),
    classCD = c("Class C/D"=1, "Class A/B"=0),
    non_JavaBali = c("Non Java-Bali"=1, "Java-Bali"=0)) 

Part 2. Descriptive analysis

Summary statistics table

  • In this section, we use the gtsummary package that can create a publication-ready table.
  • As an example, we create a table describing several variables namely: number of GPs, number of specialists, and ownership.
hospital %>% 
  select(gp_FTE,spec_FTE,publichospital) %>% # Select variable
  to_factor() %>%  # Set categorical variable to "factor"
  tbl_summary(digits = all_continuous() ~ 2, # Number of digits for cont. variables
  statistic = all_continuous() ~ "{mean} ({sd})") %>% # What type of summary statistics
  as_flex_table() # Presented as flex_table, easy to copy to Word processing software

Characteristic

N = 198

Number of General Practitioners15.41 (9.95)
Number of Medical Specialists19.86 (20.52)
Public Hospital
Public120 (61%)
Private78 (39%)
Mean (SD); n (%)

Exercise

  • Create a descriptive table for number of outpatient services (outpatients), number of admissions (admissions1), and location of hospital (non_JavaBali)

Descriptive Plot

  • We will use the ggplot package to make a versatile and nice data visualisation. Source: https://ggplot2.tidyverse.org/index.html

  • Some basic plot like histogram, scatter plot, and box-plot will be used as an example.

# Load package
library(ggplot2)
  • Histogram of number of GPs using base function
# Using base function
hist(hospital$gp_FTE)
  • Histogram of number of GPs using ggplot
# Using ggplot
hospital %>% 
  ggplot(aes(x=gp_FTE)) +
  geom_histogram(color="black", 
                 fill="darkseagreen") 
  • Scatter plot of number of medical specialist vs number of outpatients using base function
# Using base function
plot(hospital$spec_FTE,hospital$outpatients)
  • Scatter plot of number of medical specialist vs number of outpatients using ggplot
# Using ggplot
hospital %>% 
  ggplot(aes(x=spec_FTE,y=outpatients)) +
  geom_point(color="darkseagreen") +
  labs(x="Number of specialists",
       y="Number of outpatient services")
  • Using ggplot + color code based on hospital class [EXTRA]
  • Box Plot of total cost of outpatient based on hospital class using base function
# Using base function
boxplot(hospital$classCD,hospital$totalcost_op)
  • Box Plot of total cost of outpatient based on hospital class using ggplot
# Using ggplot
hospital %>% to_factor() %>%
  ggplot(aes(x=classCD,y=totalcost_op)) +
  geom_boxplot(color="darkseagreen") +
  labs(x="Hospital Class",
       y="Total cost of outpatient")

Exercise

  • Histogram of number of specialists
  • Scatter plot of number of nurses (nurse_total) vs number of admissions (admissions1)
  • Boxplot of total cost of inpatients based on hospital ownership (publichospital)

Part 3. Concept of Efficiency

Part 4. Efficiency Analysis

Cost analysis

# Calculating unit cost per service
hospital <- hospital %>% mutate(
  unitcost_ip = totalcost_ip/admissions1,
  unitcost_bd = total_cost_bd/bed_days)
  • After creating new variables, we may want to label them.
## Adding variable label to the newly created variables
hospital <- hospital %>% 
  set_variable_labels(
    unitcost_ip = "Inpatient Unit Cost",
    unitcost_bd = "Bed days Unit Cost")
  • Again, we will create a table using the tbl_summary function from the gtsummary package.
# Create summary table of the unit cost
hospital %>% select(unitcost_ip, unitcost_bd) %>%
  tbl_summary(digits = all_continuous() ~ 2,
            statistic = all_continuous() ~ "{mean} ({sd})") %>%
  as_flex_table()
  • This is the result.

Characteristic

N = 198

Inpatient Unit Cost297.58 (194.62)
Bed days Unit Cost82.35 (54.61)
Mean (SD)

Exercise

  • Please calculate the unit cost of outpatient service (totalcost_op divided by outpatients) and present the table

Pabon-Lasso Model/Diagram

  • It is basically a scatter plot comparing bed turnover ratio (throughput) and bed occupancy ratio (bed-occ).
  • Hospitals are then divided into 4 sectors based on the mean of bed turnover and bed occupancy ratios.
  • First, let’s calculate them
# Calculating the sample mean of bed turnover and bed occupancy ratio
mean_btr <-mean(hospital$throughput)
mean_bor <-mean(hospital$bed_occ)
  • This is a simple PL diagram using ggplot for all hospital
hospital %>% 
  ggplot(aes(x=bed_occ, y=throughput)) +
  geom_point() +
  geom_hline(yintercept=mean_btr) +
  geom_vline(xintercept=mean_bor)
  • This is a simple PL diagram using ggplot for public hospital
hospital %>% filter(publichospital==1) %>% # Filtering the dataset based on hospital type
  ggplot(aes(x=bed_occ, y=throughput)) +   
  geom_point() +
  geom_hline(yintercept=mean_btr) +
  geom_vline(xintercept=mean_bor) +
  labs(x="Bed occupancy ratio",
       y="Bed turnover")
  • This is a more advanced plot [EXTRA]
hospital %>% to_factor() %>% 
  ggplot(aes(x=bed_occ, y=throughput, colour=publichospital, size=beds)) + 
  geom_point(alpha=0.5) +
  geom_hline(yintercept=mean_btr) +
  geom_vline(xintercept=mean_bor) +
  labs(title="Model Pabon-Lasso",
       x="Bed Occupancy Rate", 
       y="Bed Turnover",
       colour="Ownership",
       size= "Number of beds") 

Efficiency indicator

  • Based on the Pabon-Lasso model, we will define whether hospitals are considered efficient or not.
  • The efficient hospitals are those who have both higher than average of bed turnover and bed occupancy rate.
# Construct a binary indicator of hospitals efficiency based on the PL model
hospital <- hospital %>% 
mutate(eff_pl = if_else (
  (throughput > mean_btr) & (bed_occ > mean_bor),1,0))
  • This code is used to give label to the new variable eff_pl.
# Labelling variable using the labelled package
var_label(hospital$eff_pl) <- "Efficiency (Pabon-Lasso)"
val_labels(hospital$eff_pl) <- c("Efficient"=1,"Not efficient"=0)

Exercise

  • Create a simple Pabon-Lasso diagram for private hospitals.

Data Envelopment Analysis (DEA)

  • Load the benchmarking and frontier package.
library(Benchmarking)
library(frontier)
  • Select input and output variables
# Input variables
x <- with(hospital, cbind(gp_FTE, nurse_total, other_prof,  beds)) 
 
# Output variables
y <- with(hospital, cbind(outpatients, admissions1))
  • Calculate the input and output-oriented efficiency score
# Input-oriented efficiency
dea_input <- dea.boot(x,y,NREP=100, RTS = "vrs", 
            ORIENTATION = "in")

# Output-oriented efficiency
dea_output <- dea.boot(x,y,NREP=100, RTS = "vrs", 
            ORIENTATION = "out")
  • Combining the estimates to the main dataset
# Combining results from DEA 
dea_id <- data.frame(
  cbind(id_rs=hospital$id_rs, 
        EI=c(dea_input$eff), 
        EI_bc=c(dea_input$eff.bc), 
        EI_CI_low=c(dea_input$conf.int[,1]), 
        EI_CI_hi=c(dea_input$conf.int[,2]),
        EO=c(dea_output$eff), 
        EO_bc=c(dea_output$eff.bc), 
        EO_CI_low=c(dea_output$conf.int[,1]), 
        EO_CI_hi=c(dea_output$conf.int[,2])))
 

# Merge with hospital data
hospital_dea <- merge(
  hospital, dea_id, by="id_rs", all = TRUE)

Part 5. Bivariate analysis

Introduction

  • In efficiency studies, bivariate analysis can be used to explore factors associated with some efficiency measures or scores.
  • For example, we can test whether public hospitals and private hospitals have different unit cost for their services. In this case, cost variable might be converted to a log format.

Converting variables to log form

  • First, we may want to convert the cost variable to a log form for further analyses
# Converting cost variable to log format
hospital <- hospital %>% mutate(
  lnuc_ip = log(unitcost_ip),
  lnuc_bd = log(unitcost_bd))
  • Next, we can label the newly created variables
## Adding variable label to the newly created variables
hospital <- hospital %>% 
  set_variable_labels(
    lnuc_ip = "Log (Inpatient Unit Cost)",
    lnuc_bd = "Log (Bed days Unit Cost)")

Visualise the cost difference

  • A boxplot can be used to visualise the difference in unit cost
# A boxplot comparing inpatient unit cost between public and private hospital
hospital %>% to_factor() %>%
  ggplot(aes(x=publichospital, y=unitcost_ip)) +
  xlab("Hospital type") + ylab("Inpatient unit cost") +
  geom_boxplot()
  • We can also present the cost variable in a log form
# Cost in a log form
hospital %>% to_factor() %>%
  ggplot(aes(x=publichospital, y=lnuc_ip)) +
  geom_boxplot() +
  labs(x="Hospital type",
       y="Inpatient unit cost")

Statistical test

  • Is the difference in means statistically significant?
  • In this case, we can use a two-sample t-test and produce the table at the same time using tbl_summary function.
# Two-sample t-test (unequal variance, Welch two-sample t-test)
hospital %>% 
  to_factor() %>%
  select(unitcost_ip, lnuc_ip, publichospital) %>%
  tbl_summary(by=publichospital, digits = all_continuous() ~ 3,
            statistic = all_continuous() ~ "{mean} ({sd})") %>%
  add_p(test=all_continuous() ~ "t.test") %>%
  as_flex_table()
  • Two-sample t-test (unequal variance, Welch two-sample t-test)
Characteristic Public, N = 1201 Private, N = 781 p-value2
Inpatient Unit Cost 273.839 (186.983) 334.099 (201.611) 0.036
Log (Inpatient Unit Cost) 5.423 (0.626) 5.650 (0.566) 0.009
1 Mean (SD)
2 Welch Two Sample t-test

Exercise

  • Do a statistical test comparing the difference between inpatient unit cost of hospitals located in Java/Bali and in Non-Java/Bali

Part 6. Multivariable analysis

Linear regressions

  • When the outcome (y) variable is continues and we want to explore multiple factors (x) that are associated with the outcome variable, we might use an Ordinary Least Square (OLS) or linear regression.
  • For example, we want to analyse the relationship between some contextual factors and the number of GPs.
# Estimating a linear model
ols_gp <- hospital %>% 
  lm(gp_FTE ~ publichospital + classCD + non_JavaBali + poorins, data=.)

Logistic regression

# Estimating a logistic regression
logit_pl <- hospital %>% 
  glm(formula = eff_pl ~ publichospital + classCD + non_JavaBali + poorins, data=., family=binomial )

Presenting regression results

  • To present the results/regression coefficients, R has several alternatives.
  • For example, we can use the base function (summary), tbl_regression function from the gtsummary package, modelsummary function, or many other options.
  • Linear regression coefficient summary using base function
# Linear regression coefficient summary using base function
summary(ols_gp)

Call:
lm(formula = gp_FTE ~ publichospital + classCD + non_JavaBali + 
    poorins, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.489  -5.460  -1.102   2.868  46.368 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     17.7474     1.7312  10.252  < 2e-16 ***
publichospital   7.1797     1.3301   5.398 1.96e-07 ***
classCD         -4.2448     1.5165  -2.799 0.005647 ** 
non_JavaBali    -0.8601     1.3301  -0.647 0.518658    
poorins        -14.3360     4.2856  -3.345 0.000988 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.725 on 193 degrees of freedom
Multiple R-squared:  0.247, Adjusted R-squared:  0.2314 
F-statistic: 15.83 on 4 and 193 DF,  p-value: 3.206e-11
  • Logistic regression coefficient summary using base function
# Logistic regression coefficient summary using base function
summary(logit_pl)

Call:
glm(formula = eff_pl ~ publichospital + classCD + non_JavaBali + 
    poorins, family = binomial, data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5355  -0.9421  -0.7739   1.1929   1.8367  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)   
(Intercept)     -0.5624     0.4243  -1.326  0.18500   
publichospital   0.5235     0.3344   1.565  0.11748   
classCD         -0.3133     0.3715  -0.843  0.39899   
non_JavaBali    -0.8621     0.3277  -2.631  0.00852 **
poorins          2.1924     1.0382   2.112  0.03470 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 262.73  on 197  degrees of freedom
Residual deviance: 247.62  on 193  degrees of freedom
AIC: 257.62

Number of Fisher Scoring iterations: 4
  • Linear regression table using gtsummary
# Linear regression coefficient table using tbl_regression
ols_gp %>% tbl_regression() %>% as_flex_table()
Characteristic Beta 95% CI1 p-value
Public Hospital 7.2 4.6, 9.8 <0.001
Class C/D -4.2 -7.2, -1.3 0.006
Non Java-Bali -0.86 -3.5, 1.8 0.5
Proportion with Social Health Insurance -14 -23, -5.9 <0.001
1 CI = Confidence Interval
# Logistic regression Odds Ratios (ORs) table using tbl_regression
logit_pl %>% tbl_regression(exponentiate = TRUE)

Characteristic

OR

95% CI

p-value

Public Hospital1.690.88, 3.290.12
Class C/D0.730.35, 1.530.4
Non Java-Bali0.420.22, 0.800.009
Proportion with Social Health Insurance8.961.17, 70.30.035
OR = Odds Ratio, CI = Confidence Interval
  • Linear regression table using modelsummary
# Linear regression table using modelsummary
# Labelling the covariates or regressors
ols_gp_label <- ols_gp
names(ols_gp_label$coefficients) <- 
  c("Constant", "Class C/D", "Public", "Non Java-Bali", "Proportion with SHI")

# Produce table
modelsummary(list("Number of GP" = ols_gp_label), # List of the model(s) and their label
             fmt=3,  # Number of digits
             estimate = "{estimate}{stars}", # Format of coefficients
             gof_omit = ".*")    # Goodness of fit statistics are excluded
  • Linear regression table using modelsummary
Number of GP
Constant 17.747***
(1.731)
Class C/D 7.180***
(1.330)
Public -4.245**
(1.517)
Non Java-Bali -0.860
(1.330)
Proportion with SHI -14.336***
(4.286)
  • Logistic regression table using modelsummary (code)
# Labelling the covariates or regressors
logit_pl_label <- logit_pl
names(logit_pl_label$coefficients) <- 
  c("Constant", "Class C/D", "Public", "Non Java-Bali", "Proportion with SHI")

# Produce table
modelsummary(list("Efficient Hospital (PL)" = logit_pl_label), # List of the model(s) and their label
             fmt=NULL,  # Number of digits
             estimate = "{round(exp(estimate), 3)}{stars}", # Format of coefficients
             statistic = "({round(exp(estimate) * std.error, 3)})",
             gof_omit = ".*")    # Goodness of fit statistics are excluded
  • Logistic regression table using modelsummary (result)
Efficient Hospital (PL)
Constant 0.57
(0.242)
Class C/D 1.688
(0.564)
Public 0.731
(0.272)
Non Java-Bali 0.422**
(0.138)
Proportion with SHI 8.957*
(9.299)

Exercise

  • Run a linear regression with number of specialists as the outcome variable (y) and hospital ownership, location, class, and district-level proportion of population with social health insurance as the explanatory variables (x).

  • Present the result/coefficient using either base function or other functions.

Thank you!