Abstract

This analysis deals with exploring variables and their relationship to no show rates for medical appointments. The variables that we explored are gender, number of days an appointment is scheduled in advance, age, financial aid via scholarships, preconceived health conditions, and SMS Message reminders. The data in this analysis were collected from patients in Brazil and comprises of 110,527 medical appointments and can be found on kaggle.com.

Attribute Information:

Key

The data analysis will be broken down by the following sections:

Let’s begin by taking a glimpse of our data and see what we are working with! This is also a great time to check if our data is tidy and the classes of our variables are accurate.

glimpse(ma_data) # taking a glimpse of our data, the classes and columns are accurate
## Rows: 110,527
## Columns: 13
## $ PatientId      <dbl> 2.987250e+13, 5.589978e+14, 4.262962e+12, 8.679512e+...
## $ AppointmentID  <int> 5642903, 5642503, 5642549, 5642828, 5642494, 5626772...
## $ Gender         <chr> "F", "M", "F", "F", "F", "F", "F", "F", "F", "F", "F...
## $ ScheduledDay   <dttm> 2016-04-29 18:38:08, 2016-04-29 16:08:27, 2016-04-2...
## $ AppointmentDay <dttm> 2016-04-29, 2016-04-29, 2016-04-29, 2016-04-29, 201...
## $ Age            <int> 62, 56, 62, 8, 56, 76, 23, 39, 21, 19, 30, 29, 22, 2...
## $ Scholarship    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1...
## $ Hypertension   <fct> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Diabetes       <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Alcoholism     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Handcap        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SMS_received   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0...
## $ No.show        <fct> No, No, No, No, No, No, Yes, Yes, No, No, No, Yes, N...

Gender and No Show Rates

Let’s run a bar plot and visualize the number of males vs females that made an appointment and within that a proportion of who showed and did not show to their appointment:

gender_stack <- ggplot(ma_data, aes(x = Gender, fill = No.show)) +
  geom_bar(position = "stack") +
  theme_dark() +
  labs(x = "Gender", 
       y = "Count",
       title = "More Female Appointments than Male Appointments",
       subtitle = "F = Female & M = Male",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

gender_fill <- ggplot(ma_data, aes(x = Gender, fill = No.show)) +
geom_bar(position = "fill") +
  theme_dark() +
  labs(x = "Gender", 
       y = "Rate",
       title = "No Show Rates are the Same Despite Gender",
       subtitle = "F = Female & M = Male",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(gender_stack, gender_fill, cols = 2)

kable1 <- ma_data %>%
  group_by(Gender, No.show) %>%
  summarize(count = n())
## `summarise()` regrouping output by 'Gender' (override with `.groups` argument)
knitr::kable(kable1)
Gender No.show count
F No 57246
F Yes 14594
M No 30962
M Yes 7725

Number of Days Appointments are Scheduled in Advance vs Rate of No Shows

Next, we will explore if scheduling an appointment in advance affects the rate of no shows. To do this, we will take the absolute difference between the date of the appointment and the date the appointment was scheduled.

date_data <- ma_data %>%
  select(ScheduledDay, AppointmentDay, No.show) %>%
  mutate(diffdays = abs(difftime(AppointmentDay, ScheduledDay, units = "days"))) # took the difference between the two dates to see how long it has been since they set the appointment and the actual appointment day. Absolute value was taken since you can't have an appointment before scheduling one.

Now that we have the data set we want to explore, let’s visualize the difference with a histogram (since we have a numerical x variable):

ggplot(date_data, aes(x = diffdays, fill = No.show)) +
  geom_histogram() +
  theme_dark() +
  labs(x = "# of Days Appointment was Scheduled in Advance", 
       y = "Count",
       title = "Most Appointments Scheduled for the Same Day",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Let’s zoom in and see where most of the no shows are by setting our x limits and y limits!

ggplot(date_data, aes(x = diffdays, fill = No.show)) +
  geom_histogram() +
  theme_dark() +
  coord_cartesian(xlim = c(0, 180), ylim = c(0, 6000)) +
  labs(x = "# of Days Appointment was Scheduled in Advance", 
       y = "Count",
       title = "Same Day Appointments Have High No Show Counts",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

However, after looking at the proportions, we see something entirely different.

ggplot(date_data, aes(x = diffdays, fill = No.show)) +
  geom_histogram(position = "fill") +
  theme_dark() +
  labs(x = "# of Days Appointment was Scheduled in Advance", 
       y = "Rate",
       title = "No Show Rates are Lower for Same Day Appointments",
       subtitle = "Higher Likelihood of No Show after 4 Months of Advance Scheduling",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Age and No Show Rates

We are going to explore the correlation between age/age groups and the rate at which they show or do not show to their appointment.

We want to answer if there is a particular age group that is predicted to not show up to their scheduled appointment and answer if younger patients are less dependable than older patients.

Keep in Mind:

ggplot(ma_data, aes(x = Age, fill = No.show)) +
  geom_histogram() +
  theme_dark() +
  labs(x = "Age", 
       y = "Count",
       title = "Most Appointments are Scheduled for Children and Adults",
       subtitle = "Children Age Range: 0-5 years old & Adults Age Range: 30-60 years old",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Now let’s zoom into the plot and analyze the number of no shows:

We have increased the number of bins to represent 1 year per bin for easier interpretation

ggplot(ma_data, aes(x = Age, fill = No.show)) +
  geom_histogram(bins = 100) +
  theme_dark() +
  coord_cartesian(xlim = c(0, 100), ylim = c(0, 800)) +
  labs(x = "Age", 
       y = "Count",
       title = "No Shows are Most Apparent in Adults",
       subtitle = "Adults in the Age Range: 20-30 years old",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Now let’s look at the proportions:

ggplot(ma_data, aes(x = Age, fill = No.show)) +
  geom_histogram(bins = 100, position = "fill") +
  theme_dark() +
  labs(x = "Age", 
       y = "Rate",
       title = "No Shows are Most Apparent in Adults",
       subtitle = "Adults in the Age Range: 20-30 years old",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

We can explore this further by analyzing the spread of days a patient, above the age of 90, will schedule an appointment in advance.

age_date_data <- ma_data %>%
  select(ScheduledDay, AppointmentDay, Age, No.show) %>%
  mutate(diffdays = abs(difftime(AppointmentDay, ScheduledDay, units = "days")))

ggplot(age_date_data, aes(x = diffdays, y = Age, color = No.show)) +
  geom_point(position = "jitter") +
  theme_dark() +
  coord_cartesian(ylim = c(90, 120)) +
  labs(x = "# of Days Appointment was Scheduled in Advance", 
       y = "Age",
       title = "Patients Above the Age of 90",
       subtitle = "Mainly will Schedule Appointments 0-50 days in Advance",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplot(age_date_data, aes(x = diffdays, y = Age, color = No.show)) +
  geom_point(position = "jitter") +
  theme_dark() +
  coord_cartesian(xlim = c(0, 40), ylim = c(90, 120)) +
  labs(x = "# of Days Appointment was Scheduled in Advance", 
       y = "Age",
       title = "Patients Above the Age of 90",
       subtitle = "Mainly will Schedule Appointments 0-50 days in Advance",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Binary Variables within the Data Set and Their Relationship to No Shows

Within our data set, there are binary variables that we are going to explore. We are going to see if having financial aid via scholarships, preconceived health conditions, or SMS reminders sent will affect the rate of no shows.

We will first explore each binary variable and its effects on the total number of no shows and then the rate of no shows.

binary_plot <- function(df, vect){
  ggplot(data, aes(x = vect, fill = No.show)) +
    geom_bar() +
    theme_dark() +
    scale_x_discrete(limits = c("0", "1"),
                     labels = c("Doesn't\nHave", "Has"))
} # function to make plots for the desired vector within our data set

scholar_plot <- binary_plot(ma_data, ma_data$Scholarship) +
  labs(x = "Scholarship", 
       y = "Count",
       title = "Scholarship and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

hypertension_plot <- binary_plot(ma_data, ma_data$Hypertension)+
  labs(x = "Hypertension", 
       y = "Count",
       title = "Hypertension and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

diabetes_plot <- binary_plot(ma_data, ma_data$Diabetes)+
  labs(x = "Diabetes", 
       y = "Count",
       title = "Diabetes and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

alcoholism_plot <- binary_plot(ma_data, ma_data$Alcoholism)+
  labs(x = "Alcoholism", 
       y = "Count",
       title = "Alcoholism and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

handcap_plot <- binary_plot(ma_data, ma_data$Handcap)+
  labs(x = "Handicap Disability", 
       y = "Count",
       title = "Handicap and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

sms_plot <- binary_plot(ma_data, ma_data$SMS_received)+
  labs(x = "SMS Message", 
       y = "Count",
       title = "SMS Message and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(scholar_plot, hypertension_plot, diabetes_plot, alcoholism_plot, handcap_plot, sms_plot, cols = 2)

Let’s see if having or not having these variables will affect the rate of no shows:

binary_plot2 <- function(df, vect){
  ggplot(data, aes(x = vect, fill = No.show)) +
    geom_bar(position = "fill") +
    theme_dark() +
    scale_x_discrete(limits = c("0", "1"),
                     labels = c("Doesn't\nHave", "Has"))
}

scholar_plot2 <- binary_plot2(ma_data, ma_data$Scholarship)+
  labs(x = "Scholarship", 
       y = "Rate",
       title = "Scholarship Not Effective for No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

hypternsion_plot2 <- binary_plot2(ma_data, ma_data$Hypertension)+
  labs(x = "Hypertension", 
       y = "Rate",
       title = "Hypertensive Pts have Slightly Lower No Show",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

diabetes_plot2 <- binary_plot2(ma_data, ma_data$Diabetes) +
  labs(x = "Diabetes", 
       y = "Rate",
       title = "Diabetic Pts have Slightly Lower No Show",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

alcoholism_plot2 <- binary_plot2(ma_data, ma_data$Alcoholism) +
  labs(x = "Alcoholism", 
       y = "Rate",
       title = "Alcoholism Not a Good Indicator for No Show",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

handcap_plot2 <- binary_plot2(ma_data, ma_data$Handcap) +
  labs(x = "Handicap Disability", 
       y = "Rate",
       title = "Handicap Pts have Slightly Lower No Show",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

sms_plot2 <- binary_plot2(ma_data, ma_data$SMS_received)+
  labs(x = "SMS Message", 
       y = "Rate",
       title = "SMS Messages Do Not Improve No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(scholar_plot2, hypternsion_plot2, diabetes_plot2, alcoholism_plot2, handcap_plot2, sms_plot2, cols = 2)

Looking at the proportions of patients with or without the binary variables, we can see if there are trends of no shows.

Conclusion

Not only does failure to follow through with scheduled medical appointments have a negative impact on the functionality/workflow of a clinic but it can also negatively affect the patient. This data analysis aimed to explore different variables and their relationship to no show rates. We have seen that gender plays a role in the number of scheduled appointments, however, does not affect the no show rate. Both males and females have the same rate of no shows (~20%). We have also seen that the rate of no shows increases as appointments are scheduled for more than 4 months in advance. Next, we confirmed that certain age groups, such as between 20-30 years old, have high rates of no shows as well as patients that are 90+ years old. Lastly, we confirmed that the binary variables of Scholarship, Hypertension, Diabetes, Alcoholism, the Handicapped, and SMS message reminders do not show a significant improvement in no show rates. However, we also noted that the sample size was too small to accurately indicate effectiveness.

Opportunities for further analysis could be researching the following: