Medical Appointment No Show Report

Abstract

This analysis deals with exploring variables and their relationship to no show rates for medical appointments. The variables that we explored are gender, number of days an appointment is scheduled in advance, age, financial aid via scholarships, preconceived health conditions, and SMS Message reminders. The data in this analysis were collected from patients in Brazil and comprises of 110,527 medical appointments and can be found on kaggle.com.

Attribute Information:

PatientId = Identification number of a patient
AppointmentID = Identification number of each appointment
Gender = Male or Female
ScheduledDay = The date-time (YYYY-MM-DD) the patient scheduled their appointment
AppointmentDay = Date of appointment (YYYY-MM-DD)
Age = Age of patient in years
Scholarship = Financial Aid for a family depending on if the child is enrolled in an academic program.
Hypertension = Signifies a patient as hypertensive or not hypertensive
Diabetes = Signifies a patient as diabetic or not diabetic
Alcoholism = Signifies a patient as an alcoholic or not an alcoholic
Handcap = Signifies a patient as handicap or not handicap
SMS_received = Patient opts to receive or not receive an SMS Message reminder for an appointment
No-show = Patient does or does not show up to the scheduled appointment

Key

0 (zero) indicates No/FALSE
1 (one) indicates Yes/TRUE

The data analysis will be broken down by the following sections:

Gender and No Show Rates
Number of Days Appointments are Scheduled in Advance vs Rate of No Shows
Age and No Show Rates
Binary Variables within the Data Set and Their Relationship to No Shows
- Binary Variables such as Scholarship, Hypertension, Diabetes, Alcoholism, Handicap, and SMS message received

Let’s begin by taking a glimpse of our data and see what we are working with! This is also a great time to check if our data is tidy and the classes of our variables are accurate.

glimpse(ma_data) # taking a glimpse of our data, the classes and columns are accurate

## Rows: 110,527
## Columns: 13
## $ PatientId      <dbl> 2.987250e+13, 5.589978e+14, 4.262962e+12, 8.679512e+...
## $ AppointmentID  <int> 5642903, 5642503, 5642549, 5642828, 5642494, 5626772...
## $ Gender         <chr> "F", "M", "F", "F", "F", "F", "F", "F", "F", "F", "F...
## $ ScheduledDay   <dttm> 2016-04-29 18:38:08, 2016-04-29 16:08:27, 2016-04-2...
## $ AppointmentDay <dttm> 2016-04-29, 2016-04-29, 2016-04-29, 2016-04-29, 201...
## $ Age            <int> 62, 56, 62, 8, 56, 76, 23, 39, 21, 19, 30, 29, 22, 2...
## $ Scholarship    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1...
## $ Hypertension   <fct> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Diabetes       <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Alcoholism     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Handcap        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SMS_received   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0...
## $ No.show        <fct> No, No, No, No, No, No, Yes, Yes, No, No, No, Yes, N...

Our data looks good! Let’s dive into our first section of the analysis, Age’s effect on no show rates.

Gender and No Show Rates

Let’s run a bar plot and visualize the number of males vs females that made an appointment and within that a proportion of who showed and did not show to their appointment:

gender_stack <- ggplot(ma_data, aes(x = Gender, fill = No.show)) +
  geom_bar(position = "stack") +
  theme_dark() +
  labs(x = "Gender", 
       y = "Count",
       title = "More Female Appointments than Male Appointments",
       subtitle = "F = Female & M = Male",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

gender_fill <- ggplot(ma_data, aes(x = Gender, fill = No.show)) +
geom_bar(position = "fill") +
  theme_dark() +
  labs(x = "Gender", 
       y = "Rate",
       title = "No Show Rates are the Same Despite Gender",
       subtitle = "F = Female & M = Male",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(gender_stack, gender_fill, cols = 2)

kable1 <- ma_data %>%
  group_by(Gender, No.show) %>%
  summarize(count = n())

## `summarise()` regrouping output by 'Gender' (override with `.groups` argument)

knitr::kable(kable1)

Gender	No.show	count
F	No	57246
F	Yes	14594
M	No	30962
M	Yes	7725

We can see that more appointments are made by females than males.
- Within this data set there are 71,840 female appointments and 38,687 male appointments
However, looking at the rate of who does and does not show up, both females and males are even at 20%.
Therefore, gender is not a good predictor of no show rates as there is no significant difference between the two in this data set.
- This means that although females schedule more appointments (nearly double), the rate at which they no show is the same as male patients.

Number of Days Appointments are Scheduled in Advance vs Rate of No Shows

Next, we will explore if scheduling an appointment in advance affects the rate of no shows. To do this, we will take the absolute difference between the date of the appointment and the date the appointment was scheduled.

date_data <- ma_data %>%
  select(ScheduledDay, AppointmentDay, No.show) %>%
  mutate(diffdays = abs(difftime(AppointmentDay, ScheduledDay, units = "days"))) # took the difference between the two dates to see how long it has been since they set the appointment and the actual appointment day. Absolute value was taken since you can't have an appointment before scheduling one.

Now that we have the data set we want to explore, let’s visualize the difference with a histogram (since we have a numerical x variable):

ggplot(date_data, aes(x = diffdays, fill = No.show)) +
  geom_histogram() +
  theme_dark() +
  labs(x = "# of Days Appointment was Scheduled in Advance", 
       y = "Count",
       title = "Most Appointments Scheduled for the Same Day",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

This graph shows us that the majority of the appointments are set on the same day (roughly zero days in between)

Let’s zoom in and see where most of the no shows are by setting our x limits and y limits!

ggplot(date_data, aes(x = diffdays, fill = No.show)) +
  geom_histogram() +
  theme_dark() +
  coord_cartesian(xlim = c(0, 180), ylim = c(0, 6000)) +
  labs(x = "# of Days Appointment was Scheduled in Advance", 
       y = "Count",
       title = "Same Day Appointments Have High No Show Counts",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

At first glance, it seems that in this zoomed plot, the majority of our no shows happen when a patient schedules on the same day or for the next 1-2 days and as patients schedule for appointments further in advance, the no shows start to decline.

However, after looking at the proportions, we see something entirely different.

ggplot(date_data, aes(x = diffdays, fill = No.show)) +
  geom_histogram(position = "fill") +
  theme_dark() +
  labs(x = "# of Days Appointment was Scheduled in Advance", 
       y = "Rate",
       title = "No Show Rates are Lower for Same Day Appointments",
       subtitle = "Higher Likelihood of No Show after 4 Months of Advance Scheduling",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

We see that the no show rates are relatively low if you schedule appointments and have your appointments within 120 days (4 months).
- Within 4 months the no show rates are less than 40%.
In contrast, those that have appointments set for far in the future (greater than 120 days), there is a huge spike in the rate of no shows.
- Which is greater than 50%
Therefore, when analyzing appointments and the actual appointment date in regards to no shows.
- There is a higher likelihood of individuals to no show if they set an appointment greater than 4 months in advance

Age and No Show Rates

We are going to explore the correlation between age/age groups and the rate at which they show or do not show to their appointment.

We want to answer if there is a particular age group that is predicted to not show up to their scheduled appointment and answer if younger patients are less dependable than older patients.

Keep in Mind:

The Age variable is measured in years, those that are less than 1 year old will fall within the zero bin.

ggplot(ma_data, aes(x = Age, fill = No.show)) +
  geom_histogram() +
  theme_dark() +
  labs(x = "Age", 
       y = "Count",
       title = "Most Appointments are Scheduled for Children and Adults",
       subtitle = "Children Age Range: 0-5 years old & Adults Age Range: 30-60 years old",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Within this section of the analysis, most patients that have an appointment is less than 1 year old (most likely, accompanied by their parent/guardian) or are patients between the age of 30-60 years old (shown by the red bars).

Now let’s zoom into the plot and analyze the number of no shows:

We have increased the number of bins to represent 1 year per bin for easier interpretation

ggplot(ma_data, aes(x = Age, fill = No.show)) +
  geom_histogram(bins = 100) +
  theme_dark() +
  coord_cartesian(xlim = c(0, 100), ylim = c(0, 800)) +
  labs(x = "Age", 
       y = "Count",
       title = "No Shows are Most Apparent in Adults",
       subtitle = "Adults in the Age Range: 20-30 years old",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

We see that no shows begin to dip between the ages of zero years to 12 years old (most likely because parents are taking their children to their doctor’s appointment)
However, the no shows begin to spike and maintain in the range of 20-30 years of age. (of age to be responsible for your own doctor’s appointments)
Beyond 30 years of age, there is a negative trend of no shows as age progresses.

Now let’s look at the proportions:

ggplot(ma_data, aes(x = Age, fill = No.show)) +
  geom_histogram(bins = 100, position = "fill") +
  theme_dark() +
  labs(x = "Age", 
       y = "Rate",
       title = "No Shows are Most Apparent in Adults",
       subtitle = "Adults in the Age Range: 20-30 years old",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

We see a similar trend as the plot prior, but in regards to proportions.
- There is however a spike of no shows after the age of 90. There could be reasons for this as well (maybe appointments were set in advance for these older patients but passed before they could attend their appointment)

We can explore this further by analyzing the spread of days a patient, above the age of 90, will schedule an appointment in advance.

age_date_data <- ma_data %>%
  select(ScheduledDay, AppointmentDay, Age, No.show) %>%
  mutate(diffdays = abs(difftime(AppointmentDay, ScheduledDay, units = "days")))

ggplot(age_date_data, aes(x = diffdays, y = Age, color = No.show)) +
  geom_point(position = "jitter") +
  theme_dark() +
  coord_cartesian(ylim = c(90, 120)) +
  labs(x = "# of Days Appointment was Scheduled in Advance", 
       y = "Age",
       title = "Patients Above the Age of 90",
       subtitle = "Mainly will Schedule Appointments 0-50 days in Advance",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplot(age_date_data, aes(x = diffdays, y = Age, color = No.show)) +
  geom_point(position = "jitter") +
  theme_dark() +
  coord_cartesian(xlim = c(0, 40), ylim = c(90, 120)) +
  labs(x = "# of Days Appointment was Scheduled in Advance", 
       y = "Age",
       title = "Patients Above the Age of 90",
       subtitle = "Mainly will Schedule Appointments 0-50 days in Advance",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

As seen above, patients above the age of 90 will normally schedule appointments 0-50 days in advance.
The amount of no shows are spread quite evenly within the 50 days.

Binary Variables within the Data Set and Their Relationship to No Shows

Within our data set, there are binary variables that we are going to explore. We are going to see if having financial aid via scholarships, preconceived health conditions, or SMS reminders sent will affect the rate of no shows.

We will first explore each binary variable and its effects on the total number of no shows and then the rate of no shows.

binary_plot <- function(df, vect){
  ggplot(data, aes(x = vect, fill = No.show)) +
    geom_bar() +
    theme_dark() +
    scale_x_discrete(limits = c("0", "1"),
                     labels = c("Doesn't\nHave", "Has"))
} # function to make plots for the desired vector within our data set

scholar_plot <- binary_plot(ma_data, ma_data$Scholarship) +
  labs(x = "Scholarship", 
       y = "Count",
       title = "Scholarship and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

hypertension_plot <- binary_plot(ma_data, ma_data$Hypertension)+
  labs(x = "Hypertension", 
       y = "Count",
       title = "Hypertension and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

diabetes_plot <- binary_plot(ma_data, ma_data$Diabetes)+
  labs(x = "Diabetes", 
       y = "Count",
       title = "Diabetes and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

alcoholism_plot <- binary_plot(ma_data, ma_data$Alcoholism)+
  labs(x = "Alcoholism", 
       y = "Count",
       title = "Alcoholism and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

handcap_plot <- binary_plot(ma_data, ma_data$Handcap)+
  labs(x = "Handicap Disability", 
       y = "Count",
       title = "Handicap and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

sms_plot <- binary_plot(ma_data, ma_data$SMS_received)+
  labs(x = "SMS Message", 
       y = "Count",
       title = "SMS Message and No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(scholar_plot, hypertension_plot, diabetes_plot, alcoholism_plot, handcap_plot, sms_plot, cols = 2)

Based on the above plots, there are more patients who do not have scholarships, preconceived health conditions, or SMS Messages.
- Therefore, our sample size to see the effectiveness of these binary variables is relatively small.

Let’s see if having or not having these variables will affect the rate of no shows:

binary_plot2 <- function(df, vect){
  ggplot(data, aes(x = vect, fill = No.show)) +
    geom_bar(position = "fill") +
    theme_dark() +
    scale_x_discrete(limits = c("0", "1"),
                     labels = c("Doesn't\nHave", "Has"))
}

scholar_plot2 <- binary_plot2(ma_data, ma_data$Scholarship)+
  labs(x = "Scholarship", 
       y = "Rate",
       title = "Scholarship Not Effective for No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

hypternsion_plot2 <- binary_plot2(ma_data, ma_data$Hypertension)+
  labs(x = "Hypertension", 
       y = "Rate",
       title = "Hypertensive Pts have Slightly Lower No Show",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

diabetes_plot2 <- binary_plot2(ma_data, ma_data$Diabetes) +
  labs(x = "Diabetes", 
       y = "Rate",
       title = "Diabetic Pts have Slightly Lower No Show",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

alcoholism_plot2 <- binary_plot2(ma_data, ma_data$Alcoholism) +
  labs(x = "Alcoholism", 
       y = "Rate",
       title = "Alcoholism Not a Good Indicator for No Show",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

handcap_plot2 <- binary_plot2(ma_data, ma_data$Handcap) +
  labs(x = "Handicap Disability", 
       y = "Rate",
       title = "Handicap Pts have Slightly Lower No Show",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

sms_plot2 <- binary_plot2(ma_data, ma_data$SMS_received)+
  labs(x = "SMS Message", 
       y = "Rate",
       title = "SMS Messages Do Not Improve No Shows",
       caption = "Source: https://archive.ics.uci.edu") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

multiplot(scholar_plot2, hypternsion_plot2, diabetes_plot2, alcoholism_plot2, handcap_plot2, sms_plot2, cols = 2)

Looking at the proportions of patients with or without the binary variables, we can see if there are trends of no shows.

As seen in Scholarships and SMS messages sent to the patient, it does not improve no show rates.
- Patients with children who have financial aid via scholarships have a higher rate of no shows than those without it.
- Also, patients who opted to receive SMS message reminders of appointments had a higher no show rate.
Patients with and without alcoholism show that there is an equal no show rate (so whether or not a person is an alcoholic does not indicate showing up or not showing up to an appointment)
Lastly, patients with preconceived health conditions, such as hypertension, diabetes, and disabilities, such as being handicapped, show to have lower no show rates.
Keep in mind, our sample size for those who had these binary variables was small. Therefore, it cannot accurately represent the population as being effective or ineffective.
- Further research and a large sample size will be needed to test the effectiveness of these variables.

Conclusion

Not only does failure to follow through with scheduled medical appointments have a negative impact on the functionality/workflow of a clinic but it can also negatively affect the patient. This data analysis aimed to explore different variables and their relationship to no show rates. We have seen that gender plays a role in the number of scheduled appointments, however, does not affect the no show rate. Both males and females have the same rate of no shows (~20%). We have also seen that the rate of no shows increases as appointments are scheduled for more than 4 months in advance. Next, we confirmed that certain age groups, such as between 20-30 years old, have high rates of no shows as well as patients that are 90+ years old. Lastly, we confirmed that the binary variables of Scholarship, Hypertension, Diabetes, Alcoholism, the Handicapped, and SMS message reminders do not show a significant improvement in no show rates. However, we also noted that the sample size was too small to accurately indicate effectiveness.

Opportunities for further analysis could be researching the following:

Conducting analysis on the binary variables and obtaining a sample of those only with Scholarship, Hypertension, Diabetes, Alcoholism, Handicap, and SMS message reminders and the no show rates.