This analysis deals with exploring variables and their relationship to no show rates for medical appointments. The variables that we explored are gender, number of days an appointment is scheduled in advance, age, financial aid via scholarships, preconceived health conditions, and SMS Message reminders. The data in this analysis were collected from patients in Brazil and comprises of 110,527 medical appointments and can be found on kaggle.com.
Attribute Information:
PatientId = Identification number of a patient
AppointmentID = Identification number of each appointment
Gender = Male or Female
ScheduledDay = The date-time (YYYY-MM-DD) the patient scheduled their appointment
AppointmentDay = Date of appointment (YYYY-MM-DD)
Age = Age of patient in years
Scholarship = Financial Aid for a family depending on if the child is enrolled in an academic program.
Hypertension = Signifies a patient as hypertensive or not hypertensive
Diabetes = Signifies a patient as diabetic or not diabetic
Alcoholism = Signifies a patient as an alcoholic or not an alcoholic
Handcap = Signifies a patient as handicap or not handicap
SMS_received = Patient opts to receive or not receive an SMS Message reminder for an appointment
No-show = Patient does or does not show up to the scheduled appointment
Key
0 (zero) indicates No/FALSE
1 (one) indicates Yes/TRUE
The data analysis will be broken down by the following sections:
Gender and No Show Rates
Number of Days Appointments are Scheduled in Advance vs Rate of No Shows
Age and No Show Rates
Binary Variables within the Data Set and Their Relationship to No Shows
Let’s begin by taking a glimpse of our data and see what we are working with! This is also a great time to check if our data is tidy and the classes of our variables are accurate.
glimpse(ma_data) # taking a glimpse of our data, the classes and columns are accurate
## Rows: 110,527
## Columns: 13
## $ PatientId <dbl> 2.987250e+13, 5.589978e+14, 4.262962e+12, 8.679512e+...
## $ AppointmentID <int> 5642903, 5642503, 5642549, 5642828, 5642494, 5626772...
## $ Gender <chr> "F", "M", "F", "F", "F", "F", "F", "F", "F", "F", "F...
## $ ScheduledDay <dttm> 2016-04-29 18:38:08, 2016-04-29 16:08:27, 2016-04-2...
## $ AppointmentDay <dttm> 2016-04-29, 2016-04-29, 2016-04-29, 2016-04-29, 201...
## $ Age <int> 62, 56, 62, 8, 56, 76, 23, 39, 21, 19, 30, 29, 22, 2...
## $ Scholarship <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1...
## $ Hypertension <fct> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Diabetes <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Alcoholism <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Handcap <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SMS_received <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0...
## $ No.show <fct> No, No, No, No, No, No, Yes, Yes, No, No, No, Yes, N...
Let’s run a bar plot and visualize the number of males vs females that made an appointment and within that a proportion of who showed and did not show to their appointment:
gender_stack <- ggplot(ma_data, aes(x = Gender, fill = No.show)) +
geom_bar(position = "stack") +
theme_dark() +
labs(x = "Gender",
y = "Count",
title = "More Female Appointments than Male Appointments",
subtitle = "F = Female & M = Male",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
gender_fill <- ggplot(ma_data, aes(x = Gender, fill = No.show)) +
geom_bar(position = "fill") +
theme_dark() +
labs(x = "Gender",
y = "Rate",
title = "No Show Rates are the Same Despite Gender",
subtitle = "F = Female & M = Male",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
multiplot(gender_stack, gender_fill, cols = 2)
kable1 <- ma_data %>%
group_by(Gender, No.show) %>%
summarize(count = n())
## `summarise()` regrouping output by 'Gender' (override with `.groups` argument)
knitr::kable(kable1)
| Gender | No.show | count |
|---|---|---|
| F | No | 57246 |
| F | Yes | 14594 |
| M | No | 30962 |
| M | Yes | 7725 |
We can see that more appointments are made by females than males.
However, looking at the rate of who does and does not show up, both females and males are even at 20%.
Therefore, gender is not a good predictor of no show rates as there is no significant difference between the two in this data set.
Next, we will explore if scheduling an appointment in advance affects the rate of no shows. To do this, we will take the absolute difference between the date of the appointment and the date the appointment was scheduled.
date_data <- ma_data %>%
select(ScheduledDay, AppointmentDay, No.show) %>%
mutate(diffdays = abs(difftime(AppointmentDay, ScheduledDay, units = "days"))) # took the difference between the two dates to see how long it has been since they set the appointment and the actual appointment day. Absolute value was taken since you can't have an appointment before scheduling one.
Now that we have the data set we want to explore, let’s visualize the difference with a histogram (since we have a numerical x variable):
ggplot(date_data, aes(x = diffdays, fill = No.show)) +
geom_histogram() +
theme_dark() +
labs(x = "# of Days Appointment was Scheduled in Advance",
y = "Count",
title = "Most Appointments Scheduled for the Same Day",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Let’s zoom in and see where most of the no shows are by setting our x limits and y limits!
ggplot(date_data, aes(x = diffdays, fill = No.show)) +
geom_histogram() +
theme_dark() +
coord_cartesian(xlim = c(0, 180), ylim = c(0, 6000)) +
labs(x = "# of Days Appointment was Scheduled in Advance",
y = "Count",
title = "Same Day Appointments Have High No Show Counts",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
However, after looking at the proportions, we see something entirely different.
ggplot(date_data, aes(x = diffdays, fill = No.show)) +
geom_histogram(position = "fill") +
theme_dark() +
labs(x = "# of Days Appointment was Scheduled in Advance",
y = "Rate",
title = "No Show Rates are Lower for Same Day Appointments",
subtitle = "Higher Likelihood of No Show after 4 Months of Advance Scheduling",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
We are going to explore the correlation between age/age groups and the rate at which they show or do not show to their appointment.
We want to answer if there is a particular age group that is predicted to not show up to their scheduled appointment and answer if younger patients are less dependable than older patients.
Keep in Mind:
ggplot(ma_data, aes(x = Age, fill = No.show)) +
geom_histogram() +
theme_dark() +
labs(x = "Age",
y = "Count",
title = "Most Appointments are Scheduled for Children and Adults",
subtitle = "Children Age Range: 0-5 years old & Adults Age Range: 30-60 years old",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Now let’s zoom into the plot and analyze the number of no shows:
We have increased the number of bins to represent 1 year per bin for easier interpretation
ggplot(ma_data, aes(x = Age, fill = No.show)) +
geom_histogram(bins = 100) +
theme_dark() +
coord_cartesian(xlim = c(0, 100), ylim = c(0, 800)) +
labs(x = "Age",
y = "Count",
title = "No Shows are Most Apparent in Adults",
subtitle = "Adults in the Age Range: 20-30 years old",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
We see that no shows begin to dip between the ages of zero years to 12 years old (most likely because parents are taking their children to their doctor’s appointment)
However, the no shows begin to spike and maintain in the range of 20-30 years of age. (of age to be responsible for your own doctor’s appointments)
Beyond 30 years of age, there is a negative trend of no shows as age progresses.
Now let’s look at the proportions:
ggplot(ma_data, aes(x = Age, fill = No.show)) +
geom_histogram(bins = 100, position = "fill") +
theme_dark() +
labs(x = "Age",
y = "Rate",
title = "No Shows are Most Apparent in Adults",
subtitle = "Adults in the Age Range: 20-30 years old",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
We can explore this further by analyzing the spread of days a patient, above the age of 90, will schedule an appointment in advance.
age_date_data <- ma_data %>%
select(ScheduledDay, AppointmentDay, Age, No.show) %>%
mutate(diffdays = abs(difftime(AppointmentDay, ScheduledDay, units = "days")))
ggplot(age_date_data, aes(x = diffdays, y = Age, color = No.show)) +
geom_point(position = "jitter") +
theme_dark() +
coord_cartesian(ylim = c(90, 120)) +
labs(x = "# of Days Appointment was Scheduled in Advance",
y = "Age",
title = "Patients Above the Age of 90",
subtitle = "Mainly will Schedule Appointments 0-50 days in Advance",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
ggplot(age_date_data, aes(x = diffdays, y = Age, color = No.show)) +
geom_point(position = "jitter") +
theme_dark() +
coord_cartesian(xlim = c(0, 40), ylim = c(90, 120)) +
labs(x = "# of Days Appointment was Scheduled in Advance",
y = "Age",
title = "Patients Above the Age of 90",
subtitle = "Mainly will Schedule Appointments 0-50 days in Advance",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
As seen above, patients above the age of 90 will normally schedule appointments 0-50 days in advance.
The amount of no shows are spread quite evenly within the 50 days.
Within our data set, there are binary variables that we are going to explore. We are going to see if having financial aid via scholarships, preconceived health conditions, or SMS reminders sent will affect the rate of no shows.
We will first explore each binary variable and its effects on the total number of no shows and then the rate of no shows.
binary_plot <- function(df, vect){
ggplot(data, aes(x = vect, fill = No.show)) +
geom_bar() +
theme_dark() +
scale_x_discrete(limits = c("0", "1"),
labels = c("Doesn't\nHave", "Has"))
} # function to make plots for the desired vector within our data set
scholar_plot <- binary_plot(ma_data, ma_data$Scholarship) +
labs(x = "Scholarship",
y = "Count",
title = "Scholarship and No Shows",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
hypertension_plot <- binary_plot(ma_data, ma_data$Hypertension)+
labs(x = "Hypertension",
y = "Count",
title = "Hypertension and No Shows",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
diabetes_plot <- binary_plot(ma_data, ma_data$Diabetes)+
labs(x = "Diabetes",
y = "Count",
title = "Diabetes and No Shows",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
alcoholism_plot <- binary_plot(ma_data, ma_data$Alcoholism)+
labs(x = "Alcoholism",
y = "Count",
title = "Alcoholism and No Shows",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
handcap_plot <- binary_plot(ma_data, ma_data$Handcap)+
labs(x = "Handicap Disability",
y = "Count",
title = "Handicap and No Shows",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
sms_plot <- binary_plot(ma_data, ma_data$SMS_received)+
labs(x = "SMS Message",
y = "Count",
title = "SMS Message and No Shows",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
multiplot(scholar_plot, hypertension_plot, diabetes_plot, alcoholism_plot, handcap_plot, sms_plot, cols = 2)
Let’s see if having or not having these variables will affect the rate of no shows:
binary_plot2 <- function(df, vect){
ggplot(data, aes(x = vect, fill = No.show)) +
geom_bar(position = "fill") +
theme_dark() +
scale_x_discrete(limits = c("0", "1"),
labels = c("Doesn't\nHave", "Has"))
}
scholar_plot2 <- binary_plot2(ma_data, ma_data$Scholarship)+
labs(x = "Scholarship",
y = "Rate",
title = "Scholarship Not Effective for No Shows",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
hypternsion_plot2 <- binary_plot2(ma_data, ma_data$Hypertension)+
labs(x = "Hypertension",
y = "Rate",
title = "Hypertensive Pts have Slightly Lower No Show",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
diabetes_plot2 <- binary_plot2(ma_data, ma_data$Diabetes) +
labs(x = "Diabetes",
y = "Rate",
title = "Diabetic Pts have Slightly Lower No Show",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
alcoholism_plot2 <- binary_plot2(ma_data, ma_data$Alcoholism) +
labs(x = "Alcoholism",
y = "Rate",
title = "Alcoholism Not a Good Indicator for No Show",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
handcap_plot2 <- binary_plot2(ma_data, ma_data$Handcap) +
labs(x = "Handicap Disability",
y = "Rate",
title = "Handicap Pts have Slightly Lower No Show",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
sms_plot2 <- binary_plot2(ma_data, ma_data$SMS_received)+
labs(x = "SMS Message",
y = "Rate",
title = "SMS Messages Do Not Improve No Shows",
caption = "Source: https://archive.ics.uci.edu") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
multiplot(scholar_plot2, hypternsion_plot2, diabetes_plot2, alcoholism_plot2, handcap_plot2, sms_plot2, cols = 2)
Looking at the proportions of patients with or without the binary variables, we can see if there are trends of no shows.
As seen in Scholarships and SMS messages sent to the patient, it does not improve no show rates.
Patients with and without alcoholism show that there is an equal no show rate (so whether or not a person is an alcoholic does not indicate showing up or not showing up to an appointment)
Lastly, patients with preconceived health conditions, such as hypertension, diabetes, and disabilities, such as being handicapped, show to have lower no show rates.
Keep in mind, our sample size for those who had these binary variables was small. Therefore, it cannot accurately represent the population as being effective or ineffective.
Not only does failure to follow through with scheduled medical appointments have a negative impact on the functionality/workflow of a clinic but it can also negatively affect the patient. This data analysis aimed to explore different variables and their relationship to no show rates. We have seen that gender plays a role in the number of scheduled appointments, however, does not affect the no show rate. Both males and females have the same rate of no shows (~20%). We have also seen that the rate of no shows increases as appointments are scheduled for more than 4 months in advance. Next, we confirmed that certain age groups, such as between 20-30 years old, have high rates of no shows as well as patients that are 90+ years old. Lastly, we confirmed that the binary variables of Scholarship, Hypertension, Diabetes, Alcoholism, the Handicapped, and SMS message reminders do not show a significant improvement in no show rates. However, we also noted that the sample size was too small to accurately indicate effectiveness.
Opportunities for further analysis could be researching the following: