No Show up Analysis
Given the fact that healthcare systems consumed about 20% of the U.S. Gross Domestic Product (GDP) in 2017, which correlates to hundreds of millions of doctor visits each year, it is inevitable that the occasional no-show appointment will occur. However, a significant number of missed appointments can equate to millions of dollars in lost revenue, especially in a hospital setting. According to a 2016 study, the average yearly no-show rate for primary care and specialty medical appointments is just under 20 percent - an astounding number that could mean significant revenue loss for healthcare systems, both large and small.
Studies have shown that the no-show rate drops from 23 percent to around 15 percent when patients are given a reminder call a few days before their appointment. We will try to explore other variable which might lead the skipping of medical appointment.
Data The sample data obtained is from a brazilian doctor’s office which has close to 300k observations with 15 variables. Each observation is tied to a patientID.
Importing the data
# Libraries Used
library(DT)
suppressMessages(library(dplyr))
suppressMessages(library(lubridate))
suppressMessages(library(ggplot2))
suppressMessages(library(RColorBrewer))
suppressMessages(library(tidyr))
theurl <- url('https://raw.githubusercontent.com/DataScienceAR/RClass/master/NoShowAppointments.csv')
no.show.raw <- read.csv(theurl)
no.show.df <- tbl_df(no.show.raw)
datatable(head(no.show.raw, 100))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
names(no.show.df) <- c("PatientID","AppointmentID","Gender","ScheduledDay","AppointmentDay","Age","Address",
"Government_Aid","Hypertension","Diabetes","Alcholism","Handicap","Appointment_Remainder","NoShow")
no.show.df$ScheduledDay <- date(no.show.df$ScheduledDay)
no.show.df$AppointmentDay <- date(no.show.df$AppointmentDay)
no.show.df
## # A tibble: 110,527 x 14
## PatientID AppointmentID Gender ScheduledDay AppointmentDay Age Address
## <dbl> <int> <fct> <date> <date> <int> <fct>
## 1 2.99e13 5642903 F 2016-04-29 2016-04-29 62 JARDIM…
## 2 5.59e14 5642503 M 2016-04-29 2016-04-29 56 JARDIM…
## 3 4.26e12 5642549 F 2016-04-29 2016-04-29 62 MATA D…
## 4 8.68e11 5642828 F 2016-04-29 2016-04-29 8 PONTAL…
## 5 8.84e12 5642494 F 2016-04-29 2016-04-29 56 JARDIM…
## 6 9.60e13 5626772 F 2016-04-27 2016-04-29 76 REPÚBL…
## 7 7.34e14 5630279 F 2016-04-27 2016-04-29 23 GOIABE…
## 8 3.45e12 5630575 F 2016-04-27 2016-04-29 39 GOIABE…
## 9 5.64e13 5638447 F 2016-04-29 2016-04-29 21 ANDORI…
## 10 7.81e13 5629123 F 2016-04-27 2016-04-29 19 CONQUI…
## # ... with 110,517 more rows, and 7 more variables: Government_Aid <int>,
## # Hypertension <int>, Diabetes <int>, Alcholism <int>, Handicap <int>,
## # Appointment_Remainder <int>, NoShow <fct>
# Number of values with 'NA' in the data set
paste("# of 'NA' values in Data set are :",length(which(! complete.cases(no.show.df))) )
## [1] "# of 'NA' values in Data set are : 0"
# Factorise Gender, Government Aid, Hypertenstion, Diabetes, Alcholism, Handicap, SMS_Received
Factorise <- c("Gender","Government Aid","Hypertension","Diabetes","Alcholism","Handicap","NoShow","SMS_Received")
no.show.df$Gender <- as.factor(no.show.df$Gender)
no.show.df$Government_Aid <- as.factor(no.show.df$Government_Aid)
#no.show.df$Hypertension <- as.factor(no.show.df$Hypertension)
#no.show.df$Diabetes <- as.factor(no.show.df$Diabetes)
#no.show.df$Government_Aid <- as.factor(no.show.df$Government_Aid)
#no.show.df$Alcholism <- as.factor(no.show.df$Alcholism)
#no.show.df$Handicap <- as.factor(no.show.df$Handicap)
no.show.df$NoShow <- as.factor(no.show.df$NoShow)
no.show.df$Appointment_Remainder <- as.factor(no.show.df$Appointment_Remainder)
# Number of days between Appointment day and Scheduled Day
no.show.df$Wait_time_days <- no.show.df$AppointmentDay-no.show.df$ScheduledDay
# Function to convert decimal to Percentage
percent <- function(x, digits = 2, format = "f", ...) {
paste0(formatC(100 * x, format = format, digits = digits, ...), " %")
}
no.show.df$Gender <- c("F"= "Female","M"="Male")[no.show.df$Gender]
no.show.df$Gender <- as.factor(no.show.df$Gender)
No Show Patients
round(prop.table(table(no.show.df$NoShow)) *100,2) -> No_Show
names(No_Show) <- c("Perc. of Show up", "Perc. of No Show up")
No_Show
## Perc. of Show up Perc. of No Show up
## 79.81 20.19
Hypertension Patients
round(prop.table(table(no.show.df$Hypertension)) *100,2) -> Hypertension
names(Hypertension) <- c("Perc. with No Hypertension", "Perc. with Hypertension")
Hypertension
## Perc. with No Hypertension Perc. with Hypertension
## 80.28 19.72
Diabetic Patients
round(prop.table(table(no.show.df$NoShow)) *100,2) -> No_Show
names(No_Show) <- c("Perc. of Show up", "Perc. of No Show up")
No_Show
## Perc. of Show up Perc. of No Show up
## 79.81 20.19
_ Patients who are Alcholic_
round(prop.table(table(no.show.df$NoShow)) *100,2) -> No_Show
names(No_Show) <- c("Perc. of Show up", "Perc. of No Show up")
No_Show
## Perc. of Show up Perc. of No Show up
## 79.81 20.19
HandiCap Patients
round(prop.table(table(no.show.df$NoShow)) *100,2) -> Handicap
names(Handicap) <- c("Perc. of Show up", "Perc. of No Show up")
Handicap
## Perc. of Show up Perc. of No Show up
## 79.81 20.19
Patients with Government Aid
round(prop.table(table(no.show.df$Government_Aid)) *100,2) -> Government_Aid
names(Government_Aid) <- c("Perc. of Government_Aid", "Perc. of No Government_Aid")
Government_Aid
## Perc. of Government_Aid Perc. of No Government_Aid
## 90.17 9.83
Diabetes
round(prop.table(table(no.show.df$NoShow)) *100,2) -> Diabetes
names(Diabetes) <- c("Perc. of Show up", "Perc. of No Show up")
Diabetes
## Perc. of Show up Perc. of No Show up
## 79.81 20.19
Alcholism
round(prop.table(table(no.show.df$NoShow)) *100,2) -> Alcholism
names(Alcholism) <- c("Perc. of Show up", "Perc. of No Show up")
Alcholism
## Perc. of Show up Perc. of No Show up
## 79.81 20.19
Gender
round(prop.table(table(no.show.df$NoShow)) *100,2) -> Gender
names(Gender) <- c("Perc. of Show up", "Perc. of No Show up")
Gender
## Perc. of Show up Perc. of No Show up
## 79.81 20.19
no.show.df1 <- filter(no.show.df, NoShow =='Yes') %>%
select(9:12) -> patient_condition_count1
patient_condition_count1 <-gather( patient_condition_count1, Condition, Count, 1:4)
patient_condition_count1
## # A tibble: 89,276 x 2
## Condition Count
## <chr> <int>
## 1 Hypertension 0
## 2 Hypertension 0
## 3 Hypertension 0
## 4 Hypertension 0
## 5 Hypertension 0
## 6 Hypertension 0
## 7 Hypertension 0
## 8 Hypertension 0
## 9 Hypertension 1
## 10 Hypertension 0
## # ... with 89,266 more rows
patient_condition_count <-aggregate(patient_condition_count1$Count, by = list(patient_condition_count1$Condition), FUN = sum)
#patient_condition_count <- names(c("Conditon","Count"))
names(patient_condition_count) <- c("Condition", "# Of Patients")
patient_condition_count
## Condition # Of Patients
## 1 Alcholism 677
## 2 Diabetes 1430
## 3 Handicap 453
## 4 Hypertension 3772
pie(patient_condition_count[,2], labels = patient_condition_count[,1], clockwise = TRUE,radius = 1, col= brewer.pal(7,"Set2"), border ="white", main = "Distributio of Patient condition")
Conclusion
Most of the patients who didn’t show up were under the category of “Hypertension” followed by “Diabetes”, Alcholism and “Handicap”. Since over 60% of the patients come under hypertension category who didn’t show up, this category needs further investivation like the types of the treatment given, severity of the condition ect.
# Comparison of Gender who are Hypertension with Show or No Show
ggplot(no.show.df, aes(x=Gender, fill = NoShow ))+theme_bw()+ facet_wrap( ~ Hypertension)+geom_bar()+ labs(y = "Patient Count", title = "# of Patients by Gender and Hypertension")
# Comparison of Gender who are Diabetes with Show or No Show
ggplot(no.show.df, aes(x=Gender, fill = NoShow ))+theme_bw()+ facet_wrap( ~ Diabetes)+geom_bar()+ labs(y = "Patient Count", title = "# of Patients by Gender and Diabetes")
Conclusion
Clearly there is a high percentage of patients that didn’t show up for the appointment belonged to patients without hypertension. Amoung the patients with hypertension, female patients were more than double the number of male patients. Although patients with Diabetes show the same pattern, it is clear that the female populatin is slightly above the male population who didn’t show up.
ggplot(data = no.show.df) + geom_histogram(aes(x=Age), binwidth = 25)
hist(no.show.df$Age, main="Age Histogram", xlab="Age")
Conclusion
The age distribution shows that there are large number of patients who are falling in two bins: the infants 0 to 10 years and Adults 20 to 60
ggplot(no.show.df, aes(x = as.integer(Wait_time_days) , y= as.integer(Wait_time_days), color = NoShow)) + geom_point() +labs(title = "Wait time and no show", x = "Wait time", y = "Wait time")
no.show.df
## # A tibble: 110,527 x 15
## PatientID AppointmentID Gender ScheduledDay AppointmentDay Age Address
## <dbl> <int> <fct> <date> <date> <int> <fct>
## 1 2.99e13 5642903 Female 2016-04-29 2016-04-29 62 JARDIM…
## 2 5.59e14 5642503 Male 2016-04-29 2016-04-29 56 JARDIM…
## 3 4.26e12 5642549 Female 2016-04-29 2016-04-29 62 MATA D…
## 4 8.68e11 5642828 Female 2016-04-29 2016-04-29 8 PONTAL…
## 5 8.84e12 5642494 Female 2016-04-29 2016-04-29 56 JARDIM…
## 6 9.60e13 5626772 Female 2016-04-27 2016-04-29 76 REPÚBL…
## 7 7.34e14 5630279 Female 2016-04-27 2016-04-29 23 GOIABE…
## 8 3.45e12 5630575 Female 2016-04-27 2016-04-29 39 GOIABE…
## 9 5.64e13 5638447 Female 2016-04-29 2016-04-29 21 ANDORI…
## 10 7.81e13 5629123 Female 2016-04-27 2016-04-29 19 CONQUI…
## # ... with 110,517 more rows, and 8 more variables: Government_Aid <fct>,
## # Hypertension <int>, Diabetes <int>, Alcholism <int>, Handicap <int>,
## # Appointment_Remainder <fct>, NoShow <fct>, Wait_time_days <time>
Conclusion
More wait time leads to more no shows up. There were lot of no show ups beyond 125 days.