No Show up Analysis

No Show up Analysis

Given the fact that healthcare systems consumed about 20% of the U.S. Gross Domestic Product (GDP) in 2017, which correlates to hundreds of millions of doctor visits each year, it is inevitable that the occasional no-show appointment will occur. However, a significant number of missed appointments can equate to millions of dollars in lost revenue, especially in a hospital setting. According to a 2016 study, the average yearly no-show rate for primary care and specialty medical appointments is just under 20 percent - an astounding number that could mean significant revenue loss for healthcare systems, both large and small.

What Causes Patients to Skip Medical Appointments?

Studies have shown that the no-show rate drops from 23 percent to around 15 percent when patients are given a reminder call a few days before their appointment. We will try to explore other variable which might lead the skipping of medical appointment.

Data The sample data obtained is from a brazilian doctor’s office which has close to 300k observations with 15 variables. Each observation is tied to a patientID.

Flat file on GitHuB

Importing the data

# Libraries Used
library(DT)
suppressMessages(library(dplyr))
suppressMessages(library(lubridate))
suppressMessages(library(ggplot2))
suppressMessages(library(RColorBrewer))
suppressMessages(library(tidyr))


theurl <- url('https://raw.githubusercontent.com/DataScienceAR/RClass/master/NoShowAppointments.csv')
no.show.raw <- read.csv(theurl)
no.show.df <- tbl_df(no.show.raw)
datatable(head(no.show.raw, 100))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

1 Data Munging

1.1 Typo Corrections

Column typo corrections

names(no.show.df) <- c("PatientID","AppointmentID","Gender","ScheduledDay","AppointmentDay","Age","Address",
                       "Government_Aid","Hypertension","Diabetes","Alcholism","Handicap","Appointment_Remainder","NoShow")

1.2 Time format change

no.show.df$ScheduledDay <- date(no.show.df$ScheduledDay)
no.show.df$AppointmentDay <- date(no.show.df$AppointmentDay)
no.show.df
## # A tibble: 110,527 x 14
##    PatientID AppointmentID Gender ScheduledDay AppointmentDay   Age Address
##        <dbl>         <int> <fct>  <date>       <date>         <int> <fct>  
##  1   2.99e13       5642903 F      2016-04-29   2016-04-29        62 JARDIM…
##  2   5.59e14       5642503 M      2016-04-29   2016-04-29        56 JARDIM…
##  3   4.26e12       5642549 F      2016-04-29   2016-04-29        62 MATA D…
##  4   8.68e11       5642828 F      2016-04-29   2016-04-29         8 PONTAL…
##  5   8.84e12       5642494 F      2016-04-29   2016-04-29        56 JARDIM…
##  6   9.60e13       5626772 F      2016-04-27   2016-04-29        76 REPÚBL…
##  7   7.34e14       5630279 F      2016-04-27   2016-04-29        23 GOIABE…
##  8   3.45e12       5630575 F      2016-04-27   2016-04-29        39 GOIABE…
##  9   5.64e13       5638447 F      2016-04-29   2016-04-29        21 ANDORI…
## 10   7.81e13       5629123 F      2016-04-27   2016-04-29        19 CONQUI…
## # ... with 110,517 more rows, and 7 more variables: Government_Aid <int>,
## #   Hypertension <int>, Diabetes <int>, Alcholism <int>, Handicap <int>,
## #   Appointment_Remainder <int>, NoShow <fct>

1.3 Check for inconsistent data like “NA”

# Number of values with 'NA' in the data set
paste("# of 'NA' values in Data set are :",length(which(! complete.cases(no.show.df))) )
## [1] "# of 'NA' values in Data set are : 0"

1.4 Factorise the variables

# Factorise Gender, Government Aid, Hypertenstion, Diabetes, Alcholism, Handicap, SMS_Received
Factorise <- c("Gender","Government Aid","Hypertension","Diabetes","Alcholism","Handicap","NoShow","SMS_Received")
no.show.df$Gender <- as.factor(no.show.df$Gender)
no.show.df$Government_Aid <- as.factor(no.show.df$Government_Aid)
#no.show.df$Hypertension <- as.factor(no.show.df$Hypertension)
#no.show.df$Diabetes <- as.factor(no.show.df$Diabetes)
#no.show.df$Government_Aid <- as.factor(no.show.df$Government_Aid)
#no.show.df$Alcholism <- as.factor(no.show.df$Alcholism)
#no.show.df$Handicap <- as.factor(no.show.df$Handicap)
no.show.df$NoShow <- as.factor(no.show.df$NoShow)
no.show.df$Appointment_Remainder <- as.factor(no.show.df$Appointment_Remainder)

1.5 Additional Calculations and Functions

# Number of days between Appointment day and Scheduled Day
no.show.df$Wait_time_days <- no.show.df$AppointmentDay-no.show.df$ScheduledDay

# Function to convert decimal to Percentage
percent <- function(x, digits = 2, format = "f", ...) {
  paste0(formatC(100 * x, format = format, digits = digits, ...), " %")
}

1.5 Change the Gender values to full Abbreviation

no.show.df$Gender <- c("F"= "Female","M"="Male")[no.show.df$Gender]
no.show.df$Gender <- as.factor(no.show.df$Gender)

2. Data Exploration

2.1 Statistics of the population - Numbers

No Show Patients

round(prop.table(table(no.show.df$NoShow)) *100,2) -> No_Show
names(No_Show) <- c("Perc. of Show up", "Perc. of No Show up")
No_Show
##    Perc. of Show up Perc. of No Show up 
##               79.81               20.19

Hypertension Patients

round(prop.table(table(no.show.df$Hypertension)) *100,2) -> Hypertension
names(Hypertension) <- c("Perc. with No Hypertension", "Perc. with Hypertension")
Hypertension
## Perc. with No Hypertension    Perc. with Hypertension 
##                      80.28                      19.72

Diabetic Patients

round(prop.table(table(no.show.df$NoShow)) *100,2) -> No_Show
names(No_Show) <- c("Perc. of Show up", "Perc. of No Show up")
No_Show
##    Perc. of Show up Perc. of No Show up 
##               79.81               20.19

_ Patients who are Alcholic_

round(prop.table(table(no.show.df$NoShow)) *100,2) -> No_Show
names(No_Show) <- c("Perc. of Show up", "Perc. of No Show up")
No_Show
##    Perc. of Show up Perc. of No Show up 
##               79.81               20.19

HandiCap Patients

round(prop.table(table(no.show.df$NoShow)) *100,2) -> Handicap
names(Handicap) <- c("Perc. of Show up", "Perc. of No Show up")
Handicap
##    Perc. of Show up Perc. of No Show up 
##               79.81               20.19

Patients with Government Aid

round(prop.table(table(no.show.df$Government_Aid)) *100,2) -> Government_Aid
names(Government_Aid) <- c("Perc. of Government_Aid", "Perc. of No Government_Aid")
Government_Aid
##    Perc. of Government_Aid Perc. of No Government_Aid 
##                      90.17                       9.83

Diabetes

round(prop.table(table(no.show.df$NoShow)) *100,2) -> Diabetes
names(Diabetes) <- c("Perc. of Show up", "Perc. of No Show up")
Diabetes
##    Perc. of Show up Perc. of No Show up 
##               79.81               20.19

Alcholism

round(prop.table(table(no.show.df$NoShow)) *100,2) -> Alcholism
names(Alcholism) <- c("Perc. of Show up", "Perc. of No Show up")
Alcholism
##    Perc. of Show up Perc. of No Show up 
##               79.81               20.19

Gender

round(prop.table(table(no.show.df$NoShow)) *100,2) -> Gender
names(Gender) <- c("Perc. of Show up", "Perc. of No Show up")
Gender
##    Perc. of Show up Perc. of No Show up 
##               79.81               20.19

2.2 Statistics of the population - Graphics

2.2.1 Pie Chart showing the distribution of variaus patient conditions who didn’t show up
no.show.df1 <- filter(no.show.df, NoShow =='Yes') %>%

  select(9:12) -> patient_condition_count1
patient_condition_count1 <-gather( patient_condition_count1, Condition, Count, 1:4)
patient_condition_count1
## # A tibble: 89,276 x 2
##    Condition    Count
##    <chr>        <int>
##  1 Hypertension     0
##  2 Hypertension     0
##  3 Hypertension     0
##  4 Hypertension     0
##  5 Hypertension     0
##  6 Hypertension     0
##  7 Hypertension     0
##  8 Hypertension     0
##  9 Hypertension     1
## 10 Hypertension     0
## # ... with 89,266 more rows
patient_condition_count <-aggregate(patient_condition_count1$Count, by = list(patient_condition_count1$Condition), FUN = sum)
#patient_condition_count <- names(c("Conditon","Count"))
names(patient_condition_count) <- c("Condition", "# Of Patients")
patient_condition_count
##      Condition # Of Patients
## 1    Alcholism           677
## 2     Diabetes          1430
## 3     Handicap           453
## 4 Hypertension          3772
pie(patient_condition_count[,2], labels = patient_condition_count[,1], clockwise = TRUE,radius = 1, col= brewer.pal(7,"Set2"), border ="white", main = "Distributio of Patient condition")

Conclusion

Most of the patients who didn’t show up were under the category of “Hypertension” followed by “Diabetes”, Alcholism and “Handicap”. Since over 60% of the patients come under hypertension category who didn’t show up, this category needs further investivation like the types of the treatment given, severity of the condition ect.

2.2.2 Bar Chart showing the patient count with hypertension vs. non hypertension by gender.
# Comparison of Gender who are Hypertension with Show or No Show

ggplot(no.show.df, aes(x=Gender, fill = NoShow  ))+theme_bw()+ facet_wrap( ~ Hypertension)+geom_bar()+ labs(y = "Patient Count", title = "# of Patients by Gender and Hypertension")

# Comparison of Gender who are Diabetes with Show or No Show

ggplot(no.show.df, aes(x=Gender, fill = NoShow  ))+theme_bw()+ facet_wrap( ~ Diabetes)+geom_bar()+ labs(y = "Patient Count", title = "# of Patients by Gender and Diabetes")

Conclusion

Clearly there is a high percentage of patients that didn’t show up for the appointment belonged to patients without hypertension. Amoung the patients with hypertension, female patients were more than double the number of male patients. Although patients with Diabetes show the same pattern, it is clear that the female populatin is slightly above the male population who didn’t show up.

2.2.3 Age Distribution of Patients
ggplot(data = no.show.df) + geom_histogram(aes(x=Age), binwidth = 25)

hist(no.show.df$Age, main="Age Histogram", xlab="Age")

Conclusion

The age distribution shows that there are large number of patients who are falling in two bins: the infants 0 to 10 years and Adults 20 to 60

2.2.4 Wait time vs. No show
ggplot(no.show.df, aes(x = as.integer(Wait_time_days) , y= as.integer(Wait_time_days), color = NoShow)) + geom_point() +labs(title = "Wait time and no show", x = "Wait time", y = "Wait time")

no.show.df
## # A tibble: 110,527 x 15
##    PatientID AppointmentID Gender ScheduledDay AppointmentDay   Age Address
##        <dbl>         <int> <fct>  <date>       <date>         <int> <fct>  
##  1   2.99e13       5642903 Female 2016-04-29   2016-04-29        62 JARDIM…
##  2   5.59e14       5642503 Male   2016-04-29   2016-04-29        56 JARDIM…
##  3   4.26e12       5642549 Female 2016-04-29   2016-04-29        62 MATA D…
##  4   8.68e11       5642828 Female 2016-04-29   2016-04-29         8 PONTAL…
##  5   8.84e12       5642494 Female 2016-04-29   2016-04-29        56 JARDIM…
##  6   9.60e13       5626772 Female 2016-04-27   2016-04-29        76 REPÚBL…
##  7   7.34e14       5630279 Female 2016-04-27   2016-04-29        23 GOIABE…
##  8   3.45e12       5630575 Female 2016-04-27   2016-04-29        39 GOIABE…
##  9   5.64e13       5638447 Female 2016-04-29   2016-04-29        21 ANDORI…
## 10   7.81e13       5629123 Female 2016-04-27   2016-04-29        19 CONQUI…
## # ... with 110,517 more rows, and 8 more variables: Government_Aid <fct>,
## #   Hypertension <int>, Diabetes <int>, Alcholism <int>, Handicap <int>,
## #   Appointment_Remainder <fct>, NoShow <fct>, Wait_time_days <time>

Conclusion

More wait time leads to more no shows up. There were lot of no show ups beyond 125 days.

3.0 Summary

  1. The patients registered at the clinic predominatly female accounting to more than 50% of total patients.
  2. Hypertension is the predominant condition amoung the patients registered who did not show up.
  3. There were two different age groups who registered at the clinics 0 to 10 and 20 to 60 age groups
  4. More wait time lead more no shows. Majority of no shows were focused above 125 days.