Logo Left

Logo Right

NO-SHOW APPOINTMENT

INTERNATIONAL CENTRE FOR APPLIED MATHEMATICAL

MODELLING & DATA ANALYTICS (ICAMMDA)

GRADUATE INTERNSHIP TRAINING (GIT)

AUTHOR: ADEDEJI AYOMIDE SAMUEL

DATE: 14th OF NOVEMBER, 2024

Introduction

Patient no-shows present a significant challenge for healthcare practices, impacting both operational efficiency and financial stability. The reasons behind patient no-shows are complex, often influenced by demographic factors, individual circumstances, and systemic issues within the healthcare environment.

Despite the efforts of healthcare managers to implement strategies aimed at reducing no-shows, many of these approaches rely on manual processes and their effectiveness may be very minimal.

This research studied the root causes of no-show behavior according to specific factors available in the dataset and illustrate how some of this factors influence the no-show rate.


Aim

The aim of this project is to provide healthcare practices with effective strategies to reduce patient no-show rates, thereby improving operational efficiency and enhancing patient care.


Question



Data Cleaning

Before we summon the powers of R and move into the mystical world of data analysis, we must first perform our sacred pre-analysis ritual by invoking the setwd() spell.

#Set the working directory for our analytical adventure
setwd("C:/Users/hp/Documents/Adedeji_R_files")

To invoke the read.csv() incantation, allowing us to summon the dataset from its resting place; we simply use a magical library called tidyverse

library(tidyverse)
data <-  read.csv("KaggleV2-May-2016.csv") # read the kaggle data and assign it

The R code below allows us to view the dimension, frequency and the distribution of the data respectively

dim(data) # To verify the number of rows and columns
## [1] 110527     14
table(is.na(data)) # to check the frequency of the missing value(s) 
## 
##   FALSE 
## 1547378
summary(data) #To check the distribution of the data
##    PatientId         AppointmentID        Gender          ScheduledDay      
##  Min.   :3.922e+04   Min.   :5030230   Length:110527      Length:110527     
##  1st Qu.:4.173e+12   1st Qu.:5640286   Class :character   Class :character  
##  Median :3.173e+13   Median :5680573   Mode  :character   Mode  :character  
##  Mean   :1.475e+14   Mean   :5675305                                        
##  3rd Qu.:9.439e+13   3rd Qu.:5725524                                        
##  Max.   :1.000e+15   Max.   :5790484                                        
##  AppointmentDay          Age         Neighbourhood       Scholarship     
##  Length:110527      Min.   : -1.00   Length:110527      Min.   :0.00000  
##  Class :character   1st Qu.: 18.00   Class :character   1st Qu.:0.00000  
##  Mode  :character   Median : 37.00   Mode  :character   Median :0.00000  
##                     Mean   : 37.09                      Mean   :0.09827  
##                     3rd Qu.: 55.00                      3rd Qu.:0.00000  
##                     Max.   :115.00                      Max.   :1.00000  
##   Hipertension       Diabetes         Alcoholism        Handcap       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1972   Mean   :0.07186   Mean   :0.0304   Mean   :0.02225  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
##   SMS_received     No.show         
##  Min.   :0.000   Length:110527     
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.321                     
##  3rd Qu.:1.000                     
##  Max.   :1.000

In our quest for clarity amidst the distribution of the Age column, we encountered an anomaly: The mysterious case of the minimum age = -1. Let us look deeper into this by applying the table() function

table(data$Age) #To check the frequency of each Age in the dataset
## 
##   -1    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##    1 3539 2273 1618 1513 1299 1489 1521 1427 1424 1372 1274 1195 1092 1103 1118 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 1211 1402 1509 1487 1545 1437 1452 1376 1349 1242 1332 1283 1377 1448 1403 1521 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46 
## 1439 1505 1524 1526 1378 1580 1533 1629 1536 1402 1346 1272 1344 1487 1453 1460 
##   47   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62 
## 1394 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312 
##   63   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78 
## 1374 1331 1101 1187  973 1012  832  724  695  615  725  602  544  571  527  541 
##   79   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94 
##  390  511  434  392  280  311  275  260  184  126  173  109   66   86   53   33 
##   95   96   97   98   99  100  102  115 
##   24   17   11    6    1    4    2    5

We employed the dplyr library to filter the Age column excluding the age=-1, ensuring that our analysis remains focused on the living rather than the -1 ghost.

Assumption: An infant of (1day <= Age < 12months ) is considered “Age=0 years old” for this research.

library(dplyr)


new_data <- data %>% filter(Age != -1) 
table(new_data$Age)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 3539 2273 1618 1513 1299 1489 1521 1427 1424 1372 1274 1195 1092 1103 1118 1211 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
## 1402 1509 1487 1545 1437 1452 1376 1349 1242 1332 1283 1377 1448 1403 1521 1439 
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47 
## 1505 1524 1526 1378 1580 1533 1629 1536 1402 1346 1272 1344 1487 1453 1460 1394 
##   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63 
## 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312 1374 
##   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79 
## 1331 1101 1187  973 1012  832  724  695  615  725  602  544  571  527  541  390 
##   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95 
##  511  434  392  280  311  275  260  184  126  173  109   66   86   53   33   24 
##   96   97   98   99  100  102  115 
##   17   11    6    1    4    2    5

To check the index of the column names of the data set and rename specific ones respectively.

data.frame(colnames(new_data)) #Returns column index numbers
colnames(new_data)[c(9,12,13,14)] <- c("Hypertension", "Handicap", "SMS_RECIEVED" ,"No_Show")
### bin the Appointment ID column(not needed in this analysis)
new_data <- new_data %>% select(-AppointmentID) 

sapply(new_data, class)
##      PatientId         Gender   ScheduledDay AppointmentDay            Age 
##      "numeric"    "character"    "character"    "character"      "integer" 
##  Neighbourhood    Scholarship   Hypertension       Diabetes     Alcoholism 
##    "character"      "integer"      "integer"      "integer"      "integer" 
##       Handicap   SMS_RECIEVED        No_Show 
##      "integer"      "integer"    "character"
#To change Gender, Handicap, Neighbourhood to factor data type
#ScheduledDay and AppointmentDay to Date datatype
#Also, Scholarship, Hypertension, Diabetes, Alcoholism, SMSReceived to boolean data type
new_data <- new_data %>%
  mutate(Gender = as.factor(Gender),
         Neighbourhood = as.factor(Neighbourhood),
         Handicap = as.factor(Handicap),
         ScheduledDay  = as.Date(ScheduledDay ),
         AppointmentDay = as.Date(AppointmentDay),
         Scholarship = as.logical(Scholarship),
         Hypertension = as.logical(Hypertension),
         Diabetes = as.logical(Diabetes),
         Alcoholism = as.logical(Alcoholism),
         SMS_RECIEVED = as.logical(SMS_RECIEVED),
         No_Show = recode(No_Show, 'No' = 0, 'Yes' = 1),
         No_Show = as.logical(No_Show)) 

str(new_data)
## 'data.frame':    110526 obs. of  13 variables:
##  $ PatientId     : num  2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
##  $ Gender        : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ScheduledDay  : Date, format: "2016-04-29" "2016-04-29" ...
##  $ AppointmentDay: Date, format: "2016-04-29" "2016-04-29" ...
##  $ Age           : int  62 56 62 8 56 76 23 39 21 19 ...
##  $ Neighbourhood : Factor w/ 81 levels "AEROPORTO","ANDORINHAS",..: 40 40 47 55 40 59 26 26 2 13 ...
##  $ Scholarship   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Hypertension  : logi  TRUE FALSE FALSE FALSE TRUE TRUE ...
##  $ Diabetes      : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Alcoholism    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Handicap      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SMS_RECIEVED  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ No_Show       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...



QUESTION 1

# Does the Time Interval Between Scheduled Day and Appointment Day Impact the No-Show Rate?


# Calculate the time interval in days
new_data <- new_data %>%
  mutate(TimeInterval = as.numeric(AppointmentDay - ScheduledDay)) 

unique(new_data$TimeInterval) #To verify the unique number in the time interval
##   [1]   0   2   3   1   4   9  29  10  23  11  18  17  14  28  24  21  15  16
##  [19]  22  43  30  31  42  32  56  45  46  39  37  38  44  50  60  52  53  65
##  [37]  67  91  66  84  78  87 115 109  63  70  72  57  58  51  59  41  49  73
##  [55]  64  20  33  34   6  35  36  12  13  40  47   8   5   7  25  26  48  27
##  [73]  19  61  55  62 176  54  77  69  83  76  89  81 103  79  68  75  85 112
##  [91]  -1  80  86  98  94 142 155 162 169 104 133 125  96  88  90 151 126 127
## [109] 111 119  74  71  82 108 110 102 122 101 105  92  97  93 107  95  -6 139
## [127] 132 179 117 146 123
# Removing the negative days value

new_data <- new_data %>% filter(TimeInterval != -1)
new_data <- new_data %>% filter(TimeInterval != -6)

unique(new_data$TimeInterval)
##   [1]   0   2   3   1   4   9  29  10  23  11  18  17  14  28  24  21  15  16
##  [19]  22  43  30  31  42  32  56  45  46  39  37  38  44  50  60  52  53  65
##  [37]  67  91  66  84  78  87 115 109  63  70  72  57  58  51  59  41  49  73
##  [55]  64  20  33  34   6  35  36  12  13  40  47   8   5   7  25  26  48  27
##  [73]  19  61  55  62 176  54  77  69  83  76  89  81 103  79  68  75  85 112
##  [91]  80  86  98  94 142 155 162 169 104 133 125  96  88  90 151 126 127 111
## [109] 119  74  71  82 108 110 102 122 101 105  92  97  93 107  95 139 132 179
## [127] 117 146 123

Assumption: zero day is equal to all appointment scheduled less than 24hrs.

# barplot for the time Interval column
ggplot(new_data, aes(TimeInterval)) +
  geom_histogram(bins = 100, fill = "skyblue", col = "black") +
  labs(title = "Distribution of days between schedule day and appointment day",
       x = 'Number of days',
       y = 'Count of appointments') 

Observation: The observation that “0 day” accounts for the most appointments suggests that a significant number of patients are seeking care on the same day, indicating a high demand for immediate healthcare services. This pattern may have several implications for healthcare providers which include ensuring that staffing and resources are adequate for same-day appointments and potentially leading to longer waiting period for non-urgent cases.


Here, we group the dataset by the time interval and calculate the no show rate for each time interval.

new_data1 <- new_data %>%
  group_by(TimeInterval) %>%
  summarize(No_Show_Rate = mean(No_Show  == "TRUE")) 
View(new_data1)

#  To visualize the relationship
ggplot(new_data1, aes(x = TimeInterval, y = No_Show_Rate)) +
  geom_line(colour = "red") +
  labs(title = "Impact of Time Interval on No-Show Rate",
       x = "Time Interval (days)",
       y = "No Show Rate")

Even though the time interval of 0 day account for the most appointment, they still tend to show up more compared to other days. Also, there seems to be a peak of no show rate between 125 to 150 days, significantly impacting the no show rate more. However, we can generally say that the no show rate increases as the days increase even though there are few exceptions.



QUESTION 2

# Are there days when patients rarely visit the hospitals to meet up the 
#appointment?




# To Extract specific day of the week from AppointmentDay
days_data <- new_data %>%
  mutate(AppointmentDayOfWeek = weekdays(AppointmentDay))
# Count the number of no-shows for each day of the week
days <- days_data %>%
  group_by(AppointmentDayOfWeek) %>%
  summarize(No_Show_Count = sum(No_Show == "TRUE"))
# Create a factor with levels in a specific order
days <- days %>%
  mutate(AppointmentDayOfWeek = factor(AppointmentDayOfWeek, 
                                       levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))

# Plot the no-show rate by day of the week
ggplot(days, aes(x = AppointmentDayOfWeek, y = No_Show_Count)) +
  geom_bar(stat = "identity", fill  = "darkorange2") +
  labs(title = "The Rate Of No Show by Day of the Week",
       x = "Days of the Week",
       y = "No Show Rate")

Even though it seems that patient show up more on Saturday, it is more as a result of less appointment rather than the attitude of the patients to show up more on saturday. Healthcare tends to fix appointment more on working days rather than weekend. We decided to probe Saturday further by uniquely checking for only those that showed up on saturday and we discovered that patients who didn’t show up are significantly high. However, we can generally conclude that the rate of patient not showing up on Monday, Tuesday, and Wednesday is significantly high compared to other days of the week.



QUESTION 3

Assumption: Alcoholism is a disease. This assumption was made according to Maltzman (1994) research titled “Why alcoholism is a disease”.

A disease is a syndrome, a lawful pattern of recurrent observable signs and symptoms (Blakiston 1979). According to Maltzman (1994), there is no precondition that the causes of the syndrome be known in order to classify the syndrome as a disease. Maltzman (1994) stated further that a syndrome is classified as a disease if it represents a significant deviation from a norm or standard of health as judged by experts. For a syndrome to be classified as a disease it must be judged a serious threat to health; it must be life threatening.

# Which Patients With Particular Diseases is likely to Come less for Appointment?



# Group and summarize the data, filtering only for TRUE conditions
diseases <- days_data %>%
  filter(Hypertension == TRUE | Diabetes == TRUE | Alcoholism == TRUE) %>%
  group_by(Hypertension, Diabetes, Alcoholism) %>%
  summarize(No_Show_Count = sum(No_Show == "TRUE"), .groups = "keep")
# To extract only those with a Particular disease
diseases <- diseases %>%
  pivot_longer(cols = c(Hypertension, Diabetes, Alcoholism), 
               names_to = "Disease", 
               values_to = "Present") %>%
  filter(Present == TRUE) #Because there are patient that are TRUE for more than one disease

# Plotting
ggplot(diseases, aes(x = Disease, y = No_Show_Count, fill = Disease)) +
  geom_bar(stat = "identity") +
  labs(title = "No Show Counts by Disease", 
       x = "Disease", 
       y = "No Show Count") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "none")

Patient with hypertension are more unlikely to show up compared to other diseases.



Question 4

# Does the Age Group Have Impact on the No-Show Rate?


# Create age groups
#The cut function divides the age variable into specified intervals
#breaks  specifies the boundaries for the age groups within -Inf to +Inf 
#It ensures that R picks the age from 0, otherwise R will ignore 0 and treat it as missing values

days_data <- days_data %>%
  mutate(AgeGroup = cut(Age, breaks = c(-Inf, 18, 36, 61, Inf), labels = c("0-17", "18-35", "36-60", "61+")))
# Calculate the no-show rate by age group
age_group <- days_data %>%
  group_by(AgeGroup) %>%
  summarize(No_Show_Rate = mean(No_Show == "TRUE"))


# Plot the no-show rate by age group
ggplot(age_group, aes(x = AgeGroup, y = No_Show_Rate)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "No Show Rate by Age Group",
       x = "Age Group",
       y = "No Show Rate")

The age group of 60+ probably tends to show up more because they are mostly retired and they have more time to actually go for the appointment compared to the working class. Hence, we can conclude that the age group has an impact on the no show rate.



Question 5

# Does the Location Have Impact on the No-Show Rate?

# Group by Neighborhood and calculate the No_Show_Rate
location <- days_data %>%
  group_by(Neighbourhood) %>%
  summarize(No_Show_Rate = mean(No_Show == "TRUE"))  #desc-descending

# Plot the no-show rate by neighbourhood
ggplot(location, aes(x = Neighbourhood, y = No_Show_Rate)) +
  geom_bar(stat = "identity", fill = "darkorange") +
  labs(title = "No Show Rate by Neighbourhood",
       x = "Neighbourhood",
       y = "No Show Rate") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # Rotate x-axis labels for better readability

The no-show rate appears to be higher in certain locations, as shown by the plot, which highlights a significant peak in the ILHAS OCEÂNICAS DE TRINDADE neighborhood. This could be due to the distance between the neighborhood and the healthcare providers or other factors.



Question 6

# Do patients with more than one disease increase no-show rate? 

# assign a variable for the number of diseases
days_data <- days_data %>%
  mutate(NumberOfDiseases = Hypertension + Diabetes + Alcoholism)

# Calculate the no-show rate by number of diseases
disease_count <- days_data %>%
  group_by(NumberOfDiseases) %>%
  summarize(`No Show Rate` = mean(No_Show =="TRUE"))


# Plot the no-show rate by the number of diseases
ggplot(disease_count, aes(x = NumberOfDiseases, y = `No Show Rate`)) +
  geom_line(stat = "identity", col = "sky blue", lwd =2) +
  geom_point(col="red")

  labs(title = "No Show Rate by Number of Diseases",
       x = "Number of Diseases",
       y = "No Show Rate")
## $x
## [1] "Number of Diseases"
## 
## $y
## [1] "No Show Rate"
## 
## $title
## [1] "No Show Rate by Number of Diseases"
## 
## attr(,"class")
## [1] "labels"

Patients with more than one disease do not consistently show an increased no-show rate. The no-show rate for patients with two diseases is slightly lower than for those with one disease, while the rate for patients with three diseases is slightly higher.



Question 7

# Is the Follow-Up Scheme Used Appropriately?



# Count the number of no-shows by SMS_RECIEVED
sms <- days_data %>%
  group_by(SMS_RECIEVED) %>%
  summarize(No_Show_Count = sum(No_Show == TRUE))



# Plot the impact of SMS on the no-show count
ggplot(sms, aes(x = SMS_RECIEVED, y = No_Show_Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Impact of SMS on No-Show", 
       x = "SMS Received",
       y = "No Show Count")

A lower no-show rate among people who receive SMS reminders suggests that sending reminders effectively encourages attendance. Hence, we can conclude that the follow up scheme was appropriately used.



References



Thank you for taking the time to review this analysis. I hope it has been informative.
Should you have any further questions or feedback, feel free to get in touch at officialadedeji@gmail.com.