Patient no-shows present a significant challenge for healthcare practices, impacting both operational efficiency and financial stability. The reasons behind patient no-shows are complex, often influenced by demographic factors, individual circumstances, and systemic issues within the healthcare environment.
Despite the efforts of healthcare managers to implement strategies aimed at reducing no-shows, many of these approaches rely on manual processes and their effectiveness may be very minimal.
This research studied the root causes of no-show behavior according to specific factors available in the dataset and illustrate how some of this factors influence the no-show rate.
The aim of this project is to provide healthcare practices with effective strategies to reduce patient no-show rates, thereby improving operational efficiency and enhancing patient care.
Before we summon the powers of R and move into the mystical world of
data analysis, we must first perform our sacred pre-analysis ritual by
invoking the setwd() spell.
#Set the working directory for our analytical adventure
setwd("C:/Users/hp/Documents/Adedeji_R_files")
To invoke the read.csv() incantation, allowing us to
summon the dataset from its resting place; we simply use a magical
library called tidyverse
library(tidyverse)
data <- read.csv("KaggleV2-May-2016.csv") # read the kaggle data and assign it
The R code below allows us to view the dimension, frequency and the distribution of the data respectively
dim(data) # To verify the number of rows and columns
## [1] 110527 14
table(is.na(data)) # to check the frequency of the missing value(s)
##
## FALSE
## 1547378
summary(data) #To check the distribution of the data
## PatientId AppointmentID Gender ScheduledDay
## Min. :3.922e+04 Min. :5030230 Length:110527 Length:110527
## 1st Qu.:4.173e+12 1st Qu.:5640286 Class :character Class :character
## Median :3.173e+13 Median :5680573 Mode :character Mode :character
## Mean :1.475e+14 Mean :5675305
## 3rd Qu.:9.439e+13 3rd Qu.:5725524
## Max. :1.000e+15 Max. :5790484
## AppointmentDay Age Neighbourhood Scholarship
## Length:110527 Min. : -1.00 Length:110527 Min. :0.00000
## Class :character 1st Qu.: 18.00 Class :character 1st Qu.:0.00000
## Mode :character Median : 37.00 Mode :character Median :0.00000
## Mean : 37.09 Mean :0.09827
## 3rd Qu.: 55.00 3rd Qu.:0.00000
## Max. :115.00 Max. :1.00000
## Hipertension Diabetes Alcoholism Handcap
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.1972 Mean :0.07186 Mean :0.0304 Mean :0.02225
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :4.00000
## SMS_received No.show
## Min. :0.000 Length:110527
## 1st Qu.:0.000 Class :character
## Median :0.000 Mode :character
## Mean :0.321
## 3rd Qu.:1.000
## Max. :1.000
In our quest for clarity amidst the distribution of the Age column,
we encountered an anomaly: The mysterious case of the minimum age = -1.
Let us look deeper into this by applying the table()
function
table(data$Age) #To check the frequency of each Age in the dataset
##
## -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 1 3539 2273 1618 1513 1299 1489 1521 1427 1424 1372 1274 1195 1092 1103 1118
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 1211 1402 1509 1487 1545 1437 1452 1376 1349 1242 1332 1283 1377 1448 1403 1521
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## 1439 1505 1524 1526 1378 1580 1533 1629 1536 1402 1346 1272 1344 1487 1453 1460
## 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
## 1394 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312
## 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
## 1374 1331 1101 1187 973 1012 832 724 695 615 725 602 544 571 527 541
## 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
## 390 511 434 392 280 311 275 260 184 126 173 109 66 86 53 33
## 95 96 97 98 99 100 102 115
## 24 17 11 6 1 4 2 5
We employed the dplyr library to filter the Age column excluding the age=-1, ensuring that our analysis remains focused on the living rather than the -1 ghost.
Assumption: An infant of (1day <= Age < 12months ) is considered “Age=0 years old” for this research.
library(dplyr)
new_data <- data %>% filter(Age != -1)
table(new_data$Age)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 3539 2273 1618 1513 1299 1489 1521 1427 1424 1372 1274 1195 1092 1103 1118 1211
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 1402 1509 1487 1545 1437 1452 1376 1349 1242 1332 1283 1377 1448 1403 1521 1439
## 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
## 1505 1524 1526 1378 1580 1533 1629 1536 1402 1346 1272 1344 1487 1453 1460 1394
## 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
## 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312 1374
## 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## 1331 1101 1187 973 1012 832 724 695 615 725 602 544 571 527 541 390
## 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
## 511 434 392 280 311 275 260 184 126 173 109 66 86 53 33 24
## 96 97 98 99 100 102 115
## 17 11 6 1 4 2 5
To check the index of the column names of the data set and rename specific ones respectively.
data.frame(colnames(new_data)) #Returns column index numbers
colnames(new_data)[c(9,12,13,14)] <- c("Hypertension", "Handicap", "SMS_RECIEVED" ,"No_Show")
### bin the Appointment ID column(not needed in this analysis)
new_data <- new_data %>% select(-AppointmentID)
sapply(new_data, class)
## PatientId Gender ScheduledDay AppointmentDay Age
## "numeric" "character" "character" "character" "integer"
## Neighbourhood Scholarship Hypertension Diabetes Alcoholism
## "character" "integer" "integer" "integer" "integer"
## Handicap SMS_RECIEVED No_Show
## "integer" "integer" "character"
#To change Gender, Handicap, Neighbourhood to factor data type
#ScheduledDay and AppointmentDay to Date datatype
#Also, Scholarship, Hypertension, Diabetes, Alcoholism, SMSReceived to boolean data type
new_data <- new_data %>%
mutate(Gender = as.factor(Gender),
Neighbourhood = as.factor(Neighbourhood),
Handicap = as.factor(Handicap),
ScheduledDay = as.Date(ScheduledDay ),
AppointmentDay = as.Date(AppointmentDay),
Scholarship = as.logical(Scholarship),
Hypertension = as.logical(Hypertension),
Diabetes = as.logical(Diabetes),
Alcoholism = as.logical(Alcoholism),
SMS_RECIEVED = as.logical(SMS_RECIEVED),
No_Show = recode(No_Show, 'No' = 0, 'Yes' = 1),
No_Show = as.logical(No_Show))
str(new_data)
## 'data.frame': 110526 obs. of 13 variables:
## $ PatientId : num 2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
## $ ScheduledDay : Date, format: "2016-04-29" "2016-04-29" ...
## $ AppointmentDay: Date, format: "2016-04-29" "2016-04-29" ...
## $ Age : int 62 56 62 8 56 76 23 39 21 19 ...
## $ Neighbourhood : Factor w/ 81 levels "AEROPORTO","ANDORINHAS",..: 40 40 47 55 40 59 26 26 2 13 ...
## $ Scholarship : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Hypertension : logi TRUE FALSE FALSE FALSE TRUE TRUE ...
## $ Diabetes : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Alcoholism : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Handicap : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ SMS_RECIEVED : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ No_Show : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
# Does the Time Interval Between Scheduled Day and Appointment Day Impact the No-Show Rate?
# Calculate the time interval in days
new_data <- new_data %>%
mutate(TimeInterval = as.numeric(AppointmentDay - ScheduledDay))
unique(new_data$TimeInterval) #To verify the unique number in the time interval
## [1] 0 2 3 1 4 9 29 10 23 11 18 17 14 28 24 21 15 16
## [19] 22 43 30 31 42 32 56 45 46 39 37 38 44 50 60 52 53 65
## [37] 67 91 66 84 78 87 115 109 63 70 72 57 58 51 59 41 49 73
## [55] 64 20 33 34 6 35 36 12 13 40 47 8 5 7 25 26 48 27
## [73] 19 61 55 62 176 54 77 69 83 76 89 81 103 79 68 75 85 112
## [91] -1 80 86 98 94 142 155 162 169 104 133 125 96 88 90 151 126 127
## [109] 111 119 74 71 82 108 110 102 122 101 105 92 97 93 107 95 -6 139
## [127] 132 179 117 146 123
# Removing the negative days value
new_data <- new_data %>% filter(TimeInterval != -1)
new_data <- new_data %>% filter(TimeInterval != -6)
unique(new_data$TimeInterval)
## [1] 0 2 3 1 4 9 29 10 23 11 18 17 14 28 24 21 15 16
## [19] 22 43 30 31 42 32 56 45 46 39 37 38 44 50 60 52 53 65
## [37] 67 91 66 84 78 87 115 109 63 70 72 57 58 51 59 41 49 73
## [55] 64 20 33 34 6 35 36 12 13 40 47 8 5 7 25 26 48 27
## [73] 19 61 55 62 176 54 77 69 83 76 89 81 103 79 68 75 85 112
## [91] 80 86 98 94 142 155 162 169 104 133 125 96 88 90 151 126 127 111
## [109] 119 74 71 82 108 110 102 122 101 105 92 97 93 107 95 139 132 179
## [127] 117 146 123
Assumption: zero day is equal to all appointment scheduled less than 24hrs.
# barplot for the time Interval column
ggplot(new_data, aes(TimeInterval)) +
geom_histogram(bins = 100, fill = "skyblue", col = "black") +
labs(title = "Distribution of days between schedule day and appointment day",
x = 'Number of days',
y = 'Count of appointments')
Observation: The observation that “0 day” accounts for the most appointments suggests that a significant number of patients are seeking care on the same day, indicating a high demand for immediate healthcare services. This pattern may have several implications for healthcare providers which include ensuring that staffing and resources are adequate for same-day appointments and potentially leading to longer waiting period for non-urgent cases.
Here, we group the dataset by the time interval and calculate the no show rate for each time interval.
new_data1 <- new_data %>%
group_by(TimeInterval) %>%
summarize(No_Show_Rate = mean(No_Show == "TRUE"))
View(new_data1)
# To visualize the relationship
ggplot(new_data1, aes(x = TimeInterval, y = No_Show_Rate)) +
geom_line(colour = "red") +
labs(title = "Impact of Time Interval on No-Show Rate",
x = "Time Interval (days)",
y = "No Show Rate")
Even though the time interval of 0 day account for the most appointment, they still tend to show up more compared to other days. Also, there seems to be a peak of no show rate between 125 to 150 days, significantly impacting the no show rate more. However, we can generally say that the no show rate increases as the days increase even though there are few exceptions.
# Are there days when patients rarely visit the hospitals to meet up the
#appointment?
# To Extract specific day of the week from AppointmentDay
days_data <- new_data %>%
mutate(AppointmentDayOfWeek = weekdays(AppointmentDay))
# Count the number of no-shows for each day of the week
days <- days_data %>%
group_by(AppointmentDayOfWeek) %>%
summarize(No_Show_Count = sum(No_Show == "TRUE"))
# Create a factor with levels in a specific order
days <- days %>%
mutate(AppointmentDayOfWeek = factor(AppointmentDayOfWeek,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))
# Plot the no-show rate by day of the week
ggplot(days, aes(x = AppointmentDayOfWeek, y = No_Show_Count)) +
geom_bar(stat = "identity", fill = "darkorange2") +
labs(title = "The Rate Of No Show by Day of the Week",
x = "Days of the Week",
y = "No Show Rate")
Even though it seems that patient show up more on Saturday, it is more as a result of less appointment rather than the attitude of the patients to show up more on saturday. Healthcare tends to fix appointment more on working days rather than weekend. We decided to probe Saturday further by uniquely checking for only those that showed up on saturday and we discovered that patients who didn’t show up are significantly high. However, we can generally conclude that the rate of patient not showing up on Monday, Tuesday, and Wednesday is significantly high compared to other days of the week.
Assumption: Alcoholism is a disease. This assumption was made according to Maltzman (1994) research titled “Why alcoholism is a disease”.
A disease is a syndrome, a lawful pattern of recurrent observable signs and symptoms (Blakiston 1979). According to Maltzman (1994), there is no precondition that the causes of the syndrome be known in order to classify the syndrome as a disease. Maltzman (1994) stated further that a syndrome is classified as a disease if it represents a significant deviation from a norm or standard of health as judged by experts. For a syndrome to be classified as a disease it must be judged a serious threat to health; it must be life threatening.
# Which Patients With Particular Diseases is likely to Come less for Appointment?
# Group and summarize the data, filtering only for TRUE conditions
diseases <- days_data %>%
filter(Hypertension == TRUE | Diabetes == TRUE | Alcoholism == TRUE) %>%
group_by(Hypertension, Diabetes, Alcoholism) %>%
summarize(No_Show_Count = sum(No_Show == "TRUE"), .groups = "keep")
# To extract only those with a Particular disease
diseases <- diseases %>%
pivot_longer(cols = c(Hypertension, Diabetes, Alcoholism),
names_to = "Disease",
values_to = "Present") %>%
filter(Present == TRUE) #Because there are patient that are TRUE for more than one disease
# Plotting
ggplot(diseases, aes(x = Disease, y = No_Show_Count, fill = Disease)) +
geom_bar(stat = "identity") +
labs(title = "No Show Counts by Disease",
x = "Disease",
y = "No Show Count") +
theme_minimal() +
scale_fill_brewer(palette = "Set1") +
theme(legend.position = "none")
Patient with hypertension are more unlikely to show up compared to other diseases.
# Does the Age Group Have Impact on the No-Show Rate?
# Create age groups
#The cut function divides the age variable into specified intervals
#breaks specifies the boundaries for the age groups within -Inf to +Inf
#It ensures that R picks the age from 0, otherwise R will ignore 0 and treat it as missing values
days_data <- days_data %>%
mutate(AgeGroup = cut(Age, breaks = c(-Inf, 18, 36, 61, Inf), labels = c("0-17", "18-35", "36-60", "61+")))
# Calculate the no-show rate by age group
age_group <- days_data %>%
group_by(AgeGroup) %>%
summarize(No_Show_Rate = mean(No_Show == "TRUE"))
# Plot the no-show rate by age group
ggplot(age_group, aes(x = AgeGroup, y = No_Show_Rate)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "No Show Rate by Age Group",
x = "Age Group",
y = "No Show Rate")
The age group of 60+ probably tends to show up more because they are mostly retired and they have more time to actually go for the appointment compared to the working class. Hence, we can conclude that the age group has an impact on the no show rate.
# Does the Location Have Impact on the No-Show Rate?
# Group by Neighborhood and calculate the No_Show_Rate
location <- days_data %>%
group_by(Neighbourhood) %>%
summarize(No_Show_Rate = mean(No_Show == "TRUE")) #desc-descending
# Plot the no-show rate by neighbourhood
ggplot(location, aes(x = Neighbourhood, y = No_Show_Rate)) +
geom_bar(stat = "identity", fill = "darkorange") +
labs(title = "No-Show Rate by Neighbourhood",
x = "Neighbourhood",
y = "No Show Rate") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x-axis labels for better readability
The no-show rate appears to be higher in certain locations, as shown by the plot, which highlights a significant peak in the ILHAS OCEÂNICAS DE TRINDADE neighborhood. This could be due to the distance between the neighborhood and the healthcare providers or other factors.
# Do patients with more than one disease increase no-show rate?
# assign a variable for the number of diseases
days_data <- days_data %>%
mutate(NumberOfDiseases = Hypertension + Diabetes + Alcoholism)
# Calculate the no-show rate by number of diseases
disease_count <- days_data %>%
group_by(NumberOfDiseases) %>%
summarize(No_Show_Rate = mean(No_Show =="TRUE"))
# Plot the no-show rate by the number of diseases
ggplot(disease_count, aes(x = NumberOfDiseases, y = No_Show_Rate)) +
geom_line(stat = "identity", col = "sky blue", lwd =2) +
geom_point(col="red")
labs(title = "No-Show Rate by Number of Diseases",
x = "Number of Diseases",
y = "No Show Rate")
## $x
## [1] "Number of Diseases"
##
## $y
## [1] "No Show Rate"
##
## $title
## [1] "No-Show Rate by Number of Diseases"
##
## attr(,"class")
## [1] "labels"
Patients with more than one disease do not consistently show an increased no-show rate. The no-show rate for patients with two diseases is slightly lower than for those with one disease, while the rate for patients with three diseases is slightly higher.
# Is the Follow-Up Scheme Used Appropriately?
# Count the number of no-shows by SMS_RECIEVED
sms <- days_data %>%
group_by(SMS_RECIEVED) %>%
summarize(No_Show_Count = sum(No_Show == TRUE))
# Plot the impact of SMS on the no-show count
ggplot(sms, aes(x = SMS_RECIEVED, y = No_Show_Count)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Impact of SMS on No-Show",
x = "SMS Received",
y = "No Show Count")
A lower no-show rate among people who receive SMS reminders suggests that sending reminders effectively encourages attendance. Hence, we can conclude that the follow up scheme was appropriately used.
Thank you for taking the time to review this analysis. I hope it
has been informative.
Should you have any further questions or feedback, feel free to get
in touch at officialadedeji@gmail.com.