Nearly 30% of medical appointments eventually end up canceling or not showing up. This, in turn, incurs a sizeable loss to the Hospitals / Clinics that employ specialists paid per hour. This model tries to predict the chances that the appointment would land up being a no-show to mitigate that additional cost
From the given list of parameters, such as Gender, Scheduled and Appointment Date difference, Age, Neighbourhood, sms_alert and certain conditions of a given patient, the model will predict, based on a studied dataset, whether the given appointment is likely to be a No-Show
To weigh different parameters mentioned above, there will be one dependent variable (show/no-show variable) that will be predicted for a true or false against various regressor variables. We will run a multivariate linear regression to identify most significant factors influencing a No-Show appointment. Once the factors are identified, we’d feed these factors to different machine learning models and choose the most accurate one.
Once the factors influencing a no-show appointment is clearly identified, the hospital authorities can then ensure, while making the appointment, whether it is a good idea to be making that appointment or not. For example, if the model predicts a significant relation of date difference to failure in showing up for an appointment, the authorities might choose to select a narrower window for the booking the appointment. It would be interesting to find out whether the “Send_sms” is really benefiting the hospital in retaining the patients.
library('tidyverse')
library('lubridate')
library('gridExtra')
We would extensively use tidyverse package for cleaning, tidying, manipulating and visualizing the data. The broom package within tidyverse will help us do post regression analysis, identifying outliers and influence points.
The data used in this sample is obtained for the city of Vitoria, Brazil from Kaggle.
The above dataset is one of the featured dataset of Kaggle and initially included 14 variables with close to 110,000 observations of total patient appointments gathered from various hospitals of the city. The neighbourhood variable displays the region where the hospital is located. The data is real (not simulated) from the appointment histories of the hospitals.
The data was initially posted on May, 2016 but last updated on Feb, 2017.
appointment.raw <- read_csv("KaggleV2-May-2016.csv")
glimpse(appointment.raw)
## Observations: 110,527
## Variables: 14
## $ PatientId <dbl> 2.987250e+13, 5.589978e+14, 4.262962e+12, 8.679...
## $ AppointmentID <int> 5642903, 5642503, 5642549, 5642828, 5642494, 56...
## $ Gender <chr> "F", "M", "F", "F", "F", "F", "F", "F", "F", "F...
## $ ScheduledDay <dttm> 2016-04-29 18:38:08, 2016-04-29 16:08:27, 2016...
## $ AppointmentDay <dttm> 2016-04-29, 2016-04-29, 2016-04-29, 2016-04-29...
## $ Age <int> 62, 56, 62, 8, 56, 76, 23, 39, 21, 19, 30, 29, ...
## $ Neighbourhood <chr> "JARDIM DA PENHA", "JARDIM DA PENHA", "MATA DA ...
## $ Scholarship <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ Hipertension <int> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Diabetes <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Alcoholism <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Handcap <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ SMS_received <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,...
## $ `No-show` <chr> "No", "No", "No", "No", "No", "No", "Yes", "Yes...
After an initial glimpse at the data, we notice that there are many int and char values which need to be converted to logical and factor values.
# Converting categorical variables to factors
appointment.raw$Gender <- factor(appointment.raw$Gender, levels = c("M", "F"))
appointment.raw$`NoShow` <- factor((appointment.raw$`No-show`))
appointment.raw$Neighbourhood <- factor((appointment.raw$Neighbourhood))
# Converting all logical flags
appointment.raw$Diabetes <- as.logical(appointment.raw$Diabetes)
appointment.raw$Alcoholism <- as.logical(appointment.raw$Alcoholism)
appointment.raw$Hipertension <- as.logical(appointment.raw$Hipertension)
appointment.raw$Handcap <- as.logical(appointment.raw$Handcap)
appointment.raw$Scholarship <- as.logical(appointment.raw$Scholarship)
appointment.raw$SMS_received <- as.logical(appointment.raw$SMS_received)
dim(appointment.raw)
## [1] 110527 15
We observe that there are 110527 observations and 15 variables in the data.
A quick look at the summary of the data will give us a clearer picture about the missing values, outliers and error values, if any
summary(appointment.raw)
## PatientId AppointmentID Gender
## Min. :3.922e+04 Min. :5030230 M:38687
## 1st Qu.:4.173e+12 1st Qu.:5640286 F:71840
## Median :3.173e+13 Median :5680573
## Mean :1.475e+14 Mean :5675305
## 3rd Qu.:9.439e+13 3rd Qu.:5725524
## Max. :1.000e+15 Max. :5790484
##
## ScheduledDay AppointmentDay
## Min. :2015-11-10 07:13:56 Min. :2016-04-29 00:00:00
## 1st Qu.:2016-04-29 10:27:01 1st Qu.:2016-05-09 00:00:00
## Median :2016-05-10 12:13:17 Median :2016-05-18 00:00:00
## Mean :2016-05-09 07:49:15 Mean :2016-05-19 00:57:50
## 3rd Qu.:2016-05-20 11:18:37 3rd Qu.:2016-05-31 00:00:00
## Max. :2016-06-08 20:07:23 Max. :2016-06-08 00:00:00
##
## Age Neighbourhood Scholarship Hipertension
## Min. : -1.00 JARDIM CAMBURI : 7717 Mode :logical Mode :logical
## 1st Qu.: 18.00 MARIA ORTIZ : 5805 FALSE:99666 FALSE:88726
## Median : 37.00 RESISTÊNCIA : 4431 TRUE :10861 TRUE :21801
## Mean : 37.09 JARDIM DA PENHA: 3877
## 3rd Qu.: 55.00 ITARARÉ : 3514
## Max. :115.00 CENTRO : 3334
## (Other) :81849
## Diabetes Alcoholism Handcap SMS_received
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:102584 FALSE:107167 FALSE:108286 FALSE:75045
## TRUE :7943 TRUE :3360 TRUE :2241 TRUE :35482
##
##
##
##
## No-show NoShow
## Length:110527 No :88208
## Class :character Yes:22319
## Mode :character
##
##
##
##
From the above summary we notice that there are no missing values (NA), but there are clearly some error fields such as negative Age
range(appointment.raw$Age)
## [1] -1 115
There are 1 negative ages, which are clear errors
boxplot(appointment.raw$Age)
From the boxplot, we observe an outlier point slightly ahead of 100. Let’s examine them.
hist(appointment.raw$Age)
There are sizeable number of records (close to 4000) that have age 0. Either these are talking about new born babies or this maybe erroneous data points. Dropping 4000 complete observations at this time might not be a good idea, we will treat these values of the variable later. Also, there are 5 ages greater or equal than 105. This, however, could be real data so we will retain this.
We remove the clear Age outlier appointment.raw:
appointment.raw <- appointment.raw[appointment.raw$Age>=0,]
table(appointment.raw$NoShow)
##
## No Yes
## 88207 22319
We see that 22319 out of 110526 have not showed up for their appointments.
appointment.raw %>% group_by(Neighbourhood) %>% filter(n()>1000) %>%
ggplot(aes(x=Neighbourhood, fill = NoShow)) + geom_bar() + coord_flip()
The above plot displays neighbourhoods with over 1000 appointments in the data set.
We will now analyze the appointmentID and patientID fields.
appointment.raw %>% group_by(AppointmentID) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head()
## # A tibble: 6 x 2
## AppointmentID n
## <int> <int>
## 1 5030230 1
## 2 5122866 1
## 3 5134197 1
## 4 5134220 1
## 5 5134223 1
## 6 5134224 1
appointment.raw %>% group_by(PatientId) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head()
## # A tibble: 6 x 2
## PatientId n
## <dbl> <int>
## 1 8.221459e+14 88
## 2 9.963767e+10 84
## 3 2.688613e+13 70
## 4 3.353478e+13 65
## 5 2.584244e+11 62
## 6 6.264199e+12 62
The appointmentID is unique everytime but the patientId is common to a single patient which has made multiple visits. PatientID could be more fruitful in finding recurring trend with a given patient but appointmentID might not be too useful in our analysis so we can drop this column.
appointment.cleaned <- appointment.raw[,-2]
DaysToAppointment for difference between appointment date and scheduling dateappointment.cleaned <- mutate(appointment.raw, DaysToAppointment = as.numeric(as.Date(AppointmentDay,format='%m-%d-%Y') - as.Date(ScheduledDay,format='%m-%d-%Y')))
We will use this additional variable to make a prediction about show/no-show as it is a better regressor variable than the actual dates. Let us analyze the new column we got for outliers and errors.
range(appointment.cleaned$DaysToAppointment)
## [1] -6 179
Since we know that the future date cannot be negative, there are some error fields in the observations. We need to get rid of these error values.
appointment.cleaned %>% filter(as.numeric(as.Date(AppointmentDay,format='%m-%d-%Y')) < as.numeric(as.Date(ScheduledDay,format='%m-%d-%Y')))
## # A tibble: 5 x 16
## PatientId AppointmentID Gender ScheduledDay AppointmentDay
## <dbl> <int> <fctr> <dttm> <dttm>
## 1 7.839273e+12 5679978 M 2016-05-10 10:51:53 2016-05-09
## 2 7.896294e+12 5715660 F 2016-05-18 14:50:41 2016-05-17
## 3 2.425226e+13 5664962 F 2016-05-05 13:43:58 2016-05-04
## 4 9.982316e+14 5686628 F 2016-05-11 13:49:20 2016-05-05
## 5 3.787482e+12 5655637 M 2016-05-04 06:50:57 2016-05-03
## # ... with 11 more variables: Age <int>, Neighbourhood <fctr>,
## # Scholarship <lgl>, Hipertension <lgl>, Diabetes <lgl>,
## # Alcoholism <lgl>, Handcap <lgl>, SMS_received <lgl>, `No-show` <chr>,
## # NoShow <fctr>, DaysToAppointment <dbl>
The above 5 rows can be marked as errors since the appointment dates need to be in future. These could be back entries or typing errors.
appointment.cleaned <- appointment.cleaned[as.numeric(as.Date(appointment.cleaned$AppointmentDay,format='%m-%d-%Y')) >= as.numeric(as.Date(appointment.cleaned$ScheduledDay,format='%m-%d-%Y')),]
A glimpse at the cleaned data set reveals 14 variables now with changed data types and row count reduced by 6 observations which essentially indicates how clean the data initially was.
glimpse(appointment.cleaned)
## Observations: 110,521
## Variables: 16
## $ PatientId <dbl> 2.987250e+13, 5.589978e+14, 4.262962e+12, 8....
## $ AppointmentID <int> 5642903, 5642503, 5642549, 5642828, 5642494,...
## $ Gender <fctr> F, M, F, F, F, F, F, F, F, F, F, M, F, M, F...
## $ ScheduledDay <dttm> 2016-04-29 18:38:08, 2016-04-29 16:08:27, 2...
## $ AppointmentDay <dttm> 2016-04-29, 2016-04-29, 2016-04-29, 2016-04...
## $ Age <int> 62, 56, 62, 8, 56, 76, 23, 39, 21, 19, 30, 2...
## $ Neighbourhood <fctr> JARDIM DA PENHA, JARDIM DA PENHA, MATA DA P...
## $ Scholarship <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
## $ Hipertension <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE...
## $ Diabetes <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FAL...
## $ Alcoholism <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
## $ Handcap <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
## $ SMS_received <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
## $ `No-show` <chr> "No", "No", "No", "No", "No", "No", "Yes", "...
## $ NoShow <fctr> No, No, No, No, No, No, Yes, Yes, No, No, N...
## $ DaysToAppointment <dbl> 0, 0, 0, 0, 0, 2, 2, 2, 0, 2, 2, 3, 1, 1, 1,...
Summary of cleaned data
summary(appointment.cleaned)
## PatientId AppointmentID Gender
## Min. :3.922e+04 Min. :5030230 M:38685
## 1st Qu.:4.172e+12 1st Qu.:5640284 F:71836
## Median :3.173e+13 Median :5680573
## Mean :1.475e+14 Mean :5675304
## 3rd Qu.:9.439e+13 3rd Qu.:5725524
## Max. :1.000e+15 Max. :5790484
##
## ScheduledDay AppointmentDay
## Min. :2015-11-10 07:13:56 Min. :2016-04-29 00:00:00
## 1st Qu.:2016-04-29 10:26:51 1st Qu.:2016-05-09 00:00:00
## Median :2016-05-10 12:13:17 Median :2016-05-18 00:00:00
## Mean :2016-05-09 07:48:50 Mean :2016-05-19 00:58:20
## 3rd Qu.:2016-05-20 11:18:39 3rd Qu.:2016-05-31 00:00:00
## Max. :2016-06-08 20:07:23 Max. :2016-06-08 00:00:00
##
## Age Neighbourhood Scholarship Hipertension
## Min. : 0.00 JARDIM CAMBURI : 7717 Mode :logical Mode :logical
## 1st Qu.: 18.00 MARIA ORTIZ : 5805 FALSE:99660 FALSE:88720
## Median : 37.00 RESISTÊNCIA : 4430 TRUE :10861 TRUE :21801
## Mean : 37.09 JARDIM DA PENHA: 3877
## 3rd Qu.: 55.00 ITARARÉ : 3514
## Max. :115.00 CENTRO : 3334
## (Other) :81844
## Diabetes Alcoholism Handcap SMS_received
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:102578 FALSE:107161 FALSE:108282 FALSE:75039
## TRUE :7943 TRUE :3360 TRUE :2239 TRUE :35482
##
##
##
##
## No-show NoShow DaysToAppointment
## Length:110521 No :88207 Min. : 0.00
## Class :character Yes:22314 1st Qu.: 0.00
## Mode :character Median : 4.00
## Mean : 10.18
## 3rd Qu.: 15.00
## Max. :179.00
##
We noticed a very high percentage of Patients that didn’t show up for their appointments from the data presented above. Let us explore further, using ggplot about different perspectives of that.
ggplot(appointment.cleaned, aes(x=NoShow, fill = SMS_received)) + geom_bar(position = "fill")
From the above graph, contrary to the usual belief, it appears that sending of SMS to the patient for their appointments isn’t making a big difference. Perhaps the sms is sent when the appointment is made and there could be a reminder a day prior to the scheduled date.
We would look to explore how DaysToApppointment affects the NoShow numbers. For this, we will group the data by DaysToAppointment and find out the proportion of no show for that given day.
appointment.cleaned %>% group_by(DaysToAppointment) %>% filter(n() > 8) %>% summarise(number = sum(NoShow == 'Yes')/n()) %>% filter(number > 0) %>% top_n(n=10)
## Selecting by number
## # A tibble: 10 x 2
## DaysToAppointment number
## <dbl> <dbl>
## 1 25 0.4018838
## 2 40 0.4055118
## 3 55 0.4495413
## 4 60 0.4246575
## 5 78 0.3939394
## 6 79 0.3913043
## 7 80 0.4444444
## 8 81 0.4000000
## 9 89 0.4117647
## 10 155 0.6000000
It is seen that most number of no shows occur as the difference in the dates increase. It would be interesting to check out the fraction of noshows with increasing days to appointment.
appointment.cleaned %>% group_by(DaysToAppointment) %>% filter(n() > 8) %>% summarise(number = sum(NoShow == 'Yes')/n()) %>% filter(number > 0) %>%
ggplot(aes(x=DaysToAppointment,y=number)) + geom_point(alpha = 0.4)
Unsurprisingly, the proportions of Noshow are least for 0th day since one would imagine that the patients would wait till their appointment time. For the other days, the NoShow rate hovers between .2 to .4 for most of them. To see if there is any impact of Gender in this, we exploit the fill function for both men and women.
appointment.cleaned %>% group_by(DaysToAppointment,Gender,NoShow) %>%
summarize(total.count=n()) %>% filter(total.count>10) %>%
ggplot(aes(x=DaysToAppointment,y = total.count, fill=NoShow)) + geom_bar(stat="identity", position = "fill") + facet_grid(.~Gender) +
coord_cartesian(xlim = c(0,40))
No fixed pattern separates male and female patients in terms of missing their appointments when plotted with their days leading to the appointment.
To see how Show/Noshow vary across gender, we take a look at the following bar plot.
ggplot(appointment.cleaned, aes(x=Gender, fill=NoShow)) + geom_bar(position = "fill")
There appears to be no significant difference in the proportion of male/female in their No show for appointments. Both look steady at around 20%.
Let’s take a closer look into different age groups contributing to frequent No Show. We would also like plot the gender to see any pattern across age and gender that contribute to NoShow
ggplot(appointment.cleaned, aes(x=NoShow, y=Age, col=Gender)) + geom_boxplot()
The above boxplots appear to indicate that there is no major difference between different ages and genders except that younger females, for some reason, have slightly higher NoShow rate when compared to the rest.
As an analyst, it pays to dig deeper where no one would generally look. Even though the Scheduled Day for an appointment, at first, doesn’t sound like it can have any impact on No Show. Let’s look deeper into this.
appointment.cleaned %>% group_by(RegDate=as.Date(ScheduledDay)) %>% filter(n() > 8) %>%
summarise(total.noshow=sum(NoShow=="Yes")/n()) %>% filter(total.noshow > 0) %>% arrange(desc(total.noshow)) %>%
ggplot(aes(x=RegDate, y=total.noshow)) + geom_point(alpha=0.6) + geom_smooth(method = "lm", se = F)
It does appear that, save for a few outliers, there is a general downward trend as indicated by the regression line. This implies that, given a minimum 8 appointments in a day, the proportion of NoShows decrease with as we move towards the month of June. If we look at our data, this consolidates our earlier finding that greater the difference between scheduled date to appointment date, lesser are the chances of showing up. Since most of the appointment data is concentrated around May to June, the scheduled dates from May / June have much lower No Show proportion than the ones which are scheduled much earlier
It will be interesting to view the appointment date trend with their respective days of the week, if the lazy weekends account for more Noshow or whether its the busy weekdays. Let’s find out using the graph below:
appointment.cleaned$DayAppointment <- weekdays(appointment.cleaned$AppointmentDay)
appointment.cleaned %>% group_by(DayAppointment, Gender) %>% summarize(total.noShowPropotion = sum(NoShow == 'Yes')) %>%
ggplot(aes(x=DayAppointment, y =total.noShowPropotion,fill=Gender)) + geom_col()
Lack of data for Saturdays prevents any statistical analysis for that day. In general, there doesn’t appear to be any patterns between other days of the week in terms of No Show.
The following fill plot confirm that notion
appointment.cleaned %>% group_by(DayAppointment, Gender) %>% filter(n() > 1000) %>%
summarize(total.noShowPropotion = sum(NoShow == 'Yes')/n()) %>%
ggplot(aes(x=DayAppointment, y =total.noShowPropotion,fill=Gender)) + geom_col() + facet_grid(.~Gender)
Since we are given the time of Scheduling Date, let’s find out if there are any trends for the time slots of reservation for appointment v/s their NoShow.
appointment.cleaned %>% group_by(RegistrationHour=hour(ScheduledDay),Gender) %>% filter(n()>10) %>%
summarise(proportion.noshow=sum(NoShow=="Yes")/n()) %>% arrange(desc(RegistrationHour)) %>%
ggplot(aes(x=RegistrationHour, y=proportion.noshow, fill=Gender)) +
geom_bar(stat = "identity")
From the figure, we can see that most of the NoShows are the ones that took the appointment late in the day. This trend might hint at the hospitals to stop their reservation hours slightly sooner.
From the given Neighbourhoods, lets see how they contribute to NoShow
appointment.cleaned %>% group_by(Neighbourhood) %>%
filter(n()>100) %>% summarise(proportion.noshow=sum(NoShow=="Yes")/n()) %>%
ggplot(aes(x=Neighbourhood, y=proportion.noshow)) +
geom_bar(stat="identity") + coord_flip()
For regions with atleast 100 appointments, it appears that quite a few of them have greater than .2 i.e. 1/5th no show ratio. Also, the clutter in the above graph makes for a tough viewing. Let’s analyze the regions for any specific trends by the region. To avoid the clutter, we would observe only those neighbourhoods with over 1000 appointments of which atleast 22% were missed. These are eventually the neighbourhoods that would be of bigger concern since they contribute the most to NoShow.
appointment.cleaned %>%
group_by(Neighbourhood) %>%
filter(n()>1000) %>%
summarize(NoShowProportion = sum(NoShow == 'Yes')/n()) %>%
filter(NoShowProportion > 0.22) %>%
inner_join(appointment.cleaned) %>%
select(Neighbourhood,Gender,NoShow) %>%
filter(NoShow == 'Yes') %>%
ggplot(aes(x=Neighbourhood, fill = Gender)) +
geom_bar(position = "fill") + coord_flip()
## Joining, by = "Neighbourhood"
Of the above-mentioned regions, no specific trends surface based on genders. There is slightly higher than usual male proportion missing their appointments in Praia Do Sua. Let’s add our other flag variables of Alcoholism and Scholarship to analyze further.
appointment.cleaned %>%
group_by(Neighbourhood) %>%
filter(n()>1000) %>%
summarize(NoShowProportion = sum(NoShow == 'Yes')/n()) %>%
filter(NoShowProportion > 0.22) %>%
inner_join(appointment.cleaned) %>%
select(Neighbourhood,Gender,NoShow, Alcoholism, Scholarship) %>%
filter(NoShow == 'Yes') %>%
ggplot(aes(x=Neighbourhood, fill = Alcoholism)) +
geom_bar(position = "fill") + coord_flip() + facet_grid(.~Gender)
## Joining, by = "Neighbourhood"
Interestingly, we discover the region of Praia Do sua has high Alcoholism rate than usual. This could be one of the reasons contributing to NoShow in general.
appointment.cleaned %>%
group_by(Neighbourhood) %>%
filter(n()>1000) %>%
summarize(NoShowProportion = sum(NoShow == 'Yes')/n()) %>%
filter(NoShowProportion > 0.22) %>%
inner_join(appointment.cleaned) %>%
select(Neighbourhood,Gender,NoShow, Alcoholism, Scholarship) %>%
filter(NoShow == 'Yes') %>%
ggplot(aes(x=Neighbourhood, fill = Scholarship)) +
geom_bar(position = "fill") + coord_flip()
## Joining, by = "Neighbourhood"
While for Praia Do Sua, the most plaguing factor was Alcoholism, in Ilha Do Principe and Gurgica, almost 1/4th of the total missed appointments have scholarships. Since scholarships are, in general, granted to the poor sectors of the society, these could potentially be the neighbourhoods with higher poverty rate.
In the light of specific group of people missing out on their appointments in each neighbourhood, let’s also look at different age groups that missed their appointments in these regions. We’ll create a new column for age category and create three groups - Kids, Young and Old and see how different regions saw which categories missing their appointments the most.
appointment.cleaned %>%
group_by(Neighbourhood) %>%
summarise(MoreThanThousand = n()>1000) %>%
filter(MoreThanThousand) %>%
inner_join(appointment.cleaned) %>%
mutate(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old"))) %>%
select(Neighbourhood,AgeCat,Age,NoShow) %>%
group_by(Neighbourhood,AgeCat) %>%
summarise(proportion.noshow=sum(NoShow=="Yes")/n()) %>%
ggplot(aes(x=Neighbourhood, y=proportion.noshow, fill = AgeCat)) +
geom_bar(stat="identity", position = "fill") + coord_flip()
## Joining, by = "Neighbourhood"
The above graph displays neighbourhoods with more than 1000 missed apppointments and their age category split.
appointment.cleaned %>%
group_by(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old")), Scholarship, Gender) %>%
summarise(no.show = sum(NoShow=="Yes")) %>%
ggplot(aes(x=AgeCat, y=no.show, fill = Scholarship)) +
geom_bar(stat="identity") + facet_grid(.~Gender)
It appears that young females who have access to Scholarship(medicare concession) tend to have higher number of no-show for their appointments. It could also mean that poor women tend not to care about their medical appointments as much.
appointment.cleaned %>% group_by(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old")), Alcoholism, Gender) %>%
summarise(no.show = sum(NoShow=="Yes")) %>%
ggplot(aes(x=AgeCat, y=no.show, fill = Alcoholism)) +
geom_bar(stat="identity", position = "fill") + facet_grid(.~Gender)
Interestingly, we observe older men with Alcoholism tend to show much up lesser for their appointments. These parameters could play a critical part in creating our final model that predicts a show/no-show of a patient.
For the other parameters, there aren’t any specific trends/patterns observed.
plot_diabetes <- appointment.cleaned %>% group_by(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old")), Diabetes, Gender) %>%
summarise(no.show = sum(NoShow=="Yes")) %>%
ggplot(aes(x=AgeCat, y=no.show, fill = Diabetes)) +
geom_bar(stat="identity") + facet_grid(.~Gender)
plot_htension <- appointment.cleaned %>% group_by(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old")), Hipertension, Gender) %>%
summarise(no.show = sum(NoShow=="Yes")) %>%
ggplot(aes(x=AgeCat, y=no.show, fill = Hipertension)) +
geom_bar(stat="identity") + facet_grid(.~Gender)
plot_Handicap <- appointment.cleaned %>% group_by(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old")), Handcap, Gender) %>%
summarise(no.show = sum(NoShow=="Yes")) %>%
ggplot(aes(x=AgeCat, y=no.show, fill = Handcap)) +
geom_bar(stat="identity", position = "fill") + facet_grid(.~Gender)
grid.arrange(plot_htension, plot_Handicap, plot_diabetes, nrow=3, top='Inconclusive patterns for Hypertension, Diabetes and Handicap flag')
Let’s look at the top missed appointments by patients and if there is any interesting facts about the profiles of the patients that missed their appointments the most.
appointment.cleaned %>%
group_by(PatientId) %>%
summarise(MissedAppointments=sum(NoShow=="Yes"), ProportionMissed = MissedAppointments / n()) %>%
filter(MissedAppointments>10) %>%
arrange(desc(MissedAppointments)) %>%
inner_join(appointment.cleaned) %>%
select(PatientId,MissedAppointments,ProportionMissed, Gender, Age, Scholarship) %>%
unique()
## Joining, by = "PatientId"
## # A tibble: 11 x 6
## PatientId MissedAppointments ProportionMissed Gender Age
## <dbl> <int> <dbl> <fctr> <int>
## 1 1.421987e+12 18 1.0000000 M 9
## 2 5.635135e+14 16 1.0000000 M 9
## 3 5.587790e+12 15 0.7500000 M 14
## 4 5.811973e+14 14 1.0000000 F 7
## 5 6.575144e+13 13 1.0000000 M 14
## 6 1.198157e+12 12 0.5217391 M 10
## 7 1.198157e+12 12 0.5217391 M 11
## 8 4.768616e+11 11 0.9166667 F 43
## 9 2.728422e+12 11 0.7333333 F 44
## 10 9.715136e+12 11 0.5500000 F 12
## 11 2.491637e+14 11 0.6111111 M 8
## # ... with 1 more variables: Scholarship <lgl>
From the above patients that missed their appointments the most, four of them have Scholarship aid given by the government. Given that our dataset is dominated by female population, the top missed appointments feature more than 60% male population.
Narrow Scheduling Window: Within a day, since high number of NoShow appointments were found to be scheduled after 8 PM, the authorities can look to reduce the appointment schedules (unless there is an emergency) to day times. The length of booking date must also be kept within a narrower range as we see that higher the days to appointment, higher is the NoShow Rate.
Patient Profiling: The above findings display interesting trends about different segments of the society and their relation with NoShow. These must be profiled by the Hospital authorities before booking an appointment to ensure that NoShow rate is least. The profiling could be like the way credit score works, for example, if the patient is an older male with alcoholism or a younger female with scholarship, chances are they might not turn up for the appointment if it is scheduled for a later time. Thus, the patient profile must be used before allotting an appointment date.
We observed insightful details about noshows pertaining to different neighbourhoods. it will be interesting to map these neighbourhoods’ additional data to see if we could root cause the findings. For example, certain neighbourhoods indicated the no show patients to be only females. If the crime rate or various other society data could be extracted, it would give a clearer insight to the problem and how the NoShow can be bettered.
Basic Machine learning techniques can be applied to train a model that would input a patient’s profile to predict his/her chances of not showing up.