INTRODUCTION

PACKAGES REQUIRED

library('tidyverse')
library('lubridate')
library('gridExtra')

We would extensively use tidyverse package for cleaning, tidying, manipulating and visualizing the data. The broom package within tidyverse will help us do post regression analysis, identifying outliers and influence points.

DATA PREPARATION

The data used in this sample is obtained for the city of Vitoria, Brazil from Kaggle.

The above dataset is one of the featured dataset of Kaggle and initially included 14 variables with close to 110,000 observations of total patient appointments gathered from various hospitals of the city. The neighbourhood variable displays the region where the hospital is located. The data is real (not simulated) from the appointment histories of the hospitals.

The data was initially posted on May, 2016 but last updated on Feb, 2017.

appointment.raw <- read_csv("KaggleV2-May-2016.csv")
glimpse(appointment.raw)
## Observations: 110,527
## Variables: 14
## $ PatientId      <dbl> 2.987250e+13, 5.589978e+14, 4.262962e+12, 8.679...
## $ AppointmentID  <int> 5642903, 5642503, 5642549, 5642828, 5642494, 56...
## $ Gender         <chr> "F", "M", "F", "F", "F", "F", "F", "F", "F", "F...
## $ ScheduledDay   <dttm> 2016-04-29 18:38:08, 2016-04-29 16:08:27, 2016...
## $ AppointmentDay <dttm> 2016-04-29, 2016-04-29, 2016-04-29, 2016-04-29...
## $ Age            <int> 62, 56, 62, 8, 56, 76, 23, 39, 21, 19, 30, 29, ...
## $ Neighbourhood  <chr> "JARDIM DA PENHA", "JARDIM DA PENHA", "MATA DA ...
## $ Scholarship    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ Hipertension   <int> 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Diabetes       <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Alcoholism     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Handcap        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ SMS_received   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,...
## $ `No-show`      <chr> "No", "No", "No", "No", "No", "No", "Yes", "Yes...

After an initial glimpse at the data, we notice that there are many int and char values which need to be converted to logical and factor values.

# Converting categorical variables to factors
appointment.raw$Gender <- factor(appointment.raw$Gender, levels = c("M", "F"))
appointment.raw$`NoShow` <- factor((appointment.raw$`No-show`))
appointment.raw$Neighbourhood <- factor((appointment.raw$Neighbourhood))

# Converting all logical flags 
appointment.raw$Diabetes <- as.logical(appointment.raw$Diabetes)
appointment.raw$Alcoholism <- as.logical(appointment.raw$Alcoholism)
appointment.raw$Hipertension <- as.logical(appointment.raw$Hipertension)
appointment.raw$Handcap <- as.logical(appointment.raw$Handcap)
appointment.raw$Scholarship <- as.logical(appointment.raw$Scholarship)
appointment.raw$SMS_received <- as.logical(appointment.raw$SMS_received)
dim(appointment.raw)
## [1] 110527     15

We observe that there are 110527 observations and 15 variables in the data.

A quick look at the summary of the data will give us a clearer picture about the missing values, outliers and error values, if any

summary(appointment.raw)
##    PatientId         AppointmentID     Gender   
##  Min.   :3.922e+04   Min.   :5030230   M:38687  
##  1st Qu.:4.173e+12   1st Qu.:5640286   F:71840  
##  Median :3.173e+13   Median :5680573            
##  Mean   :1.475e+14   Mean   :5675305            
##  3rd Qu.:9.439e+13   3rd Qu.:5725524            
##  Max.   :1.000e+15   Max.   :5790484            
##                                                 
##   ScheduledDay                 AppointmentDay               
##  Min.   :2015-11-10 07:13:56   Min.   :2016-04-29 00:00:00  
##  1st Qu.:2016-04-29 10:27:01   1st Qu.:2016-05-09 00:00:00  
##  Median :2016-05-10 12:13:17   Median :2016-05-18 00:00:00  
##  Mean   :2016-05-09 07:49:15   Mean   :2016-05-19 00:57:50  
##  3rd Qu.:2016-05-20 11:18:37   3rd Qu.:2016-05-31 00:00:00  
##  Max.   :2016-06-08 20:07:23   Max.   :2016-06-08 00:00:00  
##                                                             
##       Age                 Neighbourhood   Scholarship     Hipertension   
##  Min.   : -1.00   JARDIM CAMBURI : 7717   Mode :logical   Mode :logical  
##  1st Qu.: 18.00   MARIA ORTIZ    : 5805   FALSE:99666     FALSE:88726    
##  Median : 37.00   RESISTÊNCIA    : 4431   TRUE :10861     TRUE :21801    
##  Mean   : 37.09   JARDIM DA PENHA: 3877                                  
##  3rd Qu.: 55.00   ITARARÉ        : 3514                                  
##  Max.   :115.00   CENTRO         : 3334                                  
##                   (Other)        :81849                                  
##   Diabetes       Alcoholism       Handcap        SMS_received   
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:102584    FALSE:107167    FALSE:108286    FALSE:75045    
##  TRUE :7943      TRUE :3360      TRUE :2241      TRUE :35482    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##    No-show          NoShow     
##  Length:110527      No :88208  
##  Class :character   Yes:22319  
##  Mode  :character              
##                                
##                                
##                                
## 

From the above summary we notice that there are no missing values (NA), but there are clearly some error fields such as negative Age

Individual variable analysis

  • Age - The age field has a negative value which definitely is an error.
range(appointment.raw$Age)
## [1]  -1 115

There are 1 negative ages, which are clear errors

boxplot(appointment.raw$Age)

From the boxplot, we observe an outlier point slightly ahead of 100. Let’s examine them.

hist(appointment.raw$Age)

There are sizeable number of records (close to 4000) that have age 0. Either these are talking about new born babies or this maybe erroneous data points. Dropping 4000 complete observations at this time might not be a good idea, we will treat these values of the variable later. Also, there are 5 ages greater or equal than 105. This, however, could be real data so we will retain this.

We remove the clear Age outlier appointment.raw:

appointment.raw <- appointment.raw[appointment.raw$Age>=0,]
  • Show / No-Show - We will now look at the distribution of show and no-show variable. For this we will plot the frequency using table function.
table(appointment.raw$NoShow)
## 
##    No   Yes 
## 88207 22319

We see that 22319 out of 110526 have not showed up for their appointments.

  • Neighbourhood - This is a factor variable for which we would be keen to identify if there is a demographical relation to no show of appointments. Let’s observe the spread of neightbourhood variable
  appointment.raw %>% group_by(Neighbourhood) %>% filter(n()>1000) %>% 
    ggplot(aes(x=Neighbourhood, fill = NoShow)) + geom_bar() + coord_flip()

The above plot displays neighbourhoods with over 1000 appointments in the data set.

  • AppointmentID and PatientID Columns

We will now analyze the appointmentID and patientID fields.

appointment.raw %>% group_by(AppointmentID) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head()
## # A tibble: 6 x 2
##   AppointmentID     n
##           <int> <int>
## 1       5030230     1
## 2       5122866     1
## 3       5134197     1
## 4       5134220     1
## 5       5134223     1
## 6       5134224     1
appointment.raw %>% group_by(PatientId) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head()
## # A tibble: 6 x 2
##      PatientId     n
##          <dbl> <int>
## 1 8.221459e+14    88
## 2 9.963767e+10    84
## 3 2.688613e+13    70
## 4 3.353478e+13    65
## 5 2.584244e+11    62
## 6 6.264199e+12    62

The appointmentID is unique everytime but the patientId is common to a single patient which has made multiple visits. PatientID could be more fruitful in finding recurring trend with a given patient but appointmentID might not be too useful in our analysis so we can drop this column.

appointment.cleaned <- appointment.raw[,-2]

Creating a new column DaysToAppointment for difference between appointment date and scheduling date

appointment.cleaned <- mutate(appointment.raw, DaysToAppointment = as.numeric(as.Date(AppointmentDay,format='%m-%d-%Y') - as.Date(ScheduledDay,format='%m-%d-%Y')))

We will use this additional variable to make a prediction about show/no-show as it is a better regressor variable than the actual dates. Let us analyze the new column we got for outliers and errors.

range(appointment.cleaned$DaysToAppointment)
## [1]  -6 179

Since we know that the future date cannot be negative, there are some error fields in the observations. We need to get rid of these error values.

appointment.cleaned %>% filter(as.numeric(as.Date(AppointmentDay,format='%m-%d-%Y')) < as.numeric(as.Date(ScheduledDay,format='%m-%d-%Y')))
## # A tibble: 5 x 16
##      PatientId AppointmentID Gender        ScheduledDay AppointmentDay
##          <dbl>         <int> <fctr>              <dttm>         <dttm>
## 1 7.839273e+12       5679978      M 2016-05-10 10:51:53     2016-05-09
## 2 7.896294e+12       5715660      F 2016-05-18 14:50:41     2016-05-17
## 3 2.425226e+13       5664962      F 2016-05-05 13:43:58     2016-05-04
## 4 9.982316e+14       5686628      F 2016-05-11 13:49:20     2016-05-05
## 5 3.787482e+12       5655637      M 2016-05-04 06:50:57     2016-05-03
## # ... with 11 more variables: Age <int>, Neighbourhood <fctr>,
## #   Scholarship <lgl>, Hipertension <lgl>, Diabetes <lgl>,
## #   Alcoholism <lgl>, Handcap <lgl>, SMS_received <lgl>, `No-show` <chr>,
## #   NoShow <fctr>, DaysToAppointment <dbl>

The above 5 rows can be marked as errors since the appointment dates need to be in future. These could be back entries or typing errors.

appointment.cleaned <- appointment.cleaned[as.numeric(as.Date(appointment.cleaned$AppointmentDay,format='%m-%d-%Y')) >= as.numeric(as.Date(appointment.cleaned$ScheduledDay,format='%m-%d-%Y')),]

Cleaned Data

A glimpse at the cleaned data set reveals 14 variables now with changed data types and row count reduced by 6 observations which essentially indicates how clean the data initially was.

glimpse(appointment.cleaned)
## Observations: 110,521
## Variables: 16
## $ PatientId         <dbl> 2.987250e+13, 5.589978e+14, 4.262962e+12, 8....
## $ AppointmentID     <int> 5642903, 5642503, 5642549, 5642828, 5642494,...
## $ Gender            <fctr> F, M, F, F, F, F, F, F, F, F, F, M, F, M, F...
## $ ScheduledDay      <dttm> 2016-04-29 18:38:08, 2016-04-29 16:08:27, 2...
## $ AppointmentDay    <dttm> 2016-04-29, 2016-04-29, 2016-04-29, 2016-04...
## $ Age               <int> 62, 56, 62, 8, 56, 76, 23, 39, 21, 19, 30, 2...
## $ Neighbourhood     <fctr> JARDIM DA PENHA, JARDIM DA PENHA, MATA DA P...
## $ Scholarship       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
## $ Hipertension      <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE...
## $ Diabetes          <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FAL...
## $ Alcoholism        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
## $ Handcap           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
## $ SMS_received      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
## $ `No-show`         <chr> "No", "No", "No", "No", "No", "No", "Yes", "...
## $ NoShow            <fctr> No, No, No, No, No, No, Yes, Yes, No, No, N...
## $ DaysToAppointment <dbl> 0, 0, 0, 0, 0, 2, 2, 2, 0, 2, 2, 3, 1, 1, 1,...

Summary of cleaned data

summary(appointment.cleaned)
##    PatientId         AppointmentID     Gender   
##  Min.   :3.922e+04   Min.   :5030230   M:38685  
##  1st Qu.:4.172e+12   1st Qu.:5640284   F:71836  
##  Median :3.173e+13   Median :5680573            
##  Mean   :1.475e+14   Mean   :5675304            
##  3rd Qu.:9.439e+13   3rd Qu.:5725524            
##  Max.   :1.000e+15   Max.   :5790484            
##                                                 
##   ScheduledDay                 AppointmentDay               
##  Min.   :2015-11-10 07:13:56   Min.   :2016-04-29 00:00:00  
##  1st Qu.:2016-04-29 10:26:51   1st Qu.:2016-05-09 00:00:00  
##  Median :2016-05-10 12:13:17   Median :2016-05-18 00:00:00  
##  Mean   :2016-05-09 07:48:50   Mean   :2016-05-19 00:58:20  
##  3rd Qu.:2016-05-20 11:18:39   3rd Qu.:2016-05-31 00:00:00  
##  Max.   :2016-06-08 20:07:23   Max.   :2016-06-08 00:00:00  
##                                                             
##       Age                 Neighbourhood   Scholarship     Hipertension   
##  Min.   :  0.00   JARDIM CAMBURI : 7717   Mode :logical   Mode :logical  
##  1st Qu.: 18.00   MARIA ORTIZ    : 5805   FALSE:99660     FALSE:88720    
##  Median : 37.00   RESISTÊNCIA    : 4430   TRUE :10861     TRUE :21801    
##  Mean   : 37.09   JARDIM DA PENHA: 3877                                  
##  3rd Qu.: 55.00   ITARARÉ        : 3514                                  
##  Max.   :115.00   CENTRO         : 3334                                  
##                   (Other)        :81844                                  
##   Diabetes       Alcoholism       Handcap        SMS_received   
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:102578    FALSE:107161    FALSE:108282    FALSE:75039    
##  TRUE :7943      TRUE :3360      TRUE :2239      TRUE :35482    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##    No-show          NoShow      DaysToAppointment
##  Length:110521      No :88207   Min.   :  0.00   
##  Class :character   Yes:22314   1st Qu.:  0.00   
##  Mode  :character               Median :  4.00   
##                                 Mean   : 10.18   
##                                 3rd Qu.: 15.00   
##                                 Max.   :179.00   
## 

We noticed a very high percentage of Patients that didn’t show up for their appointments from the data presented above. Let us explore further, using ggplot about different perspectives of that.

DATA EXPLORATION

SMS Alert and Patient Show up

ggplot(appointment.cleaned, aes(x=NoShow, fill = SMS_received)) + geom_bar(position = "fill")

From the above graph, contrary to the usual belief, it appears that sending of SMS to the patient for their appointments isn’t making a big difference. Perhaps the sms is sent when the appointment is made and there could be a reminder a day prior to the scheduled date.

Days to appointment and No Show

We would look to explore how DaysToApppointment affects the NoShow numbers. For this, we will group the data by DaysToAppointment and find out the proportion of no show for that given day.

appointment.cleaned %>% group_by(DaysToAppointment) %>% filter(n() > 8) %>% summarise(number = sum(NoShow == 'Yes')/n()) %>% filter(number > 0) %>% top_n(n=10)
## Selecting by number
## # A tibble: 10 x 2
##    DaysToAppointment    number
##                <dbl>     <dbl>
##  1                25 0.4018838
##  2                40 0.4055118
##  3                55 0.4495413
##  4                60 0.4246575
##  5                78 0.3939394
##  6                79 0.3913043
##  7                80 0.4444444
##  8                81 0.4000000
##  9                89 0.4117647
## 10               155 0.6000000

It is seen that most number of no shows occur as the difference in the dates increase. It would be interesting to check out the fraction of noshows with increasing days to appointment.

appointment.cleaned %>% group_by(DaysToAppointment) %>% filter(n() > 8) %>% summarise(number = sum(NoShow == 'Yes')/n()) %>% filter(number > 0) %>% 
  ggplot(aes(x=DaysToAppointment,y=number)) + geom_point(alpha = 0.4)

Unsurprisingly, the proportions of Noshow are least for 0th day since one would imagine that the patients would wait till their appointment time. For the other days, the NoShow rate hovers between .2 to .4 for most of them. To see if there is any impact of Gender in this, we exploit the fill function for both men and women.

appointment.cleaned %>% group_by(DaysToAppointment,Gender,NoShow) %>% 
    summarize(total.count=n()) %>% filter(total.count>10) %>% 
    ggplot(aes(x=DaysToAppointment,y = total.count, fill=NoShow)) + geom_bar(stat="identity", position = "fill")  + facet_grid(.~Gender) + 
    coord_cartesian(xlim = c(0,40))

No fixed pattern separates male and female patients in terms of missing their appointments when plotted with their days leading to the appointment.

Age v/s No Show v/s Gender

To see how Show/Noshow vary across gender, we take a look at the following bar plot.

ggplot(appointment.cleaned, aes(x=Gender, fill=NoShow)) + geom_bar(position = "fill")

There appears to be no significant difference in the proportion of male/female in their No show for appointments. Both look steady at around 20%.

Let’s take a closer look into different age groups contributing to frequent No Show. We would also like plot the gender to see any pattern across age and gender that contribute to NoShow

ggplot(appointment.cleaned, aes(x=NoShow, y=Age, col=Gender)) + geom_boxplot()

The above boxplots appear to indicate that there is no major difference between different ages and genders except that younger females, for some reason, have slightly higher NoShow rate when compared to the rest.

ScheduledDay and AppointmentDay v/s No Show

As an analyst, it pays to dig deeper where no one would generally look. Even though the Scheduled Day for an appointment, at first, doesn’t sound like it can have any impact on No Show. Let’s look deeper into this.

appointment.cleaned %>% group_by(RegDate=as.Date(ScheduledDay)) %>% filter(n() > 8) %>% 
  summarise(total.noshow=sum(NoShow=="Yes")/n()) %>% filter(total.noshow > 0) %>% arrange(desc(total.noshow)) %>% 
  ggplot(aes(x=RegDate, y=total.noshow)) + geom_point(alpha=0.6) + geom_smooth(method = "lm", se = F)

It does appear that, save for a few outliers, there is a general downward trend as indicated by the regression line. This implies that, given a minimum 8 appointments in a day, the proportion of NoShows decrease with as we move towards the month of June. If we look at our data, this consolidates our earlier finding that greater the difference between scheduled date to appointment date, lesser are the chances of showing up. Since most of the appointment data is concentrated around May to June, the scheduled dates from May / June have much lower No Show proportion than the ones which are scheduled much earlier

It will be interesting to view the appointment date trend with their respective days of the week, if the lazy weekends account for more Noshow or whether its the busy weekdays. Let’s find out using the graph below:

appointment.cleaned$DayAppointment <- weekdays(appointment.cleaned$AppointmentDay)
 appointment.cleaned %>% group_by(DayAppointment, Gender) %>% summarize(total.noShowPropotion = sum(NoShow == 'Yes')) %>% 
    ggplot(aes(x=DayAppointment, y =total.noShowPropotion,fill=Gender)) + geom_col()

Lack of data for Saturdays prevents any statistical analysis for that day. In general, there doesn’t appear to be any patterns between other days of the week in terms of No Show.

The following fill plot confirm that notion

appointment.cleaned %>% group_by(DayAppointment, Gender) %>% filter(n() > 1000) %>% 
  summarize(total.noShowPropotion = sum(NoShow == 'Yes')/n()) %>% 
    ggplot(aes(x=DayAppointment, y =total.noShowPropotion,fill=Gender)) + geom_col() + facet_grid(.~Gender)

Since we are given the time of Scheduling Date, let’s find out if there are any trends for the time slots of reservation for appointment v/s their NoShow.

  appointment.cleaned %>% group_by(RegistrationHour=hour(ScheduledDay),Gender) %>% filter(n()>10) %>% 
    summarise(proportion.noshow=sum(NoShow=="Yes")/n()) %>% arrange(desc(RegistrationHour)) %>% 
    ggplot(aes(x=RegistrationHour, y=proportion.noshow, fill=Gender)) + 
    geom_bar(stat = "identity")

From the figure, we can see that most of the NoShows are the ones that took the appointment late in the day. This trend might hint at the hospitals to stop their reservation hours slightly sooner.

Region v/s NoShow

From the given Neighbourhoods, lets see how they contribute to NoShow

appointment.cleaned %>% group_by(Neighbourhood) %>% 
  filter(n()>100) %>% summarise(proportion.noshow=sum(NoShow=="Yes")/n()) %>% 
    ggplot(aes(x=Neighbourhood, y=proportion.noshow)) + 
    geom_bar(stat="identity") + coord_flip()

For regions with atleast 100 appointments, it appears that quite a few of them have greater than .2 i.e. 1/5th no show ratio. Also, the clutter in the above graph makes for a tough viewing. Let’s analyze the regions for any specific trends by the region. To avoid the clutter, we would observe only those neighbourhoods with over 1000 appointments of which atleast 22% were missed. These are eventually the neighbourhoods that would be of bigger concern since they contribute the most to NoShow.

 appointment.cleaned %>% 
    group_by(Neighbourhood) %>% 
    filter(n()>1000) %>% 
    summarize(NoShowProportion = sum(NoShow == 'Yes')/n()) %>% 
    filter(NoShowProportion > 0.22) %>% 
    inner_join(appointment.cleaned) %>% 
    select(Neighbourhood,Gender,NoShow) %>% 
    filter(NoShow == 'Yes') %>% 
    ggplot(aes(x=Neighbourhood, fill = Gender)) + 
    geom_bar(position = "fill") + coord_flip()
## Joining, by = "Neighbourhood"

Of the above-mentioned regions, no specific trends surface based on genders. There is slightly higher than usual male proportion missing their appointments in Praia Do Sua. Let’s add our other flag variables of Alcoholism and Scholarship to analyze further.

appointment.cleaned %>% 
    group_by(Neighbourhood) %>% 
    filter(n()>1000) %>% 
    summarize(NoShowProportion = sum(NoShow == 'Yes')/n()) %>% 
    filter(NoShowProportion > 0.22) %>% 
    inner_join(appointment.cleaned) %>% 
    select(Neighbourhood,Gender,NoShow, Alcoholism, Scholarship) %>% 
    filter(NoShow == 'Yes') %>% 
    ggplot(aes(x=Neighbourhood, fill = Alcoholism)) + 
    geom_bar(position = "fill") + coord_flip() + facet_grid(.~Gender)
## Joining, by = "Neighbourhood"

Interestingly, we discover the region of Praia Do sua has high Alcoholism rate than usual. This could be one of the reasons contributing to NoShow in general.

 appointment.cleaned %>% 
    group_by(Neighbourhood) %>% 
    filter(n()>1000) %>% 
    summarize(NoShowProportion = sum(NoShow == 'Yes')/n()) %>% 
    filter(NoShowProportion > 0.22) %>% 
    inner_join(appointment.cleaned) %>% 
    select(Neighbourhood,Gender,NoShow, Alcoholism, Scholarship) %>% 
    filter(NoShow == 'Yes') %>% 
    ggplot(aes(x=Neighbourhood, fill = Scholarship)) + 
    geom_bar(position = "fill") + coord_flip()
## Joining, by = "Neighbourhood"

While for Praia Do Sua, the most plaguing factor was Alcoholism, in Ilha Do Principe and Gurgica, almost 1/4th of the total missed appointments have scholarships. Since scholarships are, in general, granted to the poor sectors of the society, these could potentially be the neighbourhoods with higher poverty rate.

In the light of specific group of people missing out on their appointments in each neighbourhood, let’s also look at different age groups that missed their appointments in these regions. We’ll create a new column for age category and create three groups - Kids, Young and Old and see how different regions saw which categories missing their appointments the most.

 appointment.cleaned %>%
    group_by(Neighbourhood) %>% 
    summarise(MoreThanThousand = n()>1000) %>%
    filter(MoreThanThousand) %>% 
    inner_join(appointment.cleaned) %>% 
    mutate(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old"))) %>% 
    select(Neighbourhood,AgeCat,Age,NoShow) %>% 
    group_by(Neighbourhood,AgeCat) %>% 
    summarise(proportion.noshow=sum(NoShow=="Yes")/n()) %>%
    ggplot(aes(x=Neighbourhood, y=proportion.noshow, fill = AgeCat)) + 
    geom_bar(stat="identity", position = "fill") + coord_flip()
## Joining, by = "Neighbourhood"

The above graph displays neighbourhoods with more than 1000 missed apppointments and their age category split.

Impact of other flag variables with NoShow

  • Scholarship : This is the concession given by the Brazillian govt to poor families, in general, to have discounted access to medicare. Let’s see how different age groups and genders contribute to noshow:
appointment.cleaned %>% 
  group_by(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old")), Scholarship, Gender) %>%
      summarise(no.show = sum(NoShow=="Yes")) %>% 
      ggplot(aes(x=AgeCat, y=no.show, fill = Scholarship)) + 
      geom_bar(stat="identity") + facet_grid(.~Gender) 

It appears that young females who have access to Scholarship(medicare concession) tend to have higher number of no-show for their appointments. It could also mean that poor women tend not to care about their medical appointments as much.

  • Alcoholism : Lets look at the impact of alcoholism on NoShow
appointment.cleaned %>% group_by(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old")), Alcoholism, Gender) %>%
      summarise(no.show = sum(NoShow=="Yes")) %>% 
      ggplot(aes(x=AgeCat, y=no.show, fill = Alcoholism)) + 
      geom_bar(stat="identity", position = "fill") + facet_grid(.~Gender)

Interestingly, we observe older men with Alcoholism tend to show much up lesser for their appointments. These parameters could play a critical part in creating our final model that predicts a show/no-show of a patient.

For the other parameters, there aren’t any specific trends/patterns observed.

    plot_diabetes <- appointment.cleaned %>% group_by(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old")), Diabetes, Gender) %>%
      summarise(no.show = sum(NoShow=="Yes")) %>% 
      ggplot(aes(x=AgeCat, y=no.show, fill = Diabetes)) + 
      geom_bar(stat="identity") + facet_grid(.~Gender)
    
    plot_htension <- appointment.cleaned %>% group_by(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old")), Hipertension, Gender) %>%
      summarise(no.show = sum(NoShow=="Yes")) %>% 
      ggplot(aes(x=AgeCat, y=no.show, fill = Hipertension)) + 
      geom_bar(stat="identity") + facet_grid(.~Gender)
    
    plot_Handicap <- appointment.cleaned %>% group_by(AgeCat = ifelse(Age<15,"Kid", ifelse(Age<45, "Young", "Old")), Handcap, Gender) %>%
      summarise(no.show = sum(NoShow=="Yes")) %>% 
      ggplot(aes(x=AgeCat, y=no.show, fill = Handcap)) + 
      geom_bar(stat="identity", position = "fill") + facet_grid(.~Gender)
    
    grid.arrange(plot_htension, plot_Handicap, plot_diabetes, nrow=3, top='Inconclusive patterns for Hypertension, Diabetes and Handicap flag')

Most missed appointments by Patients and their profiles

Let’s look at the top missed appointments by patients and if there is any interesting facts about the profiles of the patients that missed their appointments the most.

appointment.cleaned %>% 
    group_by(PatientId) %>% 
    summarise(MissedAppointments=sum(NoShow=="Yes"), ProportionMissed = MissedAppointments / n()) %>%
    filter(MissedAppointments>10) %>% 
    arrange(desc(MissedAppointments)) %>% 
    inner_join(appointment.cleaned) %>% 
    select(PatientId,MissedAppointments,ProportionMissed, Gender, Age, Scholarship) %>% 
    unique()
## Joining, by = "PatientId"
## # A tibble: 11 x 6
##       PatientId MissedAppointments ProportionMissed Gender   Age
##           <dbl>              <int>            <dbl> <fctr> <int>
##  1 1.421987e+12                 18        1.0000000      M     9
##  2 5.635135e+14                 16        1.0000000      M     9
##  3 5.587790e+12                 15        0.7500000      M    14
##  4 5.811973e+14                 14        1.0000000      F     7
##  5 6.575144e+13                 13        1.0000000      M    14
##  6 1.198157e+12                 12        0.5217391      M    10
##  7 1.198157e+12                 12        0.5217391      M    11
##  8 4.768616e+11                 11        0.9166667      F    43
##  9 2.728422e+12                 11        0.7333333      F    44
## 10 9.715136e+12                 11        0.5500000      F    12
## 11 2.491637e+14                 11        0.6111111      M     8
## # ... with 1 more variables: Scholarship <lgl>

From the above patients that missed their appointments the most, four of them have Scholarship aid given by the government. Given that our dataset is dominated by female population, the top missed appointments feature more than 60% male population.

SUMMARY

Interesting Findings

  • High Number of appointments scheduled after 8 PM land up in no show
  • Young Females with Scholarship tend to have higher no show
  • Older males with Alcoholism also tend to have lower show up rate
  • As the days of appointment increase, the proportion of no-show also increases
  • Of all the regions contributing significantly to NoShow, Praia Do Sua has high NoShow and high Alcoholism in the region.Ilha Do Principe and Gurgica also contribute significantly to NoShow but the reason behind majority of those is Scholarship

Implications

  • Narrow Scheduling Window: Within a day, since high number of NoShow appointments were found to be scheduled after 8 PM, the authorities can look to reduce the appointment schedules (unless there is an emergency) to day times. The length of booking date must also be kept within a narrower range as we see that higher the days to appointment, higher is the NoShow Rate.

  • Patient Profiling: The above findings display interesting trends about different segments of the society and their relation with NoShow. These must be profiled by the Hospital authorities before booking an appointment to ensure that NoShow rate is least. The profiling could be like the way credit score works, for example, if the patient is an older male with alcoholism or a younger female with scholarship, chances are they might not turn up for the appointment if it is scheduled for a later time. Thus, the patient profile must be used before allotting an appointment date.

Limitations / Improvements

  • We observed insightful details about noshows pertaining to different neighbourhoods. it will be interesting to map these neighbourhoods’ additional data to see if we could root cause the findings. For example, certain neighbourhoods indicated the no show patients to be only females. If the crime rate or various other society data could be extracted, it would give a clearer insight to the problem and how the NoShow can be bettered.

  • Basic Machine learning techniques can be applied to train a model that would input a patient’s profile to predict his/her chances of not showing up.