EDA Medical Appointment No-show

About dataset

The dataset was downloaded from https://www.kaggle.com, file “KaggleV2-May-2016.csv” . It contains 14 variables and 110527 observations. The main idea of dataset is to try to recognize persons who order a doctor appointment, receives all the instructions and finally do not arrive (“no-show”).

“No-Show” dataset containes information about appointments, and information about a persons: general and medical.

The most important conclusions made are:

  1. The ratio of “arrived” patients to “didn’t arrive” patients is significantly bigger on the same day when the appointment was on a day when was ordered a queue. The ratio is about 21. In the case when a patient has an appointment on a next day after ordering a queue- a ratio dramatically fault to 3.7.

  2. If patient prone to don’t arrive at doctor appointment- the probability (arrive/ don’t arrive ) is kept about 50%, Despite a number of appointment … first appointment, or second, third or fourth.

  3. Patients at the age between 20 -65 prone to miss appointments more than younger or older patients.

  4. Patients, who suffer from hypertension tend fewer miss appointments.

  5. Other variables have no significant effect on “No Show” (patient didn’t arrive) variable.

noShow<-read.csv("KaggleV2-May-2016.csv", stringsAsFactors = FALSE)
dim(noShow)
## [1] 110527     14
str(noShow)
## 'data.frame':    110527 obs. of  14 variables:
##  $ PatientId     : num  2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
##  $ AppointmentID : int  5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
##  $ Gender        : chr  "F" "M" "F" "F" ...
##  $ ScheduledDay  : chr  "2016-04-29T18:38:08Z" "2016-04-29T16:08:27Z" "2016-04-29T16:19:04Z" "2016-04-29T17:29:31Z" ...
##  $ AppointmentDay: chr  "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" ...
##  $ Age           : int  62 56 62 8 56 76 23 39 21 19 ...
##  $ Neighbourhood : chr  "JARDIM DA PENHA" "JARDIM DA PENHA" "MATA DA PRAIA" "PONTAL DE CAMBURI" ...
##  $ Scholarship   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hipertension  : int  1 0 0 0 1 1 0 0 0 0 ...
##  $ Diabetes      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Alcoholism    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Handcap       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SMS_received  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ No.show       : chr  "No" "No" "No" "No" ...
head(noShow, 6)
##      PatientId AppointmentID Gender         ScheduledDay
## 1 2.987250e+13       5642903      F 2016-04-29T18:38:08Z
## 2 5.589978e+14       5642503      M 2016-04-29T16:08:27Z
## 3 4.262962e+12       5642549      F 2016-04-29T16:19:04Z
## 4 8.679512e+11       5642828      F 2016-04-29T17:29:31Z
## 5 8.841186e+12       5642494      F 2016-04-29T16:07:23Z
## 6 9.598513e+13       5626772      F 2016-04-27T08:36:51Z
##         AppointmentDay Age     Neighbourhood Scholarship Hipertension
## 1 2016-04-29T00:00:00Z  62   JARDIM DA PENHA           0            1
## 2 2016-04-29T00:00:00Z  56   JARDIM DA PENHA           0            0
## 3 2016-04-29T00:00:00Z  62     MATA DA PRAIA           0            0
## 4 2016-04-29T00:00:00Z   8 PONTAL DE CAMBURI           0            0
## 5 2016-04-29T00:00:00Z  56   JARDIM DA PENHA           0            1
## 6 2016-04-29T00:00:00Z  76        REPÚBLICA           0            1
##   Diabetes Alcoholism Handcap SMS_received No.show
## 1        0          0       0            0      No
## 2        0          0       0            0      No
## 3        0          0       0            0      No
## 4        0          0       0            0      No
## 5        1          0       0            0      No
## 6        0          0       0            0      No

Data Understanding

a) Numeric Data

Let’s find missing numeric data:

numIndex<- sapply(noShow, is.numeric)
summary(noShow[,numIndex])
##    PatientId         AppointmentID          Age          Scholarship     
##  Min.   :3.922e+04   Min.   :5030230   Min.   : -1.00   Min.   :0.00000  
##  1st Qu.:4.173e+12   1st Qu.:5640286   1st Qu.: 18.00   1st Qu.:0.00000  
##  Median :3.173e+13   Median :5680573   Median : 37.00   Median :0.00000  
##  Mean   :1.475e+14   Mean   :5675305   Mean   : 37.09   Mean   :0.09827  
##  3rd Qu.:9.439e+13   3rd Qu.:5725524   3rd Qu.: 55.00   3rd Qu.:0.00000  
##  Max.   :1.000e+15   Max.   :5790484   Max.   :115.00   Max.   :1.00000  
##   Hipertension       Diabetes         Alcoholism        Handcap       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1972   Mean   :0.07186   Mean   :0.0304   Mean   :0.02225  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
##   SMS_received  
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.321  
##  3rd Qu.:1.000  
##  Max.   :1.000

According to the “summary” there are no “NA”’s in all numeric columns (it is a good news).

“PatientId” - contains system running numbers of patients. PatientID is not unique variable - it means that number of patients made an appointment more then one time during the dataset period of time:

length(noShow$PatientId)==length(unique(noShow$PatientId))
## [1] FALSE

Procent Distribution of patients made one, two .. six and more appointments:

tab<-table(noShow$PatientId)
tab<-table(as.vector(tab))
tab<-round(prop.table(tab)*100, digits = 2)
appointmentFreq<-data.frame("numOfAppointments"=names(tab), "procents"=as.vector(tab))
tempDF<-appointmentFreq[appointmentFreq$procents <1,]
appointmentFreq<-appointmentFreq[appointmentFreq$procents >=1,]
tempDF<-data.frame("numOfAppointments"="6+",  "procents"=sum(tempDF$procents))
appointmentFreq<-rbind(appointmentFreq,tempDF)

pie(appointmentFreq$procents, labels = paste(appointmentFreq$procents,"%", sep = " "), 
    col = heat.colors(nrow(appointmentFreq)),
    main = "Repeated Appointments")
legend("bottomright",legend=paste(appointmentFreq$numOfAppointments, "appontments", sep =" "), bty="n",fill=heat.colors(nrow(appointmentFreq)))

“AppointmentID” - contains system running number of appointment- it supposed to be unique. Let’s test whether the same appointment appears few times in data set:

length(noShow$AppointmentID)==length(unique(noShow$AppointmentID))
## [1] TRUE

All appointments are unique. There are no repeated appointments in data set, therefore the column “AppointmentID” is redundant and will be omitted.

Variables “Scholarship”,“Hipertension”,“Diabetes”,“Alcoholism”,“Handcap”,“SMS_received” according to the “summary” are factors and can be “0” or “1”. All factor columns contain proper data (0 or 1 only, biesibes “Handcap” -it has 5 levels:0,1,2,3,4).

Scholarship(%):

tab<-table(noShow$Scholarship)
round(prop.table(tab)*100, digits =1 )
## 
##    0    1 
## 90.2  9.8

Hipertension(%):

tab<-table(noShow$Hipertension)
round(prop.table(tab)*100, digits =1 )
## 
##    0    1 
## 80.3 19.7

Diabetes(%):

tab<-table(noShow$Diabetes)
round(prop.table(tab)*100, digits =1 )
## 
##    0    1 
## 92.8  7.2

Alcoholism(%):

tab<-table(noShow$Alcoholism)
round(prop.table(tab)*100, digits =1 )
## 
##  0  1 
## 97  3

Handcap:

tab<-table(noShow$Handcap)
round(prop.table(tab)*100, digits =3 )
## 
##      0      1      2      3      4 
## 97.972  1.848  0.166  0.012  0.003

SMS_received(%):

tab<-table(noShow$SMS_received)
round(prop.table(tab)*100, digits =1 )
## 
##    0    1 
## 67.9 32.1

Column “Age”" (according to the “summary”) - it looks like contain mostly relevant information, besides rows with age -1 (less the zero):

nrow(noShow[noShow$Age<0,])
## [1] 1

There’s only one row with age value less than 0 -it will be omitted. Let’s visualize age distribution:

noShow <-noShow[noShow$Age>=0,]
hist(noShow$Age, col = "lightblue", main = "Age Distribution", xlab = "Age", ylab = "Quantity of Appointments")

Let’s take to account that patients at age less then about 16 and at age above about 80 depend on other people who care about them, therefore thea arrival to doctor appointment may be depends on the decision of the attendant person. It means that information of this group MAY BE will researched separately - let’s see further.

quantityOfAppointments_16_80<- nrow(noShow[noShow$Age<16,])+nrow(noShow[noShow$Age<16,])
procentOfAppointments_16_80<-quantityOfAppointments_16_80/nrow(noShow)*100
print(paste("Number of appointments made for patients at age less than 16 and above 80: ",quantityOfAppointments_16_80,sep=" "))
## [1] "Number of appointments made for patients at age less than 16 and above 80:  48936"
print(paste("Percent of appointments made for patients at age less than 16 and above 80: ",round(procentOfAppointments_16_80),"%",sep=" "))
## [1] "Percent of appointments made for patients at age less than 16 and above 80:  44 %"

Actually, about half of appointments belong to this category (less then 16 and above 80).

a) Not Numeric Data

numIndex<- sapply(noShow,  is.numeric )
summary(noShow[,!numIndex])
##     Gender          ScheduledDay       AppointmentDay    
##  Length:110526      Length:110526      Length:110526     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  Neighbourhood        No.show         
##  Length:110526      Length:110526     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Gender”- according to “unique” statement there is no irrelevant or missing data in the column; 65 percent of the appointments made by (for) women, and 35 by (or for) men.

unique(noShow$Gender)
## [1] "F" "M"
tab<-data.frame(table(noShow$Gender))
tab$Freq<- sapply(tab[,"Freq"], function(x) round((x/sum(tab$Freq)*100),digits = 0))
names(tab)<-c("Gender", "Procent")
tab
##   Gender Procent
## 1      F      65
## 2      M      35

“Neighbourhood”- according to “unique” statement there is no irrelevant or missing data in the column; 81 different neighbourhoods mentioned in dataset.

Neighbourhoods<-unique(noShow$Neighbourhood)
Neighbourhoods[order(Neighbourhoods)]
##  [1] "AEROPORTO"                    "ANDORINHAS"                  
##  [3] "ANTÔNIO HONÓRIO"            "ARIOVALDO FAVALESSA"         
##  [5] "BARRO VERMELHO"               "BELA VISTA"                  
##  [7] "BENTO FERREIRA"               "BOA VISTA"                   
##  [9] "BONFIM"                       "CARATOÍRA"                  
## [11] "CENTRO"                       "COMDUSA"                     
## [13] "CONQUISTA"                    "CONSOLAÇÃO"                
## [15] "CRUZAMENTO"                   "DA PENHA"                    
## [17] "DE LOURDES"                   "DO CABRAL"                   
## [19] "DO MOSCOSO"                   "DO QUADRO"                   
## [21] "ENSEADA DO SUÁ"              "ESTRELINHA"                  
## [23] "FONTE GRANDE"                 "FORTE SÃO JOÃO"            
## [25] "FRADINHOS"                    "GOIABEIRAS"                  
## [27] "GRANDE VITÓRIA"              "GURIGICA"                    
## [29] "HORTO"                        "ILHA DAS CAIEIRAS"           
## [31] "ILHA DE SANTA MARIA"          "ILHA DO BOI"                 
## [33] "ILHA DO FRADE"                "ILHA DO PRÍNCIPE"           
## [35] "ILHAS OCEÂNICAS DE TRINDADE" "INHANGUETÁ"                 
## [37] "ITARARÉ"                     "JABOUR"                      
## [39] "JARDIM CAMBURI"               "JARDIM DA PENHA"             
## [41] "JESUS DE NAZARETH"            "JOANA D´ARC"                
## [43] "JUCUTUQUARA"                  "MÁRIO CYPRESTE"             
## [45] "MARIA ORTIZ"                  "MARUÍPE"                    
## [47] "MATA DA PRAIA"                "MONTE BELO"                  
## [49] "MORADA DE CAMBURI"            "NAZARETH"                    
## [51] "NOVA PALESTINA"               "PARQUE INDUSTRIAL"           
## [53] "PARQUE MOSCOSO"               "PIEDADE"                     
## [55] "PONTAL DE CAMBURI"            "PRAIA DO CANTO"              
## [57] "PRAIA DO SUÁ"                "REDENÇÃO"                  
## [59] "REPÚBLICA"                   "RESISTÊNCIA"                
## [61] "ROMÃO"                       "SÃO BENEDITO"               
## [63] "SÃO CRISTÓVÃO"             "SÃO JOSÉ"                  
## [65] "SÃO PEDRO"                   "SANTA CECÍLIA"              
## [67] "SANTA CLARA"                  "SANTA HELENA"                
## [69] "SANTA LÚCIA"                 "SANTA LUÍZA"                
## [71] "SANTA MARTHA"                 "SANTA TEREZA"                
## [73] "SANTO ANDRÉ"                 "SANTO ANTÔNIO"              
## [75] "SANTOS DUMONT"                "SANTOS REIS"                 
## [77] "SEGURANÇA DO LAR"            "SOLON BORGES"                
## [79] "TABUAZEIRO"                   "UNIVERSITÁRIO"              
## [81] "VILA RUBIM"

No.show” - according to “unique” statement there is no irrelevant or missing data in the column (“Yes”/“No” only); 20 percent of appointments were appointed in vain- patients weren’t appear. Further will be better to transform the No.show data to 0/1 format.

unique(noShow$No.show)
## [1] "No"  "Yes"
tab<-data.frame(table(noShow$No.show))
tab$Freq<- sapply(tab[,"Freq"], function(x) round((x/sum(tab$Freq)*100),digits = 0))
names(tab)<-c("No.show", "Procent")
tab
##   No.show Procent
## 1      No      80
## 2     Yes      20

Data Preparation

Load dataset and omit negative age:

rm(list=ls())
noShow<-read.csv("KaggleV2-May-2016.csv", stringsAsFactors = FALSE)
noShow<-subset.data.frame(noShow, Age>=0)

PatientID” is running number. It provides information about a number of appointments made by the each patient (within the dataset).Lets’s find out number of appointments the patient made and running number of each appointment each patient made. For this purpose, I’ll make tree additional calculated columns : 1) “numberOfAppointment” - it reflects order number of each appointment of each patient. 2) “numberOfAppointments” - how many appointments every patient have. 3) “totalNoShow”- how many appointments each patient lost.

#PatientId
noShow.modified<-data.frame(PatientId=noShow[,"PatientId"])

#Number of each appointment each patient made
appNumber<-sapply(noShow[,"PatientId"], function(x) cumsum(noShow$PatientId==x))
noShow.modified$numberOfAppointment<-appNumber

#Alternative way to calculate Number of each appointment each patient made:
f_numberOfAppointment<-function(x)
{
  numOfAppointment<-rep(0,length(x))
  for(count in 1:length(x))
  {
    occurrence<-0
    PatientId<-x[count]
    for(count_temp in count:length(x))
    {
      PatientId_temp<-x[count_temp]
      if ((PatientId==PatientId_temp)&(numOfAppointment[count_temp]==0))
      {
        occurrence<-occurrence+1
        numOfAppointment[count_temp]<-occurrence
      }
    }
    #print(count)
  }
  return(numOfAppointment)
}
numberOfAppointment<-f_numberOfAppointment(noShow.modified$PatientId)
noShow.modified$numberOfAppointment<-numberOfAppointment

#Number of appontments each patient made
appNumber<-sapply(noShow[,"PatientId"], function(x) sum(noShow$PatientId==x))
noShow.modified$numberOfAppointments<-appNumber

#Number of appointments when patient didn't arrive
appNumber<-sapply(noShow.modified[,"PatientId"], 
                  function(x)  sum((noShow.modified$PatientId==x)&(noShow.modified$No.show==1)))
noShow.modified$totalNoShow<-appNumber

Gender” is character value. Transfer it to numeric 0/1 presentation (due to females are majority 65% set “F” as 1 and “M” as 0).

gender<-as.vector(sapply(noShow[,"Gender"], function(x) if (x=="F") 1 else 0 ))
noShow.modified<-data.frame(noShow.modified, Gender=gender)

AppointmentDay” and “ScheduledDay” - my be omitted, but befor, transfer them from character to POSIXct format and retrive two values: 1) how many days between ordering an doctor’s appointment and appointment itself; 2) on what day of week appointment is.

tempAppointmentDay<-as.POSIXct.default(noShow$AppointmentDay)
tempScheduledDay<-as.POSIXct.default(noShow$ScheduledDay)
tempDFDates<-data.frame(AppointmentDay=tempAppointmentDay,ScheduledDay=tempScheduledDay)

tempDFDates$daysBeforeAppointment<-as.vector(round((tempDFDates$AppointmentDay-tempDFDates$ScheduledDay)/86400,digits = 0))

Let’s test for any negative results:

sum(tempDFDates$daysBeforeAppointment<0)
## [1] 5

We have only 5 impossible negative “Days before appointment”. They can be replaced by 0:

tempDFDates$daysBeforeAppointment[tempDFDates$daysBeforeAppointment<0]<-0

Appointment Day”- a day of week:

getWeekDay<-function(x)
{
  if (x=="Monday") y= 1
  if (x=="Tuesday") y= 2
  if (x=="Wednesday") y=3
  if (x=="Thursday") y=4
  if (x=="Friday") y=5
  if (x=="Saturday") y=6
  return(y)
}
tempWeekDays<-weekdays.POSIXt(tempAppointmentDay)
tempDFDates$AppointmentWeekDay<-sapply(tempWeekDays, getWeekDay)

noShow.modified$daysBeforeAppointment<-tempDFDates$daysBeforeAppointment
noShow.modified$AppointmentWeekDay<- tempDFDates$AppointmentWeekDay
rm(tempDFDates)

Variables **“Age”, “Neighbourhood”, “Scholarship”, “Hipertension”, “Diabetes”, “Alcoholism”, “Handcap”, “SMS_received”* stay as they are, without modification:

noShow.modified<-data.frame(noShow.modified, noShow[6:13])

No.show” will be transformed to 0/1 format. “No”- 0 (patient arrived to appointment), “Yes” - 1 (patient didn’t arrive to appointment):

noShow.modified$No.show<-as.vector(sapply(noShow[,"No.show"], function(x) if (x=="Yes") 1 else 0 ))

Now, new modified data set will be saved and all temporary data structures will be removed:

save(noShow.modified, file='noShow.modified.RData')
save(noShow, file='noShow.RData')
rm(list = ls())
load('noShow.modified.RData')

Data Analysis

Actually, we have in dataset two kind of data: a) unique data connected to appointments themselves - like difference between date of appointment and date when appointment was ordered, or number of each appointment which each patient made; b) unique data connected to the patients- like age, conditions, neighborhood …

Data Analysis- Appointments

Firstly, separate modified data set into two data sets: first contain data of “NoShow” appointment, second “YesShow” appointments:

load('noShow.modified.RData')
arrived.no<-noShow.modified[noShow.modified$No.show==1,]
arrived.yes<-noShow.modified[noShow.modified$No.show==0,]
save(arrived.no, file='arrived.no.RData')
save(arrived.yes, file='arrived.yes.RData')

Let’s, find number (first, second, third …) of appointment, which “not accurate” patient prone to loose:

load('arrived.no.RData')
tab<-table(arrived.no$numberOfAppointment)
plot(prop.table(tab)*100, xlab ="Number of appointment", ylab = "Percent of 'lost' appointments", main="Procents of 'lost' appointments according to number")

#round(prop.table(tab)[1:10]*100, digits = 1)

Majority of “NoShow” appoinments (about 55%)-it is a first appointment patient made, second appointment - about 23%, and third appointment- about 10%, forth- about 5% … Conclusion: if patient prone to don’t arrive at doctor appointment- the probability (arrive/ don’t arrive ) is kept about 50%, Despite a number of appointment … first appointment, or second, third or fourth.

Days before appointment”- how many days between patient ordered a queue to a doctor and appointment itself:

#Density comparison
plot(density(arrived.yes$daysBeforeAppointment), col=3, main="Density: Days before appointment", xlab = "Days before appointment")
lines(density(arrived.no$daysBeforeAppointment), col=2)
legend("bottomright",legend=c("Arrived", "Didn't arrive"),bty="n",fill=c(3,2))

It looks like, that then fewer days before appointment then more patients arrive at the appointment. Let’s see the difference in ranges of “days before appointment”:

#Range comparison
boxplot(arrived.yes$daysBeforeAppointment, arrived.no$daysBeforeAppointment, outline = F, 
        names = c("Arrived", "Didn't arrive"), col = c(3,2), main="Days before appointment")

Let’s see numbers:

temp<-as.matrix(round(as.vector(summary(arrived.yes$daysBeforeAppointment)), digits = 1))
temp<-cbind(temp, round(as.vector(summary(arrived.no$daysBeforeAppointment)), digits = 1))
temp<-t(temp) 
colnames(temp)<-names(summary(arrived.yes$daysBeforeAppointment))
row.names(temp)<-c("Arrived", "Didn't arrive")
temp
##               Min. 1st Qu. Median Mean 3rd Qu. Max.
## Arrived          0       0      2  8.8      12  179
## Didn't arrive    0       4     11 15.8      23  179
t.test(arrived.yes$daysBeforeAppointment, arrived.no$daysBeforeAppointment)
## 
##  Welch Two Sample t-test
## 
## data:  arrived.yes$daysBeforeAppointment and arrived.no$daysBeforeAppointment
## t = -58.263, df = 31531, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.315262 -6.839094
## sample estimates:
## mean of x mean of y 
##  8.754759 15.831937

Now it is obvious: than fewer days before appointment than more patients will arrive at the appointment.

Let’s find exact days, when the difference (ratio) between the quantity of arrived and didn’t arrive patients is significant. I suppose, that to research the interval between 0 days (the queue was ordered in the same day) and 30 days (the queue was ordered in one month in advance) will be enough.

ratio.tab<-data.frame(daysBeforApp=c(0:30))
for(i in c(0:30))
{
  ratio.tab$ratio[i+1]<-round(sum(arrived.yes$daysBeforeAppointment==i)/
                                sum(arrived.no$daysBeforeAppointment==i), digits = 1)
}
brP<-barplot(height =ratio.tab$ratio, names.arg = ratio.tab$daysBeforApp, main="Ratio:  arrived / didn't arrive patients", xlab ="Days befor appointment", ylab = "Ratio" )
lines(x=brP, y=rep(mean(ratio.tab$ratio[-1]),31), col="orange")
legend("topright",legend=c("Ratio arrived/didn't arraived", "Average ratio (without 0 days)"),bty="n",fill=c("grey","orange"))

Conclusion: The ratio of “arrived” patients to “didn’t arrive” patients is significantly bigger on the same day when the appointment was on a day when was ordered a queue. The ratio is about 21. In the case when a patient has an appointment on a next day after ordering a queue- a ratio dramatically fault to 3.7 -close to average ratio. The average ratio is 2.4 (taken without 0 days before the appointment). After that, ratio continuously became closer to average ratio, and on 10 days before appointment became equivalent to average.

Appontment Weekday”: There isn’t significant difference between “Show” and “noShow” appointments according to the days of the week.

tab<-table(arrived.no$AppointmentWeekDay)
temp.matrix<-as.matrix(round(prop.table(tab)*100, digits = 0))
tab<-table(arrived.yes$AppointmentWeekDay)
temp.matrix<-cbind(temp.matrix, round(prop.table(tab)*100, digits = 0) )
ylim <- c(0, 1.2*max((temp.matrix)))
y =t(temp.matrix)
colnames(y)<-c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
bp<-barplot(height = y, beside=T,  ylim = ylim, xlab = "Days of week", ylab = "Percents", main="Distribution according to days of week", col=c("black", "lightgray"))
text(x = bp, y = y, label = paste(y,"%", sep = ""), pos = 3, cex = 0.8, col = "red")
legend("topright",legend=c("didn't arrive", "arrived"),bty="n", fill=c("black", "lightgray"))

SMS received”: About 70% of patients who arrived at appointment didn’t receive SMS before. Only 30% of arrived patients received SMS. On the other hand, almost half of patients, who didn’t arrive at appointment received SMS before. Therefore, it is hard to conclude, that SMS sent before an appointment can decrease the percentage of patients who ordered a queue to a doctor and didn’t arrive at an appointment.

load('noShow.RData')
tab<-table(noShow$SMS_received, noShow$No.show)
tab<-round(prop.table(tab,2)*100, digits = 0)
barplot(
  height = tab,
  beside = TRUE,
  xlab = "Patients arrived or didn't arrive at appointment",
  ylab = "Percents",
  main = "SMS/ noSMS  sent before appointment",
  names.arg = c("arrived", "didn't arrive"))
legend("topright",legend= c("No SMS", "SMS"),bty="n", fill=c("black", "lightgray"))

Data Analysis- Patients

Now, let’s analyze data connected to patients (like age, physical condition, gender…), not to the appointment (like day of week or “SMS sent”). According to “Repeated Appointments” chart, about 40% of appointments are repeated, it means, that data of a number of patients is repeated few times in original dataset. In order to analyze data about patients, we need to extract datasets with unique patients. In order to compare patients, I will make three data frames with unique patients: a) only patients who never arrived at appointment; b) patients who lose one or more appointments and arrived at one or more appointments; c) only patients who always arrived at the appointment.

Given dataset contains information collected during about half year. It means that if the patient made more than one appointment, his age may be different in different appointments and “unique” statement will return same patients with #different ages few times. Followed function recognizes rows with the same patienID and different ages and then calculates average age for each patient:

f_getPatientsUnique<-function(x)
{
  age_temp<-data.frame(PatientId=x$PatientId, Gender=x$Gender, Neighbourhood=x$Neighbourhood,
                       Scholarship=x$Scholarship, Hipertension= x$Hipertension, Diabetes=x$Diabetes,
                       Alcoholism=x$Alcoholism, Handcap=x$Handcap)
  age_temp<-unique(age_temp)  
  age_temp$Age<-rep(0, nrow(age_temp))

  for (i in 1:nrow(age_temp))
  {
    patientId<-age_temp[i,"PatientId"]
    vec<-subset (x, PatientId==patientId, select = Age)
    age_temp$Age[i]<-mean(vec$Age)
    #print(i)
  }
  return(age_temp)
}

#Never arrived at appointment
noShow.never<-subset(x=noShow.modified, subset = noShow.modified$numberOfAppointments==noShow.modified$totalNoShow,
                     select =c(PatientId, Gender, Age, Neighbourhood, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap) )
#A quantity of patients whose age was changed during observations in current group:
#nrow(noShow.never)-length(unique(noShow.never$PatientId))
temp<-f_getPatientsUnique(noShow.never)
noShow.never<-temp

#Sometimes arrived at appointment
noShow.sometimes<-subset(x=noShow.modified,
                         subset = noShow.modified$numberOfAppointments>noShow.modified$totalNoShow&noShow.modified$totalNoShow>0,
                     select =c(PatientId, Gender, Age, Neighbourhood, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap) )
#A quantity of patients whose age was changed during observations in current group:
#nrow(unique(noShow.sometimes))-length(unique(noShow.sometimes$PatientId))
temp<-f_getPatientsUnique(noShow.sometimes)
noShow.sometimes<-temp

#Always arrived at appointment
noShow.always<-subset(x=noShow.modified,
                         subset =noShow.modified$totalNoShow==0,
                         select =c(PatientId, Gender, Age, Neighbourhood, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap) )
#A quantity of patients whose age was changed during observations in current group:
#nrow(unique(noShow.always))-length(unique(noShow.always$PatientId))
temp<-f_getPatientsUnique(noShow.always)
noShow.always<-temp

Age”:

require(infotheo)
## Loading required package: infotheo
tab<-discretize(noShow.modified$Age, nbins=max(noShow.modified$Age))
tab<-table(tab)
tab<-round(prop.table(tab)*100, digits = 0)

x<-seq(0:max(as.integer(names(tab))))
y<-c(0:max(as.vector(tab)))
y<-rep(y, length(x)/length(y))
     
#never
tab<-discretize(noShow.never$Age, nbins=max(noShow.never$Age))
tab<-table(tab)
tab_never<-prop.table(tab)*100

#somtimes
tab<-discretize(noShow.sometimes$Age, nbins=max(noShow.sometimes$Age))
tab<-table(tab)
tab_sometimes<-prop.table(tab)*100

#always
tab<-discretize(noShow.always$Age, nbins=max(noShow.always$Age))
tab<-table(tab)
tab_always<-prop.table(tab)*100

#drow plot
colors<-c("red","blue","green")
plot(x, y, type="n", xlab = "Age", ylab="Percents", main="Age distributions: patients who always, sometimes,
and never miss doctor's appointment.")
lines(tab_never, type="l", col=colors[1])
lines(tab_sometimes, type="l", col=colors[2])
lines(tab_always, type="l", col=colors[3])
legend("topright",legend= c("always", "sometimes","never"), bty="n", fill=colors)

According to the chart, looks like, patients at the age between 20 -65 prone to miss appointments more than younger or older patients. The green line (patients who never miss appointments) in this interval (20-65) lower than in other intervals and lower than red and blue line (patients who always or sometimes miss doctor appointments).

Let’s compare “Scholarship”, “Hipertension”“,”Diabetes“,”Alcoholism“, and”Handcap" variables in three groups

#Function drows barplot to compare variables in three groups

draw_barplot<-function(never,sometimes,always, main, xlab, ylab)
{
  tab<-table(never)
  tab<-round(prop.table(tab)*100, digits = 1)
  never_tab<-as.vector(tab)[2]
  
  tab<-table(sometimes)
  tab<-round(prop.table(tab)*100, digits = 1)
  sometimes_tab<-as.vector(tab)[2]
  
  tab<-table(always)
  tab<-round(prop.table(tab)*100, digits = 1)
  always_tab<-as.vector(tab)[2]
  
  tab<-matrix(data.frame(never_tab,sometimes_tab,always_tab))
  ylim <- c(0, 1.2*max(unlist(tab)))
  row.names(tab)<- c("always","sometimes", "never")
  tab<-t(tab)
  bp<-barplot(tab, col=c("skyblue"), ylim = ylim, main = main, xlab = xlab, ylab = ylab)
  text(x = bp, y = tab, label = paste(tab,"%", sep = ""), pos = 3, cex = 0.8, col = "red")
}

Gender: There is no significant difference in the groups. Gender is not the main criterion explains the absence of patients at the doctor’s appointment.

draw_barplot(noShow.never$Gender,noShow.sometimes$Gender, noShow.always$Gender,
             main="Percentage of women among those \n who miss a doctor's appointment",
             xlab = "Division by groups, patients who missed doctor's appointments",
             ylab = "Percentage")

Scholarship”: There is no significant difference in the groups. But it looks like that educated patients prone more to miss appointments (but the difference is not significant).

draw_barplot(noShow.never$Scholarship,noShow.sometimes$Scholarship, noShow.always$Scholarship,
             main="Percentage of educated people among those \n who miss a doctor's appointment",
             xlab = "Division by groups, patients who missed doctor's appointments",
             ylab = "Percentage")

Hipertension”:Patients, who suffer from hypertension tend fewer miss appointments- more than 5% difference between groups “never miss appointments” and “always miss appointments”.

draw_barplot(noShow.never$Hipertension,noShow.sometimes$Hipertension, noShow.always$Hipertension,
             main="Percentage of patients with Hipertension among\n those who miss a doctor's appointment",
             xlab = "Division by groups, patients who missed doctor's appointments",
             ylab = "Percentage")

Diabete”: There is no significant difference in the groups. But it looks like that diabetic patients tend less to miss appointments (but the difference is not significant).

draw_barplot(noShow.never$Diabetes,noShow.sometimes$Diabetes, noShow.always$Diabetes,
             main="Percentage of diabetic patients among those \n who miss a doctor's appointment",
             xlab = "Division by groups: patients who missed doctor's appointments",
             ylab = "Percentage")

Alcoholism”: There is no significant difference in the groups. But it looks like that alcoholism patient prone more to miss appointments (but the difference is not significant).

draw_barplot(noShow.never$Alcoholism,noShow.sometimes$Alcoholism, noShow.always$Alcoholism,
             main="Percentage of alcoholists patients among those \n who miss a doctor's appointment",
             xlab = "Division by groups, missing the appointment",
             ylab = "Percentage")

Handcap”: Patients who suffer from handicap tend less to miss appointments.

noShow.never.Handcap<-as.vector(sapply(noShow.never[,"Handcap"], function(x) if (x==0) 0 else 1 ))
noShow.sometimes.Handcap<-as.vector(sapply(noShow.sometimes[,"Handcap"], function(x) if (x==0) 0 else 1 ))
noShow.always.Handcap<-as.vector(sapply(noShow.always[,"Handcap"], function(x) if (x==0) 0 else 1 ))

draw_barplot(noShow.never.Handcap,noShow.sometimes.Handcap, noShow.always.Handcap,
             main="Percentage of patients with handcap among those \n who miss a doctor's appointment",
             xlab = "Division by groups, missing the appointment",
             ylab = "Percentage")

Neighbourhood”: Let’s analyze whether neighborhood of patients can explain “missing” appointments. For this purpose, firstly, I’ll find distributions of “Neighbourhood” variable in all three groups, after, I’ll calculate ratio of the each neighborhood in each of three groups to “Neighbourhood” variable in general dataset, and order the data.

#Neighbourhood
tab<-table(noShow.modified$Neighbourhood)
noShow.neighbourhoods<-data.frame("Neighbourhood"=names(tab), "quantityOfPatients"=as.vector(tab))
row.names(noShow.neighbourhoods)<-gsub(" ","_", noShow.neighbourhoods$Neighbourhood)


#Never
tab<-table(noShow.never$Neighbourhood)
neverShoW.neighbourhoods<-data.frame("Neighbourhood"=names(tab), "quantityOfPatients"=as.vector(tab))
neverShoW.neighbourhoods$ratio<-rep(0, nrow(neverShoW.neighbourhoods))
for ( i in 1:nrow(neverShoW.neighbourhoods))
{
  rowname<-gsub(" ","_",  neverShoW.neighbourhoods[i,"Neighbourhood"])
  neverShoW.neighbourhoods$ratio[i]<-round(neverShoW.neighbourhoods$quantityOfPatients[i]/noShow.neighbourhoods[rowname,"quantityOfPatients"]*100,digits = 0)

}
neverShoW.neighbourhoods<-neverShoW.neighbourhoods[order(neverShoW.neighbourhoods$ratio, decreasing = TRUE), ]

#Sometimes
tab<-table(noShow.sometimes$Neighbourhood)
sometimesShow.neighbourhoods<-data.frame("Neighbourhood"=names(tab), "quantityOfPatients"=as.vector(tab))
sometimesShow.neighbourhoods$ratio<-rep(0, nrow(sometimesShow.neighbourhoods))
for ( i in 1:nrow(sometimesShow.neighbourhoods))
{
  rowname<-gsub(" ","_",  sometimesShow.neighbourhoods[i,"Neighbourhood"])
  #print(rowname)
  sometimesShow.neighbourhoods$ratio[i]<-round(sometimesShow.neighbourhoods$quantityOfPatients[i]/noShow.neighbourhoods[rowname,"quantityOfPatients"]*100,digits = 0)
  
}
sometimesShow.neighbourhoods<-sometimesShow.neighbourhoods[order(sometimesShow.neighbourhoods$ratio, decreasing = TRUE), ]


#always
tab<-table(noShow.always$Neighbourhood)
alwaysShow.neighbourhoods<-data.frame("Neighbourhood"=names(tab), "quantityOfPatients"=as.vector(tab))
alwaysShow.neighbourhoods$ratio<-rep(0, nrow(alwaysShow.neighbourhoods))
for ( i in 1:nrow(alwaysShow.neighbourhoods))
{
  rowname<-gsub(" ","_",  alwaysShow.neighbourhoods[i,"Neighbourhood"])
  #print(rowname)
  alwaysShow.neighbourhoods$ratio[i]<-round(alwaysShow.neighbourhoods$quantityOfPatients[i]/noShow.neighbourhoods[rowname,"quantityOfPatients"]*100,digits = 0)
  
}
alwaysShow.neighbourhoods<-alwaysShow.neighbourhoods[order(alwaysShow.neighbourhoods$ratio, decreasing = TRUE), ]

5 first neighbourhoods in “always missing appointments” group:

head(neverShoW.neighbourhoods,5)
##                   Neighbourhood quantityOfPatients ratio
## 35 ILHAS OCEÂNICAS DE TRINDADE                  2   100
## 50                     NAZARETH                 19    14
## 29                        HORTO                 22    13
## 1                     AEROPORTO                  1    12
## 65               SANTA CECÍLIA                 54    12

5 first neighbourhoods in “sometimes missing appointments” group:

head(sometimesShow.neighbourhoods,5)
##        Neighbourhood quantityOfPatients ratio
## 33 ILHA DO PRÍNCIPE                246    11
## 39 JESUS DE NAZARETH                319    11
## 72     SANTOS DUMONT                144    11
## 1         ANDORINHAS                233    10
## 4     BARRO VERMELHO                 41    10

5 first neighbourhoods in “never missing appointments” group:

head(alwaysShow.neighbourhoods,5)
##        Neighbourhood quantityOfPatients ratio
## 51 PARQUE INDUSTRIAL                  1   100
## 1          AEROPORTO                  6    75
## 32       ILHA DO BOI                 19    54
## 48 MORADA DE CAMBURI                 52    54
## 3  ANTÔNIO HONÓRIO                135    50

Despite that those neighborhoods take 10 first places- it is difficult to conclude that patients from those neighborhoods prone to always miss the doctor’s appointments ot tend to always arrive, because of the number of patients from those neighborhoods significantly less than the number of patients from other neighborhoods - tens against thousands. For example, impossible to say that patients from “ILHAS OCEÂNICAS DE TRINDADE” always miss the appointment, despite that all 100% of patients in this “always missing” group - We have only 2 observations at all from this neighborhood against a thousand observations from other neighborhoods. And impossible to say that patients from “PARQUE INDUSTRIAL” always arrive at the appointment, despite that all 100% of patients in this “always arriving” group - We have only 1 observations at all from this neighborhood against a thousand observations from other neighborhoods.