Summary

This is EDA related to dataset No Show provided by JoniHoppen on Kaggle.

NOTE: In this project, I am not doing any CDA(Confirmatory Data Analysis) which means I haven’t conducted any statistical tests.

Dataset

This Dataset collects information from medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment.

#Summary of columns data

  • PatientId : Identification of a patient
  • AppointmentID : Identification of each appointment
  • Gender : Male or Female
  • ScheduledDay : The day of the actuall appointment, when they have to visit the doctor.
  • AppointmentDay : The day someone called or registered the appointment, this is before appointment of course.
  • Age : How old is patient
  • Neighbourhood : Where the appointment takes place.
  • Scholarship : True of False . Please follow the link for understanding this variable https://en.wikipedia.org/wiki/Bolsa_Fam%C3%ADlia
  • Hipertension : True or False
  • Diabetes: True or False
  • Alcoholism : True or False
  • Handcap : True or False
  • SMS_received : 1 or more messages sent to the patient.
  • No-show : YES or NO

Introduction

The main aim of this project is to determine what could be associated with the patients’ likelihood of missing the appointments. In my analysis, I tried to answer the following questions:

  1. Does Gender play any role in missing appointments? If so, which gender is more likely to miss the appointment?

  2. Is the interval between Schedule day and Appointment day related to the likelihood of patients showing up?

  3. How is Scholarship and conditions like Hypertension, diabetes, alcoholism associated with the missing rate?

  4. Does receiving SMS increase the chances of patients showing up?

Loading required packages and data

suppressMessages(library(ggplot2))
suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(reshape2))
suppressMessages(library(Hmisc))
suppressMessages(library(GGally))
suppressMessages(library(funModeling))
suppressMessages(library(lubridate))
suppressMessages(library(gridExtra))

Load the Data

appt_no_show_data= read.csv('Doc_appt_EDA_data.csv')

Let’s see the top 6 and last 6 values of the dataset provided

head(appt_no_show_data)
tail(appt_no_show_data)

Data Wrangling

Let’s check out the structure of the dataset:

str(appt_no_show_data)
## 'data.frame':    110527 obs. of  14 variables:
##  $ PatientId     : num  2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
##  $ AppointmentID : int  5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
##  $ Gender        : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ScheduledDay  : Factor w/ 103549 levels "2015-11-10T07:13:56Z",..: 27742 27504 27539 27709 27498 20074 21386 21496 24945 20895 ...
##  $ AppointmentDay: Factor w/ 27 levels "2016-04-29T00:00:00Z",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age           : int  62 56 62 8 56 76 23 39 21 19 ...
##  $ Neighbourhood : Factor w/ 81 levels "AEROPORTO","ANDORINHAS",..: 40 40 47 55 40 59 26 26 2 13 ...
##  $ Scholarship   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hipertension  : int  1 0 0 0 1 1 0 0 0 0 ...
##  $ Diabetes      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Alcoholism    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Handcap       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SMS_received  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ No.show       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...

Here we found out that:

1.Scheduled day and appointment day are both dates and are presentd as Factors in data. I might need to creature another feature by getting the days between the two. So, I will convert these to datetime datatypes later in our process.

2.Scholarship, Hipertension,Diabetes,Alcoholism,Handcap and SMS_received are all integers in our data.These have values 0 and 1 which means either“yes” or “no”, so I will convert them to factors.

First Let’s rename the variables to Standard English Names:

names(appt_no_show_data)[names(appt_no_show_data) == "Hipertension"] <- "Hypertension"
names(appt_no_show_data)[names(appt_no_show_data) == "Handcap"] <- "Handicap"

Let’s convert every variable mentioned in point 2 above to factors

appt_no_show_data$Scholarship=as.factor(appt_no_show_data$Scholarship)
appt_no_show_data$Hypertension=as.factor(appt_no_show_data$Hypertension)
appt_no_show_data$Diabetes=as.factor(appt_no_show_data$Diabetes)
appt_no_show_data$Alcoholism=as.factor(appt_no_show_data$Alcoholism)
appt_no_show_data$Handicap=as.factor(appt_no_show_data$Handicap)
appt_no_show_data$SMS_received=as.factor(appt_no_show_data$SMS_received)

Let’s check the summary of our dataset

summary(appt_no_show_data)
##    PatientId         AppointmentID     Gender                  ScheduledDay   
##  Min.   :3.922e+04   Min.   :5030230   F:71840   2016-05-06T07:09:54Z:    24  
##  1st Qu.:4.173e+12   1st Qu.:5640286   M:38687   2016-05-06T07:09:53Z:    23  
##  Median :3.173e+13   Median :5680573             2016-04-25T17:17:46Z:    22  
##  Mean   :1.475e+14   Mean   :5675305             2016-04-25T17:18:27Z:    22  
##  3rd Qu.:9.439e+13   3rd Qu.:5725524             2016-04-25T17:17:23Z:    19  
##  Max.   :1.000e+15   Max.   :5790484             2016-04-25T17:19:03Z:    15  
##                                                  (Other)             :110402  
##               AppointmentDay       Age                 Neighbourhood  
##  2016-06-06T00:00:00Z: 4692   Min.   : -1.00   JARDIM CAMBURI : 7717  
##  2016-05-16T00:00:00Z: 4613   1st Qu.: 18.00   MARIA ORTIZ    : 5805  
##  2016-05-09T00:00:00Z: 4520   Median : 37.00   RESISTÊNCIA    : 4431  
##  2016-05-30T00:00:00Z: 4514   Mean   : 37.09   JARDIM DA PENHA: 3877  
##  2016-06-08T00:00:00Z: 4479   3rd Qu.: 55.00   ITARARÉ        : 3514  
##  2016-05-11T00:00:00Z: 4474   Max.   :115.00   CENTRO         : 3334  
##  (Other)             :83235                    (Other)        :81849  
##  Scholarship Hypertension Diabetes   Alcoholism Handicap   SMS_received
##  0:99666     0:88726      0:102584   0:107167   0:108286   0:75045     
##  1:10861     1:21801      1:  7943   1:  3360   1:  2042   1:35482     
##                                                 2:   183               
##                                                 3:    13               
##                                                 4:     3               
##                                                                        
##                                                                        
##  No.show    
##  No :88208  
##  Yes:22319  
##             
##             
##             
##             
## 

Age has a minimum value of -1 and maximum of 115. These seem to be outliers in the Age. We will figure these out in the later sections.

Almost 20% of patients were having Hypertenstion.

25% of the patients did not show-up for their appointment(which is our problem statement).

Check the data profile
profile_status= df_status(appt_no_show_data)
##          variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1       PatientId       0    0.00    0    0     0     0 numeric  62299
## 2   AppointmentID       0    0.00    0    0     0     0 integer 110527
## 3          Gender       0    0.00    0    0     0     0  factor      2
## 4    ScheduledDay       0    0.00    0    0     0     0  factor 103549
## 5  AppointmentDay       0    0.00    0    0     0     0  factor     27
## 6             Age    3539    3.20    0    0     0     0 integer    104
## 7   Neighbourhood       0    0.00    0    0     0     0  factor     81
## 8     Scholarship   99666   90.17    0    0     0     0  factor      2
## 9    Hypertension   88726   80.28    0    0     0     0  factor      2
## 10       Diabetes  102584   92.81    0    0     0     0  factor      2
## 11     Alcoholism  107167   96.96    0    0     0     0  factor      2
## 12       Handicap  108286   97.97    0    0     0     0  factor      5
## 13   SMS_received   75045   67.90    0    0     0     0  factor      2
## 14        No.show       0    0.00    0    0     0     0  factor      2
  • q_zeros: represent the number of zeros in a variable
  • p_zeros: represent the percentage of zeros.
  • q_na: represent the number of NAs.
  • p_na: represent the percentage of NAs.
  • q_inf: represent the number of infinite values.
  • p_inf: represent the percentage of infinite values.
  • unique: represents the unique values in a variable.
  1. Age has 3539 zeros which is 3.2% of the total values. Here zero represent the children less than 1 year old.

  2. There are 99666 zeros in scholarship which is 90% of the total observations. But zero here has a significance as zero means that the patient did not have a sholarship.

  3. The zeros in Hipertension,Diabetes,Alcoholism,Handcap,SMS_received are also signifying the same thing as above which means these are not missing values.

Before we could treat the missing values and see some outliers in the data, Let’s do some feature engineering to create new features.

Feature Engineering

Firstly, I will convert the appointment day and schedule day to date.

I am using lubridate package in R for doing all the date conversions.

appt_no_show_data$ScheduledDay=ymd_hms(appt_no_show_data$ScheduledDay)
appt_no_show_data$AppointmentDay=ymd_hms(appt_no_show_data$AppointmentDay)
str(appt_no_show_data)
## 'data.frame':    110527 obs. of  14 variables:
##  $ PatientId     : num  2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
##  $ AppointmentID : int  5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
##  $ Gender        : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ScheduledDay  : POSIXct, format: "2016-04-29 18:38:08" "2016-04-29 16:08:27" ...
##  $ AppointmentDay: POSIXct, format: "2016-04-29" "2016-04-29" ...
##  $ Age           : int  62 56 62 8 56 76 23 39 21 19 ...
##  $ Neighbourhood : Factor w/ 81 levels "AEROPORTO","ANDORINHAS",..: 40 40 47 55 40 59 26 26 2 13 ...
##  $ Scholarship   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Hypertension  : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 1 1 1 1 ...
##  $ Diabetes      : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ Alcoholism    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Handicap      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SMS_received  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ No.show       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...

The Appointment Day and Schedule day are now in Date format.

Let’s make a new feature to see the Day of Week on which the appointment was made and of the scheduled date.

appt_no_show_data$DOW_ScheduledDay=wday(appt_no_show_data$ScheduledDay,label = TRUE)
appt_no_show_data$DOW_AppointmentDay=wday(appt_no_show_data$AppointmentDay,label= TRUE)

Let’s make another variable to categorize the days into weekday and weekend.

appt_no_show_data$day_cat_schdday=ifelse(appt_no_show_data$DOW_ScheduledDay %in% c('Sat','Sun'),'Weekend','Weekday')

appt_no_show_data$day_cat_apptday=ifelse(appt_no_show_data$DOW_AppointmentDay %in% c('Sat','Sun'),'Weekend','Weekday')

appt_no_show_data$day_cat_schdday=as.factor(appt_no_show_data$day_cat_schdday)

appt_no_show_data$day_cat_apptday=as.factor(appt_no_show_data$day_cat_apptday)

Let’s check the summary again

summary(appt_no_show_data)
##    PatientId         AppointmentID     Gender     ScheduledDay                
##  Min.   :3.922e+04   Min.   :5030230   F:71840   Min.   :2015-11-10 07:13:56  
##  1st Qu.:4.173e+12   1st Qu.:5640286   M:38687   1st Qu.:2016-04-29 10:27:01  
##  Median :3.173e+13   Median :5680573             Median :2016-05-10 12:13:17  
##  Mean   :1.475e+14   Mean   :5675305             Mean   :2016-05-09 07:49:15  
##  3rd Qu.:9.439e+13   3rd Qu.:5725524             3rd Qu.:2016-05-20 11:18:37  
##  Max.   :1.000e+15   Max.   :5790484             Max.   :2016-06-08 20:07:23  
##                                                                               
##  AppointmentDay                     Age                 Neighbourhood  
##  Min.   :2016-04-29 00:00:00   Min.   : -1.00   JARDIM CAMBURI : 7717  
##  1st Qu.:2016-05-09 00:00:00   1st Qu.: 18.00   MARIA ORTIZ    : 5805  
##  Median :2016-05-18 00:00:00   Median : 37.00   RESISTÊNCIA    : 4431  
##  Mean   :2016-05-19 00:57:50   Mean   : 37.09   JARDIM DA PENHA: 3877  
##  3rd Qu.:2016-05-31 00:00:00   3rd Qu.: 55.00   ITARARÉ        : 3514  
##  Max.   :2016-06-08 00:00:00   Max.   :115.00   CENTRO         : 3334  
##                                                 (Other)        :81849  
##  Scholarship Hypertension Diabetes   Alcoholism Handicap   SMS_received
##  0:99666     0:88726      0:102584   0:107167   0:108286   0:75045     
##  1:10861     1:21801      1:  7943   1:  3360   1:  2042   1:35482     
##                                                 2:   183               
##                                                 3:    13               
##                                                 4:     3               
##                                                                        
##                                                                        
##  No.show     DOW_ScheduledDay DOW_AppointmentDay day_cat_schdday 
##  No :88208   Sun:    0        Sun:    0          Weekday:110503  
##  Yes:22319   Mon:23085        Mon:22715          Weekend:    24  
##              Tue:26168        Tue:25640                          
##              Wed:24262        Wed:25867                          
##              Thu:18073        Thu:17247                          
##              Fri:18915        Fri:19019                          
##              Sat:   24        Sat:   39                          
##  day_cat_apptday 
##  Weekday:110488  
##  Weekend:    39  
##                  
##                  
##                  
##                  
## 

There re very few appointments on Saturday and there isn’t any patient either scheduling or visiting the doctor on a sunday.

Let’s create another feature to see the number of days between scheduled day and appointment day. This feature seems quite important to me as I want to see how difference of days between the appointment day and schedule day.

appt_no_show_data$Days_Interval=as.Date(appt_no_show_data$AppointmentDay)-as.Date(appt_no_show_data$ScheduledDay)

Exploratory Data Analysis

1. Univariate Analysis:

First we will create some bar graphs for all the categorical values:

freq(appt_no_show_data)

##   Gender frequency percentage cumulative_perc
## 1      F     71840         65              65
## 2      M     38687         35             100

##                  Neighbourhood frequency percentage cumulative_perc
## 1               JARDIM CAMBURI      7717       6.98            6.98
## 2                  MARIA ORTIZ      5805       5.25           12.23
## 3                  RESISTÊNCIA      4431       4.01           16.24
## 4              JARDIM DA PENHA      3877       3.51           19.75
## 5                      ITARARÉ      3514       3.18           22.93
## 6                       CENTRO      3334       3.02           25.95
## 7                   TABUAZEIRO      3132       2.83           28.78
## 8                 SANTA MARTHA      3131       2.83           31.61
## 9            JESUS DE NAZARETH      2853       2.58           34.19
## 10                      BONFIM      2773       2.51           36.70
## 11               SANTO ANTÔNIO      2746       2.48           39.18
## 12                 SANTO ANDRÉ      2571       2.33           41.51
## 13                   CARATOÍRA      2565       2.32           43.83
## 14                      JABOUR      2509       2.27           46.10
## 15                   SÃO PEDRO      2448       2.21           48.31
## 16            ILHA DO PRÍNCIPE      2266       2.05           50.36
## 17              NOVA PALESTINA      2264       2.05           52.41
## 18                  ANDORINHAS      2262       2.05           54.46
## 19                    DA PENHA      2217       2.01           56.47
## 20                       ROMÃO      2215       2.00           58.47
## 21                    GURIGICA      2018       1.83           60.30
## 22                    SÃO JOSÉ      1977       1.79           62.09
## 23                  BELA VISTA      1907       1.73           63.82
## 24                     MARUÍPE      1902       1.72           65.54
## 25              FORTE SÃO JOÃO      1889       1.71           67.25
## 26         ILHA DE SANTA MARIA      1885       1.71           68.96
## 27               SÃO CRISTÓVÃO      1836       1.66           70.62
## 28                    REDENÇÃO      1553       1.41           72.03
## 29                SÃO BENEDITO      1439       1.30           73.33
## 30                 JOANA D´ARC      1427       1.29           74.62
## 31                  CRUZAMENTO      1398       1.26           75.88
## 32                  CONSOLAÇÃO      1376       1.24           77.12
## 33                SANTA TEREZA      1332       1.21           78.33
## 34                PRAIA DO SUÁ      1288       1.17           79.50
## 35               SANTOS DUMONT      1276       1.15           80.65
## 36              GRANDE VITÓRIA      1071       0.97           81.62
## 37           ILHA DAS CAIEIRAS      1071       0.97           82.59
## 38                  INHANGUETÁ      1057       0.96           83.55
## 39              PRAIA DO CANTO      1035       0.94           84.49
## 40              BENTO FERREIRA       858       0.78           85.27
## 41                  VILA RUBIM       851       0.77           86.04
## 42                   CONQUISTA       849       0.77           86.81
## 43                   DO QUADRO       849       0.77           87.58
## 44                   REPÚBLICA       835       0.76           88.34
## 45                  MONTE BELO       824       0.75           89.09
## 46              PARQUE MOSCOSO       802       0.73           89.82
## 47                  GOIABEIRAS       700       0.63           90.45
## 48                 JUCUTUQUARA       694       0.63           91.08
## 49                FONTE GRANDE       682       0.62           91.70
## 50               MATA DA PRAIA       644       0.58           92.28
## 51                   DO CABRAL       560       0.51           92.79
## 52                 SANTOS REIS       547       0.49           93.28
## 53                  ESTRELINHA       538       0.49           93.77
## 54                 SANTA CLARA       506       0.46           94.23
## 55                SOLON BORGES       469       0.42           94.65
## 56                     PIEDADE       452       0.41           95.06
## 57               SANTA CECÍLIA       448       0.41           95.47
## 58                 SANTA LÚCIA       438       0.40           95.87
## 59                 SANTA LUÍZA       428       0.39           96.26
## 60              BARRO VERMELHO       423       0.38           96.64
## 61                  DO MOSCOSO       413       0.37           97.01
## 62              MÁRIO CYPRESTE       371       0.34           97.35
## 63                   BOA VISTA       312       0.28           97.63
## 64                     COMDUSA       310       0.28           97.91
## 65                  DE LOURDES       305       0.28           98.19
## 66         ARIOVALDO FAVALESSA       282       0.26           98.45
## 67             ANTÔNIO HONÓRIO       271       0.25           98.70
## 68                   FRADINHOS       258       0.23           98.93
## 69              ENSEADA DO SUÁ       235       0.21           99.14
## 70                SANTA HELENA       178       0.16           99.30
## 71                       HORTO       175       0.16           99.46
## 72               UNIVERSITÁRIO       152       0.14           99.60
## 73            SEGURANÇA DO LAR       145       0.13           99.73
## 74                    NAZARETH       135       0.12           99.85
## 75           MORADA DE CAMBURI        96       0.09           99.94
## 76           PONTAL DE CAMBURI        69       0.06          100.00
## 77                 ILHA DO BOI        35       0.03          100.03
## 78               ILHA DO FRADE        10       0.01          100.04
## 79                   AEROPORTO         8       0.01          100.05
## 80 ILHAS OCEÂNICAS DE TRINDADE         2       0.00          100.05
## 81           PARQUE INDUSTRIAL         1       0.00          100.00

##   Scholarship frequency percentage cumulative_perc
## 1           0     99666      90.17           90.17
## 2           1     10861       9.83          100.00

##   Hypertension frequency percentage cumulative_perc
## 1            0     88726      80.28           80.28
## 2            1     21801      19.72          100.00

##   Diabetes frequency percentage cumulative_perc
## 1        0    102584      92.81           92.81
## 2        1      7943       7.19          100.00

##   Alcoholism frequency percentage cumulative_perc
## 1          0    107167      96.96           96.96
## 2          1      3360       3.04          100.00

##   Handicap frequency percentage cumulative_perc
## 1        0    108286      97.97           97.97
## 2        1      2042       1.85           99.82
## 3        2       183       0.17           99.99
## 4        3        13       0.01          100.00
## 5        4         3       0.00          100.00

##   SMS_received frequency percentage cumulative_perc
## 1            0     75045       67.9            67.9
## 2            1     35482       32.1           100.0

##   No.show frequency percentage cumulative_perc
## 1      No     88208      79.81           79.81
## 2     Yes     22319      20.19          100.00

##   day_cat_schdday frequency percentage cumulative_perc
## 1         Weekday    110503      99.98           99.98
## 2         Weekend        24       0.02          100.00

##   day_cat_apptday frequency percentage cumulative_perc
## 1         Weekday    110488      99.96           99.96
## 2         Weekend        39       0.04          100.00
## [1] "Variables processed: Gender, Neighbourhood, Scholarship, Hypertension, Diabetes, Alcoholism, Handicap, SMS_received, No.show, day_cat_schdday, day_cat_apptday"

Let’s check out the continuous variables now: For continuous variables, I am using histograms

AGE

ggplot(data=appt_no_show_data)+
  geom_histogram(mapping = aes(x=Age),binwidth=1)

From this visualization, it is clear that highest number of patients are children with the Age less than 1 year. Then there’s another spike at the age of 55. Let’s zoom in more to see if there are any outliers:

ggplot(data=appt_no_show_data)+
  geom_histogram(mapping = aes(x=Age),binwidth=1)+
  coord_cartesian(xlim = c(0,120), ylim = c(0,30))

From this graph, it is evident that there are few values with Age<0 and Age>100. Age<0 clearly seems to be an outlier. However, Age>100 is quite unlikely but we can keep that data.

Before treating the outliers in Age, Let’s check the Days_Interval: Days_Interval

ggplot(data=appt_no_show_data)+
  geom_histogram(mapping = aes(x=Days_Interval),binwidth=5)+
  coord_cartesian(xlim = c(0,100))

For most of the data, the days between the scheduled and the appointment day is less than 25 days. We will further plot this variable against the No_show percentage.

In the graph above, we can clearly see that there are some values where the Days_Interval is less than 0. This can not be valid data since a patient can not go for an appointment before scheduling it. Let’s treat the outliers here:

table(appt_no_show_data$Days_Interval)
## 
##    -6    -1     0     1     2     3     4     5     6     7     8     9    10 
##     1     4 38563  5213  6725  2737  5290  3277  4037  4906  2332  1605  1391 
##    11    12    13    14    15    16    17    18    19    20    21    22    23 
##   987  1115  1682  2913  1503  1151  1107  1021  1044  1187  1861  1173   822 
##    24    25    26    27    28    29    30    31    32    33    34    35    36 
##   622   637   731  1013  1706  1089   704   672   586   664   808   963   551 
##    37    38    39    40    41    42    43    44    45    46    47    48    49 
##   331   199   170   254   342   417   254   157   137   121   109   117   216 
##    50    51    52    53    54    55    56    57    58    59    60    61    62 
##   182    91    90    96    73   109   203    98   110    90    73   135   193 
##    63    64    65    66    67    68    69    70    71    72    73    74    75 
##   200    75    97   162    63    70   105   112    50    53    37    39    72 
##    76    77    78    79    80    81    82    83    84    85    86    87    88 
##    65    53    33    23    18    10     1     8    54    37     6    16    16 
##    89    90    91    92    93    94    95    96    97    98   101   102   103 
##    17    58    56     3     2     2     5     4     2     5     1     4     5 
##   104   105   107   108   109   110   111   112   115   117   119   122   123 
##     8     4     2     5     5     2     5     5     2     1     4     3     1 
##   125   126   127   132   133   139   142   146   151   155   162   169   176 
##     1     1     1     1    11     1     8     1     1    10    11     8    16 
##   179 
##    10

The frequency here shows that there are total 5 values where the difference of Scheduled day and Appointment day is negative.

I am filtering the data where interval is less than 0

appt_no_show_data = filter(appt_no_show_data,Days_Interval>=0)

Let’s see the data after removal:

table(appt_no_show_data$Days_Interval)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
## 38563  5213  6725  2737  5290  3277  4037  4906  2332  1605  1391   987  1115 
##    13    14    15    16    17    18    19    20    21    22    23    24    25 
##  1682  2913  1503  1151  1107  1021  1044  1187  1861  1173   822   622   637 
##    26    27    28    29    30    31    32    33    34    35    36    37    38 
##   731  1013  1706  1089   704   672   586   664   808   963   551   331   199 
##    39    40    41    42    43    44    45    46    47    48    49    50    51 
##   170   254   342   417   254   157   137   121   109   117   216   182    91 
##    52    53    54    55    56    57    58    59    60    61    62    63    64 
##    90    96    73   109   203    98   110    90    73   135   193   200    75 
##    65    66    67    68    69    70    71    72    73    74    75    76    77 
##    97   162    63    70   105   112    50    53    37    39    72    65    53 
##    78    79    80    81    82    83    84    85    86    87    88    89    90 
##    33    23    18    10     1     8    54    37     6    16    16    17    58 
##    91    92    93    94    95    96    97    98   101   102   103   104   105 
##    56     3     2     2     5     4     2     5     1     4     5     8     4 
##   107   108   109   110   111   112   115   117   119   122   123   125   126 
##     2     5     5     2     5     5     2     1     4     3     1     1     1 
##   127   132   133   139   142   146   151   155   162   169   176   179 
##     1     1    11     1     8     1     1    10    11     8    16    10

We will do the same process with the outliers in Age:

Let’s check how many values are there with Age<0

age_freq <- table(appt_no_show_data$Age)
age_freq
## 
##   -1    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##    1 3539 2273 1618 1513 1299 1489 1521 1426 1424 1372 1274 1195 1092 1103 1118 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 1211 1402 1509 1487 1544 1437 1452 1375 1349 1242 1332 1283 1377 1448 1403 1521 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46 
## 1439 1505 1524 1526 1378 1580 1533 1628 1536 1402 1346 1272 1344 1487 1453 1460 
##   47   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62 
## 1394 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312 
##   63   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78 
## 1374 1331 1101 1187  973 1012  832  724  695  615  725  602  544  571  527  541 
##   79   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94 
##  390  511  433  392  280  311  275  260  184  126  173  109   66   86   53   33 
##   95   96   97   98   99  100  102  115 
##   24   17   11    6    1    4    2    5
age_freq[names(age_freq)<0]
## -1 
##  1

Let’s remove the value where Age<0:

appt_no_show_data=appt_no_show_data %>% filter(Age!=-1)
table(appt_no_show_data$Age)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 3539 2273 1618 1513 1299 1489 1521 1426 1424 1372 1274 1195 1092 1103 1118 1211 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
## 1402 1509 1487 1544 1437 1452 1375 1349 1242 1332 1283 1377 1448 1403 1521 1439 
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47 
## 1505 1524 1526 1378 1580 1533 1628 1536 1402 1346 1272 1344 1487 1453 1460 1394 
##   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63 
## 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312 1374 
##   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79 
## 1331 1101 1187  973 1012  832  724  695  615  725  602  544  571  527  541  390 
##   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95 
##  511  433  392  280  311  275  260  184  126  173  109   66   86   53   33   24 
##   96   97   98   99  100  102  115 
##   17   11    6    1    4    2    5

Bi-Variate Analysis:

Categorical vs Categorical
GENDER vs No.show
ggplot(data=appt_no_show_data)+
 geom_bar(mapping=aes(x=No.show,fill=Gender),position='dodge')

Here we can say that females tend to miss the data more than men does. But it is not clear from the above barchart.

To see it more evidently, Let’s check out the frequency:

tab_Gender <- table(appt_no_show_data$Gender, appt_no_show_data$No.show)
addmargins(tab_Gender)
##      
##           No    Yes    Sum
##   F    57245  14591  71836
##   M    30962   7723  38685
##   Sum  88207  22314 110521

Taking proportions can mathematically show us the impact. (A proportion is the relative frequency of items with a given characteristic in a given set)

Percent_Gender_no_show=appt_no_show_data %>% group_by(Gender) %>% summarise(Gender_noshow=sum((No.show=="Yes")/n()))
Percent_Gender_no_show

About 20% of the females out of total female patients missed their aapointment. However, 19.9% men out of their total population missed it. So, we can say Gender isn’t playing amy role in our case.

Let’s visualise it with the plot:

ggplot(appt_no_show_data,aes(x =Gender,fill = No.show)) + 
    geom_bar(position = "fill")

SCHOLARSHIP vs No.Show
ggplot(data=appt_no_show_data)+
 geom_bar(mapping=aes(x=No.show,fill=Scholarship),position='dodge')

Let’s make a freq table:

tab_Scholarship <- table(appt_no_show_data$Scholarship, appt_no_show_data$No.show)
addmargins(tab_Scholarship)
##      
##           No    Yes    Sum
##   0    79924  19736  99660
##   1     8283   2578  10861
##   Sum  88207  22314 110521

We will now create a proportions table to look at the relative frequency.

Percent_Scholarship_no_show=appt_no_show_data %>% group_by(Scholarship) %>% summarise(Scholarship_noshow=sum((No.show=="Yes")/n())*100)
Percent_Scholarship_no_show

It is evident now that those who received scholarship, tend to miss the appointment than those who didn’t receive it.

Let’s plot our observation:

ggplot(appt_no_show_data,aes(x =Scholarship,fill = No.show)) + 
    geom_bar(position = "fill")

We will repeat the same process for Hypertension, Diabetes, Alcoholism and Handicap.

Hypertension, Diabetes, Alcoholism and Handicap vs No.Show
Percent_Hypertension_no_show=appt_no_show_data %>% group_by(Hypertension) %>% summarise(Hypertension_noshow=sum((No.show=="Yes")/n())*100)
Percent_Hypertension_no_show
g_Diabetes <- ggplot(appt_no_show_data, aes(x=Diabetes, fill=No.show)) + geom_bar(position="fill")
g_Alcoholism <- ggplot(appt_no_show_data, aes(x=Alcoholism, fill=No.show)) + geom_bar(position="fill")
g_Hypertension <- ggplot(appt_no_show_data, aes(x=Hypertension, fill=No.show)) + geom_bar(position="fill")
g_Handicap <- ggplot(appt_no_show_data, aes(x=Handicap, fill=No.show)) + geom_bar(position="fill")

g_binary <- c(g_Diabetes, g_Alcoholism, g_Hypertension, g_Handicap)
grid.arrange(g_Diabetes, g_Alcoholism, g_Hypertension, g_Handicap,ncol=2, top='Health Conditions Analysis')

Diabetes, Alcoholism and Hypertension doesn’t have any association with the likelihood of patients missing their appointments.

SMS Received vs No show

Let’s see the SMS_Received:

Percent_SMS_no_show=appt_no_show_data %>% group_by(SMS_received) %>% summarise(SMS_noshow=sum((No.show=="Yes")/n())*100)
Percent_SMS_no_show
 ggplot(appt_no_show_data, aes(x=SMS_received, fill=No.show)) + geom_bar(position="fill")

Here we can clearly see that patients who received SMS have more rate of not showing up than who didn’t receive it.

Day_of_week for Appointment Day vs No Show:
ggplot(data=appt_no_show_data)+
 geom_bar(mapping=aes(x=No.show,fill=DOW_AppointmentDay),position='dodge')

Looks like tuesday was the day where most of the people missed their appointment. But to see it more evidently, let’s see the rate of missing appointment by Day of the Week:

Missing_Rate_by_DOW=appt_no_show_data %>% group_by(DOW_AppointmentDay) %>% summarise(Days_No_Show=sum(No.show=="Yes")/n())
Missing_Rate_by_DOW

Let’s plot the above table to have clear visibility:

appt_no_show_data %>% group_by(DOW_AppointmentDay) %>% summarise(Days_No_Show=sum(No.show=="Yes")/n())%>% ggplot(aes(x=DOW_AppointmentDay,y=Days_No_Show))+
  geom_point(alpha=1)+
  geom_smooth()+
  coord_cartesian(ylim = c(0,0.40))

Friday and Saturday are the two days where people tend to miss their appointments more than other days. But we have only 39 appointments on Saturday. So, our finding is not that significant

Continuous vs Categorical

Days_Interval vs No show
appt_no_show_data$Days_Interval=as.integer(appt_no_show_data$Days_Interval)
ggplot(data=appt_no_show_data,aes(x=No.show,y=Days_Interval,color=No.show))+
  geom_boxplot(alpha=0.5)

This boxplot clearly shows that waiting time is impacting the likelihood of people not showing up for their appointments. The more the waiting period, there is high tendency that patients will miss their appointment.

Let’s look at another graph to visualize it more evidently:

ggplot(appt_no_show_data, aes(x=Days_Interval, fill=No.show)) + 
                                geom_density(alpha=0.30) + 
                                coord_cartesian(xlim=c(0, 100))

Here, it is clear that there is a huge spike showing people who attended their appointments. With increasing days of waiting time, there are no data points of people who showed up.

AppointmentDay vs No Show

Let’e plot a scatter plot for Appointment day against the proportions for patients who didn’t show up.

appt_no_show_data %>% group_by(AppointmentDay) %>% summarise(total_noshow=sum(No.show=="Yes")/n()) %>% ggplot(aes(x=AppointmentDay, y=total_noshow)) + 
    geom_point(alpha=0.3) + geom_smooth(method = "lm")

It is clear that most of the patients missed their appointments in the month of May.

ggplot(data=appt_no_show_data,aes(x=No.show,y=Age,color=No.show))+
  geom_boxplot(alpha=0.5)

By visualising this graph, we can say that the younger patients tend to miss the appointments more than older patients.

Conclusion

From my findings after EDA of this data, I found out that following variables:

has assocaitiion with patients’ likelihood of showing up to their appointments up to some extent.

Age: Children less than 1 years of age tend to miss their appointments than other age groups. So, more efforts can be put in reaching the parents to encourage them to bring their kids to the appointments.

Days_Interval: There should be less waiting time for an appointment. If the waiting time is too much, there is a chance that patient might have recovered or has forgotten about the appointment scheduled.

To revisit some exceptional findings:

  1. SMS_Received: Patients who received SMS reminder for their appointments have higher chances of missing the appointments. So, we cna save the budget by not spending resources and time by sending SMS as it has no positive effect but rather has a negative effect.

  2. Scholarship: People who were enrolled in the special Bolsa Familia program or has Scholarship variable as TRUE, are more likely to not show up on their appointments.

Future work for this project is to do the Confirmatory Data Analysis in order to prove the tentative findings with statistical significance. Then after, use a classification model on this data.