This is EDA related to dataset No Show provided by JoniHoppen on Kaggle.
NOTE: In this project, I am not doing any CDA(Confirmatory Data Analysis) which means I haven’t conducted any statistical tests.
This Dataset collects information from medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment.
#Summary of columns data
- PatientId : Identification of a patient
- AppointmentID : Identification of each appointment
- Gender : Male or Female
- ScheduledDay : The day of the actuall appointment, when they have to visit the doctor.
- AppointmentDay : The day someone called or registered the appointment, this is before appointment of course.
- Age : How old is patient
- Neighbourhood : Where the appointment takes place.
- Scholarship : True of False . Please follow the link for understanding this variable https://en.wikipedia.org/wiki/Bolsa_Fam%C3%ADlia
- Hipertension : True or False
- Diabetes: True or False
- Alcoholism : True or False
- Handcap : True or False
- SMS_received : 1 or more messages sent to the patient.
- No-show : YES or NO
The main aim of this project is to determine what could be associated with the patients’ likelihood of missing the appointments. In my analysis, I tried to answer the following questions:
Does Gender play any role in missing appointments? If so, which gender is more likely to miss the appointment?
Is the interval between Schedule day and Appointment day related to the likelihood of patients showing up?
How is Scholarship and conditions like Hypertension, diabetes, alcoholism associated with the missing rate?
Does receiving SMS increase the chances of patients showing up?
suppressMessages(library(ggplot2))
suppressMessages(library(plyr))
suppressMessages(library(dplyr))
suppressMessages(library(reshape2))
suppressMessages(library(Hmisc))
suppressMessages(library(GGally))
suppressMessages(library(funModeling))
suppressMessages(library(lubridate))
suppressMessages(library(gridExtra))
appt_no_show_data= read.csv('Doc_appt_EDA_data.csv')
Let’s see the top 6 and last 6 values of the dataset provided
head(appt_no_show_data)
tail(appt_no_show_data)
Let’s check out the structure of the dataset:
str(appt_no_show_data)
## 'data.frame': 110527 obs. of 14 variables:
## $ PatientId : num 2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
## $ AppointmentID : int 5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
## $ ScheduledDay : Factor w/ 103549 levels "2015-11-10T07:13:56Z",..: 27742 27504 27539 27709 27498 20074 21386 21496 24945 20895 ...
## $ AppointmentDay: Factor w/ 27 levels "2016-04-29T00:00:00Z",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 62 56 62 8 56 76 23 39 21 19 ...
## $ Neighbourhood : Factor w/ 81 levels "AEROPORTO","ANDORINHAS",..: 40 40 47 55 40 59 26 26 2 13 ...
## $ Scholarship : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Hipertension : int 1 0 0 0 1 1 0 0 0 0 ...
## $ Diabetes : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Alcoholism : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Handcap : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SMS_received : int 0 0 0 0 0 0 0 0 0 0 ...
## $ No.show : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...
Here we found out that:
1.Scheduled day and appointment day are both dates and are presentd as Factors in data. I might need to creature another feature by getting the days between the two. So, I will convert these to datetime datatypes later in our process.
2.Scholarship, Hipertension,Diabetes,Alcoholism,Handcap and SMS_received are all integers in our data.These have values 0 and 1 which means either“yes” or “no”, so I will convert them to factors.
First Let’s rename the variables to Standard English Names:
names(appt_no_show_data)[names(appt_no_show_data) == "Hipertension"] <- "Hypertension"
names(appt_no_show_data)[names(appt_no_show_data) == "Handcap"] <- "Handicap"
Let’s convert every variable mentioned in point 2 above to factors
appt_no_show_data$Scholarship=as.factor(appt_no_show_data$Scholarship)
appt_no_show_data$Hypertension=as.factor(appt_no_show_data$Hypertension)
appt_no_show_data$Diabetes=as.factor(appt_no_show_data$Diabetes)
appt_no_show_data$Alcoholism=as.factor(appt_no_show_data$Alcoholism)
appt_no_show_data$Handicap=as.factor(appt_no_show_data$Handicap)
appt_no_show_data$SMS_received=as.factor(appt_no_show_data$SMS_received)
Let’s check the summary of our dataset
summary(appt_no_show_data)
## PatientId AppointmentID Gender ScheduledDay
## Min. :3.922e+04 Min. :5030230 F:71840 2016-05-06T07:09:54Z: 24
## 1st Qu.:4.173e+12 1st Qu.:5640286 M:38687 2016-05-06T07:09:53Z: 23
## Median :3.173e+13 Median :5680573 2016-04-25T17:17:46Z: 22
## Mean :1.475e+14 Mean :5675305 2016-04-25T17:18:27Z: 22
## 3rd Qu.:9.439e+13 3rd Qu.:5725524 2016-04-25T17:17:23Z: 19
## Max. :1.000e+15 Max. :5790484 2016-04-25T17:19:03Z: 15
## (Other) :110402
## AppointmentDay Age Neighbourhood
## 2016-06-06T00:00:00Z: 4692 Min. : -1.00 JARDIM CAMBURI : 7717
## 2016-05-16T00:00:00Z: 4613 1st Qu.: 18.00 MARIA ORTIZ : 5805
## 2016-05-09T00:00:00Z: 4520 Median : 37.00 RESISTÊNCIA : 4431
## 2016-05-30T00:00:00Z: 4514 Mean : 37.09 JARDIM DA PENHA: 3877
## 2016-06-08T00:00:00Z: 4479 3rd Qu.: 55.00 ITARARÉ : 3514
## 2016-05-11T00:00:00Z: 4474 Max. :115.00 CENTRO : 3334
## (Other) :83235 (Other) :81849
## Scholarship Hypertension Diabetes Alcoholism Handicap SMS_received
## 0:99666 0:88726 0:102584 0:107167 0:108286 0:75045
## 1:10861 1:21801 1: 7943 1: 3360 1: 2042 1:35482
## 2: 183
## 3: 13
## 4: 3
##
##
## No.show
## No :88208
## Yes:22319
##
##
##
##
##
Age has a minimum value of -1 and maximum of 115. These seem to be outliers in the Age. We will figure these out in the later sections.
Almost 20% of patients were having Hypertenstion.
25% of the patients did not show-up for their appointment(which is our problem statement).
profile_status= df_status(appt_no_show_data)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 PatientId 0 0.00 0 0 0 0 numeric 62299
## 2 AppointmentID 0 0.00 0 0 0 0 integer 110527
## 3 Gender 0 0.00 0 0 0 0 factor 2
## 4 ScheduledDay 0 0.00 0 0 0 0 factor 103549
## 5 AppointmentDay 0 0.00 0 0 0 0 factor 27
## 6 Age 3539 3.20 0 0 0 0 integer 104
## 7 Neighbourhood 0 0.00 0 0 0 0 factor 81
## 8 Scholarship 99666 90.17 0 0 0 0 factor 2
## 9 Hypertension 88726 80.28 0 0 0 0 factor 2
## 10 Diabetes 102584 92.81 0 0 0 0 factor 2
## 11 Alcoholism 107167 96.96 0 0 0 0 factor 2
## 12 Handicap 108286 97.97 0 0 0 0 factor 5
## 13 SMS_received 75045 67.90 0 0 0 0 factor 2
## 14 No.show 0 0.00 0 0 0 0 factor 2
- q_zeros: represent the number of zeros in a variable
- p_zeros: represent the percentage of zeros.
- q_na: represent the number of NAs.
- p_na: represent the percentage of NAs.
- q_inf: represent the number of infinite values.
- p_inf: represent the percentage of infinite values.
- unique: represents the unique values in a variable.
Age has 3539 zeros which is 3.2% of the total values. Here zero represent the children less than 1 year old.
There are 99666 zeros in scholarship which is 90% of the total observations. But zero here has a significance as zero means that the patient did not have a sholarship.
The zeros in Hipertension,Diabetes,Alcoholism,Handcap,SMS_received are also signifying the same thing as above which means these are not missing values.
Before we could treat the missing values and see some outliers in the data, Let’s do some feature engineering to create new features.
Firstly, I will convert the appointment day and schedule day to date.
I am using lubridate package in R for doing all the date conversions.
appt_no_show_data$ScheduledDay=ymd_hms(appt_no_show_data$ScheduledDay)
appt_no_show_data$AppointmentDay=ymd_hms(appt_no_show_data$AppointmentDay)
str(appt_no_show_data)
## 'data.frame': 110527 obs. of 14 variables:
## $ PatientId : num 2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
## $ AppointmentID : int 5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
## $ ScheduledDay : POSIXct, format: "2016-04-29 18:38:08" "2016-04-29 16:08:27" ...
## $ AppointmentDay: POSIXct, format: "2016-04-29" "2016-04-29" ...
## $ Age : int 62 56 62 8 56 76 23 39 21 19 ...
## $ Neighbourhood : Factor w/ 81 levels "AEROPORTO","ANDORINHAS",..: 40 40 47 55 40 59 26 26 2 13 ...
## $ Scholarship : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Hypertension : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 1 1 1 1 ...
## $ Diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
## $ Alcoholism : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Handicap : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ SMS_received : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ No.show : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...
The Appointment Day and Schedule day are now in Date format.
Let’s make a new feature to see the Day of Week on which the appointment was made and of the scheduled date.
appt_no_show_data$DOW_ScheduledDay=wday(appt_no_show_data$ScheduledDay,label = TRUE)
appt_no_show_data$DOW_AppointmentDay=wday(appt_no_show_data$AppointmentDay,label= TRUE)
Let’s make another variable to categorize the days into weekday and weekend.
appt_no_show_data$day_cat_schdday=ifelse(appt_no_show_data$DOW_ScheduledDay %in% c('Sat','Sun'),'Weekend','Weekday')
appt_no_show_data$day_cat_apptday=ifelse(appt_no_show_data$DOW_AppointmentDay %in% c('Sat','Sun'),'Weekend','Weekday')
appt_no_show_data$day_cat_schdday=as.factor(appt_no_show_data$day_cat_schdday)
appt_no_show_data$day_cat_apptday=as.factor(appt_no_show_data$day_cat_apptday)
Let’s check the summary again
summary(appt_no_show_data)
## PatientId AppointmentID Gender ScheduledDay
## Min. :3.922e+04 Min. :5030230 F:71840 Min. :2015-11-10 07:13:56
## 1st Qu.:4.173e+12 1st Qu.:5640286 M:38687 1st Qu.:2016-04-29 10:27:01
## Median :3.173e+13 Median :5680573 Median :2016-05-10 12:13:17
## Mean :1.475e+14 Mean :5675305 Mean :2016-05-09 07:49:15
## 3rd Qu.:9.439e+13 3rd Qu.:5725524 3rd Qu.:2016-05-20 11:18:37
## Max. :1.000e+15 Max. :5790484 Max. :2016-06-08 20:07:23
##
## AppointmentDay Age Neighbourhood
## Min. :2016-04-29 00:00:00 Min. : -1.00 JARDIM CAMBURI : 7717
## 1st Qu.:2016-05-09 00:00:00 1st Qu.: 18.00 MARIA ORTIZ : 5805
## Median :2016-05-18 00:00:00 Median : 37.00 RESISTÊNCIA : 4431
## Mean :2016-05-19 00:57:50 Mean : 37.09 JARDIM DA PENHA: 3877
## 3rd Qu.:2016-05-31 00:00:00 3rd Qu.: 55.00 ITARARÉ : 3514
## Max. :2016-06-08 00:00:00 Max. :115.00 CENTRO : 3334
## (Other) :81849
## Scholarship Hypertension Diabetes Alcoholism Handicap SMS_received
## 0:99666 0:88726 0:102584 0:107167 0:108286 0:75045
## 1:10861 1:21801 1: 7943 1: 3360 1: 2042 1:35482
## 2: 183
## 3: 13
## 4: 3
##
##
## No.show DOW_ScheduledDay DOW_AppointmentDay day_cat_schdday
## No :88208 Sun: 0 Sun: 0 Weekday:110503
## Yes:22319 Mon:23085 Mon:22715 Weekend: 24
## Tue:26168 Tue:25640
## Wed:24262 Wed:25867
## Thu:18073 Thu:17247
## Fri:18915 Fri:19019
## Sat: 24 Sat: 39
## day_cat_apptday
## Weekday:110488
## Weekend: 39
##
##
##
##
##
There re very few appointments on Saturday and there isn’t any patient either scheduling or visiting the doctor on a sunday.
Let’s create another feature to see the number of days between scheduled day and appointment day. This feature seems quite important to me as I want to see how difference of days between the appointment day and schedule day.
appt_no_show_data$Days_Interval=as.Date(appt_no_show_data$AppointmentDay)-as.Date(appt_no_show_data$ScheduledDay)
First we will create some bar graphs for all the categorical values:
freq(appt_no_show_data)
## Gender frequency percentage cumulative_perc
## 1 F 71840 65 65
## 2 M 38687 35 100
## Neighbourhood frequency percentage cumulative_perc
## 1 JARDIM CAMBURI 7717 6.98 6.98
## 2 MARIA ORTIZ 5805 5.25 12.23
## 3 RESISTÊNCIA 4431 4.01 16.24
## 4 JARDIM DA PENHA 3877 3.51 19.75
## 5 ITARARÉ 3514 3.18 22.93
## 6 CENTRO 3334 3.02 25.95
## 7 TABUAZEIRO 3132 2.83 28.78
## 8 SANTA MARTHA 3131 2.83 31.61
## 9 JESUS DE NAZARETH 2853 2.58 34.19
## 10 BONFIM 2773 2.51 36.70
## 11 SANTO ANTÔNIO 2746 2.48 39.18
## 12 SANTO ANDRÉ 2571 2.33 41.51
## 13 CARATOÍRA 2565 2.32 43.83
## 14 JABOUR 2509 2.27 46.10
## 15 SÃO PEDRO 2448 2.21 48.31
## 16 ILHA DO PRÍNCIPE 2266 2.05 50.36
## 17 NOVA PALESTINA 2264 2.05 52.41
## 18 ANDORINHAS 2262 2.05 54.46
## 19 DA PENHA 2217 2.01 56.47
## 20 ROMÃO 2215 2.00 58.47
## 21 GURIGICA 2018 1.83 60.30
## 22 SÃO JOSÉ 1977 1.79 62.09
## 23 BELA VISTA 1907 1.73 63.82
## 24 MARUÍPE 1902 1.72 65.54
## 25 FORTE SÃO JOÃO 1889 1.71 67.25
## 26 ILHA DE SANTA MARIA 1885 1.71 68.96
## 27 SÃO CRISTÓVÃO 1836 1.66 70.62
## 28 REDENÇÃO 1553 1.41 72.03
## 29 SÃO BENEDITO 1439 1.30 73.33
## 30 JOANA D´ARC 1427 1.29 74.62
## 31 CRUZAMENTO 1398 1.26 75.88
## 32 CONSOLAÇÃO 1376 1.24 77.12
## 33 SANTA TEREZA 1332 1.21 78.33
## 34 PRAIA DO SUÁ 1288 1.17 79.50
## 35 SANTOS DUMONT 1276 1.15 80.65
## 36 GRANDE VITÓRIA 1071 0.97 81.62
## 37 ILHA DAS CAIEIRAS 1071 0.97 82.59
## 38 INHANGUETÁ 1057 0.96 83.55
## 39 PRAIA DO CANTO 1035 0.94 84.49
## 40 BENTO FERREIRA 858 0.78 85.27
## 41 VILA RUBIM 851 0.77 86.04
## 42 CONQUISTA 849 0.77 86.81
## 43 DO QUADRO 849 0.77 87.58
## 44 REPÚBLICA 835 0.76 88.34
## 45 MONTE BELO 824 0.75 89.09
## 46 PARQUE MOSCOSO 802 0.73 89.82
## 47 GOIABEIRAS 700 0.63 90.45
## 48 JUCUTUQUARA 694 0.63 91.08
## 49 FONTE GRANDE 682 0.62 91.70
## 50 MATA DA PRAIA 644 0.58 92.28
## 51 DO CABRAL 560 0.51 92.79
## 52 SANTOS REIS 547 0.49 93.28
## 53 ESTRELINHA 538 0.49 93.77
## 54 SANTA CLARA 506 0.46 94.23
## 55 SOLON BORGES 469 0.42 94.65
## 56 PIEDADE 452 0.41 95.06
## 57 SANTA CECÍLIA 448 0.41 95.47
## 58 SANTA LÚCIA 438 0.40 95.87
## 59 SANTA LUÍZA 428 0.39 96.26
## 60 BARRO VERMELHO 423 0.38 96.64
## 61 DO MOSCOSO 413 0.37 97.01
## 62 MÁRIO CYPRESTE 371 0.34 97.35
## 63 BOA VISTA 312 0.28 97.63
## 64 COMDUSA 310 0.28 97.91
## 65 DE LOURDES 305 0.28 98.19
## 66 ARIOVALDO FAVALESSA 282 0.26 98.45
## 67 ANTÔNIO HONÓRIO 271 0.25 98.70
## 68 FRADINHOS 258 0.23 98.93
## 69 ENSEADA DO SUÁ 235 0.21 99.14
## 70 SANTA HELENA 178 0.16 99.30
## 71 HORTO 175 0.16 99.46
## 72 UNIVERSITÁRIO 152 0.14 99.60
## 73 SEGURANÇA DO LAR 145 0.13 99.73
## 74 NAZARETH 135 0.12 99.85
## 75 MORADA DE CAMBURI 96 0.09 99.94
## 76 PONTAL DE CAMBURI 69 0.06 100.00
## 77 ILHA DO BOI 35 0.03 100.03
## 78 ILHA DO FRADE 10 0.01 100.04
## 79 AEROPORTO 8 0.01 100.05
## 80 ILHAS OCEÂNICAS DE TRINDADE 2 0.00 100.05
## 81 PARQUE INDUSTRIAL 1 0.00 100.00
## Scholarship frequency percentage cumulative_perc
## 1 0 99666 90.17 90.17
## 2 1 10861 9.83 100.00
## Hypertension frequency percentage cumulative_perc
## 1 0 88726 80.28 80.28
## 2 1 21801 19.72 100.00
## Diabetes frequency percentage cumulative_perc
## 1 0 102584 92.81 92.81
## 2 1 7943 7.19 100.00
## Alcoholism frequency percentage cumulative_perc
## 1 0 107167 96.96 96.96
## 2 1 3360 3.04 100.00
## Handicap frequency percentage cumulative_perc
## 1 0 108286 97.97 97.97
## 2 1 2042 1.85 99.82
## 3 2 183 0.17 99.99
## 4 3 13 0.01 100.00
## 5 4 3 0.00 100.00
## SMS_received frequency percentage cumulative_perc
## 1 0 75045 67.9 67.9
## 2 1 35482 32.1 100.0
## No.show frequency percentage cumulative_perc
## 1 No 88208 79.81 79.81
## 2 Yes 22319 20.19 100.00
## day_cat_schdday frequency percentage cumulative_perc
## 1 Weekday 110503 99.98 99.98
## 2 Weekend 24 0.02 100.00
## day_cat_apptday frequency percentage cumulative_perc
## 1 Weekday 110488 99.96 99.96
## 2 Weekend 39 0.04 100.00
## [1] "Variables processed: Gender, Neighbourhood, Scholarship, Hypertension, Diabetes, Alcoholism, Handicap, SMS_received, No.show, day_cat_schdday, day_cat_apptday"
Let’s check out the continuous variables now: For continuous variables, I am using histograms
AGE
ggplot(data=appt_no_show_data)+
geom_histogram(mapping = aes(x=Age),binwidth=1)
From this visualization, it is clear that highest number of patients are children with the Age less than 1 year. Then there’s another spike at the age of 55. Let’s zoom in more to see if there are any outliers:
ggplot(data=appt_no_show_data)+
geom_histogram(mapping = aes(x=Age),binwidth=1)+
coord_cartesian(xlim = c(0,120), ylim = c(0,30))
From this graph, it is evident that there are few values with Age<0 and Age>100. Age<0 clearly seems to be an outlier. However, Age>100 is quite unlikely but we can keep that data.
Before treating the outliers in Age, Let’s check the Days_Interval: Days_Interval
ggplot(data=appt_no_show_data)+
geom_histogram(mapping = aes(x=Days_Interval),binwidth=5)+
coord_cartesian(xlim = c(0,100))
For most of the data, the days between the scheduled and the appointment day is less than 25 days. We will further plot this variable against the No_show percentage.
In the graph above, we can clearly see that there are some values where the Days_Interval is less than 0. This can not be valid data since a patient can not go for an appointment before scheduling it. Let’s treat the outliers here:
table(appt_no_show_data$Days_Interval)
##
## -6 -1 0 1 2 3 4 5 6 7 8 9 10
## 1 4 38563 5213 6725 2737 5290 3277 4037 4906 2332 1605 1391
## 11 12 13 14 15 16 17 18 19 20 21 22 23
## 987 1115 1682 2913 1503 1151 1107 1021 1044 1187 1861 1173 822
## 24 25 26 27 28 29 30 31 32 33 34 35 36
## 622 637 731 1013 1706 1089 704 672 586 664 808 963 551
## 37 38 39 40 41 42 43 44 45 46 47 48 49
## 331 199 170 254 342 417 254 157 137 121 109 117 216
## 50 51 52 53 54 55 56 57 58 59 60 61 62
## 182 91 90 96 73 109 203 98 110 90 73 135 193
## 63 64 65 66 67 68 69 70 71 72 73 74 75
## 200 75 97 162 63 70 105 112 50 53 37 39 72
## 76 77 78 79 80 81 82 83 84 85 86 87 88
## 65 53 33 23 18 10 1 8 54 37 6 16 16
## 89 90 91 92 93 94 95 96 97 98 101 102 103
## 17 58 56 3 2 2 5 4 2 5 1 4 5
## 104 105 107 108 109 110 111 112 115 117 119 122 123
## 8 4 2 5 5 2 5 5 2 1 4 3 1
## 125 126 127 132 133 139 142 146 151 155 162 169 176
## 1 1 1 1 11 1 8 1 1 10 11 8 16
## 179
## 10
The frequency here shows that there are total 5 values where the difference of Scheduled day and Appointment day is negative.
I am filtering the data where interval is less than 0
appt_no_show_data = filter(appt_no_show_data,Days_Interval>=0)
Let’s see the data after removal:
table(appt_no_show_data$Days_Interval)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 38563 5213 6725 2737 5290 3277 4037 4906 2332 1605 1391 987 1115
## 13 14 15 16 17 18 19 20 21 22 23 24 25
## 1682 2913 1503 1151 1107 1021 1044 1187 1861 1173 822 622 637
## 26 27 28 29 30 31 32 33 34 35 36 37 38
## 731 1013 1706 1089 704 672 586 664 808 963 551 331 199
## 39 40 41 42 43 44 45 46 47 48 49 50 51
## 170 254 342 417 254 157 137 121 109 117 216 182 91
## 52 53 54 55 56 57 58 59 60 61 62 63 64
## 90 96 73 109 203 98 110 90 73 135 193 200 75
## 65 66 67 68 69 70 71 72 73 74 75 76 77
## 97 162 63 70 105 112 50 53 37 39 72 65 53
## 78 79 80 81 82 83 84 85 86 87 88 89 90
## 33 23 18 10 1 8 54 37 6 16 16 17 58
## 91 92 93 94 95 96 97 98 101 102 103 104 105
## 56 3 2 2 5 4 2 5 1 4 5 8 4
## 107 108 109 110 111 112 115 117 119 122 123 125 126
## 2 5 5 2 5 5 2 1 4 3 1 1 1
## 127 132 133 139 142 146 151 155 162 169 176 179
## 1 1 11 1 8 1 1 10 11 8 16 10
We will do the same process with the outliers in Age:
Let’s check how many values are there with Age<0
age_freq <- table(appt_no_show_data$Age)
age_freq
##
## -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 1 3539 2273 1618 1513 1299 1489 1521 1426 1424 1372 1274 1195 1092 1103 1118
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 1211 1402 1509 1487 1544 1437 1452 1375 1349 1242 1332 1283 1377 1448 1403 1521
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## 1439 1505 1524 1526 1378 1580 1533 1628 1536 1402 1346 1272 1344 1487 1453 1460
## 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
## 1394 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312
## 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
## 1374 1331 1101 1187 973 1012 832 724 695 615 725 602 544 571 527 541
## 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
## 390 511 433 392 280 311 275 260 184 126 173 109 66 86 53 33
## 95 96 97 98 99 100 102 115
## 24 17 11 6 1 4 2 5
age_freq[names(age_freq)<0]
## -1
## 1
Let’s remove the value where Age<0:
appt_no_show_data=appt_no_show_data %>% filter(Age!=-1)
table(appt_no_show_data$Age)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 3539 2273 1618 1513 1299 1489 1521 1426 1424 1372 1274 1195 1092 1103 1118 1211
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 1402 1509 1487 1544 1437 1452 1375 1349 1242 1332 1283 1377 1448 1403 1521 1439
## 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
## 1505 1524 1526 1378 1580 1533 1628 1536 1402 1346 1272 1344 1487 1453 1460 1394
## 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
## 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312 1374
## 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## 1331 1101 1187 973 1012 832 724 695 615 725 602 544 571 527 541 390
## 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
## 511 433 392 280 311 275 260 184 126 173 109 66 86 53 33 24
## 96 97 98 99 100 102 115
## 17 11 6 1 4 2 5
ggplot(data=appt_no_show_data)+
geom_bar(mapping=aes(x=No.show,fill=Gender),position='dodge')
Here we can say that females tend to miss the data more than men does. But it is not clear from the above barchart.
To see it more evidently, Let’s check out the frequency:
tab_Gender <- table(appt_no_show_data$Gender, appt_no_show_data$No.show)
addmargins(tab_Gender)
##
## No Yes Sum
## F 57245 14591 71836
## M 30962 7723 38685
## Sum 88207 22314 110521
Taking proportions can mathematically show us the impact. (A proportion is the relative frequency of items with a given characteristic in a given set)
Percent_Gender_no_show=appt_no_show_data %>% group_by(Gender) %>% summarise(Gender_noshow=sum((No.show=="Yes")/n()))
Percent_Gender_no_show
About 20% of the females out of total female patients missed their aapointment. However, 19.9% men out of their total population missed it. So, we can say Gender isn’t playing amy role in our case.
Let’s visualise it with the plot:
ggplot(appt_no_show_data,aes(x =Gender,fill = No.show)) +
geom_bar(position = "fill")
ggplot(data=appt_no_show_data)+
geom_bar(mapping=aes(x=No.show,fill=Scholarship),position='dodge')
Let’s make a freq table:
tab_Scholarship <- table(appt_no_show_data$Scholarship, appt_no_show_data$No.show)
addmargins(tab_Scholarship)
##
## No Yes Sum
## 0 79924 19736 99660
## 1 8283 2578 10861
## Sum 88207 22314 110521
We will now create a proportions table to look at the relative frequency.
Percent_Scholarship_no_show=appt_no_show_data %>% group_by(Scholarship) %>% summarise(Scholarship_noshow=sum((No.show=="Yes")/n())*100)
Percent_Scholarship_no_show
It is evident now that those who received scholarship, tend to miss the appointment than those who didn’t receive it.
Let’s plot our observation:
ggplot(appt_no_show_data,aes(x =Scholarship,fill = No.show)) +
geom_bar(position = "fill")
We will repeat the same process for Hypertension, Diabetes, Alcoholism and Handicap.
Percent_Hypertension_no_show=appt_no_show_data %>% group_by(Hypertension) %>% summarise(Hypertension_noshow=sum((No.show=="Yes")/n())*100)
Percent_Hypertension_no_show
g_Diabetes <- ggplot(appt_no_show_data, aes(x=Diabetes, fill=No.show)) + geom_bar(position="fill")
g_Alcoholism <- ggplot(appt_no_show_data, aes(x=Alcoholism, fill=No.show)) + geom_bar(position="fill")
g_Hypertension <- ggplot(appt_no_show_data, aes(x=Hypertension, fill=No.show)) + geom_bar(position="fill")
g_Handicap <- ggplot(appt_no_show_data, aes(x=Handicap, fill=No.show)) + geom_bar(position="fill")
g_binary <- c(g_Diabetes, g_Alcoholism, g_Hypertension, g_Handicap)
grid.arrange(g_Diabetes, g_Alcoholism, g_Hypertension, g_Handicap,ncol=2, top='Health Conditions Analysis')
Diabetes, Alcoholism and Hypertension doesn’t have any association with the likelihood of patients missing their appointments.
Let’s see the SMS_Received:
Percent_SMS_no_show=appt_no_show_data %>% group_by(SMS_received) %>% summarise(SMS_noshow=sum((No.show=="Yes")/n())*100)
Percent_SMS_no_show
ggplot(appt_no_show_data, aes(x=SMS_received, fill=No.show)) + geom_bar(position="fill")
Here we can clearly see that patients who received SMS have more rate of not showing up than who didn’t receive it.
ggplot(data=appt_no_show_data)+
geom_bar(mapping=aes(x=No.show,fill=DOW_AppointmentDay),position='dodge')
Looks like tuesday was the day where most of the people missed their appointment. But to see it more evidently, let’s see the rate of missing appointment by Day of the Week:
Missing_Rate_by_DOW=appt_no_show_data %>% group_by(DOW_AppointmentDay) %>% summarise(Days_No_Show=sum(No.show=="Yes")/n())
Missing_Rate_by_DOW
Let’s plot the above table to have clear visibility:
appt_no_show_data %>% group_by(DOW_AppointmentDay) %>% summarise(Days_No_Show=sum(No.show=="Yes")/n())%>% ggplot(aes(x=DOW_AppointmentDay,y=Days_No_Show))+
geom_point(alpha=1)+
geom_smooth()+
coord_cartesian(ylim = c(0,0.40))
Friday and Saturday are the two days where people tend to miss their appointments more than other days. But we have only 39 appointments on Saturday. So, our finding is not that significant
appt_no_show_data$Days_Interval=as.integer(appt_no_show_data$Days_Interval)
ggplot(data=appt_no_show_data,aes(x=No.show,y=Days_Interval,color=No.show))+
geom_boxplot(alpha=0.5)
This boxplot clearly shows that waiting time is impacting the likelihood of people not showing up for their appointments. The more the waiting period, there is high tendency that patients will miss their appointment.
Let’s look at another graph to visualize it more evidently:
ggplot(appt_no_show_data, aes(x=Days_Interval, fill=No.show)) +
geom_density(alpha=0.30) +
coord_cartesian(xlim=c(0, 100))
Here, it is clear that there is a huge spike showing people who attended their appointments. With increasing days of waiting time, there are no data points of people who showed up.
Let’e plot a scatter plot for Appointment day against the proportions for patients who didn’t show up.
appt_no_show_data %>% group_by(AppointmentDay) %>% summarise(total_noshow=sum(No.show=="Yes")/n()) %>% ggplot(aes(x=AppointmentDay, y=total_noshow)) +
geom_point(alpha=0.3) + geom_smooth(method = "lm")
It is clear that most of the patients missed their appointments in the month of May.
ggplot(data=appt_no_show_data,aes(x=No.show,y=Age,color=No.show))+
geom_boxplot(alpha=0.5)
By visualising this graph, we can say that the younger patients tend to miss the appointments more than older patients.
From my findings after EDA of this data, I found out that following variables:
has assocaitiion with patients’ likelihood of showing up to their appointments up to some extent.
Age: Children less than 1 years of age tend to miss their appointments than other age groups. So, more efforts can be put in reaching the parents to encourage them to bring their kids to the appointments.
Days_Interval: There should be less waiting time for an appointment. If the waiting time is too much, there is a chance that patient might have recovered or has forgotten about the appointment scheduled.
To revisit some exceptional findings:
SMS_Received: Patients who received SMS reminder for their appointments have higher chances of missing the appointments. So, we cna save the budget by not spending resources and time by sending SMS as it has no positive effect but rather has a negative effect.
Scholarship: People who were enrolled in the special Bolsa Familia program or has Scholarship variable as TRUE, are more likely to not show up on their appointments.
Future work for this project is to do the Confirmatory Data Analysis in order to prove the tentative findings with statistical significance. Then after, use a classification model on this data.