Investigating on Medical Appointment No Show Dataset

Yousuf Ali

10/4/2018

0.1 Data Wrangling
- 0.1.1 Load packages
- 0.1.2 Read table and Cleaning data
0.2 Exploratory Analysis
0.3 Logistic Regression
0.4 Conclusions

0.1 Data Wrangling

0.1.1 Load packages

library(tidyverse)

0.1.2 Read table and Cleaning data

# read the table

df <-  read_csv("data/KaggleV2-May-2016.csv")

# check the data

head(df)

PatientId	AppointmentID	Gender	ScheduledDay	AppointmentDay	Age	Neighbourhood	Hipertension	Diabetes	No-show
2.987250e+13	5642903	F	2016-04-29 18:38:08	2016-04-29	62	JARDIM DA PENHA	1	0	No
5.589978e+14	5642503	M	2016-04-29 16:08:27	2016-04-29	56	JARDIM DA PENHA	0	0	No
4.262962e+12	5642549	F	2016-04-29 16:19:04	2016-04-29	62	MATA DA PRAIA	0	0	No
8.679512e+11	5642828	F	2016-04-29 17:29:31	2016-04-29	8	PONTAL DE CAMBURI	0	0	No
8.841186e+12	5642494	F	2016-04-29 16:07:23	2016-04-29	56	JARDIM DA PENHA	1	1	No
9.598513e+13	5626772	F	2016-04-27 08:36:51	2016-04-29	76	REPÚBLICA	1	0	No

# check structure of the data
str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    110527 obs. of  14 variables:
##  $ PatientId     : num  2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
##  $ AppointmentID : int  5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
##  $ Gender        : chr  "F" "M" "F" "F" ...
##  $ ScheduledDay  : POSIXct, format: "2016-04-29 18:38:08" "2016-04-29 16:08:27" ...
##  $ AppointmentDay: POSIXct, format: "2016-04-29" "2016-04-29" ...
##  $ Age           : int  62 56 62 8 56 76 23 39 21 19 ...
##  $ Neighbourhood : chr  "JARDIM DA PENHA" "JARDIM DA PENHA" "MATA DA PRAIA" "PONTAL DE CAMBURI" ...
##  $ Scholarship   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hipertension  : int  1 0 0 0 1 1 0 0 0 0 ...
##  $ Diabetes      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Alcoholism    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Handcap       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SMS_received  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ No-show       : chr  "No" "No" "No" "No" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 14
##   .. ..$ PatientId     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ AppointmentID : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Gender        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ ScheduledDay  :List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_datetime" "collector"
##   .. ..$ AppointmentDay:List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_datetime" "collector"
##   .. ..$ Age           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Neighbourhood : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Scholarship   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Hipertension  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Diabetes      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Alcoholism    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Handcap       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ SMS_received  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ No-show       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

# change columns name
names(df)<-  c('patient_id',
                   'appointment_id',
                   'gender','schedule_day',
                   'appointment_day',
                   'age',
                   'neighborhood',
                   'scholarship',
                   'hypertension',
                   'diabetes',
                   'alcoholism',
                   'handicap',
                   'sms_received',
                   'no_show')

# check again
head(df)

patient_id	appointment_id	gender	schedule_day	appointment_day	age	neighborhood	hypertension	diabetes	no_show
2.987250e+13	5642903	F	2016-04-29 18:38:08	2016-04-29	62	JARDIM DA PENHA	1	0	No
5.589978e+14	5642503	M	2016-04-29 16:08:27	2016-04-29	56	JARDIM DA PENHA	0	0	No
4.262962e+12	5642549	F	2016-04-29 16:19:04	2016-04-29	62	MATA DA PRAIA	0	0	No
8.679512e+11	5642828	F	2016-04-29 17:29:31	2016-04-29	8	PONTAL DE CAMBURI	0	0	No
8.841186e+12	5642494	F	2016-04-29 16:07:23	2016-04-29	56	JARDIM DA PENHA	1	1	No
9.598513e+13	5626772	F	2016-04-27 08:36:51	2016-04-29	76	REPÚBLICA	1	0	No

# check missing value

sapply(df,function(x)sum(is.na(x)))

##      patient_id  appointment_id          gender    schedule_day 
##               0               0               0               0 
## appointment_day             age    neighborhood     scholarship 
##               0               0               0               0 
##    hypertension        diabetes      alcoholism        handicap 
##               0               0               0               0 
##    sms_received         no_show 
##               0               0

# change data type of some columns
df <- mutate_at(df, vars('gender',
                   'neighborhood',
                   'scholarship',
                   'hypertension',
                   'diabetes',
                   'alcoholism',
                   'handicap',
                   'sms_received'), as.factor)

# check data type
lapply(df, class)

## $patient_id
## [1] "numeric"
## 
## $appointment_id
## [1] "integer"
## 
## $gender
## [1] "factor"
## 
## $schedule_day
## [1] "POSIXct" "POSIXt" 
## 
## $appointment_day
## [1] "POSIXct" "POSIXt" 
## 
## $age
## [1] "integer"
## 
## $neighborhood
## [1] "factor"
## 
## $scholarship
## [1] "factor"
## 
## $hypertension
## [1] "factor"
## 
## $diabetes
## [1] "factor"
## 
## $alcoholism
## [1] "factor"
## 
## $handicap
## [1] "factor"
## 
## $sms_received
## [1] "factor"
## 
## $no_show
## [1] "character"

# check summary statistics
summary(df)

##    patient_id        appointment_id    gender   
##  Min.   :3.922e+04   Min.   :5030230   F:71840  
##  1st Qu.:4.173e+12   1st Qu.:5640286   M:38687  
##  Median :3.173e+13   Median :5680573            
##  Mean   :1.475e+14   Mean   :5675305            
##  3rd Qu.:9.439e+13   3rd Qu.:5725524            
##  Max.   :1.000e+15   Max.   :5790484            
##                                                 
##   schedule_day                 appointment_day              
##  Min.   :2015-11-10 07:13:56   Min.   :2016-04-29 00:00:00  
##  1st Qu.:2016-04-29 10:27:01   1st Qu.:2016-05-09 00:00:00  
##  Median :2016-05-10 12:13:17   Median :2016-05-18 00:00:00  
##  Mean   :2016-05-09 07:49:15   Mean   :2016-05-19 00:57:50  
##  3rd Qu.:2016-05-20 11:18:37   3rd Qu.:2016-05-31 00:00:00  
##  Max.   :2016-06-08 20:07:23   Max.   :2016-06-08 00:00:00  
##                                                             
##       age                  neighborhood   scholarship hypertension
##  Min.   : -1.00   JARDIM CAMBURI : 7717   0:99666     0:88726     
##  1st Qu.: 18.00   MARIA ORTIZ    : 5805   1:10861     1:21801     
##  Median : 37.00   RESISTÊNCIA    : 4431                           
##  Mean   : 37.09   JARDIM DA PENHA: 3877                           
##  3rd Qu.: 55.00   ITARARÉ        : 3514                           
##  Max.   :115.00   CENTRO         : 3334                           
##                   (Other)        :81849                           
##  diabetes   alcoholism handicap   sms_received   no_show         
##  0:102584   0:107167   0:108286   0:75045      Length:110527     
##  1:  7943   1:  3360   1:  2042   1:35482      Class :character  
##                        2:   183                Mode  :character  
##                        3:    13                                  
##                        4:     3                                  
##                                                                  
##

minimum age is −1, so lets check all the value less than 0

#check value less than 0
df[df$age < 0, ]

patient_id	appointment_id	gender	schedule_day	appointment_day	age	neighborhood	scholarship	hypertension	diabetes	alcoholism	handicap	sms_received	no_show
4.659432e+14	5775010	F	2016-06-06 08:58:13	2016-06-06	-1	ROMÃO	0	0	0	0	0	0	No

#drop the row with condition

df <-df[!(df$age<0),]

summary(df$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   18.00   37.00   37.09   55.00  115.00

0.2 Exploratory Analysis

0.2.1 Age vs No Show

0.2.1.1 Is there any relation between Age and No show?

0.2.1.2 plot and statistics

In the dataset description, it mention that in no_show column No means patient showed up and yes means patient did not showed up

# age vs showed up or not
# blue showed up
ggplot(df,aes(x=age)) + 
    geom_histogram(data=subset(df,no_show == 'No'),fill = '#00BFC4', alpha = 0.8, bins = 40) +
    geom_histogram(data=subset(df,no_show == 'Yes'),fill = '#F8766D', alpha = 0.8, bins = 40) +
    ggtitle('Age vs No Show Histogram')+
    theme(plot.title = element_text(hjust = 0.5))

It is very hard to tell from this plot the difference in age of patient who showed up or did not showed up. So let’s explore more.

Since there is confusion with no and yes with showed up or not, lets make it clear

# in no_show column No means patient showed up and yes means patient did not showed up
df$no_show[df$no_show == 'No'] <- 'Showed up'
df$no_show[df$no_show == 'Yes'] <- 'not showed up'

df$no_show <- as.factor(df$no_show)

# age box plot

ggplot(df, aes(x = no_show, y = age, fill = no_show))+
    geom_boxplot()+ 
    ggtitle("Age vs No Show Boxplot")+
    theme(plot.title = element_text(hjust = 0.5))

From the box plot, the mean age is higher for those who showed up. Now let’s see the exact value.

#statistics

select(df, age, no_show) %>% 
    group_by(no_show) %>% 
    summarise(age_mean = mean(age))

no_show	age_mean
not showed up	34.31767
Showed up	37.79050

more visualization

Though it might not be appropriate for all case but in this case it is possible to visualize all age group.

ggplot(data = df)+
    geom_bar(aes(factor(age), fill = no_show), position = position_fill())+
    ggtitle("Age vs No Show Proportion in Bar Diagram")+
    ylab('Proportion')+
    theme(plot.title = element_text(hjust = 0.5, size = 24))+
    theme(axis.title.y = element_text(size =18))+
    theme(axis.title.x = element_text(size =18))+
    theme(axis.text.x = element_text(size= 12, angle = 90, hjust = 1))

From the above plot now it seems that proportion of showed up patients are higher in the age range 60 to 80 than patient age under 40.

0.2.1.3 t test

# make column of age

showed_up <- df[df$no_show == 'Showed up',]$age
not_showed_up <- df[df$no_show == 'not showed up',]$age

# independent two sample t test
t.test(showed_up, not_showed_up)

## 
##  Welch Two Sample t-test
## 
## data:  showed_up and not_showed_up
## t = 20.831, df = 36143, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.146073 3.799602
## sample estimates:
## mean of x mean of y 
##  37.79050  34.31767

# just easy way, same result
t.test(df$age ~ df$no_show)

## 
##  Welch Two Sample t-test
## 
## data:  df$age by df$no_show
## t = -20.831, df = 36143, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.799602 -3.146073
## sample estimates:
## mean in group not showed up     mean in group Showed up 
##                    34.31767                    37.79050

So there is significant difference in age of patient those who showed up and those who did not show up.

0.2.2 Gender vs No Show

0.2.2.1 Is there any relation between Gender and No Show?

0.2.2.2 Plot and statistics

# replace levels
levels(df$gender)[levels(df$gender)=="M"] <- "Male"
levels(df$gender)[levels(df$gender)=="F"] <- "Female"

# count gender
ggplot(data = df)+
    geom_bar(aes(x = gender,fill = gender))+
    ggtitle("Gender Bar Diagram")+
    theme(plot.title = element_text(hjust = 0.5))

So there are more female patient than male patient in dataset.

# gender table
table(df$gender, df$no_show)

##         
##          not showed up Showed up
##   Female         14594     57245
##   Male            7725     30962

# total patient
table(df$no_show)

## 
## not showed up     Showed up 
##         22319         88207

#gender vs noshow stacked bar
ggplot(df)+
    geom_bar(aes(x = gender, fill = no_show))+
    ggtitle("Gender vs No Show Stacked Bar Diagram")+
    theme(plot.title = element_text(hjust = 0.5))+
    ylab("Count")+
    xlab("Gender")

showed up and not showed up both group is higher for female than male because more patients in data set is female. So its better to see proportionally.

# proportion table
prop.table(table(df$gender, df$no_show), margin = 1)

##         
##          not showed up Showed up
##   Female     0.2031487 0.7968513
##   Male       0.1996795 0.8003205

#plot of proportion data, just add position_fill()
ggplot(df)+
    geom_bar(aes(x = gender, fill = no_show), position = position_fill())+
    ggtitle("Gender vs No Show Bar Diagram")+
    ylab('Proportion')+
    xlab("Gender")+
    theme(plot.title = element_text(hjust = 0.5))

So there is not much difference between male and female.

0.2.2.3 chi-squared test

chisq.test(table(df$gender,df$no_show))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(df$gender, df$no_show)
## X-squared = 1.8565, df = 1, p-value = 0.173

without continuity correction

chisq.test(table(df$gender,df$no_show), correct = FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$gender, df$no_show)
## X-squared = 1.8779, df = 1, p-value = 0.1706

p value is more than 0.05, so gender difference is not significant.

0.2.3 Scholarship, Hypertension, Diabetes, Alcohol, Handicap, Sms_received vs No Show

0.2.3.1 What are the relation of No Show with Scholarship, Hypertension, Diabetes, Alcohol, Handicap, Sms_received?

0.2.3.2 plot

# Count patients number in each group
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

p1 <- ggplot(df)+geom_bar(aes(scholarship, fill = scholarship))
p2 <- ggplot(df)+geom_bar(aes(hypertension, fill =hypertension))
p3 <- ggplot(df)+geom_bar(aes(diabetes, fill = diabetes))
p4 <- ggplot(df)+geom_bar(aes (alcoholism, fill =  alcoholism))
p5 <- ggplot(df)+geom_bar(aes( handicap, fill = handicap))
p6 <- ggplot(df)+geom_bar(aes(sms_received, fill = sms_received))

grid.arrange(p1,p2,p3,p4,p5,p6, nrow = 3)

# this plot for each group with patient who showed up and not showed up
p1 <- ggplot(df)+geom_bar(aes(scholarship, fill = no_show))
p2 <- ggplot(df)+geom_bar(aes(hypertension, fill =no_show))
p3 <- ggplot(df)+geom_bar(aes(diabetes, fill = no_show))
p4 <- ggplot(df)+geom_bar(aes (alcoholism, fill =  no_show))
p5 <- ggplot(df)+geom_bar(aes( handicap, fill = no_show))
p6 <- ggplot(df)+geom_bar(aes(sms_received, fill = no_show))

grid.arrange(p1,p2,p3,p4,p5,p6, nrow = 3)

# this plot is same as above one execpt it show proportion.
p1 <- ggplot(df)+geom_bar(aes(scholarship, fill = no_show), position = position_fill())+
    ylab('Proportion')
p2 <- ggplot(df)+geom_bar(aes(hypertension, fill =no_show), position = position_fill())+
    ylab('Proportion')
p3 <- ggplot(df)+geom_bar(aes(diabetes, fill = no_show), position = position_fill())+
    ylab('Proportion')
p4 <- ggplot(df)+geom_bar(aes (alcoholism, fill =  no_show), position = position_fill())+
    ylab('Proportion')
p5 <- ggplot(df)+geom_bar(aes( handicap, fill = no_show), position = position_fill())+
    ylab('Proportion')
p6 <- ggplot(df)+geom_bar(aes(sms_received, fill = no_show), position = position_fill())+
    ylab('Proportion')

grid.arrange(p1,p2,p3,p4,p5,p6, nrow = 3)

0.2.3.3 chi-squared test

# chi-squared test

chisq.test(table(df$scholarship, df$no_show), correct = FALSE) # scholarship

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$scholarship, df$no_show)
## X-squared = 93.811, df = 1, p-value < 2.2e-16

chisq.test(table(df$hypertension, df$no_show), correct = FALSE) # hypertension

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$hypertension, df$no_show)
## X-squared = 140.89, df = 1, p-value < 2.2e-16

chisq.test(table(df$diabetes, df$no_show), correct = FALSE) # diabetes

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$diabetes, df$no_show)
## X-squared = 25.473, df = 1, p-value = 4.486e-07

chisq.test(table(df$alcoholism, df$no_show), correct = FALSE) #alcoholism

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$alcoholism, df$no_show)
## X-squared = 0.0042829, df = 1, p-value = 0.9478

chisq.test(table(df$handicap, df$no_show), correct = FALSE) #handicap

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$handicap, df$no_show)
## X-squared = 7.0356, df = 4, p-value = 0.134

chisq.test(table(df$sms_received, df$no_show), correct = FALSE) # sms_received

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$sms_received, df$no_show)
## X-squared = 1766.7, df = 1, p-value < 2.2e-16

p value is significant for scholarship, hypertension, diabetes, sms_received group.

0.2.4 Appointment day vs No Show

0.2.4.1 plot

# make new columns 
df$day <- weekdays(as.Date(df$appointment_day))

ggplot(df)+geom_bar(aes(day, fill = day))+
     ggtitle("Number of Appointment")+
    ylab('Count')+
    xlab('Day')+
    theme(plot.title = element_text(hjust = 0.5))

Number of appointment differ across week. Some day like Wednesday and Tuesday make more appointment than other. Statistics given below to see exact number.

library(lubridate)

# make days column, with lebel true
df$date <- as.Date(df$appointment_day)
df$days <- wday(df$date, label=TRUE)

# day start monday and goes on by day, using days column
ggplot(df)+geom_bar(aes(days, fill = day))+
     ggtitle("Number of Appointment")+
    ylab('Count')+
    xlab('Day')+
    theme(plot.title = element_text(hjust = 0.5))

# days vs no show
ggplot(df)+geom_bar(aes(days, fill = no_show))+ 
    ggtitle(" Number of Appointment vs No Show")+
    ylab('Count')+
    xlab('Day')+
    theme(plot.title = element_text(hjust = 0.5))

#days vs no show, using days column
ggplot(df)+geom_bar(aes(days, fill = no_show), position = position_fill())+
     ggtitle("Appointment vs No Show")+
    ylab('Proportion')+
    xlab('Day')+
    theme(plot.title = element_text(hjust = 0.5))

0.2.4.2 chi-squared test

used day column because days column contain sunday.

# day column
table(df$day, df$no_show)

##            
##             not showed up Showed up
##   Friday             4037     14982
##   Monday             4690     18024
##   Saturday              9        30
##   Thursday           3338     13909
##   Tuesday            5152     20488
##   Wednesday          5093     20774

#days column
table(df$days, df$no_show)

##      
##       not showed up Showed up
##   Sun             0         0
##   Mon          4690     18024
##   Tue          5152     20488
##   Wed          5093     20774
##   Thu          3338     13909
##   Fri          4037     14982
##   Sat             9        30

chisq.test(table(df$no_show,df$day))

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$no_show, df$day)
## X-squared = 27.48, df = 5, p-value = 4.599e-05

Since p value is significant that means showing up in appointment day is dependent on which day the appointment is.

0.2.5 Neighborhood vs No Show

df_neighbor <- data.frame(table(df$neighborhood, df$no_show))
names(df_neighbor) <- c("neighborhood", "no_show", 'Count')
head(df_neighbor)

neighborhood	no_show	Count
AEROPORTO	not showed up	1
ANDORINHAS	not showed up	521
ANTÔNIO HONÓRIO	not showed up	50
ARIOVALDO FAVALESSA	not showed up	62
BARRO VERMELHO	not showed up	91
BELA VISTA	not showed up	384

0.2.5.1 plot

ggplot(df_neighbor)+
geom_bar(aes(x = reorder(neighborhood, -Count), y = Count, fill = no_show), stat = 'identity')+
    theme(axis.text.x = element_text(size= 12, angle = 90, hjust = 1))+
    ggtitle("Neighborhood vs No Show")+
    ylab('Count')+
    xlab('Neighborhood')+
    theme(plot.title = element_text(hjust = 0.5, size = 24))+
    theme(axis.title.y = element_text(size =18))+ 
    theme(axis.title.x = element_text(size =18))

# proportion 
ggplot(df)+
    geom_bar(aes(x = neighborhood, fill = no_show), position = position_fill())+
    theme(axis.text.x = element_text(size= 12, angle = 90, hjust = 1))+
      ggtitle("Neighborhood vs No Show")+
    ylab('Proportion')+
    xlab('Neighborhood')+
    theme(plot.title = element_text(hjust = 0.5))+
    theme(plot.title = element_text(hjust = 0.5, size = 24))+
    theme(axis.title.y = element_text(size =18))+ 
    theme(axis.title.x = element_text(size =18))

0.3 Logistic Regression

0.3.1 Logistic Regression Model

select the coulmns that we are interested in.

df_2 <- select(df, age, gender, scholarship, hypertension, diabetes, alcoholism, handicap,sms_received, day, no_show)

df_2 <- mutate_at(df_2, vars(day), as.factor)

log_model_1 <- glm(no_show ~ . ,family = binomial(link = 'logit'),  data = df_2 )

summary(log_model_1)

## 
## Call:
## glm(formula = no_show ~ ., family = binomial(link = "logit"), 
##     data = df_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1306   0.5295   0.6065   0.6931   1.0061  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.2563118  0.0235536  53.339  < 2e-16 ***
## age            0.0065625  0.0003928  16.705  < 2e-16 ***
## genderMale     0.0184803  0.0162927   1.134  0.25668    
## scholarship1  -0.1845315  0.0245204  -7.526 5.25e-14 ***
## hypertension1  0.0671923  0.0245948   2.732  0.00630 ** 
## diabetes1     -0.0832348  0.0341384  -2.438  0.01476 *  
## alcoholism1   -0.1390368  0.0448100  -3.103  0.00192 ** 
## handicap1     -0.0101453  0.0590236  -0.172  0.86353    
## handicap2     -0.1395108  0.1863616  -0.749  0.45410    
## handicap3     -0.3020104  0.6684127  -0.452  0.65139    
## handicap4     -0.5451696  1.2397235  -0.440  0.66012    
## sms_received1 -0.6675492  0.0156403 -42.681  < 2e-16 ***
## dayMonday      0.1027655  0.0244653   4.200 2.66e-05 ***
## daySaturday   -0.1446681  0.3858706  -0.375  0.70772    
## dayThursday    0.1598183  0.0264963   6.032 1.62e-09 ***
## dayTuesday     0.1991868  0.0240898   8.269  < 2e-16 ***
## dayWednesday   0.1832950  0.0240028   7.636 2.23e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111205  on 110525  degrees of freedom
## Residual deviance: 108900  on 110509  degrees of freedom
## AIC: 108934
## 
## Number of Fisher Scoring iterations: 4

Drop handicap and gender and build a model again.

df_3 <- select(df, age, scholarship, hypertension, diabetes, alcoholism, sms_received, day, no_show)

log_model_2 <- glm(no_show ~ . ,family = binomial(link = 'logit'),  data = df_3 )

summary(log_model_2)

## 
## Call:
## glm(formula = no_show ~ ., family = binomial(link = "logit"), 
##     data = df_3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1357   0.5297   0.6059   0.6931   1.0096  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.2648522  0.0222333  56.890  < 2e-16 ***
## age            0.0065043  0.0003894  16.704  < 2e-16 ***
## scholarship1  -0.1882820  0.0242938  -7.750 9.17e-15 ***
## hypertension1  0.0669163  0.0245739   2.723  0.00647 ** 
## diabetes1     -0.0835637  0.0341255  -2.449  0.01434 *  
## alcoholism1   -0.1330325  0.0444837  -2.991  0.00278 ** 
## sms_received1 -0.6681409  0.0156205 -42.773  < 2e-16 ***
## dayMonday      0.1029304  0.0244647   4.207 2.58e-05 ***
## daySaturday   -0.1456292  0.3858180  -0.377  0.70583    
## dayThursday    0.1599243  0.0264956   6.036 1.58e-09 ***
## dayTuesday     0.1991017  0.0240890   8.265  < 2e-16 ***
## dayWednesday   0.1832997  0.0240024   7.637 2.23e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111205  on 110525  degrees of freedom
## Residual deviance: 108902  on 110514  degrees of freedom
## AIC: 108926
## 
## Number of Fisher Scoring iterations: 4

0.3.2 Logistic Regression with training and test data set

library(caTools)
set.seed(100)

split = sample.split(df_2$no_show, SplitRatio = 0.70)

train = subset(df_2, split == TRUE)
test = subset(df_2, split == FALSE)

logit_model <- glm(formula = no_show ~ . ,  data = train, family =binomial(link = 'logit') )

summary(logit_model)

## 
## Call:
## glm(formula = no_show ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1568   0.5233   0.6045   0.6913   1.0147  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.2489473  0.0282180  44.261  < 2e-16 ***
## age            0.0067538  0.0004694  14.388  < 2e-16 ***
## genderMale     0.0354982  0.0195178   1.819 0.068948 .  
## scholarship1  -0.1807958  0.0291101  -6.211 5.27e-10 ***
## hypertension1  0.0769496  0.0294399   2.614 0.008954 ** 
## diabetes1     -0.0580227  0.0410676  -1.413 0.157698    
## alcoholism1   -0.1441821  0.0540872  -2.666 0.007682 ** 
## handicap1      0.0422388  0.0721439   0.585 0.558226    
## handicap2     -0.0715583  0.2339774  -0.306 0.759731    
## handicap3      8.9433762 44.6912275   0.200 0.841391    
## handicap4     -0.5436300  1.2422542  -0.438 0.661665    
## sms_received1 -0.6862142  0.0186818 -36.732  < 2e-16 ***
## dayMonday      0.0986086  0.0292974   3.366 0.000763 ***
## daySaturday   -0.1614508  0.5080700  -0.318 0.750657    
## dayThursday    0.1498016  0.0317942   4.712 2.46e-06 ***
## dayTuesday     0.1912953  0.0288552   6.629 3.37e-11 ***
## dayWednesday   0.1948368  0.0288240   6.760 1.38e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 77843  on 77367  degrees of freedom
## Residual deviance: 76114  on 77351  degrees of freedom
## AIC: 76148
## 
## Number of Fisher Scoring iterations: 9

0.3.3 Model Evaluation

fitted_p <- predict(logit_model,newdata=test,type='response')

pred_test <-  ifelse(fitted_p>0.5,1,0)

tab <- table(predicted = pred_test, actual = test$no_show)
tab

##          actual
## predicted not showed up Showed up
##         1          6696     26462

Model is not like as we expected because there is class imbalance in between showed up and not showed up group.
There are \(80\) % patients who show up and \(20\) % those who don’t show up.
So if I predict someone will show up based on data (without model) that is \(80\) %.

0.4 Conclusions

If someone make appointment there is already \(80\)% chance that this patient will show up.
even though we saw some independent variable(predictors) like age, hypertension , diabetes, sms-received are significant predictor but our model fail to predict accuracy more than \(80\)% because of class imbalance. So from this dataset it is very hard to tell that who is not going to show up.
There are techniques (like under-sampling and over-sampling) available to solve class imbalance problem but that beyond this project outline.

same project in python double click and open new tab here

If you want to see same plot in Tableau
and want to see some SAS code go to my github page double click and open new tab here:github