library(tidyverse)
# read the table
df <- read_csv("data/KaggleV2-May-2016.csv")
# check the data
head(df)
| PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.987250e+13 | 5642903 | F | 2016-04-29 18:38:08 | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 5.589978e+14 | 5642503 | M | 2016-04-29 16:08:27 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4.262962e+12 | 5642549 | F | 2016-04-29 16:19:04 | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 8.679512e+11 | 5642828 | F | 2016-04-29 17:29:31 | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 8.841186e+12 | 5642494 | F | 2016-04-29 16:07:23 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
| 9.598513e+13 | 5626772 | F | 2016-04-27 08:36:51 | 2016-04-29 | 76 | REPÚBLICA | 0 | 1 | 0 | 0 | 0 | 0 | No |
# check structure of the data
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 110527 obs. of 14 variables:
## $ PatientId : num 2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
## $ AppointmentID : int 5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
## $ Gender : chr "F" "M" "F" "F" ...
## $ ScheduledDay : POSIXct, format: "2016-04-29 18:38:08" "2016-04-29 16:08:27" ...
## $ AppointmentDay: POSIXct, format: "2016-04-29" "2016-04-29" ...
## $ Age : int 62 56 62 8 56 76 23 39 21 19 ...
## $ Neighbourhood : chr "JARDIM DA PENHA" "JARDIM DA PENHA" "MATA DA PRAIA" "PONTAL DE CAMBURI" ...
## $ Scholarship : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Hipertension : int 1 0 0 0 1 1 0 0 0 0 ...
## $ Diabetes : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Alcoholism : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Handcap : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SMS_received : int 0 0 0 0 0 0 0 0 0 0 ...
## $ No-show : chr "No" "No" "No" "No" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 14
## .. ..$ PatientId : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ AppointmentID : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Gender : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ ScheduledDay :List of 1
## .. .. ..$ format: chr ""
## .. .. ..- attr(*, "class")= chr "collector_datetime" "collector"
## .. ..$ AppointmentDay:List of 1
## .. .. ..$ format: chr ""
## .. .. ..- attr(*, "class")= chr "collector_datetime" "collector"
## .. ..$ Age : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Neighbourhood : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Scholarship : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Hipertension : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Diabetes : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Alcoholism : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Handcap : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ SMS_received : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ No-show : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
# change columns name
names(df)<- c('patient_id',
'appointment_id',
'gender','schedule_day',
'appointment_day',
'age',
'neighborhood',
'scholarship',
'hypertension',
'diabetes',
'alcoholism',
'handicap',
'sms_received',
'no_show')
# check again
head(df)
| patient_id | appointment_id | gender | schedule_day | appointment_day | age | neighborhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | no_show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.987250e+13 | 5642903 | F | 2016-04-29 18:38:08 | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 5.589978e+14 | 5642503 | M | 2016-04-29 16:08:27 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4.262962e+12 | 5642549 | F | 2016-04-29 16:19:04 | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 8.679512e+11 | 5642828 | F | 2016-04-29 17:29:31 | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 8.841186e+12 | 5642494 | F | 2016-04-29 16:07:23 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
| 9.598513e+13 | 5626772 | F | 2016-04-27 08:36:51 | 2016-04-29 | 76 | REPÚBLICA | 0 | 1 | 0 | 0 | 0 | 0 | No |
# check missing value
sapply(df,function(x)sum(is.na(x)))
## patient_id appointment_id gender schedule_day
## 0 0 0 0
## appointment_day age neighborhood scholarship
## 0 0 0 0
## hypertension diabetes alcoholism handicap
## 0 0 0 0
## sms_received no_show
## 0 0
# change data type of some columns
df <- mutate_at(df, vars('gender',
'neighborhood',
'scholarship',
'hypertension',
'diabetes',
'alcoholism',
'handicap',
'sms_received'), as.factor)
# check data type
lapply(df, class)
## $patient_id
## [1] "numeric"
##
## $appointment_id
## [1] "integer"
##
## $gender
## [1] "factor"
##
## $schedule_day
## [1] "POSIXct" "POSIXt"
##
## $appointment_day
## [1] "POSIXct" "POSIXt"
##
## $age
## [1] "integer"
##
## $neighborhood
## [1] "factor"
##
## $scholarship
## [1] "factor"
##
## $hypertension
## [1] "factor"
##
## $diabetes
## [1] "factor"
##
## $alcoholism
## [1] "factor"
##
## $handicap
## [1] "factor"
##
## $sms_received
## [1] "factor"
##
## $no_show
## [1] "character"
# check summary statistics
summary(df)
## patient_id appointment_id gender
## Min. :3.922e+04 Min. :5030230 F:71840
## 1st Qu.:4.173e+12 1st Qu.:5640286 M:38687
## Median :3.173e+13 Median :5680573
## Mean :1.475e+14 Mean :5675305
## 3rd Qu.:9.439e+13 3rd Qu.:5725524
## Max. :1.000e+15 Max. :5790484
##
## schedule_day appointment_day
## Min. :2015-11-10 07:13:56 Min. :2016-04-29 00:00:00
## 1st Qu.:2016-04-29 10:27:01 1st Qu.:2016-05-09 00:00:00
## Median :2016-05-10 12:13:17 Median :2016-05-18 00:00:00
## Mean :2016-05-09 07:49:15 Mean :2016-05-19 00:57:50
## 3rd Qu.:2016-05-20 11:18:37 3rd Qu.:2016-05-31 00:00:00
## Max. :2016-06-08 20:07:23 Max. :2016-06-08 00:00:00
##
## age neighborhood scholarship hypertension
## Min. : -1.00 JARDIM CAMBURI : 7717 0:99666 0:88726
## 1st Qu.: 18.00 MARIA ORTIZ : 5805 1:10861 1:21801
## Median : 37.00 RESISTÊNCIA : 4431
## Mean : 37.09 JARDIM DA PENHA: 3877
## 3rd Qu.: 55.00 ITARARÉ : 3514
## Max. :115.00 CENTRO : 3334
## (Other) :81849
## diabetes alcoholism handicap sms_received no_show
## 0:102584 0:107167 0:108286 0:75045 Length:110527
## 1: 7943 1: 3360 1: 2042 1:35482 Class :character
## 2: 183 Mode :character
## 3: 13
## 4: 3
##
##
minimum age is −1, so lets check all the value less than 0
#check value less than 0
df[df$age < 0, ]
| patient_id | appointment_id | gender | schedule_day | appointment_day | age | neighborhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | no_show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4.659432e+14 | 5775010 | F | 2016-06-06 08:58:13 | 2016-06-06 | -1 | ROMÃO | 0 | 0 | 0 | 0 | 0 | 0 | No |
#drop the row with condition
df <-df[!(df$age<0),]
summary(df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 18.00 37.00 37.09 55.00 115.00
In the dataset description, it mention that in no_show column No means patient showed up and yes means patient did not showed up
# age vs showed up or not
# blue showed up
ggplot(df,aes(x=age)) +
geom_histogram(data=subset(df,no_show == 'No'),fill = '#00BFC4', alpha = 0.8, bins = 40) +
geom_histogram(data=subset(df,no_show == 'Yes'),fill = '#F8766D', alpha = 0.8, bins = 40) +
ggtitle('Age vs No Show Histogram')+
theme(plot.title = element_text(hjust = 0.5))
It is very hard to tell from this plot the difference in age of patient who showed up or did not showed up. So let’s explore more.
Since there is confusion with no and yes with showed up or not, lets make it clear
# in no_show column No means patient showed up and yes means patient did not showed up
df$no_show[df$no_show == 'No'] <- 'Showed up'
df$no_show[df$no_show == 'Yes'] <- 'not showed up'
df$no_show <- as.factor(df$no_show)
# age box plot
ggplot(df, aes(x = no_show, y = age, fill = no_show))+
geom_boxplot()+
ggtitle("Age vs No Show Boxplot")+
theme(plot.title = element_text(hjust = 0.5))
From the box plot, the mean age is higher for those who showed up. Now let’s see the exact value.
#statistics
select(df, age, no_show) %>%
group_by(no_show) %>%
summarise(age_mean = mean(age))
| no_show | age_mean |
|---|---|
| not showed up | 34.31767 |
| Showed up | 37.79050 |
more visualization
Though it might not be appropriate for all case but in this case it is possible to visualize all age group.
ggplot(data = df)+
geom_bar(aes(factor(age), fill = no_show), position = position_fill())+
ggtitle("Age vs No Show Proportion in Bar Diagram")+
ylab('Proportion')+
theme(plot.title = element_text(hjust = 0.5, size = 24))+
theme(axis.title.y = element_text(size =18))+
theme(axis.title.x = element_text(size =18))+
theme(axis.text.x = element_text(size= 12, angle = 90, hjust = 1))
From the above plot now it seems that proportion of showed up patients are higher in the age range 60 to 80 than patient age under 40.
# make column of age
showed_up <- df[df$no_show == 'Showed up',]$age
not_showed_up <- df[df$no_show == 'not showed up',]$age
# independent two sample t test
t.test(showed_up, not_showed_up)
##
## Welch Two Sample t-test
##
## data: showed_up and not_showed_up
## t = 20.831, df = 36143, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.146073 3.799602
## sample estimates:
## mean of x mean of y
## 37.79050 34.31767
# just easy way, same result
t.test(df$age ~ df$no_show)
##
## Welch Two Sample t-test
##
## data: df$age by df$no_show
## t = -20.831, df = 36143, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.799602 -3.146073
## sample estimates:
## mean in group not showed up mean in group Showed up
## 34.31767 37.79050
So there is significant difference in age of patient those who showed up and those who did not show up.
# replace levels
levels(df$gender)[levels(df$gender)=="M"] <- "Male"
levels(df$gender)[levels(df$gender)=="F"] <- "Female"
# count gender
ggplot(data = df)+
geom_bar(aes(x = gender,fill = gender))+
ggtitle("Gender Bar Diagram")+
theme(plot.title = element_text(hjust = 0.5))
So there are more female patient than male patient in dataset.
# gender table
table(df$gender, df$no_show)
##
## not showed up Showed up
## Female 14594 57245
## Male 7725 30962
# total patient
table(df$no_show)
##
## not showed up Showed up
## 22319 88207
#gender vs noshow stacked bar
ggplot(df)+
geom_bar(aes(x = gender, fill = no_show))+
ggtitle("Gender vs No Show Stacked Bar Diagram")+
theme(plot.title = element_text(hjust = 0.5))+
ylab("Count")+
xlab("Gender")
showed up and not showed up both group is higher for female than male because more patients in data set is female. So its better to see proportionally.
# proportion table
prop.table(table(df$gender, df$no_show), margin = 1)
##
## not showed up Showed up
## Female 0.2031487 0.7968513
## Male 0.1996795 0.8003205
#plot of proportion data, just add position_fill()
ggplot(df)+
geom_bar(aes(x = gender, fill = no_show), position = position_fill())+
ggtitle("Gender vs No Show Bar Diagram")+
ylab('Proportion')+
xlab("Gender")+
theme(plot.title = element_text(hjust = 0.5))
So there is not much difference between male and female.
chisq.test(table(df$gender,df$no_show))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(df$gender, df$no_show)
## X-squared = 1.8565, df = 1, p-value = 0.173
without continuity correction
chisq.test(table(df$gender,df$no_show), correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: table(df$gender, df$no_show)
## X-squared = 1.8779, df = 1, p-value = 0.1706
p value is more than 0.05, so gender difference is not significant.
# Count patients number in each group
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
p1 <- ggplot(df)+geom_bar(aes(scholarship, fill = scholarship))
p2 <- ggplot(df)+geom_bar(aes(hypertension, fill =hypertension))
p3 <- ggplot(df)+geom_bar(aes(diabetes, fill = diabetes))
p4 <- ggplot(df)+geom_bar(aes (alcoholism, fill = alcoholism))
p5 <- ggplot(df)+geom_bar(aes( handicap, fill = handicap))
p6 <- ggplot(df)+geom_bar(aes(sms_received, fill = sms_received))
grid.arrange(p1,p2,p3,p4,p5,p6, nrow = 3)
# this plot for each group with patient who showed up and not showed up
p1 <- ggplot(df)+geom_bar(aes(scholarship, fill = no_show))
p2 <- ggplot(df)+geom_bar(aes(hypertension, fill =no_show))
p3 <- ggplot(df)+geom_bar(aes(diabetes, fill = no_show))
p4 <- ggplot(df)+geom_bar(aes (alcoholism, fill = no_show))
p5 <- ggplot(df)+geom_bar(aes( handicap, fill = no_show))
p6 <- ggplot(df)+geom_bar(aes(sms_received, fill = no_show))
grid.arrange(p1,p2,p3,p4,p5,p6, nrow = 3)
# this plot is same as above one execpt it show proportion.
p1 <- ggplot(df)+geom_bar(aes(scholarship, fill = no_show), position = position_fill())+
ylab('Proportion')
p2 <- ggplot(df)+geom_bar(aes(hypertension, fill =no_show), position = position_fill())+
ylab('Proportion')
p3 <- ggplot(df)+geom_bar(aes(diabetes, fill = no_show), position = position_fill())+
ylab('Proportion')
p4 <- ggplot(df)+geom_bar(aes (alcoholism, fill = no_show), position = position_fill())+
ylab('Proportion')
p5 <- ggplot(df)+geom_bar(aes( handicap, fill = no_show), position = position_fill())+
ylab('Proportion')
p6 <- ggplot(df)+geom_bar(aes(sms_received, fill = no_show), position = position_fill())+
ylab('Proportion')
grid.arrange(p1,p2,p3,p4,p5,p6, nrow = 3)
# chi-squared test
chisq.test(table(df$scholarship, df$no_show), correct = FALSE) # scholarship
##
## Pearson's Chi-squared test
##
## data: table(df$scholarship, df$no_show)
## X-squared = 93.811, df = 1, p-value < 2.2e-16
chisq.test(table(df$hypertension, df$no_show), correct = FALSE) # hypertension
##
## Pearson's Chi-squared test
##
## data: table(df$hypertension, df$no_show)
## X-squared = 140.89, df = 1, p-value < 2.2e-16
chisq.test(table(df$diabetes, df$no_show), correct = FALSE) # diabetes
##
## Pearson's Chi-squared test
##
## data: table(df$diabetes, df$no_show)
## X-squared = 25.473, df = 1, p-value = 4.486e-07
chisq.test(table(df$alcoholism, df$no_show), correct = FALSE) #alcoholism
##
## Pearson's Chi-squared test
##
## data: table(df$alcoholism, df$no_show)
## X-squared = 0.0042829, df = 1, p-value = 0.9478
chisq.test(table(df$handicap, df$no_show), correct = FALSE) #handicap
##
## Pearson's Chi-squared test
##
## data: table(df$handicap, df$no_show)
## X-squared = 7.0356, df = 4, p-value = 0.134
chisq.test(table(df$sms_received, df$no_show), correct = FALSE) # sms_received
##
## Pearson's Chi-squared test
##
## data: table(df$sms_received, df$no_show)
## X-squared = 1766.7, df = 1, p-value < 2.2e-16
p value is significant for scholarship, hypertension, diabetes, sms_received group.
# make new columns
df$day <- weekdays(as.Date(df$appointment_day))
ggplot(df)+geom_bar(aes(day, fill = day))+
ggtitle("Number of Appointment")+
ylab('Count')+
xlab('Day')+
theme(plot.title = element_text(hjust = 0.5))
Number of appointment differ across week. Some day like Wednesday and Tuesday make more appointment than other. Statistics given below to see exact number.
library(lubridate)
# make days column, with lebel true
df$date <- as.Date(df$appointment_day)
df$days <- wday(df$date, label=TRUE)
# day start monday and goes on by day, using days column
ggplot(df)+geom_bar(aes(days, fill = day))+
ggtitle("Number of Appointment")+
ylab('Count')+
xlab('Day')+
theme(plot.title = element_text(hjust = 0.5))
# days vs no show
ggplot(df)+geom_bar(aes(days, fill = no_show))+
ggtitle(" Number of Appointment vs No Show")+
ylab('Count')+
xlab('Day')+
theme(plot.title = element_text(hjust = 0.5))
#days vs no show, using days column
ggplot(df)+geom_bar(aes(days, fill = no_show), position = position_fill())+
ggtitle("Appointment vs No Show")+
ylab('Proportion')+
xlab('Day')+
theme(plot.title = element_text(hjust = 0.5))
used day column because days column contain sunday.
# day column
table(df$day, df$no_show)
##
## not showed up Showed up
## Friday 4037 14982
## Monday 4690 18024
## Saturday 9 30
## Thursday 3338 13909
## Tuesday 5152 20488
## Wednesday 5093 20774
#days column
table(df$days, df$no_show)
##
## not showed up Showed up
## Sun 0 0
## Mon 4690 18024
## Tue 5152 20488
## Wed 5093 20774
## Thu 3338 13909
## Fri 4037 14982
## Sat 9 30
chisq.test(table(df$no_show,df$day))
##
## Pearson's Chi-squared test
##
## data: table(df$no_show, df$day)
## X-squared = 27.48, df = 5, p-value = 4.599e-05
Since p value is significant that means showing up in appointment day is dependent on which day the appointment is.
df_neighbor <- data.frame(table(df$neighborhood, df$no_show))
names(df_neighbor) <- c("neighborhood", "no_show", 'Count')
head(df_neighbor)
| neighborhood | no_show | Count |
|---|---|---|
| AEROPORTO | not showed up | 1 |
| ANDORINHAS | not showed up | 521 |
| ANTÔNIO HONÓRIO | not showed up | 50 |
| ARIOVALDO FAVALESSA | not showed up | 62 |
| BARRO VERMELHO | not showed up | 91 |
| BELA VISTA | not showed up | 384 |
ggplot(df_neighbor)+
geom_bar(aes(x = reorder(neighborhood, -Count), y = Count, fill = no_show), stat = 'identity')+
theme(axis.text.x = element_text(size= 12, angle = 90, hjust = 1))+
ggtitle("Neighborhood vs No Show")+
ylab('Count')+
xlab('Neighborhood')+
theme(plot.title = element_text(hjust = 0.5, size = 24))+
theme(axis.title.y = element_text(size =18))+
theme(axis.title.x = element_text(size =18))
# proportion
ggplot(df)+
geom_bar(aes(x = neighborhood, fill = no_show), position = position_fill())+
theme(axis.text.x = element_text(size= 12, angle = 90, hjust = 1))+
ggtitle("Neighborhood vs No Show")+
ylab('Proportion')+
xlab('Neighborhood')+
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.title = element_text(hjust = 0.5, size = 24))+
theme(axis.title.y = element_text(size =18))+
theme(axis.title.x = element_text(size =18))
select the coulmns that we are interested in.
df_2 <- select(df, age, gender, scholarship, hypertension, diabetes, alcoholism, handicap,sms_received, day, no_show)
df_2 <- mutate_at(df_2, vars(day), as.factor)
log_model_1 <- glm(no_show ~ . ,family = binomial(link = 'logit'), data = df_2 )
summary(log_model_1)
##
## Call:
## glm(formula = no_show ~ ., family = binomial(link = "logit"),
## data = df_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1306 0.5295 0.6065 0.6931 1.0061
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2563118 0.0235536 53.339 < 2e-16 ***
## age 0.0065625 0.0003928 16.705 < 2e-16 ***
## genderMale 0.0184803 0.0162927 1.134 0.25668
## scholarship1 -0.1845315 0.0245204 -7.526 5.25e-14 ***
## hypertension1 0.0671923 0.0245948 2.732 0.00630 **
## diabetes1 -0.0832348 0.0341384 -2.438 0.01476 *
## alcoholism1 -0.1390368 0.0448100 -3.103 0.00192 **
## handicap1 -0.0101453 0.0590236 -0.172 0.86353
## handicap2 -0.1395108 0.1863616 -0.749 0.45410
## handicap3 -0.3020104 0.6684127 -0.452 0.65139
## handicap4 -0.5451696 1.2397235 -0.440 0.66012
## sms_received1 -0.6675492 0.0156403 -42.681 < 2e-16 ***
## dayMonday 0.1027655 0.0244653 4.200 2.66e-05 ***
## daySaturday -0.1446681 0.3858706 -0.375 0.70772
## dayThursday 0.1598183 0.0264963 6.032 1.62e-09 ***
## dayTuesday 0.1991868 0.0240898 8.269 < 2e-16 ***
## dayWednesday 0.1832950 0.0240028 7.636 2.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111205 on 110525 degrees of freedom
## Residual deviance: 108900 on 110509 degrees of freedom
## AIC: 108934
##
## Number of Fisher Scoring iterations: 4
Drop handicap and gender and build a model again.
df_3 <- select(df, age, scholarship, hypertension, diabetes, alcoholism, sms_received, day, no_show)
log_model_2 <- glm(no_show ~ . ,family = binomial(link = 'logit'), data = df_3 )
summary(log_model_2)
##
## Call:
## glm(formula = no_show ~ ., family = binomial(link = "logit"),
## data = df_3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1357 0.5297 0.6059 0.6931 1.0096
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2648522 0.0222333 56.890 < 2e-16 ***
## age 0.0065043 0.0003894 16.704 < 2e-16 ***
## scholarship1 -0.1882820 0.0242938 -7.750 9.17e-15 ***
## hypertension1 0.0669163 0.0245739 2.723 0.00647 **
## diabetes1 -0.0835637 0.0341255 -2.449 0.01434 *
## alcoholism1 -0.1330325 0.0444837 -2.991 0.00278 **
## sms_received1 -0.6681409 0.0156205 -42.773 < 2e-16 ***
## dayMonday 0.1029304 0.0244647 4.207 2.58e-05 ***
## daySaturday -0.1456292 0.3858180 -0.377 0.70583
## dayThursday 0.1599243 0.0264956 6.036 1.58e-09 ***
## dayTuesday 0.1991017 0.0240890 8.265 < 2e-16 ***
## dayWednesday 0.1832997 0.0240024 7.637 2.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111205 on 110525 degrees of freedom
## Residual deviance: 108902 on 110514 degrees of freedom
## AIC: 108926
##
## Number of Fisher Scoring iterations: 4
library(caTools)
set.seed(100)
split = sample.split(df_2$no_show, SplitRatio = 0.70)
train = subset(df_2, split == TRUE)
test = subset(df_2, split == FALSE)
logit_model <- glm(formula = no_show ~ . , data = train, family =binomial(link = 'logit') )
summary(logit_model)
##
## Call:
## glm(formula = no_show ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1568 0.5233 0.6045 0.6913 1.0147
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2489473 0.0282180 44.261 < 2e-16 ***
## age 0.0067538 0.0004694 14.388 < 2e-16 ***
## genderMale 0.0354982 0.0195178 1.819 0.068948 .
## scholarship1 -0.1807958 0.0291101 -6.211 5.27e-10 ***
## hypertension1 0.0769496 0.0294399 2.614 0.008954 **
## diabetes1 -0.0580227 0.0410676 -1.413 0.157698
## alcoholism1 -0.1441821 0.0540872 -2.666 0.007682 **
## handicap1 0.0422388 0.0721439 0.585 0.558226
## handicap2 -0.0715583 0.2339774 -0.306 0.759731
## handicap3 8.9433762 44.6912275 0.200 0.841391
## handicap4 -0.5436300 1.2422542 -0.438 0.661665
## sms_received1 -0.6862142 0.0186818 -36.732 < 2e-16 ***
## dayMonday 0.0986086 0.0292974 3.366 0.000763 ***
## daySaturday -0.1614508 0.5080700 -0.318 0.750657
## dayThursday 0.1498016 0.0317942 4.712 2.46e-06 ***
## dayTuesday 0.1912953 0.0288552 6.629 3.37e-11 ***
## dayWednesday 0.1948368 0.0288240 6.760 1.38e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 77843 on 77367 degrees of freedom
## Residual deviance: 76114 on 77351 degrees of freedom
## AIC: 76148
##
## Number of Fisher Scoring iterations: 9
fitted_p <- predict(logit_model,newdata=test,type='response')
pred_test <- ifelse(fitted_p>0.5,1,0)
tab <- table(predicted = pred_test, actual = test$no_show)
tab
## actual
## predicted not showed up Showed up
## 1 6696 26462
Model is not like as we expected because there is class imbalance in between showed up and not showed up group.
There are \(80\) % patients who show up and \(20\) % those who don’t show up.
So if I predict someone will show up based on data (without model) that is \(80\) %.
If someone make appointment there is already \(80\)% chance that this patient will show up.
even though we saw some independent variable(predictors) like age, hypertension , diabetes, sms-received are significant predictor but our model fail to predict accuracy more than \(80\)% because of class imbalance. So from this dataset it is very hard to tell that who is not going to show up.
There are techniques (like under-sampling and over-sampling) available to solve class imbalance problem but that beyond this project outline.
same project in python double click and open new tab here
If you want to see same plot in Tableau
and want to see some SAS code go to my github page double click and open new tab here:github