The dataset was downloaded from https://www.kaggle.com, file “KaggleV2-May-2016.csv” . It contains 14 variables and 110527 observations. The main idea of dataset is to try to recognize persons who order a doctor appointment, receives all the instructions and finally do not arrive (“no-show”).
“No-Show” dataset containes information about appointments, and information about a persons: general and medical.
EDA
The ratio of “arrived” patients to “didn’t arrive” patients is significantly bigger on the same day when the appointment was on a day when was ordered a queue. The ratio is about 21. In the case when a patient has an appointment on a next day after ordering a queue- a ratio dramatically fault to 3.7.
If patient prone to don’t arrive at doctor appointment- the probability (arrive/ don’t arrive ) is kept about 50%, Despite a number of appointment … first appointment, or second, third or fourth.
Patients at the age between 20 -65 prone to miss appointments more than younger or older patients.
Patients, who suffer from hypertension tend fewer miss appointments.
Other variables have no significant effect on “No Show” (patient didn’t arrive) variable.
Modelling
Better results provides Conditional Inference Tree model - about 70% of correctly predicted “No.show” appointments, and about 35% of general wrong predictions.
## [1] 110527 14
## 'data.frame': 110527 obs. of 14 variables:
## $ PatientId : num 2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
## $ AppointmentID : int 5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
## $ Gender : chr "F" "M" "F" "F" ...
## $ ScheduledDay : chr "2016-04-29T18:38:08Z" "2016-04-29T16:08:27Z" "2016-04-29T16:19:04Z" "2016-04-29T17:29:31Z" ...
## $ AppointmentDay: chr "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" ...
## $ Age : int 62 56 62 8 56 76 23 39 21 19 ...
## $ Neighbourhood : chr "JARDIM DA PENHA" "JARDIM DA PENHA" "MATA DA PRAIA" "PONTAL DE CAMBURI" ...
## $ Scholarship : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Hipertension : int 1 0 0 0 1 1 0 0 0 0 ...
## $ Diabetes : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Alcoholism : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Handcap : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SMS_received : int 0 0 0 0 0 0 0 0 0 0 ...
## $ No.show : chr "No" "No" "No" "No" ...
## PatientId AppointmentID Gender ScheduledDay
## 1 2.987250e+13 5642903 F 2016-04-29T18:38:08Z
## 2 5.589978e+14 5642503 M 2016-04-29T16:08:27Z
## 3 4.262962e+12 5642549 F 2016-04-29T16:19:04Z
## 4 8.679512e+11 5642828 F 2016-04-29T17:29:31Z
## 5 8.841186e+12 5642494 F 2016-04-29T16:07:23Z
## 6 9.598513e+13 5626772 F 2016-04-27T08:36:51Z
## AppointmentDay Age Neighbourhood Scholarship Hipertension
## 1 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1
## 2 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0
## 3 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0
## 4 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0
## 5 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1
## 6 2016-04-29T00:00:00Z 76 REPÃBLICA 0 1
## Diabetes Alcoholism Handcap SMS_received No.show
## 1 0 0 0 0 No
## 2 0 0 0 0 No
## 3 0 0 0 0 No
## 4 0 0 0 0 No
## 5 1 0 0 0 No
## 6 0 0 0 0 No
Let’s find missing numeric data:
numIndex<- sapply(noShow, is.numeric)
summary(noShow[,numIndex])
## PatientId AppointmentID Age Scholarship
## Min. :3.922e+04 Min. :5030230 Min. : -1.00 Min. :0.00000
## 1st Qu.:4.173e+12 1st Qu.:5640286 1st Qu.: 18.00 1st Qu.:0.00000
## Median :3.173e+13 Median :5680573 Median : 37.00 Median :0.00000
## Mean :1.475e+14 Mean :5675305 Mean : 37.09 Mean :0.09827
## 3rd Qu.:9.439e+13 3rd Qu.:5725524 3rd Qu.: 55.00 3rd Qu.:0.00000
## Max. :1.000e+15 Max. :5790484 Max. :115.00 Max. :1.00000
## Hipertension Diabetes Alcoholism Handcap
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.1972 Mean :0.07186 Mean :0.0304 Mean :0.02225
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :4.00000
## SMS_received
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.321
## 3rd Qu.:1.000
## Max. :1.000
According to the “summary” there are no “NA”’s in all numeric columns (it is a good news).
“PatientId” - contains system running numbers of patients. PatientID is not unique variable - it means that number of patients made an appointment more then one time during the dataset period of time:
length(noShow$PatientId)==length(unique(noShow$PatientId))
## [1] FALSE
Procent Distribution of patients made one, two .. six and more appointments:
“AppointmentID” - contains system running number of appointment- it supposed to be unique. Let’s test whether the same appointment appears few times in data set:
length(noShow$AppointmentID)==length(unique(noShow$AppointmentID))
## [1] TRUE
All appointments are unique. There are no repeated appointments in data set, therefore the column “AppointmentID” is redundant and will be omitted.
Variables “Scholarship”,“Hipertension”,“Diabetes”,“Alcoholism”,“Handcap”,“SMS_received” Variables “Scholarship”,“Hipertension”,“Diabetes”,“Alcoholism”,“Handcap”,“SMS_received” according to the “summary” are factors and can be “0” or “1”. All factor columns contain proper data (0 or 1 only, biesibes “Handcap” -it has 5 levels:0,1,2,3,4).
Scholarship(%):
##
## 0 1
## 90.2 9.8
Hipertension(%):
##
## 0 1
## 80.3 19.7
Diabetes(%):
##
## 0 1
## 92.8 7.2
Alcoholism(%):
##
## 0 1
## 97 3
Handcap:
##
## 0 1 2 3 4
## 97.972 1.848 0.166 0.012 0.003
SMS_received(%):
##
## 0 1
## 67.9 32.1
Column “Age”" (according to the “summary”) - it looks like contain mostly relevant information, besides rows with age -1 (less the zero):
nrow(noShow[noShow$Age<0,])
## [1] 1
There’s only one row with age value less than 0 -it will be omitted. Let’s visualize age distribution:
Let’s take to account that patients at age less then about 16 and at age above about 80 depend on other people who care about them, therefore thea arrival to doctor appointment may be depends on the decision of the attendant person. It means that information of this group MAY BE will researched separately - let’s see further.
## [1] "Number of appointments made for patients at age less than 16 and above 80: 48936"
## [1] "Percent of appointments made for patients at age less than 16 and above 80: 44 %"
Actually, about half of appointments belong to this category (less then 16 and above 80).
numIndex<- sapply(noShow, is.numeric )
summary(noShow[,!numIndex])
## Gender ScheduledDay AppointmentDay
## Length:110526 Length:110526 Length:110526
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Neighbourhood No.show
## Length:110526 Length:110526
## Class :character Class :character
## Mode :character Mode :character
“Gender”- according to “unique” statement there is no irrelevant or missing data in the column; 65 percent of the appointments made by (for) women, and 35 by (or for) men.
## [1] "F" "M"
## Gender Procent
## 1 F 65
## 2 M 35
“Neighbourhood”- according to “unique” statement there is no irrelevant or missing data in the column; 81 different neighbourhoods mentioned in dataset.
## [1] "AEROPORTO" "ANDORINHAS"
## [3] "ANTÃNIO HONÃRIO" "ARIOVALDO FAVALESSA"
## [5] "BARRO VERMELHO" "BELA VISTA"
## [7] "BENTO FERREIRA" "BOA VISTA"
## [9] "BONFIM" "CARATOÃRA"
## [11] "CENTRO" "COMDUSA"
## [13] "CONQUISTA" "CONSOLAÃÃO"
## [15] "CRUZAMENTO" "DA PENHA"
## [17] "DE LOURDES" "DO CABRAL"
## [19] "DO MOSCOSO" "DO QUADRO"
## [21] "ENSEADA DO SUÃ" "ESTRELINHA"
## [23] "FONTE GRANDE" "FORTE SÃO JOÃO"
## [25] "FRADINHOS" "GOIABEIRAS"
## [27] "GRANDE VITÃRIA" "GURIGICA"
## [29] "HORTO" "ILHA DAS CAIEIRAS"
## [31] "ILHA DE SANTA MARIA" "ILHA DO BOI"
## [33] "ILHA DO FRADE" "ILHA DO PRÃNCIPE"
## [35] "ILHAS OCEÃNICAS DE TRINDADE" "INHANGUETÃ"
## [37] "ITARARÃ" "JABOUR"
## [39] "JARDIM CAMBURI" "JARDIM DA PENHA"
## [41] "JESUS DE NAZARETH" "JOANA D´ARC"
## [43] "JUCUTUQUARA" "MÃRIO CYPRESTE"
## [45] "MARIA ORTIZ" "MARUÃPE"
## [47] "MATA DA PRAIA" "MONTE BELO"
## [49] "MORADA DE CAMBURI" "NAZARETH"
## [51] "NOVA PALESTINA" "PARQUE INDUSTRIAL"
## [53] "PARQUE MOSCOSO" "PIEDADE"
## [55] "PONTAL DE CAMBURI" "PRAIA DO CANTO"
## [57] "PRAIA DO SUÃ" "REDENÃÃO"
## [59] "REPÃBLICA" "RESISTÃNCIA"
## [61] "ROMÃO" "SÃO BENEDITO"
## [63] "SÃO CRISTÃVÃO" "SÃO JOSÃ"
## [65] "SÃO PEDRO" "SANTA CECÃLIA"
## [67] "SANTA CLARA" "SANTA HELENA"
## [69] "SANTA LÃCIA" "SANTA LUÃZA"
## [71] "SANTA MARTHA" "SANTA TEREZA"
## [73] "SANTO ANDRÃ" "SANTO ANTÃNIO"
## [75] "SANTOS DUMONT" "SANTOS REIS"
## [77] "SEGURANÃA DO LAR" "SOLON BORGES"
## [79] "TABUAZEIRO" "UNIVERSITÃRIO"
## [81] "VILA RUBIM"
“No.show” - according to “unique” statement there is no irrelevant or missing data in the column (“Yes”/“No” only); 20 percent of appointments were appointed in vain- patients weren’t appear. Further will be better to transform the No.show data to 0/1 format.
## [1] "No" "Yes"
## No.show Procent
## 1 No 80
## 2 Yes 20
Load dataset and omit negative age.
“AppointmentDay” and “ScheduledDay” - my be omitted, but befor will transfer them from character to POSIXct format, then order dataset according to “AppointmentDay” and retrive two values: 1) how many days between ordering an doctor’s appointment and appointment itself; 2) on what day of week appointment is.
Let’s test for any negative results:
We have only 5 impossible negative “Days before appointment”. They can be replaced by 0.
“Appointment Day”- a day of week: Let’s create new calculated variable “AppointmentWeekDay”.
“No.show” will be transformed to 0/1 format. “No”- 0 (patient arrived to appointment), “Yes” - 1 (patient didn’t arrive to appointment).
“PatientID” is running number. It provides information about a number of appointments made by the each patient (within the dataset).
Let’s find out a number of appointments the patient made and running a number of each appointment each patient made. For this purpose, I’ll make four additional calculated columns : 1) “numberOfAppointment” - it reflects order number of each appointment of each patient. 2) “numberOfAppointments” - how many appointments every patient have. 3) “totalNoShow”- how many appointments each patient lost. 4) “numOfNoShowBefor” - how many appointments the patient missed.
Let’s test distribution of new tree values: 1)numberOfAppointments 2)numberOfAppointment 3)totalNoShow
How it was expected all three values are distributed with a long right tail. It seems, will be better to discretize them by equal frequency method and add new three values to the dataset for future decision.
“Gender” is character value. Transfer it to numeric 0/1 presentation (due to females are majority 65% set “F” as 1 and “M” as 0).
Variables **“Age”, “Neighbourhood”, “Scholarship”, “Hipertension”, “Diabetes”, “Alcoholism”, “Handcap”, “SMS_received”* stay as they are, without modification, only “Neighbourhood” applied as Factor.
Now, new modified data set will be saved and all temporary data structures will be removed.
Actually, we have in dataset two kind of data: a) unique data connected to appointments themselves - like difference between date of appointment and date when appointment was ordered, or number of each appointment which each patient made; b) unique data connected to the patients- like age, conditions, neighborhood …
Firstly, separate modified data set into two data sets: first contain data of “NoShow” appointment, second “YesShow” appointments.
Let’s, find number (first, second, third …) of appointment, which “not accurate” patient prone to loose:
Majority of “NoShow” appoinments (about 55%)-it is a first appointment patient made, second appointment - about 23%, and third appointment- about 10%, forth- about 5% … Conclusion: if patient prone to don’t arrive at doctor appointment- the probability (arrive/ don’t arrive ) is kept about 50%, Despite a number of appointment … first appointment, or second, third or fourth.
“Days before appointment”- how many days between patient ordered a queue to a doctor and appointment itself: It looks like, that then fewer days before appointment then more patients arrive at the appointment. Let’s see the difference in ranges of “days before appointment”:
Let’s see numbers:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Arrived 0 0 2 8.8 12 179
## Didn't arrive 0 4 11 15.8 23 179
##
## Welch Two Sample t-test
##
## data: arrived.yes$daysBeforeAppointment and arrived.no$daysBeforeAppointment
## t = -58.263, df = 31531, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.315262 -6.839094
## sample estimates:
## mean of x mean of y
## 8.754759 15.831937
Now it is obvious: than fewer days before appointment than more patients will arrive at the appointment.
Let’s find exact days, when the difference (ratio) between the quantity of arrived and didn’t arrive patients is significant. I suppose, that to research the interval between 0 days (the queue was ordered in the same day) and 30 days (the queue was ordered in one month in advance) will be enough.
Conclusion: The ratio of “arrived” patients to “didn’t arrive” patients is significantly bigger on the same day when the appointment was on a day when was ordered a queue. The ratio is about 21. In the case when a patient has an appointment on a next day after ordering a queue- a ratio dramatically fault to 3.7 -close to average ratio. The average ratio is 2.4 (taken without 0 days before the appointment). After that, ratio continuously became closer to average ratio, and on 10 days before appointment became equivalent to average.
“Appontment Weekday”: There isn’t significant difference between “Show” and “noShow” appointments according to the days of the week.
“SMS received”: About 70% of patients who arrived at appointment didn’t receive SMS before. Only 30% of arrived patients received SMS. On the other hand, almost half of patients, who didn’t arrive at appointment received SMS before. Therefore, it is hard to conclude, that SMS sent before an appointment can decrease the percentage of patients who ordered a queue to a doctor and didn’t arrive at an appointment.
Now, let’s analyze data connected to patients (like age, physical condition, gender…), not to the appointment (like day of week or “SMS sent”). According to “Repeated Appointments” chart, about 40% of appointments are repeated, it means, that data of a number of patients is repeated few times in original dataset. In order to analyze data about patients, we need to extract datasets with unique patients. In order to compare patients, I will make three data frames with unique patients: a) only patients who never arrived at appointment; b) patients who lose one or more appointments and arrived at one or more appointments; c) only patients who always arrived at the appointment.
Given dataset contains information collected during about half year. It means that if the patient made more than one appointment, his age may be different in different appointments and “unique” statement will return same patients with different ages few times. We have to recognizes rows with the same patienID and different ages and then calculates average age for each patient.
“Age”: According to the chart, looks like, patients at the age between 20 -65 prone to miss appointments more than younger or older patients. The green line (patients who never miss appointments) in this interval (20-65) lower than in other intervals and lower than red and blue line (patients who always or sometimes miss doctor appointments).
Let’s compare “Scholarship”, “Hipertension”“,”Diabetes“,”Alcoholism“, and”Handcap" variables in three groups
Gender: There is no significant difference in the groups. Gender is not the main criterion explains the absence of patients at the doctor’s appointment.
“Scholarship”: There is no significant difference in the groups. But it looks like that educated patients prone more to miss appointments (but the difference is not significant).
“Hipertension”:Patients, who suffer from hypertension tend fewer miss appointments- more than 5% difference between groups “never miss appointments” and “always miss appointments”.
“Diabete”: There is no significant difference in the groups. But it looks like that diabetic patients tend less to miss appointments (but the difference is not significant).
“Alcoholism”: There is no significant difference in the groups. But it looks like that alcoholism patient prone more to miss appointments (but the difference is not significant).
“Handcap”: Patients who suffer from handicap tend less to miss appointments.
“Neighbourhood”: Let’s analyze whether neighborhood of patients can explain “missing” appointments. For this purpose, firstly, I’ll find distributions of “Neighbourhood” variable in all three groups, after, I’ll calculate ratio of the each neighborhood in each of three groups to “Neighbourhood” variable in general dataset, and order the data.
5 first neighbourhoods in “always missing appointments” group:
## Neighbourhood quantityOfPatients ratio
## 35 ILHAS OCEÃNICAS DE TRINDADE 2 100
## 50 NAZARETH 19 14
## 29 HORTO 22 13
## 1 AEROPORTO 1 12
## 65 SANTA CECÃLIA 54 12
5 first neighbourhoods in “sometimes missing appointments” group:
## Neighbourhood quantityOfPatients ratio
## 33 ILHA DO PRÃNCIPE 246 11
## 39 JESUS DE NAZARETH 319 11
## 72 SANTOS DUMONT 144 11
## 1 ANDORINHAS 233 10
## 4 BARRO VERMELHO 41 10
5 first neighbourhoods in “never missing appointments” group:
## Neighbourhood quantityOfPatients ratio
## 51 PARQUE INDUSTRIAL 1 100
## 1 AEROPORTO 6 75
## 32 ILHA DO BOI 19 54
## 48 MORADA DE CAMBURI 52 54
## 3 ANTÃNIO HONÃRIO 135 50
Despite that those neighborhoods take 5 first places- it is difficult to conclude that patients from those neighborhoods prone to always miss the doctor’s appointments ot tend to always arrive, because of the number of patients from those neighborhoods significantly less than the number of patients from other neighborhoods - tens against thousands. For example, impossible to say that patients from “ILHAS OCEÂNICAS DE TRINDADE” always miss the appointment, despite that all 100% of patients in this “always missing” group - We have only 2 observations at all from this neighborhood against a thousand observations from other neighborhoods. And impossible to say that patients from “PARQUE INDUSTRIAL” always arrive at the appointment, despite that all 100% of patients in this “always arriving” group - We have only 1 observations at all from this neighborhood against a thousand observations from other neighborhoods. __________________________________________________________________________
Right now, we have two datasets: Appointments and Patients (patients unique information). Let’s find the best model for each of them and will decide about the final decision. I suppose, the better parameters to compare the models are Sensitivity and Accuracy. Sensitivity gives information about how accurate the model in recognition of Positive (“NoShow” is 1 or “Yes”) cases, regardless to Negative (“NoShow” is 0 or “Now”). Assume that our purpose is to prevent patients to lose appointments. We need to recognize the patient who pron to make an appointment and finally don’t arrive and take some measurements to prevent it. In case we recognize accurate patient as problematic one nothing bad will happen (for example, this patient will receive additional reminding or call from clinic to ensure his or her arrival), but in case we miss problematic patient-doctor will lose the time and will serve fewer patients.
On another hand, 100% of positive prediction (without any model) can give perfect sensitivity, in order to prevent this situation, we should take to account accuracy. Accuracy will provide information about general prediction error.
Due to more than 60% of appointments in the dataset are the first appointments, we have no information about the previous behavior of huge part of patients (do the patient tend to arrive at ordered doctor’s appointment or prone to miss that appointment.). Let’s find whether possible to build model, based on Patient information only, whether provided information about patients is enough to estimate probability of an appointment “No Show”. For this purpose, I’ll retrieve unique patient information from provided appointments dataset, and calculate the probability of “No Show” according to a number of ordered appointments and missed appointments. According to “Repeated Appointments” chart, about 40% of appointments are repeated, it means, that data of a number of patients is repeated few times in original dataset.
Given dataset contains information collected during about half year. It means that if the patient made more than one appointment, his age may be different in different appointments and “unique” statement will return same patients with different ages few times. We have to recognizes rows with the same patienID and different ages and then calculates average age for each patient.
let’s remove redundant variables, and bring the rest variables to range between 0 and 1.
load(file="noShow.Patients_ml.RData")
#Remove redundant variables
names(noShow.Patients_ml)
## [1] "PatientId" "Gender" "Neighbourhood"
## [4] "Scholarship" "Hipertension" "Diabetes"
## [7] "Alcoholism" "Handcap" "numberOfAppointments"
## [10] "totalNoShow" "Age" "No.show"
## [13] "No.show.prob"
nn<-c("PatientId", "numberOfAppointments", "totalNoShow", "No.show")
noShow.Patients_ml<-noShow.Patients_ml[,-match(nn, colnames(noShow.Patients_ml))]
#Normalization
#Neighbourhood
noShow.Patients_ml$Neighbourhood<- as.numeric(noShow.Patients_ml$Neighbourhood)
noShow.Patients_ml$Neighbourhood<-sapply(noShow.Patients_ml[,"Neighbourhood"],
function(x) return(x/100))
#Handcap
noShow.Patients_ml$Handcap<-sapply(noShow.Patients_ml[,"Handcap"],
function(x) return(x/10))
#Age
minAge<-min(noShow.Patients_ml$Age)
maxAge<-max(noShow.Patients_ml$Age)
noShow.Patients_ml$Age <- sapply(noShow.Patients_ml$Age,
function(x) return((x-minAge)/(maxAge-minAge)))
save(noShow.Patients_ml, file = "noShow.Patients_ml.RData")
In order to determine important variables, let’s find information gain of all of them, and remove from dataset variables with 0 information gain.
load(file = "noShow.Patients_ml.RData")
weights <- information.gain(No.show.prob~., noShow.Patients_ml)
weights<-data.frame(VarName=rownames(weights), weight=weights[,1])
weights<-weights[order(weights$weight, decreasing = TRUE),]
print(weights)
## VarName weight
## 2 Neighbourhood 0.1130106295
## 8 Age 0.0050085437
## 4 Hipertension 0.0012669274
## 3 Scholarship 0.0006056872
## 1 Gender 0.0000000000
## 5 Diabetes 0.0000000000
## 6 Alcoholism 0.0000000000
## 7 Handcap 0.0000000000
#Keep important variables only
cols<-c("No.show.prob","Neighbourhood","Age","Hipertension","Scholarship")
noShow.Patients_ml<-noShow.Patients_ml[,cols]
noShow.Patients_norm<-noShow.Patients_ml
save(noShow.Patients_norm, file = "noShow.Patients_norm.RData")
Let’s create train and test datasets
target_feature <- "No.show.prob"
set.seed(300)
train_indx <- sample(1:nrow(noShow.Patients_ml),floor((2/3)*nrow(noShow.Patients_ml)))
train_set <- noShow.Patients_ml[train_indx,]
test_set <- noShow.Patients_ml[-train_indx,]
Linear regression model
#LM
fmla <- as.formula("No.show.prob ~.")
linear_reg_model <- lm(fmla,train_set)
#summary(linear_reg_model)
lm_predict <- predict.lm(linear_reg_model,test_set)
rmse_summary<-data.frame(model="LM", RMSE= rmse(test_set$No.show.prob,lm_predict))
plot(lm_predict,test_set$No.show.prob, main="Lineal model results", xlab="prediction",ylab="testSet data" )
Decision (regression) Tree
require(rpart)
fmla <- as.formula("No.show.prob ~.")
#train the model with maximal depth tree
rctrl <- rpart.control(minbucket=1,minsplit=2,cp=0 )
rt_model <- rpart(fmla,train_set,control = rctrl)
#printcp(rt_model)
#find CP and optimal number of splits according to minimal xerror and train a new model
cptable<-as.data.frame(rt_model$cptable)
#plot(x=cptable$nsplit, y=cptable$xerror)
CP<-cptable[cptable$xerror==min(cptable$xerror),][1,"CP"]
nsplit<-cptable[cptable$xerror==min(cptable$xerror),][1,"nsplit"]
rctrl <- rpart.control(minbucket=1,minsplit=2,cp=CP)
rt_model <- rpart(fmla,train_set,control = rctrl)
#printcp(rt_model)
#plot(rt_model)
#text(rt_model)
fancyRpartPlot(rt_model)
dt_prediction <- predict(rt_model,test_set)
rmse_summary<-rbind(rmse_summary, data.frame(model="DT", RMSE= rmse(test_set$No.show.prob,dt_prediction)))
plot(dt_prediction,test_set$No.show.prob, main="Decision Tree model results", xlab="prediction",ylab="testSet data" )
Artficial neral network model
if (!require(neuralnet)) {install.packages("neuralnet")}
library(neuralnet)
cols <- names(noShow.Patients_ml)
fmla <- as.formula(paste("No.show.prob ~", paste(cols[!cols %in% "No.show.prob"], collapse = " + ")))
#ann_model <- neuralnet(fmla,data=train_set,hidden=c(2),linear.output=T, rep =10, startweights=c(5,100,1000,17))
#saveRDS(ann_model, "ann_noShowPatients_model.rds")
ann_model <- readRDS("ann_noShowPatients_model.rds")
plot(ann_model)
Make prediction and calculate RMSE:
ann_prediction <- compute(ann_model,test_set[,2:5])
rmse_summary<-rbind(rmse_summary, data.frame(model="ANN", RMSE= rmse(test_set$No.show.prob,ann_prediction$net.result)))
plot(test_set$No.show.prob,ann_prediction$net.result, main="ANN model results", xlab="prediction",ylab="testSet data" )
Let’s compare RMSE of all three models:
print(rmse_summary)
## model RMSE
## 1 LM 0.3495745
## 2 DT 0.3488662
## 3 ANN 0.3486925
Conclusion: According to plots of predictions all three models, and RMSE of all three models a conclusion is following: Impossible to create a model based on Patient’s information only, which can predict whether patient tends to arrive at ordered doctor’s appointment or prone to miss that appointment.
Check the information gain, and remove useless variables.
rm(list = ls())
load('noShow.modified_ml_full.RData')
weights <- information.gain(No.show~., noShow.modified_ml_full)
weights<-data.frame(VarName=rownames(weights), weight=weights[,1])
weights<-weights[order(weights$weight, decreasing = TRUE),]
print(weights)
## VarName weight
## 7 totalNoShow 0.2317456420
## 10 totalNoShow_disc 0.2312707038
## 1 daysBeforeAppointment 0.0705537330
## 2 AppointmentWeekDay 0.0264218406
## 4 numberOfAppointment 0.0262612688
## 9 numberOfAppointment_disc 0.0214867338
## 19 SMS_received 0.0203398715
## 5 numOfNoShowBefor 0.0060602165
## 13 Neighbourhood 0.0054988200
## 12 Age 0.0038769177
## 6 numberOfAppointments 0.0019966004
## 15 Hipertension 0.0006678974
## 8 numberOfAppointments_disc 0.0004930191
## 14 Scholarship 0.0004079734
## 3 PatientId 0.0000000000
## 11 Gender 0.0000000000
## 16 Diabetes 0.0000000000
## 17 Alcoholism 0.0000000000
## 18 Handcap 0.0000000000
** “PatientId”, “Gender”, “Diabetes” and “Handcap”, “Alcoholism”** bring nothing and will be omitted. All discretized variables bring less information gain than original variables - Discretized variables also will be omitted. “numberOfAppointments” and “totalNoShow” also should be omitted, due to they provide general information about of whole period of time in dataset, in other words- these two variables provide future information in case we research “early” observations (appointments). Despite variable “Neighborhoods” brings some information gain, it has 81 levels without any kind of sense except alphabetical arrangement - obviously, this variable will bring more complexity to the models, and better will be omitted.
#Keep values with information gain more than 0:
noShow.modified_ml<-noShow.modified_ml_full[,c("No.show",as.vector(subset(weights,weights$weight>0)[,"VarName"]))]
#Remove all discretized variables:
noShow.modified_ml<-noShow.modified_ml[,-grep("?disc", colnames(noShow.modified_ml))]
nn<-c("numberOfAppointments", "totalNoShow","Neighbourhood")
noShow.modified_ml<-noShow.modified_ml[,-match(nn, colnames(noShow.modified_ml))]
save(noShow.modified_ml, file='noShow.modified_ml.RData')
names(noShow.modified_ml)
## [1] "No.show" "daysBeforeAppointment" "AppointmentWeekDay"
## [4] "numberOfAppointment" "SMS_received" "numOfNoShowBefor"
## [7] "Age" "Hipertension" "Scholarship"
Let’s bring the variables to the range between 0 and 1.
#AppointmentWeekDay
noShow.modified_ml$AppointmentWeekDay<-sapply(noShow.modified_ml[,"AppointmentWeekDay"],
function(x) return(x/10))
#Age
minAge<-min(noShow.modified_ml$Age)
maxAge<-max(noShow.modified_ml$Age)
noShow.modified_ml$Age <- sapply(noShow.modified_ml$Age,
function(x) return((x-minAge)/(maxAge-minAge)))
#daysBeforeAppointment
minDBA<-min(noShow.modified_ml$daysBeforeAppointment)
maxDBA<-max(noShow.modified_ml$daysBeforeAppointment)
noShow.modified_ml$daysBeforeAppointment <- sapply(noShow.modified_ml$daysBeforeAppointment,
function(x) return((x-minDBA)/(maxDBA-minDBA)))
#numberOfAppointment
minNOA<-min(noShow.modified_ml$numberOfAppointment)
maxNOA<-max(noShow.modified_ml$numberOfAppointment)
noShow.modified_ml$numberOfAppointment <- sapply(noShow.modified_ml$numberOfAppointment,
function(x) return((x-minNOA)/(maxNOA-minNOA)))
#numOfNoShowBefor
minNSB<-min(noShow.modified_ml$numOfNoShowBefor)
maxNSB<-max(noShow.modified_ml$numOfNoShowBefor)
noShow.modified_ml$numOfNoShowBefor <- sapply(noShow.modified_ml$numOfNoShowBefor,
function(x) return((x-minNSB)/(maxNSB-minNSB)))
noShow.modified_norm<-noShow.modified_ml
save(noShow.modified_norm, file="noShow.modified_norm.RData")
str(noShow.modified_ml)
## 'data.frame': 110526 obs. of 9 variables:
## $ No.show : num 0 0 0 0 0 0 1 1 0 0 ...
## $ daysBeforeAppointment: num 0 0 0 0 0 ...
## $ AppointmentWeekDay : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ numberOfAppointment : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SMS_received : int 0 0 0 0 0 0 0 0 0 0 ...
## $ numOfNoShowBefor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Age : num 0.5391 0.487 0.5391 0.0696 0.487 ...
## $ Hipertension : int 1 0 0 0 1 1 0 0 0 0 ...
## $ Scholarship : int 0 0 0 0 0 0 0 0 0 0 ...
Formula, train and test sets:
rm(list = ls())
load(file="noShow.modified_norm.RData")
#Set "Noshow" as factor
noShow.modified_norm$No.show<-factor(noShow.modified_norm[,"No.show"])
# Formula
target_feature <- "No.show"
fmla <- as.formula(paste(target_feature,"~.",sep=""))
# Train & Test
train_indx <- sample(1:nrow(noShow.modified_norm),floor((2/3)*nrow(noShow.modified_norm)))
train_set <- noShow.modified_norm[train_indx,]
test_set <- noShow.modified_norm[-train_indx,]
save(test_set, file ="test_set.RData" )
#Dataframe for models comparison
SensitivityAndAccuracy<-data.frame()
In order to compare the models, let’s define kind of baseline: “No.show” appointments are 20%. I’ll take random 20% of appointments and define them as “1”-“No.show”, and after, I’ll calculate sensitivity and accuracy of this random choices.
##Random. No.MOdel
#finf percent of "1" in test_set:
if (!require(ROCR) ) {install.packages("ROCR")}
library(ROCR)
n<-nrow(test_set)
percentOf1<-nrow(test_set[test_set$No.show==1,])/n
#Create normal distribution of random numbers between 0 and 1 (the random numbers simulate
#probability of "1"), and find X value which represents percent of "1" in test set
set.seed(n)
normDistribution<-rnorm(n, mean = 0, sd=1)
minND<-min(normDistribution)
maxND<-max(normDistribution)
normDistribution<- sapply(normDistribution, function(x) (x-minND)/(maxND-minND))
X<-qnorm(1-percentOf1, mean = mean(normDistribution), sd = sd(normDistribution))
#plot( density(normDistribution))
# Apply "1" to numbers with simulated probability equal or more than X value, "0" for the rest
NoModel<-sapply(normDistribution, function(x) if (x>=X) 1 else 0 )
confusion_matrix <- table(test_set[,target_feature],NoModel)
row.names(confusion_matrix)<-c("actual 0", "actual 1")
colnames(confusion_matrix)<-c("predicted 0", "predicted 1")
sensitivity.random<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.random<-sum(diag(confusion_matrix))/sum(confusion_matrix)
SensitivityAndAccuracy<-rbind(SensitivityAndAccuracy, data.frame(model="No Model",sensitivity=sensitivity.random, accuracy=accuracy.random))
noModel_predict <-prediction(normDistribution, test_set$No.show)
noModel_ROC<-performance(noModel_predict, "tpr", "fpr")
save(noModel_ROC, file = "noModel_ROC.RData")
confusion_matrix
## NoModel
## predicted 0 predicted 1
## actual 0 23673 5829
## actual 1 5885 1455
Desicion Tree model:
rctrl <- rpart.control(minbucket=1,minsplit=2,cp=0)
dt_model <- rpart(fmla,train_set,control = rctrl)
#find CP and optimal number of splits according to minimal xerror and train a new model
cptable<-as.data.frame(dt_model$cptable)
#plot(x=cptable$nsplit, y=cptable$xerror)
CP<-cptable[cptable$xerror==min(cptable$xerror),][1,"CP"]
nsplit<-cptable[cptable$xerror==min(cptable$xerror),][1,"nsplit"]
print(nsplit)
## [1] 0
SensitivityAndAccuracy<-rbind(SensitivityAndAccuracy, data.frame(model="DT",sensitivity=0, accuracy=0))
Conclusion: The minimal xerror value is on 0 splits- it is meaning that Decision Tree model not suitable for this case.
Conditional Inference Tree:
##CTREE
if (!require(party)) {install.packages("party")}
library(party)
ct = ctree(fmla, data = train_set)
plot(ct, main="Conditional Inference Tree")
CTREE prediction:
ct_predict<-treeresponse(ct, newdata = test_set)
#retrieve probabilities of "1" from predictions
points<-seq(2,length(unlist(ct_predict)), by=2)
ct_predict<-unlist(ct_predict)
prob_pred<-vector()
prob_pred<-ct_predict[points]
#Find probability which gives optimal combination of sensitivity and accuracy
confMatData<-data.frame()
for(i in seq(0,1, by=0.02))
{
model_pred<- sapply(prob_pred, function(x) if(x>=i){1}else{0})
model_pred<-as.vector(model_pred)
if (max(model_pred)>0&min(model_pred)<1)
{
confusion_matrix <- table(test_set[,target_feature],model_pred)
sensitivity.ct<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.ct<-sum(diag(confusion_matrix))/sum(confusion_matrix)
confMatData<-rbind(confMatData, data.frame(prob=i, sensitivity=sensitivity.ct, accuracy=accuracy.ct))
}
}
plot(x=confMatData$prob, y=confMatData$sensitivity, type="l", col="blue", main="Conditional Inference Tree",
xlab="Probability", ylab="Sensitivity & Accuracy")
lines(x=confMatData$prob,y=confMatData$accuracy, type = "l", col="green")
legend("topright",legend= c("Sensitivity", "Accuracy"), bty="n", fill=c("blue","green"))
Let’s say that optimal probability is 0.26, and convert predictions with probability 0.26 and more to “1” and predictions with probability less the 0.26 to “0”
# Confusion matrix
i<-0.26
ct_prediction<- sapply(prob_pred, function(x) if(x>=i){1}else{0})
ct_prediction<-as.vector(ct_prediction)
confusion_matrix <- table(test_set[,target_feature],ct_prediction)
row.names(confusion_matrix)<-c("actual 0", "actual 1")
colnames(confusion_matrix)<-c("predicted 0", "predicted 1")
sensitivity.ct<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.ct<-sum(diag(confusion_matrix))/sum(confusion_matrix)
SensitivityAndAccuracy<-rbind(SensitivityAndAccuracy, data.frame(model="CTREE",sensitivity=sensitivity.ct, accuracy=accuracy.ct ))
ct_predict <-prediction(prob_pred, test_set$No.show)
ct_ROC<-performance(ct_predict, "tpr", "fpr")
save(ct_ROC, file = "ct_ROC.RData")
confusion_matrix
## ct_prediction
## predicted 0 predicted 1
## actual 0 18164 11338
## actual 1 2080 5260
** Naive Bayes:**
load(file = "noShow.modified_ml_full.RData")
categorical_features<-c("No.show","numberOfAppointment_disc","Gender","Neighbourhood","Scholarship",
"Hipertension","Diabetes","Alcoholism", "Handcap","SMS_received","numOfNoShowBefor",
"AppointmentWeekDay", "daysBeforeAppointment")
NB_dataset<-noShow.modified_ml_full[,categorical_features]
NB_dataset$No.show<-factor(NB_dataset[,"No.show"])
# Formula
fmla <- as.formula(paste(target_feature,"~.",sep=""))
# Train & Test
train_indx <- sample(1:nrow(NB_dataset),floor((2/3)*nrow(NB_dataset)))
NB.train_set <- NB_dataset[train_indx,]
NB.test_set <- NB_dataset[-train_indx,]
NB_model <- naiveBayes(x=NB.train_set[,-1],y=NB.train_set$No.show)
nb_predict <- predict(NB_model,NB.test_set[,-1],type="raw")
#retrieve probubilities of "1"
points<-seq(2,length(unlist(nb_predict)), by=2)
nb_predict<-unlist(nb_predict)
prob_pred<-vector()
prob_pred<-nb_predict[points]
confMatData<-data.frame()
for(i in seq(0,1, by=0.05))
{
model_pred<- sapply(prob_pred, function(x) if(x>=i){1}else{0})
model_pred<-as.vector(model_pred)
if (max(model_pred)>0&min(model_pred)<1)
{
confusion_matrix <- table(NB.test_set[,target_feature],model_pred)
sensitivity.ct<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.ct<-sum(diag(confusion_matrix))/sum(confusion_matrix)
confMatData<-rbind(confMatData, data.frame(prob=i, sensitivity=sensitivity.ct, accuracy=accuracy.ct))
}
}
plot(x=confMatData$prob, y=confMatData$sensitivity, type="l", col="blue", main="naiveBayes model",
xlab="Probability", ylab="Sensitivity & Accuracy")
lines(x=confMatData$prob,y=confMatData$accuracy, type = "l", col="green")
legend("topright",legend= c("Sensitivity", "Accuracy"), bty="n", fill=c("blue","green"))
Let’s say that optimal probability is 0.5, and convert predictions with probability 0.5 and more to “1” and predictions with probability less the 0.5 to “0”
# Confusion matrix
i<-0.5
nb_prediction<- sapply(prob_pred, function(x) if(x>=i){1}else{0})
nb_prediction<-as.vector(nb_prediction)
confusion_matrix <- table(test_set[,target_feature],nb_prediction)
row.names(confusion_matrix)<-c("actual 0", "actual 1")
colnames(confusion_matrix)<-c("predicted 0", "predicted 1")
sensitivity.nb<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.nb<-sum(diag(confusion_matrix))/sum(confusion_matrix)
SensitivityAndAccuracy<-rbind(SensitivityAndAccuracy, data.frame(model="naiveBayes",sensitivity=sensitivity.nb, accuracy=accuracy.nb ))
NB_predict <-prediction(prob_pred, NB.test_set$No.show)
NB_ROC<-performance(NB_predict, "tpr", "fpr")
save(NB_ROC, file = "NB_ROC.RData")
confusion_matrix
## nb_prediction
## predicted 0 predicted 1
## actual 0 14920 14582
## actual 1 3501 3839
KNN:
library(caret)
trctrl<-trainControl(method='cv', number=3, verboseIter=T)
knn_accuracy <- train(fmla, data=train_set, method='knn', tuneLength = 15,
metric='Accuracy', maximize=T, trControl=trctrl)
save(knn_accuracy, file="knn_accuracy.rds")
Find optimal k using “tolerance” function:
load(file="knn_accuracy.rds")
best_K<-tolerance(knn_accuracy$results, metric = "Accuracy", maximize = F)
K<-knn_accuracy$results[best_K,1]
knn_accuracy$results[best_K,1:2]
## k Accuracy
## 1 5 0.7683486
let’s train the model with recomended k=5, and find optimal
knn_K <- knn(train = train_set[,-1],test = test_set[,-1], k=K ,cl = train_set[,target_feature],prob = T)
# Retrieve probabilities:
prob_pred<-attr(knn_K,"prob")
confMatData<-data.frame()
for(i in seq(0,1, by=0.05))
{
model_pred<- sapply(prob_pred, function(x) if(x>=i){1}else{0})
model_pred<-as.vector(model_pred)
if (max(model_pred)>0&min(model_pred)<1)
{
confusion_matrix <- table(test_set[,target_feature],model_pred)
sensitivity.knn<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.knn<-sum(diag(confusion_matrix))/sum(confusion_matrix)
confMatData<-rbind(confMatData, data.frame(prob=i, sensitivity=sensitivity.knn, accuracy=accuracy.knn))
}
}
plot(x=confMatData$prob, y=confMatData$sensitivity, type="l", col="blue", main="KNN",
xlab="Probability", ylab="Sensitivity & Accuracy")
lines(x=confMatData$prob,y=confMatData$accuracy, type = "l", col="green")
legend("topright",legend= c("Sensitivity", "Accuracy"), bty="n", fill=c("blue","green"))
Let’s say that optimal probability is 0.83, and convert predictions with probability 0.8 and more to “1” and predictions with probability less the 0.83 to “0”. Actually, it is obvious that knn is not an acceptable model in our case.
# Confusion matrix
i<-0.83
knn_prediction<- sapply(prob_pred, function(x) if(x>=i){1}else{0})
knn_prediction<-as.vector(knn_prediction)
confusion_matrix <- table(test_set[,target_feature],knn_prediction)
row.names(confusion_matrix)<-c("actual 0", "actual 1")
colnames(confusion_matrix)<-c("predicted 0", "predicted 1")
sensitivity.knn<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.knn<-sum(diag(confusion_matrix))/sum(confusion_matrix)
SensitivityAndAccuracy<-rbind(SensitivityAndAccuracy, data.frame(model="KNN",sensitivity=sensitivity.knn, accuracy=accuracy.knn ))
KNN_predict <-prediction(prob_pred, test_set$No.show)
KNN_ROC<-performance(KNN_predict, "tpr", "fpr")
save(KNN_ROC, file = "KNN_ROC.RData")
confusion_matrix
## knn_prediction
## predicted 0 predicted 1
## actual 0 12852 16650
## actual 1 5082 2258
** Logistic regression:**
#Train the model
LR_model <- glm(fmla,family=binomial(link='logit'),data=train_set)
prob_pred.glm <- predict.glm(LR_model,test_set,type="response")
save(prob_pred.glm, file = "prob_pred.glm.RData")
#Find probability which gives optimal combination of sensitivity and accuracy
confMatData<-data.frame()
for(i in seq(0,1, by=0.05))
{
glm_pred<- sapply(prob_pred.glm, function(x) if(x>=i){1}else{0})
glm_pred<-as.vector(glm_pred)
if (max(glm_pred)>0&min(glm_pred)<1)
{
confusion_matrix <- table(test_set[,target_feature],glm_pred)
row.names(confusion_matrix)<-c("actual 0", "actual 1")
colnames(confusion_matrix)<-c("predicted 0", "predicted 1")
sensitivity.glm<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.glm<-sum(diag(confusion_matrix))/sum(confusion_matrix)
confMatData<-rbind(confMatData, data.frame(prob=i, sensitivity=sensitivity.glm, accuracy=accuracy.glm))
}
}
plot(x=confMatData$prob, y=confMatData$sensitivity, type="l", col="blue", main="Logistic regression model",
xlab="Probability", ylab="Sensitivity & Accuracy")
lines(x=confMatData$prob,y=confMatData$accuracy, type = "l", col="green")
legend("topright",legend= c("Sensitivity", "Accuracy"), bty="n", fill=c("blue","green"))
Let’s say that optimal probability is 0.2, and convert predictions with probability 0.2 and more to “1” and predictions with probability less the 0.2 to “0”
#load(file = "prob_pred.glm.RData")
i<-0.2
glm_pred<- sapply(prob_pred.glm, function(x) if(x>=i){1}else{0})
glm_pred<-as.vector(glm_pred)
confusion_matrix <- table(test_set[,target_feature],glm_pred)
row.names(confusion_matrix)<-c("actual 0", "actual 1")
colnames(confusion_matrix)<-c("predicted 0", "predicted 1")
sensitivity.glm<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.glm<-sum(diag(confusion_matrix))/sum(confusion_matrix)
SensitivityAndAccuracy<-rbind(SensitivityAndAccuracy, data.frame(model="GLM",sensitivity=sensitivity.glm, accuracy=accuracy.glm))
glm_predict <-prediction(prob_pred.glm, test_set$No.show)
glm_ROC<-performance(glm_predict, "tpr", "fpr")
save(glm_ROC, file="glm_ROC.RData")
confusion_matrix
## glm_pred
## predicted 0 predicted 1
## actual 0 19896 9606
## actual 1 3060 4280
Mixture Discriminant Analysis
if (!require("mda")) install.packages("mda")
library(mda)
mda_model<-mda(fmla,data=train_set)
mda_model<- predict(mda_model, test_set ,type="posterior")
# Retrieve probabilities:
prob_pred<-vector()
prob_pred<-unlist(mda_model)[,2]
#Find probability which gives optimal combination of sensitivity and accuracy
confMatData<-data.frame()
for(i in seq(0,1, by=0.05))
{
model_pred<- sapply(prob_pred, function(x) if(x>=i){1}else{0})
model_pred<-as.vector(model_pred)
if (max(model_pred)>0&min(model_pred)<1)
{
confusion_matrix <- table(test_set[,target_feature],model_pred)
sensitivity.mda<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.mda<-sum(diag(confusion_matrix))/sum(confusion_matrix)
confMatData<-rbind(confMatData, data.frame(prob=i, sensitivity=sensitivity.mda, accuracy=accuracy.mda))
}
}
plot(x=confMatData$prob, y=confMatData$sensitivity, type="l", col="blue", main="Mixture Discriminant Analysis",
xlab="Probability", ylab="Sensitivity & Accuracy")
lines(x=confMatData$prob,y=confMatData$accuracy, type = "l", col="green")
legend("topright",legend= c("Sensitivity", "Accuracy"), bty="n", fill=c("blue","green"))
Let’s say that optimal probability is 0.2, and convert predictions with probability 0.2 and more to “1” and predictions with probability less the 0.2 to “0”
i<-0.2
mda_pred<- sapply(prob_pred, function(x) if(x>=i){1}else{0})
mda_pred<-as.vector(mda_pred)
confusion_matrix <- table(test_set[,target_feature],mda_pred)
row.names(confusion_matrix)<-c("actual 0", "actual 1")
colnames(confusion_matrix)<-c("predicted 0", "predicted 1")
sensitivity.mda<-confusion_matrix[2,2]/(confusion_matrix[2,2]+confusion_matrix[2,1])
accuracy.mda<-sum(diag(confusion_matrix))/sum(confusion_matrix)
SensitivityAndAccuracy<-rbind(SensitivityAndAccuracy, data.frame(model="MDA",sensitivity=sensitivity.mda, accuracy=accuracy.mda))
mda_predict <-prediction(prob_pred, test_set$No.show)
mda_ROC<-performance(mda_predict, "tpr", "fpr")
save(mda_ROC, file="mda_ROC.RData")
confusion_matrix
## mda_pred
## predicted 0 predicted 1
## actual 0 18938 10564
## actual 1 2838 4502
Sensitivity And Accuracy:
print(SensitivityAndAccuracy)
## model sensitivity accuracy
## 1 No Model 0.1982289 0.6820477
## 2 DT 0.0000000 0.0000000
## 3 CTREE 0.7166213 0.6357961
## 4 naiveBayes 0.5230245 0.5091743
## 5 KNN 0.3076294 0.4101297
## 6 GLM 0.5831063 0.6562076
## 7 MDA 0.6133515 0.6362304
It looks like that better result provides Logistic Regression model, Mixture Discriminant Analysis (MDA), and Conditional Inference Tree (CIT), where CIT is performing better: about 70% of correctly predicted “No.show” appointments, and about 35% of general wrong predictions.
For final desision let’s compare the ROC of the models. ROC:
Conclusion: according to better sensitivity, Accuracy and ROC curve, the preferable model is CIT (Conditional Inference Tree).