Project

Introduction

No-show appointments (also commonly referred missed appointments) are a burden to essentially all healthcare systems, significantly impacting income, cost and use of resources. It is a well-known fact that no-show decreases the provider’s productivity and efficiency, increases healthcare costs, and limits the health clinic’s effective capacity. Negative effects are also felt by patients who keep their appointments, such as dissatisfaction with high waiting time and perception of overall decrease in service quality. In addition to creating financial costs for providers, non-attendance generates social costs related with unused staff time, ineffective use of equipment and possible misuse of patient’s time This report investigates the Kaggle no-show appointments dataset, which collects information from over 100k medical appointments in Brazil. There are a number of patient characteristics included.

Aim

The aim of this project was to: The aim of this study is to investigate the factors influencing patient no-show rates in medical appointments.

Research Questions
Do the intervals between scheduled and appointment days affect the no-show rate?
Are there days when patients rarely visit the hospital to meet up?
Which patients with particular diseases are less likely to come for an appointment?
Does the age group have an impact on the no-show rate?
Do patients with more than one disease increase the no-show rate?
Objectives

The objectives of the project were defined as follows:

Examine whether longer gaps between scheduled and appointment days increase no-show rates.
Identify days with the highest and lowest patient attendance rates.
Compare no-show rates among patients with different numbers of chronic conditions.
Determine if certain age groups are more likely to miss appointments.
Assess whether patients with more health conditions have higher or lower no-show rates.

Methodology

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

library(ggmosaic)
library(readxl)
library(tinytex)
library(xfun)

## 
## Attaching package: 'xfun'
## 
## The following object is masked from 'package:stringr':
## 
##     str_wrap
## 
## The following object is masked from 'package:base':
## 
##     attr

library(kableExtra)

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

setwd("C:\\Users\\USER\\OneDrive\\Desktop\\oluwaseun")
p.data <- read.csv("no_show_data.csv")

To know the dimension of the dataset

dim(p.data)

## [1] 110527     14

summary of all the columns

summary(p.data)

##    PatientId         AppointmentID        Gender          ScheduledDay      
##  Min.   :3.922e+04   Min.   :5030230   Length:110527      Length:110527     
##  1st Qu.:4.173e+12   1st Qu.:5640286   Class :character   Class :character  
##  Median :3.173e+13   Median :5680573   Mode  :character   Mode  :character  
##  Mean   :1.475e+14   Mean   :5675305                                        
##  3rd Qu.:9.439e+13   3rd Qu.:5725524                                        
##  Max.   :1.000e+15   Max.   :5790484                                        
##  AppointmentDay          Age         Neighbourhood       Scholarship     
##  Length:110527      Min.   : -1.00   Length:110527      Min.   :0.00000  
##  Class :character   1st Qu.: 18.00   Class :character   1st Qu.:0.00000  
##  Mode  :character   Median : 37.00   Mode  :character   Median :0.00000  
##                     Mean   : 37.09                      Mean   :0.09827  
##                     3rd Qu.: 55.00                      3rd Qu.:0.00000  
##                     Max.   :115.00                      Max.   :1.00000  
##   Hipertension       Diabetes         Alcoholism        Handcap       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1972   Mean   :0.07186   Mean   :0.0304   Mean   :0.02225  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
##   SMS_received     No.show         
##  Min.   :0.000   Length:110527     
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.321                     
##  3rd Qu.:1.000                     
##  Max.   :1.000

Plot of the thirteen columns

ggplot(p.data,aes(x=PatientId))+geom_histogram(fill=I("blue"),bins =40 )+
labs(title = "Distribution of patient identity",x="Patient identity",y="Frequency")

ggplot(p.data,aes(x=AppointmentID))+geom_histogram(fill=I("blue"),bins = 40)+
labs(title = "Distribution of appointment identity",x="Appointment identity",y="Frequency")

ggplot(p.data,aes(x=Gender))+geom_bar(fill=I("blue"),width = 0.1)+
labs(title = "Distribution of gender",x="Gender",y="Frequency")

ggplot(p.data,aes(x= AppointmentDay))+geom_bar(fill=I("blue"))+
labs(title = "Distribution of appointmentday",x="AppointmentDay",y="Frequency")

ggplot(p.data,aes(x=Age))+geom_histogram(fill=I("blue"),bins = 100)+
labs(title = "Distribution of age",x="Age",y="Frequency")

ggplot(p.data,aes(x=Hipertension))+geom_bar(fill=("blue"),width = 0.1)+
labs(title = "Distribution of Age",x="Age",y="Frequency")

ggplot(p.data,aes(x=Diabetes))+geom_bar(fill=("blue"),width = 0.1)

ggplot(p.data,aes(x=Alcoholism))+geom_bar(fill=("blue"),width = 0.1)

ggplot(p.data,aes(x=Handcap))+geom_bar(fill=("blue"))

ggplot(p.data,aes(x=Neighbourhood))+geom_bar(fill=("blue"))

ggplot(p.data,aes(x=Scholarship))+geom_bar(fill=("blue"),width = 0.1)

ggplot(p.data,aes(x=SMS_received))+geom_bar(fill=("blue"),width = 0.1)

ggplot(p.data,aes(x=No.show))+geom_bar(fill=("blue"),width = 0.2)

ggplot(p.data,aes(x=No.show))+geom_bar(color="blue",shape =3 )+
labs(title = "Distribution of  No-show",x="Patient identity",y="Frequency")

## Warning in geom_bar(color = "blue", shape = 3): Ignoring unknown parameters:
## `shape`

ggplot(p.data,aes(x=AppointmentID))+geom_histogram(fill=I("blue"),bins = 40)+
labs(title = "Distribution of appointment identity",x="Appointment identity",y="Frequency")

ggplot(p.data,aes(x=Gender))+geom_bar(fill=I("blue"),width = 0.1)+
labs(title = "Distribution of gender",x="Gender",y="Frequency")

ggplot(p.data,aes(x= AppointmentDay))+geom_bar(fill=I("blue"))+
labs(title = "Distribution of appointmentday",x="AppointmentDay",y="Frequency")

ggplot(p.data,aes(x=Age))+geom_histogram(fill=I("blue"),bins = 40)+
labs(title = "Distribution of age",x="Age",y="Frequency")

p.data <- p.data %>%rename("Handicap"=Handcap,
"Hypertension"=Hipertension,"SMSreceived"=SMS_received )

#unique(p.data$PatientId)
length(unique(p.data$PatientId))

## [1] 62299

#unique(p.data$AppointmentID)
length(unique(p.data$AppointmentID))

## [1] 110527

Age <- as.data.frame(p.data$Age)
Age_count <- table(Age)
Age_count

## p.data$Age
##   -1    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##    1 3539 2273 1618 1513 1299 1489 1521 1427 1424 1372 1274 1195 1092 1103 1118 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 1211 1402 1509 1487 1545 1437 1452 1376 1349 1242 1332 1283 1377 1448 1403 1521 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46 
## 1439 1505 1524 1526 1378 1580 1533 1629 1536 1402 1346 1272 1344 1487 1453 1460 
##   47   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62 
## 1394 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312 
##   63   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78 
## 1374 1331 1101 1187  973 1012  832  724  695  615  725  602  544  571  527  541 
##   79   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94 
##  390  511  434  392  280  311  275  260  184  126  173  109   66   86   53   33 
##   95   96   97   98   99  100  102  115 
##   24   17   11    6    1    4    2    5

Age_count <- class(Age_count)
Age_count <- class(Age_count)
Age_count

## [1] "character"

Age_count <- as.data.frame(Age_count)
class(Age_count)

## [1] "data.frame"

which(p.data$Age==-1)

## [1] 99833

p.data <- p.data[-99833,]
summary(p.data)

##    PatientId         AppointmentID        Gender          ScheduledDay      
##  Min.   :3.922e+04   Min.   :5030230   Length:110526      Length:110526     
##  1st Qu.:4.173e+12   1st Qu.:5640285   Class :character   Class :character  
##  Median :3.173e+13   Median :5680572   Mode  :character   Mode  :character  
##  Mean   :1.475e+14   Mean   :5675304                                        
##  3rd Qu.:9.439e+13   3rd Qu.:5725523                                        
##  Max.   :1.000e+15   Max.   :5790484                                        
##  AppointmentDay          Age         Neighbourhood       Scholarship     
##  Length:110526      Min.   :  0.00   Length:110526      Min.   :0.00000  
##  Class :character   1st Qu.: 18.00   Class :character   1st Qu.:0.00000  
##  Mode  :character   Median : 37.00   Mode  :character   Median :0.00000  
##                     Mean   : 37.09                      Mean   :0.09827  
##                     3rd Qu.: 55.00                      3rd Qu.:0.00000  
##                     Max.   :115.00                      Max.   :1.00000  
##   Hypertension       Diabetes         Alcoholism        Handicap      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1972   Mean   :0.07187   Mean   :0.0304   Mean   :0.02225  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
##   SMSreceived      No.show         
##  Min.   :0.000   Length:110526     
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.321                     
##  3rd Qu.:1.000                     
##  Max.   :1.000

p.data <- p.data %>%
  mutate(
    AppointmentDay = as.Date(AppointmentDay),
    ScheduledDay = as.Date(ScheduledDay)
  )

summary(p.data)

##    PatientId         AppointmentID        Gender           ScheduledDay       
##  Min.   :3.922e+04   Min.   :5030230   Length:110526      Min.   :2015-11-10  
##  1st Qu.:4.173e+12   1st Qu.:5640285   Class :character   1st Qu.:2016-04-29  
##  Median :3.173e+13   Median :5680572   Mode  :character   Median :2016-05-10  
##  Mean   :1.475e+14   Mean   :5675304                      Mean   :2016-05-08  
##  3rd Qu.:9.439e+13   3rd Qu.:5725523                      3rd Qu.:2016-05-20  
##  Max.   :1.000e+15   Max.   :5790484                      Max.   :2016-06-08  
##  AppointmentDay            Age         Neighbourhood       Scholarship     
##  Min.   :2016-04-29   Min.   :  0.00   Length:110526      Min.   :0.00000  
##  1st Qu.:2016-05-09   1st Qu.: 18.00   Class :character   1st Qu.:0.00000  
##  Median :2016-05-18   Median : 37.00   Mode  :character   Median :0.00000  
##  Mean   :2016-05-19   Mean   : 37.09                      Mean   :0.09827  
##  3rd Qu.:2016-05-31   3rd Qu.: 55.00                      3rd Qu.:0.00000  
##  Max.   :2016-06-08   Max.   :115.00                      Max.   :1.00000  
##   Hypertension       Diabetes         Alcoholism        Handicap      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1972   Mean   :0.07187   Mean   :0.0304   Mean   :0.02225  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
##   SMSreceived      No.show         
##  Min.   :0.000   Length:110526     
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.321                     
##  3rd Qu.:1.000                     
##  Max.   :1.000

p.data <- p.data %>%
  mutate(
    AppointmentDay = as.Date(AppointmentDay),
    ScheduledDay = as.Date(ScheduledDay)
  )

summary(p.data)

##    PatientId         AppointmentID        Gender           ScheduledDay       
##  Min.   :3.922e+04   Min.   :5030230   Length:110526      Min.   :2015-11-10  
##  1st Qu.:4.173e+12   1st Qu.:5640285   Class :character   1st Qu.:2016-04-29  
##  Median :3.173e+13   Median :5680572   Mode  :character   Median :2016-05-10  
##  Mean   :1.475e+14   Mean   :5675304                      Mean   :2016-05-08  
##  3rd Qu.:9.439e+13   3rd Qu.:5725523                      3rd Qu.:2016-05-20  
##  Max.   :1.000e+15   Max.   :5790484                      Max.   :2016-06-08  
##  AppointmentDay            Age         Neighbourhood       Scholarship     
##  Min.   :2016-04-29   Min.   :  0.00   Length:110526      Min.   :0.00000  
##  1st Qu.:2016-05-09   1st Qu.: 18.00   Class :character   1st Qu.:0.00000  
##  Median :2016-05-18   Median : 37.00   Mode  :character   Median :0.00000  
##  Mean   :2016-05-19   Mean   : 37.09                      Mean   :0.09827  
##  3rd Qu.:2016-05-31   3rd Qu.: 55.00                      3rd Qu.:0.00000  
##  Max.   :2016-06-08   Max.   :115.00                      Max.   :1.00000  
##   Hypertension       Diabetes         Alcoholism        Handicap      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1972   Mean   :0.07187   Mean   :0.0304   Mean   :0.02225  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
##   SMSreceived      No.show         
##  Min.   :0.000   Length:110526     
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.321                     
##  3rd Qu.:1.000                     
##  Max.   :1.000

Days.interval<-p.data$AppointmentDay- p.data$ScheduledDay 
summary(Days.interval)

##   Length    Class     Mode 
##   110526 difftime  numeric

Days.interval[which(Days.interval<0)]

## Time differences in days
## [1] -1 -1 -1 -6 -1

p.data <- data.frame(p.data,Days.interval)

calculate_percentages <- function(data, interval) {
  
  filtered_data <- data[data$Days.interval == interval, ]
  
  yes_count <- sum(filtered_data$No.show == "Yes")
  no_count <- sum(filtered_data$No.show == "No")
  
 
  yes_percentage <- (yes_count / (yes_count + no_count)) * 100
  no_percentage <- (no_count / (yes_count + no_count)) * 100
  
 
  return(list(
    Yes = yes_percentage,
    No = no_percentage,
    Total = yes_count + no_count
  ))
}

intervals <- seq(from = 0, to = 179, by = 1)
results <- list()
for (i in intervals) {
  results[[as.character(i)]] <- calculate_percentages(p.data, i)
}



intervals <- seq(from = 0, to = 179, by = 1)
results <- list()
for (i in intervals) {
  results[[as.character(i)]] <- calculate_percentages(p.data, i)
}

df <- do.call(rbind, lapply(names(results), function(i) data.frame(
  Interval = as.numeric(i),
  Yes = results[[i]]$Yes,
  No = results[[i]]$No,
  Total = results[[i]]$Total
)))

df <- df[order(df$Interval), ]


kable(df,"html", digits = 2, caption = "Percentage of Yes and No Responses by Interval")%>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE)

Percentage of Yes and No Responses by Interval
Interval	Yes	No	Total
0	4.65	95.35	38562
1	21.35	78.65	5213
2	23.82	76.18	6725
3	23.53	76.47	2737
4	23.27	76.73	5290
5	26.61	73.39	3277
6	24.80	75.20	4037
7	26.68	73.32	4906
8	28.73	71.27	2332
9	27.41	72.59	1605
10	31.63	68.37	1391
11	31.61	68.39	987
12	31.66	68.34	1115
13	31.87	68.13	1682
14	31.34	68.66	2913
15	33.40	66.60	1503
16	30.50	69.50	1151
17	31.62	68.38	1107
18	30.56	69.44	1021
19	34.77	65.23	1044
20	34.37	65.63	1187
21	30.90	69.10	1861
22	34.44	65.56	1173
23	33.58	66.42	822
24	37.78	62.22	622
25	40.19	59.81	637
26	35.98	64.02	731
27	31.59	68.41	1013
28	29.48	70.52	1706
29	32.69	67.31	1089
30	26.56	73.44	704
31	32.74	67.26	672
32	35.15	64.85	586
33	33.43	66.57	664
34	33.42	66.58	808
35	37.28	62.72	963
36	31.03	68.97	551
37	35.95	64.05	331
38	34.17	65.83	199
39	32.35	67.65	170
40	40.55	59.45	254
41	35.38	64.62	342
42	36.93	63.07	417
43	29.92	70.08	254
44	29.94	70.06	157
45	36.50	63.50	137
46	36.36	63.64	121
47	30.28	69.72	109
48	29.06	70.94	117
49	34.72	65.28	216
50	26.92	73.08	182
51	36.26	63.74	91
52	25.56	74.44	90
53	34.38	65.62	96
54	36.99	63.01	73
55	44.95	55.05	109
56	25.12	74.88	203
57	37.76	62.24	98
58	32.73	67.27	110
59	36.67	63.33	90
60	42.47	57.53	73
61	22.96	77.04	135
62	20.73	79.27	193
63	26.50	73.50	200
64	37.33	62.67	75
65	29.90	70.10	97
66	29.63	70.37	162
67	30.16	69.84	63
68	32.86	67.14	70
69	32.38	67.62	105
70	31.25	68.75	112
71	22.00	78.00	50
72	30.19	69.81	53
73	35.14	64.86	37
74	33.33	66.67	39
75	30.56	69.44	72
76	30.77	69.23	65
77	33.96	66.04	53
78	39.39	60.61	33
79	39.13	60.87	23
80	44.44	55.56	18
81	40.00	60.00	10
82	0.00	100.00	1
83	12.50	87.50	8
84	22.22	77.78	54
85	21.62	78.38	37
86	16.67	83.33	6
87	31.25	68.75	16
88	31.25	68.75	16
89	41.18	58.82	17
90	24.14	75.86	58
91	21.43	78.57	56
92	0.00	100.00	3
93	0.00	100.00	2
94	0.00	100.00	2
95	20.00	80.00	5
96	25.00	75.00	4
97	0.00	100.00	2
98	20.00	80.00	5
99	NaN	NaN	0
100	NaN	NaN	0
101	0.00	100.00	1
102	25.00	75.00	4
103	60.00	40.00	5
104	75.00	25.00	8
105	0.00	100.00	4
106	NaN	NaN	0
107	0.00	100.00	2
108	0.00	100.00	5
109	0.00	100.00	5
110	50.00	50.00	2
111	20.00	80.00	5
112	0.00	100.00	5
113	NaN	NaN	0
114	NaN	NaN	0
115	0.00	100.00	2
116	NaN	NaN	0
117	0.00	100.00	1
118	NaN	NaN	0
119	0.00	100.00	4
120	NaN	NaN	0
121	NaN	NaN	0
122	0.00	100.00	3
123	0.00	100.00	1
124	NaN	NaN	0
125	0.00	100.00	1
126	100.00	0.00	1
127	0.00	100.00	1
128	NaN	NaN	0
129	NaN	NaN	0
130	NaN	NaN	0
131	NaN	NaN	0
132	100.00	0.00	1
133	36.36	63.64	11
134	NaN	NaN	0
135	NaN	NaN	0
136	NaN	NaN	0
137	NaN	NaN	0
138	NaN	NaN	0
139	100.00	0.00	1
140	NaN	NaN	0
141	NaN	NaN	0
142	37.50	62.50	8
143	NaN	NaN	0
144	NaN	NaN	0
145	NaN	NaN	0
146	100.00	0.00	1
147	NaN	NaN	0
148	NaN	NaN	0
149	NaN	NaN	0
150	NaN	NaN	0
151	100.00	0.00	1
152	NaN	NaN	0
153	NaN	NaN	0
154	NaN	NaN	0
155	60.00	40.00	10
156	NaN	NaN	0
157	NaN	NaN	0
158	NaN	NaN	0
159	NaN	NaN	0
160	NaN	NaN	0
161	NaN	NaN	0
162	18.18	81.82	11
163	NaN	NaN	0
164	NaN	NaN	0
165	NaN	NaN	0
166	NaN	NaN	0
167	NaN	NaN	0
168	NaN	NaN	0
169	12.50	87.50	8
170	NaN	NaN	0
171	NaN	NaN	0
172	NaN	NaN	0
173	NaN	NaN	0
174	NaN	NaN	0
175	NaN	NaN	0
176	37.50	62.50	16
177	NaN	NaN	0
178	NaN	NaN	0
179	20.00	80.00	10

ggplot(p.data,aes(x=factor(Days.interval),fill=No.show ))+
geom_bar(position = "dodge")+labs(x="Intervals between scheduled and appointment Days",
y="Noshow rate")+theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  scale_fill_manual(values = c("No" = "green", "Yes" = "red"))

Note: - Intervals 0-5 days: Higher no-show rates (23.53% - 26.61%) - Intervals 6-15 days: Relatively stable no-show rates (23.82% - 26.68%) - Intervals 16-30 days: Slightly lower no-show rates (21.35% - 24.80%) - Intervals 31-60 days: Higher no-show rates (29.48% - 37.78%) - Intervals beyond 60 days: Variable no-show rates (20.00% - 100.00%) - No-show rates tend to increase as intervals lengthen. - Shorter intervals (0-5 days) have higher no-show rates. - Mid-range intervals (6-30 days) have relatively stable no-show rates. - Longer intervals (31-60 days) have higher no-show rates. - Optimize scheduling to minimize intervals. - Implement reminders for appointments with longer intervals.

p.n<- p.data$AppointmentDay

p.nn <-(weekdays(p.data$AppointmentDay))

p.nnl <- factor(p.nn, 
                levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

data <- data.frame(
  Day = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"),
  Yes = c(4689, 5150, 5092, 3337, 4037, 9),
  No = c(18023, 20488, 20774, 13909, 14981, 30),
  Total = c(22712, 25638, 25866, 17246, 19018, 39)
)




data$Percent_Yes <- round(data$Yes / data$Total * 100, 1)
data$Percent_No <- round(data$No / data$Total * 100, 1)



kable(data, "html", booktabs = TRUE, caption = "No-show statistics by day") %>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE)

No-show statistics by day
Day	Yes	No	Total	Percent_Yes	Percent_No
Monday	4689	18023	22712	20.6	79.4
Tuesday	5150	20488	25638	20.1	79.9
Wednesday	5092	20774	25866	19.7	80.3
Thursday	3337	13909	17246	19.3	80.7
Friday	4037	14981	19018	21.2	78.8
Saturday	9	30	39	23.1	76.9

D.F4 <- data.frame(Days =c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"),
Percentage=c(79.4,79.9,80.3,80.7,78.8,76.9))
D.F4

##        Days Percentage
## 1    Monday       79.4
## 2   Tuesday       79.9
## 3 Wednesday       80.3
## 4  Thursday       80.7
## 5    Friday       78.8
## 6  Saturday       76.9

ggplot(p.data, aes(x = p.nnl, fill = No.show)) + 
  geom_bar(position = "dodge") +
  labs(x = "Days of the week", y = "Frequency  of no-show", 
  title  = "Days of the week of  no-Show ")+
  theme_classic()+
  scale_fill_manual(values = c("No" = "purple", "Yes" = "blue"))

ggplot(D.F4, aes(x = Days, y = Percentage, group = 1)) +
  geom_line(color = "orange", size = 1.2) +
  geom_point(color = "red", size = 3.9) +
  labs(
    title = "Rate of participant by days",
    x = "Days",
    y = "Percentage"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  scale_x_discrete(limits = D.F4$Days)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

p.data$AppointmentDay.Weekday <- p.data %>% mutate(AppointmentDay.Weekday=p.nnl)

Note: - Saturday (Total: 39, Percent Yes: 23.1%, Percent No: 76.9%) This is likely due to the hospital’s reduced operating hours or limited services on Saturdays. - Fridays have the highest no-show rate (21.2%). - Thursdays have the lowest no-show rate (19.3%) among weekdays. - Wednesday has the highest attendance rate (80.3%). - Adjust scheduling and staffing on Saturdays. - Investigate reasons for high no-show rates on Fridays. - Optimize appointment scheduling on Thursdays. - which patient with particular diseases less come for appointment To to call out diabetes column, check does that have and does that does’nt have diseases

A.D <- p.data$Diabetes

YES.D <- which(p.data$Diabetes==1)
length(YES.D)

## [1] 7943

NO.D <- which(A.D==0)

length(NO.D)

## [1] 102583

A.H <- p.data$Hypertension
length(A.H)

## [1] 110526

YES.H <-which(A.H==1) 

length(YES.H)

## [1] 21801

A.L <- (p.data$Alcoholism)
length(A.L)

## [1] 110526

NO.H <- which(A.H==0)
length(NO.H)

## [1] 88725

A.D <- p.data$Diabetes

YES.D <- which(p.data$Diabetes==1)

length(YES.D)

## [1] 7943

NO.D <- which(A.D==0)

length(NO.D)

## [1] 102583

YES.H <-which(A.H==1) 

length(YES.H)

## [1] 21801

A.L <- (p.data$Alcoholism)
length(A.L)

## [1] 110526

NO.H <- which(A.H==0)
length(NO.H)

## [1] 88725

NO.D <- sum(p.data$No.show=="No"&  p.data$Diabetes==1)

NO.D

## [1] 6513

YES.D <- sum(p.data$No.show=="Yes"& p.data$Diabetes== 1)
YES.D

## [1] 1430

NO.H <- sum(p.data$No.show=="No"  & p.data$Hypertension== 1)
NO.H

## [1] 18029

YES.H <- sum(p.data$No.show=="Yes" & p.data$Hypertension== 1)
YES.H

## [1] 3772

A.ll <- sum(p.data$No.show=="No" & p.data$Alcoholism == 1)
A.ll

## [1] 2683

B.ll <-sum(p.data$No.show=="Yes" & p.data$Alcoholism == 1)  
B.ll

## [1] 677

H.N <-sum(p.data$No.show=="No" &p.data$Handicap == 1)
H.N

## [1] 1676

H.Y <-sum(p.data$No.show=="Yes" &p.data$Handicap == 1)
H.Y

## [1] 366

data3 <- data.frame(
  Disease = c("Diabetes", "Hypertension", "Alcoholism", "Handicap"),
  Yes = c(1430, 3772, 677, 401),
  No = c(6513, 18028, 2683, 1822),
  Total = c(7943, 21800, 3360, 2223)
)


data3$Percent_Yes <- round(data3$Yes / data3$Total * 100, 1)
data3$Percent_No <- round(data3$No / data3$Total * 100, 1)


kable(data3, "html", booktabs = TRUE, caption = "Participant statistics by diseases") %>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE)

Participant statistics by diseases
Disease	Yes	No	Total	Percent_Yes	Percent_No
Diabetes	1430	6513	7943	18.0	82.0
Hypertension	3772	18028	21800	17.3	82.7
Alcoholism	677	2683	3360	20.1	79.9
Handicap	401	1822	2223	18.0	82.0

D.AH <- data.frame(
  Diseases = rep(c("Diabetes", "Hypertension", "Alcoholism","Handicap"), each = 2),
  Status = rep(c("No", "Yes"), times = 4),
  Count = c(NO.D, YES.D, NO.H, YES.H, A.ll, B.ll,H.N,H.Y))
D.AH

##       Diseases Status Count
## 1     Diabetes     No  6513
## 2     Diabetes    Yes  1430
## 3 Hypertension     No 18029
## 4 Hypertension    Yes  3772
## 5   Alcoholism     No  2683
## 6   Alcoholism    Yes   677
## 7     Handicap     No  1676
## 8     Handicap    Yes   366

D.F <- data.frame(Diseases=c("Diabetes", "Hypertension", "Alcoholism","Handicap"),
count=c(NO.D,NO.H,A.ll,H.N))
D.F

##       Diseases count
## 1     Diabetes  6513
## 2 Hypertension 18029
## 3   Alcoholism  2683
## 4     Handicap  1676

ggplot(D.AH, aes(x = Diseases, y = Count, fill = Status)) +
geom_col(position = "dodge") +
labs(
    title = "Attendance rates by disease",
    x = "Disease",
    y = "Count",
    fill = "No-show"
  ) +
  scale_fill_manual(values = c("No" = "gold", "Yes" = "brown")) +
  theme_classic()+
  scale_x_discrete(limits = D.F$Diseases)

ggplot(D.F, aes(x = Diseases, y = count, group = 1)) +
  geom_line(color = "blue", size = 1.2) +
  geom_point(color = "red", size = 4) +
  labs(
    title = "Comparison Disease",
    x = "Disease",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  scale_x_discrete(limits = D.F$Diseases)

Note: - Patients with Hypertension have the lowest no-show rate (17.3%), indicating they are more likely to attend appointments. - Hypertension (17.3%) - Diabetes (18.0%) - Handicap (18.0%) - Alcoholism (20.1%)

Patients with Alcoholism have the highest no-show rate (20.1%), suggesting they may require additional support or targeted interventions to improve appointment adherence. - Monitor and support patients with alcoholism to improve appointment adherence. - Investigate factors contributing to high no-show rates among patients with alcoholism. - Develop disease-specific strategies to enhance appointment reminders and patient engagement.

Does the age group have impact on the NOshow rate

min(p.data$Age)

## [1] 0

p.data$Age.Range <- cut(p.data$Age, 
breaks = c(0,5,11,17,29,49,64, Inf), 
labels = c("0-5","6-11", "12-17", "18-29", "30-49", "50-64","65-above"),include.lowest = TRUE)
length(which(p.data$Age.Range=="0-5"))

## [1] 11731

summary(p.data$Age.Range)

##      0-5     6-11    12-17    18-29    30-49    50-64 65-above 
##    11731     8213     7435    16731    29380    22634    14402

a <- sum(p.data$Age.Range=="0-5"&p.data$No.show=="No")
a

## [1] 9546

b <- sum(p.data$Age.Range=="0-5"&p.data$No.show=="Yes")
b

## [1] 2185

a+b

## [1] 11731

c <- sum(p.data$Age.Range=="6-11"&p.data$No.show=="No")
c

## [1] 6363

d<- sum(p.data$Age.Range=="6-11"&p.data$No.show=="Yes")
d

## [1] 1850

c+d

## [1] 8213

e<- sum(p.data$Age.Range=="12-17"&p.data$No.show=="No")
e

## [1] 5473

f<- sum(p.data$Age.Range=="12-17"&p.data$No.show=="Yes")
f

## [1] 1962

e+f

## [1] 7435

g<- sum(p.data$Age.Range=="18-29"&p.data$No.show=="No")
g

## [1] 12607

h<- sum(p.data$Age.Range=="18-29"&p.data$No.show=="Yes")
h

## [1] 4124

g+h

## [1] 16731

i<- sum(p.data$Age.Range=="30-49"&p.data$No.show=="No")
i

## [1] 23200

j <- sum(p.data$Age.Range=="30-49"&p.data$No.show=="Yes")
j

## [1] 6180

i+j

## [1] 29380

k <- sum(p.data$Age.Range=="50-64"&p.data$No.show=="No")
k

## [1] 18849

l <- sum(p.data$Age.Range=="50-64"&p.data$No.show=="Yes")
l

## [1] 3785

k+l

## [1] 22634

m <- sum(p.data$Age.Range=="65-above"&p.data$No.show=="No")
m

## [1] 12169

n <- sum(p.data$Age.Range=="65-above"&p.data$No.show=="Yes")
n

## [1] 2233

m+n

## [1] 14402

data1 <- data.frame(
  Age.Range = c("0-5", "6-11", "12-17", "18-29", "30-49", "50-64", "65-above"),
  Yes = c(2185, 1849, 1962, 4122, 6179, 3785, 2232),
  No = c(9546, 6363, 5473, 12607, 23198, 18849, 12169),
  Total = c(11731, 8212, 7435, 16729, 29377, 22634, 14401)
)


D.F1 <- data.frame(Age.Range=c("0-5","6-11","12-17","18-29","30-49","50-64","65-above"),
Percentage=c(18.6,22.5,26.4,24.6,21.0,16.7,15.5))
D.F1

##   Age.Range Percentage
## 1       0-5       18.6
## 2      6-11       22.5
## 3     12-17       26.4
## 4     18-29       24.6
## 5     30-49       21.0
## 6     50-64       16.7
## 7  65-above       15.5

data1$Percent_Yes <- round(data1$Yes / data1$Total * 100, 1)
data1$Percent_No <- round(data1$No / data1$Total * 100, 1)


kable(data1, "html", booktabs = TRUE, caption = "No-Show Statistics by Age range") %>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE)

No-Show Statistics by Age range
Age.Range	Yes	No	Total	Percent_Yes	Percent_No
0-5	2185	9546	11731	18.6	81.4
6-11	1849	6363	8212	22.5	77.5
12-17	1962	5473	7435	26.4	73.6
18-29	4122	12607	16729	24.6	75.4
30-49	6179	23198	29377	21.0	79.0
50-64	3785	18849	22634	16.7	83.3
65-above	2232	12169	14401	15.5	84.5

ggplot(D.F1, aes(x = Age.Range, y = Percentage, group = 1)) +
  geom_line(color = "red", size = 1.5) +
  geom_point(color = "orange", size = 4) +
  labs(title = "Comparing the age range in percentage",
       x = "Age range", 
       y = "Percentage")+
  scale_x_discrete(limits = D.F1$Age.Range)

ggplot(p.data, aes(x = Age.Range, fill = factor(No.show))) +
geom_bar(position = "dodge") +
labs(title = "Age ranges on no-show rate",
x = "Age range",y = "Frequency",fill = "No-Show") +
scale_fill_manual(values = c("No" = "green", "Yes" = "red")) +
theme_classic()

Note - he highest no-show rate is among the 12-17 age group (26.4%). - The lowest no-show rate is among the 0-5 age group (18.6%). - The 18-29 and 30-49 age groups have relatively similar no-show rates (24.6% and 21.0%, respectively). - The 6-11 age group has a slightly higher no-show rate (22.5%) compared to the 0-5 age group. - Yes, the age group has a significant impact on the no-show rate. Adolescents (12-17 years old) are more likely to be no-shows, while younger children (0-5 years old) are less likely.

Do patient with more than one diseases increase no-show rate

u <- length(p.data$Hypertension)
u

## [1] 110526

r <- length(which(p.data$Hypertension==0))
r

## [1] 88725

j <- length(which(p.data$Hypertension==1))
j

## [1] 21801

v <- length(p.data$Diabetes)
v

## [1] 110526

r+j

## [1] 110526

length(which(p.data$Alcoholism==2))

## [1] 0

length(which(p.data$Handicap==2))

## [1] 183

length(which(p.data$Handicap==0))

## [1] 108285

length(which(p.data$Handicap==1))

## [1] 2042

p.data$Handicap[p.data$Handicap==2]=1

p.data$Disease_Count <- rowSums(p.data[, c("Hypertension", 
"Diabetes","Handicap","Alcoholism")])
p.data$Disease_Count <- factor(p.data$Disease_Count, 
levels = c(0, 1, 2, 3, 4), 
labels = c("Zero disease", "One disease", "Two diseases", "Three diseases", "Four diseases"))
length(p.data$Disease_Count)

## [1] 110526

unique(p.data$Disease_Count)

## [1] One disease    Zero disease   Two diseases   Three diseases Four diseases 
## 5 Levels: Zero disease One disease Two diseases ... Four diseases

summary(p.data$Disease_Count)

##   Zero disease    One disease   Two diseases Three diseases  Four diseases 
##          84114          18111           7654            627             20

a <- sum(p.data$Disease_Count=="Zero disease"&p.data$No.show=="Yes")
a

## [1] 17603

b <- sum(p.data$Disease_Count=="Zero disease"&p.data$No.show=="No")

b

## [1] 66511

a+b

## [1] 84114

c <-sum(p.data$Disease_Count=="One disease"&p.data$No.show=="Yes")
c

## [1] 3246

d <-sum(p.data$Disease_Count=="One disease"&p.data$No.show=="No") 
d

## [1] 14865

c+d

## [1] 18111

e <-sum(p.data$Disease_Count=="Two diseases"&p.data$No.show=="Yes") 
e

## [1] 1367

f <-sum(p.data$Disease_Count=="Two diseases"&p.data$No.show=="No") 
f

## [1] 6287

e+f

## [1] 7654

g <-sum(p.data$Disease_Count=="Three diseases"&p.data$No.show=="Yes") 
g

## [1] 97

h<-sum(p.data$Disease_Count=="Three diseases"&p.data$No.show=="No")
h

## [1] 530

g+h

## [1] 627

i <- sum(p.data$Disease_Count=="Four diseases"&p.data$No.show=="Yes")
i

## [1] 6

j <-sum(p.data$Disease_Count=="Four diseases"&p.data$No.show=="No")
j

## [1] 14

i+j

## [1] 20

data2 <- data.frame(
  Disease_Count = c("Zero disease", "One disease", "Two diseases", "Three diseases", "Four diseases"),
  Yes = c(17600, 3244, 1367, 97, 6),
  No = c(66510, 14864, 6287, 530, 14),
  Total = c(84110, 18108, 7654, 627, 20)
)


data2$Percent_Yes <- round(data2$Yes / data2$Total * 100, 1)
data2$Percent_No <- round(data2$No / data2$Total * 100, 1)

kable(data2, "html", booktabs = TRUE, caption = "No-Show Statistics by number of disease") %>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE)

No-Show Statistics by number of disease
Disease_Count	Yes	No	Total	Percent_Yes	Percent_No
Zero disease	17600	66510	84110	20.9	79.1
One disease	3244	14864	18108	17.9	82.1
Two diseases	1367	6287	7654	17.9	82.1
Three diseases	97	530	627	15.5	84.5
Four diseases	6	14	20	30.0	70.0

D.F6 <- data.frame(Disease_Count=c("Zero disease","One disease","Two diseases","Three diseases","Four diseases"),
Percentage=c(20.9,17.9,17.9,15.5,30.0))
D.F6

##    Disease_Count Percentage
## 1   Zero disease       20.9
## 2    One disease       17.9
## 3   Two diseases       17.9
## 4 Three diseases       15.5
## 5  Four diseases       30.0

ggplot(p.data, aes(x = Disease_Count, fill = factor(No.show))) +
geom_bar(position = "dodge") +
labs(title = "No-show rates by disease count",
x = "Number of diseases",
y = "Frequency",fill = "No-show")+
  scale_fill_manual(values = c("No" = "green", "Yes" = "red")) +
theme_classic()

ggplot(D.F6, aes(x =  Disease_Count, y = Percentage, group = 1)) +
  geom_line(color = "red", size = 1.5) +
  geom_point(color = "orange", size = 4) +
  labs(title = "Comparing the  disease count in percentage",
       x = "Disease count", 
       y = "Percentage")+
  scale_x_discrete(limits = D.F6$ Disease_Count)

Note: - Patients with 3 diseases have the lowest no-show rate (15.5%). - Patients with 0 diseases have the highest no-show rate (20.9%) among the larger disease categories. - Patients with 4 diseases have a significantly higher no-show rate (30.0%), but the sample size is very small

Targeted interventions: Focus on patients with 0 diseases, as they have the highest no-show rate (20.9%).
Personalized communication: Implement personalized communication strategies to remind patients with 0 diseases about their appointments.
Appointment scheduling: Optimize appointment scheduling for patients with 0 diseases to reduce no-show rates.

International Centre for Applied Mathematical Modelling and Data Analysis (ICAMMDA)

Graduate Internship Training (GIT) Presentation

Table of content

Introduction

Section A

skill(s) Acquired

Project

Methodology

Insight

Recommendation

Conclusion

Data science applied lab(Onging courses)

Future direction