International Centre for Applied Mathematical Modelling and Data Analysis (ICAMMDA)

Graduate Internship Training (GIT) Presentation

Alabi Kazeem Oluwaseun

2-5-2025

Table of content

Introduction

Section A

Completed courses

skill(s) Acquired

Project

No-show appointments (also commonly referred missed appointments) are a burden to essentially all healthcare systems, significantly impacting income, cost and use of resources. It is a well-known fact that no-show decreases the provider’s productivity and efficiency, increases healthcare costs, and limits the health clinic’s effective capacity. Negative effects are also felt by patients who keep their appointments, such as dissatisfaction with high waiting time and perception of overall decrease in service quality. In addition to creating financial costs for providers, non-attendance generates social costs related with unused staff time, ineffective use of equipment and possible misuse of patient’s time This report investigates the Kaggle no-show appointments dataset, which collects information from over 100k medical appointments in Brazil. There are a number of patient characteristics included.

The aim of this project was to: The aim of this study is to investigate the factors influencing patient no-show rates in medical appointments.

The objectives of the project were defined as follows:

Methodology

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
library(ggmosaic)
library(readxl)
library(tinytex)
library(xfun)
## 
## Attaching package: 'xfun'
## 
## The following object is masked from 'package:stringr':
## 
##     str_wrap
## 
## The following object is masked from 'package:base':
## 
##     attr
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
setwd("C:\\Users\\USER\\OneDrive\\Desktop\\oluwaseun")
p.data <- read.csv("no_show_data.csv")

To know the dimension of the dataset

dim(p.data)
## [1] 110527     14

summary of all the columns

summary(p.data)
##    PatientId         AppointmentID        Gender          ScheduledDay      
##  Min.   :3.922e+04   Min.   :5030230   Length:110527      Length:110527     
##  1st Qu.:4.173e+12   1st Qu.:5640286   Class :character   Class :character  
##  Median :3.173e+13   Median :5680573   Mode  :character   Mode  :character  
##  Mean   :1.475e+14   Mean   :5675305                                        
##  3rd Qu.:9.439e+13   3rd Qu.:5725524                                        
##  Max.   :1.000e+15   Max.   :5790484                                        
##  AppointmentDay          Age         Neighbourhood       Scholarship     
##  Length:110527      Min.   : -1.00   Length:110527      Min.   :0.00000  
##  Class :character   1st Qu.: 18.00   Class :character   1st Qu.:0.00000  
##  Mode  :character   Median : 37.00   Mode  :character   Median :0.00000  
##                     Mean   : 37.09                      Mean   :0.09827  
##                     3rd Qu.: 55.00                      3rd Qu.:0.00000  
##                     Max.   :115.00                      Max.   :1.00000  
##   Hipertension       Diabetes         Alcoholism        Handcap       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1972   Mean   :0.07186   Mean   :0.0304   Mean   :0.02225  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
##   SMS_received     No.show         
##  Min.   :0.000   Length:110527     
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.321                     
##  3rd Qu.:1.000                     
##  Max.   :1.000

Plot of the thirteen columns

ggplot(p.data,aes(x=PatientId))+geom_histogram(fill=I("blue"),bins =40 )+
labs(title = "Distribution of patient identity",x="Patient identity",y="Frequency")

ggplot(p.data,aes(x=AppointmentID))+geom_histogram(fill=I("blue"),bins = 40)+
labs(title = "Distribution of appointment identity",x="Appointment identity",y="Frequency")

ggplot(p.data,aes(x=Gender))+geom_bar(fill=I("blue"),width = 0.1)+
labs(title = "Distribution of gender",x="Gender",y="Frequency")

ggplot(p.data,aes(x= AppointmentDay))+geom_bar(fill=I("blue"))+
labs(title = "Distribution of appointmentday",x="AppointmentDay",y="Frequency")

ggplot(p.data,aes(x=Age))+geom_histogram(fill=I("blue"),bins = 100)+
labs(title = "Distribution of age",x="Age",y="Frequency")

ggplot(p.data,aes(x=Hipertension))+geom_bar(fill=("blue"),width = 0.1)+
labs(title = "Distribution of Age",x="Age",y="Frequency")

ggplot(p.data,aes(x=Diabetes))+geom_bar(fill=("blue"),width = 0.1)

ggplot(p.data,aes(x=Alcoholism))+geom_bar(fill=("blue"),width = 0.1)

ggplot(p.data,aes(x=Handcap))+geom_bar(fill=("blue"))

ggplot(p.data,aes(x=Neighbourhood))+geom_bar(fill=("blue"))

ggplot(p.data,aes(x=Scholarship))+geom_bar(fill=("blue"),width = 0.1)

ggplot(p.data,aes(x=SMS_received))+geom_bar(fill=("blue"),width = 0.1)

ggplot(p.data,aes(x=No.show))+geom_bar(fill=("blue"),width = 0.2)

ggplot(p.data,aes(x=No.show))+geom_bar(color="blue",shape =3 )+
labs(title = "Distribution of  No-show",x="Patient identity",y="Frequency")
## Warning in geom_bar(color = "blue", shape = 3): Ignoring unknown parameters:
## `shape`

ggplot(p.data,aes(x=AppointmentID))+geom_histogram(fill=I("blue"),bins = 40)+
labs(title = "Distribution of appointment identity",x="Appointment identity",y="Frequency")

ggplot(p.data,aes(x=Gender))+geom_bar(fill=I("blue"),width = 0.1)+
labs(title = "Distribution of gender",x="Gender",y="Frequency")

ggplot(p.data,aes(x= AppointmentDay))+geom_bar(fill=I("blue"))+
labs(title = "Distribution of appointmentday",x="AppointmentDay",y="Frequency")

ggplot(p.data,aes(x=Age))+geom_histogram(fill=I("blue"),bins = 40)+
labs(title = "Distribution of age",x="Age",y="Frequency")

p.data <- p.data %>%rename("Handicap"=Handcap,
"Hypertension"=Hipertension,"SMSreceived"=SMS_received )
#unique(p.data$PatientId)
length(unique(p.data$PatientId))
## [1] 62299
#unique(p.data$AppointmentID)
length(unique(p.data$AppointmentID))
## [1] 110527
Age <- as.data.frame(p.data$Age)
Age_count <- table(Age)
Age_count
## p.data$Age
##   -1    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##    1 3539 2273 1618 1513 1299 1489 1521 1427 1424 1372 1274 1195 1092 1103 1118 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 1211 1402 1509 1487 1545 1437 1452 1376 1349 1242 1332 1283 1377 1448 1403 1521 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46 
## 1439 1505 1524 1526 1378 1580 1533 1629 1536 1402 1346 1272 1344 1487 1453 1460 
##   47   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62 
## 1394 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312 
##   63   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78 
## 1374 1331 1101 1187  973 1012  832  724  695  615  725  602  544  571  527  541 
##   79   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94 
##  390  511  434  392  280  311  275  260  184  126  173  109   66   86   53   33 
##   95   96   97   98   99  100  102  115 
##   24   17   11    6    1    4    2    5
Age_count <- class(Age_count)
Age_count <- class(Age_count)
Age_count
## [1] "character"
Age_count <- as.data.frame(Age_count)
class(Age_count)
## [1] "data.frame"
which(p.data$Age==-1)
## [1] 99833
p.data <- p.data[-99833,]
summary(p.data)
##    PatientId         AppointmentID        Gender          ScheduledDay      
##  Min.   :3.922e+04   Min.   :5030230   Length:110526      Length:110526     
##  1st Qu.:4.173e+12   1st Qu.:5640285   Class :character   Class :character  
##  Median :3.173e+13   Median :5680572   Mode  :character   Mode  :character  
##  Mean   :1.475e+14   Mean   :5675304                                        
##  3rd Qu.:9.439e+13   3rd Qu.:5725523                                        
##  Max.   :1.000e+15   Max.   :5790484                                        
##  AppointmentDay          Age         Neighbourhood       Scholarship     
##  Length:110526      Min.   :  0.00   Length:110526      Min.   :0.00000  
##  Class :character   1st Qu.: 18.00   Class :character   1st Qu.:0.00000  
##  Mode  :character   Median : 37.00   Mode  :character   Median :0.00000  
##                     Mean   : 37.09                      Mean   :0.09827  
##                     3rd Qu.: 55.00                      3rd Qu.:0.00000  
##                     Max.   :115.00                      Max.   :1.00000  
##   Hypertension       Diabetes         Alcoholism        Handicap      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1972   Mean   :0.07187   Mean   :0.0304   Mean   :0.02225  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
##   SMSreceived      No.show         
##  Min.   :0.000   Length:110526     
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.321                     
##  3rd Qu.:1.000                     
##  Max.   :1.000
p.data <- p.data %>%
  mutate(
    AppointmentDay = as.Date(AppointmentDay),
    ScheduledDay = as.Date(ScheduledDay)
  )

summary(p.data)
##    PatientId         AppointmentID        Gender           ScheduledDay       
##  Min.   :3.922e+04   Min.   :5030230   Length:110526      Min.   :2015-11-10  
##  1st Qu.:4.173e+12   1st Qu.:5640285   Class :character   1st Qu.:2016-04-29  
##  Median :3.173e+13   Median :5680572   Mode  :character   Median :2016-05-10  
##  Mean   :1.475e+14   Mean   :5675304                      Mean   :2016-05-08  
##  3rd Qu.:9.439e+13   3rd Qu.:5725523                      3rd Qu.:2016-05-20  
##  Max.   :1.000e+15   Max.   :5790484                      Max.   :2016-06-08  
##  AppointmentDay            Age         Neighbourhood       Scholarship     
##  Min.   :2016-04-29   Min.   :  0.00   Length:110526      Min.   :0.00000  
##  1st Qu.:2016-05-09   1st Qu.: 18.00   Class :character   1st Qu.:0.00000  
##  Median :2016-05-18   Median : 37.00   Mode  :character   Median :0.00000  
##  Mean   :2016-05-19   Mean   : 37.09                      Mean   :0.09827  
##  3rd Qu.:2016-05-31   3rd Qu.: 55.00                      3rd Qu.:0.00000  
##  Max.   :2016-06-08   Max.   :115.00                      Max.   :1.00000  
##   Hypertension       Diabetes         Alcoholism        Handicap      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1972   Mean   :0.07187   Mean   :0.0304   Mean   :0.02225  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
##   SMSreceived      No.show         
##  Min.   :0.000   Length:110526     
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.321                     
##  3rd Qu.:1.000                     
##  Max.   :1.000
p.data <- p.data %>%
  mutate(
    AppointmentDay = as.Date(AppointmentDay),
    ScheduledDay = as.Date(ScheduledDay)
  )

summary(p.data)
##    PatientId         AppointmentID        Gender           ScheduledDay       
##  Min.   :3.922e+04   Min.   :5030230   Length:110526      Min.   :2015-11-10  
##  1st Qu.:4.173e+12   1st Qu.:5640285   Class :character   1st Qu.:2016-04-29  
##  Median :3.173e+13   Median :5680572   Mode  :character   Median :2016-05-10  
##  Mean   :1.475e+14   Mean   :5675304                      Mean   :2016-05-08  
##  3rd Qu.:9.439e+13   3rd Qu.:5725523                      3rd Qu.:2016-05-20  
##  Max.   :1.000e+15   Max.   :5790484                      Max.   :2016-06-08  
##  AppointmentDay            Age         Neighbourhood       Scholarship     
##  Min.   :2016-04-29   Min.   :  0.00   Length:110526      Min.   :0.00000  
##  1st Qu.:2016-05-09   1st Qu.: 18.00   Class :character   1st Qu.:0.00000  
##  Median :2016-05-18   Median : 37.00   Mode  :character   Median :0.00000  
##  Mean   :2016-05-19   Mean   : 37.09                      Mean   :0.09827  
##  3rd Qu.:2016-05-31   3rd Qu.: 55.00                      3rd Qu.:0.00000  
##  Max.   :2016-06-08   Max.   :115.00                      Max.   :1.00000  
##   Hypertension       Diabetes         Alcoholism        Handicap      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1972   Mean   :0.07187   Mean   :0.0304   Mean   :0.02225  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :4.00000  
##   SMSreceived      No.show         
##  Min.   :0.000   Length:110526     
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.321                     
##  3rd Qu.:1.000                     
##  Max.   :1.000
Days.interval<-p.data$AppointmentDay- p.data$ScheduledDay 
summary(Days.interval)
##   Length    Class     Mode 
##   110526 difftime  numeric
Days.interval[which(Days.interval<0)]
## Time differences in days
## [1] -1 -1 -1 -6 -1
p.data <- data.frame(p.data,Days.interval)
calculate_percentages <- function(data, interval) {
  
  filtered_data <- data[data$Days.interval == interval, ]
  
  yes_count <- sum(filtered_data$No.show == "Yes")
  no_count <- sum(filtered_data$No.show == "No")
  
 
  yes_percentage <- (yes_count / (yes_count + no_count)) * 100
  no_percentage <- (no_count / (yes_count + no_count)) * 100
  
 
  return(list(
    Yes = yes_percentage,
    No = no_percentage,
    Total = yes_count + no_count
  ))
}

intervals <- seq(from = 0, to = 179, by = 1)
results <- list()
for (i in intervals) {
  results[[as.character(i)]] <- calculate_percentages(p.data, i)
}



intervals <- seq(from = 0, to = 179, by = 1)
results <- list()
for (i in intervals) {
  results[[as.character(i)]] <- calculate_percentages(p.data, i)
}

df <- do.call(rbind, lapply(names(results), function(i) data.frame(
  Interval = as.numeric(i),
  Yes = results[[i]]$Yes,
  No = results[[i]]$No,
  Total = results[[i]]$Total
)))

df <- df[order(df$Interval), ]


kable(df,"html", digits = 2, caption = "Percentage of Yes and No Responses by Interval")%>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE)
Percentage of Yes and No Responses by Interval
Interval Yes No Total
0 4.65 95.35 38562
1 21.35 78.65 5213
2 23.82 76.18 6725
3 23.53 76.47 2737
4 23.27 76.73 5290
5 26.61 73.39 3277
6 24.80 75.20 4037
7 26.68 73.32 4906
8 28.73 71.27 2332
9 27.41 72.59 1605
10 31.63 68.37 1391
11 31.61 68.39 987
12 31.66 68.34 1115
13 31.87 68.13 1682
14 31.34 68.66 2913
15 33.40 66.60 1503
16 30.50 69.50 1151
17 31.62 68.38 1107
18 30.56 69.44 1021
19 34.77 65.23 1044
20 34.37 65.63 1187
21 30.90 69.10 1861
22 34.44 65.56 1173
23 33.58 66.42 822
24 37.78 62.22 622
25 40.19 59.81 637
26 35.98 64.02 731
27 31.59 68.41 1013
28 29.48 70.52 1706
29 32.69 67.31 1089
30 26.56 73.44 704
31 32.74 67.26 672
32 35.15 64.85 586
33 33.43 66.57 664
34 33.42 66.58 808
35 37.28 62.72 963
36 31.03 68.97 551
37 35.95 64.05 331
38 34.17 65.83 199
39 32.35 67.65 170
40 40.55 59.45 254
41 35.38 64.62 342
42 36.93 63.07 417
43 29.92 70.08 254
44 29.94 70.06 157
45 36.50 63.50 137
46 36.36 63.64 121
47 30.28 69.72 109
48 29.06 70.94 117
49 34.72 65.28 216
50 26.92 73.08 182
51 36.26 63.74 91
52 25.56 74.44 90
53 34.38 65.62 96
54 36.99 63.01 73
55 44.95 55.05 109
56 25.12 74.88 203
57 37.76 62.24 98
58 32.73 67.27 110
59 36.67 63.33 90
60 42.47 57.53 73
61 22.96 77.04 135
62 20.73 79.27 193
63 26.50 73.50 200
64 37.33 62.67 75
65 29.90 70.10 97
66 29.63 70.37 162
67 30.16 69.84 63
68 32.86 67.14 70
69 32.38 67.62 105
70 31.25 68.75 112
71 22.00 78.00 50
72 30.19 69.81 53
73 35.14 64.86 37
74 33.33 66.67 39
75 30.56 69.44 72
76 30.77 69.23 65
77 33.96 66.04 53
78 39.39 60.61 33
79 39.13 60.87 23
80 44.44 55.56 18
81 40.00 60.00 10
82 0.00 100.00 1
83 12.50 87.50 8
84 22.22 77.78 54
85 21.62 78.38 37
86 16.67 83.33 6
87 31.25 68.75 16
88 31.25 68.75 16
89 41.18 58.82 17
90 24.14 75.86 58
91 21.43 78.57 56
92 0.00 100.00 3
93 0.00 100.00 2
94 0.00 100.00 2
95 20.00 80.00 5
96 25.00 75.00 4
97 0.00 100.00 2
98 20.00 80.00 5
99 NaN NaN 0
100 NaN NaN 0
101 0.00 100.00 1
102 25.00 75.00 4
103 60.00 40.00 5
104 75.00 25.00 8
105 0.00 100.00 4
106 NaN NaN 0
107 0.00 100.00 2
108 0.00 100.00 5
109 0.00 100.00 5
110 50.00 50.00 2
111 20.00 80.00 5
112 0.00 100.00 5
113 NaN NaN 0
114 NaN NaN 0
115 0.00 100.00 2
116 NaN NaN 0
117 0.00 100.00 1
118 NaN NaN 0
119 0.00 100.00 4
120 NaN NaN 0
121 NaN NaN 0
122 0.00 100.00 3
123 0.00 100.00 1
124 NaN NaN 0
125 0.00 100.00 1
126 100.00 0.00 1
127 0.00 100.00 1
128 NaN NaN 0
129 NaN NaN 0
130 NaN NaN 0
131 NaN NaN 0
132 100.00 0.00 1
133 36.36 63.64 11
134 NaN NaN 0
135 NaN NaN 0
136 NaN NaN 0
137 NaN NaN 0
138 NaN NaN 0
139 100.00 0.00 1
140 NaN NaN 0
141 NaN NaN 0
142 37.50 62.50 8
143 NaN NaN 0
144 NaN NaN 0
145 NaN NaN 0
146 100.00 0.00 1
147 NaN NaN 0
148 NaN NaN 0
149 NaN NaN 0
150 NaN NaN 0
151 100.00 0.00 1
152 NaN NaN 0
153 NaN NaN 0
154 NaN NaN 0
155 60.00 40.00 10
156 NaN NaN 0
157 NaN NaN 0
158 NaN NaN 0
159 NaN NaN 0
160 NaN NaN 0
161 NaN NaN 0
162 18.18 81.82 11
163 NaN NaN 0
164 NaN NaN 0
165 NaN NaN 0
166 NaN NaN 0
167 NaN NaN 0
168 NaN NaN 0
169 12.50 87.50 8
170 NaN NaN 0
171 NaN NaN 0
172 NaN NaN 0
173 NaN NaN 0
174 NaN NaN 0
175 NaN NaN 0
176 37.50 62.50 16
177 NaN NaN 0
178 NaN NaN 0
179 20.00 80.00 10
ggplot(p.data,aes(x=factor(Days.interval),fill=No.show ))+
geom_bar(position = "dodge")+labs(x="Intervals between scheduled and appointment Days",
y="Noshow rate")+theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  scale_fill_manual(values = c("No" = "green", "Yes" = "red"))

Note: - Intervals 0-5 days: Higher no-show rates (23.53% - 26.61%) - Intervals 6-15 days: Relatively stable no-show rates (23.82% - 26.68%) - Intervals 16-30 days: Slightly lower no-show rates (21.35% - 24.80%) - Intervals 31-60 days: Higher no-show rates (29.48% - 37.78%) - Intervals beyond 60 days: Variable no-show rates (20.00% - 100.00%) - No-show rates tend to increase as intervals lengthen. - Shorter intervals (0-5 days) have higher no-show rates. - Mid-range intervals (6-30 days) have relatively stable no-show rates. - Longer intervals (31-60 days) have higher no-show rates. - Optimize scheduling to minimize intervals. - Implement reminders for appointments with longer intervals.

p.n<- p.data$AppointmentDay 
p.nn <-(weekdays(p.data$AppointmentDay))
p.nnl <- factor(p.nn, 
                levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
data <- data.frame(
  Day = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"),
  Yes = c(4689, 5150, 5092, 3337, 4037, 9),
  No = c(18023, 20488, 20774, 13909, 14981, 30),
  Total = c(22712, 25638, 25866, 17246, 19018, 39)
)




data$Percent_Yes <- round(data$Yes / data$Total * 100, 1)
data$Percent_No <- round(data$No / data$Total * 100, 1)



kable(data, "html", booktabs = TRUE, caption = "No-show statistics by day") %>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE)
No-show statistics by day
Day Yes No Total Percent_Yes Percent_No
Monday 4689 18023 22712 20.6 79.4
Tuesday 5150 20488 25638 20.1 79.9
Wednesday 5092 20774 25866 19.7 80.3
Thursday 3337 13909 17246 19.3 80.7
Friday 4037 14981 19018 21.2 78.8
Saturday 9 30 39 23.1 76.9
D.F4 <- data.frame(Days =c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"),
Percentage=c(79.4,79.9,80.3,80.7,78.8,76.9))
D.F4
##        Days Percentage
## 1    Monday       79.4
## 2   Tuesday       79.9
## 3 Wednesday       80.3
## 4  Thursday       80.7
## 5    Friday       78.8
## 6  Saturday       76.9
ggplot(p.data, aes(x = p.nnl, fill = No.show)) + 
  geom_bar(position = "dodge") +
  labs(x = "Days of the week", y = "Frequency  of no-show", 
  title  = "Days of the week of  no-Show ")+
  theme_classic()+
  scale_fill_manual(values = c("No" = "purple", "Yes" = "blue"))

ggplot(D.F4, aes(x = Days, y = Percentage, group = 1)) +
  geom_line(color = "orange", size = 1.2) +
  geom_point(color = "red", size = 3.9) +
  labs(
    title = "Rate of participant by days",
    x = "Days",
    y = "Percentage"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  scale_x_discrete(limits = D.F4$Days)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

p.data$AppointmentDay.Weekday <- p.data %>% mutate(AppointmentDay.Weekday=p.nnl)

Note: - Saturday (Total: 39, Percent Yes: 23.1%, Percent No: 76.9%) This is likely due to the hospital’s reduced operating hours or limited services on Saturdays. - Fridays have the highest no-show rate (21.2%). - Thursdays have the lowest no-show rate (19.3%) among weekdays. - Wednesday has the highest attendance rate (80.3%). - Adjust scheduling and staffing on Saturdays. - Investigate reasons for high no-show rates on Fridays. - Optimize appointment scheduling on Thursdays. - which patient with particular diseases less come for appointment To to call out diabetes column, check does that have and does that does’nt have diseases

A.D <- p.data$Diabetes

YES.D <- which(p.data$Diabetes==1)
length(YES.D)
## [1] 7943
NO.D <- which(A.D==0)

length(NO.D)
## [1] 102583
A.H <- p.data$Hypertension
length(A.H)
## [1] 110526
YES.H <-which(A.H==1) 

length(YES.H)
## [1] 21801
A.L <- (p.data$Alcoholism)
length(A.L)
## [1] 110526
NO.H <- which(A.H==0)
length(NO.H)
## [1] 88725
A.D <- p.data$Diabetes

YES.D <- which(p.data$Diabetes==1)

length(YES.D)
## [1] 7943
NO.D <- which(A.D==0)

length(NO.D)
## [1] 102583
YES.H <-which(A.H==1) 

length(YES.H)
## [1] 21801
A.L <- (p.data$Alcoholism)
length(A.L)
## [1] 110526
NO.H <- which(A.H==0)
length(NO.H)
## [1] 88725
NO.D <- sum(p.data$No.show=="No"&  p.data$Diabetes==1)

NO.D
## [1] 6513
YES.D <- sum(p.data$No.show=="Yes"& p.data$Diabetes== 1)
YES.D
## [1] 1430
NO.H <- sum(p.data$No.show=="No"  & p.data$Hypertension== 1)
NO.H
## [1] 18029
YES.H <- sum(p.data$No.show=="Yes" & p.data$Hypertension== 1)
YES.H
## [1] 3772
A.ll <- sum(p.data$No.show=="No" & p.data$Alcoholism == 1)
A.ll
## [1] 2683
B.ll <-sum(p.data$No.show=="Yes" & p.data$Alcoholism == 1)  
B.ll
## [1] 677
H.N <-sum(p.data$No.show=="No" &p.data$Handicap == 1)
H.N 
## [1] 1676
H.Y <-sum(p.data$No.show=="Yes" &p.data$Handicap == 1)
H.Y
## [1] 366
data3 <- data.frame(
  Disease = c("Diabetes", "Hypertension", "Alcoholism", "Handicap"),
  Yes = c(1430, 3772, 677, 401),
  No = c(6513, 18028, 2683, 1822),
  Total = c(7943, 21800, 3360, 2223)
)


data3$Percent_Yes <- round(data3$Yes / data3$Total * 100, 1)
data3$Percent_No <- round(data3$No / data3$Total * 100, 1)


kable(data3, "html", booktabs = TRUE, caption = "Participant statistics by diseases") %>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE)
Participant statistics by diseases
Disease Yes No Total Percent_Yes Percent_No
Diabetes 1430 6513 7943 18.0 82.0
Hypertension 3772 18028 21800 17.3 82.7
Alcoholism 677 2683 3360 20.1 79.9
Handicap 401 1822 2223 18.0 82.0
D.AH <- data.frame(
  Diseases = rep(c("Diabetes", "Hypertension", "Alcoholism","Handicap"), each = 2),
  Status = rep(c("No", "Yes"), times = 4),
  Count = c(NO.D, YES.D, NO.H, YES.H, A.ll, B.ll,H.N,H.Y))
D.AH
##       Diseases Status Count
## 1     Diabetes     No  6513
## 2     Diabetes    Yes  1430
## 3 Hypertension     No 18029
## 4 Hypertension    Yes  3772
## 5   Alcoholism     No  2683
## 6   Alcoholism    Yes   677
## 7     Handicap     No  1676
## 8     Handicap    Yes   366
D.F <- data.frame(Diseases=c("Diabetes", "Hypertension", "Alcoholism","Handicap"),
count=c(NO.D,NO.H,A.ll,H.N))
D.F
##       Diseases count
## 1     Diabetes  6513
## 2 Hypertension 18029
## 3   Alcoholism  2683
## 4     Handicap  1676
ggplot(D.AH, aes(x = Diseases, y = Count, fill = Status)) +
geom_col(position = "dodge") +
labs(
    title = "Attendance rates by disease",
    x = "Disease",
    y = "Count",
    fill = "No-show"
  ) +
  scale_fill_manual(values = c("No" = "gold", "Yes" = "brown")) +
  theme_classic()+
  scale_x_discrete(limits = D.F$Diseases)

ggplot(D.F, aes(x = Diseases, y = count, group = 1)) +
  geom_line(color = "blue", size = 1.2) +
  geom_point(color = "red", size = 4) +
  labs(
    title = "Comparison Disease",
    x = "Disease",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  scale_x_discrete(limits = D.F$Diseases)

Note: - Patients with Hypertension have the lowest no-show rate (17.3%), indicating they are more likely to attend appointments. - Hypertension (17.3%) - Diabetes (18.0%) - Handicap (18.0%) - Alcoholism (20.1%)

Patients with Alcoholism have the highest no-show rate (20.1%), suggesting they may require additional support or targeted interventions to improve appointment adherence. - Monitor and support patients with alcoholism to improve appointment adherence. - Investigate factors contributing to high no-show rates among patients with alcoholism. - Develop disease-specific strategies to enhance appointment reminders and patient engagement.

min(p.data$Age)
## [1] 0
p.data$Age.Range <- cut(p.data$Age, 
breaks = c(0,5,11,17,29,49,64, Inf), 
labels = c("0-5","6-11", "12-17", "18-29", "30-49", "50-64","65-above"),include.lowest = TRUE)
length(which(p.data$Age.Range=="0-5"))
## [1] 11731
summary(p.data$Age.Range)
##      0-5     6-11    12-17    18-29    30-49    50-64 65-above 
##    11731     8213     7435    16731    29380    22634    14402
a <- sum(p.data$Age.Range=="0-5"&p.data$No.show=="No")
a
## [1] 9546
b <- sum(p.data$Age.Range=="0-5"&p.data$No.show=="Yes")
b
## [1] 2185
a+b
## [1] 11731
c <- sum(p.data$Age.Range=="6-11"&p.data$No.show=="No")
c
## [1] 6363
d<- sum(p.data$Age.Range=="6-11"&p.data$No.show=="Yes")
d
## [1] 1850
c+d
## [1] 8213
e<- sum(p.data$Age.Range=="12-17"&p.data$No.show=="No")
e
## [1] 5473
f<- sum(p.data$Age.Range=="12-17"&p.data$No.show=="Yes")
f
## [1] 1962
e+f
## [1] 7435
g<- sum(p.data$Age.Range=="18-29"&p.data$No.show=="No")
g
## [1] 12607
h<- sum(p.data$Age.Range=="18-29"&p.data$No.show=="Yes")
h
## [1] 4124
g+h
## [1] 16731
i<- sum(p.data$Age.Range=="30-49"&p.data$No.show=="No")
i
## [1] 23200
j <- sum(p.data$Age.Range=="30-49"&p.data$No.show=="Yes")
j
## [1] 6180
i+j
## [1] 29380
k <- sum(p.data$Age.Range=="50-64"&p.data$No.show=="No")
k
## [1] 18849
l <- sum(p.data$Age.Range=="50-64"&p.data$No.show=="Yes")
l
## [1] 3785
k+l
## [1] 22634
m <- sum(p.data$Age.Range=="65-above"&p.data$No.show=="No")
m
## [1] 12169
n <- sum(p.data$Age.Range=="65-above"&p.data$No.show=="Yes")
n
## [1] 2233
m+n
## [1] 14402
data1 <- data.frame(
  Age.Range = c("0-5", "6-11", "12-17", "18-29", "30-49", "50-64", "65-above"),
  Yes = c(2185, 1849, 1962, 4122, 6179, 3785, 2232),
  No = c(9546, 6363, 5473, 12607, 23198, 18849, 12169),
  Total = c(11731, 8212, 7435, 16729, 29377, 22634, 14401)
)


D.F1 <- data.frame(Age.Range=c("0-5","6-11","12-17","18-29","30-49","50-64","65-above"),
Percentage=c(18.6,22.5,26.4,24.6,21.0,16.7,15.5))
D.F1
##   Age.Range Percentage
## 1       0-5       18.6
## 2      6-11       22.5
## 3     12-17       26.4
## 4     18-29       24.6
## 5     30-49       21.0
## 6     50-64       16.7
## 7  65-above       15.5
data1$Percent_Yes <- round(data1$Yes / data1$Total * 100, 1)
data1$Percent_No <- round(data1$No / data1$Total * 100, 1)


kable(data1, "html", booktabs = TRUE, caption = "No-Show Statistics by Age range") %>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE)
No-Show Statistics by Age range
Age.Range Yes No Total Percent_Yes Percent_No
0-5 2185 9546 11731 18.6 81.4
6-11 1849 6363 8212 22.5 77.5
12-17 1962 5473 7435 26.4 73.6
18-29 4122 12607 16729 24.6 75.4
30-49 6179 23198 29377 21.0 79.0
50-64 3785 18849 22634 16.7 83.3
65-above 2232 12169 14401 15.5 84.5
ggplot(D.F1, aes(x = Age.Range, y = Percentage, group = 1)) +
  geom_line(color = "red", size = 1.5) +
  geom_point(color = "orange", size = 4) +
  labs(title = "Comparing the age range in percentage",
       x = "Age range", 
       y = "Percentage")+
  scale_x_discrete(limits = D.F1$Age.Range)

ggplot(p.data, aes(x = Age.Range, fill = factor(No.show))) +
geom_bar(position = "dodge") +
labs(title = "Age ranges on no-show rate",
x = "Age range",y = "Frequency",fill = "No-Show") +
scale_fill_manual(values = c("No" = "green", "Yes" = "red")) +
theme_classic()

Note - he highest no-show rate is among the 12-17 age group (26.4%). - The lowest no-show rate is among the 0-5 age group (18.6%). - The 18-29 and 30-49 age groups have relatively similar no-show rates (24.6% and 21.0%, respectively). - The 6-11 age group has a slightly higher no-show rate (22.5%) compared to the 0-5 age group. - Yes, the age group has a significant impact on the no-show rate. Adolescents (12-17 years old) are more likely to be no-shows, while younger children (0-5 years old) are less likely.

u <- length(p.data$Hypertension)
u
## [1] 110526
r <- length(which(p.data$Hypertension==0))
r
## [1] 88725
j <- length(which(p.data$Hypertension==1))
j
## [1] 21801
v <- length(p.data$Diabetes)
v
## [1] 110526
r+j
## [1] 110526
length(which(p.data$Alcoholism==2))
## [1] 0
length(which(p.data$Handicap==2))
## [1] 183
length(which(p.data$Handicap==0))
## [1] 108285
length(which(p.data$Handicap==1))
## [1] 2042
p.data$Handicap[p.data$Handicap==2]=1
p.data$Disease_Count <- rowSums(p.data[, c("Hypertension", 
"Diabetes","Handicap","Alcoholism")])
p.data$Disease_Count <- factor(p.data$Disease_Count, 
levels = c(0, 1, 2, 3, 4), 
labels = c("Zero disease", "One disease", "Two diseases", "Three diseases", "Four diseases"))
length(p.data$Disease_Count)
## [1] 110526
unique(p.data$Disease_Count)
## [1] One disease    Zero disease   Two diseases   Three diseases Four diseases 
## 5 Levels: Zero disease One disease Two diseases ... Four diseases
summary(p.data$Disease_Count)
##   Zero disease    One disease   Two diseases Three diseases  Four diseases 
##          84114          18111           7654            627             20
a <- sum(p.data$Disease_Count=="Zero disease"&p.data$No.show=="Yes")
a
## [1] 17603
b <- sum(p.data$Disease_Count=="Zero disease"&p.data$No.show=="No")

b
## [1] 66511
a+b
## [1] 84114
c <-sum(p.data$Disease_Count=="One disease"&p.data$No.show=="Yes")
c
## [1] 3246
d <-sum(p.data$Disease_Count=="One disease"&p.data$No.show=="No") 
d
## [1] 14865
c+d
## [1] 18111
e <-sum(p.data$Disease_Count=="Two diseases"&p.data$No.show=="Yes") 
e
## [1] 1367
f <-sum(p.data$Disease_Count=="Two diseases"&p.data$No.show=="No") 
f
## [1] 6287
e+f
## [1] 7654
g <-sum(p.data$Disease_Count=="Three diseases"&p.data$No.show=="Yes") 
g
## [1] 97
h<-sum(p.data$Disease_Count=="Three diseases"&p.data$No.show=="No")
h
## [1] 530
g+h
## [1] 627
i <- sum(p.data$Disease_Count=="Four diseases"&p.data$No.show=="Yes")
i
## [1] 6
j <-sum(p.data$Disease_Count=="Four diseases"&p.data$No.show=="No")
j
## [1] 14
i+j
## [1] 20
data2 <- data.frame(
  Disease_Count = c("Zero disease", "One disease", "Two diseases", "Three diseases", "Four diseases"),
  Yes = c(17600, 3244, 1367, 97, 6),
  No = c(66510, 14864, 6287, 530, 14),
  Total = c(84110, 18108, 7654, 627, 20)
)


data2$Percent_Yes <- round(data2$Yes / data2$Total * 100, 1)
data2$Percent_No <- round(data2$No / data2$Total * 100, 1)

kable(data2, "html", booktabs = TRUE, caption = "No-Show Statistics by number of disease") %>%
  kable_styling(full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE)
No-Show Statistics by number of disease
Disease_Count Yes No Total Percent_Yes Percent_No
Zero disease 17600 66510 84110 20.9 79.1
One disease 3244 14864 18108 17.9 82.1
Two diseases 1367 6287 7654 17.9 82.1
Three diseases 97 530 627 15.5 84.5
Four diseases 6 14 20 30.0 70.0
D.F6 <- data.frame(Disease_Count=c("Zero disease","One disease","Two diseases","Three diseases","Four diseases"),
Percentage=c(20.9,17.9,17.9,15.5,30.0))
D.F6
##    Disease_Count Percentage
## 1   Zero disease       20.9
## 2    One disease       17.9
## 3   Two diseases       17.9
## 4 Three diseases       15.5
## 5  Four diseases       30.0
ggplot(p.data, aes(x = Disease_Count, fill = factor(No.show))) +
geom_bar(position = "dodge") +
labs(title = "No-show rates by disease count",
x = "Number of diseases",
y = "Frequency",fill = "No-show")+
  scale_fill_manual(values = c("No" = "green", "Yes" = "red")) +
theme_classic()

ggplot(D.F6, aes(x =  Disease_Count, y = Percentage, group = 1)) +
  geom_line(color = "red", size = 1.5) +
  geom_point(color = "orange", size = 4) +
  labs(title = "Comparing the  disease count in percentage",
       x = "Disease count", 
       y = "Percentage")+
  scale_x_discrete(limits = D.F6$ Disease_Count)

Note: - Patients with 3 diseases have the lowest no-show rate (15.5%). - Patients with 0 diseases have the highest no-show rate (20.9%) among the larger disease categories. - Patients with 4 diseases have a significantly higher no-show rate (30.0%), but the sample size is very small

Insight

Recommendation

The following were recommended based on the analysis:

Conclusion

In conclusion, there are many factors influencing the no-show rate based on the analysis, but the no-show rate can be reduced if the recommendations stated above can be implemented.

Data science applied lab(Onging courses)

Future direction

The data science and mathematical modelling skills I’m developing through the learning and project directly support my mission to solve real-world problems and drive progress in my community and nation. By combining advanced analytics like machine learning and causal inference with domain expertise in public health and climate science, I aim to design scalable solutions for critical challenges.