Alabi Kazeem Oluwaseun
2-5-2025
Internship at ICAMMDA started on february 11, 2024
Report structured into three key sections:
Section A: Completed courses, skills acquired, projects done
Section B: Ongoing courses and expected outcomes
Section C: Future goals and application of
Completed courses
No-show appointments (also commonly referred missed appointments) are a burden to essentially all healthcare systems, significantly impacting income, cost and use of resources. It is a well-known fact that no-show decreases the provider’s productivity and efficiency, increases healthcare costs, and limits the health clinic’s effective capacity. Negative effects are also felt by patients who keep their appointments, such as dissatisfaction with high waiting time and perception of overall decrease in service quality. In addition to creating financial costs for providers, non-attendance generates social costs related with unused staff time, ineffective use of equipment and possible misuse of patient’s time This report investigates the Kaggle no-show appointments dataset, which collects information from over 100k medical appointments in Brazil. There are a number of patient characteristics included.
The aim of this project was to: The aim of this study is to investigate the factors influencing patient no-show rates in medical appointments.
Research Questions
Do the intervals between scheduled and appointment days affect the no-show rate?
Are there days when patients rarely visit the hospital to meet up?
Which patients with particular diseases are less likely to come for an appointment?
Does the age group have an impact on the no-show rate?
Do patients with more than one disease increase the no-show rate?
Objectives
The objectives of the project were defined as follows:
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
##
## Attaching package: 'xfun'
##
## The following object is masked from 'package:stringr':
##
## str_wrap
##
## The following object is masked from 'package:base':
##
## attr
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
To know the dimension of the dataset
## [1] 110527 14
summary of all the columns
## PatientId AppointmentID Gender ScheduledDay
## Min. :3.922e+04 Min. :5030230 Length:110527 Length:110527
## 1st Qu.:4.173e+12 1st Qu.:5640286 Class :character Class :character
## Median :3.173e+13 Median :5680573 Mode :character Mode :character
## Mean :1.475e+14 Mean :5675305
## 3rd Qu.:9.439e+13 3rd Qu.:5725524
## Max. :1.000e+15 Max. :5790484
## AppointmentDay Age Neighbourhood Scholarship
## Length:110527 Min. : -1.00 Length:110527 Min. :0.00000
## Class :character 1st Qu.: 18.00 Class :character 1st Qu.:0.00000
## Mode :character Median : 37.00 Mode :character Median :0.00000
## Mean : 37.09 Mean :0.09827
## 3rd Qu.: 55.00 3rd Qu.:0.00000
## Max. :115.00 Max. :1.00000
## Hipertension Diabetes Alcoholism Handcap
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.1972 Mean :0.07186 Mean :0.0304 Mean :0.02225
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :4.00000
## SMS_received No.show
## Min. :0.000 Length:110527
## 1st Qu.:0.000 Class :character
## Median :0.000 Mode :character
## Mean :0.321
## 3rd Qu.:1.000
## Max. :1.000
Plot of the thirteen columns
ggplot(p.data,aes(x=PatientId))+geom_histogram(fill=I("blue"),bins =40 )+
labs(title = "Distribution of patient identity",x="Patient identity",y="Frequency")ggplot(p.data,aes(x=AppointmentID))+geom_histogram(fill=I("blue"),bins = 40)+
labs(title = "Distribution of appointment identity",x="Appointment identity",y="Frequency")ggplot(p.data,aes(x=Gender))+geom_bar(fill=I("blue"),width = 0.1)+
labs(title = "Distribution of gender",x="Gender",y="Frequency")ggplot(p.data,aes(x= AppointmentDay))+geom_bar(fill=I("blue"))+
labs(title = "Distribution of appointmentday",x="AppointmentDay",y="Frequency")ggplot(p.data,aes(x=Age))+geom_histogram(fill=I("blue"),bins = 100)+
labs(title = "Distribution of age",x="Age",y="Frequency")ggplot(p.data,aes(x=Hipertension))+geom_bar(fill=("blue"),width = 0.1)+
labs(title = "Distribution of Age",x="Age",y="Frequency")ggplot(p.data,aes(x=No.show))+geom_bar(color="blue",shape =3 )+
labs(title = "Distribution of No-show",x="Patient identity",y="Frequency")## Warning in geom_bar(color = "blue", shape = 3): Ignoring unknown parameters:
## `shape`
ggplot(p.data,aes(x=AppointmentID))+geom_histogram(fill=I("blue"),bins = 40)+
labs(title = "Distribution of appointment identity",x="Appointment identity",y="Frequency")ggplot(p.data,aes(x=Gender))+geom_bar(fill=I("blue"),width = 0.1)+
labs(title = "Distribution of gender",x="Gender",y="Frequency")ggplot(p.data,aes(x= AppointmentDay))+geom_bar(fill=I("blue"))+
labs(title = "Distribution of appointmentday",x="AppointmentDay",y="Frequency")ggplot(p.data,aes(x=Age))+geom_histogram(fill=I("blue"),bins = 40)+
labs(title = "Distribution of age",x="Age",y="Frequency")p.data <- p.data %>%rename("Handicap"=Handcap,
"Hypertension"=Hipertension,"SMSreceived"=SMS_received )## [1] 62299
## [1] 110527
## p.data$Age
## -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 1 3539 2273 1618 1513 1299 1489 1521 1427 1424 1372 1274 1195 1092 1103 1118
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 1211 1402 1509 1487 1545 1437 1452 1376 1349 1242 1332 1283 1377 1448 1403 1521
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## 1439 1505 1524 1526 1378 1580 1533 1629 1536 1402 1346 1272 1344 1487 1453 1460
## 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
## 1394 1399 1652 1613 1567 1746 1651 1530 1425 1635 1603 1469 1624 1411 1343 1312
## 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
## 1374 1331 1101 1187 973 1012 832 724 695 615 725 602 544 571 527 541
## 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
## 390 511 434 392 280 311 275 260 184 126 173 109 66 86 53 33
## 95 96 97 98 99 100 102 115
## 24 17 11 6 1 4 2 5
## [1] "character"
## [1] "data.frame"
## [1] 99833
## PatientId AppointmentID Gender ScheduledDay
## Min. :3.922e+04 Min. :5030230 Length:110526 Length:110526
## 1st Qu.:4.173e+12 1st Qu.:5640285 Class :character Class :character
## Median :3.173e+13 Median :5680572 Mode :character Mode :character
## Mean :1.475e+14 Mean :5675304
## 3rd Qu.:9.439e+13 3rd Qu.:5725523
## Max. :1.000e+15 Max. :5790484
## AppointmentDay Age Neighbourhood Scholarship
## Length:110526 Min. : 0.00 Length:110526 Min. :0.00000
## Class :character 1st Qu.: 18.00 Class :character 1st Qu.:0.00000
## Mode :character Median : 37.00 Mode :character Median :0.00000
## Mean : 37.09 Mean :0.09827
## 3rd Qu.: 55.00 3rd Qu.:0.00000
## Max. :115.00 Max. :1.00000
## Hypertension Diabetes Alcoholism Handicap
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.1972 Mean :0.07187 Mean :0.0304 Mean :0.02225
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :4.00000
## SMSreceived No.show
## Min. :0.000 Length:110526
## 1st Qu.:0.000 Class :character
## Median :0.000 Mode :character
## Mean :0.321
## 3rd Qu.:1.000
## Max. :1.000
p.data <- p.data %>%
mutate(
AppointmentDay = as.Date(AppointmentDay),
ScheduledDay = as.Date(ScheduledDay)
)
summary(p.data)## PatientId AppointmentID Gender ScheduledDay
## Min. :3.922e+04 Min. :5030230 Length:110526 Min. :2015-11-10
## 1st Qu.:4.173e+12 1st Qu.:5640285 Class :character 1st Qu.:2016-04-29
## Median :3.173e+13 Median :5680572 Mode :character Median :2016-05-10
## Mean :1.475e+14 Mean :5675304 Mean :2016-05-08
## 3rd Qu.:9.439e+13 3rd Qu.:5725523 3rd Qu.:2016-05-20
## Max. :1.000e+15 Max. :5790484 Max. :2016-06-08
## AppointmentDay Age Neighbourhood Scholarship
## Min. :2016-04-29 Min. : 0.00 Length:110526 Min. :0.00000
## 1st Qu.:2016-05-09 1st Qu.: 18.00 Class :character 1st Qu.:0.00000
## Median :2016-05-18 Median : 37.00 Mode :character Median :0.00000
## Mean :2016-05-19 Mean : 37.09 Mean :0.09827
## 3rd Qu.:2016-05-31 3rd Qu.: 55.00 3rd Qu.:0.00000
## Max. :2016-06-08 Max. :115.00 Max. :1.00000
## Hypertension Diabetes Alcoholism Handicap
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.1972 Mean :0.07187 Mean :0.0304 Mean :0.02225
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :4.00000
## SMSreceived No.show
## Min. :0.000 Length:110526
## 1st Qu.:0.000 Class :character
## Median :0.000 Mode :character
## Mean :0.321
## 3rd Qu.:1.000
## Max. :1.000
p.data <- p.data %>%
mutate(
AppointmentDay = as.Date(AppointmentDay),
ScheduledDay = as.Date(ScheduledDay)
)
summary(p.data)## PatientId AppointmentID Gender ScheduledDay
## Min. :3.922e+04 Min. :5030230 Length:110526 Min. :2015-11-10
## 1st Qu.:4.173e+12 1st Qu.:5640285 Class :character 1st Qu.:2016-04-29
## Median :3.173e+13 Median :5680572 Mode :character Median :2016-05-10
## Mean :1.475e+14 Mean :5675304 Mean :2016-05-08
## 3rd Qu.:9.439e+13 3rd Qu.:5725523 3rd Qu.:2016-05-20
## Max. :1.000e+15 Max. :5790484 Max. :2016-06-08
## AppointmentDay Age Neighbourhood Scholarship
## Min. :2016-04-29 Min. : 0.00 Length:110526 Min. :0.00000
## 1st Qu.:2016-05-09 1st Qu.: 18.00 Class :character 1st Qu.:0.00000
## Median :2016-05-18 Median : 37.00 Mode :character Median :0.00000
## Mean :2016-05-19 Mean : 37.09 Mean :0.09827
## 3rd Qu.:2016-05-31 3rd Qu.: 55.00 3rd Qu.:0.00000
## Max. :2016-06-08 Max. :115.00 Max. :1.00000
## Hypertension Diabetes Alcoholism Handicap
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.1972 Mean :0.07187 Mean :0.0304 Mean :0.02225
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :4.00000
## SMSreceived No.show
## Min. :0.000 Length:110526
## 1st Qu.:0.000 Class :character
## Median :0.000 Mode :character
## Mean :0.321
## 3rd Qu.:1.000
## Max. :1.000
## Length Class Mode
## 110526 difftime numeric
## Time differences in days
## [1] -1 -1 -1 -6 -1
calculate_percentages <- function(data, interval) {
filtered_data <- data[data$Days.interval == interval, ]
yes_count <- sum(filtered_data$No.show == "Yes")
no_count <- sum(filtered_data$No.show == "No")
yes_percentage <- (yes_count / (yes_count + no_count)) * 100
no_percentage <- (no_count / (yes_count + no_count)) * 100
return(list(
Yes = yes_percentage,
No = no_percentage,
Total = yes_count + no_count
))
}
intervals <- seq(from = 0, to = 179, by = 1)
results <- list()
for (i in intervals) {
results[[as.character(i)]] <- calculate_percentages(p.data, i)
}
intervals <- seq(from = 0, to = 179, by = 1)
results <- list()
for (i in intervals) {
results[[as.character(i)]] <- calculate_percentages(p.data, i)
}
df <- do.call(rbind, lapply(names(results), function(i) data.frame(
Interval = as.numeric(i),
Yes = results[[i]]$Yes,
No = results[[i]]$No,
Total = results[[i]]$Total
)))
df <- df[order(df$Interval), ]
kable(df,"html", digits = 2, caption = "Percentage of Yes and No Responses by Interval")%>%
kable_styling(full_width = FALSE, font_size = 12) %>%
column_spec(1, bold = TRUE)| Interval | Yes | No | Total |
|---|---|---|---|
| 0 | 4.65 | 95.35 | 38562 |
| 1 | 21.35 | 78.65 | 5213 |
| 2 | 23.82 | 76.18 | 6725 |
| 3 | 23.53 | 76.47 | 2737 |
| 4 | 23.27 | 76.73 | 5290 |
| 5 | 26.61 | 73.39 | 3277 |
| 6 | 24.80 | 75.20 | 4037 |
| 7 | 26.68 | 73.32 | 4906 |
| 8 | 28.73 | 71.27 | 2332 |
| 9 | 27.41 | 72.59 | 1605 |
| 10 | 31.63 | 68.37 | 1391 |
| 11 | 31.61 | 68.39 | 987 |
| 12 | 31.66 | 68.34 | 1115 |
| 13 | 31.87 | 68.13 | 1682 |
| 14 | 31.34 | 68.66 | 2913 |
| 15 | 33.40 | 66.60 | 1503 |
| 16 | 30.50 | 69.50 | 1151 |
| 17 | 31.62 | 68.38 | 1107 |
| 18 | 30.56 | 69.44 | 1021 |
| 19 | 34.77 | 65.23 | 1044 |
| 20 | 34.37 | 65.63 | 1187 |
| 21 | 30.90 | 69.10 | 1861 |
| 22 | 34.44 | 65.56 | 1173 |
| 23 | 33.58 | 66.42 | 822 |
| 24 | 37.78 | 62.22 | 622 |
| 25 | 40.19 | 59.81 | 637 |
| 26 | 35.98 | 64.02 | 731 |
| 27 | 31.59 | 68.41 | 1013 |
| 28 | 29.48 | 70.52 | 1706 |
| 29 | 32.69 | 67.31 | 1089 |
| 30 | 26.56 | 73.44 | 704 |
| 31 | 32.74 | 67.26 | 672 |
| 32 | 35.15 | 64.85 | 586 |
| 33 | 33.43 | 66.57 | 664 |
| 34 | 33.42 | 66.58 | 808 |
| 35 | 37.28 | 62.72 | 963 |
| 36 | 31.03 | 68.97 | 551 |
| 37 | 35.95 | 64.05 | 331 |
| 38 | 34.17 | 65.83 | 199 |
| 39 | 32.35 | 67.65 | 170 |
| 40 | 40.55 | 59.45 | 254 |
| 41 | 35.38 | 64.62 | 342 |
| 42 | 36.93 | 63.07 | 417 |
| 43 | 29.92 | 70.08 | 254 |
| 44 | 29.94 | 70.06 | 157 |
| 45 | 36.50 | 63.50 | 137 |
| 46 | 36.36 | 63.64 | 121 |
| 47 | 30.28 | 69.72 | 109 |
| 48 | 29.06 | 70.94 | 117 |
| 49 | 34.72 | 65.28 | 216 |
| 50 | 26.92 | 73.08 | 182 |
| 51 | 36.26 | 63.74 | 91 |
| 52 | 25.56 | 74.44 | 90 |
| 53 | 34.38 | 65.62 | 96 |
| 54 | 36.99 | 63.01 | 73 |
| 55 | 44.95 | 55.05 | 109 |
| 56 | 25.12 | 74.88 | 203 |
| 57 | 37.76 | 62.24 | 98 |
| 58 | 32.73 | 67.27 | 110 |
| 59 | 36.67 | 63.33 | 90 |
| 60 | 42.47 | 57.53 | 73 |
| 61 | 22.96 | 77.04 | 135 |
| 62 | 20.73 | 79.27 | 193 |
| 63 | 26.50 | 73.50 | 200 |
| 64 | 37.33 | 62.67 | 75 |
| 65 | 29.90 | 70.10 | 97 |
| 66 | 29.63 | 70.37 | 162 |
| 67 | 30.16 | 69.84 | 63 |
| 68 | 32.86 | 67.14 | 70 |
| 69 | 32.38 | 67.62 | 105 |
| 70 | 31.25 | 68.75 | 112 |
| 71 | 22.00 | 78.00 | 50 |
| 72 | 30.19 | 69.81 | 53 |
| 73 | 35.14 | 64.86 | 37 |
| 74 | 33.33 | 66.67 | 39 |
| 75 | 30.56 | 69.44 | 72 |
| 76 | 30.77 | 69.23 | 65 |
| 77 | 33.96 | 66.04 | 53 |
| 78 | 39.39 | 60.61 | 33 |
| 79 | 39.13 | 60.87 | 23 |
| 80 | 44.44 | 55.56 | 18 |
| 81 | 40.00 | 60.00 | 10 |
| 82 | 0.00 | 100.00 | 1 |
| 83 | 12.50 | 87.50 | 8 |
| 84 | 22.22 | 77.78 | 54 |
| 85 | 21.62 | 78.38 | 37 |
| 86 | 16.67 | 83.33 | 6 |
| 87 | 31.25 | 68.75 | 16 |
| 88 | 31.25 | 68.75 | 16 |
| 89 | 41.18 | 58.82 | 17 |
| 90 | 24.14 | 75.86 | 58 |
| 91 | 21.43 | 78.57 | 56 |
| 92 | 0.00 | 100.00 | 3 |
| 93 | 0.00 | 100.00 | 2 |
| 94 | 0.00 | 100.00 | 2 |
| 95 | 20.00 | 80.00 | 5 |
| 96 | 25.00 | 75.00 | 4 |
| 97 | 0.00 | 100.00 | 2 |
| 98 | 20.00 | 80.00 | 5 |
| 99 | NaN | NaN | 0 |
| 100 | NaN | NaN | 0 |
| 101 | 0.00 | 100.00 | 1 |
| 102 | 25.00 | 75.00 | 4 |
| 103 | 60.00 | 40.00 | 5 |
| 104 | 75.00 | 25.00 | 8 |
| 105 | 0.00 | 100.00 | 4 |
| 106 | NaN | NaN | 0 |
| 107 | 0.00 | 100.00 | 2 |
| 108 | 0.00 | 100.00 | 5 |
| 109 | 0.00 | 100.00 | 5 |
| 110 | 50.00 | 50.00 | 2 |
| 111 | 20.00 | 80.00 | 5 |
| 112 | 0.00 | 100.00 | 5 |
| 113 | NaN | NaN | 0 |
| 114 | NaN | NaN | 0 |
| 115 | 0.00 | 100.00 | 2 |
| 116 | NaN | NaN | 0 |
| 117 | 0.00 | 100.00 | 1 |
| 118 | NaN | NaN | 0 |
| 119 | 0.00 | 100.00 | 4 |
| 120 | NaN | NaN | 0 |
| 121 | NaN | NaN | 0 |
| 122 | 0.00 | 100.00 | 3 |
| 123 | 0.00 | 100.00 | 1 |
| 124 | NaN | NaN | 0 |
| 125 | 0.00 | 100.00 | 1 |
| 126 | 100.00 | 0.00 | 1 |
| 127 | 0.00 | 100.00 | 1 |
| 128 | NaN | NaN | 0 |
| 129 | NaN | NaN | 0 |
| 130 | NaN | NaN | 0 |
| 131 | NaN | NaN | 0 |
| 132 | 100.00 | 0.00 | 1 |
| 133 | 36.36 | 63.64 | 11 |
| 134 | NaN | NaN | 0 |
| 135 | NaN | NaN | 0 |
| 136 | NaN | NaN | 0 |
| 137 | NaN | NaN | 0 |
| 138 | NaN | NaN | 0 |
| 139 | 100.00 | 0.00 | 1 |
| 140 | NaN | NaN | 0 |
| 141 | NaN | NaN | 0 |
| 142 | 37.50 | 62.50 | 8 |
| 143 | NaN | NaN | 0 |
| 144 | NaN | NaN | 0 |
| 145 | NaN | NaN | 0 |
| 146 | 100.00 | 0.00 | 1 |
| 147 | NaN | NaN | 0 |
| 148 | NaN | NaN | 0 |
| 149 | NaN | NaN | 0 |
| 150 | NaN | NaN | 0 |
| 151 | 100.00 | 0.00 | 1 |
| 152 | NaN | NaN | 0 |
| 153 | NaN | NaN | 0 |
| 154 | NaN | NaN | 0 |
| 155 | 60.00 | 40.00 | 10 |
| 156 | NaN | NaN | 0 |
| 157 | NaN | NaN | 0 |
| 158 | NaN | NaN | 0 |
| 159 | NaN | NaN | 0 |
| 160 | NaN | NaN | 0 |
| 161 | NaN | NaN | 0 |
| 162 | 18.18 | 81.82 | 11 |
| 163 | NaN | NaN | 0 |
| 164 | NaN | NaN | 0 |
| 165 | NaN | NaN | 0 |
| 166 | NaN | NaN | 0 |
| 167 | NaN | NaN | 0 |
| 168 | NaN | NaN | 0 |
| 169 | 12.50 | 87.50 | 8 |
| 170 | NaN | NaN | 0 |
| 171 | NaN | NaN | 0 |
| 172 | NaN | NaN | 0 |
| 173 | NaN | NaN | 0 |
| 174 | NaN | NaN | 0 |
| 175 | NaN | NaN | 0 |
| 176 | 37.50 | 62.50 | 16 |
| 177 | NaN | NaN | 0 |
| 178 | NaN | NaN | 0 |
| 179 | 20.00 | 80.00 | 10 |
ggplot(p.data,aes(x=factor(Days.interval),fill=No.show ))+
geom_bar(position = "dodge")+labs(x="Intervals between scheduled and appointment Days",
y="Noshow rate")+theme(axis.text.x = element_text(angle = 90, hjust = 1))+
scale_fill_manual(values = c("No" = "green", "Yes" = "red"))Note: - Intervals 0-5 days: Higher no-show rates (23.53% - 26.61%) - Intervals 6-15 days: Relatively stable no-show rates (23.82% - 26.68%) - Intervals 16-30 days: Slightly lower no-show rates (21.35% - 24.80%) - Intervals 31-60 days: Higher no-show rates (29.48% - 37.78%) - Intervals beyond 60 days: Variable no-show rates (20.00% - 100.00%) - No-show rates tend to increase as intervals lengthen. - Shorter intervals (0-5 days) have higher no-show rates. - Mid-range intervals (6-30 days) have relatively stable no-show rates. - Longer intervals (31-60 days) have higher no-show rates. - Optimize scheduling to minimize intervals. - Implement reminders for appointments with longer intervals.
p.nnl <- factor(p.nn,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))data <- data.frame(
Day = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"),
Yes = c(4689, 5150, 5092, 3337, 4037, 9),
No = c(18023, 20488, 20774, 13909, 14981, 30),
Total = c(22712, 25638, 25866, 17246, 19018, 39)
)
data$Percent_Yes <- round(data$Yes / data$Total * 100, 1)
data$Percent_No <- round(data$No / data$Total * 100, 1)
kable(data, "html", booktabs = TRUE, caption = "No-show statistics by day") %>%
kable_styling(full_width = FALSE, font_size = 12) %>%
column_spec(1, bold = TRUE)| Day | Yes | No | Total | Percent_Yes | Percent_No |
|---|---|---|---|---|---|
| Monday | 4689 | 18023 | 22712 | 20.6 | 79.4 |
| Tuesday | 5150 | 20488 | 25638 | 20.1 | 79.9 |
| Wednesday | 5092 | 20774 | 25866 | 19.7 | 80.3 |
| Thursday | 3337 | 13909 | 17246 | 19.3 | 80.7 |
| Friday | 4037 | 14981 | 19018 | 21.2 | 78.8 |
| Saturday | 9 | 30 | 39 | 23.1 | 76.9 |
D.F4 <- data.frame(Days =c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"),
Percentage=c(79.4,79.9,80.3,80.7,78.8,76.9))
D.F4## Days Percentage
## 1 Monday 79.4
## 2 Tuesday 79.9
## 3 Wednesday 80.3
## 4 Thursday 80.7
## 5 Friday 78.8
## 6 Saturday 76.9
ggplot(p.data, aes(x = p.nnl, fill = No.show)) +
geom_bar(position = "dodge") +
labs(x = "Days of the week", y = "Frequency of no-show",
title = "Days of the week of no-Show ")+
theme_classic()+
scale_fill_manual(values = c("No" = "purple", "Yes" = "blue"))ggplot(D.F4, aes(x = Days, y = Percentage, group = 1)) +
geom_line(color = "orange", size = 1.2) +
geom_point(color = "red", size = 3.9) +
labs(
title = "Rate of participant by days",
x = "Days",
y = "Percentage"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
plot.title = element_text(hjust = 0.5, face = "bold")
) +
scale_x_discrete(limits = D.F4$Days)## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Note: - Saturday (Total: 39, Percent Yes: 23.1%, Percent No: 76.9%) This is likely due to the hospital’s reduced operating hours or limited services on Saturdays. - Fridays have the highest no-show rate (21.2%). - Thursdays have the lowest no-show rate (19.3%) among weekdays. - Wednesday has the highest attendance rate (80.3%). - Adjust scheduling and staffing on Saturdays. - Investigate reasons for high no-show rates on Fridays. - Optimize appointment scheduling on Thursdays. - which patient with particular diseases less come for appointment To to call out diabetes column, check does that have and does that does’nt have diseases
## [1] 7943
## [1] 102583
## [1] 110526
## [1] 21801
## [1] 110526
## [1] 88725
## [1] 7943
## [1] 102583
## [1] 21801
## [1] 110526
## [1] 88725
## [1] 6513
## [1] 1430
## [1] 18029
## [1] 3772
## [1] 2683
## [1] 677
## [1] 1676
## [1] 366
data3 <- data.frame(
Disease = c("Diabetes", "Hypertension", "Alcoholism", "Handicap"),
Yes = c(1430, 3772, 677, 401),
No = c(6513, 18028, 2683, 1822),
Total = c(7943, 21800, 3360, 2223)
)
data3$Percent_Yes <- round(data3$Yes / data3$Total * 100, 1)
data3$Percent_No <- round(data3$No / data3$Total * 100, 1)
kable(data3, "html", booktabs = TRUE, caption = "Participant statistics by diseases") %>%
kable_styling(full_width = FALSE, font_size = 12) %>%
column_spec(1, bold = TRUE)| Disease | Yes | No | Total | Percent_Yes | Percent_No |
|---|---|---|---|---|---|
| Diabetes | 1430 | 6513 | 7943 | 18.0 | 82.0 |
| Hypertension | 3772 | 18028 | 21800 | 17.3 | 82.7 |
| Alcoholism | 677 | 2683 | 3360 | 20.1 | 79.9 |
| Handicap | 401 | 1822 | 2223 | 18.0 | 82.0 |
D.AH <- data.frame(
Diseases = rep(c("Diabetes", "Hypertension", "Alcoholism","Handicap"), each = 2),
Status = rep(c("No", "Yes"), times = 4),
Count = c(NO.D, YES.D, NO.H, YES.H, A.ll, B.ll,H.N,H.Y))
D.AH## Diseases Status Count
## 1 Diabetes No 6513
## 2 Diabetes Yes 1430
## 3 Hypertension No 18029
## 4 Hypertension Yes 3772
## 5 Alcoholism No 2683
## 6 Alcoholism Yes 677
## 7 Handicap No 1676
## 8 Handicap Yes 366
D.F <- data.frame(Diseases=c("Diabetes", "Hypertension", "Alcoholism","Handicap"),
count=c(NO.D,NO.H,A.ll,H.N))
D.F## Diseases count
## 1 Diabetes 6513
## 2 Hypertension 18029
## 3 Alcoholism 2683
## 4 Handicap 1676
ggplot(D.AH, aes(x = Diseases, y = Count, fill = Status)) +
geom_col(position = "dodge") +
labs(
title = "Attendance rates by disease",
x = "Disease",
y = "Count",
fill = "No-show"
) +
scale_fill_manual(values = c("No" = "gold", "Yes" = "brown")) +
theme_classic()+
scale_x_discrete(limits = D.F$Diseases)ggplot(D.F, aes(x = Diseases, y = count, group = 1)) +
geom_line(color = "blue", size = 1.2) +
geom_point(color = "red", size = 4) +
labs(
title = "Comparison Disease",
x = "Disease",
y = "Count"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
plot.title = element_text(hjust = 0.5, face = "bold")
) +
scale_x_discrete(limits = D.F$Diseases)Note: - Patients with Hypertension have the lowest no-show rate (17.3%), indicating they are more likely to attend appointments. - Hypertension (17.3%) - Diabetes (18.0%) - Handicap (18.0%) - Alcoholism (20.1%)
Patients with Alcoholism have the highest no-show rate (20.1%), suggesting they may require additional support or targeted interventions to improve appointment adherence. - Monitor and support patients with alcoholism to improve appointment adherence. - Investigate factors contributing to high no-show rates among patients with alcoholism. - Develop disease-specific strategies to enhance appointment reminders and patient engagement.
## [1] 0
p.data$Age.Range <- cut(p.data$Age,
breaks = c(0,5,11,17,29,49,64, Inf),
labels = c("0-5","6-11", "12-17", "18-29", "30-49", "50-64","65-above"),include.lowest = TRUE)
length(which(p.data$Age.Range=="0-5"))## [1] 11731
## 0-5 6-11 12-17 18-29 30-49 50-64 65-above
## 11731 8213 7435 16731 29380 22634 14402
## [1] 9546
## [1] 2185
## [1] 11731
## [1] 6363
## [1] 1850
## [1] 8213
## [1] 5473
## [1] 1962
## [1] 7435
## [1] 12607
## [1] 4124
## [1] 16731
## [1] 23200
## [1] 6180
## [1] 29380
## [1] 18849
## [1] 3785
## [1] 22634
## [1] 12169
## [1] 2233
## [1] 14402
data1 <- data.frame(
Age.Range = c("0-5", "6-11", "12-17", "18-29", "30-49", "50-64", "65-above"),
Yes = c(2185, 1849, 1962, 4122, 6179, 3785, 2232),
No = c(9546, 6363, 5473, 12607, 23198, 18849, 12169),
Total = c(11731, 8212, 7435, 16729, 29377, 22634, 14401)
)
D.F1 <- data.frame(Age.Range=c("0-5","6-11","12-17","18-29","30-49","50-64","65-above"),
Percentage=c(18.6,22.5,26.4,24.6,21.0,16.7,15.5))
D.F1## Age.Range Percentage
## 1 0-5 18.6
## 2 6-11 22.5
## 3 12-17 26.4
## 4 18-29 24.6
## 5 30-49 21.0
## 6 50-64 16.7
## 7 65-above 15.5
data1$Percent_Yes <- round(data1$Yes / data1$Total * 100, 1)
data1$Percent_No <- round(data1$No / data1$Total * 100, 1)
kable(data1, "html", booktabs = TRUE, caption = "No-Show Statistics by Age range") %>%
kable_styling(full_width = FALSE, font_size = 12) %>%
column_spec(1, bold = TRUE)| Age.Range | Yes | No | Total | Percent_Yes | Percent_No |
|---|---|---|---|---|---|
| 0-5 | 2185 | 9546 | 11731 | 18.6 | 81.4 |
| 6-11 | 1849 | 6363 | 8212 | 22.5 | 77.5 |
| 12-17 | 1962 | 5473 | 7435 | 26.4 | 73.6 |
| 18-29 | 4122 | 12607 | 16729 | 24.6 | 75.4 |
| 30-49 | 6179 | 23198 | 29377 | 21.0 | 79.0 |
| 50-64 | 3785 | 18849 | 22634 | 16.7 | 83.3 |
| 65-above | 2232 | 12169 | 14401 | 15.5 | 84.5 |
ggplot(D.F1, aes(x = Age.Range, y = Percentage, group = 1)) +
geom_line(color = "red", size = 1.5) +
geom_point(color = "orange", size = 4) +
labs(title = "Comparing the age range in percentage",
x = "Age range",
y = "Percentage")+
scale_x_discrete(limits = D.F1$Age.Range)ggplot(p.data, aes(x = Age.Range, fill = factor(No.show))) +
geom_bar(position = "dodge") +
labs(title = "Age ranges on no-show rate",
x = "Age range",y = "Frequency",fill = "No-Show") +
scale_fill_manual(values = c("No" = "green", "Yes" = "red")) +
theme_classic()Note - he highest no-show rate is among the 12-17 age group (26.4%). - The lowest no-show rate is among the 0-5 age group (18.6%). - The 18-29 and 30-49 age groups have relatively similar no-show rates (24.6% and 21.0%, respectively). - The 6-11 age group has a slightly higher no-show rate (22.5%) compared to the 0-5 age group. - Yes, the age group has a significant impact on the no-show rate. Adolescents (12-17 years old) are more likely to be no-shows, while younger children (0-5 years old) are less likely.
## [1] 110526
## [1] 88725
## [1] 21801
## [1] 110526
## [1] 110526
## [1] 0
## [1] 183
## [1] 108285
## [1] 2042
p.data$Disease_Count <- rowSums(p.data[, c("Hypertension",
"Diabetes","Handicap","Alcoholism")])
p.data$Disease_Count <- factor(p.data$Disease_Count,
levels = c(0, 1, 2, 3, 4),
labels = c("Zero disease", "One disease", "Two diseases", "Three diseases", "Four diseases"))
length(p.data$Disease_Count)## [1] 110526
## [1] One disease Zero disease Two diseases Three diseases Four diseases
## 5 Levels: Zero disease One disease Two diseases ... Four diseases
## Zero disease One disease Two diseases Three diseases Four diseases
## 84114 18111 7654 627 20
## [1] 17603
## [1] 66511
## [1] 84114
## [1] 3246
## [1] 14865
## [1] 18111
## [1] 1367
## [1] 6287
## [1] 7654
## [1] 97
## [1] 530
## [1] 627
## [1] 6
## [1] 14
## [1] 20
data2 <- data.frame(
Disease_Count = c("Zero disease", "One disease", "Two diseases", "Three diseases", "Four diseases"),
Yes = c(17600, 3244, 1367, 97, 6),
No = c(66510, 14864, 6287, 530, 14),
Total = c(84110, 18108, 7654, 627, 20)
)
data2$Percent_Yes <- round(data2$Yes / data2$Total * 100, 1)
data2$Percent_No <- round(data2$No / data2$Total * 100, 1)
kable(data2, "html", booktabs = TRUE, caption = "No-Show Statistics by number of disease") %>%
kable_styling(full_width = FALSE, font_size = 12) %>%
column_spec(1, bold = TRUE)| Disease_Count | Yes | No | Total | Percent_Yes | Percent_No |
|---|---|---|---|---|---|
| Zero disease | 17600 | 66510 | 84110 | 20.9 | 79.1 |
| One disease | 3244 | 14864 | 18108 | 17.9 | 82.1 |
| Two diseases | 1367 | 6287 | 7654 | 17.9 | 82.1 |
| Three diseases | 97 | 530 | 627 | 15.5 | 84.5 |
| Four diseases | 6 | 14 | 20 | 30.0 | 70.0 |
D.F6 <- data.frame(Disease_Count=c("Zero disease","One disease","Two diseases","Three diseases","Four diseases"),
Percentage=c(20.9,17.9,17.9,15.5,30.0))
D.F6## Disease_Count Percentage
## 1 Zero disease 20.9
## 2 One disease 17.9
## 3 Two diseases 17.9
## 4 Three diseases 15.5
## 5 Four diseases 30.0
ggplot(p.data, aes(x = Disease_Count, fill = factor(No.show))) +
geom_bar(position = "dodge") +
labs(title = "No-show rates by disease count",
x = "Number of diseases",
y = "Frequency",fill = "No-show")+
scale_fill_manual(values = c("No" = "green", "Yes" = "red")) +
theme_classic()ggplot(D.F6, aes(x = Disease_Count, y = Percentage, group = 1)) +
geom_line(color = "red", size = 1.5) +
geom_point(color = "orange", size = 4) +
labs(title = "Comparing the disease count in percentage",
x = "Disease count",
y = "Percentage")+
scale_x_discrete(limits = D.F6$ Disease_Count)Note: - Patients with 3 diseases have the lowest no-show rate (15.5%). - Patients with 0 diseases have the highest no-show rate (20.9%) among the larger disease categories. - Patients with 4 diseases have a significantly higher no-show rate (30.0%), but the sample size is very small
The insights from the analysis were summarized as follows:
Scheduling intervals matter:
Longer delays between booking and appointment dates lead to higher no-show rates. Highest attendance: Tuesday and Wednesday.
Strengthen reminder systems:
Send automated SMS and email alerts 24-48 hours before appointments.
Add phone call reminders for patients with a history of no-shows.
Target high-risk groups:
Focus engagement efforts on young adults (20-30) and patients with fewer health conditions, as they no-show more often.
Use personalized reminders (e.g., parent alerts for pediatric visits).
Age group variations:
Young adults (20-30) had the highest no-show rates.
Elderly patients (60+) had the lowest no-show rates, likely due to greater medical needs.
Children (0-10) had high attendance rates, possibly due to parental involvement.
Multiple diseases:
Contrary to expectations, patients with more diseases had lower no-show rates.
The following were recommended based on the analysis:
In conclusion, there are many factors influencing the no-show rate based on the analysis, but the no-show rate can be reduced if the recommendations stated above can be implemented.
Topics to cover
Housing in Buenos Aires.
Air quality in Nairobi.
Bankruptcy in Poland.
Customer segment in the US.
A|B testing at WorldQuant University.
Market forecasting in India.
Skills to acquire
Python coding.
Data cleaning.
Data modelling.
Data visualization.
Machine learning.
The data science and mathematical modelling skills I’m developing through the learning and project directly support my mission to solve real-world problems and drive progress in my community and nation. By combining advanced analytics like machine learning and causal inference with domain expertise in public health and climate science, I aim to design scalable solutions for critical challenges.