Loading Dataset for Analysis into R

The dataset was downloaded as a CSV file and then loaded into R for cleaning and manipulation.

library(readr)

d_csv <- read_csv("C:/Users/AHMED FAMILY/Desktop/noShowDoc.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   Age = col_integer(),
##   Gender = col_character(),
##   AppointmentRegistration = col_datetime(format = ""),
##   ApointmentData = col_datetime(format = ""),
##   DayOfTheWeek = col_character(),
##   Status = col_character(),
##   Stat = col_integer(),
##   Diabetes = col_integer(),
##   Alcoolism = col_integer(),
##   HiperTension = col_integer(),
##   Handcap = col_integer(),
##   Smokes = col_integer(),
##   Scholarship = col_integer(),
##   Tuberculosis = col_integer(),
##   Sms_Reminder = col_integer(),
##   AwaitingTime = col_integer()
## )
head(d_csv)
## # A tibble: 6 × 16
##     Age Gender AppointmentRegistration ApointmentData DayOfTheWeek  Status
##   <int>  <chr>                  <dttm>         <dttm>        <chr>   <chr>
## 1    19      M     2014-12-16 14:46:25     2015-01-14    Wednesday Show-Up
## 2    24      F     2015-08-18 07:01:26     2015-08-19    Wednesday Show-Up
## 3     4      F     2014-02-17 12:53:46     2014-02-18      Tuesday Show-Up
## 4     5      M     2014-07-23 17:02:11     2014-08-07     Thursday Show-Up
## 5    38      M     2015-10-21 15:20:09     2015-10-27      Tuesday Show-Up
## 6     5      F     2014-06-17 06:47:27     2014-07-22      Tuesday No-Show
## # ... with 10 more variables: Stat <int>, Diabetes <int>, Alcoolism <int>,
## #   HiperTension <int>, Handcap <int>, Smokes <int>, Scholarship <int>,
## #   Tuberculosis <int>, Sms_Reminder <int>, AwaitingTime <int>

The dataset above was obtained from Kaggle. It records whether patients attended their scheduled appointments with their medical doctor, and it is the basis for the estimations, simulations, and plots that follow. Although the dataset contains 16 variables in total, this assignment uses only four of them (Age, Gender, Stat, and DayOfTheWeek) in the analyses below.

library(sjmisc)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Map each day name to a number (Sunday = 1, ..., Saturday = 7)
d_csv$DayOfTheWeek <- recode(d_csv$DayOfTheWeek,
                             "Sunday" = 1, "Monday" = 2, "Tuesday" = 3,
                             "Wednesday" = 4, "Thursday" = 5, "Friday" = 6,
                             "Saturday" = 7)

Because DayOfTheWeek is categorical, it was recoded so that each day of the week is represented by a number (for example, Sunday is 1 and Friday is 6). This was done in R, with the DescTools package loaded alongside, so that once the variable is converted to a factor, the first day (Sunday) serves as the reference level and is absorbed into the intercept. This step matters because the intercept now represents “a female patient of zero years of age who had an appointment on Sunday,” which gives the remaining days of DayOfTheWeek a baseline to be compared against.
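The reference-level behaviour described above can be checked on a small example. The sketch below uses a made-up vector of day names (not the actual data) and base R only. It assumes the default treatment contrasts, under which the first factor level gets no dummy column of its own.

```r
# Toy vector standing in for d_csv$DayOfTheWeek (hypothetical values).
days <- c("Wednesday", "Tuesday", "Sunday", "Friday", "Tuesday")

# Same mapping as in the recode step: Sunday = 1, ..., Saturday = 7.
day_codes <- c(Sunday = 1, Monday = 2, Tuesday = 3, Wednesday = 4,
               Thursday = 5, Friday = 6, Saturday = 7)
day_num <- unname(day_codes[days])

# factor() sorts the numeric codes, so "1" (Sunday) becomes the first level.
day_fac <- factor(day_num)
levels(day_fac)

# With default treatment contrasts, the first level has no dummy column:
# it is absorbed into the intercept.
colnames(model.matrix(~ day_fac))
```

If a different baseline day were wanted, `relevel()` on the factor would be one way to change it without redoing the recode.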

head(d_csv)
## # A tibble: 6 × 16
##     Age Gender AppointmentRegistration ApointmentData DayOfTheWeek  Status
##   <int>  <chr>                  <dttm>         <dttm>        <dbl>   <chr>
## 1    19      M     2014-12-16 14:46:25     2015-01-14            4 Show-Up
## 2    24      F     2015-08-18 07:01:26     2015-08-19            4 Show-Up
## 3     4      F     2014-02-17 12:53:46     2014-02-18            3 Show-Up
## 4     5      M     2014-07-23 17:02:11     2014-08-07            5 Show-Up
## 5    38      M     2015-10-21 15:20:09     2015-10-27            3 Show-Up
## 6     5      F     2014-06-17 06:47:27     2014-07-22            3 No-Show
## # ... with 10 more variables: Stat <int>, Diabetes <int>, Alcoolism <int>,
## #   HiperTension <int>, Handcap <int>, Smokes <int>, Scholarship <int>,
## #   Tuberculosis <int>, Sms_Reminder <int>, AwaitingTime <int>
library(DescTools)
d_csv$DayOfTheWeek <- factor(d_csv$DayOfTheWeek)
library(Zelig)
## 
## Attaching package: 'Zelig'
## The following objects are masked from 'package:DescTools':
## 
##     Median, Mode
z.noShow <- zlogit$new()
z.noShow$zelig(Stat ~ Gender + DayOfTheWeek + Age, data = d_csv)
summary(z.noShow)
## Model: 
## 
## Call:
## z.noShow$zelig(formula = Stat ~ Gender + DayOfTheWeek + Age, 
##     data = d_csv)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8668  -1.4323   0.7847   0.8745   1.1278  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)    1.1512331  1.0975194   1.049    0.294
## GenderM       -0.0024554  0.0085184  -0.288    0.773
## DayOfTheWeek2 -0.7648770  1.0975285  -0.697    0.486
## DayOfTheWeek3 -0.6098948  1.0975284  -0.556    0.578
## DayOfTheWeek4 -0.6514993  1.0975276  -0.594    0.553
## DayOfTheWeek5 -0.6310186  1.0975300  -0.575    0.565
## DayOfTheWeek6 -0.6954399  1.0975339  -0.634    0.526
## DayOfTheWeek7 -1.0333612  1.0989089  -0.940    0.347
## Age            0.0096842  0.0001783  54.313   <2e-16
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 367749  on 299999  degrees of freedom
## Residual deviance: 364509  on 299991  degrees of freedom
## AIC: 364527
## 
## Number of Fisher Scoring iterations: 4
## 
## Next step: Use 'setx' method

In this analysis, Age, Gender, and DayOfTheWeek are the independent variables, representing each patient’s age, gender, and the day of the week of the appointment. The dependent variable is Stat, which indicates whether the patient showed up for the scheduled appointment. The coefficients are reported on the log-odds scale. The intercept (1.15) is the estimated log-odds of showing up for the reference group: a female patient of zero years of age with a Sunday appointment. The coefficient for GenderM (-0.002) suggests that male patients are marginally less likely to show up than female patients, but the relationship between Gender and showing up is not statistically significant (p = 0.773), and none of the DayOfTheWeek coefficients are significant either. The only statistically significant coefficient is Age (0.01, p < 2e-16), which indicates that as a person grows older, they are more likely to attend. In short, although the point estimate slightly favors female patients over male patients, age is the only predictor with a clear positive effect on medical appointment attendance.
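Because log-odds are hard to read directly, it can help to convert them. The sketch below copies three coefficients from the summary output above and uses base R's `exp()` and `plogis()` (the inverse logit) to obtain odds ratios and a predicted probability. It is an illustration of the arithmetic only, not part of the original Zelig workflow.

```r
# Coefficients copied from the summary output above (log-odds scale).
b <- c(intercept = 1.1512331, GenderM = -0.0024554, Age = 0.0096842)

# Odds ratios: exp() of a log-odds coefficient.
odds_ratios <- exp(b[c("GenderM", "Age")])
odds_ratios

# Predicted probability for the reference group
# (female, Sunday appointment, age 0): inverse logit of the intercept.
p_reference <- plogis(b[["intercept"]])
p_reference
```

The Age odds ratio comes out just above 1, meaning each additional year of age multiplies the odds of showing up by roughly 1.01, and the reference-group probability is about 0.76. The large standard error on the intercept means that last figure is very imprecise.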

#z.noShow$setrange(Age = 0:115)
#z.noShow$sim()
#summary(z.noShow)
#z.noShow$setx()
#z.noShow$sim()
#summary(z.noShow)

Gender Difference

#z.noShow$setx(Gender = "M")
#z.noShow$setx1(Gender = "F")
#z.noShow$sim()
#summary(z.noShow)
#a.range = min(d_csv$Age):max(d_csv$Age)
#x <- setx(z.noShow, Age = a.range)
#s <- sim(z.noShow, x = x)
#ci.plot(s)

Once the Zelig logit model was fitted, the estimations and simulations provided a useful insight. In particular, the results and the graph produced by ci.plot indicate that age has a positive effect on attending medical appointments: older patients are more likely to attend their appointments than younger patients. One possible explanation is that older patients are more willing to discuss their personal health openly with medical professionals, or that they carry more responsibility than their younger counterparts. The model itself does not test either explanation.
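The positive age effect can also be traced by hand from the reported coefficients. The sketch below is a rough stand-in for the central line of the ci.plot. It assumes a female patient with a Tuesday appointment and a hypothetical age grid from 0 to 100. It gives point predictions only and none of the simulation-based uncertainty bands that Zelig would add.

```r
# Coefficients copied from the summary output (log-odds scale).
b0    <- 1.1512331   # intercept: female, Sunday, age 0
b_tue <- -0.6098948  # DayOfTheWeek3 (Tuesday)
b_age <- 0.0096842   # Age

# Hypothetical age grid; the commented-out code uses min/max of d_csv$Age.
ages <- seq(0, 100, by = 10)

# Predicted probability of showing up for a female patient on a Tuesday.
p_show <- plogis(b0 + b_tue + b_age * ages)
data.frame(Age = ages, P_ShowUp = round(p_show, 3))
```

The predicted probability rises steadily with age, which matches the direction described above.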