
1 Eliminate duplicate responses

First, we will check whether duplicated responses for the same trial were consistent. If they were discordant, we will retain the response in which EDC use was marked as “Yes,” because a respondent who answered “No” may not have been aware of EDC use in the clinical trial. If the responses are concordant, the second response will be removed. The following code segments implement this.

In the first segment, we make a data frame of duplicated trials with serial numbers of responses.

In the second segment, we make a data frame of the first responses and merge it with the data frame of duplicated trials. For each row, we create a variable holding the serial number of the response to retain: the response with EDC use = Yes if the responses are discordant, and the first response otherwise. The final two code chunks then build a data frame in which each trial has a single response.

2 Statistical Analysis Plan

2.1 Participation Statistics

The following table shows the statistics related to participation in the survey. The number of investigators contacted differs from the number of trials because more than one investigator may have been contacted for a single trial.

2.2 Included trial characteristics

First, we look at the characteristics of all trials that were in the sampling frame.

Next we compare the trial characteristics of those studies which have responded versus those which have not.

2.3 EDC Adoption Rate

EDC Adoption Rate (EAR): The primary outcome measure is EAR. This will be defined as the ratio of the number of CTRI registered trials that use an EDC with sophistication level 2 or more to the number of participating trials (unique CTRI registered trials for which investigators agreed to participate in the study). The proportion and its binomial 95% confidence interval will be reported.

The EDC sophistication level is defined as follows:

  • Level 1: There is a unique account and password for each user to access the online system.

  • Level 2: Sites enter subject visit data through a Web interface into electronic case report forms (eCRFs). The completion status of each eCRF for each subject can be tracked automatically online. The system provides an audit trail for all data entry and data modification.

  • Level 3: Data validation happens automatically when data are entered into the eCRF. The system will automatically log the user off after a period of inactivity.

  • Level 4: Subjects are randomized automatically.

  • Level 5: Subject recruitment can be tracked online for each site.

  • Level 6: The system allows tracking of medication inventory at the sites.

For a level to be considered complete, all the questions should be marked as Yes. If one of the questions is marked as No and a higher level is marked Yes then the higher level will be taken. For each unique trial we will therefore calculate the highest EDC sophistication level. If EDC is not used then sophistication level will be marked as missing.

The following table shows the EDC adoption rate and the different levels in the trials for which responses were received in the survey.

The following table shows the breakdown of key trial characteristics by EDC adoption status. Groups were compared using the Chi-square test for categorical variables and the Wilcoxon rank-sum test for continuous variables.

2.4 Influence of trial parameters on EAR

To determine the influence of the trial parameters on EAR, we will use a logistic regression model where the dependent variable will be EDC adoption with EDC sophistication level 2 or more (modeled as Yes or No). Independent variables will be:
1. Trial sponsor: Industry or Investigator-Initiated. Trials in which the primary sponsor is a pharmaceutical company or device manufacturer will be considered industry-sponsored, and the rest will be considered investigator-initiated.
2. Trial sample size: Total trial sample size will be modeled as a continuous variable. To relax the linearity assumption, this will be expanded using a restricted cubic spline with 3 knots.
3. Trial sites: The number of sites will also be modeled as a continuous variable. Again to relax the linearity assumptions, the model term will be expanded using a restricted cubic spline with three knots.

Interactions will be tested in an omnibus model containing all interaction terms. A Wald test will be used to determine the significance of any interaction. Odds ratios with 95% confidence intervals will be reported.

                Wald Statistics          Response: edc_adoption 

 Factor                                                       Chi-Square d.f. P     
 sample_size  (Factor+Higher Order Factors)                    6.30      6    0.3905
  All Interactions                                             2.62      4    0.6233
  Nonlinear (Factor+Higher Order Factors)                      2.75      3    0.4315
 sites  (Factor+Higher Order Factors)                         11.31      3    0.0101
  All Interactions                                             1.39      2    0.4994
 industry_funded  (Factor+Higher Order Factors)                1.83      3    0.6095
  All Interactions                                             0.91      2    0.6335
 sample_size * sites  (Factor+Higher Order Factors)            1.39      2    0.4994
  Nonlinear                                                    0.46      1    0.4958
  Nonlinear Interaction : f(A,B) vs. AB                        0.46      1    0.4958
 sample_size * industry_funded  (Factor+Higher Order Factors)  0.91      2    0.6335
  Nonlinear                                                    0.81      1    0.3674
  Nonlinear Interaction : f(A,B) vs. AB                        0.81      1    0.3674
 TOTAL NONLINEAR                                               2.75      3    0.4315
 TOTAL INTERACTION                                             2.62      4    0.6233
 TOTAL NONLINEAR + INTERACTION                                 3.65      5    0.6008
 TOTAL                                                        24.95      8    0.0016

As the results of the above ANOVA show, the Wald tests for the non-linear terms and for the interactions are not significant. Hence we show the simplified model without the interaction terms and without the non-linear terms. The table below shows the results of the logistic regression analysis.

2.5 EDC Sophistication Level

We will provide data on the median EDC sophistication levels as well as a plot showing the proportion of CTRI registered trials with different levels of EDC sophistication. Further visualization and analysis will also explore the association between trial sample size, number of trial sites, and type of trial sponsorship with EDC sophistication.

In the following table we will show the univariable analysis of the factors which influenced EDC sophistication level. We will dichotomize the level into two categories (score 6 or score 1-5).

3 Additional Analyses

Additionally, the survey collected data on alternative methods of data collection used in the trial, as well as a single-item question on the key perceived barriers to adoption of EDC in the trial.

Other reasons identified for not using EDC were:

Finally, two additional questions asked whether the trial center had access to a CTU and an IRB. We will evaluate these data in relation to EDC use.

4 Industry Sponsored trials

The percentage of industry sponsored trials by each year of registration is shown in the figure below.

5 Packages used

  1. R : R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

  2. Tidyverse : Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

  3. gtsummary : Daniel D. Sjoberg, Michael Curry, Margie Hannum, Joseph Larmarange, Karissa Whiting and Emily C. Zabor (2021). gtsummary: Presentation-Ready Data Summary and Analytic Result Tables. <https://github.com/ddsjoberg/gtsummary>, http://www.danieldsjoberg.com/gtsummary/.

  4. Hmisc : Frank E Harrell Jr, with contributions from Charles Dupont and many others. (2021). Hmisc: Harrell Miscellaneous. https://hbiostat.org/R/Hmisc/, https://github.com/harrelfe/Hmisc/

  5. flextable : flextable: Functions for Tabular Reporting. https://ardata-fr.github.io/flextable-book/, https://davidgohel.github.io/flextable/.

  6. rms : Frank E Harrell Jr (2021). rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/, https://github.com/harrelfe/rms.

  7. ggplot2 : H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

  8. lubridate : Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL https://www.jstatsoft.org/v40/i03/.

---
title: "Appendix II Electronic Data Capture Systems in India Survey"
output:
  html_notebook:
    toc: yes
    theme: paper
    number_sections: yes
  word_document:
    toc: yes
  html_document:
    toc: yes
    df_print: paged
---

```{r setup}
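knitr::opts_chunk$set(warning = FALSE,
                      message = FALSE,
                      echo = FALSE,
                      dev = c("png", "tiff"))
# Note: the working directory cannot be set through opts_chunk (`setwd` is not a chunk option).
# If the input files live in ../project/, call this in an earlier chunk instead:
# knitr::opts_knit$set(root.dir = "../project/")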
library(tidyverse)
library(gtsummary)
library(Hmisc)
library(rms)
library(lubridate)
library(flextable)

df <- read.csv("clinical_trials_infrastructure_s (9).csv")
trial_details <- read.csv("final_merged_nci_ctri_data.csv")
contacts <- read.csv("final_contact_list.csv")
cam_stats1 <- read.csv("GMass-Report-11766210-Main-Campaign-Stats.csv")
cam_stats2 <- read.csv("GMass-Report-12631646-Main-Campaign-Stats.csv")
cam_stats3 <- read.csv("GMass-Report-12960414-Main-Campaign-Stats.csv")
cam_stats1 <- cam_stats1 %>% select(emailaddress,ID,Opens,Unsubscribed,Bounced,Blocked) 
cam_stats2 <- cam_stats2 %>% select(emailaddress,ID,Opens,Unsubscribed,Bounced,Blocked) 
cam_stats3 <- cam_stats3 %>% select(emailaddress,ID,Opens,Unsubscribed,Bounced,Blocked) 



cam_stats <- rbind(cam_stats1,cam_stats2,cam_stats3)
# Sponsor type as institutional, governmental and industry as well as others. 

trial_details <- trial_details %>%  
  mutate(sponsor = trimws(tolower(sponsor))) %>% 
  mutate(sponsor_type = case_when(str_detect(sponsor,"institu|college|university") ~ "Institutional",
                                  str_detect(sponsor,"gover") ~ "Governmental",
                                  str_detect(sponsor, "commercial|consumer|export|market|device|natrace|industry|pharma|compan|contract|manufact|pvt\\.") ~ "Industry",
                                  TRUE ~ "Other"),
         industry_funded = case_when(sponsor_type == "Industry" ~ "Yes", TRUE ~ "No")) 

# Trial phase divided into categories. 
trial_details <- trial_details %>% 
  mutate(phase = trimws(phase)) %>% 
  mutate(phase = str_remove_all(phase,"Phase|Early")) %>% 
  mutate(phase = str_remove_all(phase,"\\s+|\n|\t")) %>% 
  mutate(phase = str_replace(phase,"PostMarketingSurveillance","4")) %>% 
  mutate(phase = str_replace(phase,"NotApplicable","N/A")) %>% 
  mutate(phase = str_replace(phase,"\\|","/")) %>% 
  mutate(phase_type = case_when(
    str_detect(phase,"1\\/2|1$|2$") ~ "Phase 1 - 2",
    str_detect(phase,"2\\/3|3$") ~ "Phase 2 - 3",
    str_detect(phase,"3\\/4|4$") ~ "Phase 4", 
    TRUE ~ "Unknown"
  ))

trial_details<- trial_details %>% 
  mutate(condition = tolower(condition)) %>% 
  mutate(condition = trimws(condition)) %>% 
  mutate(
    condition_type = case_when(
      str_detect(condition,"lymphoblastic|neoplasia|tumour|tumor|cancer|lymphoma|leukemia|neoplasm|carcinoma|sarcoma|castleman|chemotherapy|radiotherapy|ebrt|radiation|myelodysplastic|aplastic|stem cell| mbc |ca ovary|myeloma|nsclc|gestational trophoblastic disease|fibroadenoma|glioblastoma|leukoplakia|malignancy") ~ "Neoplasms",
      str_detect(condition,"diabetes|diabetic|thyroidism|malnutrition|metabolic|nutritional deficiency|obesity|obese|hyperalimentation|glycemia|growth|overweight|stunt|nutritional supplement|endocrine gland|disorder of thyroid|thyroid surgery|obeisity|acromegaly|pituitary|weight loss|type ii dm") ~ "Endocrine disease",
      str_detect(condition,"(m|x)dr(.*?)tb |tubercu|bacterial|pneumonia|septicemia|malaria|hiv|aids|infection|leprosy|suppurative|dengue|malaria|plasmodium|infectious|abscess|^tb |acute pancreatitis|corona|covid|filaria|mucor|warts|japanese|typhoid|sepsis|acantham|hepatitis(\\-| )(a|b|c)(\\s+|$)|difficile|tinea|leptospi|fungal|septic|scabies|rabies|leishmaniasis|hidradenitis") ~ "Infective Diseases",
      str_detect(condition,"fracture|trauma|injury|accident|burn|injuries|smoke|fire|flame|sprain|snake") ~ "Accidents and Injuries",
      str_detect(condition,"anxiety|psychoactive|related stress|depressive|depression|schizophrenia|withdrawl|addiction|affective|attention|obsessi|compulsi|seizure|mental retard|parkinson|alzheimer|neurotic|mood|neuropathy|radiculopathy|cerebral palsy|autis|dementia|(alcohol|tobacco|nicotine) (consumpt|addic|withdra|dependa)|migraine|insomnia|adhd|intellectual|epileptic|(disorder|disorders|disease|diseases) of (brain|central nervous)|epilepsies|post herpetic neuralgia|zoster associated pain|psychosis|bipolar|epilepsy|encephalopathy|smoking cessation") ~ "Mental Behavioural Disorders and nervous system disease",
      str_detect(condition,"asthma|chronic obstruct|copd|bronchiolitis|wheeze|(disease|disorder)(.*?)(respiratory|lung)|respiratory (tract|distress|disorder)|wheezing") ~ "Diseases of respiratory system",
      str_detect(condition,"(calculus)(.*?)(kidney|ureter|urethra|bladder)|nephrotic|(nephro|uro)lith|end(.*?)stage (renal|kidney) disease|esrd|dialysis|nephropathy|glomerular|urinary incontinence|renal stone|disorder of prostate|prostatic hyper|urethra|pyeloplasty|hypospadias|(diseases|disease)(.*?)(genitourinary|kidney|ureter)|urological|(azoo|oligo)spermia|(bladder|kidney|renal|ureteric|prostatic)(.*?)(disorder|disease)") ~ "Diseases of genitourinary system",
      str_detect(condition,"eye disease|glaucoma|cataract|cornea|retinal|retinopathy|conjunct|eyelid|uveitis|optic|ocular|eyelid|astigma|myopia|hypermetropia|episcler|sclera|eye health|pterygium|keratopathy|choroid") ~ "Diseases of Eye",
      str_detect(condition,"periodontitis|gerd|cirrhoses|gastritis|dentinal|pulpitis|endodontotic|edentulous|dental|molar|tooth|gingiv|caries|cirrhosis|cirrhotic|end stage (liver|hepatic)|(liver|hepatic) failure|colitis|(esophageal|gastric|duodenal|jejunal|intestinal|colonic|rectal|peptic|anal|colorectal)(.*?)(ulcer|stricture|obstruction|fistula|disorder|disease)|gastrointestinal disorder|haemorrhoids|hemorrhoids|anus|(fistula|fissure) in (ano|anus)|cholelithiasis|hepatitis|pancreatitis|(disease|diseases|disorder)(.*?)(digestive|liver|biliary|bile)|icteric|acute diarrhea|cholilithiasis|dyspepsia|proctitis|ercp|hyperbilirubinemia|jaundice|gaucher|liver disease|achalasia of cardia|gastroesophageal reflux|budd-chiari|pancreas") ~ "Diseases of digestive system",
      str_detect(condition,"rhemuatic fever|hyperten(sion|sive)|cardiac|angina|coronary syndrome|coronary|ami|myocardial infarction|heart failure|ischemic|cardiomyopathy|atrial|cardiovascular disease|(tachy|brady)cardia|congenital heart|myocardial|pulmonary hypertension|stroke|cerebrovascular disease|cvd|chd|bypass|infarction|atherosclerosis|(disease|disorder)(.*?)(circulatory|heart|cardiac)") ~ "Diseases of Circulatory System",
      str_detect(condition,"arthropathies|arthritis|chronic|polycystic|fibrosis|endometriosis|rheumatoid|crohn|gout|hyperuricemia|infertility|lupus|osteoporosis|sle|ankylosing|systemic sclerosis")~"Chronic non-communicable diseases",
      str_detect(condition,"healthy|players|volunteer|anti-aging|normal children|no specific health condition|oral health") ~ "Normal healthy volunteers",
      TRUE~"Others"))

trial_details <- trial_details %>% mutate(multicentric=case_when(sites > 1 ~ "Yes", TRUE ~ "No"))

df <- df %>% rename(ID = trial_id)


set_flextable_defaults(
  font.family = "Arial", 
  font.size = 10,
  font.color = "black",
  table.layout = "autofit",
  digits = 1,
  theme_fun = "theme_zebra"
  )
```
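The keyword-based sponsor classification in the setup chunk depends on rule order, because `case_when()` returns the first matching rule. The following is a minimal base-R sketch of that behaviour, not the code used in the analysis. The patterns are copied from the setup chunk; the sponsor strings and the helper `classify_sponsor()` are hypothetical and exist only for illustration.

```r
# Hypothetical sponsor strings, invented purely for illustration
sponsor <- c(" ABC Pharma Pvt. Ltd ", "Government Medical College",
             "Government of India", "Self funded")

classify_sponsor <- function(x) {
  x <- trimws(tolower(x))
  out <- rep("Other", length(x))
  # Assign in reverse priority so that higher-priority rules overwrite
  # lower-priority ones, mimicking first-match-wins in case_when()
  out[grepl("commercial|consumer|export|market|device|natrace|industry|pharma|compan|contract|manufact|pvt\\.", x)] <- "Industry"
  out[grepl("gover", x)] <- "Governmental"
  out[grepl("institu|college|university", x)] <- "Institutional"
  out
}

sponsor_type <- classify_sponsor(sponsor)
industry_funded <- ifelse(sponsor_type == "Industry", "Yes", "No")
print(data.frame(sponsor = trimws(sponsor), sponsor_type, industry_funded))
```

Note that "Government Medical College" is classified as Institutional rather than Governmental, because the institutional rule is evaluated first.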

# Eliminate duplicate responses

First, we will check whether duplicated responses for the same trial were consistent. If they were discordant, we will retain the response in which EDC use was marked as "Yes," because a respondent who answered "No" may not have been aware of EDC use in the clinical trial. If the responses are concordant, the second response will be removed. The following code segments implement this.

In the first segment, we make a data frame of duplicated trials with serial numbers of responses.

In the second segment, we make a data frame of the first responses and merge it with the data frame of duplicated trials. For each row, we create a variable holding the serial number of the response to retain: the response with EDC use = Yes if the responses are discordant, and the first response otherwise. The final two code chunks then build a data frame in which each trial has a single response.

```{r}
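# Segment 1: second and later responses for each trial, with their serial numbers
dup <- df %>% select(serial,ID,edc_use) %>% 
  group_by(ID) %>% 
  mutate(serial.number=row_number()) %>% 
  filter(serial.number>1)  %>% 
  ungroup()

# Segment 2: join the first response (suffix .y) to each later response (suffix .x)
# and choose the serial to retain for every duplicated trial
dup <- df %>% select(serial,ID,edc_use) %>% 
  group_by(ID) %>% 
  mutate(serial.number=row_number()) %>%
  filter(serial.number==1) %>% 
  left_join(dup,.,by="ID") %>% 
  filter(serial.x != serial.y) %>% 
  filter(serial.number.x != serial.number.y) %>% 
  # Prefer the first response unless only the later response reports EDC use
  mutate(serial_retain = case_when(edc_use.y=="Yes" ~ serial.y,
                                   edc_use.y=="No" & edc_use.x == "Yes" ~ serial.x,
                                   edc_use.y== "No" & edc_use.x == "No" ~serial.y)) %>% 
  select(ID,serial_retain) %>% 
  distinct(.,ID,.keep_all=TRUE) %>% 
  rename(serial = serial_retain)

# Trials that were answered only once
df_unique <- df %>% anti_join(.,dup,by="ID")

# The retained response for each duplicated trial
df_dup <- df %>% semi_join(.,dup,by="serial")

# One response per trial
df <- rbind(df_unique,df_dup)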
```
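The retention rule described above can be checked on a toy data set. This is a minimal base-R sketch under stated assumptions, not the code used in the analysis: the data frame `toy` and the helper `pick_serial()` are hypothetical, and the rule is written as "keep the first response marked Yes, otherwise keep the first response", which matches the rule above when a trial has two responses.

```r
# Hypothetical responses: T1 answered twice with discordant EDC use,
# T2 answered twice concordantly, T3 answered once
toy <- data.frame(
  serial  = 1:5,
  ID      = c("T1", "T1", "T2", "T2", "T3"),
  edc_use = c("No", "Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# For one trial, return the serial to retain:
# the first "Yes" response if there is one, otherwise the first response
pick_serial <- function(d) {
  yes <- d$serial[d$edc_use == "Yes"]
  if (length(yes) > 0) yes[1] else d$serial[1]
}

retain <- vapply(split(toy, toy$ID), pick_serial, integer(1))
toy_dedup <- toy[toy$serial %in% retain, ]
print(toy_dedup)
```

For T1 the later "Yes" response (serial 2) is kept, for T2 the first of the two concordant responses (serial 3) is kept, and T3 is unchanged.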

```{r}
contacts <- contacts %>% 
  mutate(email=trimws(email)) %>% 
  mutate(email=str_remove_all(email,"\\s+|\n|\t"))

# Obtain the list of CTRI trial IDs from respondents

resp <- df %>% 
  select(ID)

# Remove the respondent trials from the list

contacts2 <- contacts %>% anti_join(.,resp,by="ID")  

# Make a list of respondents who have responded for one clinical trial. Do not send them another mail. 

respondent_emails <- contacts %>% 
  semi_join(.,resp,by="ID") %>% 
  select(email,ID) %>% 
  rename(responded_trial=ID) %>% 
  mutate(email=trimws(email)) %>% 
  mutate(email=str_remove_all(email,"\\s+|\n|\t")) %>% 
  distinct(.,email,responded_trial,.keep_all=T)

responders <- contacts2 %>% inner_join(respondent_emails,.,by="email") %>% 
  distinct(.,ID,.keep_all = T) %>% 
  select(responded_trial,ID) %>% 
  rename(ID=responded_trial,other_ID=ID) %>% 
  left_join(.,df,by="ID") %>% 
  select(-c(ID)) %>% 
  rename(ID=other_ID) %>% 
  group_by(serial) %>% 
  mutate(sno = row_number()) %>% 
  mutate(sno = sno+1) %>% 
  mutate(serial = paste(serial,sno,sep="_")) %>% 
  select(-sno)

df <- rbind(df,responders)
```

```{r}
# Merge trial details with the dataset of survey responses to ensure trial characteristics can be ascertained. 

df <- df %>% 
  left_join(.,trial_details,by="ID")
df <- df %>% select(-remote_addr)
```

```{r}

ci_function_cat <- function(data, variable, by, tbl, ...) {
  # first calculate CIs for all levels
  result <-
    data %>% 
    freqtables::freq_table(!!sym(by), !!sym(variable)) %>% 
    mutate(ci = str_glue("{style_percent(lcl_row / 100, symbol = TRUE)} - {style_percent(ucl_row / 100, symbol = TRUE)}")) %>%
    select(row_cat, col_cat, ci) %>%
    pivot_wider(id_cols = col_cat, names_from = row_cat, values_from = ci) 
  
  # if variable is type 'dichotomous', only keep one row
  var_meta_data <- tbl$meta_data %>% filter(.data$variable %in% .env$variable)
  if (var_meta_data$summary_type %in% "dichotomous") {
    result <- result %>% filter(col_cat %in% var_meta_data$dichotomous_value[[1]])
  }
  
  result %>%
    select(-1) %>%
    set_names(paste0("add_stat_", seq_len(ncol(.))))
}

ci_function_con <- function(data, variable, by, ...) {
  data %>%
    arrange(.data[[by]]) %>%
    group_by(.data[[by]]) %>%
    group_map(
      ~.x[[variable]] %>% 
        t.test() %>% 
        broom::tidy() %>%
        mutate_at(vars(conf.low, conf.high), style_sigfig) %>%
        mutate(ci = str_glue("{conf.low} - {conf.high}")) %>%
        select(ci)
    ) %>%
    imap_dfr(~.x %>% mutate(col_name = paste0("add_stat_", .y))) %>%
    pivot_wider(values_from = ci, names_from = col_name)
}

```

# Statistical Analysis Plan

## Participation Statistics

The following table shows the statistics related to participation in the survey. The number of investigators contacted differs from the number of trials because more than one investigator may have been contacted for a single trial.

```{r}
unique_trials <- contacts %>% select(ID) %>% distinct() %>% summarise(total_trials=n())
unique_contacts <- contacts %>% select(email) %>% distinct() %>% summarise(unique_emails = n())
unique_opens <- cam_stats %>% filter(Opens>0) %>% distinct(emailaddress) %>% summarise(unique_opens=n())
unique_trial_opens <- cam_stats %>% filter(Opens>0) %>% distinct(ID) %>% summarise(unique_trial_opens=n())
part_trials <- df %>% select(ID) %>% distinct() %>% summarise(participating_trials = n())

statistics <- cbind(unique_contacts,unique_trials,unique_opens,unique_trial_opens,part_trials)

statistics %>% 
  mutate(
    investigator_open_rate = round(100*unique_opens/unique_emails,1),
    trial_open_rate = round(100*(unique_trial_opens/total_trials),1),
    participation_rate=round(100*participating_trials/total_trials,1)) %>%
  select(unique_emails,unique_opens,investigator_open_rate,total_trials,unique_trial_opens,trial_open_rate,participating_trials,participation_rate) %>% 
  rename(
    "Investigators (N)" = unique_emails,
    "Trials (N)" = total_trials,
    "Investigators who opened email (N)" = unique_opens,
    "Trials whose email was opened (N)" = unique_trial_opens,
    "Participating Trials (N)" = participating_trials,
    "Investigator open rate (%)" = investigator_open_rate,
    "Trial open rate (%)" = trial_open_rate,
    "Participation rate (%)" =participation_rate
  ) %>% 
  flextable() %>% 
  set_caption(.,"Table 1 : Participation Statistics")
```

## Included trial characteristics

First, we look at the characteristics of all trials that were in the sampling frame.

```{r}
trial_details %>% 
  semi_join(.,contacts,by="ID") %>% 
  select(ID,sponsor_type,industry_funded,trial_type,phase_type,trial_duration,sample_size,sites,multicentric,nations,condition_type) %>% 
  distinct(.,ID,.keep_all = T) %>% 
  select(-ID) %>% 
  tbl_summary(type = list (multicentric ~ "categorical",
                           industry_funded ~ "categorical"),
              label = list(sponsor_type ~ "Sponsor Type",
                           industry_funded ~ "Industry Funded Trial",
                           trial_type ~ "Type of Trial",
                           phase_type ~ "Trial Phase",
                           trial_duration ~ "Duration of Trial (Days)",
                           sample_size ~ "Sample Size (Total)",
                           sites ~ "Number of sites",
                           multicentric ~ "Multicentric Trial",
                           nations ~ "Country of Recruitment"
                           )) %>% 
  bold_labels()%>% 
  as_flex_table() %>% 
  set_caption(.,"Table 2 : Trial Characteristics")
```

Next we compare the trial characteristics of those studies which have responded versus those which have not.

```{r}
part_trials <- df %>% select(ID) %>% mutate(participant="Yes")


trial_details %>% 
  semi_join(.,contacts,by="ID") %>% 
  select(ID,sponsor_type,industry_funded,trial_type,phase_type,trial_duration,sample_size,sites,multicentric,nations,condition_type) %>% 
  distinct(.,ID,.keep_all = T) %>% 
  left_join(.,part_trials,by="ID") %>% 
  mutate(participant = case_when(is.na(participant) ~ "No",
                                 TRUE~"Yes")) %>% 
  select(-ID) %>% 
  tbl_summary(by=participant,
              type = list (multicentric ~ "categorical",
                           industry_funded ~ "categorical"),
              label = list(sponsor_type ~ "Sponsor Type",
                           industry_funded ~ "Industry Funded Trial",
                           trial_type ~ "Type of Trial",
                           phase_type ~ "Trial Phase",
                           trial_duration ~ "Duration of Trial (Days)",
                           sample_size ~ "Sample Size (Total)",
                           sites ~ "Number of sites",
                           multicentric ~ "Multicentric Trial",
                           nations ~ "Country of Recruitment"
                           )) %>% 
  bold_labels()%>% 
  as_flex_table() %>% 
  set_caption(.,"Table 3 : Trial characteristics compared between trials which participated in the survey and those that did not.")
```

## EDC Adoption Rate

**EDC Adoption Rate (EAR)**: The primary outcome measure is EAR. This will be defined as the ratio of the number of CTRI registered trials that use an EDC with sophistication level 2 or more to the number of participating trials (unique CTRI registered trials for which investigators agreed to participate in the study). The proportion and its binomial 95% confidence interval will be reported.

The **EDC sophistication level** is defined as follows:

-   **Level 1:** There is a unique account and password for each user to access the online system.

-   **Level 2:** Sites enter subject visit data through a Web interface into electronic case report forms (eCRFs). The completion status of each eCRF for each subject can be tracked automatically online. The system provides an audit trail for all data entry and data modification.

-   **Level 3:** Data validation happens automatically when data are entered into the eCRF. The system will automatically log the user off after a period of inactivity.

-   **Level 4:** Subjects are randomized automatically.

-   **Level 5:** Subject recruitment can be tracked online for each site.

-   **Level 6:** The system allows tracking of medication inventory at the sites.

For a level to be considered complete, **all the questions** should be marked as **Yes**. If one of the questions is marked as No and a higher level is marked Yes then the **higher level** will be taken. For each unique trial we will therefore calculate the highest EDC sophistication level. If EDC is not used then sophistication level will be marked as missing.

```{r Calculate highest EDC Sophistication Rate}
df <- df %>% 
  mutate(level1 = case_when(feature_access == "Yes" ~ 1, TRUE ~ 0),
         level2 = case_when(feature_ecrf == "Yes" & feature_audit == "Yes" ~ 1, TRUE ~ 0),
         level3 = case_when(feature_validation == "Yes" & feature_security == "Yes" ~ 1, TRUE ~ 0),
         level4 = case_when(feature_randomize == "Yes" ~ 1, TRUE ~ 0),
         level5 = case_when(feature_tracking == "Yes" ~ 1, TRUE ~ 0),
         level6 = case_when(feature_med_inventory == "Yes" ~ 1, TRUE ~ 0)
         ) %>% 
  select(serial,level1:level6) %>% 
  pivot_longer(cols=-serial, names_to = "Level", values_to = "Status") %>% 
  mutate(Level = str_remove(Level,"level")) %>% 
  mutate(Level = as.numeric(as.character(Level))) %>% 
  arrange(.,serial, Level) %>% 
  filter(Status == 1) %>% 
  group_by(serial) %>% 
  summarise(edc_sophistication_level = max(Level)) %>% 
  ungroup() %>% 
  left_join(df,.,by="serial")

```

```{r Add data on each individual level of EDC Sophistication}
df <- df %>% 
  mutate(level_1 = case_when(feature_access == "Yes" ~ "Yes", TRUE ~ "No"),
         level_2 = case_when(feature_ecrf == "Yes" & feature_audit == "Yes" ~"Yes", TRUE ~ "No"),
         level_3 = case_when(feature_validation == "Yes" & feature_security == "Yes" ~ "Yes", TRUE ~ "No"),
         level_4 = case_when(feature_randomize == "Yes" ~ "Yes", TRUE ~ "No"),
         level_5 = case_when(feature_tracking == "Yes" ~ "Yes", TRUE ~ "No"),
         level_6 = case_when(feature_med_inventory == "Yes" ~ "Yes", TRUE ~ "No")
         ) %>% 
  select(serial,level_1:level_6) %>% 
  left_join(df,.,by="serial")
```

```{r Add data on EDC Adoption Rate}
df <- df %>% 
  mutate(edc_adoption = case_when(edc_sophistication_level > 1 ~ "Yes",
                                  TRUE ~ "No"))
```
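The level rule above (take the highest completed level even if a lower level is incomplete, and treat trials with no completed level as missing) can be illustrated with a small worked example. This is a sketch on hypothetical completion flags, written in base R rather than with the tidyverse pipeline used in the analysis.

```r
# Hypothetical flags: 1 = every question for that level was answered "Yes"
levels_complete <- rbind(
  trial_a = c(1, 1, 0, 1, 0, 0),  # level 3 incomplete, level 4 complete
  trial_b = c(1, 0, 0, 0, 0, 0),  # only level 1 complete
  trial_c = c(0, 0, 0, 0, 0, 0)   # no level complete (e.g. EDC not used)
)
colnames(levels_complete) <- paste0("level", 1:6)

# Highest completed level per trial, NA when nothing is complete
highest_level <- apply(levels_complete, 1, function(x) {
  if (any(x == 1)) max(which(x == 1)) else NA_integer_
})

# Adoption requires sophistication level 2 or more; missing counts as "No"
edc_adoption <- ifelse(!is.na(highest_level) & highest_level > 1, "Yes", "No")
print(data.frame(highest_level, edc_adoption))
```

Here `trial_a` is assigned level 4 and counts as an adopter, `trial_b` stays at level 1, and `trial_c` has a missing level; neither of the latter two counts as an adopter.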

The following table shows the EDC adoption rate and the different levels in the trials for which responses were received in the survey.

```{r}
df %>% select(serial,edc_use,edc_adoption, level_1:level_6) %>% 
  pivot_longer(cols=-serial) %>% 
  group_by(name,value) %>% 
  summarise(count=n()) %>% 
  ungroup() %>% 
  pivot_wider(.,id_cols=name,names_from=value,values_from = count) %>% 
  mutate(total=Yes+No) %>% 
  mutate(proportion = round((Yes/total)*100,1)) %>% 
  rowwise() %>% 
  mutate(ci_lower = round((binconf(Yes,total,method="wilson")[2])*100,1),
         ci_upper = round((binconf(Yes,total,method="wilson")[3])*100,1)) %>% 
  select(name,total,Yes,proportion,ci_lower,ci_upper) %>% 
  mutate(name = toupper(str_replace(name,"\\_", " "))) %>% 
  mutate("CI" = paste("( ",ci_lower," - ", ci_upper,")")) %>% 
  select(name,total,Yes,proportion,CI) %>% 
  rename(Variable=name,Total = total, Percentage = proportion, "95% CI" = CI) %>% 
  flextable() %>% 
  set_caption(.,"Table 4 : EDC use and adoption rate with EDC sophistication levels among responding studies")
  
```
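The table above reports Wilson confidence intervals obtained with `Hmisc::binconf()`. As a cross-check that needs no extra packages, the same kind of interval can be sketched with `prop.test()` from base R, which returns the Wilson score interval when the continuity correction is switched off. The counts below are hypothetical and are not survey results.

```r
# Hypothetical counts: 30 of 120 responding trials meet the adoption definition
adopters <- 30
total <- 120

# prop.test() without continuity correction gives the Wilson score interval
wilson <- as.numeric(prop.test(adopters, total, correct = FALSE)$conf.int)

ear <- round(100 * adopters / total, 1)
ci <- round(100 * wilson, 1)
cat("EAR:", ear, "% (95% CI", ci[1], "-", ci[2], ")\n")
```

Unlike the simple Wald interval, the Wilson interval is not symmetric around the observed proportion, which makes it better behaved for small samples and proportions near 0 or 1.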

The following table shows the breakdown of key trial characteristics by EDC adoption status. Groups were compared using the Chi-square test for categorical variables and the Wilcoxon rank-sum test for continuous variables.

```{r EDC Adoption by trial variables}

df %>% 
  select(edc_adoption,sponsor_type,industry_funded,trial_type,phase_type,trial_duration,sample_size,sites,multicentric,nations,ctu_access) %>%
  tbl_summary(by = edc_adoption,
              type = list (multicentric ~ "categorical",
                           industry_funded ~ "categorical",
                           ctu_access ~ "categorical"),
              label = list(sponsor_type ~ "Sponsor Type",
                           industry_funded ~ "Industry Funded Trial",
                           trial_type ~ "Type of Trial",
                           phase_type ~ "Trial Phase",
                           trial_duration ~ "Duration of Trial (Days)",
                           sample_size ~ "Sample Size (Total)",
                           sites ~ "Number of sites",
                           multicentric ~ "Multicentric Trial",
                           nations ~ "Country of Recruitment",
                           ctu_access ~ "Access to Institutional CTU"
                           )) %>% 
  add_stat(
    fns = list(all_categorical() ~ ci_function_cat,
               all_continuous() ~ ci_function_con),
    location = all_categorical(FALSE) ~ "level"
  ) %>% 
  modify_header(starts_with("add_stat_") ~ "**95% CI**") %>%
  # move columns to align CIs with central estimates
  modify_table_body(~.x %>% relocate(add_stat_1, .after = stat_1)) %>% 
  add_n() %>% 
  add_p() %>% 
  bold_labels() %>% 
  add_overall() %>% 
  as_flex_table()%>% 
  set_caption(.,"Table 5 : Comparison of trial characteristics between trials that adopted an EDC and those that did not")

```
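The two tests named above have direct base-R counterparts. The sketch below runs them on simulated data with hypothetical values; it is meant only to show what each test compares. Whether `add_p()` applies a continuity correction or switches to an exact test for sparse tables depends on its defaults, so the p-values in the table need not match a manual call exactly.

```r
set.seed(1)

# Simulated, hypothetical data: 20 adopters and 20 non-adopters
toy <- data.frame(
  edc_adoption    = rep(c("Yes", "No"), each = 20),
  industry_funded = c(rep(c("Yes", "No"), times = c(14, 6)),
                      rep(c("Yes", "No"), times = c(5, 15))),
  sample_size     = c(rpois(20, 200), rpois(20, 100))
)

# Categorical variable: Chi-square test on the 2 x 2 contingency table
chi <- chisq.test(table(toy$industry_funded, toy$edc_adoption), correct = FALSE)

# Continuous variable: Wilcoxon rank-sum test (normal approximation because of ties)
wil <- wilcox.test(sample_size ~ edc_adoption, data = toy, exact = FALSE)

cat("Chi-square p-value:", format.pval(chi$p.value, digits = 3), "\n")
cat("Wilcoxon p-value:  ", format.pval(wil$p.value, digits = 3), "\n")
```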

## Influence of trial parameters on EAR

To determine the influence of the trial parameters on EAR, we will use a logistic regression model where the dependent variable will be EDC adoption with EDC sophistication level 2 or more (modeled as Yes or No). Independent variables will be:\
1. Trial sponsor: Industry or Investigator-Initiated. Trials in which the primary sponsor is a pharmaceutical company or device manufacturer will be considered industry-sponsored, and the rest will be considered investigator-initiated.\
2. Trial sample size: Total trial sample size will be modeled as a continuous variable. To relax the linearity assumption, this will be expanded using a restricted cubic spline with 3 knots.\
3. Trial sites: The number of sites will also be modeled as a continuous variable. Again to relax the linearity assumptions, the model term will be expanded using a restricted cubic spline with three knots.

Interactions will be tested in an omnibus model containing all interaction terms. A Wald test will be used to determine the significance of any interaction. Odds ratios with 95% confidence intervals will be reported.
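
As an illustration of what relaxing the linearity assumption with a three-knot restricted cubic spline involves, the sketch below uses only base R. It is not the analysis code: the data are simulated, the knot positions (10th, 50th and 90th percentiles) are an assumption of this sketch, `splines::ns()` is swapped in for `rms::rcs()`, and a likelihood ratio test stands in for the Wald test. A natural cubic spline with one interior knot and two boundary knots is linear beyond the outer knots, so the linear model is nested within it.

```r
library(splines)  # natural cubic splines; ships with base R

set.seed(1)
n <- 300
# Simulated (hypothetical) trial data, for illustration only
toy <- data.frame(sample_size = round(exp(rnorm(n, mean = 5, sd = 1))))
toy$adopted <- rbinom(n, 1, plogis(-3 + 0.5 * log(toy$sample_size)))

# Assumed knot positions: 10th, 50th and 90th percentiles of sample size
knots3 <- quantile(toy$sample_size, c(0.10, 0.50, 0.90))

# Model with a purely linear effect of sample size
fit_linear <- glm(adopted ~ sample_size, data = toy, family = binomial)

# Model with a three-knot natural (restricted) cubic spline: 2 coefficients
fit_spline <- glm(
  adopted ~ ns(sample_size, knots = knots3[2], Boundary.knots = knots3[c(1, 3)]),
  data = toy, family = binomial
)

# Likelihood ratio test of the single non-linear term (1 degree of freedom)
lr_test <- anova(fit_linear, fit_spline, test = "Chisq")
print(lr_test)
```

With three knots the spline adds only one coefficient beyond the linear term, which keeps the test of non-linearity to a single degree of freedom.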

```{r}
dd <- datadist(df)
options(datadist="dd")
model1 <- lrm(edc_adoption~rcs(sample_size,3)*(sites+industry_funded),data=df,x = T,y=T,linear.predictors = T)
anova(model1)
```

As the ANOVA above shows, the Wald tests for the non-linear terms and for the interactions are not significant. Hence we present a simplified model without the interaction terms and without the non-linear terms. The table below shows the results of this logistic regression analysis.

```{r}
df <- df %>% mutate(edc_adoption = as.factor(edc_adoption)) 
model1 <- glm(edc_adoption~sample_size+sites+industry_funded,data=df, family=binomial)
tbl_regression(model1,exponentiate = T,intercept=T) %>% 
  add_nevent() %>% 
  add_n() %>% 
  bold_labels() %>% 
  as_flex_table() %>% 
  set_caption(.,"Table 6 : Multivariable analysis of factors influencing EDC use")
```
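
The odds ratios in a table of this kind are the exponentiated model coefficients. The following base-R sketch shows that relationship on simulated data. The data frame `toy` and its effect sizes are hypothetical, and the intervals are Wald-type limits from `confint.default()`, which may differ slightly from the intervals printed in the table above.

```r
set.seed(42)
n <- 400
# Simulated (hypothetical) data with the same variable names as the model above
toy <- data.frame(
  sample_size = round(runif(n, 20, 1000)),
  sites = sample(1:20, n, replace = TRUE),
  industry_funded = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
lp <- -1.5 + 0.001 * toy$sample_size + 0.05 * toy$sites +
  0.8 * (toy$industry_funded == "Yes")
toy$edc_adoption <- factor(ifelse(rbinom(n, 1, plogis(lp)) == 1, "Yes", "No"))

# With a two-level factor response, glm() treats the first level ("No")
# as failure and models the probability of the other level ("Yes")
fit <- glm(edc_adoption ~ sample_size + sites + industry_funded,
           data = toy, family = binomial)

# Odds ratios and Wald-type 95% confidence limits on the odds-ratio scale
or_table <- cbind(OR = exp(coef(fit)), exp(confint.default(fit)))
print(round(or_table, 3))
```

Because sample size and number of sites enter as continuous terms, each odds ratio is per one-unit increase (one participant or one site).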

## EDC Sophistication Level

We will provide data on the median EDC sophistication levels as well as a plot showing the proportion of CTRI registered trials with different levels of EDC sophistication. Further visualization and analysis will also explore the association between trial sample size, number of trial sites, and type of trial sponsorship with EDC sophistication.

```{r}
df %>% select(edc_sophistication_level) %>% 
  rename("Highest EDC sophistication Level" = edc_sophistication_level) %>% 
  tbl_summary() %>%
  as_flex_table() %>% 
  set_caption(.,"Table 7 : Highest level of EDC sophistication")

df %>% select(edc_sophistication_level) %>% 
  group_by(edc_sophistication_level) %>% 
  summarise(count=n()) %>% 
  filter(!is.na(edc_sophistication_level)) %>% 
  mutate(edc_sophistication_level=as.factor(edc_sophistication_level)) %>% 
  ggplot(., aes(x = edc_sophistication_level, y = count)) +
  geom_bar(stat = "identity") +
  geom_label(aes(label = count)) +
  labs(title = "Figure 1: Highest EDC sophistication Level",
       x = "EDC Sophistication Level",
       y = "Number of Trials")


```

The following table shows the univariable analysis of the factors associated with EDC sophistication level. The level is dichotomized into two categories (level 6 versus levels 1 to 5).

```{r}
df %>% 
  select(edc_sophistication_level,sponsor_type,industry_funded,trial_type,phase_type,trial_duration,sample_size,sites,nations,multicentric,ctu_access) %>% 
  mutate(edc_sophistication_level_type = case_when(edc_sophistication_level == 6 ~ "Level 6",
                                                   edc_sophistication_level <6 ~ "Level 1 - 5")) %>% 
  select(-edc_sophistication_level) %>% 
  tbl_summary(by=edc_sophistication_level_type,
              type = list (multicentric ~ "categorical",
                           industry_funded ~ "categorical",
                           ctu_access ~ "categorical"),
              label = list(sponsor_type ~ "Sponsor Type",
                           industry_funded ~ "Industry Funded Trial",
                           trial_type ~ "Type of Trial",
                           phase_type ~ "Trial Phase",
                           trial_duration ~ "Duration of Trial (Days)",
                           sample_size ~ "Sample Size (Total)",
                           sites ~ "Number of sites",
                           multicentric ~ "Multicentric Trial",
                           nations ~ "Country of Recruitment",
                           ctu_access ~ "Access to Institutional CTU"
                           )) %>% 
  add_p() %>%
  bold_labels() %>% 
  as_flex_table() %>% 
  set_caption(.,"Table 8 : Univariable analysis of factors associated with EDC sophistication level")
```

## EAR Time trends

For unique responding CTRI-registered trials, we will create a subset containing trials registered on or after 1st January 2010. From this subset, we will then aggregate the EAR for each year using the EAR definition given above. This will be shown graphically using a bar plot or a dot plot, with one bar or point for each year. Because the trials are independent of one another, we will not use a line plot. EAR will be compared between two time periods: period 1 from 1st January 2015 to 31st December 2019 and period 2 from 1st January 2010 to 31st December 2014. Given that most randomized trials will be completed within 10 years, we expect few open clinical trials registered before 2010 to be available for analysis. However, if more than 30 such trials are available, we will also analyze an earlier period, i.e., between 1st January 2005 and 31st December 2009.

```{r}
df %>% select(serial,date_registered,edc_adoption) %>% 
  mutate(date_registered = dmy(date_registered)) %>% 
  mutate(reg_period = case_when(date_registered <= "2009-12-31" ~ "Pre 2010",
                                date_registered >= "2010-01-01" & date_registered <= "2014-12-31" ~ "Period 2 (2010 - 2014)",
                                date_registered >= "2015-01-01" & date_registered <= "2019-12-31" ~ "Period 1 (2015 - 2019)")) %>% 
  select(edc_adoption,reg_period) %>% 
  tbl_summary(by=reg_period) %>% 
  as_flex_table() %>% 
  set_caption(.,"Table 9 : EAR across time period")


df %>% select(edc_adoption,date_registered) %>%
  mutate(year_registered = year(dmy(date_registered))) %>% 
  select(year_registered,edc_adoption) %>% 
  group_by(year_registered,edc_adoption) %>% 
  summarise(count= n()) %>% 
  pivot_wider(id_cols = year_registered, names_from = edc_adoption, values_from=count,values_fill = 0) %>% 
  mutate(total = Yes+No) %>% 
  mutate(ear = round((Yes/total)*100,1)) %>% 
  mutate(ci_lower = round((binconf(Yes,total,method="wilson")[2])*100,1),
         ci_upper = round((binconf(Yes,total,method="wilson")[3])*100,1)) %>% 
  filter(Yes>1) %>% 
  mutate(year_registered=as.factor(year_registered)) %>% 
  ggplot(.,aes(x=year_registered,y=ear))+geom_point() + geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+geom_label(aes(label=ear))+
  ggtitle("Figure 2: Time Trend of EDC Adoption rate") + xlab("Year") + ylab("EDC Adoption Rate (%)")
```
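
For reference, the Wilson score interval requested above through `binconf(..., method = "wilson")` can be sketched in base R as follows. The helper `wilson_ci()` and the yearly counts are hypothetical and serve only to illustrate the calculation.

```r
# Wilson score interval for x successes out of n trials (vectorised)
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  cbind(lower = centre - half, upper = centre + half)
}

# Hypothetical yearly counts, for illustration only
yearly <- data.frame(
  year  = 2015:2018,
  yes   = c(3, 8, 12, 20),   # trials that adopted an EDC
  total = c(10, 20, 25, 30)  # responding trials registered that year
)

ci <- wilson_ci(yearly$yes, yearly$total)
yearly$ear      <- round(100 * yearly$yes / yearly$total, 1)
yearly$ci_lower <- round(100 * ci[, "lower"], 1)
yearly$ci_upper <- round(100 * ci[, "upper"], 1)
print(yearly)
```

Because the helper is vectorised, it returns one interval per row regardless of how the data frame is grouped.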

# Additional Analyses

Additionally, the survey collected data on alternative methods of data collection used in the trial, as well as a single-item question on the key perceived barriers to adoption of EDC in the trial.

```{r}
df %>%  filter(edc_use=="No") %>% 
  select(no_edc__spreadsheet,no_edc__mail,no_edc__fax) %>% 
  rename("Spreadsheet" = no_edc__spreadsheet,
         "Data sent by Email" = no_edc__mail,
         "Data sent by Fax" = no_edc__fax
         ) %>% 
  tbl_summary() %>% 
  add_n() %>% 
  bold_labels() %>% 
  as_flex_table()%>% 
  set_caption(.,"Table 10 : Data collection methods used when EDC was not used")
```

```{r}
df %>%
  filter(edc_use=="No") %>% 
  select(which_of_the_following_were_the_important_barrier_to_implementat__Complex.regulatory.requirements,
              which_of_the_following_were_the_important_barrier_to_implementat__Lack.of.technical.support,
              which_of_the_following_were_the_important_barrier_to_implementat__Costly.Software,
              which_of_the_following_were_the_important_barrier_to_implementat__Lack.of.motivation.among.clinical.trial.staff,
              which_of_the_following_were_the_important_barrier_to_implementat__Lack.of.user.friendly.software) %>% 
  mutate(across(starts_with("which_of"),~case_when(.=="X" ~ "Yes",TRUE~"No"))) %>%  
  rename("Complex Regulatory Requirements" = which_of_the_following_were_the_important_barrier_to_implementat__Complex.regulatory.requirements,
         "Lack of Technical Support" = which_of_the_following_were_the_important_barrier_to_implementat__Lack.of.technical.support,
         "Software Cost" =  which_of_the_following_were_the_important_barrier_to_implementat__Costly.Software,
         "Lack of Motivation among Clinical Trial Staff" = which_of_the_following_were_the_important_barrier_to_implementat__Lack.of.motivation.among.clinical.trial.staff,
         "Lack of user friendly Software" = which_of_the_following_were_the_important_barrier_to_implementat__Lack.of.user.friendly.software) %>%
  pivot_longer(cols=c(1:5),names_to = "Reason") %>% 
  group_by(Reason,value) %>% 
  summarise(count=n()) %>% 
  pivot_wider(id_cols = Reason, names_from=value,values_from=count) %>% 
  mutate(total = No+Yes,
         percent = round((Yes/total)*100,1)) %>% 
  mutate(ci_lower = round((binconf(Yes,total,method="wilson")[2])*100,1),
         ci_upper = round((binconf(Yes,total,method="wilson")[3])*100,1)) %>% 
  select(Reason,total,Yes,percent,ci_lower,ci_upper) %>% 
  mutate("CI" = paste0("(", ci_lower, " - ", ci_upper, ")")) %>% 
  select(Reason,total,Yes,percent,CI) %>% 
  arrange(-percent) %>% 
  rename(Total = total, Percentage = percent, "95% CI" = CI) %>% 
  flextable()%>% 
  set_caption(.,"Table 11 : Reasons for not using EDC")
  

```

Other reasons identified for not using EDC were:

```{r}
df %>%
  filter(edc_use=="No") %>% 
  filter(which_of_the_following_were_the_important_barrier_to_implementat__Other!="") %>% 
  select(which_of_the_following_were_the_important_barrier_to_implementat__Other) %>% 
  distinct(.) %>% 
  rename("Reasons" = which_of_the_following_were_the_important_barrier_to_implementat__Other) %>% 
  flextable() %>% 
  set_caption(.,"Table 12 : Free text responses to reasons for not using EDC in trial")
```

Finally, two additional questions asked whether the trial center had access to a CTU and to an IRB. We will evaluate these data in relation to EDC use.

```{r}
df %>% 
  mutate(resource = case_when(ctu_access == "Yes" & irb == "Yes" ~ "Both",
                              ctu_access == "Yes" & irb == "No" ~ "Only CTU",
                              ctu_access == "No" & irb == "Yes" ~ "Only IRB",
                              TRUE ~ "None")) %>% 
  select(edc_adoption,resource) %>% 
  tbl_summary(by=edc_adoption) %>% 
  add_n() %>% 
  bold_labels() %>% 
  add_p() %>% 
  as_flex_table() %>% 
  set_caption(.,"Table 13 : EDC adoption by CTU and IRB availability")

```
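
The four-way classification above can also be sketched in base R. This is a minimal sketch rather than the analysis code: the helper `classify_resource()` is hypothetical, and it keeps missing answers as `NA` instead of assigning them to "None", which is an assumption about how missing responses might be handled.

```r
# Classify each trial by access to a CTU and an IRB.
# Any missing answer yields NA rather than "None" (assumption of this sketch).
classify_resource <- function(ctu_access, irb) {
  ctu     <- ctu_access == "Yes"
  has_irb <- irb == "Yes"
  ifelse(ctu & has_irb, "Both",
  ifelse(ctu & !has_irb, "Only CTU",
  ifelse(!ctu & has_irb, "Only IRB", "None")))
}

# Hypothetical responses, for illustration only
ctu_answers <- c("Yes", "Yes", "No", "No", NA)
irb_answers <- c("Yes", "No", "Yes", "No", "Yes")
resource <- classify_resource(ctu_answers, irb_answers)
print(table(resource, useNA = "ifany"))
```

Keeping the missing responses separate makes it possible to see how many trials would otherwise be counted as having neither resource.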


# Industry Sponsored trials

The percentage of industry sponsored trials by each year of registration is shown in the figure below. 


```{r}
trial_details %>% 
  select(sponsor_type,date_registered) %>%
  mutate(date_registered = dmy(date_registered)) %>% 
  mutate(year = year(date_registered)) %>% 
  group_by(year,sponsor_type) %>% 
  summarise(n = n()) %>% 
  ungroup() %>% 
  pivot_wider(id_cols=year,names_from = sponsor_type,values_from = n) %>%
  rowwise() %>% 
  mutate(total = sum(Other,Industry,Institutional,Governmental,na.rm=T)) %>% 
  mutate(industry_prop = round(Industry/total*100,1)) %>% 
  filter(!is.na(year)) %>% 
  filter(!is.na(industry_prop)) %>% 
  mutate(year = as.factor(year)) %>% 
  ggplot(.,aes(y=industry_prop,x=year))+geom_bar(stat="identity",fill="steelblue")+
  geom_text(aes(label=industry_prop),vjust = 1.6, color="white")+
  theme_minimal()+
  labs(x = "Year of Trial Registration", y = "Percentage of Industry Sponsored Trials", title = "Figure 3: Year-wise change in percentage of Industry sponsored trials")
```



# Packages used

1.  R : R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL <https://www.R-project.org/>.

2.  Tidyverse : Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, <https://doi.org/10.21105/joss.01686>.

3.  gtsummary : Daniel D. Sjoberg, Michael Curry, Margie Hannum, Joseph Larmarange, Karissa Whiting and Emily C. Zabor (2021). gtsummary: Presentation-Ready Data Summary and Analytic Result Tables. <https://github.com/ddsjoberg/gtsummary>, <http://www.danieldsjoberg.com/gtsummary/>.

4.  Hmisc : Frank E Harrell Jr, with contributions from Charles Dupont and many others. (2021). Hmisc: Harrell Miscellaneous. <https://hbiostat.org/R/Hmisc/>, <https://github.com/harrelfe/Hmisc/>.

5.  flextable : flextable: Functions for Tabular Reporting. <https://ardata-fr.github.io/flextable-book/>, <https://davidgohel.github.io/flextable/>.

6.  rms : Frank E Harrell Jr (2021). rms: Regression Modeling Strategies. <https://hbiostat.org/R/rms/>, <https://github.com/harrelfe/rms>.

7.  ggplot2 : H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

8.  lubridate : Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL <https://www.jstatsoft.org/v40/i03/>.
