Part One: Getting the random sample

This analysis pulls a random sample of relevant data availability statements from the larger dataset used for the previous PLOS study. First read in the dataset.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
setwd("Z:/Publications/PLOS data availability analysis/Final coding")
full_df <- read_csv(file = "final_full_coded_set.csv")
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   id = col_character(),
##   publication_date.x = col_character(),
##   data_statement = col_character(),
##   statement_category = col_character(),
##   notes = col_character(),
##   repository = col_character()
## )

Subset to get just the statements that list a specific repository, meaning they could have been coded as either "combination" or "repository".

#how many list anything in the repository field?
sum(!is.na(full_df$repository)) #length() on a subsetted data frame counts columns, so sum the logical vector instead
#how many list website only as the sharing mechanism?
nrow(subset(full_df, repository == "website"))
## [1] 286
#how many list a website and something else too?
length(grep("website", full_df$repository))
## [1] 329

There are 8702 statements in total. Of these, 286 list ONLY a non-repository website (i.e., a website and no repository), and 329 mention a website at all, whether alone or alongside a repository. Let's go ahead and use all 329 as the website group.

website <- full_df[grepl("website", full_df$repository), ] #everything that mentions a website
repo <- full_df[(!is.na(full_df$repository) & !grepl("website", full_df$repository)), ] #everything that lists a repository but no website

Now let’s find our necessary sample size, using a 95% confidence interval with a margin of error equal to 5%. I’m using this calculation: http://documents.routledge-interactive.s3.amazonaws.com/9780415628129/Chapter%209%20-%20Determining%20sample%20size%20final_edited.pdf

p = 0.5 #assumed proportion; 0.5 is the conservative choice that maximizes the required sample size
d = 0.05 #margin of error
q = 0.5 #1 - p
n = (4*p*q)/(d^2) #base sample size for a large population; works out to 400

n_repo = nrow(repo)
n_website = nrow(website)

#apply the finite population correction for each group
repo_sample = n/((1 + (n - 1)/n_repo))
website_sample = n/((1 + (n - 1)/n_website))

So rounding up, we need 382 repo statements and 181 website statements. Now we will get a random sample of each, adding an additional 10 from each type for testing interrater reliability. These will be saved as CSVs and given to the two raters for coding. I really should have set the seed before taking the random sample, but oh well, next time!

repo_sample <- repo[sample(nrow(repo), 382), ]
website_sample <- website[sample(nrow(website), 182), ]

#write.csv(file = "repo_for_coding.csv", repo_sample)
#write.csv(file = "website_for_coding.csv", website_sample)

Coder instructions are as follows:

- Visit the URL(s) or repository(ies) provided in the data availability statement. If necessary, search by any accession number or other criteria mentioned in the data availability statement.
- In the column labeled "available?", use "yes" to specify that you could locate all data described in the data availability statement, and use "no" to specify that you could not locate ANY of the data described in the statement. Use "partial" if part but not all of the data could be located - for example, if more than one dataset is described but only one is on the website, or if more than one repository or website is listed and some of the data cannot be found.
- If both a website AND a repository are listed, only note the availability for the source of interest for each dataset. In other words, if an entry in the "website" CSV also lists a repository, only record whether the data are available on the website, not in the repository.

Here’s a thing to make it easier to get a whole list of accsesion numbers given a range.

#lets <- "ln" #letters that start the accession numbers
#num1 <- 608994 #first number
#num2 <- 609187 #last number
#nums <- seq(num1, num2, 1)
#print(length(nums)) #tells you how many records to expect
#for (i in seq_along(nums)) {
#  writeLines(paste(lets, nums[i], " OR", sep = ""))
#}
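
Uncommented and run with the values above, this prints 194 lines of the form ln608994 OR, presumably ready to paste into a repository search query.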

Part Two: Analyzing the Coded Data

Read in the coded data, then combine them into one data frame.

repo_coded <- read.csv(file = "repo_for_coding_full.csv")
web_coded <- read.csv(file = "website_for_coding_full.csv")
full_coded <- rbind(repo_coded, web_coded)
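
A quick sanity check can't hurt here; a minimal sketch, assuming the two CSVs share the same columns (which rbind() requires) and include the group column used in the tests below:

#str(full_coded) #confirm the two files stacked cleanly
#table(full_coded$group) #row counts per group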

First we’ll deal with our pilot test data to see if we have an adequate level of agreement.

pilot_coded <- read.csv(file = "pilot_coding.csv")

library(irr)
## Loading required package: lpSolve
kappa2(pilot_coded[2:3]) #columns 2 and 3 hold the two raters' codes
##  Cohen's Kappa for 2 Raters (Weights: unweighted)
## 
##  Subjects = 20 
##    Raters = 2 
##     Kappa = 1 
## 
##         z = 6.59 
##   p-value = 4.29e-11

There is perfect agreement between the two coders.

Let’s do some chi square tests. For the purposes of this analysis, partial availability will count as not available, since having only part of the data is not really adequate for reproducibility. Anything that is not in English will be excluded.

library(plyr)
## Warning: package 'plyr' was built under R version 3.3.3
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
full_coded$avail_recode <- full_coded$available. #read.csv() turned the "available?" header into "available."
full_coded$avail_recode <- revalue(full_coded$avail_recode, c(partial = "no")) #count partial availability as unavailable
full_coded$avail_recode <- factor(full_coded$avail_recode, levels = c("yes", "no"))
full_coded <- subset(full_coded, avail_recode == "yes" | avail_recode == "no") #remove the NAs, which were the non-English entries

Here is a function that runs the chi-square test and also calculates the effect size and power, given just the two variables. Optionally you can set a significance level; by default it is 0.05. The effect size is Cohen's w, the square root of the chi-square statistic divided by N (equivalent to the phi coefficient for a 2x2 table).

custom.chisq <- function(var1, var2, sig_level = 0.05){
  require(pwr) #needed for the power calculation
  test_table <- table(var1, var2) #cross-tabulate the two variables, since chisq.test() wants a table
  n <- sum(test_table) #total number of observations
  test_results <- chisq.test(test_table) #run the chi-square test
  effect_size <- sqrt(test_results$statistic/n) #Cohen's w
  pwr_results <- pwr.chisq.test(w = effect_size, N = n, sig.level = sig_level, df = test_results$parameter) #post-hoc power
  print(test_table)
  print(test_results)
  print(pwr_results)
  if(length(test_table) != 4) {
    writeLines("OR calculation not valid for these variables")
  }
  else {
    #tables index column-wise: [1] and [2] are the first column, [3] and [4] the second
    OR <- (test_table[1]/test_table[3])/(test_table[2]/test_table[4])
    writeLines(paste("Odds ratio = ", OR))
  }
}

Conduct the chi-square analysis on the two groups.

custom.chisq(full_coded$group, full_coded$avail_recode)
## Loading required package: pwr
## Warning: package 'pwr' was built under R version 3.3.3
##          var2
## var1      yes  no
##   repo    330  46
##   website 111  65
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  test_table
## X-squared = 43.995, df = 1, p-value = 3.293e-11
## 
## 
##      Chi squared power calculation 
## 
##               w = 0.2823129
##               N = 552
##              df = 1
##       sig.level = 0.05
##           power = 0.9999985
## 
## NOTE: N is the number of observations
## 
## Odds ratio =  4.20094007050529

This is a statistically significant difference with a moderate effect size and high power. The odds ratio tells us that the odds of the data being available were 4.2 times higher for repositories (330/46, about 7.2) than for websites (111/65, about 1.7). Let's see this in a proportion table.

prop.table(table(full_coded$group, full_coded$avail_recode), margin = 1)
##          
##                 yes        no
##   repo    0.8776596 0.1223404
##   website 0.6306818 0.3693182

So about 88% of repository data were available, compared with only 63% of data on websites.

Let’s deal with time. Are newer articles more likely to have available data?

full_coded$publication_date.x <- as.POSIXct(full_coded$publication_date.x) #convert the pub date column to a date/time object

full_coded$pub_year <- as.factor(format(full_coded$publication_date.x,'%Y')) #extract just the year
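
Before running the test, a quick check that every date parsed; a minimal sketch (an NA year would mean as.POSIXct() failed on that row):

#table(full_coded$pub_year, useNA = "ifany")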

custom.chisq(full_coded$avail_recode, full_coded$pub_year)
##      var2
## var1  2014 2015 2016
##   yes   77  243  121
##   no    31   66   14
## 
##  Pearson's Chi-squared test
## 
## data:  test_table
## X-squared = 13.236, df = 2, p-value = 0.001336
## 
## 
##      Chi squared power calculation 
## 
##               w = 0.1548513
##               N = 552
##              df = 2
##       sig.level = 0.05
##           power = 0.9131437
## 
## NOTE: N is the number of observations
## 
## OR calculation not valid for these variables

There is a statistically significant difference across publication years, but the effect size is very small.

What about having a DOI? The idea is that a persistent unique identifier should make it more likely that the data can still be found, so is that true?

full_coded$has_doi <- as.factor(ifelse(grepl("doi", full_coded$data_statement), "doi_yes", "doi_no")) #search the data statement to see if it mentions a doi and create a new column accordingly
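
Note that grepl() is case-sensitive by default, so a statement that only spells it "DOI" would land in doi_no here. A case-insensitive variant is sketched below, though switching would change the counts that follow:

#full_coded$has_doi <- as.factor(ifelse(grepl("doi", full_coded$data_statement, ignore.case = TRUE), "doi_yes", "doi_no"))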

custom.chisq(full_coded$has_doi, full_coded$avail_recode)
##          var2
## var1      yes  no
##   doi_no  302  96
##   doi_yes 139  15
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  test_table
## X-squared = 13.412, df = 1, p-value = 0.0002501
## 
## 
##      Chi squared power calculation 
## 
##               w = 0.1558739
##               N = 552
##              df = 1
##       sig.level = 0.05
##           power = 0.9556452
## 
## NOTE: N is the number of observations
## 
## Odds ratio =  0.339478417266187

Having a DOI is significantly associated with greater availability, though the effect size is pretty small. (The odds ratio of 0.34 is for doi_no relative to doi_yes: statements without a DOI had about a third the odds of being available.)

But then again, being in a repository is significantly associated with having a DOI, so this is probably not a meaningful comparison on its own.

custom.chisq(full_coded$has_doi, full_coded$group)
##          var2
## var1      repo website
##   doi_no   252     146
##   doi_yes  124      30
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  test_table
## X-squared = 14.348, df = 1, p-value = 0.0001519
## 
## 
##      Chi squared power calculation 
## 
##               w = 0.1612255
##               N = 552
##              df = 1
##       sig.level = 0.05
##           power = 0.9662235
## 
## NOTE: N is the number of observations
## 
## Odds ratio =  0.417587273530711

What about within the repository group? Does having a DOI matter there? Apparently not: there is no significant difference in availability between repository entries with a DOI and those without.

repo_subset <- subset(full_coded, group == "repo")

custom.chisq(repo_subset$has_doi, repo_subset$avail_recode)
##          var2
## var1      yes  no
##   doi_no  217  35
##   doi_yes 113  11
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  test_table
## X-squared = 1.5096, df = 1, p-value = 0.2192
## 
## 
##      Chi squared power calculation 
## 
##               w = 0.06336247
##               N = 376
##              df = 1
##       sig.level = 0.05
##           power = 0.2330066
## 
## NOTE: N is the number of observations
## 
## Odds ratio =  0.60353982300885

However, among website statements, having a DOI is associated with greater availability.

web_subset <- subset(full_coded, group == "website")

custom.chisq(web_subset$has_doi, web_subset$avail_recode)
##          var2
## var1      yes no
##   doi_no   85 61
##   doi_yes  26  4
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  test_table
## X-squared = 7.4683, df = 1, p-value = 0.00628
## 
## 
##      Chi squared power calculation 
## 
##               w = 0.2059934
##               N = 176
##              df = 1
##       sig.level = 0.05
##           power = 0.780195
## 
## NOTE: N is the number of observations
## 
## Odds ratio =  0.21437578814628
fisher.test(table(web_subset$has_doi, web_subset$avail_recode)) #double-check with Fisher's exact test, since the doi_yes/not-available cell count is < 5
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(web_subset$has_doi, web_subset$avail_recode)
## p-value = 0.00325
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.05211725 0.66801746
## sample estimates:
## odds ratio 
##  0.2159573
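
Fisher's exact test agrees with the chi-square result: the estimated odds ratio (about 0.22) has a 95% confidence interval that excludes 1, so the association holds even with one cell count under 5.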