Load data

## 
## -- Column specification --------------------------------------------------------
## cols(
##   indicator = col_character(),
##   `Identifier for the researcher credit ratings being evaluated` = col_character()
## )
## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_double(),
##   indicator = col_double(),
##   desc = col_character(),
##   concept = col_double(),
##   datacur = col_double(),
##   analyses = col_double(),
##   funding = col_double(),
##   investig = col_double(),
##   methods = col_double(),
##   admin = col_double(),
##   resources = col_double(),
##   software = col_double(),
##   supervis = col_double(),
##   validat = col_double(),
##   viz = col_double(),
##   writ.orig = col_double(),
##   writ.edit = col_double(),
##   rater = col_character(),
##   sum = col_double()
## )
| Variable | Description |
|-----------|-------------|
| indicator | Identifier for the researcher credit ratings being evaluated |
| desc | Written description provided by researcher |
| concept | Conceptualization credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| datacur | Data Curation credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| analyses | Formal Analyses credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| funding | Funding Acquisition credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| investig | Investigation credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| methods | Methodology credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| admin | Project Administration credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| resources | Resources credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| software | Software credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| supervis | Supervision credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| validat | Validation credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| viz | Visualization credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| writ.orig | Writing – Original Draft credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| writ.edit | Writing – Review & Editing credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| rater | Person providing the rating. OG = original ratings from researcher; all others are coder identifiers |
| sum | Total amount of credit granted by rater |

Explorations

We could estimate an overall bias for each coder. In particular, we might expect the original authors (“OG”) to overestimate their contributions, but we might also expect variation among the independent raters.

Note: Another perspective is that the OG is the only one with true knowledge of their contributions, but is also a noisy communicator of those contributions: if they did not describe their contributions well, none of the raters will be able to reproduce the (presumably accurate) OG ratings.
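
One way to make the bias idea concrete (a minimal sketch, not the analysis run here, and assuming the lme4 package is available): fit a mixed-effects model to the long-format data in cdl and inspect the estimated rater effects, treating the 0/1/2 ratings as numeric for simplicity.

library(lme4)

# Sketch: rating ~ overall mean, with random intercepts for rater,
# contributor (indicator), and CRediT dimension. Treats the 0/1/2 scale as
# numeric; an ordinal model (e.g., ordinal::clmm) would be more faithful.
bias_fit <- lmer(rating ~ 1 + (1 | rater) + (1 | indicator) + (1 | dimension),
                 data = cdl)

# Per-rater deviations from the overall mean rating; a positive value for OG
# would be consistent with authors overestimating their own contributions.
ranef(bias_fit)$rater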

| rater | mean_rating | num0s | num1s | num2s |
|-------|-------------|-------|-------|-------|
| OB | 0.56 | 1584 | 328 | 524 |
| CS | 0.40 | 1821 | 245 | 370 |
| TS | 0.37 | 1846 | 272 | 318 |
| OG | 0.31 | 1861 | 395 | 180 |
| PF | 0.22 | 1911 | 513 | 12 |
| BP | 0.22 | 1935 | 475 | 26 |

By raw ratings, PF and BP use more 1s than other raters, while OB uses more 2s (and many fewer 0s than other raters).
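
For reference, the table above can be reproduced with something like the following (a sketch; the chunk that actually generated it is not echoed, but cdl is the long-format ratings data used throughout):

cdl %>%
  group_by(rater) %>%
  summarise(mean_rating = mean(rating),
            num0s = sum(rating == 0),
            num1s = sum(rating == 1),
            num2s = sum(rating == 2),
            .groups = "drop") %>%
  arrange(desc(mean_rating)) %>%
  kable(digits = 2)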

Let’s look at whether raters’ average ratings differ significantly from each other, in which case we may want to adjust their ratings by those averages.

cdl %>% group_by(rater) %>%
  tidyboot_mean(column = rating) %>%
  ggplot(aes(x=reorder(rater, mean), y=mean)) + 
    geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) + 
  theme_classic() + ylab("Mean Ratings") + xlab("Rater") + ylim(0,1)

The bootstrapped 95% CIs show that many raters differ significantly from one another: BP is similar to PF, and TS is similar to CS, but otherwise the intervals do not overlap.

Collapsing 1 and 2 ratings

Raters look much more similar if we collapse ‘2’ and ‘1’ ratings.

cdl %>% mutate(rating = ifelse(rating==2, 1, rating)) %>%
  group_by(rater) %>%
  tidyboot_mean(column = rating) %>%
  ggplot(aes(x=reorder(rater, mean), y=mean)) + 
    geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) + 
  theme_classic() + ylab("Mean Ratings") + xlab("Rater")

Reliability: 0, 1, and 2 ratings

Fleiss’ kappa (for categorical scales) computed on the original 0, 1, and 2 ratings, across all raters (OG and the 5 independent raters), is shown below for each category.

cdw <- cdl %>% pivot_wider(id_cols = c(indicator, dimension), names_from = rater, values_from = rating) 

# Fleiss' kappa (multi-rater generalization of Cohen's kappa for categorical data)
overall_fleiss = kappam.fleiss(cdw %>% select(-indicator, -dimension)) # .342, not .23 as in poster

kappa_per_dim <- tibble()
for(this_dim in unique(cdw$dimension)) {
  kappa = kappam.fleiss(cdw %>% filter(dimension==this_dim) %>% 
                          select(-indicator, -dimension))
  kappa_per_dim = bind_rows(kappa_per_dim, tibble(dimension = this_dim, kappa = kappa$value))
}

#cdw %>% 
#  select(-indicator, -dimension) %>%
#  psych::alpha()
kappa_per_dim %>%
  arrange(desc(kappa)) %>%
  kable(digits=2)
| dimension | kappa |
|-----------|-------|
| funding | 0.55 |
| writ.edit | 0.35 |
| admin | 0.28 |
| supervis | 0.27 |
| analyses | 0.26 |
| software | 0.22 |
| resources | 0.14 |
| investig | 0.12 |
| viz | 0.09 |
| datacur | 0.08 |
| concept | 0.04 |
| validat | 0.02 |
| writ.orig | 0.00 |
| methods | 0.00 |

Overall kappa with the 0, 1, and 2 ratings is 0.34 across all ratings (categories and raters). (Why is this higher than the .23 reported in the poster?) (Also, for ordinal data such as these ratings, would intraclass correlation coefficients not be more appropriate?)
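
As a rough follow-up to that last question, a minimal ICC sketch on the same wide data (assuming the irr package, which also provides kappam.fleiss, is attached) might look like this; note that it treats the 0/1/2 ratings as interval-scaled:

# Two-way, single-rater agreement ICC across the 6 raters
icc(cdw %>% select(-indicator, -dimension),
    model = "twoway", type = "agreement", unit = "single")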

Reliability: Collapsing 1 and 2 ratings

Does reliability improve much if we collapse ‘2’ and ‘1’ ratings? Kappas per category with collapsed ratings are shown below.

cdw01 <- cdl %>% mutate(rating = ifelse(rating==2, 1, rating)) %>%
  pivot_wider(id_cols = c(indicator, dimension), names_from = rater, values_from = rating) 

overall_fleiss01 = kappam.fleiss(cdw01 %>% select(-indicator, -dimension)) 

kappa_per_dim01 <- tibble()
for(this_dim in unique(cdw01$dimension)) {
  kappa = kappam.fleiss(cdw01 %>% filter(dimension==this_dim) %>% 
                          select(-indicator, -dimension))
  kappa_per_dim01 = bind_rows(kappa_per_dim01, tibble(dimension = this_dim, kappa = kappa$value))
}

kappa_per_dim01 %>%
  arrange(desc(kappa)) %>%
  kable(digits=2)
| dimension | kappa |
|-----------|-------|
| funding | 0.81 |
| writ.edit | 0.58 |
| admin | 0.50 |
| supervis | 0.40 |
| investig | 0.31 |
| analyses | 0.29 |
| software | 0.28 |
| resources | 0.21 |
| viz | 0.12 |
| datacur | 0.11 |
| concept | 0.04 |
| validat | 0.03 |
| methods | 0.02 |
| writ.orig | -0.01 |

Overall kappa with 1 and 2 ratings collapsed is 0.48 across all ratings (categories and raters), which isn’t that bad!

Cronbach’s alpha for the collapsed ratings is shown below:

cdw01 %>% 
  select(-indicator, -dimension) %>%
  psych::alpha()
## 
## Reliability analysis   
## Call: psych::alpha(x = .)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.85      0.86    0.84       0.5   6 0.0046 0.25 0.33      0.5
## 
##  lower alpha upper     95% confidence boundaries
## 0.84 0.85 0.86 
## 
##  Reliability if an item is dropped:
##    raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## BP      0.82      0.82    0.80      0.48 4.7   0.0058 0.0064  0.48
## CS      0.83      0.83    0.81      0.50 5.0   0.0056 0.0107  0.51
## OB      0.86      0.86    0.83      0.55 6.0   0.0046 0.0034  0.54
## OG      0.83      0.84    0.81      0.51 5.1   0.0054 0.0081  0.51
## PF      0.81      0.82    0.79      0.47 4.4   0.0060 0.0060  0.48
## TS      0.82      0.83    0.80      0.49 4.8   0.0058 0.0080  0.48
## 
##  Item statistics 
##       n raw.r std.r r.cor r.drop mean   sd
## BP 2436  0.79  0.79  0.75   0.68 0.21 0.40
## CS 2436  0.77  0.76  0.70   0.65 0.25 0.43
## OB 2436  0.68  0.66  0.55   0.51 0.35 0.48
## OG 2436  0.74  0.75  0.67   0.62 0.24 0.42
## PF 2436  0.82  0.83  0.79   0.73 0.22 0.41
## TS 2436  0.78  0.79  0.73   0.67 0.24 0.43
## 
## Non missing response frequency for each item
##       0    1 miss
## BP 0.79 0.21    0
## CS 0.75 0.25    0
## OB 0.65 0.35    0
## OG 0.76 0.24    0
## PF 0.78 0.22    0
## TS 0.76 0.24    0

Ratings by Dimension

Some dimensions have quite different average ratings from others (a strong argument for using IRT models, where items’ varying ‘difficulty’ can be estimated and taken into account). Intuitively, low-scoring items may be more diagnostic: e.g., if only a few people are non-zero on ‘analyses’ (35 1s, 2 2s), then a non-zero rating there reflects a greater contribution than a non-zero rating on ‘investig’, which most people contributed to (296 0s vs. 748 1s/2s).
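
Those per-dimension counts can be checked with something like this (a sketch over the long-format cdl):

# Counts of 0/1/2 ratings within each CRediT dimension
cdl %>%
  count(dimension, rating) %>%
  pivot_wider(names_from = rating, values_from = n, values_fill = 0) %>%
  kable()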

cdl %>% group_by(dimension) %>%
  tidyboot_mean(rating) %>%
  arrange(desc(mean)) %>%
  ggplot(aes(y=reorder(dimension, mean), x=mean)) + 
    geom_pointrange(aes(xmin=ci_lower, xmax=ci_upper)) + 
  theme_classic() + xlab("Mean Rating") + ylab("Dimension")

Correlation of ratings

First unadjusted, and then adjusted for each rater’s average rating. (Note: each ‘indicator’ level is a collaborator on the same large paper.)
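
The correlation figures themselves are generated elsewhere; a minimal sketch of both versions, correlating raters’ total credit per contributor (column names as in cdl), could look like this:

# Unadjusted: correlate raters' total credit per contributor
rater_totals <- cdl %>%
  group_by(indicator, rater) %>%
  summarise(total = sum(rating), .groups = "drop") %>%
  pivot_wider(names_from = rater, values_from = total)

cor(rater_totals %>% select(-indicator))

# Adjusted: center each rater's ratings on that rater's mean before summing,
# removing overall leniency/severity differences between raters.
rater_totals_adj <- cdl %>%
  group_by(rater) %>%
  mutate(rating_c = rating - mean(rating)) %>%
  ungroup() %>%
  group_by(indicator, rater) %>%
  summarise(total = sum(rating_c), .groups = "drop") %>%
  pivot_wider(names_from = rater, values_from = total)

cor(rater_totals_adj %>% select(-indicator))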

Polytomous IRT

We may actually want to estimate a separate model for each rater (keeping in mind that ‘OG’ is a different person for each contributor, though the original authors may share some bias due to authorship). This treats contributors as test-takers, and would give us an overall quality rating for each contributor (per rater).
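
As a starting point (a sketch only, using the mirt package, which is not otherwise used above), one could fit a unidimensional graded response model per rater, treating the 14 CRediT dimensions as items and each contributor as a test-taker:

library(mirt)

# Illustrative single-rater fit (here OG); repeat per rater to compare.
resp_og <- cdl %>%
  filter(rater == "OG") %>%
  pivot_wider(id_cols = indicator, names_from = dimension, values_from = rating)

grm_og <- mirt(resp_og %>% select(-indicator), model = 1, itemtype = "graded")

coef(grm_og, simplify = TRUE)   # item discriminations and category thresholds
head(fscores(grm_og))           # latent 'overall contribution' score per contributor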