Load data

## 
## -- Column specification --------------------------------------------------------
## cols(
##   indicator = col_character(),
##   `Identifier for the researcher credit ratings being evaluated` = col_character()
## )
## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_double(),
##   indicator = col_double(),
##   desc = col_character(),
##   concept = col_double(),
##   datacur = col_double(),
##   analyses = col_double(),
##   funding = col_double(),
##   investig = col_double(),
##   methods = col_double(),
##   admin = col_double(),
##   resources = col_double(),
##   software = col_double(),
##   supervis = col_double(),
##   validat = col_double(),
##   viz = col_double(),
##   writ.orig = col_double(),
##   writ.edit = col_double(),
##   rater = col_character(),
##   sum = col_double()
## )
| Variable | Description |
|-----------|-------------|
| indicator | Identifier for the researcher credit ratings being evaluated |
| desc | Written description provided by researcher |
| concept | Conceptualization credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| datacur | Data Curation credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| analyses | Formal Analyses credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| funding | Funding Acquisition credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| investig | Investigation credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| methods | Methodology credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| admin | Project Administration credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| resources | Resources credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| software | Software credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| supervis | Supervision credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| validat | Validation credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| viz | Visualization credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| writ.orig | Writing – Original Draft credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| writ.edit | Writing – Review & Editing credit ratings (0 = no role; 1 = minor role; 2 = major role) |
| rater | Person providing the rating. OG = original ratings from researcher; all others are coder identifiers |
| sum | Total amount of credit granted by rater |

Explorations

We could estimate an overall bias for each coder. In particular, we might expect the original authors (“OG”) to overestimate their contributions, but we might also expect variation among the independent raters.

Note: Another perspective is that the OG is the only one with true knowledge of their contributions, but is also a noisy communicator of those contributions: if they did not describe their contributions well, none of the raters will be able to reproduce the (presumably accurate) OG ratings.
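
One way to make the bias idea concrete (a minimal sketch, not the analysis run here, and assuming the lme4 package is available): fit a mixed-effects model to the long-format data in cdl and inspect the estimated rater effects, treating the 0/1/2 ratings as numeric for simplicity.

library(lme4)

# Sketch: rating ~ overall mean, with random intercepts for rater,
# contributor (indicator), and CRediT dimension. Treats the 0/1/2 scale as
# numeric; an ordinal model (e.g., ordinal::clmm) would be more faithful.
bias_fit <- lmer(rating ~ 1 + (1 | rater) + (1 | indicator) + (1 | dimension),
                 data = cdl)

# Per-rater deviations from the overall mean rating; a positive value for OG
# would be consistent with authors overestimating their own contributions.
ranef(bias_fit)$rater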

| rater | mean_rating | num0s | num1s | num2s |
|-------|-------------|-------|-------|-------|
| OB | 0.56 | 1584 | 328 | 524 |
| CS | 0.40 | 1821 | 245 | 370 |
| TS | 0.37 | 1846 | 272 | 318 |
| OG | 0.31 | 1861 | 395 | 180 |
| PF | 0.22 | 1911 | 513 | 12 |
| BP | 0.22 | 1935 | 475 | 26 |

By raw ratings, PF and BP use more 1s than other raters, while OB uses more 2s (and many fewer 0s than other raters).
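
For reference, the table above can be reproduced with something like the following (a sketch; the chunk that actually generated it is not echoed, but cdl is the long-format ratings data used throughout):

cdl %>%
  group_by(rater) %>%
  summarise(mean_rating = mean(rating),
            num0s = sum(rating == 0),
            num1s = sum(rating == 1),
            num2s = sum(rating == 2),
            .groups = "drop") %>%
  arrange(desc(mean_rating)) %>%
  kable(digits = 2)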

Let’s look at whether raters’ average ratings differ significantly from each other, in which case we may want to adjust their ratings by those averages.

cdl %>% group_by(rater) %>%
  tidyboot_mean(column = rating) %>%
  ggplot(aes(x=reorder(rater, mean), y=mean)) + 
    geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) + 
  theme_classic() + ylab("Mean Ratings") + xlab("Rater") + ylim(0,1)

The bootstrapped 95% CIs show that many raters differ significantly from one another: BP is similar to PF, and TS is similar to CS, but otherwise the intervals do not overlap.

Collapsing 1 and 2 ratings

Raters look much more similar if we collapse ‘2’ and ‘1’ ratings.

cdl %>% mutate(rating = ifelse(rating==2, 1, rating)) %>%
  group_by(rater) %>%
  tidyboot_mean(column = rating) %>%
  ggplot(aes(x=reorder(rater, mean), y=mean)) + 
    geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) + 
  theme_classic() + ylab("Mean Ratings") + xlab("Rater")

Reliability: 0, 1, and 2 ratings

Fleiss’ kappa (for categorical scales) computed on the original 0, 1, and 2 ratings, across all raters (OG and the 5 independent raters), is shown below for each category.

cdw <- cdl %>% pivot_wider(id_cols = c(indicator, dimension), names_from = rater, values_from = rating) 

# Fleiss' kappa (multi-rater generalization of Cohen's kappa for categorical data)
overall_fleiss = kappam.fleiss(cdw %>% select(-indicator, -dimension)) # .342, not .23 as in poster

kappa_per_dim <- tibble()
for(this_dim in unique(cdw$dimension)) {
  kappa = kappam.fleiss(cdw %>% filter(dimension==this_dim) %>% 
                          select(-indicator, -dimension))
  kappa_per_dim = bind_rows(kappa_per_dim, tibble(dimension = this_dim, kappa = kappa$value))
}

#cdw %>% 
#  select(-indicator, -dimension) %>%
#  psych::alpha()
kappa_per_dim %>%
  arrange(desc(kappa)) %>%
  kable(digits=2)
| dimension | kappa |
|-----------|-------|
| funding | 0.55 |
| writ.edit | 0.35 |
| admin | 0.28 |
| supervis | 0.27 |
| analyses | 0.26 |
| software | 0.22 |
| resources | 0.14 |
| investig | 0.12 |
| viz | 0.09 |
| datacur | 0.08 |
| concept | 0.04 |
| validat | 0.02 |
| writ.orig | 0.00 |
| methods | 0.00 |

Overall kappa with the 0, 1, and 2 ratings is 0.34 across all ratings (categories and raters). (Why is this higher than the .23 reported in the poster?) (Also, for ordinal data such as these ratings, would intraclass correlation coefficients not be more appropriate?)
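
As a rough follow-up to that last question, a minimal ICC sketch on the same wide data (assuming the irr package, which also provides kappam.fleiss, is attached) might look like this; note that it treats the 0/1/2 ratings as interval-scaled:

# Two-way, single-rater agreement ICC across the 6 raters
icc(cdw %>% select(-indicator, -dimension),
    model = "twoway", type = "agreement", unit = "single")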

Reliability: Collapsing 1 and 2 ratings

Does reliability improve much if we collapse ‘2’ and ‘1’ ratings? Kappas per category with collapsed ratings are shown below.

cdw01 <- cdl %>% mutate(rating = ifelse(rating==2, 1, rating)) %>%
  pivot_wider(id_cols = c(indicator, dimension), names_from = rater, values_from = rating) 

overall_fleiss01 = kappam.fleiss(cdw01 %>% select(-indicator, -dimension)) 

kappa_per_dim01 <- tibble()
for(this_dim in unique(cdw01$dimension)) {
  kappa = kappam.fleiss(cdw01 %>% filter(dimension==this_dim) %>% 
                          select(-indicator, -dimension))
  kappa_per_dim01 = bind_rows(kappa_per_dim01, tibble(dimension = this_dim, kappa = kappa$value))
}

kappa_per_dim01 %>%
  arrange(desc(kappa)) %>%
  kable(digits=2)
| dimension | kappa |
|-----------|-------|
| funding | 0.81 |
| writ.edit | 0.58 |
| admin | 0.50 |
| supervis | 0.40 |
| investig | 0.31 |
| analyses | 0.29 |
| software | 0.28 |
| resources | 0.21 |
| viz | 0.12 |
| datacur | 0.11 |
| concept | 0.04 |
| validat | 0.03 |
| methods | 0.02 |
| writ.orig | -0.01 |

Overall kappa with 1 and 2 ratings collapsed is 0.48 across all ratings (categories and raters), which isn’t that bad!

Cronbach’s alpha for the collapsed ratings is shown below:

cdw01 %>% 
  select(-indicator, -dimension) %>%
  psych::alpha()
## 
## Reliability analysis   
## Call: psych::alpha(x = .)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.85      0.86    0.84       0.5   6 0.0046 0.25 0.33      0.5
## 
##  lower alpha upper     95% confidence boundaries
## 0.84 0.85 0.86 
## 
##  Reliability if an item is dropped:
##    raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## BP      0.82      0.82    0.80      0.48 4.7   0.0058 0.0064  0.48
## CS      0.83      0.83    0.81      0.50 5.0   0.0056 0.0107  0.51
## OB      0.86      0.86    0.83      0.55 6.0   0.0046 0.0034  0.54
## OG      0.83      0.84    0.81      0.51 5.1   0.0054 0.0081  0.51
## PF      0.81      0.82    0.79      0.47 4.4   0.0060 0.0060  0.48
## TS      0.82      0.83    0.80      0.49 4.8   0.0058 0.0080  0.48
## 
##  Item statistics 
##       n raw.r std.r r.cor r.drop mean   sd
## BP 2436  0.79  0.79  0.75   0.68 0.21 0.40
## CS 2436  0.77  0.76  0.70   0.65 0.25 0.43
## OB 2436  0.68  0.66  0.55   0.51 0.35 0.48
## OG 2436  0.74  0.75  0.67   0.62 0.24 0.42
## PF 2436  0.82  0.83  0.79   0.73 0.22 0.41
## TS 2436  0.78  0.79  0.73   0.67 0.24 0.43
## 
## Non missing response frequency for each item
##       0    1 miss
## BP 0.79 0.21    0
## CS 0.75 0.25    0
## OB 0.65 0.35    0
## OG 0.76 0.24    0
## PF 0.78 0.22    0
## TS 0.76 0.24    0

Ratings by Dimension

Some dimensions have quite different average ratings from others (a strong argument for using IRT models, where items’ varying ‘difficulty’ can be estimated and taken into account). Intuitively, low-scoring items may be more diagnostic: e.g., if only a few people are non-zero on ‘analyses’ (35 1s, 2 2s), then a non-zero rating there reflects a greater contribution than a non-zero rating on ‘investig’, which most people contributed to (296 0s vs. 748 1s/2s).
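
Those per-dimension counts can be checked with something like this (a sketch over the long-format cdl):

# Counts of 0/1/2 ratings within each CRediT dimension
cdl %>%
  count(dimension, rating) %>%
  pivot_wider(names_from = rating, values_from = n, values_fill = 0) %>%
  kable()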

cdl %>% group_by(dimension) %>%
  tidyboot_mean(rating) %>%
  arrange(desc(mean)) %>%
  ggplot(aes(y=reorder(dimension, mean), x=mean)) + 
    geom_pointrange(aes(xmin=ci_lower, xmax=ci_upper)) + 
  theme_classic() + xlab("Mean Rating") + ylab("Dimension")

Correlation of ratings

First unadjusted, and then adjusted for each rater’s average rating. (Note: each ‘indicator’ level is a collaborator on the same large paper.)
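
The correlation figures themselves are generated elsewhere; a minimal sketch of both versions, correlating raters’ total credit per contributor (column names as in cdl), could look like this:

# Unadjusted: correlate raters' total credit per contributor
rater_totals <- cdl %>%
  group_by(indicator, rater) %>%
  summarise(total = sum(rating), .groups = "drop") %>%
  pivot_wider(names_from = rater, values_from = total)

cor(rater_totals %>% select(-indicator))

# Adjusted: center each rater's ratings on that rater's mean before summing,
# removing overall leniency/severity differences between raters.
rater_totals_adj <- cdl %>%
  group_by(rater) %>%
  mutate(rating_c = rating - mean(rating)) %>%
  ungroup() %>%
  group_by(indicator, rater) %>%
  summarise(total = sum(rating_c), .groups = "drop") %>%
  pivot_wider(names_from = rater, values_from = total)

cor(rater_totals_adj %>% select(-indicator))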

Polytomous IRT

We may actually want to estimate a separate model for each rater (keeping in mind that ‘OG’ is a different person for each contributor, though the original authors may share some bias due to authorship). This treats contributors as test-takers, and would give us an overall quality rating for each contributor (per rater).
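
As a starting point (a sketch only, using the mirt package, which is not otherwise used above), one could fit a unidimensional graded response model per rater, treating the 14 CRediT dimensions as items and each contributor as a test-taker:

library(mirt)

# Illustrative single-rater fit (here OG); repeat per rater to compare.
resp_og <- cdl %>%
  filter(rater == "OG") %>%
  pivot_wider(id_cols = indicator, names_from = dimension, values_from = rating)

grm_og <- mirt(resp_og %>% select(-indicator), model = 1, itemtype = "graded")

coef(grm_og, simplify = TRUE)   # item discriminations and category thresholds
head(fscores(grm_og))           # latent 'overall contribution' score per contributor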