##
## -- Column specification --------------------------------------------------------
## cols(
## indicator = col_character(),
## `Identifier for the researcher credit ratings being evaluated` = col_character()
## )
##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_double(),
## indicator = col_double(),
## desc = col_character(),
## concept = col_double(),
## datacur = col_double(),
## analyses = col_double(),
## funding = col_double(),
## investig = col_double(),
## methods = col_double(),
## admin = col_double(),
## resources = col_double(),
## software = col_double(),
## supervis = col_double(),
## validat = col_double(),
## viz = col_double(),
## writ.orig = col_double(),
## writ.edit = col_double(),
## rater = col_character(),
## sum = col_double()
## )
variable | description |
---|---|
indicator | Identifier for the researcher credit ratings being evaluated |
desc | Written description provided by researcher |
concept | Conceptualization credit ratings (0 = no role; 1 = minor role; 2 = major role) |
datacur | Data Curation credit ratings (0 = no role; 1 = minor role; 2 = major role) |
analyses | Formal Analyses credit ratings (0 = no role; 1 = minor role; 2 = major role) |
funding | Funding Acquisition credit ratings (0 = no role; 1 = minor role; 2 = major role) |
investig | Investigation credit ratings (0 = no role; 1 = minor role; 2 = major role) |
methods | Methodology credit ratings (0 = no role; 1 = minor role; 2 = major role) |
admin | Project Administration credit ratings (0 = no role; 1 = minor role; 2 = major role) |
resources | Resources credit ratings (0 = no role; 1 = minor role; 2 = major role) |
software | Software credit ratings (0 = no role; 1 = minor role; 2 = major role) |
supervis | Supervision credit ratings (0 = no role; 1 = minor role; 2 = major role) |
validat | Validation credit ratings (0 = no role; 1 = minor role; 2 = major role) |
viz | Visualization credit ratings (0 = no role; 1 = minor role; 2 = major role) |
writ.orig | Writing – Original Draft credit ratings (0 = no role; 1 = minor role; 2 = major role) |
writ.edit | Writing – Review & Editing credit ratings (0 = no role; 1 = minor role; 2 = major role) |
rater | Person providing the rating. OG = original ratings from the researcher; all others are coder identifiers |
sum | Total amount of credit granted by rater |
We could estimate an overall bias for each coder. In particular, we may expect the original authors (“OG”) to overestimate their contributions, but we might also expect variation among the independent raters.
Note: another perspective is that the OG is the only one with true knowledge of their contributions, but is also a noisy communicator of them: if they did not describe their contributions well, none of the raters will be able to reproduce the (presumably accurate) OG ratings.
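One way to eventually quantify such per-coder differences would be a mixed-effects model with random intercepts for rater. The sketch below is illustrative only: it treats the 0/1/2 ratings as numeric and assumes the lme4 package, which is not otherwise used in this analysis.
# Illustrative sketch (not part of the main analysis): random intercepts per
# rater capture how much each coder deviates from the grand mean rating.
# Assumes lme4 is installed; treats ratings as numeric for simplicity.
library(lme4)
bias_mod <- lmer(rating ~ 1 + (1 | rater) + (1 | indicator) + (1 | dimension),
                 data = cdl)
ranef(bias_mod)$rater  # positive values = coder grants more credit than average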
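The per-rater summary below can be reproduced roughly as follows (a sketch assuming the long data frame cdl with columns rater and rating; the chunk that generated the table is not shown):
# Sketch of the per-rater summary table shown below (assumes cdl with
# columns rater and rating; dplyr and knitr are loaded elsewhere).
cdl %>%
  group_by(rater) %>%
  summarise(mean_rating = mean(rating),
            num0s = sum(rating == 0),
            num1s = sum(rating == 1),
            num2s = sum(rating == 2),
            .groups = "drop") %>%
  arrange(desc(mean_rating)) %>%
  kable(digits = 2)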
rater | mean_rating | num0s | num1s | num2s |
---|---|---|---|---|
OB | 0.56 | 1584 | 328 | 524 |
CS | 0.40 | 1821 | 245 | 370 |
TS | 0.37 | 1846 | 272 | 318 |
OG | 0.31 | 1861 | 395 | 180 |
PF | 0.22 | 1911 | 513 | 12 |
BP | 0.22 | 1935 | 475 | 26 |
By the raw counts, PF and BP use more 1s than the other raters, while OB uses more 2s (and many fewer 0s).
Let’s look at whether raters’ average ratings differ significantly from each other, in which case we may want to adjust their ratings by those averages.
cdl %>% group_by(rater) %>%
tidyboot_mean(column = rating) %>%
ggplot(aes(x=reorder(rater, mean), y=mean)) +
geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
theme_classic() + ylab("Mean Ratings") + xlab("Rater") + ylim(0,1)
The bootstrapped 95% CIs show that many raters differ significantly: BP is similar to PF, and TS similar to CS, but otherwise there is no overlap.
Raters look much more similar if we collapse ‘2’ and ‘1’ ratings.
cdl %>% mutate(rating = ifelse(rating==2, 1, rating)) %>%
group_by(rater) %>%
tidyboot_mean(column = rating) %>%
ggplot(aes(x=reorder(rater, mean), y=mean)) +
geom_pointrange(aes(ymin=ci_lower, ymax=ci_upper)) +
theme_classic() + ylab("Mean Ratings") + xlab("Rater")
Fleiss’ kappa (for categorical scales) with the original 0, 1, and 2 ratings, computed over all raters (OG and the 5 independent raters), is shown below per category.
cdw <- cdl %>% pivot_wider(id_cols = c(indicator, dimension), names_from = rater, values_from = rating)
# Fleiss' kappa (irr::kappam.fleiss): multi-rater generalization of Cohen's kappa for categorical ratings
overall_fleiss = kappam.fleiss(cdw %>% select(-indicator, -dimension)) # .342, not .23 as in poster
kappa_per_dim <- tibble()
for(this_dim in unique(cdw$dimension)) {
kappa = kappam.fleiss(cdw %>% filter(dimension==this_dim) %>%
select(-indicator, -dimension))
kappa_per_dim = bind_rows(kappa_per_dim, tibble(dimension = this_dim, kappa = kappa$value))
}
#cdw %>%
# select(-indicator, -dimension) %>%
# psych::alpha()
kappa_per_dim %>%
arrange(desc(kappa)) %>%
kable(digits=2)
dimension | kappa |
---|---|
funding | 0.55 |
writ.edit | 0.35 |
admin | 0.28 |
supervis | 0.27 |
analyses | 0.26 |
software | 0.22 |
resources | 0.14 |
investig | 0.12 |
viz | 0.09 |
datacur | 0.08 |
concept | 0.04 |
validat | 0.02 |
writ.orig | 0.00 |
methods | 0.00 |
Overall kappa with the 0, 1, and 2 ratings is 0.34 across all ratings (categories and raters). (Why is this higher than the .23 reported in the poster?) (Also, for ordinal data such as these ratings, would it not be better to use intraclass correlation coefficients? A sketch is below.)
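For reference, an ICC could be computed on the same wide rater-by-item matrix with the irr package (already loaded for kappam.fleiss). This is just a sketch; it treats the 0/1/2 ratings as roughly interval-scaled.
# Sketch: two-way agreement ICC over the same matrix used for Fleiss' kappa
# (assumes cdw from above and the irr package).
icc(cdw %>% select(-indicator, -dimension),
    model = "twoway", type = "agreement", unit = "single")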
Does reliability improve much if we collapse ‘2’ and ‘1’ ratings? Kappas per category with collapsed ratings are shown below.
cdw01 <- cdl %>% mutate(rating = ifelse(rating==2, 1, rating)) %>%
pivot_wider(id_cols = c(indicator, dimension), names_from = rater, values_from = rating)
overall_fleiss01 = kappam.fleiss(cdw01 %>% select(-indicator, -dimension))
kappa_per_dim01 <- tibble()
for(this_dim in unique(cdw01$dimension)) {
kappa = kappam.fleiss(cdw01 %>% filter(dimension==this_dim) %>%
select(-indicator, -dimension))
kappa_per_dim01 = bind_rows(kappa_per_dim01, tibble(dimension = this_dim, kappa = kappa$value))
}
kappa_per_dim01 %>%
arrange(desc(kappa)) %>%
kable(digits=2)
dimension | kappa |
---|---|
funding | 0.81 |
writ.edit | 0.58 |
admin | 0.50 |
supervis | 0.40 |
investig | 0.31 |
analyses | 0.29 |
software | 0.28 |
resources | 0.21 |
viz | 0.12 |
datacur | 0.11 |
concept | 0.04 |
validat | 0.03 |
methods | 0.02 |
writ.orig | -0.01 |
Overall kappa with 1 and 2 ratings collapsed is 0.48 across all ratings (categories and raters), which isn’t that bad!
Cronbach’s alpha for the collapsed ratings is shown below:
cdw01 %>%
select(-indicator, -dimension) %>%
psych::alpha()
##
## Reliability analysis
## Call: psych::alpha(x = .)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.85 0.86 0.84 0.5 6 0.0046 0.25 0.33 0.5
##
## lower alpha upper 95% confidence boundaries
## 0.84 0.85 0.86
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## BP 0.82 0.82 0.80 0.48 4.7 0.0058 0.0064 0.48
## CS 0.83 0.83 0.81 0.50 5.0 0.0056 0.0107 0.51
## OB 0.86 0.86 0.83 0.55 6.0 0.0046 0.0034 0.54
## OG 0.83 0.84 0.81 0.51 5.1 0.0054 0.0081 0.51
## PF 0.81 0.82 0.79 0.47 4.4 0.0060 0.0060 0.48
## TS 0.82 0.83 0.80 0.49 4.8 0.0058 0.0080 0.48
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## BP 2436 0.79 0.79 0.75 0.68 0.21 0.40
## CS 2436 0.77 0.76 0.70 0.65 0.25 0.43
## OB 2436 0.68 0.66 0.55 0.51 0.35 0.48
## OG 2436 0.74 0.75 0.67 0.62 0.24 0.42
## PF 2436 0.82 0.83 0.79 0.73 0.22 0.41
## TS 2436 0.78 0.79 0.73 0.67 0.24 0.43
##
## Non missing response frequency for each item
## 0 1 miss
## BP 0.79 0.21 0
## CS 0.75 0.25 0
## OB 0.65 0.35 0
## OG 0.76 0.24 0
## PF 0.78 0.22 0
## TS 0.76 0.24 0
Some dimensions have quite different average ratings than others (a strong argument for using IRT models, where items’ varying ‘difficulty’ can be estimated and taken into account). Intuitively, low-scoring items may be more diagnostic: e.g., if only a few people are non-zero on ‘analyses’ (35 1s, 2 2s), then a non-zero rating there signals a greater contribution than a non-zero on ‘investig’, which most people contributed to (296 0s vs. 748 1s/2s).
cdl %>% group_by(dimension) %>%
tidyboot_mean(rating) %>%
arrange(desc(mean)) %>%
ggplot(aes(y=reorder(dimension, mean), x=mean)) +
geom_pointrange(aes(xmin=ci_lower, xmax=ci_upper)) +
theme_classic() + xlab("Mean Rating") + ylab("Dimension")
We can also compare contributors’ ratings first unadjusted, and then adjusted for raters’ average ratings (sketched below). (Note: each ‘indicator’ level is a collaborator on the same large paper.)
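A simple version of that adjustment (a sketch using the long data frame cdl; not the modeling approach discussed next) is to center each rater’s ratings on their own mean before aggregating per contributor:
# Sketch: subtract each rater's mean rating before summarising credit per
# contributor (indicator), giving unadjusted and rater-adjusted averages.
cdl %>%
  group_by(rater) %>%
  mutate(adj_rating = rating - mean(rating)) %>%
  group_by(indicator) %>%
  summarise(raw_credit = mean(rating),
            adj_credit = mean(adj_rating),
            .groups = "drop") %>%
  arrange(desc(adj_credit))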
We may actually want to estimate a separate model for each rater (keeping in mind that ‘OG’ is a different person for each contributor, though these original authors may share some bias due to authorship). This treats contributors as test-takers and would give us an overall credit estimate for each contributor (per rater).
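As a sketch of what such a model could look like (assuming the mirt package, which is not otherwise used here), a Rasch model can be fit to one rater’s collapsed 0/1 ratings, with contributors as the ‘test-takers’ and CRediT dimensions as the items; a graded response model (itemtype = "graded") could instead keep the 0/1/2 scale.
# Illustrative IRT sketch (assumes the mirt package): contributors are rows,
# CRediT dimensions are items; shown here for the OG ratings, collapsed to 0/1.
library(mirt)
resp <- cdl %>%
  filter(rater == "OG") %>%
  mutate(rating = ifelse(rating == 2, 1, rating)) %>%
  pivot_wider(id_cols = indicator, names_from = dimension, values_from = rating)
irt_mod <- mirt(resp %>% select(-indicator), model = 1, itemtype = "Rasch")
coef(irt_mod, simplify = TRUE)$items  # per-dimension 'difficulty'
head(fscores(irt_mod))                # per-contributor overall credit estimates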