We exclude any rater with more than 20 attention-check misses.
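That exclusion is a simple per-rater filter. A minimal sketch, assuming a per-rater tally of failed attention checks (the `attention_checks` table and its `n_misses` column are illustrative, not from the source):

```r
library(dplyr)

# Hypothetical exclusion step: `attention_checks` is an assumed table with
# one row per rater and a count of missed attention checks.
good_raters <- attention_checks |>
  filter(n_misses <= 20) |>
  pull(rater)

# df_filtered (used throughout below) keeps only the passing raters.
df_filtered <- all_annotations |>
  filter(rater %in% good_raters)
```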
1. Invalid Image Rate per Category
The denominator for each category is n_total from all_annotations — the actual number of images in that category’s ground-truth bucket, which varies across categories.
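For reference, that denominator table could be built directly from `all_annotations`. This is a minimal sketch assuming one row per (image, rater) with `filename`, ground-truth `class`, and `shuffled_index` columns; the source may construct `images_per_cat` differently:

```r
library(dplyr)

# Assumed construction of the per-category denominator: count distinct
# images in each ground-truth bucket. Column names follow the joins below.
images_per_cat <- all_annotations |>
  distinct(filename, class, shuffled_index) |>
  count(class, shuffled_index, name = "n_total")
```

The full computation: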
```r
library(dplyr)
library(tidyr)

# Invalid images flagged by each rater, per category
n_invalid_per_rater <- invalid_df |>
  group_by(rater, class) |>
  summarise(n_invalid = n_distinct(filename), .groups = "drop")

# Categories in which a rater marked 0 images invalid won't appear above,
# so build a full rater × category grid before joining the true denominator.

# All raters that exist in the data
raters <- df_filtered |>
  distinct(rater, shuffled_index)

# Full grid: every rater × every category in the annotation universe
full_grid <- raters |>
  cross_join(images_per_cat |> distinct(class))  # <-- ground-truth universe

invalid_rate <- full_grid |>
  left_join(n_invalid_per_rater, by = c("rater", "class")) |>
  replace_na(list(n_invalid = 0L)) |>
  left_join(images_per_cat, by = c("class", "shuffled_index")) |>
  mutate(prop_invalid = n_invalid / n_total)

avg_valid <- invalid_rate |>
  group_by(rater) |>
  summarise(
    mean_n         = round(mean(n_invalid), 2),
    mean_prop      = round(1 - mean(prop_invalid), 3),
    sd_prop        = round(sd(prop_invalid), 3),
    shuffled_index = first(shuffled_index),
    .groups        = "drop"
  ) |>
  arrange(shuffled_index)

avg_valid |>
  knitr::kable(
    caption = "Average valid image proportion per rater, first summarized within category",
    col.names = c("Rater", "Mean # invalid", "Mean proportion valid",
                  "SD proportion valid", "Shuffled bucket")
  )
```
[Table: Average valid image proportion per rater, first summarized within category]
2. Summary Statistics

Overall summary, averaging the per-class precision, Cohen's kappa, and percent agreement across categories (image counts are summed):

```r
per_class_data |>
  summarize(
    precision = round(mean(precision), 4),
    `Average Cohen's kappa across raters` = round(mean(kappa), 4),
    `Percent agreement between raters` = round(mean(pct_agree), 4),
    `Number of total images` = sum(n_images)
  ) |>
  knitr::kable(caption = "Summary stats")
```
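For reference, the kappa averaged here is Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two raters and p_e is the agreement expected by chance from their marginal label distributions. A generic sketch of the pairwise computation follows; the two-vector input format is an assumption, not necessarily how `per_class_data` was built:

```r
# Cohen's kappa for two raters' labels over the same set of images.
cohen_kappa <- function(x, y) {
  labels <- union(unique(x), unique(y))
  tab <- table(factor(x, levels = labels), factor(y, levels = labels))
  p   <- tab / sum(tab)
  p_o <- sum(diag(p))                  # observed agreement
  p_e <- sum(rowSums(p) * colSums(p))  # chance agreement from marginals
  (p_o - p_e) / (1 - p_e)
}

# Example: two raters labeling the same five images
cohen_kappa(c("cat", "dog", "cat", "bird", "dog"),
            c("cat", "dog", "dog", "bird", "dog"))
```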