All raters agreed that Assertiveness and Self-efficacy are important for succeeding at the Presentation task. 8 of 9 raters thought Cooperation was important, and 7 of 9 agreed that Friendliness, Sympathy, Anxiety, and Liberalism are important.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_sum |
---|---|---|---|---|---|---|---|---|---|---|
Pres_Assertiveness_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Pres_SelfEfficacy_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Pres_Cooperation_Relevant | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Pres_Friendliness_Relevant | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 |
Pres_Sympathy_Relevant | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 7 |
Pres_Anxiety_Relevant | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 |
Pres_Liberalism_Relevant | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 7 |
Pres_Gregariousness_Relevant | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 6 |
Pres_Cheerfulnessful_Relevant | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 6 |
Pres_Morality_Relevant | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 6 |
Pres_Vulnerability_Relevant | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 6 |
Pres_Emotionality_Relevant | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 6 |
Pres_Intellect_Relevant | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 6 |
Pres_Altruism_Relevant | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 5 |
Pres_Modesty_Relevant | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 5 |
Note. The remaining 15 of the 30 facets are omitted to keep the table short.
Percentage agreement among raters was 10%
## Percentage agreement (Tolerance=0)
##
## Subjects = 30
## Raters = 9
## %-agree = 10
Excluding raters 1 and 2, percentage agreement was 16.7%
## Percentage agreement (Tolerance=0)
##
## Subjects = 30
## Raters = 7
## %-agree = 16.7
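The percentage agreement figures above come from the irr package. A minimal sketch of the calls, assuming the relevance ratings are stored in a hypothetical 30 × 9 data frame `pres_relevant` with facets as rows and raters V1-V9 as columns:

```r
library(irr)

# pres_relevant: hypothetical facets-by-raters matrix of 0/1 relevance ratings
agree(pres_relevant, tolerance = 0)              # %-agreement across all 9 raters
agree(pres_relevant[, -c(1, 2)], tolerance = 0)  # excluding raters 1 and 2
```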
Fleiss’s kappa = 0.14 (p<.001).
Statistic | Value |
---|---|
Kappa | 0.1397685 |
z | 4.5932613 |
p-value | 0.0000044 |
Light’s kappa = 0.17.
Statistic | Value |
---|---|
Kappa | 0.1680231 |
z | 0.0000189 |
p-value | 0.9999849 |
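Both kappa statistics are available in the irr package. A minimal sketch of the corresponding calls, again assuming the hypothetical `pres_relevant` matrix:

```r
library(irr)

kappam.fleiss(pres_relevant)  # Fleiss's kappa across all 9 raters
kappam.light(pres_relevant)   # Light's kappa (mean of pairwise Cohen's kappas)
```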
Krippendorff’s alpha was slightly lower than Fleiss’s kappa at 0.12.
Statistic | Value |
---|---|
subjects | 9 |
raters | 30 |
irr.name | alpha |
value | 0.120839371333732 |
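Krippendorff’s alpha can also be computed with irr. Note that `kripp.alpha()` treats rows as raters and columns as subjects, which is why the output above reports 9 subjects and 30 raters when the facet × rater matrix is passed directly. A sketch, using the same hypothetical `pres_relevant` object:

```r
library(irr)

# kripp.alpha() expects a raters x subjects matrix; passing the facet x rater
# matrix directly means the 30 facets are treated as "raters" and the 9 raters
# as "subjects" in the printed output
kripp.alpha(as.matrix(pres_relevant), method = "nominal")
```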
For observability of NEO facets in the Presentation task, the most observable facets were Assertiveness (9/9), followed by Gregariousness, Cheerfulness, and Self-Efficacy (all 8/9), and Friendliness, Cooperation, Sympathy, and Anxiety (all 7/9). Note that these counts do not account for whether the rater also marked the facet as relevant.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_sum |
---|---|---|---|---|---|---|---|---|---|---|
Pres_Assertiveness_Observable | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Pres_Gregariousness_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Pres_Cheerfulnessful_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Pres_SelfEfficacy_Observable | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 8 |
Pres_Friendliness_Observable | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 |
Pres_Cooperation_Observable | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 7 |
Pres_Sympathy_Observable | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 7 |
Pres_Anxiety_Observable | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 |
Pres_Altruism_Observable | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 6 |
Pres_Modesty_Observable | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 6 |
Pres_Depression_Observable | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 6 |
Pres_SelfConsciousness_Observable | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 6 |
Pres_Vulnerability_Observable | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 6 |
Pres_Emotionality_Observable | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 6 |
Pres_Orderliness_Observable | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 5 |
This kappa value represents interrater agreement on Observability across all 30 facets. The proportion of agreement above chance was 0.15 (p<.001).
Statistic | Value |
---|---|
Kappa | 0.1476518 |
z | 4.8523317 |
p-value | 0.0000012 |
This table presents the optimal level scores for the most relevant facets for the Presentation task. The row mean indicates the average optimal level score out of 7. NA values indicate that the rater did not rate the facet as relevant for the task.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_mean |
---|---|---|---|---|---|---|---|---|---|---|
Pres_Friendliness_OptLvl | NA | 4 | 4 | 5 | NA | 4 | 5 | 4 | 4 | 4.285714 |
Pres_Assertiveness_OptLvl | 6 | 6 | 5 | 6 | 4 | 6 | 5 | 6 | 6 | 5.555556 |
Pres_Cooperation_OptLvl | 4 | NA | 4 | 5 | 6 | 5 | 6 | 6 | 2 | 4.750000 |
Pres_Sympathy_OptLvl | 6 | 6 | 4 | 4 | NA | 5 | 5 | 5 | 1 | 4.500000 |
Pres_SelfEfficacy_OptLvl | 6 | 6 | 4 | 6 | 6 | 6 | 6 | 6 | 4 | 5.555556 |
Pres_Anxiety_OptLvl | NA | 5 | 3 | 1 | NA | 1 | 2 | 1 | 1 | 2.000000 |
For the kappa analysis it was necessary to replace all NA values with 0.
Statistic | Value |
---|---|
Kappa | 0.1185364 |
z | 3.5023974 |
p-value | 0.0004611 |
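A sketch of the recoding and kappa call, assuming the optimal-level ratings are in a hypothetical data frame `pres_optlvl` (facets as rows, raters as columns):

```r
library(irr)

pres_optlvl[is.na(pres_optlvl)] <- 0  # treat "not relevant" (NA) as its own category (0)
kappam.fleiss(pres_optlvl)            # agreement on the exact optimal level chosen
```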
ICC:
## Single Score Intraclass Correlation
##
## Model: twoway
## Type : consistency
##
## Subjects = 6
## Raters = 9
## ICC(C,1) = 0.438
##
## F-Test, H0: r0 = 0 ; H1: r0 > 0
## F(5,40) = 8.01 , p = 2.65e-05
##
## 95%-Confidence Interval for ICC Population Values:
## 0.163 < ICC < 0.843
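The ICC above corresponds to a two-way, consistency, single-measures model in irr. A sketch, assuming the same recoded `pres_optlvl` object:

```r
library(irr)

icc(pres_optlvl, model = "twoway", type = "consistency", unit = "single")
```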
Lastly, Kendall’s W = 0.625 (p<.001), a measure of concordance amongst the optimal-level ratings.
## Kendall's coefficient of concordance Wt
##
## Subjects = 6
## Raters = 9
## Wt = 0.625
##
## Chisq(5) = 28.1
## p-value = 3.45e-05
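Kendall’s W corrected for ties (reported as Wt above) can be computed as follows, under the same assumptions:

```r
library(irr)

kendall(pres_optlvl, correct = TRUE)  # coefficient of concordance, corrected for ties
```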
All the raters agreed that Friendliness, Assertiveness, and Cooperation are relevant for succeeding at the Group Discussion. 7/9 agreed that Gregariousness, Modesty, Achievement Striving, Anxiety, and Intellect are relevant.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_sum |
---|---|---|---|---|---|---|---|---|---|---|
Grp_Friendliness_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Grp_Assertiveness_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Grp_Cooperation_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Grp_Gregariousness_Relevant | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 7 |
Grp_Modesty_Relevant | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 7 |
Grp_AchievementStriving_Relevant | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 7 |
Grp_Anxiety_Relevant | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 7 |
Grp_Intellect_Relevant | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 7 |
Grp_SelfEfficacy_Relevant | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 6 |
Grp_SelfConsciousness_Relevant | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 6 |
Grp_Anger_Relevant | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 5 |
Grp_Vulnerability_Relevant | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 5 |
Grp_Imagination_Relevant | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 5 |
Grp_Cheerfulness_Relevant | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 4 |
Grp_Trust_Relevant | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 4 |
The kappa value represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.27 (p<.001).
Statistic | Value |
---|---|
Kappa | 0.2671131 |
z | 8.7782321 |
p-value | 0.0000000 |
Percentage agreement among raters was 13.3%
## Percentage agreement (Tolerance=0)
##
## Subjects = 30
## Raters = 9
## %-agree = 13.3
Excluding raters 1 and 2, percentage agreement was 16.7%
## Percentage agreement (Tolerance=0)
##
## Subjects = 30
## Raters = 7
## %-agree = 16.7
All 9 raters agreed that Friendliness is observable during the group discussion. 8 of 9 agreed that Gregariousness, Assertiveness, Cooperation, and Anxiety are observable.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_sum |
---|---|---|---|---|---|---|---|---|---|---|
Grp_Friendliness_Observable | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Grp_Gregariousness_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Grp_Assertiveness_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Grp_Cooperation_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Grp_Anxiety_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Grp_Cheerfulness_Observable | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 7 |
Grp_Modesty_Observable | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 |
Grp_SelfEfficacy_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 6 |
Grp_Anger_Observable | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 6 |
Grp_Intellect_Observable | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 6 |
Grp_Morality_Observable | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 5 |
Grp_AchievementStriving_Observable | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 5 |
Grp_SelfConsciousness_Observable | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 5 |
Grp_Vulnerability_Observable | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 5 |
Grp_Trust_Observable | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 4 |
The kappa value represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.21 (p<.001).
Statistic | Value |
---|---|
Kappa | 0.2141249 |
z | 7.0368617 |
p-value | 0.0000000 |
This table presents the optimal level scores for the top 8 most relevant facets for the Group Discussion task. The top 8 were chosen because those facets had the highest interrater agreement on relevance (9 or 7 raters). The row mean indicates the average optimal level score out of 7. NA values indicate that the rater did not rate the facet as relevant for the task.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_mean |
---|---|---|---|---|---|---|---|---|---|---|
Grp_Friendliness_OptLvl | 5 | 6 | 4 | 4 | 6 | 5 | 5 | 4 | 5 | 4.888889 |
Grp_Gregariousness_OptLvl | NA | 6 | NA | 4 | 6 | 6 | 6 | 5 | 5 | 5.428571 |
Grp_Assertiveness_OptLvl | 6 | 6 | 4 | 6 | 5 | 5 | 6 | 5 | 6 | 5.444444 |
Grp_Cooperation_OptLvl | 6 | 2 | 3 | 3 | 3 | 4 | 5 | 3 | 6 | 3.888889 |
Grp_Modesty_OptLvl | 6 | 3 | 3 | 2 | NA | 2 | 4 | 2 | 1 | 2.875000 |
Grp_AchievementStriving_OptLvl | 6 | NA | NA | 6 | 5 | 5 | 5 | 5 | 6 | 5.428571 |
Grp_Anxiety_OptLvl | NA | 4 | NA | 1 | 2 | 1 | 2 | 1 | 1 | 1.714286 |
Grp_Intellect_OptLvl | 6 | 6 | NA | 4 | 6 | 5 | NA | 6 | 4 | 5.285714 |
For the kappa analysis it was necessary to replace all NA values with 0. Despite much higher agreement on the relevance and observability of the facets for the Group Discussion task, the experts showed low agreement on the optimal levels of the facets they rated as most relevant.
Statistic | Value |
---|---|
Kappa | 0.0660377 |
z | 2.4919454 |
p-value | 0.0127046 |
ICC: Because the optimal level ratings are on a different (Likert) scale from the binary relevance and observability variables, kappa may not be the best metric, so the ICC is also reported:
## Call: psych::ICC(x = grp_optlvl)
##
## Intraclass correlation coefficients
## type ICC F df1 df2 p lower bound upper bound
## Single_raters_absolute ICC1 0.29 4.7 7 64 0.00025 0.091 0.68
## Single_random_raters ICC2 0.30 5.2 7 56 0.00014 0.098 0.68
## Single_fixed_raters ICC3 0.32 5.2 7 56 0.00014 0.104 0.70
## Average_raters_absolute ICC1k 0.79 4.7 7 64 0.00025 0.473 0.95
## Average_random_raters ICC2k 0.79 5.2 7 56 0.00014 0.495 0.95
## Average_fixed_raters ICC3k 0.81 5.2 7 56 0.00014 0.511 0.95
##
## Number of subjects = 8 Number of Judges = 9
## See the help file for a discussion of the other 4 McGraw and Wong estimates,
## Single Score Intraclass Correlation
##
## Model: twoway
## Type : consistency
##
## Subjects = 8
## Raters = 9
## ICC(C,1) = 0.316
##
## F-Test, H0: r0 = 0 ; H1: r0 > 0
## F(7,56) = 5.16 , p = 0.000141
##
## 95%-Confidence Interval for ICC Population Values:
## 0.104 < ICC < 0.7
Lastly, Kendall’s W = 0.41 (p<.001), indicating moderate concordance amongst ratings.
## Kendall's coefficient of concordance Wt
##
## Subjects = 8
## Raters = 9
## Wt = 0.412
##
## Chisq(7) = 26
## p-value = 0.000509
All the raters agreed that Assertiveness and Anger are relevant for succeeding at the Critique task. 8/9 experts agreed that Friendliness, Altruism, Cooperation, and Dutifulness are relevant.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_sum |
---|---|---|---|---|---|---|---|---|---|---|
Crit_Assertiveness_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Crit_Anger_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Crit_Friendliness_Relevant | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Crit_Altruism_Relevant | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Crit_Cooperation_Relevant | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 8 |
Crit_Dutifulness_Relevant | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Crit_Morality_Relevant | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 7 |
Crit_Cautiousness_Relevant | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 7 |
Crit_Anxiety_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 7 |
Crit_Sympathy_Relevant | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 6 |
Crit_Cheerfulness_Relevant | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 5 |
Crit_SelfEfficacy_Relevant | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 5 |
Crit_Orderliness_Relevant | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 5 |
Crit_AchievementStriving_Relevant | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 5 |
Crit_SelfConsciousness_Relevant | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 5 |
The kappa value represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.29 (p<.001).
Compared with the kappa values for the other tasks, agreement about the relevance of all 30 facets for the Critique task was higher, but the value is still only in the “fair” range according to general guidelines.
Statistic | Value |
---|---|
Kappa | 0.2925538 |
z | 9.6142981 |
p-value | 0.0000000 |
Percentage agreement among raters was 13.3%
## Percentage agreement (Tolerance=0)
##
## Subjects = 30
## Raters = 9
## %-agree = 13.3
Excluding raters 1 and 2, percentage agreement was 20%
## Percentage agreement (Tolerance=0)
##
## Subjects = 30
## Raters = 7
## %-agree = 20
All 9 raters agreed that Anger is observable during the Critique task. 8 of 9 agreed that Assertiveness, Altruism, Cooperation, and Anxiety are observable.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_sum |
---|---|---|---|---|---|---|---|---|---|---|
Crit_Anger_Observable | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Crit_Assertiveness_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Crit_Altruism_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Crit_Cooperation_Observable | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 8 |
Crit_Anxiety_Observable | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 8 |
Crit_Friendliness_Observable | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 |
Crit_Sympathy_Observable | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 7 |
Crit_Dutifulness_Observable | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 7 |
Crit_Cheerfulness_Observable | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 6 |
Crit_Orderliness_Observable | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 6 |
Crit_SelfConsciousness_Observable | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 6 |
Crit_Vulnerability_Observable | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 6 |
Crit_SelfEfficacy_Observable | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 5 |
Crit_AchievementStriving_Observable | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 5 |
Crit_SelfDiscipline_Observable | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 5 |
The kappa value represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.15 (p<.001).
Statistic | Value |
---|---|
Kappa | 0.1517857 |
z | 4.9881876 |
p-value | 0.0000006 |
This table presents the optimal level scores for the top 6 most relevant facets for the Critique task. The top 6 were chosen because those facets had 9 or 8 raters in agreement (i.e. the highest interrater agreement). The row mean indicates the average optimal level score out of 7. NA values indicate that the rater did not think the facet was relevant for the task.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_mean |
---|---|---|---|---|---|---|---|---|---|---|
Crit_Friendliness_OptLvl | NA | 3 | 5 | 4 | 5 | 4 | 5 | 4 | 3 | 4.125 |
Crit_Assertiveness_OptLvl | 4 | 6 | 4 | 5 | 4 | 5 | 5 | 6 | 6 | 5.000 |
Crit_Altruism_OptLvl | NA | 5 | 4 | 5 | 4 | 5 | 5 | 5 | 4 | 4.625 |
Crit_Cooperation_OptLvl | 4 | 1 | 3 | 6 | NA | 4 | 6 | 3 | 2 | 3.625 |
Crit_Dutifulness_OptLvl | 6 | 5 | NA | 5 | 6 | 6 | 4 | 5 | 3 | 5.000 |
Crit_Anger_OptLvl | 2 | 4 | 2 | 1 | 2 | 1 | 1 | 1 | 4 | 2.000 |
For the kappa analysis it was necessary to replace all NA values with 0. For the top 6 most relevant facets, agreement was 0.05, which was non-significant (p = .13).
Statistic | Value |
---|---|
Kappa | 0.0462574 |
z | 1.4949712 |
p-value | 0.1349220 |
ICC:
## boundary (singular) fit: see help('isSingular')
## Call: psych::ICC(x = crit_optlvl)
##
## Intraclass correlation coefficients
## type ICC F df1 df2 p lower bound upper bound
## Single_raters_absolute ICC1 0.24 3.9 5 48 0.0049 0.039 0.72
## Single_random_raters ICC2 0.24 3.9 5 40 0.0058 0.039 0.72
## Single_fixed_raters ICC3 0.24 3.9 5 40 0.0058 0.036 0.72
## Average_raters_absolute ICC1k 0.74 3.9 5 48 0.0049 0.269 0.96
## Average_random_raters ICC2k 0.74 3.9 5 40 0.0058 0.268 0.96
## Average_fixed_raters ICC3k 0.74 3.9 5 40 0.0058 0.254 0.96
##
## Number of subjects = 6 Number of Judges = 9
## See the help file for a discussion of the other 4 McGraw and Wong estimates,
All the raters agreed that Friendliness, Assertiveness, and Cheerfulness are relevant for succeeding at the Teaching task. 7/9 experts agreed that Intellect, Modesty, and Altruism are relevant.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_sum |
---|---|---|---|---|---|---|---|---|---|---|
Teach_Friendliness_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Teach_Assertiveness_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Teach_Cheerfulness_Relevant | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Teach_Altruism_Relevant | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 |
Teach_Modesty_Relevant | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 |
Teach_Intellect_Relevant | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 7 |
Teach_Cooperation_Relevant | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 6 |
Teach_Anger_Relevant | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 6 |
Teach_Gregariousness_Relevant | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 5 |
Teach_Sympathy_Relevant | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 5 |
Teach_SelfEfficacy_Relevant | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 5 |
Teach_Orderliness_Relevant | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 5 |
Teach_SelfDiscipline_Relevant | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 5 |
Teach_Imagination_Relevant | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 5 |
Teach_ActivityLevel_Relevant | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 4 |
Kappa represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.28 (p<.001).
Compared with the kappa values for the other tasks, agreement about the relevance of all 30 facets for the Teaching task was relatively high, but the value is still in the “fair” range according to general guidelines.
Statistic | Value |
---|---|
Kappa | 0.2849021 |
z | 9.3628385 |
p-value | 0.0000000 |
Percentage agreement among raters was 23.3%
## Percentage agreement (Tolerance=0)
##
## Subjects = 30
## Raters = 9
## %-agree = 23.3
Excluding raters 1 and 2, percentage agreement was still 23.3%
## Percentage agreement (Tolerance=0)
##
## Subjects = 30
## Raters = 7
## %-agree = 23.3
All 9 raters agreed that Friendliness is observable during the Teaching task. 8 of 9 agreed that Cheerfulness and Anger are observable. 7 of 9 agreed that Altruism and Cooperation are observable.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_sum |
---|---|---|---|---|---|---|---|---|---|---|
Teach_Friendliness_Observable | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
Teach_Cheerfulness_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Teach_Anger_Observable | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |
Teach_Altruism_Observable | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 7 |
Teach_Cooperation_Observable | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 7 |
Teach_Gregariousness_Observable | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 6 |
Teach_Assertiveness_Observable | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 6 |
Teach_Modesty_Observable | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 5 |
Teach_Sympathy_Observable | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 5 |
Teach_SelfConsciousness_Observable | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 5 |
Teach_Intellect_Observable | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 5 |
Teach_Orderliness_Observable | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 4 |
Teach_Anxiety_Observable | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 4 |
Teach_Depression_Observable | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 4 |
Teach_Vulnerability_Observable | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 4 |
The Fleiss’s kappa value represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.17 (p<.001).
Statistic | Value |
---|---|
Kappa | 0.1682928 |
z | 5.5306664 |
p-value | 0.0000000 |
This table presents the optimal level scores for the top 6 most relevant facets for the Teaching task. The top 6 were chosen because those facets had either 9 or 7 raters in agreement about their relevance (i.e. the highest interrater agreement). The row mean indicates the average optimal level score out of 7. NA values indicate that the rater did not think the facet was relevant for the task.
Facet | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | row_mean |
---|---|---|---|---|---|---|---|---|---|---|
Teach_Friendliness_OptLvl | 6 | 6 | 4 | 4 | 4 | 5 | 6 | 6 | 6 | 5.222222 |
Teach_Assertiveness_OptLvl | 4 | 5 | 2 | 5 | 4 | 4 | 5 | 4 | 5 | 4.222222 |
Teach_Cheerfulness_OptLvl | 5 | 5 | 4 | 5 | 4 | 6 | 5 | 5 | 5 | 4.888889 |
Teach_Altruism_OptLvl | NA | 5 | 4 | 6 | NA | 5 | 5 | 6 | 5 | 5.142857 |
Teach_Modesty_OptLvl | 4 | NA | 4 | 3 | NA | 4 | 4 | 5 | 3 | 3.857143 |
Teach_Intellect_OptLvl | 6 | 5 | NA | 6 | 4 | NA | 4 | 5 | 6 | 5.142857 |
For the kappa analysis it was necessary to replace all NA values with 0. For the top 6 most relevant facets, agreement was 0.05, which was non-significant (p = .19).
Statistic | Value |
---|---|
Kappa | 0.0499080 |
z | 1.3134489 |
p-value | 0.1890317 |
ICC:
## Call: psych::ICC(x = teach_optlvl)
##
## Intraclass correlation coefficients
## type ICC F df1 df2 p lower bound upper bound
## Single_raters_absolute ICC1 0.096 2.0 5 48 0.102 -0.036 0.55
## Single_random_raters ICC2 0.108 2.3 5 40 0.067 -0.020 0.55
## Single_fixed_raters ICC3 0.123 2.3 5 40 0.067 -0.025 0.59
## Average_raters_absolute ICC1k 0.489 2.0 5 48 0.102 -0.453 0.92
## Average_random_raters ICC2k 0.522 2.3 5 40 0.067 -0.215 0.92
## Average_fixed_raters ICC3k 0.558 2.3 5 40 0.067 -0.284 0.93
##
## Number of subjects = 6 Number of Judges = 9
## See the help file for a discussion of the other 4 McGraw and Wong estimates,
Lastly, Kendall’s W = 0.31 (p=.015), a moderate correlation amongst ratings.
## Kendall's coefficient of concordance Wt
##
## Subjects = 6
## Raters = 9
## Wt = 0.315
##
## Chisq(5) = 14.2
## p-value = 0.0145
Fleiss’s kappa can be used to assess inter-rater reliability when there are more than two raters and the data are on a nominal or ordinal scale. In this instance, Fleiss’s kappa was best suited for calculating reliability for Relevance and Observability, since those variables were binary (yes/no). However, some argue that Likert data are ordinal, so I used it for the Optimal Level ratings as well but also included the ICC.
Fleiss’s kappa ranges from -1 to 1 and reflects the degree of agreement beyond what would be expected by chance. Values greater than 0 indicate agreement above chance levels.
Interpretation guides strongly caution against using one-size-fits-all interpretation categories, but generally speaking, 0-.20 is “slight” agreement and 0.21-0.40 is “fair” agreement; the higher categories are not listed here because none of the observed values exceeded 0.40.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/ “For fully-crossed designs with three or more coders, Light (1971) suggests computing kappa for all coder pairs then using the arithmetic mean of these estimates to provide an overall index of agreement.”
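A sketch of this pairwise approach using irr’s `kappa2()`, again with the hypothetical `pres_relevant` matrix from the Presentation task; the mean of the pairwise Cohen’s kappas should approximate the Light’s kappa reported earlier:

```r
library(irr)

rater_pairs <- combn(ncol(pres_relevant), 2)          # all pairs of the 9 raters
pairwise_k <- apply(rater_pairs, 2, function(p) kappa2(pres_relevant[, p])$value)
mean(pairwise_k)                                      # arithmetic mean = Light's kappa
```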
Krippendorff’s alpha is similar to Fleiss’s kappa but it calculates disagreement rather than agreement and is more robust against missing values. Krippendorff’s alpha can be used for nominal and ordinal data with more than two raters. Resulting values range from 0-1, with 0 indicating perfect disagreement and 1 indicating perfect agreement.
For ordinal data (such as the Likert scale ratings of optimal personality), Kendall’s W assesses the strength of the relationship between ratings. Similar to correlation, Kendall’s W ranges from 0 to 1, where higher values indicate stronger inter-rater reliability.
https://john-uebersax.com/stat/agree.htm tests association between raters
https://john-uebersax.com/stat/agree.htm “They estimate what the correlation between raters would be if ratings were made on a continuous scale; they are, theoretically, invariant over changes in the number or ‘width’ of rating categories. The tetrachoric and polychoric correlations also provide a framework that allows testing of marginal homogeneity between raters. Thus, these statistics let one separately assess both components of rater agreement: agreement on trait definition and agreement on definitions of specific categories.”
May not be appropriate if the latent trait is discrete
Marginal homogeneity refers to equality (lack of significant difference) between one or more of the row marginal proportions and the corresponding column proportion(s). Testing marginal homogeneity is often useful in analyzing rater agreement. One reason raters disagree is because of different propensities to use each rating category.
https://www.agreestat.com/books/cac5/chapter5/chap5.pdf Aickin (1990): “The α parameter is defined as the fraction of the entire subject population made up of subjects that the two raters A and B classified identically for cause, rather than by chance.” Aickin’s alpha operates on probabilities of raters scoring items into a certain category. Percent chance agreement is calculated based only on items that were difficult to score (i.e. lack consensus), which protects against kappa paradoxes.
“Unlike Aickin’s alpha coefficient, which is defined as the probability that two raters A and B agree for cause, Gwet’s AC1 (see Gwet, 2008a) is defined as the probability that two raters agree given that the subjects being rated are not susceptible to agreement by pure chance.”
https://link-springer-com.manchester.idm.oclc.org/article/10.1007/BF02294802 Slide 46: https://folk.ntnu.no/slyderse/medstat/Interrater_fullpage_9March2016.pdf
We can examine the random effects of raters to see how much they varied in their responses overall.
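One way to do this is a mixed model with crossed random effects for raters and facets. A sketch, assuming a hypothetical long-format data frame `optlvl_long` with columns `rater`, `facet`, and `opt_level`:

```r
library(lme4)

# Variance attributable to raters (how much raters differ overall) vs. facets
m <- lmer(opt_level ~ 1 + (1 | rater) + (1 | facet), data = optlvl_long)
VarCorr(m)
```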
“Measures of rater agreement often provide low values when high levels of agreement exist among raters. The table below shows 20 passages coded by four raters using the four coding categories listed below. Note that all raters agree on every passage except for passage 20. Despite 95.2% agreement, the other measures of agreement are below acceptable levels: Fleiss’ kappa = .316, mean Cohen’s kappa = .244, and Krippendorff’s alpha = .325. 1 = Positive statement 2 = Negative statement 3 = Neutral statement 4 = Other unrelated statement/Not applicable The problem with these data is lack of variability in codes. When most raters assign one code predominately, then measures of agreement can be misleadingly low, as demonstrated in this example. This is one reason I recommend always reporting percent agreement.”
Respondents were asked to report their personality-related qualifications. Raters V1 and V2 have the fewest years of experience. It’s possible that Raters 5 and 9 have their PhDs in a non-personality field but have since published in personality journals, or they misread the question.
RaterID | EXPERIENCE | EXPERTISE |
---|---|---|
1 | 3 | Current PhD |
2 | 5 | Current PhD |
3 | 6 | Current PhD |
4 | 7 | Current PhD,Published,Behaviour coding experience |
5 | 7 | Completed MSc,Academic |
6 | 10 | Current PhD |
7 | 12 | Current PhD,Published |
8 | 17 | Completed PhD,Academic,Published,Behaviour coding experience |
9 | 32 | Published |
Note. EXPERIENCE = “How many years of experience do you have in personality science, inclusive of postgraduate training?” EXPERTISE = “Which of the following describes your expertise in personality research? You may select multiple.”