Presentation task

Presentation: Relevance

All raters agreed that Assertiveness and Self-efficacy are important for succeeding at the Presentation task. 8 of 9 raters thought Cooperation was important, and 7 of 9 agreed that Friendliness, Sympathy, Anxiety, and Liberalism are important.

Relevance of NEO Facets on Presentation Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_sum
Pres_Assertiveness_Relevant 1 1 1 1 1 1 1 1 1 9
Pres_SelfEfficacy_Relevant 1 1 1 1 1 1 1 1 1 9
Pres_Cooperation_Relevant 1 0 1 1 1 1 1 1 1 8
Pres_Friendliness_Relevant 0 1 1 1 0 1 1 1 1 7
Pres_Sympathy_Relevant 1 1 1 1 0 1 1 1 0 7
Pres_Anxiety_Relevant 0 1 1 1 0 1 1 1 1 7
Pres_Liberalism_Relevant 1 1 1 1 0 0 1 1 1 7
Pres_Gregariousness_Relevant 0 0 0 1 1 1 1 1 1 6
Pres_Cheerfulnessful_Relevant 1 1 0 1 0 1 1 1 0 6
Pres_Morality_Relevant 0 1 1 1 0 0 1 1 1 6
Pres_Vulnerability_Relevant 1 1 0 1 0 1 1 1 0 6
Pres_Emotionality_Relevant 0 0 0 1 1 1 1 1 1 6
Pres_Intellect_Relevant 1 1 0 1 0 0 1 1 1 6
Pres_Altruism_Relevant 1 1 0 0 0 1 1 1 0 5
Pres_Modesty_Relevant 0 0 0 1 0 1 1 1 1 5

Note. I cut the last 15 rows (facets) to keep the table shorter; only the 15 facets with the highest agreement are shown.

Percentage agreement among raters was 10%.

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 30 
##    Raters = 9 
##   %-agree = 10

Excluding raters 1 and 2, percentage agreement was 16.7%.

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 30 
##    Raters = 7 
##   %-agree = 16.7
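These percentages could be reproduced with irr::agree(); a minimal sketch, assuming the relevance ratings sit in a 30 x 9 facet-by-rater matrix of 0/1 values named pres_relevant (an assumed name, since the original code is not shown):

library(irr)

# Exact (tolerance = 0) percent agreement across all 9 raters
agree(pres_relevant, tolerance = 0)

# The same, after dropping raters 1 and 2 (columns 1 and 2)
agree(pres_relevant[, -c(1, 2)], tolerance = 0)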

Fleiss’s kappa: k = 0.14 (p < .001).

Fleiss’s Kappa: Presentation Relevance
Statistic Value
Kappa 0.1397685
z 4.5932613
p-value 0.0000044
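A corresponding sketch for Fleiss’s kappa, again using the assumed pres_relevant matrix (facets in rows, raters in columns):

library(irr)

# Fleiss's kappa for 9 raters; detail = TRUE would also return category-wise kappas
kappam.fleiss(pres_relevant, detail = TRUE)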

Light’s kappa = 0.17.

Light’s Kappa: Presentation Relevance
Statistic Value
Kappa 0.1680231
z 0.0000189
p-value 0.9999849

Krippendorff’s alpha was slightly lower than Fleiss’s kappa at 0.12.

Krippendorff’s alpha
Statistic Value
Subjects 9
Raters 30
alpha 0.1208394
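One caution: irr::kripp.alpha() expects a raters-by-subjects matrix (raters in rows), and the Subjects = 9 / Raters = 30 labels above suggest the facet-by-rater matrix may have been passed untransposed, so the value may be worth re-checking. A hedged sketch, with pres_relevant again the assumed 30 x 9 facet-by-rater matrix:

library(irr)

# kripp.alpha() wants raters in rows, so transpose the facet-by-rater matrix
kripp.alpha(t(as.matrix(pres_relevant)), method = "nominal")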

Spearman correlation table

Presentation: Observability

For observability of NEO facets in the presentation task, the most observable facets were Assertiveness (9/9); Gregariousness, Cheerfulness, and Self-Efficacy (all 8/9); and Friendliness, Cooperation, Sympathy, and Anxiety (all 7/9). * These are general findings that do not account for whether the rater also marked the trait as relevant.

Observability of NEO Facets on Presentation Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_sum
Pres_Assertiveness_Observable 1 1 1 1 1 1 1 1 1 9
Pres_Gregariousness_Observable 0 1 1 1 1 1 1 1 1 8
Pres_Cheerfulnessful_Observable 0 1 1 1 1 1 1 1 1 8
Pres_SelfEfficacy_Observable 1 1 1 1 0 1 1 1 1 8
Pres_Friendliness_Observable 0 1 1 1 0 1 1 1 1 7
Pres_Cooperation_Observable 1 0 1 1 1 1 0 1 1 7
Pres_Sympathy_Observable 1 1 0 1 1 1 0 1 1 7
Pres_Anxiety_Observable 0 1 1 1 0 1 1 1 1 7
Pres_Altruism_Observable 1 0 1 0 1 1 0 1 1 6
Pres_Modesty_Observable 0 1 0 1 0 1 1 1 1 6
Pres_Depression_Observable 0 1 1 1 0 1 0 1 1 6
Pres_SelfConsciousness_Observable 0 0 1 1 0 1 1 1 1 6
Pres_Vulnerability_Observable 1 0 0 1 0 1 1 1 1 6
Pres_Emotionality_Observable 0 0 0 1 1 1 1 1 1 6
Pres_Orderliness_Observable 0 0 0 1 1 1 0 1 1 5

This kappa value represents interrater agreement on Observability across all 30 facets. The proportion of agreement above chance was 0.15 (p < .001).

Fleiss’s Kappa: Presentation Observability
Statistic Value
Kappa 0.1476518
z 4.8523317
p-value 0.0000012

Presentation: Optimal Level

This table presents the optimal level scores for the most relevant facets for the Presentation task. The row mean indicates the average optimal level score out of 7; NA values indicate that the rater did not think the facet was relevant for the task.

Optimal Level of Relevant NEO Facets on Presentation Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_mean
Pres_Friendliness_OptLvl NA 4 4 5 NA 4 5 4 4 4.285714
Pres_Assertiveness_OptLvl 6 6 5 6 4 6 5 6 6 5.555556
Pres_Cooperation_OptLvl 4 NA 4 5 6 5 6 6 2 4.750000
Pres_Sympathy_OptLvl 6 6 4 4 NA 5 5 5 1 4.500000
Pres_SelfEfficacy_OptLvl 6 6 4 6 6 6 6 6 4 5.555556
Pres_Anxiety_OptLvl NA 5 3 1 NA 1 2 1 1 2.000000

For the kappa analysis it was necessary to replace all NA values with 0.
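A minimal sketch of that preprocessing step, assuming the optimal level ratings are in a facet-by-rater object called pres_optlvl (a hypothetical name, following the grp_optlvl / crit_optlvl / teach_optlvl naming visible in the ICC calls later on):

library(irr)

opt <- as.matrix(pres_optlvl)
opt[is.na(opt)] <- 0   # treat "not relevant" (NA) as its own category, coded 0

# Fleiss's kappa treats the recoded ratings as nominal categories
kappam.fleiss(opt)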

Fleiss’s Kappa: Presentation Optimal Level
Statistic Value
Kappa 0.1185364
z 3.5023974
p-value 0.0004611

ICC:

##  Single Score Intraclass Correlation
## 
##    Model: twoway 
##    Type : consistency 
## 
##    Subjects = 6 
##      Raters = 9 
##    ICC(C,1) = 0.438
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##     F(5,40) = 8.01 , p = 2.65e-05 
## 
##  95%-Confidence Interval for ICC Population Values:
##   0.163 < ICC < 0.843

Lastly, Kendall’s W = 0.63 (p < .001), a moderate-to-strong concordance amongst the optimal level ratings.

##  Kendall's coefficient of concordance Wt
## 
##  Subjects = 6 
##    Raters = 9 
##        Wt = 0.625 
## 
##  Chisq(5) = 28.1 
##   p-value = 3.45e-05
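The ICC and Kendall’s W outputs above look like those produced by irr::icc() and irr::kendall(); a hedged sketch, reusing the hypothetical opt matrix (how the NAs were actually handled for these two statistics is not shown, so the NA-to-0 recoding above is an assumption here too):

library(irr)

# Two-way consistency, single-rater ICC, as in the output above
icc(opt, model = "twoway", type = "consistency", unit = "single")

# Kendall's coefficient of concordance, corrected for ties (reported as Wt)
kendall(opt, correct = TRUE)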

Group discussion task

Group Discussion: Relevance

All the raters agreed that Friendliness, Assertiveness, and Cooperation are relevant for succeeding at the Group Discussion. 7/9 agreed that Gregariousness, Modesty, Achievement Striving, Anxiety, and Intellect are relevant.

Relevance of NEO Facets for Group Discussion Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_sum
Grp_Friendliness_Relevant 1 1 1 1 1 1 1 1 1 9
Grp_Assertiveness_Relevant 1 1 1 1 1 1 1 1 1 9
Grp_Cooperation_Relevant 1 1 1 1 1 1 1 1 1 9
Grp_Gregariousness_Relevant 0 1 0 1 1 1 1 1 1 7
Grp_Modesty_Relevant 1 1 1 1 0 1 1 1 0 7
Grp_AchievementStriving_Relevant 1 0 0 1 1 1 1 1 1 7
Grp_Anxiety_Relevant 0 1 0 1 1 1 1 1 1 7
Grp_Intellect_Relevant 1 1 0 1 1 1 0 1 1 7
Grp_SelfEfficacy_Relevant 1 1 0 1 1 1 1 0 0 6
Grp_SelfConsciousness_Relevant 0 0 1 1 0 1 1 1 1 6
Grp_Anger_Relevant 0 0 0 1 0 1 1 1 1 5
Grp_Vulnerability_Relevant 0 1 0 1 0 1 0 1 1 5
Grp_Imagination_Relevant 0 0 0 1 1 1 0 1 1 5
Grp_Cheerfulness_Relevant 0 0 0 1 0 1 1 1 0 4
Grp_Trust_Relevant 1 1 0 1 0 0 1 0 0 4

The kappa value represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.27 and was significantly different from 0 (p < .001).

Fleiss’s Kappa: Relevance of facets on Group Discussion task
Statistic Value
Kappa 0.2671131
z 8.7782321
p-value 0.0000000

Percentage agreement among raters was 13.3%.

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 30 
##    Raters = 9 
##   %-agree = 13.3

Excluding raters 1 and 2, percentage agreement was 16.7%.

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 30 
##    Raters = 7 
##   %-agree = 16.7

Group Discussion: Observability

All 9 raters agreed that Friendliness is observable during the group discussion. 8 of 9 agreed that Gregariousness, Assertiveness, Cooperation, and Anxiety are observable.

Observability of NEO Facets on Group Discussion Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_sum
Grp_Friendliness_Observable 1 1 1 1 1 1 1 1 1 9
Grp_Gregariousness_Observable 0 1 1 1 1 1 1 1 1 8
Grp_Assertiveness_Observable 0 1 1 1 1 1 1 1 1 8
Grp_Cooperation_Observable 0 1 1 1 1 1 1 1 1 8
Grp_Anxiety_Observable 0 1 1 1 1 1 1 1 1 8
Grp_Cheerfulness_Observable 0 0 1 1 1 1 1 1 1 7
Grp_Modesty_Observable 0 1 1 1 0 1 1 1 1 7
Grp_SelfEfficacy_Observable 0 1 1 1 1 1 0 0 1 6
Grp_Anger_Observable 0 0 1 1 0 1 1 1 1 6
Grp_Intellect_Observable 0 1 0 1 1 1 0 1 1 6
Grp_Morality_Observable 0 1 0 1 1 1 0 0 1 5
Grp_AchievementStriving_Observable 0 0 0 1 0 1 1 1 1 5
Grp_SelfConsciousness_Observable 0 0 1 1 0 1 0 1 1 5
Grp_Vulnerability_Observable 0 1 0 1 0 1 0 1 1 5
Grp_Trust_Observable 0 1 0 1 0 0 1 0 1 4

The kappa value represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.21 and was significantly different from 0 (p < .001).

Fleiss’s Kappa: Observability of facets on Group Discussion task
Statistic Value
Kappa 0.2141249
z 7.0368617
p-value 0.0000000

Group Discussion: Optimal level

This table presents the optimal level scores for the top 8 most relevant facets for the Group Discussion task. The top 8 were chosen because those facets had either 9 or 7 raters agreeing on relevance (i.e., the highest interrater agreement). The row mean indicates the average optimal level score out of 7. NA values indicate that the rater did not think the facet was relevant for the task.

Optimal Level of Relevant NEO Facets on Group Discussion Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_mean
Grp_Friendliness_OptLvl 5 6 4 4 6 5 5 4 5 4.888889
Grp_Gregariousness_OptLvl NA 6 NA 4 6 6 6 5 5 5.428571
Grp_Assertiveness_OptLvl 6 6 4 6 5 5 6 5 6 5.444444
Grp_Cooperation_OptLvl 6 2 3 3 3 4 5 3 6 3.888889
Grp_Modesty_OptLvl 6 3 3 2 NA 2 4 2 1 2.875000
Grp_AchievementStriving_OptLvl 6 NA NA 6 5 5 5 5 6 5.428571
Grp_Anxiety_OptLvl NA 4 NA 1 2 1 2 1 1 1.714286
Grp_Intellect_OptLvl 6 6 NA 4 6 5 NA 6 4 5.285714

For the kappa analysis it was necessary to replace all NA values with 0. Despite much higher levels of agreement on the relevance and observability of the facets for the group discussion task, the experts showed low agreement on the optimal levels of the facets they rated as most relevant.

Fleiss’s Kappa: Group Discussion Optimal Level
Statistic Value
Kappa 0.0660377
z 2.4919454
p-value 0.0127046

ICC: Since the optimal level ratings are on a different scale (1-7) from the binary relevance and observability ratings, kappa may not be the best metric. Here is the ICC:

## Call: psych::ICC(x = grp_optlvl)
## 
## Intraclass correlation coefficients 
##                          type  ICC   F df1 df2       p lower bound upper bound
## Single_raters_absolute   ICC1 0.29 4.7   7  64 0.00025       0.091        0.68
## Single_random_raters     ICC2 0.30 5.2   7  56 0.00014       0.098        0.68
## Single_fixed_raters      ICC3 0.32 5.2   7  56 0.00014       0.104        0.70
## Average_raters_absolute ICC1k 0.79 4.7   7  64 0.00025       0.473        0.95
## Average_random_raters   ICC2k 0.79 5.2   7  56 0.00014       0.495        0.95
## Average_fixed_raters    ICC3k 0.81 5.2   7  56 0.00014       0.511        0.95
## 
##  Number of subjects = 8     Number of Judges =  9
## See the help file for a discussion of the other 4 McGraw and Wong estimates,
##  Single Score Intraclass Correlation
## 
##    Model: twoway 
##    Type : consistency 
## 
##    Subjects = 8 
##      Raters = 9 
##    ICC(C,1) = 0.316
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##     F(7,56) = 5.16 , p = 0.000141 
## 
##  95%-Confidence Interval for ICC Population Values:
##   0.104 < ICC < 0.7

Lastly, Kendall’s W = 0.41 (p < .001), a moderate concordance amongst ratings.

##  Kendall's coefficient of concordance Wt
## 
##  Subjects = 8 
##    Raters = 9 
##        Wt = 0.412 
## 
##  Chisq(7) = 26 
##   p-value = 0.000509

Critique task

Critique: Relevance

All the raters agreed that Assertiveness and Anger are relevant for succeeding at the Critique task. 8/9 experts agreed that Friendliness, Altruism, Cooperation, and Dutifulness are relevant.

Relevance of NEO Facets for Critique Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_sum
Crit_Assertiveness_Relevant 1 1 1 1 1 1 1 1 1 9
Crit_Anger_Relevant 1 1 1 1 1 1 1 1 1 9
Crit_Friendliness_Relevant 0 1 1 1 1 1 1 1 1 8
Crit_Altruism_Relevant 0 1 1 1 1 1 1 1 1 8
Crit_Cooperation_Relevant 1 1 1 1 0 1 1 1 1 8
Crit_Dutifulness_Relevant 1 1 0 1 1 1 1 1 1 8
Crit_Morality_Relevant 0 1 1 1 1 1 0 1 1 7
Crit_Cautiousness_Relevant 1 0 1 1 1 1 0 1 1 7
Crit_Anxiety_Relevant 1 1 1 1 1 1 0 1 0 7
Crit_Sympathy_Relevant 0 0 1 1 0 1 1 1 1 6
Crit_Cheerfulness_Relevant 1 1 1 0 0 1 1 0 0 5
Crit_SelfEfficacy_Relevant 1 0 1 1 1 0 1 0 0 5
Crit_Orderliness_Relevant 1 0 0 1 0 1 0 1 1 5
Crit_AchievementStriving_Relevant 1 0 0 1 0 1 1 1 0 5
Crit_SelfConsciousness_Relevant 1 1 0 1 0 1 0 1 0 5

The kappa value represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.29 and was significantly different from 0 (p < .001).

Compared to kappa values for the other tasks, the level of agreement about the relevance of all 30 facets was higher for the critique task, but the value is still only “fair” according to general guidelines.

Fleiss’s Kappa: Relevance of facets on Critique task
Statistic Value
Kappa 0.2925538
z 9.6142981
p-value 0.0000000

Percentage agreement among raters was 13.3%.

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 30 
##    Raters = 9 
##   %-agree = 13.3

Excluding raters 1 and 2, percentage agreement was 20%.

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 30 
##    Raters = 7 
##   %-agree = 20

Spearman correlation

Critique: Observability

All 9 raters agreed that Anger is observable during the Critique task. 8 of 9 agreed that Assertiveness, Altruism, Cooperation, and Anxiety are observable.

Observability of NEO Facets on Critique Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_sum
Crit_Anger_Observable 1 1 1 1 1 1 1 1 1 9
Crit_Assertiveness_Observable 0 1 1 1 1 1 1 1 1 8
Crit_Altruism_Observable 0 1 1 1 1 1 1 1 1 8
Crit_Cooperation_Observable 1 1 1 1 0 1 1 1 1 8
Crit_Anxiety_Observable 1 1 1 1 1 1 0 1 1 8
Crit_Friendliness_Observable 0 1 1 1 0 1 1 1 1 7
Crit_Sympathy_Observable 0 0 1 1 1 1 1 1 1 7
Crit_Dutifulness_Observable 0 1 0 1 1 1 1 1 1 7
Crit_Cheerfulness_Observable 0 1 1 0 1 1 1 0 1 6
Crit_Orderliness_Observable 0 0 1 1 1 1 0 1 1 6
Crit_SelfConsciousness_Observable 0 1 0 1 1 1 0 1 1 6
Crit_Vulnerability_Observable 0 1 1 1 0 1 0 1 1 6
Crit_SelfEfficacy_Observable 0 0 1 1 1 0 1 0 1 5
Crit_AchievementStriving_Observable 0 0 0 1 0 1 1 1 1 5
Crit_SelfDiscipline_Observable 0 1 1 0 0 1 1 0 1 5

The kappa value represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.15 (p < .001).

Fleiss’s Kappa: Observability of facets on Critique task
Statistic Value
Kappa 0.1517857
z 4.9881876
p-value 0.0000006

Critique: Optimal level

This table presents the optimal level scores for the top 6 most relevant facets for the Critique task. The top 6 were chosen because those facets had either 9 or 8 raters agreeing on relevance (i.e., the highest interrater agreement). The row mean indicates the average optimal level score out of 7. NA values indicate that the rater did not think the facet was relevant for the task.

Optimal Level of Relevant NEO Facets on Critique Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_mean
Crit_Friendliness_OptLvl NA 3 5 4 5 4 5 4 3 4.125
Crit_Assertiveness_OptLvl 4 6 4 5 4 5 5 6 6 5.000
Crit_Altruism_OptLvl NA 5 4 5 4 5 5 5 4 4.625
Crit_Cooperation_OptLvl 4 1 3 6 NA 4 6 3 2 3.625
Crit_Dutifulness_OptLvl 6 5 NA 5 6 6 4 5 3 5.000
Crit_Anger_OptLvl 2 4 2 1 2 1 1 1 4 2.000

For the kappa analysis it was necessary to replace all NA values with 0. For the top 6 most relevant facets, agreement was 0.05, which was non-significant (p = .13).

Fleiss’s Kappa: Critique Optimal Level
Statistic Value
Kappa 0.0462574
z 1.4949712
p-value 0.1349220

ICC:

## boundary (singular) fit: see help('isSingular')
## Call: psych::ICC(x = crit_optlvl)
## 
## Intraclass correlation coefficients 
##                          type  ICC   F df1 df2      p lower bound upper bound
## Single_raters_absolute   ICC1 0.24 3.9   5  48 0.0049       0.039        0.72
## Single_random_raters     ICC2 0.24 3.9   5  40 0.0058       0.039        0.72
## Single_fixed_raters      ICC3 0.24 3.9   5  40 0.0058       0.036        0.72
## Average_raters_absolute ICC1k 0.74 3.9   5  48 0.0049       0.269        0.96
## Average_random_raters   ICC2k 0.74 3.9   5  40 0.0058       0.268        0.96
## Average_fixed_raters    ICC3k 0.74 3.9   5  40 0.0058       0.254        0.96
## 
##  Number of subjects = 6     Number of Judges =  9
## See the help file for a discussion of the other 4 McGraw and Wong estimates,

Teaching task

Teaching: Relevance

All the raters agreed that Friendliness, Assertiveness, and Cheerfulness are relevant for succeeding at the Teaching task. 7/9 experts agreed that Intellect, Modesty, and Altruism are relevant.

Relevance of NEO Facets for Teaching Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_sum
Teach_Friendliness_Relevant 1 1 1 1 1 1 1 1 1 9
Teach_Assertiveness_Relevant 1 1 1 1 1 1 1 1 1 9
Teach_Cheerfulness_Relevant 1 1 1 1 1 1 1 1 1 9
Teach_Altruism_Relevant 0 1 1 1 0 1 1 1 1 7
Teach_Modesty_Relevant 1 0 1 1 0 1 1 1 1 7
Teach_Intellect_Relevant 1 1 0 1 1 0 1 1 1 7
Teach_Cooperation_Relevant 0 0 0 1 1 1 1 1 1 6
Teach_Anger_Relevant 0 1 0 1 1 1 1 1 0 6
Teach_Gregariousness_Relevant 0 0 0 1 0 1 1 1 1 5
Teach_Sympathy_Relevant 0 0 0 1 0 1 1 1 1 5
Teach_SelfEfficacy_Relevant 1 0 0 1 1 0 1 1 0 5
Teach_Orderliness_Relevant 0 1 0 1 1 0 0 1 1 5
Teach_SelfDiscipline_Relevant 0 1 0 1 0 0 1 1 1 5
Teach_Imagination_Relevant 1 1 0 1 0 0 0 1 1 5
Teach_ActivityLevel_Relevant 1 0 0 1 0 1 1 0 0 4

Kappa represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.28 (p < .001).

Compared to kappa values for the other tasks, the level of agreement about the relevance of all 30 facets was high for the teaching task, but the value is still only “fair” according to general guidelines.

Fleiss’s Kappa: Relevance of facets on Teaching task
Statistic Value
Kappa 0.2849021
z 9.3628385
p-value 0.0000000

Percentage agreement among raters was 23.3%.

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 30 
##    Raters = 9 
##   %-agree = 23.3

Excluding raters 1 and 2, percentage agreement was still 23.3%.

##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 30 
##    Raters = 7 
##   %-agree = 23.3

Spearman correlation

Teaching: Observability

All 9 raters agreed that Friendliness is observable during the Teaching task. 8 of 9 agreed that Cheerfulness and Anger are observable. 7 of 9 agreed that Altruism and Cooperation are observable.

Observability of NEO Facets on Teaching Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_sum
Teach_Friendliness_Observable 1 1 1 1 1 1 1 1 1 9
Teach_Cheerfulness_Observable 0 1 1 1 1 1 1 1 1 8
Teach_Anger_Observable 0 1 1 1 1 1 1 1 1 8
Teach_Altruism_Observable 0 1 1 1 0 1 1 1 1 7
Teach_Cooperation_Observable 0 0 1 1 1 1 1 1 1 7
Teach_Gregariousness_Observable 0 0 1 1 0 1 1 1 1 6
Teach_Assertiveness_Observable 0 0 1 1 0 1 1 1 1 6
Teach_Modesty_Observable 0 0 1 1 0 1 0 1 1 5
Teach_Sympathy_Observable 0 0 0 1 0 1 1 1 1 5
Teach_SelfConsciousness_Observable 0 0 1 1 0 0 1 1 1 5
Teach_Intellect_Observable 0 1 0 1 0 0 1 1 1 5
Teach_Orderliness_Observable 0 0 0 1 1 0 0 1 1 4
Teach_Anxiety_Observable 0 0 1 1 0 0 0 1 1 4
Teach_Depression_Observable 0 1 1 0 0 0 0 1 1 4
Teach_Vulnerability_Observable 0 1 0 1 0 0 0 1 1 4

The Fleiss’s kappa value represents interrater agreement across all 30 facets. The proportion of agreement above chance was 0.17 (p < .001).

Fleiss’s Kappa: Observability of facets on Teaching task
Statistic Value
Kappa 0.1682928
z 5.5306664
p-value 0.0000000

Teaching: Optimal level

This table presents the optimal level scores for the top 6 most relevant facets for the Teaching task. The top 6 were chosen because those facets had either 9 or 7 raters agreeing on relevance (i.e., the highest interrater agreement). The row mean indicates the average optimal level score out of 7. NA values indicate that the rater did not think the facet was relevant for the task.

Optimal Levels of most relevant NEO Facets for Teaching Task
V1 V2 V3 V4 V5 V6 V7 V8 V9 row_mean
Teach_Friendliness_OptLvl 6 6 4 4 4 5 6 6 6 5.222222
Teach_Assertiveness_OptLvl 4 5 2 5 4 4 5 4 5 4.222222
Teach_Cheerfulness_OptLvl 5 5 4 5 4 6 5 5 5 4.888889
Teach_Altruism_OptLvl NA 5 4 6 NA 5 5 6 5 5.142857
Teach_Modesty_OptLvl 4 NA 4 3 NA 4 4 5 3 3.857143
Teach_Intellect_OptLvl 6 5 NA 6 4 NA 4 5 6 5.142857

For the kappa analysis it was necessary to replace all NA values with 0. For the top 6 most relevant facets, agreement was 0.05, which was non-significant (p = .19).

Fleiss’s Kappa: Teaching Optimal Level
Statistic Value
Kappa 0.0499080
z 1.3134489
p-value 0.1890317

ICC:

## Call: psych::ICC(x = teach_optlvl)
## 
## Intraclass correlation coefficients 
##                          type   ICC   F df1 df2     p lower bound upper bound
## Single_raters_absolute   ICC1 0.096 2.0   5  48 0.102      -0.036        0.55
## Single_random_raters     ICC2 0.108 2.3   5  40 0.067      -0.020        0.55
## Single_fixed_raters      ICC3 0.123 2.3   5  40 0.067      -0.025        0.59
## Average_raters_absolute ICC1k 0.489 2.0   5  48 0.102      -0.453        0.92
## Average_random_raters   ICC2k 0.522 2.3   5  40 0.067      -0.215        0.92
## Average_fixed_raters    ICC3k 0.558 2.3   5  40 0.067      -0.284        0.93
## 
##  Number of subjects = 6     Number of Judges =  9
## See the help file for a discussion of the other 4 McGraw and Wong estimates,

Lastly, Kendall’s W = 0.31 (p = .015), a moderate concordance amongst ratings.

##  Kendall's coefficient of concordance Wt
## 
##  Subjects = 6 
##    Raters = 9 
##        Wt = 0.315 
## 
##  Chisq(5) = 14.2 
##   p-value = 0.0145

Metrics

Fleiss’s kappa

Fleiss’s kappa can be used to assess inter-rater reliability when there are more than two raters and the data are on a nominal or ordinal scale. In this instance, Fleiss’s kappa was best suited to the Relevance and Observability ratings, since those variables were binary (yes/no). Because Likert-type ratings can be treated as ordinal, I also applied it to the Optimal Level ratings, but included the ICC as well.

Fleiss’s kappa ranges from -1 to 1 and reflects the degree of agreement beyond what would be expected by chance. Values greater than 0 indicate agreement better than chance.
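For reference, the standard definition (not reproduced from the original document) is

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},$$

where $\bar{P}$ is the mean observed proportion of agreeing rater pairs per subject and $\bar{P}_e$ is the agreement expected by chance from the overall category proportions.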

Interpretation guides strongly caution against using one-size-fits-all interpretation categories, but generally speaking, 0-.20 is “slight” agreement and 0.21-0.40 is “fair” agreement; higher categories are omitted here because none of the observed values exceeded 0.40.

Light’s kappa

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/ “For fully-crossed designs with three or more coders, Light (1971) suggests computing kappa for all coder pairs then using the arithmetic mean of these estimates to provide an overall index of agreement.”
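A sketch of that pairwise-averaging idea, using the assumed pres_relevant facet-by-rater matrix; irr::kappam.light() does the same thing in one call:

library(irr)

# Cohen's kappa for every pair of raters, then averaged (Light, 1971)
pairs <- combn(ncol(pres_relevant), 2)
pair_kappas <- apply(pairs, 2, function(p) kappa2(pres_relevant[, p])$value)
mean(pair_kappas)

# Convenience function that reports the same average
kappam.light(pres_relevant)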

Krippendorff’s Alpha

Krippendorff’s alpha is similar to Fleiss’s kappa but is built around disagreement rather than agreement and is more robust to missing values. It can be used for nominal and ordinal data with more than two raters. Values run up to 1, where 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.

Kendall’s W of Concordance

For ordinal data (such as the Likert scale ratings of optimal personality), Kendall’s W assesses the strength of the relationship between ratings. Similar to correlation, Kendall’s W ranges from 0 to 1, where higher values indicate stronger inter-rater reliability.

log odds ratio

https://john-uebersax.com/stat/agree.htm tests association between raters

tetrachoric correlation

https://john-uebersax.com/stat/agree.htm “They estimate what the correlation between raters would be if ratings were made on a continuous scale; they are, theoretically, invariant over changes in the number or “width” of rating categories. The tetrachoric and polychoric correlations also provide a framework that allows testing of marginal homogeneity between raters. Thus, these statistics let one separately assess both components of rater agreement: agreement on trait definition and agreement on definitions of specific categories.”

May not be appropriate if the latent trait is discrete
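Setting that caveat aside, a hedged sketch of how tetrachoric correlations between raters could be computed with psych::tetrachoric(), using the assumed pres_relevant 0/1 matrix (each column is a rater, so the result is a 9 x 9 rater-by-rater matrix):

library(psych)

# Tetrachoric correlations among raters' binary relevance ratings;
# with only 30 facets some 2 x 2 tables may have empty cells, so
# estimates can be unstable
tet <- tetrachoric(pres_relevant)
round(tet$rho, 2)   # estimated latent correlations between raters
tet$tau             # thresholds, which reflect marginal differences between raters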

Marginal homogeneity

Marginal homogeneity refers to equality (lack of significant difference) between one or more of the row marginal proportions and the corresponding column proportion(s). Testing marginal homogeneity is often useful in analyzing rater agreement. One reason raters disagree is because of different propensities to use each rating category.

two-rater agreement indices

Aickin’s alpha

https://www.agreestat.com/books/cac5/chapter5/chap5.pdf Aickin (1990): “The α parameter is defined as the fraction of the entire subject population made up of subjects that the two raters A and B classified identically for cause, rather than by chance.” Aickin’s alpha operates on probabilities of raters scoring items into a certain category. Percent chance agreement is calculated based only on items that were difficult to score (i.e. lack consensus), which protects against kappa paradoxes.

Gwet’s gamma or AC1 (seems to address some of the difficulties noted with kappa)

“Unlike Aickin’s alpha coefficient, which is defined as the probability that two raters A and B agree for cause, Gwet’s AC1 (see Gwet, 2008a) is defined as the probability that two raters agree given that the subjects being rated are not susceptible to agreement by pure chance.”
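As a rough illustration of the AC1 idea for two raters and binary ratings, here is a from-scratch sketch based on the formula described by Gwet (2008a); this is not code from the original analysis, and for real use a dedicated implementation (e.g., the irrCAC package) would be preferable:

# Gwet's AC1 for two raters with 0/1 ratings (sketch; assumes the formula above)
gwet_ac1_binary <- function(r1, r2) {
  pa     <- mean(r1 == r2)                # observed agreement
  pi_hat <- (mean(r1) + mean(r2)) / 2     # average prevalence of category 1
  pe     <- 2 * pi_hat * (1 - pi_hat)     # chance agreement under Gwet's model
  (pa - pe) / (1 - pe)
}

# Example: raters 3 and 4 of the assumed pres_relevant matrix
gwet_ac1_binary(pres_relevant[, 3], pres_relevant[, 4])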

conditional agreement (Rosenfield et al 1986)

logistic regression with random effects of case and expert (crossed, not nested)

https://link-springer-com.manchester.idm.oclc.org/article/10.1007/BF02294802 Slide 46: https://folk.ntnu.no/slyderse/medstat/Interrater_fullpage_9March2016.pdf

We can examine the random effects of raters to see how much they varied in their responses overall.
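A sketch of that model with lme4, assuming the relevance ratings have been reshaped to long format with columns relevant (0/1), facet, and rater (hypothetical names):

library(lme4)

# Logistic regression with crossed random intercepts for facet (case) and rater (expert)
m <- glmer(relevant ~ 1 + (1 | facet) + (1 | rater),
           data = relevance_long, family = binomial)

# The variance of the rater intercepts indicates how much raters differ in their
# overall propensity to call facets relevant
VarCorr(m)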

High agreement yet low reliability indices

from: https://www.bwgriffin.com/gsu/courses/edur8331/edur8331-presentations/EDUR-8331-07a-coder-agreement-nominal-data.pdf

“Measures of rater agreement often provide low values when high levels of agreement exist among raters. The table below shows 20 passages coded by four raters using the four coding categories listed below. Note that all raters agree on every passage except for passage 20. Despite 95.2% agreement, the other measures of agreement are below acceptable levels: Fleiss’ kappa = .316, mean Cohen’s kappa = .244, and Krippendorff’s alpha = .325. 1 = Positive statement 2 = Negative statement 3 = Neutral statement 4 = Other unrelated statement/Not applicable The problem with these data is lack of variability in codes. When most raters assign one code predominately, then measures of agreement can be misleadingly low, as demonstrated in this example. This is one reason I recommend always reporting percent agreement.”
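A toy reconstruction consistent with that description (not the exact table from the source): 20 passages, 4 raters, agreement on everything except the last passage. Percent agreement is high, yet Fleiss’s kappa is low because nearly all codes fall into one category:

library(irr)

# 19 passages coded "1" by all four raters, plus one passage with full disagreement
toy <- rbind(matrix(1, nrow = 19, ncol = 4),
             c(1, 2, 3, 4))

agree(toy)          # raw agreement is 95%
kappam.fleiss(toy)  # kappa is only about .32 because category use is so skewed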

Rater experience

Respondents were asked to report their personality-related qualifications. Raters V1 and V2 have the fewest years of experience. It’s possible that Raters 5 and 9 have their PhDs in a non-personality field but have since published in personality journals, or they misread the question.

Rater demographics
RaterID EXPERIENCE EXPERTISE
1 3 Current PhD
2 5 Current PhD
3 6 Current PhD
4 7 Current PhD,Published,Behaviour coding experience
5 7 Completed MSc,Academic
6 10 Current PhD
7 12 Current PhD,Published
8 17 Completed PhD,Academic,Published,Behaviour coding experience
9 32 Published

Note. EXPERIENCE = “How many years of experience do you have in personality science, inclusive of postgraduate training?” EXPERTISE = “Which of the following describes your expertise in personality research? You may select multiple.”