After receiving all the results from the experiment on timbre, rhythm, and general similarities, we aimed to find a level of agreement between the raters. For this experiment, the statistical analysis required was Fleiss’ Kappa due to its non-parametric nature. Landis and Koch (1977) gave an interpretation for the Fleiss’ Kappa values, presented in the following table:
| Kappa | Agreement |
|---|---|
| < 0 | Poor agreement |
| 0.01 - 0.20 | Slight agreement |
| 0.21 - 0.40 | Fair agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.81 - 1.00 | Almost perfect agreement |
The first agreement analysis made was across the participants. Because of the experimental design it was not possible to use the entire data set of the timbre and rhythm experiments in a single run of the analysis. Instead, we randomly selected n ratings of each pair for one thousand iterations and calculated the mean. Additionally, we took into account the ratings in a four-point scale, and also scaled it down to a binary scale. The results are as follows:
For four-point scale data (original):
[1] "Minimum ratings: 3"
[1] "Number of subjects: 32"
[1] "Number of pairs: 190"
mean sd median min max
Kappa 0.08 0.03 0.08 -0.02 0.18
z 2.94 1.03 2.95 -0.65 6.67
p.value 0.04 0.10 0.00 0.00 0.99
For two-point scale data (re-scaled):
[1] "Minimum ratings: 3"
[1] "Number of subjects: 32"
[1] "Number of pairs: 190"
mean sd median min max
Kappa 0.11 0.05 0.11 -0.02 0.30
z 2.63 1.12 2.61 -0.50 7.08
p.value 0.07 0.15 0.01 0.00 0.95
These tables show the descriptive statistics of Fleiss’ Kappa’s analysis. The most important points are:
Knowing this, we can see how the mean Kappa value for the two-point scale (mean Kappa = 0.11, median p-value = 0.01) is slightly larger than for the four-point scale (mean Kappa = 0.08, median p-value = 0), enough to describe the effect as slight agreement according to Landis and Koch (1977).
In the previous analysis we used the full rating data set. This contains some pairs with very concordant ratings and other pairs with very disperse ratings. This is to be expected, since participants have different listening strategies and musical preferences which mould their affinity to certain musical traits. Because of this, we selected the 25 pairs with the lowest standard deviations (four-point scale minimum sd = 0.3333, maximum sd = 0.7071; two-point scale minimum sd = 0, maximum sd = 0.3378) for a second Fleiss’ Kappa test. The results for this reduced data set show higher levels of agreement than those of the full data set, reaching fair agreement on the four-point scale (mean Kappa = 0.29, median p-value = 0), and substantial agreement on the two-point scale (mean Kappa = 0.65, median p-value = 0). (Note for edit: Evidently, the agreement level increases as we decrease the number of pairs with low SD. We need to find a good number that doesn’t look like we’re cherrypicking our data.)
For four-point scale data (original):
mean sd median min max
Kappa 0.29 0.06 0.29 0.13 0.47
z 5.35 1.05 5.31 2.43 8.55
p.value 0.00 0.00 0.00 0.00 0.02
For two-point scale data (re-scaled):
mean sd median min max
Kappa 0.65 0.09 0.65 0.37 0.93
z 7.90 1.12 7.97 4.47 11.38
p.value 0.00 0.00 0.00 0.00 0.00
As part of Experiment 2, one participant rated the same set of pairs of segments six times. For this analysis we did not need to get random samples from the ratings since it does not have missing values. A regular Fleiss’ Kappa was applied:
For four-point scale data (original):
Fleiss' Kappa for m Raters
Subjects = 18
Raters = 6
Kappa = 0.544
z = 14.4
p-value = 0
For two-point scale data (re-scaled):
Fleiss' Kappa for m Raters
Subjects = 18
Raters = 6
Kappa = 0.643
z = 10.6
p-value = 0
With this information, we can conclude that this participant has substantial agreement on his ratings based on a two-point scale (Kappa = 0.6427, p-value = 0), and moderate agreement on a four-point scale (Kappa = 0.5435, p-value = 0).
As a part of the study, we also tested participants for their agreement on general music similarity. The Fleiss’ Kappa analysis shows the following:
For four-point scale data (original):
Fleiss' Kappa for m Raters
Subjects = 18
Raters = 26
Kappa = 0.146
z = 17.2
p-value = 0
For two-point scale data (re-scaled):
Fleiss' Kappa for m Raters
Subjects = 18
Raters = 26
Kappa = 0.195
z = 14.9
p-value = 0
We can conclude that even with the re-scaled ratings, the general similarity agreement remains as slight (two-point scale Kappa = 0.1947, p-value = 0; four-point scale Kappa = 0.1464, p-value = 0).
Since the general similarity stimuli were formed from a reduced pool of segments, it is important to analyze the same isolated pairs when looking for timbre similarity. In the following table we can see the Fleiss’ Kappa analysis of timbre similarity of pairs 7, 17, 21, 39, 47, 53, 59, 62, 94, 111, 119, 149, 151, 176, 178, 184, 188, and 190:
For four-point scale data (original):
[1] "Minimum ratings: 3"
[1] "Number of subjects: 32"
[1] "Number of pairs: 18"
mean sd median min max
Kappa 0.16 0.08 0.15 -0.05 0.45
z 1.98 0.95 1.93 -0.57 5.42
p.value 0.14 0.20 0.05 0.00 1.00
For two-point scale data (re-scaled):
[1] "Minimum ratings: 3"
[1] "Number of subjects: 32"
[1] "Number of pairs: 18"
mean sd median min max
Kappa 0.28 0.12 0.26 -0.04 0.70
z 2.07 0.87 1.91 -0.31 5.17
p.value 0.12 0.18 0.06 0.00 0.98
By taking this isolated group, we can find an increase in Kappa values (two-point scale mean Kappa = 0.28, median p-value = 0.06; four-point scale mean Kappa = 0.16, median p-value = 0.05), they are within the fair agreement level for the two-point scale, and slight agreement level for the four-point scale. It is worth mentioning that the median p-values for both two- and four-point scale ratings are slightly above and at the accepted significance value of 0.05, which means that these results could be attributed to chance.
We also conducted a Wilcoxon rank sum test with timbre similarity and general similarity, which determines the probability of the two data sets belonging to the same group. The Wilcoxon rank sum test shows that 8 pairs in the four-point scale and 4 pairs in the two-point scale have a probability below 0.05 of belonging to the same group.
For four-point scale data (original):
W p.value Timbre mean rating General mean rating
7 167.0 0.007593 2.875 1.769
21 342.0 0.006130 1.450 1.038
39 147.5 0.009872 3.429 2.346
47 352.5 0.011239 3.211 2.462
59 360.0 0.001602 2.722 1.692
119 128.0 0.012062 3.000 1.808
149 126.5 0.013097 2.667 1.654
178 314.0 0.033777 2.000 1.423
For two-point scale data (re-scaled):
W p.value Timbre mean rating General mean rating
7 166 0.001564 1.750 1.154
39 140 0.012637 2.000 1.462
59 336 0.004450 1.667 1.231
119 112 0.044047 1.667 1.231
As a comparison, we also calculated the W values of both rhythm similarity and general similarity to get a better overview of the possible attributes that participants might be listening for. In this test we can observe 8 pairs in the four-point scale and 8 pairs in the two-point scale have a lower-than-0.05 probability of being part of the same group.
For four-point scale data (original):
W p.value Rhythm mean rating General mean rating
7 289.0 2.860e-02 2.500 1.769
21 267.5 2.915e-03 1.667 1.038
53 326.5 1.611e-05 2.400 1.115
111 418.5 4.425e-05 3.421 1.962
119 424.5 2.601e-06 3.556 1.808
149 330.0 1.374e-05 3.286 1.654
176 248.5 1.422e-02 3.000 2.154
190 361.0 4.114e-05 3.125 1.769
For two-point scale data (re-scaled):
W p.value Rhythm mean rating General mean rating
7 293.0 6.285e-03 1.562 1.154
21 247.0 6.699e-03 1.267 1.000
53 291.5 2.909e-04 1.533 1.038
111 382.5 2.884e-04 1.895 1.346
119 401.0 4.372e-06 1.944 1.231
149 310.0 2.049e-05 1.857 1.154
176 234.0 2.650e-02 1.769 1.385
190 337.0 1.037e-04 1.812 1.192
Another interesting comparison is the difference in similarity ratings between timbre and rhythm. Since both projects had participants assess the same segment pairs in the same way, the two data sets are comparable in number of pairs (n = 190). We found that in the original four-point scale ratings, there are 43 pairs with low probability of belonging to the same group. In the re-scaled two-point variant, 32 pairs have a low probability of belonging to the same group. There are 20 pairs that are present in in both four-and two-point scales (29, 51, 53, 59, 69, 82, 102, 105, 109, 111, 115, 120, 136, 153, 155, 166, 167, 183, 185, 190). This could mean that the participants are indeed listening for different attributes in these 20 pairs of tracks to determine the similarity rating of the specific trait (timbre or rhythm). Here are the pairs that fit in this criterion for both four- and two-point scales:
For four-point scale data (original):
W p.value Timbre mean rating Rhythm mean rating
5 "105.5" "0.0443255447725856" "2.875" "1.83333333333333"
28 "89.5" "0.0149828936900781" "2.375" "1.42857142857143"
29 "17.5" "0.0177177495805173" "1" "2.16666666666667"
32 "58.5" "0.0460723466278493" "1.8" "1.1875"
40 "161.5" "0.0270862126529018" "1.61111111111111" "1.07692307692308"
51 "55" "0.00514498560809412" "2.75" "1.125"
53 "25" "0.00759201184730878" "1.22222222222222" "2.4"
56 "223" "0.0444594045846168" "1.57894736842105" "1.16666666666667"
57 "57.5" "0.0132419018795998" "2.22222222222222" "3.30769230769231"
58 "87" "0.0456186483271802" "2.47368421052632" "3.2"
59 "179" "0.0379880856728285" "2.72222222222222" "1.92857142857143"
65 "17" "0.0128477356927297" "2.14285714285714" "3.28571428571429"
69 "27" "0.0183774661570275" "2" "3.05263157894737"
81 "74.5" "0.0158192449298792" "2.42857142857143" "1.53846153846154"
82 "201.5" "0.0369022854314003" "2.25" "1.61111111111111"
91 "68" "0.0133031102016919" "2.33333333333333" "1.35714285714286"
92 "97.5" "0.000913353258883279" "1.875" "1"
95 "72.5" "0.00717460018148199" "2.375" "1.18181818181818"
98 "194.5" "0.00756846365157055" "1.8125" "1.11764705882353"
99 "58.5" "0.042017518615856" "2.6" "1.46666666666667"
102 "69.5" "0.0143375066844626" "3" "1.75"
105 "16" "0.00244266926357476" "1.14285714285714" "2.42105263157895"
107 "98.5" "0.0320119799964901" "2.66666666666667" "3.27777777777778"
108 "12.5" "0.0170179227949198" "1.33333333333333" "2.69230769230769"
109 "56" "0.0119224282648445" "1.8" "1"
111 "26.5" "0.0143487704061196" "2.14285714285714" "3.42105263157895"
115 "2" "0.000511068008924017" "1.83333333333333" "3.83333333333333"
120 "7" "0.00325666597345654" "2.2" "3.625"
125 "60" "0.0420929867132197" "2.1875" "3.07692307692308"
127 "49.5" "0.0419782629403505" "3" "1.73333333333333"
136 "105" "0.0101889106953954" "3.25" "1.9375"
153 "32" "0.0297593705805928" "1.88888888888889" "3.06666666666667"
154 "100" "0.0439042733565846" "1.89473684210526" "2.64705882352941"
155 "71.5" "0.0194676359051022" "1.68421052631579" "2.57142857142857"
158 "66" "0.00952622579856508" "2.11764705882353" "3.125"
160 "235.5" "0.039348671476639" "2.65" "1.94117647058824"
166 "75" "0.0171597273494141" "2.33333333333333" "1.25"
167 "43" "0.0305208990764439" "2" "3"
179 "70" "0.00686346177289138" "2.46666666666667" "3.31578947368421"
180 "32" "0.0284379703638451" "2.5" "3.23529411764706"
183 "11.5" "0.000509289004160261" "1.11111111111111" "2.86666666666667"
185 "80" "0.0293241415049927" "2.14285714285714" "1.26666666666667"
190 "14" "0.00875917643294562" "1.66666666666667" "3.125"
For two-point scale data (re-scaled):
W p.value Timbre mean rating Rhythm mean rating
29 "21" "0.0316356750082053" "1" "1.5"
51 "56" "0.000335902695118983" "1.75" "1"
53 "31.5" "0.00958917722413349" "1" "1.53333333333333"
54 "80.5" "0.020116159466565" "2" "1.46666666666667"
59 "174" "0.0372971776971669" "1.66666666666667" "1.28571428571429"
60 "57" "0.0358968515568493" "1.57142857142857" "1.09090909090909"
69 "33" "0.0207030937480778" "1.28571428571429" "1.78947368421053"
75 "96.5" "0.0431498772759032" "1.71428571428571" "1.26315789473684"
82 "191" "0.035842479566953" "1.4375" "1.11111111111111"
93 "60" "0.0374588520797689" "1.5" "1.07142857142857"
97 "59.5" "0.00950080429663086" "1.4" "1"
100 "85" "0.00103592055141566" "1.44444444444444" "1.94736842105263"
102 "62.5" "0.0268884541037458" "1.57142857142857" "1.08333333333333"
105 "38.5" "0.0469083233784227" "1" "1.42105263157895"
109 "56" "0.0119224282648445" "1.4" "1"
111 "35.5" "0.0157673613407186" "1.42857142857143" "1.89473684210526"
115 "12" "0.00227441725433153" "1.33333333333333" "2"
120 "18.5" "0.0108770540412079" "1.4" "1.9375"
122 "171" "0.0399921024834285" "1.5" "1.14285714285714"
123 "90" "0.0263762437560049" "1.70588235294118" "2"
136 "96" "0.0240068064683308" "1.75" "1.25"
148 "24.5" "0.0334774779329084" "1.28571428571429" "1.78571428571429"
153 "28.5" "0.00722893557714664" "1.22222222222222" "1.8"
155 "85" "0.037924837464648" "1.21052631578947" "1.57142857142857"
166 "77" "0.00383275785336518" "1.66666666666667" "1.0625"
167 "46.5" "0.0225754929015144" "1.33333333333333" "1.78947368421053"
168 "21" "0.0103657178757337" "1" "1.6"
177 "72" "0.0340045975757569" "1.28571428571429" "1"
182 "191" "0.0400008397132475" "1.47368421052632" "1.13333333333333"
183 "22.5" "0.00190456544137854" "1" "1.66666666666667"
185 "67.5" "0.0403049743625408" "1.28571428571429" "1"
190 "25" "0.039979602946257" "1.33333333333333" "1.8125"