Preliminary analysis

Timbre similarity

Data cleanup

## Loading required package: psych

As a first step in analyzing the data collected through the Survey Gizmo online platform (n = 42), the data had to be cleaned up. To achieve this, I followed these steps:

A function (reliability.check.R) compares, for each participant, the ratings of the elements repeated at the beginning and at the end of the experiment, which should be equal (within a ±1 margin). This leaves us with 35 cases.
A script (clean.data.R) extracts only the ratings given for each pair of segments. It can either remove the ratings where the rater was “Not confident” (confidence level 1) about their answer, or return the data as-is. When confidence level 1 ratings are kept, the number of ratings per pair ranges from 8 to 15; when they are removed, it ranges from 7 to 15.
Another database was generated by polarizing the 4-point ratings into “Dissimilar” (1, from ratings 1 and 2) and “Similar” (2, from ratings 3 and 4).
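The first and last cleanup steps can be illustrated with a minimal sketch. This is not the code in reliability.check.R or clean.data.R, which is not shown here: the function names reliability_check and polarize, the matrix layout (one row per participant, one column per repeated pair), and the rule that every repeated rating must fall within the margin are all assumptions made for illustration.

```r
# Hypothetical sketch, not the author's reliability.check.R.
# `first` and `last` are assumed to be numeric matrices with one row per
# participant and one column per repeated pair (ratings on the 1-4 scale).
reliability_check <- function(first, last, margin = 1) {
  stopifnot(all(dim(first) == dim(last)))
  # Assumption: a participant passes only if every repeated rating
  # differs by at most `margin` between the two presentations
  apply(abs(first - last) <= margin, 1, all)
}

# Polarize 4-point ratings: 1-2 become "Dissimilar" (1), 3-4 become "Similar" (2)
polarize <- function(x) ifelse(x <= 2, 1, 2)

# Toy data for three participants and three repeated pairs
first <- matrix(c(1, 4, 2,
                  3, 3, 1,
                  2, 1, 4), nrow = 3, byrow = TRUE)
last  <- matrix(c(2, 4, 2,
                  1, 3, 1,
                  2, 2, 3), nrow = 3, byrow = TRUE)

keep <- reliability_check(first, last)
print(keep)                                  # which participants are retained
print(polarize(first[keep, , drop = FALSE])) # polarized ratings of retained rows
```

In the toy data, the second participant changes one rating by two points and is therefore dropped, while the other two stay within the ±1 margin.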

With this clean database, I ran some statistical analyses to find interesting trends.

Fleiss' Kappa

Fleiss' Kappa assesses the reliability of agreement among a fixed number of raters. In our case, each rater was exposed to only a fraction of the total number of pairs, so the database is incomplete and pairs differ in how many ratings they received. I wrote fleiss.EDM.R to draw, for every pair, a random sample of ratings whose size is the minimum number of ratings any pair has, and to compute the Kappa statistic on that sample. Because the draw is random, it is repeated for a user-specified number of iterations. The function outputs descriptive statistics of the Kappa values obtained across the iterations.
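The subsampling idea can be sketched in base R as follows. This is not the code in fleiss.EDM.R: the helper names fleiss_kappa and subsample_ratings, the list-of-vectors data layout, and the simulated ratings are assumptions made for illustration. Kappa is computed by hand from the standard Fleiss formula so that the sketch needs no packages; in practice kappam.fleiss from the irr package, used later in this report, can take the subsampled matrix instead.

```r
# Hypothetical sketch, not the author's fleiss.EDM.R.
# `ratings` is assumed to be a list with one numeric vector per pair,
# holding however many ratings that pair received.

# Fleiss' Kappa for a complete subjects x raters matrix
fleiss_kappa <- function(m, categories = sort(unique(as.vector(m)))) {
  n <- ncol(m)  # ratings per subject
  # counts[i, j]: number of ratings of subject i that fall in category j
  counts <- matrix(
    vapply(categories, function(ct) rowSums(m == ct), numeric(nrow(m))),
    nrow = nrow(m)
  )
  p_i <- (rowSums(counts^2) - n) / (n * (n - 1))  # per-subject agreement
  p_j <- colSums(counts) / sum(counts)            # category proportions
  p_e <- sum(p_j^2)                               # agreement expected by chance
  (mean(p_i) - p_e) / (1 - p_e)
}

# Keep, for every pair, a random subset of ratings of the minimum size
subsample_ratings <- function(ratings) {
  n_min <- min(lengths(ratings))
  t(vapply(ratings,
           function(r) r[sample.int(length(r), n_min)],
           numeric(n_min)))
}

# Simulated data: 20 pairs with between 8 and 15 ratings each
set.seed(1)
ratings <- replicate(20,
                     sample(1:4, sample(8:15, 1), replace = TRUE),
                     simplify = FALSE)

# Repeat the random draw and summarize the resulting Kappa values
kappas <- replicate(200,
                    fleiss_kappa(subsample_ratings(ratings), categories = 1:4))
print(summary(kappas))
```

Because the simulated ratings are random, the Kappa values here scatter around zero; the point of the sketch is only the draw, compute, repeat structure.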

Example

4-point scale

# The current raw dataset was loaded as "rtim"
# The variable "conf.1.rm" stands for "confidence level 1: remove"
# The variable "polarize" changes the data from 4-point base to 2-point base.
fexample <- fleiss.EDM(data=rtim, conf.1.rm=FALSE, polarize=FALSE, iterations=200)

## Loading required package: irr
## Loading required package: lpSolve

##                    vars   n mean   sd median trimmed  mad  min  max range
## X0.144861983091651    1 200 0.15 0.01   0.15    0.15 0.01 0.12 0.19  0.07
##                    skew kurtosis se
## X0.144861983091651 0.14    -0.32  0

Across the 200 iterations, Kappa had a mean of 0.15, a minimum of 0.12, and a maximum of 0.19.

2-point scale

# The current raw dataset was loaded as "rtim"
# The variable "conf.1.rm" stands for "confidence level 1: remove"
# The variable "polarize" changes the data from 4-point base to 2-point base.
fexample <- fleiss.EDM(data=rtim, conf.1.rm=FALSE, polarize=TRUE, iterations=200)

##                    vars   n mean   sd median trimmed  mad  min  max range
## X0.129523009315755    1 200 0.13 0.01   0.13    0.13 0.01 0.09 0.17  0.08
##                     skew kurtosis se
## X0.129523009315755 -0.04      0.9  0

Across the 200 iterations, Kappa had a mean of 0.13, a minimum of 0.09, and a maximum of 0.17.

Intraclass Correlation Coefficient

Another measure of the reliability of agreement between (and also within) raters is the Intraclass Correlation Coefficient (ICC). This measure stands apart from other agreement measures because of its ability to analyze exchangeable measurements. This means that it takes into account systematic differences among observers, thanks to its nature as a correlation. However, unlike most correlations, the ICC works not only with pairs of measurements but also with larger groups.

The function icc.EDM.R follows the same procedure as fleiss.EDM.R: ratings are chosen at random, and the ICC statistic is computed over as many iterations as requested.
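For reference, a one-way, single-score ICC can be computed from the between-pair and within-pair mean squares, as in the sketch below. This is not the code in icc.EDM.R: the function name icc_oneway and the simulated matrix are assumptions made for illustration. The one-way model is also an assumption, although it is consistent with the degrees of freedom reported in the timbre output of this report (189 and 1330, which match 190 pairs with 8 ratings each).

```r
# Hypothetical sketch, not the author's icc.EDM.R.
# One-way, single-score ICC for a complete subjects x raters matrix
# (rows = pairs, columns = the ratings kept for each pair).
icc_oneway <- function(m) {
  n <- nrow(m)  # number of subjects (pairs)
  k <- ncol(m)  # ratings per subject
  row_means <- rowMeans(m)
  # Between-subject and within-subject mean squares
  msb <- k * sum((row_means - mean(m))^2) / (n - 1)
  msw <- sum((m - row_means)^2) / (n * (k - 1))
  c(ICC = (msb - msw) / (msb + (k - 1) * msw),
    F   = msb / msw,
    df1 = n - 1,
    df2 = n * (k - 1))
}

# Simulated 4-point ratings: 190 pairs, 8 ratings per pair
set.seed(1)
m <- matrix(sample(1:4, 190 * 8, replace = TRUE), nrow = 190)
res <- icc_oneway(m)
print(res)
```

Feeding this function a freshly subsampled matrix on each iteration, and collecting the results, would reproduce the repeated-draw structure described above.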

Example

4-point scale

# The current raw dataset was loaded as "rtim"
# The variable "conf.1.rm" stands for "confidence level 1: remove"
# The variable "polarize" changes the data from 4-point base to 2-point base.
icc.EDM(data=rtim, conf.1.rm=FALSE, polarize=FALSE, iterations=200)

##      type                ICC              F             df1     
##  Length:200         Min.   :0.129   Min.   :2.19   Min.   :189  
##  Class :character   1st Qu.:0.148   1st Qu.:2.39   1st Qu.:189  
##  Mode  :character   Median :0.156   Median :2.48   Median :189  
##                     Mean   :0.157   Mean   :2.49   Mean   :189  
##                     3rd Qu.:0.165   3rd Qu.:2.59   3rd Qu.:189  
##                     Max.   :0.189   Max.   :2.86   Max.   :189  
##       df2             p             lower bound      upper bound   
##  Min.   :1330   Min.   :0.00e+00   Min.   :0.0889   Min.   :0.179  
##  1st Qu.:1330   1st Qu.:0.00e+00   1st Qu.:0.1057   1st Qu.:0.200  
##  Median :1330   Median :0.00e+00   Median :0.1124   Median :0.208  
##  Mean   :1330   Mean   :1.78e-17   Mean   :0.1132   Mean   :0.209  
##  3rd Qu.:1330   3rd Qu.:0.00e+00   3rd Qu.:0.1211   3rd Qu.:0.219  
##  Max.   :1330   Max.   :1.78e-15   Max.   :0.1421   Max.   :0.244

2-point scale

# The current raw dataset was loaded as "rtim"
# The variable "conf.1.rm" stands for "confidence level 1: remove"
# The variable "polarize" changes the data from 4-point base to 2-point base.
icc.EDM(data=rtim, conf.1.rm=FALSE, polarize=TRUE, iterations=200)

##      type                ICC              F             df1     
##  Length:200         Min.   :0.106   Min.   :1.95   Min.   :189  
##  Class :character   1st Qu.:0.126   1st Qu.:2.16   1st Qu.:189  
##  Mode  :character   Median :0.134   Median :2.24   Median :189  
##                     Mean   :0.135   Mean   :2.25   Mean   :189  
##                     3rd Qu.:0.142   3rd Qu.:2.33   3rd Qu.:189  
##                     Max.   :0.163   Max.   :2.56   Max.   :189  
##       df2             p            lower bound      upper bound   
##  Min.   :1330   Min.   :0.0e+00   Min.   :0.0679   Min.   :0.152  
##  1st Qu.:1330   1st Qu.:0.0e+00   1st Qu.:0.0862   1st Qu.:0.175  
##  Median :1330   Median :0.0e+00   Median :0.0932   Median :0.184  
##  Mean   :1330   Mean   :5.5e-13   Mean   :0.0935   Mean   :0.185  
##  3rd Qu.:1330   3rd Qu.:6.0e-15   3rd Qu.:0.1005   3rd Qu.:0.193  
##  Max.   :1330   Max.   :2.0e-11   Max.   :0.1187   Max.   :0.216

Within Participant Concordance

Another part of the project was to investigate whether subjects could rate the same segment pairs consistently. We asked one participant to rate the same 18 pairs six times. The analysis was run with both the 4-point and 2-point scales.

4-point scale analysis

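timbreWPC <- read.csv("timbreWPC.csv")
# Select every second column from 22 to 56 and transpose, so that the
# 18 items are rows (subjects) and the 6 repetitions are columns (raters)
kappam.fleiss(ratings=t(timbreWPC[,seq(22, 57, 2)]))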

##  Fleiss' Kappa for m Raters
## 
##  Subjects = 18 
##    Raters = 6 
##     Kappa = 0.544 
## 
##         z = 14.4 
##   p-value = 0

icc(ratings=t(timbreWPC[,seq(22, 57, 2)]))

##  Single Score Intraclass Correlation
## 
##    Model: oneway 
##    Type : consistency 
## 
##    Subjects = 18 
##      Raters = 6 
##      ICC(1) = 0.792
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##    F(17,90) = 23.8 , p = 3.88e-26 
## 
##  95%-Confidence Interval for ICC Population Values:
##   0.654 < ICC < 0.901

2-point scale analysis

# Polarize the 4-point ratings: 1-2 become "Dissimilar" (1), 3-4 become "Similar" (2)
timbreWPC[timbreWPC == 2] <- 1
timbreWPC[timbreWPC == 3 | timbreWPC == 4] <- 2
kappam.fleiss(ratings=t(timbreWPC[,seq(22, 57, 2)]))

##  Fleiss' Kappa for m Raters
## 
##  Subjects = 18 
##    Raters = 6 
##     Kappa = 0.643 
## 
##         z = 10.6 
##   p-value = 0

icc(ratings=t(timbreWPC[,seq(22, 57, 2)]))

##  Single Score Intraclass Correlation
## 
##    Model: oneway 
##    Type : consistency 
## 
##    Subjects = 18 
##      Raters = 6 
##      ICC(1) = 0.657
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##    F(17,90) = 12.5 , p = 5.56e-17 
## 
##  95%-Confidence Interval for ICC Population Values:
##   0.477 < ICC < 0.824

Rhythm similarity

For the rhythm experiment, Thomas has already analyzed the data with Fleiss' Kappa. Here I use the ICC function to gain some new insights into it.

4-point scale

# The current raw dataset was loaded as "rrhy"
# The variable "conf.1.rm" stands for "confidence level 1: remove"
# The variable "polarize" changes the data from 4-point base to 2-point base.
icc.EDM(data=rrhy, conf.1.rm=FALSE, polarize=FALSE, iterations=200)

##      type                ICC              F             df1     
##  Length:200         Min.   :0.252   Min.   :4.37   Min.   :189  
##  Class :character   1st Qu.:0.280   1st Qu.:4.88   1st Qu.:189  
##  Mode  :character   Median :0.294   Median :5.16   Median :189  
##                     Mean   :0.294   Mean   :5.18   Mean   :189  
##                     3rd Qu.:0.307   3rd Qu.:5.43   3rd Qu.:189  
##                     Max.   :0.343   Max.   :6.23   Max.   :189  
##       df2             p      lower bound     upper bound   
##  Min.   :1710   Min.   :0   Min.   :0.204   Min.   :0.308  
##  1st Qu.:1710   1st Qu.:0   1st Qu.:0.230   1st Qu.:0.338  
##  Median :1710   Median :0   Median :0.243   Median :0.353  
##  Mean   :1710   Mean   :0   Mean   :0.243   Mean   :0.353  
##  3rd Qu.:1710   3rd Qu.:0   3rd Qu.:0.255   3rd Qu.:0.366  
##  Max.   :1710   Max.   :0   Max.   :0.290   Max.   :0.404

2-point scale

# The current raw dataset was loaded as "rrhy"
# The variable "conf.1.rm" stands for "confidence level 1: remove"
# The variable "polarize" changes the data from 4-point base to 2-point base.
icc.EDM(data=rrhy, conf.1.rm=FALSE, polarize=TRUE, iterations=200)

##      type                ICC              F             df1     
##  Length:200         Min.   :0.252   Min.   :4.37   Min.   :189  
##  Class :character   1st Qu.:0.285   1st Qu.:4.98   1st Qu.:189  
##  Mode  :character   Median :0.297   Median :5.23   Median :189  
##                     Mean   :0.298   Mean   :5.27   Mean   :189  
##                     3rd Qu.:0.311   3rd Qu.:5.51   3rd Qu.:189  
##                     Max.   :0.347   Max.   :6.32   Max.   :189  
##       df2             p      lower bound     upper bound   
##  Min.   :1710   Min.   :0   Min.   :0.204   Min.   :0.308  
##  1st Qu.:1710   1st Qu.:0   1st Qu.:0.234   1st Qu.:0.343  
##  Median :1710   Median :0   Median :0.246   Median :0.356  
##  Mean   :1710   Mean   :0   Mean   :0.247   Mean   :0.357  
##  3rd Qu.:1710   3rd Qu.:0   3rd Qu.:0.259   3rd Qu.:0.370  
##  Max.   :1710   Max.   :0   Max.   :0.294   Max.   :0.408

As we can see, rhythm similarity paints a more promising picture: ICC values are around 0.30, about twice the mean values obtained for timbre (roughly 0.14 to 0.16), which indicates stronger, though still modest, agreement among raters.