This dataset contains one record per rating. Each record contains attributes of both the user and the rating. We remove validation ratings, which asked the user to rate the same concept pair twice.
options(width = 120)
ratings <- read.table("dat/ratings.tsv", header = TRUE, sep = "\t")
ratings <- ratings[ratings$r_condition != "validation", ]  # drop repeated validation pairs
ratings$r_id <- factor(ratings$r_id)  # question id
ratings$r_condition <- relevel(factor(ratings$r_condition), ref = "mturk")  # mturk as baseline
For now we are going to remove participants who are neither scholars nor turkers (the "general" condition). We don't have enough of them to report interesting effects:
ratings <- droplevels(ratings[ratings$r_condition != "general", ])  # also drop the now-empty factor levels
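As a quick sanity check (not part of the original analysis), we can confirm how many ratings each remaining condition contributes:
table(ratings$r_condition)  # counts for mturk, scholar-in, scholar-out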
We add an attribute (r_resid) that is the residual rating after controlling for question- and user-specific effects, and a flag (r_common) indicating whether the rating is for general common knowledge or a domain-specific concept.
ratings$r_resid <- resid(lm(r_rating ~ r_id + u_email, data = ratings))
ratings$r_common <- factor(ifelse(ratings$r_field == "general", "general", "specific"))
ratings$r_common <- relevel(ratings$r_common, ref = "general")
The total number of ratings is 8063 from 110 users.
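These counts can be verified directly (assuming u_email uniquely identifies users):
nrow(ratings)                    # total ratings: 8063
length(unique(ratings$u_email))  # distinct users: 110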
ratings <- ratings[ratings$r_rating > 0, ]  # remove ratings for unknown phrases
The total number of defined ratings is 7947.
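The post-filter count can be checked the same way:
nrow(ratings)  # defined ratings: 7947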
We perform an ANOVA on condition (turker / scholar-in / scholar-out), controlling for the average rating of each question via the r_id term.
fit <- aov(formula = r_rating ~ r_condition + r_id, data = ratings)
summary(fit)
##                Df Sum Sq Mean Sq F value Pr(>F)
## r_condition     2    390   194.8   218.6 <2e-16 ***
## r_id          199   6442    32.4    36.3 <2e-16 ***
## Residuals    7745   6900     0.9
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit, "r_condition")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = r_rating ~ r_condition + r_id, data = ratings)
##
## $r_condition
##                           diff     lwr     upr p adj
## scholar-in-mturk        0.6211  0.5462  0.6960     0
## scholar-out-mturk       0.3653  0.3100  0.4206     0
## scholar-out-scholar-in -0.2558 -0.3266 -0.1850     0
Overall, it appears that ratings in the scholar-in condition are about 0.26 higher than scholar-out ratings and 0.62 higher than turker ratings.
But this alone might not be particularly interesting. We may just be able to “average adjust” (mean-center) each subject's ratings to get comparable scores.
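A minimal sketch of what that adjustment would look like (the r_adj column is hypothetical and not used below):
ratings$r_adj <- ratings$r_rating - ave(ratings$r_rating, ratings$u_email)  # center each subject's ratings at 0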
We CAN'T do this if the condition effects differ between common and domain-specific questions. Do they?
Yes! The following analysis shows that they do:
fit <- aov(formula = r_rating ~ r_condition + r_common + r_condition:r_common + u_email + r_id, data = ratings)
summary(fit)
##                        Df Sum Sq Mean Sq F value  Pr(>F)
## r_condition             2    390   194.8  280.68 < 2e-16 ***
## r_common                1      2     1.8    2.62    0.11
## u_email               108   1703    15.8   22.73 < 2e-16 ***
## r_id                  198   6323    31.9   46.02 < 2e-16 ***
## r_condition:r_common    1     16    16.0   23.10 1.6e-06 ***
## Residuals            7636   5299     0.7
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit, "r_condition:r_common")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = r_rating ~ r_condition + r_common + r_condition:r_common + u_email + r_id, data = ratings)
##
## $`r_condition:r_common`
##                                             diff      lwr      upr  p adj
## scholar-in:general-mturk:general              NA       NA       NA     NA
## scholar-out:general-mturk:general          0.1586  0.02223  0.29489 0.0118
## mturk:specific-mturk:general              -0.2039 -0.32539 -0.08232 0.0000
## scholar-in:specific-mturk:general          0.4523  0.32357  0.58112 0.0000
## scholar-out:specific-mturk:general         0.2075  0.08889  0.32601 0.0000
## scholar-out:general-scholar-in:general        NA       NA       NA     NA
## mturk:specific-scholar-in:general             NA       NA       NA     NA
## scholar-in:specific-scholar-in:general        NA       NA       NA     NA
## scholar-out:specific-scholar-in:general       NA       NA       NA     NA
## mturk:specific-scholar-out:general        -0.3624 -0.45674 -0.26808 0.0000
## scholar-in:specific-scholar-out:general    0.2938  0.19030  0.39727 0.0000
## scholar-out:specific-scholar-out:general   0.0489 -0.04156  0.13936 0.6377
## scholar-in:specific-mturk:specific         0.6562  0.57316  0.73923 0.0000
## scholar-out:specific-mturk:specific        0.4113  0.34521  0.47740 0.0000
## scholar-out:specific-scholar-in:specific  -0.2449 -0.32350 -0.16627 0.0000
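To read the interaction directly, it helps to tabulate the mean rating in each condition-by-question-type cell (a convenience summary, not part of the original output; the all-NA rows above suggest the scholar-in:general cell is simply empty):
with(ratings, tapply(r_rating, list(r_condition, r_common), mean))  # cell means; empty cells return NA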