This dataset contains one record per rating. Each record contains attributes of both the user and the rating. We remove validation ratings, which asked the user to rate the same concept pair twice.
options(width = 120)
ratings <- read.table("dat/ratings.tsv", header = TRUE, sep = "\t")
ratings <- ratings[ratings$r_condition != "validation", ]  # drop repeated validation pairs
ratings$r_id <- factor(ratings$r_id)  # question id
ratings$r_condition <- relevel(factor(ratings$r_condition), ref = "mturk")  # mturk as baseline
For now we are going to remove participants who are neither scholars nor turkers (the "general" condition). We don't have enough of them to report interesting effects:
ratings <- droplevels(ratings[ratings$r_condition != "general", ])  # also drop the now-empty factor levels
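As a quick sanity check (not part of the original analysis), we can confirm how many ratings each remaining condition contributes:
table(ratings$r_condition)  # counts for mturk, scholar-in, scholar-out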
We add an attribute (r_resid) that is the residual rating after controlling for question- and user-specific effects, and a flag (r_common) indicating whether the rating is for general common knowledge or a domain-specific concept.
ratings$r_resid <- resid(lm(r_rating ~ r_id + u_email, data = ratings))
ratings$r_common <- factor(ifelse(ratings$r_field == "general", "general", "specific"))
ratings$r_common <- relevel(ratings$r_common, ref = "general")
The total number of ratings is 8063 from 110 users.
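These counts can be verified directly (assuming u_email uniquely identifies users):
nrow(ratings)                    # total ratings: 8063
length(unique(ratings$u_email))  # distinct users: 110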
ratings <- ratings[ratings$r_rating > 0, ]  # remove ratings for unknown phrases
The total number of defined ratings is 7947.
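The post-filter count can be checked the same way:
nrow(ratings)  # defined ratings: 7947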
We perform an ANOVA on condition (turker / scholar-in / scholar-out), controlling for the average rating of each question via the r_id term.
fit <- aov(formula = r_rating ~ r_condition + r_id, data = ratings)
summary(fit)
##                Df Sum Sq Mean Sq F value Pr(>F)
## r_condition     2    390   194.8   218.6 <2e-16 ***
## r_id          199   6442    32.4    36.3 <2e-16 ***
## Residuals    7745   6900     0.9
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit, "r_condition")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = r_rating ~ r_condition + r_id, data = ratings)
##
## $r_condition
##                           diff     lwr     upr p adj
## scholar-in-mturk        0.6211  0.5462  0.6960     0
## scholar-out-mturk       0.3653  0.3100  0.4206     0
## scholar-out-scholar-in -0.2558 -0.3266 -0.1850     0
Overall, it appears that ratings in the scholar-in condition are about 0.26 higher than scholar-out ratings and 0.62 higher than turker ratings.
But this alone might not be particularly interesting. We may just be able to “average adjust” (mean-center) each subject's ratings to get comparable scores.
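A minimal sketch of what that adjustment would look like (the r_adj column is hypothetical and not used below):
ratings$r_adj <- ratings$r_rating - ave(ratings$r_rating, ratings$u_email)  # center each subject's ratings at 0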
We CAN'T do this if the condition effects differ between common and domain-specific questions. Do they?
Yes! The following analysis shows that they do:
fit <- aov(formula = r_rating ~ r_condition + r_common + r_condition:r_common + u_email + r_id, data = ratings)
summary(fit)
##                        Df Sum Sq Mean Sq F value  Pr(>F)
## r_condition             2    390   194.8  280.68 < 2e-16 ***
## r_common                1      2     1.8    2.62    0.11
## u_email               108   1703    15.8   22.73 < 2e-16 ***
## r_id                  198   6323    31.9   46.02 < 2e-16 ***
## r_condition:r_common    1     16    16.0   23.10 1.6e-06 ***
## Residuals            7636   5299     0.7
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit, "r_condition:r_common")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = r_rating ~ r_condition + r_common + r_condition:r_common + u_email + r_id, data = ratings)
##
## $`r_condition:r_common`
##                                             diff      lwr      upr  p adj
## scholar-in:general-mturk:general              NA       NA       NA     NA
## scholar-out:general-mturk:general          0.1586  0.02223  0.29489 0.0118
## mturk:specific-mturk:general              -0.2039 -0.32539 -0.08232 0.0000
## scholar-in:specific-mturk:general          0.4523  0.32357  0.58112 0.0000
## scholar-out:specific-mturk:general         0.2075  0.08889  0.32601 0.0000
## scholar-out:general-scholar-in:general        NA       NA       NA     NA
## mturk:specific-scholar-in:general             NA       NA       NA     NA
## scholar-in:specific-scholar-in:general        NA       NA       NA     NA
## scholar-out:specific-scholar-in:general       NA       NA       NA     NA
## mturk:specific-scholar-out:general        -0.3624 -0.45674 -0.26808 0.0000
## scholar-in:specific-scholar-out:general    0.2938  0.19030  0.39727 0.0000
## scholar-out:specific-scholar-out:general   0.0489 -0.04156  0.13936 0.6377
## scholar-in:specific-mturk:specific         0.6562  0.57316  0.73923 0.0000
## scholar-out:specific-mturk:specific        0.4113  0.34521  0.47740 0.0000
## scholar-out:specific-scholar-in:specific  -0.2449 -0.32350 -0.16627 0.0000
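To read the interaction directly, it helps to tabulate the mean rating in each condition-by-question-type cell (a convenience summary, not part of the original output; the all-NA rows above suggest the scholar-in:general cell is simply empty):
with(ratings, tapply(r_rating, list(r_condition, r_common), mean))  # cell means; empty cells return NA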