Intro: This is an analysis of data from the Whyville plankton classification game. It includes age data on judgments across approximately 500k trials, and about 250k “ringers” (labeled examples).
Very brief summary:
Loading libraries and setting options.
library(knitr)
library(data.table)
library(dplyr)
library(ggplot2)
library(binom)
library(stringr)
library(lubridate)
rm(list=ls())
theme_set(theme_bw())
opts_chunk$set(echo=FALSE, message=FALSE, warning=FALSE, error=FALSE)
Reading in the main data file.
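A minimal sketch of the read step, assuming the trial-level data sit in a single CSV; the file name here is a placeholder, since the real file is not named in this report.
library(data.table)

# placeholder file name; adjust to the actual trial-level export
d <- fread("plankton_trials.csv")
d <- as.data.frame(d)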
Now read in the ringers.
Note that there is some overlap between the ringer files and the others (not very much, though).
There is also some user ID overlap, which is good.
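A sketch of the overlap check, assuming the main trials are in d, the ringer trials were read the same way into ringers, and both have image and uid columns (all names are assumptions).
# images and user IDs that appear in both files
shared_images <- intersect(unique(d$image), unique(ringers$image))
shared_users  <- intersect(unique(d$uid), unique(ringers$uid))
head(shared_images); length(shared_images)
head(shared_users); length(shared_users)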
## [1] "roi0.6164851500.jpg" "roi0.4594871800.jpg" "roi0.0725167101.jpg"
## [4] "roi0.2244289000.jpg" "roi0.4193960900.jpg" "roi0.1751712500.jpg"
## [1] 53
## [1] 1685237 2421043 179341 1292152 5819602 914158
## [1] 1791
Now join these files and try to propagate user age to the ringers.
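One way to do the join, assuming age lives only in the main file and both data frames share a uid column (names are assumptions).
library(dplyr)

# one age per user from the main file, carried over to the ringer trials by user ID
ages <- d %>%
  select(uid, age) %>%
  distinct()
ringers <- left_join(ringers, ages, by = "uid")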
Here’s how much data we have in terms of number of participants by age:
And in terms of number of trials.
## Source: local data frame [5 x 2]
##
## age prop
## 1 (8,12] 0.2773
## 2 (12,16] 0.3120
## 3 (16,20] 0.3096
## 4 (20,24] 0.2750
## 5 NA 0.4480
So clearly some participants contribute many trials. In addition, it looks like this ringer file has a ridiculous number of participants for whom we have no UID match. That means we have no age for them and, unfortunately, have to ditch them.
But the fact that they have so many feedback trials suggests that many of them are people who tried the game exactly once and then never came back.
Here’s the distribution of contributions. Tons of people have done ten trials, few have done 1000, one person has apparently done 40k!
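A sketch of the contribution histogram, assuming one row per trial and a uid column (names are assumptions).
library(dplyr)
library(ggplot2)

# number of trials contributed by each participant, on a log scale given the long tail
contributions <- d %>%
  group_by(uid) %>%
  summarise(n_trials = n())

ggplot(contributions, aes(x = n_trials)) +
  geom_histogram() +
  scale_x_log10()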
There is a ton of data for which we have neither ages nor grades. Let’s make a somewhat ad hoc decision to drop data from kids for whom we have fewer than 10 observations and from ages outside the range [8, 24].
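A sketch of that exclusion, under the same assumed column names.
library(dplyr)

# drop participants with fewer than 10 observations, and ages outside [8, 24]
# (rows with NA age are dropped by the age filter as well)
d <- d %>%
  group_by(uid) %>%
  filter(n() >= 10) %>%
  ungroup() %>%
  filter(age >= 8, age <= 24)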
## Source: local data frame [4 x 3]
##
## age participants observations
## 1 (8,12] 146 24390
## 2 (12,16] 326 72821
## 3 (16,20] 370 255892
## 4 (20,24] 248 169011
## Source: local data frame [5 x 3]
##
## age participants observations
## 1 (8,12] 253 25132
## 2 (12,16] 494 73948
## 3 (16,20] 462 256542
## 4 (20,24] 299 169393
## 5 NA 1279 269663
In the end we lose very few observations for most age groups, except the older/no-age group.
How many images do we actually have? It turns out that most images are unique, shown only once or twice, but pretty much all the ringers are repeated many, many times, mostly 100+.
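A sketch of the image-frequency check, assuming the trial data carry an image identifier and a logical ringer flag (both names are assumptions).
library(dplyr)

# how many times is each image shown, split by ringer status?
image_counts <- d %>%
  group_by(image, ringer) %>%
  summarise(times_shown = n())

image_counts %>%
  group_by(ringer) %>%
  summarise(median_times_shown = median(times_shown))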
Images by user ID.
First, establish a variable giving the number of trials per participant.
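One way to set this up, assuming a timestamp column is available for ordering (names are assumptions); the running trial number is also used in the plots below.
library(dplyr)

# total trials per participant plus a running trial number within participant
d <- d %>%
  arrange(uid, timestamp) %>%
  group_by(uid) %>%
  mutate(n_trials  = n(),
         trial_num = row_number()) %>%
  ungroup()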
Now use only the feedback trials to graph accuracy over age, trials, and category.
Let’s do it by age/category. All groups are getting better, albeit from different starting places.
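A sketch of the accuracy curve, assuming a logical feedback flag, a correct indicator, an age_group factor, and the trial_num variable from the sketch above (all names are assumptions).
library(dplyr)
library(ggplot2)

# bin the running trial number, then plot mean accuracy by age group
accuracy <- d %>%
  filter(feedback) %>%
  mutate(trial_bin = cut(trial_num, breaks = c(0, 10, 30, 100, 300, 1000, Inf))) %>%
  group_by(age_group, trial_bin) %>%
  summarise(accuracy = mean(correct))

ggplot(accuracy, aes(x = trial_bin, y = accuracy,
                     colour = age_group, group = age_group)) +
  geom_line() + geom_point()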
Now look at all categories. This isn’t percent correct; it’s just the proportion of trials classified into each category, split by feedback and no-feedback trials. Gray lines represent base rates.
So on the feedback trials, they are choosing at almost exactly the base rate. But on the no-feedback trials, they are converging to choose “nothing” almost all the time.
* [x] check time distribution of repeats
* [x] number of images a user has seen twice by the time they’ve played 500 times?
Let’s start with the time distribution of ringer repeats, beginning with the basic time distribution:
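A sketch of the overall time distribution, assuming timestamps are stored in a format that lubridate’s ymd_hms() can parse (an assumption).
library(lubridate)
library(ggplot2)

# parse timestamps and bin trials by week
d$timestamp <- ymd_hms(d$timestamp)
ggplot(d, aes(x = as.Date(timestamp))) +
  geom_histogram(binwidth = 7)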
OK, so there was clearly some kind of event around 7/2013, no idea what that was. Maybe some promotion?
Now, some sanity checks on time of day.
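A sketch of the hour-of-day check, using lubridate’s hour() on the same assumed timestamp column.
library(lubridate)
library(ggplot2)

# trials by hour of the day
ggplot(d, aes(x = hour(timestamp))) +
  geom_histogram(binwidth = 1)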
OK, now check the time distribution of ringer repeats.
The upshot of this analysis is that ringers are often being seen again within the same minute, because kids are doing a lot of trials in a single sitting.
Now look at the proportion of ringer repeats.
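A sketch of the repeat bookkeeping, assuming the same uid, timestamp, image, and ringer columns as above.
library(dplyr)

# flag ringer images a participant has already seen, then compute the repeat rate
# at each point in the ringer-trial sequence, averaged across participants
repeat_rate <- d %>%
  filter(ringer) %>%
  arrange(uid, timestamp) %>%
  group_by(uid) %>%
  mutate(repeat_view  = duplicated(image),
         ringer_trial = row_number()) %>%
  group_by(ringer_trial) %>%
  summarise(prop_repeats = mean(repeat_view))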
OK, it’s pretty clear that 10% of ringer trials are repeats by the time a participant is 500 trials in.
Look at the correlation between choosing “none” on no-feedback trials and doing well on feedback trials (essentially, where you set your classification threshold).
This correlation exists, but it appears to be mediated by effects of practice. You do better on feedback trials AND classify “none” more often with more practice.
A (somewhat inappropriate) linear model, not shown, supports this conclusion: significant effects of practice on accuracy, and an interaction of practice and “none”-choosing such that the more you practice and choose “none”, the better you do.
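A sketch of such a model, assuming per-participant summaries built from assumed correct, feedback, and response columns, with “nothing” as the assumed label for the “none” response.
library(dplyr)

# per-participant summaries: feedback accuracy, "none" rate on no-feedback trials,
# and (log) number of trials as the practice measure
by_subj <- d %>%
  group_by(uid) %>%
  summarise(accuracy   = mean(correct[feedback]),
            none_rate  = mean(response[!feedback] == "nothing"),
            log_trials = log10(n()))
summary(lm(accuracy ~ log_trials * none_rate, data = by_subj))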
Let’s look at confusion matrices and how they evolve over learning and age. This might help us firm up the conclusions about the misclassification of everything as “nothing.”
First the basic matrix, broken down by age. The big stripe across it is clear evidence of overclassification as “nothing” (this is only feedback trials, though).
Note that these are old, computed prior to the corrected ringer dataset.
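A sketch of the confusion matrix, assuming columns for the true (ringer) category and the participant’s response, plus the age bins used above (names are assumptions).
library(dplyr)
library(ggplot2)

# response counts by true category and age group, normalized within true category
confusion <- d %>%
  filter(feedback) %>%
  group_by(age_group, true_category, response) %>%
  summarise(n = n()) %>%
  group_by(age_group, true_category) %>%
  mutate(prop = n / sum(n))

ggplot(confusion, aes(x = true_category, y = response, fill = prop)) +
  geom_tile() +
  facet_wrap(~age_group)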
TODOs:
* Kappa for experts (dyherman lab)
* growth models, individuals that “cheat”