Intro: This is an analysis of data from the Whyville plankton classification game. It includes age data on judgments across approximately 500k trials, and about 250k “ringers” (labeled examples).
Very brief summary:
Loading libraries and setting options.
library(knitr)
library(data.table)
library(dplyr)
library(ggplot2)
library(binom)
library(stringr)
library(lubridate)
rm(list=ls())
theme_set(theme_bw())
opts_chunk$set(echo=FALSE, message=FALSE, warning=FALSE, error=FALSE)
Reading in the main data file.
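A minimal sketch of the read step, assuming the trial-level data sit in a single CSV; the file name here is a placeholder, since the real file is not named in this report.
library(data.table)

# placeholder file name; adjust to the actual trial-level export
d <- fread("plankton_trials.csv")
d <- as.data.frame(d)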
Now read in the ringers.
Note that there is some overlap between the ringer files and the others (not very much, though).
There is also some user ID overlap, which is good.
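A sketch of the overlap check, assuming the main trials are in d, the ringer trials were read the same way into ringers, and both have image and uid columns (all names are assumptions).
# images and user IDs that appear in both files
shared_images <- intersect(unique(d$image), unique(ringers$image))
shared_users  <- intersect(unique(d$uid), unique(ringers$uid))
head(shared_images); length(shared_images)
head(shared_users); length(shared_users)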
## [1] "roi0.6164851500.jpg" "roi0.4594871800.jpg" "roi0.0725167101.jpg"
## [4] "roi0.2244289000.jpg" "roi0.4193960900.jpg" "roi0.1751712500.jpg"
## [1] 53
## [1] 1685237 2421043 179341 1292152 5819602 914158
## [1] 1791
Now join these files and try to propagate user age to the ringers.
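One way to do the join, assuming age lives only in the main file and both data frames share a uid column (names are assumptions).
library(dplyr)

# one age per user from the main file, carried over to the ringer trials by user ID
ages <- d %>%
  select(uid, age) %>%
  distinct()
ringers <- left_join(ringers, ages, by = "uid")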
Here’s how much data we have in terms of number of participants by age:
And in terms of number of trials.
## Source: local data frame [5 x 2]
##
## age prop
## 1 (8,12] 0.2773
## 2 (12,16] 0.3120
## 3 (16,20] 0.3096
## 4 (20,24] 0.2750
## 5 NA 0.4480
So clearly some participants contribute many trials. In addition, it looks like this ringer file has a ridiculous number of participants for whom we have no UID match. That means we have no age for them and, unfortunately, have to ditch them.
But the fact that they have so many feedback trials suggests that many of them are people who tried the game exactly once and then never came back.
Here’s the distribution of contributions. Tons of people have done ten trials, few have done 1000, one person has apparently done 40k!
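A sketch of the contribution histogram, assuming one row per trial and a uid column (names are assumptions).
library(dplyr)
library(ggplot2)

# number of trials contributed by each participant, on a log scale given the long tail
contributions <- d %>%
  group_by(uid) %>%
  summarise(n_trials = n())

ggplot(contributions, aes(x = n_trials)) +
  geom_histogram() +
  scale_x_log10()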
There is a ton of data for which we have neither ages nor grades. Let’s make a somewhat ad hoc decision to drop data from kids for whom we have fewer than 10 observations and from ages outside the range [8, 24].
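A sketch of that exclusion, under the same assumed column names.
library(dplyr)

# drop participants with fewer than 10 observations, and ages outside [8, 24]
# (rows with NA age are dropped by the age filter as well)
d <- d %>%
  group_by(uid) %>%
  filter(n() >= 10) %>%
  ungroup() %>%
  filter(age >= 8, age <= 24)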
## Source: local data frame [4 x 3]
##
## age participants observations
## 1 (8,12] 146 24390
## 2 (12,16] 326 72821
## 3 (16,20] 370 255892
## 4 (20,24] 248 169011
## Source: local data frame [5 x 3]
##
## age participants observations
## 1 (8,12] 253 25132
## 2 (12,16] 494 73948
## 3 (16,20] 462 256542
## 4 (20,24] 299 169393
## 5 NA 1279 269663
In the end we lose very few observations for most age groups, except the older/no-age group.
How many images do we actually have? It turns out that most images are unique, shown only once or twice, but pretty much all the ringers are repeated many, many times, mostly 100+.
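A sketch of the image-frequency check, assuming the trial data carry an image identifier and a logical ringer flag (both names are assumptions).
library(dplyr)

# how many times is each image shown, split by ringer status?
image_counts <- d %>%
  group_by(image, ringer) %>%
  summarise(times_shown = n())

image_counts %>%
  group_by(ringer) %>%
  summarise(median_times_shown = median(times_shown))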
Images by user ID.
First, establish a variable giving the number of trials per participant.
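One way to set this up, assuming a timestamp column is available for ordering (names are assumptions); the running trial number is also used in the plots below.
library(dplyr)

# total trials per participant plus a running trial number within participant
d <- d %>%
  arrange(uid, timestamp) %>%
  group_by(uid) %>%
  mutate(n_trials  = n(),
         trial_num = row_number()) %>%
  ungroup()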
Now use only the feedback trials to graph accuracy over age, trials, and category.
Let’s do it by age/category. All groups are getting better, albeit from different starting places.
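A sketch of the accuracy curve, assuming a logical feedback flag, a correct indicator, an age_group factor, and the trial_num variable from the sketch above (all names are assumptions).
library(dplyr)
library(ggplot2)

# bin the running trial number, then plot mean accuracy by age group
accuracy <- d %>%
  filter(feedback) %>%
  mutate(trial_bin = cut(trial_num, breaks = c(0, 10, 30, 100, 300, 1000, Inf))) %>%
  group_by(age_group, trial_bin) %>%
  summarise(accuracy = mean(correct))

ggplot(accuracy, aes(x = trial_bin, y = accuracy,
                     colour = age_group, group = age_group)) +
  geom_line() + geom_point()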
Now look at all categories. This isn’t percent correct; it’s just the proportion of trials classified into each category, split by feedback and no-feedback trials. Gray lines represent base rates.
So on the feedback trials, they are choosing at almost exactly the base rate. But on the no-feedback trials, they are converging to choose “nothing” almost all the time.
* [x] check time distribution of repeats
* [x] number of images a user has seen twice by the time they’ve played 500 times?
Let’s start with the time distribution of ringer repeats, beginning with the basic time distribution:
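A sketch of the overall time distribution, assuming timestamps are stored in a format that lubridate’s ymd_hms() can parse (an assumption).
library(lubridate)
library(ggplot2)

# parse timestamps and bin trials by week
d$timestamp <- ymd_hms(d$timestamp)
ggplot(d, aes(x = as.Date(timestamp))) +
  geom_histogram(binwidth = 7)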
OK, so there was clearly some kind of event around 7/2013, no idea what that was. Maybe some promotion?
Now, some sanity checks on time of day.
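A sketch of the hour-of-day check, using lubridate’s hour() on the same assumed timestamp column.
library(lubridate)
library(ggplot2)

# trials by hour of the day
ggplot(d, aes(x = hour(timestamp))) +
  geom_histogram(binwidth = 1)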
OK, now check the time distribution of ringer repeats.
The upshot of this analysis is that ringers are often being seen again within the same minute, because kids are doing a lot of trials in a single sitting.
Now look at the proportion of ringer repeats.
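A sketch of the repeat bookkeeping, assuming the same uid, timestamp, image, and ringer columns as above.
library(dplyr)

# flag ringer images a participant has already seen, then compute the repeat rate
# at each point in the ringer-trial sequence, averaged across participants
repeat_rate <- d %>%
  filter(ringer) %>%
  arrange(uid, timestamp) %>%
  group_by(uid) %>%
  mutate(repeat_view  = duplicated(image),
         ringer_trial = row_number()) %>%
  group_by(ringer_trial) %>%
  summarise(prop_repeats = mean(repeat_view))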
OK, it’s pretty clear that 10% of ringer trials are repeats by the time a participant is 500 trials in.
Look at the correlation between choosing “none” on no-feedback trials and doing well on feedback trials (essentially, where you set your classification threshold).
This correlation exists, but it appears to be mediated by effects of practice. You do better on feedback trials AND classify “none” more often with more practice.
A (somewhat inappropriate) linear model, not shown, supports this conclusion: significant effects of practice on accuracy, and an interaction of practice and “none”-choosing such that the more you practice and choose “none”, the better you do.
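A sketch of such a model, assuming per-participant summaries built from assumed correct, feedback, and response columns, with “nothing” as the assumed label for the “none” response.
library(dplyr)

# per-participant summaries: feedback accuracy, "none" rate on no-feedback trials,
# and (log) number of trials as the practice measure
by_subj <- d %>%
  group_by(uid) %>%
  summarise(accuracy   = mean(correct[feedback]),
            none_rate  = mean(response[!feedback] == "nothing"),
            log_trials = log10(n()))
summary(lm(accuracy ~ log_trials * none_rate, data = by_subj))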
Let’s look at confusion matrices and how they evolve over learning and age. This might help us firm up the conclusions about the misclassification of everything as “nothing.”
First the basic matrix, broken down by age. The big stripe across it is clear evidence of overclassification as “nothing” (this is only feedback trials, though).
Note that these are old, computed prior to the corrected ringer dataset.
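A sketch of the confusion matrix, assuming columns for the true (ringer) category and the participant’s response, plus the age bins used above (names are assumptions).
library(dplyr)
library(ggplot2)

# response counts by true category and age group, normalized within true category
confusion <- d %>%
  filter(feedback) %>%
  group_by(age_group, true_category, response) %>%
  summarise(n = n()) %>%
  group_by(age_group, true_category) %>%
  mutate(prop = n / sum(n))

ggplot(confusion, aes(x = true_category, y = response, fill = prop)) +
  geom_tile() +
  facet_wrap(~age_group)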
TODOs:
* Kappa for experts (dyherman lab)
* growth models, individuals that “cheat”