all data is private from the first message. participants can manually “donate their data to science”. participants were all recruited and interviewed via Twitter.
obviously I can’t personally interview 100k people; that would take something like 34 years of never sleeping, always interviewing. so first I built an algorithm which is able to learn to communicate with people.
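(for scale: 34 years of non-stop interviewing is about 298,000 hours, which spread over 100,000 people works out to roughly 3 hours per interview.)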
but even though the algorithm could learn, it didn’t actually know how to communicate. so we spent 8 months teaching it to communicate, through its own self-learning routine.
unstructured interviews are not directed by the researcher. lines of questioning are freely traded between algorithm and participant as they get to know one another. people choose to talk about all kinds of things - some about their family, or their day. the algorithm already knows some things about itself, through hand-coded “concepts” about “itself”, and it continually learns more from what other people say about it.
the algorithm keeps track of what you talked about and when, and is able to bring it back into conversation much later, ask for clarification, or wonder whether it’s still true.
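a minimal sketch of what that memory could look like, assuming a simple timestamped store (the names and structure here are my own illustration, not the project’s actual code):

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Statement:
    """One thing a participant said, and when they said it."""
    participant: str
    topic: str                   # e.g. "family", "their day", "beliefs"
    text: str
    said_at: datetime
    still_believed: bool = True  # can be re-checked in a later conversation

@dataclass
class ConversationMemory:
    """Timestamped record of everything discussed with one participant."""
    statements: list[Statement] = field(default_factory=list)

    def remember(self, statement: Statement) -> None:
        self.statements.append(statement)

    def worth_revisiting(self, now: datetime, older_than: timedelta) -> list[Statement]:
        """Old statements the algorithm might bring back up, ask to clarify,
        or wonder whether they're still true."""
        return [s for s in self.statements if now - s.said_at > older_than]
```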
the algorithm links statements to meaning through interactive conversation, and its extensive memory of its previous conversations.
if the algorithm doesn’t understand, it can use its turn in the conversation to collect new information. the algorithm learns almost exclusively by asking clarifying questions, always attempting to lower its uncertainty in interpretation. this is accomplished through a few speech acts which the algorithm employs. the algorithm might attempt to restate what the person said, and ask if that’s what they meant; this links their original, in situ speech act to the speech act presented by the algorithm in restating, which is typically paraphrased from another person’s conversation with the algorithm. the algorithm will also extrapolate, attempting to deduce facts the person has not told it. these deductions can be as simple as converting from Fahrenheit to Centigrade, or as complex as understanding that others who believe X and Y almost always believe Z. these extrapolations are often incorrect, and if the participant says so, the algorithm takes this as an opportunity to learn. these “sanity checks” constantly update the assumptions the algorithm is making, and thus the dataset I am collecting.
each speech act and composite speech act is represented by a class in Python.
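as a rough illustration of how the speech acts above might be organized as classes (the names and fields are my guesses, not the actual codebase):

```python
from dataclasses import dataclass

@dataclass
class SpeechAct:
    """One conversational move the algorithm can take on its turn."""
    participant: str
    uncertainty: float  # interpretation uncertainty this turn tries to lower

@dataclass
class Restatement(SpeechAct):
    """Paraphrase what the person said (often borrowing phrasing from another
    participant's conversation) and ask whether that's what they meant."""
    original_statement: str
    paraphrase: str

@dataclass
class Extrapolation(SpeechAct):
    """Deduce something the person hasn't said, from unit conversions up to
    'people who believe X and Y almost always believe Z', and check it with them."""
    premises: list[str]
    deduced_fact: str

@dataclass
class SanityCheck(SpeechAct):
    """Ask whether an assumption the algorithm is carrying is (still) true;
    a correction is treated as an opportunity to learn."""
    assumption: str

@dataclass
class CompositeSpeechAct(SpeechAct):
    """Several speech acts delivered together as a single turn."""
    parts: list[SpeechAct]
```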
we never ask a question; we just let them talk.
a substantial number of these facts, nearly 20%, are about other participants.
14,427 interviewees have provided at least 100 facts about themselves.
there is a subset of 7,593 interviewees, all of whom have answered a set of 237 questions.
participants were informed at the start, in dialogue, of exactly how the project worked: how the study would work, who could access the data, and answers to any other questions the person had.
In all, we’ve collected 11,541,601 facts about the participants from the interviews, each verified as correct by the participant it concerns.
These facts include their characterizations of others.
Participants can reorganize their own logics discursively over time.
the first information collected was about myself, 5.24 years ago. we’ve known some participants for this entire time, and we have at least 3 years of data on 11,541 participants.
we also collected, with permission, each participant’s friends and followers on Twitter.
we have second-hand information about many more people than those we sampled. for instance, we know something about the religion of 19,198,307 relatives and friends of respondents.
Participants told us a little about