Data Science for Social Good, Chicago 2017

The urgent need
for declarative data

Declarative data are (and will be) needed!


  • To solve many social problems, we often
    • need to conduct an ad-hoc study and
    • gather new quantitative data
      from the population of interest.
  • Existing data may not be comprehensive enough
    or up to date.
  • The solution is to ask people relevant questions directly and gather enough declarative data
    of acceptable quality.

Barriers in declarative data collection

  • We could conduct a survey using an off-line or on-line research technique…
  • …but we would face barriers that can prevent even the most important social research projects from being conducted, for example:
    • surveying people is expensive,
    • respondents are hard to recruit,
    • response rates are declining, and
    • methodological concerns are rising.
  • These problems grow further if we want to establish a long-term relationship with respondents,
    • i.e. to conduct repeated measurements within longitudinal panels (returning with our questions to the same respondents at least once after some time).

Why is it so difficult
to establish a long-term relationship
between researchers and respondents?

The heuristic of the traditional model
of Respondent-Researcher Relationship

  1. Find potential respondents within a time window that is not necessarily convenient for them.
  2. Convince them to devote their time to working through a questionnaire, which is often boring.
  3. Convince them to share valuable, and often quite private and sensitive, information with someone who will benefit from their answers.
  4. Do not share any valuable information with them.
  5. Give the respondents as little as possible in return in order to be as cost-effective as possible.
  6. Repeat the whole process as many times and
    for as many respondents as necessary.

Consequence of the traditional model


Respondents do not want to be respondents anymore!

In the long run this model is clearly not sustainable, yet researchers act as if the general population of respondents were infinite or an easily renewable resource.

This is not only wrong but also harmful!

This situation can be seen as a classic "tragedy of the commons": a shared resource is spoiled and depleted by the collective actions of actors who are each driven by individual self-interest, which is at odds with the long-term common good.

Old paradigm in on-line research techniques

On-line research techniques are not natively on-line.
They simply mimic off-line techniques.

They are mainly off-line questionnaires converted into more or less advanced HTML forms with some additional functionality (such as randomization, skip logic, and new question types), but they are still:

  • repetitive: many questions from one survey are repeated in others;
  • time-consuming: they need a relatively long block of time to complete;
  • linear: non-linear behavior is often forbidden;
  • based on a non-equivalent information exchange:
    valuable information from the respondent vs.
    no valuable information from the researcher.

Towards a new paradigm
in declarative data collection

Assumptions of the new model
of Respondent-Researcher Relationship

  • The priority is to build a long-term relationship with respondents.
  • The process should be more like a conversation than an interrogation.
  • The new research tool for collecting declarative data should let go of old off-line paradigms, be designed to work natively in the on-line environment (both desktop & mobile), and maximize the user experience.
  • During every interaction between researcher and respondent through the new research tool, both sides should receive something valuable to them.
  • As a consequence, people are intrinsically motivated to be respondents repeatedly, over a long period.

What can be valuable for respondents?

Instant feedback

  • Respondents should be able to access feedback relevant to their answers.
  • The feedback should be provided immediately, at the end of a question set if not right after each answered question.
  • The feedback is much more valuable to a respondent if it is highly customized to that particular respondent.
    • A simple but real-world example of this approach is a salary survey that asks about your current salary and compares it to the salaries of people similar to you.
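
The salary-survey example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `share_earning_less` and `salary_feedback`, the message wording, and the idea that "people similar to you" have already been selected into `peer_salaries` are all hypothetical and not taken from any existing survey tool.

```python
from bisect import bisect_left


def share_earning_less(own_salary, peer_salaries):
    """Percentage of peers whose salary is strictly below own_salary."""
    ranked = sorted(peer_salaries)
    # bisect_left returns how many sorted values are strictly smaller.
    below = bisect_left(ranked, own_salary)
    return 100.0 * below / len(ranked)


def salary_feedback(own_salary, peer_salaries):
    """Build an instant feedback message for one respondent (hypothetical wording)."""
    if not peer_salaries:
        # No comparable respondents yet, so no comparison can be offered.
        return "Not enough data from similar respondents yet."
    share = share_earning_less(own_salary, peer_salaries)
    return f"You earn more than {share:.0f}% of similar respondents."


if __name__ == "__main__":
    print(salary_feedback(50, [30, 40, 50, 60, 70]))
```

How the group of similar respondents is chosen (for example by occupation, region, or experience) is left open here; that choice largely determines how customized the feedback feels.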

Sources of data-driven feedback

The feedback for a given respondent can be based on different data sources:

  • the answer of the respondent
    to the given question,
  • previous answers of the respondent
    to the given question,
  • answers of the respondent
    to other questions,
  • answers of other respondents
    to given or other questions,
  • external open data
    (research or administrative),
  • outcomes from reference studies
    (aggregated or summarized),
  • other sources.
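
As a rough illustration of how two of the sources listed above could be combined, the Python sketch below builds feedback from a respondent's previous answers to the same question and from other respondents' answers to that question. The names `FeedbackInputs` and `build_feedback`, the numeric answer scale, and the message texts are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class FeedbackInputs:
    """Hypothetical container for the data behind one piece of feedback."""
    current_answer: float
    # Same respondent, same question, oldest answer first.
    previous_answers: list = field(default_factory=list)
    # Other respondents, same question.
    peer_answers: list = field(default_factory=list)


def build_feedback(inputs):
    """Return a list of feedback messages based on the available sources."""
    messages = []
    if inputs.previous_answers:
        # Compare with the respondent's most recent earlier answer.
        change = inputs.current_answer - inputs.previous_answers[-1]
        if change == 0:
            messages.append("Your answer is unchanged since last time.")
        else:
            direction = "up" if change > 0 else "down"
            messages.append(
                f"Your answer went {direction} by {abs(change):g} since last time."
            )
    if inputs.peer_answers:
        # Summarize other respondents with a robust central value.
        peer_median = median(inputs.peer_answers)
        messages.append(
            f"The median answer among other respondents is {peer_median:g}."
        )
    return messages


if __name__ == "__main__":
    for line in build_feedback(FeedbackInputs(7, [5], [4, 6, 8])):
        print(line)
```

Each source is optional, so a first-time respondent with no history simply receives fewer messages rather than an error.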

The role of Data Science in the new model

The role of Data Science is to use multiple data sources, statistical inference, machine learning algorithms, and interactive data visualization, among others, to provide feedback to the respondents, which is