Motivating Research with Open Data

Collin Paschall

Introductions

  • Collin Paschall and Chris Kromphardt

  • Data Analytics and Policy Program

  • Have overseen hundreds of course projects and capstone papers

  • Goal of the program is to provide practical training in data analysis in support of evidence-based policymaking

Plan for the Talk

  • Introduce open data and the growing ecosystem

  • Review common open data sources

  • Review start-up costs to using open data

  • Share some thoughts on the research process with open data

The Open Data Ecosystem

  • We live in the Data Age: even AI is mostly about data

Are we going to use data for the common good?

Examples of Entities Doing Data Collection

  • National, state, and local government

  • International organizations

  • NGOs, think tanks, researchers

  • Journalists and citizens

  • Data are available on platforms like GitHub, Kaggle, etc.

Open data is a huge boon to public-facing research

  • Advocacy groups, think tanks, journalists, and others can create informative research projects on public health, natural disasters, public safety, educational outcomes, and much more

  • Open data and open data science tools (R, Python, etc.) have democratized data

But, some risks

  • Reliability and bias

    • Collecting information is a political act because it requires resources and implies choices about what and who is important (whom you measure and what you measure)

  • Does making data available lead to helpful, good research? Not necessarily.

An Open Data Project Lifecycle

  • Start-up costs

  • Finding and evaluating data

  • Formulating a research question

  • Choosing methods

  • Planning for maximum impact

Start-Up Costs

  • If you want to work in the open data ecosystem, you have to have some basic data science skills

    • R, Python, SQL, APIs, GitHub

    • .csv, .xml, .json files

    • Natural language processing - plain, unstructured text is probably the most difficult to deal with

  • So even when data exist, you have to know the basics to use them
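
As a rough illustration of the file formats listed above, here is a minimal sketch that reads the same two records from .csv, .json, and .xml text using only the Python standard library. The sample records, field names (county, population), and function names are hypothetical, made up for this example; real open data files will have their own layouts.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same two records in three formats.
CSV_TEXT = "county,population\nAdams,1200\nBrown,3400\n"
JSON_TEXT = (
    '[{"county": "Adams", "population": 1200},'
    ' {"county": "Brown", "population": 3400}]'
)
XML_TEXT = (
    "<records>"
    "<record><county>Adams</county><population>1200</population></record>"
    "<record><county>Brown</county><population>3400</population></record>"
    "</records>"
)


def read_csv(text):
    """Parse CSV text; every field arrives as a string, so convert numbers."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"county": r["county"], "population": int(r["population"])} for r in rows]


def read_json(text):
    """Parse JSON text; numeric types are already preserved."""
    return [
        {"county": r["county"], "population": int(r["population"])}
        for r in json.loads(text)
    ]


def read_xml(text):
    """Parse XML text; element text is a string, so convert numbers."""
    root = ET.fromstring(text)
    return [
        {"county": rec.findtext("county"), "population": int(rec.findtext("population"))}
        for rec in root.iter("record")
    ]


csv_rows = read_csv(CSV_TEXT)
json_rows = read_json(JSON_TEXT)
xml_rows = read_xml(XML_TEXT)

# All three readers should yield identical records.
print(csv_rows == json_rows == xml_rows)
```

The point of the sketch is that the format changes the parsing step, not the analysis: once the records are in a common shape, the same downstream code applies.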

Finding Data is about Reading

  • Where are these data?

    • Search Google, browse Kaggle, or ask an LLM

  • The best way to locate data is to read research in your area of interest

    • Organizations and researchers that go to the trouble of collecting data almost always write something about it

    • So the best way to find data for research is to read what other researchers are doing

      • There is very little new under the sun.

Example

  • Google Scholar search for “effects of childhood traumatic events on employment”

Reading leads to data

  • Behavioral Risk Factor Surveillance System

Start looking at your data

  • After you have located a data set that seems relevant to your interests, it’s time to stop reading for a bit and spend some time looking at the data

  • Basic questions:

    • What is the unit of observation?

    • How many observations are there?

    • What is being measured in the data for each observation?

    • Are there data quality issues? Are the data easy to read?
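
The basic questions above can be sketched as a short first-pass profile. This is a minimal example using only the Python standard library; the sample rows, the column names, and the `profile` function are hypothetical, and in practice a library such as pandas is a common choice for the same task.

```python
import csv
import io
from collections import Counter

# Hypothetical survey extract; an empty field stands in for a missing value.
SAMPLE = """respondent_id,state,age,employed
1,IL,34,yes
2,IL,,no
3,WI,51,yes
4,WI,29,
"""


def profile(text, id_column):
    """Summarize row count, columns, missing values, and ID uniqueness."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []

    # Count empty cells per column as a rough data quality check.
    missing = Counter()
    for row in rows:
        for col in columns:
            if row[col] is None or row[col].strip() == "":
                missing[col] += 1

    # If the ID column is unique, each row is plausibly one unit of observation.
    ids = [row[id_column] for row in rows]
    return {
        "n_observations": len(rows),
        "columns": columns,
        "missing_by_column": dict(missing),
        "id_is_unique": len(ids) == len(set(ids)),
    }


summary = profile(SAMPLE, "respondent_id")
for key, value in summary.items():
    print(f"{key}: {value}")
```

A duplicated identifier is often the first hint that the unit of observation is not what you assumed, for example person-year rows rather than one row per person.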

A good way to test whether you know your data

  • Try to make a one- or two-page “fact sheet” or a set of visualizations (perhaps in a dashboard)

  • If you can do this, then you probably know the data well enough to proceed

  • [Insert some simple visualizations here]

Back to careful reading

  • Once you have data that seem interesting, you want to be guided by some general principles:

    • Use the Reporter’s Questions - Who, what, where, when, why, and how

    • Focus on puzzles or anomalous outcomes

    • Draw on theory and research

    • Focus on causes, not open-ended inventories of effects

    • Be narrow, not comprehensive

    • Focus on variation

Stand on the shoulders of giants (or assistant professors)

  • The very best thing to do is find other research results that are interesting to you and work on building on those.

  • You do not have to do all the hard work of coming up with original ideas for research yourself.

Align Your Data with Your Reading

  • While you read, keep your data in mind: what would you have to do to your data to replicate these studies? What kind of data would you have to add or merge?
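
To make the add/merge question concrete, here is a minimal sketch of a left merge that attaches state-level context to individual survey rows and reports which keys found no match. The data values, field names, and the `left_merge` function are all hypothetical and only illustrate the idea; a library such as pandas offers the same operation for real projects.

```python
# Hypothetical data: survey rows keyed by state, plus state-level context to add.
survey = [
    {"respondent_id": 1, "state": "IL", "employed": "yes"},
    {"respondent_id": 2, "state": "WI", "employed": "no"},
    {"respondent_id": 3, "state": "MN", "employed": "yes"},
]
state_context = {
    "IL": {"unemployment_rate": 4.5},  # made-up values for illustration
    "WI": {"unemployment_rate": 3.0},
}


def left_merge(rows, lookup, key):
    """Keep every row; add lookup fields where the key matches.

    Returns the merged rows and the list of keys that had no match.
    """
    merged, unmatched = [], []
    for row in rows:
        extra = lookup.get(row[key])
        if extra is None:
            unmatched.append(row[key])
            merged.append(dict(row))  # keep the row without the added fields
        else:
            merged.append({**row, **extra})
    return merged, unmatched


merged, unmatched = left_merge(survey, state_context, "state")
print(f"{len(merged)} rows after merge; unmatched keys: {unmatched}")
```

Checking the unmatched keys before analysis matters because silently dropped or unmatched rows change who is in the sample, which ties back to the reliability and bias risks of open data.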

Don’t Lie With Statistics

  • Careful description and exploration are often better than sloppy or needlessly complex analysis

  • Read Philip Schrodt’s “Seven Deadly Sins”

    • Avoid garbage can regressions

    • Don’t lean too heavily on statistical significance

    • Make sure you understand the methods you are trying to use

Plan for Reproducibility

  • In the modern data science environment, if you are working with open data and aren’t doing anything proprietary, you should plan to post your work on GitHub or otherwise make your data and analysis available.

  • Comment your code and document your decisions.

Final Thoughts

  • Open data is an amazing resource!

  • But, the fundamental skill remains reading

  • You will have to modify your research in an iterative way. The work is never done!