Motivating Research with Open Data

Collin Paschall

Introductions

Collin Paschall and Chris Kromphardt
Data Analytics and Policy Program
Overseen hundreds of course projects and capstone papers
Goal of the program is to provide practical training in data analysis in support of evidence-based policymaking

Plan for the Talk

Introduce open data and the growing ecosystem
Review common open data sources
Review start-up costs to using open data
Share some thoughts on the research process with open data

The Open Data Ecosystem

We live in the Data Age, because AI is mostly about data

Are we going to use data for the common good?

Examples of Entities Doing Data Collection

National, state, and local government
International organizations
NGOs, think tanks, researchers
Journalists and citizens
Available on platforms like GitHub, Kaggle, etc.

Open data is a huge boon to public-facing research

Advocacy groups, think tanks, journalists, and others can create informative research projects on public health, natural disasters, public safety, educational outcomes, and much more
Open data and open data science tools (R, Python, etc.) have democratized access to data and data analysis

But, some risks

Reliability and bias
- Collecting information is a political act because it involves resources and implies choices about what and who is important (who do you measure, what do you measure)
Does making data available lead to helpful/good research? Not necessarily.

An Open Data Project Lifecycle

Start-up costs
Finding and evaluating data
Formulating a research question
Choosing methods
Planning for maximum impact

Start-up Costs

If you want to work in the open data ecosystem, you have to have some basic data science skills
- R, Python, SQL, APIs, GitHub
- .csv, .xml, .json files
- Natural language processing - plain, unstructured text is probably the most difficult to deal with
So just because data exist doesn’t mean you can use them: you’ve got to know the basics
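As a minimal sketch of what "the basics" look like for the three file formats listed above, the example below reads the same two records from CSV, JSON, and XML text. The records, field names, and helper functions are hypothetical, made up for illustration. It uses only the Python standard library so it runs anywhere; in practice a library such as pandas is the more common choice.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical toy records: the same two rows in three formats.
CSV_TEXT = "state,year,rate\nVA,2020,4.1\nMD,2020,3.8\n"
JSON_TEXT = (
    '[{"state": "VA", "year": 2020, "rate": 4.1},'
    ' {"state": "MD", "year": 2020, "rate": 3.8}]'
)
XML_TEXT = """<rows>
  <row><state>VA</state><year>2020</year><rate>4.1</rate></row>
  <row><state>MD</state><year>2020</year><rate>3.8</rate></row>
</rows>"""


def normalize(record):
    """Coerce field types so records from every format compare equal."""
    return {
        "state": str(record["state"]),
        "year": int(record["year"]),
        "rate": float(record["rate"]),
    }


def read_csv(text):
    # CSV has no types: every value arrives as a string.
    return [normalize(r) for r in csv.DictReader(io.StringIO(text))]


def read_json(text):
    # JSON keeps numbers as numbers, but normalize anyway for consistency.
    return [normalize(r) for r in json.loads(text)]


def read_xml(text):
    # Each <row> element becomes a dict of child tag -> text.
    root = ET.fromstring(text)
    return [
        normalize({child.tag: child.text for child in row})
        for row in root.findall("row")
    ]


rows_csv = read_csv(CSV_TEXT)
rows_json = read_json(JSON_TEXT)
rows_xml = read_xml(XML_TEXT)
print(rows_csv == rows_json == rows_xml)
```

The point of `normalize` is that the format changes how values arrive (all strings in CSV and XML, typed values in JSON), so a small cleaning step is usually needed before anything can be compared or combined.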

Finding Data is about Reading

Where are these data?
- Search Google, browse Kaggle, ask an LLM
The best way to locate data is to read research in your area of interest
- Organizations and researchers that go to the trouble of collecting data almost always write something about it
- So the best way to find data for research is to read what other researchers are doing
  - There is very little new under the sun.

Example

Google Scholar search for “effects of childhood traumatic events on employment”

Reading leads to data

Behavioral Risk Factor Surveillance System

Start looking at your data

After you have located a data set that seems relevant to your interests, it’s time to stop reading for a bit and spend some time looking at the data
Basic questions:
- What is the unit of observation?
- How many observations are there?
- What is being measured in the data for each observation?
- Are there data quality issues? Are the data easy to read?
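The basic questions above can be turned into a quick first-look script. This is a rough sketch, assuming a small hypothetical survey extract in which an empty string marks a missing answer; the column names and the `profile` helper are invented for illustration.

```python
import csv
import io

# Hypothetical survey extract; an empty field marks a missing answer.
RAW = """respondent_id,state,age,employed
1,VA,34,yes
2,VA,,no
3,MD,51,yes
4,MD,29,
"""

rows = list(csv.DictReader(io.StringIO(RAW)))


def profile(rows, id_column):
    """Answer the first-look questions for a list of dict rows."""
    columns = list(rows[0].keys())
    ids = [r[id_column] for r in rows]
    return {
        # How many observations are there?
        "n_observations": len(rows),
        # What is being measured for each observation?
        "columns": columns,
        # Is the assumed unit of observation really one row per id?
        "id_is_unique": len(set(ids)) == len(ids),
        # Are there data quality issues (here: missing values)?
        "missing_by_column": {
            c: sum(1 for r in rows if r[c] == "") for c in columns
        },
    }


summary = profile(rows, "respondent_id")
for key, value in summary.items():
    print(f"{key}: {value}")
```

If `id_is_unique` comes back false, the unit of observation is probably not what you assumed (for example, one row per respondent per year rather than one row per respondent).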

A good way to test whether you know your data

Try to make a 1 or 2 page “fact sheet” or a set of visualizations (perhaps in a dashboard)
If you can do this, then you probably know the data well enough to proceed
[Insert some simple visualizations here]
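A fact sheet does not need special tooling to start. The sketch below builds a plain-text one with a simple bar per group, assuming a hypothetical list of (state, employed) records; the data and function names are made up for illustration, and a real project would more likely use a plotting library or a dashboard tool.

```python
from collections import defaultdict

# Hypothetical records: (state, employed) pairs standing in for a real data set.
records = [
    ("VA", True), ("VA", False), ("VA", True),
    ("MD", True), ("MD", False),
    ("DC", True),
]


def employment_share(records):
    """Return the share of employed respondents in each state."""
    counts = defaultdict(lambda: [0, 0])  # state -> [employed, total]
    for state, employed in records:
        counts[state][0] += int(employed)
        counts[state][1] += 1
    return {state: emp / total for state, (emp, total) in counts.items()}


def fact_sheet(records, width=20):
    """Build a short plain-text summary with one text bar per state."""
    shares = employment_share(records)
    lines = [
        f"Observations: {len(records)}",
        f"States covered: {len(shares)}",
        "Share employed by state:",
    ]
    for state in sorted(shares):
        bar = "#" * round(shares[state] * width)
        lines.append(f"  {state} {bar} {shares[state]:.0%}")
    return "\n".join(lines)


shares = employment_share(records)
sheet = fact_sheet(records)
print(sheet)
```

If writing even this much forces you to look up what a column means or how a value is coded, that is a sign you need more time with the data before moving on.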

Back to careful reading

Once you have data that seem interesting, you want to be guided by some general principles:
- Use the Reporter’s Questions - Who, what, where, when, why, and how
- Focus on puzzles or anomalous outcomes
- Draw on theory and research
- Focus on causes, not open-ended inventories of effects
- Be narrow, not comprehensive
- Focus on variation

Stand on the shoulders of giants (or assistant professors)

The very best thing to do is find other research results that are interesting to you and work on building on those.
You do not have to do all the hard work of coming up with original ideas for research yourself.

Align Your Data with Your Reading

While you read, keep your data in mind - what would you have to do to the data you have to replicate these studies? What kind of data would you have to add/merge?
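Adding or merging data usually means joining two tables on a shared key. The sketch below shows one way to do that while keeping track of rows that fail to match, assuming two hypothetical state-level tables; all values and names are invented for illustration, and `left_join` is a hand-written helper, not a library function.

```python
# Hypothetical tables keyed by state abbreviation; all values are made up.
survey = [
    {"state": "VA", "share_employed": 0.67},
    {"state": "MD", "share_employed": 0.50},
    {"state": "DC", "share_employed": 1.00},
]
context = [
    {"state": "VA", "median_income": 80000},
    {"state": "MD", "median_income": 90000},
]


def left_join(left, right, key):
    """Keep every row of `left`, adding columns from `right` where keys match.

    Returns the merged rows and the list of left keys with no match.
    """
    lookup = {row[key]: row for row in right}
    if len(lookup) != len(right):
        # Duplicate keys would silently drop rows, so fail loudly instead.
        raise ValueError(f"duplicate {key!r} values in right table")
    merged, unmatched = [], []
    for row in left:
        match = lookup.get(row[key])
        if match is None:
            unmatched.append(row[key])
            merged.append(dict(row))
        else:
            # Left-hand values win if both tables share a column name.
            merged.append({**match, **row})
    return merged, unmatched


merged, unmatched = left_join(survey, context, "state")
print(f"{len(merged)} rows after merge; no match for: {unmatched}")
```

Checking the unmatched keys is the important habit: a merge that quietly loses observations can change the results of everything that follows.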

Don’t Lie With Statistics

Careful description and exploration are often better than sloppy or redundant complex analysis
Read Philip Schrodt’s “Seven Deadly Sins”
- Avoid garbage can regressions
- Don’t lean too heavily on statistical significance
- Make sure you understand the methods you are trying to use
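One reason to be cautious about statistical significance when many variables are thrown into a model: the more tests you run, the more likely it is that something looks significant by chance. The sketch below is a simple illustration of that arithmetic, assuming the tests are independent and every null hypothesis is true (both are simplifying assumptions); the function name is made up.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent tests is
    'significant' at level alpha when there is no real effect anywhere."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20):
    print(f"{n:>2} tests: {chance_of_false_positive(n):.0%}")
```

Under these assumptions, 20 unrelated predictors give roughly a 64 percent chance of at least one spurious "finding" at the 0.05 level.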

Plan for Reproducibility

In the modern data science environment, if you are working with open data and aren’t doing anything proprietary, you should plan to post your work on GitHub or similarly make your data and analysis available.
Comment your code and document your decisions.

Final Thoughts

Open data is an amazing resource!
But, the fundamental skill remains reading
You will have to modify your research in an iterative way. The work is never done!