Collin Paschall and Chris Kromphardt
Data Analytics and Policy Program
We have overseen hundreds of course projects and capstone papers
Goal of the program is to provide practical training in data analysis in support of evidence-based policymaking
Introduce open data and the growing ecosystem
Review common open data sources
Review start-up costs to using open data
Share some thoughts on the research process with open data
We live in the Data Age; AI, after all, is mostly about data
National, state, and local government
International organizations
NGOs, think tanks, researchers
Journalists and citizens
Available on platforms like GitHub, Kaggle, and others
Advocacy groups, think tanks, journalists, and others can create informative research projects on public health, natural disasters, public safety, educational outcomes, and much more
Open data and open data science tools (R, Python, etc.) have democratized data
Reliability and bias
Does making data available lead to helpful/good research? Not necessarily.
Start-up costs
Finding and evaluating data
Formulating a research question
Choosing methods
Planning for maximum impact
If you want to work in the open data ecosystem, you have to have some basic data science skills
R, Python, SQL, APIs, GitHub
Common file formats: .csv, .xml, .json (a loading sketch follows below)
Natural language processing: plain, unstructured text is probably the most difficult format to work with
So just because data exist doesn't mean they're ready to use; you've got to know the basics
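To make these start-up costs concrete, here is a minimal sketch of loading each of these formats in Python with pandas. The file names are hypothetical placeholders, not real datasets.

    import pandas as pd

    # Tabular data: .csv is the most common open data format
    budget = pd.read_csv("city_budget.csv")          # hypothetical file

    # .json is what most open data APIs return
    inspections = pd.read_json("inspections.json")   # hypothetical file

    # .xml is supported directly in pandas 1.3 and later
    filings = pd.read_xml("filings.xml")             # hypothetical file

    print(budget.shape, inspections.shape, filings.shape)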
Where are these data?
The best way to locate data is to read research in your area of interest
Organizations and researchers that go to the trouble of collecting data almost always write something about it
Following the literature therefore doubles as a data search
After you have located a data set that seems relevant to your interests, it’s time to stop reading for a bit and spend some time looking at the data
Basic questions:
What is the unit of observation?
How many observations are there?
What is being measured in the data for each observation?
Are there data quality issues? Are the data easy to read?
Try to make a one- or two-page "fact sheet" or a set of visualizations (perhaps in a dashboard; a sketch follows below)
If you can do this, then you probably know the data well enough to proceed
[Insert some simple visualizations here]
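As one illustration, here is a minimal exploration sketch in Python that works through the basic questions above; the file and column names are hypothetical stand-ins for whatever data set you are evaluating.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("school_outcomes.csv")  # hypothetical data set

    # How many observations, and what is measured for each one?
    print(df.shape)    # rows x columns
    print(df.dtypes)   # what each column stores
    print(df.head())   # eyeball the unit of observation

    # Data quality: missingness and duplicate records
    print(df.isna().mean().sort_values(ascending=False))
    print(df.duplicated().sum())

    # One simple visualization for a fact sheet
    df["graduation_rate"].hist(bins=30)  # hypothetical column
    plt.xlabel("Graduation rate")
    plt.title("Distribution across schools")
    plt.show()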
Once you have data that seem interesting, you want to be guided by some general principles:
Use the Reporter’s Questions - Who, what, where, when, why, and how
Focus on puzzles or anomalous outcomes
Draw on theory and research
Focus on causes, not open-ended inventories of effects
Be narrow, not comprehensive
Focus on variation
The very best approach is to find existing research results that interest you and build on them
You do not have to come up with every original research idea yourself
Careful description and exploration are often better than sloppy or redundant complex analysis
Read Philip Schrodt's "Seven Deadly Sins of Contemporary Quantitative Political Analysis"
Avoid garbage can regressions that throw every available variable at the outcome (a sketch follows this list)
Don't lean too heavily on statistical significance
Make sure you understand the methods you are trying to use
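To illustrate the garbage can point, here is a hedged sketch contrasting the anti-pattern with a small, theory-driven specification; the data set and variable names are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("districts.csv")  # hypothetical data set

    # Anti-pattern (avoid): regress the outcome on every column in the
    # file and hunt for stars.

    # Better: include only the predictors your theory gives you a reason
    # to include, and interpret magnitudes, not just significance.
    model = smf.ols(
        "turnout ~ median_income + pct_college + competitiveness",
        data=df,
    ).fit()
    print(model.summary())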
In the modern data science environment, if you are working with open data and nothing proprietary, plan to post your work on GitHub or otherwise make your data and analysis available.
Comment your code and document your decisions (a short example follows).
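For instance, documenting decisions can be as simple as leaving comments like these in your analysis script; this is a hypothetical example, not a prescribed template.

    import pandas as pd

    df = pd.read_csv("permits.csv")  # hypothetical open data file

    # DECISION: drop records before 2010 because the agency changed its
    # reporting standard that year (documented in the data dictionary).
    df = df[df["year"] >= 2010]

    # DECISION: fill missing fees with 0 only for exempt permits, per
    # the agency's documentation; leave other missing fees as NA.
    exempt = df["permit_type"].eq("exempt")
    df.loc[exempt, "fee"] = df.loc[exempt, "fee"].fillna(0)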
Open data is an amazing resource!
But the fundamental skill remains reading
You will have to modify your research in an iterative way. The work is never done!