Introduction

Welcome! This document is for students who need to find data. We’ll look at some tips and tricks for making your data search more efficient and effective. We’ll also look at a few major data sources and talk about what makes a good, usable dataset. Finally, we’ll talk about the importance of preparing and cleaning data sources that are found “in the wild.”

Before you start

Before you start your data search, sit down and evaluate your needs. Think about your research question, as well as the literature you’ve read on your topic.

What’s that? You haven’t read any literature on your topic?

Stop, and back up.

Before you start your data search, you should sit down and do some reading. Read background information on your topic, read some peer-reviewed research, and use that to help determine your data needs. Most papers have both a “Data” section and a “Methodology” section, where you can read about what data the researchers used, how they obtained it, and what they had to do to prepare it for their work. For tips on searching the literature, see my handout on effective economics literature searching.

Evaluating your data needs

Once you have some background information on your topic, and have started exploring the literature, it’s time to answer some questions about what data you’ll need.

  1. What variables do you need?

    Remember, even the most basic research questions will require at least two variables: an independent (or explanatory) variable and a dependent (or response) variable. Chances are, though, you’ll be asking more complex questions than this and you’ll need several variables. There is no hard-and-fast rule as to “how many” variables you’ll need. Let your research question and the methods you know guide this.

  2. What unit of observation do you need?

    The variables you need will have had their value observed at a particular level, and the data will be reported at a particular level of observation. The level of observation you need will help determine where you should look for your data. Common units of observation include:

    • the individual (i.e. each row is a person)
    • the household (i.e. each row is a household, made up of some number of people)
    • the county/province (i.e. each row represents a county and all of the people/households within it)
    • the state, or the country, etc.

    Remember, you can almost always aggregate up from a smaller unit of observation to a larger one (e.g. county -> state), but you cannot easily disaggregate data (e.g. state -> county) if you don’t have the smaller unit of observation already.

  3. What time period/frequency do you need?

    Certain data is only collected at particular frequencies. For instance, the United States Census is conducted every 10 years. The American Community Survey is conducted every year.

    Certain data is only distributed in certain frequencies. For example, the American Community Survey (see above) is available in annual (one year) and five-year datasets. Each of these frequencies is only available for particular levels of observation, as well (see above).

    While you can aggregate data up from shorter time frequencies to longer ones (e.g. monthly -> annual), downloading data at a much shorter time frequency than you need will result in a much larger dataset. If you only need annual data, downloading the data at a monthly frequency will give you a dataset that is 12x larger than you need, which may unnecessarily slow down your data cleaning/processing code.
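
As a rough illustration of aggregating up along both dimensions (unit of observation and time frequency), the sketch below rolls monthly, county-level counts up to annual, state-level totals in Python. The records, the field names, and the aggregate_up helper are all invented for this example; they do not come from any real dataset or library.

```python
from collections import defaultdict

# Hypothetical monthly, county-level records (all values invented for illustration).
monthly_county = [
    {"state": "VT", "county": "Addison", "year": 2020, "month": 1, "permits": 4},
    {"state": "VT", "county": "Addison", "year": 2020, "month": 2, "permits": 6},
    {"state": "VT", "county": "Chittenden", "year": 2020, "month": 1, "permits": 10},
    {"state": "VT", "county": "Chittenden", "year": 2021, "month": 1, "permits": 7},
    {"state": "NH", "county": "Grafton", "year": 2020, "month": 1, "permits": 3},
]


def aggregate_up(rows, keys, value):
    """Sum `value` over every row that shares the same combination of `keys`."""
    totals = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += row[value]
    return dict(totals)


# county -> state and monthly -> annual in one step: group by (state, year).
annual_state = aggregate_up(monthly_county, keys=("state", "year"), value="permits")

for (state, year), total in sorted(annual_state.items()):
    print(state, year, total)
```

Note that the reverse is not possible from annual_state alone: once the county and month detail has been summed away, it cannot be recovered. Simple summing also only makes sense for counts; averages, rates, and medians generally need weights or the underlying smaller-unit data.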

Data archiving

Once you have some idea of what you want your data to “look” like (i.e. variables, unit of observation, frequency), you can start thinking about where you might find this data. Several questions will help you think about where you might look:

  1. Who would care about this data? And who would care about keeping it?

    Curating data is labor- and technology-intensive work. Despite the fact that we generate (and even collect) gigantic amounts of data all the time, this data is not ready to use. Making data usable and findable, and keeping it that way over the long term, takes research, coding, migration to new formats, and documentation, among other things. What this means for your data search is that you need to think about who would care enough about the data to put this effort into it.

    Some examples:

    • The U.S. Census Bureau cares about the demographic and geographic information collected in the U.S. Census.
    • The National Center for Education Statistics cares about educational data, from Kindergarten through higher education, for United States students.
    • Google cares about the searches you do, when you do them, and the results of those searches.
    • Facebook cares about any data about you that can be used for marketing/advertising to you.

  2. What type of organization are they?

    If the data you’re looking for comes from a governmental organization, you will often be able to access it more easily, particularly if you are a citizen. If it comes from a for-profit company, that company probably has little reason to give you access to the data, especially for free.

  3. If not government, how valuable is the data? And who would pay for it?

    Facebook makes a lot of money off the data they have on you, so they are reluctant to give access to it unless it can benefit them in some way. Google can’t make much money off of their search data in the aggregate, so they give you access to the data at that level through Google Trends. However, you can’t use Google Trends for free to find out what males ages 18-25 have been searching, because that data is very valuable to Google.

  4. Are there privacy/confidentiality issues?

    If you’re looking for data on students or on health care, you’ll often run into “restricted” data. If you want data from the U.S. Census on individual respondents (rather than counties, states, etc.), you’ll also run into restrictions. Any time personally identifiable information could be disclosed, especially when it concerns sensitive topics like grades or health, expect that getting the data will take extra effort/time, or may be impossible. Talk to a librarian if you’re interested in restricted data, but be aware that this type of data is more suitable for “big” projects, like theses or senior independent work.

Types of studies

When thinking about what type of data you might need, it’s also helpful to know what type of study you’re looking for data from. Here are some examples:

  • Cross-Sectional
    • data that are only collected once
    • many public opinion surveys are cross-sectional
  • Time Series
    • studies the same variable(s) over time
    • looking at stock prices for particular firms over a 25-year period is an example of time series data
  • Longitudinal Studies (also called Panel studies)
    • conducted repeatedly and the same group of respondents is surveyed each time
    • allows for examining changes over the life course
    • “Add Health” is an example of a longitudinal (panel) study
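
One way to see the difference between these study types is in the shape of the data itself: how many units are observed, and in how many time periods. The Python sketch below is a rough heuristic along those lines, not a formal definition. The function name classify_structure, the field names "unit" and "period", and the sample rows are all hypothetical.

```python
from collections import defaultdict


def classify_structure(rows, unit_key="unit", time_key="period"):
    """Guess the study type from how many units and time periods appear."""
    if not rows:
        raise ValueError("need at least one observation")

    # Record which time periods each unit was observed in.
    periods_by_unit = defaultdict(set)
    for row in rows:
        periods_by_unit[row[unit_key]].add(row[time_key])
    all_periods = set().union(*periods_by_unit.values())

    if len(all_periods) == 1:
        return "cross-sectional"        # everything collected at one point in time
    if len(periods_by_unit) == 1:
        return "time series"            # one unit followed over time
    if all(len(periods) > 1 for periods in periods_by_unit.values()):
        return "longitudinal (panel)"   # the same units observed repeatedly
    return "unclear"                    # mixed pattern; read the documentation


# Invented sample data for each type.
opinion_poll = [{"unit": "r1", "period": 2022}, {"unit": "r2", "period": 2022}]
stock_prices = [{"unit": "ACME", "period": year} for year in range(2000, 2005)]
cohort_study = [
    {"unit": "r1", "period": 1995}, {"unit": "r1", "period": 2001},
    {"unit": "r2", "period": 1995}, {"unit": "r2", "period": 2001},
]

print(classify_structure(opinion_poll))
print(classify_structure(stock_prices))
print(classify_structure(cohort_study))
```

A real dataset's documentation is always the authority on how the study was designed; a check like this is only a quick sanity test of what you downloaded.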

Data sources

When searching for data, you won’t usually get the best results by going to Google and searching naïvely. If you are going to search Google, usually it is better to search for sources of the data, and then go to that institution’s website and search for the data on their site. One reason for this is that Google cannot “search” variables. If you’re looking for a study with particular variables, Google can’t look “inside” a dataset hosted on an institution’s website and search the variables that are contained therein.

Since it’s usually better to go to a data archive or another institutional site to search, here are some major data sources to know/review:

United States Census & American Community Survey

  • The Census is conducted every 10 years and collects data on everyone living in the United States. The American Community Survey happens annually (since 2005) and collects data on a 1% sample of the people living in the United States.
  • If you’re working with historical Census data, be aware that questions/answers change over time, and geographic areas (Congressional Districts, counties, etc.) also change
  • Access points for Census/ACS data include:
    • Social Explorer – great for tables and many different levels of geography
    • IPUMS – microdata from around the world and the United States, harmonized over time and space to make comparison easier
    • NHGIS – combines United States data from IPUMS with spatial data to make GIS work easier

General Social Survey/European Social Survey/Afrobarometer

  • These are large-scale public surveys on demographic, behavioral, political, and attitudinal topics
  • Take place over different periods of time (e.g. the GSS is a biennial survey)
  • Randomly selected sample of adults from these areas (i.e. United States, Europe, Africa)
  • Some questions are asked every time, some questions come and go, and some questions are asked once and then never again
  • To access:

Add Health (The National Longitudinal Study of Adolescent to Adult Health)

  • Longitudinal study of students who were in grades 7-12 in 1994-95 (most recent follow-up in 2008)
  • Survey data on social, economic, psychological and physical well-being
  • Contextual data on family, neighborhood, community, school, friendships, peer groups, and romantic relationships
  • Public-use and restricted versions of the data exist; the public-use version is available through ICPSR

Inter-university Consortium for Political and Social Research (ICPSR)

  • Part of the Institute for Social Research at University of Michigan
  • First attempt at openly sharing data amongst researchers (started with election studies data)
  • Curated, digitized, diverse historical data sets
  • Access requires creating an account through go/icpsr/; after that, you can access the data directly through the site

Finding help

Codebooks

If you’re looking for help with a particular dataset you’ve downloaded, the first place to look is the codebook. Sometimes codebooks are on web pages, sometimes they are downloaded as PDF or text files, but they should always be available for any dataset you want to use. If you can’t find a codebook for your dataset, strongly consider looking for another dataset.

Information contained in a codebook includes:

  • Column locations and widths for each variable (if necessary)
  • Definitions of different record/variable types
  • Response codes for each variable (e.g. does 1 = “male” or does 1 = “female”?)
  • Codes used to indicate non-response and missing data (do they use -99? -999? NA? Some combination?)
  • Exact questions and skip patterns used in a survey
  • Other indications of the content and characteristics of each variable
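
To make this concrete, here is a minimal Python sketch of how codebook information gets used when reading a fixed-width data file: column locations to slice each variable out of a line, response codes to label values, and missing-data codes to avoid treating something like -99 as a real number. The CODEBOOK dictionary, the variable names, the codes, and the sample lines are all invented for illustration; a real codebook will define its own.

```python
# Hypothetical codebook: column positions (0-based start, exclusive end),
# response labels, and the codes that mean "missing" for each variable.
CODEBOOK = {
    "sex": {"cols": (0, 1), "labels": {"1": "male", "2": "female"}, "missing": {"9"}},
    "age": {"cols": (1, 4), "labels": None, "missing": {"-99", "999"}},
}


def decode_record(line, codebook):
    """Turn one fixed-width line into a dict of cleaned values."""
    record = {}
    for name, spec in codebook.items():
        start, end = spec["cols"]
        raw = line[start:end].strip()
        if raw in spec["missing"]:
            record[name] = None                   # missing or non-response
        elif spec["labels"] is not None:
            record[name] = spec["labels"][raw]    # coded response -> label
        else:
            record[name] = int(raw)               # plain numeric variable
    return record


# Invented raw lines: 1 character for sex, then 3 characters for age.
raw_lines = ["1 34", "2-99", "9 51"]
decoded = [decode_record(line, CODEBOOK) for line in raw_lines]

for row in decoded:
    print(row)
```

Without the codebook, the second line would look like a respondent aged -99, which would quietly distort any average you computed.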

Library Research Guides

Our library research guide on data and statistics can be found at go/econstats/. It has information on finding data, as well as many more sources for data than are listed here.

Ask a librarian!

You can reach out to me, Ryan Clement, at go/ryan/ for assistance with finding, acquiring, cleaning, using, and visualizing your data! Before reaching out, please try to think through the questions above. It will help make our time together as effective as possible!