Welcome! This document is for students who need to find data. We’re going to look at some tips for planning your data search so that it is more efficient and effective. We’ll also look at a few major data sources, and we’ll talk about what makes a good or usable dataset. Finally, we’ll talk about the importance of preparing and cleaning data sources that are found “in the wild.”
Before you start your data search, sit down and evaluate your needs. Think about your research question, as well as the literature you’ve read on your topic.
What’s that? You haven’t read any literature on your topic?
Stop, and back up.
Before you start your data search, you should sit down and do some reading. Read background information on your topic, read some peer-reviewed research, and use that to help determine your data needs. Most papers have both a “Data” section and a “Methodology” section. These are the sections where you can read about what data the researchers used, how they obtained it, and what they had to do to the data to prepare it for their work. For tips on searching the literature, see my handout on effective economics literature searching.
Once you have some background information on your topic, and have started exploring the literature, it’s time to answer some questions about what data you’ll need.
What variables do you need?
Remember, even the most basic research questions will require at least two variables: an independent (or explanatory) variable and a dependent (or response) variable. Chances are, though, you’ll be asking more complex questions than this and you’ll need several variables. There is no hard-and-fast rule as to “how many” variables you’ll need. Let your research question and the methods you know guide this.
What unit of observation do you need?
The variables you need will have had their value observed at a particular level, and the data will be reported at a particular level of observation. The level of observation you need will help determine where you should look for your data. Common units of observation include:
Remember, you can almost always aggregate up from a smaller unit of observation to a larger one (e.g. county -> state), but you cannot easily disaggregate data (e.g. state -> county) if you don’t have the smaller unit of observation already.
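To make the aggregation point concrete, here is a minimal sketch in Python of rolling county-level counts up to the state level. The records, the field layout, and the function name `aggregate_to_state` are all made up for illustration; your real data will look different.

```python
from collections import defaultdict

# Hypothetical county-level records: (state, county, population).
# The figures are invented for illustration only.
county_rows = [
    ("VT", "Addison", 100),
    ("VT", "Chittenden", 250),
    ("NH", "Grafton", 120),
    ("NH", "Coos", 30),
]


def aggregate_to_state(rows):
    """Sum a county-level count up to the state level."""
    totals = defaultdict(int)
    for state, _county, population in rows:
        totals[state] += population
    return dict(totals)


state_totals = aggregate_to_state(county_rows)
print(state_totals)
```

Note that the county names are gone from the result: once you only have the state totals, there is no way to recover the county values. Also, a simple sum like this only works for counts. Rates, averages, and medians usually need weights or the underlying records to aggregate correctly.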
What time period/frequency do you need?
Certain data is only collected at particular frequencies. For instance, the United States Census is conducted every 10 years. The American Community Survey is conducted every year.
Certain data is only distributed at certain frequencies. For example, the American Community Survey mentioned above is available in annual (one-year) and five-year datasets. Each of these is only available for particular levels of observation, so your choice of frequency and your unit of observation can limit each other.
While you can aggregate data up from shorter time frequencies to longer ones (e.g. monthly -> annual), downloading data at a much shorter time frequency than you need will result in a much larger dataset. If you only need annual data, downloading the data at a monthly frequency will give you a dataset that is 12 times larger than you need, which may unnecessarily slow down your data cleaning/processing code.
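As a rough illustration of both points (aggregating up in time, and the 12-to-1 difference in size), here is a small Python sketch. The monthly series, its values, and the function name `aggregate_to_annual` are hypothetical, and it assumes the variable is something you can sensibly add up across months.

```python
from collections import defaultdict

# Hypothetical monthly series: (year, month, value). Values are invented.
monthly_rows = [
    (year, month, 10 * month)
    for year in (2020, 2021)
    for month in range(1, 13)
]


def aggregate_to_annual(rows):
    """Sum monthly values into one total per year."""
    totals = defaultdict(int)
    for year, _month, value in rows:
        totals[year] += value
    return dict(totals)


annual_totals = aggregate_to_annual(monthly_rows)

# The monthly data has 12 rows for every annual row.
print(len(monthly_rows), "monthly rows ->", len(annual_totals), "annual rows")
```

If the source already publishes the annual figures, downloading those directly saves you this step and keeps your files smaller.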
Once you have some idea of what you want your data to “look” like (i.e. variables, unit of observation, frequency), you can start thinking about where you might find this data. Several questions will help you think about where you might look:
Who would care about this data? And who would care about keeping it?
Curating data is labor- and technology-intensive work. Even though we generate (and even collect) gigantic amounts of data all the time, most of this data is not ready to use. Making data usable and findable, and keeping it that way over the long term, takes research, coding, migration to new formats, and documentation, among other things. What this means for your data search is that you need to think about who would care enough about the data to put this effort into it.
Some examples:
What type of organization are they?
If the data you’re looking for comes from a governmental organization, you will often be able to access it more easily, particularly if you are a citizen. If it comes from a for-profit company, the company probably has little reason to give you access to the data, especially for free.
If not government, how valuable is the data? And who would pay for it?
Facebook makes a lot of money off the data they have on you, so they are reluctant to give access to it unless it can benefit them in some way. Google can’t make much money off of their search data in the aggregate, so they give you access to the data at that level through Google Trends. However, you can’t use Google Trends for free to find out what males aged 18-25 have been searching for, because that data is very valuable to Google.
Are there privacy/confidentiality issues?
If you’re looking for data on students or on health care, you’ll often run into “restricted” data. If you want data from the U.S. Census on individual respondents (rather than counties, states, etc.), you’ll also run into restrictions. Any time personally identifiable information could be disclosed, especially when it concerns sensitive topics like grades or health, you need to know that getting the data will take extra effort and time, or may be impossible. Talk to a librarian if you’re interested in restricted data, but be aware that this type of data is more suitable for “big” projects, like theses or senior independent work.
When thinking about what type of data you might need, it’s also helpful to know what type of study you’re looking for data from. Here are some examples:
When searching for data, you won’t usually get the best results by going to Google and searching naïvely. If you are going to search Google, usually it is better to search for sources of the data, and then go to that institution’s website and search for the data on their site. One reason for this is that Google cannot “search” variables. If you’re looking for a study with particular variables, Google can’t look “inside” a dataset hosted on an institution’s website and search the variables that are contained therein.
Since it’s usually better to go to a data archive or another institutional site to search, here are some major data sources to know/review:
If you’re looking for help with a particular dataset you’ve downloaded, the first place to look is the codebook. Sometimes codebooks are on web pages, sometimes they are downloaded as PDF or text files, but they should always be available for any dataset you want to use. If you can’t find a codebook for your dataset, strongly consider looking for another dataset.
Information contained in a codebook includes:
Our library research guide on data and statistics can be found at go/econstats/. It has information on finding data, as well as many more sources for data than are listed here.
You can reach out to me, Ryan Clement, at go/ryan/ for assistance with finding, acquiring, cleaning, using, and visualizing your data! Before reaching out, please try to work through as many of the questions above as you can. It will help make our time together as effective as possible.