Projects: Choosing a Dataset for your Project

Codebook for Available Datasets

Click here to run the code for this workshop yourself on RStudio.Cloud!

For your first coding project, you will need a dataset! But how should you go about choosing your dataset and your variables? I’ve laid out some steps below that will help you get started.

1. What datasets are available?

I have collected a series of good datasets that you can choose from. These include datasets on politics, environmental policy, economics, and health. If you're taking my class, you can pick one of these or bring your own dataset, but if you bring your own, you will have to obtain it yourself and run it by me first.

Carbon Emissions dataset
County Health Rankings dataset
Environmental Racism dataset
Polarization and Health dataset
COVID-19 rates for Massachusetts Cities dataset
German Solar Power dataset
Japanese Prefectural Elections dataset
Japanese Municipalities dataset

These datasets are all stored in the Files panel of this RStudio.Cloud session.

2. What is the structure of my dataset?

What does a row in my dataset mean? Is it a city? A county? A city in a specific year?

| Name | Rows (unit of observation) | Filename |
|---|---|---|
| County Health Rankings | US Counties in 2020 | `county_dataset.csv` |
| Environmental Racism | US Counties in 2020 | `environmental_health.csv` |
| Polarization and Health | US Counties in 2016 | `polarization_and_health.csv` |
| COVID-19 rates for MA Cities | MA City-Weeks in 2020 | `ma_data.csv` |
| Carbon Emissions | Japanese City-Years | `jp_emissions.csv` |
| German Solar Power | German Cities in 2011 | `germany_data.csv` |
| Japanese Prefectural Elections | Japanese Candidates | `japanese_prefectural_elections.csv` |
| Japanese Municipalities | Japanese City-Years | `japan_muni_elections.csv` |
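
One way to check what a row means is to load the file and test whether the columns you think identify a row really are unique. The sketch below is a minimal base R example, not the workshop's own code. It builds a small made-up table shaped like city-weeks, and the column names `city`, `week`, and `cases` are hypothetical placeholders rather than the real variable names in `ma_data.csv`.

```r
# Toy data invented for illustration; NOT taken from ma_data.csv.
# In practice you would load a real file instead, e.g.:
#   dat <- read.csv("ma_data.csv")
dat <- data.frame(
  city  = c("Boston", "Boston", "Salem", "Salem"),
  week  = c(1, 2, 1, 2),
  cases = c(10, 12, 3, 4)
)

# How many rows and columns are there?
dim(dat)

# Peek at the first few rows
head(dat)

# If each row is a city-week, no city-week pair should repeat.
# anyDuplicated() returns 0 when there are no duplicated rows.
id_cols <- c("city", "week")
is_city_week <- anyDuplicated(dat[, id_cols]) == 0
is_city_week

# City alone should NOT identify a row in city-week data
city_is_unique <- anyDuplicated(dat$city) == 0
city_is_unique
```

If the check on your guessed identifier columns comes back `FALSE`, a row probably means something more specific than you assumed (for example, a city in a specific year rather than a city).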

3. What variables are in my dataset?

Most datasets come with a codebook, or with descriptions of their variables posted on a website somewhere. It is vitally important that you know the exact meaning of your variables, especially the units they are measured in. A height of 5 means very different things when measured in feet, inches, meters, or hands.
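
Before you open the codebook, it can help to list what is actually in the data. Here is a minimal base R sketch using an invented table. The variable `height_in` and its inch units are assumptions made up for illustration; for a real dataset, the units should always come from its codebook.

```r
# Toy data invented for illustration; variable names and units are hypothetical.
dat <- data.frame(
  name      = c("A", "B", "C"),
  height_in = c(60, 66, 72)   # assumed to be inches, per an imaginary codebook
)

# List the variable names, then their types
names(dat)
str(dat)

# Summaries can flag surprising units: heights in the 60s cannot be meters
summary(dat$height_in)

# Convert to a unit you prefer, and record the unit in the new name
# (1 inch = 2.54 cm exactly)
dat$height_cm <- dat$height_in * 2.54
dat$height_cm
```

Putting the unit in the variable name is one simple convention for keeping track of it; a comment or a note in your write-up works too.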

4. Possible Questions

Finally, you need to design your own research question! I’d recommend choosing an outcome that demonstrates considerable variation, and a single explanatory variable that you have reason to believe might affect that outcome. Then, I’d think about what other variables might also explain the variation in that outcome. Be sure to include those in your analysis!
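
As a rough sketch of that workflow in base R: check that the outcome varies, then fit a model with the explanatory variable alone and again with one other variable added. The data below are simulated, and the names `outcome`, `explanatory`, and `control` are placeholders, not variables from any of the datasets above. A linear model via `lm()` is just one common choice here, not necessarily the method your project will call for.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated data for illustration only
n <- 100
control     <- rnorm(n)
explanatory <- 0.5 * control + rnorm(n)
outcome     <- 2 * explanatory + 1 * control + rnorm(n)
dat <- data.frame(outcome, explanatory, control)

# 1. Does the outcome vary? Look at its spread.
sd(dat$outcome)
summary(dat$outcome)

# 2. Model the outcome with the single explanatory variable
m1 <- lm(outcome ~ explanatory, data = dat)

# 3. Add another variable that might also explain the outcome
m2 <- lm(outcome ~ explanatory + control, data = dat)

# Compare the estimated effect of the explanatory variable across models
coef(m1)["explanatory"]
coef(m2)["explanatory"]
```

If the estimate for the explanatory variable shifts noticeably once the other variable is included, that is a sign the second variable belongs in your analysis.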

For anyone starting out, here are a few questions that might interest you! You can certainly design your own, pick one of these, or use these as a launching pad to new research questions.

County Health Rankings

Do rural voters prefer the Republican Party or the Democratic Party? Where? Has this changed over time?
Does race have a linear effect or an exponential effect on votes cast for specific parties?
Does politics affect our physical or mental health, or health behaviors, like drinking, exercise, etc.?

Emissions Dataset

What (other than population) helped towns in Japan reduce their CO2 emissions over time?
Did a particular community trait seem to reduce CO2 emissions in some years but not others? Why?
The LDP is Japan’s main conservative, pro-business party. We might expect that towns that vote more for the LDP do not reduce their greenhouse gas emissions quite as much. Is this true? Has this effect changed from the early 2000s to the late 2010s?

Japan Cities Dataset

How did the shutdown of nuclear power plants after Fukushima affect the income per capita of towns hosting nuclear power plants?
How much did the tsunami and the Fukushima exclusion zone affect towns’ social capital?
Japan was hit by a recession in 1992, the aftermath of its very own housing bubble, which stymied local economies for decades. Did municipalities’ budgets suffer after 1992 compared with before, or did they bounce back?

Please feel free to reach out if you have questions.
