Projects: Choosing a Dataset for your Project

Codebook for Available Datasets

Click here to run the code for this workshop yourself on RStudio.Cloud!

For your first coding project, you will need a dataset! But how should you go about choosing your dataset and your variables? I’ve laid out some steps below that will help you get started.



1. What datasets are available?

I have collected a series of good datasets that you can choose from. These include datasets on politics, environmental policy, economics, and health. If you’re doing this project for my class, you can also choose your own dataset, but you will have to obtain it and run it by me first.

  • Carbon Emissions dataset
  • County Health Rankings dataset
  • Environmental Racism dataset
  • Polarization and Health dataset
  • COVID-19 rates for Massachusetts Cities
  • German Solar Power dataset
  • Japanese Prefectural Elections dataset
  • Japanese Municipalities dataset

These datasets are all stored in the Files panel of this RStudio.Cloud session.



2. What is the structure of my dataset?

What does a row in my dataset mean? Is it a city? A county? A city in a specific year?

Name                              Rows                     Filename
County Health Rankings            US Counties in 2020      county_dataset.csv
Environmental Racism              US Counties in 2020      environmental_health.csv
Polarization and Health           US Counties in 2016      polarization_and_health.csv
COVID-19 rates for MA Cities      MA City-Weeks in 2020    ma_data.csv
Carbon Emissions                  Japanese City-Years      jp_emissions.csv
German Solar Power Dataset        German Cities in 2011    germany_data.csv
Japanese Prefectural Elections    Japanese Candidates      japanese_prefectural_elections.csv
Japanese Municipalities Dataset   Japanese City-Years      japan_muni_elections.csv
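
One way to confirm what a row means is to check which columns uniquely identify it. Here is a minimal sketch in base R. It uses a tiny made-up table standing in for a city-week file like ma_data.csv (the values are invented for illustration), writes it to a temporary file, and reads it back the same way you would read a file from the Files panel.

```r
# Toy stand-in for a city-week file such as ma_data.csv (values are made up).
toy <- data.frame(
  city = c("Boston", "Boston", "Worcester", "Worcester"),
  date = c("2020-09-02", "2020-09-09", "2020-09-02", "2020-09-09"),
  positivity_rate = c(1.2, 1.5, 2.0, 2.4)
)

# Write it out and read it back. With a real file from the Files panel,
# you would instead run something like: dat <- read.csv("ma_data.csv")
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
dat <- read.csv(path)

# How many rows and columns are there, and what do the first rows look like?
dim(dat)
head(dat)

# If each row is a city-week, the combination of city + date never repeats.
n_duplicates <- sum(duplicated(dat[, c("city", "date")]))
n_duplicates
```

If the count of duplicates is above zero, the columns you picked do not define a row, and the unit of observation is probably something finer (for example, a city-week rather than a city).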



3. What variables are in my dataset?

Most datasets have codebooks, or descriptions of variables on a website somewhere. It is vitally important that you know the exact meaning of your variables, especially what units they are measured in. Height means a lot of different things when measured in feet, inches, meters, or hands.
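
To make the point about units concrete, here is a short sketch in base R. The first part expresses one height in four units using standard conversion factors. The second part shows a quick range check on a hypothetical rate variable (the values are made up), which is often enough to tell whether a variable is stored as a decimal or a percentage.

```r
# The same height, expressed in four different units.
height_m     <- 1.75
height_in    <- height_m / 0.0254  # 1 inch = 0.0254 meters
height_ft    <- height_in / 12     # 1 foot = 12 inches
height_hands <- height_in / 4      # 1 hand = 4 inches
round(c(meters = height_m, feet = height_ft,
        inches = height_in, hands = height_hands), 2)

# A quick range check often reveals the units of a variable.
# Hypothetical values: a rate stored as a decimal (0 to 1),
# not as a percentage (0 to 100).
turnout_rate <- c(0.42, 0.55, 0.61, 0.38)
range(turnout_rate)
stored_as_decimal <- max(turnout_rate) <= 1
stored_as_decimal
```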



County Health Rankings

  • county: Name of each county.

  • state: Abbreviation for each state.

  • fips: Unique 5 digit ID for each US county.

Health

  • life_expectancy: average number of years a person can expect to live, based on current age-specific death rates.

  • poor_fair_health: percentage of surveyed residents who rated their health as either poor or fair on a 5 point scale (1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent).

  • days_poor_physical_health: average number of physically unhealthy days that residents reported in past 30 days (age-adjusted).

  • days_poor_mental_health: average number of mentally unhealthy days that residents reported in past 30 days (age-adjusted).

  • frequent_phys_distress: percentage of adult respondents reporting 14 or more days of poor physical health per month.

Health Conditions

  • diabetes: percentage of adult respondents aged 20 and above with diagnosed diabetes. (Combines type I and II).

  • obesity: percentage of the adult population (age 20 and older) that reports a body mass index (BMI) greater than or equal to 30 kg/m².

Health Behaviors

  • physical_inactivity: percentage of adults age 20 and over reporting no leisure-time physical activity.

  • exercise_access: percentage of population with adequate access to locations for physical activity.

Health Environment

  • air_pollution: Average daily density of fine particulate matter in micrograms per cubic meter (PM2.5), provided by the Environmental Public Health Tracking Network.

  • For more variable definitions, see their codebook here.

Environmental Racism

  • county: Name of each county.

  • state: Abbreviation for each state.

  • fips: Unique 5 digit ID for each US county.

Main Variables

  • air_pollution: Average daily density of fine particulate matter in micrograms per cubic meter (PM2.5), provided by the Environmental Public Health Tracking Network.

  • pop_black: percentage of population who are Black.

  • poc: dichotomous variable showing whether 30% or more of county residents are Black (“High”) or less than 30% of residents are Black (“Low”).
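
A dichotomous variable like poc is just a continuous variable cut at a threshold. Here is a minimal sketch in base R of that cut, using made-up counties; the column poc_check is a hypothetical name, chosen so it does not overwrite the poc variable already in the dataset.

```r
# Made-up counties with the percentage of residents who are Black.
counties <- data.frame(
  county = c("County A", "County B", "County C"),
  pop_black = c(5.0, 30.0, 62.5)
)

# 30% or more is "High"; less than 30% is "Low".
counties$poc_check <- ifelse(counties$pop_black >= 30, "High", "Low")

# How many counties fall in each group?
table(counties$poc_check)
```

Note that a county at exactly 30% lands in the “High” group, because the definition says 30% or more.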

Covariates

  • urban: dichotomous variable showing whether county is a continuously built-up area with a population of 50,000 or more (“Urbanized Area”) or not (“Rural Area”), according to the US Census.

  • party: dichotomous variable, showing whether residents voted primarily for Clinton (“Democrat”) or Trump (“Republican”) in the 2016 election.

  • wealth: dichotomous variable, showing whether the median household income in a county is above the national median (“Above Median”) or below the national median (“Below Median”).

Polarization and Health

  • county: Name of each county.

  • state: Abbreviation for each state.

  • fips: Unique 5 digit ID for each US county.

  • region: Census region (e.g., New England).

Polarization

  • gap_2016: the difference in the percentage of votes for Democrats vs. Republicans in a county in the 2016 presidential election. 0 indicates an even split, meaning low polarization, because neither group is particularly marginalized. 50 indicates one party is much larger than the other (75%-25% split), meaning strong polarization, because one group is marginalized. 90 indicates one party is almost completely dominant (95%-5% split), meaning extreme polarization, because one group is a very tiny minority.

  • polarization: dichotomous variable showing whether there is a 50% or greater difference in votes for Democrats vs. Republicans in a given county (“High”), or less than a 50% difference in votes (“Low”).
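
The arithmetic behind these two variables can be sketched in a few lines of base R. This is an illustration under assumptions, not the exact procedure used to build the dataset: the columns pct_dem and pct_rep are hypothetical, the values are made up, and the two shares are assumed to sum to 100.

```r
# Made-up two-party vote shares (in percent) for three counties.
votes <- data.frame(
  county  = c("County A", "County B", "County C"),
  pct_dem = c(50, 75, 5),
  pct_rep = c(50, 25, 95)
)

# The gap is the size of the difference between the two shares.
# An even split gives 0, a 75-25 split gives 50, a 95-5 split gives 90.
votes$gap <- abs(votes$pct_dem - votes$pct_rep)

# A gap of 50 or more counts as "High" polarization.
votes$polarization_check <- ifelse(votes$gap >= 50, "High", "Low")
votes
```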

Social Capital

  • social_associations: number of membership associations in a county, per 10,000 residents. This includes civic organizations, bowling centers, golf clubs, fitness centers, sports organizations, religious organizations, political organizations, labor organizations, business organizations, and professional organizations.

  • social_capital: dichotomous variable showing whether the rate of social associations in a county is above the national median, building high social capital (“High”) or below the national median, building lower social capital (“Low”).

Health

  • poor_fair_health: percentage of surveyed residents who rated their health as either poor or fair on a 5 point scale (1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent). (Note: higher = worse health implications.)

  • days_poor_physical_health: average number of physically unhealthy days that residents reported in past 30 days (age-adjusted).

  • days_poor_mental_health: average number of mentally unhealthy days that residents reported in past 30 days (age-adjusted).

  • frequent_phys_distress: percentage of adult respondents reporting 14 or more days of poor physical health per month.

COVID-19 rates for Massachusetts Cities

This is a time-series dataset with many variables. The ones I most recommend are defined below:

  • city: Name of Massachusetts county subdivision (essentially a city, in census-speak).

  • unit: Unique ID code for each county subdivision.

COVID Spread

  • date: date of COVID-19 rate reporting.

  • positivity_rate: COVID-19 test positivity rate, meaning the percentage of all tests run in the last two weeks that came back positive for COVID-19. One of the better measures of COVID-19 spread, since it adjusts for testing capacity.
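
Here is a small sketch in base R of why a positivity rate adjusts for testing capacity. The columns tests_total and tests_positive are hypothetical and the counts are made up; the dataset already provides positivity_rate, so this only illustrates the calculation.

```r
# Made-up two-week test counts for two cities.
tests <- data.frame(
  city           = c("City A", "City B"),
  tests_total    = c(2000, 500),
  tests_positive = c(30, 20)
)

# Positivity rate = positive tests as a percentage of all tests run.
tests$positivity_rate_check <- 100 * tests$tests_positive / tests$tests_total
tests
```

City A has more positive tests in raw numbers, but City B has the higher positivity rate, because far fewer tests were run there.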

Vulnerability

  • svi: CDC’s Social Vulnerability Index, which measures residents’ overall vulnerability during crisis from 0 (low) to 1 (high). Incorporates 4 main aspects of vulnerability, which are available in the dataset as subindices from 0 to 1. These indices were averaged up to the city level from the census tract level.

  • svi_socioeconomic: CDC’s subindex for socioeconomic status.

  • svi_minority: CDC’s subindex for minority status and language competency.

  • svi_housing_transport: CDC’s subindex for housing and transportation access.

  • svi_household_disability: CDC’s subindex for household membership traits and disability status.

  • Many other variables are sourced from County Health Rankings. See their codebook for variable definitions.

Mobility Measures

  • Each mobility variable is the average daily percent change in mobility in a certain type of area, averaged over the 2 to 11 days prior to each date’s COVID-19 rate reported in this dataset. This window is used because 95% of people show symptoms between 2 and 11 days after exposure. Mobility is estimated using Google Android phone user data, provided by Google Mobility Reports, and averaged to the county subdivision level. See below for types of mobility:

  • residential: average change in mobility in residential areas.

  • grocery_pharmacy: average change in mobility in grocery stores and pharmacies.

  • retail_recreation: average change in mobility in retail stores and recreation areas.

  • parks: average change in mobility at parks.

  • transit: average change in mobility at train stations, metro stations, etc.

  • Social Capital Indices: Measures social capital, the social ties between residents that enable trust and reciprocity, using several dimensions. Each index is measured on a scale from 0 (low) to 1 (high).

  • social_capital: overall social capital, which averages bonding, bridging, and linking social capital indices.

  • bonding: bonding social capital, which measures connectivity to members of the same age group, gender, race, ethnicity, income, education, etc. Measured by proxy in terms of community similarity rates.

  • bridging: bridging social capital, which measures social ties between members of different social groups. Measured by proxy in terms of participation in associations, like unions, parent-teacher associations, religious organizations, etc.

  • linking: linking social capital, which measures ties between residents and officials. Close correlate of trust in government. Measured by proxy using rates of representation and political participation.
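
Returning to the mobility measures in this section: each one averages daily values over a window of 2 to 11 days before the reporting date. Here is a minimal sketch of that windowed average in base R. The daily values are made up, and the column retail_recreation_daily and the helper lagged_mean() are hypothetical names. The dataset already contains the averaged values, so you would not normally need to compute this yourself.

```r
# Made-up daily percent changes in mobility for 14 consecutive days.
daily <- data.frame(
  date = as.Date("2020-09-01") + 0:13,
  retail_recreation_daily = -30 + 0:13   # -30, -29, ..., -17
)

# Average the daily values from max_lag days before the report date
# up to min_lag days before it (both ends included).
lagged_mean <- function(report_date, daily, min_lag = 2, max_lag = 11) {
  in_window <- daily$date >= report_date - max_lag &
    daily$date <= report_date - min_lag
  mean(daily$retail_recreation_daily[in_window])
}

# For a rate reported on Sept 14, the window runs from Sept 3 to Sept 12.
window_average <- lagged_mean(as.Date("2020-09-14"), daily)
window_average
```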

German Solar Power Dataset

Dataset of German Cities, observed over 3 time periods.

  • id: Unique ID code for each German city.

  • muni: Name of German city/district.

Main Conditions

  • solar: number of solar farms built between 2008 and 2011.

  • area: area in square kilometers.

  • pop: population in 2000.

  • land_price: cost of land in 2000 in euros per square kilometer.

  • income: income per taxpayer in 2001 in thousands of euros.

  • unemployment: unemployment rate in 2001.

Social Capital & Politics

  • crime_rate: crime rate in 2007. Common proxy for bonding social capital when survey measures are unavailable.

  • voter_turnout: voter turnout in 2005 elections. Common proxy for bridging social capital.

  • winning_party: percentage of votes won by the ruling party (CDU-CSU coalition) in 2005 elections.

  • green: percentage of votes won by the Green Party in 2005.

Japanese Carbon Emissions

  • year: year of observation, from 2005 to 2017. (Skips 2006, when data was not recorded.)

  • pref: Prefecture name

  • muni: Municipality name

  • muni_code: unique 5-digit id for each municipality.

Emissions

  • emissions: total emissions in kilotons of CO2, produced in that city over the last year. Also recorded by type of emissions in following variables: emissions_manufacturing, emissions_construction_mining, emissions_agr_forest_fish, emissions_industrial_subtotal, emissions_business, emissions_households, emissions_consumer_subtotal, emissions_automobiles, emissions_trucks, emissions_railroad, emissions_boats, emissions_transport_subtotal, and emissions_waste.

  • emissions_1990_2005: change in emissions from 1990 to 2005. Positive number indicates an increase in emissions, while a negative number indicates a decrease in emissions over time.

Demographics

  • pop: population, in number of residents.

  • pop_density: population density, in 1000s of people per square kilometer.

  • income_per_capita: average income in 1000s of yen per 1000 residents.

  • pop_age_65_plus: % of residents age 65 or over.

  • pop_college: % of residents with a college degree.

  • fin_str_index: measure of city government capacity. Higher number indicates greater overall financial strength and budgetary capacity.

Economics

  • value_agr_mill: agricultural output in millions of yen per capita

  • value_manuf_mill: manufacturing output in millions of yen per capita

  • value_commerce_mill: commercial output in millions of yen per capita

  • For other variable definitions, please just shoot me an email.

Japanese Prefectural Elections

This is a recompilation of Yusaku Horiuchi and Ryota Natori’s terrific dataset. You can find their original version here, and check out their work! My lightly edited version of the dataset applies English-language names and combines data over time into one file. I’ve summarized important variables below.

  • candidate_name: name of candidate for prefectural assembly.

  • candidate_age: age of candidate, in years.

  • candidate_party: candidate’s party. There are many different parties, but only the 6 largest ones are written in English. I recommend filtering to just these parties: “Liberal Democratic Party”, “Democratic Party of Japan”, “Independent”, “Komeito Party”, “Japanese Socialist Party”, and “Japanese Communist Party”.

  • candidate_status: is the candidate an incumbent, a new candidate, have they run previously, or have they previously held office? Some other categories may be in here; best to filter to just categories that you understand.

  • candidate_rank: ranking of each candidate, in terms of votes earned.

  • election_year: year of election.

  • pref_code: unique code from 01 to 47 for each of Japan’s prefectures.

  • muni_code: unique 5-digit code for each of Japan’s municipalities

  • district_code: unique 5-digit code for each electoral district. Sometimes, wards of a city count as their own district. Sometimes, entire clumps of multiple towns count as one district. If you’re looking for city-level electoral outcomes, best to check out our other dataset on Japan.

  • muni_name: name of municipality, in Japanese.

  • election_district: name of electoral district, in Japanese.

  • district_name: name of electoral district, in Japanese.

  • eligible_voters: total number of people who were eligible to vote.

  • turnout: number of people who actually turned out to vote.

  • valid_votes_cast: number of people who, after turning out to vote, actually did cast a valid ballot.

  • num_candidates: number of candidates who ran in each district.

  • votes_received: number of votes received by that candidate in that district.

  • votes_total: number of votes received by that candidate at large for their office. Some races involve multiple districts, I believe.
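
The recommendation above to keep only the six parties written in English can be sketched in base R as follows. The candidates and vote counts are made up, and “Some Other Party” is a placeholder for any party label outside the six. In practice, check the exact spelling of the party names in your copy of the file with table() before filtering.

```r
# Made-up candidates; "Some Other Party" stands in for any other label.
candidates <- data.frame(
  candidate_name  = c("Candidate 1", "Candidate 2", "Candidate 3", "Candidate 4"),
  candidate_party = c("Liberal Democratic Party", "Independent",
                      "Some Other Party", "Japanese Communist Party"),
  votes_received  = c(12000, 8000, 1500, 4000)
)

# The six parties named in the codebook entry for candidate_party.
main_parties <- c("Liberal Democratic Party", "Democratic Party of Japan",
                  "Independent", "Komeito Party",
                  "Japanese Socialist Party", "Japanese Communist Party")

# Keep only the rows whose party is one of the six.
main <- candidates[candidates$candidate_party %in% main_parties, ]

# Count candidates per party in the filtered data.
table(main$candidate_party)
```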

Japanese Municipalities Dataset

This dataset contains a variety of data at the municipal level in Japan, from 2000 to 2017. This includes electoral outcomes, social capital, disaster outcomes, socioeconomics, and governance outcomes.

Below, I summarize key variables.

  • year: year of observation.

  • pref: prefecture that municipality is in

  • muni: municipality where election took place

  • muni_code: unique 5-digit identifier for each municipality.

  • turnout_rate: percentage of eligible voters in the city who turned out to vote, represented as a decimal (0 to 1).

Disaster Experiments

  • by_border: is that city on the Fukushima side of the border (“Fukushima”), on the other side (“Outside Fukushima”), or some other municipality (“Other”)?

  • by_exclusion_zone: is that city in the Fukushima exclusion zone (“Exclusion Zone”), just outside the Fukushima exclusion zone (“Outside exclusion zone”), or some other municipality (“Other”)?

  • by_tsunami: was that city struck by the tsunami (“Hit”), not hit but just next door (“Not Hit”), or some other municipality (“Other”)?

Electoral Outcomes

  • vote_LDP: percentage of vote won by Liberal Democratic Party, the most powerful party in Japan over the last six decades.

  • vote_DPJ: percentage of vote won by Democratic Party of Japan, the main opposition party. Their name has changed a few times in recent years.

  • vote_Komeito: percentage of vote won by Komeito Party, a conservative minority party closely aligned with Soka Gakkai, a new religious movement.

  • vote_JSP: percentage of vote won by Japanese Socialist Party.

  • vote_LDP_coalition: percentage of vote won by the LDP coalition of both the Liberal Democratic Party and Komeito Party candidates.

  • vote_JCP: percentage of vote won by Japanese Communist Party.

  • vote_Ind: percentage of vote won by Independent candidates.

Social Capital Indices

  • These indices measure social capital, the social ties between residents that enable trust and reciprocity, using several dimensions. Each index is measured on a scale from 0 (low) to 1 (high).

  • social_capital: overall social capital, which averages bonding, bridging, and linking social capital indices.

  • bonding: bonding social capital, which measures connectivity to members of the same age group, gender, race, ethnicity, income, education, etc. Measured by proxy in terms of community similarity rates.

  • bridging: bridging social capital, which measures social ties between members of different social groups. Measured by proxy in terms of participation in associations, like unions, parent-teacher associations, religious organizations, etc.

  • linking: linking social capital, which measures ties between residents and officials. Close correlate of trust in government. Measured by proxy using rates of representation and political participation.

  • vulnerability: social vulnerability index, which measures overall vulnerability of residents during crisis, incorporating factors like socioeconomic status, gender and household structure, ethnicity and discrimination, unemployment, etc. Measured where 0 is low and 1 indicates high vulnerability.
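
Since the overall social_capital index is described as an average of the three subindices, here is a minimal sketch of that calculation in base R. It assumes a simple unweighted mean, which the codebook does not spell out, and uses made-up values; social_capital_check is a hypothetical column name.

```r
# Made-up subindex values (each on a 0 to 1 scale) for two municipalities.
cities <- data.frame(
  muni     = c("City A", "City B"),
  bonding  = c(0.6, 0.3),
  bridging = c(0.5, 0.2),
  linking  = c(0.4, 0.7)
)

# Assumed here: the overall index is the unweighted mean of the three subindices.
cities$social_capital_check <-
  rowMeans(cities[, c("bonding", "bridging", "linking")])
cities
```

Comparing a column like this against the social_capital variable in the real data is a quick way to test whether the unweighted-mean assumption holds.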

Demographics

  • pop: population of each municipality

  • pop_density: population in 1000s of residents per square kilometer

  • pop_over_age_65: % residents over age 65

  • pop_women: % residents who are women

  • unemployment: unemployment rate per 1,000 residents.

  • income_thous_per_capita: income per capita, in 1000s of yen.

Government Spending and Priorities

  • financial_strength_index: financial strength index for each city, measuring the city’s financial and budgetary capacity. Higher is better.

  • rev_to_exp_ratio: ratio of revenues to expenditures for each municipality’s budget. Basic measure of municipal budgetary health. For a ratio, values above 1 mean you’ve got more revenue than spending, and values below 1 mean you’ve got more spending than revenue. Check the range of this variable before interpreting it.

  • shelters_per_capita: emergency shelters in 1000s of yen per capita.

  • exp_fire_per_capita: spending on fire departments and emergency services in 1000s of yen per capita.

  • exp_dis_relief_per_capita: spending on disaster relief in 1000s of yen per capita.

  • exp_public_works_per_capita: spending on public works and infrastructure in 1000s of yen per capita.

  • exp_social_assistance_per_capita: spending on social assistance and welfare in 1000s of yen per capita.

If you have specific questions about another measure not listed, please shoot me an email.



4. Possible Questions

Finally, you need to design your own research question! I’d recommend choosing an outcome that demonstrates considerable variation, and a single explanatory variable that you have reason to believe might affect that outcome. Then, I’d think about what other variables might also explain the variation in that outcome. Be sure to include those in your analysis!
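
As a rough sketch of that workflow in base R: check that the outcome varies, then model it with one explanatory variable plus another variable that might also explain it. The variable names come from the Polarization and Health codebook above, but the values are made up for illustration, and a linear model via lm() is just one reasonable choice, not the required method.

```r
# Made-up values for eight counties, using variable names from the
# Polarization and Health codebook.
toy <- data.frame(
  days_poor_mental_health = c(3.1, 3.8, 4.2, 4.9, 3.5, 4.6, 5.0, 3.3),
  gap_2016                = c(10, 35, 50, 70, 20, 60, 80, 15),
  social_associations     = c(14, 11, 9, 8, 13, 10, 7, 12)
)

# Step 1: does the outcome vary much? Look at its spread.
summary(toy$days_poor_mental_health)
sd(toy$days_poor_mental_health)

# Step 2: outcome ~ explanatory variable + other variable that might matter.
fit <- lm(days_poor_mental_health ~ gap_2016 + social_associations, data = toy)
summary(fit)
```

With real data you would swap in your own outcome, explanatory variable, and controls, and you would have far more than eight rows.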

For anyone starting out, here are a few questions that might interest you! You can certainly design your own, pick one of these, or use these as a launching pad to new research questions.

County Health Rankings

  • Do rural voters prefer the Republican Party or the Democratic Party? Where? Has this changed over time?

  • Does race have a linear effect on votes cast for specific parties, or an exponential effect on votes cast for specific parties?

  • Does politics affect our physical or mental health, or health behaviors, like drinking, exercise, etc.?

Emissions Dataset

  • What (other than population) helped towns in Japan reduce their CO2 emissions over time?

  • Did a particular community trait seem to reduce CO2 emissions in some years but not others? Why?

  • The LDP is Japan’s main conservative, pro-business party. We might expect that towns that vote more for the LDP might not reduce their greenhouse gas emissions quite as much… Is this true? Has this effect changed from the early 2000s to the late 2010s?

Japan Cities Dataset

  • How did the shutdown of nuclear power plants after Fukushima affect the income per capita of towns hosting nuclear power plants?

  • How much did the tsunami and the Fukushima exclusion zone affect towns’ social capital?

  • Japan got hit by a recession - its very own housing bubble - in 1992, stymieing local economies for decades. Did municipalities’ budgets suffer before/after 1992, or did they bounce back?

Please feel free to reach out if you have questions.