Projects: Choosing a Dataset for your Project
Codebook for Available Datasets
For your first coding project, you will need a dataset! But how should you go about choosing your dataset and your variables? I’ve laid out some steps below that will help you get started.
1. What datasets are available?
I have collected a series of good datasets that you can choose from. These include datasets on politics, environmental policy, economics, and health. If you’re taking my class, you can pick one of these or choose your own, but you will have to obtain your own dataset and run it by me first.
- Carbon Emissions dataset
- County Health Rankings dataset
- Environmental Racism dataset
- Polarization and Health dataset
- COVID-19 rates for Massachusetts Cities
- German Solar Power dataset
- Japanese Prefectural Elections dataset
- Japanese Municipalities dataset
These datasets are all stored in the Files panel of this RStudio.Cloud session.
2. What is the structure of my dataset?
What does a row in my dataset mean? Is it a city? A county? A city in a specific year?
| Name | Rows | Filename |
|---|---|---|
| County Health Rankings | US Counties in 2020 | county_dataset.csv |
| Environmental Racism | US Counties in 2020 | environmental_health.csv |
| Polarization and Health | US Counties in 2016 | polarization_and_health.csv |
| COVID-19 rates for MA Cities | MA City-Weeks in 2020 | ma_data.csv |
| Carbon Emissions | Japanese City-Years | jp_emissions.csv |
| German Solar Power Dataset | German Cities in 2011 | germany_data.csv |
| Japanese Prefectural Elections | Japanese Candidates | japanese_prefectural_elections.csv |
| Japanese Municipalities Dataset | Japanese City-Years | japan_muni_elections.csv |
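If you are not sure what a row represents, one quick check is whether an ID column alone identifies each row, or whether you need the ID plus a year. The sketch below is in Python purely for illustration (the same idea carries over to R); the three sample rows are invented, and is_unique_key is a hypothetical helper name, not something shipped with these datasets.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a city-year file such as jp_emissions.csv.
# These three rows are invented for illustration only.
SAMPLE = """muni_code,year,emissions
01100,2005,120.5
01100,2007,118.2
01202,2005,40.1
"""

def is_unique_key(rows, columns):
    """Return True if the given columns identify each row exactly once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return all(n == 1 for n in counts.values())

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# If the city ID repeats, a row is not simply "a city".
city_is_row = is_unique_key(rows, ["muni_code"])
# If city ID plus year is unique, a row is "a city in a specific year".
city_year_is_row = is_unique_key(rows, ["muni_code", "year"])

print("One row per city?", city_is_row)
print("One row per city-year?", city_year_is_row)
```

With the real files, you would read from the CSV in the Files panel instead of the inline sample.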
3. What variables are in my dataset?
Most datasets have codebooks, or descriptions of variables on a website somewhere. It is vitally important that you know the exact meaning of your variables, especially what units they are measured in. Height means a lot of different things when measured in feet, inches, meters, or hands.
County Health Rankings
- county: Name of each county.
- state: Abbreviation for each state.
- fips: Unique 5 digit ID for each US county.
Health
- life_expectancy: average number of years a person can expect to live, based on current age-specific death rates.
- poor_fair_health: percentage of surveyed residents who rated their health as either poor or fair on a 5 point scale (1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent).
- days_poor_physical_health: average number of physically unhealthy days that residents reported in past 30 days (age-adjusted).
- days_poor_mental_health: average number of mentally unhealthy days that residents reported in past 30 days (age-adjusted).
- frequent_phys_distress: percentage of adult respondents reporting 14 or more days of poor physical health per month.
Health Conditions
- diabetes: percentage of adult respondents aged 20 and above with diagnosed diabetes. (Combines type I and II.)
- obesity: percentage of the adult population (age 20 and older) that reports a body mass index (BMI) greater than or equal to 30 kg/m².
Health Behaviors
- physical_inactivity: percentage of adults age 20 and over reporting no leisure-time physical activity.
- exercise_access: percentage of population with adequate access to locations for physical activity.
Health Environment
- air_pollution: Average daily density of fine particulate matter in micrograms per cubic meter (PM2.5), provided by the Environmental Public Health Tracking Network.

For more variable definitions, see their codebook here.
Environmental Racism
- county: Name of each county.
- state: Abbreviation for each state.
- fips: Unique 5 digit ID for each US county.
Main Variables
- air_pollution: Average daily density of fine particulate matter in micrograms per cubic meter (PM2.5), provided by the Environmental Public Health Tracking Network.
- pop_black: percentage of population who are Black.
- poc: dichotomous variable showing whether 30% or more of county residents are Black (“High”) or less than 30% of residents are Black (“Low”).
Covariates
- urban: dichotomous variable showing whether county is a continuously built-up area with a population of 50,000 or more (“Urbanized Area”) or not (“Rural Area”), according to the US Census.
- party: dichotomous variable, showing whether residents voted primarily for Clinton (“Democrat”) or Trump (“Republican”) in the 2016 election.
- wealth: dichotomous variable, showing whether the median household income in a county is above the national average (“Above Median”) or below the national average (“Below Median”).
Polarization and Health
- county: Name of each county.
- state: Abbreviation for each state.
- fips: Unique 5 digit ID for each US county.
- region: Census region (e.g. New England).
Polarization
- gap_2016: the difference in the percentage of votes for Democrats vs. Republicans in a county in the 2016 presidential election. 0 indicates an even split, meaning low polarization, because no one group is particularly marginalized. 50 indicates one party’s share is much greater than the other’s (75%-25% split), meaning strong polarization, because one group is marginalized. 90 indicates one party is almost completely dominant (95%-5% split), meaning extreme polarization, because one group is a very tiny minority.
- polarization: dichotomous variable showing whether there is a 50% or greater difference in votes for Democrats vs. Republicans in a given county (“High”), or less than a 50% difference in votes (“Low”).
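To make the arithmetic behind these two variables concrete, here is a minimal Python sketch. It assumes the gap is the absolute difference between the two parties’ vote percentages, which matches the 75%-25% example above; the function names vote_gap and polarization_label are invented for illustration and are not part of the dataset.

```python
def vote_gap(dem_pct, rep_pct):
    """Absolute difference between two parties' vote percentages.

    Assumption: this is how gap_2016 was built; check the file itself.
    """
    return abs(dem_pct - rep_pct)

def polarization_label(gap, threshold=50):
    """Label a county 'High' at a gap of 50 points or more, else 'Low'."""
    return "High" if gap >= threshold else "Low"

# A 75%-25% split gives a 50-point gap, which counts as "High".
lopsided_gap = vote_gap(75, 25)
lopsided_label = polarization_label(lopsided_gap)

# An even split gives a gap of 0, which counts as "Low".
even_gap = vote_gap(50, 50)
even_label = polarization_label(even_gap)

print(lopsided_gap, lopsided_label)
print(even_gap, even_label)
```

Note that the gap is the same whichever party is ahead, so it measures how lopsided a county is, not which way it leans.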
Social Capital
- social_associations: number of membership associations in a county, per 10,000 residents. This includes civic organizations, bowling centers, golf clubs, fitness centers, sports organizations, religious organizations, political organizations, labor organizations, business organizations, and professional organizations.
- social_capital: dichotomous variable showing whether the rate of social associations in a county is above the national median, building high social capital (“High”) or below the national median, building lower social capital (“Low”).
Health
- poor_fair_health: percentage of surveyed residents who rated their health as either poor or fair on a 5 point scale (1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent). (Note: higher = worse health implications.)
- days_poor_physical_health: average number of physically unhealthy days that residents reported in past 30 days (age-adjusted).
- days_poor_mental_health: average number of mentally unhealthy days that residents reported in past 30 days (age-adjusted).
- frequent_phys_distress: percentage of adult respondents reporting 14 or more days of poor physical health per month.
COVID-19 rates for Massachusetts Cities
Time-series dataset with many variables; the most recommended ones are defined below:
- city: Name of Massachusetts county subdivision (essentially a city, in census-speak).
- unit: Unique ID code for each county subdivision.
COVID Spread
- date: date of COVID-19 rate reporting.
- positivity_rate: COVID test positivity rate, meaning the percentage of tests that came back positive for COVID-19 in the last two weeks, out of all total tests run. One of the better measures for COVID-19 spread, since it adjusts for testing capacity.
Vulnerability
- svi: CDC’s Social Vulnerability Index, which measures residents’ overall vulnerability during crisis from 0 (low) to 1 (high). Incorporates 4 main aspects of vulnerability, which are available in the dataset as subindices from 0 to 1. These indices were averaged up to the city level from the census tract level.
- svi_socioeconomic: CDC’s subindex for socioeconomic status.
- svi_minority: CDC’s subindex for minority status and language competency.
- svi_housing_transport: CDC’s subindex for housing and transportation access.
- svi_household_disability: CDC’s subindex for household membership traits and disability status.

Many other variables are sourced from County Health Rankings. See their codebook for variable definitions.
Mobility Measures
Each mobility measure is the average of the daily percent change in mobility in certain areas, averaged over the 2 to 11 days prior to each date’s COVID-19 rate reported in this dataset. This is because 95% of people show symptoms between 2 and 11 days after exposure. Mobility is estimated using Google Android phone user data, provided by Google Mobility Reports, and averaged to the county subdivision level. See below for types of mobility:
- residential: average change in mobility in residential areas.
- grocery_pharmacy: average change in mobility in grocery stores and pharmacies.
- retail_recreation: average change in mobility in retail stores and recreation areas.
- parks: average change in mobility at parks.
- transit: average change in mobility at train stations, metro stations, etc.
Social Capital Indices
These indices measure social capital, the social ties between residents that enable trust and reciprocity, using several dimensions. Each index is measured on a scale from 0 (low) to 1 (high).
- social_capital: overall social capital, which averages bonding, bridging, and linking social capital indices.
- bonding: bonding social capital, which measures connectivity to members of the same age group, gender, race, ethnicity, income, education, etc. Measured by proxy in terms of community similarity rates.
- bridging: bridging social capital, which measures social ties between members of different social groups. Measured by proxy in terms of participation in associations, like unions, parent-teacher associations, religious organizations, etc.
- linking: linking social capital, which measures ties between residents and officials. Close correlate of trust in government. Measured by proxy using rates of representation and political participation.
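Returning to the mobility measures in this dataset: the lagged averaging over the 2 to 11 days before each reporting date can be sketched as follows. This is a rough Python illustration, not the code used to build the dataset. It assumes the window includes both day 2 and day 11 and that missing days are simply skipped; the daily values are invented, and lagged_mobility is a hypothetical helper name.

```python
from datetime import date, timedelta
from statistics import mean

def lagged_mobility(daily_change, report_date, first_lag=2, last_lag=11):
    """Average daily percent change over the days 2 to 11 before report_date.

    daily_change maps a date to that day's percent change in mobility.
    Days missing from the mapping are skipped (an assumption).
    """
    window = []
    for lag in range(first_lag, last_lag + 1):
        day = report_date - timedelta(days=lag)
        if day in daily_change:
            window.append(daily_change[day])
    return mean(window) if window else None

# Invented data: mobility falls one more percentage point each day in March.
daily_change = {date(2020, 3, d): -float(d) for d in range(1, 21)}

# For a rate reported on March 15, the window runs from March 4 to March 13.
avg_change = lagged_mobility(daily_change, date(2020, 3, 15))
print(avg_change)
```

The point of the lag is that the mobility value lines up with when people were plausibly exposed, not with the day their test result was reported.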
German Solar Power Dataset
Dataset of German Cities, observed over 3 time periods.
- id: Unique ID code for each German city.
- muni: Name of German city/district.
Main Conditions
- solar: number of solar farms built between 2008 and 2011.
- area: area in square kilometers.
- pop: population in 2000.
- land_price: cost of land in 2000 in euros per square kilometer.
- income: income per taxpayer in 2001 in thousands of euros.
- unemployment: unemployment rate in 2001.
Social Capital & Politics
- crime_rate: crime rate in 2007. A common proxy for bonding social capital when survey measures are unavailable.
- voter_turnout: voter turnout in 2005 elections. Common proxy for bridging social capital.
- winning_party: percentage of votes won by the ruling party (CDU-CSU coalition) in 2005 elections.
- green: percentage of votes won by the Green Party in 2005.
Japanese Carbon Emissions
- year: year of observation, from 2005 to 2017. (Skips 2006, when data was not recorded.)
- pref: Prefecture name.
- muni: Municipality name.
- muni_code: unique 5-digit id for each municipality.
Emissions
- emissions: total emissions in kilotons of CO2, produced in that city over the last year. Also recorded by type of emissions in the following variables: emissions_manufacturing, emissions_construction_mining, emissions_agr_forest_fish, emissions_industrial_subtotal, emissions_business, emissions_households, emissions_consumer_subtotal, emissions_automobiles, emissions_trucks, emissions_railroad, emissions_boats, emissions_transport_subtotal, and emissions_waste.
- emissions_1990_2005: change in emissions from 1990 to 2005. Positive number indicates an increase in emissions, while a negative number indicates a decrease in emissions over time.
Demographics
- pop: population, in number of residents.
- pop_density: population density, in 1000s of people per square kilometer.
- income_per_capita: average income in 1000s of yen per 1000 residents.
- pop_age_65_plus: % of residents age 65 or over.
- pop_college: % of residents with a college degree.
- fin_str_index: measure of city government capacity. Higher number indicates greater overall financial strength and budgetary capacity.
Economics
- value_agr_mill: agricultural output in millions of yen per capita.
- value_manuf_mill: manufacturing output in millions of yen per capita.
- value_commerce_mill: commercial output in millions of yen per capita.

For other variable definitions, please just shoot me an email.
Japanese Prefectural Elections
This is a recompilation of Yusaku Horiuchi and Ryota Natori’s terrific dataset. You can find their original version here, and check out their work! My lightly edited version of the dataset applies English-language names and combines data over time into one file. I’ve summarized important variables below.
- candidate_name: name of candidate for prefectural assembly.
- candidate_age: age of candidate, in years.
- candidate_party: candidate’s party. Many different parties, but just the 6 largest ones are written in English. I recommend filtering to just the parties “Liberal Democratic Party”, “Democratic Party of Japan”, “Independent”, “Komeito Party”, “Japanese Socialist Party”, and “Japanese Communist Party”.
- candidate_status: is the candidate an incumbent, a new candidate, have they run previously, or have they previously held office? Some other categories may be in here; best to filter to just categories that you understand.
- candidate_rank: ranking of each candidate, in terms of votes earned.
- election_year: year of election.
- pref_code: unique code from 01 to 47 for each of Japan’s prefectures.
- muni_code: unique 5-digit code for each of Japan’s municipalities.
- district_code: unique 5-digit code for each electoral district. Sometimes, wards of a city count as their own district. Sometimes, entire clumps of multiple towns count as one district. If you’re looking for city-level electoral outcomes, best to check out our other dataset on Japan.
- muni_name: name of municipality, in Japanese.
- election_district: name of electoral district, in Japanese.
- district_name: name of electoral district, in Japanese.
- eligible_voters: total number of people who were eligible to vote.
- turnout: number of people who actually turned out to vote.
- valid_votes_cast: number of people who, after turning out to vote, actually did cast a valid ballot.
- num_candidates: number of candidates who ran in each district.
- votes_received: number of votes received by that candidate in that district.
- votes_total: number of votes received by that candidate at large for their office. Some races involve multiple districts, I believe.
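The recommended party filter can be sketched like this. The Python below is only an illustration: the three candidate rows are invented, the helper names keep_main_parties and vote_share are hypothetical, and treating votes_received divided by valid_votes_cast as a candidate’s vote share is my assumption about how those two columns relate. Check the exact party spellings against the file before relying on them.

```python
# The six party labels named in the codebook entry for candidate_party.
MAIN_PARTIES = {
    "Liberal Democratic Party",
    "Democratic Party of Japan",
    "Independent",
    "Komeito Party",
    "Japanese Socialist Party",
    "Japanese Communist Party",
}

# Invented rows for illustration; real rows would come from
# japanese_prefectural_elections.csv.
candidates = [
    {"candidate_name": "A", "candidate_party": "Liberal Democratic Party",
     "votes_received": 6000, "valid_votes_cast": 10000},
    {"candidate_name": "B", "candidate_party": "Independent",
     "votes_received": 3000, "valid_votes_cast": 10000},
    {"candidate_name": "C", "candidate_party": "Hypothetical Local Party",
     "votes_received": 1000, "valid_votes_cast": 10000},
]

def keep_main_parties(rows):
    """Keep only candidates whose party is one of the six named parties."""
    return [row for row in rows if row["candidate_party"] in MAIN_PARTIES]

def vote_share(row):
    """Candidate's share of valid votes, as a decimal from 0 to 1."""
    return row["votes_received"] / row["valid_votes_cast"]

main = keep_main_parties(candidates)
shares = {row["candidate_name"]: vote_share(row) for row in main}
print(shares)
```

Filtering first keeps the smaller parties, whose names are not in English, from cluttering any later comparison by party.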
Japanese Municipalities Dataset
This dataset contains a variety of data at the municipal level in Japan, from 2000 to 2017. This includes electoral outcomes, social capital, disaster outcomes, socioeconomics, and governance outcomes.
Below, I summarize key variables.
- year: year of observation.
- pref: prefecture that municipality is in.
- muni: municipality where election took place.
- muni_code: unique 5 digit identifier for each municipality.
- turnout_rate: percentage of eligible voters in the city who turned out to vote, represented as a decimal (0 to 1).
Disaster Experiments
- by_border: is that city on the Fukushima side of the border (“Fukushima”), on the other side (“Outside Fukushima”), or some other municipality (“Other”)?
- by_exclusion_zone: is that city in the Fukushima exclusion zone (“Exclusion Zone”), just outside the Fukushima exclusion zone (“Outside exclusion zone”), or some other municipality (“Other”)?
- by_tsunami: was that city struck by the tsunami (“Hit”), not hit but just next door (“Not Hit”), or some other municipality (“Other”)?
Electoral Outcomes
- vote_LDP: percentage of vote won by Liberal Democratic Party, the most powerful party in Japan over the last six decades.
- vote_DPJ: percentage of vote won by Democratic Party of Japan, the main opposition party. Their name has changed a few times in recent years.
- vote_Komeito: percentage of vote won by Komeito Party, a conservative minority party closely aligned with Soka Gakkai, the new age religious movement.
- vote_JSP: percentage of vote won by Japanese Socialist Party.
- vote_LDP_coalition: percentage of vote won by the LDP coalition of both the Liberal Democratic Party and Komeito Party candidates.
- vote_JCP: percentage of vote won by Japanese Communist Party.
- vote_Ind: percentage of vote won by Independent candidates.
Social Capital Indices
These indices measure social capital, the social ties between residents that enable trust and reciprocity, using several dimensions. Each index is measured on a scale from 0 (low) to 1 (high).
- social_capital: overall social capital, which averages bonding, bridging, and linking social capital indices.
- bonding: bonding social capital, which measures connectivity to members of the same age group, gender, race, ethnicity, income, education, etc. Measured by proxy in terms of community similarity rates.
- bridging: bridging social capital, which measures social ties between members of different social groups. Measured by proxy in terms of participation in associations, like unions, parent-teacher associations, religious organizations, etc.
- linking: linking social capital, which measures ties between residents and officials. Close correlate of trust in government. Measured by proxy using rates of representation and political participation.
- vulnerability: social vulnerability index, which measures overall vulnerability of residents during crisis, incorporating factors like socioeconomic status, gender and household structure, ethnicity and discrimination, unemployment, etc. Measured where 0 is low and 1 indicates high vulnerability.
Demographics
- pop: population of each municipality.
- pop_density: population in 1000s of residents per square kilometer.
- pop_over_age_65: % residents over age 65.
- pop_women: % residents who are women.
- unemployment: unemployment rate per 1,000 residents.
- income_thous_per_capita: income per capita in 1000s of yen.
Government Spending and Priorities
- financial_strength_index: financial strength index for each city, measuring the city’s financial and budgetary capacity. Higher is better.
- rev_to_exp_ratio: Ratio of Revenues to Expenditures for each municipality’s budget. Basic measure of municipal budgetary health. If positive, you’ve got more revenue than spending. If negative, you’ve got more spending than revenue.
- shelters_per_capita: emergency shelters in 1000s of yen per capita.
- exp_fire_per_capita: spending on fire departments and emergency services in 1000s of yen per capita.
- exp_dis_relief_per_capita: spending on disaster relief in 1000s of yen per capita.
- exp_public_works_per_capita: spending on public works and infrastructure in 1000s of yen per capita.
- exp_social_assistance_per_capita: spending on social assistance and welfare in 1000s of yen per capita.
If you have specific questions about another measure not listed, please shoot me an email.
4. Possible Questions
Finally, you need to design your own research question! I’d recommend choosing an outcome that demonstrates considerable variation, and a single explanatory variable that you have reason to believe might affect that outcome. Then, I’d think about what other variables might also explain the variation in that outcome. Be sure to include those in your analysis!
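One rough way to check whether a candidate outcome shows “considerable variation” is to look at its range and standard deviation before committing to it. The Python sketch below is just an illustration with invented numbers; describe_variation is a hypothetical helper name, and dropping missing values before summarizing is an assumption you may want to revisit for your own data.

```python
from statistics import mean, pstdev

def describe_variation(values):
    """Summarize how much a variable varies, ignoring missing values (None)."""
    clean = [v for v in values if v is not None]
    return {
        "n": len(clean),
        "min": min(clean),
        "max": max(clean),
        "mean": mean(clean),
        "sd": pstdev(clean),  # population standard deviation
    }

# Invented example: an outcome that never changes is a poor choice...
flat = describe_variation([10.0, 10.0, 10.0, None])
# ...while one that differs a lot across rows gives you something to explain.
varied = describe_variation([2.0, 10.0, 18.0])

print(flat)
print(varied)
```

Both invented variables have the same mean, so the mean alone would not tell you which one is worth studying.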
For anyone starting out, here are a few questions that might interest you! You can certainly design your own, pick one of these, or use these as a launching pad to new research questions.
County Health Rankings
Do rural voters prefer the Republican Party or the Democratic Party? Where? Has this changed over time?
Does race have a linear effect on votes cast for specific parties, or an exponential effect on votes cast for specific parties?
Does politics affect our physical or mental health, or health behaviors, like drinking, exercise, etc.?
Emissions Dataset
What (other than population) helped towns in Japan reduce their CO2 emissions over time?
Did a particular community trait seem to reduce CO2 emissions in some years but not others? Why?
The LDP is Japan’s main conservative, pro-business party. We might expect that towns that vote more for the LDP might not reduce their greenhouse gas emissions quite as much… Is this true? Has this effect changed from the early 2000s to the late 2010s?
Japan Cities Dataset
How did the shutdown of nuclear power plants after Fukushima affect the income per capita of towns hosting nuclear power plants?
How much did the tsunami and the Fukushima exclusion zone affect towns’ social capital?
Japan got hit by a recession - its very own housing bubble - in 1992, stymieing local economies for decades. Did municipalities’ budgets suffer before/after 1992, or did they bounce back?
Please feel free to reach out if you have questions.