Once you’ve decided which data sources and years you need, you can start to select geographies and tables (i.e. variables) for download. Note that with Social Explorer, if you need multiple years, you’ll have to download each year separately.
As an example, let’s say that I’m interested in the relationship between unemployment rates and lack of sleep. Do counties with higher rates of people getting insufficient sleep also have higher unemployment rates?
The first thing I need to decide is what geographic level I need – it’s right there in my question: I’m looking for county-level data, which I can get from the ACS, as well as Social Explorer’s “Health Data Release.” A couple of things to note: to make my comparison, I should select the latest year available for both datasets (i.e. 2019, because 2020 ACS data has not been released yet). Also, because I’m using ACS supplemental data, I won’t be able to get data on counties with fewer than 20,000 people. This is fine, but it is a limitation of the data that I’ll need to discuss in my write-up.
First, let’s get the ACS data. Remember, you’ll want to select the ACS Supplemental data because we want to include as many counties as possible, even if the estimates come with a larger margin of error. If the margin of error on the supplemental data were too high, you could fall back on the regular ACS 1-year estimates, which are more accurate but are only available for populations of 65,000+.
Once you choose your dataset, you’ll need to choose your geographies. In our case, we want counties, and we want all counties. If we were only interested in particular counties or particular states, we could select those at this point in order to reduce the size of our final dataset.
After I select the geographies I need, I can move on to selecting tables. These tables usually contain multiple variables (this is why they are called “tables” and not “variables”). You can scroll through the list of available tables to see what is available, but sometimes it is easiest to use the keyword search if you know the words you’re looking for. In our case, I know I need “unemployment” so I’ll just use the keyword search feature.
Once you’ve selected the tables you need and added them to your “table selections,” you can click “Show results” to explore what you’ve chosen. I always recommend taking a look at the results to make sure the variable(s) you’re interested in are actually contained in the table(s) you’ve selected.
You’ll notice two things on the Results screen:
It looks like our variables are rows, and our observations (i.e. counties) are columns. This is not in a tidy format, but don’t fear! When you download your data, it will be reversed (i.e. observations will be in rows, and variables will be in columns).
If you look through the data, you’ll probably see some empty cells – these are areas where the population is lower than 20,000 people, or areas where the margin of error was too high.
Once you’re satisfied with your results, click over to “Data Download” in the top navigation tabs in order to see the options for downloading our data.
Since we’re currently working in R, we’ll ignore the Stata-specific options for now. Make sure you download both the CSV file (which contains the data) and the data dictionary file (which contains the metadata about your data).
Finally, once you’ve downloaded the data and dictionary, move them into the proper folders in your project folder (I’ve created both a “data” and a “documentation” folder within my project folder).
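If you’d rather set up those folders and move the files from within R itself, here’s a quick sketch using base R functions. The download path below is just an assumption – adjust it to wherever your browser saves files:

# Create the project subfolders (no error if they already exist)
dir.create("data", showWarnings = FALSE)          # for the CSV data files
dir.create("documentation", showWarnings = FALSE) # for the data dictionaries
# file.rename() moves a file; the "~/Downloads/" source path is an example
file.rename("~/Downloads/R12788791_SL050.csv", "data/R12788791_SL050.csv")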
In order to get my data on insufficient sleep, I’ll also go through the above process for the “Health Data 2019” dataset, again choosing “all counties” as my geography and then searching for the keyword “sleep.”
Now that you’ve downloaded your data and moved it into the appropriate directories, we can open R and get to work on cleaning up our data. First, load the tidyverse package, as we’ll be using a number of its functions.
library(tidyverse)
Next we’ll load our data, giving our tibble an appropriate name, and take a look at the data.
acs2019 <- read_csv("data/R12788791_SL050.csv")
head(acs2019)
## # A tibble: 6 x 29
## Geo_FIPS Geo_NAME Geo_QNAME Geo_NATION Geo_REGION Geo_DIVISION Geo_STATE
## <chr> <chr> <chr> <chr> <lgl> <lgl> <chr>
## 1 01001 Autauga… Autauga … 00 NA NA 01
## 2 01003 Baldwin… Baldwin … 00 NA NA 01
## 3 01005 Barbour… Barbour … 00 NA NA 01
## 4 01007 Bibb Co… Bibb Cou… 00 NA NA 01
## 5 01009 Blount … Blount C… 00 NA NA 01
## 6 01011 Bullock… Bullock … 00 NA NA 01
## # … with 22 more variables: Geo_COUNTY <chr>, Geo_CS <lgl>, Geo_PLACE <lgl>,
## # Geo_CD <lgl>, Geo_SDE <lgl>, Geo_SDS <lgl>, Geo_SDU <lgl>, Geo_PUMA <lgl>,
## # Geo_ANRC <lgl>, Geo_AIAANAHHL <lgl>, Geo_CSA <lgl>, Geo_MSAMSA <lgl>,
## # Geo_PCOP <lgl>, Geo_MD <lgl>, Geo_CNECATA <lgl>, Geo_NECATA <lgl>,
## # Geo_PC <lgl>, Geo_ND <lgl>, Geo_UA <lgl>, SE_T017_001 <dbl>,
## # SE_T017_002 <dbl>, SE_T017_003 <dbl>
You’ll see that a number of the variables appear to be mostly (if not entirely) NAs. Every file you download from Social Explorer will have a lot of these, because Social Explorer uses a template that contains every possible geographic variable (county, tract, city, etc.), no matter which geographies you selected. So here’s our first cleanup step – we’ll pare this down to make our data easier to look at by keeping only the columns we need with select(). While we’re at it, we should also rename the variable columns to be more human-readable (remember, the pattern for rename() is new_name = old_name).
acs2019_cleaning <- acs2019 %>%
select(Geo_FIPS, Geo_QNAME, SE_T017_001, SE_T017_002, SE_T017_003) %>%
rename(labor_force = SE_T017_001,
employed = SE_T017_002,
unemployed = SE_T017_003)
head(acs2019_cleaning)
## # A tibble: 6 x 5
## Geo_FIPS Geo_QNAME labor_force employed unemployed
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 01001 Autauga County, Alabama 26598 25560 1038
## 2 01003 Baldwin County, Alabama 101561 96012 5549
## 3 01005 Barbour County, Alabama 8713 8332 381
## 4 01007 Bibb County, Alabama 10677 10302 375
## 5 01009 Blount County, Alabama 25274 24076 1198
## 6 01011 Bullock County, Alabama NA NA NA
Ah, this looks much better! Where did I find out what these variables are so that I can rename them? Why, in the data dictionary of course! This is why you should always make sure you have a data dictionary of some kind.
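As an aside, if you didn’t want to pick columns by hand, you could drop every column that is entirely NA programmatically. Here’s a sketch using select() with where() (available in dplyr 1.0+) – not required here, but handy with wide templates like Social Explorer’s:

# Keep only columns that contain at least one non-NA value
acs2019 %>%
  select(where(~ !all(is.na(.x))))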
If we had multiple years’ worth of data, we’d also want to make sure we have a year variable to properly distinguish our observations. You can create this in each individual tibble before you merge them.
acs2019_cleaning <- acs2019_cleaning %>%
mutate(year = 2019)
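For example, if you had prepared a 2018 download the same way (acs2018_cleaning here is hypothetical), you could give it year = 2018 and then stack the two tibbles with bind_rows():

# acs2018_cleaning is a hypothetical tibble prepared just like acs2019_cleaning
acs_unemployment <- bind_rows(acs2018_cleaning, acs2019_cleaning)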
Finally, we might need to create some new variables based on what we downloaded. In this case, we have the total number of unemployed people and the total number of people in the labor force, so we’ll need to use both of these to create a new variable for the unemployment rate.
acs2019_cleaned <- acs2019_cleaning %>%
mutate(unemployment_rate = unemployed/labor_force)
head(acs2019_cleaned)
## # A tibble: 6 x 7
## Geo_FIPS Geo_QNAME labor_force employed unemployed year unemployment_ra…
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Autauga Count… 26598 25560 1038 2019 0.0390
## 2 01003 Baldwin Count… 101561 96012 5549 2019 0.0546
## 3 01005 Barbour Count… 8713 8332 381 2019 0.0437
## 4 01007 Bibb County, … 10677 10302 375 2019 0.0351
## 5 01009 Blount County… 25274 24076 1198 2019 0.0474
## 6 01011 Bullock Count… NA NA NA 2019 NA
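Notice the NAs for Bullock County – one of the counties under the 20,000-person threshold mentioned earlier. If you decided to drop those rows before analysis, tidyr’s drop_na() would do it (keep in mind this shrinks your dataset, which you should note in your write-up):

# Drop rows with missing values in the specified columns (a sketch, optional)
acs2019_cleaned %>%
  drop_na(labor_force, unemployed)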
Now that we have our ACS data on unemployment cleaned up, let’s prepare the other data we downloaded on insufficient sleep. This time, we’ll do all of the prep in one longer chain of piped functions.
health2019 <- read_csv("data/R12789309_SL050.csv")
health2019_cleaned <- health2019 %>%
select(Geo_FIPS,
Geo_QNAME,
ORG_RAW_HD2019_031_IS_v1,
ORG_RAW_HD2019_031_IS_v2,
ORG_RAW_HD2019_031_IS_v3) %>%
mutate(year = 2019) %>%
rename(percent_insuff_sleep = ORG_RAW_HD2019_031_IS_v1,
percent_insuff_sleep_95CI_low = ORG_RAW_HD2019_031_IS_v2,
percent_insuff_sleep_95CI_high = ORG_RAW_HD2019_031_IS_v3)
head(health2019_cleaned)
## # A tibble: 6 x 6
## Geo_FIPS Geo_QNAME percent_insuff_s… percent_insuff_s… percent_insuff_… year
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Autauga C… 35.9 35.0 36.8 2019
## 2 01003 Baldwin C… 33.3 32.5 34.1 2019
## 3 01005 Barbour C… 38.6 37.7 39.5 2019
## 4 01007 Bibb Coun… 38.1 37.1 39.2 2019
## 5 01009 Blount Co… 35.9 34.8 37.1 2019
## 6 01011 Bullock C… 45.0 43.9 46.2 2019
We can now merge our two datasets using the join functions from dplyr. Because our two tibbles already have columns in common, we can let left_join() decide which columns to join on (dplyr will print a message telling you which columns it used).
merged_data <- left_join(acs2019_cleaned, health2019_cleaned)
head(merged_data)
## # A tibble: 6 x 10
## Geo_FIPS Geo_QNAME labor_force employed unemployed year unemployment_ra…
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Autauga … 26598 25560 1038 2019 0.0390
## 2 01003 Baldwin … 101561 96012 5549 2019 0.0546
## 3 01005 Barbour … 8713 8332 381 2019 0.0437
## 4 01007 Bibb Cou… 10677 10302 375 2019 0.0351
## 5 01009 Blount C… 25274 24076 1198 2019 0.0474
## 6 01011 Bullock … NA NA NA 2019 NA
## # … with 3 more variables: percent_insuff_sleep <dbl>,
## # percent_insuff_sleep_95CI_low <dbl>, percent_insuff_sleep_95CI_high <dbl>
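Here the shared columns are Geo_FIPS, Geo_QNAME, and year. If you’d rather be explicit about the join (which also suppresses that message), you can name the join columns yourself:

# Equivalent join, with the matching columns spelled out
merged_data <- left_join(acs2019_cleaned, health2019_cleaned,
                         by = c("Geo_FIPS", "Geo_QNAME", "year"))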
Now that we have our two datasets cleaned and merged, we can begin to explore them and craft our graphs or other outputs we need.
First, we might want to look at some histograms to see how our two variables of interest, unemployment_rate and percent_insuff_sleep, are distributed:
ggplot(merged_data, aes(x = unemployment_rate)) + geom_histogram(binwidth = 0.005)
We can see that our unemployment data is right-skewed: most counties hover around the national unemployment rate and generally stay under 10%, but a few outlier counties have much higher unemployment rates.
ggplot(merged_data, aes(x = percent_insuff_sleep)) + geom_histogram(binwidth = 0.5)
In contrast, our percent_insuff_sleep variable appears to be fairly normally distributed.
Now we can create a scatter plot to check for a relationship between our predictor variable (we’ll use percent_insuff_sleep) and our response variable (we’ll use unemployment_rate). Because we have a large number of data points, we’ll set the alpha argument so that our points are a little more transparent.
ggplot(merged_data, aes(x = percent_insuff_sleep, y = unemployment_rate)) +
geom_point(alpha = 0.5)
Because most of our unemployment numbers are bunched up in the lower part of the graph, we might want to apply a logarithmic transformation to the y-axis to see if the relationship is any clearer.
ggplot(merged_data, aes(x = percent_insuff_sleep, y = unemployment_rate)) +
geom_point(alpha = 0.5) + scale_y_log10()
Looking at our graph, it appears that there might be a weak positive correlation between our two variables.
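If you want a number to go with that visual impression, one quick check is the correlation coefficient; setting use = "complete.obs" drops the counties with missing values:

# Pearson correlation between the two variables, ignoring NAs
cor(merged_data$percent_insuff_sleep, merged_data$unemployment_rate,
    use = "complete.obs")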
What else could you do with this data now?
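One natural next step would be a simple linear model. A minimal sketch (lm() drops rows with missing values by default):

# Regress unemployment rate on the percent of people getting insufficient sleep
model <- lm(unemployment_rate ~ percent_insuff_sleep, data = merged_data)
summary(model)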
While Social Explorer is great for accessing Census and American Community Survey data, there are a ton of other interesting datasets available! Take the time to browse through some of them, especially if you’re getting stuck thinking of a topic!
Choosing your dataset might require a balancing act between your ideal unit of observation, your ideal timescale, and the exact variables you need. Remember to be flexible where you can, and understand the limitations of the data you’re using.
Even data from a tool like Social Explorer, which has already done a lot of work to make its data more usable, will require some wrangling, cleanup, and merging. Remember to budget time for this – in general, you’ll spend 80% of your time finding, understanding, and cleaning your data, and only about 20% of your time creating your final analyses, tables, and visualizations.
Finally, if you have any questions about Social Explorer, or any other data you need for your research, you can reach out to me at go/ryan/!
Social Explorer Basics
When using Social Explorer as a Middlebury student, faculty member, or staff member, you should always use the go/ link, go/socialexplorer/. You will be asked to sign in to our proxy server using your Middlebury credentials. If you’re on the correct version, you should see “Professional plan provided by Middlebury College” in the upper-right corner of your browser window. There is a free version of Social Explorer, but it does not provide access to all of the data.
Properly signed in through the proxy server
You’ll see several options for using Social Explorer on the left-hand side of the page. Please take the time to explore them (especially the maps!), but right now we’re only concerned with the “Tables” option.
Social Explorer main menu
After you click on “Tables,” you’ll see a list of all of the different data sources you can download tables from. As you can see, Social Explorer has a wide variety of data! We’re interested in demographic and economic data about the United States, by county, so we’re most likely going to find that in the Census data. However, we have two options for getting Census data! Social Explorer has data from both the Decennial Census and the American Community Survey – what’s the difference?
Sidebar: Decennial Census vs. American Community Survey
The Census Bureau, which is part of the Department of Commerce, is best known for conducting what is now called the “Decennial Census” (historically it went by other names, such as the “Census of Population and Housing”). That’s not the only large-scale survey the Census Bureau conducts, though: it is also responsible for over 130 other surveys, including surveys it conducts for other government departments.
The Census Bureau has a legal requirement to count the total number of people in the United States and submit those population totals to the President by December 31st of any year ending in zero. This is the origin of, and mandate behind, the Decennial Census.

However, it became clear early in the history of the Census that there were other important pieces of information the bureau could collect about the population, both demographic (age, race, ethnicity, marital status, etc.) and economic (employment status, income, occupation, etc.). Over time, as the population grew and the Bureau developed more advanced statistical methods, the majority of these “extra” questions were shifted onto what became known as the “long form.” This longer version of the Census questionnaire was only sent to a sample of the population and was then used to compute estimates.

The original purpose of the Census, counting the total population, stuck around in the “short form,” which (in theory) must be answered by every single person residing in the United States at the time the survey is conducted. The longer versions of the Census, whether the long-form Decennial Census or the annual American Community Survey, are answered by only a sample of the population; their results are therefore estimates that carry sampling error and are less reliable for small populations.
If you are looking for the more detailed (and less accurate) long-form data from the pre-ACS years, you’ll have to choose the “Sample Data” option in the Social Explorer drop-downs instead of the “100% Data” option (the short-form survey that everyone is supposed to answer).
Social Explorer does make this easier, though: you can also simply choose the “Social Explorer Tables,” which the site builds from both the 100% and Sample datasets, depending on what is appropriate.
If you are looking for the more detailed data after the ACS began in 2005, you’ll need to choose between the 1-year, 3-year, 5-year, and “supplemental” estimates (the 3-year estimates were discontinued after the release of the 2013 estimates and were replaced with the “supplemental” estimates the following year). The guidance offered by the Census Bureau can help you decide which estimates are appropriate for your use, but you will almost always have to balance currency, accuracy, and precision when deciding which estimates to use.