What is Social Explorer

Social Explorer is a library database that has demographic and other social data from the United States and around the world. The most widely used data on Social Explorer are the data from the United States Decennial Census and the American Community Survey. These data cover the period from 1790 - present (usually with a 1-2 year delay as new data are processed). Prior to the introduction of the American Community Survey in 2005, data were all from the Decennial Census, meaning you can only get data for Census years (e.g. 1790, 1800, 1810, etc.). After the introduction of the American Community Survey in 2005, data have been released on an annual basis, in both 1-year and 5-year estimates (more on this later).

Census geography levels

Census geography levels

When to use Social Explorer

You should use Social Explorer for your Census data needs if:

  • You need to use data aggregated at one of the Census geographies
  • You need to use data from one of the many pre-made Census tables
  • You want to compare data across different time periods

You should not use Social Explorer for your data needs if:

  • You need to use microdata, e.g. data where each row of your data represents an individual or a household, rather than a Census geography (such as tract, congressional district, county, state, etc.)
  • You need to look at the relationship between multiple variables that are not already available in prepared Census tables
  • You just need a quick set of recent statistics about a particular geography


If you have questions about whether Social Explorer is right for your research, or what alternative you should use instead, you can reach out to me or make an appointment to discuss.

Social Explorer Basics

When using Social Explorer as a Middlebury student, faculty, or staff, you should always use the go/ link, go/socialexplorer/. You will be asked to sign in to our proxy server, using your Middlebury credentials. If you’re on the correct version, you should see “Professional plan provided by Middlebury College” in the upper-right corner of your browser window. There is a free version of Social Explorer, but it does not provide access to all of the data.

Properly signed in through the proxy server

Properly signed in through the proxy server

You’ll see several options for using Social Explorer on the left-hand side of the page. Please take the time to explore them (especially the maps!), but right now we’re only concerned with the “Tables” option.

Social Explorer main menu

Social Explorer main menu

After you click on “Tables,” you’ll see a list of all of the different data sources you can download tables from. As you can see, Social Explorer has a wide variety of data! We’re interested in demographic and economic data about the United States, by county, so we’re most likely going to find that in the Census data. However, we have two options for getting Census data! Social Explorer has data from both the Decennial Census and the American Community Survey – what’s the difference?

Choosing Data to Download

Once you’ve decided which data sources and years you need, you can start to select geographies and tables (i.e. variables) for download. With Social Explorer, you’ll have to download each year you need separately, if you need multiple years.

As an example, let’s say that I’m interested in looking at the relationship between unemployment rates and lack of sleep. Do counties with higher rates of people getting insufficient sleep also have a higher unemployment rates?

The first thing I need to decide is what geographic level I need – it’s right there in my question: I’m looking for county-level data, which I can get from the ACS, as well as Social Explorer’s “Health Data Release.” A couple of things to note: to make my comparison, I should select data from the latest available years that match for both of these (i.e. 2019, because 2020 ACS data has not been released yet). Also, because I’m using ACS supplemental data, I won’t be able to get data on counties that have fewer than 20,000 people. This is fine, and is a limitation of the data that I’ll need to discuss in my write-up.

First, let’s get the ACS data. You’ll want to select ACS Supplemental data, remember, because we want to get as many counties as possible, even if there is a larger margin of error in the estimates we get. If the margin of error on the supplemental data was too high, you could use the regular ACS 1-year estimates, which have greater accuracy, but are only available for populations of 65,000+.

Once you choose your dataset, you’ll need to choose your geographies. In our case, we want counties, and we want all counties. If we were only interested in particular counties or particular states, we could select those at this point in order to reduce the size of our final dataset.

After I select the geographies I need, I can move on to selecting tables. These tables usually contain multiple variables (this is why they are called “tables” and not “variables”). You can scroll through the list of available tables to see what is available, but sometimes it is easiest to use the keyword search if you know the words you’re looking for. In our case, I know I need “unemployment” so I’ll just use the keyword search feature.

Once you’ve selected the tables you need and added them to your “table selections,” you can click “Show results” to explore what you’ve chosen. I always recommend taking a look at the results to make sure the variable(s) you’re interested in are actually contained in the table(s) you’ve selected.

You’ll notice two things on the Results screen:

  1. It looks like our variables are rows, and our observations (i.e. counties) are columns. This is not in a tidy format, but don’t fear! When you download your data, it will be reversed (i.e. observations will be in rows, and variables will be in columns).

  2. If you look through the data, you’ll probably see some empty cells – these are areas where the population is lower than 20,000 people, or areas where the margin of error was too high.

Once you’re satisfied with your results, click over to “Data Download” in the top navigation tabs in order to see the options for downloading our data.

Since we’re currently working in R, we’ll ignore the STATA-specific options for now. You want to make sure you download both the CSV file (which contains the data) and the data dictionary file (which contains the metadata about your data).

Finally, once you’ve downloaded the data and dictionary, move them into the proper folders in your project folder (I’ve created both a “data” and a “documentation” folder within my project folder).

In order to get my data on insufficient sleep, I’ll also go through the above process for the “Health Data 2019” dataset, searching for the keyword “sleep” after, again, choosing “all counties” as my geography.

Data Cleaning

Now that you’ve downloaded your data and moved it into the appropriate directories, we can open R and get to work on cleaning up our data. First you’ll want to make sure to load the tidyverse library as we’ll be using a number of functions from this package.

library(tidyverse)

Next we’ll load our data, giving our tibble an appropriate name, and take a look at the data.

acs2019 <- read_csv("data/R12788791_SL050.csv")
head(acs2019)
## # A tibble: 6 x 29
##   Geo_FIPS Geo_NAME Geo_QNAME Geo_NATION Geo_REGION Geo_DIVISION Geo_STATE
##   <chr>    <chr>    <chr>     <chr>      <lgl>      <lgl>        <chr>    
## 1 01001    Autauga… Autauga … 00         NA         NA           01       
## 2 01003    Baldwin… Baldwin … 00         NA         NA           01       
## 3 01005    Barbour… Barbour … 00         NA         NA           01       
## 4 01007    Bibb Co… Bibb Cou… 00         NA         NA           01       
## 5 01009    Blount … Blount C… 00         NA         NA           01       
## 6 01011    Bullock… Bullock … 00         NA         NA           01       
## # … with 22 more variables: Geo_COUNTY <chr>, Geo_CS <lgl>, Geo_PLACE <lgl>,
## #   Geo_CD <lgl>, Geo_SDE <lgl>, Geo_SDS <lgl>, Geo_SDU <lgl>, Geo_PUMA <lgl>,
## #   Geo_ANRC <lgl>, Geo_AIAANAHHL <lgl>, Geo_CSA <lgl>, Geo_MSAMSA <lgl>,
## #   Geo_PCOP <lgl>, Geo_MD <lgl>, Geo_CNECATA <lgl>, Geo_NECATA <lgl>,
## #   Geo_PC <lgl>, Geo_ND <lgl>, Geo_UA <lgl>, SE_T017_001 <dbl>,
## #   SE_T017_002 <dbl>, SE_T017_003 <dbl>

You’ll see that a number of the variables appear to be mostly (if not all) NAs. In every file you download from Social Explorer, you’ll have a lot of these, as Social Explorer uses a template, which contains every possible geographic variable (e.g. county, tract, city, etc.), no matter which geographies you selected. So here’s our first clean up step – we’ll need to pare this down to make our data easier to look at by select-ing only the columns we need. While we’re at it, we should also rename the variable columns to be more human-readable (remember the pattern for rename() is new_name = old_name).

acs2019_cleaning <- acs2019 %>%
  select(Geo_FIPS, Geo_QNAME, SE_T017_001, SE_T017_002, SE_T017_003) %>%
  rename(labor_force = SE_T017_001,
         employed = SE_T017_002,
         unemployed = SE_T017_003)

head(acs2019_cleaning)
## # A tibble: 6 x 5
##   Geo_FIPS Geo_QNAME               labor_force employed unemployed
##   <chr>    <chr>                         <dbl>    <dbl>      <dbl>
## 1 01001    Autauga County, Alabama       26598    25560       1038
## 2 01003    Baldwin County, Alabama      101561    96012       5549
## 3 01005    Barbour County, Alabama        8713     8332        381
## 4 01007    Bibb County, Alabama          10677    10302        375
## 5 01009    Blount County, Alabama        25274    24076       1198
## 6 01011    Bullock County, Alabama          NA       NA         NA

Ah, this looks much better! Where did I find out what these variables are so that I can rename them? Why, in the data dictionary of course! This is why you should always make sure you have a data dictionary of some kind.

If we had multiple years worth of data, we’d also want to make sure we have a year variable to properly distinguish our observations. You can create this in each individual tibble before you merge them.

acs2019_cleaning <- acs2019_cleaning %>%
  mutate(year = 2019)

Finally, we might need to create some new variables based on what we downloaded. In this case, we have the total number of unemployed people, and the total number of people in the labor force, so we’ll need use both of these to create a new variable for unemployment rate.

acs2019_cleaned <- acs2019_cleaning %>%
  mutate(unemployment_rate = unemployed/labor_force)

head(acs2019_cleaned)
## # A tibble: 6 x 7
##   Geo_FIPS Geo_QNAME      labor_force employed unemployed  year unemployment_ra…
##   <chr>    <chr>                <dbl>    <dbl>      <dbl> <dbl>            <dbl>
## 1 01001    Autauga Count…       26598    25560       1038  2019           0.0390
## 2 01003    Baldwin Count…      101561    96012       5549  2019           0.0546
## 3 01005    Barbour Count…        8713     8332        381  2019           0.0437
## 4 01007    Bibb County, …       10677    10302        375  2019           0.0351
## 5 01009    Blount County…       25274    24076       1198  2019           0.0474
## 6 01011    Bullock Count…          NA       NA         NA  2019          NA

Now that we have our ACS data on unemployment cleaned up, let’s prepare the other data we have downloaded on insuffcient sleep. This time, we’ll just do all of the prep in one longer chain of piped functions.

health2019 <- read_csv("data/R12789309_SL050.csv")

health2019_cleaned <- health2019 %>%
  select(Geo_FIPS, 
         Geo_QNAME, 
         ORG_RAW_HD2019_031_IS_v1, 
         ORG_RAW_HD2019_031_IS_v2, 
         ORG_RAW_HD2019_031_IS_v3) %>%
  mutate(year = 2019) %>%
  rename(percent_insuff_sleep = ORG_RAW_HD2019_031_IS_v1,
         percent_insuff_sleep_95CI_low = ORG_RAW_HD2019_031_IS_v2,
         percent_insuff_sleep_95CI_high = ORG_RAW_HD2019_031_IS_v3)

head(health2019_cleaned)
## # A tibble: 6 x 6
##   Geo_FIPS Geo_QNAME  percent_insuff_s… percent_insuff_s… percent_insuff_…  year
##   <chr>    <chr>                  <dbl>             <dbl>            <dbl> <dbl>
## 1 01001    Autauga C…              35.9              35.0             36.8  2019
## 2 01003    Baldwin C…              33.3              32.5             34.1  2019
## 3 01005    Barbour C…              38.6              37.7             39.5  2019
## 4 01007    Bibb Coun…              38.1              37.1             39.2  2019
## 5 01009    Blount Co…              35.9              34.8             37.1  2019
## 6 01011    Bullock C…              45.0              43.9             46.2  2019

We can now merge our two datasets, using the “join” commands from dplyr. Because our two tibbles have columns already in common, we can let the join function decide what column to use in our join.

merged_data <- left_join(acs2019_cleaned, health2019_cleaned)

head(merged_data)
## # A tibble: 6 x 10
##   Geo_FIPS Geo_QNAME labor_force employed unemployed  year unemployment_ra…
##   <chr>    <chr>           <dbl>    <dbl>      <dbl> <dbl>            <dbl>
## 1 01001    Autauga …       26598    25560       1038  2019           0.0390
## 2 01003    Baldwin …      101561    96012       5549  2019           0.0546
## 3 01005    Barbour …        8713     8332        381  2019           0.0437
## 4 01007    Bibb Cou…       10677    10302        375  2019           0.0351
## 5 01009    Blount C…       25274    24076       1198  2019           0.0474
## 6 01011    Bullock …          NA       NA         NA  2019          NA     
## # … with 3 more variables: percent_insuff_sleep <dbl>,
## #   percent_insuff_sleep_95CI_low <dbl>, percent_insuff_sleep_95CI_high <dbl>

Exploring your data

Now that we have our two datasets cleaned and merged, we can begin to explore them and craft our graphs or other outputs we need.

First, we might want to look at some histograms to see how our two variables of interest, unemployment_rate and percent_insuff_sleep are distributed:

ggplot(merged_data, aes(x = unemployment_rate)) + geom_histogram(binwidth = 0.005)

We can see that our unemployment data is skewed, with most of the counties hovering around the national unemployment rate, and generally staying under 10%, but with a few counties with much higher unemployment rates as outliers.

ggplot(merged_data, aes(x = percent_insuff_sleep)) + geom_histogram(binwidth = 0.5)

In contrast, the distribution of our percent_insuff_sleep variable appears to be fairly normally distributed.

Now we can create a scatter plot to check for a relationship between our predictor variable (we’ll use percent_insuff_sleep) and our response variable (we’ll use unemployment_rate). Because we have a large number of data points, we’ll set the alpha argument so that our points are a little more transparent.

ggplot(merged_data, aes(x = percent_insuff_sleep, y = unemployment_rate)) +
  geom_point(alpha = 0.5)

Because most of our unemployment numbers are bunched up in the lower part of the graph, we might want to do a logarithmic transformation to the y-axis to see if the relationship is any clearer.

ggplot(merged_data, aes(x = percent_insuff_sleep, y = unemployment_rate)) +
  geom_point(alpha = 0.5) + scale_y_log10()

Looking at our graph, it appears that there might be a small, fairly weak positive correlation between our two variables.

What else could you do with this data now?

Reminders

  1. While Social Explorer is great for accessing Census and American Community Survey data, there are a ton of other interesting datasets available! Take the time to browse through some of them, especially if you’re getting stuck thinking of a topic!

  2. Choosing your dataset might require a balancing act between your ideal unit of observation, your ideal timescale, and the exact variables need. Remember to be flexible where you can, and understand the limitations of the data you’re using.

  3. Even data from a tool like Social Explorer, which has had a lot of work done to it to make it more usable, will require some wrangling, clean up, and merging. Remember to budget time for this – in general, you’ll spend 80% of your time finding, understanding, and cleaning your data, and only about 20% of your time creating your final analyses, tables, and visualizations.

  4. Finally, if you have any questions about Social Explorer, or any other data you need for your research, you can reach out to me at go/ryan/!