Once you’ve decided which data sources and years you need, you can start to select geographies and tables (i.e. variables) for download. Note that with Social Explorer, if you need multiple years, you’ll have to download each year separately.
As an example, let’s say that I’m interested in the relationship between unemployment rates and lack of sleep. Do counties with higher rates of people getting insufficient sleep also have higher unemployment rates?
The first thing I need to decide is what geographic level I need – it’s right there in my question: I’m looking for county-level data, which I can get from the ACS, as well as Social Explorer’s “Health Data Release.” A couple of things to note: to make my comparison, I should select the latest year available for both datasets (i.e. 2019, because 2020 ACS data has not been released yet). Also, because I’m using ACS supplemental data, I won’t be able to get data on counties with fewer than 20,000 people. This is fine, but it is a limitation of the data that I’ll need to discuss in my write-up.
First, let’s get the ACS data. Remember, you’ll want to select the ACS Supplemental data because we want to include as many counties as possible, even if the estimates come with a larger margin of error. If the margin of error on the supplemental data were too high, you could fall back on the regular ACS 1-year estimates, which are more accurate but are only available for populations of 65,000+.
Once you choose your dataset, you’ll need to choose your geographies. In our case, we want counties, and we want all counties. If we were only interested in particular counties or particular states, we could select those at this point in order to reduce the size of our final dataset.
After I select the geographies I need, I can move on to selecting tables. These tables usually contain multiple variables (this is why they are called “tables” and not “variables”). You can scroll through the list of available tables to see what is available, but sometimes it is easiest to use the keyword search if you know the words you’re looking for. In our case, I know I need “unemployment” so I’ll just use the keyword search feature.
Once you’ve selected the tables you need and added them to your “table selections,” you can click “Show results” to explore what you’ve chosen. I always recommend taking a look at the results to make sure the variable(s) you’re interested in are actually contained in the table(s) you’ve selected.
You’ll notice two things on the Results screen:
It looks like our variables are rows, and our observations (i.e. counties) are columns. This is not in a tidy format, but don’t fear! When you download your data, it will be reversed (i.e. observations will be in rows, and variables will be in columns).
If you look through the data, you’ll probably see some empty cells – these are areas where the population is lower than 20,000 people, or areas where the margin of error was too high.
Once you’re satisfied with your results, click over to “Data Download” in the top navigation tabs in order to see the options for downloading our data.
Since we’re currently working in R, we’ll ignore the Stata-specific options for now. Make sure you download both the CSV file (which contains the data) and the data dictionary file (which contains the metadata about your data).
Finally, once you’ve downloaded the data and dictionary, move them into the proper folders in your project folder (I’ve created both a “data” and a “documentation” folder within my project folder).
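If you’d rather set up those folders and move the files from within R itself, here’s a quick sketch using base R functions. The download path below is just an assumption – adjust it to wherever your browser saves files:

# Create the project subfolders (no error if they already exist)
dir.create("data", showWarnings = FALSE)          # for the CSV data files
dir.create("documentation", showWarnings = FALSE) # for the data dictionaries
# file.rename() moves a file; the "~/Downloads/" source path is an example
file.rename("~/Downloads/R12788791_SL050.csv", "data/R12788791_SL050.csv")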
In order to get my data on insufficient sleep, I’ll also go through the above process for the “Health Data 2019” dataset, again choosing “all counties” as my geography and then searching for the keyword “sleep.”
Now that you’ve downloaded your data and moved it into the appropriate directories, we can open R and get to work on cleaning up our data. First, load the tidyverse package, as we’ll be using a number of its functions.
library(tidyverse)
Next we’ll load our data, giving our tibble an appropriate name, and take a look at the data.
acs2019 <- read_csv("data/R12788791_SL050.csv")
head(acs2019)
## # A tibble: 6 x 29
## Geo_FIPS Geo_NAME Geo_QNAME Geo_NATION Geo_REGION Geo_DIVISION Geo_STATE
## <chr> <chr> <chr> <chr> <lgl> <lgl> <chr>
## 1 01001 Autauga… Autauga … 00 NA NA 01
## 2 01003 Baldwin… Baldwin … 00 NA NA 01
## 3 01005 Barbour… Barbour … 00 NA NA 01
## 4 01007 Bibb Co… Bibb Cou… 00 NA NA 01
## 5 01009 Blount … Blount C… 00 NA NA 01
## 6 01011 Bullock… Bullock … 00 NA NA 01
## # … with 22 more variables: Geo_COUNTY <chr>, Geo_CS <lgl>, Geo_PLACE <lgl>,
## # Geo_CD <lgl>, Geo_SDE <lgl>, Geo_SDS <lgl>, Geo_SDU <lgl>, Geo_PUMA <lgl>,
## # Geo_ANRC <lgl>, Geo_AIAANAHHL <lgl>, Geo_CSA <lgl>, Geo_MSAMSA <lgl>,
## # Geo_PCOP <lgl>, Geo_MD <lgl>, Geo_CNECATA <lgl>, Geo_NECATA <lgl>,
## # Geo_PC <lgl>, Geo_ND <lgl>, Geo_UA <lgl>, SE_T017_001 <dbl>,
## # SE_T017_002 <dbl>, SE_T017_003 <dbl>
You’ll see that a number of the variables appear to be mostly (if not entirely) NAs. Every file you download from Social Explorer will have a lot of these, because Social Explorer uses a template that contains every possible geographic variable (county, tract, city, etc.), no matter which geographies you selected. So here’s our first cleanup step – we’ll pare this down to make our data easier to look at by keeping only the columns we need with select(). While we’re at it, we should also rename the variable columns to be more human-readable (remember, the pattern for rename() is new_name = old_name).
acs2019_cleaning <- acs2019 %>%
select(Geo_FIPS, Geo_QNAME, SE_T017_001, SE_T017_002, SE_T017_003) %>%
rename(labor_force = SE_T017_001,
employed = SE_T017_002,
unemployed = SE_T017_003)
head(acs2019_cleaning)
## # A tibble: 6 x 5
## Geo_FIPS Geo_QNAME labor_force employed unemployed
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 01001 Autauga County, Alabama 26598 25560 1038
## 2 01003 Baldwin County, Alabama 101561 96012 5549
## 3 01005 Barbour County, Alabama 8713 8332 381
## 4 01007 Bibb County, Alabama 10677 10302 375
## 5 01009 Blount County, Alabama 25274 24076 1198
## 6 01011 Bullock County, Alabama NA NA NA
Ah, this looks much better! Where did I find out what these variables are so that I can rename them? Why, in the data dictionary of course! This is why you should always make sure you have a data dictionary of some kind.
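As an aside, if you didn’t want to pick columns by hand, you could drop every column that is entirely NA programmatically. Here’s a sketch using select() with where() (available in dplyr 1.0+) – not required here, but handy with wide templates like Social Explorer’s:

# Keep only columns that contain at least one non-NA value
acs2019 %>%
  select(where(~ !all(is.na(.x))))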
If we had multiple years’ worth of data, we’d also want to make sure we have a year variable to properly distinguish our observations. You can create this in each individual tibble before you merge them.
acs2019_cleaning <- acs2019_cleaning %>%
mutate(year = 2019)
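For example, if you had prepared a 2018 download the same way (acs2018_cleaning here is hypothetical), you could give it year = 2018 and then stack the two tibbles with bind_rows():

# acs2018_cleaning is a hypothetical tibble prepared just like acs2019_cleaning
acs_unemployment <- bind_rows(acs2018_cleaning, acs2019_cleaning)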
Finally, we might need to create some new variables based on what we downloaded. In this case, we have the total number of unemployed people and the total number of people in the labor force, so we’ll need to use both of these to create a new variable for the unemployment rate.
acs2019_cleaned <- acs2019_cleaning %>%
mutate(unemployment_rate = unemployed/labor_force)
head(acs2019_cleaned)
## # A tibble: 6 x 7
## Geo_FIPS Geo_QNAME labor_force employed unemployed year unemployment_ra…
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Autauga Count… 26598 25560 1038 2019 0.0390
## 2 01003 Baldwin Count… 101561 96012 5549 2019 0.0546
## 3 01005 Barbour Count… 8713 8332 381 2019 0.0437
## 4 01007 Bibb County, … 10677 10302 375 2019 0.0351
## 5 01009 Blount County… 25274 24076 1198 2019 0.0474
## 6 01011 Bullock Count… NA NA NA 2019 NA
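Notice the NAs for Bullock County – one of the counties under the 20,000-person threshold mentioned earlier. If you decided to drop those rows before analysis, tidyr’s drop_na() would do it (keep in mind this shrinks your dataset, which you should note in your write-up):

# Drop rows with missing values in the specified columns (a sketch, optional)
acs2019_cleaned %>%
  drop_na(labor_force, unemployed)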
Now that we have our ACS data on unemployment cleaned up, let’s prepare the other data we downloaded on insufficient sleep. This time, we’ll do all of the prep in one longer chain of piped functions.
health2019 <- read_csv("data/R12789309_SL050.csv")
health2019_cleaned <- health2019 %>%
select(Geo_FIPS,
Geo_QNAME,
ORG_RAW_HD2019_031_IS_v1,
ORG_RAW_HD2019_031_IS_v2,
ORG_RAW_HD2019_031_IS_v3) %>%
mutate(year = 2019) %>%
rename(percent_insuff_sleep = ORG_RAW_HD2019_031_IS_v1,
percent_insuff_sleep_95CI_low = ORG_RAW_HD2019_031_IS_v2,
percent_insuff_sleep_95CI_high = ORG_RAW_HD2019_031_IS_v3)
head(health2019_cleaned)
## # A tibble: 6 x 6
## Geo_FIPS Geo_QNAME percent_insuff_s… percent_insuff_s… percent_insuff_… year
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Autauga C… 35.9 35.0 36.8 2019
## 2 01003 Baldwin C… 33.3 32.5 34.1 2019
## 3 01005 Barbour C… 38.6 37.7 39.5 2019
## 4 01007 Bibb Coun… 38.1 37.1 39.2 2019
## 5 01009 Blount Co… 35.9 34.8 37.1 2019
## 6 01011 Bullock C… 45.0 43.9 46.2 2019
We can now merge our two datasets using the join functions from dplyr. Because our two tibbles already have columns in common, we can let left_join() decide which columns to join on (dplyr will print a message telling you which columns it used).
merged_data <- left_join(acs2019_cleaned, health2019_cleaned)
head(merged_data)
## # A tibble: 6 x 10
## Geo_FIPS Geo_QNAME labor_force employed unemployed year unemployment_ra…
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Autauga … 26598 25560 1038 2019 0.0390
## 2 01003 Baldwin … 101561 96012 5549 2019 0.0546
## 3 01005 Barbour … 8713 8332 381 2019 0.0437
## 4 01007 Bibb Cou… 10677 10302 375 2019 0.0351
## 5 01009 Blount C… 25274 24076 1198 2019 0.0474
## 6 01011 Bullock … NA NA NA 2019 NA
## # … with 3 more variables: percent_insuff_sleep <dbl>,
## # percent_insuff_sleep_95CI_low <dbl>, percent_insuff_sleep_95CI_high <dbl>
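Here the shared columns are Geo_FIPS, Geo_QNAME, and year. If you’d rather be explicit about the join (which also suppresses that message), you can name the join columns yourself:

# Equivalent join, with the matching columns spelled out
merged_data <- left_join(acs2019_cleaned, health2019_cleaned,
                         by = c("Geo_FIPS", "Geo_QNAME", "year"))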
Now that we have our two datasets cleaned and merged, we can begin to explore them and craft our graphs or other outputs we need.
First, we might want to look at some histograms to see how our two variables of interest, unemployment_rate and percent_insuff_sleep, are distributed:
ggplot(merged_data, aes(x = unemployment_rate)) + geom_histogram(binwidth = 0.005)
We can see that our unemployment data is right-skewed: most counties hover around the national unemployment rate and generally stay under 10%, but a few outlier counties have much higher unemployment rates.
ggplot(merged_data, aes(x = percent_insuff_sleep)) + geom_histogram(binwidth = 0.5)
In contrast, our percent_insuff_sleep variable appears to be fairly normally distributed.
Now we can create a scatter plot to check for a relationship between our predictor variable (we’ll use percent_insuff_sleep) and our response variable (we’ll use unemployment_rate). Because we have a large number of data points, we’ll set the alpha argument so that our points are a little more transparent.
ggplot(merged_data, aes(x = percent_insuff_sleep, y = unemployment_rate)) +
geom_point(alpha = 0.5)
Because most of our unemployment numbers are bunched up in the lower part of the graph, we might want to apply a logarithmic transformation to the y-axis to see if the relationship is any clearer.
ggplot(merged_data, aes(x = percent_insuff_sleep, y = unemployment_rate)) +
geom_point(alpha = 0.5) + scale_y_log10()
Looking at our graph, it appears that there might be a weak positive correlation between our two variables.
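If you want a number to go with that visual impression, one quick check is the correlation coefficient; setting use = "complete.obs" drops the counties with missing values:

# Pearson correlation between the two variables, ignoring NAs
cor(merged_data$percent_insuff_sleep, merged_data$unemployment_rate,
    use = "complete.obs")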
What else could you do with this data now?
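One natural next step would be a simple linear model. A minimal sketch (lm() drops rows with missing values by default):

# Regress unemployment rate on the percent of people getting insufficient sleep
model <- lm(unemployment_rate ~ percent_insuff_sleep, data = merged_data)
summary(model)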
While Social Explorer is great for accessing Census and American Community Survey data, there are a ton of other interesting datasets available! Take the time to browse through some of them, especially if you’re getting stuck thinking of a topic!
Choosing your dataset might require a balancing act between your ideal unit of observation, your ideal timescale, and the exact variables you need. Remember to be flexible where you can, and understand the limitations of the data you’re using.
Even data from a tool like Social Explorer, which has already done a lot of work to make its data more usable, will require some wrangling, cleanup, and merging. Remember to budget time for this – in general, you’ll spend 80% of your time finding, understanding, and cleaning your data, and only about 20% of your time creating your final analyses, tables, and visualizations.
Finally, if you have any questions about Social Explorer, or any other data you need for your research, you can reach out to me at go/ryan/!
Social Explorer Basics
When using Social Explorer as a Middlebury student, faculty member, or staff member, you should always use the go/ link, go/socialexplorer/. You will be asked to sign in to our proxy server using your Middlebury credentials. If you’re on the correct version, you should see “Professional plan provided by Middlebury College” in the upper-right corner of your browser window. There is a free version of Social Explorer, but it does not provide access to all of the data.
Properly signed in through the proxy server
You’ll see several options for using Social Explorer on the left-hand side of the page. Please take the time to explore them (especially the maps!), but right now we’re only concerned with the “Tables” option.
Social Explorer main menu
After you click on “Tables,” you’ll see a list of all of the different data sources you can download tables from. As you can see, Social Explorer has a wide variety of data! We’re interested in demographic and economic data about the United States, by county, so we’re most likely going to find that in the Census data. However, we have two options for getting Census data! Social Explorer has data from both the Decennial Census and the American Community Survey – what’s the difference?
Sidebar: Decennial Census vs. American Community Survey
The Census Bureau, which is part of the Department of Commerce, is best known for conducting what is now called the “Decennial Census” (historically it went by other names, such as the “Census of Population and Housing”). That’s not the only large-scale survey the Census Bureau conducts, though: it is also responsible for over 130 other surveys, including surveys it conducts for other government departments.
The Census Bureau has a legal requirement to count the total number of people in the United States and submit those population totals to the President by December 31st of any year ending in zero. This is the origin of, and mandate behind, the Decennial Census.

However, it became clear early in the history of the Census that there were other important pieces of information the bureau could collect about the population, both demographic (age, race, ethnicity, marital status, etc.) and economic (employment status, income, occupation, etc.). Over time, as the population grew and the Bureau developed more advanced statistical methods, the majority of these “extra” questions were shifted onto what became known as the “long form.” This longer version of the Census questionnaire was only sent to a sample of the population and was then used to compute estimates.

The original purpose of the Census, counting the total population, stuck around in the “short form,” which (in theory) must be answered by every single person residing in the United States at the time the survey is conducted. The longer versions of the Census, whether the long-form Decennial Census or the annual American Community Survey, are answered by only a sample of the population; their results are therefore estimates that carry sampling error and are less reliable for small populations.
If you are looking for the more detailed (and less accurate) long-form data from the pre-ACS years, you’ll have to choose the “Sample Data” option in the Social Explorer drop-downs instead of the “100% Data” option (the short-form survey that everyone is supposed to answer).
Social Explorer does make this easier, though: you can also simply choose the “Social Explorer Tables,” which the site builds from both the 100% and Sample datasets, depending on what is appropriate.
If you are looking for the more detailed data after the ACS began in 2005, you’ll need to choose between the 1-year, 3-year, 5-year, and “supplemental” estimates (the 3-year estimates were discontinued after the release of the 2013 estimates and were replaced with the “supplemental” estimates the following year). The guidance offered by the Census Bureau can help you decide which estimates are appropriate for your use, but you will almost always have to balance currency, accuracy, and precision when deciding which estimates to use.