Lab 2: Introduction to data

Getting Started

Before beginning this lab, make sure you’ve:

created a sub-folder for this assignment within your ECON_210 folder
Within your assignment sub-folder, created another sub-folder called DATA
Downloaded the necessary data from the WordPress site and saved it to your assignment-specific subfolder.
Set your working directly. The address of your working directory should look something like /Users/YourName/OneDrive - Middlebury College/ECON_210/LABS/LAB_2/DATA ;

Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information – the data.

In this lab, we’ll look at data from the American Community Survey (ACS).
The ACS contains a tremendous amount of individual-level data on the respondent’s location, income, education, and more.

We will generate simple graphical and numerical summaries of the ACS data, and explore the relationship between race, education, and income. Since this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

Load packages

In this lab, we will explore and visualize the data using the tidyverse suite of packages.

We’ll also be using the cowplot package to combine plots into a single window. Before you continue, go ahead and run install.packages("cowplot") in your console to make sure it’s ready. You can also load the package now (put library(cowplot) into the chunk above), or load it later before using the plot_grid function. We’ll load both tidyverse and cowplot here.

library(tidyverse)
library(cowplot)

Creating a reproducible lab report

To create your new lab report, in RStudio, go to New File -> R Markdown… You can also download a lab report template from the Labs page of the class WordPress site.

The data

The American Community Survey (ACS) is a monthly survey administered by the U.S. Census Bureau. The survey’s responses inform where over $675 billion in federal and state funds should be distributed each year. The data we’re using today is from the 2018 ACS.
As we mentioned before, the ACS is a huge dataset. Today, we’re only looking at a small subset of variables but the entire ACS can be accessed here if you’re interested.

First, let’s open the ACS data. Note that we’re reading our data in differently than we did last week. In Lab 1, we used the load function. This week we’ll be using the function read_csv from the the tidyverse package.
read_csv allows us to read .csv (comma separated values) files into R.

We’ll assign our data the name ACR_2018 using the <- symbol.

Type the following in your console to load the data (make sure to set your working directory)

ACS_2018 <- read_csv("acs_2018_small.csv")

The data set ACS_2018 that shows up in your workspace is a data matrix, with each row representing an observation and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs. For this data set, each observation is a single individual.

To view the names of the variables, type the command

names(ACS_2018)

This returns the names of the variables in this data frame. Some of these variable names are intuitive- we can guess that age represents and individual’s age, and that educ has something to do with their level of education. Other variable names are harder to make sense of: What does incwage represent? How does educ_ba differ fromeduc_college_4yr?*
And what is a statefip? A codebook contains a description of all the variables found in a dataset. The codebook for the 2018 ACS can be found here.
Remember that our dataframe ACS_2018 is only a small subset of the entire ACS, so there will be a lot of variables in the codebook that don’t appear in ACS_2018.

Note: The codebook doesn’t actually cover the difference between educ_ba and edcu_college_4yr. educ_ba is a categorical variable that takes a value of 1 if a BA is an individual’s highest level of education and 0 if it is not (this means that the individual’s highest level of education could be high school or more than a BA (like a masters). The variable educ_college_4yr is a categorical variable that takes a 1 if an individual has completed a BA, which may or may not be their highest degree).

In Lab 1 we used the National Prisoner Statistics (NPS) dataset to explore the racial gap in incarceration rate in the United States. How can we use the ACS_2018 dataset to examine similar gaps in other areas such as education or income?

Some questions we might ask include:

What is the distribution of income in United States? How does it vary by race?
How does the distribution of educational attainment differ between Blacks and Whites?
What is the relationship between education and income? Is this the relationship between Blacks and Whites?
How does the distribution of income vary across the United States?

Analysis

Let’s start by examining the distribution of income (incwage) of all individuals in the dataset with a histogram. Here are two different ways we can code the same bar graph.

The first is the same notation we used in lab 1; specify the data and aes arguments of ggplot, plot a histogram. The second uses the %>% operator, which tells R to use ACS_2018 in the first argument (data) of the first argument. Note that we still need to tell R what it needs to plot, so we still specify aes. Both ways are correct, so use the way that feels most comfortable to you.

#1
ggplot(data = ACS_2018, aes(x = incwage)) +
  geom_histogram()

#2
ACS_2018 %>%
  ggplot(aes(x = incwage)) +
  geom_histogram()

This function says to plot the incwage variable from the ACS_2018 data frame on the x-axis. It also defines a geom (short for geometric object), which describes the type of plot you will produce.

Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins. The histogram we just created has prompted R to tell us that the default binwidth, bins = 30, is not a good fit for our data.

The binwidth you choose will depend on what values the variable you are graphing takes on. In this case, the variable we’re graphing represents an individual’s annual income. What kind of values might be typical of an individual’s annual income?
Are these values in the hundreds? Thousands? Tens of thousands?

Let’s try a few different binwidths to see which binwidth best suits our data.
You can easily define the binwidth you want to use (load the cowplot package, i.e. put library(cowplot) into a code chunk, in order to access the plot_grid function):

bw1000 <- ACS_2018 %>%
ggplot(aes(x = incwage)) +
  geom_histogram(binwidth = 1000)

bw2500 <- ACS_2018 %>%
ggplot(aes(x = incwage)) +
  geom_histogram(binwidth = 2500)

bw5000 <- ACS_2018 %>%
ggplot(aes(x = incwage)) +
  geom_histogram(binwidth = 5000)

bw10000<- ACS_2018 %>%
ggplot(aes(x = incwage)) +
  geom_histogram(binwidth = 10000)

plot_grid(bw1000, bw2500, bw5000, bw10000, nrow=2)

Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?

Let’s say we want to examine the distribution of income for Black individuals living in the United States. We’ll first filter the data for individuals who self-identified as Black on the ACS (racblk == "Yes") and then make a histogram of the income distribution of only these individuals.

ACS_2018 %>%
  filter(racblk == "Yes") %>%
  ggplot(aes(x = incwage)) +
  geom_histogram(binwidth = 5000)

Let’s decipher these two commands (OK, so it might look like four lines, but the first two physical lines of code are actually part of the same command. It’s common to add a break to a new line after %>% to help readability).

Command 1: Take the ACS_2018 data frame, filter for individuals self-identified as Black
- == means “if it’s equal to”.
- “Yes” is in quotation marks since it is a character string.
Command 2: Basically the same ggplot call from earlier for making a histogram, except that it uses the smaller subset of Black individuals instead of all individuals.

Logical operators: Filtering for certain observations (e.g. individuals of a certain race or ethnicity) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data. To do so, you can use the filter function and a series of logical operators. The most commonly used logical operators for data analysis are as follows:

== means “equal to”
!= means “not equal to”
> or < means “greater than” or “less than”
>= or <= means “greater than or equal to” or “less than or equal to”

You can also obtain numerical summaries for numeric variables in your dataframe.
To find the numerical summaries for the subset of Black indivduals we can use the following code:

ACS_2018 %>%
  filter(racblk == "Yes") %>%
  summarise(mean_incwage   = mean(incwage), 
            median_incwage = median(incwage), 
            n         = n())

Note that in the summarise function you created a list of three different numerical summaries that you were interested in. The names of these elements are user defined, like mean_incwage, median_incwage, n, and you can customize these names as you like (just don’t use spaces in your names). Calculating these summary statistics also requires that you know the function calls. Note that n() reports the sample size.

Summary statistics: Some useful function calls for summary statistics for a single numerical variable are as follows:

mean
median
sd
var
IQR
min
max

Note that each of these functions takes a single vector as an argument and returns a single value.

You can also filter based on multiple criteria. Suppose you are interested in Black individuals over age 25 (we assume that most individuals are out of school and of working age by 25):

Black_over25 <- ACS_2018 %>%
  filter(racblk == "Yes", age > 25)

Note that you can separate the conditions using commas if you want individuals that are both Black and over 25 years old. If you are interested in either individuals who are Black or over 25 years old, you can use the | instead of the comma.

Create a new data frame that includes individuals who are both Black and over 25 years old, and save this data frame as Black_over25. How many individuals meet these criteria?
Create a new data frame that includes individuals who are both White and over 25 years old, and save this data frame as White_over25. How many individuals meet these criteria?
Describe the distributions of both the White_over25 and Black_over25 data frames using histograms and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution. How are their distributions similar? How are they different?

Another useful technique is quickly calculating summary statistics for various groups in your data frame. For example, let’s examine the distribution of education for our Black_over25 dataset. we can modify our previous command using the group_by function to get the same summary stats for each level of educational attainment (educ):

Black_over25 %>%
  group_by(educ) %>%
  summarise(mean_incwage = mean(incwage), median_incwage = median(incwage), n = n())

Here, we first grouped the Black_over25 dataframe by educ and then calculated the summary statistics.

Calculate the mean and median of incwage for the White_over25 dataset, grouped by educ.
What is the mean income for white individuals over 25 with a Grade 12 education?
For individuals with 4 years of college?
What is the median income for white individuals over 25 with a Grade 12 education?
For individuals with 4 years of college?
How do these results compare to those for Black individuals of the same age and educational attainment?

Average Income by State

How does average income vary between states? Which state would you expect to have the highest average income? The lowest?

Let’s think about how you could answer these questions:

First, calculate state averages for incomes. Note that we’re using the original ACS_2018 data here. With the new language you are learning, you could
- group_by state, then
- summarise mean income.
Then, you could to arrange these average incomes in descending order (denoted with the minus sign)

ACS_2018 %>%
  group_by(statefip) %>%
  summarise(mean_incwage = mean(incwage)) %>%
  arrange(-mean_incwage)

Which state has the highest mean income? What about the highest median income?
Which state has the lowest mean income? What about the lowest median income?

Differences in Average Income by Race

Do Black and White individuals with the same education living in the same state have the average income?

To explore this question, we’ll focus on particular state (or geographic area). We can start with the District of Columbia (Washington D.C.), which we just found to have the nation’s highest mean income of $61201.78.

Let’s filter the ACS_2018 dataframe for individuals who are from the District of Columbia and are Black.

Running the following code:

ACS_2018 %>%
  filter(statefip == "District of Columbia", racblk == "Yes", age > 25) %>%
  group_by(educ) %>%
  summarise(mean_incwage = mean(incwage)) %>%
  arrange(educ)

We can see the average income for Blacks living in the District of Columbia grouped by level of education.

What is the average income for Blacks over the age of 25 living in the District of Columbia with a Grade 12
education? With 4 years of college ?
Filter the ACS_2018 dataframe for White individuals over the age of 25 living in the District of Columbia.
What is the average income for Whites over the age of 25 living in the District of Columbia with a Grade 12
education? With 4 years of college ? Compare these results to your results in part 1.
Repeat this exercise, this time filtering for a different state of your choosing. How do the distributions of your chosen state compare to those of the District of Columbia?

Poverty Line Statistics

Suppose you would like to know what proportion of individuals in each state lived above or below the National Poverty Line. In 2020, a one-person household making less than $12,760 annually was considered to be “below” the poverty line.

For now, let’s assume that every individual in our dataset is part of a single person household.

Let’s start by classifying an individual’s income as being “above” or “below” the poverty line by creating a new variable with the mutate function. Note that we are saving this as a new dataframe called ACS_2018_pl.

Get into the habit of creating a new dataframe whenever you make a modification (such as adding a new column, converting units, etc.) This way, you will not overwrite older dataframes, meaning you can always refer back to them.

ACS_2018_pl <- ACS_2018 %>%
  mutate(poverty_line = ifelse(incwage >= 12760, "above", "below"))

The first argument in the mutate function is the name of the new variable we want to create, in this case poverty_line. We use an ifelse statement to classify an individual’s income as being “above” or “below” the federal poverty line. The ifelse says, “Look at the value of incwage. If incwage >= 12760, classify the individual as”above" the poverty_line, else, classify them as “below” the poverty_line.

We can handle all of the remaining steps in one code chunk:

ACS_2018_pl %>%
  group_by(statefip) %>%
  summarise(poverty_rate = sum(poverty_line == "below") / n()) %>%
  arrange(-poverty_rate)

Which state has the highest poverty rate, based on our calculations?
Calculate the poverty rate for Blacks and the poverty rate for Whites.
Hint instead of filtering for raceblk try group_by
In how many states does the Black poverty rate exceed the White poverty rate? Hint refer back to Lab 1 for how to “count” the number of occurences of a particular outcome.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License and was adopted from OpenIntro.org.