Methods 1, Week 9

Outline

  • Final Project

  • Homework Overview

  • Spatial data in R

  • Creating maps

  • In-class exercise

  • Homework

Final Projects

The final project for this course is a research paper that uses R to answer a research question and visualize the results. The project can be on a topic of your choosing, and can be done individually or in a small group. The deliverables will include:

  1. The raw data collected for the project
  2. The scripts used to complete the analysis
  3. A research paper that includes:
    • Description of your research question(s)
    • Short description of the data sources and analysis (the full methods will be at the end)
    • Description of results + formatted data tables + formatted charts, maps
    • Discussion of the results
    • Description of the next steps to continue this research
    • Methods appendix - description of your data sources, methods, and the assumptions that underlie your research

The research paper should be about 3 pages, not counting the graphics and the methods appendix.

Homework overview and questions


Assignment 8 Notebook

Spatial data

There are special file types necessary for adding a spatial dimension to your data. The two most common are:

  • shapefiles
  • geojsons

Both formats contain geographic information that describes the location of each observation. For a point file, that is most commonly the latitude and longitude of the point.
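
As a minimal sketch of what "location of each observation" means for point data, the example below turns a small table with longitude and latitude columns into a spatial dataframe. The city coordinates are approximate, made-up illustration data, and the names cities and cities_sf are hypothetical.

```r
library(sf)

# Hypothetical point data: three cities with approximate longitude/latitude
cities <- data.frame(
  name = c("Atlanta", "New York", "Chicago"),
  lon  = c(-84.39, -74.01, -87.63),
  lat  = c(33.75, 40.71, 41.88)
)

# Convert to a spatial dataframe; crs = 4326 means WGS84 longitude/latitude
cities_sf <- st_as_sf(cities, coords = c("lon", "lat"), crs = 4326)

# Each row now carries a POINT geometry in place of the lon/lat columns
print(cities_sf)
```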

Shapefiles

A shapefile is a collection of files that contain the coordinates that make up the shapes, the data associated with each shape, and other information about how your computer should draw the shapes on earth. You need all of the files in a shapefile. They are meaningless if they are separated. This is the most common spatial data format.

Geojsons

A geojson is a single file that contains all the same information.
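
To see the difference between the two formats, here is a small sketch that writes the same made-up points to both and lists the files that result. It writes to a temporary folder, and the names pts, out_dir, shp_path and geojson_path are hypothetical.

```r
library(sf)

# A tiny spatial dataframe to write out (hypothetical data)
pts <- st_as_sf(
  data.frame(id = 1:2, lon = c(-84.39, -74.01), lat = c(33.75, 40.71)),
  coords = c("lon", "lat"), crs = 4326
)

# Write to a fresh temporary folder so nothing in your project is touched
out_dir <- tempfile("format_demo_")
dir.create(out_dir)

shp_path     <- file.path(out_dir, "pts.shp")
geojson_path <- file.path(out_dir, "pts.geojson")
st_write(pts, shp_path, quiet = TRUE)
st_write(pts, geojson_path, quiet = TRUE)

# The shapefile is several files sharing one name; the geojson is one file
list.files(out_dir)
```

This is why you move the whole unzipped folder when you download a shapefile, rather than just the .shp file.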

sf spatial data package


There are many R packages that you can use to work with spatial data. We’ll use sf because it treats spatial data exactly the same as a regular dataframe with the geometry as the last column.

  • install.packages("sf")
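
A quick way to see that a spatial dataframe behaves like a regular one is to run ordinary dplyr verbs on it. The sketch below uses the North Carolina counties sample shapefile that ships with the sf package (not part of this course's data); big_counties is a hypothetical name.

```r
library(sf)
library(dplyr)

# North Carolina counties: a sample shapefile bundled with the sf package
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Regular dplyr verbs work; the geometry column stays attached automatically
big_counties <- nc |>
  filter(AREA > 0.2) |>
  select(NAME, AREA)

class(big_counties)   # still a spatial (sf) dataframe
names(big_counties)   # geometry is kept as the last column
```

Notice that select() did not ask for geometry, but it is still there. sf keeps it attached unless you explicitly drop it.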

import spatial data with sf

  • Create a new folder for geographic data in part2/data/raw/geo
  • Download the state shapefile here
    • select this one: cb_2018_us_state_5m.zip [1.0 MB]
    • unzip it and move the folder to data/raw/geo
  • Open your part2 project in R
  • Open a new script and call it explore_educational_attainment.R
  • Save it in the scripts folder
  • Use the st_read() function from sf to import the states shapefile
library(tidycensus)
library(tidyverse)
library(sf)
library(viridis)
library(scales)
library(RColorBrewer)
# import a spatial dataframe of all states in the US
states <- st_read("data/raw/geo/cb_2018_us_state_5m/cb_2018_us_state_5m.shp")

Use tidycensus to import education data

raw_attainment_2020 <- get_acs(geography = "state",
                               variables = c(total_25_over = "B15003_001", 
                                             bachelors = "B15003_022",
                                             masters = "B15003_023", 
                                             professional = "B15003_024",
                                             phd = "B15003_025"),
                               year = 2020,
                               output = "wide")

#   Create a new dataframe -- calculate the share of adults with at least a
#   bachelor's degree, and remove Puerto Rico, Hawaii, and Alaska
attainment_2020 <- raw_attainment_2020 |>
  rename(state = NAME) |>
  mutate(pct_bachelors_plus = (bachelorsE + mastersE + 
                                 professionalE + phdE)/total_25_overE) |> 
  filter(state != "Puerto Rico",
         state != "Hawaii",
         state != "Alaska")

join data to spatial dataframe

We’ll use a right join so that we only keep the continental US

states_ed_attain <- states |> 
  right_join(attainment_2020, by = "GEOID") |> 
  select(state, total_25_overE, pct_bachelors_plus, geometry)
  • You must join data TO the spatial dataframe. If you join a spatial dataframe to a regular dataframe you’ll lose the spatial information.
  • To keep only the rows in the second dataframe, use a right_join()
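
The sketch below illustrates both bullet points with made-up data. The places and scores tables are hypothetical stand-ins for the states shapefile and the attainment table.

```r
library(sf)
library(dplyr)

# Hypothetical spatial dataframe with three places
places <- st_as_sf(
  data.frame(GEOID = c("01", "02", "03"),
             lon = c(-84.4, -74.0, -87.6),
             lat = c(33.7, 40.7, 41.9)),
  coords = c("lon", "lat"), crs = 4326
)

# Hypothetical regular dataframe that covers only two of them
scores <- data.frame(GEOID = c("01", "03"), score = c(0.31, 0.42))

# Spatial dataframe first: the result is still spatial, and right_join()
# keeps only the rows that appear in scores
spatial_result <- places |> right_join(scores, by = "GEOID")

# Regular dataframe first: the result is no longer a spatial (sf) dataframe
plain_result <- scores |> left_join(places, by = "GEOID")

inherits(spatial_result, "sf")
inherits(plain_result, "sf")
```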

Build a map

Just like other plots, you build a map with ggplot.

  • We’ll start with a simple map of the percent of people with at least a bachelor's degree.

Color by a variable

A choropleth map uses graduated color to show the variation in your data across your study area.

ggplot(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus))  + 
  geom_sf() 

Style nicely

  • Use theme_void to remove the grid
ggplot(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus))  + 
  geom_sf() +
  theme_void()

Style nicely

ggplot(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus))  + 
  geom_sf() +
  theme_void() +
  scale_fill_viridis(direction = -1,
                     name="Bachelors Degree or Higher (%)", 
                     labels=percent_format(accuracy = 1L)) +
  labs(
    title = "Educational Attainment",
    subtitle = "Percent of Adults with at least a Bachelors Degree",
    caption = "Source: American Community Survey, 2020 "
  )

Histogram

So far we have just accepted the defaults on how to display the data. An important step to building a map is understanding the shape of your data and how to best represent it in a map.

  • Let’s look at a histogram of the percent of people with at least a bachelor's degree to see how we should define the color scheme.
ggplot(data = states_ed_attain, aes(x = pct_bachelors_plus))  + 
  geom_histogram(binwidth=0.05)

Color Brewer palettes

  • Type display.brewer.all() in your console to see the names of the RColorBrewer palettes that we can use to represent our data in a choropleth map.

Style with defined breaks

We’ll use scale_fill_fermenter() to define the bins as every 5 percentage points from 0% to 60%, and select the color as Blue to Purple.

ggplot(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus))  + 
  geom_sf() +
  theme_void() +
  scale_fill_fermenter(breaks=c(0, .05, .1, .15, .2, .25, .3, .35, 
                                .4, .45, .5, .55, .6),
                       palette = "BuPu", 
                       direction = 1,
                       name="Bachelors Degree or Higher (%)", 
                       labels=percent_format(accuracy = 1L)) +
  labs(
    title = "Educational Attainment",
    subtitle = "Percent of Adults with at least a Bachelors Degree",
    caption = "Source: American Community Survey, 2020 "
  )
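
If you would rather not type out every break, the same vector can be generated with seq(). This is only a sketch; example_shares holds made-up values to show which bin a state would land in.

```r
# Same 13 break points as the vector typed out above: 0% to 60% in 5-point steps
breaks <- seq(0, 0.6, by = 0.05)
breaks

# cut() shows which bin a few hypothetical shares would fall into
example_shares <- c(0.22, 0.34, 0.59)
cut(example_shares, breaks = breaks)
```

You could then pass breaks = breaks to scale_fill_fermenter() in place of the typed-out vector.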

Handle the outliers

This map is being dominated by DC, and it’s so small you can’t even see it! How does it look if we remove it?

ggplot(data = states_ed_attain |> 
         filter(state != "District of Columbia"),
          mapping = aes(fill = pct_bachelors_plus))  + 
  geom_sf() +
  theme_void() +
  scale_fill_fermenter(breaks=c(0, .05, .1, .15, .2, .25, .3, .35, 
                                .4, .45, .5, .55, .6),
                       palette = "BuPu", 
                       direction = 1,
                       name="Bachelors Degree or Higher (%)", 
                       labels=percent_format(accuracy = 1L)) +
  labs(
    title = "Educational Attainment",
    subtitle = "Percent of Adults with at least a Bachelors Degree",
    caption = "Source: American Community Survey, 2020"
  ) 

Normally I might provide a note explaining that we removed an outlier. I’ll skip it this time since DC is so small, and not a state.

Spatial data from tidycensus

You can also import census data for most geographies as a spatial dataframe with the tidycensus package!

  • Just add the parameter geometry = T to your get_acs() or get_decennial() functions. See example below:
library(tidyverse)
library(tidycensus)

### load all the variables for the ACS
# acs201519 <- load_variables(2019, "acs5", cache = T)

# educational attainment for every county in Georgia, with geometry attached
raw_attainment_ga <- get_acs(geography = "county",
                             variables = c(total_25_over = "B15003_001",
                                           bachelors = "B15003_022",
                                           masters = "B15003_023",
                                           professional = "B15003_024",
                                           phd = "B15003_025"),
                             state = "GA",
                             year = 2020,
                             output = "wide",
                             geometry = TRUE) # this parameter imports the geometry

Note: some geographies are not available from tidycensus as spatial dataframes.

In-class exercise / homework instructions

Create 2 maps using data you download from the 2018-22 American Community Survey with the tidycensus package. You can create any maps you like. You can even use this assignment to start thinking about your final project if you are ready for that.

When you have finished your maps, save them in the output folder of part2.
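
One way to save a map is ggsave() from ggplot2. The sketch below builds a stand-in map from the sample counties that ship with sf and saves it to a temporary file. The file name, size and resolution are assumptions, and in your project the path would point into the output folder instead.

```r
library(sf)
library(ggplot2)

# Stand-in map built from the sample counties bundled with sf
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
demo_map <- ggplot(nc, aes(fill = BIR74)) +
  geom_sf() +
  theme_void()

# Hypothetical file name; in your project use something like "output/my_map.png".
# A temporary file is used here so the sketch runs anywhere.
map_path <- file.path(tempdir(), "demo_map.png")
ggsave(filename = map_path, plot = demo_map, width = 8, height = 5, dpi = 300)
```

Passing the plot explicitly with plot = avoids accidentally saving whichever plot happened to be drawn last.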

Upload your finalized script to CANVAS.

See the next slides for some examples of maps you could make if you want some inspiration:

Idea 1.

Download the median rent for every county in New York (The variable is called MEDIAN CONTRACT RENT in the ACS).
Map ideas:

  1. Create a choropleth map of the median rent in each county in New York
  2. Create a choropleth map of areas that are affordable for a person making New York minimum wage
    • assume 30% of income for rent, and 40 hours a week at NY minimum wage
    • currently the minimum wage is $16/hour in NYC, Westchester and Long Island and $15 everywhere else (source).
      • you can use an if else statement to define NYC, Westchester and Long Island’s minimum wage differently than the rest of the state. Get an extra 1 point if you figure out how to do that on your own. OR just define minimum wage as $15 everywhere for now and I’ll show you how to do it next week.

Idea 2.

Download the PEOPLE REPORTING ANCESTRY table for every census tract in Queens County

Map ideas:

  1. Create a choropleth map of the percent of people that have West Indian ancestry in each census tract
  2. Create a choropleth map of the percent of people that have German ancestry in each census tract

Idea 3.

Download the table for LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER for every census tract in all 5 counties in New York City

Map ideas:

  1. Create a choropleth map of the percent of people that “Speak only English”
  2. Create a choropleth map of the percent of people that “Speak Spanish”