Methods 1, Week 9

Outline

Final Project
Homework Overview
Spatial data in R
Creating maps
In-class exercise
Homework

Final Projects

The final project for this course is a research paper that uses R to answer a research question and visualize the results. The project can be on a topic of your choosing, and can be done individually or as a small group. The deliverables will include:

The raw data collected for the project
The scripts used to complete the analysis
A research paper that includes:
- Description of your research question(s)
- Short description of the data sources and analysis (the full methods will be at the end)
- Description of results + formatted data tables + formatted charts, maps
- Discussion of the results
- Description of the next steps to continue this research
- Methods appendix - description of your data sources, methods, and the assumptions that underlie your research

The research paper should be 3 pages, not counting the graphics and the methods appendix.

Homework overview and questions

Assignment 8 Notebook

Spatial data

There are special file types necessary for adding a spatial dimension to your data. The two most common are:

shapefiles
geojsons

Both formats contain geographic information that describes the location of each observation. For a point file, that is most commonly the latitude and longitude of the point.

Shapefiles

Shapefiles are a collection of files that contain the coordinates that make up the shapes, the data associated with each shape, and other information about how your computer should draw the shapes on earth. You need all of the files in a shapefile; they are meaningless if they are separated. This is the most common spatial data format.
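
To see what "a collection of files" means in practice, the sketch below lists the pieces of a small sample shapefile (North Carolina counties) that ships with the sf package. It assumes sf is already installed; the sample file is not part of this lesson's data.

```r
library(sf)

# Folder inside the installed sf package that holds its sample shapefiles
shape_dir <- system.file("shape", package = "sf")

# All files whose name starts with "nc." together make up ONE shapefile
nc_files <- list.files(shape_dir, pattern = "^nc\\.")
nc_files
```

The list should include nc.shp (the shapes), nc.dbf (the data for each shape), and nc.prj (the projection), among others.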

Geojsons

A GeoJSON is a single file that contains all the same information.
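
As a rough illustration that the two formats hold the same information, this sketch reads the sample North Carolina shapefile that ships with sf, writes it out as one GeoJSON file, and reads it back. It assumes sf is installed; the temporary file path is only for demonstration.

```r
library(sf)

# Read the sample North Carolina shapefile that ships with sf
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Write it out as a single GeoJSON file in a temporary folder
geojson_path <- tempfile(fileext = ".geojson")
st_write(nc, geojson_path, quiet = TRUE)

# Read the GeoJSON back in; st_read() is used the same way for both formats
nc_geojson <- st_read(geojson_path, quiet = TRUE)

# Same number of counties in both versions
nrow(nc_geojson) == nrow(nc)
```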

`sf` spatial data package

There are many R packages that you can use to work with spatial data. We’ll use sf because it treats spatial data like a regular dataframe, with the geometry stored in one extra column (by default the last one).

install.packages("sf")
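
Once sf is installed, a quick way to check the "regular dataframe plus a geometry column" idea is to inspect a spatial dataframe. This sketch uses the sample North Carolina shapefile that ships with sf rather than the course data.

```r
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

class(nc)               # includes both "sf" and "data.frame"
attr(nc, "sf_column")   # name of the column that stores the shapes
tail(names(nc), 1)      # the geometry column is the last one

# Regular dataframe subsetting works; the geometry column comes along
head(nc[, c("NAME", "AREA")], 3)
```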

import spatial data with sf

Create a new folder for geographic data in part2/data/raw/geo
Download the state shapefile here
- select this one: cb_2018_us_state_5m.zip [1.0 MB]
- unzip it and move the folder to data/raw/geo
Open your part2 project in R
Open a new script and call it explore_educational_attainment.R
Save it in the scripts folder
Use the st_read() function from sf to import the states shapefile

library(tidycensus)
library(tidyverse)
library(sf)
library(viridis)
library(scales)
library(RColorBrewer)
# import a spatial dataframe of all states in the US
states <- st_read("data/raw/geo/cb_2018_us_state_5m/cb_2018_us_state_5m.shp")

Use tidycensus to import education data

raw_attainment_2020 <- get_acs(geography = "state",
                               variables = c(total_25_over = "B15003_001", 
                                             bachelors = "B15003_022",
                                             masters = "B15003_023", 
                                             professional = "B15003_024",
                                             phd = "B15003_025"),
                               year = 2020,
                               output = "wide")

#   Create a new dataframe to calculate the share of adults with at least a
#   bachelor's degree, and remove Puerto Rico, Hawaii, and Alaska
attainment_2020 <- raw_attainment_2020 |>
  rename(state = NAME) |>
  mutate(pct_bachelors_plus = (bachelorsE + mastersE + 
                                 professionalE + phdE)/total_25_overE) |> 
  filter(state != "Puerto Rico",
         state != "Hawaii",
         state != "Alaska")
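
With output = "wide", tidycensus returns one column per variable for the estimate (name ending in E) and one for the margin of error (name ending in M), which is why the code above uses bachelorsE rather than bachelors. Here is a small sketch of the same calculation on a made-up table; the two rows and all of the numbers are hypothetical, not real ACS values.

```r
library(dplyr)

# Hypothetical estimates, shaped like the wide output of get_acs()
toy_attainment <- tibble(
  NAME = c("State A", "State B"),
  total_25_overE = c(1000, 2000),
  bachelorsE = c(200, 300),
  mastersE = c(80, 150),
  professionalE = c(10, 25),
  phdE = c(10, 25)
)

# Share of adults 25+ with a bachelor's degree or higher
toy_attainment <- toy_attainment |>
  mutate(pct_bachelors_plus = (bachelorsE + mastersE +
                                 professionalE + phdE) / total_25_overE)

toy_attainment$pct_bachelors_plus
```

The result is a proportion between 0 and 1, which is why the maps later format it with percent_format().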

join data to spatial dataframe

We’ll use a right join so that we only keep the states that remain in attainment_2020: the contiguous US (the lower 48 states plus DC).

states_ed_attain <- states |> 
  right_join(attainment_2020, by = "GEOID") |> 
  select(state, total_25_overE, pct_bachelors_plus, geometry)

You must join data TO the spatial dataframe, so the spatial dataframe goes first. If you join a spatial dataframe to a regular dataframe you’ll lose the spatial information.
To keep only the rows that have a match in the second dataframe, use a right_join().
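
The join-direction rule can be checked on a small example. This sketch uses the sample North Carolina counties that ship with sf and a made-up two-row table (county_data and its values are hypothetical), then compares the class of the result for each direction.

```r
library(sf)
library(dplyr)

# Spatial dataframe: county names plus geometry
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE) |>
  select(NAME)

# Regular dataframe with made-up values for two counties
county_data <- tibble(NAME = c("Ashe", "Alleghany"), value = c(1, 2))

# Spatial dataframe first: the result is still a spatial dataframe,
# and the right join keeps only the two counties in county_data
spatial_first <- nc |>
  right_join(county_data, by = "NAME")

# Regular dataframe first: the result is no longer a spatial dataframe
table_first <- county_data |>
  left_join(nc, by = "NAME")

inherits(spatial_first, "sf")
inherits(table_first, "sf")
nrow(spatial_first)
```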

Build a map

Just like other plots, you build a map with ggplot.

We’ll start with a simple map of the percent of people with at least a bachelor’s degree.

Color by a variable

A choropleth map uses graduated color to show the variation in your data across your study area.

ggplot(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus))  + 
  geom_sf()

Style nicely

Use theme_void to remove the grid

ggplot(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus))  + 
  geom_sf() +
  theme_void()

Style nicely

ggplot(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus))  + 
  geom_sf() +
  theme_void() +
  scale_fill_viridis(direction = -1,
                     name="Bachelors Degree or Higher (%)", 
                     labels=percent_format(accuracy = 1L)) +
  labs(
    title = "Educational Attainment",
    subtitle = "Percent of Adults with at least a Bachelors Degree",
    caption = "Source: American Community Survey, 2020 "
  )

Histogram

So far we have just accepted the defaults on how to display the data. An important step to building a map is understanding the shape of your data and how to best represent it in a map.

Let’s look at a histogram of the percent of people with at least a bachelor’s degree to see how we should define the color scheme.

ggplot(data = states_ed_attain, aes(x = pct_bachelors_plus))  + 
  geom_histogram(binwidth=0.05)

Color Brewer palettes

Type display.brewer.all() in your console to see the names of the RColorBrewer palettes that we can use to represent our data in a choropleth map.
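
If you want to see the actual colors behind a palette name, a minimal sketch with RColorBrewer is shown here. It pulls five colors from the Blue-to-Purple palette ("BuPu") and looks up how many colors that palette can supply.

```r
library(RColorBrewer)

# Five colors from the sequential Blue-to-Purple palette, light to dark
bupu_5 <- brewer.pal(n = 5, name = "BuPu")
bupu_5

# The most colors this palette can supply
brewer.pal.info["BuPu", "maxcolors"]
```

The colors come back as hex codes, which you can also pass to any ggplot2 scale that accepts manual colors.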

Style with defined breaks

We’ll use scale_fill_fermenter() to define the bins as every 5 percentage points from 0% to 60%, and select the color as Blue to Purple.

ggplot(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus))  + 
  geom_sf() +
  theme_void() +
  scale_fill_fermenter(breaks=c(0, .05, .1, .15, .2, .25, .3, .35, 
                                .4, .45, .5, .55, .6),
                       palette = "BuPu", 
                       direction = 1,
                       name="Bachelors Degree or Higher (%)", 
                       labels=percent_format(accuracy = 1L)) +
  labs(
    title = "Educational Attainment",
    subtitle = "Percent of Adults with at least a Bachelors Degree",
    caption = "Source: American Community Survey, 2020 "
  )
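
The break vector typed out above can also be generated rather than written by hand. A small sketch, using base R only (breaks_5pt is just an illustrative name):

```r
# Every 5 percentage points from 0% to 60%
breaks_5pt <- seq(from = 0, to = 0.6, by = 0.05)

breaks_5pt
length(breaks_5pt)
```

You could then pass breaks = breaks_5pt to scale_fill_fermenter() in place of the c(...) vector.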

Handle the outliers

This map is being dominated by DC, and it’s so small you can’t even see it! How does it look if we remove it?

ggplot(data = states_ed_attain |> 
         filter(state != "District of Columbia"),
          mapping = aes(fill = pct_bachelors_plus))  + 
  geom_sf() +
  theme_void() +
  scale_fill_fermenter(breaks=c(0, .05, .1, .15, .2, .25, .3, .35, 
                                .4, .45, .5, .55, .6),
                       palette = "BuPu", 
                       direction = 1,
                       name="Bachelors Degree or Higher (%)", 
                       labels=percent_format(accuracy = 1L)) +
  labs(
    title = "Educational Attainment",
    subtitle = "Percent of Adults with at least a Bachelors Degree",
    caption = "Source: American Community Survey, 2020"
  )

Handle the outliers

Normally I might provide a note explaining that we removed an outlier. I’ll skip it this time since DC is so small, and not a state.

Spatial data from tidycensus

You can also import census data for most geographies as a spatial dataframe with the tidycensus package!

Just add the parameter geometry = TRUE to your get_acs() or get_decennial() call. See the example below:

library(tidyverse)
library(tidycensus)

### load all the variables for the ACS
# acs201519 <- load_variables(2019, "acs5", cache = T)

# educational attainment for every county in Georgia, with geometry attached
raw_attainment_ga <- get_acs(geography = "county",
                             variables = c(total_25_over = "B15003_001",
                                           bachelors = "B15003_022",
                                           masters = "B15003_023",
                                           professional = "B15003_024",
                                           phd = "B15003_025"),
                             state = "GA",
                             year = 2020,
                             output = "wide",
                             geometry = TRUE) # this parameter imports the geometry

Note: some geographies are not available from tidycensus as spatial dataframes.

In-class exercise / homework instructions

Create 2 maps using data you download from the 2018-22 American Community Survey with the tidycensus package. You can create any maps you like. You can even use this assignment to start thinking about your final project if you are ready for that.

When you have finished your maps, save them in the output folder of part2.
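
One way to save a finished map is ggsave() from ggplot2. The sketch below builds a simple map from the sample North Carolina data that ships with sf and writes it to a temporary folder so it runs anywhere; in your own script you would use a path inside your output folder instead (the file name shown in the comment is hypothetical).

```r
library(sf)
library(ggplot2)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Store the map in an object so it can be passed to ggsave()
nc_map <- ggplot(data = nc, mapping = aes(fill = AREA)) +
  geom_sf() +
  theme_void()

# In your project, replace this with a path such as "output/my_map.png"
map_path <- file.path(tempdir(), "nc_area_map.png")

ggsave(filename = map_path, plot = nc_map, width = 8, height = 5, dpi = 300)
```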

Upload your finalized script to CANVAS.

See the next slides for some examples of maps you could make if you want some inspiration:

Idea 1.

Download the median rent for every county in New York (The variable is called MEDIAN CONTRACT RENT in the ACS).
Map ideas:

Create a choropleth map of the median rent in each county in New York
Create a choropleth map of areas that are affordable for a person making New York minimum wage
- assume 30% of income for rent, and 40 hours a week at NY minimum wage
- currently the minimum wage is $16/hour in NYC, Westchester and Long Island and $15/hour everywhere else (source).
  - you can use an if else statement to define the minimum wage for NYC, Westchester and Long Island differently than for the rest of the state. Get 1 extra point if you figure out how to do that on your own, OR just define minimum wage as $15 everywhere for now and I’ll show you how to do it next week.
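
The affordability assumptions above boil down to a short piece of arithmetic. This is only a sketch: it additionally assumes a worker is paid for 52 weeks a year and that rent is compared against monthly income, neither of which is stated in the assignment, and affordable_rent() is a hypothetical helper name.

```r
hours_per_week <- 40
weeks_per_year <- 52   # assumption: paid for every week of the year
rent_share <- 0.30     # 30% of income goes to rent

# Monthly rent that is affordable at a given hourly wage
affordable_rent <- function(hourly_wage) {
  annual_income <- hourly_wage * hours_per_week * weeks_per_year
  annual_income / 12 * rent_share
}

affordable_rent(15)   # 780
affordable_rent(16)   # 832
```

A county would then count as affordable when its median rent is at or below the value for its minimum wage.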

Idea 2.

Download the PEOPLE REPORTING ANCESTRY table for every census tract in Queens County

Map ideas:

Create a choropleth map of the percent of people that have West Indian ancestry in each census tract
Create a choropleth map of the percent of people that have German ancestry in each census tract

Idea 3.

Download the table for LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER for every census tract in all 5 counties in New York City

Map ideas:

Create a choropleth map of the percent of people that “Speak only English”
Create a choropleth map of the percent of people that “Speak Spanish”
