Methods 1, Week 8

Outline

  • Final Project

  • Homework Overview

  • Spatial data in R

  • Creating maps

  • In-class exercise

  • Homework

Final Projects

The final project for this course is a research paper that uses R to answer a research question and visualize the results. The project can be on a topic of your choosing, and you can work individually or in a small group. The deliverables will include:

  1. The raw data collected for the project
  2. The scripts used to complete the analysis
  3. A research paper that includes:
    • Description of your research question(s)
    • Short description of the data sources and analysis (the full methods will be at the end)
    • Description of results + formatted data tables + formatted charts, maps
    • Discussion of the results
    • Description of the next steps to continue this research
    • Methods appendix - description of your data sources, methods, and the assumptions that underlie your research

The research paper should be 3 pages, not counting graphics and the methods appendix.

Homework overview and questions


Assignment 7 Notebook

Spatial data

There are special file types necessary for adding a spatial dimension to your data. The two most common are:

  • shapefiles
  • geojsons

Both formats contain geographic information that describes the location of each observation. For a point file, that is most commonly the latitude and longitude of the point.
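As a minimal sketch of what "location" means for a point file, the code below turns a regular dataframe with longitude and latitude columns into a spatial dataframe. The city coordinates are approximate and only for illustration, and the object names (cities, cities_sf) are made up for this example.

```r
library(sf)

# Three example points; coordinates are approximate and only for illustration
cities <- data.frame(
  name = c("New York", "Atlanta", "Chicago"),
  lon  = c(-74.01, -84.39, -87.63),
  lat  = c(40.71, 33.75, 41.88)
)

# Turn the lon/lat columns into a geometry column.
# crs = 4326 declares the coordinates as WGS 84 longitude/latitude.
cities_sf <- st_as_sf(cities, coords = c("lon", "lat"), crs = 4326)

print(cities_sf)
```

Note that the order in coords is longitude first, then latitude (x, then y).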

Shapefiles

A shapefile is actually a collection of files (for example .shp, .shx, .dbf, and .prj) that together store the location, data, and projection.

Geojsons

A GeoJSON is a single file that contains all the same information.
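One way to see the difference between the two formats is to write the same small dataset out both ways and look at what lands on disk. This is only a sketch: the two points, the file name pts, and the temporary folder are invented for the example.

```r
library(sf)

# Two made-up points, just to have something to write to disk
pts <- st_as_sf(
  data.frame(name = c("a", "b"), lon = c(-74, -84), lat = c(41, 34)),
  coords = c("lon", "lat"), crs = 4326
)

# A fresh temporary folder so nothing in your project is touched
out_dir <- tempfile("formats_demo_")
dir.create(out_dir)

# Write the same data in both formats
st_write(pts, file.path(out_dir, "pts.shp"), quiet = TRUE)
st_write(pts, file.path(out_dir, "pts.geojson"), quiet = TRUE)

# The shapefile shows up as several files with the same base name;
# the GeoJSON is a single file
written <- list.files(out_dir)
print(written)
```

This is why you should always move or share all of a shapefile's pieces together, not just the .shp file.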

sf spatial data package


There are many R packages that you can use to work with spatial data. We'll use sf, which is one of the easiest to work with because it treats spatial data like a regular dataframe, with the geometry stored in an extra column (usually the last one).

  • install.packages("sf")
  • install.packages("RColorBrewer")
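To see what "a regular dataframe with a geometry column" looks like in practice, here is a small sketch using the North Carolina counties shapefile that ships with the sf package. The 0.2 cutoff on AREA is arbitrary and only there to show a filter.

```r
library(sf)
library(dplyr)

# North Carolina counties, a small example shapefile that ships with sf
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# The object is both an sf object and a data.frame
class(nc)

# Ordinary dplyr verbs work, and the geometry column comes along
# even though we did not select it
big_counties <- nc %>%
  filter(AREA > 0.2) %>%
  select(NAME, AREA)

names(big_counties)
```

If you ever do want a plain dataframe without the geometry, sf provides st_drop_geometry().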

import spatial data with sf

  • Download the state shapefile here
  • Open your main_data project in R
  • Open a new script and call it explore_educational_attainment.R
  • Save it in the scripts/data_exploration folder
  • Use the st_read() function to import the states shapefile
library(tidycensus)
library(tidyverse)
library(sf)
library(viridis)
library(scales)
library(RColorBrewer)
# import a spatial dataframe of all states in the US
states <- st_read("main_data/raw/state/states_shp.shp")

Use tidycensus to import education data

raw_attainment_2020 <- get_acs(geography = "state",
                               variables = c(total_25_over = "B15003_001", 
                                             bachelors = "B15003_022",
                                             masters = "B15003_023", 
                                             professional = "B15003_024",
                                             phd = "B15003_025"),
                               year = 2020,
                               output = "wide")

#   Create a new dataframe -- calculate the share of adults with a bachelor's
#   degree or higher, and remove Puerto Rico, Hawaii, and Alaska
attainment_2020 <- raw_attainment_2020 %>%
  rename(state = NAME) %>%
  mutate(pct_bachelors_plus = (bachelorsE + mastersE + 
                                 professionalE + phdE)/total_25_overE) %>% 
  filter(state != "Puerto Rico",
         state != "Hawaii",
         state != "Alaska")

join data to spatial dataframe

We’ll use a right join so that we keep only the states that are left in attainment_2020 (the continental US).

states_ed_attain <- states %>% 
  right_join(attainment_2020, by = "GEOID") %>% 
  select(state, total_25_overE, pct_bachelors_plus, geometry)
  • You must join data TO the spatial dataframe. If you join a spatial dataframe to a regular dataframe you’ll lose the spatial information.
  • To keep only the rows in the second dataframe, use a right_join()
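The join rules above can be tried on a small example without any census data. This sketch uses the North Carolina counties that ship with sf and a tiny made-up table (scores) standing in for the attainment data. The scores are invented numbers.

```r
library(sf)
library(dplyr)

# Spatial dataframe: North Carolina counties (ships with sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# A small non-spatial table; the scores are made-up numbers for illustration
scores <- tibble(
  NAME  = c("Ashe", "Wake", "Durham"),
  score = c(1, 2, 3)
)

# Spatial dataframe first, regular dataframe second.
# right_join() keeps only the rows that appear in scores.
joined <- nc %>%
  right_join(scores, by = "NAME") %>%
  select(NAME, score)

nrow(nc)       # all counties in the shapefile
nrow(joined)   # only the counties listed in scores
class(joined)  # still a spatial dataframe
```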

Build a map

Just like other plots, you build a map by adding instructions to the ggplot. Let’s start with a simple map of the percent of people with at least a bachelor's degree.

Color by a variable

A choropleth map uses graduated color or patterns to show the range of a statistic.

ggplot()  + 
  geom_sf(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus)) 

Style nicely

  • Use theme_void to remove the grid
ggplot()  + 
  geom_sf(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus)) +
  theme_void()

Style nicely

ggplot()  + 
  geom_sf(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus)) +
  theme_void() +
  scale_fill_viridis(direction = -1,
                     name="Bachelors Degree or Higher (%)", 
                     labels=percent_format(accuracy = 1L)) +
  labs(
    title = "Educational Attainment",
    subtitle = "Percent of Adults with at least a Bachelors Degree",
    caption = "Source: American Community Survey, 2020 "
  )

Style with specific breaks in the data

Using the RColorBrewer palette

  • Type display.brewer.all() in your console to see the names of the available palettes
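If you would rather look at one palette at a time, a short sketch like this prints the actual colors in a palette and the table of palette names. The choice of 5 colors is arbitrary.

```r
library(RColorBrewer)

# Five colors from the sequential "Blues" palette, as hex codes
blues <- brewer.pal(n = 5, name = "Blues")
print(blues)

# A table of every palette: its name (row name), the maximum number
# of colors it supports, and its category (sequential, diverging, qualitative)
head(brewer.pal.info)
```

Sequential palettes such as "Blues" are usually the right choice for a choropleth of a single statistic like a percentage.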

Style with defined breaks

ggplot()  + 
  geom_sf(data = states_ed_attain,
          mapping = aes(fill = pct_bachelors_plus)) +
  theme_void() +
  scale_fill_distiller(breaks=c(0, .1, .2, .3, .4, .5, .6),
                       palette = "Blues", 
                       name="Bachelors Degree or Higher (%)", 
                       labels=percent_format(accuracy = 1L)) +
  labs(
    title = "Educational Attainment",
    subtitle = "Percent of Adults with at least a Bachelors Degree",
    caption = "Source: American Community Survey, 2020 "
  )

In-class exercise / homework

You can also import census data with the geometry already attached using the tidycensus package! Just add the parameter geometry = TRUE to your get_acs() or get_decennial() call, and the result comes back as a spatial dataframe. See the example below:

library(tidyverse)
library(tidycensus)

### load all the variables for the ACS
# acs201519 <- load_variables(2019, "acs5", cache = T)

raw_income <- get_acs(geography = "county",
                      variables = "B19013_001",
                      state = "GA",
                      year = 2020,
                      geometry = TRUE) # this parameter imports the geometry

Assignment instructions

Create 2 maps using data you download from the 2016-20 American Community Survey with the tidycensus package. You can create any maps you like. You can even use this assignment to start thinking about your final project if you are ready for that.

When you have finished your maps, save them in the output folder of main_data.
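One way to save a finished map is ggsave(). The sketch below builds a placeholder map from the North Carolina example data that ships with sf and writes it to a temporary folder. The file name, size, and folder are assumptions for the example, so in your own script point the path at the output folder of main_data instead.

```r
library(sf)
library(ggplot2)

# Placeholder data: North Carolina counties (ships with sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Store the map in an object so it can be passed to ggsave()
example_map <- ggplot() +
  geom_sf(data = nc, mapping = aes(fill = AREA)) +
  theme_void()

# Hypothetical location: in your project, point this at the output folder
out_file <- file.path(tempdir(), "example_map.png")

# Width and height are in inches by default
ggsave(filename = out_file, plot = example_map,
       width = 8, height = 5, dpi = 300)
```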

Upload your finalized script to CANVAS.

See the next slides for some examples of maps you could make if you want some inspiration:

Idea 1.

Download the median rent for every county in New York (The variable is called MEDIAN CONTRACT RENT in the ACS).
Map ideas:

  1. Create a choropleth map of the median rent in each county in New York
  2. Create a choropleth map of areas that are affordable for a person making New York minimum wage
    • assume 30% of income for rent, and 40 hours a week at NY minimum wage (currently the minimum wage is $12.50 outside of NYC, Westchester and Long Island, source).
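The affordability threshold for the second map can be worked out with a few lines of arithmetic. This sketch uses the numbers from the slide and adds one assumption of its own: the person is paid for all 52 weeks of the year.

```r
hourly_wage    <- 12.50  # NY minimum wage from the slide
hours_per_week <- 40
weeks_per_year <- 52     # assumption: paid for every week of the year
rent_share     <- 0.30   # 30% of income goes to rent

# Convert the hourly wage to a monthly income
monthly_income <- hourly_wage * hours_per_week * weeks_per_year / 12

# The most this person can spend on rent each month (about 650 dollars)
max_affordable_rent <- rent_share * monthly_income
max_affordable_rent
```

A county could then be flagged as affordable when its median rent is at or below max_affordable_rent, for example by adding a TRUE/FALSE column with mutate() and using it as the fill.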

Idea 2.

Download the PEOPLE REPORTING ANCESTRY table for every census tract in Queens County

Map ideas:

  1. Create a choropleth map of the percent of people that have West Indian ancestry in each census tract
  2. Create a choropleth map of the percent of people that have German ancestry in each census tract

Idea 3.

Download the table for LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER for every census tract in all 5 counties in New York City

Map ideas:

  1. Create a choropleth map of the percent of people that “Speak only English”
  2. Create a choropleth map of the percent of people that “Speak Spanish”