Getting Started

This sheet builds off of “Handling Census Data in R - Part 1”. Please complete that first

Again, everytime you start R you will need to load any libraries that you need for analysis. You will need to run this bit of code every time you open R to get this to work.

library(tidycensus)
library(sf)
library(ggplot2)
library(ggthemes)
library(dplyr)

To get started working with tidycensus, users should load the package along with the tidyverse package, and set their Census API key. A key can be obtained from Census API Key. It will provide you with a 40 digit text string. Please keep track of this number. Store it in a safe place.

API_Key = '983980b9fc504149e82117csdfwefwe23dsdc507'  # non working example - please paste your own in
census_api_key(API_Key, install = TRUE, overwrite=TRUE)  

Comparing Populations Across Geographies

Let’s say you want to look at the differences in income between different parts of city, let’s say by quadrant (NE,NW,SE,SW). This can easily done using spatial statistics, joining income data into the appropriate quadrant by location.

The first step is going to be to access the DC quatrant shapefile. We can directly access data online easily here. To access the data online we just need the URL of the GeoJson file as seen below:

For this we will use read_sf from the sf package. SF will allow us to do pretty much anything that arcmap can do. In this case read_sf just reads the object from the internet for you and stores it in the same format as our census data.

Notice this is just the attribute data with a column called geometry that stores the outline of each river.

# read quadrant data from the internet
quads = read_sf('https://opendata.arcgis.com/datasets/02923e4697804406b9ee3268a160db99_11.geojson')
ggplot()+geom_sf(data=quads, fill='grey80', color='white')

Now let’s get the median income data for dc tract and plot it underneath the quadrants. Notice that I set fill = NA which makes the quadant polygons transparent, and color='white' sets thier outline color to white. Again these are not inside of aes() because we only use aes() if we want an asthetic property to be set from values in the attribute table.

# download the acs data 
med_inc =  get_acs(state = "DC", 
                   geography = "tract", 
                   year=2018,
                  variables =  c(med_inc = 'B19101_001'), 
                  geometry = TRUE,
                  output = 'wide')   # just do this
Getting data from the 2014-2018 5-year ACS
Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
Using FIPS code '11' for state 'DC'
# print the column names
print(names(med_inc))
[1] "GEOID"    "NAME"     "med_incE" "med_incM" "geometry"
# plot both
ggplot()+
        # plot median income
        geom_sf(data=med_inc, aes(fill = med_incE))+
  
        # plot quadrants
        geom_sf(data=quads, fill=NA, color='white')

Keep in mind the old addage ‘mapping is not doing’, ok I just made that up. But really we haven’t done any analysis here. We need to combine the attribute tables of med_inc and quads. We can do this by using a spatial aggregation using the sf package. Here we are going to get the median income of all census tracts within each quadrant (NE,NW,SE,SW).

# Calculate the median income of each quadrant
aggregate(med_inc, quads, 
          median_quad_income =  function(x) median(x,na.rm=T))  # na.rm=T avoids missing values
Error in st_geos_binop("intersects", x, y, sparse = sparse, prepared = prepared) : 
  st_crs(x) == st_crs(y) is not TRUE

Oops looks like the crs (coordinate reference system) isn’t the same for both layers. Let’s reproject them to “NAD83” details here using thier EPSG code, which can be used to uniquely identify any projection.

# reproject the data to to NAD_83  https://epsg.io/4269
med_inc = st_transform(med_inc, crs=4269)  
quads = st_transform(quads, crs=4269)  

Now let’s try getting the median income for each quadrant.

# 
quad_med_inc = aggregate(med_inc, quads,  FUN= median)  
although coordinates are longitude/latitude, st_intersects assumes that they are planar
quad_med_inc
Simple feature collection with 4 features and 4 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: -77.1198 ymin: 38.79164 xmax: -76.90915 ymax: 38.99597
epsg (SRID):    4269
proj4string:    +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
        GEOID                                                           NAME med_incE med_incM                       geometry
1 11001009807    Census Tract 65, District of Columbia, District of Columbia      598      124 POLYGON ((-77.06442 38.8895...
2 11001007604 Census Tract 75.04, District of Columbia, District of Columbia      628      111 POLYGON ((-77.00923 38.8393...
3 11001008701 Census Tract 84.02, District of Columbia, District of Columbia      617      106 POLYGON ((-77.0092 38.96498...
4 11001002801    Census Tract 32, District of Columbia, District of Columbia      699      117 POLYGON ((-77.1198 38.93435...
ggplot()+ geom_sf(data=quad_med_inc, aes(fill = med_incE))

Well that isn’t very interesting. Instead it might be more interesting to look at the distribution of census tract incomes in each quadrant, so we need to join the name of the quadrant to each census tract using a spatial join. A spatial join is done in sf with the st_join function.

med_inc_quads = st_join(med_inc, quads)
although coordinates are longitude/latitude, st_intersects assumes that they are planar
med_inc_quads
Simple feature collection with 216 features and 9 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: -77.11976 ymin: 38.79164 xmax: -76.9094 ymax: 38.99511
epsg (SRID):    4269
proj4string:    +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
First 10 features:
          GEOID                                                         NAME.x med_incE med_incM OBJECTID QUADRANT    NAME.y
1   11001009902 Census Tract 99.02, District of Columbia, District of Columbia      586      118        2       SE Southeast
2   11001009904 Census Tract 99.04, District of Columbia, District of Columbia      506       85        2       SE Southeast
2.1 11001009904 Census Tract 99.04, District of Columbia, District of Columbia      506       85        3       NE Northeast
3   11001000100     Census Tract 1, District of Columbia, District of Columbia     1206      180        4       NW Northwest
4   11001000801  Census Tract 8.01, District of Columbia, District of Columbia     1229      165        4       NW Northwest
5   11001000802  Census Tract 8.02, District of Columbia, District of Columbia      747      136        4       NW Northwest
6   11001001001 Census Tract 10.01, District of Columbia, District of Columbia     1910      127        4       NW Northwest
7   11001006202 Census Tract 62.02, District of Columbia, District of Columbia        0       12        1       SW Southwest
7.1 11001006202 Census Tract 62.02, District of Columbia, District of Columbia        0       12        2       SE Southeast
7.2 11001006202 Census Tract 62.02, District of Columbia, District of Columbia        0       12        3       NE Northeast
    SHAPE_Length SHAPE_Area                       geometry
1       27817.06   33075057 POLYGON ((-76.953 38.86628,...
2       27817.06   33075057 POLYGON ((-76.93592 38.8859...
2.1     30312.73   40175193 POLYGON ((-76.93592 38.8859...
3       36110.52   75733371 POLYGON ((-77.06937 38.9003...
4       36110.52   75733371 POLYGON ((-77.1054 38.91719...
5       36110.52   75733371 POLYGON ((-77.09635 38.9141...
6       36110.52   75733371 POLYGON ((-77.10053 38.9490...
7       35676.50   28472526 POLYGON ((-77.06466 38.8918...
7.1     27817.06   33075057 POLYGON ((-77.06466 38.8918...
7.2     30312.73   40175193 POLYGON ((-77.06466 38.8918...

Now let’s save that data to a shapefile so we can use it later. We can use sf

# write a geojson
st_write(obj = med_inc_quads,                                    # Object to be written
         dsn = "path_to_a_folder/acs_2016_medincome.geojson",    # destination of file
         driver = "GeoJSON")                                     # file type see help(st_write)

# write a shapefile 
st_write(obj = med_inc_quads, 
         dsn =  "path_to_a_folder/acs_2016_medincome.shp",  
         driver = "ESRI Shapefile")

Great! Now for each census tract we know which quadrant it is in. Let play with a few different ways of visualizing their differences.

First we will look at using box plots. Here we use ggplot geom_boxplot to plot our census data within each quadrant. So the data=med_inc_quads tells it where to find the new data. y=med_incE indicates what value to plot on the y axis and fill=QUADRANT treats each quadrant as a unique color or grouping.

Boxplots explained.

ggplot()+geom_boxplot(data=med_inc_quads, 
                      aes(y=med_incE, fill=QUADRANT))

Cool! If you are into this sort of thing… We could also look at a grouped histogram. Notice that the function is virtually the same as the last one except we use geom_histogram and x=med_incE because we want our values along the x axis instead of the y.

ggplot()+geom_histogram(data=med_inc_quads, 
                      aes(x=med_incE, fill=QUADRANT) )

ggplot()+geom_histogram(data=med_inc_quads, 
                      aes(x=med_incE, fill=QUADRANT))

Making Beautiful ggplots

Getting good at ggplot takes a while, because frankly it can be confusing. Lucking there is a decent shortcut called the ggthemes library. Which as the name implies allows you to quickly apply a color/style theme. See some examples here.

All we have to do is apply a theme we like using +the_theme_name

ggplot()+geom_histogram(data=med_inc_quads, 
                      aes(x=med_incE, fill=QUADRANT))+
                      theme_economist()

Saving Map

Ggplot2 also makes it easy to save you maps or graphs out as images. You can save a variety of differnt graphics types including tif, png, or even svg (for use in illustrator). See the help here.

# ggsave(filename='path_to_folder/segregationmap.png', 
#        plot = last_plot(),  # save the last thing you ploted 
#        device='png')
---
title: "Handling Census Data in R - Part 3"
output: html_notebook
---

# Getting Started

This sheet builds off of "Handling Census Data in R - Part 1". Please complete that first


Again, everytime you start R you will need to load any libraries that you need for analysis. *You will need to run this bit of code every time you open R* to get this to work. 

```{r include=T}
library(tidycensus)
library(sf)
library(ggplot2)
library(ggthemes)
library(dplyr)
```


To get started working with tidycensus, users should load the package along with the tidyverse package, and set their Census API key. A key can be obtained from [Census API Key](http://api.census.gov/data/key_signup.html).  **It will provide you with a 40 digit text string. Please keep track of this number. Store it in a safe place.**
 
```{r include=T, eval=FALSE}
API_Key = '983980b9fc504149e82117csdfwefwe23dsdc507'  # non working example - please paste your own in
census_api_key(API_Key, install = TRUE, overwrite=TRUE)  
```

# Comparing Populations Across Geographies

Let's say you want to look at the differences in income between different parts of city, let's say by quadrant (NE,NW,SE,SW). This can easily done using spatial statistics, joining income data into the appropriate quadrant by location. 

The first step is going to be to access the DC quatrant shapefile. We can directly access data online easily [here](https://opendata.dc.gov/datasets/dc-quadrants). To access the data online we just need the URL of the GeoJson file as seen below:

![ ](./OpenDataDCAPI.png)


For this we will use `read_sf` from the [sf package](https://r-spatial.github.io/sf/). SF will allow us to do pretty much anything that arcmap can do. In this case `read_sf` just reads the object from the internet for you and stores it in the same format as our census data.  

Notice this is just the attribute data with a column called `geometry` that stores the outline of each river.

```{r}
# read quadrant data from the internet
quads = read_sf('https://opendata.arcgis.com/datasets/02923e4697804406b9ee3268a160db99_11.geojson')
ggplot()+geom_sf(data=quads, fill='grey80', color='white')
```

Now let's get the median income data for dc tract and plot it underneath the quadrants. Notice that I set `fill = NA` which makes the quadant polygons transparent, and `color='white'` sets thier outline color to white. Again these are not inside of `aes()` because we only use `aes()` if we want an asthetic property to be set from values in the attribute table. 

```{r}
# download the acs data 
med_inc =  get_acs(state = "DC", 
                   geography = "tract", 
                   year=2018,
                  variables =  c(med_inc = 'B19101_001'), 
                  geometry = TRUE,
                  output = 'wide')   # just do this

# print the column names
print(names(med_inc))

# plot both
ggplot()+
        # plot median income
        geom_sf(data=med_inc, aes(fill = med_incE))+
  
        # plot quadrants
        geom_sf(data=quads, fill=NA, color='white')
```

Keep in mind the old addage 'mapping is not doing', ok I just made that up. But really we haven't done any analysis here. We need to combine the attribute tables of `med_inc` and `quads`. We can do this by using a spatial aggregation using the `sf` package. Here we are going to get the median income of all census tracts within each quadrant (NE,NW,SE,SW).

```{r}
# Calculate the median income of each quadrant
aggregate(med_inc, quads,  FUN= median)  
```

**Oops looks like the crs (coordinate reference system) isn't the same for both layers.** Let's reproject them to "NAD83" [details here](https://epsg.io/4269) using thier EPSG code, which can be used to uniquely identify any projection. 

```{r}
# reproject the data to to NAD_83  https://epsg.io/4269
med_inc = st_transform(med_inc, crs=4269)  
quads = st_transform(quads, crs=4269)  
```

Now let's try getting the median income for each quadrant. 

```{r}
# Calculate the median income of each quadrant
quad_med_inc = aggregate(med_inc, quads,  FUN= median)  
quad_med_inc
```

```{r}
ggplot()+ geom_sf(data=quad_med_inc, aes(fill = med_incE))
```

Well that isn't very interesting. Instead it might be more interesting to look at the distribution of census tract incomes in each quadrant, so we need to join the name of the quadrant to each census tract using a spatial join. A spatial join is done in `sf` with the `st_join` function.

```{r}
med_inc_quads = st_join(med_inc, quads)
med_inc_quads
```

Now let's save that data to a shapefile so we can use it later. We can use `sf`

```{r}
# write a geojson
st_write(obj = med_inc_quads,                                    # Object to be written
         dsn = "path_to_a_folder/acs_2016_medincome.geojson",    # destination of file
         driver = "GeoJSON")                                     # file type see help(st_write)

# write a shapefile 
st_write(obj = med_inc_quads, 
         dsn =  "path_to_a_folder/acs_2016_medincome.shp",  
         driver = "ESRI Shapefile")
```



Great! Now for each census tract we know which quadrant it is in. Let play with a few different ways of visualizing their differences.

First we will look at using box plots. Here we use `ggplot` `geom_boxplot` to plot our census data within each quadrant. So the `data=med_inc_quads` tells it where to find the new data. `y=med_incE` indicates what value to plot on the y axis and `fill=QUADRANT` treats each quadrant as a unique color or grouping. 

[Boxplots explained](https://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots).   

```{r}
ggplot()+geom_boxplot(data=med_inc_quads, 
                      aes(y=med_incE, fill=QUADRANT))
```


Cool! If you are into this sort of thing... We could also look at a grouped histogram. Notice that the function is virtually the same as the last one except we use `geom_histogram` and `x=med_incE` because we want our values along the x axis instead of the y. 

```{r}
ggplot()+geom_histogram(data=med_inc_quads, 
                      aes(x=med_incE, fill=QUADRANT) )
```

```{r}
ggplot()+geom_histogram(data=med_inc_quads, 
                      aes(x=med_incE, fill=QUADRANT))
```


# Making Beautiful ggplots

Getting good at ggplot takes a while, because frankly it can be confusing. Lucking there is a decent shortcut called the `ggthemes` library. Which as the name implies allows you to quickly apply a color/style theme. [See some examples here](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/).

All we have to do is apply a theme we like using `+the_theme_name` 

```{r}
ggplot()+geom_histogram(data=med_inc_quads, 
                      aes(x=med_incE, fill=QUADRANT))+
                      theme_economist()
```


# Saving Map

Ggplot2 also makes it easy to save you maps or graphs out as images. You can save a variety of differnt graphics types including tif, png, or even svg (for use in illustrator). See the [help here](https://ggplot2.tidyverse.org/reference/ggsave.html).


```{r}
ggsave(filename='path_to_folder/segregationmap.png',
       plot = last_plot(),  # save the last thing you ploted
       device='png')
```





