---
title: "Handling Census Data in R"
output: html_notebook
---

# Install
Before you begin this tutorial you must download and install [R](https://mirrors.nics.utk.edu/cran/). Then install [RStudio](https://rstudio.com/products/rstudio/download/), an IDE (integrated development environment) that provides a powerful user interface for R. Get the Open Source Edition of RStudio Desktop, then open RStudio.

If you have a pre-existing installation of R and/or RStudio, I highly recommend that you reinstall both and get as current as possible. It can be considerably harder to run old software than new.

# Libraries
R is an extensible system, and many people share useful code they have developed as packages via CRAN and GitHub. These packages provide most of the functionality we are looking to use, for instance plotting and GIS operations. To install a package from CRAN, for example the dplyr package for data manipulation, here is one way to do it in the R console (there are others).

*Keep in mind you only need to install a package (aka library) once; you do not need to reinstall it every time you run R.*
 
```{r eval=FALSE,  include=T}
install.packages('dplyr') # data manipulation and cleaning
```

Now let's install the libraries (additional functions) we need to use for this tutorial. 

```{r eval=FALSE,  include=T}
install.packages('sf')         # spatial data handling
install.packages('tidycensus') # census data download see https://walkerke.github.io/tidycensus/articles/basic-usage.html
install.packages("cli")        # needed for devtools install
install.packages('devtools')   # allows installation of libraries from github
devtools::install_github("tidyverse/ggplot2")    # plotting; installs the latest version from GitHub so that it works with sf
          # if prompted, please select "1" to install all additional packages
install.packages('ggthemes')   # plotting themes
```
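
If you rerun the install chunk often, a small guard can skip packages that are already present. This is a minimal sketch and not part of the original workflow: the helper name `missing_packages` and the `needed` vector are assumptions, and the list only covers the CRAN packages loaded later in this tutorial.

```{r eval=FALSE}
# hypothetical helper: returns the package names that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# packages this tutorial loads with library() (assumed list, adjust as needed)
needed <- c("sf", "tidycensus", "ggplot2", "ggthemes", "dplyr")
to_install <- missing_packages(needed)

# only install from an interactive session, and only when something is missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Running the chunk a second time should leave `to_install` empty, so nothing is downloaded again.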


# Getting Access to Census Data

Now that we have installed all the needed libraries (packages), we need to load them into R. *You will need to run this bit of code every time you open R* to get this to work. 

```{r include=T}
library(tidycensus)
library(sf)
library(ggplot2)
library(ggthemes)
library(dplyr)
```


To get started working with tidycensus, load the package along with the tidyverse packages loaded above (ggplot2 and dplyr), and set your Census API key. A key can be obtained from [Census API Key](http://api.census.gov/data/key_signup.html).  **It will provide you with a 40-character text string. Please keep track of this key. Store it in a safe place.**
 
```{r include=T, eval=FALSE}
API_Key = '983980b9fc504149e82117csdfwefwe23dsdc507'  # non working example - please paste your own in
census_api_key(API_Key, install = TRUE, overwrite=TRUE)  
```
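
With `install = TRUE`, tidycensus writes the key to your `.Renviron` file under the name `CENSUS_API_KEY`, so later sessions pick it up automatically. The sketch below shows one way to make the key available in the current session and check that it was stored. It assumes the default `.Renviron` location in your home directory.

```{r eval=FALSE}
# re-read the environment file so the newly stored key is visible right away
readRenviron("~/.Renviron")

# look up the stored key; an empty string means no key was found
key <- Sys.getenv("CENSUS_API_KEY")

# TRUE when the stored key has the expected 40 characters
nchar(key) == 40
```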

 


# Basic Census Data Analysis

Now that we have access to the Census API we can start accessing the data. We have to make a few choices: which census product to use, "decennial" or "acs" (American Community Survey); the geography level, e.g. "state", "county", "tract", or "block group"; the question identifier (*not all questions are available at all geography levels*); and the year.

There are two major functions implemented in tidycensus: `get_decennial`, which grants access to the 1990, 2000, and 2010 decennial US Census APIs, and `get_acs`, which grants access to the 5-year American Community Survey APIs. 

In this basic example, let's look at median household income in the past 12 months, using the 2016 ACS (the 2012-2016 5-year estimates). 

```{r}
medinc <- get_acs(geography = "state", year = 2016,
                     variables = c(medincome = "B19013_001"), 
                     output = 'wide')
head(medinc)
```

Above we can see that `variables = c(medincome = "B19013_001")` assigned the values for question "B19013_001" to columns prefixed with 'medincome'. Two variables are returned for each question: the "E" suffix designates the income "estimate" and the "M" suffix is the "margin of error". So we want to use "medincomeE" for our work. We also set `output = 'wide'`; *just do this from now on, as the default output is harder to use*. 
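
The estimate and margin-of-error columns can be combined into a simple range around each estimate. The sketch below uses a small hand-typed data frame with the same column names as `medinc`, so it runs without an API key. The object names `toy` and `toy_ci` and the columns `lower` and `upper` are illustrative assumptions, and the values are only there for demonstration.

```{r}
library(dplyr)

# illustrative stand-in for a few rows of medinc
toy <- data.frame(
  NAME       = c("Arizona", "Arkansas", "California"),
  medincomeE = c(51340, 42336, 63783),  # estimate
  medincomeM = c(231, 234, 188)         # margin of error
)

# estimate minus and plus the margin of error
toy_ci <- toy %>%
  mutate(lower = medincomeE - medincomeM,
         upper = medincomeE + medincomeM)

toy_ci
```

The same `mutate()` call should work on `medinc` itself, since it only relies on the `medincomeE` and `medincomeM` columns.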

Now let's visualize the data. Here notice that the data hasn't been sorted. 

```{r fig.height=7}
medinc %>%
  ggplot(aes(x = medincomeE, y =  NAME )) + 
  geom_point()
```

Let's look at the same data, but this time we will sort the states by median income. We can also clean up the labels for x and y. For more help with plotting in ggplot go [here](https://ggplot2.tidyverse.org/reference/).

```{r fig.height=7}
medinc %>%
  ggplot(aes(x = medincomeE, y =  reorder(NAME, medincomeE) )) + 
  geom_point() +xlab('Median Income 2016')+ylab('State')
```
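
To see what `reorder()` does in the plot above, it can help to look at it on its own. In this small sketch (the vectors `state` and `income` are illustrative stand-ins for `NAME` and `medincomeE`), `reorder()` turns the names into a factor whose levels are sorted by the second argument. ggplot then draws the levels in that order along the axis, starting from the bottom of the y axis.

```{r}
# illustrative stand-ins for NAME and medincomeE
state  <- c("Arizona", "Arkansas", "California")
income <- c(51340, 42336, 63783)

# factor whose levels are ordered by income, lowest first
state_sorted <- reorder(state, income)

levels(state_sorted)
```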



# Finding the right question

Before we start, you need to find the census code for the question you want. Tidycensus makes this fairly simple: we just need to download all questions and codes for either the ACS or the decennial census. For most interesting questions we will want to look at the ACS five-year estimates, so we are going to use the function [load_variables](https://walkerke.github.io/tidycensus/reference/load_variables.html) to download all the questions. 


```{r}
acs <- load_variables(year = 2016,
                      dataset = "acs5", 
                      cache = TRUE)     #stores it so you don't have to download it again later
head(acs)
```

So the `name` column holds the question codes, `label` gives the description, and `concept` gives the question category. To browse `acs`, double click it in the Environment pane in the upper right-hand corner of RStudio; notice you can filter questions. 
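
Filtering can also be done in code. The helper below is a hypothetical convenience function, not part of tidycensus. It keeps the rows whose `concept` or `label` contains a search phrase, ignoring case, and assumes only that the table has the `label` and `concept` columns described above.

```{r}
library(dplyr)

# hypothetical helper: search the variable table by concept or label
find_vars <- function(vars, pattern) {
  vars %>%
    filter(grepl(pattern, concept, ignore.case = TRUE) |
             grepl(pattern, label, ignore.case = TRUE))
}

# example usage once `acs` has been loaded as above:
# find_vars(acs, "median household income")
```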

# Mapping Census Data

Now let's download the data and geometry from the API. We do this by rerunning `get_acs` but with `geometry = TRUE`. Notice below the column called geometry. This holds the coordinates of each point making up the state polygons. 

```{r}
medinc <- get_acs(geography = "state", year = 2016,
                     variables = c(medincome = "B19013_001"), 
                     output = 'wide',
                     geometry = TRUE, # this downloads the polygon geometry for each state
                     shift_geo = TRUE # this moves AK and HI closer to the contiguous US so we can map them easily
                      )
head(medinc)
```

Now let's plot it using the powers of ggplot and sf. 

```{r}

ggplot()+geom_sf(data=medinc,               # this tells ggplot where the data is stored
                 aes(fill = medincomeE))    # we are saying to use the values of medincomeE as the fill color

```
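
The default map can be tidied up with a color scale and a map theme. The sketch below uses `scale_fill_viridis_c()` from ggplot2 and `theme_map()` from ggthemes on a tiny made-up `sf` object, so it runs without the API. The helper `sq()`, the object `toy`, and its values are illustrative assumptions; swapping `toy` for `medinc` should give the styled state map.

```{r}
library(sf)
library(ggplot2)
library(ggthemes)

# hypothetical helper: a unit square whose lower-left corner is at (x0, 0)
sq <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
}

# two made-up polygons with the same column names as medinc
toy <- st_sf(NAME = c("A", "B"),
             medincomeE = c(42336, 63783),
             geometry = st_sfc(sq(0), sq(1)))

p <- ggplot() +
  geom_sf(data = toy, aes(fill = medincomeE)) +
  scale_fill_viridis_c() +          # continuous viridis color scale
  labs(fill = "Median income") +    # legend title
  theme_map()                       # removes axes and grid lines

p
```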


Let's learn quickly how to reproject data as well. Here we are going to use what is called a "proj4string", which holds all the information needed, for instance the location of the prime meridian, linear units, etc. [Site to help identify appropriate projection](https://projectionwizard.org/#). [Site to help identify proj4strings](https://epsg.io/).

Here we will project the data to web mercator and plot the result.

```{r}
# note: despite the "albers" in its name, this object holds the Web Mercator version
medinc_bg_albers = st_transform(medinc, 
                         crs = '+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext  +no_defs')  


ggplot()+geom_sf(data=medinc_bg_albers,               # this tells ggplot where the data is stored
                 aes(fill = medincomeE)) +   # we are saying to use the values of medincomeE as the fill color
                 ggtitle('Median Income')
```
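
`st_transform()` also accepts an EPSG code in place of a full proj4string. As a minimal sketch, the chunk below reprojects a single made-up point from longitude/latitude (EPSG:4326) to Web Mercator (EPSG:3857) and prints the resulting coordinates in meters. The objects `pt` and `pt_merc` are illustrative and not part of the tutorial data.

```{r}
library(sf)

# one illustrative point at 100 degrees west, 45 degrees north
pt <- st_sfc(st_point(c(-100, 45)), crs = 4326)

# reproject using the EPSG code for Web Mercator
pt_merc <- st_transform(pt, crs = 3857)

st_crs(pt_merc)$epsg      # confirms the new coordinate reference system
st_coordinates(pt_merc)   # x and y in meters
```

Checking `st_crs()` after a transform is a quick way to confirm that the data ended up in the projection you intended.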

# Write Out Shapefiles

Writing out data to shapefiles is also very easy. 

```{r}
# write a geojson
st_write(medinc, 
         dsn = "path_to_a_folder/acs_2016_medincome.geojson",  
         driver = "GeoJSON")

# write a shapefile (medinc_bg_albers is the reprojected data created above)
st_write(medinc_bg_albers, 
         dsn =  "path_to_a_folder/acs_2016_medincome.shp",  
         driver = "ESRI Shapefile")

```
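
A quick way to confirm that a file was written correctly is to read it back with `st_read()`. The sketch below does a round trip with a one-row made-up `sf` object and a temporary file, so it does not depend on the folder path above. The names `toy`, `out`, and `back` are illustrative.

```{r}
library(sf)

# one illustrative feature with the same column names as medinc
toy <- st_sf(NAME = "Example",
             medincomeE = 51340,
             geometry = st_sfc(st_point(c(-100, 45)), crs = 4326))

# write to a fresh temporary GeoJSON file
out <- tempfile(fileext = ".geojson")
st_write(toy, dsn = out, driver = "GeoJSON", quiet = TRUE)

# read the file back in and inspect it
back <- st_read(out, quiet = TRUE)
back
```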


