As was mentioned earlier, Socrata is a platform that is used by many local governments to publish open data. The Socrata Open Data API (SODA API) allows us to access these data resources from an R script or notebook in a standardized way. In the video you watched, the API was likened to a waiter or liaison who carries requests and responses back and forth between a client computer (i.e., you) and server computer (i.e., the open data portal). The documentation for the SODA API can be found here and begins with API endpoints, which are essentially Uniform Resource Locators (URLs) that provide access to data. Without getting into too much detail at the moment, we use the Hypertext Transfer Protocol (HTTP) to send requests and receive responses. Most often, we are using the GET method to send queries and retrieve data.

In some cases, an R package exists that allows us to interact with an API without having to construct queries that conform to the requirements of the API and pass them using a package like httr (this is what the ungraded exercise that we started in class last time asked you to do from your web browser). The RSocrata package that we installed and loaded above does this “dirty work” of formatting the underlying HTTP queries that allow us to interact with open data portals that use the Socrata platform.

Let’s take a closer look…

help(package="RSocrata")


The “Inspect RSocrata Documentation” code chunk simply opens a page listing the seven functions that are included in this package. Take a look at this page (i.e., in the Help tab in the lower-right corner), then proceed to Exercise 1 below.



Exercise 1


Review the documentation page that appears after running the code chunk above.

1. Identify the function you would use to list datasets that exist on the City of San Francisco’s open data portal

  • Review the usage, arguments, and examples for that function

2. Insert a new code chunk below this text

  • Write and execute a line of code that retrieves a list of all datasets on the portal
  • Note: if you decide to work in a new R script, you will need to install and load the RSocrata package there
sf_data <-  ls.socrata("https://data.sfgov.org/browse")

head(sf_data)


3. Save the information returned by the function in 1. to an object (e.g., sf_data) using the assignment operator <-

  • What kind of object is this?


typeof(sf_data)
## [1] "list"


  • How many rows and columns are there?


dim(sf_data)
## [1] 650  14


There are 650 rows of 14 columns.


  • What are the names of those columns?
colnames(sf_data)
##  [1] "accessLevel"  "landingPage"  "issued"       "@type"        "modified"    
##  [6] "keyword"      "contactPoint" "publisher"    "identifier"   "description" 
## [11] "title"        "distribution" "license"      "theme"
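The "list" result from typeof() above can look odd next to a dim() result that reports rows and columns. The sketch below uses a small made-up data frame (the object name toy and its values are illustrative assumptions, not portal data) to show the likely reason: a data frame is stored internally as a list of equal-length columns, so typeof() reports the storage type while class() reports how R treats the object.

```r
# Toy data frame standing in for the portal inventory (values are made up)
toy <- data.frame(
  title = c("Dataset A", "Dataset B", "Dataset C"),
  theme = c("Housing", "Transportation", "Housing"),
  stringsAsFactors = FALSE
)

storage_type <- typeof(toy)  # internal storage type of the object
object_class <- class(toy)   # how R treats the object
toy_dim <- dim(toy)          # number of rows, then number of columns

print(storage_type)
print(object_class)
print(toy_dim)

# Columns can be pulled out by name with $ and tallied with table()
theme_counts <- table(toy$theme)
print(theme_counts)
```

Running class(sf_data) in the notebook is a quick way to check whether the inventory is a data frame.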


There are more detailed instructions on what to submit and how at the end of this notebook, but essentially you should add code chunks and text chunks (i.e., Markdown sections) to the R Notebook that you have saved locally; these should perform the tasks and answer the questions posed above.

Exploring the Inventory


Next, return to the documentation page for the RSocrata package. The read.socrata function can be used to retrieve a dataset from the San Francisco open data portal. In addition to indexing objects like lists or data frames by position using bracket notation, we can also refer to variables by name, if names exist, using the $ operator. Let’s sort the inventory and peruse the types of information available. The table function provides a count of instances within each category.


sf_contents <- ls.socrata("https://data.sfgov.org/limitTo=datasets")
names(sf_contents)
head(sf_contents)

sort(sf_contents$title)
table(sf_contents$theme)


How many datasets are there on the San Francisco portal right now?

650


Exercise 2


Execute the code chunk above and make sure you understand the code, then proceed with the tasks below:

1. Insert a new code chunk and modify the code in “Get Info for Incident Reports Dataset” to create a new object that only contains police calls to the Tenderloin:

  • You will want to use the analysis_neighborhood attribute
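library(jsonlite)  # fromJSON()
library(dplyr)     # filter()

# Inventory entry (metadata) for the incident reports dataset
police_calls_info <- sf_contents[sf_contents$title == "Police Department Incident Reports: 2018 to Present", ]
# The records themselves, downloaded from the dataset's JSON endpoint
police_calls_2 <- fromJSON("https://data.sfgov.org/resource/wg3w-h783.json")
# Keep only the incidents in the Tenderloin
police_calls_TEND <- filter(police_calls_2, analysis_neighborhood == "Tenderloin")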


2. Write and execute code to create and export to .png a barchart that shows at least one of the following:

  • Year
  • Day of the week


ggplot(data = police_calls_TEND) + 
  geom_bar(mapping = aes(x = incident_year, fill = incident_day_of_week), position = "dodge")

ggsave("Tenderloin-Police-Calls-byYear-byDay.png", units = "in", width = 16, height = 8)


3. Choose another neighborhood (run table(police_calls_tb$analysis_neighborhood) to see the options) and compare it with the Tenderloin in terms of police calls.

  • Interpret your findings in a Markdown section


police_calls_SOMA <- filter(police_calls_2, analysis_neighborhood == "South of Market")

ggplot(data = police_calls_TEND) + 
  geom_bar(mapping = aes(x = incident_year), position = "dodge") + 
  ggtitle("Police Calls to the Tenderloin")

ggsave("TEND-Police-Calls-byYear.png", units = "in", width = 16, height = 8)

ggplot(data = police_calls_SOMA) + 
  geom_bar(mapping = aes(x = incident_year), position = "dodge") + 
  ggtitle("Police Calls to South of Market")

ggsave("SOMA-Police-Calls-byYear.png", units = "in", width = 16, height = 8)


Findings

The number of police calls to the Tenderloin over these three years has been much more volatile than the number of calls to the South of Market (SoMa) neighborhood, which experienced the same total number of calls in each of the three years captured. The Tenderloin had the highest number of calls (30) in 2018, dipped to fewer than 15 in 2019, and then two years later, the number of calls (27) nearly matched the 2018 numbers.
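One caveat when interpreting counts like these: SODA endpoints return only a limited page of records by default (1,000, per the SODA documentation), so a plain request to the .json endpoint may cover only a slice of the dataset. The sketch below builds a request URL that filters on the server and raises the row limit. The $where and $limit parameters are part of the SODA query language; the limit of 5000 is an arbitrary value chosen for illustration, and the object names are assumptions.

```r
# Endpoint taken from the exercise above
endpoint <- "https://data.sfgov.org/resource/wg3w-h783.json"

# Filter expression; the field name comes from the exercise,
# the limit value is an arbitrary illustration
where_clause <- "analysis_neighborhood = 'Tenderloin'"
row_limit <- 5000

# Percent-encode the filter so spaces, quotes, and "=" are safe in a URL
query_url <- paste0(
  endpoint,
  "?$where=", utils::URLencode(where_clause, reserved = TRUE),
  "&$limit=", row_limit
)
print(query_url)

# Not run here because it needs network access and the jsonlite package:
# police_calls_TEND <- jsonlite::fromJSON(query_url)
```

Filtering on the server this way means the rows that come back are all Tenderloin incidents, rather than whichever Tenderloin incidents happened to fall in the first page of results.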


Take some time to reflect on this portion of the exercise, then proceed.

Exercise 2 Reflection

This portion of the exercise was relatively straightforward, but it made me realize that I have a lot to learn to use ggplot more effectively. I struggled, and ultimately failed, to figure out how to sort the days of the week in the correct order. I hope we will learn how to do so when working with dates. Until then, my bar chart displays the days of the week in alphabetical order, beginning with Friday! Spending more time learning how to add value labels, chart labels, trend lines, etc. will also be very valuable.
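One possible way to handle the day ordering mentioned above is to convert the column to a factor with explicit levels, since ggplot2 draws discrete axes and legends in factor-level order. The sketch below uses a short made-up vector of day names; it assumes the incident_day_of_week field holds full English day names, and starting the week on Monday is an arbitrary choice.

```r
# Desired display order (starting on Monday is an arbitrary choice)
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Made-up sample of day names standing in for incident_day_of_week
days <- c("Friday", "Monday", "Sunday", "Friday", "Wednesday")

# A factor with explicit levels keeps the calendar order
days_ordered <- factor(days, levels = day_levels)
day_counts <- table(days_ordered)
print(day_counts)

# Applied to the notebook's data (not run here):
# police_calls_TEND$incident_day_of_week <-
#   factor(police_calls_TEND$incident_day_of_week, levels = day_levels)
```

With the column converted this way, the same geom_bar call should list the days from Monday to Sunday instead of alphabetically.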



The Census API

Rather than downloading data from the official U.S. Census Bureau website, we can also access data programmatically from R. The first step is to take a quick look at the API documentation, then request an API key by visiting this site. You should receive a response in a few minutes, but be sure to safeguard your API key because it is tied to each individual user.

To get a sense of how API queries work, open a web browser and type the following into the address bar, inserting your own personal Census API key where indicated:

https://api.census.gov/data/2017/acs/acs5?key=[YOUR_API_KEY]&get=B01003_001E&for=zip%20code%20tabulation%20area:94114,94110

The resulting browser window should contain the results of the query above in JavaScript Object Notation (JSON) format. This is a compact way to store and transmit information over the internet that we will return to later in the semester. For now, we just need to know that the response there tells us:

  • That there are 73,737 people living in zip code tabulation area 94110 and 34,561 people living in San Francisco’s 94114 zip code tabulation area
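To make the structure of such a response concrete, here is a minimal sketch that parses one in R. The response text is the one recorded for the 2017 query in Exercise 3 of this notebook (it includes a state column because that query added &in=state:06). The object names are illustrative, and the jsonlite package is assumed to be installed.

```r
library(jsonlite)

# Response text as recorded in Exercise 3 (2017 query with a state column)
response_text <- '[["B01003_001E","state","zip code tabulation area"],
                   ["73737","06","94110"],
                   ["34561","06","94114"]]'

# fromJSON() turns an array of equal-length arrays into a character matrix
response_matrix <- fromJSON(response_text)

# The first row holds the column names; the remaining rows hold the data
pop_df <- as.data.frame(response_matrix[-1, , drop = FALSE],
                        stringsAsFactors = FALSE)
colnames(pop_df) <- response_matrix[1, ]

# Values arrive as text, so convert the estimate to a number
pop_df$B01003_001E <- as.numeric(pop_df$B01003_001E)
print(pop_df)
```

The same header-row-then-data-rows pattern is what the later httr example in this notebook unpacks by hand.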

So, we can send queries to the Census API for the 2013-2017 ACS 5-Year Estimates directly from a web browser, but it is more efficient to do this from R. But first of all, what are the different components of the query we just executed?

  • The https://api.census.gov/data/2017/acs/acs5? component is the base URL. It tells the Census servers that we are interested in 5-Year ACS data for the period that ends in 2017 (i.e., ACS 2013-2017)
  • The key=[YOUR_API_KEY] component is the unique identifier for the client (i.e., your browser/script) that is making the request
  • The get=B01003_001E component is the variable name (Total population). For a full list, see here.
  • The for=zip%20code%20tabulation%20area:94114,94110 component specifies the census geography, here the ZCTAs for the Castro/Noe Valley (94114) and Inner Mission/Bernal Heights (94110) areas of San Francisco. Note that these areas are generally southwest of the Tenderloin. The %20 is the percent-encoded (hexadecimal) representation of the space character. There are quite a few special characters that need to be properly encoded if they are being passed to the server.
  • Finally, note that all the arguments are concatenated using &.
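The components listed above can be assembled in R rather than typed by hand. The following is a minimal sketch using only base R; the object names are illustrative and "YOUR_API_KEY" is a placeholder to replace with your own key. utils::URLencode() handles the percent-encoding of the spaces in the geography name.

```r
# Pieces of the query described above
base_url  <- "https://api.census.gov/data/2017/acs/acs5?"
api_key   <- "YOUR_API_KEY"  # placeholder, not a working key
variable  <- "B01003_001E"   # Total population
geography <- "zip code tabulation area:94114,94110"

# Join the arguments with & and encode the spaces as %20
census_url <- paste0(
  base_url,
  "key=", api_key,
  "&get=", variable,
  "&for=", utils::URLencode(geography)
)
print(census_url)
```

Keeping each component in its own object makes it easy to swap in a different year, variable, or geography without retyping the whole URL.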

These particular parameters are specific to Census APIs and endpoints. If you want to interact with other APIs, you will first have to refer to their documentation and understand how to properly format a query. It is also possible that APIs change over time, causing your code to stop working. Luckily, it is usually not onerous to make the necessary tweaks once you understand how the API has changed.

Exercise 3


Still using a web browser, practice what we have learned here by constructing the proper API query in the address bar to:

1. Retrieve the total population of the same two San Francisco ZCTAs based on the 2014-2018 ACS 5-Year Estimates


2017


CALL: https://api.census.gov/data/2017/acs/acs5?get=B01003_001E&for=zip%20code%20tabulation%20area:94114,94110&in=state:06&key=7f5b3f883e24d9d3a7f207f76bfb731d19124299


RESPONSE: [["B01003_001E","state","zip code tabulation area"], ["73737","06","94110"], ["34561","06","94114"]]


2017 Population Estimates

94114 - 34,561 ; 94110 - 73,737


2018


CALL: https://api.census.gov/data/2018/acs/acs5?get=B01003_001E&for=zip%20code%20tabulation%20area:94114,94110&in=state:06&key=7f5b3f883e24d9d3a7f207f76bfb731d19124299


RESPONSE: [["B01003_001E","state","zip code tabulation area"], ["34754","06","94114"], ["74161","06","94110"]]


2018 Population Estimates

94114 - 34,754 ; 94110 - 74,161


2. Retrieve the unemployment rate for residents of the same two San Francisco ZCTAs based on the 2014-2018 ACS 5-Year Estimates

  • Hint: Take a look at variable "DP03_0005PE" which is located in a different API with a slightly different base URL here

CALL: https://api.census.gov/data/2018/acs/acs5/profile?get=DP03_0005PE&for=zip%20code%20tabulation%20area:94114,94110&in=state:06&key=7f5b3f883e24d9d3a7f207f76bfb731d19124299

RESPONSE: [["DP03_0005PE","state","zip code tabulation area"], ["3.2","06","94114"], ["3.1","06","94110"]]

2018 Percent Unemployed 94114 - 3.2% ; 94110 - 3.1%


3. Now change geographies and retrieve the unemployment rate for the city as a whole

  • Hint: Try &for=place:67000&in=state:06

CALL: https://api.census.gov/data/2018/acs/acs5/profile?get=DP03_0005PE&for=place:67000&in=state:06&key=7f5b3f883e24d9d3a7f207f76bfb731d19124299


RESPONSE: [["DP03_0005PE","state","place"], ["3.3","06","67000"]]


2018 Percent Unemployed San Francisco - 3.3%


Take some time to experiment with one or more of the other variables, then proceed with the exercise.


Using the examples provided by the Census, we can create a simple loop to retrieve a variety of variables for the City of San Francisco.

install.packages("httr")
## Installing package into 'C:/Users/Student/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'httr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Student\AppData\Local\Temp\RtmpA5vUGU\downloaded_packages
library(httr)
## Warning: package 'httr' was built under R version 4.0.5
library(jsonlite)

baseurl <- "https://api.census.gov/data/2018/acs/acs5?"
param1 <- "get="
param2 <- "&for=place:"
param3 <- "&in=state:06"
key <- "&key=f9f31e09fed0f44d23a8a354469e461df98f34cb"

# These are the place FIPS codes for San Francisco, Oakland, and Berkeley
places.list <- paste("67000", "53000", "06000", sep = ",")
# B01003_001E is Total population (B00001_001E is the unweighted sample count)
vars.list <- c("B01003_001E", "B19013_001E")
vars.names <- c("Total population", "Median household income")

req <- httr::GET(paste0(baseurl, param1, vars.list[1],
                        param2, places.list, param3, key))
req.json <- fromJSON(content(req, "text"), flatten=TRUE)
req.df <- as.data.frame(req.json)
col.names <- req.json[1,]
df <- req.df[2:4, ]
colnames(df) <- col.names
df
req.2 <- httr::GET(paste0(baseurl, param1, vars.list[2],
                        param2, places.list, param3, key))
req.json.2 <- fromJSON(content(req.2, "text"), flatten=TRUE)
req.df.2 <- as.data.frame(req.json.2)
col.names.2 <- req.json.2[1,]
df.2 <- req.df.2[2:4, ]
colnames(df.2) <- col.names.2
df.2
df.2 <- subset(df.2, select = -state)

merged.df <- merge(df, df.2, by = "place")   
merged.df
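The chunk above repeats the same request-and-tidy steps once per variable. A rough sketch of how those steps might be wrapped in functions and applied in a loop follows. The function names build_query and fetch_variable are hypothetical, "YOUR_API_KEY" is a placeholder, and B01003_001E (the total population variable named earlier in this notebook) is used for the first variable. The network calls are left commented out, so only the URL construction runs as written.

```r
baseurl <- "https://api.census.gov/data/2018/acs/acs5?"
vars.list <- c("B01003_001E", "B19013_001E")
# Place FIPS codes for San Francisco, Oakland, and Berkeley
places.list <- paste("67000", "53000", "06000", sep = ",")

# Hypothetical helper: build the request URL for one variable
build_query <- function(variable, key = "YOUR_API_KEY") {
  paste0(baseurl, "get=", variable,
         "&for=place:", places.list,
         "&in=state:06",
         "&key=", key)
}

# Hypothetical helper: request one URL and tidy the response
# (requires the httr and jsonlite packages plus network access)
fetch_variable <- function(url) {
  req <- httr::GET(url)
  m <- jsonlite::fromJSON(httr::content(req, "text", encoding = "UTF-8"))
  out <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  colnames(out) <- m[1, ]  # the first row of the response holds the names
  out
}

# One URL per variable
urls <- vapply(vars.list, build_query, character(1))
print(urls)

# Not run here (needs a valid key and network access):
# results <- lapply(urls, fetch_variable)
# merged.df <- Reduce(function(a, b) merge(a, b, by = c("state", "place")), results)
```

Adding another variable then only requires extending vars.list, rather than copying the request block a third time.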



Lab Reflection

This lab was a useful introduction to navigating the Census API and working in the tidyverse, as well as structuring a digestible markdown document.

Working with Census

Navigating the Census documentation was rather confusing, given the repetitive structure of their site and the many, many different datasets and geographic levels that one can query. I could not find anywhere on the site a direct reference to querying by ZCTA; their 53 examples of geographic calls excluded this particular level. And why does a call for ZCTA require a state code when a zip code should be a unique identifier nationally?

The Tidyverse

I’ve certainly learned I have a long way to go with ggplot, as I noted in an earlier reflection. I look forward to developing a much greater command of this tool.

Structuring Markdowns

I believe I’m getting the hang of the basics of building a digestible markdown document, and I look forward to learning more. I’d also like to develop a better command of navigating a longer markdown notebook within RStudio to be able to quickly make adjustments in specific areas of the notebook.



Lab Instructions © Bev Wilson 2022 | Department of Urban + Environmental Planning | University of Virginia