As was mentioned earlier, Socrata is a platform that is used by many local governments to publish open data. The Socrata Open Data API (SODA API) allows us to access these data resources from an R script or notebook in a standardized way. In the video you watched, the API was likened to a waiter or liaison who carries requests and responses back and forth between a client computer (i.e., you) and server computer (i.e., the open data portal). The documentation for the SODA API can be found here and begins with API endpoints, which are essentially Uniform Resource Locators (URLs) that provide access to data. Without getting into too much detail at the moment, we use the Hypertext Transfer Protocol (HTTP) to send requests and receive responses. Most often, we are using the GET method to send queries and retrieve data.
In some cases, an R package exists that allows us to interact with an API without having to construct queries that conform to the API’s requirements and pass them using a package like httr (constructing queries by hand is what the ungraded exercise we started in class last time asked you to do from your web browser). The RSocrata package that we installed and loaded above does this “dirty work” of formatting the underlying HTTP queries, allowing us to interact with open data portals that use the Socrata platform.
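To make the waiter analogy concrete, here is a rough sketch of the kind of raw request that RSocrata formats on our behalf. The endpoint is the SFPD incident dataset used later in this notebook, $limit is a standard SODA query parameter, and the resp and incidents names are just illustrative.

library(httr)
library(jsonlite)
# Hand-build a SODA request: the endpoint URL plus a $limit parameter
resp <- GET("https://data.sfgov.org/resource/wg3w-h783.json",
            query = list("$limit" = 10))
# The portal answers with JSON, which we parse into a data frame
incidents <- fromJSON(content(resp, "text"))
head(incidents)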
Let’s take a closer look…
help(package="RSocrata")
The “Inspect RSocrata Documentation” code chunk simply opens a page listing the seven functions included in this package. Take a look at this page (i.e., in the Help tab in the lower-right corner), then proceed to Exercise 1 below.
Review the documentation page that appears after running the code chunk above.
sf_data <- ls.socrata("https://data.sfgov.org/browse")
head(sf_data)
typeof(sf_data)
## [1] "list"
dim(sf_data)
## [1] 650 14
There are 650 rows and 14 columns.
colnames(sf_data)
## [1] "accessLevel" "landingPage" "issued" "@type" "modified"
## [6] "keyword" "contactPoint" "publisher" "identifier" "description"
## [11] "title" "distribution" "license" "theme"
There are more detailed instructions on what to submit and how at the end of this notebook, but essentially you should add code chunks and text chunks (i.e., Markdown sections) to your locally saved copy of this R Notebook that perform the tasks and answer the questions posed above.
Next, return to the documentation page for the RSocrata package. The ls.socrata function lists every dataset available on a portal, and the read.socrata function can then be used to retrieve a particular dataset from the San Francisco open data portal. In addition to indexing objects like lists or data frames by position using bracket notation, we can also refer to variables by name, if they exist, using the $ operator. Let’s sort the inventory and peruse the types of information available. The table function provides a count of instances within each category.
sf_contents <- ls.socrata("https://data.sfgov.org/limitTo=datasets")
names(sf_contents)
head(sf_contents)
sort(sf_contents$title)
table(sf_contents$theme)
Execute the code chunk above and make sure you understand the code, then proceed with the tasks below:
# Locate the dataset in the inventory by its title
police_calls_2 <- sf_contents[sf_contents$title == "Police Department Incident Reports: 2018 to Present", ]
# Pull the records from the dataset's JSON endpoint (fromJSON requires
# jsonlite; filter() and ggplot() below come from dplyr and ggplot2)
library(jsonlite)
police_calls_2 <- fromJSON("https://data.sfgov.org/resource/wg3w-h783.json")
police_calls_TEND <- filter(police_calls_2, analysis_neighborhood == "Tenderloin")
ggplot(data = police_calls_TEND) +
  geom_bar(mapping = aes(x = incident_year, fill = incident_day_of_week), position = "dodge")
ggsave("Tenderloin-Police-Calls-byYear-byDay.png", units = "in", width = 16, height = 8)
Run table(police_calls_2$analysis_neighborhood) to see how the calls are distributed across neighborhoods, pick another neighborhood, and compare it with the Tenderloin in terms of police calls.
ggplot(data = police_calls_TEND) +
geom_bar(mapping = aes(x = incident_year), position = "dodge") +
ggtitle("Police Calls to the Tenderloin")
ggsave("TEND-Police-Calls-byYear.png", units = "in", width = 16, height = 8)
police_calls_SOMA <- filter(police_calls_2, analysis_neighborhood == "South of Market")
ggplot(data = police_calls_SOMA) +
geom_bar(mapping = aes(x = incident_year), position = "dodge") +
ggtitle("Police Calls to South of Market")
ggsave("SOMA-Police-Calls-byYear.png", units = "in", width = 16, height = 8)
The number of police calls to the Tenderloin over these three years has been much more volatile than the number of calls to the South of Market (SoMa) neighborhood, which recorded the same total number of calls in each of the three years captured. The Tenderloin had the highest number of calls (30) in 2018, dipped to fewer than 15 in 2019, and then rebounded in 2020 to a total (27) that nearly matched the 2018 figure.
Take some time to reflect on this portion of the exercise, then proceed.
This portion of the exercise was relatively straightforward, but it made me realize that I have a lot to learn to use ggplot more effectively. I struggled, and ultimately failed, to figure out how to sort the days of the week into the correct order; I hope we will learn how to do so when working with dates. Until then, my bar chart displays the days of the week in an arbitrary order beginning with Friday! Spending more time learning how to add value labels and chart labels, adjust values, draw trend lines, and so on will also be very valuable.
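For the record, one common fix, sketched here under the assumption that incident_day_of_week stores full day names, is to recode the column as a factor with explicitly ordered levels before plotting:

# Define the desired ordering, then recode the column as a factor
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
police_calls_TEND$incident_day_of_week <- factor(
  police_calls_TEND$incident_day_of_week, levels = day_levels)
# ggplot will now draw the bars and legend in Monday-to-Sunday order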
Rather than downloading data from the official U.S. Census Bureau website, we can also access data programmatically from R. The first step is to take a quick look at the API documentation, then request an API key by visiting this site. You should receive a response within a few minutes, but be sure to safeguard your API key because it is tied to each individual user.
In order to get a sense for how API queries work, open a web browser and type the following into the search bar, inserting your own personal Census API key where indicated:
https://api.census.gov/data/2017/acs/acs5?key=7f5b3f883e24d9d3a7f207f76bfb731d19124299&get=B01003_001E&for=zip%20code%20tabulation%20area:94114,94110
The resulting browser window should contain the results of the query above in JavaScript Object Notation (JSON) format. This is a compact way to store and transmit information over the internet that we will return to later in the semester. For now, we just need to know that the response reports the total population (variable B01003_001E) for each of the two ZCTAs in our request.
So, we can send queries to the Census API for the 2013-2017 ACS 5-Year Estimates directly from a web browser, but it is more efficient to do this from R. But first of all, what are the different components of the query we just executed?
- The https://api.census.gov/data/2017/acs/acs5? component is the base URL. It tells the Census servers that we are interested in 5-Year ACS data that ends in 2017 (i.e., ACS 2013-2017).
- The key=[YOUR_API_KEY] component is the unique identifier for the client (i.e., your browser/script) that is making the request.
- The get=B01003_001E component is the variable name (Total population). For a full list, see here.
- The for=zip%20code%20tabulation%20area:94114,94110 component specifies the census geography: the ZCTAs for the Castro/Noe Valley (94114) and Inner Mission/Bernal Heights (94110) areas of San Francisco. Note that these areas are generally southwest of the Tenderloin. The %20 is a hexadecimal representation of the space character; there are quite a few special characters that need to be properly encoded if they are being passed to the server.
- The components are chained together with the & character.

This structure is unique to Census APIs and endpoints. If you want to interact with other APIs, you will first have to refer to their documentation and understand how to properly format a query. It is also possible that APIs change over time, causing your code to stop working. Luckily, it is usually not onerous to make the necessary tweaks once you understand how the API has changed.
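To see the encoding at work, here is a quick sketch that assembles the same query string in R. URLencode comes from the built-in utils package, and [YOUR_API_KEY] remains a placeholder:

# Build the 2017 ACS 5-year query piece by piece
base.url <- "https://api.census.gov/data/2017/acs/acs5?"
geo <- URLencode("zip code tabulation area")  # spaces become %20
query <- paste0(base.url, "key=", "[YOUR_API_KEY]",
                "&get=B01003_001E",
                "&for=", geo, ":94114,94110")
query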
Still using a web browser, practice what we have learned here by constructing the proper API query in the search bar to:
RESPONSE: '[["B01003_001E","state","zip code tabulation area"], ["73737","06","94110"], ["34561","06","94114"]]'
2017 Population Estimates
94114 - 34,561 ; 94110 - 73,737
RESPONSE: '[["B01003_001E","state","zip code tabulation area"], ["34754","06","94114"], ["74161","06","94110"]]'
2018 Population Estimates: 94114 - 34,754 ; 94110 - 74,161
"DP03_0005PE" which is located in a different API with a slightly different base URL hereRESPONSE: [[“DP03_0005PE”,“state”,“zip code tabulation area”], [“3.2”,“06”,“94114”], [“3.1”,“06”,“94110”]]
2018 Percent Unemployed 94114 - 3.2% ; 94110 - 3.1%
To query the City of San Francisco as a whole place rather than by ZCTA, the geography components become &for=place:67000&in=state:06.
RESPONSE: '[["DP03_0005PE","state","place"], ["3.3","06","67000"]]'
2018 Percent Unemployed: San Francisco - 3.3%
Take some time to experiment with one or more of the other variables, then proceed with the exercise.
Using the examples provided by the Census, we can retrieve a variety of variables for San Francisco, Oakland, and Berkeley. The requests below are written out one at a time; a compact loop version is sketched at the end of this section.
install.packages("httr")
## Installing package into 'C:/Users/Student/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'httr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Student\AppData\Local\Temp\RtmpA5vUGU\downloaded_packages
library(httr)
## Warning: package 'httr' was built under R version 4.0.5
library(jsonlite)
baseurl <- "https://api.census.gov/data/2018/acs/acs5?"
param1 <- "get="
param2 <- "&for=place:"
param3 <- "&in=state:06"
key <- "&key=f9f31e09fed0f44d23a8a354469e461df98f34cb"
# These are the place FIPS codes for San Francisco, Oakland, and Berkeley
places.list <- paste("67000", "53000", "06000", sep = ",")
# Total population and median household income
vars.list <- c("B01003_001E", "B19013_001E")
vars.names <- c("Total population", "Median household income")
# Request the first variable for all three places, parse the JSON
# response, and promote its first row to column names
req <- httr::GET(paste0(baseurl, param1, vars.list[1],
    param2, places.list, param3, key))
req.json <- fromJSON(content(req, "text"), flatten=TRUE)
req.df <- as.data.frame(req.json)
col.names <- req.json[1,]
df <- req.df[2:4, ]   # rows 2-4 hold the data; row 1 is the header
colnames(df) <- col.names
df
# Repeat the same steps for the second variable
req.2 <- httr::GET(paste0(baseurl, param1, vars.list[2],
    param2, places.list, param3, key))
req.json.2 <- fromJSON(content(req.2, "text"), flatten=TRUE)
req.df.2 <- as.data.frame(req.json.2)
col.names.2 <- req.json.2[1,]
df.2 <- req.df.2[2:4, ]
colnames(df.2) <- col.names.2
df.2
# Drop the duplicated state column, then join the two results by place
df.2 <- subset(df.2, select = -state)
merged.df <- merge(df, df.2, by = "place")
merged.df
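As promised above, the repeated request/parse/clean steps can be collapsed into a loop over vars.list. This is a sketch built from the objects already defined; dfs and merged.loop are my own names:

# Loop over the variable list, repeating the GET/parse/clean steps,
# then merge the per-variable results by place
dfs <- list()
for (i in seq_along(vars.list)) {
  r <- httr::GET(paste0(baseurl, param1, vars.list[i],
                        param2, places.list, param3, key))
  j <- fromJSON(content(r, "text"), flatten = TRUE)
  d <- as.data.frame(j)
  colnames(d) <- j[1, ]
  d <- d[-1, ]                                # drop the header row
  if (i > 1) d <- subset(d, select = -state)  # keep state only once
  dfs[[i]] <- d
}
merged.loop <- Reduce(function(x, y) merge(x, y, by = "place"), dfs)
merged.loop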
This lab was a useful introduction to navigating the Census API and working in the tidyverse, as well as structuring a digestible markdown document.
Navigating the Census documentation was rather confusing, given the repetitive structure of their site and the many, many different datasets and geographic levels that one can query. I was surprised that I could not find a direct reference anywhere on the site to querying by ZCTA; their 53 examples of geographic calls excluded this particular level. And why does a call for a ZCTA require a state code when a zip code should be a unique identifier nationally?
I’ve certainly learned I have a long way to go with ggplot, as I noted in an earlier reflection. I look forward to developing a much greater command of this tool.
I believe I’m getting the hang of the basics of building a digestible markdown document, and I look forward to learning more. I’d also like to develop a better command of navigating a longer markdown notebook within RStudio to be able to quickly make adjustments in specific areas of the notebook.