Using the BEACON CSV API

This vignette shows how to use the new CSV API to the BEACON database. It's a really quick example using R. If you want to add example code for Matlab, we can probably do that! Email me at david.holstius@berkeley.edu.

Why you might want to use the API:

Daily CO2 for October 2012

First we make a request to the server. It might take a few seconds to respond if you're requesting a new URL. The httr R package is simple and easy but not fast (this is a demo!). Responses are cached on the server, so second and third requests should be faster.

You can reuse the get_month() and get_day() functions without needing to understand the guts. I've wrapped them into a beacon R package, hoping that people find this dataflow useful.

Both get_month(...) and get_day(...) fetch data from the API, load it into a data.frame (R's native format), and parse the timestamps, which by default are stored in UTC. They'll show up in the timezone you provide, which by default is “America/Los_Angeles”. Let's peek at the data:

suppressPackageStartupMessages(library(beacon))
daily_means <- get_month("Vaisala_CO2_ppm", 2012, 10)
## Downloaded 185 datapoints from /avg/Vaisala_CO2_ppm/2012/10/1d/stacked.csv
head(daily_means)
##            metric           timestamp value            site
## 1 Vaisala_CO2_ppm 2012-09-30 10:00:26 454.0         BurckES
## 2 Vaisala_CO2_ppm 2012-09-30 10:00:27 408.4          WestEd
## 3 Vaisala_CO2_ppm 2012-09-30 10:00:28 437.1 InternationalES
## 4 Vaisala_CO2_ppm 2012-09-30 10:00:31 455.5     KorematsuES
## 5 Vaisala_CO2_ppm 2012-09-30 10:00:32 367.7       SkylineHS
## 6 Vaisala_CO2_ppm 2012-09-30 10:00:33 429.3   StElizabethHS
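If you're curious about the load-and-parse step described above, here's a rough sketch in base R. It is not the package's actual code. The CSV layout and the timestamp string format are assumptions, and the two rows are a toy stand-in for a server response.

```r
# Toy stand-in for the CSV body the server might return (assumed layout).
csv_text <- "metric,timestamp,value,site
Vaisala_CO2_ppm,2012-09-30 17:00:26,454.0,BurckES
Vaisala_CO2_ppm,2012-09-30 17:00:27,408.4,WestEd"

# Load into a data.frame. With a live server you would read from a URL instead.
dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Parse the timestamp strings as UTC ...
dat$timestamp <- as.POSIXct(dat$timestamp, tz = "UTC")

# ... then switch the display timezone. The underlying instants do not change.
attr(dat$timestamp, "tzone") <- "America/Los_Angeles"

print(dat)
```

The instant in time stays the same throughout; only the timezone used for printing changes.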

Unfortunately there are still 0s and -999s in the database, which means the averaged values will be artificially low. We need to clean those out. But here's a quick plot with those artifacts evident:

suppressPackageStartupMessages(require(ggplot2))
suppressPackageStartupMessages(require(scales))
fig <- qplot(timestamp, value, color = site, geom = "line", data = daily_means)
fig <- fig + scale_y_continuous("Vaisala_CO2_ppm", limits = c(0, 600))
fig <- fig + scale_x_datetime("America/Los_Angeles", breaks = date_breaks("1 week"), 
    minor_breaks = date_breaks("1 day"), labels = date_format("%a %m/%d"))
show(fig + ggtitle("Daily Means"))
## Warning: Removed 14 rows containing missing values (geom_path).

[Figure: daily means of Vaisala_CO2_ppm by site, October 2012]
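To see why the 0s and -999s matter, here's a tiny worked example. The readings are made up for illustration, and I'm assuming that 0 and -999 are the only placeholder values.

```r
# Toy readings: three plausible CO2 values plus two placeholder values.
readings <- c(410, 420, 430, 0, -999)

# Mean with the placeholders included.
mean_with_artifacts <- mean(readings)

# Drop the placeholders, then average again.
valid <- readings[!(readings %in% c(0, -999))]
mean_clean <- mean(valid)

print(c(with_artifacts = mean_with_artifacts, clean = mean_clean))
```

The first mean comes out to 52.2 and the second to 420. Note that the filtering has to happen before averaging, so dropping rows from an already-averaged download won't fully repair the values.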

What Happened

OK, let's break it down. That URL had three important parts:

The metric name is Vaisala_CO2_ppm. This is the same name that the metric has in our time-series database, OpenTSDB. We still need to load in data for more metrics (like board_V).

The date range is the month of October 2012 (“2012/10”) and the averaging interval is daily (“1d”). You can also request hourly averages using get_month(metric, year, month, "1h"), and 1-hour or 5-minute (“5m”) averages using get_day(metric, year, month, day, "5m"). For example:

hourly_means <- get_day("Vaisala_CO2_ppm", 2012, 10, 1)
## Downloaded 174 datapoints from
## /avg/Vaisala_CO2_ppm/2012/10/01/1h/stacked.csv
fig <- qplot(timestamp, value, color = site, geom = "line", data = hourly_means)
fig <- fig + scale_y_continuous("Vaisala_CO2_ppm")
fig <- fig + scale_x_datetime("America/Los_Angeles")
show(fig + ggtitle("Hourly Means"))

[Figure: hourly means of Vaisala_CO2_ppm by site, October 1, 2012]
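Here's a sketch of how the parts of the URL fit together, based only on the two paths printed in the download messages above. The helper beacon_path() is hypothetical and not part of the beacon package; it just reproduces those two paths.

```r
# Hypothetical helper (not in the beacon package): build the request path
# from the metric name, the date range, and the averaging interval.
beacon_path <- function(metric, year, month, day = NULL, interval = "1d") {
  date_part <- if (is.null(day)) {
    sprintf("%d/%02d", year, month)             # a whole month, e.g. 2012/10
  } else {
    sprintf("%d/%02d/%02d", year, month, day)   # a single day, e.g. 2012/10/01
  }
  sprintf("/avg/%s/%s/%s/stacked.csv", metric, date_part, interval)
}

month_path <- beacon_path("Vaisala_CO2_ppm", 2012, 10)
day_path <- beacon_path("Vaisala_CO2_ppm", 2012, 10, day = 1, interval = "1h")
print(c(month_path, day_path))
```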

Both get_month(...) and get_day(...) will actually fetch a little more data than is needed, so if you need exact date ranges, use subset(...) to trim the data based on the timestamps.
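Trimming with subset(...) might look something like this. The data.frame is a toy stand-in for what get_month() returns (the values are made up), and I'm assuming you want October in local time.

```r
tz <- "America/Los_Angeles"

# Toy stand-in for a downloaded data.frame; note the extra rows at both ends.
dat <- data.frame(
  timestamp = as.POSIXct(c("2012-09-30 10:00:26", "2012-10-01 10:00:26",
                           "2012-10-31 10:00:26", "2012-11-01 10:00:26"),
                         tz = tz),
  value = c(454.0, 452.3, 449.8, 451.1)
)

# Keep timestamps in [start, end): midnight on Oct 1 up to midnight on Nov 1.
start <- as.POSIXct("2012-10-01", tz = tz)
end <- as.POSIXct("2012-11-01", tz = tz)
october <- subset(dat, timestamp >= start & timestamp < end)

print(october)
```

Using a half-open interval means a timestamp falling exactly at midnight on November 1 is excluded.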

Since these are public-facing URLs, they are limited in the amount of data they return, so (a) the server doesn't freak out and (b) the caching performs well. Let me know if you want to be able to request larger ranges at higher resolution.

INSTALLING

You can install the API bindings for R using devtools. If you don't have devtools, this will install it for you:

install.packages("devtools")

Once you have that, you can install the beacon package from my GitHub repository:

library(devtools)
install_github("holstius/beacon")

Then you can load it at any time using:

library(beacon)

NOTES

The format is “stacked” or “long”. Talk to me if you need support for “unstacked” or “wide”. The pandas project has awesome support for stacking/unstacking data in Python, and an excellent description of the difference:

http://pandas.pydata.org/pandas-docs/dev/reshaping.html#reshaping-by-stacking-and-unstacking
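In the meantime, you can unstack the data yourself in R. Here's a minimal sketch using base R's reshape(). The data.frame is a toy example with made-up values; the column names follow the output shown earlier.

```r
# Toy "stacked" (long) data: one row per timestamp per site.
long <- data.frame(
  timestamp = rep(c("2012-10-01", "2012-10-02"), each = 2),
  site = rep(c("BurckES", "WestEd"), times = 2),
  value = c(454.0, 408.4, 450.1, 410.2)
)

# "Unstacked" (wide) data: one row per timestamp, one value column per site.
wide <- reshape(long, idvar = "timestamp", timevar = "site", direction = "wide")

print(wide)
```

The result has one value.<site> column for each site, for example value.BurckES and value.WestEd.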