This is a little demonstration of how to quickly load, reshape, and aggregate BEACON data from CSV files using a few handy R packages.
Questions or comments? Send email to david.holstius@berkeley.edu.
You'll need the latest data.table package, available from R-Forge. If you don't have it, install it like this:
install.packages("data.table", repos = "http://r-forge.r-project.org")
Here are the rest of the packages you'll need. You can get these from the usual CRAN repositories (type help("install.packages") for more).
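For example, this one-liner should fetch them all if you don't already have them:

install.packages(c("lubridate", "ggplot2", "scales"))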
suppressPackageStartupMessages({
    library(data.table)
    library(lubridate)
    library(ggplot2)
    library(scales)
})
## Warning: package 'data.table' was built under R version 2.15.3
Pick the file you want to load. It cannot be gzipped. The file won't have a header, so we explicitly set the column names.
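If your copy does happen to be gzipped, one option is to decompress it first. A minimal sketch, assuming a hypothetical ginko.csv.gz and the R.utils package:

# Decompress a (hypothetical) gzipped copy, keeping the .gz file around
R.utils::gunzip("ginko.csv.gz", remove = FALSE)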
if (interactive()) {
    nodefile_path <- file.choose()
} else {
    nodefile_path <- file.path("~", "Dropbox", "BEACON", "using-R", "ginko.csv")
}
nodefile_data <- fread(nodefile_path, header = FALSE)
setnames(nodefile_data, c("O3_V", "CO_V", "NO2_V", "press_V", "power_V", "temp_V",
                          "humid_V", "CO2_ppm", "temp_C", "timestamp"))
Here are the first and last few rows of the data. Notice the difference in the UTC offsets (-0800 at the start vs. -0700 at the end).
options(width = 120, digits = 3)
print(nodefile_data)
## O3_V CO_V NO2_V press_V power_V temp_V humid_V CO2_ppm temp_C timestamp
## 1: 0.117 -3.63 4.60 0.253 4.94 5.04 0.453 440 13.4 05-05-2012 00:06:44 -0800
## 2: 0.117 -3.63 4.60 0.250 4.92 5.03 0.453 445 13.4 05-05-2012 00:06:54 -0800
## 3: 0.117 -3.63 4.60 0.253 4.95 5.04 0.453 440 13.4 05-05-2012 00:07:04 -0800
## 4: 0.115 -3.63 4.60 0.250 4.94 5.04 0.453 445 13.4 05-05-2012 00:07:14 -0800
## 5: 0.117 -3.63 4.60 0.250 4.94 5.04 0.453 441 13.5 05-05-2012 00:07:24 -0800
## ---
## 2668156: 0.159 1.54 2.51 0.253 5.07 0.25 0.351 402 22.4 02-23-2013 15:11:37 -0700
## 2668157: 0.159 1.55 2.51 0.253 5.07 0.26 0.351 402 22.4 02-23-2013 15:11:42 -0700
## 2668158: 0.149 1.54 2.50 0.253 5.07 0.25 0.422 401 22.4 02-23-2013 15:11:47 -0700
## 2668159: 0.149 1.54 2.51 0.250 5.07 0.25 0.371 399 22.4 02-23-2013 15:11:52 -0700
## 2668160: 0.149 1.53 2.51 0.253 5.07 0.25 0.340 405 22.4 02-23-2013 15:11:57 -0700
Next we've got to parse the timestamps. Here's an example of how timestamps with different UTC offsets (Daylight Saving Time vs. Standard Time) get converted.
nodefile_timestamp_format <- "%m-%d-%Y %H:%M:%S %z"
as.POSIXct(
    c("03-09-2012 07:00:00 -0800",
      "03-10-2012 07:00:00 -0700"),
    format = nodefile_timestamp_format
)
## [1] "2012-03-09 07:00:00 PST" "2012-03-10 06:00:00 PST"
Converting timestamps that aren't in UTC takes a loooong time. If they were in UTC, we could use fasttime::fastPOSIXct, which is about 40-50x faster.
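For reference, here's what that fast path looks like. This is illustrative only, since our timestamps aren't ISO-formatted UTC strings, which is what fastPOSIXct assumes:

library(fasttime)
# Parses an ISO-formatted string as UTC, then displays it in local time
fastPOSIXct("2012-05-05 08:06:44", tz = "America/Los_Angeles")  # "2012-05-05 01:06:44 PDT"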
local_time <- as.POSIXct(nodefile_data[, timestamp], format = nodefile_timestamp_format)
nodefile_data[, "timestamp"] <- local_time
setkey(nodefile_data, timestamp)
Extract just the timestamp and CO2_ppm columns from the main data.table, creating a new data.table called co2_data.
summary(co2_data <- nodefile_data[, list(timestamp, CO2_ppm)])
## timestamp CO2_ppm
## Min. :2012-05-05 01:06:44.00 Min. :-999
## 1st Qu.:2012-08-05 13:07:37.50 1st Qu.: 392
## Median :2012-11-10 04:45:07.00 Median : 408
## Mean :2012-10-24 16:00:45.03 Mean : 127
## 3rd Qu.:2013-01-14 23:15:07.25 3rd Qu.: 432
## Max. :2013-02-23 14:11:57.00 Max. : 692
Replace the -999s, and anything else below 200 ppm (physically implausible for ambient CO2), with NA.
clean <- function(value) replace(value, which(value < 200), NA)
summary(co2_data <- co2_data[, CO2_ppm := clean(CO2_ppm)])
## timestamp CO2_ppm
## Min. :2012-05-05 01:06:44.00 Min. :376
## 1st Qu.:2012-08-05 13:07:37.50 1st Qu.:403
## Median :2012-11-10 04:45:07.00 Median :414
## Mean :2012-10-24 16:00:45.03 Mean :425
## 3rd Qu.:2013-01-14 23:15:07.25 3rd Qu.:443
## Max. :2013-02-23 14:11:57.00 Max. :692
## NA's :559212
Now aggregate. You can define your own grouping function; I made one called cut_day. You can also supply your own summarization/aggregation functions instead of mean, median, etc.: here, GM and GSD calculate the geometric mean and geometric standard deviation. The lower and upper quartiles, q25 and q75, are computed inline with quantile(), and n counts the valid (non-NA) observations in each group.
# Grouping function: truncate each timestamp to local midnight
cut_day <- function(x) as.POSIXct(trunc(x, "day"))
# Custom aggregation functions: geometric mean and geometric SD
GM <- function(x, ...) exp(mean(log(x), ...))
GSD <- function(x, ...) exp(sd(log(x), ...))
# Daily rollup
co2_1d <- co2_data[,
    list(
        mean = mean(CO2_ppm, na.rm = TRUE),
        median = median(CO2_ppm, na.rm = TRUE),
        sd = sd(CO2_ppm, na.rm = TRUE),
        q25 = quantile(CO2_ppm, 0.25, na.rm = TRUE),
        q75 = quantile(CO2_ppm, 0.75, na.rm = TRUE),
        GM = GM(CO2_ppm, na.rm = TRUE),
        GSD = GSD(CO2_ppm, na.rm = TRUE),
        n = length(na.omit(CO2_ppm))
    ),
    by = cut_day(timestamp)
]
head(co2_1d)
## cut_day mean median sd q25 q75 GM GSD n
## 1: 2012-05-05 430.1 427.1 14.61 417.2 442.0 429.8 1.034 8190
## 2: 2012-05-06 435.8 437.3 20.17 415.2 449.9 435.3 1.047 8549
## 3: 2012-05-07 443.6 445.7 21.88 422.7 458.7 443.1 1.050 8555
## 4: 2012-05-08 432.8 431.7 14.52 421.1 443.0 432.6 1.033 8580
## 5: 2012-05-09 426.8 425.0 12.58 416.7 435.9 426.6 1.030 8550
## 6: 2012-05-10 421.8 419.8 11.43 413.4 426.8 421.7 1.027 8550
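As a quick sanity check on the helper functions (made-up values, not drawn from the dataset):

cut_day(as.POSIXct("2012-05-05 13:45:00"))  # truncates to local midnight
GM(c(400, 410, 420))   # geometric mean, about 409.9
GSD(c(400, 410, 420))  # geometric SD, about 1.025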
A quick plot of the result, co2_1d, showing the median and the middle 50%:
fig <- ggplot(co2_1d, aes(x = cut_day))
fig <- fig + scale_y_continuous("Vaisala CO2 (ppm)")
fig <- fig + geom_ribbon(aes(ymin = q25, ymax = q75), alpha = 0.3)
fig <- fig + geom_line(aes(y = median), color = "black")
fig <- fig + ggtitle("Ginko: daily CO2 (middle 50% and median)")
show(fig)
Subsetting is fairly easy. Suppose we only want to look at data since December 2012.
t0 <- ISOdate(2012, 12, 01, tz = "America/Los_Angeles")  # note: ISOdate defaults to 12:00 noon
selected_co2_1d <- co2_1d[cut_day > t0, ]
Use the %+% operator to replace the data in the figure above.
fig <- fig %+% selected_co2_1d
Add custom date breaks of 1 month (major) and 1 week (minor).
fig <- fig + scale_x_datetime(
    "America/Los_Angeles",  # the first argument is the axis title
    breaks = date_breaks("1 month"),
    minor_breaks = date_breaks("1 week"),
    labels = date_format("%b\n%Y")
)
show(fig)
## Warning: Removed 1 rows containing missing values (geom_path).
That's all for this demo! Send questions or comments to david.holstius@berkeley.edu.