Working with NODEFILE data in R

This is a little demonstration of how to quickly load, reshape, and aggregate BEACON data from CSV files using a few handy R packages.

Questions or comments? Send email to david.holstius@berkeley.edu.

Requirements

You'll need the latest data.table package, available from R-Forge. If you don't have it, install it like this:

install.packages("data.table", repos = "http://r-forge.r-project.org")

Here are the rest of the packages you'll need. You can get these from the usual CRAN repositories (type help("install.packages") for more).

suppressPackageStartupMessages({
    library(data.table)
    library(lubridate)
    library(ggplot2)
    library(scales)
})

Loading CSV data

Pick the file you want to load. It can't be gzipped (see the sketch below for one workaround). The file won't have a header row, so we set the column names explicitly.

if (interactive()) {
    nodefile_path <- file.choose()
} else {
    nodefile_path <- file.path("~", "Dropbox", "BEACON", "using-R", "ginko.csv")
}
nodefile_data <- fread(nodefile_path, header = FALSE)
setnames(nodefile_data, c("O3_V", "CO_V", "NO2_V", "press_V", "power_V", "temp_V", 
    "humid_V", "CO2_ppm", "temp_C", "timestamp"))

Here are the first and last few rows of the data. Notice the difference in the GMT offsets: -0800 in the first rows, -0700 in the last.

options(width = 120, digits = 3)
print(nodefile_data)
##           O3_V  CO_V NO2_V press_V power_V temp_V humid_V CO2_ppm temp_C                 timestamp
##       1: 0.117 -3.63  4.60   0.253    4.94   5.04   0.453     440   13.4 05-05-2012 00:06:44 -0800
##       2: 0.117 -3.63  4.60   0.250    4.92   5.03   0.453     445   13.4 05-05-2012 00:06:54 -0800
##       3: 0.117 -3.63  4.60   0.253    4.95   5.04   0.453     440   13.4 05-05-2012 00:07:04 -0800
##       4: 0.115 -3.63  4.60   0.250    4.94   5.04   0.453     445   13.4 05-05-2012 00:07:14 -0800
##       5: 0.117 -3.63  4.60   0.250    4.94   5.04   0.453     441   13.5 05-05-2012 00:07:24 -0800
##      ---                                                                                          
## 2668156: 0.159  1.54  2.51   0.253    5.07   0.25   0.351     402   22.4 02-23-2013 15:11:37 -0700
## 2668157: 0.159  1.55  2.51   0.253    5.07   0.26   0.351     402   22.4 02-23-2013 15:11:42 -0700
## 2668158: 0.149  1.54  2.50   0.253    5.07   0.25   0.422     401   22.4 02-23-2013 15:11:47 -0700
## 2668159: 0.149  1.54  2.51   0.250    5.07   0.25   0.371     399   22.4 02-23-2013 15:11:52 -0700
## 2668160: 0.149  1.53  2.51   0.253    5.07   0.25   0.340     405   22.4 02-23-2013 15:11:57 -0700

Parsing timestamps

Next we've got to parse the timestamps. Here's an example of how timestamps with different GMT offsets (Daylight Saving Time vs. Standard Time) are converted.

nodefile_timestamp_format <- "%m-%d-%Y %H:%M:%S %z"
as.POSIXct(
  c("03-09-2012 07:00:00 -0800",
    "03-10-2012 07:00:00 -0700"
    ),
  format = nodefile_timestamp_format
)
## [1] "2012-03-09 07:00:00 PST" "2012-03-10 06:00:00 PST"

Converting timestamps that aren't in UTC takes a loooong time. If they were in UTC, we could use fasttime::fastPOSIXct, which is about 40-50x faster (sketch below).

local_time <- as.POSIXct(nodefile_data[, timestamp], format = nodefile_timestamp_format)
nodefile_data[, "timestamp"] <- local_time
setkey(nodefile_data, timestamp)
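
For comparison, here's roughly what the fast path would look like. This is just a sketch: it assumes the logger had written ISO-8601 strings in UTC (like "2012-05-05 08:06:44"), which ours are not.

# Sketch: fastPOSIXct parses ISO-8601 strings, assuming UTC;
# the tz argument only affects how the result is displayed
utc_time <- fasttime::fastPOSIXct("2012-05-05 08:06:44", tz = "UTC")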

Aggregating CO2

Extract just the timestamp and CO2_ppm columns from the main data.table, creating a new data.table called co2_data.

summary(co2_data <- nodefile_data[, list(timestamp, CO2_ppm)])
##    timestamp                         CO2_ppm    
##  Min.   :2012-05-05 01:06:44.00   Min.   :-999  
##  1st Qu.:2012-08-05 13:07:37.50   1st Qu.: 392  
##  Median :2012-11-10 04:45:07.00   Median : 408  
##  Mean   :2012-10-24 16:00:45.03   Mean   : 127  
##  3rd Qu.:2013-01-14 23:15:07.25   3rd Qu.: 432  
##  Max.   :2013-02-23 14:11:57.00   Max.   : 692

Replace all the -999s with NA. The clean function below actually treats anything under 200 ppm as invalid, which catches the -999 sentinel along with any other implausibly low readings.

# Values below 200 ppm (including the -999 sentinel) become NA
clean <- function(value) replace(value, which(value < 200), NA)
summary(co2_data <- co2_data[, CO2_ppm := clean(CO2_ppm)])
##    timestamp                         CO2_ppm      
##  Min.   :2012-05-05 01:06:44.00   Min.   :376     
##  1st Qu.:2012-08-05 13:07:37.50   1st Qu.:403     
##  Median :2012-11-10 04:45:07.00   Median :414     
##  Mean   :2012-10-24 16:00:45.03   Mean   :425     
##  3rd Qu.:2013-01-14 23:15:07.25   3rd Qu.:443     
##  Max.   :2013-02-23 14:11:57.00   Max.   :692     
##                                   NA's   :559212
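
A quick sanity check of clean on a toy vector:

clean(c(-999, 150, 410, 692))  # NA NA 410 692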

Now aggregate. You can define your own grouping function; I made one called cut_day. You can also define your own summary functions instead of using mean, median, and so on: I created two custom functions, GM and GSD, to calculate the geometric mean and geometric standard deviation, and computed the lower and upper quartiles, q25 and q75, with quantile. Finally, I defined the statistic n as the number of valid (non-NA) observations in the underlying data.

# Grouping function
cut_day <- function(x) as.POSIXct(trunc(x, "day"))

# Custom aggregation functions
# (Geometric mean and standard deviation)
GM <- function(x, ...) exp(mean(log(x), ...))
GSD <- function(x, ...) exp(sd(log(x), ...))

# Daily rollup
co2_1d <- co2_data[,
  list(
    mean   = mean(CO2_ppm, na.rm=TRUE),
    median = median(CO2_ppm, na.rm=TRUE),
    sd     = sd(CO2_ppm, na.rm=TRUE),
    q25    = quantile(CO2_ppm, 0.25, na.rm=TRUE),
    q75    = quantile(CO2_ppm, 0.75, na.rm=TRUE),
    GM     = GM(CO2_ppm, na.rm=TRUE),
    GSD    = GSD(CO2_ppm, na.rm=TRUE),
    n      = length(na.omit(CO2_ppm))
  ), 
  cut_day(timestamp)
]
head(co2_1d)
##       cut_day  mean median    sd   q25   q75    GM   GSD    n
## 1: 2012-05-05 430.1  427.1 14.61 417.2 442.0 429.8 1.034 8190
## 2: 2012-05-06 435.8  437.3 20.17 415.2 449.9 435.3 1.047 8549
## 3: 2012-05-07 443.6  445.7 21.88 422.7 458.7 443.1 1.050 8555
## 4: 2012-05-08 432.8  431.7 14.52 421.1 443.0 432.6 1.033 8580
## 5: 2012-05-09 426.8  425.0 12.58 416.7 435.9 426.6 1.030 8550
## 6: 2012-05-10 421.8  419.8 11.43 413.4 426.8 421.7 1.027 8550
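
The same idiom works at other resolutions. An hourly rollup, for example, just needs a different grouping function (a sketch, keeping only a couple of the statistics above):

# Hourly grouping function (same idiom as cut_day)
cut_hour <- function(x) as.POSIXct(trunc(x, "hour"))

co2_1h <- co2_data[,
  list(
    median = median(CO2_ppm, na.rm = TRUE),
    n      = length(na.omit(CO2_ppm))
  ),
  cut_hour(timestamp)
]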

A quick plot of co2_1d, showing the daily median and middle 50%:

fig <- ggplot(co2_1d, aes(x = cut_day))
fig <- fig + scale_y_continuous("Vaisala CO2 (ppm)")
fig <- fig + geom_ribbon(aes(ymin = q25, ymax = q75), alpha = 0.3)
fig <- fig + geom_line(aes(y = median), color = "black")
fig <- fig + ggtitle("Ginko: daily CO2 (middle 50% and median)")
show(fig)

[Figure: Ginko daily CO2, median and middle 50% (chunk plot_co2_1d)]

Subsetting and customizing plots

Subsetting is fairly easy. Suppose we only want to look at data since December 2012.

t0 <- ISOdate(2012, 12, 1, hour = 0, tz = "America/Los_Angeles")  # ISOdate defaults to noon, so ask for midnight
selected_co2_1d <- co2_1d[cut_day >= t0, ]

Use the %+% operator to replace the data in the figure above.

fig <- fig %+% selected_co2_1d

Add custom date breaks of 1 month (major) and 1 week (minor).

fig <- fig + scale_x_datetime(
  "America/Los_Angeles",  # the first argument is the axis title
  breaks = date_breaks("1 month"),
  minor_breaks = date_breaks("1 week"),
  labels = date_format("%b\n%Y")
)
show(fig)
## Warning: Removed 1 rows containing missing values (geom_path).

[Figure: daily CO2 since December 2012, with monthly and weekly date breaks (chunk add_date_breaks)]

Conclusion

That's all for this demo! Send questions or comments to david.holstius@berkeley.edu.