This notebook contains examples of how to process and plot weather data from the SNBS weather stations. The code in this notebook assumes that data is in “./data/” directory relative to the location of this file.

Before going further, load the required packages:

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.3.0
## v tibble  2.0.1     v dplyr   0.7.8
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(RColorBrewer)

Now let’s take a look at the data files.

The data is recorded by the Campbell Scienfitific CR10X dataloggers onto SM4M storage modules. I extracted the records using the CSI SMS.exe software, to write out csv files. Some of the storage modules were corrupted and required multiple readouts. For each storage module, I extracted as many files as possible, concatenated them, and then deleted non-unique lines. There were some bad lines, where commas were missing or two records were written to a single line. I used standard linux tools like grep and sed to search for lines that were improperly formatted and then manually either corrected them or deleted them if I couldn’t correct them. This process was done mostly on the linux command line and I won’t cover the details here.

Once that process was done, we are left with “.dat” csv file for each weather station. Here are the files:

list.files("./data/")
##  [1] "101Whlr_low217_09.DAT"  "101Whlr_low319_07.DAT" 
##  [3] "201Whlr_high318_07.DAT" "301LV_low205_08.DAT"   
##  [5] "301LV_low224_09.DAT"    "301LV_low301_07.DAT"   
##  [7] "301LV_low306_07.DAT"    "401LV_high231_09.DAT"  
##  [9] "401LV_high310_07.DAT"   "501Bxtr_low088_08.DAT" 
## [11] "501Bxtr_low187_07.DAT"  "501Bxtr_low198_08.DAT" 
## [13] "501Bxtr_low211_09.DAT"  "bxl_uniq.dat"          
## [15] "lvh.dat"                "lvl_cleaned.txt"       
## [17] "wheel_lo.dat"

Let’s take a look at the format of one of these data files:

read_lines("./data/lvh.dat", n_max = 11)
##  [1] "401,2009,231,1500,13.3,15.92,19.36,24.99,7.63,4.125,4.125,18.15,36.05,-.027,18.38"
##  [2] "401,2009,231,1600,13.31,15.93,17.84,25.04,8.43,4.47,4.47,338.1,39.05,-.015,.584"  
##  [3] "401,2009,231,1700,13.33,15.92,15.69,23.92,8.43,5.02,5.02,316.9,21.79,-.005,0"     
##  [4] "401,2009,231,1800,13.17,15.91,14.46,21.93,7.31,4.855,4.855,323.9,16.16,-.01,0"    
##  [5] "401,2009,231,1900,13.01,15.91,13.3,19.42,9.39,4.494,4.494,49.64,54.59,-.006,0"    
##  [6] "401,2009,231,2000,12.91,15.9,12.02,16.73,13.46,7.48,7.48,64.77,19.51,-.006,0"     
##  [7] "401,2009,231,2100,12.84,15.89,11.3,14.56,11.79,7.34,7.34,63.89,13.33,-.007,0"     
##  [8] "401,2009,231,2200,12.8,15.88,10.82,13.12,11.87,6.484,6.484,61.3,26.46,-.008,0"    
##  [9] "401,2009,231,2300,12.77,15.88,10.73,12.13,9.31,3.639,3.639,83.3,48.61,-.008,0"    
## [10] "401,2009,231,2400,12.75,15.88,10.35,11.33,6.673,2.303,2.303,80.6,74.4,-.005,0"    
## [11] "402,2009,231,2400,.011,.01,.01,25.16,7.13,20.73,7.29"

We can see two types of records stored in these files. One type of record has an initial entry of “401” and one of “402”. These correspond to two different recording schedules on the datalogger. The first entry, x01 or x02 is the array label. In our case, x01 records are hourly data and x02 records are daily data. The first digit of the array entry is different for each datalogger, or should be. When uploading data it is important to manually record which station the storage module came from. It is unwise to assume that datalogger was programmed to correctly label the arrays!

We’re now ready to import data into R. Here are functions to translate a .dat file into a long-formatted dataframe.

# fields2datetime
# create single datetime field out of separate year, hour, time fields
fields2datetime = function(year=2017, day=0, time=0) {
  TZ_PST = "US/Pacific"
  d = make_datetime(year, 1, 1, hour=time/100, min=time %% 100, tz=TZ_PST)
  yday(d) = day
  return(d)
}

# dat2df
# read a Campbell .dat file as input and return a long dataframe. 
dat2df = function(datFile) {
  rawDat = read.csv(datFile, header=F, sep=",")
  names(rawDat)[1:4] = c("array", "year", "day", "time") # label the common columns
  rawDat = rawDat %>% mutate(dt = fields2datetime(year, day, time))
  
  # Split data into 2 separate dataframes based on array:
  datArray1 = rawDat[ (rawDat$array %% 10) == 1 , c(1,16,5:15)]
  datArray2 = rawDat[ (rawDat$array %% 10) == 2 , c(1,16,5:11)]
  
  # label the data fields, detailed description in comments
  array1Labels = c("array", "datetime","batV","mSec4","tAir","tGrd","wSpdMax","wSpd","wSpd2","wDir","wUnk","snwD","rain")
  array2Labels = c("array", "datetime","volmSlMax","volmSlMin","volmSl","tGrdMax","tGrdMin","tAirMax","tAirMin")
  names(datArray1) = array1Labels
  names(datArray2) = array2Labels
  array1Long = datArray1 %>% 
    gather(key = "record", value = "value", -datetime, -array)
  array2Long = datArray2 %>% 
    gather(key = "record", value = "value", -datetime, -array)
  datDF = full_join(array1Long, array2Long)
  datDF = datDF %>% arrange(datetime) 
  datDF = datDF %>% filter(!is.na(datetime)) # remove invalid datetimes
  return(datDF) 
}

lvh = dat2df("./data/lvh.dat")
## Joining, by = c("array", "datetime", "record", "value")
head(lvh)

I want to combine the data from all the stations into a single dataframe. We have everything we need to do that with our current dataframe format, but I don’t like how the station ID is currently denoted only as the first digit of the array label. I want to add another column with an actual human-readable station name.

First let’s define the names and ids:

station = c("Wheeler_Lo", "Wheeler_Hi", "LeeVining_Lo", "LeeVining_Hi", "Baxter_Lo")
stn = c("whl", "whh", "lvl", "lvh", "bxl")
id = c(1,2,3,4,5)

stations = data.frame(station, stn, id, stringsAsFactors=FALSE)
stations

Now we can concatenate all of our data and add the labels. I’m also going to change the array field to record_type, which will be hourly or daily.

# make list of all files:
fileList = list.files("./data/", full.names = TRUE)

allDats = lapply(fileList, dat2df)
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
wxAll = allDats %>% reduce(full_join) %>% distinct()
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
## Joining, by = c("array", "datetime", "record", "value")
wxAll = wxAll %>% mutate(id = array %/% 100 )
wxAll = left_join(wxAll, stations, by="id")

wxAll = wxAll %>% mutate(type = array %% 10 ) 

head(wxAll)

There is one more problem, we have some duplicate data points. I don’t know how best to deal with this, but I will just keep whichever one comes first:

# These should be equal:
nrow(wxAll)
## [1] 3402851
nrow(wxAll %>% distinct(datetime, type, record, id, stn))
## [1] 3400574
# Hack to keep only one measurement for a given time, record, station, and array:
wxAll  = wxAll %>% distinct(datetime, type, record, id, stn, .keep_all=TRUE)

Now all of our records are in one long-format dataframe. This is perfect for plotting with ggplot, so let’s start

A big issue with this inconsistent data is determining which periods from which stations are valid. THe main thing that messes up a station is the battery dying. So let’s look at battery voltages for the stations:

batPlot = ggplot(wxAll %>% filter(record == "batV" & type == 1)) +
  geom_point(aes(x=datetime, y=value, color = stn), alpha=0.05)  + 
  ylim(12,14.5)

batPlot
## Warning: Removed 5309 rows containing missing values (geom_point).

The first thing that’s noticeable is that we have data from 2031. We haven’t cleaned the data enough yet! This is because a dead battery can cause the clock to reset. The records look valid but the clock is totally off. Let’s eliminate data with dates obviously out of range and take another look.

maxDate = as.POSIXct(today())
wxAll = wxAll %>% filter(datetime < maxDate)

batPlot = ggplot(wxAll %>% filter(record == "batV" & type == 1)) +
  geom_point(aes(x=datetime, y=value, color = stn), alpha=0.05)  + 
  ylim(12,14.5)

batPlot
## Warning: Removed 3878 rows containing missing values (geom_point).

This graph looks a bit of a mess, but it can be used as a starting point to determine what periods had bad or no data for each station. You might notice there isn’t really any data below 12 Volts on the battery. This is because below 12V, the clock will probably reset and subsequent records will be in a bad time range, and hence filtered out of our plot.

I haven’t yet bothered to write an automated way to select good date ranges. As such, I’m going to manually pick a good date range for one station and then go into how to look at some of the data from it.

Let’s look at one station:

whlBat = ggplot(wxAll %>% filter(record == "batV" & type == 1 & stn == "whl")) +
  geom_point(aes(x=datetime, y=value, color = stn), alpha=0.05)  + 
  ylim(12,14.5)

whlBat  
## Warning: Removed 1640 rows containing missing values (geom_point).

I’m noticing an early dip, after which the station was serviced and back up, and then a long slow decline while the station went neglected. Things to note: there are 3 distinct lines within the overall line. Looking at this data at a smaller timescale can illuminate why.

startDate = as.POSIXct(ymd(20100401))
endDate = as.POSIXct(ymd(20100405))

whlBatDaily = ggplot(wxAll %>% filter( record == "batV" & 
                                  type == 1 & 
                                  stn == "whl" &
                                  datetime > startDate &
                                  datetime < endDate
                                    )) +
  geom_point(aes(x=datetime, y=value, color = stn), alpha=0.9)  + 
  ylim(12,14.5)

whlBatDaily  

Now we can clearly see that the 3 lines are made by the daily charge cycle. You can also see the seasonal variation in charge on a yearly scale. Note that the solar panels are highly tilted, pointing to the South.

startDate = as.POSIXct(ymd(20100501))
endDate = as.POSIXct(ymd(20110430))

whlBatYearly = ggplot(wxAll %>% filter( record == "batV" & 
                                  type == 1 & 
                                  stn == "whl" &
                                  datetime > startDate &
                                  datetime < endDate
                                    )) +
  geom_point(aes(x=datetime, y=value, color = stn), alpha=0.5)  + 
  ylim(12,14.5)

whlBatYearly

I haven’t yet decided how best to filter bad battery voltages, so let’s move on to actual weather data. The snow depth sensors tended to be reliable:

snowWhl = ggplot(wxAll %>% filter( record == "snwD" & 
                                  stn == "whl")) +
  geom_point( aes(x=datetime, y=value), size=0.01, alpha=0.5, colour="#0000CC") +
  ylim(0, 0.8) +
  xlab("time") +
  ylab("snow depth (m)") + 
  ggtitle("Wheeler Low Elevation Snow Long Term")
  
snowWhl
## Warning: Removed 16519 rows containing missing values (geom_point).

And temperature over a similar time period:

tempWhl = ggplot(wxAll %>% filter( record == "tAir" & 
                                  stn == "whl")) +
  geom_point( aes(x=datetime, y=value), size=0.01, alpha=0.5, colour="#CC00CC") +
  xlab("time") +
  ylab("air temp (C)") + 
  ggtitle("Wheeler Low Elevation Temp Long Term")
  
tempWhl 

Ok that’s kind of boring. Wind data is pretty interesting. Let’s look at some ways to plot it:

startDate = as.POSIXct(ymd(20100501))
endDate = as.POSIXct(ymd(20110430))

windWhl = ggplot(wxAll %>% filter( (record == "wSpd" | record == "wSpdMax") &
                                     stn == "whl" &
                                     datetime > startDate &
                                     datetime < endDate )) +
  geom_point(aes(x=datetime, y=value, col=record), size=0.4, alpha=0.3) + 
  scale_color_manual(name="hourly", labels=c("avg", "max"), values=alpha(c("blue","red"), .2)) +
  xlab("time") +
  ylab("Wind speed in m/s") +
  ggtitle("Wheeler Low Elevation Wind Speed Long Term")

windWhl

Wind direction changes a lot, and it’s hard to visualize on cartesian coordinates. One way to look at it for a given time period is as a wind rose on polar coordinates. In this case the wind direction is angle, and the magnitude is the wind speed. We can do a scatter plot on such coordinates. The downside is that it contains no information about the sequence of the measurements. To do this, I do need to transform the data into a wide dataframe

startDate = as.POSIXct(ymd(20100501))
endDate = as.POSIXct(ymd(20110430))


wind = wxAll %>% 
  filter(type == 1) %>% 
  filter( record %in% c("wSpd", "wSpdMax", "wDir") ) %>%
  spread(record, value)

polarWindWhl = ggplot(wind %>% filter( stn == "whl" & 
                                         datetime > startDate & 
                                         datetime < endDate )) +
  geom_point(aes(x=wDir, y=wSpdMax), color="red", alpha=0.1, shape=16) +
  geom_point(aes(x=wDir, y=wSpd), color="blue", alpha=0.1, shape=16) +
  coord_polar() +
  scale_x_continuous(labels = c("N","NE","E","SE","S","SW","W","NW"),
                     breaks = c(0, 45, 90, 135, 180, 225, 270, 315)) +
  scale_y_continuous(name="m/s", breaks=c(0,5,10,15,20,25,30),limits=c(0,20)) +
  ylim(0,18) +
  ylab("") + 
  xlab("blue: hly avg | red: hly max") +
  ggtitle("Scatterplot of wind speed vs. direction for AY 2010/11")
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
polarWindWhl
## Warning: Removed 36 rows containing missing values (geom_point).