Introduction

This tutorial outlines methods for charting data pulled from the aWhere API in the R programming language. Previous tutorials provided by aWhere have covered using the aWhere API in R to pull data for one location or for a set of locations.

While it is possible that readers without any R experience can follow along, this tutorial assumes a relatively advanced level of understanding of R and the RStudio Integrated Development Environment (IDE). This understanding should include a grasp of topics like basic data structures in R, working with data frames, loading packages, and working with the ggplot2 family of packages and functions.

Pulling & Evaluating Data for Charting

In this tutorial, we will work with a dataset pulled from the aWhere API using the parameters and function below. For more information on how this function operates, and details about the aWhere API and its functionality, please refer to the supplementary tutorial Using the aWhere API in R. Please note that for this code to work, you must have previously installed the aWhereAPI R package.


## 
## #install the aWhere R package and its dependencies - only needs to be done once
## install.packages('devtools')
## install.packages(c('chron', 'magrittr', 'btcops', 'DBI', 'assertthat', 'Rcpp', 'tibble'))
## devtools::install_github("aWhereAPI/aWhere-R-Library")
## 
#load libraries that will be needed
suppressWarnings(suppressPackageStartupMessages(library(tidyr)))
suppressWarnings(suppressPackageStartupMessages(library(dplyr)))
suppressWarnings(suppressPackageStartupMessages(library(aWhereAPI)))

#set the path to your working directory
setwd("~/R Working Directory/API")

## 
## #authenticate yourself to the aWhere API - your key and secret are unique to your account
## key <- "API consumer key"
## secret <- "API consumer secret"
## 
#send your credentials to the API to receive an access token
get_token(key, secret)
## $error
## [1] FALSE
## 
## $error_message
## NULL
## 
## $token
## [1] "i7t4NsFfNVtEWwvMBdlxfFsgFl2W"
#set some data query parameters
lat <-        -20.465133
lon <-        -42.081328
day_start <-  "2018-01-01"
day_end <-    as.character(Sys.Date()+7)
year_start <- 2008
year_end <-   2016

#source helper function - be sure the pathway given to the file as an input is correct
source("./function_generateaWhereDataset.R")
source("./function_generateaWhereChart.R")

###Pull entire dataset
weather_df <- generateaWhereDataset(lat = lat, lon = lon, 
                                    day_start = day_start, 
                                    day_end = day_end, 
                                    year_start = year_start, 
                                    year_end = year_end)

In summary, the code above tells R the latitude and longitude of the point for which we want to query data, the start and end dates of the data query, and the span of years to include in the long-term normal calculations. It then runs a function called generateaWhereDataset to query the data for that point and store it in the R environment as an object called weather_df.

Before charting the data, the user should examine it to understand its structure and clean it, if needed. Below is a simple data exploration function, summary(), that users can try, along with the (truncated) output the function delivers when run on the exact dataset retrieved above.


#summary results are run for only the first 10 columns
summary(weather_df[ , 1:10])
##      day                date           maxTemp.amount  maxTemp.average
##  Length:72          Length:72          Min.   :20.39   Min.   :25.16  
##  Class :character   Class :character   1st Qu.:27.32   1st Qu.:26.25  
##  Mode  :character   Mode  :character   Median :28.79   Median :26.84  
##                                        Mean   :28.26   Mean   :26.84  
##                                        3rd Qu.:29.90   3rd Qu.:27.52  
##                                        Max.   :31.61   Max.   :28.36  
##  maxTemp.stdDev  minTemp.amount  minTemp.average minTemp.stdDev  
##  Min.   :1.514   Min.   :16.08   Min.   :17.15   Min.   :0.8336  
##  1st Qu.:2.208   1st Qu.:19.35   1st Qu.:17.85   1st Qu.:1.5645  
##  Median :2.636   Median :19.90   Median :18.20   Median :1.7486  
##  Mean   :2.689   Mean   :19.71   Mean   :18.11   Mean   :1.7767  
##  3rd Qu.:3.088   3rd Qu.:20.39   3rd Qu.:18.41   3rd Qu.:1.9119  
##  Max.   :4.179   Max.   :21.24   Max.   :18.93   Max.   :2.8894  
##  precipitation.amount precipitation.average
##  Min.   : 0.00000     Min.   : 0.000       
##  1st Qu.: 0.09375     1st Qu.: 2.341       
##  Median : 1.97500     Median : 4.240       
##  Mean   : 8.86632     Mean   : 5.170       
##  3rd Qu.:11.35312     3rd Qu.: 7.239       
##  Max.   :59.12500     Max.   :22.744

This command gives you basic information about the way the data frame weather_df is structured, and about the variables it contains. Users should note that in this particular dataset, forecast, observed (past), and long-term normal weather information is contained in different variables. This is why there are NAs in some columns.

For example, there are 8 NA entries in the observed maxTemp column (out of 43 observations), as the NA entries correspond to future dates. Conversely, there are 35 NAs in the maxTemp.forecast column, as there are only 8 days of forecast. The maxTemp.average column contains entries for all 43 dates, because this column denotes the long-term normal (2008-2016, as specified by year_start and year_end) at that location on that date. The exact counts will differ depending on when the query is run, because day_end is set relative to the current date. This pattern (variable, variable.forecast, variable.average) is repeated elsewhere in the data with other weather attributes.
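One quick way to check this pattern is to count the missing values in each column. The sketch below uses a small made-up data frame (toy_df and its values are illustrative, not real aWhere output); the same colSums(is.na(...)) call could be applied to weather_df.

```r
#a small illustrative data frame - values are made up, not aWhere output
toy_df <- data.frame(date = c("2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"),
                     maxTemp.amount   = c(28.1, 29.4, NA, NA),     #observed: NA for future dates
                     maxTemp.forecast = c(NA, NA, 30.2, 27.9),     #forecast: NA for past dates
                     maxTemp.average  = c(27.0, 26.8, 26.9, 27.1), #LTN: present for every date
                     stringsAsFactors = FALSE)

#count the NA entries in each column
na_counts <- colSums(is.na(toy_df))
print(na_counts)
```

Columns holding observed data should show NAs only for future dates, forecast columns only for past dates, and long-term normal columns none at all.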

From the structure and summary information, a user may make any number of choices about what to do next. Users may want to rename columns or delete unneeded ones to make it simpler to work with the dataset during plotting. Users may decide they want to restrict the date range, or break up the dataset into a few different sets. These tasks and more can be done using basic R functions covered in other tutorials.

Before moving on to making charts, practice subsetting certain rows or columns in the data, using code similar to the examples below. This practice will help ensure that you know how to reference only the data you want to chart in the charting functions.


#Subset only rows corresponding to date "2018-01-01"

weather_df[which(weather_df$date == "2018-01-01"), ]

#Subset only the date and maxTemp.average columns

weather_df[, names(weather_df) %in% c("date", 
                                "maxTemp.average")]

There are other functions and packages in R that can help you to slice your data simply and flexibly - in this tutorial we are using base functions where possible in order to improve the likelihood that the code will play well with future changes to the R language and package library.

One simple operation we may want to consider is sorting the data by date. This can be accomplished with the order() function.


#Order data by date

weather_df <- weather_df[order(weather_df$date),]

print(weather_df$date[1:10])
##  [1] "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04" "2018-01-05"
##  [6] "2018-01-06" "2018-01-07" "2018-01-08" "2018-01-09" "2018-01-10"

The order() function returns a numerical vector of the “date” column’s row numbers, reordered so that the dates are ascending. Given this vector as an input to the [] subset operator, R reorders the rows of the weather_df dataset accordingly. This works on the character dates because strings in "YYYY-MM-DD" format sort alphabetically in chronological order. After reordering, the “date” column is in sequential order when printed to the console.
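To see what order() returns on its own, here is a minimal sketch using a short, made-up vector of dates (the vector is illustrative and not part of the aWhere dataset).

```r
#an illustrative vector of ISO-format dates, deliberately out of order
dates <- c("2018-01-03", "2018-01-01", "2018-01-02")

#order() returns the positions that would sort the vector
idx <- order(dates)
print(idx)

#using those positions inside [] reorders the vector (or the rows of a data frame)
sorted_dates <- dates[idx]
print(sorted_dates)
```

The same index vector placed before the comma in weather_df[idx, ] would reorder whole rows rather than a single vector.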

Basic Charting in R

Having examined the data, users may have seen some weather attributes they would like to plot. An obvious candidate would be precipitation at the location queried across the time period queried. The code below uses basic R functions to draw line graphs of rainfall in the current time period and of the long-term normal (LTN) for the same period.


#An example of basic plot syntax in R, note that the dataset input as the first 
#parameter is subsetted

plot(x = weather_df$precipitation.amount,        #the data to be plotted
     type = "o",                                 #the type of graph - "o" = lines + points
     col = "blue",                               #color of the lines and points
     xlab = "day of query range",                #the title of the x axis
     ylab = "mm",                                #the title of the y axis
     ylim = c(0,70),                             #standardize the range of the y axis
     main = "Observed Precip at Query Location") #the title of the plot

plot(x = weather_df$precipitation.average,       #the data to be plotted
     type = "o",                                 #the type of graph - "o" = lines + points
     col = "blue",                               #color of the lines and points
     xlab = "day of query range",                #the title of the x axis
     ylab = "mm",                                #the title of the y axis
     ylim = c(0,70),                             #standardize the range of the y axis
     main = "LTN Precip at Query Location")      #the title of the plot


These plots are a simple start, but we can already think of ways they could be improved. For one, it would be better if both lines were on the same graph. For another, it would be better if the x-axis were labeled with the dates themselves, rather than their position in the sequence. And finally, the plot design is very basic, and could use some improvement.
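As a rough illustration, the first two improvements can be sketched in base R alone. The example below uses made-up values (toy_df is illustrative, not aWhere output), adds the second series to the same chart with lines(), and labels the x axis with dates using axis().

```r
#illustrative data - values are made up
toy_df <- data.frame(date = as.character(seq(as.Date("2018-01-01"), by = "day", length.out = 30)),
                     precipitation.amount  = rep(c(0, 12, 3, 0, 25, 1), 5),
                     precipitation.average = rep(c(4, 5, 6, 5, 4, 6), 5),
                     stringsAsFactors = FALSE)

#positions of the tick marks: every tenth day
tick_at <- seq(1, nrow(toy_df), by = 10)

#first series, with the default x axis suppressed
plot(x = toy_df$precipitation.amount, type = "o", col = "blue",
     xaxt = "n", xlab = "date", ylab = "mm", ylim = c(0, 30),
     main = "Precip - Observed and LTN")

#second series added to the same chart
lines(x = toy_df$precipitation.average, type = "o", col = "orange")

#date labels on the x axis, and a legend
axis(side = 1, at = tick_at, labels = toy_df$date[tick_at])
legend("topright", legend = c("Observed", "LTN"),
       col = c("blue", "orange"), lty = 1, pch = 1)
```

This works, but each addition requires another separate call, which is part of the motivation for a layered approach to plotting.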

The ggplot2 package is among R’s most popular packages for plotting data intuitively and beautifully. The package works on a slightly different idea of graphing from R’s base graphics. In base R graphics, you input exactly the ranges of data you want plotted for each individual chart, then alter the appearance. In ggplot2, you can start by feeding an entire dataset to the plot, then control which elements of the data appear on the chart, and in what ways, using a succession of commands layered on top with the “+” operator.


#Load the ggplot2 package, and the associated ggthemes package
library(ggplot2)
library(ggthemes)

#Layer a line geometry on top of the base plot
ggplot(weather_df) +
  geom_line(aes(x = date, y = precipitation.amount, group = latitude), 
            size = 2,
            color = "blue",
            na.rm = TRUE) +
  geom_line(aes(x = date, y = precipitation.average, group = latitude),
            size = 2,
            color = "orange",
            na.rm = TRUE) +
  scale_x_discrete(breaks = unique(weather_df$date)[seq(1,length(weather_df$date),10)]) +
  ylab("mm") +
  ggtitle("Precipitation - Observed and LTN")


On your own, you can try to input the base ggplot(weather_df) command alone into your console and see that it results in a blank plot. To produce an interesting graphic, we must tell R what elements of the data to plot, and how. We start by layering a line geometry element on top using the “+” sign, with the geom_line() command. The parameters to this function are important, and need to be understood in detail.

  1. aes: short for aesthetics (meaning visualization components), this function can take many inputs. Here, we use three aesthetics:
     - x: the attribute we wish to use for the x axis
     - y: the attribute we wish to use for the y axis
     - group: the attribute by which to group data into coherent lines. If we had data for multiple geographic points, this would be especially useful. Above, the grouping attribute is simply set to “latitude”, as all the data is for the same point.
  2. size: thickness of the line
  3. color: a default color for elements drawn with geom_line()
  4. na.rm: when set to TRUE, tells R to silently drop NA values from the data it charts

This code, replicated twice with slight changes to the data and color it charts, layers two lines onto the base chart, and sets the axis and tick labels by default. Users may wish to override the default labels, however.

For tick labels, this can be done with a variety of commands beginning with scale_x and scale_y. Because the dates are stored as character strings, ggplot treats them as discrete items, so the above code uses scale_x_discrete() to set custom tick points (breaks) at every tenth date, from the start of the date range to the end.
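The break selection can be examined on its own, outside of any plot. The sketch below uses a made-up 35-day date vector (illustrative only) to show which values the seq() indexing picks out.

```r
#illustrative character dates, in the same format as the "date" column
dates <- as.character(seq(as.Date("2018-01-01"), by = "day", length.out = 35))

#positions 1, 11, 21, 31 - every tenth entry
idx <- seq(1, length(dates), 10)

#the date strings that would be passed to the breaks argument
tick_breaks <- dates[idx]
print(tick_breaks)
```

Changing the third argument of seq() changes the spacing of the ticks, for example 7 for weekly labels.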

The y axis label can be customized using a simple ylab() command with a text string inside. A parallel function, xlab(), exists for the x axis. Users should practice overriding the default parameters of the chart with additional functions like this in order to learn how to control charting appearances.

Finally, the code layers a chart title on top with the ggtitle() command, with a text string inside.

Flexible Chart Customization

This chart is useful, but it would also be useful to see accumulated precipitation versus the “norm”, so we can better understand whether this period of rain is more than average or less than average. To do this we should change the variables included in the data frame to accumulated precipitation.
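To make the idea of accumulation concrete, the sketch below computes a running total with cumsum() on made-up daily values. It assumes the accumulated columns represent a running sum of the daily values over the query period; the dataset already supplies these columns, so this is only an illustration.

```r
#illustrative daily precipitation in mm (values are made up)
daily_precip <- c(0, 12.5, 3, 0, 25)

#running total across the period
accumulated_precip <- cumsum(daily_precip)
print(accumulated_precip)
```

An accumulated series never decreases, which makes it easier to compare the current period against the long-term normal than a spiky daily series.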

In changing this chart, the charting code can also be cleaned up to avoid repeating geom_line() commands and to automate the coloring of lines and the generation of a legend.


#Select the variables you wish to chart from the data frame, rename them, and change 
#the data format from wide to long (one row per date and variable)
chart_data <- dplyr::select(weather_df, 
                            date, 
                            accumulatedPrecipitation.amount, 
                            accumulatedPrecipitation.average) %>%
    setNames(., c("date", "Current", "LTN")) %>% 
    gather(key = "Variable", value = "measure", -date) #keep date as an identifier column
    
#Layer a line geometry on top of the base plot
ggplot() + geom_line(data = chart_data, 
                      aes(x = date, 
                          y = measure, 
                          group = Variable,
                          color = Variable),
                      size = 1.5,
                      na.rm = TRUE) + 
  scale_x_discrete(breaks = unique(weather_df$date)[seq(1,length(weather_df$date),10)]) +
  labs(x="Date", y = "mm \n") +
  ggtitle("Accumulated Precipitation - Current vs. Long-term Normal")


There are two parts to this code. The first part selects out the variables we are interested in charting from the weather_df data frame and formats them in a way that will be easier for ggplot to digest. The second part modifies the charting code from above, adding the following parameters within the aes() command:

  1. group: Tell R how to group the precipitation data in order to draw different lines
  2. color: Tell R to assign different colors to each of the lines

These parameters together make it so R will digest any number of “groups” of data, chart individual lines for each, and assign helpful colors as well as automatically generate a legend to help interpret the data.
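The long format that makes this grouping possible can be illustrated without any packages. The sketch below builds it by hand in base R from a small made-up wide data frame (wide_df and its values are illustrative); it is an alternative way to arrive at the same date, Variable, and measure layout, not the code used to produce the chart.

```r
#illustrative wide data - one row per date, one column per series (values are made up)
wide_df <- data.frame(date = c("2018-01-01", "2018-01-02", "2018-01-03"),
                      Current = c(0, 12.5, 15.5),
                      LTN = c(4, 9, 15),
                      stringsAsFactors = FALSE)

#long format: one row per date and series, built with base R only
long_df <- data.frame(date = rep(wide_df$date, times = 2),
                      Variable = rep(c("Current", "LTN"), each = nrow(wide_df)),
                      measure = c(wide_df$Current, wide_df$LTN),
                      stringsAsFactors = FALSE)
print(long_df)
```

With one row per date and series, a single column (Variable) identifies which line each value belongs to, so adding a third series only adds rows, not plotting code.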

The code so far has demonstrated some of the customization options available in ggplot, but there are many more. For example, you can add themes to your chart and customize the colors more flexibly. The following chart uses the same subset of data but adds more customization on top.


#Adding a theme, modifying colors, rotating x-axis labels, and placing the legend 
#on the bottom

ggplot() + theme_igray() + scale_colour_tableau() +
    geom_line(data = chart_data, 
              aes(x = date, 
                  y = measure, 
                  group = Variable,
                  color = Variable),
              size = 1.5,
              na.rm = TRUE) + 
    scale_x_discrete(breaks = unique(chart_data$date)[seq(1,length(unique(chart_data$date)),10)],
          labels = substr(unique(chart_data$date), 6, 10)[seq(1,length(unique(chart_data$date)),10)]) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(legend.position="bottom", legend.direction="horizontal",
          legend.title = element_blank()) +
    labs(x="Date", y = "mm \n") +
    ggtitle(paste("Accumulated Precipitation - Current vs. Long-term Normal"))


aWhere has created a function called generateaWhereChart(), which leverages code like the above to automatically generate a chart for a given dataset created using the generateaWhereDataset() command. The code to source this function is found above; an example of the output of this function is shown here:


#give the function the weather_df data frame as an input, and a variable to chart

generateaWhereChart(weather_df, variable = "accumulatedPrecipitation")


This charting function takes a data frame, weather_df, as the first input, followed by the variable to chart, which can be any of the following:

  1. precipitation: Daily precipitation values across the date span of the input dataset
  2. accumulatedPrecipitation: Accumulated precipitation across the date span of the input dataset
  3. maxTemp: Daily maximum temperature values across the date span of the input dataset
  4. minTemp: Daily minimum temperature values across the date span of the input dataset
  5. pet: Daily Potential Evapotranspiration (PET) values across the date span of the input dataset
  6. accumulatedPet: Accumulated Potential Evapotranspiration (PET) across the date span of the input dataset
  7. ppet: Daily Precipitation/PET values across the date span of the input dataset (this is a simple ratio of aWhere’s daily precipitation data over its daily PET data)
  8. rollingavgppet: A rolling average of Precipitation/PET values across the date span of the input dataset (the number of days applied to the rolling average can be customized, by default it is 30)
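For readers unfamiliar with the last two variables, the sketch below shows one way a Precipitation/PET ratio and its rolling average could be computed. The values are made up, the window is shortened to 3 days to keep the example small, and the choice of a trailing window is an assumption; this is not the implementation inside generateaWhereChart().

```r
#illustrative daily precipitation and PET in mm (values are made up)
precip <- c(1, 2, 3, 4, 5, 6)
pet    <- rep(5, 6)

#daily ratio of precipitation to PET
ppet <- precip / pet

#window length in days - 3 here for brevity
window <- 3

#trailing moving average: the mean of the current day and the (window - 1) days before it
rolling_avg <- as.numeric(stats::filter(ppet, rep(1 / window, window), sides = 1))
print(rolling_avg)
```

The first window - 1 entries are NA because a full window of past values is not yet available for those days.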

In addition to the two required parameters, there are a number of optional parameters you can include, such as the title demonstrated in the following chart:


#give the function the weather_df data frame as an input

generateaWhereChart(weather_df, variable = "rollingavgppet",
                    title = "Rolling Average Precipitation/PET at Location X")


Remember that R, RStudio, and the suite of R packages are free, open-source tools to help you query data, build visualizations, and replicate your work for other datasets, locations, and time periods. This tutorial is intended to help users quickly get started using R to query and process aWhere weather data into insights for decision-making. The scripts which contain the custom functions created by aWhere and demonstrated in this tutorial are available upon request by contacting your organization’s aWhere representative, or emailing beawhere@awhere.com.