1 Introduction

The Australian Bureau of Meteorology (BOM) http://www.bom.gov.au provides a number of public data services for obtaining climate and past weather observations, along with many other data services and reports covering all kinds of weather-related data.

Using the BOM’s weather station directory http://www.bom.gov.au/climate/data/stations/, a user can specify a location, coordinates, state/territory or weather station number, choose the weather data type they wish to obtain, and download a compressed data file containing all observations recorded since the site started operating, or narrow it down to a date range. This method requires processing a large number of files and is limited to one type of weather data at a time, such as maximum temperature or rainfall.

Another service for obtaining data from the BOM is the Daily Weather Observations (DWO) service http://www.bom.gov.au/climate/data/index.shtml, which is limited to the last 14 months for any given weather station. This method, however, provides a range of observation types in a single download.

In this post I will describe a process for obtaining DWO for a set of weather stations.

2 Daily Weather Observations

The process described in this post to get DWO includes the following steps:

  1. Select weather stations
  2. Download DWO files
  3. Cleanse and tidy DWO files

2.1 Select Weather Stations

Weather stations from different states can be selected using the list on this page http://www.bom.gov.au/climate/dwo/index.shtml; any other station number can also be used for downloading DWO data.

To automate the download of DWO data for a selected set of weather stations, we first construct a data set, bomdwolist, containing the list of stations to download.
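A minimal sketch of this list, using the stations shown in the table below (any other BOM DWO station could be added in the same way), might look like this; the dworaw folder that will hold the raw downloads is also created here:

library(dplyr)
library(tibble)

# folder that will hold the raw DWO downloads
bomdwo_download_folder <- "dworaw"
if (!file.exists(bomdwo_download_folder)) {
  dir.create(file.path(bomdwo_download_folder))
}

# hard-coded list of selected weather stations
bomdwolist <- tribble(
  ~oid,         ~stationid, ~state,  ~city,       ~stationaname,
  "IDCJDW2124", "066062",   "NSW",   "Sydney",    "Observatory Hill",
  "IDCJDW3033", "086338",   "VIC",   "Melbourne", "Olympic Park",
  "IDCJDW4019", "040913",   "QLD",   "Brisbane",  "Brisbane City",
  "IDCJDW5002", "023090",   "SA",    "Adelaide",  "Kent Town",
  "IDCJDW7021", "094029",   "TAS",   "Hobart",    "Ellerslie Road",
  "IDCJDW2171", "070217",   "SNOWY", "Cooma",     "Cooma Airport"
)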

knitr::kable(bomdwolist %>% head(10), caption = "Table: list of weather stations")
Table: list of weather stations

oid          stationid   state   city        stationaname
-----------  ----------  ------  ----------  -----------------
IDCJDW2124   066062      NSW     Sydney      Observatory Hill
IDCJDW3033   086338      VIC     Melbourne   Olympic Park
IDCJDW4019   040913      QLD     Brisbane    Brisbane City
IDCJDW5002   023090      SA      Adelaide    Kent Town
IDCJDW7021   094029      TAS     Hobart      Ellerslie Road
IDCJDW2171   070217      SNOWY   Cooma       Cooma Airport

The bomdwolist data set is then augmented with template columns for the download URL and the local file name of the DWO data for each weather station.

bomdwolist <- bomdwolist %>%
  mutate(
    url_template = paste0(
      "http://www.bom.gov.au/climate/dwo/{ym}/text/",
      oid,
      ".{ym}.csv"
    ),
    filename_template = paste0(bomdwo_download_folder, "/", state, ".{ym}.csv")
  )

knitr::kable(bomdwolist %>% head(10), caption = "Table: list of weather stations with download url template")
Table: list of weather stations with download url template

oid          stationid   state   city        stationaname       url_template                                                      filename_template
-----------  ----------  ------  ----------  -----------------  ----------------------------------------------------------------  --------------------
IDCJDW2124   066062      NSW     Sydney      Observatory Hill   http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW2124.{ym}.csv   dworaw/NSW.{ym}.csv
IDCJDW3033   086338      VIC     Melbourne   Olympic Park       http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW3033.{ym}.csv   dworaw/VIC.{ym}.csv
IDCJDW4019   040913      QLD     Brisbane    Brisbane City      http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW4019.{ym}.csv   dworaw/QLD.{ym}.csv
IDCJDW5002   023090      SA      Adelaide    Kent Town          http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW5002.{ym}.csv   dworaw/SA.{ym}.csv
IDCJDW7021   094029      TAS     Hobart      Ellerslie Road     http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW7021.{ym}.csv   dworaw/TAS.{ym}.csv
IDCJDW2171   070217      SNOWY   Cooma       Cooma Airport      http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW2171.{ym}.csv   dworaw/SNOWY.{ym}.csv

The {ym} parameter will later be substituted with a valid year-month code (in %Y%m format, e.g. 201809) to download the corresponding data file for each selected month.
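For example, substituting {ym} for September 2018 in the Sydney station’s template yields the concrete download URL:

# produces: "http://www.bom.gov.au/climate/dwo/201809/text/IDCJDW2124.201809.csv"
gsub("\\{ym}", "201809",
     "http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW2124.{ym}.csv")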

2.2 Download DWO files

The download operation loops through all the selected weather station IDs and downloads one file per month for each station. To keep track of the process, logging and error handling are implemented in the code.

2.2.1 Download functions with error handling and logging

The following functions will come in handy to automate the download process and later inspect the logs for errors.
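These functions, and the code in the rest of the post, assume that a few packages are loaded in addition to dplyr and tibble used above (a minimal set, inferred from the functions called):

library(lubridate) # now(), ymd(), %m-% for date arithmetic
library(readr)     # read_csv(), read_table(), write_csv()
library(purrr)     # as_vector()
library(tidyr)     # spread()
library(zoo)       # as.yearmon()
library(ggplot2)   # plotting
library(plotly)    # ggplotly() for an interactive chart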

# build a one-row data frame describing the outcome of a single download attempt
record_result <- function(timestamp, status, downloadurl, message)
{
  return(
    data_frame(
      ts = timestamp,
      status = status,
      downloadurl = downloadurl,
      message = message
    )
  )
}


# download a single file and return a log record for the attempt
# (on success the message column holds the local file name,
#  on warning/error it holds the condition message)
dwo_download.file <- function (url, filename)
{
  print(paste0("try to download: ", url))
  currTime <- now()
  r <- tryCatch(
    expr = {
      download.file(url, filename)
      record_result(currTime, "success", url, filename)
    },
    warning = function(e) {
      return(record_result(currTime, "warning", url, e$message))
    },
    error = function(e) {
      print("error")
      return(record_result(currTime, "error", url, e$message))
    }
  )

  return(r)
}



# download the DWO files for all stations in bomdwolist for a single
# year-month batch {ym}, and return a log data frame for the batch
download_batch <- function (bomdwolist, ym)
{
  # substitute the {ym} placeholder in the URL and file name templates
  bomdwolist <- bomdwolist %>% mutate(
    url = gsub(x = url_template, "\\{ym}", ym),
    filename = gsub(x = filename_template, "\\{ym}", ym)
  )

  # call the download function on the batch of {ym}
  loglist <- Map(dwo_download.file, bomdwolist$url, bomdwolist$filename)

  # combine the per-file log records into one data frame and tag the batch
  logdf <- do.call(rbind, loglist)
  rownames(logdf) <- NULL
  logdf <- logdf %>% mutate(batch = ym)
  return(logdf)
}
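As a quick illustration (assuming the example stations above), a single batch can be downloaded and logged on its own before running the full loop:

# download September 2018 only and keep the resulting log records
septlog <- download_batch(bomdwolist, "201809")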

2.2.2 Download Loop

The following code segment prepares the log data frame and kicks off the download process by iterating over ym_seq and calling the download_batch() function for each of the year-month combinations (the current month plus the previous 13 months).

log <- data_frame()

# year-month codes for the current month and the previous 13 months (14 in total)
ym_seq <- strftime(now() %m-% months(0:13), "%Y%m")

for (ym in ym_seq)
{
  l <- download_batch(bomdwolist, ym)
  log <- rbind(log, l)
}


#### write operation log file to disk
logfilename <- paste0(bomdwo_download_folder, strftime(now(), "/log_%Y%m%d%H%M.csv"))
write_csv(log, logfilename)

The log file is then written to disk using the write_csv() function.
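With the log on disk (or still in memory), failed downloads can be spotted by filtering on the status column, for example:

# list any downloads that did not complete successfully
log %>% filter(status != "success")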

2.3 Cleanse DWO files

The first challenge to address when cleansing the downloaded data files is finding the true start of the data set in each file. The downloaded files usually include free text at the beginning that describes the contained data set. The issue with this header is that it has a different number of lines in different files, making it impossible to skip a constant number of lines for all files.

The function onefile_cleanse() finds the true start line of the data set by looking for the text ,"Date" in the file to determine the appropriate skipCount. It then reopens the data file, skipping the header using the skipCount value.

# this function will cleanse a single file and return a tidy data frame
onefile_cleanse <- function(fn) {
  # open the file as raw lines to locate the real header row
  climateDf <- read_table(fn, col_names = "c1")
  # find the real first row: the column header line contains ,"Date",
  skipCount <- which(grepl(pattern = ",\"Date\",", climateDf$c1))

  # reopen the file, skipping the free-text header, and apply our own column names
  climateDf <- read_csv(fn,
                        skip = skipCount,
                        col_names = colnames)

  # add the REGION code, extracted from the file name, and select the date,
  # region and temperature variables only
  retDf <- climateDf %>%
    mutate(
      Date = ymd(Date),
      REGION = gsub(
        pattern = ".\\d{6}.csv",
        replacement = "",
        x = gsub(
          pattern = paste0(bomdwo_download_folder, "/"),
          replacement = "",
          x = fn
        )
      )
    ) %>%
    select(Date, REGION, TempMin_C, TempMax_C, Temp9am_C, Temp3pm_C) %>%
    glimpse()

  return(retDf)
}
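As an illustration, assuming the colnames vector defined below and a previously downloaded file, a single file can be cleansed with:

# cleanse one downloaded file (file names follow the {state}.{ym}.csv template)
nsw_sep <- onefile_cleanse("dworaw/NSW.201809.csv")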

The log file generated by the download code is used to determine the list of files that were successfully downloaded.

#### process all downloaded files

# folder locations for the raw downloads and the cleansed output
bomdwo_download_folder <- "dworaw"
bomdwo_cleansedfiles <- "dwoclean"

if (!file.exists(bomdwo_cleansedfiles)) {
  dir.create(file.path(bomdwo_cleansedfiles))
}

# get the log file(s) written by the download step and read the first one
logfiles <- list.files(bomdwo_download_folder, pattern = "^log_.*\\.csv$")

log <- read_csv(paste0(bomdwo_download_folder, "/", logfiles[1]))

# keep only the successful downloads; for these the message column holds the file name
files <- log %>%
  filter(status == 'success') %>%
  rename(filename = message) %>%
  select(filename) %>% as_vector()
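If every download succeeded, files should contain one path per station per month; with the six example stations and 14 months, that is 84 files:

# expected to be 84 (6 stations x 14 months) if every download succeeded
length(files)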

A list of column names is prepared, to be applied to the data sets extracted from the downloaded files.

colnames <- c("empty", 
              "Date", 
              "TempMin_C", 
              "TempMax_C", 
              "RainFall_mm", 
              "Evaportation_mm", 
              "Sunshine_hh", 
              "WindGustDirection", 
              "WindGustSpeedMax_kmh",
              "WindGustTime", 
              "Temp9am_C", 
              "Humidity9am_percent", 
              "CloudAmount9am_oktas", 
              "WindDirection9am", 
              "WindSpeed9am_kmh", 
              "MSLPressure9am_hPa", 
              "Temp3pm_C", 
              "Humidity3pm_percent", 
              "CloudAmount3pm_oktas", 
              "WindDirection3pm", 
              "WindSpeed3pm_kmh", 
              "MSLPressure3pm_hPa"
              )

File cleansing is then performed by calling lapply() to invoke onefile_cleanse() on the list of downloaded files. All extracted data sets are then combined into one data frame, climateAllDf.

# call the file cleansing function on each downloaded file, then combine the results
climateAllDf <- lapply(files, onefile_cleanse)
climateAllDf <- do.call(rbind, climateAllDf)

Here is a summary of all observations from the downloaded data files.

climateAllDf %>%
  group_by(REGION, ym = as.yearmon(Date)) %>%
  summarise(N = n()) %>%
  spread(key = REGION, value = N) %>%
  knitr::kable(caption = "Table: Summary")
Table: Summary

ym          NSW   QLD   SA   SNOWY   TAS   VIC
---------  ----  ----  ---  ------  ----  ----
Aug 2017     31    31   31      31    31    31
Sep 2017     30    30   30      30    30    30
Oct 2017     31    31   31      31    31    31
Nov 2017     30    30   30      30    30    30
Dec 2017     31    31   31      31    31    31
Jan 2018     31    31   31      31    31    31
Feb 2018     28    28   28      28    28    28
Mar 2018     31    31   31      31    31    31
Apr 2018     30    30   30      30    30    30
May 2018     31    31   31      31    31    31
Jun 2018     30    30   30      30    30    30
Jul 2018     31    31   31      31    31    31
Aug 2018     31    31   31      31    31    31
Sep 2018     30    30   30      30    30    30

The data set can now be saved to disk.

# write the data set to file 

if (!file.exists("tidy")) {
  dir.create(file.path("tidy"))
}


write_csv(climateAllDf, "tidy/climate.csv")
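The tidy file can be reloaded in a later session without repeating the download and cleansing steps, for example:

# reload the tidy data set from disk
climateAllDf <- read_csv("tidy/climate.csv")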

2.4 Explore the weather data

Last but not least, a vignette is never complete without a graph; here is a time series chart of the maximum and minimum temperatures in NSW over the past 14 months.

p <- climateAllDf %>%
  filter(REGION == "NSW") %>%
  ggplot() +
  geom_line(aes(x = Date, y = TempMax_C), color = "red") +
  geom_line(aes(x = Date, y = TempMin_C), color = "green") +
  ylab("Max (red), Min (green) Temperatures") +
  scale_x_date()

plotly::ggplotly(p)