1 Introduction

The Australian Bureau of Meteorology (BOM) http://www.bom.gov.au provides a number of public data services for obtaining climate and past weather observations, along with many other data services and reports covering all kinds of weather-related data.

Using the BOM’s weather station directory http://www.bom.gov.au/climate/data/stations/, a user can specify a location, coordinates, state/territory or weather station number, choose the weather data type they wish to obtain, and download a compressed data file containing all observations recorded since the site started operating, or narrow it down to a date range. This method requires processing a large number of files and is limited to one type of weather data at a time, such as maximum temperature or rainfall.

Another service for obtaining data from the BOM is the Daily Weather Observations (DWO) service http://www.bom.gov.au/climate/data/index.shtml, which is limited to the last 14 months for any given weather station. This method, however, provides a range of observation types in a single download.

In this post I will describe a process for obtaining DWO for a set of weather stations.

2 Daily Weather Observations

The process described in this post to get DWO includes the following steps:

  1. Select weather stations
  2. Download DWO files
  3. Cleanse and tidy DWO files

2.1 Select Weather Stations

Weather stations from different states can be selected using the list on this page http://www.bom.gov.au/climate/dwo/index.shtml; any other station number can also be used for downloading DWO data.

To automate the download of DWO data for a selected set of weather stations, we first construct a data set, bomdwolist, containing the list of stations to download.
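A minimal sketch of this list, using the stations shown in the table below (any other BOM DWO station could be added in the same way), might look like this; the dworaw folder that will hold the raw downloads is also created here:

library(dplyr)
library(tibble)

# folder that will hold the raw DWO downloads
bomdwo_download_folder <- "dworaw"
if (!file.exists(bomdwo_download_folder)) {
  dir.create(file.path(bomdwo_download_folder))
}

# hard-coded list of selected weather stations
bomdwolist <- tribble(
  ~oid,         ~stationid, ~state,  ~city,       ~stationaname,
  "IDCJDW2124", "066062",   "NSW",   "Sydney",    "Observatory Hill",
  "IDCJDW3033", "086338",   "VIC",   "Melbourne", "Olympic Park",
  "IDCJDW4019", "040913",   "QLD",   "Brisbane",  "Brisbane City",
  "IDCJDW5002", "023090",   "SA",    "Adelaide",  "Kent Town",
  "IDCJDW7021", "094029",   "TAS",   "Hobart",    "Ellerslie Road",
  "IDCJDW2171", "070217",   "SNOWY", "Cooma",     "Cooma Airport"
)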

knitr::kable(bomdwolist %>% head(10), caption = "Table: list of weather stations")
Table: list of weather stations

oid          stationid   state   city        stationaname
-----------  ----------  ------  ----------  -----------------
IDCJDW2124   066062      NSW     Sydney      Observatory Hill
IDCJDW3033   086338      VIC     Melbourne   Olympic Park
IDCJDW4019   040913      QLD     Brisbane    Brisbane City
IDCJDW5002   023090      SA      Adelaide    Kent Town
IDCJDW7021   094029      TAS     Hobart      Ellerslie Road
IDCJDW2171   070217      SNOWY   Cooma       Cooma Airport

The bomdwolist data set is then augmented with template columns for the download URL and the local file name of the DWO data for each weather station.

bomdwolist <- bomdwolist %>%
  mutate(
    url_template = paste0(
      "http://www.bom.gov.au/climate/dwo/{ym}/text/",
      oid,
      ".{ym}.csv"
    ),
    filename_template = paste0(bomdwo_download_folder, "/", state, ".{ym}.csv")
  )

knitr::kable(bomdwolist %>% head(10), caption = "Table: list of weather stations with download url template")
Table: list of weather stations with download url template

oid          stationid   state   city        stationaname       url_template                                                      filename_template
-----------  ----------  ------  ----------  -----------------  ----------------------------------------------------------------  --------------------
IDCJDW2124   066062      NSW     Sydney      Observatory Hill   http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW2124.{ym}.csv   dworaw/NSW.{ym}.csv
IDCJDW3033   086338      VIC     Melbourne   Olympic Park       http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW3033.{ym}.csv   dworaw/VIC.{ym}.csv
IDCJDW4019   040913      QLD     Brisbane    Brisbane City      http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW4019.{ym}.csv   dworaw/QLD.{ym}.csv
IDCJDW5002   023090      SA      Adelaide    Kent Town          http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW5002.{ym}.csv   dworaw/SA.{ym}.csv
IDCJDW7021   094029      TAS     Hobart      Ellerslie Road     http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW7021.{ym}.csv   dworaw/TAS.{ym}.csv
IDCJDW2171   070217      SNOWY   Cooma       Cooma Airport      http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW2171.{ym}.csv   dworaw/SNOWY.{ym}.csv

The {ym} parameter will later be substituted with a valid year-month code (in %Y%m format, e.g. 201809) to download the corresponding data file for each selected month.
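For example, substituting {ym} for September 2018 in the Sydney station’s template yields the concrete download URL:

# produces: "http://www.bom.gov.au/climate/dwo/201809/text/IDCJDW2124.201809.csv"
gsub("\\{ym}", "201809",
     "http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW2124.{ym}.csv")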

2.2 Download DWO files

The download operation loops through all the selected weather station IDs and downloads one file per month for each station. To keep track of the process, logging and error handling are implemented in the code.

2.2.1 Download functions with error handling and logging

The following functions will come in handy to automate the download process and later inspect the logs for errors.
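These functions, and the code in the rest of the post, assume that a few packages are loaded in addition to dplyr and tibble used above (a minimal set, inferred from the functions called):

library(lubridate) # now(), ymd(), %m-% for date arithmetic
library(readr)     # read_csv(), read_table(), write_csv()
library(purrr)     # as_vector()
library(tidyr)     # spread()
library(zoo)       # as.yearmon()
library(ggplot2)   # plotting
library(plotly)    # ggplotly() for an interactive chart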

# build a one-row data frame describing the outcome of a single download attempt
record_result <- function(timestamp, status, downloadurl, message)
{
  return(
    data_frame(
      ts = timestamp,
      status = status,
      downloadurl = downloadurl,
      message = message
    )
  )
}


# download a single file and return a log record for the attempt
# (on success the message column holds the local file name,
#  on warning/error it holds the condition message)
dwo_download.file <- function (url, filename)
{
  print(paste0("try to download: ", url))
  currTime <- now()
  r <- tryCatch(
    expr = {
      download.file(url, filename)
      record_result(currTime, "success", url, filename)
    },
    warning = function(e) {
      return(record_result(currTime, "warning", url, e$message))
    },
    error = function(e) {
      print("error")
      return(record_result(currTime, "error", url, e$message))
    }
  )

  return(r)
}



# download the DWO files for all stations in bomdwolist for a single
# year-month batch {ym}, and return a log data frame for the batch
download_batch <- function (bomdwolist, ym)
{
  # substitute the {ym} placeholder in the URL and file name templates
  bomdwolist <- bomdwolist %>% mutate(
    url = gsub(x = url_template, "\\{ym}", ym),
    filename = gsub(x = filename_template, "\\{ym}", ym)
  )

  # call the download function on the batch of {ym}
  loglist <- Map(dwo_download.file, bomdwolist$url, bomdwolist$filename)

  # combine the per-file log records into one data frame and tag the batch
  logdf <- do.call(rbind, loglist)
  rownames(logdf) <- NULL
  logdf <- logdf %>% mutate(batch = ym)
  return(logdf)
}
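As a quick illustration (assuming the example stations above), a single batch can be downloaded and logged on its own before running the full loop:

# download September 2018 only and keep the resulting log records
septlog <- download_batch(bomdwolist, "201809")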

2.2.2 Download Loop

The following code segment prepares the log data frame and kicks off the download process by iterating over ym_seq and calling the download_batch() function for each of the year-month combinations (the current month plus the previous 13 months).

log <- data_frame()

# year-month codes for the current month and the previous 13 months (14 in total)
ym_seq <- strftime(now() %m-% months(0:13), "%Y%m")

for (ym in ym_seq)
{
  l <- download_batch(bomdwolist, ym)
  log <- rbind(log, l)
}


#### write operation log file to disk
logfilename <- paste0(bomdwo_download_folder, strftime(now(), "/log_%Y%m%d%H%M.csv"))
write_csv(log, logfilename)

The log file is then written to disk using the write_csv() function.
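With the log on disk (or still in memory), failed downloads can be spotted by filtering on the status column, for example:

# list any downloads that did not complete successfully
log %>% filter(status != "success")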

2.3 Cleanse DWO files

The first challenge to address when cleansing the downloaded data files is finding the true start of the data set in each file. The downloaded files usually include free text at the beginning that describes the contained data set. The issue with this header is that it has a different number of lines in different files, making it impossible to skip a constant number of lines for all files.

The function onefile_cleanse() finds the true start line of the data set by looking for the text ,"Date" in the file to determine the appropriate skipCount. It then reopens the data file, skipping the header using the skipCount value.

# this function will cleanse a single file and return a tidy data frame
onefile_cleanse <- function(fn) {
  # open the file as raw lines to locate the real header row
  climateDf <- read_table(fn, col_names = "c1")
  # find the real first row: the column header line contains ,"Date",
  skipCount <- which(grepl(pattern = ",\"Date\",", climateDf$c1))

  # reopen the file, skipping the free-text header, and apply our own column names
  climateDf <- read_csv(fn,
                        skip = skipCount,
                        col_names = colnames)

  # add the REGION code, extracted from the file name, and select the date,
  # region and temperature variables only
  retDf <- climateDf %>%
    mutate(
      Date = ymd(Date),
      REGION = gsub(
        pattern = ".\\d{6}.csv",
        replacement = "",
        x = gsub(
          pattern = paste0(bomdwo_download_folder, "/"),
          replacement = "",
          x = fn
        )
      )
    ) %>%
    select(Date, REGION, TempMin_C, TempMax_C, Temp9am_C, Temp3pm_C) %>%
    glimpse()

  return(retDf)
}
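As an illustration, assuming the colnames vector defined below and a previously downloaded file, a single file can be cleansed with:

# cleanse one downloaded file (file names follow the {state}.{ym}.csv template)
nsw_sep <- onefile_cleanse("dworaw/NSW.201809.csv")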

The log file generated by the download code is used to determine the list of files that were successfully downloaded.

#### process all downloaded files

# folder locations for the raw downloads and the cleansed output
bomdwo_download_folder <- "dworaw"
bomdwo_cleansedfiles <- "dwoclean"

if (!file.exists(bomdwo_cleansedfiles)) {
  dir.create(file.path(bomdwo_cleansedfiles))
}

# get the log file(s) written by the download step and read the first one
logfiles <- list.files(bomdwo_download_folder, pattern = "^log_.*\\.csv$")

log <- read_csv(paste0(bomdwo_download_folder, "/", logfiles[1]))

# keep only the successful downloads; for these the message column holds the file name
files <- log %>%
  filter(status == 'success') %>%
  rename(filename = message) %>%
  select(filename) %>% as_vector()
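If every download succeeded, files should contain one path per station per month; with the six example stations and 14 months, that is 84 files:

# expected to be 84 (6 stations x 14 months) if every download succeeded
length(files)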

A list of column names is prepared, to be applied to the data sets extracted from the downloaded files.

colnames <- c("empty", 
              "Date", 
              "TempMin_C", 
              "TempMax_C", 
              "RainFall_mm", 
              "Evaportation_mm", 
              "Sunshine_hh", 
              "WindGustDirection", 
              "WindGustSpeedMax_kmh",
              "WindGustTime", 
              "Temp9am_C", 
              "Humidity9am_percent", 
              "CloudAmount9am_oktas", 
              "WindDirection9am", 
              "WindSpeed9am_kmh", 
              "MSLPressure9am_hPa", 
              "Temp3pm_C", 
              "Humidity3pm_percent", 
              "CloudAmount3pm_oktas", 
              "WindDirection3pm", 
              "WindSpeed3pm_kmh", 
              "MSLPressure3pm_hPa"
              )

File cleansing is then performed by calling lapply() to invoke onefile_cleanse() on the list of downloaded files. All extracted data sets are then combined into one data frame, climateAllDf.

# call the file cleansing function on each downloaded file, then combine the results
climateAllDf <- lapply(files, onefile_cleanse)
climateAllDf <- do.call(rbind, climateAllDf)

Here is a summary of all observations from the downloaded data files.

climateAllDf %>%
  group_by(REGION, ym = as.yearmon(Date)) %>%
  summarise(N = n()) %>%
  spread(key = REGION, value = N) %>%
  knitr::kable(caption = "Table: Summary")
Table: Summary

ym          NSW   QLD   SA   SNOWY   TAS   VIC
---------  ----  ----  ---  ------  ----  ----
Aug 2017     31    31   31      31    31    31
Sep 2017     30    30   30      30    30    30
Oct 2017     31    31   31      31    31    31
Nov 2017     30    30   30      30    30    30
Dec 2017     31    31   31      31    31    31
Jan 2018     31    31   31      31    31    31
Feb 2018     28    28   28      28    28    28
Mar 2018     31    31   31      31    31    31
Apr 2018     30    30   30      30    30    30
May 2018     31    31   31      31    31    31
Jun 2018     30    30   30      30    30    30
Jul 2018     31    31   31      31    31    31
Aug 2018     31    31   31      31    31    31
Sep 2018     30    30   30      30    30    30

The data set can now be saved to disk.

# write the data set to file 

if (!file.exists("tidy")) {
  dir.create(file.path("tidy"))
}


write_csv(climateAllDf, "tidy/climate.csv")
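The tidy file can be reloaded in a later session without repeating the download and cleansing steps, for example:

# reload the tidy data set from disk
climateAllDf <- read_csv("tidy/climate.csv")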

2.4 Explore the weather data

Last but not least, a vignette is never complete without a graph; here is a time series chart of the maximum and minimum temperatures in NSW over the past 14 months.

p <- climateAllDf %>%
  filter(REGION == "NSW") %>%
  ggplot() +
  geom_line(aes(x = Date, y = TempMax_C), color = "red") +
  geom_line(aes(x = Date, y = TempMin_C), color = "green") +
  ylab("Max (red), Min (green) Temperatures") +
  scale_x_date()

plotly::ggplotly(p)