The Australian Bureau of Meteorology (BOM) http://www.bom.gov.au provides a number of public data services for obtaining climate and historical weather data, along with many other services and reports covering all kinds of weather-related data.
Using the BOM’s weather station directory http://www.bom.gov.au/climate/data/stations/ a user can specify a location, coordinates, state/territory or a weather station number, choose the weather data type of interest, and download a compressed data file containing all observations recorded since the site started operating, or narrow the result down by a date range. This method requires processing a large number of files and is limited to one type of weather data at a time, such as maximum temperature or rainfall.
Another service for obtaining data from BOM is the Daily Weather Observations (DWO) service http://www.bom.gov.au/climate/data/index.shtml, which is limited to the last 14 months for any given weather station. This method, however, provides a range of observation types in a single download.
In this post I will describe a process for obtaining DWO for a set of weather stations. The process has four broad steps: select the weather stations, construct the download URL templates, download the monthly DWO files with logging and error handling, and cleanse and combine the downloaded files into one tidy data set.
Weather stations can be selected from the state-by-state lists at http://www.bom.gov.au/climate/dwo/index.shtml, although any other station number can also be used for downloading DWO.
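The package setup is not shown in the original post; the code below assumes roughly the following libraries, inferred from the functions used later (`%>%` and `mutate()` from dplyr, `now()`, `%m-%` and `ymd()` from lubridate, `read_csv()`/`write_csv()` from readr, `spread()` from tidyr, `as.yearmon()` from zoo, and ggplot2 with plotly for the chart):

```r
# assumed setup -- inferred from the functions used later in the post
library(dplyr)      # %>%, mutate, filter, bind_rows, count, ...
library(tidyr)      # spread
library(readr)      # read_csv, read_lines, write_csv
library(lubridate)  # now, %m-%, ymd
library(zoo)        # as.yearmon
library(ggplot2)    # plotting; plotly is called via plotly::ggplotly
```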
To automate the download of DWO for the selected set of weather stations, we first construct a data set, `bomdwolist`, that contains the list of stations to download.
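The construction of `bomdwolist` is not shown in the original post; a minimal sketch using `tibble::tribble()`, reproducing the rows displayed below, might look like this (station IDs are kept as strings to preserve their leading zeros):

```r
# hypothetical construction of bomdwolist, matching the table below
bomdwolist <- tibble::tribble(
  ~oid,         ~stationid, ~state,  ~city,       ~stationaname,
  "IDCJDW2124", "066062",   "NSW",   "Sydney",    "Observatory Hill",
  "IDCJDW3033", "086338",   "VIC",   "Melbourne", "Olympic Park",
  "IDCJDW4019", "040913",   "QLD",   "Brisbane",  "Brisbane City",
  "IDCJDW5002", "023090",   "SA",    "Adelaide",  "Kent Town",
  "IDCJDW7021", "094029",   "TAS",   "Hobart",    "Ellerslie Road",
  "IDCJDW2171", "070217",   "SNOWY", "Cooma",     "Cooma Airport"
)
```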
knitr::kable(bomdwolist %>% head(10), caption = "Table: list of weather stations")
oid | stationid | state | city | stationaname |
---|---|---|---|---|
IDCJDW2124 | 066062 | NSW | Sydney | Observatory Hill |
IDCJDW3033 | 086338 | VIC | Melbourne | Olympic Park |
IDCJDW4019 | 040913 | QLD | Brisbane | Brisbane City |
IDCJDW5002 | 023090 | SA | Adelaide | Kent Town |
IDCJDW7021 | 094029 | TAS | Hobart | Ellerslie Road |
IDCJDW2171 | 070217 | SNOWY | Cooma | Cooma Airport |
The `bomdwolist` is then augmented with template columns for the download URL and the local file name corresponding to each weather station number.
# download destination folder; create it if it does not exist yet
bomdwo_download_folder <- "dworaw"
if (!file.exists(bomdwo_download_folder)) {
  dir.create(file.path(bomdwo_download_folder))
}
bomdwolist <- bomdwolist %>%
  mutate(
    url_template = paste0(
      "http://www.bom.gov.au/climate/dwo/{ym}/text/",
      oid,
      ".{ym}.csv"
    ),
    filename_template = paste0(bomdwo_download_folder, "/", state, ".{ym}.csv")
  )
knitr::kable(bomdwolist %>% head(10), caption = "Table: list of weather stations with download url template")
oid | stationid | state | city | stationaname | url_template | filename_template |
---|---|---|---|---|---|---|
IDCJDW2124 | 066062 | NSW | Sydney | Observatory Hill | http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW2124.{ym}.csv | dworaw/NSW.{ym}.csv |
IDCJDW3033 | 086338 | VIC | Melbourne | Olympic Park | http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW3033.{ym}.csv | dworaw/VIC.{ym}.csv |
IDCJDW4019 | 040913 | QLD | Brisbane | Brisbane City | http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW4019.{ym}.csv | dworaw/QLD.{ym}.csv |
IDCJDW5002 | 023090 | SA | Adelaide | Kent Town | http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW5002.{ym}.csv | dworaw/SA.{ym}.csv |
IDCJDW7021 | 094029 | TAS | Hobart | Ellerslie Road | http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW7021.{ym}.csv | dworaw/TAS.{ym}.csv |
IDCJDW2171 | 070217 | SNOWY | Cooma | Cooma Airport | http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW2171.{ym}.csv | dworaw/SNOWY.{ym}.csv |
The `{ym}` parameter will later be substituted with a valid year-month code (formatted as `%Y%m`) to download the corresponding data files for the selected month.
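For example, with `ym = "201809"` the Sydney template resolves as follows:

```r
gsub("{ym}", "201809",
     "http://www.bom.gov.au/climate/dwo/{ym}/text/IDCJDW2124.{ym}.csv",
     fixed = TRUE)
#> [1] "http://www.bom.gov.au/climate/dwo/201809/text/IDCJDW2124.201809.csv"
```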
The download operation will loop through all the selected weather station IDs and download one file per month for each station. To keep track of the process, logging and error handling are implemented in the code.
The following functions automate the download process and make it possible to inspect the logs for errors afterwards.
record_result <- function(timestamp, status, downloadurl, message) {
  # build a one-row log record; tibble() replaces the deprecated data_frame()
  tibble(
    ts = timestamp,
    status = status,
    downloadurl = downloadurl,
    message = message
  )
}
dwo_download.file <- function(url, filename) {
  message("trying to download: ", url)
  currTime <- now()
  r <- tryCatch(
    expr = {
      download.file(url, filename)
      # on success, record the local file name in the message field;
      # the cleansing step later recovers the file list from it
      record_result(currTime, "success", url, filename)
    },
    warning = function(e) {
      record_result(currTime, "warning", url, e$message)
    },
    error = function(e) {
      record_result(currTime, "error", url, e$message)
    }
  )
  return(r)
}
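As a quick sanity check (my example, not from the original post), the helper can be tried on a single station and month; it returns a one-row log tibble whether the download succeeds or fails:

```r
# assumes the dworaw folder created earlier exists
dwo_download.file(
  "http://www.bom.gov.au/climate/dwo/201809/text/IDCJDW2124.201809.csv",
  "dworaw/NSW.201809.csv"
)
```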
download_batch <- function(bomdwolist, ym) {
  # substitute the {ym} placeholder in the URL and file name templates;
  # fixed = TRUE treats the pattern as a literal string, not a regex
  bomdwolist <- bomdwolist %>%
    mutate(
      url = gsub("{ym}", ym, url_template, fixed = TRUE),
      filename = gsub("{ym}", ym, filename_template, fixed = TRUE)
    )
  # call the download function on the batch for this {ym}
  loglist <- Map(dwo_download.file, bomdwolist$url, bomdwolist$filename)
  # combine the one-row log records and tag them with the batch code
  logdf <- bind_rows(loglist) %>% mutate(batch = ym)
  return(logdf)
}
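A single batch can also be run on its own, e.g. for September 2018:

```r
septlog <- download_batch(bomdwolist, "201809")  # one file per station
```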
The following code segment prepares the `log` data frame and kicks off the download process by iterating over `ym_seq` and calling the `download_batch()` function for each of the year-month combinations.
log <- tibble()
# the last 14 months as %Y%m codes, using lubridate's %m-% operator
ym_seq <- strftime(now() %m-% months(0:13), "%Y%m")
for (ym in ym_seq) {
  l <- download_batch(bomdwolist, ym)
  log <- rbind(log, l)
}
#### write operation log file to disk
logfilename <- paste0(bomdwo_download_folder, strftime(now(), "/log_%Y%m%d%H%M.csv"))
write_csv(log, logfilename)
The log file is then written to disk using the `write_csv()` function.
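Because warnings and errors are recorded instead of stopping the loop, the log can be inspected afterwards; a quick check along these lines will surface any failed batches:

```r
# count non-successful downloads per year-month batch
log %>%
  filter(status != "success") %>%
  count(batch, status)
```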
The first challenge to address when cleansing the downloaded data files is finding the true start of the data set in each file. The downloaded files usually begin with free text describing the contained data set. The issue with this header is that it has a different number of lines in different files, so a constant number of lines cannot be skipped for all files.
The function `onefile_cleanse()` finds the true start line of the data set by looking for the text `,"Date",` in the file to determine the appropriate `skipCount`. It then reopens the data file, skipping the header using the `skipCount` value.
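To illustrate the idea on a mock file (the preamble below is invented; real DWO files carry a station-specific description of varying length):

```r
mock <- c(
  "Daily Weather Observations for Sydney, New South Wales",
  "Observations were drawn from ...",
  "",
  ",\"Date\",\"Minimum temperature\",\"Maximum temperature\"",
  ",2018-09-01,12.3,21.4"
)
which(grepl(',"Date",', mock, fixed = TRUE))
#> [1] 4   # skip 4 lines: the data starts on line 5
```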
# this function will cleanse a single file and return a tidy data frame
onefile_cleanse <- function(fn) {
  # read the raw lines and find the real first row: the column header
  # line is the one containing ,"Date",
  rawlines <- read_lines(fn)
  skipCount <- which(grepl(',"Date",', rawlines, fixed = TRUE))
  # reopen the file, skipping the free-text preamble plus the header
  # line, and apply our own column names (colnames is defined below)
  climateDf <- read_csv(fn,
                        skip = skipCount,
                        col_names = colnames)
  # add the REGION code, extracted from the file name, and keep the
  # date, region and temperature variables only
  retDf <- climateDf %>%
    mutate(
      Date = ymd(Date),
      REGION = gsub(
        pattern = "\\.\\d{6}\\.csv",
        replacement = "",
        x = basename(fn)
      )
    ) %>%
    select(Date, REGION, TempMin_C, TempMax_C, Temp9am_C, Temp3pm_C) %>%
    glimpse()
  return(retDf)
}
The log file generated by the download code is used to determine the list of files that were successfully downloaded.
#### process all downloaded files
bomdwo_download_folder <- "dworaw"
bomdwo_cleansedfiles <- "dwoclean"
if (!file.exists(bomdwo_cleansedfiles)) {
  dir.create(file.path(bomdwo_cleansedfiles))
}
# get the log file; list.files() treats pattern as a regex, hence the ^
# anchor ("log_*" would match "log" followed by zero or more underscores)
logfiles <- list.files(bomdwo_download_folder, pattern = "^log_")
log <- read_csv(paste0(bomdwo_download_folder, "/", logfiles[1]))
# keep the successful downloads; the message field holds the local file name
files <- log %>%
  filter(status == "success") %>%
  pull(message)
A vector of column names is prepared, to be applied to the data sets extracted from the data files.
colnames <- c("empty",                # leading unnamed column in the raw files
              "Date",
              "TempMin_C",
              "TempMax_C",
              "RainFall_mm",
              "Evaporation_mm",
              "Sunshine_hh",
              "WindGustDirection",
              "WindGustSpeedMax_kmh",
              "WindGustTime",
              "Temp9am_C",
              "Humidity9am_percent",
              "CloudAmount9am_oktas",
              "WindDirection9am",
              "WindSpeed9am_kmh",
              "MSLPressure9am_hPa",
              "Temp3pm_C",
              "Humidity3pm_percent",
              "CloudAmount3pm_oktas",
              "WindDirection3pm",
              "WindSpeed3pm_kmh",
              "MSLPressure3pm_hPa"
)
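With the column names in place, the cleanser can be tried on a single file (assuming it was downloaded earlier and exists on disk):

```r
nsw_sep <- onefile_cleanse("dworaw/NSW.201809.csv")
```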
File cleansing is then performed by calling `lapply()` to invoke `onefile_cleanse()` on the list of downloaded files. All extracted data sets are then combined into one data frame, `climateAllDf`.
# cleanse every downloaded file, then combine the results into one data frame
climateAllDf <- lapply(files, onefile_cleanse)
climateAllDf <- bind_rows(climateAllDf)
Here is a summary of all observations from the downloaded data files.
climateAllDf %>%
  group_by(REGION, ym = as.yearmon(Date)) %>%
  summarise(N = n()) %>%
  spread(key = REGION, value = N) %>%
  knitr::kable(caption = "Table: Summary")
ym | NSW | QLD | SA | SNOWY | TAS | VIC |
---|---|---|---|---|---|---|
Aug 2017 | 31 | 31 | 31 | 31 | 31 | 31 |
Sep 2017 | 30 | 30 | 30 | 30 | 30 | 30 |
Oct 2017 | 31 | 31 | 31 | 31 | 31 | 31 |
Nov 2017 | 30 | 30 | 30 | 30 | 30 | 30 |
Dec 2017 | 31 | 31 | 31 | 31 | 31 | 31 |
Jan 2018 | 31 | 31 | 31 | 31 | 31 | 31 |
Feb 2018 | 28 | 28 | 28 | 28 | 28 | 28 |
Mar 2018 | 31 | 31 | 31 | 31 | 31 | 31 |
Apr 2018 | 30 | 30 | 30 | 30 | 30 | 30 |
May 2018 | 31 | 31 | 31 | 31 | 31 | 31 |
Jun 2018 | 30 | 30 | 30 | 30 | 30 | 30 |
Jul 2018 | 31 | 31 | 31 | 31 | 31 | 31 |
Aug 2018 | 31 | 31 | 31 | 31 | 31 | 31 |
Sep 2018 | 30 | 30 | 30 | 30 | 30 | 30 |
The data set can now be saved to disk.
# write the data set to file
if (!file.exists("tidy")) {
dir.create(file.path("tidy"))
}
write_csv(climateAllDf, "tidy/climate.csv")
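The tidy file can later be read back without re-running the pipeline; a partial column specification keeps the types stable:

```r
climateAllDf <- read_csv(
  "tidy/climate.csv",
  col_types = cols(Date = col_date(), REGION = col_character())
)
```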
Last but not least, a vignette is never complete without a graph. Here is a time series chart of maximum and minimum temperatures in NSW over the past 14 months.
p <- climateAllDf %>%
  filter(REGION == "NSW") %>%
  ggplot() +
  geom_line(aes(x = Date, y = TempMax_C), color = "red") +
  geom_line(aes(x = Date, y = TempMin_C), color = "green") +
  ylab("Max (red), Min (green) Temperatures") +
  scale_x_date()
plotly::ggplotly(p)