Data munging, sometimes referred to as data wrangling is the process of transforming and unifying often messy and complex data sets with the intent of making it more accessible and valuable for analysis. It is a fundamental process in analytics and data warehousing projects as it often accounts for 50%-80% of the costs and time involved.
This vignette will explain the steps involved in loading and munging of weather data from the Australian Government’s Bureau of Meterology (BOM), which will allow us to analyse Australian weather data by decade.
Our final table should display the average maximum temperature by decade from four major Australian cities: Sydney, Melbourne, Brisbane and Canberra.
We will be using the ‘bomrang’ and ‘tidyverse’ package for this excercise.
The bomrang library contains functions to interface with the BOM data, fetching and returning data sets of forecasts, historical and current weather data from weather stations in Australia.
The tidyverse package contains the dplyr package which contains functions for transforming our data.
library(bomrang)
##
## Data (c) Australian Government Bureau of Meteorology,
## Creative Commons (CC) Attribution 3.0 licence or
## Public Access Licence (PAL) as appropriate.
## See http://www.bom.gov.au/other/copyright.shtml
##
##
## If you use bomrang, please cite it.
## See `citation('bomrang')` for the proper citation.
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Use the get_historical() function from the ‘bomrang’ library to load past daily observations for a particular weather station in Australia.
In this example, we are using two arguments to load the maximum temperatures from a weather station in each capital city.
To retrieve data from a particular weather station we will use the Station ID which is uniquely allocated to each weather station. More infromation on station numbers can be found here.
The type argument is used to specify what weather measurement we would like to retrieve, such as rain, minimum temperature, maximum temperature or solar exposure. As we are interested in measuring maximum temperature we will be using ‘max’.
syd <- get_historical(stationid = "066062", type = "max")
## Data saved as /var/folders/48/xwjgszy56ms4kwz9k_kj45nw0000gn/T//Rtmpzzqcva/IDCJAC0010_066062_1800_Data.csv
mel <- get_historical(stationid = "086282", type = "max")
## Data saved as /var/folders/48/xwjgszy56ms4kwz9k_kj45nw0000gn/T//Rtmpzzqcva/IDCJAC0010_086282_1800_Data.csv
canb <- get_historical(stationid = "070351", type = "max")
## Data saved as /var/folders/48/xwjgszy56ms4kwz9k_kj45nw0000gn/T//Rtmpzzqcva/IDCJAC0010_070351_1800_Data.csv
bris <- get_historical(stationid = "040913", type = "max")
## Data saved as /var/folders/48/xwjgszy56ms4kwz9k_kj45nw0000gn/T//Rtmpzzqcva/IDCJAC0010_040913_1800_Data.csv
Let’s view a sample of our data frames:
head(syd)
## Product_code Station_number Year Month Day Max_temperature
## 1 IDCJAC0010 66062 1859 1 1 24.4
## 2 IDCJAC0010 66062 1859 1 2 24.4
## 3 IDCJAC0010 66062 1859 1 3 24.2
## 4 IDCJAC0010 66062 1859 1 4 24.7
## 5 IDCJAC0010 66062 1859 1 5 24.6
## 6 IDCJAC0010 66062 1859 1 6 22.2
## Accum_days_max Quality
## 1 NA Y
## 2 1 Y
## 3 1 Y
## 4 1 Y
## 5 1 Y
## 6 1 Y
head(mel)
## Product_code Station_number Year Month Day Max_temperature
## 1 IDCJAC0010 86282 1970 1 1 NA
## 2 IDCJAC0010 86282 1970 1 2 NA
## 3 IDCJAC0010 86282 1970 1 3 NA
## 4 IDCJAC0010 86282 1970 1 4 NA
## 5 IDCJAC0010 86282 1970 1 5 NA
## 6 IDCJAC0010 86282 1970 1 6 NA
## Accum_days_max Quality
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
head(canb)
## Product_code Station_number Year Month Day Max_temperature
## 1 IDCJAC0010 70351 2008 1 1 NA
## 2 IDCJAC0010 70351 2008 1 2 NA
## 3 IDCJAC0010 70351 2008 1 3 NA
## 4 IDCJAC0010 70351 2008 1 4 NA
## 5 IDCJAC0010 70351 2008 1 5 NA
## 6 IDCJAC0010 70351 2008 1 6 NA
## Accum_days_max Quality
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
head(bris)
## Product_code Station_number Year Month Day Max_temperature
## 1 IDCJAC0010 40913 1999 1 1 NA
## 2 IDCJAC0010 40913 1999 1 2 NA
## 3 IDCJAC0010 40913 1999 1 3 NA
## 4 IDCJAC0010 40913 1999 1 4 NA
## 5 IDCJAC0010 40913 1999 1 5 NA
## 6 IDCJAC0010 40913 1999 1 6 NA
## Accum_days_max Quality
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
We can see that the Sydney data frame displays the daily maximum recorded temperature for that particular station since 1st January 1859. It also displays the accumulated number of days of that same maximum temperature and the quality of the data, that is, whether the recorded observation has completed the routine quality control process or not.
When viewing the other data frames from the other cities, note that there are NA values, which is important to note when we summarise our data in the future steps.
As we are only displaying the Station ID, it is not easy to interpret which city the observation originated from. To easily interpret the location, let’s add a column called Location using the mutate() function. This will allow us to display the name of the city that weather station was located in:
syd <- mutate(syd, Location="Sydney")
mel <- mutate(mel, Location="Melbourne")
canb <- mutate(canb, Location="Canberra")
bris <- mutate(bris, Location="Brisbane")
We’d now like to combine the data sets from each of the four major cities to create one dataset allowing us to easily compare the temperatures of each of the major cities.
Use the bind_rows function to join the Sydney, Melbourne, Canberra and Brisbane data frames:
aus <- bind_rows(syd, mel, canb, bris)
Our final output is one table that should summarise the average maximum temperature for each of the major cities by decade, which will require rearranging our data set using the mutate, group_by, summarise and spread functions from dplyr the package in the tidyverse.
Mutate() computes and appends an additional column called Decade at the end of our table. The Decade column calculates the decade that year is in using simple mathematics calculations.
Then, the data set is grouped by the Location then Decade, where we calculate the average maximum temperature for each decade for that particular city. Note that in the summarise() function, we will include an na.rm argument, to specify that we should calcualte the mean of the non-missing values since there are a few decades with NA values.
We then rearrange the Location rows into their own separate columns using the spread().
aus <- mutate(aus,Decade=Year-Year%%10) %>%
group_by(Location, Decade) %>%
summarise(avg_max=mean(Max_temperature,na.rm=T)) %>%
spread(Location, avg_max)
Let’s now view the final aus dataframe:
aus
## # A tibble: 17 x 5
## Decade Brisbane Canberra Melbourne Sydney
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1850 NA NA NA 21.4
## 2 1860 NA NA NA 21.1
## 3 1870 NA NA NA 21.0
## 4 1880 NA NA NA 20.9
## 5 1890 NA NA NA 20.8
## 6 1900 NA NA NA 21.0
## 7 1910 NA NA NA 21.6
## 8 1920 NA NA NA 21.9
## 9 1930 NA NA NA 21.5
## 10 1940 NA NA NA 21.5
## 11 1950 NA NA NA 21.7
## 12 1960 NA NA NA 21.9
## 13 1970 NA NA 19.3 22.1
## 14 1980 NA NA 19.6 22.3
## 15 1990 27.2 NA 19.4 22.2
## 16 2000 26.5 22.1 20.3 22.9
## 17 2010 26.6 21.0 20.6 23.2