In the first days of our journey, we travel to familiar territory. We can use these days to further prepare for the challenges that lie ahead. This part is called data cleaning or data wrangling. Nothing exotic here, just the transformation1 In some cases, data wrangling can take up to 80% of the total project time. of our raw data2 Reminder: we have our raw data stored safely. We read them but do not edit them. into another format that will be more appropriate for our exploratory analyses.
But first we have to import our files.
list.files('data/raw')
## [1] "precip_day.rds" "project" "runoff_day"
## [4] "runoff_stations.csv"
There is a .csv file containing some information about the locations where data were collected (river runoff stations) and a folder with the actual daily discharge time series in text format. This is quite common in geosciences, especially when it comes to spatiotemporal data. There is a separate record with other information about the data, such as where the data were collected.
We shall import all of them, transform them to data.table objects and save them as .rds files for further use. In the meantime, we will also add a couple of new variables that might be useful in further steps and remove the redundant ones.3 In this project, the transformations are few and simple, so a single script file can be used for each dataset. In more complex cases, one script should be used for importing and a different one for transforming.
We will create a new R script4 Shortcut: Ctrl + Shift + N and save it as 01a_import_runoff_info.R in folder code. It is useful to keep some ordering in the file name (“01a_”), to keep track of the analysis workflow.
The first thing to do is load the libraries and the data.
library(data.table)
runoff_stations <- fread('./data/raw/runoff_stations.csv')
The fread function reads the csv as a data.table object. Let’s have a look at our data and get some information about how they are stored.
head(runoff_stations)
## id station country lat lon area altitude
## 1: 6335020 REES DE 51.75692 6.395395 159300 8.00
## 2: 6335050 DUESSELDORF DE 51.22555 6.770183 147680 24.48
## 3: 6335060 KOELN DE 50.93696 6.963293 144232 34.97
## 4: 6335070 ANDERNACH DE 50.44339 7.392050 139549 51.47
## 5: 6335100 KAUB DE 50.08544 7.764967 103488 67.66
## 6: 6335150 MAINZ DE 50.00399 8.275313 98206 78.43
str(runoff_stations)
## Classes 'data.table' and 'data.frame': 20 obs. of 7 variables:
## $ id : int 6335020 6335050 6335060 6335070 6335100 6335150 6335170 6335180 6335200 6335400 ...
## $ station : chr "REES" "DUESSELDORF" "KOELN" "ANDERNACH" ...
## $ country : chr "DE" "DE" "DE" "DE" ...
## $ lat : num 51.8 51.2 50.9 50.4 50.1 ...
## $ lon : num 6.4 6.77 6.96 7.39 7.76 ...
## $ area : int 159300 147680 144232 139549 103488 98206 53131 68827 50196 34550 ...
## $ altitude: num 8 24.5 35 51.5 67.7 ...
## - attr(*, ".internal.selfref")=<externalptr>
There are 7 columns with information about station location and the area of the catchment of each Rhine subsection. There are 20 stations and we can check their names to see if there are any duplicates.
runoff_stations$station
## [1] "REES" "DUESSELDORF"
## [3] "KOELN" "ANDERNACH"
## [5] "KAUB" "MAINZ"
## [7] "SPEYER" "WORMS"
## [9] "MAXAU" "RHEINFELDEN"
## [11] "LOBITH" "BASEL(ST.ALBAN)"
## [13] "BASEL, RHEINHALLE" "BASEL, SCHIFFLAENDE"
## [15] "RHEINFELDEN, MESSSTATION" "REKINGEN"
## [17] "NEUHAUSEN, FLURLINGERBRUECKE" "DOMAT/EMS"
## [19] "FELSBERG" "DIEPOLDSAU, RIETBRUECKE"
Everything looks good. However, it will be handier if we provide some shorter names for further use. We achieve this with the abbreviate function. We will save the result as a factor, which is a categorical variable. In addition, we notice that the id variable is stored as an integer, although it also represents a categorical variable, so we transform it to a factor as well. Finally, we remove some decimals with the round function to improve data readability.
runoff_stations[, sname := factor(abbreviate(station))]
runoff_stations[, id := factor(id)]
runoff_stations[, lat := round(lat, 3)]
runoff_stations[, lon := round(lon, 3)]
runoff_stations[, altitude := round(altitude, 0)]
That’s all! We can now save our imported data as an .rds file.
saveRDS(runoff_stations, './data/runoff_stations_raw.rds')
We run and save our script. We commit the changes and push them to our GitHub repository. To continue, we clear our workspace5 Clearing the workspace before running each script is fundamental. A quick way to do this is the Ctrl + Shift + F10 shortcut.. It is important to keep this workflow for every R script we will produce.6 Each script should do a single thing.
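If the shortcut is not at hand, a minimal console alternative is to remove every object from the workspace; keep in mind that, unlike restarting the session, this does not unload packages.
rm(list = ls()) # removes all objects from the workspace; does not restart R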
Then, we create a new script file and save it as 01b_import_runoff_day.R, again in folder code7 Note the difference between saving the file that contains the code for importing the data (01b_import_runoff_day.R) and the data themselves (runoff_stations.rds).. This script will import the daily runoff data.
library(data.table)
runoff_stations <- readRDS('./data/runoff_stations_raw.rds')
Importing the daily data needs a more complex approach. We start by examining the structure of the text files, picking one at random.
fread('./data/raw/runoff_day/6335020_Q_Day.Cmd.txt')
## YYYY-MM-DD hh:mm Value
## 1: 1814-11-01 --:-- 845.00
## 2: 1814-11-02 --:-- 828.00
## 3: 1814-11-03 --:-- 812.00
## 4: 1814-11-04 --:-- 796.00
## 5: 1814-11-05 --:-- 768.00
## ---
## 73837: 2016-12-27 --:-- 1048.73
## 73838: 2016-12-28 --:-- 1035.63
## 73839: 2016-12-29 --:-- 1027.47
## 73840: 2016-12-30 --:-- 1012.50
## 73841: 2016-12-31 --:-- 1006.63
There are three columns: date, time and runoff discharge. Our objective is to read each file and merge them all into a single data.table object in tidy format.
First of all, we define the variables that we will need.
raw_path <- './data/raw/runoff_day/'
fnames <- list.files(raw_path)
n_station <- length(fnames)
id_length <- 7
runoff_day_raw <- data.table()
id_sname <- runoff_stations[, .(id, sname)]
We do this to minimize numbers and text in the main body of our script, as well as to avoid repetition.8 Reminder: try to use meaningful names for the variables. The last line creates a new data.table object that stores the station (short) name together with the station id.9 If you have trouble following this command, have a look at the data.table syntax.
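If that last command seems cryptic, here is a minimal, self-contained sketch of the general DT[i, j, by] syntax that data.table uses; the toy table and its column names are made up purely for illustration.
# A toy data.table, only to illustrate the DT[i, j, by] form
toy <- data.table(id = c('a', 'a', 'b'), value = c(1, 2, 3))
toy[id == 'a']                        # i: filter rows
toy[, .(id, value)]                   # j: select columns as a data.table
toy[, .(total = sum(value)), by = id] # by: aggregate per group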
Then, we create a loop that reads each text file and appends it to the bottom of the previous one. To better understand the loop, we will examine it step by step. We can do this by setting the variable file_count, which represents the loop counter, to 1. Thus, this is the first iteration of the loop.
file_count <- 1
We start by reading the first file and storing it in the variable temp_dt, which is a temporary data.table needed just for the loop.
temp_dt <- fread(paste0(raw_path, fnames[file_count]))
The next step is to extract the station id from the file name and add it as a new column to temp_dt.
station_id <- substr(fnames[file_count], 1, id_length)
temp_dt <- cbind(id = factor(station_id), temp_dt)
Let’s have a look at temp_dt.
## id YYYY-MM-DD hh:mm Value
## 1: 6335020 1814-11-01 --:-- 845.00
## 2: 6335020 1814-11-02 --:-- 828.00
## 3: 6335020 1814-11-03 --:-- 812.00
## 4: 6335020 1814-11-04 --:-- 796.00
## 5: 6335020 1814-11-05 --:-- 768.00
## ---
## 73837: 6335020 2016-12-27 --:-- 1048.73
## 73838: 6335020 2016-12-28 --:-- 1035.63
## 73839: 6335020 2016-12-29 --:-- 1027.47
## 73840: 6335020 2016-12-30 --:-- 1012.50
## 73841: 6335020 2016-12-31 --:-- 1006.63
Now, it is time to add the station (short) name as another column. We make this happen by merging temp_dt with id_sname, matching the values in the id column.
temp_dt <- id_sname[temp_dt, on = 'id']
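To see what this kind of join does, consider a small sketch with hypothetical tables; lookup plays the role of id_sname and obs the role of temp_dt.
# Hypothetical lookup and observation tables, for illustration only
lookup <- data.table(id = c('A', 'B'), sname = c('ST_A', 'ST_B'))
obs    <- data.table(id = c('A', 'A', 'B'), value = c(1, 2, 3))
lookup[obs, on = 'id'] # keeps every row of obs and adds the matching sname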
Finally, we attach temp_dt to the end of runoff_day_raw.
runoff_day_raw <- rbind(runoff_day_raw, temp_dt)
Once the loop is over, all our discharge data are merged into a single data.table object: runoff_day_raw.
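Putting the steps together, the complete loop (also listed in the full script at the end of the chapter) looks like this:
for (file_count in 1:n_station) {
  temp_dt <- fread(paste0(raw_path, fnames[file_count]))
  station_id <- substr(fnames[file_count], 1, id_length)
  temp_dt <- cbind(id = factor(station_id), temp_dt)
  temp_dt <- id_sname[temp_dt, on = 'id']
  runoff_day_raw <- rbind(runoff_day_raw, temp_dt)
}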
Before saving it for further use, we shall make a few format tweaks. We start by removing the time column, as it is of no use in our analysis.
runoff_day_raw[, 'hh:mm' := NULL]
Then, we change the column names to preserve script consistency.10 For variables, we use lowercase letters and underscores.
colnames(runoff_day_raw)[3:4] <- c('date', 'value')
Lastly, we transform the dates from character strings to Date objects, in order to make date manipulations easier in the future.
runoff_day_raw[, date := as.Date(date)]
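As a brief illustration of why the Date class helps, one-liners like the following (not part of the script) become straightforward later on; year() is provided by data.table.
runoff_day_raw[year(date) >= 2000]             # filter by year
runoff_day_raw[date >= as.Date('1990-01-01')]  # filter from an exact date onwards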
Although we still haven’t made any changes in the data, we do not want to keep them in data/raw. That folder is reserved only for the original data sources.
saveRDS(runoff_day_raw, './data/runoff_day_raw.rds')
We have to note that runoff_day_raw is in tidy format, as we have a single value per variable. On the other hand, we kept runoff_stations in table or wide format, because it is more intuitive to read the information it contains. If needed, we can easily transform any part of it to tidy format using the melt function.
runoff_coordinates <- runoff_stations[, .(sname, lat, lon)]
runoff_coordinates <- melt(runoff_coordinates, id.vars = 'sname')
runoff_coordinates
Now, we are ready to move on and take a first look at our data.11 Don’t forget our workflow!
As you might already have guessed, EDA involves a lot of graphical exploration. In R, the most robust way to perform it is through the ggplot2 package12 ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics. (ggplot2 homepage). ggplot2 offers the means to create both simple and sophisticated plots. It is good to remember that ggplot2 plays well with the tidy format.
Our next step is to have a look at our imported data. A new step in the analysis means a new script. So let’s clear our workspace and create a new script. We can save it as 02_preparation.R or similar in folder code.
The first thing to do is to load the libraries and the data.
library(data.table)
library(ggplot2)
runoff_stations <- readRDS('./data/runoff_stations_raw.rds')
runoff_day <- readRDS('./data/runoff_day_raw.rds')
Let’s pick a random time series (runoff measured at station Rees) and store it in a separate data.table.
rees_runoff_day <- runoff_day[sname == 'REES']
Then we plot it with the function ggplot.
ggplot(data = rees_runoff_day) +
geom_line(aes(x = date, y = value))
With ggplot2, you begin a plot with the function ggplot. The function ggplot creates a coordinate system that you can add layers to. The first argument of ggplot is the dataset to use in the graph. So ggplot(data = rees_runoff_day) creates an empty graph.
You complete your graph by adding one or more layers to ggplot. The function geom_line adds a line layer to your plot. The package ggplot2 comes with many geom functions, each adding a different type of layer to a plot. For example, to change the plot type we just replace geom_line with geom_point.
ggplot(data = rees_runoff_day) +
geom_point(aes(x = date, y = value))
Or we can have them both, and move aes into the ggplot arguments to avoid repetition.
ggplot(data = rees_runoff_day,
aes(x = date, y = value)) +
geom_point() +
geom_line()
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes, and the x and y arguments of aes specify which variables to map to the x and y axes. The function ggplot looks for the mapped variables in the data argument. In this case, it is rees_runoff_day.
aes mapping is not only about x and y. If we have more than one variable, we can plot them using different colors or other aesthetic properties. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of our points. We can display a point in different ways by changing the values of its aesthetic properties. We can convey information about our data by mapping the aesthetics in our plot to the variables in our dataset. For example, let’s subset our dataset to two stations. We will use the OR operator |.
rees_dier_runoff_day <- runoff_day[sname == 'REES' | sname == 'DIER']
By adding the col argument, we plot the runoff of each station separately.
ggplot(data = rees_dier_runoff_day) +
geom_line(aes(x = date, y = value, col = sname))
Back to our project, it would be helpful to plot all available time series. However, putting them in a single plot can be messy. A useful alternative is facet_wrap.13 We have also added theme_bw to change the figure format. By keeping the same axes for all panels we can directly compare them. Sometimes, however, we might prefer to adapt the axes to the range of each dataset. This is achieved through the facet_wrap argument scales, as sketched below.
ggplot(data = runoff_day, aes(x = date, y = value)) +
geom_line() +
facet_wrap(~sname) +
theme_bw()
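If we preferred each panel to adapt to its own data range, a variant using the scales argument mentioned in the sidenote (here freeing only the y axis) could look like this:
ggplot(data = runoff_day, aes(x = date, y = value)) +
  geom_line() +
  facet_wrap(~sname, scales = 'free_y') +
  theme_bw()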
At this point, we are interested in outliers14 Values outside the plausible range of our data. and missing values. In the plotted runoff data, we can easily see some runoff values that make no sense at all; they are negative. These outliers are actually the missing values in our data.15 It is quite common to use a very low negative value as a missing value flag, e.g., -9999. We can summarize them to get an overview of the missing values in our time series.
missing_values <- runoff_day[value < 0, .(missing = .N), by = sname]
missing_values
## sname missing
## 1: REES 5053
## 2: MAXA 122
## 3: DOMA 1096
## 4: FELS 1096
Just four stations, that’s not bad. However, it would be more informative if we transformed the absolute numbers into the ratio of missing values to the total record length of each station.
sample_size <- runoff_day[, .(size = .N), by = sname]
runoff_stations <- runoff_stations[sample_size, on = 'sname']
runoff_stations <- missing_values[runoff_stations, on = 'sname']
runoff_stations[is.na(missing), missing := 0]
runoff_stations[, missing := missing / size]
runoff_stations[, missing := round(missing, 3)]
setcolorder(runoff_stations, #making 'missing' last column
c(setdiff(names(runoff_stations), 'missing'), 'missing'))
head(runoff_stations[missing > 0], 4)
## sname id station country lat lon area altitude size missing
## 1: REES 6335020 REES DE 51.757 6.395 159300 8 73841 0.068
## 2: MAXA 6335200 MAXAU DE 49.039 8.306 50196 98 35064 0.003
## 3: DOMA 6935145 DOMAT/EMS CH 46.838 9.456 3229 623 43099 0.025
## 4: FELS 6935146 FELSBERG CH 46.844 9.482 3249 610 32872 0.033
We see that in all four records the share of missing values is below 10%. We will remove the outliers from the dataset and plot the REES station to check.
runoff_day <- runoff_day[value >= 0]
rees_runoff_day <- runoff_day[sname == 'REES']
ggplot(data = rees_runoff_day, aes(x = date, y = value)) +
geom_line() +
geom_point() +
theme_bw()
Another issue that we would like to avoid is records with too few values. But how few is too few? The answer to this question is tied to the analysis hypothesis. Here, we are interested in the hydroclimatic shifts since 2000. Therefore, it would be great if we had 20 years of data since 2000. Looking back at our plot, we can easily see that this is not the case. We have to make a compromise. To do this, we have to learn more about the last dates in the records.
station_time <- runoff_day[, .(start = min(year(date)),
end = max(year(date))),
by = sname]
table(station_time$end)
##
## 1982 1988 1995 2016 2017
## 1 1 1 16 1
Most of the records appear to reach 2016. Since we are interested in comparing climatic periods, we need data from at least two consecutive climatic periods.16 A climatic period usually has a length of 30 years. We can add the corresponding information to runoff_stations and then keep only the records within these limits.
max_year <- 2016
min_year <- max_year - (30 * 2)
runoff_stations <- runoff_stations[station_time, on = 'sname']
runoff_stations <- runoff_stations[start <= min_year &
end >= max_year &
size >= 30 * 2 * 365]
runoff_day <- runoff_day[id %in% runoff_stations$id]
runoff_day <- runoff_day[year(date) <= 2016]
We save our data, ready for the next step of our exploration. Since this is the dataset we will use for our analyses, we can save the time series as runoff_day.rds.
saveRDS(runoff_stations, './data/runoff_stations.rds')
saveRDS(runoff_day, './data/runoff_day.rds')
Tidy data manifesto from Hadley Wickham.
Need help with ggplot2 graphs? Check Winston Chang’s book R Graphics Cookbook: Practical Recipes for Visualizing Data or Antony Unwin’s Graphical Data Analysis with R.
Two good places to get an overview of the major graph types and how to create them in R are the R Gallery and the Top 50 ggplot2 visualizations. In addition, a working project on ggplot2 graphs and useful palettes can be found here.
What are the units for area and runoff in our records?
What is the average catchment area and runoff of the Rhine, according to our data? Write a script that performs the calculation.
What is the average runoff at each station? Present the results in a graph.
Is there any relationship between station altitude and catchment area? Why?
As we look back, the port is nowhere to be seen. We are now sailing on the open sea.
abbreviate() # Abbreviate strings to a given number of characters
data.table() # Create a new data.table object
cbind() # Combine vectors, matrices or data frames by columns
colnames() # Retrieve or set the column names of a data object
factor() # Create a factor object
fread() # Fast read of a delimited file into a data.table object
length() # Return the length of a vector or list
ggplot() # Initialize a ggplot object
is.na() # Check whether an element is missing
max() # Return the maximum value
min() # Return the minimum value
paste0() # Concatenate vectors after converting to character, with no separator
rbind() # Combine vectors, matrices or data frames by rows
readRDS() # Load an .rds file into the workspace
round() # Round a value to a given number of decimal places
year() # Return the year of a date object
01a_import_runoff_info.R
library(data.table)
runoff_stations <- fread('./data/raw/runoff_stations.csv')
runoff_stations[, sname := factor(abbreviate(station))]
runoff_stations[, id := factor(id)]
runoff_stations[, lat := round(lat, 3)]
runoff_stations[, lon := round(lon, 3)]
runoff_stations[, altitude := round(altitude, 0)]
saveRDS(runoff_stations, './data/runoff_stations_raw.rds')
01b_import_runoff_day.R
library(data.table)
runoff_stations <- readRDS('./data/runoff_stations_raw.rds')
raw_path <- './data/raw/runoff_day/'
fnames <- list.files(raw_path)
n_station <- length(fnames)
id_length <- 7
runoff_day_raw <- data.table()
id_sname <- runoff_stations[, .(id, sname)]
for(file_count in 1:n_station){
temp_dt <- fread(paste0(raw_path, fnames[file_count]))
station_id <- substr(fnames[file_count], 1, id_length)
temp_dt <- cbind(id = factor(station_id), temp_dt)
temp_dt <- id_sname[temp_dt, on = 'id']
runoff_day_raw <- rbind(runoff_day_raw, temp_dt)
}
runoff_day_raw[, 'hh:mm' := NULL]
colnames(runoff_day_raw)[3:4] <- c('date', 'value')
runoff_day_raw[, date := as.Date(date)]
saveRDS(runoff_day_raw, './data/runoff_day_raw.rds')
02_preparation.R
library(data.table)
library(ggplot2)
runoff_stations <- readRDS('./data/runoff_stations_raw.rds')
runoff_day <- readRDS('./data/runoff_day_raw.rds')
rees_runoff_day <- runoff_day[sname == 'REES']
ggplot(data = rees_runoff_day) +
geom_line(aes(x = date, y = value))
ggplot(data = rees_runoff_day) +
geom_point(aes(x = date, y = value))
ggplot(data = rees_runoff_day,
aes(x = date, y = value)) +
geom_point() +
geom_line()
rees_dier_runoff_day <- runoff_day[sname == 'REES' | sname == 'DIER']
ggplot(data = rees_dier_runoff_day) +
geom_line(aes(x = date, y = value, col = sname))
ggplot(data = runoff_day, aes(x = date, y = value)) +
geom_line() +
facet_wrap(~sname) +
theme_bw()
missing_values <- runoff_day[value < 0, .(missing = .N), by = sname]
sample_size <- runoff_day[, .(size = .N), by = sname]
runoff_stations <- runoff_stations[sample_size, on = 'sname']
runoff_stations <- missing_values[runoff_stations, on = 'sname']
runoff_stations[is.na(missing), missing := 0]
runoff_stations[, missing := missing / size]
runoff_stations[, missing := round(missing, 3)]
setcolorder(runoff_stations, #making 'missing' last column
c(setdiff(names(runoff_stations), 'missing'), 'missing'))
runoff_day <- runoff_day[value >= 0]
rees_runoff_day <- runoff_day[sname == 'REES']
ggplot(rees_runoff_day, aes(x = date, y = value)) +
geom_line() +
geom_point() +
theme_bw()
station_time <- runoff_day[, .(start = min(year(date)),
end = max(year(date))),
by = sname]
max_year <- 2016
min_year <- max_year - (30 * 2)
runoff_stations <- runoff_stations[station_time, on = 'sname']
runoff_stations <- runoff_stations[start <= min_year &
end >= max_year &
size >= 30 * 2 * 365]
runoff_day <- runoff_day[id %in% runoff_stations$id]
runoff_day <- runoff_day[year(date) <= 2016]
saveRDS(runoff_stations, './data/runoff_stations.rds')
saveRDS(runoff_day, './data/runoff_day.rds')
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.