Objectives

  1. Develop an efficient workflow.
  2. Download and import some data sets.
  3. Clean and manipulate the data.
  4. Understand the data.table syntax and tidy data philosophy.

Efficient directory structure

First of all, we create a new project and name it my_project. Then, we create the folders where we will save our data, code and results:

dir.create('data')
dir.create('data/raw')
dir.create('results')
dir.create('scripts')
list.files()

Data import

There are many different ways to import data. We will start with one quite common; text files. To do this we will first download them from the knmi climate explorer platform and save them in data/raw folder. Then load them with read.table function.

download.file('https://climexp.knmi.nl/data/pa11518.dat', './data/raw/precip_prague.dat')
download.file('https://climexp.knmi.nl/data/pa16714.dat', './data/raw/precip_athens.dat')
download.file('https://climexp.knmi.nl/data/pa1489.2.dat', './data/raw/precip_oslo.dat')
prcp_prague <- read.table('./data/raw/precip_prague.dat')
prcp_athens <- read.table('./data/raw/precip_athens.dat')
prcp_oslo <- read.table('./data/raw/precip_oslo.dat')

What kind of data structures are the prcp_ objects?

Can you add names to describe each column (hint: try to avoid doing it three times)?

Data.table

Data.table is an enhanced data.frame.

“Data.table allows you to reduce your programming time as well as your computing time considerably, and it is especially useful if you often find yourself working with large datasets. For example, to read in a 20GB .csv file with 200 million rows and 16 columns, data.table only needs 8 minutes thanks to the fread function. This is instead of the hours it would take you with the read.csv function. Once you understand its concepts and principles, the speed and simplicity of the package are astonishing!” (Source: DataCamp site)

library(data.table)
prcp_prague <- fread('./data/raw/precip_prague.dat', skip = 17)

Creation of data.tables

  • conversion from data.frame, matrix
  • build form scratch in the same way as data.frame
  • by importing a file with fread function
prcp_athens <- data.table(prcp_athens)
prcp_oslo <- data.table(prcp_oslo)

save(prcp_prague, prcp_athens, prcp_oslo, file = './data/precip_month.rdata')

General structure of data.tables calls:

DT[i, j, by] means:

Take data.table DT, select rows using i, then calculate j possibly grouped by by.


Which rows?

prcp_prague[year > 1960]
prcp_athens[year > 1990 & year < 2000]
prcp_oslo[year == 1921]
prcp_oslo[feb > 10]
prcp_oslo[feb > 10, feb]
prcp_oslo[feb > 10, year]
prcp_athens[year > 1990 & year < 2000, .(year, dec, jan, feb)]

Which are the years that both June and July precipitation in Prague was below 30? How much was the precipitation in June/July then?

What to do?

Results are either single vector, new variable inside data.table or new data.table

# single vector
prcp_prague[, dec + jan + feb]
prcp_prague[year > 1990 & year < 2000, mean(dec)]
prcp_prague[year == 1921,  dec + jan + feb]

# new variable
prcp_prague[, winter := dec + jan + feb]
prcp_prague[year > 1990 & year < 2000, m_dec_90s := mean(dec)]
prcp_prague[year == 1921, winter_1921 := dec + jan + feb]

# new data.table
prcp_prague[, .(winter = dec + jan + feb, spring = mar + apr + may)]

Can you create a new column in prcp_oslo that is called century with the century number as a factor? (hint: cut function)

Grouped by what?

Groups are specified as the by argument

prcp_oslo[, mean(sep), by = century]

What’s wrong here? How can we fix it?

Which is the maximum precipitation observed in Athens?

Tidy data

Very often we work with 2D arrays - matrices or data.frames. Data set is a collection of values belonging to a variable and observation. Tidy data are such that

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Which are the variables and which are the observations in our data?

Tools to tidy data - melt & dcast

To gather variable from multiple columns use melt:

#Let's start over again
load('./data/precip_month.rdata')

prcp_oslo_tidy <- melt(prcp_oslo, id.vars = 'year', value.factor = TRUE)

prcp_oslo_tidy #the order of the observation is not very efficient this way 

colnames(prcp_oslo_tidy)[2:3] <- c('month', 'prcp')

setorder(prcp_oslo_tidy, year, month)

str(prcp_oslo_tidy)

To spread the variables back to column use dcast

dcast(prcp_oslo_tidy, year ~ month)

Combination of data.tables

  • by rows - rbind
  • by columns - cbind (in general not recommended - other database operations are prefered)
a <- data.table(x = 1:2, y = 1:2)
b <- data.table(x = 3:4, y = c(3, 1))
rbind(a, b)

Let’s try to combine our data.tables into a new one called prcp_stations. First of all, we have to add new variables:

prcp_oslo_tidy[, station := factor('oslo')]

How we should proceed?

Now it is easier to solve the missing value problem now:

prcp_stations[prcp < 0, prcp := NA]
prcp_stations <- prcp_stations[complete.cases(prcp_stations)]

save(prcp_stations, './data/precip_stations.rdata')

It is easier and more efficient to manipulate data in general. Can you answer the following questions:

Which is the max monthly precipitation in Athens?

Which is the max monthly precipitation in each station?

Which is the mean monthly precipitation in Athens?

Which is the mean monthly precipitation in each station? Can you present this as a table, where rows represent the months and columns represent stations (hint: yes, it is dcast!)

Where should we save our script?

Other resources:

The original paper on data table and tidy data.

More info and summaries of the data table syntax can be found here and here.