First of all, we create a new project and name it my_project. Then, we create the folders where we will save our data, code and results:
dir.create('data')
dir.create('data/raw')
dir.create('results')
dir.create('scripts')
list.files()
There are many different ways to import data. We will start with one quite common; text files. To do this we will first download them from the knmi climate explorer platform and save them in data/raw folder. Then load them with read.table function.
download.file('https://climexp.knmi.nl/data/pa11518.dat', './data/raw/precip_prague.dat')
download.file('https://climexp.knmi.nl/data/pa16714.dat', './data/raw/precip_athens.dat')
download.file('https://climexp.knmi.nl/data/pa1489.2.dat', './data/raw/precip_oslo.dat')
prcp_prague <- read.table('./data/raw/precip_prague.dat')
prcp_athens <- read.table('./data/raw/precip_athens.dat')
prcp_oslo <- read.table('./data/raw/precip_oslo.dat')
What kind of data structures are the prcp_ objects?
Can you add names to describe each column (hint: try to avoid doing it three times)?
Data.table is an enhanced data.frame.
“Data.table allows you to reduce your programming time as well as your computing time considerably, and it is especially useful if you often find yourself working with large datasets. For example, to read in a 20GB .csv file with 200 million rows and 16 columns, data.table only needs 8 minutes thanks to the fread function. This is instead of the hours it would take you with the read.csv function. Once you understand its concepts and principles, the speed and simplicity of the package are astonishing!” (Source: DataCamp site)
library(data.table)
prcp_prague <- fread('./data/raw/precip_prague.dat', skip = 17)
prcp_athens <- data.table(prcp_athens)
prcp_oslo <- data.table(prcp_oslo)
save(prcp_prague, prcp_athens, prcp_oslo, file = './data/precip_month.rdata')
DT[i, j, by] means:
Take data.table DT, select rows using i, then calculate j possibly grouped by by.
prcp_prague[year > 1960]
prcp_athens[year > 1990 & year < 2000]
prcp_oslo[year == 1921]
prcp_oslo[feb > 10]
prcp_oslo[feb > 10, feb]
prcp_oslo[feb > 10, year]
prcp_athens[year > 1990 & year < 2000, .(year, dec, jan, feb)]
Which are the years that both June and July precipitation in Prague was below 30? How much was the precipitation in June/July then?
Results are either single vector, new variable inside data.table or new data.table
# single vector
prcp_prague[, dec + jan + feb]
prcp_prague[year > 1990 & year < 2000, mean(dec)]
prcp_prague[year == 1921, dec + jan + feb]
# new variable
prcp_prague[, winter := dec + jan + feb]
prcp_prague[year > 1990 & year < 2000, m_dec_90s := mean(dec)]
prcp_prague[year == 1921, winter_1921 := dec + jan + feb]
# new data.table
prcp_prague[, .(winter = dec + jan + feb, spring = mar + apr + may)]
Can you create a new column in prcp_oslo that is called
centurywith the century number as a factor? (hint:cutfunction)
Groups are specified as the by argument
prcp_oslo[, mean(sep), by = century]
What’s wrong here? How can we fix it?
Which is the maximum precipitation observed in Athens?
Very often we work with 2D arrays - matrices or data.frames. Data set is a collection of values belonging to a variable and observation. Tidy data are such that
Which are the variables and which are the observations in our data?
melt & dcastTo gather variable from multiple columns use melt:
#Let's start over again
load('./data/precip_month.rdata')
prcp_oslo_tidy <- melt(prcp_oslo, id.vars = 'year', value.factor = TRUE)
prcp_oslo_tidy #the order of the observation is not very efficient this way
colnames(prcp_oslo_tidy)[2:3] <- c('month', 'prcp')
setorder(prcp_oslo_tidy, year, month)
str(prcp_oslo_tidy)
To spread the variables back to column use dcast
dcast(prcp_oslo_tidy, year ~ month)
rbindcbind (in general not recommended - other database operations are prefered)a <- data.table(x = 1:2, y = 1:2)
b <- data.table(x = 3:4, y = c(3, 1))
rbind(a, b)
Let’s try to combine our data.tables into a new one called prcp_stations. First of all, we have to add new variables:
prcp_oslo_tidy[, station := factor('oslo')]
How we should proceed?
Now it is easier to solve the missing value problem now:
prcp_stations[prcp < 0, prcp := NA]
prcp_stations <- prcp_stations[complete.cases(prcp_stations)]
save(prcp_stations, './data/precip_stations.rdata')
It is easier and more efficient to manipulate data in general. Can you answer the following questions:
Which is the max monthly precipitation in Athens?
Which is the max monthly precipitation in each station?
Which is the mean monthly precipitation in Athens?
Which is the mean monthly precipitation in each station? Can you present this as a table, where rows represent the months and columns represent stations (hint: yes, it is dcast!)
Where should we save our script?