Introduction

Introduction

The present document works as draft user manual that aims to help understand/ apply most of the examples shared during the lectures. In the following sections there is available information that will help you re-produce efficiently the results that will be followed in the presentations.

In the first lecture, we intend to work with manipulating environmental datasets using data.table. We begin with simple operations to more complex ones. There will also take place an introduction to parallel computing with R in real case studies.

In lecture 2, we will use all the information obtained from lecture 1 and use it in order to walk through an introduction to spatial visualization using a set of very interesting libraries.

The main goal of the sessions that you will attend is to work on real case studies. For this reason, you will be given a set of environmental datasets (mostly precipitation records) available from the European Climate Assessment webpage. For ease of time, all the educational dataset can be downloaded directly from the following path

Now that you have downloaded the folder with the presentation datasets, you may create a new folder, extract the downloaded files into the new folder and open a new r session. You may set the working diretory to the local file where you placed the downloaded data. That was it! You are ready to follow the presentation steps.

Lecture 1 - Introduction to Large Data Management using R

In Lecture 1, we make a short introduction to large - data management using the very well-known library data.table. All participants should take the following installation steps in order to re-produce the examples of Lecture 1.

Installing `data.table`

Click on the “Packages” tab of the R-Studio IDE.
Press Install
Choose Install from: “Repository (CRAN, CRANextra)”
Write data.table.
Remember to tick dependencies
Press install

Reading large .csv datasets using `data.table`

In this section we present a fast and user friendly way to import delimited files in R using fread.

# Load library
library(data.table)

# Define Path of the available data
dataPath <- "D:/ECA Data/"

# Read dataset with default settings
data <- fread(paste0(dataPath,"3SwedishStationsData.csv"))

# Read dataset after specifying the existence of headers
data <- fread(paste0(dataPath,"3SwedishStationsData.csv"), header=TRUE)

# Read dataset and define separator
data <- fread(paste0(dataPath,"3SwedishStationsData.csv"), header=TRUE, sep=",")

# Read specific number of rows of the dataset (-1 means all rows)
data <- fread(paste0(dataPath,"3SwedishStationsData.csv"), header=TRUE, sep=",",
              nrows=10)

# Read but select specific rows (select date and records from station 1 )
data <- fread(paste0(dataPath,"3SwedishStationsData.csv"), header=TRUE,
              sep=",", nrows=10, select=c("DATE","RR_STATION_1"))
              
# If you want to have a look at the structure of the data
str(data)

Tip When Loading an External Dataset

When loading an external dataset in R, it is advised to identify all the available variables of the dataset as.character and later to modify them accordingly. This will prevent R from identifying the variables type automatically itself.

# Load library
library(data.table)
# Define path of the data
dataPath <- "D:/ECA Data/" # Define the path where data is located
# Load data
dataset <- read.table(paste0(dataPath,"RR_STAID000001.txt") ,skip=20, sep=",",header=T, 
                      stringsAsFactors = FALSE, colClasses = "character")
# Make dataset as data.table
dataset <- data.table(dataset)
str(dataset)

Reading large .csv datasets - using the `ff` package.

The ff package is able to load dataset on disk and use fast functions for access.

library(ff)
dataPath <- "D:/ECA Data/"
data <- read.csv.ffdf(file=paste0(dataPath,"totalData.csv"), header=T, VERBOSE=T)

You may also find a very interesting article on how to import datasets in R through www.r-bloggers.com.

Copy `data.table`

A copy() may be required when doing dt_names = names(DT). Due to R’s copy-on-modify, dt_names still points to the same location in memory as names(DT). Therefore modifying DT by reference now, say by adding a new column, dt_names will also get updated. To avoid this, one has to explicitly copy: dt_names <- copy(names(DT)).

data1 <- copy(data)

Subset `data.table`

Subset is a very important operation as it is used often when it comes to perform a data manipulation procedure. For example, it often to exclude missing variables using multiple criteria.

# Load library
library(data.table)
# Read csv data
"D:/ECA Data/"
dataset<-fread(paste0(dataPath,"3SwedishStationsData.csv"),colClasses = "character")
# Apply data.table operation
dataset <- dataset[, lapply(.SD, as.numeric), .SDcols=c("RR_STATION_1","RR_STATION_2","RR_STATION_3"),
                   by=c("DATE")]

# Subset by year
dataset[year(DATE)==2008]
# Subset by month
dataset[month(DATE)==1]
# Subset dataset where non-missing values exist
dataset[!is.na(RR_STATION_1) & RR_STATION_2 & RR_STATION_3]
# Subset and return specific columns
dataset[!is.na(RR_STATION_1) & RR_STATION_2 & RR_STATION_3, RR_STATION_1]

Operations using `data.table` - Modify Date

# Load library
library(data.table)
# Define path of the data
dataPath <- "D:/ECA Data/" # Define the path where data is located
# Load data
dataset <- read.table(paste0(dataPath,"RR_STAID000001.txt") ,skip=20, sep=",",header=T, 
                      stringsAsFactors = FALSE, colClasses = "character")
# Make dataset as data.table
dataset <- data.table(dataset)

# Modify date format
dataset[,DATE:=as.Date(as.character(DATE), format("%Y%m%d"))]

Operations using `data.table` - Group By - sum

dataset[,RR:= as.numeric(RR)]
dataset[RR != -9999 ,sum(RR), by=month(DATE)]

Advanced Operations using `data.table` - Modify type for a group of variables

# Load library
library(data.table)
# Read csv data
dataset<-fread(paste0(dataPath,"3SwedishStationsData.csv"),colClasses = "character")
# Apply data.table operation
dataset <- dataset[, lapply(.SD, as.numeric), .SDcols=c("RR_STATION_1","RR_STATION_2","RR_STATION_3"),
                   by=c("DATE")]
str(dataset)

Advanced Operations using `data.table` - Group By Multiple Columns - sum

Calculate the monthly precipitation excluding missing records:

# Load library
library(data.table)
# Read csv data
dataset<-fread(paste0(dataPath,"3SwedishStationsData.csv"),colClasses = "character")
# Apply data.table operation
dataset <- dataset[, lapply(.SD, as.numeric), .SDcols=c("RR_STATION_1","RR_STATION_2","RR_STATION_3"),
                   by=c("DATE")]

monthlyMean <- dataset[RR_STATION_1 != -9999 | RR_STATION_2 != -9999 | RR_STATION_3 != -9999,
           lapply(.SD, sum, na.rm=T), by=month(DATE)]
head(monthlyMean)

Create a data.table from scratch

# Create a data.table of 100 columns and 10 rows
testDataset <- data.table(matrix(rnorm(1000, mean=10, sd=1), ncol=100))
# Create a gourping factor variable
testDataset[,GroupingFactor:= sample(x = c("A","B","C"),nrow(testDataset), replace = T)]
# Find the sum of each column by grouping factor
testDataset[, lapply(.SD, sum, na.rm=T), by=GroupingFactor]

If you want to make this operation to a specific set of columns:

# Sum of the first 10 columns by grouping factor
testDataset[, lapply(.SD, sum, na.rm=T), .SDcols=paste0("V",1:10), by=GroupingFactor]

Can you calculate the monthly standard deviation of raindall also?

Advanced Operations using `data.table` - Sub assign by reference

An advanced sub-assign operation would be assigning missing values to station_2 and station_3 in cases where station_1 has missing records too.

# Load library
library(data.table)
# Read csv data
dataset<-fread(paste0(dataPath,"3SwedishStationsData.csv"),colClasses = "character")
# Apply data.table operation
dataset <- dataset[, lapply(.SD, as.numeric), .SDcols=c("RR_STATION_1","RR_STATION_2","RR_STATION_3"),
                   by=c("DATE")]

dataset[is.na(RR_STATION_1), c("RR_STATION_2","RR_STATION_3"):=NA]

Operations using `data.table` - `setkey`

The setkey operation is used to sort a data.table. For example:

# Sorts data.table by date
setkey(dataset,DATE)

You may also sort the dataset according to multiple columns. The setkey operation is used to in order to perform data.table joins. It is also easy to learn by example:

example(data.table)

Lecture 1 - Introduction to Parallel Computing using R

In this section the following packages must be installed:

install.packages("foreach",dependencies = TRUE)) # Install foreach package
install.packages("doParallel",dependencies = TRUE)) # Install doParallel package

Parallel Computing and the for loop operation

Example: Create a function that produces a matrix with elements coming from the Normal distribution and calculate the sum of all elements

sumMatrix <- function(n) { return(sum(matrix(rnorm(n)))) }

Create a for loop that will execute this computation for 1 up to 50 matrices with the respective dimension:

x <- list()
system.time({
  for (i in 1:50){
    x[[i]] <- sumMatrix(i)
  }
  result <- unlist(x)
})

Question: Can we reproduce this example with lapply?

What if we had to calculate sum of all elements of 50000 matrices?

system.time(results <-unlist(lapply(1:50000, sumMatrix)))

A parallel version of the for loop operation:

library(foreach)
library(doParallel)
cores <- 3               # Define number of cores
cl<-makeCluster(cores)   # Create cores
registerDoParallel(cl)   # Register cores
system.time(res <- foreach(i=1:50000) %dopar% sumMatrix(i)) # Parallelize for loop
stopCluster(cl)          # Stop cluster

Parallelize Regression Models accross a number of meteorological stations - Parallel Data & Task Approach

We will try to optimize the computation time of the daily rainfall trend for all the available stations of a country. We begin with the sequential approach:

system.time({
library(data.table)
library(stringr)
library(leaflet)
dataPath <- "D:/ECA Data/" # Define the path where data is located
# Load Dataset with Locations of Meteorological Stations
tempMapping <- data.table(read.table(paste0(dataPath,"/stations.txt"), skip = 16 ,sep=",",
                                     stringsAsFactors = FALSE, header=TRUE, quote="", colClasses="character"))

# Subset greek meteorological stations
stations <- tempMapping[CN=="SE"]
stations[,STAID:=gsub(" ","",STAID)]
# Return greek stations
stations <- stations[,STAID]

# Load each one meteorological station 
stationList <- list()
for (idx in stations) { 
  stationCodec <- str_pad(idx, pad = 0, width = 6, side = "left")   # Create station code
  # Create station full name of dataset
  stationDataName <- paste0("RR_STAID",stationCodec,".txt")
  # Import dataset
  stationData <- data.table(read.table(paste0(dataPath,stationDataName), skip=20, sep=",",
                                       header=T, stringsAsFactors = FALSE, colClasses = "character"))
  stationData[,STAID:=gsub(" ","",STAID)]
  # Make RR and Q_RR as numeric
  stationData[,c("RR","Q_RR"):=lapply(.SD, as.numeric), .SDcols=c("RR","Q_RR")]
  # Drop missing records
  stationData <- stationData[RR!=-9999]
  # Manipulate Date
  stationData[,DATE:=as.Date(as.character(DATE), format("%Y%m%d"))]
  setkey(stationData, DATE)
  # Calculate Mean Annual Rainfall Amount
  stationData[,DailyTrend:=1:nrow(stationData)]
  
  results <- summary(lm(formula = RR~DailyTrend, data = stationData))
  stationList[[idx]] <- c(idx,results$coefficients[2])
}

totalTrends<-do.call(rbind,stationList)
totalTrends <- data.table(totalTrends)
setnames(totalTrends, c("STAID","Slope"))
totalTrends[,Slope:=as.numeric(Slope)]
})

Elapsed computation time 184 seconds. The following code implements the parallelized approach:

library(foreach)
library(doParallel)
library(data.table)
library(stringr)

cores <- 3               # Define number of cores
cl<-makeCluster(cores)   # Create cores
registerDoParallel(cl)   # Register cores


dataPath <- "D:/ECA Data/" # Define the path where data is located
# Load Dataset with Locations of Meteorological Stations
tempMapping <- data.table(read.table(paste0(dataPath,"/stations.txt"), skip = 16 ,sep=",",
                                     stringsAsFactors = FALSE, header=TRUE, quote="", colClasses="character"))

# Subset greek meteorological stations
stations <- tempMapping[CN=="SE"]
stations[,STAID:=gsub(" ","",STAID)]
# Return greek stations
stations <- stations[,STAID]

dailyTrendFunction <- function(idx) {
  library(data.table)
  library(stringr)
  stationCodec <- str_pad(idx, pad = 0, width = 6, side = "left")   # Create station code
  # Create station full name of dataset
  stationDataName <- paste0("RR_STAID",stationCodec,".txt")
  # Import dataset
  stationData <- data.table(read.table(paste0(dataPath,stationDataName), skip=20, sep=",",
                                       header=T, stringsAsFactors = FALSE, colClasses = "character"))
  stationData[,STAID:=gsub(" ","",STAID)]
  # Make RR and Q_RR as numeric
  stationData[,c("RR","Q_RR"):=lapply(.SD, as.numeric), .SDcols=c("RR","Q_RR")]
  # Drop missing records
  stationData <- stationData[RR!=-9999]
  # Manipulate Date
  stationData[,DATE:=as.Date(as.character(DATE), format("%Y%m%d"))]
  setkey(stationData, DATE)
  # Calculate Mean Annual Rainfall Amount
  stationData[,DailyTrend:=1:nrow(stationData)]
  
  results <- summary(lm(formula = RR~DailyTrend, data = stationData))
  stationList <- c(idx,results$coefficients[2])
  
  return(stationList)  

}

# Perform parallelization algorithm
system.time(res <- foreach(idx=stations) %dopar% dailyTrendFunction(idx)) # Parallelize for loop

# Stop cluster
stopCluster(cl)

Elapsed computation time 92 seconds.

Parallelize Tasks - Compute correlation of daily rainfall amount between all the available meteorological stations of a country

First, we begin with the sequential calculation of dailly rainfall correlation between all the available greek stations (26 stations).

library(data.table)
library(stringr)
library(corrplot)

dataPath <- "D:/ECA Data/" # Define the path where data is located
# Load Dataset with Locations of Meteorological Stations
tempMapping <- data.table(read.table(paste0(dataPath,"/stations.txt"), skip = 16 ,sep=",",
                                     stringsAsFactors = FALSE, header=TRUE, quote="", colClasses="character"))

# Subset greek meteorological stations
stations <- tempMapping[CN=="GR"]
stations[,STAID:=gsub(" ","",STAID)]
# Return greek stations
stations <- stations[,STAID]

# Load each one meteorological station 
stationList <- list()
for (idx in stations) { 
  stationCodec <- str_pad(idx, pad = 0, width = 6, side = "left")   # Create station code
  # Create station full name of dataset
  stationDataName <- paste0("RR_STAID",stationCodec,".txt")
  # Import dataset
  stationData <- data.table(read.table(paste0(dataPath,stationDataName), skip=20, sep=",",
                                       header=T, stringsAsFactors = FALSE, colClasses = "character"))
  stationData[,STAID:=gsub(" ","",STAID)]
  # Make RR and Q_RR as numeric
  stationData[,c("RR","Q_RR"):=lapply(.SD, as.numeric), .SDcols=c("RR","Q_RR")]
  # Drop missing records
  stationData <- stationData[RR!=-9999]
  # Manipulate Date
  stationData[,DATE:=as.Date(as.character(DATE), format("%Y%m%d"))]
  # Import station data into a list
  stationList[[idx]] <- stationData
}

# Combine all datasets into one
totalData<-do.call(rbind,stationList)
# Make dataset data.table
totalData <- data.table(totalData)
# Convert rainfall series to numeric
totalData[,RR:=as.numeric(RR)]
# save results
saveRDS(totalData,paste0(dataPath,"totalData.RDS"))

# Manipulate dataset in the appropriate form
dcastDataset<-dcast.data.table(totalData,formula = "DATE ~ STAID", value.var="RR")
# Compute correlation
correlValues <- data.table(cor(dcastDataset[,2:25,with=F], method = "pearson", use="pairwise.complete.obs"))
corrplot.mixed(cor(correlValues))

However, in Greece there are only 26 meteorological stations that measure daily rainfall precipitation. The elapsed computation time took approaximately 3.45 seconds. If we were to apply the same approach to all the available stations of Sweden would the code presented just above work efficiently? In a local laptop it took more than 2 hours and the task was later killed.

We proceed step by step to the parallel computing of the correlation of all the available meterorological stations in Finland:

library(data.table)
library(stringr)
library(corrplot)

dataPath <- "D:/ECA Data/" # Define the path where data is located
# Load Dataset with Locations of Meteorological Stations
tempMapping <- data.table(read.table(paste0(dataPath,"/stations.txt"), skip = 16 ,sep=",",
                                     stringsAsFactors = FALSE, header=TRUE, quote="", colClasses="character"))

# Subset greek meteorological stations
stations <- tempMapping[CN=="FI"]
stations[,STAID:=gsub(" ","",STAID)]
# Return greek stations
stations <- stations[,STAID]

# Load each one meteorological station 
stationList <- list()
for (idx in stations) { 
  stationCodec <- str_pad(idx, pad = 0, width = 6, side = "left")   # Create station code
  # Create station full name of dataset
  stationDataName <- paste0("RR_STAID",stationCodec,".txt")
  # Import dataset
  stationData <- data.table(read.table(paste0(dataPath,stationDataName), skip=20, sep=",",
                                       header=T, stringsAsFactors = FALSE, colClasses = "character"))
  stationData[,STAID:=gsub(" ","",STAID)]
  # Make RR and Q_RR as numeric
  stationData[,c("RR","Q_RR"):=lapply(.SD, as.numeric), .SDcols=c("RR","Q_RR")]
  # Drop missing records
  stationData <- stationData[RR!=-9999]
  # Manipulate Date
  stationData[,DATE:=as.Date(as.character(DATE), format("%Y%m%d"))]
  # Import station data into a list
  stationList[[idx]] <- stationData
}

# Combine all datasets into one
totalData<-do.call(rbind,stationList)
# Make dataset data.table
totalData <- data.table(totalData)
# Convert rainfall series to numeric
totalData[,RR:=as.numeric(RR)]
# save results

# Manipulate dataset in the appropriate form
dcastDataset<-dcast.data.table(totalData,formula = "DATE ~ STAID", value.var="RR")
dcastDataset[,DATE:=NULL]

dcastDataset <-as.matrix(dcastDataset)
# Compute correlation
rm(totalData)

x<-dcastDataset
rm(dcastDataset)

library(ff)
library(data.table)

nblocks <- 40
ncols <- ceiling(ncol(x)/nblocks)
chunks <- rep(1:nblocks, each=ncols)[1:ncol(x)]
chunkMap <- data.table(Column=1:ncol(x), Chunk=chunks)
chunkCombination <- expand.grid(unique(chunks),unique(chunks))
chunkCombination <- t(apply(chunkCombination,1,sort))
chunkCombination <- unique(chunkCombination)

# corMAT <- ff(vmode = "single", dim = c(ncol(x),ncol(x)))

corMAT <- matrix(0, ncol = ncol(x), nrow = ncol(x) )


calcCorrel <- function(idx,x,chunkCombination,chunkMap,corMAT){
  library(ff)
  library(data.table)
  group1 <- chunkMap[Chunk==chunkCombination[idx,1], Column]
  group2 <- chunkMap[Chunk==chunkCombination[idx,2], Column]
  
  cat("Chunk",chunkCombination[idx,1],"with chunk",chunkCombination[idx,2],"\n")
  tempCor <- cor(x[,group1], x[,group2], use="pairwise.complete.obs")  
  
  corMAT[group1,group2] <- tempCor
  corMAT[group2,group1] <- t(tempCor)
  gc()
  
  return(corMAT) 
}

library(foreach)
library(doParallel)
cores <- 4               # Define number of cores
cl<-makeCluster(cores)   # Create cores
registerDoParallel(cl)   # Register cores
system.time(res <- foreach(idx=1:nrow(chunkCombination)) %dopar% calcCorrel(idx,x,chunkCombination,chunkMap,corMAT)) # Parallelize for loop
stopCluster(cl)          # Stop cluster

Lecture 2 - Introduction to Visualization of Spatial Data using R!

In this section, the following packages must be downloaded:

install.packages("ggmap", dependencies = TRUE) # Install ggmap package
install.packages("stringr", dependencies = TRUE) # Install stringr package
install.packages("ggplot2", dependencies = TRUE) # Install ggplot2 package

Case Study - Create a map with the location of the meteorological station in Rhodes

library(ggmap)
library(ggplot2)
# Get Map
map_Location <- get_map(location = c(lon = 28.12, lat = 36.4) , zoom = 8, scale = 2,
                        language = "en-EN",  source = "google", maptype = "roadmap")

# Plot the points on the map
tempPlot <- ggmap(map_Location) + 
            geom_point(aes(x = 28.12, y = 36.4, fill = "red", alpha = 0.8),
            size = 4, shape = 21) +
            guides(fill=FALSE, alpha=FALSE, size=FALSE) +
            xlab("Longitude") + 
            ylab("Latitude")

Case Study - Plot all the available meteorological stations of Greece

library(data.table)
library(ggplot2)
dataPath <- "./data/" # Define the path where data is located
# Load Dataset with Locations of Meteorological Stations
tempMapping <- data.table(read.table(paste0(dataPath,"/stations.txt"), skip = 16 ,sep=",",
                                     stringsAsFactors = FALSE, header=TRUE, quote=""))
# Subset greek meteorological stations
greekStations <- tempMapping[CN=="GR"]
# Manipulate longitude and latitude function
findLoc <- function(x) {
  tempLoc <- str_split(x, pattern = ":") # Split character string based on ":"
  # Manipulate Longitude and Latitude
  x.out  <- as.numeric(tempLoc[[1]][1]) + as.numeric(tempLoc[[1]][2])/60 +
                       as.numeric(tempLoc[[1]][3])/3600
  return(x.out)
}

# Apply function in order to manipulate longitude and latitude
greekStations[,latUpd:=sapply(LAT,findLoc)][,longUpd:=sapply(LON,findLoc)]
greekStations[,LatLong:=paste0(latUpd,":",longUpd)]
# Find if there are duplicated stations as the data series are blended
greekStations <- greekStations[!duplicated(LatLong)]
# Get map
map_Location <- get_map(location = c(lon = mean(greekStations$longUpd), 
                lat = mean(greekStations$latUpd)) , zoom = 4, language = "en-EN",
                source = "google", maptype = "roadmap")
# Plot all map points
tempPlot <- ggmap(map_Location) +
  geom_point(data = greekStations, aes(x = longUpd, y = latUpd, fill = "red", alpha = 0.8),
             size = 4, shape = 21, color="red") +
guides(fill=FALSE, alpha=FALSE, size=FALSE) + 
xlab("Longitude") + ylab("Latitude")

Case Study - Spatial Rainfall Visualization with `ggmap` - Mean Annual Rainfall Amount

library(data.table)
library(stringr)
library(leaflet)
library(ggmap)
dataPath <- "D:/ECA Data/" # Define the path where data is located
# Load Dataset with Locations of Meteorological Stations
tempMapping <- data.table(read.table(paste0(dataPath,"/stations.txt"), skip = 16 ,sep=",",
                                     stringsAsFactors = FALSE, header=TRUE, quote="", colClasses="character"))
# Subset greek meteorological stations
greekStations <- tempMapping[CN=="GR"]
greekStations[,STAID:=gsub(" ","",STAID)]
# Return greek stations
greekStationsIds <- greekStations[,STAID]

# Load each one meteorological station 
rainfallList <- list()
for (idx in greekStationsIds) { 
  stationCodec <- str_pad(idx, pad = 0, width = 6, side = "left")   # Create station code
  # Create station full name of dataset
  stationDataName <- paste0("RR_STAID",stationCodec,".txt")
  # Import dataset
  stationData <- data.table(read.table(paste0(dataPath,stationDataName), skip=20, sep=",",
                                       header=T, stringsAsFactors = FALSE, colClasses = "character"))
  stationData[,STAID:=gsub(" ","",STAID)]
  # Make RR and Q_RR as numeric
  stationData[,c("RR","Q_RR"):=lapply(.SD, as.numeric), .SDcols=c("RR","Q_RR")]
  # Drop missing records
  stationData <- stationData[RR!=-9999]
  # Manipulate Date
  stationData[,DATE:=as.Date(as.character(DATE), format("%Y%m%d"))]
  # Calculate Mean Annual Rainfall Amount
  AnnualRainfall <- stationData[,list(AnnualRainfall=sum(RR)),
                                by=list(STAID,year(DATE))]
  setnames(AnnualRainfall, "year", "Year")
  rainfallList[[idx]] <- AnnualRainfall
}
# Combine all the results into one dataset:
totalAnnualRainfall <- rbindlist(rainfallList)
# Subset dataset
subsetAnnualRainfall <- totalAnnualRainfall[Year>=1955 & Year<=2004,
                                            list(MeanAnnualRainfall=mean(AnnualRainfall)), by=c("STAID")]

# Function that mannipulates longitude and latitude
findLoc <- function(x) {
  tempLoc <- str_split(x, pattern = ":")
  x.out  <- as.numeric(tempLoc[[1]][1]) + as.numeric(tempLoc[[1]][2])/60 +
    as.numeric(tempLoc[[1]][3])/3600
  return(x.out)
}

# Find out longitude and latitude for each station
subsetAnnualRainfall <- merge(greekStations,subsetAnnualRainfall, by=c("STAID"))
# Manipulate Longitude and latitude
subsetAnnualRainfall[,latUpd:=sapply(LAT,findLoc)][,longUpd:=sapply(LON,findLoc)]
subsetAnnualRainfall[,LatLong:=paste0(latUpd,":",longUpd)]
subsetAnnualRainfall[,STANAME:=gsub(" ","",STANAME)]

#Get Map
map_Location <- get_map(location = c(lon = mean(subsetAnnualRainfall$longUpd), 
                                     lat = mean(subsetAnnualRainfall$latUpd)) ,
                        zoom = 6, language = "en-EN",
                        source = "google", maptype = "roadmap")

# Plot all map points
tempPlot <- ggmap(map_Location) + geom_point(data = subsetAnnualRainfall,
                                             aes(x = longUpd, y = latUpd, color=MeanAnnualRainfall), size=5) + 
  xlab("Longitude") + ylab("Latitude")

Lecture 2 - Introduction to `leaflet` for R

In this section we introduce the leaflet package that creates javascript interactive maps. In order to be able to apply the code presented in this section you need to download and install the following packages:

install.packages(devtools, dependencies = TRUE)
devtools::install_github("rstudio/leaflet")

Create a leaflet map of the meteorological station in Rhodes Island

library(leaflet)
m <- leaflet() %>% setView(lng =28.1,  lat = 36.4, zoom = 12)
m %>% addTiles()

Create a map and plot the location of the meterological station

library(leaflet)
m <- leaflet() %>% setView(lng =28.12,  lat = 36.4, zoom = 14)
m %>% addTiles()  %>% addMarkers(lng =28.12,  lat = 36.4, 
                                 popup="Rhodes Meteorological Station")

Spatial Visualization with `leaflet` for R

In this part we plot as before all the available meteorological stations of Greece and their mean annual rainfall magitude:

library(data.table)
library(stringr)
library(leaflet)
dataPath <- "D:/ECA Data/" # Define the path where data is located
# Load Dataset with Locations of Meteorological Stations
tempMapping <- data.table(read.table(paste0(dataPath,"/stations.txt"), skip = 16 ,sep=",",
                                     stringsAsFactors = FALSE, header=TRUE, quote="", colClasses="character"))
# Subset greek meteorological stations
greekStations <- tempMapping[CN=="GR"]
greekStations[,STAID:=gsub(" ","",STAID)]
# Return greek stations
greekStationsIds <- greekStations[,STAID]

# Load each one meteorological station 
rainfallList <- list()
for (idx in greekStationsIds) { 
  stationCodec <- str_pad(idx, pad = 0, width = 6, side = "left")   # Create station code
  # Create station full name of dataset
  stationDataName <- paste0("RR_STAID",stationCodec,".txt")
  # Import dataset
  stationData <- data.table(read.table(paste0(dataPath,stationDataName), skip=20, sep=",",
                                       header=T, stringsAsFactors = FALSE, colClasses = "character"))
  stationData[,STAID:=gsub(" ","",STAID)]
  # Make RR and Q_RR as numeric
  stationData[,c("RR","Q_RR"):=lapply(.SD, as.numeric), .SDcols=c("RR","Q_RR")]
  # Drop missing records
  stationData <- stationData[RR!=-9999]
  # Manipulate Date
  stationData[,DATE:=as.Date(as.character(DATE), format("%Y%m%d"))]
  # Calculate Mean Annual Rainfall Amount
  AnnualRainfall <- stationData[,list(AnnualRainfall=sum(RR)),
                                 by=list(STAID,year(DATE))]
  setnames(AnnualRainfall, "year", "Year")
  rainfallList[[idx]] <- AnnualRainfall
}
# Combine all the results into one dataset:
totalAnnualRainfall <- rbindlist(rainfallList)
# Subset dataset
subsetAnnualRainfall <- totalAnnualRainfall[Year>=1955 & Year<=2004,
                                                list(MeanAnnualRainfall=mean(AnnualRainfall)), by=c("STAID")]

# Function that mannipulates longitude and latitude
findLoc <- function(x) {
  tempLoc <- str_split(x, pattern = ":")
  x.out  <- as.numeric(tempLoc[[1]][1]) + as.numeric(tempLoc[[1]][2])/60 +
    as.numeric(tempLoc[[1]][3])/3600
  return(x.out)
}

# Find out longitude and latitude for each station
subsetAnnualRainfall <- merge(greekStations,subsetAnnualRainfall, by=c("STAID"))
# Manipulate Longitude and latitude
subsetAnnualRainfall[,latUpd:=sapply(LAT,findLoc)][,longUpd:=sapply(LON,findLoc)]
subsetAnnualRainfall[,LatLong:=paste0(latUpd,":",longUpd)]
subsetAnnualRainfall[,STANAME:=gsub(" ","",STANAME)]

stationsMap <- leaflet(data = subsetAnnualRainfall) %>% addTiles() %>%
  addMarkers(~longUpd, ~latUpd, popup = ~STANAME)
stationsMap

Spatial Visualization with `leaflet` for R - Adding Circles

stationsMap <- leaflet(data = subsetAnnualRainfall) %>% addTiles() %>%
  addCircles(~longUpd, ~latUpd, popup = ~STANAME, color = c('red'))
stationsMap

Spatial Visualization with `leaflet` for R - Plot Mean Annual Rainfall Magnitude

  countrymappping  <- leaflet(subsetAnnualRainfall) %>%
  
  addTiles(group = "OSM (default)") %>%
  addProviderTiles("Stamen.Toner", group = "Toner") %>%
  addProviderTiles("Stamen.TonerLite", group = "Toner Lite") %>%
  
  # Overlay groups
  addCircles(~longUpd, ~latUpd, ~10*MeanAnnualRainfall, stroke = F, group = "Mean Annual Rainfall", color="red") %>%
  
  addMarkers(~longUpd, ~latUpd) %>%
  
  addLayersControl(
    baseGroups = c("OSM (default)", "Toner", "Toner Lite"),
    overlayGroups = c("Mean Annual Rainfall"),
    options = layersControlOptions(collapsed = FALSE))
countrymappping

Spatial Visualization with `leaflet` for R - Plot Number of Daily Rainfall Events

dataPath <- "D:/ECA Data/" # Define the path where data is located
# Load Dataset with Locations of Meteorological Stations
tempMapping <- data.table(read.table(paste0(dataPath,"/stations.txt"), skip = 16 ,sep=",",
                                     stringsAsFactors = FALSE, header=TRUE, quote="", colClasses="character"))
# Subset greek meteorological stations
greekStations <- tempMapping[CN=="GR"]
greekStations[,STAID:=gsub(" ","",STAID)]
# Return greek stations
greekStationsIds <- greekStations[,STAID]

# Load each one meteorological station 
rainfallList <- list()
for (idx in greekStationsIds) { 
  stationCodec <- str_pad(idx, pad = 0, width = 6, side = "left")   # Create station code
  # Create station full name of dataset
  stationDataName <- paste0("RR_STAID",stationCodec,".txt")
  # Import dataset
  stationData <- data.table(read.table(paste0(dataPath,stationDataName), skip=20, sep=",",
                                       header=T, stringsAsFactors = FALSE, colClasses = "character"))
  stationData[,STAID:=gsub(" ","",STAID)]
  # Make RR and Q_RR as numeric
  stationData[,c("RR","Q_RR"):=lapply(.SD, as.numeric), .SDcols=c("RR","Q_RR")]
  # Drop missing records
  stationData <- stationData[RR!=-9999]
  # Manipulate Date
  stationData[,DATE:=as.Date(as.character(DATE), format("%Y%m%d"))]
  # Calculate Number of Annual Rainfall Events
  
  AnnualRainfallEvents <- stationData[RR>0, list(NumberofRainfallEvents=.N), by=list(STAID,year(DATE))]
  rainfallList[[idx]] <- AnnualRainfallEvents
}
# Combine all the results into one dataset:
totalRainfallEvents <- rbindlist(rainfallList)
setnames(totalRainfallEvents, "year", "Year")
# Subset dataset
subsetRainfallEvents <- totalRainfallEvents[Year>=1955 & Year<=2004,
                                                list(MeanNumberofRainfallEvents=mean(NumberofRainfallEvents)), by=c("STAID")]


# Function that mannipulates longitude and latitude
findLoc <- function(x) {
  tempLoc <- str_split(x, pattern = ":")
  x.out  <- as.numeric(tempLoc[[1]][1]) + as.numeric(tempLoc[[1]][2])/60 +
    as.numeric(tempLoc[[1]][3])/3600
  return(x.out)
}

# Find out longitude and latitude for each station
subsetRainfallEvents <- merge(greekStations,subsetRainfallEvents, by=c("STAID"))
# Manipulate Longitude and latitude
subsetRainfallEvents[,latUpd:=sapply(LAT,findLoc)][,longUpd:=sapply(LON,findLoc)]
subsetRainfallEvents[,LatLong:=paste0(latUpd,":",longUpd)]

#Plot mean annual number of rainfall events

countrymappping  <- leaflet(subsetRainfallEvents) %>%
  
  addTiles(group = "OSM (default)") %>%
  addProviderTiles("Stamen.Toner", group = "Toner") %>%
  addProviderTiles("Stamen.TonerLite", group = "Toner Lite") %>%
  
  # Overlay groups
  addCircles(~longUpd, ~latUpd, ~500*MeanNumberofRainfallEvents, stroke = F, group = "Mean Annual Rainfall", color="red") %>%
  
  addMarkers(~longUpd, ~latUpd) %>%
  
  addLayersControl(
    baseGroups = c("OSM (default)", "Toner", "Toner Lite"),
    overlayGroups = c("Mean Annual Number of Rainfall Events"),
    options = layersControlOptions(collapsed = FALSE))
countrymappping

Spatial Visualization with `leaflet` for R - Cluster Geospatial Data of Sweden

library(data.table)
library(ggplot2)
library(leaflet)
library(stringr)

dataPath <- "D:/ECA Data/" # Define the path where data is located
# Load Dataset with Locations of Meteorological Stations
tempMapping <- data.table(read.table(paste0(dataPath,"/stations.txt"), skip = 16 ,sep=",",
                                     stringsAsFactors = FALSE, header=TRUE, quote=""))
# Subset greek meteorological stations
greekStations <- tempMapping[CN=="SE"]
# Manipulate longitude and latitude function
findLoc <- function(x) {
  tempLoc <- str_split(x, pattern = ":") # Split character string based on ":"
  # Manipulate Longitude and Latitude
  x.out  <- as.numeric(tempLoc[[1]][1]) + as.numeric(tempLoc[[1]][2])/60 +
    as.numeric(tempLoc[[1]][3])/3600
  return(x.out)
}

# Apply function in order to manipulate longitude and latitude
greekStations[,latUpd:=sapply(LAT,findLoc)][,longUpd:=sapply(LON,findLoc)]
greekStations[,LatLong:=paste0(latUpd,":",longUpd)]
# Find if there are duplicated stations as the data series are blended
greekStations <- greekStations[!duplicated(LatLong)]

clusterChart <- leaflet(greekStations) %>% addTiles() %>% addMarkers(~longUpd, ~latUpd,
  clusterOptions = markerClusterOptions()
)
clusterChart

Material of Introduction to large data management using R!

Konstantinos Mammas

September 2015

Introduction

Lecture 1 - Introduction to Large Data Management using R

Installing `data.table`

Reading large .csv datasets using `data.table`

Tip When Loading an External Dataset

Reading large .csv datasets - using the `ff` package.

Copy `data.table`

Subset `data.table`

Operations using `data.table` - Modify Date

Operations using `data.table` - Group By - sum

Advanced Operations using `data.table` - Modify type for a group of variables

Advanced Operations using `data.table` - Group By Multiple Columns - sum

Create a data.table from scratch

Advanced Operations using `data.table` - Sub assign by reference

Operations using `data.table` - `setkey`

Lecture 1 - Introduction to Parallel Computing using R

Parallel Computing and the for loop operation

Parallelize Regression Models accross a number of meteorological stations - Parallel Data & Task Approach

Parallelize Tasks - Compute correlation of daily rainfall amount between all the available meteorological stations of a country

Lecture 2 - Introduction to Visualization of Spatial Data using R!

Case Study - Create a map with the location of the meteorological station in Rhodes

Case Study - Plot all the available meteorological stations of Greece

Case Study - Spatial Rainfall Visualization with `ggmap` - Mean Annual Rainfall Amount

Lecture 2 - Introduction to `leaflet` for R

Create a leaflet map of the meteorological station in Rhodes Island

Create a map and plot the location of the meterological station

Spatial Visualization with `leaflet` for R

Spatial Visualization with `leaflet` for R - Adding Circles

Spatial Visualization with `leaflet` for R - Plot Mean Annual Rainfall Magnitude

Spatial Visualization with `leaflet` for R - Plot Number of Daily Rainfall Events

Spatial Visualization with `leaflet` for R - Cluster Geospatial Data of Sweden

Material of Introduction to large data management using R!

Konstantinos Mammas

September 2015

Introduction

Lecture 1 - Introduction to Large Data Management using R

Installing data.table

Reading large .csv datasets using data.table

Tip When Loading an External Dataset

Reading large .csv datasets - using the ff package.

Copy data.table

Subset data.table

Operations using data.table - Modify Date

Operations using data.table - Group By - sum

Advanced Operations using data.table - Modify type for a group of variables

Advanced Operations using data.table - Group By Multiple Columns - sum

Create a data.table from scratch

Advanced Operations using data.table - Sub assign by reference

Operations using data.table - setkey

Lecture 1 - Introduction to Parallel Computing using R

Parallel Computing and the for loop operation

Parallelize Regression Models accross a number of meteorological stations - Parallel Data & Task Approach

Parallelize Tasks - Compute correlation of daily rainfall amount between all the available meteorological stations of a country

Lecture 2 - Introduction to Visualization of Spatial Data using R!

Case Study - Create a map with the location of the meteorological station in Rhodes

Case Study - Plot all the available meteorological stations of Greece

Case Study - Spatial Rainfall Visualization with ggmap - Mean Annual Rainfall Amount

Lecture 2 - Introduction to leaflet for R

Create a leaflet map of the meteorological station in Rhodes Island

Create a map and plot the location of the meterological station

Spatial Visualization with leaflet for R

Spatial Visualization with leaflet for R - Adding Circles

Spatial Visualization with leaflet for R - Plot Mean Annual Rainfall Magnitude

Spatial Visualization with leaflet for R - Plot Number of Daily Rainfall Events

Spatial Visualization with leaflet for R - Cluster Geospatial Data of Sweden

Installing `data.table`

Reading large .csv datasets using `data.table`

Reading large .csv datasets - using the `ff` package.

Copy `data.table`

Subset `data.table`

Operations using `data.table` - Modify Date

Operations using `data.table` - Group By - sum

Advanced Operations using `data.table` - Modify type for a group of variables

Advanced Operations using `data.table` - Group By Multiple Columns - sum

Advanced Operations using `data.table` - Sub assign by reference

Operations using `data.table` - `setkey`

Case Study - Spatial Rainfall Visualization with `ggmap` - Mean Annual Rainfall Amount

Lecture 2 - Introduction to `leaflet` for R

Spatial Visualization with `leaflet` for R

Spatial Visualization with `leaflet` for R - Adding Circles

Spatial Visualization with `leaflet` for R - Plot Mean Annual Rainfall Magnitude

Spatial Visualization with `leaflet` for R - Plot Number of Daily Rainfall Events

Spatial Visualization with `leaflet` for R - Cluster Geospatial Data of Sweden