Welcome to the first script of Module 2!
At the example of weather stations, you will here learn how to:
Please read the following code instructions carefully, try to understand the code that follows each instruction, execute it and see what happens. Do not hesitate to change the code or write your own code, and execute it as well!
We start off again with defining our working directory and loading all the packages that we will need:
dir <- "C:/Users/Max/Desktop/IAMO_neu/eLearning/Module_2_Climate_trends/M02_Part01_Weather-stations/"
library(reshape2)
library(dplyr)
library(ggplot2)
library(viridis)
library(pals)
library(lubridate)
library(Kendall)
Let us import the data of the station “Sisian”, and the file “parameters.csv” which contains an overview about the different parameters measured at each station:
sisian <- read.csv(paste0(dir, "data/Weather_by_station/Sisian.csv"), sep=";")
paras <- read.csv(paste0(dir, "data/parameters.csv"), sep=";")
head(sisian[,c(1:10)])
| station | parameter | year_month | day1 | day2 | day3 | day4 | day5 | day6 | day7 |
|---|---|---|---|---|---|---|---|---|---|
| Sisian | 2 | 1995-01 | 2.1 | 6.5 | 9.6 | 9.3 | 10.3 | 4.7 | 4.9 |
| Sisian | 2 | 1995-02 | 6.4 | 5.3 | 8 | 6.3 | -1.6 | 2.9 | 6 |
| Sisian | 2 | 1995-03 | 8.8 | 7.6 | 10.3 | 11.5 | 11.7 | 11.7 | 12.9 |
| Sisian | 2 | 1995-04 | 16.3 | 17.5 | 14.8 | 10.6 | 7.5 | 9.5 | 12 |
| Sisian | 2 | 1995-05 | 19.5 | 13.3 | 17 | 17 | 17.4 | 18.4 | 15.2 |
| Sisian | 2 | 1995-06 | 21.8 | 22.5 | 18.2 | 22 | 21.8 | 19.3 | 19.8 |
head(paras)
| parameter | parameter_long | parameter_short |
|---|---|---|
| 2 | Maximum_Temperature | TMAX |
| 3 | Minimum_Temperature | TMIN |
| 4 | Average_Temperature | TAVG |
| 5 | Precipitation | PRCP |
| 17 | Relative_Humidity | HUMI |
| 19 | Vapour_Pressure | VAPR |
You can see that the data in “sisian” is not yet in a format that allows us to efficiently work with it. We need to apply some processing steps:
### substitute hyphens by NA
sisian[sisian == " -"] <- NA
sisian[sisian == "-"] <- NA
# class(sisian[,4]) ## the data is not formatted as numeric!
### transform data entries to numbers
sisian[, 4:34] = sapply(sisian[, 4:34], as.numeric)
### transform column names to numbers
colnames(sisian)[4:34] <- gsub("day", "", names(sisian))[4:34]
### melt dataframe
sisian <- melt(sisian, id.vars=c("station", "parameter", "year_month"))
### create a column for year and month
sisian$YEAR <- as.numeric(substr(sisian$year_month, 1, 4))
sisian$MONTH <- as.numeric(substr(sisian$year_month, 6, 7))
### join "sisian" with "paras"
sisian <- merge(sisian, paras, by="parameter")
### rename, reorder and delete some columns
names(sisian) <- c("PARA_CODE", "STATION", "YEAR_MONTH", "DAY", "VALUE", "YEAR", "MONTH", "PARA_LONG", "PARA_SHORT")
sisian <- sisian[,c(2,9,8,6,7,4,5)]
sisian$DAY <- as.numeric(sisian$DAY)
### create a date-column and delete all entries with non-existing dates (e.g. February 30th, etc.)
sisian$DATE = as.Date(paste(sisian$YEAR, sisian$MONTH, sisian$DAY, sep = "-"))
sisian = sisian[!is.na(sisian$DATE), ]
Let’s have a look at the final data set. It contains measurements of a total of 17 parameters measured during the years 1995 to 2020:
head(sisian)
| STATION | PARA_SHORT | PARA_LONG | YEAR | MONTH | DAY | VALUE | DATE |
|---|---|---|---|---|---|---|---|
| Sisian | TMAX | Maximum_Temperature | 1995 | 1 | 1 | 2.1 | 1995-01-01 |
| Sisian | TMAX | Maximum_Temperature | 1995 | 2 | 1 | 6.4 | 1995-02-01 |
| Sisian | TMAX | Maximum_Temperature | 1995 | 3 | 1 | 8.8 | 1995-03-01 |
| Sisian | TMAX | Maximum_Temperature | 1995 | 4 | 1 | 16.3 | 1995-04-01 |
| Sisian | TMAX | Maximum_Temperature | 1995 | 5 | 1 | 19.5 | 1995-05-01 |
| Sisian | TMAX | Maximum_Temperature | 1995 | 6 | 1 | 21.8 | 1995-06-01 |
unique(sisian$PARA_LONG)
## [1] "Maximum_Temperature" "Minimum_Temperature"
## [3] "Average_Temperature" "Precipitation"
## [5] "Relative_Humidity" "Vapour_Pressure"
## [7] "Snow_Cover" "Average_Wind_Speed"
## [9] "Maximum_Wind_Speed" "Soil_Temperature_05cm"
## [11] "Soil_Temperature_10cm" "Soil_Temperature_20cm"
## [13] "Soil_Temperature_15cm" "Maximum_Soil_Temperature"
## [15] "Minimum_Soil_Temperature" "Average_Soil_Temperature"
## [17] "Sunshine_Duration"
unique(sisian$PARA_SHORT)
## [1] "TMAX" "TMIN" "TAVG" "PRCP" "HUMI" "VAPR"
## [7] "SNOW" "WIND_AVG" "WIND_MAX" "T_SOIL_05" "T_SOIL_10" "T_SOIL_20"
## [13] "T_SOIL_15" "TMAX_SOIL" "TMIN_SOIL" "TAVG_SOIL" "SUNSHINE"
unique(sisian$YEAR)
## [1] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## [16] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
From the newly created column “DATE”, let us also calculate the DOY, i.e. day of the year, with the help of the function yday() from the package lubridate:
sisian$DOY <- yday(sisian$DATE)
We then sort the entire dataframe by the column “DATE”:
sisian <- sisian[order(sisian$PARA_SHORT, sisian$DATE),]
Let us now plot the average daily temperature from 1995 to 1999. The function scale_x_date() ensures that every year is written once on the x-axis:
sisian_sel1 <- filter(sisian, PARA_SHORT == "TAVG", YEAR <= 1999 )
ggplot(data=sisian_sel1, aes(x=DATE, y=VALUE)) +
geom_line(size=1) + labs(x="", y="avg. temp. in °C") +
ggtitle("Average daily temperature at Sisian station") +
scale_x_date(date_breaks = "years" , date_labels = "%Y")
We can use the previously created column “DOY” to calculate the average temperature of each day of the year between 1995 and 1999. Let’s also calculate the standard deviation, and plot it as a shadow around the line of the mean using geom_ribbon():
sisian_sel1_mean <- sisian_sel1 %>% group_by(DOY) %>% summarize(mean = mean(na.omit(VALUE)), sd = sd(na.omit(VALUE))) %>% as.data.frame()
ggplot(data=sisian_sel1_mean, aes(x=DOY, y=mean)) +
geom_line() +
geom_ribbon(aes(ymin = mean-sd, ymax = mean+sd), alpha=0.2) +
labs(x="Day of the year", y="avg. temp. in °C") +
ggtitle("Average daily temperature at Sisian station, mean 1995-1999")
Obviously, there is some year-to-year variation in our data. We are interested in knowing out whether this variation is random, or whether there is some tendency in it, for example a trend of warming temperatures over time. To find that out, we have to further process our data. Let us start off with calculating monthly temperatures for all years:
sisian_tavg <- filter(sisian, PARA_SHORT == "TAVG")
sisian_tavg_monthly <- sisian_tavg %>% group_by(YEAR, MONTH) %>% summarize(mean = mean(na.omit(VALUE))) %>% as.data.frame()
head(sisian_tavg_monthly)
We select the June temperatures and plot them as a line:
sisian_JUN <- filter(sisian_tavg_monthly, MONTH == 6)
ggplot(data=sisian_JUN, aes(x=YEAR, y=mean)) +
geom_line() +
geom_point(size=2) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
ggtitle("June temperatures at Sisian station") +
ylab("avg. temp. in °C") + xlab("")
We calculate the differences between yearly June temperatures and the average June temperature of the reference period 2000-2015 to represent anomalies:
mean_2000_2015 <- mean(filter(sisian_JUN, YEAR %in% c(2000:2015))$mean)
sisian_JUN$diff <- sisian_JUN$mean - mean_2000_2015
ggplot(data=sisian_JUN, aes(x=YEAR, y=diff, fill=diff)) +
geom_bar(stat="identity") +
scale_fill_gradientn(colors=coolwarm(10), limits=c(-2.5, 2.5)) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
ggtitle("June temperature anomalies at Sisian station, reference: 2000-2015") +
ylab("avg. temp. in °C") + xlab("")
The question obviously is by how much June temperatures have gotten warmer over time at Sisian station, and whether this trend is significant. Let us answer the first question by applying a linear regression model with the function lm(). We regress June temperature against year, and save the results in the object “model1”, which is an object of type “lm”, i.e. a linear model object. By calling “model1”, we obtain two coefficients that represent the intercept of the regression line with the y-axis, and the slope of the regression line. Here, the slope can be interpreted as increment in degrees Celsius per year:
model1 <- lm(sisian_JUN$mean ~ sisian_JUN$YEAR)
class(model1)
## [1] "lm"
model1
##
## Call:
## lm(formula = sisian_JUN$mean ~ sisian_JUN$YEAR)
##
## Coefficients:
## (Intercept) sisian_JUN$YEAR
## -75.61392 0.04561
These numbers are also saved under the name “coefficients” in the lm-object, and we can call them to add a regression line to our previous graph with the function geom_abline():
ggplot(data=sisian_JUN, aes(x=YEAR, y=mean)) +
geom_line() +
geom_point(size=2) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
ggtitle("June temperatures at Sisian station") +
ylab("avg. temp. in °C") + xlab("") +
geom_abline(aes(intercept = model1$coefficients[1], slope = model1$coefficients[2]))
The slope in “model1” can be interpreted as follows: Between 1995 and 2020, June temperatures have increased by 0.045 °C per year. In the whole period of 25 years, June has therefore warmed by 1.125 °C. However, we have to test whether this increase is significant. For such time series without periodical trends, we can apply the Mann-Kendall test, which is available in the package Kendall. All we have to do is to supply the time series of yearly June temperatures to the function MannKendall(). The result is an object of type “Kendall” that has two values. We are here most interested in the 2-sided p-value, which tells us whether there is a significant time trend (p-value < 0.05) or whether not (p-value > 0.05). Here, the p-value is 0.07, so we can assume that there is NO significant trend in June temperatures:
model1
##
## Call:
## lm(formula = sisian_JUN$mean ~ sisian_JUN$YEAR)
##
## Coefficients:
## (Intercept) sisian_JUN$YEAR
## -75.61392 0.04561
teststat <- MannKendall(sisian_JUN$mean)
class(teststat)
## [1] "Kendall"
teststat[2]$sl[1]<0.05
## [1] FALSE
But maybe there are significant temperature trends in other months, or there are trends for other parameters. Let us now systematically scan our whole dataset for any monthly trends. That means, we want to test for each parameter, and for each month, whether there has been a significant change between 1995 and 2020. To do so, we can work with a nested for-loop, i.e. we will iterate over both the parameters and the months at once. In each step, we will calculate the respective monthly values, apply the Mann-Kendall test to them, and write the result (“1” for significant and “0” for not significant) into the table “results_temp”. At the end of each outer loop, we append “results_temp” to “results”. For precipitation and sunshine duration, we calculate from the daily measurements the monthly sum instead of the monthly mean:
## define vector with all parameter names
paras <- unique(sisian$PARA_SHORT)
## create empty results object
results <- NULL
## iterate over each parameter
for (p in 1:length(paras))
{ ## create temporary, empty results dataframe to keep track of the results of the Mann-Kendall tests
results_temp <- data.frame(parameter = paras[p], month = c(1:12), significant = NA)
## iterate over each month
for (m in 1:12)
{ ## create data selection based on parameter p and month m
sisian_sel <- filter(sisian, PARA_SHORT == paras[p], MONTH == m)
## check whether there is actually data for the combination of parameter and month
## (otherwise, there will be an error in some iterations)
if (dim(sisian_sel)[1] > 0 ) {
## for each month, calculate the SUM of daily values for all parameters related to precipitation, snow cover and sunshine duration
if (paras[p] %in% c("PRCP", "SUNSHINE")) {
sisian_sel_comb <- sisian_sel %>% group_by(YEAR, MONTH) %>% summarize(comb = sum(na.omit(VALUE))) %>% as.data.frame()
## for each month, calculate the MEAN of daily values for all parameters related to humidity, temperature, vapour pressure and wind
} else { sisian_sel_comb <- sisian_sel %>% group_by(YEAR, MONTH) %>% summarize(comb = mean(na.omit(VALUE))) %>% as.data.frame() }
## check whether there are at least 3 years with data
## (otherwise, there will be an error in some iterations)
if (dim(sisian_sel_comb)[1] > 3 ) {
## calculate Mann Kendall statistics
teststat <- MannKendall(sisian_sel_comb$comb)
## write result of Mann Kendall test into "results_temp"
if (teststat[2]$sl[1] < 0.05) { results_temp$significant[m] <- 1 }
else if (teststat[2]$sl[1] >= 0.05) { results_temp$significant[m] <- 0 }
}
}
}
## append "results_temp" to "results"
results <- rbind(results, results_temp)
}
The table “results” looks the following:
head(results)
| parameter | month | significant |
|---|---|---|
| HUMI | 1 | 0 |
| HUMI | 2 | 0 |
| HUMI | 3 | 0 |
| HUMI | 4 | 0 |
| HUMI | 5 | 0 |
| HUMI | 6 | 1 |
Let us now visualize our results to get a better overview for which parameters and which months there have been significant trends:
ggplot(data=results, aes(x=month, y=parameter, color=factor(significant))) +
geom_point(size=5) +
ggtitle("Trends from 1995 to 2020 at Sisian station") +
scale_x_continuous(breaks=c(1:12), labels = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) +
scale_color_manual(labels = c("no significant trend", "significant trend", "NA"),
values=c("firebrick2", "chartreuse3", "lightgrey"),
name="") +
ylab("") + xlab("")
Congratulations, you made it to the end of the first script! You can now go on with script 02, or solve the exercises 01 to 04.