Meteoswiss

1. Summary and table of content

This short case study proposes to explore a meteorological dataset and perform some “citizen science” about dealing with global warming.

1.1. Main question:

Does the data provides convincing evidence of climate change?

The question could be re-centered though as we are using observational data on a realitvely short period of time, and even though evidence of increasing temperatures is shown, there are a few caveats:
- We interpret data from a few locations in Switzerland (82 stations vs. more than 500), and as a result the yearly country averages we found from the data do not match Swiss official temperature records.
(Source: https://www.meteoschweiz.admin.ch/product/input/climate-data/swissmean/10.18751-Climate-Timeseries-CHTM-1.0-swiss.txt)
- interpreting the data: it is sometimes easy to jump to conclusions and the purpose of this project is not to explain the causes of climate change, just to provide statistical elements.
- short time span of the data, global warming rates are usually computed over several decades, our dataset comprises data from 2011 to 2017.
- measures of global warming come from other methods such as GWP (global warming potential) and with a considerable amount of variables to take into account.

In spite of the caveats, the insights we extracted corroborate what we “know and hear” about climate change.

This document is targeted at non-expert readers with a basic knowledge of R, inline code will be commented.

You can jump at any time to section 5 to consult all the findings.

1.2. Data:

The dataset, “Meteoswiss.xlsx” is not found online, it has been pre-processed, sourced from data originating from Meteoswiss.
Source data: 20170616_GHO_TMT_alle_Stationen_2012-2016.XLSX AND 20170706_gho_TMT_112_Stationen.XLSX.
The data contains daily average air temperatures measured at 82 weather stations in Switzerland, from 2011 to June 2017.

1.3. Table of contents:

1. Summary and table of content
2. Loading libraries and data
3. Cleaning and preprocessing data: missing values, outliers/entry errors.
4. EDA and Questions: exploratory data analysis with summary statistics, plots
5. Predicting future temperatures?
6. Findings: a few words of caution about the data and findings organized by categories

1.4. Summary findings

Observing the dataset, there are “visible” signs of climate change in Switzerland and particularly in the “coldest” localities. Also, the statistical evidence is conclusive and we observed 0.207 temperature increase between January 2012 and December 2016. Even though the findings seem to corroborate what we “intuitively” know or hear about the topic, we would need many more years of data and more explanatory variables to provide more robust elements.

We also had to take some decisions about which data to use, as there were incomplete data or outliers (100 degrees Celsius).

You can find the complete insights at the end of this document (section 5).

2. Loading libraries and dataset

The libraries include
- tidyverse (dplyr, ggplot2, …),
- readxl (much quicker alternative to the base read.xls command, re-encodes non-ASCII to UTF-8 for handling of German characters, and identify/convert dates to POSIXct format),
- lubridate (to handle date formats)
- forecast, xts: load and perform time series analytics

library(knitr); library(tidyverse); library(readxl); library(lubridate); library(forecast); library(xts)
#import first sheet of excel file, skipping the first two lines
meteo <- read_excel("Meteoswiss.xlsx", sheet = 1, col_names = TRUE, skip = 2)
#View the 3 first observations and variables
meteo[1:3, 1:3]

## # A tibble: 3 x 3
##   Datum               `Aadorf/Tänikon - TAE` `Acquarossa/Comprovasco - CO~
##   <dttm>                               <dbl>                         <dbl>
## 1 2011-01-01 00:00:00                     NA                            NA
## 2 2011-01-02 00:00:00                     NA                            NA
## 3 2011-01-03 00:00:00                     NA                            NA

3. Cleaning and preprocessing data

3.1. Shorten variable names

As seen in the output above, the variables/stations names are quite long and may be unpractical to refer to in the process. As they are composed of the full station name followed by a 3-letters code, we will retain the code only as a variable name and keep a reference file on the side to translate codes into a station name.

#save variable names to "stations" dataframe, omitting the first variable which is the date
stations <- data.frame(names(meteo)[-1], stringsAsFactors = FALSE)
#split full name and 3-letter code
stations <- str_split_fixed(stations[,1], " - ", 2)
#change column names in meteo dataset 
names(meteo) <- c("Date", stations[,2])
#View the 3 first observations and variables
meteo[1:3, 1:6]

## # A tibble: 3 x 6
##   Date                  TAE   COM   ABO   AIG   ALT
##   <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2011-01-01 00:00:00    NA    NA    NA    NA  -0.4
## 2 2011-01-02 00:00:00    NA    NA    NA    NA   0.6
## 3 2011-01-03 00:00:00    NA    NA    NA    NA  -2.1

The variables (columns) format provides now for a more concise view.

3.2. Checking empty rows or columns

We check if there are any columns or rows that are empty, so that we delete what is unnecessary.

#are there empty columns? we loop over each column of the dataset, count the missing values. We display then the table only when the value is 0, meaning that that the printed results will match empty columns
empty_cols <- rep(0, ncol(meteo))
for(i in 1:ncol(meteo)){
  all_na_cols <- sum(!is.na(meteo[,i]))
  empty_cols[i] <- all_na_cols
}
which(empty_cols == 0)

## integer(0)

The output means that there is no zero value, therefore each column contains at least one observation. We replicate the process for rows.

empty_rows <- rep(0,nrow(meteo))
for(j in 1:nrow(meteo)){
  all_na_rows <- sum(!is.na(meteo[j,-1]))
  empty_rows[j] <- all_na_rows
}
which(empty_rows == 0)

## [1] 2374

The output shows that row 2374 is empty, this is the last row and we can delete it.

#remove the column
meteo <- meteo[-2374,]
#housekeeping task, delete all objects except "meteo" and "stations"  
rm(list = setdiff(ls(), c("meteo", "stations")))

3.3. Checking NAs (Missing Values)

We start by showing summary statistics of our general data, all temperature variables combined together, therefore we exclude the first variable which contains the date.

summary(unlist(meteo[,-1]))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -24.400   3.500   9.400   9.559  15.300 100.000   22265

We see two problems here: a lot of missing values and maximum temperatures of 100 degrees celsius. We will deal with the outliers (unrealistic temperatures) in the next section.

Missing values: let’s diagnose when or where they occur, let’s start with “when” as we have only 7 years to inspect.

#format a dataframe to store the missing values per year
NA_year_table <- data.frame(Year = c(2011:2017), NAs = rep(0))
#loop over each year and count how many NAs we have
for(i in c(1:7)){
  a <- sum(is.na(meteo[year(meteo$Date) == 2010+i,]))
  NA_year_table$NAs[i] <- a
}
#print the results
NA_year_table

##   Year   NAs
## 1 2011 22265
## 2 2012     0
## 3 2013     0
## 4 2014     0
## 5 2015     0
## 6 2016     0
## 7 2017     0

It seems that many stations were not monitored in 2011, we can measure the impact of the missing data over all observations.

#find how many missing values in 2011 (row 1)
missing_2011 <- NA_year_table[1,]$NAs
#find how many possible cases we have in 2011, total observations in 2011 multiplied by the stations
cases_2011 <- nrow(meteo[year(meteo$Date) == 2011,]) * ncol(meteo[-1])
missing_2011 / cases_2011

## [1] 0.7439024

Almost 75% of values are missing for 2011. We do not have enough elements at this stage to decide whether the 25% of available values in 2011 are representative of the whole country.
Therefore we decide to focus for now on the 2012 to 2017 period.

We create two new datasets, “meteo_11” (where we’ll keep the data on the side if we need to revert to it later) and
“meteo_12_17”, which will become our primary dataset for the next steps.

meteo_11 <- meteo[year(meteo$Date) == 2011,]
meteo_12_17 <- meteo[year(meteo$Date) > 2011,]
#print new summary
summary(unlist(meteo_12_17[,-1]))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -24.400   3.400   9.300   9.503  15.200 100.000

All missing values disappeared.

3.4. Dealing with outliers

We need now to deal with the maximum temperatures of 100 degrees Celsius, which are obvious errors or a coding convention when there was no observation.
The first step is to replace the extreme values (100 degrees Celsius) by missing values (NA).

#replace values of 100 with NA
meteo_12_17[ ,-1][meteo_12_17[ ,-1] == 100] <- NA
#print summary again
summary(unlist(meteo_12_17[,-1]))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -24.400   3.400   9.300   9.235  15.100  29.500     487

It looks now more reasonable with temparatures comprised between -24 degrees and +29.5, but this introduced 487 new missing values.

3.5. Imputing missing values

We need to take a decision as of how to handle these missing values. Before getting too much into strategies for handling missing values, let’s print again a summary per year of the missing values:

#re-using above code
NA_year_table <- data.frame(Year = c(2012:2017), NAs = rep(0))
for(i in c(1:6)){
  a <- sum(is.na(meteo_12_17[year(meteo_12_17$Date) == 2011+i,]))
  NA_year_table$NAs[i] <- a
}
NA_year_table

##   Year NAs
## 1 2012  81
## 2 2013 173
## 3 2014 144
## 4 2015  46
## 5 2016  29
## 6 2017  14

STRATEGY FOR IMPUTING MISSING VALUES
When working mostly with means, like we do here with mean temperatures, imputing missing value linearly is the safest approach. For instance, if we had a series of 1 to 10 with a mean 5.5, and took another similar series with missing values between 1 and 10, we would come to the same result:

mean(1:10)

## [1] 5.5

mean(c(1, NA, NA, NA, NA, NA, NA, NA, NA, NA, 10), na.rm = TRUE)

## [1] 5.5

So this seems like the best strategy, but we have a lot of missing values in 2013 and 2014, we ideally don’t want to have too many consecutive missing values for our linearly imputing method as this may overlap months or seasons and bias the results.

We use for this the rle() function that will check consecutive NAs in our dataset.

#check consecutive missing values
lrna <- rle(is.na(unlist(meteo_12_17))) 
sort(lrna$lengths[lrna$values], decreasing = TRUE)

##  EVI707  CHZ913  CRM731  EGO894  SPF571 BIE1137 BIZ1524  BRL348   RAG38 
##     100      92      28      24      22      21      21      21      15 
##  EGO921  CRM147 CRM1000  EVI539 SPF1168  EGO290  EGO731  GRE410 INT1998 
##      11       8       8       8       6       5       5       5       5 
##  MOE136 ARH1110 BIZ1923  CRM158  GIH746   BRL93 MAH1792 SPF1280    BAS1 
##       5       4       4       4       4       4       4       4       3 
##  CRM604 CRM1325 CRM1414  MAH937   MOE58 BIZ1656 CRM1628  EGO908 EVI1431 
##       3       3       3       3       3       2       2       2       2 
##  MAH294 MOE1226  SPF181   ARH90  ARH286 BIZ1890   EVI90  EVI286 EVI1425 
##       2       2       2       1       1       1       1       1       1 
##   GIH96 GRE1903   BRL58   BRL98  BRL130  BRL225  BRL257   MAH90  MAH286 
##       1       1       1       1       1       1       1       1       1 
##  MAH416  MOE575 
##       1       1

The results are not ideal, top values (100, 92, …) show that there will be some consecutive missing values for long stretches of time, overlapping over months and seaons. This means that the calculated means over months and seasons will not be entirely reliable.
Instead of deleting further data, since the occurences of long stretches of missing values are not frequent, we will accept that bias. For more impactful projects, we would need further exploration and a different strategy.

We impute missing values using the approxfun() command:

#we exclude the first column that contains the date, so that it retains the original date format
meteo_12_17[-1] <- data.frame(lapply(meteo_12_17[-1], function(x) approxfun(seq_along(x), x)(seq_along(x))))
#reprint summary
summary(unlist(meteo_12_17[,-1]))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -24.400   3.400   9.300   9.235  15.100  29.500       3

The summary still shows 3 missing values, probabbly at the beginning or end of a column where the approxfun command didn’t find a bounding value, let’s identify where it occured:

#the sapply chunk returns a table with each column and number of NAs per column, we use the same chunk to subset and show only values different from 0
sapply(meteo_12_17, function(x) sum(is.na(x)))[sapply(meteo_12_17, function(x) sum(is.na(x)))!=0]

## RAG 
##   3

#now that we found that the RAG column was impacted, let's find in which rows 
which(is.na(meteo_12_17$RAG))

## [1] 2006 2007 2008

#the 3 last rows show NA values, let's take a look at what are the missing values, starting with a few values before the NAs appear
meteo_12_17$RAG[2000:2008]

## [1] 26.5 25.6 22.9 19.5 22.3 20.8   NA   NA   NA

To deal with these values, we could use some time-series / ARIMA type of scheme to impute values, but we’ll use a simpler strategy by just taking for each NA the mean of the past 5 days:

while(sum(is.na(meteo_12_17$RAG))!=0){
  index_next_na <- match(NA, meteo_12_17$RAG)
  meteo_12_17$RAG[index_next_na] <- round(mean(meteo_12_17$RAG[(index_next_na-5):(index_next_na-1)]),1)
}
#check what the imputed values look like
meteo_12_17$RAG[2000:2008]

## [1] 26.5 25.6 22.9 19.5 22.3 20.8 22.2 21.5 21.3

#check summary
summary(unlist(meteo_12_17[,-1]))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -24.400   3.400   9.300   9.236  15.100  29.500

We find ourselves with a clean dataset, no missing values, clear column names, no outliers and correct formats.

We can proceed with the actual tasks at hand.

#cleaning variables
rm(list = setdiff(ls(), c("meteo_11", "meteo_12_17", "stations")))

4. Exploratory Data Analysis and Questions

In this section, we visualize the data to get a feel of how it is structured, and answer some questions at the same time.

We start by adding some variables to the dataset:
- mean_stations: mean across all monitoring stations in Switzerland
- Year: extracting the year from the date
- Month: extracting the month from the date
- Season: extracintg the season from the date, we use a shortcut and take full calendar months instead of starting at the 21st of the month for start of season, therefore winter will be January, February and March, instead of 21st December to 20th March.

#add mean across stations variable in dataframe, we exclude the date from the mean 
meteo_new_1217 <- meteo_12_17[-1] %>% mutate(mean_stations = rowMeans(.))
#we reconstitue the dataframe adding the Date again
meteo_new_1217 <- cbind(Date = meteo_12_17$Date, meteo_new_1217)
#add Year, Month and Seasons variables
meteo_new_1217 <- meteo_new_1217 %>% mutate(Year = year(Date), Month = month(Date), 
                                      Season = ifelse(Month %in% 10:12, "Fall",
                                      ifelse(Month %in% 1:3, "Winter",
                                      ifelse(Month %in% 4:6, "Spring",
                                      "Summer")))
                                      )
#reorder the seasons starting with fall (for clearer plots)
meteo_new_1217$Season <- factor(meteo_new_1217$Season, levels = c("Fall", "Winter", "Spring", "Summer"))
dim(meteo_new_1217)

## [1] 2008   87

Structure of our new dataset (meteo_new_1217):
- 2008 observations (days)
- 87 variables: Date, 82 stations, and 4 newly created variables (mean_stations, Year, Month, Season)

4.1. Question 1: is there, on average, statistical evidence of climate change in Switzerland?

Side by side boxplots of yearly temperature averages across all stations in Switzerland:

ggplot(meteo_new_1217, aes(x = as.factor(Year), y = mean_stations, fill = as.factor(Year))) + geom_boxplot() + 
  geom_jitter(alpha = 0.3, color = "grey") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  labs(title = "Figure 1 - Boxplot of average temperatures per year", x = "Year", y = "Temperature")

There is no obvious visual evidence of an increase.
We can ignore year 2017 as it is incomplete (measurements until end of June, which pushes the mean down to winter and springtime temperatures).

Another way would be to plot the daily temperatures across all stations:

ggplot(data = meteo_new_1217, aes(x = Date, y = mean_stations)) + geom_line() + geom_smooth(method = "lm", color = "red") + geom_smooth() + labs(title = "Figure 2 - Average daily temperatures across all stations (01/2012 to 06/2017)", x = "Date", y = "Temperatures")

There is no surprise in the seasonality of the time series and we observe a hint at an increasing trend with the weakly increasing slope of the regression line (in red).

MEASURING THE TEMPERATURE CHANGE PER YEAR

We create a summary table per year, including all the years present in the dataset.

#initialize table with summary statistics
yearly_stats <- data.frame(Min = rep(0,7), First_quart = rep(0,7), Median = rep(0,7), Mean = rep(0,7), Third_quart = rep(0,7), Max = rep(0,7))
rownames(yearly_stats) <- c(2011:2017)
#incorporating year 2011 from our meteo_11 dataset
yearly_stats[1,1] <- min(unlist(meteo_11[-1]), na.rm = TRUE)
yearly_stats[1,2] <- quantile(unlist(meteo_11[-1]), 0.25, na.rm = TRUE)
yearly_stats[1,3] <- median(unlist(meteo_11[-1]), na.rm = TRUE)
yearly_stats[1,4] <- mean(unlist(meteo_11[-1]), na.rm = TRUE)
yearly_stats[1,5] <- quantile(unlist(meteo_11[-1]), 0.75, na.rm = TRUE)
yearly_stats[1,6] <- max(unlist(meteo_11[-1]), na.rm = TRUE)
#summary statistics for each subsequent year from meteo_new_1217 dataset 
for(i in 2:7){
  yearly_stats[i,1] <- min(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,83:87)]))
  yearly_stats[i,2] <- quantile(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,83:87)]),0.25)
  yearly_stats[i,3] <- median(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,83:87)]))
  yearly_stats[i,4] <- mean(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,83:87)]))
  yearly_stats[i,5] <- quantile(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,83:87)]),0.75)
  yearly_stats[i,6] <- max(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,83:87)]))
}
#show table
kable(yearly_stats)

	Min	First_quart	Median	Mean	Third_quart	Max
2011	-11.6	4.7	11.3	10.750424	16.6	27.8
2012	-24.4	3.4	9.5	9.088994	15.6	28.5
2013	-17.1	1.8	8.9	8.573798	14.9	28.8
2014	-14.8	4.7	10.1	9.887242	14.9	26.8
2015	-15.2	3.7	9.7	9.831514	15.4	29.5
2016	-16.9	3.4	8.8	9.297885	15.3	26.4
2017	-20.3	2.5	8.1	8.121042	13.7	27.7

For yearly summary, we exclude 2017, that is incomplete. The question remains whether we should include 2011, as only 25% of the stations are monitoring temperatures at that time.

We create two subsets of the summary table and plot the means as well as a regression line:

#including 2011, excluding 2017:
yearly_stats_1116 <- yearly_stats[-7,] 
#exluding both 2011 and 2017:
yearly_stats_1216 <- yearly_stats[-c(1,7),] 
#Plot including 2011:
ggplot(yearly_stats_1116, aes(x = c(2011:2016), y = Mean)) + geom_point() + geom_line() + 
  scale_y_continuous(breaks = round(seq(min(yearly_stats_1116$Mean), max(yearly_stats_1116$Mean), by = 0.1),1)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(title = "Figure 3 - Average yearly temperatures including 2011", x = "Years", y = "Mean Temperatures")

#Regression slope
lm(data = yearly_stats_1116, Mean ~ c(2011:2016))

## 
## Call:
## lm(formula = Mean ~ c(2011:2016), data = yearly_stats_1116)
## 
## Coefficients:
##  (Intercept)  c(2011:2016)  
##     223.6752       -0.1063

We see that including 2011 contradicts our above clue that there would be a slight increase in temperatures. We see a decrease, with a slope of -0.1.
This is because mean temperatures in 2011 were higher than any subsequent year. We cannot explain at this point if this is a real fact, or due to other factors such as only 25% of the stations monitoring the temperature.

Therefore we decide to exclude 2011 and provide again a plot and a slope value for linear regression:

ggplot(yearly_stats_1216, aes(x = c(2012:2016), y = Mean)) + geom_line() + geom_point() + 
  scale_y_continuous(breaks = round(seq(min(yearly_stats_1216$Mean), max(yearly_stats_1216$Mean), by = 0.1),1)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(title = "Figure 4 - Average yearly temperatures excluding 2011", x = "Years", y = "Mean Temperatures")

summary(lm(data = yearly_stats_1216, Mean ~ c(2012:2016)))

## 
## Call:
## lm(formula = Mean ~ c(2012:2016), data = yearly_stats_1216)
## 
## Residuals:
##     2012     2013     2014     2015     2016 
##  0.08821 -0.59454  0.55136  0.32808 -0.37310 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -328.1095   351.1699  -0.934    0.419
## c(2012:2016)    0.1675     0.1744   0.961    0.407
## 
## Residual standard error: 0.5514 on 3 degrees of freedom
## Multiple R-squared:  0.2353, Adjusted R-squared:  -0.01953 
## F-statistic: 0.9234 on 1 and 3 DF,  p-value: 0.4075

The slope is increasing this time, by a 0.1675 factor. We printed a full summary, which also tests the null hypothesis that the slope is not different than zero. The high p-value (>0.05) makes us fail to reject the null hypothesis and conclude that the slope we found may not be that significant, it confirms what we visually suspected, that a linear model is not the best way to model temperature increase (at least not with so few data points).

A larger view of the mean, maximum and minimum temperatures over time shows that even though the means and maximum seem to remain more or less the same, the minimal temperatures seem to get higher over time.

ggplot(yearly_stats_1216, aes(x = c(2012:2016), y = Mean)) + geom_line() + geom_point() + 
  geom_line(aes(y = Min), color = "blue") + geom_point(aes(y = Min), color = "blue") + 
  geom_line(aes(y = Max), color = "red") + geom_point(aes(y = Max), color = "red") + 
  labs(title = "Figure 5 - Minimum, Mean and Maximum Temperatures", x = "Year", y = "Average Temperature")

We perform now a t-test in order to see whether the temperature differences between 2016 and 2012 are significantly different than 0, statistically proving that there could be effects of global warming visible in Switzerland.

We chose 2016 and 2012 as they are years with complete data in our dataset. In order to use a compare means with a paired t-test, we need to create a dataframe showing the average per station and per year: to each station measurement in 2012 will correpond a measurement in 2016:

#create dataframe to store means per station and per year, we reuse our dataset, keeping only station variables, years and first 5 rows only
stations_mean_12_16 <- meteo_new_1217[1:5,-c(1,84,86,87)]
#reorder columns
stations_mean_12_16 <- stations_mean_12_16[,c(83,1:82)]
#compute means of each station per year with a for loop, existing values will be overwritten
for(i in 1:5){
  for(j in 2:83){
    stations_mean_12_16[i,j] <- mean(meteo_new_1217[meteo_new_1217$Year == 2011+i,j])
  }
}
#Perform t-test with paired means
t.test(as.numeric(stations_mean_12_16[5,-1]), as.numeric(stations_mean_12_16[1,-1]), paired = TRUE)

## 
##  Paired t-test
## 
## data:  as.numeric(stations_mean_12_16[5, -1]) and as.numeric(stations_mean_12_16[1, -1])
## t = 7.1234, df = 81, p-value = 3.893e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1492058 0.2648619
## sample estimates:
## mean of the differences 
##               0.2070339

The output of the t-test shows a difference in temperatures of about 0.207 degrees celsius in 5 years.

The small p-value (less than 0.05) makes us reject the null hypothesis (that the difference of both means are equal to zero) in favor of the alternative hypothesis.
In plain language, we do have convincing statistical evidence that the mean temperatures differ between 2012 and 2016, with a point estimate of 0,207 degrees Celsius. And we are 95% confidence that the true increase in average temperature was between 0,15 and 0,265 degrees Celsius.

A very important fact also is that the “accepted” global warming rate of 0.15 to 0.18 degrees Celsius per decade is exceeded here in only half a decade of recoreded values.
(Source: https://www.climate.gov/news-features/understanding-climate/climate-change-global-temperature)

A word of caution though as the “accepted” global warming rate is calculated over longer periods of time and with a large array of sophisticated methods.
We could see before for instance that adding the year 2011 would completely change our findings. We would need more years of observations in order to confirm our findings.

4.2. Question 2: are there some trends depending on seasons?

Some headlines can sometimes state that “winters are getting colder”, “summers are getting warmer”, or the other way around.
Out next question could be to check if we see stronger evidence of climate change when looking by seasons only.

We use a bit of a shortcut here and observe only the slope of the regression lines per season, across the 2012-2017 data (observations in 2017 are incomplete and used for Winter and Spring).

ggplot(meteo_new_1217, aes(x = as.factor(Year), y = mean_stations)) + geom_boxplot() + 
  geom_jitter(alpha = 0.3, color = "salmon") + 
  facet_grid(~Season) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  labs(title = "Figure 6 - Boxplot of temperatures per season and per year", x = "Year", y = "Average Temperature")

The boxplots show significant variations within each season group, but to get a better understanding, we calculate now the slope for each season, giving a sense of an increase or decrease of temperatures.

We need to group the data differently for this:

#group and calculate means per year and per season
season_means <- meteo_new_1217 %>% group_by(Year, Season) %>% 
  summarise(Mean = mean(mean_stations)) %>% ungroup()
#show table
season_coefs <- data.frame(Season = c("Fall", "Winter", "Spring", "Summer"), Coefs = rep(0,4))
fall_means <- season_means %>% filter(Season == "Fall")
winter_means <- season_means %>% filter(Season == "Winter")
spring_means <- season_means %>% filter(Season == "Spring")
summer_means <- season_means %>% filter(Season == "Summer")
season_coefs[1,2] <- summary(lm(data = fall_means, Mean ~ Year))$coef[2,1]
season_coefs[2,2] <- summary(lm(data = winter_means, Mean ~ Year))$coef[2,1]
season_coefs[3,2] <- summary(lm(data = spring_means, Mean ~ Year))$coef[2,1]
season_coefs[4,2] <- summary(lm(data = summer_means, Mean ~ Year))$coef[2,1]
kable(season_coefs)

Season	Coefs
Fall	-0.0424852
Winter	0.3227299
Spring	0.1777783
Summer	0.2194489

A basic interpretation would tell us that falls and springs don’t see much increase or decrease in temperature, but summers and mostly winters seem to get warmer. The mean is however sensitive to outliers, and the trend in winters could be partly associated with the fact that we observe less very-low temperatures in the winters as seen on figure 5.

Observing medians that are more robust to outliers could be an alternative option.

4.3. Question 3: variations between stations/localities

In a country with a diverse geography and natural fluctuations like in Switzerland, we may observe some large variations depending on where the stations are located.

We create 4 tables for summary statistics: minimum, maximum, mean and median.
In each table, we want to have all stations, their yearly mean values for each statistic, their linear model coefficient and the p-value of the slope.

#1. INITIALIZING TABLE, starting with the minimum statistic
#copy our main dataset into a new one, stations_min_12_16, we remove the date and all columns we added before, retaining just the Year variable and colnames of all stations
stations_min_12_16 <- meteo_new_1217[,-c(1,84,86,87)]
#keep 5 first rows, one per year from 2012 to 2016
stations_min_12_16 <- stations_min_12_16[1:5,]
#reorganize columns
stations_min_12_16 <- stations_min_12_16[,c(83,1:82)]
#set all values to 0
stations_min_12_16[,] <- 0

#2. POPULATE TABLE WITH MINIMUM TEMPERATURES PER YEAR AND PER STATION
for(i in 1:5){
    stations_min_12_16[i,1] <- 2011+i
    for(j in 2:83){
    stations_min_12_16[i,j] <- min(meteo_new_1217[meteo_new_1217$Year == 2011+i,j])
  }
}
#replicate the process for maximum temperatures dataset
#we skip steps of part 1 about initializing the dataset by re-using the first minimum temperatures dataset
stations_max_12_16 <- stations_min_12_16
#for loop to populate the table with maximal temperatures
for(i in 1:5){
  for(j in 2:83){
    stations_max_12_16[i,j] <- max(meteo_new_1217[meteo_new_1217$Year == 2011+i,j])
  }
}
#replicate the process for means
stations_mean_12_16 <- stations_min_12_16
for(i in 1:5){
  for(j in 2:83){
    stations_mean_12_16[i,j] <- mean(meteo_new_1217[meteo_new_1217$Year == 2011+i,j])
  }
}
#rpelication the process for medians
stations_median_12_16 <- stations_min_12_16
for(i in 1:5){
  for(j in 2:83){
    stations_median_12_16[i,j] <- median(meteo_new_1217[meteo_new_1217$Year == 2011+i,j])
  }
}

Now that we have clear summary tables for all stations, we transform them into a better format to perform row-wise operations, finding means, slope coefficient and the pvalue of the slope.

#create function that performs the following: transpose the dataframe, add and calculate mean, slope coefficient and pvalue columns.
output_df <- function(x){
  #convert the name of function attribute as character string to re-use at the end for naming the resulting object
  name <- as.character(substitute(x))
  #transpose, columns as rows, rows as columns
  x <- as.data.frame(t(as.matrix(x)))
  #rename columns
  names(x) <- c(2012:2016)
  #delete first row (which was holding the years names)
  x <- x[-1,]
  #calculate mean and add column
  x$mean <- apply(x, 1, FUN = mean)
  #calculate lm and p-value and add columns
  for(i in 1:82){
    x$coef[i] <- lm(as.numeric(x[i,1:5]) ~ c(2012:2016))$coefficients[2]
    x$pvalue[i] <- summary(lm(as.numeric(x[i,1:5]) ~ c(2012:2016)))$coefficients[2,4]
  }
  #"super assign" result to global environment with with the same name
  assign(name, x, envir = .GlobalEnv)
}
#transform each table with the help of the function 
output_df(stations_min_12_16)
output_df(stations_max_12_16)
output_df(stations_mean_12_16)
output_df(stations_median_12_16)
#check the final format
dim(stations_mean_12_16)

## [1] 82  8

head((stations_mean_12_16))

##          2012      2013      2014      2015      2016      mean       coef
## TAE  9.198361  8.557534 10.016986  9.887123  9.334153  9.398831 0.16011737
## COM 10.788525 10.313973 10.966849 11.376986 10.841257 10.857518 0.11684782
## ABO  6.342077  5.761644  7.045479  7.325205  6.807923  6.656466 0.24952556
## AIG 10.709563 10.101096 11.325753 11.005205 10.658743 10.760072 0.08024702
## ALT 10.209016  9.726027 11.136438 10.633973 10.515847 10.444260 0.15216064
## ARH 10.266803  9.867945 10.933425 10.646027 10.420765 10.426993 0.10860057
##        pvalue
## TAE 0.4673919
## COM 0.4084714
## ABO 0.2449977
## AIG 0.6494054
## ALT 0.4350199
## ARH 0.4717388

Please note that we are using only a fraction of the tables we created, most of them being used for exploratory data analysis. We could also easily use these dataframes for an interactive shiny application, this is however beyond the scope of this case study.

What are the warmest localities, in average, over the 2012-2016 period?

#transform the stations object (created at start of project) in a dataframe and modify the variable names
stations <- as.data.frame(stations); names(stations) <- c("station_name", "code")

#warmest localities in average, displaying top 10 values
stations_mean_12_16 %>% tibble::rownames_to_column() %>% arrange(desc(mean)) %>% select(rowname, mean, coef) %>% head(10) %>%
  left_join(.,stations, by = c("rowname" = "code")) %>% kable(digits = 2, align = "cccc")

rowname	mean	coef	station_name
LUG	13.36	0.10	Lugano
OTL	13.25	0.14	Locarno/Monti
MAG	12.13	0.11	Magadino/Cadenazzo
SBO	11.88	0.06	Stabio
PUY	11.48	0.16	Pully
GRO	11.41	-0.21	Grono
SIO	11.19	0.13	Sion
BAS	11.16	0.15	Basel/Binningen
GVE	11.13	0.16	Genève-Cointrin
EVI	11.08	0.17	Evionnaz

What are the coldest localities, in average, over the 2012-2016 period?

stations_mean_12_16 %>% tibble::rownames_to_column() %>% arrange(mean) %>% select(rowname, mean, coef) %>% head(10) %>%
  left_join(.,stations, by = c("rowname" = "code")) %>% kable(digits = 2, align = "cccc")

rowname	mean	coef	station_name
ANT	4.45	0.03	Andermatt
ULR	4.62	0.22	Ulrichen
BRL	5.57	0.12	La Brévine
NAP	6.02	0.26	Napf
SCU	6.28	0.16	Scuol
MVE	6.63	0.20	Montana
FRE	6.65	0.28	Bullet/La Frétaz
ABO	6.66	0.25	Adelboden
SMM	6.77	0.15	Sta. Maria, Val Müstair
CHD	7.11	0.23	Château-d’Oex

Localities that experienced the highest rate of change in temperatures
A larger slope coefficient shows a higher rate of warmer temperatures over the observed period.

#localities with strongest linear regression slope over the 5 years period, meaning where global warming could have been felt obviously
stations_mean_12_16 %>% tibble::rownames_to_column() %>% arrange(desc(coef)) %>% select(rowname, mean, coef, pvalue) %>% head(10) %>%
  left_join(.,stations, by = c("rowname" = "code")) %>% kable(digits = 2, align = "ccccc")

rowname	mean	coef	pvalue	station_name
ENG	7.16	0.28	0.22	Engelberg
FRE	6.65	0.28	0.24	Bullet/La Frétaz
HOE	7.36	0.27	0.31	Hörnli
EBK	8.34	0.27	0.22	Ebnat-Kappel
ELM	7.34	0.27	0.27	Elm
DIS	7.36	0.27	0.17	Disentis/Sedrun
NAP	6.02	0.26	0.27	Napf
ABO	6.66	0.25	0.24	Adelboden
LAE	8.06	0.25	0.34	Lägern
PLF	7.81	0.24	0.32	Plaffeien

These 3 tables show us some interesting insights:
- the warmest localities have a slope close to 0, meaning that they do not seem to have experienced any significant increase in yearly average temperatures.
- the coldest localities (with lowest average temperatures) have significantly higher slopes (except for Andermatt), meaning that they may have experienced more obvious warming rates.
- the localities that experienced the most drastic temperature increase rates show a low average mean in temperatures, 3 of them are among the top 10 coldest Swiss localities (Bullet/La Frétaz, Napf, Adelboden).

We can verify this by plotting the relation between slope coefficients and average temperature of each locality:

ggplot(data = stations_mean_12_16, aes(x = coef, y = mean)) + geom_point() + geom_smooth(method = "lm") + 
  labs(title = "Figure 7 - Coefficient slope vs. Average Temperature", x = "slope coefficient", y = "mean temperatures")

There is no notion of causality here, but there have been some explanations that global warming would impact first the coldest (sometimes glaciers) regions before impacting significatly the moderate climate areas.

4.4. Question 4: can we predict future temperatures?

To get a complete picture of global temperatures, we combine measurements from the air above land and the ocean surface.

The earth is divided into grid boxes of equal surface, the more monitoring stations we have per block, the more accurate the measure will be.

In the case of Switzerland, our current dataset covers 82 stations out of over 500, meaning that our results may be biased, mostly considering the large natural fluctuations within a country like Switzerland.

More information can be found here: https://www.carbonbrief.org/explainer-how-do-scientists-measure-global-temperature.

Working with a small dataset, our prediction cannot be expected to be very thorough:
- we saw that a linear model wouldn’t be the best fit, this is also proven by the fact that the earth warming rate is frequently reviewed by scientists and does not follow a known pattern.
- with our 5 years of data, we can only attempt prediction for the next one to two years, and we could see that including 2011, our regression line would have been decreasing.

LONG TERM PREDICTIONS:

As a reminder, we can print again the linear trend we used from 2012 to 2016, extending the regression line to 2050.

ggplot(yearly_stats_1216, aes(x = c(2012:2016), y = Mean)) + geom_line() + geom_point() + xlim(c(2012, 2050)) + ylim(c(8.5, 20)) + 
  geom_abline(slope = 0.1675, intercept = -328.1095, color = "red") + 
  labs(title = "Figure 8 - Extrapolating the regression line", x = "Years", y = "Mean Temperatures")

lm(data = yearly_stats_1216, Mean ~ c(2012:2016))

## 
## Call:
## lm(formula = Mean ~ c(2012:2016), data = yearly_stats_1216)
## 
## Coefficients:
##  (Intercept)  c(2012:2016)  
##    -328.1095        0.1675

#using the linear model for x = 2050: 
-328.1095 + (0.1675*2050)

## [1] 15.2655

We see (and very much hope), that an average temperature of 15.3 degres Celsius in Switzerland in 2050 is unrealistic. This would mean a difference of 6 degrees between 2016 and 2050 (T_hat_2050 - T_observed_2016 = 15.2655-9.297885).

Using some simmple algebra, we could forecast temperatures in 2050 based on the agreed warming rate of 0.15 to 0.18 degrees per decade (Source: NOAA, https://www.climate.gov/news-features/understanding-climate/climate-change-global-temperature).

(2050-2016+1)*0.015; (2050-2016+1)*0.018

## [1] 0.525

## [1] 0.63

So we would expect a nationwide temperature increase in Switzerland of 0.525 to 0.63 degrees Celsius from 2016 to 2050, against a whopping 6 degrees found with a linear model using only 5 years of historical data.

SHORT TERM PREDICTIONS:
Our dataset was not complete and let’s imagine we would like to forecast the average temperatures for 2017 (we have only 6 months of data for 2017).
We have several methods:
1. Using our linear regression model.
2. Using the accepted global warming rate, we will take an average 0.0165 degrees per year (1.5 to 1.8 degrees per century).
3. Using time series.

#initialize reporting table
predicted_mean_2017 <- data.frame(Method = c("Linear Regression", "Use Global Warming Rate", "Time Series"), Value = rep(0))
#1. linear regression
predicted_mean_2017[1,2] <- -328.1095 + (0.1675*2017)

#2. using global warming rate
predicted_mean_2017[2,2] <- yearly_stats_1216[5,4] + 0.0165 

#3. Using time series
#create time series object comprised of dates and mean values across all available Swiss stations
meteo_xts <- xts(meteo_new_1217[,c(1,84)][,-1], order.by=as.Date(meteo_new_1217[,c(1,84)][,1], "%Y-%m-%d"))
#using Fourier and auto.arima to build model
y <- ts(meteo_xts, frequency=7)
z <- fourier(ts(meteo_new_1217$mean_stations, frequency=365.25), K=30)
zf <- fourier(ts(meteo_new_1217$mean_stations, frequency=365.25), K=30, h=184)
fit <- auto.arima(y, xreg=z, seasonal=TRUE)
#forecast values
fc <- forecast(fit, xreg=zf, h=184)
#plot results
autoplot(fc)

#measure mean: sum of predicted values + sum of observed values (second line), the total divided by 365
predicted_mean_2017[3,2] <- (sum(fc$mean) + 
                            meteo_new_1217 %>% filter(Year == 2017) %>% select(mean_stations) %>% summarise(sum = sum(mean_stations)))/365
kable(predicted_mean_2017, digits = 4)

Method	Value
Linear Regression	9.7380
Use Global Warming Rate	9.3144
Time Series	9.7165

Comments:
- On the short term, we see that the linear regression and time series methods are very close.
- Using Time Series would need further testing of different models and fine-tuning, deep learning methods for time series with the keras package for instance could yield interesting results.
- As we do not have a dataset similar to the training data of this case study, we cannot benchmark our results to actual Swiss results. We have the records of 82 stations (out of more than 500) at our disposal and this provides an important bias for the validity of the results.
- As an example, the official data provided by meteoschweiz (https://www.meteoschweiz.admin.ch/product/input/climate-data/swissmean/10.18751-Climate-Timeseries-CHTM-1.0-swiss.txt), shows a year average of 6.16 degrees Celsius against the 9.3 to 9.7 we found using our current dataset. As a benchmark, we see that the 6 first months showed on the meteoschweiz website give an average of 4.78 degrees, whereas our dataset shows 8.13 derees. It seems that the localities we used are not representative of the climate diversity of the country.
- But all proportions left equal, our findings about warming, yearly, per season and nationwide as well as warm/cold regions, remain valid.

5. Summary findings

5.1. Notes about the data

The dataset ranges from 2011 to 2017, with a few caveats to note:
- the observations of 2011 are incomplete with a lot of missing values, only 25% of the 82 monitoring stations are covered.
- in 2017, the observations stop on June 30th. The missing 6 months of data make it unpractical to perform yearly calculations, we can however use this data for seasonal analysis.
- some of the recorded values show 100 degrees celsius, which can be a mistake or a convention for filling in missing values. We decided to convert them into missing values and then impute the values linearly. There are 2 occurences of missing data for periods of over 3 consecutive months each. We decided to carry on with the imputing method but the reader should be informed of this bias. A commercial project going into production would need a more failsafe approach.
- As a result of these biases, most of our findings used data from 2012 to 2016.

5.2. Findings

Average nationwide temperature increase:
- the average nationwide temperature increase between 2012 and 2016 is about 0.207 degrees Celsius with a 95% confidence interval between 0.15 and 0.265 degrees Celsius.
- this can be compared to one of the “accepted” global warming rates of 0.15 to 0.18 degrees per decade. In the only 5 years we explored, the general rate has been exceeded.
- there is no obvious linear trend in yearly averages increase. 2013 showed a lower average temperature, while we saw important spikes in 2014 and 2015.
- looking for statistical proof, we found conclusive evidence (with a pvalue for the paired t-test close to 0) that the temperature increase was significant.
- low temperatures seem to have increased over the year (from -24.4 degrees in 2012 to -16.9 in 2016), maximal temperatures stay more or less the same.

Seasonal impact:
- it seems that the highest warming rates are observed over the years in Winter and in Summer, much less in mid-seasons.

Variations between localities:
- generally, the localities experiencing the most drastic warming rate are the ones showing among the coldest yearly temperature averages.
- the warmest localities have a slope close to 0, meaning that they do not seem to have experienced any significant increase in yearly average temperatures.
- the coldest localities have significantly higher slopes, meaning that they may have experienced more obvious warming rates.
- the localities that experienced the most drastic temperature increase rates show a low average mean in temperatures, 3 of them are among the top 10 coldest Swiss localities (Bullet/La Frétaz, Napf, Adelboden).

Predicting future temperatures:
- we used several methods
- linear regression is the less relevant one as we do not have enough data to identify a linear pattern. Using our linear model makes unrealistic predictions on the long term.
- using the accepted global warming rate of 0.15 to 0.18 degrees per decade. This rate is frequently reviewed (and increased) by autoritative institutions, proving that any linearity wouldn’t be reasonable.
- using time series: applying deep learning with time series would be the most reliable way, but many more explanatory variables would be necessary to obtain reliable projections.

Conclusion:
Observing the dataset, there are clear signs of climate change in Switzerland and particularly in the “coldest” localities, the statistical evidence is conclusive.
Even though the findings corroborate what we “intuitively” know or hear about the topic, we would need many more years of data and more explanatory variables to provide more robust elements.

Thank you for reading through this document, for any question, feel free to reach out:

Xavier Valdayron
xavier@measuringsocial.com
xavier_valdayron@yahoo.com

–

References:
Below are a few reference articles to help get a better grasp on the topic.