3. Cleaning and preprocessing data
3.1. Shorten variable names
As seen in the output above, the station names are quite long and impractical to refer to during the analysis. Since each is composed of the full station name followed by a 3-letter code, we will retain only the code as the variable name and keep a reference table on the side to translate codes back into station names.
#save variable names to "stations" dataframe, omitting the first variable which is the date
stations <- data.frame(names(meteo)[-1], stringsAsFactors = FALSE)
#split full name and 3-letter code
stations <- str_split_fixed(stations[,1], " - ", 2)
#change column names in meteo dataset
names(meteo) <- c("Date", stations[,2])
#View the 3 first observations and variables
meteo[1:3, 1:6]
## # A tibble: 3 x 6
## Date TAE COM ABO AIG ALT
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2011-01-01 00:00:00 NA NA NA NA -0.4
## 2 2011-01-02 00:00:00 NA NA NA NA 0.6
## 3 2011-01-03 00:00:00 NA NA NA NA -2.1
The variable (column) names now provide a more concise view.
3.2. Checking empty rows or columns
We check whether any columns or rows are completely empty, so that we can delete what is unnecessary.
#are there empty columns? we loop over each column and count its non-missing values; we then display the indices where that count is 0, i.e. the empty columns
non_na_per_col <- rep(0, ncol(meteo))
for(i in 1:ncol(meteo)){
  non_na_per_col[i] <- sum(!is.na(meteo[,i]))
}
which(non_na_per_col == 0)
## integer(0)
The output integer(0) means that no column has a count of zero; each column therefore contains at least one observation. We replicate the process for rows.
non_na_per_row <- rep(0, nrow(meteo))
for(j in 1:nrow(meteo)){
  non_na_per_row[j] <- sum(!is.na(meteo[j,-1]))
}
which(non_na_per_row == 0)
## [1] 2374
The output shows that row 2374 is empty; it is the last row, so we can delete it.
#remove the empty row
meteo <- meteo[-2374,]
#housekeeping task, delete all objects except "meteo" and "stations"
rm(list = setdiff(ls(), c("meteo", "stations")))
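As an aside, both emptiness checks can also be written more compactly without explicit loops; a minimal, equivalent sketch:
#loop-free versions of the same checks
which(colSums(!is.na(meteo)) == 0) #empty columns
which(rowSums(!is.na(meteo[,-1])) == 0) #empty rows, excluding the date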
3.3. Checking NAs (Missing Values)
We start with summary statistics of the data as a whole, all temperature variables combined; we exclude the first variable, which contains the date.
summary(unlist(meteo[,-1]))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -24.400 3.500 9.400 9.559 15.300 100.000 22265
We see two problems here: a lot of missing values, and maximum temperatures of 100 degrees Celsius. We will deal with these outliers (unrealistic temperatures) in the next section.
Missing values: let’s diagnose when and where they occur, starting with “when”, as we have only 7 years to inspect.
#format a dataframe to store the missing values per year
NA_year_table <- data.frame(Year = c(2011:2017), NAs = rep(0, 7))
#loop over each year and count how many NAs we have
for(i in c(1:7)){
  a <- sum(is.na(meteo[year(meteo$Date) == 2010+i,]))
  NA_year_table$NAs[i] <- a
}
#print the results
NA_year_table
## Year NAs
## 1 2011 22265
## 2 2012 0
## 3 2013 0
## 4 2014 0
## 5 2015 0
## 6 2016 0
## 7 2017 0
It seems that many stations were not monitored in 2011; we can measure the impact of the missing data relative to all possible observations.
#find how many missing values in 2011 (row 1)
missing_2011 <- NA_year_table[1,]$NAs
#find how many possible cases we have in 2011: the number of observations (days) in 2011 multiplied by the number of stations
cases_2011 <- nrow(meteo[year(meteo$Date) == 2011,]) * ncol(meteo[-1])
missing_2011 / cases_2011
## [1] 0.7439024
Almost 75% of the values are missing for 2011. We do not have enough information at this stage to decide whether the 25% of available values in 2011 are representative of the whole country.
We therefore decide to focus on the 2012 to 2017 period for now.
We create two new datasets: “meteo_11” (where we keep the 2011 data on the side, in case we need to revert to it later) and “meteo_12_17”, which becomes our primary dataset for the next steps.
meteo_11 <- meteo[year(meteo$Date) == 2011,]
meteo_12_17 <- meteo[year(meteo$Date) > 2011,]
#print new summary
summary(unlist(meteo_12_17[,-1]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -24.400 3.400 9.300 9.503 15.200 100.000
All missing values disappeared.
3.4. Dealing with outliers
We now need to deal with the maximum temperatures of 100 degrees Celsius, which are obvious errors, or possibly a coding convention for missing observations.
The first step is to replace these extreme values (100 degrees Celsius) with missing values (NA).
#replace values of 100 with NA
meteo_12_17[ ,-1][meteo_12_17[ ,-1] == 100] <- NA
#print summary again
summary(unlist(meteo_12_17[,-1]))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -24.400 3.400 9.300 9.235 15.100 29.500 487
The summary now looks more reasonable, with temperatures ranging between -24 and +29.5 degrees, but the replacement introduced 487 new missing values.
3.5. Imputing missing values
We need to decide how to handle these missing values. Before getting too deep into imputation strategies, let’s print again a per-year summary of the missing values:
#re-using above code
NA_year_table <- data.frame(Year = c(2012:2017), NAs = rep(0, 6))
for(i in c(1:6)){
  a <- sum(is.na(meteo_12_17[year(meteo_12_17$Date) == 2011+i,]))
  NA_year_table$NAs[i] <- a
}
NA_year_table
## Year NAs
## 1 2012 81
## 2 2013 173
## 3 2014 144
## 4 2015 46
## 5 2016 29
## 6 2017 14
STRATEGY FOR IMPUTING MISSING VALUES
Since we work mostly with means here (mean temperatures), imputing missing values linearly is the safest approach: linear interpolation between two known values leaves the mean over that stretch unchanged. For instance, a series of 1 to 10 has a mean of 5.5, and a similar series with all interior values missing yields the same result:
mean(1:10)
## [1] 5.5
mean(c(1, NA, NA, NA, NA, NA, NA, NA, NA, NA, 10), na.rm = TRUE)
## [1] 5.5
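The same holds for actual linear interpolation: filling the gaps with approx() reconstructs the interior values exactly, leaving the mean unchanged. A quick sketch:
#toy check: linear interpolation recovers the interior values of 1:10 exactly
x <- c(1, rep(NA, 8), 10)
approx(seq_along(x), x, xout = seq_along(x))$y
##  [1]  1  2  3  4  5  6  7  8  9 10
mean(approx(seq_along(x), x, xout = seq_along(x))$y)
## [1] 5.5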
So this seems like the best strategy. However, we have many missing values in 2013 and 2014, and we ideally don’t want too many consecutive missing values, as linear imputation over long gaps may span months or seasons and bias the results.
For this we use the rle() function, which identifies runs of consecutive NAs in our dataset.
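On a toy vector, rle() encodes the lengths of consecutive runs of identical values; applied to is.na(), the TRUE runs are stretches of consecutive NAs:
#toy example: one run of 3 consecutive NAs, two isolated values, one trailing NA
rle(is.na(c(4.2, NA, NA, NA, 7.1, NA)))
## Run Length Encoding
##   lengths: int [1:4] 1 3 1 1
##   values : logi [1:4] FALSE TRUE FALSE TRUE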
#check consecutive missing values
lrna <- rle(is.na(unlist(meteo_12_17)))
sort(lrna$lengths[lrna$values], decreasing = TRUE)
## EVI707 CHZ913 CRM731 EGO894 SPF571 BIE1137 BIZ1524 BRL348 RAG38
## 100 92 28 24 22 21 21 21 15
## EGO921 CRM147 CRM1000 EVI539 SPF1168 EGO290 EGO731 GRE410 INT1998
## 11 8 8 8 6 5 5 5 5
## MOE136 ARH1110 BIZ1923 CRM158 GIH746 BRL93 MAH1792 SPF1280 BAS1
## 5 4 4 4 4 4 4 4 3
## CRM604 CRM1325 CRM1414 MAH937 MOE58 BIZ1656 CRM1628 EGO908 EVI1431
## 3 3 3 3 3 2 2 2 2
## MAH294 MOE1226 SPF181 ARH90 ARH286 BIZ1890 EVI90 EVI286 EVI1425
## 2 2 2 1 1 1 1 1 1
## GIH96 GRE1903 BRL58 BRL98 BRL130 BRL225 BRL257 MAH90 MAH286
## 1 1 1 1 1 1 1 1 1
## MAH416 MOE575
## 1 1
The results are not ideal: the top values (100, 92, …) show that some stations have long stretches of consecutive missing values, spanning months and seasons. This means that the calculated means over months and seasons will not be entirely reliable.
Instead of deleting further data, and since such long stretches are infrequent, we accept that bias. For a more impactful project, we would need further exploration and a different strategy.
We impute the missing values using the approxfun() function:
#we exclude the first column that contains the date, so that it retains the original date format
meteo_12_17[-1] <- data.frame(lapply(meteo_12_17[-1], function(x) approxfun(seq_along(x), x)(seq_along(x))))
#reprint summary
summary(unlist(meteo_12_17[,-1]))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -24.400 3.400 9.300 9.235 15.100 29.500 3
The summary still shows 3 missing values, probably at the beginning or the end of a column, where the approxfun() call found no bounding value to interpolate from.
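This behaviour is easy to check on a toy vector: approx()/approxfun() drop the NA points, interpolate interior gaps, and return NA outside the range of known points, since they do not extrapolate.
#toy example: the interior gap is filled, the trailing NAs are not
x <- c(10, NA, 30, NA, NA)
approxfun(seq_along(x), x)(seq_along(x))
## [1] 10 20 30 NA NA
Let's identify where this occurred in our data: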
#the sapply call returns the number of NAs per column; we use the same expression to subset and display only the non-zero counts
sapply(meteo_12_17, function(x) sum(is.na(x)))[sapply(meteo_12_17, function(x) sum(is.na(x)))!=0]
## RAG
## 3
#now that we found that the RAG column was impacted, let's find in which rows
which(is.na(meteo_12_17$RAG))
## [1] 2006 2007 2008
#the last 3 rows contain NAs; let's look at them along with a few preceding values
meteo_12_17$RAG[2000:2008]
## [1] 26.5 25.6 22.9 19.5 22.3 20.8 NA NA NA
To deal with these values, we could use some time-series / ARIMA type of imputation scheme, but we’ll use a simpler strategy: replace each NA with the mean of the previous 5 days:
#as long as RAG contains NAs, find the next one and replace it with the rounded mean of the 5 preceding values
while(sum(is.na(meteo_12_17$RAG)) != 0){
  index_next_na <- match(NA, meteo_12_17$RAG)
  meteo_12_17$RAG[index_next_na] <- round(mean(meteo_12_17$RAG[(index_next_na-5):(index_next_na-1)]), 1)
}
#check what the imputed values look like
meteo_12_17$RAG[2000:2008]
## [1] 26.5 25.6 22.9 19.5 22.3 20.8 22.2 21.5 21.3
#check summary
summary(unlist(meteo_12_17[,-1]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -24.400 3.400 9.300 9.236 15.100 29.500
We now have a clean dataset: no missing values, clear column names, no outliers, and correct formats.
We can proceed with the actual tasks at hand.
#cleaning variables
rm(list = setdiff(ls(), c("meteo_11", "meteo_12_17", "stations")))
4. Exploratory Data Analysis and Questions
In this section, we visualize the data to get a feel for how it is structured, and answer some questions along the way.
We start by adding some variables to the dataset:
- mean_stations: mean across all monitoring stations in Switzerland
- Year: extracting the year from the date
- Month: extracting the month from the date
- Season: extracting the season from the date. We take a shortcut and use full calendar months instead of starting each season on the 21st; winter is therefore January, February and March, instead of 21 December to 20 March.
#add mean across stations variable in dataframe, we exclude the date from the mean
meteo_new_1217 <- meteo_12_17[-1] %>% mutate(mean_stations = rowMeans(.))
#we reconstitute the dataframe, adding the Date again
meteo_new_1217 <- cbind(Date = meteo_12_17$Date, meteo_new_1217)
#add Year, Month and Seasons variables
meteo_new_1217 <- meteo_new_1217 %>% mutate(Year = year(Date), Month = month(Date),
Season = ifelse(Month %in% 10:12, "Fall",
ifelse(Month %in% 1:3, "Winter",
ifelse(Month %in% 4:6, "Spring",
"Summer")))
)
#reorder the seasons starting with fall (for clearer plots)
meteo_new_1217$Season <- factor(meteo_new_1217$Season, levels = c("Fall", "Winter", "Spring", "Summer"))
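As an aside, the nested ifelse() calls could be written more readably with dplyr::case_when(); a sketch of an equivalent assignment (assuming dplyr is loaded, as above):
#equivalent season assignment with case_when(), months grouped as before
meteo_new_1217 <- meteo_new_1217 %>%
  mutate(Season = factor(case_when(Month %in% 1:3 ~ "Winter",
                                   Month %in% 4:6 ~ "Spring",
                                   Month %in% 7:9 ~ "Summer",
                                   TRUE ~ "Fall"),
                         levels = c("Fall", "Winter", "Spring", "Summer")))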
dim(meteo_new_1217)
## [1] 2008 87
Structure of our new dataset (meteo_new_1217):
- 2008 observations (days)
- 87 variables: Date, 82 stations, and 4 newly created variables (mean_stations, Year, Month, Season)
4.1. Question 1: is there, on average, statistical evidence of climate change in Switzerland?
Side by side boxplots of yearly temperature averages across all stations in Switzerland:
ggplot(meteo_new_1217, aes(x = as.factor(Year), y = mean_stations, fill = as.factor(Year))) + geom_boxplot() +
geom_jitter(alpha = 0.3, color = "grey") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Figure 1 - Boxplot of average temperatures per year", x = "Year", y = "Temperature")

- There is no obvious visual evidence of an increase.
- We can ignore year 2017, as it is incomplete (measurements run only until the end of June), which pulls its mean down toward winter and spring temperatures.
Another way is to plot the daily mean temperatures across all stations:
ggplot(data = meteo_new_1217, aes(x = Date, y = mean_stations)) + geom_line() + geom_smooth(method = "lm", color = "red") + geom_smooth() + labs(title = "Figure 2 - Average daily temperatures across all stations (01/2012 to 06/2017)", x = "Date", y = "Temperatures")

The seasonality of the time series holds no surprises, and the weakly positive slope of the regression line (in red) hints at an increasing trend.
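To put a rough number on that hint, we can fit the same linear model on the daily means and scale the slope to degrees per year; a sketch (Date is first coerced from date-time to days):
#slope of a linear fit on the daily means, scaled from per-day to per-year
daily_fit <- lm(mean_stations ~ as.numeric(as.Date(Date)), data = meteo_new_1217)
unname(coef(daily_fit)[2]) * 365.25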
MEASURING THE TEMPERATURE CHANGE PER YEAR
We create a summary table per year, including all the years present in the dataset.
#initialize table with summary statistics
yearly_stats <- data.frame(Min = rep(0,7), First_quart = rep(0,7), Median = rep(0,7), Mean = rep(0,7), Third_quart = rep(0,7), Max = rep(0,7))
rownames(yearly_stats) <- c(2011:2017)
#incorporating year 2011 from our meteo_11 dataset
yearly_stats[1,1] <- min(unlist(meteo_11[-1]), na.rm = TRUE)
yearly_stats[1,2] <- quantile(unlist(meteo_11[-1]), 0.25, na.rm = TRUE)
yearly_stats[1,3] <- median(unlist(meteo_11[-1]), na.rm = TRUE)
yearly_stats[1,4] <- mean(unlist(meteo_11[-1]), na.rm = TRUE)
yearly_stats[1,5] <- quantile(unlist(meteo_11[-1]), 0.75, na.rm = TRUE)
yearly_stats[1,6] <- max(unlist(meteo_11[-1]), na.rm = TRUE)
#summary statistics for each subsequent year from meteo_new_1217 dataset
for(i in 2:7){
  #note: the station columns are 2:83, so we exclude columns 1 and 84:87 (Date and the added variables)
  yearly_stats[i,1] <- min(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,84:87)]))
  yearly_stats[i,2] <- quantile(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,84:87)]),0.25)
  yearly_stats[i,3] <- median(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,84:87)]))
  yearly_stats[i,4] <- mean(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,84:87)]))
  yearly_stats[i,5] <- quantile(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,84:87)]),0.75)
  yearly_stats[i,6] <- max(unlist(meteo_new_1217[meteo_new_1217$Year == 2010+i,-c(1,84:87)]))
}
#show table
kable(yearly_stats)
| Year | Min | First_quart | Median | Mean | Third_quart | Max |
|------|-------|-------------|--------|-----------|-------------|------|
| 2011 | -11.6 | 4.7 | 11.3 | 10.750424 | 16.6 | 27.8 |
| 2012 | -24.4 | 3.4 | 9.5 | 9.088994 | 15.6 | 28.5 |
| 2013 | -17.1 | 1.8 | 8.9 | 8.573798 | 14.9 | 28.8 |
| 2014 | -14.8 | 4.7 | 10.1 | 9.887242 | 14.9 | 26.8 |
| 2015 | -15.2 | 3.7 | 9.7 | 9.831514 | 15.4 | 29.5 |
| 2016 | -16.9 | 3.4 | 8.8 | 9.297885 | 15.3 | 26.4 |
| 2017 | -20.3 | 2.5 | 8.1 | 8.121042 | 13.7 | 27.7 |
For the yearly summary, we exclude 2017, which is incomplete. The question remains whether we should include 2011, when only about 25% of the stations were monitoring temperatures.
We create two subsets of the summary table and plot the means as well as a regression line:
#including 2011, excluding 2017:
yearly_stats_1116 <- yearly_stats[-7,]
#excluding both 2011 and 2017:
yearly_stats_1216 <- yearly_stats[-c(1,7),]
#Plot including 2011:
ggplot(yearly_stats_1116, aes(x = c(2011:2016), y = Mean)) + geom_point() + geom_line() +
scale_y_continuous(breaks = round(seq(min(yearly_stats_1116$Mean), max(yearly_stats_1116$Mean), by = 0.1),1)) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Figure 3 - Average yearly temperatures including 2011", x = "Years", y = "Mean Temperatures")

#Regression slope
lm(data = yearly_stats_1116, Mean ~ c(2011:2016))
##
## Call:
## lm(formula = Mean ~ c(2011:2016), data = yearly_stats_1116)
##
## Coefficients:
## (Intercept) c(2011:2016)
## 223.6752 -0.1063
Including 2011 contradicts the earlier hint of a slight increase in temperatures: we now see a decrease, with a slope of about -0.1.
This is because mean temperatures in 2011 were higher than in any subsequent year. We cannot tell at this point whether this is a real effect or an artifact of factors such as only 25% of the stations reporting.
We therefore decide to exclude 2011 and produce again a plot and a regression slope:
ggplot(yearly_stats_1216, aes(x = c(2012:2016), y = Mean)) + geom_line() + geom_point() +
scale_y_continuous(breaks = round(seq(min(yearly_stats_1216$Mean), max(yearly_stats_1216$Mean), by = 0.1),1)) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Figure 4 - Average yearly temperatures excluding 2011", x = "Years", y = "Mean Temperatures")

summary(lm(data = yearly_stats_1216, Mean ~ c(2012:2016)))
##
## Call:
## lm(formula = Mean ~ c(2012:2016), data = yearly_stats_1216)
##
## Residuals:
## 2012 2013 2014 2015 2016
## 0.08821 -0.59454 0.55136 0.32808 -0.37310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -328.1095 351.1699 -0.934 0.419
## c(2012:2016) 0.1675 0.1744 0.961 0.407
##
## Residual standard error: 0.5514 on 3 degrees of freedom
## Multiple R-squared: 0.2353, Adjusted R-squared: -0.01953
## F-statistic: 0.9234 on 1 and 3 DF, p-value: 0.4075
This time the slope is positive, at 0.1675 degrees per year. We printed a full summary, which also tests the null hypothesis that the slope is equal to zero. The high p-value (>0.05) means we fail to reject the null hypothesis: the slope we found may not be significant. This confirms what we visually suspected, namely that a linear model is not the best way to model temperature change (at least not with so few data points).
A broader view of the minimum, mean and maximum temperatures over time shows that, even though the means and maxima remain more or less stable, the minimum temperatures seem to rise over time.
ggplot(yearly_stats_1216, aes(x = c(2012:2016), y = Mean)) + geom_line() + geom_point() +
geom_line(aes(y = Min), color = "blue") + geom_point(aes(y = Min), color = "blue") +
geom_line(aes(y = Max), color = "red") + geom_point(aes(y = Max), color = "red") +
labs(title = "Figure 5 - Minimum, Mean and Maximum Temperatures", x = "Year", y = "Average Temperature")

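One way to quantify this impression is to compute the slope of each of the three series over 2012-2016; a sketch reusing the yearly summary table:
#slopes of the yearly minimum, mean and maximum temperatures
sapply(yearly_stats_1216[, c("Min", "Mean", "Max")],
       function(y) coef(lm(y ~ c(2012:2016)))[2])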
We now perform a t-test to see whether the temperature differences between 2016 and 2012 are significantly different from 0, which would provide statistical evidence of climate change visible in Switzerland.
We chose 2016 and 2012 because both years have complete data in our dataset. In order to compare means with a paired t-test, we need a dataframe holding the average per station and per year, so that each station's measurement in 2012 is paired with its measurement in 2016:
#create a dataframe to store the means per station and per year; we reuse our dataset, keeping only the station variables, the Year variable, and the first 5 rows (one per year)
stations_mean_12_16 <- meteo_new_1217[1:5,-c(1,84,86,87)]
#reorder columns
stations_mean_12_16 <- stations_mean_12_16[,c(83,1:82)]
#compute means of each station per year with a for loop, existing values will be overwritten
for(i in 1:5){
  #label the row with the correct year (the rows were copied from 2012 data)
  stations_mean_12_16[i,1] <- 2011+i
  for(j in 2:83){
    stations_mean_12_16[i,j] <- mean(meteo_new_1217[meteo_new_1217$Year == 2011+i,j])
  }
}
#Perform t-test with paired means
t.test(as.numeric(stations_mean_12_16[5,-1]), as.numeric(stations_mean_12_16[1,-1]), paired = TRUE)
##
## Paired t-test
##
## data: as.numeric(stations_mean_12_16[5, -1]) and as.numeric(stations_mean_12_16[1, -1])
## t = 7.1234, df = 81, p-value = 3.893e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1492058 0.2648619
## sample estimates:
## mean of the differences
## 0.2070339
The output of the t-test shows a difference in mean temperatures of about 0.207 degrees Celsius over 5 years.
The small p-value (less than 0.05) makes us reject the null hypothesis (that the difference in means is zero) in favor of the alternative hypothesis.
In plain language, we have convincing statistical evidence that mean temperatures differ between 2012 and 2016, with a point estimate of 0.207 degrees Celsius, and we are 95% confident that the true increase in average temperature lies between 0.15 and 0.265 degrees Celsius.
Strikingly, the “accepted” global warming rate of 0.15 to 0.18 degrees Celsius per decade is exceeded here within only half a decade of recorded values.
(Source: https://www.climate.gov/news-features/understanding-climate/climate-change-global-temperature)
A word of caution though: the “accepted” global warming rate is calculated over much longer periods and with a large array of sophisticated methods.
We saw, for instance, that adding the year 2011 completely changes our findings; we would need more years of observations to confirm them.
4.2. Question 2: are there some trends depending on seasons?
Headlines sometimes state that “winters are getting colder” or “summers are getting warmer”, or the other way around.
Our next question is whether we see stronger evidence of climate change when looking at each season separately.
We take a bit of a shortcut here and only look at the slope of the regression line per season, across the 2012-2017 data (the incomplete 2017 observations contribute to Winter and Spring).
ggplot(meteo_new_1217, aes(x = as.factor(Year), y = mean_stations)) + geom_boxplot() +
geom_jitter(alpha = 0.3, color = "salmon") +
facet_grid(~Season) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Figure 6 - Boxplot of temperatures per season and per year", x = "Year", y = "Average Temperature")

The boxplots show sizeable variation within each season group. To get a better understanding, we now calculate a slope per season, giving a sense of increasing or decreasing temperatures.
We need to group the data differently for this:
#group and calculate means per year and per season
season_means <- meteo_new_1217 %>% group_by(Year, Season) %>%
summarise(Mean = mean(mean_stations)) %>% ungroup()
#initialize a table for the per-season slope coefficients
season_coefs <- data.frame(Season = c("Fall", "Winter", "Spring", "Summer"), Coefs = rep(0,4))
fall_means <- season_means %>% filter(Season == "Fall")
winter_means <- season_means %>% filter(Season == "Winter")
spring_means <- season_means %>% filter(Season == "Spring")
summer_means <- season_means %>% filter(Season == "Summer")
season_coefs[1,2] <- summary(lm(data = fall_means, Mean ~ Year))$coef[2,1]
season_coefs[2,2] <- summary(lm(data = winter_means, Mean ~ Year))$coef[2,1]
season_coefs[3,2] <- summary(lm(data = spring_means, Mean ~ Year))$coef[2,1]
season_coefs[4,2] <- summary(lm(data = summer_means, Mean ~ Year))$coef[2,1]
kable(season_coefs)
| Season | Coefs |
|--------|------------|
| Fall | -0.0424852 |
| Winter | 0.3227299 |
| Spring | 0.1777783 |
| Summer | 0.2194489 |
A basic interpretation: falls show essentially no trend, while springs, summers, and especially winters appear to get warmer. The mean is, however, sensitive to outliers, and the winter trend could be partly due to fewer very low temperatures in recent winters, as seen in Figure 5.
Looking at medians, which are more robust to outliers, is an alternative option.
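A sketch of that alternative, reusing the grouping from above with median() in place of mean():
#per-season trends of yearly median temperatures
season_medians <- meteo_new_1217 %>% group_by(Year, Season) %>%
  summarise(Median = median(mean_stations)) %>% ungroup()
sapply(levels(season_medians$Season), function(s)
  coef(lm(Median ~ Year, data = filter(season_medians, Season == s)))[2])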
4.3. Question 3: variations between stations/localities
In a country with as diverse a geography and as much natural variation as Switzerland, we may observe large differences depending on where the stations are located.
We create 4 tables for summary statistics: minimum, maximum, mean and median.
Each table should contain all stations, their yearly values for that statistic, their linear model slope coefficient, and the p-value of the slope.
#1. INITIALIZING TABLE, starting with the minimum statistic
#copy our main dataset into a new one, stations_min_12_16; we remove the date and the columns added earlier, keeping only the station columns and the Year variable
stations_min_12_16 <- meteo_new_1217[,-c(1,84,86,87)]
#keep the first 5 rows, one per year from 2012 to 2016
stations_min_12_16 <- stations_min_12_16[1:5,]
#reorganize columns
stations_min_12_16 <- stations_min_12_16[,c(83,1:82)]
#set all values to 0
stations_min_12_16[,] <- 0
#2. POPULATE TABLE WITH MINIMUM TEMPERATURES PER YEAR AND PER STATION
for(i in 1:5){
  stations_min_12_16[i,1] <- 2011+i
  for(j in 2:83){
    stations_min_12_16[i,j] <- min(meteo_new_1217[meteo_new_1217$Year == 2011+i,j])
  }
}
#replicate the process for the maximum temperatures dataset
#we skip the initialization steps of part 1 by re-using the minimum temperatures dataset
stations_max_12_16 <- stations_min_12_16
#for loop to populate the table with maximum temperatures
for(i in 1:5){
  for(j in 2:83){
    stations_max_12_16[i,j] <- max(meteo_new_1217[meteo_new_1217$Year == 2011+i,j])
  }
}
#replicate the process for means
stations_mean_12_16 <- stations_min_12_16
for(i in 1:5){
  for(j in 2:83){
    stations_mean_12_16[i,j] <- mean(meteo_new_1217[meteo_new_1217$Year == 2011+i,j])
  }
}
#replicate the process for medians
stations_median_12_16 <- stations_min_12_16
for(i in 1:5){
  for(j in 2:83){
    stations_median_12_16[i,j] <- median(meteo_new_1217[meteo_new_1217$Year == 2011+i,j])
  }
}
Now that we have summary tables for all stations, we transform them into a format better suited to row-wise operations: computing means, the slope coefficient, and the p-value of the slope.
#create a function that transposes a summary table and adds mean, slope coefficient and p-value columns
output_df <- function(x){
  #capture the argument name as a character string, to name the resulting object at the end
  name <- as.character(substitute(x))
  #transpose: columns become rows, rows become columns
  x <- as.data.frame(t(as.matrix(x)))
  #rename columns
  names(x) <- c(2012:2016)
  #delete the first row (which held the years)
  x <- x[-1,]
  #calculate the mean and add it as a column
  x$mean <- apply(x, 1, FUN = mean)
  #fit a linear model per station and add slope and p-value columns
  for(i in 1:82){
    x$coef[i] <- lm(as.numeric(x[i,1:5]) ~ c(2012:2016))$coefficients[2]
    x$pvalue[i] <- summary(lm(as.numeric(x[i,1:5]) ~ c(2012:2016)))$coefficients[2,4]
  }
  #assign the result to the global environment under the same name
  assign(name, x, envir = .GlobalEnv)
}
#transform each table with the help of the function
output_df(stations_min_12_16)
output_df(stations_max_12_16)
output_df(stations_mean_12_16)
output_df(stations_median_12_16)
#check the final format
dim(stations_mean_12_16)
## [1] 82 8
head(stations_mean_12_16)
## 2012 2013 2014 2015 2016 mean coef
## TAE 9.198361 8.557534 10.016986 9.887123 9.334153 9.398831 0.16011737
## COM 10.788525 10.313973 10.966849 11.376986 10.841257 10.857518 0.11684782
## ABO 6.342077 5.761644 7.045479 7.325205 6.807923 6.656466 0.24952556
## AIG 10.709563 10.101096 11.325753 11.005205 10.658743 10.760072 0.08024702
## ALT 10.209016 9.726027 11.136438 10.633973 10.515847 10.444260 0.15216064
## ARH 10.266803 9.867945 10.933425 10.646027 10.420765 10.426993 0.10860057
## pvalue
## TAE 0.4673919
## COM 0.4084714
## ABO 0.2449977
## AIG 0.6494054
## ALT 0.4350199
## ARH 0.4717388
Note that we use only a fraction of the tables created here; most of them serve exploratory data analysis. These dataframes could also easily feed an interactive shiny application, though that is beyond the scope of this case study.
What are the warmest localities, on average, over the 2012-2016 period?
#transform the stations object (created at the start of the project) into a dataframe and rename its variables
stations <- as.data.frame(stations); names(stations) <- c("station_name", "code")
#warmest localities on average, displaying the top 10 values
stations_mean_12_16 %>% tibble::rownames_to_column() %>% arrange(desc(mean)) %>% select(rowname, mean, coef) %>% head(10) %>%
left_join(.,stations, by = c("rowname" = "code")) %>% kable(digits = 2, align = "cccc")
| rowname | mean | coef | station_name |
|---------|-------|-------|--------------------|
| LUG | 13.36 | 0.10 | Lugano |
| OTL | 13.25 | 0.14 | Locarno/Monti |
| MAG | 12.13 | 0.11 | Magadino/Cadenazzo |
| SBO | 11.88 | 0.06 | Stabio |
| PUY | 11.48 | 0.16 | Pully |
| GRO | 11.41 | -0.21 | Grono |
| SIO | 11.19 | 0.13 | Sion |
| BAS | 11.16 | 0.15 | Basel/Binningen |
| GVE | 11.13 | 0.16 | Genève-Cointrin |
| EVI | 11.08 | 0.17 | Evionnaz |
What are the coldest localities, on average, over the 2012-2016 period?
stations_mean_12_16 %>% tibble::rownames_to_column() %>% arrange(mean) %>% select(rowname, mean, coef) %>% head(10) %>%
left_join(.,stations, by = c("rowname" = "code")) %>% kable(digits = 2, align = "cccc")
| rowname | mean | coef | station_name |
|---------|------|------|-------------------------|
| ANT | 4.45 | 0.03 | Andermatt |
| ULR | 4.62 | 0.22 | Ulrichen |
| BRL | 5.57 | 0.12 | La Brévine |
| NAP | 6.02 | 0.26 | Napf |
| SCU | 6.28 | 0.16 | Scuol |
| MVE | 6.63 | 0.20 | Montana |
| FRE | 6.65 | 0.28 | Bullet/La Frétaz |
| ABO | 6.66 | 0.25 | Adelboden |
| SMM | 6.77 | 0.15 | Sta. Maria, Val Müstair |
| CHD | 7.11 | 0.23 | Château-d’Oex |
Localities that experienced the highest rate of change in temperatures
A larger slope coefficient indicates a faster rise in temperatures over the observed period.
#localities with the steepest regression slope over the 5-year period, i.e. where warming would have been most noticeable
stations_mean_12_16 %>% tibble::rownames_to_column() %>% arrange(desc(coef)) %>% select(rowname, mean, coef, pvalue) %>% head(10) %>%
left_join(.,stations, by = c("rowname" = "code")) %>% kable(digits = 2, align = "ccccc")
| rowname | mean | coef | pvalue | station_name |
|---------|------|------|--------|------------------|
| ENG | 7.16 | 0.28 | 0.22 | Engelberg |
| FRE | 6.65 | 0.28 | 0.24 | Bullet/La Frétaz |
| HOE | 7.36 | 0.27 | 0.31 | Hörnli |
| EBK | 8.34 | 0.27 | 0.22 | Ebnat-Kappel |
| ELM | 7.34 | 0.27 | 0.27 | Elm |
| DIS | 7.36 | 0.27 | 0.17 | Disentis/Sedrun |
| NAP | 6.02 | 0.26 | 0.27 | Napf |
| ABO | 6.66 | 0.25 | 0.24 | Adelboden |
| LAE | 8.06 | 0.25 | 0.34 | Lägern |
| PLF | 7.81 | 0.24 | 0.32 | Plaffeien |
These 3 tables give us some interesting insights:
- the warmest localities have slopes close to 0, meaning they do not seem to have experienced any significant increase in yearly average temperatures.
- the coldest localities (those with the lowest average temperatures) have markedly higher slopes (except Andermatt), suggesting they may have experienced more pronounced warming.
- the localities with the steepest temperature increases have low average temperatures; 3 of them are among the 10 coldest Swiss localities (Bullet/La Frétaz, Napf, Adelboden).
We can verify this by plotting the relation between slope coefficients and average temperature of each locality:
ggplot(data = stations_mean_12_16, aes(x = coef, y = mean)) + geom_point() + geom_smooth(method = "lm") +
labs(title = "Figure 7 - Coefficient slope vs. Average Temperature", x = "slope coefficient", y = "mean temperatures")

There is no notion of causality here, but it has been suggested that global warming affects the coldest regions (sometimes glacier regions) first, before significantly impacting areas with a moderate climate.
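As a numeric complement to Figure 7, we can compute the correlation between the slope coefficients and the station means; a quick sketch:
#correlation between per-station warming slope and average temperature
cor(stations_mean_12_16$coef, stations_mean_12_16$mean)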
4.4. Question 4: can we predict future temperatures?
To get a complete picture of global temperatures, measurements from the air above land and from the ocean surface are combined.
The earth is divided into grid boxes of equal surface; the more monitoring stations per box, the more accurate the measurements.
In the case of Switzerland, our dataset covers 82 stations out of over 500, so our results may be biased, especially given the large natural variations within a country like Switzerland.
More information can be found here: https://www.carbonbrief.org/explainer-how-do-scientists-measure-global-temperature.
Working with such a small dataset, our predictions cannot be expected to be very accurate:
- we saw that a linear model is not the best fit; this is consistent with the fact that the earth's warming rate is regularly revised by scientists and does not follow a simple, known pattern.
- with our 5 years of data, we can only attempt predictions one to two years ahead, and we saw that including 2011 would have turned our regression line downward.
LONG TERM PREDICTIONS:
As a reminder, we print again the linear trend fitted on 2012 to 2016, extending the regression line to 2050.
ggplot(yearly_stats_1216, aes(x = c(2012:2016), y = Mean)) + geom_line() + geom_point() + xlim(c(2012, 2050)) + ylim(c(8.5, 20)) +
geom_abline(slope = 0.1675, intercept = -328.1095, color = "red") +
labs(title = "Figure 8 - Extrapolating the regression line", x = "Years", y = "Mean Temperatures")

lm(data = yearly_stats_1216, Mean ~ c(2012:2016))
##
## Call:
## lm(formula = Mean ~ c(2012:2016), data = yearly_stats_1216)
##
## Coefficients:
## (Intercept) c(2012:2016)
## -328.1095 0.1675
#using the linear model for x = 2050:
-328.1095 + (0.1675*2050)
## [1] 15.2655
We see (and very much hope) that an average temperature of 15.3 degrees Celsius in Switzerland in 2050 is unrealistic. It would mean a difference of about 6 degrees between 2016 and 2050 (T_hat_2050 - T_observed_2016 = 15.2655 - 9.297885).
Using some simple algebra, we can instead forecast 2050 temperatures from the accepted warming rate of 0.15 to 0.18 degrees per decade, i.e. 0.015 to 0.018 degrees per year (Source: NOAA, https://www.climate.gov/news-features/understanding-climate/climate-change-global-temperature).
(2050-2016+1)*0.015; (2050-2016+1)*0.018
## [1] 0.525
## [1] 0.63
So we would expect a nationwide temperature increase in Switzerland of 0.525 to 0.63 degrees Celsius from 2016 to 2050, against the whopping 6 degrees suggested by a linear model fitted on only 5 years of historical data.
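For completeness, the same extrapolation can be obtained with predict(), whose prediction interval shows just how uncertain a 2050 forecast based on 5 data points is; a sketch (the Year column is added on the fly):
#refit the yearly model with an explicit Year variable, then predict 2050
fit_1216 <- lm(Mean ~ Year, data = transform(yearly_stats_1216, Year = 2012:2016))
predict(fit_1216, newdata = data.frame(Year = 2050), interval = "prediction")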
SHORT TERM PREDICTIONS:
Our dataset is incomplete for 2017 (we have only 6 months of data); let's imagine we want to forecast the average temperature for the whole of 2017.
We have several methods:
1. Using our linear regression model.
2. Using the accepted global warming rate: we take the average of 0.0165 degrees per year (the midpoint of 1.5 to 1.8 degrees per century).
3. Using time series.
#initialize reporting table
predicted_mean_2017 <- data.frame(Method = c("Linear Regression", "Use Global Warming Rate", "Time Series"), Value = rep(0))
#1. linear regression
predicted_mean_2017[1,2] <- -328.1095 + (0.1675*2017)
#2. using global warming rate
predicted_mean_2017[2,2] <- yearly_stats_1216[5,4] + 0.0165
#3. Using time series
#create time series object comprised of dates and mean values across all available Swiss stations
meteo_xts <- xts(meteo_new_1217$mean_stations, order.by = as.Date(meteo_new_1217$Date))
#using Fourier and auto.arima to build model
y <- ts(meteo_xts, frequency=7)
z <- fourier(ts(meteo_new_1217$mean_stations, frequency=365.25), K=30)
zf <- fourier(ts(meteo_new_1217$mean_stations, frequency=365.25), K=30, h=184)
fit <- auto.arima(y, xreg=z, seasonal=TRUE)
#forecast values
fc <- forecast(fit, xreg=zf, h=184)
#plot results
autoplot(fc)

#estimate the 2017 mean: sum of the forecast values plus sum of the observed 2017 values, divided by 365
predicted_mean_2017[3,2] <- (sum(fc$mean) +
meteo_new_1217 %>% filter(Year == 2017) %>% select(mean_stations) %>% summarise(sum = sum(mean_stations)))/365
kable(predicted_mean_2017, digits = 4)
| Method | Value |
|--------------------------|--------|
| Linear Regression | 9.7380 |
| Use Global Warming Rate | 9.3144 |
| Time Series | 9.7165 |
Comments:
- In the short term, the linear regression and time series methods give very similar results.
- The time series approach would need further testing of different models and fine-tuning; deep learning methods for time series, for instance with the keras package, could yield interesting results.
- As we do not have a dataset similar to the training data of this case study, we cannot benchmark our results against official Swiss figures. We have the records of 82 stations (out of more than 500) at our disposal, which introduces a substantial bias into the validity of the results.
- As an example, the official data provided by meteoschweiz (https://www.meteoschweiz.admin.ch/product/input/climate-data/swissmean/10.18751-Climate-Timeseries-CHTM-1.0-swiss.txt) shows a yearly average of 6.16 degrees Celsius, against the 9.3 to 9.7 we found with our dataset. As a benchmark, the first 6 months shown on the meteoschweiz website average 4.78 degrees, whereas our dataset shows 8.13 degrees. It seems that the localities we used are not representative of the country's climate diversity.
- Our relative findings, however (warming per year, per season and nationwide, as well as the contrast between warm and cold regions), remain valid.