Abstract
The object of this report is to determine whether distinct seasonal profiles can be identified in the proposed Meteoblue data, which comprise meteorological measurements for the region of Basel in Switzerland. Two definitions of seasons have been examined: the astronomical definition, which is based on the position of the Earth relative to the sun, and the meteorological definition, which relies on annual temperature cycles and coincides with three-month splits of the Gregorian calendar. The analysis focused on the application of Principal Component Analysis (PCA) and clustering to reduce the number of dimensions and identify structure in the data. The results from the scaled PCA led to 12 principal components being retained, and the interpretation of the associated coefficients suggests that variables related to ‘Wind’ and ‘Cloud Coverage’ drive most of the underlying trend. The application of K-means and hierarchical clustering methods to the PCA results indicates that there is no inherent structure in the data that places the seasons into distinct groups, regardless of how the clusters are defined. The meteorological definition of seasons therefore does not appear to be suitable for the region of Basel, since the data do not reflect any clear patterns in the measured climatic variables across time that would support such a classification.
Introduction
This report follows from the Meteoblue dataset proposal and aims to illustrate the application of multivariate techniques to discover latent patterns between the variables under consideration, with a view to forming hypotheses. The objective of this report is to investigate whether the meteorological definition of seasons (based on annual temperature cycles) can be applied to the region of Basel, in comparison to the astronomical definition of seasons (based on the Earth’s position relative to the sun).
To that end, an overview of the dataset is first provided, followed by data cleaning and an exploratory analysis, before the application of the multivariate methods and a discussion of the findings. It should also be noted that this document is aimed at a non-statistical audience and therefore aspires to communicate the results, as well as the methods used, without mathematical proofs, to support a widespread understanding. The code used for the analysis, together with the associated results, is presented throughout the document.
As previously described in the proposal, the data comprise daily meteorological measurements taken in the canton of Basel in Switzerland.
#Code adapted from:
#https://blog.dominodatalab.com/geographic-visualization-with-rs-ggmaps/
#https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf
require(ggmap)
basel <- c(lon = 7.57 , lat = 47.56)
map <- get_map(location = basel, maptype = "terrain")
ggmap(map)
#Original datastet import
dat <- readxl::read_xlsx("C:/Users/Sanil Purryag/Google Drive/St Andrews docs/Modules/Semester 2/Multivariate Analysis/Project resources/Code-WIP/history_export_1984-2018-U.xlsx")
The data encompass 45 variables and 12,112 observations, all represented as numeric values. These variables describe elements of the weather in Basel, recorded in different units of measurement, together with the time at which the measurements were taken. The variables have been grouped under labels, based on the type of information they provide about the climate, and are presented in the table below (as shown in the proposal):
| Grouping label for variables of interest | Minimum value across variables under grouped label | Maximum value across variables under grouped label |
|---|---|---|
| Year, month and day | 01/01/85 | 27/02/18 |
| Temperature | -13.74 | 30.65 |
| Atmospheric Pressure | 971.56 | 1046.82 |
| Humidity | 34.71 | 100 |
| Percentage of the sky covered with clouds | 0 | 100 |
| Total precipitation amount | 0 | 62.4 |
| Global radiation | 121.93 | 8488.82 |
| Wind speed | 3.69 | 139.68 |
Based on the principles of Van den Broeck et al. (2005) on detecting, diagnosing, and editing data abnormalities, the data will first be examined for any inconsistencies and errors before being used for the analysis.
To reduce the number of variables for the analysis and streamline the data, a new variable “Date” has been created from the existing ‘Year’, ‘Month’ and ‘Day’ columns. This allows the data to be filtered and represented across time using this single column. Additionally, the hour and minute columns were dropped because they contain the same value for every observation.
#Drop columns for Hour and Minute
#Below line adapted from:
#https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name
mydat <- within(dat, rm(Hour, Minute))
#create covariate 'Date' by concatenating columns: Year, Month, Date
Date <-
as.Date(paste(mydat$Year, mydat$Month, mydat$Day, sep = "-"))
mydat$Date <- Date
Given that this report will determine distinct seasonal profiles under different definitions of seasons, dummy variables will be created for each season to identify whether a specific date belongs to that season. Each dummy is coded as one when a date falls within the season and zero when it does not. Two sets of dummies will therefore be created, one for the astronomical definition of seasons and one for the meteorological definition of seasons, as follows.
Astronomical definition of seasons - sourced from: https://www.metoffice.gov.uk/learning/seasons/spring/when-does-spring-start
The dummies for the meteorological definition of seasons are based on three-month blocks. They have been created using the R code below, which splits each year into three-month blocks starting in March.
#Below code Adapted from:
#https://stackoverflow.com/questions/45735624/how-to-add-dummy-variables-q1-q2-q3-based-on-the-month-of-date-variable-yyyym
#Length of season obtained from:
#https://www.metoffice.gov.uk/learning/seasons/spring/when-does-spring-start
#Met_season1 - Spring (March - May)
#Met_season2 - Summer (June - August)
#Met_season3 - Autumn (September - November)
#Met_season4 - Winter (December - February)
suppressWarnings(library("lubridate"))
met.def <-
data.table::dcast(mydat,
Date ~ paste0("Met_season", lubridate::quarter(mydat$Date, fiscal_start = 3)),
length,
value.var = "Date")
colnames(met.def)[2] <- "Met_Spring"
colnames(met.def)[3] <- "Met_Summer"
colnames(met.def)[4] <- "Met_Autumn"
colnames(met.def)[5] <- "Met_Winter"
The dummies for the astronomical definition of seasons have been generated using user-created functions (available in the appendix), based on the dates given in the source above.
#Extract the date and year columns from the data and create placeholder columns for the season dummies
Date <- (mydat$Date)
Years <- mydat$Year
Winter <- data.frame(winter = 1)
Summer <- data.frame(Summer = 1)
Autumn <- data.frame(Autumn = 1)
Spring <- data.frame(Spring = 1)
#merge the dataframes respectively
Merged <- cbind(Date, Years, Winter)
Merged1 <- cbind(Date, Years, Summer)
Merged2 <- cbind(Date, Years, Autumn)
Merged3 <- cbind(Date, Years, Spring)
#clean environment
rm(Winter)
rm(Summer)
rm(Autumn)
rm(Spring)
#run each season function and save the results (note: the results reuse the function names)
astro.winter <- astro.winter(datframe = Merged)
astro.summer <- astro.summer(datframe = Merged1)
astro.autumn <- astro.autumn(datframe = Merged2)
astro.spring <- astro.spring(datframe = Merged3)
Moreover, a categorical variable will be created for each definition so that all four seasons can be identified and filtered, one for the meteorological definition and one for the astronomical definition. The categories will be coded as follows:
| Season | Meteorological Definition | Astronomical Definition |
|---|---|---|
| Summer | Met_summer | Astro_summer |
| Winter | Met_winter | Astro_winter |
| Autumn | Met_autumn | Astro_autumn |
| Spring | Met_spring | Astro_spring |
#Create Cleaned dataframe by dropping old timestamp and rearranging date as the first column
cleaned <- cbind(Date, mydat[, 1:43])
#Add meteorological definition of season to Cleaned
cleaned <- cbind(cleaned, met.def [, 2:5])
#Add astronomical definition of seasons to Cleaned
cleaned <-
cbind(cleaned,
astro.winter[, 2],
astro.summer[, 2],
astro.autumn[, 2],
astro.spring[, 2])
colnames(cleaned)[49] <- "Astro_Winter"
colnames(cleaned)[50] <- "Astro_Summer"
colnames(cleaned)[51] <- "Astro_Autumn"
colnames(cleaned)[52] <- "Astro_Spring"
pca.cleaned <- cleaned
pca.cleaned$Date <- as.numeric(paste(cleaned$Year, cleaned$Month, cleaned$Day, sep = ""))
colnames(pca.cleaned)[1] <- "Date"
#Creation of new factor variable from previously created dummies
#meteorological season factor
contain <- pca.cleaned[,-c(2:4)]
meteo <- contain[, 42:45]
meteo[, 2] <- ifelse(contain$Met_Summer == 1, 2, 0)
meteo[, 3] <- ifelse(contain$Met_Autumn == 1, 3, 0)
meteo[, 4] <- ifelse(contain$Met_Winter == 1, 4, 0)
meteo1 <- rowSums(meteo)
#astronomical season factor
astro <- contain[, 46:49]
astro[, 2] <- ifelse(contain$Astro_Summer == 1, 2, 0)
astro[, 3] <- ifelse(contain$Astro_Autumn == 1, 3, 0)
astro[, 4] <- ifelse(contain$Astro_Spring == 1, 4, 0)
astro1 <- rowSums(astro)
something <- cbind(contain, meteo1, astro1)
#subset new dataset for multivariate analysis
proper <- subset(something, select = -c(42:49))
colnames(proper)[42] <- "Met_def"
colnames(proper)[43] <- "Astro_def"
proper$Met_def <- as.character(proper$Met_def)
#Encode factors as seasons for meteorological
proper$Met_def[proper$Met_def == "1"] <-
"Met_spring"
proper$Met_def[proper$Met_def == "2"] <-
"Met_summer"
proper$Met_def[proper$Met_def == "3"] <-
"Met_autumn"
proper$Met_def[proper$Met_def == "4"] <-
"Met_winter"
#Encode factors as seasons for astronomical
proper$Astro_def[proper$Astro_def == "1"] <- "Astro_winter"
proper$Astro_def[proper$Astro_def == "2"] <- "Astro_summer"
proper$Astro_def[proper$Astro_def == "3"] <- "Astro_autumn"
proper$Astro_def[proper$Astro_def == "4"] <- "Astro_spring"
Before the application of multivariate techniques, some exploratory data analysis will be carried out to investigate the structure of the data. However, given the volume of data at hand, the applicability and effectiveness of traditional exploratory techniques are limited.
Pairs plots illustrate the pairwise relationships between variables by matching values in the same data row. Given that the techniques under consideration require approximately linear relationships, pairs plots have been used to investigate the general shape of the relationships between the different variables. Using the code in the appendix, a general linear tendency can indeed be observed between the variables under study. A sample of the pairs plot obtained from the dataset is shown below:
#create sample of random observations
sampled <- cleaned[sample(nrow(cleaned),200),]
#run pairs plots for all the variables to investigate the different relations
require(lattice)
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.4.2
splom(~sampled[,5:7],varname.cex = .5)
Application of Multivariate Analysis Techniques
Multivariate analysis techniques will be employed to identify common characteristics within the multiple measurements taken daily. Principal Component Analysis (“PCA”) and clustering will be used to identify a common profile for each season in the region of Basel. It should be noted that while Multidimensional Scaling (“MDS”) was considered for the analysis, it was later removed as it did not provide additional insight; the MDS code can nevertheless be found in the appendix.
As outlined by Stackexchange (2018), the objective of Principal Component Analysis (“PCA”) is to summarise data through the identification of common underlying components. For example, seasons can be described in numerous ways, such as their respective temperature levels, their associated humidity levels, their length in days, and so on; a list of such attributes can thus be made for each season. Some attributes will capture more of the underlying meaning associated with seasons than others, while some will describe closely related aspects of seasons, effectively making them redundant. PCA seeks to summarise the information describing a phenomenon by creating new components from combinations of the existing variables in the data.
Consider the scatterplot below, representing different data points (e.g. dates) for two variables (e.g. temperature and wind speed). It can be observed from the general direction of the points on the plot (upward to the right) that the variables are positively correlated, or in other words, that the variables move in tandem.
Sample scatter plot - sourced from: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
A new component, summarising the information carried by those two variables, can be constructed by drawing a line (the principal component) through the centre of the data points and projecting all the points onto it, as shown in the simulation below (rotating red line). The red dots are the projections of the original points (blue dots), and the red lines connecting them indicate the distance between the original points and their projections onto the principal component. As shown below, the principal component line can be set at several different angles.
PCA simulation - sourced from: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
PCA finds the best principal component by minimising the distances between the original points (blue dots) and their projections (red dots), which is equivalent to maximising the spread of the red dots along the principal component line (shown below in the fitted simulation).
PCA best fitted line - sourced from: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
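To make this idea concrete, the short sketch below is a toy example unrelated to the weather data, using two hypothetical variables (simulated “temperature” and a correlated “wind speed”); it shows how two positively correlated variables are summarised almost entirely by their first principal component.
#Toy illustration of PCA (sketch, not part of the weather analysis)
set.seed(1)
temp <- rnorm(100) #hypothetical standardised "temperature"
wind <- 0.8 * temp + rnorm(100, sd = 0.4) #hypothetical correlated "wind speed"
toy.pca <- prcomp(cbind(temp, wind), scale. = TRUE)
summary(toy.pca) #PC1 carries most of the variance
toy.pca$rotation #both variables load on PC1, so it summarises their shared trend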
While PCA provides an intuitive way to analyse the data, Bartlett’s test of sphericity will first be used to ensure that this method is suitable. This test examines whether the variables are unrelated and thus unsuitable for the detection of an inherent structure (IBM, 2018). A p-value below the conventional 0.05 threshold indicates that the correlations between the variables are strong enough for PCA to be meaningful; the result below (a p-value of effectively zero) therefore indicates that PCA can be used with this dataset.
#code adapted from:
#https://www.r-bloggers.com/naive-principal-component-analysis-using-r/
require(psych)
test.res <- cortest.bartlett(proper[, 1:41])
print(paste("P-value","=",test.res$p.value))
## [1] "P-value = 0"
The data will first be scaled so that the PCA is run appropriately on the underlying dataset. Scaling gives every variable unit variance, so that variables measured in different units and on different scales contribute comparably to the analysis, instead of the components being dominated by the variables with the largest numerical ranges.
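As a brief illustration of what scaling does, the sketch below (which assumes the ‘proper’ data frame built above) checks that each scaled column has a standard deviation of one, so that the measurement units no longer dominate the analysis.
#Sketch: verify that scaling gives every variable unit variance
scaled.check <- scale(proper[, 1:41])
round(apply(scaled.check, 2, sd)[1:5], 2) #first few standard deviations, all equal to 1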
#Below code chunks adapted from:
#https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance
#https://www.r-bloggers.com/pca-and-k-means-clustering-of-delta-aircraft/
#http://www.sthda.com/english/wiki/print.php?id=207
#https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html
PCA.complete <- prcomp(proper[, 1:41], scale. = TRUE)
The plots below, known as scree plots, illustrate the variation explained by each principal component; they can also be used to show how the eigenvalues (scaling factors) decrease across the principal components. The scree plot is used to select the number of components that will be carried forward as a new, smaller set of variables in the later stages of the analysis.
From the scree plot with the “Percentage of explained variances” vertical axis, it can be observed that about 12 components explain about 85% of the variation in the data, i.e. 12 components retain about 85% of the information that the original data held. Alternatively, on the scree plot with the “Eigenvalue” vertical axis, the Kaiser criterion specifies that only components with an eigenvalue greater than one should be retained, which leads to only eight components being selected.
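The figures quoted above can also be read directly from the PCA object; the short sketch below (assuming PCA.complete from the previous chunk) prints the cumulative proportion of variance explained and counts the eigenvalues above one for the Kaiser criterion.
#Sketch: numbers behind the scree plots
var.explained <- PCA.complete$sdev^2 / sum(PCA.complete$sdev^2)
round(cumsum(var.explained)[1:18], 3) #cumulative proportion of variance explained
sum(PCA.complete$sdev^2 > 1) #Kaiser criterion: number of eigenvalues greater than one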
library(factoextra)
library(FactoMineR)
# Visualize variances through screeplot
fviz_screeplot(
PCA.complete,
choice = "variance",
addlabels = TRUE,
ylim = c(0, 50),
barfill = 'coral1',
main = "Screeplot for Weather Dataset PCA covering 95% of the variation",
xlab = "Principal Components",
ggtheme = theme_classic(),
ncp = 21
)
# Visualize eigenvalues through screeplot
fviz_screeplot(
PCA.complete,
choice = "eigenvalue",
addlabels = TRUE,
ylim = c(0, 50),
barfill = 'coral1',
main = "Screeplot for Weather Dataset PCA",
xlab = "Principal Components",
ggtheme = theme_classic(),
ncp = 21
)
The plot below (also known as a biplot) illustrates the first two principal components resulting from the PCA on the dataset. The biplot depicts the variables under study in such a way that it can be inferred, from their relative locations on the plot, whether variables are similar to each other. For instance, “Wind Speed daily min [80 m above gnd]” and “Wind Speed daily min [900 mb]” appear quite similar, given their overlapping locations on the plot. Additionally, the colour indicates the contribution of each variable to the principal components, with “Wind Speed daily min [900 mb]” being one of the highest contributors and “Sunshine Duration daily sum [sfc]” one of the weakest.
library(factoextra)
library(FactoMineR)
#Make graph of variables - PCA biplot
fviz_pca_var(
PCA.complete,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
)
The graph below depicts how well individual observations are represented on the principal components. The cos2 measure (squared cosine of an observation on a component) indicates the quality of this representation on a scale of zero to one, where one is a perfect representation (Abdi and Williams, 2010). The plot shows that individuals close to the centre are the least well represented on the first two principal components, while the best-represented observations lie towards the bottom left (highlighted in bright red).
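The cos2 values underlying this plot can also be inspected directly; the sketch below (assuming PCA.complete from above and the get_pca_ind helper in factoextra) extracts the representation quality of the first few observations on the first two principal components.
#Sketch: inspect cos2 values for individual observations
ind.cos2 <- factoextra::get_pca_ind(PCA.complete)$cos2
head(round(ind.cos2[, 1:2], 3)) #quality of representation on PC1 and PC2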
#code adapted from: http://www.sthda.com/english/wiki/print.php?id=204#cos2-quality-of-the-representation-for-individuals-on-the-principal-components
#make graph of individuals
fviz_pca_ind(PCA.complete, col.ind = "cos2") +
scale_color_gradient2(
low = "white",
mid = "blue",
high = "red",
midpoint = 0.50
) + theme_minimal()
The primary aim of this analysis is to identify whether seasons can be profiled based on a combination of climatic occurrences. The plots below, which colour the observations by season under the two different definitions across the first two principal components, show how the seasons are distributed across the PCs.
#Score plots of the observations coloured by season, using the 'proper' data frame
library(ggfortify)
#PCA plot for meteorological seasons
autoplot(PCA.complete, data = proper, colour = 'Met_def')
#PCA plot for astronomical seasons
autoplot(PCA.complete, data = proper, colour = 'Astro_def')
#Note: an elbow in the scree plot could also suggest retaining only the first four PCs
It is important to bear in mind that PCA displays trends in the data through the use of new axes (the principal components). These patterns can be interpreted by examining the rotations (or loadings) associated with each variable entered into the PCA, to identify which variables have the greatest influence on the principal components. To do so, a measure of influence known as the equilibrium contribution (one divided by the square root of the number of variables) will be calculated below and used as a benchmark: variables whose loadings exceed it in absolute value are taken to drive the component.
#Equilibrium contribution computation
equilibrium <- 1/sqrt(ncol(proper[,1:41]))
print(equilibrium)
## [1] 0.1561738
#Variables whose PC1 loadings exceed the equilibrium contribution in absolute value
which((PCA.complete$rotation[,1] > equilibrium)|(PCA.complete$rotation[,1]< -equilibrium))
## Total Cloud Cover daily mean [sfc]
## 7
## Medium Cloud Cover daily mean [mid cld lay]
## 9
## Low Cloud Cover daily mean [low cld lay]
## 10
## Sunshine Duration daily sum [sfc]
## 11
## Shortwave Radiation daily sum [sfc]
## 12
## Wind Speed daily mean [10 m above gnd]
## 13
## Wind Speed daily mean [80 m above gnd]
## 15
## Wind Speed daily mean [900 mb]
## 17
## Wind Gust daily mean [sfc]
## 19
## Wind Speed daily max [10 m above gnd]
## 34
## Wind Speed daily min [10 m above gnd]
## 35
## Wind Speed daily max [80 m above gnd]
## 36
## Wind Speed daily min [80 m above gnd]
## 37
## Wind Speed daily max [900 mb]
## 38
## Wind Speed daily min [900 mb]
## 39
## Wind Gust daily max [sfc]
## 40
## Wind Gust daily min [sfc]
## 41
##first PC suggests that wind speed and cloud cover are important separators
which((PCA.complete$rotation[,2] > equilibrium)|(PCA.complete$rotation[,2]< -equilibrium))
## Relative Humidity daily mean [2 m above gnd]
## 3
## Total Cloud Cover daily mean [sfc]
## 7
## High Cloud Cover daily mean [high cld lay]
## 8
## Medium Cloud Cover daily mean [mid cld lay]
## 9
## Low Cloud Cover daily mean [low cld lay]
## 10
## Sunshine Duration daily sum [sfc]
## 11
## Shortwave Radiation daily sum [sfc]
## 12
## Wind Speed daily mean [10 m above gnd]
## 13
## Wind Speed daily mean [80 m above gnd]
## 15
## Wind Speed daily mean [900 mb]
## 17
## Wind Gust daily mean [sfc]
## 19
## Relative Humidity daily max [2 m above gnd]
## 22
## Relative Humidity daily min [2 m above gnd]
## 23
## Total Cloud Cover daily min [sfc]
## 27
## Medium Cloud Cover daily min [mid cld lay]
## 31
## Low Cloud Cover daily min [low cld lay]
## 33
## Wind Speed daily max [10 m above gnd]
## 34
## Wind Speed daily max [80 m above gnd]
## 36
## Wind Speed daily min [80 m above gnd]
## 37
## Wind Speed daily min [900 mb]
## 39
## Wind Gust daily max [sfc]
## 40
## Wind Gust daily min [sfc]
## 41
##second PC suggests the same, in addition to humidity, sunshine and radiation
The comparison against the equilibrium contribution above indicates that the variables associated with wind and cloud cover are the driving forces behind the dominant trend in the data. It can thus be hypothesised that these variables have a significant influence on the underlying seasonal patterns.
As an extension to PCA, clustering methods have been considered to better profile seasons by organising the observations into sensible intrinsic groupings, or clusters (Jain, 2010). This would allow other underlying structures in the data to be identified and help to streamline the data further. To this end, two classes of clustering methods have been considered, namely hierarchical and partition clustering.
Partition methods assign each data point to exactly one of a pre-specified number of clusters. Hierarchical clustering, by contrast, repeatedly merges the samples (or clusters) that are closest in terms of a chosen distance measure, and clusters are built according to how proximity between groups is defined: clusters can be joined on the basis of their closest points (single linkage), their furthest points (complete linkage), or the average of all pairwise distances between their points (average linkage). A small illustrative sketch is given below.
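The sketch below illustrates the distinction on a small toy dataset (simulated points, unrelated to the weather data): K-means partitions the points into a pre-specified number of clusters, while hierarchical clustering builds clusters from Euclidean distances under the three linkage rules just described.
#Toy illustration of partition vs hierarchical clustering (sketch)
set.seed(2)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 3), ncol = 2)) #two well-separated groups of points
kmeans(toy, centers = 2)$cluster #partition method: each point assigned to one of two clusters
d <- dist(toy) #Euclidean (straight-line) distances between points
cutree(hclust(d, method = "single"), k = 2) #join clusters via their closest points
cutree(hclust(d, method = "complete"), k = 2) #join clusters via their furthest points
cutree(hclust(d, method = "average"), k = 2) #join clusters via average pairwise distance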
The clustering algorithms applied in the analysis are K-means clustering (a partition method) and average-linkage hierarchical clustering, with K-means being used primarily because of its lower computational load. Additionally, since meteorological data do not typically have an inherent hierarchical structure (unlike, say, biological data, which would justify the use of a dendrogram), K-means clustering appears to be the more appropriate choice.
As indicated by Agarwal (2011), the leading challenges when applying clustering methods lie in determining the desired number of clusters and choosing the distance measure. The clustering methods will be applied to the scores of the first 12 principal components, since these explain about 85% of the variation in the data. For this report, the distance measure used is the Euclidean distance (the straight-line distance between two points), and the number of clusters was chosen using the following three methods: the number of principal components explaining about 95% of the variation, the elbow method, and the variance ratio criterion.
Based on the PCA results, 18 principal components explain about 95% of the variation in the data. As a first choice, K-means clustering will therefore be run with 18 clusters and projected onto the first three principal components as follows:
#retain the scores of the first 12 principal components (about 85% of the variance explained)
chosen.comp <- data.frame(PCA.complete$x[, 1:12])
#Kmeans clustering for 18 clusters based on 95% variance explained
k1 <- kmeans(chosen.comp, 18, nstart = 25, iter.max = 1000)
# library(RColorBrewer)
# library(scales)
# palette(alpha(brewer.pal(9, 'Set1'), 0.5))
# plot(chosen.comp, col = k1$clust, pch = 16)
#3d scatter plot for 18 clusters
library(plotly)
Scatter.18 <-
plot_ly(
chosen.comp,
x = ~ PC1,
y = ~ PC2,
z = ~ PC3,
color = ~ k1$clust,
colors = c('#BF382A', '#0C4B8E')
) %>%
add_markers() %>%
layout(scene = list(
xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')
))
Scatter.18
#table for astro_def
table(k1$cluster, proper$Astro_def)
##
## Astro_autumn Astro_spring Astro_summer Astro_winter
## 1 99 35 3 216
## 2 325 107 31 399
## 3 185 124 66 267
## 4 198 43 14 276
## 5 402 160 27 289
## 6 165 395 394 130
## 7 45 153 337 5
## 8 14 2 0 10
## 9 84 175 32 134
## 10 149 98 35 209
## 11 167 45 1 209
## 12 122 300 350 99
## 13 103 373 512 18
## 14 172 468 573 28
## 15 161 255 104 123
## 16 98 261 559 29
## 17 72 44 19 42
## 18 409 31 12 521
#table for Met_def
table(k1$cluster, proper$Met_def)
##
## Met_autumn Met_spring Met_summer Met_winter
## 1 49 67 1 236
## 2 220 162 32 448
## 3 152 169 54 267
## 4 132 76 17 306
## 5 367 206 16 289
## 6 250 338 383 113
## 7 104 113 323 0
## 8 7 5 0 14
## 9 78 196 33 118
## 10 113 127 35 216
## 11 122 88 0 212
## 12 201 290 334 46
## 13 215 295 495 1
## 14 286 336 609 10
## 15 164 232 123 124
## 16 188 191 560 8
## 17 65 49 18 45
## 18 290 96 3 584
The elbow method examines the within-groups sum of squares (WSS), which reflects the percentage of variance explained, as a function of the number of clusters; the number of clusters is chosen at the point beyond which adding a further cluster no longer leads to a meaningful improvement (Bholowalia and Kumar, 2014). The plot below shows the WSS for 1 to 12 clusters and indicates that beyond 2 clusters the improvement with each new cluster is minimal.
#Below code adapted from:
#https://www.r-bloggers.com/pca-and-k-means-clustering-of-delta-aircraft/
#plot wss for 12 clusters for numeric data and choose cluster based on elbow method
wss <- (nrow(proper[, 1:41]) - 1) * sum(apply(proper[, 1:41], 2, var))
for (i in 2:12)
wss[i] <- sum(kmeans(proper[, 1:41],
centers = i)$withinss)
plot(1:12,
wss,
type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
axis(side = 1, at = 1:12)
abline(h = 6.394687e+15, col = "red", lty = 3, lwd = 3)
abline(v = 2, col = "red", lty =3, lwd = 3)
Consequently, a 3d scatter plot across the first three principal components will be constructed to illustrate the 2 clusters determined by the elbow method.
#Assuming 2 as elbow based on above graph
elbow.comp <- data.frame(PCA.complete$x[, 1:12])
#Kmeans clustering for 2 clusters based on elbow method
k <- kmeans(elbow.comp, 2, nstart = 25, iter.max = 1000)
# library(RColorBrewer)
# library(scales)
# palette(alpha(brewer.pal(9, 'Set1'), 0.5))
# plot(elbow.comp, col = k$clust, pch = 16)
library(plotly)
Scatter.2 <-
plot_ly(
elbow.comp,
x = ~ PC1,
y = ~ PC2,
z = ~ PC3,
color = ~ k$clust,
colors = c('#BF382A', '#0C4B8E')
) %>%
add_markers() %>%
layout(scene = list(
xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')
))
Scatter.2
#Table for astro_def
table(k$cluster, proper$Astro_def)
##
## Astro_autumn Astro_spring Astro_summer Astro_winter
## 1 1105 927 532 1389
## 2 1865 2142 2537 1615
#Table for Met_def
table(k$cluster, proper$Met_def)
##
## Met_autumn Met_spring Met_summer Met_winter
## 1 928 1026 532 1467
## 2 2075 2010 2504 1570
The Calinski and Harabasz index, also known as the variance ratio criterion, compares the variance between clusters with the variance within clusters (De Amorim and Hennig, 2015). Applying this measure to determine the optimal number of clusters leads to 15 clusters being chosen for the K-means method and 2 clusters for the average-linkage hierarchical method. The associated plot for the K-means is found below, and the results for the hierarchical clustering are in the appendix.
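For reference, the variance ratio criterion can be computed directly from a fitted K-means object; the sketch below (assuming the 18-cluster solution k1 and the chosen.comp scores from above) uses the between- and within-cluster sums of squares returned by kmeans, with higher values indicating better-separated clusters.
#Sketch: variance ratio (Calinski-Harabasz) criterion from a kmeans fit
n.obs <- nrow(chosen.comp)
n.clust <- length(k1$size)
(k1$betweenss / (n.clust - 1)) / (k1$tot.withinss / (n.obs - n.clust))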
#compute kmeans optimal cluster number according to NbClust - Calinski and Harabasz's index
##15 clusters is optimal
# library(NbClust)
# nbclustering.kmeans <-
# NbClust(
# data = PCA.complete$x,
# diss = NULL,
# distance = "euclidean",
# min.nc = 2,
# max.nc = 15,
# method = "kmeans",
# index = "cindex",
# alphaBeale = 0.1
# )
#compute average-linkage optimal cluster number according to NbClust - Calinski and Harabasz's index
## 2 clusters is optimal
# nbclustering.avg <-
# NbClust(
# data = PCA.complete$x,
# diss = NULL,
# distance = "euclidean",
# min.nc = 2,
# max.nc = 15,
# method = "average",
# index = "cindex",
# alphaBeale = 0.1
# )
#Kmeans clustering for 15 clusters based on optimal index choice
k2 <- kmeans(chosen.comp, 15, nstart = 25, iter.max = 1000)
# library(RColorBrewer)
# library(scales)
# palette(alpha(brewer.pal(9, 'Set1'), 0.5))
# plot(chosen.comp, col = k1$clust, pch = 16)
#3d scatter plot for 15 clusters
Scatter.15 <-
plot_ly(
chosen.comp,
x = ~ PC1,
y = ~ PC2,
z = ~ PC3,
color = ~ k2$clust,
colors = c('#BF382A', '#0C4B8E')
) %>%
add_markers() %>%
layout(scene = list(
xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')
))
Scatter.15
#table for astro_def
table(k2$cluster, proper$Astro_def)
##
## Astro_autumn Astro_spring Astro_summer Astro_winter
## 1 244 350 200 238
## 2 145 110 43 207
## 3 100 179 33 156
## 4 177 52 2 216
## 5 113 44 4 245
## 6 141 434 594 31
## 7 14 2 0 10
## 8 119 132 51 82
## 9 219 169 145 282
## 10 364 308 206 408
## 11 279 119 49 337
## 12 49 166 377 4
## 13 409 529 460 135
## 14 471 40 16 617
## 15 126 435 889 36
#table for Met_def
table(k2$cluster, proper$Met_def)
##
## Met_autumn Met_spring Met_summer Met_winter
## 1 267 333 195 237
## 2 113 141 39 212
## 3 88 198 35 147
## 4 133 100 0 214
## 5 57 81 2 266
## 6 279 357 559 5
## 7 7 5 0 14
## 8 113 125 61 85
## 9 198 203 128 286
## 10 311 333 229 413
## 11 208 150 58 368
## 12 112 119 365 0
## 13 492 450 486 105
## 14 338 131 4 671
## 15 287 310 875 14
Based on the different clustering methods and techniques, the K-means solution with 15 clusters appears to be the preferred choice, since it maximises the variance ratio and is broadly in line with the number of principal components that explain about 90% of the variation in the data. However, it can also be observed that it does an inadequate job of separating the data cloud into distinct clusters that would represent the different seasons.
This report sought to establish a profile for each season in the region of Basel, based on the meteorological and astronomical definitions of seasons. The application of PCA and clustering revealed that the meteorological data can be reduced in dimension but cannot be used to find an intrinsic pattern, since no distinct clusters emerge.
Further research could apply other multivariate methods for dimensionality reduction (reducing the number of variables) that are robust to outliers. An inspection of Cook’s distance (code in the appendix) revealed a large number of outliers (extreme values) that have not been specifically addressed in this analysis, since the object of the report was solely to identify patterns for further investigation. Finally, another research direction would be to investigate what defines a season in Basel on the basis of the climatic data, since the meteorological definition did not appear to be a sensible fit.
#Structure of original imported data
#code adapted from: https://stackoverflow.com/questions/44200394/show-str-as-table-in-r-markdown
# suppressWarnings(library(knitr))
# suppressWarnings(library(magrittr))
# data.frame(variable = names(dat),
# classe = sapply(dat, typeof),
# first_values = sapply(dat, function(x) paste0(head(x), collapse = ", ")),
# row.names = NULL) %>%
# kable()
# #create sample of random observations
# sampled <- cleaned[sample(nrow(cleaned),200),]
# #run pairs plots for all the variables to investigate the different relations
# require(lattice)
# splom(~sampled[,5:14],varname.cex = .5)
# splom(~sampled[,15:23],varname.cex = .5)
# splom(~sampled[,24:34],varname.cex = .5)
# splom(~sampled[,35:44],varname.cex = .5)
# ########################################################################################################################################
# #Function to define dummies for astrological spring
# #Resources used from:
# #https://stackoverflow.com/questions/15659783/why-does-unlist-kill-dates-in-r
# #https://stackoverflow.com/questions/15303283/how-to-do-vlookup-and-fill-down-like-in-excel-in-r
# ########################################################################################################################################
# #Purpose: The astro.spring function calculates whether each date in a year-by-year interval
# #falls in the range of dates for the astronomical spring (20th March - 20th June)
#
# ##Required Inputs:
# #Dataframe with dates, year and spring column
#
# ##Expected output:
# #list of 0s and 1s indicating if a certain date falls into the spring range
#
# ########################################################################################################################################
#
#
# #require lubridate package for date manipulation
#
# astro.spring <- function(datframe = Merged) {
# require("lubridate")
# #create dynamic interval
# diff <- (2018 - 1984)
# #create empty vector to store dummy results
# result <- vector()
# #Create Calendar list to preserve dates
# Calendar <- list()
# #Create spring list to preserve dates
# spring.list <- list()
# for (i in 0:diff) {
# year <- 1984 + as.numeric(i)
# #spring interval - could be defined outside of function
# #function could just add years
# start.interval <-
# as.Date(paste(20, 03, year, sep = "."), format = "%d.%m.%Y")
# end.interval <-
# as.Date(paste(20,06 , year, sep = "."), format = "%d.%m.%Y")
# int1 <- interval(start.interval, end.interval)
# #break statement for possibility of year '2019'
# #Used to circumvent for end.ind
# if ((year) == 2019) {
# break
# }
# #cycle through dataframe by identifying first occurance of year
# #and the last occurance for spring (year +1 )
# start.ind <- min(which(year == Years))
# end.ind <- max(which((year) == Years))
# val1 <- datframe[start.ind:end.ind, ]
# #Preserve dates
# Calendar[[i+1]] <- val1[,1]
# #identify number of rows for interval and use as looping rule
# extent <- nrow(val1)
# #reinitialise vector to store result
# result <- vector()
# #rotate through rows of year of interest
# for (j in 1:extent) {
# val <- val1[j, 1]
# test <- val %within% int1
# ifelse(test, result[j] <- 1, result[j] <- 0)
# }
# spring.list[[i + 1]] <- result
# }
# #create dataframe by binding the spring dummy with dates
# spring.list <- as.data.frame(do.call("c",spring.list))
# Calandrier <- as.data.frame(do.call("c",Calendar))
# compiled <- cbind(Calandrier,spring.list)
# colnames(compiled)<- c("Date","spring")
# #remove occuring rows
# uniquely <- unique(compiled)
# #fully remove duplicated values over intervals by subsetting 1 and 0
# sub1 <- subset(compiled,spring == '1')
# sub2 <- subset(compiled,spring == '0')
# binded <- rbind.data.frame(sub1, sub2)
# sorted <- binded[sort.list(binded[,1]),]
# #remove duplicates in second position
# listing <- sorted[!duplicated(sorted$Date),]
# return(listing)
# }
#
# ########################################################################################################################################
# #Function to define dummies for astrological autumn
# #Resources used from:
# #https://stackoverflow.com/questions/15659783/why-does-unlist-kill-dates-in-r
# #https://stackoverflow.com/questions/15303283/how-to-do-vlookup-and-fill-down-like-in-excel-in-r
# ########################################################################################################################################
# #Purpose: The astro.autumn function calculates whether each date in a year-by-year interval
# #falls in the range of dates for the astronomical autumn (22nd September - 20th December)
#
# ##Required Inputs:
# #Dataframe with dates, year and autumn column
#
# ##Expected output:
# #list of 0s and 1s indicating if a certain date falls into the autumn range
#
# ########################################################################################################################################
#
#
# #require lubridate package for date manipulation
#
# astro.autumn <- function(datframe = Merged) {
# require("lubridate")
# #create dynamic interval
# diff <- (2018 - 1984)
# #create empty vector to store dummy results
# result <- vector()
# #Create Calendar list to preserve dates
# Calendar <- list()
# #Create autumn list to preserve dates
# autumn.list <- list()
# for (i in 0:diff) {
# year <- 1984 + as.numeric(i)
# #autumn interval - could be defined outside of function
# #function could just add years
# start.interval <-
# as.Date(paste(22, 09, year, sep = "."), format = "%d.%m.%Y")
# end.interval <-
# as.Date(paste(20, 12, year, sep = "."), format = "%d.%m.%Y")
# int1 <- interval(start.interval, end.interval)
# #break statement for possibility of year '2019'
# #Used to circumvent for end.ind
# if ((year) == 2019) {
# break
# }
# #cycle through dataframe by identifying first occurance of year
# #and the last occurance for autumn (year +1 )
# start.ind <- min(which(year == Years))
# end.ind <- max(which((year) == Years))
# val1 <- datframe[start.ind:end.ind, ]
# #Preserve dates
# Calendar[[i+1]] <- val1[,1]
# #identify number of rows for interval and use as looping rule
# extent <- nrow(val1)
# #reinitialise vector to store result
# result <- vector()
# #rotate through rows of year of interest
# for (j in 1:extent) {
# val <- val1[j, 1]
# test <- val %within% int1
# ifelse(test, result[j] <- 1, result[j] <- 0)
# }
# autumn.list[[i + 1]] <- result
# }
# #create dataframe by binding the autumn dummy with dates
# autumn.list <- as.data.frame(do.call("c",autumn.list))
# Calandrier <- as.data.frame(do.call("c",Calendar))
# compiled <- cbind(Calandrier,autumn.list)
# colnames(compiled)<- c("Date","autumn")
# #remove occuring rows
# uniquely <- unique(compiled)
# #fully remove duplicated values over intervals by subsetting 1 and 0
# sub1 <- subset(compiled,autumn == '1')
# sub2 <- subset(compiled,autumn == '0')
# binded <- rbind.data.frame(sub1, sub2)
# sorted <- binded[sort.list(binded[,1]),]
# #remove duplicates in second position
# listing <- sorted[!duplicated(sorted$Date),]
# return(listing)
# }
#
# ########################################################################################################################################
# #Function to define dummies for astrological Summer
# #Resources used from:
# #https://stackoverflow.com/questions/15659783/why-does-unlist-kill-dates-in-r
# #https://stackoverflow.com/questions/15303283/how-to-do-vlookup-and-fill-down-like-in-excel-in-r
# ########################################################################################################################################
# #Purpose: The astro.summer function calculates whether each date in a year-by-year interval
# #falls in the range of dates for the astronomical summer (21st June - 21st September)
#
# ##Required Inputs:
# #Dataframe with dates, year and Summer column
#
# ##Expected output:
# #list of 0s and 1s indicating if a certain date falls into the Summer range
#
# ########################################################################################################################################
#
#
# #require lubridate package for date manipulation
#
# astro.summer <- function(datframe = Merged) {
# require("lubridate")
# #create dynamic interval
# diff <- (2018 - 1984)
# #create empty vector to store dummy results
# result <- vector()
# #Create Calendar list to preserve dates
# Calendar <- list()
# #Create Summer list to preserve dates
# Summer.list <- list()
# for (i in 0:diff) {
# year <- 1984 + as.numeric(i)
# #Summer interval - could be defined outside of function
# #function could just add years
# start.interval <-
# as.Date(paste(21, 06, year, sep = "."), format = "%d.%m.%Y")
# end.interval <-
# as.Date(paste(21, 09, year, sep = "."), format = "%d.%m.%Y")
# int1 <- interval(start.interval, end.interval)
# #break statement for possibility of year '2019'
# #Used to circumvent for end.ind
# if ((year) == 2019) {
# break
# }
# #cycle through dataframe by identifying first occurance of year
# #and the last occurance for Summer (year +1 )
# start.ind <- min(which(year == Years))
# end.ind <- max(which((year) == Years))
# val1 <- datframe[start.ind:end.ind, ]
# #Preserve dates
# Calendar[[i+1]] <- val1[,1]
# #identify number of rows for interval and use as looping rule
# extent <- nrow(val1)
# #reinitialise vector to store result
# result <- vector()
# #rotate through rows of year of interest
# for (j in 1:extent) {
# val <- val1[j, 1]
# test <- val %within% int1
# ifelse(test, result[j] <- 1, result[j] <- 0)
# }
# Summer.list[[i + 1]] <- result
# }
# #create dataframe by binding the Summer dummy with dates
# Summer.list <- as.data.frame(do.call("c",Summer.list))
# Calandrier <- as.data.frame(do.call("c",Calendar))
# compiled <- cbind(Calandrier,Summer.list)
# colnames(compiled)<- c("Date","Summer")
# #remove occuring rows
# uniquely <- unique(compiled)
# #fully remove duplicated values over intervals by subsetting 1 and 0
# sub1 <- subset(compiled,Summer == '1')
# sub2 <- subset(compiled,Summer == '0')
# binded <- rbind.data.frame(sub1, sub2)
# sorted <- binded[sort.list(binded[,1]),]
# #remove duplicates in second position
# listing <- sorted[!duplicated(sorted$Date),]
# return(listing)
# }
# ########################################################################################################################################
# #Function to define dummies for astrological winter
# #Resources used from:
# #https://stackoverflow.com/questions/15659783/why-does-unlist-kill-dates-in-r
# #https://stackoverflow.com/questions/15303283/how-to-do-vlookup-and-fill-down-like-in-excel-in-r
# ########################################################################################################################################
# #Purpose: The astro.winter function calculates whether each date in a year-by-year interval
# #falls in the range of dates for the astronomical winter (21st December - 19th March)
#
# ##Required Inputs:
# #Dataframe with dates, year and winter column
#
# ##Expected output:
# #list of 0s and 1s indicating if a certain date falls into the winter range
#
# ########################################################################################################################################
#
# astro.winter <- function(datframe = Merged) {
# require("lubridate")
# #create dynamic interval
# diff <- (2018 - 1984)
# #create empty vector to store dummy results
# result <- vector()
# #Create Calendar list to preserve dates
# Calendar <- list()
# #Create winter list to preserve dates
# winter.list <- list()
# for (i in 0:diff) {
# year <- 1984 + as.numeric(i)
# #winter interval - could be defined outside of function
# #function could just add years
# start.interval <-
# as.Date(paste(21, 12, year, sep = "."), format = "%d.%m.%Y")
# end.interval <-
# as.Date(paste(19, 03, (year + 1), sep = "."), format = "%d.%m.%Y")
# int1 <- interval(start.interval, end.interval)
# #break statement for possibility of year '2019'
# #Used to circumvent for end.ind
# if ((year + 1) == 2019) {
# break
# }
# #cycle through dataframe by identifying first occurance of year
# #and the last occurance for winter (year +1 )
# start.ind <- min(which(year == Years))
# end.ind <- max(which((year + 1) == Years))
# val1 <- datframe[start.ind:end.ind, ]
# #Preserve dates
# Calendar[[i+1]] <- val1[,1]
# #identify number of rows for interval and use as looping rule
# extent <- nrow(val1)
# #reinitialise vector to store result
# result <- vector()
# #rotate through rows of year of interest
# for (j in 1:extent) {
# val <- val1[j, 1]
# test <- val %within% int1
# ifelse(test, result[j] <- 1, result[j] <- 0)
# }
# winter.list[[i + 1]] <- result
# }
# #create dataframe by binding the winter dummy with dates
# winter.list <- as.data.frame(do.call("c",winter.list))
# Calandrier <- as.data.frame(do.call("c",Calendar))
# compiled <- cbind(Calandrier,winter.list)
# colnames(compiled)<- c("Date","Winter")
# #remove occuring rows
# uniquely <- unique(compiled)
# #fully remove duplicated values over intervals by subsetting 1 and 0
# sub1 <- subset(compiled,Winter == '1')
# sub2 <- subset(compiled,Winter == '0')
# binded <- rbind.data.frame(sub1, sub2)
# sorted <- binded[sort.list(binded[,1]),]
# #remove duplicates in second position
# listing <- sorted[!duplicated(sorted$Date),]
# return(listing)
# }
#print PCA loadings using kable
#link for installation
#devtools::install_github("haozhu233/kableExtra")
# loadings<- data.frame((PCA.complete$rotation))
# column.name <- rownames(loadings)
# loadings1<- do.call(rbind,loadings)
# colnames(loadings1) <- column.name
# load <- as.data.frame(loadings1)
#
# library(kableExtra)
# load %>%
# knitr::kable("html") %>%
# kable_styling() %>%
# scroll_box(width = "500px", height = "200px")
#
#Code adapted from:
#https://www.r-statistics.com/2016/01/multidimensional-scaling-with-r-from-mastering-data-analysis-with-r/
#https://www.r-bloggers.com/multidimensional-scaling-mds-with-r/
#
# # configure multicore processing
# require(doParallel)
# cl <- makeCluster(detectCores())
# registerDoParallel(cl)
#
# #classic MDS - excluding 2 factor variables
#dist.matrix <- dist(proper[,1:41])
# #scaled <- cmdscale(dist.matrix)
# #plot classic MDS
# library(ggpubr)
# plot(scaled[, 1], scaled[, 2])
# text(scaled[,1], scaled[,2], labels(proper$Met_def), cex = 0.9, xpd = TRUE)
# #Non metric MDS
# library(MASS)
# nonmetric.mds <- isoMDS(dist.matrix)
#
# #plot non-metric mds
# plot(nonmetric.mds$points, type = "n")
# text(nonmetric.mds$points, labels = as.character(1:nrow(cleaned)))
#Visualising patterns in correlation matrix
# library(dplyr)
# library(ggpubr)
# res.cor <- cor(something[,1:49], method = "spearman")
# mds.cor <- (1 - res.cor) %>%
# cmdscale() %>%
# as_tibble()
# colnames(mds.cor) <- c("Dim.1", "Dim.2")
# ggscatter(mds.cor, x = "Dim.1", y = "Dim.2",
# size = 1,
# label = colnames(res.cor),
# repel = TRUE)
#Detailed cluster analysis------------------------------------------------------------------------
# #2 cluster for kmeans ------------------------
# sort(table(k$clust))
# clust <- names(sort(table(k$clust)))
# #meteorogical season cluster content
# pca.cleaned[k$clust == clust[1], 42:45]
# pca.cleaned[k$clust == clust[2], 42:45]
#
# #astrological season cluster content
# pca.cleaned[k$clust == clust[1], 45:49]
# pca.cleaned[k$clust == clust[2], 45:49]
#
# #15 clusters for kmeans ------------------------
# sort(table(k2$clust))
# clust <- names(sort(table(k2$clust)))
# #meteorogical season cluster content
# pca.cleaned[k2$clust == clust[1], 42:45]
# pca.cleaned[k2$clust == clust[2], 42:45]
#
# #astrological season cluster content
# pca.cleaned[k2$clust == clust[1], 45:49]
# pca.cleaned[k2$clust == clust[2], 45:49]
#
#
# #18 clusters for kmeans ------------------------
# sort(table(k1$clust))
# clust <- names(sort(table(k1$clust)))
# #meteorogical season cluster content
# pca.cleaned[k1$clust == clust[1], 42:45]
# pca.cleaned[k1$clust == clust[2], 42:45]
#
# #astrological season cluster content
# pca.cleaned[k1$clust == clust[1], 45:49]
# pca.cleaned[k1$clust == clust[2], 45:49]
#
## hierarchical clustering - Average----------------
# library(flashClust)
# clusters <- flashClust((dist.matrix), method = 'average')
# #plot(clusters)
# #assuming optimal according to CI index - 2
# clusterCut <- cutree(clusters, 2)
# plot(table(clusterCut, proper$Met_def))
#Check for outlier using Cook's distance
#Code heavily sourced from: http://r-statistics.co/Outlier-Treatment-With-R.html
#
# model <- lm(cleaned$`Temperature daily mean [2 m above gnd]`~. , data = cleaned[,5:52])
# cooksdist <- cooks.distance(model)
# plot(cooksdist, pch="*", cex=2, main="Influential Obs by Cooks distance") # plot cook's distance
# text(x=1:length(cooksdist)+1, y=cooksdist, labels=ifelse(cooksdist>4*mean(cooksdist, na.rm=T),names(cooksdist),""), col="coral") # add labels
# abline(h = 4*mean(cooksdist), col="blue") # add cutoff line