Abstract
The object of this report is to determine whether distinct seasonal profiles can be identified in the proposed Meteoblue data, which comprise meteorological measurements for the region of Basel in Switzerland. Two definitions of seasons have been examined: the astronomical definition, which is based on the position of the Earth relative to the sun, and the meteorological definition, which relies on annual temperature cycles and coincides with three-month splits of the Gregorian calendar. The analysis focused on the application of Principal Component Analysis (PCA) and clustering to reduce the number of dimensions and identify structure in the data. The results from the scaled PCA led to 12 principal components being retained, and the interpretation of the associated coefficients suggests that variables related to ‘Wind’ and ‘Cloud Coverage’ drive most of the underlying trend. The application of K-means and hierarchical clustering methods to the PCA results indicates that there is no inherent structure in the data that places the seasons into distinct groups, regardless of how the clusters are defined. The meteorological definition of seasons therefore does not appear to be suitable for the region of Basel, since the data do not reflect any clear patterns in the measured climatic variables across time that would support such a classification.
Introduction
This report follows from the Meteoblue dataset proposal and aims to illustrate the application of multivariate techniques to discover latent patterns between the variables under consideration, with a view to forming hypotheses. The objective of this report is to investigate whether the meteorological definition of seasons (based on annual temperature cycles) can be applied to the region of Basel, in comparison to the astronomical definition of seasons (based on the Earth’s position relative to the sun).
To that end, an overview of the dataset is first provided, followed by data cleaning and an exploratory analysis, before the application of the multivariate methods and a discussion of the findings. It should also be noted that this document is aimed at a non-statistical audience and therefore aspires to communicate the results, as well as the methods used, without mathematical proofs, to support a widespread understanding. The code used for the analysis, together with the associated results, is presented throughout the document.
As previously described in the proposal, the data comprise daily meteorological measurements taken in the canton of Basel in Switzerland.
#Code adapted from:
#https://blog.dominodatalab.com/geographic-visualization-with-rs-ggmaps/
#https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf
require(ggmap)
basel <- c(lon = 7.57 , lat = 47.56)
map <- get_map(location = basel, maptype = "terrain")
ggmap(map)
#Original datastet import
dat <- readxl::read_xlsx("C:/Users/Sanil Purryag/Google Drive/St Andrews docs/Modules/Semester 2/Multivariate Analysis/Project resources/Code-WIP/history_export_1984-2018-U.xlsx")
The data encompass 45 variables and 12,112 observations, all represented as numeric values. These variables describe elements of the weather in Basel, recorded in different units of measurement, together with the time at which the measurements were taken. The variables have been grouped under labels, based on the type of information they provide about the climate, and are presented in the table below (as shown in the proposal):
| Grouping label for variables of interest | Minimum value across variables under grouped label | Maximum value across variables under grouped label |
|---|---|---|
| Year, month and day | 01/01/85 | 27/02/18 |
| Temperature | -13.74 | 30.65 |
| Atmospheric Pressure | 971.56 | 1046.82 |
| Humidity | 34.71 | 100 |
| Percentage of the sky covered with clouds | 0 | 100 |
| Total precipitation amount | 0 | 62.4 |
| Global radiation | 121.93 | 8488.82 |
| Wind speed | 3.69 | 139.68 |
Based on the principles of Van den Broeck et al. (2005) on detecting, diagnosing, and editing data abnormalities, the data will first be examined for any inconsistencies and errors before being used for the analysis.
To reduce the number of variables for the analysis and streamline the data, a new variable “Date” has been created from the existing ‘Year’, ‘Month’ and ‘Day’ columns. This allows the data to be filtered and represented across time using this single column. Additionally, the hour and minute columns were dropped because they contain the same value for every observation.
#Drop columns for Hour and Minute
#Below line adapted from:
#https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name
mydat <- within(dat, rm(Hour, Minute))
#create covariate 'Date' by concatenating columns: Year, Month, Date
Date <-
as.Date(paste(mydat$Year, mydat$Month, mydat$Day, sep = "-"))
mydat$Date <- Date
Given that this report will determine distinct seasonal profiles under different definitions of seasons, dummy variables will be created for each season to identify whether a specific date belongs to that season. Each dummy is coded as one when a date falls within the season and zero when it does not. Two sets of dummies will therefore be created, one for the astronomical definition of seasons and one for the meteorological definition of seasons, as follows.
Astronomical definition of seasons - sourced from: https://www.metoffice.gov.uk/learning/seasons/spring/when-does-spring-start
The dummies for the meteorological definition of seasons are based on three-month blocks. They have been created using the R code below, which splits each year into three-month blocks starting in March.
#Below code Adapted from:
#https://stackoverflow.com/questions/45735624/how-to-add-dummy-variables-q1-q2-q3-based-on-the-month-of-date-variable-yyyym
#Length of season obtained from:
#https://www.metoffice.gov.uk/learning/seasons/spring/when-does-spring-start
#Met_season1 - Spring (March - May)
#Met_season2 - Summer (June - August)
#Met_season3 - Autumn (September - November)
#Met_season4 - Winter (December - February)
suppressWarnings(library("lubridate"))
met.def <-
data.table::dcast(mydat,
Date ~ paste0("Met_season", lubridate::quarter(mydat$Date, fiscal_start = 3)),
length,
value.var = "Date")
colnames(met.def)[2] <- "Met_Spring"
colnames(met.def)[3] <- "Met_Summer"
colnames(met.def)[4] <- "Met_Autumn"
colnames(met.def)[5] <- "Met_Winter"
The dummies for the astronomical definition of seasons have been generated using user-created functions (available in the appendix), based on the dates given in the source above.
#Extract the date and year columns from the data and create placeholder columns for the season dummies
Date <- (mydat$Date)
Years <- mydat$Year
Winter <- data.frame(winter = 1)
Summer <- data.frame(Summer = 1)
Autumn <- data.frame(Autumn = 1)
Spring <- data.frame(Spring = 1)
#merge the dataframes respectively
Merged <- cbind(Date, Years, Winter)
Merged1 <- cbind(Date, Years, Summer)
Merged2 <- cbind(Date, Years, Autumn)
Merged3 <- cbind(Date, Years, Spring)
#clean environment
rm(Winter)
rm(Summer)
rm(Autumn)
rm(Spring)
#run each season function and save the results (note: the results reuse the function names)
astro.winter <- astro.winter(datframe = Merged)
astro.summer <- astro.summer(datframe = Merged1)
astro.autumn <- astro.autumn(datframe = Merged2)
astro.spring <- astro.spring(datframe = Merged3)
Moreover, a categorical variable will be created for each definition so that all four seasons can be identified and filtered, one for the meteorological definition and one for the astronomical definition. The categories will be coded as follows:
| Season | Meteorological Definition | Astronomical Definition |
|---|---|---|
| Summer | Met_summer | Astro_summer |
| Winter | Met_winter | Astro_winter |
| Autumn | Met_autumn | Astro_autumn |
| Spring | Met_spring | Astro_spring |
#Create Cleaned dataframe by dropping old timestamp and rearranging date as the first column
cleaned <- cbind(Date, mydat[, 1:43])
#Add meteorological definition of season to Cleaned
cleaned <- cbind(cleaned, met.def [, 2:5])
#Add astronomical definition of seasons to Cleaned
cleaned <-
cbind(cleaned,
astro.winter[, 2],
astro.summer[, 2],
astro.autumn[, 2],
astro.spring[, 2])
colnames(cleaned)[49] <- "Astro_Winter"
colnames(cleaned)[50] <- "Astro_Summer"
colnames(cleaned)[51] <- "Astro_Autumn"
colnames(cleaned)[52] <- "Astro_Spring"
pca.cleaned <- cleaned
pca.cleaned$Date <- as.numeric(paste(cleaned$Year, cleaned$Month, cleaned$Day, sep = ""))
colnames(pca.cleaned)[1] <- "Date"
#Creation of new factor variable from previously created dummies
#meteorological season factor
contain <- pca.cleaned[,-c(2:4)]
meteo <- contain[, 42:45]
meteo[, 2] <- ifelse(contain$Met_Summer == 1, 2, 0)
meteo[, 3] <- ifelse(contain$Met_Autumn == 1, 3, 0)
meteo[, 4] <- ifelse(contain$Met_Winter == 1, 4, 0)
meteo1 <- rowSums(meteo)
#astronomical season factor
astro <- contain[, 46:49]
astro[, 2] <- ifelse(contain$Astro_Summer == 1, 2, 0)
astro[, 3] <- ifelse(contain$Astro_Autumn == 1, 3, 0)
astro[, 4] <- ifelse(contain$Astro_Spring == 1, 4, 0)
astro1 <- rowSums(astro)
something <- cbind(contain, meteo1, astro1)
#subset new dataset for multivariate analysis
proper <- subset(something, select = -c(42:49))
colnames(proper)[42] <- "Met_def"
colnames(proper)[43] <- "Astro_def"
proper$Met_def <- as.character(proper$Met_def)
#Encode factors as seasons for meteorological
proper$Met_def[proper$Met_def == "1"] <-
"Met_spring"
proper$Met_def[proper$Met_def == "2"] <-
"Met_summer"
proper$Met_def[proper$Met_def == "3"] <-
"Met_autumn"
proper$Met_def[proper$Met_def == "4"] <-
"Met_winter"
#Encode factors as seasons for astronomical
proper$Astro_def[proper$Astro_def == "1"] <- "Astro_winter"
proper$Astro_def[proper$Astro_def == "2"] <- "Astro_summer"
proper$Astro_def[proper$Astro_def == "3"] <- "Astro_autumn"
proper$Astro_def[proper$Astro_def == "4"] <- "Astro_spring"
Before the application of multivariate techniques, some exploratory data analysis will be carried out to investigate the structure of the data. However, given the volume of data at hand, the applicability and effectiveness of traditional exploratory techniques are limited.
Pairs plots illustrate the pairwise relationships between variables by matching values in the same data row. Given that the techniques under consideration require approximately linear relationships, pairs plots have been used to investigate the general shape of the relationships between the different variables. Using the code in the appendix, a general linear tendency can indeed be observed between the variables under study. A sample of the pairs plot obtained from the dataset is shown below:
#create sample of random observations
sampled <- cleaned[sample(nrow(cleaned),200),]
#run pairs plots for all the variables to investigate the different relations
require(lattice)
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.4.2
splom(~sampled[,5:7],varname.cex = .5)
Application of Multivariate Analysis Techniques
Multivariate analysis techniques will be employed to identify common characteristics within the multiple measurements taken daily. Principal Component Analysis (“PCA”) and clustering will be used to identify a common profile for each season in the region of Basel. It should be noted that while Multidimensional Scaling (“MDS”) was considered for the analysis, it was later removed as it did not provide additional insight; the MDS code can nevertheless be found in the appendix.
As outlined by Stackexchange (2018), the objective of Principal Component Analysis (“PCA”) is to summarise data through the identification of common underlying components. For example, seasons can be described in numerous ways, such as their respective temperature levels, their associated humidity levels, their length in days, and so on; a list of such attributes can thus be made for each season. Some attributes will capture more of the underlying meaning associated with seasons than others, while some will describe closely related aspects of seasons, effectively making them redundant. PCA seeks to summarise the information describing a phenomenon by creating new components from combinations of the existing variables in the data.
Consider the scatterplot below, representing different data points (e.g. dates) for two variables (e.g. temperature and wind speed). It can be observed from the general direction of the points on the plot (upward to the right) that the variables are positively correlated, or in other words, that the variables move in tandem.
Sample scatter plot - sourced from: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
A new component, summarising the information carried by those two variables, can be constructed by drawing a line (the principal component) through the centre of the data points and projecting all the points onto it, as shown in the simulation below (rotating red line). The red dots are the projections of the original points (blue dots), and the red lines connecting them indicate the distance between the original points and their projections onto the principal component. As shown below, the principal component line can be set at several different angles.
PCA simulation - sourced from: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
PCA finds the best principal component by minimising the distances between the original points (blue dots) and their projections (red dots), which is equivalent to maximising the spread of the red dots along the principal component line (shown below in the fitted simulation).
PCA best fitted line - sourced from: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
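To make this idea concrete, the short sketch below is a toy example unrelated to the weather data, using two hypothetical variables (simulated “temperature” and a correlated “wind speed”); it shows how two positively correlated variables are summarised almost entirely by their first principal component.
#Toy illustration of PCA (sketch, not part of the weather analysis)
set.seed(1)
temp <- rnorm(100) #hypothetical standardised "temperature"
wind <- 0.8 * temp + rnorm(100, sd = 0.4) #hypothetical correlated "wind speed"
toy.pca <- prcomp(cbind(temp, wind), scale. = TRUE)
summary(toy.pca) #PC1 carries most of the variance
toy.pca$rotation #both variables load on PC1, so it summarises their shared trend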
While PCA provides an intuitive way to analyse the data, Bartlett’s test of sphericity will first be used to ensure that this method is suitable. This test examines whether the variables are unrelated and thus unsuitable for the detection of an inherent structure (IBM, 2018). A p-value below the conventional 0.05 threshold indicates that the correlations between the variables are strong enough for PCA to be meaningful; the result below (a p-value of effectively zero) therefore indicates that PCA can be used with this dataset.
#code adapted from:
#https://www.r-bloggers.com/naive-principal-component-analysis-using-r/
require(psych)
test.res <- cortest.bartlett(proper[, 1:41])
print(paste("P-value","=",test.res$p.value))
## [1] "P-value = 0"
The data will first be scaled so that the PCA is run appropriately on the underlying dataset. Scaling gives every variable unit variance, so that variables measured in different units and on different scales contribute comparably to the analysis, instead of the components being dominated by the variables with the largest numerical ranges.
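As a brief illustration of what scaling does, the sketch below (which assumes the ‘proper’ data frame built above) checks that each scaled column has a standard deviation of one, so that the measurement units no longer dominate the analysis.
#Sketch: verify that scaling gives every variable unit variance
scaled.check <- scale(proper[, 1:41])
round(apply(scaled.check, 2, sd)[1:5], 2) #first few standard deviations, all equal to 1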
#Below code chunks adapted from:
#https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance
#https://www.r-bloggers.com/pca-and-k-means-clustering-of-delta-aircraft/
#http://www.sthda.com/english/wiki/print.php?id=207
#https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html
PCA.complete <- prcomp(proper[, 1:41], scale. = TRUE)
The plots below, known as scree plots, illustrate the variation explained by each principal component; they can also be used to show how the eigenvalues (scaling factors) decrease across the principal components. The scree plot is used to select the number of components that will be carried forward as a new, smaller set of variables in the later stages of the analysis.
From the scree plot with the “Percentage of explained variances” vertical axis, it can be observed that about 12 components explain about 85% of the variation in the data, i.e. 12 components retain about 85% of the information that the original data held. Alternatively, on the scree plot with the “Eigenvalue” vertical axis, the Kaiser criterion specifies that only components with an eigenvalue greater than one should be retained, which leads to only eight components being selected.
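The figures quoted above can also be read directly from the PCA object; the short sketch below (assuming PCA.complete from the previous chunk) prints the cumulative proportion of variance explained and counts the eigenvalues above one for the Kaiser criterion.
#Sketch: numbers behind the scree plots
var.explained <- PCA.complete$sdev^2 / sum(PCA.complete$sdev^2)
round(cumsum(var.explained)[1:18], 3) #cumulative proportion of variance explained
sum(PCA.complete$sdev^2 > 1) #Kaiser criterion: number of eigenvalues greater than one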
library(factoextra)
library(FactoMineR)
# Visualize variances through screeplot
fviz_screeplot(
PCA.complete,
choice = "variance",
addlabels = TRUE,
ylim = c(0, 50),
barfill = 'coral1',
main = "Screeplot for Weather Dataset PCA covering 95% of the variation",
xlab = "Principal Components",
ggtheme = theme_classic(),
ncp = 21
)
# Visualize eigenvalues through screeplot
fviz_screeplot(
PCA.complete,
choice = "eigenvalue",
addlabels = TRUE,
ylim = c(0, 50),
barfill = 'coral1',
main = "Screeplot for Weather Dataset PCA",
xlab = "Principal Components",
ggtheme = theme_classic(),
ncp = 21
)
The plot below (also known as a biplot) illustrates the first two principal components resulting from the PCA on the dataset. The biplot depicts the variables under study in such a way that it can be inferred, from their relative locations on the plot, whether variables are similar to each other. For instance, “Wind Speed daily min [80 m above gnd]” and “Wind Speed daily min [900 mb]” appear quite similar, given their overlapping locations on the plot. Additionally, the colour indicates the contribution of each variable to the principal components, with “Wind Speed daily min [900 mb]” being one of the highest contributors and “Sunshine Duration daily sum [sfc]” one of the weakest.
library(factoextra)
library(FactoMineR)
#Make graph of variables - PCA biplot
fviz_pca_var(
PCA.complete,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
)
The graph below depicts how well individual observations are represented on the principal components. The cos2 measure (squared cosine of an observation on a component) indicates the quality of this representation on a scale of zero to one, where one is a perfect representation (Abdi and Williams, 2010). The plot shows that individuals close to the centre are the least well represented on the first two principal components, while the best-represented observations lie towards the bottom left (highlighted in bright red).
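The cos2 values underlying this plot can also be inspected directly; the sketch below (assuming PCA.complete from above and the get_pca_ind helper in factoextra) extracts the representation quality of the first few observations on the first two principal components.
#Sketch: inspect cos2 values for individual observations
ind.cos2 <- factoextra::get_pca_ind(PCA.complete)$cos2
head(round(ind.cos2[, 1:2], 3)) #quality of representation on PC1 and PC2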
#code adapted from: http://www.sthda.com/english/wiki/print.php?id=204#cos2-quality-of-the-representation-for-individuals-on-the-principal-components
#make graph of individuals
fviz_pca_ind(PCA.complete, col.ind = "cos2") +
scale_color_gradient2(
low = "white",
mid = "blue",
high = "red",
midpoint = 0.50
) + theme_minimal()
The primary aim of this analysis is to identify whether seasons can be profiled based on a combination of climatic occurrences. The plots below, which colour the observations by season under the two different definitions across the first two principal components, show how the seasons are distributed across the PCs.
#Score plots of the observations coloured by season, using the 'proper' data frame
library(ggfortify)
#PCA plot for meteorological seasons
autoplot(PCA.complete, data = proper, colour = 'Met_def')
#PCA plot for astronomical seasons
autoplot(PCA.complete, data = proper, colour = 'Astro_def')
#Note: an elbow in the scree plot could also suggest retaining only the first four PCs
It is important to bear in mind that PCA displays trends in the data through the use of new axes (the principal components). These patterns can be interpreted by examining the rotations (or loadings) associated with each variable entered into the PCA, to identify which variables have the greatest influence on the principal components. To do so, a measure of influence known as the equilibrium contribution (one divided by the square root of the number of variables) will be calculated below and used as a benchmark: variables whose loadings exceed it in absolute value are taken to drive the component.
#Equilibrium contribution computation
equilibrium <- 1/sqrt(ncol(proper[,1:41]))
print(equilibrium)
## [1] 0.1561738
#Variables whose PC1 loadings exceed the equilibrium contribution in absolute value
which((PCA.complete$rotation[,1] > equilibrium)|(PCA.complete$rotation[,1]< -equilibrium))
## Total Cloud Cover daily mean [sfc]
## 7
## Medium Cloud Cover daily mean [mid cld lay]
## 9
## Low Cloud Cover daily mean [low cld lay]
## 10
## Sunshine Duration daily sum [sfc]
## 11
## Shortwave Radiation daily sum [sfc]
## 12
## Wind Speed daily mean [10 m above gnd]
## 13
## Wind Speed daily mean [80 m above gnd]
## 15
## Wind Speed daily mean [900 mb]
## 17
## Wind Gust daily mean [sfc]
## 19
## Wind Speed daily max [10 m above gnd]
## 34
## Wind Speed daily min [10 m above gnd]
## 35
## Wind Speed daily max [80 m above gnd]
## 36
## Wind Speed daily min [80 m above gnd]
## 37
## Wind Speed daily max [900 mb]
## 38
## Wind Speed daily min [900 mb]
## 39
## Wind Gust daily max [sfc]
## 40
## Wind Gust daily min [sfc]
## 41
##first PC suggests that wind speed and cloud cover are important separators
which((PCA.complete$rotation[,2] > equilibrium)|(PCA.complete$rotation[,2]< -equilibrium))
## Relative Humidity daily mean [2 m above gnd]
## 3
## Total Cloud Cover daily mean [sfc]
## 7
## High Cloud Cover daily mean [high cld lay]
## 8
## Medium Cloud Cover daily mean [mid cld lay]
## 9
## Low Cloud Cover daily mean [low cld lay]
## 10
## Sunshine Duration daily sum [sfc]
## 11
## Shortwave Radiation daily sum [sfc]
## 12
## Wind Speed daily mean [10 m above gnd]
## 13
## Wind Speed daily mean [80 m above gnd]
## 15
## Wind Speed daily mean [900 mb]
## 17
## Wind Gust daily mean [sfc]
## 19
## Relative Humidity daily max [2 m above gnd]
## 22
## Relative Humidity daily min [2 m above gnd]
## 23
## Total Cloud Cover daily min [sfc]
## 27
## Medium Cloud Cover daily min [mid cld lay]
## 31
## Low Cloud Cover daily min [low cld lay]
## 33
## Wind Speed daily max [10 m above gnd]
## 34
## Wind Speed daily max [80 m above gnd]
## 36
## Wind Speed daily min [80 m above gnd]
## 37
## Wind Speed daily min [900 mb]
## 39
## Wind Gust daily max [sfc]
## 40
## Wind Gust daily min [sfc]
## 41
##second PC suggests the same, in addition to humidity, sunshine and radiation
The comparison against the equilibrium contribution above indicates that the variables associated with wind and cloud cover are the driving forces behind the dominant trend in the data. It can thus be hypothesised that these variables have a significant influence on the underlying seasonal patterns.
As an extension to PCA, clustering methods have been considered to better profile seasons by organising the observations into sensible intrinsic groupings, or clusters (Jain, 2010). This would allow other underlying structures in the data to be identified and help to streamline the data further. To this end, two classes of clustering methods have been considered, namely hierarchical and partition clustering.
Partition methods assign each data point to exactly one of a pre-specified number of clusters. Hierarchical clustering, by contrast, repeatedly merges the samples (or clusters) that are closest in terms of a chosen distance measure, and clusters are built according to how proximity between groups is defined: clusters can be joined on the basis of their closest points (single linkage), their furthest points (complete linkage), or the average of all pairwise distances between their points (average linkage). A small illustrative sketch is given below.
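The sketch below illustrates the distinction on a small toy dataset (simulated points, unrelated to the weather data): K-means partitions the points into a pre-specified number of clusters, while hierarchical clustering builds clusters from Euclidean distances under the three linkage rules just described.
#Toy illustration of partition vs hierarchical clustering (sketch)
set.seed(2)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 3), ncol = 2)) #two well-separated groups of points
kmeans(toy, centers = 2)$cluster #partition method: each point assigned to one of two clusters
d <- dist(toy) #Euclidean (straight-line) distances between points
cutree(hclust(d, method = "single"), k = 2) #join clusters via their closest points
cutree(hclust(d, method = "complete"), k = 2) #join clusters via their furthest points
cutree(hclust(d, method = "average"), k = 2) #join clusters via average pairwise distance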
The clustering algorithms applied in the analysis are K-means clustering (a partition method) and average-linkage hierarchical clustering, with K-means being used primarily because of its lower computational load. Additionally, since meteorological data do not typically have an inherent hierarchical structure (unlike, say, biological data, which would justify the use of a dendrogram), K-means clustering appears to be the more appropriate choice.
As indicated by Agarwal (2011), the leading challenges when applying clustering methods lie in determining the desired number of clusters and choosing the distance measure. The clustering methods will be applied to the scores of the first 12 principal components, since these explain about 85% of the variation in the data. For this report, the distance measure used is the Euclidean distance (the straight-line distance between two points), and the number of clusters was chosen using the following three methods: the number of principal components explaining about 95% of the variation, the elbow method, and the variance ratio criterion.
Based on the PCA results, 18 principal components explain about 95% of the variation in the data. As a first choice, K-means clustering will therefore be run with 18 clusters and projected onto the first three principal components as follows:
#retain the scores of the first 12 principal components (about 85% of the variance explained)
chosen.comp <- data.frame(PCA.complete$x[, 1:12])
#Kmeans clustering for 18 clusters based on 95% variance explained
k1 <- kmeans(chosen.comp, 18, nstart = 25, iter.max = 1000)
# library(RColorBrewer)
# library(scales)
# palette(alpha(brewer.pal(9, 'Set1'), 0.5))
# plot(chosen.comp, col = k1$clust, pch = 16)
#3d scatter plot for 18 clusters
library(plotly)
Scatter.18 <-
plot_ly(
chosen.comp,
x = ~ PC1,
y = ~ PC2,
z = ~ PC3,
color = ~ k1$clust,
colors = c('#BF382A', '#0C4B8E')
) %>%
add_markers() %>%
layout(scene = list(
xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')
))
Scatter.18
#table for astro_def
table(k1$cluster, proper$Astro_def)
##
## Astro_autumn Astro_spring Astro_summer Astro_winter
## 1 99 35 3 216
## 2 325 107 31 399
## 3 185 124 66 267
## 4 198 43 14 276
## 5 402 160 27 289
## 6 165 395 394 130
## 7 45 153 337 5
## 8 14 2 0 10
## 9 84 175 32 134
## 10 149 98 35 209
## 11 167 45 1 209
## 12 122 300 350 99
## 13 103 373 512 18
## 14 172 468 573 28
## 15 161 255 104 123
## 16 98 261 559 29
## 17 72 44 19 42
## 18 409 31 12 521
#table for Met_def
table(k1$cluster, proper$Met_def)
##
## Met_autumn Met_spring Met_summer Met_winter
## 1 49 67 1 236
## 2 220 162 32 448
## 3 152 169 54 267
## 4 132 76 17 306
## 5 367 206 16 289
## 6 250 338 383 113
## 7 104 113 323 0
## 8 7 5 0 14
## 9 78 196 33 118
## 10 113 127 35 216
## 11 122 88 0 212
## 12 201 290 334 46
## 13 215 295 495 1
## 14 286 336 609 10
## 15 164 232 123 124
## 16 188 191 560 8
## 17 65 49 18 45
## 18 290 96 3 584
The elbow method examines the within-groups sum of squares (WSS), which reflects the percentage of variance explained, as a function of the number of clusters; the number of clusters is chosen at the point beyond which adding a further cluster no longer leads to a meaningful improvement (Bholowalia and Kumar, 2014). The plot below shows the WSS for 1 to 12 clusters and indicates that beyond 2 clusters the improvement with each new cluster is minimal.
#Below code adapted from:
#https://www.r-bloggers.com/pca-and-k-means-clustering-of-delta-aircraft/
#plot wss for 12 clusters for numeric data and choose cluster based on elbow method
wss <- (nrow(proper[, 1:41]) - 1) * sum(apply(proper[, 1:41], 2, var))
for (i in 2:12)
wss[i] <- sum(kmeans(proper[, 1:41],
centers = i)$withinss)
plot(1:12,
wss,
type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
axis(side = 1, at = 1:12)
abline(h = 6.394687e+15, col = "red", lty = 3, lwd = 3)
abline(v = 2, col = "red", lty =3, lwd = 3)
Consequently, a 3d scatter plot across the first three principal components will be constructed to illustrate the 2 clusters determined by the elbow method.
#Assuming 2 as elbow based on above graph
elbow.comp <- data.frame(PCA.complete$x[, 1:12])
#Kmeans clustering for 2 clusters based on elbow method
k <- kmeans(elbow.comp, 2, nstart = 25, iter.max = 1000)
# library(RColorBrewer)
# library(scales)
# palette(alpha(brewer.pal(9, 'Set1'), 0.5))
# plot(elbow.comp, col = k$clust, pch = 16)
library(plotly)
Scatter.2 <-
plot_ly(
elbow.comp,
x = ~ PC1,
y = ~ PC2,
z = ~ PC3,
color = ~ k$clust,
colors = c('#BF382A', '#0C4B8E')
) %>%
add_markers() %>%
layout(scene = list(
xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')
))
Scatter.2
#Table for astro_def
table(k$cluster, proper$Astro_def)
##
## Astro_autumn Astro_spring Astro_summer Astro_winter
## 1 1105 927 532 1389
## 2 1865 2142 2537 1615
#Table for Met_def
table(k$cluster, proper$Met_def)
##
## Met_autumn Met_spring Met_summer Met_winter
## 1 928 1026 532 1467
## 2 2075 2010 2504 1570
The Calinski and Harabasz index, also known as the variance ratio criterion, compares the variance between clusters with the variance within clusters (De Amorim and Hennig, 2015). Applying this measure to determine the optimal number of clusters leads to 15 clusters being chosen for the K-means method and 2 clusters for the average-linkage hierarchical method. The associated plot for the K-means is found below, and the results for the hierarchical clustering are in the appendix.
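For reference, the variance ratio criterion can be computed directly from a fitted K-means object; the sketch below (assuming the 18-cluster solution k1 and the chosen.comp scores from above) uses the between- and within-cluster sums of squares returned by kmeans, with higher values indicating better-separated clusters.
#Sketch: variance ratio (Calinski-Harabasz) criterion from a kmeans fit
n.obs <- nrow(chosen.comp)
n.clust <- length(k1$size)
(k1$betweenss / (n.clust - 1)) / (k1$tot.withinss / (n.obs - n.clust))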
#compute kmeans optimal cluster number according to NbClust - Calinski and Harabasz's index
##15 clusters is optimal
# library(NbClust)
# nbclustering.kmeans <-
# NbClust(
# data = PCA.complete$x,
# diss = NULL,
# distance = "euclidean",
# min.nc = 2,
# max.nc = 15,
# method = "kmeans",
# index = "cindex",
# alphaBeale = 0.1
# )
#compute average-linkage optimal cluster number according to NbClust - Calinski and Harabasz's index
## 2 clusters is optimal
# nbclustering.avg <-
# NbClust(
# data = PCA.complete$x,
# diss = NULL,
# distance = "euclidean",
# min.nc = 2,
# max.nc = 15,
# method = "average",
# index = "cindex",
# alphaBeale = 0.1
# )
#Kmeans clustering for 15 clusters based on optimal index choice
k2 <- kmeans(chosen.comp, 15, nstart = 25, iter.max = 1000)
# library(RColorBrewer)
# library(scales)
# palette(alpha(brewer.pal(9, 'Set1'), 0.5))
# plot(chosen.comp, col = k1$clust, pch = 16)
#3d scatter plot for 15 clusters
Scatter.15 <-
plot_ly(
chosen.comp,
x = ~ PC1,
y = ~ PC2,
z = ~ PC3,
color = ~ k2$clust,
colors = c('#BF382A', '#0C4B8E')
) %>%
add_markers() %>%
layout(scene = list(
xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')
))
Scatter.15
#table for astro_def
table(k2$cluster, proper$Astro_def)
##
## Astro_autumn Astro_spring Astro_summer Astro_winter
## 1 244 350 200 238
## 2 145 110 43 207
## 3 100 179 33 156
## 4 177 52 2 216
## 5 113 44 4 245
## 6 141 434 594 31
## 7 14 2 0 10
## 8 119 132 51 82
## 9 219 169 145 282
## 10 364 308 206 408
## 11 279 119 49 337
## 12 49 166 377 4
## 13 409 529 460 135
## 14 471 40 16 617
## 15 126 435 889 36
#table for Met_def
table(k2$cluster, proper$Met_def)
##
## Met_autumn Met_spring Met_summer Met_winter
## 1 267 333 195 237
## 2 113 141 39 212
## 3 88 198 35 147
## 4 133 100 0 214
## 5 57 81 2 266
## 6 279 357 559 5
## 7 7 5 0 14
## 8 113 125 61 85
## 9 198 203 128 286
## 10 311 333 229 413
## 11 208 150 58 368
## 12 112 119 365 0
## 13 492 450 486 105
## 14 338 131 4 671
## 15 287 310 875 14
Based on the different clustering methods and techniques, the K-means solution with 15 clusters appears to be the preferred choice, since it maximises the variance ratio and is broadly in line with the number of principal components that explain about 90% of the variation in the data. However, it can also be observed that it does an inadequate job of separating the data cloud into distinct clusters that would represent the different seasons.
This report sought to establish a profile for each season in the region of Basel, based on the meteorological and astronomical definitions of seasons. The application of PCA and clustering revealed that the meteorological data can be reduced in dimension but cannot be used to find an intrinsic pattern, since no distinct clusters emerge.
Further research could apply other multivariate methods for dimensionality reduction (reducing the number of variables) that are robust to outliers. An inspection of Cook’s distance (code in the appendix) revealed a large number of outliers (extreme values) that have not been specifically addressed in this analysis, since the object of the report was solely to identify patterns for further investigation. Finally, another research direction would be to investigate what defines a season in Basel on the basis of the climatic data, since the meteorological definition did not appear to be a sensible fit.
#Structure of original imported data
#code adapted from: https://stackoverflow.com/questions/44200394/show-str-as-table-in-r-markdown
# suppressWarnings(library(knitr))
# suppressWarnings(library(magrittr))
# data.frame(variable = names(dat),
# classe = sapply(dat, typeof),
# first_values = sapply(dat, function(x) paste0(head(x), collapse = ", ")),
# row.names = NULL) %>%
# kable()
# #create sample of random observations
# sampled <- cleaned[sample(nrow(cleaned),200),]
# #run pairs plots for all the variables to investigate the different relations
# require(lattice)
# splom(~sampled[,5:14],varname.cex = .5)
# splom(~sampled[,15:23],varname.cex = .5)
# splom(~sampled[,24:34],varname.cex = .5)
# splom(~sampled[,35:44],varname.cex = .5)
# ########################################################################################################################################
# #Function to define dummies for astrological spring
# #Resources used from:
# #https://stackoverflow.com/questions/15659783/why-does-unlist-kill-dates-in-r
# #https://stackoverflow.com/questions/15303283/how-to-do-vlookup-and-fill-down-like-in-excel-in-r
# ########################################################################################################################################
# #Purpose: The astro.spring function calculates whether each date in a year-by-year interval
# #falls in the range of dates for the astronomical spring (20th March - 20th June)
#
# ##Required Inputs:
# #Dataframe with dates, year and spring column
#
# ##Expected output:
# #list of 0s and 1s indicating if a certain date falls into the spring range
#
# ########################################################################################################################################
#
#
# #require lubridate package for date manipulation
#
# astro.spring <- function(datframe = Merged) {
# require("lubridate")
# #create dynamic interval
# diff <- (2018 - 1984)
# #create empty vector to store dummy results
# result <- vector()
# #Create Calendar list to preserve dates
# Calendar <- list()
# #Create spring list to preserve dates
# spring.list <- list()
# for (i in 0:diff) {
# year <- 1984 + as.numeric(i)
# #spring interval - could be defined outside of function
# #function could just add years
# start.interval <-
# as.Date(paste(20, 03, year, sep = "."), format = "%d.%m.%Y")
# end.interval <-
# as.Date(paste(20,06 , year, sep = "."), format = "%d.%m.%Y")
# int1 <- interval(start.interval, end.interval)
# #break statement for possibility of year '2019'
# #Used to circumvent for end.ind
# if ((year) == 2019) {
# break
# }
# #cycle through dataframe by identifying first occurance of year
# #and the last occurance for spring (year +1 )
# start.ind <- min(which(year == Years))
# end.ind <- max(which((year) == Years))
# val1 <- datframe[start.ind:end.ind, ]
# #Preserve dates
# Calendar[[i+1]] <- val1[,1]
# #identify number of rows for interval and use as looping rule
# extent <- nrow(val1)
# #reinitialise vector to store result
# result <- vector()
# #rotate through rows of year of interest
# for (j in 1:extent) {
# val <- val1[j, 1]
# test <- val %within% int1
# ifelse(test, result[j] <- 1, result[j] <- 0)
# }
# spring.list[[i + 1]] <- result
# }
# #create dataframe by binding the spring dummy with dates
# spring.list <- as.data.frame(do.call("c",spring.list))
# Calandrier <- as.data.frame(do.call("c",Calendar))
# compiled <- cbind(Calandrier,spring.list)
# colnames(compiled)<- c("Date","spring")
# #remove occuring rows
# uniquely <- unique(compiled)
# #fully remove duplicated values over intervals by subsetting 1 and 0
# sub1 <- subset(compiled,spring == '1')
# sub2 <- subset(compiled,spring == '0')
# binded <- rbind.data.frame(sub1, sub2)
# sorted <- binded[sort.list(binded[,1]),]
# #remove duplicates in second position
# listing <- sorted[!duplicated(sorted$Date),]
# return(listing)
# }
#
# ########################################################################################################################################
# #Function to define dummies for astrological autumn
# #Resources used from:
# #https://stackoverflow.com/questions/15659783/why-does-unlist-kill-dates-in-r
# #https://stackoverflow.com/questions/15303283/how-to-do-vlookup-and-fill-down-like-in-excel-in-r
# ########################################################################################################################################
# #Purpose: The astro.autumn function calculates whether each date in a year-by-year interval
# #falls in the range of dates for the astronomical autumn (22nd September - 20th December)
#
# ##Required Inputs:
# #Dataframe with dates, year and autumn column
#
# ##Expected output:
# #list of 0s and 1s indicating if a certain date falls into the autumn range
#
# ########################################################################################################################################
#
#
# #require lubridate package for date manipulation
#
# astro.autumn <- function(datframe = Merged) {
# require("lubridate")
# #create dynamic interval
# diff <- (2018 - 1984)
# #create empty vector to store dummy results
# result <- vector()
# #Create Calendar list to preserve dates
# Calendar <- list()
# #Create autumn list to preserve dates
# autumn.list <- list()
# for (i in 0:diff) {
# year <- 1984 + as.numeric(i)
# #autumn interval - could be defined outside of function
# #function could just add years
# start.interval <-
# as.Date(paste(22, 09, year, sep = "."), format = "%d.%m.%Y")
# end.interval <-
# as.Date(paste(20, 12, year, sep = "."), format = "%d.%m.%Y")
# int1 <- interval(start.interval, end.interval)
# #break statement for possibility of year '2019'
# #Used to circumvent for end.ind
# if ((year) == 2019) {
# break
# }
# #cycle through dataframe by identifying first occurance of year
# #and the last occurance for autumn (year +1 )
# start.ind <- min(which(year == Years))
# end.ind <- max(which((year) == Years))
# val1 <- datframe[start.ind:end.ind, ]
# #Preserve dates
# Calendar[[i+1]] <- val1[,1]
# #identify number of rows for interval and use as looping rule
# extent <- nrow(val1)
# #reinitialise vector to store result
# result <- vector()
# #rotate through rows of year of interest
# for (j in 1:extent) {
# val <- val1[j, 1]
# test <- val %within% int1
# ifelse(test, result[j] <- 1, result[j] <- 0)
# }
# autumn.list[[i + 1]] <- result
# }
# #create dataframe by binding the autumn dummy with dates
# autumn.list <- as.data.frame(do.call("c",autumn.list))
# Calandrier <- as.data.frame(do.call("c",Calendar))
# compiled <- cbind(Calandrier,autumn.list)
# colnames(compiled)<- c("Date","autumn")
# #remove occuring rows
# uniquely <- unique(compiled)
# #fully remove duplicated values over intervals by subsetting 1 and 0
# sub1 <- subset(compiled,autumn == '1')
# sub2 <- subset(compiled,autumn == '0')
# binded <- rbind.data.frame(sub1, sub2)
# sorted <- binded[sort.list(binded[,1]),]
# #remove duplicates in second position
# listing <- sorted[!duplicated(sorted$Date),]
# return(listing)
# }
#
# ########################################################################################################################################
# #Function to define dummies for astrological Summer
# #Resources used from:
# #https://stackoverflow.com/questions/15659783/why-does-unlist-kill-dates-in-r
# #https://stackoverflow.com/questions/15303283/how-to-do-vlookup-and-fill-down-like-in-excel-in-r
# ########################################################################################################################################
# #Purpose: The astro.summer function calculates whether each date in a year-by-year interval
# #falls in the range of dates for the astronomical summer (21st June - 21st September)
#
# ##Required Inputs:
# #Dataframe with dates, year and Summer column
#
# ##Expected output:
# #list of 0s and 1s indicating if a certain date falls into the Summer range
#
# ########################################################################################################################################
#
#
# #require lubridate package for date manipulation
#
# astro.summer <- function(datframe = Merged) {
# require("lubridate")
# #create dynamic interval
# diff <- (2018 - 1984)
# #create empty vector to store dummy results
# result <- vector()
# #Create Calendar list to preserve dates
# Calendar <- list()
# #Create Summer list to preserve dates
# Summer.list <- list()
# for (i in 0:diff) {
# year <- 1984 + as.numeric(i)
# #Summer interval - could be defined outside of function
# #function could just add years
# start.interval <-
# as.Date(paste(21, 06, year, sep = "."), format = "%d.%m.%Y")
# end.interval <-
# as.Date(paste(21, 09, year, sep = "."), format = "%d.%m.%Y")
# int1 <- interval(start.interval, end.interval)
# #break statement for possibility of year '2019'
# #Used to circumvent for end.ind
# if ((year) == 2019) {
# break
# }
# #cycle through dataframe by identifying first occurance of year
# #and the last occurance for Summer (year +1 )
# start.ind <- min(which(year == Years))
# end.ind <- max(which((year) == Years))
# val1 <- datframe[start.ind:end.ind, ]
# #Preserve dates
# Calendar[[i+1]] <- val1[,1]
# #identify number of rows for interval and use as looping rule
# extent <- nrow(val1)
# #reinitialise vector to store result
# result <- vector()
# #rotate through rows of year of interest
# for (j in 1:extent) {
# val <- val1[j, 1]
# test <- val %within% int1
# ifelse(test, result[j] <- 1, result[j] <- 0)
# }
# Summer.list[[i + 1]] <- result
# }
# #create dataframe by binding the Summer dummy with dates
# Summer.list <- as.data.frame(do.call("c",Summer.list))
# Calandrier <- as.data.frame(do.call("c",Calendar))
# compiled <- cbind(Calandrier,Summer.list)
# colnames(compiled)<- c("Date","Summer")
# #remove occuring rows
# uniquely <- unique(compiled)
# #fully remove duplicated values over intervals by subsetting 1 and 0
# sub1 <- subset(compiled,Summer == '1')
# sub2 <- subset(compiled,Summer == '0')
# binded <- rbind.data.frame(sub1, sub2)
# sorted <- binded[sort.list(binded[,1]),]
# #remove duplicates in second position
# listing <- sorted[!duplicated(sorted$Date),]
# return(listing)
# }
# ########################################################################################################################################
# #Function to define dummies for astrological winter
# #Resources used from:
# #https://stackoverflow.com/questions/15659783/why-does-unlist-kill-dates-in-r
# #https://stackoverflow.com/questions/15303283/how-to-do-vlookup-and-fill-down-like-in-excel-in-r
# ########################################################################################################################################
# #Purpose: The astro.winter function calculates whether each date in a year-by-year interval
# #falls in the range of dates for the astronomical winter (21st December - 19th March)
#
# ##Required Inputs:
# #Dataframe with dates, year and winter column
#
# ##Expected output:
# #list of 0s and 1s indicating if a certain date falls into the winter range
#
# ########################################################################################################################################
#
# astro.winter <- function(datframe = Merged) {
# require("lubridate")
# #create dynamic interval
# diff <- (2018 - 1984)
# #create empty vector to store dummy results
# result <- vector()
# #Create Calendar list to preserve dates
# Calendar <- list()
# #Create winter list to preserve dates
# winter.list <- list()
# for (i in 0:diff) {
# year <- 1984 + as.numeric(i)
# #winter interval - could be defined outside of function
# #function could just add years
# start.interval <-
# as.Date(paste(21, 12, year, sep = "."), format = "%d.%m.%Y")
# end.interval <-
# as.Date(paste(19, 03, (year + 1), sep = "."), format = "%d.%m.%Y")
# int1 <- interval(start.interval, end.interval)
# #break statement for possibility of year '2019'
# #Used to circumvent for end.ind
# if ((year + 1) == 2019) {
# break
# }
# #cycle through dataframe by identifying first occurance of year
# #and the last occurance for winter (year +1 )
# start.ind <- min(which(year == Years))
# end.ind <- max(which((year + 1) == Years))
# val1 <- datframe[start.ind:end.ind, ]
# #Preserve dates
# Calendar[[i+1]] <- val1[,1]
# #identify number of rows for interval and use as looping rule
# extent <- nrow(val1)
# #reinitialise vector to store result
# result <- vector()
# #rotate through rows of year of interest
# for (j in 1:extent) {
# val <- val1[j, 1]
# test <- val %within% int1
# ifelse(test, result[j] <- 1, result[j] <- 0)
# }
# winter.list[[i + 1]] <- result
# }
# #create dataframe by binding the winter dummy with dates
# winter.list <- as.data.frame(do.call("c",winter.list))
# Calandrier <- as.data.frame(do.call("c",Calendar))
# compiled <- cbind(Calandrier,winter.list)
# colnames(compiled)<- c("Date","Winter")
# #remove occuring rows
# uniquely <- unique(compiled)
# #fully remove duplicated values over intervals by subsetting 1 and 0
# sub1 <- subset(compiled,Winter == '1')
# sub2 <- subset(compiled,Winter == '0')
# binded <- rbind.data.frame(sub1, sub2)
# sorted <- binded[sort.list(binded[,1]),]
# #remove duplicates in second position
# listing <- sorted[!duplicated(sorted$Date),]
# return(listing)
# }
#print PCA loadings using kable
#link for installation
#devtools::install_github("haozhu233/kableExtra")
# loadings<- data.frame((PCA.complete$rotation))
# column.name <- rownames(loadings)
# loadings1<- do.call(rbind,loadings)
# colnames(loadings1) <- column.name
# load <- as.data.frame(loadings1)
#
# library(kableExtra)
# load %>%
# knitr::kable("html") %>%
# kable_styling() %>%
# scroll_box(width = "500px", height = "200px")
#
#Code adapted from:
#https://www.r-statistics.com/2016/01/multidimensional-scaling-with-r-from-mastering-data-analysis-with-r/
#https://www.r-bloggers.com/multidimensional-scaling-mds-with-r/
#
# # configure multicore processing
# require(doParallel)
# cl <- makeCluster(detectCores())
# registerDoParallel(cl)
#
# #classic MDS - excluding 2 factor variables
#dist.matrix <- dist(proper[,1:41])
# #scaled <- cmdscale(dist.matrix)
# #plot classic MDS
# library(ggpubr)
# plot(scaled[, 1], scaled[, 2])
# text(scaled[,1], scaled[,2], labels(proper$Met_def), cex = 0.9, xpd = TRUE)
# #Non metric MDS
# library(MASS)
# nonmetric.mds <- isoMDS(dist.matrix)
#
# #plot non-metric mds
# plot(nonmetric.mds$points, type = "n")
# text(nonmetric.mds$points, labels = as.character(1:nrow(cleaned)))
#Visualising patterns in correlation matrix
# library(dplyr)
# library(ggpubr)
# res.cor <- cor(something[,1:49], method = "spearman")
# mds.cor <- (1 - res.cor) %>%
# cmdscale() %>%
# as_tibble()
# colnames(mds.cor) <- c("Dim.1", "Dim.2")
# ggscatter(mds.cor, x = "Dim.1", y = "Dim.2",
# size = 1,
# label = colnames(res.cor),
# repel = TRUE)
#Detailed cluster analysis------------------------------------------------------------------------
# #2 cluster for kmeans ------------------------
# sort(table(k$clust))
# clust <- names(sort(table(k$clust)))
# #meteorogical season cluster content
# pca.cleaned[k$clust == clust[1], 42:45]
# pca.cleaned[k$clust == clust[2], 42:45]
#
# #astrological season cluster content
# pca.cleaned[k$clust == clust[1], 45:49]
# pca.cleaned[k$clust == clust[2], 45:49]
#
# #15 clusters for kmeans ------------------------
# sort(table(k2$clust))
# clust <- names(sort(table(k2$clust)))
# #meteorogical season cluster content
# pca.cleaned[k2$clust == clust[1], 42:45]
# pca.cleaned[k2$clust == clust[2], 42:45]
#
# #astrological season cluster content
# pca.cleaned[k2$clust == clust[1], 45:49]
# pca.cleaned[k2$clust == clust[2], 45:49]
#
#
# #18 clusters for kmeans ------------------------
# sort(table(k1$clust))
# clust <- names(sort(table(k1$clust)))
# #meteorogical season cluster content
# pca.cleaned[k1$clust == clust[1], 42:45]
# pca.cleaned[k1$clust == clust[2], 42:45]
#
# #astrological season cluster content
# pca.cleaned[k1$clust == clust[1], 45:49]
# pca.cleaned[k1$clust == clust[2], 45:49]
#
## hierarchical clustering - Average----------------
# library(flashClust)
# clusters <- flashClust((dist.matrix), method = 'average')
# #plot(clusters)
# #assuming optimal according to CI index - 2
# clusterCut <- cutree(clusters, 2)
# plot(table(clusterCut, proper$Met_def))
#Check for outlier using Cook's distance
#Code heavily sourced from: http://r-statistics.co/Outlier-Treatment-With-R.html
#
# model <- lm(cleaned$`Temperature daily mean [2 m above gnd]`~. , data = cleaned[,5:52])
# cooksdist <- cooks.distance(model)
# plot(cooksdist, pch="*", cex=2, main="Influential Obs by Cooks distance") # plot cook's distance
# text(x=1:length(cooksdist)+1, y=cooksdist, labels=ifelse(cooksdist>4*mean(cooksdist, na.rm=T),names(cooksdist),""), col="coral") # add labels
# abline(h = 4*mean(cooksdist), col="blue") # add cutoff line