This paper was written to fulfill an assignment for the Unsupervised Learning course in the Data Science and Business Analytics programme at the University of Warsaw.
The paper is related to the Applied Microeconomics assignment that I am working on with my team: Elbrus Gasimov (437656) and Gunneet Singh (430075).
Laura E. R. Florencia (430985) - 2021
In this paper, we are going to talk about the “Gold Standard”. Everybody has heard the notion, whether in economics or used to describe something of high value. The notion has evolved throughout history, and even though the actual gold standard is no longer the basis of finance, we still see its traces everywhere.
I cover the Unsupervised Learning topics of Clustering and Dimension Reduction to analyse how the gold standard relates to our economy nowadays.
Here is the data from gold.org for gold prices per troy ounce* from 1978 to 2020 in several countries (national currency unit per troy ounce).
* 1 troy ounce = 31.1034768 grams. The troy ounce is a unit of measure used for weighing precious metals that dates back to the Middle Ages.
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(readxl)
library(ggplot2)
library(dplyr)
library(plotly)
library(hrbrthemes)
library(animation) #for rgl in clusterSim
library(FactoMineR)
setwd("D:\\0. DSBA - Warsaw Uni\\8 Unsupervised Learning\\Paper")
getwd()
## [1] "D:/0. DSBA - Warsaw Uni/8 Unsupervised Learning/Paper"
dataprice <- read.csv2("goldprice.csv")                # yearly gold prices, 1978-2020
datadaily <- read.csv2("dailyFluctuation.csv")         # daily price fluctuations in local currencies (raw)
datacleardaily <- read.csv2("dailyFluctuationt.csv")   # daily prices after conversion to USD (cleaned)
View(dataprice)
View(datadaily)
View(datacleardaily)
Let us look at the price fluctuation in 6 countries over 42 years. The prices were converted from local currency to USD.
# `ï..year` is the year column; its name was mangled by the file's byte-order mark when read with read.csv2
ggplot(data = dataprice, aes(x = `ï..year`, y = USD)) +
  geom_line() + ggtitle("Gold Price in the USA, 1978 - 2020")
ggplot(data = dataprice, aes(x = `ï..year`, y = EUR)) +
  geom_line() + ggtitle("Gold Price in Europe, 1978 - 2020")
ggplot(data = dataprice, aes(x = `ï..year`, y = JPY)) +
  geom_line() + ggtitle("Gold Price in Japan, 1978 - 2020")
ggplot(data = dataprice, aes(x = `ï..year`, y = GBP)) +
  geom_line() + ggtitle("Gold Price in the United Kingdom, 1978 - 2020")
ggplot(data = dataprice, aes(x = `ï..year`, y = CAD)) +
  geom_line() + ggtitle("Gold Price in Canada, 1978 - 2020")
ggplot(data = dataprice, aes(x = `ï..year`, y = CHF)) +
  geom_line() + ggtitle("Gold Price in Switzerland, 1978 - 2020")
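Since all six series share the same year axis, they can also be drawn on a single chart. The sketch below assumes the column names used above and that the tidyr package is installed (it is not loaded elsewhere in this paper).
# Reshape to long format: one row per year and currency, then plot all series together
pricelong <- tidyr::pivot_longer(dataprice, cols = c("USD", "EUR", "JPY", "GBP", "CAD", "CHF"),
                                 names_to = "currency", values_to = "price")
ggplot(pricelong, aes(x = `ï..year`, y = price, colour = currency)) +
  geom_line() +
  ggtitle("Gold Price per Troy Ounce in Six Currencies, 1978 - 2020")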
# After showing all prices from 1978 to 2020, we go deeper into clustering.
# The Elbow Curve method is helpful because it shows whether increasing the number of clusters
# contributes to separating the clusters in a meaningful way rather than only marginally.
# Another way to choose the number of clusters is NbClust: it proposes the best clustering scheme
# from the different results obtained by varying all combinations of number of clusters,
# distance measures, and clustering methods.
set.seed(42)
# Elbow plot: total within-cluster sum of squares for k = 1 to 25
fviz_nbclust(dataprice, kmeans, method = "wss", k.max = 25) + theme_minimal() +
ggtitle("The Elbow Method")
However, in this paper we did not run NbClust; the code below is shown only for reference.
set.seed(42)
# library(NbClust)  # required if this chunk is run
res.nbclust <- NbClust(dataprice, distance = "euclidean", min.nc = 2, max.nc = 25,
                       method = "complete", index = "all")
factoextra::fviz_nbclust(res.nbclust) + theme_minimal() +
  ggtitle("NbClust's optimal number of clusters")
km3 <- eclust(dataprice, "kmeans", hc_metric = "euclidean", k = 3)
fviz_cluster(km3, main = "Yearly Price of Gold - 3 Clusters")
km3$cluster
km3$silinfo
# If we see negative silhouette values, there is a problem somewhere:
# the clustering is not perfect, and those observations would fit better in a neighbouring cluster.
# We should try a different number of clusters (or a different distance) and look for a better solution.
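To make this concrete, here is a minimal sketch for inspecting the badly matched observations in the k = 3 solution; the column names follow the silinfo structure returned by eclust.
sil3 <- km3$silinfo$widths          # one row per observation: cluster, neighbor, sil_width
neg3 <- sil3[sil3$sil_width < 0, ]  # observations that sit closer to a neighbouring cluster
nrow(neg3)                          # how many observations are badly matched
head(neg3)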
km4 <- eclust(dataprice, "kmeans", hc_metric = "euclidean", k = 4)
fviz_cluster(km4, main = "Yearly Price of Gold - 4 Clusters")
km4$cluster
km4$silinfo
# We still get negative silhouette values, and even more of them than with k = 3.
km5 <- eclust(dataprice, "kmeans", hc_metric = "euclidean", k = 5)
fviz_cluster(km5, main = "Yearly Price of Gold - 5 Clusters")
km5$cluster
km5$silinfo
dim(dataprice)
head(dataprice)
dim(datadaily)
head(datadaily)
# CLARA on the raw (uncleaned) daily data
cl2 <- eclust(datadaily, "clara", k = 5)  # factoextra
summary(cl2)
fviz_cluster(cl2)
fviz_silhouette(cl2)
# Depending on the observations drawn, the silhouette may be slightly different.
fviz_cluster(cl2, geom = "point", ellipse.type = "norm")  # factoextra::
fviz_cluster(cl2, palette = c("#00AFBB", "#FC4E07", "#E7B800", "#2E9FDF", "#868686"),  # five colours for k = 5
             ellipse.type = "t", geom = "point", pointsize = 1, ggtheme = theme_classic())
fviz_silhouette(cl2)
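Because CLARA works on random subsamples of the data, the sampling effect mentioned above can be reduced by drawing more subsamples; eclust forwards extra arguments to cluster::clara, so a sketch would be:
# More subsamples give a more stable medoid solution and silhouette (at extra computational cost)
cl2_stable <- eclust(datadaily, "clara", k = 5, samples = 50, graph = FALSE)
fviz_silhouette(cl2_stable)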
# CLARA on the daily price data after converting the local currencies into USD
cl2 <- eclust(datacleardaily, "clara", k = 3)  # factoextra
summary(cl2)
fviz_cluster(cl2)
fviz_silhouette(cl2)
# We can see that from time to time the prices are scattered; the trend is neither clearly up nor down.
# There are many outliers and much uncertainty in the gold price across 42 years of observations in 19 countries.
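One crude way to substantiate the remark about outliers is to count, per currency column, how many daily observations fall outside the usual 1.5 x IQR boxplot whiskers; this sketch assumes every column of datacleardaily is a numeric price series.
# Number of boxplot outliers per currency column
sapply(datacleardaily, function(p) length(boxplot.stats(p)$out))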
# Dimension Reduction - PCA
# Normalization of the data (we do not always have to do this)
# install.packages("D:/Installer/clusterSim_0.49-2/clusterSim", repos = NULL, type = "source")
library(clusterSim)
datacleardaily.n <- data.Normalization(datacleardaily, type = "n1", normalization = "column")  # clusterSim::
datacleardaily.cov <- cov(datacleardaily.n)
datacleardaily.eigen <- eigen(datacleardaily.cov)
datacleardaily.eigen$values
head(datacleardaily.eigen$vectors)
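As a cross-check, the eigenvalues above should equal the variances of the principal components when PCA is run on the same normalized data; a minimal sketch, assuming datacleardaily.n is already centred and scaled by data.Normalization:
pca.check <- prcomp(datacleardaily.n, center = FALSE, scale. = FALSE)
head(pca.check$sdev^2)                 # variances of the principal components ...
head(datacleardaily.eigen$values)      # ... should match the eigenvalues of the covariance matrix
head(cumsum(pca.check$sdev^2) / sum(pca.check$sdev^2))  # cumulative share of variance explained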
cleardata<-datacleardaily
cleardata.pca1 <- prcomp(cleardata, center = TRUE, scale. = FALSE)  # stats::
cleardata.pca1
cleardata.pca1$rotation  # only the "rotation" part: the matrix of variable loadings
cleardata.pca2 <- princomp(cleardata)  # stats::princomp()
loadings(cleardata.pca2)
plot(cleardata.pca2)  # the same can be done with plot(cleardata.pca1)
library(factoextra)
fviz_pca_var(cleardata.pca1, col.var = "steelblue")
# Visualization of PCA
# install.packages("D:/Installer/ggfortify_0.4.11/ggfortify", repos = NULL, type = "source")
library(ggfortify)
autoplot(cleardata.pca1)
autoplot(cleardata.pca1, loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)
# Multiple Factor Analysis
# Multiple factor analysis (MFA) (J. Pagès 2002) is a multivariate data analysis method for summarizing and
# visualizing a complex data table in which individuals are described by several sets of variables
# (quantitative and/or qualitative) structured into groups. It takes into account the contribution of all
# active groups of variables to define the distance between individuals. The number of variables may differ
# from one group to the other, and the nature of the variables (qualitative or quantitative) can vary between
# groups, but the variables should be of the same nature within a given group (Abdi and Williams 2010).
# Why MFA? I chose MFA because the data mixes continuous and categorical variables.
# As for the groups, it seemed like a good idea to combine the price observations from both
# the recession years and the normal years.
# data(datacleardaily)  # not needed: datacleardaily is already loaded above
colnames(datacleardaily)
# group 1: 1978 - 1982 > recession > 1046 rows
# group 2: 1983 - 1989 > normal    > 2872 - 1046 = 1826 rows
# group 3: 1990 - 1991 > recession > 3394 - 2872 = 522 rows
# group 4: 1992 - 2000 > normal    > 5742 - 3394 = 2348 rows
# group 5: 2001        > recession > 6003 - 5742 = 261 rows
# group 6: 2002 - 2007 > normal    > 7568 - 6003 = 1565 rows
# group 7: 2008 - 2009 > recession > 8091 - 7568 = 523 rows
# group 8: 2010 - 2019 > normal    > 10699 - 8091 = 2608 rows
# group 9: 2020        > recession > 10952 - 10699 = 253 rows
library(FactoMineR)
res.mfa <- MFA(datacleardaily,
               group = c(2, 4, 3, 4, 4, 2),
               type = c("n", "s", "s", "n", "n", "n"),
               name.group = c("US", "Europe", "East Asia", "Middle-East", "South East Asia", "Aus & Afr"),
               num.group.sup = c(1, 6),
               graph = TRUE)
print(res.mfa)
library("factoextra")
eig.val <- get_eigenvalue(res.mfa)
head(eig.val)
fviz_screeplot(res.mfa)
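Beyond the plots, the group-level results can also be pulled out of the MFA object directly; a short sketch using factoextra's get_mfa_var:
group.res <- get_mfa_var(res.mfa, "group")
head(group.res$coord)    # coordinates of the variable groups on the MFA dimensions
head(group.res$contrib)  # contributions of the groups to each dimension (in %)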
fviz_mfa_var(res.mfa, "group")
# Conclusions
# Both clustering and outlier detection are important data analysis tasks.
# An extension of the k-means algorithm such as KMOR provides data clustering and outlier detection
# simultaneously. In the KMOR algorithm, two parameters n0 and γ are used to control the number of outliers.
# The parameter n0 is the maximum number of outliers the algorithm will produce regardless of the value of γ.
# For fixed n0, a larger value of γ leads to fewer outliers. The two parameters can also be estimated within
# the algorithm, for example by running the traditional k-means algorithm on the dataset to estimate n0 and γ.
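To illustrate the idea, here is a much-simplified sketch (not the published KMOR implementation): ordinary k-means followed by flagging points far from their centres, with gamma and n0 playing the roles described above; kmeans_with_outliers and its default values are hypothetical choices for this paper's yearly price data.
# Simplified illustration in the spirit of KMOR (not the published algorithm):
# run k-means, then flag points whose distance to their own centre exceeds
# gamma times the average distance, keeping at most n0 outliers.
kmeans_with_outliers <- function(x, k, gamma = 3, n0 = 50) {
  x  <- as.matrix(x)
  km <- kmeans(x, centers = k, nstart = 25)
  d  <- sqrt(rowSums((x - km$centers[km$cluster, , drop = FALSE])^2))  # distance to own centre
  flagged <- which(d > gamma * mean(d))
  if (length(flagged) > n0) flagged <- flagged[order(d[flagged], decreasing = TRUE)][seq_len(n0)]
  list(cluster = km$cluster, outliers = flagged, distance = d)
}
# Example use on the yearly price data (numeric columns only):
# res <- kmeans_with_outliers(dataprice, k = 3)
# res$outliers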
https://www.investopedia.com/terms/t/troyounce.asp, accessed 19 Dec 2020
https://www.researchgate.net/publication/331663878_Clustering_and_dimension_reduction_for_mixed_variables
https://stats.stackexchange.com/questions/288668/clustering-as-dimensionality-reduction
https://www.rug.nl/research/portal/publications/clustering-and-dimension-reduction-for-mixed-variables(83435d87-2609-46f5-96fb-ec016515490a).html
https://rpubs.com/mkhan/STDS_36103_Vignette
https://www.youtube.com/watch?v=N5gYo43oLE8
https://en.wikipedia.org/wiki/Cluster_analysis