Premise

This .html document refers to a fully reproducible R markdown document that can be downloaded form the code button in the top-right of this document. The source file is also available here

This file that can be used to go through the analysis associated with the Blue Paper 10: “A Global Habitat Strategy Framework” section.

In this supplementary, we assume some basic knowledge of the R programming language. If instruction are followed, this should be fully reproducible using R studio. For further comments on the results of this analysis please refer to the main paper, or please find Contacts to get in touch with the authors to report comments, bugs or problems.

Once R and R-studio are installed on the system, you can download the full supplementary data from: https://github.com/Fabbiologia/Blue-Paper-10.git

Beware that while all data used are open source, specific permission to reuse and publish them are needed from data providers. Credit for the use of those data should also go to the proper source listed in the table in the Methods in brief section.

The R code (v.3.6.0) was written using R-studio IDE (v.1.2.1511), as well as this document, using the following packages that can be installed in R or through R-studio using the following commands:

the package clValid available to download at https://cran.r-project.org/src/contrib/clValid_0.6-6.tar.gz and needs to be installed from source: install.packages(path_to_file, repos = NULL, type="source")

After all the packages are installed they can be loaded in R using the following commands:

## loading needed libraries after are installed
library(tidyverse)
library(readxl)
library(xlsx)
library(cluster)
library(factoextra)
library(vegan)
library(scales)
library(DT)
library(clValid)
library(cowplot)
library(ggpubr)

Methods in brief

Habitat data

To understand habitat area within each Exclusive Economic Zone (EEZ) and the proportion of each habitat that is within an MPA, we collected spatially referenced habitat data for coastal and oceanic ecosystems between February and April 2019. Data for thirty different habitats were found spanning oceanic features digitized from bathymetric data to satellite derived area estimates for many of the coastal habitats. We used a dataset that combined both EEZ area and land that was created by Sala et al., (2018) and intersected it with each habitat using ArcGIS 10.5 Desktop.

Next we calculated the area for each habitat by dissolving the resulting layer by Country and projecting it into the World Cylindrical Equal Area projection, and using the “Calculate Geometry” tool. To extract habitats that are of high conservation importance and the focus of our analysis we selected habitats that comprised of less than 0.5% of the cumulative habitat area. Seamounts did not fall under this classification, but due to their ecological importance were combined with guyots in our analysis (Rogers 1994). We used 6 habitats closely associated with the coast and 6 more closely associated with open ocean, for a total of 12 habitats. The table below lists these habitats and datasource used in the analysis. Losses and gains past the dates the data describes were not accounted for.

The February 2019 World Database of Protected Areas (UNEP-WCMC and IUCN (2019) was used to calculate the area or the number of reported locations of each habitat inside of an MPA. It needs to be clarified that being inside a MPA does not mean the habitat is protected, since the MPA objective and regulamentation might not involve the habitat at all. However, we consider that being inside an environmentally managed area should provide at least some indirect benefits to the habitat conservation.

The dataset was filtered by MPAs whose status was either designated, inscribed, adopted or established, thus removing not reported and proposed categories. The same methodology for the area calculation within each EEZ except the intersection included the filtered MPA dataset. The results were then compiled and compared to ensure that the maximum value of percent protection was 1 for each country.

Habitat Date of Data Data Type Source
Estuaries 2003 Polygon Alder (2003)
Mangroves 1997 - 2000 Polygon Giri, et al. (2011)
Saltmarsh 1973 - 2015 Points McOwen, et al. (2017)
Seagrasses 1934 - 2015 Polygon UNEP-WCMC, Short FT (2017)
Coral Reefs 1954 - 2018 Polygon UNEP-WCMC, WorldFish Centre, WRI, TNC (2018)
Kelp NA Point Jorge Assis (submitted for publication)
Cold Corals 1915 - 2014 Point Freiwald A (2017)
Sills 1950-2009 Polygon Harris et al. (2014)
Seamounts/Guyots 1950-2009 Polygon Harris et al. (2014)
Bridges 1950-2009 Polygon Harris et al. (2014)
Rift Valleys 1950-2009 Polygon Harris et al. (2014)
Hydrothermal Vents 1994-2019 Point Beaulieu, S.E., Szafranski, K. (2019)

The Global Habitat Strategy Framework

In this section, we applied a global marketing strategy to conservation. This strategy is based on the segmentation of a complex group of objects characterized by some attributes. Here, objects are countries’ Exclusive Economic Zones (EEZs), and the attributes are reported in the following table:

! Categories Attributes Coded as Description Source
Population Population growth pop_growth Population growth (annual %) derived from total population. World Bank Open data: https://data.worldbank.org/
! Population Population living less than 5m above sea level pop_coastal Population living in areas where elevation is below 5 meters (% of total population) World Bank Open data: https://data.worldbank.org/
Population Poverty poverty Percent of population living under $1.90 World Bank Open data: https://data.worldbank.org
Consumption Greenhouse gasses emissions emissions Total greenhouse gas emissions (kt of CO2 equivalent) World Bank Open data: https://data.worldbank.org/
Consumption Gross Domestic Product per capita GDP_capita GDP per capita (current USD) World Bank Open data: https://data.worldbank.org/
Consumption Gross Domestic Product total GDP_tot GDP per capita (current USD) World Bank Open data: https://data.worldbank.org/
Environment Pressures on the Marine Environment pressures The ecological and social factors that decrease health status Ocean Health Index: http://www.oceanhealthindex.org/
Environment Amount of area protected protected Overlap between protected area and the target habitat This paper
Government Corruption corruption Corruption score index for each country Worldwide governance indicators: https://datacatalog.worldbank.org
Government Regulatory quality reg_qual Regulatory Quality captures perceptions of the ability of the government to formulate and implement sound policies and regulations Worldwide governance indicators: https://datacatalog.worldbank.org
Government International cooperation int_coop Number of international treaties signed by a country This paper

We used the attributes in the table above to create an ecosocial dataset that can be reproduced using the code below in the Data wrangling section.

The segmentation can be fully reproduced and is further commented in the Segmentation process section.

To achieve the segmentation we used an unsupervised clustering technique. These analysis are designed to categorise observations into a number of different groups (“clusters”), with each being relatively similar based on their values for a range of different factors.

First we scaled our attributes, then we calculated an euclidean distance matrix. Then we used the Ward’s hierarchical cluster, that minimise the total variance between samples within each cluster. Then merge clusters successively so as to minimise the increase in the Ward’s distance. The process continues until there’s just one cluster containing all the observations.

Since we had no a priori hypothesis in on how many segments (clusters) countries should be separated in, the advantage of the unsupervised technique is that it is independent to a previous choice of the number of clusters. Therefore, the choice of how many segments countries will be separated can be based upon the usability and usefulness of the corresponding groupings.

Nevertheless, we tested with different cluster validation techniques, which method and number of clusters better represent data. This process can be reproduced in the Cluster validation section.

Then, we grouped countries according to their segments and calculated the average score for each attribute and plotted it in a bar-plot that are represented in Figure 1 and 2 in main text. These can be fully reproduced in the Results visualization section.

Data wrangling

Filtering data

Two filters were created for landlocked countries that due to an error in eez layer were included and disputed areas also were filtered out.

landlocked <- c("Afghanistan", "Andorra", "Armenia", "Austria", 
                "Azerbaijan", "Belarus", "Bhutan", "Bolivia", 
                "Botswana", "Burkina Faso", "Burundi", 
                "Central African Republic", "Chad", "Czech Republic", 
                "Ethiopia", "Hungary", "Kazakhstan", "Kosovo", "Kyrgyzstan", 
                "Laos", "Lesotho", "Liechtenstein", "Luxembourg", "Macedonia", 
                "Malawi", "Mali", "Moldova", "Mongolia", "Nepal", "Niger", "Paraguay", 
                "Rwanda", "San Marino", "Serbia", "Slovakia", "South Sudan", 
                "Swaziland", "Switzerland", "Tajikistan", "Turkmenistan", "Uganda", 
                "Uzbekistan", "West Bank", "Vatican City", "Zambia", "Zimbabwe", 
                "Gaza Strip")

disputed <- c("Area en controversia (disputed - Peruvian point of view)", 
              "Area of overlap Australia/Indonesia",
              "Conflict zone China/Japan/Taiwan", 
              "Conflict zone Japan/Russia", 
              "Conflict zone Japan/South Korea",
              "Disputed Barbados/Trinidad & Tobago", 
              "Joint regime Colombia/Jamaica", 
              "Joint regime Japan/Korea", 
              "Joint regime Nigeria/Sao Tome and Principe",  
              "Joint development area Australia/East Timor",
              "Protected zone Australia/Papua New Guinea", 
              "Disputed Kenya/Somalia",
              "Disputed Western Sahara/Mauritania")

Loading data files

All data sources are stored in the data/sources/ folder, some have been manually modified from the original downloaded version, that can be checked in the folder data/sources/raw. The final dataset that is going to be used in further analysis is the data/BP-10-Dataset.xlsx file.

EEZ and habitat data

### EEZ----
eez <- read.csv("data/sources/EEZ_marine_area.csv") %>% 
      select(Country, EEZ_area=Area)%>% 
      filter(! Country %in% disputed) %>% 
      filter(! Country %in% landlocked)


## Habitat data ----

hab <- read_excel("data/sources/Area of Habitats by Country.xlsx") %>% 
      filter(! Country %in% disputed) %>% 
      filter(! Country %in% landlocked)


eez_hab <- merge(eez, hab, all.x = T) %>% 
      replace(is.na(.), 0)


## Saving file ----

write.xlsx(eez_hab, "data/BP-10-Dataset.xlsx", sheetName = "eez_hab", 
           col.names = TRUE, row.names = F, append = FALSE)

Ecosocial data

## World Bank Data ----
social <- readxl::read_excel("data/sources/WorldBankData.xlsx")

### OHI----

OHI <- read_excel("data/sources/OHI/hab_pressures_output.xls") %>% 
      select(-c("OBJECTID", "FREQUENCY")) %>% 
      group_by(Country) %>% 
      gather("year", "pressures", SUM_w2012:SUM_w2018) %>% 
      summarise(pressures_hab = mean(pressures))

OHI <- OHI %>% 
      filter(! Country %in% disputed) %>% 
      filter(! Country %in% landlocked)


### WGI indicators ----

WGI <- read_excel("data/sources/WGI_estimate.xlsx") %>% 
      group_by(Country) %>% 
      spread(Indicator, WGI) %>% 
      select(Country, Regulatory_quality, Corruption)


### Internationa agreements edited ----

ICS <- read_excel("data/sources/International_coop.xlsx")

Now it is possible to merge all these datasets. The independent file loading was a choice made to clarify the process and link directly to data sources that can be worked and evaluated independently.

df <- merge(eez, OHI, all.x = T)
df <- merge(df, WGI, all.x = T)
df <- merge(df, social, all.x=T)
df <- merge(df, ICS, all.x = T)

df <- cbind(df, eez_hab[16:ncol(eez_hab)])
names(df)
##  [1] "Country"                  "EEZ_area"                
##  [3] "pressures_hab"            "Regulatory_quality"      
##  [5] "Corruption"               "Pop_growth"              
##  [7] "GDP_capita"               "GDP_tot"                 
##  [9] "Poverty"                  "Pop_coastal"             
## [11] "Greenhouse_gas"           "International_coop"      
## [13] "Coral_Reefs_Parea"        "Estuaries_Parea"         
## [15] "Mangroves_Parea"          "Seagrasses_Parea"        
## [17] "Kelp_Parea"               "Saltmarsh_Parea"         
## [19] "Hydrothermal_vents_Parea" "Cold_Coral_Parea"        
## [21] "Bridges_Parea"            "Sills_Parea"             
## [23] "Rift_Valleys_Parea"       "Seamounts_Guyots_Parea"
names(df) <- c("Country", "EEZ_area", 
               "pressures",
               "reg_qual",
               "corruption",
               "pop_growth",
               "GDP_cap",
               "GDP_tot",
               "poverty",
               "pop_coast",
               "emissions",
               "int_coop",
               "Coral_Reefs_Parea",
               "Estuaries_Parea",
               "Mangroves_Parea",
               "Seagrasses_Parea", 
               "Kelp_Parea", 
               "Saltmarsh_Parea",
               "Hydrothermal_vents_Parea",
               "Cold_Coral_Parea",
               "Bridges_Parea", 
               "Sills_Parea",
               "Rift_Valleys_Parea",
               "Seamounts_Guyots_Parea")

write.xlsx(df, "data/BP-10-Dataset.xlsx",
           sheetName="ecosocial", append=TRUE, row.names = F)

Note that the last code append data into the excel sheet “ecosocial”.


Segmentation process

The segmentation process, here applied to conservation, is based on a marketing strategy to simplify the complexity of target audience to which a product need to be targeted. The world is very heterogeneous, therefore global corporations realized that to adapt to this complexity they needed to divide and conquer different types of markets.

In marketing, segmentations is a process that helps marketers narrow their focus on the most promising groups within the universe they are targeting. There is no single correct way to segment a market. Defining a target consumer base can be performed using a variety of segmentation methods. One can apply a combination of these methods to provide greater insight to their target markets. Corporations usually start creating geo-clusters, which combine geographic, demographic and behavioural data.

In this context, conservation might aim at doing the same thing. International conservation strategies can benefit from this approach to create new strategies to apply to this complex world.

In the following sections we present an fully reproducible example of the segmentation process using the data available.

In the previous Data wrangling section we showed how the dataset was created. Here we focus on describing the process of the segmentation.


First we load the habitat data and we assign the name df that stand for data.frame.

# loading data
df <- read_excel("data/BP-10-Dataset.xlsx")

Then we upload the ecosocial data from the same xlsx file but in the other sheet (note the sheet argument in the read_excel function).

ecosocial <- read_excel("data/BP-10-Dataset.xlsx", sheet = "ecosocial")

From the original ecosocial data we want to change two things. First we invert the Corruption value. In the original format, corruption is represented with positive values when it is low, while with negative values when it is high. This was counter-intuitive for the way we wanted to interpret data, so we multiplied it *-1.

ecosocial$corruption <- ecosocial$corruption*-1 #this inverts Corruption values

Relative habitat importance index

Here we calculate the relative importance index following this formula:

\(\frac{(habitat-area-in-a-country / Total-habitat-area)*100}{(Country-EEZ-area / World-EEZ-area)*100}\)

That can be calculated with the following code:

## Area standardization x EEZ ----

## formula to calculate percentage
perc <- function(x){
   (x/sum(x))*100
}

framework <- df %>% 
   select(Country:Seamounts_Guyots) %>% 
   mutate_at(vars(EEZ_area:Seamounts_Guyots), perc) %>% ## percentage contribution
   mutate_at(vars(MPA:Seamounts_Guyots), funs(./EEZ_area)) ## EEZ area standardization
   

write.xlsx(as.data.frame(framework), "data/BP-10-framework.xlsx", row.names = F)

Example dataset

Now that all variables needed are in place, we choose the habitat target. Here we present a study case with Mangroves and the code to reproduce Figure 2 in main text for Coral Reefs.

First we want to extract countries that have mangroves, merge this with ecosocial variables, and create the dataset we will use for the segmentation.

## Dataset creation 

MG <- framework %>% 
      select(Country, Mangroves) %>% 
      arrange(-Mangroves) %>% 
      mutate(Mangroves = round(Mangroves, 2)) %>%
      filter(Mangroves > 0) %>% 
      merge(., ecosocial) %>% 
      select(Country:int_coop, protected=Mangroves_Parea) %>% 
      mutate_if(is.numeric, round, 2)

This create the following dataset, the first columns called “Mangroves” refers to the Relative habitat importance index calculated for this habitat". Using this interactive table countries can be ordered by its value:

It is important also to account for correlation among variables. Correlated variables are not necessarily a problem for cluster analyisis. However, this can be used to reduce the amount of variables used if wanted.

### Correlation between variables ----

df_corr <- Hmisc::rcorr(as.matrix(select(MG, -Country)))

# Insignificant correlation are crossed
corrplot::corrplot.mixed(df_corr$r, tl.pos = "lt",
         p.mat = df_corr$P, sig.level = 0.05, insig = "pch", pch.cex = 2, pch.col = "grey90")

Here some of the governance data are correlated as expected. Protection of the habitat however is strongly correlated only with EEZ area. No other ecosocial attribute has a significant correlation with protection. A low but unsignificant correlation is found with international cooperations, suggesting that the effects of treaties countries signed had some, even if low, effect on the amount of the habitat protected.

Clustering

To segment countries, we used unsupervised clustering techniques. These analysis are designed to categorise observations into a number of different groups (“clusters”), with each being relatively similar based on their values for a range of different factors. Some form of distance measure are used to determine how close or far apart are different samples (countries, regions, sites etc.) are based on their attributes.

There are many flavours of clustering methods depending upon the distance measure chosen and which characteristic one wants to highlight in the data.

For example, Ward’s distance minimise the total variance between samples within each cluster. Then merge clusters successively so as to minimise the increase in the Ward’s distance. The process continues until there’s just one cluster containing all the observations.

There is a subjective element in clustering techniques. By choosing different pre-transformations of the data, different variables, different distance matrices, the results can be very different. If this from one side can be seen as a weakness, when properly applied can be a powerful exploration technique that can be applied to many situations, scales and questions to be answered. At the end of the clustering process one can review the data and identify what the members of each cluster have in common or not by summarising the characteristics of each cluster and potentially visualising these summarises as a means of comparing them.

Another source of subjectivity is the choice of the number of clusters. The number of cluster (k) can be determined by the user, with the aid of some statistical techniques called “cluster validations”. However, at the end of the day, depending on the research question, a good grouping is determined based upon the usability and usefulness of the corresponding groupings. Once the segmentation is complete, and further knowledge is obtained, other predictive statistical techniques can be used to test the results of the strategy.

Data scaling

In our study case, the first thing we need to do is scaling the variables, this is done to account for the different units, and the different origins all these variables have. If accounted raw, some variables would have much more weight than others only because have different scale and not because of a true effect of that particular variable.

# Scaling

df_scl <- scale(MG[4:ncol(MG)]) 
# the part in parenthesis is only to exclude non numerical columns


# this creates new raw names for the dataframe and 
#the vegan::makecepnames is to shorten some country names
rownames(df_scl) <- make.cepnames(MG$Country) 

This is how the dataset looks:

datatable(df_scl)

There are many NAs in this dataset. This is because we are using Exclusive Economic Zones, while most of the available ecosocial variables are at a country level. Further, for some countries the indicators are simply not available. Unfortunately, we need to exclude those rows with NAs from our analysis. Clustering can deal with some NAs, however, a good amount of countries have a complete data set, so we decided to exclude them.

df_scl <- na.omit(df_scl) ## excluding NAs

Then the distance metric is calculated. As mentioned earlier there are a plethora of distance metrics that can be applied. These can highlight some aspect of the data and should be chosen depending on the hypothesis and the types of variables available. Here we chose the Euclidean distance. Mainly because is the default measure for many software, also because it is relatively simple to interpret and observation with high values of variables are usually clustered together as for observation with low values. This characteristic is something that makes sense when dealing with ecosocial data to create country segments.

The following code calculate the euclidean distance matrix:

dist.eucl <- dist(df_scl, method = "euclidean")

Cluster validation

If one is struggling with the many clustering options, using the clValid package, it is possible to have an aid to choose the best algorithm and also a number of cluster suggestion for each method.

This compares clustering algorithms using two cluster validation measures:

  1. Internal measures, which uses intrinsic information in the data like connectivity, silhouette coefficient and the Dunn index

these are calculated with the following code:

# this selects the methods to be tested
clmethods <- c("hierarchical","kmeans","pam") 

intern <- clValid(na.omit(df_scl), nClust = 2:10, 
                  clMethods = clmethods, validation = "internal")
summary(intern)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                                  2       3       4       5       6       7       8       9      10
##                                                                                                   
## hierarchical Connectivity   2.9290  5.8579  8.9869 13.8361 19.8730 22.2063 27.2754 39.5115 40.8810
##              Dunn           0.9152  0.7288  0.4791  0.3906  0.2552  0.2552  0.2552  0.2280  0.2306
##              Silhouette     0.6778  0.6009  0.4558  0.4104  0.3554  0.3315  0.2468  0.2974  0.2757
## kmeans       Connectivity   4.8579 12.1361 13.1361 21.3504 33.5464 34.2464 44.8226 47.1560 44.4730
##              Dunn           0.7188  0.3222  0.3906  0.2071  0.1378  0.1900  0.1445  0.1521  0.1942
##              Silhouette     0.6511  0.4346  0.4332  0.2877  0.2898  0.2974  0.2570  0.2492  0.2523
## pam          Connectivity  30.0393 39.6226 40.4929 41.7135 43.6079 43.7508 46.1798 52.0063 53.0258
##              Dunn           0.0729  0.0729  0.0729  0.0999  0.1146  0.1146  0.1826  0.1831  0.1901
##              Silhouette     0.1886  0.1335  0.1565  0.1855  0.1847  0.2218  0.2271  0.2158  0.2249
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 2.9290 hierarchical 2       
## Dunn         0.9152 hierarchical 2       
## Silhouette   0.6778 hierarchical 2

The results of the optimal scores suggests that the best methods is the hierarchical cluster with 2-3 groups.

  1. Stability measures, are a special version of internal measures which evaluate the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time.

This method include various measures:

  • The average proportion of non-overlap APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.
  • The average distance AD measures the average distance between observations placed in the same cluster under both cases (full data set and removal of one column).
  • The average distance between means ADM measures the average distance between cluster centres for observations placed in the same cluster under both cases.
  • The figure of merit FOM measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns.

These are calculated with the following code:

clmethods <- c("hierarchical","kmeans","pam")
stab <- clValid(na.omit(df_scl), nClust = 2:6, clMethods = clmethods, 
                validation = "stability")
# Display only optimal Scores
optimalScores(stab)
##           Score       Method Clusters
## APN 0.002884477 hierarchical        2
## AD  2.303623538          pam        6
## ADM 0.040850499 hierarchical        2
## FOM 0.666398620 hierarchical        6

In a similar way as before the stability measures suggest the use of hierarchical methods. However, the analyst might also explore a kmeans or pam algorithm with a higher number of cluster.

Again, these should be taken as a guide that the analyst can follow with common-sense.

In our example, we chose to go along with the hierarchical clusters and in particular we use a common method called “Ward” clustering. This can be calculated and visualised with the following code:

res.ward <- hclust(dist.eucl, method = "ward.D2")

fviz_dend(res.ward, k = 4, # Cut in four k = 3 groups
          cex = 0.8,   # label size
          k_colors = "aaas"  # colour palette
)

Note the k=4 argument in the fvis_dend() function. This refers to the number of cluster the analyst can decide to represent the data with.
The advantage of the unsupervised cluster, relative to supervised ones (e.g. kmeans), is that the final cluster tree results are independent on the number of cluster chosen. Therefore, here assigning a number of clusters can be done at different levels, deciding where to “cut” the dendogram tree.

If one wants to be sure about choice of clusters, can use the following methods to further decide the “best” way to groups the objects:

p1 <- fviz_nbclust(df_scl, hcut, method = "wss") +
      geom_vline(xintercept = 4, linetype = 2)+
      labs(subtitle = "Elbow method")

# Silhouette method
p2 <- fviz_nbclust(df_scl, hcut, method = "silhouette")+
      labs(subtitle = "Silhouette method")

# Gap statistic

set.seed(123)
p3 <- fviz_nbclust(na.omit(df_scl), hcut, nstart = 25,  method = "gap_stat", nboot = 500)+
      labs(subtitle = "Gap statistic method")

plot_grid(p1,p2,p3)

These tests, confirm that the best number of groups is between 2 to 4.

Results visualization

Now that we defined our country segments, we can extract the values for each country that define those segments.

To do that we can use the following code:

# this creates cluster classes data 
clusterclass<- cutree(res.ward, k=4) 
# this create factor levels according to the groups
clusterclass_levels<- levels(factor(clusterclass)) 
# this tranforme it to a R data.frame
clusterclass <- as.data.frame(clusterclass) 
# this creates a "Country" column 
clusterclass$Country <- rownames(clusterclass) 
# this merge the cluster groups with the variables used
mangrove_clusters <- cbind(as.data.frame(clusterclass), df_scl) 

# this create a factor for the clusters
mangrove_clusters$clusterclass <- as.factor(mangrove_clusters$clusterclass) 

For the final visualization we follow these steps:

  1. we create the final cluster graph final_cluster;
  2. we create three bar plots for each cluster: mc1, mc2 and mc3;
  3. we create a final image joining all these plots using the ggarrange() function from the package ggarrange.
# final cluster
final_cluster <- fviz_dend(res.ward, 
          k = 4, # Cut in four k = 4 groups
          k_colors  = "aaas", 
          horiz = TRUE,
          cex = 0.7,
          sub = "")

# cluster 1
mc1 <- mangrove_clusters %>% 
      select(clusterclass, pressures:protected) %>%  # selecting variables
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% #rescaling
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% # changing sign
      gather("Variable", "Value", pressures:protected) %>%  # gathering 
      group_by(clusterclass, Variable) %>% ## grouping
      summarise(Value=mean(Value, na.rm=T)) %>%  # summarising 
      filter(clusterclass == 1) %>% # this filters cluster one
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) +  # plot starts
      geom_col(col="black", fill = "#EE0000FF", alpha=0.8)+ 
      ylim(-1,1)+ # this set the scale for the plot
      theme_minimal()+ # this set the graphic theme
      labs(x='', y='', title="C-3")+  
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
       coord_flip()# these are graphical options

# cluster 2
mc2 <- mangrove_clusters %>% 
   select(clusterclass, pressures:protected) %>% 
   mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% 
   mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
   gather("Variable", "Value", pressures:protected) %>% 
   group_by(clusterclass, Variable) %>% 
   summarise(Value=mean(Value, na.rm=T)) %>% 
   filter(clusterclass == 2) %>% 
   ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) + 
   geom_col(col="black", fill = "#008B45FF", alpha=0.8)+ 
   ylim(-1,1)+
   theme_minimal()+
   labs(x='', y='Score', title="C-2")+
   theme(axis.text.y = element_text(size=16), 
         axis.text.x = element_text(color = "black"),
         text=element_text(family="serif", face="bold"), 
         panel.grid = element_line(colour="gray90"), 
         legend.position = "")+
      coord_flip()

# cluster 3
mc3 <- mangrove_clusters %>% 
      select(clusterclass, pressures:protected) %>% 
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% 
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
      gather("Variable", "Value", pressures:protected) %>% 
      group_by(clusterclass, Variable) %>% 
      summarise(Value=mean(Value, na.rm=T)) %>% 
      filter(clusterclass == 3) %>% 
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) + 
      geom_col(col="black", fill = "#631879FF", alpha=0.8)+ 
      ylim(-1,1)+
      theme_minimal()+
      labs(x='', y='', title="C-1")+
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
      coord_flip()

# cluster 4
mc4 <- mangrove_clusters %>% 
      select(clusterclass, pressures:protected) %>% 
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% 
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
      gather("Variable", "Value", pressures:protected) %>% 
      group_by(clusterclass, Variable) %>% 
      summarise(Value=mean(Value, na.rm=T)) %>% 
      filter(clusterclass == 4) %>% 
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) + 
      geom_col(col="black", fill = "#3B4992FF", alpha=0.8)+ 
      ylim(-1,1)+
      theme_minimal()+
      labs(x='', y='', title="C-4")+
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
      coord_flip()

# here we create the panel: 

ggarrange(
  final_cluster,  # First row with line plot
  # Second row with box and dot plots
  ggarrange(mc3, mc2, mc1, mc4, ncol = 1), 
  nrow = 1,
  ncol = 2
  ) 

*note that the cluster numbers are changed for the visualization, so that are in the same order in the Figure in main text. This is only for visual representation and do not change the results. Names of some countries were also shortened to save space.

We can save the figure using:

ggsave("figs/Figure1.png", dpi=300)

Coral Reefs

For Coral reefs we show the code below. We followed the same methodology and framework as for Mangroves.

## Coral Reefs dataset creation 

CR <- framework %>% 
   select(Country, Coral_Reefs) %>% 
   arrange(-Coral_Reefs) %>% 
   mutate(Coral_Reefs = round(Coral_Reefs, 2)) %>% 
   filter(Coral_Reefs > 0) %>% 
   merge(., ecosocial) %>% 
   select(Country:emissions, protected=Coral_Reefs_Parea)

This create the following dataset:

datatable(CR)

Now if we want to order countries according to the relative habitat importance index here are the results:

datatable(CR %>% select(Country, Coral_Reefs) %>% arrange(-Coral_Reefs))
### Correlation between variables ----

df_corr <- Hmisc::rcorr(as.matrix(select(CR, -Country)))

# Insignificant correlation are crossed
corrplot::corrplot.mixed(df_corr$r, tl.pos = "lt",
         p.mat = df_corr$P, sig.level = 0.05, 
         insig = "pch", pch.cex = 2, pch.col = "grey90")

Interestingly, in this case, protection is negativley correlated with corruption. Even if significance is low.

Coral reefs Clustering

Data scaling

# Scaling
# the part in parenthesis is only to exclude non numerical columns
df_scl <- scale(CR[4:ncol(CR)]) 

# this creates new raw names for the dataframe and 
#the vegan::makecepnames is to shorten some country names

rownames(df_scl) <- make.cepnames(CR$Country) 


df_scl <- na.omit(df_scl) ## excluding NAs

datatable(df_scl)

Distance calculation

dist.eucl <- dist(df_scl, method = "euclidean")

Coral reefs cluster validation

# this selects the methods to be tested
clmethods <- c("hierarchical","kmeans","pam") 

intern <- clValid(na.omit(df_scl), nClust = 2:10, 
                  clMethods = clmethods, validation = "internal")
summary(intern)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                                  2       3       4       5       6       7       8       9      10
##                                                                                                   
## hierarchical Connectivity   2.9290  5.8579  8.7869 11.7159 14.8115 17.8516 20.7806 32.3925 37.4266
##              Dunn           0.8245  0.6826  0.8172  0.6569  0.5455  0.4598  0.4598  0.1850  0.1871
##              Silhouette     0.6864  0.6570  0.5867  0.4501  0.3778  0.2540  0.1604  0.2278  0.2311
## kmeans       Connectivity   9.7159 10.2159 10.7159 19.4302 26.7488 29.6341 38.8218 37.3746 40.1437
##              Dunn           0.3063  0.4902  0.6569  0.2222  0.1389  0.2452  0.1578  0.1827  0.1827
##              Silhouette     0.6163  0.5957  0.5829  0.2223  0.2659  0.1725  0.2506  0.2820  0.2556
## pam          Connectivity   2.9290  5.8579 25.0147 30.6159 32.3615 47.4492 37.6365 43.8460 50.0980
##              Dunn           0.8245  0.6826  0.0928  0.0847  0.1283  0.1130  0.1451  0.2128  0.1817
##              Silhouette     0.6864  0.6570  0.2128  0.1933  0.2109  0.1597  0.2491  0.2377  0.2015
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 2.9290 hierarchical 2       
## Dunn         0.8245 hierarchical 2       
## Silhouette   0.6864 hierarchical 2
clmethods <- c("hierarchical","kmeans","pam")
stab <- clValid(na.omit(df_scl), nClust = 2:6, clMethods = clmethods, 
                validation = "stability")
# Display only optimal Scores
optimalScores(stab)
##           Score       Method Clusters
## APN 0.004162812 hierarchical        2
## AD  2.166699805          pam        6
## ADM 0.068919641 hierarchical        2
## FOM 0.791351867 hierarchical        5
res.ward <- hclust(dist.eucl, method = "ward.D2")

fviz_dend(res.ward, k = 4, # Cut in k = 4 groups
          cex = 0.5,   # label size
          k_colors = "aaas"  # colour palette
)

In this case, pruning in 4 segments, create a cluster with only one country (Australia). This is were, for management pourposes, the analyst might choose to join those countries in one segment and treat them accordingly.

p1 <- fviz_nbclust(df_scl, hcut, method = "wss") +
      geom_vline(xintercept = 4, linetype = 2)+
      labs(subtitle = "Elbow method")

# Silhouette method
p2 <- fviz_nbclust(df_scl, hcut, method = "silhouette")+
      labs(subtitle = "Silhouette method")

# Gap statistic

set.seed(123)
p3 <- fviz_nbclust(na.omit(df_scl), hcut, nstart = 25,  method = "gap_stat", nboot = 500)+
      labs(subtitle = "Gap statistic method")

plot_grid(p1,p2,p3)

In this case, while the best clustering is suggested to be 2 or 4, we can see from the dendogram that 4 main groups can be managed in this framework and can be interesting to separate them further. The group formed by Australia, China, Japan and the USA is grouped arificially, since cutting the dendogram would split them further. However, in a management perspective, it make sense that these should be considered toghether.

Coral reefs results visualization

# this creates cluster classes data 
clusterclass<- cutree(res.ward, k=4) 
# this create factor levels according to the groups
clusterclass_levels<- levels(factor(clusterclass)) 
# this tranforme it to a R data.frame
clusterclass <- as.data.frame(clusterclass) 
# this creates a "Country" column 
clusterclass$Country <- rownames(clusterclass) 
# this merge the cluster groups with the variables used
coralreefs_clusters <- cbind(as.data.frame(clusterclass), df_scl) 
# this create a factor for the clusters
coralreefs_clusters$clusterclass <- as.factor(coralreefs_clusters$clusterclass) 

For the final visualization we follow these steps:

  1. we create the final cluster graph final_cluster;
  2. we create three bar plots for each cluster: mc1, mc2 and mc3;
  3. we create a final image joining all these plots using the ggarrange() function from the package ggarrange.
# final cluster
final_cluster <- fviz_dend(res.ward, 
          k = 4, # Cut in four k = 4 groups
          k_colors  = "aaas", 
          horiz = TRUE,
          cex = 0.7,
          sub = "")

# cluster 1
mc1 <-  coralreefs_clusters %>% 
      select(clusterclass, pressures:protected) %>%  #selecting variables
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% # rescaling
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
      gather("Variable", "Value", pressures:protected) %>% # reshape
      group_by(clusterclass, Variable) %>% ## grouping
      summarise(Value=mean(Value, na.rm=T)) %>%  # summarising 
      filter(clusterclass %in% c(1,4)) %>% 
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) +  
      geom_col(aes(fill = clusterclass), col="black",  alpha=0.8, position = "dodge2")+ #
      scale_fill_manual(values=c("#008B45FF", "#631879FF"))+
      ylim(-1,1)+ 
      theme_minimal()+ 
      labs(x='', y='', title="C-2")+ 
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
       coord_flip() 


# cluster 2
mc2 <- coralreefs_clusters %>% 
      select(clusterclass, pressures:protected) %>%  
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% 
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
      gather("Variable", "Value", pressures:protected) %>% 
      group_by(clusterclass, Variable) %>% 
      summarise(Value=mean(Value, na.rm=T)) %>%
      filter(clusterclass == 2 ) %>% 
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) +  
      geom_col(col="black", fill = "#3B4992FF", alpha=0.8)+ 
      ylim(-1,1)+ 
      theme_minimal()+
      labs(x='', y='', title="C-3")+ 
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
       coord_flip()

# cluster 3
mc3 <- coralreefs_clusters %>% 
      select(clusterclass, pressures:protected) %>% 
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>%
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
      gather("Variable", "Value", pressures:protected) %>%
      group_by(clusterclass, Variable) %>%
      summarise(Value=mean(Value, na.rm=T)) %>%
      filter(clusterclass == 3) %>%
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) +
      geom_col(col="black", fill = "#EE0000FF", alpha=0.8)+ 
      ylim(-1,1)+ 
      theme_minimal()+
      labs(x='', y='', title="C-2")+
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
       coord_flip()

# here we create the panel: 

ggarrange(
  final_cluster,  # First row with line plot
  # Second row with box and dot plots
  ggarrange(mc1, mc3, mc2, ncol = 1), 
  nrow = 1,
  ncol = 2
  ) 

*note that the cluster numbers are changed for the visualization, so that are in the same order in the Figure in main text. This is only for visual representation and do not change the results.

We can save the figure using:

ggsave("figs/Figure2.png", dpi=300)

References

Sala, E., Mayorga, J., Costello, C., Kroodsma, D., Palomares, M. L. D., Pauly, D., […] Zeller, D. (2018). The economics of fishing the high seas. Science Advances, 4(6), eaat2504. https://doi.org/10.1126/sciadv.aat2504

Habitat data references from the table

  • Estuaries: Alder J (2003). Putting the coast in the “Sea Around Us”. The Sea Around Us Newsletter 15: 1-2. URL: http://seaaroundus.org/newsletter/Issue15.pdf; http://data.unep-wcmc.org/datasets/23 (version 2.0)

  • Mangroves: Giri, C., E. Ochieng, L. L. Tieszen, Z. Zhu, A. Singh, T. Loveland, J. Masek, and N. Duke. (2011). “Status and Distribution of Mangrove Forests of the World Using Earth Observation Satellite Data: Status and Distributions of Global Mangroves.” Global Ecology and Biogeography 20 (1): 154–59. https://doi.org/10.1111/j.1466-8238.2010.00584.x.

  • Saltmarsh: Mcowen C, Weatherdon LV, Bochove J, Sullivan E, Blyth S, Zockler C, Stanwell-Smith D, Kingston N, Martin CS, Spalding M, Fletcher S (2017). A global map of saltmarshes. Biodiversity Data Journal 5: e11764. Paper DOI: https://doi.org/10.3897/BDJ.5.e11764; Data URL: http://data.unep-wcmc.org/datasets/43 (v.6)

  • Seagrasses:UNEP-WCMC, Short FT (2018). Global distribution of seagrasses (version 6.0). Sixth update to the data layer used in Green and Short (2003). Cambridge (UK): UN Environment World Conservation Monitoring Centre. URL: http://data.unep-wcmc.org/datasets/7

  • Coral Reefs: UNEP-WCMC, WorldFish Centre, WRI, TNC (2018). Global distribution of warm-water coral reefs, compiled from multiple sources including the Millennium Coral Reef Mapping Project. Version 4.0. Includes contributions from IMaRS-USF and IRD (2005), IMaRS-USF (2005) and Spalding et al. (2001). Cambridge (UK): UN Environment World Conservation Monitoring Centre. URL: http://data.unep-wcmc.org/datasets/1

  • Kelp: Jorge Assis (submitted for publication)

  • Coldcorals: Freiwald A, Rogers A, Hall-Spencer J, Guinotte JM, Davies AJ, Yesson C, Martin CS, Weatherdon LV (2017). Global distribution of cold-water corals (version 5.0). Fifth update to the dataset in Freiwald et al. (2004) by UNEP-WCMC, in collaboration with Andre Freiwald and John Guinotte. Cambridge (UK): UN Environment World Conservation Monitoring Centre. URL: http://data.unep-wcmc.org/datasets/3

  • Sills-Rift Valleys: Harris, P. T., Macmillan-Lawler, M., Rupp, J., & Baker, E. K. (2014). Geomorphology of the oceans. Marine Geology, 352, 4–24. https://doi.org/10.1016/j.margeo.2014.01.011 Hydrothermal vents: Beaulieu, S.E., Szafranski, K. (2018) InterRidge Global Database of Active Submarine Hydrothermal Vent Fields, Version 3.4. World Wide Web electronic publication available from http://vents-data.interridge.org Accessed 2019-02-20.

Contacts

---
title: 'BP-10:GHSF supplementary analysis'
author: "Fabio Favoretto, Joy Kumagai, Octavio Aburto, Alex Rogers"
date: "Last updated `r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: true
    theme: united
    code_download: true
---
<style>
body {
text-align: justify}
</style>

```{r setup, include=FALSE}
knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE
)
```


# Premise

This .html document refers to a fully reproducible R markdown document that can be downloaded form the `code` button in the top-right of this document. The source file is also available [here](https://github.com/Fabbiologia/Blue-Paper-10/blob/master/Supplementary_A_Global_Habitat_Strategy_Framework.Rmd)


This file that can be used to go through the analysis associated with the Blue Paper 10: **"A Global Habitat Strategy Framework"** section. 

In this supplementary, we assume some basic knowledge of the R programming language. If instruction are followed, this should be fully reproducible using R studio. For further comments on the results of this analysis please refer to the main paper, or please find [Contacts][Contacts] to get in touch with the authors to report comments, bugs or problems. 

- To download and install R: https://cran.r-project.org/bin/windows/base/
- To download and install R-studio: https://www.rstudio.com/products/rstudio/download/

Once R and R-studio are installed on the system, you can download the full supplementary data from: https://github.com/Fabbiologia/Blue-Paper-10.git 

Beware that while all data used are open source, specific permission to reuse and publish them are needed from data providers. Credit for the use of those data should also go to the proper source listed in the table in the [Methods in brief][Methods in brief] section. 

The R code (v.3.6.0) was written using R-studio IDE (v.1.2.1511), as well as this document, using the following packages that can be installed in R or through R-studio using the following commands: 

- `install.packages("tidyverse")`
- `install.packages("readxl")`
- `install.packages("xlsx")`
- `install.packages(cluster)`
- `install.packages(factoextra)`
- `install.packages(vegan)`
- `install.packages(scales)`
- `install.packages(DT)`
- `install.packages(cowplot)`
- `install.packages("ggpubr")`

the package `clValid` available to download at https://cran.r-project.org/src/contrib/clValid_0.6-6.tar.gz
and needs to be installed from source: `install.packages(path_to_file, repos = NULL, type="source")`


After all the packages are installed they can be loaded in R using the following commands: 
```{r}
## loading needed libraries after are installed
library(tidyverse)
library(readxl)
library(xlsx)
library(cluster)
library(factoextra)
library(vegan)
library(scales)
library(DT)
library(clValid)
library(cowplot)
library(ggpubr)
```

---

# Methods in brief

## Habitat data

To understand habitat area within each Exclusive Economic Zone (EEZ) and the proportion of each habitat that is within an MPA, we collected spatially referenced habitat data for coastal and oceanic ecosystems between February and April 2019. Data for thirty different habitats were found spanning oceanic features digitized from bathymetric data to satellite derived area estimates for many of the coastal habitats. We used a dataset that combined both EEZ area and land that was created by Sala et al., (2018) and intersected it with each habitat using ArcGIS 10.5 Desktop. 

Next we calculated the area for each habitat by dissolving the resulting layer by Country and projecting it into the World Cylindrical Equal Area projection, and using the "Calculate Geometry" tool. To extract habitats that are of high conservation importance and the focus of our analysis we selected habitats that comprised of less than 0.5% of the cumulative habitat area. Seamounts did not fall under this classification, but due to their ecological importance were combined with guyots in our analysis (Rogers 1994). We used 6 habitats closely associated with the coast and 6 more closely associated with open ocean, for a total of 12 habitats. The table below lists these habitats and datasource used in the analysis. Losses and gains past the dates the data describes were not accounted for. 

The February 2019 World Database of Protected Areas (UNEP-WCMC and IUCN (2019) was used to calculate the area or the number of reported locations of each habitat inside of an MPA. It needs to be clarified that being inside a MPA does not mean the habitat is protected, since the MPA objective and regulamentation might not involve the habitat at all. However, we consider that being inside an environmentally managed area should provide at least some indirect benefits to the habitat conservation.

The dataset was filtered by MPAs whose status was either designated, inscribed, adopted or established, thus removing not reported and proposed categories. The same methodology for the area calculation within each EEZ except the intersection included the filtered MPA dataset. The results were then compiled and compared to ensure that the maximum value of percent protection was 1 for each country. 

| Habitat |  Date of Data | Data Type | Source |
| :---    |     :----:    |   :----:  |   ---: |
| Estuaries | 2003 | Polygon | Alder (2003) |
| Mangroves	| 1997 - 2000 | Polygon | Giri, et al. (2011) | 
| Saltmarsh	| 1973 - 2015 | Points | McOwen, et al. (2017) |
| Seagrasses | 1934 - 2015 | Polygon | UNEP-WCMC, Short FT (2017) |
| Coral Reefs | 1954 - 2018 |	Polygon | UNEP-WCMC, WorldFish Centre, WRI, TNC (2018) |
| Kelp | NA	| Point | Jorge Assis (submitted for publication) |
| Cold Corals | 1915 - 2014 |	Point	| Freiwald A (2017) |
| Sills | 1950-2009 | Polygon | Harris et al. (2014) |
| Seamounts/Guyots | 1950-2009 | Polygon | Harris et al. (2014) |
| Bridges | 1950-2009 | Polygon | Harris et al. (2014) |
| Rift Valleys | 1950-2009 | Polygon | Harris et al. (2014) |
| Hydrothermal Vents | 1994-2019 | Point | Beaulieu, S.E., Szafranski, K. (2019) |



## The Global Habitat Strategy Framework

In this section, we applied a global marketing strategy to conservation. This strategy is based on the segmentation of a complex group of objects characterized by some attributes. Here, objects are countries' Exclusive Economic Zones (EEZs), and the attributes are reported in the following table: 

! Categories | Attributes      | Coded as | Description | Source     |
| :---       | :---        |    :----:   | :----: |         ---: |
| Population | Population growth      |  pop_growth      | Population growth (annual %) derived from total population. | World Bank Open data: https://data.worldbank.org/ |
! Population | Population living less than 5m above sea level   |   pop_coastal      | Population living in areas where elevation is below 5 meters (% of total population) |   World Bank Open data: https://data.worldbank.org/  |
| Population | Poverty | poverty | Percent of population living under $1.90 | World Bank Open data: https://data.worldbank.org |
| Consumption | Greenhouse gasses emissions  |  emissions   | Total greenhouse gas emissions (kt of CO2 equivalent) | World Bank Open data: https://data.worldbank.org/   |
| Consumption | Gross Domestic Product per capita | GDP_capita | GDP per capita (current USD) | World Bank Open data: https://data.worldbank.org/ | 
| Consumption | Gross Domestic Product total | GDP_tot | GDP per capita (current USD) | World Bank Open data: https://data.worldbank.org/ |
| Environment | Pressures on the Marine Environment | pressures | The ecological and social factors that decrease health status | Ocean Health Index: http://www.oceanhealthindex.org/ | 
| Environment | Amount of area protected | protected | Overlap between protected area and the target habitat | This paper |
| Government | Corruption | corruption | Corruption score index for each country | Worldwide governance indicators: https://datacatalog.worldbank.org |
| Government | Regulatory quality | reg_qual | Regulatory Quality captures perceptions of the ability of the government to formulate and implement sound policies and regulations | Worldwide governance indicators: https://datacatalog.worldbank.org |
| Government | International cooperation | int_coop | Number of international treaties signed by a country | This paper |

---


We used the attributes in the table above to create an ecosocial dataset that can be reproduced using the code below in the [Data wrangling][Data wrangling] section. 

The segmentation can be fully reproduced and is further commented in the [Segmentation process][Segmentation process] section. 

To achieve the segmentation we used an unsupervised clustering technique. These analysis are designed to categorise observations into a number of different groups ("clusters"), with each being relatively similar based on their values for a range of different factors. 

First we scaled our attributes, then we calculated an euclidean distance matrix. Then we used the Ward's hierarchical cluster, that minimise the total variance between samples within each cluster. Then merge clusters successively so as to minimise the increase in the Ward's distance. The process continues until there's just one cluster containing all the observations. 

Since we had no *a priori* hypothesis in on how many segments (clusters) countries should be separated in, the advantage of the unsupervised technique is that it is independent to a previous choice of the number of clusters. Therefore, the choice of how many segments countries will be separated can be based upon the usability and usefulness of the corresponding groupings. 

Nevertheless, we tested with different cluster validation techniques, which method and number of clusters better represent data. This process can be reproduced in the [Cluster validation][Cluster validation] section. 

Then, we grouped countries according to their segments and calculated the average score for each attribute and plotted it in a bar-plot that are represented in Figure 1 and 2 in main text. These can be fully reproduced in the [Results visualization][Results Visualization] section. 


# Data wrangling

## Filtering data

Two filters were created for landlocked countries that due to an error in eez layer were included and disputed areas also were filtered out.


```{r}
landlocked <- c("Afghanistan", "Andorra", "Armenia", "Austria", 
                "Azerbaijan", "Belarus", "Bhutan", "Bolivia", 
                "Botswana", "Burkina Faso", "Burundi", 
                "Central African Republic", "Chad", "Czech Republic", 
                "Ethiopia", "Hungary", "Kazakhstan", "Kosovo", "Kyrgyzstan", 
                "Laos", "Lesotho", "Liechtenstein", "Luxembourg", "Macedonia", 
                "Malawi", "Mali", "Moldova", "Mongolia", "Nepal", "Niger", "Paraguay", 
                "Rwanda", "San Marino", "Serbia", "Slovakia", "South Sudan", 
                "Swaziland", "Switzerland", "Tajikistan", "Turkmenistan", "Uganda", 
                "Uzbekistan", "West Bank", "Vatican City", "Zambia", "Zimbabwe", 
                "Gaza Strip")

disputed <- c("Area en controversia (disputed - Peruvian point of view)", 
              "Area of overlap Australia/Indonesia",
              "Conflict zone China/Japan/Taiwan", 
              "Conflict zone Japan/Russia", 
              "Conflict zone Japan/South Korea",
              "Disputed Barbados/Trinidad & Tobago", 
              "Joint regime Colombia/Jamaica", 
              "Joint regime Japan/Korea", 
              "Joint regime Nigeria/Sao Tome and Principe",  
              "Joint development area Australia/East Timor",
              "Protected zone Australia/Papua New Guinea", 
              "Disputed Kenya/Somalia",
              "Disputed Western Sahara/Mauritania")
```


## Loading data files

All data sources are stored in the `data/sources/` folder, some have been manually modified from the original downloaded version, that can be checked in the folder `data/sources/raw`.
The final dataset that is going to be used in further analysis is the `data/BP-10-Dataset.xlsx` file. 


### EEZ and habitat data

```{r}
### EEZ----
eez <- read.csv("data/sources/EEZ_marine_area.csv") %>% 
      select(Country, EEZ_area=Area)%>% 
      filter(! Country %in% disputed) %>% 
      filter(! Country %in% landlocked)


## Habitat data ----

hab <- read_excel("data/sources/Area of Habitats by Country.xlsx") %>% 
      filter(! Country %in% disputed) %>% 
      filter(! Country %in% landlocked)


eez_hab <- merge(eez, hab, all.x = T) %>% 
      replace(is.na(.), 0)


## Saving file ----

write.xlsx(eez_hab, "data/BP-10-Dataset.xlsx", sheetName = "eez_hab", 
           col.names = TRUE, row.names = F, append = FALSE)

```

### Ecosocial data 

```{r}
## World Bank Data ----
social <- readxl::read_excel("data/sources/WorldBankData.xlsx")

### OHI----

OHI <- read_excel("data/sources/OHI/hab_pressures_output.xls") %>% 
      select(-c("OBJECTID", "FREQUENCY")) %>% 
      group_by(Country) %>% 
      gather("year", "pressures", SUM_w2012:SUM_w2018) %>% 
      summarise(pressures_hab = mean(pressures))

OHI <- OHI %>% 
      filter(! Country %in% disputed) %>% 
      filter(! Country %in% landlocked)


### WGI indicators ----

WGI <- read_excel("data/sources/WGI_estimate.xlsx") %>% 
      group_by(Country) %>% 
      spread(Indicator, WGI) %>% 
      select(Country, Regulatory_quality, Corruption)


### Internationa agreements edited ----

ICS <- read_excel("data/sources/International_coop.xlsx")


```

Now it is possible to merge all these datasets. 
The independent file loading was a choice made to clarify the process and link directly to data sources that can be worked and evaluated independently. 


```{r}

df <- merge(eez, OHI, all.x = T)
df <- merge(df, WGI, all.x = T)
df <- merge(df, social, all.x=T)
df <- merge(df, ICS, all.x = T)

df <- cbind(df, eez_hab[16:ncol(eez_hab)])
names(df)
names(df) <- c("Country", "EEZ_area", 
               "pressures",
               "reg_qual",
               "corruption",
               "pop_growth",
               "GDP_cap",
               "GDP_tot",
               "poverty",
               "pop_coast",
               "emissions",
               "int_coop",
               "Coral_Reefs_Parea",
               "Estuaries_Parea",
               "Mangroves_Parea",
               "Seagrasses_Parea", 
               "Kelp_Parea", 
               "Saltmarsh_Parea",
               "Hydrothermal_vents_Parea",
               "Cold_Coral_Parea",
               "Bridges_Parea", 
               "Sills_Parea",
               "Rift_Valleys_Parea",
               "Seamounts_Guyots_Parea")

write.xlsx(df, "data/BP-10-Dataset.xlsx",
           sheetName="ecosocial", append=TRUE, row.names = F)
```


Note that the last code append data into the excel sheet "ecosocial". 

---

# Segmentation process

The segmentation process, here applied to conservation, is based on a marketing strategy to simplify the complexity of target audience to which a product need to be targeted. The world is very heterogeneous, therefore global corporations realized that to adapt to this complexity they needed to divide and conquer different types of markets. 

In marketing, segmentations is a process that helps marketers narrow their focus on the most promising groups within the universe they are targeting. There is no single correct way to segment a market. Defining a target consumer base can be performed using a variety of segmentation methods. One can apply a combination of these methods to provide greater insight to their target markets. Corporations usually start creating geo-clusters, which combine geographic, demographic and behavioural data. 

In this context, conservation might aim at doing the same thing. International conservation strategies can benefit from this approach to create new strategies to apply to this complex world. 

In the following sections we present an fully reproducible example of the segmentation process using the data available. 

In the previous [Data wrangling][Data wrangling] section we showed how the dataset was created. Here we focus on describing the process of the segmentation. 

---

First we load the habitat data and we assign the name `df` that stand for `data.frame`.

```{r}
# loading data
df <- read_excel("data/BP-10-Dataset.xlsx")

```

Then we upload the ecosocial data from the same `xlsx` file but in the other sheet (note the `sheet` argument in the `read_excel` function).

```{r}
ecosocial <- read_excel("data/BP-10-Dataset.xlsx", sheet = "ecosocial")
```

From the original ecosocial data we want to change two things. First we invert the Corruption value. In the original format, corruption is represented with positive values when it is low, while with negative values when it is high. This was counter-intuitive for the way we wanted to interpret data, so we multiplied it `*-1`.


```{r}
ecosocial$corruption <- ecosocial$corruption*-1 #this inverts Corruption values

```


### Relative habitat importance index

Here we calculate the relative importance index following this formula:

$\frac{(habitat-area-in-a-country / Total-habitat-area)*100}{(Country-EEZ-area / World-EEZ-area)*100}$

That can be calculated with the following code: 
```{r}
## Area standardization x EEZ ----

## formula to calculate percentage
perc <- function(x){
   (x/sum(x))*100
}

framework <- df %>% 
   select(Country:Seamounts_Guyots) %>% 
   mutate_at(vars(EEZ_area:Seamounts_Guyots), perc) %>% ## percentage contribution
   mutate_at(vars(MPA:Seamounts_Guyots), funs(./EEZ_area)) ## EEZ area standardization
   

write.xlsx(as.data.frame(framework), "data/BP-10-framework.xlsx", row.names = F)

```

---

## Example dataset

Now that all variables needed are in place, we choose the habitat target. Here we present a study case with Mangroves and the code to reproduce Figure 2 in main text for [Coral Reefs][Coral Reefs]. 

First we want to extract countries that have mangroves, merge this with ecosocial variables, and create the dataset we will use for the segmentation.  

```{r}
## Dataset creation 

MG <- framework %>% 
      select(Country, Mangroves) %>% 
      arrange(-Mangroves) %>% 
      mutate(Mangroves = round(Mangroves, 2)) %>%
      filter(Mangroves > 0) %>% 
      merge(., ecosocial) %>% 
      select(Country:int_coop, protected=Mangroves_Parea) %>% 
      mutate_if(is.numeric, round, 2)

```

This create the following dataset, the first columns called "Mangroves" refers to the [Relative habitat importance index][Relative habitat importance index] calculated for this habitat". Using this interactive table countries can be ordered by its value:

It is important also to account for correlation among variables. Correlated variables are not necessarily a problem for cluster analyisis. However, this can be used to reduce the amount of variables used if wanted. 

```{r fig.width=8, fig.height=8}
### Correlation between variables ----

df_corr <- Hmisc::rcorr(as.matrix(select(MG, -Country)))

# Insignificant correlation are crossed
corrplot::corrplot.mixed(df_corr$r, tl.pos = "lt",
         p.mat = df_corr$P, sig.level = 0.05, insig = "pch", pch.cex = 2, pch.col = "grey90")

```

Here some of the governance data are correlated as expected. Protection of the habitat however is strongly correlated only with EEZ area. No other ecosocial attribute has a significant correlation with protection. A low but unsignificant correlation is found with international cooperations, suggesting that the effects of treaties countries signed had some, even if low, effect on the amount of the habitat protected. 


### Clustering

To segment countries, we used unsupervised clustering techniques. These analysis are designed to categorise observations into a number of different groups ("clusters"), with each being relatively similar based on their values for a range of different factors. Some form of distance measure are used to determine how close or far apart are different samples (countries, regions, sites etc.) are based on their attributes. 

There are many flavours of clustering methods depending upon the distance measure chosen and which characteristic one wants to highlight in the data. 

For example, Ward's distance minimise the total variance between samples within each cluster. Then merge clusters successively so as to minimise the increase in the Ward's distance. The process continues until there's just one cluster containing all the observations. 

There is a subjective element in clustering techniques. By choosing different pre-transformations of the data, different variables, different distance matrices, the results can be very different. If this from one side can be seen as a weakness, when properly applied can be a powerful exploration technique that can be applied to many situations, scales and questions to be answered. 
At the end of the clustering process one can review the data and identify what the members of each cluster have in common or not by summarising the characteristics of each cluster and potentially visualising these summarises as a means of comparing them.

Another source of subjectivity is the choice of the number of clusters. The number of cluster (k) can be determined by the user, with the aid of some statistical techniques called "cluster validations". However, at the end of the day, depending on the research question, a good grouping is determined based upon the usability and usefulness of the corresponding groupings. Once the segmentation is complete, and further knowledge is obtained, other predictive statistical techniques can be used to test the results of the strategy. 

#### Data scaling

In our study case, the first thing we need to do is scaling the variables, this is done to account for the different units, and the different origins all these variables have. If accounted raw, some variables would have much more weight than others only because have different scale and not because of a true effect of that particular variable. 

```{r}
# Scaling

df_scl <- scale(MG[4:ncol(MG)]) 
# the part in parenthesis is only to exclude non numerical columns


# this creates new raw names for the dataframe and 
#the vegan::makecepnames is to shorten some country names
rownames(df_scl) <- make.cepnames(MG$Country) 



```


This is how the dataset looks: 


```{r}

datatable(df_scl)

```
There are many NAs in this dataset. 
This is because we are using Exclusive Economic Zones, while most of the available ecosocial variables are at a country level. Further, for some countries the indicators are simply not available. 
Unfortunately, we need to exclude those rows with NAs from our analysis. 
Clustering can deal with some NAs, however, a good amount of countries have a complete data set, so we decided to exclude them. 

```{r}

df_scl <- na.omit(df_scl) ## excluding NAs

```

Then the distance metric is calculated. 
As mentioned earlier there are a plethora of distance metrics that can be applied. These can highlight some aspect of the data and should be chosen depending on the hypothesis and the types of variables available. 
Here we chose the Euclidean distance. Mainly because is the default measure for many software, also because it is relatively simple to interpret and observation with high values of variables are usually clustered together as for observation with low values. This characteristic is something that makes sense when dealing with ecosocial data to create country segments. 

The following code calculate the euclidean distance matrix: 

```{r}
dist.eucl <- dist(df_scl, method = "euclidean")

```


### Cluster validation

If one is struggling with the many clustering options, using the `clValid` package, it is possible to have an aid to choose the best algorithm and also a number of cluster suggestion for each method. 

This compares clustering algorithms using two cluster validation measures: 

1. *Internal measures*, which uses intrinsic information in the data like connectivity, silhouette coefficient and the Dunn index

these are calculated with the following code: 
```{r}

# this selects the methods to be tested
clmethods <- c("hierarchical","kmeans","pam") 

intern <- clValid(na.omit(df_scl), nClust = 2:10, 
                  clMethods = clmethods, validation = "internal")
summary(intern)
```

The results of the optimal scores suggests that the best methods is the hierarchical cluster with 2-3 groups. 


2. *Stability measures*, are a special version of internal measures which evaluate the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time. 

This method include various measures: 

- The  average  proportion of non-overlap **APN** measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.
- The average distance **AD** measures the average distance between observations placed in the same cluster under both cases (full data set and removal of one column).
- The average distance between means **ADM** measures the average distance between cluster centres for observations placed in the same cluster under both cases.
- The figure of merit **FOM** measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns.

These are calculated with the following code: 

```{r}
clmethods <- c("hierarchical","kmeans","pam")
stab <- clValid(na.omit(df_scl), nClust = 2:6, clMethods = clmethods, 
                validation = "stability")
# Display only optimal Scores
optimalScores(stab)

```

In a similar way as before the stability measures suggest the use of hierarchical methods. However, the analyst might also explore a kmeans or pam algorithm with a higher number of cluster. 

Again, these should be taken as a guide that the analyst can follow with common-sense. 


In our example, we chose to go along with the hierarchical clusters and in particular we use a common method called "Ward" clustering. 
This can be calculated and visualised with the following code: 

```{r fig.width=10}
res.ward <- hclust(dist.eucl, method = "ward.D2")

fviz_dend(res.ward, k = 4, # Cut in four k = 3 groups
          cex = 0.8,   # label size
          k_colors = "aaas"  # colour palette
)

```

Note the `k=4` argument in the `fvis_dend()` function. This refers to the number of cluster the analyst can decide to represent the data with.  
The advantage of the unsupervised cluster, relative to supervised ones (e.g. kmeans), is that the final cluster tree results are independent on the number of cluster chosen. Therefore, here assigning a number of clusters can be done at different levels, deciding where to "cut" the dendogram tree. 

If one wants to be sure about choice of clusters, can use the following methods to further decide the "best" way to groups the objects: 

```{r}
p1 <- fviz_nbclust(df_scl, hcut, method = "wss") +
      geom_vline(xintercept = 4, linetype = 2)+
      labs(subtitle = "Elbow method")

# Silhouette method
p2 <- fviz_nbclust(df_scl, hcut, method = "silhouette")+
      labs(subtitle = "Silhouette method")

# Gap statistic

set.seed(123)
p3 <- fviz_nbclust(na.omit(df_scl), hcut, nstart = 25,  method = "gap_stat", nboot = 500)+
      labs(subtitle = "Gap statistic method")

plot_grid(p1,p2,p3)
```

**These tests, confirm that the best number of groups is between 2 to 4.**

# Results visualization

Now that we defined our country segments, we can extract the values for each country that define those segments.

To do that we can use the following code:

```{r}
# this creates cluster classes data 
clusterclass<- cutree(res.ward, k=4) 
# this create factor levels according to the groups
clusterclass_levels<- levels(factor(clusterclass)) 
# this tranforme it to a R data.frame
clusterclass <- as.data.frame(clusterclass) 
# this creates a "Country" column 
clusterclass$Country <- rownames(clusterclass) 
# this merge the cluster groups with the variables used
mangrove_clusters <- cbind(as.data.frame(clusterclass), df_scl) 

# this create a factor for the clusters
mangrove_clusters$clusterclass <- as.factor(mangrove_clusters$clusterclass) 
```

For the final visualization we follow these steps: 

1. we create the final cluster graph `final_cluster`;
2. we create three [bar plots](https://www.r-graph-gallery.com/barplot/) for each cluster: `mc1`, `mc2` and `mc3`;
3. we create a final image joining all these plots using the `ggarrange()` function from the package `ggarrange`.

```{r fig.height=12, fig.width=10}
# final cluster
final_cluster <- fviz_dend(res.ward, 
          k = 4, # Cut in four k = 4 groups
          k_colors  = "aaas", 
          horiz = TRUE,
          cex = 0.7,
          sub = "")

# cluster 1
mc1 <- mangrove_clusters %>% 
      select(clusterclass, pressures:protected) %>%  # selecting variables
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% #rescaling
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% # changing sign
      gather("Variable", "Value", pressures:protected) %>%  # gathering 
      group_by(clusterclass, Variable) %>% ## grouping
      summarise(Value=mean(Value, na.rm=T)) %>%  # summarising 
      filter(clusterclass == 1) %>% # this filters cluster one
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) +  # plot starts
      geom_col(col="black", fill = "#EE0000FF", alpha=0.8)+ 
      ylim(-1,1)+ # this set the scale for the plot
      theme_minimal()+ # this set the graphic theme
      labs(x='', y='', title="C-3")+  
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
       coord_flip()# these are graphical options

# cluster 2
mc2 <- mangrove_clusters %>% 
   select(clusterclass, pressures:protected) %>% 
   mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% 
   mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
   gather("Variable", "Value", pressures:protected) %>% 
   group_by(clusterclass, Variable) %>% 
   summarise(Value=mean(Value, na.rm=T)) %>% 
   filter(clusterclass == 2) %>% 
   ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) + 
   geom_col(col="black", fill = "#008B45FF", alpha=0.8)+ 
   ylim(-1,1)+
   theme_minimal()+
   labs(x='', y='Score', title="C-2")+
   theme(axis.text.y = element_text(size=16), 
         axis.text.x = element_text(color = "black"),
         text=element_text(family="serif", face="bold"), 
         panel.grid = element_line(colour="gray90"), 
         legend.position = "")+
      coord_flip()

# cluster 3
mc3 <- mangrove_clusters %>% 
      select(clusterclass, pressures:protected) %>% 
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% 
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
      gather("Variable", "Value", pressures:protected) %>% 
      group_by(clusterclass, Variable) %>% 
      summarise(Value=mean(Value, na.rm=T)) %>% 
      filter(clusterclass == 3) %>% 
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) + 
      geom_col(col="black", fill = "#631879FF", alpha=0.8)+ 
      ylim(-1,1)+
      theme_minimal()+
      labs(x='', y='', title="C-1")+
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
      coord_flip()

# cluster 4
mc4 <- mangrove_clusters %>% 
      select(clusterclass, pressures:protected) %>% 
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% 
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
      gather("Variable", "Value", pressures:protected) %>% 
      group_by(clusterclass, Variable) %>% 
      summarise(Value=mean(Value, na.rm=T)) %>% 
      filter(clusterclass == 4) %>% 
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) + 
      geom_col(col="black", fill = "#3B4992FF", alpha=0.8)+ 
      ylim(-1,1)+
      theme_minimal()+
      labs(x='', y='', title="C-4")+
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
      coord_flip()

# here we create the panel: 

ggarrange(
  final_cluster,  # First row with line plot
  # Second row with box and dot plots
  ggarrange(mc3, mc2, mc1, mc4, ncol = 1), 
  nrow = 1,
  ncol = 2
  ) 

```

*note that the cluster numbers are changed for the visualization, so that are in the same order in the Figure in main text. This is only for visual representation and do not change the results.
Names of some countries were also shortened to save space. 


We can save the figure using: 

```{r fig.height=12, fig.width=10}
ggsave("figs/Figure1.png", dpi=300)
```

--- 



# Coral Reefs

For Coral reefs we show the code below. We followed the same methodology and framework as for Mangroves. 

```{r}
## Coral Reefs dataset creation 

CR <- framework %>% 
   select(Country, Coral_Reefs) %>% 
   arrange(-Coral_Reefs) %>% 
   mutate(Coral_Reefs = round(Coral_Reefs, 2)) %>% 
   filter(Coral_Reefs > 0) %>% 
   merge(., ecosocial) %>% 
   select(Country:emissions, protected=Coral_Reefs_Parea)

```

This create the following dataset:

```{r}

datatable(CR)

```

Now if we want to order countries according to the relative habitat importance index here are the results: 

```{r}

datatable(CR %>% select(Country, Coral_Reefs) %>% arrange(-Coral_Reefs))

```

```{r fig.width=8, fig.height=8}
### Correlation between variables ----

df_corr <- Hmisc::rcorr(as.matrix(select(CR, -Country)))

# Insignificant correlation are crossed
corrplot::corrplot.mixed(df_corr$r, tl.pos = "lt",
         p.mat = df_corr$P, sig.level = 0.05, 
         insig = "pch", pch.cex = 2, pch.col = "grey90")

```

Interestingly, in this case, protection is negativley correlated with corruption. Even if significance is low. 


### Coral reefs Clustering


#### Data scaling


```{r}
# Scaling
# the part in parenthesis is only to exclude non numerical columns
df_scl <- scale(CR[4:ncol(CR)]) 

# this creates new raw names for the dataframe and 
#the vegan::makecepnames is to shorten some country names

rownames(df_scl) <- make.cepnames(CR$Country) 


df_scl <- na.omit(df_scl) ## excluding NAs

datatable(df_scl)

```


Distance calculation 

```{r}
dist.eucl <- dist(df_scl, method = "euclidean")

```


### Coral reefs cluster validation

```{r}
# this selects the methods to be tested
clmethods <- c("hierarchical","kmeans","pam") 

intern <- clValid(na.omit(df_scl), nClust = 2:10, 
                  clMethods = clmethods, validation = "internal")
summary(intern)
```


```{r}
clmethods <- c("hierarchical","kmeans","pam")
stab <- clValid(na.omit(df_scl), nClust = 2:6, clMethods = clmethods, 
                validation = "stability")
# Display only optimal Scores
optimalScores(stab)

```



```{r}
res.ward <- hclust(dist.eucl, method = "ward.D2")

fviz_dend(res.ward, k = 4, # Cut in k = 4 groups
          cex = 0.5,   # label size
          k_colors = "aaas"  # colour palette
)

```
In this case, pruning in 4 segments, create a cluster with only one country (Australia). This is were, for management pourposes, the analyst might choose to join those countries in one segment and treat them accordingly. 

```{r}
p1 <- fviz_nbclust(df_scl, hcut, method = "wss") +
      geom_vline(xintercept = 4, linetype = 2)+
      labs(subtitle = "Elbow method")

# Silhouette method
p2 <- fviz_nbclust(df_scl, hcut, method = "silhouette")+
      labs(subtitle = "Silhouette method")

# Gap statistic

set.seed(123)
p3 <- fviz_nbclust(na.omit(df_scl), hcut, nstart = 25,  method = "gap_stat", nboot = 500)+
      labs(subtitle = "Gap statistic method")

plot_grid(p1,p2,p3)
```

In this case, while the best clustering is suggested to be 2 or 4, we can see from the dendogram that 4 main groups can be managed in this framework and can be interesting to separate them further. The group formed by Australia, China, Japan and the USA is grouped arificially, since cutting the dendogram would split them further. However, in a management perspective, it make sense that these should be considered toghether. 


# Coral reefs results visualization

```{r}
# this creates cluster classes data 
clusterclass<- cutree(res.ward, k=4) 
# this create factor levels according to the groups
clusterclass_levels<- levels(factor(clusterclass)) 
# this tranforme it to a R data.frame
clusterclass <- as.data.frame(clusterclass) 
# this creates a "Country" column 
clusterclass$Country <- rownames(clusterclass) 
# this merge the cluster groups with the variables used
coralreefs_clusters <- cbind(as.data.frame(clusterclass), df_scl) 
# this create a factor for the clusters
coralreefs_clusters$clusterclass <- as.factor(coralreefs_clusters$clusterclass) 

```

For the final visualization we follow these steps: 

1. we create the final cluster graph `final_cluster`;
2. we create three [bar plots](https://www.r-graph-gallery.com/barplot/) for each cluster: `mc1`, `mc2` and `mc3`;
3. we create a final image joining all these plots using the `ggarrange()` function from the package `ggarrange`.

```{r fig.height=12, fig.width=10}
# final cluster
final_cluster <- fviz_dend(res.ward, 
          k = 4, # Cut in four k = 4 groups
          k_colors  = "aaas", 
          horiz = TRUE,
          cex = 0.7,
          sub = "")

# cluster 1
mc1 <-  coralreefs_clusters %>% 
      select(clusterclass, pressures:protected) %>%  #selecting variables
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% # rescaling
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
      gather("Variable", "Value", pressures:protected) %>% # reshape
      group_by(clusterclass, Variable) %>% ## grouping
      summarise(Value=mean(Value, na.rm=T)) %>%  # summarising 
      filter(clusterclass %in% c(1,4)) %>% 
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) +  
      geom_col(aes(fill = clusterclass), col="black",  alpha=0.8, position = "dodge2")+ #
      scale_fill_manual(values=c("#008B45FF", "#631879FF"))+
      ylim(-1,1)+ 
      theme_minimal()+ 
      labs(x='', y='', title="C-2")+ 
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
       coord_flip() 


# cluster 2
mc2 <- coralreefs_clusters %>% 
      select(clusterclass, pressures:protected) %>%  
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>% 
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
      gather("Variable", "Value", pressures:protected) %>% 
      group_by(clusterclass, Variable) %>% 
      summarise(Value=mean(Value, na.rm=T)) %>%
      filter(clusterclass == 2 ) %>% 
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) +  
      geom_col(col="black", fill = "#3B4992FF", alpha=0.8)+ 
      ylim(-1,1)+ 
      theme_minimal()+
      labs(x='', y='', title="C-3")+ 
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
       coord_flip()

# cluster 3
mc3 <- coralreefs_clusters %>% 
      select(clusterclass, pressures:protected) %>% 
      mutate_if(is.numeric, rescale, na.rm=T, c(0,1)) %>%
      mutate(pressures=pressures*-1, corruption=corruption*-1, 
             pop_growth=pop_growth*-1, poverty=poverty*-1,
             pop_coast=pop_coast*-1, emissions=emissions*-1) %>% 
      gather("Variable", "Value", pressures:protected) %>%
      group_by(clusterclass, Variable) %>%
      summarise(Value=mean(Value, na.rm=T)) %>%
      filter(clusterclass == 3) %>%
      ggplot(aes(x=reorder(Variable, Value), y=Value, group=1)) +
      geom_col(col="black", fill = "#EE0000FF", alpha=0.8)+ 
      ylim(-1,1)+ 
      theme_minimal()+
      labs(x='', y='', title="C-2")+
      theme(axis.text.y = element_text(size=16),
            axis.text.x = element_text(color = "black"),
            text=element_text(family="serif", face="bold"), 
            panel.grid = element_line(colour="gray90"), 
            legend.position = "")+
       coord_flip()

# here we create the panel: 

ggarrange(
  final_cluster,  # First row with line plot
  # Second row with box and dot plots
  ggarrange(mc1, mc3, mc2, ncol = 1), 
  nrow = 1,
  ncol = 2
  ) 

```

*note that the cluster numbers are changed for the visualization, so that are in the same order in the Figure in main text. This is only for visual representation and do not change the results.

We can save the figure using: 

```{r fig.height=12, fig.width=10}
ggsave("figs/Figure2.png", dpi=300)
```


# References 


Sala, E., Mayorga, J., Costello, C., Kroodsma, D., Palomares, M. L. D., Pauly, D., […] Zeller, D. 
(2018). The economics of fishing the high seas. Science Advances, 4(6), eaat2504. https://doi.org/10.1126/sciadv.aat2504

## Habitat data references from the table

- Estuaries: Alder J (2003). Putting the coast in the “Sea Around Us”. The Sea Around Us Newsletter 15: 1-2. URL: http://seaaroundus.org/newsletter/Issue15.pdf; http://data.unep-wcmc.org/datasets/23 (version 2.0) 

- Mangroves: Giri, C., E. Ochieng, L. L. Tieszen, Z. Zhu, A. Singh, T. Loveland, J. Masek, and N. Duke. (2011). “Status and Distribution of Mangrove Forests of the World Using Earth Observation Satellite Data: Status and Distributions of Global Mangroves.” Global Ecology and Biogeography 20 (1): 154–59. https://doi.org/10.1111/j.1466-8238.2010.00584.x.

- Saltmarsh: Mcowen C, Weatherdon LV, Bochove J, Sullivan E, Blyth S, Zockler C, Stanwell-Smith D, Kingston N, Martin CS, Spalding M, Fletcher S (2017). A global map of saltmarshes. Biodiversity Data Journal 5: e11764. Paper DOI: https://doi.org/10.3897/BDJ.5.e11764; Data URL: http://data.unep-wcmc.org/datasets/43 (v.6)

- Seagrasses:UNEP-WCMC, Short FT (2018). Global distribution of seagrasses (version 6.0). Sixth update to the data layer used in Green and Short (2003). Cambridge (UK): UN Environment World Conservation Monitoring Centre. URL: http://data.unep-wcmc.org/datasets/7

- Coral Reefs: UNEP-WCMC, WorldFish Centre, WRI, TNC (2018). Global distribution of warm-water coral reefs, compiled from multiple sources including the Millennium Coral Reef Mapping Project. Version 4.0. Includes contributions from IMaRS-USF and IRD (2005), IMaRS-USF (2005) and Spalding et al. (2001). Cambridge (UK): UN Environment World Conservation Monitoring Centre. URL: http://data.unep-wcmc.org/datasets/1

- Kelp: Jorge Assis (submitted for publication) 

- Coldcorals: Freiwald A, Rogers A, Hall-Spencer J, Guinotte JM, Davies AJ, Yesson C, Martin CS, Weatherdon LV (2017). Global distribution of cold-water corals (version 5.0). Fifth update to the dataset in Freiwald et al. (2004) by UNEP-WCMC, in collaboration with Andre Freiwald and John Guinotte. Cambridge (UK): UN Environment World Conservation Monitoring Centre. URL: http://data.unep-wcmc.org/datasets/3

- Sills-Rift Valleys: Harris, P. T., Macmillan-Lawler, M., Rupp, J., & Baker, E. K. (2014). Geomorphology of the oceans. Marine Geology, 352, 4–24. https://doi.org/10.1016/j.margeo.2014.01.011
Hydrothermal vents: Beaulieu, S.E., Szafranski, K. (2018) InterRidge Global Database of Active Submarine Hydrothermal Vent Fields, Version 3.4. World Wide Web electronic publication available from http://vents-data.interridge.org Accessed 2019-02-20.


# Contacts

- Fabio Favoretto: favoretto.fabio@gmail.com

- Joy Kumagai: jkumagai96@gmail.com