A multinational company is looking for a country to expand into. To remain completely unbiased, the company assigns its data scientist to categorize countries based on their socio-economic factors. The result of the categorization will be used in a board of directors meeting so the company can decide which category of countries to prioritize expanding into. As the company’s data scientist, we are going to attempt a PCA and a K-means clustering of countries based on their socio-economic factors.
The data was uploaded by Rohan Kukkula on Kaggle and comprises several countries along with their socio-economic indicators.
https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data?select=data-dictionary.csv
First and foremost, we import all of the necessary libraries that we are going to use in the exploratory data analysis and modeling process.
library(GGally)
library(FactoMineR)
library(factoextra)
library(dplyr)
library(readr)
library(leaflet)
library(rworldmap)
library(countrycode)
library(htmltools)
library(wesanderson)
library(purrr)
library(NbClust)
library(tidyr)
Now, we load our dataset.
country <- read_csv("data/Country-data.csv")
Let’s do a quick inspection of our data.
rmarkdown::paged_table(country)
This data consists of 167 rows and 10 columns. Below is a brief description of the data’s variables.
rmarkdown::paged_table(read_csv("data/data-dictionary.csv"))
Now, let’s check whether our data needs any cleaning.
First of all, let’s check whether there are any missing values.
colSums(is.na(country))
## country child_mort exports health imports income inflation
## 0 0 0 0 0 0 0
## life_expec total_fer gdpp
## 0 0 0
None, perfect. Let’s move on to the next step. Now, we are going to check whether our data contains any duplicates, as duplicates would make the data redundant.
country[duplicated(country$country),]
None, so let’s move on to the next step. For graphing and other purposes, we are going to add a continent column to the data frame using the countrycode package.
country$continent <- countrycode(sourcevar = country$country,
origin = "country.name",
destination = "continent")
Scaling is necessary for both clustering and principal component analysis, so we scale our data first.
country_scaled <- scale(country %>% select_if(is.numeric))
First of all, we are going to examine the tendencies of certain variables by creating a geographical visualisation of our data. To start, we use the rworldmap library and its joinCountryData2Map function to create a spatial data object that we can plot in the next step. This step is essential so that we can map our data later on.
joinData <- joinCountryData2Map( country,
joinCode = "NAME",
nameJoinColumn = "country")
## 166 codes from your data successfully matched countries in the map
## 1 codes from your data failed to match with a country code in the map
## 77 codes from the map weren't represented in your data
For simplicity, we are going to ignore the one country that failed to match a country code in the map, since we are only using this information for plotting purposes.
Our original data has 10 columns, one of which is the country name, so there are 9 variables that we can plot.
pal8 <- colorNumeric(palette = "Reds", domain = range(country$exports))
pal5 <- colorNumeric(palette = "Blues", domain = range(country$imports))
pal6 <- colorNumeric(palette = "Greens", domain = range(country$income))
pal7<- colorNumeric(palette = "Purples", domain = range(country$gdpp))
pal9<- colorNumeric(palette = "Blues", domain = range(country$inflation))
map8 <- leaflet::leaflet(joinData) %>%addProviderTiles(providers$OpenStreetMap)
map_lad8 <- map8 %>%addPolygons(
data = joinData, # LAD polygon data from geojson
weight = 0.5, # line thickness
opacity = 1, # line transparency
color = "black",
fillOpacity =0.7,
fillColor = ~pal8(exports),
popup = paste("Country: ", joinData$country, "<br>",
"Exports: ", joinData$exports,"% of GDP" ,"<br>")) %>%
addLegend("bottomright",
pal = pal8,
values=joinData$exports,
opacity=1,
title = "Exports",
labels = ~country)
map5 <- leaflet::leaflet(joinData) %>%addProviderTiles(providers$OpenStreetMap)
map_lad5<- map5 %>%addPolygons(
data = joinData, # LAD polygon data from geojson
weight = 0.5, # line thickness
opacity = 1, # line transparency
color = "black",
fillOpacity =0.7,
fillColor = ~pal5(imports),
popup = paste("Country: ", joinData$country, "<br>",
"Imports: ", joinData$imports,"% Of GDP" ,"<br>")) %>%
addLegend("bottomright",
pal = pal5,
values=joinData$imports,
opacity=1,
title = "Imports",
labels = ~country)
map6 <- leaflet::leaflet(joinData) %>%addProviderTiles(providers$OpenStreetMap)
map_lad6 <- map6%>%addPolygons(
data = joinData, # LAD polygon data from geojson
weight = 0.5, # line thickness
opacity = 1, # line transparency
color = "black",
fillOpacity =0.7,
fillColor = ~pal6(income),
popup = paste("Country: ", joinData$country, "<br>",
"Income: ", joinData$income, "<br>")) %>%
addLegend("bottomright",
pal = pal6,
values=joinData$income,
opacity=1,
title = "Income",
labels = ~country)
map7<- leaflet::leaflet(joinData) %>%addProviderTiles(providers$OpenStreetMap)
map_lad7 <- map7%>%addPolygons(
data = joinData, # LAD polygon data from geojson
weight = 0.5, # line thickness
opacity = 1, # line transparency
color = "black",
fillOpacity =0.7,
fillColor = ~pal7(gdpp),
popup = paste("Country: ", joinData$country, "<br>",
"GDP: ", joinData$gdpp, "<br>")) %>%
addLegend("bottomright",
pal = pal7,
values=joinData$gdpp,
opacity=1,
title = "GDP",
labels = ~country)
map9<- leaflet::leaflet(joinData) %>%addProviderTiles(providers$OpenStreetMap)
map_lad9 <- map9%>%addPolygons(
data = joinData, # LAD polygon data from geojson
weight = 0.5, # line thickness
opacity = 1,
color="black",
fillOpacity =0.7,
fillColor = ~pal9(inflation),
popup = paste("Country: ", joinData$country, "<br>",
"Inflation: ", joinData$inflation," %", "<br>")) %>%
addLegend("bottomright",
pal = pal9,
values=joinData$inflation,
opacity=1,
title = "Inflation",
labels = ~country)
leafsync::latticeView(map_lad8, map_lad5, map_lad7, map_lad9)
The maps above let us compare how exports, imports, GDP per capita and inflation are distributed across the globe.
map_lad6
From the map above, we can see that the Americas and Europe generally have a higher net income per person than the other continents.
country %>% select(-country) %>%
pivot_longer(cols=c(1:9),names_to = "variable",values_to = "value") %>%
ggplot(aes(y = value, fill = continent)) + geom_boxplot() + facet_wrap(~variable, scales = "free") + theme_classic() + scale_fill_manual(values = wes_palette(n = 5, name = "Darjeeling2"))
The boxplots above show how each socio-economic variable is distributed across the continents.
We want to explore whether our data contains highly correlated variables, since multicollinearity is undesirable when machine learning techniques such as regression and classification are applied to the data. This issue can be addressed with principal component analysis, which reduces the data’s dimensionality.
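Before plotting, a quick programmatic check (a small optional addition, not part of the original workflow) can list the variable pairs whose absolute correlation exceeds an arbitrary threshold such as 0.7:
# Correlation matrix of the numeric columns only
cor_mat <- cor(country %>% select_if(is.numeric))
# Keep each pair once (upper triangle) and filter by |r| > 0.7
cor_pairs <- as.data.frame(as.table(cor_mat)) %>%
  filter(as.integer(Var1) < as.integer(Var2), abs(Freq) > 0.7) %>%
  arrange(desc(abs(Freq)))
cor_pairs
A correlation heat map makes the same information easier to scan: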
ggcorr(country %>% select_if(is.numeric), label = T, hjust = 0.5, size = 3)
From the correlation plot above, we can see that several variables are highly correlated with each other, such as exports and imports, income and GDP per capita, and child mortality and life expectancy. Thus, it makes sense to perform principal component analysis on our data.
First of all, we are going to cluster the countries using K-means clustering. K-means partitions the data into the provided number of clusters K by iterating between assigning each observation to the nearest cluster centroid (using Euclidean distance) and recomputing the centroids until the assignments stabilise.
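To make the mechanics concrete, below is a minimal from-scratch sketch of that loop (for illustration only; the analysis itself uses the built-in kmeans function, and the helper name simple_kmeans is our own):
# Illustrative K-means loop, not the implementation used later in this post
simple_kmeans <- function(x, k, iters = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centroids
  for (i in seq_len(iters)) {
    # distance from every observation to every centroid (Euclidean)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    assignment <- apply(d, 1, which.min)             # nearest centroid per observation
    for (j in seq_len(k)) {                          # recompute centroids as cluster means
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}
# Example usage on the scaled data:
# simple_kmeans(country_scaled, k = 3)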
The first thing we need is a good value of K for our model. There are three common methods for this: the elbow method, the silhouette method and the gap statistic method. The elbow method looks for the k beyond which the within-cluster sum of squares no longer decreases sharply as more clusters are added; the “elbow” or bend in the plot is taken as the chosen k.
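As a quick illustration of what fviz_nbclust computes for the elbow plot, the total within-cluster sum of squares can also be calculated manually for a range of k (an optional check; the seed and nstart values below are our own choices):
set.seed(123)   # assumed seed, purely for a stable illustration
wss <- map_dbl(1:10, function(k) {
  kmeans(country_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
In the analysis itself we rely on fviz_nbclust, which produces the same kind of plot in one call.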
fviz_nbclust(x = country_scaled, method = "wss", kmeans)
Using the elbow method, we cannot see any clear indication of the ideal number of clusters; k = 4 may be suitable, since beyond that point there is no dramatic decrease in the total within-cluster sum of squares.
To verify our findings, we are going to try the silhouette method. The silhouette method calculates the average silhouette width for each k; the k with the maximum average silhouette width is considered optimal.
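For reference, the average silhouette width for a single k can be computed directly with the cluster package (assumed to be installed; this mirrors what fviz_nbclust does for every k):
library(cluster)
set.seed(123)   # assumed seed for a stable illustration
km5 <- kmeans(country_scaled, centers = 5, nstart = 25)
sil <- silhouette(km5$cluster, dist(country_scaled))
mean(sil[, "sil_width"])   # average silhouette width for k = 5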
fviz_nbclust(x = country_scaled, method = "silhouette", kmeans)
Contrary to the elbow method, it appears that k = 5 is optimal for our data.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The optimal number of clusters is the value that maximizes the gap statistic.
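The same statistic can be computed with clusGap from the cluster package (a sketch with arbitrary nstart and B values of our own choosing):
library(cluster)
set.seed(123)   # assumed seed for a stable illustration
gap_stat <- clusGap(country_scaled, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
For consistency with the other plots, we use fviz_nbclust here as well.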
fviz_nbclust(x = country_scaled, method = "gap_stat", kmeans)
As we can see, the gap statistic method produces an optimal k of 3. All three methods produce different results, which makes the process so far inconclusive.
Since the methods above yield different results, we are going to try the NbClust function.
The NbClust function computes 30 indices for choosing the best number of clusters and proposes the best clustering scheme from the results obtained by varying the number of clusters, the distance measure and the clustering method.
nbclust_out <- NbClust(
data = country_scaled,
distance = "euclidean",
min.nc = 2, # minimum number of clusters
max.nc = 10, # maximum number of clusters
method = "kmeans" # one of: "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid", "kmeans"
)
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 4 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Now, let’s plot the results.
nbclust_plot <- data.frame(clusters = nbclust_out$Best.nc[1, ])
nbclust_plot <- subset(nbclust_plot, clusters >= 2 & clusters <= 10)
# create plot
ggplot(nbclust_plot) +
aes(x = clusters) +
geom_histogram(bins = 30L, fill = "#0c4c8a") +
labs(x = "Number of clusters", y = "Frequency among all indices", title = "Optimal number of clusters") +
theme_minimal()
From this method, we obtain that the best number of clusters for our data is indeed three. This aligns with the k that we got using the gap statistic method.
Now, let’s build our clusters, setting the number of centers to three as obtained in the previous step.
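One caveat: kmeans starts from random centroids, so the cluster numbering (and occasionally the solution itself) can vary between runs. A minimal reproducible variant, which is our own addition rather than part of the original workflow, would fix a seed and use several random starts:
set.seed(100)                                      # arbitrary seed, our addition
kmeans(country_scaled, centers = 3, nstart = 25)   # multiple starts for a more stable solution
Below we keep the original, simpler call.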
country_cluster <- kmeans(country_scaled, centers = 3)
Then we add the cluster assignments back to the original data.
country <- country %>% mutate(cluster = as.factor(country_cluster$cluster))
mean <- country %>% select(-c(country, continent)) %>% group_by(cluster) %>% summarise_all(.funs = c(mean = "mean"))
mean %>% pivot_longer(cols = -cluster, names_to = "type", values_to = "value") %>%
ggplot(aes(x = cluster, y = value)) +
geom_col(aes(fill = cluster)) +
facet_wrap(~type, scales = "free_y") + theme_classic() + scale_fill_manual(values = wes_palette(n = 3, name = "Darjeeling2"))
From the plot above, we can see that:
Cluster 1: has stronger economic indicators than the other clusters, with higher GDP per capita, income, exports and imports, and lower inflation. It also has better social indicators, with lower child mortality and higher life expectancy.
Cluster 2: has the highest total fertility (children born per woman) and the lowest life expectancy. It also has the lowest exports, imports and GDP per capita, and the highest inflation.
Cluster 3: sits in the middle on every indicator.
From these profiles, we can infer that cluster 1 consists of developed countries, cluster 2 consists of underdeveloped countries and cluster 3 consists of developing countries.
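If we want these labels attached to the data for reporting, a small optional step could recode the cluster factor (the status column name is hypothetical, and the mapping only holds for this particular kmeans run, since cluster numbering is arbitrary):
# Map cluster numbers to descriptive labels (numbering depends on the kmeans run)
country <- country %>%
  mutate(status = recode(cluster,
                         "1" = "Developed",
                         "2" = "Underdeveloped",
                         "3" = "Developing"))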
Now, we will implement principal component analysis, also known as PCA, on our data. PCA is a dimensionality reduction method, often used on larger datasets, that computes the principal components and uses them to perform a change of basis on the data. Keeping only the first few principal components reduces the dimension of the data while retaining the directions that explain the most variance.
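As a quick intuition check (separate from the FactoMineR workflow used below), the same idea can be reproduced with base R: the principal components are the eigenvectors of the covariance matrix of the scaled data, and the proportion of variance explained comes from the eigenvalues. Small differences from the FactoMineR output can arise from how the variance is normalised.
eig <- eigen(cov(country_scaled))          # eigendecomposition of the covariance matrix
round(eig$values / sum(eig$values), 3)     # proportion of variance explained per component
# The same numbers via prcomp; the data is already scaled, so no further centering or scaling
pca_base <- prcomp(country_scaled, center = FALSE, scale. = FALSE)
summary(pca_base)$importance["Proportion of Variance", ]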
First, we take the country_scaled data and add the continent and cluster columns to it, so that we can use these qualitative variables for plotting later on.
country_pca <- as.data.frame(country_scaled) %>% mutate(continent = as.factor(country$continent), cluster = as.factor(country$cluster))
Now, let’s perform PCA on our dataset. We set scale.unit to FALSE because we have already scaled our data, and we set quali.sup to columns 10 and 11, where the qualitative variables (continent and cluster) are located.
pca_country <- PCA(X = country_pca,
scale.unit = F,
quali.sup = c(10,11),
graph = F,
ncp = 9)
summary(pca_country)
##
## Call:
## PCA(X = country_pca, scale.unit = F, ncp = 9, quali.sup = c(10,
## 11), graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 4.111 1.537 1.163 0.989 0.657 0.222 0.113
## % of var. 45.952 17.182 13.004 11.053 7.340 2.484 1.260
## Cumulative % of var. 45.952 63.133 76.138 87.191 94.531 97.015 98.276
## Dim.8 Dim.9
## Variance 0.088 0.066
## % of var. 0.981 0.743
## Cumulative % of var. 99.257 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 3.220 | -2.904 1.229 0.814 | 0.095 0.004 0.001 | -0.716
## 2 | 1.468 | 0.429 0.027 0.085 | -0.586 0.134 0.160 | -0.332
## 3 | 1.659 | -0.284 0.012 0.029 | -0.454 0.080 0.075 | 1.218
## 4 | 3.888 | -2.924 1.245 0.565 | 1.690 1.113 0.189 | 1.520
## 5 | 1.411 | 1.030 0.155 0.533 | 0.136 0.007 0.009 | -0.225
## 6 | 2.216 | 0.022 0.000 0.000 | -1.774 1.226 0.641 | 0.867
## 7 | 1.714 | -0.101 0.001 0.003 | -0.567 0.125 0.109 | 0.241
## 8 | 3.395 | 2.335 0.794 0.473 | -1.982 1.531 0.341 | 0.190
## 9 | 3.331 | 2.965 1.280 0.792 | -0.732 0.209 0.048 | -0.518
## 10 | 1.576 | -0.181 0.005 0.013 | -0.402 0.063 0.065 | 0.865
## ctr cos2
## 1 0.264 0.049 |
## 2 0.057 0.051 |
## 3 0.763 0.539 |
## 4 1.190 0.153 |
## 5 0.026 0.025 |
## 6 0.387 0.153 |
## 7 0.030 0.020 |
## 8 0.019 0.003 |
## 9 0.138 0.024 |
## 10 0.385 0.301 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## child_mort | -0.851 17.600 0.728 | 0.239 3.720 0.058 | -0.032 0.087 0.001
## exports | 0.576 8.060 0.333 | 0.760 37.597 0.581 | 0.156 2.096 0.025
## health | 0.306 2.275 0.094 | -0.301 5.909 0.091 | -0.644 35.597 0.417
## imports | 0.327 2.608 0.108 | 0.833 45.134 0.698 | -0.324 8.996 0.105
## income | 0.808 15.876 0.657 | 0.028 0.051 0.001 | 0.325 9.093 0.106
## inflation | -0.392 3.732 0.154 | -0.010 0.007 0.000 | 0.693 41.283 0.483
## life_expec | 0.863 18.134 0.750 | -0.276 4.960 0.077 | 0.123 1.298 0.015
## total_fer | -0.819 16.300 0.674 | 0.192 2.410 0.037 | 0.021 0.038 0.000
## gdpp | 0.796 15.417 0.638 | -0.057 0.212 0.003 | 0.133 1.512 0.018
##
## child_mort |
## exports |
## health |
## imports |
## income |
## inflation |
## life_expec |
## total_fer |
## gdpp |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3
## Africa | 2.106 | -2.000 0.902 -7.954 | 0.453 0.046 2.944 | -0.023
## Americas | 0.766 | 0.233 0.093 0.651 | -0.603 0.620 -2.753 | -0.165
## Asia | 0.801 | 0.212 0.070 0.819 | 0.025 0.001 0.157 | 0.648
## Europe | 2.086 | 2.019 0.937 7.083 | -0.152 0.005 -0.870 | -0.400
## Oceania | 1.047 | -0.066 0.004 -0.100 | -0.022 0.000 -0.054 | -0.889
## cluster_1 | 2.908 | 2.761 0.902 9.199 | -0.212 0.005 -1.157 | 0.069
## cluster_2 | 0.805 | 0.175 0.047 1.117 | -0.138 0.030 -1.447 | 0.024
## cluster_3 | 2.563 | -2.427 0.897 -9.653 | 0.410 0.026 2.667 | -0.096
## cos2 v.test
## Africa 0.000 -0.173 |
## Americas 0.047 -0.868 |
## Asia 0.654 4.703 |
## Europe 0.037 -2.640 |
## Oceania 0.720 -2.534 |
## cluster_1 0.001 0.430 |
## cluster_2 0.001 0.291 |
## cluster_3 0.001 -0.717 |
Now, we visualize the explained variance.
fviz_eig(pca_country, ncp = 9, addlabels = T, main = "Explained Variance in Each Dimensions", hjust = -0.3) + ylim(0, 50)
As we can see, the first principal component has the highest explained variance. We set a benchmark of 80% for the total variance that we want to retain by keeping only a few principal components. Using the first four dimensions, we already explain about 87.2% of the total variance of the data.
This is the essence of principal component reduction: from the 9 numerical features in our original data, we obtain 4 components that together explain about 87.2% of the total variance.
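The number of components needed to reach the 80% benchmark can also be read off programmatically from the eigenvalue table returned by FactoMineR (the column name below is the one used in its PCA output):
cum_var <- pca_country$eig[, "cumulative percentage of variance"]   # cumulative variance per dimension
which(cum_var >= 80)[1]                                             # smallest number of components reaching 80%
With the eigenvalue table shown above, this returns 4.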
df_pca<- data.frame(pca_country$ind$coord[, 1:4]) %>% bind_cols(cluster = as.factor(country$cluster),continent=as.factor(country$continent)) %>% select(cluster,continent, 1:4)
rmarkdown::paged_table(df_pca)
Next, we plot the individual factor map, coloring the individuals by continent (the 10th column).
fviz_pca_ind(pca_country, habillage = 10, addEllipses = T) + ylim(-3.5, 6.5) + xlim(-6.5, 8.5)
## Warning: Removed 1 rows containing non-finite values (stat_ellipse).
## Warning: Removed 1 rows containing non-finite values (stat_mean).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text).
As we can see, there is considerable overlap when we color the principal component scores by continent; the continents do not separate clearly in the principal component space.
Now, we plot the variable factor map.
fviz_pca_var(pca_country, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "Red"), repel = TRUE)
The variable factor map above shows how much each variable contributes to the first two principal components and how the variables relate to one another.
Since our data comprises 9 dimensions, visualising the clusters directly can be quite challenging. We are therefore going to use PCA to visualise our clusters.
plot.PCA(x = pca_country, choix = "ind", habillage = 11, select = "contrib3", label = "none")
Though principal components are hard to interpret on their own, the individual plot above clearly shows the distinction between the clusters, so in our case this method is quite applicable to our data.
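As an alternative view (not part of the original analysis), factoextra’s fviz_cluster can draw a similar map directly from the kmeans object, projecting the scaled data onto the first two principal components:
fviz_cluster(country_cluster, data = country_scaled,
             geom = "point", ellipse.type = "convex") +
  theme_classic()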
In conclusion, we have performed clustering using the K-means method and applied principal component analysis, and we have visualized our clusters using PCA. Going back to the business case, we obtained 3 clusters. As categorized above, cluster 1 corresponds to developed countries, cluster 2 to underdeveloped countries and cluster 3 to developing countries. The management’s decision will depend on the strategy and the risks they are willing to take. The less developed a country is, the harder it is to generate sales there, since net income per person is lower; on the other hand, targeting more developed countries would mean higher operational costs. In this light, expanding into developing countries would be a sensible middle ground for the company’s strategy, neither too risky nor too conservative.