A multinational company is looking for a country to expand into. To remain completely unbiased, the company assigns its data scientist to categorize countries based on their socio-economic factors. The result of the categorization will be used in a board of directors meeting so the company can decide which category of countries to prioritize expanding into. As the company’s data scientist, we are going to attempt a PCA and a K-means clustering of countries based on their socio-economic factors.
The data was uploaded by Rohan Kukkula on Kaggle and comprises several countries along with their socio-economic indicators.
https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data?select=data-dictionary.csv
First and foremost, we import all of the necessary libraries that we are going to use in the exploratory data analysis and modeling process.
library(GGally)
library(FactoMineR)
library(factoextra)
library(dplyr)
library(readr)
library(leaflet)
library(rworldmap)
library(countrycode)
library(htmltools)
library(wesanderson)
library(purrr)
library(NbClust)
library(tidyr)
Now, we load our dataset.
country <- read_csv("data/Country-data.csv")
Let’s do a quick inspection of our data.
rmarkdown::paged_table(country)
This data consists of 167 rows and 10 columns. Below is a brief description of the data’s variables.
rmarkdown::paged_table(read_csv("data/data-dictionary.csv"))
Now, let’s check whether our data needs any cleaning.
First of all, let’s check whether there are any missing values.
colSums(is.na(country))
## country child_mort exports health imports income inflation
## 0 0 0 0 0 0 0
## life_expec total_fer gdpp
## 0 0 0
None, perfect. Let’s move on to the next step. Now, we are going to check whether our data contains any duplicates, as duplicates would make the data redundant.
country[duplicated(country$country),]
None, so let’s move on to the next step. For graphing and other purposes, we are going to add a continent column to the data frame using the countrycode package.
country$continent <- countrycode(sourcevar = country$country,
origin = "country.name",
destination = "continent")
Scaling is necessary for both clustering and principal component analysis, so we scale our data first.
country_scaled <- scale(country %>% select_if(is.numeric))
First of all, we are going to examine the tendencies of certain variables by creating a geographical visualisation of our data. To start, we use the rworldmap library and its joinCountryData2Map function to create a spatial data object that we can plot in the next step. This step is essential so that we can map our data later on.
joinData <- joinCountryData2Map( country,
joinCode = "NAME",
nameJoinColumn = "country")
## 166 codes from your data successfully matched countries in the map
## 1 codes from your data failed to match with a country code in the map
## 77 codes from the map weren't represented in your data
For simplicity, we are going to ignore the one country that failed to match a country code in the map, since we are only using this information for plotting purposes.
Our original data has 10 columns, one of which is the country name, so there are 9 variables that we can plot.
pal8 <- colorNumeric(palette = "Reds", domain = range(country$exports))
pal5 <- colorNumeric(palette = "Blues", domain = range(country$imports))
pal6 <- colorNumeric(palette = "Greens", domain = range(country$income))
pal7<- colorNumeric(palette = "Purples", domain = range(country$gdpp))
pal9<- colorNumeric(palette = "Blues", domain = range(country$inflation))
map8 <- leaflet::leaflet(joinData) %>%addProviderTiles(providers$OpenStreetMap)
map_lad8 <- map8 %>%addPolygons(
data = joinData, # LAD polygon data from geojson
weight = 0.5, # line thickness
opacity = 1, # line transparency
color = "black",
fillOpacity =0.7,
fillColor = ~pal8(exports),
popup = paste("Country: ", joinData$country, "<br>",
"Exports: ", joinData$exports,"% of GDP" ,"<br>")) %>%
addLegend("bottomright",
pal = pal8,
values=joinData$exports,
opacity=1,
title = "Exports",
labels = ~country)
map5 <- leaflet::leaflet(joinData) %>%addProviderTiles(providers$OpenStreetMap)
map_lad5<- map5 %>%addPolygons(
data = joinData, # LAD polygon data from geojson
weight = 0.5, # line thickness
opacity = 1, # line transparency
color = "black",
fillOpacity =0.7,
fillColor = ~pal5(imports),
popup = paste("Country: ", joinData$country, "<br>",
"Imports: ", joinData$imports,"% Of GDP" ,"<br>")) %>%
addLegend("bottomright",
pal = pal5,
values=joinData$imports,
opacity=1,
title = "Imports",
labels = ~country)
map6 <- leaflet::leaflet(joinData) %>%addProviderTiles(providers$OpenStreetMap)
map_lad6 <- map6%>%addPolygons(
data = joinData, # LAD polygon data from geojson
weight = 0.5, # line thickness
opacity = 1, # line transparency
color = "black",
fillOpacity =0.7,
fillColor = ~pal6(income),
popup = paste("Country: ", joinData$country, "<br>",
"Income: ", joinData$income, "<br>")) %>%
addLegend("bottomright",
pal = pal6,
values=joinData$income,
opacity=1,
title = "Income",
labels = ~country)
map7<- leaflet::leaflet(joinData) %>%addProviderTiles(providers$OpenStreetMap)
map_lad7 <- map7%>%addPolygons(
data = joinData, # LAD polygon data from geojson
weight = 0.5, # line thickness
opacity = 1, # line transparency
color = "black",
fillOpacity =0.7,
fillColor = ~pal7(gdpp),
popup = paste("Country: ", joinData$country, "<br>",
"GDP: ", joinData$gdpp, "<br>")) %>%
addLegend("bottomright",
pal = pal7,
values=joinData$gdpp,
opacity=1,
title = "GDP",
labels = ~country)
map9<- leaflet::leaflet(joinData) %>%addProviderTiles(providers$OpenStreetMap)
map_lad9 <- map9%>%addPolygons(
data = joinData, # LAD polygon data from geojson
weight = 0.5, # line thickness
opacity = 1,
color="black",
fillOpacity =0.7,
fillColor = ~pal9(inflation),
popup = paste("Country: ", joinData$country, "<br>",
"Inflation: ", joinData$inflation," %", "<br>")) %>%
addLegend("bottomright",
pal = pal9,
values=joinData$inflation,
opacity=1,
title = "Inflation",
labels = ~country)
leafsync::latticeView(map_lad8, map_lad5, map_lad7, map_lad9)
The maps above let us compare how exports, imports, GDP per capita and inflation are distributed across the globe.
map_lad6
From the map above, we can see that the Americas and Europe generally have a higher net income per person than the other continents.
country %>% select(-country) %>%
pivot_longer(cols=c(1:9),names_to = "variable",values_to = "value") %>%
ggplot(aes(y = value, fill = continent)) + geom_boxplot() + facet_wrap(~variable, scales = "free") + theme_classic() + scale_fill_manual(values = wes_palette(n = 5, name = "Darjeeling2"))
The boxplots above show how each socio-economic variable is distributed across the continents.
We want to explore whether our data contains highly correlated variables, since multicollinearity is undesirable when machine learning techniques such as regression and classification are applied to the data. This issue can be addressed with principal component analysis, which reduces the data’s dimensionality.
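Before plotting, a quick programmatic check (a small optional addition, not part of the original workflow) can list the variable pairs whose absolute correlation exceeds an arbitrary threshold such as 0.7:
# Correlation matrix of the numeric columns only
cor_mat <- cor(country %>% select_if(is.numeric))
# Keep each pair once (upper triangle) and filter by |r| > 0.7
cor_pairs <- as.data.frame(as.table(cor_mat)) %>%
  filter(as.integer(Var1) < as.integer(Var2), abs(Freq) > 0.7) %>%
  arrange(desc(abs(Freq)))
cor_pairs
A correlation heat map makes the same information easier to scan: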
ggcorr(country %>% select_if(is.numeric), label = T, hjust = 0.5, size = 3)
From the correlation plot above, we can see that several variables are highly correlated with each other, such as exports and imports, income and GDP per capita, and child mortality and life expectancy. Thus, it makes sense to perform principal component analysis on our data.
First of all, we are going to cluster the countries using K-means clustering. K-means partitions the data into the provided number of clusters K by iterating between assigning each observation to the nearest cluster centroid (using Euclidean distance) and recomputing the centroids until the assignments stabilise.
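To make the mechanics concrete, below is a minimal from-scratch sketch of that loop (for illustration only; the analysis itself uses the built-in kmeans function, and the helper name simple_kmeans is our own):
# Illustrative K-means loop, not the implementation used later in this post
simple_kmeans <- function(x, k, iters = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centroids
  for (i in seq_len(iters)) {
    # distance from every observation to every centroid (Euclidean)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    assignment <- apply(d, 1, which.min)             # nearest centroid per observation
    for (j in seq_len(k)) {                          # recompute centroids as cluster means
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}
# Example usage on the scaled data:
# simple_kmeans(country_scaled, k = 3)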
The first thing we need is a good value of K for our model. There are three common methods for this: the elbow method, the silhouette method and the gap statistic method. The elbow method looks for the k beyond which the within-cluster sum of squares no longer decreases sharply as more clusters are added; the “elbow” or bend in the plot is taken as the chosen k.
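As a quick illustration of what fviz_nbclust computes for the elbow plot, the total within-cluster sum of squares can also be calculated manually for a range of k (an optional check; the seed and nstart values below are our own choices):
set.seed(123)   # assumed seed, purely for a stable illustration
wss <- map_dbl(1:10, function(k) {
  kmeans(country_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
In the analysis itself we rely on fviz_nbclust, which produces the same kind of plot in one call.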
fviz_nbclust(x = country_scaled, method = "wss", kmeans)
Using the elbow method, we cannot see any clear indication of the ideal number of clusters; k = 4 may be suitable, since beyond that point there is no dramatic decrease in the total within-cluster sum of squares.
To verify our findings, we are going to try the silhouette method. The silhouette method calculates the average silhouette width for each k; the k with the maximum average silhouette width is considered optimal.
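For reference, the average silhouette width for a single k can be computed directly with the cluster package (assumed to be installed; this mirrors what fviz_nbclust does for every k):
library(cluster)
set.seed(123)   # assumed seed for a stable illustration
km5 <- kmeans(country_scaled, centers = 5, nstart = 25)
sil <- silhouette(km5$cluster, dist(country_scaled))
mean(sil[, "sil_width"])   # average silhouette width for k = 5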
fviz_nbclust(x = country_scaled, method = "silhouette", kmeans)
Contrary to the elbow method, it appears that k = 5 is optimal for our data.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The optimal number of clusters is the value that maximizes the gap statistic.
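The same statistic can be computed with clusGap from the cluster package (a sketch with arbitrary nstart and B values of our own choosing):
library(cluster)
set.seed(123)   # assumed seed for a stable illustration
gap_stat <- clusGap(country_scaled, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
For consistency with the other plots, we use fviz_nbclust here as well.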
fviz_nbclust(x = country_scaled, method = "gap_stat", kmeans)
As we can see, the gap statistic method produces an optimal k of 3. All three methods produce different results, which makes the process so far inconclusive.
Since the methods above yield different results, we are going to try the NbClust function.
The NbClust function computes 30 indices for choosing the best number of clusters and proposes the best clustering scheme from the results obtained by varying the number of clusters, the distance measure and the clustering method.
nbclust_out <- NbClust(
data = country_scaled,
distance = "euclidean",
min.nc = 2, # minimum number of clusters
max.nc = 10, # maximum number of clusters
method = "kmeans" # one of: "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid", "kmeans"
)
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 4 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Now, let’s plot the results.
nbclust_plot <- data.frame(clusters = nbclust_out$Best.nc[1, ])
nbclust_plot <- subset(nbclust_plot, clusters >= 2 & clusters <= 10)
# create plot
ggplot(nbclust_plot) +
aes(x = clusters) +
geom_histogram(bins = 30L, fill = "#0c4c8a") +
labs(x = "Number of clusters", y = "Frequency among all indices", title = "Optimal number of clusters") +
theme_minimal()
From this method, we obtain that the best number of clusters for our data is indeed three. This aligns with the k that we got using the gap statistic method.
Now, let’s build our clusters, setting the number of centers to three as obtained in the previous step.
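One caveat: kmeans starts from random centroids, so the cluster numbering (and occasionally the solution itself) can vary between runs. A minimal reproducible variant, which is our own addition rather than part of the original workflow, would fix a seed and use several random starts:
set.seed(100)                                      # arbitrary seed, our addition
kmeans(country_scaled, centers = 3, nstart = 25)   # multiple starts for a more stable solution
Below we keep the original, simpler call.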
country_cluster <- kmeans(country_scaled, centers = 3)
Then we add the cluster assignments back to the original data.
country <- country %>% mutate(cluster = as.factor(country_cluster$cluster))
mean <- country %>% select(-c(country, continent)) %>% group_by(cluster) %>% summarise_all(.funs = c(mean = "mean"))
mean %>% pivot_longer(cols = -cluster, names_to = "type", values_to = "value") %>%
ggplot(aes(x = cluster, y = value)) +
geom_col(aes(fill = cluster)) +
facet_wrap(~type, scales = "free_y") + theme_classic() + scale_fill_manual(values = wes_palette(n = 3, name = "Darjeeling2"))
From the plot above, we can see that:
Cluster 1: has stronger economic indicators than the other clusters, with higher GDP per capita, income, exports and imports, and lower inflation. It also has better social indicators, with lower child mortality and higher life expectancy.
Cluster 2: has the highest total fertility (children born per woman) and the lowest life expectancy. It also has the lowest exports, imports and GDP per capita, and the highest inflation.
Cluster 3: sits in the middle on every indicator.
From these profiles, we can infer that cluster 1 consists of developed countries, cluster 2 consists of underdeveloped countries and cluster 3 consists of developing countries.
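If we want these labels attached to the data for reporting, a small optional step could recode the cluster factor (the status column name is hypothetical, and the mapping only holds for this particular kmeans run, since cluster numbering is arbitrary):
# Map cluster numbers to descriptive labels (numbering depends on the kmeans run)
country <- country %>%
  mutate(status = recode(cluster,
                         "1" = "Developed",
                         "2" = "Underdeveloped",
                         "3" = "Developing"))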
Now, we will implement principal component analysis, also known as PCA, on our data. PCA is a dimensionality reduction method, often used on larger datasets, that computes the principal components and uses them to perform a change of basis on the data. Keeping only the first few principal components reduces the dimension of the data while retaining the directions that explain the most variance.
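As a quick intuition check (separate from the FactoMineR workflow used below), the same idea can be reproduced with base R: the principal components are the eigenvectors of the covariance matrix of the scaled data, and the proportion of variance explained comes from the eigenvalues. Small differences from the FactoMineR output can arise from how the variance is normalised.
eig <- eigen(cov(country_scaled))          # eigendecomposition of the covariance matrix
round(eig$values / sum(eig$values), 3)     # proportion of variance explained per component
# The same numbers via prcomp; the data is already scaled, so no further centering or scaling
pca_base <- prcomp(country_scaled, center = FALSE, scale. = FALSE)
summary(pca_base)$importance["Proportion of Variance", ]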
First, we take the country_scaled data and add the continent and cluster columns to it, so that we can use these qualitative variables for plotting later on.
country_pca <- as.data.frame(country_scaled) %>% mutate(continent = as.factor(country$continent), cluster = as.factor(country$cluster))
Now, let’s perform PCA on our dataset. We set scale.unit to FALSE because we have already scaled our data, and we set quali.sup to columns 10 and 11, where the qualitative variables (continent and cluster) are located.
pca_country <- PCA(X = country_pca,
scale.unit = F,
quali.sup = c(10,11),
graph = F,
ncp = 9)
summary(pca_country)
##
## Call:
## PCA(X = country_pca, scale.unit = F, ncp = 9, quali.sup = c(10,
## 11), graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 4.111 1.537 1.163 0.989 0.657 0.222 0.113
## % of var. 45.952 17.182 13.004 11.053 7.340 2.484 1.260
## Cumulative % of var. 45.952 63.133 76.138 87.191 94.531 97.015 98.276
## Dim.8 Dim.9
## Variance 0.088 0.066
## % of var. 0.981 0.743
## Cumulative % of var. 99.257 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 3.220 | -2.904 1.229 0.814 | 0.095 0.004 0.001 | -0.716
## 2 | 1.468 | 0.429 0.027 0.085 | -0.586 0.134 0.160 | -0.332
## 3 | 1.659 | -0.284 0.012 0.029 | -0.454 0.080 0.075 | 1.218
## 4 | 3.888 | -2.924 1.245 0.565 | 1.690 1.113 0.189 | 1.520
## 5 | 1.411 | 1.030 0.155 0.533 | 0.136 0.007 0.009 | -0.225
## 6 | 2.216 | 0.022 0.000 0.000 | -1.774 1.226 0.641 | 0.867
## 7 | 1.714 | -0.101 0.001 0.003 | -0.567 0.125 0.109 | 0.241
## 8 | 3.395 | 2.335 0.794 0.473 | -1.982 1.531 0.341 | 0.190
## 9 | 3.331 | 2.965 1.280 0.792 | -0.732 0.209 0.048 | -0.518
## 10 | 1.576 | -0.181 0.005 0.013 | -0.402 0.063 0.065 | 0.865
## ctr cos2
## 1 0.264 0.049 |
## 2 0.057 0.051 |
## 3 0.763 0.539 |
## 4 1.190 0.153 |
## 5 0.026 0.025 |
## 6 0.387 0.153 |
## 7 0.030 0.020 |
## 8 0.019 0.003 |
## 9 0.138 0.024 |
## 10 0.385 0.301 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## child_mort | -0.851 17.600 0.728 | 0.239 3.720 0.058 | -0.032 0.087 0.001
## exports | 0.576 8.060 0.333 | 0.760 37.597 0.581 | 0.156 2.096 0.025
## health | 0.306 2.275 0.094 | -0.301 5.909 0.091 | -0.644 35.597 0.417
## imports | 0.327 2.608 0.108 | 0.833 45.134 0.698 | -0.324 8.996 0.105
## income | 0.808 15.876 0.657 | 0.028 0.051 0.001 | 0.325 9.093 0.106
## inflation | -0.392 3.732 0.154 | -0.010 0.007 0.000 | 0.693 41.283 0.483
## life_expec | 0.863 18.134 0.750 | -0.276 4.960 0.077 | 0.123 1.298 0.015
## total_fer | -0.819 16.300 0.674 | 0.192 2.410 0.037 | 0.021 0.038 0.000
## gdpp | 0.796 15.417 0.638 | -0.057 0.212 0.003 | 0.133 1.512 0.018
##
## child_mort |
## exports |
## health |
## imports |
## income |
## inflation |
## life_expec |
## total_fer |
## gdpp |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3
## Africa | 2.106 | -2.000 0.902 -7.954 | 0.453 0.046 2.944 | -0.023
## Americas | 0.766 | 0.233 0.093 0.651 | -0.603 0.620 -2.753 | -0.165
## Asia | 0.801 | 0.212 0.070 0.819 | 0.025 0.001 0.157 | 0.648
## Europe | 2.086 | 2.019 0.937 7.083 | -0.152 0.005 -0.870 | -0.400
## Oceania | 1.047 | -0.066 0.004 -0.100 | -0.022 0.000 -0.054 | -0.889
## cluster_1 | 2.908 | 2.761 0.902 9.199 | -0.212 0.005 -1.157 | 0.069
## cluster_2 | 0.805 | 0.175 0.047 1.117 | -0.138 0.030 -1.447 | 0.024
## cluster_3 | 2.563 | -2.427 0.897 -9.653 | 0.410 0.026 2.667 | -0.096
## cos2 v.test
## Africa 0.000 -0.173 |
## Americas 0.047 -0.868 |
## Asia 0.654 4.703 |
## Europe 0.037 -2.640 |
## Oceania 0.720 -2.534 |
## cluster_1 0.001 0.430 |
## cluster_2 0.001 0.291 |
## cluster_3 0.001 -0.717 |
Now, we visualize the explained variance.
fviz_eig(pca_country, ncp = 9, addlabels = T, main = "Explained Variance in Each Dimensions", hjust = -0.3) + ylim(0, 50)
As we can see, the first principal component has the highest explained variance. We set a benchmark of 80% for the total variance that we want to retain by keeping only a few principal components. Using the first four dimensions, we already explain about 87.2% of the total variance of the data.
This is the essence of principal component reduction: from the 9 numerical features in our original data, we obtain 4 components that together explain about 87.2% of the total variance.
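The number of components needed to reach the 80% benchmark can also be read off programmatically from the eigenvalue table returned by FactoMineR (the column name below is the one used in its PCA output):
cum_var <- pca_country$eig[, "cumulative percentage of variance"]   # cumulative variance per dimension
which(cum_var >= 80)[1]                                             # smallest number of components reaching 80%
With the eigenvalue table shown above, this returns 4.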
df_pca<- data.frame(pca_country$ind$coord[, 1:4]) %>% bind_cols(cluster = as.factor(country$cluster),continent=as.factor(country$continent)) %>% select(cluster,continent, 1:4)
rmarkdown::paged_table(df_pca)
Next, we plot the individual factor map, coloring the individuals by continent (the 10th column).
fviz_pca_ind(pca_country, habillage = 10, addEllipses = T) + ylim(-3.5, 6.5) + xlim(-6.5, 8.5)
## Warning: Removed 1 rows containing non-finite values (stat_ellipse).
## Warning: Removed 1 rows containing non-finite values (stat_mean).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text).
As we can see, there is considerable overlap when we color the principal component scores by continent; the continents do not separate clearly in the principal component space.
Now, we plot the variable factor map.
fviz_pca_var(pca_country, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "Red"), repel = TRUE)
The variable factor map above shows how much each variable contributes to the first two principal components and how the variables relate to one another.
Since our data comprises 9 dimensions, visualising the clusters directly can be quite challenging. We are therefore going to use PCA to visualise our clusters.
plot.PCA(x = pca_country, choix = "ind", habillage = 11, select = "contrib3", label = "none")
Though principal components are hard to interpret on their own, the individual plot above clearly shows the distinction between the clusters, so in our case this method is quite applicable to our data.
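As an alternative view (not part of the original analysis), factoextra’s fviz_cluster can draw a similar map directly from the kmeans object, projecting the scaled data onto the first two principal components:
fviz_cluster(country_cluster, data = country_scaled,
             geom = "point", ellipse.type = "convex") +
  theme_classic()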
In conclusion, we have performed clustering using the K-means method and applied principal component analysis, and we have visualized our clusters using PCA. Going back to the business case, we obtained 3 clusters. As categorized above, cluster 1 corresponds to developed countries, cluster 2 to underdeveloped countries and cluster 3 to developing countries. The management’s decision will depend on the strategy and the risks they are willing to take. The less developed a country is, the harder it is to generate sales there, since net income per person is lower; on the other hand, targeting more developed countries would mean higher operational costs. In this light, expanding into developing countries would be a sensible middle ground for the company’s strategy, neither too risky nor too conservative.