The idea for this project comes from Kaggle: https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data
HELP International has been able to raise around $10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. So, the CEO has to decide which countries are in the direst need of aid. Hence, your job as a data scientist is to categorise the countries using some socio-economic and health factors that determine the overall development of the country. Then you need to suggest the countries which the CEO needs to focus on the most.
To answer this question, two unsupervised learning methods are used: the K-Means and PAM clustering algorithms.
Let’s have a look at the data and the information Kaggle provides:
country : Name of the country
child_mort : Death of children under 5 years of age per 1000 live births
exports : Exports of goods and services per capita. Given as a percentage of the GDP per capita
health : Total health spending per capita. Given as a percentage of the GDP per capita
imports : Imports of goods and services per capita. Given as a percentage of the GDP per capita
income : Net income per person
inflation : The measurement of the annual growth rate of the Total GDP
life_expec : The average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same.
gdpp : The GDP per capita. Calculated as the Total GDP divided by the total population.
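library(ggplot2) # ggplot()
library(factoextra) # fviz_nbclust(), eclust(), fviz_cluster(), fviz_silhouette()
library(cluster) # pam(), silhouette()
country_data <- read.csv('C:/Users/1032745/OneDrive - Blue Yonder/Desktop/PRIVATE/STUDIA/UL/PROJECTS/Country-data.csv', header = TRUE)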
head(country_data)
## country child_mort exports health imports income inflation
## 1 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44
## 2 Albania 16.6 28.0 6.55 48.6 9930 4.49
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.10
## 4 Angola 119.0 62.3 2.85 42.9 5900 22.40
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44
## 6 Argentina 14.5 18.9 8.10 16.0 18700 20.90
## life_expec total_fer gdpp
## 1 56.2 5.82 553
## 2 76.3 1.65 4090
## 3 76.5 2.89 4460
## 4 60.1 6.16 3530
## 5 76.8 2.13 12200
## 6 75.8 2.37 10300
dim(country_data)
## [1] 167 10
summary(country_data)
## country child_mort exports health
## Length:167 Min. : 2.60 Min. : 0.109 Min. : 1.810
## Class :character 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920
## Mode :character Median : 19.30 Median : 35.000 Median : 6.320
## Mean : 38.27 Mean : 41.109 Mean : 6.816
## 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600
## Max. :208.00 Max. :200.000 Max. :17.900
## imports income inflation life_expec
## Min. : 0.0659 Min. : 609 Min. : -4.210 Min. :32.10
## 1st Qu.: 30.2000 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30
## Median : 43.3000 Median : 9960 Median : 5.390 Median :73.10
## Mean : 46.8902 Mean : 17145 Mean : 7.782 Mean :70.56
## 3rd Qu.: 58.7500 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80
## Max. :174.0000 Max. :125000 Max. :104.000 Max. :82.80
## total_fer gdpp
## Min. :1.150 Min. : 231
## 1st Qu.:1.795 1st Qu.: 1330
## Median :2.410 Median : 4660
## Mean :2.948 Mean : 12964
## 3rd Qu.:3.880 3rd Qu.: 14050
## Max. :7.490 Max. :105000
any(is.na(country_data))
## [1] FALSE
Select the columns to use for clustering
cluster_data <- country_data[,c("child_mort", "exports", "health", "imports", "income", "inflation", "life_expec", "total_fer","gdpp")]
Let’s have a look at the distribution of each feature
par(mfcol = c(3,3))
titles<-colnames(cluster_data)
i<-1
titles
## [1] "child_mort" "exports" "health" "imports" "income"
## [6] "inflation" "life_expec" "total_fer" "gdpp"
for (i in 1:9){
hist(cluster_data[,i],
main=titles[i],
col="darkmagenta",
freq=FALSE
)
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
par(mfcol = c(1,1))
child_mort, exports, income, total_fer, inflation and gdpp appear to be right-skewed, roughly log-normal. imports and health seem to be approximately normally distributed. life_expec also looks approximately normal, but with negative skewness.
Let us now see which countries are the best candidates for financial support when each variable is considered separately.
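The visual impression from the histograms can be cross-checked numerically. The sketch below is an illustration only: it uses simulated data rather than the country dataset, and `sample_skewness` is a hypothetical helper (not part of the original analysis) that computes the moment-based skewness coefficient. A clearly positive value suggests a right-skewed variable, and if the variable is roughly log-normal, taking logs should bring the value close to zero.

```r
# Hypothetical helper: moment-based sample skewness (third standardised moment).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

set.seed(42)
# Simulated stand-ins, NOT the country data: one log-normal, one normal variable.
sim <- data.frame(
  skewed    = rlnorm(500, meanlog = 8, sdlog = 1),
  symmetric = rnorm(500, mean = 45, sd = 10)
)

raw_skew <- sapply(sim, sample_skewness)       # skewness on the original scale
log_skew <- sample_skewness(log(sim$skewed))   # skewness after a log transform

print(round(raw_skew, 2))
print(round(log_skew, 2))
```

On the real data the same helper could be applied with `sapply(cluster_data, sample_skewness)`. Note that `inflation` contains negative values, so it cannot be log-transformed directly.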
x=c(head(country_data[order(country_data$child_mort, decreasing = TRUE),1],5))
y=c(head(country_data[order(country_data$child_mort, decreasing = TRUE),2],5))
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Child Mortality Rate") + ggtitle("Child Mortality by Country")
Except for Haiti, all of these countries are in Africa. Haiti is located in the Caribbean and has the highest child mortality rate of all the studied countries.
x=c(head(country_data[order(country_data$exports, decreasing = FALSE),1],5))
y=c(head(country_data[order(country_data$exports, decreasing = FALSE),"exports"],5)) # plot exports, not column 2 (child_mort)
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Export Rate") + ggtitle("Exports by Country")
Exports are one of the crucial elements in building the economic strength of a country.
x=c(head(country_data[order(country_data$imports, decreasing = FALSE),1],5))
y=c(head(country_data[order(country_data$imports, decreasing = FALSE),"imports"],5)) # plot imports, not column 2 (child_mort)
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Import Rate") + ggtitle("Imports by Country")
We see that countries which export a lot also import a lot, so a high import rate does not mean that a country is in a bad position in the international arena.
x=c(head(country_data[order(country_data$health, decreasing = FALSE),1],5))
y=c(head(country_data[order(country_data$health, decreasing = FALSE),"health"],5)) # plot health, not column 2 (child_mort)
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Capital on health Rate") + ggtitle("Capital on health by Country")
Countries that can allocate a large share of their resources to health are in a good position.
x=c(head(country_data[order(country_data$income, decreasing = FALSE),1],5))
y=c(head(country_data[order(country_data$income, decreasing = FALSE),"income"],5)) # plot income, not column 2 (child_mort)
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Income Rate") + ggtitle("Income by Country")
Countries with lower income would be preferable candidates for aid.
x=c(head(country_data[order(country_data$inflation, decreasing = TRUE),1],5))
y=c(head(country_data[order(country_data$inflation, decreasing = TRUE),"inflation"],5)) # plot inflation, not column 2 (child_mort)
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Inflation Rate") + ggtitle("Inflation by Country")
Countries with higher inflation would be preferable candidates for aid.
x=c(head(country_data[order(country_data$life_expec, decreasing = FALSE),1],5))
y=c(head(country_data[order(country_data$life_expec, decreasing = FALSE),"life_expec"],5)) # plot life_expec, not column 2 (child_mort)
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Years") + ggtitle("Expected life duration by Country")
Countries with lower life expectancy would be preferable candidates for aid.
x=c(head(country_data[order(country_data$total_fer, decreasing = TRUE),1],5))
y=c(head(country_data[order(country_data$total_fer, decreasing = TRUE),"total_fer"],5)) # plot total_fer, not column 2 (child_mort)
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Total fertility rate") + ggtitle("Total Fertility by Country")
x=c(head(country_data[order(country_data$total_fer, decreasing = FALSE),1],5))
y=c(head(country_data[order(country_data$total_fer, decreasing = FALSE),"total_fer"],5)) # plot total_fer, not column 2 (child_mort)
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("Total fertility rate") + ggtitle("Total Fertility by Country")
For the fertility rate the situation is less clear: according to World Bank data, fertility rates are high in less developed countries. On the other hand, developed countries would like to have a higher fertility rate to support their economies.
x=c(head(country_data[order(country_data$gdpp, decreasing = FALSE),1],5))
y=c(head(country_data[order(country_data$gdpp, decreasing = FALSE),"gdpp"],5)) # plot gdpp, not column 2 (child_mort)
df=data.frame(x,y)
ggplot(df, aes(x, y, fill = y)) + geom_col() + xlab("Country") + ylab("GDPP Rate") + ggtitle("GDPP by Country")
Countries with lower GDP per capita would be preferable candidates for aid.
Overall, African countries seem to perform worst on these features. To analyse the data more deeply, unsupervised clustering methods will be used, starting with the choice of an adequate number of clusters. K-Means and PAM are used, as they were presented in our classes. CLARA and CLARANS are not needed here: they are variants of PAM designed for large datasets, and this dataset is small.
Before further analysis I standardise the data, because the variables are measured on different scales.
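The near-identical ranking blocks above could be collapsed into one helper. The following is a minimal sketch, assuming a hypothetical function `top_countries` (not in the original analysis) that returns the n countries with the most extreme values of a named column. The sample rows are the six countries shown by `head(country_data)` earlier, restricted to two columns.

```r
# Hypothetical helper: the n rows with the highest (or lowest) value of `var`.
top_countries <- function(data, var, n = 5, decreasing = TRUE) {
  ord <- order(data[[var]], decreasing = decreasing)
  data[head(ord, n), c("country", var)]
}

# Six rows copied from the head(country_data) output (subset of columns).
sample_data <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria", "Angola",
                 "Antigua and Barbuda", "Argentina"),
  child_mort = c(90.2, 16.6, 27.3, 119.0, 10.3, 14.5),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300)
)

worst_mortality <- top_countries(sample_data, "child_mort", n = 2)
lowest_gdpp     <- top_countries(sample_data, "gdpp", n = 2, decreasing = FALSE)
print(worst_mortality)
print(lowest_gdpp)
```

Selecting the column by name rather than by position avoids accidentally plotting a different variable than the one used for sorting. The returned data frame can be passed straight to `ggplot()` with `geom_col()`.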
cluster_data.s<-scale(cluster_data) # standardised data (obs-mean)/sd
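As a quick illustration of what `scale()` does, the sketch below standardises three `income` values taken from the `head(country_data)` output by hand and compares the result with `scale()`. The variable names are illustrative only.

```r
income <- c(1610, 9930, 12900)  # Afghanistan, Albania, Algeria

# Manual standardisation: subtract the mean, divide by the sample standard deviation.
manual <- (income - mean(income)) / sd(income)

# scale() returns a one-column matrix; drop it to a plain vector for comparison.
scaled <- as.numeric(scale(income))

print(round(manual, 3))
print(all.equal(manual, scaled))
```

After standardisation every column has mean 0 and standard deviation 1, so variables with large raw values such as income or gdpp do not dominate the Euclidean distances used by the clustering algorithms.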
fviz_nbclust(cluster_data.s, FUNcluster = kmeans, method = "silhouette")
fviz_nbclust(cluster_data.s, FUNcluster = kmeans, method = "gap_stat")
fviz_nbclust(cluster_data.s, FUNcluster = kmeans, method = "wss")
Different methods give different hints about how many clusters to choose. I will pick 3 clusters, as that gives the most interpretable result: one cluster for countries that do not require help, one for countries that may require help, and one for countries that must receive help.
fviz_nbclust(cluster_data.s, FUNcluster=cluster::pam, method="silhouette")+ theme_classic()
fviz_nbclust(cluster_data.s, FUNcluster=cluster::pam, method="gap_stat")+ theme_classic()
fviz_nbclust(cluster_data.s, FUNcluster=cluster::pam, method="wss")+ theme_classic()
With PAM the situation is the same, so here as well I pick 3 clusters.
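For reference, the average silhouette width that `fviz_nbclust` plots can also be computed directly. This is a minimal sketch on simulated data (three artificial groups, not the country dataset), assuming the `cluster` package is installed; `avg_sil_width` is a hypothetical helper name.

```r
library(cluster)  # provides pam() and silhouette()

set.seed(123)
# Simulated data: three well-separated groups in two dimensions.
sim <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)
sim_s <- scale(sim)
d <- dist(sim_s)

# Hypothetical helper: average silhouette width of a PAM solution with k clusters.
avg_sil_width <- function(k) {
  fit <- pam(sim_s, k = k)
  mean(silhouette(fit$clustering, d)[, "sil_width"])
}

ks <- 2:6
widths <- sapply(ks, avg_sil_width)
best_k <- ks[which.max(widths)]

print(round(widths, 3))
print(best_k)
```

On real data the maximum is often much less pronounced than in this artificial example, which is one reason why the silhouette, gap statistic and WSS plots can point to different numbers of clusters.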
km1<-eclust(cluster_data.s, "kmeans", hc_metric="euclidean",k=3)
fviz_cluster(km1, main="kmeans / Euclidean")
km1$centers
## child_mort exports health imports income inflation
## 1 -0.8249676 0.64314557 0.7252301 0.19006732 1.4797922 -0.48346661
## 2 -0.4052346 -0.03155768 -0.2237978 0.02408916 -0.2510155 -0.01711594
## 3 1.3561391 -0.43622118 -0.1555163 -0.18863644 -0.6848344 0.40090504
## life_expec total_fer gdpp
## 1 1.0763414 -0.7895024 1.6111498
## 2 0.2539698 -0.4230704 -0.3534185
## 3 -1.2783352 1.3608511 -0.6024306
sil<-silhouette(km1$cluster, dist(cluster_data.s))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 36 0.15
## 2 2 84 0.36
## 3 3 47 0.24
Now that our data is clustered, we need to determine which cluster contains the countries that require help. Let us go back to what we established in the EDA part and take two examples: we know that high child mortality is bad and low GDPP is bad.
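The same decision can also be read off the k-means centres. A minimal sketch, using the standardised `child_mort` and `gdpp` centres copied from the `km1$centers` output above (rounded to three decimals); the object names are illustrative.

```r
# Cluster centres for two variables, copied (rounded) from the km1$centers output.
centers <- rbind(
  "1" = c(child_mort = -0.825, gdpp =  1.611),
  "2" = c(child_mort = -0.405, gdpp = -0.353),
  "3" = c(child_mort =  1.356, gdpp = -0.602)
)

# Cluster with the highest average child mortality and with the lowest average GDPP.
highest_mortality <- rownames(centers)[which.max(centers[, "child_mort"])]
lowest_gdpp       <- rownames(centers)[which.min(centers[, "gdpp"])]

print(highest_mortality)
print(lowest_gdpp)
```

Because k-means numbers its clusters arbitrarily, a check like this is safer than hard-coding a cluster id: the numbering can change between runs with different random starts.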
country_data["cluster"] <- km1$cluster
# treat the cluster id as a factor so that one box is drawn per cluster
ggplot(country_data, aes(x=factor(cluster), y=child_mort)) +
geom_boxplot() +
xlab("cluster")
ggplot(country_data, aes(x=factor(cluster), y=gdpp)) +
geom_boxplot() +
xlab("cluster")
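After checking those two conditions, it is clear that cluster 3, which has the highest child mortality and the lowest GDP per capita, contains the countries that require help:
df_help_k_means <- country_data[country_data$cluster == 3, ]
list_of_countries_that_require_help_k_means <- df_help_k_means$country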
list_of_countries_that_require_help_k_means
## [1] "Afghanistan" "Angola"
## [3] "Benin" "Botswana"
## [5] "Burkina Faso" "Burundi"
## [7] "Cameroon" "Central African Republic"
## [9] "Chad" "Comoros"
## [11] "Congo, Dem. Rep." "Congo, Rep."
## [13] "Cote d'Ivoire" "Equatorial Guinea"
## [15] "Eritrea" "Gabon"
## [17] "Gambia" "Ghana"
## [19] "Guinea" "Guinea-Bissau"
## [21] "Haiti" "Iraq"
## [23] "Kenya" "Kiribati"
## [25] "Lao" "Lesotho"
## [27] "Liberia" "Madagascar"
## [29] "Malawi" "Mali"
## [31] "Mauritania" "Mozambique"
## [33] "Namibia" "Niger"
## [35] "Nigeria" "Pakistan"
## [37] "Rwanda" "Senegal"
## [39] "Sierra Leone" "South Africa"
## [41] "Sudan" "Tanzania"
## [43] "Timor-Leste" "Togo"
## [45] "Uganda" "Yemen"
## [47] "Zambia"
Let us now move to the PAM algorithm.
pam1<-eclust(cluster_data.s, "pam", k=3)
fviz_silhouette(pam1)
## cluster size ave.sil.width
## 1 1 51 0.25
## 2 2 85 0.31
## 3 3 31 0.26
fviz_cluster(pam1)
sil.p<-silhouette(pam1$cluster, dist(cluster_data.s))
fviz_silhouette(sil.p)
## cluster size ave.sil.width
## 1 1 51 0.25
## 2 2 85 0.31
## 3 3 31 0.26
The check of cluster assignments that was done for k-means is required here as well.
country_data["cluster_pam"] <- pam1$cluster
# treat the cluster id as a factor so that one box is drawn per cluster
ggplot(country_data, aes(x=factor(cluster_pam), y=child_mort)) +
geom_boxplot() +
xlab("cluster_pam")
ggplot(country_data, aes(x=factor(cluster_pam), y=gdpp)) +
geom_boxplot() +
xlab("cluster_pam")
With PAM the cluster numbering is different:
Cluster 1 : Countries that must receive financial help.
Cluster 2 : Countries that are in between, for which it should be carefully considered whether they require help.
Cluster 3 : Countries that do not require any help.
df_help_pam <- country_data[country_data$cluster_pam == 1, ]
list_of_countries_that_require_help_pam <- df_help_pam$country
list_of_countries_that_require_help_pam
## [1] "Afghanistan" "Angola"
## [3] "Benin" "Botswana"
## [5] "Burkina Faso" "Burundi"
## [7] "Cameroon" "Central African Republic"
## [9] "Chad" "Comoros"
## [11] "Congo, Dem. Rep." "Congo, Rep."
## [13] "Cote d'Ivoire" "Equatorial Guinea"
## [15] "Eritrea" "Gabon"
## [17] "Gambia" "Ghana"
## [19] "Guinea" "Guinea-Bissau"
## [21] "Haiti" "India"
## [23] "Iraq" "Kenya"
## [25] "Kiribati" "Lao"
## [27] "Lesotho" "Liberia"
## [29] "Madagascar" "Malawi"
## [31] "Mali" "Mauritania"
## [33] "Mozambique" "Myanmar"
## [35] "Namibia" "Nepal"
## [37] "Niger" "Nigeria"
## [39] "Pakistan" "Rwanda"
## [41] "Senegal" "Sierra Leone"
## [43] "South Africa" "Sudan"
## [45] "Tajikistan" "Tanzania"
## [47] "Timor-Leste" "Togo"
## [49] "Uganda" "Yemen"
## [51] "Zambia"
x <- intersect(list_of_countries_that_require_help_pam,list_of_countries_that_require_help_k_means)
x
## [1] "Afghanistan" "Angola"
## [3] "Benin" "Botswana"
## [5] "Burkina Faso" "Burundi"
## [7] "Cameroon" "Central African Republic"
## [9] "Chad" "Comoros"
## [11] "Congo, Dem. Rep." "Congo, Rep."
## [13] "Cote d'Ivoire" "Equatorial Guinea"
## [15] "Eritrea" "Gabon"
## [17] "Gambia" "Ghana"
## [19] "Guinea" "Guinea-Bissau"
## [21] "Haiti" "Iraq"
## [23] "Kenya" "Kiribati"
## [25] "Lao" "Lesotho"
## [27] "Liberia" "Madagascar"
## [29] "Malawi" "Mali"
## [31] "Mauritania" "Mozambique"
## [33] "Namibia" "Niger"
## [35] "Nigeria" "Pakistan"
## [37] "Rwanda" "Senegal"
## [39] "Sierra Leone" "South Africa"
## [41] "Sudan" "Tanzania"
## [43] "Timor-Leste" "Togo"
## [45] "Uganda" "Yemen"
## [47] "Zambia"
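Comparing the two printed lists also shows which countries PAM flags in addition to k-means. A short sketch with `setdiff()`, using only an excerpt of both lists (the variable names `kmeans_excerpt` and `pam_excerpt` are illustrative):

```r
# Excerpts of the two lists printed above, not the full output.
kmeans_excerpt <- c("Haiti", "Iraq", "Kenya", "Mozambique", "Namibia")
pam_excerpt    <- c("Haiti", "India", "Iraq", "Kenya", "Mozambique",
                    "Myanmar", "Namibia")

# Countries flagged by one algorithm but not by the other.
only_pam    <- setdiff(pam_excerpt, kmeans_excerpt)
only_kmeans <- setdiff(kmeans_excerpt, pam_excerpt)

print(only_pam)
print(only_kmeans)
```

On the full lists, `setdiff(list_of_countries_that_require_help_pam, list_of_countries_that_require_help_k_means)` gives the four countries that appear only in the PAM output: India, Myanmar, Nepal and Tajikistan. Every country in the k-means list also appears in the PAM list, which is why the intersection has 47 entries.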
After analysing the data with the K-means and PAM algorithms, I have come to a conclusion on the most effective and strategic use of the available funds. My recommendation is to distribute the funds evenly among the 47 countries that were selected by both the K-Means and the PAM analysis, which are stored in the variable x.
To arrive at this conclusion, several key steps were taken. First, the data was checked for missing values and standardised. Next, the K-means and PAM algorithms were applied to cluster the countries into groups with similar socio-economic and health characteristics. Finally, for each algorithm the cluster with the highest child mortality and the lowest GDP per capita was selected, and the two resulting lists were intersected.
The distribution of funds to the 47 countries identified through the analysis can help to address key development challenges and support sustainable economic growth in those regions. This can be achieved by investing in key areas such as education, healthcare, infrastructure, and employment opportunities, which can ultimately lead to improved quality of life for citizens and increased economic prosperity.
Moreover, distributing the funds evenly to the 47 countries can ensure that the impact of the funds is maximized, as all countries receive a fair share of the resources available. This approach can also help to avoid any potential biases or inequalities in the distribution of funds, ensuring that all countries are given an equal opportunity to benefit from the investment.
In conclusion, distributing the funds evenly to the 47 countries identified through the K-Means and PAM analysis is the most effective and strategic approach to using the available funds. This approach can help to address key development challenges, support sustainable economic growth, and ensure equitable distribution of resources.
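For a sense of scale, the even split can be worked out directly. A small sketch, assuming the budget is exactly $10 million (the brief only says "around") and that nothing is held back for overheads:

```r
budget      <- 10e6  # assumed exact budget in USD
n_countries <- 47    # countries selected by both K-Means and PAM

per_country <- budget / n_countries
print(round(per_country))
```

This comes to roughly $213,000 per country.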