Introduction

With almost 40 million inhabitants and a diverse geography that encompasses the Andes mountains, glacial lakes, and the Pampas grasslands, Argentina is the second largest country in South America by area and has one of the largest economies in the region. It is politically organized as a federation of 23 provinces and an autonomous city, Buenos Aires.

Tasks

We will analyze ten economic and social indicators collected for each province. Because these indicators are highly correlated, we will use principal component analysis (PCA) to reduce redundancies and highlight patterns that are not apparent in the raw data. After visualizing the patterns, we will use k-means clustering to partition the provinces into groups with similar development levels. These results can be used to plan public policy by helping allocate resources to develop infrastructure, education, and welfare programs.

Data Preparation

These are the libraries we will use in this analysis.

library(tidyverse)
library(GGally)
library(factoextra)
library(FactoMineR)
library(ggrepel)

Read the data and inspect its structure.

prov <- read.csv("argentina.csv")
glimpse(prov)
## Rows: 22
## Columns: 11
## $ province               <chr> "Buenos Aires", "Catamarca", "Córdoba", "Co...
## $ gdp                    <dbl> 292689868, 6150949, 69363739, 7968013, 98326...
## $ illiteracy             <dbl> 1.383240, 2.344140, 2.714140, 5.602420, 7.51...
## $ poverty                <dbl> 8.167798, 9.234095, 5.382380, 12.747191, 15....
## $ deficient_infra        <dbl> 5.511856, 10.464484, 10.436086, 17.438858, 3...
## $ school_dropout         <dbl> 0.7661682, 0.9519631, 1.0350558, 3.8642652, ...
## $ no_healthcare          <dbl> 48.7947, 45.0456, 45.7640, 62.1103, 65.5104,...
## $ birth_mortal           <dbl> 4.4, 1.5, 4.8, 5.9, 7.5, 3.0, 3.1, 16.2, 3.7...
## $ pop                    <int> 15625084, 367828, 3308876, 992595, 1055259, ...
## $ movie_theatres_per_cap <dbl> 6.015968e-06, 5.437324e-06, 1.118204e-05, 4....
## $ doctors_per_cap        <dbl> 0.004835622, 0.004502104, 0.010175359, 0.004...

Based on the result, we don’t need to change any data types. Let’s check whether there are any missing values (NA).

colSums(is.na(prov))
##               province                    gdp             illiteracy 
##                      0                      0                      0 
##                poverty        deficient_infra         school_dropout 
##                      0                      0                      0 
##          no_healthcare           birth_mortal                    pop 
##                      0                      0                      0 
## movie_theatres_per_cap        doctors_per_cap 
##                      0                      0

As we can see, the dataset doesn’t have any NA values, so we can continue to the next step.

Unsupervised Learning

Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention. Its ability to discover similarities and differences in information makes it well suited to exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition. In this case, we use unsupervised learning for exploratory data analysis and to profile the provinces of Argentina, so that budget allocation for their development can be planned.

Principal Component Analysis

PCA is an unsupervised learning technique that summarizes multivariate data by reducing redundancies (variables that are correlated). The new variables (the principal components) are linear combinations of the original variables that retain as much variation as possible.
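To make "linear combinations" concrete, here is a minimal sketch in base R. It is an illustration only: it uses the built-in USArrests dataset as a stand-in for the provincial data, and base R's prcomp rather than the FactoMineR function used in this analysis.

```r
# Stand-in data: the built-in USArrests dataset (NOT the Argentina data).
x <- scale(USArrests)   # center and scale each variable
pca <- prcomp(x)        # base R PCA, used here instead of FactoMineR::PCA

# Each component score is a weighted sum of the standardized variables;
# the weights are the columns of pca$rotation.
scores_manual <- x %*% pca$rotation

# The manual linear combinations match the scores returned by prcomp.
max_diff <- max(abs(scores_manual - pca$x))

# Correlations between different components are zero up to rounding error.
pc_cor <- cor(pca$x)
max_offdiag <- max(abs(pc_cor[upper.tri(pc_cor)]))

# Share of the total variance retained by each component (decreasing).
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_share, 3))
```

The first components carry the largest shares of the variance, which is why a few of them can stand in for the full set of correlated variables.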

To decide whether PCA is worthwhile, we first check whether the variables are correlated.

ggcorr(prov, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

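The ggcorr output shows that the variables are correlated with one another, so PCA is appropriate. We run it with the PCA function from FactoMineR, declaring province (column 1) as a supplementary qualitative variable: it is excluded from the computation, but the province names stay attached to the results. Setting scale.unit to TRUE standardizes the variables, which matters here because they are measured in very different units.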

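prov_pca <- PCA(X = prov,
                scale.unit = TRUE,  # standardize every variable
                quali.sup = 1,      # province: supplementary, not used in the PCA
                graph = FALSE,
                ncp = 10)           # keep all 10 components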

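As the next step, let’s check whether the principal components are still correlated with each other. Because every province appears exactly once, the supplementary-category coordinates in quali.sup$coord coincide with the provinces’ own component scores, with the province names as row names.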

ggcorr(prov_pca$quali.sup$coord, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

After PCA, the components are uncorrelated with one another. Before clustering, we need to know how many components are required to explain at least 90% of the variance in the dataset.

prov_pca$eig
##          eigenvalue percentage of variance cumulative percentage of variance
## comp 1  4.459990699            44.59990699                          44.59991
## comp 2  1.942356927            19.42356927                          64.02348
## comp 3  1.041920805            10.41920805                          74.44268
## comp 4  0.934245121             9.34245121                          83.78514
## comp 5  0.724781967             7.24781967                          91.03296
## comp 6  0.387604114             3.87604114                          94.90900
## comp 7  0.222161067             2.22161067                          97.13061
## comp 8  0.153922307             1.53922307                          98.66983
## comp 9  0.129975560             1.29975560                          99.96959
## comp 10 0.003041433             0.03041433                         100.00000
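The two percentage columns follow directly from the eigenvalues. The sketch below recomputes them in base R from the values printed above and finds the smallest number of components that reaches the 90% threshold. The variable names (eigenvalue, pct_var, cum_pct, n_keep) are ours, chosen for illustration.

```r
# Eigenvalues copied from the prov_pca$eig output above.
eigenvalue <- c(4.459990699, 1.942356927, 1.041920805, 0.934245121,
                0.724781967, 0.387604114, 0.222161067, 0.153922307,
                0.129975560, 0.003041433)

# With 10 standardized variables, the eigenvalues sum to 10,
# so each percentage is simply eigenvalue / 10 * 100.
pct_var <- 100 * eigenvalue / sum(eigenvalue)
cum_pct <- cumsum(pct_var)

# Smallest number of components whose cumulative share reaches 90%.
n_keep <- which(cum_pct >= 90)[1]

print(round(cum_pct, 2))
print(n_keep)
```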

Based on the summary, the first 5 components retain 91% of the variance of all variables. Let’s extract these component scores into a data frame for clustering; the code below drops all of the original columns, so only the five component scores remain.

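# Keep the scores on the first five components (rows are named by province)
pc_keep <- prov_pca$quali.sup$coord[, 1:5] %>% 
  as.data.frame()
# Drop every original column, then attach the component scores
mycols <- colnames(prov)
colsnumber <- match(mycols, names(prov))
new_prov <- prov %>% 
  select(-all_of(colsnumber)) %>% 
  bind_cols(pc_keep) 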

With the component scores in place, we can use clustering to group similar observations (in this case, provinces).

Clustering

Clustering is a method for grouping data based on its characteristics (distance). The goal is to obtain clusters that are internally homogeneous (observations within the same cluster are similar) and mutually heterogeneous (observations in different clusters are dissimilar).
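Homogeneity within clusters and heterogeneity between them can be measured with sums of squares. The following sketch is an illustration on synthetic points, not the provincial data; it shows how base R's kmeans reports both quantities.

```r
set.seed(1)
# Synthetic stand-in data: two well-separated groups of 20 points in 2D.
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))

km <- kmeans(pts, centers = 2, nstart = 10)

# The total variation splits into a within-cluster part (how spread out
# each cluster is) and a between-cluster part (how far apart they are).
within_ss  <- km$tot.withinss
between_ss <- km$betweenss
total_ss   <- km$totss

# A ratio close to 1 means tight clusters that are far from one another.
ratio <- between_ss / total_ss
print(round(ratio, 3))
```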

Before clustering the data, we need to choose the number of clusters.

fviz_nbclust(x = new_prov, FUNcluster = kmeans, method = "wss")

Based on the elbow plot, 4 clusters appears to be a reasonable choice.
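The "wss" method plots the total within-cluster sum of squares for each candidate number of clusters. As a rough sketch of that idea in base R, using synthetic points with three groups as a stand-in for the provincial data:

```r
set.seed(42)
# Synthetic stand-in data with three groups (NOT the provincial data).
pts <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(pts, centers = k, nstart = 25)$tot.withinss
})

# Improvement gained by adding one more cluster; the "elbow" is the
# point after which these gains become small.
gain <- -diff(wss)

print(round(wss, 1))
print(round(gain, 1))
```

Reading the elbow remains a judgment call: the curve shows where extra clusters stop paying off, but it does not single out one number automatically.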

RNGkind(sample.kind = "Rounding")
set.seed(123)
prov_cluster <- kmeans(x = new_prov, centers = 4)

Now let’s visualize the clusters.

fviz_cluster(object = prov_cluster, data = new_prov)

With the cluster assignments in hand, we can add them to the original data and then do exploratory data analysis to gain insight.

prov <- prov %>% 
  mutate(cluster = as.factor(prov_cluster$cluster))
head(prov)

Exploratory Data Analysis

We explore the data to gain insight, in this case into which clusters most need development. Before exploring, we add two new columns, gdp_per_cap and percentage. GDP is a measure of the size of a province’s economy. To measure how rich or poor the inhabitants are, economists use per capita GDP, which is GDP divided by the province’s population (this is gdp_per_cap). percentage is the province’s population divided by the total population, and will be used to see how the population is distributed across Argentina.

prov <- prov %>% 
  mutate(gdp_per_cap = gdp / pop,
         percentage = pop / sum(pop))
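As a quick worked example of these two formulas, the sketch below applies them in base R to the first three provinces, using the gdp and pop values visible in the glimpse() output above. The data frame name sample_prov and the column share_of_sample are ours; the share is computed within this three-province sample only, so it will not match the percentage column of the full dataset.

```r
# First three rows of gdp and pop as shown in the glimpse() output above.
sample_prov <- data.frame(
  province = c("Buenos Aires", "Catamarca", "Córdoba"),
  gdp = c(292689868, 6150949, 69363739),
  pop = c(15625084, 367828, 3308876)
)

# Per capita GDP: size of the economy divided by the number of inhabitants.
sample_prov$gdp_per_cap <- sample_prov$gdp / sample_prov$pop

# Population share within this sample only (not of the whole country).
sample_prov$share_of_sample <- sample_prov$pop / sum(sample_prov$pop)

# Sort from highest to lowest GDP per capita.
ranked <- sample_prov[order(-sample_prov$gdp_per_cap), ]
print(ranked)
```

Even in this small sample, the province with the largest GDP is not the one with the highest GDP per capita: Córdoba comes out at roughly 21.0 per inhabitant against roughly 18.7 for Buenos Aires, in the dataset’s units.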

Population Distribution

First, let’s look at the population distribution in Argentina.

# Provinces with more than one million inhabitants, largest first
bigpop_prov <- prov %>% 
  arrange(desc(pop)) %>%
  select(province, pop, percentage, cluster) %>%
  filter(pop > 1000000)
bigpop_prov

As we can see, although Argentina ranks third in South America in total population, that population is unevenly distributed throughout the country. About 60% of it resides in the Pampa region (Buenos Aires, La Pampa, Santa Fe, Entre Ríos and Córdoba), which encompasses only about 20% of the land area. Buenos Aires alone has 42% of the population in this dataset. Note that the Buenos Aires row, with about 15.6 million inhabitants, matches the size of Buenos Aires Province rather than the Autonomous City of Buenos Aires, the national capital.

Rich Provinces

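# Top 9 provinces by GDP per capita. top_n(9) without a weight would rank
# by the last selected column (cluster), so we rank by gdp_per_cap explicitly.
rich_prov <- prov %>% 
  select(province, gdp_per_cap, cluster) %>%
  slice_max(gdp_per_cap, n = 9)
rich_prov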

Although Buenos Aires has by far the largest population, Chubut, San Luis, Santa Fe, and Mendoza are richer than Buenos Aires in terms of GDP per capita.

Graph of Variables

plot.PCA(prov_pca,
         choix = "var")

We will refer to this graph in the analysis below.

Buenos Aires, the highest GDP province

The vectors corresponding to gdp and pop point in the same direction as Dimension 2, so a province that scores high on that dimension, as Buenos Aires does, has high GDP and a large population. Let’s visualize this pattern with a plot of gdp against cluster (we should get similar results with pop).

ggplot(prov, aes(cluster, gdp, color = cluster)) +
  geom_point()+
  geom_text_repel(aes(label = province), show.legend = FALSE) +
  labs(x = "Cluster", y = "GDP")

Based on the result, we can see that the GDP of Buenos Aires is far higher than that of any other province. Clusters 1 and 2 have similar GDP, while cluster 4 has the lowest GDP.

Highest GDP per Capita Cluster

If we plot gdp_per_cap for each cluster, we can see that about half of the provinces in cluster 1 have greater GDP per capita than the provinces in the other clusters. We would see similar results for movie_theatres_per_cap and doctors_per_cap.

ggplot(prov, aes(cluster, gdp_per_cap, color = cluster)) +
  geom_point()+
  geom_text_repel(aes(label = province), show.legend = FALSE) +
  labs(x = "Cluster", y = "GDP per capita")

Based on the result, we can see that Buenos Aires doesn’t have the highest GDP per capita: its GDP is the largest, but so is its population, which the GDP is divided by. As in the previous analysis, cluster 4 has the lowest GDP per capita, while clusters 1 and 2 have similar distributions of GDP per capita.

Poorest Cluster

As shown in the graph of variables, provinces with high positive values in Dimension 1 have high values in poverty, deficient infrastructure, etc. These variables are also negatively correlated with gdp_per_cap, so these provinces have low values in that variable.

ggplot(prov, aes(cluster, poverty, color = cluster)) +
  geom_point()+
  labs(x = "Cluster", y = "Poverty rate") +
  geom_text_repel(aes(label = province), show.legend = FALSE)

Based on the result, we can see that cluster 4 has the highest poverty rate, which is consistent with its having the lowest GDP and the lowest GDP per capita. The other clusters have poverty distributions similar to one another.

Overall Profiling

To see the broad outline, we can profile the clusters by computing the mean of each variable per cluster.

# Mean of every numeric variable within each cluster
profil <- prov %>%
   select(-province) %>% 
   group_by(cluster) %>% 
   summarise(across(everything(), mean))
head(profil)

Based on this profiling, we can see that clusters 1 and 4 need more development than clusters 2 and 3. Cluster 4 has the lowest GDP per capita and the highest illiteracy, poverty, deficient_infra, etc., followed by clusters 1, 2, and 3 in that order.

Conclusion

The government should prioritize budget allocation for developing the provinces in cluster 1 and cluster 4, but not to the exclusion of the others, because concentrating development spending in these two clusters alone could cause massive inflation. The provinces in each cluster are listed below.

prov %>% 
  select(province, cluster) %>% 
  arrange(cluster)