1. Provinces of Argentina
With almost 40 million inhabitants and a diverse geography that encompasses the Andes mountains, glacial lakes, and the Pampas grasslands, Argentina is the second largest country (by area) and has one of the largest economies in South America. It is politically organized as a federation of 23 provinces and an autonomous city, Buenos Aires.
We will analyze ten economic and social indicators collected for each province. Because these indicators are highly correlated, we will use Principal Component Analysis (PCA) to reduce redundancies and highlight patterns that are not apparent in the raw data. After visualizing the patterns, we will use k-means clustering to partition the provinces into groups with similar development levels.
These results can be used to plan public policy by helping allocate resources to develop infrastructure, education, and welfare programs.
Setup
# Load the tidyverse
library(tidyverse)
# Read in the dataset
argentina <- read_csv("datasets/argentina.csv")
## # A tibble: 6 x 11
## province gdp illiteracy poverty deficient_infra school_dropout
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Buenos … 2.93e8 1.38 8.17 5.51 0.766
## 2 Catamar… 6.15e6 2.34 9.23 10.5 0.952
## 3 Córdoba 6.94e7 2.71 5.38 10.4 1.04
## 4 Corrien… 7.97e6 5.60 12.7 17.4 3.86
## 5 Chaco 9.83e6 7.52 15.9 31.5 2.58
## 6 Chubut 1.77e7 1.55 8.05 8.04 0.586
## # … with 5 more variables: no_healthcare <dbl>, birth_mortal <dbl>,
## # pop <dbl>, movie_theatres_per_cap <dbl>, doctors_per_cap <dbl>
2. Most populous, richest provinces
Argentina ranks third in South America in total population, but the population is unevenly distributed throughout the country. Sixty percent of the population resides in the Pampa region (Buenos Aires, La Pampa, Santa Fe, Entre Ríos and Córdoba) which only encompasses about 20% of the land area.
GDP is a measure of the size of a province’s economy. To measure how rich or poor the inhabitants are, economists use per capita GDP, which is GDP divided by the province’s population.
# Find the four richest provinces
( rich_provinces <- argentina %>%
arrange(-gdp_per_cap) %>%
select(province, gdp_per_cap) %>%
top_n(4) )
## Selecting by gdp_per_cap
## # A tibble: 4 x 2
## province gdp_per_cap
## <chr> <dbl>
## 1 Santa Cruz 42.6
## 2 Neuquén 40.9
## 3 Chubut 34.9
## 4 San Luis 27.3
# Find the provinces with populations over 1 million
( bigger_pops <- argentina %>%
arrange(pop) %>%
select(province, pop) %>%
filter(pop > 1000000) )
## # A tibble: 9 x 2
## province pop
## <chr> <dbl>
## 1 Chaco 1055259
## 2 Misiones 1101593
## 3 Salta 1214441
## 4 Entre Ríos 1235994
## 5 Tucumán 1448188
## 6 Mendoza 1738929
## 7 Santa Fe 3194537
## 8 Córdoba 3308876
## 9 Buenos Aires 15625084
3. A matrix for PCA
Principal Component Analysis (PCA) is an unsupervised learning technique that summarizes multivariate data by reducing redundancies (variables that are correlated). New variables (the principal components) are linear combinations of the original data that retain as much variation as possible. We would imagine that some aspects of economic and social data would be highly correlated, so let’s see what pops out. But first, we need to do some data preparation.
R makes it easy to run a PCA with the PCA()
function from the FactoMineR
package. The first argument in PCA()
is a data frame or matrix of the data where the rows are “individuals” (or in our case, provinces) and columns are numeric variables. To prepare for the analysis, we will remove the column of province names and build a matrix from the dataset.
# Select numeric columns and cast to matrix
argentina_matrix <- argentina %>%
select_if(is.numeric) %>%
as.matrix()
# Print the first lines of the result
head(argentina_matrix)
## gdp illiteracy poverty deficient_infra school_dropout
## [1,] 292689868 1.38324 8.167798 5.511856 0.7661682
## [2,] 6150949 2.34414 9.234095 10.464484 0.9519631
## [3,] 69363739 2.71414 5.382380 10.436086 1.0350558
## [4,] 7968013 5.60242 12.747191 17.438858 3.8642652
## [5,] 9832643 7.51758 15.862619 31.479527 2.5774621
## [6,] 17747854 1.54806 8.051752 8.044618 0.5863094
## no_healthcare birth_mortal pop movie_theatres_per_cap
## [1,] 48.7947 4.4 15625084 0.000006015968
## [2,] 45.0456 1.5 367828 0.000005437324
## [3,] 45.7640 4.8 3308876 0.000011182045
## [4,] 62.1103 5.9 992595 0.000004029841
## [5,] 65.5104 7.5 1055259 0.000002842904
## [6,] 39.5473 3.0 509108 0.000015713758
## doctors_per_cap gdp_per_cap
## [1,] 0.004835622 18.732051
## [2,] 0.004502104 16.722352
## [3,] 0.010175359 20.962931
## [4,] 0.004495288 8.027456
## [5,] 0.003604802 9.317753
## [6,] 0.004498063 34.860686
4. Reducing dimensions
PCA finds a lower dimensional representation of the data that keeps the maximum amount of variance. It’s great for analyzing multivariate datasets, like this one, with multiple numerical columns that are highly correlated. Typically, the first few components preserve most of the information in the raw data, allowing us, to go from eleven dimensions (eleven original variables) down to two dimensions (two variables that are summaries of the original eleven).
To run PCA, we need to make sure all the variables are on similar scales. Otherwise, variables with large variance will be overrepresented. In PCA()
setting scale.unit = TRUE
ensures that variables are scaled to unit variance before crunching the numbers.
Feel free to explore the output!
# Load FactoMineR
library(FactoMineR)
# Apply PCA and print results
( argentina_pca <- PCA(argentina_matrix, scale.unit = TRUE) )
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 22 individuals, described by 11 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
5. PCA: Variables & Components
Now that we have the principal components, we can see how the original variables are correlated among themselves and how the original variables are correlated with the principal components. We will build a plot using the factoextra
package to help us understand these relationships. A correlation circle plot (also known as a variable correlation plot) shows the relationship among all variables as they are plotted on the first two principal components (Dimension 1 and Dimension 2).
To understand the plot, note that:
- Positively correlated variables have similar vectors.
- The vectors of negatively correlated variables are on opposite sides of the plot origin (opposite quadrants).
- Each axis represents a principal component. Vectors pointing in the direction of the component are correlated with that component.
- The percentage of the original variance explained by each component (dimension) is given in parentheses in the axes labels.
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
# Set the size of plots in this notebook
options(repr.plot.width=7, repr.plot.height=5)
# Plot the original variables and the first 2 components and print the plot object.
( pca_var_plot <- fviz_pca_var(argentina_pca) )
# Sum the variance preserved by the first two components. Print the result.
( variance_first_two_pca <- argentina_pca$eig[2, 1] + argentina_pca$eig[2, 2] )
## [1] 20.25001
6. Plotting the components
With the first two principal components representing almost 65% of the variance, most of the information we are interested in is summarized in these two components. From the variable correlation plot, we can see that population and GDP are highly correlated; illiteracy, poverty, no healthcare, school dropout, and deficient infrastructure are correlated; and GDP per capita and movie theaters per capita are correlated.
But how do these correlations map to the provinces? To dive into that question, let’s plot the individual principal components for each province and look for clusters.
7. Cluster using K means
It looks like one province stands out and the rest follow the gradient along the second dimension. Are there clusters we are not detecting? Let’s use K-means clustering to see if there are patterns we are not detecting.
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## 1 -2.3614699 5.85722972 0.02887238 1.1415873 -0.55664309
## 2 -0.6584545 -0.48651253 -1.29107656 -0.2415875 -0.03742897
## 3 -2.5752814 0.52381577 1.70307390 -0.9930687 2.11197740
## 4 2.8268547 0.15875925 0.48894162 -1.0256780 0.09720908
## 5 4.3000823 0.09073316 0.08145973 0.8101764 1.08427385
## 6 -2.7072455 -1.36535411 -0.23583031 0.9771357 -0.60816245
# Create an intermediate data frame with pca_1 and pca_2
argentina_comps <- tibble(pca_1 = argentina_pca$ind$coord[, 1],
pca_2 = argentina_pca$ind$coord[, 2])
# Cluster the observations using the first 2 components and print its contents
( argentina_km <- kmeans(argentina_comps, centers = 4, nstart = 20, iter.max = 50) )
## K-means clustering with 4 clusters of sizes 6, 1, 8, 7
##
## Cluster means:
## pca_1 pca_2
## 1 3.1637648 0.1200775
## 2 -2.3614699 5.8572297
## 3 -0.1320515 -0.3199319
## 4 -2.2235295 -0.5740342
##
## Clustering vector:
## [1] 2 3 4 1 1 4 3 1 3 4 3 4 1 4 3 1 3 3 4 4 1 3
##
## Within cluster sum of squares by cluster:
## [1] 4.375350 0.000000 3.109136 8.403846
## (between_SS / total_SS = 89.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
8. Components with colors
Now that we have cluster assignments for each province, we will plot the provinces according to their principal components coordinates, colored by the cluster.
# Convert assigned clusters to factor
clusters_as_factor <- factor(argentina_km$cluster)
# Plot individulas colored by cluster
fviz_pca_ind(argentina_pca,
title = "Clustered Provinces - PCA",
habillage = clusters_as_factor)
9. Buenos Aires, in a league of its own
A few things to note from the scatter plot:
- Cluster 1 includes only Buenos Aires and has a large positive value in Dimension 2 with an intermediate negative value in Dimension 1.
- Cluster 2 has the greatest negative values in Dimension 1.
- Cluster 3 has the greatest positive values in Dimension 1.
- Cluster 4 has small absolute values in Dimension 1.
- Clusters 2, 3, and 4, all have small absolute values in Dimension 2.
We will focus on exploring clusters 1, 2, and 3 in terms of the original variables in the next few tasks.
As we noted earlier, Buenos Aires is in a league of its own, with the largest positive value in Dimension 2 by far. The figure below is a biplot, a combination of the individuals plot from Task 6 and the circle plot from Task 5.
Since the vectors corresponding to gdp
and pop
are in the same direction as Dimension 2, Buenos Aires has high GDP and high population. Let’s visualize this pattern with a plot of gdp
against cluster
(we should get similar results with pop
).
library(ggrepel)
# Add cluster column to argentina
argentina <- argentina %>%
mutate(cluster= clusters_as_factor)
# Make a scatterplot of gdp vs. cluster, colored by cluster
ggplot(argentina, aes(y = gdp/1000000, x = cluster, color = cluster)) +
geom_point() +
geom_text_repel(aes(label = province), show.legend = FALSE) +
labs(x = "Cluster", y = "GDP in million") +
ggtitle("Argentina's GDP vs Province Clusters")
10. The rich provinces
Provinces in cluster 2 have large negative values in Dimension 1. The biplot shows that gdp_per_cap
, movie_theaters_per_cap
and doctors_per_cap
also have high negative values in Dimension 1.
If we plot gdp_per_cap
for each cluster, we can see that provinces in this cluster 2, in general, have greater GDP per capita than the provinces in the other clusters. San Luis is the only province from the other clusters with gdp_per_cap
in the range of values observed in cluster 2. We will see similar results for movie_theaters_per_cap
and doctors_per_cap
.
# Make a scatterplot of GDP per capita vs. cluster, colored by cluster
ggplot(argentina, aes(y = gdp_per_cap, x = cluster, color = cluster)) +
geom_point() +
geom_text_repel(aes(label = province), show.legend = FALSE) +
labs(x = "Cluster", y = "GDP per capita") +
ggtitle("Argentina's GDP per capita vs Province Clusters")
11. The poor provinces
Provinces in Cluster 3 have high positive values in Dimension 1. As shown in the biplot, provinces with high positive values in Dimension 1 have high values in poverty, deficient infrastructure, etc. These variables are also negatively correlated with gdp_per_cap
, so these provinces have low values in this variable.
# Make scatterplot of poverty vs. cluster, colored by cluster
ggplot(argentina, aes(x = cluster, y = poverty, color = cluster)) +
geom_point() +
labs(x = "Cluster", y = "Poverty rate") +
geom_text_repel(aes(label = province), show.legend = FALSE) +
ggtitle("Argentina's Proverty vs Province Clusters")
12. Planning for public policy
Now that we have an idea of how social and economic welfare varies among provinces, we’ve been asked to help plan an education program. A pilot phase of the program will be carried out to identify design issues. Our goal is to select the proposal with the most diverse set of provinces:
- Tucumán, San Juán, and Entre Ríos
- Córdoba, Santa Fé, and Mendoza
- Buenos Aires, Santa Cruz, and Misiones
Which proposal includes the most diverse set of provinces?