1. Provinces of Argentina

With almost 40 million inhabitants and a diverse geography that encompasses the Andes mountains, glacial lakes, and the Pampas grasslands, Argentina is the second largest country (by area) and has one of the largest economies in South America. It is politically organized as a federation of 23 provinces and an autonomous city, Buenos Aires.

We will analyze ten economic and social indicators collected for each province. Because these indicators are highly correlated, we will use Principal Component Analysis (PCA) to reduce redundancies and highlight patterns that are not apparent in the raw data. After visualizing the patterns, we will use k-means clustering to partition the provinces into groups with similar development levels.

These results can be used to plan public policy by helping allocate resources to develop infrastructure, education, and welfare programs.

Setup

## # A tibble: 6 x 11
##   province    gdp illiteracy poverty deficient_infra school_dropout
##   <chr>     <dbl>      <dbl>   <dbl>           <dbl>          <dbl>
## 1 Buenos … 2.93e8       1.38    8.17            5.51          0.766
## 2 Catamar… 6.15e6       2.34    9.23           10.5           0.952
## 3 Córdoba  6.94e7       2.71    5.38           10.4           1.04 
## 4 Corrien… 7.97e6       5.60   12.7            17.4           3.86 
## 5 Chaco    9.83e6       7.52   15.9            31.5           2.58 
## 6 Chubut   1.77e7       1.55    8.05            8.04          0.586
## # … with 5 more variables: no_healthcare <dbl>, birth_mortal <dbl>,
## #   pop <dbl>, movie_theatres_per_cap <dbl>, doctors_per_cap <dbl>

2. Most populous, richest provinces

Argentina ranks third in South America in total population, but the population is unevenly distributed throughout the country. Sixty percent of the population resides in the Pampa region (Buenos Aires, La Pampa, Santa Fe, Entre Ríos and Córdoba) which only encompasses about 20% of the land area.

GDP is a measure of the size of a province’s economy. To measure how rich or poor the inhabitants are, economists use per capita GDP, which is GDP divided by the province’s population.

## Selecting by gdp_per_cap
## # A tibble: 4 x 2
##   province   gdp_per_cap
##   <chr>            <dbl>
## 1 Santa Cruz        42.6
## 2 Neuquén           40.9
## 3 Chubut            34.9
## 4 San Luis          27.3
## # A tibble: 9 x 2
##   province          pop
##   <chr>           <dbl>
## 1 Chaco         1055259
## 2 Misiones      1101593
## 3 Salta         1214441
## 4 Entre Ríos    1235994
## 5 Tucumán       1448188
## 6 Mendoza       1738929
## 7 Santa Fe      3194537
## 8 Córdoba       3308876
## 9 Buenos Aires 15625084

3. A matrix for PCA

Principal Component Analysis (PCA) is an unsupervised learning technique that summarizes multivariate data by reducing redundancies (variables that are correlated). New variables (the principal components) are linear combinations of the original data that retain as much variation as possible. We would imagine that some aspects of economic and social data would be highly correlated, so let’s see what pops out. But first, we need to do some data preparation.

R makes it easy to run a PCA with the PCA() function from the FactoMineR package. The first argument in PCA() is a data frame or matrix of the data where the rows are “individuals” (or in our case, provinces) and columns are numeric variables. To prepare for the analysis, we will remove the column of province names and build a matrix from the dataset.

##            gdp illiteracy   poverty deficient_infra school_dropout
## [1,] 292689868    1.38324  8.167798        5.511856      0.7661682
## [2,]   6150949    2.34414  9.234095       10.464484      0.9519631
## [3,]  69363739    2.71414  5.382380       10.436086      1.0350558
## [4,]   7968013    5.60242 12.747191       17.438858      3.8642652
## [5,]   9832643    7.51758 15.862619       31.479527      2.5774621
## [6,]  17747854    1.54806  8.051752        8.044618      0.5863094
##      no_healthcare birth_mortal      pop movie_theatres_per_cap
## [1,]       48.7947          4.4 15625084         0.000006015968
## [2,]       45.0456          1.5   367828         0.000005437324
## [3,]       45.7640          4.8  3308876         0.000011182045
## [4,]       62.1103          5.9   992595         0.000004029841
## [5,]       65.5104          7.5  1055259         0.000002842904
## [6,]       39.5473          3.0   509108         0.000015713758
##      doctors_per_cap gdp_per_cap
## [1,]     0.004835622   18.732051
## [2,]     0.004502104   16.722352
## [3,]     0.010175359   20.962931
## [4,]     0.004495288    8.027456
## [5,]     0.003604802    9.317753
## [6,]     0.004498063   34.860686

4. Reducing dimensions

PCA finds a lower dimensional representation of the data that keeps the maximum amount of variance. It’s great for analyzing multivariate datasets, like this one, with multiple numerical columns that are highly correlated. Typically, the first few components preserve most of the information in the raw data, allowing us, to go from eleven dimensions (eleven original variables) down to two dimensions (two variables that are summaries of the original eleven).

To run PCA, we need to make sure all the variables are on similar scales. Otherwise, variables with large variance will be overrepresented. In PCA() setting scale.unit = TRUE ensures that variables are scaled to unit variance before crunching the numbers.

Feel free to explore the output!

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 22 individuals, described by 11 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

5. PCA: Variables & Components

Now that we have the principal components, we can see how the original variables are correlated among themselves and how the original variables are correlated with the principal components. We will build a plot using the factoextra package to help us understand these relationships. A correlation circle plot (also known as a variable correlation plot) shows the relationship among all variables as they are plotted on the first two principal components (Dimension 1 and Dimension 2).

To understand the plot, note that:

## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

## [1] 20.25001

6. Plotting the components

With the first two principal components representing almost 65% of the variance, most of the information we are interested in is summarized in these two components. From the variable correlation plot, we can see that population and GDP are highly correlated; illiteracy, poverty, no healthcare, school dropout, and deficient infrastructure are correlated; and GDP per capita and movie theaters per capita are correlated.

But how do these correlations map to the provinces? To dive into that question, let’s plot the individual principal components for each province and look for clusters.

7. Cluster using K means

It looks like one province stands out and the rest follow the gradient along the second dimension. Are there clusters we are not detecting? Let’s use K-means clustering to see if there are patterns we are not detecting.

##        Dim.1       Dim.2       Dim.3      Dim.4       Dim.5
## 1 -2.3614699  5.85722972  0.02887238  1.1415873 -0.55664309
## 2 -0.6584545 -0.48651253 -1.29107656 -0.2415875 -0.03742897
## 3 -2.5752814  0.52381577  1.70307390 -0.9930687  2.11197740
## 4  2.8268547  0.15875925  0.48894162 -1.0256780  0.09720908
## 5  4.3000823  0.09073316  0.08145973  0.8101764  1.08427385
## 6 -2.7072455 -1.36535411 -0.23583031  0.9771357 -0.60816245
## K-means clustering with 4 clusters of sizes 6, 1, 8, 7
## 
## Cluster means:
##        pca_1      pca_2
## 1  3.1637648  0.1200775
## 2 -2.3614699  5.8572297
## 3 -0.1320515 -0.3199319
## 4 -2.2235295 -0.5740342
## 
## Clustering vector:
##  [1] 2 3 4 1 1 4 3 1 3 4 3 4 1 4 3 1 3 3 4 4 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 4.375350 0.000000 3.109136 8.403846
##  (between_SS / total_SS =  89.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

8. Components with colors

Now that we have cluster assignments for each province, we will plot the provinces according to their principal components coordinates, colored by the cluster.

9. Buenos Aires, in a league of its own

A few things to note from the scatter plot:

We will focus on exploring clusters 1, 2, and 3 in terms of the original variables in the next few tasks.

As we noted earlier, Buenos Aires is in a league of its own, with the largest positive value in Dimension 2 by far. The figure below is a biplot, a combination of the individuals plot from Task 6 and the circle plot from Task 5.

Since the vectors corresponding to gdp and pop are in the same direction as Dimension 2, Buenos Aires has high GDP and high population. Let’s visualize this pattern with a plot of gdp against cluster (we should get similar results with pop).

10. The rich provinces

Provinces in cluster 2 have large negative values in Dimension 1. The biplot shows that gdp_per_cap, movie_theaters_per_cap and doctors_per_cap also have high negative values in Dimension 1.

If we plot gdp_per_cap for each cluster, we can see that provinces in this cluster 2, in general, have greater GDP per capita than the provinces in the other clusters. San Luis is the only province from the other clusters with gdp_per_cap in the range of values observed in cluster 2. We will see similar results for movie_theaters_per_cap and doctors_per_cap.

11. The poor provinces

Provinces in Cluster 3 have high positive values in Dimension 1. As shown in the biplot, provinces with high positive values in Dimension 1 have high values in poverty, deficient infrastructure, etc. These variables are also negatively correlated with gdp_per_cap, so these provinces have low values in this variable.

12. Planning for public policy

Now that we have an idea of how social and economic welfare varies among provinces, we’ve been asked to help plan an education program. A pilot phase of the program will be carried out to identify design issues. Our goal is to select the proposal with the most diverse set of provinces:

  1. Tucumán, San Juán, and Entre Ríos
  2. Córdoba, Santa Fé, and Mendoza
  3. Buenos Aires, Santa Cruz, and Misiones

Which proposal includes the most diverse set of provinces?