Public Planning in Argentina

Introduction

With almost 40 million inhabitants and a diverse geography that encompasses the Andes mountains, glacial lakes, and the Pampas grasslands, Argentina is the second largest country in South America by area and has one of the region's largest economies. It is politically organized as a federation of 23 provinces and an autonomous city, Buenos Aires.

I analyze ten economic and social indicators collected for each province. Because these indicators are highly correlated, I will use principal component analysis (PCA) to reduce redundancies and highlight patterns that are not apparent in the raw data. After visualizing the patterns, I will use k-means clustering to partition the provinces into groups with similar development levels.

These results can be used to plan public policy by helping allocate resources to develop infrastructure, education, and welfare programs.

## -- Attaching packages ------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --

## v ggplot2 3.1.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.3.1
## v readr   1.3.0     v forcats 0.3.0

## -- Conflicts --------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## Parsed with column specification:
## cols(
##   province = col_character(),
##   gdp = col_double(),
##   illiteracy = col_double(),
##   poverty = col_double(),
##   deficient_infra = col_double(),
##   school_dropout = col_double(),
##   no_healthcare = col_double(),
##   birth_mortal = col_double(),
##   pop = col_double(),
##   movie_theatres_per_cap = col_double(),
##   doctors_per_cap = col_double()
## )

## [1] 22

## # A tibble: 6 x 11
##   province    gdp illiteracy poverty deficient_infra school_dropout
##   <chr>     <dbl>      <dbl>   <dbl>           <dbl>          <dbl>
## 1 Buenos ~ 2.93e8       1.38    8.17            5.51          0.766
## 2 Catamar~ 6.15e6       2.34    9.23           10.5           0.952
## 3 Córdoba  6.94e7       2.71    5.38           10.4           1.04 
## 4 Corrien~ 7.97e6       5.60   12.7            17.4           3.86 
## 5 Chaco    9.83e6       7.52   15.9            31.5           2.58 
## 6 Chubut   1.77e7       1.55    8.05            8.04          0.586
## # ... with 5 more variables: no_healthcare <dbl>, birth_mortal <dbl>,
## #   pop <dbl>, movie_theatres_per_cap <dbl>, doctors_per_cap <dbl>

Most populous, richest provinces

Argentina ranks third in South America in total population, but the population is unevenly distributed throughout the country. Sixty percent of the population resides in the Pampa region (Buenos Aires, La Pampa, Santa Fe, Entre Ríos and Córdoba), which encompasses only about 20% of the land area.

GDP is a measure of the size of a province’s economy. To measure how rich or poor the inhabitants are, economists use per capita GDP, which is GDP divided by the province’s population.

# Add gdp_per_cap column to argentina
argentina <- argentina %>% 
  mutate(gdp_per_cap = gdp / pop) 

# Find the four richest provinces
( rich_provinces  <- argentina %>% 
                         arrange(desc(gdp_per_cap)) %>%                         
                         select(province, gdp_per_cap) %>%
                         top_n(4, gdp_per_cap))

## # A tibble: 4 x 2
##   province   gdp_per_cap
##   <chr>            <dbl>
## 1 Santa Cruz        42.6
## 2 Neuquén           40.9
## 3 Chubut            34.9
## 4 San Luis          27.3

# Find the provinces with populations over 1 million
( bigger_pops <- argentina %>% 
                     arrange(desc(pop)) %>%
                     select(province, pop) %>%
                     filter(pop > 1000000))

## # A tibble: 9 x 2
##   province          pop
##   <chr>           <dbl>
## 1 Buenos Aires 15625084
## 2 Córdoba       3308876
## 3 Santa Fe      3194537
## 4 Mendoza       1738929
## 5 Tucumán       1448188
## 6 Entre Ríos    1235994
## 7 Salta         1214441
## 8 Misiones      1101593
## 9 Chaco         1055259

A matrix for PCA

Principal Component Analysis (PCA) is an unsupervised learning technique that summarizes multivariate data by reducing redundancies (variables that are correlated). New variables (the principal components) are linear combinations of the original variables that retain as much variation as possible. Some aspects of economic and social data are highly correlated, so let’s see what pops out. But first, I need to do some data preparation.

R makes it easy to run a PCA with the PCA() function from the FactoMineR package. The first argument in PCA() is a data frame or matrix of the data where the rows are “individuals” (or in our case, provinces) and columns are numeric variables. To prepare for the analysis, I will remove the column of province names and build a matrix from the dataset.

# Select numeric columns and cast to matrix
argentina_matrix  <- argentina  %>% 
  select_if(is.numeric) %>%  
  as.matrix()

# Print the first lines of the result
head(argentina_matrix)

##            gdp illiteracy   poverty deficient_infra school_dropout
## [1,] 292689868    1.38324  8.167798        5.511856      0.7661682
## [2,]   6150949    2.34414  9.234095       10.464484      0.9519631
## [3,]  69363739    2.71414  5.382380       10.436086      1.0350558
## [4,]   7968013    5.60242 12.747191       17.438858      3.8642652
## [5,]   9832643    7.51758 15.862619       31.479527      2.5774621
## [6,]  17747854    1.54806  8.051752        8.044618      0.5863094
##      no_healthcare birth_mortal      pop movie_theatres_per_cap
## [1,]       48.7947          4.4 15625084           6.015968e-06
## [2,]       45.0456          1.5   367828           5.437324e-06
## [3,]       45.7640          4.8  3308876           1.118204e-05
## [4,]       62.1103          5.9   992595           4.029841e-06
## [5,]       65.5104          7.5  1055259           2.842904e-06
## [6,]       39.5473          3.0   509108           1.571376e-05
##      doctors_per_cap gdp_per_cap
## [1,]     0.004835622   18.732051
## [2,]     0.004502104   16.722352
## [3,]     0.010175359   20.962931
## [4,]     0.004495288    8.027456
## [5,]     0.003604802    9.317753
## [6,]     0.004498063   34.860686
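
One side effect of the conversion above is that the matrix has no row names, so later output refers to provinces by row number (1 to 22). As far as I know, PCA() labels individuals by row name, so keeping the province names as row names is one way around this. The sketch below uses base R and a toy data frame (`toy`, a hypothetical name) built from four rows of the output above; it is an illustration, not the code used in this analysis.

```r
# Toy data frame with the same layout as `argentina`
# (values copied from the output printed in this document)
toy <- data.frame(
  province = c("Buenos Aires", "Córdoba", "Chaco", "Chubut"),
  gdp      = c(292689868, 69363739, 9832643, 17747854),
  pop      = c(15625084, 3308876, 1055259, 509108),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, then cast to matrix
numeric_cols <- sapply(toy, is.numeric)
toy_matrix   <- as.matrix(toy[, numeric_cols])

# Carry the province names over as row names
rownames(toy_matrix) <- toy$province

toy_matrix
```

With row names in place, plots and printed results can show province names instead of row indices.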

Reducing dimensions

PCA finds a lower-dimensional representation of the data that keeps the maximum amount of variance. It’s great for analyzing multivariate datasets like this one, with multiple numerical columns that are highly correlated. Typically, the first few components preserve most of the information in the raw data, allowing me to go from eleven dimensions (the ten original indicators plus gdp_per_cap) down to two dimensions (two variables that summarize the eleven).

To run PCA, I need to make sure all the variables are on similar scales. Otherwise, variables with large variance will be overrepresented. In PCA(), setting scale.unit = TRUE ensures that variables are scaled to unit variance before crunching the numbers.
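
To see why scaling matters, here is a minimal sketch using base R's prcomp() rather than FactoMineR's PCA() (a swapped-in function, chosen so the example runs without extra packages). It uses three columns from the first six rows of the table above and compares the first-component loadings with and without scaling.

```r
# First six provinces, three indicators, values copied from the table above
x <- cbind(
  gdp        = c(292689868, 6150949, 69363739, 7968013, 9832643, 17747854),
  illiteracy = c(1.38324, 2.34414, 2.71414, 5.60242, 7.51758, 1.54806),
  poverty    = c(8.167798, 9.234095, 5.382380, 12.747191, 15.862619, 8.051752)
)

# Without scaling, gdp's huge variance dominates the first component
pca_raw      <- prcomp(x, center = TRUE, scale. = FALSE)
raw_loadings <- abs(pca_raw$rotation[, "PC1"])

# With scaling, every variable enters with unit variance
pca_scaled      <- prcomp(x, center = TRUE, scale. = TRUE)
scaled_loadings <- abs(pca_scaled$rotation[, "PC1"])

round(raw_loadings, 4)
round(scaled_loadings, 4)
```

In the unscaled run the first component is essentially gdp alone, while the scaled run spreads the loadings across all three variables.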

# Apply PCA and print results
( argentina_pca  <- PCA(argentina_matrix, scale.unit = TRUE) )

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 22 individuals, described by 11 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

PCA: Variables & Components

Now that I have the principal components, I can see how the original variables are correlated among themselves and how the original variables are correlated with the principal components. I will build a plot using the factoextra package to help understand these relationships. A correlation circle plot (also known as a variable correlation plot) shows the relationship among all variables as they are plotted on the first two principal components (Dimension 1 and Dimension 2).

To understand the plot, note that:

- Positively correlated variables have similar vectors.
- The vectors of negatively correlated variables are on opposite sides of the plot origin (opposite quadrants).
- Each axis represents a principal component. Vectors pointing in the direction of the component are correlated with that component.
- The percentage of the original variance explained by each component (dimension) is given in parentheses in the axis labels.
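
To make the link between a vector and a correlation concrete, the sketch below reproduces the numbers for gdp on Dimension 2 from values printed further down in this document ($eig and $svd$V). It assumes the standard relations for a PCA on scaled variables: the coordinate is the eigenvector entry times the square root of the eigenvalue, cos2 is the squared coordinate, and the contribution is cos2 as a share of the eigenvalue.

```r
# Values for gdp on Dimension 2, copied from the PCA output in this document
eigenvalue_dim2 <- 2.006757647   # $eig, comp 2
eigenvector_gdp <- 0.64562084    # $svd$V, row 1 (gdp), column 2

# Coordinate on the correlation circle = correlation with the component
coord_gdp <- eigenvector_gdp * sqrt(eigenvalue_dim2)

# cos2 = squared coordinate: quality of representation on that axis
cos2_gdp <- coord_gdp^2

# Contribution (%) of gdp to Dimension 2
contrib_gdp <- 100 * cos2_gdp / eigenvalue_dim2

round(c(coord = coord_gdp, cos2 = cos2_gdp, contrib = contrib_gdp), 4)
```

These match the gdp entries for Dim.2 in $var$coord, $var$cos2, and $var$contrib up to rounding.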

# Plot the original variables and the first 2 components and print the plot object.
( pca_var_plot <- fviz_pca_var(argentina_pca))

# Sum the variance preserved by the first two components. Print the result.
( variance_first_two_pca <- argentina_pca$eig[1, 2] + argentina_pca$eig[2, 2] )

## [1] 63.54897
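
The percentages come straight from the eigenvalues. As a small worked check in base R, using the eigenvalues printed in the $eig table of this document: with unit-variance scaling the eigenvalues sum to the number of variables (eleven), and each percentage is an eigenvalue's share of that total.

```r
# Eigenvalues copied from the $eig table in this document
eigenvalues <- c(4.983629450, 2.006757647, 1.042135090, 0.949962523,
                 0.742104444, 0.485195430, 0.329075931, 0.184573578,
                 0.145702076, 0.129508268, 0.001355563)

# With unit-variance scaling, total variance equals the number of variables
total_variance <- sum(eigenvalues)

# Share of variance per component, and the running total
pct_variance <- 100 * eigenvalues / total_variance
cum_pct      <- cumsum(pct_variance)

round(cum_pct[1:3], 2)
```

The second cumulative value is the same 63.55% obtained above by adding the first two rows of the percentage column.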

Plotting the components

With the first two principal components representing about 64% of the variance, most of the information I am interested in is summarized in these two components. From the variable correlation plot, I can see that population and GDP are highly correlated; illiteracy, poverty, no healthcare, school dropout, and deficient infrastructure are correlated; and GDP per capita and movie theaters per capita are correlated.

But how do these correlations map to the provinces? To dive into that question, I’ll plot the individual principal components for each province and look for clusters.

# Visualize Dim2 vs. Dim1
fviz_pca_ind(argentina_pca, title = "Provinces - PCA")

Cluster using K-means

It looks like one province stands out and the rest are spread along a gradient on the first dimension. Are there clusters I am not detecting? I’ll use K-means clustering to find out.

# Set seed to 1234 for reproducibility
set.seed(1234)
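# Inspect the PCA object; head() on a list returns its first six elements,
# which here is every element, so the full contents are printed
head(argentina_pca)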

## $eig
##          eigenvalue percentage of variance
## comp 1  4.983629450             45.3057223
## comp 2  2.006757647             18.2432513
## comp 3  1.042135090              9.4739554
## comp 4  0.949962523              8.6360229
## comp 5  0.742104444              6.7464040
## comp 6  0.485195430              4.4108675
## comp 7  0.329075931              2.9915994
## comp 8  0.184573578              1.6779416
## comp 9  0.145702076              1.3245643
## comp 10 0.129508268              1.1773479
## comp 11 0.001355563              0.0123233
##         cumulative percentage of variance
## comp 1                           45.30572
## comp 2                           63.54897
## comp 3                           73.02293
## comp 4                           81.65895
## comp 5                           88.40536
## comp 6                           92.81622
## comp 7                           95.80782
## comp 8                           97.48576
## comp 9                           98.81033
## comp 10                          99.98768
## comp 11                         100.00000
## 
## $var
## $var$coord
##                             Dim.1       Dim.2       Dim.3       Dim.4
## gdp                    -0.3531993  0.91458696  0.08754649  0.12733753
## illiteracy              0.8853148  0.02380461  0.02007275 -0.18633306
## poverty                 0.8519141  0.04728600 -0.11508341  0.21974852
## deficient_infra         0.6612476 -0.13860488  0.07865448  0.43724696
## school_dropout          0.6068040  0.02321504  0.41314618 -0.51753592
## no_healthcare           0.8827617  0.19865491 -0.05695380  0.01912375
## birth_mortal            0.4976821 -0.02647374  0.73210807  0.33456496
## pop                    -0.2587412  0.95003081  0.06144140  0.13485990
## movie_theatres_per_cap -0.7148864 -0.36527201  0.36142907  0.29671447
## doctors_per_cap        -0.5956986  0.06231820  0.41227017 -0.38607478
## gdp_per_cap            -0.7642370 -0.26020203 -0.01867155  0.15533866
##                              Dim.5
## gdp                    -0.01513966
## illiteracy              0.13325888
## poverty                -0.24316858
## deficient_infra         0.46747981
## school_dropout         -0.26983192
## no_healthcare           0.26044559
## birth_mortal           -0.21528460
## pop                    -0.01995636
## movie_theatres_per_cap  0.05508681
## doctors_per_cap         0.47514859
## gdp_per_cap            -0.17395790
## 
## $var$cor
##                             Dim.1       Dim.2       Dim.3       Dim.4
## gdp                    -0.3531993  0.91458696  0.08754649  0.12733753
## illiteracy              0.8853148  0.02380461  0.02007275 -0.18633306
## poverty                 0.8519141  0.04728600 -0.11508341  0.21974852
## deficient_infra         0.6612476 -0.13860488  0.07865448  0.43724696
## school_dropout          0.6068040  0.02321504  0.41314618 -0.51753592
## no_healthcare           0.8827617  0.19865491 -0.05695380  0.01912375
## birth_mortal            0.4976821 -0.02647374  0.73210807  0.33456496
## pop                    -0.2587412  0.95003081  0.06144140  0.13485990
## movie_theatres_per_cap -0.7148864 -0.36527201  0.36142907  0.29671447
## doctors_per_cap        -0.5956986  0.06231820  0.41227017 -0.38607478
## gdp_per_cap            -0.7642370 -0.26020203 -0.01867155  0.15533866
##                              Dim.5
## gdp                    -0.01513966
## illiteracy              0.13325888
## poverty                -0.24316858
## deficient_infra         0.46747981
## school_dropout         -0.26983192
## no_healthcare           0.26044559
## birth_mortal           -0.21528460
## pop                    -0.01995636
## movie_theatres_per_cap  0.05508681
## doctors_per_cap         0.47514859
## gdp_per_cap            -0.17395790
## 
## $var$cos2
##                             Dim.1        Dim.2        Dim.3        Dim.4
## gdp                    0.12474975 0.8364693038 0.0076643884 0.0162148476
## illiteracy             0.78378227 0.0005666597 0.0004029152 0.0347200079
## poverty                0.72575762 0.0022359658 0.0132441916 0.0482894099
## deficient_infra        0.43724841 0.0192113115 0.0061865277 0.1911849048
## school_dropout         0.36821106 0.0005389382 0.1706897644 0.2678434240
## no_healthcare          0.77926823 0.0394637724 0.0032437353 0.0003657178
## birth_mortal           0.24768746 0.0007008589 0.5359822300 0.1119337096
## pop                    0.06694702 0.9025585381 0.0037750454 0.0181871924
## movie_theatres_per_cap 0.51106252 0.1334236430 0.1306309699 0.0880394785
## doctors_per_cap        0.35485688 0.0038835580 0.1699666951 0.1490537324
## gdp_per_cap            0.58405823 0.0677050978 0.0003486267 0.0241300981
##                               Dim.5
## gdp                    0.0002292093
## illiteracy             0.0177579296
## poverty                0.0591309576
## deficient_infra        0.2185373727
## school_dropout         0.0728092624
## no_healthcare          0.0678319067
## birth_mortal           0.0463474578
## pop                    0.0003982561
## movie_theatres_per_cap 0.0030345561
## doctors_per_cap        0.2257661842
## gdp_per_cap            0.0302613518
## 
## $var$contrib
##                            Dim.1       Dim.2       Dim.3       Dim.4
## gdp                     2.503191 41.68262695  0.73545057  1.70689340
## illiteracy             15.727138  0.02823757  0.03866248  3.65488185
## poverty                14.562833  0.11142181  1.27087090  5.08329631
## deficient_infra         8.773694  0.95733092  0.59363971 20.12552076
## school_dropout          7.388412  0.02685617 16.37885204 28.19515692
## no_healthcare          15.636560  1.96654402  0.31125862  0.03849813
## birth_mortal            4.970021  0.03492494 51.43116619 11.78296058
## pop                     1.343339 44.97596107  0.36224146  1.91451684
## movie_theatres_per_cap 10.254826  6.64871731 12.53493632  9.26767913
## doctors_per_cap         7.120451  0.19352402 16.30946858 15.69048555
## gdp_per_cap            11.719536  3.37385523  0.03345312  2.54011053
##                              Dim.5
## gdp                     0.03088639
## illiteracy              2.39291513
## poverty                 7.96801017
## deficient_infra        29.44833094
## school_dropout          9.81118802
## no_healthcare           9.14047979
## birth_mortal            6.24540901
## pop                     0.05366578
## movie_theatres_per_cap  0.40891227
## doctors_per_cap        30.42242719
## gdp_per_cap             4.07777530
## 
## 
## $ind
## $ind$coord
##          Dim.1       Dim.2       Dim.3      Dim.4       Dim.5
## 1  -2.36146994  5.85722972  0.02887238  1.1415873 -0.55664309
## 2  -0.65845451 -0.48651253 -1.29107656 -0.2415875 -0.03742897
## 3  -2.57528138  0.52381577  1.70307390 -0.9930687  2.11197740
## 4   2.82685473  0.15875925  0.48894162 -1.0256780  0.09720908
## 5   4.30008229  0.09073316  0.08145973  0.8101764  1.08427385
## 6  -2.70724552 -1.36535411 -0.23583031  0.9771357 -0.60816245
## 7   0.04335852 -0.21902507 -0.37815826 -0.1520217  0.62600102
## 8   4.21149482 -0.12335287  1.70641723  2.0281065 -0.06036068
## 9   0.81084392 -0.02185901 -1.40833132  0.3811242 -0.10375906
## 10 -2.31574880 -1.54147650  1.21560898  1.5886977  1.04580005
## 11  0.03981146 -0.89482651  2.47490195 -0.4534430 -1.71571657
## 12 -1.49491730  0.06125399 -0.03492552 -0.5930355  0.07767891
## 13  3.00162668  0.32821769 -0.01572717 -0.8404313 -1.30454951
## 14 -1.54346805 -0.87168732 -0.34268279  0.3787263 -0.38147289
## 15 -0.92603242 -0.59687582 -1.15450347  0.4228471  0.90200815
## 16  1.92430029  0.19027541 -0.68696523  0.5769439 -0.29786660
## 17  0.28315539 -0.08601798  0.21078632 -1.7200906 -0.76180005
## 18 -0.63180598 -0.40460402 -0.03114572 -1.1202342  0.32829060
## 19 -3.24754516 -1.50531253 -0.71328750  0.9818379 -1.31232195
## 20 -1.68050058  0.68052132  0.58532027 -1.1784782  0.16509857
## 21  2.71823019  0.07583245 -1.39279281 -0.2316759  0.59215715
## 22 -0.01728864  0.15026553 -0.80995574 -0.7374384  0.10958703
## 
## $ind$cos2
##           Dim.1        Dim.2        Dim.3      Dim.4        Dim.5
## 1  1.339410e-01 0.8240115152 2.002230e-05 0.03130164 0.0074422184
## 2  1.503778e-01 0.0820956847 5.781443e-01 0.02024329 0.0004859013
## 3  3.986197e-01 0.0164917402 1.743317e-01 0.05927450 0.2680944888
## 4  8.092557e-01 0.0025524477 2.420989e-02 0.10653709 0.0009569571
## 5  8.460085e-01 0.0003766633 3.036038e-04 0.03003173 0.0537897069
## 6  6.425089e-01 0.1634234811 4.875544e-03 0.08370158 0.0324237361
## 7  1.013185e-03 0.0258539981 7.707033e-02 0.01245521 0.2111981610
## 8  6.834503e-01 0.0005863175 1.122031e-01 0.15849512 0.0001403921
## 9  1.412998e-01 0.0001026900 4.262621e-01 0.03121764 0.0023137642
## 10 3.593658e-01 0.1592309960 9.902426e-02 0.16913611 0.0732910777
## 11 1.486119e-04 0.0750784016 5.743192e-01 0.01927892 0.2760122966
## 12 5.665016e-01 0.0009511205 3.092093e-04 0.08915140 0.0015295831
## 13 7.026757e-01 0.0084016692 1.929047e-05 0.05508660 0.1327280557
## 14 3.799797e-01 0.1211955114 1.873051e-02 0.02287789 0.0232109243
## 15 2.103926e-01 0.0874069813 3.270157e-01 0.04386769 0.1996177111
## 16 5.842851e-01 0.0057127394 7.446439e-02 0.05252261 0.0139998282
## 17 1.821912e-02 0.0016813427 1.009631e-02 0.67232677 0.1318741723
## 18 1.234999e-01 0.0506476681 3.001205e-04 0.38825472 0.0333438665
## 19 6.113929e-01 0.1313600570 2.949437e-02 0.05588427 0.0998368240
## 20 4.058789e-01 0.0665583615 4.923867e-02 0.19960079 0.0039174710
## 21 6.579750e-01 0.0005120909 1.727468e-01 0.00477968 0.0312256073
## 22 9.523524e-05 0.0071944029 2.090252e-01 0.17327166 0.0038264349
## 
## $ind$contrib
##           Dim.1        Dim.2        Dim.3      Dim.4        Dim.5
## 1  5.086235e+00 77.708210251  0.003635956  6.2357560  1.897867670
## 2  3.954423e-01  0.536130420  7.270383057  0.2792671  0.008580807
## 3  6.048963e+00  0.621497698 12.650866941  4.7187766 27.320636998
## 4  7.288507e+00  0.057090053  1.042719074  5.0337644  0.057879657
## 5  1.686495e+01  0.018647236  0.028942713  3.1407258  7.200957614
## 6  6.684776e+00  4.222532436  0.242578627  4.5685736  2.265434379
## 7  1.714670e-03  0.108660090  0.623735519  0.1105814  2.400282576
## 8  1.617723e+01  0.034465206 12.700585922 19.6812355  0.022316217
## 9  5.996614e-01  0.001082290  8.650933511  0.6950305  0.065942340
## 10 4.891189e+00  5.382155130  6.445269762 12.0768420  6.698994194
## 11 1.445599e-03  1.813677529 26.715868375  0.9838215 18.030317687
## 12 2.038290e+00  0.008498673  0.005320336  1.6827989  0.036958859
## 13 8.217599e+00  0.244009295  0.001078834  3.3796773 10.423976357
## 14 2.172836e+00  1.721091062  0.512198015  0.6863123  0.891332298
## 15 7.821390e-01  0.806956945  5.813581796  0.8555344  4.983485604
## 16 3.377359e+00  0.082006393  2.058367022  1.5927150  0.543445376
## 17 7.312758e-02  0.016759493  0.193793033 14.1570741  3.554629041
## 18 3.640819e-01  0.370802610  0.004231067  6.0046614  0.660129561
## 19 9.619267e+00  5.132580229  2.219128865  4.6126493 10.548557377
## 20 2.575781e+00  1.048976759  1.494309043  6.6452903  0.166954805
## 21 6.739133e+00  0.013025443  8.461090299  0.2568223  2.147762468
## 22 2.726172e-04  0.051144759  2.861382234  2.6020904  0.073558116
## 
## $ind$dist
##        1        2        3        4        5        6        7        8 
## 6.452464 1.697985 4.078921 3.142394 4.675084 3.377443 1.362166 5.094280 
##        9       10       11       12       13       14       15       16 
## 2.157081 3.862986 3.265738 1.986171 3.580793 2.503903 2.018882 2.517448 
##       17       18       19       20       21       22 
## 2.097786 1.797838 4.153316 2.637791 3.351055 1.771585 
## 
## 
## $svd
## $svd$vs
##  [1] 2.23240441 1.41660074 1.02085018 0.97466021 0.86145484 0.69655971
##  [7] 0.57365140 0.42962027 0.38170941 0.35987257 0.03681797
## 
## $svd$U
##               [,1]        [,2]        [,3]       [,4]        [,5]
##  [1,] -1.057814582  4.13470752  0.02828268  1.1712670 -0.64616630
##  [2,] -0.294953058 -0.34343659 -1.26470719 -0.2478684 -0.04344856
##  [3,] -1.153590884  0.36976951  1.66828976 -1.0188871  2.45164030
##  [4,]  1.266282539  0.11207057  0.47895532 -1.0523441  0.11284292
##  [5,]  1.926211162  0.06404992  0.07979597  0.8312398  1.25865431
##  [6,] -1.212703894 -0.96382422 -0.23101363  1.0025399 -0.70597136
##  [7,]  0.019422341 -0.15461313 -0.37043463 -0.1559740  0.72667886
##  [8,]  1.886528625 -0.08707666  1.67156481  2.0808344 -0.07006831
##  [9,]  0.363215518 -0.01543061 -1.37956710  0.3910329 -0.12044632
## [10,] -1.037333912 -1.08815170  1.19078098  1.6300016  1.21399288
## [11,]  0.017833443 -0.63167164  2.42435374 -0.4652319 -1.99165004
## [12,] -0.669644485  0.04324012 -0.03421219 -0.6084536  0.09017177
## [13,]  1.344571201  0.23169386 -0.01540595 -0.8622813 -1.51435623
## [14,] -0.691392673 -0.61533733 -0.33568373  0.3885727 -0.44282401
## [15,] -0.414813920 -0.42134372 -1.13092351  0.4338405  1.04707537
## [16,]  0.861985528  0.13431830 -0.67293443  0.5919437 -0.34577158
## [17,]  0.126838751 -0.06072140  0.20648115 -1.7648106 -0.88431804
## [18,] -0.283015919 -0.28561613 -0.03050958 -1.1493587  0.38108858
## [19,] -1.454729771 -1.06262300 -0.69871908  1.0073643 -1.52337869
## [20,] -0.752776053  0.48039035  0.57336549 -1.2091170  0.19165087
## [21,]  1.217624447  0.05353128 -1.36434595 -0.2376992  0.68739199
## [22,] -0.007744403  0.10607472 -0.79341294 -0.7566108  0.12721158
## 
## $svd$V
##             [,1]        [,2]        [,3]        [,4]        [,5]
##  [1,] -0.1582148  0.64562084  0.08575841  0.13064813 -0.01757452
##  [2,]  0.3965746  0.01680404  0.01966278 -0.19117745  0.15469050
##  [3,]  0.3816128  0.03337991 -0.11273291  0.22546167 -0.28227664
##  [4,]  0.2962042 -0.09784329  0.07704802  0.44861477  0.54266316
##  [5,]  0.2718163  0.01638785  0.40470794 -0.53099112 -0.31322816
##  [6,]  0.3954309  0.14023352 -0.05579056  0.01962094  0.30233226
##  [7,]  0.2229355 -0.01868822  0.71715526  0.34326317 -0.24990816
##  [8,] -0.1159025  0.67064119  0.06018650  0.13836607 -0.02316588
##  [9,] -0.3202316 -0.25785107  0.35404712  0.30442863  0.06394625
## [10,] -0.2668417  0.04399136  0.40384983 -0.39611218  0.55156529
## [11,] -0.3423381 -0.18368057 -0.01829020  0.15937724 -0.20193502
## 
## 
## $call
## $call$row.w
##  [1] 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455
##  [7] 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455
## [13] 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455
## [19] 0.04545455 0.04545455 0.04545455 0.04545455
## 
## $call$col.w
##  [1] 1 1 1 1 1 1 1 1 1 1 1
## 
## $call$scale.unit
## [1] TRUE
## 
## $call$ncp
## [1] 5
## 
## $call$centre
##  [1] 3.055703e+07 3.225541e+00 9.925625e+00 1.267730e+01 1.724866e+00
##  [6] 5.076884e+01 4.986364e+00 1.686352e+06 7.143952e-06 4.893720e-03
## [11] 1.834566e+01
## 
## $call$ecart.type
##  [1] 6.040940e+07 1.808927e+00 3.692633e+00 7.050933e+00 1.125942e+00
##  [6] 8.969951e+00 3.417907e+00 3.145799e+06 4.273323e-06 1.487053e-03
## [11] 1.013528e+01
## 
## $call$X
##          gdp illiteracy   poverty deficient_infra school_dropout
## 1  292689868   1.383240  8.167798        5.511856      0.7661682
## 2    6150949   2.344140  9.234095       10.464484      0.9519631
## 3   69363739   2.714140  5.382380       10.436086      1.0350558
## 4    7968013   5.602420 12.747191       17.438858      3.8642652
## 5    9832643   7.517580 15.862619       31.479527      2.5774621
## 6   17747854   1.548060  8.051752        8.044618      0.5863094
## 7   20743409   3.185580  7.288751       18.794568      1.8871881
## 8    3807057   4.610640 17.035583       28.004985      2.2689741
## 9    6484938   2.151390 13.367965       12.483179      0.7212945
## 10   6990262   1.539300  3.398774       16.505714      0.2040934
## 11   5590516   2.773210 10.875152        7.403254      3.8449494
## 12  33431369   2.200200  5.692798        3.839852      1.0637179
## 13   9646826   6.863950 13.529788        8.325740      3.1291244
## 14  22564106   1.943750  9.456635       11.267278      1.3935038
## 15  10264584   2.031420  8.678391       14.885444      0.4080420
## 16  13438835   3.346090 16.870500       14.182303      1.4820300
## 17   8262309   2.963260  9.050784        3.914390      3.2984129
## 18  11780849   3.433650  6.593771        9.679894      2.0001724
## 19  11663738   0.791485  8.024762        7.411364      0.2892622
## 20  81588690   1.975940  6.081012       11.869195      2.8721807
## 21   8387859   6.272090 11.759000       20.491433      2.3255981
## 22  13856199   3.770370 11.214239        6.466665      0.9772847
##    no_healthcare birth_mortal      pop movie_theatres_per_cap
## 1        48.7947          4.4 15625084           6.015968e-06
## 2        45.0456          1.5   367828           5.437324e-06
## 3        45.7640          4.8  3308876           1.118204e-05
## 4        62.1103          5.9   992595           4.029841e-06
## 5        65.5104          7.5  1055259           2.842904e-06
## 6        39.5473          3.0   509108           1.571376e-05
## 7        48.6571          3.1  1235994           5.663458e-06
## 8        65.8126         16.2   530162           3.772432e-06
## 9        54.1615          3.7   673307           2.970413e-06
## 10       45.4764          7.2   318951           1.881167e-05
## 11       40.8341         11.4   333642           1.198890e-05
## 12       50.5843          4.4  1738929           8.050932e-06
## 13       57.8339          8.1  1101593           1.815553e-06
## 14       48.7431          3.3   551266           9.070032e-06
## 15       49.9463          0.8   638645           9.394891e-06
## 16       60.4230          5.8  1214441           4.117121e-06
## 17       52.9684          4.2   681055           5.873241e-06
## 18       51.6154          3.8   432310           4.626310e-06
## 19       29.2321          3.3   273964           1.095034e-05
## 20       41.9660          2.6  3194537           6.573723e-06
## 21       63.6637          1.7   874006           3.432471e-06
## 22       48.2242          3.0  1448188           4.833627e-06
##    doctors_per_cap gdp_per_cap
## 1      0.004835622   18.732051
## 2      0.004502104   16.722352
## 3      0.010175359   20.962931
## 4      0.004495288    8.027456
## 5      0.003604802    9.317753
## 6      0.004498063   34.860686
## 7      0.004678825   16.782775
## 8      0.003440458    7.180932
## 9      0.003958076    9.631473
## 10     0.005414625   21.916415
## 11     0.005092285   16.756031
## 12     0.005720188   19.225264
## 13     0.002880374    8.757160
## 14     0.005066520   40.931431
## 15     0.004897870   16.072442
## 16     0.003991137   11.065861
## 17     0.005043646   12.131632
## 18     0.006102103   27.250930
## 19     0.004270634   42.573981
## 20     0.006671702   25.540067
## 21     0.002821491    9.597026
## 22     0.005500667    9.567956
## 
## $call$row.w.init
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## $call$call
## PCA(X = argentina_matrix, scale.unit = TRUE)

# Create an intermediate data frame with pca_1 and pca_2
argentina_comps <- tibble(pca_1 = argentina_pca$ind$coord[ ,1],  
                          pca_2 = argentina_pca$ind$coord[ ,2])
argentina_comps

## # A tibble: 22 x 2
##      pca_1   pca_2
##      <dbl>   <dbl>
##  1 -2.36    5.86  
##  2 -0.658  -0.487 
##  3 -2.58    0.524 
##  4  2.83    0.159 
##  5  4.30    0.0907
##  6 -2.71   -1.37  
##  7  0.0434 -0.219 
##  8  4.21   -0.123 
##  9  0.811  -0.0219
## 10 -2.32   -1.54  
## # ... with 12 more rows

# Cluster the observations using the first 2 components and print its contents
( argentina_km <- kmeans(argentina_comps, centers = 4, nstart = 20, iter.max = 50))

## K-means clustering with 4 clusters of sizes 1, 7, 6, 8
## 
## Cluster means:
##        pca_1      pca_2
## 1 -2.3614699  5.8572297
## 2 -2.2235295 -0.5740342
## 3  3.1637648  0.1200775
## 4 -0.1320515 -0.3199319
## 
## Clustering vector:
##  [1] 1 4 2 3 3 2 4 3 4 2 4 2 3 2 4 3 4 4 2 2 3 4
## 
## Within cluster sum of squares by cluster:
## [1] 0.000000 8.403846 4.375350 3.109136
##  (between_SS / total_SS =  89.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
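
The between_SS / total_SS figure in the output above measures how much of the spread in the two components is explained by the cluster assignment. As a sketch in base R, it can be recomputed from the coordinates in $ind$coord and the clustering vector printed above, without rerunning kmeans(). The helper `sum_sq` is a hypothetical name introduced for this illustration.

```r
# First two principal component coordinates, copied from $ind$coord
pca_1 <- c(-2.36146994, -0.65845451, -2.57528138, 2.82685473, 4.30008229,
           -2.70724552, 0.04335852, 4.21149482, 0.81084392, -2.31574880,
           0.03981146, -1.49491730, 3.00162668, -1.54346805, -0.92603242,
           1.92430029, 0.28315539, -0.63180598, -3.24754516, -1.68050058,
           2.71823019, -0.01728864)
pca_2 <- c(5.85722972, -0.48651253, 0.52381577, 0.15875925, 0.09073316,
           -1.36535411, -0.21902507, -0.12335287, -0.02185901, -1.54147650,
           -0.89482651, 0.06125399, 0.32821769, -0.87168732, -0.59687582,
           0.19027541, -0.08601798, -0.40460402, -1.50531253, 0.68052132,
           0.07583245, 0.15026553)

# Cluster labels copied from the clustering vector above
cluster <- c(1, 4, 2, 3, 3, 2, 4, 3, 4, 2, 4, 2, 3, 2, 4, 3, 4, 4, 2, 2, 3, 4)

coords <- cbind(pca_1, pca_2)

# Sum of squared distances from each row of `m` to the column means of `m`
sum_sq <- function(m) sum(scale(m, center = TRUE, scale = FALSE)^2)

total_ss  <- sum_sq(coords)
within_ss <- sapply(split(as.data.frame(coords), cluster),
                    function(d) sum_sq(as.matrix(d)))

# Share of the total spread that lies between clusters
ratio <- 1 - sum(within_ss) / total_ss

round(within_ss, 3)
round(100 * ratio, 1)
```

A ratio close to 100% means the clusters are tight relative to the overall spread; the single-province cluster contributes zero within-cluster sum of squares by construction.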

Components with colors

Now that I have cluster assignments for each province, I will plot the provinces according to their principal component coordinates, colored by cluster.

# Convert assigned clusters to factor
clusters_as_factor <- factor(argentina_km$cluster)

# Plot individuals colored by cluster
fviz_pca_ind(argentina_pca, 
             title = "Clustered Provinces - PCA", 
             habillage = clusters_as_factor)

Buenos Aires, in a league of its own

A few things to note from the scatter plot:

- Cluster 1 includes only Buenos Aires and has a large positive value in Dimension 2 with an intermediate negative value in Dimension 1.
- Cluster 2 has the greatest negative values in Dimension 1.
- Cluster 3 has the greatest positive values in Dimension 1.
- Cluster 4 has small absolute values in Dimension 1.
- Clusters 2, 3, and 4 all have small absolute values in Dimension 2.

I will focus on exploring clusters 1, 2, and 3 in terms of the original variables in the next few sections.

As I noted earlier, Buenos Aires is in a league of its own, with the largest positive value in Dimension 2 by far. The figure below is a biplot, a combination of the individuals plot and the correlation circle plot from the earlier sections.

Since the vectors corresponding to gdp and pop are in the same direction as Dimension 2, Buenos Aires has high GDP and high population. I’ll visualize this pattern with a plot of gdp against cluster (I should get similar results with pop).

# Add cluster column to argentina
argentina <- argentina %>%
               mutate(cluster = clusters_as_factor)

# Make a scatterplot of gdp vs. cluster, colored by cluster
# (geom_text_repel() comes from the ggrepel package)
ggplot(argentina, aes(y = gdp, x = cluster, color = cluster)) +
  geom_point() +
  geom_text_repel(aes(label = province), show.legend = FALSE) +
  labs(x = "Cluster", y = "GDP") +
  ggtitle("Argentina's GDP vs Province Clusters")

The rich provinces

Provinces in cluster 2 have large negative values in Dimension 1. The biplot shows that gdp_per_cap, movie_theatres_per_cap and doctors_per_cap also have large negative values in Dimension 1.

If I plot gdp_per_cap for each cluster, I can see that provinces in cluster 2 generally have greater GDP per capita than the provinces in the other clusters. San Luis is the only province from the other clusters with gdp_per_cap in the range of values observed in cluster 2. I see similar results for movie_theatres_per_cap and doctors_per_cap.

# Make a scatterplot of GDP per capita vs. cluster, colored by cluster
ggplot(argentina, aes(y = gdp_per_cap, x = cluster, color = cluster)) +
    geom_point() +
    geom_text_repel(aes(label = province), show.legend = FALSE) +
    labs(x = "Cluster", y = "GDP per capita") +
    ggtitle("Argentina's GDP per Capita vs Province Clusters")

The poor provinces

Provinces in Cluster 3 have high positive values in Dimension 1. As shown in the biplot, provinces with high positive values in Dimension 1 have high values in poverty, deficient infrastructure, etc. These variables are also negatively correlated with gdp_per_cap, so these provinces have low values in this variable.

# Make scatterplot of poverty vs. cluster, colored by cluster
ggplot(argentina, aes(x = cluster, y = poverty, color = cluster)) +
    geom_point() +
    labs(x = "Cluster", y = "Poverty rate") +
    geom_text_repel(aes(label = province), show.legend = FALSE) +
    ggtitle("Argentina's Poverty vs Province Clusters")
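
A numeric summary can back up what the plot shows. The sketch below, in base R, averages the poverty rate within each cluster using the poverty column from $call$X and the clustering vector printed earlier in this document.

```r
# Poverty rates (from $call$X) and cluster labels (from the clustering vector)
poverty <- c(8.167798, 9.234095, 5.382380, 12.747191, 15.862619, 8.051752,
             7.288751, 17.035583, 13.367965, 3.398774, 10.875152, 5.692798,
             13.529788, 9.456635, 8.678391, 16.870500, 9.050784, 6.593771,
             8.024762, 6.081012, 11.759000, 11.214239)
cluster <- factor(c(1, 4, 2, 3, 3, 2, 4, 3, 4, 2, 4, 2, 3, 2, 4, 3, 4, 4,
                    2, 2, 3, 4))

# Mean poverty rate per cluster
mean_poverty <- tapply(poverty, cluster, mean)

round(mean_poverty, 2)
```

Cluster 3 has the highest average poverty rate and cluster 2 the lowest, in line with their positions at opposite ends of Dimension 1.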

Planning for public policy

Now that I have an idea of how social and economic welfare varies among provinces, I’ve been asked to help plan an education program. A pilot phase of the program will be carried out to identify design issues. My goal is to select the proposal with the most diverse set of provinces:

1. Tucumán, San Juan, and Entre Ríos
2. Córdoba, Santa Fe, and Mendoza
3. Buenos Aires, Santa Cruz, and Misiones

# Assign pilot provinces to the most diverse group: proposal 3 spans
# clusters 1 (Buenos Aires), 2 (Santa Cruz), and 3 (Misiones)
pilot_provinces <- 3
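
One way to make "most diverse" concrete is to count how many distinct clusters each proposal covers. The sketch below does this in base R. The cluster labels were read off the k-means output by matching provinces to rows through their population or GDP per capita; San Juan's label is an assumption, since its row is not identified in the printed output, and the result does not depend on it.

```r
# Proposals and cluster labels; San Juan's cluster (4) is an assumption
pilot <- data.frame(
  proposal = rep(1:3, each = 3),
  province = c("Tucumán", "San Juan", "Entre Ríos",
               "Córdoba", "Santa Fe", "Mendoza",
               "Buenos Aires", "Santa Cruz", "Misiones"),
  cluster  = c(4, 4, 4,
               2, 2, 2,
               1, 2, 3),
  stringsAsFactors = FALSE
)

# Number of distinct clusters covered by each proposal
n_clusters <- tapply(pilot$cluster, pilot$proposal,
                     function(x) length(unique(x)))

# Proposal covering the most clusters
most_diverse <- as.integer(names(which.max(n_clusters)))

n_clusters
most_diverse
```

Proposal 2 stays within a single cluster, proposal 1 covers at most two whatever San Juan's label is, and proposal 3 covers three.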