With almost 40 million inhabitants and a diverse geography that encompasses the Andes mountains, glacial lakes, and the Pampas grasslands, Argentina is the second largest country in South America by area and has one of the continent's largest economies. It is politically organized as a federation of 23 provinces and an autonomous city, Buenos Aires.
I analyze ten economic and social indicators collected for each province. Because these indicators are highly correlated, I will use principal component analysis (PCA) to reduce redundancies and highlight patterns that are not apparent in the raw data. After visualizing the patterns, I will use k-means clustering to partition the provinces into groups with similar development levels.
These results can be used to plan public policy by helping allocate resources to develop infrastructure, education, and welfare programs.
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.3.1
## v readr 1.3.0 v forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Parsed with column specification:
## cols(
## province = col_character(),
## gdp = col_double(),
## illiteracy = col_double(),
## poverty = col_double(),
## deficient_infra = col_double(),
## school_dropout = col_double(),
## no_healthcare = col_double(),
## birth_mortal = col_double(),
## pop = col_double(),
## movie_theatres_per_cap = col_double(),
## doctors_per_cap = col_double()
## )
## [1] 22
## # A tibble: 6 x 11
## province gdp illiteracy poverty deficient_infra school_dropout
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Buenos ~ 2.93e8 1.38 8.17 5.51 0.766
## 2 Catamar~ 6.15e6 2.34 9.23 10.5 0.952
## 3 Córdoba 6.94e7 2.71 5.38 10.4 1.04
## 4 Corrien~ 7.97e6 5.60 12.7 17.4 3.86
## 5 Chaco 9.83e6 7.52 15.9 31.5 2.58
## 6 Chubut 1.77e7 1.55 8.05 8.04 0.586
## # ... with 5 more variables: no_healthcare <dbl>, birth_mortal <dbl>,
## # pop <dbl>, movie_theatres_per_cap <dbl>, doctors_per_cap <dbl>
Argentina ranks third in South America in total population, but the population is unevenly distributed throughout the country. Sixty percent of the population resides in the Pampa region (Buenos Aires, La Pampa, Santa Fe, Entre Ríos, and Córdoba), which encompasses only about 20% of the land area.
GDP is a measure of the size of a province’s economy. To measure how rich or poor the inhabitants are, economists use per capita GDP, which is GDP divided by the province’s population.
# Add gdp_per_cap column to argentina
argentina <- argentina %>%
  mutate(gdp_per_cap = gdp / pop)
# Find the four richest provinces (rank explicitly by gdp_per_cap)
( rich_provinces <- argentina %>%
    arrange(desc(gdp_per_cap)) %>%
    select(province, gdp_per_cap) %>%
    top_n(4, gdp_per_cap) )
## # A tibble: 4 x 2
## province gdp_per_cap
## <chr> <dbl>
## 1 Santa Cruz 42.6
## 2 Neuquén 40.9
## 3 Chubut 34.9
## 4 San Luis 27.3
# Find the provinces with populations over 1 million
( bigger_pops <- argentina %>%
    arrange(desc(pop)) %>%
    select(province, pop) %>%
    filter(pop > 1000000) )
## # A tibble: 9 x 2
## province pop
## <chr> <dbl>
## 1 Buenos Aires 15625084
## 2 Córdoba 3308876
## 3 Santa Fe 3194537
## 4 Mendoza 1738929
## 5 Tucumán 1448188
## 6 Entre Ríos 1235994
## 7 Salta 1214441
## 8 Misiones 1101593
## 9 Chaco 1055259
Principal Component Analysis (PCA) is an unsupervised learning technique that summarizes multivariate data by reducing redundancies (variables that are correlated). New variables (the principal components) are linear combinations of the original variables that retain as much variation as possible. Some aspects of economic and social data are highly correlated, so let’s see what pops out. But first, I need to do some data preparation.
R makes it easy to run a PCA with the PCA() function from the FactoMineR package. The first argument in PCA() is a data frame or matrix of the data where the rows are “individuals” (or in our case, provinces) and columns are numeric variables. To prepare for the analysis, I will remove the column of province names and build a matrix from the dataset.
# Select numeric columns and cast to matrix
argentina_matrix <- argentina %>%
  select_if(is.numeric) %>%
  as.matrix()
# Print the first lines of the result
head(argentina_matrix)
## gdp illiteracy poverty deficient_infra school_dropout
## [1,] 292689868 1.38324 8.167798 5.511856 0.7661682
## [2,] 6150949 2.34414 9.234095 10.464484 0.9519631
## [3,] 69363739 2.71414 5.382380 10.436086 1.0350558
## [4,] 7968013 5.60242 12.747191 17.438858 3.8642652
## [5,] 9832643 7.51758 15.862619 31.479527 2.5774621
## [6,] 17747854 1.54806 8.051752 8.044618 0.5863094
## no_healthcare birth_mortal pop movie_theatres_per_cap
## [1,] 48.7947 4.4 15625084 6.015968e-06
## [2,] 45.0456 1.5 367828 5.437324e-06
## [3,] 45.7640 4.8 3308876 1.118204e-05
## [4,] 62.1103 5.9 992595 4.029841e-06
## [5,] 65.5104 7.5 1055259 2.842904e-06
## [6,] 39.5473 3.0 509108 1.571376e-05
## doctors_per_cap gdp_per_cap
## [1,] 0.004835622 18.732051
## [2,] 0.004502104 16.722352
## [3,] 0.010175359 20.962931
## [4,] 0.004495288 8.027456
## [5,] 0.003604802 9.317753
## [6,] 0.004498063 34.860686
PCA finds a lower-dimensional representation of the data that keeps the maximum amount of variance. It’s great for analyzing multivariate datasets, like this one, with multiple numerical columns that are highly correlated. Typically, the first few components preserve most of the information in the raw data, allowing me to go from eleven dimensions (the ten indicators plus GDP per capita) down to two dimensions (two variables that are summaries of the eleven inputs).
To run PCA, I need to make sure all the variables are on similar scales. Otherwise, variables with large variance will be overrepresented. In PCA(), setting scale.unit = TRUE ensures that variables are scaled to unit variance before crunching the numbers.
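To see why scaling matters, here is a minimal sketch. It does not use the Argentina data: the built-in USArrests dataset and base R's prcomp() are substitutions I am making for illustration, with the scale. argument of prcomp() standing in for scale.unit in PCA().

```r
# Illustration only: built-in USArrests data and base R prcomp(),
# not the argentina data or FactoMineR::PCA()
data(USArrests)

# The column variances differ by orders of magnitude
raw_variances <- sapply(USArrests, var)

# Without scaling, the first component is dominated by the highest-variance column
pca_unscaled <- prcomp(USArrests, scale. = FALSE)
dominant_unscaled <- names(which.max(abs(pca_unscaled$rotation[, 1])))

# With scaling to unit variance, several variables share the first component
pca_scaled <- prcomp(USArrests, scale. = TRUE)

print(round(raw_variances, 1))
print(dominant_unscaled)
print(round(abs(pca_unscaled$rotation[, 1]), 3))
print(round(abs(pca_scaled$rotation[, 1]), 3))
```

In the Argentina data, gdp and pop are in the millions while the per capita rates are tiny fractions, so the same effect would be expected without scaling.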
# Apply PCA and print results
( argentina_pca <- PCA(argentina_matrix, scale.unit = TRUE) )
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 22 individuals, described by 11 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
Now that I have the principal components, I can see how the original variables are correlated among themselves and how the original variables are correlated with the principal components. I will build a plot using the factoextra package to help understand these relationships. A correlation circle plot (also known as a variable correlation plot) shows the relationship among all variables as they are plotted on the first two principal components (Dimension 1 and Dimension 2).
To understand the plot, note that:
- Positively correlated variables have similar vectors.
- The vectors of negatively correlated variables are on opposite sides of the plot origin (opposite quadrants).
- Each axis represents a principal component. Vectors pointing in the direction of the component are correlated with that component.
- The percentage of the original variance explained by each component (dimension) is given in parentheses in the axes labels.
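The link between arrows and correlations can be checked numerically. The sketch below again uses the built-in USArrests data and base R's prcomp() as stand-ins (my assumptions, not the analysis code of this document). It shows that, for scaled data, a variable's coordinate on a component equals its correlation with that component.

```r
# Illustration only: built-in USArrests data and base R prcomp(),
# not the argentina data or FactoMineR::PCA()
pca <- prcomp(USArrests, scale. = TRUE)

# Coordinates of each variable on the correlation circle:
# eigenvector entry (loading) times the component's standard deviation
var_coords <- pca$rotation %*% diag(pca$sdev)
colnames(var_coords) <- colnames(pca$rotation)

# The same numbers computed directly as correlations
# between the original variables and the component scores
var_cors <- cor(USArrests, pca$x)

print(round(var_coords[, 1:2], 3))
print(all.equal(var_coords, var_cors))

# Squared coordinates sum to 1 across all components,
# so no arrow can extend beyond the unit circle
print(rowSums(var_coords^2))
```

This is consistent with the PCA() output printed in this document, where the $var$coord and $var$cor tables hold identical values.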
# Plot the original variables and the first 2 components and print the plot object.
( pca_var_plot <- fviz_pca_var(argentina_pca))
# Sum the variance preserved by the first two components. Print the result.
( variance_first_two_pca <- argentina_pca$eig[1, 2] + argentina_pca$eig[2, 2] )
## [1] 63.54897
With the first two principal components representing about 64% of the variance, most of the information I am interested in is summarized in these two components. From the variable correlation plot, I can see that population and GDP are highly correlated; illiteracy, poverty, no healthcare, school dropout, and deficient infrastructure are correlated; and GDP per capita and movie theaters per capita are correlated.
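As a quick check on where these percentages come from, the sketch below recomputes them from the eigenvalues. The eigenvalues are copied by hand from the $eig table printed for argentina_pca in this document; the variable names are my own.

```r
# Eigenvalues copied from the $eig table printed for argentina_pca
eigenvalues <- c(4.983629450, 2.006757647, 1.042135090, 0.949962523,
                 0.742104444, 0.485195430, 0.329075931, 0.184573578,
                 0.145702076, 0.129508268, 0.001355563)

# With unit-variance scaling, the eigenvalues sum to the number of variables (11)
total_variance <- sum(eigenvalues)

# Percentage of variance per component, and the running total
pct_variance <- 100 * eigenvalues / total_variance
cum_pct_variance <- cumsum(pct_variance)

print(round(total_variance, 6))
print(round(pct_variance[1:2], 4))
print(round(cum_pct_variance[2], 5))
```

The cumulative value for the first two components reproduces the 63.54897 obtained from argentina_pca$eig.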
But how do these correlations map to the provinces? To dive into that question, I’ll plot each province’s coordinates on the first two principal components and look for clusters.
# Visualize Dim2 vs. Dim1
fviz_pca_ind(argentina_pca, title = "Provinces - PCA")
It looks like one province stands out along the second dimension, while the rest are spread along the first dimension. Are there clusters I am not detecting? I’ll use k-means clustering to find out.
# Set seed to 1234 for reproducibility
set.seed(1234)
head(argentina_pca)
## $eig
## eigenvalue percentage of variance
## comp 1 4.983629450 45.3057223
## comp 2 2.006757647 18.2432513
## comp 3 1.042135090 9.4739554
## comp 4 0.949962523 8.6360229
## comp 5 0.742104444 6.7464040
## comp 6 0.485195430 4.4108675
## comp 7 0.329075931 2.9915994
## comp 8 0.184573578 1.6779416
## comp 9 0.145702076 1.3245643
## comp 10 0.129508268 1.1773479
## comp 11 0.001355563 0.0123233
## cumulative percentage of variance
## comp 1 45.30572
## comp 2 63.54897
## comp 3 73.02293
## comp 4 81.65895
## comp 5 88.40536
## comp 6 92.81622
## comp 7 95.80782
## comp 8 97.48576
## comp 9 98.81033
## comp 10 99.98768
## comp 11 100.00000
##
## $var
## $var$coord
## Dim.1 Dim.2 Dim.3 Dim.4
## gdp -0.3531993 0.91458696 0.08754649 0.12733753
## illiteracy 0.8853148 0.02380461 0.02007275 -0.18633306
## poverty 0.8519141 0.04728600 -0.11508341 0.21974852
## deficient_infra 0.6612476 -0.13860488 0.07865448 0.43724696
## school_dropout 0.6068040 0.02321504 0.41314618 -0.51753592
## no_healthcare 0.8827617 0.19865491 -0.05695380 0.01912375
## birth_mortal 0.4976821 -0.02647374 0.73210807 0.33456496
## pop -0.2587412 0.95003081 0.06144140 0.13485990
## movie_theatres_per_cap -0.7148864 -0.36527201 0.36142907 0.29671447
## doctors_per_cap -0.5956986 0.06231820 0.41227017 -0.38607478
## gdp_per_cap -0.7642370 -0.26020203 -0.01867155 0.15533866
## Dim.5
## gdp -0.01513966
## illiteracy 0.13325888
## poverty -0.24316858
## deficient_infra 0.46747981
## school_dropout -0.26983192
## no_healthcare 0.26044559
## birth_mortal -0.21528460
## pop -0.01995636
## movie_theatres_per_cap 0.05508681
## doctors_per_cap 0.47514859
## gdp_per_cap -0.17395790
##
## $var$cor
## Dim.1 Dim.2 Dim.3 Dim.4
## gdp -0.3531993 0.91458696 0.08754649 0.12733753
## illiteracy 0.8853148 0.02380461 0.02007275 -0.18633306
## poverty 0.8519141 0.04728600 -0.11508341 0.21974852
## deficient_infra 0.6612476 -0.13860488 0.07865448 0.43724696
## school_dropout 0.6068040 0.02321504 0.41314618 -0.51753592
## no_healthcare 0.8827617 0.19865491 -0.05695380 0.01912375
## birth_mortal 0.4976821 -0.02647374 0.73210807 0.33456496
## pop -0.2587412 0.95003081 0.06144140 0.13485990
## movie_theatres_per_cap -0.7148864 -0.36527201 0.36142907 0.29671447
## doctors_per_cap -0.5956986 0.06231820 0.41227017 -0.38607478
## gdp_per_cap -0.7642370 -0.26020203 -0.01867155 0.15533866
## Dim.5
## gdp -0.01513966
## illiteracy 0.13325888
## poverty -0.24316858
## deficient_infra 0.46747981
## school_dropout -0.26983192
## no_healthcare 0.26044559
## birth_mortal -0.21528460
## pop -0.01995636
## movie_theatres_per_cap 0.05508681
## doctors_per_cap 0.47514859
## gdp_per_cap -0.17395790
##
## $var$cos2
## Dim.1 Dim.2 Dim.3 Dim.4
## gdp 0.12474975 0.8364693038 0.0076643884 0.0162148476
## illiteracy 0.78378227 0.0005666597 0.0004029152 0.0347200079
## poverty 0.72575762 0.0022359658 0.0132441916 0.0482894099
## deficient_infra 0.43724841 0.0192113115 0.0061865277 0.1911849048
## school_dropout 0.36821106 0.0005389382 0.1706897644 0.2678434240
## no_healthcare 0.77926823 0.0394637724 0.0032437353 0.0003657178
## birth_mortal 0.24768746 0.0007008589 0.5359822300 0.1119337096
## pop 0.06694702 0.9025585381 0.0037750454 0.0181871924
## movie_theatres_per_cap 0.51106252 0.1334236430 0.1306309699 0.0880394785
## doctors_per_cap 0.35485688 0.0038835580 0.1699666951 0.1490537324
## gdp_per_cap 0.58405823 0.0677050978 0.0003486267 0.0241300981
## Dim.5
## gdp 0.0002292093
## illiteracy 0.0177579296
## poverty 0.0591309576
## deficient_infra 0.2185373727
## school_dropout 0.0728092624
## no_healthcare 0.0678319067
## birth_mortal 0.0463474578
## pop 0.0003982561
## movie_theatres_per_cap 0.0030345561
## doctors_per_cap 0.2257661842
## gdp_per_cap 0.0302613518
##
## $var$contrib
## Dim.1 Dim.2 Dim.3 Dim.4
## gdp 2.503191 41.68262695 0.73545057 1.70689340
## illiteracy 15.727138 0.02823757 0.03866248 3.65488185
## poverty 14.562833 0.11142181 1.27087090 5.08329631
## deficient_infra 8.773694 0.95733092 0.59363971 20.12552076
## school_dropout 7.388412 0.02685617 16.37885204 28.19515692
## no_healthcare 15.636560 1.96654402 0.31125862 0.03849813
## birth_mortal 4.970021 0.03492494 51.43116619 11.78296058
## pop 1.343339 44.97596107 0.36224146 1.91451684
## movie_theatres_per_cap 10.254826 6.64871731 12.53493632 9.26767913
## doctors_per_cap 7.120451 0.19352402 16.30946858 15.69048555
## gdp_per_cap 11.719536 3.37385523 0.03345312 2.54011053
## Dim.5
## gdp 0.03088639
## illiteracy 2.39291513
## poverty 7.96801017
## deficient_infra 29.44833094
## school_dropout 9.81118802
## no_healthcare 9.14047979
## birth_mortal 6.24540901
## pop 0.05366578
## movie_theatres_per_cap 0.40891227
## doctors_per_cap 30.42242719
## gdp_per_cap 4.07777530
##
##
## $ind
## $ind$coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## 1 -2.36146994 5.85722972 0.02887238 1.1415873 -0.55664309
## 2 -0.65845451 -0.48651253 -1.29107656 -0.2415875 -0.03742897
## 3 -2.57528138 0.52381577 1.70307390 -0.9930687 2.11197740
## 4 2.82685473 0.15875925 0.48894162 -1.0256780 0.09720908
## 5 4.30008229 0.09073316 0.08145973 0.8101764 1.08427385
## 6 -2.70724552 -1.36535411 -0.23583031 0.9771357 -0.60816245
## 7 0.04335852 -0.21902507 -0.37815826 -0.1520217 0.62600102
## 8 4.21149482 -0.12335287 1.70641723 2.0281065 -0.06036068
## 9 0.81084392 -0.02185901 -1.40833132 0.3811242 -0.10375906
## 10 -2.31574880 -1.54147650 1.21560898 1.5886977 1.04580005
## 11 0.03981146 -0.89482651 2.47490195 -0.4534430 -1.71571657
## 12 -1.49491730 0.06125399 -0.03492552 -0.5930355 0.07767891
## 13 3.00162668 0.32821769 -0.01572717 -0.8404313 -1.30454951
## 14 -1.54346805 -0.87168732 -0.34268279 0.3787263 -0.38147289
## 15 -0.92603242 -0.59687582 -1.15450347 0.4228471 0.90200815
## 16 1.92430029 0.19027541 -0.68696523 0.5769439 -0.29786660
## 17 0.28315539 -0.08601798 0.21078632 -1.7200906 -0.76180005
## 18 -0.63180598 -0.40460402 -0.03114572 -1.1202342 0.32829060
## 19 -3.24754516 -1.50531253 -0.71328750 0.9818379 -1.31232195
## 20 -1.68050058 0.68052132 0.58532027 -1.1784782 0.16509857
## 21 2.71823019 0.07583245 -1.39279281 -0.2316759 0.59215715
## 22 -0.01728864 0.15026553 -0.80995574 -0.7374384 0.10958703
##
## $ind$cos2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## 1 1.339410e-01 0.8240115152 2.002230e-05 0.03130164 0.0074422184
## 2 1.503778e-01 0.0820956847 5.781443e-01 0.02024329 0.0004859013
## 3 3.986197e-01 0.0164917402 1.743317e-01 0.05927450 0.2680944888
## 4 8.092557e-01 0.0025524477 2.420989e-02 0.10653709 0.0009569571
## 5 8.460085e-01 0.0003766633 3.036038e-04 0.03003173 0.0537897069
## 6 6.425089e-01 0.1634234811 4.875544e-03 0.08370158 0.0324237361
## 7 1.013185e-03 0.0258539981 7.707033e-02 0.01245521 0.2111981610
## 8 6.834503e-01 0.0005863175 1.122031e-01 0.15849512 0.0001403921
## 9 1.412998e-01 0.0001026900 4.262621e-01 0.03121764 0.0023137642
## 10 3.593658e-01 0.1592309960 9.902426e-02 0.16913611 0.0732910777
## 11 1.486119e-04 0.0750784016 5.743192e-01 0.01927892 0.2760122966
## 12 5.665016e-01 0.0009511205 3.092093e-04 0.08915140 0.0015295831
## 13 7.026757e-01 0.0084016692 1.929047e-05 0.05508660 0.1327280557
## 14 3.799797e-01 0.1211955114 1.873051e-02 0.02287789 0.0232109243
## 15 2.103926e-01 0.0874069813 3.270157e-01 0.04386769 0.1996177111
## 16 5.842851e-01 0.0057127394 7.446439e-02 0.05252261 0.0139998282
## 17 1.821912e-02 0.0016813427 1.009631e-02 0.67232677 0.1318741723
## 18 1.234999e-01 0.0506476681 3.001205e-04 0.38825472 0.0333438665
## 19 6.113929e-01 0.1313600570 2.949437e-02 0.05588427 0.0998368240
## 20 4.058789e-01 0.0665583615 4.923867e-02 0.19960079 0.0039174710
## 21 6.579750e-01 0.0005120909 1.727468e-01 0.00477968 0.0312256073
## 22 9.523524e-05 0.0071944029 2.090252e-01 0.17327166 0.0038264349
##
## $ind$contrib
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## 1 5.086235e+00 77.708210251 0.003635956 6.2357560 1.897867670
## 2 3.954423e-01 0.536130420 7.270383057 0.2792671 0.008580807
## 3 6.048963e+00 0.621497698 12.650866941 4.7187766 27.320636998
## 4 7.288507e+00 0.057090053 1.042719074 5.0337644 0.057879657
## 5 1.686495e+01 0.018647236 0.028942713 3.1407258 7.200957614
## 6 6.684776e+00 4.222532436 0.242578627 4.5685736 2.265434379
## 7 1.714670e-03 0.108660090 0.623735519 0.1105814 2.400282576
## 8 1.617723e+01 0.034465206 12.700585922 19.6812355 0.022316217
## 9 5.996614e-01 0.001082290 8.650933511 0.6950305 0.065942340
## 10 4.891189e+00 5.382155130 6.445269762 12.0768420 6.698994194
## 11 1.445599e-03 1.813677529 26.715868375 0.9838215 18.030317687
## 12 2.038290e+00 0.008498673 0.005320336 1.6827989 0.036958859
## 13 8.217599e+00 0.244009295 0.001078834 3.3796773 10.423976357
## 14 2.172836e+00 1.721091062 0.512198015 0.6863123 0.891332298
## 15 7.821390e-01 0.806956945 5.813581796 0.8555344 4.983485604
## 16 3.377359e+00 0.082006393 2.058367022 1.5927150 0.543445376
## 17 7.312758e-02 0.016759493 0.193793033 14.1570741 3.554629041
## 18 3.640819e-01 0.370802610 0.004231067 6.0046614 0.660129561
## 19 9.619267e+00 5.132580229 2.219128865 4.6126493 10.548557377
## 20 2.575781e+00 1.048976759 1.494309043 6.6452903 0.166954805
## 21 6.739133e+00 0.013025443 8.461090299 0.2568223 2.147762468
## 22 2.726172e-04 0.051144759 2.861382234 2.6020904 0.073558116
##
## $ind$dist
## 1 2 3 4 5 6 7 8
## 6.452464 1.697985 4.078921 3.142394 4.675084 3.377443 1.362166 5.094280
## 9 10 11 12 13 14 15 16
## 2.157081 3.862986 3.265738 1.986171 3.580793 2.503903 2.018882 2.517448
## 17 18 19 20 21 22
## 2.097786 1.797838 4.153316 2.637791 3.351055 1.771585
##
##
## $svd
## $svd$vs
## [1] 2.23240441 1.41660074 1.02085018 0.97466021 0.86145484 0.69655971
## [7] 0.57365140 0.42962027 0.38170941 0.35987257 0.03681797
##
## $svd$U
## [,1] [,2] [,3] [,4] [,5]
## [1,] -1.057814582 4.13470752 0.02828268 1.1712670 -0.64616630
## [2,] -0.294953058 -0.34343659 -1.26470719 -0.2478684 -0.04344856
## [3,] -1.153590884 0.36976951 1.66828976 -1.0188871 2.45164030
## [4,] 1.266282539 0.11207057 0.47895532 -1.0523441 0.11284292
## [5,] 1.926211162 0.06404992 0.07979597 0.8312398 1.25865431
## [6,] -1.212703894 -0.96382422 -0.23101363 1.0025399 -0.70597136
## [7,] 0.019422341 -0.15461313 -0.37043463 -0.1559740 0.72667886
## [8,] 1.886528625 -0.08707666 1.67156481 2.0808344 -0.07006831
## [9,] 0.363215518 -0.01543061 -1.37956710 0.3910329 -0.12044632
## [10,] -1.037333912 -1.08815170 1.19078098 1.6300016 1.21399288
## [11,] 0.017833443 -0.63167164 2.42435374 -0.4652319 -1.99165004
## [12,] -0.669644485 0.04324012 -0.03421219 -0.6084536 0.09017177
## [13,] 1.344571201 0.23169386 -0.01540595 -0.8622813 -1.51435623
## [14,] -0.691392673 -0.61533733 -0.33568373 0.3885727 -0.44282401
## [15,] -0.414813920 -0.42134372 -1.13092351 0.4338405 1.04707537
## [16,] 0.861985528 0.13431830 -0.67293443 0.5919437 -0.34577158
## [17,] 0.126838751 -0.06072140 0.20648115 -1.7648106 -0.88431804
## [18,] -0.283015919 -0.28561613 -0.03050958 -1.1493587 0.38108858
## [19,] -1.454729771 -1.06262300 -0.69871908 1.0073643 -1.52337869
## [20,] -0.752776053 0.48039035 0.57336549 -1.2091170 0.19165087
## [21,] 1.217624447 0.05353128 -1.36434595 -0.2376992 0.68739199
## [22,] -0.007744403 0.10607472 -0.79341294 -0.7566108 0.12721158
##
## $svd$V
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.1582148 0.64562084 0.08575841 0.13064813 -0.01757452
## [2,] 0.3965746 0.01680404 0.01966278 -0.19117745 0.15469050
## [3,] 0.3816128 0.03337991 -0.11273291 0.22546167 -0.28227664
## [4,] 0.2962042 -0.09784329 0.07704802 0.44861477 0.54266316
## [5,] 0.2718163 0.01638785 0.40470794 -0.53099112 -0.31322816
## [6,] 0.3954309 0.14023352 -0.05579056 0.01962094 0.30233226
## [7,] 0.2229355 -0.01868822 0.71715526 0.34326317 -0.24990816
## [8,] -0.1159025 0.67064119 0.06018650 0.13836607 -0.02316588
## [9,] -0.3202316 -0.25785107 0.35404712 0.30442863 0.06394625
## [10,] -0.2668417 0.04399136 0.40384983 -0.39611218 0.55156529
## [11,] -0.3423381 -0.18368057 -0.01829020 0.15937724 -0.20193502
##
##
## $call
## $call$row.w
## [1] 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455
## [7] 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455
## [13] 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455
## [19] 0.04545455 0.04545455 0.04545455 0.04545455
##
## $call$col.w
## [1] 1 1 1 1 1 1 1 1 1 1 1
##
## $call$scale.unit
## [1] TRUE
##
## $call$ncp
## [1] 5
##
## $call$centre
## [1] 3.055703e+07 3.225541e+00 9.925625e+00 1.267730e+01 1.724866e+00
## [6] 5.076884e+01 4.986364e+00 1.686352e+06 7.143952e-06 4.893720e-03
## [11] 1.834566e+01
##
## $call$ecart.type
## [1] 6.040940e+07 1.808927e+00 3.692633e+00 7.050933e+00 1.125942e+00
## [6] 8.969951e+00 3.417907e+00 3.145799e+06 4.273323e-06 1.487053e-03
## [11] 1.013528e+01
##
## $call$X
## gdp illiteracy poverty deficient_infra school_dropout
## 1 292689868 1.383240 8.167798 5.511856 0.7661682
## 2 6150949 2.344140 9.234095 10.464484 0.9519631
## 3 69363739 2.714140 5.382380 10.436086 1.0350558
## 4 7968013 5.602420 12.747191 17.438858 3.8642652
## 5 9832643 7.517580 15.862619 31.479527 2.5774621
## 6 17747854 1.548060 8.051752 8.044618 0.5863094
## 7 20743409 3.185580 7.288751 18.794568 1.8871881
## 8 3807057 4.610640 17.035583 28.004985 2.2689741
## 9 6484938 2.151390 13.367965 12.483179 0.7212945
## 10 6990262 1.539300 3.398774 16.505714 0.2040934
## 11 5590516 2.773210 10.875152 7.403254 3.8449494
## 12 33431369 2.200200 5.692798 3.839852 1.0637179
## 13 9646826 6.863950 13.529788 8.325740 3.1291244
## 14 22564106 1.943750 9.456635 11.267278 1.3935038
## 15 10264584 2.031420 8.678391 14.885444 0.4080420
## 16 13438835 3.346090 16.870500 14.182303 1.4820300
## 17 8262309 2.963260 9.050784 3.914390 3.2984129
## 18 11780849 3.433650 6.593771 9.679894 2.0001724
## 19 11663738 0.791485 8.024762 7.411364 0.2892622
## 20 81588690 1.975940 6.081012 11.869195 2.8721807
## 21 8387859 6.272090 11.759000 20.491433 2.3255981
## 22 13856199 3.770370 11.214239 6.466665 0.9772847
## no_healthcare birth_mortal pop movie_theatres_per_cap
## 1 48.7947 4.4 15625084 6.015968e-06
## 2 45.0456 1.5 367828 5.437324e-06
## 3 45.7640 4.8 3308876 1.118204e-05
## 4 62.1103 5.9 992595 4.029841e-06
## 5 65.5104 7.5 1055259 2.842904e-06
## 6 39.5473 3.0 509108 1.571376e-05
## 7 48.6571 3.1 1235994 5.663458e-06
## 8 65.8126 16.2 530162 3.772432e-06
## 9 54.1615 3.7 673307 2.970413e-06
## 10 45.4764 7.2 318951 1.881167e-05
## 11 40.8341 11.4 333642 1.198890e-05
## 12 50.5843 4.4 1738929 8.050932e-06
## 13 57.8339 8.1 1101593 1.815553e-06
## 14 48.7431 3.3 551266 9.070032e-06
## 15 49.9463 0.8 638645 9.394891e-06
## 16 60.4230 5.8 1214441 4.117121e-06
## 17 52.9684 4.2 681055 5.873241e-06
## 18 51.6154 3.8 432310 4.626310e-06
## 19 29.2321 3.3 273964 1.095034e-05
## 20 41.9660 2.6 3194537 6.573723e-06
## 21 63.6637 1.7 874006 3.432471e-06
## 22 48.2242 3.0 1448188 4.833627e-06
## doctors_per_cap gdp_per_cap
## 1 0.004835622 18.732051
## 2 0.004502104 16.722352
## 3 0.010175359 20.962931
## 4 0.004495288 8.027456
## 5 0.003604802 9.317753
## 6 0.004498063 34.860686
## 7 0.004678825 16.782775
## 8 0.003440458 7.180932
## 9 0.003958076 9.631473
## 10 0.005414625 21.916415
## 11 0.005092285 16.756031
## 12 0.005720188 19.225264
## 13 0.002880374 8.757160
## 14 0.005066520 40.931431
## 15 0.004897870 16.072442
## 16 0.003991137 11.065861
## 17 0.005043646 12.131632
## 18 0.006102103 27.250930
## 19 0.004270634 42.573981
## 20 0.006671702 25.540067
## 21 0.002821491 9.597026
## 22 0.005500667 9.567956
##
## $call$row.w.init
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $call$call
## PCA(X = argentina_matrix, scale.unit = TRUE)
# Create an intermediate data frame with pca_1 and pca_2
argentina_comps <- tibble(pca_1 = argentina_pca$ind$coord[, 1],
                          pca_2 = argentina_pca$ind$coord[, 2])
argentina_comps
## # A tibble: 22 x 2
## pca_1 pca_2
## <dbl> <dbl>
## 1 -2.36 5.86
## 2 -0.658 -0.487
## 3 -2.58 0.524
## 4 2.83 0.159
## 5 4.30 0.0907
## 6 -2.71 -1.37
## 7 0.0434 -0.219
## 8 4.21 -0.123
## 9 0.811 -0.0219
## 10 -2.32 -1.54
## # ... with 12 more rows
# Cluster the observations using the first 2 components and print its contents
( argentina_km <- kmeans(argentina_comps, centers = 4, nstart = 20, iter.max = 50))
## K-means clustering with 4 clusters of sizes 1, 7, 6, 8
##
## Cluster means:
## pca_1 pca_2
## 1 -2.3614699 5.8572297
## 2 -2.2235295 -0.5740342
## 3 3.1637648 0.1200775
## 4 -0.1320515 -0.3199319
##
## Clustering vector:
## [1] 1 4 2 3 3 2 4 3 4 2 4 2 3 2 4 3 4 4 2 2 3 4
##
## Within cluster sum of squares by cluster:
## [1] 0.000000 8.403846 4.375350 3.109136
## (between_SS / total_SS = 89.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
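The between_SS / total_SS figure can be reconstructed from numbers already printed in this document. The sketch below is a hedged check, not part of the original analysis. It assumes the PCA coordinates are centred with a weight of 1/22 per province (as $call$row.w suggests), so that the total sum of squares of each component is 22 times its eigenvalue.

```r
# Values copied from the outputs printed in this document
eig_first_two <- c(4.983629450, 2.006757647)           # eigenvalues of Dim 1 and Dim 2
withinss <- c(0.000000, 8.403846, 4.375350, 3.109136)  # within-cluster sums of squares
n_provinces <- 22

# Assumption: coordinates are centred and uniformly weighted, so the total
# sum of squares of a component equals n times its eigenvalue
totss <- n_provinces * sum(eig_first_two)
tot_withinss <- sum(withinss)
betweenss <- totss - tot_withinss

# Share of the total variation explained by the cluster assignment
pct_between <- 100 * betweenss / totss
print(round(pct_between, 1))
```

A high share means the four cluster centers capture most of the spread of the provinces in the two-component space. The one-province cluster contributes zero within-cluster variation by construction.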
Now that I have cluster assignments for each province, I will plot the provinces according to their principal components coordinates, colored by the cluster.
# Convert assigned clusters to factor
clusters_as_factor <- factor(argentina_km$cluster)
# Plot individuals colored by cluster
fviz_pca_ind(argentina_pca,
             title = "Clustered Provinces - PCA",
             habillage = clusters_as_factor)
A few things to note from the scatter plot:
- Cluster 1 includes only Buenos Aires and has a large positive value in Dimension 2 with an intermediate negative value in Dimension 1.
- Cluster 2 has the greatest negative values in Dimension 1.
- Cluster 3 has the greatest positive values in Dimension 1.
- Cluster 4 has small absolute values in Dimension 1.
- Clusters 2, 3, and 4 all have small absolute values in Dimension 2.

I will focus on exploring clusters 1, 2, and 3 in terms of the original variables in the next few sections.
As I noted earlier, Buenos Aires is in a league of its own, with the largest positive value in Dimension 2 by far. The figure below is a biplot, a combination of the individuals plot and the correlation circle plot shown earlier.
Since the vectors corresponding to gdp and pop are in the same direction as Dimension 2, Buenos Aires has high GDP and high population. I’ll visualize this pattern with a plot of gdp against cluster (I should get similar results with pop).
# Add cluster column to argentina
argentina <- argentina %>%
  mutate(cluster = clusters_as_factor)
# Make a scatterplot of gdp vs. cluster, colored by cluster
ggplot(argentina, aes(y = gdp, x = cluster, color = cluster)) +
  geom_point() +
  geom_text_repel(aes(label = province), show.legend = FALSE) +
  labs(x = "Cluster", y = "GDP") +
  ggtitle("Argentina's GDP vs Province Clusters")
Provinces in cluster 2 have large negative values in Dimension 1. The biplot shows that gdp_per_cap, movie_theatres_per_cap, and doctors_per_cap also have high negative values in Dimension 1.
If I plot gdp_per_cap for each cluster, I can see that provinces in cluster 2, in general, have greater GDP per capita than the provinces in the other clusters. San Luis is the only province from the other clusters with gdp_per_cap in the range of values observed in cluster 2. I see similar results for movie_theatres_per_cap and doctors_per_cap.
# Make a scatterplot of GDP per capita vs. cluster, colored by cluster
ggplot(argentina, aes(y = gdp_per_cap, x = cluster, color = cluster)) +
  geom_point() +
  geom_text_repel(aes(label = province), show.legend = FALSE) +
  labs(x = "Cluster", y = "GDP per capita") +
  ggtitle("Argentina's GDP per Capita vs Province Clusters")
Provinces in Cluster 3 have high positive values in Dimension 1. As shown in the biplot, provinces with high positive values in Dimension 1 have high values in poverty, deficient infrastructure, etc. These variables are also negatively correlated with gdp_per_cap, so these provinces have low values in this variable.
# Make scatterplot of poverty vs. cluster, colored by cluster
ggplot(argentina, aes(x = cluster, y = poverty, color = cluster)) +
  geom_point() +
  labs(x = "Cluster", y = "Poverty rate") +
  geom_text_repel(aes(label = province), show.legend = FALSE) +
  ggtitle("Argentina's Poverty vs Province Clusters")
Now that I have an idea of how social and economic welfare varies among provinces, I’ve been asked to help plan an education program. A pilot phase of the program will be carried out to identify design issues. My goal is to select the proposal with the most diverse set of provinces:
1. Tucumán, San Juan, and Entre Ríos
2. Córdoba, Santa Fe, and Mendoza
3. Buenos Aires, Santa Cruz, and Misiones
# Assign pilot provinces to the most diverse group:
# proposal 3 is the only one whose provinces fall in three different clusters (1, 2, and 3)
pilot_provinces <- 3
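The choice can be double-checked by counting the distinct clusters in each proposal. The sketch below hard-codes cluster labels read off the k-means clustering vector printed earlier, matching provinces to rows through the populations and GDP per capita values shown in this document. San Juan does not appear in any printed table, so its label is my assumption; the conclusion does not depend on it, because Tucumán and Entre Ríos already share a cluster.

```r
# Cluster labels taken from the printed k-means clustering vector.
# Rows were matched to provinces via the printed pop and gdp_per_cap values.
# ASSUMPTION: San Juan's label (4) is inferred, not confirmed by the printed output.
cluster_of <- c("Buenos Aires" = 1, "Córdoba" = 2, "Entre Ríos" = 4,
                "Mendoza" = 2, "Misiones" = 3, "San Juan" = 4,
                "Santa Cruz" = 2, "Santa Fe" = 2, "Tucumán" = 4)

# The three candidate proposals for the pilot phase
proposals <- list(
  "1" = c("Tucumán", "San Juan", "Entre Ríos"),
  "2" = c("Córdoba", "Santa Fe", "Mendoza"),
  "3" = c("Buenos Aires", "Santa Cruz", "Misiones")
)

# Count how many different clusters each proposal covers
distinct_clusters <- sapply(proposals, function(p) length(unique(cluster_of[p])))
most_diverse <- as.integer(names(which.max(distinct_clusters)))

print(distinct_clusters)
print(most_diverse)
```

Counting distinct clusters is only one way to define diversity. With the full argentina data frame, the same count could be taken from its cluster column instead of a hand-built lookup.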