Principal Component Analysis (PCA) is a powerful statistical technique that reduces the complexity of a dataset by identifying underlying patterns and summarizing the information into latent factors. Its primary objective is to identify and describe the latent constructs that underlie the observed variables. These constructs can be understood as unobservable variables behind our database, while the factors can be interpreted as estimates of those unobservable variables obtained after conducting a factor analysis. In short, the technique helps us understand and interpret the underlying structure of the data, simplifying its complexity and extracting the information that lies behind the observed variables.
In this case, we will apply Principal Component Analysis based on the Kaiser criterion to extract factors representing the main dimensions that influence the quality of life in Brazilian metropolitan regions.
We will explore a comprehensive database containing metrics/variables related to longevity, education, income of inhabitants, as well as variables related to the infrastructure of these metropolitan regions.
Our ultimate goal is to create a ranking of regions, ordering them from best to worst, based on the information extracted from the database through factor analysis.
Database used: A - Base 20 RMs 2000_2010 (please open the link in a new tab/window).
# Packages used throughout the analysis
pacotes <- c("tidyverse","ggrepel","reshape2","knitr","kableExtra","dplyr","Hmisc","ltm","readxl",
             "PerformanceAnalytics","plotly","factoextra","psych","sp","tmap")

# Install any packages that are still missing, then load all of them
instalador <- pacotes[!pacotes %in% installed.packages()]
if(length(instalador) > 0) install.packages(instalador, dependencies = TRUE)
sapply(pacotes, require, character.only = TRUE)
We will use a database provided by IBGE (the Brazilian Institute of Geography and Statistics); the same file is available at the GitHub link above.
dados_completos <- read_excel("A - Base 20 RMs 2000_2010.xlsx")
As the original database contains an extensive number of variables, let’s create a new database that includes only the most relevant variables for our analysis:
dados <- dados_completos[, c("NOME_RM", "ESPVIDA", "FECTOT", "MORT5", "SOBRE60", "E_ANOSESTUDO", "T_SUPER25M", "RDPC", "T_DES18M", "T_BANAGUA", "T_LIXO", "T_LUZ", "POP", "IDHM", "IDHM_E", "IDHM_L", "IDHM_R")]
Below is the legend for the selected variables (expressed as averages or rates):
ESPVIDA = Life expectancy;
FECTOT = Number of children per woman;
MORT5 = Under-5 mortality rate;
SOBRE60 = Survival rate up to 60 years of age;
E_ANOSESTUDO = Expected years of schooling at age 18;
T_SUPER25M = Rate of people aged 25 or older with completed higher education;
RDPC = Per capita family income;
T_DES18M = Unemployment rate for individuals aged 18 or older;
T_BANAGUA = Rate of households with indoor plumbing and piped water supply;
T_LIXO = Rate of households served by a garbage collection service;
T_LUZ = Rate of households with electric power;
POP = Number of people residing in fixed households;
IDHM = Municipal Human Development Index;
IDHM_E = Municipal Human Development Index - Education dimension;
IDHM_L = Municipal Human Development Index - Longevity dimension;
IDHM_R = Municipal Human Development Index - Income dimension;
dados %>%
slice(1:10) %>%
kable(format = "html", escape = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
scroll_box(width = "100%", height = "250px")
| NOME_RM | ESPVIDA | FECTOT | MORT5 | SOBRE60 | E_ANOSESTUDO | T_SUPER25M | RDPC | T_DES18M | T_BANAGUA | T_LIXO | T_LUZ | POP | IDHM | IDHM_E | IDHM_L | IDHM_R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RM Manaus (AM) | 68.77 | 2.80 | 41.53 | 75.22 | 7.99 | 4.80 | 487.67 | 21.72 | 61.59 | 90.97 | 95.33 | 1631641 | 0.585 | 0.414 | 0.730 | 0.661 |
| RM Belém (PA) | 70.13 | 2.27 | 35.46 | 76.92 | 8.64 | 7.15 | 524.82 | 18.42 | 68.57 | 90.75 | 99.04 | 1964065 | 0.621 | 0.474 | 0.752 | 0.672 |
| RM Grande São Luís (MA) | 68.73 | 2.24 | 49.25 | 75.04 | 9.37 | 6.12 | 449.12 | 20.48 | 54.96 | 74.21 | 98.58 | 1086365 | 0.642 | 0.560 | 0.729 | 0.647 |
| RM Fortaleza (CE) | 69.55 | 2.38 | 40.32 | 76.53 | 8.99 | 6.31 | 496.32 | 16.14 | 72.80 | 90.47 | 98.24 | 3041196 | 0.622 | 0.488 | 0.743 | 0.663 |
| RM Natal (RN) | 69.49 | 2.37 | 49.93 | 76.61 | 9.16 | 7.60 | 537.69 | 16.84 | 81.68 | 95.45 | 98.83 | 1124946 | 0.625 | 0.487 | 0.742 | 0.676 |
| RM Recife (PE) | 69.26 | 2.10 | 48.62 | 75.59 | 8.75 | 8.95 | 560.66 | 22.14 | 80.04 | 87.43 | 99.62 | 3306954 | 0.627 | 0.490 | 0.738 | 0.683 |
| RM Maceió (AL) | 65.85 | 2.47 | 57.86 | 69.66 | 7.58 | 7.96 | 503.36 | 22.32 | 81.50 | 92.71 | 98.82 | 1019183 | 0.567 | 0.402 | 0.681 | 0.666 |
| RM Salvador (BA) | 69.58 | 1.90 | 49.68 | 75.78 | 8.80 | 8.04 | 614.59 | 23.86 | 85.59 | 91.12 | 99.29 | 3099673 | 0.636 | 0.497 | 0.743 | 0.698 |
| RM Belo Horizonte (MG) | 72.01 | 2.01 | 27.88 | 79.86 | 9.82 | 9.71 | 782.97 | 16.84 | 95.15 | 93.68 | 99.60 | 4330618 | 0.682 | 0.549 | 0.784 | 0.737 |
| RM Grande Vitória (MG) | 71.75 | 2.09 | 27.08 | 77.90 | 10.02 | 8.23 | 731.30 | 16.05 | 93.70 | 91.06 | 99.72 | 1432003 | 0.678 | 0.552 | 0.779 | 0.726 |
Showing only the first 10 rows.
summary(dados[,2:17])
## ESPVIDA FECTOT MORT5 SOBRE60
## Min. :65.85 Min. :1.620 Min. :12.92 Min. :69.66
## 1st Qu.:71.69 1st Qu.:1.810 1st Qu.:16.20 1st Qu.:77.81
## Median :73.28 Median :1.990 Median :21.16 Median :82.35
## Mean :72.96 Mean :2.008 Mean :25.32 Mean :81.15
## 3rd Qu.:75.18 3rd Qu.:2.172 3rd Qu.:27.71 3rd Qu.:84.16
## Max. :76.46 Max. :2.800 Max. :57.86 Max. :87.84
## E_ANOSESTUDO T_SUPER25M RDPC T_DES18M
## Min. : 7.580 Min. : 4.80 Min. : 449.1 Min. : 4.450
## 1st Qu.: 9.485 1st Qu.: 8.92 1st Qu.: 705.9 1st Qu.: 7.968
## Median : 9.920 Median :11.23 Median : 827.9 Median :13.085
## Mean : 9.777 Mean :11.45 Mean : 843.9 Mean :12.989
## 3rd Qu.:10.190 3rd Qu.:14.15 3rd Qu.:1001.7 3rd Qu.:16.810
## Max. :11.120 Max. :18.72 Max. :1362.5 Max. :23.860
## T_BANAGUA T_LIXO T_LUZ POP
## Min. :54.96 Min. :74.21 Min. :95.33 Min. : 718993
## 1st Qu.:82.19 1st Qu.:92.58 1st Qu.:99.07 1st Qu.: 1590334
## Median :93.36 Median :97.16 Median :99.73 Median : 2294862
## Mean :88.91 Mean :95.44 Mean :99.41 Mean : 3651660
## 3rd Qu.:96.50 3rd Qu.:98.73 3rd Qu.:99.87 3rd Qu.: 3620772
## Max. :97.62 Max. :99.77 Max. :99.97 Max. :19601268
## IDHM IDHM_E IDHM_L IDHM_R
## Min. :0.5670 Min. :0.4020 Min. :0.6810 Min. :0.6470
## 1st Qu.:0.6755 1st Qu.:0.5308 1st Qu.:0.7780 1st Qu.:0.7200
## Median :0.7120 Median :0.6000 Median :0.8050 Median :0.7455
## Mean :0.7094 Mean :0.6042 Mean :0.7996 Mean :0.7425
## 3rd Qu.:0.7675 3rd Qu.:0.6917 3rd Qu.:0.8367 3rd Qu.:0.7760
## Max. :0.7940 Max. :0.7370 Max. :0.8580 Max. :0.8260
Let’s extract the correlation coefficients using the rcorr function and store them in an object named “rho”.
rho <- rcorr(as.matrix(dados[,2:17]), type="pearson")
We can now extract some information from the “rho” object we created:
corr_coef <- rho$r
corr_coef %>%
kable(format = "html", escape = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
scroll_box(width = "100%", height = "250px")
| | ESPVIDA | FECTOT | MORT5 | SOBRE60 | E_ANOSESTUDO | T_SUPER25M | RDPC | T_DES18M | T_BANAGUA | T_LIXO | T_LUZ | POP | IDHM | IDHM_E | IDHM_L | IDHM_R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ESPVIDA | 1.0000000 | -0.8012736 | -0.9398518 | 0.8999331 | 0.8108366 | 0.8832583 | 0.8547294 | -0.9032078 | 0.6921079 | 0.6892453 | 0.5904512 | 0.2191420 | 0.9733573 | 0.9103724 | 0.9999810 | 0.8726580 |
| FECTOT | -0.8012736 | 1.0000000 | 0.6773541 | -0.7265224 | -0.6253673 | -0.7957273 | -0.6764232 | 0.6923827 | -0.6263505 | -0.4518295 | -0.8233436 | -0.3250679 | -0.8541995 | -0.8513180 | -0.8007404 | -0.6990485 |
| MORT5 | -0.9398518 | 0.6773541 | 1.0000000 | -0.8742580 | -0.8117622 | -0.7660431 | -0.8014588 | 0.8365488 | -0.7059415 | -0.7155116 | -0.5351988 | -0.2230180 | -0.9011792 | -0.8216582 | -0.9395207 | -0.8471580 |
| SOBRE60 | 0.8999331 | -0.7265224 | -0.8742580 | 1.0000000 | 0.7315983 | 0.7438187 | 0.6477674 | -0.8109527 | 0.5432362 | 0.5607777 | 0.4986221 | 0.1262239 | 0.8803374 | 0.8711136 | 0.8995537 | 0.6800611 |
| E_ANOSESTUDO | 0.8108366 | -0.6253673 | -0.8117622 | 0.7315983 | 1.0000000 | 0.6373194 | 0.6614750 | -0.7073972 | 0.6410770 | 0.5414740 | 0.5916271 | 0.1544536 | 0.8289873 | 0.8015708 | 0.8110836 | 0.7050427 |
| T_SUPER25M | 0.8832583 | -0.7957273 | -0.7660431 | 0.7438187 | 0.6373194 | 1.0000000 | 0.8970472 | -0.8310691 | 0.6716508 | 0.6147516 | 0.6277368 | 0.3034201 | 0.9141601 | 0.8518184 | 0.8833361 | 0.8822774 |
| RDPC | 0.8547294 | -0.6764232 | -0.8014588 | 0.6477674 | 0.6614750 | 0.8970472 | 1.0000000 | -0.7369514 | 0.7586944 | 0.6774055 | 0.5605634 | 0.4335171 | 0.8442711 | 0.7007926 | 0.8546981 | 0.9884021 |
| T_DES18M | -0.9032078 | 0.6923827 | 0.8365488 | -0.8109527 | -0.7073972 | -0.8310691 | -0.7369514 | 1.0000000 | -0.5519304 | -0.6356822 | -0.5080116 | -0.0872201 | -0.8795034 | -0.8423586 | -0.9034464 | -0.7410374 |
| T_BANAGUA | 0.6921079 | -0.6263505 | -0.7059415 | 0.5432362 | 0.6410770 | 0.6716508 | 0.7586944 | -0.5519304 | 1.0000000 | 0.8196077 | 0.7191427 | 0.2827204 | 0.6719106 | 0.5395288 | 0.6917530 | 0.8192453 |
| T_LIXO | 0.6892453 | -0.4518295 | -0.7155116 | 0.5607777 | 0.5414740 | 0.6147516 | 0.6774055 | -0.6356822 | 0.8196077 | 1.0000000 | 0.4314067 | 0.2202156 | 0.6224431 | 0.5014995 | 0.6896981 | 0.7236845 |
| T_LUZ | 0.5904512 | -0.8233436 | -0.5351988 | 0.4986221 | 0.5916271 | 0.6277368 | 0.5605634 | -0.5080116 | 0.7191427 | 0.4314067 | 1.0000000 | 0.2444659 | 0.6527374 | 0.6254521 | 0.5894292 | 0.6060996 |
| POP | 0.2191420 | -0.3250679 | -0.2230180 | 0.1262239 | 0.1544536 | 0.3034201 | 0.4335171 | -0.0872201 | 0.2827204 | 0.2202156 | 0.2444659 | 1.0000000 | 0.2578177 | 0.1833684 | 0.2190487 | 0.4098729 |
| IDHM | 0.9733573 | -0.8541995 | -0.9011792 | 0.8803374 | 0.8289873 | 0.9141601 | 0.8442711 | -0.8795034 | 0.6719106 | 0.6224431 | 0.6527374 | 0.2578177 | 1.0000000 | 0.9693973 | 0.9734731 | 0.8592049 |
| IDHM_E | 0.9103724 | -0.8513180 | -0.8216582 | 0.8711136 | 0.8015708 | 0.8518184 | 0.7007926 | -0.8423586 | 0.5395288 | 0.5014995 | 0.6254521 | 0.1833684 | 0.9693973 | 1.0000000 | 0.9106483 | 0.7143394 |
| IDHM_L | 0.9999810 | -0.8007404 | -0.9395207 | 0.8995537 | 0.8110836 | 0.8833361 | 0.8546981 | -0.9034464 | 0.6917530 | 0.6896981 | 0.5894292 | 0.2190487 | 0.9734731 | 0.9106483 | 1.0000000 | 0.8725010 |
| IDHM_R | 0.8726580 | -0.6990485 | -0.8471580 | 0.6800611 | 0.7050427 | 0.8822774 | 0.9884021 | -0.7410374 | 0.8192453 | 0.7236845 | 0.6060996 | 0.4098729 | 0.8592049 | 0.7143394 | 0.8725010 | 1.0000000 |
The matrix above shows the Pearson correlation coefficients for each pair of variables.
corr_sig <- round(rho$P, 5)
corr_sig %>%
kable(format = "html", escape = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
scroll_box(width = "100%", height = "250px")
| | ESPVIDA | FECTOT | MORT5 | SOBRE60 | E_ANOSESTUDO | T_SUPER25M | RDPC | T_DES18M | T_BANAGUA | T_LIXO | T_LUZ | POP | IDHM | IDHM_E | IDHM_L | IDHM_R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ESPVIDA | NA | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00006 | 0.17427 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| FECTOT | 0.00000 | NA | 0.00000 | 0.00000 | 0.00002 | 0.00000 | 0.00000 | 0.00000 | 0.00002 | 0.00343 | 0.00000 | 0.04069 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| MORT5 | 0.00000 | 0.00000 | NA | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00037 | 0.16659 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| SOBRE60 | 0.00000 | 0.00000 | 0.00000 | NA | 0.00000 | 0.00000 | 0.00001 | 0.00000 | 0.00029 | 0.00017 | 0.00106 | 0.43769 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| E_ANOSESTUDO | 0.00000 | 0.00002 | 0.00000 | 0.00000 | NA | 0.00001 | 0.00000 | 0.00000 | 0.00001 | 0.00031 | 0.00006 | 0.34130 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| T_SUPER25M | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00001 | NA | 0.00000 | 0.00000 | 0.00000 | 0.00002 | 0.00001 | 0.05700 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| RDPC | 0.00000 | 0.00000 | 0.00000 | 0.00001 | 0.00000 | 0.00000 | NA | 0.00000 | 0.00000 | 0.00000 | 0.00017 | 0.00520 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| T_DES18M | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | NA | 0.00022 | 0.00001 | 0.00082 | 0.59254 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| T_BANAGUA | 0.00000 | 0.00002 | 0.00000 | 0.00029 | 0.00001 | 0.00000 | 0.00000 | 0.00022 | NA | 0.00000 | 0.00000 | 0.07712 | 0.00000 | 0.00033 | 0.00000 | 0.00000 |
| T_LIXO | 0.00000 | 0.00343 | 0.00000 | 0.00017 | 0.00031 | 0.00002 | 0.00000 | 0.00001 | 0.00000 | NA | 0.00545 | 0.17212 | 0.00002 | 0.00098 | 0.00000 | 0.00000 |
| T_LUZ | 0.00006 | 0.00000 | 0.00037 | 0.00106 | 0.00006 | 0.00001 | 0.00017 | 0.00082 | 0.00000 | 0.00545 | NA | 0.12844 | 0.00001 | 0.00002 | 0.00006 | 0.00003 |
| POP | 0.17427 | 0.04069 | 0.16659 | 0.43769 | 0.34130 | 0.05700 | 0.00520 | 0.59254 | 0.07712 | 0.17212 | 0.12844 | NA | 0.10824 | 0.25739 | 0.17446 | 0.00862 |
| IDHM | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00002 | 0.00001 | 0.10824 | NA | 0.00000 | 0.00000 | 0.00000 |
| IDHM_E | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00033 | 0.00098 | 0.00002 | 0.25739 | 0.00000 | NA | 0.00000 | 0.00000 |
| IDHM_L | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00006 | 0.17446 | 0.00000 | 0.00000 | NA | 0.00000 |
| IDHM_R | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00003 | 0.00862 | 0.00000 | 0.00000 | 0.00000 | NA |
This matrix shows the significance levels (p-values) for each pair. At the 5% significance level (95% confidence), almost all pairs show a statistically significant correlation. The exceptions are concentrated in the "POP" variable (the resident population), which is not significantly correlated with most of the other variables.
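As a quick check of that reading, we can list the pairs whose p-values exceed 0.05 directly from the corr_sig matrix created above. This is a minimal sketch; the names pares_nao_sig and p_valor are just illustrative:
# Sketch: list the variable pairs whose p-value exceeds 0.05
# (each pair appears twice because the matrix is symmetric)
pares_nao_sig <- which(corr_sig > 0.05, arr.ind = TRUE)
data.frame(var1 = rownames(corr_sig)[pares_nao_sig[, "row"]],
           var2 = colnames(corr_sig)[pares_nao_sig[, "col"]],
           p_valor = corr_sig[pares_nao_sig])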
We can also create a “heat map” to better visualize the information seen above:
ggplotly(
dados[,2:17] %>%
cor() %>%
melt() %>%
rename(Correlação = value) %>%
ggplot() +
geom_tile(aes(x = Var1, y = Var2, fill = Correlação)) +
geom_text(aes(x = Var1, y = Var2, label = format(Correlação, digits = 1)),
size = 5) +
scale_fill_viridis_b() +
labs(x = NULL, y = NULL) +
theme_bw() + theme(
axis.text.x = element_blank(),
axis.title.x = element_blank()
),
width = 1000)
Note: Warmer colors indicate a higher positive correlation, while cooler colors indicate no correlation or negative correlation.
We can also visualize this information using the chart.Correlation function:
chart.Correlation(dados[, 2:17], histogram = TRUE, pch = "+")
This function combines, in a single panel, the scatter plots for each pair of variables, histograms of each variable, and the corresponding correlation coefficients.
Bartlett's test of sphericity checks whether the correlation matrix differs significantly from an identity matrix, that is, whether the variables in the dataset are correlated with each other at all. Although we have already observed clear signs of correlation, we will run Bartlett's test through the 'cortest.bartlett' function as confirmation:
cortest.bartlett(dados[, 2:17])
## R was not square, finding R from data
## $chisq
## [1] 1485.071
##
## $p.value
## [1] 6.124255e-234
##
## $df
## [1] 120
We obtained a p-value very close to zero, which indicates that the variables in the data are not independent from each other, suggesting the presence of significant correlation among them. Therefore, we can proceed with our analysis.
We will use the principal function from the psych package to perform our factor analysis. The technique transforms a set of correlated variables into a new set of uncorrelated variables called principal components.
Since our database has 16 variables, to preserve all the variability present in the data, we will request a total of 16 principal components.
fatorial <- principal(dados[2:17],
nfactors = length(dados[2:17]),
rotate = "none",
scores = TRUE)
fatorial
## Principal Components Analysis
## Call: principal(r = dados[2:17], nfactors = length(dados[2:17]), rotate = "none",
## scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
## ESPVIDA 0.98 -0.14 0.04 0.09 0.00 -0.01 0.04 -0.02 -0.07 -0.02 0.03
## FECTOT -0.84 -0.03 0.44 0.15 0.09 0.15 -0.05 -0.04 0.16 0.07 -0.01
## MORT5 -0.93 0.11 -0.17 -0.08 -0.17 0.01 -0.12 0.14 0.05 -0.13 0.08
## SOBRE60 0.87 -0.32 -0.03 0.11 0.14 -0.20 0.22 -0.01 0.16 -0.06 0.06
## E_ANOSESTUDO 0.83 -0.16 0.01 -0.09 0.41 0.29 -0.14 0.05 0.02 -0.05 0.04
## T_SUPER25M 0.92 0.05 -0.08 0.10 -0.32 0.04 -0.05 0.10 0.13 0.03 -0.02
## RDPC 0.89 0.27 0.11 0.15 -0.19 0.22 0.05 -0.02 -0.01 -0.01 0.06
## T_DES18M -0.88 0.27 -0.08 -0.10 0.14 0.08 0.24 0.20 -0.02 0.08 0.04
## T_BANAGUA 0.79 0.37 0.24 -0.39 0.05 0.00 0.08 0.05 0.03 -0.10 -0.13
## T_LIXO 0.73 0.22 0.53 -0.15 0.04 -0.29 -0.13 0.09 -0.03 0.05 0.08
## T_LUZ 0.70 0.20 -0.38 -0.53 0.00 0.00 -0.03 -0.14 0.07 0.08 0.05
## POP 0.30 0.75 -0.28 0.44 0.23 -0.12 -0.07 -0.03 0.03 -0.01 -0.01
## IDHM 0.98 -0.12 -0.10 0.07 -0.01 0.03 -0.02 0.08 -0.02 0.04 -0.03
## IDHM_E 0.91 -0.26 -0.24 0.06 0.03 -0.03 -0.08 0.15 0.01 0.07 -0.06
## IDHM_L 0.98 -0.14 0.04 0.09 0.00 -0.01 0.04 -0.01 -0.07 -0.02 0.03
## IDHM_R 0.92 0.26 0.14 0.07 -0.12 0.19 0.09 -0.03 -0.02 0.01 0.02
## PC12 PC13 PC14 PC15 PC16 h2 u2 com
## ESPVIDA 0.08 0.01 0.00 0.00 0 1 4.4e-16 1.1
## FECTOT 0.05 0.04 0.01 0.00 0 1 -6.2e-15 1.8
## MORT5 0.01 0.05 0.01 0.00 0 1 -2.7e-15 1.3
## SOBRE60 -0.03 0.01 0.00 0.00 0 1 -2.7e-15 1.7
## E_ANOSESTUDO -0.01 -0.04 0.00 0.00 0 1 -2.7e-15 2.0
## T_SUPER25M 0.03 -0.07 0.00 0.00 0 1 -2.9e-15 1.4
## RDPC -0.03 0.03 -0.04 0.00 0 1 -2.4e-15 1.5
## T_DES18M 0.03 -0.01 0.00 0.00 0 1 -3.1e-15 1.6
## T_BANAGUA 0.02 0.01 0.00 0.00 0 1 -1.6e-15 2.3
## T_LIXO -0.02 0.00 0.00 0.00 0 1 -1.8e-15 2.7
## T_LUZ 0.02 0.02 0.00 0.00 0 1 -1.6e-15 2.9
## POP 0.01 0.00 0.00 0.00 0 1 -6.7e-16 2.7
## IDHM -0.01 0.04 0.01 -0.01 0 1 -3.3e-15 1.1
## IDHM_E -0.02 0.05 0.00 0.01 0 1 -3.1e-15 1.4
## IDHM_L 0.08 0.01 0.00 0.00 0 1 -3.8e-15 1.1
## IDHM_R -0.05 0.01 0.04 0.00 0 1 -2.7e-15 1.4
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
## SS loadings 11.69 1.27 0.91 0.77 0.45 0.34 0.19 0.14 0.09 0.06 0.05
## Proportion Var 0.73 0.08 0.06 0.05 0.03 0.02 0.01 0.01 0.01 0.00 0.00
## Cumulative Var 0.73 0.81 0.87 0.91 0.94 0.96 0.98 0.98 0.99 0.99 1.00
## Proportion Explained 0.73 0.08 0.06 0.05 0.03 0.02 0.01 0.01 0.01 0.00 0.00
## Cumulative Proportion 0.73 0.81 0.87 0.91 0.94 0.96 0.98 0.98 0.99 0.99 1.00
## PC12 PC13 PC14 PC15 PC16
## SS loadings 0.02 0.02 0 0 0
## Proportion Var 0.00 0.00 0 0 0
## Cumulative Var 1.00 1.00 1 1 1
## Proportion Explained 0.00 0.00 0 0 0
## Cumulative Proportion 1.00 1.00 1 1 1
##
## Mean item complexity = 1.8
## Test of the hypothesis that 16 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0
## with the empirical chi square 0 with prob < NA
##
## Fit based upon off diagonal values = 1
The factor loadings indicate, on a scale from -1 to 1, how much each variable contributes to the formation of each component. Looking at the table, we can see that our first component captures a large portion of the data's variability.
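For instance, a quick way to inspect how strongly each variable loads on the first component is to sort the PC1 column of the loadings matrix (a small sketch using the fatorial object created above):
# Sketch: loadings of the first principal component, sorted from highest to lowest
round(sort(unclass(fatorial$loadings)[, "PC1"], decreasing = TRUE), 2)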
The eigenvalues indicate how much of the total variance each component captures. Let's display our eigenvalues with 5 decimal places:
autovalores <- round(fatorial$values, 5)
autovalores
## [1] 11.69324 1.26592 0.90564 0.77437 0.45165 0.33644 0.19296 0.13774
## [9] 0.08873 0.06124 0.04824 0.02390 0.01694 0.00286 0.00012 0.00002
It’s important to note that only 2 eigenvalues are greater than 1, as this detail is relevant for the continuation of our analysis based on the Kaiser criterion.
sum(autovalores)
## [1] 16.00001
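As expected, the eigenvalues add up to 16, the number of standardized variables (each standardized variable contributes a variance of 1). A scree plot makes the eigenvalue-greater-than-1 cutoff easier to see. Below is a minimal sketch using fviz_eig() from the factoextra package loaded earlier; it assumes that prcomp() on the standardized variables reproduces the same eigenvalues as fatorial$values:
# Sketch: scree plot of the eigenvalues, with the Kaiser cutoff at 1
fviz_eig(prcomp(dados[, 2:17], scale. = TRUE),
         choice = "eigenvalue",  # plot the eigenvalues rather than % of variance
         ncp = 16,               # display all 16 components
         addlabels = TRUE) +
  geom_hline(yintercept = 1, linetype = "dashed")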
As the goal of factor analysis is to simplify and condense the amount of information contained in the original variables, we will now make this simplification based on the Kaiser criterion. The Kaiser criterion states that only factors with eigenvalues greater than 1 should be considered significant. This means that only factors that explain a greater amount of variance than an individual variable will be retained.
k <- sum(autovalores > 1)
print(k)
## [1] 2
Since only 2 of our eigenvalues are greater than 1, only 2 factors will be preserved.
fatorial2 <- principal(dados[2:17],
nfactors = k,
rotate = "none",
scores = TRUE)
fatorial2
## Principal Components Analysis
## Call: principal(r = dados[2:17], nfactors = k, rotate = "none", scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 h2 u2 com
## ESPVIDA 0.98 -0.14 0.97 0.025 1.0
## FECTOT -0.84 -0.03 0.71 0.286 1.0
## MORT5 -0.93 0.11 0.87 0.128 1.0
## SOBRE60 0.87 -0.32 0.85 0.149 1.3
## E_ANOSESTUDO 0.83 -0.16 0.71 0.291 1.1
## T_SUPER25M 0.92 0.05 0.84 0.158 1.0
## RDPC 0.89 0.27 0.87 0.129 1.2
## T_DES18M -0.88 0.27 0.85 0.151 1.2
## T_BANAGUA 0.79 0.37 0.75 0.247 1.4
## T_LIXO 0.73 0.22 0.58 0.418 1.2
## T_LUZ 0.70 0.20 0.53 0.466 1.2
## POP 0.30 0.75 0.65 0.347 1.3
## IDHM 0.98 -0.12 0.97 0.027 1.0
## IDHM_E 0.91 -0.26 0.90 0.104 1.2
## IDHM_L 0.98 -0.14 0.97 0.025 1.0
## IDHM_R 0.92 0.26 0.91 0.089 1.2
##
## PC1 PC2
## SS loadings 11.69 1.27
## Proportion Var 0.73 0.08
## Cumulative Var 0.73 0.81
## Proportion Explained 0.90 0.10
## Cumulative Proportion 0.90 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.06
## with the empirical chi square 37.79 with prob < 1
##
## Fit based upon off diagonal values = 0.99
This highlights how well the first component alone condenses the information from the original variables, accounting for 73% of the total variance.
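To see which variables drive each of the two retained components, we can plot the loadings. This is a sketch that builds a helper data frame (here called cargas) from fatorial2$loadings and uses the ggplot2 and ggrepel packages loaded at the start:
# Sketch: loading plot of the two retained components
cargas <- as.data.frame(unclass(fatorial2$loadings))  # loadings matrix -> data frame
cargas$variavel <- rownames(cargas)
ggplot(cargas, aes(x = PC1, y = PC2, label = variavel)) +
  geom_point() +
  geom_text_repel(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Loadings on PC1", y = "Loadings on PC2") +
  theme_bw()
In this solution, most variables sit far from zero on PC1, while POP is the main contributor to PC2 (loadings of 0.30 and 0.75, respectively).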
scores_fatoriais <- as.data.frame(fatorial2$weights)
round(scores_fatoriais, 3) %>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = FALSE,
font_size = 20)
| | PC1 | PC2 |
|---|---|---|
| ESPVIDA | 0.084 | -0.109 |
| FECTOT | -0.072 | -0.022 |
| MORT5 | -0.079 | 0.084 |
| SOBRE60 | 0.074 | -0.250 |
| E_ANOSESTUDO | 0.071 | -0.125 |
| T_SUPER25M | 0.078 | 0.038 |
| RDPC | 0.076 | 0.214 |
| T_DES18M | -0.075 | 0.213 |
| T_BANAGUA | 0.067 | 0.291 |
| T_LIXO | 0.062 | 0.177 |
| T_LUZ | 0.060 | 0.161 |
| POP | 0.026 | 0.592 |
| IDHM | 0.084 | -0.098 |
| IDHM_E | 0.078 | -0.209 |
| IDHM_L | 0.084 | -0.110 |
| IDHM_R | 0.078 | 0.208 |
These factor score coefficients (weights) indicate how much each variable contributes to the construction of each principal component's scores. An observation's score on a component is calculated by multiplying its standardized variable values by these weights and summing the products.
fatores <- as.data.frame(fatorial2$scores)
fatores
## PC1 PC2
## 1 -2.27807458 -0.75208164
## 2 -1.36700106 -0.32897735
## 3 -1.86131148 -1.65648403
## 4 -1.46556196 -0.38792353
## 5 -1.30175240 0.05883814
## 6 -1.35230324 0.62153512
## 7 -2.17803739 1.40604296
## 8 -1.20188608 0.94098168
## 9 -0.26824831 0.63190506
## 10 -0.43251561 0.05595586
## 11 -0.23934530 2.08105483
## 12 0.25957437 2.98833776
## 13 -0.10157389 0.48672981
## 14 0.20498602 0.52294266
## 15 -0.13531427 -0.01787279
## 16 -0.04481424 0.36775842
## 17 0.05999162 0.41007687
## 18 -0.76851218 -0.52890532
## 19 -0.33489685 0.30587303
## 20 -0.23687394 0.15929281
## 21 -0.04121714 -1.25523277
## 22 0.17702713 -0.89734540
## 23 0.27261412 -1.85457570
## 24 0.37516405 -1.18903484
## 25 0.38069506 -0.69520350
## 26 0.33496928 -0.32984767
## 27 -0.02652783 -0.39302817
## 28 0.50290224 -0.31086742
## 29 1.02552995 0.05706312
## 30 0.93010794 -0.41594619
## 31 0.96298181 1.31482409
## 32 1.44344053 2.20113948
## 33 0.95088981 -0.61213894
## 34 1.31584401 -0.46308740
## 35 1.04346268 -0.90946697
## 36 1.30181947 -0.34892142
## 37 1.06446806 -0.09186182
## 38 0.81040598 -0.85455643
## 39 0.96530881 -0.43148471
## 40 1.25358481 0.11449232
These are the factors themselves: each observation's score on the retained components, obtained by applying the score weights above to the standardized variables. In factor analysis, the factors represent the individual positions of the observations on the underlying theoretical constructs.
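As a quick sanity check of this description, we can reproduce the first observation's PC1 score by hand. This is a sketch under the assumption that principal() computes the scores as the standardized variables multiplied by the weight matrix shown earlier; the two values below should match up to rounding:
# Sketch: reproduce the first observation's PC1 score manually
z <- scale(dados[, 2:17])            # standardized original variables
sum(z[1, ] * scores_fatoriais$PC1)   # manual score for observation 1
fatores$PC1[1]                       # score returned by principal()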
rho2 <- rcorr(as.matrix(fatores), type="pearson")
round(rho2$r, 4)
## PC1 PC2
## PC1 1 0
## PC2 0 1
Here we can see that our factors are uncorrelated with each other: they are orthogonal and share no common variance. This means they capture distinct dimensions of the data, which makes the results easier to interpret.
comunalidades <- as.data.frame(unclass(fatorial2$communality)) %>%
rename(comunalidades = 1)
round(comunalidades, 3) %>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = FALSE,
font_size = 20)
| | comunalidades |
|---|---|
| ESPVIDA | 0.975 |
| FECTOT | 0.714 |
| MORT5 | 0.872 |
| SOBRE60 | 0.851 |
| E_ANOSESTUDO | 0.709 |
| T_SUPER25M | 0.842 |
| RDPC | 0.871 |
| T_DES18M | 0.849 |
| T_BANAGUA | 0.753 |
| T_LIXO | 0.582 |
| T_LUZ | 0.534 |
| POP | 0.653 |
| IDHM | 0.973 |
| IDHM_E | 0.896 |
| IDHM_L | 0.975 |
| IDHM_R | 0.911 |
Communality is an important measure. It ranges from 0 to 1, with 0 indicating that the variable has no relation to any of the factors and 1 indicating that the variable is perfectly explained by the factors. We can see that, even though the Kaiser criterion led us to retain only 2 factors, we still capture a large share of the variability of the original data.
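Equivalently, each communality is the sum of the squared loadings of that variable on the retained components; a minimal sketch of this check:
# Sketch: communality = sum of squared loadings across the two retained components
round(rowSums(unclass(fatorial2$loadings)^2), 3)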
dados_fatores <- bind_cols(dados,
"fator_1" = fatores$PC1,
"fator_2" = fatores$PC2)
Using only the 2 retained factors as indicators, we now calculate a single score for each observation. In this type of analysis, the score is usually taken as the sum of each factor multiplied by the proportion of variance shared (explained) by that factor. The code below first creates the variancia_compartilhada object from fatorial2$Vaccounted (its second row, "Proportion Var", holds each factor's share of the total variance) and then computes the score:
# variancia_compartilhada: "Proportion Var" is the second row of the variance table returned by principal()
variancia_compartilhada <- as.data.frame(fatorial2$Vaccounted)
dados_rank <- dados_fatores %>%
  mutate(pontuacao = fator_1 * variancia_compartilhada$PC1[2] + fator_2 * variancia_compartilhada$PC2[2])
dados_rank[,c(1, 20)] %>%
arrange(desc(pontuacao)) %>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = T,
font_size = 12)
| NOME_RM | pontuacao |
|---|---|
| RM São Paulo (SP) 2010 | 1.2290601 |
| RIDE DF e Entorno (DF/GO/MG) 2010 | 0.9252126 |
| RM Campinas (SP) 2010 | 0.9250152 |
| RM Curitiba (PR) 2010 | 0.9237985 |
| RM Rio Janeiro (RJ) 2010 | 0.8078024 |
| RM Porto Alegre (RS) 2010 | 0.7706742 |
| RM Belo Horizonte (MG) 2010 | 0.7540001 |
| RM Vale do Paraíba e Litoral Norte (SP) 2010 | 0.6906338 |
| RM Goiânia (GO) 2010 | 0.6713349 |
| RM Grande Vitória (MG) 2010 | 0.6468385 |
| RM Baixada Santista (SP) 2010 | 0.6465037 |
| RM Vale do Rio Cuiabá (MT) 2010 | 0.5246542 |
| RM São Paulo (SP) | 0.4261417 |
| RM Salvador (BA) 2010 | 0.3429388 |
| RM Natal (RN) 2010 | 0.2232177 |
| RM Recife (PE) 2010 | 0.2187071 |
| RM Campinas (SP) | 0.1911847 |
| RM Fortaleza (CE) 2010 | 0.1801035 |
| RM Porto Alegre (RS) 2010 | 0.0762888 |
| RM Belém (PA) 2010 | 0.0583782 |
| RM Grande São Luís (MA) 2010 | 0.0524996 |
| RM Curitiba (PR) | -0.0036544 |
| RM Rio Janeiro (RJ) | -0.0102667 |
| RM Baixada Santista (SP) | -0.0357228 |
| RM Maceió (AL) | -0.0504837 |
| RM Vale do Paraíba e Litoral Norte (SP) | -0.1003055 |
| RM Manaus (AM) 2010 | -0.1294368 |
| RM Belo Horizonte (MG) | -0.1460468 |
| RIDE DF e Entorno (DF/GO/MG) | -0.1605107 |
| RM Goiânia (GO) | -0.2205510 |
| RM Grande Vitória (MG) | -0.3116670 |
| RM Vale do Rio Cuiabá (MT) | -0.6034967 |
| RM Salvador (BA) | -0.8039205 |
| RM Recife (PE) | -0.9391242 |
| RM Natal (RN) | -0.9467009 |
| RM Belém (PA) | -1.0250704 |
| RM Fortaleza (CE) | -1.1017652 |
| RM Maceió (AL) | -1.4805228 |
| RM Grande São Luís (MA) | -1.4913584 |
| RM Manaus (AM) | -1.7243838 |
When analyzing the ranking of metropolitan regions produced by the principal component factor analysis, we can see that it reflects the Brazilian reality consistently: regions such as São Paulo, Campinas, and Curitiba occupy the top positions, while metropolitan regions of the North and Northeast sit at the bottom. This highlights the ability of the analysis to summarize the complex information contained in an extensive database into just two factors.
Principal component factor analysis not only allowed us to create a ranking of metropolitan regions but also provided valuable insights into the underlying latent constructs behind the observed variables. During the analysis process, we could identify and understand some of the factors that influence longevity, education, income, and infrastructure in Brazilian cities.
In addition to ranking, principal component factor analysis can have various other uses in data science projects. It can serve as a basis for creating indices or scores that quantify the performance of observations, help identify key variables that impact latent constructs, and provide a clearer view of the data’s structure.
Thus, principal component factor analysis has proven to be a powerful tool for understanding and interpreting complex databases, allowing the extraction of relevant information in a simpler and more concise manner. This approach contributes to informed decision-making and promotes a deeper understanding of phenomena derived from relationships between variables.