Principal Component Analysis (PCA) is a powerful statistical technique that allows us to reduce the complexity of a dataset by identifying underlying patterns and summarizing the information into latent factors. The primary objective of PCA is to identify and describe the latent constructs and/or factors that underlie the observed variables. These constructs can be understood as unobservable variables in our database, while the factors can be interpreted as estimates of these unobservable variables obtained after conducting a factor analysis. In summary, this technique helps us understand and interpret the underlying structure of the data, simplifying complexity and extracting information that lies behind the observed variables.

In this case, we will apply Principal Component Analysis based on the Kaiser criterion to extract factors representing the main dimensions that influence the quality of life in Brazilian metropolitan regions.

We will explore a comprehensive database containing metrics/variables related to longevity, education, income of inhabitants, as well as variables related to the infrastructure of these metropolitan regions.

Our ultimate goal is to create a ranking of regions, ordering them from best to worst, based on the information extracted from the database through factor analysis.

Database used:
A - Base 20RMs 2000_2010 (Please right-click and select “open in a new tab/window”.)

pacotes <- c("tidyverse","ggrepel","reshape2","knitr","kableExtra","dplyr","Hmisc","ltm","readxl", 
             "PerformanceAnalytics","plotly", "factoextra","psych","sp","tmap")
if(sum(as.numeric(!pacotes %in% installed.packages())) != 0){
  instalador <- pacotes[!pacotes %in% installed.packages()]
  for(i in 1:length(instalador)) {
    install.packages(instalador, dependencies = T)
    break()}
  sapply(pacotes, require, character = T) 
} else {
  sapply(pacotes, require, character = T) 
}

Loading the database

We will use a database provided by IBGE (Brazilian Institute of Geography and Statistics). This same database is available at the GitHub link above

dados_completos <- read_excel("A - Base 20 RMs 2000_2010.xlsx")

As the original database contains an extensive number of variables, let’s create a new database that includes only the most relevant variables for our analysis:

dados <- dados_completos[, c("NOME_RM", "ESPVIDA", "FECTOT", "MORT5", "SOBRE60", "E_ANOSESTUDO", "T_SUPER25M", "RDPC", "T_DES18M", "T_BANAGUA", "T_LIXO", "T_LUZ", "POP", "IDHM", "IDHM_E", "IDHM_L", "IDHM_R")]

Below is the legend for the selected variables (described in averages or rates):

ESPVIDA = Life expectancy;

FECTOT = Number of children per woman;

MORT5 = Under-5 mortality rate;

SOBRE60 = Survival rate up to 60 years of age;

E_ANOSESTUDO = Years of schooling up to 18 years of age;

T_SUPER25M = Rate of people aged 25 or older with completed higher education;

RDPC = Per capita family income;

T_DES18M = Unemployment rate for individuals aged 18 or older;

T_BANAGUA = Rate of households with indoor plumbing and piped water supply;

T_LIXO = Rate of households served by a garbage collection service;

T_LUZ = Rate of households with electric power;

POP = Number of people residing in fixed households;

IDHM = Municipal Human Development Index;

IDHM_E = Municipal Human Development Index - Education dimension;

IDHM_L = Municipal Human Development Index - Longevity dimension;

IDHM_R = Municipal Human Development Index - Income dimension;

Descriptive Statistics

Observing the database

dados %>% 
  slice(1:10) %>%
  kable(format = "html", escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  scroll_box(width = "100%", height = "250px")

NOME_RM	ESPVIDA	FECTOT	MORT5	SOBRE60	E_ANOSESTUDO	T_SUPER25M	RDPC	T_DES18M	T_BANAGUA	T_LIXO	T_LUZ	POP	IDHM	IDHM_E	IDHM_L	IDHM_R
RM Manaus (AM)	68.77	2.80	41.53	75.22	7.99	4.80	487.67	21.72	61.59	90.97	95.33	1631641	0.585	0.414	0.730	0.661
RM Belém (PA)	70.13	2.27	35.46	76.92	8.64	7.15	524.82	18.42	68.57	90.75	99.04	1964065	0.621	0.474	0.752	0.672
RM Grande São Luís (MA)	68.73	2.24	49.25	75.04	9.37	6.12	449.12	20.48	54.96	74.21	98.58	1086365	0.642	0.560	0.729	0.647
RM Fortaleza (CE)	69.55	2.38	40.32	76.53	8.99	6.31	496.32	16.14	72.80	90.47	98.24	3041196	0.622	0.488	0.743	0.663
RM Natal (RN)	69.49	2.37	49.93	76.61	9.16	7.60	537.69	16.84	81.68	95.45	98.83	1124946	0.625	0.487	0.742	0.676
RM Recife (PE)	69.26	2.10	48.62	75.59	8.75	8.95	560.66	22.14	80.04	87.43	99.62	3306954	0.627	0.490	0.738	0.683
RM Maceió (AL)	65.85	2.47	57.86	69.66	7.58	7.96	503.36	22.32	81.50	92.71	98.82	1019183	0.567	0.402	0.681	0.666
RM Salvador (BA)	69.58	1.90	49.68	75.78	8.80	8.04	614.59	23.86	85.59	91.12	99.29	3099673	0.636	0.497	0.743	0.698
RM Belo Horizonte (MG)	72.01	2.01	27.88	79.86	9.82	9.71	782.97	16.84	95.15	93.68	99.60	4330618	0.682	0.549	0.784	0.737
RM Grande Vitória (MG)	71.75	2.09	27.08	77.90	10.02	8.23	731.30	16.05	93.70	91.06	99.72	1432003	0.678	0.552	0.779	0.726

showing only the first rows

Univariate Descriptive Statistics

summary(dados[,2:17])

##     ESPVIDA          FECTOT          MORT5          SOBRE60     
##  Min.   :65.85   Min.   :1.620   Min.   :12.92   Min.   :69.66  
##  1st Qu.:71.69   1st Qu.:1.810   1st Qu.:16.20   1st Qu.:77.81  
##  Median :73.28   Median :1.990   Median :21.16   Median :82.35  
##  Mean   :72.96   Mean   :2.008   Mean   :25.32   Mean   :81.15  
##  3rd Qu.:75.18   3rd Qu.:2.172   3rd Qu.:27.71   3rd Qu.:84.16  
##  Max.   :76.46   Max.   :2.800   Max.   :57.86   Max.   :87.84  
##   E_ANOSESTUDO      T_SUPER25M         RDPC           T_DES18M     
##  Min.   : 7.580   Min.   : 4.80   Min.   : 449.1   Min.   : 4.450  
##  1st Qu.: 9.485   1st Qu.: 8.92   1st Qu.: 705.9   1st Qu.: 7.968  
##  Median : 9.920   Median :11.23   Median : 827.9   Median :13.085  
##  Mean   : 9.777   Mean   :11.45   Mean   : 843.9   Mean   :12.989  
##  3rd Qu.:10.190   3rd Qu.:14.15   3rd Qu.:1001.7   3rd Qu.:16.810  
##  Max.   :11.120   Max.   :18.72   Max.   :1362.5   Max.   :23.860  
##    T_BANAGUA         T_LIXO          T_LUZ            POP          
##  Min.   :54.96   Min.   :74.21   Min.   :95.33   Min.   :  718993  
##  1st Qu.:82.19   1st Qu.:92.58   1st Qu.:99.07   1st Qu.: 1590334  
##  Median :93.36   Median :97.16   Median :99.73   Median : 2294862  
##  Mean   :88.91   Mean   :95.44   Mean   :99.41   Mean   : 3651660  
##  3rd Qu.:96.50   3rd Qu.:98.73   3rd Qu.:99.87   3rd Qu.: 3620772  
##  Max.   :97.62   Max.   :99.77   Max.   :99.97   Max.   :19601268  
##       IDHM            IDHM_E           IDHM_L           IDHM_R      
##  Min.   :0.5670   Min.   :0.4020   Min.   :0.6810   Min.   :0.6470  
##  1st Qu.:0.6755   1st Qu.:0.5308   1st Qu.:0.7780   1st Qu.:0.7200  
##  Median :0.7120   Median :0.6000   Median :0.8050   Median :0.7455  
##  Mean   :0.7094   Mean   :0.6042   Mean   :0.7996   Mean   :0.7425  
##  3rd Qu.:0.7675   3rd Qu.:0.6917   3rd Qu.:0.8367   3rd Qu.:0.7760  
##  Max.   :0.7940   Max.   :0.7370   Max.   :0.8580   Max.   :0.8260

Analyzing some of the variables that are likely to be correlated

Let’s create some graphs to analyze the relationship between certain variables. We know that years of schooling tend to be related to income and employment of the population. Therefore, we will use the variables related to these factors to start our analysis.

Scatter plot and linear fit between years of schooling and per capita income.

dados %>%
  ggplot() +
  geom_point(aes(x = E_ANOSESTUDO, y = RDPC),
             color = "darkorchid",
             size = 3) +
  geom_smooth(aes(x = E_ANOSESTUDO, y = RDPC),
              color = "orange", 
              method = "lm", 
              formula = y ~ x, 
              se = FALSE,
              size = 1.3) +
  labs(x = "Years of Study",
       y = "Per Capita Income") +
  theme_bw()

Scatter plot and linear fit between years of schooling and the unemployment rate.

dados %>%
  ggplot() +
  geom_point(aes(x = E_ANOSESTUDO, y = T_DES18M),
             color = "darkorchid",
             size = 3) +
  geom_smooth(aes(x = E_ANOSESTUDO, y = T_DES18M),
              color = "orange", 
              method = "lm", 
              formula = y ~ x, 
              se = FALSE,
              size = 1.3) +
  labs(x = "Years of Study",
       y = "Unemployment") +
  theme_bw()

We can observe two significant correlations in the above graphs. They show a clear trend of increasing per capita income as the level of education increases. Additionally, there appears to be an inverse relationship between unemployment and an increase in years of schooling.

Note: Due to the large number of variables, we will not plot graphs for all pairs of variables.

Principal Component Analysis (PCA) elaboration

Pearson correlation coefficients for each pair of variables

Let’s extract the correlation coefficients using the rcorr function and store them in an object named “rho”.

rho <- rcorr(as.matrix(dados[,2:17]), type="pearson")

We can now extract some information from the “rho” object we created:

corr_coef <- rho$r 
corr_coef %>% 
  kable(format = "html", escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  scroll_box(width = "100%", height = "250px")

	ESPVIDA	FECTOT	MORT5	SOBRE60	E_ANOSESTUDO	T_SUPER25M	RDPC	T_DES18M	T_BANAGUA	T_LIXO	T_LUZ	POP	IDHM	IDHM_E	IDHM_L	IDHM_R
ESPVIDA	1.0000000	-0.8012736	-0.9398518	0.8999331	0.8108366	0.8832583	0.8547294	-0.9032078	0.6921079	0.6892453	0.5904512	0.2191420	0.9733573	0.9103724	0.9999810	0.8726580
FECTOT	-0.8012736	1.0000000	0.6773541	-0.7265224	-0.6253673	-0.7957273	-0.6764232	0.6923827	-0.6263505	-0.4518295	-0.8233436	-0.3250679	-0.8541995	-0.8513180	-0.8007404	-0.6990485
MORT5	-0.9398518	0.6773541	1.0000000	-0.8742580	-0.8117622	-0.7660431	-0.8014588	0.8365488	-0.7059415	-0.7155116	-0.5351988	-0.2230180	-0.9011792	-0.8216582	-0.9395207	-0.8471580
SOBRE60	0.8999331	-0.7265224	-0.8742580	1.0000000	0.7315983	0.7438187	0.6477674	-0.8109527	0.5432362	0.5607777	0.4986221	0.1262239	0.8803374	0.8711136	0.8995537	0.6800611
E_ANOSESTUDO	0.8108366	-0.6253673	-0.8117622	0.7315983	1.0000000	0.6373194	0.6614750	-0.7073972	0.6410770	0.5414740	0.5916271	0.1544536	0.8289873	0.8015708	0.8110836	0.7050427
T_SUPER25M	0.8832583	-0.7957273	-0.7660431	0.7438187	0.6373194	1.0000000	0.8970472	-0.8310691	0.6716508	0.6147516	0.6277368	0.3034201	0.9141601	0.8518184	0.8833361	0.8822774
RDPC	0.8547294	-0.6764232	-0.8014588	0.6477674	0.6614750	0.8970472	1.0000000	-0.7369514	0.7586944	0.6774055	0.5605634	0.4335171	0.8442711	0.7007926	0.8546981	0.9884021
T_DES18M	-0.9032078	0.6923827	0.8365488	-0.8109527	-0.7073972	-0.8310691	-0.7369514	1.0000000	-0.5519304	-0.6356822	-0.5080116	-0.0872201	-0.8795034	-0.8423586	-0.9034464	-0.7410374
T_BANAGUA	0.6921079	-0.6263505	-0.7059415	0.5432362	0.6410770	0.6716508	0.7586944	-0.5519304	1.0000000	0.8196077	0.7191427	0.2827204	0.6719106	0.5395288	0.6917530	0.8192453
T_LIXO	0.6892453	-0.4518295	-0.7155116	0.5607777	0.5414740	0.6147516	0.6774055	-0.6356822	0.8196077	1.0000000	0.4314067	0.2202156	0.6224431	0.5014995	0.6896981	0.7236845
T_LUZ	0.5904512	-0.8233436	-0.5351988	0.4986221	0.5916271	0.6277368	0.5605634	-0.5080116	0.7191427	0.4314067	1.0000000	0.2444659	0.6527374	0.6254521	0.5894292	0.6060996
POP	0.2191420	-0.3250679	-0.2230180	0.1262239	0.1544536	0.3034201	0.4335171	-0.0872201	0.2827204	0.2202156	0.2444659	1.0000000	0.2578177	0.1833684	0.2190487	0.4098729
IDHM	0.9733573	-0.8541995	-0.9011792	0.8803374	0.8289873	0.9141601	0.8442711	-0.8795034	0.6719106	0.6224431	0.6527374	0.2578177	1.0000000	0.9693973	0.9734731	0.8592049
IDHM_E	0.9103724	-0.8513180	-0.8216582	0.8711136	0.8015708	0.8518184	0.7007926	-0.8423586	0.5395288	0.5014995	0.6254521	0.1833684	0.9693973	1.0000000	0.9106483	0.7143394
IDHM_L	0.9999810	-0.8007404	-0.9395207	0.8995537	0.8110836	0.8833361	0.8546981	-0.9034464	0.6917530	0.6896981	0.5894292	0.2190487	0.9734731	0.9106483	1.0000000	0.8725010
IDHM_R	0.8726580	-0.6990485	-0.8471580	0.6800611	0.7050427	0.8822774	0.9884021	-0.7410374	0.8192453	0.7236845	0.6060996	0.4098729	0.8592049	0.7143394	0.8725010	1.0000000

This matrix above shows the correlation coefficients between pairs of variables.

corr_sig <- round(rho$P, 5) 
corr_sig %>% 
  kable(format = "html", escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  scroll_box(width = "100%", height = "250px")

	ESPVIDA	FECTOT	MORT5	SOBRE60	E_ANOSESTUDO	T_SUPER25M	RDPC	T_DES18M	T_BANAGUA	T_LIXO	T_LUZ	POP	IDHM	IDHM_E	IDHM_L	IDHM_R
ESPVIDA	NA	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00006	0.17427	0.00000	0.00000	0.00000	0.00000
FECTOT	0.00000	NA	0.00000	0.00000	0.00002	0.00000	0.00000	0.00000	0.00002	0.00343	0.00000	0.04069	0.00000	0.00000	0.00000	0.00000
MORT5	0.00000	0.00000	NA	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00037	0.16659	0.00000	0.00000	0.00000	0.00000
SOBRE60	0.00000	0.00000	0.00000	NA	0.00000	0.00000	0.00001	0.00000	0.00029	0.00017	0.00106	0.43769	0.00000	0.00000	0.00000	0.00000
E_ANOSESTUDO	0.00000	0.00002	0.00000	0.00000	NA	0.00001	0.00000	0.00000	0.00001	0.00031	0.00006	0.34130	0.00000	0.00000	0.00000	0.00000
T_SUPER25M	0.00000	0.00000	0.00000	0.00000	0.00001	NA	0.00000	0.00000	0.00000	0.00002	0.00001	0.05700	0.00000	0.00000	0.00000	0.00000
RDPC	0.00000	0.00000	0.00000	0.00001	0.00000	0.00000	NA	0.00000	0.00000	0.00000	0.00017	0.00520	0.00000	0.00000	0.00000	0.00000
T_DES18M	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	NA	0.00022	0.00001	0.00082	0.59254	0.00000	0.00000	0.00000	0.00000
T_BANAGUA	0.00000	0.00002	0.00000	0.00029	0.00001	0.00000	0.00000	0.00022	NA	0.00000	0.00000	0.07712	0.00000	0.00033	0.00000	0.00000
T_LIXO	0.00000	0.00343	0.00000	0.00017	0.00031	0.00002	0.00000	0.00001	0.00000	NA	0.00545	0.17212	0.00002	0.00098	0.00000	0.00000
T_LUZ	0.00006	0.00000	0.00037	0.00106	0.00006	0.00001	0.00017	0.00082	0.00000	0.00545	NA	0.12844	0.00001	0.00002	0.00006	0.00003
POP	0.17427	0.04069	0.16659	0.43769	0.34130	0.05700	0.00520	0.59254	0.07712	0.17212	0.12844	NA	0.10824	0.25739	0.17446	0.00862
IDHM	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00002	0.00001	0.10824	NA	0.00000	0.00000	0.00000
IDHM_E	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00033	0.00098	0.00002	0.25739	0.00000	NA	0.00000	0.00000
IDHM_L	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00006	0.17446	0.00000	0.00000	NA	0.00000
IDHM_R	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00000	0.00003	0.00862	0.00000	0.00000	0.00000	NA

This matrix shows the significance levels for each pair. Considering a confidence level of 95%, we can see that almost all of our pairs have a significant correlation. The exceptions seem to concentrate on the “POP” variable, which indicates the number of people residing in fixed households. This indicates that the “POP” variable does not have a significant correlation with the other variables.

We can also create a “heat map” to better visualize the information seen above:

Creation of a heat map of Pearson correlations between variables

ggplotly(
  dados[,2:17] %>%
    cor() %>%
    melt() %>%
    rename(Correlação = value) %>%
    ggplot() +
    geom_tile(aes(x = Var1, y = Var2, fill = Correlação)) +
    geom_text(aes(x = Var1, y = Var2, label = format(Correlação, digits = 1)),
              size = 5) +
    scale_fill_viridis_b() +
    labs(x = NULL, y = NULL) +
    theme_bw() + theme(
    axis.text.x = element_blank(),
    axis.title.x = element_blank()
  ),
width = 1000)

Note: Warmer colors indicate a higher positive correlation, while cooler colors indicate no correlation or negative correlation.

Visualization using the `chart.Correlation`

We can also visualize this information using the chart.Correlation function:

chart.Correlation(dados[, 2:17], histogram = TRUE, pch = "+")

This function allows us to visualize the plots we created earlier along with the correlation coefficients of the variable pairs.

Bartlett’s sphericity test

Bartlett’s sphericity test is a statistical analysis used to check whether the variables in a dataset are correlated with each other. Although we have already observed indications of correlations, we will use Bartlett’s test through the ‘cortest.bartlett’ function as a means of confirmation:

cortest.bartlett(dados[, 2:17])

## R was not square, finding R from data

## $chisq
## [1] 1485.071
## 
## $p.value
## [1] 6.124255e-234
## 
## $df
## [1] 120

We obtained a p-value very close to zero, which indicates that the variables in the data are not independent from each other, suggesting the presence of significant correlation among them. Therefore, we can proceed with our analysis.

Principal Component Analysis (PCA) analysis

We will use the principal function to perform our factor analysis. This technique consists of transforming a set of correlated variables into a new set of uncorrelated variables called principal components.

Since our database has 16 variables, to preserve all the variability present in the data, we will request a total of 16 principal components.

fatorial <- principal(dados[2:17],
                      nfactors = length(dados[2:17]),
                      rotate = "none",
                      scores = TRUE)

Looking at the factor loadings of our components:

fatorial

## Principal Components Analysis
## Call: principal(r = dados[2:17], nfactors = length(dados[2:17]), rotate = "none", 
##     scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9  PC10  PC11
## ESPVIDA       0.98 -0.14  0.04  0.09  0.00 -0.01  0.04 -0.02 -0.07 -0.02  0.03
## FECTOT       -0.84 -0.03  0.44  0.15  0.09  0.15 -0.05 -0.04  0.16  0.07 -0.01
## MORT5        -0.93  0.11 -0.17 -0.08 -0.17  0.01 -0.12  0.14  0.05 -0.13  0.08
## SOBRE60       0.87 -0.32 -0.03  0.11  0.14 -0.20  0.22 -0.01  0.16 -0.06  0.06
## E_ANOSESTUDO  0.83 -0.16  0.01 -0.09  0.41  0.29 -0.14  0.05  0.02 -0.05  0.04
## T_SUPER25M    0.92  0.05 -0.08  0.10 -0.32  0.04 -0.05  0.10  0.13  0.03 -0.02
## RDPC          0.89  0.27  0.11  0.15 -0.19  0.22  0.05 -0.02 -0.01 -0.01  0.06
## T_DES18M     -0.88  0.27 -0.08 -0.10  0.14  0.08  0.24  0.20 -0.02  0.08  0.04
## T_BANAGUA     0.79  0.37  0.24 -0.39  0.05  0.00  0.08  0.05  0.03 -0.10 -0.13
## T_LIXO        0.73  0.22  0.53 -0.15  0.04 -0.29 -0.13  0.09 -0.03  0.05  0.08
## T_LUZ         0.70  0.20 -0.38 -0.53  0.00  0.00 -0.03 -0.14  0.07  0.08  0.05
## POP           0.30  0.75 -0.28  0.44  0.23 -0.12 -0.07 -0.03  0.03 -0.01 -0.01
## IDHM          0.98 -0.12 -0.10  0.07 -0.01  0.03 -0.02  0.08 -0.02  0.04 -0.03
## IDHM_E        0.91 -0.26 -0.24  0.06  0.03 -0.03 -0.08  0.15  0.01  0.07 -0.06
## IDHM_L        0.98 -0.14  0.04  0.09  0.00 -0.01  0.04 -0.01 -0.07 -0.02  0.03
## IDHM_R        0.92  0.26  0.14  0.07 -0.12  0.19  0.09 -0.03 -0.02  0.01  0.02
##               PC12  PC13  PC14  PC15 PC16 h2       u2 com
## ESPVIDA       0.08  0.01  0.00  0.00    0  1  4.4e-16 1.1
## FECTOT        0.05  0.04  0.01  0.00    0  1 -6.2e-15 1.8
## MORT5         0.01  0.05  0.01  0.00    0  1 -2.7e-15 1.3
## SOBRE60      -0.03  0.01  0.00  0.00    0  1 -2.7e-15 1.7
## E_ANOSESTUDO -0.01 -0.04  0.00  0.00    0  1 -2.7e-15 2.0
## T_SUPER25M    0.03 -0.07  0.00  0.00    0  1 -2.9e-15 1.4
## RDPC         -0.03  0.03 -0.04  0.00    0  1 -2.4e-15 1.5
## T_DES18M      0.03 -0.01  0.00  0.00    0  1 -3.1e-15 1.6
## T_BANAGUA     0.02  0.01  0.00  0.00    0  1 -1.6e-15 2.3
## T_LIXO       -0.02  0.00  0.00  0.00    0  1 -1.8e-15 2.7
## T_LUZ         0.02  0.02  0.00  0.00    0  1 -1.6e-15 2.9
## POP           0.01  0.00  0.00  0.00    0  1 -6.7e-16 2.7
## IDHM         -0.01  0.04  0.01 -0.01    0  1 -3.3e-15 1.1
## IDHM_E       -0.02  0.05  0.00  0.01    0  1 -3.1e-15 1.4
## IDHM_L        0.08  0.01  0.00  0.00    0  1 -3.8e-15 1.1
## IDHM_R       -0.05  0.01  0.04  0.00    0  1 -2.7e-15 1.4
## 
##                         PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9 PC10 PC11
## SS loadings           11.69 1.27 0.91 0.77 0.45 0.34 0.19 0.14 0.09 0.06 0.05
## Proportion Var         0.73 0.08 0.06 0.05 0.03 0.02 0.01 0.01 0.01 0.00 0.00
## Cumulative Var         0.73 0.81 0.87 0.91 0.94 0.96 0.98 0.98 0.99 0.99 1.00
## Proportion Explained   0.73 0.08 0.06 0.05 0.03 0.02 0.01 0.01 0.01 0.00 0.00
## Cumulative Proportion  0.73 0.81 0.87 0.91 0.94 0.96 0.98 0.98 0.99 0.99 1.00
##                       PC12 PC13 PC14 PC15 PC16
## SS loadings           0.02 0.02    0    0    0
## Proportion Var        0.00 0.00    0    0    0
## Cumulative Var        1.00 1.00    1    1    1
## Proportion Explained  0.00 0.00    0    0    0
## Cumulative Proportion 1.00 1.00    1    1    1
## 
## Mean item complexity =  1.8
## Test of the hypothesis that 16 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0 
##  with the empirical chi square  0  with prob <  NA 
## 
## Fit based upon off diagonal values = 1

The factor loadings indicate, through a scale from -1 to 1, how much each variable contributes to the formation of the component. Consequently, we can see that our first component seems to capture a good portion of the data variability.

Initial identification of all eigenvalues

The eigenvalues indicate the variability of each component. Let’s display our eigenvalues with 5 decimal places:

autovalores <- round(fatorial$values, 5)
autovalores

##  [1] 11.69324  1.26592  0.90564  0.77437  0.45165  0.33644  0.19296  0.13774
##  [9]  0.08873  0.06124  0.04824  0.02390  0.01694  0.00286  0.00012  0.00002

It’s important to note that only 2 eigenvalues are greater than 1, as this detail is relevant for the continuation of our analysis based on the Kaiser criterion.

Adding up our eigenvalues to check that the sum results in the exact number of variables in our database:

sum(autovalores)

## [1] 16.00001

As the goal of factor analysis is to simplify and condense the amount of information contained in the original variables, we will now make this simplification based on the Kaiser criterion. The Kaiser criterion states that only factors with eigenvalues greater than 1 should be considered significant. This means that only factors that explain a greater amount of variance than an individual variable will be retained.

Definition of the number of factors with eigenvalues greater than 1

k <- sum(autovalores > 1)
print(k)

## [1] 2

Since only 2 of our eigenvalues are greater than 1, only 2 factors will be preserved.

Reapplying the factor analysis through the principal function with only the first 2 components:

fatorial2 <- principal(dados[2:17],
                      nfactors = k,
                      rotate = "none",
                      scores = TRUE)

fatorial2

## Principal Components Analysis
## Call: principal(r = dados[2:17], nfactors = k, rotate = "none", scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                PC1   PC2   h2    u2 com
## ESPVIDA       0.98 -0.14 0.97 0.025 1.0
## FECTOT       -0.84 -0.03 0.71 0.286 1.0
## MORT5        -0.93  0.11 0.87 0.128 1.0
## SOBRE60       0.87 -0.32 0.85 0.149 1.3
## E_ANOSESTUDO  0.83 -0.16 0.71 0.291 1.1
## T_SUPER25M    0.92  0.05 0.84 0.158 1.0
## RDPC          0.89  0.27 0.87 0.129 1.2
## T_DES18M     -0.88  0.27 0.85 0.151 1.2
## T_BANAGUA     0.79  0.37 0.75 0.247 1.4
## T_LIXO        0.73  0.22 0.58 0.418 1.2
## T_LUZ         0.70  0.20 0.53 0.466 1.2
## POP           0.30  0.75 0.65 0.347 1.3
## IDHM          0.98 -0.12 0.97 0.027 1.0
## IDHM_E        0.91 -0.26 0.90 0.104 1.2
## IDHM_L        0.98 -0.14 0.97 0.025 1.0
## IDHM_R        0.92  0.26 0.91 0.089 1.2
## 
##                         PC1  PC2
## SS loadings           11.69 1.27
## Proportion Var         0.73 0.08
## Cumulative Var         0.73 0.81
## Proportion Explained   0.90 0.10
## Cumulative Proportion  0.90 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.06 
##  with the empirical chi square  37.79  with prob <  1 
## 
## Fit based upon off diagonal values = 0.99

It is emphasized here how the first component demonstrates a high capacity to condense the information from the original variables.

Identification of the shared variance in each component

variancia_compartilhada <- as.data.frame(fatorial2$Vaccounted) %>% 
  slice(1:3)

rownames(variancia_compartilhada) <- c("Eigenvalues",
                                       "Proportion of Variance",
                                       "Cumulative Proportion of Variance")
round(variancia_compartilhada, 3) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", 
                full_width = FALSE, 
                font_size = 20)

	PC1	PC2
Eigenvalues	11.693	1.266
Proportion of Variance	0.731	0.079
Cumulative Proportion of Variance	0.731	0.810

Here we can see that our first 2 components capture approximately 81% of the variance of the original data.

Visualization of factorial scores

scores_fatoriais <- as.data.frame(fatorial2$weights)

round(scores_fatoriais, 3) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", 
                full_width = FALSE, 
                font_size = 20)

	PC1	PC2
ESPVIDA	0.084	-0.109
FECTOT	-0.072	-0.022
MORT5	-0.079	0.084
SOBRE60	0.074	-0.250
E_ANOSESTUDO	0.071	-0.125
T_SUPER25M	0.078	0.038
RDPC	0.076	0.214
T_DES18M	-0.075	0.213
T_BANAGUA	0.067	0.291
T_LIXO	0.062	0.177
T_LUZ	0.060	0.161
POP	0.026	0.592
IDHM	0.084	-0.098
IDHM_E	0.078	-0.209
IDHM_L	0.084	-0.110
IDHM_R	0.078	0.208

Factorial scores represent the contribution of each observation to the construction of each principal component. They are calculated by multiplying the factorial loads (weights) of the principal components by the original variables of each observation and summing these products.

Visualization of the factors themselvess

fatores <- as.data.frame(fatorial2$scores)

fatores

##            PC1         PC2
## 1  -2.27807458 -0.75208164
## 2  -1.36700106 -0.32897735
## 3  -1.86131148 -1.65648403
## 4  -1.46556196 -0.38792353
## 5  -1.30175240  0.05883814
## 6  -1.35230324  0.62153512
## 7  -2.17803739  1.40604296
## 8  -1.20188608  0.94098168
## 9  -0.26824831  0.63190506
## 10 -0.43251561  0.05595586
## 11 -0.23934530  2.08105483
## 12  0.25957437  2.98833776
## 13 -0.10157389  0.48672981
## 14  0.20498602  0.52294266
## 15 -0.13531427 -0.01787279
## 16 -0.04481424  0.36775842
## 17  0.05999162  0.41007687
## 18 -0.76851218 -0.52890532
## 19 -0.33489685  0.30587303
## 20 -0.23687394  0.15929281
## 21 -0.04121714 -1.25523277
## 22  0.17702713 -0.89734540
## 23  0.27261412 -1.85457570
## 24  0.37516405 -1.18903484
## 25  0.38069506 -0.69520350
## 26  0.33496928 -0.32984767
## 27 -0.02652783 -0.39302817
## 28  0.50290224 -0.31086742
## 29  1.02552995  0.05706312
## 30  0.93010794 -0.41594619
## 31  0.96298181  1.31482409
## 32  1.44344053  2.20113948
## 33  0.95088981 -0.61213894
## 34  1.31584401 -0.46308740
## 35  1.04346268 -0.90946697
## 36  1.30181947 -0.34892142
## 37  1.06446806 -0.09186182
## 38  0.81040598 -0.85455643
## 39  0.96530881 -0.43148471
## 40  1.25358481  0.11449232

The factors are calculated by multiplying the factorial scores by the standard deviations of the principal components. The factors in factor analysis represent the individual scores of the observations in the underlying theoretical constructs.

Pearson correlation coefficients for each pair of (orthogonal) factors

rho2 <- rcorr(as.matrix(fatores), type="pearson")
round(rho2$r, 4)

##     PC1 PC2
## PC1   1   0
## PC2   0   1

Here we can note that our factors do not have correlation with each other. We can say that the factors are orthogonal and do not share any common variance. This indicates that they represent distinct dimensions of the latent construct, and consequently facilitate the interpretation of the results.

Visualization of communalities

comunalidades <- as.data.frame(unclass(fatorial2$communality)) %>%
  rename(comunalidades = 1)

round(comunalidades, 3) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped",
                full_width = FALSE,
                font_size = 20)

	comunalidades
ESPVIDA	0.975
FECTOT	0.714
MORT5	0.872
SOBRE60	0.851
E_ANOSESTUDO	0.709
T_SUPER25M	0.842
RDPC	0.871
T_DES18M	0.849
T_BANAGUA	0.753
T_LIXO	0.582
T_LUZ	0.534
POP	0.653
IDHM	0.973
IDHM_E	0.896
IDHM_L	0.975
IDHM_R	0.911

Communality is an important measure. It ranges from 0 to 1, with 0 indicating that the variable has no relation to any of the factors and 1 indicating that the variable is perfectly related to at least one of the factors. We can see that, thanks to the Kaiser criterion, despite having preserved only 2 factors, we are managing to capture a good part of the variability of the original data.

Ranking of Metropolitan Regions

Creating a database with the factors to rank the cities

dados_fatores <- bind_cols(dados,
                           "fator_1" = fatores$PC1, 
                           "fator_2" = fatores$PC2)

Assuming only the 2 factors as indicators, the “score” is calculated. Where the score in this type of experiment is generally considered to be factor * shared variance by that factor.

dados_rank <- dados_fatores %>% 
  mutate(pontuacao = fator_1 * variancia_compartilhada$PC1[2] + fator_2 * variancia_compartilhada$PC2[2])

Viewing the final ranking in descending order

dados_rank[,c(1, 20)] %>%
  arrange(desc(pontuacao)) %>% 
  kable() %>%
  kable_styling(bootstrap_options = "striped", 
                full_width = T, 
                font_size = 12)

NOME_RM	pontuacao
RM São Paulo (SP) 2010	1.2290601
RIDE DF e Entorno (DF/GO/MG) 2010	0.9252126
RM Campinas (SP) 2010	0.9250152
RM Curitiba (PR) 2010	0.9237985
RM Rio Janeiro (RJ) 2010	0.8078024
RM Porto Alegre (RS) 2010	0.7706742
RM Belo Horizonte (MG) 2010	0.7540001
RM Vale do Paraíba e Litoral Norte (SP) 2010	0.6906338
RM Goiânia (GO) 2010	0.6713349
RM Grande Vitória (MG) 2010	0.6468385
RM Baixada Santista (SP) 2010	0.6465037
RM Vale do Rio Cuiabá (MT) 2010	0.5246542
RM São Paulo (SP)	0.4261417
RM Salvador (BA) 2010	0.3429388
RM Natal (RN) 2010	0.2232177
RM Recife (PE) 2010	0.2187071
RM Campinas (SP)	0.1911847
RM Fortaleza (CE) 2010	0.1801035
RM Porto Alegre (RS) 2010	0.0762888
RM Belém (PA) 2010	0.0583782
RM Grande São Luís (MA) 2010	0.0524996
RM Curitiba (PR)	-0.0036544
RM Rio Janeiro (RJ)	-0.0102667
RM Baixada Santista (SP)	-0.0357228
RM Maceió (AL)	-0.0504837
RM Vale do Paraíba e Litoral Norte (SP)	-0.1003055
RM Manaus (AM) 2010	-0.1294368
RM Belo Horizonte (MG)	-0.1460468
RIDE DF e Entorno (DF/GO/MG)	-0.1605107
RM Goiânia (GO)	-0.2205510
RM Grande Vitória (MG)	-0.3116670
RM Vale do Rio Cuiabá (MT)	-0.6034967
RM Salvador (BA)	-0.8039205
RM Recife (PE)	-0.9391242
RM Natal (RN)	-0.9467009
RM Belém (PA)	-1.0250704
RM Fortaleza (CE)	-1.1017652
RM Maceió (AL)	-1.4805228
RM Grande São Luís (MA)	-1.4913584
RM Manaus (AM)	-1.7243838

When analyzing the ranking of cities based on principal component factor analysis, we observe that it consistently reflects the Brazilian reality. Metropolitan regions of cities like São Paulo, Campinas, and Curitiba occupy the top positions, while metropolitan regions in the North and Northeast are at the bottom positions. This highlights the ability of data analysis to summarize complex information contained in an extensive database into just two factors.

Principal component factor analysis not only allowed us to create a ranking of metropolitan regions but also provided valuable insights into the underlying latent constructs behind the observed variables. During the analysis process, we could identify and understand some of the factors that influence longevity, education, income, and infrastructure in Brazilian cities.

In addition to ranking, principal component factor analysis can have various other uses in data science projects. It can serve as a basis for creating indices or scores that quantify the performance of observations, help identify key variables that impact latent constructs, and provide a clearer view of the data’s structure.

Thus, principal component factor analysis has proven to be a powerful tool for understanding and interpreting complex databases, allowing the extraction of relevant information in a simpler and more concise manner. This approach contributes to informed decision-making and promotes a deeper understanding of phenomena derived from relationships between variables.

Principal Component Analysis

Rafael

2023-06-21

Loading the database

Descriptive Statistics

Observing the database

Univariate Descriptive Statistics

Analyzing some of the variables that are likely to be correlated

Scatter plot and linear fit between years of schooling and per capita income.

Scatter plot and linear fit between years of schooling and the unemployment rate.

Principal Component Analysis (PCA) elaboration

Pearson correlation coefficients for each pair of variables

Creation of a heat map of Pearson correlations between variables

Visualization using the `chart.Correlation`

Bartlett’s sphericity test

Principal Component Analysis (PCA) analysis

Looking at the factor loadings of our components:

Initial identification of all eigenvalues

Adding up our eigenvalues to check that the sum results in the exact number of variables in our database:

Definition of the number of factors with eigenvalues greater than 1

Reapplying the factor analysis through the principal function with only the first 2 components:

Identification of the shared variance in each component

Visualization of factorial scores

Visualization of the factors themselvess

Pearson correlation coefficients for each pair of (orthogonal) factors

Visualization of communalities

Ranking of Metropolitan Regions

Creating a database with the factors to rank the cities

Viewing the final ranking in descending order

Principal Component Analysis

Rafael

2023-06-21

Loading the database

Descriptive Statistics

Observing the database

Univariate Descriptive Statistics

Analyzing some of the variables that are likely to be correlated

Scatter plot and linear fit between years of schooling and per capita income.

Scatter plot and linear fit between years of schooling and the unemployment rate.

Principal Component Analysis (PCA) elaboration

Pearson correlation coefficients for each pair of variables

Creation of a heat map of Pearson correlations between variables

Visualization using the chart.Correlation

Bartlett’s sphericity test

Principal Component Analysis (PCA) analysis

Looking at the factor loadings of our components:

Initial identification of all eigenvalues

Adding up our eigenvalues to check that the sum results in the exact number of variables in our database:

Definition of the number of factors with eigenvalues greater than 1

Reapplying the factor analysis through the principal function with only the first 2 components:

Identification of the shared variance in each component

Visualization of factorial scores

Visualization of the factors themselvess

Pearson correlation coefficients for each pair of (orthogonal) factors

Visualization of communalities

Ranking of Metropolitan Regions

Creating a database with the factors to rank the cities

Viewing the final ranking in descending order

Visualization using the `chart.Correlation`