Cardiovascular diseases are a global health concern, so identifying the factors that contribute to heart problems is crucial. Understanding which aspects and personal characteristics play a key role in this condition allows us to develop more effective and personalized preventive strategies.

Correspondence analysis is a powerful technique that enables the identification of patterns and associations between categorical variables. Based on this methodology, we will investigate a wide range of factors such as age, gender, and medical diagnoses. Through this approach, we will gain meaningful insights for a comprehensive understanding of the characteristics and profiles of individuals affected by heart problems.

Upon completing the analysis, we will be able to create a perceptual map, a graph in which the relationship between categories can be read from their proximity. We will be able to see which categories of which variables are more strongly associated with the presence of heart problems and which are associated with their absence.

Database used:
dados_cor

Loading packages

pacotes <- c("plotly", "tidyverse", "ggrepel", "knitr", "kableExtra", "sjPlot", "FactoMineR", "amap", "ade4","readxl")

# Install any missing packages, then load all of them
instalador <- pacotes[!pacotes %in% rownames(installed.packages())]
if (length(instalador) > 0) {
  install.packages(instalador, dependencies = TRUE)
}
sapply(pacotes, require, character.only = TRUE)
## Warning: package 'sjPlot' was built under R version 4.2.3
## Warning: package 'ade4' was built under R version 4.2.3

Importing the database

dados_cor <- read_excel("dados_cor_acm.xlsx")
Idade  Sexo       Tipo_Dor_Peito  PS_Descanso  Colesterol  Açucar_Sangue  ECG_Descanso  BC_Max  Angina_Exerc  Doença_Card
40     Masculino  Atipica         140          289         Normal         Normal        172     Nao           Nao
49     Feminino   Sem_Dor         160          180         Normal         Normal        156     Nao           Sim
37     Masculino  Atipica         130          283         Normal         Anormal_ST    98      Nao           Nao
48     Feminino   Assintomatico   138          214         Normal         Normal        108     Sim           Sim
54     Masculino  Sem_Dor         150          195         Normal         Normal        122     Nao           Nao
39     Masculino  Sem_Dor         120          339         Normal         Normal        170     Nao           Nao
45     Feminino   Atipica         130          237         Normal         Normal        170     Nao           Nao
54     Masculino  Atipica         110          208         Normal         Normal        142     Nao           Nao
37     Masculino  Assintomatico   140          207         Normal         Normal        130     Sim           Sim
48     Feminino   Atipica         120          284         Normal         Normal        120     Nao           Nao

When analyzing the database, we can see that some of the variables are quantitative, while others are qualitative. Since correspondence analysis is a technique exclusive to qualitative variables, we need to transform some of the variables. We could simply exclude the quantitative variables, but to avoid losing information from these variables, we will proceed with the transformation.

Adjusting the database for our analysis

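We will categorize each quantitative variable into three bands based on its quartiles: values up to the first quartile, values between the first and third quartiles, and values above the third quartile.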

dados_cor <- dados_cor %>% 
  mutate(Categ_Idade = case_when(Idade <= quantile(Idade, 0.25, na.rm = T) ~ "menores_idades",
                                 Idade > quantile(Idade, 0.25, na.rm = T) & Idade <= quantile(Idade, 0.75, na.rm = T) ~ "idades_médias",
                                 Idade > quantile(Idade, 0.75, na.rm = T) ~ "maiores_idades"))

dados_cor <- dados_cor %>% 
  mutate(Categ_PS_Desc = case_when(PS_Descanso <= quantile(PS_Descanso, 0.25, na.rm = T) ~ "PS_descanso_baixo",
                                   PS_Descanso > quantile(PS_Descanso, 0.25, na.rm = T) & PS_Descanso <= quantile(PS_Descanso, 0.75, na.rm = T) ~ "PS_descanso_médio",
                                   PS_Descanso > quantile(PS_Descanso, 0.75, na.rm = T) ~ "PS_descanso_alto"))

dados_cor <- dados_cor %>% 
  mutate(Categ_Colest = case_when(Colesterol <= quantile(Colesterol, 0.25, na.rm = T) ~ "menor_colesterol",
                                  Colesterol > quantile(Colesterol, 0.25, na.rm = T) & Colesterol <= quantile(Colesterol, 0.75, na.rm = T) ~ "colesterol_médio",
                                  Colesterol > quantile(Colesterol, 0.75, na.rm = T) ~ "maior_colesterol"))

dados_cor <- dados_cor %>% 
  mutate(Categ_BC_Max = case_when(BC_Max <= quantile(BC_Max, 0.25, na.rm = T) ~ "menor_BC_Max",
                                  BC_Max > quantile(BC_Max, 0.25, na.rm = T) & BC_Max <= quantile(BC_Max, 0.75, na.rm = T) ~ "BC_Max_médio",
                                  BC_Max > quantile(BC_Max, 0.75, na.rm = T) ~ "maior_BC_Max"))
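
The four blocks above apply the same quartile rule to different columns. As a minimal sketch of that rule, the hypothetical helper below (categorize_by_quartiles is not part of the original analysis, and the input vector is toy data) uses base R's cut() with the first and third quartiles as break points. It assumes the two quartiles are distinct; cut() fails if they coincide.

```r
# Hypothetical helper: three bands split at the 1st and 3rd quartiles.
# right = TRUE reproduces the "<= quartile" boundaries used in the case_when() calls.
categorize_by_quartiles <- function(x, labels = c("low", "medium", "high")) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  cut(x, breaks = c(-Inf, q[1], q[2], Inf), labels = labels, right = TRUE)
}

# Toy values, not taken from the dataset
x <- c(10, 20, 30, 40, 50, 60, 70, 80)
bands <- categorize_by_quartiles(x)
table(bands)
```

With the real data, a call such as categorize_by_quartiles(dados_cor$Idade, c("menores_idades", "idades_médias", "maiores_idades")) should give the same grouping as the first block, returned as a factor rather than a character vector.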

With the variables properly transformed, we can now eliminate the original quantitative variables while keeping, in categorized form, the information they contained.

Removing the variables that we will no longer use

dados_cor <- dados_cor %>% 
  select(-Idade, -PS_Descanso, -Colesterol, -BC_Max)

Checking the variable types

str(dados_cor)
## tibble [918 × 10] (S3: tbl_df/tbl/data.frame)
##  $ Sexo          : chr [1:918] "Masculino" "Feminino" "Masculino" "Feminino" ...
##  $ Tipo_Dor_Peito: chr [1:918] "Atipica" "Sem_Dor" "Atipica" "Assintomatico" ...
##  $ Açucar_Sangue : chr [1:918] "Normal" "Normal" "Normal" "Normal" ...
##  $ ECG_Descanso  : chr [1:918] "Normal" "Normal" "Anormal_ST" "Normal" ...
##  $ Angina_Exerc  : chr [1:918] "Nao" "Nao" "Nao" "Sim" ...
##  $ Doença_Card   : chr [1:918] "Nao" "Sim" "Nao" "Sim" ...
##  $ Categ_Idade   : chr [1:918] "menores_idades" "idades_médias" "menores_idades" "idades_médias" ...
##  $ Categ_PS_Desc : chr [1:918] "PS_descanso_médio" "PS_descanso_alto" "PS_descanso_médio" "PS_descanso_médio" ...
##  $ Categ_Colest  : chr [1:918] "maior_colesterol" "colesterol_médio" "maior_colesterol" "colesterol_médio" ...
##  $ Categ_BC_Max  : chr [1:918] "maior_BC_Max" "BC_Max_médio" "menor_BC_Max" "menor_BC_Max" ...

We will change our character variables to factors because the dudi.acm function we will use only accepts inputs of this type.

Converting the character variables to factors

dados_cor <- as.data.frame(unclass(dados_cor), stringsAsFactors=TRUE) 
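
A more explicit alternative is sketched below on a toy data frame (the values are illustrative only). It converts each character column in turn and then confirms that every column is a factor, which is a cheap check to run before calling dudi.acm.

```r
# Toy data frame with character columns (illustrative values)
toy <- data.frame(
  Sexo         = c("Masculino", "Feminino", "Masculino"),
  Angina_Exerc = c("Nao", "Sim", "Nao"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving other columns untouched
toy[] <- lapply(toy, function(col) if (is.character(col)) factor(col) else col)

# Check that all columns are now factors
all_factors <- all(vapply(toy, is.factor, logical(1)))
all_factors
```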

In order for a correspondence analysis to be conducted, we need the variables in the database to have an association with at least one other variable present in the database. In the context of our analysis, the presence or absence of heart disease is the most relevant aspect, so we will check if the other variables are related to this one.
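
One common way to run this kind of check is a chi-square test of independence between each variable and the variable of interest. The sketch below uses invented toy counts, and p_values_vs_target is a hypothetical helper, not a function from the original analysis. A small p-value (for example, below 0.05) suggests that the two variables are associated.

```r
# Toy data (invented counts, not the article's dataset)
toy <- data.frame(
  Angina_Exerc = factor(c(rep("Sim", 40), rep("Nao", 60))),
  Sexo         = factor(rep(c("Masculino", "Feminino"), 50)),
  Doenca_Card  = factor(c(rep("Sim", 32), rep("Nao", 8),
                          rep("Sim", 18), rep("Nao", 42)))
)

# Hypothetical helper: chi-square p-value of every column against a target column
p_values_vs_target <- function(df, target) {
  others <- setdiff(names(df), target)
  sapply(others, function(v) chisq.test(table(df[[v]], df[[target]]))$p.value)
}

p_values <- p_values_vs_target(toy, "Doenca_Card")
p_values < 0.05
```

With the real data, the call would presumably be p_values_vs_target(dados_cor, "Doença_Card"). In this toy example Angina_Exerc is associated with the outcome, while Sexo is not.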

Starting the Correspondence Analysis

Applying the correspondence analysis using the dudi.acm function.

ACM <- dudi.acm(dados_cor, scannf = FALSE, nf = 3)

Because we have more than two variables, we need a function that performs a Multiple Correspondence Analysis (MCA). MCA can be understood as extending Simple Correspondence Analysis (SCA) to all pairs of variables at once: the associations in every pairwise contingency table are analyzed jointly. The decomposition yields eigenvalues, which can be used, among other things, to obtain the proportion of inertia explained by each dimension. In the end, we will have coordinates that we can use to plot a perceptual map showing the associations between the categories of the variables.
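
To make the "all pairs of variables" idea concrete, the sketch below builds, for a toy data frame, the binary (indicator) matrix and the Burt matrix, whose off-diagonal blocks are exactly the pairwise contingency tables. The data and object names are illustrative, and this only shows the input structure that MCA works with, not what dudi.acm does internally.

```r
# Toy data: two categorical variables, five observations
toy <- data.frame(
  Sexo   = factor(c("M", "F", "M", "F", "M")),
  Doenca = factor(c("Sim", "Nao", "Sim", "Sim", "Nao"))
)

# Indicator (binary) matrix Z: one 0/1 column per category
Z <- do.call(cbind, lapply(names(toy), function(v) {
  m <- sapply(levels(toy[[v]]), function(l) as.integer(toy[[v]] == l))
  colnames(m) <- paste(v, levels(toy[[v]]), sep = ".")
  m
}))

# Burt matrix: all pairwise contingency tables arranged in one symmetric matrix
burt <- t(Z) %*% Z

# The Sexo x Doenca block equals the ordinary contingency table of the two variables
bloco <- burt[c("Sexo.F", "Sexo.M"), c("Doenca.Nao", "Doenca.Sim")]
bloco
table(toy$Sexo, toy$Doenca)
```

Each row of Z sums to the number of variables, because every observation falls into exactly one category of each variable.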

Analyzing the variances of each dimension

perc_variancia <- (ACM$eig / sum(ACM$eig)) * 100
paste0(round(perc_variancia,2),"%")
##  [1] "16.95%" "8.14%"  "7.47%"  "6.84%"  "6.4%"   "6.17%"  "5.72%"  "5.66%" 
##  [9] "5.21%"  "5.06%"  "4.96%"  "4.72%"  "4.36%"  "3.8%"   "3.16%"  "3.03%" 
## [17] "2.34%"
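
A cumulative view of these percentages shows how much inertia is retained when only the first two or three dimensions are plotted. The sketch below retypes the rounded percentages printed above; in the actual session one would presumably call cumsum(perc_variancia) directly.

```r
# Rounded percentages copied from the output above
perc <- c(16.95, 8.14, 7.47, 6.84, 6.4, 6.17, 5.72, 5.66, 5.21,
          5.06, 4.96, 4.72, 4.36, 3.8, 3.16, 3.03, 2.34)

# Cumulative percentage of inertia explained by the first k dimensions
acumulado <- cumsum(perc)
round(acumulado[1:3], 2)
```

The first two dimensions together account for about 25.09% of the total inertia, and the first three for about 32.56%.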

The number of dimensions depends on the number of variables and categories. In our case, we have a total of 17 dimensions. A simple formula for the total number of dimensions in an MCA is to subtract the total number of variables from the total number of categories. We know we have 10 variables in total, so 17 dimensions imply 27 categories; let’s confirm that we do indeed have 27 categories.

Number of categories per variable

# Count the levels of each factor column
quant_categorias <- sapply(dados_cor, nlevels)
quant_categorias
##           Sexo Tipo_Dor_Peito  Açucar_Sangue   ECG_Descanso   Angina_Exerc 
##              2              4              2              3              2 
##    Doença_Card    Categ_Idade  Categ_PS_Desc   Categ_Colest   Categ_BC_Max 
##              2              3              3              3              3

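We can see that our variables do indeed have a total of 27 categories (2 + 4 + 2 + 3 + 2 + 2 + 3 + 3 + 3 + 3), and 27 - 10 = 17 matches the number of dimensions obtained.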
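
The same check can be written as a short calculation. In the sketch below, the category counts are retyped from the output above so that the snippet runs on its own; in the actual session the existing quant_categorias object could be used directly.

```r
# Category counts copied from the output above
quant_categorias <- c("Sexo" = 2, "Tipo_Dor_Peito" = 4, "Açucar_Sangue" = 2,
                      "ECG_Descanso" = 3, "Angina_Exerc" = 2, "Doença_Card" = 2,
                      "Categ_Idade" = 3, "Categ_PS_Desc" = 3, "Categ_Colest" = 3,
                      "Categ_BC_Max" = 3)

# Number of MCA dimensions = total categories - number of variables
n_dimensoes <- sum(quant_categorias) - length(quant_categorias)
n_dimensoes
```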

Creating a dataframe with the standard coordinates obtained from the binary matrix

df_ACM <- data.frame(ACM$c1, Variável = rep(names(quant_categorias),
                                            quant_categorias))

Let’s use the obtained coordinates to plot some graphs that facilitate the visualization of associations between the variables.

Plotting graphs

Plotting the 2D perceptual map

df_ACM %>%
  rownames_to_column() %>%
  rename(Categoria = 1) %>%
  ggplot(aes(x = CS1, y = CS2, label = Categoria, color = Variável)) +
  geom_point() +
  geom_label_repel() +
  geom_vline(aes(xintercept = 0), linetype = "longdash", color = "grey48") +
  geom_hline(aes(yintercept = 0), linetype = "longdash", color = "grey48") +
  labs(x = paste("Dimension 1:", paste0(round(perc_variancia[1], 2), "%")),
       y = paste("Dimension 2:", paste0(round(perc_variancia[2], 2), "%"))) +
  theme_bw() +
  guides(color = "none")

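This graph allows us to see which categories are most strongly associated with the presence or absence of heart disease: categories plotted close to the "Sim" category of Doença_Card tend to occur together with heart disease, while those plotted close to "Nao" tend to occur in its absence.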
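
Reading proximity by eye can be complemented by computing distances on the map. The sketch below uses invented coordinates in place of df_ACM (the row names and values are illustrative only) and ranks the categories by their Euclidean distance to the heart disease category. Since the first two dimensions capture only part of the total inertia, such distances are a rough guide rather than a formal measure of association.

```r
# Invented 2D coordinates standing in for the CS1 and CS2 columns of df_ACM
coords <- data.frame(
  CS1 = c(1.0, -0.9, 0.8, -0.7),
  CS2 = c(0.2, -0.1, 0.4, -0.3),
  row.names = c("Doenca_Card.Sim", "Doenca_Card.Nao",
                "Angina_Exerc.Sim", "Angina_Exerc.Nao")
)

# Reference point: the category indicating presence of heart disease
ref <- unlist(coords["Doenca_Card.Sim", ])

# Euclidean distance from every category to the reference point
dists <- sqrt(rowSums(sweep(as.matrix(coords), 2, ref)^2))
sort(dists)
```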

3D perceptual map (first 3 dimensions)

ACM_3D <- plot_ly()

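# type = "scatter3d" is set explicitly so plotly does not have to infer the trace type
ACM_3D <- add_trace(p = ACM_3D,
                    x = df_ACM$CS1,
                    y = df_ACM$CS2,
                    z = df_ACM$CS3,
                    type = "scatter3d",
                    mode = "text",
                    text = rownames(df_ACM),
                    textfont = list(color = "blue"),
                    marker = list(color = "red"),
                    showlegend = FALSE)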

ACM_3D

Because it adds a third dimension, the 3D map captures somewhat more of the variability in the data than the 2D map (about 32.56% of the total inertia instead of 25.09%, based on the percentages above). On the other hand, a large number of categories can make the 3D graph hard to read. When more than two dimensions carry relevant information, the 3D graph can be as useful as the 2D map, or even more useful, as an analytical tool.

Plotting the perceptual map of the observations

df_coord_obs <- ACM$li

df_coord_obs %>%
  ggplot(aes(x = Axis1, y = Axis2, color = dados_cor$Doença_Card)) +
  geom_point() +
  geom_vline(aes(xintercept = 0), linetype = "longdash", color = "grey48") +
  geom_hline(aes(yintercept = 0), linetype = "longdash", color = "grey48") +
  labs(x = paste("Dimension 1:", paste0(round(perc_variancia[1], 2), "%")),
       y = paste("Dimension 2:", paste0(round(perc_variancia[2], 2), "%")),
       color = "Cardiac Disease") +
  theme_bw()

Legend: Sim = yes; Nao = no

This last graph shows the arrangement of the observations according to their categories, with a focus on the variable indicating the presence or absence of heart disease in the individual.

Conclusion

We can conclude that correspondence analysis allows us to condense categorical (qualitative) variables into quantitative variables, which can be used in the plotting of graphs for a better visualization of the relationship between these variables. This transformation enables us to explore and interpret patterns of association between categories more efficiently.

Furthermore, the quantitative variables obtained through correspondence analysis can be used in other data analysis techniques. For instance, these variables can be employed in factor analyses to identify underlying structures in the data or in clustering to group similar observations. They can also serve as predictors in supervised statistical or machine learning models, which lets us explore the relationship between the transformed categorical variables and an outcome variable of interest.
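
As an illustration of the clustering use case, the sketch below runs k-means on invented observation coordinates. In the actual analysis the input would presumably be the observation coordinates in ACM$li. The number of clusters (2) and the seed are arbitrary choices made for this example.

```r
# Invented observation coordinates standing in for ACM$li
coords <- data.frame(
  Axis1 = c(-1.1, -0.9, -1.0, 1.0, 1.2, 0.9),
  Axis2 = c(0.1, -0.2, 0.0, 0.2, -0.1, 0.0)
)

# k-means with 2 clusters; nstart repeats the random initialization several times
set.seed(123)
grupos <- kmeans(coords, centers = 2, nstart = 10)
grupos$cluster
```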

Therefore, correspondence analysis provides an effective way to explore and visualize the relationship between categorical variables, and the quantitative variables derived from this analysis can be applied in various data analysis techniques, expanding the potential for insights and discoveries in exploratory and predictive studies.