Cardiovascular diseases are a global health concern, so identifying the factors that contribute to heart problems is crucial. Understanding which personal characteristics are associated with the condition helps us develop more effective and personalized preventive strategies.

Correspondence analysis is a powerful technique that enables the identification of patterns and associations between categorical variables. Based on this methodology, we will investigate a wide range of factors such as age, gender, and medical diagnoses. Through this approach, we will gain meaningful insights for a comprehensive understanding of the characteristics and profiles of individuals affected by heart problems.

Upon completing the analysis, we will be able to draw a perceptual map in which the relationship between categories can be read from their proximity on the graph. We will see which categories of which variables are more strongly associated with the presence of heart problems and which are associated with their absence.

Database used:
dados_cor

Loading packages

pacotes <- c("plotly", "tidyverse", "ggrepel", "knitr", "kableExtra", "sjPlot", "FactoMineR", "amap", "ade4","readxl")

# Install any packages that are missing (a single vectorized call), then load them all
if (sum(as.numeric(!pacotes %in% installed.packages())) != 0) {
  instalador <- pacotes[!pacotes %in% installed.packages()]
  install.packages(instalador, dependencies = TRUE)
}

sapply(pacotes, require, character.only = TRUE)

## Warning: package 'sjPlot' was built under R version 4.2.3

## Warning: package 'ade4' was built under R version 4.2.3

Importing the database

dados_cor <- read_excel("dados_cor_acm.xlsx")

Idade	Sexo	Tipo_Dor_Peito	PS_Descanso	Colesterol	Açucar_Sangue	ECG_Descanso	BC_Max	Angina_Exerc	Doença_Card
40	Masculino	Atipica	140	289	Normal	Normal	172	Nao	Nao
49	Feminino	Sem_Dor	160	180	Normal	Normal	156	Nao	Sim
37	Masculino	Atipica	130	283	Normal	Anormal_ST	98	Nao	Nao
48	Feminino	Assintomatico	138	214	Normal	Normal	108	Sim	Sim
54	Masculino	Sem_Dor	150	195	Normal	Normal	122	Nao	Nao
39	Masculino	Sem_Dor	120	339	Normal	Normal	170	Nao	Nao
45	Feminino	Atipica	130	237	Normal	Normal	170	Nao	Nao
54	Masculino	Atipica	110	208	Normal	Normal	142	Nao	Nao
37	Masculino	Assintomatico	140	207	Normal	Normal	130	Sim	Sim
48	Feminino	Atipica	120	284	Normal	Normal	120	Nao	Nao

When analyzing the database, we can see that some of the variables are quantitative, while others are qualitative. Since correspondence analysis is a technique exclusive to qualitative variables, we need to transform some of the variables. We could simply exclude the quantitative variables, but to avoid losing information from these variables, we will proceed with the transformation.

Adjusting the database for our analysis

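We will categorize the quantitative variables using their quartiles: values up to the first quartile form the low category, values above the first and up to the third quartile form the middle category, and values above the third quartile form the high category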

dados_cor <- dados_cor %>% 
  mutate(Categ_Idade = case_when(Idade <= quantile(Idade, 0.25, na.rm = T) ~ "menores_idades",
                                 Idade > quantile(Idade, 0.25, na.rm = T) & Idade <= quantile(Idade, 0.75, na.rm = T) ~ "idades_médias",
                                 Idade > quantile(Idade, 0.75, na.rm = T) ~ "maiores_idades"))

dados_cor <- dados_cor %>% 
  mutate(Categ_PS_Desc = case_when(PS_Descanso <= quantile(PS_Descanso, 0.25, na.rm = T) ~ "PS_descanso_baixo",
                                   PS_Descanso > quantile(PS_Descanso, 0.25, na.rm = T) & PS_Descanso <= quantile(PS_Descanso, 0.75, na.rm = T) ~ "PS_descanso_médio",
                                   PS_Descanso > quantile(PS_Descanso, 0.75, na.rm = T) ~ "PS_descanso_alto"))

dados_cor <- dados_cor %>% 
  mutate(Categ_Colest = case_when(Colesterol <= quantile(Colesterol, 0.25, na.rm = T) ~ "menor_colesterol",
                                  Colesterol > quantile(Colesterol, 0.25, na.rm = T) & Colesterol <= quantile(Colesterol, 0.75, na.rm = T) ~ "colesterol_médio",
                                  Colesterol > quantile(Colesterol, 0.75, na.rm = T) ~ "maior_colesterol"))

dados_cor <- dados_cor %>% 
  mutate(Categ_BC_Max = case_when(BC_Max <= quantile(BC_Max, 0.25, na.rm = T) ~ "menor_BC_Max",
                                  BC_Max > quantile(BC_Max, 0.25, na.rm = T) & BC_Max <= quantile(BC_Max, 0.75, na.rm = T) ~ "BC_Max_médio",
                                  BC_Max > quantile(BC_Max, 0.75, na.rm = T) ~ "maior_BC_Max"))
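
The four blocks above apply the same quartile rule to different columns. As a minimal sketch of that rule in isolation, the base R helper below (the name `categorize_quartiles` and the generic labels are assumptions made for illustration, not part of the original analysis) is applied to the ten ages shown in the table above. Because the cut points come from only ten values, they differ from those of the full 918-row database.

```r
# Hypothetical helper: split a numeric vector into three groups
# using its first and third quartiles as cut points.
categorize_quartiles <- function(x, labels = c("low", "mid", "high")) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  ifelse(x <= q[1], labels[1],
         ifelse(x <= q[2], labels[2], labels[3]))
}

# Ages from the first ten rows of the table shown earlier
idade <- c(40, 49, 37, 48, 54, 39, 45, 54, 37, 48)
categ_idade <- categorize_quartiles(idade)
categ_idade
```

With `dplyr` loaded, the same helper could be called inside `mutate()` with the category labels used in this analysis passed through `labels`.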

With the variables transformed, we can now drop the original quantitative variables while keeping their information in categorical form.

Removing the variables that we will no longer use

dados_cor <- dados_cor %>% 
  select(-Idade, -PS_Descanso, -Colesterol, -BC_Max)

Checking the variable types

str(dados_cor)

## tibble [918 × 10] (S3: tbl_df/tbl/data.frame)
##  $ Sexo          : chr [1:918] "Masculino" "Feminino" "Masculino" "Feminino" ...
##  $ Tipo_Dor_Peito: chr [1:918] "Atipica" "Sem_Dor" "Atipica" "Assintomatico" ...
##  $ Açucar_Sangue : chr [1:918] "Normal" "Normal" "Normal" "Normal" ...
##  $ ECG_Descanso  : chr [1:918] "Normal" "Normal" "Anormal_ST" "Normal" ...
##  $ Angina_Exerc  : chr [1:918] "Nao" "Nao" "Nao" "Sim" ...
##  $ Doença_Card   : chr [1:918] "Nao" "Sim" "Nao" "Sim" ...
##  $ Categ_Idade   : chr [1:918] "menores_idades" "idades_médias" "menores_idades" "idades_médias" ...
##  $ Categ_PS_Desc : chr [1:918] "PS_descanso_médio" "PS_descanso_alto" "PS_descanso_médio" "PS_descanso_médio" ...
##  $ Categ_Colest  : chr [1:918] "maior_colesterol" "colesterol_médio" "maior_colesterol" "colesterol_médio" ...
##  $ Categ_BC_Max  : chr [1:918] "maior_BC_Max" "BC_Max_médio" "menor_BC_Max" "menor_BC_Max" ...

We will change our character variables to factors because the dudi.acm function we will use only accepts inputs of this type.

The function for creating the Correspondence Analysis (CA) requires the use of “factors”

dados_cor <- as.data.frame(unclass(dados_cor), stringsAsFactors=TRUE)
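
A quick way to confirm that the conversion behaves as expected is to try it on a small data frame first. The sketch below uses a hypothetical `toy` object built from the first four rows of the table shown earlier and checks that every column ends up as a factor.

```r
# Toy data frame with character columns (values from the first four rows)
toy <- data.frame(Sexo = c("Masculino", "Feminino", "Masculino", "Feminino"),
                  Angina_Exerc = c("Nao", "Nao", "Nao", "Sim"),
                  stringsAsFactors = FALSE)

# Same conversion idiom as above: character columns become factors
toy <- as.data.frame(unclass(toy), stringsAsFactors = TRUE)

all_factors <- all(sapply(toy, is.factor))
all_factors
levels(toy$Sexo)
```

The same `all(sapply(dados_cor, is.factor))` check can be run on the real data before calling `dudi.acm`.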

In order for a correspondence analysis to be conducted, we need the variables in the database to have an association with at least one other variable present in the database. In the context of our analysis, the presence or absence of heart disease is the most relevant aspect, so we will check if the other variables are related to this one.

Checking if the variables are related to each other

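Contingency tables

In each table below, every cell lists four values in this order: observed count, expected count, row percentage, and column percentage.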

sjt.xtab(var.row = dados_cor$Doença_Card,
         var.col = dados_cor$Sexo,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")

Doença_Card	Sexo		Total
Doença_Card	Feminino	Masculino	Total
Nao	143 86 34.9 % 74.1 %	267 324 65.1 % 36.8 %	410 410 100 % 44.7 %
Sim	50 107 9.8 % 25.9 %	458 401 90.2 % 63.2 %	508 508 100 % 55.3 %
Total	193 193 21 % 100 %	725 725 79 % 100 %	918 918 100 % 100 %
χ²=84.145 · df=1 · φ=0.305 · p=0.000

sjt.xtab(var.row = dados_cor$Doença_Card,
         var.col = dados_cor$Tipo_Dor_Peito,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")

Doença_Card	Tipo_Dor_Peito				Total
Doença_Card	Assintomatico	Atipica	Sem_Dor	Tipica	Total
Nao	104 222 25.4 % 21 %	149 77 36.3 % 86.1 %	131 91 32 % 64.5 %	26 21 6.3 % 56.5 %	410 410 100 % 44.7 %
Sim	392 274 77.2 % 79 %	24 96 4.7 % 13.9 %	72 112 14.2 % 35.5 %	20 25 3.9 % 43.5 %	508 508 100 % 55.3 %
Total	496 496 54 % 100 %	173 173 18.8 % 100 %	203 203 22.1 % 100 %	46 46 5 % 100 %	918 918 100 % 100 %
χ²=268.067 · df=3 · Cramer’s V=0.540 · p=0.000

sjt.xtab(var.row = dados_cor$Doença_Card,
         var.col = dados_cor$Açucar_Sangue,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE,
         encoding = "UTF-8")

Doença_Card	Açucar_Sangue		Total
Doença_Card	Diabetes	Normal	Total
Nao	44 96 10.7 % 20.6 %	366 314 89.3 % 52 %	410 410 100 % 44.7 %
Sim	170 118 33.5 % 79.4 %	338 390 66.5 % 48 %	508 508 100 % 55.3 %
Total	214 214 23.3 % 100 %	704 704 76.7 % 100 %	918 918 100 % 100 %
χ²=64.321 · df=1 · φ=0.267 · p=0.000

sjt.xtab(var.row = dados_cor$Doença_Card,
         var.col = dados_cor$ECG_Descanso,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE,
         encoding = "UTF-8")

Doença_Card	ECG_Descanso			Total
Doença_Card	Anormal_ST	Hipertrofia_VE	Normal	Total
Nao	61 79 14.9 % 34.3 %	82 84 20 % 43.6 %	267 247 65.1 % 48.4 %	410 410 100 % 44.7 %
Sim	117 99 23 % 65.7 %	106 104 20.9 % 56.4 %	285 305 56.1 % 51.6 %	508 508 100 % 55.3 %
Total	178 178 19.4 % 100 %	188 188 20.5 % 100 %	552 552 60.1 % 100 %	918 918 100 % 100 %
χ²=10.931 · df=2 · Cramer’s V=0.109 · p=0.004

sjt.xtab(var.row = dados_cor$Doença_Card,
         var.col = dados_cor$Angina_Exerc,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE,
         encoding = "UTF-8")

Doença_Card	Angina_Exerc		Total
Doença_Card	Nao	Sim	Total
Nao	355 244 86.6 % 64.9 %	55 166 13.4 % 14.8 %	410 410 100 % 44.7 %
Sim	192 303 37.8 % 35.1 %	316 205 62.2 % 85.2 %	508 508 100 % 55.3 %
Total	547 547 59.6 % 100 %	371 371 40.4 % 100 %	918 918 100 % 100 %
χ²=222.259 · df=1 · φ=0.494 · p=0.000

sjt.xtab(var.row = dados_cor$Doença_Card,
         var.col = dados_cor$Categ_Idade,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE,
         encoding = "UTF-8")

Doença_Card	Categ_Idade			Total
Doença_Card	idades_médias	maiores_idades	menores_idades	Total
Nao	196 205 47.8 % 42.8 %	60 99 14.6 % 27.1 %	154 107 37.6 % 64.4 %	410 410 100 % 44.7 %
Sim	262 253 51.6 % 57.2 %	161 122 31.7 % 72.9 %	85 132 16.7 % 35.6 %	508 508 100 % 55.3 %
Total	458 458 49.9 % 100 %	221 221 24.1 % 100 %	239 239 26 % 100 %	918 918 100 % 100 %
χ²=65.879 · df=2 · Cramer’s V=0.268 · p=0.000

sjt.xtab(var.row = dados_cor$Doença_Card,
         var.col = dados_cor$Categ_PS_Desc,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE,
         encoding = "UTF-8")

Doença_Card	Categ_PS_Desc			Total
Doença_Card	PS_descanso_alto	PS_descanso_baixo	PS_descanso_médio	Total
Nao	70 98 17.1 % 31.8 %	150 131 36.6 % 51.2 %	190 181 46.3 % 46.9 %	410 410 100 % 44.7 %
Sim	150 122 29.5 % 68.2 %	143 162 28.1 % 48.8 %	215 224 42.3 % 53.1 %	508 508 100 % 55.3 %
Total	220 220 24 % 100 %	293 293 31.9 % 100 %	405 405 44.1 % 100 %	918 918 100 % 100 %
χ²=20.574 · df=2 · Cramer’s V=0.150 · p=0.000

sjt.xtab(var.row = dados_cor$Doença_Card,
         var.col = dados_cor$Categ_Colest,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")

Doença_Card	Categ_Colest			Total
Doença_Card	colesterol_médio	maior_colesterol	menor_colesterol	Total
Nao	257 206 62.7 % 55.7 %	102 101 24.9 % 44.9 %	51 103 12.4 % 22.2 %	410 410 100 % 44.7 %
Sim	204 255 40.2 % 44.3 %	125 126 24.6 % 55.1 %	179 127 35.2 % 77.8 %	508 508 100 % 55.3 %
Total	461 461 50.2 % 100 %	227 227 24.7 % 100 %	230 230 25.1 % 100 %	918 918 100 % 100 %
χ²=69.994 · df=2 · Cramer’s V=0.276 · p=0.000

sjt.xtab(var.row = dados_cor$Doença_Card,
         var.col = dados_cor$Categ_BC_Max,
         show.exp = TRUE,
         show.row.prc = TRUE,
         show.col.prc = TRUE, 
         encoding = "UTF-8")

Doença_Card	Categ_BC_Max			Total
Doença_Card	BC_Max_médio	maior_BC_Max	menor_BC_Max	Total
Nao	188 195 45.9 % 43.1 %	163 99 39.8 % 73.8 %	59 117 14.4 % 22.6 %	410 410 100 % 44.7 %
Sim	248 241 48.8 % 56.9 %	58 122 11.4 % 26.2 %	202 144 39.8 % 77.4 %	508 508 100 % 55.3 %
Total	436 436 47.5 % 100 %	221 221 24.1 % 100 %	261 261 28.4 % 100 %	918 918 100 % 100 %
χ²=127.483 · df=2 · Cramer’s V=0.373 · p=0.000

All our chi-square tests return a p-value below 0.05 (the largest is 0.004, for ECG_Descanso). At the 5% significance level, every variable is therefore associated with Doença_Card, and hence with at least one other variable in the database, so we can proceed with our analysis.
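
To show where the expected counts and the chi-square statistic come from, the sketch below rebuilds the Sexo table from its observed counts and runs base R's `chisq.test`. The object names and the ASCII spelling `Doenca_Card` are assumptions for illustration. For 2x2 tables `chisq.test` applies Yates' continuity correction by default, which appears to be consistent with the statistic reported for this table.

```r
# Observed counts copied from the Doença_Card x Sexo table above
sexo_x_doenca <- matrix(c(143, 50, 267, 458), nrow = 2,
                        dimnames = list(Doenca_Card = c("Nao", "Sim"),
                                        Sexo = c("Feminino", "Masculino")))

teste <- chisq.test(sexo_x_doenca)

round(teste$expected)   # expected counts under independence
teste$statistic         # chi-square statistic
teste$p.value < 0.05    # is the association significant at the 5% level?
```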

Starting the Correspondence Analysis

Applying the correspondence analysis using the `dudi.acm` function.

ACM <- dudi.acm(dados_cor, scannf = FALSE, nf = 3)

Because we have more than two variables, we need a function that performs a Multiple Correspondence Analysis (MCA). MCA can be viewed as a set of Simple Correspondence Analyses (SCAs) carried out between all pairs of variables. Eigenvalues are then extracted and can be used, among other things, to obtain the proportion of inertia explained by each dimension. In the end, we obtain coordinates that we can use to plot a perceptual map showing the associations between the categories of the variables.
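
To make the "all pairs of variables" idea concrete, the sketch below builds the binary (indicator) matrix and the Burt matrix for two variables, using the first five rows of the table shown earlier. It is an illustration in base R, not a description of what `dudi.acm` does internally; the helper name `build_indicator` and the ASCII spelling `Doenca_Card` are assumptions.

```r
toy <- data.frame(
  Sexo = factor(c("Masculino", "Feminino", "Masculino", "Feminino", "Masculino")),
  Doenca_Card = factor(c("Nao", "Sim", "Nao", "Sim", "Nao"))
)

# Hypothetical helper: one 0/1 column per category of each factor
build_indicator <- function(df) {
  blocks <- lapply(names(df), function(v) {
    lv <- levels(df[[v]])
    m <- sapply(lv, function(l) as.integer(df[[v]] == l))
    colnames(m) <- paste(v, lv, sep = ".")
    m
  })
  do.call(cbind, blocks)
}

Z <- build_indicator(toy)   # binary matrix: each row has one 1 per variable
burt <- t(Z) %*% Z          # Burt matrix: all pairwise contingency tables
burt
```

The off-diagonal block of `burt` is the contingency table of Sexo by Doenca_Card, which is the table a simple correspondence analysis of that pair would start from.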

Analyzing the variances of each dimension

perc_variancia <- (ACM$eig / sum(ACM$eig)) * 100
paste0(round(perc_variancia,2),"%")

##  [1] "16.95%" "8.14%"  "7.47%"  "6.84%"  "6.4%"   "6.17%"  "5.72%"  "5.66%" 
##  [9] "5.21%"  "5.06%"  "4.96%"  "4.72%"  "4.36%"  "3.8%"   "3.16%"  "3.03%" 
## [17] "2.34%"
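
A cumulative view of these percentages helps in judging how much of the total inertia a map with two or three dimensions represents. The sketch below uses the rounded values printed above (the object names `perc` and `acumulado` are assumptions); with the real object, `cumsum(perc_variancia)` gives the unrounded equivalent.

```r
# Rounded percentages of inertia copied from the output above
perc <- c(16.95, 8.14, 7.47, 6.84, 6.4, 6.17, 5.72, 5.66, 5.21,
          5.06, 4.96, 4.72, 4.36, 3.8, 3.16, 3.03, 2.34)

acumulado <- cumsum(perc)
acumulado[1:3]                      # inertia covered by the first 1, 2 and 3 dimensions

# Smallest number of dimensions that covers at least half of the inertia
dims_50 <- which(acumulado >= 50)[1]
dims_50
```

With these rounded values, the first two dimensions cover about 25.09% of the inertia, the first three about 32.56%, and six dimensions are needed to pass 50%.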

The number of dimensions depends on the number of variables and categories. In our case, we have a total of 17 dimensions. A simple formula to determine the total number of dimensions in an MCA is to subtract the number of variables from the total number of categories. We know we have 10 variables in total, so 17 dimensions imply 27 categories (27 - 10 = 17); let’s confirm that we indeed have 27 categories.

Number of categories per variable

# All columns are factors at this point, so nlevels() can be applied directly
quant_categorias <- sapply(dados_cor, nlevels)
quant_categorias

##           Sexo Tipo_Dor_Peito  Açucar_Sangue   ECG_Descanso   Angina_Exerc 
##              2              4              2              3              2 
##    Doença_Card    Categ_Idade  Categ_PS_Desc   Categ_Colest   Categ_BC_Max 
##              2              3              3              3              3

We can see that indeed our variables have a total of 27 categories.
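
The arithmetic can be checked directly from the counts printed above. In the sketch below the object names are assumptions and the variable names are spelled in ASCII; with the real objects, `sum(quant_categorias) - length(quant_categorias)` performs the same calculation.

```r
# Number of categories per variable, copied from the output above
n_categorias <- c(Sexo = 2, Tipo_Dor_Peito = 4, Acucar_Sangue = 2,
                  ECG_Descanso = 3, Angina_Exerc = 2, Doenca_Card = 2,
                  Categ_Idade = 3, Categ_PS_Desc = 3, Categ_Colest = 3,
                  Categ_BC_Max = 3)

total_categorias <- sum(n_categorias)
n_dimensoes <- total_categorias - length(n_categorias)  # categories minus variables

total_categorias
n_dimensoes
```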

Creating a dataframe with the standard coordinates obtained from the binary matrix

df_ACM <- data.frame(ACM$c1, Variável = rep(names(quant_categorias),
                                            quant_categorias))

Let’s use the obtained coordinates to plot some graphs that facilitate the visualization of associations between the variables.

Plotting graphs

Plotting the 2D perceptual map

df_ACM %>%
  rownames_to_column() %>%
  rename(Categoria = 1) %>%
  ggplot(aes(x = CS1, y = CS2, label = Categoria, color = Variável)) +
  geom_point() +
  geom_label_repel() +
  geom_vline(aes(xintercept = 0), linetype = "longdash", color = "grey48") +
  geom_hline(aes(yintercept = 0), linetype = "longdash", color = "grey48") +
  labs(x = paste("Dimension 1:", paste0(round(perc_variancia[1], 2), "%")),
       y = paste("Dimension 2:", paste0(round(perc_variancia[2], 2), "%"))) +
  theme_bw() +
  guides(color = "none")

This graph allows us to visualize which categories of the variables are most closely associated with the presence or absence of heart disease: categories plotted near each other tend to occur together in the same individuals.

3D perceptual map (first 3 dimensions)

ACM_3D <- plot_ly()

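# Set the trace type explicitly so plotly does not have to infer it from x, y and z
ACM_3D <- add_trace(p = ACM_3D,
                    x = df_ACM$CS1,
                    y = df_ACM$CS2,
                    z = df_ACM$CS3,
                    type = "scatter3d",
                    mode = "text",
                    text = rownames(df_ACM),
                    textfont = list(color = "blue"),
                    marker = list(color = "red"),
                    showlegend = FALSE)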

ACM_3D

The 3D graph adds a third dimension and therefore captures a bit more of the variability present in the data (about 32.56% of the total inertia with three dimensions, against 25.09% with two), but a large number of categories can make it challenging to interpret. When the third dimension carries meaningful structure, the 3D graph can be as useful as the 2D graph, or even more useful, as an analytical tool.

Plotting the perceptual map of the observations

df_coord_obs <- ACM$li

df_coord_obs %>%
  ggplot(aes(x = Axis1, y = Axis2, color = dados_cor$Doença_Card)) +
  geom_point() +
  geom_vline(aes(xintercept = 0), linetype = "longdash", color = "grey48") +
  geom_hline(aes(yintercept = 0), linetype = "longdash", color = "grey48") +
  labs(x = paste("Dimension 1:", paste0(round(perc_variancia[1], 2), "%")),
       y = paste("Dimension 2:", paste0(round(perc_variancia[2], 2), "%")),
       color = "Cardiac Disease") +
  theme_bw()

In the legend, Sim = yes and Nao = no.

This last graph shows the arrangement of the observations according to their categories, with a focus on the variable indicating the presence or absence of heart disease in the individual.

Conclusion

We can conclude that correspondence analysis allows us to condense categorical (qualitative) variables into quantitative variables, which can be used in the plotting of graphs for a better visualization of the relationship between these variables. This transformation enables us to explore and interpret patterns of association between categories more efficiently.

Furthermore, the quantitative variables obtained through correspondence analysis can be used in other data analysis techniques. For instance, they can be employed in factor analyses to identify underlying structures in the data or in clustering to group similar observations. They can also serve as predictors in supervised statistical or machine learning models, which lets us explore the relationship between the transformed categorical variables and an outcome of interest, such as the presence of heart disease.

Therefore, correspondence analysis provides an effective way to explore and visualize the relationship between categorical variables, and the quantitative variables derived from this analysis can be applied in various data analysis techniques, expanding the potential for insights and discoveries in exploratory and predictive studies.

Correspondence analysis for heart diseases

Rafael

2023-06-23

