In the contemporary world, where cardiovascular diseases have become a global concern, the ability to identify factors contributing to the development of heart problems is essential. Understanding the personal characteristics that play a key role in this condition allows us to develop more effective and personalized preventive strategies.
Correspondence analysis is a powerful technique that enables the identification of patterns and associations between categorical variables. Based on this methodology, we will investigate a wide range of factors such as age, gender, and medical diagnoses. Through this approach, we will gain meaningful insights for a comprehensive understanding of the characteristics and profiles of individuals affected by heart problems.
Upon completing the analysis, we will be able to build a perceptual map, a graph in which the relationship between categories can be read from their proximity. We will see which categories of each variable are more strongly associated with the presence of heart problems and which with their absence.
Database used:
dados_cor
# Packages used in this analysis
pacotes <- c("plotly", "tidyverse", "ggrepel", "knitr", "kableExtra", "sjPlot", "FactoMineR", "amap", "ade4", "readxl")
# Install only the packages that are missing, then load all of them
instalador <- pacotes[!pacotes %in% rownames(installed.packages())]
if (length(instalador) > 0) {
  install.packages(instalador, dependencies = TRUE)
}
sapply(pacotes, require, character.only = TRUE)
## Warning: package 'sjPlot' was built under R version 4.2.3
## Warning: package 'ade4' was built under R version 4.2.3
dados_cor <- read_excel("dados_cor_acm.xlsx")
| Idade | Sexo | Tipo_Dor_Peito | PS_Descanso | Colesterol | Açucar_Sangue | ECG_Descanso | BC_Max | Angina_Exerc | Doença_Card |
|---|---|---|---|---|---|---|---|---|---|
| 40 | Masculino | Atipica | 140 | 289 | Normal | Normal | 172 | Nao | Nao |
| 49 | Feminino | Sem_Dor | 160 | 180 | Normal | Normal | 156 | Nao | Sim |
| 37 | Masculino | Atipica | 130 | 283 | Normal | Anormal_ST | 98 | Nao | Nao |
| 48 | Feminino | Assintomatico | 138 | 214 | Normal | Normal | 108 | Sim | Sim |
| 54 | Masculino | Sem_Dor | 150 | 195 | Normal | Normal | 122 | Nao | Nao |
| 39 | Masculino | Sem_Dor | 120 | 339 | Normal | Normal | 170 | Nao | Nao |
| 45 | Feminino | Atipica | 130 | 237 | Normal | Normal | 170 | Nao | Nao |
| 54 | Masculino | Atipica | 110 | 208 | Normal | Normal | 142 | Nao | Nao |
| 37 | Masculino | Assintomatico | 140 | 207 | Normal | Normal | 130 | Sim | Sim |
| 48 | Feminino | Atipica | 120 | 284 | Normal | Normal | 120 | Nao | Nao |
When analyzing the database, we can see that some of the variables are quantitative, while others are qualitative. Since correspondence analysis is a technique exclusive to qualitative variables, we need to transform some of the variables. We could simply exclude the quantitative variables, but to avoid losing information from these variables, we will proceed with the transformation.
# Bin each quantitative variable into three groups using its 1st and 3rd quartiles
dados_cor <- dados_cor %>%
  mutate(Categ_Idade = case_when(
    Idade <= quantile(Idade, 0.25, na.rm = TRUE) ~ "menores_idades",
    Idade > quantile(Idade, 0.25, na.rm = TRUE) & Idade <= quantile(Idade, 0.75, na.rm = TRUE) ~ "idades_médias",
    Idade > quantile(Idade, 0.75, na.rm = TRUE) ~ "maiores_idades"))
dados_cor <- dados_cor %>%
  mutate(Categ_PS_Desc = case_when(
    PS_Descanso <= quantile(PS_Descanso, 0.25, na.rm = TRUE) ~ "PS_descanso_baixo",
    PS_Descanso > quantile(PS_Descanso, 0.25, na.rm = TRUE) & PS_Descanso <= quantile(PS_Descanso, 0.75, na.rm = TRUE) ~ "PS_descanso_médio",
    PS_Descanso > quantile(PS_Descanso, 0.75, na.rm = TRUE) ~ "PS_descanso_alto"))
dados_cor <- dados_cor %>%
  mutate(Categ_Colest = case_when(
    Colesterol <= quantile(Colesterol, 0.25, na.rm = TRUE) ~ "menor_colesterol",
    Colesterol > quantile(Colesterol, 0.25, na.rm = TRUE) & Colesterol <= quantile(Colesterol, 0.75, na.rm = TRUE) ~ "colesterol_médio",
    Colesterol > quantile(Colesterol, 0.75, na.rm = TRUE) ~ "maior_colesterol"))
dados_cor <- dados_cor %>%
  mutate(Categ_BC_Max = case_when(
    BC_Max <= quantile(BC_Max, 0.25, na.rm = TRUE) ~ "menor_BC_Max",
    BC_Max > quantile(BC_Max, 0.25, na.rm = TRUE) & BC_Max <= quantile(BC_Max, 0.75, na.rm = TRUE) ~ "BC_Max_médio",
    BC_Max > quantile(BC_Max, 0.75, na.rm = TRUE) ~ "maior_BC_Max"))
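The four transformations above apply the same quartile rule to different columns. As a sketch only, that rule could be wrapped in a small helper built on base R's `cut()`. The helper name `categorizar_quartis`, the labels, and the toy vector are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: bins a numeric vector into three groups using Q1 and Q3,
# mirroring the case_when() rule (<= Q1, between Q1 and Q3, > Q3).
categorizar_quartis <- function(x, rotulos) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  # right = TRUE gives the intervals (-Inf, Q1], (Q1, Q3], (Q3, Inf)
  grupos <- cut(x, breaks = c(-Inf, q[1], q[2], Inf), labels = rotulos, right = TRUE)
  as.character(grupos)
}

# Toy values (not taken from dados_cor)
valores <- c(1, 2, 3, 4, 5, 6, 7, 8)
categorias <- categorizar_quartis(valores, c("baixo", "medio", "alto"))
print(categorias)
```

Note that `cut()` stops with an error when Q1 and Q3 coincide, because the breaks are no longer unique, so a variable with very few distinct values would need a different rule.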
With the variables properly transformed, we can now eliminate the quantitative variables without losing the information contained in them.
dados_cor <- dados_cor %>%
select(-Idade, -PS_Descanso, -Colesterol, -BC_Max)
str(dados_cor)
## tibble [918 × 10] (S3: tbl_df/tbl/data.frame)
## $ Sexo : chr [1:918] "Masculino" "Feminino" "Masculino" "Feminino" ...
## $ Tipo_Dor_Peito: chr [1:918] "Atipica" "Sem_Dor" "Atipica" "Assintomatico" ...
## $ Açucar_Sangue : chr [1:918] "Normal" "Normal" "Normal" "Normal" ...
## $ ECG_Descanso : chr [1:918] "Normal" "Normal" "Anormal_ST" "Normal" ...
## $ Angina_Exerc : chr [1:918] "Nao" "Nao" "Nao" "Sim" ...
## $ Doença_Card : chr [1:918] "Nao" "Sim" "Nao" "Sim" ...
## $ Categ_Idade : chr [1:918] "menores_idades" "idades_médias" "menores_idades" "idades_médias" ...
## $ Categ_PS_Desc : chr [1:918] "PS_descanso_médio" "PS_descanso_alto" "PS_descanso_médio" "PS_descanso_médio" ...
## $ Categ_Colest : chr [1:918] "maior_colesterol" "colesterol_médio" "maior_colesterol" "colesterol_médio" ...
## $ Categ_BC_Max : chr [1:918] "maior_BC_Max" "BC_Max_médio" "menor_BC_Max" "menor_BC_Max" ...
We will change our character variables to factors because the dudi.acm function we will use only accepts inputs of this type.
dados_cor <- as.data.frame(unclass(dados_cor), stringsAsFactors=TRUE)
For a correspondence analysis to be meaningful, each variable in the database needs to be associated with at least one other variable in it, a check usually made with a chi-square test of independence. In the context of our analysis, the presence or absence of heart disease is the most relevant aspect, so we will check whether the other variables are related to this one.
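One common way to run this check is a chi-square test between the outcome and each of the other variables. The sketch below uses a toy data frame with ASCII column names (`toy`, `Doenca_Card`, `Sexo`, `Acucar` are all hypothetical stand-ins); with the real data, `toy` would be replaced by `dados_cor` and the outcome by its heart disease column.

```r
# Toy data standing in for dados_cor (hypothetical values, not the real data set)
toy <- data.frame(
  Doenca_Card = factor(rep(c("Sim", "Nao"), each = 50)),
  Sexo = factor(c(rep("Masculino", 40), rep("Feminino", 10),
                  rep("Masculino", 20), rep("Feminino", 30))),
  Acucar = factor(rep(c("Normal", "Alto"), times = 50))
)

desfecho <- "Doenca_Card"
preditoras <- setdiff(names(toy), desfecho)

# Chi-square test of independence between each variable and the outcome
p_valores <- sapply(preditoras, function(v) {
  chisq.test(table(toy[[v]], toy[[desfecho]]))$p.value
})
print(round(p_valores, 4))
```

A small p-value suggests the variable is associated with the outcome and is a reasonable candidate for the analysis; a large one suggests it may contribute little to the map.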
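# Run the MCA with ade4::dudi.acm, keeping 3 dimensions (nf = 3)
# and skipping the interactive scree plot prompt (scannf = FALSE)
ACM <- dudi.acm(dados_cor, scannf = FALSE, nf = 3)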
Because we have more than two categorical variables, we need a function that performs a Multiple Correspondence Analysis (MCA). MCA can be viewed as an extension of Simple Correspondence Analysis (SCA) that takes into account the associations between all pairs of variables at once. Eigenvalues are then extracted and can be used, among other things, to obtain the proportion of inertia explained by each dimension. In the end, we obtain coordinates that we can use to plot a perceptual map showing the associations between the categories of the variables.
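To make the "all pairs of variables" idea concrete, the sketch below builds an indicator (dummy) matrix for a toy data set and multiplies it by its own transpose to obtain the Burt matrix, whose off-diagonal blocks are the pairwise contingency tables. The data frame `toy` and its values are assumptions for illustration; this is not how `dudi.acm` is implemented internally, only a way to see the structure MCA works with.

```r
# Toy data with two categorical variables (hypothetical values)
toy <- data.frame(
  A = factor(c("a1", "a1", "a2", "a2", "a1", "a2")),
  B = factor(c("b1", "b2", "b1", "b2", "b1", "b1"))
)

# Indicator matrix: one 0/1 column per category of each variable
blocos <- lapply(toy, function(f) {
  sapply(levels(f), function(nivel) as.integer(f == nivel))
})
Z <- do.call(cbind, unname(blocos))

# Burt matrix: all pairwise cross-tabulations stacked in one square matrix
burt <- crossprod(Z)
print(burt)

# The A-by-B block matches the ordinary contingency table
print(table(toy$A, toy$B))
```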
perc_variancia <- (ACM$eig / sum(ACM$eig)) * 100
paste0(round(perc_variancia,2),"%")
## [1] "16.95%" "8.14%" "7.47%" "6.84%" "6.4%" "6.17%" "5.72%" "5.66%"
## [9] "5.21%" "5.06%" "4.96%" "4.72%" "4.36%" "3.8%" "3.16%" "3.03%"
## [17] "2.34%"
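To judge how much a two- or three-dimensional map retains, the percentages can be accumulated. The sketch below simply re-enters the rounded values printed above into a hypothetical vector `perc`; in the analysis itself, `cumsum(perc_variancia)` would give the same information without rounding.

```r
# Rounded percentages copied from the output above
perc <- c(16.95, 8.14, 7.47, 6.84, 6.40, 6.17, 5.72, 5.66, 5.21,
          5.06, 4.96, 4.72, 4.36, 3.80, 3.16, 3.03, 2.34)

# Cumulative share of inertia retained by the first k dimensions
perc_acumulado <- cumsum(perc)
print(perc_acumulado[1:3])
```

Under these rounded values, the first two dimensions retain about 25% of the total inertia and the first three about 33%, so any map read from them shows only part of the structure in the data.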
The number of dimensions depends on the number of variables and categories. In our case, we have a total of 17 dimensions. A simple formula for the total number of dimensions in an MCA is to subtract the total number of variables from the total number of categories. We know we have 10 variables in total; let’s confirm that we indeed have 27 categories, since 27 - 10 = 17.
# Number of categories (factor levels) in each variable
quant_categorias <- sapply(dados_cor, nlevels)
quant_categorias
## Sexo Tipo_Dor_Peito Açucar_Sangue ECG_Descanso Angina_Exerc
## 2 4 2 3 2
## Doença_Card Categ_Idade Categ_PS_Desc Categ_Colest Categ_BC_Max
## 2 3 3 3 3
We can see that our variables do have 27 categories in total, which gives 27 - 10 = 17 dimensions and matches the 17 eigenvalues obtained earlier.
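The same arithmetic can be written out in R. The vector below re-enters the category counts shown in the output above (the name `n_categorias` is an assumption); in the analysis itself, `sum(quant_categorias) - length(quant_categorias)` would do the same.

```r
# Category counts per variable, copied from the output above
n_categorias <- c(2, 4, 2, 3, 2, 2, 3, 3, 3, 3)

total_categorias <- sum(n_categorias)     # total number of categories
total_variaveis <- length(n_categorias)   # total number of variables

# Number of MCA dimensions: categories minus variables
n_dimensoes <- total_categorias - total_variaveis
print(c(categorias = total_categorias, variaveis = total_variaveis, dimensoes = n_dimensoes))
```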
df_ACM <- data.frame(ACM$c1, Variável = rep(names(quant_categorias),
quant_categorias))
Let’s use the obtained coordinates to plot some graphs that facilitate the visualization of associations between the variables.
# 2D perceptual map of the categories (first two dimensions)
df_ACM %>%
  rownames_to_column() %>%
  rename(Categoria = 1) %>%
  ggplot(aes(x = CS1, y = CS2, label = Categoria, color = Variável)) +
  geom_point() +
  geom_label_repel() +
  geom_vline(aes(xintercept = 0), linetype = "longdash", color = "grey48") +
  geom_hline(aes(yintercept = 0), linetype = "longdash", color = "grey48") +
  labs(x = paste("Dimension 1:", paste0(round(perc_variancia[1], 2), "%")),
       y = paste("Dimension 2:", paste0(round(perc_variancia[2], 2), "%"))) +
  theme_bw() +
  guides(color = "none")  # hide the legend; color = FALSE is deprecated in recent ggplot2
This graph allows us to see which categories lie close to the categories indicating the presence or absence of heart disease, and are therefore more strongly associated with them. Proximity indicates association, not causation.
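Reading proximity by eye can be complemented with a simple distance table. The sketch below ranks categories by their Euclidean distance to a target category in the first two dimensions. The coordinates and row names in `coords` are invented for illustration; with the real results, the `CS1` and `CS2` columns of `df_ACM` would take their place, and the target would be the row for the presence of heart disease.

```r
# Hypothetical category coordinates (not the values from the analysis)
coords <- data.frame(
  CS1 = c(1.0, -1.0, 0.9, -0.8, 0.1),
  CS2 = c(0.2, -0.1, 0.3, -0.2, 1.5),
  row.names = c("Doenca_Card.Sim", "Doenca_Card.Nao",
                "Angina_Exerc.Sim", "Angina_Exerc.Nao",
                "ECG_Descanso.Anormal_ST")
)

alvo <- "Doenca_Card.Sim"

# Subtract the target's coordinates from every row, then take the Euclidean norm
diferencas <- sweep(as.matrix(coords), 2, unlist(coords[alvo, ]))
dist_ao_alvo <- sort(sqrt(rowSums(diferencas^2)))
print(round(dist_ao_alvo, 3))
```

Because two dimensions retain only part of the total inertia, such distances are an approximation and are best treated as a guide for interpretation rather than a formal measure.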
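# 3D perceptual map: category names placed at their first three coordinates
ACM_3D <- plot_ly()
ACM_3D <- add_trace(p = ACM_3D,
                    x = df_ACM$CS1,
                    y = df_ACM$CS2,
                    z = df_ACM$CS3,
                    type = "scatter3d",  # set explicitly so plotly does not have to guess the trace type
                    mode = "text",
                    text = rownames(df_ACM),
                    textfont = list(color = "blue"),
                    showlegend = FALSE)
ACM_3D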
The 3D graph, having one more dimension than the 2D graph, captures a bit more of the variability present in the data, but a large number of categories can make it challenging to interpret. Still, when more than two dimensions carry relevant information, the 3D graph can be as useful as the 2D graph, or even more useful, as an analytical tool.
# Observation coordinates, with the outcome attached for coloring
df_coord_obs <- ACM$li
df_coord_obs$Doença_Card <- dados_cor$Doença_Card
df_coord_obs %>%
  ggplot(aes(x = Axis1, y = Axis2, color = Doença_Card)) +
  geom_point() +
  geom_vline(aes(xintercept = 0), linetype = "longdash", color = "grey48") +
  geom_hline(aes(yintercept = 0), linetype = "longdash", color = "grey48") +
  labs(x = paste("Dimension 1:", paste0(round(perc_variancia[1], 2), "%")),
       y = paste("Dimension 2:", paste0(round(perc_variancia[2], 2), "%")),
       color = "Cardiac Disease") +
  theme_bw()
Legend: Sim = yes; Nao = no.
This last graph shows the arrangement of the observations according to their categories, with a focus on the variable indicating the presence or absence of heart disease in the individual.
We can conclude that correspondence analysis allows us to condense categorical (qualitative) variables into quantitative variables, which can be used in the plotting of graphs for a better visualization of the relationship between these variables. This transformation enables us to explore and interpret patterns of association between categories more efficiently.
Furthermore, the quantitative variables obtained through correspondence analysis can be used in other data analysis techniques. For instance, they can be employed in factor analyses to identify underlying structures in the data, or in clustering to group similar observations. They can also serve as predictors in supervised statistical or machine learning models, which lets us explore the relationship between the transformed categorical variables and an outcome variable of interest.
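As an illustration of the clustering use case, the sketch below runs k-means on simulated observation coordinates. The data frame `coord_obs`, its two well-separated groups, and the choice of two clusters are assumptions for the example; in practice the input would be the observation coordinates from the MCA (for example `ACM$li`), and the number of clusters would need to be justified.

```r
set.seed(123)

# Simulated coordinates standing in for the MCA observation coordinates
coord_obs <- data.frame(
  Axis1 = c(rnorm(20, mean = -2, sd = 0.3), rnorm(20, mean = 2, sd = 0.3)),
  Axis2 = c(rnorm(20, mean = 0, sd = 0.3), rnorm(20, mean = 0, sd = 0.3))
)

# k-means on the coordinates; nstart reruns the algorithm from several random starts
agrupamento <- kmeans(coord_obs, centers = 2, nstart = 10)

# Cluster sizes and the cluster assigned to each observation
print(table(agrupamento$cluster))
coord_obs$cluster <- factor(agrupamento$cluster)
```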
Therefore, correspondence analysis provides an effective way to explore and visualize the relationship between categorical variables, and the quantitative variables derived from this analysis can be applied in various data analysis techniques, expanding the potential for insights and discoveries in exploratory and predictive studies.