Introduction

Some data analysis methods become frequent go-tos because they are reliable and useful. Often, the question at hand is how variables relate to one another, a fundamental step not just for understanding the data but also for checking that our models’ assumptions are met.

When dealing with numeric variables, we typically turn to Pearson’s and Spearman’s correlation coefficients to gauge the strength and direction of these relationships. These coefficients do not apply to categorical data, however. Enter Cramer’s V, an effect size (ES) metric that quantifies the overall association between the rows and columns of a contingency table.

The results can be read using the following buckets: an \(ES \leq 0.2\) signifies a weak association, \(0.2 \lt ES \leq 0.6\) suggests a moderate one, and an \(ES \gt 0.6\) points to a strong one. To make these results easier to read, we can plot them as a heatmap, which shows at a glance which pairs of variables are associated. This great idea came from Gustavo Duarte, from Comunidade DS.
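As a small illustration of these buckets, the sketch below labels effect sizes with base R's `cut()`. The helper name `classify_es` is made up for this example and is not part of the original analysis.

```r
# Hypothetical helper: label an effect size using the cut-offs quoted above.
classify_es <- function(es) {
  cut(es,
      breaks = c(-Inf, 0.2, 0.6, Inf),
      labels = c("weak", "moderate", "strong"),
      right = TRUE)  # intervals are closed on the right, e.g. (0.2, 0.6]
}

# Example values chosen to fall in each bucket
as.character(classify_es(c(0.02, 0.38, 0.82)))
```

Because the intervals are closed on the right, a value of exactly 0.2 is labeled weak and a value of exactly 0.6 is labeled moderate, matching the inequalities above.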

1. Load libraries

First, let’s load the libraries required for the task at hand.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(vcd)
## Loading required package: grid
library(reshape2)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

1.1. Auxiliary functions

Then, let’s build a function that will perform the Cramer’s V calculations.

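cramer_v <- function(x, y) {
  # Contingency table of the two categorical variables
  cm <- table(x, y)
  n <- sum(cm)   # number of observations
  r <- nrow(cm)  # number of row categories
  k <- ncol(cm)  # number of column categories

  # Phi-squared: the chi-squared statistic scaled by the sample size
  phi2 <- unname(chisq.test(cm)$statistic) / n
  # Bias-corrected phi-squared, floored at zero
  phi2corr <- max(0, phi2 - (k - 1) * (r - 1) / (n - 1))

  # Bias-corrected numbers of columns and rows
  kcorr <- k - (k - 1)^2 / (n - 1)
  rcorr <- r - (r - 1)^2 / (n - 1)

  # The square root covers the whole ratio, not just the numerator
  return(sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))
}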
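For reference, the function above appears to be modeled on the bias-corrected version of Cramér's V (the correction usually attributed to Bergsma, 2013). That attribution is my reading rather than something the post states. With \(n\) observations and a contingency table of \(r\) rows and \(k\) columns, the estimator can be written as:

```latex
\[
\varphi^2 = \frac{\chi^2}{n}, \qquad
\tilde{\varphi}^2 = \max\!\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right)
\]
\[
\tilde{k} = k - \frac{(k-1)^2}{n-1}, \qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
\]
```

For large \(n\) the correction terms become negligible and \(\tilde{V}\) approaches the uncorrected \(V = \sqrt{\varphi^2 / \min(k-1,\, r-1)}\).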

2. Load data

Next, I load the data, which comes from another project of mine.

df_cleaned <- readRDS("df_cleaned.rds")
sapply(df_cleaned, class)
##                   id               gender                  age 
##            "numeric"             "factor"            "numeric" 
##      driving_license          region_code   previously_insured 
##             "factor"            "numeric"             "factor" 
##          vehicle_age       vehicle_damage   health_annual_paid 
##             "factor"             "factor"            "numeric" 
## policy_sales_channel      days_associated             response 
##            "numeric"            "numeric"             "factor"
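The class listing above is what identifies the categorical columns. A minimal sketch of picking them out programmatically follows. It uses a made-up toy data frame, since `df_cleaned` is not available here, and the column names only mimic the real ones.

```r
# Toy stand-in for df_cleaned (assumption: the real data is not available here).
toy <- data.frame(
  id       = 1:4,
  gender   = factor(c("F", "M", "F", "M")),
  age      = c(23, 41, 35, 52),
  response = factor(c("0", "1", "0", "0"))
)

# Keep the names of the factor columns, then drop the response variable.
cat_vars <- names(toy)[sapply(toy, is.factor)]
cat_vars <- setdiff(cat_vars, "response")
cat_vars
```

On the real data, the same two lines would return the five factor columns other than `response`.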

3. Perform calculations

3.1. Calculate data’s Cramer’s V

Now, I calculate Cramer’s V for every pair of categorical variables (I have left out the response variable).

a1   <- cramer_v( df_cleaned$gender,df_cleaned$gender )
a2   <- cramer_v( df_cleaned$gender,df_cleaned$driving_license )
a3   <- cramer_v( df_cleaned$gender,df_cleaned$previously_insured )
a4   <- cramer_v( df_cleaned$gender,df_cleaned$vehicle_age )
a5   <- cramer_v( df_cleaned$gender,df_cleaned$vehicle_damage )

a6   <- cramer_v( df_cleaned$driving_license,df_cleaned$gender )
a7   <- cramer_v( df_cleaned$driving_license,df_cleaned$driving_license  )
a8   <- cramer_v( df_cleaned$driving_license,df_cleaned$previously_insured )
a9   <- cramer_v( df_cleaned$driving_license,df_cleaned$vehicle_age )
a10  <- cramer_v( df_cleaned$driving_license,df_cleaned$vehicle_damage )

a11  <- cramer_v( df_cleaned$previously_insured,df_cleaned$gender )
a12  <- cramer_v( df_cleaned$previously_insured,df_cleaned$driving_license  )
a13  <- cramer_v( df_cleaned$previously_insured,df_cleaned$previously_insured )
a14  <- cramer_v( df_cleaned$previously_insured,df_cleaned$vehicle_age )
a15  <- cramer_v( df_cleaned$previously_insured,df_cleaned$vehicle_damage )

a16  <- cramer_v( df_cleaned$vehicle_age,df_cleaned$gender )
a17  <- cramer_v( df_cleaned$vehicle_age,df_cleaned$driving_license  )
a18  <- cramer_v( df_cleaned$vehicle_age,df_cleaned$previously_insured )
a19  <- cramer_v( df_cleaned$vehicle_age,df_cleaned$vehicle_age )
a20  <- cramer_v( df_cleaned$vehicle_age,df_cleaned$vehicle_damage )

a21  <- cramer_v( df_cleaned$vehicle_damage,df_cleaned$gender )
a22  <- cramer_v( df_cleaned$vehicle_damage,df_cleaned$driving_license  )
a23  <- cramer_v( df_cleaned$vehicle_damage,df_cleaned$previously_insured )
a24  <- cramer_v( df_cleaned$vehicle_damage,df_cleaned$vehicle_age )
a25  <- cramer_v( df_cleaned$vehicle_damage,df_cleaned$vehicle_damage )
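The 25 calls above can also be generated with nested `sapply()` calls. The sketch below is one way to do it, not the author's code. It runs on made-up toy data and redefines a compact bias-corrected `cramer_v` so that the block is self-contained. It also passes `correct = FALSE` to `chisq.test()`, a choice made for this sketch so that a perfectly associated 2x2 pair yields exactly 1.

```r
# Compact bias-corrected Cramer's V, redefined here so the block runs on its own.
cramer_v <- function(x, y) {
  cm <- table(x, y)
  n <- sum(cm); r <- nrow(cm); k <- ncol(cm)
  # correct = FALSE disables Yates' continuity correction (a choice for this sketch)
  chi2 <- suppressWarnings(chisq.test(cm, correct = FALSE))$statistic
  phi2 <- unname(chi2) / n
  phi2corr <- max(0, phi2 - (k - 1) * (r - 1) / (n - 1))
  kcorr <- k - (k - 1)^2 / (n - 1)
  rcorr <- r - (r - 1)^2 / (n - 1)
  sqrt(phi2corr / min(kcorr - 1, rcorr - 1))
}

# Toy data (made up for illustration; not the post's df_cleaned).
set.seed(42)
toy <- data.frame(
  a = factor(sample(c("x", "y"), 200, replace = TRUE)),
  b = factor(sample(c("p", "q", "r"), 200, replace = TRUE))
)
toy$c <- toy$a  # a copy of `a`, so the two are perfectly associated

vars <- names(toy)

# Evaluate every ordered pair of columns and arrange the results as a square matrix.
v_mat <- sapply(vars, function(v2) {
  sapply(vars, function(v1) cramer_v(toy[[v1]], toy[[v2]]))
})

round(v_mat, 2)
```

The result is a square matrix with the variable names on both dimensions, so `as.data.frame(v_mat)` gives the same layout as a hand-built data frame of results.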

3.2. Create a data frame with the results

Then, I create a data frame with the results.

d <- data.frame(
  gender = c(a1, a2, a3, a4, a5),
  driving_license = c(a6, a7, a8, a9, a10),
  previously_insured = c(a11, a12, a13, a14, a15),
  vehicle_age = c(a16, a17, a18, a19, a20),
  vehicle_damage = c(a21, a22, a23, a24, a25))

3.3. Prepare data for plotting

Here, I reshape the data into the long format used by the heatmap: one row per pair of variables.

# Set row names and column names
rownames(d) <- colnames(d)

# Get the variable names
variables <- colnames(d)

# Create an empty data frame
cor_df <- data.frame()

# Loop over every pair of variables and append one row per pair
for (i in seq_along(variables)) {
  for (j in seq_along(variables)) {
    row <- data.frame(
      Variable1 = variables[i],
      Variable2 = variables[j],
      Cramers_V = round(d[i, j], 2)
    )
    cor_df <- rbind(cor_df, row)
  }
}

# Filter out the diagonal (each variable paired with itself)
cor_df <- cor_df[cor_df$Variable1 != cor_df$Variable2, ]

# Reset row names
rownames(cor_df) <- NULL

# Print the resulting data frame
print(cor_df)
##             Variable1          Variable2 Cramers_V
## 1              gender    driving_license      0.02
## 2              gender previously_insured      0.08
## 3              gender        vehicle_age      0.17
## 4              gender     vehicle_damage      0.09
## 5     driving_license             gender      0.02
## 6     driving_license previously_insured      0.01
## 7     driving_license        vehicle_age      0.04
## 8     driving_license     vehicle_damage      0.02
## 9  previously_insured             gender      0.08
## 10 previously_insured    driving_license      0.01
## 11 previously_insured        vehicle_age      0.38
## 12 previously_insured     vehicle_damage      0.82
## 13        vehicle_age             gender      0.17
## 14        vehicle_age    driving_license      0.04
## 15        vehicle_age previously_insured      0.38
## 16        vehicle_age     vehicle_damage      0.40
## 17     vehicle_damage             gender      0.09
## 18     vehicle_damage    driving_license      0.02
## 19     vehicle_damage previously_insured      0.82
## 20     vehicle_damage        vehicle_age      0.40
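The same wide-to-long reshaping can be done without explicit loops. The sketch below, which is an alternative and not the author's code, relies on base R's `as.table()` and `as.data.frame()` and uses a small made-up matrix in place of `d`.

```r
# Small made-up matrix standing in for `d` (values are illustrative only).
vars <- c("gender", "vehicle_age", "vehicle_damage")
m <- matrix(c(1.00, 0.17, 0.09,
              0.17, 1.00, 0.40,
              0.09, 0.40, 1.00),
            nrow = 3, dimnames = list(vars, vars))

# as.table() + as.data.frame() unpivots the matrix into one row per cell.
long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(long) <- c("Variable1", "Variable2", "Cramers_V")

# Drop the diagonal, as in the loop-based version.
long <- long[long$Variable1 != long$Variable2, ]
rownames(long) <- NULL
long
```

For a data frame such as `d`, converting it first with `as.matrix(d)` should give the same kind of input. The row order differs from the loop-based version, which does not matter for plotting.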

3.4. Plot the heatmap

Here it is!
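The plotting code and the figure itself are not included in this version of the post. A minimal sketch of how such a heatmap could be drawn with ggplot2 is shown below. The input values, the object name `plot_df`, and the color choices are assumptions made for illustration, not the author's actual plot.

```r
library(ggplot2)

# Made-up long-format input with the same columns as cor_df (illustrative values).
vars <- c("gender", "vehicle_age", "vehicle_damage")
plot_df <- expand.grid(Variable1 = vars, Variable2 = vars,
                       stringsAsFactors = FALSE)
plot_df <- plot_df[plot_df$Variable1 != plot_df$Variable2, ]
plot_df$Cramers_V <- c(0.17, 0.09, 0.17, 0.40, 0.09, 0.40)

# One tile per pair of variables, colored and labeled by Cramer's V.
p <- ggplot(plot_df, aes(x = Variable1, y = Variable2, fill = Cramers_V)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2f", Cramers_V))) +
  scale_fill_gradient(low = "white", high = "steelblue", limits = c(0, 1)) +
  labs(x = NULL, y = NULL, fill = "Cramer's V") +
  theme_minimal()

print(p)
```

Fixing the fill limits at 0 and 1 keeps the color scale comparable across datasets, since Cramer's V always falls in that range.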

Conclusion

Cramer’s V is not a correlation coefficient. Instead, it is an omnibus effect size that measures the overall degree of association between the rows and columns of a contingency table. Since its values range from 0 to 1, a heatmap is a natural way to visualize them.

Thanks for reading!

References

American Psychological Association. APA Dictionary of Psychology. Available at: https://dictionary.apa.org/cramers-v

IBM Cognos Analytics. Cramér’s V. Available at: https://www.ibm.com/docs/en/cognos-analytics/12.0.0?topic=terms-cramrs-v