Some data analysis methods become frequent go-tos because they are reliable and useful. Often the question at hand is how variables relate to one another, which matters both for understanding the data and for checking that our models’ assumptions are met.
For numeric variables, we typically turn to Pearson’s or Spearman’s correlation to gauge the strength and direction of these relationships. Categorical data calls for a different tool. Enter Cramer’s V, an effect size (ES) metric that quantifies the overall association between the rows and columns of a contingency table.
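For reference, the textbook (uncorrected) form of Cramer’s V for a contingency table with \(r\) rows, \(k\) columns, \(n\) observations, and Pearson chi-squared statistic \(\chi^2\) is usually written as follows. The symbols are chosen here to match the variable names used in the function defined later in this post.

\[
V = \sqrt{\frac{\chi^2 / n}{\min(k - 1,\; r - 1)}}
\]

V is 0 when the two variables are independent in the sample and reaches 1 under perfect association.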
A common rule of thumb sorts the results into three buckets: an \(ES \leq 0.2\) signifies a weak association, \(0.2 \lt ES \leq 0.6\) suggests a moderate one, and an \(ES \gt 0.6\) points to a strong one. A heatmap is a convenient way to display Cramer’s V results for many pairs of variables at once. This great idea came from Gustavo Duarte, from Comunidade DS.
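The cut-offs above can be turned into labels with base R’s `cut()`. This is only a small sketch: the helper name `es_label` is hypothetical and does not come from the original analysis.

```r
# Hypothetical helper (not part of the original post): label Cramer's V values
# using the rule-of-thumb cut-offs quoted above.
es_label <- function(v) {
  cut(
    v,
    breaks = c(-Inf, 0.2, 0.6, Inf),  # intervals are right-closed by default
    labels = c("weak", "moderate", "strong")
  )
}

# Example values, including one that sits exactly on a cut-off
es_label(c(0.02, 0.20, 0.38, 0.82))
```

Because `cut()` closes intervals on the right by default, a value of exactly 0.2 falls in the weak bucket and exactly 0.6 in the moderate one, matching the inequalities above.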
First, let’s load the libraries required for the task at hand.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(vcd)
## Loading required package: grid
library(reshape2)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
Then, let’s build a function that will perform the Cramer’s V calculations.
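As a reading aid, here is the bias-corrected estimator that functions of this shape are usually modeled on, commonly attributed to Bergsma (2013). That attribution is an assumption about the intent, not something stated in this post. Here \(r\) and \(k\) are the numbers of rows and columns, \(n\) is the total count, and \(\chi^2\) is the Pearson statistic.

\[
\tilde{\varphi}^2 = \max\!\left(0,\; \frac{\chi^2}{n} - \frac{(k-1)(r-1)}{n-1}\right), \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}, \qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1}
\]

\[
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
\]

With a large \(n\), the correction terms shrink toward zero and \(\tilde{V}\) approaches the uncorrected V.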
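cramer_v <- function(x, y) {
  # Cross-tabulate the two categorical variables
  cm <- table(x, y)
  n <- sum(cm)   # total number of observations
  r <- nrow(cm)  # number of row categories
  k <- ncol(cm)  # number of column categories
  # Pearson chi-squared statistic; correct = FALSE turns off the Yates
  # continuity correction that chisq.test() applies to 2 x 2 tables by default
  chi2 <- unname(chisq.test(cm, correct = FALSE)$statistic)
  # Bias correction: the term is subtracted from phi^2 = chi2 / n, not from chi2
  phi2corr <- max(0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
  kcorr <- k - (k - 1)^2 / (n - 1)
  rcorr <- r - (r - 1)^2 / (n - 1)
  # The square root covers the whole ratio
  return(sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))
}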
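As a sanity check, the uncorrected Cramer’s V can be compared against `assocstats()` from the vcd package loaded earlier. The sketch below uses simulated data (the variables `x` and `y` are made up for illustration); with large samples, a bias-corrected value should land very close to this one.

```r
library(vcd)

set.seed(123)
# Simulated, purely illustrative data: y copies x about 70% of the time
x <- sample(c("a", "b", "c"), 1000, replace = TRUE)
y <- ifelse(runif(1000) < 0.7, x, sample(c("a", "b", "c"), 1000, replace = TRUE))
tab <- table(x, y)

# Uncorrected Cramer's V computed by hand from the Pearson chi-squared statistic
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
v_manual <- sqrt((chi2 / sum(tab)) / min(dim(tab) - 1))

# The same quantity as reported by vcd
v_vcd <- assocstats(tab)$cramer

c(manual = v_manual, vcd = v_vcd)
```

If the two numbers disagree noticeably, the usual suspects are a continuity correction left switched on or a misplaced parenthesis around the square root.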
Next, I load the data, which comes from another project of mine.
df_cleaned <- readRDS("df_cleaned.rds")
sapply(df_cleaned, class)
## id gender age
## "numeric" "factor" "numeric"
## driving_license region_code previously_insured
## "factor" "numeric" "factor"
## vehicle_age vehicle_damage health_annual_paid
## "factor" "factor" "numeric"
## policy_sales_channel days_associated response
## "numeric" "numeric" "factor"
Now, I calculate Cramer’s V for each pair of categorical variables (I have left out the response variable).
a1 <- cramer_v(df_cleaned$gender, df_cleaned$gender)
a2 <- cramer_v(df_cleaned$gender, df_cleaned$driving_license)
a3 <- cramer_v(df_cleaned$gender, df_cleaned$previously_insured)
a4 <- cramer_v(df_cleaned$gender, df_cleaned$vehicle_age)
a5 <- cramer_v(df_cleaned$gender, df_cleaned$vehicle_damage)
a6 <- cramer_v(df_cleaned$driving_license, df_cleaned$gender)
a7 <- cramer_v(df_cleaned$driving_license, df_cleaned$driving_license)
a8 <- cramer_v(df_cleaned$driving_license, df_cleaned$previously_insured)
a9 <- cramer_v(df_cleaned$driving_license, df_cleaned$vehicle_age)
a10 <- cramer_v(df_cleaned$driving_license, df_cleaned$vehicle_damage)
a11 <- cramer_v(df_cleaned$previously_insured, df_cleaned$gender)
a12 <- cramer_v(df_cleaned$previously_insured, df_cleaned$driving_license)
a13 <- cramer_v(df_cleaned$previously_insured, df_cleaned$previously_insured)
a14 <- cramer_v(df_cleaned$previously_insured, df_cleaned$vehicle_age)
a15 <- cramer_v(df_cleaned$previously_insured, df_cleaned$vehicle_damage)
a16 <- cramer_v(df_cleaned$vehicle_age, df_cleaned$gender)
a17 <- cramer_v(df_cleaned$vehicle_age, df_cleaned$driving_license)
a18 <- cramer_v(df_cleaned$vehicle_age, df_cleaned$previously_insured)
a19 <- cramer_v(df_cleaned$vehicle_age, df_cleaned$vehicle_age)
a20 <- cramer_v(df_cleaned$vehicle_age, df_cleaned$vehicle_damage)
a21 <- cramer_v(df_cleaned$vehicle_damage, df_cleaned$gender)
a22 <- cramer_v(df_cleaned$vehicle_damage, df_cleaned$driving_license)
a23 <- cramer_v(df_cleaned$vehicle_damage, df_cleaned$previously_insured)
a24 <- cramer_v(df_cleaned$vehicle_damage, df_cleaned$vehicle_age)
a25 <- cramer_v(df_cleaned$vehicle_damage, df_cleaned$vehicle_damage)
Then, I create a data frame with the results.
d <- data.frame(
  gender = c(a1, a2, a3, a4, a5),
  driving_license = c(a6, a7, a8, a9, a10),
  previously_insured = c(a11, a12, a13, a14, a15),
  vehicle_age = c(a16, a17, a18, a19, a20),
  vehicle_damage = c(a21, a22, a23, a24, a25)
)
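The 25 explicit calls work, but the same matrix can be built by looping over the factor columns. The sketch below is one possible way to do that, not the code used for the results in this post. It runs on a simulated stand-in data frame (`toy` is hypothetical, as is the helper `cramers_v_plain`, which computes the uncorrected statistic so that the block runs on its own).

```r
# Compact, uncorrected Cramer's V, defined here only to keep the sketch self-contained
cramers_v_plain <- function(x, y) {
  tab <- table(x, y)
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  sqrt((chi2 / sum(tab)) / min(dim(tab) - 1))
}

set.seed(1)
# Hypothetical stand-in for df_cleaned, with random and therefore unrelated columns
toy <- data.frame(
  gender         = factor(sample(c("F", "M"), 300, replace = TRUE)),
  vehicle_age    = factor(sample(c("< 1 year", "1-2 years", "> 2 years"), 300, replace = TRUE)),
  vehicle_damage = factor(sample(c("No", "Yes"), 300, replace = TRUE)),
  age            = round(runif(300, 20, 80)),
  response       = factor(sample(c("0", "1"), 300, replace = TRUE))
)

# Keep only the factor columns and leave out the response variable
cat_vars <- setdiff(names(toy)[sapply(toy, is.factor)], "response")

# Pairwise Cramer's V: one column of the matrix per variable
v_mat <- sapply(cat_vars, function(a) {
  sapply(cat_vars, function(b) cramers_v_plain(toy[[a]], toy[[b]]))
})

# Same layout as the data frame d: variables as both row and column names
d_alt <- as.data.frame(v_mat)
round(d_alt, 2)
```

Selecting columns with `is.factor` means that a newly added categorical variable is picked up automatically, without extending a list of hand-written calls.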
Here, I reshape the data into the long format used for the heatmap.
# Set row names and column names
rownames(d) <- colnames(d)
# Get the variable names
variables <- colnames(d)
# Create an empty data frame
cor_df <- data.frame()
# Loop to fill the data frame, one row per pair of variables
for (i in seq_along(variables)) {
  for (j in seq_along(variables)) {
    row <- data.frame(
      Variable1 = variables[i],
      Variable2 = variables[j],
      Cramers_V = round(d[i, j], 2)
    )
    cor_df <- rbind(cor_df, row)
  }
}
# Filter out the diagonal (association of a variable with itself)
cor_df <- cor_df[cor_df$Variable1 != cor_df$Variable2, ]
# Reset row names
rownames(cor_df) <- NULL
# Print the resulting data frame
print(cor_df)
## Variable1 Variable2 Cramers_V
## 1 gender driving_license 0.02
## 2 gender previously_insured 0.08
## 3 gender vehicle_age 0.17
## 4 gender vehicle_damage 0.09
## 5 driving_license gender 0.02
## 6 driving_license previously_insured 0.01
## 7 driving_license vehicle_age 0.04
## 8 driving_license vehicle_damage 0.02
## 9 previously_insured gender 0.08
## 10 previously_insured driving_license 0.01
## 11 previously_insured vehicle_age 0.38
## 12 previously_insured vehicle_damage 0.82
## 13 vehicle_age gender 0.17
## 14 vehicle_age driving_license 0.04
## 15 vehicle_age previously_insured 0.38
## 16 vehicle_age vehicle_damage 0.40
## 17 vehicle_damage gender 0.09
## 18 vehicle_damage driving_license 0.02
## 19 vehicle_damage previously_insured 0.82
## 20 vehicle_damage vehicle_age 0.40
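The plotting code is not shown in this text, so the following is only a minimal sketch of how such a heatmap could be drawn with ggplot2, not the code behind the original figure. The values are copied from the table printed above; the color scale, labels, and theme are arbitrary choices.

```r
library(ggplot2)

# The ten unique pairs, with values copied from the printed table above
pairs <- data.frame(
  Variable1 = c("gender", "gender", "gender", "gender",
                "driving_license", "driving_license", "driving_license",
                "previously_insured", "previously_insured", "vehicle_age"),
  Variable2 = c("driving_license", "previously_insured", "vehicle_age", "vehicle_damage",
                "previously_insured", "vehicle_age", "vehicle_damage",
                "vehicle_age", "vehicle_damage", "vehicle_damage"),
  Cramers_V = c(0.02, 0.08, 0.17, 0.09, 0.01, 0.04, 0.02, 0.38, 0.82, 0.40)
)

# Mirror the pairs so both triangles of the heatmap are filled, as in cor_df
mirrored <- data.frame(
  Variable1 = pairs$Variable2,
  Variable2 = pairs$Variable1,
  Cramers_V = pairs$Cramers_V
)
cor_df <- rbind(pairs, mirrored)

# One tile per pair of variables, shaded by Cramer's V on a fixed 0 to 1 scale
p <- ggplot(cor_df, aes(x = Variable1, y = Variable2, fill = Cramers_V)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2f", Cramers_V)), size = 3) +
  scale_fill_gradient(low = "white", high = "steelblue", limits = c(0, 1)) +
  labs(x = NULL, y = NULL, fill = "Cramer's V") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```

Fixing the fill limits at 0 and 1 keeps the colors comparable across datasets, since Cramer’s V cannot fall outside that range.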
Here it is!
Cramer’s V is not a correlation coefficient. It is an omnibus effect size that measures the overall degree of association between the rows and columns of a contingency table, without indicating a direction. In the table above, previously_insured and vehicle_damage stand out with a strong association (0.82). Because the values lie between 0 and 1, a heatmap is a natural way to visualize them.
Thanks for reading!
American Psychological Association. APA Dictionary of Psychology. Available at: https://dictionary.apa.org/cramers-v
IBM Cognos Analytics. Cramér’s V. Available at: https://www.ibm.com/docs/en/cognos-analytics/12.0.0?topic=terms-cramrs-v