Analyzing Multilingual Distribution Across Countries Using Correspondence Analysis

Author

Saurabh C Srivastava

Published

March 1, 2025

Objective of the Analysis

The objective of this analysis is to explore how languages are distributed across countries using Correspondence Analysis (CA). The dataset records language usage in Canada, the USA, England, Italy, and Switzerland across five major languages (English, French, Spanish, German, and Italian), with each country's counts summing to 1,000. By applying CA, we aim to visualize relationships between countries and their dominant languages and to identify language preferences and cultural influences in multilingual societies.

Brief Description of the Code

1. Load Required Libraries

library(dplyr)       # Data manipulation

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tibble)      # Converting columns to row names
library(FactoMineR)  # Correspondence Analysis (CA)
library(factoextra)  # CA visualization (optional)
Loading required package: ggplot2
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

2. Create and Load Language Data

lang_text = ("
Country English French  Spanish German  Italian Total
Canada  688 280 10  11  11  1000
USA 730 31  190 8   41  1000
England 798 74  38  31  59  1000
Italy   17  13  11  15  944 1000
Switzerland 15  222 20  648 95  1000")

3. Data Preprocessing

  • Removes the Total column, which is redundant for Correspondence Analysis because it is simply the sum of each row.

  • Converts the Country column into row names, so that the data frame contains only the numeric counts that CA() expects.

lang = read.table(textConnection(lang_text), sep = "", header = TRUE, stringsAsFactors = TRUE)
lang
      Country English French Spanish German Italian Total
1      Canada     688    280      10     11      11  1000
2         USA     730     31     190      8      41  1000
3     England     798     74      38     31      59  1000
4       Italy      17     13      11     15     944  1000
5 Switzerland      15    222      20    648      95  1000
lang = lang %>% dplyr::select(-Total)
lang
      Country English French Spanish German Italian
1      Canada     688    280      10     11      11
2         USA     730     31     190      8      41
3     England     798     74      38     31      59
4       Italy      17     13      11     15     944
5 Switzerland      15    222      20    648      95
str(lang)
'data.frame':   5 obs. of  6 variables:
 $ Country: Factor w/ 5 levels "Canada","England",..: 1 5 2 3 4
 $ English: int  688 730 798 17 15
 $ French : int  280 31 74 13 222
 $ Spanish: int  10 190 38 11 20
 $ German : int  11 8 31 15 648
 $ Italian: int  11 41 59 944 95
lang$Country = as.factor(lang$Country)
lang = lang %>% column_to_rownames("Country")
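
Before running CA it can help to look at the row profiles, that is, each country's counts divided by its row total. The sketch below is an illustration in base R only and is not part of the original workflow; the matrix name `counts` and the helper vector `dominant` are assumptions introduced here, and the numbers are copied from the table above.

```r
# Rebuild the contingency table as a plain matrix (values copied from the data above).
counts <- matrix(
  c(688, 280,  10,  11,  11,
    730,  31, 190,   8,  41,
    798,  74,  38,  31,  59,
     17,  13,  11,  15, 944,
     15, 222,  20, 648,  95),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Canada", "USA", "England", "Italy", "Switzerland"),
    c("English", "French", "Spanish", "German", "Italian")
  )
)

# Row profiles: each row divided by its own total, so every row sums to 1.
row_profiles <- prop.table(counts, margin = 1)
round(row_profiles, 3)

# Language with the largest share in each country.
dominant <- colnames(counts)[apply(row_profiles, 1, which.max)]
names(dominant) <- rownames(counts)
dominant
```

Because every country has the same total (1,000), all rows carry equal mass in the analysis, so no single country dominates the solution through sample size alone.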

4. Perform Correspondence Analysis (CA)

  • Performs Correspondence Analysis (CA) to explore relationships between countries and language usage.

  • The graph = TRUE argument automatically generates a basic visualization.

res.ca <- CA(lang, graph = TRUE)
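
To make clear what CA() computes, the following is a minimal base R sketch of the standard CA decomposition: a singular value decomposition of the standardized residuals of the contingency table. It is an illustration under stated assumptions, not FactoMineR's internal code; the names `counts`, `S`, `decomp`, and `eig` are introduced here for the example. The eigenvalues should correspond to those FactoMineR stores in res.ca$eig, though that comparison has not been run as part of this document.

```r
# Contingency table (values copied from the data above).
counts <- matrix(
  c(688, 280,  10,  11,  11,
    730,  31, 190,   8,  41,
    798,  74,  38,  31,  59,
     17,  13,  11,  15, 944,
     15, 222,  20, 648,  95),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Canada", "USA", "England", "Italy", "Switzerland"),
    c("English", "French", "Spanish", "German", "Italian")
  )
)

P        <- counts / sum(counts)   # correspondence matrix
row_mass <- rowSums(P)             # row masses
col_mass <- colSums(P)             # column masses

# Standardized residuals: (observed - expected) / sqrt(expected), in proportions.
S <- diag(1 / sqrt(row_mass)) %*%
  (P - row_mass %o% col_mass) %*%
  diag(1 / sqrt(col_mass))

decomp <- svd(S)

# A 5 x 5 table has at most min(5, 5) - 1 = 4 non-trivial dimensions.
n_dim <- min(nrow(counts), ncol(counts)) - 1
eig   <- decomp$d[seq_len(n_dim)]^2   # principal inertias (eigenvalues)
pct   <- 100 * eig / sum(eig)         # share of total inertia per dimension

round(cbind(eigenvalue = eig, percent = pct, cumulative = cumsum(pct)), 3)
```

The sum of the eigenvalues is the total inertia, which equals the Pearson chi-square statistic of the table divided by the grand total. The share of inertia on the first two dimensions indicates how much of the table's structure a two-dimensional plot can show.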

5. Visualizing Correspondence Analysis Results with a Biplot

fviz_ca_biplot(res.ca, 
               repel = TRUE, 
               col.row = "#2E9FDF",  
               col.col = "#FC4E07", 
               axes = c(1, 2)
) + ggtitle("Correspondence Analysis of Languages and Countries") +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(caption = "Saurabh's Work")
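
The default biplot is a symmetric map, which places both countries and languages at their principal coordinates. The base R sketch below shows how such coordinates can be derived from the SVD; it is an illustration, not the code factoextra runs, and the names `row_coord` and `col_coord` are assumptions made for this example. Axis signs are arbitrary, so a plot built from these values may be mirrored relative to the figure above.

```r
# Contingency table (values copied from the data above).
counts <- matrix(
  c(688, 280,  10,  11,  11,
    730,  31, 190,   8,  41,
    798,  74,  38,  31,  59,
     17,  13,  11,  15, 944,
     15, 222,  20, 648,  95),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Canada", "USA", "England", "Italy", "Switzerland"),
    c("English", "French", "Spanish", "German", "Italian")
  )
)

P        <- counts / sum(counts)
row_mass <- rowSums(P)
col_mass <- colSums(P)

# SVD of the standardized residuals.
S <- diag(1 / sqrt(row_mass)) %*%
  (P - row_mass %o% col_mass) %*%
  diag(1 / sqrt(col_mass))
decomp <- svd(S)

keep <- 1:2  # the two axes shown in the biplot

# Principal coordinates: singular vectors rescaled by the masses and singular values.
row_coord <- diag(1 / sqrt(row_mass)) %*% decomp$u[, keep] %*% diag(decomp$d[keep])
col_coord <- diag(1 / sqrt(col_mass)) %*% decomp$v[, keep] %*% diag(decomp$d[keep])
dimnames(row_coord) <- list(rownames(counts), c("Dim1", "Dim2"))
dimnames(col_coord) <- list(colnames(counts), c("Dim1", "Dim2"))

round(row_coord, 3)
round(col_coord, 3)
```

In a symmetric map, distances among countries and distances among languages are meaningful, but the distance between a country and a language is not directly interpretable. Country-language associations are better read from the direction in which points lie relative to the origin.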

Practical Applications of Correspondence Analysis (CA)

Correspondence Analysis (CA) analyzes relationships between categorical variables by reducing a contingency table to a small number of dimensions that can be plotted and interpreted visually. It is applied in market research, healthcare, finance, political analysis, and e-commerce to uncover patterns and associations within data; three of these areas are outlined below.

In the healthcare and epidemiology sector, CA plays a critical role in disease and risk factor analysis, mapping associations between different patient groups and their likelihood of developing specific conditions. Hospitals and healthcare institutions use it to optimize resource allocation, ensuring services are available to the populations that need them most. Additionally, it is a valuable tool in analyzing patient satisfaction surveys, helping healthcare providers improve their services based on demographic-specific feedback.

Social scientists and political analysts use Correspondence Analysis to examine voter behavior and public opinion trends. By analyzing survey responses, researchers can determine how demographic factors such as income level, education, and geography influence political preferences. This information helps policymakers craft messages that resonate with specific voter segments and design policies that address the needs of diverse communities.

Financial institutions also benefit from CA, particularly in credit risk analysis and investment segmentation. Banks can use this method to assess loan applicants based on income, employment type, and credit history, identifying patterns that help determine risk levels. Additionally, CA aids in understanding consumer banking preferences, allowing financial organizations to tailor products such as loans, credit cards, and savings accounts to different customer profiles.

Conclusion

The Correspondence Analysis (CA) visualization provides a clear representation of how different countries associate with specific languages.

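  • The predominantly English-speaking countries (Canada, USA, and England) are positioned close together because their language profiles are similar: English accounts for 688 to 798 of every 1,000 counts in each.

  • Italy is distinctly associated with the Italian language (944 of 1,000), while Switzerland is the most multilingual country in the dataset, led by German (648) with a substantial French share (222) and some Italian (95).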

  • The proximity of certain countries and languages in the CA plot confirms expected linguistic patterns, reinforcing how historical, geographical, and cultural influences shape language usage in different nations.
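
The grouping described in these points can be checked numerically with chi-square distances between country profiles, which is the distance that CA approximates between row points. The sketch below is a base R illustration and not part of the original analysis; `counts`, `scaled`, and `chi_dist` are names introduced here for the example.

```r
# Contingency table (values copied from the data above).
counts <- matrix(
  c(688, 280,  10,  11,  11,
    730,  31, 190,   8,  41,
    798,  74,  38,  31,  59,
     17,  13,  11,  15, 944,
     15, 222,  20, 648,  95),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Canada", "USA", "England", "Italy", "Switzerland"),
    c("English", "French", "Spanish", "German", "Italian")
  )
)

profiles <- prop.table(counts, margin = 1)   # row profiles
col_mass <- colSums(counts) / sum(counts)    # average profile (column masses)

# Chi-square distance = Euclidean distance after dividing each column
# of the profiles by the square root of its column mass.
scaled   <- sweep(profiles, 2, sqrt(col_mass), "/")
chi_dist <- as.matrix(dist(scaled))

round(chi_dist, 2)
```

In the resulting matrix, the distances among Canada, the USA, and England are smaller than any distance from those three countries to Italy or Switzerland, which is consistent with the clustering seen in the plot.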

This analysis effectively demonstrates how Correspondence Analysis can be used to interpret categorical data and uncover relationships between multiple variables in sociolinguistic studies.