Analyzing Multilingual Distribution Across Countries Using Correspondence Analysis
Author
Saurabh C Srivastava
Published
March 1, 2025
Objective of the Analysis
The objective of this analysis is to explore language distribution across countries using Correspondence Analysis (CA). The dataset records language usage in Canada, the USA, England, Italy, and Switzerland across five major languages (English, French, Spanish, German, and Italian), with counts normalized to a total of 1,000 per country. CA lets us visualize the relationships between countries and their dominant languages, and identify language preferences and cultural influences in multilingual societies.
Brief Description of the Code
1. Load Required Libraries
library(dplyr) # Data manipulation
library(tibble) # Converting columns to row names
library(FactoMineR) # Correspondence Analysis (CA)
library(factoextra) # CA visualization (optional)
2. Create and Load Language Data
lang_text = ("Country English French Spanish German Italian Total
Canada 688 280 10 11 11 1000
USA 730 31 190 8 41 1000
England 798 74 38 31 59 1000
Italy 17 13 11 15 944 1000
Switzerland 15 222 20 648 95 1000")
3. Data Preprocessing
Removes the Total column, as it is redundant for Correspondence Analysis.
Converts Country column into row names, making it compatible for Correspondence Analysis.
lang = read.table(textConnection(lang_text), sep = "", header = TRUE, stringsAsFactors = TRUE)
lang
Country English French Spanish German Italian Total
1 Canada 688 280 10 11 11 1000
2 USA 730 31 190 8 41 1000
3 England 798 74 38 31 59 1000
4 Italy 17 13 11 15 944 1000
5 Switzerland 15 222 20 648 95 1000
lang = lang %>% dplyr::select(-Total)
lang
Country English French Spanish German Italian
1 Canada 688 280 10 11 11
2 USA 730 31 190 8 41
3 England 798 74 38 31 59
4 Italy 17 13 11 15 944
5 Switzerland 15 222 20 648 95
str(lang)
'data.frame': 5 obs. of 6 variables:
$ Country: Factor w/ 5 levels "Canada","England",..: 1 5 2 3 4
$ English: int 688 730 798 17 15
$ French : int 280 31 74 13 222
$ Spanish: int 10 190 38 11 20
$ German : int 11 8 31 15 648
$ Italian: int 11 41 59 944 95
lang$Country = as.factor(lang$Country)
lang = lang %>% column_to_rownames("Country")
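Before running CA, it can help to look at the row profiles, that is, each country's counts divided by its row total, since these are what CA compares. The following is a minimal base R sketch, not part of the original analysis. It rebuilds the table as a matrix under the illustrative name `lang_mat` so that it runs on its own; with the data frame prepared above, `as.matrix(lang)` should play the same role.

```r
# Same counts as the article's table, rebuilt as a matrix so this block is
# self-contained. `lang_mat` is an illustrative name, not from the article.
lang_mat <- matrix(
  c(688, 280,  10,  11,  11,
    730,  31, 190,   8,  41,
    798,  74,  38,  31,  59,
     17,  13,  11,  15, 944,
     15, 222,  20, 648,  95),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Canada", "USA", "England", "Italy", "Switzerland"),
    c("English", "French", "Spanish", "German", "Italian")
  )
)

# Row profiles: each country's counts as a share of its own total
row_profiles <- prop.table(lang_mat, margin = 1)
print(round(row_profiles, 3))

# Language with the largest share in each country
dominant <- colnames(lang_mat)[apply(lang_mat, 1, which.max)]
names(dominant) <- rownames(lang_mat)
print(dominant)
```

Because every country's counts sum to 1,000 here, the row profiles are simply the counts divided by 1,000, and all five countries carry equal weight in the analysis.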
4. Perform Correspondence Analysis (CA)
Performs Correspondence Analysis (CA) to explore relationships between countries and language usage.
The graph = TRUE argument automatically generates a basic visualization.
res.ca <- CA(lang, graph = TRUE)
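To make the CA step less of a black box, the sketch below reproduces the core computation in base R: a singular value decomposition of the standardized residuals of the contingency table. This is the textbook formulation of CA, offered as an illustration rather than a description of FactoMineR's internals. The names `lang_mat`, `row_mass`, `col_mass`, and `row_coord` are illustrative assumptions, and the signs of the coordinates may be flipped relative to the output of `CA()`.

```r
# Same counts as the article's table, rebuilt as a matrix so this block is
# self-contained. `lang_mat` is an illustrative name, not from the article.
lang_mat <- matrix(
  c(688, 280,  10,  11,  11,
    730,  31, 190,   8,  41,
    798,  74,  38,  31,  59,
     17,  13,  11,  15, 944,
     15, 222,  20, 648,  95),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Canada", "USA", "England", "Italy", "Switzerland"),
    c("English", "French", "Spanish", "German", "Italian")
  )
)

n <- sum(lang_mat)
P <- lang_mat / n                       # correspondence matrix (proportions)
row_mass <- rowSums(P)                  # row masses
col_mass <- colSums(P)                  # column masses
expected <- outer(row_mass, col_mass)   # proportions expected under independence

# Standardized residuals and their singular value decomposition
S <- (P - expected) / sqrt(expected)
dec <- svd(S)

# Squared singular values are the principal inertias (eigenvalues)
eig <- dec$d^2
total_inertia <- sum(eig)
pct_inertia <- 100 * eig / total_inertia
print(round(pct_inertia, 1))

# Principal coordinates of the countries: D_r^(-1/2) U Sigma
row_coord <- sweep(dec$u, 2, dec$d, "*") / sqrt(row_mass)
rownames(row_coord) <- rownames(lang_mat)
print(round(row_coord[, 1:2], 3))

# Total inertia equals the Pearson chi-square statistic divided by n
chi2 <- unname(chisq.test(lang_mat)$statistic)
print(all.equal(total_inertia, chi2 / n))
```

A 5 x 5 table has at most four non-trivial dimensions, so the last value in `eig` is numerically zero. The eigenvalues reported in `res.ca$eig` should agree with `eig` up to rounding.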
5. Visualizing Correspondence Analysis Results with a Biplot
fviz_ca_biplot(res.ca, repel = TRUE, col.row = "#2E9FDF", col.col = "#FC4E07", axes = c(1, 2)) +
  ggtitle("Correspondence Analysis of Languages and Countries") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(caption = "Saurabh's Work")
Practical Applications of Correspondence Analysis (CA)
Correspondence Analysis (CA) is a statistical technique for analyzing the association between the rows and columns of a contingency table of categorical variables. It reduces dimensionality and represents the categories as points on a low-dimensional map that is easy to read. It is applied in market research, healthcare, finance, political analysis, and e-commerce, where it helps organizations uncover patterns and associations in categorical data.
In the healthcare and epidemiology sector, CA plays a critical role in disease and risk factor analysis, mapping associations between different patient groups and their likelihood of developing specific conditions. Hospitals and healthcare institutions use it to optimize resource allocation, ensuring services are available to the populations that need them most. Additionally, it is a valuable tool in analyzing patient satisfaction surveys, helping healthcare providers improve their services based on demographic-specific feedback.
Social scientists and political analysts use Correspondence Analysis to examine voter behavior and public opinion trends. By analyzing survey responses, researchers can determine how demographic factors such as income level, education, and geography influence political preferences. This information helps policymakers craft messages that resonate with specific voter segments and design policies that address the needs of diverse communities.
Financial institutions also benefit from CA, particularly in credit risk analysis and investment segmentation. Banks can use this method to assess loan applicants based on income, employment type, and credit history, identifying patterns that help determine risk levels. Additionally, CA aids in understanding consumer banking preferences, allowing financial organizations to tailor products such as loans, credit cards, and savings accounts to different customer profiles.
Conclusion
The Correspondence Analysis (CA) visualization provides a clear representation of how different countries associate with specific languages.
English-speaking countries (Canada, USA, and England) are positioned close together, reflecting similar language profiles dominated by English.
Italy is distinctly associated with the Italian language, while Switzerland has the most multilingual profile, led by German (648 per 1,000) with a substantial French share (222 per 1,000).
The proximity of certain countries and languages in the CA plot confirms expected linguistic patterns, reinforcing how historical, geographical, and cultural influences shape language usage in different nations.
This analysis effectively demonstrates how Correspondence Analysis can be used to interpret categorical data and uncover relationships between multiple variables in sociolinguistic studies.
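The proximity claims above can be checked numerically. Distances between countries on a CA map approximate chi-square distances between their row profiles, so one way to verify the grouping is to compute those distances directly. The sketch below does this in base R. It is an illustrative check rather than part of the original analysis, and `lang_mat`, `scaled`, and `chi_dist` are assumed names.

```r
# Same counts as the article's table, rebuilt as a matrix so this block is
# self-contained. `lang_mat` is an illustrative name, not from the article.
lang_mat <- matrix(
  c(688, 280,  10,  11,  11,
    730,  31, 190,   8,  41,
    798,  74,  38,  31,  59,
     17,  13,  11,  15, 944,
     15, 222,  20, 648,  95),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Canada", "USA", "England", "Italy", "Switzerland"),
    c("English", "French", "Spanish", "German", "Italian")
  )
)

profiles <- prop.table(lang_mat, margin = 1)    # row profiles
col_mass <- colSums(lang_mat) / sum(lang_mat)   # average language shares

# Chi-square distance = Euclidean distance after dividing each column
# of the profiles by the square root of its column mass
scaled <- sweep(profiles, 2, sqrt(col_mass), "/")
chi_dist <- as.matrix(dist(scaled))
print(round(chi_dist, 2))

# Compare distances within the English-majority group to distances
# between that group and the other two countries
eng <- c("Canada", "USA", "England")
others <- c("Italy", "Switzerland")
print(max(chi_dist[eng, eng]) < min(chi_dist[eng, others]))
```

For this table, every distance among Canada, the USA, and England is smaller than any distance from those three to Italy or Switzerland, which is consistent with the grouping seen in the biplot.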