Click the Original, Code & Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: Visual Capitalist (2021).


Objective

The objective of the visualisation is to display linguistic diversity across the globe. It purposefully illustrates that countries with larger populations do not necessarily have a higher number of languages within their populations, and it highlights the top 10 most diverse countries. It is aimed at linguists and others with a growing interest in the languages of the world, specifically those who are keen on linguistic demography.

The visualisation chosen had the following three main issues:

  • Area and size can be deceptive when used to encode a quantitative variable. In the chosen visualisation, the two quantitative features “Total Population” and “Total Languages” are shown for each country as “semihemispheres” whose area and size indicate magnitude. There is also no information about how the quantities have been scaled.

  • The representation makes it difficult to compare the features. It is almost impossible to tell by what margin one country is more “linguistically diverse” than another, even when it is compared with its neighbour in the ranking. For example, a meaningful comparison between India and China is barely possible.

  • Although the visualisation aims to highlight linguistic differences across the globe, the features chosen (population and total number of languages) do not necessarily portray actual linguistic diversity, that is, the density of unique languages present in a country. The linguistic diversity of a country can be expressed numerically as the linguistic diversity index (LDI), which gives the probability that two randomly selected people will not share the same first language. The LDI ranges from 0 (where everyone speaks the same language) to 1 (where no two people share a language). Hence, the ranking methodology is flawed, and countries should instead be ranked according to their LDI.
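
To make the index concrete, the sketch below computes an LDI from speaker counts using Greenberg's diversity index, one minus the sum of squared speaker shares. This is assumed to be the form behind the published figures. The helper name `ldi` and the speaker counts are hypothetical and are not taken from any of the data sources.

```r
# Hypothetical first-language speaker counts for one country (not real data)
speakers <- c(lang_a = 500, lang_b = 300, lang_c = 200)

# Greenberg's diversity index: probability that two randomly chosen
# people have different first languages
ldi <- function(counts) {
  shares <- counts / sum(counts)
  1 - sum(shares^2)
}

ldi(speakers)        # 1 - (0.25 + 0.09 + 0.04) = 0.62
ldi(c(only = 1000))  # a single language gives 0
```

A country whose speakers are split evenly across many languages approaches 1, while a large population dominated by one language stays near 0, regardless of how many small languages it also has.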

Reference

Ang, C. (2021). Ranked: The Countries with the Most Linguistic Diversity. https://www.visualcapitalist.com/the-countries-with-the-most-linguistic-diversity/

Code

The following code was used to fix the issues identified in the original.

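library(pdftools)
library(dplyr)
library(ggplot2)
#facetscales is not on CRAN; it is assumed here to have been installed from GitHub (zeehio/facetscales)
library(facetscales)
library(scales)
library(stringr)
library(tidyr)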


#read the pdf containing the list of countries and the count of languages spoken in each
pdf <- pdf_text(pdf = "countries_with_the_most_spoken_languages.pdf")

#remove newline and carriage return characters from the first page
text <- gsub("\n", "", gsub("\r", "", pdf[1], fixed = TRUE), fixed = TRUE)

#match "<non-digits><three digits>" pairs once; rows 2 to 11 hold the ten countries
matches <- str_match_all(text, "(\\D+)(\\d{3})")[[1]][2:11, ]

#extract the country list
countries <- str_trim(matches[, 2])
#fix the name of the first country by dropping its five leading characters
countries[1] <- substr(countries[1], 6, nchar(countries[1]))
#convert the count to numeric
lang_num <- as.numeric(matches[, 3])

#dataframe that holds country against total languages spoken
df_lng_count <- data.frame(Country = countries, C1.Lng = lang_num)


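#read csv file of population data (no header, so the leading metadata rows are kept as data)
pop <- read.csv("API_SP.POP.TOTL_DS2_en_csv_v2_2252106.csv", header = FALSE)

#subset the data to get the name of each country and its population in 2019
pop <- pop[4:nrow(pop), c(1, 64)]

#rename the columns
colnames(pop) <- c("Country", "C2.Pop")

#the header text was read as data, so the population column may not be numeric; convert it
pop$C2.Pop <- as.numeric(as.character(pop$C2.Pop))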

#diversity index collected from UNESCO (2005) paper release
df_idx <- data.frame(Country = df_lng_count$Country,
                     LDI = c(0.99, 0.846, 0.87, 0.93, 0.353, 0.126, 0.491, 0.135, 0.942, 0.032))

#top ten countries by language count, with population and linguistic diversity index
df <- df_lng_count %>%
  inner_join(pop %>% filter(Country %in% df_lng_count$Country), by = "Country") %>%
  inner_join(df_idx, by = "Country")

#order the countries from highest to lowest LDI
df$Country <- factor(df$Country, levels = df$Country[order(df$LDI, decreasing = TRUE)])
#preparing our data for ggplot: one row per country and measure
df <- gather(df, key = "measure", value = "value", c("C1.Lng", "C2.Pop", "LDI"))

#scales of each facet
scales_y <- list(
  `C1.Lng` = scale_y_continuous(breaks = seq(0, 900, 250)),
  `C2.Pop` = scale_y_log10(labels = label_number(scale = 1 / 1000000)),
  `LDI` = scale_y_continuous(labels = percent_format())
)

#facet labels, looked up by the value of the measure column
var.names <- c("C1.Lng" = "Language Count",
               "C2.Pop" = "Population (Millions, Log Scale)",
               "LDI" = "Linguistic Diversity Index")
var.labeller <- as_labeller(var.names)
#barplot using ggplot
p1 <- ggplot(df, aes(x = Country, y = value, fill = Country)) +
  geom_bar(stat = "identity", width = 0.7) +
  facet_grid_sc(rows = vars(measure), scales = list(y = scales_y), labeller = var.labeller) +
  labs(title = "Linguistic Diversity", y = "Count", subtitle = "The most linguistically diverse countries") +
  #bar labels: population in millions, language count as is, LDI as a percentage
  geom_text(aes(label = ifelse(value > 1000, paste(round(value / 1000000), "M"),
                               ifelse(value > 100, value, paste(round(value * 100, 2), "%"))),
                y = value),
            position = position_dodge(0.9), vjust = -.4, size = 3, color = "black") +
  theme_bw() +
  coord_cartesian(clip = "off") +
  theme(panel.spacing.x = unit(3, "lines"),
        panel.spacing.y = unit(1, "lines")) +
  scale_fill_manual(name = "Linguistic Diversity Rank",
                    labels = c("1.Papua New Guinea", "2.Cameroon", "3.India",
                               "4.Nigeria", "5.Indonesia", "6.China",
                               "7.United States", "8.Mexico", "9.Australia", "10.Brazil"),
                    values = c("#3A6629", "#69B391", "#4CBB17", "#00EE00", "#93DB70",
                               "#97E697", "#66FF66", "#B7E2B7", "#C1FFC1", "#E0EEE0"))
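
The extraction step in the code above relies on the pattern `(\\D+)(\\d{3})`, which pairs a run of non-digit characters (the country name) with the three digits that follow it (the language count). The sketch below shows the same idea in base R on a made-up string. The sample text and the `sample_` names are hypothetical, and the real PDF page may be laid out differently.

```r
# Made-up sample of flattened PDF text (hypothetical, not the real page)
sample_text <- "Papua New Guinea   840   Indonesia   711   Nigeria   517"

# Each match is a run of non-digits followed by exactly three digits
hits <- regmatches(sample_text, gregexpr("\\D+\\d{3}", sample_text))[[1]]

# Split each match into the name part and the count part
sample_countries <- trimws(sub("\\d{3}$", "", hits))
sample_counts <- as.numeric(sub("^\\D+", "", hits))

data.frame(Country = sample_countries, C1.Lng = sample_counts)
```

Because the pattern expects exactly three digits, a count with fewer or more digits would be skipped or split, so this approach only suits a list in which every count falls between 100 and 999.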

Data Reference

Eberhard, D. M., Simons, G. F., & Fennig, C. D. (2021). What countries have the most languages? https://www.ethnologue.com/guides/countries-most-languages

The World Bank. (2021). Population, total. https://data.worldbank.org/indicator/SP.POP.TOTL

United Nations Educational, Scientific and Cultural Organization. (2009). Investing in Cultural Diversity and Intercultural Dialogue. http://www.lacult.unesco.org/docc/2009_Investing_in_cult_div_Completo.pdf

Reconstruction

The following plot fixes the main issues in the original.