Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: Visual Capitalist, Ang, C. (2020).


Objective

In today’s increasingly globalized world, it is important to have a shared means of communication. Roughly 7,139 languages are spoken in the world today. About 40% of them are endangered, often with fewer than 1,000 speakers remaining, while only 23 languages account for more than half of the world’s population (“How many languages are there in the world?”, 2021).

This data visualization aims to showcase the top 10 most spoken languages in the world, the proportion of native speakers of each language and the origin of each language. Its target audience is the general public worldwide, or anyone interested in international business or linguistics, for example authors, educators and business people.

The visualization chosen had the following three main issues:

  • Visual Comparison Accuracy: The method used to encode quantitative data affects how quickly and accurately viewers can compare values. The visualization uses area/size to show the difference in the total number of speakers of each language. Because the differences between languages are small, it is hard to make accurate inferences unless the actual numbers or ranks are provided. The following are some examples:

    • The areas representing the total number of speakers of English and Mandarin Chinese look very similar, and the areas for French, Standard Arabic, Bengali, Russian and Portuguese also appear to be equal. Without the numbers and ranks, viewers would find it difficult to make accurate inferences.
    • Because the areas for total speakers look so similar, the proportion of native speakers is also difficult to judge. For example, 94% of Portuguese speakers consider it their native language, while only 86% of Bengali speakers do; yet in the visualization the proportion of native speakers looks the same for both languages.
  • Frequency Data: The number of speakers of each language is reported as a raw count (frequency). Counts alone do not show how widely a language is spoken relative to the total world population. Since the visualization aims to showcase the top 10 most spoken languages in the world, it is essential to relate the figures to the total world population. For example, a value of 1,132 million English speakers conveys little to a viewer who does not know the total world population. It is therefore better to visualize the data as percentages.

  • Colour Blindness: The plot uses colours that are not colour-blind friendly. Approximately 8% of males and 0.4% of females have some form of colour blindness, with red-green colour blindness being the most common type. Simulations of the visualization for both Red-Blind (Protanopia) and Green-Blind (Deuteranopia) viewers, produced with the Coblis colour blindness simulator, are shown in the figure below.

    Coblis — Color Blindness Simulator – Colblindor (2021)


    The simulations show that both Red-Blind (Protanopia) and Green-Blind (Deuteranopia) viewers would not be able to distinguish the language origins of Mandarin Chinese (Sino-Tibetan) and Standard Arabic (Afro-Asiatic). The plot also uses highly saturated colours, which may cause visual stress during prolonged viewing.
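The visual comparison issue in the first bullet can be illustrated with a little arithmetic. The following is a minimal sketch, assuming each language is drawn as a mark whose area is proportional to its speaker count and that the marks have similar shapes. The value 1,132 million is the English figure quoted above; the second value (1,100 million) is hypothetical and chosen only for illustration.

```r
# Speaker counts in millions: 1132 is the English figure quoted in the text,
# 1100 is a hypothetical value used only for illustration.
speakers <- c(english = 1132, hypothetical = 1100)

# Ratio of the values, which is also the ratio of the areas
# when area is proportional to value
value_ratio <- speakers[["english"]] / speakers[["hypothetical"]]

# For similar shapes, linear size (width or radius) scales with sqrt(area)
size_ratio <- sqrt(value_ratio)

cat(sprintf("Difference in value: %.1f%%\n", (value_ratio - 1) * 100))
cat(sprintf("Difference in width: %.1f%%\n", (size_ratio - 1) * 100))
```

Under these assumptions the widths of the two marks differ by roughly half as much as the values themselves, which is one reason small differences are harder to read from area than from bar length.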

Reference

Code

The following code was used to fix the issues identified in the original.

library(ggplot2)
library(magrittr)
library(tidyr)
library(dplyr)

# Load the data and confirm the column types
data <- read.csv("Top 10 Most Spoken Language.csv")
class(data$Rank)
## [1] "integer"
class(data$Language)
## [1] "character"
class(data$Total.Speakers)
## [1] "integer"
class(data$Native.Speakers)
## [1] "integer"
class(data$Language.Origins)
## [1] "character"
# Reshape to long format: one row per language and speaker type
data <- data %>% gather(Total.Speakers, Native.Speakers, key = "Speakers", value = "Number of Speakers")

# Convert the grouping columns to factors with readable labels
data$Language.Origins <- data$Language.Origins %>% factor(levels = c("Indo-European", "Sino-Tibetan", "Afro-Asiatic", "Austronesian"))
data$Speakers <- data$Speakers %>% factor(levels = c("Total.Speakers", "Native.Speakers"), labels = c("Total Speakers", "Native Speakers"))
# Treat missing counts as zero
data$`Number of Speakers`[is.na(data$`Number of Speakers`)] <- 0
# Replace each total with the non-native count (total minus native) so the stacked bars sum to the total.
# This relies on both subsets listing the languages in the same order, as gather() produces.
data$`Number of Speakers`[data$Speakers == "Total Speakers"] <- data$`Number of Speakers`[data$Speakers == "Total Speakers"] - data$`Number of Speakers`[data$Speakers == "Native Speakers"]
# Relabel the two groups as non-native and native
data$Speakers <- as.character(data$Speakers)
data$Speakers[data$Speakers == "Total Speakers"] <- "Non Native Speakers"
data$Speakers <- data$Speakers %>% factor(levels = c("Non Native Speakers", "Native Speakers"), labels = c("Non Native", "Native"))

# Express the counts (in millions) as a percentage of a world population of 7,800 million
data <- data %>% mutate(percentage = round(`Number of Speakers` / 7800 * 100, 2))

# Stacked bars: fill shows native vs. non-native speakers, outline colour shows language origin
p <- ggplot(data, aes(x = reorder(Language, -percentage), y = percentage,
                      fill = Speakers, color = Language.Origins)) +
  labs(title = "Top 10 Most Spoken Language In The World", x = "Language",
       y = "Total World Population (In Percentage)", color = "Language Origins") +
  scale_y_continuous(breaks = seq(0, 15, by = 1))
p <- p + geom_bar(stat = "identity", size = 1, alpha = 0.8, width = 0.9) +
  scale_fill_brewer(palette = "Pastel1") +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        legend.background = element_rect(fill = "white", size = 0.5, linetype = "solid", colour = "black"),
        legend.title = element_text(size = 7, face = "bold"), legend.text = element_text(size = 7),
        legend.position = c(.9, .7), legend.box = "vertical",
        axis.text = element_text(size = 8, face = "bold"), axis.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(face = "bold.italic", angle = 30, hjust = 1)) +
  scale_color_manual(values = c("Grey61", "brown4", "darkorange2", "mediumpurple4")) +
  # Label each segment with its percentage and the number of speakers in millions
  geom_text(aes(label = paste(percentage, "% (", `Number of Speakers`, "M)", sep = "")),
            position = position_stack(), vjust = -1, size = 2.1, fontface = "bold", color = "black")
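The split into native and non-native speakers and the conversion to percentages can be checked on a small table before plotting. The following is a minimal sketch in base R, assuming a table with the same column names as the CSV. The two languages and all of their counts are hypothetical, and `world_population` simply reuses the constant 7800 (millions) from the code above.

```r
# Toy table in the same shape as the CSV; all counts are hypothetical (millions)
toy <- data.frame(
  Language = c("Language A", "Language B"),
  Total.Speakers = c(1000, 400),
  Native.Speakers = c(300, 380)
)
world_population <- 7800  # millions, the same constant used in the code above

# Split each total into native and non-native speakers
toy$Non.Native.Speakers <- toy$Total.Speakers - toy$Native.Speakers

# Express each part as a share of the world population
toy$native_pct <- round(toy$Native.Speakers / world_population * 100, 2)
toy$non_native_pct <- round(toy$Non.Native.Speakers / world_population * 100, 2)
toy$total_pct <- round(toy$Total.Speakers / world_population * 100, 2)

# The two stacked segments should add up to the total share, up to rounding
stopifnot(all(abs(toy$native_pct + toy$non_native_pct - toy$total_pct) <= 0.02))
print(toy)
```

Because each share is rounded separately, the two segments can differ from the rounded total by a small amount, which is why the check allows a tolerance instead of testing for exact equality.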

Data Reference

Reconstruction

The following plot fixes the main issues in the original.