Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.
Objective
In today’s increasingly globalized world, it is important to have a shared means of communication. Roughly 7,139 languages are spoken in the world today. About 40% of them are endangered, often with fewer than 1,000 speakers remaining, while only 23 languages account for more than half the world’s population (“How many languages are there in the world?”, 2021).
This data visualization aims to showcase the top 10 most spoken languages in the world, the proportion of native speakers of each language and the origin of each language. Its target audience is the general public around the world and anyone interested in international business or linguistics, for example authors, the education sector and business people.
The visualization chosen had the following three main issues:
Visual Comparison Accuracy: The way quantitative data is encoded in a visualization affects how quickly and accurately a viewer can compare values. The original visualization uses area/size to show the difference in the total number of speakers for each language. Because the differences between the totals for different languages are small, it is hard to make accurate inferences if the actual numbers or ranks are not provided. The following are some examples:
Frequency Data: The number of speakers of each language is reported as a frequency (a raw count). A raw count does not show how widely a language is spoken relative to the total world population. Since this visualization aims to showcase the top 10 most spoken languages in the world, it is essential to display the figures relative to the total world population. For example, a value of 1,132 million English speakers conveys little sense of scale, as a typical viewer might not know the total world population. It is therefore better to visualize the data as percentages.
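The conversion from a raw count to a share of the world population is simple arithmetic. The sketch below is illustrative only: it assumes a world population of about 7,800 million, the same divisor used in the reconstruction code, and the variable names are made up for this example.

```r
# Assumed world population in millions (an assumption of this sketch)
world_population_m <- 7800

# English speakers in millions, as quoted in the text above
english_speakers_m <- 1132

# Share of the world population, rounded to two decimal places
english_share_pct <- round(english_speakers_m / world_population_m * 100, 2)
print(english_share_pct)
```

Under that assumption, English speakers make up roughly 14.5% of the world population, which is easier to interpret than the raw count.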
Colour Blindness: The plot uses colours that are not colour-blind friendly. Approximately 8% of males and 0.4% of females have some form of colour blindness, with red-green colour blindness being the most common type. Simulations of the data visualization for both Red-Blind (Protanopia) and Green-Blind (Deuteranopia) viewers, produced with the Coblis colour blindness simulator, are shown in the figure below.
Coblis — Color Blindness Simulator – Colblindor (2021)
It can be seen that both Red-Blind (Protanopia) and Green-Blind (Deuteranopia) viewers would not be able to differentiate between the language origins of Mandarin Chinese (language origin: Sino-Tibetan) and Standard Arabic (language origin: Afro-Asiatic). The plot also uses highly saturated colours, which might cause visual stress for the viewer during prolonged viewing.
Reference
The following code was used to fix the issues identified in the original.
library(ggplot2)
library(magrittr)
library(tidyr)
library(dplyr)
data <- read.csv("Top 10 Most Spoken Language.csv")
class(data$Rank)
## [1] "integer"
class(data$Language)
## [1] "character"
class(data$Total.Speakers)
## [1] "integer"
class(data$Native.Speakers)
## [1] "integer"
class(data$Language.Origins)
## [1] "character"
# Reshape to long format: one row per language and speaker type
data <- data %>% gather(Total.Speakers, Native.Speakers, key = "Speakers", value = "Number of Speakers")
# Fix the order of the language families and speaker types used in the legends
data$Language.Origins <- data$Language.Origins %>% factor(levels = c("Indo-European", "Sino-Tibetan", "Afro-Asiatic", "Austronesian"), labels = c("Indo-European", "Sino-Tibetan", "Afro-Asiatic", "Austronesian"))
data$Speakers <- data$Speakers %>% factor(levels = c("Total.Speakers", "Native.Speakers"), labels = c("Total Speakers", "Native Speakers"))
# Treat missing counts as zero
data$`Number of Speakers`[is.na(data$`Number of Speakers`)] <- 0
# Turn totals into non-native counts (total minus native); this relies on the languages appearing in the same order within each speaker type
data$`Number of Speakers`[data$Speakers == "Total Speakers"] <- data$`Number of Speakers`[data$Speakers == "Total Speakers"] - data$`Number of Speakers`[data$Speakers == "Native Speakers"]
# Relabel the former totals as non-native speakers
data$Speakers <- as.character(data$Speakers)
data$Speakers[data$Speakers == "Total Speakers"] <- "Non Native Speakers"
data$Speakers <- data$Speakers %>% factor(levels = c("Non Native Speakers", "Native Speakers"), labels = c("Non Native", "Native"))
# Express the counts (in millions) as a percentage of a world population of 7,800 million
data <- data %>% mutate(percentage = round(`Number of Speakers` / 7800 * 100, 2))
# Base plot: languages ordered by share, bars filled by speaker type and outlined by language family
p <- ggplot(data, aes(x = reorder(Language, -percentage), y = percentage, fill = Speakers, color = Language.Origins)) +
  labs(title = "Top 10 Most Spoken Language In The World", x = "Language", y = "Total World Population (In Percentage)", color = "Language Origins") +
  scale_y_continuous(breaks = seq(0, 15, by = 1))
# Stacked bars with a pastel fill, muted outline colours and a percentage/count label on each segment
p <- p + geom_bar(stat = "identity", size = 1, alpha = 0.8, width = 0.9) +
  scale_fill_brewer(palette = "Pastel1") +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        legend.background = element_rect(fill = "white", size = 0.5, linetype = "solid", colour = "black"),
        legend.title = element_text(size = 7, face = "bold"), legend.text = element_text(size = 7),
        legend.position = c(.9, .7), legend.box = "vertical",
        axis.text = element_text(size = 8, face = "bold"), axis.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(face = "bold.italic", angle = 30, hjust = 1)) +
  scale_color_manual(values = c("Grey61", "brown4", "darkorange2", "mediumpurple4")) +
  geom_text(aes(label = paste(percentage, "% (", `Number of Speakers`, "M)", sep = "")), position = position_stack(), vjust = -1, size = 2.1, fontface = "bold", color = "black")
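The split of each bar into native and non-native segments can be checked on a small toy table. The sketch below uses base R only and is not the code used for the reconstruction. Apart from the 1,132 million English total quoted earlier, the figures are illustrative placeholders rather than the actual dataset, and the names `toy`, `long` and `stacked_total` are made up for this example.

```r
# Toy data in millions of speakers; illustrative placeholders, not the full dataset
toy <- data.frame(
  Language        = c("English", "Mandarin Chinese"),
  Total.Speakers  = c(1132, 1117),
  Native.Speakers = c(379, 918)
)

# Assumed world population in millions (same divisor as the reconstruction code)
world_population_m <- 7800

# Non-native speakers are what remains after removing native speakers
toy$NonNative.Speakers <- toy$Total.Speakers - toy$Native.Speakers

# Long format: one row per language and speaker type
long <- data.frame(
  Language = rep(toy$Language, times = 2),
  Speakers = rep(c("Non Native", "Native"), each = nrow(toy)),
  Number   = c(toy$NonNative.Speakers, toy$Native.Speakers)
)
long$percentage <- round(long$Number / world_population_m * 100, 2)

# The two segments of each bar add back up to the language's total
stacked_total <- tapply(long$Number, long$Language, sum)
print(long)
print(stacked_total)
```

Because the two segments sum to the original total, stacking them keeps the overall bar height equal to the total share of speakers while still showing the native proportion.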
Data Reference
Ang, C. (2020). The World’s Top 10 Most Spoken Languages. Visual Capitalist. Retrieved 22 April 2021, from The Visual Capitalist website: https://www.visualcapitalist.com/the-worlds-top-10-most-spoken-languages/.
2020 World Population Data Sheet Shows Older Populations Growing, Total Fertility Rates Declining – Population Reference Bureau. Prb.org. (2021). Retrieved 27 April 2021, from https://www.prb.org/2020-world-population-data-sheet/#:~:text=The%20world%20population%20is%20projected,as%20in%20the%20United%20States.
The following plot fixes the main issues in the original.