#Load libraries and import dataset
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#ggplot2, dplyr, and tidyr are already attached by library("tidyverse") above, so they do not need to be loaded again.
language_extinction_data <- read.csv("extinct_languages_data.csv")
This project uses the Extinct Languages dataset from Kaggle, available at https://www.kaggle.com/code/nishithamajeti/extinct-languages. We will take a cursory look at how extinct and endangered languages are distributed across the globe. After loading the dataset, we select the columns we will be using.
#Select the columns we will use for our visualizations and save the result. These are: ID, Name in English, Countries, Degree of endangerment, Number of speakers, Latitude, and Longitude.
lang_data_selected <- language_extinction_data %>%
  select(ID, Name.in.English, Countries, Degree.of.endangerment, Number.of.speakers, Latitude, Longitude)
Some of the languages in our dataset are listed under multiple countries in the Countries column. We’re going to create separate rows for those entries, keeping the unique ID per language.
#Split the comma-separated Countries entry for each language into one row per country.
#The result is only printed here, not assigned, so lang_data_selected itself is unchanged.
lang_data_selected %>%
  mutate(Countries = strsplit(Countries, ",")) %>%
  unnest(Countries)
## # A tibble: 3,074 × 7
## ID Name.in.English Countries Degree.of.endangerment Number.of.speakers
## <int> <chr> <chr> <chr> <int>
## 1 1022 South Italian "Italy" Vulnerable 7500000
## 2 1023 Sicilian "Italy" Vulnerable 5000000
## 3 383 Low Saxon "Germany" Vulnerable 4800000
## 4 383 Low Saxon " Denmark" Vulnerable 4800000
## 5 383 Low Saxon " Netherland… Vulnerable 4800000
## 6 383 Low Saxon " Poland" Vulnerable 4800000
## 7 383 Low Saxon " Russian Fe… Vulnerable 4800000
## 8 335 Belarusian "Belarus" Vulnerable 4000000
## 9 335 Belarusian " Latvia" Vulnerable 4000000
## 10 335 Belarusian " Lithuania" Vulnerable 4000000
## # ℹ 3,064 more rows
## # ℹ 2 more variables: Latitude <dbl>, Longitude <dbl>
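In the output above, every country after the first keeps a leading space (for example " Denmark"), and the split result is printed rather than saved. A minimal sketch of one way to handle both, using the same `strsplit()` and `unnest()` approach plus base R's `trimws()`. The data frame `toy_langs` and the name `toy_split` are made up for illustration and are not part of the dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for lang_data_selected
toy_langs <- data.frame(
  ID = c(383L, 1023L),
  Name.in.English = c("Low Saxon", "Sicilian"),
  Countries = c("Germany, Denmark, Poland", "Italy"),
  stringsAsFactors = FALSE
)

# Split on commas, expand to one row per country, strip the
# leftover whitespace, and assign the result for later use
toy_split <- toy_langs %>%
  mutate(Countries = strsplit(Countries, ",")) %>%
  unnest(Countries) %>%
  mutate(Countries = trimws(Countries))

toy_split
```

Each language keeps its ID on every row, so the split table can still be traced back to one entry per language.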
We are going to take a look at the distribution of extinct languages by latitude and longitude to see if we can find any correlations.
#Save a data frame containing only the extinct languages.
extinct_lang <- lang_data_selected %>%
  filter(Degree.of.endangerment == 'Extinct')
#Create a scatterplot showing the distribution of extinct languages by latitude and longitude. We plot solid circles (shape 19) in 'steelblue' with alpha = 0.3, so overlapping data points appear darker. We also add jitter and silently drop rows with missing values (na.rm = TRUE). Does there appear to be any kind of correlation?
ggplot(data = extinct_lang, aes(x = Latitude, y = Longitude)) +
  geom_point(position = 'jitter', shape = 19, alpha = .3, color = 'steelblue', na.rm = TRUE) +
  ggtitle("Distribution of Extinct Languages by Latitude and Longitude") +
  theme_minimal()
The scatterplot above shows that extinct languages do not follow a clear pattern along lines of latitude or longitude. I have opted not to fit a linear regression here because the relationship between the two variables does not appear to be linear. While there are concentrations in some areas, they do not indicate a trend one way or the other. This is important in challenging assumptions about cultures and languages within countries near or along the equator.
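The visual impression of no linear trend could also be checked numerically. The sketch below computes a Pearson correlation with base R's `cor()` on a made-up set of coordinates. `toy_coords` and its values are illustrative only and say nothing about the actual dataset.

```r
# Hypothetical coordinates standing in for extinct_lang
toy_coords <- data.frame(
  Latitude  = c(-33.9, 12.5, 45.1, NA, 60.2, -5.3),
  Longitude = c(151.2, -70.0, 7.6, 30.1, NA, 120.4)
)

# Pearson correlation, using only rows where both coordinates are present
r <- cor(toy_coords$Latitude, toy_coords$Longitude, use = "complete.obs")
r
```

A value near 0 would be consistent with the absence of a linear relationship, although it would not rule out clustering or other non-linear structure.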
Now let’s take a look at which countries have the largest number of endangered languages overall.
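#Filter out extinct languages by keeping rows with at least one recorded speaker. Rows where Number.of.speakers is NA are dropped as well.
endangered_lang <- lang_data_selected %>%
  filter(Number.of.speakers > 0)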
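#Count languages per Countries entry and sort by frequency to find the 5 entries that account for the most endangered languages.
#Countries is still the original unsplit column here, so only languages listed under a single country are counted for that country.
df <- endangered_lang %>%
  group_by(Countries) %>%
  summarize(Freq = n()) %>%
  arrange(desc(Freq)) %>%
  slice(1:5)
df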
## # A tibble: 5 × 2
## Countries Freq
## <chr> <int>
## 1 Brazil 171
## 2 United States of America 157
## 3 India 149
## 4 Mexico 141
## 5 Indonesia 136
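The counts above group on the `Countries` column as it appears in the raw data, so a language shared between several countries falls under its combined string instead of under each country. A sketch of how per-country counts could be computed after splitting, shown on toy data. `toy_endangered` and `toy_counts` are hypothetical names, and the numbers are not from the dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: three languages, two of them shared between countries
toy_endangered <- data.frame(
  ID = c(1L, 2L, 3L),
  Countries = c("Brazil, Peru", "Brazil", "Peru, Brazil, Colombia"),
  stringsAsFactors = FALSE
)

# One row per (language, country) pair, then count rows per country,
# sorted from most to fewest languages
toy_counts <- toy_endangered %>%
  mutate(Countries = strsplit(Countries, ",")) %>%
  unnest(Countries) %>%
  mutate(Countries = trimws(Countries)) %>%
  count(Countries, name = "Freq", sort = TRUE)

toy_counts
```

With this approach a language spoken in three countries contributes one count to each of them, which may or may not be the intended definition of "languages per country".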
#Plot a bar chart of the 5 countries with the most endangered languages. Freq is already a count per country, so we use geom_col(), which draws bar heights directly from the y values.
ggplot(df, aes(x = Countries, y = Freq)) +
  geom_col(color = 'darkred', fill = 'brown1') +
  theme_classic() +
  ggtitle("Highest Distribution of Endangered Languages Per Country")
We can see here that Brazil tops the list, followed by the USA. This is consistent with these countries’ histories of colonization and genocide. However, the sheer number of endangered languages in each of them might be surprising to those who are unaware of the linguistic diversity of these areas’ indigenous peoples, or who expected these languages to already be extinct. It’s important to demonstrate to students that the indigenous peoples of these lands are still fighting to preserve their languages and cultures, in spite of centuries of brutal imperialism.