Language Diversity Dataset

Authors

Rashad Long

Biyag Dukuray

Original Dataset

Data Source: The Language Diversity Dataset, obtained from the untidydata repository.

Justification for Reshaping: The dataset is in an “untidy” long format: each country's measurements are stacked in a single Measurement/Value pair of columns instead of being stored one variable per column. Reshaping it to a wide format, with one row per country and one column per variable, makes it easier to analyze.

Variable Renaming: Several variables are also renamed for readability.

Tidy Dataset Features

Variable Definitions
Features	Description	Data Type
Continent	Name of the Continent	chr
Country	Name of the Country	chr
Languages	Number of Languages Spoken	num
Area	Area of land (km²)	num
Population	Population (Thousands)	num
Weather Stations	Number of weather stations used to calculate Mean Growth	num
Mean Growth	The Mean Growing Season (months)	num
Growth Deviation	The standard deviation of the Growing Season values from the different weather stations in that country	num

Data Import

The dataset ships with the untidydata package (installed via devtools) and is loaded directly from it. A copy can also be found in the repo.

library(untidydata)
library(tidyverse)

# Load the dataset
data(language_diversity)
str(language_diversity)

tibble [444 × 4] (S3: tbl_df/tbl/data.frame)
 $ Continent  : chr [1:444] "Africa" "Africa" "Oceania" "Asia" ...
 $ Country    : chr [1:444] "Algeria" "Angola" "Australia" "Bangladesh" ...
 $ Measurement: chr [1:444] "Langs" "Langs" "Langs" "Langs" ...
 $ Value      : num [1:444] 18 42 234 37 52 38 27 209 75 94 ...

head(language_diversity)

# A tibble: 6 × 4
  Continent Country    Measurement Value
  <chr>     <chr>      <chr>       <dbl>
1 Africa    Algeria    Langs          18
2 Africa    Angola     Langs          42
3 Oceania   Australia  Langs         234
4 Asia      Bangladesh Langs          37
5 Africa    Benin      Langs          52
6 Americas  Bolivia    Langs          38

Data Tidying and Transformation

The data was reshaped to a wider format with the pivot_wider() function, taking the new column names from the Measurement column and the cell values from the Value column. Several columns were then renamed for readability.

# Spread the data out using the pivot_wider function
wide_languages <- language_diversity |>
  pivot_wider(names_from = Measurement, values_from = Value)

# Change names of specific columns
wide_languages <- wide_languages |>
  rename(
    "Languages" = "Langs",
    "Weather Stations" = "Stations",
    "Mean Growth" = "MGS",
    "Growth Deviation" = "Std"
  )
str(wide_languages)

tibble [74 × 8] (S3: tbl_df/tbl/data.frame)
 $ Continent       : chr [1:74] "Africa" "Africa" "Oceania" "Asia" ...
 $ Country         : chr [1:74] "Algeria" "Angola" "Australia" "Bangladesh" ...
 $ Languages       : num [1:74] 18 42 234 37 52 38 27 209 75 94 ...
 $ Area            : num [1:74] 2381741 1246700 7713364 143998 112622 ...
 $ Population      : num [1:74] 25660 10303 17336 118745 4889 ...
 $ Weather Stations: num [1:74] 102 50 134 20 7 48 10 245 6 13 ...
 $ Mean Growth     : num [1:74] 6.6 6.22 6 7.4 7.14 6.92 4.6 9.71 5.17 8.08 ...
 $ Growth Deviation: num [1:74] 2.29 1.87 4.17 0.73 0.99 2.5 1.69 5.87 1.07 1.21 ...

head(wide_languages)

# A tibble: 6 × 8
  Continent Country Languages   Area Population `Weather Stations` `Mean Growth`
  <chr>     <chr>       <dbl>  <dbl>      <dbl>              <dbl>         <dbl>
1 Africa    Algeria        18 2.38e6      25660                102          6.6 
2 Africa    Angola         42 1.25e6      10303                 50          6.22
3 Oceania   Austra…       234 7.71e6      17336                134          6   
4 Asia      Bangla…        37 1.44e5     118745                 20          7.4 
5 Africa    Benin          52 1.13e5       4889                  7          7.14
6 Americas  Bolivia        38 1.10e6       7612                 48          6.92
# ℹ 1 more variable: `Growth Deviation` <dbl>
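The reshaping step can be illustrated on a tiny table. The sketch below is a minimal example, not part of the original analysis: it rebuilds four long-format rows by hand from values shown in the output above (Langs and Population for Algeria and Angola), checks that each Country/Measurement pair occurs only once, and then pivots. The names sample_long and sample_wide are hypothetical.

```r
library(tidyr)
library(tibble)

# Four long-format rows copied by hand from the output above
# (sample_long is a hypothetical name used only in this sketch)
sample_long <- tribble(
  ~Country,  ~Measurement, ~Value,
  "Algeria", "Langs",       18,
  "Algeria", "Population",  25660,
  "Angola",  "Langs",       42,
  "Angola",  "Population",  10303
)

# pivot_wider() needs one value per Country/Measurement pair;
# stop early if any pair is duplicated
stopifnot(!any(duplicated(sample_long[c("Country", "Measurement")])))

# One row per country, one column per measurement
sample_wide <- pivot_wider(sample_long,
                           names_from = Measurement,
                           values_from = Value)
print(sample_wide)
```

The same uniqueness check could be run on language_diversity before pivoting. With 444 long rows and 74 countries, each country contributes six measurements.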

Analysis

This analysis identifies the countries with the most languages per capita. To do so, we add a new column that divides each country's number of languages by its population. Because Population is recorded in thousands, the resulting value is the number of languages per thousand residents; this scaling does not change how the countries rank.

pivoted_data <-
  pivot_wider(language_diversity,
              names_from = Measurement,
              values_from = Value)

pivoted_data$Avg_Langs_Per_Person <-
  pivoted_data$Langs / pivoted_data$Population

print(pivoted_data)

# A tibble: 74 × 9
   Continent Country      Langs    Area Population Stations   MGS   Std
   <chr>     <chr>        <dbl>   <dbl>      <dbl>    <dbl> <dbl> <dbl>
 1 Africa    Algeria         18 2381741      25660      102  6.6   2.29
 2 Africa    Angola          42 1246700      10303       50  6.22  1.87
 3 Oceania   Australia      234 7713364      17336      134  6     4.17
 4 Asia      Bangladesh      37  143998     118745       20  7.4   0.73
 5 Africa    Benin           52  112622       4889        7  7.14  0.99
 6 Americas  Bolivia         38 1098581       7612       48  6.92  2.5 
 7 Africa    Botswana        27  581730       1348       10  4.6   1.69
 8 Americas  Brazil         209 8511965     153322      245  9.71  5.87
 9 Africa    Burkina Faso    75  274000       9242        6  5.17  1.07
10 Africa    CAR             94  622984       3127       13  8.08  1.21
# ℹ 64 more rows
# ℹ 1 more variable: Avg_Langs_Per_Person <dbl>
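As a hedged alternative to the `$` assignment above, the same ratio can be computed with dplyr's mutate(). The sketch below uses three rows copied by hand from the printed output and makes the unit explicit. The names sample_rows, Langs_Per_Thousand and Langs_Per_Million are hypothetical and do not appear in the original analysis.

```r
library(dplyr)
library(tibble)

# Three rows copied by hand from the output above; Population is in thousands
sample_rows <- tibble(
  Country    = c("Algeria", "Angola", "Botswana"),
  Langs      = c(18, 42, 27),
  Population = c(25660, 10303, 1348)
)

sample_rates <- sample_rows |>
  mutate(
    # Same ratio as Avg_Langs_Per_Person: languages per thousand residents
    Langs_Per_Thousand = Langs / Population,
    # Hypothetical rescaled column: languages per million residents
    Langs_Per_Million  = Langs / Population * 1000
  )

print(sample_rates)
```

Rescaling only changes the unit, so the ordering of countries is the same under either column.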

Conclusion

Based on this analysis, the countries with the most languages per capita are Vanuatu, Papua New Guinea, French Guiana, Suriname and Gabon. This ranking comes from sorting the data in descending order of the Avg_Langs_Per_Person column; the chart shows the top 10 countries.

#| warning: false

sorted_data <-
  pivoted_data[order(-pivoted_data$Avg_Langs_Per_Person),]

top_10 <- head(sorted_data, 10)

ggplot(top_10, aes(x = reorder(Country, Avg_Langs_Per_Person), y = Avg_Langs_Per_Person)) +
  geom_col(fill = "skyblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Top 10 Countries with the Highest Average Languages per Capita",
       x = "Country",
       y = "Average Languages per Capita")
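The sort-then-head() selection above could also be written with dplyr's slice_max(), which keeps the n rows with the largest values of a column. This is a minimal sketch on three hand-copied rows, not the authors' code; sample_rows and top_2 are hypothetical names, and n = 2 is used only because the sample is small.

```r
library(dplyr)
library(tibble)

# Three rows copied by hand from the earlier output; Population is in thousands
sample_rows <- tibble(
  Country    = c("Algeria", "Angola", "Botswana"),
  Langs      = c(18, 42, 27),
  Population = c(25660, 10303, 1348)
) |>
  mutate(Avg_Langs_Per_Person = Langs / Population)

# Keep the two rows with the largest ratio (n = 10 would mirror top_10)
top_2 <- sample_rows |>
  slice_max(Avg_Langs_Per_Person, n = 2)

print(top_2)
```

In this sample Botswana and Angola are kept, because their ratios (27 / 1348 and 42 / 10303) are larger than Algeria's (18 / 25660).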