New York Languages

The Languages of New York City is a dataset of languages spoken in the New York City area, maintained by the Endangered Language Alliance. It focuses on “Indigenous, minority, and endangered languages” and their communities.

ggplot(notempty) +
  geom_bar(aes(x = Language.Family, fill = Size)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  scale_y_continuous(breaks = seq(0, 600, by = 100), limits = c(0, 600))

The chart above shows the number of speaker communities for each language family, with the fill color representing the size of the communities. Indo-European is clearly the language family with the largest representation.

Let’s zoom in, however, on just one of the language families: Turkic!

Turkic Languages

fam_percent <- nyclang %>%
  group_by(Language.Family) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(Percentage = Count / sum(Count) * 100)
turk_percent <- fam_percent %>%
  filter(Language.Family == "Turkic")
print(turk_percent)

## # A tibble: 1 × 3
##   Language.Family Count Percentage
##   <chr>           <int>      <dbl>
## 1 Turkic             29       2.23

Turkic language communities make up just 2.2% (29 communities) of all the language communities represented in the dataset. What can we find out about these communities?

ggplot(turktrue) +
  geom_bar(aes(x = Language, fill = Size)) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))

The chart above is similar to the previous bar chart, but instead of language families it shows the count of speaker communities for each Turkic language in the NYC area. Most communities are of the “smallest” size. Turkish has the most communities, at 4, but these are only medium and small in size. Uzbek is the sole Turkic language with a “large” community.
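The counts read off the chart can also be tabulated directly. Below is a minimal sketch using count() from dplyr. The toy data frame and its rows are hypothetical stand-ins for turktrue, which is assumed to have the Language and Size columns used in the plot above.

```r
library(dplyr)

# Hypothetical stand-in for turktrue; these rows are illustrative only.
toy <- data.frame(
  Language = c("Lang A", "Lang A", "Lang B", "Lang C"),
  Size     = c("Small", "Medium", "Smallest", "Smallest")
)

# Communities per language, split by size (the numbers behind the stacked bars).
size_counts <- toy %>% count(Language, Size)

# Total communities per language, with the most communities first.
lang_counts <- toy %>% count(Language, sort = TRUE)

print(size_counts)
print(lang_counts)
```

Running the same two count() calls on turktrue should give the per-language totals without having to read them off the axis.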

turkmap_view <- turktrue %>% 
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>% 
  st_jitter(factor = 0.01) %>%
  mapview
turkmap_view

Here is an interactive map of the Turkic language communities. Each circle represents one community, and clicking on a circle shows the available data about that community and its language.

Most of the circles are clustered around Brooklyn, while the others are dispersed across the rest of the city and New Jersey. Is there a pattern as to which languages are clustered near each other?

qmplot(x = lon, y = lat, data = turktrue, maptype = "stamen_toner_lite",
       geom = "jitter", color = World.Region, size = 0.5, position = position_jitter(width = .01, height = .01)) +
  labs(title = "Language by World Region")

## Warning: `position` is deprecated.

## ℹ Using `zoom = 10`

## ℹ © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.

Here is the same map, but where the colors represent the region of origin of each language.

qmplot(x = lon, y = lat, data = turktrue, maptype = "stamen_toner_lite",
       geom = "jitter", color = Branch, size = .01, position = position_jitter(width = .01, height = .01)) +
  labs(title = "Languages by Branch")

## Warning: `position` is deprecated.

## ℹ Using `zoom = 10`

## ℹ © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.

And again, but this time the colors represent the branch of Turkic that each language belongs to.

Unfortunately, it is hard to make out any patterns from the maps above.

What happens if we group the communities by region of origin and calculate the mean distance between the communities within each group?

turk_sf <- st_as_sf(turktrue, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(turk_sf, turk_sf)
diag(geo_dist_matrix) <- NA
average_distance_turk <- mean(geo_dist_matrix, na.rm = TRUE)
turk_sd <- sd(geo_dist_matrix, na.rm = TRUE)

cenas_sf <- st_as_sf(central_asia, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(cenas_sf, cenas_sf)
diag(geo_dist_matrix) <- NA
average_distance_cenas <- mean(geo_dist_matrix, na.rm = TRUE)
cenas_sd <- sd(geo_dist_matrix, na.rm = TRUE)

ee_sf <- st_as_sf(eastern_europe, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(ee_sf, ee_sf)
diag(geo_dist_matrix) <- NA
average_distance_ee <- mean(geo_dist_matrix, na.rm = TRUE)
ee_sd <- sd(geo_dist_matrix, na.rm = TRUE)

ea_sf <- st_as_sf(eastern_asia, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(ea_sf, ea_sf)
diag(geo_dist_matrix) <- NA
average_distance_ea <- mean(geo_dist_matrix, na.rm = TRUE)
ea_sd <- sd(geo_dist_matrix, na.rm = TRUE)

se_sf <- st_as_sf(southern_europe, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(se_sf, se_sf)
diag(geo_dist_matrix) <- NA
average_distance_se <- mean(geo_dist_matrix, na.rm = TRUE)
se_sd <- sd(geo_dist_matrix, na.rm = TRUE)

wa_sf <- st_as_sf(western_asia, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(wa_sf, wa_sf)
diag(geo_dist_matrix) <- NA
average_distance_wa <- mean(geo_dist_matrix, na.rm = TRUE)
wa_sd <- sd(geo_dist_matrix, na.rm = TRUE)

avgdis_region <- bind_cols(average_distance_turk, average_distance_cenas, average_distance_ea, average_distance_ee, average_distance_wa)

## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`

avgdis_region <- avgdis_region %>%
  rename(all_turk = 1, central_asia = 2, eastern_asia = 3, eastern_europe = 4, western_asia = 5)

avgdis_region <- avgdis_region %>% pivot_longer(cols = everything(), names_to = "region", values_to = "average.distance")


average_distance_turksd <- bind_cols(average_distance_turk, turk_sd)

## New names:
## • `` -> `...1`
## • `` -> `...2`

average_distance_cenas <- bind_cols(average_distance_cenas, cenas_sd)

## New names:
## • `` -> `...1`
## • `` -> `...2`

average_distance_ee <- bind_cols(average_distance_ee, ee_sd)

## New names:
## • `` -> `...1`
## • `` -> `...2`

average_distance_ea <- bind_cols(average_distance_ea, ea_sd)

## New names:
## • `` -> `...1`
## • `` -> `...2`

average_distance_se <- bind_cols(average_distance_se, se_sd)

## New names:
## • `` -> `...1`
## • `` -> `...2`

average_distance_wa <- bind_cols(average_distance_wa, wa_sd)

## New names:
## • `` -> `...1`
## • `` -> `...2`

# Attach each group's standard deviation to its own row. The order matches
# the rows of avgdis_region: all_turk, central_asia, eastern_asia,
# eastern_europe, western_asia.
avgdis_region <- avgdis_region %>%
  mutate(sd = c(turk_sd, cenas_sd, ea_sd, ee_sd, wa_sd))

ggplot(avgdis_region) +
  geom_col(aes(x = region, y = average.distance, fill = region)) +
  coord_flip()

There is quite a difference! Turkic language communities of Eastern European origin live the nearest to each other, less than half as far apart as those of Western Asian origin, which are the furthest from each other. How about by language branch?

karluk_sf <- st_as_sf(karluk, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(karluk_sf, karluk_sf)
diag(geo_dist_matrix) <- NA
average_distance_karluk <- mean(geo_dist_matrix, na.rm = TRUE)

kipchak_sf <- st_as_sf(kipchak, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(kipchak_sf, kipchak_sf)
diag(geo_dist_matrix) <- NA
average_distance_kipchak <- mean(geo_dist_matrix, na.rm = TRUE)

oghur_sf <- st_as_sf(oghur, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(oghur_sf, oghur_sf)
diag(geo_dist_matrix) <- NA
average_distance_oghur <- mean(geo_dist_matrix, na.rm = TRUE)

oghuz_sf <- st_as_sf(oghuz, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(oghuz_sf, oghuz_sf)
diag(geo_dist_matrix) <- NA
average_distance_oghuz <- mean(geo_dist_matrix, na.rm = TRUE)

sayan_sf <- st_as_sf(sayan, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(sayan_sf, sayan_sf)
diag(geo_dist_matrix) <- NA
average_distance_sayan <- mean(geo_dist_matrix, na.rm = TRUE)

siberian_sf <- st_as_sf(siberian, coords = c("lon", "lat"), crs = 4326)
geo_dist_matrix <- st_distance(siberian_sf, siberian_sf)
diag(geo_dist_matrix) <- NA
average_distance_siberian <- mean(geo_dist_matrix, na.rm = TRUE)

avgdis_branch <- bind_cols(average_distance_turk, average_distance_karluk, average_distance_kipchak, average_distance_oghur, average_distance_oghuz, average_distance_sayan, average_distance_siberian)

## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`

avgdis_branch <- avgdis_branch %>%
  rename(all_turk = 1, karluk = 2, kipchak = 3, oghur = 4, oghuz = 5, sayan = 6, siberian = 7)

avgdis_branch <- avgdis_branch %>% pivot_longer(cols = everything(), names_to = "branch", values_to = "average.distance")

ggplot(avgdis_branch) +
  geom_col(aes(x = branch, y = average.distance, fill = branch)) +
  coord_flip()

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_col()`).

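There is not enough data for the Siberian, Sayan, and Oghur branches! Each is represented by only one language, so there is no pair of communities to measure a distance between, and the three bars are dropped from the chart. Among the remaining branches, Karluk communities are the furthest from each other, Kipchak communities are the nearest, and Oghuz communities fall in the middle of the two.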
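The per-group blocks above repeat the same steps for every region and branch, so they could be condensed into a single helper. The sketch below is one possible way to do that, not the code used for the charts: the function name mean_pairwise_distance and the toy data are hypothetical, and the coordinates are made up. It assumes a data frame with lon and lat columns, as in turktrue.

```r
library(sf)

# Hypothetical toy data standing in for turktrue; coordinates are made up.
toy <- data.frame(
  Branch = c("A", "A", "A", "B"),
  lon    = c(-73.95, -73.98, -74.00, -73.90),
  lat    = c(40.65, 40.70, 40.72, 40.75)
)

# Mean distance in metres between all pairs of points in df.
# A group with fewer than two points has no pair to measure, so return NA.
mean_pairwise_distance <- function(df) {
  if (nrow(df) < 2) return(NA_real_)
  pts <- st_as_sf(df, coords = c("lon", "lat"), crs = 4326)
  d <- st_distance(pts)
  # The matrix is symmetric, so the upper triangle holds each pair once.
  mean(as.numeric(d[upper.tri(d)]))
}

# One value per branch, returned as a named numeric vector.
by_branch <- sapply(split(toy, toy$Branch), mean_pairwise_distance)
print(by_branch)
```

Because the distance matrix is symmetric, averaging the upper triangle gives the same mean as averaging the full matrix with the diagonal set to NA. Splitting on World.Region instead of Branch would give the per-region values in the same way.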

In summary

Turkic languages make up just 2.2% of all language communities in the New York dataset. Of the Turkic languages spoken, Turkish has the highest number of communities, while Uzbek has the only “large” community. By region of origin, Eastern European Turkic communities live the nearest to each other and Western Asian communities the furthest. By branch, Kipchak communities live the nearest to each other and Karluk communities the furthest.

Reference

Perlin, Ross, Daniel Kaufman, Jason Lampel, Maya Daurio, Mark Turin, Sienna Craig, eds., Languages of New York City (digital version), map. New York: Endangered Language Alliance. (Available online at http://languagemap.nyc, Accessed on 2021-04-15.)