New York City is a densely multilingual city, and many residents speak languages other than English at home. Limited English Proficiency (LEP) speakers may face challenges in accessing services and information. This analysis visualizes the concentration of LEP speakers across NYC community districts through an interactive choropleth map.
Before beginning this analysis, I set up my environment by installing and loading the necessary packages.
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
install.packages("tmap", repos = "http://cran.us.r-project.org")
install.packages("sf", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(dplyr)
library(readr)
library(tmap)
library(sf)
I worked with two datasets for this analysis. The first dataset contains information on the LEP population by community district, while the second dataset provides the total population for each community district.
The first dataset is titled “Population and Languages of the Limited English Proficient LEP Speaker by Community District” and was made available by the NYC Civic Engagement Commission (CEC). It is available on NYC OpenData.
languages_of_lep_community_district <- read.csv("Population_and_Languages_of_the_Limited_English_Proficient__LEP__Speakers_by_Community_District_2025.csv")
head(languages_of_lep_community_district)
##     borough borough_community_district_code    community_district_name
## 1 Manhattan                             101 Battery Park City, Tribeca
## 2 Manhattan                             101 Battery Park City, Tribeca
## 3 Manhattan                             101 Battery Park City, Tribeca
## 4 Manhattan                             101 Battery Park City, Tribeca
## 5 Manhattan                             101 Battery Park City, Tribeca
## 6 Manhattan                             101 Battery Park City, Tribeca
##           language lep_population pct_lep_pop cvalep_pop pct_cvalep_pop
## 1        Afrikaans              0         0.0          0              0
## 2 Akan (incl. Twi)              0         0.0          0              0
## 3         Albanian              7         0.2          0              0
## 4  Aleut languages              0         0.0          0              0
## 5          Amharic              0         0.0          0              0
## 6 Apache languages              0         0.0          0              0
The second dataset is titled “New York City Population By Community Districts” and was made available by the Department of City Planning (DCP). It is also available on NYC OpenData. Note that the data was filtered before downloading to include only population counts from the 2010 Census, the most recent year available in this dataset at the time of this analysis.
cleaned_population_by_cd <- read.csv("cleaned_population_by_cd.csv") %>%
mutate(population_2010 = parse_number(population_2010))
head(cleaned_population_by_cd)
##   Borough CD.Number borough_community_district_code
## 1       2         1                             201
## 2       2         2                             202
## 3       2         3                             203
## 4       2         4                             204
## 5       2         5                             205
## 6       2         6                             206
##              community_district_name population_2010
## 1   Melrose, Mott Haven, Port Morris           91497
## 2              Hunts Point, Longwood           52246
## 3      Morrisania, Crotona Park East           79762
## 4      Highbridge, Concourse Village          146441
## 5 University Hts., Fordham, Mt. Hope          128200
## 6              East Tremont, Belmont           83268
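The `parse_number()` call above suggests that `population_2010` was read in as text, for example with thousands separators. That is an assumption, since the raw file is not shown here. A minimal sketch, using made-up strings, of what the conversion does:

```r
library(readr)

# Hypothetical raw values, assumed to be stored as text with comma separators
raw_population <- c("91,497", "52,246", "146,441")

# parse_number() drops the grouping marks and returns a numeric vector
population_numeric <- parse_number(raw_population)
population_numeric
```

Converting to a numeric type at load time matters here because the column is later used as the denominator of a division.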
To find the community districts with the highest concentration of LEP speakers, I first calculated the total LEP population for each community district and saved this result as a new dataframe.
total_lep_populations <- languages_of_lep_community_district %>%
group_by(borough_community_district_code) %>%
summarise(total_lep = sum(lep_population, na.rm = TRUE)) %>%
arrange(desc(total_lep))
View(total_lep_populations)
Next, I joined the total LEP populations dataframe with the population by community districts dataframe to calculate the concentration of LEP speakers in each community district. Both datasets share the column “borough_community_district_code”, which serves as the join key. Note that the original DCP dataset did not store the borough community district codes in the three-digit format used by the CEC dataset. The DCP dataset used here was pre-cleaned in Excel to add this formatting.
cd_population_totals <- cleaned_population_by_cd
populations_merged_table <- inner_join(total_lep_populations, cd_population_totals, by = "borough_community_district_code")
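Because `inner_join()` silently drops rows whose key has no match in the other table, it can be worth checking which districts fall out of the merge. The sketch below is not part of the original analysis; it uses small made-up tables (`lep_totals` and `cd_population` are hypothetical names) and `anti_join()` to list unmatched keys on each side.

```r
library(dplyr)

# Toy stand-ins for the two tables (values are illustrative, not real data)
lep_totals <- tibble(
  borough_community_district_code = c(101, 102, 201),
  total_lep = c(500, 1200, 30000)
)
cd_population <- tibble(
  borough_community_district_code = c(101, 201, 301),
  population_2010 = c(60000, 91000, 150000)
)

# Rows in the LEP table with no matching population record
unmatched_lep <- anti_join(lep_totals, cd_population,
                           by = "borough_community_district_code")

# Rows in the population table with no matching LEP record
unmatched_pop <- anti_join(cd_population, lep_totals,
                           by = "borough_community_district_code")

unmatched_lep$borough_community_district_code
unmatched_pop$borough_community_district_code
```

If both results are empty on the real data, the inner join kept every district.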
From here, I was able to calculate the percentage of LEP speakers out of the total population for each community district.
percentage_of_lep_by_cd <- populations_merged_table %>%
  # Each row is already one community district, so no grouping is needed
  mutate(lep_concentration = total_lep / population_2010) %>%
  arrange(desc(lep_concentration))
View(percentage_of_lep_by_cd)
This calculation produced a notable result. Although Sunset Park, Windsor Terrace in Brooklyn ranks sixth in total LEP population, it has the highest concentration of LEP speakers relative to total population of any community district: nearly 50% of its residents are LEP New Yorkers.
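To illustrate why a ranking by total LEP population can differ from a ranking by concentration, here is a small sketch with invented numbers. The district names and values are hypothetical and are not taken from the datasets.

```r
library(dplyr)

# Illustrative numbers only; district names are hypothetical
toy <- tibble(
  district = c("District A", "District B", "District C"),
  total_lep = c(60000, 45000, 20000),
  population_2010 = c(200000, 100000, 80000)
)

ranked <- toy %>%
  mutate(
    lep_concentration = total_lep / population_2010,
    rank_by_total = min_rank(desc(total_lep)),
    rank_by_share = min_rank(desc(lep_concentration))
  ) %>%
  arrange(rank_by_share)

ranked
```

In this toy table, District B has fewer LEP residents than District A but a higher share, so it moves to the top once the totals are divided by population.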
The datasets I have been working with so far do not contain any geographic information. To create a map visualization, I needed to load a shapefile of NYC Community Districts that contains the geographic boundaries for each community district. The shapefile I used was downloaded from ArcGIS Hub.
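The chunk that read the shapefile is not echoed in this report; only its console output appears. Here is a minimal sketch of what such a call might look like, assuming the file sits in the working directory and that `sf::st_read()` was used. The "Reading layer" message is consistent with that function, but the exact call is not shown in the source. `read_cd_shapes()` is a hypothetical helper introduced only for illustration.

```r
library(sf)

# Hypothetical helper: read a community district shapefile from `path`
read_cd_shapes <- function(path) {
  stopifnot(file.exists(path))
  st_read(path)
}

# Assumed relative location of the shapefile; adjust to where the file is stored
shp_path <- "NYC_Community_Districts.shp"
if (file.exists(shp_path)) {
  nyc_cd_shapes <- read_cd_shapes(shp_path)
}
```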
## Reading layer `NYC_Community_Districts' from data source
## `/Users/oliviamignone/Documents/Data Analysis/GoogleCapstone/R Markdown/NYC_Community_Districts.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 71 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120128.4 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
To prepare for merging, I added a column named “borough_community_district_code” to the shapefile dataset that holds the same information as the “BoroCD” column. This way, I could use this column as a join key with the same column in the percentage_of_lep_by_cd dataframe.
nyc_cd_shapes <- nyc_cd_shapes %>%
  # Copy BoroCD into a character column named to match the LEP dataframe's key
  mutate(borough_community_district_code = as.character(BoroCD))
Next, I checked whether the “borough_community_district_code” columns in both datasets were formatted the same way. In the shapefile dataset, this column was formatted as a character type, while in the LEP concentration dataframe, it was formatted as a double type. I converted the LEP concentration dataframe’s column to character format to enable the join.
percentage_of_lep_by_cd <- percentage_of_lep_by_cd %>%
mutate(borough_community_district_code = as.character(borough_community_district_code))
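The type check described above can be sketched on toy data. The tables below are invented for illustration (`shapes_key` and `lep_key` are hypothetical names). They show that dplyr refuses to join a character key to a numeric key, and that the join works once the types agree.

```r
library(dplyr)

# Toy tables whose key columns have different types (illustrative values)
shapes_key <- tibble(borough_community_district_code = c("101", "102"))
lep_key <- tibble(borough_community_district_code = c(101, 102),
                  lep_concentration = c(0.05, 0.12))

# Compare the key types before joining
class(shapes_key$borough_community_district_code)  # "character"
class(lep_key$borough_community_district_code)     # "numeric"

# Joining mismatched types raises an error in dplyr; capture it for illustration
join_attempt <- tryCatch(
  left_join(shapes_key, lep_key, by = "borough_community_district_code"),
  error = function(e) "join failed: incompatible key types"
)

# After converting the numeric key to character, the join succeeds
lep_key_chr <- lep_key %>%
  mutate(borough_community_district_code = as.character(borough_community_district_code))
joined <- left_join(shapes_key, lep_key_chr, by = "borough_community_district_code")
```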
Now, the LEP concentration dataframe could be successfully joined with the shapefile dataset. I also transformed the lep_concentration column to a percentage format and filtered out any community districts that did not have LEP data.
nyc_cd_lep_map <- nyc_cd_shapes %>%
left_join(percentage_of_lep_by_cd, by = "borough_community_district_code") %>%
mutate(lep_percent = lep_concentration * 100) %>%
filter(!is.na(total_lep))
Finally, I created an interactive choropleth map using the tmap package to visualize the concentration of LEP speakers by community district in NYC. The map displays community districts shaded according to their percentage of LEP population, with a legend and pop-up information for each district.
tmap_mode("view")
# tmap v4 syntax: class breaks, legend, and pop-ups are all set on one polygon layer.
# A separate tm_fill() would add a second layer without applying the breaks to this one.
tm_shape(nyc_cd_lep_map) +
  tm_polygons(fill = "lep_percent",
              fill.scale = tm_scale_intervals(breaks = c(0,5,10,15,20,25,30,35,40,45,Inf)),
              fill.legend = tm_legend(title = "% LEP Population",
                                      orientation = "landscape"),
              id = "community_district_name",
              popup.vars = c("Community District Code: " = "borough_community_district_code",
                             "Total Population: " = "population_2010",
                             "Total LEP Population: " = "total_lep",
                             "% LEP Population: " = "lep_percent")) +
  tm_title("LEP Concentration by NYC Community District") +
  tm_layout(frame = FALSE)
The interactive pop-ups provide additional context and data for each district, including total population, total LEP population, and % LEP population. This map provides a clear visual representation of the concentration of LEP speakers across NYC community districts. Community districts with higher percentages of LEP speakers are shaded more darkly, allowing for easy identification of areas with significant LEP populations.
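To see how the break points used in the map code divide percentages into shading classes, the following base R sketch bins a few invented values with `cut()`. It assumes left-closed intervals such as [5,10). That may match tmap's default, but this has not been verified here and is worth checking against the tmap documentation.

```r
# Break points copied from the map code
breaks <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, Inf)

# Illustrative percentages (not real district values)
lep_percent <- c(2.3, 5, 17.8, 49.6)

# right = FALSE makes each bin closed on the left and open on the right
bins <- cut(lep_percent, breaks = breaks, right = FALSE)
table(bins)
```

Under this assumption, a district at exactly 5% falls in the second class rather than the first, and anything at 45% or above lands in the final open-ended class.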