Bethlehem Broad St. at Christmas
Over this past winter break, I had the chance to read Chloe Taft’s From Steel to Slots: Casino Capitalism in the Post-Industrial City (2016). In this book, Taft offers a narrative of how local landscapes and ethnographies are transformed through the rise and fall of labor markets as they respond to globally driven shifts of capital and economies. Existing as the case study in this larger context of economic and social change, Taft writes specifically on the city of Bethlehem which is a relatively small post-industrial city on the eastern side of Pennsylvania in the Lehigh Valley. Underneath this larger narrative and serving as the main influence for my term project, Taft documents Bethlehem’s early to mid-20th century settlement pattern that saw segregation manifest across foreign-born status, religion, and income (Bethlehem’s Bureau of Planning and Development, 1967).
Imitative of recent planning theory that illustrates how markets, personal preferences, policies, technology, and other societal regimes have led to repeated socio-economic structures and their spatial separation, I initially set out to build a model that asked if the demographics presented by Taft could predict or show similar characteristics with the city’s existing settlement pattern. I intended to build a regression model that used the latest census data (2018 American Community Survey or ACS) containing foreign-born populations by Bethlehem census-tracts as my dependent variable. For my explanatory variables containing other demographics, I planned to use 1960 decennial census data as it was the year of Bethlehem’s peak population during its industrial period.
Unfortunately, due to incongruent census tracts between the 1960 decennial census and today’s ACS, the data was incompatible with a longitudinal regression study [Figure 6. Centroids of 1960 Decennial Census Tracts Census Tracts on top of 2018 ACS-5 Year Census Tracts]. As a result, my analysis pivoted to a different form of longitudinal study, where I assessed the relationship between the same demographics of interest spatially over time using decennial census tract data from 1970-2000 on 2010 geographies as they were the most congruent spatially.
For the greater Lehigh region, the coupled growth of railroad transport and access to the anthracite coal mines in the Lehigh Valley saw emerging industrial industries and their attached immigrant workforce move into the region in the latter half of the 19th century (Taft, 2016). Coming mostly from the Eastern and Southern reaches of Europe, the influx of immigrants during this formative time were a response to the exponential growth and demand for steel. Founded in 1857 as the Saucona Iron Company and later in 1899 as the Bethlehem Steel Corporation, the city’s steel plant and foreign-born laborers were the main economic drivers of the city (Taft, 2016). One of the major segregation patterns that unfolded during this time was the immense migration of immigrants to the south side of Bethlehem (Philadelphia Industrial Publish Company, 2018). Seen through the tables and maps presented in Bethlehem’s Bureau of Planning and Development’s (BBPD) Community Renewal Program Reports, Bethlehem’s greater majority of incoming immigrants settled south of the east-west flowing Lehigh River. As explained in Taft (2016), this can be explained by the city’s original settlers (the Moravian Church) conditional allowance of steel manufacturing in the city. The Moravian settlers planned for industry and its immigrant-majority blue-collar workforce to exist only south of the Lehigh River, across from the `quaint` and `proper` Moravian-preserved Northside. This is characteristic of the segregation pattern that became visible across most industrialized cities of this time, where heavy concentrations of European immigrants settled into extremely dense neighborhoods due to the social regimes that set up the unique proximity between these emerging foreign-born neighborhoods and the factories/plants they were seeking employment from (Taft, 2016).
As I learned more about the data found in the BBPD Community Renewal Program Reports, I found their maps and tables to be a meaningful way to visualize patterns of socio-economic segregation in the city. To quickly summarize their purpose, these reports were a response to the general trend of urban decay and poor living conditions that were associated with inner cities across the US at the time (1950s-1970s), and they presented demographic trends and physical characteristics of Bethlehem neighborhoods as they investigated potential areas of the city for redevelopment. A key part of these reports and serving as the base for which my own study looks to imitate, are their tables that present longitudinal change of foreign-born populations across the city’s wards from 1920 to 1960 [Figure 1. Table X Foreign Born Population by Wards].
Looking to recreate this foreign-born study for the subsequent decades (1970-2000), my report builds on BBPD’s analysis through a wider array of demographic metrics as they relate to the segregation narrative described in From Steel to Slots (2016). Unable to find or recreate the Bethlehem wards used in the Community Renewal Program Reports, I used Pennsylvania census tract data from 1970-2000 as well as by wrangling raw text from online websites to create needed explanatory data concerning other demographics. This study examines these demographic variables [Table 1. Variables of Interest (by Census Tract)] spatially over time, through exploratory data analysis, and later through calculating the dissimilarity index of Foreign Born versus U.S. Born populations to measure their segregation across the scale of Bethlehem as well as other major Pennsylvanian cities for comparison.
YEAR = Decennial Census Year
POP = Population
AREA_SQMI = Square Mile Area of Census Tract
Dependent Var F_BORN = Foreign Born Population
Dependent Var P_F_BORN = Percent Foreign Born
N_POV = Total Population used for Decennial Census Income to Poverty Ratio
POV_POOR = Poor - Population with Income to Poverty Line Ratio less than 1.0
PCT_POOR = (POV_POOR / N_POV) * 100
POV_STRG = Struggling - Population with Income to Poverty Line Ratio between 1.0 & 2.0
PCT_STRG = (POV_STRG / N_POV) * 100
POV_OKAY = Okay - Population with Income to Poverty Line Ratio greater than 2.0
Locality = City that census tract resides in.
‘Bethlehem’ = Tracts within 5 miles of Bethlehem City Centroid
‘Philadelphia’ = Tracts within 5 miles of Philadelphia City Centroid
‘Pittsburgh’ = Tracts within 5 miles of Pittsburgh City Centroid
‘Other’ = All other census tracts
N_Tracts = Number of tracts within a census tract - always 1
Cath_Chur = Number of Catholic churches within a census tract
CATH_BUFF = Catholic Buffer (Within a half mile of an Existing Catholic Church)
DISS_INDEX = Dissimilarity Index Score (Range of 0 - 1)
Thinking first on the correlation between our dependent variable and explanatory variables, across census tracts, I expected to see direct relationships with our foreign born dependent variables among tracts with higher populations, higher amounts of ‘Poor’ and ‘Struggling’ populations, higher amounts of Catholic Churches, and lower square mile areas. Understanding that most foreign born persons settle near the largest labor markets, tracts within our localities of interest are expected to have some of the highest numbers of foreign born populations. However, there should still exist other outliers across the “Other” locality, because I did not capture every substantive PA labor market (Lancaster, Scranton, Harrisburg, etc.). Looking specifically at change in our foreign born dependent variables over time, I expect to see higher clusters of foreign born population (aka more outliers and higher values) due to theory and research presented in Taft (2016). It’s been found that poverty rates have increased in the city of Bethlehem, and because of this, I deduce that this is a growing trend for other localities and is indicative also of higher foreign born populations.
Thinking now on the dissimilarity index, across the four different localities (Bethlehem, Philadelphia, Pittsburgh, Other), due to the way the dissimilarity index is calculated, I expected to see higher dissimilarity indices among localities containing merely more tracts, and by virtue of their greater abundance of tracts, these localities should also contain a greater abundance of P_F_BORN outliers. Outliers are representative of unusually high amounts of foreign born populations. These are expected to be in our smallest AREA_SQMI census tracts due to the theory behind the Modifiable Areal Unit Problem (MAUP). While foreign born outliers can be read as simply unique cases of higher count, because they are aggregated across census tract groups, the methodology behind census tract grouping could lead to biased results.
To explain further, when going back to the ‘smallest AREA_SQMI’ hypothesis, localities with merely more tracts to partition (separate and make smaller), the greater the chance of these tracts being partitioned in a way that pits foreign born populations from U.S. Born populations when they could actually exist in close proximity. Tracts with relatively high ‘AREA_SQMI’, if nearby a substantive PA labor market, are expected to dilute this chance. On the opposite side, if I increased our spatial resolution from census tract to county, there would be much heavier dilution resulting in an even more biased dissimilarity index.
To remove any MAUP bias, instead of using the polygon feature layer used through census tracts, each person/survey would have to be calculated as one feature point layer (containing the coordinates of their address). While I cannot remove this bias, I can at least use P_F_BORN instead of F_BORN to account for the different scales of population across tracts.
Over the course of the semester, I found RMarkdown to be a compelling challenge and opportunity for me to grow further in my programming skills. I decided on RMarkdown to be the main avenue for narrating my Final Reports, because it pairs together my writing, my code, and my outputs, offers a number of interactive features, such as clickable tabs and scroll-able tables, and it also best suits animated plots, which is one of my most consistent mediums for showing change over time within my variables. Additionally, I plan to publish it’s HTML formatted output to R-Pubs for cloud-saving and sharing purposes.
I gathered demographic data concerning Population, Foreign Born Population, Poverty, and Census Tract Area through NHGIS and Social Explorer, using the decennial census for 1970, 1980, 1990, and 2000. The only spatial data directly sourced came from the Census TIGER/Line shapefile offered by NHGIS. For my Catholic explanatory variables, I pulled raw text from the masstime.us website, which is purposed to help users find a Catholic Mass near them in ‘as simple as a few clicks’. Catholic church information includes address, their offered Mass times by day of week, confession times, as well as other information.
Needed Libraries
library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(scales)
library(ggpubr)
library(ggcorrplot)
library(coefplot)
library(rsample)
library(ggpubr)
library(jtools)
library(gridExtra)
library(car)
library(zoo)
library(tmap)
library(gifski)
library(gganimate)
library(ggthemes)
library(ggmap)
library(seg)
library(viridis)
library(ggthemes)
library(seg)
palette5 <- c("#25CB10", "#5AB60C", "#8FA108", "#C48C04", "#FA7800")
source("./data/functions.r")Above is the collection of packages used for the wrangling of data and its subsequent analysis. Most critical, is the “tidyverse” package which we’ve discussed and worked our way through extensively this semester.
Below is the wrangling of my initial longitudinal analysis. It represents a plethora of foreign-born demographics which were ultimately not setup for comparing over time due to the inconsistencies of the data collected by each Decennial Survey.
#Original Idea / Mutating
#reading in sf
us_census_tract_1960 <- st_read("./data/US_tract_1960.shp")
us_census_tract <- us_census_tract_1960 %>% mutate_at(c(1:2, 5:6), as.numeric)
pa_census_tract_sf <- us_census_tract %>%
filter(us_census_tract$NHGISST == 420)
#reading in csv
PA_CSV <- as_tibble(read_csv("./data/nhgis1960_tract.csv"))
PA_CSV_2 <- PA_CSV %>%
select(-c(7:9))
#renaming and concatenating vars
PA_CSV_2 <- PA_CSV_2 %>%
dplyr::arrange(`GISJOIN`)
#renaming of columns to appropriate names
PA_CSV_2 <- PA_CSV_2 %>%
rename("POP" = 8) %>%
rename("AREA" = 9) %>%
rename("TOT_E" = 10) %>%
rename("TECNICAL_E" = 11) %>%
rename("FARM_E" = 12) %>%
rename("MANAGE_E" = 13) %>%
rename("CLERICAL_E" = 14) %>%
rename("SALES_E" = 15) %>%
rename("FOREMAN_E" = 16) %>%
rename("OPERATVE_E" = 17) %>%
rename("PRIV_E" = 18) %>%
rename("SERVIC_E" = 19) %>%
rename("FARM_2_E" = 20) %>%
rename("LABOR_E" = 21) %>%
rename("OTHER_E" = 22) %>%
rename("POP_FOREIG" = 24) %>%
rename("POP_UK" = 28) %>%
rename("POP_IRE" = 29) %>%
rename("POP_NOR" = 30) %>%
rename("POP_SWEDE" = 31) %>%
rename("POP_GERM" = 32) %>%
rename("POP_POL" = 33) %>%
rename("POP_CZECH" = 34) %>%
rename("POP_AUSTRI" = 35) %>%
rename("POP_HUN" = 36) %>%
rename("POP_USSR" = 37) %>%
rename("POP_ITALY" = 38) %>%
rename("POP_CAN" = 39) %>%
rename("POP_MEX" = 40) %>%
rename("POP_OTHER" = 41) %>%
rename("H_UNITS" = 42) %>%
rename("H_1950" = 43) %>%
rename("H_1949" = 44) %>%
rename("H_1940" = 45)
#further filtering and mutating
PA_CSV_2 <- PA_CSV_2 %>%
select(!contains("SE_T"))
PA_CSV_3 <- PA_CSV_2 %>%
mutate("NOTFARM" = FARM_E + FARM_2_E) %>%
select(!contains("FARM_")) %>%
rename("FARM_E" = "NOTFARM") %>%
relocate("FARM_E", .before = "MANAGE_E")
#further filtering and mutating
PA_CSV_4 <- PA_CSV_3 %>%
mutate("PCT_FOREIG" = (POP_FOREIG / POP) * 100) %>%
relocate("PCT_FOREIG", .after = "POP_FOREIG") %>%
mutate("PCT_UK" = (POP_UK / POP) * 100) %>%
relocate("PCT_UK", .after = "POP_UK") %>%
mutate("PCT_IRE" = (POP_IRE / POP) * 100) %>%
relocate("PCT_IRE", .after = "POP_IRE") %>%
mutate("PCT_NOR" = (POP_NOR / POP) * 100) %>%
relocate("PCT_NOR", .after = "POP_NOR") %>%
mutate("PCT_SWEDE" = (POP_SWEDE / POP) * 100) %>%
relocate("PCT_SWEDE", .after = "POP_SWEDE") %>%
mutate("PCT_GERM" = (POP_GERM / POP) * 100) %>%
relocate("PCT_GERM", .after = "POP_GERM") %>%
mutate("PCT_POL" = (POP_POL / POP) * 100) %>%
relocate("PCT_POL", .after = "POP_POL") %>%
mutate("PCT_CZECH" = (POP_CZECH / POP) * 100) %>%
relocate("PCT_CZECH", .after = "POP_CZECH") %>%
mutate("PCT_AUSTRI" = (POP_AUSTRI / POP) * 100) %>%
relocate("PCT_AUSTRI", .after = "POP_AUSTRI") %>%
mutate("PCT_HUN" = (POP_HUN / POP) * 100) %>%
relocate("PCT_HUN", .after = "POP_HUN") %>%
mutate("PCT_USSR" = (POP_USSR / POP) * 100) %>%
relocate("PCT_USSR", .after = "POP_USSR") %>%
mutate("PCT_ITALY" = (POP_ITALY / POP) * 100) %>%
relocate("PCT_ITALY", .after = "POP_ITALY") %>%
mutate("PCT_CAN" = (POP_CAN / POP) * 100) %>%
relocate("PCT_CAN", .after = "POP_CAN") %>%
mutate("PCT_MEX" = (POP_MEX / POP) * 100) %>%
relocate("PCT_MEX", .after = "POP_MEX") %>%
mutate("PCT_OTHER" = (POP_FOREIG / POP) * 100) %>%
relocate("PCT_OTHER", .after = "POP_OTHER")
#further filtering and mutating
PA_CSV_5 <- PA_CSV_4 %>%
rename("FORE" = "POP_FOREIG") %>%
select(!contains("POP_")) %>%
rename("POP_FOREIG" = "FORE")
#further filtering and mutating
PA_CSV_6 <- PA_CSV_5 %>%
mutate("PCT_1950" = (H_1950 / H_UNITS) * 100) %>%
relocate("PCT_1950", .after = "H_1950") %>%
mutate("PCT_1949" = (H_1949 / H_UNITS) * 100) %>%
relocate("PCT_1949", .after = "H_1949") %>%
mutate("PCT_1940" = (H_1940 / H_UNITS) * 100) %>%
relocate("PCT_1940", .after = "H_1940") %>%
rename("house" = "H_UNITS") %>%
select(!contains("H_")) %>%
rename("H_UNITS" = "house")
#further filtering and mutating
PA_CSV_7 <- PA_CSV_6 %>%
mutate("PCT_E" = (TOT_E / POP) * 100) %>%
relocate("PCT_E", .after = "TOT_E") %>%
mutate("PCT_TECH" = (TECNICAL_E / TOT_E) * 100) %>%
relocate("PCT_TECH", .after = "TECNICAL_E") %>%
mutate("PCT_FARM" = (FARM_E / TOT_E) * 100) %>%
relocate("PCT_FARM", .after = "FARM_E") %>%
mutate("PCT_MANAG" = (MANAGE_E / TOT_E) * 100) %>%
relocate("PCT_MANAG", .after = "MANAGE_E") %>%
mutate("PCT_CLER" = (CLERICAL_E / TOT_E) * 100) %>%
relocate("PCT_CLER", .after = "CLERICAL_E") %>%
mutate("PCT_SALS" = (SALES_E / TOT_E) * 100) %>%
relocate("PCT_SALS", .after = "SALES_E") %>%
mutate("PCT_FORMAN" = (FOREMAN_E / TOT_E) * 100) %>%
relocate("PCT_FORMAN", .after = "FOREMAN_E") %>%
mutate("PCT_OPERAT" = (OPERATVE_E / TOT_E) * 100) %>%
relocate("PCT_OPERAT", .after = "OPERATVE_E") %>%
mutate("PCT_PRIV" = (PRIV_E / TOT_E) * 100) %>%
relocate("PCT_PRIV", .after = "PRIV_E") %>%
mutate("PCT_SERV" = (SERVIC_E / TOT_E) * 100) %>%
relocate("PCT_SERV", .after = "SERVIC_E") %>%
mutate("PCT_LABOR" = (LABOR_E / TOT_E) * 100) %>%
relocate("PCT_LABOR", .after = "LABOR_E") %>%
mutate("PCT_OTHER" = (OTHER_E / TOT_E) * 100) %>%
relocate("PCT_OTHER", .after = "OTHER_E") %>%
rename("house" = "TOT_E") %>%
select(!contains("_E")) %>%
rename("TOT_E" = "house")
#write_csv(PA_CSV_7, "./data/PA_1960_F.csv", append = TRUE)
############
# 1960 shps
# Read in coordinates for census tracts
census.sf <- st_read("./data/US_tract_1960.shp")
st_crs(census.sf)
census.sf <- census.sf %>%
dplyr::arrange(GISJOIN)
PA_tracts <- inner_join(census.sf, PA_CSV_7, by = "GISJOIN")
census.1960.sf <- PA_tracts %>%
select(-1,
-2,
-5,
-6)
#reading in 2018 csv
another_CSV <- as_tibble(read_csv("./data/2018_dep.csv"))
another_CSV <- another_CSV %>%
filter(STATE == "Pennsylvania") %>%
select(1,
5,
36:38) %>%
mutate(PCT_F_2018 = (AJ7VE003 / AJ7VE001) *100) %>%
rename("POP_F_2018" = AJ7VE003)
#####################
#reading in 2018 shapefile census'
census.2018 <- st_read("./data/US_tract_2018.shp")
st_crs(census.2018)
census.2018 <- census.2018 %>%
filter(STATEFP == 42)
census.2018_ <- inner_join(another_CSV, census.2018, by = "GISJOIN")
census.2018__ <- census.2018_ %>%
select(1,
5,
6,
21)
census.2018.sf <- st_as_sf(census.2018__)
st_crs(census.2018.sf)
# using sf
census_cent_2018 <- st_centroid(census.2018.sf)
st_crs(census_cent_2018)st_crs(census.1960.sf)
# using sf
census_cent_1960 <- st_centroid(census.1960.sf)
st_crs(census_cent_1960)
nrow(census.1960.sf)
nrow(census_cent_2018)ggplot() +
geom_sf(data = census.1960.sf, fill = 'white') +
labs(subtitle = "1960 Decennial Census Tracts")ggplot() +
geom_sf(data = census.2018.sf, fill = 'white') +
labs(subtitle = "2018 ACS-5 Years Census Tracts")ggplot() +
geom_sf(data = census.1960.sf, fill = 'white') +
geom_sf(data = census_cent_2018, color = 'red', size = 0.25, alpha = 0.3) +
labs(subtitle = "Centroids of 2018 ACS 5-Year Census Tracts on top of 1960 Decennial Census Tracts")ggplot() +
geom_sf(data = census.2018.sf, fill = 'white') +
geom_sf(data = census_cent_1960, color = 'red', size = 0.25, alpha = 0.3) +
labs(subtitle = "Centroids of 1960 Census Tracts on top of 2018 ACS 5-Year Census Tracts")Following the roadblock of my initial longitudinal analysis, I scaled back my variables of interest to include only ‘foreign born’ and ‘poverty’ data. These variables were downloaded as csv’s from Social Explorer and were sorted initially using Excel. Using Excel, I manually renamed existing variables of interest as well as combined the decennial census data (1970-2000) together. I combined data by creating a new variable called YEAR which represented the decennial census survey that the census-tract row was attached to. Using YEAR as the unique identifier, the decennial census surveys were combined by simply copying and pasting each survey’s rows underneath one another to create 12,868 rows of data (representing the four combined surveys). The resulting csv was read into R and later joined with the Census TIGER/Line shapefile collected from NHGIS as seen below.
#reading in csv's & adding PCT_FBs
df_2000 <- as_tibble(read_csv("./data/POV_A_F_Born_1970_2010.csv"))
df_2000 <- df_2000 %>%
mutate("P_F_BORN" = (F_BORN / POP) * 100) %>%
relocate("P_F_BORN", .after = "F_BORN")%>%
mutate("PCT_POOR" = (POV_POOR / N_POV) * 100) %>%
relocate("PCT_POOR", .after = "POV_POOR") %>%
mutate("PCT_STRG" = (POV_STRG / N_POV) * 100) %>%
relocate("PCT_STRG", .after = "POV_STRG") %>%
mutate("PCT_OKAY" = (POV_OKAY / N_POV) * 100) %>%
relocate("PCT_OKAY", .after = "POV_OKAY")
###reading in 2010 census tract polygons
tracts_2010.shp <- st_read("./data/US_tract_2010.shp")
st_crs(tracts_2010.shp)
PA_tracts_2010.shp <- tracts_2010.shp %>%
filter(STATEFP10 == "42")
###joing all csv's
#2000 & 1990
# df_decennial <- left_join(df_2000, df_1990, by = "Geo_FIPS", suffix = c("", ".annoying_duplicate_column")) %>%
# select(-ends_with(".annoying_duplicate_column"))
# #1980
# df_decennial <- left_join(df_decennial, df_1980, by = "Geo_FIPS", suffix = c("", ".annoying_duplicate_column")) %>%
# select(-ends_with(".annoying_duplicate_column"))
# #1970
# df_decennial <- left_join(df_decennial, df_1970, by = "Geo_FIPS", suffix = c("", ".annoying_duplicate_column")) %>%
# select(-ends_with(".annoying_duplicate_column"))
df_decennial <- df_2000
# #POP CHANGE VAR - no longer using this
# EDA_df_2000 <- df_decennial %>%
# filter(YEAR == 2000)
#
# EDA_df_1970 <- df_decennial %>%
# filter(YEAR == 1970)
#
# EDA_df_2000 <- EDA_df_2000 %>%
# mutate(POP_PCT_CHNG = ((EDA_df_2000$POP - EDA_df_1970$POP) / EDA_df_1970$POP))
#
# EDA_df_2000 <- EDA_df_2000 %>%
# mutate(FB_PCT_CHNG = ((EDA_df_2000$F_BORN - EDA_df_1970$F_BORN) / EDA_df_1970$F_BORN))
#
# EDA_df <- EDA_df_2000 %>%
# select(Geo_FIPS,
# POP_PCT_CHNG,
# FB_PCT_CHNG)
#df_decennial
#----------------------------------------------------
#joining 2010 sf with 1970:2000 csvs
df_decennial$Geo_FIPS <- as.character(df_decennial$Geo_FIPS)
df_decennial <- df_decennial %>%
rename("GEOID10" = "Geo_FIPS")
tracts_2010 <- inner_join(PA_tracts_2010.shp, df_decennial, by = "GEOID10")
#st_crs(tracts_2010)While a theoretically correct longitudinal analysis was able to be setup using the 1970-2000 census’, there still existed some discrepancies in the variables of interest, particularly in the poverty variable. To quickly run through the limitations encountered by the variable, I’ll summarize how each census set up the poverty variable used in this analysis.
The 1970 decennial census poverty variable (Ratio of Income to Poverty Line) was measured by families versus individuals. Counts were available for families whose incomes were below 1.0 (Poor), between 1.0 and 2.0 (Struggling), as well as above 2.0 (Okay). Each ratio’s label (Poor, Struggling, Okay) were taken from the 2000 decennial survey that summarized each ratio with its interpretation. As seen in bold, the discrepancy in this census was the measurement of families versus individuals, which is different than the subsequent surveys.
The 1980 decennial census poverty variable (Ratio of Income to Poverty Line) was measured by individuals. Counts were available for individuals whose incomes were below 1.25 (Poor), between 1.25 and 2.0 (Struggling), as well as above 2.0 (Okay). Each ratio’s label (Poor, Struggling, Okay) were taken from the 2000 decennial survey that summarized each ratio with its interpretation. As seen in bold, the discrepancy in this census was the distinguishing between ‘Poor’ and ‘Struggling’ which were manipulated to be 1.25 and below (Poor) and between 1.25 and 2.0 (Struggling). This is a result of how the census counted and partitioned survey respondents. This is the only survey who’s ‘Poor’ and ‘Struggling’ factor were partitioned in this way.
The 1990 and 2000 decennial census poverty variables (Ratio of Income to Poverty Line) was measured by individuals. Counts were available for individuals whose incomes were below 1.0 (Poor), between 1.0 and 2.0 (Struggling), as well as above 2.0 (Okay). Each ratio’s label (Poor, Struggling, Okay) were taken from the 2000 decennial survey that summarized each ratio with its interpretation. These surveys were the most consistent across time.
The poverty line threshold used in each census comes from the poverty thresholds used by the Census Bureau. These are updated annually and are most suited for statistical analysis.
Using the masstime.us user interface, I filtered U.S. churches that were only located in the state of Pennsylvania. Then, using the ‘Reader View’ offered by my current version of Safari (15.4), I copied and pasted all text from the resulting webpage into an excel file, and later manually tidied the data into columns containing each church’s name and address. I later geocoded the church’s addresses using ArcGIS Pro’s ’Geocode Addresses’ tool. The resulting points feature layer was then read and further manipulated in R using the associated tidyverse packages.
#reading in sf
pa_churches <- st_read("./data/PA_Churches_geom.shp")
# head(pa_churches)
pa_churches_1 <- pa_churches %>%
filter(Region != "Alabama")
catholic.sf <- pa_churches_1 %>%
select("City",
"Subregion",
"Postal",
"USER_CHURC",
"USER_ADDRE")
catholic.sf <- st_transform(catholic.sf, crs = st_crs(tracts_2010))
catholic.sf2 <- pa_churches_1 %>%
select(!1:36)
catholic.sf2 <- st_transform(catholic.sf2, crs = st_crs(tracts_2010))Localities seen in [Table 1. Variables of Interest (by Census Tract)] were created by pulling each city’s centroid coordinates (from Google Maps specifically) and its subsequent conversion to an sf object. The resulting sf object was transformed to match the projections of the NHGIS shapefile (4326).
#making points layer of cities of interest
cities_sf <- tribble(
~id, ~lon, ~lat,
"Pittsburgh", -79.997422, 40.438592,
"Philadelphia", -75.171193, 39.949852,
# "New York", -74.00782551842028, 40.707626,
"Bethlehem", -75.37797, 40.61867) %>% #Assigning specific type
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
st_transform(crs = st_crs(tracts_2010)) %>%
mutate(lon = st_coordinates(.)[,1],
lat = st_coordinates(.)[,2])#when using the tracts as just a baselayer always filter out only one year. Otherwise, 12,000+ polygons will be mapped and it takes a brick
ggplot() +
geom_sf(data = filter(tracts_2010, YEAR == 1970), fill = 'white') +
geom_sf(data = cities_sf, color = 'red', size = 1)Code Chunk below was no-longer needed for analysis
# church density thing
# # EDA Graphics & last bits of wrangling new vars
# catholic.sf %>%
# st_transform('ESRI:102286') %>%
# distinct()
#
# PA_County <-
# st_read("./data/PaCounty2022_04.geojson")
# st_crs(PA_County)
#
# PA_County.sf <- st_transform(PA_County, crs = st_crs(catholic.sf))
#
# philly.sf <- PA_County %>%
# filter(COUNTY_NAM == "PHILADELPHIA")
#
# philly.sf <- st_transform(philly.sf, crs = st_crs(catholic.sf))
#
# pittsburgh.sf <- PA_County %>%
# filter(COUNTY_NAM == "ALLEGHENY")
#
# pittsburgh.sf <- st_transform(pittsburgh.sf, crs = st_crs(catholic.sf))
#
# ggplot() + geom_sf(data = PA_County.sf, fill = "grey40") +
# stat_density2d(data = data.frame(st_coordinates(catholic.sf)),
# aes(X, Y, fill = ..level.., alpha = ..level..),
# size = 0.01, bins = 40, geom = 'polygon') +
# scale_fill_gradient(low = "#25CB10", high = "#FA7800",
# breaks=c(0.000000003,0.00000003),
# labels=c("Minimum","Maximum"), name = "Density") +
# scale_alpha(range = c(0.00, 0.35), guide = "none") +
# labs(title = "Density of Catholic Churches, Pennsylvania") +
# mapTheme()
#
# # st_crs(tract_distinct)
# p1 <- ggplot() +
# geom_sf(data = PA_County.sf, fill = 'white') +
# # geom_sf(data = PA_tracts_1, fill = 'orange') +
# geom_sf(data = catholic.sf, color = 'red') The code chunk below illustrates how I counted the number of Catholic Churches within each Census Tract as well as by using a half-mile buffer around each census tract. The sf package was critical here, and, because the Catholic Church demographic variable is meant to show proximity versus exclusivity, I ultimately decided that counting Catholic Churches within a half-mile buffer was the better option theoretically, as multiple census tracts should be allowed to share the same Catholic church.
The code chunk below illustrates how I categorized census tracts by their locality. I created a five-mile buffer around each locality’s centroid, and identified the census tracts that were within each respective locality. Each census tract has only one locality, and any tract that was not found within the five-mile buffer of our cities of interest (Bethlehem, Pittsburgh, Philadelphia) was given “Other”.
The code chunk below illustrates how I counted the number of census tracts that fell within each of my designated localities. Additionally, I counted the number of outliers within each locality as they pertain to P_F_BORN. The results of this analysis relate to which locality I expect to be most segregated (highest dissimilarity index) as mentioned in Expected Findings.
tracts_2010_df <- st_drop_geometry(tracts_2010) %>%
na.omit()
tracts_2010_df %>%
group_by(Locality) %>%
count()quantile(tracts_2010_df$P_F_BORN, 0.75)## 75%
## 4.75981
tracts_2010_df %>%
filter(P_F_BORN > 4.7591) %>%
group_by(Locality) %>%
count()quantile(tracts_2010_df$P_F_BORN, 0.25)## 25%
## 1.290115
tracts_2010_df %>%
filter(P_F_BORN < 1.290115) %>%
group_by(Locality) %>%
count()Code Chunk below was no-longer needed for analysis
# ###Clipping Catholic Churches Point Layer to Metros
# ###
# PITT_CATH <- st_intersection(catholic.sf, PITT_Tracts)
#
# PHIL_CATH <- st_intersection(catholic.sf, PHIL_Tracts)
#
# BETH_CATH <- st_intersection(catholic.sf, BETH_Tracts)
#
# # NY_CATH <- st_intersection(catholic.sf, NY_Tracts)Code Chunk below was no-longer needed for analysis
# #working with google api
# #getting dimensions and scale right for baselayers to use and visualize later
# ####################################################
# #google map thing
#
# my.goog <- read_lines("./google_api.txt")
# saveRDS(my.goog, file = "mattyp_goog_api.rds")
# my.goog <- readRDS("mattyp_goog_api.rds")
# #Sys.setenv()
# register_google(key = my.goog)
# ###
# ####fine-tuning distance metrics to use for the city vars based on their fit with these ggmaps
#
# tracts_2010.wgs84 <- st_transform(tracts_2010, 4326) #for compatibility with google
#
#
# beth_map <- ggmap(get_googlemap(center = c(lon = -75.37797, lat = 40.61867),
# zoom = 12,
# maptype = "terrain",
# color = "color")) +
# geom_sf(data = filter(tracts_2010.wgs84, Locality == "BETHLEHEM" & YEAR == 1970),
# mapping = aes(fill = F_BORN), inherit.aes = FALSE,
# color = "gray90", size = 0.25, alpha = 0.55) +
# theme_map()
#
# ggsave(filename = paste0("./img/beth_map_VAR.png"), plot = beth_map, width = 8, height = 8)
# ###
#
# pitt_map <- ggmap(get_googlemap(center = c(lon = -79.997422, lat = 40.438592),
# zoom = 12,
# maptype = "terrain",
# color = "color")) +
# geom_sf(data = filter(tracts_2010.wgs84, Locality == "PITTSBURGH" & YEAR == 1970),
# mapping = aes(fill = F_BORN), inherit.aes = FALSE,
# color = "gray90", size = 0.25, alpha = 0.55) +
# theme_map()
#
# ggsave(filename = paste0("./img/pitt_map_VAR.png"), plot = pitt_map, width = 8, height = 8)
# ###
# philly_map <- ggmap(get_googlemap(center = c(lon = -75.171193, lat = 39.949852),
# zoom = 12,
# maptype = "terrain",
# color = "color")) +
# geom_sf(data = filter(tracts_2010.wgs84, Locality == "PHILADELPHIA" & YEAR == 1970),
# mapping = aes(fill = F_BORN), inherit.aes = FALSE,
# color = "gray90", size = 0.25, alpha = 0.55) +
# theme_map()
#
# ggsave(filename = paste0("./img/philly_map_VAR.png"), plot = philly_map, width = 8, height = 8)Below is an animated boxplot analysis. It shows the distribution of ‘Percent Foreign by Locality’ across each decennial census. Paired with the results from Counting Tracts by Locality + Counting # of Outliers, the P_F_BORN variable can be easily assessed by each locality over time.
animate(pct_FB_boxplot1.animation, height = 800, width = 800, fps = 1, duration = 5, end_pause = 1, res = 100, renderer=gifski_renderer())## Warning: Removed 1136 rows containing non-finite values (stat_boxplot).
Assessing our results, among the localities, ‘Other’ has the most census tracts (9847). However, expect me to omit this locality until the very end of each ran analysis as I aim to compare specifically between Bethlehem, Pittsburgh, and Philadelphia. Among our localities of interest, Philadelphia contains the most census tracts, followed by Pittsburgh, and lastly, Bethlehem (936, 712, 228 respectively). Looking specifically at outliers concerning P_F_BORN, the same order is seen with Philadelphia leading, followed by Pittsburgh, and Bethlehem (644, 347, 110 respectively). Further, among these outliers, Bethlehem actually contains the highest percentage of outliers that are ‘high’ (greater than the 3rd quartile), followed by Philadelphia and Pittsburgh (97.3%, 79.0%, 70.9% respectively). Now, concerning our ‘Other’ locality, it contains 4,762 outliers, with only 43.5% of them being ‘high’.
Below is a correlation matrix analysis. It shows the correlation between our variables of interest (excluding non-numeric vars) . Particular to P_F_BORN, it has a somewhat indirect relationship with AREA_SQMI (-0.25) as well as PCT_STRG (-10.5). Oppositely, P_F_BORN has a somewhat direct relationship with PCT_POOR (0.07). Overall, there are no clear relationships, with mostly no relationship at all between P_F_BORN and our explanatory variables.
numericVars <- tracts_2010_df %>%
na.omit() %>%
select(AREA_SQMI,
YEAR,
POP,
P_F_BORN,
PCT_POOR,
PCT_STRG,
PCT_OKAY,
CATH_BUFF)ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
insig = "blank") +
labs(title = "Correlation across numeric variables") #correlation on numeric variables only
correlation_results <- numericVars %>%
cor()
print(correlation_results)## AREA_SQMI YEAR POP P_F_BORN PCT_POOR
## AREA_SQMI 1.00000000 0.118361507 0.081303691 -0.247641408 -0.016533489
## YEAR 0.11836151 1.000000000 0.029603196 -0.006070913 0.227999250
## POP 0.08130369 0.029603196 1.000000000 0.029232304 0.007290028
## P_F_BORN -0.24764141 -0.006070913 0.029232304 1.000000000 0.068907274
## PCT_POOR -0.01653349 0.227999250 0.007290028 0.068907274 1.000000000
## PCT_STRG 0.16864259 -0.089016380 0.012418686 -0.104533602 0.491256904
## PCT_OKAY -0.08437164 -0.088981261 -0.006226701 0.016575073 -0.876837987
## CATH_BUFF 0.02693408 0.004453306 -0.018005709 -0.031258725 0.032296115
## PCT_STRG PCT_OKAY CATH_BUFF
## AREA_SQMI 0.16864259 -0.084371635 0.026934084
## YEAR -0.08901638 -0.088981261 0.004453306
## POP 0.01241869 -0.006226701 -0.018005709
## P_F_BORN -0.10453360 0.016575073 -0.031258725
## PCT_POOR 0.49125690 -0.876837987 0.032296115
## PCT_STRG 1.00000000 -0.840192815 0.019749392
## PCT_OKAY -0.84019281 1.000000000 -0.029731018
## CATH_BUFF 0.01974939 -0.029731018 1.000000000
Below is an animated scatterplot analysis. It visualizes the relationship between P_F_BORN and each categorical ‘Poverty’ variable (PCT_POOR, PCT_STRG, PCT_OKAY) and how the relationships change over time (each decennial census). Lastly, it also presents the relationship between P_F_BORN and CATH_BUFF.
animate(pct_poor_graph1.animation, height = 800, width = 800, fps = 1, duration = 5, end_pause = 1, res = 100, renderer=gifski_renderer())## Warning: Removed 676 rows containing missing values (geom_point).
## Removed 676 rows containing missing values (geom_point).
## Warning: Removed 454 rows containing missing values (geom_point).
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 8 rows containing missing values (geom_point).
Assessing the scatterplots results, across the decennial census’, there is a clear pattern of PCT_POOR and P_F_BORN increasing together over time. There doesn’t appear to be any real discernible pattern over time between P_F_BORN and PCT_STRG or with the CATH_BUFF. Lastly, the PCT_OKAY variable appears to be decreasing over time, with the percent of census tracts having an Income to Poverty Ratio > 2.0 decreasing each census, particularly in Pittsburgh and Philadelphia.
The code chunks below and their resulting outputs contain choropleth maps that show how P_F_BORN has changed spatially over time in each of our comparison cities (excluding ‘Other’).
## Source : https://maps.googleapis.com/maps/api/staticmap?center=40.63037,-75.389&zoom=12&size=640x640&scale=2&maptype=terrain&key=xxx
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Source : https://maps.googleapis.com/maps/api/staticmap?center=39.949852,-75.171193&zoom=12&size=640x640&scale=2&maptype=terrain&key=xxx
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Source : https://maps.googleapis.com/maps/api/staticmap?center=40.438592,-80.001422&zoom=12&size=640x640&scale=2&maptype=terrain&key=xxx
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
From the choropleth maps generated, it’s very visible just how bigger Bethlehem’s census tracts are compared to Philadelphia or even Pittsburgh’s. Besides this fact, across all three cities, there are clear patterns concerning the distribution of P_F_BORN. In Bethlehem, as mentioned in Taft (2016), higher percents of foreign born populations are found south of the Lehigh river among each decennial census. However, in the year, 2000, there appears to be a regional shifting of P_F_BORN as the tracts that surround the peripheries of Bethlehem begin to show increasing numbers. In Philadelphia, it’s almost quite difficult to interpret the map. There are pockets of high P_F_BORN in the city center as well as the western peripheries. Additionally, just north of the city center and to the city’s southeast, these tracts remain low P_F_BORN. In Pittsburgh, showing the clearest pattern, census tracts to the east remain high P_F_BORN across each decennial census.
Using the dissim() function from the seg package, I was able to calculate the degree of segregation between foreign born populations and U.S. born populations. For only this analysis did I use F_BORN, as there was no theoretical reason to alternatively use P_F_BORN when calculating the degree of segregation.
print(diss_index_tb)## # A tibble: 12 × 3
## id year diss_index
## <chr> <dbl> <dbl>
## 1 Bethlehem 1970 0.251
## 2 Bethlehem 1980 0.250
## 3 Bethlehem 1990 0.202
## 4 Bethlehem 2000 0.258
## 5 Philadelphia 1970 0.355
## 6 Philadelphia 1980 0.349
## 7 Philadelphia 1990 0.357
## 8 Philadelphia 2000 0.339
## 9 Pittsburgh 1970 0.251
## 10 Pittsburgh 1980 0.273
## 11 Pittsburgh 1990 0.357
## 12 Pittsburgh 2000 0.406
Dissimilarity Index Results can be seen in [Table 3. Dissimilarity Index Results]. Assessing the the degree of segregation between foreign born populations and U.S. born populations, Bethlehem’s is the lowest as it remained around 0.25 but with an unanticipated drop to 0.202 in 1990. Philadelphia’s degree of segregation remained the highest until 2000. It’s dissimilarity index slightly decreased between 1970 and 2000 from 0.354 to 0.339. Pittsburgh’s dissimilarity index has seen the greatest change over time. Beginning at 0.251 (similar to Bethlehem) in 1970, it has risen each decennial census to eventually become the most segregated locality in the analysis with a degree of segregation of 0.406.
Thinking back to my Expected Findings, between my dependent variable and explanatory variables, I expected to see higher amounts of outliers among localities with greater number of census tracts. This hypothesis was proven to be true, as Philadelphia fit this hypothesis in both regards. Moving to my correlation plots (and matrix), I expected to see direct relationships with our foreign born dependent variables among tracts with higher populations, higher amounts of ‘Poor’ and ‘Struggling’ populations, higher amounts of Catholic Churches, and lower square mile areas. This was just about true. While P_F_BORN was directly related to PCT_POOR and inversely related to AREA_SQMI, it held no relationship with either PCT_STRG or CATH_BUFF. Thinking to the limitations of these two variables (PCT_STRG’s limitation mentioned in Existing Inconsistencies and Limitations), the CATH_BUFF variable could be improved by including the year each Catholic church was built to then later group into the YEAR variable. Since the Catholic church data I collected was only the most recent, it misses out on churches that existed during the study period but no longer exist today. Additionally, new churches that have recently been established may not have existed during the study period.
Thinking now of change in our foreign born dependent variables over time, I expected to see higher amounts of P_F_BORN and PCT_POOR over time. This hypothesis was deemed correct through the visualization of [Figure 11. Scatterplot Analysis - Percent Foreign Born vs Percent ‘Poor’ (Income to Poverty Line Ratio < 1.0)]. Paired with the Choropleth Map analysis, there is visible change in P_F_BORN populations over time. Overall, I would say the Bethlehem choropleth map was in agreement with Taft’s analysis. However, I expected to see a sharper degree of separation between North and South Bethlehem (North and South of the Lehigh River) Where were tracts south of the Lehigh River that did increase considerably in P_F_BORN, its pattern was not as clear as Pittsburgh or Philadlephia’s. This could be the results of the city’s larger census tracts (limitation mentioned in Expected Findings.
Pittsburgh’s pattern was the most surprising. Made even more visible through the dissimilarity index, of the localities, it is the most segregated city of recent years (2000). I did very little research on Pittsburgh going into this analysis, so I did not know what to expect. However, it’s 2000 diss_index (0.406) is greater than Philadelphia’s of that same year and was the highest diss_index recorded of any locality, which goes against my hypothesis (since Philadelphia has more tracts and more outliers). Further investigation could be done on the emerging settlement patterns taking place in Pittsburgh. While there is a growing trend of increasing P_F_BORN and PCT_POOR across the state, Pittsburgh seems to be the unique case study, not Bethlehem.
Overall, while this was a very difficult longitudinal analysis, I believe this study does have its benefits for the measuring of segregation across populations through a unique demographic lens. This is a great companion study to Taft (2016), and with my code and data manipulation being open, gives myself and any other social-scientist a platform of comparison for any new census-based segregation study going forward. While I would personally like to take a break from using census data (for at least a moment), when it does get consistency right, it can lead to unique case studies and analyses. I heavily enjoyed this term project’s process, but I’m also happy to finally tuck it away.