CTA āLā Train Expansion: Assessment on Ridership and Neighborhood Coverage in Chicago
Objective
This project is about analyzing the L Train network in the metropolis of Chicago. We will look at the location of stations and their proximity to different neighborhoods to understand the accessibility of these lines. Then working to make an overall evaluation of the efficiency of this system for commuters and citizens. We will also include several demographic and transportation variables. The ultimate goal is to see how CTAās subway network relates with the demographic and other variables, and to determine whether it is efficiently serving the people who are in need.
Background
The Chicago Transit Authority (CTA) is the main public transit system operator in Chicago. The āLā train, short for āelevatedā, is the mass transit(subway) system operated by CTA, serving Chicago and its neighboring communities. It together with Metra (commuter railway system) and Pace(suburban bus system) forms the multimodal public transportation network in metropolitan Chicago (i.e.Ā Chicagoland).
The āLā system mainly serves the city of Chicago as well as places in the neighboring Lake County. It is the 4th largest subway system in the United States. Consisting of 8 lines, its average normal weekday ridership goes to over 1.6 million passengers (https://www.transitchicago.com/facts/).
Data Acquisition
GTFS Data
The first and most important data for us to retrieve is the GTFS data, which includes the subway stations, routes, and their geographical locations. There is publicly available GTFS data that we can directly retrieve from CTAās official website.
After downloading the data, what we had to do was to filter out the subway routes (route_type == 1) & (location_type == 1). We joined the data from the stops, routes, trips and shapes columns to get the shape sf objects for the routes and the stations.
Here is the map that shows the CTA āLā Train network that we retrieved from the GTFS data. We can see that all the subway lines starts from a loop in the downtown region and goes radial into the suburban area. The subway loop is the namesake of Chicagoās downtown or CBD region, āthe Loopā.
Ridership Data
The official ridership data can be found at CTAās website as well as the Chicago Data Portal. It includes daily ridership for every station dating back from 2001 til this date (2024). We loaded the dataset in R and did data cleaning procedures. We wanted to gfilter out time range between the years 2015 and 2019, as these are the last years before subway ridership is affected by the COVID event.
We also filtered out only the weekday riderships of each station, as we think it may be most representative of the daily commuting patterns of the subway. Then we grouped then ridership by station and calculated the daily means of each station.
# Use forward slashes
ridership <- read.csv("CTA_-_Ridership_-__L__Station_Entries_-_Daily_Totals_20240922.csv")
ridership <- ridership %>%
mutate(station_id = as.character(station_id)) %>%
mutate(date = trimws(date), date = mdy(date)) %>%
filter(date >= as.Date("2015-01-01") & date <= as.Date("2019-12-31"))avg_rides <- ridership %>%
filter(daytype == "W") %>%
mutate(year = year(date)) %>%
group_by(station_id, stop_name) %>%
#group_by(station_id, stop_name, year) %>%
summarise(avg_rides = mean(rides, na.rm = TRUE), .groups = 'drop') Here is the map showing the riderships of each station, with larger circle indicating higher ridership. Unfortunately the size-based legend is not available, but you can view ech stationās ridership by clicking it and itāll show in the pop-up.
Census Demographic Data
We also can fetch the data from the Census Bureau, which is quite convenient with a census api key using the tidycensus package.
The data extraction uses the get_acs function to fetch block group-level data from the American Community Survey (ACS) for Cook and Lake Counties in Illinois. Important variables include household income, population, vehicle ownership, and senior population by age and gender. These variables are processed to find total counts (like total senior population) and densities (like car density, senior density, and population density) by dividing counts by the area of each spatial unit. The data is transformed to the same coordinate system and filtered to remove empty geometries. This helps combine demographic and spatial details for analysis.
Map 1: Population Density and Average Ridership
Here is the first map plotting the population density (per 1 square km) of each block group:
Map 2: Car Density and Average Ridership
This map shows car density (green shading) and average CTA ridership (dots, intensity by color) across Chicago. Higher car densities are concentrated in less populated areas, while the urban core has lower car density, likely due to better public transportation. Areas with higher car density often align with CTA transit lines, showing good transit access. Outside downtown, where ridership is highest, there is overlap between high car density areas and places with high ridership. This suggests that areas using cars also rely on public transit for some trips, revealing how both cars and transit work together in these areas.
Map 3: Senior Density and Average Ridership
#map2 <- tm_basemap(leaflet::providers$CartoDB.Positron) +
# tm_shape(census_data) +
# tm_polygons(col = "senior_density", style = "jenks", palette = "Purples", alpha = 0.8,
# popup.vars = c("Senior Density" = "senior_density")) +
# tm_shape(subway_sf) +
# tm_lines(col = "black", lwd = 2) +
# tm_shape(stations_ridership) +
# tm_dots(col = "avg_rides", style = "fisher", palette = "OrRd", size = 0.1)
#
#map2
black_to_purple <- colorRampPalette(c("#000000", "#990099", "#A020F0"))
tm_basemap(leaflet::providers$CartoDB.DarkMatter) +
tm_shape(census_data) +
tm_polygons(col = "senior_density", colorNA = "#444444", style = "fisher", popup.vars = c("Senior Density" = "senior_density"), border.col = NA, lwd = 0, palette = black_to_purple(10), alpha = 0.7, n = 10,
title = "senior_density per 1 sq km") +
tm_shape(subway_sf) + tm_lines(col = "#FF5555", lwd=2) +
tm_shape(stations_ridership) +
tm_dots(col = "avg_rides", style = "jenks", palette = "OrRd", border.col="grey", size=0.07) +
tm_view(set.view = c(-87.6298, 41.8781, 11))The third map visualizes senior density (purple shading) and average CTA ridership. Senior density is more spread-out than car density but aligns with CTA transit lines. This suggests older populations in these areas may have better transit access. Outside the downtown core, there is significant overlap between high senior density areas and high ridership regions. This shows that seniors in transit-accessible areas may rely on public transportation, highlighting the need for accessible transit to support mobility for older residents.
Map 4: Household Income and Average Ridership
The fourth map combines household income data (shaded in orange) with average CTA ridership. High-income areas are mostly in the north and near the lakefront, while lower-income areas are concentrated in the south and west. Also, Many high-ridership stations are located in lower-income areas, emphasizing the critical role of public transit in providing mobility for economically disadvantaged communities. The contrast between income and ridership patterns highlights the link between transit use and socioeconomic factors.
Bike Stations Data
There might be a pattern for subway passengers to use bike to reach their destinations within a mile from subway stations. So we used GBFS data from Divvy and calculated the density of bike stations for each block group.
gbfs_url <- "https://gbfs.lyft.com/gbfs/2.3/chi/en/station_information.json"
response <- GET(gbfs_url)
gbfs_data <- fromJSON(content(response, "text"))
stations <- gbfs_data$data$stations
head(stations)## short_name lat lon rental_uris.ios
## 1 24266 41.79431 -87.70348 https://chi.lft.to/lastmile_qr_scan
## 2 24477 41.98096 -87.76149 https://chi.lft.to/lastmile_qr_scan
## 3 24273 41.78458 -87.70319 https://chi.lft.to/lastmile_qr_scan
## 4 15535 41.88370 -87.64424 https://chi.lft.to/lastmile_qr_scan
## 5 TA1307000129 41.83104 -87.62680 https://chi.lft.to/lastmile_qr_scan
## 6 13345 41.83138 -87.61803 https://chi.lft.to/lastmile_qr_scan
## rental_uris.android station_id
## 1 https://chi.lft.to/lastmile_qr_scan 1963359453732131538
## 2 https://chi.lft.to/lastmile_qr_scan 2021144286196005052
## 3 https://chi.lft.to/lastmile_qr_scan 1963671811736140038
## 4 https://chi.lft.to/lastmile_qr_scan a3a52df5-a135-11e9-9cda-0a87ae2ba916
## 5 https://chi.lft.to/lastmile_qr_scan a3a75d8a-a135-11e9-9cda-0a87ae2ba916
## 6 https://chi.lft.to/lastmile_qr_scan a3aae35a-a135-11e9-9cda-0a87ae2ba916
## capacity name region_id address
## 1 11 Kedzie Ave & 54th Pl <NA> <NA>
## 2 16 Lynch Ave & Elston Ave <NA> <NA>
## 3 15 Kedzie Ave & 60th St <NA> <NA>
## 4 23 Desplaines St & Randolph St <NA> <NA>
## 5 15 State St & 35th St <NA> <NA>
## 6 15 Calumet Ave & 35th St <NA> <NA>
Bike Stations Map
Bike Stations Density and Average Ridership
OSM Street Network and Network Analysis on Service Area
We would like to fetch the road network data for the Chicago region and the area served by the āLā Train network. We did this in the osmdata package and created a custom bounding box to download the data. Ideally we should do all analysis in r markdown, but due to the large size of road network data(~800MB), we had to use ArcGIS for the network analysis.
Here is the bounding box we created to download the road network.
Here is the code we used to fetch the road network data.
osm_road <- opq(bbox = bb) %>%
add_osm_feature(key = 'highway',
value = c("motorway", "trunk", "primary", "secondary", "tertiary", "residential",
"motorway_link", "trunk_link", "primary_link", "secondary_link",
"tertiary_link", "residential_link", "unclassified")) %>%
osmdata_sf() %>%
osm_poly2line()After we downloaded the road network, we converted it to an sf network object, and activated the edges. To clean out the road network, we first filtered out all edges which are same pairs ( filter(!edge_is_multiple()) ) and edges which starts and ends at the same node ( filter(!edge_is_loop()) ). There are some road intersections that are not marked as nodes in this road network, and we used sfnetworks::to_spatial_subdivision to create nodes for those intersections. There are also āpseudoā nodes which only connects to two edges which is not essentially a network node so we deleted it using sfnetworks::to_spatial_smooth.
From this point on, the capacity in R simple features class does not allow us to do network analysis on the massive network. We exported the sfnetwork to a shapefile and did the analysis in ArcGIS Pro.
In ArcGIS, we imported the osm road network and the CTA Station point locations. In the osm_edges shapefile, we created two new columns, one is ālengthā that calculates the geographical length of all road sections, and the other is āminutesā that calculates the estimated walking time (in minutes) with 5 km per hour speed using the ālengthā column.
We built a Network Dataset using the osm_edges and osm_nodes shapefiles, and set the traveling mode to minutes using the column we created earlier. We are able to run a 5-10-15 minute walking distance service area analysis. We exported the result to a shapefile and imported it back to R.
## Reading layer `cta_polygons' from data source
## `C:\Users\benso\Documents\Georgia Tech\CP8883\Final Project\shp\cta_polygons.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.90694 ymin: 41.71177 xmax: -87.59051 ymax: 42.08323
## Geodetic CRS: WGS 84
We decided to use the 15-minute walking distance region. Hereās what it looks like:
We assigned each census block a binary value based on whether it intersects with the 15-min walking distance polygon.
Til this point, all the data we needed are ready for us to do a regression analysis on the different variables.
Regression Analysis
library(dplyr)
census_data <- census_data %>%
select(
-vehiclesM, -seniors_male, -seniors_maleM,
-seniors_female, -seniors_femaleM, -seniors_male_1, -seniors_male_1M, -seniors_female_1,
-seniors_female_1M, -seniors_male_2, -seniors_male_2M, -seniors_female_2, -seniors_female_2M,
-seniors_male_3, -seniors_male_3M, -seniors_female_3, -seniors_female_3M, -seniors_male_4,
-seniors_male_4M, -seniors_female_4, -seniors_female_4M, -seniors_male_5, -seniors_male_5M,
-seniors_female_5, -seniors_female_5M, -seniors_male_6, -seniors_male_6M, -seniors_female_6,
-seniors_female_6M, -seniors_male_7, -seniors_male_7M, -seniors_female_7, -seniors_female_7M
)census_new_df <- as.data.frame(census_new)
census_data_df <- as.data.frame(census_data)
merged_census <- left_join(census_new_df, census_data_df, by = "GEOID")
merged_census_clean1 <- merged_census %>%
select(
-NAME, -hhincomeM.x, -populationM.x, -geometry.x,
-hhincome.y, -hhincomeM.y, -population.y, -populationM.y,
-geometry.y, -NAM, -pop_den.y, -count, -population.x, -vehicles, - seniors
)## Warning: package 'VIM' was built under R version 4.4.2
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:dbscan':
##
## kNN
## The following object is masked from 'package:datasets':
##
## sleep
library(sf)
merged_census_clean1 <- read.csv("merged_census_clean.csv")
merged_census_cleaned <- merged_census_clean1 %>%
select(-GEOID)
# Convert columns to numeric where possible
merged_census_cleaned <- merged_census_cleaned %>%
mutate(across(where(is.character), ~ as.numeric(as.character(.)))) %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), ., .))) # Preserve NAs
# Check for and remove rows with any NA values before KNN
na_rows <- merged_census_cleaned %>%
rowwise() %>%
mutate(any_na = any(is.na(c_across(everything())))) %>%
pull(any_na)
if (any(na_rows)) {
merged_census_cleaned <- merged_census_cleaned[!na_rows, ]
}
# Apply KNN for missing values
knn_result <- kNN(merged_census_cleaned, k = 5)## Warning in kNN(merged_census_cleaned, k = 5): Nothing to impute, because no NA
## are present (also after using makeNA)
# Extract the imputed dataset (exclude "_imp" columns)
merged_census_cleaned <- knn_result[, !grepl("_imp$", names(knn_result))]
# Convert 'intersects' column from TRUE/FALSE to 1/0
merged_census_cleaned <- merged_census_cleaned %>%
mutate(intersects = ifelse(intersects, 1, 0))
# Define dependent and independent variables
dependent_variable <- "intersects"
independent_variables <- setdiff(names(merged_census_cleaned), dependent_variable)
# Construct and run the regression
formula <- as.formula(paste(dependent_variable, "~", paste(independent_variables, collapse = " + ")))
regression_model <- lm(formula, data = merged_census_cleaned)
summary(regression_model)##
## Call:
## lm(formula = formula, data = merged_census_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6303 -0.2466 -0.1920 0.4432 0.8708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.409e-01 1.726e-02 8.165 4.60e-16 ***
## hhincome.x 3.186e-07 1.623e-07 1.963 0.049748 *
## pop_den.x 4.289e-05 3.077e-06 13.939 < 2e-16 ***
## bike_den 2.191e-02 1.865e-03 11.747 < 2e-16 ***
## car_density -2.369e-05 6.742e-06 -3.514 0.000447 ***
## senior_density -2.053e-05 3.990e-06 -5.144 2.86e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4198 on 3149 degrees of freedom
## Multiple R-squared: 0.2099, Adjusted R-squared: 0.2087
## F-statistic: 167.3 on 5 and 3149 DF, p-value: < 2.2e-16
cook_dist <- cooks.distance(regression_model)
merged_census_cleaned$cook_dist <- cooks.distance(regression_model)
plot(cook_dist, pch = "*", cex = 2, main = "Influential Obs by Cooks Distance")
abline(h = 4*mean (cook_dist, na.rm = T), col = "red")
text(x = 1: length(cook_dist) + 5,
y = cook_dist,
col = "red",
labels = ifelse(cook_dist > 4 * mean(cook_dist, na.rm = T),
names(cook_dist),
""))#Remove the outlier
merged_census_noout <- merged_census_cleaned[merged_census_cleaned$cook_dist < 2, ]
ols_noout <- lm(formula, data = merged_census_noout)
summary(ols_noout)##
## Call:
## lm(formula = formula, data = merged_census_noout)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7290 -0.2431 -0.1788 0.4280 0.8667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.138e-01 1.739e-02 6.544 6.97e-11 ***
## hhincome.x 2.542e-07 1.607e-07 1.582 0.114
## pop_den.x 3.458e-05 3.203e-06 10.795 < 2e-16 ***
## bike_den 2.045e-02 1.854e-03 11.033 < 2e-16 ***
## car_density 2.344e-06 7.365e-06 0.318 0.750
## senior_density -1.654e-05 3.976e-06 -4.160 3.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4154 on 3148 degrees of freedom
## Multiple R-squared: 0.2265, Adjusted R-squared: 0.2253
## F-statistic: 184.3 on 5 and 3148 DF, p-value: < 2.2e-16
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
# Residuals vs Fitted Values
ggplot(data = merged_census_noout, aes(x = ols_noout$fitted.values, y = ols_noout$residuals)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal()# Correlation Heatmap
cor_matrix <- merged_census_noout %>%
select(where(is.numeric)) %>%
cor(use = "pairwise.complete.obs")
corrplot(cor_matrix, method = "color", tl.col = "black", tl.srt = 45)formula <- as.formula(paste(dependent_variable, "~", paste(independent_variables, collapse = " + ")))
logit_model <- glm(formula, data = merged_census_cleaned, family = binomial(link="logit"))## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = merged_census_cleaned)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.591e+00 1.292e-01 -20.060 < 2e-16 ***
## hhincome.x 4.100e-06 1.011e-06 4.056 5.00e-05 ***
## pop_den.x 3.382e-04 3.024e-05 11.184 < 2e-16 ***
## bike_den 8.261e-02 1.198e-02 6.893 5.45e-12 ***
## car_density -2.241e-04 6.027e-05 -3.718 0.000201 ***
## senior_density 3.293e-05 3.659e-05 0.900 0.368112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4022.4 on 3154 degrees of freedom
## Residual deviance: 3067.1 on 3149 degrees of freedom
## AIC: 3079.1
##
## Number of Fisher Scoring iterations: 6
\[ Interpretaions: \]
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -1533.5721107 -2011.1924565 955.2406916 0.2374812 0.2612313
## r2CU
## 0.3625455
# Distribution of Car Density
ggplot(data = merged_census_noout, aes(x = car_density)) +
geom_density(binwidth = 1, fill = "blue", color = "white", alpha = 0.7) +
labs(
title = "Distribution of Car Density",
x = "Car Density (vehicles per 1 square km)",
y = "Count"
) +
theme_minimal()## Warning in geom_density(binwidth = 1, fill = "blue", color = "white", alpha =
## 0.7): Ignoring unknown parameters: `binwidth`
# Car Density vs Intersects
ggplot(data = merged_census_noout, aes(x = car_density, y = intersects)) +
geom_jitter(alpha = 0.5, color = "blue") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "red", se=F) +
labs(
title = "Car Density vs Intersects",
x = "Car Density (vehicles per 1 square km)",
y = "Is within 15 min from CTA Station (1 = Yes, 0 = No)"
) +
theme_minimal()## `geom_smooth()` using formula = 'y ~ x'
# Distribution of Senior Density
ggplot(data = merged_census_noout, aes(x = senior_density)) +
geom_density(binwidth = 1, fill = "purple", color = "white", alpha = 0.7) +
labs(
title = "Distribution of Senior Density",
x = "Senior Density (seniors per 1 square km)",
y = "Count"
) +
theme_minimal()## Warning in geom_density(binwidth = 1, fill = "purple", color = "white", :
## Ignoring unknown parameters: `binwidth`
# Interaction between Senior Density and Intersects
ggplot(data = merged_census_noout, aes(x = senior_density, y = intersects)) +
geom_jitter(alpha = 0.5, color = "purple") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "red", se = FALSE) +
labs(
title = "Senior Density vs Intersects",
x = "Senior Density (seniors per 1 square km)",
y = "Is within 15 min from CTA Station (1 = Yes, 0 = No)"
) +
theme_minimal()## `geom_smooth()` using formula = 'y ~ x'
# Distribution of Population Density
ggplot(data = merged_census_noout, aes(x = pop_den.x)) +
geom_density(binwidth = 1, fill = "green", color = "white", alpha = 0.7) +
labs(
title = "Distribution of Population Density",
x = "Population Density (people per 1000 sq.m)",
y = "Count"
) +
theme_minimal()## Warning in geom_density(binwidth = 1, fill = "green", color = "white", alpha =
## 0.7): Ignoring unknown parameters: `binwidth`
# Interaction between Population Density and Intersects
ggplot(data = merged_census_noout, aes(x = pop_den.x, y = intersects)) +
geom_jitter(alpha = 0.5, color = "green") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "red", se = FALSE) +
labs(
title = "Population Density vs Intersects",
x = "Population Density (people per 1000 sq.m)",
y = "Is within 15 min from CTA Station (1 = Yes, 0 = No)"
) +
theme_minimal()## `geom_smooth()` using formula = 'y ~ x'
# Distribution of Bike Density
ggplot(data = merged_census_noout, aes(x = bike_den)) +
geom_density(binwidth = 1, fill = "orange", color = "white", alpha = 0.7) +
labs(
title = "Distribution of Bike Density",
x = "Bike Density (bikes per 1000 sq.m)",
y = "Count"
) +
theme_minimal()## Warning in geom_density(binwidth = 1, fill = "orange", color = "white", :
## Ignoring unknown parameters: `binwidth`
# Interaction between Bike Density and Intersects
ggplot(data = merged_census_noout, aes(x = bike_den, y = intersects)) +
geom_jitter(alpha = 0.5, color = "orange") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "red", se = FALSE) +
labs(
title = "Bike Density vs Intersects",
x = "Bike Density (bikes per 1000 sq.m)",
y = "Is within 15 min from CTA Station (1 = Yes, 0 = No)"
) +
theme_minimal()## `geom_smooth()` using formula = 'y ~ x'