CTA “L” Train Expansion: Assessment on Ridership and Neighborhood Coverage in Chicago
Yanfu Bai, Hsu-Chieh(Jasmine) Ma, Jiaqi Gu
Objective
Urban public transit systems play a critical role in shaping city life, influencing both commuting patterns and the socioeconomic landscape of neighborhoods. In Chicago, the CTA L Trains have been central to the city’s mobility, offering essential transit services to a diverse range of communities.
This project analyzes the L Train network in Chicago. We look at the location of stations and their proximity to different neighborhoods to understand how accessible the lines are, and then evaluate how efficiently the system serves commuters and residents. We also include several demographic and transportation variables. The ultimate goal is to see how CTA’s subway network relates to these variables, and to determine whether it efficiently serves the people who need it.
Background
The Chicago Transit Authority (CTA) is the main public transit operator in Chicago. The “L” train, short for “elevated”, is the mass transit (subway) system operated by CTA, serving Chicago and its neighboring communities. Together with Metra (the commuter rail system) and Pace (the suburban bus system), it forms the multimodal public transportation network of metropolitan Chicago (i.e. Chicagoland).
The “L” system mainly serves the city of Chicago as well as several neighboring suburbs in Cook County. It is the 4th largest subway system in the United States and consists of 8 lines. CTA reports an average weekday ridership of over 1.6 million passengers (https://www.transitchicago.com/facts/).
Research Questions
- How effectively does the CTA’s current coverage align with areas of high population, car, senior and bike density?
- What are the demographic and socioeconomic characteristics of areas with and without CTA coverage?
- Are there underserved areas that lack access to CTA stations?
- What role does bike infrastructure play in promoting ridership and accessibility to CTA stations?
- How do ridership patterns vary between central, endpoint, and intermediate stations across the network?
- What recommendations can be made to optimize CTA coverage and ridership in light of these findings?
Data Acquisition
GTFS Data
The first and most important dataset is the GTFS feed, which includes the subway stations, routes, and their geographic locations. CTA publishes its GTFS feed on its official website, so we could download it directly.
After downloading the data, we kept only the subway routes (route_type == 1) and the parent stations (location_type == 1). We then joined the stops, routes, trips, and shapes tables to build sf objects for the routes and the stations.
Here is the map of the CTA “L” Train network that we built from the GTFS data. All the subway lines start from a loop in the downtown area and radiate out toward the suburbs. This loop is the namesake of Chicago’s downtown or CBD region, “the Loop”.
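The filtering step can be illustrated with a minimal base R sketch. The two data frames below are hypothetical stand-ins for the GTFS routes and stops tables, and their rows are invented for illustration; only the route_type and location_type codes come from the GTFS specification.

```r
# Toy stand-ins for the GTFS routes and stops tables (hypothetical rows).
routes <- data.frame(
  route_id   = c("rail_a", "rail_b", "bus_1"),
  route_type = c(1, 1, 3)          # GTFS: 1 = subway/metro, 3 = bus
)
stops <- data.frame(
  stop_id       = c("st_1", "st_1_platform", "bus_stop_9"),
  location_type = c(1, 0, 0)       # GTFS: 1 = station, 0 = stop or platform
)

# Keep only subway routes and parent stations.
rail_routes <- routes[routes$route_type == 1, ]
stations    <- stops[stops$location_type == 1, ]

print(rail_routes)
print(stations)
```

In the real feed the filtered tables would then be joined to trips and shapes to recover the line geometries.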
Ridership Data
The official ridership data can be found on CTA’s website as well as the Chicago Data Portal. It includes daily ridership for every station from 2001 to the present (2024). We loaded the dataset in R and cleaned it. We kept only the years 2015 through 2019, as these are the last years before subway ridership was affected by the COVID-19 pandemic.
We also kept only the weekday ridership of each station, as we think it is most representative of daily commuting patterns. We then grouped the ridership by station and calculated the daily mean for each station.
library(dplyr)      # data manipulation verbs and the %>% pipe
library(lubridate)  # mdy() and year()

# Use forward slashes
ridership <- read.csv("CTA_-_Ridership_-__L__Station_Entries_-_Daily_Totals_20240922.csv")

# Parse the dates and keep the pre-COVID window (2015 to 2019)
ridership <- ridership %>%
  mutate(station_id = as.character(station_id)) %>%
  mutate(date = trimws(date), date = mdy(date)) %>%
  filter(date >= as.Date("2015-01-01") & date <= as.Date("2019-12-31"))

# Average weekday ("W") entries per station
avg_rides <- ridership %>%
  filter(daytype == "W") %>%
  mutate(year = year(date)) %>%
  group_by(station_id, stop_name) %>%
  #group_by(station_id, stop_name, year) %>%
  summarise(avg_rides = mean(rides, na.rm = TRUE), .groups = 'drop')

Here is the map showing the ridership of each station, with larger circles indicating higher ridership. Unfortunately a size-based legend is not available, but you can view each station’s ridership by clicking it; the value appears in the pop-up.
Census Demographic Data
We also fetched data from the Census Bureau, which is convenient with the tidycensus package and a Census API key.
The data extraction uses the get_acs function to fetch block group-level data from the American Community Survey (ACS) for Cook and Lake Counties in Illinois. Important variables include household income, population, vehicle ownership, and senior population by age and gender. These variables are processed to find total counts (like total senior population) and densities (like car density, senior density, and population density) by dividing counts by the area of each spatial unit. The data is transformed to the same coordinate system and filtered to remove empty geometries. This helps combine demographic and spatial details for analysis.
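The density calculation described above can be sketched in base R as follows. The column names and values are assumptions made for illustration, not the actual ACS variable codes used in the project; the area is assumed to be in square meters and is converted to square kilometers before dividing.

```r
# Hypothetical block groups with counts and areas (illustrative values only).
block_groups <- data.frame(
  geoid          = c("bg_1", "bg_2"),
  population     = c(1200, 800),
  vehicles       = c(500, 600),
  seniors_male   = c(60, 40),
  seniors_female = c(90, 50),
  area_m2        = c(250000, 1000000)
)

# Total senior population is the sum over the age-by-gender counts.
block_groups$seniors <- block_groups$seniors_male + block_groups$seniors_female

# Densities are counts divided by area in square kilometers.
area_km2 <- block_groups$area_m2 / 1e6
block_groups$pop_den        <- block_groups$population / area_km2
block_groups$car_density    <- block_groups$vehicles / area_km2
block_groups$senior_density <- block_groups$seniors / area_km2

print(block_groups[, c("geoid", "pop_den", "car_density", "senior_density")])
```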
Map 1: Population Density and Average Ridership
Here is the first map plotting the population density (per 1 square km) of each block group:
High ridership stations are often in areas with higher population density. But some dense areas have low ridership, while others have high ridership. This difference may be due to the spacing between stations. In areas with fewer stations and high population, one station serves a larger area, leading to higher ridership at that station. In areas with more stations close together, passengers have more options, spreading ridership across multiple stations and reducing crowding.
Map 2: Car Density and Average Ridership
This map shows car density (green shading) and average CTA ridership (dots, intensity by color) across Chicago. Higher car densities are concentrated in less populated areas, while the urban core has lower car density, likely due to better public transportation. Areas with higher car density often align with CTA transit lines, showing good transit access. Outside downtown, where ridership is highest, there is overlap between high car density areas and places with high ridership. This suggests that areas using cars also rely on public transit for some trips, revealing how both cars and transit work together in these areas.
Map 3: Senior Density and Average Ridership
The third map visualizes senior density (purple shading) and average CTA ridership. Senior density is more spread-out than car density but aligns with CTA transit lines. This suggests older populations in these areas may have better transit access. Outside the downtown core, there is significant overlap between high senior density areas and high ridership regions. This shows that seniors in transit-accessible areas may rely on public transportation, highlighting the need for accessible transit to support mobility for older residents.
Map 4: Household Income and Average Ridership
The fourth map combines household income data (shaded in orange) with average CTA ridership. High-income areas are mostly in the north and near the lakefront, while lower-income areas are concentrated in the south and west. Many high-ridership stations are located in lower-income areas, emphasizing the critical role of public transit in providing mobility for economically disadvantaged communities. The contrast between income and ridership patterns highlights the link between transit use and socioeconomic factors.
Bike Stations Data
Subway passengers may use bikes to reach destinations within a mile of a subway station. To capture this, we used GBFS data from Divvy and calculated the density of bike stations for each block group.
Bike Stations Map
Bike Stations Density and Average Ridership
OSM Street Network and Network Analysis on Service Area
We wanted to fetch the road network data for the Chicago region and the area served by the “L” Train network. We did this with the osmdata package, using a custom bounding box to download the data. Ideally we would do all of the analysis in R Markdown, but due to the large size of the road network data (~800 MB), we had to use ArcGIS for the network analysis.
Here is the bounding box we created to download the road network.
Here is the code we used to fetch the road network data.
library(osmdata)  # opq(), add_osm_feature(), osmdata_sf(), osm_poly2line()

# bb is the custom bounding box shown above
osm_road <- opq(bbox = bb) %>%
  add_osm_feature(key = 'highway',
                  value = c("motorway", "trunk", "primary", "secondary", "tertiary", "residential",
                            "motorway_link", "trunk_link", "primary_link", "secondary_link",
                            "tertiary_link", "residential_link", "unclassified")) %>%
  osmdata_sf() %>%
  osm_poly2line()

After downloading the road network, we converted it to an sfnetwork object and activated the edges. To clean the network, we first removed duplicate edges that connect the same pair of nodes ( filter(!edge_is_multiple()) ) and edges that start and end at the same node ( filter(!edge_is_loop()) ). Some road intersections are not marked as nodes in this network, so we used sfnetworks::to_spatial_subdivision to create nodes at those intersections. There are also “pseudo” nodes that connect only two edges and are not true network nodes, so we removed them using sfnetworks::to_spatial_smooth.
From this point on, the network was too large to analyze with the simple features tooling in R. We exported the sfnetwork to a shapefile and did the analysis in ArcGIS Pro.
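The cleaning steps above can be sketched on a small network. This example assumes the sfnetworks, tidygraph, and dplyr packages are installed, and it uses roxel, a small example street dataset bundled with sfnetworks, instead of the Chicago download; it is an illustration of the sequence of operations, not the project’s exact script.

```r
library(sfnetworks)
library(tidygraph)
library(dplyr)

# Build an undirected network from the bundled example street lines.
net <- as_sfnetwork(roxel, directed = FALSE)

cleaned <- net %>%
  activate("edges") %>%
  filter(!edge_is_multiple()) %>%    # drop duplicate edges between the same node pair
  filter(!edge_is_loop()) %>%        # drop edges that start and end at the same node
  convert(to_spatial_subdivision) %>% # add nodes where edges meet at interior points
  convert(to_spatial_smooth)         # remove pseudo nodes that join only two edges

print(cleaned)
```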
In ArcGIS, we imported the OSM road network and the CTA station point locations. In the osm_edges shapefile, we created two new columns: “length”, the geographic length of each road segment, and “minutes”, the estimated walking time (in minutes) derived from “length” at an assumed walking speed of 5 km per hour.
We built a Network Dataset from the osm_edges and osm_nodes shapefiles and set the travel cost to the “minutes” column we created earlier. This let us run a 5-, 10-, and 15-minute walking service area analysis. We exported the result to a shapefile and imported it back into R.
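The walking-time column is a simple unit conversion. The sketch below assumes segment lengths are stored in meters and uses invented lengths; the column was actually computed in ArcGIS, so this only mirrors the arithmetic.

```r
# Hypothetical road segments with lengths in meters.
edges <- data.frame(length = c(250, 1250, 500))

# 5 km/h expressed in meters per minute.
walk_speed_m_per_min <- 5000 / 60

# Walking time in minutes for each segment.
edges$minutes <- edges$length / walk_speed_m_per_min

print(edges)
```

At this speed a 15-minute walk corresponds to 1,250 meters along the network.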
## Reading layer `cta_polygons' from data source
## `C:\Users\benso\Documents\Georgia Tech\CP8883\Final Project\shp\cta_polygons.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.90694 ymin: 41.71177 xmax: -87.59051 ymax: 42.08323
## Geodetic CRS: WGS 84
We decided to use the 15-minute walking distance region. Here’s what it looks like:
We assigned each census block group a binary value based on whether it intersects the 15-minute walking distance polygon.
At this point, all the data we needed was ready for a regression analysis on the different variables.
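A minimal sketch of this flagging step, assuming the sf package is installed. The square polygons are toy geometries in arbitrary planar coordinates, and the helper square() is hypothetical; the real inputs would be the census geometries and the service area polygon.

```r
library(sf)

# Hypothetical helper: an axis-aligned square polygon with its lower-left corner at (x0, y0).
square <- function(x0, y0, size = 1) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + size, y0), c(x0 + size, y0 + size),
    c(x0, y0 + size), c(x0, y0)
  )))
}

# Three toy census units and one toy service area.
block_groups <- st_sf(
  geoid    = c("A", "B", "C"),
  geometry = st_sfc(square(0, 0), square(2, 0), square(5, 5))
)
service_area <- st_sfc(square(0.5, 0.5, size = 2))

# 1 if the unit touches or overlaps the service area, 0 otherwise.
hits <- lengths(st_intersects(block_groups, service_area)) > 0
block_groups$intersects <- as.numeric(hits)

print(block_groups)
```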
Regression Analysis
To handle missing values in the dataset, we used K-Nearest Neighbor (KNN) imputation. Compared to mean imputation, which introduces bias and ignores multivariate patterns, KNN adapts to the complexity and variability in the data and better preserves its structure.
After filling all missing values with KNN estimates, we converted the column “intersects” from Boolean (TRUE/FALSE) to numeric (1 for TRUE, 0 for FALSE) and used “intersects” as the dependent variable.
The first model we fitted is a multiple linear regression model, with “intersects” (whether within 15 minutes’ walking distance or not) as the dependent variable and median household income, population density, bike station density, car density, and senior population density as independent variables.
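The idea behind KNN imputation can be sketched in base R. The function knn_impute() below is a hypothetical helper written for illustration; the project may have used a package implementation with different distance and aggregation choices. Here a missing value is replaced by the mean of the k rows that are closest on the standardized predictor columns.

```r
# Hypothetical helper: impute `target` from the k nearest rows on `predictors`.
knn_impute <- function(df, target, predictors, k = 3) {
  x <- scale(as.matrix(df[, predictors, drop = FALSE]))  # standardize predictors
  missing  <- which(is.na(df[[target]]))
  observed <- which(!is.na(df[[target]]))
  for (i in missing) {
    # Euclidean distance from row i to every row with an observed target.
    d <- sqrt(rowSums(sweep(x[observed, , drop = FALSE], 2, x[i, ])^2))
    nearest <- observed[order(d)[seq_len(min(k, length(observed)))]]
    df[[target]][i] <- mean(df[[target]][nearest])
  }
  df
}

# Toy data: the second row is missing household income.
toy <- data.frame(
  pop_den  = c(1000, 1100, 1200, 9000, 9100),
  hhincome = c(50000, NA, 54000, 120000, 118000)
)
filled <- knn_impute(toy, target = "hhincome", predictors = "pop_den", k = 2)
print(filled)
```

The missing row is filled from its two low-density neighbors rather than from the overall mean, which the high-income rows would pull upward.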
##
## Call:
## lm(formula = formula, data = merged_census_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6303 -0.2466 -0.1920 0.4432 0.8708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.409e-01 1.726e-02 8.165 4.60e-16 ***
## hhincome.x 3.186e-07 1.623e-07 1.963 0.049748 *
## pop_den.x 4.289e-05 3.077e-06 13.939 < 2e-16 ***
## bike_den 2.191e-02 1.865e-03 11.747 < 2e-16 ***
## car_density -2.369e-05 6.742e-06 -3.514 0.000447 ***
## senior_density -2.053e-05 3.990e-06 -5.144 2.86e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4198 on 3149 degrees of freedom
## Multiple R-squared: 0.2099, Adjusted R-squared: 0.2087
## F-statistic: 167.3 on 5 and 3149 DF, p-value: < 2.2e-16
We then used Cook’s Distance to identify influential outliers in the regression model. Observations with a Cook’s Distance below a threshold (e.g., 2) were retained for further analysis. We refit the regression model on the refined dataset to reduce the influence of outliers on the estimates. Below are the results of the outlier-cleaned linear model. After the outlier removal, both the R-squared and the adjusted R-squared values increase. The statistically significant positive predictors in this model are population density and bike station density, and the significant negative predictor is senior density.
##
## Call:
## lm(formula = formula, data = merged_census_noout)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7290 -0.2431 -0.1788 0.4280 0.8667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.138e-01 1.739e-02 6.544 6.97e-11 ***
## hhincome.x 2.542e-07 1.607e-07 1.582 0.114
## pop_den.x 3.458e-05 3.203e-06 10.795 < 2e-16 ***
## bike_den 2.045e-02 1.854e-03 11.033 < 2e-16 ***
## car_density 2.344e-06 7.365e-06 0.318 0.750
## senior_density -1.654e-05 3.976e-06 -4.160 3.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4154 on 3148 degrees of freedom
## Multiple R-squared: 0.2265, Adjusted R-squared: 0.2253
## F-statistic: 184.3 on 5 and 3148 DF, p-value: < 2.2e-16
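The Cook’s Distance filtering used to produce the outlier-cleaned model can be sketched in base R on simulated data. The data, the single predictor, and the 4/n cutoff are assumptions made for illustration; 4/n is a common rule of thumb and is not the threshold used in this report.

```r
set.seed(42)

# Simulated stand-in data: a binary outcome driven by one density variable.
n <- 200
toy <- data.frame(pop_den = runif(n, 0, 10000))
toy$intersects <- as.numeric(runif(n) < plogis(-2 + 4e-4 * toy$pop_den))

# Fit the linear model and compute Cook's Distance for each observation.
fit <- lm(intersects ~ pop_den, data = toy)
cooks <- cooks.distance(fit)

# Keep observations below the chosen cutoff (assumed here, see lead-in).
threshold <- 4 / n
keep <- cooks < threshold

# Refit on the retained observations.
refit <- lm(intersects ~ pop_den, data = toy[keep, ])

cat("Removed", sum(!keep), "of", n, "observations\n")
print(summary(refit)$r.squared)
```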
We also fit a logistic model, since the dependent variable is binary. The statistically significant positive predictors in this model are household income, population density, and bike station density, and the significant negative predictor is car density. Population density and bike station density are positive and significant in both the linear and the logistic models.
##
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = merged_census_cleaned)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.591e+00 1.292e-01 -20.060 < 2e-16 ***
## hhincome.x 4.100e-06 1.011e-06 4.056 5.00e-05 ***
## pop_den.x 3.382e-04 3.024e-05 11.184 < 2e-16 ***
## bike_den 8.261e-02 1.198e-02 6.893 5.45e-12 ***
## car_density -2.241e-04 6.027e-05 -3.718 0.000201 ***
## senior_density 3.293e-05 3.659e-05 0.900 0.368112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4022.4 on 3154 degrees of freedom
## Residual deviance: 3067.1 on 3149 degrees of freedom
## AIC: 3079.1
##
## Number of Fisher Scoring iterations: 6
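Logistic coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies three estimates from the output above; rescaling population density to steps of 1,000 people per square kilometer is a choice made here for readability.

```r
# Estimates copied from the logistic regression output above.
coefs <- c(pop_den = 3.382e-04, bike_den = 8.261e-02, car_density = -2.241e-04)

# Odds ratio for a one-unit increase in each predictor.
odds_ratio_per_unit <- exp(coefs)

# Odds ratio for an increase of 1,000 people per square kilometer.
or_pop_per_1000 <- exp(1000 * coefs[["pop_den"]])

print(odds_ratio_per_unit)
print(or_pop_per_1000)
```

An odds ratio above 1 means the odds of being inside the 15-minute service area rise with the predictor, holding the others fixed; a value below 1, as for car density, means they fall.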
The McFadden pseudo R-squared of the logistic regression model (0.237) is also higher than the R-squared of the linear model (0.210), although the two measures are not directly comparable.
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -1533.5721107 -2011.1924565 955.2406916 0.2374812 0.2612313
## r2CU
## 0.3625455
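The pseudo R-squared values above can be reproduced from the two log-likelihoods. The sketch below copies llh and llhNull from the output and takes n = 3155 from the null deviance’s 3154 degrees of freedom; the formulas are the standard McFadden, maximum likelihood (Cox and Snell), and Cragg-Uhler (Nagelkerke) definitions.

```r
# Values copied from the model output above.
llh      <- -1533.5721107   # log-likelihood of the fitted model
llh_null <- -2011.1924565   # log-likelihood of the intercept-only model
n        <- 3155            # observations (null deviance df + 1)

g2       <- -2 * (llh_null - llh)                  # likelihood ratio statistic
mcfadden <- 1 - llh / llh_null                     # McFadden pseudo R-squared
r2_ml    <- 1 - exp(-g2 / n)                       # maximum likelihood pseudo R-squared
r2_cu    <- r2_ml / (1 - exp(2 * llh_null / n))    # Cragg-Uhler pseudo R-squared

print(c(G2 = g2, McFadden = mcfadden, r2ML = r2_ml, r2CU = r2_cu))
```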
Visualization
Here is the correlation heatmap of the pairs of variables. The color intensity denotes the strength of the correlation between each pair of variables.
Distribution of Car Density
The distribution of car density is highly skewed, with most areas having very low vehicle density and a few areas exhibiting extremely high densities, forming a long tail. This suggests vehicles are concentrated in certain urban regions, while many areas may be less car-reliant or less populated. The skewness reflects varied land-use patterns, ranging from dense urban cores to sparsely populated suburban zones.
Car Density vs Intersects
The scatterplot of car density vs. intersects shows two clear clusters due to the binary nature of intersects: areas with CTA coverage (1) and without coverage (0). The red logistic regression curve indicates a positive relationship, where higher car density increases the likelihood of CTA coverage, especially in moderately dense areas. However, the curve flattens at very high car densities, suggesting that beyond a certain threshold, more cars do not significantly affect the probability of coverage. This reveals that CTA focuses on areas with moderate car density, balancing transit accessibility with urban vehicle use.
Distribution of Household Income
The distribution of household income has a single peak around the $50,000–$100,000 range, representing median income levels in the area. The curve steadily declines as income increases, with fewer households earning above $150,000. This distribution highlights the region’s socioeconomic diversity, with many middle-income households and fewer at the extremes. Most blocks are neither extremely wealthy nor impoverished, reflecting common urban and suburban trends.
Household Income vs Intersects
The scatter plot of household income vs. CTA station coverage shows a near-flat regression line, indicating little correlation between income and proximity to a CTA station. However, the points reveal that higher-income households are more often found in areas without CTA coverage (intersects = 0). This suggests wealthier neighborhoods may depend less on public transit, pointing to a possible disparity in transit usage across income levels.
Distribution of Population Density
The distribution of population density is highly skewed, with most areas having densities below 10,000 people per square kilometer. The number of areas drops sharply as density increases, indicating that very dense blocks are uncommon and concentrated in specific regions. This pattern reflects urban planning, where high densities are usually found near city centers or transit hubs.
Population Density vs. Intersects
The scatterplot of population density vs. CTA coverage (intersects) reveals a non-linear relationship. The red curve indicates that higher population densities increase the likelihood of CTA coverage, but the effect levels off beyond a certain density. This suggests the CTA focuses on moderately dense urban areas rather than exclusively targeting the most densely populated blocks.
Distribution of Bike Density
The distribution of bike density shows a heavily skewed pattern, with most areas reporting very low values close to zero. Only a small number of areas have higher bike density, indicating that bike infrastructure is concentrated in specific locations, leaving much of the region with minimal resources.
Bike Density vs Intersects
The scatterplot of bike density vs. CTA coverage (intersects) shows an upward trend, with higher bike density linked to a greater likelihood of being within 15 minutes of a CTA station. However, distinct patterns emerge:
- High Bike Density Distribution:
- Unlike previous variables (e.g., car density or population density), where areas with high values tend to cluster around intersects = 1 (covered by CTA), here we observe a more balanced distribution of high bike density points across both intersects = 0 (not covered) and intersects = 1 (covered).
- This suggests that areas with higher bike density are not exclusively reliant on CTA coverage, indicating that bike infrastructure may serve as a complementary transit option, independent of CTA presence.
- Low Bike Density Distribution:
- For areas with low bike density, the points are more evenly distributed between intersects = 0 and intersects = 1, reinforcing the idea that bike density alone does not determine CTA coverage in these regions.
- This highlights the potential role of other factors, such as population density or urban planning priorities.
- Unique Insights:
- The balanced presence of high bike density points across both intersects = 0 and intersects = 1 suggests that bike infrastructure might play an independent role in transit accessibility.
- This differs from variables like car or population density, where high values more strongly align with CTA coverage, indicating that bike density is less predictive of CTA coverage in isolation.
- This result points to the need for a combined transit strategy, where bikes complement rather than compete with CTA services, particularly in areas with well-established bike infrastructure.
Research Limitations
This study has several limitations. The model explains only 21% of the variation in CTA coverage, suggesting key factors like canceled trips and land use are missing. It focuses on correlations but does not explore causation. Influential data points and skewed residuals indicate potential biases. These issues limit the scope and highlight areas for improvement in future research.
Conclusion
The analysis reveals key insights about CTA coverage and ridership patterns. CTA stations are strategically placed in areas with higher population, car, and bike densities, focusing on serving densely populated neighborhoods. However, some high-income and low-density neighborhoods remain underserved.
Ridership patterns show that stations in the city center and at line endpoints have significantly higher average ridership, emphasizing their importance as transit hubs. Additionally, bike-friendly areas demonstrate a positive association with CTA coverage, suggesting efforts to support multi-modal transportation.
Overall, the findings underline the effectiveness of CTA’s current coverage in serving densely populated and transit-dependent areas. However, gaps in coverage in certain high-income, lower-density neighborhoods suggest opportunities to expand services or introduce supplementary transit options to ensure equitable access across Chicago.
Recommendations
- Expand Access: Provide transit options for underserved low-density neighborhoods.
- Improve Multi-Modal Connectivity: Increase biking infrastructure near CTA stations and support last-mile solutions.
- Refine Models: Include additional variables and explore causal relationships for deeper insights.
- Focus on Equity and Sustainability: Prioritize underserved areas and promote transit-oriented development.
CTA serves densely populated areas well. Addressing these gaps can improve equity and create a more sustainable transit network.
References
Wang, K., & Woo, M. (2017). The relationship between transit rich neighborhoods and transit ridership: Evidence from the decentralization of poverty. Applied Geography, 86, 183-196.
Boarnet, M. G., Bostic, R. W., Rodnyansky, S., et al. (2020). Do high income households reduce driving more when living near rail transit? Transportation Research Part D: Transport and Environment, 80, 102244.
Merlin, L. A., Singer, M., & Levine, J. (2021). Influences on transit ridership and transit accessibility in US urban areas. Transportation Research Part A: Policy and Practice, 150, 63-73.
Yang, H., Ruan, Z., Li, W., Zhu, H., Zhao, J., & Peng, J. (2022). The Impact of Built Environment Factors on Elderly People’s Mobility Characteristics by Metro System Considering Spatial Heterogeneity. ISPRS International Journal of Geo-Information, 11(5), 315. https://doi.org/10.3390/ijgi11050315
Daqrouq, A., & Anjomani, A. (2019). Public Transit Ridership and Car-Oriented Cities: The Case of the Dallas Region. Economies, 7(3), 86. https://doi.org/10.3390/economies7030086
Mass Transit as an Economic Equalizer: The Case for Expanding and Investing in Fair Fares. (n.d.). Community Service Society of New York. https://www.cssny.org/news/entry/mass-transit-economic-equity-fair-fares