I decided to look into data about the number of entries at stations on the L in Chicago. Growing up in Chicago, I used to ride the L on most weekends, to get to concerts, camp, the beach, museums, and more.
Map of the L train in Chicago
I joined 3 different data sets into one:
Note: All of the weather data was collected at a single weather station at Midway Airport.
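As a rough sketch of what a join like this could look like, the snippet below combines three small toy tables with dplyr. The table names (`rides`, `stations`, `weather`), the join keys, and the values are assumptions for illustration, not the actual source files or the code used here.

```r
library(dplyr)

# Toy stand-ins for the three sources (values are illustrative only).
rides <- tibble(
  station_id = c(41280, 41000, 41280),
  date       = as.Date(c("2017-12-22", "2017-12-22", "2017-12-23")),
  rides      = c(6104, 3636, 2950)
)
stations <- tibble(
  station_id   = c(41280, 41000),
  station_name = c("Jefferson Park", "Cermak-Chinatown"),
  ada          = c(TRUE, TRUE)
)
weather <- tibble(
  date      = as.Date(c("2017-12-22", "2017-12-23")),
  high_temp = c(41, 38),
  prec      = c(0.00, 0.02)
)

# One row per station-day: station attributes join on station_id,
# weather joins on date (assumed keys; one weather reading per day
# is shared by every station).
d_toy <- rides %>%
  left_join(stations, by = "station_id") %>%
  left_join(weather, by = "date")
```

Using `left_join()` keeps every ridership row even if a station or a date has no match in the other tables, which makes missing matches show up as `NA` instead of silently dropping rows.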
glimpse(d)
## Rows: 519,420
## Columns: 18
## $ station_id <dbl> 41280, 41000, 40280, 40140, 40690, 41660, 401…
## $ station_name <fct> Jefferson Park, Cermak-Chinatown, Central-Lak…
## $ date <date> 2017-12-22, 2017-12-18, 2017-12-02, 2017-12-…
## $ day_type <fct> Weekday, Weekday, Saturday, Weekday, Sunday/H…
## $ rides <dbl> 6104, 3636, 1270, 1759, 499, 8615, 442, 1353,…
## $ station_name_simple <chr> "Jefferson Park", "Cermak-Chinatown", "Centra…
## $ station_name_descriptive <chr> "Jefferson Park (Blue Line)", "Cermak-Chinato…
## $ linecolor_simple <chr> "Blue", "Red", "Green", "Yellow", "Purple", "…
## $ linecolor <fct> "Blue", "Red", "Green", "Yellow", "Purple", "…
## $ line_type <fct> Single, Single, Single, Single, Single, Singl…
## $ ada <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, T…
## $ lat <chr> "41.970634", "41.853206", "41.887389", "42.03…
## $ long <chr> "-87.760892", "-87.630968", "-87.76565", "-87…
## $ avg_wind_speed <dbl> 7.83, 12.53, 5.14, 11.41, 6.26, 12.75, 3.80, …
## $ prec <dbl> 0.00, 0.02, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ high_temp <dbl> 41, 45, 54, 51, 58, 15, 47, 54, 29, 51, 48, 8…
## $ low_temp <dbl> 35, 41, 35, 37, 31, 0, 32, 35, 19, 37, 34, -3…
## $ weather_event <dbl> 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, …
On any given day, at any given station, how many people enter that station?
What does TOTAL ridership look like on a daily basis ACROSS stations?
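Total daily ridership across stations is a sum of `rides` grouped by `date`. Here is a minimal sketch on toy data; the column names follow the `glimpse()` output above, but the values and the object names `d_toy` and `daily_total` are made up for illustration.

```r
library(dplyr)

# Toy station-day rows (values are illustrative only).
d_toy <- tibble(
  station_name = c("A", "B", "A", "B"),
  date         = as.Date(c("2017-12-22", "2017-12-22",
                           "2017-12-23", "2017-12-23")),
  day_type     = c("Weekday", "Weekday", "Saturday", "Saturday"),
  rides        = c(6104, 3636, 2950, 1270)
)

# Collapse station-level rows into one system-wide total per day.
daily_total <- d_toy %>%
  group_by(date, day_type) %>%
  summarise(total_rides = sum(rides), .groups = "drop")
```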
We can check how close that distribution is to normal with some Q-Q plots:
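For reference, a Q-Q plot can be drawn in base R with `qqnorm()` and `qqline()`. The sketch below uses simulated right-skewed values (`rides_sim` is an assumption, not the real ridership data) to show what a long right tail looks like: the points bend away from the reference line at the upper end.

```r
set.seed(1)

# Simulated, right-skewed stand-in for daily rides (not real data).
rides_sim <- rlnorm(500, meanlog = 7, sdlog = 1)

# Sample quantiles against theoretical normal quantiles.
qq <- qqnorm(rides_sim, main = "Q-Q plot of simulated rides")

# Reference line through the first and third quartiles.
qqline(rides_sim)
```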
Those histograms all have long tails. A box-plot shows us those outliers more clearly.
I manually labeled the top outlier by each day type.
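The labels here were added by hand, but the same rows could also be found programmatically. One possible approach, sketched on toy data with made-up station names and values, is `slice_max()` within each day type.

```r
library(dplyr)

# Toy station-day rows (values are illustrative only).
d_toy <- tibble(
  station_name = c("A", "B", "C", "A", "B", "C"),
  day_type     = c("Weekday", "Weekday", "Weekday",
                   "Saturday", "Saturday", "Saturday"),
  rides        = c(100, 250, 180, 90, 60, 300)
)

# Keep the single highest-ridership row within each day type.
top_by_day_type <- d_toy %>%
  group_by(day_type) %>%
  slice_max(rides, n = 1, with_ties = FALSE) %>%
  ungroup()
```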
(In Yellow) Addison is on the Red Line, right next to Wrigley Field. This was the day of a Cubs vs. Indians game in the 2016 World Series.
(In Brown) Lake/State serves multiple lines, and is right in the heart of downtown. This was the day of the parade celebrating that the Cubs won the World Series.
Fans walk to Grant Park for Cubs World Series Celebration
Source: https://www.denverpost.com/2016/11/04/chicago-cubs-world-series-rally-2016/
(In Pink) Belmont serves a few lines, and is close to Boystown. This was the day of the Pride Parade in Chicago, a few days after same-sex marriage was legalized in the U.S.
Chicago Pride Parade 2015
Source: https://www.timeout.com/chicago/lgbt/photos-from-the-2015-chicago-pride-parade
Note: "Multiple" stations are those that serve more than one line
color. Because rides counts the number of entries into a station,
we can't know which train someone rode after entering.
The Chicago Loop serves downtown Chicago and is a popular transit
route for daily commuters and tourists alike. Many, but not all, of the
"multiple" stations are located on the Loop.
The Chicago Loop
We looked at ridership by line. What about by station?
The top 2 stations here are on the Loop.
Here we can see that (ignoring the “multiple” stations) the Red Line is the most popular. Was that the case for the full 10-year period?
Why did Yellow Line ridership drop in 2015? From looking into it, it seems that a section of the embankment next to the track collapsed because of a construction failure at a nearby water reclamation plant, damaging the whole track.
Yearly totals felt too broad for spotting riding trends. I wanted to see what ridership looked like month to month, and whether any patterns were visible there.
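Moving from yearly to monthly totals mostly comes down to truncating each date to the first of its month before grouping. A small sketch on toy data (values and the `monthly` name are assumptions):

```r
library(dplyr)

# Toy daily rows for one line (values are illustrative only).
d_toy <- tibble(
  date      = as.Date(c("2015-04-30", "2015-05-01",
                        "2015-05-15", "2015-06-01")),
  linecolor = "Yellow",
  rides     = c(100, 40, 60, 10)
)

# Truncate each date to the first day of its month, then sum per line.
monthly <- d_toy %>%
  mutate(month = as.Date(format(date, "%Y-%m-01"))) %>%
  group_by(linecolor, month) %>%
  summarise(rides = sum(rides), .groups = "drop")
```

Keeping `month` as a `Date` (rather than a string like "2015-05") means it still sorts and plots correctly on a time axis.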
Now let’s look at stations’ accessibility and the relationship with rides.
## ada
## Mode :logical
## FALSE:42
## TRUE :101
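One way to compare ridership by accessibility is to average each station first and then average those station means within each `ada` group, so that stations with more rows do not dominate. This is a sketch on toy data with made-up values, not necessarily the comparison used here.

```r
library(dplyr)

# Toy station-day rows (values are illustrative only).
d_toy <- tibble(
  station_id = c(1, 1, 2, 2, 3, 3),
  ada        = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE),
  rides      = c(100, 300, 50, 150, 400, 600)
)

by_ada <- d_toy %>%
  # Step 1: one mean per station.
  group_by(station_id, ada) %>%
  summarise(mean_rides = mean(rides), .groups = "drop") %>%
  # Step 2: count stations and average the station means per ada group.
  group_by(ada) %>%
  summarise(n_stations = n(), mean_daily_rides = mean(mean_rides))
```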
Now let’s look at the weather data.
Is there a linear relationship between the following weather data and the average number of daily rides?
This data is filtered to exclude dates after March 1, 2020. Ridership dropped drastically during COVID due to city-wide lockdowns. Before then, the relationships between rides and weather are much clearer.
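To make the filter and the linear-relationship check concrete, here is a minimal base R sketch. The data is simulated: the `daily` data frame, the slope of 15 rides per degree, and the noise level are arbitrary assumptions, so the fitted numbers say nothing about the real data.

```r
set.seed(42)
n <- 200

# Simulated daily averages (not real data).
daily <- data.frame(
  date      = seq(as.Date("2019-12-01"), by = "day", length.out = n),
  high_temp = round(runif(n, 10, 90))
)
# Assumed linear effect of temperature plus noise.
daily$avg_rides <- 3000 + 15 * daily$high_temp + rnorm(n, sd = 100)

# Keep only dates up to March 1, 2020, before lockdowns began.
pre_covid <- subset(daily, date <= as.Date("2020-03-01"))

# Simple linear regression of average rides on the daily high.
fit <- lm(avg_rides ~ high_temp, data = pre_covid)
coef(fit)
summary(fit)$r.squared
```

The same pattern would apply to `low_temp`, `prec`, or `avg_wind_speed` by swapping the predictor in the formula.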
Could ridership in X years be predicted using a regression model?
Are there relationships between the geographic locations of stations and the amount of ridership they get?
How does this analysis change if the year range is increased (e.g., to the past 20-30 years)?
Data:
References that provided context on certain dates in the data: