Background

I decided to look into data about the number of entries at stations on the L in Chicago. Growing up in Chicago, I used to ride the L on most weekends, to get to concerts, camp, the beach, museums, and more.

Map of the L train in Chicago

Summary of the data

I joined 3 different data sets into one: daily station ridership, station details, and daily weather.

Note: All of the weather data was collected from a single nearby weather station, at Midway Airport.
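As a rough sketch of what a join like this could look like in base R: the three toy data frames below are hypothetical stand-ins, and the choice of station_id and date as join keys is an assumption based on the columns in the combined data, not a record of the exact code used.

```r
# Toy stand-ins for the three sources; all names and values are hypothetical.
rides <- data.frame(
  station_id = c(1, 1, 2),
  date       = as.Date(c("2017-12-01", "2017-12-02", "2017-12-01")),
  rides      = c(100, 80, 250)
)
stations <- data.frame(
  station_id   = c(1, 2),
  station_name = c("Station A", "Station B"),
  linecolor    = c("Blue", "Red")
)
weather <- data.frame(
  date      = as.Date(c("2017-12-01", "2017-12-02")),
  high_temp = c(41, 45),
  prec      = c(0.00, 0.02)
)

# Station details attach by station_id (assumed key).
d_toy <- merge(rides, stations, by = "station_id", all.x = TRUE)

# Weather attaches by date (assumed key). With a single weather station,
# every L station gets the same weather values on a given day.
d_toy <- merge(d_toy, weather, by = "date", all.x = TRUE)

print(d_toy)
```

Using all.x = TRUE keeps every ridership row even if a station or a date has no match in the other tables, so gaps show up as NA rather than as silently dropped rows.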

What does the data look like?

glimpse(d)
## Rows: 519,420
## Columns: 18
## $ station_id               <dbl> 41280, 41000, 40280, 40140, 40690, 41660, 401…
## $ station_name             <fct> Jefferson Park, Cermak-Chinatown, Central-Lak…
## $ date                     <date> 2017-12-22, 2017-12-18, 2017-12-02, 2017-12-…
## $ day_type                 <fct> Weekday, Weekday, Saturday, Weekday, Sunday/H…
## $ rides                    <dbl> 6104, 3636, 1270, 1759, 499, 8615, 442, 1353,…
## $ station_name_simple      <chr> "Jefferson Park", "Cermak-Chinatown", "Centra…
## $ station_name_descriptive <chr> "Jefferson Park (Blue Line)", "Cermak-Chinato…
## $ linecolor_simple         <chr> "Blue", "Red", "Green", "Yellow", "Purple", "…
## $ linecolor                <fct> "Blue", "Red", "Green", "Yellow", "Purple", "…
## $ line_type                <fct> Single, Single, Single, Single, Single, Singl…
## $ ada                      <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, T…
## $ lat                      <chr> "41.970634", "41.853206", "41.887389", "42.03…
## $ long                     <chr> "-87.760892", "-87.630968", "-87.76565", "-87…
## $ avg_wind_speed           <dbl> 7.83, 12.53, 5.14, 11.41, 6.26, 12.75, 3.80, …
## $ prec                     <dbl> 0.00, 0.02, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ high_temp                <dbl> 41, 45, 54, 51, 58, 15, 47, 54, 29, 51, 48, 8…
## $ low_temp                 <dbl> 35, 41, 35, 37, 31, 0, 32, 35, 19, 37, 34, -3…
## $ weather_event            <dbl> 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, …

What does ridership look like on a daily basis at stations?

On any given day, at any given station, how many people enter that station?

On any given day, how many people ride the L?

What does TOTAL ridership look like on a daily basis ACROSS stations?
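A minimal sketch of how the daily total could be computed, assuming each row is one station on one date (as the station_id, date, and rides columns suggest). The toy data frame is made up for illustration.

```r
# Hypothetical toy data: two stations over two days.
d_toy <- data.frame(
  station_name = c("A", "B", "A", "B"),
  date  = as.Date(c("2017-12-01", "2017-12-01", "2017-12-02", "2017-12-02")),
  rides = c(100, 250, 80, 300)
)

# Sum station entries over all stations for each date.
daily_total <- aggregate(rides ~ date, data = d_toy, FUN = sum)

print(daily_total)
```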

Is ridership normally distributed?

We can see how close to normal that distribution is with some Q-Q plots:
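For readers unfamiliar with Q-Q plots, here is a small sketch using base R's qqnorm() on simulated data. The right-skewed sample is only a stand-in for ride counts (an assumption, not the real data), and the normal sample is there for comparison.

```r
set.seed(42)

# Simulated, right-skewed "ride counts" (hypothetical) and a normal sample.
rides_sim  <- rlnorm(500, meanlog = 7, sdlog = 1)
normal_sim <- rnorm(500)

# qqnorm() pairs each sample quantile (y) with the matching theoretical
# normal quantile (x). plot.it = FALSE returns the pairs without drawing.
qq_skew <- qqnorm(rides_sim,  plot.it = FALSE)
qq_norm <- qqnorm(normal_sim, plot.it = FALSE)

# If the data were normal, the points would fall close to a straight line,
# so the correlation between x and y would be very close to 1.
r_skew <- cor(qq_skew$x, qq_skew$y)
r_norm <- cor(qq_norm$x, qq_norm$y)

print(c(skewed = r_skew, normal = r_norm))
```

Dropping plot.it = FALSE and following the call with qqline(rides_sim) draws the plot itself; a long right tail shows up as points curving away from the reference line at the upper end.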

Let’s take a closer look at some outliers

Those histograms all have long tails. A box plot shows us those outliers more clearly.

I manually labeled the top outlier by each day type.
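The labeling here was done by hand, but the same points could be found programmatically. The sketch below uses the usual box-plot rule (points more than 1.5 times the interquartile range beyond the hinges) via base R's boxplot.stats(); the toy data and station names are hypothetical.

```r
# Hypothetical toy data: six station-days per day type, one extreme value each.
d_toy <- data.frame(
  day_type     = rep(c("Weekday", "Saturday"), each = 6),
  station_name = rep(c("A", "B", "C", "D", "E", "F"), times = 2),
  rides        = c(100, 110, 120, 130, 140, 900,
                    50,  55,  60,  65,  70, 400)
)

# For each day type, keep the rows a box plot would draw as outlier points,
# then pick the single largest one.
top_outlier <- do.call(rbind, lapply(split(d_toy, d_toy$day_type), function(g) {
  out <- g[g$rides %in% boxplot.stats(g$rides)$out, ]
  out[which.max(out$rides), ]
}))

print(top_outlier)
```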

Chicago Cubs

(In Yellow) Addison is on the Red Line, right next to Wrigley Field. This was the day of a Cubs vs. Indians game in the 2016 World Series.

(In Brown) Lake/State serves multiple lines and is right in the heart of downtown. This was the day of the parade celebrating the Cubs' World Series win.

Fans walk to Grant Park for Cubs World Series Celebration

Source: https://www.denverpost.com/2016/11/04/chicago-cubs-world-series-rally-2016/

June 28, 2015 - Chicago Pride Parade

(In Pink) Belmont serves a few lines and is close to Boystown. This was the day of the Pride Parade in Chicago, two days after same-sex marriage was legalized nationwide in the U.S.

Chicago Pride Parade 2015

Source: https://www.timeout.com/chicago/lgbt/photos-from-the-2015-chicago-pride-parade

What does ridership look like by Line?

Note: “Multiple” stations are stations that serve more than one line color. Because rides count the number of entries into a station, we can’t know which train someone rode after entering.

The Loop

The Chicago Loop serves downtown Chicago and is a popular transit route for daily commuters and tourists alike. Many, but not all, of the “multiple” stations are located on the Loop.

The Chicago Loop

Top and Bottom Stations by total rides in 10 years

We looked at ridership by Line, what about by Station?

The top 2 stations here are on the Loop.
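A ranking like this can be sketched in a few lines of base R: sum rides per station, sort, and take both ends of the list. The data below is a hypothetical toy example, not the real station totals.

```r
# Hypothetical toy data: four stations, two days each.
d_toy <- data.frame(
  station_name = c("A", "A", "B", "B", "C", "C", "D", "D"),
  rides        = c(100, 80, 250, 300, 10, 20, 400, 90)
)

# Total rides per station, sorted from busiest to quietest.
totals <- aggregate(rides ~ station_name, data = d_toy, FUN = sum)
totals <- totals[order(totals$rides, decreasing = TRUE), ]

top_stations    <- head(totals, 2)  # busiest stations
bottom_stations <- tail(totals, 2)  # quietest stations

print(top_stations)
print(bottom_stations)
```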

Here we can see that (ignoring the “multiple” stations) the Red Line is the most popular. Was that the case for the full 10-year period?

Ridership by line over 10 years

Why was there a drop on the Yellow Line in 2015? From looking into it, it seems that a section of the embankment next to the track collapsed due to a construction failure at a nearby water reclamation plant, damaging the track.

Weather

Now let’s look at the weather data.

Correlation between Rides and Weather

Weather vs. Rides Linear Relationship

Is there a linear relationship between the following weather data and the average number of daily rides?

  • Average Wind Speed
  • High Temperature
  • Low Temperature

This data is filtered to exclude dates after March 1, 2020. Ridership dropped drastically during COVID due to city-wide lockdowns. Before then, the relationships between rides and weather are much clearer.
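To make the approach concrete, here is a minimal sketch of the filter, a correlation, and a simple linear fit in base R. The data is simulated, and the column names avg_rides and high_temp as used here, along with the size of the simulated drop, are assumptions for illustration only.

```r
set.seed(1)

# Simulated daily data (hypothetical): rides rise with the high temperature.
dates     <- seq(as.Date("2020-01-01"), as.Date("2020-04-30"), by = "day")
high_temp <- 30 + 0.3 * seq_along(dates) + rnorm(length(dates), sd = 5)
avg_rides <- 2000 + 15 * high_temp + rnorm(length(dates), sd = 100)

# Mimic a sharp drop in ridership after March 1, 2020 (made-up magnitude).
after <- dates > as.Date("2020-03-01")
avg_rides[after] <- avg_rides[after] * 0.2

daily <- data.frame(date = dates, high_temp = high_temp, avg_rides = avg_rides)

# Keep only dates up to and including March 1, 2020.
pre <- daily[daily$date <= as.Date("2020-03-01"), ]

# Pearson correlation, then a one-predictor linear regression.
r   <- cor(pre$high_temp, pre$avg_rides)
fit <- lm(avg_rides ~ high_temp, data = pre)

print(r)
print(coef(fit))
```

The same pattern applies to the other weather variables: swap high_temp for the low temperature or average wind speed column. Running cor() on the unfiltered daily data instead of pre shows how much the post-lockdown rows can distort the relationship.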

Possible next steps:

  • Could ridership in X years be predicted using a regression model?

  • Are there relationships between the geographic locations of stations and the amount of ridership they get?

  • How does this analysis change if the year range is increased (e.g., to the past 20-30 years)?