Cyclistic program analysis

In 2016, Cyclistic launched a bike-share service in Chicago. The program has since grown to a fleet of 5,824 bicycles across 692 stations. Yet the more profitable offering is not the most attractive one: the customer base is increasingly made up of casual riders rather than annual members. To keep the service running in the city, the company wants to turn casual riders into members of the program.

The business task, then, is to convert as many casual customers as possible into members of the Cyclistic program. To do so, we need analysis oriented toward building a marketing strategy. Taking into account that the program is a green, socially inclusive initiative, it reaches population segments that are interesting to explore (awareness, social profiles, interests, etc.), along with why bike use does or does not differ among them.

From this proposal we pose the following main question: why are more people casual riders rather than members of the program? We need to understand the main characteristics of casual and member customers. Answering this means operationalizing the differences between those customer segments and asking: how do annual members and casual riders use Cyclistic bikes differently? We can also include some questions about the social profile of customers, such as age, gender, or whatever other variable is present in the data set.

The tools selected were BigQuery for processing and RStudio for analysis, both highly recommended by the Google tutors.

In SQL we not only prepared each data set accordingly, but also processed the data, merging both tables and doing a preliminary exploration to pave the way for the rest of the process. BigQuery is a data warehouse, which lets us manipulate large data sets easily and accurately.

Prepare phase

The data set chosen is a public data source, provided to us through the Google Data Analytics course, and we can use it to explore how different customer types use Cyclistic bikes in Chicago. Continuing with data preparation, it is important to mention that the data was not collected by us but provided by the Google and Coursera team (it is secondary data from external sources).

Accordingly, as should be done, this data set prohibits misuse and safeguards the stored information. We therefore follow the good ethics practices required of data analysts, making explicit that we will not publish any private information about riders. In addition, to keep the information safe, we use version control and applied data anonymization: we kept sensitive information out of the analysis and took security steps to guarantee the privacy and security of the tools, the files, and the information manipulated.

Now it is time to mention some structural characteristics of this data. First, the data sets are comma-separated values files, which gives them a structured format of columns and rows. Since we work with both SQL in BigQuery and R in RStudio, in the first case we migrated the files from Google Sheets to the BigQuery environment (data migration). The variables correspond to the columns and the observations to the rows. The data follow the wide format, in which each row represents one observation with its attributes in columns.

Continuing with data preparation, we reviewed not only the general structure but also the data type of each variable in the data frames. There are character, numeric, and integer types stored, corresponding to qualitative and quantitative data. Inspecting the variables, among the numeric and integer types some are continuous ( … ) and some discrete ( …. ); the character variables are nominal ( …. ).
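A minimal sketch of that type inspection, assuming one of the quarterly CSVs has already been downloaded (the path is a placeholder):

# Read one quarterly file and list the storage class of each column
bikes_2019_df <- read.csv("Divvy_Trips_2019_Q1.csv")  # placeholder path
sapply(bikes_2019_df, class)  # character, integer, and numeric columns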

On another note, after a brief exploration of our variable of interest, the data could show an observation bias, because both data sets contain far more subscriber than casual customers. Moreover, the observations in both data sets show more subscribers than non-subscribers from 2019 to 2020. This preliminary exploration suggests that the research proposal could be motivated by fear of market uncertainty, that is, an interpretation bias. The business task hides that suggestion: supposedly there is a trend toward casual use of the bike service, even though the preliminary exploration suggests the contrary. Our findings about the differences between the two customer segments could therefore reject or confirm the underlying hypothesis: is there more growth in casual riders than in subscribers?

This research question is not minor, because the business task is aimed at changing how customers use the bikes. Answering it matters for future marketing strategies: changing or reinforcing customers' intention of use.

BigQuery SQL

Let's connect to BigQuery to show the steps we applied to get, prepare, and process the data. To document our BigQuery work, we decided to include the queries in this R Markdown. This is possible thanks to the following libraries and to code chunks that read SQL.

library(readr)
library(bigrquery)
library(DBI)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(stringr)
library(RSQLite)
library(vroom)
## 
## Attaching package: 'vroom'
## 
## The following objects are masked from 'package:readr':
## 
##     as.col_spec, col_character, col_date, col_datetime, col_double,
##     col_factor, col_guess, col_integer, col_logical, col_number,
##     col_skip, col_time, cols, cols_condense, cols_only, date_names,
##     date_names_lang, date_names_langs, default_locale, fwf_cols,
##     fwf_empty, fwf_positions, fwf_widths, locale, output_column,
##     problems, spec
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
knitr::opts_chunk$set(echo = TRUE)
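For reference, this is how a SQL chunk is declared in this R Markdown so that knitr runs the query against our database connection (a minimal sketch; db_local is the DBI connection created in the next step):

```{sql connection=db_local}
SELECT COUNT(*) AS n FROM bikes_2019
```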

There is an important step to cover before anything else: loading the files. In BigQuery we stored and migrated internal data from the different data sets; the first data was loaded from Google Drive, which required us to assign the first row of each loaded table as the header. We started loading the files from Google Drive after downloading the archive, noticed that the content of the first row was the header in both files, and assigned it as such. To simulate the task in RStudio, we used the DBI and RSQLite libraries to connect to a database located in RAM, processed by the SQLite engine through RStudio, loading the files from our device. dbWriteTable takes the loaded data frames and saves them inside the SQLite database db_local.

# 1. Create the connection (use a single name: db_local)
db_local <- dbConnect(RSQLite::SQLite(), ":memory:")

# 2. Define the file paths
ruta1 = "C:\\Users\\user\\Downloads\\Divvy_Trips_2019_Q1 - Divvy_Trips_2019_Q1.csv"
ruta2 = "C:\\Users\\user\\Downloads\\Divvy_Trips_2020_Q1 - Divvy_Trips_2020_Q1.csv"

# 3. Load the tables (make sure to use the same name 'db_local')
dbWriteTable(db_local, "bikes_2019", read.csv(ruta1))
dbWriteTable(db_local, "bikes_2020", read.csv(ruta2))

# Verification: this lists the tables that now exist in the SQL engine
dbListTables(db_local)
## [1] "bikes_2019" "bikes_2020"

Afterwards, we inspected both tables to confirm the columns and variables, first in the 2019 table through SQL, like this:

/* Without setting output.var, the result is printed right below */
SELECT 
    *
FROM 
    bikes_2019 
LIMIT 10
Displaying records 1 - 10
trip_id start_time end_time bikeid tripduration from_station_id from_station_name to_station_id to_station_name usertype gender birthyear
21742443 2019-01-01 0:04:37 2019-01-01 0:11:07 2167 390 199 Wabash Ave & Grand Ave 84 Milwaukee Ave & Grand Ave Subscriber Male 1989
21742444 2019-01-01 0:08:13 2019-01-01 0:15:34 4386 441 44 State St & Randolph St 624 Dearborn St & Van Buren St (*) Subscriber Female 1990
21742445 2019-01-01 0:13:23 2019-01-01 0:27:12 1524 829 15 Racine Ave & 18th St 644 Western Ave & Fillmore St (*) Subscriber Female 1994
21742446 2019-01-01 0:13:45 2019-01-01 0:43:28 252 1,783.00 123 California Ave & Milwaukee Ave 176 Clark St & Elm St Subscriber Male 1993
21742447 2019-01-01 0:14:52 2019-01-01 0:20:56 1170 364 173 Mies van der Rohe Way & Chicago Ave 35 Streeter Dr & Grand Ave Subscriber Male 1994
21742448 2019-01-01 0:15:33 2019-01-01 0:19:09 2437 216 98 LaSalle St & Washington St 49 Dearborn St & Monroe St Subscriber Female 1983
21742449 2019-01-01 0:16:06 2019-01-01 0:19:03 2708 177 98 LaSalle St & Washington St 49 Dearborn St & Monroe St Subscriber Male 1984
21742450 2019-01-01 0:18:41 2019-01-01 0:20:21 2796 100 211 St. Clair St & Erie St 142 McClurg Ct & Erie St Subscriber Male 1990
21742451 2019-01-01 0:18:43 2019-01-01 0:47:30 6205 1,727.00 150 Fort Dearborn Dr & 31st St 148 State St & 33rd St Subscriber Male 1995
21742452 2019-01-01 0:19:18 2019-01-01 0:24:54 3939 336 268 Lake Shore Dr & North Blvd 141 Clark St & Lincoln Ave Subscriber Male 1996

and then in the 2020 table, queried the same way:

/* Without setting output.var, the result is printed right below */
SELECT 
    *
FROM 
    bikes_2020 
LIMIT 10

Second, we started exploring the data frames. Exploration showed that the main variable had inconsistent values across years (Customer and Subscriber in 2019; casual and member in 2020). It also showed that some 2019 variables were absent in 2020. So first we explored the main variable.

/* Without setting output.var, the result is printed right below */
SELECT 
    usertype,
    COUNT(usertype) AS count
FROM 
    bikes_2019
GROUP BY usertype
    /* We limited the output in the present project */
LIMIT 10 
2 records
usertype count
Customer 23163
Subscriber 341906

In 2019 there were 341,906 subscribers and 23,163 casual customers, plus one stray 'usertype' value. Likewise, in 2020 there were 107,492 members and 6,149 casual customers, plus one null and one stray 'member_casual' value. Since we wanted to join both data frames, we needed to standardize these values. We did the same with gender, to try to keep that relevant information.
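For completeness, the equivalent check on the 2020 table was run the same way (a sketch through the DBI connection; in 2020 the column is member_casual):

dbGetQuery(db_local, "
  SELECT member_casual, COUNT(member_casual) AS count
  FROM bikes_2020
  GROUP BY member_casual
")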

Continuing with data preparation, another important observation was that all variables were stored as strings, so we had to cast each one according to the quantitative or qualitative type assumed by theory. In a single query we did both the data merging and the data casting of date and string variables.

CREATE TABLE data_merged AS
SELECT 
    CAST(trip_id AS TEXT) AS ride_id,
    DATETIME(start_time) AS started_at,
    DATETIME(end_time) AS ended_at,
    CAST(gender AS TEXT) AS gender, 
    CAST(birthyear AS INTEGER) AS birth_year, -- check that the source column is named birthyear
    CAST(bikeid AS TEXT) AS rideable_type,
    from_station_name AS start_station_name,
    CAST(from_station_id AS TEXT) AS start_station_id,
    to_station_name AS end_station_name,
    CAST(to_station_id AS TEXT) AS end_station_id,
    CASE 
      WHEN usertype = 'Subscriber' THEN 'member'
      WHEN usertype = 'Customer' THEN 'casual'
      ELSE usertype 
    END AS member_casual
FROM  bikes_2019
WHERE trip_id IS NOT NULL

UNION ALL

SELECT 
    ride_id,
    DATETIME(started_at) AS started_at,
    DATETIME(ended_at) AS ended_at,
    'Unknown' AS gender, 
    NULL AS birth_year,
    rideable_type,
    start_station_name,
    CAST(start_station_id AS TEXT) AS start_station_id,
    end_station_name,
    CAST(end_station_id AS TEXT) AS end_station_id,
    member_casual
FROM  bikes_2020
WHERE ride_id IS NOT NULL

To enable the comparative analysis between 2019 and 2020, we joined both schemas through UNION ALL. Because this R Markdown runs SQLite rather than BigQuery, we exchanged the STRING and INT64 types for TEXT and INTEGER. We applied the DATETIME function to normalize the timestamps. As mentioned, we recoded the Customer and Subscriber values to match the newer format and ensure a consistent merged sample.

/* Create the final table, adding time dimensions */

CREATE TABLE data_table AS
SELECT 
    *,
    -- Day-of-week column (0 is Sunday; translated to a name)
    CASE strftime('%w', started_at)
        WHEN '0' THEN 'Sunday'
        WHEN '1' THEN 'Monday'
        WHEN '2' THEN 'Tuesday'
        WHEN '3' THEN 'Wednesday'
        WHEN '4' THEN 'Thursday'
        WHEN '5' THEN 'Friday'
        WHEN '6' THEN 'Saturday'
    END AS day_of_week,
    
    -- Time-of-day column, derived from the extracted hour
    CASE 
        WHEN CAST(strftime('%H', started_at) AS INT) BETWEEN 6 AND 11 THEN 'Morning'
        WHEN CAST(strftime('%H', started_at) AS INT) BETWEEN 12 AND 17 THEN 'Afternoon'
        WHEN CAST(strftime('%H', started_at) AS INT) BETWEEN 18 AND 23 THEN 'Night'
        ELSE 'Late Night' 
    END AS time_of_day
FROM data_merged
WHERE started_at IS NOT NULL;

We merged both transformed data sets on their common variables, even though some variables do not match across years. We cast values to string and timestamp types for dates and other qualitative variables. For timestamp variables we extracted the date parts needed to approximate the riding moment, casting to integer where appropriate. To build this final table we translated the filtering logic from BigQuery into this R Markdown. The main difference is the date and hour manipulation: instead of FORMAT_TIMESTAMP and EXTRACT, the present document applies strftime() to get both the weekday and the hour of day. As you can see, we applied a data mapping with a CASE structure to index the weekday names accordingly.
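For reference, the BigQuery version of the same time-dimension derivation looked roughly like this (a sketch, not executed in this document; it assumes started_at is stored as a TIMESTAMP):

# BigQuery dialect: FORMAT_TIMESTAMP('%A', ...) returns the weekday name directly,
# and EXTRACT(HOUR FROM ...) returns the hour of day
bq_time_dims <- "
SELECT
  *,
  FORMAT_TIMESTAMP('%A', started_at) AS day_of_week,
  CASE
    WHEN EXTRACT(HOUR FROM started_at) BETWEEN 6 AND 11 THEN 'Morning'
    WHEN EXTRACT(HOUR FROM started_at) BETWEEN 12 AND 17 THEN 'Afternoon'
    WHEN EXTRACT(HOUR FROM started_at) BETWEEN 18 AND 23 THEN 'Night'
    ELSE 'Late Night'
  END AS time_of_day
FROM data_merged
WHERE started_at IS NOT NULL"
# e.g. bigrquery::bq_project_query("your-project-id", bq_time_dims)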

After that, we derived the ride length of each observation from existing variables (started_at and ended_at) into a new one (ride_length), adding a data filter on ride length. We took the chance to rename our main category to status_service, as follows:

/* Compute ride_length and rename the status column.
   In SQLite, the difference is computed by subtracting Unix-epoch values.
*/
CREATE TABLE data_frame AS
SELECT *
FROM (
    SELECT 
        ride_id,
        started_at,
        ended_at,
        gender,
        birth_year,
        rideable_type,
        start_station_name,
        start_station_id,
        end_station_name,
        end_station_id,
        day_of_week,
        time_of_day,
        member_casual AS status_service,
        (strftime('%s', ended_at) - strftime('%s', started_at)) AS ride_length
    FROM data_table
) AS subquery
WHERE ride_length > 0;

Below is a final look at the prepared and processed data frames: a snapshot of our variables of interest before adding geo points, locations, and age, and before selecting the final categories.


SELECT 
    COUNT(*)
FROM data_frame
1 records
COUNT(*)
336351

For the last steps we applied reverse geocoding. The neighborhood boundaries used in this report are sourced from the Chicago Data Portal. This data set is fundamental for data segmentation, as it provides the infrastructure needed to transform latitude and longitude coordinates into human-readable community names.

Technical Note: The dataset includes the the_geom column, which contains the MultiPolygon geometries for Chicago’s 77 community areas. This column was processed via SQL to perform a Spatial Join with the Divvy bike station coordinates, ensuring each station is accurately mapped to its geographic district.

The station infrastructure data was retrieved from the Chicago Data Portal - Divvy Bicycle Stations. This dataset provides a real-time and historical inventory of the bicycle docking stations available across the city, which is crucial for identifying the physical touchpoints of the bike-share network.

Technical Note: This dataset provides the exact Latitude and Longitude for each docking station. These coordinates served as the “point” entities during the spatial join process, allowing them to be intersected with the community area polygons to determine the geographic distribution of the Divvy fleet.

We used this information to map the station locations and thus derive some social variables, such as location, and a deeper social context about users. The idea is that we can approximate a rider's place of residence from the name of the starting station, assuming that in the morning people are likely to be close to their homes. This inference lets us broaden the analysis and measure differences in other variables.

The first merge was between the community geo-location data and the bike station location data. To ensure computational efficiency and maintain the integrity of complex geospatial relationships, this project employs a hybrid data processing strategy. The high-overhead Spatial Join, which determines the community area for each bike station using the ST_CONTAINS logic, was executed within Google BigQuery. This leverages cloud-native GIS capabilities to handle high-precision polygon intersections that would otherwise be memory-intensive for a local environment.
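A sketch of that spatial join as it ran in BigQuery (the project path and table names are placeholders; the_geom holds the community polygons as WKT):

# BigQuery GIS sketch: match each station point to the polygon containing it
spatial_join_sql <- "
SELECT
  s.*,
  c.community,
  c.area_numbe
FROM `your-project.your_dataset.divvy_stations` AS s
JOIN `your-project.your_dataset.community_areas` AS c
  ON ST_CONTAINS(
       ST_GEOGFROMTEXT(c.the_geom),            -- community MultiPolygon (WKT)
       ST_GEOGPOINT(s.longitude, s.latitude))  -- station coordinates as a point
"
# e.g. bigrquery::bq_project_query("your-project-id", spatial_join_sql)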

The resulting processed dataset was then exported as a flat file and integrated into our local SQLite environment. This approach allows us to bypass technical hardware limitations without sacrificing data depth. Consequently, the following sections focus on the Exploratory Data Analysis (EDA), where we utilize the pre-joined locations to uncover patterns in station density, rider behavior by community, and infrastructure distribution across the city of Chicago.

To do this, we migrated the merged data sets within the BigQuery environment and finished with a last merge that added the geo-codes together with longitude and latitude. We used an INNER JOIN to keep only the values in our data of interest, because the sample is limited to Cyclistic customers.

Once we had the final data frame after exploring, preparing, and processing the data, we exported the information as a CSV file to start processing it in RStudio.

Data exploration

In Google Cloud Storage (GCS), a structured export process was carried out to bypass the browser's direct download limit (usually 10 MB or around 16,000 rows). Since the original file was larger than 1 GB, BigQuery could not download it as a single file. The solution was to export the query results to a "bucket" (a temporary storage area in Google's cloud).

Using a wildcard (*) in the file path, Google automatically fragmented the large data set into 53 smaller, manageable CSV files. Once the cloud export was complete, we used the web interface and the gsutil command-line tool to transfer these 53 individual files to the local Downloads folder. The final result is that all the complete data is on the local machine, ready to be read into RStudio using vroom.
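A sketch of that export with bigrquery (project, dataset, table, and bucket names are placeholders):

library(bigrquery)
# Export the result table to GCS; the * wildcard shards the output into many CSVs
bq_table_save(
  bq_table("your-project-id", "your_dataset", "data_bike_final"),
  destination_uris = "gs://your-bucket/data_bike_final_*.csv"
)
# Then, from a terminal: gsutil -m cp "gs://your-bucket/data_bike_final_*.csv" ~/Downloads/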

library(vroom)

# Step 1: Get the full list of file paths
# list.files() finds the names; full.names=TRUE makes R use the full path.
todos_los_archivos <- list.files(
  path = "C:/Users/user/Downloads/", 
  pattern = "data_bike_final_.*\\.csv", # use a regular expression for the pattern
  full.names = TRUE
)

# Check in the console that this variable contains 53 paths
print(paste("Archivos encontrados:", length(todos_los_archivos)))
## [1] "Archivos encontrados: 53"
# Step 2: vroom reads the explicit list of files
# We pass the exact list of names, without using the * wildcard in the main path.
data_bike <- vroom(todos_los_archivos, show_col_types = FALSE)

# Confirm that all records are now loaded
print(paste("Total de filas cargadas:", nrow(data_bike)))
## [1] "Total de filas cargadas: 423137"

The variables listed show some common points to keep in mind: the type of service per customer, the trip or ride duration, the starting and ending points of the ride, and others like gender or birth year, which could reveal relevant patterns of bike use. Sorting and filtering the data, we carried the main variables into the data merge in SQL.

Last but not least, we applied SQL tools to validate the data type and values of each variable, checking the consistency of every value to prepare the data for the next steps.
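Those checks looked like the following (a sketch; typeof() is SQLite's per-value type inspection, run here against the merged data_frame table):

dbGetQuery(db_local, "
  SELECT typeof(ride_length) AS stored_type, COUNT(*) AS n
  FROM data_frame
  GROUP BY typeof(ride_length)
")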

head(data_bike)
nrow(data_bike)
## [1] 423137

Process

Once the data was prepared, we chose the relevant values to answer our question, exploring the differences between customers according to their type, social characteristics, location, and duration of bike use. To reach those goals we minimized any issues for the analysis: we cast data types, dropped NA values, and omitted wrong data.

The analysis was aimed at capturing the differences between the two user segments, Subscribers and Casual riders.

The first step in RStudio was exploring the previously built data frames, already cleaned and transformed according to our analysis needs. In this step we inspected the variables, observations, null values, data types, and outliers, in order to normalize the data and select the categories.

##  [1] "COMMUNITY"        "AREA_NUMBE"       "Station Name"     "Total Docks"     
##  [5] "Status"           "Longitude"        "Latitude"         "ride_id"         
##  [9] "started_at"       "ended_at"         "gender"           "birth_year"      
## [13] "rideable_type"    "start_station_id" "end_station_id"   "date"            
## [17] "day_of_week"      "time_of_day"      "status_service"   "ride_length"     
## [21] "end_station_name" "age"

So first we deleted the unnecessary columns:

  • start_station_id (the station name already identifies the starting point)
  • end_station_id (the station name already identifies the ending point)
  • started_at (weekday, date, and time of day are already derived)
  • ended_at (ride_length already captures the duration)
  • rideable_type
  • birth_year (age is already derived)
  • Total Docks (external information not needed here)
df_bike <- data_bike %>%  
  select(-c( birth_year, rideable_type, started_at, ended_at , end_station_id,  start_station_id, "Total Docks"))

colnames(df_bike)
##  [1] "COMMUNITY"        "AREA_NUMBE"       "Station Name"     "Status"          
##  [5] "Longitude"        "Latitude"         "ride_id"          "gender"          
##  [9] "date"             "day_of_week"      "time_of_day"      "status_service"  
## [13] "ride_length"      "end_station_name" "age"
# Reorder the columns
df_bike <- df_bike %>%
  select(
    ride_id,                    # trip ID
    status_service,             # member type (member/casual)
    date,                       # date
    day_of_week,                # day of week
    time_of_day,                # time of day
    ride_length,                # duration
    gender,                     # gender
    age,                        # age
    "Station Name",               # origin station
    end_station_name,           # destination station
    Status,                     # station status
    Latitude, Longitude,        # coordinates
    COMMUNITY, AREA_NUMBE       # community data
  )

This gives a better overview for planning the analysis. We now have 423,137 observations and 15 variables: the identification variable; the target variable status_service; time variables such as date, day_of_week, and time_of_day; ride references such as ride_length, Station Name, and end_station_name; location references like Latitude and Longitude; and other social variables such as gender, COMMUNITY, AREA_NUMBE, and age. Most are of character type, except date, which is in date format, and the doubles (ride_length, age, Latitude, Longitude, and AREA_NUMBE).

glimpse(df_bike)
## Rows: 423,137
## Columns: 15
## $ ride_id          <chr> "21743421", "21744156", "21742530", "21743928", "2174…
## $ status_service   <chr> "member", "member", "member", "member", "member", "me…
## $ date             <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019…
## $ day_of_week      <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday"…
## $ time_of_day      <chr> "Afternoon", "Night", "Night", "Afternoon", "Afternoo…
## $ ride_length      <dbl> 615, 934, 497, 1603, 510, 472, 485, 2832, 1792, 227, …
## $ gender           <chr> "Male", "Male", "Female", "Male", "Male", "Male", "Ma…
## $ age              <dbl> 27, 28, 30, 45, 19, 36, 38, NA, 30, 27, 50, 34, 37, 5…
## $ `Station Name`   <chr> "Eckhart Park", "Lincoln Ave & Waveland Ave", "State …
## $ end_station_name <chr> "Damen Ave & Pierce Ave", "Damen Ave & Cortland St", …
## $ Status           <chr> "In Service", "In Service", "In Service", "In Service…
## $ Latitude         <dbl> 41.89637, 41.94880, 41.88938, 41.96845, 41.96300, 41.…
## $ Longitude        <dbl> -87.66098, -87.67528, -87.62708, -87.67423, -87.68478…
## $ COMMUNITY        <chr> "WEST TOWN", "NORTH CENTER", "NEAR NORTH SIDE", "LINC…
## $ AREA_NUMBE       <dbl> 24, 5, 8, 4, 4, 7, 6, 77, 8, 8, 24, 7, 8, 33, 7, 32, …

It is necessary to look at a summary of the variables to decide on further cleaning or transformation steps.

summary(df_bike)
##    ride_id          status_service          date            day_of_week       
##  Length:423137      Length:423137      Min.   :2019-01-01   Length:423137     
##  Class :character   Class :character   1st Qu.:2019-02-05   Class :character  
##  Mode  :character   Mode  :character   Median :2019-03-11   Mode  :character  
##                                        Mean   :2019-05-09                     
##                                        3rd Qu.:2019-03-30                     
##                                        Max.   :2020-01-31                     
##                                                                               
##  time_of_day         ride_length          gender               age        
##  Length:423137      Min.   :       2   Length:423137      Min.   : 16.00  
##  Class :character   1st Qu.:     322   Class :character   1st Qu.: 29.00  
##  Mode  :character   Median :     515   Mode  :character   Median : 34.00  
##                     Mean   :    1060                      Mean   : 37.29  
##                     3rd Qu.:     842                      3rd Qu.: 44.00  
##                     Max.   :10632022                      Max.   :119.00  
##                                                           NA's   :115300  
##  Station Name       end_station_name      Status             Latitude    
##  Length:423137      Length:423137      Length:423137      Min.   :41.74  
##  Class :character   Class :character   Class :character   1st Qu.:41.88  
##  Mode  :character   Mode  :character   Mode  :character   Median :41.89  
##                                                           Mean   :41.90  
##                                                           3rd Qu.:41.91  
##                                                           Max.   :42.02  
##                                                                          
##    Longitude       COMMUNITY           AREA_NUMBE   
##  Min.   :-87.77   Length:423137      Min.   : 1.00  
##  1st Qu.:-87.65   Class :character   1st Qu.: 8.00  
##  Median :-87.64   Mode  :character   Median :24.00  
##  Mean   :-87.64                      Mean   :21.27  
##  3rd Qu.:-87.63                      3rd Qu.:32.00  
##  Max.   :-87.55                      Max.   :77.00  
## 
# 1. Count null values (except for age and gender)
null_report <- df_bike %>%
  select(-age, -gender) %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Null_Count")

# 2. Count unique values for age and gender
unique_values <- df_bike %>%
  summarise(
    unique_age = n_distinct(age, na.rm = TRUE),
    unique_gender = n_distinct(gender, na.rm = TRUE)
  )

# 3. Count inconsistent ride_length values (less than 1 second)
invalid_rides <- df_bike %>%
  filter(ride_length < 1) %>%
  count()

# Print the results to the console
print("--- Null values report ---")
## [1] "--- Null values report ---"
print(null_report)
## # A tibble: 13 × 2
##    Variable         Null_Count
##    <chr>                 <int>
##  1 ride_id                   0
##  2 status_service            1
##  3 date                      0
##  4 day_of_week               0
##  5 time_of_day               0
##  6 ride_length               0
##  7 Station Name              0
##  8 end_station_name          0
##  9 Status                    0
## 10 Latitude                  0
## 11 Longitude                 0
## 12 COMMUNITY                 0
## 13 AREA_NUMBE                0
print("--- Unique values report ---")
## [1] "--- Unique values report ---"
print(unique_values)
## # A tibble: 1 × 2
##   unique_age unique_gender
##        <int>         <int>
## 1         72             3
print(paste("--- Invalid trips (< 1 seg):", invalid_rides$n))
## [1] "--- Invalid trips (< 1 seg): 0"
library(dplyr)

# Apply the cleaning and transformation rules
table_cleaned <- df_bike %>%
  # 1. Cap ages above 80 at exactly 80
  mutate(age = ifelse(age > 80, 80, age)) %>%
  
  # 2. Convert gender values other than Male/Female to 'Unknown'
  # We also handle null values (NAs) in this column
  mutate(gender = case_when(
    gender == "Male" ~ "Male",
    gender == "Female" ~ "Female",
    TRUE ~ "Unknown"
  )) %>%
  
  # 3. Drop nulls in status_service and keep only valid segments
  # Note: the merged data uses 'member'/'casual' alongside 'Subscriber'/'Customer'
  filter(!is.na(status_service)) %>%
  filter(status_service %in% c("Subscriber", "Customer", "member", "casual"))

# Verify the results
summary(table_cleaned$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   16.00   29.00   34.00   37.27   44.00   80.00  115299
table(table_cleaned$gender)
## 
##  Female    Male Unknown 
##   58855  247424  116857
table(table_cleaned$status_service)
## 
## casual member 
##  23677 399459

To ensure the statistical robustness of the analysis, several data normalization and imputation procedures were implemented:

Age Capping: Values in the age variable exceeding 80 were capped at 80. This technique mitigates the influence of extreme outliers (likely due to data entry errors) while preserving the integrity of the records for other variables.

Gender Categorization: To handle missing data and non-binary entries, the gender variable was recoded. Only “Male” and “Female” were retained as distinct categories, while all other values—including nulls—were consolidated into an “Unknown” category.

Target Variable Refinement: Observations with null values in status_service were removed. Furthermore, the data set was filtered to include only valid user segments—Subscriber (Member) and Customer (Casual)—aligning the data with the primary business objective of the analysis.

According to the previous summary, the data behave mostly as expected, except for age, which indicated riders with birth dates implausibly far in the past; this is what motivated the age capping applied above.
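A quick check of those implausible values (a sketch over df_bike, before the capping):

# Count rides whose recorded age exceeds 80 (birth years too far in the past)
df_bike %>% filter(age > 80) %>% count()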

library(janitor)


# Standardizing all headers to lowercase with underscores
table_cleaned <- table_cleaned %>%
  clean_names()

# Displaying the clean column names
colnames(table_cleaned)
##  [1] "ride_id"          "status_service"   "date"             "day_of_week"     
##  [5] "time_of_day"      "ride_length"      "gender"           "age"             
##  [9] "station_name"     "end_station_name" "status"           "latitude"        
## [13] "longitude"        "community"        "area_numbe"

To streamline the analysis and prevent reference errors, the data set headers were normalized using the clean_names() function. This process automatically converts all column names to lowercase and replaces spaces or irregular characters with underscores. By adopting this snake_case convention, the code becomes more readable and consistent, ensuring that functions across the tidyverse can interact with the variables without requiring special quoting for spaces or mixed casing.

Descriptive analysis

Starting the analysis, we identified the main variables: status_service and ride_length. We analyze each individually first. A frequency analysis gives an approximation, but it might hide some interesting relations we can explore; for example, is there a relationship between more or less service use and the status service? The difference is not minor, because it informs us about earnings according to usage and member counts even without financial information (we have neither the subscription cost nor the price of individual uses). Of course, we can still get close to a recommendation.

Target analysis

The combined outcome of both analyses could reveal differences in service use, not only in service profile. As we saw at first, the counts show more and more subscriber service:

  • there were 341,906 members and 23,163 casual customers in 2019
  • there were 107,492 members and 6,149 casual customers in 2020

The ride length information shows some points of interest.

print("--- Average report ---")
## [1] "--- Average report ---"
mean(table_cleaned$ride_length) #straight average (total ride length / rides)
## [1] 1059.954
print("--- Median report ---")
## [1] "--- Median report ---"
median(table_cleaned$ride_length) #midpoint number in the ascending array of ride lengths
## [1] 515
print("--- Max value report ---")
## [1] "--- Max value report ---"
max(table_cleaned$ride_length) #longest ride
## [1] 10632022
print("--- Min value report ---")
## [1] "--- Min value report ---"
min(table_cleaned$ride_length) #shortest ride
## [1] 2

As we can see, these are good usage measures for bikes in Chicago, which translates into good earnings in general.

But beyond these statistics, we need to know whether subscriber or casual service increases usage duration and, as a consequence, profits. For that we had to compute measures such as the average, maximum, and minimum values, and then cross them between segments. At the very least, this analysis approximates information about earnings.

Afterwards we can include an analysis of the social profile of customers, such as proxy location, gender, or age. To reach those variables we preprocessed each user's morning starting point and then mapped those locations against open-source data on poverty levels. For the others, gender and age, we applied an expansion based on demographic information linking the approximate place of residence to the data present in this project.

Compare results

print("--- Average report ---")
## [1] "--- Average report ---"
aggregate(table_cleaned$ride_length ~ table_cleaned$status_service, FUN = mean)
print("--- Median report ---")
## [1] "--- Median report ---"
aggregate(table_cleaned$ride_length ~ table_cleaned$status_service, FUN = median)
print("--- Max report ---")
## [1] "--- Max report ---"
aggregate(table_cleaned$ride_length ~ table_cleaned$status_service, FUN = max)
print("--- Min report ---")
## [1] "--- Min report ---"
aggregate(table_cleaned$ride_length ~ table_cleaned$status_service, FUN = min)

The last outcomes show indications that run contrary to the counts by status service. Although casual riders are fewer, they ride longer. The casual average ride is much higher across all measures: mean, median, and maximum values. This indicates that casual riders pay more per use, yet they are fewer.

Searching for more information about those outcomes, we can look at the average ride time by day for members vs casual users.

# Notice that the days of the week are out of order. Let's fix that.
table_cleaned$day_of_week <- ordered(table_cleaned$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

# Now, let's run the average ride time by each day for members vs casual users
aggregate(table_cleaned$ride_length ~ table_cleaned$status_service + table_cleaned$day_of_week, FUN = mean)

The results show longer ride lengths in the first days of the week than in the last ones. People tend to ride at the start of the week.

Likewise, the average ride time by day for members vs casual users shows longer durations among casual than member riders. On every day, casual customers exceed member users in ride length, which suggests that, independently of user counts, casual status increases bike use.

library(dplyr)

# Use the day_of_week column that already exists in the dataset
summary_by_day <- table_cleaned %>% 
  group_by(status_service, day_of_week) %>% 
  summarise(
    number_of_rides = n(),
    average_duration = mean(ride_length, na.rm = TRUE),
    .groups = "drop"
  ) %>% 
  # Sort by user type and day
  arrange(status_service, day_of_week)

# View the summary
summary_by_casual <- filter(summary_by_day, status_service == "casual")
summary_by_member <- filter(summary_by_day, status_service == "member")

print("--- Average report ---")
## [1] "--- Average report ---"
arrange(summary_by_day, desc(average_duration))
print("--- Average report casual customers---")
## [1] "--- Average report casual customers---"
arrange(summary_by_casual, desc(average_duration))
print("--- Average report member customer ---")
## [1] "--- Average report member customer ---"
arrange(summary_by_member, desc(average_duration))

The same analysis can show the preferred time to ride, which appears to be the night. Users in both service statuses are notably night riders, with casual users showing a higher volume of ride length than members, while the number of riders is instead higher among members, as we can observe below:

library(dplyr)

# Use the time_of_day column that already exists in the dataset
summary_by_time <- table_cleaned %>% 
  group_by(status_service, time_of_day) %>% 
  summarise(
    number_of_rides = n(),
    average_duration = mean(ride_length, na.rm = TRUE),
    .groups = "drop"
  ) %>% 
  # Sort by user type and time of day
  arrange(status_service, time_of_day)


print("--- Average report ---")
## [1] "--- Average report ---"
# View the summary
print(arrange(summary_by_time, desc(average_duration)))
## # A tibble: 6 × 4
##   status_service time_of_day number_of_rides average_duration
##   <chr>          <chr>                 <int>            <dbl>
## 1 casual         Night                  5143            5973.
## 2 casual         Morning                4693            5785.
## 3 casual         Afternoon             13841            5201.
## 4 member         Night                 85780            1006.
## 5 member         Afternoon            169059             781.
## 6 member         Morning              144620             694.
print("--- Number of rides report ---")
## [1] "--- Number of rides report ---"
# View the summary
print(arrange(summary_by_time, desc(number_of_rides)))
## # A tibble: 6 × 4
##   status_service time_of_day number_of_rides average_duration
##   <chr>          <chr>                 <int>            <dbl>
## 1 member         Afternoon            169059             781.
## 2 member         Morning              144620             694.
## 3 member         Night                 85780            1006.
## 4 casual         Afternoon             13841            5201.
## 5 casual         Night                  5143            5973.
## 6 casual         Morning                4693            5785.

The temporal analysis reveals a distinct contrast in peak usage times between user segments. While both groups show a surprising affinity for night-time riding and for the early days of the week, casual users represent a higher proportion of total ride volume during late-night hours, and member users a higher proportion of total users.

Both user segments keep the same time and day patterns, even though their volumes and user counts differ significantly.

The number of members substantially exceeds casual riders, reinforcing the hypothesis that members primarily use Divvy for work-related commuting. Conversely, the casual segment's dominance during non-traditional hours suggests a preference for recreational or nocturnal transit, providing a key opportunity for targeted evening marketing campaigns. However, in both user segments the number of users has decreased while the average ride has proportionally increased.

library(lubridate)

# Derive the year from the date column
summary_by_year <- table_cleaned %>% 
  mutate(year = year(as.Date(date))) %>%
  group_by(status_service, year) %>% 
  summarise(
    number_of_rides = n(),
    average_duration = mean(ride_length, na.rm = TRUE),
    .groups = "drop"
  ) %>% 
  # Sort by user type and year
  arrange(status_service, year)

print(summary_by_year)
## # A tibble: 4 × 4
##   status_service  year number_of_rides average_duration
##   <chr>          <dbl>           <int>            <dbl>
## 1 casual          2019           18506            4075.
## 2 casual          2020            5171           10529.
## 3 member          2019          303787             841.
## 4 member          2020           95672             661.

Socio-Economic and Demographic Data Integration

To define the user social profile, we leveraged the pre-categorized time_of_day variable. By filtering specifically for trips occurring during the 'Morning' period, we established a geographical proxy for the users' home locations. This assumes that the first movement of the day typically originates from the user's primary residence or local neighborhood.
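A minimal sketch of that filter, using the variables already present in table_cleaned:

# Keep morning trips and count morning starts per community as the home proxy
morning_homes <- table_cleaned %>%
  filter(time_of_day == "Morning") %>%
  group_by(community, status_service) %>%
  summarise(morning_starts = n(), .groups = "drop")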

To enrich the analysis of user behavior, we integrated external socio-economic data from the City of Chicago Data Portal, specifically focusing on the “Hardship Index” and poverty indicators per Community Area.

This allowed us to assign a socio-economic stratum (SES) to each trip based on the historical census data of the community area. By segmenting the data into High, Middle, and Low SES, we can now evaluate the accessibility and equity of the Divvy system. This analysis provides a window into the demographic composition of our service status groups, revealing whether 'Subscribers' primarily reside in low-hardship areas compared to 'Casual' users, and how age and gender distributions vary across these economic zones.

The data processing workflow consisted of the following key steps:

  1. Socio-Economic Mapping: We performed a join between the bike-sharing data set and Chicago’s community metrics, categorizing trips into three poverty tiers based on the percentage of households below the poverty line: Low (<5%), Moderate (5-10%), and High (>10%).

  2. Demographic Imputation: To address the high volume of “Unknown” values in gender and age fields (particularly among Casual riders), we implemented a Geographic Proportional Imputation. This method replaced unknown values by sampling from the known demographic distribution of each specific community, preserving local neighborhood profiles.

  3. Cross-Tabulation Analysis: We generated specialized tables to identify user personas. This included calculating the Median Age and Gender Distribution (Male vs. Female percentages) cross-referenced by status_service (Casual vs. Member) for each community.

This comprehensive approach transformed raw trip data into a multi-dimensional dataset, allowing us to analyze how gender gaps, age groups, and subscription preferences fluctuate across different economic landscapes in Chicago.

Socio-Economic Mapping

We start by standardizing the community values, lowercasing the names so they can be matched with the previously mapped poverty values.

library(dplyr)
library(stringr)

# 1. Create the mapping data frame with only the 45 communities present
mapeo_pobreza_restringido <- data.frame(
  community = c(
    "west town", "north center", "near north side", "lincoln square", 
    "lincoln park", "lake view", "edgewater", "near south side", 
    "loop", "logan square", "uptown", "rogers park", 
    "avondale", "north park", "irving park", "greater grand crossing", 
    "near west side", "bridgeport", "lower west side", "hyde park", 
    "albany park", "armour square", "west ridge", "douglas", 
    "washington park", "kenwood", "mckinley park", "east garfield park", 
    "woodlawn", "south lawndale", "englewood", "grand boulevard", 
    "north lawndale", "humboldt park", "new city", "fuller park", 
    "austin", "south shore", "portage park", "west garfield park", 
    "hermosa", "chatham", "west englewood", "south chicago", "avalon park"
  ),
  valor_pobreza = c(
    5.56, 2.18, 5.48, 4.68, 4.83, 4.51, 38.32, 9.62, 
    3.02, 5.14, 9.93, 9.56, 3.87, 5.87, 3.50, 8.22, 
    16.93, 6.00, 7.20, 14.51, 5.34, 18.22, 6.32, 10.62, 
    16.84, 22.71, 3.93, 11.42, 15.71, 5.53, 14.35, 14.98, 
    10.57, 9.79, 3.46, 11.05, 10.28, 10.29, 4.68, 15.25, 
    7.17, 19.74, 4.97, 5.31, 5.64
  )
)

# 2. Standardize the community names
table_cleaned <- table_cleaned %>% 
  mutate(community = str_trim(tolower(community)))


# 3. Join the poverty values onto the data frame
table_cleaned <- table_cleaned %>%
  left_join(mapeo_pobreza_restringido, by = "community")

# Verify the outcome
summary(table_cleaned$valor_pobreza)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.180   4.510   5.480   8.153  10.620  38.320

We then went through the poverty percentages by community, binning them into the following categories:

  • High Poverty: Communities with more than 10% of households below the poverty line.
  • Moderate Poverty: Communities between 5% and 10%.
  • Low Poverty: Communities with less than 5%.

Those bins allow us to segment the socio-economic profile of the users and analyze the status service accordingly. The finding is that most trips were made from intermediate to low strata, by members as well as casual customers.

library(dplyr)

# Create the categories based on the ranges defined above
table_cleaned <- table_cleaned %>%
  mutate(categoria_pobreza = case_when(
    valor_pobreza > 10 ~ "High (>10%)",
    valor_pobreza >= 5 & valor_pobreza <= 10 ~ " Intermediate (5% - 10%)",
    valor_pobreza < 5 ~ "Low (<5%)",
    TRUE ~ "Sin Datos"
  ))

# Order the categories so the charts come out in a logical order
table_cleaned$categoria_pobreza <- factor(table_cleaned$categoria_pobreza, 
                                        levels = c("Low (<5%)", " Intermediate (5% - 10%)", "High (>10%)"))

# Quick summary of how many trips fall in each stratum
resumen_social <- table_cleaned %>%
  group_by(categoria_pobreza, status_service) %>%
  summarise(total_trips = n(),
            average_duration = mean(ride_length, na.rm = TRUE),
            .groups = 'drop',
            )

print(resumen_social)
## # A tibble: 6 × 4
##   categoria_pobreza          status_service total_trips average_duration
##   <fct>                      <chr>                <int>            <dbl>
## 1 "Low (<5%)"                casual                9912            4094.
## 2 "Low (<5%)"                member              150944             723.
## 3 " Intermediate (5% - 10%)" casual                9824            6066.
## 4 " Intermediate (5% - 10%)" member              143111             877.
## 5 "High (>10%)"              casual                3941            7533.
## 6 "High (>10%)"              member              105404             797.

Even though members dominate in numbers, there are some interesting findings:

  • Among casual customers in low social strata, users made fewer trips but rode for longer times.
  • Among member customers the same holds, while still keeping the time record.
  • In contrast, comparing both social segments, ride duration increases among casual customers while user counts increase among members, as we can see next:
library(dplyr)
library(tidyr)

# 1. Build the summary table with percentages and totals
tabla_resumen <- table_cleaned %>%
  filter(!is.na(categoria_pobreza)) %>%
  group_by(categoria_pobreza, status_service) %>%
  summarise(cantidad = n(),
            average_duration = mean(ride_length, na.rm = TRUE),
            .groups = 'drop') %>%
  # Compute each segment's share of average duration within each poverty level
  group_by(categoria_pobreza) %>%
  mutate(
    porcentaje = round((average_duration / sum(average_duration)) * 100, 2),
    total_viajes_nivel = sum(average_duration)
  ) %>%
  ungroup() %>%
  # Pivot from long to wide format (cross-tab)
  select(categoria_pobreza, status_service, porcentaje, total_viajes_nivel) %>%
  pivot_wider(names_from = status_service, values_from = porcentaje) %>%
  # Compute each level's % of the global total
  mutate(porcentaje_del_total_global = round((total_viajes_nivel / sum(total_viajes_nivel)) * 100, 2))

print(tabla_resumen)
## # A tibble: 3 × 5
##   categoria_pobreza      total_viajes_nivel casual member porcentaje_del_total…¹
##   <fct>                               <dbl>  <dbl>  <dbl>                  <dbl>
## 1 "Low (<5%)"                         4817.   85.0  15.0                    24.0
## 2 " Intermediate (5% - …              6943.   87.4  12.6                    34.6
## 3 "High (>10%)"                       8330.   90.4   9.57                   41.5
## # ℹ abbreviated name: ¹​porcentaje_del_total_global

So casual customers not only ride longer; their share of use also increases as we move into higher-poverty sectors.

Nevertheless, this finding contrasts with the total number of customers in each status service. Members far outnumber casual customers, representing a considerable majority.

library(dplyr)
library(tidyr)

# 1. Build the summary table with percentages and totals
tabla_resumen <- table_cleaned %>%
  filter(!is.na(categoria_pobreza)) %>%
  group_by(categoria_pobreza, status_service) %>%
  summarise(cantidad = n(),
            average_duration = mean(ride_length, na.rm = TRUE),
            .groups = 'drop') %>%
  # Compute the % of casuals and members within each poverty level
  group_by(categoria_pobreza) %>%
  mutate(
    porcentaje = round((cantidad / sum(cantidad)) * 100, 2),
    total_viajes_nivel = sum(cantidad)
  ) %>%
  ungroup() %>%
  # Pivot from long to wide format (cross-tab)
  select(categoria_pobreza, status_service, porcentaje, total_viajes_nivel) %>%
  pivot_wider(names_from = status_service, values_from = porcentaje) %>%
  # Compute each level's % of the global total
  mutate(porcentaje_del_total_global = round((total_viajes_nivel / sum(total_viajes_nivel)) * 100, 2))

print(tabla_resumen)
## # A tibble: 3 × 5
##   categoria_pobreza      total_viajes_nivel casual member porcentaje_del_total…¹
##   <fct>                               <int>  <dbl>  <dbl>                  <dbl>
## 1 "Low (<5%)"                        160856   6.16   93.8                   38.0
## 2 " Intermediate (5% - …             152935   6.42   93.6                   36.1
## 3 "High (>10%)"                      109345   3.6    96.4                   25.8
## # ℹ abbreviated name: ¹​porcentaje_del_total_global

In summary: there are more member customers, but proportionally casual customers ride more. Both trends tend to strengthen in lower-income sectors.

Median Age and Gender Distribution

By applying this segmentation, we cross-referenced the imputed gender and median age data with status_service (Casual vs. Member). This allowed us to determine whether the subscription model (Member or Casual) is predominantly used by specific age groups or genders within different economic contexts. This multi-layered analysis reveals how the 'Member' profile varies across those contexts.

To resolve the missing data issue for the ‘gender’ variable, we moved beyond simple deletion. We implemented a Geographic Proportional Imputation. By using the starting community as a demographic anchor, we replaced ‘Unknown’ values with the predominant gender profile of that specific area. This preserved the unique socio-economic signature of each neighborhood, allowing for a high-integrity cross-analysis between gender, status_service, and the Poverty Index.

library(dplyr)
library(tidyr)

# 1. Build the weight (probability) table per community
pesos_genero <- table_cleaned %>%
  filter(gender %in% c("Male", "Female")) %>%
  group_by(community, gender) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  mutate(prob = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = gender, values_from = prob)

# 2. Function to impute based on the community's probabilities
imputar_genero <- function(comunidad, genero_orig, prob_M, prob_F) {
  # If the gender is already known, keep it as is
  if (genero_orig != "Unknown") {
    return(genero_orig)
  }
  # If the community has no data to decide, fall back to a global default
  if (is.na(prob_M) | is.na(prob_F)) {
    return("Male") # Asignación por defecto basada en tendencia global de Divvy
  }
  # Sample randomly according to the community's weights
  sample(c("Male", "Female"), size = 1, prob = c(prob_M, prob_F))
}

# 3. Apply the imputation to the main table
table_cleaned <- table_cleaned %>%
  left_join(pesos_genero, by = "community") %>%
  rowwise() %>%
  mutate(gender = imputar_genero(community, gender, Male, Female)) %>%
  ungroup() %>%
  select(-Male, -Female) # drop the helper columns

# Verify that no "Unknown" values remain
table(table_cleaned$gender)
## 
## Female   Male 
##  81138 341998

An important observation: there are many more male than female users. This repeats in every community where we observed the gender factor, for casual as well as member customers. It is an important point to consider when adjusting a marketing strategy to focus bike use on women.

library(dplyr)
library(tidyr)

# 1. Compute the gender distribution (using the imputed gender)
tabla_genero_status <- table_cleaned %>%
  group_by(community, status_service, gender) %>%
  summarise(total = n(), .groups = 'drop') %>%
  # Compute the gender percentage within each community and each status
  group_by(community, status_service) %>%
  mutate(porcentaje = round((total / sum(total)) * 100, 2)) %>%
  ungroup()

# 2. Pivot to a wide format to see Male and Female side by side
# This creates columns like "casual_Female", "casual_Male", etc.
tabla_genero_final <- tabla_genero_status %>%
  select(community, status_service, gender, porcentaje) %>%
  pivot_wider(
    names_from = c(status_service, gender), 
    values_from = porcentaje,
    names_sep = "_"
  )

# 3. View the results
print(tabla_genero_final)
## # A tibble: 45 × 5
##    community          casual_Female casual_Male member_Female member_Male
##    <chr>                      <dbl>       <dbl>         <dbl>       <dbl>
##  1 albany park                 18.0        82.0          20.5        79.5
##  2 armour square               27.6        72.4          22.2        77.8
##  3 austin                      23.1        76.9          30.1        69.9
##  4 avalon park                 16.7        83.3          25          75  
##  5 avondale                    14.6        85.4          18.8        81.2
##  6 bridgeport                  14.1        85.9          20.2        79.8
##  7 chatham                     28.6        71.4          18.6        81.4
##  8 douglas                     26.3        73.7          22.8        77.2
##  9 east garfield park          18.2        81.8          26.6        73.4
## 10 edgewater                   22.0        78.0          17.8        82.2
## # ℹ 35 more rows

We also computed the median age to profile users in every community. The median age is slightly lower for casual riders than for members, so members tend to be older than casual customers, which could help explain why casual riders prefer longer rides.

library(dplyr)
library(tidyr)

# 1. Median age per community and status
tabla_edad_status <- table_cleaned %>%
  group_by(community, status_service) %>%
  summarise(edad_mediana = median(age, na.rm = TRUE), .groups = 'drop')

# 2. Pivot to compare Casual vs Member side by side
tabla_edad_final <- tabla_edad_status %>%
  pivot_wider(names_from = status_service, values_from = edad_mediana) %>%
  rename(
    `Edad Mediana Casual` = casual,
    `Edad Mediana Member` = member
  )

# 3. Show the results
print(tabla_edad_final)
## # A tibble: 45 × 3
##    community          `Edad Mediana Casual` `Edad Mediana Member`
##    <chr>                              <dbl>                 <dbl>
##  1 albany park                         33                      37
##  2 armour square                       22.5                    31
##  3 austin                              33                      43
##  4 avalon park                         NA                      46
##  5 avondale                            29                      35
##  6 bridgeport                          29                      31
##  7 chatham                             29.5                    36
##  8 douglas                             28                      28
##  9 east garfield park                  28.5                    29
## 10 edgewater                           30                      35
## # ℹ 35 more rows

Summary

The findings identify a clear “usage vs. volume” paradox: members provide the volume (number of trips), while casual riders provide the intensity (duration and distance), especially in lower socio-economic sectors.

Socio-Economic & Demographic Divergence

The analysis reveals a significant “usage paradox” between user segments across Chicago’s communities:

The Intensity vs. Volume Gap: While Members constitute the vast majority of total trips, Casual riders demonstrate higher engagement in terms of trip duration and distance. This trend is particularly amplified in Low Socio-Economic (High Poverty) sectors, where casual trips are fewer in number but significantly longer in duration.

# Let's visualize the number of rides by rider type
table_cleaned %>% 
  group_by(status_service, day_of_week) %>% 
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>% 
  arrange(status_service, day_of_week) %>% 
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = status_service)) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'status_service'. You can override using
## the `.groups` argument.

# Let's create a visualization for average duration
table_cleaned %>% 
  group_by(status_service, day_of_week) %>% 
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>% 
  arrange(status_service, day_of_week) %>% 
  ggplot(aes(x = day_of_week, y = average_duration, fill = status_service)) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'status_service'. You can override using
## the `.groups` argument.

A key finding is that bike usage intensity increases in lower socio-economic sectors. Casual riders in these areas are not just riding for leisure; the longer distances suggest the service acts as a critical mobility tool for longer commutes where other transit options may be limited.
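
This claim can be checked with a short aggregation. The sketch below assumes the same hypothetical poverty_index column and reuses the 1–120 minute trip filter applied in the map further down:

library(dplyr)

# Median trip duration and trip counts per poverty band and user status.
# poverty_index is an assumed column name for the socio-economic indicator.
table_cleaned %>%
  mutate(
    ride_min = as.numeric(ride_length) / 60,
    poverty_band = ntile(poverty_index, 3) # 1 = lowest, 3 = highest poverty
  ) %>%
  filter(ride_min > 1, ride_min < 120) %>%
  group_by(poverty_band, status_service) %>%
  summarise(
    trips = n(),
    median_min = median(ride_min, na.rm = TRUE),
    .groups = "drop"
  )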

In addition, the data shows a consistent Gender Gap across all 45 communities. Males represent the predominant user base in both the Casual and Member segments. This disparity presents a high-value marketing opportunity: a strategy specifically designed to lower barriers for women could unlock a large, underserved market.

library(ggplot2)
library(dplyr)

# 1. Prepare the data (the 'long' format works best for ggplot)
df_genero_plot <- tabla_genero_status %>%
  filter(!is.na(gender)) %>% # Make sure there are no NAs
  # Order the communities by percentage (reorder uses the mean by default)
  mutate(community = reorder(community, porcentaje))

# 2. Build the stacked bar chart
ggplot(df_genero_plot, aes(x = porcentaje, y = community, fill = gender)) +
  geom_bar(stat = "identity", position = "fill") + # "fill" normalizes every bar to 100%
  facet_wrap(~status_service) +
  
  # Contrasting colors (blue for Male, red for Female)
  scale_fill_manual(values = c("Male" = "#2c7bb6", "Female" = "#d7191c"), 
                    name = "Gender") +
  
  # Vertical line at 50% to see at a glance which gender dominates
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "white", alpha = 0.5) +
  
  scale_x_continuous(labels = scales::percent) + # Percentage format
  
  labs(
    title = "Gender Distribution by Community and User Status",
    subtitle = "Percentage of Male vs Female across 100% of community trips",
    x = "Percentage Contribution",
    y = "Community Area"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 7), # Small text so every community fits
    legend.position = "top",
    strip.text = element_text(size = 12, face = "bold")
  )

To address the high volume of “Unknown” values in the gender column (primarily from Casual riders), we implemented a Geographic Proportional Imputation. This method preserves the specific demographic characteristics of each Chicago community.

  1. Imputation Logic: We calculated the probability of a user being Male or Female from the known distribution within each Community Area, then applied these weights to replace “Unknown” entries (a quick sanity check of this step is sketched below).
  2. Post-Imputation Analysis: After cleaning, we analyzed the gender distribution across service types (Member vs. Casual) to identify marketing opportunities.
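
As a sanity check on step 1, the imputed table can be compared against the original weights. This sketch assumes the pesos_genero weights table built earlier is still available in the session:

library(dplyr)

# Compare the post-imputation share of women per community with the share
# estimated from the known (non-"Unknown") records.
chequeo_imputacion <- table_cleaned %>%
  group_by(community) %>%
  summarise(pct_female_post = mean(gender == "Female"), .groups = "drop") %>%
  left_join(
    pesos_genero %>% select(community, pct_female_known = Female),
    by = "community"
  ) %>%
  mutate(diff = round(pct_female_post - pct_female_known, 3))

# Differences near zero indicate the imputation preserved each community's mix
summary(chequeo_imputacion$diff)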

Finally, by calculating the Median Age, we observed that Members tend to be older than Casual riders. This age difference helps explain the usage patterns: younger Casual riders show a higher preference for long-distance riding, whereas older Members likely use the service for shorter, more consistent “last-mile” utility trips.
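
One way to probe this explanation is to bin riders by age and compare median trip durations. The following sketch uses the existing age and ride_length columns; the age breaks are illustrative:

library(dplyr)

# Median ride duration per age band and service status, to test whether
# younger casual riders really take the longest trips.
table_cleaned %>%
  mutate(
    ride_min = as.numeric(ride_length) / 60,
    age_band = cut(age, breaks = c(15, 25, 35, 45, 55, 100),
                   labels = c("16-25", "26-35", "36-45", "46-55", "56+"))
  ) %>%
  filter(!is.na(age_band), ride_min > 1, ride_min < 120) %>%
  group_by(age_band, status_service) %>%
  summarise(
    trips = n(),
    median_min = median(ride_min, na.rm = TRUE),
    .groups = "drop"
  )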

Age Demographics: Casual vs. Member Profiles

To further refine our user personas, we calculated the Median Age for both segments across all Chicago communities. This metric helps distinguish between lifestyle-oriented riders and utility-based commuters.

  1. Methodology: We used the median() function to mitigate the influence of potential outliers in the age data, ensuring a representative age profile for each geographic area.
  2. Comparative Structure: The data was pivoted to allow a direct side-by-side comparison between the two service statuses.

library(ggplot2)
library(dplyr)

# Order the data so the community with the highest median age appears on top
df_edad_plot <- tabla_edad_status %>%
  filter(!is.na(edad_mediana)) %>%
  mutate(community = reorder(community, edad_mediana))

ggplot(df_edad_plot, aes(x = edad_mediana, y = community)) +
  # Background segment for a "lollipop" style
  geom_segment(aes(x = 0, xend = edad_mediana, y = community, yend = community), color = "gray90") +
  # Colored point for the median age
  geom_point(aes(color = status_service), size = 3) +
  # Facet by segment so the comparison is immediate
  facet_wrap(~status_service) +
  # Orange for casual, blue for member
  scale_color_manual(values = c("casual" = "#E27429", "member" = "#00539C")) +
  labs(
    title = "Median Age by Chicago Community",
    subtitle = "Analysis of User Demographics per Neighborhood",
    x = "Median Age (Years)",
    y = "Community Area"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 8), # Small text so every community fits
    strip.text = element_text(size = 14, face = "bold"),
    panel.grid.major.y = element_blank() # Remove horizontal grid lines to reduce clutter
  )

library(tigris)
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
library(sf)
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(ggplot2)
library(dplyr)

# 1. Official Chicago outline
chicago_shape <- places(state = "IL", cb = TRUE) %>%
  filter(NAME == "Chicago")
## Retrieving data for the year 2024
# 2. Aggregate trips per community, keeping rides between 1 and 120 minutes
df_mapa <- table_cleaned %>%
  mutate(ride_minutos = as.numeric(ride_length) / 60) %>%
  filter(ride_minutos > 1 & ride_minutos < 120) %>%
  group_by(community, longitude, latitude, status_service) %>%
  summarise(
    total_viajes = n(),
    duracion_mediana = median(ride_minutos, na.rm = TRUE),
    .groups = 'drop'
  )

# 3. Bubble map
ggplot() +
  # Chicago background (neutral gray so it does not compete with the bubbles)
  geom_sf(data = chicago_shape, fill = "#f2f2f2", color = "#d1d1d1", size = 0.4) +
  
  # Bubbles: color encodes intensity, size encodes volume
  geom_point(data = df_mapa, 
             aes(x = longitude, y = latitude, 
                 size = total_viajes, 
                 color = duracion_mediana), 
             alpha = 0.6) + 
  
  # Yellow-orange-red scale to highlight intensity
  scale_color_distiller(palette = "YlOrRd", direction = 1, name = "Intensity (Minutes)") +
  
  # Bubble size encodes volume
  scale_size_continuous(name = "Volume (Trips)", range = c(0.5, 9)) +

  # Annotations in a matching tone for visual coherence
  geom_label(data = data.frame(
      status_service = c("casual", "member"),
      label = c("CASUAL:\nHIGHER INTENSITY\nFewer Users", 
                "MEMBER:\nHIGHER FREQUENCY\nShorter Duration"),
      x = c(-87.82, -87.82),
      y = c(41.73, 41.73)
    ), 
    aes(x = x, y = y, label = label), 
    size = 2.5, fontface = "bold", fill = "white", color = "#d7301f", alpha = 0.9, hjust = 0) +

  facet_wrap(~status_service) +
  
  labs(
    title = "Divvy System Intensity: Casual vs. Member",
    subtitle = "Color Depth (Intensity) vs. Bubble Size (Volume)",
    caption = "Casual riders consistently show higher intensity (darker red) across Chicago."
  ) +
  theme_void() + # Clean map aesthetic
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 14),
    strip.text = element_text(size = 12, face = "bold")
  )