In 2016, Cyclistic launched a bike-share service in Chicago. Over the years the program has grown to a fleet of 5,824 bicycles across 692 stations. Yet the most profitable segment is not the one that is growing: customers are increasingly casual riders rather than annual members. To keep the service viable in the city, the company therefore wants to ensure that casual riders become members of the program.
The business task, then, is to convert as many casual customers as possible into members of the Cyclistic program. To do so, we need analysis that can inform a marketing strategy. Since the program is also a green and social-inclusion initiative, it reaches population segments worth exploring (awareness, social profiles, interests, etc.), together with why bike usage does or does not differ among them.
From this proposal we pose the following main questions: why are more people casual riders rather than members of the program? We need to understand the main characteristics of casual and member customers. Answering this implies operationalizing the differences between those customer segments and asking: how do annual members and casual riders use Cyclistic bikes differently? We also include questions about the social profile of customers, such as age, gender, or any other variable present in the data set.
The tools selected were BigQuery for processing the data and RStudio for analysis, both highly recommended by the Google course instructors.
In SQL we not only prepared each data set accordingly, but also processed the data to merge both tables and run a preliminary exploration that set up the rest of the workflow. BigQuery is a data warehouse, which lets us manipulate large data sets easily and accurately.
The data set chosen is a public data source. It is provided by the Google Data Analytics course, and we can use it to explore how different customer types use Cyclistic bikes in Chicago. Continuing with data preparation, it is important to mention that the data was not collected by us but supplied by the Google and Coursera team (it is secondary data from external sources).
Accordingly, this data set restricts and safeguards the stored information. We follow the good-practice ethics required of data analysts and state explicitly that we will not publish any private information about riders. In addition, to keep the information safe, we use version control and apply data anonymization: we keep sensitive information out of the analysis and take security steps to guarantee the privacy and security of the tools, files, and information we handle.
Now it is time to mention some structural characteristics of the data. First, the data sets are comma-separated values files, which gives them a structured format of columns and rows. Because we work with SQL in BigQuery as well as with R in RStudio, in the first case we migrated the files from Google Sheets into the BigQuery environment (data migration). Variables correspond to columns and observations to rows. The data follow a wide format in which each row represents one observation with its attributes in columns.
Continuing with data preparation, we reviewed not only the general structure but also the data types of the variables in the data frames. The frames store character, numeric, and integer types, corresponding to qualitative and quantitative data. Inspecting the numeric and integer variables, some are continuous ( … ) and some discrete ( …. ); the character variables are nominal ( …. ).
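For instance, once a quarterly file is read into a data frame (here hypothetically named trips_2019 and read from a local copy of the CSV), the stored types can be checked like this:
# Read one quarterly file and list the type stored in every column
trips_2019 <- read.csv("Divvy_Trips_2019_Q1.csv")  # hypothetical local filename
sapply(trips_2019, class)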
On the other hand, after a brief exploration of our variable of interest, the data could show an observation bias, because both files contain more subscribers than casual customers. Moreover, both files show more subscribers than casual riders from 2019 to 2020. This preliminary exploration suggests that the research proposal may be motivated by fear of market uncertainty, that is, an interpretation bias. The business task hides that suggestion: it assumes a trend toward casual use of the bike service, even though the preliminary exploration suggests the contrary. For that reason, our findings about the differences between the two customer segments can reject or confirm the underlying hypothesis: is there more growth in casual riders than in subscribers?
This research question is not minor, because the business task is aimed at changing how customers use the bikes. Answering it matters for future marketing strategies: either changing or reinforcing customers' intention of use.
Let's connect with BigQuery to show the steps we applied to get, prepare, and process the data. To include our process in BigQuery, we decided to include the queries in this R Markdown document. This is possible thanks to the following libraries and to code chunks that can read SQL.
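A minimal setup sketch is shown below, assuming the packages are installed; the connection chunk option is what lets a chunk be written in SQL and executed against the in-memory database defined later.
# Packages used throughout this report
library(tidyverse)  # dplyr, ggplot2, tidyr, lubridate, ...
library(vroom)      # fast reading of many CSV files
library(janitor)    # clean_names() for consistent headers
library(DBI)        # generic database interface
library(RSQLite)    # in-memory SQLite engine that executes the SQL chunks
# SQL chunks later in the document are declared with the chunk option
# connection = db_local, so their body is sent to that connection.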
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'vroom'
##
## The following objects are masked from 'package:readr':
##
## as.col_spec, col_character, col_date, col_datetime, col_double,
## col_factor, col_guess, col_integer, col_logical, col_number,
## col_skip, col_time, cols, cols_condense, cols_only, date_names,
## date_names_lang, date_names_langs, default_locale, fwf_cols,
## fwf_empty, fwf_positions, fwf_widths, locale, output_column,
## problems, spec
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
There is an important step to include in this preview: loading the files. We stored and migrated internal data into BigQuery from different data sets. The first data loaded came from Google Drive, which required us to manipulate the loaded tables and assign the first row as the header. We started loading the files from Google Drive after downloading the archive, noticed that the first row of both files contained the column names, and therefore assigned that first row as the header of each file. To reproduce the task in RStudio, we used the DBI and RSQLite libraries to connect to a database located in RAM, processed by the SQLite engine through RStudio, loading the files from our device. dbWriteTable takes the loaded data frames and saves them inside the SQLite database db_local.
# 1. Create the connection (we use a single name: db_local)
db_local <- dbConnect(RSQLite::SQLite(), ":memory:")
# 2. Define the file paths
ruta1 = "C:\\Users\\user\\Downloads\\Divvy_Trips_2019_Q1 - Divvy_Trips_2019_Q1.csv"
ruta2 = "C:\\Users\\user\\Downloads\\Divvy_Trips_2020_Q1 - Divvy_Trips_2020_Q1.csv"
# 3. Load the tables (make sure to use the same name 'db_local')
dbWriteTable(db_local, "bikes_2019", read.csv(ruta1))
dbWriteTable(db_local, "bikes_2020", read.csv(ruta2))
# Verification: this tells us whether the tables already exist in the SQL engine
dbListTables(db_local)
## [1] "bikes_2019" "bikes_2020"
Afterwards, we inspected both tables to confirm the columns and variables, first in the 2019 table, queried with SQL in BigQuery like this:
| trip_id | start_time | end_time | bikeid | tripduration | from_station_id | from_station_name | to_station_id | to_station_name | usertype | gender | birthyear |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 21742443 | 2019-01-01 0:04:37 | 2019-01-01 0:11:07 | 2167 | 390 | 199 | Wabash Ave & Grand Ave | 84 | Milwaukee Ave & Grand Ave | Subscriber | Male | 1989 |
| 21742444 | 2019-01-01 0:08:13 | 2019-01-01 0:15:34 | 4386 | 441 | 44 | State St & Randolph St | 624 | Dearborn St & Van Buren St (*) | Subscriber | Female | 1990 |
| 21742445 | 2019-01-01 0:13:23 | 2019-01-01 0:27:12 | 1524 | 829 | 15 | Racine Ave & 18th St | 644 | Western Ave & Fillmore St (*) | Subscriber | Female | 1994 |
| 21742446 | 2019-01-01 0:13:45 | 2019-01-01 0:43:28 | 252 | 1,783.00 | 123 | California Ave & Milwaukee Ave | 176 | Clark St & Elm St | Subscriber | Male | 1993 |
| 21742447 | 2019-01-01 0:14:52 | 2019-01-01 0:20:56 | 1170 | 364 | 173 | Mies van der Rohe Way & Chicago Ave | 35 | Streeter Dr & Grand Ave | Subscriber | Male | 1994 |
| 21742448 | 2019-01-01 0:15:33 | 2019-01-01 0:19:09 | 2437 | 216 | 98 | LaSalle St & Washington St | 49 | Dearborn St & Monroe St | Subscriber | Female | 1983 |
| 21742449 | 2019-01-01 0:16:06 | 2019-01-01 0:19:03 | 2708 | 177 | 98 | LaSalle St & Washington St | 49 | Dearborn St & Monroe St | Subscriber | Male | 1984 |
| 21742450 | 2019-01-01 0:18:41 | 2019-01-01 0:20:21 | 2796 | 100 | 211 | St. Clair St & Erie St | 142 | McClurg Ct & Erie St | Subscriber | Male | 1990 |
| 21742451 | 2019-01-01 0:18:43 | 2019-01-01 0:47:30 | 6205 | 1,727.00 | 150 | Fort Dearborn Dr & 31st St | 148 | State St & 33rd St | Subscriber | Male | 1995 |
| 21742452 | 2019-01-01 0:19:18 | 2019-01-01 0:24:54 | 3939 | 336 | 268 | Lake Shore Dr & North Blvd | 141 | Clark St & Lincoln Ave | Subscriber | Male | 1996 |
We inspected the 2020 table in the same way. Its schema differs: trips are identified by ride_id and rideable_type, the rider segment is stored in member_casual instead of usertype, and gender and birthyear are not collected.
Second, we started exploring the data frames. The exploration showed that the main variable had inconsistent values between the two years (Customer and Subscriber in 2019 versus casual and member in 2020) and that some variables present in 2019 were missing in 2020. So we first explored the main variable.
/* Because output.var is NOT set, the result is printed right below */
SELECT
usertype,
COUNT(usertype) AS count
FROM
bikes_2019
GROUP BY usertype
/* We limited the output in the present project */
LIMIT 10

| usertype | count |
|---|---|
| Customer | 23163 |
| Subscriber | 341906 |
In 2019 there were 341,906 Subscriber trips, 23,163 Customer (casual) trips, and one stray row containing the literal value usertype. Likewise, in 2020 there were 107,492 member trips and 6,149 casual trips, plus one null and one stray member_casual row. To join both data frames we needed to standardize these values. We did the same with gender, to try to keep that relevant information.
Also, continuing the data preparation, another important observation was that all variables were stored as strings, so we had to cast each one according to the quantitative or qualitative type assumed by theory. In a single query we performed both the data merging and the data casting of date and string variables.
CREATE TABLE data_merged AS
SELECT
CAST(trip_id AS TEXT) AS ride_id,
DATETIME(start_time) AS started_at,
DATETIME(end_time) AS ended_at,
CAST(gender AS TEXT) AS gender,
CAST(birthyear AS INTEGER) AS birth_year, -- Check the exact name of the birth-year column in the source table
CAST(bikeid AS TEXT) AS rideable_type,
from_station_name AS start_station_name,
CAST(from_station_id AS TEXT) AS start_station_id,
to_station_name AS end_station_name,
CAST(to_station_id AS TEXT) AS end_station_id,
CASE
WHEN usertype = 'Subscriber' THEN 'member'
WHEN usertype = 'Customer' THEN 'casual'
ELSE usertype
END AS member_casual
FROM bikes_2019
WHERE trip_id IS NOT NULL
UNION ALL
SELECT
ride_id,
DATETIME(started_at) AS started_at,
DATETIME(ended_at) AS ended_at,
'Unknown' AS gender,
NULL AS birth_year,
rideable_type,
start_station_name,
CAST(start_station_id AS TEXT) AS start_station_id,
end_station_name,
CAST(end_station_id AS TEXT) AS end_station_id,
member_casual
FROM bikes_2020
WHERE ride_id IS NOT NULL

To enable the comparative analysis between 2019 and 2020, we joined both schemas through UNION ALL. Because this R Markdown runs on SQLite, we replaced the BigQuery STRING and INT64 types with TEXT and INTEGER. We applied the DATETIME function to normalize the timestamps and, as mentioned, recoded the Customer and Subscriber values to match the newer format, ensuring a consistent combined sample.
/* We create the final table, adding time dimensions */
CREATE TABLE data_table AS
SELECT
*,
-- Day-of-week column (0 is Sunday; translated to a name)
CASE strftime('%w', started_at)
WHEN '0' THEN 'Sunday'
WHEN '1' THEN 'Monday'
WHEN '2' THEN 'Tuesday'
WHEN '3' THEN 'Wednesday'
WHEN '4' THEN 'Thursday'
WHEN '5' THEN 'Friday'
WHEN '6' THEN 'Saturday'
END AS day_of_week,
-- Time-of-day column, extracted from the hour
CASE
WHEN CAST(strftime('%H', started_at) AS INT) BETWEEN 6 AND 11 THEN 'Morning'
WHEN CAST(strftime('%H', started_at) AS INT) BETWEEN 12 AND 17 THEN 'Afternoon'
WHEN CAST(strftime('%H', started_at) AS INT) BETWEEN 18 AND 23 THEN 'Night'
ELSE 'Late Night'
END AS time_of_day
FROM data_merged
WHERE started_at IS NOT NULL;

We merged both transformed data sets, reconciling the variables even though some of them do not match. We cast the values to string or timestamp for the dates and the other qualitative variables. For the timestamp variables we extracted the date to approximate the day of the ride, casting to an integer where appropriate. To build this final table we translated the logic of the BigQuery version into the filtering applied in this R Markdown. The main difference is the handling of dates and hours: instead of FORMAT_TIMESTAMP and EXTRACT, here we applied strftime() to obtain both the weekday and the hour of the day. As shown above, a CASE expression acts as a data mapping that assigns the weekday names accordingly.
After that, we computed the ride length of each observation, deriving it from the existing variables (ended_at and started_at) into a new one (ride_length), and adding a filter on ride length. We took the chance to rename our main category to status_service, as follows:
/* We compute ride_length and rename the status column.
In SQLite, the difference is obtained by subtracting the Unix-epoch values.
*/
CREATE TABLE data_frame AS
SELECT *
FROM (
SELECT
ride_id,
started_at,
ended_at,
gender,
birth_year,
rideable_type,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
day_of_week,
time_of_day,
member_casual AS status_service,
(strftime('%s', ended_at) - strftime('%s', started_at)) AS ride_length
FROM data_table
) AS subquery
WHERE ride_length > 0;

This gives a final view of the prepared and processed data frames: a snapshot of our variables of interest before adding geo-points, locations, and age, and before selecting the final categories.
| COUNT(*) |
|---|
| 336351 |
For the last steps we applied reverse geocoding. The neighborhood boundaries used in this report are sourced from the Chicago Data Portal. This dataset is fundamental for data segmentation, as it provides the necessary infrastructure to transform latitude and longitude coordinates into human-readable community names.
Technical Note: The dataset includes the
the_geom column, which contains the MultiPolygon geometries
for Chicago’s 77 community areas. This column was processed via SQL to
perform a Spatial Join with the Divvy bike station
coordinates, ensuring each station is accurately mapped to its
geographic district.
The station infrastructure data was retrieved from the Chicago Data Portal - Divvy Bicycle Stations. This dataset provides a real-time and historical inventory of the bicycle docking stations available across the city, which is crucial for identifying the physical touchpoints of the bike-share network.
Technical Note: This dataset provides the exact
Latitude and Longitude for each docking
station. These coordinates served as the “point” entities during the
spatial join process, allowing them to be intersected with the community
area polygons to determine the geographic distribution of the Divvy
fleet.
We used this information to map station locations and thus derive some social variables, such as location, and a deeper social context about users. The idea is that we can approximate a rider's residence from the starting station, assuming that in the morning people are likely to be close to their homes. This inference lets us widen the analysis and examine differences across other variables.
The first merge was between the community geo-location data and the bike station location data. To ensure computational efficiency and maintain the integrity of complex geospatial relationships, this project employs a hybrid data processing strategy. The high-overhead Spatial Join, which determines the community area for each bike station using the ST_CONTAINS logic, was executed within Google BigQuery. This leverages cloud-native GIS capabilities to handle high-precision polygon intersections that would otherwise be memory-intensive for a local environment.
The resulting processed dataset was then exported as a flat file and integrated into our local SQLite environment. This approach allows us to bypass technical hardware limitations without sacrificing data depth. Consequently, the following sections focus on the Exploratory Data Analysis (EDA), where we utilize the pre-joined locations to uncover patterns in station density, rider behavior by community, and infrastructure distribution across the city of Chicago.
To do this, we migrated the merged data sets within the BigQuery environment and finished with a last merge that added the geocode together with longitude and latitude. We used an inner join to keep only the values in our data of interest, since the analysis refers to a sample limited to Cyclistic's customers.
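For reference, the same point-in-polygon logic can be sketched locally with the sf package. This is only an illustrative alternative, assuming the community boundaries are available as a local GeoJSON export (boundaries.geojson is a hypothetical filename) and the stations as a data frame with Longitude/Latitude columns; it is not the BigQuery job we actually ran.
library(sf)
library(dplyr)
# Community-area polygons (assumed local export of the Chicago Data Portal layer)
communities <- st_read("boundaries.geojson", quiet = TRUE)
# Hypothetical stations data frame with one row per docking station
stations <- data.frame(
station_name = c("Streeter Dr & Grand Ave", "State St & 33rd St"),
Longitude = c(-87.6120, -87.6277),
Latitude = c(41.8923, 41.8348)
)
# Turn the coordinates into point geometries (WGS84 lon/lat)
stations_sf <- st_as_sf(stations, coords = c("Longitude", "Latitude"), crs = 4326)
# Point-in-polygon join: each station inherits the attributes of the polygon containing it
stations_with_community <- st_join(stations_sf, st_transform(communities, 4326), join = st_within)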
Once we obtained the final data frame after exploring, preparing, and processing the data, we exported the information as CSV files to continue the work in RStudio.
In Google Cloud Storage (GCS), a structured export process was carried out to bypass the browser's direct download limit (usually 10 MB or around 16,000 rows). Since the original result set was larger than 1 GB, BigQuery could not download it as a single file. The solution was to export the query results to a bucket (a temporary storage area in Google's cloud).
Using a wildcard (*) in the file path, Google automatically split the large dataset into 53 smaller, manageable CSV files. Once the cloud export was complete, we used either the web interface or the gsutil command-line tool to transfer these 53 individual files to the local Downloads folder. As a result, all of the data is now on the local machine, ready to be read into RStudio using vroom.
library(vroom)
# Step 1: Get the full list of file paths
# list.files() finds the names; full.names = TRUE makes R use the complete path.
todos_los_archivos <- list.files(
path = "C:/Users/user/Downloads/",
pattern = "data_bike_final_.*\\.csv", # A regular expression for the file pattern
full.names = TRUE
)
# Check in the console that this variable contains 53 paths
print(paste("Files found:", length(todos_los_archivos)))
## [1] "Files found: 53"
# Step 2: vroom reads the explicit list of files
# We pass the exact list of names, without using the * wildcard in the main path.
data_bike <- vroom(todos_los_archivos, show_col_types = FALSE)
# Confirm that all records are now loaded
print(paste("Total rows loaded:", nrow(data_bike)))
## [1] "Total rows loaded: 423137"
The listed variables show some points in common to keep in mind: the type of service used by the customer, the trip or ride duration, the starting and ending points of the ride, and others such as gender or birth year, which could reveal relevant patterns of bike use. Sorting and filtering the data, we took the main variables into the data merging in SQL.
Last but not least, we applied SQL tools to validate the data type and values in each variable, checking the consistency of each value to prepare the data for the next steps.
## [1] 423137
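The SQL checks themselves are not reproduced here; an equivalent local sketch in R, assuming the combined data_bike frame loaded above, could look like this:
library(dplyr)
# Data types of every column
sapply(data_bike, class)
# Consistency of the key categorical variable
data_bike %>% count(status_service)
# Row count should match the figure reported by the SQL engine
nrow(data_bike)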
Once the data was prepared for processing, we chose the relevant values to answer our question, exploring the differences between customers according to their type, social characteristics, location, and duration of bike use. To reach those goals we minimized anything that could get in the way of the analysis, which is why we cast the data, dropped NA values, and omitted wrong records.
The analysis was designed to capture differences between the user segments: Subscribers and Casual riders.
The first step in RStudio was exploring the data frames built previously, which were cleaned and transformed according to the needs of our analysis. In this step we inspected the variables, observations, null values, data types, and outliers in order to normalize the data and select the categories.
## [1] "COMMUNITY" "AREA_NUMBE" "Station Name" "Total Docks"
## [5] "Status" "Longitude" "Latitude" "ride_id"
## [9] "started_at" "ended_at" "gender" "birth_year"
## [13] "rideable_type" "start_station_id" "end_station_id" "date"
## [17] "day_of_week" "time_of_day" "status_service" "ride_length"
## [21] "end_station_name" "age"
So first we deleted the unnecessary columns:
df_bike <- data_bike %>%
select(-c( birth_year, rideable_type, started_at, ended_at , end_station_id, start_station_id, "Total Docks"))
colnames(df_bike)
## [1] "COMMUNITY" "AREA_NUMBE" "Station Name" "Status"
## [5] "Longitude" "Latitude" "ride_id" "gender"
## [9] "date" "day_of_week" "time_of_day" "status_service"
## [13] "ride_length" "end_station_name" "age"
# Reorder the columns
df_bike <- df_bike %>%
select(
ride_id, # Trip ID
status_service, # Member type (Member/Casual)
date, # Date
day_of_week, # Day of the week
time_of_day, # Time of day
ride_length, # Duration
gender, # Gender
age, # Age
"Station Name", # Origin station
end_station_name, # Destination station
Status, # Station status
Latitude, Longitude, # Coordinates
COMMUNITY, AREA_NUMBE # Community data
)

This gives a better view for planning the analysis. We can see 423,137 observations and 15 variables: the identification variable; the target variable status_service; time variables such as date, day_of_week, and time_of_day; the ride reference ride_length; Station Name and end_station_name; location references such as Latitude and Longitude; and other social variables such as gender, COMMUNITY, AREA_NUMBE, and age. Most variables are of character type, except date (stored as a date) and ride_length, age, Latitude, Longitude, and AREA_NUMBE (stored as doubles).
## Rows: 423,137
## Columns: 15
## $ ride_id <chr> "21743421", "21744156", "21742530", "21743928", "2174…
## $ status_service <chr> "member", "member", "member", "member", "member", "me…
## $ date <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019…
## $ day_of_week <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday"…
## $ time_of_day <chr> "Afternoon", "Night", "Night", "Afternoon", "Afternoo…
## $ ride_length <dbl> 615, 934, 497, 1603, 510, 472, 485, 2832, 1792, 227, …
## $ gender <chr> "Male", "Male", "Female", "Male", "Male", "Male", "Ma…
## $ age <dbl> 27, 28, 30, 45, 19, 36, 38, NA, 30, 27, 50, 34, 37, 5…
## $ `Station Name` <chr> "Eckhart Park", "Lincoln Ave & Waveland Ave", "State …
## $ end_station_name <chr> "Damen Ave & Pierce Ave", "Damen Ave & Cortland St", …
## $ Status <chr> "In Service", "In Service", "In Service", "In Service…
## $ Latitude <dbl> 41.89637, 41.94880, 41.88938, 41.96845, 41.96300, 41.…
## $ Longitude <dbl> -87.66098, -87.67528, -87.62708, -87.67423, -87.68478…
## $ COMMUNITY <chr> "WEST TOWN", "NORTH CENTER", "NEAR NORTH SIDE", "LINC…
## $ AREA_NUMBE <dbl> 24, 5, 8, 4, 4, 7, 6, 77, 8, 8, 24, 7, 8, 33, 7, 32, …
It is necessary to look at a summary of the variables in order to add any further cleaning or transformation steps.
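The overview below can be produced with a call along these lines (a sketch, assuming the df_bike frame defined above):
# Five-number summaries for numeric columns and basic information for the rest
summary(df_bike)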
## ride_id status_service date day_of_week
## Length:423137 Length:423137 Min. :2019-01-01 Length:423137
## Class :character Class :character 1st Qu.:2019-02-05 Class :character
## Mode :character Mode :character Median :2019-03-11 Mode :character
## Mean :2019-05-09
## 3rd Qu.:2019-03-30
## Max. :2020-01-31
##
## time_of_day ride_length gender age
## Length:423137 Min. : 2 Length:423137 Min. : 16.00
## Class :character 1st Qu.: 322 Class :character 1st Qu.: 29.00
## Mode :character Median : 515 Mode :character Median : 34.00
## Mean : 1060 Mean : 37.29
## 3rd Qu.: 842 3rd Qu.: 44.00
## Max. :10632022 Max. :119.00
## NA's :115300
## Station Name end_station_name Status Latitude
## Length:423137 Length:423137 Length:423137 Min. :41.74
## Class :character Class :character Class :character 1st Qu.:41.88
## Mode :character Mode :character Mode :character Median :41.89
## Mean :41.90
## 3rd Qu.:41.91
## Max. :42.02
##
## Longitude COMMUNITY AREA_NUMBE
## Min. :-87.77 Length:423137 Min. : 1.00
## 1st Qu.:-87.65 Class :character 1st Qu.: 8.00
## Median :-87.64 Mode :character Median :24.00
## Mean :-87.64 Mean :21.27
## 3rd Qu.:-87.63 3rd Qu.:32.00
## Max. :-87.55 Max. :77.00
##
# 1. Count null values (except for age and gender)
null_report <- df_bike %>%
select(-age, -gender) %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Null_Count")
# 2. Count unique values for age and gender
unique_values <- df_bike %>%
summarise(
unique_age = n_distinct(age, na.rm = TRUE),
unique_gender = n_distinct(gender, na.rm = TRUE)
)
# 3. Count inconsistent values in ride_length (shorter than 1 second)
invalid_rides <- df_bike %>%
filter(ride_length < 1) %>%
count()
# Show the results in the console
print("--- Null values report ---")
## [1] "--- Null values report ---"
## # A tibble: 13 × 2
## Variable Null_Count
## <chr> <int>
## 1 ride_id 0
## 2 status_service 1
## 3 date 0
## 4 day_of_week 0
## 5 time_of_day 0
## 6 ride_length 0
## 7 Station Name 0
## 8 end_station_name 0
## 9 Status 0
## 10 Latitude 0
## 11 Longitude 0
## 12 COMMUNITY 0
## 13 AREA_NUMBE 0
## [1] "--- Unique values report ---"
## # A tibble: 1 × 2
## unique_age unique_gender
## <int> <int>
## 1 72 3
## [1] "--- Invalid trips (< 1 seg): 0"
library(dplyr)
# Apply the cleaning and transformation rules
table_cleaned <- df_bike %>%
# 1. Cap ages above 80 at exactly 80
mutate(age = ifelse(age > 80, 80, age)) %>%
# 2. Convert gender values other than Male/Female to 'Unknown'
# We also handle the null values (NAs) in this column
mutate(gender = case_when(
gender == "Male" ~ "Male",
gender == "Female" ~ "Female",
TRUE ~ "Unknown"
)) %>%
# 3. Drop nulls in status_service and keep only valid segments
# Note: BigQuery sometimes uses 'Member' instead of 'Subscriber'; adjust to your data
filter(!is.na(status_service)) %>%
filter(status_service %in% c("Subscriber", "Customer", "member", "casual"))
# Check the results
summary(table_cleaned$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 16.00 29.00 34.00 37.27 44.00 80.00 115299
##
## Female Male Unknown
## 58855 247424 116857
##
## casual member
## 23677 399459
To ensure the statistical robustness of the analysis, several data normalization and imputation procedures were implemented:
Age Capping: Values in the age variable exceeding 80 were capped at 80. This technique mitigates the influence of extreme outliers (likely due to data entry errors) while preserving the integrity of the records for other variables.
Gender Categorization: To handle missing data and non-binary entries, the gender variable was recoded. Only “Male” and “Female” were retained as distinct categories, while all other values—including nulls—were consolidated into an “Unknown” category.
Target Variable Refinement: Observations with null values in status_service were removed. Furthermore, the data set was filtered to include only valid user segments—Subscriber (Member) and Customer (Casual)—aligning the data with the primary business objective of the analysis.
According to the previous summary, most variables behave normally, except for age, which indicates riders with birth dates far in the past. So we are going to explore this.
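As an illustrative check (assuming the pre-capping df_bike frame is still available), the implausible ages can be inspected like this:
library(dplyr)
# How many riders report an age above 80, and how extreme are those values?
df_bike %>%
filter(age > 80) %>%
summarise(n_riders = n(), max_age = max(age, na.rm = TRUE))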
library(janitor)
# Standardizing all headers to lowercase with underscores
table_cleaned <- table_cleaned %>%
clean_names()
# Displaying the clean column names
colnames(table_cleaned)
## [1] "ride_id" "status_service" "date" "day_of_week"
## [5] "time_of_day" "ride_length" "gender" "age"
## [9] "station_name" "end_station_name" "status" "latitude"
## [13] "longitude" "community" "area_numbe"
To streamline the analysis and prevent reference errors, the data set
headers were normalized using the clean_names() function.
This process automatically converts all column names to lowercase and
replaces spaces or irregular characters with underscores. By adopting
this snake_case convention, the code becomes more
readable and consistent, ensuring that functions across the
tidyverse can interact with the variables without
requiring special quoting for spaces or mixed casing.
Starting the analysis, we identified the main variables: status_service and ride_length. Each will first be analyzed individually. A frequency analysis gives a first approximation, but it might hide some relationships of interest that we can explore; for example, is there a relationship between more or less service use and the status of the service? The difference is not minor, because it tells us about earnings as a function of usage and number of members even without financial information (we have neither the subscription cost nor the revenue per use). We can still get close to a recommendation, though.
Layering both analyses could reveal differences in how the service is used, not only in the service profile. As we saw at first, the counts show far more subscriber rides:
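A minimal frequency sketch of the target variable, assuming the cleaned table_cleaned frame:
library(dplyr)
# Number of rides per service status
table_cleaned %>% count(status_service)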
The ride-length information shows some points of interest.
## [1] "--- Average report ---"
## [1] 1059.954
## [1] "--- Median report ---"
## [1] 515
## [1] "--- Max value report ---"
## [1] 10632022
## [1] "--- Min value report ---"
## [1] 2
As we can see, the figures show healthy bike usage in Chicago, which translates into good earnings in general.

But beyond these overall statistics, we need to know whether the subscriber or the casual service drives longer usage and, as a consequence, better profits. For that we computed measures such as the average, median, maximum, and minimum values and then crossed them between segments, as shown in the sketch below. At the very least, this analysis approximates the earnings picture.
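A minimal sketch (assuming the table_cleaned frame and the ride_length column used throughout this report) of how those per-segment statistics can be computed:
library(dplyr)
# Ride-length statistics split by service status
table_cleaned %>%
group_by(status_service) %>%
summarise(
mean_length = mean(ride_length, na.rm = TRUE),
median_length = median(ride_length, na.rm = TRUE),
max_length = max(ride_length, na.rm = TRUE),
min_length = min(ride_length, na.rm = TRUE)
)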
Afterwards we can add an analysis of the social profile of the customers, such as approximate location, gender, or age. To reach those variables we preprocess the users' starting points in the morning and then map those locations against an open data source on poverty levels. For gender and age, we apply an expansion based on the demographic information of the inferred residence area and the data present in this project.
## [1] "--- Average report ---"
## [1] "--- Median report ---"
## [1] "--- Max report ---"
## [1] "--- Min report ---"
The last outcomes run contrary to the raw counts by status service. Although casual riders are fewer, they ride longer. The casual average ride is much higher on every measure: mean, median, and maximum. It indicates that casual riders pay more per use, yet they are fewer.
Looking for more detail behind those outcomes, we can examine the average ride time for each day of the week for members versus casual users.
# Notice that the days of the week are out of order. Let's fix that.
table_cleaned$day_of_week <- ordered(table_cleaned$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
# Now, let's run the average ride time by each day for members vs casual users
aggregate(table_cleaned$ride_length ~ table_cleaned$status_service + table_cleaned$day_of_week, FUN = mean)

The results show that ride lengths are longer at the start of the week than at its end; people tend to take longer rides early in the week.
Moreover, the average ride time per day for members versus casual users is longer among casual riders than among members. Every day, casual customers exceed member users in ride length, which suggests that, independent of the number of users, casual use increases total riding time.
library(dplyr)
# We use the day_of_week column that already exists in the dataset
summary_by_day <- table_cleaned %>%
group_by(status_service, day_of_week) %>%
summarise(
number_of_rides = n(),
average_duration = mean(ride_length, na.rm = TRUE),
.groups = "drop"
) %>%
# Order by user type and day
arrange(status_service, day_of_week)
# Visualize the summary
summary_by_casual <- filter(summary_by_day, status_service == "casual")
summary_by_member <- filter(summary_by_day, status_service == "member")
print("--- Average report ---")
## [1] "--- Average report ---"
## [1] "--- Average report casual customers---"
## [1] "--- Average report member customer ---"
The same analysis can show the preferred time to ride. Night stands out: in both service statuses the longest average rides happen at night, with casual users accumulating far more riding time per trip than members, while the number of rides is instead much higher among members, as we can observe below:
library(dplyr)
# We use the time_of_day column that already exists in the dataset
summary_by_time <- table_cleaned %>%
group_by(status_service, time_of_day) %>%
summarise(
number_of_rides = n(),
average_duration = mean(ride_length, na.rm = TRUE),
.groups = "drop"
) %>%
# Order by user type and time of day
arrange(status_service, time_of_day)
print("--- Average report ---")## [1] "--- Average report ---"
## # A tibble: 6 × 4
## status_service time_of_day number_of_rides average_duration
## <chr> <chr> <int> <dbl>
## 1 casual Night 5143 5973.
## 2 casual Morning 4693 5785.
## 3 casual Afternoon 13841 5201.
## 4 member Night 85780 1006.
## 5 member Afternoon 169059 781.
## 6 member Morning 144620 694.
## [1] "--- Number of rides report ---"
## # A tibble: 6 × 4
## status_service time_of_day number_of_rides average_duration
## <chr> <chr> <int> <dbl>
## 1 member Afternoon 169059 781.
## 2 member Morning 144620 694.
## 3 member Night 85780 1006.
## 4 casual Afternoon 13841 5201.
## 5 casual Night 5143 5973.
## 6 casual Morning 4693 5785.
The temporal analysis reveals a distinct contrast in peak usage between user segments. While both groups record their longest average rides at night and at the start of the week, casual users account for a higher share of riding time during late-night hours, and member users dominate the total number of rides.
Both user segments keep similar times and days of use, even though their volumes differ significantly.
The number of members substantially exceeds casual riders, reinforcing the hypothesis that members primarily use Divvy for work-related commuting. Conversely, the casual segment's relative weight during non-traditional hours suggests a preference for recreational or nocturnal transit, providing a key opportunity for targeted evening marketing campaigns. In addition, both user segments recorded fewer rides in the 2020 quarter than in 2019, while the casual segment's average ride length rose sharply.
library(lubridate)
# We use the date column that already exists in the dataset
summary_by_year <- table_cleaned %>%
mutate(year = year(as.Date(date))) %>%
group_by(status_service, year) %>%
summarise(
number_of_rides = n(),
average_duration = mean(ride_length, na.rm = TRUE),
.groups = "drop"
) %>%
# Order by user type and year
arrange(status_service, year)
print(summary_by_year)
## # A tibble: 4 × 4
## status_service year number_of_rides average_duration
## <chr> <dbl> <int> <dbl>
## 1 casual 2019 18506 4075.
## 2 casual 2020 5171 10529.
## 3 member 2019 303787 841.
## 4 member 2020 95672 661.
To define the user social profile, we leveraged the pre-categorized time_of_day variable. By filtering specifically for trips occurring during the 'Morning' period, we established a geographical proxy for the users' home locations. This assumes that the first movement of the day typically originates from the user's primary residence or local neighborhood.
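A minimal sketch of that morning filter, assuming table_cleaned; the resulting per-community counts are what we treat as the residence proxy:
library(dplyr)
# Morning trips only, counted by starting community and service status
home_proxy <- table_cleaned %>%
filter(time_of_day == "Morning") %>%
count(community, status_service, name = "morning_trips") %>%
arrange(desc(morning_trips))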
To enrich the analysis of user behavior, we integrated external socio-economic data from the City of Chicago Data Portal, specifically focusing on the “Hardship Index” and poverty indicators per Community Area.
This allowed us to assign a socio-economic stratum (SES) to each trip based on the historical census data of the community area. By segmenting the data into High, Middle, and Low SES, we can now evaluate the accessibility and equity of the Divvy system. This analysis provides a window into the demographic composition of our service status groups, revealing whether 'Subscribers' primarily reside in low-hardship areas compared to 'Casual' users, and how age and gender distributions vary across these economic zones.
The data processing workflow consisted of the following key steps:
Socio-Economic Mapping: We performed a join between the bike-sharing data set and Chicago’s community metrics, categorizing trips into three poverty tiers based on the percentage of households below the poverty line: Low (<5%), Moderate (5-10%), and High (>10%).
Demographic Imputation: To address the high volume of “Unknown” values in gender and age fields (particularly among Casual riders), we implemented a Geographic Proportional Imputation. This method replaced unknown values by sampling from the known demographic distribution of each specific community, preserving local neighborhood profiles .
Cross-Tabulation Analysis: We generated
specialized tables to identify user personas. This included calculating
the Median Age and Gender Distribution
(Male vs. Female percentages) cross-referenced by
status_service (Casual vs. Member) for each
community.
This comprehensive approach transformed raw trip data into a multi-dimensional dataset, allowing us to analyze how gender gaps, age groups, and subscription preferences fluctuate across different economic landscapes in Chicago.
We start by standardizing the community values, mapping them and applying lowercase to the names so that they can be matched with the poverty values mapped previously.
library(dplyr)
library(stringr)
# 1. Create the mapping data frame with only the 45 communities present in the data
mapeo_pobreza_restringido <- data.frame(
community = c(
"west town", "north center", "near north side", "lincoln square",
"lincoln park", "lake view", "edgewater", "near south side",
"loop", "logan square", "uptown", "rogers park",
"avondale", "north park", "irving park", "greater grand crossing",
"near west side", "bridgeport", "lower west side", "hyde park",
"albany park", "armour square", "west ridge", "douglas",
"washington park", "kenwood", "mckinley park", "east garfield park",
"woodlawn", "south lawndale", "englewood", "grand boulevard",
"north lawndale", "humboldt park", "new city", "fuller park",
"austin", "south shore", "portage park", "west garfield park",
"hermosa", "chatham", "west englewood", "south chicago", "avalon park"
),
valor_pobreza = c(
5.56, 2.18, 5.48, 4.68, 4.83, 4.51, 38.32, 9.62,
3.02, 5.14, 9.93, 9.56, 3.87, 5.87, 3.50, 8.22,
16.93, 6.00, 7.20, 14.51, 5.34, 18.22, 6.32, 10.62,
16.84, 22.71, 3.93, 11.42, 15.71, 5.53, 14.35, 14.98,
10.57, 9.79, 3.46, 11.05, 10.28, 10.29, 4.68, 15.25,
7.17, 19.74, 4.97, 5.31, 5.64
)
)
# 2. Standardize the community names (trim and lowercase)
table_cleaned <- table_cleaned %>%
mutate(community = str_trim(tolower(community)))
# 3. Join the data frame with the poverty values
table_cleaned <- table_cleaned %>%
left_join(mapeo_pobreza_restringido, by = "community")
# Outcomes verification
summary(table_cleaned$valor_pobreza)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.180 4.510 5.480 8.153 10.620 38.320
We then binned the poverty values, grouping the percentage of poverty in each community into the categories that follow.
Those bins allow us to segment the socio-economic profile of the users and analyze the status of service accordingly. The finding is that most trips were made from intermediate- to low-poverty strata, by members as well as casual customers.
library(dplyr)
# Create the categories based on the chosen ranges
table_cleaned <- table_cleaned %>%
mutate(categoria_pobreza = case_when(
valor_pobreza > 10 ~ "High (>10%)",
valor_pobreza >= 5 & valor_pobreza <= 10 ~ " Intermediate (5% - 10%)",
valor_pobreza < 5 ~ "Low (<5%)",
TRUE ~ "Sin Datos"
))
# Order the categories so the charts come out in a logical order
table_cleaned$categoria_pobreza <- factor(table_cleaned$categoria_pobreza,
levels = c("Low (<5%)", " Intermediate (5% - 10%)", "High (>10%)"))
# Quick summary of how many trips fall in each stratum
resumen_social <- table_cleaned %>%
group_by(categoria_pobreza, status_service) %>%
summarise(total_trips = n(),
average_duration = mean(ride_length, na.rm = TRUE),
.groups = 'drop',
)
print(resumen_social)
## # A tibble: 6 × 4
## categoria_pobreza status_service total_trips average_duration
## <fct> <chr> <int> <dbl>
## 1 "Low (<5%)" casual 9912 4094.
## 2 "Low (<5%)" member 150944 723.
## 3 " Intermediate (5% - 10%)" casual 9824 6066.
## 4 " Intermediate (5% - 10%)" member 143111 877.
## 5 "High (>10%)" casual 3941 7533.
## 6 "High (>10%)" member 105404 797.
Even though members account for most rides, there are some interesting findings:
library(dplyr)
library(tidyr)
# 1. Compute the summary table with percentages and totals
tabla_resumen <- table_cleaned %>%
filter(!is.na(categoria_pobreza)) %>%
group_by(categoria_pobreza, status_service) %>%
summarise(cantidad = n(),
average_duration = mean(ride_length, na.rm = TRUE),
.groups = 'drop') %>%
# Compute the % of Casuals and Members within each poverty level
group_by(categoria_pobreza) %>%
mutate(
porcentaje = round((average_duration / sum(average_duration)) * 100, 2),
total_viajes_nivel = sum(average_duration)
) %>%
ungroup() %>%
# Reshape from long to wide format (cross-tab)
select(categoria_pobreza, status_service, porcentaje, total_viajes_nivel) %>%
pivot_wider(names_from = status_service, values_from = porcentaje) %>%
# Compute the % that each level represents of the global grand total
mutate(porcentaje_del_total_global = round((total_viajes_nivel / sum(total_viajes_nivel)) * 100, 2))
print(tabla_resumen)
## # A tibble: 3 × 5
## categoria_pobreza total_viajes_nivel casual member porcentaje_del_total…¹
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 "Low (<5%)" 4817. 85.0 15.0 24.0
## 2 " Intermediate (5% - … 6943. 87.4 12.6 34.6
## 3 "High (>10%)" 8330. 90.4 9.57 41.5
## # ℹ abbreviated name: ¹porcentaje_del_total_global
So casual riders not only ride longer; their relative share of riding time also increases as poverty levels rise, across both customer segments.
Nevertheless, this finding contrasts with the total number of customers in each status of service: members far outnumber casual customers and represent a considerable majority.
library(dplyr)
library(tidyr)
# 1. Compute the summary table with percentages and totals
tabla_resumen <- table_cleaned %>%
filter(!is.na(categoria_pobreza)) %>%
group_by(categoria_pobreza, status_service) %>%
summarise(cantidad = n(),
average_duration = mean(ride_length, na.rm = TRUE),
.groups = 'drop') %>%
# Compute the % of Casuals and Members within each poverty level
group_by(categoria_pobreza) %>%
mutate(
porcentaje = round((cantidad / sum(cantidad)) * 100, 2),
total_viajes_nivel = sum(cantidad)
) %>%
ungroup() %>%
# Reshape from long to wide format (cross-tab)
select(categoria_pobreza, status_service, porcentaje, total_viajes_nivel) %>%
pivot_wider(names_from = status_service, values_from = porcentaje) %>%
# Compute the % that each level represents of the global grand total
mutate(porcentaje_del_total_global = round((total_viajes_nivel / sum(total_viajes_nivel)) * 100, 2))
print(tabla_resumen)
## # A tibble: 3 × 5
## categoria_pobreza total_viajes_nivel casual member porcentaje_del_total…¹
## <fct> <int> <dbl> <dbl> <dbl>
## 1 "Low (<5%)" 160856 6.16 93.8 38.0
## 2 " Intermediate (5% - … 152935 6.42 93.6 36.1
## 3 "High (>10%)" 109345 3.6 96.4 25.8
## # ℹ abbreviated name: ¹porcentaje_del_total_global
To sum up: there are more member customers, but proportionally casual customers ride more, and both tendencies intensify in lower-income sectors.
By applying this segmentation, we cross-referenced the imputed gender and median age data with status_service (Casual vs. Member). This allowed us to determine whether the subscription model (Member or Casual) is predominantly used by specific age groups or genders within different economic contexts. This multi-layered analysis reveals whether the "Member" profile is concentrated in particular demographic groups.
To resolve the missing data issue for the ‘gender’ variable, we moved beyond simple deletion. We implemented a Geographic Proportional Imputation. By using the starting community as a demographic anchor, we replaced ‘Unknown’ values with the predominant gender profile of that specific area. This preserved the unique socio-economic signature of each neighborhood, allowing for a high-integrity cross-analysis between gender, status_service, and the Poverty Index.
library(dplyr)
library(tidyr)
# 1. Create the weights (probabilities) table per community
pesos_genero <- table_cleaned %>%
filter(gender %in% c("Male", "Female")) %>%
group_by(community, gender) %>%
summarise(n = n(), .groups = "drop_last") %>%
mutate(prob = n / sum(n)) %>%
select(-n) %>%
pivot_wider(names_from = gender, values_from = prob)
# 2. Function to impute based on the community's probabilities
imputar_genero <- function(comunidad, genero_orig, prob_M, prob_F) {
# If the gender is already known, keep it as is
if (genero_orig != "Unknown") {
return(genero_orig)
}
# If the community has no data to decide, return "Unknown" or a global default
if (is.na(prob_M) | is.na(prob_F)) {
return("Male") # Asignación por defecto basada en tendencia global de Divvy
}
# Choose at random according to the community's weights
sample(c("Male", "Female"), size = 1, prob = c(prob_M, prob_F))
}
# 3. Apply the imputation to the main table
table_cleaned <- table_cleaned %>%
left_join(pesos_genero, by = "community") %>%
rowwise() %>%
mutate(gender = imputar_genero(community, gender, Male, Female)) %>%
ungroup() %>%
select(-Male, -Female) # Remove the auxiliary columns
# Check that no "Unknown" values remain
table(table_cleaned$gender)
##
## Female Male
## 81138 341998
An important observation is that there are many more male than female users. This repeats in every community where we looked at the gender of customers, both casual and member. It is an important point to consider when adjusting the marketing strategy to focus bike use on women.
library(dplyr)
library(tidyr)
# 1. Compute the gender distribution (using the imputed gender)
tabla_genero_status <- table_cleaned %>%
group_by(community, status_service, gender) %>%
summarise(total = n(), .groups = 'drop') %>%
# Compute the gender percentage within each community and each status
group_by(community, status_service) %>%
mutate(porcentaje = round((total / sum(total)) * 100, 2)) %>%
ungroup()
# 2. Reshape to a wide format to see Male and Female side by side
# This will create columns such as "casual_Female", "casual_Male", etc.
tabla_genero_final <- tabla_genero_status %>%
select(community, status_service, gender, porcentaje) %>%
pivot_wider(
names_from = c(status_service, gender),
values_from = porcentaje,
names_sep = "_"
)
# 3. View the results
print(tabla_genero_final)
## # A tibble: 45 × 5
## community casual_Female casual_Male member_Female member_Male
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 albany park 18.0 82.0 20.5 79.5
## 2 armour square 27.6 72.4 22.2 77.8
## 3 austin 23.1 76.9 30.1 69.9
## 4 avalon park 16.7 83.3 25 75
## 5 avondale 14.6 85.4 18.8 81.2
## 6 bridgeport 14.1 85.9 20.2 79.8
## 7 chatham 28.6 71.4 18.6 81.4
## 8 douglas 26.3 73.7 22.8 77.2
## 9 east garfield park 18.2 81.8 26.6 73.4
## 10 edgewater 22.0 78.0 17.8 82.2
## # ℹ 35 more rows
We computed the median age in order to understand user preferences in every community. The median age is slightly lower for casual than for member customers. So members tend to be older than casual customers, which could explain the casual riders' preference for longer rides compared to members.
library(dplyr)
library(tidyr)
# 1. Compute the median age by community and status
tabla_edad_status <- table_cleaned %>%
group_by(community, status_service) %>%
summarise(edad_mediana = median(age, na.rm = TRUE), .groups = 'drop')
# 2. Pivot to compare Casual vs Member side by side
tabla_edad_final <- tabla_edad_status %>%
pivot_wider(names_from = status_service, values_from = edad_mediana) %>%
rename(
`Edad Mediana Casual` = casual,
`Edad Mediana Member` = member
)
# 3. Show the results
print(tabla_edad_final)
## # A tibble: 45 × 3
## community `Edad Mediana Casual` `Edad Mediana Member`
## <chr> <dbl> <dbl>
## 1 albany park 33 37
## 2 armour square 22.5 31
## 3 austin 33 43
## 4 avalon park NA 46
## 5 avondale 29 35
## 6 bridgeport 29 31
## 7 chatham 29.5 36
## 8 douglas 28 28
## 9 east garfield park 28.5 29
## 10 edgewater 30 35
## # ℹ 35 more rows
The findings identify a clear “usage vs. volume” paradox. While members provide the volume (quantity), casual riders provide the intensity (duration/distance), especially as you move into lower socio-economic sectors.
Socio-Economic & Demographic Divergence The analysis reveals a significant “usage paradox” between user segments across Chicago’s communities:
The Intensity vs. Volume Gap: While Members constitute the vast majority of total trips, Casual riders demonstrate higher engagement in terms of trip duration and distance. This trend is particularly amplified in Low Socio-Economic (High Poverty) sectors, where casual trips are fewer in number but significantly longer in duration.
# Let's visualize the number of rides by rider type
table_cleaned %>%
group_by(status_service, day_of_week) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(status_service, day_of_week) %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = status_service)) +
geom_col(position = "dodge")## `summarise()` has grouped output by 'status_service'. You can override using
## the `.groups` argument.
# Let's create a visualization for average duration
table_cleaned %>%
group_by(status_service, day_of_week) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(status_service, day_of_week) %>%
ggplot(aes(x = day_of_week, y = average_duration, fill = status_service)) +
geom_col(position = "dodge")## `summarise()` has grouped output by 'status_service'. You can override using
## the `.groups` argument.
A key finding is that bike usage intensity increases as we move into lower social sectors. Casual riders in these areas are not just riding for leisure; the increased distance suggests the service acts as a critical mobility tool for longer commutes where other transit options may be limited.
Adding, data shows a consistent Gender Gap across all 45 communities. Males represent the predominant user base in both the Casual and Member segments. This identified disparity presents a high-value marketing opportunity: a strategy specifically designed to lower barriers for women could unlock a massive, underserved market.
library(ggplot2)
library(dplyr)
# 1. Prepare the data (using the 'long' format, which works better for ggplot)
df_genero_plot <- tabla_genero_status %>%
filter(!is.na(gender)) %>% # Make sure there are no NAs
# Order the communities by the percentage of women
mutate(community = reorder(community, porcentaje))
# 2. Create the stacked bar chart
ggplot(df_genero_plot, aes(x = porcentaje, y = community, fill = gender)) +
geom_bar(stat = "identity", position = "fill") + # "fill" makes every bar reach 100%
facet_wrap(~status_service) +
# Use colors with good contrast (orange and blue, for example)
scale_fill_manual(values = c("Male" = "#2c7bb6", "Female" = "#d7191c"),
name = "Gender") +
# Vertical line at 50% to quickly see which gender dominates
geom_vline(xintercept = 0.5, linetype = "dashed", color = "white", alpha = 0.5) +
scale_x_continuous(labels = scales::percent) + # Percentage format
labs(
title = "Gender Distribution by Community and User Status",
subtitle = "Percentage of Male vs Female across 100% of community trips",
x = "Percentage Contribution",
y = "Community Area"
) +
theme_minimal() +
theme(
axis.text.y = element_text(size = 7), # Adjusted so the community names fit
legend.position = "top",
strip.text = element_text(size = 12, face = "bold")
)

To address the high volume of "Unknown" values in the gender column (primarily from Casual riders), we implemented a Geographical Proportional Imputation. This method preserves the specific demographic characteristics of each Chicago community.
Finally, by calculating the Median Age, we observed that Members tend to be older than Casual riders. This age difference helps explain the usage patterns: younger Casual riders show a higher preference for long-distance riding, whereas older Members likely utilize the service for shorter, more consistent “last-mile” utility trips.
To further refine our user personas, we calculated the Median Age for both segments across all Chicago communities. This metric helps distinguish between lifestyle-oriented riders and utility-based commuters.
We used the median() function to mitigate the influence of potential outliers in the age data, ensuring a representative age profile for each geographic area.
library(ggplot2)
library(dplyr)
# The summary table here is tabla_edad_status
# We order it so the community with the highest median age appears at the top
df_edad_plot <- tabla_edad_status %>%
filter(!is.na(edad_mediana)) %>%
mutate(community = reorder(community, edad_mediana))
ggplot(df_edad_plot, aes(x = edad_mediana, y = community)) +
# Background segment for a "lollipop" style
geom_segment(aes(x = 0, xend = edad_mediana, y = community, yend = community), color = "gray90") +
# Colored point for the median age
geom_point(aes(color = status_service), size = 3) +
# Facet by segment so the comparison is large and clear
facet_wrap(~status_service) +
# Color palette (orange for casual, blue for member)
scale_color_manual(values = c("casual"="#E27429", "member" = "#00539C")) +
labs(
title = "Median Age by Chicago Community",
subtitle = "Analysis of User Demographics per Neighborhood",
x = "Median Age (Years)",
y = "Community Area"
) +
theme_minimal() +
theme(
axis.text.y = element_text(size = 8), # Small font so all the communities fit
strip.text = element_text(size = 14, face = "bold"),
panel.grid.major.y = element_blank() # Remove horizontal grid lines to avoid clutter
)
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(ggplot2)
library(dplyr)
library(tigris) # places() provides the official city boundary
library(sf) # simple-features support for geom_sf()
# 1. Official city silhouette
chicago_shape <- places(state = "IL", cb = TRUE) %>%
filter(NAME == "Chicago")## Retrieving data for the year 2024
# 2. The full data set
df_mapa <- table_cleaned %>%
mutate(ride_minutos = as.numeric(ride_length) / 60) %>%
filter(ride_minutos > 1 & ride_minutos < 120) %>%
group_by(community, longitude, latitude, status_service) %>%
summarise(
total_viajes = n(),
duracion_mediana = median(ride_minutos, na.rm = TRUE),
.groups = 'drop'
)
# 3. Chart with orange bubbles
ggplot() +
# Chicago background (neutral grey so it does not compete with the bubble colors)
geom_sf(data = chicago_shape, fill = "#f2f2f2", color = "#d1d1d1", size = 0.4) +
# Bubbles: the color scale only applies here
geom_point(data = df_mapa,
aes(x = longitude, y = latitude,
size = total_viajes,
color = duracion_mediana),
alpha = 0.6) +
# Yellow-orange-red scale (highlights the intensity much better)
scale_color_distiller(palette = "YlOrRd", direction = 1, name = "Intensity (Minutes)") +
# Size encodes the volume
scale_size_continuous(name = "Volume (Trips)", range = c(0.5, 9)) +
# Annotations in the same tone for visual consistency
geom_label(data = data.frame(
status_service = c("casual", "member"),
label = c("CASUAL:\nMÁS INTENSIDAD\nMenos Usuarios",
"MEMBER:\nMÁS FRECUENCIA\nMenos Duración"),
x = c(-87.82, -87.82),
y = c(41.73, 41.73)
),
aes(x = x, y = y, label = label),
size = 2.5, fontface = "bold", fill = "white", color = "#d7301f", alpha = 0.9, hjust = 0) +
facet_wrap(~status_service) +
labs(
title = "Divvy System Intensity: Casual vs. Member",
subtitle = "Color Depth (Intensity) vs. Bubble Size (Volume)",
caption = "Casual riders consistently show higher intensity (darker red) across Chicago."
) +
theme_void() + # Clean map aesthetic
theme(
legend.position = "right",
plot.title = element_text(face = "bold", size = 14),
strip.text = element_text(size = 12, face = "bold")
)