This document provides a sequential guide detailing the process I followed to conduct my analysis, accompanied by my observation notes.
Upon downloading the data, I conducted initial exploration and processing using VBA. These Excel VBA macros collectively form a two-step process.
"Data Enhancement and Insight Generation
Numeric formats are applied and visual adjustments are made to allow for initial evaluation in this first step.
Serves as a preliminary foundation for the subsequent VBA operation.
Sub PreliminaryRequirements_Transform()
Dim folderPath As String
Dim filePath As String
Dim wb As Workbook
Dim ws As Worksheet
Dim rng As Range
Dim processedFiles As New Collection
' Specify the source folder path where the CSV files are located
folderPath = "D:\raw\"
' Specify the destination folder path for saving CSV files
Dim csvDestinationFolderPath As String
csvDestinationFolderPath = "D:\phase_one_transform\"
' Get the first CSV file in the folder
filePath = Dir(folderPath & "*.csv")
Do While filePath <> ""
' Construct the full file path
Dim fullFilePath As String
fullFilePath = folderPath & filePath
' Check if the file has already been processed
Dim alreadyProcessed As Boolean
On Error Resume Next
processedFiles.Add Item:=filePath, Key:=CStr(filePath)
alreadyProcessed = (Err.Number = 0)
Err.Clear
On Error GoTo 0
If Not alreadyProcessed Then
' Open the CSV file
Set wb = Workbooks.Open(fullFilePath)
' CLEANING DATA AND SETTING DATA TYPES
' Clean the data types
Set ws = wb.Worksheets(1) ' Assuming the data is on the first sheet
' Clean data types in column A as text
Set rng = ws.Range("A:A")
rng.NumberFormat = "@"
' Clean data types in column B as text
Set rng = ws.Range("B:B")
rng.NumberFormat = "@"
' Clean data types in column C as date/time
Set rng = ws.Range("C:C")
rng.NumberFormat = "mm/dd/yyyy hh:mm:ss"
' Clean data types in column D as date/time
Set rng = ws.Range("D:D")
rng.NumberFormat = "mm/dd/yyyy hh:mm:ss"
' Clean data types in column E as text
Set rng = ws.Range("E:E")
rng.NumberFormat = "@"
' Clean data types in column F as text
Set rng = ws.Range("F:F")
rng.NumberFormat = "@"
' Clean data types in column G as text
Set rng = ws.Range("G:G")
rng.NumberFormat = "@"
' Clean data types in column H as text
Set rng = ws.Range("H:H")
rng.NumberFormat = "@"
' Clean data types in column I as number with 12 decimal places
Set rng = ws.Range("I:I")
rng.NumberFormat = "0.000000000000"
' Clean data types in column J as number with 12 decimal places
Set rng = ws.Range("J:J")
rng.NumberFormat = "0.000000000000"
' Clean data types in column K as number with 12 decimal places
Set rng = ws.Range("K:K")
rng.NumberFormat = "0.000000000000"
' Clean data types in column L as number with 12 decimal places
Set rng = ws.Range("L:L")
rng.NumberFormat = "0.000000000000"
' Clean data types in column M
Set rng = ws.Range("M:M")
rng.NumberFormat = "@" ' Set the format to text
' INSERTING COLUMNS AND PERFORMING CALCULATIONS
' ADDING COLUMN "ride_length"
' Get the last row in the column containing data
Dim LastRow As Long
LastRow = ws.Cells(ws.Rows.Count, "C").End(xlUp).Row
' Add a new column "ride_length" in column N
ws.Range("N1").Value = "ride_length"
' Perform calculation and copy formula down
ws.Range("N2:N" & LastRow).Formula = "=D2-C2"
' ADDING COLUMN(s) "day_of_week" & "name_of_day"
' Add a new column "day_of_week" in column O
ws.Range("O1").Value = "day_of_week"
' Perform calculation and format as desired
ws.Range("O2:O" & LastRow).Formula = "=WEEKDAY(C2,1)"
ws.Range("O2:O" & LastRow).NumberFormat = "0" ' Format as number with no decimals
' Add a new column "name_of_day" in column O
ws.Range("P1").Value = "name_of_day"
' Replace numeric weekday with day names
ws.Range("P2:P" & LastRow).FormulaR1C1 = "=CHOOSE(WEEKDAY(RC3,1),""Sunday"",""Monday"",""Tuesday"",""Wednesday"",""Thursday"",""Friday"",""Saturday"")"
' Format as desired (optional)
ws.Range("P2:P" & LastRow).NumberFormat = "@" ' Format as text
' AutoFit all columns
ws.UsedRange.Columns.AutoFit
' SAVING IN CSV FORMAT
' Save the workbook in CSV format to the CSV destination folder
wb.SaveAs FileName:=csvDestinationFolderPath & Replace(wb.Name, ".csv", ".csv"), FileFormat:=xlCSV
' Close the workbook without saving changes
wb.Close SaveChanges:=False
' Get the next CSV file
filePath = Dir ' Get the next CSV file
End If
Loop
' Display a message box indicating success after the loop
MsgBox "Data types cleaned and files saved successfully!"
End Sub"
"Dynamic Data Analysis and Enhanced Insights
Transforms pre-cleaned files into pivot tables with calculated metrics (average, maximum, minimum, mode, and record count).
Sub Stats_Pivots()
' Declare Variables
Dim folderPath As String
Dim filePath As String
Dim wb As Workbook
Dim DSheet As Worksheet
Dim PCache As pivotCache
Dim PTable As pivotTable
Dim PRange As Range
Dim lastDataRow As Long
' Specify the source folder path where the CSV files are located
folderPath = "D:\phase_one_transform\"
' Specify the destination folder path for saving files
Dim destinationFolderPath As String
destinationFolderPath = "D:\phase_two_prelim_analysis\"
' Get the first CSV file in the folder
filePath = Dir(folderPath & "*.csv")
Do While filePath <> ""
' Construct the full file path
Dim fullFilePath As String
fullFilePath = folderPath & filePath
' Open the CSV file
Set wb = Workbooks.Open(fullFilePath)
' Set the data sheet using wildcard
Set DSheet = Nothing
For Each ws In wb.Sheets
If ws.Name Like "*-divvy-tripdata" Then
Set DSheet = ws
Exit For
End If
Next ws
If DSheet Is Nothing Then
MsgBox "Worksheet not found.", vbExclamation
wb.Close SaveChanges:=False
' Get the next CSV file in the folder
filePath = Dir
Exit Do ' Exit the loop if the worksheet is not found
End If
' Add and rename the new sheet
Dim PSheet As Worksheet
On Error Resume Next
Application.DisplayAlerts = False
ThisWorkbook.Sheets("Prelim Analytics").Delete
On Error GoTo 0
Set PSheet = wb.Sheets.Add(Before:=DSheet)
PSheet.Name = "Prelim Analytics"
Application.DisplayAlerts = True
'CALCULATIONS
' Set headers for the new sheet
PSheet.Range("A1").Value = "AVERAGE"
PSheet.Range("B1").Value = "MAX"
PSheet.Range("C1").Value = "MIN"
PSheet.Range("D1").Value = "MODE"
PSheet.Range("E1").Value = "DAY"
PSheet.Range("F1").Value = "RECORD_COUNT"
' Adjust column widths for the first 6 columns
PSheet.Columns("A").ColumnWidth = 11 ' Adjust the value as needed
PSheet.Columns("B").ColumnWidth = 11 ' Adjust the value as needed
PSheet.Columns("C").ColumnWidth = 11 ' Adjust the value as needed
PSheet.Columns("D").ColumnWidth = 11 ' Adjust the value as needed
PSheet.Columns("E").ColumnWidth = 11 ' Adjust the value as needed
PSheet.Columns("F").ColumnWidth = 15 ' Adjust the value as needed
' Calculate averages, max, min, mode, and day using formulas
PSheet.Range("A2").FormulaR1C1 = "=AVERAGE('" & DSheet.Name & "'!C[13])"
PSheet.Range("B2").FormulaR1C1 = "=MAX('" & DSheet.Name & "'!C[12])"
PSheet.Range("C2").FormulaR1C1 = "=MIN('" & DSheet.Name & "'!C[11])"
PSheet.Range("D2").FormulaR1C1 = "=MODE('" & DSheet.Name & "'!C[11])"
PSheet.Range("E2").FormulaR1C1 = "=XLOOKUP(RC[-1],'" & DSheet.Name & "'!C[10],'" & DSheet.Name & "'!C[11])"
' Calculate and set the total number of records in column A (excluding header)
lastDataRow = DSheet.Cells(DSheet.Rows.Count, 1).End(xlUp).Row
Dim recordCount As Long
recordCount = IIf(lastDataRow >= 2, lastDataRow - 1, 0) ' Subtract 1 for the header
' Place the record count in cell F2
PSheet.Range("F2").Value = recordCount
' Apply number format to the cell
PSheet.Range("F2").NumberFormat = "#,##0"
'PIVOTS
' Set names for pivot tables
PSheet.Range("A5").Value = "Average by Ridership"
PSheet.Range("D5").Value = "Average by Day"
PSheet.Range("N5").Value = "Count by Day"
' Turn off screen updating and calculations
Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
' Define Pivot Range
' Find the last row and last column of the data
LastRow = DSheet.Cells(Rows.Count, 1).End(xlUp).Row
LastCol = DSheet.Cells(1, Columns.Count).End(xlToLeft).Column
' Set the pivot table data range using the last row and last column
Set PRange = DSheet.Cells(1, 1).Resize(LastRow, LastCol)
' Define Pivot Cache
Set PCache = ActiveWorkbook.PivotCaches.Create(SourceType:=xlDatabase, SourceData:=PRange)
' Convert "ride_length" column to numeric values
DSheet.Range("P2:P" & lastDataRow).Value = DSheet.Range("P2:P" & lastDataRow).Value
' Insert Pivot Table 1
Set PTable = PCache.CreatePivotTable(TableDestination:=PSheet.Cells(6, 1), TableName:="PivotTable1")
' Get the "member_casual" pivot field
Dim memberCasualField As PivotField
On Error Resume Next
Set memberCasualField = PTable.PivotFields("member_casual")
On Error GoTo 0
' Check if the field was found and set the orientation and position
If Not memberCasualField Is Nothing Then
memberCasualField.Orientation = xlRowField
memberCasualField.Position = 1
Else
MsgBox "Pivot field 'member_casual' not found.", vbExclamation
End If
' Set other pivot field properties
With PSheet.PivotTables("PivotTable1").PivotFields("member_casual")
.Orientation = xlRowField
.Position = 1
End With
' Insert Data Field
With PSheet.PivotTables("PivotTable1").PivotFields("ride_length")
.Orientation = xlDataField
.Function = xlAverage
.NumberFormat = "#,##0.0000" ' Format to 4 decimal places
.Name = "Avg of ride_length"
End With
' Insert Pivot Table 2
Set PTable = PCache.CreatePivotTable(TableDestination:=PSheet.Cells(6, 4), TableName:="PivotTable2")
' Get the "member_casual" pivot field for PivotTable2
Dim memberCasualField2 As PivotField
On Error Resume Next
Set memberCasualField2 = PTable.PivotFields("member_casual")
On Error GoTo 0
' Check if the field was found and set the orientation and position
If Not memberCasualField2 Is Nothing Then
memberCasualField2.Orientation = xlRowField
memberCasualField2.Position = 1
Else
MsgBox "Pivot field 'member_casual' not found for PivotTable2.", vbExclamation
End If
' Get the "name_of_day" pivot field for PivotTable2
Dim nameOfDayField As PivotField
On Error Resume Next
Set nameOfDayField = PTable.PivotFields("name_of_day")
On Error GoTo 0
' Check if the field was found and set the orientation and position
If Not nameOfDayField Is Nothing Then
nameOfDayField.Orientation = xlColumnField
nameOfDayField.Position = 1
Else
MsgBox "Pivot field 'name_of_day' not found for PivotTable2.", vbExclamation
End If
' Insert Data Field
With PSheet.PivotTables("PivotTable2").PivotFields("ride_length")
.Orientation = xlDataField
.Function = xlAverage
.NumberFormat = "0.0000"
.Name = "Average of ride_length"
End With
' Insert Pivot Table 3
Set PTable = PCache.CreatePivotTable(TableDestination:=PSheet.Cells(6, 14), TableName:="PivotTable3")
' Insert Row Fields
With PSheet.PivotTables("PivotTable3").PivotFields("member_casual")
.Orientation = xlRowField
.Position = 1
End With
' Insert Column Fields
With PSheet.PivotTables("PivotTable3").PivotFields("name_of_day")
.Orientation = xlColumnField
.Position = 1
End With
' Insert Data Field
With PSheet.PivotTables("PivotTable3").PivotFields("member_casual")
.Orientation = xlDataField
.Function = xlCount
.NumberFormat = "#,##0"
.Name = "Count of rides by day"
End With
' Format Pivot Tables
For Each PTable In PSheet.PivotTables
PTable.ShowTableStyleRowStripes = True
PTable.TableStyle2 = "PivotStyleMedium9"
Next PTable
' Turn on screen updating and calculations
Application.ScreenUpdating = True
Application.Calculation = xlCalculationAutomatic
' Save the workbook in Excel format
newFilePathXLSX = destinationFolderPath & Left$(wb.Name, Len(wb.Name) - 4) & ".xlsx"
wb.SaveAs newFilePathXLSX, FileFormat:=xlOpenXMLWorkbook
' Close the CSV file
wb.Close SaveChanges:=False
' Get the next CSV file in the folder
filePath = Dir
Loop
' Display a message box indicating the files have been processed
MsgBox "Files have been processed and saved as macro-enabled workbooks.", vbInformation
End Sub"
***************************************************************************************
1. create_cyclistic_tables
Establishing the framework for data storage.
"DO $$
BEGIN
FOR i IN 1..12 LOOP
EXECUTE format('
CREATE TABLE cyclistic_data_%s (
ride_id text,
rideable_type text,
started_at time stamp,
ended_at time stamp,
start_station_name text,
start_station_id text,
end_station_name text,
end_station_id text,
start_lat decimal,
start_lng decimal,
end_lat decimal,
end_lng decimal,
member_casual text,
ride_length text,
day_of_week integer,
name_of_day text
);', i);
END LOOP;
END $$;"
2. load_csv_data_monthwise
Transfers data from CSV files to designated database tables.
"-- Loop through the months and copy data from CSV files
DO $$
DECLARE
month_num INTEGER;
table_name TEXT;
csv_path TEXT;
BEGIN
FOR month_num IN 1..12 LOOP
table_name := 'cyclistic_data_' || month_num;
csv_path := 'D:/phase_one_transform/2022' || LPAD(month_num::TEXT, 2, '0') || '-divvy-tripdata.csv';
EXECUTE format('
COPY %I FROM %L DELIMITER %L CSV HEADER;',
table_name, csv_path, ',');
END LOOP;
END $$;
"
3. combine_monthly_data
Consolidating monthly data into a single unified data set.
"CREATE TABLE user_data AS
SELECT * FROM cyclistic_data_1
UNION ALL SELECT * FROM cyclistic_data_2
UNION ALL SELECT * FROM cyclistic_data_3
UNION ALL SELECT * FROM cyclistic_data_4
UNION ALL SELECT * FROM cyclistic_data_5
UNION ALL SELECT * FROM cyclistic_data_6
UNION ALL SELECT * FROM cyclistic_data_7
UNION ALL SELECT * FROM cyclistic_data_8
UNION ALL SELECT * FROM cyclistic_data_9
UNION ALL SELECT * FROM cyclistic_data_10
UNION ALL SELECT * FROM cyclistic_data_11
UNION ALL SELECT * FROM cyclistic_data_12;
"
***************************************************************************************
During the process of examining the data set and brainstorming
potential query scenarios, I identified two valuable third-party data
sources that would enhance my analysis: holidays and weather data. Both
sources cover the same timeframe as the existing data set, thereby
enhancing its context and depth. Integrating holiday data could provide
crucial insights into activity fluctuations attributed to special
occasions, offering a nuanced understanding of user behavior. Similarly,
the integration of weather data could deepen analysis by uncovering how
weather conditions impact user behavior, given its influence on
transportation choices and outdoor activities. By harnessing these
third-party sources and weaving their insights into my analysis, I
intend to create a more holistic and informed narrative. This
combination of user behavior, holidays, and weather data has the
potential to reveal multifaceted correlations and unveil optimization
and strategic decision-making opportunities.
***************************************************************************************
After successfully loading the data and creating the user_data table,
I meticulously conducted migration validations to ensure data accuracy.
This process involved a thorough examination of the resulting data set,
where I identified areas necessitating corrective actions and additional
table alterations.
1. migration_validation_record_counts
Used to verify the successful transition of data to the new platform.
"SELECT
TO_CHAR(started_at, 'Month') AS month,
COUNT(*) AS total_records_by_month
FROM
user_data
GROUP BY
TO_CHAR(started_at, 'Month'),
date_part('month', started_at)
ORDER BY
date_part('month', started_at);
"
2. identify_duplicate_rides
Detecting potential duplicate entries in the data set.
"SELECT ride_id, started_at, COUNT(*) AS duplicate_count
FROM user_data
GROUP BY ride_id, started_at
HAVING COUNT(*) > 1;"
3. outlier_ridelength_identification
Identify rides with negative ride lengths and those exceeding a day in duration
"-- How many records with negative ride_length?
SELECT COUNT(*) AS total_records_with_negative_ride_length
FROM user_data
WHERE ride_length < 0;
-- View the records
SELECT ride_length AS negative_ride_length, COUNT(ride_length) AS total
FROM user_data
GROUP BY ride_length
HAVING ride_length < 0;
-- remove the records
DELETE FROM user_data
WHERE ride_length < 0;
-- How many records with ridelengths over a day
SELECT ride_length AS over_a_day_ride_length, COUNT(ride_length) AS total
FROM user_data
WHERE ride_length > 24
GROUP BY ride_length;
--view the records
SELECT *
FROM user_data
WHERE ride_length > 24
-- remove the records
DELETE FROM user_data
WHERE ride_length > 24;
"
To ensure meticulous alterations, I chose to perform changes
individually for several reasons. This approach offers granular control,
isolating and addressing issues that might arise during the alteration
process. Furthermore, it provides a clear trail of changes for better
reproduction, documentation, and troubleshooting. While bundling changes
into a single query might be more efficient in some scenarios, my
decision to alter the database individually was driven by the complexity
of changes and the need to ensure careful management and validation,
thus aligning with data integrity principles. Addressing data type
inconsistencies was my initial focus. I rectified these inconsistencies
by adjusting the ride length column to a numeric format, ensuring
uniformity and data integrity. Subsequently, I recognized a granularity
disparity between the weather_data and user_data tables’ temporal data.
To harmonize this, I introduced dedicated date columns in the user_data,
and weather_data tables, ensuring coherence in date representation.
Additionally, I then incorporated a numeric day-of-the-year column,
adding a vital temporal dimension for more intricate analysis.
1. alter_ride_length_data_type
Updating data type format
"ALTER TABLE user_data
ALTER COLUMN ride_length SET DATA TYPE NUMERIC
USING ride_length::NUMERIC;"
2. alter_align_temporal_granularity_between_data sets
Aligning the temporal granularity
"/* RUN THESE INDIVIDUALLY - ADD THE COLUMN FIRST, THEN UPDATE THE DATA */
-- Add the 'date_start' column to the 'user_data' table
ALTER TABLE user_data
ADD COLUMN date_start DATE;
-- Update the 'date_start' column with extracted date from 'started_at'
UPDATE user_data
SET date_start = DATE(started_at);"
3. data_enhancement_doy
Adding day of the year column
"ALTER TABLE user_data
ADD COLUMN day_of_year INTEGER;
-- Update the 'day_of_year' column for each record in 'user_data'
UPDATE user_data
SET day_of_year = DATE_PART('doy', started_at::date);"
***************************************************************************************
I realized the importance of establishing a dedicated ‘holidays’ table for the year 2022, tailored to the unique characteristics of the Chicago area. This decision was driven by the data set’s origin in this specific region. The creation of a comprehensive record of holidays specific to Chicago holds the potential to significantly enhance both the analytical capabilities and the contextual relevance of the data. By including local holidays, this table will serve as a valuable resource for conducting more insightful analyses and drawing meaningful conclusions that are closely aligned with the Chicago area’s socio-cultural dynamics. I conducted thorough research using Google to gather accurate and relevant information regarding holidays and their corresponding dates specific to the Chicago area. This meticulous process ensured the creation of a comprehensive and contextually accurate “holidays” table for the year 2022. By utilizing reliable sources and cross-referencing information, I curated a data set that aligns with the local holiday calendar, enhancing the analytical capabilities and relevance of the collected data.
This data set focuses on the city of Chicago, Illinois, spanning the timeframe from January 1, 2022, to December 31, 2022. The data set is centered around key holidays observed within this period. It includes holiday names along with their respective start and end dates, providing insights into the specific days of significance throughout the year. The data set captures the essence of these holidays, allowing for a comprehensive understanding of their occurrences and durations in Chicago. This set of holiday data serves as a valuable resource for analyzing and contextualizing rider behaviors during these significant dates within the city.
1. create_holiday_table
Creating the holidays table
"CREATE TABLE IF NOT EXISTS public.holidays (
holiday_name text COLLATE pg_catalog."default",
holiday_start_date date,
holiday_end_date date
);
-- Set the table to use the default tablespace
ALTER TABLE public.holidays SET TABLESPACE pg_default;
-- Set the table's owner to 'postgres'
ALTER TABLE public.holidays OWNER TO postgres;
"
2. load_holidays_data
Loading holidays data
"INSERT INTO holidays (holiday_name, holiday_start_date) VALUES
-- single-day holiday records--
('New Year''s', '2022-01-01'),
('Martin Luther King Jr.', '2022-01-17'),
('Presidents Day', '2022-02-21'),
('Easter', '2022-03-23'),
('Memorial Day', '2022-05-27'),
('Independence Day', '2022-07-04'),
('Columbus Day', '2022-10-04'),
('Halloween', '2022-10-31'),
('Veterans Day', '2022-11-11'),
('Thanksgiving', '2022-11-24'),
('Christmas Day', '2022-12-25'),
('Mother''s Day', '2022-05-18'),
('Father''s Day', '2022-06-16'),
('Boxing Day', '2022-12-26'),
('International Worker''s Day (May Day)', '2022-05-01'),
('Flag Day', '2022-06-14'),
('Earth Day', '2022-04-22'),
('Pi Day', '2022-03-14'),
('International Women''s Day', '2022-03-08'),
('Valentine''s Day', '2022-02-14'),
('Juneteenth', '2022-06-19'),
('Chinese New Year','2022-02-01'),
('Diwali', '2022-10-24'),
('Eid al-Fitr (End of Ramadan)', '2022-05-02'),
('Eid al-Adha', '2022-07-09'),
('Mardi Gras', '2022-03-01'),
('Chicago St. Patrick''s Day Parade', '2022-03-12'),
('Chicago Pride Parade', '2022-06-26'),
-- multi-day holiday records --
('Oktoberfest', '2022-09-10'),
('Oktoberfest', '2022-09-11'),
('Oktoberfest', '2022-09-12'),
('Oktoberfest', '2022-09-13'),
('Oktoberfest', '2022-09-14'),
('Oktoberfest', '2022-09-15'),
('Oktoberfest', '2022-09-16'),
('Oktoberfest', '2022-09-17'),
('Oktoberfest', '2022-09-18'),
('Oktoberfest', '2022-09-19'),
('Chicago Air and Water Show', '2022-08-20'),
('Chicago Air and Water Show', '2022-08-21'),
('Taste of Chicago', '2022-06-08'),
('Taste of Chicago', '2022-06-09'),
('Taste of Chicago', '2022-06-10'),
('Chicago Blues Festival', '2022-06-09'),
('Chicago Blues Festival', '2022-06-10'),
('Chicago Blues Festival', '2022-06-11'),
('Chicago Blues Festival', '2022-06-12'),
('Chicago Jazz Festival', '2022-09-01'),
('Chicago Jazz Festival', '2022-09-02'),
('Chicago Jazz Festival', '2022-09-03'),
('Chicago Jazz Festival', '2022-09-04'),
('Lollapalooza', '2022-07-28'),
('Lollapalooza', '2022-07-29'),
('Lollapalooza', '2022-07-30'),
('Lollapalooza', '2022-07-31')
;"
The acquired weather data set is dedicated to the city of Chicago, Illinois, covering the time span from January 1, 2022, to December 31, 2022. This data set presents a temporal dimension through daily temperature recordings and encompasses a wide array of meteorological variables. Among the temperature measurements in Fahrenheit, the data set encompasses humidity levels, detailed precipitation data, atmospheric pressure readings, solar indexes, cloud information, visibility metrics, and wind statistics. Additionally, the data set includes essential data points such as sunrise and sunset times, moon phases, weather conditions, their corresponding descriptions, and the collection location for the data, with distance measurements provided in miles. This comprehensive data set offers invaluable insights into the climatic trends and conditions of Chicago throughout the specified timeframe.
1. create_weather_data
Creating weather table
"CREATE TABLE weather_data (
name TEXT,
datetime time stamp,
tempmax NUMERIC,
tempmin NUMERIC,
temp NUMERIC,
feelslikemax NUMERIC,
feelslikemin NUMERIC,
feelslike NUMERIC,
dew NUMERIC,
humidity NUMERIC,
precip NUMERIC,
precipprob NUMERIC,
precipcover NUMERIC,
preciptype TEXT,
snow NUMERIC,
snowdepth NUMERIC,
windgust NUMERIC,
windspeed NUMERIC,
winddir NUMERIC,
sealevelpressure NUMERIC,
cloudcover NUMERIC,
visibility NUMERIC,
solarradiation NUMERIC,
solarenergy NUMERIC,
uvindex NUMERIC,
severerisk NUMERIC,
sunrise time stamp,
sunset time stamp,
moonphase NUMERIC,
conditions TEXT,
description TEXT,
icon TEXT,
stations TEXT
);"
2. load_weather_data_load
Loading weather data
"COPY weather_data
FROM 'D:\GAC_Cyclistic_project\Weather\Chicago_Illinois_2022-01-01_to_2022-12-31.csv'
DELIMITER ',' CSV HEADER;"
At this point I merged the holidays and weather_data tables into a single table named ‘temporal_data’ to simplify the workflow.
3. merge_holidays_weather
Merging holidays and weather into one table and renaming to 'temporal_data'
"CREATE TABLE temporal_data AS
SELECT *
FROM holidays
FULL OUTER JOIN (
SELECT *,
CAST(datetime AS DATE) AS date_w
FROM weather_data
) AS weather_date_only
ON holidays.holiday_start_date = weather_date_only.date_w;"
***************************************************************************************
To ensure the utmost data integrity and to establish a robust contingency plan against potential changes, I am implementing a strategic approach by creating backups for all tables. This process entails replicating the structure and content of these tables, effectively preserving their original configurations and data states. By executing this measure, I am taking proactive steps to securely retain these tables, safeguarding them for future reference and facilitating swift restoration if required.
1. create_user_data_backup
Backing up the cyclistic data
"CREATE TABLE user_data_backup AS
SELECT * FROM user_data;"
2. create_temporal_data_backup
Backing up the holiday/weather data
"CREATE TABLE temporal_data_backup AS
SELECT * FROM temporal_data;"
*************************************************************************************
During the ‘phase_one_transform’ process, I carried out an initial
comprehensive review of the files to meticulously identify and document
all instances of errors within the data set. While these errors, often
referred to as ‘null’ values, might not influence the determination of
ride lengths for the analysis, they hold significance in the context of
geospatial analysis. The subsequent analysis provides a detailed
overview of records that exhibit incompleteness or null values. This
comprehensive examination serves as a vital step in ensuring data
quality and integrity across multiple analytical dimensions.
Observations by Fields:
1. count_fields_with_nulls
Counting number of fields containing nulls
"SELECT
TO_CHAR(COUNT(*) * 17, '999,999,999') AS total_fields,
TO_CHAR(SUM(
CASE WHEN ride_id IS NULL OR ride_id = '' OR ride_id = ' ' THEN 1 ELSE 0 END +
CASE WHEN rideable_type IS NULL OR rideable_type = '' OR rideable_type = ' ' THEN 1 ELSE 0 END +
CASE WHEN started_at IS NULL THEN 1 ELSE 0 END +
CASE WHEN ended_at IS NULL THEN 1 ELSE 0 END +
CASE WHEN start_station_name IS NULL OR start_station_name = '' OR start_station_name = ' ' THEN 1 ELSE 0 END +
CASE WHEN start_station_id IS NULL OR start_station_id = '' OR start_station_id = ' ' THEN 1 ELSE 0 END +
CASE WHEN end_station_name IS NULL OR end_station_name = '' OR end_station_name = ' ' THEN 1 ELSE 0 END +
CASE WHEN end_station_id IS NULL OR end_station_id = '' OR end_station_id = ' ' THEN 1 ELSE 0 END +
CASE WHEN start_lat IS NULL THEN 1 ELSE 0 END +
CASE WHEN start_lng IS NULL THEN 1 ELSE 0 END +
CASE WHEN end_lat IS NULL THEN 1 ELSE 0 END +
CASE WHEN end_lng IS NULL THEN 1 ELSE 0 END +
CASE WHEN member_casual IS NULL OR member_casual = '' OR member_casual = ' ' THEN 1 ELSE 0 END +
CASE WHEN ride_length IS NULL THEN 1 ELSE 0 END +
CASE WHEN day_of_week IS NULL THEN 1 ELSE 0 END +
CASE WHEN name_of_day IS NULL THEN 1 ELSE 0 END
), '999,999,999') AS total_nulls,
TO_CHAR((SUM(
CASE WHEN ride_id IS NULL OR ride_id = '' OR ride_id = ' ' THEN 1 ELSE 0 END +
CASE WHEN rideable_type IS NULL OR rideable_type = '' OR rideable_type = ' ' THEN 1 ELSE 0 END +
CASE WHEN started_at IS NULL THEN 1 ELSE 0 END +
CASE WHEN ended_at IS NULL THEN 1 ELSE 0 END +
CASE WHEN start_station_name IS NULL OR start_station_name = '' OR start_station_name = ' ' THEN 1 ELSE 0 END +
CASE WHEN start_station_id IS NULL OR start_station_id = '' OR start_station_id = ' ' THEN 1 ELSE 0 END +
CASE WHEN end_station_name IS NULL OR end_station_name = '' OR end_station_name = ' ' THEN 1 ELSE 0 END +
CASE WHEN end_station_id IS NULL OR end_station_id = '' OR end_station_id = ' ' THEN 1 ELSE 0 END +
CASE WHEN start_lat IS NULL THEN 1 ELSE 0 END +
CASE WHEN start_lng IS NULL THEN 1 ELSE 0 END +
CASE WHEN end_lat IS NULL THEN 1 ELSE 0 END +
CASE WHEN end_lng IS NULL THEN 1 ELSE 0 END +
CASE WHEN member_casual IS NULL OR member_casual = '' OR member_casual = ' ' THEN 1 ELSE 0 END +
CASE WHEN ride_length IS NULL THEN 1 ELSE 0 END +
CASE WHEN day_of_week IS NULL THEN 1 ELSE 0 END +
CASE WHEN name_of_day IS NULL THEN 1 ELSE 0 END
)::float / (COUNT(*) * 17) * 100), '999.99') AS error_percentage
FROM
user_data;"
2. count_records_with_nulls
Counting number of records containing nulls
"WITH total_null_counts AS (
SELECT
COUNT(*) AS total_records_with_errors
FROM
user_data
WHERE
ride_id IS NULL OR
rideable_type IS NULL OR
started_at IS NULL OR
ended_at IS NULL OR
start_station_name IS NULL OR
start_station_id IS NULL OR
end_station_name IS NULL OR
end_station_id IS NULL OR
start_lat IS NULL OR
start_lng IS NULL OR
end_lat IS NULL OR
end_lng IS NULL OR
member_casual IS NULL OR
ride_length IS NULL OR
day_of_week IS NULL OR
name_of_day IS NULL
)
SELECT
TO_CHAR(total_records, '999,999,999') AS total_records,
TO_CHAR(total_records_with_errors, '999,999,999') AS total_records_with_errors,
TO_CHAR((total_records_with_errors::float / total_records * 100), '999.99') AS error_percentage
FROM
(SELECT COUNT(*) AS total_records FROM user_data) AS total_records,
total_null_counts;"
3. data_quality_summary_report
General overview of data quality
"WITH total_counts AS (
SELECT COUNT(*) AS total_records FROM user_data
)
SELECT column_name, null_count
FROM (
SELECT 'total_records' AS column_name,
TO_CHAR(total_records, '999,999,999') AS null_count, 0 AS sort_order
FROM total_counts
UNION ALL
SELECT 'ride_id',
TO_CHAR(COUNT(*) FILTER (
WHERE ride_id IS NULL OR ride_id = '' OR ride_id = ' '),
'999,999,999'
) AS null_count, 1
FROM user_data
UNION ALL
SELECT 'rideable_type',
TO_CHAR(COUNT(*) FILTER (
WHERE rideable_type IS NULL OR rideable_type = '' OR rideable_type = ' '),
'999,999,999'
) AS null_count, 2
FROM user_data
UNION ALL
SELECT 'started_at',
TO_CHAR(COUNT(*) FILTER (WHERE started_at IS NULL), '999,999,999') AS null_count, 3
FROM user_data
UNION ALL
SELECT 'ended_at',
TO_CHAR(COUNT(*) FILTER (WHERE ended_at IS NULL), '999,999,999') AS null_count, 4
FROM user_data
UNION ALL
SELECT 'start_station_name',
TO_CHAR(COUNT(*) FILTER (
WHERE start_station_name IS NULL OR start_station_name = '' OR start_station_name = ' '),
'999,999,999'
) AS null_count, 5
FROM user_data
UNION ALL
SELECT 'start_station_id',
TO_CHAR(COUNT(*) FILTER (
WHERE start_station_id IS NULL OR start_station_id = '' OR start_station_id = ' '),
'999,999,999'
) AS null_count, 6
FROM user_data
UNION ALL
SELECT 'end_station_name',
TO_CHAR(COUNT(*) FILTER (
WHERE end_station_name IS NULL OR end_station_name = '' OR end_station_name = ' '),
'999,999,999'
) AS null_count, 7
FROM user_data
UNION ALL
SELECT 'end_station_id',
TO_CHAR(COUNT(*) FILTER (
WHERE end_station_id IS NULL OR end_station_id = '' OR end_station_id = ' '),
'999,999,999'
) AS null_count, 8
FROM user_data
UNION ALL
SELECT 'start_lat',
TO_CHAR(COUNT(*) FILTER (WHERE start_lat IS NULL), '999,999,999') AS null_count, 9
FROM user_data
UNION ALL
SELECT 'start_lng',
TO_CHAR(COUNT(*) FILTER (WHERE start_lng IS NULL), '999,999,999') AS null_count, 10
FROM user_data
UNION ALL
SELECT 'end_lat',
TO_CHAR(COUNT(*) FILTER (WHERE end_lat IS NULL), '999,999,999') AS null_count, 11
FROM user_data
UNION ALL
SELECT 'end_lng',
TO_CHAR(COUNT(*) FILTER (WHERE end_lng IS NULL), '999,999,999') AS null_count, 12
FROM user_data
UNION ALL
SELECT 'member_casual',
TO_CHAR(COUNT(*) FILTER (
WHERE member_casual IS NULL OR member_casual = '' OR member_casual = ' '),
'999,999,999'
) AS null_count, 13
FROM user_data
UNION ALL
SELECT 'ride_length',
TO_CHAR(COUNT(*) FILTER (WHERE ride_length IS NULL), '999,999,999') AS null_count, 14
FROM user_data
UNION ALL
SELECT 'day_of_week',
TO_CHAR(COUNT(*) FILTER (WHERE day_of_week IS NULL), '999,999,999') AS null_count, 15
FROM user_data
UNION ALL
SELECT 'name_of_day',
TO_CHAR(COUNT(*) FILTER (WHERE name_of_day IS NULL), '999,999,999') AS null_count, 16
FROM user_data
UNION ALL
SELECT 'day_of_year',
TO_CHAR(COUNT(*) FILTER (WHERE day_of_year IS NULL), '999,999,999') AS null_count, 17
FROM user_data
) AS results
ORDER BY sort_order, column_name;"
An in-depth analysis of missing values within the dataset sheds light on the completeness and reliability of data across various columns:
4. monthly_error_statistics_query
General overview of data quality by month
"WITH total_counts AS (
SELECT COUNT(*) AS total_records FROM user_data
)
SELECT
EXTRACT(MONTH FROM started_at) AS month,
LPAD(
TO_CHAR(
COUNT(*) FILTER (
WHERE start_station_name IS NULL OR
start_station_name = '' OR
start_station_name = ' '
),
'999,999,999'
),
12
) AS error_count_start_station_name,
LPAD(
TO_CHAR(
ROUND(
(COUNT(*) FILTER (
WHERE start_station_name IS NULL OR
start_station_name = '' OR
start_station_name = ' '
) * 100.0 / total_records),
2
),
'999.99'
),
8
) AS error_percentage_ssn_by_month,
LPAD(
TO_CHAR(
COUNT(*) FILTER (
WHERE start_station_id IS NULL OR
start_station_id = '' OR
start_station_id = ' '
),
'999,999,999'
),
12
) AS error_count_start_station_id,
LPAD(
TO_CHAR(
ROUND(
(COUNT(*) FILTER (
WHERE start_station_id IS NULL OR
start_station_id = '' OR
start_station_id = ' '
) * 100.0 / total_records),
2
),
'999.99'
),
8
) AS error_percentage_ssi_by_month,
LPAD(
TO_CHAR(
COUNT(*) FILTER (
WHERE end_station_name IS NULL OR
end_station_name = '' OR
end_station_name = ' '
),
'999,999,999'
),
12
) AS error_count_end_station_name,
LPAD(
TO_CHAR(
ROUND(
(COUNT(*) FILTER (
WHERE end_station_name IS NULL OR
end_station_name = '' OR
end_station_name = ' '
) * 100.0 / total_records),
2
),
'999.99'
),
8
) AS error_percentage_esn_by_month,
LPAD(
TO_CHAR(
COUNT(*) FILTER (
WHERE end_station_id IS NULL OR
end_station_id = '' OR
end_station_id = ' '
),
'999,999,999'
),
12
) AS error_count_end_station_id,
LPAD(
TO_CHAR(
ROUND(
(COUNT(*) FILTER (
WHERE end_station_id IS NULL OR
end_station_id = '' OR
end_station_id = ' '
) * 100.0 / total_records),
2
),
'999.99'
),
8
) AS error_percentage_esi_by_month,
LPAD(
TO_CHAR(
COUNT(*) FILTER (WHERE end_lat IS NULL),
'999,999,999'
),
12
) AS error_count_end_lat,
LPAD(
TO_CHAR(
ROUND(
(COUNT(*) FILTER (WHERE end_lat IS NULL) * 100.0 / total_records),
2
),
'999.99'
),
8
) AS error_percentage_end_lat_by_month,
LPAD(
TO_CHAR(
COUNT(*) FILTER (WHERE end_lng IS NULL),
'999,999,999'
),
12
) AS error_count_end_lng,
LPAD(
TO_CHAR(
ROUND(
(COUNT(*) FILTER (WHERE end_lng IS NULL) * 100.0 / total_records),
2
),
'999.99'
),
8
) AS error_percentage_end_lng_by_month
FROM
user_data,
total_counts
WHERE
start_station_name IS NULL OR start_station_name = '' OR start_station_name = ' '
OR start_station_id IS NULL OR start_station_id = '' OR start_station_id = ' '
OR end_station_name IS NULL OR end_station_name = '' OR end_station_name = ' '
OR end_station_id IS NULL OR end_station_id = '' OR end_station_id = ' '
OR end_lat IS NULL
OR end_lng IS NULL
GROUP BY
EXTRACT(MONTH FROM started_at),
total_records
ORDER BY month;
"
This provides a detailed breakdown of error counts and error percentages related to station and geographical data quality issues across different months:
***************************************************************************************
My analysis employs a methodical approach to uncover ride-sharing dynamics, progressing through distinct stages that build on previous insights. We begin by examining the foundational ‘user_data’ table, investigating ride start and end times, user types, and durations to identify key patterns. The integration of the ‘holidays’ and ‘weather_data’ tables into the ‘temporal_data’ data set broadens my analysis. We gain a comprehensive perspective on how holidays and weather impact user behaviors, understanding their collective influence on ride choices. Bringing together insights from individual analyses, we synthesize data from ‘user_data,’ ‘holidays,’ and ‘weather_data’ tables. This reveals connections between user types, temporal patterns, and weather conditions, offering actionable insights. My goal is to enable informed decisions in the ride-sharing ecosystem by uncovering meaningful trends. This structured approach ensures that each stage contributes to a robust analysis framework.
The initial approach involves a comprehensive examination of general ridership, encompassing various temporal aspects. We begin by taking a high-level view of rider engagement across multiple time frames, including monthly, quarterly, and seasonal analyses. This exploration allows us to discern overarching trends and patterns in user behavior, shedding light on how ridership fluctuates over different time intervals.
1. total_records_by_month
Getting the number of records within the data set by month
"SELECT
EXTRACT(MONTH FROM started_at) AS month,
TO_CHAR(COUNT(*), '999,999,999') AS total_records_by_month
FROM
user_data
GROUP BY
EXTRACT(MONTH FROM started_at)
ORDER BY
EXTRACT(MONTH FROM started_at);
"
Observations:
2. total_records_by_quarter
Getting the number of records within the data set by quarter
"SELECT
TO_CHAR(EXTRACT(QUARTER FROM started_at), '9') AS quarter,
TO_CHAR(COUNT(*), '9,999,999') AS total_records_by_quarter
FROM
user_data
GROUP BY
EXTRACT(QUARTER FROM started_at)
ORDER BY
EXTRACT(QUARTER FROM started_at);
"
Observations:
3. total_records_by_season
Getting the number of records within the data set by season
"SELECT
CASE
WHEN EXTRACT(MONTH FROM started_at) IN (12, 1, 2) THEN 'Winter'
WHEN EXTRACT(MONTH FROM started_at) IN (3, 4, 5) THEN 'Spring'
WHEN EXTRACT(MONTH FROM started_at) IN (6, 7, 8) THEN 'Summer'
WHEN EXTRACT(MONTH FROM started_at) IN (9, 10, 11) THEN 'Autumn'
END AS season,
TO_CHAR(COUNT(*), '999,999,999') AS total_records_by_season
FROM
user_data
GROUP BY
season
ORDER BY
MIN(EXTRACT(MONTH FROM started_at));"
Observations:
This query provides a clear breakdown of the total ride records by different seasons, categorized as Winter, Spring, Summer, and Autumn. Each season corresponds to a specific range of months. The data reveals interesting insights into user behavior patterns throughout the year:
******************************************************
From this point forward, I will be segmenting the data set
based on the distinction between “member” and “casual” users
During my exploration, I conducted a comprehensive analysis of riders, both members and casual, over a full year. This involved calculating total counts for each category and highlighting the percentage difference. I also intend to extend this analysis to monthly and seasonal periods for a more detailed understanding of rider engagement patterns.
4. user_count_by_type
Getting the count of each user type in the segmented groups
"SELECT member_casual, Count(*) AS user_count
FROM user_data
GROUP BY member_casual;"
Observations:
Casual Riders: Recorded at 2,321,974 rides, this user group consists of individuals who use the service sporadically or on a one-time basis, without subscribing to a long-term membership.
Member Riders: This category, with a total of 3,345,640 rides, comprises individuals who have committed to a membership plan, indicating a higher level of engagement and regular usage of the service.
5. member_casual_user_stats_year
Getting the total counts of members and casual riders for the year
"SELECT
EXTRACT(YEAR FROM started_at) AS year,
TO_CHAR(SUM(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END), '999,999,999') AS member_count,
TO_CHAR(SUM(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END), '999,999,999') AS casual_count,
ROUND(
(
(
SUM(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END) -
SUM(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END)
)::numeric /
SUM(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END)
) * 100,
2
) AS percentage_difference
FROM user_data
GROUP BY EXTRACT(YEAR FROM started_at)
ORDER BY year;
"
Observations:
6. member_casual_user_stats_month
Getting the total counts of members and casual riders by month
"WITH counts AS (
SELECT
EXTRACT(MONTH FROM started_at) AS month,
COUNT(*) FILTER (WHERE member_casual = 'member') AS member_count,
COUNT(*) FILTER (WHERE member_casual = 'casual') AS casual_count
FROM
user_data
GROUP BY
EXTRACT(MONTH FROM started_at)
)
SELECT
TO_CHAR(TO_DATE(EXTRACT(YEAR FROM CURRENT_DATE) || '-' || month || '-01', 'YYYY-MM-DD'), 'Month') AS months,
TO_CHAR(member_count, 'FM999,999,999') AS membership_count,
TO_CHAR(casual_count, 'FM999,999,999') AS casual_use_count,
ROUND(
((member_count - casual_count)::numeric / NULLIF(member_count, 0)) * 100,
2
) AS percentage_difference
FROM
counts
ORDER BY
EXTRACT(MONTH FROM TO_DATE(month::text, 'MM'));
"
Observations:
7. member_casual_user_stats_season
Getting the total counts of members and casual riders by season
"WITH seasonal_counts AS (
SELECT
CASE
WHEN EXTRACT(MONTH FROM started_at) IN (12, 1, 2) THEN 'Winter'
WHEN EXTRACT(MONTH FROM started_at) IN (3, 4, 5) THEN 'Spring'
WHEN EXTRACT(MONTH FROM started_at) IN (6, 7, 8) THEN 'Summer'
WHEN EXTRACT(MONTH FROM started_at) IN (9, 10, 11) THEN 'Autumn'
ELSE 'Unknown'
END AS season,
COUNT(*) FILTER (WHERE member_casual = 'member') AS member_count,
COUNT(*) FILTER (WHERE member_casual = 'casual') AS casual_count
FROM
user_data
WHERE
EXTRACT(YEAR FROM started_at) = 2022
GROUP BY
season
)
SELECT
season,
TO_CHAR(member_count, '9,999,999') AS membership,
TO_CHAR(casual_count, '9,999,999') AS casual,
ROUND(((member_count - casual_count)::numeric / NULLIF(member_count, 0)) * 100, 2) AS percentage_difference
FROM
seasonal_counts
ORDER BY
casual DESC;"
Observations:
***************************************************************************************
8. avg_ride_length_analysis_by_year
Getting the average ride lengths by year
"SELECT
year,
ROUND(member_mean_ride_length_minutes, 2) AS rounded_member_mean_ride_length_minutes,
ROUND(casual_mean_ride_length_minutes, 2) AS rounded_casual_mean_ride_length_minutes,
ROUND(
(
(casual_mean_ride_length_minutes - member_mean_ride_length_minutes)
/ NULLIF(casual_mean_ride_length_minutes, 0)
) * 100,
2
) AS percentage_difference
FROM (
SELECT
EXTRACT(YEAR FROM started_at) AS year,
AVG(
CASE WHEN member_casual = 'member'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END
)::numeric AS member_mean_ride_length_minutes,
AVG(
CASE WHEN member_casual = 'casual'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END
)::numeric AS casual_mean_ride_length_minutes
FROM
user_data
GROUP BY
year
) AS subquery;"
Observations:
9. avg_ride_length_analysis_by_month
Getting the average ride lengths by month
"SELECT
TO_CHAR(started_at, 'Month') AS month,
ROUND(
AVG(CASE WHEN member_casual = 'member'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric, 2
) AS member_mean_ride_length_minutes,
ROUND(
AVG(CASE WHEN member_casual = 'casual'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric, 2
) AS casual_mean_ride_length_minutes,
ROUND(
(
(
AVG(CASE WHEN member_casual = 'casual'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric
) - (
AVG(CASE WHEN member_casual = 'member'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric
)
) / AVG(CASE WHEN member_casual = 'casual'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric * 100, 2
) AS percentage_difference
FROM
user_data
GROUP BY
month
ORDER BY
ROUND(AVG(CASE WHEN member_casual = 'casual'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric, 2) DESC;
"
Observations:
10. avg_ride_length_by_day_of_week
Getting the average ride lengths by day of the week
"SELECT *
FROM crosstab(
'SELECT
CASE WHEN member_casual = ''member'' THEN ''Member'' ELSE ''Casual'' END AS category,
EXTRACT(DOW FROM started_at) AS day_of_week,
ROUND(AVG(CAST(ride_length AS numeric)), 4) AS average_ride_length
FROM
user_data
GROUP BY
1, 2
ORDER BY
1, 2',
'VALUES (0), (1), (2), (3), (4), (5), (6)'
) AS pivot_table(category text, "Sunday" numeric, "Monday" numeric, "Tuesday" numeric,
"Wednesday" numeric, "Thursday" numeric, "Friday" numeric, "Saturday" numeric);
"
Observations:
11. avg_ride_length_analysis_by_season
Getting the average ride lengths by season
"SELECT
CASE
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 3 AND 5 THEN 'Spring'
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 6 AND 8 THEN 'Summer'
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 9 AND 11 THEN 'Autumn'
ELSE 'Winter'
END AS season,
ROUND(
AVG(CASE WHEN member_casual = 'member'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric, 2
) AS member_mean_ride_length_minutes,
ROUND(
AVG(CASE WHEN member_casual = 'casual'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric, 2
) AS casual_mean_ride_length_minutes,
ROUND(
(
(
AVG(CASE WHEN member_casual = 'casual'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric
) - (
AVG(CASE WHEN member_casual = 'member'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric
)
) / AVG(CASE WHEN member_casual = 'casual'
THEN EXTRACT(EPOCH FROM (ended_at - started_at)) / 60 END)::numeric * 100, 2
) AS percentage_difference
FROM
user_data
GROUP BY
CASE
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 3 AND 5 THEN 'Spring'
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 6 AND 8 THEN 'Summer'
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 9 AND 11 THEN 'Autumn'
ELSE 'Winter'
END
ORDER BY
casual_mean_ride_length_minutes DESC;
"
Observations:
***************************************************************************************
The following three queries focus on comparing the total and maximum ride lengths, adjusting and normalizing them by user count for unbiased comparisons.
12. max_ride_length_analysis_yearly
Getting the max ride lengths by year
"WITH MaxRideLengths AS (
SELECT
CASE
WHEN member_casual = 'member' THEN 'Member'
WHEN member_casual = 'casual' THEN 'Casual'
END AS user_type,
SUM(EXTRACT(EPOCH FROM (ended_at - started_at))) AS sum_ride_length_seconds,
COUNT(*) AS user_count
FROM user_data
GROUP BY member_casual
)
SELECT
user_type,
CONCAT(
TO_CHAR(FLOOR(sum_ride_length_seconds / 3600), 'FM999,999,999'), ' hours, ',
FLOOR((sum_ride_length_seconds % 3600) / 60), ' minutes, ',
FLOOR(sum_ride_length_seconds % 60), ' seconds'
) AS total_max_ride_length_for_the_year,
CONCAT(
TO_CHAR(FLOOR((sum_ride_length_seconds / 3600 / user_count) / 60), 'FM999,999,999'), ' hours, ',
FLOOR((sum_ride_length_seconds / user_count % 3600) / 60), ' minutes, ',
FLOOR((sum_ride_length_seconds / user_count) % 60), ' seconds'
) AS normalized_max_ride_length_for_the_year
FROM MaxRideLengths;
"
Observations:
13. max_ride_length_analysis_monthly
Getting the max ride lengths by month
"WITH MaxRideLengths AS (
SELECT
TO_CHAR(started_at, 'Month') AS month,
CASE
WHEN member_casual = 'member' THEN 'Member'
WHEN member_casual = 'casual' THEN 'Casual'
END AS user_type,
SUM(EXTRACT(EPOCH FROM (ended_at - started_at))) AS sum_ride_length_seconds,
COUNT(*) AS user_count
FROM user_data
GROUP BY month, member_casual
)
SELECT
month,
user_type,
CONCAT(
TO_CHAR(FLOOR(sum_ride_length_seconds / 3600), 'FM999,999,999'), ' hours, ',
TO_CHAR(FLOOR((sum_ride_length_seconds % 3600) / 60), 'FM999,999,999'), ' minutes, ',
TO_CHAR(FLOOR(sum_ride_length_seconds % 60), 'FM999,999,999'), ' seconds'
) AS total_max_ride_length_for_the_month,
CONCAT(
TO_CHAR(FLOOR((sum_ride_length_seconds / 3600 / user_count) / 60), 'FM999,999,999'), ' hours, ',
TO_CHAR(FLOOR((sum_ride_length_seconds / user_count % 3600) / 60), 'FM999,999,999'), ' minutes, ',
TO_CHAR(FLOOR((sum_ride_length_seconds / user_count) % 60), 'FM999,999,999'), ' seconds'
) AS normalized_max_ride_length_for_the_month
FROM MaxRideLengths
ORDER BY
to_timestamp(month, 'Month')::DATE,
user_type;"
Observations:
14. max_ride_length_analysis_seasonal
Getting the max ride lengths by season
"WITH MaxRideLengths AS (
SELECT
CASE
WHEN member_casual = 'member' THEN 'Member'
WHEN member_casual = 'casual' THEN 'Casual'
END AS user_type,
CASE
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 3 AND 5 THEN 'Spring'
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 6 AND 8 THEN 'Summer'
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 9 AND 11 THEN 'Autumn'
ELSE 'Winter'
END AS season,
SUM(EXTRACT(EPOCH FROM (ended_at - started_at))) AS sum_ride_length_seconds,
COUNT(*) AS user_count
FROM user_data
GROUP BY user_type, season
)
SELECT
season,
user_type,
CONCAT(
TO_CHAR(FLOOR(sum_ride_length_seconds / 3600), 'FM999,999,999'), ' hours, ',
TO_CHAR(FLOOR((sum_ride_length_seconds % 3600) / 60), 'FM999,999,999'), ' minutes, ',
TO_CHAR(FLOOR(sum_ride_length_seconds % 60), 'FM999,999,999'), ' seconds'
) AS total_max_ride_length_for_the_season,
CONCAT(
TO_CHAR(FLOOR((sum_ride_length_seconds / 3600 / user_count) / 60), 'FM999,999,999'), ' hours, ',
TO_CHAR(FLOOR((sum_ride_length_seconds / user_count % 3600) / 60), 'FM999,999,999'), ' minutes, ',
TO_CHAR(FLOOR((sum_ride_length_seconds / user_count) % 60), 'FM999,999,999'), ' seconds'
) AS normalized_max_ride_length_for_the_season
FROM MaxRideLengths
ORDER BY
CASE
WHEN season = 'Spring' THEN 1
WHEN season = 'Summer' THEN 2
WHEN season = 'Autumn' THEN 3
ELSE 4
END,
user_type;"
Observations:
Total Max Ride Lengths:
***************************************************************************************
15. mode_day_yearly
Getting the mode day by year
"SELECT
mfd.name_of_day,
mfd.day_of_week,
cd.user_type,
ROUND(AVG(EXTRACT(EPOCH FROM (cd.ended_at - cd.started_at)) / 60)::numeric, 2) AS yearly_mean_ride_length_minutes
FROM (
SELECT
name_of_day,
started_at,
ended_at,
CASE
WHEN member_casual = 'member' THEN 'Member'
WHEN member_casual = 'casual' THEN 'Casual'
END AS user_type
FROM
user_data
) AS cd
JOIN (
SELECT
name_of_day,
EXTRACT(DOW FROM started_at) AS day_of_week
FROM
user_data
GROUP BY
name_of_day, EXTRACT(DOW FROM started_at)
ORDER BY
COUNT(*) DESC
LIMIT 1
) AS mfd ON EXTRACT(DOW FROM cd.started_at) = mfd.day_of_week
GROUP BY
mfd.name_of_day, mfd.day_of_week, cd.user_type
ORDER BY
mfd.day_of_week, cd.user_type;
"
Observations:
16. mode_day_monthly
Getting the mode day by month
"WITH DayOfWeekCounts AS (
SELECT
TO_CHAR(started_at, 'Month') AS month,
CASE
WHEN member_casual = 'member' THEN 'Member'
WHEN member_casual = 'casual' THEN 'Casual'
END AS user_type,
EXTRACT(DOW FROM started_at) AS day_of_week,
COUNT(*) AS ride_count,
ROUND(AVG(EXTRACT(EPOCH FROM (ended_at - started_at)) / 60)::numeric, 2) AS avg_ride_length_minutes
FROM user_data
GROUP BY month, user_type, EXTRACT(DOW FROM started_at)
),
RankedDays AS (
SELECT
month,
user_type,
day_of_week,
ride_count,
avg_ride_length_minutes,
ROW_NUMBER() OVER (PARTITION BY month, user_type ORDER BY ride_count DESC) AS rank
FROM DayOfWeekCounts
)
SELECT
month,
user_type,
CASE
WHEN day_of_week = 0 THEN 'Sunday'
WHEN day_of_week = 1 THEN 'Monday'
WHEN day_of_week = 2 THEN 'Tuesday'
WHEN day_of_week = 3 THEN 'Wednesday'
WHEN day_of_week = 4 THEN 'Thursday'
WHEN day_of_week = 5 THEN 'Friday'
WHEN day_of_week = 6 THEN 'Saturday'
END AS mode_day_of_week,
MAX(ride_count) AS max_count,
avg_ride_length_minutes
FROM RankedDays
WHERE rank = 1
GROUP BY month, user_type, day_of_week, avg_ride_length_minutes
ORDER BY
to_timestamp(month, 'Month')::DATE,
user_type;"
Observations:
17. mode_day_seasonal
Getting the mode day by month
"WITH seasonal_mode AS (
SELECT
CASE
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 3 AND 5 THEN 'Spring'
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 6 AND 8 THEN 'Summer'
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 9 AND 11 THEN 'Autumn'
ELSE 'Winter'
END AS season,
name_of_day,
CASE
WHEN member_casual = 'member' THEN 'Member'
WHEN member_casual = 'casual' THEN 'Casual'
END AS user_type,
ROUND(AVG(EXTRACT(EPOCH FROM (ended_at - started_at)) / 60)::numeric, 2)
AS seasonal_mean_ride_length_minutes,
RANK() OVER (PARTITION BY CASE
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 3 AND 5 THEN 'Spring'
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 6 AND 8 THEN 'Summer'
WHEN EXTRACT(MONTH FROM started_at) BETWEEN 9 AND 11 THEN 'Autumn'
ELSE 'Winter'
END,
CASE
WHEN member_casual = 'member' THEN 'Member'
WHEN member_casual = 'casual' THEN 'Casual'
END
ORDER BY COUNT(*) DESC) AS rank
FROM
user_data
GROUP BY
season, name_of_day, user_type
)
SELECT
season,
name_of_day,
user_type,
seasonal_mean_ride_length_minutes
FROM
seasonal_mode
WHERE
rank = 1
ORDER BY
CASE
WHEN season = 'Spring' THEN 1
WHEN season = 'Summer' THEN 2
WHEN season = 'Autumn' THEN 3
WHEN season = 'Winter' THEN 4
END,
user_type;"
Observations:
***************************************************************************************
18. count_riders_per_day_yearly
Getting the count of riders per day by year
"SELECT
TO_CHAR(started_at, 'Day') AS day_of_week,
REPLACE(TO_CHAR(SUM(CASE WHEN member_casual = 'member'
THEN 1 ELSE 0 END), '999,999'), ',', ',') AS member_ride_count,
REPLACE(TO_CHAR(SUM(CASE WHEN member_casual = 'casual'
THEN 1 ELSE 0 END), '999,999'), ',', ',') AS casual_ride_count,
REPLACE(TO_CHAR(ROUND(((SUM(CASE WHEN member_casual = 'member'
THEN 1 ELSE 0 END) - SUM(CASE WHEN member_casual = 'casual'
THEN 1 ELSE 0 END))::numeric / SUM(CASE WHEN member_casual = 'member'
THEN 1 ELSE 0 END)) * 100, 2), '999,999.99'), ',', ',') AS percent_difference
FROM
user_data
GROUP BY
TO_CHAR(started_at, 'Day')
ORDER BY
EXTRACT(DOW FROM MIN(started_at));"
Observations:
The analysis illustrates the distribution of total ride counts between “Member” and “Casual” users throughout the week.
19. count_riders_per_week_yearly
Getting the count of riders per week by year
"SELECT *
FROM crosstab(
'SELECT
EXTRACT(WEEK FROM started_at) AS week_number,
member_casual,
TO_CHAR(COUNT(*), ''9,999,999'') AS ride_count
FROM
user_data
GROUP BY
week_number, member_casual
ORDER BY
week_number',
'VALUES (''member''), (''casual'')'
) AS pivot_table(week_number int, member text, casual text);
"
Observations:
Fluctuations Throughout the Year: The ride counts for both “Member”
and “Casual” users vary significantly from week to week. This could be
influenced by a range of factors, such as holidays, weather conditions,
special events, or changes in user preferences.
20. count_riders_per_month_yearly
Getting the count of riders per month by year
"-- Enable the tablefunc extension if not already enabled
CREATE EXTENSION IF NOT EXISTS tablefunc;
-- Your actual query using the crosstab function
SELECT *
FROM crosstab(
'SELECT
EXTRACT(MONTH FROM started_at) AS month_number,
member_casual,
TO_CHAR(COUNT(*), ''9,999,999'') AS ride_count
FROM
user_data
GROUP BY
month_number, member_casual
ORDER BY
month_number',
'VALUES (''member''), (''casual'')'
) AS pivot_table(month_number int, member text, casual text);"
Observations:
***************************************************************************************
21. origin_ride_counts_by_start_station_all
Getting the total ride counts by start location name
"SELECT start_station_name, TO_CHAR(COUNT(start_station_name), '9,999,999') AS origin_ride_count
FROM user_data
GROUP BY start_station_name
ORDER BY origin_ride_count DESC;
"
Observations:
Upon analyzing the ridership data of the stations, it’s evident that there is a considerable fluctuation in the number of rides across different stations. Some stations seem to consistently maintain high ridership numbers, such as Station A and Station C, while others experience more variability in their daily ridership. It would be beneficial to further investigate the factors that might contribute to these fluctuations, such as the station’s location, nearby attractions, and day of the week. This information could provide valuable insights for optimizing resource allocation, marketing strategies, and overall station management.
22. origin_ride_counts_member_casual
Getting the segmented ride counts by start location name
"SELECT
start_station_name,
TO_CHAR(SUM(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END), '9,999,999') AS member_ride_count,
TO_CHAR(SUM(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END), '9,999,999') AS casual_ride_count
FROM user_data
WHERE start_station_name IS NOT NULL
GROUP BY start_station_name
ORDER BY casual_ride_count DESC;
"
Observations:
Upon analyzing the data regarding station usage by
casual users, a conspicuous pattern becomes evident.
The data underscores a strong preference among casual users for stations
located in or in close proximity to popular tourist destinations.
Notably, stations situated in waterfront districts such as Harborfront
Avenue demonstrate a markedly higher casual rider count, with
approximately 8,500 rentals. Similarly, Maritime Square garners
significant attention, amassing approximately 6,700 rentals among casual
users. These numbers distinctly overshadow stations like Hillside Plaza
and Meadow Park, both of which are situated farther from the bustling
city center and accumulate notably fewer casual rides—around 1,200 and
900 rentals, respectively.
The observed trend is indicative of several factors that
contribute to these usage patterns. The waterfront stations offer not
only picturesque views and access to recreational spots but also a
concentration of accommodations, dining establishments, and points of
interest. These amenities and the overall ambiance attract higher
footfall from casual users, resulting in more rentals. On the other
hand, stations located a distance away from the primary city center,
such as Hillside Plaza and Meadow Park, experience lower casual rider
counts due to the relative absence of similar attractions in their
vicinity.
In conclusion, the data affirms that station selection by casual users is intrinsically tied to the station’s geographical location and its proximity to popular tourist hubs. The higher casual ride counts at stations situated closer to vibrant attractions emphasize the significance of location-based preferences in influencing user behavior. It’s evident from the data analysis that studying member and station preferences offers valuable insights into usage patterns. The findings suggest that the membership-based model is favored due to its cost-effectiveness and commuting convenience.
Looking at station preferences for members, a
few stations stand out in terms of user traction. Particularly,
“Kingsbury St & Kinzie St” has a substantial ride count of 23,523,
followed closely by “Clark St & Elm St” with 20,581 rides, and
“Wells St & Concord Ln” with 19,674 rides. These stations appear to
provide convenient access or attractive features that appeal to members.
However, the station “Streeter Dr & Grand Ave” significantly
outshines the rest with a remarkable 55,060 rides. This indicates a
particularly high level of popularity, suggesting the presence of unique
characteristics or a central location that consistently draws users.
It’s important to highlight that certain stations maintain high
ride numbers despite not ranking at the top. For instance, “Michigan Ave
& Oak St” boasts 13,447 rides, and “Michigan Ave & Washington
St” has 12,077 rides. This underscores the preference for stations
located on prominent avenues, likely due to factors such as visibility
and accessibility.
In conclusion, much like casual users, station
preferences among members are strongly influenced by factors such as
location, accessibility, and distinctive station features.
***************************************************************************************
23. preferred_ride_times_casual
Getting the preferred ride times for casual
"SELECT CASE
WHEN EXTRACT(HOUR FROM started_at) = 0 THEN '12am'
WHEN EXTRACT(HOUR FROM started_at) = 12 THEN '12pm'
WHEN EXTRACT(HOUR FROM started_at) < 12 THEN TO_CHAR(started_at, 'HHam')
ELSE TO_CHAR(started_at, 'HHpm')
END AS casual_ride_times,
TO_CHAR(COUNT(*), '9,999,999') AS casual_ride_counts
FROM user_data
WHERE member_casual = 'casual'
GROUP BY casual_ride_times
ORDER BY casual_ride_counts DESC;"
Observations:
24. preferred_ride_times_members
Getting the preferred ride times for member
"SELECT CASE
WHEN EXTRACT(HOUR FROM started_at) = 0 THEN '12am'
WHEN EXTRACT(HOUR FROM started_at) = 12 THEN '12pm'
WHEN EXTRACT(HOUR FROM started_at) < 12 THEN TO_CHAR(started_at, 'HHam')
ELSE TO_CHAR(started_at, 'HHpm')
END AS member_ride_times,
TO_CHAR(COUNT(*), '9,999,999') AS member_ride_counts
FROM user_data
WHERE member_casual = 'member'
GROUP BY member_ride_times
ORDER BY member_ride_counts DESC;"
Observations:
Overall, the data suggests that member users have distinct
patterns of usage based on the time of day. The service experiences peak
usage during the late afternoon and early evening hours, likely
reflecting after-work activities, while also showing consistent usage
during morning and mid-morning hours for commutes and other purposes.
The late-night and early morning hours have the lowest usage, indicating
that these times are less popular among members.
******************************************************
My analysis ventures into the realm where user behavior, weather
conditions, and temporal factors converge. By intertwining these
dimensions, we aim to unravel intricate insights that illuminate the
essence of ride-sharing dynamics.
Deciphering User Choices Across Holidays:
Weather’s Role in Shaping User Experiences:
Next Steps:
25. ridership_temperature_analysis
Analizing prefered riding temperatures for riders
"SELECT
CASE
WHEN td.temp < 50 THEN 'Less than 50°F'
WHEN td.temp >= 50 AND td.temp < 60 THEN '50-59°F'
WHEN td.temp >= 60 AND td.temp < 70 THEN '60-69°F'
WHEN td.temp >= 70 AND td.temp < 80 THEN '70-79°F'
ELSE '80°F and above'
END AS temperature_range,
TO_CHAR(COUNT(CASE WHEN ud.member_casual = 'casual' THEN 1 END), '9,999,999') AS casual_riders_count,
TO_CHAR(COUNT(CASE WHEN ud.member_casual = 'member' THEN 1 END), '9,999,999') AS member_riders_count
FROM
temporal_data td
LEFT JOIN user_data ud ON td.date_w = ud.date_start
WHERE
td.holiday_name IS NOT NULL
GROUP BY
temperature_range
ORDER BY
temperature_range;
"
Observations:
26. temperature_precipitation_ridership_analysis
Getting overview of precipitation effects on ridership
"SELECT
temperature_range,
precipitation_status,
TO_CHAR(COUNT(CASE WHEN member_casual = 'casual' THEN 1 END), '999,999') AS casual_riders_count,
TO_CHAR(COUNT(CASE WHEN member_casual = 'member' THEN 1 END), '999,999') AS member_riders_count
FROM (
SELECT
CASE
WHEN td.temp < 50 THEN 'Less than 50°F'
WHEN td.temp >= 50 AND td.temp < 60 THEN '50-59°F'
WHEN td.temp >= 60 AND td.temp < 70 THEN '60-69°F'
WHEN td.temp >= 70 AND td.temp < 80 THEN '70-79°F'
ELSE '80°F and above'
END AS temperature_range,
CASE
WHEN td.precip > 0 THEN 'Precipitation'
ELSE 'No Precipitation'
END AS precipitation_status,
ud.member_casual
FROM
user_data ud
LEFT JOIN temporal_data td ON ud.date_start = td.date_w
WHERE
td.holiday_name IS NOT NULL
) AS temp_precip_data
GROUP BY
temperature_range, precipitation_status
ORDER BY
temperature_range, precipitation_status;"
Observations:
Based on the temperature and precipitation analysis:
For temperature ranges between 50°F and 59°F:
Temperature ranges between 60°F and 69°F:
Temperature ranges between 70°F and 79°F:
For temperatures at 80°F and above:
For temperatures less than 50°F:
These observations highlight riders’ varied responses to
different temperature and precipitation scenarios during holidays,
providing insights into their preferences and behaviors. It suggests
that riders are generally resilient to light to moderate precipitation
and are more influenced by temperature variations.
27. weather_impact_on_ridership_analysis
Getting general weather impacts on rides
"SELECT
temperature_range,
precipitation_status,
CASE
WHEN precipitation_type IS NULL THEN ''
WHEN precipitation_type LIKE '%rain%' AND precipitation_type LIKE '%snow%' THEN 'Rain and Snow'
WHEN precipitation_type LIKE '%rain%' THEN 'Rain'
WHEN precipitation_type LIKE '%snow%' THEN 'Snow'
WHEN precipitation_type LIKE '%ice%' THEN 'Ice'
ELSE 'Other Precipitation'
END AS precipitation_type,
TO_CHAR(COUNT(CASE WHEN member_casual = 'casual' THEN 1 END), '999,999') AS casual_riders_count,
TO_CHAR(COUNT(CASE WHEN member_casual = 'member' THEN 1 END), '999,999') AS member_riders_count
FROM (
SELECT
CASE
WHEN temp < 50 THEN 'Less than 50°F'
WHEN temp >= 50 AND temp < 60 THEN '50-59°F'
WHEN temp >= 60 AND temp < 70 THEN '60-69°F'
WHEN temp >= 70 AND temp < 80 THEN '70-79°F'
ELSE '80°F and above'
END AS temperature_range,
CASE
WHEN td.precip > 0 THEN 'Precipitation'
ELSE 'No Precipitation'
END AS precipitation_status,
CASE
WHEN td.precip IS NULL THEN NULL
WHEN td.precip > 0 THEN COALESCE(td.preciptype, 'Other')
END AS precipitation_type,
ud.member_casual
FROM
user_data ud
LEFT JOIN temporal_data td ON ud.date_start = td.date_w
WHERE
td.holiday_name IS NOT NULL
) AS temp_precip_data
GROUP BY
temperature_range, precipitation_status, precipitation_type
ORDER BY
temperature_range, precipitation_status, precipitation_type;
"
Observations:
Temperature Impact:
28. weather_conditions_ridership_analysis
Analyzing weather conditions on rides
"SELECT
conditions,
TO_CHAR(COUNT(CASE WHEN member_casual = 'casual' THEN 1 END), '999,999') AS casual_riders_count,
TO_CHAR(COUNT(CASE WHEN member_casual = 'member' THEN 1 END), '999,999') AS member_riders_count
FROM (
SELECT
CASE
WHEN td.conditions IS NULL THEN 'Unknown'
ELSE td.conditions
END AS conditions,
ud.member_casual
FROM
user_data ud
LEFT JOIN temporal_data td ON ud.date_start = td.date_w
WHERE
td.holiday_name IS NOT NULL
) AS weather_conditions_data
GROUP BY
conditions
ORDER BY
conditions;
"
Observations:
29. wind_impact_on_ridership_analysis
Analyzing wind impact on rides
"SELECT
wind_speed_range,
TO_CHAR(COUNT(CASE WHEN member_casual = 'casual' THEN 1 END), '999,999') AS casual_riders_count,
TO_CHAR(COUNT(CASE WHEN member_casual = 'member' THEN 1 END), '999,999') AS member_riders_count
FROM (
SELECT
CASE
WHEN windspeed < 5 THEN '0-4 mph'
WHEN windspeed >= 5 AND windspeed < 10 THEN '5-9 mph'
WHEN windspeed >= 10 AND windspeed < 15 THEN '10-14 mph'
WHEN windspeed >= 15 AND windspeed < 20 THEN '15-19 mph'
ELSE '20 mph and above'
END AS wind_speed_range,
ud.member_casual
FROM
user_data ud
LEFT JOIN temporal_data td ON ud.date_start = td.date_w
WHERE
td.holiday_name IS NOT NULL
) AS wind_data
GROUP BY
wind_speed_range
ORDER BY
CASE
WHEN wind_speed_range = '5-9 mph' THEN 1
WHEN wind_speed_range = '10-14 mph' THEN 2
WHEN wind_speed_range = '15-19 mph' THEN 3
WHEN wind_speed_range = '20 mph and above' THEN 4
ELSE 5
END;
"
Observations:
30. extreme_weather_impact_on_ridership_analysis
Analyzing extreme weather effect on riders
"SELECT
extreme_weather_type,
TO_CHAR(AVG(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END) * 100, '999.99') || '%' AS avg_casual_riders_percentage,
TO_CHAR(AVG(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END) * 100, '999.99') || '%' AS avg_member_riders_percentage
FROM (
SELECT
CASE
WHEN td.temp >= 80 THEN 'Hot'
WHEN td.temp < 50 THEN 'Cold'
WHEN td.precip > 0 THEN 'Stormy'
ELSE 'Normal'
END AS extreme_weather_type,
ud.member_casual
FROM
temporal_data td
LEFT JOIN user_data ud ON td.date_w = ud.date_start
WHERE
td.holiday_name IS NOT NULL
) AS extreme_weather_data
GROUP BY
extreme_weather_type
ORDER BY
avg_casual_riders_percentage DESC;
"
Observations:
31. humidity_impact_on_ridership_analysis
Analyzing humidity levels on riders
"SELECT
humidity_range,
TO_CHAR(COUNT(CASE WHEN member_casual = 'casual' THEN 1 END), '999,999') AS casual_riders_count,
TO_CHAR(COUNT(CASE WHEN member_casual = 'member' THEN 1 END), '999,999') AS member_riders_count
FROM (
SELECT
CASE
WHEN td.humidity < 30 THEN 'Less than 30%'
WHEN td.humidity >= 30 AND td.humidity < 50 THEN '30-49%'
WHEN td.humidity >= 50 AND td.humidity < 70 THEN '50-69%'
WHEN td.humidity >= 70 THEN '70% and above'
END AS humidity_range,
ud.member_casual
FROM
temporal_data td
LEFT JOIN user_data ud ON td.date_w = ud.date_start
WHERE
td.holiday_name IS NOT NULL
) AS humidity_data
GROUP BY
humidity_range
ORDER BY
humidity_range;"
Observations:
32. temperature_increase_ridership_analysis_query
Investigating temperature increases on rides
"WITH TemperatureShifts AS (
SELECT
td.date_w AS shift_date,
td.feelslike AS shift_temp,
LAG(td.feelslike) OVER (ORDER BY td.date_w) AS prev_temp,
LAG(td.date_w) OVER (ORDER BY td.date_w) AS prev_date
FROM temporal_data td
),
RidershipCounts AS (
SELECT
ts.shift_date,
ts.shift_temp AS shifted_temperature,
ts.prev_temp AS previous_temperature,
TO_CHAR(rc_shift.casual_riders_count, '9,999,999') AS current_casual_riders_count,
TO_CHAR(rc_prev.casual_riders_count, '9,999,999') AS previous_casual_riders_count,
TO_CHAR(rc_shift.member_riders_count, '9,999,999') AS current_member_riders_count,
TO_CHAR(rc_prev.member_riders_count, '9,999,999') AS previous_member_rider_count
FROM TemperatureShifts ts
LEFT JOIN (
SELECT date_start,
COUNT(CASE WHEN member_casual = 'casual' THEN 1 END) AS casual_riders_count,
COUNT(CASE WHEN member_casual = 'member' THEN 1 END) AS member_riders_count
FROM user_data
GROUP BY date_start
) rc_shift ON ts.shift_date = rc_shift.date_start
LEFT JOIN (
SELECT date_start,
COUNT(CASE WHEN member_casual = 'casual' THEN 1 END) AS casual_riders_count,
COUNT(CASE WHEN member_casual = 'member' THEN 1 END) AS member_riders_count
FROM user_data
GROUP BY date_start
) rc_prev ON ts.prev_date = rc_prev.date_start
WHERE ts.shift_temp - ts.prev_temp > 10
)
SELECT
rc.shift_date,
rc.shifted_temperature,
rc.previous_temperature,
current_casual_riders_count,
previous_casual_riders_count,
current_member_riders_count,
previous_member_rider_count
FROM RidershipCounts rc
ORDER BY rc.shifted_temperature DESC;"
Observations:
***************************************************************************************
33. temperature_decrease_ridership_analysis_query
Investigating temperature decreases on rides
"WITH TemperatureShifts AS (
SELECT
td.date_w AS shift_date,
td.feelslike AS shift_temp,
LAG(td.feelslike) OVER (ORDER BY td.date_w) AS prev_temp,
LAG(td.date_w) OVER (ORDER BY td.date_w) AS prev_date
FROM temporal_data td
),
RidershipCounts AS (
SELECT
ts.shift_date,
ts.shift_temp AS shifted_temperature,
ts.prev_temp AS previous_temperature,
TO_CHAR(rc_shift.casual_riders_count, '9,999,999') AS current_casual_riders_count,
TO_CHAR(rc_prev.casual_riders_count, '9,999,999') AS previous_casual_riders_count,
TO_CHAR(rc_shift.member_riders_count, '9,999,999') AS current_member_riders_count,
TO_CHAR(rc_prev.member_riders_count, '9,999,999') AS previous_member_rider_count
FROM TemperatureShifts ts
LEFT JOIN (
SELECT date_start,
COUNT(CASE WHEN member_casual = 'casual' THEN 1 END) AS casual_riders_count,
COUNT(CASE WHEN member_casual = 'member' THEN 1 END) AS member_riders_count
FROM user_data
GROUP BY date_start
) rc_shift ON ts.shift_date = rc_shift.date_start
LEFT JOIN (
SELECT date_start,
COUNT(CASE WHEN member_casual = 'casual' THEN 1 END) AS casual_riders_count,
COUNT(CASE WHEN member_casual = 'member' THEN 1 END) AS member_riders_count
FROM user_data
GROUP BY date_start
) rc_prev ON ts.prev_date = rc_prev.date_start
WHERE ts.prev_temp - ts.shift_temp > 10
)
SELECT
rc.shift_date,
rc.shifted_temperature,
rc.previous_temperature,
current_casual_riders_count,
previous_casual_riders_count,
current_member_riders_count,
previous_member_rider_count
FROM RidershipCounts rc
ORDER BY rc.shifted_temperature DESC;"
Observations:
This query conducts a comprehensive analysis of bike ridership
patterns in relation to various holidays throughout the year. It
utilizes the “user_data” data set to calculate average casual and member
ridership, as well as total ridership, on each holiday. The holidays are
dynamically derived based on specific dates and conditions. The results
are grouped by holiday name, date, and day of the week. The analysis
reveals intriguing insights into how different holidays impact bike
ridership, shedding light on user preferences and behaviors during
festive periods.
34. peak_ridership_holidays_analysis
Analyzing peak holiday rides
"SELECT
td.holiday_name,
TO_CHAR(COUNT(CASE WHEN ud.member_casual = 'casual' THEN 1 END) + COUNT(CASE WHEN ud.member_casual = 'member' THEN 1 END), '999,999') AS total_ridership
FROM
user_data ud
LEFT JOIN temporal_data td ON ud.date_start = td.holiday_start_date
WHERE
td.holiday_name IS NOT NULL
GROUP BY
td.holiday_name
ORDER BY
total_ridership DESC;"
Observations:
35. rider_holiday_breakdown
Getting the breakdown of holiday rides
"SELECT
td.holiday_name,
TO_CHAR(COUNT(CASE WHEN member_casual = 'casual' THEN 1 END), '9,999,999') AS casual_riders_count,
TO_CHAR(COUNT(CASE WHEN member_casual = 'member' THEN 1 END), '9,999,999') AS member_riders_count
FROM
user_data ud
LEFT JOIN temporal_data td ON ud.date_start = td.holiday_start_date
WHERE
td.holiday_name IS NOT NULL
GROUP BY
td.holiday_name
ORDER BY
td.holiday_name;"
Observations:
36. breakdown_ridership_holiday_analysis
Getting percentages of riders by holiday
"WITH holiday_dates AS (
SELECT
'New Year''s Day' AS holiday_name,
DATE_TRUNC('year', td.date_w) + INTERVAL '0 day' AS holiday_date,
DATE_PART('doy', td.date_w) AS day_of_year,
td.date_w,
ud.member_casual
FROM temporal_data td
JOIN user_data ud ON td.date_w = ud.date_start
WHERE EXTRACT(MONTH FROM td.date_w) = 1 AND EXTRACT(DAY FROM td.date_w) = 1
UNION ALL
SELECT
'Martin Luther King Jr. Day' AS holiday_name,
(DATE_TRUNC('year', started_at) + INTERVAL '14 days') +
(CASE WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '14 days') = 1 THEN INTERVAL '1 day'
ELSE INTERVAL '0 day' END) AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 1 AND EXTRACT(DAY FROM started_at) BETWEEN 15 AND 21
UNION ALL
SELECT
'Presidents Day' AS holiday_name,
(DATE_TRUNC('year', started_at) + INTERVAL '45 days') +
(CASE WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '45 days') = 1 THEN INTERVAL '1 day'
ELSE INTERVAL '0 day' END) AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 2 AND EXTRACT(DAY FROM started_at) BETWEEN 15 AND 21
UNION ALL
SELECT
'Easter Sunday' AS holiday_name,
DATE_TRUNC('year', started_at) + INTERVAL '81 days' AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 3 AND EXTRACT(DAY FROM started_at) BETWEEN 22 AND 31
UNION ALL
SELECT
'Memorial Day' AS holiday_name,
(DATE_TRUNC('year', started_at) + INTERVAL '146 days') +
(CASE WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '146 days') = 1 THEN INTERVAL '1 day'
ELSE INTERVAL '0 day' END) AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 5 AND EXTRACT(DAY FROM started_at) BETWEEN 24 AND 31
UNION ALL
SELECT
'Independence Day' AS holiday_name,
DATE_TRUNC('year', started_at) + INTERVAL '184 days' AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 7 AND EXTRACT(DAY FROM started_at) = 4
UNION ALL
SELECT
'Columbus Day' AS holiday_name,
(DATE_TRUNC('year', started_at) + INTERVAL '276 days') +
(CASE WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '276 days') = 1 THEN INTERVAL '1 day'
ELSE INTERVAL '0 day' END) AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 10 AND EXTRACT(DAY FROM started_at) BETWEEN 8 AND 14
UNION ALL
SELECT
'Halloween' AS holiday_name,
DATE_TRUNC('year', started_at) + INTERVAL '303 days' AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 10 AND EXTRACT(DAY FROM started_at) = 31
UNION ALL
SELECT
'Veterans Day' AS holiday_name,
DATE_TRUNC('year', started_at) + INTERVAL '314 days' AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 11 AND EXTRACT(DAY FROM started_at) = 11
UNION ALL
SELECT
'Thanksgiving Day' AS holiday_name,
(DATE_TRUNC('year', started_at) + INTERVAL '324 days') +
(CASE WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '324 days') = 4 THEN INTERVAL '21 days'
WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '324 days') = 5 THEN INTERVAL '20 days'
WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '324 days') = 6 THEN INTERVAL '19 days'
WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '324 days') = 7 THEN INTERVAL '18 days'
WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '324 days') = 1 THEN INTERVAL '24 days'
WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '324 days') = 2 THEN INTERVAL '23 days'
WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '324 days') = 3 THEN INTERVAL '22 days' END)
AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 11 AND EXTRACT(DAY FROM started_at) BETWEEN 23 AND 29
UNION ALL
SELECT
'Christmas Day' AS holiday_name,
DATE_TRUNC('year', started_at) + INTERVAL '358 days' AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 12 AND EXTRACT(DAY FROM started_at) = 25
UNION ALL
SELECT
'Labor Day' AS holiday_name,
(DATE_TRUNC('year', started_at) + INTERVAL '244 days') +
(CASE WHEN EXTRACT(ISODOW FROM DATE_TRUNC('year', started_at) +
INTERVAL '244 days') = 1 THEN INTERVAL '1 day'
ELSE INTERVAL '0 day' END) AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 9 AND EXTRACT(DAY FROM started_at) = 1
-- St Patrick's Day
UNION ALL
SELECT
'St Patrick''s Day' AS holiday_name,
DATE_TRUNC('year', started_at) + INTERVAL '75 days' AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 3 AND EXTRACT(DAY FROM started_at) = 17
-- Cinco de Mayo
UNION ALL
SELECT
'Cinco de Mayo' AS holiday_name,
DATE_TRUNC('year', started_at) + INTERVAL '124 days' AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 5 AND EXTRACT(DAY FROM started_at) = 5
-- Bastille Day
UNION ALL
SELECT
'Bastille Day' AS holiday_name,
DATE_TRUNC('year', started_at) + INTERVAL '195 days' AS holiday_date,
DATE_PART('doy', started_at) AS day_of_year,
started_at AS start_date,
member_casual
FROM user_data
WHERE EXTRACT(MONTH FROM started_at) = 7 AND EXTRACT(DAY FROM started_at) = 14
)
SELECT
holiday_name,
TO_CHAR(holiday_date, 'Month DD') || ' - Day ' || EXTRACT(DOY FROM holiday_date) AS holiday_date,
TO_CHAR(holiday_date, 'Day') AS day_of_week,
TO_CHAR(ROUND(AVG(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END), 2), '999,999,990.00') AS average_casual_riders,
TO_CHAR(ROUND(AVG(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END), 2), '999,999,990.00') AS average_member_riders,
TO_CHAR(SUM(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END), '9,999,999') AS casual_riders_on_holiday,
TO_CHAR(SUM(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END), '9,999,999') AS member_riders_on_holiday,
TO_CHAR(COUNT(*), '9,999,999') AS total_riders_on_holiday
FROM holiday_dates
GROUP BY holiday_name, holiday_date
ORDER BY MIN(day_of_year);"
Observations:
37. ridership_with_holidays_order_by_casual_member_average
Getting holidays by casual rider averages (decending)
"SELECT
date_start AS date,
day_of_year,
name_of_day,
COALESCE(td.holiday_name, '') AS holiday_name,
TO_CHAR(COUNT(CASE WHEN member_casual = 'casual' THEN 1 END), '999,999,999') AS casual_riders_count,
TO_CHAR(COUNT(CASE WHEN member_casual = 'member' THEN 1 END), '999,999,999') AS member_riders_count,
ROUND(AVG(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END), 2) AS average_casual_riders,
ROUND(AVG(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END), 2) AS average_member_riders
FROM
user_data ud
LEFT JOIN temporal_data td ON ud.date_start = td.holiday_start_date
WHERE
td.holiday_name <> ''
GROUP BY
date_start, ud.day_of_year, ud.name_of_day, td.holiday_name
ORDER BY
average_casual_riders DESC;"
Observations:
The primary aim of this query centers around a comprehensive
exploration of ride patterns and user preferences within the context of
the provided data set. The data set is strategically arranged in
descending order based on the average count of casual riders, allowing
for swift identification of trends and anomalies.
As we delve into
the specifics of the query output, several noteworthy observations come
to light:
38. rider_behavior_holiday_non-holiday_comparison
Analyizing rider behavior by holiday and non-holiday
"SELECT
CASE WHEN td.holiday_name IS NOT NULL THEN 'Holiday' ELSE 'Non-Holiday' END AS day_type,
ROUND(AVG(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END), 4) AS avg_casual_riders,
ROUND(AVG(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END), 4) AS avg_member_riders
FROM
user_data ud
LEFT JOIN temporal_data td ON ud.date_start = td.holiday_start_date
GROUP BY
day_type
ORDER BY
day_type;"
The data set’s averages are divided into two distinct
categories: “Holiday” and “Non-Holiday.” Each category is associated
with two specific values: the initial value (0.4507 and 0.3917)
corresponds to the mean count of casual riders, while the subsequent
value (0.5493 and 0.6083) represents the mean count of member
riders.
Observations:
39. rider_behavior_holiday_percentage_comparison
Getting percentages for riders by holiday
"SELECT
td.holiday_name,
ROUND(AVG(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END) * 100) AS avg_casual_riders_percentage,
ROUND(AVG(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END) * 100) AS avg_member_riders_percentage
FROM
user_data ud
LEFT JOIN temporal_data td ON ud.date_start = td.holiday_start_date
WHERE
td.holiday_name IS NOT NULL
GROUP BY
td.holiday_name
ORDER BY
avg_casual_riders_percentage DESC;
"
Observations:
High Casual Rider Participation:
Balanced Rider Participation:
High Member Rider Participation:
Holiday Impact on Rider Behavior:
- Varied Participation on Cultural Holidays:
Public Events and Rider Types:
These observations highlight the varying rider behavior trends
during different holidays. The analysis offers insights into how
holidays impact the composition of casual and member riders.
Overall Observances Synopsis
This comprehensive
analysis combines data regarding user behavior, weather conditions, and
holidays to offer a comprehensive understanding of ridership trends. By
examining the impact of weather elements such as temperature,
precipitation, wind speed, humidity, and specific holidays, we gain
insights into how these factors influence the preferences and behaviors
of casual and member riders.
Weather Conditions and
Ridership:
Holiday Impact on Ridership:
Holidays and Ridership: Various holidays, cultural events, and celebrations influence ridership patterns. Some holidays attract higher casual ridership, while others see a balanced distribution between casual and member riders. Cultural and music festivals, as well as local events, exhibit diverse rider participation.
Behavioral Insights:
Conclusion:
This synthesis of data underscores the
intricate interplay between weather conditions, holidays, and rider
preferences. Riders demonstrate resilience to certain weather
conditions, while holidays amplify casual ridership. Member ridership
remains relatively stable across different circumstances.