Cyclistic is a bike sharing company which gathers data of their users and wants to understand the main differences between the ācasualā and āmemberā users.
The main question to be answered is āHow do annual members and casual riders use Cyclistic bikes differently?ā.
The following document is a description of the steps and processes undergone throughout the analysis to arrive to the answer of the previous question.
The only data source for available for this case are the Q1 data of the ride trips of the app in the years 2019 and 2020. Thus, the answers are limited to what can be found on this data.
First step is to install and activate the packages that will be used.
library(tidyverse)
## āā Attaching core tidyverse packages āāāāāāāāāāāāāāāāāāāāāāāā tidyverse 2.0.0 āā
## ā dplyr 1.1.4 ā readr 2.1.5
## ā forcats 1.0.0 ā stringr 1.5.1
## ā ggplot2 3.5.1 ā tibble 3.2.1
## ā lubridate 1.9.4 ā tidyr 1.3.1
## ā purrr 1.0.2
## āā Conflicts āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā tidyverse_conflicts() āā
## ā dplyr::filter() masks stats::filter()
## ā dplyr::lag() masks stats::lag()
## ā¹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Secondly, the data should be imported and explored.
# Importing Data
divvy_trips_2019 <- read.csv("Divvy_Trips_2019_Q1.csv")
divvy_trips_2020 <- read.csv("Divvy_Trips_2020_Q1.csv")
# Exploratory Analysis (EDA) - Q1 2019
head(divvy_trips_2019)
## trip_id start_time end_time bikeid tripduration
## 1 21742443 2019-01-01 00:04:37 2019-01-01 00:11:07 2167 390.0
## 2 21742444 2019-01-01 00:08:13 2019-01-01 00:15:34 4386 441.0
## 3 21742445 2019-01-01 00:13:23 2019-01-01 00:27:12 1524 829.0
## 4 21742446 2019-01-01 00:13:45 2019-01-01 00:43:28 252 1,783.0
## 5 21742447 2019-01-01 00:14:52 2019-01-01 00:20:56 1170 364.0
## 6 21742448 2019-01-01 00:15:33 2019-01-01 00:19:09 2437 216.0
## from_station_id from_station_name to_station_id
## 1 199 Wabash Ave & Grand Ave 84
## 2 44 State St & Randolph St 624
## 3 15 Racine Ave & 18th St 644
## 4 123 California Ave & Milwaukee Ave 176
## 5 173 Mies van der Rohe Way & Chicago Ave 35
## 6 98 LaSalle St & Washington St 49
## to_station_name usertype gender birthyear
## 1 Milwaukee Ave & Grand Ave Subscriber Male 1989
## 2 Dearborn St & Van Buren St (*) Subscriber Female 1990
## 3 Western Ave & Fillmore St (*) Subscriber Female 1994
## 4 Clark St & Elm St Subscriber Male 1993
## 5 Streeter Dr & Grand Ave Subscriber Male 1994
## 6 Dearborn St & Monroe St Subscriber Female 1983
colnames(divvy_trips_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
glimpse(divvy_trips_2019)
## Rows: 365,069
## Columns: 12
## $ trip_id <int> 21742443, 21742444, 21742445, 21742446, 21742447, 21ā¦
## $ start_time <chr> "2019-01-01 00:04:37", "2019-01-01 00:08:13", "2019-ā¦
## $ end_time <chr> "2019-01-01 00:11:07", "2019-01-01 00:15:34", "2019-ā¦
## $ bikeid <int> 2167, 4386, 1524, 252, 1170, 2437, 2708, 2796, 6205,ā¦
## $ tripduration <chr> "390.0", "441.0", "829.0", "1,783.0", "364.0", "216.ā¦
## $ from_station_id <int> 199, 44, 15, 123, 173, 98, 98, 211, 150, 268, 299, 2ā¦
## $ from_station_name <chr> "Wabash Ave & Grand Ave", "State St & Randolph St", ā¦
## $ to_station_id <int> 84, 624, 644, 176, 35, 49, 49, 142, 148, 141, 295, 4ā¦
## $ to_station_name <chr> "Milwaukee Ave & Grand Ave", "Dearborn St & Van Bureā¦
## $ usertype <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribeā¦
## $ gender <chr> "Male", "Female", "Female", "Male", "Male", "Female"ā¦
## $ birthyear <int> 1989, 1990, 1994, 1993, 1994, 1983, 1984, 1990, 1995ā¦
# Exploratory Analysis (EDA) - Q1 2020
head(divvy_trips_2020)
## ride_id rideable_type started_at ended_at
## 1 EACB19130B0CDA4A docked_bike 2020-01-21 20:06:59 2020-01-21 20:14:30
## 2 8FED874C809DC021 docked_bike 2020-01-30 14:22:39 2020-01-30 14:26:22
## 3 789F3C21E472CA96 docked_bike 2020-01-09 19:29:26 2020-01-09 19:32:17
## 4 C9A388DAC6ABF313 docked_bike 2020-01-06 16:17:07 2020-01-06 16:25:56
## 5 943BC3CBECCFD662 docked_bike 2020-01-30 08:37:16 2020-01-30 08:42:48
## 6 6D9C8A6938165C11 docked_bike 2020-01-10 12:33:05 2020-01-10 12:37:54
## start_station_name start_station_id end_station_name
## 1 Western Ave & Leland Ave 239 Clark St & Leland Ave
## 2 Clark St & Montrose Ave 234 Southport Ave & Irving Park Rd
## 3 Broadway & Belmont Ave 296 Wilton Ave & Belmont Ave
## 4 Clark St & Randolph St 51 Fairbanks Ct & Grand Ave
## 5 Clinton St & Lake St 66 Wells St & Hubbard St
## 6 Wells St & Hubbard St 212 Desplaines St & Randolph St
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 326 41.9665 -87.6884 41.9671 -87.6674 member
## 2 318 41.9616 -87.6660 41.9542 -87.6644 member
## 3 117 41.9401 -87.6455 41.9402 -87.6530 member
## 4 24 41.8846 -87.6319 41.8918 -87.6206 member
## 5 212 41.8856 -87.6418 41.8899 -87.6343 member
## 6 96 41.8899 -87.6343 41.8846 -87.6446 member
colnames(divvy_trips_2020)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
glimpse(divvy_trips_2020)
## Rows: 426,887
## Columns: 13
## $ ride_id <chr> "EACB19130B0CDA4A", "8FED874C809DC021", "789F3C21E4ā¦
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "dockeā¦
## $ started_at <chr> "2020-01-21 20:06:59", "2020-01-30 14:22:39", "2020ā¦
## $ ended_at <chr> "2020-01-21 20:14:30", "2020-01-30 14:26:22", "2020ā¦
## $ start_station_name <chr> "Western Ave & Leland Ave", "Clark St & Montrose Avā¦
## $ start_station_id <int> 239, 234, 296, 51, 66, 212, 96, 96, 212, 38, 117, 1ā¦
## $ end_station_name <chr> "Clark St & Leland Ave", "Southport Ave & Irving Paā¦
## $ end_station_id <int> 326, 318, 117, 24, 212, 96, 212, 212, 96, 100, 632,ā¦
## $ start_lat <dbl> 41.9665, 41.9616, 41.9401, 41.8846, 41.8856, 41.889ā¦
## $ start_lng <dbl> -87.6884, -87.6660, -87.6455, -87.6319, -87.6418, -ā¦
## $ end_lat <dbl> 41.9671, 41.9542, 41.9402, 41.8918, 41.8899, 41.884ā¦
## $ end_lng <dbl> -87.6674, -87.6644, -87.6530, -87.6206, -87.6343, -ā¦
## $ member_casual <chr> "member", "member", "member", "member", "member", "ā¦
After a quick examination there are a few things that can be noticed:
The following code tackles each of the previously mentioned problems.
## Handles Extra column problem
# Deletes the extra columns
divvy_trips_2019_df <- subset(divvy_trips_2019, select = -c(bikeid,
tripduration,
gender,
birthyear))
divvy_trips_2020_df <- subset(divvy_trips_2020, select = -c(rideable_type,
start_lat,
start_lng,
end_lat,
end_lng))
glimpse(divvy_trips_2019_df)
## Rows: 365,069
## Columns: 8
## $ trip_id <int> 21742443, 21742444, 21742445, 21742446, 21742447, 21ā¦
## $ start_time <chr> "2019-01-01 00:04:37", "2019-01-01 00:08:13", "2019-ā¦
## $ end_time <chr> "2019-01-01 00:11:07", "2019-01-01 00:15:34", "2019-ā¦
## $ from_station_id <int> 199, 44, 15, 123, 173, 98, 98, 211, 150, 268, 299, 2ā¦
## $ from_station_name <chr> "Wabash Ave & Grand Ave", "State St & Randolph St", ā¦
## $ to_station_id <int> 84, 624, 644, 176, 35, 49, 49, 142, 148, 141, 295, 4ā¦
## $ to_station_name <chr> "Milwaukee Ave & Grand Ave", "Dearborn St & Van Bureā¦
## $ usertype <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribeā¦
glimpse(divvy_trips_2020_df)
## Rows: 426,887
## Columns: 8
## $ ride_id <chr> "EACB19130B0CDA4A", "8FED874C809DC021", "789F3C21E4ā¦
## $ started_at <chr> "2020-01-21 20:06:59", "2020-01-30 14:22:39", "2020ā¦
## $ ended_at <chr> "2020-01-21 20:14:30", "2020-01-30 14:26:22", "2020ā¦
## $ start_station_name <chr> "Western Ave & Leland Ave", "Clark St & Montrose Avā¦
## $ start_station_id <int> 239, 234, 296, 51, 66, 212, 96, 96, 212, 38, 117, 1ā¦
## $ end_station_name <chr> "Clark St & Leland Ave", "Southport Ave & Irving Paā¦
## $ end_station_id <int> 326, 318, 117, 24, 212, 96, 212, 212, 96, 100, 632,ā¦
## $ member_casual <chr> "member", "member", "member", "member", "member", "ā¦
## Corrects the naming in the older version of the file.
colnames(divvy_trips_2019_df)
## [1] "trip_id" "start_time" "end_time"
## [4] "from_station_id" "from_station_name" "to_station_id"
## [7] "to_station_name" "usertype"
colnames(divvy_trips_2020_df)
## [1] "ride_id" "started_at" "ended_at"
## [4] "start_station_name" "start_station_id" "end_station_name"
## [7] "end_station_id" "member_casual"
colnames(divvy_trips_2019_df) <- c("ride_id",
"started_at",
"ended_at",
"start_station_id",
"start_station_name",
"end_station_id",
"end_station_name",
"member_casual")
colnames(divvy_trips_2019_df)
## [1] "ride_id" "started_at" "ended_at"
## [4] "start_station_id" "start_station_name" "end_station_id"
## [7] "end_station_name" "member_casual"
colnames(divvy_trips_2020_df)
## [1] "ride_id" "started_at" "ended_at"
## [4] "start_station_name" "start_station_id" "end_station_name"
## [7] "end_station_id" "member_casual"
## Corrects the datatypes in both tables
glimpse(divvy_trips_2019_df)
## Rows: 365,069
## Columns: 8
## $ ride_id <int> 21742443, 21742444, 21742445, 21742446, 21742447, 2ā¦
## $ started_at <chr> "2019-01-01 00:04:37", "2019-01-01 00:08:13", "2019ā¦
## $ ended_at <chr> "2019-01-01 00:11:07", "2019-01-01 00:15:34", "2019ā¦
## $ start_station_id <int> 199, 44, 15, 123, 173, 98, 98, 211, 150, 268, 299, ā¦
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "State St & Randolph St",ā¦
## $ end_station_id <int> 84, 624, 644, 176, 35, 49, 49, 142, 148, 141, 295, ā¦
## $ end_station_name <chr> "Milwaukee Ave & Grand Ave", "Dearborn St & Van Burā¦
## $ member_casual <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribā¦
glimpse(divvy_trips_2020_df)
## Rows: 426,887
## Columns: 8
## $ ride_id <chr> "EACB19130B0CDA4A", "8FED874C809DC021", "789F3C21E4ā¦
## $ started_at <chr> "2020-01-21 20:06:59", "2020-01-30 14:22:39", "2020ā¦
## $ ended_at <chr> "2020-01-21 20:14:30", "2020-01-30 14:26:22", "2020ā¦
## $ start_station_name <chr> "Western Ave & Leland Ave", "Clark St & Montrose Avā¦
## $ start_station_id <int> 239, 234, 296, 51, 66, 212, 96, 96, 212, 38, 117, 1ā¦
## $ end_station_name <chr> "Clark St & Leland Ave", "Southport Ave & Irving Paā¦
## $ end_station_id <int> 326, 318, 117, 24, 212, 96, 212, 212, 96, 100, 632,ā¦
## $ member_casual <chr> "member", "member", "member", "member", "member", "ā¦
divvy_trips_2019_df <- transform(divvy_trips_2019_df,
ride_id = as.character(ride_id),
started_at = as.Date(started_at),
ended_at = as.Date(ended_at),
start_station_name = as.character(start_station_name),
end_station_name = as.character(end_station_name)
)
divvy_trips_2020_df <- transform(divvy_trips_2020_df,
started_at = as.Date(started_at),
ended_at = as.Date(ended_at),
start_station_id = as.character(start_station_id),
end_station_id = as.character(end_station_id)
)
glimpse(divvy_trips_2019_df)
## Rows: 365,069
## Columns: 8
## $ ride_id <chr> "21742443", "21742444", "21742445", "21742446", "21ā¦
## $ started_at <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 20ā¦
## $ ended_at <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 20ā¦
## $ start_station_id <int> 199, 44, 15, 123, 173, 98, 98, 211, 150, 268, 299, ā¦
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "State St & Randolph St",ā¦
## $ end_station_id <int> 84, 624, 644, 176, 35, 49, 49, 142, 148, 141, 295, ā¦
## $ end_station_name <chr> "Milwaukee Ave & Grand Ave", "Dearborn St & Van Burā¦
## $ member_casual <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribā¦
glimpse(divvy_trips_2020_df)
## Rows: 426,887
## Columns: 8
## $ ride_id <chr> "EACB19130B0CDA4A", "8FED874C809DC021", "789F3C21E4ā¦
## $ started_at <date> 2020-01-21, 2020-01-30, 2020-01-09, 2020-01-06, 20ā¦
## $ ended_at <date> 2020-01-21, 2020-01-30, 2020-01-09, 2020-01-06, 20ā¦
## $ start_station_name <chr> "Western Ave & Leland Ave", "Clark St & Montrose Avā¦
## $ start_station_id <chr> "239", "234", "296", "51", "66", "212", "96", "96",ā¦
## $ end_station_name <chr> "Clark St & Leland Ave", "Southport Ave & Irving Paā¦
## $ end_station_id <chr> "326", "318", "117", "24", "212", "96", "212", "212ā¦
## $ member_casual <chr> "member", "member", "member", "member", "member", "ā¦
Once both data frames are ready, having the same number of columns and datatypes I merge them to create a single data frame.
# Having solved all previous problems I merge both data frames into one variable.
divvy_trips_df <- rbind(divvy_trips_2019_df, divvy_trips_2020_df)
Having just one data frame simplifies the analysis, but still there are bound to be mistakes within.
First, a new inspection is needed as there is a new table to explore.
# Inspect the new table that has been created
colnames(divvy_trips_df) # List of column names
## [1] "ride_id" "started_at" "ended_at"
## [4] "start_station_id" "start_station_name" "end_station_id"
## [7] "end_station_name" "member_casual"
nrow(divvy_trips_df) # How many rows are in data frame?
## [1] 791956
dim(divvy_trips_df) # Dimensions of the data frame?
## [1] 791956 8
head(divvy_trips_df) # See the first 6 rows of data frame.
## ride_id started_at ended_at start_station_id
## 1 21742443 2019-01-01 2019-01-01 199
## 2 21742444 2019-01-01 2019-01-01 44
## 3 21742445 2019-01-01 2019-01-01 15
## 4 21742446 2019-01-01 2019-01-01 123
## 5 21742447 2019-01-01 2019-01-01 173
## 6 21742448 2019-01-01 2019-01-01 98
## start_station_name end_station_id
## 1 Wabash Ave & Grand Ave 84
## 2 State St & Randolph St 624
## 3 Racine Ave & 18th St 644
## 4 California Ave & Milwaukee Ave 176
## 5 Mies van der Rohe Way & Chicago Ave 35
## 6 LaSalle St & Washington St 49
## end_station_name member_casual
## 1 Milwaukee Ave & Grand Ave Subscriber
## 2 Dearborn St & Van Buren St (*) Subscriber
## 3 Western Ave & Fillmore St (*) Subscriber
## 4 Clark St & Elm St Subscriber
## 5 Streeter Dr & Grand Ave Subscriber
## 6 Dearborn St & Monroe St Subscriber
str(divvy_trips_df) # See list of columns and data types (numeric, character, etc)
## 'data.frame': 791956 obs. of 8 variables:
## $ ride_id : chr "21742443" "21742444" "21742445" "21742446" ...
## $ started_at : Date, format: "2019-01-01" "2019-01-01" ...
## $ ended_at : Date, format: "2019-01-01" "2019-01-01" ...
## $ start_station_id : chr "199" "44" "15" "123" ...
## $ start_station_name: chr "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
## $ end_station_id : chr "84" "624" "644" "176" ...
## $ end_station_name : chr "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
## $ member_casual : chr "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
summary(divvy_trips_df) # Statistical summary of data.
## ride_id started_at ended_at
## Length:791956 Min. :2019-01-01 Min. :2019-01-01
## Class :character 1st Qu.:2019-02-28 1st Qu.:2019-02-28
## Mode :character Median :2020-01-07 Median :2020-01-07
## Mean :2019-08-31 Mean :2019-08-31
## 3rd Qu.:2020-02-19 3rd Qu.:2020-02-19
## Max. :2020-03-31 Max. :2020-05-19
## start_station_id start_station_name end_station_id end_station_name
## Length:791956 Length:791956 Length:791956 Length:791956
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## member_casual
## Length:791956
## Class :character
## Mode :character
##
##
##
On further inspection, it is possible to detect that there are 4 distinct ways to name the type of user that the company has.
# member_casual should be composed of two values. member or casual.
divvy_trips_df %>%
distinct(member_casual)
## member_casual
## 1 Subscriber
## 2 Customer
## 3 member
## 4 casual
To fix this, the āSubscriberā and āCustomerā classification is changed for the more updated version of the Q1 2020 data frame.
divvy_trips_df$member_casual <- replace(divvy_trips_df$member_casual,
divvy_trips_df$member_casual == 'Subscriber', 'member')
divvy_trips_df$member_casual <- replace(divvy_trips_df$member_casual,
divvy_trips_df$member_casual == 'Customer', 'casual')
In order to make further analysis (such as representing the ride length throughout the month), there needs to be a reformat of the date.
divvy_trips_df$date <- as.Date(divvy_trips_df$started_at) # The default format is yyyy-mm-dd
divvy_trips_df$month <- format(as.Date(divvy_trips_df$date), "%m")
divvy_trips_df$day <- format(as.Date(divvy_trips_df$date), "%d")
divvy_trips_df$year <- format(as.Date(divvy_trips_df$date), "%Y")
divvy_trips_df$day_of_week <- format(as.Date(divvy_trips_df$date), "%A")
In the initial data frame of 2019 there was a ātripdurationā column, showing the time that the trip took. However, the 2020 data frame didnāt have it, and at the time of merging it had to be deleted.
But having the moment at which the user started its ride, and when the user ended his ride, it can be determine how long was his ride.
divvy_trips_df$ride_length <- difftime(divvy_trips_df$ended_at,divvy_trips_df$started_at)
typeof(divvy_trips_df$ride_length) ## Shows data-type as "Double" when it should be NUmeric.
## [1] "double"
divvy_trips_df <- divvy_trips_df %>%
mutate(ride_length = as.numeric(ride_length))
str(divvy_trips_df) # Checks if datatype has changed
## 'data.frame': 791956 obs. of 14 variables:
## $ ride_id : chr "21742443" "21742444" "21742445" "21742446" ...
## $ started_at : Date, format: "2019-01-01" "2019-01-01" ...
## $ ended_at : Date, format: "2019-01-01" "2019-01-01" ...
## $ start_station_id : chr "199" "44" "15" "123" ...
## $ start_station_name: chr "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
## $ end_station_id : chr "84" "624" "644" "176" ...
## $ end_station_name : chr "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
## $ member_casual : chr "member" "member" "member" "member" ...
## $ date : Date, format: "2019-01-01" "2019-01-01" ...
## $ month : chr "01" "01" "01" "01" ...
## $ day : chr "01" "01" "01" "01" ...
## $ year : chr "2019" "2019" "2019" "2019" ...
## $ day_of_week : chr "martes" "martes" "martes" "martes" ...
## $ ride_length : num 0 0 0 0 0 0 0 0 0 0 ...
## Checks if there are negative rides
any(divvy_trips_df$ride_length < 0) # Returns false, all ride lengths make sense
## [1] FALSE
Having cleaned the data set, I can export it as a .csv file and analyze it in a dataviz tool such as Tableau.
write.csv(divvy_trips_df, file = 'divvy_trips_2019-2020.csv')
To end this analysis I will show an interactive dashboard representing the cleaned data done with the Tableau visualization tool.
We started off this analysis questioning the main differences between both kind of users, and through the analysis done we can conclude that:
Trip duration by day of the week:
Casual users have a higher average trip duration, especially on weekends (Saturday and Sunday).
Members have a more consistent average duration throughout the week, with lower values compared to casual users.
Trip duration by month:
Casual users show significantly higher durations in February compared to the other months (January and March).
Members have a more uniform trip duration across the three analyzed months.
Daily evolution of trip duration:
Casual users show higher duration peaks on specific days, such as February 9th and 24th, exceeding 3,000 minutes on average.
Members have less variability, with lower peaks and more stable distributions throughout the month.