Increasing the profits by maximizing the number of annual memberships by converting casual riders (single riders and full-day riders) into annual members (annual memberships)
The business task is to show how do the casual riders differ from annual members using past user data and report the findings
The stakeholders in this project are: * Lily Moreno: The director of marketing and the manager, she is responsible for the development of campaigns and initiatives to promote the bike-share program
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
The data that we are using is located in this link (https://divvy-tripdata.s3.amazonaws.com/index.html). It comes from Motivate International Inc, a bike share company in Chicago
The data is organized in zip folders in csv files, some files (including divyy-tripdata) contain trips information from 04-2020 till 04-2022. Others (called Divvy-Trips-…Qx) present data from 2015 till the first quarter (first 3 months) of 2020.
And other folders (called Divvy-Stations-trips …. Qx) contain stations and trips info in csv files for the years of 2013 and 2014. But our main focus is going to be on the trips of 2021 which consist of 13 columns concerning the trip and the user such as; Ride id, bike type, starting and ending time starting and ending station id, name and location, geographic coordinates and membership type PS: if we want to add other axis of analysis, we can work on like members gender and age the folders called Divvy-Stations-trips …. Qx. It contains this kind of information
The data is credible and original as it is collected directly by the company that runs the Cyclistic Bike Share program which is called Motivate Inc, not by second- or third-party info. It is also current and relevant as it is been updated monthly from 2015 till now (2022). We can also say that the data is accurate complete and unbiased as it reflects all real data gathered by the company not just a sample.
The data has been made available by Motivate International Inc. under this license https://ride.divvybikes.com/data-license-agreement The data-privacy issues prohibit us from using riders’ personally identifiable information that’s why we can’t determine if casual riders live in the Cyclitic service area or if they have purchased multiple single passes
The business goal that was already set is to determine the difference between casual riders and annuals members and it seems we have enough information about the 2 different types users in order to discover their differences but not having an idea about the identity of the users which can limit our analysis in order to know if the casual riders are loyal clients so we can build a more specific business strategy upon that. But in general, we can answer the business question about the difference between the 2 types of clients.
The problems that we faced with the data; Having dirty data that needs cleaning (duplicates, missing fields … etc.) The csv files are huge which required me to go by segment and clean the data from the variables I am not going to use.
I choose RStudio Desktop in order to prepare, process, clean, analyze and visualize the data because the set is too large and will be hard to get done in Excel, google spreadsheets or RStudio Cloud.
In order to get an overview, the data was reviewed in terms of understanding of the consent of variables, data formats and data integrity.
Data review involved the following:
• Checking column names across all the 12 original files.
• Checking for missing values.
• Checking of white spaces.
• Checking of duplicate records.
• Other data anomalies.
However, the review of the data revealed several problems: • Duplicate record of ID numbers.
• Records with missing start or end station name.
• Records with very short or very long ride duration.
• Records for trips starting or ending at an administrative station (repair or testing station).
All 12 files were combined into one data set after initial review was completed. The final data set consisted of 5479096 rows with 13 columns of character and numeric data. This matched the number of records in all 12 monthly data files.
#setting working directory
setwd("G:/portfolio projects/my work/cyclistic bike/CaseStudy/")
Import the data files into new vectors
january_2021 <- read.csv("data/202101-divvy-tripdata.csv")
february_2021 <- read.csv("data/202102-divvy-tripdata.csv")
march_2021 <- read.csv("data/202103-divvy-tripdata.csv")
april_2021 <- read.csv("data/202104-divvy-tripdata.csv")
may_2021 <- read.csv("data/202105-divvy-tripdata.csv")
june_2021 <- read.csv("data/202106-divvy-tripdata.csv")
july_2021 <- read.csv("data/202107-divvy-tripdata.csv")
august_2021 <- read.csv("data/202108-divvy-tripdata.csv")
september_2021 <- read.csv("data/202109-divvy-tripdata.csv")
october_2021 <- read.csv("data/202110-divvy-tripdata.csv")
november_2021 <- read.csv("data/202111-divvy-tripdata.csv")
december_2021 <- read.csv("data/202112-divvy-tripdata.csv")
Check column names to ensure we can join all the data. Compare column names for each of the files. While the names don’t have to be in the same order but they do need to match perfectly before we can use a command to join them into one file
Data validation
Calculate the total number of records in all twelve monthly files. It is 5479096 rows
sum( nrow(january_2021) + nrow(february_2021)
+ nrow(march_2021) + nrow(april_2021) + nrow(may_2021)
+ nrow(june_2021) + nrow(july_2021) + nrow(august_2021)
+ nrow(september_2021) + nrow(october_2021) + nrow(november_2021)+nrow(december_2021))
## [1] 5595063
Merging all individual files in df_all using “rbind()”
df_all <- rbind(january_2021, february_2021, march_2021, april_2021, may_2021, june_2021, july_2021, august_2021, september_2021, october_2021, november_2021, december_2021)
head(df_all)
## ride_id rideable_type started_at ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St 17660
## 2 California Ave & Cortez St 17660
## 3 California Ave & Cortez St 17660
## 4 California Ave & Cortez St 17660
## 5 California Ave & Cortez St 17660
## 6 California Ave & Cortez St 17660
## start_lat start_lng end_lat end_lng member_casual
## 1 41.90034 -87.69674 41.89 -87.72 member
## 2 41.90033 -87.69671 41.90 -87.69 member
## 3 41.90031 -87.69664 41.90 -87.70 member
## 4 41.90040 -87.69666 41.92 -87.69 member
## 5 41.90033 -87.69670 41.90 -87.70 casual
## 6 41.90041 -87.69676 41.94 -87.71 casual
calculating the number of rows in the merged file just to make sure
nrow(df_all)
## [1] 5595063
*we can notice that the data was merged correctly noticing the number of rows
Saving the merged dataframe to local storage for backup using “write_csv()”
write_csv(df_all, "merged_trip.csv")
Inspecting the new combined dataframe
str(df_all) #To see the structure
## 'data.frame': 5595063 obs. of 13 variables:
## $ ride_id : chr "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
## $ ended_at : chr "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
## $ start_station_name: chr "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
summary(df_all) #to produce a summary of all records in the found set
## ride_id rideable_type started_at ended_at
## Length:5595063 Length:5595063 Length:5595063 Length:5595063
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:5595063 Length:5595063 Length:5595063 Length:5595063
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.84 Min. :41.39 Min. :-88.97
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52 Max. :42.17 Max. :-87.49
## NA's :4771 NA's :4771
## member_casual
## Length:5595063
## Class :character
## Mode :character
##
##
##
##
df_all <- distinct(df_all)
nrow(df_all)
## [1] 5595063
####Transforming data Create additional columns for Date, Month, Day, Year, day of the Week from the started_at column.This allows for more granular analysis of the data by date/day/month
df_all$date <- as.Date(df_all$started_at)
df_all$week_day <- format(as.Date(df_all$date), "%A")
df_all$month <- format(as.Date(df_all$date), "%b")
df_all$year<-format(df_all$date,"%Y")
head(df_all)
## ride_id rideable_type started_at ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St 17660
## 2 California Ave & Cortez St 17660
## 3 California Ave & Cortez St 17660
## 4 California Ave & Cortez St 17660
## 5 California Ave & Cortez St 17660
## 6 California Ave & Cortez St 17660
## start_lat start_lng end_lat end_lng member_casual date week_day month
## 1 41.90034 -87.69674 41.89 -87.72 member 2021-01-23 samedi janv.
## 2 41.90033 -87.69671 41.90 -87.69 member 2021-01-27 mercredi janv.
## 3 41.90031 -87.69664 41.90 -87.70 member 2021-01-21 jeudi janv.
## 4 41.90040 -87.69666 41.92 -87.69 member 2021-01-07 jeudi janv.
## 5 41.90033 -87.69670 41.90 -87.70 casual 2021-01-23 samedi janv.
## 6 41.90041 -87.69676 41.94 -87.71 casual 2021-01-09 samedi janv.
## year
## 1 2021
## 2 2021
## 3 2021
## 4 2021
## 5 2021
## 6 2021
Create a column for duration of rides calculated from start and end time of rides.
df_all$ride_length <- difftime(df_all$ended_at,df_all$started_at, units = "mins")
head(df_all)
## ride_id rideable_type started_at ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St 17660
## 2 California Ave & Cortez St 17660
## 3 California Ave & Cortez St 17660
## 4 California Ave & Cortez St 17660
## 5 California Ave & Cortez St 17660
## 6 California Ave & Cortez St 17660
## start_lat start_lng end_lat end_lng member_casual date week_day month
## 1 41.90034 -87.69674 41.89 -87.72 member 2021-01-23 samedi janv.
## 2 41.90033 -87.69671 41.90 -87.69 member 2021-01-27 mercredi janv.
## 3 41.90031 -87.69664 41.90 -87.70 member 2021-01-21 jeudi janv.
## 4 41.90040 -87.69666 41.92 -87.69 member 2021-01-07 jeudi janv.
## 5 41.90033 -87.69670 41.90 -87.70 casual 2021-01-23 samedi janv.
## 6 41.90041 -87.69676 41.94 -87.71 casual 2021-01-09 samedi janv.
## year ride_length
## 1 2021 10.4166667 mins
## 2 2021 4.0666667 mins
## 3 2021 1.3333333 mins
## 4 2021 11.7000000 mins
## 5 2021 0.7166667 mins
## 6 2021 53.7833333 mins
df_all[df_all==""] <- NA
head(df_all)
## ride_id rideable_type started_at ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St 17660 <NA> <NA>
## 2 California Ave & Cortez St 17660 <NA> <NA>
## 3 California Ave & Cortez St 17660 <NA> <NA>
## 4 California Ave & Cortez St 17660 <NA> <NA>
## 5 California Ave & Cortez St 17660 <NA> <NA>
## 6 California Ave & Cortez St 17660 <NA> <NA>
## start_lat start_lng end_lat end_lng member_casual date week_day month
## 1 41.90034 -87.69674 41.89 -87.72 member 2021-01-23 samedi janv.
## 2 41.90033 -87.69671 41.90 -87.69 member 2021-01-27 mercredi janv.
## 3 41.90031 -87.69664 41.90 -87.70 member 2021-01-21 jeudi janv.
## 4 41.90040 -87.69666 41.92 -87.69 member 2021-01-07 jeudi janv.
## 5 41.90033 -87.69670 41.90 -87.70 casual 2021-01-23 samedi janv.
## 6 41.90041 -87.69676 41.94 -87.71 casual 2021-01-09 samedi janv.
## year ride_length
## 1 2021 10.4166667 mins
## 2 2021 4.0666667 mins
## 3 2021 1.3333333 mins
## 4 2021 11.7000000 mins
## 5 2021 0.7166667 mins
## 6 2021 53.7833333 mins
filtering, data with started_at greater than the ended_at will be removed and starting and ending stations not mentioned
filtered_df <-df_all %>%
filter(ride_length > 0)%>%
filter(!is.na(start_station_name)) %>%
filter(!is.na(end_station_name)) %>%
filter(!is.na(ride_length))
head(filtered_df)
## ride_id rideable_type started_at ended_at
## 1 B9F73448DFBE0D45 classic_bike 2021-01-24 19:15:38 2021-01-24 19:22:51
## 2 457C7F4B5D3DA135 electric_bike 2021-01-23 12:57:38 2021-01-23 13:02:10
## 3 57C750326F9FDABE electric_bike 2021-01-09 15:28:04 2021-01-09 15:37:51
## 4 4D518C65E338D070 electric_bike 2021-01-09 15:28:57 2021-01-09 15:37:54
## 5 9D08A3AFF410474D classic_bike 2021-01-24 15:56:59 2021-01-24 16:07:08
## 6 49FCE1F8598F12C6 electric_bike 2021-01-22 15:15:28 2021-01-22 15:36:01
## start_station_name start_station_id end_station_name
## 1 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 2 California Ave & Cortez St 17660 California Ave & North Ave
## 3 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 4 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 5 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 6 California Ave & Cortez St 17660 Wells St & Elm St
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 657 41.90036 -87.69670 41.89918 -87.67220 member
## 2 13258 41.90041 -87.69673 41.91044 -87.69689 member
## 3 657 41.90037 -87.69669 41.89918 -87.67218 casual
## 4 657 41.90038 -87.69672 41.89915 -87.67218 casual
## 5 657 41.90036 -87.69670 41.89918 -87.67220 casual
## 6 KA1504000135 41.90037 -87.69679 41.90327 -87.63446 member
## date week_day month year ride_length
## 1 2021-01-24 dimanche janv. 2021 7.216667 mins
## 2 2021-01-23 samedi janv. 2021 4.533333 mins
## 3 2021-01-09 samedi janv. 2021 9.783333 mins
## 4 2021-01-09 samedi janv. 2021 8.950000 mins
## 5 2021-01-24 dimanche janv. 2021 10.150000 mins
## 6 2021-01-22 vendredi janv. 2021 20.550000 mins
Get rid of too long rides as rides should be limited to 1 day or 1440 minutes or 24Hr(cyclistic considers these bikes are stolen).
filtered_df <- filtered_df[!filtered_df$ride_length>1440,]
head(filtered_df)
## ride_id rideable_type started_at ended_at
## 1 B9F73448DFBE0D45 classic_bike 2021-01-24 19:15:38 2021-01-24 19:22:51
## 2 457C7F4B5D3DA135 electric_bike 2021-01-23 12:57:38 2021-01-23 13:02:10
## 3 57C750326F9FDABE electric_bike 2021-01-09 15:28:04 2021-01-09 15:37:51
## 4 4D518C65E338D070 electric_bike 2021-01-09 15:28:57 2021-01-09 15:37:54
## 5 9D08A3AFF410474D classic_bike 2021-01-24 15:56:59 2021-01-24 16:07:08
## 6 49FCE1F8598F12C6 electric_bike 2021-01-22 15:15:28 2021-01-22 15:36:01
## start_station_name start_station_id end_station_name
## 1 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 2 California Ave & Cortez St 17660 California Ave & North Ave
## 3 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 4 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 5 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 6 California Ave & Cortez St 17660 Wells St & Elm St
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 657 41.90036 -87.69670 41.89918 -87.67220 member
## 2 13258 41.90041 -87.69673 41.91044 -87.69689 member
## 3 657 41.90037 -87.69669 41.89918 -87.67218 casual
## 4 657 41.90038 -87.69672 41.89915 -87.67218 casual
## 5 657 41.90036 -87.69670 41.89918 -87.67220 casual
## 6 KA1504000135 41.90037 -87.69679 41.90327 -87.63446 member
## date week_day month year ride_length
## 1 2021-01-24 dimanche janv. 2021 7.216667 mins
## 2 2021-01-23 samedi janv. 2021 4.533333 mins
## 3 2021-01-09 samedi janv. 2021 9.783333 mins
## 4 2021-01-09 samedi janv. 2021 8.950000 mins
## 5 2021-01-24 dimanche janv. 2021 10.150000 mins
## 6 2021-01-22 vendredi janv. 2021 20.550000 mins
Get rid of rides as rides that are equal to 0
filtered_df <- filtered_df[!filtered_df$ride_length==0,]
head(filtered_df)
## ride_id rideable_type started_at ended_at
## 1 B9F73448DFBE0D45 classic_bike 2021-01-24 19:15:38 2021-01-24 19:22:51
## 2 457C7F4B5D3DA135 electric_bike 2021-01-23 12:57:38 2021-01-23 13:02:10
## 3 57C750326F9FDABE electric_bike 2021-01-09 15:28:04 2021-01-09 15:37:51
## 4 4D518C65E338D070 electric_bike 2021-01-09 15:28:57 2021-01-09 15:37:54
## 5 9D08A3AFF410474D classic_bike 2021-01-24 15:56:59 2021-01-24 16:07:08
## 6 49FCE1F8598F12C6 electric_bike 2021-01-22 15:15:28 2021-01-22 15:36:01
## start_station_name start_station_id end_station_name
## 1 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 2 California Ave & Cortez St 17660 California Ave & North Ave
## 3 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 4 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 5 California Ave & Cortez St 17660 Wood St & Augusta Blvd
## 6 California Ave & Cortez St 17660 Wells St & Elm St
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 657 41.90036 -87.69670 41.89918 -87.67220 member
## 2 13258 41.90041 -87.69673 41.91044 -87.69689 member
## 3 657 41.90037 -87.69669 41.89918 -87.67218 casual
## 4 657 41.90038 -87.69672 41.89915 -87.67218 casual
## 5 657 41.90036 -87.69670 41.89918 -87.67220 casual
## 6 KA1504000135 41.90037 -87.69679 41.90327 -87.63446 member
## date week_day month year ride_length
## 1 2021-01-24 dimanche janv. 2021 7.216667 mins
## 2 2021-01-23 samedi janv. 2021 4.533333 mins
## 3 2021-01-09 samedi janv. 2021 9.783333 mins
## 4 2021-01-09 samedi janv. 2021 8.950000 mins
## 5 2021-01-24 dimanche janv. 2021 10.150000 mins
## 6 2021-01-22 vendredi janv. 2021 20.550000 mins
Filter out data that we will not be using for this analysis.
filtered_df <- filtered_df %>%
select(rideable_type, member_casual, ride_length, month, year, week_day )
head(filtered_df)
## rideable_type member_casual ride_length month year week_day
## 1 classic_bike member 7.216667 mins janv. 2021 dimanche
## 2 electric_bike member 4.533333 mins janv. 2021 samedi
## 3 electric_bike casual 9.783333 mins janv. 2021 samedi
## 4 electric_bike casual 8.950000 mins janv. 2021 samedi
## 5 classic_bike casual 10.150000 mins janv. 2021 dimanche
## 6 electric_bike member 20.550000 mins janv. 2021 vendredi