Increasing the profits by maximizing the number of annual memberships by converting casual riders (single riders and full-day riders) into annual members (annual memberships)
The business task is to show how do the casual riders differ from annual members using past user data and report the findings
The stakeholders in this project are:
Lily Moreno: The director of marketing and the manager, she is responsible for the development of campaigns and initiatives to promote the bike-share program
The cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that would help to guide the cyclistic marketing strategy. I am a junior analyst in this team.
Cyclistic executive team: The detail-oriented executive team will decide whether to approve the recommended marketing program or not.
The data that we are using is located in this link (https://divvy-tripdata.s3.amazonaws.com/index.html). It comes from Motivate International Inc, a bike share company in Chicago
The data is organized in zip folders in csv files, some files (including divyy-tripdata) contain trips information from 04-2020 till 04-2022. Others (called Divvy-Trips-…Qx) present data from 2015 till the first quarter (first 3 months) of 2020.
And other folders (called Divvy-Stations-trips …. Qx) contain stations and trips info in csv files for the years of 2013 and 2014. But our main focus is going to be on the trips of 2021 which consist of 13 columns concerning the trip and the user such as; Ride id, bike type, starting and ending time, starting and ending station id, name and location, geographic coordinates and membership type
PS: if we want to add other axis of analysis we can work on, like members gender and age, the folders called “Divvy-Stations-trips …. Qx” contain this kind of information
The data is credible and original as it is collected directly by the company that runs the Cyclistic Bike Share program which is called Motivate Inc, not by second- or third-party info. It is also current and relevant as it is been updated monthly from 2015 till now (2022). We can also say that the data is accurate complete and unbiased as it reflects all real data gathered by the company not just a sample.
The data has been made available by Motivate International Inc. under this license https://ride.divvybikes.com/data-license-agreement . The data-privacy issues prohibit us from using riders’ personally identifiable information that’s why we can’t determine if casual riders live in the Cyclitic service area or if they have purchased multiple single passes.
The business goal that was already set is to determine the difference between casual riders and annual members. We have enough information about the 2 different types of users in order to discover their differences but not having an idea about the identity of the users can limit our analysis in order to know if the casual riders are loyal clients so we can build a more specific business strategy upon that.
But in general, we can answer the business question about the differences between the 2 types of clients.
The problems that we faced with the data:
• Having dirty data that needs cleaning (duplicates, missing fields … etc.)
• The csv files are huge which required me to go by segment and clean the data from the variables I am not going to use.
I choose RStudio Desktop in order to prepare, process, clean, analyze and visualize the data because the set is too large and will be hard to get done in Excel, google spreadsheets or RStudio Cloud.
In order to get an overview, the data was reviewed in terms of understanding of the consent of variables, data formats and data integrity.
Data review involved the following:
• Checking column names across all the 12 original files.
• Checking for missing values.
• Checking of white spaces.
• Checking of duplicate records.
• Other data anomalies.
However, the review of the data revealed several problems:
• Duplicate record of ID numbers.
• Records with missing start or end station name.
• Records with very short or very long ride duration.
• Records for trips starting or ending at an administrative station (repair or testing station).
All 12 files were combined into one data set after initial review was completed. The final data set consisted of 4586829 rows with 6 columns of character and numeric data. This matched the number of records in all 12 monthly data files.
Setting work Directory
setwd("C:/Users/Mega-PC/Desktop/portfolio projects/portfolio projects/my work/cyclistic bike/CaseStudy")
Install Packages
install.packages('plyr', repos = "http://cran.us.r-project.org")
install.packages('tidyverse')
install.packages("lubridate")
Load Packages
library(tidyverse)
library(lubridate)
Import the data files into new vectors
january_2021 <- read.csv("data/202101-divvy-tripdata.csv")
february_2021 <- read.csv("data/202102-divvy-tripdata.csv")
march_2021 <- read.csv("data/202103-divvy-tripdata.csv")
april_2021 <- read.csv("data/202104-divvy-tripdata.csv")
may_2021 <- read.csv("data/202105-divvy-tripdata.csv")
june_2021 <- read.csv("data/202106-divvy-tripdata.csv")
july_2021 <- read.csv("data/202107-divvy-tripdata.csv")
august_2021 <- read.csv("data/202108-divvy-tripdata.csv")
september_2021 <- read.csv("data/202109-divvy-tripdata.csv")
october_2021 <- read.csv("data/202110-divvy-tripdata.csv")
november_2021 <- read.csv("data/202111-divvy-tripdata.csv")
december_2021 <- read.csv("data/202112-divvy-tripdata.csv")
Check column names to ensure we can join all the data. Compare column names for each of the files. While the names don’t have to be in the same order but they do need to match perfectly before we can use a command to join them into one file
Data validation
Calculate the total number of records in all twelve monthly files. It is 5479096 rows
sum( nrow(january_2021) + nrow(february_2021)
+ nrow(march_2021) + nrow(april_2021) + nrow(may_2021)
+ nrow(june_2021) + nrow(july_2021) + nrow(august_2021)
+ nrow(september_2021) + nrow(october_2021) + nrow(november_2021)+nrow(december_2021))
## [1] 5595063
Merging all individual files in df_all using “rbind()”
df_all <- rbind(january_2021, february_2021, march_2021, april_2021, may_2021, june_2021, july_2021, august_2021, september_2021, october_2021, november_2021, december_2021)
head(df_all)
calculating the number of rows in the merged file just to make sure
nrow(df_all)
## [1] 5595063
We can notice that the data was merged correctly noticing the number of rows is equal to the sum of records in all twelve monthly files
Saving the merged dataframe to local storage for backup using “write_csv()”
write_csv(df_all, "merged_trip.csv")
Inspecting the new combined dataframe
str(df_all)
## 'data.frame': 5595063 obs. of 13 variables:
## $ ride_id : chr "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
## $ ended_at : chr "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
## $ start_station_name: chr "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
df_all <- distinct(df_all)
nrow(df_all)
## [1] 5595063
Change the date to usa in order to geth the days of the week and the month in english
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
Create additional columns for the date, Week_day, month and year from the “started_at” column.
This allows a more granular analysis of the data by date/day/month
df_all$date <- as.Date(df_all$started_at)
df_all$week_day <- format(as.Date(df_all$date), "%A")
df_all$month <- format(as.Date(df_all$date), "%B")
df_all$year<-format(df_all$date,"%Y")
head(df_all)
Create a column for duration of rides calculated from start and end time of rides called “ride_length”.
df_all$ended_at <- as.POSIXct(df_all$ended_at, tz = 'UTC')
df_all$started_at <- as.POSIXct(df_all$started_at, tz = 'UTC')
df_all$ride_length <- difftime(df_all$ended_at, df_all$started_at, units = "mins")
df_all$ended_at<- as.character(df_all$ended_at)
df_all$started_at<- as.character(df_all$started_at)
head(df_all)
df_all[df_all==""] <- NA
head(df_all)
filtering data with “started_at” greater than the “ended_at” and starting and ending stations not mentioned
filtered_df <-df_all %>%
filter(ride_length > 0)%>%
filter(!is.na(start_station_name)) %>%
filter(!is.na(end_station_name)) %>%
filter(!is.na(ride_length))
filtered_df
head(filtered_df)
Get rid of too long rides as rides should be limited to 1 day or 1440 minutes or 24Hr(cyclistic considers these bikes are stolen).
filtered_df <- filtered_df[!filtered_df$ride_length>1440,]
Get rid of rides that are equal to 0
filtered_df <- filtered_df[!filtered_df$ride_length==0,]
head(filtered_df)
Filter out data that we will not be using for this analysis.
filtered_df <- filtered_df %>%
select(rideable_type, member_casual, ride_length, month, year, week_day )
head(filtered_df)
checking there are NA values in the dataframe
sum(is.na(filtered_df))
## [1] 0
checking the data types of the columns to be sure before calculations
colnames(filtered_df)
## [1] "rideable_type" "member_casual" "ride_length" "month"
## [5] "year" "week_day"
str(filtered_df)
## 'data.frame': 4586829 obs. of 6 variables:
## $ rideable_type: chr "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ member_casual: chr "member" "member" "casual" "casual" ...
## $ ride_length : 'difftime' num 7.21666666666667 4.53333333333333 9.78333333333333 8.95 ...
## ..- attr(*, "units")= chr "mins"
## $ month : chr "January" "January" "January" "January" ...
## $ year : chr "2021" "2021" "2021" "2021" ...
## $ week_day : chr "Sunday" "Saturday" "Saturday" "Saturday" ...
After cleaning the data, it is now ready to enter the analysis phase. The total row count after the cleaning has reduced to 4,586,829.
Now we Write the csv file again in the disk and restart R session to tackle the analysis phase
write_csv(filtered_df, "final_trip.csv")
check the types of members existing in our dataframe
filtered_df2<- filtered_df %>% distinct(member_casual)
filtered_df2
check the types of rideable types existing in our dataframe
filtered_df3<- filtered_df %>% distinct(rideable_type)
filtered_df3
We should focus in our analysis is the business task that we were asked to do ,which is showing how do causal riders differ from annual riders
Upload the “final_trip.csv” file on-board and check data validation if everything is ok
final_trip <- read_csv("final_trip.csv")
## Rows: 4586829 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): rideable_type, member_casual, month, week_day
## dbl (2): ride_length, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(final_trip)
## spec_tbl_df [4,586,829 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ rideable_type: chr [1:4586829] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ member_casual: chr [1:4586829] "member" "member" "casual" "casual" ...
## $ ride_length : num [1:4586829] 7.22 4.53 9.78 8.95 10.15 ...
## $ month : chr [1:4586829] "January" "January" "January" "January" ...
## $ year : num [1:4586829] 2021 2021 2021 2021 2021 ...
## $ week_day : chr [1:4586829] "Sunday" "Saturday" "Saturday" "Saturday" ...
## - attr(*, "spec")=
## .. cols(
## .. rideable_type = col_character(),
## .. member_casual = col_character(),
## .. ride_length = col_double(),
## .. month = col_character(),
## .. year = col_double(),
## .. week_day = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
names(final_trip)
## [1] "rideable_type" "member_casual" "ride_length" "month"
## [5] "year" "week_day"
final_trip$month <- ordered(final_trip$month,levels=c("January", "February", "March", "April","May", "June", "July", "August","September", "October", "November","December"))
final_trip$week_day <- ordered(final_trip$week_day, levels=c("Sunday", "Monday", "Tuesday","Wednesday", "Thursday","Friday", "Saturday"))
summary(final_trip$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0167 6.9500 12.2000 19.8852 22.1000 1439.3667
aggregate(final_trip$ride_length ~ final_trip$member_casual, FUN = mean)
aggregate(final_trip$ride_length ~ final_trip$member_casual, FUN = median)
aggregate(final_trip$ride_length ~ final_trip$member_casual, FUN = max)
aggregate(final_trip$ride_length ~ final_trip$member_casual, FUN = min)
aggregate(final_trip$ride_length ~ final_trip$member_casual + final_trip$week_day, FUN = mean)
We notice that the average of ride length for casual users has always been higher than the members
num_of_rides <- final_trip %>%
group_by(member_casual,week_day) %>%
summarise(number_of_rides = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
num_of_rides
num_of_rides_month <- final_trip %>%
group_by(member_casual,month) %>%
summarise(number_of_rides = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
num_of_rides_month