1- Ask

1.1- Business Objective

Increasing the profits by maximizing the number of annual memberships by converting casual riders (single riders and full-day riders) into annual members (annual memberships)

1.2- Business Task

The business task is to show how do the casual riders differ from annual members using past user data and report the findings

1.3- Stakeholders

The stakeholders in this project are:

  • Lily Moreno: The director of marketing and the manager, she is responsible for the development of campaigns and initiatives to promote the bike-share program

  • The cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that would help to guide the cyclistic marketing strategy. I am a junior analyst in this team.

  • Cyclistic executive team: The detail-oriented executive team will decide whether to approve the recommended marketing program or not.

2- Prepare

2.1- Where is the data located

The data that we are using is located in this link (https://divvy-tripdata.s3.amazonaws.com/index.html). It comes from Motivate International Inc, a bike share company in Chicago

2.2- How is the data organized

The data is organized in zip folders in csv files, some files (including divyy-tripdata) contain trips information from 04-2020 till 04-2022. Others (called Divvy-Trips-…Qx) present data from 2015 till the first quarter (first 3 months) of 2020.

And other folders (called Divvy-Stations-trips …. Qx) contain stations and trips info in csv files for the years of 2013 and 2014. But our main focus is going to be on the trips of 2021 which consist of 13 columns concerning the trip and the user such as; Ride id, bike type, starting and ending time, starting and ending station id, name and location, geographic coordinates and membership type

PS: if we want to add other axis of analysis we can work on, like members gender and age, the folders called “Divvy-Stations-trips …. Qx” contain this kind of information

2.3- Unbiasedness and Credibility of the Data

The data is credible and original as it is collected directly by the company that runs the Cyclistic Bike Share program which is called Motivate Inc, not by second- or third-party info. It is also current and relevant as it is been updated monthly from 2015 till now (2022). We can also say that the data is accurate complete and unbiased as it reflects all real data gathered by the company not just a sample.

2.4- Licensing, Privacy, Security and Accessibility

The data has been made available by Motivate International Inc. under this license https://ride.divvybikes.com/data-license-agreement . The data-privacy issues prohibit us from using riders’ personally identifiable information that’s why we can’t determine if casual riders live in the Cyclitic service area or if they have purchased multiple single passes.

2.5- Ability of the Data to Answer business questions

The business goal that was already set is to determine the difference between casual riders and annual members. We have enough information about the 2 different types of users in order to discover their differences but not having an idea about the identity of the users can limit our analysis in order to know if the casual riders are loyal clients so we can build a more specific business strategy upon that.

But in general, we can answer the business question about the differences between the 2 types of clients.

2.6- Problems with the Data

The problems that we faced with the data:

• Having dirty data that needs cleaning (duplicates, missing fields … etc.)

• The csv files are huge which required me to go by segment and clean the data from the variables I am not going to use.

3- Process

3.1- Tools

I choose RStudio Desktop in order to prepare, process, clean, analyze and visualize the data because the set is too large and will be hard to get done in Excel, google spreadsheets or RStudio Cloud.

3.2- Review of the Data

In order to get an overview, the data was reviewed in terms of understanding of the consent of variables, data formats and data integrity.

Data review involved the following:

• Checking column names across all the 12 original files.

• Checking for missing values.

• Checking of white spaces.

• Checking of duplicate records.

• Other data anomalies.

However, the review of the data revealed several problems:

• Duplicate record of ID numbers.

• Records with missing start or end station name.

• Records with very short or very long ride duration.

• Records for trips starting or ending at an administrative station (repair or testing station).

All 12 files were combined into one data set after initial review was completed. The final data set consisted of 4586829 rows with 6 columns of character and numeric data. This matched the number of records in all 12 monthly data files.

Setting work Directory

setwd("C:/Users/Mega-PC/Desktop/portfolio projects/portfolio projects/my work/cyclistic bike/CaseStudy")

Install Packages

install.packages('plyr', repos = "http://cran.us.r-project.org")
install.packages('tidyverse')
install.packages("lubridate")

Load Packages

library(tidyverse)
library(lubridate)

3.3- Collect Data(Data wrangling)

Import the data files into new vectors

january_2021 <- read.csv("data/202101-divvy-tripdata.csv")
february_2021 <- read.csv("data/202102-divvy-tripdata.csv")
march_2021 <- read.csv("data/202103-divvy-tripdata.csv")
april_2021 <- read.csv("data/202104-divvy-tripdata.csv")
may_2021 <- read.csv("data/202105-divvy-tripdata.csv") 
june_2021 <- read.csv("data/202106-divvy-tripdata.csv")
july_2021 <- read.csv("data/202107-divvy-tripdata.csv")
august_2021 <- read.csv("data/202108-divvy-tripdata.csv")
september_2021 <- read.csv("data/202109-divvy-tripdata.csv")
october_2021 <- read.csv("data/202110-divvy-tripdata.csv")
november_2021 <- read.csv("data/202111-divvy-tripdata.csv")
december_2021 <- read.csv("data/202112-divvy-tripdata.csv")

3.4-Data validation

Check column names to ensure we can join all the data. Compare column names for each of the files. While the names don’t have to be in the same order but they do need to match perfectly before we can use a command to join them into one file

Data validation

Calculate the total number of records in all twelve monthly files. It is 5479096 rows

sum( nrow(january_2021) + nrow(february_2021) 
    + nrow(march_2021) + nrow(april_2021) + nrow(may_2021) 
    + nrow(june_2021) + nrow(july_2021) + nrow(august_2021)
    + nrow(september_2021) + nrow(october_2021) + nrow(november_2021)+nrow(december_2021))
## [1] 5595063

Merging all individual files in df_all using “rbind()”

df_all <- rbind(january_2021, february_2021, march_2021, april_2021, may_2021, june_2021, july_2021, august_2021, september_2021, october_2021, november_2021, december_2021)
head(df_all)

calculating the number of rows in the merged file just to make sure

nrow(df_all)
## [1] 5595063

We can notice that the data was merged correctly noticing the number of rows is equal to the sum of records in all twelve monthly files

Saving the merged dataframe to local storage for backup using “write_csv()”

write_csv(df_all, "merged_trip.csv")

Inspecting the new combined dataframe

str(df_all)
## 'data.frame':    5595063 obs. of  13 variables:
##  $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
##  $ ended_at          : chr  "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
##  $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...

3.5- Cleaning data

3.5.1- Removing duplicates

df_all <- distinct(df_all)
nrow(df_all)
## [1] 5595063

3.5.2- Transforming data

Change the date to usa in order to geth the days of the week and the month in english

Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"

Create additional columns for the date, Week_day, month and year from the “started_at” column.

This allows a more granular analysis of the data by date/day/month

df_all$date <- as.Date(df_all$started_at)
df_all$week_day <- format(as.Date(df_all$date), "%A")
df_all$month <- format(as.Date(df_all$date), "%B")
df_all$year<-format(df_all$date,"%Y")
head(df_all)

Create a column for duration of rides calculated from start and end time of rides called “ride_length”.

df_all$ended_at <- as.POSIXct(df_all$ended_at, tz = 'UTC')
df_all$started_at <- as.POSIXct(df_all$started_at, tz = 'UTC')
df_all$ride_length <- difftime(df_all$ended_at, df_all$started_at, units = "mins")
df_all$ended_at<- as.character(df_all$ended_at)
df_all$started_at<- as.character(df_all$started_at)
head(df_all)

3.5.2- Replacing blank cells with NA

df_all[df_all==""] <- NA 
head(df_all)

3.6- Filtering data

3.6.1- Filtering ride length

filtering data with “started_at” greater than the “ended_at” and starting and ending stations not mentioned

filtered_df <-df_all %>% 
  filter(ride_length > 0)%>%
  filter(!is.na(start_station_name)) %>% 
  filter(!is.na(end_station_name)) %>% 
  filter(!is.na(ride_length))
filtered_df
head(filtered_df)

Get rid of too long rides as rides should be limited to 1 day or 1440 minutes or 24Hr(cyclistic considers these bikes are stolen).

filtered_df <- filtered_df[!filtered_df$ride_length>1440,]

Get rid of rides that are equal to 0

filtered_df <- filtered_df[!filtered_df$ride_length==0,] 
head(filtered_df)

3.6.2- Filtering unused data

Filter out data that we will not be using for this analysis.

filtered_df <- filtered_df %>% 
  select(rideable_type, member_casual, ride_length, month, year, week_day )
  head(filtered_df)

3.6.3- checking the filtered data

checking there are NA values in the dataframe

sum(is.na(filtered_df))
## [1] 0

checking the data types of the columns to be sure before calculations

colnames(filtered_df)
## [1] "rideable_type" "member_casual" "ride_length"   "month"        
## [5] "year"          "week_day"
str(filtered_df)
## 'data.frame':    4586829 obs. of  6 variables:
##  $ rideable_type: chr  "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ member_casual: chr  "member" "member" "casual" "casual" ...
##  $ ride_length  : 'difftime' num  7.21666666666667 4.53333333333333 9.78333333333333 8.95 ...
##   ..- attr(*, "units")= chr "mins"
##  $ month        : chr  "January" "January" "January" "January" ...
##  $ year         : chr  "2021" "2021" "2021" "2021" ...
##  $ week_day     : chr  "Sunday" "Saturday" "Saturday" "Saturday" ...

3.7- writting the final clean file

After cleaning the data, it is now ready to enter the analysis phase. The total row count after the cleaning has reduced to 4,586,829.

Now we Write the csv file again in the disk and restart R session to tackle the analysis phase

write_csv(filtered_df, "final_trip.csv")

4- Analyze

check the types of members existing in our dataframe

filtered_df2<- filtered_df %>% distinct(member_casual)
filtered_df2

check the types of rideable types existing in our dataframe

filtered_df3<- filtered_df %>% distinct(rideable_type)
filtered_df3

We should focus in our analysis is the business task that we were asked to do ,which is showing how do causal riders differ from annual riders

4.1- Data validation

Upload the “final_trip.csv” file on-board and check data validation if everything is ok

final_trip <- read_csv("final_trip.csv")
## Rows: 4586829 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): rideable_type, member_casual, month, week_day
## dbl (2): ride_length, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(final_trip)
## spec_tbl_df [4,586,829 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ rideable_type: chr [1:4586829] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ member_casual: chr [1:4586829] "member" "member" "casual" "casual" ...
##  $ ride_length  : num [1:4586829] 7.22 4.53 9.78 8.95 10.15 ...
##  $ month        : chr [1:4586829] "January" "January" "January" "January" ...
##  $ year         : num [1:4586829] 2021 2021 2021 2021 2021 ...
##  $ week_day     : chr [1:4586829] "Sunday" "Saturday" "Saturday" "Saturday" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   rideable_type = col_character(),
##   ..   member_casual = col_character(),
##   ..   ride_length = col_double(),
##   ..   month = col_character(),
##   ..   year = col_double(),
##   ..   week_day = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
names(final_trip)
## [1] "rideable_type" "member_casual" "ride_length"   "month"        
## [5] "year"          "week_day"

4.2- Sorting month and week days in order

final_trip$month <- ordered(final_trip$month,levels=c("January", "February", "March", "April","May", "June", "July", "August","September", "October", "November","December"))

final_trip$week_day <- ordered(final_trip$week_day, levels=c("Sunday", "Monday", "Tuesday","Wednesday", "Thursday","Friday", "Saturday"))

4.3- Descriptive analysis on the ride length

summary(final_trip$ride_length)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0167    6.9500   12.2000   19.8852   22.1000 1439.3667

4.4- Comparing members and casual users regarding ride length

4.4.1- in terms of the mean

aggregate(final_trip$ride_length ~ final_trip$member_casual, FUN = mean)

4.4.2- in terms of the median

aggregate(final_trip$ride_length ~ final_trip$member_casual, FUN = median)

4.4.3- in terms of the max

aggregate(final_trip$ride_length ~ final_trip$member_casual, FUN = max)

4.4.4- in terms of the min

aggregate(final_trip$ride_length ~ final_trip$member_casual, FUN = min)

4.4.5- The average ride length by day for members vs casual users

aggregate(final_trip$ride_length ~ final_trip$member_casual + final_trip$week_day, FUN = mean)

We notice that the average of ride length for casual users has always been higher than the members

4.5- Comparing members and casual users sum of rides

4.5.1- By day of the week

num_of_rides <- final_trip %>%
  group_by(member_casual,week_day) %>% 
  summarise(number_of_rides = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
num_of_rides 

4.5.2- By month

num_of_rides_month <- final_trip %>%
  group_by(member_casual,month) %>% 
  summarise(number_of_rides = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
num_of_rides_month

5- Share

5.1- Acording to the number of rides

*Number of rides by day of the week by member type

final_trip %>%
  group_by(member_casual, week_day) %>% 
  summarise(num_of_rides = n()) %>% 
  arrange(member_casual, week_day) %>% 
  ggplot(aes(x = week_day, y = num_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Day of the week", y = "Number of rides",
       title = "Number of Rides by Day of the week")+
  scale_y_continuous(name=expression(num ~ of ~ rides ~ (x10^5)),
                     labels=function(x) x / 100000,
                     limits=c(0,700000))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

This visualization shows the number of rides by day of the week by each customer type. It indicates that casual riders peak is on the Sunday and Saturday. While it remains steady for members the whole week except Tuesday and Wednesday in particular where a slight peak can be seen.

This indicates that members mainly use the bikes for their regular commutes while casual riders for leisure on weekends.

*Number of rides by month by member type

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

This visualization shows the number of rides by month by each customer type. The graph shows that the number of rides is directly related to the season.The warmer the weather is the more members and casuals use bikes and the colder the weather is the more and more this value decreases. We notice that the number of rides for casual users decreases more than the members as the weather gets colder

*Number of rides by bike type

final_trip %>% 
  group_by(rideable_type) %>% 
  summarise(num_of_rides = n()) %>%
  ggplot(aes(x = rideable_type, y = num_of_rides, fill = rideable_type)) +
  geom_col(position = "dodge") +
  labs(x = "Bike Types", y = "Number of Rides",
       title = "Number of Rides by Bike Type ")+
  scale_y_continuous(name=expression(num ~ of ~ rides ~ (x10^5)),
                     labels=function(x) x / 100000,
                     limits=c(0,5000000))

This visualization shows the number of rides by bike type. It indicates that the number of rides for classic bikes is the highest followed by the number of electric bikes and last but not least the lowest number is presented by docked bikes.

*Number of rides by bike type by member type

final_trip %>% 
  group_by(member_casual,rideable_type) %>% 
  summarise(num_of_rides = n()) %>%
   arrange(member_casual, rideable_type) %>% 
  ggplot(aes(x = rideable_type, y = num_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Bike Types", y = "Number of Rides",
       title = "Number of Rides by Bike Type by member type ")+
  scale_y_continuous(name=expression(num ~ of ~ rides ~ (x10^5)),
                     labels=function(x) x / 100000,
                     limits=c(0,2500000))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

This visualization shows the number of rides by bike type for each member type. It indicates that the number of rides for classic bikes is the highest followed by the number of electric bikes and last but not least the lowest number is presented by docked bikes as the previous visualization. And for each type the members present higher numbers than the casual users.

*Number of rides by day of the week by bike type by member type

final_trip %>% 
  group_by(member_casual,week_day,rideable_type) %>% 
  summarise(num_of_rides = n()) %>%
   arrange(member_casual,week_day) %>% 
  ggplot(aes(x = week_day, y = num_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  facet_wrap(~ rideable_type) +
  labs(x = "Day of the week", y = "Number of Rides",
       title = "Number of Rides by day of the week by Bike Type by member type  ", fill="Customer type")+
  scale_y_continuous(name=expression(num ~ of ~ rides ~ (x10^5)),
                     labels=function(x) x / 100000,
                     limits=c(0,400000))
## `summarise()` has grouped output by 'member_casual', 'week_day'. You can
## override using the `.groups` argument.

The above graph, we can see that classic Bikes are very popular with members everyday of the week compared to electric bikes while the docked bikes are only used by casual users . Members are quite steady throughout the week.

*Number of rides by month by bike type by member type

final_trip %>% 
  group_by(member_casual,month,rideable_type) %>% 
  summarise(num_of_rides = n()) %>%
   arrange(member_casual,month) %>% 
  ggplot(aes(x = month, y = num_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  facet_wrap(~ rideable_type) +
  labs(x = "month", y = "Number of Rides",
       title = "Number of Rides by month by Bike Type by member type ", fill="Customer type")+
  scale_y_continuous(name=expression(num ~ of ~ rides ~ (x10^5)),
                     labels=function(x) x / 100000,
                     limits=c(0,300000))
## `summarise()` has grouped output by 'member_casual', 'month'. You can override
## using the `.groups` argument.

The above graph, we can see that classic Bikes are very popular with members during every month compared to the electric bikes while the docked bikes are only used by casual users. The number of rides is correlated to the temperature.

5.2- Acording to the average of ride Duration

*average Duration of a ride by day of the week by member type

final_trip %>%
  group_by(member_casual, week_day) %>% 
  summarise(avg_rides = mean(ride_length)) %>%
  ggplot(aes(x = week_day, y = avg_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Day of the week", y = "Average Rides Duration",
       title = "Average Ride Duration by Day of Week")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

This visualization shows the duration of rides by day of the week for each member type. It indicates that the average duration are the highest during the weekend for all users while in general the average is the highest among casual users.

*avg Duration of ride by month by member type

final_trip %>% 
  group_by(member_casual, month) %>% 
  summarise(avg_rides = mean(ride_length)) %>%
  ggplot(aes(x = month, y = avg_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Months", y = "Average Rides Duration",
       title = "Average Rides by Months ")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

This visualization shows the duration of rides by month for each member type. It indicates that the average duration is the highest during the warmer seasons but it is slightly less during the colder season.In general the casual users are the ones who do the longest rides.

*length of ride by Bike Type

final_trip %>% 
  group_by(rideable_type) %>% 
  summarise(avg_rides = mean(ride_length)) %>%
  ggplot(aes(x = rideable_type, y = avg_rides, fill = rideable_type)) +
  geom_col(position = "dodge") +
  labs(x = "Bike Types", y = "Average Rides Duration",
       title = "Average Ride durations of Bike Type ")

*length of ride by Bike Type(by member type)

final_trip %>% 
  group_by(member_casual, rideable_type) %>% 
  summarise(avg_rides = mean(ride_length)) %>%
  ggplot(aes(x = rideable_type, y = avg_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Bike Types", y = "Average Rides Duration(mins)",
       title = "Average Ride durations of Bike Types(members vs casual)")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

From the 2 graphs above we can tell that docked Bikes are the most used Bikes followed by classic bikes and then electric Bikes. Also, there is a high popularity of docked Bikes by casual users followed by classic Bikes. Electric is being the least popular.

5.3- Visualizing average ride duration of users from January 2021 to December 2021 (Trend)

  • Visualizing average ride duration by month by user type
final_trip %>%
  group_by(member_casual, month) %>% 
  summarise(avg_rides = mean(ride_length)) %>%
  ggplot(aes(x = month, y = avg_rides, group = member_casual)) +
  geom_line(aes(color=member_casual)) + geom_point(aes(color=member_casual)) +
  labs(x = "Months", y = "Average Rides Duration(mins)",
       title = "Average Rides Duration(mins) trends(By Member vs Casual)")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

* Visualizing average ride duration by month by bike type

final_trip %>%
  group_by(rideable_type, month) %>% 
  summarise(avg_rides = mean(ride_length)) %>%
  ggplot(aes(x = month, y = avg_rides, group = rideable_type)) +
  geom_line(aes(color=rideable_type)) + geom_point(aes(color=rideable_type)) +
  labs(x = "Months(Year 2021)", y = "Average Rides Duration(mins)",
       title = "Average Rides Duration(mins) trend(by Bike Types)")
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.

5.4- Conclusions

1.Casual users enjoy taking rides on Weekends and less rides on workdays. While members have more steady number of rides on workdays and a little increase on the weekends.

2.Both Casual users and members average number and length of rides increase as the Chicago weather gets warmer and decreases when its cold.

3.Casual users prefer classic Bikes, followed by electric bikes and docked bikes being the least popular.

4.Members prefer classic bikes, with docked and electric bikes relatively close in average number.

5.Casual users take longer rides than members.

6.Docked bikes are the most used among all types of bikes.

5.5- Recommendations

From the insights I derived from this analysis, I would recommend the director of marketing and the manager of Cyclistic Bike Share, Lily Moreno, to design a marketing campaign with a purpose of converting Casual users into members with Subscriptions. And add a marketing campaign to increase trips with Classic and Electric Bikes.