bike station

Introduction

This project is the final project in my Google Data Analytics Professional Certificate Course on coursera. In this case study, I will be analyzing a public dataset for a fictional company called Cyclistic, provided by the course. Here, I will be using R programming language for this analysis because of its potential benefits to reproducibility, transparency, easy statistical analysis tools and data visualizations

This project will be based on the data analysis process

This project will follow the following road map steps

scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships,in essence, converting casual users to member. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members.But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data

Characters and Teams

ASK

Three questions will guide the future marketing program:

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

Lily Moreno (director of marketing and my manager) has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

Key tasks

  1. Identify the business task
  • The main business objective is to design marketing strategies aimed at converting casual riders into annual members by understanding how they differ.
  1. Consider key stakeholders
  • The key stakeholders are the Director of Marketing (Lily Moreno), Marketing Analytics team, and Executive team

Deliverable

  1. A clear statement of the business task
  • To find the differences between the casual riders and annual members

PREPARE

I will use Cyclistic’s historical trip data to analyze and identify trends.The data has been made available by Motivate International Inc. under this license Datasets are available here for download

Key tasks

  1. Download data and store it appropriately
  • Data has been downloaded and copies have been stored securely on my computer.
  1. Identify how it’s organized
  • The data is in CSV (comma-separated values) format, and there are a total of 13 columns for q3_2019 and q4_2019 dataset and also 15 columns for q1_2020
  1. Sort and filter the data
  • For this analysis, I will be using data for the year 2019 and 2020
  1. Determine the credibility of the data
  • For the purposes of this case study, the datasets are appropriate and will enable me to answer the business questions. The data has been made available by Motivate International Inc. This is public data that I can use to explore how different customer types are using Cyclistic bikes. But data-privacy issues will prohibit me from using rider’s personal identifiable information and this will prevent me from determining if riders have purchased multiple single passes. All ride ids are unique

Deliverable

A description of all data sources used
* The main source of all the data used was provided by the Cyclistic Company

Install and load the needed R studio library

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)

import data to R studio

  • importing q2_2019,q3_2019,q4_2019,q1_2020 dataset into R studio
q2_2019 <- read.csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read.csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read.csv("Divvy_Trips_2019_q4.csv")
q1_2020 <- read.csv("Divvy_Trips_2020_Q1.csv")

viewing and comparing each columns of the dataset

colnames(q2_2019)
##  [1] "X01...Rental.Details.Rental.ID"                   
##  [2] "X01...Rental.Details.Local.Start.Time"            
##  [3] "X01...Rental.Details.Local.End.Time"              
##  [4] "X01...Rental.Details.Bike.ID"                     
##  [5] "X01...Rental.Details.Duration.In.Seconds.Uncapped"
##  [6] "X03...Rental.Start.Station.ID"                    
##  [7] "X03...Rental.Start.Station.Name"                  
##  [8] "X02...Rental.End.Station.ID"                      
##  [9] "X02...Rental.End.Station.Name"                    
## [10] "User.Type"                                        
## [11] "Member.Gender"                                    
## [12] "X05...Member.Details.Member.Birthday.Year"
colnames(q3_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q4_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q1_2020)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Rename columns to make them consistent with q1_2020,this is due to the new naming convention of columns by the public dataset

Rename the q2_2019 dataset columns to q1_2020 dataset column

(q2_2019 <-  rename(q2_2019
                  ,ride_id = "X01...Rental.Details.Rental.ID"
                  ,rideable_type = "X01...Rental.Details.Bike.ID" 
                  ,started_at = "X01...Rental.Details.Local.Start.Time"
                  ,ended_at = "X01...Rental.Details.Local.End.Time"
                  ,start_station_name = "X03...Rental.Start.Station.Name"
                  ,start_station_id = "X03...Rental.Start.Station.ID"
                  ,end_station_name = "X02...Rental.End.Station.Name"
                  ,end_station_id = "X02...Rental.End.Station.ID"
                  ,member_casual = "User.Type"))
(q3_2019 <- rename(q3_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid
                   ,started_at = start_time
                   ,ended_at = end_time
                   ,start_station_name = from_station_name
                   ,start_station_id = from_station_id
                   ,end_station_name = to_station_name
                   ,end_station_id = to_station_id
                   ,member_casual = usertype))
(q4_2019 <- rename(q4_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid
                   ,started_at = start_time
                   ,ended_at = end_time
                   ,start_station_name = from_station_name
                   ,start_station_id = from_station_id
                   ,end_station_name = to_station_name
                   ,end_station_id = to_station_id
                   ,member_casual = usertype))

Confirmation of the columns name

colnames(q2_2019)
##  [1] "ride_id"                                          
##  [2] "started_at"                                       
##  [3] "ended_at"                                         
##  [4] "rideable_type"                                    
##  [5] "X01...Rental.Details.Duration.In.Seconds.Uncapped"
##  [6] "start_station_id"                                 
##  [7] "start_station_name"                               
##  [8] "end_station_id"                                   
##  [9] "end_station_name"                                 
## [10] "member_casual"                                    
## [11] "Member.Gender"                                    
## [12] "X05...Member.Details.Member.Birthday.Year"
colnames(q3_2019)
##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "tripduration"       "start_station_id"  
##  [7] "start_station_name" "end_station_id"     "end_station_name"  
## [10] "member_casual"      "gender"             "birthyear"
colnames(q4_2019)
##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "tripduration"       "start_station_id"  
##  [7] "start_station_name" "end_station_id"     "end_station_name"  
## [10] "member_casual"      "gender"             "birthyear"
colnames(q1_2020)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Confirmation of data type of all the columns

str(q2_2019)
str(q3_2019)
str(q4_2019)
str(q1_2020)

Rideable_type and ride_id columns from q2_1019,q3_2019 and q4_2019 needs to be converted to character data type as its in q1-2020 dataset

q2_2019 <- dplyr::mutate(q2_2019, ride_id =   as.character(q2_2019$ride_id),
                 rideable_type = as.character(q2_2019$rideable_type))
q3_2019 <- dplyr::mutate(q3_2019,ride_id = as.character(q3_2019$ride_id),
                  rideable_type = as.character(q3_2019$rideable_type))
q4_2019 <-  dplyr::mutate(q4_2019,ride_id = as.character(q4_2019$ride_id),
                  rideable_type = as.character(q4_2019$rideable_type))
is.character(q2_2019$ride_id)
## [1] TRUE
is.character(q3_2019$ride_id)
## [1] TRUE
is.character(q4_2019$ride_id)
## [1] TRUE
is.character(q2_2019$rideable_type)
## [1] TRUE
is.character(q3_2019$rideable_type)
## [1] TRUE
is.character(q4_2019$rideable_type)
## [1] TRUE

Aggregate the four datasets into a single dataset

all_trip <- bind_rows(q2_2019,q3_2019,q4_2019,q1_2020) 

Drop some columns that are not unique

all_trip <- all_trip %>% 
  select(-c( "X01...Rental.Details.Duration.In.Seconds.Uncapped","X05...Member.Details.Member.Birthday.Year",Member.Gender,"X05...Member.Details.Member.Birthday.Year",tripduration,birthyear,start_lat,start_lng,end_lat,end_lng,gender))
colnames(all_trip)
## [1] "ride_id"            "started_at"         "ended_at"          
## [4] "rideable_type"      "start_station_id"   "start_station_name"
## [7] "end_station_id"     "end_station_name"   "member_casual"

Data processing activities on the dataset

  1. In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate that from four to two labels.we need to replace the subscribers with member and the customers with casual
all_trip <- all_trip %>% 
  mutate(member_casual = recode(member_casual
                                ,"Subscriber" = "member"
                                ,"Customer" = "casual"))

Confirm by running this code chunk

n_distinct(all_trip$member_casual)
## [1] 2
  1. Add a ride length column to show the time in seconds between each trip duration and also convert the data type to numeric
all_trip$ride_length <- difftime(all_trip$ended_at,all_trip$started_at)
typeof(all_trip$ride_length)
## [1] "double"

Convert to numeric data type

all_trip$ride_length <- as.numeric(all_trip$ride_length)
typeof(all_trip$ride_length)
## [1] "double"
  1. Add columns that list the date, month, day, and year of each ride this will allow us to aggregate ride data for each month, day, or year
all_trip$date <- as.Date(all_trip$started_at) #The default format is yyyy-mm-dd
all_trip$year <- format(as.Date(all_trip$date), "%Y")
all_trip$month <- format(as.Date(all_trip$date), "%m")
all_trip$day <- format(as.Date(all_trip$date), "%d")
all_trip$day_of_the_week <- format(as.Date(all_trip$date),"%A")

To confirm the new columns

colnames(all_trip)
##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "start_station_id"   "start_station_name"
##  [7] "end_station_id"     "end_station_name"   "member_casual"     
## [10] "ride_length"        "date"               "year"              
## [13] "month"              "day"                "day_of_the_week"

Filter out all the data where the ride_length is less than one seconds and also where the start station name is “HQ QR”

all_trip_v2 <- all_trip[!(all_trip$ride_length < 1 | all_trip$start_station_name == "HQ QR"),]

.

All the data have been stored appropriately and has been prepared for analysis

  • Descriptive analysis on ride_length (all figures in seconds)
mean(all_trip_v2$ride_length) #straight average (total ride length / rides)
median(all_trip_v2$ride_length) #midpoint number in the ascending array of ride lengths
max(all_trip_v2$ride_length) #longest ride
min(all_trip_v2$ride_length) #shortest ride

aggregating members_casual and ride_length(in seconds)

aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual, FUN = mean)
##   all_trip_v2$member_casual all_trip_v2$ride_length
## 1                    casual               3552.7502
## 2                    member                850.0662
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual, FUN = median)
##   all_trip_v2$member_casual all_trip_v2$ride_length
## 1                    casual                    1546
## 2                    member                     589
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual, FUN = max)
##   all_trip_v2$member_casual all_trip_v2$ride_length
## 1                    casual                 9387024
## 2                    member                 9056634
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual, FUN = min)
##   all_trip_v2$member_casual all_trip_v2$ride_length
## 1                    casual                       2
## 2                    member                       1

aggregating ride_length and the days_of_the_week

all_trip_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual+weekday, FUN = mean) 
##    all_trip_v2$member_casual weekday all_trip_v2$ride_length
## 1                     casual     Sun               3581.4054
## 2                     member     Sun                919.9746
## 3                     casual     Mon               3372.2869
## 4                     member     Mon                842.5726
## 5                     casual     Tue               3596.3599
## 6                     member     Tue                826.1427
## 7                     casual     Wed               3718.6619
## 8                     member     Wed                823.9996
## 9                     casual     Thu               3682.9847
## 10                    member     Thu                823.9278
## 11                    casual     Fri               3773.8351
## 12                    member     Fri                824.5305
## 13                    casual     Sat               3331.9138
## 14                    member     Sat                968.9337
all_trip_v2 %>%
 mutate(weekday = wday(started_at, label = TRUE)) %>% 
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual+weekday, FUN = max)  
##    all_trip_v2$member_casual weekday all_trip_v2$ride_length
## 1                     casual     Sun                 8636205
## 2                     member     Sun                 2910776
## 3                     casual     Mon                 7247750
## 4                     member     Mon                 8203638
## 5                     casual     Tue                 7522062
## 6                     member     Tue                 9056634
## 7                     casual     Wed                 7606872
## 8                     member     Wed                 5628779
## 9                     casual     Thu                 9387024
## 10                    member     Thu                 6028602
## 11                    casual     Fri                 7939448
## 12                    member     Fri                 4840301
## 13                    casual     Sat                 8116785
## 14                    member     Sat                 4805491
all_trip_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% 
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual+weekday, FUN = min)  
##    all_trip_v2$member_casual weekday all_trip_v2$ride_length
## 1                     casual     Sun                       3
## 2                     member     Sun                       2
## 3                     casual     Mon                       8
## 4                     member     Mon                       2
## 5                     casual     Tue                      11
## 6                     member     Tue                       2
## 7                     casual     Wed                       6
## 8                     member     Wed                       2
## 9                     casual     Thu                      11
## 10                    member     Thu                       1
## 11                    casual     Fri                       2
## 12                    member     Fri                       1
## 13                    casual     Sat                       3
## 14                    member     Sat                       1
 all_trip_v2 %>%
 mutate(weekday = wday(started_at, label = TRUE)) %>% 
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual+weekday, FUN = median)
##    all_trip_v2$member_casual weekday all_trip_v2$ride_length
## 1                     casual     Sun                    1626
## 2                     member     Sun                     641
## 3                     casual     Mon                    1547
## 4                     member     Mon                     579
## 5                     casual     Tue                    1428
## 6                     member     Tue                     580
## 7                     casual     Wed                    1419
## 8                     member     Wed                     583
## 9                     casual     Thu                    1430
## 10                    member     Thu                     581
## 11                    casual     Fri                    1499
## 12                    member     Fri                     573
## 13                    casual     Sat                    1658
## 14                    member     Sat                     640

Analyze number of rides by riders type during the weekday

  all_trip_v2 %>% 
  mutate(weekdays = wday(started_at, label = TRUE)) %>%
  group_by(weekdays,member_casual) %>% 
  summarise(number_of_ride = n()) %>% 
  arrange(weekdays)
## `summarise()` has grouped output by 'weekdays'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups:   weekdays [7]
##    weekdays member_casual number_of_ride
##    <ord>    <chr>                  <int>
##  1 Sun      casual                181293
##  2 Sun      member                267965
##  3 Mon      casual                103296
##  4 Mon      member                472196
##  5 Tue      casual                 90510
##  6 Tue      member                508445
##  7 Wed      casual                 92457
##  8 Wed      member                500329
##  9 Thu      casual                102679
## 10 Thu      member                484177
## 11 Fri      casual                122404
## 12 Fri      member                452790
## 13 Sat      casual                209543
## 14 Sat      member                287958

Analyze total number of ride by member_casual(rider type) and months of ride

all_trip_v2 %>% 
  group_by(member_casual,month) %>% 
  summarise(number_of_rides = n()) %>% 
  arrange(month)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 24 × 3
## # Groups:   member_casual [2]
##    member_casual month number_of_rides
##    <chr>         <chr>           <int>
##  1 casual        01               7785
##  2 member        01             136099
##  3 casual        02              12314
##  4 member        02             126714
##  5 casual        03              24615
##  6 member        03             115593
##  7 casual        04              47744
##  8 member        04             217566
##  9 casual        05              81624
## 10 member        05             285834
## # … with 14 more rows

Analyze the ride_length(total distance) and month of the ride

all_trip_v2 %>% 
  group_by(month) %>% 
  summarise(ride_length = sum(ride_length)) %>% 
  arrange(month)
## # A tibble: 12 × 2
##    month ride_length
##    <chr>       <dbl>
##  1 01      166548847
##  2 02      195843763
##  3 03      204036659
##  4 04      322327235
##  5 05      488463093
##  6 06      660063474
##  7 07     1006336596
##  8 08     1143030984
##  9 09      709088948
## 10 10      486573591
## 11 11      193487505
## 12 12      157404301

Analyze the number of ride by types of riders(member_casual)

all_trip_v2 %>% 
  group_by(member_casual) %>% 
  summarise(count_of_rides = length(ride_length))
## # A tibble: 2 × 2
##   member_casual count_of_rides
##   <chr>                  <int>
## 1 casual                902182
## 2 member               2973860

Supporting visualizations and key findings

Visualize average_duration of riders type and weekday

all_trip_v2 %>% 
  mutate(weekday = wday(started_at , label = TRUE)) %>%
  group_by(member_casual, weekday) %>%  
  summarise(average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday) %>% 
  ggplot(aes( x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") + 
  labs(title = "Average duration of ride by day of the week", subtitle = "ride_length and weekday")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

from the graph above, casual riders tends to rides for more distance than membership riders during the week days, Membership riders rides a little bit more on saturdays while for casual riders, they rides more on Friday. Generally, both riders type makes more use of the bikes on weekends.

Visualize total number of ride by member_casual(rider type) and months of ride

all_trip_v2 %>% 
  group_by(member_casual,month) %>% 
  summarise(number_of_rides = n()) %>% 
  arrange(member_casual,month) %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual )) +
  theme(axis.text.x = element_text(angle = 45)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

From the graph above, it can be deduced that members have the most numbers of rides than casual riders, it can also be confirmed that August has the highest number of rides by both members and casual riders, month july, August and september recorded the most number of riders for both riders which stood above 350000 rides for casual and partly over 100000 rides for casual riders,january to march recorded the lowest number of rides for both parties

Visualize the ride_length(total distance) and month of the ride

all_trip_v2 %>% 
  group_by(member_casual,month) %>% 
  summarise(average_ride_length = mean(ride_length)) %>% 
  arrange(month) %>% 
  ggplot(aes(x = month, y = average_ride_length, fill = member_casual)) +
  labs(title = "Average ride distance of members and casual riders by month")+
  theme(axis.text.x = element_text(angle = 45)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

From the graph above, casual riders ride for a longer distance than the member users, likewise, the month of January and February has the longest distance of rides from casual riders whereas the membership riders maintained a uniform riding distance ranging from over 2500 to 5000

Visualize members and casuals by the number of ride taken (ride count)

all_trip_v2 %>% 
  group_by(member_casual) %>% 
  summarise(ride_count = length(ride_id)) %>% 
  arrange(member_casual) %>% 
  ggplot(aes(x = member_casual,y = ride_count,fill = member_casual)) +
  geom_col(position = "dodge")+
  labs(title = "Total rides taken (ride_count) of Members and Casual riders")+
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

From the above graph, we can observe that there are more member rides(2973860) compared to casual rides(902182) based on the ride count

compare Members and Casual riders depending on ride distance

all_trip_v2 %>% 
  group_by(member_casual) %>% 
  summarise(average_ride_length = mean(ride_length)) %>%
  ggplot() + 
  geom_col(mapping= aes(x= member_casual,y= average_ride_length,fill=member_casual), show.legend = FALSE)+
  labs(title = "Mean distance traveled by Members and Casual riders")

From the graph, casual riders tends to travel for more distance than the memberships riders

Conclusion

  • Members have more rides compared to casual riders.

  • We have more members riding in all months compared to casual riders.

  • Casual riders rides for a longer time period.

  • casual riders ride more throughout the entire week days than the member riders

  • Casual riders go farther in terms of distance both by week days and months than the memberhip riders

top three recommendations based on this analysis

  1. Have a slash sale or promo for casual riders so they can acquire more bikes and indulge them in the benefits of being a member.

  2. Host fun biking competitions with prizes at intervals for casual riders during August due to their high number of rides during that time, this will also attract them to get a membership.

  3. Encourage casual riders to ride more in the entire year through advertisement, hand flyers, by giving them various coupons so as to convince them into being a member.