Data Analysis Casestudy Project: Cyclystic Bike-Share


Google coursera data analytics capstone. In this scenario I am a junior data analyst working for cylistic bikes. With the goal of maximising annual memberships we examine the data set and perform the phases of data analysis while preparing deliverables.


Business task: Find ways of maximising anual memberships by understanding and comparing casual and annual ridership.
Stakeholders: Lily Moreno (Dir. of marketing), the cyclistic marketing analytics team and the cylistic executive team.
Data: 12 months of historical trip data in CSV form. It is from the 2019 year, split into four quarters contained in four different files.
Public license: It is made available through motivate international inc by https://ride.divvybikes.com/data-license-agreemente
Data limitations: A limitation of the data is that due to data privacy concerns we cannot acess and connect pass purchases to rider addresses.
Programs used: RStudio
R packages used: tidyverse, ggplot2, dplyr and lubridate


We will start by installing the packages with install.packages() and then intializing them with library(). These packages are what we will use to perform data cleaning, filtering and graphing.

install.packages("tidyverse")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("lubridate")
library("tidyverse")  # "designed to make it easy to install and load multiple 'tidyverse' packages in a single step"
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("ggplot2")    # "You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
library("dplyr")      #
library("lubridate")  # "Lubridate provides tools that make it easier to parse and manipulate dates."


Next we will take the raw data in the CSV files and use the read_csv() function to bring them into R.

q1 <- read_csv("2019_Q1.csv")
## Rows: 365069 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num  (1): tripduration
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q2 <- read_csv("2019_Q2.csv")
## Rows: 1108163 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
## dbl  (5): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 03 - R...
## num  (1): 01 - Rental Details Duration In Seconds Uncapped
## dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q3 <- read_csv("2019_Q3.csv")
## Rows: 1640718 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num  (1): tripduration
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q4 <- read_csv("2019_Q4.csv")
## Rows: 704054 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num  (1): tripduration
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Preparing the Data:

Now that the CSV files are brought into R we can perform some initial observations to understand the datasets. We use colnames() to learn what the column names are in each of the four datasets. The column names in the q2 (2019_q2.csv) dataset need to be appropriately changed to achieve standerdisation.

colnames(q1)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q2) # The colnames are different in this data set than the rest. 
##  [1] "01 - Rental Details Rental ID"                   
##  [2] "01 - Rental Details Local Start Time"            
##  [3] "01 - Rental Details Local End Time"              
##  [4] "01 - Rental Details Bike ID"                     
##  [5] "01 - Rental Details Duration In Seconds Uncapped"
##  [6] "03 - Rental Start Station ID"                    
##  [7] "03 - Rental Start Station Name"                  
##  [8] "02 - Rental End Station ID"                      
##  [9] "02 - Rental End Station Name"                    
## [10] "User Type"                                       
## [11] "Member Gender"                                   
## [12] "05 - Member Details Member Birthday Year"
colnames(q3)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q4)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"


With glimpse() we can also see what type of data in contained in each colummn and if its the same.

glimpse(q1)
## Rows: 365,069
## Columns: 12
## $ trip_id           <dbl> 21742443, 21742444, 21742445, 21742446, 21742447, 21…
## $ start_time        <dttm> 2019-01-01 00:04:37, 2019-01-01 00:08:13, 2019-01-0…
## $ end_time          <dttm> 2019-01-01 00:11:07, 2019-01-01 00:15:34, 2019-01-0…
## $ bikeid            <dbl> 2167, 4386, 1524, 252, 1170, 2437, 2708, 2796, 6205,…
## $ tripduration      <dbl> 390, 441, 829, 1783, 364, 216, 177, 100, 1727, 336, …
## $ from_station_id   <dbl> 199, 44, 15, 123, 173, 98, 98, 211, 150, 268, 299, 2…
## $ from_station_name <chr> "Wabash Ave & Grand Ave", "State St & Randolph St", …
## $ to_station_id     <dbl> 84, 624, 644, 176, 35, 49, 49, 142, 148, 141, 295, 4…
## $ to_station_name   <chr> "Milwaukee Ave & Grand Ave", "Dearborn St & Van Bure…
## $ usertype          <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribe…
## $ gender            <chr> "Male", "Female", "Female", "Male", "Male", "Female"…
## $ birthyear         <dbl> 1989, 1990, 1994, 1993, 1994, 1983, 1984, 1990, 1995…
glimpse(q2)
## Rows: 1,108,163
## Columns: 12
## $ `01 - Rental Details Rental ID`                    <dbl> 22178529, 22178530,…
## $ `01 - Rental Details Local Start Time`             <dttm> 2019-04-01 00:02:2…
## $ `01 - Rental Details Local End Time`               <dttm> 2019-04-01 00:09:4…
## $ `01 - Rental Details Bike ID`                      <dbl> 6251, 6226, 5649, 4…
## $ `01 - Rental Details Duration In Seconds Uncapped` <dbl> 446, 1048, 252, 357…
## $ `03 - Rental Start Station ID`                     <dbl> 81, 317, 283, 26, 2…
## $ `03 - Rental Start Station Name`                   <chr> "Daley Center Plaza…
## $ `02 - Rental End Station ID`                       <dbl> 56, 59, 174, 133, 1…
## $ `02 - Rental End Station Name`                     <chr> "Desplaines St & Ki…
## $ `User Type`                                        <chr> "Subscriber", "Subs…
## $ `Member Gender`                                    <chr> "Male", "Female", "…
## $ `05 - Member Details Member Birthday Year`         <dbl> 1975, 1984, 1990, 1…
glimpse(q3)
## Rows: 1,640,718
## Columns: 12
## $ trip_id           <dbl> 23479388, 23479389, 23479390, 23479391, 23479392, 23…
## $ start_time        <dttm> 2019-07-01 00:00:27, 2019-07-01 00:01:16, 2019-07-0…
## $ end_time          <dttm> 2019-07-01 00:20:41, 2019-07-01 00:18:44, 2019-07-0…
## $ bikeid            <dbl> 3591, 5353, 6180, 5540, 6014, 4941, 3770, 5442, 2957…
## $ tripduration      <dbl> 1214, 1048, 1554, 1503, 1213, 310, 1248, 1550, 1583,…
## $ from_station_id   <dbl> 117, 381, 313, 313, 168, 300, 168, 313, 43, 43, 511,…
## $ from_station_name <chr> "Wilton Ave & Belmont Ave", "Western Ave & Monroe St…
## $ to_station_id     <dbl> 497, 203, 144, 144, 62, 232, 62, 144, 195, 195, 84, …
## $ to_station_name   <chr> "Kimball Ave & Belmont Ave", "Western Ave & 21st St"…
## $ usertype          <chr> "Subscriber", "Customer", "Customer", "Customer", "C…
## $ gender            <chr> "Male", NA, NA, NA, NA, "Male", NA, NA, NA, NA, NA, …
## $ birthyear         <dbl> 1992, NA, NA, NA, NA, 1990, NA, NA, NA, NA, NA, NA, …
glimpse(q4)
## Rows: 704,054
## Columns: 12
## $ trip_id           <dbl> 25223640, 25223641, 25223642, 25223643, 25223644, 25…
## $ start_time        <dttm> 2019-10-01 00:01:39, 2019-10-01 00:02:16, 2019-10-0…
## $ end_time          <dttm> 2019-10-01 00:17:20, 2019-10-01 00:06:34, 2019-10-0…
## $ bikeid            <dbl> 2215, 6328, 3003, 3275, 5294, 1891, 1061, 1274, 6011…
## $ tripduration      <dbl> 940, 258, 850, 2350, 1867, 373, 1072, 1458, 1437, 83…
## $ from_station_id   <dbl> 20, 19, 84, 313, 210, 156, 84, 156, 156, 336, 77, 19…
## $ from_station_name <chr> "Sheffield Ave & Kingsbury St", "Throop (Loomis) St …
## $ to_station_id     <dbl> 309, 241, 199, 290, 382, 226, 142, 463, 463, 336, 50…
## $ to_station_name   <chr> "Leavitt St & Armitage Ave", "Morgan St & Polk St", …
## $ usertype          <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribe…
## $ gender            <chr> "Male", "Male", "Female", "Male", "Male", "Female", …
## $ birthyear         <dbl> 1987, 1998, 1991, 1990, 1987, 1994, 1991, 1995, 1993…





Processing the data:

Now that we understand what data we have we can begin to process it for further steps. First we will make sure all column names are the same.

q2 <- q2 %>% 
  rename("trip_id" = "01 - Rental Details Rental ID",
        "start_time"  = "01 - Rental Details Local Start Time",
        "end_time" = "01 - Rental Details Local End Time",
        "bikeid" = "01 - Rental Details Bike ID",
        "tripduration" = "01 - Rental Details Duration In Seconds Uncapped",
        "from_station_id" = "03 - Rental Start Station ID",
        "from_station_name" = "03 - Rental Start Station Name",
        "to_station_id" = "02 - Rental End Station ID",
        "to_station_name" = "02 - Rental End Station Name",
        "usertype" = "User Type",
        "gender" = "Member Gender",
        "birthyear" = "05 - Member Details Member Birthday Year")


Now we can combine the four quarterly datasets into one year long data set that we will call ‘yeardata’ using the bind_rows() function.

yeardata <- bind_rows(q1, q2, q3, q4)


There are two unique values in the Usertype column. We will change them for clearer readability.

yeardata$usertype[yeardata$usertype == 'Subscriber'] <- 'Annual Membership'
yeardata$usertype[yeardata$usertype == 'Customer'] <- 'Single Use'


There is no column that has ride of length. We do have start_time and end_time therefore we can use the difftime() function to determine the length of trip.

yeardata$ride_length <- difftime(yeardata$end_time,yeardata$start_time)


Now we will check for any 0 length or negative length trips that might be anomolies and remove them:

yeardata %>% # With this we can see the rides with lengths at or less than 0.
  select(ride_length) %>% 
  filter(ride_length <= 0.0) 
## # A tibble: 13 × 1
##    ride_length   
##    <drtn>        
##  1 -33.41667 mins
##  2 -35.28333 mins
##  3 -45.15000 mins
##  4 -31.26667 mins
##  5 -46.10000 mins
##  6 -48.95000 mins
##  7 -47.36667 mins
##  8 -48.25000 mins
##  9 -48.28333 mins
## 10 -47.26667 mins
## 11 -53.68333 mins
## 12 -56.36667 mins
## 13 -49.83333 mins
yeardata <- yeardata[yeardata$ride_length > 0.0, ] # Now we establish the data set yeardata with only rides above 0
yeardata %>% # Now we can see if our function worked correctly.
  select(ride_length) %>% 
  filter(ride_length <= 0.0) 
## # A tibble: 0 × 1
## # ℹ 1 variable: ride_length <drtn>
yeardata <- yeardata[!duplicated(yeardata), ] # This function to remove duplicate rows.


Create a new column that extracts the month from start_time in a 1-12 format. Aggregate the date and time into seperate categories: Day, month and Day of week for analysis. Rearrange the display order of day_of_week for context of future charts.

yeardata <- yeardata %>% 
  mutate(trip_month=month(yeardata$start_time))
yeardata$date <- as.Date(yeardata$start_time) 
yeardata$month <- format(as.Date(yeardata$date), "%m") # Month
yeardata$day <- format(as.Date(yeardata$date), "%d") # Day
yeardata$day_of_week <- format(as.Date(yeardata$date), "%A") # Day_of_week
yeardata$day_of_week <- ordered(yeardata$day_of_week, levels=c( "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) # Reorders days of week to Monday first and Sunday last.
colnames(yeardata)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"        
## [13] "ride_length"       "trip_month"        "date"             
## [16] "month"             "day"               "day_of_week"





Visualizing our data:

Lets use ggplot() and geom_bar() to do some graphs. First we will visualise how many yearly riders we have in each usertype category. We can see there are many more annual subscribers than single use subscribers.

ggplot(data=yeardata)+
  geom_bar(mapping=aes(x=usertype,fill=usertype))+
  labs(title="Usertypes compared")




Now we will plot a bar graph of trips per month, split by usertype.

We can see what we would expect in that there is a spike in ridership during the summer months and a downward trend during the “off-season”.

Months are in a 1-12 format. Where 1 = January, 12 = December.

ggplot(data=yeardata)+
  geom_bar(mapping=aes(x=trip_month,fill=usertype))+
  facet_wrap(~usertype)+
  labs(title="Trips per month",subtitle="split by usertype")




Now we will plot a bar graph with difference in genders in terms of users. We can notice there is a disparity between men and women in terms of ridership.

ggplot(data=yeardata)+
  geom_bar(mapping=aes(x=gender,fill=gender)) +
  facet_wrap(~usertype) +
  labs(title="Gender breakdown of riders", subtitle="split by usertype")




Now we will explore users by birthyear. Use coord_cartesian(xlim) to remove front and back end of the X-axis with unnecesary decades i.e. 1890, 2030. We are able to see there is a growing wave of younger people who are using the bike shares.

ggplot(data=yeardata)+
  geom_bar(mapping=aes(x=birthyear,fill=gender))+
  coord_cartesian(xlim = c(1945, 2005))+
  labs(title="Users by birthyear",subtitle="1945-2005")
## Warning: Removed 538746 rows containing non-finite values (`stat_count()`).




Next we will graph day of week usage averages over the year split by usertype. The days of week will now display in order Monday to Sunday as per our prior rearange function.

ggplot(data=yeardata)+
  geom_bar(mapping=aes(x=day_of_week,fill=usertype))+
  facet_wrap(~usertype)+
  theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))+
  labs(title="Weekday usage",subtitle="Split by usertype")





Summary of analysis:

There is a large disparity between male bikers and women bikers. However, if we further break it down by age groups there is a growing proportion of young girls and women using the bike shares. So there is a potential demographic to create ads towards. We also notice that casual ridership peaks on the weekends.
Annual ridership dips on the weekends, as we can assume its primarialy commuters. So weekend casual riders might be convinced into buying annual memberships through ads or discounts.
However, I primarialy recommend further research and studies into these questions to be able to draw more meaningful conclusions.


Recommendations

1: Create advertisement campaign for young people to use the bike shares. Potenital to offer youth discounts for annual memberships.
2: Create an advertisement campaign for casual weekend riders. Possibly offer weekend or seasonal passes.
3: Collect more data on the habits and wants of weekekend riders in terms of purchasing an annual pass