knitr::opts_chunk$set(echo = TRUE)

Introduction

This is for the Google Data Analytics Capstone project on Coursera. For the purposes of this exercise, we will be using public data from a bike share company called Divvy. In this exercise, I work for a fictional company, Divvy. In order to answer the key business questions, I will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

About the Company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic finance analyst: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Questions for the analysis

1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?

The Business Task

To deliver recommendations by analyzing the 2022 trip data for Divvy to understand how casual riders, and annual members use the service differently. Using that information I will try to provide answers on how to convert casual riders to annual members.

Description of Data Sources used

I used the most recent twelve-month (#December_2021-September_2022) historical trip datasets provided by Divvy to perform this analysis. This includes twelve different files each containing a month of data ranging from August 2021, to July 2022. This is public data that you can use to explore how different customer types are using Divvy bikes. The data has been made available by Motivate International Inc. under this license.

Each dataset is made up of thirteen columns:

[1] ride_id   rideable_type   started_at
[4] ended_at   start_station_name   start_station_id
[7] end_station_name   end_station_id   start_lat
[10] start_lng   end_lat   end_lng
[13] member_casual

Process Phase:

In this phase we will process the data by cleaning and ensuring that it is correct,relevant,complete and error free. - We have to check if data contains any missing or null values -Transform the data into format we want for the analysis

Cleaning and Manipulation of Data

I will be using the programming language R to conduct this analysis.

First I need to install and load relevant packages:

Loading Packags

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(timechange)
library(ggplot2)
library(dplyr)

Upload Dataset

In the project we make use of Cyclistic’s historical trip data to analyze and identify trends.Download the previous 12 months of Cyclistic trip data here. (Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this license.) This is a public data that can be use to explore how different customer types are using Cyclistic bikes.

Next I need to load, read and assign a variable to all twelve csv files to Rstudio Environment.

Dec_21 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202112-divvy-tripdata.csv")
Nov_21 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202111-divvy-tripdata.csv")
Oct_21 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202110-divvy-tripdata.csv")
Jan_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202201-divvy-tripdata.csv")
Feb_22 <- read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202202-divvy-tripdata.csv")
Mar_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202203-divvy-tripdata.csv")
Apr_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202204-divvy-tripdata.csv")
May_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202205-divvy-tripdata.csv")
Jun_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202206-divvy-tripdata.csv")
Jul_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202207-divvy-tripdata.csv")
Aug_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202208-divvy-tripdata.csv")
Sep_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202209-divvy-publictripdata.csv")

To ensure consistency, we make sure all of these files have the same columns, so I will run colnames() on each csv file. Once I verify that all csv files contain each of the 13 columns shown above, I need to use rbind() to merge all files together vertically into a single large data frame.

Total_trip <- rbind(Dec_21, Nov_21, Oct_21, Jan_22, Feb_22, Mar_22, Apr_22, May_22, Jun_22, Jul_22 , Aug_22, Sep_22)

Check and Removing ‘unwanted bad’ data

To make sure our data is clean, let’s check for and remove NA/null values. We will have to assign this to a new data frame.

Total_trip_1 <- na.omit(Total_trip)

We are going to add a new column called “ride_length” to calculate the length of each trip in minutes. We will use the difftime() function to do this, and shown in the chunk below:

Total_trip_1$ride_length <- difftime(Total_trip_1$ended_at, Total_trip_1$started_at, units="mins")

Let check the internal structure of (Total_trip_1)

This will help to confirm the data_types that was previuosly loaded.

str(Total_trip_1)
## 'data.frame':    5822391 obs. of  14 variables:
##  $ ride_id           : chr  "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : chr  "2021-12-07 15:06:07" "2021-12-11 03:43:29" "2021-12-15 23:10:28" "2021-12-26 16:16:10" ...
##  $ ended_at          : chr  "2021-12-07 15:13:42" "2021-12-11 04:10:23" "2021-12-15 23:23:14" "2021-12-26 16:30:53" ...
##  $ start_station_name: chr  "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
##  $ start_station_id  : chr  "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
##  $ end_station_name  : chr  "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
##  $ end_station_id    : chr  "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num  41.9 42 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "casual" "member" "member" ...
##  $ ride_length       : 'difftime' num  7.58333333333333 26.9 12.7666666666667 14.7166666666667 ...
##   ..- attr(*, "units")= chr "mins"
##  - attr(*, "na.action")= 'omit' Named int [1:5844] 3342 3391 3467 3586 3681 3692 3693 111319 111481 111526 ...
##   ..- attr(*, "names")= chr [1:5844] "3342" "3391" "3467" "3586" ...

Let confirm if all ‘unwanted bad’ data have been removed

Using View(trip_data) to check out our new column, we can see that there are some negative ride times. Let’s check how many columns display negative ride lengths and then remove them:

Total_trip_2 <- Total_trip_1[!Total_trip_1$ride_length < 0,]

Now we are going to add four new columns. The first will be populated with the abbreviated day of the week that the ride started on. We will also make a column for the month and the year. We can use this to explore how both casual riders and members use the bikes differently depending on the day of the week, month, or year. Going further, lets add a column isolating the hour that the ride started_at

Even though we already have this data in the “started_at” and “ended_at” columns, breaking it down into its own columns will make it easier to aggregate later on.

Total_trip_2$month <- format(as.Date(Total_trip_2$started_at), "%m")
Total_trip_2$year <- format(as.Date(Total_trip_2$started_at), "%Y")
Total_trip_2$start_hour <- format(as.POSIXct(Total_trip_2$started_at), format = "%H")
Total_trip_2 <- mutate(Total_trip_2, started_at = as_datetime(started_at), ended_at = as_datetime(ended_at)) 
Total_trip_2 <- mutate(Total_trip_2, "days_of_week" = weekdays(started_at))
glimpse(Total_trip_2)
## Rows: 5,822,283
## Columns: 18
## $ ride_id            <chr> "46F8167220E4431F", "73A77762838B32FD", "4CF4245205…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-12-07 15:06:07, 2021-12-11 03:43:29, 2021-12-…
## $ ended_at           <dttm> 2021-12-07 15:13:42, 2021-12-11 04:10:23, 2021-12-…
## $ start_station_name <chr> "Laflin St & Cullerton St", "LaSalle Dr & Huron St"…
## $ start_station_id   <chr> "13307", "KP1705001026", "KA1504000117", "KA1504000…
## $ end_station_name   <chr> "Morgan St & Polk St", "Clarendon Ave & Leland Ave"…
## $ end_station_id     <chr> "TA1307000130", "TA1307000119", "13137", "KP1705001…
## $ start_lat          <dbl> 41.85483, 41.89441, 41.89936, 41.89939, 41.89558, 4…
## $ start_lng          <dbl> -87.66366, -87.63233, -87.64852, -87.64854, -87.682…
## $ end_lat            <dbl> 41.87197, 41.96797, 41.93758, 41.89488, 41.93125, 4…
## $ end_lng            <dbl> -87.65097, -87.65000, -87.64410, -87.63233, -87.644…
## $ member_casual      <chr> "member", "casual", "member", "member", "member", "…
## $ ride_length        <drtn> 7.583333 mins, 26.900000 mins, 12.766667 mins, 14.…
## $ month              <chr> "12", "12", "12", "12", "12", "12", "12", "12", "12…
## $ year               <chr> "2021", "2021", "2021", "2021", "2021", "2021", "20…
## $ start_hour         <chr> "15", "03", "23", "16", "11", "18", "15", "13", "14…
## $ days_of_week       <chr> "Tuesday", "Saturday", "Wednesday", "Sunday", "Thur…

Before we move on to the analysis, we need to first convert ‘ride_length’ from Factor to Numeric so we can run calculations on the data.

is.factor(Total_trip_2$ride_length)
## [1] FALSE
Total_trip_2$ride_length <- as.numeric(as.character(Total_trip_2$ride_length))
is.numeric(Total_trip_2$ride_length)
## [1] TRUE

Analyze Phase:

Now we have everything we need to move on to the analysis phase.

Descriptive Analysis of the Data

We will use the summary() function to quickly get the min, median, mean, and max for “ride_length”.

summary(Total_trip_2$ride_length)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     5.93    10.48    16.76    18.82 40705.02

Lets compare ride_length between casual riders, and members.

#Number of rides per hour of the day
Total_trip_2 %>% 
  mutate(hour = format(as.POSIXct(Total_trip_2$started_at), format = "%H")) %>% 
  group_by(member_casual, hour) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, hour)  %>% 
  ggplot(aes(x = hour, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Based on the graph above, we can come to the following conclusions: Members tend to have a higher frequency of rides between 6:00-9:00 and 15:00-19:00. This would seem to suggest that annual members use their bikes to commute to and from work. Casual riders have a spike in number of rides around 15:00-18:00 as well, but that isn’t reflected in the morning meaning this is likely not reflecting casual riders using the bikes to commute to and from work but rather perhaps after work activities.

Lets break it down by weekday.

# analyze ridership data by type and weekday
Total_trip_2 %>% 
  mutate(day_of_week = wday(started_at, label = TRUE)) %>%  #creates weekday field using wday()
  group_by(member_casual, day_of_week) %>%  #groups by usertype(member_casual) and weekday
  summarise(number_of_rides = n() #calculates the number of rides and average duration 
            ,average_duration = mean(ride_length)) %>% # calculates the average duration
  arrange(member_casual, day_of_week) # sorts
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual day_of_week number_of_rides average_duration
##    <chr>         <ord>                 <int>            <dbl>
##  1 casual        Sun                  403983             26.3
##  2 casual        Mon                  279170             23.4
##  3 casual        Tue                  275219             20.3
##  4 casual        Wed                  281149             19.8
##  5 casual        Thu                  306111             20.3
##  6 casual        Fri                  351729             21.3
##  7 casual        Sat                  498682             25.5
##  8 member        Sun                  393502             13.9
##  9 member        Mon                  472981             12.1
## 10 member        Tue                  541416             11.9
## 11 member        Wed                  538401             11.9
## 12 member        Thu                  530450             12.0
## 13 member        Fri                  491368             12.3
## 14 member        Sat                  458122             13.9

Let’s visualize the number of rides by rider type per weekday

# Let's visualize the number of rides by rider type per weekday
Total_trip_2 %>% 
  mutate(day_of_week = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, day_of_week)  %>% 
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

We can make some interesting discoveries from the graph shown above. We can see that during week day, members have a much higher frequency of rides than casual riders. Casual riders however, pull ahead with slightly more rides on the weekends, while the number of rides for members actually decreases slightly.

Let’s create a visualization for average duration

Total_trip_2 %>% 
  mutate(day_of_week = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, day_of_week)  %>% 
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

In this graph, we find that casual riders, while not taking quite as many rides, do in fact take much longer rides than members throughout each day of the week, with slight increases on the weekends. The average duration for members rides stays very consistent throughout the week.

Let’s compare number_of_rides per month for casual and annual members.

Total_trip_2 %>% 
  mutate(month = format(as.Date(Total_trip_2$started_at), "%m")) %>% 
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

We observe from this graph that casual riders always has less number of rides than annual members in every months of the year. It is worth noting that both members and casual riders number of rides follows a bell curve, meaning that both parties are using bikes more in the warmer months, than the colder ones. Casual riders bike trips dips drastically between November and March.

Lets see what type of bike (rideable_type) casual riders and members prefer.

Total_trip_2 %>% 
  group_by(member_casual, rideable_type) %>% 
  arrange(member_casual, rideable_type)  %>% 
  ggplot(aes(x = rideable_type, fill = member_casual)) +
  geom_bar(position = "dodge")

Based on this graph, both members and casual riders prefer to use the classic bike type as well electric bike type. It appear to be less important to focus on the type of bike the riders used.

Conclusion and recommendations

Casual riders take less trips during the week day than annual members do. We believe this to be due to members taking more rides on the way to work, and on the way home from work.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.