Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask:

Prepare

This project will use the data provided by Google, for more details, click here

Code:

Setting up the environment

library(tidyverse)  
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'readr' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)

Import files ranging from 202004 to 202012

filenames <-list.files("C:\\Users\\BinZhang\\Desktop\\Cap1", pattern = "*.csv", full.names = TRUE)
tripdata <- rbindlist(lapply(filenames, fread))

Data cleaning: remove rows with NA values.

cleaned_tripdata<-tripdata[complete.cases(tripdata[,]),]

Date cleaning: remove more rows with missing/blank values in certain columns.

cleaned_tripdata<-cleaned_tripdata[!(cleaned_tripdata$start_station_name =="" | cleaned_tripdata$start_station_id =="" | cleaned_tripdata$end_station_name == "" | cleaned_tripdata$end_station_id == "" ),]

data cleaning: filter out data with started_at > ended_at

cleaned_tripdata <- cleaned_tripdata %>% 
  filter(started_at < ended_at)

Add a new column for the duration of each trip in min:

diff_time_min <- difftime(as.POSIXct(cleaned_tripdata$ended_at), as.POSIXct(cleaned_tripdata$started_at), units = "mins")
cleaned_tripdata <- cleaned_tripdata %>% 
  mutate(trip_duration_min = as.integer(diff_time_min))

Calculate the average time of the service being used by members and casual riders.

cleaned_tripdata %>% 
  group_by(member_casual) %>%
  drop_na() %>% 
  summarize(ave_time = mean(trip_duration_min))
## # A tibble: 2 x 2
##   member_casual ave_time
##   <chr>            <dbl>
## 1 casual            45.6
## 2 member            15.8

Let’s find out what the top 5 busiest and least busy start stations are.

cleaned_tripdata %>% 
  count(start_station_name, member_casual, sort = TRUE)
##              start_station_name member_casual     n
##    1:   Streeter Dr & Grand Ave        casual 24329
##    2: Lake Shore Dr & Monroe St        casual 18302
##    3:           Millennium Park        casual 17589
##    4:         Clark St & Elm St        member 17472
##    5:       Theater on the Lake        casual 13950
##   ---                                              
## 1354: Stewart Ave & 63rd St (*)        member     1
## 1355:   Torrence Ave & 106th St        member     1
## 1356:   Torrence Ave & 126th Pl        member     1
## 1357:    Western Ave & 104th St        member     1
## 1358:    Western Ave & 111th St        member     1

Let’s find out what the top 5 busiest and least busy end stations are.

cleaned_tripdata %>% 
  count(end_station_name, member_casual, sort = TRUE)
##                end_station_name member_casual     n
##    1:   Streeter Dr & Grand Ave        casual 26607
##    2:           Millennium Park        casual 18305
##    3: Lake Shore Dr & Monroe St        casual 17984
##    4:         Clark St & Elm St        member 17928
##    5:       Theater on the Lake        casual 15693
##   ---                                              
## 1362: Stewart Ave & 63rd St (*)        member     1
## 1363:   Torrence Ave & 106th St        member     1
## 1364:   Torrence Ave & 126th Pl        member     1
## 1365:    Western Ave & 104th St        member     1
## 1366:    Western Ave & 111th St        member     1

Let’s find out how busy each day is within a week.

cleaned_tripdata %>% 
  count(weekdays(as.Date(cleaned_tripdata$started_at)), sort = TRUE)
##    weekdays(as.Date(cleaned_tripdata$started_at))      n
## 1:                                       Saturday 559587
## 2:                                         Sunday 452936
## 3:                                         Friday 442346
## 4:                                       Thursday 402126
## 5:                                      Wednesday 390278
## 6:                                        Tuesday 357311
## 7:                                         Monday 347175

Insert a new column showing on which day each ridership happens.

cleaned_tripdata <- cleaned_tripdata %>% 
  mutate(day_of_week = weekdays(as.Date(started_at)))

We can see that docked_bike is the most popular choice among both members and casual riders.

cleaned_tripdata %>% 
  group_by(member_casual, rideable_type) %>% 
  summarize(num_of_rides = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups:   member_casual [2]
##   member_casual rideable_type num_of_rides
##   <chr>         <chr>                <int>
## 1 casual        classic_bike         11260
## 2 casual        docked_bike        1092325
## 3 casual        electric_bike       145379
## 4 member        classic_bike         59143
## 5 member        docked_bike        1432356
## 6 member        electric_bike       211296

Prepare for the visualization.

options(scipen = 999)
cleaned_tripdata$day_of_week <- factor(cleaned_tripdata$day_of_week, levels = c("Monday","Tuesday","Wednesday", "Thursday","Friday","Saturday","Sunday"))

Visualize the popularity of each ride type.

ggplot(cleaned_tripdata) +
  geom_bar(aes(x = cleaned_tripdata$day_of_week, fill = rideable_type)) +
  labs(title = "Share of Ride Type",
       caption = paste0("Data from Google Capstone Project 1"),
        x = "Weekdays", y = "Number of rides")

Visualize the weekly usage of each ride type for members and casual riders respectively.

ggplot(cleaned_tripdata)+
  geom_bar(aes(x = cleaned_tripdata$day_of_week, fill = rideable_type))+
  facet_wrap(~member_casual)+
  labs(title = "Share of Ride Type",
       caption = paste0("Data from Google Capstone Project 1"),
       x = "Weekdays", y = "Number of rides")

Bias:

Due to data privacy issues, we can’t determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes; also, for the same reason, we won’t be able to find out which data is from the same rider and which one is not, therefore, we assume that the data from each row is from a distinct rider.

Conclusion:

Based on the data we have, it’s hard to come up with a strategy of how to convert casual cyclists to members. However, there are some obvious trends to consider that may help us when making decisions:

Recommendation: