You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
This project will use the data provided by Google, for more details, click here
Setting up the environment
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## Warning: package 'readr' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
Import files ranging from 202004 to 202012
filenames <-list.files("C:\\Users\\BinZhang\\Desktop\\Cap1", pattern = "*.csv", full.names = TRUE)
tripdata <- rbindlist(lapply(filenames, fread))
Data cleaning: remove rows with NA values.
cleaned_tripdata<-tripdata[complete.cases(tripdata[,]),]
Date cleaning: remove more rows with missing/blank values in certain columns.
cleaned_tripdata<-cleaned_tripdata[!(cleaned_tripdata$start_station_name =="" | cleaned_tripdata$start_station_id =="" | cleaned_tripdata$end_station_name == "" | cleaned_tripdata$end_station_id == "" ),]
data cleaning: filter out data with started_at > ended_at
cleaned_tripdata <- cleaned_tripdata %>%
filter(started_at < ended_at)
Add a new column for the duration of each trip in min:
diff_time_min <- difftime(as.POSIXct(cleaned_tripdata$ended_at), as.POSIXct(cleaned_tripdata$started_at), units = "mins")
cleaned_tripdata <- cleaned_tripdata %>%
mutate(trip_duration_min = as.integer(diff_time_min))
Calculate the average time of the service being used by members and casual riders.
cleaned_tripdata %>%
group_by(member_casual) %>%
drop_na() %>%
summarize(ave_time = mean(trip_duration_min))
## # A tibble: 2 x 2
## member_casual ave_time
## <chr> <dbl>
## 1 casual 45.6
## 2 member 15.8
Let’s find out what the top 5 busiest and least busy start stations are.
cleaned_tripdata %>%
count(start_station_name, member_casual, sort = TRUE)
## start_station_name member_casual n
## 1: Streeter Dr & Grand Ave casual 24329
## 2: Lake Shore Dr & Monroe St casual 18302
## 3: Millennium Park casual 17589
## 4: Clark St & Elm St member 17472
## 5: Theater on the Lake casual 13950
## ---
## 1354: Stewart Ave & 63rd St (*) member 1
## 1355: Torrence Ave & 106th St member 1
## 1356: Torrence Ave & 126th Pl member 1
## 1357: Western Ave & 104th St member 1
## 1358: Western Ave & 111th St member 1
Let’s find out what the top 5 busiest and least busy end stations are.
cleaned_tripdata %>%
count(end_station_name, member_casual, sort = TRUE)
## end_station_name member_casual n
## 1: Streeter Dr & Grand Ave casual 26607
## 2: Millennium Park casual 18305
## 3: Lake Shore Dr & Monroe St casual 17984
## 4: Clark St & Elm St member 17928
## 5: Theater on the Lake casual 15693
## ---
## 1362: Stewart Ave & 63rd St (*) member 1
## 1363: Torrence Ave & 106th St member 1
## 1364: Torrence Ave & 126th Pl member 1
## 1365: Western Ave & 104th St member 1
## 1366: Western Ave & 111th St member 1
Let’s find out how busy each day is within a week.
cleaned_tripdata %>%
count(weekdays(as.Date(cleaned_tripdata$started_at)), sort = TRUE)
## weekdays(as.Date(cleaned_tripdata$started_at)) n
## 1: Saturday 559587
## 2: Sunday 452936
## 3: Friday 442346
## 4: Thursday 402126
## 5: Wednesday 390278
## 6: Tuesday 357311
## 7: Monday 347175
Insert a new column showing on which day each ridership happens.
cleaned_tripdata <- cleaned_tripdata %>%
mutate(day_of_week = weekdays(as.Date(started_at)))
We can see that docked_bike is the most popular choice among both members and casual riders.
cleaned_tripdata %>%
group_by(member_casual, rideable_type) %>%
summarize(num_of_rides = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: member_casual [2]
## member_casual rideable_type num_of_rides
## <chr> <chr> <int>
## 1 casual classic_bike 11260
## 2 casual docked_bike 1092325
## 3 casual electric_bike 145379
## 4 member classic_bike 59143
## 5 member docked_bike 1432356
## 6 member electric_bike 211296
Prepare for the visualization.
options(scipen = 999)
cleaned_tripdata$day_of_week <- factor(cleaned_tripdata$day_of_week, levels = c("Monday","Tuesday","Wednesday", "Thursday","Friday","Saturday","Sunday"))
Visualize the popularity of each ride type.
ggplot(cleaned_tripdata) +
geom_bar(aes(x = cleaned_tripdata$day_of_week, fill = rideable_type)) +
labs(title = "Share of Ride Type",
caption = paste0("Data from Google Capstone Project 1"),
x = "Weekdays", y = "Number of rides")
Visualize the weekly usage of each ride type for members and casual riders respectively.
ggplot(cleaned_tripdata)+
geom_bar(aes(x = cleaned_tripdata$day_of_week, fill = rideable_type))+
facet_wrap(~member_casual)+
labs(title = "Share of Ride Type",
caption = paste0("Data from Google Capstone Project 1"),
x = "Weekdays", y = "Number of rides")
Due to data privacy issues, we can’t determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes; also, for the same reason, we won’t be able to find out which data is from the same rider and which one is not, therefore, we assume that the data from each row is from a distinct rider.
Based on the data we have, it’s hard to come up with a strategy of how to convert casual cyclists to members. However, there are some obvious trends to consider that may help us when making decisions: