You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
● Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
● Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
● Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
● Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
The ultimate goal is to design marketing strategies aimed at converting casual riders into annual members. To do so, a smaller question that I aim to answer here is: How do annual members and casual riders use Cyclistic bikes differently.
I will be using a total record of 12 months of Cyclistic’s historial trip data to analyze and identify trends. It is a public data set, made available by Motivate International Inc. under this license. My data ranges from April 2019 to March 2020, and it is cited and vetted by Chicago department of transportation.
Note: for data security purposes, every riders’ personal identifiable information is hidden through tokenization.
Excel spreadsheets can only contain 1,048,576 rows by 16,384 columns and the 4 quarterly datasets combined are far beyond that limit, therefore R is used to perform all tasks - organizing, cleaning, analyzing, and visualizing.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(metR)
##
## Attaching package: 'metR'
##
## The following object is masked from 'package:purrr':
##
## cross
Q1_2020<-read_csv("/Users/thomasli/Downloads/Divvy_Trips_2020_Q1.csv")
## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q2_2019<-read_csv("/Users/thomasli/Downloads/Divvy_Trips_2019_Q2.csv")
## Rows: 1108163 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
## dbl (5): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 03 - R...
## dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q3_2019<-read_csv("/Users/thomasli/Downloads/Divvy_Trips_2019_Q3.csv")
## Rows: 1640718 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): from_station_name, to_station_name, usertype, gender
## dbl (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## dttm (2): start_time, end_time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q4_2019<-read_csv("/Users/thomasli/Downloads/Divvy_Trips_2019_Q4.csv")
## Rows: 704054 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): from_station_name, to_station_name, usertype, gender
## dbl (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## dttm (2): start_time, end_time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(Q2_2019)
## [1] "01 - Rental Details Rental ID"
## [2] "01 - Rental Details Local Start Time"
## [3] "01 - Rental Details Local End Time"
## [4] "01 - Rental Details Bike ID"
## [5] "01 - Rental Details Duration In Seconds Uncapped"
## [6] "03 - Rental Start Station ID"
## [7] "03 - Rental Start Station Name"
## [8] "02 - Rental End Station ID"
## [9] "02 - Rental End Station Name"
## [10] "User Type"
## [11] "Member Gender"
## [12] "05 - Member Details Member Birthday Year"
colnames(Q3_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(Q4_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(Q1_2020)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
(Q4_2019 <- rename(Q4_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
## # A tibble: 704,054 × 12
## ride_id started_at ended_at rideable_type tripduration
## <dbl> <dttm> <dttm> <dbl> <dbl>
## 1 25223640 2019-10-01 00:01:39 2019-10-01 00:17:20 2215 940
## 2 25223641 2019-10-01 00:02:16 2019-10-01 00:06:34 6328 258
## 3 25223642 2019-10-01 00:04:32 2019-10-01 00:18:43 3003 850
## 4 25223643 2019-10-01 00:04:32 2019-10-01 00:43:43 3275 2350
## 5 25223644 2019-10-01 00:04:34 2019-10-01 00:35:42 5294 1867
## 6 25223645 2019-10-01 00:04:38 2019-10-01 00:10:51 1891 373
## 7 25223646 2019-10-01 00:04:52 2019-10-01 00:22:45 1061 1072
## 8 25223647 2019-10-01 00:04:57 2019-10-01 00:29:16 1274 1458
## 9 25223648 2019-10-01 00:05:20 2019-10-01 00:29:18 6011 1437
## 10 25223649 2019-10-01 00:05:20 2019-10-01 02:23:46 2957 8306
## # … with 704,044 more rows, and 7 more variables: start_station_id <dbl>,
## # start_station_name <chr>, end_station_id <dbl>, end_station_name <chr>,
## # member_casual <chr>, gender <chr>, birthyear <dbl>
(Q3_2019 <- rename(Q3_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
## # A tibble: 1,640,718 × 12
## ride_id started_at ended_at rideable_type tripduration
## <dbl> <dttm> <dttm> <dbl> <dbl>
## 1 23479388 2019-07-01 00:00:27 2019-07-01 00:20:41 3591 1214
## 2 23479389 2019-07-01 00:01:16 2019-07-01 00:18:44 5353 1048
## 3 23479390 2019-07-01 00:01:48 2019-07-01 00:27:42 6180 1554
## 4 23479391 2019-07-01 00:02:07 2019-07-01 00:27:10 5540 1503
## 5 23479392 2019-07-01 00:02:13 2019-07-01 00:22:26 6014 1213
## 6 23479393 2019-07-01 00:02:21 2019-07-01 00:07:31 4941 310
## 7 23479394 2019-07-01 00:02:24 2019-07-01 00:23:12 3770 1248
## 8 23479395 2019-07-01 00:02:26 2019-07-01 00:28:16 5442 1550
## 9 23479396 2019-07-01 00:02:34 2019-07-01 00:28:57 2957 1583
## 10 23479397 2019-07-01 00:02:45 2019-07-01 00:29:14 6091 1589
## # … with 1,640,708 more rows, and 7 more variables: start_station_id <dbl>,
## # start_station_name <chr>, end_station_id <dbl>, end_station_name <chr>,
## # member_casual <chr>, gender <chr>, birthyear <dbl>
(Q2_2019 <- rename(Q2_2019
,ride_id = "01 - Rental Details Rental ID"
,rideable_type = "01 - Rental Details Bike ID"
,started_at = "01 - Rental Details Local Start Time"
,ended_at = "01 - Rental Details Local End Time"
,start_station_name = "03 - Rental Start Station Name"
,start_station_id = "03 - Rental Start Station ID"
,end_station_name = "02 - Rental End Station Name"
,end_station_id = "02 - Rental End Station ID"
,member_casual = "User Type"))
## # A tibble: 1,108,163 × 12
## ride_id started_at ended_at rideable_type
## <dbl> <dttm> <dttm> <dbl>
## 1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48 6251
## 2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30 6226
## 3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19 5649
## 4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58 4151
## 5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13 3270
## 6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56 3123
## 7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41 6418
## 8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11 4513
## 9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44 3280
## 10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39 5534
## # … with 1,108,153 more rows, and 8 more variables:
## # `01 - Rental Details Duration In Seconds Uncapped` <dbl>,
## # start_station_id <dbl>, start_station_name <chr>, end_station_id <dbl>,
## # end_station_name <chr>, member_casual <chr>, `Member Gender` <chr>,
## # `05 - Member Details Member Birthday Year` <dbl>
summary(Q1_2020)
## ride_id rideable_type started_at
## Length:426887 Length:426887 Min. :2020-01-01 00:04:44.00
## Class :character Class :character 1st Qu.:2020-01-24 14:03:26.00
## Mode :character Mode :character Median :2020-02-17 05:01:27.00
## Mean :2020-02-14 01:23:18.51
## 3rd Qu.:2020-03-05 15:08:13.50
## Max. :2020-03-31 23:51:34.00
##
## ended_at start_station_name start_station_id
## Min. :2020-01-01 00:10:54.00 Length:426887 Min. : 2.0
## 1st Qu.:2020-01-24 14:21:24.50 Class :character 1st Qu.: 77.0
## Median :2020-02-17 05:48:58.00 Mode :character Median :176.0
## Mean :2020-02-14 01:45:25.43 Mean :209.8
## 3rd Qu.:2020-03-05 15:27:54.00 3rd Qu.:298.0
## Max. :2020-05-19 20:10:34.00 Max. :675.0
##
## end_station_name end_station_id start_lat start_lng
## Length:426887 Min. : 2.0 Min. :41.74 Min. :-87.77
## Class :character 1st Qu.: 77.0 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Median :175.0 Median :41.89 Median :-87.64
## Mean :209.3 Mean :41.90 Mean :-87.64
## 3rd Qu.:297.0 3rd Qu.:41.92 3rd Qu.:-87.63
## Max. :675.0 Max. :42.06 Max. :-87.55
## NA's :1
## end_lat end_lng member_casual
## Min. :41.74 Min. :-87.77 Length:426887
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.89 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.64
## 3rd Qu.:41.92 3rd Qu.:-87.63
## Max. :42.06 Max. :-87.55
## NA's :1 NA's :1
summary(Q2_2019)
## ride_id started_at
## Min. :22178529 Min. :2019-04-01 00:02:22.00
## 1st Qu.:22505074 1st Qu.:2019-05-02 15:19:47.50
## Median :22831158 Median :2019-05-26 11:38:51.00
## Mean :22829856 Mean :2019-05-22 23:25:39.56
## 3rd Qu.:23154496 3rd Qu.:2019-06-13 18:18:56.00
## Max. :23479387 Max. :2019-06-30 23:59:05.00
##
## ended_at rideable_type
## Min. :2019-04-01 00:09:48.00 Min. : 1
## 1st Qu.:2019-05-02 15:37:41.00 1st Qu.:1738
## Median :2019-05-26 12:03:20.00 Median :3471
## Mean :2019-05-22 23:47:46.85 Mean :3399
## 3rd Qu.:2019-06-13 18:37:40.50 3rd Qu.:5080
## Max. :2019-07-06 14:22:25.00 Max. :6471
##
## 01 - Rental Details Duration In Seconds Uncapped start_station_id
## Min. : 61 Min. : 1.0
## 1st Qu.: 426 1st Qu.: 77.0
## Median : 742 Median :174.0
## Mean : 1327 Mean :200.3
## 3rd Qu.: 1347 3rd Qu.:289.0
## Max. :4757640 Max. :669.0
##
## start_station_name end_station_id end_station_name member_casual
## Length:1108163 Min. : 1.0 Length:1108163 Length:1108163
## Class :character 1st Qu.: 77.0 Class :character Class :character
## Mode :character Median :174.0 Mode :character Mode :character
## Mean :201.3
## 3rd Qu.:290.0
## Max. :669.0
##
## Member Gender 05 - Member Details Member Birthday Year
## Length:1108163 Min. :1759
## Class :character 1st Qu.:1979
## Mode :character Median :1987
## Mean :1984
## 3rd Qu.:1992
## Max. :2014
## NA's :180953
summary(Q3_2019)
## ride_id started_at
## Min. :23479388 Min. :2019-07-01 00:00:27.00
## 1st Qu.:23935498 1st Qu.:2019-07-24 15:39:07.00
## Median :24367416 Median :2019-08-14 07:11:50.50
## Mean :24364471 Mean :2019-08-15 00:18:26.00
## 3rd Qu.:24797401 3rd Qu.:2019-09-05 18:16:45.75
## Max. :25223639 Max. :2019-09-30 23:59:37.00
##
## ended_at rideable_type tripduration
## Min. :2019-07-01 00:07:31.00 Min. : 1 Min. : 61
## 1st Qu.:2019-07-24 16:04:16.00 1st Qu.:1713 1st Qu.: 465
## Median :2019-08-14 07:28:07.50 Median :3419 Median : 813
## Mean :2019-08-15 00:47:28.21 Mean :3350 Mean : 1742
## 3rd Qu.:2019-09-05 18:37:10.75 3rd Qu.:4997 3rd Qu.: 1460
## Max. :2019-11-04 08:09:47.00 Max. :6471 Max. :9056633
##
## start_station_id start_station_name end_station_id end_station_name
## Min. : 2.0 Length:1640718 Min. : 2.0 Length:1640718
## 1st Qu.: 77.0 Class :character 1st Qu.: 80.0 Class :character
## Median :174.0 Mode :character Median :176.0 Mode :character
## Mean :202.4 Mean :203.9
## 3rd Qu.:289.0 3rd Qu.:291.0
## Max. :673.0 Max. :673.0
##
## member_casual gender birthyear
## Length:1640718 Length:1640718 Min. :1888
## Class :character Class :character 1st Qu.:1980
## Mode :character Mode :character Median :1988
## Mean :1985
## 3rd Qu.:1992
## Max. :2003
## NA's :278094
summary(Q4_2019)
## ride_id started_at
## Min. :25223640 Min. :2019-10-01 00:01:39.00
## 1st Qu.:25407380 1st Qu.:2019-10-13 17:38:29.25
## Median :25590864 Median :2019-10-28 18:04:41.00
## Mean :25592222 Mean :2019-11-05 00:53:59.77
## 3rd Qu.:25777172 3rd Qu.:2019-11-25 16:19:34.00
## Max. :25962904 Max. :2019-12-31 23:57:17.00
##
## ended_at rideable_type tripduration
## Min. :2019-10-01 00:06:34.00 Min. : 1 Min. : 61
## 1st Qu.:2019-10-13 17:59:43.25 1st Qu.:1724 1st Qu.: 354
## Median :2019-10-28 18:18:41.00 Median :3473 Median : 585
## Mean :2019-11-05 01:13:49.22 Mean :3396 Mean : 1189
## 3rd Qu.:2019-11-25 16:34:22.75 3rd Qu.:5065 3rd Qu.: 1000
## Max. :2020-01-21 13:54:35.00 Max. :6946 Max. :8585902
##
## start_station_id start_station_name end_station_id end_station_name
## Min. : 2 Length:704054 Min. : 2.0 Length:704054
## 1st Qu.: 77 Class :character 1st Qu.: 77.0 Class :character
## Median :174 Mode :character Median :174.0 Mode :character
## Mean :204 Mean :203.9
## 3rd Qu.:291 3rd Qu.:291.0
## Max. :673 Max. :673.0
##
## member_casual gender birthyear
## Length:704054 Length:704054 Min. :1899
## Class :character Class :character 1st Qu.:1978
## Mode :character Mode :character Median :1987
## Mean :1984
## 3rd Qu.:1992
## Max. :2003
## NA's :61681
Q2_2019 <- mutate(Q2_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
Q3_2019 <- mutate(Q3_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
Q4_2019 <- mutate(Q4_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
total<-bind_rows(Q2_2019, Q3_2019, Q4_2019, Q1_2020)
dim(total)
## [1] 3879822 19
head(total)
## # A tibble: 6 × 19
## ride_id started_at ended_at rideable_type `01 - Rental D…`
## <chr> <dttm> <dttm> <chr> <dbl>
## 1 221785… 2019-04-01 00:02:22 2019-04-01 00:09:48 6251 446
## 2 221785… 2019-04-01 00:03:02 2019-04-01 00:20:30 6226 1048
## 3 221785… 2019-04-01 00:11:07 2019-04-01 00:15:19 5649 252
## 4 221785… 2019-04-01 00:13:01 2019-04-01 00:18:58 4151 357
## 5 221785… 2019-04-01 00:19:26 2019-04-01 00:36:13 3270 1007
## 6 221785… 2019-04-01 00:19:39 2019-04-01 00:23:56 3123 257
## # … with 14 more variables: start_station_id <dbl>, start_station_name <chr>,
## # end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
## # `Member Gender` <chr>, `05 - Member Details Member Birthday Year` <dbl>,
## # tripduration <dbl>, gender <chr>, birthyear <dbl>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>
summary(total)
## ride_id started_at
## Length:3879822 Min. :2019-04-01 00:02:22.00
## Class :character 1st Qu.:2019-06-23 07:49:09.25
## Mode :character Median :2019-08-14 17:43:38.00
## Mean :2019-08-26 00:49:59.38
## 3rd Qu.:2019-10-12 12:10:21.00
## Max. :2020-03-31 23:51:34.00
##
## ended_at rideable_type
## Min. :2019-04-01 00:09:48.00 Length:3879822
## 1st Qu.:2019-06-23 08:20:27.75 Class :character
## Median :2019-08-14 18:02:04.00 Mode :character
## Mean :2019-08-26 01:14:37.06
## 3rd Qu.:2019-10-12 12:36:16.75
## Max. :2020-05-19 20:10:34.00
##
## 01 - Rental Details Duration In Seconds Uncapped start_station_id
## Min. : 61 Min. : 1.0
## 1st Qu.: 426 1st Qu.: 77.0
## Median : 742 Median :174.0
## Mean : 1327 Mean :202.9
## 3rd Qu.: 1347 3rd Qu.:291.0
## Max. :4757640 Max. :675.0
## NA's :2771659
## start_station_name end_station_id end_station_name member_casual
## Length:3879822 Min. : 1.0 Length:3879822 Length:3879822
## Class :character 1st Qu.: 77.0 Class :character Class :character
## Mode :character Median :174.0 Mode :character Mode :character
## Mean :203.8
## 3rd Qu.:291.0
## Max. :675.0
## NA's :1
## Member Gender 05 - Member Details Member Birthday Year tripduration
## Length:3879822 Min. :1759 Min. : 61
## Class :character 1st Qu.:1979 1st Qu.: 423
## Mode :character Median :1987 Median : 732
## Mean :1984 Mean : 1576
## 3rd Qu.:1992 3rd Qu.: 1322
## Max. :2014 Max. :9056633
## NA's :2952612 NA's :1535050
## gender birthyear start_lat start_lng
## Length:3879822 Min. :1888 Min. :42 Min. :-88
## Class :character 1st Qu.:1980 1st Qu.:42 1st Qu.:-88
## Mode :character Median :1988 Median :42 Median :-88
## Mean :1985 Mean :42 Mean :-88
## 3rd Qu.:1992 3rd Qu.:42 3rd Qu.:-88
## Max. :2003 Max. :42 Max. :-88
## NA's :1874825 NA's :3452935 NA's :3452935
## end_lat end_lng
## Min. :42 Min. :-88
## 1st Qu.:42 1st Qu.:-88
## Median :42 Median :-88
## Mean :42 Mean :-88
## 3rd Qu.:42 3rd Qu.:-88
## Max. :42 Max. :-88
## NA's :3452936 NA's :3452936
str(total)
## tibble [3,879,822 × 19] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:3879822] "22178529" "22178530" "22178531" "22178532" ...
## $ started_at : POSIXct[1:3879822], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
## $ ended_at : POSIXct[1:3879822], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
## $ rideable_type : chr [1:3879822] "6251" "6226" "5649" "4151" ...
## $ 01 - Rental Details Duration In Seconds Uncapped: num [1:3879822] 446 1048 252 357 1007 ...
## $ start_station_id : num [1:3879822] 81 317 283 26 202 420 503 260 211 211 ...
## $ start_station_name : chr [1:3879822] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
## $ end_station_id : num [1:3879822] 56 59 174 133 129 426 500 499 211 211 ...
## $ end_station_name : chr [1:3879822] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
## $ member_casual : chr [1:3879822] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ Member Gender : chr [1:3879822] "Male" "Female" "Male" "Male" ...
## $ 05 - Member Details Member Birthday Year : num [1:3879822] 1975 1984 1990 1993 1992 ...
## $ tripduration : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ gender : chr [1:3879822] NA NA NA NA ...
## $ birthyear : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ start_lat : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ start_lng : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ end_lat : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
## $ end_lng : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
total <- total %>%
select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender,
"01 - Rental Details Duration In Seconds Uncapped",
"05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))
Under the “member_casual” column, there are two names for members and also two names for casual riders(“Member” and “subscriber” are the same, and “customer” and “casual” are the same). I have to combine each of the two labels.
table(total$member_casual)
##
## casual Customer member Subscriber
## 48480 857474 378407 2595461
total<-total %>%
mutate(member_casual = recode(member_casual,
"Subscriber" = "member","Customer" = "casual"))
table(total$member_casual)
##
## casual member
## 905954 2973868
total$date <- as.Date(total$started_at)
total$month <- format(as.Date(total$date), "%m")
total$day <- format(as.Date(total$date), "%d")
total$year <- format(as.Date(total$date), "%Y")
total$day_of_week <- format(as.Date(total$date), "%A")
total$ride_length <- difftime(total$ended_at,total$started_at)
total$ride_length_m <- (difftime(total$ended_at,total$started_at))/60
summary(total)
## ride_id started_at
## Length:3879822 Min. :2019-04-01 00:02:22.00
## Class :character 1st Qu.:2019-06-23 07:49:09.25
## Mode :character Median :2019-08-14 17:43:38.00
## Mean :2019-08-26 00:49:59.38
## 3rd Qu.:2019-10-12 12:10:21.00
## Max. :2020-03-31 23:51:34.00
##
## ended_at rideable_type start_station_id
## Min. :2019-04-01 00:09:48.00 Length:3879822 Min. : 1.0
## 1st Qu.:2019-06-23 08:20:27.75 Class :character 1st Qu.: 77.0
## Median :2019-08-14 18:02:04.00 Mode :character Median :174.0
## Mean :2019-08-26 01:14:37.06 Mean :202.9
## 3rd Qu.:2019-10-12 12:36:16.75 3rd Qu.:291.0
## Max. :2020-05-19 20:10:34.00 Max. :675.0
##
## start_station_name end_station_id end_station_name member_casual
## Length:3879822 Min. : 1.0 Length:3879822 Length:3879822
## Class :character 1st Qu.: 77.0 Class :character Class :character
## Mode :character Median :174.0 Mode :character Mode :character
## Mean :203.8
## 3rd Qu.:291.0
## Max. :675.0
## NA's :1
## date month day year
## Min. :2019-04-01 Length:3879822 Length:3879822 Length:3879822
## 1st Qu.:2019-06-23 Class :character Class :character Class :character
## Median :2019-08-14 Mode :character Mode :character Mode :character
## Mean :2019-08-25
## 3rd Qu.:2019-10-12
## Max. :2020-03-31
##
## day_of_week ride_length ride_length_m
## Length:3879822 Length:3879822 Length:3879822
## Class :character Class :difftime Class :difftime
## Mode :character Mode :numeric Mode :numeric
##
##
##
##
total_v2<-total[!(total$ride_length<0),]
total_v2 %>%
summarise(max(ride_length), min(ride_length), mean(ride_length), median(ride_length))
## # A tibble: 1 × 4
## `max(ride_length)` `min(ride_length)` `mean(ride_length)` `median(ride_lengt…`
## <drtn> <drtn> <drtn> <drtn>
## 1 9387024 secs 0 secs 1477.751 secs 711 secs
aggregate(total_v2$ride_length ~ total_v2$member_casual, FUN = mean)
## total_v2$member_casual total_v2$ride_length
## 1 casual 3538.4516 secs
## 2 member 850.0659 secs
aggregate(total_v2$ride_length ~ total_v2$member_casual, FUN = median)
## total_v2$member_casual total_v2$ride_length
## 1 casual 1540 secs
## 2 member 589 secs
aggregate(total_v2$ride_length ~ total_v2$member_casual, FUN = max)
## total_v2$member_casual total_v2$ride_length
## 1 casual 9387024 secs
## 2 member 9056634 secs
aggregate(total_v2$ride_length ~ total_v2$member_casual, FUN = min)
## total_v2$member_casual total_v2$ride_length
## 1 casual 0 secs
## 2 member 1 secs
total_v2$day_of_week <-ordered(total_v2$day_of_week, levels = c("Monday", "Tuesday","Wednesday","Thursday","Friday","Saturday", "Sunday"))
aggregate(total_v2$ride_length ~ total_v2$member_casual+total_v2$day_of_week, FUN = mean)
## total_v2$member_casual total_v2$day_of_week total_v2$ride_length
## 1 casual Monday 3335.6446 secs
## 2 member Monday 842.5726 secs
## 3 casual Tuesday 3569.7986 secs
## 4 member Tuesday 826.1427 secs
## 5 casual Wednesday 3691.0203 secs
## 6 member Wednesday 823.9980 secs
## 7 casual Thursday 3660.2933 secs
## 8 member Thursday 823.9278 secs
## 9 casual Friday 3758.2210 secs
## 10 member Friday 824.5305 secs
## 11 casual Saturday 3331.9138 secs
## 12 member Saturday 968.9337 secs
## 13 casual Sunday 3581.4054 secs
## 14 member Sunday 919.9746 secs
total_v2 %>%
mutate(weekday = wday(started_at, label =TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <drtn>
## 1 casual Sun 181293 3581.4054 secs
## 2 casual Mon 104432 3335.6446 secs
## 3 casual Tue 91184 3569.7986 secs
## 4 casual Wed 93150 3691.0203 secs
## 5 casual Thu 103316 3660.2933 secs
## 6 casual Fri 122913 3758.2210 secs
## 7 casual Sat 209543 3331.9138 secs
## 8 member Sun 267965 919.9746 secs
## 9 member Mon 472196 842.5726 secs
## 10 member Tue 508445 826.1427 secs
## 11 member Wed 500330 823.9980 secs
## 12 member Thu 484177 823.9278 secs
## 13 member Fri 452790 824.5305 secs
## 14 member Sat 287958 968.9337 secs
total_v2 %>%
group_by(member_casual) %>%
summarise(rider_count = n()) %>%
ggplot(aes(x = member_casual, y = rider_count,fill=member_casual )) + geom_col()
#number of rides
total_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x=weekday, y = number_of_rides, fill = member_casual))+geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
#average duration
total_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration , fill = member_casual)) + geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## Don't know how to automatically pick scale for object of type difftime.
## Defaulting to continuous.
total_v2$month <- as.numeric(total_v2$month)
total_v2$season<-season(total_v2$month)
total_v2%>%
group_by(season, day_of_week, member_casual) %>%
summarise(number_of_rides = n(),avg_ride_length = mean(ride_length_m)) %>%
ggplot() + geom_col(mapping = aes(x = day_of_week, y = number_of_rides, fill = member_casual), position = "dodge") + facet_wrap(~season) + scale_y_continuous(breaks = seq(0, 400000, by = 50000))
## `summarise()` has grouped output by 'season', 'day_of_week'. You can override
## using the `.groups` argument.
total_v2%>%
group_by(month, member_casual) %>%
summarise(number_of_rides = n(),avg_ride_length = mean(ride_length_m)) %>%
ggplot() + geom_line(mapping = aes(x = month, y = number_of_rides, color = member_casual)) + scale_x_continuous(breaks = seq(1, 12, by = 1))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
From the first plot, we see over 70% of the users are annual members, suggesting the company is already doing a great job in keeping their users loyal and converting them to members.
From the number_of_rides vs weekday plot and the average_duration vs weekday plot, we see that casual riders take longer rides and ride more often on the weekends. Members, on the other hand, take more number of rides but shorter duration per ride. This may be due to tourists visiting the city on the weekends, or just people taking time off on the weekends.
From the number_of_rides vs month line graph, we see the amount of rides start to rise in February and continue to rise till August. This correlation is due to the rise in temperature by the end of winter, with more people going outside and cycling. And inversely, when it starts to get cold again after August, the amount of rides goes down along with the temperature.
The marketing Campaign should be launched between February to August, since the number of casual riders peak around this time.
Design seasonal packages, providing more flexibility for riders to get membership for a specific period of time when they ride more often.
Design riding packages targeting those riders who bike for recreational activities and weekend events and offer special discounts and coupons on such events to encourage casual riders to buy member.