Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Characters and teams

● Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

● Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

● Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.

● Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

Goal

The ultimate goal is to design marketing strategies aimed at converting casual riders into annual members. To do so, a smaller question that I aim to answer here is: How do annual members and casual riders use Cyclistic bikes differently.

Data Source

I will be using a total record of 12 months of Cyclistic’s historial trip data to analyze and identify trends. It is a public data set, made available by Motivate International Inc. under this license. My data ranges from April 2019 to March 2020, and it is cited and vetted by Chicago department of transportation.

Note: for data security purposes, every riders’ personal identifiable information is hidden through tokenization.

Why R?

Excel spreadsheets can only contain 1,048,576 rows by 16,384 columns and the 4 quarterly datasets combined are far beyond that limit, therefore R is used to perform all tasks - organizing, cleaning, analyzing, and visualizing.

Setting up

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(metR)
## 
## Attaching package: 'metR'
## 
## The following object is masked from 'package:purrr':
## 
##     cross

Importing data into R

Q1_2020<-read_csv("/Users/thomasli/Downloads/Divvy_Trips_2020_Q1.csv")
## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q2_2019<-read_csv("/Users/thomasli/Downloads/Divvy_Trips_2019_Q2.csv")
## Rows: 1108163 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
## dbl  (5): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 03 - R...
## dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q3_2019<-read_csv("/Users/thomasli/Downloads/Divvy_Trips_2019_Q3.csv")
## Rows: 1640718 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q4_2019<-read_csv("/Users/thomasli/Downloads/Divvy_Trips_2019_Q4.csv")
## Rows: 704054 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): from_station_name, to_station_name, usertype, gender
## dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## dttm (2): start_time, end_time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Compare column names of each csv file so that they match before joining them into one table.

colnames(Q2_2019)
##  [1] "01 - Rental Details Rental ID"                   
##  [2] "01 - Rental Details Local Start Time"            
##  [3] "01 - Rental Details Local End Time"              
##  [4] "01 - Rental Details Bike ID"                     
##  [5] "01 - Rental Details Duration In Seconds Uncapped"
##  [6] "03 - Rental Start Station ID"                    
##  [7] "03 - Rental Start Station Name"                  
##  [8] "02 - Rental End Station ID"                      
##  [9] "02 - Rental End Station Name"                    
## [10] "User Type"                                       
## [11] "Member Gender"                                   
## [12] "05 - Member Details Member Birthday Year"
colnames(Q3_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(Q4_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(Q1_2020)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Notice the difference in label between the four quarters. Now I have to utilize rename() to match each data set to Q1_2020.

(Q4_2019 <- rename(Q4_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))
## # A tibble: 704,054 × 12
##     ride_id started_at          ended_at            rideable_type tripduration
##       <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
##  1 25223640 2019-10-01 00:01:39 2019-10-01 00:17:20          2215          940
##  2 25223641 2019-10-01 00:02:16 2019-10-01 00:06:34          6328          258
##  3 25223642 2019-10-01 00:04:32 2019-10-01 00:18:43          3003          850
##  4 25223643 2019-10-01 00:04:32 2019-10-01 00:43:43          3275         2350
##  5 25223644 2019-10-01 00:04:34 2019-10-01 00:35:42          5294         1867
##  6 25223645 2019-10-01 00:04:38 2019-10-01 00:10:51          1891          373
##  7 25223646 2019-10-01 00:04:52 2019-10-01 00:22:45          1061         1072
##  8 25223647 2019-10-01 00:04:57 2019-10-01 00:29:16          1274         1458
##  9 25223648 2019-10-01 00:05:20 2019-10-01 00:29:18          6011         1437
## 10 25223649 2019-10-01 00:05:20 2019-10-01 02:23:46          2957         8306
## # … with 704,044 more rows, and 7 more variables: start_station_id <dbl>,
## #   start_station_name <chr>, end_station_id <dbl>, end_station_name <chr>,
## #   member_casual <chr>, gender <chr>, birthyear <dbl>
(Q3_2019 <- rename(Q3_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))
## # A tibble: 1,640,718 × 12
##     ride_id started_at          ended_at            rideable_type tripduration
##       <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
##  1 23479388 2019-07-01 00:00:27 2019-07-01 00:20:41          3591         1214
##  2 23479389 2019-07-01 00:01:16 2019-07-01 00:18:44          5353         1048
##  3 23479390 2019-07-01 00:01:48 2019-07-01 00:27:42          6180         1554
##  4 23479391 2019-07-01 00:02:07 2019-07-01 00:27:10          5540         1503
##  5 23479392 2019-07-01 00:02:13 2019-07-01 00:22:26          6014         1213
##  6 23479393 2019-07-01 00:02:21 2019-07-01 00:07:31          4941          310
##  7 23479394 2019-07-01 00:02:24 2019-07-01 00:23:12          3770         1248
##  8 23479395 2019-07-01 00:02:26 2019-07-01 00:28:16          5442         1550
##  9 23479396 2019-07-01 00:02:34 2019-07-01 00:28:57          2957         1583
## 10 23479397 2019-07-01 00:02:45 2019-07-01 00:29:14          6091         1589
## # … with 1,640,708 more rows, and 7 more variables: start_station_id <dbl>,
## #   start_station_name <chr>, end_station_id <dbl>, end_station_name <chr>,
## #   member_casual <chr>, gender <chr>, birthyear <dbl>
(Q2_2019 <- rename(Q2_2019
                   ,ride_id = "01 - Rental Details Rental ID"
                   ,rideable_type = "01 - Rental Details Bike ID" 
                   ,started_at = "01 - Rental Details Local Start Time"  
                   ,ended_at = "01 - Rental Details Local End Time"  
                   ,start_station_name = "03 - Rental Start Station Name" 
                   ,start_station_id = "03 - Rental Start Station ID"
                   ,end_station_name = "02 - Rental End Station Name" 
                   ,end_station_id = "02 - Rental End Station ID"
                   ,member_casual = "User Type"))
## # A tibble: 1,108,163 × 12
##     ride_id started_at          ended_at            rideable_type
##       <dbl> <dttm>              <dttm>                      <dbl>
##  1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48          6251
##  2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30          6226
##  3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19          5649
##  4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58          4151
##  5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13          3270
##  6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56          3123
##  7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41          6418
##  8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11          4513
##  9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44          3280
## 10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39          5534
## # … with 1,108,153 more rows, and 8 more variables:
## #   `01 - Rental Details Duration In Seconds Uncapped` <dbl>,
## #   start_station_id <dbl>, start_station_name <chr>, end_station_id <dbl>,
## #   end_station_name <chr>, member_casual <chr>, `Member Gender` <chr>,
## #   `05 - Member Details Member Birthday Year` <dbl>

Inspect each updated datasets to make sure they are consistent.

summary(Q1_2020)
##    ride_id          rideable_type        started_at                    
##  Length:426887      Length:426887      Min.   :2020-01-01 00:04:44.00  
##  Class :character   Class :character   1st Qu.:2020-01-24 14:03:26.00  
##  Mode  :character   Mode  :character   Median :2020-02-17 05:01:27.00  
##                                        Mean   :2020-02-14 01:23:18.51  
##                                        3rd Qu.:2020-03-05 15:08:13.50  
##                                        Max.   :2020-03-31 23:51:34.00  
##                                                                        
##     ended_at                      start_station_name start_station_id
##  Min.   :2020-01-01 00:10:54.00   Length:426887      Min.   :  2.0   
##  1st Qu.:2020-01-24 14:21:24.50   Class :character   1st Qu.: 77.0   
##  Median :2020-02-17 05:48:58.00   Mode  :character   Median :176.0   
##  Mean   :2020-02-14 01:45:25.43                      Mean   :209.8   
##  3rd Qu.:2020-03-05 15:27:54.00                      3rd Qu.:298.0   
##  Max.   :2020-05-19 20:10:34.00                      Max.   :675.0   
##                                                                      
##  end_station_name   end_station_id    start_lat       start_lng     
##  Length:426887      Min.   :  2.0   Min.   :41.74   Min.   :-87.77  
##  Class :character   1st Qu.: 77.0   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Median :175.0   Median :41.89   Median :-87.64  
##                     Mean   :209.3   Mean   :41.90   Mean   :-87.64  
##                     3rd Qu.:297.0   3rd Qu.:41.92   3rd Qu.:-87.63  
##                     Max.   :675.0   Max.   :42.06   Max.   :-87.55  
##                     NA's   :1                                       
##     end_lat         end_lng       member_casual     
##  Min.   :41.74   Min.   :-87.77   Length:426887     
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.89   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.64                     
##  3rd Qu.:41.92   3rd Qu.:-87.63                     
##  Max.   :42.06   Max.   :-87.55                     
##  NA's   :1       NA's   :1
summary(Q2_2019)
##     ride_id           started_at                    
##  Min.   :22178529   Min.   :2019-04-01 00:02:22.00  
##  1st Qu.:22505074   1st Qu.:2019-05-02 15:19:47.50  
##  Median :22831158   Median :2019-05-26 11:38:51.00  
##  Mean   :22829856   Mean   :2019-05-22 23:25:39.56  
##  3rd Qu.:23154496   3rd Qu.:2019-06-13 18:18:56.00  
##  Max.   :23479387   Max.   :2019-06-30 23:59:05.00  
##                                                     
##     ended_at                      rideable_type 
##  Min.   :2019-04-01 00:09:48.00   Min.   :   1  
##  1st Qu.:2019-05-02 15:37:41.00   1st Qu.:1738  
##  Median :2019-05-26 12:03:20.00   Median :3471  
##  Mean   :2019-05-22 23:47:46.85   Mean   :3399  
##  3rd Qu.:2019-06-13 18:37:40.50   3rd Qu.:5080  
##  Max.   :2019-07-06 14:22:25.00   Max.   :6471  
##                                                 
##  01 - Rental Details Duration In Seconds Uncapped start_station_id
##  Min.   :     61                                  Min.   :  1.0   
##  1st Qu.:    426                                  1st Qu.: 77.0   
##  Median :    742                                  Median :174.0   
##  Mean   :   1327                                  Mean   :200.3   
##  3rd Qu.:   1347                                  3rd Qu.:289.0   
##  Max.   :4757640                                  Max.   :669.0   
##                                                                   
##  start_station_name end_station_id  end_station_name   member_casual     
##  Length:1108163     Min.   :  1.0   Length:1108163     Length:1108163    
##  Class :character   1st Qu.: 77.0   Class :character   Class :character  
##  Mode  :character   Median :174.0   Mode  :character   Mode  :character  
##                     Mean   :201.3                                        
##                     3rd Qu.:290.0                                        
##                     Max.   :669.0                                        
##                                                                          
##  Member Gender      05 - Member Details Member Birthday Year
##  Length:1108163     Min.   :1759                            
##  Class :character   1st Qu.:1979                            
##  Mode  :character   Median :1987                            
##                     Mean   :1984                            
##                     3rd Qu.:1992                            
##                     Max.   :2014                            
##                     NA's   :180953
summary(Q3_2019)
##     ride_id           started_at                    
##  Min.   :23479388   Min.   :2019-07-01 00:00:27.00  
##  1st Qu.:23935498   1st Qu.:2019-07-24 15:39:07.00  
##  Median :24367416   Median :2019-08-14 07:11:50.50  
##  Mean   :24364471   Mean   :2019-08-15 00:18:26.00  
##  3rd Qu.:24797401   3rd Qu.:2019-09-05 18:16:45.75  
##  Max.   :25223639   Max.   :2019-09-30 23:59:37.00  
##                                                     
##     ended_at                      rideable_type   tripduration    
##  Min.   :2019-07-01 00:07:31.00   Min.   :   1   Min.   :     61  
##  1st Qu.:2019-07-24 16:04:16.00   1st Qu.:1713   1st Qu.:    465  
##  Median :2019-08-14 07:28:07.50   Median :3419   Median :    813  
##  Mean   :2019-08-15 00:47:28.21   Mean   :3350   Mean   :   1742  
##  3rd Qu.:2019-09-05 18:37:10.75   3rd Qu.:4997   3rd Qu.:   1460  
##  Max.   :2019-11-04 08:09:47.00   Max.   :6471   Max.   :9056633  
##                                                                   
##  start_station_id start_station_name end_station_id  end_station_name  
##  Min.   :  2.0    Length:1640718     Min.   :  2.0   Length:1640718    
##  1st Qu.: 77.0    Class :character   1st Qu.: 80.0   Class :character  
##  Median :174.0    Mode  :character   Median :176.0   Mode  :character  
##  Mean   :202.4                       Mean   :203.9                     
##  3rd Qu.:289.0                       3rd Qu.:291.0                     
##  Max.   :673.0                       Max.   :673.0                     
##                                                                        
##  member_casual         gender            birthyear     
##  Length:1640718     Length:1640718     Min.   :1888    
##  Class :character   Class :character   1st Qu.:1980    
##  Mode  :character   Mode  :character   Median :1988    
##                                        Mean   :1985    
##                                        3rd Qu.:1992    
##                                        Max.   :2003    
##                                        NA's   :278094
summary(Q4_2019)
##     ride_id           started_at                    
##  Min.   :25223640   Min.   :2019-10-01 00:01:39.00  
##  1st Qu.:25407380   1st Qu.:2019-10-13 17:38:29.25  
##  Median :25590864   Median :2019-10-28 18:04:41.00  
##  Mean   :25592222   Mean   :2019-11-05 00:53:59.77  
##  3rd Qu.:25777172   3rd Qu.:2019-11-25 16:19:34.00  
##  Max.   :25962904   Max.   :2019-12-31 23:57:17.00  
##                                                     
##     ended_at                      rideable_type   tripduration    
##  Min.   :2019-10-01 00:06:34.00   Min.   :   1   Min.   :     61  
##  1st Qu.:2019-10-13 17:59:43.25   1st Qu.:1724   1st Qu.:    354  
##  Median :2019-10-28 18:18:41.00   Median :3473   Median :    585  
##  Mean   :2019-11-05 01:13:49.22   Mean   :3396   Mean   :   1189  
##  3rd Qu.:2019-11-25 16:34:22.75   3rd Qu.:5065   3rd Qu.:   1000  
##  Max.   :2020-01-21 13:54:35.00   Max.   :6946   Max.   :8585902  
##                                                                   
##  start_station_id start_station_name end_station_id  end_station_name  
##  Min.   :  2      Length:704054      Min.   :  2.0   Length:704054     
##  1st Qu.: 77      Class :character   1st Qu.: 77.0   Class :character  
##  Median :174      Mode  :character   Median :174.0   Mode  :character  
##  Mean   :204                         Mean   :203.9                     
##  3rd Qu.:291                         3rd Qu.:291.0                     
##  Max.   :673                         Max.   :673.0                     
##                                                                        
##  member_casual         gender            birthyear    
##  Length:704054      Length:704054      Min.   :1899   
##  Class :character   Class :character   1st Qu.:1978   
##  Mode  :character   Mode  :character   Median :1987   
##                                        Mean   :1984   
##                                        3rd Qu.:1992   
##                                        Max.   :2003   
##                                        NA's   :61681

Utilize mutate() to convert ride_id and rideable_id to the same type for us to stack.

Q2_2019 <-  mutate(Q2_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
Q3_2019 <-  mutate(Q3_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
Q4_2019 <-  mutate(Q4_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 

Now that each dataframe are consistent, I’m ready to combine them into one dataframe.

total<-bind_rows(Q2_2019, Q3_2019, Q4_2019, Q1_2020)

Inspect the new table and remove unnecessary columns(lat, long, birthyear, gender).

dim(total)
## [1] 3879822      19
head(total)
## # A tibble: 6 × 19
##   ride_id started_at          ended_at            rideable_type `01 - Rental D…`
##   <chr>   <dttm>              <dttm>              <chr>                    <dbl>
## 1 221785… 2019-04-01 00:02:22 2019-04-01 00:09:48 6251                       446
## 2 221785… 2019-04-01 00:03:02 2019-04-01 00:20:30 6226                      1048
## 3 221785… 2019-04-01 00:11:07 2019-04-01 00:15:19 5649                       252
## 4 221785… 2019-04-01 00:13:01 2019-04-01 00:18:58 4151                       357
## 5 221785… 2019-04-01 00:19:26 2019-04-01 00:36:13 3270                      1007
## 6 221785… 2019-04-01 00:19:39 2019-04-01 00:23:56 3123                       257
## # … with 14 more variables: start_station_id <dbl>, start_station_name <chr>,
## #   end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
## #   `Member Gender` <chr>, `05 - Member Details Member Birthday Year` <dbl>,
## #   tripduration <dbl>, gender <chr>, birthyear <dbl>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>
summary(total)
##    ride_id            started_at                    
##  Length:3879822     Min.   :2019-04-01 00:02:22.00  
##  Class :character   1st Qu.:2019-06-23 07:49:09.25  
##  Mode  :character   Median :2019-08-14 17:43:38.00  
##                     Mean   :2019-08-26 00:49:59.38  
##                     3rd Qu.:2019-10-12 12:10:21.00  
##                     Max.   :2020-03-31 23:51:34.00  
##                                                     
##     ended_at                      rideable_type     
##  Min.   :2019-04-01 00:09:48.00   Length:3879822    
##  1st Qu.:2019-06-23 08:20:27.75   Class :character  
##  Median :2019-08-14 18:02:04.00   Mode  :character  
##  Mean   :2019-08-26 01:14:37.06                     
##  3rd Qu.:2019-10-12 12:36:16.75                     
##  Max.   :2020-05-19 20:10:34.00                     
##                                                     
##  01 - Rental Details Duration In Seconds Uncapped start_station_id
##  Min.   :     61                                  Min.   :  1.0   
##  1st Qu.:    426                                  1st Qu.: 77.0   
##  Median :    742                                  Median :174.0   
##  Mean   :   1327                                  Mean   :202.9   
##  3rd Qu.:   1347                                  3rd Qu.:291.0   
##  Max.   :4757640                                  Max.   :675.0   
##  NA's   :2771659                                                  
##  start_station_name end_station_id  end_station_name   member_casual     
##  Length:3879822     Min.   :  1.0   Length:3879822     Length:3879822    
##  Class :character   1st Qu.: 77.0   Class :character   Class :character  
##  Mode  :character   Median :174.0   Mode  :character   Mode  :character  
##                     Mean   :203.8                                        
##                     3rd Qu.:291.0                                        
##                     Max.   :675.0                                        
##                     NA's   :1                                            
##  Member Gender      05 - Member Details Member Birthday Year  tripduration    
##  Length:3879822     Min.   :1759                             Min.   :     61  
##  Class :character   1st Qu.:1979                             1st Qu.:    423  
##  Mode  :character   Median :1987                             Median :    732  
##                     Mean   :1984                             Mean   :   1576  
##                     3rd Qu.:1992                             3rd Qu.:   1322  
##                     Max.   :2014                             Max.   :9056633  
##                     NA's   :2952612                          NA's   :1535050  
##     gender            birthyear         start_lat         start_lng      
##  Length:3879822     Min.   :1888      Min.   :42        Min.   :-88      
##  Class :character   1st Qu.:1980      1st Qu.:42        1st Qu.:-88      
##  Mode  :character   Median :1988      Median :42        Median :-88      
##                     Mean   :1985      Mean   :42        Mean   :-88      
##                     3rd Qu.:1992      3rd Qu.:42        3rd Qu.:-88      
##                     Max.   :2003      Max.   :42        Max.   :-88      
##                     NA's   :1874825   NA's   :3452935   NA's   :3452935  
##     end_lat           end_lng       
##  Min.   :42        Min.   :-88      
##  1st Qu.:42        1st Qu.:-88      
##  Median :42        Median :-88      
##  Mean   :42        Mean   :-88      
##  3rd Qu.:42        3rd Qu.:-88      
##  Max.   :42        Max.   :-88      
##  NA's   :3452936   NA's   :3452936
str(total)
## tibble [3,879,822 × 19] (S3: tbl_df/tbl/data.frame)
##  $ ride_id                                         : chr [1:3879822] "22178529" "22178530" "22178531" "22178532" ...
##  $ started_at                                      : POSIXct[1:3879822], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
##  $ ended_at                                        : POSIXct[1:3879822], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
##  $ rideable_type                                   : chr [1:3879822] "6251" "6226" "5649" "4151" ...
##  $ 01 - Rental Details Duration In Seconds Uncapped: num [1:3879822] 446 1048 252 357 1007 ...
##  $ start_station_id                                : num [1:3879822] 81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name                              : chr [1:3879822] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id                                  : num [1:3879822] 56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name                                : chr [1:3879822] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual                                   : chr [1:3879822] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ Member Gender                                   : chr [1:3879822] "Male" "Female" "Male" "Male" ...
##  $ 05 - Member Details Member Birthday Year        : num [1:3879822] 1975 1984 1990 1993 1992 ...
##  $ tripduration                                    : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ gender                                          : chr [1:3879822] NA NA NA NA ...
##  $ birthyear                                       : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ start_lat                                       : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ start_lng                                       : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lat                                         : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
##  $ end_lng                                         : num [1:3879822] NA NA NA NA NA NA NA NA NA NA ...
total <- total %>% 
  select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, 
            "01 - Rental Details Duration In Seconds Uncapped", 
            "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))

Problem

Under the “member_casual” column, there are two names for members and also two names for casual riders(“Member” and “subscriber” are the same, and “customer” and “casual” are the same). I have to combine each of the two labels.

table(total$member_casual)
## 
##     casual   Customer     member Subscriber 
##      48480     857474     378407    2595461
total<-total %>% 
  mutate(member_casual = recode(member_casual,
                                "Subscriber" = "member","Customer" = "casual"))
table(total$member_casual)
## 
##  casual  member 
##  905954 2973868

For easier analysis, add additional columns of data giving more specified detail about ride date.

total$date <- as.Date(total$started_at)
total$month <- format(as.Date(total$date), "%m")
total$day <- format(as.Date(total$date), "%d")
total$year <- format(as.Date(total$date), "%Y")
total$day_of_week <- format(as.Date(total$date), "%A")

Add a “ride_duration” calculation to all_trips both in seconds and minutes.

total$ride_length <- difftime(total$ended_at,total$started_at)
total$ride_length_m <- (difftime(total$ended_at,total$started_at))/60
summary(total)
##    ride_id            started_at                    
##  Length:3879822     Min.   :2019-04-01 00:02:22.00  
##  Class :character   1st Qu.:2019-06-23 07:49:09.25  
##  Mode  :character   Median :2019-08-14 17:43:38.00  
##                     Mean   :2019-08-26 00:49:59.38  
##                     3rd Qu.:2019-10-12 12:10:21.00  
##                     Max.   :2020-03-31 23:51:34.00  
##                                                     
##     ended_at                      rideable_type      start_station_id
##  Min.   :2019-04-01 00:09:48.00   Length:3879822     Min.   :  1.0   
##  1st Qu.:2019-06-23 08:20:27.75   Class :character   1st Qu.: 77.0   
##  Median :2019-08-14 18:02:04.00   Mode  :character   Median :174.0   
##  Mean   :2019-08-26 01:14:37.06                      Mean   :202.9   
##  3rd Qu.:2019-10-12 12:36:16.75                      3rd Qu.:291.0   
##  Max.   :2020-05-19 20:10:34.00                      Max.   :675.0   
##                                                                      
##  start_station_name end_station_id  end_station_name   member_casual     
##  Length:3879822     Min.   :  1.0   Length:3879822     Length:3879822    
##  Class :character   1st Qu.: 77.0   Class :character   Class :character  
##  Mode  :character   Median :174.0   Mode  :character   Mode  :character  
##                     Mean   :203.8                                        
##                     3rd Qu.:291.0                                        
##                     Max.   :675.0                                        
##                     NA's   :1                                            
##       date               month               day                year          
##  Min.   :2019-04-01   Length:3879822     Length:3879822     Length:3879822    
##  1st Qu.:2019-06-23   Class :character   Class :character   Class :character  
##  Median :2019-08-14   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2019-08-25                                                           
##  3rd Qu.:2019-10-12                                                           
##  Max.   :2020-03-31                                                           
##                                                                               
##  day_of_week        ride_length       ride_length_m    
##  Length:3879822     Length:3879822    Length:3879822   
##  Class :character   Class :difftime   Class :difftime  
##  Mode  :character   Mode  :numeric    Mode  :numeric   
##                                                        
##                                                        
##                                                        
## 

Notice some rides have negative ridetime due to several reasons, it is best to delete these rides. (We will create a new version since data are being removed)

total_v2<-total[!(total$ride_length<0),]

Descriptive Analysis on Ride Length

total_v2 %>% 
  summarise(max(ride_length), min(ride_length), mean(ride_length), median(ride_length))
## # A tibble: 1 × 4
##   `max(ride_length)` `min(ride_length)` `mean(ride_length)` `median(ride_lengt…`
##   <drtn>             <drtn>             <drtn>              <drtn>              
## 1 9387024 secs       0 secs             1477.751 secs       711 secs

Compare between members and casuals

aggregate(total_v2$ride_length ~ total_v2$member_casual, FUN = mean)
##   total_v2$member_casual total_v2$ride_length
## 1                 casual       3538.4516 secs
## 2                 member        850.0659 secs
aggregate(total_v2$ride_length ~ total_v2$member_casual, FUN = median)
##   total_v2$member_casual total_v2$ride_length
## 1                 casual            1540 secs
## 2                 member             589 secs
aggregate(total_v2$ride_length ~ total_v2$member_casual, FUN = max)
##   total_v2$member_casual total_v2$ride_length
## 1                 casual         9387024 secs
## 2                 member         9056634 secs
aggregate(total_v2$ride_length ~ total_v2$member_casual, FUN = min)
##   total_v2$member_casual total_v2$ride_length
## 1                 casual               0 secs
## 2                 member               1 secs

Average ride time by each day (in order) for members vs casuals.

total_v2$day_of_week <-ordered(total_v2$day_of_week, levels = c("Monday", "Tuesday","Wednesday","Thursday","Friday","Saturday", "Sunday"))
aggregate(total_v2$ride_length ~ total_v2$member_casual+total_v2$day_of_week, FUN = mean)
##    total_v2$member_casual total_v2$day_of_week total_v2$ride_length
## 1                  casual               Monday       3335.6446 secs
## 2                  member               Monday        842.5726 secs
## 3                  casual              Tuesday       3569.7986 secs
## 4                  member              Tuesday        826.1427 secs
## 5                  casual            Wednesday       3691.0203 secs
## 6                  member            Wednesday        823.9980 secs
## 7                  casual             Thursday       3660.2933 secs
## 8                  member             Thursday        823.9278 secs
## 9                  casual               Friday       3758.2210 secs
## 10                 member               Friday        824.5305 secs
## 11                 casual             Saturday       3331.9138 secs
## 12                 member             Saturday        968.9337 secs
## 13                 casual               Sunday       3581.4054 secs
## 14                 member               Sunday        919.9746 secs

Analyze number of rides as well as average ride time order by rider type and weekday.

total_v2 %>% 
  mutate(weekday = wday(started_at, label =TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int> <drtn>          
##  1 casual        Sun              181293 3581.4054 secs  
##  2 casual        Mon              104432 3335.6446 secs  
##  3 casual        Tue               91184 3569.7986 secs  
##  4 casual        Wed               93150 3691.0203 secs  
##  5 casual        Thu              103316 3660.2933 secs  
##  6 casual        Fri              122913 3758.2210 secs  
##  7 casual        Sat              209543 3331.9138 secs  
##  8 member        Sun              267965  919.9746 secs  
##  9 member        Mon              472196  842.5726 secs  
## 10 member        Tue              508445  826.1427 secs  
## 11 member        Wed              500330  823.9980 secs  
## 12 member        Thu              484177  823.9278 secs  
## 13 member        Fri              452790  824.5305 secs  
## 14 member        Sat              287958  968.9337 secs

Visualize number of rides by rider type

total_v2 %>%
  group_by(member_casual) %>%
  summarise(rider_count = n()) %>% 
  ggplot(aes(x = member_casual, y = rider_count,fill=member_casual )) + geom_col()

Number of rides and average_duration by rider type order by days of week

#number of rides
total_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday) %>% 
  ggplot(aes(x=weekday, y = number_of_rides, fill = member_casual))+geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

#average duration
total_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration , fill = member_casual)) + geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## Don't know how to automatically pick scale for object of type difftime.
## Defaulting to continuous.

Effect of Seasonality Order by days of week

total_v2$month <- as.numeric(total_v2$month)
total_v2$season<-season(total_v2$month)
total_v2%>%
  group_by(season, day_of_week, member_casual) %>%   
  summarise(number_of_rides = n(),avg_ride_length = mean(ride_length_m)) %>% 
ggplot() + geom_col(mapping = aes(x = day_of_week, y = number_of_rides, fill = member_casual), position = "dodge") + facet_wrap(~season) + scale_y_continuous(breaks = seq(0, 400000, by = 50000))
## `summarise()` has grouped output by 'season', 'day_of_week'. You can override
## using the `.groups` argument.

Continuous change of number of rides along the whole 12 months.

total_v2%>%
  group_by(month, member_casual) %>%   
  summarise(number_of_rides = n(),avg_ride_length = mean(ride_length_m)) %>% 
ggplot() + geom_line(mapping = aes(x = month, y = number_of_rides, color = member_casual)) + scale_x_continuous(breaks = seq(1, 12, by = 1))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

Conclusion

  1. From the first plot, we see over 70% of the users are annual members, suggesting the company is already doing a great job in keeping their users loyal and converting them to members.

  2. From the number_of_rides vs weekday plot and the average_duration vs weekday plot, we see that casual riders take longer rides and ride more often on the weekends. Members, on the other hand, take more number of rides but shorter duration per ride. This may be due to tourists visiting the city on the weekends, or just people taking time off on the weekends.

  3. From the number_of_rides vs month line graph, we see the amount of rides start to rise in February and continue to rise till August. This correlation is due to the rise in temperature by the end of winter, with more people going outside and cycling. And inversely, when it starts to get cold again after August, the amount of rides goes down along with the temperature.

Recommendations

  1. The marketing Campaign should be launched between February to August, since the number of casual riders peak around this time.

  2. Design seasonal packages, providing more flexibility for riders to get membership for a specific period of time when they ride more often.

  3. Design riding packages targeting those riders who bike for recreational activities and weekend events and offer special discounts and coupons on such events to encourage casual riders to buy member.