Chicago
Chicago

INTRODUCTION

This is part of the final project of the Google Coursera Data analyst course. The final assignment for the course requires to analyze data from a company and publish the results.

Cyclistic is a fictitious bike sharing company based in Chicago. I am working as a junior data analyst as part of the marketing team and I have to analyze the previous year’s data and present the results to the marketing department.

SCENARIO

There are two types of riders at the moment: members and casual riders. Cyclistic has realized that annual members are more profitable than casual riders. Therefore, the company would like to maximize the amount of members. The company would like to make membership more attractive for casual riders.

TASK

The purpose of my analysis is to find out how different is the use of bikes for members and casual riders.

PREPARATION

The data can be found at link and the license is at link. I downloaded the last 12 months of data from July 20203 to June 20204. The data is too large to work with it on spreadsheets so I have done my analysis using R.

=====================

STEP 1: COLLECT DATA

=====================

Getting necessary packages for analysis

library(tidyverse)  #helps wrangle data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)  #helps wrangle date attributes
library(ggplot2)    #helps visualize data
library(dplyr)      #helps with data manipulation
library(fixr)       #helps fix data
library(readr)      #helps reading csv files

Set working directory

I granted access to R studio to my desktop so I can work easily with the files I need.

setwd("C:/Users/Juan/Desktop/cases/Bikes/month")

Import monthly csv files

july23 <- read_csv("202307-divvy-tripdata.csv")
## Rows: 767650 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug23 <- read_csv("202308-divvy-tripdata.csv")
## Rows: 771693 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep23 <- read_csv("202309-divvy-tripdata.csv")
## Rows: 666371 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct23 <- read_csv("202310-divvy-tripdata.csv")
## Rows: 537113 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov23 <- read_csv("202311-divvy-tripdata.csv")
## Rows: 362518 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec23 <- read_csv("202312-divvy-tripdata.csv")
## Rows: 224073 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jan24 <- read_csv("202401-divvy-tripdata.csv")
## Rows: 144873 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb24 <- read_csv("202402-divvy-tripdata.csv")
## Rows: 223164 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
marc24 <- read_csv("202403-divvy-tripdata.csv")
## Rows: 301687 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
april24 <- read_csv("202404-divvy-tripdata.csv")
## Rows: 415025 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may24 <- read_csv("202405-divvy-tripdata.csv")
## Rows: 609493 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
june24 <- read_csv("202406-divvy-tripdata.csv")
## Rows: 710721 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

==== STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE ====

Checking all column names match

colnames(july23)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(aug23)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(sep23)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(oct23)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(nov23)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(dec23)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(jan24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(feb24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(marc24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(april24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(may24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(june24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

All tables have matching columns so I can proceed to bind all the data in one table.

Bind data in one data frame

all_trips <- bind_rows(july23,aug23,sep23,oct23,nov23,dec23,jan24,feb24,marc24,april24,
                       may24,june24)


str(all_trips)
## spc_tbl_ [5,734,381 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5734381] "9340B064F0AEE130" "D1460EE3CE0D8AF8" "DF41BE31B895A25E" "9624A293749EF703" ...
##  $ rideable_type     : chr [1:5734381] "electric_bike" "classic_bike" "classic_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5734381], format: "2023-07-23 20:06:14" "2023-07-23 17:05:07" ...
##  $ ended_at          : POSIXct[1:5734381], format: "2023-07-23 20:22:44" "2023-07-23 17:18:37" ...
##  $ start_station_name: chr [1:5734381] "Kedzie Ave & 110th St" "Western Ave & Walton St" "Western Ave & Walton St" "Racine Ave & Randolph St" ...
##  $ start_station_id  : chr [1:5734381] "20204" "KA1504000103" "KA1504000103" "13155" ...
##  $ end_station_name  : chr [1:5734381] "Public Rack - Racine Ave & 109th Pl" "Milwaukee Ave & Grand Ave" "Damen Ave & Pierce Ave" "Clinton St & Madison St" ...
##  $ end_station_id    : chr [1:5734381] "877" "13033" "TA1305000041" "TA1305000032" ...
##  $ start_lat         : num [1:5734381] 41.7 41.9 41.9 41.9 42 ...
##  $ start_lng         : num [1:5734381] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:5734381] 41.7 41.9 41.9 41.9 42 ...
##  $ end_lng           : num [1:5734381] -87.7 -87.6 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr [1:5734381] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Now I have all my data in one table which is necessary for full analysis.

Drop unnecessary columns for analysis

Latitude and longitude are not neeede for the analysis

all_trips <- all_trips %>%  
  select(-c(start_lat, start_lng, end_lat, end_lng))

==== STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS ======

Inspect the new table that has been created

colnames(all_trips)  #List of column names
## [1] "ride_id"            "rideable_type"      "started_at"        
## [4] "ended_at"           "start_station_name" "start_station_id"  
## [7] "end_station_name"   "end_station_id"     "member_casual"
nrow(all_trips)  #How many rows are in data frame?
## [1] 5734381
dim(all_trips)  #Dimensions of the data frame?
## [1] 5734381       9
head(all_trips)  #See the first 6 rows of data frame.  Also tail(all_trips)
## # A tibble: 6 × 9
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 9340B064F0AEE130 electric_bike 2023-07-23 20:06:14 2023-07-23 20:22:44
## 2 D1460EE3CE0D8AF8 classic_bike  2023-07-23 17:05:07 2023-07-23 17:18:37
## 3 DF41BE31B895A25E classic_bike  2023-07-23 10:14:53 2023-07-23 10:24:29
## 4 9624A293749EF703 electric_bike 2023-07-21 08:27:44 2023-07-21 08:32:40
## 5 2F68A6A4CDB4C99A classic_bike  2023-07-08 15:46:42 2023-07-08 15:58:08
## 6 9AEE973E6B941A9C classic_bike  2023-07-10 08:44:47 2023-07-10 08:49:41
## # ℹ 5 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, member_casual <chr>
str(all_trips)  #See list of columns and data types (numeric, character, etc)
## tibble [5,734,381 × 9] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5734381] "9340B064F0AEE130" "D1460EE3CE0D8AF8" "DF41BE31B895A25E" "9624A293749EF703" ...
##  $ rideable_type     : chr [1:5734381] "electric_bike" "classic_bike" "classic_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5734381], format: "2023-07-23 20:06:14" "2023-07-23 17:05:07" ...
##  $ ended_at          : POSIXct[1:5734381], format: "2023-07-23 20:22:44" "2023-07-23 17:18:37" ...
##  $ start_station_name: chr [1:5734381] "Kedzie Ave & 110th St" "Western Ave & Walton St" "Western Ave & Walton St" "Racine Ave & Randolph St" ...
##  $ start_station_id  : chr [1:5734381] "20204" "KA1504000103" "KA1504000103" "13155" ...
##  $ end_station_name  : chr [1:5734381] "Public Rack - Racine Ave & 109th Pl" "Milwaukee Ave & Grand Ave" "Damen Ave & Pierce Ave" "Clinton St & Madison St" ...
##  $ end_station_id    : chr [1:5734381] "877" "13033" "TA1305000041" "TA1305000032" ...
##  $ member_casual     : chr [1:5734381] "member" "member" "member" "member" ...
summary(all_trips)  #Statistical summary of data. Mainly for numeric
##    ride_id          rideable_type        started_at                    
##  Length:5734381     Length:5734381     Min.   :2023-07-01 00:00:00.00  
##  Class :character   Class :character   1st Qu.:2023-08-27 15:55:29.00  
##  Mode  :character   Mode  :character   Median :2023-11-09 14:41:58.00  
##                                        Mean   :2023-12-16 20:05:31.20  
##                                        3rd Qu.:2024-04-23 17:55:40.00  
##                                        Max.   :2024-06-30 23:55:17.06  
##     ended_at                      start_station_name start_station_id  
##  Min.   :2023-07-01 00:01:26.00   Length:5734381     Length:5734381    
##  1st Qu.:2023-08-27 16:19:14.00   Class :character   Class :character  
##  Median :2023-11-09 14:55:49.00   Mode  :character   Mode  :character  
##  Mean   :2023-12-16 20:23:50.00                                        
##  3rd Qu.:2024-04-23 18:08:50.00                                        
##  Max.   :2024-06-30 23:59:57.93                                        
##  end_station_name   end_station_id     member_casual     
##  Length:5734381     Length:5734381     Length:5734381    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 

Add columns that list the date, month, day, and year of each ride

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

Let’s create a new column for ride length and then clean the results we don’t need (negative and 0 values).

Add a “ride_length” calculation to all_trips (in seconds)

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

Convert “ride_length” from Factor to numeric so we can run calculations on the data

is.factor(all_trips$ride_length)
## [1] FALSE
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE

Check negative values

check_for_negative_values(all_trips)

Replace negative and 0 values with NA values

all_trips[all_trips <= 0] <- NA  

Remove rows with NA values

all_trips <- na.omit(all_trips)

Remove negative and 0 values in data frame

all_trips_2 <- all_trips[apply(all_trips!=0, 1, all),]

Check negative values again

check_for_negative_values(all_trips_2)
## Data frame does not contain negative values.
## NULL

The data is clean.

Find number of casual and members riders

table(all_trips_2$member_casual)
## 
##  casual  member 
## 1503559 2770268

We can see clearly that there are many more casual riders than members.

========= STEP 4: CONDUCT DESCRIPTIVE ANALYSIS =========

Descriptive analysis on ride_length (all figures in seconds)

mean(all_trips_2$ride_length) #straight average (total ride length / rides)
## [1] 990.8093
median(all_trips_2$ride_length) #midpoint number in the ascending array of ride lengths
## [1] 604
max(all_trips_2$ride_length) #longest ride
## [1] 413473
min(all_trips_2$ride_length) #shortest ride
## [1] 0.155

Condense the four lines above to one line using summary() on the specific attribute

summary(all_trips_2$ride_length)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.2    347.0    604.0    990.8   1082.0 413473.0

Compare members and casual users

aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = mean)
##   all_trips_2$member_casual all_trips_2$ride_length
## 1                    casual               1434.4350
## 2                    member                750.0321
aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = median)
##   all_trips_2$member_casual all_trips_2$ride_length
## 1                    casual                     796
## 2                    member                     529
aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = max)
##   all_trips_2$member_casual all_trips_2$ride_length
## 1                    casual                  413473
## 2                    member                   89859
aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = min)
##   all_trips_2$member_casual all_trips_2$ride_length
## 1                    casual                   0.155
## 2                    member                   0.263

Order days of the week

all_trips_2$day_of_week <- ordered(all_trips_2$day_of_week, levels=c("domingo", "lunes", "martes", "miércoles", "jueves", "viernes", "sábado"))

aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual + all_trips_2$day_of_week, FUN = mean)
##    all_trips_2$member_casual all_trips_2$day_of_week all_trips_2$ride_length
## 1                     casual                 domingo               1658.9865
## 2                     member                 domingo                846.7964
## 3                     casual                   lunes               1406.8795
## 4                     member                   lunes                717.3063
## 5                     casual                  martes               1272.5568
## 6                     member                  martes                723.1518
## 7                     casual               miércoles               1254.6979
## 8                     member               miércoles                724.1508
## 9                     casual                  jueves               1225.2893
## 10                    member                  jueves                709.9829
## 11                    casual                 viernes               1380.3573
## 12                    member                 viernes                733.7183
## 13                    casual                  sábado               1609.3650
## 14                    member                  sábado                835.3832

Casual riders are using the bikes for a longer period of time at any dady of the week. This could be explained as many tourists to the city could be using the bikes for visiting.

Now, let’s run the average ride time by each day for members vs casual users

aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual + all_trips_2$day_of_week, FUN = mean)
##    all_trips_2$member_casual all_trips_2$day_of_week all_trips_2$ride_length
## 1                     casual                 domingo               1658.9865
## 2                     member                 domingo                846.7964
## 3                     casual                   lunes               1406.8795
## 4                     member                   lunes                717.3063
## 5                     casual                  martes               1272.5568
## 6                     member                  martes                723.1518
## 7                     casual               miércoles               1254.6979
## 8                     member               miércoles                724.1508
## 9                     casual                  jueves               1225.2893
## 10                    member                  jueves                709.9829
## 11                    casual                 viernes               1380.3573
## 12                    member                 viernes                733.7183
## 13                    casual                  sábado               1609.3650
## 14                    member                  sábado                835.3832

analyze ridership data by type and weekday

all_trips_2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()                       
            ,average_duration = mean(ride_length)) %>%  
  arrange(member_casual, weekday)   
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        "do\\."          268375            1659.
##  2 casual        "lu\\."          175574            1407.
##  3 casual        "ma\\."          173707            1273.
##  4 casual        "mi\\."          178181            1255.
##  5 casual        "ju\\."          182745            1225.
##  6 casual        "vi\\."          213507            1380.
##  7 casual        "sá\\."          311470            1609.
##  8 member        "do\\."          313737             847.
##  9 member        "lu\\."          395057             717.
## 10 member        "ma\\."          437338             723.
## 11 member        "mi\\."          447921             724.
## 12 member        "ju\\."          440940             710.
## 13 member        "vi\\."          385370             734.
## 14 member        "sá\\."          349905             835.

analyze ridership data by type and weekday

all_trips_2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()                       
            ,average_duration = mean(ride_length)) %>%  
  arrange(member_casual, weekday)   
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        "do\\."          268375            1659.
##  2 casual        "lu\\."          175574            1407.
##  3 casual        "ma\\."          173707            1273.
##  4 casual        "mi\\."          178181            1255.
##  5 casual        "ju\\."          182745            1225.
##  6 casual        "vi\\."          213507            1380.
##  7 casual        "sá\\."          311470            1609.
##  8 member        "do\\."          313737             847.
##  9 member        "lu\\."          395057             717.
## 10 member        "ma\\."          437338             723.
## 11 member        "mi\\."          447921             724.
## 12 member        "ju\\."          440940             710.
## 13 member        "vi\\."          385370             734.
## 14 member        "sá\\."          349905             835.

Let’s visualize the number of rides by rider type

all_trips_2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday,  y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + scale_fill_manual(values = c ("casual" = "darkblue", "member" = "darkorange"))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Although there are more casual users than members, it is clear that members use the bikes more frequently. This could be explain due to the fact that many members are most probably using the bikes to commute so the rides are shorter but more frequent.

Let’s create a visualization for average duration

all_trips_2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") + scale_fill_manual(values = c ("casual" = "purple", "member" = "yellow"))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

This graph shows clearly that casual riders are using the bikes for longer periods of time. Many temporary visitor to the city probably using the bikes to move around and their therefore their trips are longer.

Find row with maximum ride length

rides<- all_trips_2%>%
  select(start_station_name,end_station_name,ride_length)
summary(rides)
##  start_station_name end_station_name    ride_length      
##  Length:4273827     Length:4273827     Min.   :     0.2  
##  Class :character   Class :character   1st Qu.:   347.0  
##  Mode  :character   Mode  :character   Median :   604.0  
##                                        Mean   :   990.8  
##                                        3rd Qu.:  1082.0  
##                                        Max.   :413473.0
max_ride <- dplyr::filter(rides, ride_length == max(ride_length))
max_start_station <- max_ride$start_station_name
max_end_station <- max_ride$end_station_name

Find the row with the minimum ride length

rides<- all_trips_2%>%
  select(start_station_name,end_station_name,ride_length)
summary(rides)
##  start_station_name end_station_name    ride_length      
##  Length:4273827     Length:4273827     Min.   :     0.2  
##  Class :character   Class :character   1st Qu.:   347.0  
##  Mode  :character   Mode  :character   Median :   604.0  
##                                        Mean   :   990.8  
##                                        3rd Qu.:  1082.0  
##                                        Max.   :413473.0
min_ride <- dplyr::filter(rides, ride_length == min(ride_length))
min_start_station <- min_ride$start_station_name
min_end_station <- min_ride$end_station_name

Display the results

cat("Maximum ride length starts at:", max_start_station,",","and ends at:", max_end_station, "\n")
## Maximum ride length starts at: Larrabee St & Webster Ave , and ends at: Michigan Ave & Oak St
cat("Minimum ride length starts at:", min_start_station, ",","and ends at:", min_end_station, "\n")
## Minimum ride length starts at: Millennium Park , and ends at: Millennium Park

Combine max_ride and min_ride

combined_rides<-rbind(max_ride,min_ride)

Visualize ride lengths by start and end stations

ggplot(combined_rides, aes(x = str_wrap(start_station_name,width = 10), y = ride_length, fill = end_station_name)) +
  geom_bar(stat = "identity", fill = "#FF6666") +
  geom_text(aes(label = ride_length), vjust = -0.5) +
  labs(title = "Maximum and Minimum Ride Length by Start and End Stations", x = "Start_Station_Name", y = "Ride_Length (minutes)") +
  theme_minimal()+
  theme(
    plot.background = element_rect(color = "purple", linewidth = 1),
    panel.border = element_rect(color = "lightblue", fill = NA, size = 0.5)
  )+
  scale_y_continuous(labels = scales::number_format(accuracy = 1))
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

### Create a visualization for the number of rides by rider type

ggplot(data = all_trips_2) +
  geom_bar(mapping=aes(x=member_casual, fill=member_casual)) +
  labs(title = "Number of Rides: Member vs. Casual") + 
  scale_fill_manual(values = c ("casual" = "purple", "member" = "lightblue"))

### Top 3 Start Stations

all_trips_2 %>% 
  group_by(start_station_name) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count)) %>%
  head(4)
## # A tibble: 4 × 2
##   start_station_name                 count
##   <chr>                              <int>
## 1 Streeter Dr & Grand Ave            60601
## 2 DuSable Lake Shore Dr & Monroe St  40556
## 3 Michigan Ave & Oak St              35295
## 4 DuSable Lake Shore Dr & North Blvd 34665

** The top stations are found on the water front where most amenities are found link

Top 3 End Stations

all_trips_2 %>% 
  group_by(end_station_name) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count)) %>%
  head(4)
## # A tibble: 4 × 2
##   end_station_name                   count
##   <chr>                              <int>
## 1 Streeter Dr & Grand Ave            62076
## 2 DuSable Lake Shore Dr & Monroe St  38878
## 3 DuSable Lake Shore Dr & North Blvd 37448
## 4 Michigan Ave & Oak St              35976

Create visualization for members vs casual per month

all_trips_2 %>% 
  ggplot(aes(started_at)) + 
  geom_histogram(color = "#000000", aes(fill = member_casual)) +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~member_casual) + 
  scale_fill_manual(values = c ("casual" = "purple", "member" = "lightyellow")) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

** All riders use the service during the warmer months of the year and it drops during the coldest months. Chicago is very cold during the winter months so that explains it.

CONCLUSIONS

  1. The analysis shows that there are more casual users than members.
  2. Casual riders ride for longer periods of time but members ride more frequently
  3. The most popular stations are near the waterfront where more of the touristic sights are found.

All these finding suggest that if Cyclistic would like to get more members, these are my recommendations:

  1. At the moments there are only a two options for casual members, one ride pass and one day pass. It would be interesting to create new passes such as Weekend pass, a three day pass, Weekly pass and Monthly pass.

  2. Attracting prices could be a good way of converting recurrent casual rider into full membership. This could be done by introducing discounts or giving a free month rides per year.

  3. Advertising at main stations, popular destinations and on social media the new rates and discounts should be a priority once the new plans are put in place.

##======== STEP 5: EXPORT SUMMARY FILE FOR FURTHER ANALYSIS ========

Create a csv file that we will visualize in Excel, Tableau, or my presentation software

counts <- aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual + 
                      all_trips_2$day_of_week, FUN = mean)

View(counts)
write.csv(counts, file ='/Users/Juan/Desktop/cases/Bikes/avg_ride_lenght.csv')