This is part of the final project of the Google Coursera Data analyst course. The final assignment for the course requires to analyze data from a company and publish the results.
Cyclistic is a fictitious bike sharing company based in Chicago. I am working as a junior data analyst as part of the marketing team and I have to analyze the previous year’s data and present the results to the marketing department.
There are two types of riders at the moment: members and casual riders. Cyclistic has realized that annual members are more profitable than casual riders. Therefore, the company would like to maximize the amount of members. The company would like to make membership more attractive for casual riders.
The purpose of my analysis is to find out how different is the use of bikes for members and casual riders.
The data can be found at link and the license is at link. I downloaded the last 12 months of data from July 20203 to June 20204. The data is too large to work with it on spreadsheets so I have done my analysis using R.
library(tidyverse) #helps wrangle data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate) #helps wrangle date attributes
library(ggplot2) #helps visualize data
library(dplyr) #helps with data manipulation
library(fixr) #helps fix data
library(readr) #helps reading csv files
I granted access to R studio to my desktop so I can work easily with the files I need.
setwd("C:/Users/Juan/Desktop/cases/Bikes/month")
july23 <- read_csv("202307-divvy-tripdata.csv")
## Rows: 767650 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug23 <- read_csv("202308-divvy-tripdata.csv")
## Rows: 771693 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep23 <- read_csv("202309-divvy-tripdata.csv")
## Rows: 666371 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct23 <- read_csv("202310-divvy-tripdata.csv")
## Rows: 537113 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov23 <- read_csv("202311-divvy-tripdata.csv")
## Rows: 362518 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec23 <- read_csv("202312-divvy-tripdata.csv")
## Rows: 224073 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jan24 <- read_csv("202401-divvy-tripdata.csv")
## Rows: 144873 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb24 <- read_csv("202402-divvy-tripdata.csv")
## Rows: 223164 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
marc24 <- read_csv("202403-divvy-tripdata.csv")
## Rows: 301687 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
april24 <- read_csv("202404-divvy-tripdata.csv")
## Rows: 415025 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may24 <- read_csv("202405-divvy-tripdata.csv")
## Rows: 609493 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
june24 <- read_csv("202406-divvy-tripdata.csv")
## Rows: 710721 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(july23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(aug23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(sep23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(oct23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(nov23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(dec23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jan24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(feb24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(marc24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(april24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(may24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(june24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
All tables have matching columns so I can proceed to bind all the data in one table.
all_trips <- bind_rows(july23,aug23,sep23,oct23,nov23,dec23,jan24,feb24,marc24,april24,
may24,june24)
str(all_trips)
## spc_tbl_ [5,734,381 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5734381] "9340B064F0AEE130" "D1460EE3CE0D8AF8" "DF41BE31B895A25E" "9624A293749EF703" ...
## $ rideable_type : chr [1:5734381] "electric_bike" "classic_bike" "classic_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5734381], format: "2023-07-23 20:06:14" "2023-07-23 17:05:07" ...
## $ ended_at : POSIXct[1:5734381], format: "2023-07-23 20:22:44" "2023-07-23 17:18:37" ...
## $ start_station_name: chr [1:5734381] "Kedzie Ave & 110th St" "Western Ave & Walton St" "Western Ave & Walton St" "Racine Ave & Randolph St" ...
## $ start_station_id : chr [1:5734381] "20204" "KA1504000103" "KA1504000103" "13155" ...
## $ end_station_name : chr [1:5734381] "Public Rack - Racine Ave & 109th Pl" "Milwaukee Ave & Grand Ave" "Damen Ave & Pierce Ave" "Clinton St & Madison St" ...
## $ end_station_id : chr [1:5734381] "877" "13033" "TA1305000041" "TA1305000032" ...
## $ start_lat : num [1:5734381] 41.7 41.9 41.9 41.9 42 ...
## $ start_lng : num [1:5734381] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5734381] 41.7 41.9 41.9 41.9 42 ...
## $ end_lng : num [1:5734381] -87.7 -87.6 -87.7 -87.6 -87.6 ...
## $ member_casual : chr [1:5734381] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Now I have all my data in one table which is necessary for full analysis.
Latitude and longitude are not neeede for the analysis
all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
colnames(all_trips) #List of column names
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "member_casual"
nrow(all_trips) #How many rows are in data frame?
## [1] 5734381
dim(all_trips) #Dimensions of the data frame?
## [1] 5734381 9
head(all_trips) #See the first 6 rows of data frame. Also tail(all_trips)
## # A tibble: 6 × 9
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 9340B064F0AEE130 electric_bike 2023-07-23 20:06:14 2023-07-23 20:22:44
## 2 D1460EE3CE0D8AF8 classic_bike 2023-07-23 17:05:07 2023-07-23 17:18:37
## 3 DF41BE31B895A25E classic_bike 2023-07-23 10:14:53 2023-07-23 10:24:29
## 4 9624A293749EF703 electric_bike 2023-07-21 08:27:44 2023-07-21 08:32:40
## 5 2F68A6A4CDB4C99A classic_bike 2023-07-08 15:46:42 2023-07-08 15:58:08
## 6 9AEE973E6B941A9C classic_bike 2023-07-10 08:44:47 2023-07-10 08:49:41
## # ℹ 5 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, member_casual <chr>
str(all_trips) #See list of columns and data types (numeric, character, etc)
## tibble [5,734,381 × 9] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5734381] "9340B064F0AEE130" "D1460EE3CE0D8AF8" "DF41BE31B895A25E" "9624A293749EF703" ...
## $ rideable_type : chr [1:5734381] "electric_bike" "classic_bike" "classic_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5734381], format: "2023-07-23 20:06:14" "2023-07-23 17:05:07" ...
## $ ended_at : POSIXct[1:5734381], format: "2023-07-23 20:22:44" "2023-07-23 17:18:37" ...
## $ start_station_name: chr [1:5734381] "Kedzie Ave & 110th St" "Western Ave & Walton St" "Western Ave & Walton St" "Racine Ave & Randolph St" ...
## $ start_station_id : chr [1:5734381] "20204" "KA1504000103" "KA1504000103" "13155" ...
## $ end_station_name : chr [1:5734381] "Public Rack - Racine Ave & 109th Pl" "Milwaukee Ave & Grand Ave" "Damen Ave & Pierce Ave" "Clinton St & Madison St" ...
## $ end_station_id : chr [1:5734381] "877" "13033" "TA1305000041" "TA1305000032" ...
## $ member_casual : chr [1:5734381] "member" "member" "member" "member" ...
summary(all_trips) #Statistical summary of data. Mainly for numeric
## ride_id rideable_type started_at
## Length:5734381 Length:5734381 Min. :2023-07-01 00:00:00.00
## Class :character Class :character 1st Qu.:2023-08-27 15:55:29.00
## Mode :character Mode :character Median :2023-11-09 14:41:58.00
## Mean :2023-12-16 20:05:31.20
## 3rd Qu.:2024-04-23 17:55:40.00
## Max. :2024-06-30 23:55:17.06
## ended_at start_station_name start_station_id
## Min. :2023-07-01 00:01:26.00 Length:5734381 Length:5734381
## 1st Qu.:2023-08-27 16:19:14.00 Class :character Class :character
## Median :2023-11-09 14:55:49.00 Mode :character Mode :character
## Mean :2023-12-16 20:23:50.00
## 3rd Qu.:2024-04-23 18:08:50.00
## Max. :2024-06-30 23:59:57.93
## end_station_name end_station_id member_casual
## Length:5734381 Length:5734381 Length:5734381
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
Let’s create a new column for ride length and then clean the results we don’t need (negative and 0 values).
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
is.factor(all_trips$ride_length)
## [1] FALSE
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE
check_for_negative_values(all_trips)
all_trips[all_trips <= 0] <- NA
all_trips <- na.omit(all_trips)
all_trips_2 <- all_trips[apply(all_trips!=0, 1, all),]
check_for_negative_values(all_trips_2)
## Data frame does not contain negative values.
## NULL
The data is clean.
table(all_trips_2$member_casual)
##
## casual member
## 1503559 2770268
We can see clearly that there are many more casual riders than members.
mean(all_trips_2$ride_length) #straight average (total ride length / rides)
## [1] 990.8093
median(all_trips_2$ride_length) #midpoint number in the ascending array of ride lengths
## [1] 604
max(all_trips_2$ride_length) #longest ride
## [1] 413473
min(all_trips_2$ride_length) #shortest ride
## [1] 0.155
summary(all_trips_2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2 347.0 604.0 990.8 1082.0 413473.0
aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = mean)
## all_trips_2$member_casual all_trips_2$ride_length
## 1 casual 1434.4350
## 2 member 750.0321
aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = median)
## all_trips_2$member_casual all_trips_2$ride_length
## 1 casual 796
## 2 member 529
aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = max)
## all_trips_2$member_casual all_trips_2$ride_length
## 1 casual 413473
## 2 member 89859
aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = min)
## all_trips_2$member_casual all_trips_2$ride_length
## 1 casual 0.155
## 2 member 0.263
all_trips_2$day_of_week <- ordered(all_trips_2$day_of_week, levels=c("domingo", "lunes", "martes", "miércoles", "jueves", "viernes", "sábado"))
aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual + all_trips_2$day_of_week, FUN = mean)
## all_trips_2$member_casual all_trips_2$day_of_week all_trips_2$ride_length
## 1 casual domingo 1658.9865
## 2 member domingo 846.7964
## 3 casual lunes 1406.8795
## 4 member lunes 717.3063
## 5 casual martes 1272.5568
## 6 member martes 723.1518
## 7 casual miércoles 1254.6979
## 8 member miércoles 724.1508
## 9 casual jueves 1225.2893
## 10 member jueves 709.9829
## 11 casual viernes 1380.3573
## 12 member viernes 733.7183
## 13 casual sábado 1609.3650
## 14 member sábado 835.3832
Casual riders are using the bikes for a longer period of time at any dady of the week. This could be explained as many tourists to the city could be using the bikes for visiting.
aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual + all_trips_2$day_of_week, FUN = mean)
## all_trips_2$member_casual all_trips_2$day_of_week all_trips_2$ride_length
## 1 casual domingo 1658.9865
## 2 member domingo 846.7964
## 3 casual lunes 1406.8795
## 4 member lunes 717.3063
## 5 casual martes 1272.5568
## 6 member martes 723.1518
## 7 casual miércoles 1254.6979
## 8 member miércoles 724.1508
## 9 casual jueves 1225.2893
## 10 member jueves 709.9829
## 11 casual viernes 1380.3573
## 12 member viernes 733.7183
## 13 casual sábado 1609.3650
## 14 member sábado 835.3832
all_trips_2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual "do\\." 268375 1659.
## 2 casual "lu\\." 175574 1407.
## 3 casual "ma\\." 173707 1273.
## 4 casual "mi\\." 178181 1255.
## 5 casual "ju\\." 182745 1225.
## 6 casual "vi\\." 213507 1380.
## 7 casual "sá\\." 311470 1609.
## 8 member "do\\." 313737 847.
## 9 member "lu\\." 395057 717.
## 10 member "ma\\." 437338 723.
## 11 member "mi\\." 447921 724.
## 12 member "ju\\." 440940 710.
## 13 member "vi\\." 385370 734.
## 14 member "sá\\." 349905 835.
all_trips_2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual "do\\." 268375 1659.
## 2 casual "lu\\." 175574 1407.
## 3 casual "ma\\." 173707 1273.
## 4 casual "mi\\." 178181 1255.
## 5 casual "ju\\." 182745 1225.
## 6 casual "vi\\." 213507 1380.
## 7 casual "sá\\." 311470 1609.
## 8 member "do\\." 313737 847.
## 9 member "lu\\." 395057 717.
## 10 member "ma\\." 437338 723.
## 11 member "mi\\." 447921 724.
## 12 member "ju\\." 440940 710.
## 13 member "vi\\." 385370 734.
## 14 member "sá\\." 349905 835.
all_trips_2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") + scale_fill_manual(values = c ("casual" = "darkblue", "member" = "darkorange"))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Although there are more casual users than members, it is clear that members use the bikes more frequently. This could be explain due to the fact that many members are most probably using the bikes to commute so the rides are shorter but more frequent.
all_trips_2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") + scale_fill_manual(values = c ("casual" = "purple", "member" = "yellow"))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
This graph shows clearly that casual riders are using the bikes for longer periods of time. Many temporary visitor to the city probably using the bikes to move around and their therefore their trips are longer.
rides<- all_trips_2%>%
select(start_station_name,end_station_name,ride_length)
summary(rides)
## start_station_name end_station_name ride_length
## Length:4273827 Length:4273827 Min. : 0.2
## Class :character Class :character 1st Qu.: 347.0
## Mode :character Mode :character Median : 604.0
## Mean : 990.8
## 3rd Qu.: 1082.0
## Max. :413473.0
max_ride <- dplyr::filter(rides, ride_length == max(ride_length))
max_start_station <- max_ride$start_station_name
max_end_station <- max_ride$end_station_name
rides<- all_trips_2%>%
select(start_station_name,end_station_name,ride_length)
summary(rides)
## start_station_name end_station_name ride_length
## Length:4273827 Length:4273827 Min. : 0.2
## Class :character Class :character 1st Qu.: 347.0
## Mode :character Mode :character Median : 604.0
## Mean : 990.8
## 3rd Qu.: 1082.0
## Max. :413473.0
min_ride <- dplyr::filter(rides, ride_length == min(ride_length))
min_start_station <- min_ride$start_station_name
min_end_station <- min_ride$end_station_name
cat("Maximum ride length starts at:", max_start_station,",","and ends at:", max_end_station, "\n")
## Maximum ride length starts at: Larrabee St & Webster Ave , and ends at: Michigan Ave & Oak St
cat("Minimum ride length starts at:", min_start_station, ",","and ends at:", min_end_station, "\n")
## Minimum ride length starts at: Millennium Park , and ends at: Millennium Park
combined_rides<-rbind(max_ride,min_ride)
ggplot(combined_rides, aes(x = str_wrap(start_station_name,width = 10), y = ride_length, fill = end_station_name)) +
geom_bar(stat = "identity", fill = "#FF6666") +
geom_text(aes(label = ride_length), vjust = -0.5) +
labs(title = "Maximum and Minimum Ride Length by Start and End Stations", x = "Start_Station_Name", y = "Ride_Length (minutes)") +
theme_minimal()+
theme(
plot.background = element_rect(color = "purple", linewidth = 1),
panel.border = element_rect(color = "lightblue", fill = NA, size = 0.5)
)+
scale_y_continuous(labels = scales::number_format(accuracy = 1))
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
### Create a visualization for the number of rides by rider type
ggplot(data = all_trips_2) +
geom_bar(mapping=aes(x=member_casual, fill=member_casual)) +
labs(title = "Number of Rides: Member vs. Casual") +
scale_fill_manual(values = c ("casual" = "purple", "member" = "lightblue"))
### Top 3 Start Stations
all_trips_2 %>%
group_by(start_station_name) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
head(4)
## # A tibble: 4 × 2
## start_station_name count
## <chr> <int>
## 1 Streeter Dr & Grand Ave 60601
## 2 DuSable Lake Shore Dr & Monroe St 40556
## 3 Michigan Ave & Oak St 35295
## 4 DuSable Lake Shore Dr & North Blvd 34665
** The top stations are found on the water front where most amenities are found link
all_trips_2 %>%
group_by(end_station_name) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
head(4)
## # A tibble: 4 × 2
## end_station_name count
## <chr> <int>
## 1 Streeter Dr & Grand Ave 62076
## 2 DuSable Lake Shore Dr & Monroe St 38878
## 3 DuSable Lake Shore Dr & North Blvd 37448
## 4 Michigan Ave & Oak St 35976
all_trips_2 %>%
ggplot(aes(started_at)) +
geom_histogram(color = "#000000", aes(fill = member_casual)) +
theme(axis.text.x = element_text(angle = 90)) +
facet_wrap(~member_casual) +
scale_fill_manual(values = c ("casual" = "purple", "member" = "lightyellow"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
** All riders use the service during the warmer months of the year and
it drops during the coldest months. Chicago is very cold during the
winter months so that explains it.
All these finding suggest that if Cyclistic would like to get more members, these are my recommendations:
At the moments there are only a two options for casual members, one ride pass and one day pass. It would be interesting to create new passes such as Weekend pass, a three day pass, Weekly pass and Monthly pass.
Attracting prices could be a good way of converting recurrent casual rider into full membership. This could be done by introducing discounts or giving a free month rides per year.
Advertising at main stations, popular destinations and on social media the new rates and discounts should be a priority once the new plans are put in place.
##======== STEP 5: EXPORT SUMMARY FILE FOR FURTHER ANALYSIS ========
counts <- aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual +
all_trips_2$day_of_week, FUN = mean)
View(counts)
write.csv(counts, file ='/Users/Juan/Desktop/cases/Bikes/avg_ride_lenght.csv')