Cyclistic is a bike-share company based in Chicago that launched a successful bike-sharing program in 2016. Throughout the years, the program has expanded significantly to a fleet of 5,824 bicycles and a network of 692 geotracked stations sprawled across the city. With the large number of bicycles across numerous stations, customers can rent bikes from one station and return them to any other station within the network at their convenience. This encourages people to opt for cycling as a mode of transportation, therefore contributing to the success of Cyclistic’s bike-sharing program.
Cyclistic’s marketing strategy has so far focused on building general awareness and appealing to broad consumer segments. The company offers flexibile pricing plans that cater to diverse needs of users including single-ride passes, full-day passes, and annual memberships. Besides, it provides reclining bikes, hand tricycles, and cargo bikes, effectively welcoming individuals with disabilities and those who can’t ride on the standard two-wheeled bicycles. Based on the company database, Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day. While traditional bikes remain as the popular option, around 8% of users opt for the assistive alternatives.
The company’s marketing director believes that the company’s future success depends on maximizing the number of annual memberships. Therefore, as a junior data analyst, my team and I have to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, we will design a new marketing strategy to convert casual riders into annual members.
Understand the differences between casual and member usage by answering the following questions:
Installing & Loading tidyverse and dplyr for data manipulation
Dataset to be used for this analysis can be found at divvy-tripdata. The data has been made available by Motivate International Inc. under this license. The Q1 2020 and Q1 2021 (Q1 2019 data was unavailable) data were used in this analysis.
R was used to combine, clean, analyze and visualize. This includes the following files that were added:
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
setwd('/cloud/project/')
df1 <- read_csv('/cloud/project/Divvy_Trips_2020_Q1.csv')
## Rows: 426887 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (7): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2 <- read_csv('/cloud/project/Divvy_Trips_2021_Q1.csv')
## Rows: 96834 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2$start_station_id =as.numeric(as.character(df2$start_station_id))
## Warning: NAs introduced by coercion
df2$end_station_id = as.numeric(as.character(df2$end_station_id))
## Warning: NAs introduced by coercion
bikeshare_data <- na.omit(bikeshare_data)
bikeshare_data$ride_length <- as.numeric(bikeshare_data$ride_length)
summary(bikeshare_data$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 318.0 518.0 791.6 861.0 86399.0
hist(bikeshare_data$ride_length)
Q1 <- quantile(bikeshare_data$ride_length, .25)
Q3 <- quantile(bikeshare_data$ride_length, .75)
IQR <- IQR(bikeshare_data$ride_length)
outliers <- subset(bikeshare_data, bikeshare_data$ride_length<(Q1 - 1.5*IQR) | bikeshare_data$ride_length>(Q3 + 1.5*IQR))
dim(outliers)
## [1] 14589 15
bikeshare_data <- bikeshare_data[!(bikeshare_data$ride_length > 480),]
bikeshare_data <- bikeshare_data[!(bikeshare_data$ride_length < 5),]
ggplot(bikeshare_data, aes(x=member_casual, y=ride_length, fill=member_casual)) +
geom_boxplot()
rows_with_na <- bikeshare_data[apply(bikeshare_data, 1, function(x) any(is.na(x))
), ]
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(lubridate)
bikeshare_data$date <- ymd_hms(bikeshare_data$started_at)
bikeshare_data$weekday <- weekdays(bikeshare_data$date)
bikeshare_data$month_number <- format(bikeshare_data$date, "%m")
bikeshare_data$month_number <- as.numeric(bikeshare_data$month_number)
bikeshare_data$month <- month.name[bikeshare_data$month_number]
bikeshare_data$year <- year(bikeshare_data$date)
bikeshare_data$year <- as.character(bikeshare_data$year)
bikeshare_data$start_hour <- hour(bikeshare_data$date)
bikeshare_data %>%
group_by(rideable_type, month) %>%
summarize(count = n()) %>%
arrange(desc(count))
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: rideable_type [3]
## rideable_type month count
## <chr> <chr> <int>
## 1 docked_bike January 67330
## 2 docked_bike February 28497
## 3 classic_bike January 4620
## 4 electric_bike January 1824
bikeshare_data %>%
group_by(rideable_type, weekday) %>%
summarize(count = n()) %>%
arrange(desc(count))
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
## # A tibble: 21 × 3
## # Groups: rideable_type [3]
## rideable_type weekday count
## <chr> <chr> <int>
## 1 docked_bike Tuesday 17321
## 2 docked_bike Thursday 17187
## 3 docked_bike Wednesday 16697
## 4 docked_bike Friday 15826
## 5 docked_bike Monday 15640
## 6 docked_bike Sunday 6579
## 7 docked_bike Saturday 6577
## 8 classic_bike Friday 736
## 9 classic_bike Thursday 708
## 10 classic_bike Saturday 705
## # ℹ 11 more rows
bikeshare_data %>%
group_by(rideable_type, year) %>%
summarize(count = n()) %>%
arrange(desc(count))
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: rideable_type [3]
## rideable_type year count
## <chr> <chr> <int>
## 1 docked_bike 2020 95776
## 2 classic_bike 2021 4620
## 3 electric_bike 2021 1824
## 4 docked_bike 2021 51
bikeshare_data %>%
group_by(member_casual, weekday) %>%
summarize(count = n()) %>%
arrange(desc(count))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups: member_casual [2]
## member_casual weekday count
## <chr> <chr> <int>
## 1 member Tuesday 17765
## 2 member Thursday 17704
## 3 member Wednesday 17168
## 4 member Friday 16402
## 5 member Monday 16203
## 6 member Saturday 7139
## 7 member Sunday 6927
## 8 casual Friday 469
## 9 casual Wednesday 459
## 10 casual Thursday 456
## 11 casual Tuesday 425
## 12 casual Saturday 407
## 13 casual Sunday 376
## 14 casual Monday 371
bikeshare_data %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),
average_ride_duration = mean(ride_length),
median = median(ride_length))%>%
arrange(desc(number_of_rides))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 5
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_ride_duration median
## <chr> <chr> <int> <dbl> <dbl>
## 1 member Tuesday 17765 303. 308
## 2 member Thursday 17704 302. 306
## 3 member Wednesday 17168 300. 303
## 4 member Friday 16402 302. 306
## 5 member Monday 16203 301. 305
## 6 member Saturday 7139 294. 299
## 7 member Sunday 6927 294. 296
## 8 casual Friday 469 313. 327
## 9 casual Wednesday 459 297. 328
## 10 casual Thursday 456 321. 337
## 11 casual Tuesday 425 331. 349
## 12 casual Saturday 407 325. 352
## 13 casual Sunday 376 321. 349
## 14 casual Monday 371 324. 334
bikeshare_data %>%
group_by(member_casual) %>%
summarise(total_rides = length(member_casual),
percentage = total_rides/nrow(bikeshare_data) * 100,
avg_ride_length = mean(ride_length),
median = median(ride_length))
## # A tibble: 2 × 5
## member_casual total_rides percentage avg_ride_length median
## <chr> <int> <dbl> <dbl> <dbl>
## 1 casual 2963 2.90 318. 339
## 2 member 99308 97.1 301. 304
bikeshare_data %>%
group_by(rideable_type) %>%
summarise(total_rides = length(member_casual),
percentage = total_rides/nrow(bikeshare_data) * 100,
avg_ride_length = mean(ride_length),
median = median(ride_length))
## # A tibble: 3 × 5
## rideable_type total_rides percentage avg_ride_length median
## <chr> <int> <dbl> <dbl> <dbl>
## 1 classic_bike 4620 4.52 287. 294
## 2 docked_bike 95827 93.7 302. 306
## 3 electric_bike 1824 1.78 292. 300
bikeshare_data %>%
group_by(rideable_type, weekday) %>%
summarise(number_of_rides = n(),
average_ride_duration = mean(ride_length),
median = median(ride_length),
max = max(ride_length),
min = min(ride_length)) %>%
arrange(desc(number_of_rides))
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
## # A tibble: 21 × 7
## # Groups: rideable_type [3]
## rideable_type weekday number_of_rides average_ride_duration median max
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 docked_bike Tuesday 17321 304. 308 480
## 2 docked_bike Thursday 17187 303. 307 480
## 3 docked_bike Wednesday 16697 301. 304 480
## 4 docked_bike Friday 15826 304. 308 480
## 5 docked_bike Monday 15640 302. 306 480
## 6 docked_bike Sunday 6579 296. 299 480
## 7 docked_bike Saturday 6577 298. 303 480
## 8 classic_bike Friday 736 284. 285 480
## 9 classic_bike Thursday 708 290. 294. 480
## 10 classic_bike Saturday 705 277. 284 480
## # ℹ 11 more rows
## # ℹ 1 more variable: min <dbl>
bikeshare_data %>%
group_by(rideable_type,member_casual,weekday) %>%
summarise(total_count=n()) %>%
arrange(desc(total_count))
## `summarise()` has grouped output by 'rideable_type', 'member_casual'. You can
## override using the `.groups` argument.
## # A tibble: 42 × 4
## # Groups: rideable_type, member_casual [6]
## rideable_type member_casual weekday total_count
## <chr> <chr> <chr> <int>
## 1 docked_bike member Tuesday 16975
## 2 docked_bike member Thursday 16841
## 3 docked_bike member Wednesday 16331
## 4 docked_bike member Friday 15458
## 5 docked_bike member Monday 15366
## 6 docked_bike member Saturday 6303
## 7 docked_bike member Sunday 6293
## 8 classic_bike member Friday 684
## 9 classic_bike member Thursday 644
## 10 classic_bike member Saturday 617
## # ℹ 32 more rows
bikeshare_data %>%
group_by(member_casual, month) %>%
summarize(count = n()) %>%
arrange(desc(count))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: member_casual [2]
## member_casual month count
## <chr> <chr> <int>
## 1 member January 71596
## 2 member February 27712
## 3 casual January 2178
## 4 casual February 785
bikeshare_data %>%
group_by(member_casual, rideable_type, weekday) %>%
summarize(count = n()) %>%
arrange(desc(count))
## `summarise()` has grouped output by 'member_casual', 'rideable_type'. You can
## override using the `.groups` argument.
## # A tibble: 42 × 4
## # Groups: member_casual, rideable_type [6]
## member_casual rideable_type weekday count
## <chr> <chr> <chr> <int>
## 1 member docked_bike Tuesday 16975
## 2 member docked_bike Thursday 16841
## 3 member docked_bike Wednesday 16331
## 4 member docked_bike Friday 15458
## 5 member docked_bike Monday 15366
## 6 member docked_bike Saturday 6303
## 7 member docked_bike Sunday 6293
## 8 member classic_bike Friday 684
## 9 member classic_bike Thursday 644
## 10 member classic_bike Saturday 617
## # ℹ 32 more rows
bikeshare_data %>%
group_by(start_station_name) %>%
summarise(total_count = length((start_station_name))) %>%
arrange(desc(total_count)) %>% top_n(10)
## Selecting by total_count
## # A tibble: 10 × 2
## start_station_name total_count
## <chr> <int>
## 1 Clinton St & Madison St 1852
## 2 Kingsbury St & Kinzie St 1658
## 3 Canal St & Adams St 1657
## 4 Clinton St & Washington Blvd 1392
## 5 Clinton St & Lake St 1117
## 6 Columbus Dr & Randolph St 1044
## 7 Canal St & Madison St 984
## 8 Dearborn St & Erie St 951
## 9 LaSalle St & Jackson Blvd 918
## 10 Desplaines St & Kinzie St 914
bikeshare_data %>%
group_by(start_station_name, member_casual) %>%
summarise(total_count = length((start_station_name))) %>%
arrange(desc(total_count)) %>% top_n(10)
## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.
## Selecting by total_count
## # A tibble: 1,014 × 3
## # Groups: start_station_name [593]
## start_station_name member_casual total_count
## <chr> <chr> <int>
## 1 Clinton St & Madison St member 1837
## 2 Kingsbury St & Kinzie St member 1637
## 3 Canal St & Adams St member 1628
## 4 Clinton St & Washington Blvd member 1385
## 5 Clinton St & Lake St member 1085
## 6 Columbus Dr & Randolph St member 1030
## 7 Canal St & Madison St member 963
## 8 Dearborn St & Erie St member 929
## 9 LaSalle St & Jackson Blvd member 913
## 10 Desplaines St & Kinzie St member 885
## # ℹ 1,004 more rows
bikeshare_data %>%
group_by(start_station_name, rideable_type) %>%
summarise(total_count = length((start_station_name))) %>%
arrange(desc(total_count)) %>% top_n(10)
## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.
## Selecting by total_count
## # A tibble: 1,002 × 3
## # Groups: start_station_name [593]
## start_station_name rideable_type total_count
## <chr> <chr> <int>
## 1 Clinton St & Madison St docked_bike 1852
## 2 Kingsbury St & Kinzie St docked_bike 1658
## 3 Canal St & Adams St docked_bike 1579
## 4 Clinton St & Washington Blvd docked_bike 1392
## 5 Clinton St & Lake St docked_bike 1035
## 6 Canal St & Madison St docked_bike 959
## 7 Columbus Dr & Randolph St docked_bike 929
## 8 LaSalle St & Jackson Blvd docked_bike 918
## 9 Desplaines St & Kinzie St docked_bike 914
## 10 University Ave & 57th St docked_bike 884
## # ℹ 992 more rows
bikeshare_data %>%
group_by(start_station_name, member_casual, rideable_type) %>%
summarise(total_count = length((start_station_name))) %>%
arrange(desc(total_count)) %>% top_n(10)
## `summarise()` has grouped output by 'start_station_name', 'member_casual'. You
## can override using the `.groups` argument.
## Selecting by total_count
## # A tibble: 1,624 × 4
## # Groups: start_station_name, member_casual [1,014]
## start_station_name member_casual rideable_type total_count
## <chr> <chr> <chr> <int>
## 1 Clinton St & Madison St member docked_bike 1837
## 2 Kingsbury St & Kinzie St member docked_bike 1637
## 3 Canal St & Adams St member docked_bike 1559
## 4 Clinton St & Washington Blvd member docked_bike 1385
## 5 Clinton St & Lake St member docked_bike 1015
## 6 Canal St & Madison St member docked_bike 940
## 7 Columbus Dr & Randolph St member docked_bike 921
## 8 LaSalle St & Jackson Blvd member docked_bike 913
## 9 Desplaines St & Kinzie St member docked_bike 885
## 10 University Ave & 57th St member docked_bike 849
## # ℹ 1,614 more rows
top_station <- bikeshare_data %>%
group_by(start_station_name) %>%
summarise(total_count = length(na.omit(start_station_name))) %>%
arrange(desc(total_count)) %>%
top_n(10)
## Selecting by total_count