Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. The Director of Marketing believes there is a solid opportunity to convert casual riders into members. The goal is to design marketing strategies aimed at converting casual riders into annual members.
Identify how annual members and casual riders differ in their use of the bike-share program
Analyzed member versus casual rider data for number of rides by time of day, number of rides by day of week, number of rides per month, average ride duration (in minutes) by day of week, and average ride duration (in minutes) by month.
library(tidyverse) #helps wrangle data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Use the conflicted package to manage conflicts
library(conflicted)
# Set dplyr::filter and dplyr::lag as the default choices
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
setwd("~/Desktop/capstone/Divvy_Data_2021")
Upload Divvy datasets (csv files).
On Kaggle: 2021 data can be found here: /kaggle/input/cyclistic-case-study-google-certificate
Downloaded the data to my drive from here: Pulled from here
License: here
jan <- read_csv("202101-divvy-tripdata.csv")
## Rows: 96834 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb <- read_csv("202102-divvy-tripdata.csv")
## Rows: 49622 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar <- read_csv("202103-divvy-tripdata.csv")
## Rows: 228496 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr <- read_csv("202104-divvy-tripdata.csv")
## Rows: 337230 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may <- read_csv("202105-divvy-tripdata.csv")
## Rows: 531633 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun <- read_csv("202106-divvy-tripdata.csv")
## Rows: 729595 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul <- read_csv("202107-divvy-tripdata.csv")
## Rows: 822410 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug <- read_csv("202108-divvy-tripdata.csv")
## Rows: 804352 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep <- read_csv("202109-divvy-tripdata.csv")
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct <- read_csv("202110-divvy-tripdata.csv")
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov <- read_csv("202111-divvy-tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec <- read_csv("202112-divvy-tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
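The twelve read_csv() calls above follow one file-name pattern, so they could also be driven from a file listing. The following is a minimal sketch in base R, not the method used in this notebook. The temporary folder, the two tiny stand-in files, and the names tmp_dir, monthly, and demo_trips are hypothetical and exist only to keep the example self-contained.

```r
# Hypothetical stand-in data: two tiny "monthly" files in a temporary folder
tmp_dir <- tempfile("divvy_demo")
dir.create(tmp_dir)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(tmp_dir, "202101-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(tmp_dir, "202102-divvy-tripdata.csv"), row.names = FALSE)

# List every file that matches the naming pattern (list.files sorts alphabetically)
csv_files <- list.files(tmp_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)

# Read each file into a list, naming each element by its yyyymm prefix
monthly <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
names(monthly) <- substr(basename(csv_files), 1, 6)

# Stack the monthly tables into one data frame
demo_trips <- do.call(rbind, monthly)
nrow(demo_trips)
```

Keeping the tables in a named list also makes later per-file checks easier to loop over.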
Check the structure of the files using spec()
spec(jan)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(feb)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(mar)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(apr)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(may)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(jun)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(jul)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(aug)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(sep)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(oct)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(nov)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(dec)
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
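Rather than comparing the twelve spec() printouts by eye, the comparison could be automated. This is a sketch under the assumption that the monthly tables are collected in a list. The data frames jan_demo and feb_demo and the helper column_layout() are hypothetical stand-ins, not objects from this notebook.

```r
# Hypothetical stand-ins for two monthly tables
jan_demo <- data.frame(
  ride_id = "A1",
  started_at = as.POSIXct("2021-01-23 16:14:19", tz = "UTC"),
  member_casual = "member",
  stringsAsFactors = FALSE
)
feb_demo <- data.frame(
  ride_id = "B1",
  started_at = as.POSIXct("2021-02-02 08:00:00", tz = "UTC"),
  member_casual = "casual",
  stringsAsFactors = FALSE
)
monthly_demo <- list(jan = jan_demo, feb = feb_demo)

# Column names plus the first class of each column, as a named character vector
column_layout <- function(df) sapply(df, function(col) class(col)[1])

# Compare every table against the first one
reference <- column_layout(monthly_demo[[1]])
same_layout <- sapply(monthly_demo, function(df) identical(column_layout(df), reference))
same_layout
```

If any entry of same_layout is FALSE, that table needs a type or name fix before the tables are stacked.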
all_trips <- rbind(jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec)
Inspect the new table that has been created – the column names, the first few lines of data, the types of data in each column, and a statistical summary of numeric data.
colnames(all_trips)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
head(all_trips)
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
str(all_trips)
## spc_tbl_ [5,595,063 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5595063] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:5595063] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5595063], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct[1:5595063], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:5595063] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:5595063] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:5595063] NA NA NA NA ...
## $ end_station_id : chr [1:5595063] NA NA NA NA ...
## $ start_lat : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:5595063] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(all_trips)
## ride_id rideable_type started_at
## Length:5595063 Length:5595063 Min. :2021-01-01 00:02:05.00
## Class :character Class :character 1st Qu.:2021-06-06 23:52:40.00
## Mode :character Mode :character Median :2021-08-01 01:52:11.00
## Mean :2021-07-29 07:41:02.63
## 3rd Qu.:2021-09-24 16:36:16.00
## Max. :2021-12-31 23:59:48.00
##
## ended_at start_station_name start_station_id
## Min. :2021-01-01 00:08:39.00 Length:5595063 Length:5595063
## 1st Qu.:2021-06-07 00:44:21.00 Class :character Class :character
## Median :2021-08-01 02:21:55.00 Mode :character Mode :character
## Mean :2021-07-29 08:02:58.75
## 3rd Qu.:2021-09-24 16:54:05.50
## Max. :2022-01-03 17:32:18.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5595063 Length:5595063 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
##
## end_lat end_lng member_casual
## Min. :41.39 Min. :-88.97 Length:5595063
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.17 Max. :-87.49
## NA's :4771 NA's :4771
Add columns that list the date, hour (converted to numeric), day, month (converted to numeric), and year of each ride. Also create a column that specifies whether a day is a weekday or a weekend day. Set levels for the days of the week so that they will graph in the appropriate order. Rename the member_casual column to usertype (a personal preference).
all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$month <- as.numeric(all_trips$month) #Conversion
all_trips$hour <- format(as.POSIXct(all_trips$started_at), format = "%H")
all_trips$hour <- as.numeric(all_trips$hour) #Conversion
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
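all_trips$day_type <- ifelse(all_trips$day_of_week %in% c("Saturday", "Sunday"), "weekend", "weekday") #day_of_week holds full day names; day holds the day of the month
# Note: levels<- on a character column does not create a factor; the summaries below order days with wday(started_at, label = TRUE)
levels(all_trips$day) <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")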
all_trips <- rename(all_trips, usertype = member_casual)
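The derived columns can be illustrated on two of the start times shown in head(all_trips). This is a sketch in base R, assuming UTC timestamps. It builds the day name from the numeric weekday so the result does not depend on the session locale. The names demo, day_names, and wday_index are hypothetical.

```r
# Two start times taken from the head(all_trips) output; UTC is an assumption
demo <- data.frame(
  started_at = as.POSIXct(c("2021-01-23 16:14:19", "2021-01-27 18:43:08"), tz = "UTC")
)

demo$date  <- as.Date(demo$started_at)
demo$hour  <- as.numeric(format(demo$started_at, "%H"))
demo$month <- as.numeric(format(demo$date, "%m"))

# POSIXlt stores the weekday as 0 (Sunday) through 6 (Saturday)
day_names  <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
wday_index <- as.POSIXlt(demo$started_at)$wday

# A factor with explicit levels keeps the days in calendar order when plotted
demo$day_of_week <- factor(day_names[wday_index + 1], levels = day_names)
demo$day_type <- ifelse(demo$day_of_week %in% c("Saturday", "Sunday"),
                        "weekend", "weekday")
demo
```

The weekend flag is derived from the day name, not from the day of the month.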
Add a column for “ride_length” calculation. Convert it to a numeric and change calculation from seconds to minutes.
# Add a "ride_length" calculation to all_trips (difference in seconds)
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")
# Convert "ride_length" from difftime to numeric so we can run calculations on the data
all_trips$ride_length <- as.numeric(all_trips$ride_length)
#Convert ride_length from seconds to minutes for easier calculation
all_trips$ride_length <- (all_trips$ride_length / 60)
summary(all_trips$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -58.03 6.75 12.00 21.94 21.78 55944.15
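As a worked example of the ride_length calculation, the first two rides shown in head(all_trips) can be computed directly in minutes. This is a sketch that asks difftime() for minutes up front, which is an alternative to dividing seconds by 60. The UTC time zone and the name ride_length_min are assumptions for the illustration.

```r
# Start and end times of the first two rides in head(all_trips)
started_at <- as.POSIXct(c("2021-01-23 16:14:19", "2021-01-27 18:43:08"), tz = "UTC")
ended_at   <- as.POSIXct(c("2021-01-23 16:24:44", "2021-01-27 18:47:12"), tz = "UTC")

# Request minutes explicitly so the unit never depends on difftime's automatic choice
ride_length_min <- as.numeric(difftime(ended_at, started_at, units = "mins"))

# 625 seconds and 244 seconds, expressed in minutes
round(ride_length_min, 2)
```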
Inspecting the ride_length column shows some trip lengths that are negative and some that are multiple days long. Delete any rides shorter than 30 seconds or longer than 6 hours.
all_trips_clean <- all_trips[!(all_trips$ride_length < .5 | all_trips$ride_length > 360),]
Remove the data where the start or end station name is NA
all_trips_clean <- all_trips_clean %>% drop_na(start_station_name) %>% drop_na(end_station_name)
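The two cleaning rules can be seen together on a toy table. This is a sketch with made-up rides, written as a single logical filter in base R. The data frame demo and the vector keep are hypothetical names.

```r
# Made-up rides covering each case the cleaning rules are meant to catch
demo <- data.frame(
  ride_id = c("r1", "r2", "r3", "r4", "r5"),
  ride_length = c(-58.03, 0.4, 12, 400, 25),          # minutes
  start_station_name = c("A", "A", "A", "A", NA),
  end_station_name = c("B", "B", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Keep rides from 30 seconds up to 6 hours that have both station names
keep <- !is.na(demo$ride_length) &
  demo$ride_length >= 0.5 & demo$ride_length <= 360 &
  !is.na(demo$start_station_name) & !is.na(demo$end_station_name)

demo_clean <- demo[keep, ]
demo_clean
```

Testing for NA explicitly keeps rows with a missing ride length from turning into all-NA rows in the result.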
Create a csv file with the cleaned up data for future use
write_csv(all_trips_clean, "full-2021-divvydata.csv")
Step 4: Conduct Descriptive Analysis
summary(all_trips_clean$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.50 7.05 12.28 19.15 22.17 359.97
Compare the ride length data for members versus casual riders
paste('Mean ride length')
## [1] "Mean ride length"
aggregate(all_trips_clean$ride_length ~ all_trips_clean$usertype, FUN = mean)
## all_trips_clean$usertype all_trips_clean$ride_length
## 1 casual 26.61215
## 2 member 13.12268
paste("Median ride length")
## [1] "Median ride length"
aggregate(all_trips_clean$ride_length ~ all_trips_clean$usertype, FUN = median)
## all_trips_clean$usertype all_trips_clean$ride_length
## 1 casual 16.7
## 2 member 9.8
paste("Max ride length")
## [1] "Max ride length"
aggregate(all_trips_clean$ride_length ~ all_trips_clean$usertype, FUN = max)
## all_trips_clean$usertype all_trips_clean$ride_length
## 1 casual 359.9667
## 2 member 359.7500
paste("Min ride length")
## [1] "Min ride length"
aggregate(all_trips_clean$ride_length ~ all_trips_clean$usertype, FUN = min)
## all_trips_clean$usertype all_trips_clean$ride_length
## 1 casual 0.5
## 2 member 0.5
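The four aggregate() calls above could also be collapsed into one table. This is a sketch on made-up ride lengths, using split() and sapply() from base R. The names demo and ride_stats are hypothetical.

```r
# Made-up ride lengths in minutes
demo <- data.frame(
  usertype = c("casual", "casual", "casual", "member", "member", "member"),
  ride_length = c(10, 20, 60, 5, 9, 10),
  stringsAsFactors = FALSE
)

# One column per usertype, one row per statistic
ride_stats <- sapply(
  split(demo$ride_length, demo$usertype),
  function(x) c(mean = mean(x), median = median(x), max = max(x), min = min(x))
)
ride_stats
```

Seeing the mean next to the median for each group makes the skew from long rides easier to spot.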
Calculate how many rides were taken total, and how many were taken by members versus casual riders.
all_trips_clean %>%
summarise(number_of_rides = n())
## # A tibble: 1 × 1
## number_of_rides
## <int>
## 1 4545092
all_trips_clean %>%
group_by(usertype)%>%
summarise(number_of_rides = n())%>%
ggplot(aes(x=usertype, y = number_of_rides, fill = usertype))+
geom_col(position = "dodge") + geom_text(aes(label = number_of_rides)) +
labs(caption = "4,545,092 as total number of rides ")
Number of rides by bike type (with detail of casual and member counts)
all_trips_clean %>%
group_by(usertype, rideable_type)%>%
summarise(number_of_rides = n())%>%
ggplot(aes(x=rideable_type, y = number_of_rides, fill = usertype))+
geom_col(position="stack")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
Calculate average ride time for all riders. Visualize by usertype
all_trips_clean %>%
summarise(average_duration = mean(ride_length))
## # A tibble: 1 × 1
## average_duration
## <dbl>
## 1 19.1
all_trips_clean %>%
group_by(usertype)%>%
summarise(average_duration = mean(ride_length))%>%
ggplot(aes(x=usertype, y = average_duration, fill = usertype))+
geom_col(position = "dodge")+
geom_label(aes(x = usertype, label=average_duration)) +
labs(caption = "Average ride time overall is 19.14989 minutes")
Create dataframe with average ride time per day of week for members vs casual users. This doesn’t sort the data correctly, but it’s a good starting point.
aggregate(all_trips_clean$ride_length ~ all_trips_clean$usertype +
all_trips_clean$day_of_week, FUN = mean)
## all_trips_clean$usertype all_trips_clean$day_of_week
## 1 casual Friday
## 2 member Friday
## 3 casual Monday
## 4 member Monday
## 5 casual Saturday
## 6 member Saturday
## 7 casual Sunday
## 8 member Sunday
## 9 casual Thursday
## 10 member Thursday
## 11 casual Tuesday
## 12 member Tuesday
## 13 casual Wednesday
## 14 member Wednesday
## all_trips_clean$ride_length
## 1 24.57028
## 2 12.75790
## 3 27.15517
## 4 12.68280
## 5 28.74262
## 6 14.71857
## 7 30.64529
## 8 15.07385
## 9 22.67127
## 10 12.29999
## 11 24.18076
## 12 12.35584
## 13 23.17347
## 14 12.40694
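The alphabetical ordering above comes from day_of_week being a plain character column. One possible fix, sketched here on made-up values, is to turn the column into a factor with explicit levels before aggregating. The names demo, day_levels, and by_day are hypothetical.

```r
# Made-up average inputs for three days
demo <- data.frame(
  usertype = rep(c("casual", "member"), times = 3),
  day_of_week = rep(c("Friday", "Monday", "Sunday"), each = 2),
  ride_length = c(24, 12, 27, 13, 30, 15),
  stringsAsFactors = FALSE
)

# Explicit levels define the sort order that aggregate() will use
day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
demo$day_of_week <- factor(demo$day_of_week, levels = day_levels)

by_day <- aggregate(ride_length ~ usertype + day_of_week, data = demo, FUN = mean)
by_day
```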
Create a dataframe that calculates 1. average ride time and 2. number of rides, per day of week for members and casual users. This dataframe will be sorted and presented Sunday to Saturday.
all_trips_clean %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(usertype, weekday) %>%
summarise(number_of_rides = n(),
average_duration = mean(ride_length)) %>%
arrange(weekday, usertype)
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: usertype [2]
## usertype weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 400080 30.6
## 2 member Sun 307737 15.1
## 3 casual Mon 226918 27.2
## 4 member Mon 342968 12.7
## 5 casual Tue 213088 24.2
## 6 member Tue 384500 12.4
## 7 casual Wed 216427 23.2
## 8 member Wed 393931 12.4
## 9 casual Thu 222378 22.7
## 10 member Thu 369900 12.3
## 11 casual Fri 287606 24.6
## 12 member Fri 362101 12.8
## 13 casual Sat 464287 28.7
## 14 member Sat 353171 14.7
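The weekday table above can be cross-checked against the earlier totals. The sketch below re-enters the ride counts from that table and the per-usertype means reported earlier, then confirms that the counts add up to the 4,545,092 total and that the count-weighted mean reproduces the overall average of about 19.15 minutes. The object names are hypothetical.

```r
# Ride counts copied from the weekday table above
weekday_counts <- data.frame(
  weekday = rep(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), each = 2),
  usertype = rep(c("casual", "member"), times = 7),
  number_of_rides = c(400080, 307737, 226918, 342968, 213088, 384500,
                      216427, 393931, 222378, 369900, 287606, 362101,
                      464287, 353171)
)

# Total rides per usertype and overall
rides_by_usertype <- tapply(weekday_counts$number_of_rides,
                            weekday_counts$usertype, sum)
total_rides <- sum(rides_by_usertype)

# Mean ride length per usertype, copied from the aggregate() output
mean_by_usertype <- c(casual = 26.61215, member = 13.12268)

# The overall mean is the count-weighted average of the group means
overall_mean <- sum(rides_by_usertype * mean_by_usertype[names(rides_by_usertype)]) /
  total_rides
c(total_rides = total_rides, overall_mean = overall_mean)
```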
Create a dataframe that calculates 1. average ride time and 2. number of rides, per month for members and casual users. This dataframe will be sorted and printed January through December.
all_trips_clean %>%
mutate(month = month(started_at, label = TRUE)) %>%
group_by(usertype, month) %>%
summarise(number_of_rides = n(),
average_duration = mean(ride_length)) %>%
arrange(month, usertype)
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
## # A tibble: 24 × 4
## # Groups: usertype [2]
## usertype month number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Jan 14581 20.4
## 2 member Jan 68291 12.0
## 3 casual Feb 8499 27.2
## 4 member Feb 33951 14.1
## 5 casual Mar 75008 29.9
## 6 member Mar 128926 13.6
## 7 casual Apr 119413 29.9
## 8 member Apr 176048 14.2
## 9 casual May 214712 30.9
## 10 member May 231639 14.2
## # ℹ 14 more rows
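The printed table stops after ten rows, and summarise() reports that the output is still grouped. A sketch of one way to address both, assuming the dplyr package is installed: pass .groups = "drop" to summarise() and print with n = Inf. The tibble demo and its values are made up for the illustration.

```r
library(dplyr)

# Made-up rides: month as a number, ride_length in minutes
demo <- tibble(
  usertype = c("casual", "casual", "member", "member", "casual", "member"),
  month = c(1, 1, 1, 1, 2, 2),
  ride_length = c(20, 30, 10, 14, 28, 12)
)

monthly_summary <- demo %>%
  group_by(usertype, month) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length),
            .groups = "drop") %>%   # return an ungrouped tibble, no message
  arrange(month, usertype)

# n = Inf prints every row instead of the first ten
print(monthly_summary, n = Inf)
```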
Create a visualization showing the number of rides per day of the week. Separate by usertype.
all_trips_clean %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(usertype, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(usertype, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = usertype)) +
facet_wrap(~usertype) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
Create a visualization showing average ride duration per day of the week. Separate by usertype.
all_trips_clean %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(usertype, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(usertype, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = usertype)) +
facet_wrap(~usertype)+
geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
Create a visualization showing the number of rides by month.
all_trips_clean %>%
mutate(month = month(started_at, label = TRUE)) %>%
group_by(usertype, month) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(usertype, month) %>%
ggplot(aes(x = month, y = number_of_rides, fill = usertype)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
Create a visualization showing average ride duration by month.
all_trips_clean %>%
mutate(month = month(started_at, label = TRUE)) %>%
group_by(usertype, month) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(usertype, month) %>%
ggplot(aes(x = month, y = average_duration, fill = usertype)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
Create a visualization that shows the number of rides taken over the course of the day.
all_trips_clean %>%
mutate(time_of_day = as.numeric(hour)) %>%
group_by(usertype, time_of_day) %>%
summarise(number_of_rides = n()) %>%
arrange(usertype, time_of_day) %>%
ggplot() +
geom_line(aes(x = time_of_day, y = number_of_rides, color = usertype))
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
Create a visualization that shows the duration of rides taken over the course of the day.
all_trips_clean %>%
mutate(time_of_day = as.numeric(hour)) %>%
group_by(usertype, time_of_day) %>%
summarise(average_duration = mean(ride_length)) %>%
arrange(usertype, time_of_day) %>%
ggplot() +
geom_line(aes(x = time_of_day, y = average_duration, color = usertype))
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
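The line charts show the hourly pattern. The busiest hour per usertype can also be read off numerically. This is a sketch on made-up start hours using table() from base R. The names demo, rides_per_hour, and peak_hour are hypothetical, and which.max() returns the earliest hour when two hours tie.

```r
# Made-up start hours (0-23) for a handful of rides
demo <- data.frame(
  usertype = c(rep("casual", 4), rep("member", 5)),
  hour = c(17, 17, 12, 9, 8, 8, 8, 17, 17),
  stringsAsFactors = FALSE
)

# Rows are usertypes, columns are hours, cells are ride counts
rides_per_hour <- table(demo$usertype, demo$hour)

# For each usertype, pick the hour with the highest count
peak_hour <- apply(rides_per_hour, 1, function(counts) {
  as.numeric(names(counts)[which.max(counts)])
})
peak_hour
```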
Looking at start and end points: Given the question we are trying to solve for this capstone, it would be useful to know which stations are most popular for casual users versus members.
Five most popular start stations for members:
all_trips_clean %>%
filter(usertype == "member") %>%
group_by(start_station_name) %>%
summarise(number_of_rides = n()) %>%
arrange(desc(number_of_rides)) %>%
slice(1:5)
## # A tibble: 5 × 2
## start_station_name number_of_rides
## <chr> <int>
## 1 Clark St & Elm St 23673
## 2 Wells St & Concord Ln 22554
## 3 Kingsbury St & Kinzie St 22451
## 4 Wells St & Elm St 20064
## 5 Dearborn St & Erie St 18452
Five most popular start stations for casual users:
all_trips_clean %>%
filter(usertype == "casual") %>%
group_by(start_station_name) %>%
summarise(number_of_rides = n()) %>%
arrange(desc(number_of_rides)) %>%
slice(1:5)
## # A tibble: 5 × 2
## start_station_name number_of_rides
## <chr> <int>
## 1 Streeter Dr & Grand Ave 63832
## 2 Millennium Park 31870
## 3 Michigan Ave & Oak St 28434
## 4 Shedd Aquarium 22353
## 5 Theater on the Lake 20448
Five most popular end stations for members:
all_trips_clean %>%
filter(usertype == "member") %>%
group_by(end_station_name) %>%
summarise(number_of_rides = n()) %>%
arrange(desc(number_of_rides)) %>%
slice(1:5)
## # A tibble: 5 × 2
## end_station_name number_of_rides
## <chr> <int>
## 1 Clark St & Elm St 23745
## 2 Wells St & Concord Ln 23208
## 3 Kingsbury St & Kinzie St 22649
## 4 Wells St & Elm St 20625
## 5 Dearborn St & Erie St 19108
Five most popular end stations for casual users:
all_trips_clean %>%
filter(usertype == "casual") %>%
group_by(end_station_name) %>%
summarise(number_of_rides = n()) %>%
arrange(desc(number_of_rides)) %>%
slice(1:5)
## # A tibble: 5 × 2
## end_station_name number_of_rides
## <chr> <int>
## 1 Streeter Dr & Grand Ave 66971
## 2 Millennium Park 33501
## 3 Michigan Ave & Oak St 30146
## 4 Theater on the Lake 22122
## 5 Shedd Aquarium 20977
Five most popular start stations overall:
all_trips_clean %>%
group_by(start_station_name) %>%
summarise(number_of_rides = n()) %>%
arrange(desc(number_of_rides)) %>%
slice(1:5)
## # A tibble: 5 × 2
## start_station_name number_of_rides
## <chr> <int>
## 1 Streeter Dr & Grand Ave 79465
## 2 Michigan Ave & Oak St 42371
## 3 Wells St & Concord Ln 41290
## 4 Millennium Park 40062
## 5 Clark St & Elm St 39172
Five most popular end stations overall:
all_trips_clean %>%
group_by(end_station_name) %>%
summarise(number_of_rides = n()) %>%
arrange(desc(number_of_rides)) %>%
slice(1:5)
## # A tibble: 5 × 2
## end_station_name number_of_rides
## <chr> <int>
## 1 Streeter Dr & Grand Ave 81009
## 2 Michigan Ave & Oak St 43126
## 3 Wells St & Concord Ln 41701
## 4 Millennium Park 41411
## 5 Clark St & Elm St 38620
The data above shows that the most popular start and end stations for casual riders are at tourist destinations: Navy Pier (Streeter Dr & Grand Ave), Millennium Park, an intersection by Water Tower Place and the Drake Hotel (Michigan Ave & Oak St), Theater on the Lake, and the Shedd Aquarium.
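The station queries above repeat one pattern: filter by usertype, count rides per station, and keep the top rows. A small helper could cover all of them. This is a sketch in base R on made-up rides. The function top_stations() and the data frame demo are hypothetical, not part of this notebook.

```r
# Count rides per station within each usertype and keep the n busiest stations
top_stations <- function(trips, station_col, n = 5) {
  lapply(split(trips[[station_col]], trips$usertype), function(stations) {
    head(sort(table(stations), decreasing = TRUE), n)
  })
}

# Made-up rides using station names that appear in the tables above
demo <- data.frame(
  usertype = c("casual", "casual", "casual", "member", "member", "member"),
  start_station_name = c("Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                         "Millennium Park", "Clark St & Elm St",
                         "Clark St & Elm St", "Wells St & Elm St"),
  stringsAsFactors = FALSE
)

top_start <- top_stations(demo, "start_station_name", n = 1)
top_start
```

Calling the same helper with "end_station_name" would give the end-station rankings from the same code path.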