R Markdown Document for My Capstone on 2021 Divvy Data

Context:

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. The Director of Marketing believes there is a solid opportunity to convert casual riders into members. The goal is to design marketing strategies aimed at converting casual riders into annual members.

Key Task:

Identify how annual members and casual riders differ in their use of the bike-share program.

Summary of Analysis:

Analyzed member versus casual rider data for number of rides by time of day, number of rides by day of week, number of rides per month, average ride duration (in minutes) by day of week, and average ride duration (in minutes) by month.

Code

Load the necessary libraries

library(tidyverse)  #helps wrangle data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Use the conflicted package to manage conflicts
library(conflicted)

# Set dplyr::filter and dplyr::lag as the default choices
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.

Set Working Directory

setwd("~/Desktop/capstone/Divvy_Data_2021")

Step 1: Collect Data

Upload Divvy datasets (csv files).

On Kaggle: 2021 data can be found here: /kaggle/input/cyclistic-case-study-google-certificate

Downloaded the data to my drive from here: Pulled from here

License: here

jan <- read_csv("202101-divvy-tripdata.csv")
## Rows: 96834 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb <- read_csv("202102-divvy-tripdata.csv")
## Rows: 49622 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar <- read_csv("202103-divvy-tripdata.csv")
## Rows: 228496 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr <- read_csv("202104-divvy-tripdata.csv")
## Rows: 337230 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may <- read_csv("202105-divvy-tripdata.csv")
## Rows: 531633 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun <- read_csv("202106-divvy-tripdata.csv")
## Rows: 729595 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul <- read_csv("202107-divvy-tripdata.csv")
## Rows: 822410 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug <- read_csv("202108-divvy-tripdata.csv")
## Rows: 804352 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep <- read_csv("202109-divvy-tripdata.csv")
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct <- read_csv("202110-divvy-tripdata.csv")
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov <- read_csv("202111-divvy-tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec <- read_csv("202112-divvy-tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Check the structure of the files using spec()

spec(jan)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(feb)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(mar)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(apr)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(may)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(jun)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(jul)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(aug)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(sep)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(oct)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(nov)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
spec(dec)
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )
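The twelve printed specs can also be compared programmatically instead of by eye. The following is a minimal sketch, assuming the monthly data frames are collected in a named list; `jan_demo` and `feb_demo` are small hypothetical stand-ins, not the real `jan` and `feb` objects.

```r
# Hypothetical stand-ins for two monthly files (the real objects are jan, feb, ...)
jan_demo <- data.frame(ride_id = "A1", start_lat = 41.9, member_casual = "member",
                       stringsAsFactors = FALSE)
feb_demo <- data.frame(ride_id = "B2", start_lat = 41.8, member_casual = "casual",
                       stringsAsFactors = FALSE)
monthly <- list(jan = jan_demo, feb = feb_demo)

# Named vector: column name -> class, with multi-class columns collapsed to one string
col_classes <- function(df) {
  vapply(df, function(col) paste(class(col), collapse = "/"), character(1))
}

# Compare every month against the first one; TRUE means same names, order, and types
reference <- col_classes(monthly[[1]])
same_layout <- vapply(monthly, function(df) identical(col_classes(df), reference),
                      logical(1))
same_layout
```

Any month that prints `FALSE` would need a closer look before the files are combined.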

Step 2: Combine Datasets

all_trips <- rbind(jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec)
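As an alternative to twelve separate reads followed by rbind(), readr (version 2.0 or later) can read a vector of file paths and stack the rows in one call. This is only a sketch under stated assumptions: the folder `divvy_demo`, the two tiny files, and the column name `source_file` are hypothetical and exist only to make the example self-contained.

```r
library(readr)

# Write two tiny stand-in files to a temporary folder (hypothetical data)
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(ride_id = c("A1", "A2"),
                     member_casual = c("member", "casual")),
          file.path(demo_dir, "202101-divvy-tripdata.csv"))
write_csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(demo_dir, "202102-divvy-tripdata.csv"))

# Find every monthly file by its naming pattern
files <- sort(list.files(demo_dir, pattern = "-divvy-tripdata\\.csv$",
                         full.names = TRUE))

# Read all files at once; `id` records which file each row came from
all_trips_demo <- read_csv(files, id = "source_file", show_col_types = FALSE)
nrow(all_trips_demo)
```

Setting `show_col_types = FALSE` also suppresses the column specification message that is printed after each read.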

Step 3: Clean up and add data to prepare for analysis

Inspect the new table that has been created – the column names, the first few lines of data, the types of data in each column, and a statistical summary of numeric data.

colnames(all_trips)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
head(all_trips)
## # A tibble: 6 × 13
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
str(all_trips)
## spc_tbl_ [5,595,063 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5595063] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr [1:5595063] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5595063], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXct[1:5595063], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr [1:5595063] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:5595063] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:5595063] NA NA NA NA ...
##  $ end_station_id    : chr [1:5595063] NA NA NA NA ...
##  $ start_lat         : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:5595063] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(all_trips)
##    ride_id          rideable_type        started_at                    
##  Length:5595063     Length:5595063     Min.   :2021-01-01 00:02:05.00  
##  Class :character   Class :character   1st Qu.:2021-06-06 23:52:40.00  
##  Mode  :character   Mode  :character   Median :2021-08-01 01:52:11.00  
##                                        Mean   :2021-07-29 07:41:02.63  
##                                        3rd Qu.:2021-09-24 16:36:16.00  
##                                        Max.   :2021-12-31 23:59:48.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2021-01-01 00:08:39.00   Length:5595063     Length:5595063    
##  1st Qu.:2021-06-07 00:44:21.00   Class :character   Class :character  
##  Median :2021-08-01 02:21:55.00   Mode  :character   Mode  :character  
##  Mean   :2021-07-29 08:02:58.75                                        
##  3rd Qu.:2021-09-24 16:54:05.50                                        
##  Max.   :2022-01-03 17:32:18.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5595063     Length:5595063     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   :41.39   Min.   :-88.97   Length:5595063    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.17   Max.   :-87.49                     
##  NA's   :4771    NA's   :4771

Add columns that list the date, hour (converted to numeric), day, month (converted to numeric), and year of each ride. Also create a column that specifies whether a day is a weekday or a weekend. Set levels for the weekdays so that they graph in the appropriate order. The member_casual column is renamed to usertype as a personal preference.

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$month <- as.numeric(all_trips$month) #Conversion
all_trips$hour <- format(as.POSIXct(all_trips$started_at), format = "%H")
all_trips$hour <- as.numeric(all_trips$hour) #Conversion
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
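#day_of_week holds full day names such as "Saturday"; day holds the day of the month
all_trips$day_type <- ifelse(all_trips$day_of_week %in% c("Saturday", "Sunday"), "weekend", "weekday")
#This only attaches a levels attribute to the character column day; the weekday
#order in the tables and plots further down comes from wday(label = TRUE)
levels(all_trips$day) <- c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")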

all_trips <- rename(all_trips, usertype = member_casual)
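The weekday or weekend flag can also be derived without relying on day names, which depend on the session locale. A minimal sketch in base R, using `as.POSIXlt()$wday` (0 is Sunday, 6 is Saturday). The first and third timestamps appear in the data preview; the second one and all variable names are hypothetical.

```r
# Three start times from January 2021: a Saturday, a Sunday, and a Wednesday
started_at <- as.POSIXct(c("2021-01-23 16:14:19",
                           "2021-01-24 09:00:00",   # hypothetical ride
                           "2021-01-27 18:43:08"), tz = "UTC")
trip_date <- as.Date(started_at)

# Locale-independent weekday index: 0 = Sunday, ..., 6 = Saturday
wday_index <- as.POSIXlt(trip_date)$wday
day_type <- ifelse(wday_index %in% c(0, 6), "weekend", "weekday")

# Build an explicitly ordered factor so tables and plots sort Sunday to Saturday
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
day_of_week <- factor(day_names[wday_index + 1], levels = day_names)

data.frame(trip_date, day_of_week, day_type)
```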

Add a column for “ride_length” calculation. Convert it to a numeric and change calculation from seconds to minutes.

# Add a "ride_length" calculation to all_trips (in seconds at this stage)
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")

# Convert "ride_length" from difftime to numeric so we can run calculations on the data
all_trips$ride_length <- as.numeric(all_trips$ride_length)

# Convert ride_length from seconds to minutes for easier calculation
all_trips$ride_length <- (all_trips$ride_length / 60)

summary(all_trips$ride_length)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   -58.03     6.75    12.00    21.94    21.78 55944.15
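When no units are given, difftime() chooses them automatically from the data, so stating the units makes the result predictable. A short sketch that computes minutes directly, using the first two rides from the preview above (the variable names are illustrative only):

```r
# Start and end times of the first two rides shown in head(all_trips)
started_at <- as.POSIXct(c("2021-01-23 16:14:19", "2021-01-27 18:43:08"), tz = "UTC")
ended_at   <- as.POSIXct(c("2021-01-23 16:24:44", "2021-01-27 18:47:12"), tz = "UTC")

# Ask for minutes explicitly, then drop the difftime class
ride_length_min <- as.numeric(difftime(ended_at, started_at, units = "mins"))
round(ride_length_min, 2)
```

The first ride lasts 10 minutes 25 seconds and the second 4 minutes 4 seconds, so the rounded values are 10.42 and 4.07.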

Inspecting the ride_length column shows that some trip lengths are negative and some are multiple days long. Delete any rides shorter than 30 seconds or longer than 6 hours.

all_trips_clean <- all_trips[!(all_trips$ride_length < .5 | all_trips$ride_length > 360),]
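One caveat with logical row indexing is that a missing ride_length produces a row of NAs rather than being dropped. The sketch below illustrates this on a toy data frame; apart from the -58.03 minimum reported above, the values and the names `trips_demo`, `indexed`, and `cleaned` are hypothetical.

```r
# Toy data: negative, too short, valid, too long, and missing ride lengths (minutes)
trips_demo <- data.frame(ride_id = c("a", "b", "c", "d", "e"),
                         ride_length = c(-58.03, 0.25, 12, 400, NA))

# Same pattern as above: the NA condition yields an extra all-NA row
indexed <- trips_demo[!(trips_demo$ride_length < 0.5 |
                        trips_demo$ride_length > 360), ]

# Explicit NA handling keeps only rows known to be within range
keep <- !is.na(trips_demo$ride_length) &
  trips_demo$ride_length >= 0.5 &
  trips_demo$ride_length <= 360
cleaned <- trips_demo[keep, ]

nrow(indexed)  # includes the all-NA row
nrow(cleaned)  # only the valid ride
```

If the real data has no missing timestamps, both approaches return the same rows.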

Remove the data where the start or end station name is NA

all_trips_clean <- all_trips_clean %>% drop_na(start_station_name) %>% drop_na(end_station_name)

Create a csv file with the cleaned up data for future use

write_csv(all_trips_clean, "full-2021-divvydata.csv")

Step 4: Conduct Descriptive Analysis

summary(all_trips_clean$ride_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50    7.05   12.28   19.15   22.17  359.97

Compare the ride length data for members versus casual riders

paste('Mean ride length')
## [1] "Mean ride length"
aggregate(all_trips_clean$ride_length ~ all_trips_clean$usertype, FUN = mean)
##   all_trips_clean$usertype all_trips_clean$ride_length
## 1                   casual                    26.61215
## 2                   member                    13.12268
paste("Median ride length")
## [1] "Median ride length"
aggregate(all_trips_clean$ride_length ~ all_trips_clean$usertype, FUN = median)
##   all_trips_clean$usertype all_trips_clean$ride_length
## 1                   casual                        16.7
## 2                   member                         9.8
paste("Max ride length")
## [1] "Max ride length"
aggregate(all_trips_clean$ride_length ~ all_trips_clean$usertype, FUN = max)
##   all_trips_clean$usertype all_trips_clean$ride_length
## 1                   casual                    359.9667
## 2                   member                    359.7500
paste("Min ride length")
## [1] "Min ride length"
aggregate(all_trips_clean$ride_length ~ all_trips_clean$usertype, FUN = min)
##   all_trips_clean$usertype all_trips_clean$ride_length
## 1                   casual                         0.5
## 2                   member                         0.5
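The four aggregate() calls can also be collapsed into a single grouped summary. This is a sketch on hypothetical toy data (`trips_demo` and its values are invented for illustration), not a replacement for the results above.

```r
library(dplyr)

# Toy data with the same two columns used above
trips_demo <- data.frame(
  usertype = c("casual", "casual", "member", "member", "member"),
  ride_length = c(10, 30, 5, 10, 15)
)

# One row per usertype with all four statistics side by side
ride_stats <- trips_demo %>%
  group_by(usertype) %>%
  summarise(mean_length = mean(ride_length),
            median_length = median(ride_length),
            max_length = max(ride_length),
            min_length = min(ride_length))
ride_stats
```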

Calculate how many rides were taken total, and how many were taken by members versus casual riders.

all_trips_clean %>%
summarise(number_of_rides = n())
## # A tibble: 1 × 1
##   number_of_rides
##             <int>
## 1         4545092
all_trips_clean %>%
group_by(usertype)%>%
summarise(number_of_rides = n())%>%
ggplot(aes(x=usertype, y = number_of_rides, fill = usertype))+
geom_col(position = "dodge") + geom_text(aes(label = number_of_rides)) + 
labs(caption = "4,545,092 as total number of rides ")

Number of rides by bike type (with detail of casual and member counts)

all_trips_clean %>%
group_by(usertype, rideable_type)%>%
summarise(number_of_rides = n())%>%
ggplot(aes(x=rideable_type, y = number_of_rides, fill = usertype))+
geom_col(position="stack")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.

Calculate average ride time for all riders. Visualize by usertype.

all_trips_clean %>%
summarise(average_duration = mean(ride_length))
## # A tibble: 1 × 1
##   average_duration
##              <dbl>
## 1             19.1
all_trips_clean %>%
group_by(usertype)%>%
summarise(average_duration = mean(ride_length))%>%
ggplot(aes(x=usertype, y = average_duration, fill = usertype))+
geom_col(position = "dodge")+
geom_label(aes(x = usertype, label=average_duration)) + 
labs(caption = "Average ride time overall is 19.14989 minutes")

Create dataframe with average ride time per day of week for members vs casual users. This doesn’t sort the data correctly, but it’s a good starting point.

aggregate(all_trips_clean$ride_length ~ all_trips_clean$usertype + 
          all_trips_clean$day_of_week, FUN = mean)
##    all_trips_clean$usertype all_trips_clean$day_of_week
## 1                    casual                      Friday
## 2                    member                      Friday
## 3                    casual                      Monday
## 4                    member                      Monday
## 5                    casual                    Saturday
## 6                    member                    Saturday
## 7                    casual                      Sunday
## 8                    member                      Sunday
## 9                    casual                    Thursday
## 10                   member                    Thursday
## 11                   casual                     Tuesday
## 12                   member                     Tuesday
## 13                   casual                   Wednesday
## 14                   member                   Wednesday
##    all_trips_clean$ride_length
## 1                     24.57028
## 2                     12.75790
## 3                     27.15517
## 4                     12.68280
## 5                     28.74262
## 6                     14.71857
## 7                     30.64529
## 8                     15.07385
## 9                     22.67127
## 10                    12.29999
## 11                    24.18076
## 12                    12.35584
## 13                    23.17347
## 14                    12.40694
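The alphabetical order above comes from day_of_week being a character column. One way to get calendar order from aggregate() is to convert the column to a factor with explicit levels first. A minimal sketch on hypothetical toy data (`trips_demo` and its values are invented for illustration):

```r
# Toy data with day names stored as plain text, as in all_trips_clean
trips_demo <- data.frame(
  usertype = c("member", "casual", "member", "casual"),
  day_of_week = c("Monday", "Sunday", "Friday", "Monday"),
  ride_length = c(10, 30, 12, 20)
)

# Explicit levels define the sort order used by aggregate()
day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
trips_demo$day_of_week <- factor(trips_demo$day_of_week, levels = day_levels)

by_day <- aggregate(ride_length ~ usertype + day_of_week,
                    data = trips_demo, FUN = mean)
by_day
```

Passing `data = trips_demo` to the formula interface also gives the result short column names instead of the `all_trips_clean$...` prefixes.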

Create a dataframe that calculates 1. average ride time and 2. number of rides, per day of week for members and casual users. This dataframe will be sorted and presented Sunday to Saturday.

all_trips_clean %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(usertype, weekday) %>% 
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(weekday, usertype)
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups:   usertype [2]
##    usertype weekday number_of_rides average_duration
##    <chr>    <ord>             <int>            <dbl>
##  1 casual   Sun              400080             30.6
##  2 member   Sun              307737             15.1
##  3 casual   Mon              226918             27.2
##  4 member   Mon              342968             12.7
##  5 casual   Tue              213088             24.2
##  6 member   Tue              384500             12.4
##  7 casual   Wed              216427             23.2
##  8 member   Wed              393931             12.4
##  9 casual   Thu              222378             22.7
## 10 member   Thu              369900             12.3
## 11 casual   Fri              287606             24.6
## 12 member   Fri              362101             12.8
## 13 casual   Sat              464287             28.7
## 14 member   Sat              353171             14.7
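The "grouped output" message printed above can be avoided by stating what should happen to the grouping. A sketch with `.groups = "drop"` on hypothetical toy data (the third ride and the name `trips_demo` are invented for illustration):

```r
library(dplyr)
library(lubridate)

# Toy rides: one casual ride on a Saturday, two member rides on a Wednesday
trips_demo <- data.frame(
  usertype = c("casual", "member", "member"),
  started_at = as.POSIXct(c("2021-01-23 16:14:19",
                            "2021-01-27 18:43:08",
                            "2021-01-27 08:00:00"), tz = "UTC"),
  ride_length = c(30, 10, 14)
)

weekday_summary <- trips_demo %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(usertype, weekday) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length),
            .groups = "drop") %>%   # return an ungrouped tibble, no message
  arrange(weekday, usertype)
weekday_summary
```

The result is ungrouped, so later steps such as arrange() or ggplot() behave the same as before without the message.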

Create a dataframe that calculates 1. average ride time and 2. number of rides, per month for members and casual users. This dataframe will be sorted and printed January through December.

all_trips_clean %>% 
  mutate(month = month(started_at, label = TRUE)) %>%
  group_by(usertype, month) %>% 
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(month, usertype)
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
## # A tibble: 24 × 4
## # Groups:   usertype [2]
##    usertype month number_of_rides average_duration
##    <chr>    <ord>           <int>            <dbl>
##  1 casual   Jan             14581             20.4
##  2 member   Jan             68291             12.0
##  3 casual   Feb              8499             27.2
##  4 member   Feb             33951             14.1
##  5 casual   Mar             75008             29.9
##  6 member   Mar            128926             13.6
##  7 casual   Apr            119413             29.9
##  8 member   Apr            176048             14.2
##  9 casual   May            214712             30.9
## 10 member   May            231639             14.2
## # ℹ 14 more rows

Create a visualization showing the number of rides per day of the week. Separate by usertype.

all_trips_clean %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(usertype, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(usertype, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = usertype)) +
  facet_wrap(~usertype) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.

Create a visualization showing average ride duration per day of the week. Separate by usertype.

all_trips_clean %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(usertype, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(usertype, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = usertype)) + 
  facet_wrap(~usertype)+
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.

Create a visualization showing the number of rides by month.

all_trips_clean %>% 
  mutate(month = month(started_at, label = TRUE)) %>% 
  group_by(usertype, month) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(usertype, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = usertype)) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.

Create a visualization showing average ride duration by month.

all_trips_clean %>% 
  mutate(month = month(started_at, label = TRUE)) %>% 
  group_by(usertype, month) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(usertype, month)  %>% 
  ggplot(aes(x = month, y = average_duration, fill = usertype)) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.

Create a visualization that shows the number of rides taken over the course of the day.

all_trips_clean %>% 
  mutate(time_of_day = as.numeric(hour)) %>% 
  group_by(usertype, time_of_day) %>% 
  summarise(number_of_rides = n()) %>% 
  arrange(usertype, time_of_day)  %>% 
  ggplot() +
  geom_line(aes(x = time_of_day, y = number_of_rides, color = usertype))
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
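
To read the busiest hour for each rider type as a number rather than off the line chart, the same hourly counts can be reduced to one row per `usertype`. This is only a sketch on made-up counts: `toy_hourly`, `peak_hour`, and `peak_rides` are hypothetical names, and the values are not taken from the Divvy data.

```r
library(dplyr)

# Hypothetical hourly counts standing in for the summarised all_trips_clean
toy_hourly <- tibble::tibble(
  usertype        = rep(c("casual", "member"), each = 4),
  time_of_day     = rep(c(8, 12, 17, 20), times = 2),
  number_of_rides = c(5, 30, 20, 10, 25, 15, 40, 8)
)

peak_hours <- toy_hourly %>%
  group_by(usertype) %>%
  summarise(
    # hour at which the ride count is largest within each usertype
    peak_hour  = time_of_day[which.max(number_of_rides)],
    peak_rides = max(number_of_rides),
    .groups = "drop"
  )

peak_hours
```

`which.max()` returns the first maximum, so ties between hours would resolve to the earlier hour.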

Create a visualization that shows the average ride duration over the course of the day.

all_trips_clean %>% 
  mutate(time_of_day = as.numeric(hour)) %>% 
  group_by(usertype, time_of_day) %>% 
  summarise(average_duration = mean(ride_length)) %>% 
  arrange(usertype, time_of_day)  %>% 
  ggplot() +
  geom_line(aes(x = time_of_day, y = average_duration, color = usertype))
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.

Looking at start and end points: Given the question this capstone is trying to answer, it is useful to know which stations are most popular with casual riders versus members.

Five most popular start stations for members:

all_trips_clean %>%
  filter(usertype == "member") %>%
  group_by(start_station_name) %>%
  summarise(number_of_rides = n()) %>%
  arrange(desc(number_of_rides)) %>%
  slice(1:5)
## # A tibble: 5 × 2
##   start_station_name       number_of_rides
##   <chr>                              <int>
## 1 Clark St & Elm St                  23673
## 2 Wells St & Concord Ln              22554
## 3 Kingsbury St & Kinzie St           22451
## 4 Wells St & Elm St                  20064
## 5 Dearborn St & Erie St              18452

Five most popular start stations for casual riders:

all_trips_clean %>%
  filter(usertype == "casual") %>%
  group_by(start_station_name) %>%
  summarise(number_of_rides = n()) %>%
  arrange(desc(number_of_rides)) %>%
  slice(1:5)
## # A tibble: 5 × 2
##   start_station_name      number_of_rides
##   <chr>                             <int>
## 1 Streeter Dr & Grand Ave           63832
## 2 Millennium Park                   31870
## 3 Michigan Ave & Oak St             28434
## 4 Shedd Aquarium                    22353
## 5 Theater on the Lake               20448

Five most popular end stations for members:

all_trips_clean %>%
  filter(usertype == "member") %>%
  group_by(end_station_name) %>%
  summarise(number_of_rides = n()) %>%
  arrange(desc(number_of_rides)) %>%
  slice(1:5)
## # A tibble: 5 × 2
##   end_station_name         number_of_rides
##   <chr>                              <int>
## 1 Clark St & Elm St                  23745
## 2 Wells St & Concord Ln              23208
## 3 Kingsbury St & Kinzie St           22649
## 4 Wells St & Elm St                  20625
## 5 Dearborn St & Erie St              19108

Five most popular end stations for casual riders:

all_trips_clean %>%
  filter(usertype == "casual") %>%
  group_by(end_station_name) %>%
  summarise(number_of_rides = n()) %>%
  arrange(desc(number_of_rides)) %>%
  slice(1:5)
## # A tibble: 5 × 2
##   end_station_name        number_of_rides
##   <chr>                             <int>
## 1 Streeter Dr & Grand Ave           66971
## 2 Millennium Park                   33501
## 3 Michigan Ave & Oak St             30146
## 4 Theater on the Lake               22122
## 5 Shedd Aquarium                    20977
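
The four rankings above repeat the same pipeline with a different filter and station column. As a sketch of one way to avoid the repetition, a small helper could rank stations within each rider type in a single pass. The function name `top_stations`, its `top_n` argument, and the toy data are all hypothetical; this is not the code used to produce the tables above.

```r
library(dplyr)

# Hypothetical helper: top `top_n` stations per usertype for a given station column
top_stations <- function(trips, station_col, top_n = 5) {
  trips %>%
    group_by(usertype, {{ station_col }}) %>%
    # drop_last keeps the result grouped by usertype for the ranking step
    summarise(number_of_rides = n(), .groups = "drop_last") %>%
    slice_max(number_of_rides, n = top_n, with_ties = FALSE) %>%
    ungroup()
}

# Made-up trips standing in for all_trips_clean
toy_trips <- tibble::tibble(
  usertype           = c(rep("casual", 6), rep("member", 3)),
  start_station_name = c("A", "A", "A", "B", "B", "C", "C", "C", "D")
)

top_start <- top_stations(toy_trips, start_station_name, top_n = 2)
top_start
```

On the real data, the analogous calls would be `top_stations(all_trips_clean, start_station_name)` and `top_stations(all_trips_clean, end_station_name)`, assuming the column names used in the chunks above.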

Five most popular start stations overall:

all_trips_clean %>%
  group_by(start_station_name) %>%
  summarise(number_of_rides = n()) %>%
  arrange(desc(number_of_rides)) %>%
  slice(1:5)
## # A tibble: 5 × 2
##   start_station_name      number_of_rides
##   <chr>                             <int>
## 1 Streeter Dr & Grand Ave           79465
## 2 Michigan Ave & Oak St             42371
## 3 Wells St & Concord Ln             41290
## 4 Millennium Park                   40062
## 5 Clark St & Elm St                 39172

Five most popular end stations overall:

all_trips_clean %>%
  group_by(end_station_name) %>%
  summarise(number_of_rides = n()) %>%
  arrange(desc(number_of_rides)) %>%
  slice(1:5)
## # A tibble: 5 × 2
##   end_station_name        number_of_rides
##   <chr>                             <int>
## 1 Streeter Dr & Grand Ave           81009
## 2 Michigan Ave & Oak St             43126
## 3 Wells St & Concord Ln             41701
## 4 Millennium Park                   41411
## 5 Clark St & Elm St                 38620

The data above show that the most popular start and end stations for casual riders are at tourist destinations: Navy Pier (Streeter Dr & Grand Ave), Millennium Park, an intersection near Water Tower Place and the Drake Hotel (Michigan Ave & Oak St), Theater on the Lake, and the Shedd Aquarium. None of these stations appears in the members' top five.
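
The contrast between the two rider groups can be checked directly by comparing their top-five start-station lists. The sketch below hard-codes the station names printed in the tables above rather than recomputing them from `all_trips_clean`; the vector names are hypothetical.

```r
# Station names copied from the top-five start-station tables above
member_top5 <- c("Clark St & Elm St", "Wells St & Concord Ln",
                 "Kingsbury St & Kinzie St", "Wells St & Elm St",
                 "Dearborn St & Erie St")
casual_top5 <- c("Streeter Dr & Grand Ave", "Millennium Park",
                 "Michigan Ave & Oak St", "Shedd Aquarium",
                 "Theater on the Lake")

# Stations that appear in both lists
shared_stations <- intersect(member_top5, casual_top5)
length(shared_stations)
```

The intersection is empty: the two groups share no top-five start station, which supports treating casual and member riders as using different parts of the network.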