PREPARATION

The data can be found at link and the license is at link. I downloaded the last 12 months of data from July 20203 to June 20204. The data is too large to work with it on spreadsheets so I have done my analysis using R.

=====================

STEP 1: COLLECT DATA

=====================

Getting necessary packages for analysis

library(tidyverse)  #helps wrangle data

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)  #helps wrangle date attributes
library(ggplot2)    #helps visualize data
library(dplyr)      #helps with data manipulation
library(fixr)       #helps fix data
library(readr)      #helps reading csv files

Set working directory

I granted access to R studio to my desktop so I can work easily with the files I need.

setwd("C:/Users/Juan/Desktop/cases/Bikes/month")

Import monthly csv files

july23 <- read_csv("202307-divvy-tripdata.csv")

## Rows: 767650 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

aug23 <- read_csv("202308-divvy-tripdata.csv")

## Rows: 771693 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sep23 <- read_csv("202309-divvy-tripdata.csv")

## Rows: 666371 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

oct23 <- read_csv("202310-divvy-tripdata.csv")

## Rows: 537113 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nov23 <- read_csv("202311-divvy-tripdata.csv")

## Rows: 362518 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dec23 <- read_csv("202312-divvy-tripdata.csv")

## Rows: 224073 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jan24 <- read_csv("202401-divvy-tripdata.csv")

## Rows: 144873 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

feb24 <- read_csv("202402-divvy-tripdata.csv")

## Rows: 223164 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

marc24 <- read_csv("202403-divvy-tripdata.csv")

## Rows: 301687 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

april24 <- read_csv("202404-divvy-tripdata.csv")

## Rows: 415025 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

may24 <- read_csv("202405-divvy-tripdata.csv")

## Rows: 609493 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

june24 <- read_csv("202406-divvy-tripdata.csv")

## Rows: 710721 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

==== STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE ====

Checking all column names match

colnames(july23)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(aug23)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(sep23)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(oct23)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(nov23)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(dec23)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(jan24)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(feb24)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(marc24)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(april24)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(may24)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(june24)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

All tables have matching columns so I can proceed to bind all the data in one table.

Bind data in one data frame

all_trips <- bind_rows(july23,aug23,sep23,oct23,nov23,dec23,jan24,feb24,marc24,april24,
                       may24,june24)


str(all_trips)

## spc_tbl_ [5,734,381 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5734381] "9340B064F0AEE130" "D1460EE3CE0D8AF8" "DF41BE31B895A25E" "9624A293749EF703" ...
##  $ rideable_type     : chr [1:5734381] "electric_bike" "classic_bike" "classic_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5734381], format: "2023-07-23 20:06:14" "2023-07-23 17:05:07" ...
##  $ ended_at          : POSIXct[1:5734381], format: "2023-07-23 20:22:44" "2023-07-23 17:18:37" ...
##  $ start_station_name: chr [1:5734381] "Kedzie Ave & 110th St" "Western Ave & Walton St" "Western Ave & Walton St" "Racine Ave & Randolph St" ...
##  $ start_station_id  : chr [1:5734381] "20204" "KA1504000103" "KA1504000103" "13155" ...
##  $ end_station_name  : chr [1:5734381] "Public Rack - Racine Ave & 109th Pl" "Milwaukee Ave & Grand Ave" "Damen Ave & Pierce Ave" "Clinton St & Madison St" ...
##  $ end_station_id    : chr [1:5734381] "877" "13033" "TA1305000041" "TA1305000032" ...
##  $ start_lat         : num [1:5734381] 41.7 41.9 41.9 41.9 42 ...
##  $ start_lng         : num [1:5734381] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:5734381] 41.7 41.9 41.9 41.9 42 ...
##  $ end_lng           : num [1:5734381] -87.7 -87.6 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr [1:5734381] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Now I have all my data in one table which is necessary for full analysis.

Drop unnecessary columns for analysis

Latitude and longitude are not neeede for the analysis

all_trips <- all_trips %>%  
  select(-c(start_lat, start_lng, end_lat, end_lng))

==== STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS ======

Inspect the new table that has been created

colnames(all_trips)  #List of column names

## [1] "ride_id"            "rideable_type"      "started_at"        
## [4] "ended_at"           "start_station_name" "start_station_id"  
## [7] "end_station_name"   "end_station_id"     "member_casual"

nrow(all_trips)  #How many rows are in data frame?

## [1] 5734381

dim(all_trips)  #Dimensions of the data frame?

## [1] 5734381       9

head(all_trips)  #See the first 6 rows of data frame.  Also tail(all_trips)

## # A tibble: 6 × 9
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 9340B064F0AEE130 electric_bike 2023-07-23 20:06:14 2023-07-23 20:22:44
## 2 D1460EE3CE0D8AF8 classic_bike  2023-07-23 17:05:07 2023-07-23 17:18:37
## 3 DF41BE31B895A25E classic_bike  2023-07-23 10:14:53 2023-07-23 10:24:29
## 4 9624A293749EF703 electric_bike 2023-07-21 08:27:44 2023-07-21 08:32:40
## 5 2F68A6A4CDB4C99A classic_bike  2023-07-08 15:46:42 2023-07-08 15:58:08
## 6 9AEE973E6B941A9C classic_bike  2023-07-10 08:44:47 2023-07-10 08:49:41
## # ℹ 5 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, member_casual <chr>

str(all_trips)  #See list of columns and data types (numeric, character, etc)

## tibble [5,734,381 × 9] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5734381] "9340B064F0AEE130" "D1460EE3CE0D8AF8" "DF41BE31B895A25E" "9624A293749EF703" ...
##  $ rideable_type     : chr [1:5734381] "electric_bike" "classic_bike" "classic_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5734381], format: "2023-07-23 20:06:14" "2023-07-23 17:05:07" ...
##  $ ended_at          : POSIXct[1:5734381], format: "2023-07-23 20:22:44" "2023-07-23 17:18:37" ...
##  $ start_station_name: chr [1:5734381] "Kedzie Ave & 110th St" "Western Ave & Walton St" "Western Ave & Walton St" "Racine Ave & Randolph St" ...
##  $ start_station_id  : chr [1:5734381] "20204" "KA1504000103" "KA1504000103" "13155" ...
##  $ end_station_name  : chr [1:5734381] "Public Rack - Racine Ave & 109th Pl" "Milwaukee Ave & Grand Ave" "Damen Ave & Pierce Ave" "Clinton St & Madison St" ...
##  $ end_station_id    : chr [1:5734381] "877" "13033" "TA1305000041" "TA1305000032" ...
##  $ member_casual     : chr [1:5734381] "member" "member" "member" "member" ...

summary(all_trips)  #Statistical summary of data. Mainly for numeric

##    ride_id          rideable_type        started_at                    
##  Length:5734381     Length:5734381     Min.   :2023-07-01 00:00:00.00  
##  Class :character   Class :character   1st Qu.:2023-08-27 15:55:29.00  
##  Mode  :character   Mode  :character   Median :2023-11-09 14:41:58.00  
##                                        Mean   :2023-12-16 20:05:31.20  
##                                        3rd Qu.:2024-04-23 17:55:40.00  
##                                        Max.   :2024-06-30 23:55:17.06  
##     ended_at                      start_station_name start_station_id  
##  Min.   :2023-07-01 00:01:26.00   Length:5734381     Length:5734381    
##  1st Qu.:2023-08-27 16:19:14.00   Class :character   Class :character  
##  Median :2023-11-09 14:55:49.00   Mode  :character   Mode  :character  
##  Mean   :2023-12-16 20:23:50.00                                        
##  3rd Qu.:2024-04-23 18:08:50.00                                        
##  Max.   :2024-06-30 23:59:57.93                                        
##  end_station_name   end_station_id     member_casual     
##  Length:5734381     Length:5734381     Length:5734381    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##

Add columns that list the date, month, day, and year of each ride

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

Let’s create a new column for ride length and then clean the results we don’t need (negative and 0 values).

Add a “ride_length” calculation to all_trips (in seconds)

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

Convert “ride_length” from Factor to numeric so we can run calculations on the data

is.factor(all_trips$ride_length)

## [1] FALSE

all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)

## [1] TRUE

Check negative values

check_for_negative_values(all_trips)

Replace negative and 0 values with NA values

all_trips[all_trips <= 0] <- NA

Remove rows with NA values

all_trips <- na.omit(all_trips)

Remove negative and 0 values in data frame

all_trips_2 <- all_trips[apply(all_trips!=0, 1, all),]

Check negative values again

check_for_negative_values(all_trips_2)

## Data frame does not contain negative values.

## NULL

The data is clean.

Find number of casual and members riders

table(all_trips_2$member_casual)

## 
##  casual  member 
## 1503559 2770268

We can see clearly that there are many more casual riders than members.

========= STEP 4: CONDUCT DESCRIPTIVE ANALYSIS =========

Descriptive analysis on ride_length (all figures in seconds)

mean(all_trips_2$ride_length) #straight average (total ride length / rides)

## [1] 990.8093

median(all_trips_2$ride_length) #midpoint number in the ascending array of ride lengths

## [1] 604

max(all_trips_2$ride_length) #longest ride

## [1] 413473

min(all_trips_2$ride_length) #shortest ride

## [1] 0.155

Condense the four lines above to one line using summary() on the specific attribute

summary(all_trips_2$ride_length)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.2    347.0    604.0    990.8   1082.0 413473.0

Compare members and casual users

aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = mean)

##   all_trips_2$member_casual all_trips_2$ride_length
## 1                    casual               1434.4350
## 2                    member                750.0321

aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = median)

##   all_trips_2$member_casual all_trips_2$ride_length
## 1                    casual                     796
## 2                    member                     529

aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = max)

##   all_trips_2$member_casual all_trips_2$ride_length
## 1                    casual                  413473
## 2                    member                   89859

aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual, FUN = min)

##   all_trips_2$member_casual all_trips_2$ride_length
## 1                    casual                   0.155
## 2                    member                   0.263

Order days of the week

all_trips_2$day_of_week <- ordered(all_trips_2$day_of_week, levels=c("domingo", "lunes", "martes", "miércoles", "jueves", "viernes", "sábado"))

aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual + all_trips_2$day_of_week, FUN = mean)

##    all_trips_2$member_casual all_trips_2$day_of_week all_trips_2$ride_length
## 1                     casual                 domingo               1658.9865
## 2                     member                 domingo                846.7964
## 3                     casual                   lunes               1406.8795
## 4                     member                   lunes                717.3063
## 5                     casual                  martes               1272.5568
## 6                     member                  martes                723.1518
## 7                     casual               miércoles               1254.6979
## 8                     member               miércoles                724.1508
## 9                     casual                  jueves               1225.2893
## 10                    member                  jueves                709.9829
## 11                    casual                 viernes               1380.3573
## 12                    member                 viernes                733.7183
## 13                    casual                  sábado               1609.3650
## 14                    member                  sábado                835.3832

Casual riders are using the bikes for a longer period of time at any dady of the week. This could be explained as many tourists to the city could be using the bikes for visiting.

Now, let’s run the average ride time by each day for members vs casual users

aggregate(all_trips_2$ride_length ~ all_trips_2$member_casual + all_trips_2$day_of_week, FUN = mean)

##    all_trips_2$member_casual all_trips_2$day_of_week all_trips_2$ride_length
## 1                     casual                 domingo               1658.9865
## 2                     member                 domingo                846.7964
## 3                     casual                   lunes               1406.8795
## 4                     member                   lunes                717.3063
## 5                     casual                  martes               1272.5568
## 6                     member                  martes                723.1518
## 7                     casual               miércoles               1254.6979
## 8                     member               miércoles                724.1508
## 9                     casual                  jueves               1225.2893
## 10                    member                  jueves                709.9829
## 11                    casual                 viernes               1380.3573
## 12                    member                 viernes                733.7183
## 13                    casual                  sábado               1609.3650
## 14                    member                  sábado                835.3832

analyze ridership data by type and weekday

all_trips_2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()                       
            ,average_duration = mean(ride_length)) %>%  
  arrange(member_casual, weekday)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        "do\\."          268375            1659.
##  2 casual        "lu\\."          175574            1407.
##  3 casual        "ma\\."          173707            1273.
##  4 casual        "mi\\."          178181            1255.
##  5 casual        "ju\\."          182745            1225.
##  6 casual        "vi\\."          213507            1380.
##  7 casual        "sá\\."          311470            1609.
##  8 member        "do\\."          313737             847.
##  9 member        "lu\\."          395057             717.
## 10 member        "ma\\."          437338             723.
## 11 member        "mi\\."          447921             724.
## 12 member        "ju\\."          440940             710.
## 13 member        "vi\\."          385370             734.
## 14 member        "sá\\."          349905             835.

analyze ridership data by type and weekday

all_trips_2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()                       
            ,average_duration = mean(ride_length)) %>%  
  arrange(member_casual, weekday)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        "do\\."          268375            1659.
##  2 casual        "lu\\."          175574            1407.
##  3 casual        "ma\\."          173707            1273.
##  4 casual        "mi\\."          178181            1255.
##  5 casual        "ju\\."          182745            1225.
##  6 casual        "vi\\."          213507            1380.
##  7 casual        "sá\\."          311470            1609.
##  8 member        "do\\."          313737             847.
##  9 member        "lu\\."          395057             717.
## 10 member        "ma\\."          437338             723.
## 11 member        "mi\\."          447921             724.
## 12 member        "ju\\."          440940             710.
## 13 member        "vi\\."          385370             734.
## 14 member        "sá\\."          349905             835.

Let’s visualize the number of rides by rider type

all_trips_2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday,  y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + scale_fill_manual(values = c ("casual" = "darkblue", "member" = "darkorange"))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Although there are more casual users than members, it is clear that members use the bikes more frequently. This could be explain due to the fact that many members are most probably using the bikes to commute so the rides are shorter but more frequent.

Let’s create a visualization for average duration

all_trips_2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") + scale_fill_manual(values = c ("casual" = "purple", "member" = "yellow"))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

This graph shows clearly that casual riders are using the bikes for longer periods of time. Many temporary visitor to the city probably using the bikes to move around and their therefore their trips are longer.

Find row with maximum ride length

rides<- all_trips_2%>%
  select(start_station_name,end_station_name,ride_length)
summary(rides)

##  start_station_name end_station_name    ride_length      
##  Length:4273827     Length:4273827     Min.   :     0.2  
##  Class :character   Class :character   1st Qu.:   347.0  
##  Mode  :character   Mode  :character   Median :   604.0  
##                                        Mean   :   990.8  
##                                        3rd Qu.:  1082.0  
##                                        Max.   :413473.0

max_ride <- dplyr::filter(rides, ride_length == max(ride_length))
max_start_station <- max_ride$start_station_name
max_end_station <- max_ride$end_station_name

Find the row with the minimum ride length

rides<- all_trips_2%>%
  select(start_station_name,end_station_name,ride_length)
summary(rides)

##  start_station_name end_station_name    ride_length      
##  Length:4273827     Length:4273827     Min.   :     0.2  
##  Class :character   Class :character   1st Qu.:   347.0  
##  Mode  :character   Mode  :character   Median :   604.0  
##                                        Mean   :   990.8  
##                                        3rd Qu.:  1082.0  
##                                        Max.   :413473.0

min_ride <- dplyr::filter(rides, ride_length == min(ride_length))
min_start_station <- min_ride$start_station_name
min_end_station <- min_ride$end_station_name

Display the results

cat("Maximum ride length starts at:", max_start_station,",","and ends at:", max_end_station, "\n")

## Maximum ride length starts at: Larrabee St & Webster Ave , and ends at: Michigan Ave & Oak St

cat("Minimum ride length starts at:", min_start_station, ",","and ends at:", min_end_station, "\n")

## Minimum ride length starts at: Millennium Park , and ends at: Millennium Park

Combine max_ride and min_ride

combined_rides<-rbind(max_ride,min_ride)

Visualize ride lengths by start and end stations

ggplot(combined_rides, aes(x = str_wrap(start_station_name,width = 10), y = ride_length, fill = end_station_name)) +
  geom_bar(stat = "identity", fill = "#FF6666") +
  geom_text(aes(label = ride_length), vjust = -0.5) +
  labs(title = "Maximum and Minimum Ride Length by Start and End Stations", x = "Start_Station_Name", y = "Ride_Length (minutes)") +
  theme_minimal()+
  theme(
    plot.background = element_rect(color = "purple", linewidth = 1),
    panel.border = element_rect(color = "lightblue", fill = NA, size = 0.5)
  )+
  scale_y_continuous(labels = scales::number_format(accuracy = 1))

## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

### Create a visualization for the number of rides by rider type

ggplot(data = all_trips_2) +
  geom_bar(mapping=aes(x=member_casual, fill=member_casual)) +
  labs(title = "Number of Rides: Member vs. Casual") + 
  scale_fill_manual(values = c ("casual" = "purple", "member" = "lightblue"))

### Top 3 Start Stations

all_trips_2 %>% 
  group_by(start_station_name) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count)) %>%
  head(4)

## # A tibble: 4 × 2
##   start_station_name                 count
##   <chr>                              <int>
## 1 Streeter Dr & Grand Ave            60601
## 2 DuSable Lake Shore Dr & Monroe St  40556
## 3 Michigan Ave & Oak St              35295
## 4 DuSable Lake Shore Dr & North Blvd 34665

** The top stations are found on the water front where most amenities are found link

Top 3 End Stations

all_trips_2 %>% 
  group_by(end_station_name) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count)) %>%
  head(4)

## # A tibble: 4 × 2
##   end_station_name                   count
##   <chr>                              <int>
## 1 Streeter Dr & Grand Ave            62076
## 2 DuSable Lake Shore Dr & Monroe St  38878
## 3 DuSable Lake Shore Dr & North Blvd 37448
## 4 Michigan Ave & Oak St              35976

Create visualization for members vs casual per month

all_trips_2 %>% 
  ggplot(aes(started_at)) + 
  geom_histogram(color = "#000000", aes(fill = member_casual)) +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~member_casual) + 
  scale_fill_manual(values = c ("casual" = "purple", "member" = "lightyellow"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

** All riders use the service during the warmer months of the year and it drops during the coldest months. Chicago is very cold during the winter months so that explains it.

Cyclistic project 2023-24

Juan Aurelio García Morata

2024-09-15

INTRODUCTION

SCENARIO

TASK

PREPARATION

=====================

STEP 1: COLLECT DATA

=====================

Getting necessary packages for analysis

Set working directory

Import monthly csv files

==== STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE ====

Checking all column names match

Bind data in one data frame

Drop unnecessary columns for analysis

==== STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS ======

Inspect the new table that has been created

Add columns that list the date, month, day, and year of each ride

Add a “ride_length” calculation to all_trips (in seconds)

Convert “ride_length” from Factor to numeric so we can run calculations on the data

Check negative values

Replace negative and 0 values with NA values

Remove rows with NA values

Remove negative and 0 values in data frame

Check negative values again

Find number of casual and members riders

========= STEP 4: CONDUCT DESCRIPTIVE ANALYSIS =========

Descriptive analysis on ride_length (all figures in seconds)

Condense the four lines above to one line using summary() on the specific attribute

Compare members and casual users

Order days of the week

Now, let’s run the average ride time by each day for members vs casual users

analyze ridership data by type and weekday

analyze ridership data by type and weekday

Let’s visualize the number of rides by rider type

Let’s create a visualization for average duration

Find row with maximum ride length

Find the row with the minimum ride length

Display the results

Combine max_ride and min_ride

Visualize ride lengths by start and end stations

Top 3 End Stations

Create visualization for members vs casual per month

CONCLUSIONS

Create a csv file that we will visualize in Excel, Tableau, or my presentation software