Introduction

I am a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. My team therefore wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. First, however, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.

The assignment for the analytics team is to analyze the available datasets to answer the following questions:

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

Lily Moreno, the director of marketing and my manager, has assigned me the first question: “How do annual members and casual riders use Cyclistic bikes differently?”

Based on this key question, I formulated three related hypotheses:

  1. Members and casual riders may have different ride durations.
  2. Members and casual riders may have different ride lengths (distances).
  3. The season of the year has a significant effect on the number of bike rides taken, regardless of whether the riders are members or casual users.

For this analysis, the following libraries are used:

Install and Load Packages

library("tidyverse")
library("lubridate")
library("geosphere")
library("stats")
library("knitr")

1. Data Collection & Preparation.

For this project, I utilized the bike-sharing datasets for the year 2022, which can be downloaded from here. The dataset is made available by Lyft Bikes and Scooters under the terms of this license agreement.

After storing the data locally, I verified the credibility of the data by checking whether it is ROCCC (reliable, original, comprehensive, current, and cited). Once I was confident in the data’s credibility, I opened the data to begin exploring and analyzing it.

Opening the data.

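# Load the Divvy datasets (csv files) here.
# Note: January is read with base read.csv(), unlike the other months (readr::read_csv()),
# so its started_at and ended_at columns load as character (see section 2.2).
Jan_2022 <- read.csv("C:/Users/User/Music/Bike_data_2022/202201_tripdata.csv")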
Feb_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202202_tripdata.csv")
Mar_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202203_tripdata.csv")
Apr_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202204_tripdata.csv")
May_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202205_tripdata.csv")
Jun_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202206_tripdata.csv")
Jul_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202207_tripdata.csv")
Aug_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202208_tripdata.csv")
Sept_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202209_tripdata.csv")
Oct_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202210_tripdata.csv")
Nov_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202211_tripdata.csv")
Dec_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202212_tripdata.csv")
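The twelve calls above could also be written as a loop over the files, which makes it harder to mix readers by accident (January uses read.csv() while the other months use read_csv()). The following is a minimal sketch, assuming the files sit in one folder and follow the *_tripdata.csv naming seen above. The helper read_monthly_files() and the demo files are hypothetical, and base read.csv() is used only to keep the example free of dependencies; readr::read_csv() could be substituted.

```r
# Hypothetical helper: read every "*_tripdata.csv" file in a folder with one reader.
read_monthly_files <- function(data_dir) {
  paths <- sort(list.files(data_dir, pattern = "_tripdata\\.csv$", full.names = TRUE))
  monthly <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  names(monthly) <- basename(paths)  # label each data frame with its file name
  monthly
}

# Demo with two tiny made-up files in a temporary folder.
demo_dir <- file.path(tempdir(), "bike_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = "A1", member_casual = "casual"),
          file.path(demo_dir, "202201_tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B2", member_casual = "member"),
          file.path(demo_dir, "202202_tripdata.csv"), row.names = FALSE)

monthly_demo <- read_monthly_files(demo_dir)
names(monthly_demo)          # file names, in sorted (chronological) order
sapply(monthly_demo, nrow)   # number of rows read from each file
```

Because every file goes through the same reader, the column classes should come out consistent across months.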

2. Wrangle Data and Combine Into a Single File

This section prepares the data for analysis. I first checked that the monthly datasets have the same columns in the same format, and then combined them into a single dataset.

## 2.1 Create a list of monthly data frames for 2022.
monthly_data_2022 <- list(Jan_2022, Feb_2022, Mar_2022, Apr_2022, 
                          May_2022, Jun_2022, Jul_2022, Aug_2022, 
                          Sept_2022, Oct_2022, Nov_2022, Dec_2022)
## Extract the column names of the first data frame.
colnames_2022 <- colnames(monthly_data_2022[[1]])

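## Check that every data frame has the same set of column names as the first data frame.
## setequal() ignores column order, which is enough because bind_rows() matches columns by name.
consistent_colnames <- all(sapply(monthly_data_2022, function(df) setequal(colnames(df), colnames_2022)))
## Stop here if any month has different columns.
stopifnot(consistent_colnames)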

2.2 Check the data type

This section checks the class of the “started_at” and “ended_at” columns in each month’s data frame before stacking them all into one dataset.

# Use sapply to apply the function 'class' to each data frame in the list, and extract the class of the 'started_at' column for each month.
sapply(list(Jan_2022, Feb_2022, Mar_2022, Apr_2022, May_2022, Jun_2022, 
            Jul_2022, Aug_2022, Sept_2022, Oct_2022, Nov_2022, Dec_2022), 
       function(x) class(x$started_at))
## [[1]]
## [1] "character"
## 
## [[2]]
## [1] "POSIXct" "POSIXt" 
## 
## [[3]]
## [1] "POSIXct" "POSIXt" 
## 
## [[4]]
## [1] "POSIXct" "POSIXt" 
## 
## [[5]]
## [1] "POSIXct" "POSIXt" 
## 
## [[6]]
## [1] "POSIXct" "POSIXt" 
## 
## [[7]]
## [1] "POSIXct" "POSIXt" 
## 
## [[8]]
## [1] "POSIXct" "POSIXt" 
## 
## [[9]]
## [1] "POSIXct" "POSIXt" 
## 
## [[10]]
## [1] "POSIXct" "POSIXt" 
## 
## [[11]]
## [1] "POSIXct" "POSIXt" 
## 
## [[12]]
## [1] "POSIXct" "POSIXt"
sapply(list(Jan_2022, Feb_2022, Mar_2022, Apr_2022, May_2022, Jun_2022, 
            Jul_2022, Aug_2022, Sept_2022, Oct_2022, Nov_2022, Dec_2022), 
       function(x) class(x$ended_at))
## [[1]]
## [1] "character"
## 
## [[2]]
## [1] "POSIXct" "POSIXt" 
## 
## [[3]]
## [1] "POSIXct" "POSIXt" 
## 
## [[4]]
## [1] "POSIXct" "POSIXt" 
## 
## [[5]]
## [1] "POSIXct" "POSIXt" 
## 
## [[6]]
## [1] "POSIXct" "POSIXt" 
## 
## [[7]]
## [1] "POSIXct" "POSIXt" 
## 
## [[8]]
## [1] "POSIXct" "POSIXt" 
## 
## [[9]]
## [1] "POSIXct" "POSIXt" 
## 
## [[10]]
## [1] "POSIXct" "POSIXt" 
## 
## [[11]]
## [1] "POSIXct" "POSIXt" 
## 
## [[12]]
## [1] "POSIXct" "POSIXt"

The “started_at” and “ended_at” columns in the first data frame (Jan_2022) are stored as character, while the other months are POSIXct. The code below converts them to POSIXct so that the date and time data can be analyzed consistently.

# Convert "started_at" & "ended_at" column to POSIXct in first data frame.
Jan_2022$started_at <- as.POSIXct(Jan_2022$started_at, format = "%Y-%m-%d %H:%M:%S")
Jan_2022$ended_at <- as.POSIXct(Jan_2022$ended_at, format = "%Y-%m-%d %H:%M:%S")
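One detail worth checking in this conversion is the time zone. as.POSIXct() falls back to the session's local time zone when no tz is given, while readr::read_csv() parses datetimes as UTC by default, so January could end up offset from the other months. The following is a minimal sketch, assuming UTC is the intended zone (the source does not state this), using two timestamps taken from the January rows shown in this report.

```r
# Two example timestamps copied from the January data.
raw_times <- c("2022-01-13 03:59:47", "2022-01-10 00:41:56")

# Parse with an explicit time zone so the result does not depend on the machine's settings.
parsed_utc <- as.POSIXct(raw_times, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

class(parsed_utc)            # "POSIXct" "POSIXt"
attr(parsed_utc, "tzone")    # "UTC"
format(parsed_utc[1], "%H")  # "03"
```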

2.3 Combine into a single dataset

After converting the “started_at” and “ended_at” columns to POSIXct, the monthly data frames can be combined into a single dataset.

# Stack all monthly data frames into one dataset.
bike_2022 <- bind_rows(Jan_2022, Feb_2022, Mar_2022, 
                       Apr_2022, May_2022, Jun_2022, 
                       Jul_2022, Aug_2022, Sept_2022, 
                       Oct_2022, Nov_2022, Dec_2022)
# Recheck the structure of the combined data frame.
str(bike_2022)
## 'data.frame':    5667717 obs. of  13 variables:
##  $ ride_id           : chr  "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct, format: "2022-01-13 03:59:47" "2022-01-10 00:41:56" ...
##  $ ended_at          : POSIXct, format: "2022-01-13 04:02:44" "2022-01-10 00:46:17" ...
##  $ start_station_name: chr  "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr  "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr  "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr  "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num  42 42 41.9 42 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num  42 42 41.9 42 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "member" "casual" ...

3. Data Cleaning

An essential part of data preparation and analysis is handling missing values. In this project I handle them by removing the affected rows, an approach that is simple and suits this dataset.

# Collect the rows that have a missing value in any column.
na_df <- bike_2022[rowSums(is.na(bike_2022)) > 0,]
nrow(na_df)
## [1] 1274801
# Count the missing values in each column of the dataset.
colSums(is.na(bike_2022))
##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##             816804             816804             874815             874815 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               5858               5858 
##      member_casual 
##                  0
# Calculate the percentage of rows with missing end_lat and end_lng.
(nrow(na_df[is.na(na_df$end_lat), ])/nrow(bike_2022))*100
## [1] 0.1033573
(nrow(na_df[is.na(na_df$end_lng), ])/nrow(bike_2022))*100
## [1] 0.1033573

Only a very small portion of the data, about 0.1 %, is missing the end coordinates (end_lat and end_lng). These rows are removed in the next section.
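The same percentage can be obtained for every column at once. The following is a minimal sketch on a made-up toy data frame (the values are illustrative, not taken from the Divvy data): is.na() returns a logical matrix, and colMeans() turns it into the share of missing values per column.

```r
# Toy data frame standing in for bike_2022 (values are made up for illustration).
toy <- data.frame(
  ride_id          = c("a", "b", "c", "d"),
  end_station_name = c("Clark St", NA, NA, "Paulina St"),
  end_lat          = c(42.0, 41.9, NA, 41.9)
)

# Share of missing values per column, expressed as a percentage.
missing_pct <- colMeans(is.na(toy)) * 100
missing_pct
# ride_id: 0, end_station_name: 50, end_lat: 25
```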

3.1 Removing rows with missing data for my analysis

This step cleans the bike_2022 data frame by removing rows with missing values in the end_lat column. The end_lng column also has missing values, but they occur in the same rows, so removing the rows with a missing end_lat automatically removes the rows with a missing end_lng as well.

# Keep only the rows where end_lat is not missing.
cleaned_data <- bike_2022[!is.na(bike_2022$end_lat),]
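The claim that both coordinate columns are missing in the same rows can be verified directly rather than assumed. The following is a minimal sketch on made-up toy data; the same comparison could be run on bike_2022.

```r
# Toy coordinates: rows 2 and 4 are missing both end_lat and end_lng.
toy <- data.frame(
  end_lat = c(42.01, NA, 41.93, NA),
  end_lng = c(-87.67, NA, -87.67, NA)
)

# TRUE only if the two columns are missing in exactly the same rows.
same_rows_missing <- identical(is.na(toy$end_lat), is.na(toy$end_lng))
same_rows_missing

# Filtering on end_lat alone then leaves no missing end_lng either.
cleaned_toy <- toy[!is.na(toy$end_lat), ]
anyNA(cleaned_toy$end_lng)
```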

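The earlier colSums() output shows that six columns have missing values: start_station_name, start_station_id, end_station_name, end_station_id, end_lat, and end_lng. The four station columns are not used in my analysis and are dropped in section 3.2, so only the rows with missing end coordinates are removed here.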

# Recount the missing values and the remaining rows after the removal.
colSums(is.na(cleaned_data))
##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##             816804             816804             869043             869043 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0                  0                  0 
##      member_casual 
##                  0
nrow(cleaned_data)
## [1] 5661859

3.2 Removing columns

Next, I remove the following columns, which are not used in my analysis.

# Remove column start_station_name, start_station_id, end_station_name, end_station_id.
cleaned_data <- cleaned_data %>% 
  select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
# Recheck the column names of the dataset.
colnames(cleaned_data)
## [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
## [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
## [9] "member_casual"
# Recheck whether any missing values remain.
colSums(is.na(cleaned_data))
##       ride_id rideable_type    started_at      ended_at     start_lat 
##             0             0             0             0             0 
##     start_lng       end_lat       end_lng member_casual 
##             0             0             0             0

With no missing values left in any column, the data is ready for the next step.

4. Preparing data for analysis

A crucial step in the data analysis process is preparing data for analysis, which entails converting raw data into a format that can be used for statistical analysis. This process can involve creating new columns and scaling or normalizing data to make it suitable for analysis.

4.1 Inspect the new table that has been created

This step inspects the structure and characteristics of the data frame: the number of rows and the dimensions of the data frame, a preview of the first few rows, the column data types, and a statistical summary of the data.

# Number of rows in the data frame.
nrow(cleaned_data)  
## [1] 5661859
# Dimensions of the data frame.
dim(cleaned_data) 
## [1] 5661859       9
# See the first 6 rows of data frame.
head(cleaned_data)  
##            ride_id rideable_type          started_at            ended_at
## 1 C2F7DD78E82EC875 electric_bike 2022-01-13 03:59:47 2022-01-13 04:02:44
## 2 A6CF8980A652D272 electric_bike 2022-01-10 00:41:56 2022-01-10 00:46:17
## 3 BD0F91DFF741C66D  classic_bike 2022-01-24 20:53:40 2022-01-24 20:58:01
## 4 CBB80ED419105406  classic_bike 2022-01-03 16:18:04 2022-01-03 16:33:00
## 5 DDC963BFDDA51EEA  classic_bike 2022-01-19 17:31:10 2022-01-19 17:37:12
## 6 A39C6F6CC0586C0B  classic_bike 2022-01-11 10:48:09 2022-01-11 10:51:31
##   start_lat start_lng  end_lat   end_lng member_casual
## 1  42.01280 -87.66591 42.01256 -87.67437        casual
## 2  42.01276 -87.66597 42.01256 -87.67437        casual
## 3  41.92560 -87.65371 41.92533 -87.66580        member
## 4  41.98359 -87.66915 41.96151 -87.67139        casual
## 5  41.87785 -87.62408 41.88462 -87.62783        member
## 6  41.89563 -87.67207 41.90312 -87.67394        member
# See list of columns and data types.
str(cleaned_data)  
## 'data.frame':    5661859 obs. of  9 variables:
##  $ ride_id      : chr  "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type: chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at   : POSIXct, format: "2022-01-13 03:59:47" "2022-01-10 00:41:56" ...
##  $ ended_at     : POSIXct, format: "2022-01-13 04:02:44" "2022-01-10 00:46:17" ...
##  $ start_lat    : num  42 42 41.9 42 41.9 ...
##  $ start_lng    : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat      : num  42 42 41.9 42 41.9 ...
##  $ end_lng      : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual: chr  "casual" "casual" "member" "casual" ...
# Statistical summary of data.
summary(cleaned_data)  
##    ride_id          rideable_type        started_at                   
##  Length:5661859     Length:5661859     Min.   :2021-12-31 16:00:05.0  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:17:08.5  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:11:15.0  
##                                        Mean   :2022-07-20 07:18:40.2  
##                                        3rd Qu.:2022-09-16 07:38:28.0  
##                                        Max.   :2022-12-31 23:59:26.0  
##     ended_at                        start_lat       start_lng     
##  Min.   :2021-12-31 16:01:48.00   Min.   :41.64   Min.   :-87.84  
##  1st Qu.:2022-05-28 19:37:21.00   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :2022-07-22 15:30:24.00   Median :41.90   Median :-87.64  
##  Mean   :2022-07-20 07:34:59.95   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:2022-09-16 07:51:35.00   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :2023-01-01 18:09:37.00   Max.   :45.64   Max.   :-73.80  
##     end_lat         end_lng       member_casual     
##  Min.   : 0.00   Min.   :-88.14   Length:5661859    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.37   Max.   :  0.00

4.2 Create new columns for analysis

Creating new columns for analysis is a crucial step in the data analysis process that involves deriving additional variables from existing data to gain insights and make more informed decisions. This process involves performing calculations or transformations on existing variables.

Creating new columns for the hour of the day, day of the week, months, or seasons can provide valuable insights into patterns and trends within the data. These new variables can be derived from existing datetime data, and can help identify peak usage times, seasonality effects, or other patterns that may impact analysis and decision-making.

# Create new columns.
cleaned_data$hour_of_day <- hour(cleaned_data$started_at)
cleaned_data$day_of_week <- wday(cleaned_data$started_at, label = TRUE)
cleaned_data$month <- month(cleaned_data$started_at)
# Map the month number to a season; months 12, 1 and 2 fall through to "Winter".
cleaned_data$season <- case_when(
  cleaned_data$month %in% c(3, 4, 5)   ~ "Spring",
  cleaned_data$month %in% c(6, 7, 8)   ~ "Summer",
  cleaned_data$month %in% c(9, 10, 11) ~ "Fall",
  TRUE                                 ~ "Winter"
)
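The month-to-season mapping can also be expressed as a lookup vector, which keeps the rule in one place and is easy to check by eye. This is an alternative sketch in base R, not the method used in this report; season_by_month and season_of_month() are hypothetical names.

```r
# Lookup vector: position i holds the season for month number i (1 = January).
season_by_month <- c("Winter", "Winter",
                     "Spring", "Spring", "Spring",
                     "Summer", "Summer", "Summer",
                     "Fall", "Fall", "Fall",
                     "Winter")

# Hypothetical helper: map month numbers (1-12) to season labels by indexing.
season_of_month <- function(month) season_by_month[month]

season_of_month(c(1, 4, 7, 10, 12))
# "Winter" "Spring" "Summer" "Fall" "Winter"
```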

4.3 Calculate and create new column for ride duration

Calculating the duration of each ride can provide valuable information on ride duration and usage patterns. This is done by subtracting started_at from ended_at for each ride in the dataset, resulting in a new numeric column with the duration of each ride in minutes.

# Calculate ride duration in minutes.
cleaned_data$ride_duration_minutes <- as.numeric(difftime(cleaned_data$ended_at, cleaned_data$started_at, units = "mins"))
# Then round to 2 decimal places.
cleaned_data$ride_duration_minutes <- round(cleaned_data$ride_duration_minutes, 2)
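As a worked example of this calculation, the first ride shown earlier starts at 03:59:47 and ends at 04:02:44, which is 177 seconds, or 2.95 minutes. The sketch below reproduces that with difftime(); the explicit tz = "UTC" is an assumption made only so the example runs the same on any machine.

```r
# Start and end time of the first ride in the data preview.
started_at <- as.POSIXct("2022-01-13 03:59:47", tz = "UTC")
ended_at   <- as.POSIXct("2022-01-13 04:02:44", tz = "UTC")

# Difference in minutes, converted to a plain number and rounded to 2 decimals.
duration_min <- round(as.numeric(difftime(ended_at, started_at, units = "mins")), 2)
duration_min
# 2.95
```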

After that, I check whether the cleaned data contains any negative ride durations and display the corresponding start and end times. This is an important step in ensuring data accuracy and consistency, as negative ride durations likely indicate errors in data collection or processing.

# Check whether negative durations appear.
head(select(cleaned_data[cleaned_data$ride_duration_minutes < 0, ], c(started_at, ended_at, ride_duration_minutes)))
##                  started_at            ended_at ride_duration_minutes
## 403592  2022-03-05 11:00:57 2022-03-05 10:55:01                 -5.93
## 406769  2022-03-05 11:38:04 2022-03-05 11:37:57                 -0.12
## 1467056 2022-05-30 11:06:29 2022-05-30 11:06:17                 -0.20
## 1635638 2022-06-07 19:15:39 2022-06-07 17:05:37               -130.03
## 1636061 2022-06-07 19:14:46 2022-06-07 17:07:45               -127.02
## 1788334 2022-06-23 19:22:57 2022-06-23 19:21:46                 -1.18
# Count the negative values in ride_duration_minutes.
nrow(cleaned_data[cleaned_data$ride_duration_minutes < 0, ])
## [1] 100
# Calculate the percentage of negative values in ride_duration_minutes.
(nrow(cleaned_data[cleaned_data$ride_duration_minutes < 0, ])/nrow(cleaned_data))*100
## [1] 0.001766204
# Remove the rows where ride_duration_minutes is less than 0, i.e. negative ride durations.
cleaned_data <- cleaned_data[cleaned_data$ride_duration_minutes >= 0, ]
# Print summary statistics of ride_duration_minutes.
summary(cleaned_data$ride_duration_minutes)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     5.82    10.27    16.33    18.42 34354.07

4.4 Calculate and create new column for ride length

Calculating the length of each ride can provide valuable information on ride length and usage patterns. This is done from the start and end coordinates (“start_lat”, “start_lng”, “end_lat”, and “end_lng”) with the geosphere package, resulting in a new numeric column with the length of each ride. Note that this is the straight-line (great-circle) distance between the start and end points, not the route actually ridden.

# Create the ride_length_km column: distHaversine() returns meters, so divide by 1000 to convert to kilometers.
cleaned_data$ride_length_km <- geosphere::distHaversine(
  p1 = cbind(cleaned_data$start_lng, cleaned_data$start_lat),
  p2 = cbind(cleaned_data$end_lng, cleaned_data$end_lat)
) / 1000 
# Then round to 2 decimal places.
cleaned_data$ride_length_km <- round(cleaned_data$ride_length_km, 2)
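To make the distance calculation concrete, the haversine formula can be written out in base R. This is a sketch for illustration, not the geosphere source code: haversine_km() is a hypothetical helper, and the radius of 6378137 m is the default that geosphere documents for distHaversine(). The example uses the coordinates of the first ride in the data preview.

```r
# Hypothetical helper: great-circle distance in km between two lng/lat points (degrees).
haversine_km <- function(lng1, lat1, lng2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  # Haversine term: squared half-chord length between the two points.
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  # Central angle times radius gives meters; divide by 1000 for kilometers.
  2 * r * asin(pmin(1, sqrt(a))) / 1000
}

# First ride in the preview: start (42.01280, -87.66591), end (42.01256, -87.67437).
first_ride_km <- haversine_km(-87.66591, 42.01280, -87.67437, 42.01256)
round(first_ride_km, 2)  # about 0.7 km
```

A ride that starts and ends at the same point gets a length of 0 km under this measure, however long it lasted.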

5. Conduct Descriptive Analysis

For the descriptive analysis, this part involves exploring the data and summarizing its key characteristics, such as measures of central tendency, variability, and distribution. This type of analysis can provide insights into the patterns and trends present in the data, and help us understand the overall shape and structure of the data.

5.1.1 General analysis comparing between member and casual

In this section, a general analysis is conducted to compare the usage patterns between members and casual. The aim is to gain insights into the differences in how the two types of riders use the bike-sharing service.

# Create a data frame comparing the number of rides by members and casual riders.
member_summary <- cleaned_data %>%
  group_by(member_casual) %>%
  summarise(count = n())
# Graph for comparing number between member and casual.
ggplot(member_summary, aes(x = member_casual, y = count)) +
  geom_bar(aes(fill = member_casual), stat = "identity", position = "dodge") +
  geom_text(aes(label=count), stat="identity", 
            position=position_dodge(width=0.9), vjust=-0.5) +
  labs(title = "Count of Member and Casual", x = "Member Type", y = "Count")

  • The graph shows the number of rides by member type. Members took 3,344,951 rides (59.08%), more than the 2,316,808 rides (40.92%) taken by casual riders.

  • Members therefore used the bike-sharing service more frequently than casual riders.
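The percentages quoted above follow directly from the two ride counts. A short sketch of the arithmetic, using the counts reported in the graph:

```r
# Ride counts by member type, as reported above.
ride_counts <- c(member = 3344951, casual = 2316808)

# Share of all rides, as a percentage rounded to 2 decimals.
ride_share_pct <- round(100 * ride_counts / sum(ride_counts), 2)
ride_share_pct
# member: 59.08, casual: 40.92
```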

5.1.2 Comparing for average ride duration and ride length between member type

This section presents a general comparison between member and casual based on the average ride duration and ride length. By calculating the mean of these two variables, we can gain insights into the differences in riding patterns between the two groups. This information can be useful for understanding the ride behaviours and developing targeted marketing strategies.

# Create a data frame with the average ride duration and ride length for members and casual riders.
ride_summary <- cleaned_data %>%
  group_by(member_casual) %>%
  summarize(avg_duration = mean(ride_duration_minutes),
            avg_length = mean(ride_length_km))
# Graph for the ride duration(minutes) using ride summary between member and casual.
ggplot(ride_summary, aes(x = member_casual, y = avg_duration, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Member Type", y = "Average Ride Duration (minutes)", fill = "Member/Casual") +
  ggtitle("Average Ride Duration Between Member and Casual")

  • In the chart above, casual riders have the higher average ride duration: 21.99 minutes, compared with 12.41 minutes for members.

  • Casual riders therefore tend to take longer rides than members.

# Graph for the ride length(km) using ride summary between member and casual.
ggplot(ride_summary, aes(x = member_casual, y = avg_length, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Member Type", y = "Average Length (km)", fill = "Member/Casual") +
  ggtitle("Average Length Between Member and Casual")

  • The graph above shows only a small difference in average ride length (km) between member types: casual riders average 2.18 km, compared with 2.11 km for members.

  • Therefore, it can be concluded that, on average, casual riders tend to ride slightly farther compared to member riders.

5.1.3 Rideable type with the highest average duration between members and casual

This section aims to identify the rideable type with the highest average duration for both member and casual. By comparing the average duration of rides for each rideable type, we can gain insights into how each member type uses Cyclistic bikes differently.

# Create a data frame of average duration and length by member type and rideable type.
# .groups = "drop" returns an ungrouped result and avoids the grouping message.
avg_duration_by_type <- cleaned_data %>%
  group_by(member_casual, rideable_type) %>%
  summarize(avg_duration = mean(ride_duration_minutes),
            avg_length = mean(ride_length_km),
            .groups = "drop")
knitr::kable(avg_duration_by_type)
|member_casual |rideable_type | avg_duration| avg_length|
|:-------------|:-------------|------------:|----------:|
|casual        |classic_bike  |     24.53221|   2.087424|
|casual        |docked_bike   |     50.70781|   2.175997|
|casual        |electric_bike |     16.17394|   2.253113|
|member        |classic_bike  |     13.32055|   1.965717|
|member        |electric_bike |     11.46272|   2.266822|
# Graph to see the result for ride duration by rideable type between member and casual.
ggplot(avg_duration_by_type, aes(x = rideable_type, y = avg_duration, 
                                 fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Rideable Type", y = "Average Duration (Minutes)",
       title = "Average Ride Duration by Rideable Type Between Member and Casual")

  • The graph above compares the average ride duration (minutes) across the three rideable types.

  • On classic bikes, casual riders average 24.53 minutes, longer than members at 13.32 minutes. The same holds for electric bikes, where casual riders average 16.17 minutes and members 11.46 minutes.

  • Docked bikes have the highest average duration of the three rideable types, at 50.71 minutes. Only casual riders appear in this category, which may indicate that it is preferred by casual riders who are looking for longer rides.

5.2.1 Season analysis comparing between member and casual

A seasonal analysis of bike-sharing patterns gives insights into how members and casual riders differ in their bike usage habits throughout the year.

# Create a data frame of average ride length and ride duration by season, rideable type, and member type.
rides_by_season <- cleaned_data %>%
  group_by(season, rideable_type, member_casual) %>%
  summarize(avg_ride_length = mean(ride_length_km),
            avg_ride_duration = mean(ride_duration_minutes))

knitr::kable(rides_by_season)
|season |rideable_type |member_casual | avg_ride_length| avg_ride_duration|
|:------|:-------------|:-------------|---------------:|-----------------:|
|Fall   |classic_bike  |casual        |        2.051708|          22.40831|
|Fall   |classic_bike  |member        |        1.989286|          12.77882|
|Fall   |docked_bike   |casual        |        2.179392|          44.47659|
|Fall   |electric_bike |casual        |        2.142633|          14.04954|
|Fall   |electric_bike |member        |        2.210474|          10.99363|
|Spring |classic_bike  |casual        |        2.109167|          27.08755|
|Spring |classic_bike  |member        |        1.909699|          13.07919|
|Spring |docked_bike   |casual        |        2.120987|          53.03799|
|Spring |electric_bike |casual        |        2.367752|          17.84304|
|Spring |electric_bike |member        |        2.194932|          11.26554|
|Summer |classic_bike  |casual        |        2.121717|          24.52697|
|Summer |classic_bike  |member        |        2.068572|          14.25216|
|Summer |docked_bike   |casual        |        2.214547|          50.76781|
|Summer |electric_bike |casual        |        2.304302|          17.23570|
|Summer |electric_bike |member        |        2.450307|          12.37411|
|Winter |classic_bike  |casual        |        1.618039|          20.67457|
|Winter |classic_bike  |member        |        1.613089|          11.62504|
|Winter |docked_bike   |casual        |        1.954290|          79.95522|
|Winter |electric_bike |casual        |        1.927572|          11.48674|
|Winter |electric_bike |member        |        1.954729|          10.16456|
# Plot the average ride duration by rideable type between member & casual for each season.
ggplot(rides_by_season, aes(x = rideable_type, y = avg_ride_duration, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Rideable Type", y = "Average Duration (Minutes)", 
  title = "Comparing Average Ride Durations on Rideable Type by Season Between Member Type") +
  facet_wrap(~season) +
  theme(plot.title = element_text(size = 11))

  • The analysis of bike-sharing patterns across seasons reveals notable differences in average ride duration between casual riders and members. In Winter, classic bike rides averaged 20.67 minutes for casual riders and 11.63 minutes for members. Similarly, on electric bikes casual riders averaged 11.49 minutes, longer than the 10.16 minutes observed for members. In Spring, casual riders had their longest classic bike average, 27.09 minutes, compared with 13.08 minutes for members.

  • In Fall, casual riders averaged 22.41 minutes on classic bikes, longer than members at 12.78 minutes, and 14.05 minutes on electric bikes, longer than members at 10.99 minutes. In Summer, casual riders again rode longer: 24.53 minutes versus 14.25 minutes on classic bikes, and 17.24 minutes versus 12.37 minutes on electric bikes.

  • Interestingly, only casual riders rented docked bikes, and they did so in every season. Their average docked bike ride lasted 79.96 minutes in Winter, and 53.04, 44.48, and 50.77 minutes in Spring, Fall, and Summer, respectively.

# Plot the average ride length by rideable type between member & casual for season.
ggplot(rides_by_season, aes(x = rideable_type, y = avg_ride_length, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Rideable Type", y = "Average Ride Length (km)", 
       title = "Comparing Average Ride Length by Season, Rideable Type, and Membership Type") +
  facet_wrap(~season) +
  theme(plot.title = element_text(size = 11))

  • The analysis of bike-sharing patterns across seasons reveals differences in average ride length between casual riders and members. In Winter, classic bike rides averaged 1.62 km for casual riders and 1.61 km for members. On electric bikes, however, casual riders averaged 1.93 km, shorter than the 1.95 km observed for members. In Spring, casual riders had their longest average ride length, 2.37 km on electric bikes, compared with 2.19 km for members.

  • In Fall, casual riders rode farther on classic bikes, 2.05 km versus 1.99 km for members, but shorter on electric bikes, 2.14 km versus 2.21 km.

  • In Summer, casual riders still rode farther on classic bikes, 2.12 km versus 2.07 km for members, but shorter on electric bikes, 2.30 km versus 2.45 km.

  • Interestingly, only casual riders rented docked bikes, and they did so in every season. Their average ride length was 1.95 km in Winter, and 2.12 km, 2.18 km, and 2.21 km in Spring, Fall, and Summer, respectively.

5.2.1 Monthly analysis comparing between member and casual

The monthly analysis of bike-sharing patterns compares the usage between member and casual over different months. This analysis provides insights into how frequently users of each type use the bike-sharing service, how long their rides are, and how usage patterns differ between different months. These insights can help bike-sharing companies better understand their member type and optimize their services to meet their needs.

# Create a data frame counting the number of rides per month for members & casual.
# .groups = "drop" returns an ungrouped result and avoids the grouping message.
counts_by_month <- cleaned_data %>%
  group_by(month, member_casual) %>%
  summarize(count = n(), .groups = "drop")
knitr::kable(counts_by_month)
| month|member_casual |  count|
|-----:|:-------------|------:|
|     1|casual        |  17879|
|     1|member        |  84728|
|     2|casual        |  21361|
|     2|member        |  94171|
|     3|casual        |  89642|
|     3|member        | 194132|
|     4|casual        | 126121|
|     4|member        | 244811|
|     5|casual        | 279764|
|     5|member        | 354371|
|     6|casual        | 368082|
|     6|member        | 400055|
|     7|casual        | 405179|
|     7|member        | 417346|
|     8|casual        | 358161|
|     8|member        | 426913|
|     9|casual        | 296074|
|     9|member        | 404544|
|    10|casual        | 208611|
|    10|member        | 349595|
|    11|casual        | 100559|
|    11|member        | 236905|
|    12|casual        |  45375|
|    12|member        | 137380|
# Plot the number of rides per month for members & casual.
ggplot(counts_by_month, aes(x = month, y = count, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Number of Rides by Month Between Member and Casual",
       x = "Month", y = "Number of Rides",
       fill = "Member Type") +
  theme(legend.position = "top", axis.text.y = element_text(size = 8))

  • The graph above shows the number of bike rides by month for members and casual users. August had the highest number of rides for members, with 426,913 rides, while July had the highest number for casual users, with 405,179 rides.

  • Rides for both groups climbed from January through mid-summer. Casual rides peaked in July and then fell steeply, from 405,179 in July to 45,375 in December. Member rides peaked a month later, in August, and declined from September to December.

  • Ridership for both groups rose sharply at the start of Spring, in March. The decline began at the end of Summer for casual riders (August) and in early Fall for members (September), and continued into early Winter (December) for both.
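The peak months can be checked directly from the table. The sketch below re-enters the monthly counts by hand (the name monthly_counts is illustrative, chosen to avoid overwriting counts_by_month) and picks the busiest month for each rider type.

```r
# Monthly ride counts copied from the counts_by_month table above,
# ordered casual, member within each month.
monthly_counts <- data.frame(
  month = rep(1:12, each = 2),
  member_casual = rep(c("casual", "member"), times = 12),
  count = c(17879, 84728, 21361, 94171, 89642, 194132, 126121, 244811,
            279764, 354371, 368082, 400055, 405179, 417346, 358161, 426913,
            296074, 404544, 208611, 349595, 100559, 236905, 45375, 137380)
)

# For each rider type, keep the row with the largest count.
peak_rows <- do.call(rbind, lapply(
  split(monthly_counts, monthly_counts$member_casual),
  function(d) d[which.max(d$count), ]
))

print(peak_rows)
```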

5.2.2 Monthly analysis comparing in average ride duration between member and casual

This monthly analysis compares the average ride duration between member and casual. It examines the differences in ride duration patterns for each user type across different months, shedding light on potential trends and seasonal variations in ride behavior.

# Compute the average ride duration and average ride length per month for members and casual riders.
avg_duration_by_month <- cleaned_data %>% 
  group_by(month, member_casual) %>% 
  summarize(avg_ride_duration = mean(ride_duration_minutes), 
            avg_ride_length = mean(ride_length_km))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
knitr::kable(avg_duration_by_month)
month member_casual avg_ride_duration avg_ride_length
1 casual 22.28512 1.916228
1 member 11.61177 1.745195
2 casual 22.09011 1.985074
2 member 11.05778 1.778522
3 casual 25.73872 2.186419
3 member 11.74342 1.981899
4 casual 23.35343 2.218713
4 member 11.36473 1.948877
5 casual 25.54210 2.259579
5 member 13.06465 2.146724
6 casual 23.37789 2.245559
6 member 13.65384 2.250941
7 casual 23.29289 2.221379
7 member 13.43377 2.260524
8 casual 21.41921 2.205590
8 member 13.08152 2.224552
9 casual 19.97333 2.175584
9 member 12.63778 2.153869
10 casual 18.44234 2.051708
10 member 11.54556 2.004560
11 casual 15.47479 2.067671
11 member 10.86853 2.177716
12 casual 14.10573 1.719102
12 member 10.36091 1.806797
# Plot the average ride duration between member & casual by month.
ggplot(avg_duration_by_month, aes(x = month, y = avg_ride_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Average Ride Duration by Month Between Member Type",
       x = "Month", y = "Average Ride Duration (Minutes)",
       fill = "Member Type") +
  theme(legend.position = "top", axis.text.y = element_text(size = 8))

  • The graph depicts the average duration of bike rides per month for members and casual riders. Casual riders had their highest average ride duration in March at 25.74 minutes, with May close behind at 25.54 minutes, followed by a gradual decline to 14.11 minutes in December. Members had their highest average ride duration in June at 13.65 minutes, followed by a gradual decline to 10.36 minutes in December.

  • Interestingly, casual riders had a longer average ride duration than members in every month of the year, by a factor of roughly 1.4 to 2.2. This could suggest that casual riders are comfortable with the prices charged for rides. However, further analysis is necessary to confirm this hypothesis.
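To make the size of the gap concrete, the sketch below divides the casual average by the member average for each month. The values are copied by hand from the table above, and the vector names are illustrative.

```r
# Average ride duration (minutes) by month, January to December,
# copied from the avg_duration_by_month table above.
casual_minutes <- c(22.28512, 22.09011, 25.73872, 23.35343, 25.54210, 23.37789,
                    23.29289, 21.41921, 19.97333, 18.44234, 15.47479, 14.10573)
member_minutes <- c(11.61177, 11.05778, 11.74342, 11.36473, 13.06465, 13.65384,
                    13.43377, 13.08152, 12.63778, 11.54556, 10.86853, 10.36091)

# Ratio of casual to member average duration for each month.
duration_ratio <- casual_minutes / member_minutes
names(duration_ratio) <- month.abb

print(round(duration_ratio, 2))
print(round(range(duration_ratio), 2))
```

The ratio is largest in March and smallest in December, so the gap narrows as the weather gets colder.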

# Plot the average ride length between member & casual by month.
ggplot(avg_duration_by_month, aes(x = month, y = avg_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Average Ride Length by Month and Rider Type",
       x = "Month", y = "Average Ride Length (km)",
       fill = "Member Type") +
  theme(legend.position = "top", axis.text.y = element_text(size = 8))  

  • The graph displays the average ride length per month for members and casual riders. The two groups differ only slightly in each month. The highest average ride length for casual riders was recorded in May at 2.2596 km, while for members it was recorded in July at 2.2605 km. The lowest average ride length was recorded in December at 1.72 km for casual riders, and in January at 1.75 km for members.

  • Interestingly, members cover almost the same distance as casual riders despite their much shorter ride durations, and in June, July, August, November, and December their average ride length is slightly longer. This suggests that members, who use the bikes more often, ride more directly between stations than casual riders. However, further analysis is required to confirm this observation.
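One rough way to compare the two tables is to turn them into an implied speed, dividing average length by average duration. This is only a sketch. It uses a ratio of monthly averages rather than an average of per-ride speeds, and it assumes ride_length_km measures the distance between start and end points, which would understate the distance of round trips and detours.

```r
# Monthly averages copied from the avg_duration_by_month table above.
casual_minutes <- c(22.28512, 22.09011, 25.73872, 23.35343, 25.54210, 23.37789,
                    23.29289, 21.41921, 19.97333, 18.44234, 15.47479, 14.10573)
member_minutes <- c(11.61177, 11.05778, 11.74342, 11.36473, 13.06465, 13.65384,
                    13.43377, 13.08152, 12.63778, 11.54556, 10.86853, 10.36091)
casual_km <- c(1.916228, 1.985074, 2.186419, 2.218713, 2.259579, 2.245559,
               2.221379, 2.205590, 2.175584, 2.051708, 2.067671, 1.719102)
member_km <- c(1.745195, 1.778522, 1.981899, 1.948877, 2.146724, 2.250941,
               2.260524, 2.224552, 2.153869, 2.004560, 2.177716, 1.806797)

# Implied speed in km/h: distance divided by time in hours.
implied_speed_kmh <- function(km, minutes) km / (minutes / 60)

implied_speed <- data.frame(
  month  = month.abb,
  casual = round(implied_speed_kmh(casual_km, casual_minutes), 2),
  member = round(implied_speed_kmh(member_km, member_minutes), 2)
)

print(implied_speed)
```

Under these assumptions the implied member speed is higher in every month, which is consistent with more direct station-to-station trips.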

5.3.1 Day of week analysis comparing between member and casual

This analysis compares the number of rides taken by members and casual riders on each day of the week. It helps identify patterns or differences in how each rider type uses the bike-sharing service across the week.

# Count rides by day of week for members and casual riders.
day_of_week_summary <- cleaned_data %>%
  group_by(day_of_week, member_casual) %>%
  summarise(count = n())
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
knitr::kable(day_of_week_summary)
day_of_week member_casual count
Sun casual 387902
Sun member 388064
Mon casual 277045
Mon member 473468
Tue casual 263268
Tue member 518560
Wed casual 273793
Wed member 523689
Thu casual 308686
Thu member 531835
Fri casual 334553
Fri member 466723
Sat casual 471561
Sat member 442612
# Plot ride counts by day of week for members and casual riders.
ggplot(day_of_week_summary, aes(x = day_of_week, y = count, fill = member_casual)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = count), position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 2) +
  labs(x = "Day of Week", y = "Count", fill = "Member/Casual") +
  ggtitle("Usage Patterns by Day of Week Between Member and Casual")

  • The graph above displays Cyclistic bike usage by day of the week for members and casual riders. Members take more rides on every day of the week except Saturday, when casual riders outnumber them (471,561 rides for casual riders compared to 442,612 for members). On Sunday the two groups are nearly equal (387,902 casual rides and 388,064 member rides).

  • Although casual riders take fewer rides overall, their usage rises steadily from Tuesday to Saturday. In contrast, member usage is highest from Tuesday to Thursday and lowest on Sunday.

  • It is noteworthy that members take more rides than casual riders on six of the seven days of the week. This could suggest that the price offered to members is more attractive and cost-effective for regular bike users.
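The weekday and weekend contrast is easier to see as a share of rides. The sketch below computes the casual share of all rides for each day from the counts in the table above (vector names are illustrative).

```r
# Ride counts by day of week, copied from the day_of_week_summary table above.
days <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
casual_rides <- c(387902, 277045, 263268, 273793, 308686, 334553, 471561)
member_rides <- c(388064, 473468, 518560, 523689, 531835, 466723, 442612)

# Share of each day's rides taken by casual riders.
casual_share <- casual_rides / (casual_rides + member_rides)
names(casual_share) <- days

print(round(100 * casual_share, 1))

# Days on which casual riders take more than half of all rides.
casual_majority_days <- days[casual_share > 0.5]
print(casual_majority_days)
```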

5.3.2 Day of week analysis comparing average duration and average length between member and casual

The analysis compares the average ride duration and length for member and casual across different days of the week. It aims to provide insights into whether there are any notable differences in the way these two groups use the bike-sharing service on different days.

# Compute average ride duration and average ride length by day of week for members and casual riders.
duration_summary <- cleaned_data %>%
  group_by(day_of_week, member_casual) %>%
  summarize(avg_duration = mean(ride_duration_minutes),
            avg_length = mean(ride_length_km))
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
knitr::kable(duration_summary)
day_of_week member_casual avg_duration avg_length
Sun casual 25.09057 2.239146
Sun member 13.68545 2.145585
Mon casual 22.62697 2.096668
Mon member 11.99860 2.042182
Tue casual 19.63264 2.092925
Tue member 11.81419 2.065093
Wed casual 19.00205 2.172441
Wed member 11.84072 2.196016
Thu casual 19.67199 2.132912
Thu member 12.00163 2.093116
Fri casual 20.61548 2.155729
Fri member 12.22869 2.061212
Sat casual 24.59478 2.299640
Sat member 13.80004 2.196418
# Plot average ride duration by day of week for members and casual riders.
ggplot(duration_summary, aes(x = day_of_week, y = avg_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = round(avg_duration, 2)), position = position_dodge(width = 0.9),
            vjust = -0.5, size = 2) +
  labs(x = "Day of Week", y = "Average Duration (Minutes)", fill = "Member/Casual") +
  ggtitle("Average Ride Duration by Day of Week Between Member and Casual") +
  theme(legend.position = "bottom")

  • The graph above shows the average ride duration by day of week for members and casual riders. Casual riders have the higher average ride duration on every day of the week.

  • Although members have the lower average ride duration throughout the week, their average remains relatively stable, ranging from 11.81 minutes on Tuesday to 13.80 minutes on Saturday.

  • Additionally, for casual riders the average ride duration is lowest on Wednesday (19.00 minutes) and increases steadily through Sunday (25.09 minutes).

# Plot average ride length by day of week for members and casual riders.
ggplot(duration_summary, aes(x = day_of_week, y = avg_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = round(avg_length, 2)), position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 2) +
  labs(x = "Day of Week", y = "Average Ride Length (km)", fill = "Member/Casual") +
  ggtitle("Average Ride Length by Day of Week and Member/Casual Riders") +
  theme(legend.position = "bottom")

  • The graph above illustrates the average ride length by day of week for members and casual riders. Notably, the ride lengths of the two rider types are strikingly similar on every day of the week.

  • Casual riders have marginally longer average ride lengths than members on every day except Wednesday. The longest average ride length for casual riders is recorded on Saturdays at 2.30 km; for members, Saturdays and Wednesdays are virtually tied at 2.20 km.

5.4.1 Hour of day comparing between member and casual

The analysis compares the number of rides taken by members and casual riders at each hour of the day. The aim is to show how the two groups use the bike-sharing service at different hours.

# Count rides by hour of day for members and casual riders.
ride_counts <- cleaned_data %>%
  group_by(hour_of_day, member_casual) %>%
  summarize(count = n())
## `summarise()` has grouped output by 'hour_of_day'. You can override using the
## `.groups` argument.
knitr::kable(ride_counts)
hour_of_day member_casual count
0 casual 46483
0 member 40915
1 casual 30220
1 member 25527
2 casual 19129
2 member 16132
3 casual 11948
3 member 12307
4 casual 8719
4 member 13911
5 casual 13476
5 member 36203
6 casual 30483
6 member 93851
7 casual 52422
7 member 174145
8 casual 70630
8 member 207008
9 casual 72934
9 member 148885
10 casual 93251
10 member 138686
11 casual 121068
11 member 162476
12 casual 143330
12 member 185106
13 casual 149433
13 member 182959
14 casual 158826
14 member 181243
15 casual 176792
15 member 216107
16 casual 196184
16 member 284351
17 casual 218621
17 member 341070
18 casual 196102
18 member 278418
19 casual 150208
19 member 201957
20 casual 111028
20 member 142115
21 casual 95006
21 member 112757
22 casual 86030
22 member 88110
23 casual 64485
23 member 60712
# Plot ride counts by hour of day for members and casual riders.
ggplot(ride_counts, aes(x = hour_of_day, y = count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Number of Rides by Hour of Day Between Member and Casual",
       x = "Hour of Day", y = "Number of Rides",
       fill = "Rider Type") +
  theme(legend.position = "bottom")

  • The graph displays the hourly number of rides for members and casual riders. Members take more rides than casual riders at most hours; casual riders lead only late at night, from 11 p.m. through the 2 a.m. hour.

  • According to the graph, the busiest hour of the day is around 5 p.m., when members recorded 341,070 rides. Member rides rise from a low of 12,307 at 3 a.m. to a first peak of 207,008 around 8 a.m., ease during the late morning, and then climb again to the 5 p.m. peak.

  • The graph reveals a very low number of rides for both casual riders and members between 3 a.m. and 4 a.m. Usage then increases and reaches its highest level around 5 p.m. for both rider types.
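The hourly peaks can be located programmatically. The sketch below defines a small helper, local_peaks (a hypothetical name, not part of any package), that flags hours whose ride count exceeds both neighbouring hours. The counts are copied from the table above; the first and last hours are not tested because they have only one neighbour in this vector.

```r
hours <- 0:23

# Hourly ride counts copied from the ride_counts table above.
member_rides <- c(40915, 25527, 16132, 12307, 13911, 36203, 93851, 174145,
                  207008, 148885, 138686, 162476, 185106, 182959, 181243, 216107,
                  284351, 341070, 278418, 201957, 142115, 112757, 88110, 60712)
casual_rides <- c(46483, 30220, 19129, 11948, 8719, 13476, 30483, 52422,
                  70630, 72934, 93251, 121068, 143330, 149433, 158826, 176792,
                  196184, 218621, 196102, 150208, 111028, 95006, 86030, 64485)

# Return the positions whose value is larger than both neighbours.
local_peaks <- function(x) {
  i <- 2:(length(x) - 1)
  i[x[i] > x[i - 1] & x[i] > x[i + 1]]
}

member_peaks <- hours[local_peaks(member_rides)]
casual_peaks <- hours[local_peaks(casual_rides)]

print(member_peaks)
print(casual_peaks)
```

Members show separate morning and late-afternoon peaks plus a small midday bump, while casual rides build to a single peak in the late afternoon.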

5.5.1 Hour of day analysis comparing average duration and average length between member and casual

This analysis compares the average ride duration and average ride length of members and casual riders across the hours of the day. It can show when each rider type takes its longest and shortest rides and help bike-sharing companies adjust their services accordingly.

# Compute average ride duration and average ride length by hour of day for members and casual riders.
ride_duration_summary <- cleaned_data %>%
  group_by(hour_of_day, member_casual) %>%
  summarize(avg_ride_duration = mean(ride_duration_minutes),
            avg_ride_length = mean(ride_length_km))
## `summarise()` has grouped output by 'hour_of_day'. You can override using the
## `.groups` argument.
knitr::kable(ride_duration_summary)
hour_of_day member_casual avg_ride_duration avg_ride_length
0 casual 19.16063 1.994147
0 member 11.96146 2.041912
1 casual 20.18658 2.022072
1 member 11.92407 1.979325
2 casual 19.88483 2.099720
2 member 11.99664 1.936918
3 casual 19.47687 2.254195
3 member 11.95756 1.887134
4 casual 17.33578 2.232890
4 member 11.69781 2.069858
5 casual 15.99421 2.107309
5 member 10.50820 2.128078
6 casual 16.11783 2.094036
6 member 10.89261 2.207568
7 casual 15.11570 2.089533
7 member 11.45753 2.274050
8 casual 16.66733 2.088277
8 member 11.30717 2.143365
9 casual 21.63870 2.152578
9 member 11.41917 2.015782
10 casual 25.48343 2.250716
10 member 12.17153 2.061350
11 casual 26.11056 2.278090
11 member 12.64284 2.017851
12 casual 25.37249 2.253513
12 member 12.18605 1.976384
13 casual 25.63903 2.243600
13 member 12.36834 2.082029
14 casual 25.34080 2.255736
14 member 12.72825 2.050572
15 casual 24.03236 2.244738
15 member 12.79964 2.150954
16 casual 22.46893 2.355238
16 member 13.02258 2.205016
17 casual 21.03642 2.252223
17 member 13.26508 2.213336
18 casual 20.71839 2.163160
18 member 13.05927 2.127925
19 casual 20.78182 2.040136
19 member 12.77520 2.037212
20 casual 20.28402 1.980405
20 member 12.58302 2.012269
21 casual 19.25802 2.034150
21 member 12.30942 2.090073
22 casual 18.78231 2.093450
22 member 12.26700 2.168989
23 casual 19.42073 2.097291
23 member 12.30790 2.155957
# Plot average ride duration by hour of day for members and casual riders.
ggplot(ride_duration_summary, aes(x = hour_of_day, y = avg_ride_duration, 
                                  fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Duration by Hour of Day Between Member and Casual",
       x = "Hour of Day", y = "Average Duration (Minutes)", 
       fill = "Member/Casual") +
  theme(legend.position = "bottom")

  • The graph above shows the average ride duration (in minutes) by hour for members and casual riders. Notably, the average ride duration for casual riders is higher than that of members at every hour.

  • The highest average ride duration of the day is observed around 11 a.m., at 26.11 minutes for casual riders. Their average duration climbs from about 15 minutes at 7 a.m., stays above 25 minutes from 10 a.m. to 2 p.m., and then declines gradually through the afternoon and evening.

  • An interesting pattern occurs for members: the lowest average ride duration, 10.51 minutes, is recorded around 5 a.m., and the highest, 13.27 minutes, around 5 p.m. This suggests that the average ride duration for members follows a different, much flatter trend than that of casual riders.

# Plot average ride length by hour of day for members and casual riders.
ggplot(ride_duration_summary, aes(x = hour_of_day, y = avg_ride_length,
                                  fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Length by Hour of Day Between Member and Casual",
       x = "Hour of Day", y = "Average Ride Length (km)",
       fill = "Member/Casual") +
  theme(legend.position = "bottom")

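  • The graph displays the average ride length (in kilometers) by hour for members and casual riders. Casual riders have a slightly higher average ride length than members in 15 of the 24 hours (1 a.m. to 4 a.m. and 9 a.m. to 7 p.m.), while members ride slightly farther at midnight, from 5 a.m. to 8 a.m., and from 8 p.m. to 11 p.m.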

  • Each rider type follows its own pattern, with slightly different average lengths over the day. The highest average ride length for members is 2.27 km around 7 a.m., while the highest for casual riders is 2.36 km at 4 p.m. The lowest average ride length for members is 1.89 km at 3 a.m., while the lowest for casual riders is 1.98 km at 8 p.m.
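As a quick check on which rider type rides farther at each hour, the sketch below compares the two columns of hourly average ride length copied from the table above (vector names are illustrative).

```r
hours <- 0:23

# Average ride length (km) by hour, copied from the ride_duration_summary table.
casual_km <- c(1.994147, 2.022072, 2.099720, 2.254195, 2.232890, 2.107309,
               2.094036, 2.089533, 2.088277, 2.152578, 2.250716, 2.278090,
               2.253513, 2.243600, 2.255736, 2.244738, 2.355238, 2.252223,
               2.163160, 2.040136, 1.980405, 2.034150, 2.093450, 2.097291)
member_km <- c(2.041912, 1.979325, 1.936918, 1.887134, 2.069858, 2.128078,
               2.207568, 2.274050, 2.143365, 2.015782, 2.061350, 2.017851,
               1.976384, 2.082029, 2.050572, 2.150954, 2.205016, 2.213336,
               2.127925, 2.037212, 2.012269, 2.090073, 2.168989, 2.155957)

# Hours in which the casual average exceeds the member average.
casual_longer <- casual_km > member_km
hours_casual_longer <- hours[casual_longer]

print(hours_casual_longer)
print(sum(casual_longer))
```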

5.6 Hypothesis

A hypothesis in data analysis is a tentative explanation or prediction that proposes a possible relationship between variables. Hypotheses guide the data analysis process by providing a specific and testable explanation for a phenomenon or observation.

In this analysis, three hypotheses were formulated to explore potential differences between member and casual riders in terms of ride duration and length, as well as differences in the number of rides taken across different seasons. Two statistical modeling techniques, the t-test and ANOVA, were used to test the hypotheses and evaluate the statistical significance of the results. The t-test was used for hypotheses 1 and 2, while ANOVA was used for hypothesis 3.

1. Members and casual riders may have different ride durations by hour of day.

t.test(avg_ride_duration ~ member_casual, data = ride_duration_summary)
## 
##  Welch Two Sample t-test
## 
## data:  avg_ride_duration by member_casual
## t = 12.421, df = 25.002, p-value = 3.415e-12
## alternative hypothesis: true difference in means between group casual and group member is not equal to 0
## 95 percent confidence interval:
##  7.114905 9.943382
## sample estimates:
## mean in group casual mean in group member 
##             20.67949             12.15035
#Plot
ggplot(ride_duration_summary, aes(x = member_casual, y = avg_ride_duration)) +
  geom_boxplot() +
  labs(x = "Rider Type", y = "Ride Duration (minutes)") +
  ggtitle("Distribution of Ride Duration by Rider Type")

The t-test output suggests a significant difference in average ride duration between members and casual riders. Across the 24 hourly averages, the mean ride duration is 20.68 minutes for casual riders and 12.15 minutes for members, and the 95% confidence interval for the difference (casual minus member) is 7.11 to 9.94 minutes. The p-value of 3.415e-12 and the t-value of 12.421 on 25.002 degrees of freedom provide strong evidence against the null hypothesis that the true difference in means between the two groups is 0. Therefore, we can conclude that there is a significant difference in average ride duration between members and casual riders.
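Because each hour of the day contributes one casual average and one member average, the two samples are naturally paired. The sketch below reproduces the Welch test from the hourly averages in the table above and, as an assumption about what might suit this design better, also runs a paired t-test on the same values. It is an illustration, not a replacement for the analysis above.

```r
# Hourly average ride duration (minutes), hours 0 to 23,
# copied from the ride_duration_summary table above.
casual_minutes <- c(19.16063, 20.18658, 19.88483, 19.47687, 17.33578, 15.99421,
                    16.11783, 15.11570, 16.66733, 21.63870, 25.48343, 26.11056,
                    25.37249, 25.63903, 25.34080, 24.03236, 22.46893, 21.03642,
                    20.71839, 20.78182, 20.28402, 19.25802, 18.78231, 19.42073)
member_minutes <- c(11.96146, 11.92407, 11.99664, 11.95756, 11.69781, 10.50820,
                    10.89261, 11.45753, 11.30717, 11.41917, 12.17153, 12.64284,
                    12.18605, 12.36834, 12.72825, 12.79964, 13.02258, 13.26508,
                    13.05927, 12.77520, 12.58302, 12.30942, 12.26700, 12.30790)

# Welch two-sample t-test, the same test t.test() runs by default above.
welch_test <- t.test(casual_minutes, member_minutes)

# Paired t-test: treats each hour as a matched casual/member pair.
paired_test <- t.test(casual_minutes, member_minutes, paired = TRUE)

print(welch_test)
print(paired_test)
```

Note that both tests work on 24 hourly averages per group, so they describe differences between hourly means rather than between individual rides.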

2. Members and casual riders may have different ride lengths by hour of day.

t.test(avg_ride_length ~ member_casual, data = ride_duration_summary)
## 
##  Welch Two Sample t-test
## 
## data:  avg_ride_length by member_casual
## t = 2.3662, df = 45.618, p-value = 0.02227
## alternative hypothesis: true difference in means between group casual and group member is not equal to 0
## 95 percent confidence interval:
##  0.01021111 0.12673067
## sample estimates:
## mean in group casual mean in group member 
##             2.153217             2.084746
#Plot
ggplot(ride_duration_summary, aes(x = member_casual, y = avg_ride_length)) +
  geom_boxplot() +
  labs(x = "Rider Type", y = "Ride Length (km)") +
  ggtitle("Distribution of Ride Length by Rider Type")

Based on the results of the t-test, there is evidence of a statistically significant difference in mean ride length between casual riders and members. The p-value of 0.02227 means that, if there were no true difference, a difference at least this large would be expected only about 2.2% of the time. The 95% confidence interval for the difference (casual minus member) is 0.010 km to 0.127 km, which excludes zero. Therefore, it can be concluded that casual riders and members have different average ride lengths, although the difference is small (2.15 km versus 2.08 km). Casual riders may be somewhat more likely to take longer rides than members.

3. The season of the year has a significant effect on the number of bike rides taken, regardless of whether the riders are members or casual users.

anova1 <- aov(avg_ride_duration ~ season + member_casual, data = rides_by_season)
summary(anova1)
##               Df Sum Sq Mean Sq F value Pr(>F)  
## season         3     87    28.9   0.093 0.9625  
## member_casual  1   1900  1900.1   6.146 0.0255 *
## Residuals     15   4637   309.1                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This table shows the results of an ANOVA fitted with season and member type (member_casual) as factors. Note that the response in the model above is avg_ride_duration, so the table evaluates differences in average ride duration rather than in the number of bike rides. The table provides information on the degrees of freedom (Df), the sum of squares (Sum Sq), the mean squares (Mean Sq), the F-value, and the p-value (Pr(>F)) for each factor and the residuals.

The “season” factor is not significant (F = 0.093, p = 0.9625), indicating no detectable difference in average ride duration across seasons. The “member_casual” factor is significant at the 5% level (F = 6.146, p = 0.0255), indicating that average ride duration differs between casual and member riders. The residual sum of squares is 4637, which represents the unexplained variation in the data.

Overall, these results show a significant effect of member type (casual or member) but not of season. Because the response in this model is average ride duration, the hypothesis about the number of bike rides is not tested directly and needs a model with ride counts as the response.
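A minimal sketch of how hypothesis 3 could be tested on ride counts is shown below. It uses the monthly counts from section 5.2.1 as the response and assumes a season mapping of December to February as Winter, March to May as Spring, June to August as Summer, and September to November as Fall. The helper season_of_month and the data frame monthly_counts are illustrative names, and the mapping may differ from the one used to build rides_by_season.

```r
# Monthly ride counts copied from the counts_by_month table,
# ordered casual, member within each month.
monthly_counts <- data.frame(
  month = rep(1:12, each = 2),
  member_casual = rep(c("casual", "member"), times = 12),
  count = c(17879, 84728, 21361, 94171, 89642, 194132, 126121, 244811,
            279764, 354371, 368082, 400055, 405179, 417346, 358161, 426913,
            296074, 404544, 208611, 349595, 100559, 236905, 45375, 137380)
)

# Assumed season mapping (see the lead-in above).
season_of_month <- function(m) {
  ifelse(m %in% c(12, 1, 2), "Winter",
  ifelse(m %in% 3:5, "Spring",
  ifelse(m %in% 6:8, "Summer", "Fall")))
}
monthly_counts$season <- factor(season_of_month(monthly_counts$month))

# Two-way ANOVA with the number of rides as the response.
count_model <- aov(count ~ season + member_casual, data = monthly_counts)
count_anova <- summary(count_model)[[1]]

print(count_anova)
```

With only 24 monthly totals this is a coarse test, so its result would be best read alongside the monthly chart rather than on its own.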

6. Conclusion and Recommendations

Based on the data analysis, members account for 59.08% of all rides and casual riders for 40.92%, and the two groups have different preferences for rideable types. Members use classic bikes (30.19% of all rides) and electric bikes (28.9%), while casual riders use three rideable types: classic bikes (15.7%), electric bikes (22.13%), and docked bikes (3.09%).

Additionally, the analysis found a significant difference in average ride duration between members and casual riders across four time dimensions: season, month, day of the week, and hour of the day. Casual riders tend to take longer rides than members. For example, by hour of day the highest average duration for casual riders is 26.11 minutes around 11 a.m., whereas the highest for members is 13.27 minutes around 5 p.m. By month, the highest average duration for casual riders is 25.74 minutes in March (with 25.54 minutes in May close behind), while for members it is 13.65 minutes in June.

On the other hand, the analysis found only a slight difference in average ride length between members and casual riders across the same four time dimensions. For example, in the seasonal results the average ride length is quite similar for members and casual riders on classic and electric bikes. Members led casual riders on electric bikes in Fall, Summer, and Winter, but not in Spring, whereas casual riders led on classic and docked bikes in every season. This may suggest that classic and docked bikes are priced lower than electric bikes, which may carry a special offer for members. Additionally, Summer has the most rides for both rider types: casual riders recorded 405,179 rides (7.16%) in July, and members recorded 426,913 rides (7.54%) in August. Finally, the average duration by hour shows that casual riders take longer rides than members at every hour of the day.

Besides, statistical tests were run to examine potential differences between members and casual riders in ride duration and ride length, as well as differences across seasons. For hypothesis 1, the t-test provides strong evidence against the null hypothesis that the true difference in means between the two groups is 0, which means there is a significant difference in average ride duration between members and casual riders. For hypothesis 2, the t-test suggests that casual riders and members have different average ride lengths, with casual riders somewhat more likely to take longer rides. For hypothesis 3, the ANOVA output in section 5.6 does not show a significant effect of season (p = 0.9625) but does show a significant effect of member type (p = 0.0255). Because that model uses average ride duration as the response, the effect of season on the number of rides remains to be tested directly.

Recommendation