How Does a Bike-Share Navigate Speedy Success?

Introduction

I am a junior data analyst on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.

The assignment for the analytics team is to analyze the available datasets to answer the following questions:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

Lily Moreno, the director of marketing and my manager, has assigned me the first question: “How do annual members and casual riders use Cyclistic bikes differently?”

Based on this question, I propose three related hypotheses:

Members and casual riders may have different ride durations.
Members and casual riders may have different ride lengths for days.
The season of the year has a significant effect on the number of bike rides taken, regardless of whether the riders are members or casual users.

For this analysis, the following libraries are used:

Install and Load Packages

library("tidyverse")
library("lubridate")
library("geosphere")
library("stats")
library("knitr")

1. Data Collection & Preparation.

For this project, I utilized the bike-sharing datasets for the year 2022, which can be downloaded from here. The dataset is made available by Lyft Bikes and Scooters under the terms of this license agreement.

After storing the data locally, I verified the credibility of the data by checking whether it is ROCCC (reliable, original, comprehensive, current, and cited). Once I was confident in the data’s credibility, I opened the data to begin exploring and analyzing it.

Opening the data.

# Upload Divvy datasets (csv files) here
Jan_2022 <- read.csv("C:/Users/User/Music/Bike_data_2022/202201_tripdata.csv")
Feb_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202202_tripdata.csv")
Mar_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202203_tripdata.csv")
Apr_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202204_tripdata.csv")
May_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202205_tripdata.csv")
Jun_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202206_tripdata.csv")
Jul_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202207_tripdata.csv")
Aug_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202208_tripdata.csv")
Sept_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202209_tripdata.csv")
Oct_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202210_tripdata.csv")
Nov_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202211_tripdata.csv")
Dec_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202212_tripdata.csv")
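
The twelve calls above could also be written as a loop. The following is only a sketch, assuming that all monthly files sit in one folder and follow the `*_tripdata.csv` naming used above. The helper name `read_trip_files` and its `data_dir` argument are hypothetical and not part of the original analysis. Reading every file with the same reader (`readr::read_csv`) should also give every month the same column types.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read every "*_tripdata.csv" file in `data_dir`
# with the same reader and stack the results into one data frame.
read_trip_files <- function(data_dir) {
  files <- sort(list.files(data_dir, pattern = "_tripdata\\.csv$", full.names = TRUE))
  monthly <- lapply(files, function(f) read_csv(f, show_col_types = FALSE))
  bind_rows(monthly)
}

# Example call, using the folder from the code above:
# bike_2022 <- read_trip_files("C:/Users/User/Music/Bike_data_2022")
```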

2. Wrangle Data and Combine Into a Single File

This section prepares the data for analysis. First, I check that the monthly datasets share the same columns and column formats, and then I combine them into a single data frame.

2.1 Check the column names

# Create a list of monthly data frames for 2022.
monthly_data_2022 <- list(Jan_2022, Feb_2022, Mar_2022, Apr_2022, 
                          May_2022, Jun_2022, Jul_2022, Aug_2022, 
                          Sept_2022, Oct_2022, Nov_2022, Dec_2022)

# Extract the column names of the first data frame.
colnames_2022 <- colnames(monthly_data_2022[[1]])

# Check that every data frame has exactly the same column names, in the same order, as the first one.
# (identical() is stricter than %in%, which would only test for a subset.)
consistent_colnames <- all(sapply(monthly_data_2022, function(df) identical(colnames(df), colnames_2022)))
consistent_colnames

2.2 Check the data type

This section checks the class of the “started_at” and “ended_at” columns in each month’s data frame before stacking them into one dataset.

# Use sapply to apply the function 'class' to each data frame in the list, and extract the class of the 'started_at' column for each month.
sapply(list(Jan_2022, Feb_2022, Mar_2022, Apr_2022, May_2022, Jun_2022, 
            Jul_2022, Aug_2022, Sept_2022, Oct_2022, Nov_2022, Dec_2022), 
       function(x) class(x$started_at))

## [[1]]
## [1] "character"
## 
## [[2]]
## [1] "POSIXct" "POSIXt" 
## 
## [[3]]
## [1] "POSIXct" "POSIXt" 
## 
## [[4]]
## [1] "POSIXct" "POSIXt" 
## 
## [[5]]
## [1] "POSIXct" "POSIXt" 
## 
## [[6]]
## [1] "POSIXct" "POSIXt" 
## 
## [[7]]
## [1] "POSIXct" "POSIXt" 
## 
## [[8]]
## [1] "POSIXct" "POSIXt" 
## 
## [[9]]
## [1] "POSIXct" "POSIXt" 
## 
## [[10]]
## [1] "POSIXct" "POSIXt" 
## 
## [[11]]
## [1] "POSIXct" "POSIXt" 
## 
## [[12]]
## [1] "POSIXct" "POSIXt"

sapply(list(Jan_2022, Feb_2022, Mar_2022, Apr_2022, May_2022, Jun_2022, 
            Jul_2022, Aug_2022, Sept_2022, Oct_2022, Nov_2022, Dec_2022), 
       function(x) class(x$ended_at))

## [[1]]
## [1] "character"
## 
## [[2]]
## [1] "POSIXct" "POSIXt" 
## 
## [[3]]
## [1] "POSIXct" "POSIXt" 
## 
## [[4]]
## [1] "POSIXct" "POSIXt" 
## 
## [[5]]
## [1] "POSIXct" "POSIXt" 
## 
## [[6]]
## [1] "POSIXct" "POSIXt" 
## 
## [[7]]
## [1] "POSIXct" "POSIXt" 
## 
## [[8]]
## [1] "POSIXct" "POSIXt" 
## 
## [[9]]
## [1] "POSIXct" "POSIXt" 
## 
## [[10]]
## [1] "POSIXct" "POSIXt" 
## 
## [[11]]
## [1] "POSIXct" "POSIXt" 
## 
## [[12]]
## [1] "POSIXct" "POSIXt"
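
The two checks above can be combined into one compact table. This is a minimal sketch; the helper name `check_datetime_classes` is hypothetical, and it reports only the first class of each column (for example "POSIXct" rather than both "POSIXct" and "POSIXt").

```r
# Hypothetical helper: return a matrix with one row per data frame and
# one column per date-time column, holding the first class of each.
check_datetime_classes <- function(dfs, cols = c("started_at", "ended_at")) {
  classes <- sapply(dfs, function(df) {
    sapply(cols, function(col) class(df[[col]])[1])
  })
  t(classes)
}

# Example call with the list built in section 2.1:
# check_datetime_classes(monthly_data_2022)
```

A row that shows "character" where the others show "POSIXct" points to the month that needs converting.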

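The “started_at” and “ended_at” columns are stored as character in the first data frame (Jan_2022), which was read with base R’s read.csv(), whereas the other months were read with read_csv() and are already POSIXct. The code below converts the January columns from character to POSIXct so that all months share the same date-time type for the analysis.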

# Convert "started_at" & "ended_at" column to POSIXct in first data frame.
Jan_2022$started_at <- as.POSIXct(Jan_2022$started_at, format = "%Y-%m-%d %H:%M:%S")
Jan_2022$ended_at <- as.POSIXct(Jan_2022$ended_at, format = "%Y-%m-%d %H:%M:%S")
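
One detail worth checking here is the time zone. `as.POSIXct()` interprets the text in the session’s local time zone unless `tz` is given, while `readr::read_csv()` parses date-times as UTC by default. If the local time zone is not UTC, the January timestamps would be shifted relative to the other months. The minimum `started_at` of 2021-12-31 in the summary of section 4.1 may be a symptom of this. The following is a minimal sketch, assuming the January timestamps should be treated the same way as the readr-parsed months; `jan_example` is a hypothetical stand-in for `Jan_2022`.

```r
# Hypothetical stand-in for Jan_2022 with character timestamps.
jan_example <- data.frame(
  started_at = c("2022-01-13 03:59:47", "2022-01-10 00:41:56"),
  stringsAsFactors = FALSE
)

# Passing tz = "UTC" makes the conversion independent of the local time zone.
jan_example$started_at <- as.POSIXct(jan_example$started_at,
                                     format = "%Y-%m-%d %H:%M:%S",
                                     tz = "UTC")

# Inspect the time zone attached to the converted column.
attr(jan_example$started_at, "tzone")
```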

2.3 Combine into a single dataset

After converting the “started_at” and “ended_at” columns to POSIXct, the monthly data frames can be combined into a single dataset.

# Stack all monthly data frames into one dataset.
bike_2022 <- bind_rows(Jan_2022, Feb_2022, Mar_2022, 
                       Apr_2022, May_2022, Jun_2022, 
                       Jul_2022, Aug_2022, Sept_2022, 
                       Oct_2022, Nov_2022, Dec_2022)

# Recheck the structure of the combined data frame.
str(bike_2022)

## 'data.frame':    5667717 obs. of  13 variables:
##  $ ride_id           : chr  "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct, format: "2022-01-13 03:59:47" "2022-01-10 00:41:56" ...
##  $ ended_at          : POSIXct, format: "2022-01-13 04:02:44" "2022-01-10 00:46:17" ...
##  $ start_station_name: chr  "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr  "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr  "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr  "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num  42 42 41.9 42 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num  42 42 41.9 42 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "member" "casual" ...

3. Data Cleaning

An essential part of data preparation and analysis is handling missing values. In this project I take the simplest approach that suits this dataset: I remove the rows with missing end coordinates and drop the columns I do not need.

# Collect all rows that contain at least one missing value and count them.
na_df <- bike_2022[rowSums(is.na(bike_2022)) > 0,]
nrow(na_df)

## [1] 1274801

# Count the missing values in each column of the dataset.
colSums(is.na(bike_2022))

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##             816804             816804             874815             874815 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               5858               5858 
##      member_casual 
##                  0

# Calculate the percentage of rows with a missing end_lat and with a missing end_lng.
(nrow(na_df[is.na(na_df$end_lat), ])/nrow(bike_2022))*100

## [1] 0.1033573

(nrow(na_df[is.na(na_df$end_lng), ])/nrow(bike_2022))*100

## [1] 0.1033573

Only a very small portion of the data, about 0.1%, has missing end coordinates (end_lat and end_lng). These rows are removed in the next section.
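
The share of missing values can also be computed for every column at once. This is a minimal sketch on hypothetical toy data (`example_df`); applied to `bike_2022`, the same expression should give the percentage for each of the 13 columns.

```r
# Hypothetical toy data with one row that lacks end coordinates.
example_df <- data.frame(
  end_lat = c(41.9, NA, 42.0, 41.8),
  end_lng = c(-87.6, NA, -87.7, -87.6),
  member_casual = c("casual", "member", "member", "casual")
)

# is.na() gives a logical matrix; the column means are the missing shares.
missing_pct <- colMeans(is.na(example_df)) * 100
missing_pct
```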

3.1 Removing rows with missing data for my analysis

The bike_2022 data frame is cleaned by removing rows with missing values in the end_lat column. The end_lng column has the same number of missing values, and the check below confirms that they occur in the same rows, so filtering on end_lat also removes every missing end_lng value.

# Keep only the rows where end_lat is not missing.
cleaned_data <- bike_2022[!is.na(bike_2022$end_lat),]

The column totals in section 3 show six columns with missing values: start_station_name, start_station_id, end_station_name, end_station_id, end_lat, and end_lng. Only the rows with missing end coordinates are removed here; the station columns are dealt with in section 3.2.

# Recount the missing values per column after removing those rows.
colSums(is.na(cleaned_data))

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##             816804             816804             869043             869043 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0                  0                  0 
##      member_casual 
##                  0

nrow(cleaned_data)

## [1] 5661859

3.2 Removing columns

Next, I remove the following columns, which are not needed for my analysis.

# Remove column start_station_name, start_station_id, end_station_name, end_station_id.
cleaned_data <- cleaned_data %>% 
  select(-c(start_station_name, start_station_id, end_station_name, end_station_id))

# Recheck the column names of the dataset.
colnames(cleaned_data)

## [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
## [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
## [9] "member_casual"

# Recheck whether any missing values remain.
colSums(is.na(cleaned_data))

##       ride_id rideable_type    started_at      ended_at     start_lat 
##             0             0             0             0             0 
##     start_lng       end_lat       end_lng member_casual 
##             0             0             0             0

With no missing values left in any column, the data is ready for the next step.

4. Preparing data for analysis

A crucial step in the data analysis process is preparing the data, which means converting raw data into a format that can be used for statistical analysis. In this project, that involves inspecting the cleaned table and creating new columns.

4.1 Inspect the new table that has been created

This step checks the structure and characteristics of the data frame: the number of rows and its dimensions, a preview of the first few rows, the column data types, and a statistical summary of the data.

# How many rows are in data frame.
nrow(cleaned_data)

## [1] 5661859

# Dimensions of the data frame.
dim(cleaned_data)

## [1] 5661859       9

# See the first 6 rows of data frame.
head(cleaned_data)

##            ride_id rideable_type          started_at            ended_at
## 1 C2F7DD78E82EC875 electric_bike 2022-01-13 03:59:47 2022-01-13 04:02:44
## 2 A6CF8980A652D272 electric_bike 2022-01-10 00:41:56 2022-01-10 00:46:17
## 3 BD0F91DFF741C66D  classic_bike 2022-01-24 20:53:40 2022-01-24 20:58:01
## 4 CBB80ED419105406  classic_bike 2022-01-03 16:18:04 2022-01-03 16:33:00
## 5 DDC963BFDDA51EEA  classic_bike 2022-01-19 17:31:10 2022-01-19 17:37:12
## 6 A39C6F6CC0586C0B  classic_bike 2022-01-11 10:48:09 2022-01-11 10:51:31
##   start_lat start_lng  end_lat   end_lng member_casual
## 1  42.01280 -87.66591 42.01256 -87.67437        casual
## 2  42.01276 -87.66597 42.01256 -87.67437        casual
## 3  41.92560 -87.65371 41.92533 -87.66580        member
## 4  41.98359 -87.66915 41.96151 -87.67139        casual
## 5  41.87785 -87.62408 41.88462 -87.62783        member
## 6  41.89563 -87.67207 41.90312 -87.67394        member

# See list of columns and data types.
str(cleaned_data)

## 'data.frame':    5661859 obs. of  9 variables:
##  $ ride_id      : chr  "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type: chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at   : POSIXct, format: "2022-01-13 03:59:47" "2022-01-10 00:41:56" ...
##  $ ended_at     : POSIXct, format: "2022-01-13 04:02:44" "2022-01-10 00:46:17" ...
##  $ start_lat    : num  42 42 41.9 42 41.9 ...
##  $ start_lng    : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat      : num  42 42 41.9 42 41.9 ...
##  $ end_lng      : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual: chr  "casual" "casual" "member" "casual" ...

# Statistical summary of data.
summary(cleaned_data)

##    ride_id          rideable_type        started_at                   
##  Length:5661859     Length:5661859     Min.   :2021-12-31 16:00:05.0  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:17:08.5  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:11:15.0  
##                                        Mean   :2022-07-20 07:18:40.2  
##                                        3rd Qu.:2022-09-16 07:38:28.0  
##                                        Max.   :2022-12-31 23:59:26.0  
##     ended_at                        start_lat       start_lng     
##  Min.   :2021-12-31 16:01:48.00   Min.   :41.64   Min.   :-87.84  
##  1st Qu.:2022-05-28 19:37:21.00   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :2022-07-22 15:30:24.00   Median :41.90   Median :-87.64  
##  Mean   :2022-07-20 07:34:59.95   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:2022-09-16 07:51:35.00   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :2023-01-01 18:09:37.00   Max.   :45.64   Max.   :-73.80  
##     end_lat         end_lng       member_casual     
##  Min.   : 0.00   Min.   :-88.14   Length:5661859    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.37   Max.   :  0.00
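
The summary above reports a minimum `end_lat` of 0.00 and a maximum `end_lng` of 0.00, which is far from Chicago, so some rides appear to carry placeholder end coordinates. Such rows would inflate any distance computed from the coordinates. The following is a minimal sketch for counting and filtering them, under the assumption that a coordinate of exactly 0 is never valid for this dataset; `example_rides` is hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data: the second ride has placeholder end coordinates.
example_rides <- tibble(
  ride_id = c("a", "b", "c"),
  end_lat = c(41.90, 0, 41.88),
  end_lng = c(-87.64, 0, -87.63)
)

# Count the suspicious rows first, so the size of the problem is known.
zero_coords <- example_rides %>% filter(end_lat == 0 | end_lng == 0)
nrow(zero_coords)

# Keep only rides whose end coordinates are both non-zero.
valid_rides <- example_rides %>% filter(end_lat != 0, end_lng != 0)
```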

4.2 Create new columns for analysis

Creating new columns for analysis is a crucial step in the data analysis process that involves deriving additional variables from existing data to gain insights and make more informed decisions. This process involves performing calculations or transformations on existing variables.

Creating new columns for the hour of the day, day of the week, months, or seasons can provide valuable insights into patterns and trends within the data. These new variables can be derived from existing datetime data, and can help identify peak usage times, seasonality effects, or other patterns that may impact analysis and decision-making.

# Create new columns.
cleaned_data$hour_of_day <- hour(cleaned_data$started_at)
cleaned_data$day_of_week <- wday(cleaned_data$started_at, label = TRUE)
cleaned_data$month <- month(cleaned_data$started_at)
cleaned_data$season <- ifelse(month(cleaned_data$started_at) %in% c(3, 4, 5), "Spring",
                              ifelse(month(cleaned_data$started_at) %in% c(6, 7, 8), "Summer",
                                     ifelse(month(cleaned_data$started_at) %in% c(9, 10, 11),"Fall", "Winter")))
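
The nested `ifelse()` calls can be hard to read. An equivalent mapping, sketched here with `dplyr::case_when()`, lists one season per line. The function name `month_to_season` is hypothetical; the month ranges are the ones used above.

```r
library(dplyr)

# Hypothetical helper: map month numbers (1-12) to the seasons used above.
month_to_season <- function(m) {
  case_when(
    m %in% 3:5  ~ "Spring",
    m %in% 6:8  ~ "Summer",
    m %in% 9:11 ~ "Fall",
    TRUE        ~ "Winter"  # December, January, February
  )
}

# Example call on the month column created above:
# cleaned_data$season <- month_to_season(cleaned_data$month)
```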

4.3 Calculate and create new column for ride duration

Calculating the duration of each ride provides valuable information on usage patterns. It is obtained by subtracting started_at from ended_at for each ride in the dataset, which gives a new numeric column with the duration of each ride in minutes.

# Calculate ride duration in minutes.
cleaned_data$ride_duration_minutes <- as.numeric(difftime(cleaned_data$ended_at, cleaned_data$started_at, units = "mins"))
# Then round to 2 decimal places.
cleaned_data$ride_duration_minutes <- round(cleaned_data$ride_duration_minutes, 2)

Next, I check whether the cleaned data contains any negative ride durations and display the corresponding start and end times. This is an important step for data accuracy and consistency, because a negative ride duration most likely indicates an error in data collection or processing.

# Check whether negative durations appear.
head(select(cleaned_data[cleaned_data$ride_duration_minutes < 0, ], c(started_at, ended_at, ride_duration_minutes)))

##                  started_at            ended_at ride_duration_minutes
## 403592  2022-03-05 11:00:57 2022-03-05 10:55:01                 -5.93
## 406769  2022-03-05 11:38:04 2022-03-05 11:37:57                 -0.12
## 1467056 2022-05-30 11:06:29 2022-05-30 11:06:17                 -0.20
## 1635638 2022-06-07 19:15:39 2022-06-07 17:05:37               -130.03
## 1636061 2022-06-07 19:14:46 2022-06-07 17:07:45               -127.02
## 1788334 2022-06-23 19:22:57 2022-06-23 19:21:46                 -1.18

# Count the negative values in ride_duration_minutes.
nrow(cleaned_data[cleaned_data$ride_duration_minutes < 0, ])

## [1] 100

# Calculate the percentage of negative values in ride_duration_minutes.
(nrow(cleaned_data[cleaned_data$ride_duration_minutes < 0, ])/nrow(cleaned_data))*100

## [1] 0.001766204

# Remove rows where ride_duration_minutes is less than 0, i.e. rides with a negative duration.
cleaned_data <- cleaned_data[cleaned_data$ride_duration_minutes >= 0, ]

# Print summary statistics of ride_duration_minutes.
summary(cleaned_data$ride_duration_minutes)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     5.82    10.27    16.33    18.42 34354.07

4.4 Calculate and create new column for ride length

Calculating the length of each ride provides further information on usage patterns. It is computed from the start and end coordinates (“start_lat”, “start_lng”, “end_lat”, and “end_lng”) with the geosphere package, which gives a new numeric column with the length of each ride. Note that this is the straight-line (great-circle) distance between the start and end points, not the distance actually cycled.

# Create the ride_length_km column: distHaversine() returns meters, so divide by 1000 to convert to kilometers.
cleaned_data$ride_length_km <- geosphere::distHaversine(
  p1 = cbind(cleaned_data$start_lng, cleaned_data$start_lat),
  p2 = cbind(cleaned_data$end_lng, cleaned_data$end_lat)
) / 1000

# Then round to 2 decimal places.
cleaned_data$ride_length_km <- round(cleaned_data$ride_length_km, 2)
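
To make the distance calculation above more concrete, the haversine formula can be written out in base R. This is an illustrative sketch, not the geosphere implementation. The function name `haversine_km` is hypothetical, and the default radius of 6378137 m is assumed to match the default that `distHaversine()` uses.

```r
# Hypothetical helper: great-circle distance in kilometers between two
# points given as longitude/latitude in degrees.
haversine_km <- function(lng1, lat1, lng2, lat2, radius_m = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  # pmin() guards against rounding errors that push sqrt(a) above 1.
  2 * radius_m * asin(pmin(1, sqrt(a))) / 1000
}

# Example call with the coordinates of the first ride shown in section 4.1:
haversine_km(-87.66591, 42.01280, -87.67437, 42.01256)
```

A ride that starts and ends at the same point has a length of 0 km under this measure, however long it lasted.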

5. Conduct Descriptive Analysis

The descriptive analysis explores the data and summarizes its key characteristics, such as measures of central tendency, variability, and distribution. This type of analysis gives insight into the patterns and trends in the data and helps us understand its overall shape and structure.

5.1.1 General comparison between members and casual riders

This section compares the overall usage of members and casual riders. The aim is to see how the two types of riders differ in their use of the bike-sharing service.

# Create a data frame with the number of rides by members and casual riders.
member_summary <- cleaned_data %>%
  group_by(member_casual) %>%
  summarise(count = n())

# Graph for comparing number between member and casual.
ggplot(member_summary, aes(x = member_casual, y = count)) +
  geom_bar(aes(fill = member_casual), stat = "identity", position = "dodge") +
  geom_text(aes(label=count), stat="identity", 
            position=position_dodge(width=0.9), vjust=-0.5) +
  labs(title = "Count of Member and Casual", x = "Member Type", y = "Count")

The graph shows the number of rides for each rider type. Members took 3,344,951 rides (59.08%), more than the 2,316,808 rides (40.92%) taken by casual riders.
In other words, members account for more rides in total than casual riders do.
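
The percentages quoted above are not produced by the code in this section. One way to derive them is sketched below, using the two ride counts reported above; `member_counts` is a hypothetical stand-in for `member_summary`.

```r
library(dplyr)

# Ride counts as reported above (stand-in for member_summary).
member_counts <- tibble(
  member_casual = c("casual", "member"),
  count = c(2316808, 3344951)
)

# Add each rider type's share of all rides, in percent.
member_share <- member_counts %>%
  mutate(pct = round(count / sum(count) * 100, 2))
member_share
```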

5.1.2 Comparing average ride duration and ride length between rider types

This section compares members and casual riders on average ride duration and average ride length. The mean of these two variables shows how riding patterns differ between the two groups, which is useful for understanding ride behaviour and developing targeted marketing strategies.

# Create a data frame with the average ride duration and ride length for members and casual riders.
ride_summary <- cleaned_data %>%
  group_by(member_casual) %>%
  summarize(avg_duration = mean(ride_duration_minutes),
            avg_length = mean(ride_length_km))

# Graph for the ride duration(minutes) using ride summary between member and casual.
ggplot(ride_summary, aes(x = member_casual, y = avg_duration, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Member Type", y = "Average Ride Duration (minutes)", fill = "Member/Casual") +
  ggtitle("Average Ride Duration Between Member and Casual")

The chart above shows that casual riders have a higher average ride duration, 21.99 minutes, than members, 12.41 minutes.
Casual riders therefore tend to take longer rides than members.
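
The overall summary in section 4.3 shows a mean duration (16.33 minutes) well above the median (10.27 minutes), which suggests that a small number of very long rides pull the mean upward. Reporting the median next to the mean can show whether the gap between rider types holds for typical rides as well. The following is a minimal sketch on hypothetical toy data (`example_rides`); the values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: one very long casual ride skews the casual mean.
example_rides <- tibble(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  ride_duration_minutes = c(10, 20, 300, 8, 10, 12)
)

# Mean and median duration per rider type.
duration_summary <- example_rides %>%
  group_by(member_casual) %>%
  summarise(avg_duration = mean(ride_duration_minutes),
            median_duration = median(ride_duration_minutes),
            .groups = "drop")
duration_summary
```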

# Graph for the ride length(km) using ride summary between member and casual.
ggplot(ride_summary, aes(x = member_casual, y = avg_length, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Member Type", y = "Average Length (km)", fill = "Member/Casual") +
  ggtitle("Average Length Between Member and Casual")

The graph above shows only a small difference in average ride length: casual riders average 2.18 km and members 2.11 km.
On average, casual riders therefore ride slightly farther than members.

5.1.3 Popular rideable type between member and casual

This section aims to identify the most popular rideable type for both member and casual. By comparing the number of rides taken for each rideable type, we can gain insights into the preferences of each user group.

# Create a dataframe to compare which rideable type has most ride by both member and casual.
rideable_counts <- cleaned_data %>%
  group_by(rideable_type, member_casual) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.

knitr::kable(rideable_counts)

rideable_type	member_casual	count
classic_bike	casual	888890
classic_bike	member	1709054
docked_bike	casual	174858
electric_bike	casual	1253060
electric_bike	member	1635897

# Graph which rideable type has most ride by both members and casual.
ggplot(rideable_counts, aes(x = rideable_type, y = count)) + 
  geom_bar(aes(fill = member_casual), stat = "identity", position = "dodge") +
  labs(x = "Rideable Type", y = "Count", fill = "Member/Casual") +
  ggtitle("Popular Rideable Types Between Member and Casual")

The graph above shows the usage of each rideable type by rider type. The data contains three rideable types: classic bikes, docked bikes, and electric bikes.
The classic bike is the rideable type members use most, with 1,709,054 rides, while casual riders used classic bikes 888,890 times. The docked bike is used exclusively by casual riders, with 174,858 rides. The electric bike is also popular with members, with 1,635,897 rides, compared with 1,253,060 rides by casual riders.
Members therefore use classic and electric bikes more often than casual riders do, while only casual riders use all three rideable types.
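
Because members take more rides overall, raw counts make it hard to see which rideable type each group prefers. A complementary view, sketched below, is the share of each rider type’s own rides that falls on each rideable type. The counts are copied from the table above; `type_counts` is a hypothetical stand-in for `rideable_counts`.

```r
library(dplyr)

# Counts copied from the table above (stand-in for rideable_counts).
type_counts <- tibble(
  rideable_type = c("classic_bike", "classic_bike", "docked_bike",
                    "electric_bike", "electric_bike"),
  member_casual = c("casual", "member", "casual", "casual", "member"),
  count = c(888890, 1709054, 174858, 1253060, 1635897)
)

# Share of each rider type's rides by rideable type, in percent.
type_share <- type_counts %>%
  group_by(member_casual) %>%
  mutate(pct = count / sum(count) * 100) %>%
  ungroup()
type_share
```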

5.1.4 Average ride duration by rideable type between members and casual riders

This section identifies the rideable type with the highest average duration for members and casual riders. Comparing the average ride duration for each rideable type shows how each rider type uses Cyclistic bikes differently.

# Create a data frame with the average duration and length by rider type and rideable type.
avg_duration_by_type <- cleaned_data %>%
  group_by(member_casual, rideable_type) %>%
  summarize(avg_duration = mean(ride_duration_minutes),
            avg_length = mean(ride_length_km))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

knitr::kable(avg_duration_by_type)

member_casual	rideable_type	avg_duration	avg_length
casual	classic_bike	24.53221	2.087424
casual	docked_bike	50.70781	2.175997
casual	electric_bike	16.17394	2.253113
member	classic_bike	13.32055	1.965717
member	electric_bike	11.46272	2.266822

# Graph to see the result for ride duration by rideable type between member and casual.
ggplot(avg_duration_by_type, aes(x = rideable_type, y = avg_duration, 
                                 fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Rideable Type", y = "Average Duration (Minutes)",
       title = "Average Ride Duration by Rideable Type Between Member and Casual")

The graph above shows the average ride duration in minutes for the three rideable types.
On classic bikes, casual riders average 24.53 minutes per ride, longer than the 13.32 minutes for members. The same holds for electric bikes, where casual riders average 16.17 minutes and members 11.46 minutes.
The docked bike has the highest average duration of the three rideable types, 50.71 minutes, which may indicate that this type is preferred by casual riders looking for longer rides.

5.2.1 Seasonal comparison between members and casual riders

A seasonal analysis of bike-sharing patterns shows how members and casual riders differ in their bike usage habits throughout the year.

# Create a data frame with the average ride length and ride duration by season, rideable type, and rider type.
rides_by_season <- cleaned_data %>%
  group_by(season, rideable_type, member_casual) %>%
  summarize(avg_ride_length = mean(ride_length_km),
            avg_ride_duration = mean(ride_duration_minutes))

knitr::kable(rides_by_season)

season	rideable_type	member_casual	avg_ride_length	avg_ride_duration
Fall	classic_bike	casual	2.051708	22.40831
Fall	classic_bike	member	1.989286	12.77882
Fall	docked_bike	casual	2.179392	44.47659
Fall	electric_bike	casual	2.142633	14.04954
Fall	electric_bike	member	2.210474	10.99363
Spring	classic_bike	casual	2.109167	27.08755
Spring	classic_bike	member	1.909699	13.07919
Spring	docked_bike	casual	2.120987	53.03799
Spring	electric_bike	casual	2.367752	17.84304
Spring	electric_bike	member	2.194932	11.26554
Summer	classic_bike	casual	2.121717	24.52697
Summer	classic_bike	member	2.068572	14.25216
Summer	docked_bike	casual	2.214547	50.76781
Summer	electric_bike	casual	2.304302	17.23570
Summer	electric_bike	member	2.450307	12.37411
Winter	classic_bike	casual	1.618039	20.67457
Winter	classic_bike	member	1.613089	11.62504
Winter	docked_bike	casual	1.954290	79.95522
Winter	electric_bike	casual	1.927572	11.48674
Winter	electric_bike	member	1.954729	10.16456

# Plot the average ride duration by rideable type between member & casual for each season.
ggplot(rides_by_season, aes(x = rideable_type, y = avg_ride_duration, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Rideable Type", y = "Average Duration (Minutes)", 
  title = "Comparing Average Ride Durations on Rideable Type by Season Between Member Type") +
  facet_wrap(~season) +
  theme(plot.title = element_text(size = 11))

The seasonal analysis reveals notable differences in average ride duration between casual riders and members. In Winter, the average duration of classic bike rides was 20.67 minutes for casual riders and 11.63 minutes for members. Similarly, on electric bikes casual riders averaged 11.49 minutes, longer than the 10.16 minutes for members. In Spring, casual riders had their longest average classic bike duration, 27.09 minutes, compared with 13.08 minutes for members.
In Fall, casual riders again rode longer on classic bikes, with 22.41 minutes against 12.78 minutes for members, and on electric bikes, with 14.05 minutes against 10.99 minutes. In Summer, casual riders averaged 24.53 minutes on classic bikes against 14.25 minutes for members, and 17.24 minutes on electric bikes against 12.37 minutes.
Interestingly, only casual riders rented docked bikes, and they did so in every season. Their average docked bike duration was 79.96 minutes in Winter and 53.04 minutes, 44.48 minutes, and 50.77 minutes in Spring, Fall, and Summer, respectively.

# Plot the average ride length by rideable type between member & casual for season.
ggplot(rides_by_season, aes(x = rideable_type, y = avg_ride_length, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Rideable Type", y = "Average Ride Length (km)", 
       title = "Comparing Average Ride Length by Season, Rideable Type, and Membership Type") +
  facet_wrap(~season) +
  theme(plot.title = element_text(size = 11))

The seasonal comparison of average ride length shows much smaller differences between casual riders and members. In Winter, the average length of classic bike rides was 1.62 km for casual riders and 1.61 km for members. On electric bikes, casual riders averaged 1.93 km, slightly shorter than the 1.95 km for members. In Spring, casual riders had their longest average ride length, 2.37 km on electric bikes, compared with 2.19 km for members.
In Fall, casual riders rode slightly farther on classic bikes, with 2.05 km against 1.99 km for members. On electric bikes, casual riders rode shorter distances, with 2.14 km against 2.21 km for members.

In Summer, casual riders still had the longer average length on classic bikes, with 2.12 km against 2.07 km for members. On electric bikes, however, casual riders rode shorter distances, with 2.30 km against 2.45 km for members.

As with duration, only casual riders used docked bikes, in every season. Their average docked bike ride length was 1.95 km in Winter and 2.12 km, 2.18 km, and 2.21 km in Spring, Fall, and Summer, respectively.
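
Because `season` is stored as text, the facets in the two seasonal plots appear in alphabetical order (Fall, Spring, Summer, Winter). Converting the column to a factor with explicit levels puts the panels in calendar order. This is a minimal sketch; the level order in `season_levels` is an assumption about the preferred layout, and `example_seasons` is hypothetical toy data.

```r
# Assumed calendar order for the facet panels.
season_levels <- c("Winter", "Spring", "Summer", "Fall")

# Hypothetical toy vector in the alphabetical order the plots use by default.
example_seasons <- c("Fall", "Spring", "Summer", "Winter")

# A factor with explicit levels keeps the chosen order when plotting.
season_factor <- factor(example_seasons, levels = season_levels)
levels(season_factor)

# Example call before plotting:
# rides_by_season$season <- factor(rides_by_season$season, levels = season_levels)
```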

5.2.1 Monthly analysis comparing between member and casual

The monthly analysis of bike-sharing patterns compares usage between members and casual riders over different months. This analysis provides insights into how frequently each rider type uses the bike-sharing service and how usage patterns differ from month to month. These insights can help bike-sharing companies better understand their rider types and optimize their services to meet their needs.

# Create dataframe counting number which month has most ride between members & casual.
counts_by_month <- cleaned_data %>%
  group_by(month, member_casual) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

knitr::kable(counts_by_month)

month	member_casual	count
1	casual	17879
1	member	84728
2	casual	21361
2	member	94171
3	casual	89642
3	member	194132
4	casual	126121
4	member	244811
5	casual	279764
5	member	354371
6	casual	368082
6	member	400055
7	casual	405179
7	member	417346
8	casual	358161
8	member	426913
9	casual	296074
9	member	404544
10	casual	208611
10	member	349595
11	casual	100559
11	member	236905
12	casual	45375
12	member	137380

# Plot the number which month has most ride between members & casual.
ggplot(counts_by_month, aes(x = month, y = count, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Number of Rides by Month Between Member and Casual",
       x = "Month", y = "Number of Rides",
       fill = "Member Type") +
  theme(legend.position = "top", axis.text.y = element_text(size = 8))

The graph above shows the number of bike rides by month for members and casual riders. August had the highest number of rides for members, with 426,913 rides, while July had the highest number for casual riders, with 405,179 rides.
Both groups ride more from March onward, at the start of Spring. Casual rides climb until July and then fall every month from August (the end of Summer) to December (early Winter). Member rides keep climbing until August, so their decline begins a month later, running from September (early Fall) to December.
The decline is also steeper for casual riders: their December count (45,375) is about 11% of their July peak, while the member December count (137,380) is about 32% of the August peak.
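
The peak month for each rider type can also be read off programmatically instead of by eye. The following is a minimal sketch, assuming the counts are retyped from the table above into a small tibble (in the actual analysis, `counts_by_month` already exists); `peak_months` is a name introduced here for illustration.

```r
library(dplyr)

# Monthly ride counts copied from the table above (months 1-12,
# casual first and member second within each month).
counts_by_month <- tibble(
  month = rep(1:12, each = 2),
  member_casual = rep(c("casual", "member"), times = 12),
  count = c(17879, 84728, 21361, 94171, 89642, 194132, 126121, 244811,
            279764, 354371, 368082, 400055, 405179, 417346, 358161, 426913,
            296074, 404544, 208611, 349595, 100559, 236905, 45375, 137380)
)

# For each rider type, keep the single month with the most rides.
peak_months <- counts_by_month %>%
  group_by(member_casual) %>%
  slice_max(count, n = 1) %>%
  ungroup()

print(peak_months)
```

The result has one row per rider type, which makes it easy to cross-check the months quoted in the text against the table.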

5.2.2 Monthly analysis comparing in average ride duration between member and casual

This monthly analysis compares the average ride duration between member and casual. It examines the differences in ride duration patterns for each user type across different months, shedding light on potential trends and seasonal variations in ride behavior.

# Create dataframe for both average ride duration and ride length between members & casual for monthly.
avg_duration_by_month <- cleaned_data %>% 
  group_by(month, member_casual) %>% 
  summarize(avg_ride_duration = mean(ride_duration_minutes), 
            avg_ride_length = mean(ride_length_km))

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

knitr::kable(avg_duration_by_month)

month	member_casual	avg_ride_duration	avg_ride_length
1	casual	22.28512	1.916228
1	member	11.61177	1.745195
2	casual	22.09011	1.985074
2	member	11.05778	1.778522
3	casual	25.73872	2.186419
3	member	11.74342	1.981899
4	casual	23.35343	2.218713
4	member	11.36473	1.948877
5	casual	25.54210	2.259579
5	member	13.06465	2.146724
6	casual	23.37789	2.245559
6	member	13.65384	2.250941
7	casual	23.29289	2.221379
7	member	13.43377	2.260524
8	casual	21.41921	2.205590
8	member	13.08152	2.224552
9	casual	19.97333	2.175584
9	member	12.63778	2.153869
10	casual	18.44234	2.051708
10	member	11.54556	2.004560
11	casual	15.47479	2.067671
11	member	10.86853	2.177716
12	casual	14.10573	1.719102
12	member	10.36091	1.806797

# Plot the average ride duration between member & casual by month.
ggplot(avg_duration_by_month, aes(x = month, y = avg_ride_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Average Ride Duration by Month Between Member Type",
       x = "Month", y = "Average Ride Duration (Minutes)",
       fill = "Member Type") +
  theme(legend.position = "top", axis.text.y = element_text(size = 8))

The graph depicts the average ride duration per month for both members and casual riders. Casual riders had their highest average ride duration in March at 25.74 minutes and stayed above 23 minutes through July, followed by a gradual decline to 14.11 minutes in December. Members, on the other hand, had their highest average ride duration in June at 13.65 minutes, followed by a gradual decline to 10.36 minutes in December.
Casual riders had a longer average ride duration than members in every month of the year, from about 1.4 times as long in December to about 2.2 times as long in March, which could suggest that casual riders are comfortable with the prices charged for rides. However, further analysis is necessary to confirm this hypothesis.

# Plot the average ride length between member & casual by month.
ggplot(avg_duration_by_month, aes(x = month, y = avg_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Average Ride Length by Month and Rider Type",
       x = "Month", y = "Average Ride Length (km)",
       fill = "Member Type") +
  theme(legend.position = "top", axis.text.y = element_text(size = 8))

The graph displays the average ride length per month for both members and casual riders. The average ride length differs only slightly between the two groups in each month. The highest average ride length for casual riders was recorded in May at 2.2596 km, while for members it was recorded in July at 2.2605 km. The lowest average ride length was recorded in December at 1.72 km for casual riders, and in January at 1.75 km for members.
Interestingly, members' average ride length is close to that of casual riders even though their average ride duration is much shorter. This may suggest that members ride more directly from start to end, perhaps because they use the bikes more frequently and know their routes. However, further analysis is required to confirm this observation.
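
To make the comparison between duration and length concrete, the monthly averages can be combined into a duration ratio and an implied average speed. This is a rough sketch under two assumptions: the values are retyped from the table above, and `ride_length_km` is taken to be the distance between start and end points, so the implied speed understates the true riding speed on any trip that is not a straight line. The column names below (`casual_min`, `member_kmh`, and so on) are introduced here for illustration only.

```r
# Monthly averages copied from the table above (months 1-12).
monthly <- data.frame(
  month = 1:12,
  casual_min = c(22.28512, 22.09011, 25.73872, 23.35343, 25.54210, 23.37789,
                 23.29289, 21.41921, 19.97333, 18.44234, 15.47479, 14.10573),
  member_min = c(11.61177, 11.05778, 11.74342, 11.36473, 13.06465, 13.65384,
                 13.43377, 13.08152, 12.63778, 11.54556, 10.86853, 10.36091),
  casual_km = c(1.916228, 1.985074, 2.186419, 2.218713, 2.259579, 2.245559,
                2.221379, 2.205590, 2.175584, 2.051708, 2.067671, 1.719102),
  member_km = c(1.745195, 1.778522, 1.981899, 1.948877, 2.146724, 2.250941,
                2.260524, 2.224552, 2.153869, 2.004560, 2.177716, 1.806797)
)

# How many times longer the average casual ride lasts than the average member ride.
monthly$duration_ratio <- monthly$casual_min / monthly$member_min

# Implied average speed in km/h: distance divided by time in hours.
# This divides one average by another, so it is only a rough indicator.
monthly$casual_kmh <- monthly$casual_km / (monthly$casual_min / 60)
monthly$member_kmh <- monthly$member_km / (monthly$member_min / 60)

print(round(monthly[, c("month", "duration_ratio", "casual_kmh", "member_kmh")], 2))
```

Because the ratio of two averages is not the same as the average of per-ride ratios, a ride-level calculation on `cleaned_data` would be the more reliable check.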

5.3.1 Day of week analysis comparing between member and casual

This analysis compares the number of rides taken by members and casual riders on each day of the week. It helps to identify any patterns or differences in the way these rider types use the bike-sharing service on different days of the week.

# Create dataframe to compare of usage patterns by day of week.
day_of_week_summary <- cleaned_data %>%
  group_by(day_of_week, member_casual) %>%
  summarise(count = n())

## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.

knitr::kable(day_of_week_summary)

day_of_week	member_casual	count
Sun	casual	387902
Sun	member	388064
Mon	casual	277045
Mon	member	473468
Tue	casual	263268
Tue	member	518560
Wed	casual	273793
Wed	member	523689
Thu	casual	308686
Thu	member	531835
Fri	casual	334553
Fri	member	466723
Sat	casual	471561
Sat	member	442612

# Plot the result dataframe to compare the usage between member & casual by day of the week.
ggplot(day_of_week_summary, aes(x = day_of_week, y = count, fill = member_casual)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = count), position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 2) +
  labs(x = "Day of Week", y = "Count", fill = "Member/Casual") +
  ggtitle("Usage Patterns by Day of Week Between Member and Casual")

The graph above displays the usage patterns of Cyclistic bikes by day of the week for members and casual riders. Members have the higher usage on every day except Saturday, when casual riders outnumber them (471,561 rides for casual riders compared to 442,612 for members). On Sunday the two groups are almost level (388,064 member rides against 387,902 casual rides).
Although casual riders have the lower usage overall, their rides increase steadily from Tuesday (263,268) to Saturday. In contrast, member usage peaks midweek, on Thursday (531,835), and is lowest on Sunday.
Members therefore take more rides than casual riders on six of the seven days, with the largest gaps on weekdays. This could suggest that the price offered to members is more attractive and cost-effective for regular bike users.
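
One compact way to summarize the weekday and weekend contrast is the share of each group's rides that fall on Saturday or Sunday. The sketch below assumes the counts are retyped from the table above; `weekend_share` and `is_weekend` are names introduced here for illustration.

```r
library(dplyr)

# Ride counts by day of week copied from the table above
# (casual first and member second within each day).
day_of_week_summary <- tibble(
  day_of_week = rep(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), each = 2),
  member_casual = rep(c("casual", "member"), times = 7),
  count = c(387902, 388064, 277045, 473468, 263268, 518560, 273793, 523689,
            308686, 531835, 334553, 466723, 471561, 442612)
)

# Share of each rider type's rides taken on Saturday or Sunday.
weekend_share <- day_of_week_summary %>%
  mutate(is_weekend = day_of_week %in% c("Sat", "Sun")) %>%
  group_by(member_casual) %>%
  summarise(total_rides = sum(count),
            weekend_rides = sum(count[is_weekend]),
            weekend_share = weekend_rides / total_rides,
            .groups = "drop")

print(weekend_share)
```

With these counts, about 37% of casual rides fall on the weekend, against about 25% of member rides.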

5.3.2 Day of week analysis comparing average duration and average length between member and casual

The analysis compares the average ride duration and length for member and casual across different days of the week. It aims to provide insights into whether there are any notable differences in the way these two groups use the bike-sharing service on different days.

# Create dataframe of average duration and average length to compare of usage patterns by day of week.
duration_summary <- cleaned_data %>%
  group_by(day_of_week, member_casual) %>%
  summarize(avg_duration = mean(ride_duration_minutes),
            avg_length = mean(ride_length_km))

## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.

knitr::kable(duration_summary)

day_of_week	member_casual	avg_duration	avg_length
Sun	casual	25.09057	2.239146
Sun	member	13.68545	2.145585
Mon	casual	22.62697	2.096668
Mon	member	11.99860	2.042182
Tue	casual	19.63264	2.092925
Tue	member	11.81419	2.065093
Wed	casual	19.00205	2.172441
Wed	member	11.84072	2.196016
Thu	casual	19.67199	2.132912
Thu	member	12.00163	2.093116
Fri	casual	20.61548	2.155729
Fri	member	12.22869	2.061212
Sat	casual	24.59478	2.299640
Sat	member	13.80004	2.196418

# Plot the dataframe of average duration to compare of usage patterns by day of week.
ggplot(duration_summary, aes(x = day_of_week, y = avg_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = round(avg_duration, 2)), position = position_dodge(width = 0.9),
            vjust = -0.5, size = 2) +
  labs(x = "Day of Week", y = "Average Duration (Minutes)", fill = "Member/Casual") +
  ggtitle("Average Ride Duration by Day of Week Between Member and Casual") +
  theme(legend.position = "bottom")

The graph above shows the average ride duration by day of week for both members and casual riders. Casual riders have the higher average ride duration on every day of the week.
Although members have the lower average ride duration throughout the week, it remains relatively stable, ranging from 11.81 minutes on Tuesday to 13.80 minutes on Saturday.
Additionally, for casual riders the average ride duration is at its lowest on Wednesday (19.00 minutes) and gradually increases until Sunday (25.09 minutes).

# Plot the dataframe of average length to compare of usage patterns by day of week.
ggplot(duration_summary, aes(x = day_of_week, y = avg_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = round(avg_length, 2)), position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 2) +
  labs(x = "Day of Week", y = "Average Ride Length (km)", fill = "Member/Casual") +
  ggtitle("Average Ride Length by Day of Week and Member/Casual Riders") +
  theme(legend.position = "bottom")

The above graph illustrates the average ride length by day of the week for both members and casual riders. Notably, the ride lengths of the two rider types are very similar on each day of the week.
Casual riders have marginally longer average ride lengths than members on every day except Wednesday. The longest average ride length for both groups is recorded on Saturday, at 2.30 km for casual riders and 2.20 km for members; for members, Wednesday is practically tied with Saturday at 2.20 km.

5.4.1 Hour of day comparing between member and casual

The analysis compares the number of rides taken by members and casual riders by hour of day. The aim is to provide insights into potential differences in how these two groups use the bike-sharing service at different hours.

# Create dataframe to compare of usage patterns by hours of day.
ride_counts <- cleaned_data %>%
  group_by(hour_of_day, member_casual) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'hour_of_day'. You can override using the
## `.groups` argument.

knitr::kable(ride_counts)

hour_of_day	member_casual	count
0	casual	46483
0	member	40915
1	casual	30220
1	member	25527
2	casual	19129
2	member	16132
3	casual	11948
3	member	12307
4	casual	8719
4	member	13911
5	casual	13476
5	member	36203
6	casual	30483
6	member	93851
7	casual	52422
7	member	174145
8	casual	70630
8	member	207008
9	casual	72934
9	member	148885
10	casual	93251
10	member	138686
11	casual	121068
11	member	162476
12	casual	143330
12	member	185106
13	casual	149433
13	member	182959
14	casual	158826
14	member	181243
15	casual	176792
15	member	216107
16	casual	196184
16	member	284351
17	casual	218621
17	member	341070
18	casual	196102
18	member	278418
19	casual	150208
19	member	201957
20	casual	111028
20	member	142115
21	casual	95006
21	member	112757
22	casual	86030
22	member	88110
23	casual	64485
23	member	60712

# Plot the dataframe to compare of usage patterns by hours of day.
ggplot(ride_counts, aes(x = hour_of_day, y = count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Number of Rides by Hour of Day Between Member and Casual",
       x = "Hour of Day", y = "Number of Rides",
       fill = "Rider Type") +
  theme(legend.position = "bottom")

The graph displays the hourly number of rides for both members and casual riders. Members take more rides than casual riders at every hour from 3 a.m. to 10 p.m.; casual riders take more only between 11 p.m. and 2 a.m.
The highest number of rides in any hour is recorded around 5 p.m., with 341,070 member rides. Member rides rise from a low of 12,307 at 3 a.m. to a morning peak of 207,008 around 8 a.m., dip in mid-morning, and then climb again to the 5 p.m. peak.
Both groups record very few rides between 3 a.m. and 4 a.m. Casual rides then increase steadily through the day, without a morning peak, and also reach their maximum around 5 p.m. (218,621 rides).
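
The hours at which one group overtakes the other, and each group's busiest hour, can be checked directly from the hourly counts. This is a minimal sketch, assuming the counts are retyped from the table above as two plain vectors; the vector names are introduced here for illustration.

```r
# Hourly ride counts copied from the table above (hours 0-23).
hours <- 0:23
casual <- c(46483, 30220, 19129, 11948, 8719, 13476, 30483, 52422,
            70630, 72934, 93251, 121068, 143330, 149433, 158826, 176792,
            196184, 218621, 196102, 150208, 111028, 95006, 86030, 64485)
member <- c(40915, 25527, 16132, 12307, 13911, 36203, 93851, 174145,
            207008, 148885, 138686, 162476, 185106, 182959, 181243, 216107,
            284351, 341070, 278418, 201957, 142115, 112757, 88110, 60712)

# Hours at which casual riders take more rides than members.
casual_lead_hours <- hours[casual > member]

# Busiest hour for each rider type.
peak_casual <- hours[which.max(casual)]
peak_member <- hours[which.max(member)]

print(casual_lead_hours)
print(c(casual = peak_casual, member = peak_member))
```

With these counts, casual riders lead only at hours 23, 0, 1, and 2, and both groups peak at hour 17 (5 p.m.).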

5.5.1 Hour of day analysis comparing average duration and average length between member and casual

This hour of day analysis compares the average ride duration and average ride length of members and casual riders during different hours of the day. It can provide valuable insights into when each rider type takes its longest rides and help bike-sharing companies adjust their services accordingly.

# Create dataframe of average ride duration and average ride length to compare of usage patterns between member and casual by hours of day.
ride_duration_summary <- cleaned_data %>%
  group_by(hour_of_day, member_casual) %>%
  summarize(avg_ride_duration = mean(ride_duration_minutes),
            avg_ride_length = mean(ride_length_km))

## `summarise()` has grouped output by 'hour_of_day'. You can override using the
## `.groups` argument.

knitr::kable(ride_duration_summary)

hour_of_day	member_casual	avg_ride_duration	avg_ride_length
0	casual	19.16063	1.994147
0	member	11.96146	2.041912
1	casual	20.18658	2.022072
1	member	11.92407	1.979325
2	casual	19.88483	2.099720
2	member	11.99664	1.936918
3	casual	19.47687	2.254195
3	member	11.95756	1.887134
4	casual	17.33578	2.232890
4	member	11.69781	2.069858
5	casual	15.99421	2.107309
5	member	10.50820	2.128078
6	casual	16.11783	2.094036
6	member	10.89261	2.207568
7	casual	15.11570	2.089533
7	member	11.45753	2.274050
8	casual	16.66733	2.088277
8	member	11.30717	2.143365
9	casual	21.63870	2.152578
9	member	11.41917	2.015782
10	casual	25.48343	2.250716
10	member	12.17153	2.061350
11	casual	26.11056	2.278090
11	member	12.64284	2.017851
12	casual	25.37249	2.253513
12	member	12.18605	1.976384
13	casual	25.63903	2.243600
13	member	12.36834	2.082029
14	casual	25.34080	2.255736
14	member	12.72825	2.050572
15	casual	24.03236	2.244738
15	member	12.79964	2.150954
16	casual	22.46893	2.355238
16	member	13.02258	2.205016
17	casual	21.03642	2.252223
17	member	13.26508	2.213336
18	casual	20.71839	2.163160
18	member	13.05927	2.127925
19	casual	20.78182	2.040136
19	member	12.77520	2.037212
20	casual	20.28402	1.980405
20	member	12.58302	2.012269
21	casual	19.25802	2.034150
21	member	12.30942	2.090073
22	casual	18.78231	2.093450
22	member	12.26700	2.168989
23	casual	19.42073	2.097291
23	member	12.30790	2.155957

# Graph the dataframe of average ride duration to compare of usage patterns between member and casual by hours of day.
ggplot(ride_duration_summary, aes(x = hour_of_day, y = avg_ride_duration, 
                                  fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Duration by Hour of Day Between Member and Casual",
       x = "Hour of Day", y = "Average Duration (Minutes)", 
       fill = "Member/Casual") +
  theme(legend.position = "bottom")

The graph above shows the average ride duration (in minutes) by hour for both members and casual riders. The average ride duration for casual riders is higher than that of members at every hour.
The highest average ride duration of the day is observed around 11 a.m., at 26.11 minutes for casual riders. Casual ride duration is lowest around 7 a.m. (15.12 minutes), rises sharply through the morning, stays above 25 minutes from 10 a.m. to 2 p.m., and then declines gradually through the afternoon and evening.
A different pattern occurs for members, whose lowest average ride duration of 10.51 minutes is recorded around 5 a.m. and whose highest of 13.27 minutes is recorded around 5 p.m. Member duration therefore varies by less than 3 minutes over the day, while casual duration varies by about 11 minutes.

# Graph the dataframe of average ride length to compare of usage patterns between member and casual by hours of day.
ggplot(ride_duration_summary, aes(x = hour_of_day, y = avg_ride_length,
                                  fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Length by Hour of Day Between Member and Casual",
       x = "Hour of Day", y = "Average Ride Length (km)",
       fill = "Member/Casual") +
  theme(legend.position = "bottom")

The graph displays the average ride length (in kilometers) per hour for both members and casual riders. Casual riders have a slightly higher average ride length than members during most of the day (1 a.m. to 4 a.m. and 9 a.m. to 7 p.m.), while members ride slightly farther at midnight, in the early morning (5 a.m. to 8 a.m.), and in the evening (8 p.m. to 11 p.m.).
Each rider type shows its own pattern over the day. The highest average ride length for members is 2.27 km around 7 a.m., while the highest for casual riders is 2.36 km at 4 p.m. Meanwhile, the lowest average ride length for members is 1.89 km at 3 a.m., while the lowest for casual riders is 1.98 km at 8 p.m.

5.6 Hypothesis

A hypothesis in data analysis is a tentative explanation or prediction that proposes a possible relationship between variables. Hypotheses guide the data analysis process by providing a specific and testable explanation for a phenomenon or observation.

In this analysis, three hypotheses were formulated to explore potential differences between member and casual riders in terms of ride duration and length, as well as differences in the number of rides taken across different seasons. Two statistical modeling techniques, the t-test and ANOVA, were used to test the hypotheses and evaluate the statistical significance of the results. The t-test was used for hypotheses 1 and 2, while ANOVA was used for hypothesis 3.

1. Members and casual riders may have different ride durations by hour of day.

t.test(avg_ride_duration ~ member_casual, data = ride_duration_summary)

## 
##  Welch Two Sample t-test
## 
## data:  avg_ride_duration by member_casual
## t = 12.421, df = 25.002, p-value = 3.415e-12
## alternative hypothesis: true difference in means between group casual and group member is not equal to 0
## 95 percent confidence interval:
##  7.114905 9.943382
## sample estimates:
## mean in group casual mean in group member 
##             20.67949             12.15035

#Plot
ggplot(ride_duration_summary, aes(x = member_casual, y = avg_ride_duration)) +
  geom_boxplot() +
  labs(x = "Rider Type", y = "Ride Duration (minutes)") +
  ggtitle("Distribution of Ride Duration by Rider Type")

The output of the t-test suggests that there is a significant difference in average ride duration between members and casual riders. Note that the test was run on the 24 hourly averages for each group in ride_duration_summary, not on individual rides. The mean of the hourly averages is 20.68 minutes for casual riders and 12.15 minutes for members, a difference of 8.53 minutes, with a 95% confidence interval for the difference of 7.11 to 9.94 minutes. The t-value of 12.421 on 25.002 degrees of freedom and the p-value of 3.415e-12 provide strong evidence against the null hypothesis that the true difference in means between the two groups is 0. Therefore, we can conclude that there is a significant difference in the average ride duration between members and casual riders.
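
For readers who want to see what the Welch test computes, the statistic and its degrees of freedom can be rebuilt by hand from the hourly averages and compared with `t.test()`. This is a sketch, assuming the 24 hourly averages per group are retyped from the table in section 5.5.1; the variable names are introduced here for illustration.

```r
# Hourly average ride durations (minutes), hours 0-23, copied from the table above.
casual <- c(19.16063, 20.18658, 19.88483, 19.47687, 17.33578, 15.99421,
            16.11783, 15.11570, 16.66733, 21.63870, 25.48343, 26.11056,
            25.37249, 25.63903, 25.34080, 24.03236, 22.46893, 21.03642,
            20.71839, 20.78182, 20.28402, 19.25802, 18.78231, 19.42073)
member <- c(11.96146, 11.92407, 11.99664, 11.95756, 11.69781, 10.50820,
            10.89261, 11.45753, 11.30717, 11.41917, 12.17153, 12.64284,
            12.18605, 12.36834, 12.72825, 12.79964, 13.02258, 13.26508,
            13.05927, 12.77520, 12.58302, 12.30942, 12.26700, 12.30790)

# Squared standard error of each group mean.
se2_casual <- var(casual) / length(casual)
se2_member <- var(member) / length(member)

# Welch t statistic: difference in means over the combined standard error.
t_stat <- (mean(casual) - mean(member)) / sqrt(se2_casual + se2_member)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_welch <- (se2_casual + se2_member)^2 /
  (se2_casual^2 / (length(casual) - 1) + se2_member^2 / (length(member) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

# Built-in Welch test on the same vectors, for comparison.
builtin <- t.test(casual, member)
print(c(t = t_stat, df = df_welch, p = p_value))
print(builtin)
```

Since each group contributes only 24 hourly averages, the degrees of freedom are small; a test on individual rides in `cleaned_data` would use far more observations and would answer a slightly different question.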

2. Members and casual riders may have different ride lengths by hour of day.

t.test(avg_ride_length ~ member_casual, data = ride_duration_summary)

## 
##  Welch Two Sample t-test
## 
## data:  avg_ride_length by member_casual
## t = 2.3662, df = 45.618, p-value = 0.02227
## alternative hypothesis: true difference in means between group casual and group member is not equal to 0
## 95 percent confidence interval:
##  0.01021111 0.12673067
## sample estimates:
## mean in group casual mean in group member 
##             2.153217             2.084746

#Plot
ggplot(ride_duration_summary, aes(x = member_casual, y = avg_ride_length)) +
  geom_boxplot() +
  labs(x = "Rider Type", y = "Ride Length (km)") +
  ggtitle("Distribution of Ride Length by Rider Type")

Based on the results of the t-test, there is evidence of a statistically significant difference in mean ride length between casual riders and members. The p-value of 0.02227 is below the 0.05 threshold, meaning a difference this large would be unlikely if the two groups had the same mean. The 95% confidence interval for the difference in means is 0.010 km to 0.127 km, which excludes 0 but shows that the difference is small in practical terms (the group means are 2.15 km for casual riders and 2.08 km for members). As with hypothesis 1, the test was run on hourly averages rather than individual rides. It can be concluded that casual riders and members have different average ride lengths, with casual riders taking slightly longer rides on average.

3. The season of the year has a significant effect on the number of bike rides taken, regardless of whether the riders are members or casual users.

anova1 <- aov(avg_ride_duration ~ season + member_casual, data = rides_by_season)
summary(anova1)

##               Df Sum Sq Mean Sq F value Pr(>F)  
## season         3     87    28.9   0.093 0.9625  
## member_casual  1   1900  1900.1   6.146 0.0255 *
## Residuals     15   4637   309.1                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This table shows the results of an ANOVA that models avg_ride_duration in rides_by_season as a function of season and rider type. The table provides information on the degrees of freedom (Df), the sum of squares (Sum Sq), the mean squares (Mean Sq), the F-value, and the p-value (Pr(>F)) for each factor and the residuals.

The “season” factor is not significant, with a p-value of 0.9625, indicating no detectable difference in average ride duration across seasons in this summary table. The “member_casual” factor is significant at the 5% level, with a p-value of 0.0255, indicating that average ride duration differs between casual riders and members. The residual sum of squares is 4637, which represents the unexplained variation in the data.

Overall, because the response variable in this model is average ride duration rather than the number of rides, these results do not directly test hypothesis 3. The monthly ride counts in section 5.2.1 do show a strong seasonal pattern, with far more rides in Summer than in Winter, but a model with ride counts as the response would be needed to confirm it statistically.
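
Hypothesis 3 is stated in terms of the number of rides, so one way to examine it more directly is an ANOVA with ride counts as the response. The following is a rough sketch rather than the analysis actually run: it assumes the monthly counts from section 5.2.1 are retyped by hand, and it assumes a season mapping of December to February as Winter, March to May as Spring, June to August as Summer, and September to November as Fall. The names `counts`, `season_of`, and `anova_counts` are introduced here for illustration.

```r
# Monthly ride counts copied from the table in section 5.2.1
# (casual first and member second within each month).
counts <- data.frame(
  month = rep(1:12, each = 2),
  member_casual = rep(c("casual", "member"), times = 12),
  count = c(17879, 84728, 21361, 94171, 89642, 194132, 126121, 244811,
            279764, 354371, 368082, 400055, 405179, 417346, 358161, 426913,
            296074, 404544, 208611, 349595, 100559, 236905, 45375, 137380)
)

# ASSUMPTION: season assigned by month (index 1 = January, ..., 12 = December).
season_of <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
               "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
counts$season <- factor(season_of[counts$month],
                        levels = c("Winter", "Spring", "Summer", "Fall"))

# Two-factor ANOVA with monthly ride counts as the response.
anova_counts <- aov(count ~ season + member_casual, data = counts)
anova_table <- summary(anova_counts)[[1]]
print(anova_table)
```

Treating the three months within a season as replicates is a simplification, and monthly totals are unlikely to meet the usual ANOVA assumptions exactly, so the output is best read as a rough check on the seasonal pattern rather than a definitive test.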

6. Conclusion and Recommendations

Based on the results of the data analysis, members took 59.08% of all rides and casual riders 40.92%, and the two groups have different preferences for rideable types. Members use only classic bikes (30.19% of all rides) and electric bikes (28.9%), while casual riders use all three rideable types: classic bikes (15.7%), electric bikes (22.13%), and docked bikes (3.09%).

Additionally, the analysis found a clear difference in average ride duration between members and casual riders across all four time dimensions: season, month, day of the week, and hour of the day. Casual riders tend to take longer rides than members. For example, the highest hourly average for casual riders is 26.11 minutes, around 11 a.m., while the highest for members is 13.27 minutes, around 5 p.m. Likewise, the highest monthly average for casual riders is 25.74 minutes in March, while for members it is 13.65 minutes in June.

On the other hand, the analysis found only a slight difference in average ride length between members and casual riders across the same four time dimensions: season, month, day of the week, and hour of the day. For example, the seasonal results show that average ride length is quite similar between members and casual riders for both classic bikes and electric bikes. Members rode slightly farther than casual riders on electric bikes in Fall, Summer, and Winter, but not in Spring, while casual riders rode slightly farther on classic bikes and were the only group to use docked bikes in any season. This may suggest that classic and docked bikes are cheaper to rent than electric bikes, which may carry a special offer for members. Additionally, the data show that Summer has the most rides for both rider types: casual riders recorded 405,179 rides (7.16% of all rides) in July, and members recorded 426,913 (7.54%) in August. Finally, the average duration by hour shows that casual riders have the longer rides at every hour of the day.

Besides, statistical tests were run on potential differences between members and casual riders in ride duration and length, and on differences across seasons. The t-test for hypothesis 1 gives strong evidence against the null hypothesis that the true difference in means between the two groups is 0, which means there is a significant difference in average ride duration between members and casual riders. The t-test for hypothesis 2 suggests that casual riders and members have different average ride lengths, with casual riders taking slightly longer rides. For hypothesis 3, the ANOVA shown above modeled average ride duration rather than the number of rides: in that model, season was not significant (p = 0.9625) while member type was (p = 0.0255). The seasonal effect on the number of rides is visible in the monthly counts but was not tested directly.

Recommendation

The company could offer promotions or incentives to encourage more casual riders and members to use docked bikes, or it could invest in more electric bikes to meet the demand from casual riders.
To attract more casual riders and improve customer loyalty, the company could launch marketing campaigns highlighting the convenience, affordability, and environmental benefits of the bike-sharing service, with a focus on this group's high average ride duration. This may increase revenue and encourage more casual riders to become members.
The company should collect more data to confirm these findings and improve data quality by using a standardized data entry form, performing regular data quality checks, and implementing data validation checks, since the data contain many missing values.
The company should also identify other variables that may be important predictors of trip counts and duration, such as demographic factors, price, and weather conditions.