I am a junior data analyst on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.
The assignment for the analytics team is to analyze the available dataset to answer the following questions:
Lily Moreno, the director of marketing and my manager, has assigned me the first question: “How do annual members and casual riders use Cyclistic bikes differently?”
Based on this key question, I formulated three related hypotheses:
For this analysis, the following libraries are used:
library("tidyverse")
library("lubridate")
library("geosphere")
library("stats")
library("knitr")
For this project, I utilized the bike-sharing datasets for the year 2022, which can be downloaded from here. The dataset is made available by Lyft Bikes and Scooters under the terms of this license agreement.
After storing the data locally, I verified the credibility of the data by checking whether it is ROCCC (reliable, original, comprehensive, current, and cited). Once I was confident in the data’s credibility, I opened the data to begin exploring and analyzing it.
Opening the data.
# Upload Divvy datasets (csv files) here
Jan_2022 <- read.csv("C:/Users/User/Music/Bike_data_2022/202201_tripdata.csv")
Feb_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202202_tripdata.csv")
Mar_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202203_tripdata.csv")
Apr_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202204_tripdata.csv")
May_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202205_tripdata.csv")
Jun_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202206_tripdata.csv")
Jul_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202207_tripdata.csv")
Aug_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202208_tripdata.csv")
Sept_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202209_tripdata.csv")
Oct_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202210_tripdata.csv")
Nov_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202211_tripdata.csv")
Dec_2022 <- read_csv("C:/Users/User/Music/Bike_data_2022/202212_tripdata.csv")
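The twelve near-identical read calls can also be written as a loop. The following is a minimal sketch, not the method used in this report: the temporary folder, the two toy files, and the names `data_dir`, `monthly_data`, and `all_rides` are hypothetical stand-ins so that the example runs without the real data. It uses base R's `read.csv`; `readr::read_csv` could be substituted.

```r
# Hypothetical stand-in for the data folder: write two tiny monthly files
# into a temporary directory so the sketch runs without the real data.
data_dir <- file.path(tempdir(), "bike_data_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("casual", "member")),
          file.path(data_dir, "202201_tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(data_dir, "202202_tripdata.csv"), row.names = FALSE)

# List every monthly file, sorted so the months stay in calendar order.
csv_files <- sort(list.files(data_dir, pattern = "_tripdata\\.csv$", full.names = TRUE))

# Read each file with the same function, so a column gets the same type in every month.
monthly_data <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
names(monthly_data) <- sub("_tripdata\\.csv$", "", basename(csv_files))

# Stack the monthly data frames into one.
all_rides <- do.call(rbind, monthly_data)
nrow(all_rides)
```

Reading all months with one function also avoids the mixed column types that can arise when one file is read with `read.csv` and the rest with `read_csv`.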
This section prepares the data for analysis. First, I checked that the monthly datasets share the same columns and formats, then combined them into a single data frame.
## 2.1 Create a list of monthly data frames for 2022.
monthly_data_2022 <- list(Jan_2022, Feb_2022, Mar_2022, Apr_2022,
May_2022, Jun_2022, Jul_2022, Aug_2022,
Sept_2022, Oct_2022, Nov_2022, Dec_2022)
## Extract the column names of the first data frame.
colnames_2022 <- colnames(monthly_data_2022[[1]])
## Check that every data frame has exactly the same column names, in the same order, as the first one.
consistent_colnames <- all(sapply(monthly_data_2022, function(df) identical(colnames(df), colnames_2022)))
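How strict a column-name check is depends on the comparison used. The sketch below uses two hypothetical toy data frames (`jan` and `feb`) to show that a membership test with `%in%` can pass even when a column is missing, while an exact or set-based comparison catches it.

```r
# Toy data frames (hypothetical) standing in for two monthly files.
jan <- data.frame(ride_id = "A1", started_at = "2022-01-01 00:00:00",
                  member_casual = "casual")
feb <- data.frame(ride_id = "B1", member_casual = "member")  # started_at is missing

reference_cols <- colnames(jan)

# Membership test: TRUE, because every column of feb also exists in jan.
subset_check <- all(colnames(feb) %in% reference_cols)

# Exact comparison: FALSE, because feb lacks a column that jan has.
exact_check <- identical(colnames(feb), reference_cols)

# Order-insensitive comparison: also FALSE; useful when column order may differ.
set_check <- setequal(colnames(feb), reference_cols)

c(subset = subset_check, exact = exact_check, set = set_check)
```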
Before stacking all months into one dataset, I checked the class of the “started_at” and “ended_at” columns in each month’s data frame.
# Use sapply to apply the function 'class' to each data frame in the list, and extract the class of the 'started_at' column for each month.
sapply(list(Jan_2022, Feb_2022, Mar_2022, Apr_2022, May_2022, Jun_2022,
Jul_2022, Aug_2022, Sept_2022, Oct_2022, Nov_2022, Dec_2022),
function(x) class(x$started_at))
## [[1]]
## [1] "character"
##
## [[2]]
## [1] "POSIXct" "POSIXt"
##
## [[3]]
## [1] "POSIXct" "POSIXt"
##
## [[4]]
## [1] "POSIXct" "POSIXt"
##
## [[5]]
## [1] "POSIXct" "POSIXt"
##
## [[6]]
## [1] "POSIXct" "POSIXt"
##
## [[7]]
## [1] "POSIXct" "POSIXt"
##
## [[8]]
## [1] "POSIXct" "POSIXt"
##
## [[9]]
## [1] "POSIXct" "POSIXt"
##
## [[10]]
## [1] "POSIXct" "POSIXt"
##
## [[11]]
## [1] "POSIXct" "POSIXt"
##
## [[12]]
## [1] "POSIXct" "POSIXt"
sapply(list(Jan_2022, Feb_2022, Mar_2022, Apr_2022, May_2022, Jun_2022,
Jul_2022, Aug_2022, Sept_2022, Oct_2022, Nov_2022, Dec_2022),
function(x) class(x$ended_at))
## [[1]]
## [1] "character"
##
## [[2]]
## [1] "POSIXct" "POSIXt"
##
## [[3]]
## [1] "POSIXct" "POSIXt"
##
## [[4]]
## [1] "POSIXct" "POSIXt"
##
## [[5]]
## [1] "POSIXct" "POSIXt"
##
## [[6]]
## [1] "POSIXct" "POSIXt"
##
## [[7]]
## [1] "POSIXct" "POSIXt"
##
## [[8]]
## [1] "POSIXct" "POSIXt"
##
## [[9]]
## [1] "POSIXct" "POSIXt"
##
## [[10]]
## [1] "POSIXct" "POSIXt"
##
## [[11]]
## [1] "POSIXct" "POSIXt"
##
## [[12]]
## [1] "POSIXct" "POSIXt"
The output shows that “started_at” and “ended_at” are stored as character in the first data frame (Jan_2022) and as POSIXct in all the others. The code below converts the January columns to POSIXct so that date and time can be analyzed consistently.
# Convert "started_at" & "ended_at" column to POSIXct in first data frame.
Jan_2022$started_at <- as.POSIXct(Jan_2022$started_at, format = "%Y-%m-%d %H:%M:%S")
Jan_2022$ended_at <- as.POSIXct(Jan_2022$ended_at, format = "%Y-%m-%d %H:%M:%S")
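When converting text to POSIXct, it can help to set the time zone explicitly and to check for values that failed to parse. The sketch below is illustrative only: the time zone `America/Chicago` is an assumption about how the trip times were recorded, and `raw_times` and `parsed` are hypothetical names.

```r
# Timestamps as they look when read as text; the last one is deliberately invalid.
raw_times <- c("2022-01-13 03:59:47", "2022-01-13 04:02:44", "not a timestamp")

# Assumption: the trip times are Chicago local time, so the zone is set explicitly.
# Without tz, as.POSIXct() uses the time zone of the current R session.
parsed <- as.POSIXct(raw_times, format = "%Y-%m-%d %H:%M:%S", tz = "America/Chicago")

# Strings that do not match the format become NA instead of raising an error,
# so counting NAs after the conversion catches silent parsing failures.
sum(is.na(parsed))

# Confirm the class now matches the other months.
class(parsed)
```

`readr::read_csv` parses date-times as UTC by default, so it may be worth confirming that all months share one time zone before stacking them. The statistical summary later in this report shows a minimum `started_at` in 2021, which may be a sign of such a mismatch.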
After converting the “started_at” and “ended_at” columns to POSIXct, the monthly data frames can be combined into a single dataset.
# Stack all monthly data frames into one dataset.
bike_2022 <- bind_rows(Jan_2022, Feb_2022, Mar_2022,
Apr_2022, May_2022, Jun_2022,
Jul_2022, Aug_2022, Sept_2022,
Oct_2022, Nov_2022, Dec_2022)
# Recheck the structure of the combined data frame.
str(bike_2022)
## 'data.frame': 5667717 obs. of 13 variables:
## $ ride_id : chr "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct, format: "2022-01-13 03:59:47" "2022-01-10 00:41:56" ...
## $ ended_at : POSIXct, format: "2022-01-13 04:02:44" "2022-01-10 00:46:17" ...
## $ start_station_name: chr "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ start_lat : num 42 42 41.9 42 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num 42 42 41.9 42 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr "casual" "casual" "member" "casual" ...
Handling missing values is an essential part of data preparation. Because deletion is simple and suits this dataset, I removed the rows with missing coordinates and dropped the columns with missing values that are not needed for the analysis.
# Collect the rows that have a missing value in any column.
na_df <- bike_2022[rowSums(is.na(bike_2022)) > 0,]
nrow(na_df)
## [1] 1274801
# Count the missing values in each column.
colSums(is.na(bike_2022))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 816804 816804 874815 874815
## start_lat start_lng end_lat end_lng
## 0 0 5858 5858
## member_casual
## 0
# Calculate the percentage of rows with a missing end_lat or end_lng.
(nrow(na_df[is.na(na_df$end_lat), ])/nrow(bike_2022))*100
## [1] 0.1033573
(nrow(na_df[is.na(na_df$end_lng), ])/nrow(bike_2022))*100
## [1] 0.1033573
Only a very small portion of the data, about 0.1%, is missing the end coordinates (end_lat and end_lng). These rows are removed in the next step.
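The missing share of every column can also be computed in one step. This is a small sketch on a hypothetical toy data frame (`rides`), not on the real data: `is.na()` returns a TRUE/FALSE matrix, and its column means are the fraction of missing values per column.

```r
# Toy data frame (hypothetical values) with gaps in two columns.
rides <- data.frame(
  ride_id = c("A1", "A2", "A3", "A4"),
  end_station_name = c("Clark St & Touhy Ave", NA, NA, "Paulina St & Montrose Ave"),
  end_lat = c(42.01, 41.93, NA, 41.96)
)

# Column means of the TRUE/FALSE matrix, converted to percentages.
missing_pct <- round(colMeans(is.na(rides)) * 100, 2)
missing_pct
```

Applied to the combined data, the same call would report all thirteen columns at once, including the station columns, where a much larger share of values is missing than in the coordinates.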
I cleaned the bike_2022 data frame by removing rows with missing values in the end_lat column. The end_lng column has missing values too, but both columns are missing in the same rows (5858 each), so filtering on end_lat also removes the rows with missing end_lng.
# Keep only the rows where end_lat is present.
cleaned_data <- bike_2022[!is.na(bike_2022$end_lat),]
The earlier column counts showed six columns with missing values: start_station_name, start_station_id, end_station_name, end_station_id, end_lat, and end_lng. The check below confirms that after the filter only the four station columns still contain missing values.
# Check the remaining missing values and the row count after removing rows.
colSums(is.na(cleaned_data))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 816804 816804 869043 869043
## start_lat start_lng end_lat end_lng
## 0 0 0 0
## member_casual
## 0
nrow(cleaned_data)
## [1] 5661859
Next, I removed the station name and ID columns, which are not used in my analysis.
# Remove the columns start_station_name, start_station_id, end_station_name, end_station_id.
cleaned_data <- cleaned_data %>%
select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
# Recheck the column names.
colnames(cleaned_data)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual"
# Recheck whether any missing values remain.
colSums(is.na(cleaned_data))
## ride_id rideable_type started_at ended_at start_lat
## 0 0 0 0 0
## start_lng end_lat end_lng member_casual
## 0 0 0 0
No column has missing values any more, so the data is ready for the next step.
A crucial step in the data analysis process is preparing data for analysis, which entails converting raw data into a format that can be used for statistical analysis. This process can involve creating new columns and scaling or normalizing data to make it suitable for analysis.
First, I examined the structure and characteristics of the data frame. This includes listing the number of rows and the dimensions of the data frame, previewing the first few rows, checking the column data types, and getting a statistical summary of the data.
# How many rows are in data frame.
nrow(cleaned_data)
## [1] 5661859
# Dimensions of the data frame.
dim(cleaned_data)
## [1] 5661859 9
# See the first 6 rows of data frame.
head(cleaned_data)
## ride_id rideable_type started_at ended_at
## 1 C2F7DD78E82EC875 electric_bike 2022-01-13 03:59:47 2022-01-13 04:02:44
## 2 A6CF8980A652D272 electric_bike 2022-01-10 00:41:56 2022-01-10 00:46:17
## 3 BD0F91DFF741C66D classic_bike 2022-01-24 20:53:40 2022-01-24 20:58:01
## 4 CBB80ED419105406 classic_bike 2022-01-03 16:18:04 2022-01-03 16:33:00
## 5 DDC963BFDDA51EEA classic_bike 2022-01-19 17:31:10 2022-01-19 17:37:12
## 6 A39C6F6CC0586C0B classic_bike 2022-01-11 10:48:09 2022-01-11 10:51:31
## start_lat start_lng end_lat end_lng member_casual
## 1 42.01280 -87.66591 42.01256 -87.67437 casual
## 2 42.01276 -87.66597 42.01256 -87.67437 casual
## 3 41.92560 -87.65371 41.92533 -87.66580 member
## 4 41.98359 -87.66915 41.96151 -87.67139 casual
## 5 41.87785 -87.62408 41.88462 -87.62783 member
## 6 41.89563 -87.67207 41.90312 -87.67394 member
# See list of columns and data types.
str(cleaned_data)
## 'data.frame': 5661859 obs. of 9 variables:
## $ ride_id : chr "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type: chr "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct, format: "2022-01-13 03:59:47" "2022-01-10 00:41:56" ...
## $ ended_at : POSIXct, format: "2022-01-13 04:02:44" "2022-01-10 00:46:17" ...
## $ start_lat : num 42 42 41.9 42 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num 42 42 41.9 42 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual: chr "casual" "casual" "member" "casual" ...
# Statistical summary of data.
summary(cleaned_data)
## ride_id rideable_type started_at
## Length:5661859 Length:5661859 Min. :2021-12-31 16:00:05.0
## Class :character Class :character 1st Qu.:2022-05-28 19:17:08.5
## Mode :character Mode :character Median :2022-07-22 15:11:15.0
## Mean :2022-07-20 07:18:40.2
## 3rd Qu.:2022-09-16 07:38:28.0
## Max. :2022-12-31 23:59:26.0
## ended_at start_lat start_lng
## Min. :2021-12-31 16:01:48.00 Min. :41.64 Min. :-87.84
## 1st Qu.:2022-05-28 19:37:21.00 1st Qu.:41.88 1st Qu.:-87.66
## Median :2022-07-22 15:30:24.00 Median :41.90 Median :-87.64
## Mean :2022-07-20 07:34:59.95 Mean :41.90 Mean :-87.65
## 3rd Qu.:2022-09-16 07:51:35.00 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :2023-01-01 18:09:37.00 Max. :45.64 Max. :-73.80
## end_lat end_lng member_casual
## Min. : 0.00 Min. :-88.14 Length:5661859
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. : 0.00
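The summary shows a minimum `end_lat` of 0.00 and a maximum `end_lng` of 0.00, and the start coordinates also reach values far from the rest of the data. Such points are not near Chicago and would inflate any distance computed from them. One possible way to flag them is a rough bounding box, sketched below on hypothetical toy rows; the box limits are assumptions and should be adjusted to the actual service area.

```r
# Toy rides (hypothetical values); the last row mimics a (0, 0) end point.
rides <- data.frame(
  ride_id = c("A1", "A2", "A3"),
  end_lat = c(42.01256, 41.92533, 0),
  end_lng = c(-87.67437, -87.66580, 0),
  stringsAsFactors = FALSE
)

# Assumed rough bounding box around the Chicago area.
lat_range <- c(41.0, 43.0)
lng_range <- c(-89.0, -87.0)

in_box <- rides$end_lat >= lat_range[1] & rides$end_lat <= lat_range[2] &
  rides$end_lng >= lng_range[1] & rides$end_lng <= lng_range[2]

# Inspect the flagged rows before deciding whether to drop them.
rides[!in_box, ]

# Keep only the rows with plausible end coordinates.
plausible_rides <- rides[in_box, ]
nrow(plausible_rides)
```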
Creating new columns for analysis is a crucial step in the data analysis process that involves deriving additional variables from existing data to gain insights and make more informed decisions. This process involves performing calculations or transformations on existing variables.
Creating new columns for the hour of the day, day of the week, months, or seasons can provide valuable insights into patterns and trends within the data. These new variables can be derived from existing datetime data, and can help identify peak usage times, seasonality effects, or other patterns that may impact analysis and decision-making.
# Create new columns.
cleaned_data$hour_of_day <- hour(cleaned_data$started_at)
cleaned_data$day_of_week <- wday(cleaned_data$started_at, label = TRUE)
cleaned_data$month <- month(cleaned_data$started_at)
cleaned_data$season <- ifelse(month(cleaned_data$started_at) %in% c(3, 4, 5), "Spring",
ifelse(month(cleaned_data$started_at) %in% c(6, 7, 8), "Summer",
ifelse(month(cleaned_data$started_at) %in% c(9, 10, 11),"Fall", "Winter")))
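The nested `ifelse` calls can also be expressed as a lookup vector indexed by month number. This is an alternative sketch with the same month-to-season mapping; `season_by_month` and `months` are hypothetical names used only for illustration.

```r
# Season for each month number 1..12 (same mapping as the nested ifelse).
season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")

# Example month numbers, as returned by lubridate::month().
months <- c(1, 4, 7, 10, 12)

# Indexing the lookup vector by month number returns the season.
season <- season_by_month[months]
season

# A factor with explicit levels keeps the seasons in calendar order in tables
# and plots, instead of the default alphabetical order.
season_factor <- factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))
levels(season_factor)
```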
Calculating the duration of each ride provides information on usage patterns. The duration is obtained by subtracting started_at from ended_at for each ride, which yields a new numeric column holding the ride duration in minutes.
# Calculate ride duration in minutes.
cleaned_data$ride_duration_minutes <- as.numeric(difftime(cleaned_data$ended_at, cleaned_data$started_at, units = "mins"))
# Then round to 2 decimal places.
cleaned_data$ride_duration_minutes <- round(cleaned_data$ride_duration_minutes, 2)
Next, I checked for negative ride durations in the cleaned data and displayed the corresponding start and end times. This is an important step in ensuring data accuracy and consistency, as negative durations likely indicate errors in data collection or processing.
# Check whether negative durations appear.
head(select(cleaned_data[cleaned_data$ride_duration_minutes < 0, ], c(started_at, ended_at, ride_duration_minutes)))
## started_at ended_at ride_duration_minutes
## 403592 2022-03-05 11:00:57 2022-03-05 10:55:01 -5.93
## 406769 2022-03-05 11:38:04 2022-03-05 11:37:57 -0.12
## 1467056 2022-05-30 11:06:29 2022-05-30 11:06:17 -0.20
## 1635638 2022-06-07 19:15:39 2022-06-07 17:05:37 -130.03
## 1636061 2022-06-07 19:14:46 2022-06-07 17:07:45 -127.02
## 1788334 2022-06-23 19:22:57 2022-06-23 19:21:46 -1.18
# Count the negative values in ride_duration_minutes.
nrow(cleaned_data[cleaned_data$ride_duration_minutes < 0, ])
## [1] 100
# Calculate the percentage of negative values in ride_duration_minutes.
(nrow(cleaned_data[cleaned_data$ride_duration_minutes < 0, ])/nrow(cleaned_data))*100
## [1] 0.001766204
# Remove the rows where ride_duration_minutes is less than 0, i.e. the negative ride durations.
cleaned_data <- cleaned_data[cleaned_data$ride_duration_minutes >= 0, ]
# Print summary statistics of ride_duration_minutes.
summary(cleaned_data$ride_duration_minutes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.82 10.27 16.33 18.42 34354.07
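The maximum of 34354.07 minutes is almost 24 days, and the mean (16.33) sits well above the median (10.27), which suggests that a few very long rides pull the average upward. The sketch below uses hypothetical toy durations to show the effect; the 24-hour cutoff is an assumption for illustration, not a rule taken from the data provider.

```r
# Toy durations in minutes (hypothetical), with one extreme value equal to the
# maximum reported in the summary.
duration_min <- c(5.8, 10.3, 18.4, 12.0, 34354.07)

mean_all <- mean(duration_min)      # pulled far upward by the single extreme ride
median_all <- median(duration_min)  # barely affected by it

# Assumed cutoff: treat rides longer than 24 hours as suspect.
max_minutes <- 24 * 60
n_suspect <- sum(duration_min > max_minutes)

# Mean after setting the suspect rides aside.
mean_trimmed <- mean(duration_min[duration_min <= max_minutes])

c(mean_all = mean_all, median_all = median_all,
  n_suspect = n_suspect, mean_trimmed = mean_trimmed)
```

Reporting the median next to the mean, or reviewing rides above a chosen cutoff, would make the later comparisons of average duration less sensitive to such values.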
Calculating the length of each ride provides information on how far riders travel. The length is computed from the start and end coordinates (“start_lat”, “start_lng”, “end_lat”, and “end_lng”) with the geosphere package, which yields a new numeric column. Note that this is the straight-line (great-circle) distance between the start and end points, not the distance actually ridden.
# Create the ride_length_km column: compute the Haversine distance in meters, then divide by 1000 to convert to kilometers.
cleaned_data$ride_length_km <- geosphere::distHaversine(
p1 = cbind(cleaned_data$start_lng, cleaned_data$start_lat),
p2 = cbind(cleaned_data$end_lng, cleaned_data$end_lat)
) / 1000
# Then round to 2 decimal places.
cleaned_data$ride_length_km <- round(cleaned_data$ride_length_km, 2)
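To make the distance calculation more transparent, the Haversine formula can be written out in base R. The function `haversine_m` below is a hypothetical helper for illustration, not part of geosphere; its default radius is the same value that `geosphere::distHaversine` uses by default (6378137 m). The coordinates come from the first ride in the data preview.

```r
# Haversine great-circle distance in meters between two longitude/latitude points.
haversine_m <- function(lng1, lat1, lng2, lat2, radius = 6378137) {
  to_rad <- pi / 180
  d_lat <- (lat2 - lat1) * to_rad
  d_lng <- (lng2 - lng1) * to_rad
  a <- sin(d_lat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(d_lng / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius * asin(pmin(1, sqrt(a)))
}

# First ride from the data preview: start and end coordinates, result in km.
dist_km <- haversine_m(-87.66591, 42.01280, -87.67437, 42.01256) / 1000
round(dist_km, 2)

# A round trip that ends where it started has length 0, however far the rider went.
round_trip_m <- haversine_m(-87.66591, 42.01280, -87.66591, 42.01280)
round_trip_m
```

The second call shows one limitation of this measure: rides that return to their starting point are recorded with a length of zero, so the average ride length likely understates the distance actually covered.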
For the descriptive analysis, this part involves exploring the data and summarizing its key characteristics, such as measures of central tendency, variability, and distribution. This type of analysis can provide insights into the patterns and trends present in the data, and help us understand the overall shape and structure of the data.
In this section, a general analysis compares the usage patterns of members and casual riders. The aim is to gain insights into the differences in how the two types of riders use the bike-sharing service.
# Create a data frame comparing the number of rides by members and casual riders.
member_summary <- cleaned_data %>%
group_by(member_casual) %>%
summarise(count = n())
# Graph for comparing number between member and casual.
ggplot(member_summary, aes(x = member_casual, y = count)) +
geom_bar(aes(fill = member_casual), stat = "identity", position = "dodge") +
geom_text(aes(label=count), stat="identity",
position=position_dodge(width=0.9), vjust=-0.5) +
labs(title = "Count of Member and Casual", x = "Member Type", y = "Count")
As the graph shows, members took 3344951 rides (59.08%), more than the 2316808 rides (40.92%) taken by casual riders.
I conclude that members used the bike-sharing service more often than casual riders.
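The percentages quoted above can be reproduced from the two ride counts. A minimal sketch, with the counts typed in by hand and `ride_counts` and `ride_share` as hypothetical names:

```r
# Ride counts per rider type, as reported above.
ride_counts <- c(casual = 2316808, member = 3344951)

# Share of all rides, in percent, rounded to two decimals.
ride_share <- round(100 * ride_counts / sum(ride_counts), 2)
ride_share
```

In the full analysis the same shares could be added as a column to `member_summary`, so that the bar labels show percentages as well as counts.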
This section presents a general comparison between members and casual riders based on average ride duration and ride length. By calculating the mean of these two variables, we can gain insights into the differences in riding patterns between the two groups. This information can be useful for understanding ride behaviour and developing targeted marketing strategies.
# Create dataframe ride summary for ride_duration_minutes between member and casual.
ride_summary <- cleaned_data %>%
group_by(member_casual) %>%
summarize(avg_duration = mean(ride_duration_minutes),
avg_length = mean(ride_length_km))
# Graph for the ride duration(minutes) using ride summary between member and casual.
ggplot(ride_summary, aes(x = member_casual, y = avg_duration, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Member Type", y = "Average Ride Duration (minutes)", fill = "Member/Casual") +
ggtitle("Average Ride Duration Between Member and Casual")
The chart above shows that casual riders have a higher average ride duration (21.99 minutes) than members (12.41 minutes).
I conclude that casual riders tend to take longer rides than members.
# Graph for the ride length(km) using ride summary between member and casual.
ggplot(ride_summary, aes(x = member_casual, y = avg_length, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Member Type", y = "Average Length (km)", fill = "Member/Casual") +
ggtitle("Average Length Between Member and Casual")
The graph above shows only a small difference in average ride length: casual riders average 2.18 km and members 2.11 km.
Therefore, on average, casual riders ride slightly farther than members.
This section aims to identify the most popular rideable type for both member and casual. By comparing the number of rides taken for each rideable type, we can gain insights into the preferences of each user group.
# Create a dataframe to compare which rideable type has most ride by both member and casual.
rideable_counts <- cleaned_data %>%
group_by(rideable_type, member_casual) %>%
summarize(count = n())
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
knitr::kable(rideable_counts)
| rideable_type | member_casual | count |
|---|---|---|
| classic_bike | casual | 888890 |
| classic_bike | member | 1709054 |
| docked_bike | casual | 174858 |
| electric_bike | casual | 1253060 |
| electric_bike | member | 1635897 |
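The `summarise()` message printed above is informational: the result is still grouped by the first grouping column. If an ungrouped result is wanted, the `.groups` argument can say so explicitly. The sketch below runs on hypothetical toy rows and assumes dplyr 1.0 or later; `rideable_counts_demo` and `counted` are illustrative names.

```r
library(dplyr)

# Toy rides (hypothetical values) with the two grouping columns.
rides <- data.frame(
  rideable_type = c("classic_bike", "classic_bike", "electric_bike", "docked_bike"),
  member_casual = c("member", "casual", "member", "casual")
)

# .groups = "drop" returns an ungrouped result and suppresses the message.
rideable_counts_demo <- rides %>%
  group_by(rideable_type, member_casual) %>%
  summarize(count = n(), .groups = "drop")

# count() is a shorthand for the same group-and-tally step.
counted <- rides %>% count(rideable_type, member_casual, name = "count")

rideable_counts_demo
```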
# Graph which rideable type has most ride by both members and casual.
ggplot(rideable_counts, aes(x = rideable_type, y = count)) +
geom_bar(aes(fill = member_casual), stat = "identity", position = "dodge") +
labs(x = "Rideable Type", y = "Count", fill = "Member/Casual") +
ggtitle("Popular Rideable Types Between Member and Casual")
The graph above displays the usage of each rideable type by member type. The data contains three rideable types: classic bikes, docked bikes, and electric bikes. Members used only classic and electric bikes.
The classic bike is the rideable type most used by members, with 1709054 rides, while casual riders used classic bikes 888890 times. The docked bike is used exclusively by casual riders, with 174858 rides. The electric bike is also popular with members, with 1635897 rides, compared with 1253060 rides by casual riders.
In short, members used classic and electric bikes more often than casual riders did, while casual riders used all three rideable types.
This section aims to identify the rideable type with the highest average duration for both member and casual. By comparing the average duration of rides for each rideable type, we can gain insights into how each member type uses Cyclistic bikes differently.
# Create dataframe and check which rideable type the most average duration between members and casual.
avg_duration_by_type <- cleaned_data %>%
group_by(member_casual, rideable_type) %>%
summarize(avg_duration = mean(ride_duration_minutes),
avg_length = mean(ride_length_km))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
knitr::kable(avg_duration_by_type)
| member_casual | rideable_type | avg_duration | avg_length |
|---|---|---|---|
| casual | classic_bike | 24.53221 | 2.087424 |
| casual | docked_bike | 50.70781 | 2.175997 |
| casual | electric_bike | 16.17394 | 2.253113 |
| member | classic_bike | 13.32055 | 1.965717 |
| member | electric_bike | 11.46272 | 2.266822 |
# Graph to see the result for ride duration by rideable type between member and casual.
ggplot(avg_duration_by_type, aes(x = rideable_type, y = avg_duration,
fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Rideable Type", y = "Average Duration (Minutes)",
title = "Average Ride Duration by Rideable Type Between Member and Casual")
The graph above compares the average ride duration (in minutes) across the three rideable types.
On classic bikes, casual riders average 24.53 minutes per ride, longer than the 13.32 minutes for members. The same holds for electric bikes, where casual riders average 16.17 minutes against 11.46 minutes for members.
Docked bikes have the highest average duration of all, 50.71 minutes, which may indicate that casual riders looking for longer rides prefer this type.
A seasonal analysis of bike-sharing patterns shows how members and casual riders differ in their bike usage throughout the year.
# Create a data frame with the average ride length and ride duration by season, rideable type, and member type.
rides_by_season <- cleaned_data %>%
group_by(season, rideable_type, member_casual) %>%
summarize(avg_ride_length = mean(ride_length_km),
avg_ride_duration = mean(ride_duration_minutes))
knitr::kable(rides_by_season)
| season | rideable_type | member_casual | avg_ride_length | avg_ride_duration |
|---|---|---|---|---|
| Fall | classic_bike | casual | 2.051708 | 22.40831 |
| Fall | classic_bike | member | 1.989286 | 12.77882 |
| Fall | docked_bike | casual | 2.179392 | 44.47659 |
| Fall | electric_bike | casual | 2.142633 | 14.04954 |
| Fall | electric_bike | member | 2.210474 | 10.99363 |
| Spring | classic_bike | casual | 2.109167 | 27.08755 |
| Spring | classic_bike | member | 1.909699 | 13.07919 |
| Spring | docked_bike | casual | 2.120987 | 53.03799 |
| Spring | electric_bike | casual | 2.367752 | 17.84304 |
| Spring | electric_bike | member | 2.194932 | 11.26554 |
| Summer | classic_bike | casual | 2.121717 | 24.52697 |
| Summer | classic_bike | member | 2.068572 | 14.25216 |
| Summer | docked_bike | casual | 2.214547 | 50.76781 |
| Summer | electric_bike | casual | 2.304302 | 17.23570 |
| Summer | electric_bike | member | 2.450307 | 12.37411 |
| Winter | classic_bike | casual | 1.618039 | 20.67457 |
| Winter | classic_bike | member | 1.613089 | 11.62504 |
| Winter | docked_bike | casual | 1.954290 | 79.95522 |
| Winter | electric_bike | casual | 1.927572 | 11.48674 |
| Winter | electric_bike | member | 1.954729 | 10.16456 |
# Plot the average ride duration by rideable type between member & casual for each season.
ggplot(rides_by_season, aes(x = rideable_type, y = avg_ride_duration, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Rideable Type", y = "Average Duration (Minutes)",
title = "Comparing Average Ride Durations on Rideable Type by Season Between Member Type") +
facet_wrap(~season) +
theme(plot.title = element_text(size = 11))
The seasonal analysis reveals notable differences in average ride duration between casual riders and members. In Winter, classic bike rides averaged 20.67 minutes for casual riders and 11.63 minutes for members. Similarly, on electric bikes casual riders averaged 11.49 minutes, longer than the 10.16 minutes for members. In Spring, casual riders had their longest average classic bike duration, 27.09 minutes, compared with 13.08 minutes for members.
In Fall, casual riders again rode longer on classic bikes, averaging 22.41 minutes against 12.78 minutes for members, and on electric bikes, with 14.05 minutes against 10.99 minutes. In Summer, casual riders averaged 24.53 minutes on classic bikes against 14.25 minutes for members, and 17.24 minutes on electric bikes against 12.37 minutes.
Only casual riders rented docked bikes, and they did so in every season. Their average docked bike ride lasted 79.96 minutes in Winter, 53.04 minutes in Spring, 50.77 minutes in Summer, and 44.48 minutes in Fall.
# Plot the average ride length by rideable type between member & casual for season.
ggplot(rides_by_season, aes(x = rideable_type, y = avg_ride_length, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Rideable Type", y = "Average Ride Length (km)",
title = "Comparing Average Ride Length by Season, Rideable Type, and Membership Type") +
facet_wrap(~season) +
theme(plot.title = element_text(size = 11))
For average ride length, the seasonal differences between casual riders and members are small. In Winter, classic bike rides averaged 1.62 km for casual riders and 1.61 km for members, while on electric bikes casual riders averaged 1.93 km, slightly shorter than the 1.95 km for members. In Spring, casual riders had their longest average ride length, 2.37 km on electric bikes, compared with 2.19 km for members.
In Fall, casual riders rode farther on classic bikes, averaging 2.05 km against 1.99 km for members, but shorter on electric bikes, with 2.14 km against 2.21 km.
In Summer, casual riders again rode farther on classic bikes, averaging 2.12 km against 2.07 km for members, but shorter on electric bikes, with 2.30 km against 2.45 km.
The monthly analysis of bike-sharing patterns compares the usage between member and casual over different months. This analysis provides insights into how frequently users of each type use the bike-sharing service, how long their rides are, and how usage patterns differ between different months. These insights can help bike-sharing companies better understand their member type and optimize their services to meet their needs.
# Create dataframe counting number which month has most ride between members & casual.
counts_by_month <- cleaned_data %>%
group_by(month, member_casual) %>%
summarize(count = n())
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
knitr::kable(counts_by_month)
| month | member_casual | count |
|---|---|---|
| 1 | casual | 17879 |
| 1 | member | 84728 |
| 2 | casual | 21361 |
| 2 | member | 94171 |
| 3 | casual | 89642 |
| 3 | member | 194132 |
| 4 | casual | 126121 |
| 4 | member | 244811 |
| 5 | casual | 279764 |
| 5 | member | 354371 |
| 6 | casual | 368082 |
| 6 | member | 400055 |
| 7 | casual | 405179 |
| 7 | member | 417346 |
| 8 | casual | 358161 |
| 8 | member | 426913 |
| 9 | casual | 296074 |
| 9 | member | 404544 |
| 10 | casual | 208611 |
| 10 | member | 349595 |
| 11 | casual | 100559 |
| 11 | member | 236905 |
| 12 | casual | 45375 |
| 12 | member | 137380 |
# Plot the number which month has most ride between members & casual.
ggplot(counts_by_month, aes(x = month, y = count, fill = member_casual)) +
geom_col(position = "dodge") +
scale_x_continuous(breaks = 1:12) +
labs(title = "Number of Rides by Month Between Member and Casual",
x = "Month", y = "Number of Rides",
fill = "Member Type") +
theme(legend.position = "top", axis.text.y = element_text(size = 8))
The graph above shows the number of bike rides by month for members and casual riders. August had the highest number of rides for members, with 426,913 rides, while July had the highest number for casual riders, with 405,179 rides.
Rides by both members and casual riders increased from March to July. Casual rides then declined steeply from August to December, while member rides kept rising to their August peak and began to decline a month later, from September to December.
Both groups began riding more at the start of Spring, in March. The decline in casual rides started earlier, from August (the end of Summer) until December (early Winter), whereas the decline in member rides started from September (early Fall) until December.
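The peak months can be read off the table programmatically rather than by eye. A small sketch with the monthly counts typed in from the table above; `casual`, `member`, `peak_month`, and `casual_share` are hypothetical names.

```r
# Monthly ride counts copied from the table above (months 1 to 12).
casual <- c(17879, 21361, 89642, 126121, 279764, 368082,
            405179, 358161, 296074, 208611, 100559, 45375)
member <- c(84728, 94171, 194132, 244811, 354371, 400055,
            417346, 426913, 404544, 349595, 236905, 137380)

# Month number with the most rides for each rider type.
peak_month <- c(casual = which.max(casual), member = which.max(member))
peak_month

# Casual rides as a share of each month's total, in percent.
casual_share <- round(100 * casual / (casual + member), 1)
names(casual_share) <- month.abb
casual_share
```

The share of casual rides climbs from under a fifth of all rides in January to almost half in July, which makes the seasonal nature of casual use easier to see than the raw counts do.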
This monthly analysis compares the average ride duration between member and casual. It examines the differences in ride duration patterns for each user type across different months, shedding light on potential trends and seasonal variations in ride behavior.
# Create dataframe for both average ride duration and ride length between members & casual for monthly.
avg_duration_by_month <- cleaned_data %>%
group_by(month, member_casual) %>%
summarize(avg_ride_duration = mean(ride_duration_minutes),
avg_ride_length = mean(ride_length_km))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
knitr::kable(avg_duration_by_month)
| month | member_casual | avg_ride_duration | avg_ride_length |
|---|---|---|---|
| 1 | casual | 22.28512 | 1.916228 |
| 1 | member | 11.61177 | 1.745195 |
| 2 | casual | 22.09011 | 1.985074 |
| 2 | member | 11.05778 | 1.778522 |
| 3 | casual | 25.73872 | 2.186419 |
| 3 | member | 11.74342 | 1.981899 |
| 4 | casual | 23.35343 | 2.218713 |
| 4 | member | 11.36473 | 1.948877 |
| 5 | casual | 25.54210 | 2.259579 |
| 5 | member | 13.06465 | 2.146724 |
| 6 | casual | 23.37789 | 2.245559 |
| 6 | member | 13.65384 | 2.250941 |
| 7 | casual | 23.29289 | 2.221379 |
| 7 | member | 13.43377 | 2.260524 |
| 8 | casual | 21.41921 | 2.205590 |
| 8 | member | 13.08152 | 2.224552 |
| 9 | casual | 19.97333 | 2.175584 |
| 9 | member | 12.63778 | 2.153869 |
| 10 | casual | 18.44234 | 2.051708 |
| 10 | member | 11.54556 | 2.004560 |
| 11 | casual | 15.47479 | 2.067671 |
| 11 | member | 10.86853 | 2.177716 |
| 12 | casual | 14.10573 | 1.719102 |
| 12 | member | 10.36091 | 1.806797 |
# Plot the average ride duration between member & casual by month.
ggplot(avg_duration_by_month, aes(x = month, y = avg_ride_duration, fill = member_casual)) +
geom_col(position = "dodge") +
scale_x_continuous(breaks = 1:12) +
labs(title = "Average Ride Duration by Month Between Member Type",
x = "Month", y = "Average Ride Duration (Minutes)",
fill = "Member Type") +
theme(legend.position = "top", axis.text.y = element_text(size = 8))
The graph shows the average ride duration per month for members and casual riders. Casual riders had their longest average rides in March (25.74 minutes), with May close behind (25.54 minutes); the average then fell steadily to 14.11 minutes in December. Members had their longest average rides in June (13.65 minutes), falling gradually to 10.36 minutes in December.
Casual riders had a longer average ride duration than members in every month, by a factor of roughly 1.4 to 2.2. This could suggest that casual riders are comfortable with the prices charged for rides, but further analysis is necessary to confirm this hypothesis.
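The monthly comparison can be checked directly against the table. The following is a minimal sketch in base R that re-enters the printed averages by hand (an assumption made so the block runs without `cleaned_data`) and reports the peak month and the casual-to-member duration ratio.

```r
# Average ride duration (minutes) for months 1-12, copied from the table above.
casual <- c(22.28512, 22.09011, 25.73872, 23.35343, 25.54210, 23.37789,
            23.29289, 21.41921, 19.97333, 18.44234, 15.47479, 14.10573)
member <- c(11.61177, 11.05778, 11.74342, 11.36473, 13.06465, 13.65384,
            13.43377, 13.08152, 12.63778, 11.54556, 10.86853, 10.36091)

# Month number with the longest average ride for each rider type.
peak_month <- c(casual = which.max(casual), member = which.max(member))

# How many times longer the average casual ride is, month by month.
duration_ratio <- round(casual / member, 2)

peak_month
range(duration_ratio)
```

Reading peaks off the data with `which.max()` avoids misreading a crowded bar chart.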
# Plot the average ride length between member & casual by month.
ggplot(avg_duration_by_month, aes(x = month, y = avg_ride_length, fill = member_casual)) +
geom_col(position = "dodge") +
scale_x_continuous(breaks = 1:12) +
labs(title = "Average Ride Length by Month and Rider Type",
x = "Month", y = "Average Ride Length (km)",
fill = "Member Type") +
theme(legend.position = "top", axis.text.y = element_text(size = 8))
The graph displays the average ride length per month for members and casual riders. The two groups differ only slightly in every month. The highest average ride length for casual riders was recorded in May (2.2596 km), and for members in July (2.2605 km). The lowest was recorded in December for casual riders (1.72 km) and in January for members (1.75 km).
Interestingly, members' average ride length is close to that of casual riders, and slightly longer from June to August and in November and December, even though their average ride duration is far shorter. This suggests that members, who use the bikes most often, ride more directly from start to end than casual riders. However, further analysis is required to confirm this observation.
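One way to relate ride length to ride duration is an implied average speed. The sketch below divides the monthly average length by the monthly average duration, using values copied from the table. It assumes that ride_length_km is the distance between the start and end points of a ride, so the result is a start-to-end speed rather than the speed along the route actually ridden. The helper `implied_speed_kmh` is hypothetical, and a ratio of averages is only a rough stand-in for per-ride speeds.

```r
# Monthly averages (months 1-12) copied from the table above.
dur_casual <- c(22.28512, 22.09011, 25.73872, 23.35343, 25.54210, 23.37789,
                23.29289, 21.41921, 19.97333, 18.44234, 15.47479, 14.10573)
dur_member <- c(11.61177, 11.05778, 11.74342, 11.36473, 13.06465, 13.65384,
                13.43377, 13.08152, 12.63778, 11.54556, 10.86853, 10.36091)
len_casual <- c(1.916228, 1.985074, 2.186419, 2.218713, 2.259579, 2.245559,
                2.221379, 2.205590, 2.175584, 2.051708, 2.067671, 1.719102)
len_member <- c(1.745195, 1.778522, 1.981899, 1.948877, 2.146724, 2.250941,
                2.260524, 2.224552, 2.153869, 2.004560, 2.177716, 1.806797)

# Hypothetical helper: km per hour from km and minutes.
implied_speed_kmh <- function(length_km, duration_min) {
  length_km / (duration_min / 60)
}

speed <- data.frame(
  month  = 1:12,
  casual = round(implied_speed_kmh(len_casual, dur_casual), 1),
  member = round(implied_speed_kmh(len_member, dur_member), 1)
)
speed
```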
This analysis compares how often members and casual riders use the service on each day of the week. It helps to identify patterns or differences in the way the two rider types use the bike-sharing service on different days.
# Create a dataframe of ride counts by day of week to compare usage patterns.
day_of_week_summary <- cleaned_data %>%
group_by(day_of_week, member_casual) %>%
summarise(count = n())
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
knitr::kable(day_of_week_summary)
| day_of_week | member_casual | count |
|---|---|---|
| Sun | casual | 387902 |
| Sun | member | 388064 |
| Mon | casual | 277045 |
| Mon | member | 473468 |
| Tue | casual | 263268 |
| Tue | member | 518560 |
| Wed | casual | 273793 |
| Wed | member | 523689 |
| Thu | casual | 308686 |
| Thu | member | 531835 |
| Fri | casual | 334553 |
| Fri | member | 466723 |
| Sat | casual | 471561 |
| Sat | member | 442612 |
# Plot the result dataframe to compare the usage between member & casual by day of the week.
ggplot(day_of_week_summary, aes(x = day_of_week, y = count, fill = member_casual)) +
geom_col(position = "dodge") +
geom_text(aes(label = count), position = position_dodge(width = 0.9),
vjust = -0.5, size = 2) +
labs(x = "Day of Week", y = "Count", fill = "Member/Casual") +
ggtitle("Usage Patterns by Day of Week Between Member and Casual")
The graph above displays the number of rides by day of the week for members and casual riders. Members take more rides than casual riders on every day except Saturday, when casual riders outnumber them (471,561 rides compared with 442,612). On Sunday the two groups are almost level (388,064 member rides and 387,902 casual rides).
Casual usage is lowest on Tuesday and then rises steadily through to Saturday. In contrast, member usage is highest from Tuesday to Thursday and lowest on Sunday.
Because members take more rides than casual riders on six of the seven days, the price offered to members may be more attractive and cost-effective for regular bike users.
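The weekday and weekend contrast can be summarized with a short calculation. This is a minimal sketch in base R using the counts printed in the table; treating Saturday and Sunday as the weekend is an assumption of the sketch.

```r
# Ride counts per day of week, copied from the table above (Sun to Sat).
days   <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
casual <- c(387902, 277045, 263268, 273793, 308686, 334553, 471561)
member <- c(388064, 473468, 518560, 523689, 531835, 466723, 442612)

is_weekend <- days %in% c("Sat", "Sun")

# Share of each group's rides that fall on Saturday or Sunday.
weekend_share <- c(
  casual = sum(casual[is_weekend]) / sum(casual),
  member = sum(member[is_weekend]) / sum(member)
)
round(100 * weekend_share, 1)

# Days on which casual riders outnumber members.
casual_lead_days <- days[casual > member]
casual_lead_days
```

Expressing weekend rides as a share of each group's own total removes the effect of members simply taking more rides overall.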
The analysis compares the average ride duration and length for members and casual riders across the days of the week. It aims to show whether there are any notable differences in the way the two groups use the bike-sharing service on different days.
# Create a dataframe of average ride duration and length by day of week to compare usage patterns.
duration_summary <- cleaned_data %>%
group_by(day_of_week, member_casual) %>%
summarize(avg_duration = mean(ride_duration_minutes),
avg_length = mean(ride_length_km))
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
knitr::kable(duration_summary)
| day_of_week | member_casual | avg_duration | avg_length |
|---|---|---|---|
| Sun | casual | 25.09057 | 2.239146 |
| Sun | member | 13.68545 | 2.145585 |
| Mon | casual | 22.62697 | 2.096668 |
| Mon | member | 11.99860 | 2.042182 |
| Tue | casual | 19.63264 | 2.092925 |
| Tue | member | 11.81419 | 2.065093 |
| Wed | casual | 19.00205 | 2.172441 |
| Wed | member | 11.84072 | 2.196016 |
| Thu | casual | 19.67199 | 2.132912 |
| Thu | member | 12.00163 | 2.093116 |
| Fri | casual | 20.61548 | 2.155729 |
| Fri | member | 12.22869 | 2.061212 |
| Sat | casual | 24.59478 | 2.299640 |
| Sat | member | 13.80004 | 2.196418 |
# Plot average ride duration by day of week to compare usage patterns.
ggplot(duration_summary, aes(x = day_of_week, y = avg_duration, fill = member_casual)) +
geom_col(position = "dodge") +
geom_text(aes(label = round(avg_duration, 2)), position = position_dodge(width = 0.9),
vjust = -0.5, size = 2) +
labs(x = "Day of Week", y = "Average Duration (Minutes)", fill = "Member/Casual") +
ggtitle("Average Ride Duration by Day of Week Between Member and Casual") +
theme(legend.position = "bottom")
The graph above shows the average ride duration by day of the week for members and casual riders. Casual riders have the longer average ride duration on every day of the week.
Although members have the shorter rides throughout the week, their average duration is relatively stable, ranging from 11.81 minutes on Tuesday to 13.80 minutes on Saturday.
For casual riders, the average ride duration is lowest on Wednesday (19.00 minutes) and increases each day until Sunday (25.09 minutes).
# Plot average ride length by day of week to compare usage patterns.
ggplot(duration_summary, aes(x = day_of_week, y = avg_length, fill = member_casual)) +
geom_col(position = "dodge") +
geom_text(aes(label = round(avg_length, 2)), position = position_dodge(width = 0.9),
vjust = -0.5, size = 2) +
labs(x = "Day of Week", y = "Average Ride Length (km)", fill = "Member/Casual") +
ggtitle("Average Ride Length by Day of Week and Member/Casual Riders") +
theme(legend.position = "bottom")
The graph above illustrates the average ride length by day of the week for members and casual riders. The two rider types have strikingly similar ride lengths on every day of the week.
Casual riders have marginally longer average ride lengths than members on every day except Wednesday. The longest average ride length for casual riders is on Saturday (2.30 km); for members, Saturday and Wednesday are effectively tied at 2.20 km.
The analysis compares the ride patterns of members and casual riders by hour of day, including the number of rides and the average ride duration and length. The aim is to show how the two groups use the bike-sharing service at different hours.
# Create a dataframe of ride counts by hour of day to compare usage patterns.
ride_counts <- cleaned_data %>%
group_by(hour_of_day, member_casual) %>%
summarize(count = n())
## `summarise()` has grouped output by 'hour_of_day'. You can override using the
## `.groups` argument.
knitr::kable(ride_counts)
| hour_of_day | member_casual | count |
|---|---|---|
| 0 | casual | 46483 |
| 0 | member | 40915 |
| 1 | casual | 30220 |
| 1 | member | 25527 |
| 2 | casual | 19129 |
| 2 | member | 16132 |
| 3 | casual | 11948 |
| 3 | member | 12307 |
| 4 | casual | 8719 |
| 4 | member | 13911 |
| 5 | casual | 13476 |
| 5 | member | 36203 |
| 6 | casual | 30483 |
| 6 | member | 93851 |
| 7 | casual | 52422 |
| 7 | member | 174145 |
| 8 | casual | 70630 |
| 8 | member | 207008 |
| 9 | casual | 72934 |
| 9 | member | 148885 |
| 10 | casual | 93251 |
| 10 | member | 138686 |
| 11 | casual | 121068 |
| 11 | member | 162476 |
| 12 | casual | 143330 |
| 12 | member | 185106 |
| 13 | casual | 149433 |
| 13 | member | 182959 |
| 14 | casual | 158826 |
| 14 | member | 181243 |
| 15 | casual | 176792 |
| 15 | member | 216107 |
| 16 | casual | 196184 |
| 16 | member | 284351 |
| 17 | casual | 218621 |
| 17 | member | 341070 |
| 18 | casual | 196102 |
| 18 | member | 278418 |
| 19 | casual | 150208 |
| 19 | member | 201957 |
| 20 | casual | 111028 |
| 20 | member | 142115 |
| 21 | casual | 95006 |
| 21 | member | 112757 |
| 22 | casual | 86030 |
| 22 | member | 88110 |
| 23 | casual | 64485 |
| 23 | member | 60712 |
# Plot ride counts by hour of day to compare usage patterns.
ggplot(ride_counts, aes(x = hour_of_day, y = count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Number of Rides by Hour of Day Between Member and Casual",
x = "Hour of Day", y = "Number of Rides",
fill = "Rider Type") +
theme(legend.position = "bottom")
The graph displays the number of rides per hour for members and casual riders. Members take more rides than casual riders at most hours; the exceptions are the late-night hours from 11 p.m. through 2 a.m., when casual riders slightly outnumber members.
The busiest hour of the day is 5 p.m., with 341,070 member rides. Member rides are lowest at 3 a.m. (12,307 rides), climb to a morning peak at 8 a.m. (207,008 rides), dip in the late morning, and then rise again to the 5 p.m. peak.
Both groups record very few rides between 3 a.m. and 4 a.m., and both reach their highest number of rides at 5 p.m.
This part of the hour-of-day analysis compares the average ride duration and ride length of members and casual riders at different hours of the day. It can show when each rider type takes its longest rides and help bike-sharing companies adjust their services accordingly.
# Create a dataframe of average ride duration and ride length by hour of day for members and casual riders.
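The hourly gap between the two groups can be expressed as a ratio of member rides to casual rides. The sketch below uses four hours picked from the table for illustration (morning commute, midday, evening commute, late night); the choice of hours is an assumption of this example, not part of the original analysis.

```r
# Ride counts at selected hours, copied from the table above.
hours  <- c(8, 13, 17, 23)
casual <- c(70630, 149433, 218621, 64485)
member <- c(207008, 182959, 341070, 60712)

# Member rides per casual ride at each selected hour.
member_per_casual <- setNames(round(member / casual, 2), paste0(hours, ":00"))
member_per_casual
```

A ratio well above 1 marks hours dominated by members, while a ratio below 1 marks hours in which casual riders take more rides.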
ride_duration_summary <- cleaned_data %>%
group_by(hour_of_day, member_casual) %>%
summarize(avg_ride_duration = mean(ride_duration_minutes),
avg_ride_length = mean(ride_length_km))
## `summarise()` has grouped output by 'hour_of_day'. You can override using the
## `.groups` argument.
knitr::kable(ride_duration_summary)
| hour_of_day | member_casual | avg_ride_duration | avg_ride_length |
|---|---|---|---|
| 0 | casual | 19.16063 | 1.994147 |
| 0 | member | 11.96146 | 2.041912 |
| 1 | casual | 20.18658 | 2.022072 |
| 1 | member | 11.92407 | 1.979325 |
| 2 | casual | 19.88483 | 2.099720 |
| 2 | member | 11.99664 | 1.936918 |
| 3 | casual | 19.47687 | 2.254195 |
| 3 | member | 11.95756 | 1.887134 |
| 4 | casual | 17.33578 | 2.232890 |
| 4 | member | 11.69781 | 2.069858 |
| 5 | casual | 15.99421 | 2.107309 |
| 5 | member | 10.50820 | 2.128078 |
| 6 | casual | 16.11783 | 2.094036 |
| 6 | member | 10.89261 | 2.207568 |
| 7 | casual | 15.11570 | 2.089533 |
| 7 | member | 11.45753 | 2.274050 |
| 8 | casual | 16.66733 | 2.088277 |
| 8 | member | 11.30717 | 2.143365 |
| 9 | casual | 21.63870 | 2.152578 |
| 9 | member | 11.41917 | 2.015782 |
| 10 | casual | 25.48343 | 2.250716 |
| 10 | member | 12.17153 | 2.061350 |
| 11 | casual | 26.11056 | 2.278090 |
| 11 | member | 12.64284 | 2.017851 |
| 12 | casual | 25.37249 | 2.253513 |
| 12 | member | 12.18605 | 1.976384 |
| 13 | casual | 25.63903 | 2.243600 |
| 13 | member | 12.36834 | 2.082029 |
| 14 | casual | 25.34080 | 2.255736 |
| 14 | member | 12.72825 | 2.050572 |
| 15 | casual | 24.03236 | 2.244738 |
| 15 | member | 12.79964 | 2.150954 |
| 16 | casual | 22.46893 | 2.355238 |
| 16 | member | 13.02258 | 2.205016 |
| 17 | casual | 21.03642 | 2.252223 |
| 17 | member | 13.26508 | 2.213336 |
| 18 | casual | 20.71839 | 2.163160 |
| 18 | member | 13.05927 | 2.127925 |
| 19 | casual | 20.78182 | 2.040136 |
| 19 | member | 12.77520 | 2.037212 |
| 20 | casual | 20.28402 | 1.980405 |
| 20 | member | 12.58302 | 2.012269 |
| 21 | casual | 19.25802 | 2.034150 |
| 21 | member | 12.30942 | 2.090073 |
| 22 | casual | 18.78231 | 2.093450 |
| 22 | member | 12.26700 | 2.168989 |
| 23 | casual | 19.42073 | 2.097291 |
| 23 | member | 12.30790 | 2.155957 |
# Plot average ride duration by hour of day for members and casual riders.
ggplot(ride_duration_summary, aes(x = hour_of_day, y = avg_ride_duration,
fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Duration by Hour of Day Between Member and Casual",
x = "Hour of Day", y = "Average Duration (Minutes)",
fill = "Member/Casual") +
theme(legend.position = "bottom")
The graph above shows the average ride duration (in minutes) by hour of day for members and casual riders. The average ride duration for casual riders is higher than that of members at every hour.
The highest average ride duration of the day is 26.11 minutes, recorded for casual riders at 11 a.m. Casual ride duration rises sharply after 8 a.m., stays above 25 minutes from 10 a.m. to 2 p.m., and then broadly declines through the afternoon and evening.
Members follow a different pattern: their lowest average ride duration, 10.51 minutes, is recorded at 5 a.m., and their highest, 13.27 minutes, at 5 p.m.
# Plot average ride length by hour of day for members and casual riders.
ggplot(ride_duration_summary, aes(x = hour_of_day, y = avg_ride_length,
fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Length by Hour of Day Between Member and Casual",
x = "Hour of Day", y = "Average Ride Length (km)",
fill = "Member/Casual") +
theme(legend.position = "bottom")
The graph displays the average ride length (in kilometers) by hour of day for members and casual riders. Casual riders have a slightly higher average ride length than members from 1 a.m. to 4 a.m. and from 9 a.m. to 7 p.m., while members ride slightly farther during the morning commute (5 a.m. to 8 a.m.) and from 8 p.m. through the midnight hour.
Each rider type has its own pattern over the day. The highest average ride length for members is 2.27 km at 7 a.m., while the highest for casual riders is 2.36 km at 4 p.m. The lowest average ride length for members is 1.89 km at 3 a.m., while the lowest for casual riders is 1.98 km at 8 p.m.
A hypothesis in data analysis is a tentative explanation or prediction that proposes a possible relationship between variables. Hypotheses guide the data analysis process by providing a specific and testable explanation for a phenomenon or observation.
In this analysis, three hypotheses were formulated to explore potential differences between member and casual riders in terms of ride duration and length, as well as differences in the number of rides taken across different seasons. Two statistical modeling techniques, the t-test and ANOVA, were used to test the hypotheses and evaluate the statistical significance of the results. The t-test was used for hypotheses 1 and 2, while ANOVA was used for hypothesis 3.
t.test(avg_ride_duration ~ member_casual, data = ride_duration_summary)
##
## Welch Two Sample t-test
##
## data: avg_ride_duration by member_casual
## t = 12.421, df = 25.002, p-value = 3.415e-12
## alternative hypothesis: true difference in means between group casual and group member is not equal to 0
## 95 percent confidence interval:
## 7.114905 9.943382
## sample estimates:
## mean in group casual mean in group member
## 20.67949 12.15035
#Plot
ggplot(ride_duration_summary, aes(x = member_casual, y = avg_ride_duration)) +
geom_boxplot() +
labs(x = "Rider Type", y = "Ride Duration (minutes)") +
ggtitle("Distribution of Ride Duration by Rider Type")
The output of the t-test suggests that there is a significant difference in average ride duration between members and casual riders. Note that the test is run on the ride_duration_summary table, so each group consists of 24 hourly averages rather than individual rides. The mean of the hourly averages is 20.68 minutes for casual riders and 12.15 minutes for members, and the 95% confidence interval for the difference is 7.11 to 9.94 minutes. With t = 12.421 on 25.002 degrees of freedom and a p-value of 3.415e-12, there is strong evidence for rejecting the null hypothesis that the true difference in means between the two groups is 0.
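The t.test() call above is given ride_duration_summary, the hourly table, so each group contains 24 hourly averages and every hour counts as one observation. The sketch below re-enters those printed averages by hand (an assumption made so the block runs without the original data) and repeats Welch's test on them. Because the inputs are rounded as printed, the statistics should agree with the output above only approximately. A test on individual rides from cleaned_data would be a different analysis and is not attempted here.

```r
# Average ride duration (minutes) for hours 0-23, copied from the hourly table.
casual <- c(19.16063, 20.18658, 19.88483, 19.47687, 17.33578, 15.99421,
            16.11783, 15.11570, 16.66733, 21.63870, 25.48343, 26.11056,
            25.37249, 25.63903, 25.34080, 24.03236, 22.46893, 21.03642,
            20.71839, 20.78182, 20.28402, 19.25802, 18.78231, 19.42073)
member <- c(11.96146, 11.92407, 11.99664, 11.95756, 11.69781, 10.50820,
            10.89261, 11.45753, 11.30717, 11.41917, 12.17153, 12.64284,
            12.18605, 12.36834, 12.72825, 12.79964, 13.02258, 13.26508,
            13.05927, 12.77520, 12.58302, 12.30942, 12.26700, 12.30790)

# t.test() uses Welch's unequal-variance test by default (var.equal = FALSE).
res <- t.test(casual, member)

length(casual)   # observations per group
res$estimate     # group means of the hourly averages
res$conf.int     # 95% confidence interval for the difference in means
res$p.value
```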
t.test(avg_ride_length ~ member_casual, data = ride_duration_summary)
##
## Welch Two Sample t-test
##
## data: avg_ride_length by member_casual
## t = 2.3662, df = 45.618, p-value = 0.02227
## alternative hypothesis: true difference in means between group casual and group member is not equal to 0
## 95 percent confidence interval:
## 0.01021111 0.12673067
## sample estimates:
## mean in group casual mean in group member
## 2.153217 2.084746
#Plot
ggplot(ride_duration_summary, aes(x = member_casual, y = avg_ride_length)) +
geom_boxplot() +
labs(x = "Rider Type", y = "Ride Length (km)") +
ggtitle("Distribution of Ride Length by Rider Type")
The results of the t-test, again run on the 24 hourly averages per group, indicate a statistically significant difference in mean ride length between casual riders and members. The p-value of 0.02227 means that a difference this large would be unlikely if the true means were equal. The 95% confidence interval for the difference, 0.010 km to 0.127 km, excludes zero. It can therefore be concluded that the two groups have different average ride lengths, with casual riders taking slightly longer rides, although the difference (about 0.07 km) is small in practical terms.
anova1 <- aov(avg_ride_duration ~ season + member_casual, data = rides_by_season)
summary(anova1)
## Df Sum Sq Mean Sq F value Pr(>F)
## season 3 87 28.9 0.093 0.9625
## member_casual 1 1900 1900.1 6.146 0.0255 *
## Residuals 15 4637 309.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This table shows the results of the ANOVA used for hypothesis 3. As specified in the call above, the model's response is avg_ride_duration from rides_by_season, with season and member type as factors, so it tests average ride duration rather than the number of rides directly. The table lists the degrees of freedom (Df), the sum of squares (Sum Sq), the mean squares (Mean Sq), the F-value, and the p-value (Pr(>F)) for each factor and the residuals.
The “season” factor is not significant (F = 0.093, p = 0.9625), so the output gives no evidence that the response differs across seasons. The “member_casual” factor is significant at the 5% level (F = 6.146, p = 0.0255), indicating a difference between casual riders and members. The residual sum of squares is 4637, which represents the unexplained variation in the data.
Overall, the output shown supports an effect of member type (casual or member) but not of season. Testing differences in the number of rides across seasons would require a model with ride counts as the response.
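The F values and p-values in the ANOVA table can be re-derived from its printed sums of squares and degrees of freedom, which is a quick way to confirm how each row should be read. The sketch below is only an arithmetic check on the rounded figures in the table, so the results should match the printed values approximately.

```r
# Sums of squares and degrees of freedom as printed in the ANOVA table.
ss <- c(season = 87, member_casual = 1900, residuals = 4637)
df <- c(season = 3,  member_casual = 1,    residuals = 15)

# Mean square = sum of squares / degrees of freedom.
ms <- ss / df

# F = factor mean square / residual mean square.
f_value <- ms[c("season", "member_casual")] / ms[["residuals"]]

# Upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_value, df[c("season", "member_casual")], df[["residuals"]],
              lower.tail = FALSE)

round(f_value, 3)
round(p_value, 4)
```

A factor is significant at the 5% level only when its p-value is below 0.05.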
The analysis found that members account for 59.08% of rides and casual riders for 40.92%, and that the two groups have different preferences for rideable types. Members use classic bikes (30.19% of all rides) and electric bikes (28.9%), while casual riders use three rideable types: classic bikes (15.7%), electric bikes (22.13%), and docked bikes (3.09%).
Additionally, the analysis found a statistically significant difference in average ride duration between members and casual riders, and the pattern holds across all four time dimensions: season, month, day of the week, and hour of the day. Casual riders tend to take longer rides than members. By hour of day, the casual average peaks at 26.11 minutes at 11 a.m., while the member average peaks at 13.27 minutes at 5 p.m. By month, the casual average peaks at 25.74 minutes in March, while the member average peaks at 13.65 minutes in June.
On the other hand, the analysis found only a slight difference in average ride length between members and casual riders across the same four time dimensions. In the seasonal results, for example, average ride length is quite similar between members and casual riders for classic bikes and electric bikes. Members mostly use electric bikes in fall, summer, and winter, but not in spring, whereas casual riders mostly use classic bikes and docked bikes in every season. This may suggest that classic and docked bikes cost less per ride than electric bikes, or that members receive a special offer on electric bikes. Additionally, summer has the most rides for both rider types: casual riders recorded 405,179 rides (7.16% of all rides) in July, and members recorded 426,913 rides (7.54%) in August. Finally, by hour of day, casual riders have the longer average ride duration at every hour.
Finally, statistical tests were run to examine potential differences between members and casual riders in ride duration and length, as well as differences across seasons. The t-test for hypothesis 1 gives strong evidence for rejecting the null hypothesis that the true difference in means between the two groups is 0, which means there is a significant difference in average ride duration between members and casual riders. The t-test for hypothesis 2 indicates that casual riders and members have different average ride lengths, with casual riders taking slightly longer rides. For hypothesis 3, the ANOVA output shows a significant effect of member type (p = 0.0255) but no significant effect of season (p = 0.9625) on the modeled response, average ride duration.