Abstract

Cyclistic, a popular bike-share company based in Chicago, strives to maximize its annual memberships for sustainable growth. The company has an existing user base composed of both casual riders (who opt for single-ride or full-day passes) and annual members. The focus of this case study is to understand the usage patterns of these two groups, as the company believes that converting casual riders into annual members can significantly enhance its profitability. To devise effective marketing strategies for this conversion, the case study examines how these two groups use Cyclistic bikes differently, why casual riders might opt for annual memberships, and how digital media can facilitate this conversion.

Primary Business Question

The main business question we seek to address is:

How do annual members and casual riders use Cyclistic bikes differently?

To answer this question, we will delve into the following topics:

  1. Duration and frequency of rides: We will analyze the data to understand the typical length of rides for both user groups. Are there noticeable differences between casual riders and members? Are members using the bikes more frequently or for longer durations?

  2. Ride timings: We’ll investigate if there’s a difference in the time of day, day of the week, or month of the year when the two user groups typically use the bikes.

  3. Bike preferences: Given that Cyclistic offers various bike options, we will examine if there’s a pattern in the type of bike preferred by the two user groups.

Answering these questions will provide insights into the riding patterns of the two groups (annual members and casual riders) and will form the foundation for a marketing strategy aimed at converting casual riders into annual members. The strategy will be developed in alignment with these insights, addressing the unique needs and behaviors of casual riders to motivate their transition to becoming annual members.

Cyclistic’s collected data forms the backbone of this analysis. For transparency, we have included the source code in the paper’s Appendix. We strive to communicate our findings clearly, pairing concise language with comprehensible visualizations and straightforward recommendations.

About the Data

The data for this analysis is publicly available and located on Divvy’s (aka Cyclistic) Amazon S3 storage. The data is titled “Divvy Data” (Divvy 2023).

The data is organized in an anonymized format, detailing each trip’s start and end day/time, start and end stations, and rider type (Member or Casual). Trips taken by staff and any trips under 60 seconds in length have been excluded from this dataset.

Given the anonymization and rigorous data handling practices outlined by Divvy, we can consider the data to be Reliable, Original, Comprehensive, Current, and Cited (ROCCC). However, some level of bias could be inherent as the data only includes riders who choose to use Divvy’s services, and not all potential or actual cyclists in Chicago. Also, the credibility of the dataset is tied to Divvy’s data collection and handling processes.

The data is provided according to the Divvy Data License Agreement. Accessibility is facilitated through public download links, ensuring the data can be freely used for analysis while respecting privacy through anonymization. Security of the data in our analysis will be ensured by following best practices, such as secure data storage and handling.

Data integrity has been verified through initial exploratory data analysis to check for missing values, inconsistencies, or obvious outliers. Any issues found during this process will be documented and addressed appropriately.

This dataset is instrumental in answering our primary business question — understanding how annual members and casual riders use Cyclistic bikes differently. Each trip’s information, coupled with the type of rider, will offer insights into the usage patterns of the two user groups.

Potential problems with the data could include missing values, inconsistent data entries, or anomalies. These will be explored and addressed during the data cleaning and preprocessing stages of our analysis. If severe issues are encountered, they will be documented and taken into consideration when interpreting the analysis results.

Structure of the Data

We used the str() function in R to give us a detailed structure of the data.

## spc_tbl_ [6,230,310 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:6230310] "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ...
##  $ rideable_type     : Factor w/ 3 levels "classic_bike",..: 3 1 1 1 3 1 1 1 3 3 ...
##  $ started_at        : POSIXct[1:6230310], format: "2022-04-06 17:42:48" "2022-04-24 19:23:07" ...
##  $ ended_at          : POSIXct[1:6230310], format: "2022-04-06 17:54:36" "2022-04-24 19:43:17" ...
##  $ start_station_name: chr [1:6230310] "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ...
##  $ start_station_id  : chr [1:6230310] "515" "13075" "TA1307000121" "13075" ...
##  $ end_station_name  : chr [1:6230310] "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ...
##  $ end_station_id    : chr [1:6230310] "605" "TA1307000120" "TA1307000120" "KA1706005007" ...
##  $ start_lat         : num [1:6230310] 42 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:6230310] -87.7 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:6230310] 42.1 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:6230310] -87.7 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : Factor w/ 2 levels "casual","member": 2 2 2 1 2 2 2 2 2 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Here are some interesting observations from the output:

  • Data size: The dataframe contains over 6.23 million records, spread across 13 variables.

  • Data types: The variables in the dataframe are of various types, including character (e.g., ride_id, start_station_name, start_station_id, etc.), factor (e.g., rideable_type, member_casual), numeric (e.g., start_lat, start_lng, end_lat, end_lng), and POSIXct (e.g., started_at, ended_at). This variety indicates that the dataset is rich and contains different kinds of information.

  • Rideable types: There are three levels of rideable_type: classic_bike, docked_bike, and electric_bike.

  • Member categories: There are two levels of member_casual: casual and member.

  • Geographic coordinates: The data contains latitude and longitude information for both the start and end locations of rides. This information could be used to plot geographic routes or to analyze usage patterns across different locations.

  • Time information: The dataset includes when each ride started and ended. This information can be used to analyze the temporal patterns of bike usage.

  • Station information: The data includes detailed information about the start and end stations, including their names and IDs.

Duplicate Observations

Now we turn our attention to any broad issues in the data. The first task we will undertake is to check for duplicate rows. Fortunately, this dataset does not contain any duplicates.

## Total number of duplicate rows found: 0

Missing Values

Our next task is to look for missing data (or NA data, as it is commonly called). Here is what we found:

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##             902896             903028             964949             965090 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               6290               6290 
##      member_casual 
##                  0

From this output, we can see that there are missing values (NA) in some of the columns.

  • The start_station_name column has 902,896 missing values.
  • The start_station_id column has 903,028 missing values.
  • The end_station_name column has 964,949 missing values.
  • The end_station_id column has 965,090 missing values.
  • The end_lat and end_lng columns each have 6,290 missing values.

The other columns (ride_id, rideable_type, started_at, ended_at, start_lat, start_lng, and member_casual) do not have any missing values.

Missing data can be problematic because it may skew the results of our analysis or cause certain functions to fail. We will want to investigate why these values are missing and consider ways of handling them, such as filling them in with an average or other statistic (imputation) or removing the affected rows.
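
For illustration, here are two common ways such gaps could be handled in R, depending on whether the affected columns matter for a given analysis. This is a sketch only; the object names (rides_with_coords, rides_labeled) are hypothetical, and these steps are not necessarily applied later in this report.

# Option 1: keep only rides that have end coordinates (drops a small fraction of rows)
rides_with_coords <- combined_data_df %>%
    filter(!is.na(end_lat), !is.na(end_lng))

# Option 2: keep every ride but label missing station names explicitly
rides_labeled <- combined_data_df %>%
    mutate(start_station_name = replace_na(start_station_name, "Unknown"),
           end_station_name   = replace_na(end_station_name, "Unknown"))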

The differences in the number of missing values between the start and end station names and IDs could indicate inconsistencies in data collection or recording. This may require further investigation to ensure that any analyses we conduct on this data are valid.

Summary Statistics

Now we can run some summary statistics to get an overall view of the data’s makeup and how we might approach our analysis.

Data summary

Name                     combined_data_df
Number of rows           6230310
Number of columns        13

Column type frequency:
  character              5
  factor                 2
  numeric                4
  POSIXct                2

Group variables          None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 6230310 0
start_station_name 902896 0.86 3 64 0 1722 0
start_station_id 903028 0.86 3 36 0 1320 0
end_station_name 964949 0.85 3 64 0 1742 0
end_station_id 965090 0.85 3 36 0 1324 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
rideable_type 0 1 FALSE 3 ele: 3238379, cla: 2809297, doc: 182634
member_casual 0 1 FALSE 2 mem: 3745586, cas: 2484724

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 42.07 ▁▁▇▇▁
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -87.52 ▁▁▆▇▁
end_lat 6290 1 41.90 0.07 0.00 41.88 41.90 41.93 42.37 ▁▁▁▁▇
end_lng 6290 1 -87.65 0.10 -88.14 -87.66 -87.64 -87.63 0.00 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2022-04-01 00:01:48 2023-04-30 23:59:05 2022-08-21 12:55:50 5242951
ended_at 0 1 2022-04-01 00:02:15 2023-05-03 10:37:12 2022-08-21 13:20:40 5257508

Here are some insights we can glean:

  • Data dimensions: The dataset contains 6,230,310 rows (observations) and 13 columns (variables). It’s a large dataset.

  • Variable types: There are five character variables, two factors, four numeric variables, and two POSIXct (date-time) variables. This mix of types suggests a varied dataset, likely to contain both categorical and continuous data, as well as time series data.

  • Missing Values: The start and end station names and IDs have around 15% missing values, given that the complete rate is around 0.85. This could be a potential area of concern depending on how we want to use this data.

  • Unique Values: The ride_id variable is unique for every row, which correctly suggests that it is a unique identifier for each ride.

  • Rideable Type: There are three types of bikes used in these rides: electric bikes, classic bikes, and docked bikes. The most common type of bike used is the electric bike, followed by the classic bike.

  • Membership: There are more rides by members (3,745,586) than by casual users (2,484,724).

  • Locations: The latitude and longitude for start and end points of the rides are quite concentrated around certain values (41.9 for latitude and -87.6 for longitude), suggesting that most rides occur in a relatively confined geographical area.

  • Dates: The started_at and ended_at fields indicate that the data spans from 1st April 2022 to 30th April 2023, so just over a year’s worth of ride data.

Processing the Data

Turning our attention to processing the data, we can review the decisions we have made thus far and augment the data as needed to facilitate the upcoming analysis.

Key Decisions

Here are the decisions made so far:

  • Tools chosen include the R programming language along with the tidyverse, purrr, knitr, skimr, and readr libraries. These tools were chosen for their strength in data manipulation, visualization, and analysis: tidyverse includes several useful packages for data cleaning and analysis, purrr enhances R’s functional programming capabilities, knitr provides dynamic report generation, and skimr and readr help with data summarization and importing.

  • The data’s integrity is ensured through multiple steps: first, the dataset is loaded directly from a local “.rds” file if one exists, avoiding any corruption that could occur during repeated read operations; if the “.rds” file does not exist, the CSV files are read directly from the source directory, preserving the original data’s integrity.

  • Data cleaning steps taken involve checking for missing values and duplicated records. The number of missing values in each column is found using colSums(is.na(combined_data_df)), and duplicated records are found using duplicated(combined_data_df).

  • Data verification for readiness for analysis is done by summarizing the dataframe using skim(combined_data_df) which provides detailed summary statistics. This summary provides a good understanding of the data’s distribution and potential issues (like extreme outliers).

  • Cleaning processes are documented in this markdown document, including checking for missing values, identifying duplicates, and generating summary statistics. This document serves as a record of the cleaning process and can be shared or reviewed as needed.

Additional Data Changes

Beyond the standard processing steps to get the data ready for analysis, we decided to make a few modifications to the dataframe to support additional queries we anticipate. The changes are as follows (a condensed code sketch follows this list):

  • Calculate Trip Time: A ride_length column was added, derived by subtracting started_at from ended_at. The result is the total minutes elapsed for the ride.

  • Start Date: One new column (started_date) was added to strip the time from the original started_at column. This makes it easier to plot rental trends by day.

  • Day of Week: Two new columns, day_of_week_start and day_of_week_end, were added to make it easier to analyze the specific day (e.g., Monday, Tuesday) on which a trip started or ended. These are stored as factors for more advanced analysis.
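
A condensed version of the Appendix code that adds these columns is shown below; wday() comes from lubridate.

# ride length in minutes
combined_data_df$ride_length <-
    round(as.numeric(difftime(combined_data_df$ended_at,
                              combined_data_df$started_at,
                              units = "mins")), 2)

# calendar date the ride started (no time component)
combined_data_df$started_date <- as.Date(combined_data_df$started_at)

# day of week the ride started and ended
combined_data_df$day_of_week_start <- wday(combined_data_df$started_at, label = TRUE)
combined_data_df$day_of_week_end   <- wday(combined_data_df$ended_at, label = TRUE)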

Removing Outliers

After adding the ride_length variable, we discovered some outliers in the data. Specifically, we found start and end times that would result in a negative (or very small positive) ride length, as well as values far beyond what could reasonably be considered a normal ride. Ultimately, we decided to limit the data to rentals lasting between five minutes and two days.
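
The filter itself is a single dplyr step (condensed from the Appendix); since ride_length is measured in minutes, two days corresponds to 2,880 minutes.

# keep only rides longer than five minutes and no longer than two days
combined_data_df <- combined_data_df %>%
    filter(ride_length > 5 & ride_length <= 2880)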

Analysis of the Data

Basic Rental Metrics

Overall, it appears that casual customers rent bikes for longer periods of time compared to members. This information can be valuable for our strategic planning. For instance, it might be worth exploring if there are ways to encourage casual customers to become members, given their high usage. It’s also important to explore the reasons behind the high maximum ride lengths, particularly for casual customers, as this might point to potential misuse or misunderstanding of rental policies.

##   Metric  Member  Casual
## 1   Mean   15.57   27.03
## 2    Max 1559.67 2878.82
## 3    Min    5.02    5.02
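
For reference, the same figures can be produced with a single grouped summary, equivalent to the step-by-step computation in the Appendix but with one row per rider type.

# mean, max, and min ride length (in minutes) by rider type
combined_data_df %>%
    group_by(member_casual) %>%
    summarise(mean_minutes = round(mean(as.numeric(ride_length)), 2),
              max_minutes  = round(max(as.numeric(ride_length)), 2),
              min_minutes  = round(min(as.numeric(ride_length)), 2))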

Daily Bike Rentals Over Time

Now let’s take a step back and look at the entire year of data we have to get some perspective on rental trends between the two groups. The graph below shows the daily bike rentals from April 2022 to April 2023.

As we can see, there is a definite seasonal trend in bike rentals overall, and particularly between the two groups. Most notably, member rentals decline in the winter months but almost never bottom out, whereas casual rentals flatten considerably in winter and make up for it in volume during the warmer months. If we are looking to convert casual riders to members, an incentive targeted at the warmer months should be considered.
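
For reference, the daily counts plotted for each group come from grouping rides by their start date (condensed from the Appendix).

# number of rides per day for each group
daily_counts_member <- member_df %>% group_by(started_date) %>% summarise(count = n())
daily_counts_casual <- casual_df %>% group_by(started_date) %>% summarise(count = n())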

Bike Preferences Overall

When we looked at bike type preference, we found that members marginally prefer classic bikes. The same is not true for casual riders, who tend to prefer electric bikes. Perhaps incentives around electric bike rentals should be considered to move casual riders toward becoming members.

NOTE: While the vast majority of bikes are classic or electric, it appears that there are still some with the older designation of docked_bike. A docked bike is a traditional type of shared bike that needs to be picked up and returned to a specific docking station. Docking stations are fixed physical locations distributed throughout a city where bikes are stored. When users are finished with their ride, they must return the bike to one of these stations and dock it correctly for the trip to end.

Cyclistic, which operates in Chicago, uses both docked bikes and “dockless” bikes. Dockless bikes, also called free-floating bikes, can be picked up and left anywhere within a designated area. They have built-in locks and GPS technology that allow the system to keep track of the bikes’ locations.

Day and Time Preferences

Now that we know the seasonal differences in rentals, we can turn our attention to the day and time when most rentals happen for our two groups. The heat maps below show us this information.

For member rentals, we can clearly see that most rentals happen Monday through Thursday and cluster around the morning and evening commute windows. Surprisingly, fewer rentals happen in the morning than in the evening, which might suggest some riders take alternate transport to work and then rent a bike at the end of the day to go home. Further study to discover what those modes might be would be useful in building a marketing strategy.

For casual rentals, the pattern is reversed. Most rentals happen on the weekends, and rentals are fairly consistent throughout those days. This suggests that casual rentals are being used for leisure rather than commuting.
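
The hour-by-day frequencies behind these heat maps come from grouping each group’s rides by start hour and weekday. A condensed version of the Appendix code for members is shown below; the casual version is identical apart from the input data frame. hour() and wday() are lubridate functions.

member_heat_df <- member_df %>%
    mutate(hour = hour(started_at),
           day = wday(started_at, label = TRUE)) %>%
    group_by(hour, day) %>%
    summarise(freq = n(), .groups = "drop")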

Bike Preferences Over Time

Looking at bike preferences over time between the two groups, we don’t see many surprises. However, there is a very notable split in the warmer months, where members are clearly choosing classic bikes over electric. This could be due to a host of reasons, not least simple supply and demand, since this is the busiest time of the year for rentals. However, it should be examined to ensure there isn’t some other causal factor.

Conclusion

Upon examining the bike-sharing data from Cyclistic, we’ve unearthed several key insights that could steer the company’s strategic decision-making, especially when it comes to converting casual riders into members:

  1. Seasonality: We observed a clear pattern of bike rentals across the seasons. Both members and casual riders tend to rent more during warmer months and less during winter. However, members’ usage is steadier across the year, while casual riders’ rentals significantly drop during winter. Therefore, memberships offer a consistent revenue stream across the year.

  2. Rental Duration: Casual riders appear to rent bikes for longer periods than members. This trend could indicate that casual riders embark on longer, less hurried trips. Understanding the reasons behind these long rides could guide our marketing strategies to convert more casual riders into members.

  3. Bike Preference: Members seem to prefer classic bikes slightly more than casual riders who lean towards electric bikes. Creating incentives related to electric bike rentals could persuade casual riders to sign up for a membership.

  4. Day and Time: Members usually rent bikes on weekdays during commute hours, suggesting that they use the service for work or school transportation. In contrast, casual riders tend to rent bikes on weekends throughout the day, indicating leisure usage. Marketing campaigns can highlight the benefits of membership for both commuting and leisure to attract casual riders.

  5. Data Quality: We noticed some areas where the data quality could be improved. Specifically, the absence of data in the start and end station names and IDs could limit the depth of our analysis. Enhanced data collection or cleaning methods could improve this aspect and provide more valuable data for future analysis.

In conclusion, these findings can help Cyclistic design targeted marketing strategies to convert casual riders into members: for instance, promoting membership benefits for commuting and leisure, offering incentives for electric bike rentals, and intensifying marketing efforts during the warm seasons when casual usage soars. Improving data collection practices could provide a richer data set, which would be a powerful asset for future decision-making.

Recommendations

Based on our analysis, we offer the following top three recommendations:

  1. Develop Seasonal Marketing Campaigns: Since bike rentals show a clear seasonal trend with casual riders renting more during warmer months, consider developing marketing campaigns that align with this trend. Use targeted advertising during these periods to highlight the benefits of membership, possibly offering seasonal discounts or promotional codes to incentivize casual riders to convert into members.

  2. Promote Electric Bikes: As casual riders show a preference for electric bikes, there could be potential in offering membership benefits specifically related to these bikes. Special member rates, priority access, or loyalty rewards for electric bike usage could appeal to these casual riders and motivate them to sign up for membership.

  3. Highlight Membership Benefits for Leisure and Commute: Given the differing usage patterns between members and casual users, consider tailoring communication to highlight the benefits of membership for both commuting and leisure use. For instance, emphasize the cost-efficiency of membership for weekday commuting to work or school, and the convenience and enjoyment for weekend leisure rides. This dual-focused approach could resonate with casual riders who are using the service for different purposes.

Further Study

To make predictions about when a casual rider might become a member, we need more comprehensive data. This includes tracking individual user behavior over time: ride frequency, duration, preferred stations, and other relevant features. In particular, we need data in which we can observe casual riders actually transitioning to membership.

With the appropriate data, we could apply techniques such as survival analysis or other time-to-event models. These models could potentially predict the timing of a casual rider converting to a member.
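
As a purely illustrative sketch, and only if the rider-level data described above were collected: a Cox proportional hazards model from R’s survival package could relate riding behavior to the time it takes a casual rider to convert. The data frame rider_history_df and its columns (days_observed, converted, rides_per_month, avg_ride_length) are hypothetical; nothing like them exists in the current trip data.

# Illustrative only: rider_history_df and all of its columns are hypothetical.
# days_observed = days from a rider's first casual ride to conversion or censoring
# converted     = 1 if the rider became a member during the study window, else 0
library(survival)

conversion_fit <- coxph(Surv(days_observed, converted) ~ rides_per_month + avg_ride_length,
                        data = rider_history_df)
summary(conversion_fit)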

However, we should recognize the complexity of this task. Numerous factors influence a casual rider’s decision to become a member, including personal financial circumstances, lifestyle choices, and the quality of their experiences with the service. Understanding these nuances will require a multi-dimensional approach and thoughtful interpretation of the data.

Therefore, it’s crucial that we broaden our data collection and analysis to better understand these dynamics. By doing so, we will be able to create accurate models and develop strategic insights to drive membership growth.

Appendix

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2023. Rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Divvy. 2023. “Divvy Data.” https://divvy-tripdata.s3.amazonaws.com/index.html.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.
RStudio Team. 2020. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, PBC. http://www.rstudio.com/.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. https://yihui.org/knitr/.

Source Code

All source code is written in the R Programming Language (R Core Team 2019) using the R Studio IDE (RStudio Team 2020). The packages used include the popular tidyverse (Wickham et al. 2019) set of packages for doing most of the heavy lifting in the code, knitr (Xie 2015) for rendering the R Markdown (Allaire et al. 2023) documents, and skimr (Waring et al. 2022) for creating detailed summaries of the data. Finally, for generating the visuals, we used the ggplot2 (Wickham 2016) package.

# set a seed in case we use any random numbers
set.seed(1337)


# Set the names of the packages and libraries you want to install
required_libraries <- c("tidyverse", "purrr", "knitr", "skimr", "readr")

# Install missing packages and load all required libraries
for (lib in required_libraries) {
  if (!requireNamespace(lib, quietly = TRUE)) {
    install.packages(lib)
  }
  library(lib, character.only = TRUE)
}

# load all the cycling data for analysis
# our data runs from 04/2022 to 04/2023
if(file.exists("combined_data_df.rds")) {
    # read in our combined data
    combined_data_df <- readRDS("combined_data_df.rds")
    
} else {
    
    # get file list
    file_list <- list.files(path = "SourceData/", pattern = "*.csv", full.names = TRUE)
    
    # read and combine files
    combined_data_df <- map_df(file_list, read_csv)
    
    # convert rideable_type and member_casual into factors
    combined_data_df$rideable_type <- as.factor(combined_data_df$rideable_type)
    combined_data_df$member_casual <- as.factor(combined_data_df$member_casual)
    
    # save the combined_data object so we don't have to reload this data again
    # unless it changes
    write_rds(combined_data_df, "combined_data_df.rds")
}


# examine the structure of the data
str(combined_data_df)


# Identify duplicates
duplicates <- duplicated(combined_data_df)

# Count duplicates
num_duplicates <- sum(duplicates)

# Show the number of duplicates
cat("Total number of duplicate rows found:",num_duplicates)


# Check for missing values
colSums(is.na(combined_data_df))


# Print detailed summary statistics of the dataframe
skim(combined_data_df)


# calculate the total ride time in minutes for each ride; using difftime()
# with explicit units avoids relying on the unit R picks automatically when
# subtracting two POSIXct vectors
combined_data_df$ride_length <- 
    round(as.numeric(difftime(combined_data_df$ended_at, 
                              combined_data_df$started_at, 
                              units = "mins")), 2)


# Add a new column for the date without the time
combined_data_df$started_date <- as.Date(combined_data_df$started_at)


# Create the 'day_of_week_start' and 'day_of_week_end' columns
combined_data_df$day_of_week_start <- wday(combined_data_df$started_at, label = TRUE)
combined_data_df$day_of_week_end <- wday(combined_data_df$ended_at, label = TRUE)

# Convert the new columns to factors
combined_data_df$day_of_week_start <- as.factor(combined_data_df$day_of_week_start)
combined_data_df$day_of_week_end <- as.factor(combined_data_df$day_of_week_end)



# Remove outliers by keeping only rides longer than five minutes
# and no longer than two days (2880 minutes)
combined_data_df <- combined_data_df %>% filter(ride_length <= 2880 & 
                                                    ride_length > 5) 



# split our data into two groups: member and casual
member_df <- combined_data_df %>% filter(member_casual == "member") 
casual_df <- combined_data_df %>% filter(member_casual == "casual") 

# ride_length metrics for both groups
ride_length_mean_member <- mean(as.numeric(member_df$ride_length))
ride_length_max_member <- max(as.numeric(member_df$ride_length))
ride_length_min_member <- min(as.numeric(member_df$ride_length))

ride_length_mean_casual <- mean(as.numeric(casual_df$ride_length))
ride_length_max_casual <- max(as.numeric(casual_df$ride_length))
ride_length_min_casual <- min(as.numeric(casual_df$ride_length))

# Create a data frame
ride_length_metrics <- data.frame(
    Metric = c("Mean", "Max", "Min"),
    Member = round(c(ride_length_mean_member, ride_length_max_member, ride_length_min_member), 2),
    Casual = round(c(ride_length_mean_casual, ride_length_max_casual, ride_length_min_casual), 2)
)

# Print the data frame
print(ride_length_metrics)



# Group by date and count the number of rides each day per group
daily_counts_member <- member_df %>% group_by(started_date) %>% summarise(count = n())
daily_counts_casual <- casual_df %>% group_by(started_date) %>% summarise(count = n())


# super line graph for members
ggplot(daily_counts_member, aes(x = started_date, y = count)) +
    geom_line(color = "darkblue") +
    geom_smooth(method = "loess", se = FALSE, color = "red", linetype="dashed") +
    labs(x = "Date", 
         y = "Number of Rentals", 
         title = "Daily Bike Rentals for Members", 
         subtitle = "With trend line",
         caption = "Source: https://divvybikes.com/system-data") +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold", size = 20),
          plot.subtitle = element_text(face = "italic", size = 12),
          axis.title = element_text(face = "bold", size = 14),
          axis.text = element_text(size = 12))

# super line graph for casual
ggplot(daily_counts_casual, aes(x = started_date, y = count)) +
    geom_line(color = "darkblue") +
    geom_smooth(method = "loess", se = FALSE, color = "red", linetype="dashed") +
    labs(x = "Date", 
         y = "Number of Rentals", 
         title = "Daily Bike Rentals for Casual", 
         subtitle = "With trend line",
         caption = "Source: https://divvybikes.com/system-data") +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold", size = 20),
          plot.subtitle = element_text(face = "italic", size = 12),
          axis.title = element_text(face = "bold", size = 14),
          axis.text = element_text(size = 12))




# Calculate the counts
member_counts <- member_df %>% group_by(rideable_type) %>% summarise(count = n())
casual_counts <- casual_df %>% group_by(rideable_type) %>% summarise(count = n())

# Combine the data into one dataframe
bike_pref_df <- rbind(
    mutate(member_counts, user_type = "Member"),
    mutate(casual_counts, user_type = "Casual")
)

# Plot
ggplot(bike_pref_df, aes(x = rideable_type, y = count, fill = user_type)) +
    geom_bar(stat = "identity", position = "dodge") +
    geom_text(aes(label=count), vjust=-0.3, position = position_dodge(0.9)) +
    labs(x = "Bike Type", y = "Count", fill = "User Type", title = "Bike Type Preference by User Type") +
    theme_minimal()



# Extract hour and day of week from started_at for members
# (stored in a separate data frame so member_df keeps its trip-level data)
member_heat_df <- member_df %>%
    mutate(hour = hour(started_at),
           day = wday(started_at, label = TRUE)) %>%
    group_by(hour, day) %>%
    summarise(freq = n(), .groups = "drop")

# Extract hour and day of week from started_at for casual customers
casual_heat_df <- casual_df %>%
    mutate(hour = hour(started_at),
           day = wday(started_at, label = TRUE)) %>%
    group_by(hour, day) %>%
    summarise(freq = n(), .groups = "drop")



# Create the heatmap for member rentals
ggplot(member_heat_df, aes(x = day, y = hour, fill = freq)) +
    geom_tile(color = "white") +
    scale_fill_gradient(low = "light blue", high = "dark blue") +
    labs(title = "Member Rentals by Time and Day of Week",
         x = "Day of Week",
         y = "Hour of Day",
         fill = "Frequency") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))



# Create the heatmap for casual rentals
ggplot(casual_heat_df, aes(x = day, y = hour, fill = freq)) +
    geom_tile(color = "white") +
    scale_fill_gradient(low = "light blue", high = "dark blue") +
    labs(title = "Casual Rentals by Time and Day of Week",
         x = "Day of Week",
         y = "Hour of Day",
         fill = "Frequency") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))



# Group by date, user type and bike type to get counts
daily_bike_counts <- combined_data_df %>% 
    group_by(started_date, member_casual, rideable_type) %>%
    summarise(count = n()) %>%
    ungroup()

# Now we'll split the data into two dataframes for member and casual
member_daily_bike_counts <- daily_bike_counts %>% 
    filter(member_casual == "member")

casual_daily_bike_counts <- daily_bike_counts %>% 
    filter(member_casual == "casual")

# Define our color palette
color_palette <- c("classic_bike" = "blue", "electric_bike" = "red", "docked_bike" = "green")

# Plot for Members
ggplot(member_daily_bike_counts, aes(x = started_date, y = count, color = rideable_type)) +
    geom_line() +
    scale_color_manual(values = color_palette) +
    labs(x = "Date", y = "Number of Rentals", 
         color = "Bike Type",
         title = "Daily Bike Type Preferences of Members") +
    theme_minimal()

# Plot for Casual users
ggplot(casual_daily_bike_counts, aes(x = started_date, y = count, color = rideable_type)) +
    geom_line() +
    scale_color_manual(values = color_palette) +
    labs(x = "Date", y = "Number of Rentals", 
         color = "Bike Type",
         title = "Daily Bike Type Preferences of Casual Users") +
    theme_minimal()