R Markdown

Introduction

Hello! My name is Laura and I have been learning the basics of data analytics through Coursera’s Google Data Analytics Professional Certificate over the past few months. To illustrate my learning and the skills and technologies I have picked up along the way, I have created the following Capstone project.

Scenario

I am a junior data analyst working within the marketing team at Cyclistic, a bike-share company in Chicago. The director of marketing is looking to see how Cyclistic can increase the adoption of annual memberships, which are much more profitable, as opposed to casual usage. The director of marketing believes that Cyclistic’s success relies on this. In order to answer the broader question of how to convert casual riders into members, I will be exploring how casual users and members use Cyclistic bikes differently and looking at data from the past 12 months to explore this.

The data I have been given is hosted on an AWS web page, where the data for each month from November 2023 to October 2024 is contained within a zip file. This is data that comes directly from Cyclistic and is therefore original and reliable. It runs up to October 2024, so is therefore current, and it is clearly cited where the data has come from. Whilst it contains data on all bike rides taken within the last 12 months, no data was collected on individual customers (for data privacy reasons). There is therefore enough data for us to answer the business question, but the data is maybe not as comprehensive as it could potentially be.

Process (Google Sheets):

The CSV files containing the monthly data were each too large to import to Google Sheets for cleaning. The csv file for May 2024 was split in two parts using splitcsv.com to allow for upload to Google Sheets and was used as a tester to explore the data.

The following two columns were then added to the sheet:

The May 2024 sheet was then filtered to check for missing data.

~86,000 rows didn’t have an end station specified. ~80,000 rows didn’t have a start station specified. 35,647 rows didn’t have either a start or end station. All rows, however, do have a latitude. This data will therefore be kept in the dataset as these data points could be used for geo mapping. Considerations will therefore need to be made if analysing station names. 688 rows didn’t have an end latitude or longitude. The time of these journeys were also shown to be 24+ hours. These are likely to be bikes that broke during a journey or were stolen. These will be deleted from the dataset as these are unusual journeys and we only want to analyse data for journeys that were completed. It is worth noting that these journeys were all taken by classic bikes. All other columns had full sets of data.

The maximum and minimum of each column was checked to ensure that there were no anomalies. In the ride_length column, there were a number of rides that were shown as having a negative length. These rows were deleted as the data here is likely to have been incorrectly captured. A lot of these rows also had start and/or end station data missing.

The ride_id column was also checked for duplicates - none were found.

It was quickly ascertained that analysing all 12 months of data in Google sheets was going to be difficult as the total number of rows was going to be ~6 million. The method for analysis was switched to R/Posit Studio Cloud. Using Google Sheets had, however, familiarised me with the data and therefore will make cleaning it in R more efficient.

Process (R/Posit Studio Cloud):

The CSV files for the 12 months leading up to October 2024 were downloaded from the following site: https://divvy-tripdata.s3.amazonaws.com/index.html

The tidyverse, janitor and lubridate packages were installed using the following

install.packages('tidyverse')
install.packages('janitor')
install.packages('lubridate')

library(tidyverse)
library(janitor)
library(lubridate)

The files were uploaded to Posit Studio Cloud using the ‘Upload’ function in the workspace browser. “Read.csv” was then used to upload each csv file to R and to give each file a more readable pseudonym:

nov2023 <- read_csv("202311-divvy-tripdata.csv")
dec2023 <- read_csv("202312-divvy-tripdata.csv")
jan2024 <- read_csv("202401-divvy-tripdata.csv")
feb2024 <- read_csv("202402-divvy-tripdata.csv")
mar2024 <- read_csv("202403-divvy-tripdata.csv")
apr2024 <- read_csv("202404-divvy-tripdata.csv")
may2024 <- read.csv("202405-divvy-tripdata.csv")
jun2024 <- read.csv("202406-divvy-tripdata.csv")
jul2024 <- read.csv("202407-divvy-tripdata.csv")
aug2024 <- read.csv("202408-divvy-tripdata.csv")
sep2024 <- read.csv("202409-divvy-tripdata.csv")
oct2024 <- read.csv("202410-divvy-tripdata.csv")

The different csv files then had to be merged to create one dataframe. To check that the csv files were compatible and there are no discrepancies with formatting (e.g. each ID column had the same number of digits, each date column was formatted as dd/mm/yyyy etc.), the format of each column was checked using the ‘str()’ method.

It was found that for tables nov2023 - apr2024, the started_at and ended_at columns were formatted as datetime, whereas for tables may2024 - oct2024, these columns were formatted as characters.

To ensure consistent formatting, the started_at and ended_at columns in tables may2024, jun2024, jul2024, aug2024, sep2024 and oct2024 needed to be converted to datetime. This was done using the following method:

may2024$started_at <- as.POSIXlt (may2024$started_at, format = "%m/%d/%Y %H:%M")
may2024$ended_at <- as.POSIXlt (may2024$ended_at, format = "%m/%d/%Y %H:%M")
str(may2024)
rm(may2024)
may2024 <- read_csv("202405-divvy-tripdata.csv")

jun2024$started_at <- as.POSIXlt (jun2024$started_at, format = "%m/%d/%Y %H:%M")
jun2024$ended_at <- as.POSIXlt (jun2024$ended_at, format = "%m/%d/%Y %H:%M")
str(jun2024)
rm(jun2024)
jun2024 <- read_csv("202406-divvy-tripdata.csv")

jul2024$started_at <- as.POSIXlt (jul2024$started_at, format = "%m/%d/%Y %H:%M")
jul2024$ended_at <- as.POSIXlt (jul2024$ended_at, format = "%m/%d/%Y %H:%M")
str(jul2024)
rm(jul2024)
jul2024 <- read_csv("202407-divvy-tripdata.csv")

aug2024$started_at <- as.POSIXlt (aug2024$started_at, format = "%m/%d/%Y %H:%M")
aug2024$ended_at <- as.POSIXlt (aug2024$ended_at, format = "%m/%d/%Y %H:%M")
str(aug2024)
rm(aug2024)
aug2024 <- read_csv("202408-divvy-tripdata.csv")

sep2024$started_at <- as.POSIXlt (sep2024$started_at, format = "%m/%d/%Y %H:%M")
sep2024$ended_at <- as.POSIXlt (sep2024$ended_at, format = "%m/%d/%Y %H:%M")
str(sep2024)
rm(sep2024)
sep2024 <- read_csv("202409-divvy-tripdata.csv")

oct2024$started_at <- as.POSIXlt (oct2024$started_at, format = "%m/%d/%Y %H:%M")
oct2024$ended_at <- as.POSIXlt (oct2024$ended_at, format = "%m/%d/%Y %H:%M")
str(oct2024)
rm(oct2024)
oct2024 <- read_csv("202410-divvy-tripdata.csv")

Once the data was consistent, the tables were merged together to form a table called cycles2024 using the following code:

#Create merge table
cycles2024 <- bind_rows(nov2023, dec2023, jan2024, feb2024, mar2024, apr2024, may2024, jun2024, jul2024, aug2024, sep2024, oct2024)

#Drop tables for individual months to save processing space
rm(nov2023)
rm(dec2023)
rm(jan2024)
rm(feb2024)
rm(mar2024)
rm(apr2024)
rm(may2024)
rm(jun2024)
rm(jul2024)
rm(aug2024)
rm(sep2024)
rm(oct2024)
gc()

This created a table of all 12 months of data, containing 5,933,712 rows and 13 columns.

The data in the aggregated table then needed an initial clean. Empty rows and duplicates were removed by using the following:

#Remove empty rows
remove_empty(cycles2024, which = c())

#Remove duplicated rows
cycles2024clean <- cycles2024[!duplicated(cycles2024$ride_id), ]

“Ride_id” was used as the primary key here as each ride was assigned a unique key, therefore duplicates of this needed to be removed. 211 rows were removed, so the new table contains 5,933,501 rows, each of which contained a unique ride ID.

Two further columns then needed to be added in order to fully analyse data to answer the business question of how monthly and casual users differ. My hypothesis is that annual members are using the bikes more to commute, meaning peak usage on weekday mornings and shorter journeys, and casual members are using the bikes for leisure, meaning peak usage at the weekend and longer journeys. To analyse this, two columns needed to be added: weekday to assess the difference in usage between weekdays and weekends; and journey_length, to assess the journey types that these users are taking.

#Adding the journey length column
cycles2024clean <- cycles2024clean %>% mutate(journey_length = difftime((`ended_at`),(`started_at`), units = "mins"))

#Converting output to hh:mm:ss format
cycles2024clean$journey_length_hms <- hms(minutes = as.numeric(cycles2024clean$journey_length))

#Adding the weekday column
cycles2024clean <- cycles2024clean %>% mutate(weekday = weekdays(`started_at`))

Now that these columns have been added in, rows where journey_length is <0 will be deleted as it is likely here that the data has been incorrectly inputted (ended_at before started_at etc.). Furthermore, rows where journey_length is >24 hours will also be deleted as, from our exploration using Google Sheets, almost all of these didn’t have end stations and are likely to be incomplete journeys (e.g. bikes was stolen or became faulty), therefore not representing a typical journey.

#Filtering for journey times < 0 minutes
cycles2024better <- cycles2024clean %>% filter(journey_length>0)

#Filtering for journey times > 24 hours
cycles2024better <- cycles2024clean %>% filter(journey_length>0, journey_length<1439.617)

#Drop cycles2024clean, cycles2024 and all additional tables to save processing space
rm(cycles2024clean)
rm(cycles2024)
gc()

The Cyclistic data has now been cleaned and prepared for analysis.

Analyse

To start analysing, ggplot was installed using the following:

install.packages("ggplot2")
library(ggplot2)

To focus back on the business question of how annual and casual members of Cyclistic use bikes differently, I will assess whether my hypothesis is correct. To refresh, my hypothesis is that members use bikes more frequently for short journeys during the week (i.e. commuting), and casual users use bikes more for leisure. To assess whether this is correct, I will draw comparisons between the following data points:

Number of Journeys Taken

#To prevent scientific notation
options(scipen = 999)

#Creating the graph
ggplot(data = cycles2024better) +
geom_bar(mapping = aes(x=member_casual, fill="orange")) +
labs(title="Total Member Usage", x="Member Type", y="Annual Usage")

As a brief overview, this graph shows that members took many more journeys than casual users in the last 12 months. This would fit in with the hypothesis that members take more frequent, shorter journeys and casual users make fewer, longer journeys. Additional data that would be helpful to analyse this point would be for customer IDs to be assigned to each journey. This would allow us to analyse whether the above is because there are more members than casual users, or because members take more journeys on average than casual users

Average Journey Length

#Calculating the average journey length
mean(cycles2024better$journey_length)
## Time difference of 15.36926 mins
#Making two tables to help assess if there is a significant difference in journey length between members and casual users
membercycles2024 <- cycles2024better %>% filter(member_casual=="member")

mean(membercycles2024$journey_length)
## Time difference of 12.19586 mins
casualcycles2024 <- cycles2024better %>% filter(member_casual=="casual")

mean(casualcycles2024$journey_length)
## Time difference of 20.88599 mins

Both of these show that members tend to take shorter journeys than casual users.

Let’s break this down further and look at the average journey length over the course of a week.

Weekly Usage

#Ensuring weekdays show in the correct order
cycles2024better$weekday <- ordered(cycles2024better$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

#Graph created
ggplot(cycles2024better) +
  geom_bar(mapping = aes(x=weekday, fill=member_casual), position="dodge2") +
  theme_bw() +
  labs(x="Weekday", y="Total Rides")

The above graph shows peak member usage during weekdays which then drops off during the weekend, and peak casual usage on Saturday followed by Sunday, with gradual increases during the week.

To combine these two observations (difference in average journey length and difference in usage across the week), attempts were made to create a graph to show if the average journey length for each user type differed across weekdays.

#ggplot(nov2023, aes(x=weekday, y=mean(journey_length), fill = member_casual))+
    #geom_col(position = "dodge2")

#The above wouldn't run with the full 12 months of data. The below was therefore run to create two tibbles to show the average journey length for each member type on each day of the week

casualweekdayaverage <- casualcycles2024 %>% 
    group_by(weekday) %>% 
   summarise(mean(journey_length)) %>% 
  mutate(member_type = "casual")

memberweekdayaverage <- membercycles2024 %>% 
    group_by(weekday) %>% 
   summarise(mean(journey_length)) %>% 
  mutate(member_type = "member")

#Tibbles were then bound together
weekdayaverage <- bind_rows(memberweekdayaverage, casualweekdayaverage)

#Ensuring weekdays show in the correct order
weekdayaverage$weekday <- ordered(weekdayaverage$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

#Graph made to illustrate findings
ggplot(weekdayaverage, aes(x=weekday,y=`mean(journey_length)`, fill=member_type)) +
  geom_col(position = "dodge2") +
  labs(x="Weekday",y="Average journey length (minutes)") +
  theme_bw()

The above graph shows that, on each day of the week, casual users took on average longer journeys than members, with their longest journeys being taken on Saturday and Sunday. The length of member journeys remains steady throughout the week, with a slight increase on Saturday and Sunday.

Daily Usage

Let’s break this down even further and look at whether members or casual users use Cyclistic bikes differently throughout the day.

#Extracting the hour from the journey start time
cycles2024better$ride_hour <- hour(cycles2024better$started_at)

#Graph created
ggplot(cycles2024better) +
  geom_bar(mapping = aes(x=ride_hour, fill=member_casual), stat="count", position="dodge2") +
  theme_bw() +
  labs(x="Start Hour", y="Total Rides")

This graph shows that the peak start hour for both user types is 17:00, and the least popular start hour for both user types is 04:00. There are two peaks of member usage throughout the day: at 08:00 and at 17:00. This would coincide with both morning and evening rush hour for commuters. For casual users, on the other hand, there is only one daily peak: at 17:00, with the number of rides slowly increasing throughout the day from 07:00 to 17:00 and then decreasing until the end of the day.

Let’s break this down further and see if there is any variation to this daily use pattern throughout the week.

ggplot(cycles2024better, aes(x=ride_hour, fill = member_casual)) +
     geom_bar(stat="count", position="dodge2") +
     facet_wrap("weekday") +
     theme_bw() +
     labs(x="Start Hour", y="Total Rides")

This graph clearly shows a change in ride behaviour between weekdays and weekends for both members and casual users During the weekdays, members take many more rides than casual users, as ascertained in previous analysis. Both user types show two usage peaks as well, one at 08:00 and at 17:00, however for members these peaks are much more pronounced. The fact that these peaks show for casual users as well means that they could also be using Cyclistic bikes for commuting but potentially not as regularly.

During the weekends, the usage is almost exactly the same between members and casual users, with both showing one peak between 12:00 and 15:00, where usage slowly builds up during the morning and slowly peters out during the evening. This is an interesting pattern as it suggests that the overall difference in use between members and casual users is actually not as pronounced as anticipated, and that the main difference in usage is the volume of rides during the week. It is worth noting, however, that the peak time of usage for casual users overall is on weekend afternoons.

Seasonal Difference

Whilst we have assessed there is a difference throughout the week in member and casual user usage, let’s have a look to see if there are any seasonal differences. If we can establish when during the year there is high casual usage, this will allow for more effective timed targeting of ad campaigns.

#Extract just the date from the journey start time
cycles2024better$ride_date <- date(cycles2024better$started_at)

#Create two graphs comparing seasonal usage of casual users and members
ggplot(filter(cycles2024better)) +
  geom_area(mapping = aes(x=ride_date), stat="count", fill="orange") +
  facet_grid("member_casual") +
  theme_bw() +
  labs(x="Date", y="Total Rides")

The above graphs show an interesting difference in seasonal usage between members and casual users. What is interesting here is the usage pattern of casual users where there is a larger difference in usage between summer months (May - September) and winter months (October - April) compared to that of members. This increase in usage at the beginning of April could be down to a number of factors, including (but not limited to):

To understand these annual patterns better and ascertain if this change in usage is regular, it would be interesting to get the following information from the Cyclistic marketing team:

If none of this is the case, then it can be safely assumed that casual usage of Cyclistic bikes increases dramatically during the summer, with the largest increase occurring around mid-April. Knowing this could allow for effective marketing strategies and for marketing to ramp up in the months prior to this increase.

In conclusion, from the analysis so far we can confidently state that the types of journeys that casual users and annual members take are different. Annual members on average take shorter, more frequent journeys on all days of the week, with peak usage happening on weekdays during morning and evening rush hour. Casual users, on the other hand, take longer and less frequent journeys, with most journeys happening between 12:00 and 15:00 on weekends. The main difference in the volume of usage is during weekdays at rush hour.

Geographic Differences

Finally, I analysed whether there were any geographic differences in where casual users and members were making their Cyclistic journeys. To do this, I used Tableau to plot geographic data and see where the majority of these journeys were started by both members and casual users. The below heat map was created in Tableau Public Desktop to show where the highest density of journeys started in from November 2023 to October 2024.

‘Map 1 - Journey start locations’
‘Map 1 - Journey start locations’

This map shows the highest density of journeys starting within the centre of Chicago and then spreading north along the coast up to Rogers Park. There is also a bubble of high density journeys starting between Washington Park and Jackson Park in the south of the city. Let’s zoom in on the highest density area in Central Chicago:

‘Map 2 - Journey start locations central Chicago’
‘Map 2 - Journey start locations central Chicago’

This shows the hotspots for journey start locations, which include locations in Greektown, Chicago Loop & Magnificent Mile.

Let’s break this down further and see whether these locations differ between members and casual users.

‘Map 3 - Member vs Casual’
‘Map 3 - Member vs Casual’

The above maps show that there are variations in popular starting stations for casual users and members. The top map, showing popular member starting stations, has a medium density throughout the centre of Chicago and then one hotspot of high density. This is the area between the north/south highway and the canal, which, from looking at Google Maps, is West Loop Gate. This is home to Chicago Union Station, the main transport hub within Chicago, and is next to Greektown, the main commercial district of Chicago. The bottom map, showing popular casual user starting stations, has areas of high density throughout central Chicago, with notable hotspots in Millennium Park and Northerly Island park (green areas to the east of the centre), Navy Pier and River North. From searching online, it has been found that these are all popular tourist destinations.

Geographic data shows that casual users are collecting bikes more frequently from green spaces and members are collecting bikes more around transport hubs and in commercial areas.

Final Conclusions & Recommendations:.

From analysing the Cyclistic data using Google Sheets, R and Tableau, we were able to create a rough profile of both members and casual users of Cyclistic:

Casual

Average journey length: 20:53.15 minutes

Most popular journey times: Afternoon/Weekend/Summer

Most popular geographies: All of Central Chicago

Member

Average journey length: 12:11.75 minutes

Most popular journey times: 08:00 & 17:00/Weekday/Summer

Most popular geographies: West Loop Gate

Members tend to take shorter journeys on average and their most popular time to start a journey was during rush hour on weekdays. Their use was also seasonal and they were more likely to make a journey during the summer months as opposed to during the winter. Their use was also concentrated around the main transport hub and commercial district of Chicago. These patterns of use make it likely that members are people using Cyclistic to commute to work or to travel around during their workday.

To contrast, casual users tended to take longer journeys than members and their most popular time to start a journey was on weekend afternoons between 12:00 and 15:00. Their use was also seasonal, with a large increase in use during the summer months, and was concentrated throughout central Chicago, with main hotspots by tourist attractions and in green spaces. These patterns of use make it likely that casual users are tourists or locals doing leisure rides outside of work hours.

Recommendations:

Based on my previous analysis, my top 3 recommendations to help convert casual users into members are:

Final Remarks:

To make this analysis the best it could be, it would be useful in the future to add the following data to the dataset: