Context

The following data analysis project is the final part of the Google Data Analytics Professional Certificate. The objective is to put all the skills taught throughout the course into practice. The course itself is broken down into 6 key phases that represent the integral processes of data analysis:

Ask
Prepare
Process
Analyze
Share
Act

Before we can jump into these processes, we first need to answer some preliminary questions below.

Where is your data located?
- The data is located in cloud storage on Google Drive and Kaggle.
How is the data organized?
- The data is organized in Microsoft Excel sheets representing each month of 2022. It’s also organized in numerous data frames within R.
Are there issues with bias or credibility in this data? Does your data ROCCC?
- For persons viewing this project who may not be familiar with the term ROCCC, it essentially asks if your data is:
- RELIABLE - Yes, the data is reliable. The Cyclistic dataset is based on real data published by Divvy Bikes, based in Chicago.
- ORIGINAL - Yes, the data is original. Divvy Bikes makes all of its trip data available for public use.
- COMPREHENSIVE - Somewhat. While the data as uploaded is broken down into columns, there are still issues to overcome, which will be tackled in the process phase.
- CURRENT - Yes, the data is current. Divvy Bikes adds new data each month to its ever-growing list of public data.
- CITED - Yes, the data is cited. Since Divvy Bikes produces and verifies its data, and this data is being used for this project, I would conclude it’s cited.
How are you addressing licensing, privacy, security, and accessibility?
- Licensing: Since the data is made publicly available by the source (Divvy Bikes), and subsequent manipulations and cleaning were done by me, it should be covered by a Creative Commons license.
- Privacy: The data and subsequent analysis are being stored and presented on publicly accessible platforms, which means they won’t be private.
- Security: Since the data is being stored and presented on publicly accessible platforms, I am reliant on their security.
- Accessibility: Steps have been taken to ensure that all charts and tables are legible and clearly presented, especially for the visually impaired.
How did you verify the data’s integrity?
- Data integrity was verified through the owner’s download portal. Each month’s data is stored in a zip file that can be downloaded from the Divvy Bikes site.
How does it help you answer your question?
- While the data can help to answer my questions, it is not perfect. To answer the questions more effectively, some qualitative data would have been useful.
Are there any problems with the data?
- The data was not cleaned.

Ask

The project brief sets out the questions we are asking of Cyclistic’s data and the scenario in which they are asked, both reproduced below.

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

Prepare

The objective of this part of the project is to outline how we get the data ready for analysis.

Download

The data was downloaded from Cyclistic (that is, Divvy Bikes) in the form of zip files, one for each month in which the data was collected.

Store

The zip files were stored initially on my laptop’s hard drive. They were then unzipped and the individual CSV files were transferred to my Google Drive and Kaggle.

Process

Here is where the fun really started. During this phase I worked on removing inconsistencies, ensured the data was properly cleaned, and added some useful calculations. The first part of the processing and cleaning was done in Excel and the second part was handled in R; the reason for this is explained later on, along with any assumptions made.

What did I do in Excel?

To make life easier for myself, I saved each .CSV file as a .XLSX file, because .CSV files do not retain certain types of formatting, such as tables.
After saving the files in .XLSX format, I put the data into tables. This was done so I could see the data presented as columns and rows much more easily, which is useful when creating formulas.
Once the data was in table format, I added 4 new columns to the existing ones. These were:
- trip_month: Based on the “started_at” column, this gives the month of the trip using the formula =SWITCH(MONTH(started_at),1,"JANUARY",2,"FEBRUARY",3,"MARCH",4,"APRIL",5,"MAY",6,"JUNE",7,"JULY",8,"AUGUST",9,"SEPTEMBER",10,"OCTOBER",11,"NOVEMBER",12,"DECEMBER"). The result was formatted as a custom data type.
- trip_day: Based on the “started_at” column, this gives the day of the week of the trip using the formula =SWITCH(WEEKDAY(started_at),1,"SUNDAY",2,"MONDAY",3,"TUESDAY",4,"WEDNESDAY",5,"THURSDAY",6,"FRIDAY",7,"SATURDAY"). The result was formatted as a custom data type.
- trip_time_period: Based on the “started_at” column, this gives the period of the day of the trip using the formula =IF(AND(HOUR(started_at)>0,HOUR(started_at)<12),"Morning",IF(AND(HOUR(started_at)>12,HOUR(started_at)<17),"Afternoon",IF(AND(HOUR(started_at)>17,HOUR(started_at)<20),"Evening","Night"))). The result was formatted as a custom data type. Because the comparisons are strict, trips that start during hours 0, 12 and 17 are labelled "Night".
- trip_duration: Based on the “started_at” and “ended_at” columns, this gives the duration of the trip in minutes using the formula =(ended_at-started_at)*1440. Excel stores date-times as fractions of a day, so multiplying the difference by 1440 converts it to minutes. The result was formatted as a numeric data type.
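
The same four columns could also be derived in R rather than Excel. The sketch below is a base R approximation, not the method used in this project, and the function name add_trip_columns() is hypothetical. It assumes started_at and ended_at are POSIXct date-times. Unlike the strict comparisons in the Excel formula, it assigns hour 12 to Afternoon and hour 17 to Evening (an assumption about the intended boundaries), so its time-period counts would differ from those reported later.

```r
# Hypothetical helper: derive the four trip columns from the two timestamps.
add_trip_columns <- function(trips) {
  start <- as.POSIXlt(trips$started_at)

  # month.name is a base R constant, so the result does not depend on locale.
  trips$trip_month <- toupper(month.name[start$mon + 1])

  # POSIXlt$wday counts from 0 (Sunday) to 6 (Saturday).
  day_names <- c("SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY",
                 "THURSDAY", "FRIDAY", "SATURDAY")
  trips$trip_day <- day_names[start$wday + 1]

  # Assumed boundaries: Morning 01-11, Afternoon 12-16, Evening 17-19,
  # Night otherwise (hours 20-23 and hour 0).
  h <- start$hour
  trips$trip_time_period <- ifelse(h >= 1 & h < 12, "Morning",
                            ifelse(h >= 12 & h < 17, "Afternoon",
                            ifelse(h >= 17 & h < 20, "Evening", "Night")))

  # Duration in minutes, the same quantity as the Excel day fraction times 1440.
  trips$trip_duration <- as.numeric(
    difftime(trips$ended_at, trips$started_at, units = "mins"))

  trips
}

# Invented example row: a 45-minute trip starting at 12:30 on 1 January 2022.
example <- data.frame(
  started_at = as.POSIXct("2022-01-01 12:30:00", tz = "UTC"),
  ended_at   = as.POSIXct("2022-01-01 13:15:00", tz = "UTC")
)
add_trip_columns(example)
```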

Quick note: I am quite adept at using Excel. It is a skill I had developed prior to doing this project or pursuing the certification, hence my use of functions like SWITCH.

Once the new columns were added, I formatted the pre-existing columns where necessary.
- ride_id: initially general, changed to text
- rideable_type: initially general, changed to text
- started_at: initially custom date/time, left unchanged
- ended_at: initially custom date/time, left unchanged
- start_station_name: initially general, changed to text
- start_station_id: initially general, changed to text
- end_station_name: initially general, changed to text
- end_station_id: initially general, changed to text
- start_lat: initially general, changed to numeric
- start_lng: initially general, changed to numeric
- end_lat: initially general, changed to numeric
- end_lng: initially general, changed to numeric
- member_casual: initially general, changed to text
Once all the formatting was completed, the files were saved. The rest of the process phase was handled in R, documented below.

What did I do in R?

Loaded the necessary libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(dplyr)
library(ggplot2)
library(lubridate)

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(readxl)
library(modeest)

## Registered S3 method overwritten by 'rmutil':
##   method         from
##   print.response httr

library(stringr)

Imported the partially cleaned Excel files into data frames

I took the time to do some preliminary data cleaning in Excel prior to bringing the data over to R. The reason for this is that there are things that Excel is just naturally good at from a data cleaning perspective. These include fixing data types, performing basic functions and calculations, and highlighting errors. One major caveat is that the Excel files for this project were huge, in excess of 60MB or over 100,000 rows of data. While I do have a fairly competent laptop, it wouldn’t be efficient to try to do everything in Excel, especially when R can be brought in to handle some of the heavy lifting.

january_tripdata <- read_excel("Divvy-TripData/202201-divvy-tripdata.xlsx", 
    sheet = "202201-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

february_tripdata <- read_excel("Divvy-TripData/202202-divvy-tripdata.xlsx", 
    sheet = "202202-divvy-tripdata", col_types = c("text", 
        "text", "date", "date","text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

march_tripdata <- read_excel("Divvy-TripData/202203-divvy-tripdata.xlsx", 
    sheet = "202203-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

april_tripdata <- read_excel("Divvy-TripData/202204-divvy-tripdata.xlsx", 
    sheet = "202204-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

may_tripdata <- read_excel("Divvy-TripData/202205-divvy-tripdata.xlsx", 
    sheet = "202205-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

june_tripdata <- read_excel("Divvy-TripData/202206-divvy-tripdata.xlsx", 
    sheet = "202206-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

july_tripdata <- read_excel("Divvy-TripData/202207-divvy-tripdata.xlsx", 
    sheet = "202207-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

august_tripdata <- read_excel("Divvy-TripData/202208-divvy-tripdata.xlsx", 
    sheet = "202208-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

september_tripdata <- read_excel("Divvy-TripData/202209-divvy-tripdata.xlsx", 
    sheet = "202209-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

october_tripdata <- read_excel("Divvy-TripData/202210-divvy-tripdata.xlsx", 
    sheet = "202210-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

november_tripdata <- read_excel("Divvy-TripData/202211-divvy-tripdata.xlsx", 
    sheet = "202211-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))

december_tripdata <- read_excel("Divvy-TripData/202212-divvy-tripdata.xlsx", 
    sheet = "202212-divvy-tripdata", col_types = c("text", 
        "text", "date", "date", "text", "text", "text", 
        "numeric", "text", "text", "text", 
        "text", "numeric", "numeric", "numeric", 
        "numeric", "text"))
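
The twelve calls above differ only in the month number, so the import could also be written as a loop. The following is a minimal sketch under that assumption; read_all_months() is a hypothetical helper, while the folder name, sheet naming pattern and column types are copied from the calls above. It returns one combined data frame instead of twelve separate ones.

```r
# Column types copied from the read_excel() calls above (17 columns).
trip_col_types <- c("text", "text", "date", "date", "text", "text", "text",
                    "numeric", "text", "text", "text", "text",
                    "numeric", "numeric", "numeric", "numeric", "text")

# Build the twelve sheet names, e.g. "202201-divvy-tripdata".
sheet_names <- sprintf("2022%02d-divvy-tripdata", 1:12)
file_paths  <- file.path("Divvy-TripData", paste0(sheet_names, ".xlsx"))

# Hypothetical helper: read every monthly workbook and stack the results.
read_all_months <- function(paths, sheets, col_types) {
  monthly <- Map(
    function(path, sheet) {
      readxl::read_excel(path, sheet = sheet, col_types = col_types)
    },
    paths, sheets
  )
  dplyr::bind_rows(monthly)
}

# Only attempt the import when the files are actually present.
if (all(file.exists(file_paths))) {
  all_trips <- read_all_months(file_paths, sheet_names, trip_col_types)
}
```

Keeping the column types in a single vector also means a change to the layout only has to be made in one place.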

Bound all the data frames together into a single data frame

Now that we have all of our Excel files loaded into their respective data frames, let’s combine them into one big data frame to make the analysis more efficient.

all_trips <- bind_rows(
            january_tripdata,
            february_tripdata,
            march_tripdata,
            april_tripdata,
            may_tripdata,
            june_tripdata,
            july_tripdata,
            august_tripdata,
            september_tripdata,
            october_tripdata,
            november_tripdata,
            december_tripdata
            )

Removed any blank rows

One of the assumptions I made was that any trip without a start or end station was an error and, as such, should be removed from the dataset. In addition to start and end stations, some rows had blank coordinates, so those needed to be removed as well.

all_trips_no_blanks <- drop_na(all_trips)
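
Called without column arguments, drop_na() removes a row if any of its columns is NA. Before dropping rows it can help to see how many missing values each column holds and how many rows will be lost. The sketch below shows one way to check this in base R; toy_trips and its values are invented stand-ins for all_trips.

```r
# Toy data standing in for all_trips; the values are invented.
toy_trips <- data.frame(
  ride_id            = c("A1", "B2", "C3", "D4"),
  start_station_name = c("Millennium Park", NA,
                         "Broadway & Belmont Ave", "Wabash Ave & Grand Ave"),
  end_lat            = c(41.9, 41.9, NA, 41.9),
  stringsAsFactors   = FALSE
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(toy_trips))
missing_per_column

# Rows with no NA in any column, i.e. the rows drop_na() would keep.
kept <- toy_trips[complete.cases(toy_trips), ]
rows_removed <- nrow(toy_trips) - nrow(kept)
rows_removed
```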

Removed unrealistic ride durations

Initially I thought about excluding only trips with a duration below 0 minutes. However, that doesn’t make real-world sense from an analysis perspective: a realistic bike ride can’t take 0 minutes, it must be longer. By the same reasoning, a single ride is limited in how long it can last, so any rides longer than 24 hours are assumed to be errors. Therefore, I decided to filter out rides shorter than 1 minute and longer than 24 hours for a more meaningful analysis. Note that trip_duration is in minutes and the upper bound in the code below is 34,560 minutes, which is 24 days (24 × 1,440) rather than 24 hours (1,440 minutes), so the results that follow still include rides of up to about 24 days.

all_trips_no_negatives <- all_trips_no_blanks[(all_trips_no_blanks$trip_duration > 1 & all_trips_no_blanks$trip_duration < 34560),]
trips_summary <- all_trips_no_negatives
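
Since trip_duration is measured in minutes, a 24-hour limit corresponds to 1,440 minutes, whereas 34,560 is 24 × 1,440. The sketch below shows what a filter with a 1-minute lower bound and a 24-hour upper bound could look like. The helper filter_realistic() and the toy data are hypothetical, and applying this filter to the real data would change the figures reported in the rest of this analysis, which were produced with the 34,560 bound.

```r
MINUTES_PER_DAY <- 24 * 60  # 1440 minutes in 24 hours

# Hypothetical helper: keep rides between min_minutes and max_minutes inclusive.
filter_realistic <- function(trips, min_minutes = 1,
                             max_minutes = MINUTES_PER_DAY) {
  keep <- !is.na(trips$trip_duration) &
    trips$trip_duration >= min_minutes &
    trips$trip_duration <= max_minutes
  trips[keep, , drop = FALSE]
}

# Toy durations in minutes: negative, under a minute, typical, and multi-day.
toy <- data.frame(
  ride_id       = c("a", "b", "c", "d"),
  trip_duration = c(-5, 0.5, 12, 29271)
)
filtered <- filter_realistic(toy)
filtered
```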

Analyze and Share

I combined these two phases because, as you read this, I am fulfilling both: I have analyzed the data and am sharing the results with you below.

Summarized the newly cleaned data

Much of the heavy lifting for cleaning the data was done in Excel, but due to the enormous file sizes of the data sets used in this analysis, it wouldn’t be practical to do all of the summarisation there. So we tackle that aspect here, since R is much more efficient.

summary(trips_summary)

##    ride_id          rideable_type        started_at                    
##  Length:4292473     Length:4292473     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-29 05:01:35.00  
##  Mode  :character   Mode  :character   Median :2022-07-20 19:37:51.00  
##                                        Mean   :2022-07-19 11:37:12.23  
##                                        3rd Qu.:2022-09-14 17:45:46.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##     ended_at                       trip_month          trip_day        
##  Min.   :2022-01-01 00:01:48.00   Length:4292473     Length:4292473    
##  1st Qu.:2022-05-29 05:40:24.00   Class :character   Class :character  
##  Median :2022-07-20 19:55:37.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-19 11:54:35.89                                        
##  3rd Qu.:2022-09-14 18:01:26.00                                        
##  Max.   :2023-01-01 18:09:37.00                                        
##  trip_time_period   trip_duration      start_station_name start_station_id  
##  Length:4292473     Min.   :    1.00   Length:4292473     Length:4292473    
##  Class :character   1st Qu.:    6.25   Class :character   Class :character  
##  Mode  :character   Median :   10.80   Mode  :character   Mode  :character  
##                     Mean   :   17.39                                        
##                     3rd Qu.:   19.25                                        
##                     Max.   :34354.07                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:4292473     Length:4292473     Min.   :41.65   Min.   :-87.83  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.64  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##     end_lat         end_lng       member_casual     
##  Min.   : 0.00   Min.   :-87.83   Length:4292473    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.64                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.06   Max.   :  0.00

str(trips_summary)

## tibble [4,292,473 × 17] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:4292473] "578BA30BA1348F18" "5EE2D7C533CCC17B" "5AA216F2E2138811" "81F3141973924C8C" ...
##  $ rideable_type     : chr [1:4292473] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : POSIXct[1:4292473], format: "2022-01-01 01:00:05" "2022-01-06 19:07:45" ...
##  $ ended_at          : POSIXct[1:4292473], format: "2022-01-21 08:51:11" "2022-01-25 14:30:33" ...
##  $ trip_month        : chr [1:4292473] "JANUARY" "JANUARY" "JANUARY" "JANUARY" ...
##  $ trip_day          : chr [1:4292473] "SATURDAY" "THURSDAY" "THURSDAY" "WEDNESDAY" ...
##  $ trip_time_period  : chr [1:4292473] "Morning" "Evening" "Night" "Afternoon" ...
##  $ trip_duration     : num [1:4292473] 29271 27083 14238 9839 8531 ...
##  $ start_station_name: chr [1:4292473] "Millennium Park" "Wabash Ave & Grand Ave" "Broadway & Belmont Ave" "Sedgwick St & Schiller St" ...
##  $ start_station_id  : chr [1:4292473] "13008" "TA1307000117" "13277" "TA1307000143" ...
##  $ end_station_name  : chr [1:4292473] "Fairfield Ave & Roosevelt Rd" "Base - 2132 W Hubbard Warehouse" "Avers Ave & Belmont Ave" "Larrabee St & Division St" ...
##  $ end_station_id    : chr [1:4292473] "KA1504000102" "Hubbard Bike-checking (LBS-WH-TEST)" "15640" "KA1504000079" ...
##  $ start_lat         : num [1:4292473] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:4292473] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:4292473] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:4292473] -87.7 -87.7 -87.7 -87.6 -87.7 ...
##  $ member_casual     : chr [1:4292473] "casual" "casual" "casual" "casual" ...

Dived Deeper into the analysis

What was the summary information for trip duration?

summary(trips_summary$trip_duration)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     6.25    10.80    17.39    19.25 34354.07

What was the total number of rides based on Membership Type?

members_trip_total <- data.frame(table(trips_summary$member_casual))
colnames(members_trip_total) <- c("Membership Type", "Number of Rides")
members_trip_total

##   Membership Type Number of Rides
## 1          casual         1731141
## 2          member         2561332

Quick Observation: Annual Members took almost 50% more rides than Casual Riders (about 48% more). Note that these are ride counts, not counts of individual riders.

What was the Total Duration of rides based on Membership Type?

members_trip_total_duration <- aggregate(trips_summary$trip_duration, list(trips_summary$member_casual), FUN = sum)
colnames(members_trip_total_duration) <- c("Membership Type", "Trip Duration (Minutes)")
members_trip_total_duration

##   Membership Type Trip Duration (Minutes)
## 1          casual                42171615
## 2          member                32492675

Quick Observation: Casual Riders rode for almost 30% longer than Annual Members in total

What was the average trip duration based on Membership Type?

members_trip_mean <- aggregate(trips_summary$trip_duration, list(trips_summary$member_casual), FUN = mean)
colnames(members_trip_mean) <- c("Membership Type","Average Trip (minutes)")
members_trip_mean

##   Membership Type Average Trip (minutes)
## 1          casual               24.36059
## 2          member               12.68585

Quick Observation: Casual Riders’ average trip (24.4 minutes) was almost twice as long as Annual Members’ (12.7 minutes), roughly 92% longer
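
The percentages behind the quick observations so far can be checked with a little arithmetic. The sketch below uses a small helper, pct_more(), which is a hypothetical name, together with the figures copied from the tables above. It gives roughly 48% more rides for members, roughly 30% more total minutes for casual riders, and roughly 92% longer average trips for casual riders.

```r
# Percentage by which `a` exceeds `b`.
pct_more <- function(a, b) (a / b - 1) * 100

# Figures copied from the output tables above.
rides_member_vs_casual   <- pct_more(2561332, 1731141)    # number of rides
total_casual_vs_member   <- pct_more(42171615, 32492675)  # total minutes
average_casual_vs_member <- pct_more(24.36059, 12.68585)  # mean minutes

round(c(rides   = rides_member_vs_casual,
        total   = total_casual_vs_member,
        average = average_casual_vs_member), 1)
```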

What were the longest and shortest trip durations based on Membership Type?

members_trip_max <- aggregate(trips_summary$trip_duration, list(trips_summary$member_casual), FUN = max)
colnames(members_trip_max) <- c("Membership Type", "Longest Trip (minutes)")
members_trip_max

##   Membership Type Longest Trip (minutes)
## 1          casual              34354.067
## 2          member               1493.233

members_trip_min <- aggregate(trips_summary$trip_duration, list(trips_summary$member_casual), FUN = min)
colnames(members_trip_min) <- c("Membership Type", "Shortest Trip (minutes)")
members_trip_min

##   Membership Type Shortest Trip (minutes)
## 1          casual                       1
## 2          member                       1

Looked at how popular each weekday is among Casual Riders, Annual Members and overall

What was the most popular day of the week overall?

members_trip_weekday_popular <- data.frame(table(trips_summary$trip_day))
colnames(members_trip_weekday_popular) <- c("Weekday", "Frequency")
members_trip_weekday_popular

##     Weekday Frequency
## 1    FRIDAY    598037
## 2    MONDAY    575757
## 3  SATURDAY    692890
## 4    SUNDAY    588199
## 5  THURSDAY    634734
## 6   TUESDAY    597226
## 7 WEDNESDAY    605630

Quick Observation: Saturday had the most rides overall, with both membership types combined

What was the most popular day based on Membership Type?

members_trip_popular_days <- data.frame(table(trips_summary$member_casual, trips_summary$trip_day))
colnames(members_trip_popular_days) <- c("Membership Type","Weekday","Frequency")
members_trip_popular_days

##    Membership Type   Weekday Frequency
## 1           casual    FRIDAY    244971
## 2           member    FRIDAY    353066
## 3           casual    MONDAY    207530
## 4           member    MONDAY    368227
## 5           casual  SATURDAY    361592
## 6           member  SATURDAY    331298
## 7           casual    SUNDAY    296578
## 8           member    SUNDAY    291621
## 9           casual  THURSDAY    226558
## 10          member  THURSDAY    408176
## 11          casual   TUESDAY    193402
## 12          member   TUESDAY    403824
## 13          casual WEDNESDAY    200510
## 14          member WEDNESDAY    405120

Visualized the results

ggplot(members_trip_popular_days, aes(x = `Weekday`, y = `Frequency`, fill = `Membership Type`)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = `Frequency`), position = position_dodge(width = 0.9), vjust = -0.25)+
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Weekday", y = "Frequency", fill = "Membership Type", title = "Most Popular Day based on Membership Type") +
  theme_minimal()
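
Because table() sorts character values alphabetically, the weekdays in the tables (and therefore on the plot's x-axis) start with FRIDAY rather than following the calendar. One way to get calendar order is to convert the column to a factor with explicit levels before plotting. The sketch below illustrates this with the overall weekday counts from the table above; starting the week on Sunday is an assumption that matches the WEEKDAY numbering used in the Excel formula.

```r
# Calendar order, Sunday first (an assumption matching the Excel formula).
weekday_levels <- c("SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY",
                    "THURSDAY", "FRIDAY", "SATURDAY")

# Overall ride counts copied from the weekday table above.
weekday_counts <- data.frame(
  Weekday   = c("FRIDAY", "MONDAY", "SATURDAY", "SUNDAY",
                "THURSDAY", "TUESDAY", "WEDNESDAY"),
  Frequency = c(598037, 575757, 692890, 588199, 634734, 597226, 605630)
)

# A factor with explicit levels sorts (and plots) in the order of its levels.
weekday_counts$Weekday <- factor(weekday_counts$Weekday,
                                 levels = weekday_levels)
ordered_counts <- weekday_counts[order(weekday_counts$Weekday), ]
ordered_counts
```

The same idea would apply to the month tables later on, using toupper(month.name) as the levels.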

Extracted the key results based on Membership Type

popular_casual_day <- members_trip_popular_days[members_trip_popular_days[, "Membership Type"] == "casual", ]
popular_member_day <- members_trip_popular_days[members_trip_popular_days[, "Membership Type"] == "member", ]
casual_most_popular_row <- which.max(popular_casual_day[, "Frequency"])
member_most_popular_row <- which.max(popular_member_day[, "Frequency"])
casual_most_popular_day <- popular_casual_day[casual_most_popular_row, c("Membership Type", "Weekday", "Frequency")]
member_most_popular_day <- popular_member_day[member_most_popular_row, c("Membership Type", "Weekday", "Frequency")]
most_popular_days <- rbind(casual_most_popular_day, member_most_popular_day)
most_popular_days

##    Membership Type  Weekday Frequency
## 5           casual SATURDAY    361592
## 10          member THURSDAY    408176
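
The extraction above follows a pattern that recurs in this analysis: split the table by membership type, find the row with the largest value in each part, and bind the results. A more compact alternative is sketched below. The helper top_per_group() is hypothetical, and the data frame is rebuilt here from the table above only so that the example is self-contained.

```r
# Hypothetical helper: return the row with the largest value in each group.
top_per_group <- function(counts, group_col, value_col) {
  pieces <- split(counts, counts[[group_col]])
  top <- lapply(pieces, function(piece) {
    piece[which.max(piece[[value_col]]), ]
  })
  do.call(rbind, top)
}

# Ride counts by membership type and weekday, copied from the table above.
day_counts <- data.frame(
  `Membership Type` = rep(c("casual", "member"), times = 7),
  Weekday = rep(c("FRIDAY", "MONDAY", "SATURDAY", "SUNDAY",
                  "THURSDAY", "TUESDAY", "WEDNESDAY"), each = 2),
  Frequency = c(244971, 353066, 207530, 368227, 361592, 331298, 296578,
                291621, 226558, 408176, 193402, 403824, 200510, 405120),
  check.names = FALSE
)

top_days <- top_per_group(day_counts, "Membership Type", "Frequency")
top_days
```

Passing the column names as arguments means the same helper could be reused for the time-period and month tables.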

Quick Observation: For Casual Riders, the most popular day of the week is Saturday, whereas for Annual Members it’s Thursday

Explored which time period is most popular based on Membership Type and overall

What was the most popular time period overall?

members_trip_popular_time_period <- data.frame(table(trips_summary$trip_time_period))
colnames(members_trip_popular_time_period) <- c("Time Period", "Frequency")
members_trip_popular_time_period

##   Time Period Frequency
## 1   Afternoon   1191788
## 2     Evening    636305
## 3     Morning   1157095
## 4       Night   1307285

Quick Observation: The most popular time period overall was Night

What was the most popular time period based on membership type?

members_trip_popular_period <- data.frame(table(trips_summary$member_casual, trips_summary$trip_time_period))
colnames(members_trip_popular_period) <- c("Membership Type","Time Period","Frequency")
members_trip_popular_period

##   Membership Type Time Period Frequency
## 1          casual   Afternoon    517921
## 2          member   Afternoon    673867
## 3          casual     Evening    260318
## 4          member     Evening    375987
## 5          casual     Morning    382052
## 6          member     Morning    775043
## 7          casual       Night    570850
## 8          member       Night    736435

Visualized the results

ggplot(members_trip_popular_period, aes(x = `Time Period`, y = `Frequency`, fill = `Membership Type`)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = `Frequency`), position = position_dodge(width = 0.9), vjust = -0.25)+
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Time Period", y = "Frequency", fill = "Membership Type", title = "Most Popular Time Period by Membership Type") +
  theme_minimal()

Extracted the key results based on Membership Type

popular_casual_period <- members_trip_popular_period[members_trip_popular_period[, "Membership Type"] == "casual", ]
popular_member_period <- members_trip_popular_period[members_trip_popular_period[, "Membership Type"] == "member", ]
casual_most_popular_period_row <- which.max(popular_casual_period[, "Frequency"])
member_most_popular_period_row <- which.max(popular_member_period[, "Frequency"])
casual_most_popular_period <- popular_casual_period[casual_most_popular_period_row, c("Membership Type", "Time Period", "Frequency")]
member_most_popular_period <- popular_member_period[member_most_popular_period_row, c("Membership Type", "Time Period", "Frequency")]
most_popular_time_period <- rbind(casual_most_popular_period, member_most_popular_period)
most_popular_time_period

##   Membership Type Time Period Frequency
## 7          casual       Night    570850
## 6          member     Morning    775043

Investigated how popular each month is among Casual Riders and Annual Members, as well as overall

What was the most popular month overall?

members_trip_popular_month_overall <- data.frame(table(trips_summary$trip_month))
colnames(members_trip_popular_month_overall) <- c("Month", "Frequency")
members_trip_popular_month_overall

##        Month Frequency
## 1      APRIL    268528
## 2     AUGUST    594293
## 3   DECEMBER    132604
## 4   FEBRUARY     87649
## 5    JANUARY     79052
## 6       JULY    630891
## 7       JUNE    609773
## 8      MARCH    212888
## 9        MAY    494082
## 10  NOVEMBER    251113
## 11   OCTOBER    406305
## 12 SEPTEMBER    525295

Quick Observation: July was the most popular month overall

What was the most popular month based on Membership Type?

members_trip_popular_month <- data.frame(table(trips_summary$member_casual, trips_summary$trip_month))
colnames(members_trip_popular_month) <- c("Membership Type","Month","Number of Rides")
members_trip_popular_month

##    Membership Type     Month Number of Rides
## 1           casual     APRIL           90812
## 2           member     APRIL          177716
## 3           casual    AUGUST          265735
## 4           member    AUGUST          328558
## 5           casual  DECEMBER           30979
## 6           member  DECEMBER          101625
## 7           casual  FEBRUARY           14972
## 8           member  FEBRUARY           72677
## 9           casual   JANUARY           12481
## 10          member   JANUARY           66571
## 11          casual      JULY          306599
## 12          member      JULY          324292
## 13          casual      JUNE          287536
## 14          member      JUNE          322237
## 15          casual     MARCH           66401
## 16          member     MARCH          146487
## 17          casual       MAY          216932
## 18          member       MAY          277150
## 19          casual  NOVEMBER           72364
## 20          member  NOVEMBER          178749
## 21          casual   OCTOBER          148857
## 22          member   OCTOBER          257448
## 23          casual SEPTEMBER          217473
## 24          member SEPTEMBER          307822

Visualized the results

members_trip_popular_month$`Month` <- as.factor(members_trip_popular_month$`Month`)

# The table holds ride counts per month, so the axis and labels show counts.
ggplot(members_trip_popular_month, aes(x = `Month`, y = `Number of Rides`, fill = `Membership Type`)) +
  geom_col(position = "dodge", width = 1) +
  geom_text(aes(label = `Number of Rides`), position = position_dodge(width = 0.9), vjust = -0.25)+
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Month", y = "Number of Rides", fill = "Membership Type", title = "Number of Rides by Month and Membership Type") +
  theme_minimal()

Extracted the key results based on Membership Type

popular_casual_month <- members_trip_popular_month[members_trip_popular_month[, "Membership Type"] == "casual", ]
popular_member_month <- members_trip_popular_month[members_trip_popular_month[, "Membership Type"] == "member", ]
casual_most_popular_row2 <- which.max(popular_casual_month[, "Number of Rides"])
member_most_popular_row2 <- which.max(popular_member_month[, "Number of Rides"])
casual_most_popular_month <- popular_casual_month[casual_most_popular_row2, c("Membership Type", "Month", "Number of Rides")]
member_most_popular_month <- popular_member_month[member_most_popular_row2, c("Membership Type", "Month", "Number of Rides")]
most_popular_month <- rbind(casual_most_popular_month, member_most_popular_month)
most_popular_month

##    Membership Type  Month Number of Rides
## 11          casual   JULY          306599
## 4           member AUGUST          328558

Finally, explored the most popular bicycle types based on Membership Type

members_popular_bike <- data.frame(table(trips_summary$member_casual, trips_summary$rideable_type))
colnames(members_popular_bike) <- c("Membership Type","Bicycle Type","Number of rides")
members_popular_bike

##   Membership Type  Bicycle Type Number of rides
## 1          casual  classic_bike          875958
## 2          member  classic_bike         1682817
## 3          casual   docked_bike          173342
## 4          member   docked_bike               0
## 5          casual electric_bike          681841
## 6          member electric_bike          878515

Visualized the results

ggplot(members_popular_bike, aes(x = `Bicycle Type`, y = `Number of rides`, fill = `Membership Type`)) +
  geom_col(position = "dodge", width = 1) +
  geom_text(aes(label = `Number of rides`), position = position_dodge(width = 0.9), vjust = -0.25) +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Bicycle Type", y = "Number of rides", fill = "Membership Type", title = "Most Popular Bicycle Type by Number of Rides and Membership Type")

Repeated the entire analysis, this time using Bicycle Type as the focus

The Average Trip Duration by Bicycle Type

bicycle_trip_mean <- aggregate(trips_summary$trip_duration, list(trips_summary$rideable_type), FUN = mean)
colnames(bicycle_trip_mean) <- c("Bicycle Type","Average Trip (minutes)")
bicycle_trip_mean

##    Bicycle Type Average Trip (minutes)
## 1  classic_bike               17.32218
## 2   docked_bike               51.14780
## 3 electric_bike               13.76267

Longest Trip by Bicycle Type

bicycle_trip_max <- aggregate(trips_summary$trip_duration, list(trips_summary$rideable_type), FUN = max)
colnames(bicycle_trip_max) <- c("Bicycle Type", "Longest Trip (minutes)")
bicycle_trip_max

##    Bicycle Type Longest Trip (minutes)
## 1  classic_bike               1499.417
## 2   docked_bike              34354.067
## 3 electric_bike                480.000

Shortest Trip by Bicycle Type

bicycle_trip_min <- aggregate(trips_summary$trip_duration, list(trips_summary$rideable_type), FUN = min)
colnames(bicycle_trip_min) <- c("Bicycle Type", "Shortest Trip (minutes)")
bicycle_trip_min

##    Bicycle Type Shortest Trip (minutes)
## 1  classic_bike                       1
## 2   docked_bike                       1
## 3 electric_bike                       1

What was the most popular Bicycle Type overall?

members_trip_popular_bike <- data.frame(table(trips_summary$rideable_type))
colnames(members_trip_popular_bike) <- c("Bicycle Type", "Number of rides")
members_trip_popular_bike

##    Bicycle Type Number of rides
## 1  classic_bike         2558775
## 2   docked_bike          173342
## 3 electric_bike         1560356

Quick Observation: The most popular bicycle type overall was the Classic Bike
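
To put the observation above in proportion, the ride counts can be converted to shares of all rides. This is a small sketch using the counts printed in the table; the vector is typed in by hand and the names `rides_by_bike` and `bike_share_pct` are illustrative rather than taken from the original script.

```r
# Counts copied from the members_trip_popular_bike output above
rides_by_bike <- c(classic_bike = 2558775, docked_bike = 173342, electric_bike = 1560356)

# Share of all rides per bicycle type, as a percentage rounded to one decimal
bike_share_pct <- round(100 * prop.table(rides_by_bike), 1)
bike_share_pct

# Name of the bicycle type with the most rides
names(which.max(rides_by_bike))
```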

What was the most Popular Day by Bicycle Type?

bicycle_trip_popular_days <- data.frame(table(trips_summary$rideable_type, trips_summary$trip_day))
colnames(bicycle_trip_popular_days) <- c("Bicycle Type","Weekday","Number of rides")
bicycle_trip_popular_days

##     Bicycle Type   Weekday Number of rides
## 1   classic_bike    FRIDAY          349141
## 2    docked_bike    FRIDAY           22806
## 3  electric_bike    FRIDAY          226090
## 4   classic_bike    MONDAY          346180
## 5    docked_bike    MONDAY           21995
## 6  electric_bike    MONDAY          207582
## 7   classic_bike  SATURDAY          416944
## 8    docked_bike  SATURDAY           40026
## 9  electric_bike  SATURDAY          235920
## 10  classic_bike    SUNDAY          353180
## 11   docked_bike    SUNDAY           34882
## 12 electric_bike    SUNDAY          200137
## 13  classic_bike  THURSDAY          375396
## 14   docked_bike  THURSDAY           19304
## 15 electric_bike  THURSDAY          240034
## 16  classic_bike   TUESDAY          358874
## 17   docked_bike   TUESDAY           17366
## 18 electric_bike   TUESDAY          220986
## 19  classic_bike WEDNESDAY          359060
## 20   docked_bike WEDNESDAY           16963
## 21 electric_bike WEDNESDAY          229607

Visualized the results

ggplot(bicycle_trip_popular_days, aes(x = `Weekday`, y = `Number of rides`, fill = `Bicycle Type`)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = `Number of rides`), position = position_dodge(width = 0.9), vjust = -0.25)+
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Weekday", y = "Number of rides", fill = "Bicycle Type", title = "Most Popular Weekday by Bicycle Type") +
  theme_minimal()

Extracted the key results for each bicycle type

popular_classic_day <- bicycle_trip_popular_days[bicycle_trip_popular_days[, "Bicycle Type"] == "classic_bike", ]
popular_docked_day <- bicycle_trip_popular_days[bicycle_trip_popular_days[, "Bicycle Type"] == "docked_bike", ]
popular_electric_day <- bicycle_trip_popular_days[bicycle_trip_popular_days[, "Bicycle Type"] == "electric_bike", ]
classic_most_popular_row <- which.max(popular_classic_day[, "Number of rides"])
docked_most_popular_row <- which.max(popular_docked_day[, "Number of rides"])
electric_most_popular_row <- which.max(popular_electric_day[, "Number of rides"])
classic_most_popular_day <- popular_classic_day[classic_most_popular_row, c("Bicycle Type", "Weekday", "Number of rides")]
docked_most_popular_day <- popular_docked_day[docked_most_popular_row, c("Bicycle Type", "Weekday", "Number of rides")]
electric_most_popular_day <- popular_electric_day[electric_most_popular_row, c("Bicycle Type", "Weekday", "Number of rides")]
most_popular_bike_days <- rbind(classic_most_popular_day, docked_most_popular_day, electric_most_popular_day)
most_popular_bike_days

##     Bicycle Type  Weekday Number of rides
## 7   classic_bike SATURDAY          416944
## 8    docked_bike SATURDAY           40026
## 15 electric_bike THURSDAY          240034
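
The filter, `which.max`, and `rbind` steps above are repeated for every bicycle type, so they could also be wrapped in one small function. The sketch below is one possible way to do that in base R. `top_row_per_group` is a hypothetical helper, not part of the original analysis, and the data frame is a three-weekday stand-in built from the table above so that the block runs on its own.

```r
# Hypothetical helper: return the row with the largest value of `value_col`
# within each level of `group_col`.
top_row_per_group <- function(df, group_col, value_col) {
  groups <- split(df, df[[group_col]], drop = TRUE)
  top_rows <- lapply(groups, function(g) g[which.max(g[[value_col]]), ])
  do.call(rbind, top_rows)
}

# Stand-in data: three weekdays copied from the bicycle_trip_popular_days output
bicycle_trip_popular_days <- data.frame(
  `Bicycle Type` = rep(c("classic_bike", "docked_bike", "electric_bike"), times = 3),
  Weekday = rep(c("FRIDAY", "SATURDAY", "THURSDAY"), each = 3),
  `Number of rides` = c(349141, 22806, 226090,
                        416944, 40026, 235920,
                        375396, 19304, 240034),
  check.names = FALSE
)

most_popular_bike_days <- top_row_per_group(
  bicycle_trip_popular_days, "Bicycle Type", "Number of rides"
)
most_popular_bike_days
```

The same call would work for the time period and month tables by passing those data frames instead. Row names in the result come from the group names, so they differ from the row numbers printed above.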

Exploring which time period is most popular based on Bicycle Type

What was the most popular time period based on bicycle type?

members_bike_popular_period <- data.frame(table(trips_summary$rideable_type, trips_summary$trip_time_period))
colnames(members_bike_popular_period) <- c("Bicycle Type","Time Period","Number of rides")
members_bike_popular_period

##     Bicycle Type Time Period Number of rides
## 1   classic_bike   Afternoon          696076
## 2    docked_bike   Afternoon           61741
## 3  electric_bike   Afternoon          433971
## 4   classic_bike     Evening          398665
## 5    docked_bike     Evening           22365
## 6  electric_bike     Evening          215275
## 7   classic_bike     Morning          686059
## 8    docked_bike     Morning           35324
## 9  electric_bike     Morning          435712
## 10  classic_bike       Night          777975
## 11   docked_bike       Night           53912
## 12 electric_bike       Night          475398

Visualized the results

ggplot(members_bike_popular_period, aes(x = `Time Period`, y = `Number of rides`, fill = `Bicycle Type`)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = `Number of rides`), position = position_dodge(width = 0.9), vjust = -0.25)+
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Time Period", y = "Number of rides", fill = "Bicycle Type", title = "Most Popular Time Period by Bicycle Type") +
  theme_minimal()

Extracted the key results based on Bicycle Type

popular_classic_time_period <- members_bike_popular_period[members_bike_popular_period[, "Bicycle Type"] == "classic_bike", ]
popular_docked_time_period <- members_bike_popular_period[members_bike_popular_period[, "Bicycle Type"] == "docked_bike", ]
popular_electric_time_period <- members_bike_popular_period[members_bike_popular_period[, "Bicycle Type"] == "electric_bike", ]
classic_most_popular_period_row <- which.max(popular_classic_time_period[, "Number of rides"])
docked_most_popular_period_row <- which.max(popular_docked_time_period[, "Number of rides"])
electric_most_popular_period_row <- which.max(popular_electric_time_period[, "Number of rides"])
classic_most_popular_period <- popular_classic_time_period[classic_most_popular_period_row, c("Bicycle Type", "Time Period", "Number of rides")]
docked_most_popular_period <- popular_docked_time_period[docked_most_popular_period_row, c("Bicycle Type", "Time Period", "Number of rides")]
electric_most_popular_period <- popular_electric_time_period[electric_most_popular_period_row, c("Bicycle Type", "Time Period", "Number of rides")]
most_popular_bike_time_period <- rbind(classic_most_popular_period, docked_most_popular_period, electric_most_popular_period)
most_popular_bike_time_period

##     Bicycle Type Time Period Number of rides
## 10  classic_bike       Night          777975
## 2    docked_bike   Afternoon           61741
## 12 electric_bike       Night          475398

What was the most popular month based on bicycle type?

bicycle_trip_popular_month <- data.frame(table(trips_summary$rideable_type, trips_summary$trip_month))
colnames(bicycle_trip_popular_month) <- c("Bicycle Type","Month","Number of rides")
bicycle_trip_popular_month

##     Bicycle Type     Month Number of rides
## 1   classic_bike     APRIL          164281
## 2    docked_bike     APRIL           11910
## 3  electric_bike     APRIL           92337
## 4   classic_bike    AUGUST          338611
## 5    docked_bike    AUGUST           25646
## 6  electric_bike    AUGUST          230036
## 7   classic_bike  DECEMBER           72196
## 8    docked_bike  DECEMBER            1860
## 9  electric_bike  DECEMBER           58548
## 10  classic_bike  FEBRUARY           58140
## 11   docked_bike  FEBRUARY            1338
## 12 electric_bike  FEBRUARY           28171
## 13  classic_bike   JANUARY           53978
## 14   docked_bike   JANUARY             939
## 15 electric_bike   JANUARY           24135
## 16  classic_bike      JULY          366908
## 17   docked_bike      JULY           30313
## 18 electric_bike      JULY          233670
## 19  classic_bike      JUNE          399684
## 20   docked_bike      JUNE           29962
## 21 electric_bike      JUNE          180127
## 22  classic_bike     MARCH          132472
## 23   docked_bike     MARCH            8173
## 24 electric_bike     MARCH           72243
## 25  classic_bike       MAY          318546
## 26   docked_bike       MAY           25931
## 27 electric_bike       MAY          149605
## 28  classic_bike  NOVEMBER          142429
## 29   docked_bike  NOVEMBER            5736
## 30 electric_bike  NOVEMBER          102948
## 31  classic_bike   OCTOBER          210209
## 32   docked_bike   OCTOBER           12268
## 33 electric_bike   OCTOBER          183828
## 34  classic_bike SEPTEMBER          301321
## 35   docked_bike SEPTEMBER           19266
## 36 electric_bike SEPTEMBER          204708

Visualized the results

bicycle_trip_popular_month$`Month` <- as.factor(bicycle_trip_popular_month$`Month`)

ggplot(bicycle_trip_popular_month, aes(x = `Month`, y = `Number of rides`, fill = `Bicycle Type`)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = `Number of rides`), position = position_dodge(width = 0.9), vjust = -0.25)+
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Month", y = "Number of rides", fill = "Bicycle Type", title = "Most Popular Month by Bicycle Type") +
  theme_minimal()
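
The month table above is sorted alphabetically (APRIL, AUGUST, DECEMBER, ...), and the plot's x-axis appears to follow the same order. If a calendar order is preferred, the factor levels can be set explicitly. This is a minimal sketch, assuming the month values are upper-case English month names as shown in the table; `month_values` and `month_factor` are illustrative names.

```r
# Stand-in for the Month column, in the alphabetical order shown in the table
month_values <- c("APRIL", "AUGUST", "DECEMBER", "FEBRUARY", "JANUARY", "JULY",
                  "JUNE", "MARCH", "MAY", "NOVEMBER", "OCTOBER", "SEPTEMBER")

# month.name is a built-in base R constant: "January", ..., "December"
month_factor <- factor(month_values, levels = toupper(month.name))

# Levels now run in calendar order, which ggplot2 uses for a discrete axis
levels(month_factor)
```

Applied to the real data, the same `factor(..., levels = toupper(month.name))` call would replace the `as.factor()` line before plotting.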

Extracted the key results for each bicycle type

popular_classic_month <- bicycle_trip_popular_month[bicycle_trip_popular_month[, "Bicycle Type"] == "classic_bike", ]
popular_docked_month <- bicycle_trip_popular_month[bicycle_trip_popular_month[, "Bicycle Type"] == "docked_bike", ]
popular_electric_month <- bicycle_trip_popular_month[bicycle_trip_popular_month[, "Bicycle Type"] == "electric_bike", ]
classic_most_popular_month_row <- which.max(popular_classic_month[, "Number of rides"])
docked_most_popular_month_row <- which.max(popular_docked_month[, "Number of rides"])
electric_most_popular_month_row <- which.max(popular_electric_month[, "Number of rides"])
classic_most_popular_month <- popular_classic_month[classic_most_popular_month_row, c("Bicycle Type", "Month", "Number of rides")]
docked_most_popular_month <- popular_docked_month[docked_most_popular_month_row, c("Bicycle Type", "Month", "Number of rides")]
electric_most_popular_month <- popular_electric_month[electric_most_popular_month_row, c("Bicycle Type", "Month", "Number of rides")]
most_popular_bike_month <- rbind(classic_most_popular_month, docked_most_popular_month, electric_most_popular_month)
most_popular_bike_month

##     Bicycle Type Month Number of rides
## 19  classic_bike  JUNE          399684
## 17   docked_bike  JULY           30313
## 18 electric_bike  JULY          233670

Conclusion

Act

The last aspect of the data analysis process is taking action based on the analysis you have done and shared. In this particular case, the action takes the form of recommendations to the Cyclistic team based on the initial questions at the start of this project, which I have outlined below.

Questions and Recommendations

How do annual members and casual riders use Cyclistic bikes differently?
- Based on the data above, we can see a few things regarding Annual and Casual Members:
  1. On average, Casual Members tend to take longer rides on Saturdays, whereas Annual Members prefer Thursdays.
  2. Casual Members prefer to ride at night, whereas Annual Members prefer Mornings.
  3. Casual and Annual Members love the classic bikes over the other bicycle types, but Casual Members seem to be more inclined to give electric bikes a try as the disparity between the two isn’t as great as with Annual Members.
Why would casual riders buy Cyclistic annual memberships?
- Personally, I don’t believe this question can be reasonably answered with the data provided, as the benefits and disadvantages would come down to pricing, which isn’t given in this project. However, if we assume that casual riders would enjoy certain benefits that only exist by being an Annual Member, then those benefits would be a good reason to change membership types. These benefits could be:
  - Unlimited rides of a certain duration within a 24-hour day.
  - Discounted ride rates based on peak periods (Months, Days, Period of the day)
How can Cyclistic use digital media to influence casual riders to become members?
- There are a few ways Cyclistic can use digital media to influence casual riders to become annual members.
  1. Digital Surveys at the end of each ride. These can be used to gather qualitative data about why the rider chose the type of bicycle and the purpose of the trip. At the end of the survey, the casual rider could be shown the price of such trips if they were an annual member, and be presented with a coupon code that would discount their first year of annual membership if they signed up.
  2. Targeted Digital Media Campaigns. These could be used to great effect when combined with Popular Month, Day and Time Period data exposed in the above analysis for casual riders. Offering discounted Annual Memberships for the first year if they sign up at peak periods for casual riders is a great way to catch their attention and incentivize them to switch to Annual Memberships.
  3. Targeted Digital Media referral campaigns. These campaigns would be primarily aimed at existing Annual Members but could be extended to newly converted Casual Riders. If an existing Annual Member convinces a friend who is a casual rider to switch to an Annual Membership:
    - The existing Annual Member or newly converted Casual Rider gets a small percentage discount on renewal for each casual rider converted.
    - The casual rider gets a discount on their new Annual Membership at sign up.

Google Data Analytics Certificate Capstone Project: Cyclistic Bike-Share

Nicholas L. Worrell

2023-02-14
