Cyclistic Bike Sharing

Introduction

Cyclistic™ is a bike-sharing company with an extensive network of 5,800 bikes spread across 1,600 stations in the state of Chicago, USA. The core of its business model revolves around providing an eco-friendly and affordable alternative means of transportation.

The company has successfully built brand recognition through targeted marketing efforts, particularly with flexible pricing plans. This approach has allowed the company to secure a significant market share. Users are offered two subscription types: annual memberships for regular users, known as members and one-trip or day passes for occasional riders, whom are known as casual.

The marketing team is currently addressing three crucial questions that will shape the future marketing strategy:

What’s the difference between casual members and annual members?
What motivates casual riders to opt for an annual plan?
How can Cyclistic utilize digital media to encourage casual riders to choose an annual membership?

I am tasked to delve into the monthly data sets from 2023 and identify patterns among different member types and propose three strategic recommendations.

Within this document, you’ll find a comprehensive guide outlining the meticulous steps taken to prepare, process, and visualize the data.

For a detailed exploration of my findings, analysis, conclusions, and recommendations, please refer to the complete report accessible here

Ask

I have been tasked with answering the first question, investigate the difference between members and occasional riders, report my findings, craft visualizations and provide three strategic recommendations to the Marketing Director, with the objective of converting occasional bikers into paying members.

Expected output includes:

Description of all data sources.
Documentation of the analysis process.
Data visualizations.
Report of my analysis.

Prepare

Setting up the Environment

Data will be processed and analyzed using R Project, leveraging CRAN libraries to help across various analytical and data processing tasks.

library(tidyverse) # A bundle of R tools for easy and organized data work.
library(ggplot2)   # Create custom data visualizations.
library(readr)     # Quickly read and write data.
library(dplyr)     # Simplify data manipulation.
library(tibble)    # A user-friendly data frame.
library(janitor)   # Clean and format messy data easily.
library(tidyr)     # Reshape and organize data.
library(lubridate) # Manage dates and times more easily.
library(purrr)     # Simplify tasks related to functions and vectors.
library(knitr)     # Mix R code and results into documents.
library(skimr)     # Get quick insights about the data.
library(ggmap)     # Get Google Maps data (Requires API key).
library(sf)        # Spatial vector data encoding.
library(geosphere) # Calculate distance between lat and lng
library(leaflet)   # Create interactive maps
library(leaflet.extras)
library(geojsonio)
library(Rcpp)      # Integration of C++ with R for performance improvements
library(Cairo)     # HD plots
library(webshot)
library(mapview)
library(htmlwidgets)

RStudio is the preffered IDE due to its suitability for handling large datasets and its convenience in generating plots within the same environment, for publishing purposes, Rmarkdown and Rpubs.

Note: To process spatial data I attempted to capitalize on Stamen Maps. However they were recently acquired by Stadia Maps, rendering Stamen Maps unavailable in the R environment, Consequently, to process spatial data contained in the datasets, I have opted for Google Maps API instead and for widets leaflet has been a lifesaver.

ROCCC Statement

Cyclistic has provided a data bucket for this analysis. The data is distributed under this license agreement, Divvy Bikes is a program of the Chicago Department of Transportation (CDOT).

To ensure that the data meets the ROCCC (Reliable, Original, Comprehensive, Current, and Cited) standards, I’ve verified that it is meticulously organized, featuring clear columns for each user’s trip details. This organization facilitates a thorough understanding of the information. The data repository follows the naming convention YYYYMM-divvy-tripdata.zip, with our focus on the 2023 files.

It has been noted by the publisher that the dataset has anonymized trip IDs to protect user privacy. Additionally, any trips taken by staff members and trips that were below 0 seconds in duration have subsequently been removed, we’ll dive in to this later.

Importing Data Sets

First, let’s set the working directory where the data sets are located Then, import each file into the project as a data frame.

# Set Working Directory
setwd("~/Documents/Cyclist Data")

# Assign a variable to each workbook
Jan_2023 <- read_csv("202301-divvy-tripdata.csv")
Feb_2023 <- read_csv("202302-divvy-tripdata.csv")
Mar_2023 <- read_csv("202303-divvy-tripdata.csv")
Apr_2023 <- read_csv("202304-divvy-tripdata.csv")
May_2023 <- read_csv("202305-divvy-tripdata.csv")
Jun_2023 <- read_csv("202306-divvy-tripdata.csv")
Jul_2023 <- read_csv("202307-divvy-tripdata.csv")
Aug_2023 <- read_csv("202308-divvy-tripdata.csv")
Sep_2023 <- read_csv("202309-divvy-tripdata.csv")
Oct_2023 <- read_csv("202310-divvy-tripdata.csv")
Nov_2023 <- read_csv("202311-divvy-tripdata.csv")
Dec_2023 <- read_csv("202312-divvy-tripdata.csv")

Lets take a look at the first data frame to get an idea of how its structured.

## ── Data Summary ────────────────────────
##                            Values
## Name                       data  
## Number of rows             190301
## Number of columns          13    
## _______________________          
## Column type frequency:           
##   character                7     
##   numeric                  4     
##   POSIXct                  2     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────
##   skim_variable      n_missing complete_rate min max empty n_unique whitespace
## 1 ride_id                    0         1      16  16     0   190301          0
## 2 rideable_type              0         1      11  13     0        3          0
## 3 start_station_name     26721         0.860  10  50     0      964          0
## 4 start_station_id       26721         0.860   3  12     0      944          0
## 5 end_station_name       27840         0.854  10  50     0      962          0
## 6 end_station_id         27840         0.854   3  35     0      942          0
## 7 member_casual              0         1       6   6     0        2          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean     sd    p0   p25   p50   p75
## 1 start_lat             0         1      41.9 0.0492  41.6  41.9  41.9  41.9
## 2 start_lng             0         1     -87.6 0.0285 -87.8 -87.7 -87.6 -87.6
## 3 end_lat             127         0.999  41.9 0.0492  41.6  41.9  41.9  41.9
## 4 end_lng             127         0.999 -87.6 0.0285 -87.8 -87.7 -87.6 -87.6
##    p100 hist 
## 1  42.1 ▁▁▇▆▁
## 2 -87.5 ▁▁▆▇▁
## 3  42.1 ▁▁▇▅▁
## 4 -87.5 ▁▁▅▇▁
## 
## ── Variable type: POSIXct ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min                 max                
## 1 started_at            0             1 2023-01-01 00:01:58 2023-01-31 23:56:09
## 2 ended_at              0             1 2023-01-01 00:02:41 2023-02-04 04:27:03
##   median              n_unique
## 1 2023-01-14 16:26:15   178872
## 2 2023-01-14 16:44:41   179025

This process provides insights into two key aspects: first, a summary of the NA values present in the data frame, and second, an overview of the data formats for each variable.

Next, let’s compare data types across the columns in each workbook and identify any mismatches. We will then return only those columns that exhibit mismatched data types.

# Compare Column Types
mismatched_dt <- compare_df_cols(Jan_2023, Feb_2023, Mar_2023, Apr_2023, May_2023, Jun_2023, Jul_2023, Aug_2023, Sep_2023, Oct_2023, Nov_2023, Dec_2023, return = "mismatch")

## [1] "There are no mismatched columns"

Summary

Data has been imported into the project.

Data structure includes a variety of data types, encompassing both atomic vectors and nominal vectors.

Data type is consistent across workbooks.

Data adheres to the ROCCC standard.

Process

Combining Data Frames

To streamline data processing and address inconsistencies across observations stored in various data frames, let’s combine the data into a single data frame.

df_2023 <- list(Jan_2023, Feb_2023, Mar_2023, Apr_2023, May_2023, Jun_2023, Jul_2023, Aug_2023, Sep_2023, Oct_2023, Nov_2023, Dec_2023)

combined_df <- bind_rows(df_2023)

Let’s verify that all data from the workbooks has been successfully added to the combined data frame.

# Count the number of rows
total_rows <- nrow(combined_df)
total_rows1 <- sum(sapply(list(Jan_2023, Feb_2023, Mar_2023, Apr_2023, May_2023, Jun_2023, Jul_2023, Aug_2023, Sep_2023, Oct_2023, Nov_2023, Dec_2023),
                          nrow))

## There's a total of 5,719,877 observations

Transforming the data

There are two variables of POSIXct (Date and Time) class, started_at and ended_at.We can use these to calculate the duration for each observation in the data set.

transform1 <- combined_df %>%
  mutate(duration_seconds = round(as.numeric(difftime(ended_at, started_at, units = "secs"))),
         duration_minutes = round(as.numeric(difftime(ended_at, started_at, units = "mins")), digits = 2),
         duration_hours = round(as.numeric(difftime(ended_at, started_at, units = "hours")), digits = 3))

Let’s transform these variables and obtain the values for time, month, day, quarter and year.

transform2 <- transform1 %>% 
  arrange(started_at) %>% 
  mutate(day_name = format(started_at, "%A"),
         month_name = format(started_at, "%B"),
         hour = format(started_at, "%H"),
         day = as.numeric(format(started_at, "%d")),
         month = as.numeric(format(started_at, "%m")),
         year = as.numeric(format(started_at, "%Y")),
         quarter = cut(month, breaks = c(0, 3, 6, 9, 12), labels = c("Q1", "Q2", "Q3", "Q4")))

Calculating the distance from the beginning and end of each trip its rather convenient. To achieve this, we’ll need to define a new function capable of measuring the distance between geographical coordinates.

# Function to calculate distance using Haversine formula
haversine <- function(lat1, lon1, lat2, lon2) {
  R <- 6371  # Radius of the Earth in kilometers
  
  # Calculate the distance using the Haversine formula
  distance <- distGeo(c(lon1, lat1), c(lon2, lat2), a=R)
  
  return(distance)
}

# Prepare coordinate data
coords_start <- cbind(transform2$start_lng, transform2$start_lat)
coords_end <- cbind(transform2$end_lng, transform2$end_lat)

# Calculate distance for each row
transformed_df <- transform2 %>% 
  mutate(distance_meters = distGeo(coords_start, coords_end))

Finally, let’s factor categorical data. This process enables us to establish an order for our analysis.

transformed_df$day_name <- factor(transformed_df$day_name, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
transformed_df$hour <- factor(transformed_df$hour, levels = c("00", "01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18","19","20","21","22","23"))
transformed_df$month_name <- factor(transformed_df$month_name, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
transformed_df$quarter <- factor(transformed_df$quarter, levels = c("Q1", "Q2", "Q3", "Q4"))

Uncovering patterns

More often than not, there are hidden patterns lurking within the data, waiting to be discovered and leveraged for valuable insights.

In the Prepare phase, we observed missing values in the Jan_2023 data frame for variables such as start_station, end_station, end_lat, and end_lng.

Let’s perform a quick analysis on the newly combined data frame and verify whether the same columns contain missing values.

## Variables that contain NA values

## [1] "start_station_name" "start_station_id"   "end_station_name"  
## [4] "end_station_id"     "end_lat"            "end_lng"           
## [7] "distance_meters"

The significance of NA values lies in their potential to affect the integrity and accuracy of the data analysis process.

Let’s begin by having a look at NA values for start_station_name.

## 
## 
## |start_station_name |end_station_name       | start_lat|  end_lat|
## |:------------------|:----------------------|---------:|--------:|
## |NA                 |Halsted St & Roscoe St |     41.97| 41.94367|
## |NA                 |NA                     |     41.80| 41.80000|

From a glimpse, we can deduce that there may have been an issue recording the station name at the start of the trip. Let’s take a closer look at the NA values for the end_station_name variable.

## 
## 
## |start_station_name  |end_station_name | start_lat| end_lat|
## |:-------------------|:----------------|---------:|-------:|
## |NA                  |NA               |  41.98000|   41.96|
## |2112 W Peterson Ave |NA               |  41.99116|   41.98|

We’ve discovered that end_lat values are rounded to two decimal points when end_station_names are missing. Similarly, start_lat values are rounded in entries with missing start_station_names.

Side Note: At 5 decimal points, the distance equals a radius of 1.11 meters; at 2 decimal places, approximately 1.11 kilometers.

I’m curious about the types of bikes that performed these trips, and whether if it affects more occasional users or members.

# Summary of bike types and NA values for Start Station 
transformed_df %>% 
  select(start_station_name, end_station_name, rideable_type, member_casual) %>% 
  filter(is.na(start_station_name) | is.na(end_station_name) ) %>%
  group_by(rideable_type, member_casual) %>% 
  summarise(count = n()) %>% 
  arrange(member_casual) %>% 
  print()

## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.

## # A tibble: 5 × 3
## # Groups:   rideable_type [3]
##   rideable_type member_casual  count
##   <chr>         <chr>          <int>
## 1 classic_bike  casual          3818
## 2 docked_bike   casual          2047
## 3 electric_bike casual        521373
## 4 classic_bike  member          1304
## 5 electric_bike member        859266

This occurence happens mostly with electric bikes, which don’t require a docking station, we can assume that this happens because of issues related to the GPS system installed in these bikes, also this issue affects members more than it does to occasional users.

Let’s see how these values are plotted in a map, we’ll have a look at start_stations_names with NA values.

The uniform appearance is a consequence of the variables’ decimal points, We can infer that this pattern likely extends to end_station_names

What is the impact of these values on the data set?

# Count of rows with missing values
na_values_ssn <- combined_df[is.na(combined_df$start_station_name) |
                                      is.na(combined_df$end_station_name), 
                                      c("start_station_name", "end_station_name")] %>%
  nrow()

## NA values for start or end station names represent 1,387,808 a 24.26% of total observations.

That’s a lot, but this doesn’t necesesarily mean that these observations are useless.

Let’s turn our attention to the NA values contained in these variables.

na_value_spatial <- transformed_df[is.na(combined_df$end_lat) |
                                     is.na(combined_df$end_lng), 
                                   c("start_lat", "start_lng", "end_lat", "end_lng")] %>%
  nrow()

## NA values for spatial data in 'end_stations' represent 6,990 observations.

Let’s have a closer look at these observations and include another variable, duration_minutes.

na_value_spatial <- transformed_df %>% 
  select(member_casual, duration_minutes, start_lng, start_lat, end_lng, end_lat) %>% 
  filter(is.na(start_lat) | is.na(start_lng) | is.na(end_lat) | is.na(end_lng)) %>%
  arrange(desc(duration_minutes)) %>% 
  print()

## # A tibble: 6,990 × 6
##    member_casual duration_minutes start_lng start_lat end_lng end_lat
##    <chr>                    <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
##  1 casual                  98489.     -87.7      42.0      NA      NA
##  2 casual                  92570.     -87.7      42.0      NA      NA
##  3 casual                  83383.     -87.6      41.9      NA      NA
##  4 casual                  79775.     -87.6      41.9      NA      NA
##  5 casual                  64172.     -87.6      41.9      NA      NA
##  6 casual                  64009.     -87.6      41.9      NA      NA
##  7 casual                  62867.     -87.6      41.9      NA      NA
##  8 casual                  56195.     -87.6      41.9      NA      NA
##  9 casual                  55070.     -87.7      41.9      NA      NA
## 10 casual                  51461.     -87.6      41.9      NA      NA
## # ℹ 6,980 more rows

We have made another discovery, outliers. These trips have an unusual high duration, we can assume that these observations had issues at the end the trip or are stolen bikes, heh.

lets understand the impact of outlier data in this subset, assuming that the max duration of a normal trip would be around 30 minutes.

## Values above 90 minutes in the 'na_value_spatial' object represent 6,803 observations.

That’s around 98% of the values contained in the subset. Last, let’s have quick a look at distance_meters values.

## # A tibble: 10 × 7
##    member_casual distance_meters duration_minutes start_lng start_lat end_lng
##    <chr>                   <dbl>            <dbl>     <dbl>     <dbl>   <dbl>
##  1 member               9818680.             0.6      -87.7      41.9     0  
##  2 casual               9814574.           203.       -87.6      41.9     0  
##  3 casual               9814022.             4.83     -87.6      41.8     0  
##  4 casual                 49022.            99.5      -87.6      41.9   -88.2
##  5 member                 40859.             2.6      -87.6      41.9   -88.0
##  6 casual                 36909.           118.       -87.6      41.7   -87.8
##  7 casual                 33901.           150.       -87.7      41.8   -88.1
##  8 casual                 33057.           130.       -87.6      41.8   -87.7
##  9 member                 32221.            90.1      -87.7      42.0   -87.7
## 10 casual                 30601.           149.       -87.7      42.1   -87.6
## # ℹ 1 more variable: end_lat <dbl>

That is also an an unusual travelled distance. Now that we have found some patterns in the data, and also discovered outliers, it’s time to do some cleaning.

Before we beging

Summary Statistics

Let’s take a look at the summary values for min, max, median and mean for the duration and distance_meters columns, this will give us insights into the range and central tendency of the data.

##  duration_seconds  duration_minutes    duration_hours      distance_meters  
##  Min.   :-999391   Min.   :-16656.52   Min.   :-277.6090   Min.   :      0  
##  1st Qu.:    325   1st Qu.:     5.42   1st Qu.:   0.0900   1st Qu.:    863  
##  Median :    572   Median :     9.53   Median :   0.1590   Median :   1541  
##  Mean   :   1090   Mean   :    18.17   Mean   :   0.3028   Mean   :   2103  
##  3rd Qu.:   1015   3rd Qu.:    16.92   3rd Qu.:   0.2820   3rd Qu.:   2749  
##  Max.   :5909344   Max.   : 98489.07   Max.   :1641.4840   Max.   :9818680  
##                                                            NA's   :6990

The mean provides a measure of central tendency by summing all values in the variable and dividing by the total count of values. The median represents the middle value of the variable The min and max values denote the lowest and highest values. Additionally, the 1st and 3rd quartiles represent the values below which 25% and 75% of the data fall, respectively.

Central Distribution

Normalizing the data is crucial before analysis to ensure accurate insights. By adjusting data to a common scale or distribution, comparisons between variables become meaningful. It aids in identifying patterns, relationships, and outliers effectively. Normalization enhances the reliability and validity of statistical analyses, allowing for more accurate conclusions and informed decision-making.

We are going to start by conducting a Shapiro-Wilk test to assess if the data follows a normal distribution. For this task we are going to establish a null and alternative hypothesis, setting a 0.05 significance level.

The test will be conducted on the duration_minutes variable, as the duration of the trip is closely related to distance. Analyzing outliers in the duration_minutes variable is essential for understanding trip characteristics and potential anomalies.

H0 = The data set follows a normal distribution. H1 = The data set does not follow a normal distribution.

# Arranging the data
duration_data <- transformed_df %>%
  arrange(duration_minutes)

# Extracting the duration_minutes column
test_data <- as.data.frame(duration_data$duration_minutes)
colnames(test_data) <- "duration"

# And convert it to a numeric class for the purpose of the test
fig_2_data <- as.numeric(test_data$duration)

# Shapiro-Wilk Test
shapiro_sample_size <- 3000
shapiro_sample_data <- sample(fig_2_data, shapiro_sample_size)
shapiro_result <- shapiro.test(shapiro_sample_data)
print(shapiro_result)

## 
##  Shapiro-Wilk normality test
## 
## data:  shapiro_sample_data
## W = 0.0064973, p-value < 2.2e-16

The test result indicate strong evidence to reject the null hypothesis, we can determine this by looking at the p-value which is smaller than the significance level. We can safely conclude that the data does not follow a normal distribution.

Let’s visualize the test data in a QQ plot, and confirm this. View Chunk

The green line in the plot is this normality we are seeking, the black line represents the actual data. At both edges we see little circles that represent the values that drift apart from normality; we are going to call these values outliers.

Excluding Outliers

These outliers can significantly influence the mean and median values, potentially leading to an inaccurate portrayal of the central tendency of the data. Therefore, it’s crucial to identify and appropriately handle these outliers to ensure the robustness and reliability of our analysis and conclusions.

To achieve this, we will employ the IQR method, which involves determining the first (lower quartile) and third quartile ranges (upper quartile) of the dataset. We will then exclude values falling outside this range from our analysis, resulting in a normalized dataset.

On top of this, let’s declare a new function to obtain the mode of the duration variable.

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Extract First and Third quartile
Q1 <- quantile(duration_data$duration_minutes, 0.25)
Q3 <- quantile(duration_data$duration_minutes, 0.75)

# Calculate IQR
IQR <- Q3 - Q1

# Define lower and upper fences
lower_fence <- Q1 - 1.5 * IQR
upper_fence <- Q3 + 1.5 * IQR

# Identify outliers
outliers_logical <- duration_data$duration_minutes < lower_fence | duration_data$duration_minutes > upper_fence
outliers <- duration_data[outliers_logical, ]

# Exclude outliers
rm_outliers_df <- duration_data[!outliers_logical, ]
summary(rm_outliers_df$duration_minutes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -11.60    5.17    8.82   10.75   14.67   34.17

Let’s visualize the data distribution by plotting a Bell Curve and examining its fit to the data. View Chunk

The bell curve showcases duration in minutes on the X axis and the density of these values on the Y axis, the curve central tendency normalizes at 3 standard deviations significantly reducing the statistics in our data frame.

Further Cleaning

Let’s examine the trips with a duration of zero or less.

## # A tibble: 1,211 × 3
##    rideable_type duration_minutes distance_meters
##    <chr>                    <dbl>           <dbl>
##  1 classic_bike            -11.6               0 
##  2 electric_bike           -11.1             310.
##  3 electric_bike           -11.1             865.
##  4 electric_bike            -8.93            378.
##  5 electric_bike            -8.43            401.
##  6 electric_bike            -5.87            508.
##  7 electric_bike            -5.05            352.
##  8 electric_bike            -5.05          10631.
##  9 electric_bike            -4.97            636.
## 10 electric_bike            -4.92            268.
## # ℹ 1,201 more rows

## 421,594 ( 7.37% ) observations with values are equal or below zero have been excluded from the data set.

Let’s shift our focus to the distance_meters variable. Let’s investigate trips with durations of less than 5 minutes to uncover any noteworthy insights.

distance_upper_values <- cleaned_duration %>%
  select(distance_meters, duration_minutes, start_lng, start_lat, end_lng, end_lat) %>%
  filter(!is.na(distance_meters) & duration_minutes < 5) %>% 
  arrange(desc(distance_meters)) %>% 
  slice(1:10) %>% 
  print()

## # A tibble: 10 × 6
##    distance_meters duration_minutes start_lng start_lat end_lng end_lat
##              <dbl>            <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
##  1        9818680.             0.6      -87.7      41.9     0       0  
##  2        9814022.             4.83     -87.6      41.8     0       0  
##  3          40859.             2.6      -87.6      41.9   -88.0    41.6
##  4          20950.             3.53     -87.6      41.8   -87.7    41.9
##  5          20511.             4.85     -87.6      41.8   -87.6    41.9
##  6          18649.             3.93     -87.8      42.0   -87.6    41.9
##  7          14482.             4.07     -87.6      41.8   -87.6    41.9
##  8          13202.             4.77     -87.6      41.8   -87.6    41.9
##  9          12812.             4.87     -87.6      41.9   -87.6    41.8
## 10          12648.             3.63     -87.6      41.9   -87.7    42.0

It’s quite unusual that some rides exhibit an exceptionally long traversed distance considering their duration, we could assume that there was an issue recording the time at the start of the trip, on top of that the first two rows display an incorrectly calculated distance. Let’s take a look at the mean and median distance by time.

## # A tibble: 35 × 3
##    duration_minutes mean_distance median_distance
##               <dbl>         <dbl>           <dbl>
##  1               34         4537.           4520.
##  2               33         4477.           4462.
##  3               32         4474.           4520.
##  4               31         4460.           4505.
##  5               30         4384.           4449.
##  6               29         4319.           4372.
##  7               28         4242.           4296.
##  8               27         4200.           4256.
##  9               26         4134.           4223.
## 10               25         4068.           4143.
## # ℹ 25 more rows

Considering that the max speed for Cyclistic bikes is of 20 mph (32 km/h), we will exclude values where the duration exceeds it’s maximum distance, and we’ll store these values in excluded_dist_df.
- Duration < 30 seconds and Distance = 0 meters: Possible false starts, removed observations 30,950.
- Duration < 5 minutes, Distance > 2,670 meters: removed observations 288.
- Duration < 10 minutes, Distance > 5,359 meters: removed observations 209
- Duration < 15 minutes, Distance > 8,046 meters: removed observations 123
- Duration < 20 minutes, Distance > 10,724 meters: removed observations 59
- Duration < 25 minutes, Distance > 13,408 meters: removed observations 24
- Duration < 30 minutes, Distance > 16,092 meters: removed observations 20
- Duration < 34 minutes, Distance > 18,239 meters: removed observations 13

# Set data frames with conditions
rm1 <- cleaned_duration %>%
  filter(!is.na(distance_meters) & distance_meters > 18239 & duration_minutes <= 34) %>% 
  arrange(distance_meters)

rm2 <- cleaned_duration %>%
  filter(!is.na(distance_meters) & distance_meters > 16092 & duration_minutes <= 30) %>% 
  arrange(distance_meters)

rm3 <- cleaned_duration %>%
  filter(!is.na(distance_meters) & distance_meters > 13408 & duration_minutes <= 25) %>% 
  arrange(distance_meters)

rm4 <- cleaned_duration %>%
  filter(!is.na(distance_meters) & distance_meters > 10724 & duration_minutes <= 20) %>% 
  arrange(distance_meters)

rm5 <- cleaned_duration %>%
  filter(!is.na(distance_meters) & distance_meters > 8046 & duration_minutes <= 15) %>% 
  arrange(distance_meters) 

rm6 <- cleaned_duration %>%
  filter(!is.na(distance_meters) & distance_meters > 5359 & duration_minutes <= 10) %>% 
  arrange(distance_meters)

rm7 <- cleaned_duration %>%
  filter(!is.na(distance_meters) & distance_meters > 2670 & duration_minutes <= 5) %>% 
  arrange(distance_meters)

rm8 <- cleaned_duration %>%
  filter(!is.na(distance_meters) & distance_meters == 0 & duration_seconds <= 15) %>% 
  arrange(distance_meters)

# Exclude the conditions from the dataframe
excluded_dist_df <- cleaned_duration %>%
  anti_join(rm1, by = c("distance_meters", "duration_minutes")) %>%
  anti_join(rm2, by = c("distance_meters", "duration_minutes")) %>%
  anti_join(rm3, by = c("distance_meters", "duration_minutes")) %>%
  anti_join(rm4, by = c("distance_meters", "duration_minutes")) %>%
  anti_join(rm5, by = c("distance_meters", "duration_minutes")) %>%
  anti_join(rm6, by = c("distance_meters", "duration_minutes")) %>%
  anti_join(rm7, by = c("distance_meters", "duration_minutes")) %>%
  anti_join(rm8, by = c("distance_meters", "duration_minutes"))

## 31,386 ( 0.59% ) observations that surpass the maximum travel distance based on speed have been omitted from the dataset.

Next, let’s examine the recorded values for end_station_name, noting that there are missing values for the end_lng and end_lat variables.

## # A tibble: 5 × 1
##   end_station_name          
##   <chr>                     
## 1 Halsted St & Fulton St    
## 2 Stony Island Ave & 63rd St
## 3 Elizabeth St & Randolph St
## 4 Lincoln Ave & Byron St    
## 5 Drexel Ave & 60th St

To address the missing coordinates and re-calculate the distance, we first need to declare the lat and lng values for these stations, subsequently, we can proceed to fill in the absent data and store it in a data frame.

end_lat_value1 <- 41.88687
end_lng_value1 <- -87.64809
end_lat_value2 <- 41.78051
end_lng_value2 <- -87.58685
end_lat_value3 <- 41.88434
end_lng_value3 <- -87.6589
end_lat_value4 <- 41.95237
end_lng_value4 <- -87.6773
end_lat_value5 <- 41.78586
end_lng_value5 <- -87.60455

# Fill missing values with corresponding station data
fill_values <- excluded_dist_df %>%
  mutate(end_lat = if_else(end_station_name == "Halsted St & Fulton St" & is.na(end_lat), end_lat_value1, end_lat),
         end_lng = if_else(end_station_name == "Halsted St & Fulton St" & is.na(end_lng), end_lng_value1, end_lng)) %>%
  mutate(end_lat = if_else(end_station_name == "Stony Island Ave & 63rd St" & is.na(end_lat), end_lat_value2, end_lat),
         end_lng = if_else(end_station_name == "Stony Island Ave & 63rd St" & is.na(end_lng), end_lng_value2, end_lng)) %>% 
  mutate(end_lat = if_else(end_station_name == "Elizabeth St & Randolph St" & is.na(end_lat), end_lat_value3, end_lat),
         end_lng = if_else(end_station_name == "Elizabeth St & Randolph St" & is.na(end_lng), end_lng_value3, end_lng)) %>% 
  mutate(end_lat = if_else(end_station_name == "Lincoln Ave & Byron St" & is.na(end_lat), end_lat_value4, end_lat),
         end_lng = if_else(end_station_name == "Lincoln Ave & Byron St" & is.na(end_lng), end_lng_value4, end_lng)) %>% 
  mutate(end_lat = if_else(end_station_name == "Drexel Ave & 60th St" & is.na(end_lat), end_lat_value5, end_lat),
         end_lng = if_else(end_station_name == "Drexel Ave & 60th St" & is.na(end_lng), end_lng_value5, end_lng))

# Prepare coordinate data
coords_start <- cbind(fill_values$start_lng, fill_values$start_lat)
coords_end <- cbind(fill_values$end_lng, fill_values$end_lat)

# Re calculate distance for each row
cleaned_distance <- fill_values %>% 
  mutate(distance_meters = distGeo(coords_start, coords_end))

## 93 observations with no NA values for `end_station_name` and contain missing `end_lng` and `end_lat` values have been filled.

Finally, lets exclude from the data set the variables that have no recorded end_lng and end_lat coordinates. We are doing this for the purpose of maintaining row consistency across the data set.

rm9 <- cleaned_distance %>%
  filter(is.na(end_lat) & is.na(end_lng))

cyclistic_data_2023 <- cleaned_distance %>%
  anti_join(rm9, by = c("end_lat", "end_lng"))

## 118 observations with NA values for `end_station_name` but contain missing `end_lng` and `end_lat` values have been filled.

Lets have one last look at the summary statistics for the duration_minutes and distance_meters variables.

##  duration_minutes distance_meters  
##  Min.   : 0.02    Min.   :    0.0  
##  1st Qu.: 5.22    1st Qu.:  877.8  
##  Median : 8.88    Median : 1518.8  
##  Mean   :10.82    Mean   : 1967.1  
##  3rd Qu.:14.72    3rd Qu.: 2620.9  
##  Max.   :34.17    Max.   :14318.8

## 453,098 ( 7.92% ) observations have been excluded for the analysis phase.

With a cleaned data set that reflects a centralized distribution, we can be certain that our analysis will be as accurate as possible.

Summary

Combined monthly trips files into a single data frame.

Extracted duration and date values, factored values.

Discovered patterns in geo-spatial data

Cleaned outliers and filled missing values.

Distribution has been normalized.

7.92% of total observations have been excluded from the analysis.

Analyze

The final step of the data analysis process is the analysis itself, since we are already familiar with the structure of the data set, we have a good idea on how to approach our business task. As a refresher we are tasked with finding the Differences between Casual Riders and Member Subscribers, recommendations are available on the final report.

## Correlation index between mean duration and distance = -0.4222123

Fig 17. #14 Bike Type Preference

View Chunk

With all the visualizations in place, there’s not much left to say but thank you for accompanying me this far. I hope that this journey has offered valuable insights into the comprehensive data analysis process. Have a good day!

Resources and References

Leveraged resources, articles and communities for knowledge, insights, inspiration and answers.

Special Thanks

Coursera - “Google Data Analytics Professional Certificate” Coursera
Fundae - “Fundación Estatal para la Formación en el Empleo” FUNDAE

Resources

Yihui Xie, Christophe Dervieux, Emily Riederer (2023). “R Markdown Cookbook” Retrieved from Bookdown
Yihui Xie, J. J. Allaire, Garrett Grolemund (2023). “R Markdown: The Definitive Guide” Retrieved from Bookdown
Zachary M. Smith (2020). “R Markdown Crash Course” Retrieved from Bookdown
Robert J. Hijmans (2023). “Spherical Data Analysis with R” Retrieved From Rspatial

Organizations

Megan A. Jones, Marisa Guarinello, Courtney Soderberg, Leah A. Wasser (2021). “Time Series 00: Intro to Time Series Data in R - Managing Date/Time Formats” Retrieved from Neon Science
cmkmanwani (2022). “Mathematics | Probability Distributions Set 3 (Normal Distribution).” Retrieved from GeeksforGeeks
Courtney Taylor (2020). “How to Calculate a Sample Standard Deviation.” Retrieved from ThoughtCo
Dr. Mark Williamson (n.d.). “Sample Size in R.” Retrieved from University of North Dakota
Jose Luis Garcia G. (2019) “Creación de visores de mapas web de Leaflet en R” Retrieved from Mapping GIS
Melanie Frazier (2020). “R color cheatsheet” Retrieved from NCEAS
Swirl - “Learn R, in R” Swirl
Brilliant - “Statistics Fundamentals” Retrieved from Brilliant

Blogs

Statisticshowto (n.d.). “Interquartile Range (IQR): What It Is and How to Find It.” Retrieved from Statisticshowto
Antoine Soetewey (2020). “Outliers Detection in R.” Retrieved from statsandr
Antoine Soetewey (2024). “Binary Logistic Regression in R.” Retrieved from statsandr
Pritha Bhandari (2020). “Interquartile Range (IQR): What It Is and How to Find It.” Retrieved from Scribbr
Cedric Chin (2024). “Becoming Data-Driven: First Principles.” Retrieved from Commoncog
Alicia Horsch (2021). “Hypothesis Testing for Data Scientists: Everything You Need to Know.” Retrieved from Towards Data Science
Zach (2021). “How to Fix in R: Non-numeric Argument to Binary Operator.” Retrieved from Statology

Communities

Pierre L (2015). “Heatmap plot by value using ggmap” Retrieved from Stack Overflow
Rstudio Community (2021). “Need help with ggmap and google API in Rstudio”. Retrieved from RStudio Community
Crestor (2021). “Problem registering google map API using”register_google” in ggmap” Retrieved from Stack Overflow
finnstats (2021). “How to Identify Outliers: Grubbs’ Test in R.” Retrieved from R-bloggers
Gaurav Chawla (2015). “How to Prevent Scientific Notation in R.” Retrieved from Stack Overflow
SteveO7 (2015). “Function to calculate geospatial distance between two points (lat,long) using R [duplicate]” Retrieved from Stack Overflow
kjo (2019). “How to Add Count Labels to the Right of the Bars in a Horizontal Bar Chart.” Retrieved from Stack Overflow
Jonathan West (2020). “Leaflet not rendering in dynamically generated R markdown html knitr” Retrieved from Stack Overflow
f.lechleitner (2018). “RMarkdown: Floating TOC and TOC at beginning” Retrieved from Stack Overflow
Cristoph (2018). “How to remove a section from the table of contents” Retrieved from Stack Overflow

Videos

Lisa Lendway - Statistics & Data Science (2020, September 19). “Plotting data on maps in R using ggmap” Retrieved from Youtube
AI & Data Science with Raghav (2018). “R Tutorials 1: Plot a Standard Normal Curve” Retrieved from YouTube

Capstones

Abdullaah, Odunmbaku (2023). “Cyclistic Case Study” Retrieved from RPubs
Ezra Ip (2021). “Summary on Cyclistic Chapter 1: The usage time difference between casuals and members” Retrieved from Medium
Sagnik Chand (2023). “Google Data Analytics Capstone Project: Cyclistic Bike-share Case Study” Retrieved fom Medium
Joey Petosa (2023). “Cyclistic Case Study Using Spreadsheets, SQL and Tableau” Retrieved from Joey Petosa
J. Mason (2023). “Cyclistic Case Study” Retrieved from RPubs
Eyitayo Ogunniyi (2023). “Google Data Analytics Capstone Project: Cyclistic Bike Share Case Study” Retrieved from Medium
Kevin de Ramos (2023). “Google Data Analytics Capstone: Cyclistic Case Study” Retrieved from LinkedIn
Dimitris Raidos (2022). “Cyclistic Case Study - Google Data Analytics Cert.”Retrieved from Kaggle

Tools

Color for HTML snippets. Supplied by HTML Code Generator
Map coordinates. Supplied by OpenStreetMap
Awesome R Editor. Supplied byRstudio
AI-supported troubleshooting. Supplied by ChatGPT

Chunks

Fig 1. Missing Start Stations Names by Rideable Types

View Fig

# Modeling the plot data
fig_1_data <- transformed_df %>% 
  select(member_casual, start_station_name, end_station_name, rideable_type, end_lat, end_lng) %>% 
  filter(is.na(end_station_name) & member_casual == "member") %>% 
  group_by(rideable_type, end_lat, end_lng) %>% 
  summarise(count = n(), .groups = 'drop') %>% print()

# Setting up the map
suppressMessages({
fig_1_map <- get_googlemap(center = c( lon = -87.630303, lat = 41.871074),
                             style = c(feature = "all", element = "none", visibility = "off"),
                             zoom = 11,
                             size = c(720, 720),
                             maptype = "roadmap")
})

# Plotting the data on the map
ggmap(fig_1_map) +
  geom_point(data = fig_1_data, aes(x = end_lng, y = end_lat, color = rideable_type, size = count), alpha = 0.5, na.rm = TRUE) +
  scale_color_manual(values = c("electric_bike" = "gray", "classic_bike" = "black")) +
  labs(title = "Fig 1. Missing Start Stations Names by Rideable Types", x = "Longitude", y = "Latitude", size = "Value Count", color = "Bike Type") +
  theme_minimal() +
  theme(text = element_text(family = "Trebuchet MS"),
        plot.title = element_text(hjust = 0.5, vjust = 2, size = 12),
        axis.text = element_blank(),
        axis.title = element_blank())

Fig 2. QQ Plot

View Fig

fig_1 <- qqnorm(fig_2_data, main = "fig 2. Normal QQ Plot")
qqline(fig_2_data, distribution = qnorm, col = "darkgreen")

Fig 3. Bell Curve

View Fig

# Central Tendency Values
mean <- mean(cyclistic_data_2023$duration_minutes)
sd <- sd(cyclistic_data_2023$duration_minutes)
y <- dnorm(cyclistic_data_2023$duration_minutes, mean, sd)
mode <- Mode(cyclistic_data_2023$duration_minutes)
median <- median(cyclistic_data_2023$duration_minutes)

# Plotting The Bell Curve
fig_2 <- plot(x = cyclistic_data_2023$duration_minutes, 
              y = y,
              col = "blue", 
              type = "l",
              xlab = "Duration in Mins", 
              ylab = "Density", 
              main = "fig 2. Bell Curve",
              lwd = 6)

# Text Format
par(family = "Trebuchet MS")

# Adding SD lines
text(mean + seq(-3, 3) * sd, 
     dnorm(mean + seq(-3, 3) * sd, mean, sd), 
     c(paste0(-(3:1), "sd"), "", paste0(1:3, "sd")), 
     pos = 3,
     offset = 3,
     col = "black")

# Mean Line
abline(v = mean + seq(-3, 3) * sd, 
       col = "gray", 
       lty = 2)

# Mean Line Format
text(mean,
     dnorm(mean, mean, sd),
     "Mean",
     pos = 1,
     offset = 8,
     col = "darkgray")

# Median Line
abline(v = median, 
       col = "darkgreen", 
       lty = 2)

# Median Line Format
text(median, 
     dnorm(median, mean, sd), 
     "Median", 
     pos = 3,
     offset = -6.5,
     col = "darkgreen")

# Mode Line
abline(v = mode, 
       col = "blue", 
       lty = 2)

# Mode Line Format
text(mode, 
     dnorm(mode, mean, sd), 
     "Mode", 
     pos = 3,
     offset = -7,
     col = "blue")

Fig 4. #1 Trip Distribution

View Fig

fig_4 <- ggplot(fig_4_data, aes(x = "", y = Freq, fill = Var1)) +
  coord_polar("y", start = 0, direction = 1) +
  geom_bar(stat = "identity", 
           width = 1, 
           color = "#FFFFFF") +
  scale_fill_manual(values = c("casual" = "#EDB8AA", "member" = "#D4EDAA"),
                    breaks = c("casual", "member"),
                    labels = c("Casual", "Member")) +
  labs(title = "fig 1. User Distribution", subtitle = "for 2023", fill = "", caption = "casuals: 1,769,115; members: 3,497,664") +
  geom_text(aes(label = paste(round(Freq / sum(Freq) * 100), "%")), 
            position = position_stack(vjust = 0.5),
            family = "Verdana") +
  theme_classic() +
  theme(axis.text = element_blank(), 
        axis.title = element_blank(), 
        panel.grid = element_blank(),
        axis.line = element_line("#FFFFFF"),
        text = element_text(family = "Trebuchet MS"),
        plot.title = element_text(hjust = 0.5, vjust = -5, face = 'bold', family = "Trebuchet MS"),
        plot.subtitle = element_text(hjust = 0.5, vjust = -8, face = 'italic', family = "Trebuchet MS"),
        legend.title = element_text(size = 12),
        legend.position = "top",
        legend.box.margin = margin(t = 18, r = 5, b = -31, l = 0),
        legend.text = element_text(size = 10),
        plot.caption = element_text(face = "italic", hjust = 0.5, vjust = 7))

Fig 5. #2 Trip Distribution Per Quarter

View Fig

fig_5 <- ggplot(fig_5_data, aes(y = reorder(quarter,desc(quarter)), x = count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge2") +
  geom_text(aes(label = format(count, big.mark = ",")), 
          position = position_dodge(width = 0.9), 
          hjust = 1.2, 
          size = 3, 
          family = "Trebuchet MS",
          color = "black") +
  geom_text(aes(label = sprintf("%.2f%%", percent)), 
          position = position_dodge(width = 0.9), 
          hjust = -0.2, 
          size = 3, 
          family = "Trebuchet MS",
          color = "black",
          check_overlap = TRUE)+  
  scale_fill_manual(values = c("casual" = "#EDB8AA", "member" = "#D4EDAA"),
                    breaks = c("casual", "member"), 
                    labels = c("Casual", "Member")) +
  labs(title = "fig 2. Quarterly Trip Distribution", 
       subtitle = "Categorized by user group", 
       y = "", 
       x = "Frequency", 
       fill = "",
       caption = "Percentages based on group totals") + 
  scale_x_continuous(labels = function(x) format(x, big.mark = ","), limits = c(0, 1300000)) +
  theme_classic() +
  theme(
    axis.line = element_line("gray"),
    text = element_text(family = "Trebuchet MS"),
    plot.title = element_text(hjust = 0.5, vjust = -1, face = 'bold'),
    plot.subtitle = element_text(hjust = 0.5, vjust = -2, face = 'italic'),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 8),
    axis.title = element_text(size = 8),
    axis.ticks.y = element_line(linewidth = 0.1),
    axis.ticks.x = element_line(linewidth = 0.1),
    legend.position = "top",
    legend.box.margin = margin(t = 13, r = 0, b = -50, l = 290))

Fig 6. #3 Trip Distribution Per Month

View Fig

fig_6 <- ggplot(fig_6_data, aes(x = month_name, y = percent, fill = member_casual)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(x = month_name, y = percent, label = sprintf("%.2f%%", percent)), 
            position = position_fill(vjust = 0.5),
            size = 2.5, 
            family = "Trebuchet MS",
            color = "black") +
  scale_fill_manual(values = c("casual" = "#EDB8AA", "member" = "#D4EDAA"),
                    breaks = c("casual", "member"), 
                    labels = c("occasional", "member")) +
  labs(title = "fig 6. Monthly Trip Distribution", 
       subtitle = "Categorized by user group", 
       x = "", 
       y = "", 
       fill = "",
       caption = " - Percentages based on group totals") + 
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))+
  theme_classic() +
  theme(
    axis.line = element_line("gray"),
    text = element_text(family = "Trebuchet MS"),
    plot.title = element_text(hjust = 0.5, vjust = -1, face = 'bold', family = "Trebuchet MS"),
    plot.subtitle = element_text(hjust = 0.5, vjust = -2, face = 'italic', family = "Trebuchet MS"),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 8),
    axis.title = element_text(size = 8),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.caption = element_text(face = "italic", hjust = 0, margin = margin(t = 10, r = 0, b = 0, l = 0)))

Fig 7. #4 Trip Distribution per Day

View Fig

fig_7 <- ggplot(fig_7_data, aes(x = day_name, y = percent, fill = member_casual)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(x = day_name, y = percent, label = sprintf("%.2f%%",percent)), 
            position = position_fill(vjust = 0.5),
            size = 3, 
            family = "Trebuchet MS",
            color = "black") +  
  scale_fill_manual(values = c("casual" = "#EDB8AA", "member" = "#D4EDAA"),
                    breaks = c("casual", "member"), 
                    labels = c("occasional", "member")) +
  labs(title = "fig 4. Trip Distribution per Day", 
       subtitle = "Categorized by user group", 
       x = "Day", 
       y = "Distribution", 
       fill = "",
              caption = " - Percentages based on group totals") + 
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))+
  theme_classic() +
  theme(
    axis.line = element_line("gray"),
    text = element_text(family = "Trebuchet MS"),
    plot.title = element_text(hjust = 0.5, vjust = -1, face = 'bold', family = "Trebuchet MS"),
    plot.subtitle = element_text(hjust = 0.5, vjust = -2, face = 'italic', family = "Trebuchet MS"),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 8),
    axis.text.y = element_text(size = 8),
    axis.title = element_text(size = 8),
    axis.ticks.y = element_line(linewidth = 0.1),
    axis.ticks.x = element_line(linewidth = 0.1),
    plot.caption = element_text(face = "italic", hjust = 0, margin = margin(t = 10, r = 0, b = 0, l = 0)))

Fig 8. #5 Casual Users Daily Breakdown

View Fig

fig_8 <- ggplot(fig_8_data, aes(x = day_name, y = hour, fill = percent, color = member_casual)) +
  geom_tile(color = "black") +  
  scale_fill_gradient(low = "gray", high = "#9D3414") +
  geom_text(aes(label = sprintf("%.1f%%", percent)), 
            size = 3,
            color = "white") +  
  labs(title = "fig 5. Hourly Distribution ", subtitle = "Casual Riders", x="", y = "") +
  theme_minimal() +
    theme(
    axis.line = element_line("gray"),
    text = element_text(family = "Trebuchet MS"),
    plot.title = element_text(hjust = 0.5, vjust = 1.5, face = 'bold', family = "Trebuchet MS"),
    plot.subtitle = element_text(hjust = 0.5, vjust = 1, face = 'italic', family = "Trebuchet MS"),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 8),
    axis.text.y = element_text(size = 8),
    axis.title = element_text(size = 8),
    axis.ticks.y = element_line(linewidth = 0.1),
    axis.ticks.x = element_line(linewidth = 0.1),
    plot.margin = margin(t = 1, r = 0, b = 0, l = 0, unit = "lines")) +
  guides(fill = 'none')

Fig 9. #6 Member Users Daily Breakdown

View Fig

fig_9 <- ggplot(fig_9_data, aes(x = day_name, y = hour, fill = percent, color = member_casual)) +
  geom_tile(color = "black") +  
  scale_fill_gradient(low = "gray", high = "#448B0E") +
  geom_text(aes(label = sprintf("%.1f%%", percent)), 
            size = 3,
            color = "white") +  
  labs(title = "fig 6. Hourly Distribution ", subtitle = "Member Subscribers", x="", y = "") +
  theme_minimal() +
    theme(
    axis.line = element_line("gray"),
    text = element_text(family = "Trebuchet MS"),
    plot.title = element_text(hjust = 0.5, vjust = 1.5, face = 'bold', family = "Trebuchet MS"),
    plot.subtitle = element_text(hjust = 0.5, vjust = 1, face = 'italic', family = "Trebuchet MS"),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 8),
    axis.text.y = element_text(size = 8),
    axis.title = element_text(size = 8),
    axis.ticks.y = element_line(linewidth = 0.1),
    axis.ticks.x = element_line(linewidth = 0.1),
    plot.margin = margin(t = 1, r = 0, b = 0, l = 0, unit = "lines")) +
  guides(fill = 'none')

Fig 10. #7 Leaflet. Choroplot Start Stations

View Fig

fig_10 <- leaflet(districts, options = leafletOptions(minZoom = 10, maxZoom = 12, fontFamily = "Trebuchet MS")) %>% 
  setView(lng = -87.6473, lat = 41.8563, zoom = 10) %>% 
  setMaxBounds(lng1 = -87.9789, lat1 = 42.1270, lng2 = -87.2184, lat2 = 41.5623) %>% 
  addProviderTiles("CartoDB.Voyager") %>% 
  addPolygons(
    fillColor = ~pal(fig_10_data$total_count),
    weight = 2,
    opacity = 20,
    color = "white",
    dashArray = "3",
    fillOpacity = 0.6,
    highlightOptions = highlightOptions(
      weight = 2,
      color = "#666",
      dashArray = "",
      fillOpacity = 0.5,
      bringToFront = TRUE),
      label = labels_choro,
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px",
      direction = "auto")) %>% 
    addLegend(
    "bottomright", 
    pal = pal, 
    values = ~fig_10_data$total_count,
    title = "Trips",
    opacity = 0.6) %>%
  addControl(
    html = "<h1 style='padding: 2px; margin: 0; font-family: Trebuchet MS'>Bike Trip Origins</h3>",
    position = "topright") %>%
  addControl(
    html = "<h3 style='padding: 0xpx; margin: 0; font-family: Trebuchet MS; font-size: 12px;'>fig 7. Trip Starts by Districts in Chicago</h3>",
    position = "topright")

Fig 11. #8 Leaflet. Casual Start Stations Heatmap

View Fig

fig_11 <- leaflet(data = fig_11_data, options = leafletOptions(minZoom = 10, maxZoom = 16, fontFamily = "Trebuchet MS")) %>%
  setView(lng = -87.3444, lat = 41.8563, zoom = 10) %>% 
  setMaxBounds(lng1 = -89.9789, lat1 = 42.1270, lng2 = -87.2184, lat2 = 41.5623) %>% 
  addProviderTiles("CartoDB.DarkMatter") %>%
  addCircleMarkers(
    lng = ~start_lng,
    lat = ~start_lat, 
    popup = ~paste(start_station_name, "<br> Total Trips:", count),
    radius = sqrt(fig_11_data$count) * 0.06, 
    fillOpacity = 1,
    color = ~pal(fig_11_data$count),
    stroke = FALSE) %>%
  addLegend(
    "bottomright", 
    pal = pal, 
    values = ~fig_11_data$count,
    title = "Count",
    opacity = 0.5) %>%
  addControl(
    html = "<h1 style='padding: 2px; margin: 0; font-family: Trebuchet MS'>Trip Start</h1>",
    position = "topright") %>%
  addControl(
    html = "<h3 style='padding: 0xpx; margin: 0; font-family: Trebuchet MS; font-size: 14px;'>Fig 8. Casual Riders Trip Start </h3>",
    position = "topright")

Fig 12. #9 leaflet. Casual End Stations Heatmap

View Fig

fig_12 <- leaflet(data = fig_12_data, options = leafletOptions(minZoom = 10, maxZoom = 16, fontFamily = "Trebuchet MS")) %>%
  setView(lng = -87.3444, lat = 41.8563, zoom = 10) %>% 
  setMaxBounds(lng1 = -89.9789, lat1 = 42.1270, lng2 = -87.2184, lat2 = 41.5623) %>% 
  addProviderTiles("CartoDB.DarkMatter") %>%
  addCircleMarkers(
    lng = ~end_lng,
    lat = ~end_lat, 
    popup = ~paste(end_station_name, "<br> Total Trips:", count),
    radius = sqrt(fig_12_data$count) * 0.06, 
    fillOpacity = 1,
    color = ~pal(fig_12_data$count),
    stroke = FALSE) %>%
  addLegend(
    "bottomright", 
    pal = pal, 
    values = ~fig_12_data$count,
    title = "Count",
    opacity = 0.5) %>%
  addControl(
    html = "<h1 style='padding: 2px; margin: 0; font-family: Trebuchet MS'>Trip End</h1>",
    position = "topright") %>%
  addControl(
    html = "<h3 style='padding: 0xpx; margin: 0; font-family: Trebuchet MS; font-size: 14px;'>Fig 9. Casual Riders Trip End </h3>",
    position = "topright")

Fig 13. #10 Leaeflet. Member Start Stations Heatmap

View Fig

fig_13 <- leaflet(data = fig_13_data, options = leafletOptions(minZoom = 10, maxZoom = 16, fontFamily = "Trebuchet MS")) %>%
  setView(lng = -87.3444, lat = 41.8563, zoom = 10) %>% 
  setMaxBounds(lng1 = -89.9789, lat1 = 42.1270, lng2 = -87.2184, lat2 = 41.5623) %>% 
  addProviderTiles("CartoDB.DarkMatter") %>%
  addCircleMarkers(
    lng = ~start_lng,
    lat = ~start_lat, 
    popup = ~paste(start_station_name, "<br> Total Trips:", count),
    radius = sqrt(fig_13_data$count) * 0.06, 
    fillOpacity = 1,
    color = ~pal(fig_13_data$count),
    stroke = FALSE) %>%
  addLegend(
    "bottomright", 
    pal = pal, 
    values = ~fig_13_data$count,
    title = "Count",
    opacity = 0.5) %>%
  addControl(
    html = "<h1 style='padding: 2px; margin: 0; font-family: Trebuchet MS'>Trip Start</h1>",
    position = "topright") %>%
  addControl(
    html = "<h3 style='padding: 0xpx; margin: 0; font-family: Trebuchet MS; font-size: 14px;'>Fig 10. Member Riders Trip Start </h3>",
    position = "topright")

Fig 14. #11 Leaflet. Member End Stations Heatmap

View Fig

fig_14 <- leaflet(data = fig_14_data, options = leafletOptions(minZoom = 10, maxZoom = 16, fontFamily = "Trebuchet MS")) %>%
  setView(lng = -87.3444, lat = 41.8563, zoom = 10) %>% 
  setMaxBounds(lng1 = -89.9789, lat1 = 42.1270, lng2 = -87.2184, lat2 = 41.5623) %>% 
  addProviderTiles("CartoDB.DarkMatter") %>%
  addCircleMarkers(
    lng = ~end_lng,
    lat = ~end_lat, 
    popup = ~paste(end_station_name, "<br> Total Trips:", count),
    radius = sqrt(fig_14_data$count) * 0.06, 
    fillOpacity = 1,
    color = ~pal(fig_14_data$count),
    stroke = FALSE) %>%
  addLegend(
    "bottomright", 
    pal = pal, 
    values = ~fig_14_data$count,
    title = "Count",
    opacity = 0.5) %>%
  addControl(
    html = "<h1 style='padding: 2px; margin: 0; font-family: Trebuchet MS'>Trip End</h1>",
    position = "topright") %>%
  addControl(
    html = "<h3 style='padding: 0xpx; margin: 0; font-family: Trebuchet MS; font-size: 14px;'>Fig 11. Member Riders Trip End </h3>",
    position = "topright")

Fig 15. #12 Daily Mean Duration

View Fig

fig_15 <- ggplot(fig_15_data, aes(x = day_name, y = mean, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge2") +
  geom_text(aes(label = round(mean, 2)), position = position_dodge(width = 1), vjust = -0.5, size = 3,family = "Trebuchet MS") +
  scale_fill_manual(values = c("casual" = "#EDB8AA", "member" = "#D4EDAA"),
                    breaks = c("casual", "member"), 
                    labels = c("occasional", "member")) +
  labs(title = "fig 12. Daily Mean Duration", subtitle = "Categorized by user group", x = "", y = "Minutes", fill = "User Type") + 
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))+
  theme_classic() +
  theme(
    axis.line = element_line("gray"),
    text = element_text(family = "Trebuchet MS"), 
    plot.title = element_text(hjust = 0.5, vjust = 1, face = 'bold'), 
    plot.subtitle = element_text(hjust = 0.5, vjust = 1, face = 'italic'), 
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 8),
    axis.title = element_text(size = 8),
    axis.ticks.y = element_line(linewidth = 0.1),
    axis.ticks.x = element_line(linewidth = 0.1),
    axis.text.x = element_text(angle = 45, hjust = 1),
    strip.background = element_rect(color = "white"),
    strip.text = element_text(size = 10),
    plot.margin = margin(t = 1, r = 1, b = 1, l = 1, unit = "lines")) +
  facet_wrap(~member_casual) +
  guides(fill = "none")

Fig 16. #13 Contour Plot

View Fig

fig_16 <- ggplot(correlation_values, aes(x = mean_duration, y = mean_distance)) +
  geom_density_2d() +
  geom_density_2d(color = "#4285f4ff") + 
  labs(title = "fig 13. Contour Plot", subtitle = "Spatial Density of Trip Characteristics", x = "Mean Duration (minutes)", y = "Mean Distance (meters)") +
  theme_classic() +
  theme(
    axis.line = element_line("gray"),
    text = element_text(family = "Trebuchet MS"), 
    plot.title = element_text(hjust = 0.5, vjust = 1, face = 'bold'), 
    plot.subtitle = element_text(hjust = 0.5, vjust = 1, face = 'italic'), 
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 8),
    axis.title = element_text(size = 8),
    axis.ticks.y = element_line(linewidth = 0.1),
    axis.ticks.x = element_line(linewidth = 0.1),
    axis.text.x = element_text(angle = 45, hjust = 1),
    strip.background = element_rect(color = "white"),
    strip.text = element_text(size = 10),
    plot.margin = margin(t = 1, r = 1, b = 1, l = 1, unit = "lines"))

Fig 17. #14 Bike Type Preference

View Fig

fig_17 <- ggplot(fig_17_data, aes(x = member_casual, y = percentage, fill = rideable_type)) +
  geom_col(stat = "identity", position = "dodge")+
  geom_text(aes(label = sprintf("%.2f%%", percentage)), 
            position = position_dodge(width = 0.9),
            vjust = -0.5,
            size = 3, 
            family = "Trebuchet MS",
            color = "black") +
  scale_fill_manual(values = c("classic_bike" = "#AAEDC3", "docked_bike" = "#AAD4ED", "electric_bike" = "#AAB3ED"),
                    labels = c("Classic Bike", "Docked Bike", "Electric Bike")) +
  labs(title = "fig 14. Rideable Types Preference", subtitle = "Percentage of bike ride by user group", x = "Month", y = "Distribution", fill = "User Type", caption = "-Docked bikes represent older classic bikes; data cut as of August 2023") +
  labs(x = "Member Type", y = "Percentage", fill = "Rideable Type") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme_classic() +
  theme(
    axis.line = element_line("gray"),
    text = element_text(family = "Trebuchet MS"),
    plot.title = element_text(hjust = 0.5, vjust = 3, face = 'bold'),
    plot.subtitle = element_text(hjust = 0.5, vjust = 2, face = 'italic'),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 8),
    axis.title = element_text(size = 8),
    plot.margin = margin(t = 1, r = 1, b = 1, l = 1, unit = "lines"))

Cyclistic Bike Sharing

Frank N.

Feb 18, 2024

Introduction

Ask

Prepare

Setting up the Environment

ROCCC Statement

Importing Data Sets

Summary

Process

Combining Data Frames

Transforming the data

Uncovering patterns

Before we beging

Summary Statistics

Central Distribution

Excluding Outliers

Further Cleaning

Summary

Analyze

Fig 4. #1 Trip Distribution

Fig 5. #2 Trip Distribution Per Quarter

Fig 6. #3 Trip Distribution Per Month

Fig 7. #4 Trip Distribution Per Day

Fig 8. #5 Casual Users Daily Breakdown

Fig 9. #6 Member Users Daily Breakdown

Fig 10: #7 Leaflet. Choroplot Start Stations

Fig 11. #8 Leaflet. Casual Start Stations Heatmap

Fig 12. #9 leaflet. Casual End Stations Heatmap

Fig 13. #10 Leaeflet. Member Start Stations Heatmap

Fig 14. #11 Leaflet. Member End Stations Heatmap

Fig 15. #12 Daily Mean Duration

Fig 16. #13 Contour Plot

Fig 17. #14 Bike Type Preference

Resources and References

Special Thanks

Resources

Organizations

Blogs

Communities

Videos

Capstones

Tools

Chunks

Fig 1. Missing Start Stations Names by Rideable Types

Fig 2. QQ Plot

Fig 3. Bell Curve

Fig 4. #1 Trip Distribution

Fig 5. #2 Trip Distribution Per Quarter

Fig 6. #3 Trip Distribution Per Month

Fig 7. #4 Trip Distribution per Day

Fig 8. #5 Casual Users Daily Breakdown

Fig 9. #6 Member Users Daily Breakdown

Fig 10. #7 Leaflet. Choroplot Start Stations

Fig 11. #8 Leaflet. Casual Start Stations Heatmap

Fig 12. #9 leaflet. Casual End Stations Heatmap

Fig 13. #10 Leaeflet. Member Start Stations Heatmap

Fig 14. #11 Leaflet. Member End Stations Heatmap

Fig 15. #12 Daily Mean Duration

Fig 16. #13 Contour Plot

Fig 17. #14 Bike Type Preference