Cyclistic™ is a bike-sharing company with an
extensive network of 5,800 bikes spread across 1,600
stations in the state of Chicago, USA. The core of its business
model revolves around providing an eco-friendly and affordable
alternative means of transportation.
The company has successfully built brand recognition through targeted
marketing efforts, particularly with flexible pricing plans. This
approach has allowed the company to secure a significant market share.
Users are offered two subscription types: annual memberships for regular
users, known as members and one-trip or day passes for
occasional riders, whom are known as casual.
The marketing team is currently addressing three crucial questions
that will shape the future marketing strategy:
I am tasked to delve into the monthly data sets from 2023 and
identify patterns among different member types and propose three
strategic recommendations.
Within this document, you’ll find a comprehensive guide outlining the
meticulous steps taken to prepare, process, and visualize the
data.
For a detailed exploration of my findings, analysis, conclusions, and
recommendations, please refer to the complete report accessible here
I have been tasked with answering the first question, investigate the difference between members and occasional riders, report my findings, craft visualizations and provide three strategic recommendations to the Marketing Director, with the objective of converting occasional bikers into paying members.
Expected output includes:
Data will be processed and analyzed using R Project,
leveraging CRAN
libraries to help across various analytical and data processing
tasks.
library(tidyverse) # A bundle of R tools for easy and organized data work.
library(ggplot2) # Create custom data visualizations.
library(readr) # Quickly read and write data.
library(dplyr) # Simplify data manipulation.
library(tibble) # A user-friendly data frame.
library(janitor) # Clean and format messy data easily.
library(tidyr) # Reshape and organize data.
library(lubridate) # Manage dates and times more easily.
library(purrr) # Simplify tasks related to functions and vectors.
library(knitr) # Mix R code and results into documents.
library(skimr) # Get quick insights about the data.
library(ggmap) # Get Google Maps data (Requires API key).
library(sf) # Spatial vector data encoding.
library(geosphere) # Calculate distance between lat and lng
library(leaflet) # Create interactive maps
library(leaflet.extras)
library(geojsonio)
library(Rcpp) # Integration of C++ with R for performance improvements
library(Cairo) # HD plots
library(webshot)
library(mapview)
library(htmlwidgets)
RStudio is the
preffered IDE due to its suitability for handling large datasets and its
convenience in generating plots within the same environment, for
publishing purposes, Rmarkdown and Rpubs.
Note: To process spatial data I attempted to
capitalize on Stamen Maps.
However they were recently acquired by Stadia Maps, rendering Stamen Maps
unavailable in the R environment, Consequently, to process spatial data
contained in the datasets, I have opted for Google Maps API instead
and for widets leaflet
has been a lifesaver.
Cyclistic has provided a data
bucket for this analysis. The data is distributed under this license
agreement, Divvy Bikes is a program
of the Chicago Department of Transportation (CDOT).
To ensure that the data meets the ROCCC (Reliable, Original,
Comprehensive, Current, and Cited) standards, I’ve verified that it is
meticulously organized, featuring clear columns for each user’s trip
details. This organization facilitates a thorough understanding of the
information. The data repository follows the naming convention
YYYYMM-divvy-tripdata.zip, with our focus on the 2023
files.
It has been noted by the publisher that the dataset has anonymized
trip IDs to protect user privacy. Additionally, any trips taken by staff
members and trips that were below 0 seconds in duration have
subsequently been removed, we’ll dive in to this later.
First, let’s set the working directory where the data sets are located Then, import each file into the project as a data frame.
# Set Working Directory
setwd("~/Documents/Cyclist Data")
# Assign a variable to each workbook
Jan_2023 <- read_csv("202301-divvy-tripdata.csv")
Feb_2023 <- read_csv("202302-divvy-tripdata.csv")
Mar_2023 <- read_csv("202303-divvy-tripdata.csv")
Apr_2023 <- read_csv("202304-divvy-tripdata.csv")
May_2023 <- read_csv("202305-divvy-tripdata.csv")
Jun_2023 <- read_csv("202306-divvy-tripdata.csv")
Jul_2023 <- read_csv("202307-divvy-tripdata.csv")
Aug_2023 <- read_csv("202308-divvy-tripdata.csv")
Sep_2023 <- read_csv("202309-divvy-tripdata.csv")
Oct_2023 <- read_csv("202310-divvy-tripdata.csv")
Nov_2023 <- read_csv("202311-divvy-tripdata.csv")
Dec_2023 <- read_csv("202312-divvy-tripdata.csv")
Lets take a look at the first data frame to get an idea of how its structured.
## ── Data Summary ────────────────────────
## Values
## Name data
## Number of rows 190301
## Number of columns 13
## _______________________
## Column type frequency:
## character 7
## numeric 4
## POSIXct 2
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 ride_id 0 1 16 16 0 190301 0
## 2 rideable_type 0 1 11 13 0 3 0
## 3 start_station_name 26721 0.860 10 50 0 964 0
## 4 start_station_id 26721 0.860 3 12 0 944 0
## 5 end_station_name 27840 0.854 10 50 0 962 0
## 6 end_station_id 27840 0.854 3 35 0 942 0
## 7 member_casual 0 1 6 6 0 2 0
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75
## 1 start_lat 0 1 41.9 0.0492 41.6 41.9 41.9 41.9
## 2 start_lng 0 1 -87.6 0.0285 -87.8 -87.7 -87.6 -87.6
## 3 end_lat 127 0.999 41.9 0.0492 41.6 41.9 41.9 41.9
## 4 end_lng 127 0.999 -87.6 0.0285 -87.8 -87.7 -87.6 -87.6
## p100 hist
## 1 42.1 ▁▁▇▆▁
## 2 -87.5 ▁▁▆▇▁
## 3 42.1 ▁▁▇▅▁
## 4 -87.5 ▁▁▅▇▁
##
## ── Variable type: POSIXct ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max
## 1 started_at 0 1 2023-01-01 00:01:58 2023-01-31 23:56:09
## 2 ended_at 0 1 2023-01-01 00:02:41 2023-02-04 04:27:03
## median n_unique
## 1 2023-01-14 16:26:15 178872
## 2 2023-01-14 16:44:41 179025
This process provides insights into two key aspects: first, a summary of the NA values present in the data frame, and second, an overview of the data formats for each variable.
Next, let’s compare data types across the columns in each workbook and identify any mismatches. We will then return only those columns that exhibit mismatched data types.
# Compare Column Types
mismatched_dt <- compare_df_cols(Jan_2023, Feb_2023, Mar_2023, Apr_2023, May_2023, Jun_2023, Jul_2023, Aug_2023, Sep_2023, Oct_2023, Nov_2023, Dec_2023, return = "mismatch")
## [1] "There are no mismatched columns"
- Data has been imported into the project.
- Data structure includes a variety of data types, encompassing both atomic vectors and nominal vectors.
- Data type is consistent across workbooks.
- Data adheres to the ROCCC standard.
To streamline data processing and address inconsistencies across observations stored in various data frames, let’s combine the data into a single data frame.
df_2023 <- list(Jan_2023, Feb_2023, Mar_2023, Apr_2023, May_2023, Jun_2023, Jul_2023, Aug_2023, Sep_2023, Oct_2023, Nov_2023, Dec_2023)
combined_df <- bind_rows(df_2023)
Let’s verify that all data from the workbooks has been successfully added to the combined data frame.
# Count the number of rows
total_rows <- nrow(combined_df)
total_rows1 <- sum(sapply(list(Jan_2023, Feb_2023, Mar_2023, Apr_2023, May_2023, Jun_2023, Jul_2023, Aug_2023, Sep_2023, Oct_2023, Nov_2023, Dec_2023),
nrow))
## There's a total of 5,719,877 observations
There are two variables of POSIXct (Date and Time) class,
started_atandended_at.We can use these to calculate thedurationfor each observation in the data set.
transform1 <- combined_df %>%
mutate(duration_seconds = round(as.numeric(difftime(ended_at, started_at, units = "secs"))),
duration_minutes = round(as.numeric(difftime(ended_at, started_at, units = "mins")), digits = 2),
duration_hours = round(as.numeric(difftime(ended_at, started_at, units = "hours")), digits = 3))
Let’s transform these variables and obtain the values for
time,month,day,quarterandyear.
transform2 <- transform1 %>%
arrange(started_at) %>%
mutate(day_name = format(started_at, "%A"),
month_name = format(started_at, "%B"),
hour = format(started_at, "%H"),
day = as.numeric(format(started_at, "%d")),
month = as.numeric(format(started_at, "%m")),
year = as.numeric(format(started_at, "%Y")),
quarter = cut(month, breaks = c(0, 3, 6, 9, 12), labels = c("Q1", "Q2", "Q3", "Q4")))
Calculating the distance from the beginning and end of each trip its rather convenient. To achieve this, we’ll need to define a new function capable of measuring the distance between geographical coordinates.
# Function to calculate distance using Haversine formula
haversine <- function(lat1, lon1, lat2, lon2) {
R <- 6371 # Radius of the Earth in kilometers
# Calculate the distance using the Haversine formula
distance <- distGeo(c(lon1, lat1), c(lon2, lat2), a=R)
return(distance)
}
# Prepare coordinate data
coords_start <- cbind(transform2$start_lng, transform2$start_lat)
coords_end <- cbind(transform2$end_lng, transform2$end_lat)
# Calculate distance for each row
transformed_df <- transform2 %>%
mutate(distance_meters = distGeo(coords_start, coords_end))
Finally, let’s factor categorical data. This process enables us to establish an order for our analysis.
transformed_df$day_name <- factor(transformed_df$day_name, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
transformed_df$hour <- factor(transformed_df$hour, levels = c("00", "01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18","19","20","21","22","23"))
transformed_df$month_name <- factor(transformed_df$month_name, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
transformed_df$quarter <- factor(transformed_df$quarter, levels = c("Q1", "Q2", "Q3", "Q4"))
More often than not, there are hidden patterns lurking within the data, waiting to be discovered and leveraged for valuable insights.
In the Prepare phase, we observed missing values in the
Jan_2023data frame for variables such asstart_station,end_station,end_lat, andend_lng.
Let’s perform a quick analysis on the newly combined data frame and verify whether the same columns contain missing values.
## Variables that contain NA values
## [1] "start_station_name" "start_station_id" "end_station_name"
## [4] "end_station_id" "end_lat" "end_lng"
## [7] "distance_meters"
The significance of NA values lies in their potential to affect the integrity and accuracy of the data analysis process.
Let’s begin by having a look at NA values for
start_station_name.
##
##
## |start_station_name |end_station_name | start_lat| end_lat|
## |:------------------|:----------------------|---------:|--------:|
## |NA |Halsted St & Roscoe St | 41.97| 41.94367|
## |NA |NA | 41.80| 41.80000|
From a glimpse, we can deduce that there may have been an issue recording the station name at the start of the trip. Let’s take a closer look at the NA values for the
end_station_namevariable.
##
##
## |start_station_name |end_station_name | start_lat| end_lat|
## |:-------------------|:----------------|---------:|-------:|
## |NA |NA | 41.98000| 41.96|
## |2112 W Peterson Ave |NA | 41.99116| 41.98|
We’ve discovered that
end_latvalues are rounded to two decimal points whenend_station_namesare missing. Similarly,start_latvalues are rounded in entries with missingstart_station_names.
Side Note: At 5 decimal points, the distance equals a radius of 1.11 meters; at 2 decimal places, approximately 1.11 kilometers.
I’m curious about the types of bikes that performed these trips, and whether if it affects more occasional users or members.
# Summary of bike types and NA values for Start Station
transformed_df %>%
select(start_station_name, end_station_name, rideable_type, member_casual) %>%
filter(is.na(start_station_name) | is.na(end_station_name) ) %>%
group_by(rideable_type, member_casual) %>%
summarise(count = n()) %>%
arrange(member_casual) %>%
print()
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 3
## # Groups: rideable_type [3]
## rideable_type member_casual count
## <chr> <chr> <int>
## 1 classic_bike casual 3818
## 2 docked_bike casual 2047
## 3 electric_bike casual 521373
## 4 classic_bike member 1304
## 5 electric_bike member 859266
This occurence happens mostly with electric bikes, which don’t require a docking station, we can assume that this happens because of issues related to the GPS system installed in these bikes, also this issue affects members more than it does to occasional users.
Let’s see how these values are plotted in a map, we’ll have a look at
start_stations_nameswith NA values.
The uniform appearance is a consequence of the variables’ decimal points, We can infer that this pattern likely extends to
end_station_namesWhat is the impact of these values on the data set?
# Count of rows with missing values
na_values_ssn <- combined_df[is.na(combined_df$start_station_name) |
is.na(combined_df$end_station_name),
c("start_station_name", "end_station_name")] %>%
nrow()
## NA values for start or end station names represent 1,387,808 a 24.26% of total observations.
That’s a lot, but this doesn’t necesesarily mean that these observations are useless.
Let’s turn our attention to the NA values contained in these variables.
na_value_spatial <- transformed_df[is.na(combined_df$end_lat) |
is.na(combined_df$end_lng),
c("start_lat", "start_lng", "end_lat", "end_lng")] %>%
nrow()
## NA values for spatial data in 'end_stations' represent 6,990 observations.
Let’s have a closer look at these observations and include another variable,
duration_minutes.
na_value_spatial <- transformed_df %>%
select(member_casual, duration_minutes, start_lng, start_lat, end_lng, end_lat) %>%
filter(is.na(start_lat) | is.na(start_lng) | is.na(end_lat) | is.na(end_lng)) %>%
arrange(desc(duration_minutes)) %>%
print()
## # A tibble: 6,990 × 6
## member_casual duration_minutes start_lng start_lat end_lng end_lat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 casual 98489. -87.7 42.0 NA NA
## 2 casual 92570. -87.7 42.0 NA NA
## 3 casual 83383. -87.6 41.9 NA NA
## 4 casual 79775. -87.6 41.9 NA NA
## 5 casual 64172. -87.6 41.9 NA NA
## 6 casual 64009. -87.6 41.9 NA NA
## 7 casual 62867. -87.6 41.9 NA NA
## 8 casual 56195. -87.6 41.9 NA NA
## 9 casual 55070. -87.7 41.9 NA NA
## 10 casual 51461. -87.6 41.9 NA NA
## # ℹ 6,980 more rows
We have made another discovery, outliers. These trips have an unusual high duration, we can assume that these observations had issues at the end the trip or are stolen bikes, heh.
lets understand the impact of outlier data in this subset, assuming that the max duration of a normal trip would be around 30 minutes.
## Values above 90 minutes in the 'na_value_spatial' object represent 6,803 observations.
That’s around 98% of the values contained in the subset. Last, let’s have quick a look at
distance_metersvalues.
## # A tibble: 10 × 7
## member_casual distance_meters duration_minutes start_lng start_lat end_lng
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 member 9818680. 0.6 -87.7 41.9 0
## 2 casual 9814574. 203. -87.6 41.9 0
## 3 casual 9814022. 4.83 -87.6 41.8 0
## 4 casual 49022. 99.5 -87.6 41.9 -88.2
## 5 member 40859. 2.6 -87.6 41.9 -88.0
## 6 casual 36909. 118. -87.6 41.7 -87.8
## 7 casual 33901. 150. -87.7 41.8 -88.1
## 8 casual 33057. 130. -87.6 41.8 -87.7
## 9 member 32221. 90.1 -87.7 42.0 -87.7
## 10 casual 30601. 149. -87.7 42.1 -87.6
## # ℹ 1 more variable: end_lat <dbl>
That is also an an unusual travelled distance. Now that we have found some patterns in the data, and also discovered outliers, it’s time to do some cleaning.
Let’s take a look at the summary values for
min,max,medianandmeanfor thedurationanddistance_meterscolumns, this will give us insights into the range and central tendency of the data.
## duration_seconds duration_minutes duration_hours distance_meters
## Min. :-999391 Min. :-16656.52 Min. :-277.6090 Min. : 0
## 1st Qu.: 325 1st Qu.: 5.42 1st Qu.: 0.0900 1st Qu.: 863
## Median : 572 Median : 9.53 Median : 0.1590 Median : 1541
## Mean : 1090 Mean : 18.17 Mean : 0.3028 Mean : 2103
## 3rd Qu.: 1015 3rd Qu.: 16.92 3rd Qu.: 0.2820 3rd Qu.: 2749
## Max. :5909344 Max. : 98489.07 Max. :1641.4840 Max. :9818680
## NA's :6990
The mean provides a measure of central tendency by summing all values in the variable and dividing by the total count of values. The median represents the middle value of the variable The min and max values denote the lowest and highest values. Additionally, the 1st and 3rd quartiles represent the values below which 25% and 75% of the data fall, respectively.
Normalizing the data is crucial before analysis to ensure accurate insights. By adjusting data to a common scale or distribution, comparisons between variables become meaningful. It aids in identifying patterns, relationships, and outliers effectively. Normalization enhances the reliability and validity of statistical analyses, allowing for more accurate conclusions and informed decision-making.
We are going to start by conducting a Shapiro-Wilk test to assess if the data follows a normal distribution. For this task we are going to establish a null and alternative hypothesis, setting a 0.05 significance level.
The test will be conducted on the
duration_minutesvariable, as thedurationof the trip is closely related todistance. Analyzing outliers in the duration_minutes variable is essential for understanding trip characteristics and potential anomalies.
H0 = The data set follows a normal distribution. H1 = The data set does not follow a normal distribution.
# Arranging the data
duration_data <- transformed_df %>%
arrange(duration_minutes)
# Extracting the duration_minutes column
test_data <- as.data.frame(duration_data$duration_minutes)
colnames(test_data) <- "duration"
# And convert it to a numeric class for the purpose of the test
fig_2_data <- as.numeric(test_data$duration)
# Shapiro-Wilk Test
shapiro_sample_size <- 3000
shapiro_sample_data <- sample(fig_2_data, shapiro_sample_size)
shapiro_result <- shapiro.test(shapiro_sample_data)
print(shapiro_result)
##
## Shapiro-Wilk normality test
##
## data: shapiro_sample_data
## W = 0.0064973, p-value < 2.2e-16
The test result indicate strong evidence to reject the null hypothesis, we can determine this by looking at the p-value which is smaller than the significance level. We can safely conclude that the data does not follow a normal distribution.
Let’s visualize the test data in a QQ plot, and confirm this. View Chunk
The green line in the plot is this normality we are seeking, the black line represents the actual data. At both edges we see little circles that represent the values that drift apart from normality; we are going to call these values
outliers.
These outliers can significantly influence the
meanandmedianvalues, potentially leading to an inaccurate portrayal of the central tendency of the data. Therefore, it’s crucial to identify and appropriately handle these outliers to ensure the robustness and reliability of our analysis and conclusions.
To achieve this, we will employ the IQR method, which involves determining the first (lower quartile) and third quartile ranges (upper quartile) of the dataset. We will then exclude values falling outside this range from our analysis, resulting in a normalized dataset.
On top of this, let’s declare a new function to obtain the
modeof the duration variable.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
# Extract First and Third quartile
Q1 <- quantile(duration_data$duration_minutes, 0.25)
Q3 <- quantile(duration_data$duration_minutes, 0.75)
# Calculate IQR
IQR <- Q3 - Q1
# Define lower and upper fences
lower_fence <- Q1 - 1.5 * IQR
upper_fence <- Q3 + 1.5 * IQR
# Identify outliers
outliers_logical <- duration_data$duration_minutes < lower_fence | duration_data$duration_minutes > upper_fence
outliers <- duration_data[outliers_logical, ]
# Exclude outliers
rm_outliers_df <- duration_data[!outliers_logical, ]
summary(rm_outliers_df$duration_minutes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.60 5.17 8.82 10.75 14.67 34.17
Let’s visualize the data distribution by plotting a Bell Curve and examining its fit to the data. View Chunk
The bell curve showcases duration in minutes on the X axis and the density of these values on the Y axis, the curve central tendency normalizes at 3 standard deviations significantly reducing the statistics in our data frame.
Let’s examine the trips with a
durationof zero or less.
## # A tibble: 1,211 × 3
## rideable_type duration_minutes distance_meters
## <chr> <dbl> <dbl>
## 1 classic_bike -11.6 0
## 2 electric_bike -11.1 310.
## 3 electric_bike -11.1 865.
## 4 electric_bike -8.93 378.
## 5 electric_bike -8.43 401.
## 6 electric_bike -5.87 508.
## 7 electric_bike -5.05 352.
## 8 electric_bike -5.05 10631.
## 9 electric_bike -4.97 636.
## 10 electric_bike -4.92 268.
## # ℹ 1,201 more rows
## 421,594 ( 7.37% ) observations with values are equal or below zero have been excluded from the data set.
Let’s shift our focus to the
distance_metersvariable. Let’s investigate trips with durations of less than 5 minutes to uncover any noteworthy insights.
distance_upper_values <- cleaned_duration %>%
select(distance_meters, duration_minutes, start_lng, start_lat, end_lng, end_lat) %>%
filter(!is.na(distance_meters) & duration_minutes < 5) %>%
arrange(desc(distance_meters)) %>%
slice(1:10) %>%
print()
## # A tibble: 10 × 6
## distance_meters duration_minutes start_lng start_lat end_lng end_lat
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9818680. 0.6 -87.7 41.9 0 0
## 2 9814022. 4.83 -87.6 41.8 0 0
## 3 40859. 2.6 -87.6 41.9 -88.0 41.6
## 4 20950. 3.53 -87.6 41.8 -87.7 41.9
## 5 20511. 4.85 -87.6 41.8 -87.6 41.9
## 6 18649. 3.93 -87.8 42.0 -87.6 41.9
## 7 14482. 4.07 -87.6 41.8 -87.6 41.9
## 8 13202. 4.77 -87.6 41.8 -87.6 41.9
## 9 12812. 4.87 -87.6 41.9 -87.6 41.8
## 10 12648. 3.63 -87.6 41.9 -87.7 42.0
It’s quite unusual that some rides exhibit an exceptionally long traversed distance considering their
duration, we could assume that there was an issue recording the time at the start of the trip, on top of that the first two rows display an incorrectly calculateddistance. Let’s take a look at themeanandmediandistance by time.
## # A tibble: 35 × 3
## duration_minutes mean_distance median_distance
## <dbl> <dbl> <dbl>
## 1 34 4537. 4520.
## 2 33 4477. 4462.
## 3 32 4474. 4520.
## 4 31 4460. 4505.
## 5 30 4384. 4449.
## 6 29 4319. 4372.
## 7 28 4242. 4296.
## 8 27 4200. 4256.
## 9 26 4134. 4223.
## 10 25 4068. 4143.
## # ℹ 25 more rows
Considering that the max speed for Cyclistic bikes is of 20 mph (32 km/h), we will exclude values where the duration exceeds it’s maximum distance, and we’ll store these values in
excluded_dist_df.
- Duration < 30 seconds and Distance = 0 meters: Possible false starts, removed observations 30,950.
- Duration < 5 minutes, Distance > 2,670 meters: removed observations 288.
- Duration < 10 minutes, Distance > 5,359 meters: removed observations 209
- Duration < 15 minutes, Distance > 8,046 meters: removed observations 123
- Duration < 20 minutes, Distance > 10,724 meters: removed observations 59
- Duration < 25 minutes, Distance > 13,408 meters: removed observations 24
- Duration < 30 minutes, Distance > 16,092 meters: removed observations 20
- Duration < 34 minutes, Distance > 18,239 meters: removed observations 13
# Set data frames with conditions
rm1 <- cleaned_duration %>%
filter(!is.na(distance_meters) & distance_meters > 18239 & duration_minutes <= 34) %>%
arrange(distance_meters)
rm2 <- cleaned_duration %>%
filter(!is.na(distance_meters) & distance_meters > 16092 & duration_minutes <= 30) %>%
arrange(distance_meters)
rm3 <- cleaned_duration %>%
filter(!is.na(distance_meters) & distance_meters > 13408 & duration_minutes <= 25) %>%
arrange(distance_meters)
rm4 <- cleaned_duration %>%
filter(!is.na(distance_meters) & distance_meters > 10724 & duration_minutes <= 20) %>%
arrange(distance_meters)
rm5 <- cleaned_duration %>%
filter(!is.na(distance_meters) & distance_meters > 8046 & duration_minutes <= 15) %>%
arrange(distance_meters)
rm6 <- cleaned_duration %>%
filter(!is.na(distance_meters) & distance_meters > 5359 & duration_minutes <= 10) %>%
arrange(distance_meters)
rm7 <- cleaned_duration %>%
filter(!is.na(distance_meters) & distance_meters > 2670 & duration_minutes <= 5) %>%
arrange(distance_meters)
rm8 <- cleaned_duration %>%
filter(!is.na(distance_meters) & distance_meters == 0 & duration_seconds <= 15) %>%
arrange(distance_meters)
# Exclude the conditions from the dataframe
excluded_dist_df <- cleaned_duration %>%
anti_join(rm1, by = c("distance_meters", "duration_minutes")) %>%
anti_join(rm2, by = c("distance_meters", "duration_minutes")) %>%
anti_join(rm3, by = c("distance_meters", "duration_minutes")) %>%
anti_join(rm4, by = c("distance_meters", "duration_minutes")) %>%
anti_join(rm5, by = c("distance_meters", "duration_minutes")) %>%
anti_join(rm6, by = c("distance_meters", "duration_minutes")) %>%
anti_join(rm7, by = c("distance_meters", "duration_minutes")) %>%
anti_join(rm8, by = c("distance_meters", "duration_minutes"))
## 31,386 ( 0.59% ) observations that surpass the maximum travel distance based on speed have been omitted from the dataset.
Next, let’s examine the recorded values for
end_station_name, noting that there are missing values for theend_lngandend_latvariables.
## # A tibble: 5 × 1
## end_station_name
## <chr>
## 1 Halsted St & Fulton St
## 2 Stony Island Ave & 63rd St
## 3 Elizabeth St & Randolph St
## 4 Lincoln Ave & Byron St
## 5 Drexel Ave & 60th St
To address the missing coordinates and re-calculate the distance, we first need to declare the
latandlngvalues for these stations, subsequently, we can proceed to fill in the absent data and store it in a data frame.
end_lat_value1 <- 41.88687
end_lng_value1 <- -87.64809
end_lat_value2 <- 41.78051
end_lng_value2 <- -87.58685
end_lat_value3 <- 41.88434
end_lng_value3 <- -87.6589
end_lat_value4 <- 41.95237
end_lng_value4 <- -87.6773
end_lat_value5 <- 41.78586
end_lng_value5 <- -87.60455
# Fill missing values with corresponding station data
fill_values <- excluded_dist_df %>%
mutate(end_lat = if_else(end_station_name == "Halsted St & Fulton St" & is.na(end_lat), end_lat_value1, end_lat),
end_lng = if_else(end_station_name == "Halsted St & Fulton St" & is.na(end_lng), end_lng_value1, end_lng)) %>%
mutate(end_lat = if_else(end_station_name == "Stony Island Ave & 63rd St" & is.na(end_lat), end_lat_value2, end_lat),
end_lng = if_else(end_station_name == "Stony Island Ave & 63rd St" & is.na(end_lng), end_lng_value2, end_lng)) %>%
mutate(end_lat = if_else(end_station_name == "Elizabeth St & Randolph St" & is.na(end_lat), end_lat_value3, end_lat),
end_lng = if_else(end_station_name == "Elizabeth St & Randolph St" & is.na(end_lng), end_lng_value3, end_lng)) %>%
mutate(end_lat = if_else(end_station_name == "Lincoln Ave & Byron St" & is.na(end_lat), end_lat_value4, end_lat),
end_lng = if_else(end_station_name == "Lincoln Ave & Byron St" & is.na(end_lng), end_lng_value4, end_lng)) %>%
mutate(end_lat = if_else(end_station_name == "Drexel Ave & 60th St" & is.na(end_lat), end_lat_value5, end_lat),
end_lng = if_else(end_station_name == "Drexel Ave & 60th St" & is.na(end_lng), end_lng_value5, end_lng))
# Prepare coordinate data
coords_start <- cbind(fill_values$start_lng, fill_values$start_lat)
coords_end <- cbind(fill_values$end_lng, fill_values$end_lat)
# Re calculate distance for each row
cleaned_distance <- fill_values %>%
mutate(distance_meters = distGeo(coords_start, coords_end))
## 93 observations with no NA values for `end_station_name` and contain missing `end_lng` and `end_lat` values have been filled.
Finally, lets exclude from the data set the variables that have no recorded
end_lngandend_latcoordinates. We are doing this for the purpose of maintaining row consistency across the data set.
rm9 <- cleaned_distance %>%
filter(is.na(end_lat) & is.na(end_lng))
cyclistic_data_2023 <- cleaned_distance %>%
anti_join(rm9, by = c("end_lat", "end_lng"))
## 118 observations with NA values for `end_station_name` but contain missing `end_lng` and `end_lat` values have been filled.
Lets have one last look at the summary statistics for the
duration_minutesanddistance_metersvariables.
## duration_minutes distance_meters
## Min. : 0.02 Min. : 0.0
## 1st Qu.: 5.22 1st Qu.: 877.8
## Median : 8.88 Median : 1518.8
## Mean :10.82 Mean : 1967.1
## 3rd Qu.:14.72 3rd Qu.: 2620.9
## Max. :34.17 Max. :14318.8
## 453,098 ( 7.92% ) observations have been excluded for the analysis phase.
With a cleaned data set that reflects a centralized distribution, we can be certain that our analysis will be as accurate as possible.
- Combined monthly trips files into a single data frame.
- Extracted duration and date values, factored values.
- Discovered patterns in geo-spatial data
- Cleaned outliers and filled missing values.
- Distribution has been normalized.
- 7.92% of total observations have been excluded from the analysis.
The final step of the data analysis process is the analysis itself, since we are already familiar with the structure of the data set, we have a good idea on how to approach our business task. As a refresher we are tasked with finding the Differences between Casual Riders and Member Subscribers, recommendations are available on the final report.
## Correlation index between mean duration and distance = -0.4222123
With all the visualizations in place, there’s not much left to say but thank you for accompanying me this far. I hope that this journey has offered valuable insights into the comprehensive data analysis process. Have a good day!
Leveraged resources, articles and communities for knowledge, insights, inspiration and answers.
# Modeling the plot data
fig_1_data <- transformed_df %>%
select(member_casual, start_station_name, end_station_name, rideable_type, end_lat, end_lng) %>%
filter(is.na(end_station_name) & member_casual == "member") %>%
group_by(rideable_type, end_lat, end_lng) %>%
summarise(count = n(), .groups = 'drop') %>% print()
# Setting up the map
suppressMessages({
fig_1_map <- get_googlemap(center = c( lon = -87.630303, lat = 41.871074),
style = c(feature = "all", element = "none", visibility = "off"),
zoom = 11,
size = c(720, 720),
maptype = "roadmap")
})
# Plotting the data on the map
ggmap(fig_1_map) +
geom_point(data = fig_1_data, aes(x = end_lng, y = end_lat, color = rideable_type, size = count), alpha = 0.5, na.rm = TRUE) +
scale_color_manual(values = c("electric_bike" = "gray", "classic_bike" = "black")) +
labs(title = "Fig 1. Missing Start Stations Names by Rideable Types", x = "Longitude", y = "Latitude", size = "Value Count", color = "Bike Type") +
theme_minimal() +
theme(text = element_text(family = "Trebuchet MS"),
plot.title = element_text(hjust = 0.5, vjust = 2, size = 12),
axis.text = element_blank(),
axis.title = element_blank())
fig_1 <- qqnorm(fig_2_data, main = "fig 2. Normal QQ Plot")
qqline(fig_2_data, distribution = qnorm, col = "darkgreen")
# Central Tendency Values
mean <- mean(cyclistic_data_2023$duration_minutes)
sd <- sd(cyclistic_data_2023$duration_minutes)
y <- dnorm(cyclistic_data_2023$duration_minutes, mean, sd)
mode <- Mode(cyclistic_data_2023$duration_minutes)
median <- median(cyclistic_data_2023$duration_minutes)
# Plotting The Bell Curve
fig_2 <- plot(x = cyclistic_data_2023$duration_minutes,
y = y,
col = "blue",
type = "l",
xlab = "Duration in Mins",
ylab = "Density",
main = "fig 2. Bell Curve",
lwd = 6)
# Text Format
par(family = "Trebuchet MS")
# Adding SD lines
text(mean + seq(-3, 3) * sd,
dnorm(mean + seq(-3, 3) * sd, mean, sd),
c(paste0(-(3:1), "sd"), "", paste0(1:3, "sd")),
pos = 3,
offset = 3,
col = "black")
# Mean Line
abline(v = mean + seq(-3, 3) * sd,
col = "gray",
lty = 2)
# Mean Line Format
text(mean,
dnorm(mean, mean, sd),
"Mean",
pos = 1,
offset = 8,
col = "darkgray")
# Median Line
abline(v = median,
col = "darkgreen",
lty = 2)
# Median Line Format
text(median,
dnorm(median, mean, sd),
"Median",
pos = 3,
offset = -6.5,
col = "darkgreen")
# Mode Line
abline(v = mode,
col = "blue",
lty = 2)
# Mode Line Format
text(mode,
dnorm(mode, mean, sd),
"Mode",
pos = 3,
offset = -7,
col = "blue")
fig_4 <- ggplot(fig_4_data, aes(x = "", y = Freq, fill = Var1)) +
coord_polar("y", start = 0, direction = 1) +
geom_bar(stat = "identity",
width = 1,
color = "#FFFFFF") +
scale_fill_manual(values = c("casual" = "#EDB8AA", "member" = "#D4EDAA"),
breaks = c("casual", "member"),
labels = c("Casual", "Member")) +
labs(title = "fig 1. User Distribution", subtitle = "for 2023", fill = "", caption = "casuals: 1,769,115; members: 3,497,664") +
geom_text(aes(label = paste(round(Freq / sum(Freq) * 100), "%")),
position = position_stack(vjust = 0.5),
family = "Verdana") +
theme_classic() +
theme(axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
axis.line = element_line("#FFFFFF"),
text = element_text(family = "Trebuchet MS"),
plot.title = element_text(hjust = 0.5, vjust = -5, face = 'bold', family = "Trebuchet MS"),
plot.subtitle = element_text(hjust = 0.5, vjust = -8, face = 'italic', family = "Trebuchet MS"),
legend.title = element_text(size = 12),
legend.position = "top",
legend.box.margin = margin(t = 18, r = 5, b = -31, l = 0),
legend.text = element_text(size = 10),
plot.caption = element_text(face = "italic", hjust = 0.5, vjust = 7))
fig_5 <- ggplot(fig_5_data, aes(y = reorder(quarter,desc(quarter)), x = count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge2") +
geom_text(aes(label = format(count, big.mark = ",")),
position = position_dodge(width = 0.9),
hjust = 1.2,
size = 3,
family = "Trebuchet MS",
color = "black") +
geom_text(aes(label = sprintf("%.2f%%", percent)),
position = position_dodge(width = 0.9),
hjust = -0.2,
size = 3,
family = "Trebuchet MS",
color = "black",
check_overlap = TRUE)+
scale_fill_manual(values = c("casual" = "#EDB8AA", "member" = "#D4EDAA"),
breaks = c("casual", "member"),
labels = c("Casual", "Member")) +
labs(title = "fig 2. Quarterly Trip Distribution",
subtitle = "Categorized by user group",
y = "",
x = "Frequency",
fill = "",
caption = "Percentages based on group totals") +
scale_x_continuous(labels = function(x) format(x, big.mark = ","), limits = c(0, 1300000)) +
theme_classic() +
theme(
axis.line = element_line("gray"),
text = element_text(family = "Trebuchet MS"),
plot.title = element_text(hjust = 0.5, vjust = -1, face = 'bold'),
plot.subtitle = element_text(hjust = 0.5, vjust = -2, face = 'italic'),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
axis.text = element_text(size = 8),
axis.title = element_text(size = 8),
axis.ticks.y = element_line(linewidth = 0.1),
axis.ticks.x = element_line(linewidth = 0.1),
legend.position = "top",
legend.box.margin = margin(t = 13, r = 0, b = -50, l = 290))
fig_6 <- ggplot(fig_6_data, aes(x = month_name, y = percent, fill = member_casual)) +
geom_bar(stat = "identity", position = "fill") +
geom_text(aes(x = month_name, y = percent, label = sprintf("%.2f%%", percent)),
position = position_fill(vjust = 0.5),
size = 2.5,
family = "Trebuchet MS",
color = "black") +
scale_fill_manual(values = c("casual" = "#EDB8AA", "member" = "#D4EDAA"),
breaks = c("casual", "member"),
labels = c("occasional", "member")) +
labs(title = "fig 6. Monthly Trip Distribution",
subtitle = "Categorized by user group",
x = "",
y = "",
fill = "",
caption = " - Percentages based on group totals") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))+
theme_classic() +
theme(
axis.line = element_line("gray"),
text = element_text(family = "Trebuchet MS"),
plot.title = element_text(hjust = 0.5, vjust = -1, face = 'bold', family = "Trebuchet MS"),
plot.subtitle = element_text(hjust = 0.5, vjust = -2, face = 'italic', family = "Trebuchet MS"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
axis.text = element_text(size = 8),
axis.title = element_text(size = 8),
axis.text.x = element_text(angle = 45, hjust = 1),
plot.caption = element_text(face = "italic", hjust = 0, margin = margin(t = 10, r = 0, b = 0, l = 0)))
fig_7 <- ggplot(fig_7_data, aes(x = day_name, y = percent, fill = member_casual)) +
geom_bar(stat = "identity", position = "fill") +
geom_text(aes(x = day_name, y = percent, label = sprintf("%.2f%%",percent)),
position = position_fill(vjust = 0.5),
size = 3,
family = "Trebuchet MS",
color = "black") +
scale_fill_manual(values = c("casual" = "#EDB8AA", "member" = "#D4EDAA"),
breaks = c("casual", "member"),
labels = c("occasional", "member")) +
labs(title = "fig 4. Trip Distribution per Day",
subtitle = "Categorized by user group",
x = "Day",
y = "Distribution",
fill = "",
caption = " - Percentages based on group totals") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))+
theme_classic() +
theme(
axis.line = element_line("gray"),
text = element_text(family = "Trebuchet MS"),
plot.title = element_text(hjust = 0.5, vjust = -1, face = 'bold', family = "Trebuchet MS"),
plot.subtitle = element_text(hjust = 0.5, vjust = -2, face = 'italic', family = "Trebuchet MS"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
axis.text = element_text(size = 8),
axis.text.y = element_text(size = 8),
axis.title = element_text(size = 8),
axis.ticks.y = element_line(linewidth = 0.1),
axis.ticks.x = element_line(linewidth = 0.1),
plot.caption = element_text(face = "italic", hjust = 0, margin = margin(t = 10, r = 0, b = 0, l = 0)))
fig_8 <- ggplot(fig_8_data, aes(x = day_name, y = hour, fill = percent, color = member_casual)) +
geom_tile(color = "black") +
scale_fill_gradient(low = "gray", high = "#9D3414") +
geom_text(aes(label = sprintf("%.1f%%", percent)),
size = 3,
color = "white") +
labs(title = "fig 5. Hourly Distribution ", subtitle = "Casual Riders", x="", y = "") +
theme_minimal() +
theme(
axis.line = element_line("gray"),
text = element_text(family = "Trebuchet MS"),
plot.title = element_text(hjust = 0.5, vjust = 1.5, face = 'bold', family = "Trebuchet MS"),
plot.subtitle = element_text(hjust = 0.5, vjust = 1, face = 'italic', family = "Trebuchet MS"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
axis.text = element_text(size = 8),
axis.text.y = element_text(size = 8),
axis.title = element_text(size = 8),
axis.ticks.y = element_line(linewidth = 0.1),
axis.ticks.x = element_line(linewidth = 0.1),
plot.margin = margin(t = 1, r = 0, b = 0, l = 0, unit = "lines")) +
guides(fill = 'none')
fig_9 <- ggplot(fig_9_data, aes(x = day_name, y = hour, fill = percent, color = member_casual)) +
geom_tile(color = "black") +
scale_fill_gradient(low = "gray", high = "#448B0E") +
geom_text(aes(label = sprintf("%.1f%%", percent)),
size = 3,
color = "white") +
labs(title = "fig 6. Hourly Distribution ", subtitle = "Member Subscribers", x="", y = "") +
theme_minimal() +
theme(
axis.line = element_line("gray"),
text = element_text(family = "Trebuchet MS"),
plot.title = element_text(hjust = 0.5, vjust = 1.5, face = 'bold', family = "Trebuchet MS"),
plot.subtitle = element_text(hjust = 0.5, vjust = 1, face = 'italic', family = "Trebuchet MS"),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
axis.text = element_text(size = 8),
axis.text.y = element_text(size = 8),
axis.title = element_text(size = 8),
axis.ticks.y = element_line(linewidth = 0.1),
axis.ticks.x = element_line(linewidth = 0.1),
plot.margin = margin(t = 1, r = 0, b = 0, l = 0, unit = "lines")) +
guides(fill = 'none')
fig_10 <- leaflet(districts, options = leafletOptions(minZoom = 10, maxZoom = 12, fontFamily = "Trebuchet MS")) %>%
setView(lng = -87.6473, lat = 41.8563, zoom = 10) %>%
setMaxBounds(lng1 = -87.9789, lat1 = 42.1270, lng2 = -87.2184, lat2 = 41.5623) %>%
addProviderTiles("CartoDB.Voyager") %>%
addPolygons(
fillColor = ~pal(fig_10_data$total_count),
weight = 2,
opacity = 20,
color = "white",
dashArray = "3",
fillOpacity = 0.6,
highlightOptions = highlightOptions(
weight = 2,
color = "#666",
dashArray = "",
fillOpacity = 0.5,
bringToFront = TRUE),
label = labels_choro,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto")) %>%
addLegend(
"bottomright",
pal = pal,
values = ~fig_10_data$total_count,
title = "Trips",
opacity = 0.6) %>%
addControl(
html = "<h1 style='padding: 2px; margin: 0; font-family: Trebuchet MS'>Bike Trip Origins</h3>",
position = "topright") %>%
addControl(
html = "<h3 style='padding: 0xpx; margin: 0; font-family: Trebuchet MS; font-size: 12px;'>fig 7. Trip Starts by Districts in Chicago</h3>",
position = "topright")
fig_11 <- leaflet(data = fig_11_data, options = leafletOptions(minZoom = 10, maxZoom = 16, fontFamily = "Trebuchet MS")) %>%
setView(lng = -87.3444, lat = 41.8563, zoom = 10) %>%
setMaxBounds(lng1 = -89.9789, lat1 = 42.1270, lng2 = -87.2184, lat2 = 41.5623) %>%
addProviderTiles("CartoDB.DarkMatter") %>%
addCircleMarkers(
lng = ~start_lng,
lat = ~start_lat,
popup = ~paste(start_station_name, "<br> Total Trips:", count),
radius = sqrt(fig_11_data$count) * 0.06,
fillOpacity = 1,
color = ~pal(fig_11_data$count),
stroke = FALSE) %>%
addLegend(
"bottomright",
pal = pal,
values = ~fig_11_data$count,
title = "Count",
opacity = 0.5) %>%
addControl(
html = "<h1 style='padding: 2px; margin: 0; font-family: Trebuchet MS'>Trip Start</h1>",
position = "topright") %>%
addControl(
html = "<h3 style='padding: 0xpx; margin: 0; font-family: Trebuchet MS; font-size: 14px;'>Fig 8. Casual Riders Trip Start </h3>",
position = "topright")
fig_12 <- leaflet(data = fig_12_data, options = leafletOptions(minZoom = 10, maxZoom = 16, fontFamily = "Trebuchet MS")) %>%
setView(lng = -87.3444, lat = 41.8563, zoom = 10) %>%
setMaxBounds(lng1 = -89.9789, lat1 = 42.1270, lng2 = -87.2184, lat2 = 41.5623) %>%
addProviderTiles("CartoDB.DarkMatter") %>%
addCircleMarkers(
lng = ~end_lng,
lat = ~end_lat,
popup = ~paste(end_station_name, "<br> Total Trips:", count),
radius = sqrt(fig_12_data$count) * 0.06,
fillOpacity = 1,
color = ~pal(fig_12_data$count),
stroke = FALSE) %>%
addLegend(
"bottomright",
pal = pal,
values = ~fig_12_data$count,
title = "Count",
opacity = 0.5) %>%
addControl(
html = "<h1 style='padding: 2px; margin: 0; font-family: Trebuchet MS'>Trip End</h1>",
position = "topright") %>%
addControl(
html = "<h3 style='padding: 0xpx; margin: 0; font-family: Trebuchet MS; font-size: 14px;'>Fig 9. Casual Riders Trip End </h3>",
position = "topright")
fig_13 <- leaflet(data = fig_13_data, options = leafletOptions(minZoom = 10, maxZoom = 16, fontFamily = "Trebuchet MS")) %>%
setView(lng = -87.3444, lat = 41.8563, zoom = 10) %>%
setMaxBounds(lng1 = -89.9789, lat1 = 42.1270, lng2 = -87.2184, lat2 = 41.5623) %>%
addProviderTiles("CartoDB.DarkMatter") %>%
addCircleMarkers(
lng = ~start_lng,
lat = ~start_lat,
popup = ~paste(start_station_name, "<br> Total Trips:", count),
radius = sqrt(fig_13_data$count) * 0.06,
fillOpacity = 1,
color = ~pal(fig_13_data$count),
stroke = FALSE) %>%
addLegend(
"bottomright",
pal = pal,
values = ~fig_13_data$count,
title = "Count",
opacity = 0.5) %>%
addControl(
html = "<h1 style='padding: 2px; margin: 0; font-family: Trebuchet MS'>Trip Start</h1>",
position = "topright") %>%
addControl(
html = "<h3 style='padding: 0xpx; margin: 0; font-family: Trebuchet MS; font-size: 14px;'>Fig 10. Member Riders Trip Start </h3>",
position = "topright")
fig_14 <- leaflet(data = fig_14_data, options = leafletOptions(minZoom = 10, maxZoom = 16, fontFamily = "Trebuchet MS")) %>%
setView(lng = -87.3444, lat = 41.8563, zoom = 10) %>%
setMaxBounds(lng1 = -89.9789, lat1 = 42.1270, lng2 = -87.2184, lat2 = 41.5623) %>%
addProviderTiles("CartoDB.DarkMatter") %>%
addCircleMarkers(
lng = ~end_lng,
lat = ~end_lat,
popup = ~paste(end_station_name, "<br> Total Trips:", count),
radius = sqrt(fig_14_data$count) * 0.06,
fillOpacity = 1,
color = ~pal(fig_14_data$count),
stroke = FALSE) %>%
addLegend(
"bottomright",
pal = pal,
values = ~fig_14_data$count,
title = "Count",
opacity = 0.5) %>%
addControl(
html = "<h1 style='padding: 2px; margin: 0; font-family: Trebuchet MS'>Trip End</h1>",
position = "topright") %>%
addControl(
html = "<h3 style='padding: 0xpx; margin: 0; font-family: Trebuchet MS; font-size: 14px;'>Fig 11. Member Riders Trip End </h3>",
position = "topright")
fig_15 <- ggplot(fig_15_data, aes(x = day_name, y = mean, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge2") +
geom_text(aes(label = round(mean, 2)), position = position_dodge(width = 1), vjust = -0.5, size = 3,family = "Trebuchet MS") +
scale_fill_manual(values = c("casual" = "#EDB8AA", "member" = "#D4EDAA"),
breaks = c("casual", "member"),
labels = c("occasional", "member")) +
labs(title = "fig 12. Daily Mean Duration", subtitle = "Categorized by user group", x = "", y = "Minutes", fill = "User Type") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))+
theme_classic() +
theme(
axis.line = element_line("gray"),
text = element_text(family = "Trebuchet MS"),
plot.title = element_text(hjust = 0.5, vjust = 1, face = 'bold'),
plot.subtitle = element_text(hjust = 0.5, vjust = 1, face = 'italic'),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
axis.text = element_text(size = 8),
axis.title = element_text(size = 8),
axis.ticks.y = element_line(linewidth = 0.1),
axis.ticks.x = element_line(linewidth = 0.1),
axis.text.x = element_text(angle = 45, hjust = 1),
strip.background = element_rect(color = "white"),
strip.text = element_text(size = 10),
plot.margin = margin(t = 1, r = 1, b = 1, l = 1, unit = "lines")) +
facet_wrap(~member_casual) +
guides(fill = "none")
fig_16 <- ggplot(correlation_values, aes(x = mean_duration, y = mean_distance)) +
geom_density_2d() +
geom_density_2d(color = "#4285f4ff") +
labs(title = "fig 13. Contour Plot", subtitle = "Spatial Density of Trip Characteristics", x = "Mean Duration (minutes)", y = "Mean Distance (meters)") +
theme_classic() +
theme(
axis.line = element_line("gray"),
text = element_text(family = "Trebuchet MS"),
plot.title = element_text(hjust = 0.5, vjust = 1, face = 'bold'),
plot.subtitle = element_text(hjust = 0.5, vjust = 1, face = 'italic'),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
axis.text = element_text(size = 8),
axis.title = element_text(size = 8),
axis.ticks.y = element_line(linewidth = 0.1),
axis.ticks.x = element_line(linewidth = 0.1),
axis.text.x = element_text(angle = 45, hjust = 1),
strip.background = element_rect(color = "white"),
strip.text = element_text(size = 10),
plot.margin = margin(t = 1, r = 1, b = 1, l = 1, unit = "lines"))
fig_17 <- ggplot(fig_17_data, aes(x = member_casual, y = percentage, fill = rideable_type)) +
geom_col(stat = "identity", position = "dodge")+
geom_text(aes(label = sprintf("%.2f%%", percentage)),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 3,
family = "Trebuchet MS",
color = "black") +
scale_fill_manual(values = c("classic_bike" = "#AAEDC3", "docked_bike" = "#AAD4ED", "electric_bike" = "#AAB3ED"),
labels = c("Classic Bike", "Docked Bike", "Electric Bike")) +
labs(title = "fig 14. Rideable Types Preference", subtitle = "Percentage of bike ride by user group", x = "Month", y = "Distribution", fill = "User Type", caption = "-Docked bikes represent older classic bikes; data cut as of August 2023") +
labs(x = "Member Type", y = "Percentage", fill = "Rideable Type") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
theme_classic() +
theme(
axis.line = element_line("gray"),
text = element_text(family = "Trebuchet MS"),
plot.title = element_text(hjust = 0.5, vjust = 3, face = 'bold'),
plot.subtitle = element_text(hjust = 0.5, vjust = 2, face = 'italic'),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
axis.text = element_text(size = 8),
axis.title = element_text(size = 8),
plot.margin = margin(t = 1, r = 1, b = 1, l = 1, unit = "lines"))