This case study is the capstone project to get the Google Data Analytics Certification. The certification emphasizes teaching the six phases of the data analysis process to solve problems. Students build technical skills by learning and using key software and big data platforms such: Excel, SQL, R programming, R Studio, Tableau, Google Cloud Big Query SQL Workspace, Posit R-Cloud workspace, Tableau Public, Kaggle, and other applications and workspaces.
I will use the six steps of the Data Analysis Process for the case study: Ask, Prepare, Process, Analyze, Share, and Act. For this project, I will use Excel, R programming, and R Desktop. I will also use structured thinking, which is the process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying options to solve the problem at hand.
The Cyclistic bike-share company is based in Chicago. The bikes are geo-tracks by the starting and ending stations. Bike can be picked up at one station and dropped off at another. The company currently has 692 docking stations and 5,824 bikes. Traditional bikes make up 92% of the bikes, and 8% of the bikes are assistive: reclining, hand tricycles, and cargo bikes. There are three pricing plans: single-rider passes, full-day passes, and annual memberships. Casual riders are customers that purchase single-ride and full-day passes; effectively, there are annual memberships and casual riders.
The stakeholders for the project are: Lily Moreno, Director of Marketing, the Cyclistic marketing analytics team, and the Cyclistic executive team.
The Director of Marketing, Moreno, believes that future success depends on maximizing annual memberships. The goal is to design marketing strategies aimed at converting casual riders to annual members. Moreno believes that it would be easier to convert causal members to annual members than it would be to recruit new members, because the casual riders already know about and have chosen Cyclistic for their biking needs. To design the marketing strategy, the marketing analyst team needs to: 1. better understand how annual members and casual riders differ, 2. determine why casual riders would buy a membership, and 3. decide how digital media could affect their marketing tactics.
Define the problem to be solved to fully understand stakeholder expectations. I will use the SMART methodology to ask questions to learn about the data and to gain insights to solve the problem.
The goal is to increase annual memberships. How to do that is the problem that needs to be solved. The business task assigned is to answer the following question: How do annual members and casual riders use Cyclistic bikes differently?
How can insights drive the business decisions? Analyze the raw data to get insights needed to make informed decision-making. This is accomplished by following the data to where it leads you. Creating queries can help to identify trends, which in turn can identify opportunities. The opportunities can be exploited to influence casual riders to become members.
In order to create questions, I have to take a preliminary look at the data at this point. What type of data is available that can be analyzed to show how riders use bikes differently? We have time, distance, location, and bike type.
What can I do with this data?
I can do descriptive statistics and logistics regression.
Descriptive statistics
It can show usage and possible trends between the two rider types by
looking at and comparing various factors such as number of trips, time,
and distance. These factors are framed as follows:
TRIPS
What are the total number of trips taken?
What are the total number of trips taken by customer type?
RIDER TYPE
Percentage by rider type What percentage of the customers have an annual
membership?
What percentage of the customers are casual riders?
BIKE TYPE
What is the percentage for each bike type used by rider type?
What percentage of members use classic bikes?
What percentage of casual riders use classic bikes?
What percentage of members use electric bikes?
What percentage of casual riders use electric bikes?
What percentage of casual riders use docked bikes?
TIME
What is the average time riders rode bikes per trip by rider type?
Day of the Week
Rank day of the week that riders spend the most minutes to the least
minutes on bikes.
Which day of the week do riders spend the most minutes on bikes?
Which day of the week do riders spend the least minutes on bikes?
Rank the day of the week that members spend the most minutes to the
least minutes on bikes.
Which day of the week do members spend the most minutes on bikes?
Which day of the week do members spend the least minutes on bikes?
Rank the day of the week do casual riders spend the most minutes to the
least minutes on bikes?
Which day of the week do casual riders spend the most minutes on
bikes?
Which day of the week do casual riders spend the least minutes on
bikes?
Month
Rank the months that riders spend the most and least minutes on
bikes?
Which month do riders spend the most minutes on bikes?
Which month do riders spend the least minutes on bikes?
Rank the months that member spend the most and least minutes on
bikes?
Which month do members spend the most minutes on bikes?
Which month do members spend the least minutes on bikes?
Rank the months that casual riders spend the most and least minutes on
bikes?
Which month do casual riders spend the most minutes on bikes?
Which month do casual riders spend the least minutes on bikes?
TRIPS
Day of week
Rank the day of the week riders take the most trips to the least
trips?
Which day of the week do riders take the most trips?
Which day of the week do riders take the least trips?
Rank the day of the week members take the most trips to the least
trips?
Which day of the week do annual members take the most trips?
Which day of the week do annual members take the least trips?
Rank the day of the week casual riders take the most trips to the least
trips?
Which day of the week do casual riders take the most trips?
Which day of the week do casual riders take the least trips?
What are the top 3 days of the week with the most trips taken by
members?
What are the top 3 days of the week with the most trips taken by casual
riders?
What are the bottom 3 days of the week with the most trips taken by
members?
What are the bottom 3 days of the week with the most trips taken by
casual riders?
Month
Rank the months that riders take the most trips to the least
trips?
Which month do riders take the most trips?
Which month do riders take the least trips?
Rank the months that members take the most trips to the least
trips?
Which month do members take the most trips?
Which month do members take the least trips?
Rank the months that casual riders take the most trips to the least
trips?
Which month do casual riders take the most trips?
Which month do casual riders take the least trips?
DISTANCE
What is the average distance riders traveled per trip?
Day of Week
Rank the day-of-the-week that riders rode the most miles to the least
miles?
Which day-of-the-week do riders rode the most miles?
Which day-of-the-week that riders rode the least miles?
Rank the day-of-the-week that members rode the most miles to the least
miles?
Which day-of-the-week do members rode the most miles?
Which day-of-the-week that riders rode the least miles?
Rank the day-of-the-week that members rode the most miles to the least
miles?
Which day-of-the-week do members rode the most miles?
Which day of the week do members rode the least miles?
Rank the day-of-the-week that casual riders rode the most miles to the
least miles?
Which day of the week do casual riders rode the most miles?
Which day of the week do casual riders rode the least miles?
Month
Rank the months that riders rode the most miles to the least
miles.
Which month did riders ride the most miles?
Which month did riders ride the least miles?
Rank the months that members rode the most miles to the least
miles.
Which month do members ride the most miles?
Which month do members ride the least miles?
Rank the months that casual riders rode the most miles to the least
miles.
Which month do casual riders ride the most miles?
Which month do casual riders ride the least miles?
AVERAGES Average minutes most miles
What are the average minutes per trip that members rode bikes when they
rode the most miles?
What are the average minutes per trip that casual riders rode bikes when
they rode the most miles?
Average Distance Top Month
What is the average distance per trip that members rode in the top
month?
What is the average distance per trip that casual riders ride in the top
month?
Average Time Rode by Month
What are the average minutes that annual members used bikes each
month?
What are the average minutes that casual riders used bikes each
month?
Average Distance Traveled by Month
What is the average distance that annual members traveled each
month?
What is the average distance that casual riders traveled each month?
Logistics Regression
Logistics regression can show the probability that a person will get a
membership based on predictors such as miles and distance. I can use the
use the start and end times to extract the amount of time for each ride.
I can also use starting and ending longitude and latitude data to
extrapolate distance. That will provide numerical variables of time and
distance. We want to predict membership, which can be thought of as a
binary. True or false. You either have it or you don’t. That means
membership is a categorical variable. This case meets the criteria for
logistics regression analysis, because I have a binary outcome and
numerical predictors. We can use logistics regression to calculate the
probability of a binary outcome; a membership. We can also find out the
impact of numeric independent variables on the categorical dependent
variable. We can evaluate which independent variable (time or distance)
influences the dependent variable (membership) the most. Once answered,
the course of action is to target the variable with the greater
influence.
Gaps
Where are the gaps?
The gap is space or distance of the present state of where the company
is now to the preferred state of where the company wants to be. The
present state is 41% annual members and 59% casual riders. The Preferred
state is 100% annual membership and 0% casual riders, so there is no
gap.
Opportunities
Where are the opportunities?
The obvious opportunity is to convert the 41% casual riders to annual
members. The other opportunities will surface after the analysis.
Options
What are the options?
We must reflect back on the business task of converting casual members
to annual members in order to think about options. I always say to put
all the options on the table, good or bad. Once all the options are
listed, or in my case the ones I can think of, then pick the ones that
make the most sense.
Opportunity Cost vs Value
The casual riders need to believe that the switch to annual membership
is worth it. So, how would you do that? That can be done by using
pricing plans, incentives, promotions, discounts, or other means. Let’s
look at a simple analogy that works and see if we can use that same
logic. If that chocolate ice cream you just bought tasted good and
satisfied your sweet tooth, would you be likely to buy it the next time?
Most likely, yes. If that software worked really well and satisfies the
need, would you be likely to buy it? Most likely, yes. Now let’s look at
how this analogy worked in business. Adobe provided Acrobat Reader free
for years. It worked. It was good. Everyone loved it and used it. It
satisfied the need to read PDF files. Everyone was used to using Adobe.
Now, Adobe has many products, and they are almost all subscription
based. People know they are good, and they buy them, and they use them.
We don’t have to reinvent the wheel. This model can be applied to Bike
Share. If casual riders are given a membership for a trial period, they
will most likely use the bikes more and see the value of a buying a
membership. The opportunity cost is the loss of potential revenue during
the trial period. Is the opportunity cost worth getting more
memberships?
This step is to collect and store data for the upcoming data analysis.
The data is provided by Amazon and available for download at Amazonws.com.
There are 12 monthly CSV files. One year of data is being evaluated, which is from February 2022 – January 2023. There are 13 columns with what looks like three possible data types: character, date, and numeric.
ROCCC check
* Reliable? Yes. The data is provided by the company Motivate
International Inc, for the purpose of the case study.
* Original? Yes. It is fictitious data that is to be used for case
studies.
* Comprehensive? No. PII and financial data has been removed.
* Current? Yes. The data files are current.
* Cited? Yes. The data is from Motivate, and there is a license
agreement.
Licensing, Privacy, Security, Accessibility
* Licensing: The license agreement is available online. * Security: Data
security issues are stated in the license agreement under Prohibited
Conduct. * Accessibility: The data is accessible online for download at
Amazonws.com.
Data Integrity The files are all downloaded and populated. I checked
the file properties of all the files to check for byte size. All the
files have the same number of columns and the same names. Problems
include missing data and values of zero. The missing data is
predominately station names and station IDs. Omitting the columns won’t
impact the case study.
* Completeness: there are missing values in several columns.
* Accuracy: aside from the missing data, the data is accurate.
* Consistency: aside from the missing values, the data is
consistent.
* Timeliness: the data is always available, so it is timely.
* Compliance: this is a nonissue as this is a case problem.
Organize and Protect Data
Each of the 12 data files will be saved with a new name, which preserves
the original files.The new files, working files, will be used for
analysis. There is a backup of the original and working files.
Original filenames ………………. New filenames
202202-divvy-tripdata.csv ………. 202202Data.csv
202203-divvy-tripdata.csv ………. 202203Data.csv
202204-divvy-tripdata.csv ………. 202204Data.csv
202205-divvy-tripdata.csv ………. 202205Data.csv
202206-divvy-tripdata.csv ………. 202206Data.csv
202207-divvy-tripdata.csv ………. 202207Data.csv
202208-divvy-tripdata.csv ………. 202208Data.csv
202209-divvy-publictripdata.csv….202209Data.csv
202210-divvy-tripdata.csv ………. 202210Data.csv
202211-divvy-tripdata.csv ………. 202211Data.csv
202212-divvy-tripdata.csv ………. 202212Data.csv
202301-divvy-tripdata.csv ………. 202301Data.csv
The Process step has three parts: 1) clean the data, 2) verify the data, and 3) gain strategies for verifying and sharing your data.
I will do a little work in Excel to add some useful columns and format the data so that it will be easier to work with in R. Then, I will combine the 12 CSV data files into one data frame. Then I will eliminate any errors and inaccuracies.
1. Clean Data
* find and eliminate errors or inaccuracies
* transform into useful format
* combine data sets to make more complete
* remove outliers that skew information
2. Verify Data to make sure it is complete and
correct
3. Gain Strategies for verifying and sharing your data
cleansing with stakeholders
This is where we find and eliminate errors or inaccuracies, transfer into a useful format, combine datasets to make it more complete,and remove any outliers that may skew the data.
The original dataset has 13 columns. I removed four columns and added 5. There are now 14 columns.
After inspecting the CSV files, there are a lot missing data and values of zero for the station IDs and station names; thousands and thousands of missing values. There are also inconsistencies in these columns with the naming conventions, so I won’t use them to avoid skewing the data. These 4 columns were deleted: start_station_name, start_station_id,end_station_name, and end_station_id. I used Excel to add five columns and extrapolated data for new columns that will be useful during analysis.
I checked for blank records and obvious errors. There were some blank values for the end_lat and end_lng columns. I will sort that out with R.
I sorted by miles and found a bunch of #NUM! errors in the miles
column. NUM errors occur when a formula or function isn’t valid or a
mathematical function doesn’t work like dividing by zero. I removed the
#NUM! records from each file. Less than 1% of the data per files was
removed.
year…..records…#NUM!…remaining..% removed
202202…115,159…449…..114,709….0.390%
202203…284,042…853…..283,188….0.300%
202204…371,250…1,877…369,373….0.506%
202205…634,859…0…….634,859….0.000%
202206…769,205…924…..768,280….0.120%
202207…823,489…966…..822,522….0.117%
202208…785,933…919…..785,013….0.117%
202209…701,340…817…..700,522….0.116%
202210…558,686…539…..558,146….0.096%
202211…337,736…396…..337,339….0.117%
202212…181,807…211…..181,595….0.116%
202301…190,302…232…..190,069….0.122%
Other errors and outliers that I will remove with R.
Trips with duration less than 1 minute.
Trips with duration greater than minutes in a day, 1440.
Trips greater than 100 miles.
I notied there were trips with miles over 60,000.
Zeroes in the member column should not be removed as it indicates a
non-membership.
Load the CSV files into R, combine them into a data frame, and then continue the cleaning process.
Load R packages and libraries. this is done from the console to allow the Knit to run.
options(repos="https://cran.rstudio.com")
install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\rfinegan\AppData\Local\Temp\RtmpS6CWuC\downloaded_packages
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
Create the data frames by reading in the CSV files.
bike202202 <- read_csv("202202Data.csv")
## Rows: 115158 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202203 <- read_csv("202203Data.csv")
## Rows: 283187 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202204 <- read_csv("202204Data.csv")
## Rows: 371247 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202205 <- read_csv("202205Data.csv")
## Rows: 634858 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202206 <- read_csv("202206Data.csv")
## Rows: 769204 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202207 <- read_csv("202207Data.csv")
## Rows: 823488 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202208 <- read_csv("202208Data.csv")
## Rows: 785932 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202209 <- read_csv("202209Data.csv")
## Rows: 701339 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202210 <- read_csv("202210Data.csv")
## Rows: 558685 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202211 <- read_csv("202211Data.csv")
## Rows: 337735 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202212 <- read_csv("202212Data.csv")
## Rows: 181806 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike202301 <- read_csv("202301Data.csv")
## Rows: 190301 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Check the column names of data frames.
colnames(bike202202)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202203)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202204)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202205)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202206)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202207)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202208)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202209)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202210)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202211)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202212)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
colnames(bike202301)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
Here are the 14 columns and their data types.
ride_id………character
rideable_type…character
started_at……character
ended_at……..character
start_lat…….numeric
start_lng…….numeric
end_lat………numeric
end_lng………numeric
member_casual…character
member……….numeric
minutes………numeric
dow………….character
month………..character
miles………..numeric
Check the structure of data.
str(bike202202)
## spc_tbl_ [115,158 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:115158] "973B6FB26CA8BD27" "73758E384AA65A53" "202F9F20AB6E177D" "E83139CA4370782B" ...
## $ rideable_type: chr [1:115158] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr [1:115158] "2/10/2022 14:35" "2/1/2022 9:36" "2/11/2022 10:16" "2/8/2022 16:03" ...
## $ ended_at : chr [1:115158] "2/10/2022 14:49" "2/1/2022 10:51" "2/11/2022 11:05" "2/8/2022 16:40" ...
## $ start_lat : num [1:115158] 41.9 41.8 41.8 41.8 41.8 ...
## $ start_lng : num [1:115158] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:115158] 41.9 41.8 41.8 41.8 41.8 ...
## $ end_lng : num [1:115158] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual: chr [1:115158] "member" "casual" "member" "member" ...
## $ member : num [1:115158] 1 0 1 1 1 0 1 1 0 1 ...
## $ minutes : num [1:115158] 14 75.2 49.1 36.6 39.6 ...
## $ dow : chr [1:115158] "Thu" "Tue" "Fri" "Tue" ...
## $ month : chr [1:115158] "Feb" "Feb" "Feb" "Feb" ...
## $ miles : num [1:115158] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202203)
## spc_tbl_ [283,187 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:283187] "01CBE93F2A84B1D4" "BF3995B1A3E8C334" "F6F4B4A2E7DA0B33" "D92355D09AD21135" ...
## $ rideable_type: chr [1:283187] "classic_bike" "docked_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr [1:283187] "3/26/2022 11:40" "3/20/2022 16:38" "3/16/2022 14:42" "3/13/2022 17:05" ...
## $ ended_at : chr [1:283187] "3/26/2022 12:04" "3/20/2022 16:59" "3/16/2022 14:43" "3/13/2022 18:12" ...
## $ start_lat : num [1:283187] 41.8 41.9 42 41.7 42 ...
## $ start_lng : num [1:283187] -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ end_lat : num [1:283187] 41.8 41.9 42 41.7 42 ...
## $ end_lng : num [1:283187] -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual: chr [1:283187] "member" "casual" "member" "casual" ...
## $ member : num [1:283187] 1 0 1 0 0 1 1 1 0 1 ...
## $ minutes : num [1:283187] 24.03 21.42 0.35 67.37 89.18 ...
## $ dow : chr [1:283187] "Sat" "Sun" "Wed" "Sun" ...
## $ month : chr [1:283187] "Mar" "Mar" "Mar" "Mar" ...
## $ miles : num [1:283187] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202204)
## spc_tbl_ [371,247 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:371247] "7E14444DA4A4B2BC" "56970CB2B4599258" "99473F4114014BAE" "2AA570D0C915ED6D" ...
## $ rideable_type: chr [1:371247] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr [1:371247] "4/29/2022 11:38" "4/24/2022 18:19" "4/17/2022 14:01" "4/23/2022 16:37" ...
## $ ended_at : chr [1:371247] "4/29/2022 11:38" "4/24/2022 19:15" "4/17/2022 14:05" "4/23/2022 16:37" ...
## $ start_lat : num [1:371247] 41.9 41.9 41.9 42 41.9 ...
## $ start_lng : num [1:371247] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:371247] 41.9 41.9 41.9 42 41.9 ...
## $ end_lng : num [1:371247] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ member_casual: chr [1:371247] "member" "casual" "casual" "member" ...
## $ member : num [1:371247] 1 0 0 1 0 1 1 1 1 1 ...
## $ minutes : num [1:371247] 0.53 56.4 4.33 0.57 35.75 ...
## $ dow : chr [1:371247] "Fri" "Sun" "Sun" "Sat" ...
## $ month : chr [1:371247] "Apr" "Apr" "Apr" "Apr" ...
## $ miles : num [1:371247] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202205)
## spc_tbl_ [634,858 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:634858] "043D51D15887474F" "84676EDD2699ABF5" "92ACD45CFCD9364E" "09B6FA8E709B19A7" ...
## $ rideable_type: chr [1:634858] "classic_bike" "docked_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:634858] "5/19/2022 21:55" "5/14/2022 13:53" "5/4/2022 8:03" "5/24/2022 17:59" ...
## $ ended_at : chr [1:634858] "5/19/2022 21:56" "5/14/2022 17:30" "5/4/2022 8:05" "5/24/2022 18:01" ...
## $ start_lat : num [1:634858] 41.9 41.8 41.8 42 42 ...
## $ start_lng : num [1:634858] -87.7 -87.6 -87.8 -87.7 -87.7 ...
## $ end_lat : num [1:634858] 41.9 41.8 41.8 42 42 ...
## $ end_lng : num [1:634858] -87.7 -87.6 -87.8 -87.7 -87.7 ...
## $ member_casual: chr [1:634858] "member" "casual" "member" "member" ...
## $ member : num [1:634858] 1 0 1 1 1 0 0 1 1 1 ...
## $ minutes : num [1:634858] 0.75 217.65 2.48 2.2 35.97 ...
## $ dow : chr [1:634858] "Thu" "Sat" "Wed" "Tue" ...
## $ month : chr [1:634858] "May" "May" "May" "May" ...
## $ miles : num [1:634858] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202206)
## spc_tbl_ [769,204 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:769204] "C19B08D794D1C89E" "6E9E3A041C14E960" "0AFDFDED87B18711" "73824977D6BA1702" ...
## $ rideable_type: chr [1:769204] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:769204] "6/30/2022 11:09" "6/30/2022 11:05" "6/30/2022 8:02" "6/30/2022 8:12" ...
## $ ended_at : chr [1:769204] "6/30/2022 11:10" "6/30/2022 11:09" "6/30/2022 8:07" "6/30/2022 8:17" ...
## $ start_lat : num [1:769204] 41.9 41.9 41.9 41.9 42 ...
## $ start_lng : num [1:769204] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:769204] 41.9 41.9 41.9 41.9 42 ...
## $ end_lng : num [1:769204] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual: chr [1:769204] "casual" "casual" "casual" "casual" ...
## $ member : num [1:769204] 0 0 0 0 0 0 0 0 0 0 ...
## $ minutes : num [1:769204] 0.78 3.42 4.88 4.1 21.68 ...
## $ dow : chr [1:769204] "Thu" "Thu" "Thu" "Thu" ...
## $ month : chr [1:769204] "Jun" "Jun" "Jun" "Jun" ...
## $ miles : num [1:769204] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202207)
## spc_tbl_ [823,488 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:823488] "55661F65ED2D1B17" "DBA6BF86FC23B1F3" "1A3FDE807BC010E0" "6614263E987C229A" ...
## $ rideable_type: chr [1:823488] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr [1:823488] "7/4/2022 9:32" "7/31/2022 15:42" "7/21/2022 18:42" "7/1/2022 18:57" ...
## $ ended_at : chr [1:823488] "7/4/2022 10:44" "7/31/2022 15:42" "7/21/2022 19:12" "7/1/2022 18:58" ...
## $ start_lat : num [1:823488] 41.9 41.9 41.9 41.9 41.8 ...
## $ start_lng : num [1:823488] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:823488] 41.9 41.9 41.9 41.9 41.8 ...
## $ end_lng : num [1:823488] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual: chr [1:823488] "casual" "member" "member" "member" ...
## $ member : num [1:823488] 0 1 1 1 0 1 1 1 1 1 ...
## $ minutes : num [1:823488] 72.03 0.23 29.43 0.5 80.48 ...
## $ dow : chr [1:823488] "Mon" "Sun" "Thu" "Fri" ...
## $ month : chr [1:823488] "Jul" "Jul" "Jul" "Jul" ...
## $ miles : num [1:823488] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202208)
## spc_tbl_ [785,932 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:785932] "CBC1E1C17B07A999" "11EDB12F3F8F084E" "A49AE6AC519701A8" "0370D86C6729518B" ...
## $ rideable_type: chr [1:785932] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:785932] "8/8/2022 11:30" "8/7/2022 20:37" "8/7/2022 19:09" "8/8/2022 13:24" ...
## $ ended_at : chr [1:785932] "8/8/2022 11:36" "8/7/2022 20:40" "8/7/2022 19:11" "8/8/2022 13:31" ...
## $ start_lat : num [1:785932] 41.8 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:785932] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:785932] 41.8 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:785932] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual: chr [1:785932] "casual" "casual" "casual" "casual" ...
## $ member : num [1:785932] 0 0 0 0 0 0 0 0 0 0 ...
## $ minutes : num [1:785932] 5.42 3.4 2.65 6.78 31.15 ...
## $ dow : chr [1:785932] "Mon" "Sun" "Sun" "Mon" ...
## $ month : chr [1:785932] "Aug" "Aug" "Aug" "Aug" ...
## $ miles : num [1:785932] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202209)
## spc_tbl_ [701,339 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:701339] "E12D4A16BF51C274" "A02B53CD7DB72DD7" "4DEEB4550A266AE1" "106BA573A37CC68F" ...
## $ rideable_type: chr [1:701339] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr [1:701339] "9/1/2022 17:11" "9/1/2022 17:15" "9/1/2022 7:30" "9/1/2022 6:44" ...
## $ ended_at : chr [1:701339] "9/1/2022 17:14" "9/1/2022 17:16" "9/1/2022 7:32" "9/1/2022 6:48" ...
## $ start_lat : num [1:701339] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:701339] -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num [1:701339] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:701339] -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ member_casual: chr [1:701339] "casual" "casual" "casual" "casual" ...
## $ member : num [1:701339] 0 0 0 0 0 0 0 0 0 0 ...
## $ minutes : num [1:701339] 3.27 0.37 2.42 3.83 1.88 ...
## $ dow : chr [1:701339] "Thu" "Thu" "Thu" "Thu" ...
## $ month : chr [1:701339] "Sep" "Sep" "Sep" "Sep" ...
## $ miles : num [1:701339] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202210)
## spc_tbl_ [558,685 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:558685] "9209A92F1AF15A51" "ADB620B994F6DB82" "83ACA34E916CF8D5" "6F5988377250388C" ...
## $ rideable_type: chr [1:558685] "electric_bike" "electric_bike" "classic_bike" "electric_bike" ...
## $ started_at : chr [1:558685] "10/6/2022 7:12" "10/21/2022 12:40" "10/27/2022 19:21" "10/18/2022 21:11" ...
## $ ended_at : chr [1:558685] "10/6/2022 7:12" "10/21/2022 13:33" "10/27/2022 19:43" "10/18/2022 21:11" ...
## $ start_lat : num [1:558685] 42 42 41.9 41.9 41.9 ...
## $ start_lng : num [1:558685] -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:558685] 42 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:558685] -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual: chr [1:558685] "member" "member" "member" "member" ...
## $ member : num [1:558685] 1 1 1 1 0 1 1 0 1 0 ...
## $ minutes : num [1:558685] 0.65 52.75 22.15 0.28 13.67 ...
## $ dow : chr [1:558685] "Thu" "Fri" "Thu" "Tue" ...
## $ month : chr [1:558685] "Oct" "Oct" "Oct" "Oct" ...
## $ miles : num [1:558685] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202211)
## spc_tbl_ [337,735 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:337735] "7F8CA9B17D7E2B5F" "E7372C2C8A9BFCA7" "1E19C43F1B1A04EC" "C8467A137DFF117A" ...
## $ rideable_type: chr [1:337735] "classic_bike" "electric_bike" "classic_bike" "docked_bike" ...
## $ started_at : chr [1:337735] "11/3/2022 11:52" "11/8/2022 5:16" "11/18/2022 11:11" "11/10/2022 15:49" ...
## $ ended_at : chr [1:337735] "11/3/2022 11:52" "11/8/2022 5:16" "11/18/2022 11:43" "11/10/2022 17:47" ...
## $ start_lat : num [1:337735] 41.9 42 41.9 41.9 41.9 ...
## $ start_lng : num [1:337735] -87.6 -87.7 -87.7 -87.6 -87.7 ...
## $ end_lat : num [1:337735] 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:337735] -87.6 -87.7 -87.7 -87.6 -87.7 ...
## $ member_casual: chr [1:337735] "member" "member" "member" "casual" ...
## $ member : num [1:337735] 1 1 1 0 1 1 1 1 1 1 ...
## $ minutes : num [1:337735] 0.73 0.07 31.42 118.33 2.35 ...
## $ dow : chr [1:337735] "Thu" "Tue" "Fri" "Thu" ...
## $ month : chr [1:337735] "Nov" "Nov" "Nov" "Nov" ...
## $ miles : num [1:337735] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202212)
## spc_tbl_ [181,806 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:181806] "471662F932C43107" "FE9D8A7A26F6E917" "8602A572ADB87616" "BD850F0C2EF9CBA0" ...
## $ rideable_type: chr [1:181806] "classic_bike" "classic_bike" "classic_bike" "electric_bike" ...
## $ started_at : chr [1:181806] "12/20/2022 7:13" "12/6/2022 13:21" "12/19/2022 12:07" "12/31/2022 16:11" ...
## $ ended_at : chr [1:181806] "12/20/2022 7:14" "12/6/2022 13:21" "12/19/2022 12:46" "12/31/2022 16:11" ...
## $ start_lat : num [1:181806] 41.9 42 42 42 41.9 ...
## $ start_lng : num [1:181806] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:181806] 41.9 42 42 42 41.9 ...
## $ end_lng : num [1:181806] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual: chr [1:181806] "member" "member" "member" "member" ...
## $ member : num [1:181806] 1 1 1 1 1 1 1 1 1 1 ...
## $ minutes : num [1:181806] 0.5 0.05 38.42 0.38 28.65 ...
## $ dow : chr [1:181806] "Tue" "Tue" "Mon" "Sat" ...
## $ month : chr [1:181806] "Dec" "Dec" "Dec" "Dec" ...
## $ miles : num [1:181806] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(bike202301)
## spc_tbl_ [190,301 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:190301] "C1D80BC2F7A0AF30" "B4ED20B291AB768F" "D9012A259FE14AC9" "BD8BFC9CDEA07766" ...
## $ rideable_type: chr [1:190301] "electric_bike" "electric_bike" "classic_bike" "electric_bike" ...
## $ started_at : chr [1:190301] "1/19/2023 19:17" "1/26/2023 12:36" "1/19/2023 17:42" "1/2/2023 13:49" ...
## $ ended_at : chr [1:190301] "1/19/2023 19:18" "1/26/2023 12:37" "1/19/2023 17:42" "1/2/2023 13:49" ...
## $ start_lat : num [1:190301] 41.9 41.9 41.9 41.9 42 ...
## $ start_lng : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ end_lat : num [1:190301] 41.9 41.9 41.9 41.9 42 ...
## $ end_lng : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual: chr [1:190301] "casual" "member" "member" "casual" ...
## $ member : num [1:190301] 0 1 1 0 1 1 1 1 1 1 ...
## $ minutes : num [1:190301] 0.47 0.8 0.07 0.27 0.93 ...
## $ dow : chr [1:190301] "Thu" "Thu" "Thu" "Mon" ...
## $ month : chr [1:190301] "Jan" "Jan" "Jan" "Jan" ...
## $ miles : num [1:190301] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
I will be doing calculations with the minutes and miles columns.They both show as numeric columns. All the other columns checked out.
Character data types ride_id, rideable_type, started_at, ended_at, member_casual, dow, month
Numeric data types start_lat, start_lng, end_lat, end_lng, member, minutes, and miles.
back to…
# Create and Transform
Merge the 12 data frames into one data frame.
trips <- bind_rows(
bike202202, bike202203, bike202204, bike202205, bike202206, bike202207,
bike202208, bike202209, bike202210, bike202211, bike202212, bike202301
)
This is a test and check of the data to see if the merge of the 12 data frames to the one data frame worked out OK.
nrow(trips)
## [1] 5752940
Check the number of variables.
dim(trips)
## [1] 5752940 14
Check the column names of the new data frame.
colnames(trips)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
Check the structure of the column data types.
str(trips)
## spc_tbl_ [5,752,940 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5752940] "973B6FB26CA8BD27" "73758E384AA65A53" "202F9F20AB6E177D" "E83139CA4370782B" ...
## $ rideable_type: chr [1:5752940] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr [1:5752940] "2/10/2022 14:35" "2/1/2022 9:36" "2/11/2022 10:16" "2/8/2022 16:03" ...
## $ ended_at : chr [1:5752940] "2/10/2022 14:49" "2/1/2022 10:51" "2/11/2022 11:05" "2/8/2022 16:40" ...
## $ start_lat : num [1:5752940] 41.9 41.8 41.8 41.8 41.8 ...
## $ start_lng : num [1:5752940] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:5752940] 41.9 41.8 41.8 41.8 41.8 ...
## $ end_lng : num [1:5752940] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual: chr [1:5752940] "member" "casual" "member" "member" ...
## $ member : num [1:5752940] 1 0 1 1 1 0 1 1 0 1 ...
## $ minutes : num [1:5752940] 14 75.2 49.1 36.6 39.6 ...
## $ dow : chr [1:5752940] "Thu" "Tue" "Fri" "Tue" ...
## $ month : chr [1:5752940] "Feb" "Feb" "Feb" "Feb" ...
## $ miles : num [1:5752940] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. member = col_double(),
## .. minutes = col_double(),
## .. dow = col_character(),
## .. month = col_character(),
## .. miles = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Take a peek at the data frame.
glimpse(trips)
## Rows: 5,752,940
## Columns: 14
## $ ride_id <chr> "973B6FB26CA8BD27", "73758E384AA65A53", "202F9F20AB6E177…
## $ rideable_type <chr> "classic_bike", "classic_bike", "classic_bike", "classic…
## $ started_at <chr> "2/10/2022 14:35", "2/1/2022 9:36", "2/11/2022 10:16", "…
## $ ended_at <chr> "2/10/2022 14:49", "2/1/2022 10:51", "2/11/2022 11:05", …
## $ start_lat <dbl> 41.87812, 41.78794, 41.78794, 41.78794, 41.78794, 41.900…
## $ start_lng <dbl> -87.64395, -87.58832, -87.58832, -87.58832, -87.58832, -…
## $ end_lat <dbl> 41.87812, 41.78794, 41.78794, 41.78794, 41.78794, 41.900…
## $ end_lng <dbl> -87.64395, -87.58832, -87.58832, -87.58832, -87.58832, -…
## $ member_casual <chr> "member", "casual", "member", "member", "member", "casua…
## $ member <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1,…
## $ minutes <dbl> 13.97, 75.17, 49.13, 36.60, 39.58, 0.08, 31.10, 30.88, 1…
## $ dow <chr> "Thu", "Tue", "Fri", "Tue", "Wed", "Sun", "Thu", "Wed", …
## $ month <chr> "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", …
## $ miles <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Save trips variable to a file.
save(trips, file = "trips.Rdata")
Verified the file is created and in the working directory. Check for missing values.
sum(is.na(trips))
## [1] 103191
There are 103,191 missing or NA values. What percentage of data will
I lose by removing NAs?
I can calculate it. 103,191 / 5,752,940 = 0.017937 or 1.79%. That is
acceptable.
Remove NAs and create trips2 data frame.
trips2 <- na.omit(trips)
Check the number of rows for trips2.
nrow(trips2)
## [1] 5740512
Check column names of trips2 data frame.
colnames(trips2)
## [1] "ride_id" "rideable_type" "started_at" "ended_at"
## [5] "start_lat" "start_lng" "end_lat" "end_lng"
## [9] "member_casual" "member" "minutes" "dow"
## [13] "month" "miles"
Check the structure of trips2 data frame.
str(trips2)
## tibble [5,740,512 × 14] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5740512] "973B6FB26CA8BD27" "73758E384AA65A53" "202F9F20AB6E177D" "E83139CA4370782B" ...
## $ rideable_type: chr [1:5740512] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr [1:5740512] "2/10/2022 14:35" "2/1/2022 9:36" "2/11/2022 10:16" "2/8/2022 16:03" ...
## $ ended_at : chr [1:5740512] "2/10/2022 14:49" "2/1/2022 10:51" "2/11/2022 11:05" "2/8/2022 16:40" ...
## $ start_lat : num [1:5740512] 41.9 41.8 41.8 41.8 41.8 ...
## $ start_lng : num [1:5740512] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:5740512] 41.9 41.8 41.8 41.8 41.8 ...
## $ end_lng : num [1:5740512] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual: chr [1:5740512] "member" "casual" "member" "member" ...
## $ member : num [1:5740512] 1 0 1 1 1 0 1 1 0 1 ...
## $ minutes : num [1:5740512] 14 75.2 49.1 36.6 39.6 ...
## $ dow : chr [1:5740512] "Thu" "Tue" "Fri" "Tue" ...
## $ month : chr [1:5740512] "Feb" "Feb" "Feb" "Feb" ...
## $ miles : num [1:5740512] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:12428] 115082 115083 115084 115085 115086 115087 115088 115089 115090 115091 ...
## ..- attr(*, "names")= chr [1:12428] "115082" "115083" "115084" "115085" ...
The column names and column structures look good.
Save trips2 variable to a file.
save(trips2, file = "trips2.Rdata")
Verified the file is created and in the working directory.
Check for duplicates.
trips2 %>%
group_by_all() %>%
filter(n()>1) %>%
ungroup()
No duplicates found!!
Filter the data. Remove rows with minutes < 1 minute, or minutes > 1440, or miles > 100.
trips3 <- trips2[trips2$minutes >= 1, ]
Row count for trips 3.
nrow(trips3)
## [1] 5616638
Save trips3 variable to a file.
save(trips3, file = "trips3.Rdata")
Verified the file is created and in the working directory.
Remove rows where minutes > 1440 or miles > 100.
trips4 <- trips3[!(trips3$minutes > 1400 | trips3$miles > 100), ]
Row count for trips 4.
nrow(trips4)
## [1] 5616390
After the cleaning process, approximate 6% of the records were
removed.
That is 5,752,940 - 5,392,606 = 4,591 records or 0.0626 or 6.26% of
total records.
Save trips4 variable to a file.
save(trips4, file = "trips4.Rdata")
Verified the file is created and in the working directory.
Station names and IDs were removed, because there were too many missing data and inconsistencies. The record count at the beginning of the process for the trips data frame was 5,752,940. After the cleaning process, the record count for the trips for data frame is 5,616,390. The number of records removed are 5,752,940 - 5,616,390 = 136,550 records, which is 2.37% of the total records. The change in population should not affect the outcome of the analysis. The data are clean and good to go.
This step is for using tools to transform and organize information in order to draw useful conclusions, make predictions, and drive informed decision-making. Use tools to format and transform data • Sort and filter data • Identify patterns and draw conclusions • Make predictions and recommendations • Make data-driven decisions
How should I organize the data to perform analysis on it? To begin, I have to go back to Step 1: Ask Questions. I asked myself, what can I do with this data. Then I listed the questions that I could think of that seemed relevant. It is time to answer the questions now that the data is cleaned.
What are the total number of trips taken?
trips4 %>% tally()
Trip Count by Rider Type
What are the total number of trips taken by rider type?
trips4 %>%
count(member_casual)
Percentage by Rider Type
What percentage of the customers have an annual membership?
sum(trips4$member_casual == "member") / sum(trips4$member_casual == "casual" | trips4$member_casual == "member")
## [1] 0.5928365
What percentage of the customers are casual riders?
sum(trips4$member_casual == "casual") / sum(trips4$member_casual == "casual" | trips4$member_casual == "member")
## [1] 0.4071635
Pie chart: Percentage by Rider Type
Member=trips4$member_casual
Member=as.factor(trips4$member_casual)
table(Member)
## Member
## casual member
## 2286789 3329601
Freqmember=table(Member)
Percentage=round(Freqmember/5616390*100)
lbl=paste(names(Freqmember), Percentage, "%", sep = " ")
pie(Freqmember, main="Percentage by Rider Type", col = c(4,3), labels = lbl)
# Bike Type
What is the bike type count by rider type?
trips4 %>%
group_by(member_casual, rideable_type) %>%
summarise(num_used = n(), .groups = "drop") %>%
arrange(desc(num_used))
Note that members do not dock bikes.
What is the percentage for each bike type used by riders?
trips4 %>%
group_by(rideable_type) %>%
summarise(num_trips = n(), .groups = "drop") %>%
mutate(percent = round(num_trips / sum(num_trips),digits =2) * 100) %>%
arrange(desc(percent))
Pie chart: Percentage by Bike Type
RideType=trips4$rideable_type
RideType=as.factor(trips4$rideable_type)
table(RideType)
## RideType
## classic_bike docked_bike electric_bike
## 2592234 173711 2850445
FreqRideType=table(RideType)
Percentage=round(FreqRideType/5616390*100)
lbl=paste(names(FreqRideType), Percentage, "%", sep = " ")
pie(FreqRideType, main="Percentage by Bike Type", col = c(4,3,2), labels = lbl)
What is the percentage for each bike type used by rider type?
trips4 %>%
group_by(member_casual,rideable_type) %>%
summarise(num_trips = n(), .groups = "drop") %>%
mutate(percent = round(num_trips / sum(num_trips),digits =2) * 100) %>%
arrange(desc(percent))
What percentage Of members use classic bikes?
nrow(trips4[trips4$member_casual == "member" & trips4$rideable_type == "classic_bike", ]) /
nrow(trips4[trips4$member_casual == "member", ])
## [1] 0.5136829
What percentage of casual riders use classic bikes?
nrow(trips4[trips4$member_casual == "casual" & trips4$rideable_type == "classic_bike", ]) /
nrow(trips4[trips4$member_casual == "casual", ])
## [1] 0.385639
What Of the percentage of members use electric bikes?
nrow(trips4[trips4$member_casual == "member" & trips4$rideable_type == "electric_bike", ]) /
nrow(trips4[trips4$member_casual == "member", ])
## [1] 0.4863171
What percentage Of casual riders use electric bikes?
nrow(trips4[trips4$member_casual == "casual" & trips4$rideable_type == "electric_bike", ]) /
nrow(trips4[trips4$member_casual == "casual", ])
## [1] 0.5383982
What percentage Of casual riders use docked bikes?
nrow(trips4[trips4$member_casual == "casual" & trips4$rideable_type == "docked_bike", ]) /
nrow(trips4[trips4$member_casual == "casual", ])
## [1] 0.07596285
What is the average time riders used bikes per trip by rider type?
trips4 %>%
group_by(member_casual) %>%
summarise(AverageMinutes = round(mean(minutes), digits = 1))
Casual riders on average spend almost twice as much time on a bike as members.
Bar chart: Average time per trip
trips4 %>%
group_by(member_casual) %>%
summarise(AMin = round(mean(minutes), digits = 1)) %>%
ggplot(., aes(x=member_casual, y = AMin)) +
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=AMin), vjust = -0.3) +
labs(title="Average Minutes per Trip", x="Member Type", y="Minutes")
Rank day of the week that riders spend the most minutes to the least minutes on bikes.
trips4 %>%
group_by(dow) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
arrange(desc(MinuteCount))
Most minutes: Saturday
Least minutes: Tuesday
Bar chart: Rider Trip Minutes by Day of Week
trips4%>%
group_by(dow) %>%
summarise(MinuteCount = round(sum(minutes), digits = 0), .groups = 'drop') %>%
ggplot(., aes(x=dow, y=MinuteCount)) +
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=format(MinuteCount,big.mark=","), vjust = -0.3)) +
labs(title="Trip Minutes by Day of Week", x="Day of Week", y="Minutes")
Which day of the week do riders spend the most minutes on bikes?
trips4 %>%
group_by(dow) %>%
summarise(minutes = sum(minutes)) %>%
top_n(1, minutes)
Which day of the week do riders spend the least minutes on bikes?
trips4 %>%
group_by(dow) %>%
summarise(minutes = sum(minutes)) %>%
top_n(-1, minutes)
Rank the day of the week that members spend the most minutes to the least minutes on bikes.
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow, member_casual) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
arrange(desc(MinuteCount))
Most minutes: Thursday
Least minutes: Sunday
Which day of the week do members spend the most minutes on bikes?
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow) %>%
summarise(minutes = sum(minutes)) %>%
top_n(1, minutes)
Which day of the week do members spend the least minutes on bikes?
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow) %>%
summarise(minutes = sum(minutes)) %>%
top_n(-1, minutes)
Rank the day of the week do casual riders spend the most minutes to the least minutes on bikes?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow, member_casual) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
arrange(desc(MinuteCount))
Most minutes: Saturday
Least minutes: Tuesday
Which day of the week do casual riders spend the most minutes on bikes?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow) %>%
summarise(minutes = sum(minutes)) %>%
top_n(1, minutes)
Which day of the week do casual riders spend the least minutes on bikes?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow) %>%
summarise(minutes = sum(minutes)) %>%
top_n(-1, minutes)
Bar chart: Day of Week Most Minutes by Member Type
trips4 %>%
group_by(member_casual, dow) %>%
summarise(MostMinutes = sum(minutes)) %>%
top_n(1, MostMinutes) %>%
ggplot(., aes(x=member_casual, y=MostMinutes))+
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=dow), vjust = -0.3) +
labs(title="Day of Week - Most Minutes by Member Type", x="Member Type", y="Minutes")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Rank the months that riders spend the most and least minutes on bikes?
trips4 %>%
group_by(month) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
arrange(desc(MinuteCount))
Most minutes: July
Least minutes: Feb
Bar chart: Rider Trip Minutes by Month
trips4%>%
group_by(month) %>%
summarise(MinuteCount = round(sum(minutes), digits = 0), .groups = 'drop') %>%
ggplot(., aes(x=month, y=MinuteCount)) +
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=format(MinuteCount,big.mark=","), vjust = -0.3)) +
labs(title="Trip Minutes by Month", x="Month", y="Minutes")
Which month do riders spend the most minutes on bikes?
trips4 %>%
group_by(month) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
top_n(1, MinuteCount)
Which month do riders spend the least minutes on bikes?
trips4 %>%
group_by(month) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
top_n(-1, MinuteCount)
Rank the months that member spend the most and least minutes on bikes?
trips4 %>%
filter(member_casual == "member") %>%
group_by(month) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
arrange(desc(MinuteCount))
Most minutes: July
Least minutes: February
Which month do members spend the most minutes on bikes?
trips4 %>%
filter(member_casual == "member") %>%
group_by(month) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
top_n(1, MinuteCount)
Which month do members spend the least minutes on bikes?
trips4 %>%
filter(member_casual == "member") %>%
group_by(month) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
top_n(-1, MinuteCount)
Rank the months that casual riders spend the most and least minutes on bikes?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(month) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
arrange(desc(MinuteCount))
Most minutes: July
Least minutes: February
Which month do casual riders spend the most minutes on bikes?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(month) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
top_n(1, MinuteCount)
Which month do casual riders spend the least minutes on bikes?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(month) %>%
summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
top_n(-1, MinuteCount)
Bar chart: Month Most Minutes by Member Type
trips4 %>%
group_by(member_casual, month) %>%
summarise(MostMinutes = sum(minutes)) %>%
top_n(1, MostMinutes) %>%
ggplot(., aes(x=member_casual, y=MostMinutes)) +
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=month), vjust = -0.3) +
labs(title="Month - Most Minutes by Member Type", x="Member Type", y="Minutes")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Rank the day of the week riders take the most trips to the least trips?
trips4 %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
arrange(desc(TripCount))
Most trips: Saturday
Least trips: Monday
Bar chart: Trips by Weekday
trips4%>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
ggplot(., aes(x=dow, y=TripCount)) +
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=format(TripCount,big.mark=","), vjust = -0.3)) +
labs(title="Trips by Day of Week", x="Day of Week", y="Trips")
Which day of the week do riders take the most trips?
trips4 %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
top_n(1, TripCount)
Which day of the week do riders take the least trips?
trips4 %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
top_n(-1, TripCount)
Rank the day of the week members take the most trips to the least trips?
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
arrange(desc(TripCount))
Most trips: Thursday
Least trips: Sunday
Which day of the week do annual members take the most trips?
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
top_n(1, TripCount)
Which day of the week do annual members take the least trips?
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
top_n(-1, TripCount)
Rank the day of the week casual riders take the most trips to the least trips?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
arrange(desc(TripCount))
Most trips:Saturday
Least trips: Tuesday
Which day of the week do casual riders take the most trips?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
top_n(1, TripCount)
Which day of the week do casual riders take the least trips?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
top_n(-1, TripCount)
What are the top 3 days of the week with the most trips taken by members?
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
top_n(3, TripCount)
What are the top 3 days of the week with the most trips taken by casual riders?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
top_n(3, TripCount)
What are the bottom 3 days of the week with the least trips taken by members?
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
top_n(-3, TripCount)
What are the bottom 3 days of the week with the least trips taken by casual riders?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
top_n(-3, TripCount)
Bar chart: Day of Week - Most Trips by member Type
trips4 %>%
group_by(member_casual, dow) %>%
summarise(TripCount = n(), groups = "drop") %>%
top_n(1, TripCount) %>%
ggplot(., aes(x=member_casual, y=TripCount)) +
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=dow), vjust = -0.3) +
labs(title="Day of Week - Most Trips by Member Type", x="Member Type", y="Minutes")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Rank the months that riders take the most trips to the least trips?
trips4 %>%
group_by(month) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
arrange(desc(TripCount))
most trips: Jul
least trips: Jan
Bar chart: Trips by Month
trips4%>%
group_by(month) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
ggplot(., aes(x=month, y=TripCount)) +
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=format(TripCount,big.mark=","), vjust = -0.3)) +
labs(title="Trips by Month", x="Month", y="Trips")
Which month do riders take the most trips?
trips4 %>%
group_by(month) %>%
summarise(TripCount = n(), .groups = "drop") %>%
top_n(1, TripCount)
Which month do riders take the least trips?
trips4 %>%
group_by(month) %>%
summarise(TripCount = n(), .groups = "drop") %>%
top_n(-1, TripCount)
Rank the months that members take the most trips to the least trips?
trips4 %>%
filter(member_casual == "member") %>%
group_by(month) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
arrange(desc(TripCount))
most trips: Aug least trips: Jan
Which month do members take the most trips?
trips4 %>%
filter(member_casual == "member") %>%
group_by(month) %>%
summarise(TripCount = n(), .groups = "drop") %>%
top_n(1, TripCount)
Which month do members take the least trips?
trips4 %>%
filter(member_casual == "member") %>%
group_by(month) %>%
summarise(TripCount = n(), .groups = "drop") %>%
top_n(-1, TripCount)
Rank the months that casual riders take the most trips to the least trips?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(month) %>%
summarise(TripCount = n(), .groups = 'drop') %>%
arrange(desc(TripCount))
most trips: Jul
least trips: Feb
Which month do casual riders take the most trips?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(month) %>%
summarise(TripCount = n(), .groups = "drop") %>%
top_n(1, TripCount)
Which month do casual riders take the least trips?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(month) %>%
summarise(TripCount = n(), .groups = "drop") %>%
top_n(-1, TripCount)
Bar chart: Month Most Trips by Member Type
trips4 %>%
group_by(member_casual, month) %>%
summarise(TripCount = n(), groups = "drop") %>%
top_n(1, TripCount) %>%
ggplot(., aes(x=member_casual, y=TripCount))+
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=month), vjust = -0.3) +
labs(title="Month - Most Trips by Member Type", x="Member Type", y="Trips")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
What is the average distance riders traveled per trip?
trips4 %>%
group_by(member_casual) %>%
summarise(AvgMiles = mean(miles), .groups = 'drop') %>%
arrange(desc(AvgMiles))
Rank the day-of-the-week that riders rode the most miles to the least miles?
trips4 %>%
group_by(dow) %>%
summarise(dowMiles = sum(miles), .groups = 'drop') %>%
arrange(desc(dowMiles))
Most miles: Saturday
Least miles: Monday
Bar chart: Miles by Day of Week
trips4%>%
group_by(dow) %>%
summarise(SumMiles = sum(miles)) %>%
ggplot(., aes(x=dow, y=SumMiles)) +
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=format(SumMiles,big.mark=","), vjust = -0.3)) +
labs(title="Miles by Day of Week", x="Day of Week", y="Miles")
Which day-of-the-week do riders rode the most miles?
trips4 %>%
group_by(dow) %>%
summarise(dowMiles = sum(miles), .groups = 'drop') %>%
top_n(1, dowMiles)
Which day-of-the-week that riders rode the least miles?
trips4 %>%
group_by(dow) %>%
summarise(dowMiles = sum(miles), .groups = 'drop') %>%
top_n(-1, dowMiles)
Rank the day-of-the-week that members rode the most miles to the least miles?
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow) %>%
summarise(dowMiles = sum(miles), .groups = 'drop') %>%
arrange(desc(dowMiles))
Most miles: Thursday
Least miles: Sunday
Which day-of-the-week do members rode the most miles?
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow) %>%
summarise(dowMiles = sum(miles), .groups = "drop") %>%
top_n(1, dowMiles)
Which day of the week do members rode the least miles?
trips4 %>%
filter(member_casual == "member") %>%
group_by(dow) %>%
summarise(dowMiles = sum(miles), .groups = "drop") %>%
top_n(-1, dowMiles)
Rank the day-of-the-week that casual riders rode the most miles to the least miles?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow) %>%
summarise(dowMiles = sum(miles), .groups = 'drop') %>%
arrange(desc(dowMiles))
Most miles: Saturday
Least miles: Tuesday
Which day of the week do casual riders rode the most miles?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow) %>%
summarise(dowMiles = sum(miles), .groups = "drop") %>%
top_n(1, dowMiles)
Which day of the week do casual riders rode the least miles?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(dow) %>%
summarise(dowMiles = sum(miles), .groups = "drop") %>%
top_n(-1, dowMiles)
Bar chart: Day of Week - Most miles by member Type
trips4 %>%
group_by(member_casual, dow) %>%
summarise(MostMiles = sum(miles)) %>%
top_n(1, MostMiles) %>%
ggplot(., aes(x=member_casual, y=MostMiles))+
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=dow), vjust = -0.3) +
labs(title="Day of Week - Most Minutes by Member Type", x="Member Type", y="Minutes")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Rank the months that riders rode the most miles to the least miles.
trips4 %>%
group_by(month) %>%
summarise(MileCount = sum(miles), .groups = 'drop') %>%
arrange(desc(MileCount))
Most miles: July
Least miles: February
Bar chart: Miles per Month
trips4%>%
group_by(month) %>%
summarise(SumMiles = sum(miles)) %>%
ggplot(., aes(x=month, y=SumMiles)) +
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=format(SumMiles,big.mark=","), vjust = -0.3)) +
labs(title="Miles by Month", x="Month", y="Miles")
Which month did riders ride the most miles?
trips4 %>%
group_by(month) %>%
summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(1, MostMiles)
Which month did riders ride the least miles?
trips4 %>%
group_by(month) %>%
summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(-1, MostMiles)
Rank the months that members rode the most miles to the least miles.
trips4 %>%
filter(member_casual =="member") %>%
group_by(month) %>%
summarise(MileCount = sum(miles), .groups = 'drop') %>%
arrange(desc(MileCount))
Most miles; August
Least miles: February
Which month do members ride the most miles?
trips4 %>%
filter(member_casual == "member") %>%
group_by(month) %>%
summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(1, MostMiles)
Which month do members ride the least miles?
trips4 %>%
filter(member_casual == "member") %>%
group_by(month) %>%
summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(-1, MostMiles)
Rank the months that casual riders rode the most miles to the least miles.
trips4 %>%
filter(member_casual =="casual") %>%
group_by(month) %>%
summarise(MileCount = sum(miles), .groups = 'drop') %>%
arrange(desc(MileCount))
Most miles: July
Least miles: February
Which month do casual riders ride the most miles?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(month) %>%
summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(1, MostMiles)
Which month do casual riders ride the least miles?
trips4 %>%
filter(member_casual == "casual") %>%
group_by(month) %>%
summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(-1, MostMiles)
Bar chart: Month - Most Miles by Member Type
trips4 %>%
group_by(member_casual, month) %>%
summarise(MostMiles = sum(miles)) %>%
top_n(1, MostMiles) %>%
ggplot(., aes(x=member_casual, y=MostMiles))+
geom_bar(width=0.5, stat="identity", fill="steelblue") +
geom_text(aes(label=month), vjust = -0.3) +
labs(title="Month - Most Miles by Member Type", x="Member Type", y="Miles")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Average minutes most miles What are the average minutes per trip that members rode bikes when they rode the most miles?
trips4 %>%
filter(month == "Aug" & member_casual == "member") %>%
group_by(month, member_casual) %>%
summarise(AvgMinutes = mean(minutes), .groups = 'drop')
What are the average minutes per trip that casual riders rode bikes when they rode the most miles?
trips4 %>%
filter(month == "Jul" & member_casual == "casual") %>%
group_by(month, member_casual) %>%
summarise(AvgMinutes = mean(minutes), .groups = 'drop')
Average distance most miles What is the average distance per trip that members rode in the top month?
trips4 %>%
filter(month == "Aug", member_casual == "member") %>%
group_by(month, member_casual) %>%
summarise(AvgMiles = mean(miles), .groups ='drop') %>%
arrange(desc(AvgMiles))
What is the average distance per trip that casual riders ride in the top month?
trips4 %>%
filter(month == "Jul", member_casual == "casual") %>%
group_by(month, member_casual) %>%
summarise(AvgMiles = mean(miles), .groups ='drop') %>%
arrange(desc(AvgMiles))
Create new data frame. Remove any unneeded columns for cleaner data run and computer efficiency.
trips5 <- trips4[,c("member", "minutes", "miles")]
Make sure all columns are still numeric.
trips5$member = as.numeric(as.character(trips5$member))
trips5$minutes = as.numeric(as.character(trips5$minutes))
trips5$miles = as.numeric(as.character(trips5$miles))
View the data frame to make sure members are 1 and casual are 0.
View(trips5)
Summarize data
summary(trips5)
## member minutes miles
## Min. :0.0000 Min. : 1.00 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 6.02 1st Qu.: 0.570
## Median :1.0000 Median : 10.43 Median : 1.000
## Mean :0.5928 Mean : 16.44 Mean : 1.347
## 3rd Qu.:1.0000 3rd Qu.: 18.57 3rd Qu.: 1.750
## Max. :1.0000 Max. :1396.75 Max. :26.270
Get standard deviation
sapply(trips5, sd)
## member minutes miles
## 0.4913058 28.3688534 1.1917930
Fit the logistics regression model.
M1=glm(formula=member~minutes+miles,family="binomial", data=trips5)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Summarize results
summary(M1)
##
## Call:
## glm(formula = member ~ minutes + miles, family = "binomial",
## data = trips5)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.132e-01 1.450e-03 491.8 <2e-16 ***
## minutes -3.829e-02 8.514e-05 -449.8 <2e-16 ***
## miles 1.820e-01 8.984e-04 202.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7591220 on 5616389 degrees of freedom
## Residual deviance: 7258406 on 5616387 degrees of freedom
## AIC: 7258412
##
## Number of Fisher Scoring iterations: 6
Explanation of results
The p-values indicated that minutes and miles are important predictors, because values are less than .05.
The dependent variable, member_casual now named member, (Intercept) is influenced by the independent variables minutes and miles.
Each one-unit change in minutes will decrease the log odds of getting a membership by -.03829.
Each one-unit change in miles will increase the log odds of getting a membership by 0.1820.
We can look at the difference between Null deviance and Residual deviance to see if the model is a good fit. The bigger the difference, the better the model. Null deviance is when you only have an intercept in your equation, but in our case we have variables on the other side of the equation; the Residual variance. The model is a good fit if that difference is big enough. This explanation is referenced from “Simply Explained Logistics Regression with Example in R” on the towardsdatascience.com website. https://towardsdatascience.com/simply-explained-logistic-regression-with-example-in-r-b919acb1d6b3
Test the model. Plug in numbers for minutes and miles to see the
probability of a casual rider getting a membership.
.07132Y = -0.03829(minutes) + 0.182(miles)
.07132Y = -0.03829(20) + 0.182(8) = 9.67751 percent probability that the
casual rider will get a membership.
Plot the logistics function for the miles predictor for casual riders.
trips5 %>%
filter(member == "0")
ggplot(trips5, aes(x=miles, y=member)) +
geom_point() +
stat_smooth(method="glm", color="green", se=FALSE, method.args=list(family=binomial))
## `geom_smooth()` using formula = 'y ~ x'
What does this chart mean? You can see in the chart that if the miles are less than 12, the log odds, probability, that the person will be a member is at least 50%.
Since the coefficient for miles was positive, it is a better predictor than minutes, which has a negative coefficient. That begs the question of: where should our efforts be targeted? It’s really sort of an educated guess. I am going to guess at casual riders that ride 5 or more miles.
How many casual riders ride 5 or more miles?
nrow(trips5[trips5$member == "0" & trips5$miles >= "5", ])
## [1] 44296
There are 44,296 casual riders that have a 50% or higher change of becoming annual members.
Plot the logistics function for the minutes predictor for casual riders.
trips5 %>%
filter(member == "0")
ggplot(trips5, aes(x=minutes, y=member)) +
geom_point(alpha=.5) +
scale_x_continuous(n.breaks = 20) +
stat_smooth(method="glm", color="green", se=FALSE, method.args=list(family=binomial))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
You can see in the chart that if the minutes are less than 25, the log odds, probability, that the person will be a member is at least 50%.
It makes sense that both charts are negative sloping, because these are rentals and there are only used a portion of the day.
save(trips5, file = "trips5.Rdata")
Insights and Recommendations
Distance is the key factor for turning casual riders into annual members.
The Target
There are 44,296 casual riders that have a 50% or higher change of
becoming annual members.
I recommend option 3 that I suggested in step 1, ASK, of the Data Analytics Process: Launch a small marketing plan targeting causal riders at specific times and locations.
Time Casual riders spend almost twice as much time per trip on a bike
than members.
Include an element in the marketing plan to target casual riders
to:
1. ride bikes during the middle days of the week, and
2. get people riding more in the non-peak months.
Distance
While the average distance riders traveled per trip is 1.35 miles, there
are casual riders that bike a lot further.
Include an element in the marketing plan to get customers to bike a
little further.
Regression analysis
Analysis indicates that miles is a better predictor than minutes for a
casual rider to become an annual member.
Include an element in the marketing plan to get casual riders to ride
more miles; up to 12 miles maximum.