Case Study

This case study is the capstone project to get the Google Data Analytics Certification. The certification emphasizes teaching the six phases of the data analysis process to solve problems. Students build technical skills by learning and using key software and big data platforms such: Excel, SQL, R programming, R Studio, Tableau, Google Cloud Big Query SQL Workspace, Posit R-Cloud workspace, Tableau Public, Kaggle, and other applications and workspaces.

Methodology

I will use the six steps of the Data Analysis Process for the case study: Ask, Prepare, Process, Analyze, Share, and Act. For this project, I will use Excel, R programming, and R Desktop. I will also use structured thinking, which is the process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying options to solve the problem at hand.

Company Overview

The Cyclistic bike-share company is based in Chicago. The bikes are geo-tracks by the starting and ending stations. Bike can be picked up at one station and dropped off at another. The company currently has 692 docking stations and 5,824 bikes. Traditional bikes make up 92% of the bikes, and 8% of the bikes are assistive: reclining, hand tricycles, and cargo bikes. There are three pricing plans: single-rider passes, full-day passes, and annual memberships. Casual riders are customers that purchase single-ride and full-day passes; effectively, there are annual memberships and casual riders.

Stakeholders

The stakeholders for the project are: Lily Moreno, Director of Marketing, the Cyclistic marketing analytics team, and the Cyclistic executive team.

The Director of Marketing, Moreno, believes that future success depends on maximizing annual memberships. The goal is to design marketing strategies aimed at converting casual riders to annual members. Moreno believes that it would be easier to convert causal members to annual members than it would be to recruit new members, because the casual riders already know about and have chosen Cyclistic for their biking needs. To design the marketing strategy, the marketing analyst team needs to: 1. better understand how annual members and casual riders differ, 2. determine why casual riders would buy a membership, and 3. decide how digital media could affect their marketing tactics.

STEP 1: Ask

Define the problem to be solved to fully understand stakeholder expectations. I will use the SMART methodology to ask questions to learn about the data and to gain insights to solve the problem.

Business Task

The goal is to increase annual memberships. How to do that is the problem that needs to be solved. The business task assigned is to answer the following question: How do annual members and casual riders use Cyclistic bikes differently?

Insights

How can insights drive the business decisions? Analyze the raw data to get insights needed to make informed decision-making. This is accomplished by following the data to where it leads you. Creating queries can help to identify trends, which in turn can identify opportunities. The opportunities can be exploited to influence casual riders to become members.

Questions

In order to create questions, I have to take a preliminary look at the data at this point. What type of data is available that can be analyzed to show how riders use bikes differently? We have time, distance, location, and bike type.

What can I do with this data?
I can do descriptive statistics and logistics regression.

Descriptive statistics
It can show usage and possible trends between the two rider types by looking at and comparing various factors such as number of trips, time, and distance. These factors are framed as follows:

TRIPS
What are the total number of trips taken?
What are the total number of trips taken by customer type?

RIDER TYPE
Percentage by rider type What percentage of the customers have an annual membership?
What percentage of the customers are casual riders?

BIKE TYPE
What is the percentage for each bike type used by rider type?
What percentage of members use classic bikes?
What percentage of casual riders use classic bikes?
What percentage of members use electric bikes?
What percentage of casual riders use electric bikes?
What percentage of casual riders use docked bikes?

TIME
What is the average time riders rode bikes per trip by rider type?

Day of the Week
Rank day of the week that riders spend the most minutes to the least minutes on bikes.
Which day of the week do riders spend the most minutes on bikes?
Which day of the week do riders spend the least minutes on bikes?
Rank the day of the week that members spend the most minutes to the least minutes on bikes.
Which day of the week do members spend the most minutes on bikes?
Which day of the week do members spend the least minutes on bikes?
Rank the day of the week do casual riders spend the most minutes to the least minutes on bikes?
Which day of the week do casual riders spend the most minutes on bikes?
Which day of the week do casual riders spend the least minutes on bikes?

Month
Rank the months that riders spend the most and least minutes on bikes?
Which month do riders spend the most minutes on bikes?
Which month do riders spend the least minutes on bikes?
Rank the months that member spend the most and least minutes on bikes?
Which month do members spend the most minutes on bikes?
Which month do members spend the least minutes on bikes?
Rank the months that casual riders spend the most and least minutes on bikes?
Which month do casual riders spend the most minutes on bikes?
Which month do casual riders spend the least minutes on bikes?

TRIPS
Day of week
Rank the day of the week riders take the most trips to the least trips?
Which day of the week do riders take the most trips?
Which day of the week do riders take the least trips?
Rank the day of the week members take the most trips to the least trips?
Which day of the week do annual members take the most trips?
Which day of the week do annual members take the least trips?
Rank the day of the week casual riders take the most trips to the least trips?
Which day of the week do casual riders take the most trips?
Which day of the week do casual riders take the least trips?
What are the top 3 days of the week with the most trips taken by members?
What are the top 3 days of the week with the most trips taken by casual riders?
What are the bottom 3 days of the week with the most trips taken by members?
What are the bottom 3 days of the week with the most trips taken by casual riders?

Month
Rank the months that riders take the most trips to the least trips?
Which month do riders take the most trips?
Which month do riders take the least trips?
Rank the months that members take the most trips to the least trips?
Which month do members take the most trips?
Which month do members take the least trips?
Rank the months that casual riders take the most trips to the least trips?
Which month do casual riders take the most trips?
Which month do casual riders take the least trips?

DISTANCE
What is the average distance riders traveled per trip?

Day of Week
Rank the day-of-the-week that riders rode the most miles to the least miles?
Which day-of-the-week do riders rode the most miles?
Which day-of-the-week that riders rode the least miles?
Rank the day-of-the-week that members rode the most miles to the least miles?
Which day-of-the-week do members rode the most miles?
Which day-of-the-week that riders rode the least miles?
Rank the day-of-the-week that members rode the most miles to the least miles?
Which day-of-the-week do members rode the most miles?
Which day of the week do members rode the least miles?
Rank the day-of-the-week that casual riders rode the most miles to the least miles?
Which day of the week do casual riders rode the most miles?
Which day of the week do casual riders rode the least miles?

Month
Rank the months that riders rode the most miles to the least miles.
Which month did riders ride the most miles?
Which month did riders ride the least miles?
Rank the months that members rode the most miles to the least miles.
Which month do members ride the most miles?
Which month do members ride the least miles?
Rank the months that casual riders rode the most miles to the least miles.
Which month do casual riders ride the most miles?
Which month do casual riders ride the least miles?

AVERAGES Average minutes most miles
What are the average minutes per trip that members rode bikes when they rode the most miles?
What are the average minutes per trip that casual riders rode bikes when they rode the most miles?

Average Distance Top Month
What is the average distance per trip that members rode in the top month?
What is the average distance per trip that casual riders ride in the top month?

Average Time Rode by Month
What are the average minutes that annual members used bikes each month?
What are the average minutes that casual riders used bikes each month?

Average Distance Traveled by Month
What is the average distance that annual members traveled each month?
What is the average distance that casual riders traveled each month?

Logistics Regression
Logistics regression can show the probability that a person will get a membership based on predictors such as miles and distance. I can use the use the start and end times to extract the amount of time for each ride. I can also use starting and ending longitude and latitude data to extrapolate distance. That will provide numerical variables of time and distance. We want to predict membership, which can be thought of as a binary. True or false. You either have it or you don’t. That means membership is a categorical variable. This case meets the criteria for logistics regression analysis, because I have a binary outcome and numerical predictors. We can use logistics regression to calculate the probability of a binary outcome; a membership. We can also find out the impact of numeric independent variables on the categorical dependent variable. We can evaluate which independent variable (time or distance) influences the dependent variable (membership) the most. Once answered, the course of action is to target the variable with the greater influence.

Gaps
Where are the gaps?
The gap is space or distance of the present state of where the company is now to the preferred state of where the company wants to be. The present state is 41% annual members and 59% casual riders. The Preferred state is 100% annual membership and 0% casual riders, so there is no gap.

Opportunities
Where are the opportunities?
The obvious opportunity is to convert the 41% casual riders to annual members. The other opportunities will surface after the analysis.

Options
What are the options?
We must reflect back on the business task of converting casual members to annual members in order to think about options. I always say to put all the options on the table, good or bad. Once all the options are listed, or in my case the ones I can think of, then pick the ones that make the most sense.

Do nothing. Stay the current course.
Launch a massively expensive marketing plan showing the benefits and convenience of a great bike-share service and hope you snag casual riders.
Launch a small marketing plan targeting causal riders at specific times and locations.
Expand the pricing plan. Add a pricing plan for weekends.
Add a Rewards Program. Rewards could be free water, snack, or incentive discount.
Provide a free membership for a limited time period. Do it in increments.

Opportunity Cost vs Value
The casual riders need to believe that the switch to annual membership is worth it. So, how would you do that? That can be done by using pricing plans, incentives, promotions, discounts, or other means. Let’s look at a simple analogy that works and see if we can use that same logic. If that chocolate ice cream you just bought tasted good and satisfied your sweet tooth, would you be likely to buy it the next time? Most likely, yes. If that software worked really well and satisfies the need, would you be likely to buy it? Most likely, yes. Now let’s look at how this analogy worked in business. Adobe provided Acrobat Reader free for years. It worked. It was good. Everyone loved it and used it. It satisfied the need to read PDF files. Everyone was used to using Adobe. Now, Adobe has many products, and they are almost all subscription based. People know they are good, and they buy them, and they use them. We don’t have to reinvent the wheel. This model can be applied to Bike Share. If casual riders are given a membership for a trial period, they will most likely use the bikes more and see the value of a buying a membership. The opportunity cost is the loss of potential revenue during the trial period. Is the opportunity cost worth getting more memberships?

STEP 2: Prepare

This step is to collect and store data for the upcoming data analysis.

Data Generation

The data is provided by Amazon and available for download at Amazonws.com.

202202-divvy-tripdata.csv
202203-divvy-tripdata.csv
202204-divvy-tripdata.csv
202205-divvy-tripdata.csv
202206-divvy-tripdata.csv
202207-divvy-tripdata.csv
202208-divvy-tripdata.csv
202209-divvy-publictripdata.csv
202210-divvy-tripdata.csv
202211-divvy-tripdata.csv
202212-divvy-tripdata.csv
202301-divvy-tripdata.csv

Data Format and Structure

There are 12 monthly CSV files. One year of data is being evaluated, which is from February 2022 – January 2023. There are 13 columns with what looks like three possible data types: character, date, and numeric.

Bias and Credibility

ROCCC check
* Reliable? Yes. The data is provided by the company Motivate International Inc, for the purpose of the case study.
* Original? Yes. It is fictitious data that is to be used for case studies.
* Comprehensive? No. PII and financial data has been removed.
* Current? Yes. The data files are current.
* Cited? Yes. The data is from Motivate, and there is a license agreement.

Licensing, Privacy, Security, Accessibility
* Licensing: The license agreement is available online. * Security: Data security issues are stated in the license agreement under Prohibited Conduct. * Accessibility: The data is accessible online for download at Amazonws.com.

Data Integrity The files are all downloaded and populated. I checked the file properties of all the files to check for byte size. All the files have the same number of columns and the same names. Problems include missing data and values of zero. The missing data is predominately station names and station IDs. Omitting the columns won’t impact the case study.
* Completeness: there are missing values in several columns.
* Accuracy: aside from the missing data, the data is accurate.
* Consistency: aside from the missing values, the data is consistent.
* Timeliness: the data is always available, so it is timely.
* Compliance: this is a nonissue as this is a case problem.

Organize and Protect Data
Each of the 12 data files will be saved with a new name, which preserves the original files.The new files, working files, will be used for analysis. There is a backup of the original and working files.

Original filenames ………………. New filenames
202202-divvy-tripdata.csv ………. 202202Data.csv
202203-divvy-tripdata.csv ………. 202203Data.csv
202204-divvy-tripdata.csv ………. 202204Data.csv
202205-divvy-tripdata.csv ………. 202205Data.csv
202206-divvy-tripdata.csv ………. 202206Data.csv
202207-divvy-tripdata.csv ………. 202207Data.csv
202208-divvy-tripdata.csv ………. 202208Data.csv
202209-divvy-publictripdata.csv….202209Data.csv
202210-divvy-tripdata.csv ………. 202210Data.csv
202211-divvy-tripdata.csv ………. 202211Data.csv
202212-divvy-tripdata.csv ………. 202212Data.csv
202301-divvy-tripdata.csv ………. 202301Data.csv

STEP 3: Process

The Process step has three parts: 1) clean the data, 2) verify the data, and 3) gain strategies for verifying and sharing your data.

I will do a little work in Excel to add some useful columns and format the data so that it will be easier to work with in R. Then, I will combine the 12 CSV data files into one data frame. Then I will eliminate any errors and inaccuracies.

1. Clean Data
* find and eliminate errors or inaccuracies
* transform into useful format
* combine data sets to make more complete
* remove outliers that skew information
2. Verify Data to make sure it is complete and correct
3. Gain Strategies for verifying and sharing your data cleansing with stakeholders

Cleaning the Data

This is where we find and eliminate errors or inaccuracies, transfer into a useful format, combine datasets to make it more complete,and remove any outliers that may skew the data.

The original dataset has 13 columns. I removed four columns and added 5. There are now 14 columns.

After inspecting the CSV files, there are a lot missing data and values of zero for the station IDs and station names; thousands and thousands of missing values. There are also inconsistencies in these columns with the naming conventions, so I won’t use them to avoid skewing the data. These 4 columns were deleted: start_station_name, start_station_id,end_station_name, and end_station_id. I used Excel to add five columns and extrapolated data for new columns that will be useful during analysis.

member: =IF(M2=“member”,1,0), convert membership to numeric, binary.
minutes: =(D2-C2)*1440, format mm:ss
dow: =TEXT(C2,“ddd”), the three-letter abbreviation for the day of the week
month: =MONTH(C2), the three-letter abbreviation for the month
miles: =ACOS(COS(RADIANS(90-I2))COS(RADIANS(90-K2))+SIN(RADIANS(90-I2))SIN(RADIANS(90-K2))COS(RADIANS(J2-L2)))3959, distance between the beginning and ending coordinates of a trip

I checked for blank records and obvious errors. There were some blank values for the end_lat and end_lng columns. I will sort that out with R.

I sorted by miles and found a bunch of #NUM! errors in the miles column. NUM errors occur when a formula or function isn’t valid or a mathematical function doesn’t work like dividing by zero. I removed the #NUM! records from each file. Less than 1% of the data per files was removed.
year…..records…#NUM!…remaining..% removed
202202…115,159…449…..114,709….0.390%
202203…284,042…853…..283,188….0.300%
202204…371,250…1,877…369,373….0.506%
202205…634,859…0…….634,859….0.000%
202206…769,205…924…..768,280….0.120%
202207…823,489…966…..822,522….0.117%
202208…785,933…919…..785,013….0.117%
202209…701,340…817…..700,522….0.116%
202210…558,686…539…..558,146….0.096%
202211…337,736…396…..337,339….0.117%
202212…181,807…211…..181,595….0.116%
202301…190,302…232…..190,069….0.122%

Other errors and outliers that I will remove with R.
Trips with duration less than 1 minute.
Trips with duration greater than minutes in a day, 1440.
Trips greater than 100 miles.
I notied there were trips with miles over 60,000.
Zeroes in the member column should not be removed as it indicates a non-membership.

Load the CSV files into R, combine them into a data frame, and then continue the cleaning process.

Load R packages and libraries. this is done from the console to allow the Knit to run.

options(repos="https://cran.rstudio.com")

install.packages("tidyverse")

## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\rfinegan\AppData\Local\Temp\RtmpS6CWuC\downloaded_packages

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)

Create the data frames by reading in the CSV files.

bike202202 <- read_csv("202202Data.csv")

## Rows: 115158 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202203 <- read_csv("202203Data.csv")

## Rows: 283187 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202204 <- read_csv("202204Data.csv")

## Rows: 371247 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202205 <- read_csv("202205Data.csv")

## Rows: 634858 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202206 <- read_csv("202206Data.csv")

## Rows: 769204 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202207 <- read_csv("202207Data.csv")

## Rows: 823488 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202208 <- read_csv("202208Data.csv")

## Rows: 785932 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202209 <- read_csv("202209Data.csv")

## Rows: 701339 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202210 <- read_csv("202210Data.csv")

## Rows: 558685 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202211 <- read_csv("202211Data.csv")

## Rows: 337735 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202212 <- read_csv("202212Data.csv")

## Rows: 181806 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bike202301 <- read_csv("202301Data.csv")

## Rows: 190301 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, member_casual, dow, m...
## dbl (7): start_lat, start_lng, end_lat, end_lng, member, minutes, miles
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Check the column names of data frames.

colnames(bike202202)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202203)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202204)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202205)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202206)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202207)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202208)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202209)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202210)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202211)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202212)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

colnames(bike202301)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

Here are the 14 columns and their data types.
ride_id………character
rideable_type…character
started_at……character
ended_at……..character
start_lat…….numeric
start_lng…….numeric
end_lat………numeric
end_lng………numeric
member_casual…character
member……….numeric
minutes………numeric
dow………….character
month………..character
miles………..numeric

Check the structure of data.

str(bike202202)

## spc_tbl_ [115,158 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:115158] "973B6FB26CA8BD27" "73758E384AA65A53" "202F9F20AB6E177D" "E83139CA4370782B" ...
##  $ rideable_type: chr [1:115158] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at   : chr [1:115158] "2/10/2022 14:35" "2/1/2022 9:36" "2/11/2022 10:16" "2/8/2022 16:03" ...
##  $ ended_at     : chr [1:115158] "2/10/2022 14:49" "2/1/2022 10:51" "2/11/2022 11:05" "2/8/2022 16:40" ...
##  $ start_lat    : num [1:115158] 41.9 41.8 41.8 41.8 41.8 ...
##  $ start_lng    : num [1:115158] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat      : num [1:115158] 41.9 41.8 41.8 41.8 41.8 ...
##  $ end_lng      : num [1:115158] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual: chr [1:115158] "member" "casual" "member" "member" ...
##  $ member       : num [1:115158] 1 0 1 1 1 0 1 1 0 1 ...
##  $ minutes      : num [1:115158] 14 75.2 49.1 36.6 39.6 ...
##  $ dow          : chr [1:115158] "Thu" "Tue" "Fri" "Tue" ...
##  $ month        : chr [1:115158] "Feb" "Feb" "Feb" "Feb" ...
##  $ miles        : num [1:115158] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202203)

## spc_tbl_ [283,187 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:283187] "01CBE93F2A84B1D4" "BF3995B1A3E8C334" "F6F4B4A2E7DA0B33" "D92355D09AD21135" ...
##  $ rideable_type: chr [1:283187] "classic_bike" "docked_bike" "classic_bike" "classic_bike" ...
##  $ started_at   : chr [1:283187] "3/26/2022 11:40" "3/20/2022 16:38" "3/16/2022 14:42" "3/13/2022 17:05" ...
##  $ ended_at     : chr [1:283187] "3/26/2022 12:04" "3/20/2022 16:59" "3/16/2022 14:43" "3/13/2022 18:12" ...
##  $ start_lat    : num [1:283187] 41.8 41.9 42 41.7 42 ...
##  $ start_lng    : num [1:283187] -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ end_lat      : num [1:283187] 41.8 41.9 42 41.7 42 ...
##  $ end_lng      : num [1:283187] -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ member_casual: chr [1:283187] "member" "casual" "member" "casual" ...
##  $ member       : num [1:283187] 1 0 1 0 0 1 1 1 0 1 ...
##  $ minutes      : num [1:283187] 24.03 21.42 0.35 67.37 89.18 ...
##  $ dow          : chr [1:283187] "Sat" "Sun" "Wed" "Sun" ...
##  $ month        : chr [1:283187] "Mar" "Mar" "Mar" "Mar" ...
##  $ miles        : num [1:283187] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202204)

## spc_tbl_ [371,247 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:371247] "7E14444DA4A4B2BC" "56970CB2B4599258" "99473F4114014BAE" "2AA570D0C915ED6D" ...
##  $ rideable_type: chr [1:371247] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at   : chr [1:371247] "4/29/2022 11:38" "4/24/2022 18:19" "4/17/2022 14:01" "4/23/2022 16:37" ...
##  $ ended_at     : chr [1:371247] "4/29/2022 11:38" "4/24/2022 19:15" "4/17/2022 14:05" "4/23/2022 16:37" ...
##  $ start_lat    : num [1:371247] 41.9 41.9 41.9 42 41.9 ...
##  $ start_lng    : num [1:371247] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat      : num [1:371247] 41.9 41.9 41.9 42 41.9 ...
##  $ end_lng      : num [1:371247] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ member_casual: chr [1:371247] "member" "casual" "casual" "member" ...
##  $ member       : num [1:371247] 1 0 0 1 0 1 1 1 1 1 ...
##  $ minutes      : num [1:371247] 0.53 56.4 4.33 0.57 35.75 ...
##  $ dow          : chr [1:371247] "Fri" "Sun" "Sun" "Sat" ...
##  $ month        : chr [1:371247] "Apr" "Apr" "Apr" "Apr" ...
##  $ miles        : num [1:371247] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202205)

## spc_tbl_ [634,858 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:634858] "043D51D15887474F" "84676EDD2699ABF5" "92ACD45CFCD9364E" "09B6FA8E709B19A7" ...
##  $ rideable_type: chr [1:634858] "classic_bike" "docked_bike" "electric_bike" "electric_bike" ...
##  $ started_at   : chr [1:634858] "5/19/2022 21:55" "5/14/2022 13:53" "5/4/2022 8:03" "5/24/2022 17:59" ...
##  $ ended_at     : chr [1:634858] "5/19/2022 21:56" "5/14/2022 17:30" "5/4/2022 8:05" "5/24/2022 18:01" ...
##  $ start_lat    : num [1:634858] 41.9 41.8 41.8 42 42 ...
##  $ start_lng    : num [1:634858] -87.7 -87.6 -87.8 -87.7 -87.7 ...
##  $ end_lat      : num [1:634858] 41.9 41.8 41.8 42 42 ...
##  $ end_lng      : num [1:634858] -87.7 -87.6 -87.8 -87.7 -87.7 ...
##  $ member_casual: chr [1:634858] "member" "casual" "member" "member" ...
##  $ member       : num [1:634858] 1 0 1 1 1 0 0 1 1 1 ...
##  $ minutes      : num [1:634858] 0.75 217.65 2.48 2.2 35.97 ...
##  $ dow          : chr [1:634858] "Thu" "Sat" "Wed" "Tue" ...
##  $ month        : chr [1:634858] "May" "May" "May" "May" ...
##  $ miles        : num [1:634858] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202206)

## spc_tbl_ [769,204 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:769204] "C19B08D794D1C89E" "6E9E3A041C14E960" "0AFDFDED87B18711" "73824977D6BA1702" ...
##  $ rideable_type: chr [1:769204] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at   : chr [1:769204] "6/30/2022 11:09" "6/30/2022 11:05" "6/30/2022 8:02" "6/30/2022 8:12" ...
##  $ ended_at     : chr [1:769204] "6/30/2022 11:10" "6/30/2022 11:09" "6/30/2022 8:07" "6/30/2022 8:17" ...
##  $ start_lat    : num [1:769204] 41.9 41.9 41.9 41.9 42 ...
##  $ start_lng    : num [1:769204] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat      : num [1:769204] 41.9 41.9 41.9 41.9 42 ...
##  $ end_lng      : num [1:769204] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual: chr [1:769204] "casual" "casual" "casual" "casual" ...
##  $ member       : num [1:769204] 0 0 0 0 0 0 0 0 0 0 ...
##  $ minutes      : num [1:769204] 0.78 3.42 4.88 4.1 21.68 ...
##  $ dow          : chr [1:769204] "Thu" "Thu" "Thu" "Thu" ...
##  $ month        : chr [1:769204] "Jun" "Jun" "Jun" "Jun" ...
##  $ miles        : num [1:769204] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202207)

## spc_tbl_ [823,488 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:823488] "55661F65ED2D1B17" "DBA6BF86FC23B1F3" "1A3FDE807BC010E0" "6614263E987C229A" ...
##  $ rideable_type: chr [1:823488] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at   : chr [1:823488] "7/4/2022 9:32" "7/31/2022 15:42" "7/21/2022 18:42" "7/1/2022 18:57" ...
##  $ ended_at     : chr [1:823488] "7/4/2022 10:44" "7/31/2022 15:42" "7/21/2022 19:12" "7/1/2022 18:58" ...
##  $ start_lat    : num [1:823488] 41.9 41.9 41.9 41.9 41.8 ...
##  $ start_lng    : num [1:823488] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat      : num [1:823488] 41.9 41.9 41.9 41.9 41.8 ...
##  $ end_lng      : num [1:823488] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual: chr [1:823488] "casual" "member" "member" "member" ...
##  $ member       : num [1:823488] 0 1 1 1 0 1 1 1 1 1 ...
##  $ minutes      : num [1:823488] 72.03 0.23 29.43 0.5 80.48 ...
##  $ dow          : chr [1:823488] "Mon" "Sun" "Thu" "Fri" ...
##  $ month        : chr [1:823488] "Jul" "Jul" "Jul" "Jul" ...
##  $ miles        : num [1:823488] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202208)

## spc_tbl_ [785,932 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:785932] "CBC1E1C17B07A999" "11EDB12F3F8F084E" "A49AE6AC519701A8" "0370D86C6729518B" ...
##  $ rideable_type: chr [1:785932] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at   : chr [1:785932] "8/8/2022 11:30" "8/7/2022 20:37" "8/7/2022 19:09" "8/8/2022 13:24" ...
##  $ ended_at     : chr [1:785932] "8/8/2022 11:36" "8/7/2022 20:40" "8/7/2022 19:11" "8/8/2022 13:31" ...
##  $ start_lat    : num [1:785932] 41.8 41.9 41.9 41.9 41.9 ...
##  $ start_lng    : num [1:785932] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat      : num [1:785932] 41.8 41.9 41.9 41.9 41.9 ...
##  $ end_lng      : num [1:785932] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual: chr [1:785932] "casual" "casual" "casual" "casual" ...
##  $ member       : num [1:785932] 0 0 0 0 0 0 0 0 0 0 ...
##  $ minutes      : num [1:785932] 5.42 3.4 2.65 6.78 31.15 ...
##  $ dow          : chr [1:785932] "Mon" "Sun" "Sun" "Mon" ...
##  $ month        : chr [1:785932] "Aug" "Aug" "Aug" "Aug" ...
##  $ miles        : num [1:785932] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202209)

## spc_tbl_ [701,339 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:701339] "E12D4A16BF51C274" "A02B53CD7DB72DD7" "4DEEB4550A266AE1" "106BA573A37CC68F" ...
##  $ rideable_type: chr [1:701339] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at   : chr [1:701339] "9/1/2022 17:11" "9/1/2022 17:15" "9/1/2022 7:30" "9/1/2022 6:44" ...
##  $ ended_at     : chr [1:701339] "9/1/2022 17:14" "9/1/2022 17:16" "9/1/2022 7:32" "9/1/2022 6:48" ...
##  $ start_lat    : num [1:701339] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng    : num [1:701339] -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ end_lat      : num [1:701339] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng      : num [1:701339] -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ member_casual: chr [1:701339] "casual" "casual" "casual" "casual" ...
##  $ member       : num [1:701339] 0 0 0 0 0 0 0 0 0 0 ...
##  $ minutes      : num [1:701339] 3.27 0.37 2.42 3.83 1.88 ...
##  $ dow          : chr [1:701339] "Thu" "Thu" "Thu" "Thu" ...
##  $ month        : chr [1:701339] "Sep" "Sep" "Sep" "Sep" ...
##  $ miles        : num [1:701339] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202210)

## spc_tbl_ [558,685 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:558685] "9209A92F1AF15A51" "ADB620B994F6DB82" "83ACA34E916CF8D5" "6F5988377250388C" ...
##  $ rideable_type: chr [1:558685] "electric_bike" "electric_bike" "classic_bike" "electric_bike" ...
##  $ started_at   : chr [1:558685] "10/6/2022 7:12" "10/21/2022 12:40" "10/27/2022 19:21" "10/18/2022 21:11" ...
##  $ ended_at     : chr [1:558685] "10/6/2022 7:12" "10/21/2022 13:33" "10/27/2022 19:43" "10/18/2022 21:11" ...
##  $ start_lat    : num [1:558685] 42 42 41.9 41.9 41.9 ...
##  $ start_lng    : num [1:558685] -87.7 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat      : num [1:558685] 42 42 41.9 41.9 41.9 ...
##  $ end_lng      : num [1:558685] -87.7 -87.7 -87.6 -87.7 -87.7 ...
##  $ member_casual: chr [1:558685] "member" "member" "member" "member" ...
##  $ member       : num [1:558685] 1 1 1 1 0 1 1 0 1 0 ...
##  $ minutes      : num [1:558685] 0.65 52.75 22.15 0.28 13.67 ...
##  $ dow          : chr [1:558685] "Thu" "Fri" "Thu" "Tue" ...
##  $ month        : chr [1:558685] "Oct" "Oct" "Oct" "Oct" ...
##  $ miles        : num [1:558685] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202211)

## spc_tbl_ [337,735 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:337735] "7F8CA9B17D7E2B5F" "E7372C2C8A9BFCA7" "1E19C43F1B1A04EC" "C8467A137DFF117A" ...
##  $ rideable_type: chr [1:337735] "classic_bike" "electric_bike" "classic_bike" "docked_bike" ...
##  $ started_at   : chr [1:337735] "11/3/2022 11:52" "11/8/2022 5:16" "11/18/2022 11:11" "11/10/2022 15:49" ...
##  $ ended_at     : chr [1:337735] "11/3/2022 11:52" "11/8/2022 5:16" "11/18/2022 11:43" "11/10/2022 17:47" ...
##  $ start_lat    : num [1:337735] 41.9 42 41.9 41.9 41.9 ...
##  $ start_lng    : num [1:337735] -87.6 -87.7 -87.7 -87.6 -87.7 ...
##  $ end_lat      : num [1:337735] 41.9 42 41.9 41.9 41.9 ...
##  $ end_lng      : num [1:337735] -87.6 -87.7 -87.7 -87.6 -87.7 ...
##  $ member_casual: chr [1:337735] "member" "member" "member" "casual" ...
##  $ member       : num [1:337735] 1 1 1 0 1 1 1 1 1 1 ...
##  $ minutes      : num [1:337735] 0.73 0.07 31.42 118.33 2.35 ...
##  $ dow          : chr [1:337735] "Thu" "Tue" "Fri" "Thu" ...
##  $ month        : chr [1:337735] "Nov" "Nov" "Nov" "Nov" ...
##  $ miles        : num [1:337735] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202212)

## spc_tbl_ [181,806 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:181806] "471662F932C43107" "FE9D8A7A26F6E917" "8602A572ADB87616" "BD850F0C2EF9CBA0" ...
##  $ rideable_type: chr [1:181806] "classic_bike" "classic_bike" "classic_bike" "electric_bike" ...
##  $ started_at   : chr [1:181806] "12/20/2022 7:13" "12/6/2022 13:21" "12/19/2022 12:07" "12/31/2022 16:11" ...
##  $ ended_at     : chr [1:181806] "12/20/2022 7:14" "12/6/2022 13:21" "12/19/2022 12:46" "12/31/2022 16:11" ...
##  $ start_lat    : num [1:181806] 41.9 42 42 42 41.9 ...
##  $ start_lng    : num [1:181806] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat      : num [1:181806] 41.9 42 42 42 41.9 ...
##  $ end_lng      : num [1:181806] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual: chr [1:181806] "member" "member" "member" "member" ...
##  $ member       : num [1:181806] 1 1 1 1 1 1 1 1 1 1 ...
##  $ minutes      : num [1:181806] 0.5 0.05 38.42 0.38 28.65 ...
##  $ dow          : chr [1:181806] "Tue" "Tue" "Mon" "Sat" ...
##  $ month        : chr [1:181806] "Dec" "Dec" "Dec" "Dec" ...
##  $ miles        : num [1:181806] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(bike202301)

## spc_tbl_ [190,301 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:190301] "C1D80BC2F7A0AF30" "B4ED20B291AB768F" "D9012A259FE14AC9" "BD8BFC9CDEA07766" ...
##  $ rideable_type: chr [1:190301] "electric_bike" "electric_bike" "classic_bike" "electric_bike" ...
##  $ started_at   : chr [1:190301] "1/19/2023 19:17" "1/26/2023 12:36" "1/19/2023 17:42" "1/2/2023 13:49" ...
##  $ ended_at     : chr [1:190301] "1/19/2023 19:18" "1/26/2023 12:37" "1/19/2023 17:42" "1/2/2023 13:49" ...
##  $ start_lat    : num [1:190301] 41.9 41.9 41.9 41.9 42 ...
##  $ start_lng    : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ end_lat      : num [1:190301] 41.9 41.9 41.9 41.9 42 ...
##  $ end_lng      : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ member_casual: chr [1:190301] "casual" "member" "member" "casual" ...
##  $ member       : num [1:190301] 0 1 1 0 1 1 1 1 1 1 ...
##  $ minutes      : num [1:190301] 0.47 0.8 0.07 0.27 0.93 ...
##  $ dow          : chr [1:190301] "Thu" "Thu" "Thu" "Mon" ...
##  $ month        : chr [1:190301] "Jan" "Jan" "Jan" "Jan" ...
##  $ miles        : num [1:190301] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

I will be doing calculations with the minutes and miles columns.They both show as numeric columns. All the other columns checked out.

The Data

Character data types ride_id, rideable_type, started_at, ended_at, member_casual, dow, month

Numeric data types start_lat, start_lng, end_lat, end_lng, member, minutes, and miles.

back to…
# Create and Transform
Merge the 12 data frames into one data frame.

trips <- bind_rows(
    bike202202, bike202203, bike202204, bike202205, bike202206, bike202207,
  bike202208, bike202209, bike202210, bike202211, bike202212, bike202301
)

Test the Data

This is a test and check of the data to see if the merge of the 12 data frames to the one data frame worked out OK.

nrow(trips)

## [1] 5752940

Check the number of variables.

dim(trips)

## [1] 5752940      14

Check the column names of the new data frame.

colnames(trips)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

Check the structure of the column data types.

str(trips)

## spc_tbl_ [5,752,940 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:5752940] "973B6FB26CA8BD27" "73758E384AA65A53" "202F9F20AB6E177D" "E83139CA4370782B" ...
##  $ rideable_type: chr [1:5752940] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at   : chr [1:5752940] "2/10/2022 14:35" "2/1/2022 9:36" "2/11/2022 10:16" "2/8/2022 16:03" ...
##  $ ended_at     : chr [1:5752940] "2/10/2022 14:49" "2/1/2022 10:51" "2/11/2022 11:05" "2/8/2022 16:40" ...
##  $ start_lat    : num [1:5752940] 41.9 41.8 41.8 41.8 41.8 ...
##  $ start_lng    : num [1:5752940] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat      : num [1:5752940] 41.9 41.8 41.8 41.8 41.8 ...
##  $ end_lng      : num [1:5752940] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual: chr [1:5752940] "member" "casual" "member" "member" ...
##  $ member       : num [1:5752940] 1 0 1 1 1 0 1 1 0 1 ...
##  $ minutes      : num [1:5752940] 14 75.2 49.1 36.6 39.6 ...
##  $ dow          : chr [1:5752940] "Thu" "Tue" "Fri" "Tue" ...
##  $ month        : chr [1:5752940] "Feb" "Feb" "Feb" "Feb" ...
##  $ miles        : num [1:5752940] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character(),
##   ..   member = col_double(),
##   ..   minutes = col_double(),
##   ..   dow = col_character(),
##   ..   month = col_character(),
##   ..   miles = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Take a peek at the data frame.

glimpse(trips)

## Rows: 5,752,940
## Columns: 14
## $ ride_id       <chr> "973B6FB26CA8BD27", "73758E384AA65A53", "202F9F20AB6E177…
## $ rideable_type <chr> "classic_bike", "classic_bike", "classic_bike", "classic…
## $ started_at    <chr> "2/10/2022 14:35", "2/1/2022 9:36", "2/11/2022 10:16", "…
## $ ended_at      <chr> "2/10/2022 14:49", "2/1/2022 10:51", "2/11/2022 11:05", …
## $ start_lat     <dbl> 41.87812, 41.78794, 41.78794, 41.78794, 41.78794, 41.900…
## $ start_lng     <dbl> -87.64395, -87.58832, -87.58832, -87.58832, -87.58832, -…
## $ end_lat       <dbl> 41.87812, 41.78794, 41.78794, 41.78794, 41.78794, 41.900…
## $ end_lng       <dbl> -87.64395, -87.58832, -87.58832, -87.58832, -87.58832, -…
## $ member_casual <chr> "member", "casual", "member", "member", "member", "casua…
## $ member        <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1,…
## $ minutes       <dbl> 13.97, 75.17, 49.13, 36.60, 39.58, 0.08, 31.10, 30.88, 1…
## $ dow           <chr> "Thu", "Tue", "Fri", "Tue", "Wed", "Sun", "Thu", "Wed", …
## $ month         <chr> "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", …
## $ miles         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Save trips variable to a file.

save(trips, file = "trips.Rdata")

Verified the file is created and in the working directory. Check for missing values.

sum(is.na(trips))

## [1] 103191

There are 103,191 missing or NA values. What percentage of data will I lose by removing NAs?
I can calculate it. 103,191 / 5,752,940 = 0.017937 or 1.79%. That is acceptable.

Remove NAs and create trips2 data frame.

trips2 <- na.omit(trips)

Check the number of rows for trips2.

nrow(trips2)

## [1] 5740512

Check column names of trips2 data frame.

colnames(trips2)

##  [1] "ride_id"       "rideable_type" "started_at"    "ended_at"     
##  [5] "start_lat"     "start_lng"     "end_lat"       "end_lng"      
##  [9] "member_casual" "member"        "minutes"       "dow"          
## [13] "month"         "miles"

Check the structure of trips2 data frame.

str(trips2)

## tibble [5,740,512 × 14] (S3: tbl_df/tbl/data.frame)
##  $ ride_id      : chr [1:5740512] "973B6FB26CA8BD27" "73758E384AA65A53" "202F9F20AB6E177D" "E83139CA4370782B" ...
##  $ rideable_type: chr [1:5740512] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at   : chr [1:5740512] "2/10/2022 14:35" "2/1/2022 9:36" "2/11/2022 10:16" "2/8/2022 16:03" ...
##  $ ended_at     : chr [1:5740512] "2/10/2022 14:49" "2/1/2022 10:51" "2/11/2022 11:05" "2/8/2022 16:40" ...
##  $ start_lat    : num [1:5740512] 41.9 41.8 41.8 41.8 41.8 ...
##  $ start_lng    : num [1:5740512] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat      : num [1:5740512] 41.9 41.8 41.8 41.8 41.8 ...
##  $ end_lng      : num [1:5740512] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual: chr [1:5740512] "member" "casual" "member" "member" ...
##  $ member       : num [1:5740512] 1 0 1 1 1 0 1 1 0 1 ...
##  $ minutes      : num [1:5740512] 14 75.2 49.1 36.6 39.6 ...
##  $ dow          : chr [1:5740512] "Thu" "Tue" "Fri" "Tue" ...
##  $ month        : chr [1:5740512] "Feb" "Feb" "Feb" "Feb" ...
##  $ miles        : num [1:5740512] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int [1:12428] 115082 115083 115084 115085 115086 115087 115088 115089 115090 115091 ...
##   ..- attr(*, "names")= chr [1:12428] "115082" "115083" "115084" "115085" ...

The column names and column structures look good.

Save trips2 variable to a file.

save(trips2, file = "trips2.Rdata")

Verified the file is created and in the working directory.

Check for duplicates.

trips2 %>%
  group_by_all() %>%
  filter(n()>1) %>%
  ungroup()

No duplicates found!!

Filter the data. Remove rows with minutes < 1 minute, or minutes > 1440, or miles > 100.

trips3 <- trips2[trips2$minutes >= 1, ]

Row count for trips 3.

nrow(trips3)

## [1] 5616638

Save trips3 variable to a file.

save(trips3, file = "trips3.Rdata")

Verified the file is created and in the working directory.

Remove rows where minutes > 1440 or miles > 100.

trips4 <- trips3[!(trips3$minutes > 1400 | trips3$miles > 100), ]

Row count for trips 4.

nrow(trips4)

## [1] 5616390

After the cleaning process, approximate 6% of the records were removed.
That is 5,752,940 - 5,392,606 = 4,591 records or 0.0626 or 6.26% of total records.

Save trips4 variable to a file.

save(trips4, file = "trips4.Rdata")

Verified the file is created and in the working directory.

Verify and Report on Cleaning Results

Station names and IDs were removed, because there were too many missing data and inconsistencies. The record count at the beginning of the process for the trips data frame was 5,752,940. After the cleaning process, the record count for the trips for data frame is 5,616,390. The number of records removed are 5,752,940 - 5,616,390 = 136,550 records, which is 2.37% of the total records. The change in population should not affect the outcome of the analysis. The data are clean and good to go.

STEP 4: Analyze

This step is for using tools to transform and organize information in order to draw useful conclusions, make predictions, and drive informed decision-making. Use tools to format and transform data • Sort and filter data • Identify patterns and draw conclusions • Make predictions and recommendations • Make data-driven decisions

How should I organize the data to perform analysis on it? To begin, I have to go back to Step 1: Ask Questions. I asked myself, what can I do with this data. Then I listed the questions that I could think of that seemed relevant. It is time to answer the questions now that the data is cleaned.

Trips

What are the total number of trips taken?

trips4 %>% tally()

Trip Count by Rider Type
What are the total number of trips taken by rider type?

 trips4 %>% 
  count(member_casual)

Percentage by Rider Type
What percentage of the customers have an annual membership?

sum(trips4$member_casual == "member") / sum(trips4$member_casual == "casual" | trips4$member_casual == "member")

## [1] 0.5928365

What percentage of the customers are casual riders?

sum(trips4$member_casual == "casual") / sum(trips4$member_casual == "casual" | trips4$member_casual == "member")

## [1] 0.4071635

Pie chart: Percentage by Rider Type

Member=trips4$member_casual
Member=as.factor(trips4$member_casual)
table(Member)

## Member
##  casual  member 
## 2286789 3329601

Freqmember=table(Member)
Percentage=round(Freqmember/5616390*100)
lbl=paste(names(Freqmember), Percentage, "%", sep = " ")
pie(Freqmember, main="Percentage by Rider Type", col = c(4,3), labels = lbl)

# Bike Type
What is the bike type count by rider type?

trips4 %>% 
  group_by(member_casual, rideable_type) %>% 
  summarise(num_used = n(), .groups = "drop") %>%
  arrange(desc(num_used))

Note that members do not dock bikes.

What is the percentage for each bike type used by riders?

trips4 %>% 
    group_by(rideable_type) %>% 
    summarise(num_trips = n(), .groups = "drop") %>%
    mutate(percent = round(num_trips / sum(num_trips),digits =2) * 100) %>%
    arrange(desc(percent))

Pie chart: Percentage by Bike Type

RideType=trips4$rideable_type
RideType=as.factor(trips4$rideable_type)
table(RideType)

## RideType
##  classic_bike   docked_bike electric_bike 
##       2592234        173711       2850445

FreqRideType=table(RideType)
Percentage=round(FreqRideType/5616390*100)
lbl=paste(names(FreqRideType), Percentage, "%", sep = " ")
pie(FreqRideType, main="Percentage by Bike Type", col = c(4,3,2), labels = lbl)

What is the percentage for each bike type used by rider type?

trips4 %>% 
    group_by(member_casual,rideable_type) %>% 
    summarise(num_trips = n(), .groups = "drop") %>%
    mutate(percent = round(num_trips / sum(num_trips),digits =2) * 100) %>%
    arrange(desc(percent))

What percentage Of members use classic bikes?

nrow(trips4[trips4$member_casual == "member" & trips4$rideable_type == "classic_bike", ]) /
nrow(trips4[trips4$member_casual == "member", ])

## [1] 0.5136829

What percentage of casual riders use classic bikes?

nrow(trips4[trips4$member_casual == "casual" & trips4$rideable_type == "classic_bike", ]) /
nrow(trips4[trips4$member_casual == "casual", ])

## [1] 0.385639

What Of the percentage of members use electric bikes?

nrow(trips4[trips4$member_casual == "member" & trips4$rideable_type == "electric_bike", ]) /
nrow(trips4[trips4$member_casual == "member", ])

## [1] 0.4863171

What percentage Of casual riders use electric bikes?

nrow(trips4[trips4$member_casual == "casual" & trips4$rideable_type == "electric_bike", ]) /
nrow(trips4[trips4$member_casual == "casual", ])

## [1] 0.5383982

What percentage Of casual riders use docked bikes?

nrow(trips4[trips4$member_casual == "casual" & trips4$rideable_type == "docked_bike", ]) /
nrow(trips4[trips4$member_casual == "casual", ])

## [1] 0.07596285

Time

What is the average time riders used bikes per trip by rider type?

trips4 %>% 
group_by(member_casual) %>% 
summarise(AverageMinutes = round(mean(minutes), digits = 1))

Casual riders on average spend almost twice as much time on a bike as members.

Bar chart: Average time per trip

trips4 %>%
  group_by(member_casual) %>%
  summarise(AMin = round(mean(minutes), digits = 1)) %>%
    ggplot(., aes(x=member_casual, y = AMin)) +
    geom_bar(width=0.5, stat="identity", fill="steelblue") +
    geom_text(aes(label=AMin), vjust = -0.3) +
    labs(title="Average Minutes per Trip", x="Member Type", y="Minutes")

day of week

Rank day of the week that riders spend the most minutes to the least minutes on bikes.

trips4 %>% 
group_by(dow) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  arrange(desc(MinuteCount))

Most minutes: Saturday
Least minutes: Tuesday

Bar chart: Rider Trip Minutes by Day of Week

trips4%>%
  group_by(dow) %>%
  summarise(MinuteCount = round(sum(minutes), digits = 0), .groups = 'drop') %>%
  ggplot(., aes(x=dow, y=MinuteCount)) +
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=format(MinuteCount,big.mark=","), vjust = -0.3)) +
  labs(title="Trip Minutes by Day of Week", x="Day of Week", y="Minutes")

Which day of the week do riders spend the most minutes on bikes?

trips4 %>% 
  group_by(dow) %>%
  summarise(minutes = sum(minutes)) %>%
  top_n(1, minutes)

Which day of the week do riders spend the least minutes on bikes?

trips4 %>% 
  group_by(dow) %>%
  summarise(minutes = sum(minutes)) %>%
  top_n(-1, minutes)

Rank the day of the week that members spend the most minutes to the least minutes on bikes.

trips4 %>% 
  filter(member_casual == "member") %>%
  group_by(dow, member_casual) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  arrange(desc(MinuteCount))

Most minutes: Thursday
Least minutes: Sunday

Which day of the week do members spend the most minutes on bikes?

trips4 %>% 
  filter(member_casual == "member") %>%
  group_by(dow) %>%
  summarise(minutes = sum(minutes)) %>%
  top_n(1, minutes)

Which day of the week do members spend the least minutes on bikes?

trips4 %>% 
  filter(member_casual == "member") %>%
  group_by(dow) %>%
  summarise(minutes = sum(minutes)) %>%
  top_n(-1, minutes)

Rank the day of the week do casual riders spend the most minutes to the least minutes on bikes?

trips4 %>% 
  filter(member_casual == "casual") %>%
  group_by(dow, member_casual) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  arrange(desc(MinuteCount))

Most minutes: Saturday
Least minutes: Tuesday

Which day of the week do casual riders spend the most minutes on bikes?

trips4 %>% 
  filter(member_casual == "casual") %>%
  group_by(dow) %>%
  summarise(minutes = sum(minutes)) %>%
  top_n(1, minutes)

Which day of the week do casual riders spend the least minutes on bikes?

trips4 %>% 
  filter(member_casual == "casual") %>%
  group_by(dow) %>%
  summarise(minutes = sum(minutes)) %>%
  top_n(-1, minutes)

Bar chart: Day of Week Most Minutes by Member Type

trips4 %>% 
  group_by(member_casual, dow) %>%
  summarise(MostMinutes = sum(minutes)) %>%
  top_n(1, MostMinutes) %>%
  ggplot(., aes(x=member_casual, y=MostMinutes))+
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=dow), vjust = -0.3) +
  labs(title="Day of Week - Most Minutes by Member Type", x="Member Type", y="Minutes")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

month

Rank the months that riders spend the most and least minutes on bikes?

trips4 %>% 
  group_by(month) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  arrange(desc(MinuteCount))

Most minutes: July
Least minutes: Feb

Bar chart: Rider Trip Minutes by Month

trips4%>%
  group_by(month) %>%
  summarise(MinuteCount = round(sum(minutes), digits = 0), .groups = 'drop') %>%
  ggplot(., aes(x=month, y=MinuteCount)) +
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=format(MinuteCount,big.mark=","), vjust = -0.3)) +
  labs(title="Trip Minutes by Month", x="Month", y="Minutes")

Which month do riders spend the most minutes on bikes?

trips4 %>% 
  group_by(month) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  top_n(1, MinuteCount)

Which month do riders spend the least minutes on bikes?

trips4 %>% 
  group_by(month) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  top_n(-1, MinuteCount)

Rank the months that member spend the most and least minutes on bikes?

trips4 %>% 
  filter(member_casual == "member") %>%
  group_by(month) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  arrange(desc(MinuteCount))

Most minutes: July
Least minutes: February

Which month do members spend the most minutes on bikes?

trips4 %>% 
  filter(member_casual == "member") %>%
  group_by(month) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  top_n(1, MinuteCount)

Which month do members spend the least minutes on bikes?

trips4 %>% 
  filter(member_casual == "member") %>%
  group_by(month) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  top_n(-1, MinuteCount)

Rank the months that casual riders spend the most and least minutes on bikes?

trips4 %>% 
  filter(member_casual == "casual") %>%
  group_by(month) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  arrange(desc(MinuteCount))

Most minutes: July
Least minutes: February

Which month do casual riders spend the most minutes on bikes?

trips4 %>% 
  filter(member_casual == "casual") %>%
  group_by(month) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  top_n(1, MinuteCount)

Which month do casual riders spend the least minutes on bikes?

trips4 %>% 
  filter(member_casual == "casual") %>%
  group_by(month) %>% 
  summarise(MinuteCount = sum(minutes), .groups = 'drop') %>%
  top_n(-1, MinuteCount)

Bar chart: Month Most Minutes by Member Type

trips4 %>% 
  group_by(member_casual, month) %>%
  summarise(MostMinutes = sum(minutes)) %>%
  top_n(1, MostMinutes) %>%
  ggplot(., aes(x=member_casual, y=MostMinutes)) +
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=month), vjust = -0.3) +
  labs(title="Month - Most Minutes by Member Type", x="Member Type", y="Minutes")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Trips

day of week

Rank the day of the week riders take the most trips to the least trips?

trips4 %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  arrange(desc(TripCount))

Most trips: Saturday
Least trips: Monday

Bar chart: Trips by Weekday

trips4%>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  ggplot(., aes(x=dow, y=TripCount)) +
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=format(TripCount,big.mark=","), vjust = -0.3)) +
  labs(title="Trips by Day of Week", x="Day of Week", y="Trips")

Which day of the week do riders take the most trips?

trips4 %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  top_n(1, TripCount)

Which day of the week do riders take the least trips?

trips4 %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  top_n(-1, TripCount)

Rank the day of the week members take the most trips to the least trips?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  arrange(desc(TripCount))

Most trips: Thursday
Least trips: Sunday

Which day of the week do annual members take the most trips?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  top_n(1, TripCount)

Which day of the week do annual members take the least trips?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  top_n(-1, TripCount)

Rank the day of the week casual riders take the most trips to the least trips?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  arrange(desc(TripCount))

Most trips:Saturday
Least trips: Tuesday

Which day of the week do casual riders take the most trips?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  top_n(1, TripCount)

Which day of the week do casual riders take the least trips?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  top_n(-1, TripCount)

What are the top 3 days of the week with the most trips taken by members?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  top_n(3, TripCount)

What are the top 3 days of the week with the most trips taken by casual riders?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  top_n(3, TripCount)

What are the bottom 3 days of the week with the least trips taken by members?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  top_n(-3, TripCount)

What are the bottom 3 days of the week with the least trips taken by casual riders?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(dow) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  top_n(-3, TripCount)

Bar chart: Day of Week - Most Trips by member Type

trips4 %>% 
  group_by(member_casual, dow) %>%
  summarise(TripCount = n(), groups = "drop") %>%
  top_n(1, TripCount) %>%
    ggplot(., aes(x=member_casual, y=TripCount)) +
    geom_bar(width=0.5, stat="identity", fill="steelblue") +
    geom_text(aes(label=dow), vjust = -0.3) +
    labs(title="Day of Week - Most Trips by Member Type", x="Member Type", y="Minutes")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

month

Rank the months that riders take the most trips to the least trips?

trips4 %>%
  group_by(month) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  arrange(desc(TripCount))

most trips: Jul
least trips: Jan

Bar chart: Trips by Month

trips4%>%
  group_by(month) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  ggplot(., aes(x=month, y=TripCount)) +
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=format(TripCount,big.mark=","), vjust = -0.3)) +
  labs(title="Trips by Month", x="Month", y="Trips")

Which month do riders take the most trips?

trips4 %>%
  group_by(month) %>%
  summarise(TripCount = n(), .groups = "drop") %>%
top_n(1, TripCount)

Which month do riders take the least trips?

trips4 %>%
  group_by(month) %>%
  summarise(TripCount = n(), .groups = "drop") %>%
top_n(-1, TripCount)

Rank the months that members take the most trips to the least trips?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(month) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  arrange(desc(TripCount))

most trips: Aug least trips: Jan

Which month do members take the most trips?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(month) %>%
  summarise(TripCount = n(), .groups = "drop") %>%
top_n(1, TripCount)

Which month do members take the least trips?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(month) %>%
  summarise(TripCount = n(), .groups = "drop") %>%
top_n(-1, TripCount)

Rank the months that casual riders take the most trips to the least trips?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(month) %>%
  summarise(TripCount = n(), .groups = 'drop') %>%
  arrange(desc(TripCount))

most trips: Jul
least trips: Feb

Which month do casual riders take the most trips?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(month) %>%
  summarise(TripCount = n(), .groups = "drop") %>%
top_n(1, TripCount)

Which month do casual riders take the least trips?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(month) %>%
  summarise(TripCount = n(), .groups = "drop") %>%
top_n(-1, TripCount)

Bar chart: Month Most Trips by Member Type

trips4 %>% 
  group_by(member_casual, month) %>%
  summarise(TripCount = n(), groups = "drop") %>%
  top_n(1, TripCount) %>%
  ggplot(., aes(x=member_casual, y=TripCount))+
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=month), vjust = -0.3) +
  labs(title="Month - Most Trips by Member Type", x="Member Type", y="Trips")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Distance

What is the average distance riders traveled per trip?

trips4 %>% 
  group_by(member_casual) %>% 
  summarise(AvgMiles = mean(miles), .groups = 'drop') %>%
  arrange(desc(AvgMiles))

day of week

Rank the day-of-the-week that riders rode the most miles to the least miles?

trips4 %>% 
  group_by(dow) %>% 
  summarise(dowMiles = sum(miles), .groups = 'drop') %>%
  arrange(desc(dowMiles))

Most miles: Saturday
Least miles: Monday

Bar chart: Miles by Day of Week

trips4%>%
  group_by(dow) %>%
  summarise(SumMiles = sum(miles)) %>%
  ggplot(., aes(x=dow, y=SumMiles)) +
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=format(SumMiles,big.mark=","), vjust = -0.3)) +
  labs(title="Miles by Day of Week", x="Day of Week", y="Miles")

Which day-of-the-week do riders rode the most miles?

trips4 %>% 
  group_by(dow) %>% 
  summarise(dowMiles = sum(miles), .groups = 'drop') %>%
  top_n(1, dowMiles)

Which day-of-the-week that riders rode the least miles?

trips4 %>% 
  group_by(dow) %>% 
  summarise(dowMiles = sum(miles), .groups = 'drop') %>%
  top_n(-1, dowMiles)

Rank the day-of-the-week that members rode the most miles to the least miles?

trips4 %>% 
  filter(member_casual  == "member") %>%
  group_by(dow) %>% 
  summarise(dowMiles = sum(miles), .groups = 'drop') %>%
  arrange(desc(dowMiles))

Most miles: Thursday
Least miles: Sunday

Which day-of-the-week do members rode the most miles?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(dow) %>%
  summarise(dowMiles = sum(miles), .groups = "drop") %>%
  top_n(1, dowMiles)

Which day of the week do members rode the least miles?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(dow) %>%
  summarise(dowMiles = sum(miles), .groups = "drop") %>%
  top_n(-1, dowMiles)

Rank the day-of-the-week that casual riders rode the most miles to the least miles?

trips4 %>% 
  filter(member_casual  == "casual") %>%
  group_by(dow) %>% 
  summarise(dowMiles = sum(miles), .groups = 'drop') %>%
  arrange(desc(dowMiles))

Most miles: Saturday
Least miles: Tuesday

Which day of the week do casual riders rode the most miles?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(dow) %>%
  summarise(dowMiles = sum(miles), .groups = "drop") %>%
  top_n(1, dowMiles)

Which day of the week do casual riders rode the least miles?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(dow) %>%
  summarise(dowMiles = sum(miles), .groups = "drop") %>%
  top_n(-1, dowMiles)

Bar chart: Day of Week - Most miles by member Type

trips4 %>% 
  group_by(member_casual, dow) %>%
  summarise(MostMiles = sum(miles)) %>%
  top_n(1, MostMiles) %>%
  ggplot(., aes(x=member_casual, y=MostMiles))+
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=dow), vjust = -0.3) +
  labs(title="Day of Week - Most Minutes by Member Type", x="Member Type", y="Minutes")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

month

Rank the months that riders rode the most miles to the least miles.

trips4 %>% 
  group_by(month) %>% 
  summarise(MileCount = sum(miles), .groups = 'drop') %>%
  arrange(desc(MileCount))

Most miles: July
Least miles: February

Bar chart: Miles per Month

trips4%>%
  group_by(month) %>%
  summarise(SumMiles = sum(miles)) %>%
  ggplot(., aes(x=month, y=SumMiles)) +
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=format(SumMiles,big.mark=","), vjust = -0.3)) +
  labs(title="Miles by Month", x="Month", y="Miles")

Which month did riders ride the most miles?

trips4 %>%
  group_by(month) %>%
  summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(1, MostMiles)

Which month did riders ride the least miles?

trips4 %>%
  group_by(month) %>%
  summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(-1, MostMiles)

Rank the months that members rode the most miles to the least miles.

trips4 %>% 
  filter(member_casual =="member") %>%
  group_by(month) %>% 
  summarise(MileCount = sum(miles), .groups = 'drop') %>%
  arrange(desc(MileCount))

Most miles; August
Least miles: February

Which month do members ride the most miles?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(month) %>%
  summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(1, MostMiles)

Which month do members ride the least miles?

trips4 %>%
  filter(member_casual == "member") %>%
  group_by(month) %>%
  summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(-1, MostMiles)

Rank the months that casual riders rode the most miles to the least miles.

trips4 %>% 
  filter(member_casual =="casual") %>%
  group_by(month) %>% 
  summarise(MileCount = sum(miles), .groups = 'drop') %>%
  arrange(desc(MileCount))

Most miles: July
Least miles: February

Which month do casual riders ride the most miles?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(month) %>%
  summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(1, MostMiles)

Which month do casual riders ride the least miles?

trips4 %>%
  filter(member_casual == "casual") %>%
  group_by(month) %>%
  summarise(MostMiles = sum(miles), .groups = "drop", ) %>%
top_n(-1, MostMiles)

Bar chart: Month - Most Miles by Member Type

trips4 %>% 
  group_by(member_casual, month) %>%
  summarise(MostMiles = sum(miles)) %>%
  top_n(1, MostMiles) %>%
  ggplot(., aes(x=member_casual, y=MostMiles))+
  geom_bar(width=0.5, stat="identity", fill="steelblue") +
  geom_text(aes(label=month), vjust = -0.3) +
  labs(title="Month - Most Miles by Member Type", x="Member Type", y="Miles")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Average minutes most miles What are the average minutes per trip that members rode bikes when they rode the most miles?

trips4 %>% 
  filter(month == "Aug" & member_casual == "member") %>%
  group_by(month, member_casual) %>% 
  summarise(AvgMinutes = mean(minutes), .groups = 'drop')

What are the average minutes per trip that casual riders rode bikes when they rode the most miles?

trips4 %>% 
  filter(month == "Jul" & member_casual == "casual") %>%
  group_by(month, member_casual) %>% 
  summarise(AvgMinutes = mean(minutes), .groups = 'drop')

Average distance most miles What is the average distance per trip that members rode in the top month?

trips4 %>% 
  filter(month == "Aug", member_casual == "member") %>%
  group_by(month, member_casual) %>% 
  summarise(AvgMiles = mean(miles), .groups ='drop') %>%
  arrange(desc(AvgMiles))

What is the average distance per trip that casual riders ride in the top month?

trips4 %>% 
  filter(month == "Jul", member_casual == "casual") %>%
  group_by(month, member_casual) %>% 
  summarise(AvgMiles = mean(miles), .groups ='drop') %>%
  arrange(desc(AvgMiles))

Logistics Regression Analysis

Create new data frame. Remove any unneeded columns for cleaner data run and computer efficiency.

trips5 <- trips4[,c("member", "minutes", "miles")]

Make sure all columns are still numeric.

trips5$member = as.numeric(as.character(trips5$member))
trips5$minutes = as.numeric(as.character(trips5$minutes))
trips5$miles = as.numeric(as.character(trips5$miles))

View the data frame to make sure members are 1 and casual are 0.

View(trips5)

Summarize data

summary(trips5)

##      member          minutes            miles       
##  Min.   :0.0000   Min.   :   1.00   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.:   6.02   1st Qu.: 0.570  
##  Median :1.0000   Median :  10.43   Median : 1.000  
##  Mean   :0.5928   Mean   :  16.44   Mean   : 1.347  
##  3rd Qu.:1.0000   3rd Qu.:  18.57   3rd Qu.: 1.750  
##  Max.   :1.0000   Max.   :1396.75   Max.   :26.270

Get standard deviation

sapply(trips5, sd)

##     member    minutes      miles 
##  0.4913058 28.3688534  1.1917930

Fit the logistics regression model.

M1=glm(formula=member~minutes+miles,family="binomial", data=trips5)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Summarize results

summary(M1)

## 
## Call:
## glm(formula = member ~ minutes + miles, family = "binomial", 
##     data = trips5)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.132e-01  1.450e-03   491.8   <2e-16 ***
## minutes     -3.829e-02  8.514e-05  -449.8   <2e-16 ***
## miles        1.820e-01  8.984e-04   202.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7591220  on 5616389  degrees of freedom
## Residual deviance: 7258406  on 5616387  degrees of freedom
## AIC: 7258412
## 
## Number of Fisher Scoring iterations: 6

Explanation of results

The p-values indicated that minutes and miles are important predictors, because values are less than .05.

The dependent variable, member_casual now named member, (Intercept) is influenced by the independent variables minutes and miles.

Each one-unit change in minutes will decrease the log odds of getting a membership by -.03829.

Each one-unit change in miles will increase the log odds of getting a membership by 0.1820.

We can look at the difference between Null deviance and Residual deviance to see if the model is a good fit. The bigger the difference, the better the model. Null deviance is when you only have an intercept in your equation, but in our case we have variables on the other side of the equation; the Residual variance. The model is a good fit if that difference is big enough. This explanation is referenced from “Simply Explained Logistics Regression with Example in R” on the towardsdatascience.com website. https://towardsdatascience.com/simply-explained-logistic-regression-with-example-in-r-b919acb1d6b3

Test the model. Plug in numbers for minutes and miles to see the probability of a casual rider getting a membership.
.07132Y = -0.03829(minutes) + 0.182(miles)
.07132Y = -0.03829(20) + 0.182(8) = 9.67751 percent probability that the casual rider will get a membership.

Plot the logistics function for the miles predictor for casual riders.

trips5 %>%
  filter(member == "0")

    ggplot(trips5, aes(x=miles, y=member)) +
    geom_point() +
    stat_smooth(method="glm", color="green", se=FALSE, method.args=list(family=binomial))

## `geom_smooth()` using formula = 'y ~ x'

What does this chart mean? You can see in the chart that if the miles are less than 12, the log odds, probability, that the person will be a member is at least 50%.

Since the coefficient for miles was positive, it is a better predictor than minutes, which has a negative coefficient. That begs the question of: where should our efforts be targeted? It’s really sort of an educated guess. I am going to guess at casual riders that ride 5 or more miles.

How many casual riders ride 5 or more miles?

nrow(trips5[trips5$member == "0" & trips5$miles >= "5", ])

## [1] 44296

There are 44,296 casual riders that have a 50% or higher change of becoming annual members.

Plot the logistics function for the minutes predictor for casual riders.

trips5 %>%
  filter(member == "0")

  ggplot(trips5, aes(x=minutes, y=member)) +
  geom_point(alpha=.5) +
  scale_x_continuous(n.breaks = 20) +
  stat_smooth(method="glm", color="green", se=FALSE, method.args=list(family=binomial))

## `geom_smooth()` using formula = 'y ~ x'

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

You can see in the chart that if the minutes are less than 25, the log odds, probability, that the person will be a member is at least 50%.

It makes sense that both charts are negative sloping, because these are rentals and there are only used a portion of the day.

save(trips5, file = "trips5.Rdata")

STEP 6: Act

Insights and Recommendations

Distance is the key factor for turning casual riders into annual members.

The Target
There are 44,296 casual riders that have a 50% or higher change of becoming annual members.

I recommend option 3 that I suggested in step 1, ASK, of the Data Analytics Process: Launch a small marketing plan targeting causal riders at specific times and locations.

Time Casual riders spend almost twice as much time per trip on a bike than members.
Include an element in the marketing plan to target casual riders to:
1. ride bikes during the middle days of the week, and
2. get people riding more in the non-peak months.

Distance
While the average distance riders traveled per trip is 1.35 miles, there are casual riders that bike a lot further.
Include an element in the marketing plan to get customers to bike a little further.

Regression analysis
Analysis indicates that miles is a better predictor than minutes for a casual rider to become an annual member.
Include an element in the marketing plan to get casual riders to ride more miles; up to 12 miles maximum.

Case Study: How Does a Bike-Share Navigate Speedy Success

Case Study

Methodology

Company Overview

Stakeholders

STEP 1: Ask

Business Task

Insights

Questions

STEP 2: Prepare

Data Generation

Data Format and Structure

Bias and Credibility

STEP 3: Process

Cleaning the Data

The Data

Test the Data

Verify and Report on Cleaning Results

STEP 4: Analyze

Trips

Time

day of week

month

Trips

day of week

month

Distance

day of week

month

Logistics Regression Analysis

STEP 5: SHARE

STEP 6: Act