The Cyclistic bike-share analysis is one of the three case studies that the Google Data Analytics Professional Certificate program offers its students, providing an existing data set and guiding questions. In the case study I am going to present, I take the role of a junior data analyst working on the marketing analyst team at Cyclistic.
The director of marketing believes that the future success of the company depends on maximizing the number of annual memberships. My team wants to understand how casual riders and annual members use Cyclistic bikes; from these insights, we will design a new marketing strategy to convert casual riders into annual members.
I have been assigned to answer the question of how annual members and casual riders use Cyclistic bikes differently.
In this analysis, I will demonstrate the differences in how these two types of users ride Cyclistic bikes.
The goal is to design marketing strategies aimed at converting casual riders into annual members, using this information to draw conclusions and make recommendations about that strategy.
I will be following the steps of the data analysis process:
First, examine how casual riders and annual members differ in the way they use Cyclistic bikes, and use that information to create a strategy to convert casual riders into annual members in order to increase profits.
Once the analysis is complete, provide quality recommendations on how to reach casual riders and how their usage differs from that of the annual members who use Cyclistic.
Data source:
The marketing analyst team will use data provided by the company. The data set was built from the activity of Cyclistic bike users and contains every ride made in the first quarter of 2020. Because of the data-privacy policy, it holds no personal information about the users, so individual customers cannot be identified; only the ride history is available. For each ride, the data set records the start and end station, the start and end times, the start and end latitude and longitude, and the type of member who made the ride.
The credibility of this data set is assessed with the ROCC method (Reliable, Original, Comprehensive, Current, and Cited).
Sorting and filtering the data:
In this analysis I will focus on the time of every trip in order to detect any trends in how casual riders and annual members use Cyclistic bikes.
I will be using the “divvy-tripdata” data set to get the information I need, and I will use RStudio to complete this analysis.
Install and load the packages:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(readr)
library(tidyr)
library(dplyr)
library(here)
## here() starts at C:/Users/torre/OneDrive/Escritorio/Case_Study/R_markdown
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
As previously mentioned, we are going to use R in this project. The reason is that this is a large data set with more than 400,000 rows, so R lets the team process the information faster, and everyone can review the code, which gives our analysis more credibility.
First we are going to import the data set and verify that it was imported properly, without any errors.
library(readr)
cyclistic_data <- read_csv("C:/Users/torre/OneDrive/Escritorio/Case_Study/Divvy_Trips_2020_Q1/Divvy_Trips_2020_Q1.csv")
## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(cyclistic_data)
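View() opens the data in a spreadsheet-style viewer, which only works interactively. For a reproducible check of the import, a quick sketch using functions from the packages loaded above could be:
glimpse(cyclistic_data)            # column names, types, and a preview of each column
nrow(cyclistic_data)               # should match the 426,887 rows reported by read_csv()
sum(is.na(cyclistic_data$ride_id)) # ride identifiers should never be missing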
cyclistic_rides2 <- cyclistic_data %>%
select(ride_id, rideable_type, started_at,
ended_at,start_station_name,
start_station_id,end_station_name,
end_station_id, member_casual)%>%
separate(started_at, into =c("start_date","start_time"),sep= " ") %>%
separate(ended_at, into =c("ended_date","ended_time"),sep= " ")
##The time was only showing "%H:%M" and I also need the "%S", so I exported the data to Excel to make the change, because I did not find a friendly way to add the "%S" in R.
write.csv(cyclistic_rides2,"C:/Users/torre/OneDrive/Escritorio/Certificados/ResourcesData/Google DataAnalytics Resources/Course8/cyclistic_rides.csv", row.names=FALSE)
##I added the "%S" to the time columns, but when I imported the data back, those columns were read as doubles, so I set the format to time while importing the data with read_csv().
library(readr)
cyclistic_rides_v2 <- read_csv("C:/Users/torre/OneDrive/Escritorio/Case_Study/cyclistic_rides.csv",
col_types = cols(start_time = col_time(format = "%H:%M:%S"),
ended_time = col_time(format = "%H:%M:%S")))
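As an aside, a possible alternative that avoids the round trip through Excel (a sketch only, not the workflow used above) would be to parse the original started_at and ended_at strings as full date-times with lubridate, which is loaded as part of the tidyverse; parse_date_time() accepts timestamps with or without seconds:
cyclistic_dt <- cyclistic_data %>%   # hypothetical name; not used in the rest of the analysis
  mutate(started_at = parse_date_time(started_at, orders = c("ymd HMS", "ymd HM")),
         ended_at   = parse_date_time(ended_at,   orders = c("ymd HMS", "ymd HM")))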
str(cyclistic_rides_v2)
## spc_tbl_ [426,887 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:426887] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
## $ rideable_type : chr [1:426887] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ start_date : Date[1:426887], format: "2020-01-21" "2020-01-30" ...
## $ start_time : 'hms' num [1:426887] 20:06:00 14:22:00 19:29:00 16:17:00 ...
## ..- attr(*, "units")= chr "secs"
## $ ended_date : Date[1:426887], format: "2020-01-21" "2020-01-30" ...
## $ ended_time : 'hms' num [1:426887] 20:14:00 14:26:00 19:32:00 16:25:00 ...
## ..- attr(*, "units")= chr "secs"
## $ start_station_name: chr [1:426887] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
## $ start_station_id : num [1:426887] 239 234 296 51 66 212 96 96 212 38 ...
## $ end_station_name : chr [1:426887] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
## $ end_station_id : num [1:426887] 326 318 117 24 212 96 212 212 96 100 ...
## $ member_casual : chr [1:426887] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. start_date = col_date(format = ""),
## .. start_time = col_time(format = "%H:%M:%S"),
## .. ended_date = col_date(format = ""),
## .. ended_time = col_time(format = "%H:%M:%S"),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Next we need to check for duplicates. We know that the same user can make multiple trips and that the same stations can be used daily by the same users, so the user and station columns cannot distinguish one ride from another.
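One quick sanity check (a sketch, assuming ride_id is meant to uniquely identify each ride) is to count duplicated ride_id values:
sum(duplicated(cyclistic_rides_v2$ride_id))  # 0 means every ride_id appears exactly once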
Now we compute the ride length with the mutate() function; the new column tells us how long each trip took in total, in minutes.
cyclistic_rides_v3 <- mutate(cyclistic_rides_v2, ride_length = difftime(ended_time, start_time, units = "mins"))
After inspecting the table, I noticed some negative numbers in the new column. To find out how many negative values the table contains, we use the count() function.
count(cyclistic_rides_v3[cyclistic_rides_v3$ride_length <0,])
## # A tibble: 1 × 1
## n
## <int>
## 1 1027
We must eliminate those negative values so that our calculations come out properly.
cyclistic_rides_v4 <- cyclistic_rides_v3[!cyclistic_rides_v3$ride_length < 0,]
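A plausible cause of these negative values (an assumption, not verified here) is rides that cross midnight: ride_length compares only the time-of-day columns, so a ride starting at 23:50 and ending at 00:10 the next day comes out negative. A more robust sketch would combine date and time before subtracting:
cyclistic_check <- cyclistic_rides_v3 %>%  # hypothetical name; illustrates the idea only
  mutate(start_dt = as.POSIXct(paste(start_date, start_time)),
         end_dt   = as.POSIXct(paste(ended_date, ended_time)),
         ride_length_full = difftime(end_dt, start_dt, units = "mins"))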
The next step is to add a new column called “weekday”, which records the day of the week of each ride as a number, with 1 = Sunday representing the start of the week.
cyclistic_rides_v5 <- cyclistic_rides_v4 %>%
mutate(weekday = wday(start_date, week_start = 7))
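For readability in plots and tables, wday() can also return labeled factors instead of numbers (a sketch; the rest of this analysis keeps the numeric coding):
cyclistic_rides_v5 %>%
  mutate(weekday_label = wday(start_date, label = TRUE, week_start = 7)) %>%
  count(weekday_label)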
Now that our data is properly cleaned and formatted, we can proceed to some calculations in order to identify any trends or relationships in the data.
We need to see the average ride length for casual riders and annual members, for each day of the week.
average_rides <- cyclistic_rides_v5%>%
group_by(member_casual, weekday)%>%
summarise(mean(ride_length))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
We also get the number of rides that casual riders and annual members made.
numbers_of_rides_members <- cyclistic_rides_v5%>%
count(member_casual)
##And we rename the column n to number_of_rides.
number_of_rides<- numbers_of_rides_members%>%
rename(number_of_rides=n)
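The rename step can be skipped by naming the count column directly with count()'s name argument (an equivalent one-liner):
number_of_rides <- cyclistic_rides_v5 %>%
  count(member_casual, name = "number_of_rides")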
Next we analyze the average time that casual riders and annual members ride Cyclistic bikes, along with the total minutes ridden and the maximum and minimum ride lengths.
average_rides_members <- cyclistic_rides_v5%>%
group_by(member_casual)%>%
summarise(average_rides_minutes = mean(ride_length),
total_rides_minutes = sum(ride_length),
max_ride_length=max(ride_length),
min_ride_length=min(ride_length))
Now we compute the same ride averages, but broken down by weekday.
weekdays_rides <- cyclistic_rides_v5%>%
group_by(weekday, member_casual)%>%
summarise(average_rides_minutes = mean(ride_length),
total_rides_minutes = sum(ride_length),
max_ride_length=max(ride_length))
## `summarise()` has grouped output by 'weekday'. You can override using the
## `.groups` argument.
We get the top 10 start stations that riders use the most.
top_stations <- cyclistic_rides_v5%>%
group_by(start_station_name)%>%
count(start_station_name, sort = TRUE)
#In order to get just the top 10, we order the stations by usage and keep the first ten rows.
top_10_stations <- head(top_stations[order(-top_stations$n), ], 10)%>%
rename(start_station_usage=n)
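Since top_stations is already sorted, head() works fine; an equivalent sketch with slice_max() avoids the manual ordering (note that count() kept the grouping, so it must be dropped first):
top_10_stations_alt <- top_stations %>%  # hypothetical name for this alternative
  ungroup() %>%                          # drop the grouping left over from count()
  slice_max(n, n = 10) %>%
  rename(start_station_usage = n)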