Welcome to this Capstone case study from Google Data Analytics Certificate by Coursera. This is an awesome opportunity to apply the knowledge and skills gained in the journey, with this real life scenario will perform as a Junior data analyst role and it is going to be part of my online portfolio.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
## Ask
#### Understanding the task
Three questions will guide the future marketing program: 1. How do annual members and casual riders use Cyclistic bikes differently? 2. Why would casual riders buy Cyclistic annual memberships? 3. How can Cyclistic use digital media to influence casual riders to become members?
Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
The Marketing Team believes that the profits might be maximized through increased conversion of one time customers (casual) to regular customers (members).
To conduct this analysis, we are obtaining our data from the Cyclistic’s trip data link historical trip data will use a 12 month period (from 08-2022 to 07-2023), the data has been made available by Motivate International Inc. Under this licence Licence Link
Proceed to download all 12 zip files, extract CSV and save them in a separate new folder to keep the original dataset as a backup, in case any data is missing through the exercise.
The plan after examining our datasets, is to observe that all files contain the same columns, and that the data type is consistent and appropriate. I decided to start up with Excel to clean the data on each file and make it more manageable for my equipment and then will use Rstudio to combine all twelve files and further analysis
Examining each CSV file is ROCCC (reliable, original, comprehensive, current, cited) The company has its own license over the dataset. Besides that, the dataset does not have any personal information about the riders. Checked for null data, misspelled words, mistype, extra spaces and characters, messy strings, and truncated data, also looked for duplicate data, using the remove duplicates function.
The task assigned by Lily Moreno gave me room to be able to remove a couple of unnecessary columns this will make it easier to work with the data while decreasing the volume will speed up the equipment processing.
After removing the unnecessary columns, I am going to add a fifth column named Ride_length ” Calculate the length of each ride by subtracting the column “started_at” from the column “ended_at” (for example, =D2-C2) and format as HH:MM:SS using Format > Cells > Time > 37:30:55.When done with calculating the whole column we continue to copy and paste it without format.
We also add a column for weekdays, using the WEEKDAY function “day_of_week,” and calculate the day of the week that each ride started using the “WEEKDAY” command (for example, =WEEKDAY(C2,1)) in each file. Format as General or as a number with no decimals, noting that 1 = Sunday and 7 = Saturdayand copied and pasted only the values without any formula or function to be helpful for further analysis.
## Analyze
Before moving my analysis to Rstudio, decided to explore a little my data individually, using a pivot table in a different sheet in every file, exploring the data in a table allows us to find interesting insights into our data.
Here is a calculation of the average Ride length by user per month
The following example shows the average ride length by user per day, starting 1 as Sunday finishing 7 as Saturday.
Also, we take a look at the rider’s quantity per day in every file analyzed.
Now, that we have all the data needed clean, I decided to migrate our 12 files in order to combine them and create a single dataset to have all your data in one file, As previously explained working in a spreadsheet will not let us handle the amount of data combined and that is why Rstudio is the perfect tool to work with our data from now on.
I began by loading data month by month. After I have imported all 12 CSV files and put together the datasets, I can move on to the next steps. It is essential to keep these 12 CSV files separate because we might have to copy their file paths to load them into Rstudio
the following code will let you load your first CSV file and create your dataset:
august_2022 <- read.csv("C:/Users/hisny/Desktop/Capstone-tripdata/divvy_csv/202208_CSV.csv")
** NOTE: If you are working with windows and copy your path make sure you change the simbols from to / you will thank me later. **
Once we have all 12 Months loaded,we proceed to verify each colunm name because they need to match correctly before and after this validation using the colnames(fuction) it is time to create a single dataset with all our data combine in just one dataset.
months_2023 <- bind_rows(january_2023,february_2023,march_2023,may_2023,june_2023,july_2023,april_2023_v1)
final_data_cleaned <- read.csv("C:/Users/hisny/Desktop/capstone_1/bike_cleaned_output_v2.csv")
Now, our dataset is complete with the 12 months together, is important to make sure we make a copy in case our data is missing or messy.
final_data_cleaned_v1 <- final_data_cleaned
Also, we make sure a CVS file is created and save in your directory to keep your progress save.
write.csv(final_data_cleaned,"bike_cleaned_output_v1.csv")
Now, it is time to start working with our data in Rstudio but first is important to make sure we have installed all packages we need to perform our data manipulation.
install.packages(“tidyverse”) library(tidyverse) library(lubridate) library(dplyr) library(skimr) install.packages(“skimr”)
Here used the remane function to change the column names.
In order to create compelling visuals decided to insert columns to be able separate the date format for my graphs when sharing and presenting stage.
final_data_cleaned_v1$date <- as.Date(final_data_cleaned_v1$start_time, format = '%d/%m/%y')
final_data_cleaned_v1$month <-format(as.Date(final_data_cleaned_v1$date), "%m")
final_data_cleaned_v1$day <- format(as.Date(final_data_cleaned_v1$date), "%d")
final_data_cleaned_v1$year <- format(as.Date(final_data_cleaned_v1$date), "%Y%")
final_data_cleaned_v1$day_of_week <- ordered (final_data_cleaned_v1$day_of_week, levels=c("Sunday" , "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
Here we can make sure our final document have clean and relevant data in the proper format and this skim_without_charts function helps to verify nothing is missing, now our data is ready for further analysis
Prior continuing the analysis I noticed the ride_length column strings are character, converting from character to factor.
<!–>{r echo=TRUE} final_data_cleaned_v1 <- mutate_at (final_data_cleaned_v1, vars(ride_length),as.factor)
<!–>
Now we have our column in factor it is time to convert our ride_length column from factor to number.
final_data_cleaned_v1$ride_length <- as.numeric(as.factor(final_data_cleaned_v1$ride_length))
Always is a good practice to validate our data changes, how can you verify your data? with the following code.
is.numeric(final_data_cleaned_v1$ride_length)
## [1] TRUE
###CONDUCTING DESCRIPTIVE ANALYSIS###
mean(final_data_cleaned_v1$ride_length)
## [1] 894.9244
median(final_data_cleaned_v1$ride_length)
## [1] 578
max(final_data_cleaned_v1$ride_length)
## [1] 22124
min(final_data_cleaned_v1$ride_length)
## [1] 1
summary(final_data_cleaned_v1$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 329.0 578.0 894.9 1027.0 22124.0
Comparison by user type:
aggregate(final_data_cleaned_v1$ride_length ~ final_data_cleaned_v1$user_type, FUN=mean)
## final_data_cleaned_v1$user_type final_data_cleaned_v1$ride_length
## 1 casual 1191.176
## 2 member 714.079
aggregate(final_data_cleaned_v1$ride_length ~ final_data_cleaned_v1$user_type, FUN=median)
## final_data_cleaned_v1$user_type final_data_cleaned_v1$ride_length
## 1 casual 713
## 2 member 513
aggregate(final_data_cleaned_v1$ride_length ~ final_data_cleaned_v1$user_type, FUN=max)
## final_data_cleaned_v1$user_type final_data_cleaned_v1$ride_length
## 1 casual 22124
## 2 member 22121
aggregate(final_data_cleaned_v1$ride_length ~ final_data_cleaned_v1$user_type, FUN=min)
## final_data_cleaned_v1$user_type final_data_cleaned_v1$ride_length
## 1 casual 1
## 2 member 1
Looking for average ride per day by user type
aggregate(cbind(Avg_time_per_day=final_data_cleaned$ride_length) ~ cbind(User_type=final_data_cleaned$user_type) + cbind(Day_of_week=final_data_cleaned$day_of_week), FUN=mean)
## User_type Day_of_week Avg_time_per_day
## 1 casual Friday 1256.7663
## 2 member Friday 713.6921
## 3 casual Monday 1049.9357
## 4 member Monday 691.7732
## 5 casual Saturday 1120.3693
## 6 member Saturday 683.9464
## 7 casual Sunday 1036.9942
## 8 member Sunday 679.4760
## 9 casual Thursday 1352.5184
## 10 member Thursday 778.6435
## 11 casual Tuesday 1117.5004
## 12 member Tuesday 703.8505
## 13 casual Wednesday 1287.2222
## 14 member Wednesday 764.6367
To have a better understanding of our data, I decided to share my findings using Tableau.
It appears that the bike usage data showcases the maximum and minimum number of riders, and may provide insights into the preferences that riders have in terms of bike type.
[Total Riders daily comparison!]
(C:/Users/hisny/Desktop/Capstone_Prt_Scr/tableau_bike/Total_user_2.png)
Based on the data analyzed, we can observe a trend for our target audience where the maximum number of visits by our casual riders occur on days 3, 4, and 5.
Total amount of riders.
The following chart shows us a comparison between user per day that helps shareholders to make any decision and when to do it.
At this point of the analysis, we have a good knowledge of our data and understand different rider patterns.
Based on the findings from this analysis, I would like to share my takeaways and provide some recommendations for the company’s marketing team to convert casual riders into members:
1. Enroll campaign: Now we have identified when the casual riders visit the most effective campaign to have them enroll in the membership program, the marketing team could work with promotions and offer strategies for casual riders becoming members and a rewards program for those who are already members. 2. Days-targeted benefits: Targeted the top 3 busiest days for casual riders giving them exclusive offers and coupons to increase the chance to subscribe to the membership program.
This is how my capstone project is finished. Thanks for giving me this opportunity to prove my data-driven skills.