Introduction

Welcome to this Capstone case study from Google Data Analytics Certificate by Coursera. This is an awesome opportunity to apply the knowledge and skills gained in the journey, with this real life scenario will perform as a Junior data analyst role and it is going to be part of my online portfolio.

Scenario and Role

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Characters and teams

Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.

Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.

## Ask

#### Understanding the task

Three questions will guide the future marketing program: 1. How do annual members and casual riders use Cyclistic bikes differently? 2. Why would casual riders buy Cyclistic annual memberships? 3. How can Cyclistic use digital media to influence casual riders to become members?

Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

The Marketing Team believes that the profits might be maximized through increased conversion of one time customers (casual) to regular customers (members).

Prepare

To conduct this analysis, we are obtaining our data from the Cyclistic’s trip data link historical trip data will use a 12 month period (from 08-2022 to 07-2023), the data has been made available by Motivate International Inc. Under this licence Licence Link

Proceed to download all 12 zip files, extract CSV and save them in a separate new folder to keep the original dataset as a backup, in case any data is missing through the exercise.

12 Months data file!

The plan after examining our datasets, is to observe that all files contain the same columns, and that the data type is consistent and appropriate. I decided to start up with Excel to clean the data on each file and make it more manageable for my equipment and then will use Rstudio to combine all twelve files and further analysis

Process

Cleaning data

Examining each CSV file is ROCCC (reliable, original, comprehensive, current, cited) The company has its own license over the dataset. Besides that, the dataset does not have any personal information about the riders. Checked for null data, misspelled words, mistype, extra spaces and characters, messy strings, and truncated data, also looked for duplicate data, using the remove duplicates function.

Looking into file!

The task assigned by Lily Moreno gave me room to be able to remove a couple of unnecessary columns this will make it easier to work with the data while decreasing the volume will speed up the equipment processing.

Removing unnecessary columns!

After removing the unnecessary columns, I am going to add a fifth column named Ride_length ” Calculate the length of each ride by subtracting the column “started_at” from the column “ended_at” (for example, =D2-C2) and format as HH:MM:SS using Format > Cells > Time > 37:30:55.When done with calculating the whole column we continue to copy and paste it without format.

We also add a column for weekdays, using the WEEKDAY function “day_of_week,” and calculate the day of the week that each ride started using the “WEEKDAY” command (for example, =WEEKDAY(C2,1)) in each file. Format as General or as a number with no decimals, noting that 1 = Sunday and 7 = Saturdayand copied and pasted only the values without any formula or function to be helpful for further analysis.

Ride length and Day of week column added! ## Analyze

Before moving my analysis to Rstudio, decided to explore a little my data individually, using a pivot table in a different sheet in every file, exploring the data in a table allows us to find interesting insights into our data.

Here is a calculation of the average Ride length by user per month

Average ride length by user!

The following example shows the average ride length by user per day, starting 1 as Sunday finishing 7 as Saturday.

Average ride length by user per day!

Also, we take a look at the rider’s quantity per day in every file analyzed.

Riders quantity per week!

Now, that we have all the data needed clean, I decided to migrate our 12 files in order to combine them and create a single dataset to have all your data in one file, As previously explained working in a spreadsheet will not let us handle the amount of data combined and that is why Rstudio is the perfect tool to work with our data from now on.

I began by loading data month by month. After I have imported all 12 CSV files and put together the datasets, I can move on to the next steps. It is essential to keep these 12 CSV files separate because we might have to copy their file paths to load them into Rstudio

the following code will let you load your first CSV file and create your dataset:

august_2022 <- read.csv("C:/Users/hisny/Desktop/Capstone-tripdata/divvy_csv/202208_CSV.csv")

** NOTE: If you are working with windows and copy your path make sure you change the simbols from to / you will thank me later. **

Once we have all 12 Months loaded,we proceed to verify each colunm name because they need to match correctly before and after this validation using the colnames(fuction) it is time to create a single dataset with all our data combine in just one dataset.

To merge all dataset we will use the bind_rows function:

months_2023 <- bind_rows(january_2023,february_2023,march_2023,may_2023,june_2023,july_2023,april_2023_v1)

final_data_cleaned <- read.csv("C:/Users/hisny/Desktop/capstone_1/bike_cleaned_output_v2.csv")

Now, our dataset is complete with the 12 months together, is important to make sure we make a copy in case our data is missing or messy.

final_data_cleaned_v1 <- final_data_cleaned

Also, we make sure a CVS file is created and save in your directory to keep your progress save.

write.csv(final_data_cleaned,"bike_cleaned_output_v1.csv")

Now, it is time to start working with our data in Rstudio but first is important to make sure we have installed all packages we need to perform our data manipulation.

install.packages(“tidyverse”) library(tidyverse) library(lubridate) library(dplyr) library(skimr) install.packages(“skimr”)

Here used the remane function to change the column names.

In order to create compelling visuals decided to insert columns to be able separate the date format for my graphs when sharing and presenting stage.

final_data_cleaned_v1$date <- as.Date(final_data_cleaned_v1$start_time, format = '%d/%m/%y')
final_data_cleaned_v1$month <-format(as.Date(final_data_cleaned_v1$date), "%m")
final_data_cleaned_v1$day <- format(as.Date(final_data_cleaned_v1$date), "%d")
final_data_cleaned_v1$year <- format(as.Date(final_data_cleaned_v1$date), "%Y%")
final_data_cleaned_v1$day_of_week <- ordered (final_data_cleaned_v1$day_of_week, levels=c("Sunday" , "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

Here we can make sure our final document have clean and relevant data in the proper format and this skim_without_charts function helps to verify nothing is missing, now our data is ready for further analysis

Prior continuing the analysis I noticed the ride_length column strings are character, converting from character to factor.

<!–>{r echo=TRUE} final_data_cleaned_v1 <- mutate_at (final_data_cleaned_v1, vars(ride_length),as.factor) <!–>

Now we have our column in factor it is time to convert our ride_length column from factor to number.

final_data_cleaned_v1$ride_length <- as.numeric(as.factor(final_data_cleaned_v1$ride_length))

Always is a good practice to validate our data changes, how can you verify your data? with the following code.

is.numeric(final_data_cleaned_v1$ride_length)

## [1] TRUE

###CONDUCTING DESCRIPTIVE ANALYSIS###

mean(final_data_cleaned_v1$ride_length)

## [1] 894.9244

median(final_data_cleaned_v1$ride_length)

## [1] 578

max(final_data_cleaned_v1$ride_length)

## [1] 22124

min(final_data_cleaned_v1$ride_length)

## [1] 1

summary(final_data_cleaned_v1$ride_length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   329.0   578.0   894.9  1027.0 22124.0

Comparison by user type:

aggregate(final_data_cleaned_v1$ride_length ~ final_data_cleaned_v1$user_type, FUN=mean)

##   final_data_cleaned_v1$user_type final_data_cleaned_v1$ride_length
## 1                          casual                          1191.176
## 2                          member                           714.079

aggregate(final_data_cleaned_v1$ride_length ~ final_data_cleaned_v1$user_type, FUN=median)

##   final_data_cleaned_v1$user_type final_data_cleaned_v1$ride_length
## 1                          casual                               713
## 2                          member                               513

aggregate(final_data_cleaned_v1$ride_length ~ final_data_cleaned_v1$user_type, FUN=max)

##   final_data_cleaned_v1$user_type final_data_cleaned_v1$ride_length
## 1                          casual                             22124
## 2                          member                             22121

aggregate(final_data_cleaned_v1$ride_length ~ final_data_cleaned_v1$user_type, FUN=min)

##   final_data_cleaned_v1$user_type final_data_cleaned_v1$ride_length
## 1                          casual                                 1
## 2                          member                                 1

Looking for average ride per day by user type

aggregate(cbind(Avg_time_per_day=final_data_cleaned$ride_length) ~ cbind(User_type=final_data_cleaned$user_type) + cbind(Day_of_week=final_data_cleaned$day_of_week), FUN=mean)

##    User_type Day_of_week Avg_time_per_day
## 1     casual      Friday        1256.7663
## 2     member      Friday         713.6921
## 3     casual      Monday        1049.9357
## 4     member      Monday         691.7732
## 5     casual    Saturday        1120.3693
## 6     member    Saturday         683.9464
## 7     casual      Sunday        1036.9942
## 8     member      Sunday         679.4760
## 9     casual    Thursday        1352.5184
## 10    member    Thursday         778.6435
## 11    casual     Tuesday        1117.5004
## 12    member     Tuesday         703.8505
## 13    casual   Wednesday        1287.2222
## 14    member   Wednesday         764.6367

To have a better understanding of our data, I decided to share my findings using Tableau.

Total of users per day!

It appears that the bike usage data showcases the maximum and minimum number of riders, and may provide insights into the preferences that riders have in terms of bike type.

[Total Riders daily comparison!]

(C:/Users/hisny/Desktop/Capstone_Prt_Scr/tableau_bike/Total_user_2.png)

Based on the data analyzed, we can observe a trend for our target audience where the maximum number of visits by our casual riders occur on days 3, 4, and 5.

Global bike type per user!

Total amount of riders.

The following chart shows us a comparison between user per day that helps shareholders to make any decision and when to do it.

Day of week/ User type!

Day of week / user type #2!

Capstone project Cyclistic bike-share

Hisny Ramirez

2023-10-12