The Cyclistic bike-share analysis is one of the three case studies that the Google Data Analytics Professional Certificate program offers its students, providing an existing data set and guiding questions. In the case study I am going to present, I take the role of a junior data analyst working on the marketing analyst team at Cyclistic.
The director of marketing believes that the future success of the company depends on maximizing the number of annual memberships. My team wants to understand how casual riders and annual members use Cyclistic bikes; from these insights, we will design a new marketing strategy to convert casual riders into annual members.
I have been assigned to answer the question of how annual members and casual riders use Cyclistic bikes differently.
In this analysis, I will demonstrate the differences in how these two types of users ride Cyclistic bikes.
The goal is to design marketing strategies aimed at converting casual riders into annual members, using this information to draw conclusions and make recommendations about that strategy.
I will be following the steps of the data analysis process:
First, examine how casual riders and annual members differ in the way they use Cyclistic bikes, and use that information to create a strategy to convert casual riders into annual members in order to increase profits.
Once the analysis is complete, provide quality recommendations on how to reach casual riders and how their usage differs from that of the annual members who use Cyclistic.
Data source:
The marketing analyst team will use data provided by the company. The data set was built from the activity of Cyclistic bike users and contains every ride made in the first quarter of 2020. Because of the data-privacy policy, it holds no personal information about the users, so individual customers cannot be identified; only the ride history is available. For each ride, the data set records the start and end station, the start and end times, the start and end latitude and longitude, and the type of member who made the ride.
The credibility of this data set is assessed with the ROCC method (Reliable, Original, Comprehensive, Current, and Cited).
Sorting and filtering the data:
In this analysis I will focus on the time of every trip in order to detect any trends in how casual riders and annual members use Cyclistic bikes.
I will be using the “divvy-tripdata” data set to get the information I need, and I will use RStudio to complete this analysis.
Install and load the packages:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(readr)
library(tidyr)
library(dplyr)
library(here)
## here() starts at C:/Users/torre/OneDrive/Escritorio/Case_Study/R_markdown
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
As previously mentioned, we are going to use R in this project. The reason is that this is a large data set with more than 400,000 rows, so R lets the team process the information faster, and everyone can review the code, which gives our analysis more credibility.
First we are going to import the data set and verify that it was imported properly, without any errors.
library(readr)
cyclistic_data <- read_csv("C:/Users/torre/OneDrive/Escritorio/Case_Study/Divvy_Trips_2020_Q1/Divvy_Trips_2020_Q1.csv")
## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(cyclistic_data)
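View() opens the data in a spreadsheet-style viewer, which only works interactively. For a reproducible check of the import, a quick sketch using functions from the packages loaded above could be:
glimpse(cyclistic_data)            # column names, types, and a preview of each column
nrow(cyclistic_data)               # should match the 426,887 rows reported by read_csv()
sum(is.na(cyclistic_data$ride_id)) # ride identifiers should never be missing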
cyclistic_rides2 <- cyclistic_data %>%
select(ride_id, rideable_type, started_at,
ended_at,start_station_name,
start_station_id,end_station_name,
end_station_id, member_casual)%>%
separate(started_at, into =c("start_date","start_time"),sep= " ") %>%
separate(ended_at, into =c("ended_date","ended_time"),sep= " ")
##The time was only showing "%H:%M" and I also need the "%S", so I exported the data to Excel to make the change, because I did not find a friendly way to add the "%S" in R.
write.csv(cyclistic_rides2,"C:/Users/torre/OneDrive/Escritorio/Certificados/ResourcesData/Google DataAnalytics Resources/Course8/cyclistic_rides.csv", row.names=FALSE)
##I added the "%S" to the time columns, but when I imported the data back, those columns were read as doubles, so I set the format to time while importing the data with read_csv().
library(readr)
cyclistic_rides_v2 <- read_csv("C:/Users/torre/OneDrive/Escritorio/Case_Study/cyclistic_rides.csv",
col_types = cols(start_time = col_time(format = "%H:%M:%S"),
ended_time = col_time(format = "%H:%M:%S")))
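As an aside, a possible alternative that avoids the round trip through Excel (a sketch only, not the workflow used above) would be to parse the original started_at and ended_at strings as full date-times with lubridate, which is loaded as part of the tidyverse; parse_date_time() accepts timestamps with or without seconds:
cyclistic_dt <- cyclistic_data %>%   # hypothetical name; not used in the rest of the analysis
  mutate(started_at = parse_date_time(started_at, orders = c("ymd HMS", "ymd HM")),
         ended_at   = parse_date_time(ended_at,   orders = c("ymd HMS", "ymd HM")))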
str(cyclistic_rides_v2)
## spc_tbl_ [426,887 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:426887] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
## $ rideable_type : chr [1:426887] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ start_date : Date[1:426887], format: "2020-01-21" "2020-01-30" ...
## $ start_time : 'hms' num [1:426887] 20:06:00 14:22:00 19:29:00 16:17:00 ...
## ..- attr(*, "units")= chr "secs"
## $ ended_date : Date[1:426887], format: "2020-01-21" "2020-01-30" ...
## $ ended_time : 'hms' num [1:426887] 20:14:00 14:26:00 19:32:00 16:25:00 ...
## ..- attr(*, "units")= chr "secs"
## $ start_station_name: chr [1:426887] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
## $ start_station_id : num [1:426887] 239 234 296 51 66 212 96 96 212 38 ...
## $ end_station_name : chr [1:426887] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
## $ end_station_id : num [1:426887] 326 318 117 24 212 96 212 212 96 100 ...
## $ member_casual : chr [1:426887] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. start_date = col_date(format = ""),
## .. start_time = col_time(format = "%H:%M:%S"),
## .. ended_date = col_date(format = ""),
## .. ended_time = col_time(format = "%H:%M:%S"),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Next we need to check for duplicates. We know that the same user can make multiple trips and that the same stations can be used daily by the same users, so the user and station columns cannot distinguish one ride from another.
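One quick sanity check (a sketch, assuming ride_id is meant to uniquely identify each ride) is to count duplicated ride_id values:
sum(duplicated(cyclistic_rides_v2$ride_id))  # 0 means every ride_id appears exactly once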
Now we compute the ride length with the mutate() function; the new column tells us how long each trip took in total, in minutes.
cyclistic_rides_v3 <- mutate(cyclistic_rides_v2, ride_length = difftime(ended_time, start_time, units = "mins"))
After inspecting the table, I noticed some negative numbers in the new column. To find out how many negative values the table contains, we use the count() function.
count(cyclistic_rides_v3[cyclistic_rides_v3$ride_length <0,])
## # A tibble: 1 × 1
## n
## <int>
## 1 1027
We must eliminate those negative values so that our calculations come out properly.
cyclistic_rides_v4 <- cyclistic_rides_v3[!cyclistic_rides_v3$ride_length < 0,]
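A plausible cause of these negative values (an assumption, not verified here) is rides that cross midnight: ride_length compares only the time-of-day columns, so a ride starting at 23:50 and ending at 00:10 the next day comes out negative. A more robust sketch would combine date and time before subtracting:
cyclistic_check <- cyclistic_rides_v3 %>%  # hypothetical name; illustrates the idea only
  mutate(start_dt = as.POSIXct(paste(start_date, start_time)),
         end_dt   = as.POSIXct(paste(ended_date, ended_time)),
         ride_length_full = difftime(end_dt, start_dt, units = "mins"))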
The next step is to add a new column called “weekday”, which records the day of the week of each ride as a number, with 1 = Sunday representing the start of the week.
cyclistic_rides_v5 <- cyclistic_rides_v4 %>%
mutate(weekday = wday(start_date, week_start = 7))
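For readability in plots and tables, wday() can also return labeled factors instead of numbers (a sketch; the rest of this analysis keeps the numeric coding):
cyclistic_rides_v5 %>%
  mutate(weekday_label = wday(start_date, label = TRUE, week_start = 7)) %>%
  count(weekday_label)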
Now that our data is properly cleaned and formatted, we can proceed to some calculations in order to identify any trends or relationships in the data.
We need to see the average ride length for casual riders and annual members, for each day of the week.
average_rides <- cyclistic_rides_v5%>%
group_by(member_casual, weekday)%>%
summarise(mean(ride_length))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
We also get the number of rides that casual riders and annual members made.
numbers_of_rides_members <- cyclistic_rides_v5%>%
count(member_casual)
##And we rename the column n to number_of_rides.
number_of_rides<- numbers_of_rides_members%>%
rename(number_of_rides=n)
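The rename step can be skipped by naming the count column directly with count()'s name argument (an equivalent one-liner):
number_of_rides <- cyclistic_rides_v5 %>%
  count(member_casual, name = "number_of_rides")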
Next we analyze the average time that casual riders and annual members ride Cyclistic bikes, along with the total minutes ridden and the maximum and minimum ride lengths.
average_rides_members <- cyclistic_rides_v5%>%
group_by(member_casual)%>%
summarise(average_rides_minutes = mean(ride_length),
total_rides_minutes = sum(ride_length),
max_ride_length=max(ride_length),
min_ride_length=min(ride_length))
Now we compute the same ride averages, but broken down by weekday.
weekdays_rides <- cyclistic_rides_v5%>%
group_by(weekday, member_casual)%>%
summarise(average_rides_minutes = mean(ride_length),
total_rides_minutes = sum(ride_length),
max_ride_length=max(ride_length))
## `summarise()` has grouped output by 'weekday'. You can override using the
## `.groups` argument.
We get the top 10 start stations that riders use the most.
top_stations <- cyclistic_rides_v5%>%
group_by(start_station_name)%>%
count(start_station_name, sort = TRUE)
#In order to get just the top 10, we order the stations by usage and keep the first ten rows.
top_10_stations <- head(top_stations[order(-top_stations$n), ], 10)%>%
rename(start_station_usage=n)
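Since top_stations is already sorted, head() works fine; an equivalent sketch with slice_max() avoids the manual ordering (note that count() kept the grouping, so it must be dropped first):
top_10_stations_alt <- top_stations %>%  # hypothetical name for this alternative
  ungroup() %>%                          # drop the grouping left over from count()
  slice_max(n, n = 10) %>%
  rename(start_station_usage = n)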