Google_Data_Analytics_Professional_Certificate_Capstone

Introduction

You are working for Cyclistic, a bike-sharing company. Bikes can be unlocked from one station and returned to any other station in the system anytime.

Cyclistic has flexible pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Director of marketing believes the company’s future success depends on maximizing the number of annual memberships as finance analysts have concluded that annual memberships are much more profitable than casual riders. She also believes that there is a good chance of converting casual riders to members as they are already aware of Cyclistic program and have chosen it for their mobility needs.

The full document to the case study can be found in the Google Data Analytics Capstone: Complete a Case Study

For this project this steps will be followed to ensure its completion:

It will follow the steps of the data analysis process: Ask, prepare, process, analyze, share, and act. Each step will follow its own roadmap with: * Initial assessment and exploration done using Excel * Data cleaning, exploration, manipulation, validation and bit of visualisation done using R * Data visualization using tableau

1.Ask

identify the business task: Strategy to maximize the number of annual memberships by converting casual riders into annual riders.

Key Stakeholders: Lily Monero & the Executive team

Questions to Analyze * How do annual members and casual riders use Cyclistic bikes differently? * Why would casual riders buy Cyclistic annual memberships? * How can Cyclistic use digital media to influence casual riders to become members?

2.Prepare

I will be using 6 Months of data from April-2020 to September-2020. The data is made available in the form of zip files here. The data is made publicly available by Motivate International Inc.

The company has their own licence over the dataset. Besides that, the dataset doesn’t have any personal information about the riders. The user info is hidden by tokenization.

Key Tasks

Download data and store it properly
Identify how the data is organised
Sort and filter the data using excel and R

3.Process

Guiding Questions

Which tools are you using and why? I’m using R for this project, for two main reasons: Because of the large dataset and to gather experience with the language.
Have you ensured your data’s integrity? Yes, the data is consistent throughout the columns.
What steps are you taking to ensure your data is clean First removing incorrect ride_ids from the data set using filters in excel, then removing duplicates in R after combing all the datasets.

Initial Assessment of the data in excel

By using filter function, abnormalities in the ride_id column is spotted. Standard ride_id contains 16 characters. Take note to remove rows that does not contains 16 character for ride_id.

improper ride ids removed

The first step is to install various necessary packages and install them in R studio

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(dplyr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)

Importing files in the database to be read.

april_tripdata <- read_csv("april_tripdata.csv")

## Rows: 84665 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

may_tripdata <- read_csv("may_tripdata.csv")

## Rows: 199947 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

june_tripdata <- read_csv("june_tripdata.csv")

## Rows: 342525 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

july_tripdata <- read_csv("july_tripdata.csv")

## Rows: 550705 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

august_tripdata <- read_csv("august_tripdata.csv")

## Rows: 608506 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

september_tripdata <- read_csv("september_tripdata.csv")

## Rows: 500380 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Then merging 6 months of data in .csv format into a single file

agr_datatry <- bind_rows(april_tripdata, may_tripdata, june_tripdata,july_tripdata, august_tripdata,september_tripdata)

summary of the dataframe

summary(agr_datatry)

##    ride_id          rideable_type       started_at          ended_at        
##  Length:2286728     Length:2286728     Length:2286728     Length:2286728    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  start_station_name start_station_id end_station_name   end_station_id 
##  Length:2286728     Min.   :  2.0    Length:2286728     Min.   :  2.0  
##  Class :character   1st Qu.: 97.0    Class :character   1st Qu.: 98.0  
##  Mode  :character   Median :196.0    Mode  :character   Median :197.0  
##                     Mean   :224.9                       Mean   :225.9  
##                     3rd Qu.:312.0                       3rd Qu.:313.0  
##                     Max.   :721.0                       Max.   :721.0  
##                     NA's   :152                         NA's   :199    
##    start_lat       start_lng         end_lat         end_lng      
##  Min.   :41.65   Min.   :-87.77   Min.   :41.65   Min.   :-87.77  
##  1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :41.90   Median :-87.64   Median :41.90   Median :-87.64  
##  Mean   :41.91   Mean   :-87.64   Mean   :41.91   Mean   :-87.64  
##  3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :42.06   Max.   :-87.53   Max.   :42.07   Max.   :-87.53  
##                                                                   
##  member_casual     
##  Length:2286728    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Removind Duplicated from the data

cyclistic_no_dups <- agr_datatry[!duplicated(agr_datatry$ride_id), ]
print(paste("Removed", nrow(agr_datatry) - nrow(cyclistic_no_dups), "duplicated rows"))

## [1] "Removed 2 duplicated rows"

Tibble of edited table

tibble(agr_datatry)

## # A tibble: 2,286,728 × 13
##    ride_id   rideable_type started_at ended_at start_station_n… start_station_id
##    <chr>     <chr>         <chr>      <chr>    <chr>                       <dbl>
##  1 A847FADB… docked_bike   26-04-202… 26-04-2… Eckhart Park                   86
##  2 5405B80E… docked_bike   17-04-202… 17-04-2… Drake Ave & Ful…              503
##  3 5DD24A79… docked_bike   01-04-202… 01-04-2… McClurg Ct & Er…              142
##  4 2A59BBDF… docked_bike   07-04-202… 07-04-2… California Ave …              216
##  5 27AD306C… docked_bike   18-04-202… 18-04-2… Rush St & Hubba…              125
##  6 356216E8… docked_bike   30-04-202… 30-04-2… Mies van der Ro…              173
##  7 A2759CB0… docked_bike   02-04-202… 02-04-2… Streeter Dr & G…               35
##  8 FC8BC2E2… docked_bike   07-04-202… 07-04-2… Ogden Ave & Roo…              434
##  9 9EC56486… docked_bike   15-04-202… 15-04-2… LaSalle Dr & Hu…              627
## 10 A8FFF891… docked_bike   04-04-202… 04-04-2… Kedzie Ave & La…              377
## # … with 2,286,718 more rows, and 7 more variables: end_station_name <chr>,
## #   end_station_id <dbl>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>

calculating ride duration as it will be an important metric for analysis

cyclistic_no_dups <- cyclistic_no_dups %>%
  mutate(across(c(started_at, ended_at), dmy_hm), 
  ride_duration = difftime(ended_at,started_at, units = "mins"))

seprating month and year from date

cyclistic_no_dups <- cyclistic_no_dups %>%
  mutate(year_month = paste(strftime(cyclistic_no_dups$started_at, "%Y"),
                            "-",
                            strftime(cyclistic_no_dups$started_at, "%m"),
                            paste("(",strftime(cyclistic_no_dups$started_at, "%b"), ")", sep="")))
unique(cyclistic_no_dups$year_month)

## [1] "2020 - 04 (Apr)" "2020 - 05 (May)" "2020 - 06 (Jun)" "2020 - 07 (Jul)"
## [5] "2020 - 08 (Aug)" "2020 - 09 (Sep)" "2020 - 10 (Oct)"

this is done to check which months have more people traffic

calculation of weekdays

yclistic_no_dups <- cyclistic_no_dups %>%
  mutate(weekday = paste(strftime(cyclistic_no_dups$ended_at, "%u"), "-", strftime(cyclistic_no_dups$ended_at, "%a")))
unique(cyclistic_no_dups$weekday)

## Warning: Unknown or uninitialised column: `weekday`.

## NULL

this is done to identify trends among different class of members across various days across a week.

seprating starting hour

cyclistic_no_dups <- cyclistic_no_dups %>%
  mutate(start_hour = strftime(cyclistic_no_dups$ended_at, "%H"))
unique(cyclistic_no_dups$start_hour)

##  [1] "23" "22" "18" "16" "20" "19" "21" "08" "17" "14" "01" "00" "03" "05" "15"
## [16] "12" "13" "04" "11" "07" "10" "02" "06" "09"

this is done to find frequency during different hours of the day.

4.Analyze & 5.Share

In this project i have combined the two steps together

I will be using both R and Tableau for data visualization and analysis. The reason being my familiarity with the tools and ease to use of the two mentioned tools.

Data Visualizations

First of all we will be analyzing ride share between casual and member customers

Then we will see the location wise distribution of rides based on the basis of starting location. Then we will see the location wise distribution of rides based on the basis of ending location.

Now analyzing distribution on the basis of month and customer type.

cyclistic_no_dups %>%
  ggplot(aes(year_month, fill=member_casual)) +
  geom_bar() +
  labs(x="Month", title="Chart 02 - Distribution by month") +
  coord_flip()

Analysis of the data on the basis of hourly frequency to find out the busy hours

cyclistic_no_dups %>%
  ggplot(aes(start_hour, fill=member_casual)) +
  labs(x="Hour of the day", title="Chart 04 - Distribution by hour of the day") +
  geom_bar()

Now analyzing the data on the basis of days of the week and no of rides, also comparing this with different types of customer.

Now coming to the share part on the basis of the data we had at our disposal my finding about the dataset are

Members have the biggest proportion of the dataset, ~19% bigger thand casuals.
There’s more data points at the last semester of 2020.
The month with the biggest count of data points was August with ~18% of the dataset.
In all months we have more members’ rides than casual rides.
The difference of proporcion of member x casual is smaller in the last semester of 2020.
Temperature heavily influences the volume of rides in the month.
The biggest volume of data is on the the weekend.
There’s a bigger volume of bikers in the afternoon.
We have more bike rides on the weekends. Maybe because on those days the bikes were utilized for more recreational ways. This even more plausible when knowing that There’s a bigger volume of bikers in the afternoon.

Now for how members differs from casuals:

Members may have the biggest volume of data, besides on saturday. On this weekday, casuals take place as having the most data points.
Weekends have the biggest volume of casuals, starting on friday, a ~20% increase.
We have more members during the morning, mainly between 5am and 11am. And more casuals between 11pm and 4am.
There’s a big increase of data points in the midweek between 6am to 8am for members. Then it fell a bit. Another big increase is from 5pm to 6pm.
During the weekend we have a bigger flow of casuals between 11am to 6pm.
Casuals have more riding time than members.
Riding time for members keeps unchanged during the midweek, increasing during weekends.
Casuals follow a more curve distribution, peaking on sundays and valleying on Wednesday/Thursday.

What we can take from this information is that members have a more fixed use for bikes besides casuals. Their uses is for more routine activities, like:

Go to work. Use it as an exercise.

Concluding:

Members use the bikes for fixed activities, one of those is going to work.
Bikes are used for recreation on the weekends.
Rides are influenced by temperature.

Guiding questions

Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently? Yes. The data points to several differences between casuals and members.

What story does your data tell? The main story the data tells is that members have set schedules, as seen on chart 06 on key timestamps. Those timestamps point out that members use the bikes for routine activities, like going to work. Charts like 08 also point out that they have less riding time, because they have a set route to take.

How do your findings relate to your original question? The findings build a profile for members, relating to “Find the keys differences between casuals and annual riders”, also knowing whey they use the bikes helps to find “How digital media could influence them”.

6.Act

The act phase would be done by the marketing team of the company. The main takeaway will be the top three recommendations for the marketing.

Google_Data_Analytics_Professional_Certificate_Capstone_Project

Kanishk Gupta

2022-07-17