Hi Everyone this is my version of google data analytics capstone project 1.
You are working for Cyclistic, a bike-sharing company. Bikes can be unlocked from one station and returned to any other station in the system anytime.
Cyclistic has flexible pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Director of marketing believes the company’s future success depends on maximizing the number of annual memberships as finance analysts have concluded that annual memberships are much more profitable than casual riders. She also believes that there is a good chance of converting casual riders to members as they are already aware of Cyclistic program and have chosen it for their mobility needs.
The full document to the case study can be found in the Google Data Analytics Capstone: Complete a Case Study
For this project this steps will be followed to ensure its completion:
It will follow the steps of the data analysis process: Ask, prepare, process, analyze, share, and act. Each step will follow its own roadmap with: * Initial assessment and exploration done using Excel * Data cleaning, exploration, manipulation, validation and bit of visualisation done using R * Data visualization using tableau
identify the business task: Strategy to maximize the number of annual memberships by converting casual riders into annual riders.
Key Stakeholders: Lily Monero & the Executive team
Questions to Analyze * How do annual members and casual riders use Cyclistic bikes differently? * Why would casual riders buy Cyclistic annual memberships? * How can Cyclistic use digital media to influence casual riders to become members?
I will be using 6 Months of data from April-2020 to September-2020. The data is made available in the form of zip files here. The data is made publicly available by Motivate International Inc.
improper ride ids removed
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
april_tripdata <- read_csv("april_tripdata.csv")
## Rows: 84665 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may_tripdata <- read_csv("may_tripdata.csv")
## Rows: 199947 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
june_tripdata <- read_csv("june_tripdata.csv")
## Rows: 342525 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
july_tripdata <- read_csv("july_tripdata.csv")
## Rows: 550705 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
august_tripdata <- read_csv("august_tripdata.csv")
## Rows: 608506 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
september_tripdata <- read_csv("september_tripdata.csv")
## Rows: 500380 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
agr_datatry <- bind_rows(april_tripdata, may_tripdata, june_tripdata,july_tripdata, august_tripdata,september_tripdata)
summary(agr_datatry)
## ride_id rideable_type started_at ended_at
## Length:2286728 Length:2286728 Length:2286728 Length:2286728
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:2286728 Min. : 2.0 Length:2286728 Min. : 2.0
## Class :character 1st Qu.: 97.0 Class :character 1st Qu.: 98.0
## Mode :character Median :196.0 Mode :character Median :197.0
## Mean :224.9 Mean :225.9
## 3rd Qu.:312.0 3rd Qu.:313.0
## Max. :721.0 Max. :721.0
## NA's :152 NA's :199
## start_lat start_lng end_lat end_lng
## Min. :41.65 Min. :-87.77 Min. :41.65 Min. :-87.77
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.91 Mean :-87.64 Mean :41.91 Mean :-87.64
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.06 Max. :-87.53 Max. :42.07 Max. :-87.53
##
## member_casual
## Length:2286728
## Class :character
## Mode :character
##
##
##
##
cyclistic_no_dups <- agr_datatry[!duplicated(agr_datatry$ride_id), ]
print(paste("Removed", nrow(agr_datatry) - nrow(cyclistic_no_dups), "duplicated rows"))
## [1] "Removed 2 duplicated rows"
Tibble of edited table
tibble(agr_datatry)
## # A tibble: 2,286,728 × 13
## ride_id rideable_type started_at ended_at start_station_n… start_station_id
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 A847FADB… docked_bike 26-04-202… 26-04-2… Eckhart Park 86
## 2 5405B80E… docked_bike 17-04-202… 17-04-2… Drake Ave & Ful… 503
## 3 5DD24A79… docked_bike 01-04-202… 01-04-2… McClurg Ct & Er… 142
## 4 2A59BBDF… docked_bike 07-04-202… 07-04-2… California Ave … 216
## 5 27AD306C… docked_bike 18-04-202… 18-04-2… Rush St & Hubba… 125
## 6 356216E8… docked_bike 30-04-202… 30-04-2… Mies van der Ro… 173
## 7 A2759CB0… docked_bike 02-04-202… 02-04-2… Streeter Dr & G… 35
## 8 FC8BC2E2… docked_bike 07-04-202… 07-04-2… Ogden Ave & Roo… 434
## 9 9EC56486… docked_bike 15-04-202… 15-04-2… LaSalle Dr & Hu… 627
## 10 A8FFF891… docked_bike 04-04-202… 04-04-2… Kedzie Ave & La… 377
## # … with 2,286,718 more rows, and 7 more variables: end_station_name <chr>,
## # end_station_id <dbl>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>
cyclistic_no_dups <- cyclistic_no_dups %>%
mutate(across(c(started_at, ended_at), dmy_hm),
ride_duration = difftime(ended_at,started_at, units = "mins"))
cyclistic_no_dups <- cyclistic_no_dups %>%
mutate(year_month = paste(strftime(cyclistic_no_dups$started_at, "%Y"),
"-",
strftime(cyclistic_no_dups$started_at, "%m"),
paste("(",strftime(cyclistic_no_dups$started_at, "%b"), ")", sep="")))
unique(cyclistic_no_dups$year_month)
## [1] "2020 - 04 (Apr)" "2020 - 05 (May)" "2020 - 06 (Jun)" "2020 - 07 (Jul)"
## [5] "2020 - 08 (Aug)" "2020 - 09 (Sep)" "2020 - 10 (Oct)"
this is done to check which months have more people traffic
yclistic_no_dups <- cyclistic_no_dups %>%
mutate(weekday = paste(strftime(cyclistic_no_dups$ended_at, "%u"), "-", strftime(cyclistic_no_dups$ended_at, "%a")))
unique(cyclistic_no_dups$weekday)
## Warning: Unknown or uninitialised column: `weekday`.
## NULL
this is done to identify trends among different class of members across various days across a week.
cyclistic_no_dups <- cyclistic_no_dups %>%
mutate(start_hour = strftime(cyclistic_no_dups$ended_at, "%H"))
unique(cyclistic_no_dups$start_hour)
## [1] "23" "22" "18" "16" "20" "19" "21" "08" "17" "14" "01" "00" "03" "05" "15"
## [16] "12" "13" "04" "11" "07" "10" "02" "06" "09"
this is done to find frequency during different hours of the day.
Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently? Yes. The data points to several differences between casuals and members.
What story does your data tell? The main story the data tells is that members have set schedules, as seen on chart 06 on key timestamps. Those timestamps point out that members use the bikes for routine activities, like going to work. Charts like 08 also point out that they have less riding time, because they have a set route to take.
How do your findings relate to your original question? The findings build a profile for members, relating to “Find the keys differences between casuals and annual riders”, also knowing whey they use the bikes helps to find “How digital media could influence them”.
The act phase would be done by the marketing team of the company. The main takeaway will be the top three recommendations for the marketing.