Introduction This case study is a part of google data analytics capstone. Project is broken down into steps which are taught in this professional certificate, which are Ask, Prepare, Process, Analyze, Share, Act. This data is of a fictional company which will be analyzed to produce insights which satisfy the business task.The dataset used is of divvytrip and the data can be downladed here.
Scenario
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations
About The Company
In 2016, Cyclistic launched a successful bike-share oering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders dier, why casual riders would buy a membership, and how digital media could aect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Characters and Teams
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also oering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
ASK
Tasks Done
Identify the business task
Consider key stakeholders
Questions
1 Goal?
Design marketing strategies aimed at converting casual riders into annual members.
2 Business Task?
How do annual members and casual riders use Cyclistic bikes differently?
3 How can my insights drive business decisions?
With reference to my insights the company can leverage more users and can increase the conversion ratio of annual members from casual riders.
PREPARE
Tasks Done
Download data and store it appropriately.
Identify how it’s organized.
Sort and filter the data.
Determine the credibility of the data.
Questions
1 Where is your data located?
Extraction of data took place from Kaggle website.
2 How is the data organized?
The data is structured, file format is .csv and the files are organized ccording to month.
3 Are there issues with bias or credibility in this data? Does your data ROCCC?
There are no issues with bias and credibility as it is the internal data of th company, and the data is reliable, original, comprehensive, current and credible.
4 How are you addressing licensing, privacy, security, and accessibility?
The data has been made available by Motivate International Inc. under this license on top of this their dataset doesnt include customers’ personal information. Security and accessibility are not a problem as the dataset is public and available for everyone to utilise.
5 How did you verify the data’s integrity?
The data is consistent, accurate but it had few columns which were not of desired data type, these were handled in coming stages.
6 How does it help you answer your question?
The data is sufficient to extract business critical insights which will solve the business task.
7 Are there any problems with the data?
Too less information about the riders.
PROCESS
library(tidyverse) # used for wrangling data
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate) # used for parsing date attributes
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2) # used for visualization
library(dplyr)
library(anytime)
Loading the .csv files
Jan_2021 <- read_csv("202101-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Feb_2021 <- read_csv("202102-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Mar_2021 <- read_csv("202103-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_character(),
## ended_at = col_character(),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Apr_2021 <- read_csv("202104-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
May_2021 <- read_csv("202105-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Jun_2021 <- read_csv("202106-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Jul_2020 <- read_csv("202007-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_character(),
## ended_at = col_character(),
## start_station_name = col_character(),
## start_station_id = col_double(),
## end_station_name = col_character(),
## end_station_id = col_double(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Aug_2020 <- read_csv("202008-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_double(),
## end_station_name = col_character(),
## end_station_id = col_double(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Sep_2020 <- read_csv("202009-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_double(),
## end_station_name = col_character(),
## end_station_id = col_double(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Oct_2020 <- read_csv("202010-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_double(),
## end_station_name = col_character(),
## end_station_id = col_double(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Nov_2020 <- read_csv("202011-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_double(),
## end_station_name = col_character(),
## end_station_id = col_double(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Dec_2020 <- read_csv("202012-divvy-tripdata.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
Checking the structure of all dataframes to find any inconcistencies
str(Jan_2021)
## spec_tbl_df [96,834 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:96834] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:96834] NA NA NA NA ...
## $ end_station_id : chr [1:96834] NA NA NA NA ...
## $ start_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:96834] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(Feb_2021)
## spec_tbl_df [49,622 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:49622] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
## $ rideable_type : chr [1:49622] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:49622], format: "2021-02-12 16:14:56" "2021-02-14 17:52:38" ...
## $ ended_at : POSIXct[1:49622], format: "2021-02-12 16:21:43" "2021-02-14 18:12:09" ...
## $ start_station_name: chr [1:49622] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
## $ start_station_id : chr [1:49622] "525" "525" "KA1503000012" "637" ...
## $ end_station_name : chr [1:49622] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
## $ end_station_id : chr [1:49622] "660" "16806" "TA1305000029" "TA1305000034" ...
## $ start_lat : num [1:49622] 42 42 41.9 41.9 41.8 ...
## $ start_lng : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:49622] 42 42 41.9 41.9 41.8 ...
## $ end_lng : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr [1:49622] "member" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(Mar_2021)
## spec_tbl_df [205,856 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:205856] "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
## $ rideable_type : chr [1:205856] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr [1:205856] "16/03/2021 08:32" "28/03/2021 01:26" "11/03/2021 21:17" "11/03/2021 13:26" ...
## $ ended_at : chr [1:205856] "16/03/2021 08:36" "28/03/2021 01:36" "11/03/2021 21:33" "11/03/2021 13:55" ...
## $ start_station_name: chr [1:205856] "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
## $ start_station_id : chr [1:205856] "15651" "15651" "15443" "TA1308000021" ...
## $ end_station_name : chr [1:205856] "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr [1:205856] "13266" "18017" "TA1308000043" "13323" ...
## $ start_lat : num [1:205856] 41.9 41.9 41.8 42 42 ...
## $ start_lng : num [1:205856] -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:205856] 41.9 41.9 41.8 42 42.1 ...
## $ end_lng : num [1:205856] -87.7 -87.7 -87.6 -87.6 -87.7 ...
## $ member_casual : chr [1:205856] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(Apr_2021)
## spec_tbl_df [337,230 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:337230] "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
## $ rideable_type : chr [1:337230] "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
## $ started_at : POSIXct[1:337230], format: "2021-04-12 18:25:36" "2021-04-27 17:27:11" ...
## $ ended_at : POSIXct[1:337230], format: "2021-04-12 18:56:55" "2021-04-27 18:31:29" ...
## $ start_station_name: chr [1:337230] "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
## $ start_station_id : chr [1:337230] "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
## $ end_station_name : chr [1:337230] "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
## $ end_station_id : chr [1:337230] "13235" "KA1503000069" "20121" "13235" ...
## $ start_lat : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
## $ start_lng : num [1:337230] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
## $ end_lng : num [1:337230] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:337230] "member" "casual" "casual" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(May_2021)
## spec_tbl_df [531,633 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:531633] "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
## $ rideable_type : chr [1:531633] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:531633], format: "2021-05-30 11:58:15" "2021-05-30 11:29:14" ...
## $ ended_at : POSIXct[1:531633], format: "2021-05-30 12:10:39" "2021-05-30 12:14:09" ...
## $ start_station_name: chr [1:531633] NA NA NA NA ...
## $ start_station_id : chr [1:531633] NA NA NA NA ...
## $ end_station_name : chr [1:531633] NA NA NA NA ...
## $ end_station_id : chr [1:531633] NA NA NA NA ...
## $ start_lat : num [1:531633] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:531633] 41.9 41.8 41.9 41.9 41.9 ...
## $ end_lng : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:531633] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(Jun_2021)
## spec_tbl_df [729,595 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
## $ rideable_type : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:729595], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
## $ ended_at : POSIXct[1:729595], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
## $ start_station_name: chr [1:729595] NA NA NA NA ...
## $ start_station_id : chr [1:729595] NA NA NA NA ...
## $ end_station_name : chr [1:729595] NA NA NA NA ...
## $ end_station_id : chr [1:729595] NA NA NA NA ...
## $ start_lat : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
## $ start_lng : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
## $ end_lng : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:729595] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(Jul_2020)
## spec_tbl_df [551,480 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:551480] "762198876D69004D" "BEC9C9FBA0D4CF1B" "D2FD8EA432C77EC1" "54AE594E20B35881" ...
## $ rideable_type : chr [1:551480] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ started_at : chr [1:551480] "09/07/2020 15:22" "24/07/2020 23:56" "08/07/2020 19:49" "17/07/2020 19:06" ...
## $ ended_at : chr [1:551480] "09/07/2020 15:25" "25/07/2020 00:20" "08/07/2020 19:56" "17/07/2020 19:27" ...
## $ start_station_name: chr [1:551480] "Ritchie Ct & Banks St" "Halsted St & Roscoe St" "Lake Shore Dr & Diversey Pkwy" "LaSalle St & Illinois St" ...
## $ start_station_id : num [1:551480] 180 299 329 181 268 635 113 211 176 31 ...
## $ end_station_name : chr [1:551480] "Wells St & Evergreen Ave" "Broadway & Ridge Ave" "Clark St & Wellington Ave" "Clark St & Armitage Ave" ...
## $ end_station_id : num [1:551480] 291 461 156 94 301 289 140 31 191 142 ...
## $ start_lat : num [1:551480] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:551480] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:551480] 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:551480] -87.6 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:551480] "member" "member" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(Aug_2020)
## spec_tbl_df [622,361 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:622361] "322BD23D287743ED" "2A3AEF1AB9054D8B" "67DC1D133E8B5816" "C79FBBD412E578A7" ...
## $ rideable_type : chr [1:622361] "docked_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:622361], format: "2020-08-20 18:08:14" "2020-08-27 18:46:04" ...
## $ ended_at : POSIXct[1:622361], format: "2020-08-20 18:17:51" "2020-08-27 19:54:51" ...
## $ start_station_name: chr [1:622361] "Lake Shore Dr & Diversey Pkwy" "Michigan Ave & 14th St" "Columbus Dr & Randolph St" "Daley Center Plaza" ...
## $ start_station_id : num [1:622361] 329 168 195 81 658 658 196 67 153 177 ...
## $ end_station_name : chr [1:622361] "Clark St & Lincoln Ave" "Michigan Ave & 14th St" "State St & Randolph St" "State St & Kinzie St" ...
## $ end_station_id : num [1:622361] 141 168 44 47 658 658 49 229 225 305 ...
## $ start_lat : num [1:622361] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:622361] -87.6 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num [1:622361] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:622361] -87.6 -87.6 -87.6 -87.6 -87.7 ...
## $ member_casual : chr [1:622361] "member" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(Sep_2020)
## spec_tbl_df [532,958 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:532958] "2B22BD5F95FB2629" "A7FB70B4AFC6CAF2" "86057FA01BAC778E" "57F6DC9A153DB98C" ...
## $ rideable_type : chr [1:532958] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:532958], format: "2020-09-17 14:27:11" "2020-09-17 15:07:31" ...
## $ ended_at : POSIXct[1:532958], format: "2020-09-17 14:44:24" "2020-09-17 15:07:45" ...
## $ start_station_name: chr [1:532958] "Michigan Ave & Lake St" "W Oakdale Ave & N Broadway" "W Oakdale Ave & N Broadway" "Ashland Ave & Belle Plaine Ave" ...
## $ start_station_id : num [1:532958] 52 NA NA 246 24 94 291 NA NA NA ...
## $ end_station_name : chr [1:532958] "Green St & Randolph St" "W Oakdale Ave & N Broadway" "W Oakdale Ave & N Broadway" "Montrose Harbor" ...
## $ end_station_id : num [1:532958] 112 NA NA 249 24 NA 256 NA NA NA ...
## $ start_lat : num [1:532958] 41.9 41.9 41.9 42 41.9 ...
## $ start_lng : num [1:532958] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:532958] 41.9 41.9 41.9 42 41.9 ...
## $ end_lng : num [1:532958] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:532958] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(Oct_2020)
## spec_tbl_df [388,653 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:388653] "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
## $ rideable_type : chr [1:388653] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:388653], format: "2020-10-31 19:39:43" "2020-10-31 23:50:08" ...
## $ ended_at : POSIXct[1:388653], format: "2020-10-31 19:57:12" "2020-11-01 00:04:16" ...
## $ start_station_name: chr [1:388653] "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
## $ start_station_id : num [1:388653] 313 227 102 165 190 359 313 125 NA 174 ...
## $ end_station_name : chr [1:388653] "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : num [1:388653] 125 260 423 256 185 53 125 313 199 635 ...
## $ start_lat : num [1:388653] 41.9 41.9 41.8 42 41.9 ...
## $ start_lng : num [1:388653] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:388653] 41.9 41.9 41.8 42 41.9 ...
## $ end_lng : num [1:388653] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr [1:388653] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(Nov_2020)
## spec_tbl_df [259,716 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:259716] "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
## $ rideable_type : chr [1:259716] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:259716], format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" ...
## $ ended_at : POSIXct[1:259716], format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" ...
## $ start_station_name: chr [1:259716] "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
## $ start_station_id : num [1:259716] 110 672 76 659 2 72 76 NA 58 394 ...
## $ end_station_name : chr [1:259716] "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
## $ end_station_id : num [1:259716] 211 29 41 185 2 76 72 NA 288 273 ...
## $ start_lat : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:259716] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:259716] -87.6 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr [1:259716] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
str(Dec_2020)
## spec_tbl_df [131,573 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:131573] "70B6A9A437D4C30D" "158A465D4E74C54A" "5262016E0F1F2F9A" "BE119628E44F871E" ...
## $ rideable_type : chr [1:131573] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:131573], format: "2020-12-27 12:44:29" "2020-12-18 17:37:15" ...
## $ ended_at : POSIXct[1:131573], format: "2020-12-27 12:55:06" "2020-12-18 17:44:19" ...
## $ start_station_name: chr [1:131573] "Aberdeen St & Jackson Blvd" NA NA NA ...
## $ start_station_id : chr [1:131573] "13157" NA NA NA ...
## $ end_station_name : chr [1:131573] "Desplaines St & Kinzie St" NA NA NA ...
## $ end_station_id : chr [1:131573] "TA1306000003" NA NA NA ...
## $ start_lat : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
## $ start_lng : num [1:131573] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
## $ end_lng : num [1:131573] -87.6 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:131573] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
Data Cleaning
Converting inconcistent columns by changing data type.
mutate() used for conversion of column’s data type
Jan_2021<-mutate(Jan_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Feb_2021<-mutate(Feb_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Mar_2021<-mutate(Mar_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Apr_2021<-mutate(Apr_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
May_2021<-mutate(May_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Jun_2021<-mutate(Jun_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Jul_2020<-mutate(Jul_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Aug_2020<-mutate(Aug_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Sep_2020<-mutate(Sep_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Oct_2020<-mutate(Oct_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Nov_2020<-mutate(Nov_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Dec_2020<-mutate(Dec_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Parsing character to date time
Mar_2021$started_at<-dmy_hms(Mar_2021$started_at,tz=Sys.timezone())
Mar_2021$ended_at<-dmy_hms(Mar_2021$ended_at,tz=Sys.timezone())
Jul_2020$started_at<-dmy_hms(Jul_2020$started_at,tz=Sys.timezone())
Jul_2020$ended_at<-dmy_hms(Jul_2020$ended_at,tz=Sys.timezone())
Merging the dataframes
Complete_trips<-bind_rows(Jan_2021,Feb_2021,Mar_2021,Apr_2021,May_2021,Jun_2021,Jul_2020,Aug_2020,Sep_2020,Oct_2020,Nov_2020,Dec_2020)
Eliminating Duplicates
duplicated() returns a logical vector where TRUE specifies which elements of a vector or data frame are duplicates.
Distinct_trips<-Complete_trips[!duplicated(Complete_trips$ride_id),]
cleaned_divvy<-na.omit(Distinct_trips)
Primary Insights from the dataset
colnames(cleaned_divvy)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
nrow(cleaned_divvy)
## [1] 2630393
summary(cleaned_divvy)
## ride_id rideable_type started_at
## Length:2630393 Length:2630393 Min. :2020-03-01 15:30:01
## Class :character Class :character 1st Qu.:2020-08-04 16:48:43
## Mode :character Mode :character Median :2020-09-06 19:29:09
## Mean :2020-10-03 09:51:38
## 3rd Qu.:2020-10-22 18:15:19
## Max. :2021-06-30 23:59:05
## ended_at start_station_name start_station_id
## Min. :2020-03-01 15:30:06 Length:2630393 Min. : 2
## 1st Qu.:2020-08-04 17:15:00 Class :character 1st Qu.: 114
## Median :2020-09-06 20:01:03 Mode :character Median : 236
## Mean :2020-10-03 10:17:37 Mean : 2135
## 3rd Qu.:2020-10-22 18:34:31 3rd Qu.: 443
## Max. :2021-07-09 17:23:46 Max. :202480
## end_station_name end_station_id start_lat start_lng
## Length:2630393 Min. : 2 Min. :41.65 Min. :-87.77
## Class :character 1st Qu.: 115 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Median : 236 Median :41.90 Median :-87.64
## Mean : 2140 Mean :41.90 Mean :-87.64
## 3rd Qu.: 448 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :202480 Max. :42.06 Max. :-87.53
## end_lat end_lng member_casual
## Min. :41.65 Min. :-87.77 Length:2630393
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.64
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
str(cleaned_divvy)
## tibble [2,630,393 x 13] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:2630393] "B9F73448DFBE0D45" "457C7F4B5D3DA135" "57C750326F9FDABE" "4D518C65E338D070" ...
## $ rideable_type : chr [1:2630393] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:2630393], format: "2021-01-24 19:15:38" "2021-01-23 12:57:38" ...
## $ ended_at : POSIXct[1:2630393], format: "2021-01-24 19:22:51" "2021-01-23 13:02:10" ...
## $ start_station_name: chr [1:2630393] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : num [1:2630393] 17660 17660 17660 17660 17660 ...
## $ end_station_name : chr [1:2630393] "Wood St & Augusta Blvd" "California Ave & North Ave" "Wood St & Augusta Blvd" "Wood St & Augusta Blvd" ...
## $ end_station_id : num [1:2630393] 657 13258 657 657 657 ...
## $ start_lat : num [1:2630393] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:2630393] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:2630393] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:2630393] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:2630393] "member" "member" "casual" "casual" ...
## - attr(*, "na.action")= 'omit' Named int [1:1806878] 1 2 3 4 5 6 7 8 9 15 ...
## ..- attr(*, "names")= chr [1:1806878] "1" "2" "3" "4" ...
Manipulating the data
Addition of New columns
Ride_time_min Calcualtes the total time travelled by the rider.
cleaned_divvy<-cleaned_divvy %>%
mutate(Ride_time_min=as.numeric(cleaned_divvy$ended_at-cleaned_divvy$started_at)/60)
summary(cleaned_divvy$Ride_time_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -28948.43 2.43 10.10 25.98 21.45 54283.35
Year_month Seperating the year & month.
cleaned_divvy<-cleaned_divvy %>%
mutate(year_month=paste(strftime(cleaned_divvy$started_at, "%Y"),"-", strftime(cleaned_divvy$started_at, "%m"), "(", strftime(cleaned_divvy$started_at, "%b"), ")", sep=""))
cleaned_divvy<-cleaned_divvy %>%
mutate(month=paste(strftime(cleaned_divvy$started_at, "%m"), "(", strftime(cleaned_divvy$started_at, "%b"), ")", sep=""))
unique(cleaned_divvy$month)
## [1] "01(Jan)" "02(Feb)" "03(Mar)" "04(Apr)" "05(May)" "06(Jun)" "07(Jul)"
## [8] "08(Aug)" "09(Sep)" "10(Oct)" "11(Nov)" "12(Dec)"
Functions used are: mutate, paste, strftime, unique.
mutate() adds new variables and preserves existing ones, here paste() concatenate Vectors by converting them into character strftime() converts a date-time object into a character vector unique() eliminate or delete the duplicate values or the rows present.
Weekday
cleaned_divvy<-cleaned_divvy %>%
mutate(weekday=paste(strftime(cleaned_divvy$ended_at, "%u"),"-", strftime(cleaned_divvy$ended_at, "%a")))
start_hour
cleaned_divvy<-cleaned_divvy %>%
mutate(start_hour=strftime(cleaned_divvy$ended_at, "%H"))
unique(cleaned_divvy$start_hour)
## [1] "00" "18" "21" "13" "14" "20" "19" "23" "22" "01" "17" "05" "16" "03" "15"
## [16] "02" "04" "06" "12" "10" "11" "08" "07" "09"
Questions
1 What tools are you choosing and why?
Tool used is R studio as this project contains large datasets so handling the data checking, tidying, transforming, analysing, visualising and documenting in R would be faster and easier as compared to spreadsheets.
2 Have you ensured your data’s integrity?
Yes, The data is concistent and accurate.
3 What steps have you taken to ensure that your data is clean?
Removal of duplicate rows, converted the columns to their correct format, omotted the NAs.
4 Have you documented your cleaning process so you can review and share those results?
Yes, In this document.
Tasks Done
Check the data for errors.
Choose your tools.
Transform the data so you can work with it eectively
Document the cleaning process.
ANALYZE
Data distribution
Casuals and members
Tells us how is the data distributed among the casual riders and members
cleaned_divvy %>%
group_by(member_casual) %>%
summarise('no_of_riders'= n(),'%_of'=(n()/nrow(cleaned_divvy))*100)
## # A tibble: 2 x 3
## member_casual no_of_riders `%_of`
## <chr> <int> <dbl>
## 1 casual 1162723 44.2
## 2 member 1467670 55.8
Functions used are: group_by, summarise, length, nrow
summarise() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. If we combine group_by() and n() we can easily count the no of rows for each product, an alterntive for n() is length (ride_id). Simple count() isnt used as it cant be used on character class object.nrow() is used to return the number of rows in the data set.
ggplot(cleaned_divvy,aes(member_casual,fill=member_casual))+
geom_bar()+scale_fill_brewer(palette = "Set2")+labs(title= "Members X Casual riders distribution")

Members comprise of more than 50% of the data 55.7% (approx) which is approximately 10% more than casual riders.
Distribution by Month
cleaned_divvy %>%
group_by(year_month) %>%
summarise('no_of_riders' = length(ride_id), '%_of'=(length(ride_id)/nrow(cleaned_divvy)*100),'members_p'=(sum(member_casual=='member')/length(ride_id))*100, 'casuals_p'=(sum(member_casual=='casual')/length(ride_id))*100, 'Member_Casual_Diff'= members_p-casuals_p)
## # A tibble: 14 x 6
## year_month no_of_riders `%_of` members_p casuals_p Member_Casual_Diff
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2020-03(Mar) 47065 1.79 56.5 43.5 13.0
## 2 2020-07(Jul) 550394 20.9 51.2 48.8 2.35
## 3 2020-08(Aug) 604445 23.0 53.5 46.5 6.94
## 4 2020-09(Sep) 501353 19.1 56.9 43.1 13.8
## 5 2020-10(Oct) 340912 13.0 63.8 36.2 27.7
## 6 2020-11(Nov) 223732 8.51 67.1 32.9 34.2
## 7 2020-12(Dec) 23816 0.905 75.0 25.0 50.1
## 8 2021-01(Jan) 16589 0.631 80.2 19.8 60.5
## 9 2021-02(Feb) 9096 0.346 76.3 23.7 52.5
## 10 2021-03(Mar) 67 0.00255 62.7 37.3 25.4
## 11 2021-04(Apr) 67597 2.57 52.8 47.2 5.61
## 12 2021-05(May) 106161 4.04 44.9 55.1 -10.3
## 13 2021-06(Jun) 138302 5.26 44.2 55.8 -11.6
## 14 2021-07(Jul) 864 0.0328 48.5 51.5 -3.01
cleaned_divvy %>%
ggplot(aes(month, fill=member_casual))+
geom_bar()+labs(x='Month',y='No of rides', title='Distribution by Month')+ coord_flip()+ theme(axis.title.x = element_text(color="Blue",size=13),
axis.title.y = element_text(color="Blue",size=13))

Insights
The ratio of members is higher than casuals in all months. The difference of proportion seems less in the mid months(Jul-sept)
Lets compare this with climate of Chicago to get further insights.Data is taken from Climte of Chicago
chicago_mean_temp<- c(-3.2, -1.2, 4.4, 10.5, 16.6, 22.2, 24.8, 23.9, 19.9, 12.9, 5.8, -0.3)
Chicago_month <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
data.frame(Chicago_month, chicago_mean_temp) %>%
ggplot(aes(x=factor(Chicago_month,levels=Chicago_month),y=chicago_mean_temp, fill=chicago_mean_temp))+geom_col(color="Black")+labs(x="Month", y="Mean_temperature", title="Chicago's Mean Temperature [1991-2020]")

Geom_col() is used to inorder to represent the values to the height of the bars.
Insight
The volume of rides undertaken by riders is highly impacted by the climate of a month.
Distribution by Weekday
cleaned_divvy %>%
group_by(weekday) %>%
summarise("no_of_riders"=n(), "%_of"=(n()/nrow(cleaned_divvy))*100, "members_p"=(sum(member_casual=="member")/n())*100, "casual_p"=(sum(member_casual=="casual")/n())*100, "member_casual_diff_%"=members_p-casual_p)
## # A tibble: 7 x 6
## weekday no_of_riders `%_of` members_p casual_p `member_casual_diff_%`
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 - Mon 315569 12.0 58.3 41.7 16.6
## 2 2 - Tue 309752 11.8 63.7 36.3 27.5
## 3 3 - Wed 339587 12.9 63.9 36.1 27.8
## 4 4 - Thu 352912 13.4 62.2 37.8 24.5
## 5 5 - Fri 391989 14.9 57.2 42.8 14.4
## 6 6 - Sat 491222 18.7 46.7 53.3 -6.55
## 7 7 - Sun 429362 16.3 45.6 54.4 -8.79
cleaned_divvy %>%
ggplot(aes(x=weekday, fill=member_casual))+geom_bar()+
labs(x="Day", y="No of Rides")+ coord_flip()

Insights
Largest volume of rides are taken place on Weekend i.e Friday, Saturday and Sunday with Saturday being the highest.
Members have the largest volume through the week except on weekends.No of rides made by Casuals increase by 20% (approx) from friday to Sunday.
Distribution by Type of bike
cleaned_divvy %>%
group_by(rideable_type) %>%
summarise("no_of_riders"= n(),
'%_of' = (n() / nrow(cleaned_divvy)) * 100,
'members_p' = (sum(member_casual == "member") / n()) * 100,
'casual_p' = (sum(member_casual == "casual") / n()) * 100,
'member_casual_perc_difer' = members_p - casual_p)
## # A tibble: 3 x 6
## rideable_type no_of_riders `%_of` members_p casual_p member_casual_perc_difer
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 classic_bike 274461 10.4 59.5 40.5 18.9
## 2 docked_bike 1943842 73.9 55.1 44.9 10.2
## 3 electric_bike 412090 15.7 56.7 43.3 13.4
cleaned_divvy %>%
ggplot(aes(x=member_casual, fill=rideable_type))+geom_bar(position = "dodge")+
labs(x="Bike", y="No of Rides")

Insights
The docked bike is more preferred by both type of riders, however we cant pinpoint it as the company may actually provide more docked bikes than other.
Most no of rides are made using Docked bikes and least is Classic. Furthermore we can say that members use Docked bike more than Casuals.
There has been a rise of usage of classic bikes from 2021 followed by electric bikes.
Saving the dataset into a CSV file
cleaned_divvy %>%
write.csv("final_divvy_dataset.csv")
Questions
1 How should you organize your data to perform analysis on it?
Data has been orgnised into a csv file and stored in a folder.
2 Has your data been properly formatted?
All columns contain correct data type.
3 What surprises did you discover in the data?
Distribution of the riders during the weekday and weekends and how climate had a huge impact on the riding.
4 What trends or relationships did you find in the data?
Volume of members is more than casual riders, No of rides done by the riders differ in weekdays and weekends, No of rides are more during the Summer months(June - August), Members tend to prefer docked bikes.
5 How will these insights help answer your business questions?
These will help provide behavioral and usage patterns of bike users which will be utilised by the marketing team to leverage more traffic.
Tasks Done
Aggregate your data so it’s useful and accessible.
Organize and format your data.
Perform calculations.
Identify trends and relationships.
SHARE
Share phase involves presenting the findings in a documented or a dashboard manner. In this project the documentation is done using R Markdown and published in RPubs.
Key findings:
Members are larger in volume than Casual riders i.e 20% (approx)
In all months No of rides done by members is more than casuals
The difference in usage decreases in last quarter of 2020
Climate highly impacts the volume of rides in a month.
Weekend has larger volume of rides compared to weekdays.
Conclusion
From the findings we can derive that members use the bikes for routine activities like going to work or exercise as their distribution stays normal during the weekdays, where as casuals have a more curved distribution which peaks in weekends, through this we can derive that casuals use the bikes for recreational ways.
Questions
1 Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?
The above analysis answers and shows the difference between the two type of riders.
2 What story does your data tell?
The story which the data tells us is that members though being large in number and in executing higher no of rides their total ride time is less compared to casuals as members mostly use their bikes for routine activities and they also have set routes where as casuals usage booms in weekends and we can say that they use it for recreational ways.
3 How do your findings relate to your original question?
My findings help in identifying the behavioral and usage differences of both types of customers which will make it easier for conducting a digital campaign and to increase the conversion ratio.
4 Who is your audience? What is the best way to communicate with them?
Audience is my cyclistic marketing analytics team and Lily Moreno. Best way to communicate with them is through neat and clear visualizations embedded in a dashboard in a story format.
5 Can data visualization help you share your findings?
The primary purpose of visualising data is to share the findings in an understandable manner for the stakeholders.
6 Is your presentation accessible to your audience?
Yes, The colors used are pleasing, the font is readable and there is no clutter.
Tasks Done
Determine the best way to share your findings.
Create effective data visualizations.
Present your findings.
Ensure your work is accessible.
ACT
This phase needs to be executed by the marketing team of Cyclistic Co.
Questions
1 What is your final conclusion based on your analysis? and How could your team and business apply your insights?
Behvioral patterns and usage patterns of the users are different and those patterns can be targeted in ordered to make the services more desirable so that the user traffic increases multifold and so will the conversion ratio from casual to annual members.
2 What next steps would you or your stakeholders take based on your findings?
Analysis can be further done to the granular level by utilising the geographic data and marketing team can accordingly plan its strategies to leverage more users.
3 Is there additional data you could use to expand on your findings?
Geographical data, Additional Climate data, Routes data.
——————————————————————————-