Exploratory Data Analysis of Cyclisitc Co.

Introduction This case study is a part of google data analytics capstone. Project is broken down into steps which are taught in this professional certificate, which are Ask, Prepare, Process, Analyze, Share, Act. This data is of a fictional company which will be analyzed to produce insights which satisfy the business task.The dataset used is of divvytrip and the data can be downladed here.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations

About The Company

In 2016, Cyclistic launched a successful bike-share oering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders dier, why casual riders would buy a membership, and how digital media could aect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Characters and Teams

Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also oering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.

Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

ASK

Tasks Done

Identify the business task

Consider key stakeholders

Questions

1 Goal?

Design marketing strategies aimed at converting casual riders into annual members.

2 Business Task?

How do annual members and casual riders use Cyclistic bikes differently?

3 How can my insights drive business decisions?

With reference to my insights the company can leverage more users and can increase the conversion ratio of annual members from casual riders.

PREPARE

Tasks Done

Download data and store it appropriately.

Identify how it’s organized.

Sort and filter the data.

Determine the credibility of the data.

Questions

1 Where is your data located?

Extraction of data took place from Kaggle website.

2 How is the data organized?

The data is structured, file format is .csv and the files are organized ccording to month.

3 Are there issues with bias or credibility in this data? Does your data ROCCC?

There are no issues with bias and credibility as it is the internal data of th company, and the data is reliable, original, comprehensive, current and credible.

4 How are you addressing licensing, privacy, security, and accessibility?

The data has been made available by Motivate International Inc. under this license on top of this their dataset doesnt include customers’ personal information. Security and accessibility are not a problem as the dataset is public and available for everyone to utilise.

5 How did you verify the data’s integrity?

The data is consistent, accurate but it had few columns which were not of desired data type, these were handled in coming stages.

6 How does it help you answer your question?

The data is sufficient to extract business critical insights which will solve the business task.

7 Are there any problems with the data?

Too less information about the riders.

PROCESS

library(tidyverse) # used for wrangling data

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate) # used for parsing date attributes

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)   # used for visualization
library(dplyr)
library(anytime)

Loading the .csv files

Jan_2021 <- read_csv("202101-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Feb_2021 <- read_csv("202102-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Mar_2021 <- read_csv("202103-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_character(),
##   ended_at = col_character(),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Apr_2021 <- read_csv("202104-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

May_2021 <- read_csv("202105-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Jun_2021 <- read_csv("202106-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Jul_2020 <- read_csv("202007-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_character(),
##   ended_at = col_character(),
##   start_station_name = col_character(),
##   start_station_id = col_double(),
##   end_station_name = col_character(),
##   end_station_id = col_double(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Aug_2020 <- read_csv("202008-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_double(),
##   end_station_name = col_character(),
##   end_station_id = col_double(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Sep_2020 <- read_csv("202009-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_double(),
##   end_station_name = col_character(),
##   end_station_id = col_double(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Oct_2020 <- read_csv("202010-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_double(),
##   end_station_name = col_character(),
##   end_station_id = col_double(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Nov_2020 <- read_csv("202011-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_double(),
##   end_station_name = col_character(),
##   end_station_id = col_double(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Dec_2020 <- read_csv("202012-divvy-tripdata.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   ride_id = col_character(),
##   rideable_type = col_character(),
##   started_at = col_datetime(format = ""),
##   ended_at = col_datetime(format = ""),
##   start_station_name = col_character(),
##   start_station_id = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_character(),
##   start_lat = col_double(),
##   start_lng = col_double(),
##   end_lat = col_double(),
##   end_lng = col_double(),
##   member_casual = col_character()
## )

Checking the structure of all dataframes to find any inconcistencies

str(Jan_2021)

## spec_tbl_df [96,834 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:96834] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:96834] NA NA NA NA ...
##  $ end_station_id    : chr [1:96834] NA NA NA NA ...
##  $ start_lat         : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:96834] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(Feb_2021)

## spec_tbl_df [49,622 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:49622] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
##  $ rideable_type     : chr [1:49622] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:49622], format: "2021-02-12 16:14:56" "2021-02-14 17:52:38" ...
##  $ ended_at          : POSIXct[1:49622], format: "2021-02-12 16:21:43" "2021-02-14 18:12:09" ...
##  $ start_station_name: chr [1:49622] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
##  $ start_station_id  : chr [1:49622] "525" "525" "KA1503000012" "637" ...
##  $ end_station_name  : chr [1:49622] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
##  $ end_station_id    : chr [1:49622] "660" "16806" "TA1305000029" "TA1305000034" ...
##  $ start_lat         : num [1:49622] 42 42 41.9 41.9 41.8 ...
##  $ start_lng         : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:49622] 42 42 41.9 41.9 41.8 ...
##  $ end_lng           : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:49622] "member" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(Mar_2021)

## spec_tbl_df [205,856 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:205856] "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
##  $ rideable_type     : chr [1:205856] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr [1:205856] "16/03/2021 08:32" "28/03/2021 01:26" "11/03/2021 21:17" "11/03/2021 13:26" ...
##  $ ended_at          : chr [1:205856] "16/03/2021 08:36" "28/03/2021 01:36" "11/03/2021 21:33" "11/03/2021 13:55" ...
##  $ start_station_name: chr [1:205856] "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
##  $ start_station_id  : chr [1:205856] "15651" "15651" "15443" "TA1308000021" ...
##  $ end_station_name  : chr [1:205856] "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr [1:205856] "13266" "18017" "TA1308000043" "13323" ...
##  $ start_lat         : num [1:205856] 41.9 41.9 41.8 42 42 ...
##  $ start_lng         : num [1:205856] -87.7 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:205856] 41.9 41.9 41.8 42 42.1 ...
##  $ end_lng           : num [1:205856] -87.7 -87.7 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr [1:205856] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(Apr_2021)

## spec_tbl_df [337,230 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:337230] "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
##  $ rideable_type     : chr [1:337230] "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:337230], format: "2021-04-12 18:25:36" "2021-04-27 17:27:11" ...
##  $ ended_at          : POSIXct[1:337230], format: "2021-04-12 18:56:55" "2021-04-27 18:31:29" ...
##  $ start_station_name: chr [1:337230] "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
##  $ start_station_id  : chr [1:337230] "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
##  $ end_station_name  : chr [1:337230] "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
##  $ end_station_id    : chr [1:337230] "13235" "KA1503000069" "20121" "13235" ...
##  $ start_lat         : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
##  $ start_lng         : num [1:337230] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
##  $ end_lng           : num [1:337230] -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:337230] "member" "casual" "casual" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(May_2021)

## spec_tbl_df [531,633 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:531633] "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
##  $ rideable_type     : chr [1:531633] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:531633], format: "2021-05-30 11:58:15" "2021-05-30 11:29:14" ...
##  $ ended_at          : POSIXct[1:531633], format: "2021-05-30 12:10:39" "2021-05-30 12:14:09" ...
##  $ start_station_name: chr [1:531633] NA NA NA NA ...
##  $ start_station_id  : chr [1:531633] NA NA NA NA ...
##  $ end_station_name  : chr [1:531633] NA NA NA NA ...
##  $ end_station_id    : chr [1:531633] NA NA NA NA ...
##  $ start_lat         : num [1:531633] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:531633] 41.9 41.8 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:531633] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(Jun_2021)

## spec_tbl_df [729,595 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
##  $ rideable_type     : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:729595], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
##  $ ended_at          : POSIXct[1:729595], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
##  $ start_station_name: chr [1:729595] NA NA NA NA ...
##  $ start_station_id  : chr [1:729595] NA NA NA NA ...
##  $ end_station_name  : chr [1:729595] NA NA NA NA ...
##  $ end_station_id    : chr [1:729595] NA NA NA NA ...
##  $ start_lat         : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
##  $ start_lng         : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
##  $ end_lng           : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:729595] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(Jul_2020)

## spec_tbl_df [551,480 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:551480] "762198876D69004D" "BEC9C9FBA0D4CF1B" "D2FD8EA432C77EC1" "54AE594E20B35881" ...
##  $ rideable_type     : chr [1:551480] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : chr [1:551480] "09/07/2020 15:22" "24/07/2020 23:56" "08/07/2020 19:49" "17/07/2020 19:06" ...
##  $ ended_at          : chr [1:551480] "09/07/2020 15:25" "25/07/2020 00:20" "08/07/2020 19:56" "17/07/2020 19:27" ...
##  $ start_station_name: chr [1:551480] "Ritchie Ct & Banks St" "Halsted St & Roscoe St" "Lake Shore Dr & Diversey Pkwy" "LaSalle St & Illinois St" ...
##  $ start_station_id  : num [1:551480] 180 299 329 181 268 635 113 211 176 31 ...
##  $ end_station_name  : chr [1:551480] "Wells St & Evergreen Ave" "Broadway & Ridge Ave" "Clark St & Wellington Ave" "Clark St & Armitage Ave" ...
##  $ end_station_id    : num [1:551480] 291 461 156 94 301 289 140 31 191 142 ...
##  $ start_lat         : num [1:551480] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:551480] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:551480] 41.9 42 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:551480] -87.6 -87.7 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:551480] "member" "member" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(Aug_2020)

## spec_tbl_df [622,361 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:622361] "322BD23D287743ED" "2A3AEF1AB9054D8B" "67DC1D133E8B5816" "C79FBBD412E578A7" ...
##  $ rideable_type     : chr [1:622361] "docked_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:622361], format: "2020-08-20 18:08:14" "2020-08-27 18:46:04" ...
##  $ ended_at          : POSIXct[1:622361], format: "2020-08-20 18:17:51" "2020-08-27 19:54:51" ...
##  $ start_station_name: chr [1:622361] "Lake Shore Dr & Diversey Pkwy" "Michigan Ave & 14th St" "Columbus Dr & Randolph St" "Daley Center Plaza" ...
##  $ start_station_id  : num [1:622361] 329 168 195 81 658 658 196 67 153 177 ...
##  $ end_station_name  : chr [1:622361] "Clark St & Lincoln Ave" "Michigan Ave & 14th St" "State St & Randolph St" "State St & Kinzie St" ...
##  $ end_station_id    : num [1:622361] 141 168 44 47 658 658 49 229 225 305 ...
##  $ start_lat         : num [1:622361] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:622361] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:622361] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:622361] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr [1:622361] "member" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(Sep_2020)

## spec_tbl_df [532,958 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:532958] "2B22BD5F95FB2629" "A7FB70B4AFC6CAF2" "86057FA01BAC778E" "57F6DC9A153DB98C" ...
##  $ rideable_type     : chr [1:532958] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:532958], format: "2020-09-17 14:27:11" "2020-09-17 15:07:31" ...
##  $ ended_at          : POSIXct[1:532958], format: "2020-09-17 14:44:24" "2020-09-17 15:07:45" ...
##  $ start_station_name: chr [1:532958] "Michigan Ave & Lake St" "W Oakdale Ave & N Broadway" "W Oakdale Ave & N Broadway" "Ashland Ave & Belle Plaine Ave" ...
##  $ start_station_id  : num [1:532958] 52 NA NA 246 24 94 291 NA NA NA ...
##  $ end_station_name  : chr [1:532958] "Green St & Randolph St" "W Oakdale Ave & N Broadway" "W Oakdale Ave & N Broadway" "Montrose Harbor" ...
##  $ end_station_id    : num [1:532958] 112 NA NA 249 24 NA 256 NA NA NA ...
##  $ start_lat         : num [1:532958] 41.9 41.9 41.9 42 41.9 ...
##  $ start_lng         : num [1:532958] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:532958] 41.9 41.9 41.9 42 41.9 ...
##  $ end_lng           : num [1:532958] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:532958] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(Oct_2020)

## spec_tbl_df [388,653 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:388653] "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
##  $ rideable_type     : chr [1:388653] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:388653], format: "2020-10-31 19:39:43" "2020-10-31 23:50:08" ...
##  $ ended_at          : POSIXct[1:388653], format: "2020-10-31 19:57:12" "2020-11-01 00:04:16" ...
##  $ start_station_name: chr [1:388653] "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
##  $ start_station_id  : num [1:388653] 313 227 102 165 190 359 313 125 NA 174 ...
##  $ end_station_name  : chr [1:388653] "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : num [1:388653] 125 260 423 256 185 53 125 313 199 635 ...
##  $ start_lat         : num [1:388653] 41.9 41.9 41.8 42 41.9 ...
##  $ start_lng         : num [1:388653] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:388653] 41.9 41.9 41.8 42 41.9 ...
##  $ end_lng           : num [1:388653] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr [1:388653] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(Nov_2020)

## spec_tbl_df [259,716 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:259716] "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
##  $ rideable_type     : chr [1:259716] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:259716], format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" ...
##  $ ended_at          : POSIXct[1:259716], format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" ...
##  $ start_station_name: chr [1:259716] "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
##  $ start_station_id  : num [1:259716] 110 672 76 659 2 72 76 NA 58 394 ...
##  $ end_station_name  : chr [1:259716] "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
##  $ end_station_id    : num [1:259716] 211 29 41 185 2 76 72 NA 288 273 ...
##  $ start_lat         : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:259716] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:259716] -87.6 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:259716] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

str(Dec_2020)

## spec_tbl_df [131,573 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:131573] "70B6A9A437D4C30D" "158A465D4E74C54A" "5262016E0F1F2F9A" "BE119628E44F871E" ...
##  $ rideable_type     : chr [1:131573] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:131573], format: "2020-12-27 12:44:29" "2020-12-18 17:37:15" ...
##  $ ended_at          : POSIXct[1:131573], format: "2020-12-27 12:55:06" "2020-12-18 17:44:19" ...
##  $ start_station_name: chr [1:131573] "Aberdeen St & Jackson Blvd" NA NA NA ...
##  $ start_station_id  : chr [1:131573] "13157" NA NA NA ...
##  $ end_station_name  : chr [1:131573] "Desplaines St & Kinzie St" NA NA NA ...
##  $ end_station_id    : chr [1:131573] "TA1306000003" NA NA NA ...
##  $ start_lat         : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
##  $ start_lng         : num [1:131573] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
##  $ end_lng           : num [1:131573] -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:131573] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )

Data Cleaning

Converting inconcistent columns by changing data type.

mutate() used for conversion of column’s data type

Jan_2021<-mutate(Jan_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Feb_2021<-mutate(Feb_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Mar_2021<-mutate(Mar_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Apr_2021<-mutate(Apr_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
May_2021<-mutate(May_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Jun_2021<-mutate(Jun_2021, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Jul_2020<-mutate(Jul_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Aug_2020<-mutate(Aug_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Sep_2020<-mutate(Sep_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Oct_2020<-mutate(Oct_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Nov_2020<-mutate(Nov_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))
Dec_2020<-mutate(Dec_2020, start_station_id=as.double(start_station_id),end_station_id = as.double(end_station_id))

Parsing character to date time

Mar_2021$started_at<-dmy_hms(Mar_2021$started_at,tz=Sys.timezone())
Mar_2021$ended_at<-dmy_hms(Mar_2021$ended_at,tz=Sys.timezone())
Jul_2020$started_at<-dmy_hms(Jul_2020$started_at,tz=Sys.timezone())
Jul_2020$ended_at<-dmy_hms(Jul_2020$ended_at,tz=Sys.timezone())

Merging the dataframes

Complete_trips<-bind_rows(Jan_2021,Feb_2021,Mar_2021,Apr_2021,May_2021,Jun_2021,Jul_2020,Aug_2020,Sep_2020,Oct_2020,Nov_2020,Dec_2020)

Eliminating Duplicates

duplicated() returns a logical vector where TRUE specifies which elements of a vector or data frame are duplicates.

Distinct_trips<-Complete_trips[!duplicated(Complete_trips$ride_id),]
cleaned_divvy<-na.omit(Distinct_trips)

Primary Insights from the dataset

colnames(cleaned_divvy)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

nrow(cleaned_divvy)

## [1] 2630393

summary(cleaned_divvy)

##    ride_id          rideable_type        started_at                 
##  Length:2630393     Length:2630393     Min.   :2020-03-01 15:30:01  
##  Class :character   Class :character   1st Qu.:2020-08-04 16:48:43  
##  Mode  :character   Mode  :character   Median :2020-09-06 19:29:09  
##                                        Mean   :2020-10-03 09:51:38  
##                                        3rd Qu.:2020-10-22 18:15:19  
##                                        Max.   :2021-06-30 23:59:05  
##     ended_at                   start_station_name start_station_id
##  Min.   :2020-03-01 15:30:06   Length:2630393     Min.   :     2  
##  1st Qu.:2020-08-04 17:15:00   Class :character   1st Qu.:   114  
##  Median :2020-09-06 20:01:03   Mode  :character   Median :   236  
##  Mean   :2020-10-03 10:17:37                      Mean   :  2135  
##  3rd Qu.:2020-10-22 18:34:31                      3rd Qu.:   443  
##  Max.   :2021-07-09 17:23:46                      Max.   :202480  
##  end_station_name   end_station_id     start_lat       start_lng     
##  Length:2630393     Min.   :     2   Min.   :41.65   Min.   :-87.77  
##  Class :character   1st Qu.:   115   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Median :   236   Median :41.90   Median :-87.64  
##                     Mean   :  2140   Mean   :41.90   Mean   :-87.64  
##                     3rd Qu.:   448   3rd Qu.:41.93   3rd Qu.:-87.63  
##                     Max.   :202480   Max.   :42.06   Max.   :-87.53  
##     end_lat         end_lng       member_casual     
##  Min.   :41.65   Min.   :-87.77   Length:2630393    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.64                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.07   Max.   :-87.52

str(cleaned_divvy)

## tibble [2,630,393 x 13] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:2630393] "B9F73448DFBE0D45" "457C7F4B5D3DA135" "57C750326F9FDABE" "4D518C65E338D070" ...
##  $ rideable_type     : chr [1:2630393] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:2630393], format: "2021-01-24 19:15:38" "2021-01-23 12:57:38" ...
##  $ ended_at          : POSIXct[1:2630393], format: "2021-01-24 19:22:51" "2021-01-23 13:02:10" ...
##  $ start_station_name: chr [1:2630393] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : num [1:2630393] 17660 17660 17660 17660 17660 ...
##  $ end_station_name  : chr [1:2630393] "Wood St & Augusta Blvd" "California Ave & North Ave" "Wood St & Augusta Blvd" "Wood St & Augusta Blvd" ...
##  $ end_station_id    : num [1:2630393] 657 13258 657 657 657 ...
##  $ start_lat         : num [1:2630393] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:2630393] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:2630393] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:2630393] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:2630393] "member" "member" "casual" "casual" ...
##  - attr(*, "na.action")= 'omit' Named int [1:1806878] 1 2 3 4 5 6 7 8 9 15 ...
##   ..- attr(*, "names")= chr [1:1806878] "1" "2" "3" "4" ...

Manipulating the data

Addition of New columns

Ride_time_min Calcualtes the total time travelled by the rider.

cleaned_divvy<-cleaned_divvy %>% 
  mutate(Ride_time_min=as.numeric(cleaned_divvy$ended_at-cleaned_divvy$started_at)/60)
summary(cleaned_divvy$Ride_time_min)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -28948.43      2.43     10.10     25.98     21.45  54283.35

Year_month Seperating the year & month.

cleaned_divvy<-cleaned_divvy %>% 
  mutate(year_month=paste(strftime(cleaned_divvy$started_at, "%Y"),"-", strftime(cleaned_divvy$started_at, "%m"), "(", strftime(cleaned_divvy$started_at, "%b"), ")", sep=""))

cleaned_divvy<-cleaned_divvy %>% 
  mutate(month=paste(strftime(cleaned_divvy$started_at, "%m"), "(", strftime(cleaned_divvy$started_at, "%b"), ")", sep=""))

unique(cleaned_divvy$month)

##  [1] "01(Jan)" "02(Feb)" "03(Mar)" "04(Apr)" "05(May)" "06(Jun)" "07(Jul)"
##  [8] "08(Aug)" "09(Sep)" "10(Oct)" "11(Nov)" "12(Dec)"

Functions used are: mutate, paste, strftime, unique.

mutate() adds new variables and preserves existing ones, here paste() concatenate Vectors by converting them into character strftime() converts a date-time object into a character vector unique() eliminate or delete the duplicate values or the rows present.

Weekday

cleaned_divvy<-cleaned_divvy %>% 
  mutate(weekday=paste(strftime(cleaned_divvy$ended_at, "%u"),"-", strftime(cleaned_divvy$ended_at, "%a")))

start_hour

cleaned_divvy<-cleaned_divvy %>% 
  mutate(start_hour=strftime(cleaned_divvy$ended_at, "%H"))
unique(cleaned_divvy$start_hour)

##  [1] "00" "18" "21" "13" "14" "20" "19" "23" "22" "01" "17" "05" "16" "03" "15"
## [16] "02" "04" "06" "12" "10" "11" "08" "07" "09"

Questions

1 What tools are you choosing and why?

Tool used is R studio as this project contains large datasets so handling the data checking, tidying, transforming, analysing, visualising and documenting in R would be faster and easier as compared to spreadsheets.

2 Have you ensured your data’s integrity?

Yes, The data is concistent and accurate.

3 What steps have you taken to ensure that your data is clean?

Removal of duplicate rows, converted the columns to their correct format, omotted the NAs.

4 Have you documented your cleaning process so you can review and share those results?

Yes, In this document.

Tasks Done

Check the data for errors.

Choose your tools.

Transform the data so you can work with it eectively

Document the cleaning process.

ANALYZE

Data distribution

Casuals and members

Tells us how is the data distributed among the casual riders and members

cleaned_divvy %>% 
  group_by(member_casual) %>% 
  summarise('no_of_riders'= n(),'%_of'=(n()/nrow(cleaned_divvy))*100)

## # A tibble: 2 x 3
##   member_casual no_of_riders `%_of`
##   <chr>                <int>  <dbl>
## 1 casual             1162723   44.2
## 2 member             1467670   55.8

Functions used are: group_by, summarise, length, nrow

summarise() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. If we combine group_by() and n() we can easily count the no of rows for each product, an alterntive for n() is length (ride_id). Simple count() isnt used as it cant be used on character class object.nrow() is used to return the number of rows in the data set.

ggplot(cleaned_divvy,aes(member_casual,fill=member_casual))+
  geom_bar()+scale_fill_brewer(palette = "Set2")+labs(title= "Members X Casual riders distribution")

Members comprise of more than 50% of the data 55.7% (approx) which is approximately 10% more than casual riders.

Distribution by Month

cleaned_divvy %>% 
  group_by(year_month) %>% 
  summarise('no_of_riders' = length(ride_id), '%_of'=(length(ride_id)/nrow(cleaned_divvy)*100),'members_p'=(sum(member_casual=='member')/length(ride_id))*100, 'casuals_p'=(sum(member_casual=='casual')/length(ride_id))*100, 'Member_Casual_Diff'= members_p-casuals_p)

## # A tibble: 14 x 6
##    year_month   no_of_riders   `%_of` members_p casuals_p Member_Casual_Diff
##    <chr>               <int>    <dbl>     <dbl>     <dbl>              <dbl>
##  1 2020-03(Mar)        47065  1.79         56.5      43.5              13.0 
##  2 2020-07(Jul)       550394 20.9          51.2      48.8               2.35
##  3 2020-08(Aug)       604445 23.0          53.5      46.5               6.94
##  4 2020-09(Sep)       501353 19.1          56.9      43.1              13.8 
##  5 2020-10(Oct)       340912 13.0          63.8      36.2              27.7 
##  6 2020-11(Nov)       223732  8.51         67.1      32.9              34.2 
##  7 2020-12(Dec)        23816  0.905        75.0      25.0              50.1 
##  8 2021-01(Jan)        16589  0.631        80.2      19.8              60.5 
##  9 2021-02(Feb)         9096  0.346        76.3      23.7              52.5 
## 10 2021-03(Mar)           67  0.00255      62.7      37.3              25.4 
## 11 2021-04(Apr)        67597  2.57         52.8      47.2               5.61
## 12 2021-05(May)       106161  4.04         44.9      55.1             -10.3 
## 13 2021-06(Jun)       138302  5.26         44.2      55.8             -11.6 
## 14 2021-07(Jul)          864  0.0328       48.5      51.5              -3.01

cleaned_divvy %>% 
  ggplot(aes(month, fill=member_casual))+
  geom_bar()+labs(x='Month',y='No of rides', title='Distribution by Month')+ coord_flip()+ theme(axis.title.x = element_text(color="Blue",size=13),
        axis.title.y = element_text(color="Blue",size=13))

Insights

The ratio of members is higher than casuals in all months. The difference of proportion seems less in the mid months(Jul-sept)

Lets compare this with climate of Chicago to get further insights.Data is taken from Climte of Chicago

chicago_mean_temp<- c(-3.2, -1.2, 4.4, 10.5, 16.6, 22.2, 24.8, 23.9, 19.9, 12.9, 5.8, -0.3)
Chicago_month <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")

data.frame(Chicago_month, chicago_mean_temp) %>% 
  ggplot(aes(x=factor(Chicago_month,levels=Chicago_month),y=chicago_mean_temp, fill=chicago_mean_temp))+geom_col(color="Black")+labs(x="Month", y="Mean_temperature", title="Chicago's Mean Temperature [1991-2020]")

Geom_col() is used to inorder to represent the values to the height of the bars.

Insight

The volume of rides undertaken by riders is highly impacted by the climate of a month.

Distribution by Weekday

cleaned_divvy %>% 
  group_by(weekday) %>% 
  summarise("no_of_riders"=n(), "%_of"=(n()/nrow(cleaned_divvy))*100, "members_p"=(sum(member_casual=="member")/n())*100, "casual_p"=(sum(member_casual=="casual")/n())*100, "member_casual_diff_%"=members_p-casual_p)

## # A tibble: 7 x 6
##   weekday no_of_riders `%_of` members_p casual_p `member_casual_diff_%`
##   <chr>          <int>  <dbl>     <dbl>    <dbl>                  <dbl>
## 1 1 - Mon       315569   12.0      58.3     41.7                  16.6 
## 2 2 - Tue       309752   11.8      63.7     36.3                  27.5 
## 3 3 - Wed       339587   12.9      63.9     36.1                  27.8 
## 4 4 - Thu       352912   13.4      62.2     37.8                  24.5 
## 5 5 - Fri       391989   14.9      57.2     42.8                  14.4 
## 6 6 - Sat       491222   18.7      46.7     53.3                  -6.55
## 7 7 - Sun       429362   16.3      45.6     54.4                  -8.79

cleaned_divvy %>% 
  ggplot(aes(x=weekday, fill=member_casual))+geom_bar()+
  labs(x="Day", y="No of Rides")+ coord_flip()

Insights

Largest volume of rides are taken place on Weekend i.e Friday, Saturday and Sunday with Saturday being the highest.

Members have the largest volume through the week except on weekends.No of rides made by Casuals increase by 20% (approx) from friday to Sunday.

Distribution by Type of bike

cleaned_divvy %>%
    group_by(rideable_type) %>% 
    summarise("no_of_riders"= n(),
          '%_of' = (n() / nrow(cleaned_divvy)) * 100,
          'members_p' = (sum(member_casual == "member") / n()) * 100,
          'casual_p' = (sum(member_casual == "casual") / n()) * 100,
          'member_casual_perc_difer' = members_p - casual_p)

## # A tibble: 3 x 6
##   rideable_type no_of_riders `%_of` members_p casual_p member_casual_perc_difer
##   <chr>                <int>  <dbl>     <dbl>    <dbl>                    <dbl>
## 1 classic_bike        274461   10.4      59.5     40.5                     18.9
## 2 docked_bike        1943842   73.9      55.1     44.9                     10.2
## 3 electric_bike       412090   15.7      56.7     43.3                     13.4

cleaned_divvy %>% 
  ggplot(aes(x=member_casual, fill=rideable_type))+geom_bar(position = "dodge")+
  labs(x="Bike", y="No of Rides")

Insights

The docked bike is more preferred by both type of riders, however we cant pinpoint it as the company may actually provide more docked bikes than other.

Most no of rides are made using Docked bikes and least is Classic. Furthermore we can say that members use Docked bike more than Casuals.

There has been a rise of usage of classic bikes from 2021 followed by electric bikes.

Saving the dataset into a CSV file

cleaned_divvy %>%
  write.csv("final_divvy_dataset.csv")

Questions

1 How should you organize your data to perform analysis on it?

Data has been orgnised into a csv file and stored in a folder.

2 Has your data been properly formatted?

All columns contain correct data type.

3 What surprises did you discover in the data?

Distribution of the riders during the weekday and weekends and how climate had a huge impact on the riding.

4 What trends or relationships did you find in the data?

Volume of members is more than casual riders, No of rides done by the riders differ in weekdays and weekends, No of rides are more during the Summer months(June - August), Members tend to prefer docked bikes.

5 How will these insights help answer your business questions?

These will help provide behavioral and usage patterns of bike users which will be utilised by the marketing team to leverage more traffic.

Tasks Done

Aggregate your data so it’s useful and accessible.

Organize and format your data.

Perform calculations.

Identify trends and relationships.

Conclusion

From the findings we can derive that members use the bikes for routine activities like going to work or exercise as their distribution stays normal during the weekdays, where as casuals have a more curved distribution which peaks in weekends, through this we can derive that casuals use the bikes for recreational ways.

Questions

1 Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?

The above analysis answers and shows the difference between the two type of riders.

2 What story does your data tell?

The story which the data tells us is that members though being large in number and in executing higher no of rides their total ride time is less compared to casuals as members mostly use their bikes for routine activities and they also have set routes where as casuals usage booms in weekends and we can say that they use it for recreational ways.

3 How do your findings relate to your original question?

My findings help in identifying the behavioral and usage differences of both types of customers which will make it easier for conducting a digital campaign and to increase the conversion ratio.

4 Who is your audience? What is the best way to communicate with them?

Audience is my cyclistic marketing analytics team and Lily Moreno. Best way to communicate with them is through neat and clear visualizations embedded in a dashboard in a story format.

5 Can data visualization help you share your findings?

The primary purpose of visualising data is to share the findings in an understandable manner for the stakeholders.

6 Is your presentation accessible to your audience?

Yes, The colors used are pleasing, the font is readable and there is no clutter.

Tasks Done

Determine the best way to share your findings.

Create effective data visualizations.

Present your findings.

Ensure your work is accessible.

ACT

This phase needs to be executed by the marketing team of Cyclistic Co.

Questions

1 What is your final conclusion based on your analysis? and How could your team and business apply your insights?

Behvioral patterns and usage patterns of the users are different and those patterns can be targeted in ordered to make the services more desirable so that the user traffic increases multifold and so will the conversion ratio from casual to annual members.

2 What next steps would you or your stakeholders take based on your findings?

Analysis can be further done to the granular level by utilising the geographic data and marketing team can accordingly plan its strategies to leverage more users.

3 Is there additional data you could use to expand on your findings?

Geographical data, Additional Climate data, Routes data.

——————————————————————————-

Exploratory Data Analysis of Cyclisitc Co.

Akhelaaditya

8/10/2021

Scenario

About The Company

Characters and Teams

ASK

PREPARE

PROCESS

Loading the .csv files

Checking the structure of all dataframes to find any inconcistencies

Data Cleaning

Converting inconcistent columns by changing data type.

Parsing character to date time

Merging the dataframes

Eliminating Duplicates

Primary Insights from the dataset

Manipulating the data

ANALYZE

Saving the dataset into a CSV file

Conclusion

ACT

Exploratory Data Analysis of Cyclisitc Co.

Akhelaaditya

8/10/2021

Scenario

About The Company

Characters and Teams

ASK

PREPARE

PROCESS

Loading the .csv files

Checking the structure of all dataframes to find any inconcistencies

Data Cleaning

Converting inconcistent columns by changing data type.

Parsing character to date time

Merging the dataframes

Eliminating Duplicates

Primary Insights from the dataset

Manipulating the data

ANALYZE

Saving the dataset into a CSV file

SHARE

Conclusion

ACT