A junior data analyst is working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. There are two types of riders-
Customers who purchase single-ride or full-day passes are referred to as casual riders
Customers who purchase annual memberships are Cyclistic members
The company’s future success depends on maximizing the number of annual memberships and needs to understand how casual riders and annual members use Cyclistic bikes differently. From these insights,your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Use the right tools and prepare a story/presentation for the stakeholders keeping the business task in mind.
The data is on an AWS server where it is easily downloadable and named correctly. I have downloaded the previous 12 months data and stored it locally for the next steps in this process. It is organized by year and Fiscal Quarters. The data is reliable and original since it comes from the company.It is comprehensive, current, and cited.
Start with processing the data Calling the libraries required for this job
library(dplyr)
library(tidyverse)
library(skimr)
library(janitor)
library(readr)
Because of limited RAM available on RStudio, I choose only 3 data sets, if I add more data sets, unfortunately the RStudio crash.
trip_2021_01 <- read_csv("202101-divvy-tripdata.csv")
trip_2021_02 <- read_csv("202102-divvy-tripdata.csv")
trip_2021_03 <- read_csv("202103-divvy-tripdata.csv")
Before merging,lets check the structure of one file
str(trip_2021_01)
## spec_tbl_df [96,834 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:96834] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:96834] NA NA NA NA ...
## $ end_station_id : chr [1:96834] NA NA NA NA ...
## $ start_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:96834] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Trip2021 <- rbind(trip_2021_01,trip_2021_02,trip_2021_03)
When we saw the structure of the file earlier, we saw that start and end date were in character format. We need to change that to date and time format.
Trip2021$started_at = strptime(Trip2021$started_at,"%Y-%m-%d %H:%M:%S")
Trip2021$ended_at = strptime(Trip2021$ended_at,"%Y-%m-%d %H:%M:%S")
str(Trip2021)
## spec_tbl_df [374,952 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:374952] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:374952] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXlt[1:374952], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXlt[1:374952], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:374952] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:374952] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:374952] NA NA NA NA ...
## $ end_station_id : chr [1:374952] NA NA NA NA ...
## $ start_lat : num [1:374952] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:374952] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:374952] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:374952] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:374952] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Trip2021<-mutate(Trip2021,tripduration=difftime(ended_at,started_at,units = "secs"))
head(Trip2021)
## # A tibble: 6 × 14
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 E19E6F1B8D4C4… electr… 2021-01-23 16:14:19 2021-01-23 16:24:44 Califo… 17660
## 2 DC88F20C2C55F… electr… 2021-01-27 18:43:08 2021-01-27 18:47:12 Califo… 17660
## 3 EC45C94683FE3… electr… 2021-01-21 22:35:54 2021-01-21 22:37:14 Califo… 17660
## 4 4FA453A75AE37… electr… 2021-01-07 13:31:13 2021-01-07 13:42:55 Califo… 17660
## 5 BE5E8EB4E7263… electr… 2021-01-23 02:24:02 2021-01-23 02:24:45 Califo… 17660
## 6 5D8969F88C773… electr… 2021-01-09 14:24:07 2021-01-09 15:17:54 Califo… 17660
## # … with 8 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, tripduration <drtn>, and abbreviated variable names
## # ¹rideable_type, ²start_station_name, ³start_station_id
## # ℹ Use `colnames()` to see all variable names
Lets filter ‘tripduaration’ greater than zero because is Viewing Trip data we found a lot of values less than zero
Trip2021 <- filter(Trip2021,Trip2021$tripduration>0)
Lets take a glimpse of data
glimpse(Trip2021)
## Rows: 374,930
## Columns: 14
## $ ride_id <chr> "E19E6F1B8D4C42ED", "DC88F20C2C55F27F", "EC45C94683…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2021-01-23 16:14:19, 2021-01-27 18:43:08, 2021-01-…
## $ ended_at <dttm> 2021-01-23 16:24:44, 2021-01-27 18:47:12, 2021-01-…
## $ start_station_name <chr> "California Ave & Cortez St", "California Ave & Cor…
## $ start_station_id <chr> "17660", "17660", "17660", "17660", "17660", "17660…
## $ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Wood St & Augu…
## $ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "657", "13258",…
## $ start_lat <dbl> 41.90034, 41.90033, 41.90031, 41.90040, 41.90033, 4…
## $ start_lng <dbl> -87.69674, -87.69671, -87.69664, -87.69666, -87.696…
## $ end_lat <dbl> 41.89000, 41.90000, 41.90000, 41.92000, 41.90000, 4…
## $ end_lng <dbl> -87.72000, -87.69000, -87.70000, -87.69000, -87.700…
## $ member_casual <chr> "member", "member", "member", "member", "casual", "…
## $ tripduration <drtn> 625 secs, 244 secs, 80 secs, 702 secs, 43 secs, 32…
We also saw that some of the values are missing/blank. We need to remove this bad data.
Trip2021 <- Trip2021 %>%
na.omit # Removing rows with NA values
For the final step,we will arrange our data in ascending order by date and then view our clean data
Trip2021<-mutate(Trip2021,weekday=weekdays(started_at))
Trip2021<-arrange(Trip2021,started_at)
View(Trip2021)
write_csv(Trip2021,"Trip2021_v1.csv")
Great, now we are all set to analyze and visualize our data
##Analysis and Presentation of Final Results
All the data was transferred to Tableau, please check. After downloading the data we need to separate date and time to use in Tableau for analysis. I used Replace function in Spreadsheet to remove string from Stated at (date and time) and used text to column wizard in Data tab to separate date and time.