Scenario

A junior data analyst is working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. There are two types of riders-

The company’s future success depends on maximizing the number of annual memberships and needs to understand how casual riders and annual members use Cyclistic bikes differently. From these insights,your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Use the right tools and prepare a story/presentation for the stakeholders keeping the business task in mind.

Objectives of analysis

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

Preparing data

The data is on an AWS server where it is easily downloadable and named correctly. I have downloaded the previous 12 months data and stored it locally for the next steps in this process. It is organized by year and Fiscal Quarters. The data is reliable and original since it comes from the company.It is comprehensive, current, and cited.

Processing Data

Start with processing the data Calling the libraries required for this job

library(dplyr)
library(tidyverse)
library(skimr)
library(janitor)
library(readr)

Loading Cyclist Data in R

Because of limited RAM available on RStudio, I choose only 3 data sets, if I add more data sets, unfortunately the RStudio crash.

trip_2021_01 <- read_csv("202101-divvy-tripdata.csv")
trip_2021_02 <- read_csv("202102-divvy-tripdata.csv")
trip_2021_03 <- read_csv("202103-divvy-tripdata.csv")

Merging the data into single file

Before merging,lets check the structure of one file

str(trip_2021_01)
## spec_tbl_df [96,834 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:96834] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:96834] NA NA NA NA ...
##  $ end_station_id    : chr [1:96834] NA NA NA NA ...
##  $ start_lat         : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:96834] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Bind Together

Trip2021 <- rbind(trip_2021_01,trip_2021_02,trip_2021_03)

Chaging to suitable data types

When we saw the structure of the file earlier, we saw that start and end date were in character format. We need to change that to date and time format.

Trip2021$started_at = strptime(Trip2021$started_at,"%Y-%m-%d %H:%M:%S")
Trip2021$ended_at = strptime(Trip2021$ended_at,"%Y-%m-%d %H:%M:%S")

Check String

str(Trip2021)
## spec_tbl_df [374,952 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:374952] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr [1:374952] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXlt[1:374952], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXlt[1:374952], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr [1:374952] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:374952] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:374952] NA NA NA NA ...
##  $ end_station_id    : chr [1:374952] NA NA NA NA ...
##  $ start_lat         : num [1:374952] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:374952] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:374952] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:374952] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:374952] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Adding Trip Duration

Trip2021<-mutate(Trip2021,tripduration=difftime(ended_at,started_at,units = "secs"))

Lets Preview

head(Trip2021)
## # A tibble: 6 × 14
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 E19E6F1B8D4C4… electr… 2021-01-23 16:14:19 2021-01-23 16:24:44 Califo… 17660  
## 2 DC88F20C2C55F… electr… 2021-01-27 18:43:08 2021-01-27 18:47:12 Califo… 17660  
## 3 EC45C94683FE3… electr… 2021-01-21 22:35:54 2021-01-21 22:37:14 Califo… 17660  
## 4 4FA453A75AE37… electr… 2021-01-07 13:31:13 2021-01-07 13:42:55 Califo… 17660  
## 5 BE5E8EB4E7263… electr… 2021-01-23 02:24:02 2021-01-23 02:24:45 Califo… 17660  
## 6 5D8969F88C773… electr… 2021-01-09 14:24:07 2021-01-09 15:17:54 Califo… 17660  
## # … with 8 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, tripduration <drtn>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
## # ℹ Use `colnames()` to see all variable names

Lets filter ‘tripduaration’ greater than zero because is Viewing Trip data we found a lot of values less than zero

Trip2021 <- filter(Trip2021,Trip2021$tripduration>0)

Lets take a glimpse of data

glimpse(Trip2021)
## Rows: 374,930
## Columns: 14
## $ ride_id            <chr> "E19E6F1B8D4C42ED", "DC88F20C2C55F27F", "EC45C94683…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-01-23 16:14:19, 2021-01-27 18:43:08, 2021-01-…
## $ ended_at           <dttm> 2021-01-23 16:24:44, 2021-01-27 18:47:12, 2021-01-…
## $ start_station_name <chr> "California Ave & Cortez St", "California Ave & Cor…
## $ start_station_id   <chr> "17660", "17660", "17660", "17660", "17660", "17660…
## $ end_station_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Wood St & Augu…
## $ end_station_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "657", "13258",…
## $ start_lat          <dbl> 41.90034, 41.90033, 41.90031, 41.90040, 41.90033, 4…
## $ start_lng          <dbl> -87.69674, -87.69671, -87.69664, -87.69666, -87.696…
## $ end_lat            <dbl> 41.89000, 41.90000, 41.90000, 41.92000, 41.90000, 4…
## $ end_lng            <dbl> -87.72000, -87.69000, -87.70000, -87.69000, -87.700…
## $ member_casual      <chr> "member", "member", "member", "member", "casual", "…
## $ tripduration       <drtn> 625 secs, 244 secs, 80 secs, 702 secs, 43 secs, 32…

We also saw that some of the values are missing/blank. We need to remove this bad data.

Trip2021 <- Trip2021 %>% 
  na.omit  # Removing rows with NA values

For the final step,we will arrange our data in ascending order by date and then view our clean data

Trip2021<-mutate(Trip2021,weekday=weekdays(started_at))
Trip2021<-arrange(Trip2021,started_at)
View(Trip2021)

Exporting Data

write_csv(Trip2021,"Trip2021_v1.csv")

Great, now we are all set to analyze and visualize our data

##Analysis and Presentation of Final Results

All the data was transferred to Tableau, please check. After downloading the data we need to separate date and time to use in Tableau for analysis. I used Replace function in Spreadsheet to remove string from Stated at (date and time) and used text to column wizard in Data tab to separate date and time.