This is a capstone project as part of my Google Data Analytics Professional Certificate course. The project will focus on analytical skills (analysis, data cleaning and visualization). For analysis / cleaning, R and R Studio was used. While for visualization and storytelling, Tableau Public was used.
For this project following data analysis steps will be
followed:
• Ask
• Prepare
• Process
•
Analyze
• Share
• Act
Scenario
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Three questions will guide the future marketing program:
1. How
do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to
become members?
The director of marketing and your manager Lily Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently? (which is what the analysis will be concentrated on)
Key Task
• Objective is to build the best marketing strategy(s) for
turning “casual” riders into annual “members” by analyzing how each
group of customers use Cyclist bikes.
• Stakeholders for this
project will be the Cyclistic executive team, Director of Marketing
(Lily Moreno), and Marketing Analytics team Deliverable
• Answering
the guiding question: How do annual members and casual riders use
Cyclistic bikes differently?
The data that was used for this project was for all of 2023 (12 months) data and was provided by Motivate International, Inc.
Due note that the data sets have a different name due to the fictionalize company (Cyclist company) being used for the Capstone project.
All trip data is in comma-delimited (.CSV) format. Column names
“ride_id”, “rideable_type”, “started_at”, “ended_at”,
“start_station_name”, “start_station_id”, “end_station_name”,
“end_station_id”, “start_lat”, “start_lng”, “end_lat”, “end_lng”,
“member_casual”. A total of 13 columns.
Data Description
The data contains the following columns:
ride_id
(categorical): Unique number assigned to a ride trip.
rideable_type (categorical):Type of bike used during
trip; standard two-wheel bike, reclining bike, hand tricycle, or cargo
bike.
started_at (datetime): Start date and time
for the trip
ended_at (datetime): End data and
time for the trip
start_station_name
(categorical): Name of the station where the trip started
start_station_id (categorical): Unique identification
code assigned to the start station.
end_station_name
(categorical): Name of the station where the trip ended.
end_station_id (categorical): Unique identification
code assigned to the end station.
start_lat
(numeric): Latitude coordinate of where the trip started.
start_lng (numeric): Longitude coordinate of where the
trip started.
end_lat (numeric): Latitude
coordinate of where the trip ended.
end_lng
(numeric): Longitude coordinate of where the trip ended.
member_casual (categorical): Customer type; “member” =
annual member, “casual” = casual rider.
Loading R Packages
As part of the analysis being done in R, below are the packages that will be used.
(.packages())
## [1] "plyr" "forcats" "lubridate" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
Combining 12 Files into 1
The dataset that was provided was split into 1 file per month,
for a total of 12 files that had to be combined into 1.
bikeshare_df <-list.files(path = "C:/Users/Fiona/Desktop/Code/Dataset",
pattern = "*.csv", full.names = FALSE) %>%
lapply(read_csv) %>%
bind_rows
write.csv(data_all, "cyclistic_202301-202312.csv", row.names = TRUE)
Creation of Dataframe
Once all 12 CSV has been combined into 1, we can then create a
dataframe out of it. This will allow for all 12 month period to be
analyze / cleaned in one go.
bikeshare_df <- read_csv("cyclistic_202301-202312.csv")
## New names:
## Rows: 5719877 Columns: 14
## -- Column specification
## -------------------------------------------------------- Delimiter: "," chr
## (7): ride_id, rideable_type, start_station_name, start_station_id, end_... dbl
## (5): ...1, start_lat, start_lng, end_lat, end_lng dttm (2): started_at,
## ended_at
## i Use `spec()` to retrieve the full column specification for this data. i
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## * `` -> `...1`
During this step, I will begin to clean and prepare the data for
analysis.
Inspecting Various Aspect of the Dataframe
Below are some quick views of the head(), summary(), str()
Displays first 6 rows of the dataset
head(bikeshare_df)
## # A tibble: 6 x 14
## ...1 ride_id rideable_type started_at ended_at
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2 2 13CB7EB698CEDB88 classic_bike 2023-01-10 15:37:36 2023-01-10 15:46:05
## 3 3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4 4 C90792D034FED968 classic_bike 2023-01-22 10:52:58 2023-01-22 11:01:44
## 5 5 3397017529188E8A classic_bike 2023-01-12 13:58:01 2023-01-12 14:13:20
## 6 6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## # i 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
Displays all Col Names and data types
str(bikeshare_df)
## spc_tbl_ [5,719,877 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:5719877] 1 2 3 4 5 6 7 8 9 10 ...
## $ ride_id : chr [1:5719877] "F96D5A74A3E41399" "13CB7EB698CEDB88" "BD88A2E670661CE5" "C90792D034FED968" ...
## $ rideable_type : chr [1:5719877] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5719877], format: "2023-01-21 20:05:42" "2023-01-10 15:37:36" ...
## $ ended_at : POSIXct[1:5719877], format: "2023-01-21 20:16:33" "2023-01-10 15:46:05" ...
## $ start_station_name: chr [1:5719877] "Lincoln Ave & Fullerton Ave" "Kimbark Ave & 53rd St" "Western Ave & Lunt Ave" "Kimbark Ave & 53rd St" ...
## $ start_station_id : chr [1:5719877] "TA1309000058" "TA1309000037" "RP-005" "TA1309000037" ...
## $ end_station_name : chr [1:5719877] "Hampden Ct & Diversey Ave" "Greenwood Ave & 47th St" "Valli Produce - Evanston Plaza" "Greenwood Ave & 47th St" ...
## $ end_station_id : chr [1:5719877] "202480.0" "TA1308000002" "599" "TA1308000002" ...
## $ start_lat : num [1:5719877] 41.9 41.8 42 41.8 41.8 ...
## $ start_lng : num [1:5719877] -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num [1:5719877] 41.9 41.8 42 41.8 41.8 ...
## $ end_lng : num [1:5719877] -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ member_casual : chr [1:5719877] "member" "member" "casual" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Statistical Summary of the dataset
summary(bikeshare_df)
## ...1 ride_id rideable_type
## Min. : 1 Length:5719877 Length:5719877
## 1st Qu.:1429970 Class :character Class :character
## Median :2859939 Mode :character Mode :character
## Mean :2859939
## 3rd Qu.:4289908
## Max. :5719877
##
## started_at ended_at start_station_name
## Min. :2023-01-01 00:01:58 Min. :2023-01-01 00:02:41 Length:5719877
## 1st Qu.:2023-05-21 12:50:44 1st Qu.:2023-05-21 13:14:09 Class :character
## Median :2023-07-20 18:02:50 Median :2023-07-20 18:19:47 Mode :character
## Mean :2023-07-16 10:27:50 Mean :2023-07-16 10:46:00
## 3rd Qu.:2023-09-16 20:08:49 3rd Qu.:2023-09-16 20:28:10
## Max. :2023-12-31 23:59:38 Max. :2024-01-01 23:50:51
##
## start_station_id end_station_name end_station_id start_lat
## Length:5719877 Length:5719877 Length:5719877 Min. :41.63
## Class :character Class :character Class :character 1st Qu.:41.88
## Mode :character Mode :character Mode :character Median :41.90
## Mean :41.90
## 3rd Qu.:41.93
## Max. :42.07
##
## start_lng end_lat end_lng member_casual
## Min. :-87.94 Min. : 0.00 Min. :-88.16 Length:5719877
## 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :-87.64 Median :41.90 Median :-87.64 Mode :character
## Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :-87.46 Max. :42.18 Max. : 0.00
## NA's :6990 NA's :6990
I will now add Columns for Date, Month, Year, Day of the Week,
Ride Length
In order to analyze ride usage based on the month, day,
and year, we need to add columns for each.
bikeshare_df$date <- as.Date(bikeshare_df$started_at)
bikeshare_df$month <- format(as.Date(bikeshare_df$date), "%m")
bikeshare_df$day <- format(as.Date(bikeshare_df$date), "%d")
bikeshare_df$year <- format(as.Date(bikeshare_df$date), "%Y")
bikeshare_df$day_of_week <- format(as.Date(bikeshare_df$date), "%A")
bikeshare_df$ride_length <- difftime(bikeshare_df$ended_at,bikeshare_df$started_at)
Next is to convert the ride length to numeric for calculation
is.factor(bikeshare_df$ride_length)
## [1] FALSE
bikeshare_df$ride_length <- as.numeric(as.character(bikeshare_df$ride_length))
is.numeric(bikeshare_df$ride_length)
## [1] TRUE
Now we will begin to remove the “bad” data
The data frame includes entries when bikes were taken out of docks and checked for quality or ride_length was negative. Thus, We will create a new version of the data frame since data is being removed.
bikeshare_df_v2 <- bikeshare_df[!(bikeshare_df$ride_length<0),] # removes neg values
bikeshare_df_v2 <- mutate(bikeshare_df_v2, ...1 = NULL) # removes extra col
Inspect the new dataframe
head(bikeshare_df_v2)
## # A tibble: 6 x 19
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2 13CB7EB698CEDB88 classic_bike 2023-01-10 15:37:36 2023-01-10 15:46:05
## 3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4 C90792D034FED968 classic_bike 2023-01-22 10:52:58 2023-01-22 11:01:44
## 5 3397017529188E8A classic_bike 2023-01-12 13:58:01 2023-01-12 14:13:20
## 6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## # i 15 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # date <date>, month <chr>, day <chr>, year <chr>, day_of_week <chr>,
## # ride_length <dbl>
colnames(bikeshare_df_v2)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "date" "month"
## [16] "day" "year" "day_of_week"
## [19] "ride_length"
View(bikeshare_df_v2)
Check for number of rows to make sure data was removed
nrow(bikeshare_df) #5719877 rows before removing values
## [1] 5719877
nrow(bikeshare_df_v2) #5719605
## [1] 5719605
A total of 272 rows (entries were removed)
Make sure value does not exist in data set
any(bikeshare_df_v2$start_station_name == "HQ QR")
## [1] NA
Check for negative values in the dataset
any(bikeshare_df_v2$ride_length < 0)
## [1] FALSE
Creating an extract of the clean data
write_csv(bikeshare_df_v2, "2023-12-31_cyclistic_clean_data.csv")#Clean data extract
As there are to many rows to analyze in Tableau Public, we will
be creating subset file for use with the following variables.
myvars <- c("ride_id", "rideable_type", "member_casual", "date", "month",
"day", "year", "day_of_week", "ride_length", "start_station_name",
"end_station_name")
Storing selected columns in a data frame and writing it to a
subset file
bikeshare_subset <- bikeshare_df_v2[myvars]
write_csv(bikeshare_subset, "2023-12-31_cyclistic_subset.csv")