Introduction

This is a capstone project as part of my Google Data Analytics Professional Certificate course. The project will focus on analytical skills (analysis, data cleaning and visualization). For analysis / cleaning, R and R Studio was used. While for visualization and storytelling, Tableau Public was used.

For this project following data analysis steps will be followed:
• Ask
• Prepare
• Process
• Analyze
• Share
• Act

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.


Ask

Three questions will guide the future marketing program:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?

The director of marketing and your manager Lily Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently? (which is what the analysis will be concentrated on)


Key Task


• Objective is to build the best marketing strategy(s) for turning “casual” riders into annual “members” by analyzing how each group of customers use Cyclist bikes.
• Stakeholders for this project will be the Cyclistic executive team, Director of Marketing (Lily Moreno), and Marketing Analytics team Deliverable
• Answering the guiding question: How do annual members and casual riders use Cyclistic bikes differently?


Prepare


The data that was used for this project was for all of 2023 (12 months) data and was provided by Motivate International, Inc.

Due note that the data sets have a different name due to the fictionalize company (Cyclist company) being used for the Capstone project.

All trip data is in comma-delimited (.CSV) format. Column names “ride_id”, “rideable_type”, “started_at”, “ended_at”, “start_station_name”, “start_station_id”, “end_station_name”, “end_station_id”, “start_lat”, “start_lng”, “end_lat”, “end_lng”, “member_casual”. A total of 13 columns.

Data Description


The data contains the following columns:
ride_id (categorical): Unique number assigned to a ride trip.
rideable_type (categorical):Type of bike used during trip; standard two-wheel bike, reclining bike, hand tricycle, or cargo bike.
started_at (datetime): Start date and time for the trip
ended_at (datetime): End data and time for the trip
start_station_name (categorical): Name of the station where the trip started
start_station_id (categorical): Unique identification code assigned to the start station.
end_station_name (categorical): Name of the station where the trip ended.
end_station_id (categorical): Unique identification code assigned to the end station.
start_lat (numeric): Latitude coordinate of where the trip started.
start_lng (numeric): Longitude coordinate of where the trip started.
end_lat (numeric): Latitude coordinate of where the trip ended.
end_lng (numeric): Longitude coordinate of where the trip ended.
member_casual (categorical): Customer type; “member” = annual member, “casual” = casual rider.

Loading R Packages


As part of the analysis being done in R, below are the packages that will be used.

(.packages())
##  [1] "plyr"      "forcats"   "lubridate" "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"


Combining 12 Files into 1


The dataset that was provided was split into 1 file per month, for a total of 12 files that had to be combined into 1.

bikeshare_df <-list.files(path = "C:/Users/Fiona/Desktop/Code/Dataset",
                      pattern = "*.csv", full.names = FALSE) %>%
  lapply(read_csv) %>%
  bind_rows

write.csv(data_all, "cyclistic_202301-202312.csv", row.names = TRUE) 


Creation of Dataframe


Once all 12 CSV has been combined into 1, we can then create a dataframe out of it. This will allow for all 12 month period to be analyze / cleaned in one go.

bikeshare_df <- read_csv("cyclistic_202301-202312.csv")
## New names:
## Rows: 5719877 Columns: 14
## -- Column specification
## -------------------------------------------------------- Delimiter: "," chr
## (7): ride_id, rideable_type, start_station_name, start_station_id, end_... dbl
## (5): ...1, start_lat, start_lng, end_lat, end_lng dttm (2): started_at,
## ended_at
## i Use `spec()` to retrieve the full column specification for this data. i
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## * `` -> `...1`


Process


During this step, I will begin to clean and prepare the data for analysis.

Inspecting Various Aspect of the Dataframe


Below are some quick views of the head(), summary(), str()


Displays first 6 rows of the dataset

head(bikeshare_df)
## # A tibble: 6 x 14
##    ...1 ride_id          rideable_type started_at          ended_at           
##   <dbl> <chr>            <chr>         <dttm>              <dttm>             
## 1     1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2     2 13CB7EB698CEDB88 classic_bike  2023-01-10 15:37:36 2023-01-10 15:46:05
## 3     3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4     4 C90792D034FED968 classic_bike  2023-01-22 10:52:58 2023-01-22 11:01:44
## 5     5 3397017529188E8A classic_bike  2023-01-12 13:58:01 2023-01-12 14:13:20
## 6     6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## # i 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>


Displays all Col Names and data types

str(bikeshare_df)
## spc_tbl_ [5,719,877 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1              : num [1:5719877] 1 2 3 4 5 6 7 8 9 10 ...
##  $ ride_id           : chr [1:5719877] "F96D5A74A3E41399" "13CB7EB698CEDB88" "BD88A2E670661CE5" "C90792D034FED968" ...
##  $ rideable_type     : chr [1:5719877] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5719877], format: "2023-01-21 20:05:42" "2023-01-10 15:37:36" ...
##  $ ended_at          : POSIXct[1:5719877], format: "2023-01-21 20:16:33" "2023-01-10 15:46:05" ...
##  $ start_station_name: chr [1:5719877] "Lincoln Ave & Fullerton Ave" "Kimbark Ave & 53rd St" "Western Ave & Lunt Ave" "Kimbark Ave & 53rd St" ...
##  $ start_station_id  : chr [1:5719877] "TA1309000058" "TA1309000037" "RP-005" "TA1309000037" ...
##  $ end_station_name  : chr [1:5719877] "Hampden Ct & Diversey Ave" "Greenwood Ave & 47th St" "Valli Produce - Evanston Plaza" "Greenwood Ave & 47th St" ...
##  $ end_station_id    : chr [1:5719877] "202480.0" "TA1308000002" "599" "TA1308000002" ...
##  $ start_lat         : num [1:5719877] 41.9 41.8 42 41.8 41.8 ...
##  $ start_lng         : num [1:5719877] -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ end_lat           : num [1:5719877] 41.9 41.8 42 41.8 41.8 ...
##  $ end_lng           : num [1:5719877] -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr [1:5719877] "member" "member" "casual" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>


Statistical Summary of the dataset

summary(bikeshare_df)
##       ...1           ride_id          rideable_type     
##  Min.   :      1   Length:5719877     Length:5719877    
##  1st Qu.:1429970   Class :character   Class :character  
##  Median :2859939   Mode  :character   Mode  :character  
##  Mean   :2859939                                        
##  3rd Qu.:4289908                                        
##  Max.   :5719877                                        
##                                                         
##    started_at                     ended_at                   start_station_name
##  Min.   :2023-01-01 00:01:58   Min.   :2023-01-01 00:02:41   Length:5719877    
##  1st Qu.:2023-05-21 12:50:44   1st Qu.:2023-05-21 13:14:09   Class :character  
##  Median :2023-07-20 18:02:50   Median :2023-07-20 18:19:47   Mode  :character  
##  Mean   :2023-07-16 10:27:50   Mean   :2023-07-16 10:46:00                     
##  3rd Qu.:2023-09-16 20:08:49   3rd Qu.:2023-09-16 20:28:10                     
##  Max.   :2023-12-31 23:59:38   Max.   :2024-01-01 23:50:51                     
##                                                                                
##  start_station_id   end_station_name   end_station_id       start_lat    
##  Length:5719877     Length:5719877     Length:5719877     Min.   :41.63  
##  Class :character   Class :character   Class :character   1st Qu.:41.88  
##  Mode  :character   Mode  :character   Mode  :character   Median :41.90  
##                                                           Mean   :41.90  
##                                                           3rd Qu.:41.93  
##                                                           Max.   :42.07  
##                                                                          
##    start_lng         end_lat         end_lng       member_casual     
##  Min.   :-87.94   Min.   : 0.00   Min.   :-88.16   Length:5719877    
##  1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :-87.64   Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :-87.65   Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :-87.46   Max.   :42.18   Max.   :  0.00                     
##                   NA's   :6990    NA's   :6990


I will now add Columns for Date, Month, Year, Day of the Week, Ride Length
In order to analyze ride usage based on the month, day, and year, we need to add columns for each.

bikeshare_df$date <- as.Date(bikeshare_df$started_at)
bikeshare_df$month <- format(as.Date(bikeshare_df$date), "%m")
bikeshare_df$day <- format(as.Date(bikeshare_df$date), "%d")
bikeshare_df$year <- format(as.Date(bikeshare_df$date), "%Y")
bikeshare_df$day_of_week <- format(as.Date(bikeshare_df$date), "%A")
bikeshare_df$ride_length <- difftime(bikeshare_df$ended_at,bikeshare_df$started_at)


Next is to convert the ride length to numeric for calculation

is.factor(bikeshare_df$ride_length)
## [1] FALSE
bikeshare_df$ride_length <- as.numeric(as.character(bikeshare_df$ride_length))
is.numeric(bikeshare_df$ride_length)
## [1] TRUE


Now we will begin to remove the “bad” data

The data frame includes entries when bikes were taken out of docks and checked for quality or ride_length was negative. Thus, We will create a new version of the data frame since data is being removed.

bikeshare_df_v2 <- bikeshare_df[!(bikeshare_df$ride_length<0),] # removes neg values
bikeshare_df_v2 <- mutate(bikeshare_df_v2, ...1 = NULL) # removes extra col


Inspect the new dataframe

head(bikeshare_df_v2)
## # A tibble: 6 x 19
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2 13CB7EB698CEDB88 classic_bike  2023-01-10 15:37:36 2023-01-10 15:46:05
## 3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4 C90792D034FED968 classic_bike  2023-01-22 10:52:58 2023-01-22 11:01:44
## 5 3397017529188E8A classic_bike  2023-01-12 13:58:01 2023-01-12 14:13:20
## 6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## # i 15 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   date <date>, month <chr>, day <chr>, year <chr>, day_of_week <chr>,
## #   ride_length <dbl>
colnames(bikeshare_df_v2)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"      "date"               "month"             
## [16] "day"                "year"               "day_of_week"       
## [19] "ride_length"
View(bikeshare_df_v2)


Check for number of rows to make sure data was removed

nrow(bikeshare_df) #5719877 rows before removing values
## [1] 5719877
nrow(bikeshare_df_v2) #5719605
## [1] 5719605

A total of 272 rows (entries were removed)


Make sure value does not exist in data set

any(bikeshare_df_v2$start_station_name == "HQ QR") 
## [1] NA


Check for negative values in the dataset

any(bikeshare_df_v2$ride_length < 0)
## [1] FALSE


Creating an extract of the clean data

write_csv(bikeshare_df_v2, "2023-12-31_cyclistic_clean_data.csv")#Clean data extract


As there are to many rows to analyze in Tableau Public, we will be creating subset file for use with the following variables.

myvars <- c("ride_id", "rideable_type", "member_casual", "date", "month", 
            "day", "year", "day_of_week", "ride_length", "start_station_name",
            "end_station_name")


Storing selected columns in a data frame and writing it to a subset file

bikeshare_subset <- bikeshare_df_v2[myvars]


write_csv(bikeshare_subset, "2023-12-31_cyclistic_subset.csv")


Analyze / Share / Act


For the Analyze / Share / Act phase of this project, we will be referring to Tableau Public link here. Or below (For the below version I apologize as the window may be a bit to small for the visuals).


## Thank you I wanted to experiment with using both R and Tableau in one file. This analyis was quite interesting to do when trying to incorporate a Business Intelligence tool like Tableau into R. The story flows when visualizations can be seen as the “Story” is being presented throughout.