Capstone Case 1 Google Cert

About this Project

This case study completes the capstone requirement for the Google Data Analytics Certificate currently offered on Coursera. The purpose is to use the tools of choice to provide recommendations by following the six steps of the data analysis process: Ask, Prepare, Process, Analyze, Share and Act.

                                   **Scenario**

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

                                   **Ask**

Three questions will guide the future marketing program:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

                                **Prepare**

To keep this report concise, the steps of the analysis are laid out sequentially. Each step is accompanied by the code needed to process the available data.

Step 1: Load the relevant libraries and set up the environment for the subsequent functions to run

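library(tidyverse)   # data wrangling and plotting (attaches ggplot2, dplyr, tidyr, readr)
library(ggplot2)     # plotting
library(lubridate)   # date-time helpers
library(dplyr)       # data manipulation: mutate(), bind_rows()
library(readr)       # read_csv()
library(data.table)  # fast tabular data tools
library(tidyr)       # data tidying
library(here)        # project-relative file paths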

                                 **Process**

Step 2: Read the CSV files into separate data frames

The code chunk below reads each monthly CSV file into its own data frame. The column specification that read_csv() prints for each file should be inspected to confirm that the data types are consistent across files. The types that appear are chr (character), dbl (double) and dttm (datetime).

# Read each monthly trip file (June 2020 to May 2021) into its own data frame.
X202006_divvy_tripdata <- read_csv(here("202006-divvy-tripdata.csv"))
X202007_divvy_tripdata <- read_csv(here("202007-divvy-tripdata.csv"))
X202008_divvy_tripdata <- read_csv(here("202008-divvy-tripdata.csv"))
X202009_divvy_tripdata <- read_csv(here("202009-divvy-tripdata.csv"))
X202010_divvy_tripdata <- read_csv(here("202010-divvy-tripdata.csv"))
# The next two object names do not match their contents:
# X202111_divvy_tripdata holds December 2020 and X202012_divvy_tripdata holds
# November 2020. Steps 3 and 4 refer to the objects by these names.
X202111_divvy_tripdata <- read_csv(here("202012-divvy-tripdata.csv"))
X202012_divvy_tripdata <- read_csv(here("202011-divvy-tripdata.csv"))
# Each 2021 object reads the file for the month given in its name.
X202101_divvy_tripdata <- read_csv(here("202101-divvy-tripdata.csv"))
X202102_divvy_tripdata <- read_csv(here("202102-divvy-tripdata.csv"))
X202103_divvy_tripdata <- read_csv(here("202103-divvy-tripdata.csv"))
X202104_divvy_tripdata <- read_csv(here("202104-divvy-tripdata.csv"))
X202105_divvy_tripdata <- read_csv(here("202105-divvy-tripdata.csv"))
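
Rather than comparing the printed column specifications by eye, the column types can also be compared programmatically. The sketch below is only an illustration: `june_sample` and `december_sample` are small invented data frames standing in for two monthly files, and `compare_column_types()` is a hypothetical helper, not part of any package. It assumes every data frame has the same columns.

```r
# Invented toy data standing in for two monthly files (not the real trip data).
june_sample <- data.frame(
  ride_id = c("A1", "B2"),
  start_station_id = c(117, 41),          # stored as a number
  stringsAsFactors = FALSE
)
december_sample <- data.frame(
  ride_id = c("C3", "D4"),
  start_station_id = c("X100", "X200"),   # stored as text
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a matrix with one row per column name and one
# column per data frame, each cell holding the class of that column.
compare_column_types <- function(frames) {
  sapply(frames, function(df) sapply(df, function(col) class(col)[1]))
}

type_table <- compare_column_types(
  list(june = june_sample, december = december_sample)
)
print(type_table)

# Names of the columns whose class differs between the data frames.
mismatched <- rownames(type_table)[
  apply(type_table, 1, function(x) length(unique(x)) > 1)
]
print(mismatched)
```

Passing the real monthly data frames in a named list instead of the toy ones would flag any column that needs typecasting before the files are combined.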

Step 3: Typecast columns with inconsistent data types

The variables “start_station_id” and “end_station_id” are not typed consistently across the monthly files: some files store them as numbers and others as text. Because these values are identifiers rather than measures, it makes more sense to store them as characters. We use the mutate() function to convert both columns to the “character” type in each affected data frame, then combine the results into a new data frame called “Combined_Clean_Data”. The str() output below the chunk verifies that both columns are now of type chr:

# Cast both station ID columns to character in each data frame that needs it.
X202006_divvy_tripdata_Clean <- X202006_divvy_tripdata %>% mutate(start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
X202007_divvy_tripdata_Clean <- X202007_divvy_tripdata %>% mutate(start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
X202008_divvy_tripdata_Clean <- X202008_divvy_tripdata %>% mutate(start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
X202009_divvy_tripdata_Clean <- X202009_divvy_tripdata %>% mutate(start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
X202010_divvy_tripdata_Clean <- X202010_divvy_tripdata %>% mutate(start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
X202012_divvy_tripdata_Clean <- X202012_divvy_tripdata %>% mutate(start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))

# Stack the cleaned data frames into one.
Combined_Clean_Data <- bind_rows(X202006_divvy_tripdata_Clean, X202007_divvy_tripdata_Clean, X202008_divvy_tripdata_Clean, X202009_divvy_tripdata_Clean, X202010_divvy_tripdata_Clean, X202012_divvy_tripdata_Clean)

str(Combined_Clean_Data)

## tibble [2,698,173 × 13] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:2698173] "8CD5DE2C2B6C4CFC" "9A191EB2C751D85D" "F37D14B0B5659BCF" "C41237B506E85FA1" ...
##  $ rideable_type     : chr [1:2698173] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : POSIXct[1:2698173], format: "2020-06-13 23:24:48" "2020-06-26 07:26:10" ...
##  $ ended_at          : POSIXct[1:2698173], format: "2020-06-13 23:36:55" "2020-06-26 07:31:58" ...
##  $ start_station_name: chr [1:2698173] "Wilton Ave & Belmont Ave" "Federal St & Polk St" "Daley Center Plaza" "Broadway & Cornelia Ave" ...
##  $ start_station_id  : chr [1:2698173] "117" "41" "81" "303" ...
##  $ end_station_name  : chr [1:2698173] "Damen Ave & Clybourn Ave" "Daley Center Plaza" "State St & Harrison St" "Broadway & Berwyn Ave" ...
##  $ end_station_id    : chr [1:2698173] "163" "81" "5" "294" ...
##  $ start_lat         : num [1:2698173] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:2698173] -87.7 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:2698173] 41.9 41.9 41.9 42 41.9 ...
##  $ end_lng           : num [1:2698173] -87.7 -87.6 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr [1:2698173] "casual" "member" "member" "casual" ...
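
The same cast can be written once and applied to every monthly data frame. The following is a minimal sketch, assuming dplyr is installed; `month_a` and `month_b` are invented toy data frames standing in for the monthly files, and `cast_station_ids()` is a hypothetical helper name.

```r
library(dplyr)

# Invented toy data standing in for two monthly files.
month_a <- data.frame(
  ride_id = c("A1", "B2"),
  start_station_id = c(117, 41),
  end_station_id = c(163, 81),
  stringsAsFactors = FALSE
)
month_b <- data.frame(
  ride_id = "C3",
  start_station_id = 303,
  end_station_id = 294,
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert both station ID columns to character.
cast_station_ids <- function(df) {
  df %>%
    mutate(across(c(start_station_id, end_station_id), as.character))
}

# Apply the helper to each data frame, then stack the results.
combined <- list(month_a, month_b) %>%
  lapply(cast_station_ids) %>%
  bind_rows()

str(combined)
```

Keeping the data frames in a list means the cast is defined in one place, which reduces the chance of a copy-and-paste slip in a single month.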

Step 4: Combining the Clean Data

Now that the station ID columns are consistent, we can combine the data frames that needed cleaning with those that were already consistent. The result is called “Main_Trips_Dataframe”.

# Stack the cleaned data with the remaining monthly data frames.
Main_Trips_Dataframe <- bind_rows(Combined_Clean_Data,X202105_divvy_tripdata,X202104_divvy_tripdata,X202103_divvy_tripdata,X202102_divvy_tripdata,X202101_divvy_tripdata,X202111_divvy_tripdata)

# Remove the intermediate objects to free memory.
rm(Combined_Clean_Data,X202006_divvy_tripdata,X202007_divvy_tripdata,X202008_divvy_tripdata,X202009_divvy_tripdata,X202010_divvy_tripdata,X202111_divvy_tripdata,X202012_divvy_tripdata,X202101_divvy_tripdata,X202102_divvy_tripdata,X202103_divvy_tripdata,X202104_divvy_tripdata,X202105_divvy_tripdata)

str(Main_Trips_Dataframe)

## tibble [3,487,611 × 13] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:3487611] "8CD5DE2C2B6C4CFC" "9A191EB2C751D85D" "F37D14B0B5659BCF" "C41237B506E85FA1" ...
##  $ rideable_type     : chr [1:3487611] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : POSIXct[1:3487611], format: "2020-06-13 23:24:48" "2020-06-26 07:26:10" ...
##  $ ended_at          : POSIXct[1:3487611], format: "2020-06-13 23:36:55" "2020-06-26 07:31:58" ...
##  $ start_station_name: chr [1:3487611] "Wilton Ave & Belmont Ave" "Federal St & Polk St" "Daley Center Plaza" "Broadway & Cornelia Ave" ...
##  $ start_station_id  : chr [1:3487611] "117" "41" "81" "303" ...
##  $ end_station_name  : chr [1:3487611] "Damen Ave & Clybourn Ave" "Daley Center Plaza" "State St & Harrison St" "Broadway & Berwyn Ave" ...
##  $ end_station_id    : chr [1:3487611] "163" "81" "5" "294" ...
##  $ start_lat         : num [1:3487611] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:3487611] -87.7 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:3487611] 41.9 41.9 41.9 42 41.9 ...
##  $ end_lng           : num [1:3487611] -87.7 -87.6 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr [1:3487611] "casual" "member" "member" "casual" ...

Visual Analysis 1 - Customer Type vs. Number of Rides

The first visual analysis shows how the two customer types use the service. The most popular type of bike for both casual riders and members appears to be the docked bike.

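# Bars count rides per customer type; the fill splits each bar by bike type
# so that the most popular bike can be read from the chart.
ggplot(data = Main_Trips_Dataframe, mapping = aes(x = member_casual, fill = rideable_type)) +
  geom_bar() +
  ylab("Number_Of_Rides")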
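
The chart can be complemented by a small table of the underlying counts. The sketch below shows one way to tabulate rides per customer type and bike type with dplyr. `toy_trips` is an invented sample, not the Cyclistic data, so the numbers it prints say nothing about real ridership.

```r
library(dplyr)

# Invented toy data with the two columns used by the chart.
toy_trips <- data.frame(
  member_casual = c("casual", "member", "member", "casual", "member"),
  rideable_type = c("docked_bike", "docked_bike", "electric_bike",
                    "docked_bike", "docked_bike"),
  stringsAsFactors = FALSE
)

ride_counts <- toy_trips %>%
  # One row per customer type and bike type, with the number of rides.
  count(member_casual, rideable_type, name = "Number_Of_Rides") %>%
  # Share of each bike type within its customer type.
  group_by(member_casual) %>%
  mutate(share = Number_Of_Rides / sum(Number_Of_Rides)) %>%
  ungroup()

print(ride_counts)
```

Replacing `toy_trips` with `Main_Trips_Dataframe` would give the counts behind the bars.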

Visual Analysis 2 - Weather Season vs. Number of Rides

The first step for this analysis is to add a new column to the “Main_Trips_Dataframe” object. The goal is to visualize how the weather affects ridership for both customer types. We therefore create a column named “Seasons_Chicago” of type chr and populate it from the month of “started_at” using a conditional mapping: September to November is Fall, March to May is Spring, June to August is Summer, and December to February is Winter. Any row whose month cannot be determined keeps the placeholder “empty”. We then call str() to verify that the new column has the correct data type, and unique() to check which seasons appear in it.

# Map the month number of each ride's start time to a Chicago season.
Main_Trips_Dataframe <- Main_Trips_Dataframe %>%
  mutate(Seasons_Chicago = case_when(
    lubridate::month(started_at) %in% c(9, 10, 11) ~ "Fall",
    lubridate::month(started_at) %in% c(3, 4, 5)   ~ "Spring",
    lubridate::month(started_at) %in% c(6, 7, 8)   ~ "Summer",
    lubridate::month(started_at) %in% c(12, 1, 2)  ~ "Winter",
    TRUE                                           ~ "empty"  # placeholder for missing dates
  ))

str(Main_Trips_Dataframe)

## tibble [3,487,611 × 14] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:3487611] "8CD5DE2C2B6C4CFC" "9A191EB2C751D85D" "F37D14B0B5659BCF" "C41237B506E85FA1" ...
##  $ rideable_type     : chr [1:3487611] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : POSIXct[1:3487611], format: "2020-06-13 23:24:48" "2020-06-26 07:26:10" ...
##  $ ended_at          : POSIXct[1:3487611], format: "2020-06-13 23:36:55" "2020-06-26 07:31:58" ...
##  $ start_station_name: chr [1:3487611] "Wilton Ave & Belmont Ave" "Federal St & Polk St" "Daley Center Plaza" "Broadway & Cornelia Ave" ...
##  $ start_station_id  : chr [1:3487611] "117" "41" "81" "303" ...
##  $ end_station_name  : chr [1:3487611] "Damen Ave & Clybourn Ave" "Daley Center Plaza" "State St & Harrison St" "Broadway & Berwyn Ave" ...
##  $ end_station_id    : chr [1:3487611] "163" "81" "5" "294" ...
##  $ start_lat         : num [1:3487611] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:3487611] -87.7 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:3487611] 41.9 41.9 41.9 42 41.9 ...
##  $ end_lng           : num [1:3487611] -87.7 -87.6 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr [1:3487611] "casual" "member" "member" "casual" ...
##  $ Seasons_Chicago   : chr [1:3487611] "Summer" "Summer" "Summer" "Summer" ...

unique(Main_Trips_Dataframe$Seasons_Chicago)

## [1] "Summer" "Fall"   "Winter"
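
`unique()` only shows which labels occur, not how many rides fall in each month. A per-month tally is one way to confirm that every month of the study period is represented. The sketch below uses base R's `format()` on a few invented timestamps (`toy_started_at` is hypothetical), so it does not depend on the locale's month names.

```r
# Invented start times standing in for the started_at column.
toy_started_at <- as.POSIXct(
  c("2020-06-13 23:24:48", "2020-06-26 07:26:10",
    "2020-11-02 08:00:00", "2021-03-15 17:30:00"),
  tz = "UTC"
)

# Year-month key such as "2020-06" for each ride.
ride_month <- format(toy_started_at, "%Y-%m")

# Number of rides per calendar month, sorted chronologically by the key.
rides_per_month <- table(ride_month)
print(rides_per_month)
```

Applied to `Main_Trips_Dataframe$started_at`, a month that is missing from the table, or one whose count looks implausible, would point to a problem in the loading step.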

As expected, ridership is highest in summer, when outdoor conditions are most favorable, followed by fall. Winter shows the lowest ridership, which makes sense: besides the cold itself, icy conditions make riding more hazardous. It is also interesting that the docked bike is used most in summer and fall and least in winter. This visual is produced by the ggplot() call below:

ggplot(data = Main_Trips_Dataframe ,mapping = aes(x=Seasons_Chicago,fill=rideable_type))+
  geom_bar()+
  ylab("Number_Of_Rides")
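
To view the seasonal pattern separately for each customer type, one option is to facet the same bar chart by `member_casual`. The sketch below assumes ggplot2 is installed and uses an invented sample (`toy_trips`), so the resulting chart is illustrative only.

```r
library(ggplot2)

# Invented toy data with the three columns the chart needs.
toy_trips <- data.frame(
  Seasons_Chicago = c("Summer", "Summer", "Fall", "Winter", "Summer", "Fall"),
  rideable_type   = c("docked_bike", "docked_bike", "electric_bike",
                      "docked_bike", "electric_bike", "docked_bike"),
  member_casual   = c("casual", "member", "member", "member", "casual", "casual"),
  stringsAsFactors = FALSE
)

# Same stacked bar chart as above, with one panel per customer type.
season_plot <- ggplot(data = toy_trips,
                      mapping = aes(x = Seasons_Chicago, fill = rideable_type)) +
  geom_bar() +
  facet_wrap(~ member_casual) +
  ylab("Number_Of_Rides")

print(season_plot)
```

With `Main_Trips_Dataframe` in place of `toy_trips`, the two panels would show whether casual riders and members respond to the seasons differently.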

                      **Key Findings**

We can now briefly go over the observations for each visual output:

As noted in Visual Analysis 1, the bike of choice for both casual riders and members is the docked bike. This suggests that the docking infrastructure in the area Cyclistic serves is convenient for riders. This detail should be taken into consideration when designing campaigns to convert casual riders into subscribers.
In Visual Analysis 2, we noted that the weather appears to play a large part in the number of rides. Summer has the highest ridership and winter the lowest.

                    **Recommendations**

Run season-based promotions when casual ridership is expected to be at its highest, which makes for a more focused campaign overall.
One campaign strategy could be to have influencers in bike-riding circles occasionally promote Cyclistic through their own social media platforms. This might increase subscriptions either directly through the partnership with the influencer or indirectly through their audience.

References

Reading and Writing CSV Files

Battle of the bikes: docked vs. dockless

Adrian Mutandiro

2023-02-01