How Does a Bike-Share Navigate Speedy Success?

This is a case study for the Google Certificate on Data Analysis.

Business Task

The goal of this business task is to design a Marketing Strategy aimed at converting casual riders into annual members. In order to do so, the goal of this research is to understand how casual riders and annual members use Cyclistic bikes differently.

Description of Data Used

In order to access the data we need to go to click the link. This will lead us to the folders that contain the data, in order to do this report I used the folders that contain information about the last 12 months (from Sep 2020 to Aug 2021).

for (month in 1:8){
  date<-paste("20210", month,  sep="") 
  filename<-paste(date, "-divvy-tripdata.csv",  sep="")
  assign(paste("data", month, sep="_"),read.csv(filename))}

assign(paste("data", 9, sep="_"),read.csv("202009-divvy-tripdata.csv"))

for (month in 10:12){
  date<-paste("2020", month,  sep="") 
  filename<-paste(date, "-divvy-tripdata.csv",  sep="")
  assign(paste("data", month, sep="_"),read.csv(filename))}

Documentation of cleaning or manipulation of data

Now let’s clean the data: to begin this, let’s present it in only one database, in order to do this first I will create two variables corresponding to the month and year, so in the future we will be able to assign a time period to each value.

data_1<-mutate(data_1, year=2021, month = 01)
data_2<-mutate(data_1, year=2021, month = 02)
data_3<-mutate(data_1, year=2021, month = 03)
data_4<-mutate(data_1, year=2021, month = 04)
data_5<-mutate(data_1, year=2021, month = 05)
data_6<-mutate(data_1, year=2021, month = 06)
data_7<-mutate(data_1, year=2021, month = 07)
data_8<-mutate(data_1, year=2021, month = 08)
data_9<-mutate(data_1, year=2021, month = 09)
data_10<-mutate(data_1, year=2020, month = 10)
data_11<-mutate(data_1, year=2020, month = 11)
data_12<-mutate(data_1, year=2020, month = 12)

final_database <- rbind(data_1, data_2, data_3, data_4, data_5, data_6, data_7, data_8, data_9, data_10, data_11, data_12)

# let's clean the environment by erasing some objects
rm(data_1, data_2, data_3, data_4, data_5, data_6, data_7, data_8, data_9, data_10, data_11, data_12)
rm(db, db_20, i, dataname, date, filename, month)

After creating one dataset that contains all the information, let’s create some variable. In this section I create three new variables: * ride_length: calculate the length of all the rides (in seconds and hours). * day_of_the_week: calculates which of the week the service was used. * year_month: this is a variable that contains both the year and the month when the service was used.

final_database <- mutate(final_database,
                         ride_length = difftime(ended_at, started_at),
                         ride_length_h = difftime(ended_at, started_at, units="hours"),
                         day_of_the_week = weekdays(as.Date(started_at)),
                         day_of_the_week_c = as.factor(weekdays(as.Date(started_at))),
                         year_month = zoo::as.yearmon(paste(year, month), "%Y %m"),
                         year_month_c = as.factor(zoo::as.yearmon(paste(year, month), "%Y %m")))

In this section I will eliminate all the observations that have a ride length that is negative. Notice that by doing this 24 observations are lost.

final_database_clean <- final_database[!(final_database$ride_length<0),]

Summary of the Analysis

In this section I will construct some descriptive statistics. I will always differentiate between the two main groups, those that are members and those that are casual users of the service, given that this is our main goal.

On top if this I will also get some summary statistics based on:

The specific date (year, month).
The specific day of the week.

in order to find some patterns.

#Let's get some basic statistics to see if there are any outliers:
#head(final_database)

#I need to use a function to get the mode, otherwise the function mode() gives the type of variable

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}


#Statistics based on group

by_mem<-group_by(final_database_clean, member_casual) %>% 
  summarise(
          total=n(),
          len_mean = mean(ride_length_h),
          len_min = min(ride_length),
          len_max = max(ride_length_h),
          len_sd = sd(ride_length),
          day_mode = getmode(day_of_the_week),
          month_mode = getmode(year_month),
          count_elect = sum(rideable_type == "electric_bike", na.rm=TRUE),
          count_dock = sum(rideable_type == "docked_bike", na.rm=TRUE),
          count_classical = sum(rideable_type == "classic_bike", na.rm=TRUE))
           

by_mem<-mutate(by_mem,
               elec_share = count_elect/total,
               dock_share = count_dock/total,
               class_share = count_classical/total)
  
knitr::kable(by_mem, caption="Summary table")

Summary table
member_casual	total	len_mean	len_min	len_max	len_sd	day_mode	month_mode	count_elect	count_dock	count_classical	elec_share	dock_share	class_share
casual	217404	0.4280765 hours	1 secs	330.43194 hours	10976.345	Saturday	Jan 2021	93036	25260	99108	0.4279406	0.1161892	0.4558702
member	944580	0.2145385 hours	0 secs	24.99917 hours	1833.832	Friday	Jan 2021	303288	12	641280	0.3210824	0.0000127	0.6789049


#Statistics based on group and year/month

by_mem_month<-group_by(final_database_clean, member_casual, year_month_c, .drop=FALSE) %>% 
  summarise(
            total=n(),
            len_mean = mean(ride_length_h),
            len_min = min(ride_length),
            len_max = max(ride_length_h),
            len_sd = sd(ride_length),
            day_mode = getmode(day_of_the_week),
            count_elect = sum(rideable_type == "electric_bike", na.rm=TRUE),
            count_dock = sum(rideable_type == "docked_bike", na.rm=TRUE),
            count_classical = sum(rideable_type == "classic_bike", na.rm=TRUE),
            .groups = "keep")

by_mem_month<-mutate(by_mem_month,
                elec_share = count_elect/total,
                dock_share = count_dock/total,
                class_share = count_classical/total)

knitr::kable(by_mem_month, caption="Summary table - by date")

Summary table - by date
member_casual	year_month_c	total	len_mean	len_min	len_max	len_sd	day_mode	count_elect	count_dock	count_classical	elec_share	dock_share	class_share
casual	Oct 2020	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	Nov 2020	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	Dec 2020	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	Jan 2021	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	Feb 2021	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	Mar 2021	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	Apr 2021	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	May 2021	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	Jun 2021	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	Jul 2021	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	Aug 2021	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
casual	Sep 2021	18117	0.4280765 hours	1 secs	330.43194 hours	10976.623	Saturday	7753	2105	8259	0.4279406	0.1161892	0.4558702
member	Oct 2020	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	Nov 2020	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	Dec 2020	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	Jan 2021	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	Feb 2021	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	Mar 2021	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	Apr 2021	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	May 2021	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	Jun 2021	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	Jul 2021	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	Aug 2021	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049
member	Sep 2021	78715	0.2145385 hours	0 secs	24.99917 hours	1833.843	Friday	25274	1	53440	0.3210824	0.0000127	0.6789049


#Statistics based on group and day of the week

by_mem_day<-group_by(final_database_clean, member_casual, day_of_the_week, .drop=FALSE) %>% 
  summarise(
          total=n(),
          len_mean = mean(ride_length_h),
          len_min = min(ride_length),
          len_max = max(ride_length_h),
          len_sd = sd(ride_length),
          count_elect = sum(rideable_type == "electric_bike", na.rm=TRUE),
          count_dock = sum(rideable_type == "docked_bike", na.rm=TRUE),
          count_classical = sum(rideable_type == "classic_bike", na.rm=TRUE),
          .groups = "keep")

by_mem_day<-mutate(by_mem_day,
                elec_share = count_elect/total,
                dock_share = count_dock/total,
                class_share = count_classical/total)


knitr::kable(by_mem_day, caption="Summary table - by day of the week")

Summary table - by day of the week
member_casual	day_of_the_week	total	len_mean	len_min	len_max	len_sd	count_elect	count_dock	count_classical	elec_share	dock_share	class_share
casual	Friday	34056	0.3825049 hours	1 secs	61.25361 hours	5727.333	15828	3816	14412	0.4647639	0.1120507	0.4231853
casual	Monday	25056	0.3416230 hours	2 secs	24.99750 hours	3210.948	11424	2412	11220	0.4559387	0.0962644	0.4477969
casual	Saturday	48036	0.5291652 hours	2 secs	330.43194 hours	19295.513	17724	6828	23484	0.3689733	0.1421434	0.4888833
casual	Sunday	34368	0.4869339 hours	3 secs	85.36861 hours	10061.609	13884	4224	16260	0.4039804	0.1229050	0.4731145
casual	Thursday	28176	0.3608854 hours	1 secs	24.99806 hours	4211.954	12696	2832	12648	0.4505963	0.1005111	0.4488927
casual	Tuesday	22584	0.3724724 hours	1 secs	48.86833 hours	5680.557	10296	2364	9924	0.4558980	0.1046759	0.4394261
casual	Wednesday	25128	0.4276147 hours	1 secs	91.82722 hours	9381.646	11184	2784	11160	0.4450812	0.1107927	0.4441261
member	Friday	151740	0.2054044 hours	0 secs	17.33750 hours	1246.815	50856	0	100884	0.3351522	0.0000000	0.6648478
member	Monday	133620	0.2203209 hours	2 secs	24.99861 hours	2745.456	42060	0	91560	0.3147732	0.0000000	0.6852268
member	Saturday	147828	0.2259902 hours	2 secs	24.45194 hours	1505.493	49056	0	98772	0.3318451	0.0000000	0.6681549
member	Sunday	106464	0.2226606 hours	2 secs	18.22472 hours	1448.226	34596	0	71868	0.3249549	0.0000000	0.6750451
member	Thursday	143616	0.2042194 hours	0 secs	24.99917 hours	1625.038	45492	0	98124	0.3167614	0.0000000	0.6832386
member	Tuesday	126828	0.2006372 hours	1 secs	24.50278 hours	1247.780	38940	0	87888	0.3070300	0.0000000	0.6929700
member	Wednesday	134484	0.2242114 hours	1 secs	24.99722 hours	2434.755	42288	12	92184	0.3144463	0.0000892	0.6854644

Supporting visualizations and key findings

Let’s start by visualizing the big two groups to see how the behave ifferently:


#relative amount
ggplot(data = by_mem, aes(x = "", y = total, fill = member_casual)) + 
  geom_bar(stat="identity", width=1, color="white")+
  coord_polar("y", start=0)+
  theme_void()+ # remove background, grid, numeric labels
  ggtitle("Relative casual riders and annual members")+
  theme_bw()


#total amount
ggplot(data = by_mem, aes(x = member_casual, y=total, fill = member_casual)) + 
  geom_bar(stat="identity")+
  ggtitle("Total casual riders and annual members")+
  theme_bw()


par(mfrow=c(1,2))
#total amount by ride elect
ggplot(data = by_mem, aes(x = member_casual, y=elec_share, fill = member_casual)) + 
  geom_bar(stat="identity")+
  geom_text(aes(label=scales::percent(elec_share), vjust=2), color="white", size=6)+
  ggtitle("Share of people who use electric bikes conditional on their member status")+
  theme_bw()

#total amount by ride elect
ggplot(data = by_mem, aes(x = member_casual, y=class_share, fill = member_casual)) + 
  geom_bar(stat="identity")+
  geom_text(aes(label=scales::percent(class_share), vjust=2), color="white", size=6)+
  ggtitle("Share of people who use classical bikes conditional on their member status")+
  theme_bw()


#Let's start plotting conditional on month 
ggplot(data = by_mem_month, aes(x = year_month_c)) + 
  geom_line(mapping = aes(y = total, color = member_casual))+
  ggtitle("Use by member type through time")+
  theme_bw()
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?

  
#Data by day
ggplot(data = by_mem_day, aes(x = factor(day_of_the_week), y = total, fill=member_casual)) + 
  geom_bar(stat="identity", position=position_dodge())+
  ggtitle("Use by member type and day of the week")+
  theme_bw()

How Does a Bike-Share Navigate Speedy Success?

Aguiar28D

9/25/2021

Business Task

Description of Data Used

Documentation of cleaning or manipulation of data

Summary of the Analysis

Supporting visualizations and key findings

Top three recommendations