This is a case study for the Google Certificate on Data Analysis.

Business Task

The goal of this business task is to design a Marketing Strategy aimed at converting casual riders into annual members. In order to do so, the goal of this research is to understand how casual riders and annual members use Cyclistic bikes differently.

Description of Data Used

In order to access the data we need to go to click the link. This will lead us to the folders that contain the data, in order to do this report I used the folders that contain information about the last 12 months (from Sep 2020 to Aug 2021).

for (month in 1:8){
  date<-paste("20210", month,  sep="") 
  filename<-paste(date, "-divvy-tripdata.csv",  sep="")
  assign(paste("data", month, sep="_"),read.csv(filename))}

assign(paste("data", 9, sep="_"),read.csv("202009-divvy-tripdata.csv"))

for (month in 10:12){
  date<-paste("2020", month,  sep="") 
  filename<-paste(date, "-divvy-tripdata.csv",  sep="")
  assign(paste("data", month, sep="_"),read.csv(filename))}

Documentation of cleaning or manipulation of data

Now let’s clean the data: to begin this, let’s present it in only one database, in order to do this first I will create two variables corresponding to the month and year, so in the future we will be able to assign a time period to each value.

data_1<-mutate(data_1, year=2021, month = 01)
data_2<-mutate(data_1, year=2021, month = 02)
data_3<-mutate(data_1, year=2021, month = 03)
data_4<-mutate(data_1, year=2021, month = 04)
data_5<-mutate(data_1, year=2021, month = 05)
data_6<-mutate(data_1, year=2021, month = 06)
data_7<-mutate(data_1, year=2021, month = 07)
data_8<-mutate(data_1, year=2021, month = 08)
data_9<-mutate(data_1, year=2021, month = 09)
data_10<-mutate(data_1, year=2020, month = 10)
data_11<-mutate(data_1, year=2020, month = 11)
data_12<-mutate(data_1, year=2020, month = 12)

final_database <- rbind(data_1, data_2, data_3, data_4, data_5, data_6, data_7, data_8, data_9, data_10, data_11, data_12)

# let's clean the environment by erasing some objects
rm(data_1, data_2, data_3, data_4, data_5, data_6, data_7, data_8, data_9, data_10, data_11, data_12)
rm(db, db_20, i, dataname, date, filename, month)

After creating one dataset that contains all the information, let’s create some variable. In this section I create three new variables: * ride_length: calculate the length of all the rides (in seconds and hours). * day_of_the_week: calculates which of the week the service was used. * year_month: this is a variable that contains both the year and the month when the service was used.

final_database <- mutate(final_database,
                         ride_length = difftime(ended_at, started_at),
                         ride_length_h = difftime(ended_at, started_at, units="hours"),
                         day_of_the_week = weekdays(as.Date(started_at)),
                         day_of_the_week_c = as.factor(weekdays(as.Date(started_at))),
                         year_month = zoo::as.yearmon(paste(year, month), "%Y %m"),
                         year_month_c = as.factor(zoo::as.yearmon(paste(year, month), "%Y %m")))

In this section I will eliminate all the observations that have a ride length that is negative. Notice that by doing this 24 observations are lost.

final_database_clean <- final_database[!(final_database$ride_length<0),]

Summary of the Analysis

In this section I will construct some descriptive statistics. I will always differentiate between the two main groups, those that are members and those that are casual users of the service, given that this is our main goal.

On top if this I will also get some summary statistics based on:

in order to find some patterns.

#Let's get some basic statistics to see if there are any outliers:
#head(final_database)

#I need to use a function to get the mode, otherwise the function mode() gives the type of variable

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}


#Statistics based on group

by_mem<-group_by(final_database_clean, member_casual) %>% 
  summarise(
          total=n(),
          len_mean = mean(ride_length_h),
          len_min = min(ride_length),
          len_max = max(ride_length_h),
          len_sd = sd(ride_length),
          day_mode = getmode(day_of_the_week),
          month_mode = getmode(year_month),
          count_elect = sum(rideable_type == "electric_bike", na.rm=TRUE),
          count_dock = sum(rideable_type == "docked_bike", na.rm=TRUE),
          count_classical = sum(rideable_type == "classic_bike", na.rm=TRUE))
           

by_mem<-mutate(by_mem,
               elec_share = count_elect/total,
               dock_share = count_dock/total,
               class_share = count_classical/total)
  
knitr::kable(by_mem, caption="Summary table")
Summary table
member_casual total len_mean len_min len_max len_sd day_mode month_mode count_elect count_dock count_classical elec_share dock_share class_share
casual 217404 0.4280765 hours 1 secs 330.43194 hours 10976.345 Saturday Jan 2021 93036 25260 99108 0.4279406 0.1161892 0.4558702
member 944580 0.2145385 hours 0 secs 24.99917 hours 1833.832 Friday Jan 2021 303288 12 641280 0.3210824 0.0000127 0.6789049

#Statistics based on group and year/month

by_mem_month<-group_by(final_database_clean, member_casual, year_month_c, .drop=FALSE) %>% 
  summarise(
            total=n(),
            len_mean = mean(ride_length_h),
            len_min = min(ride_length),
            len_max = max(ride_length_h),
            len_sd = sd(ride_length),
            day_mode = getmode(day_of_the_week),
            count_elect = sum(rideable_type == "electric_bike", na.rm=TRUE),
            count_dock = sum(rideable_type == "docked_bike", na.rm=TRUE),
            count_classical = sum(rideable_type == "classic_bike", na.rm=TRUE),
            .groups = "keep")

by_mem_month<-mutate(by_mem_month,
                elec_share = count_elect/total,
                dock_share = count_dock/total,
                class_share = count_classical/total)

knitr::kable(by_mem_month, caption="Summary table - by date")
Summary table - by date
member_casual year_month_c total len_mean len_min len_max len_sd day_mode count_elect count_dock count_classical elec_share dock_share class_share
casual Oct 2020 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual Nov 2020 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual Dec 2020 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual Jan 2021 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual Feb 2021 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual Mar 2021 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual Apr 2021 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual May 2021 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual Jun 2021 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual Jul 2021 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual Aug 2021 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
casual Sep 2021 18117 0.4280765 hours 1 secs 330.43194 hours 10976.623 Saturday 7753 2105 8259 0.4279406 0.1161892 0.4558702
member Oct 2020 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member Nov 2020 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member Dec 2020 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member Jan 2021 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member Feb 2021 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member Mar 2021 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member Apr 2021 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member May 2021 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member Jun 2021 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member Jul 2021 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member Aug 2021 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049
member Sep 2021 78715 0.2145385 hours 0 secs 24.99917 hours 1833.843 Friday 25274 1 53440 0.3210824 0.0000127 0.6789049

#Statistics based on group and day of the week

by_mem_day<-group_by(final_database_clean, member_casual, day_of_the_week, .drop=FALSE) %>% 
  summarise(
          total=n(),
          len_mean = mean(ride_length_h),
          len_min = min(ride_length),
          len_max = max(ride_length_h),
          len_sd = sd(ride_length),
          count_elect = sum(rideable_type == "electric_bike", na.rm=TRUE),
          count_dock = sum(rideable_type == "docked_bike", na.rm=TRUE),
          count_classical = sum(rideable_type == "classic_bike", na.rm=TRUE),
          .groups = "keep")

by_mem_day<-mutate(by_mem_day,
                elec_share = count_elect/total,
                dock_share = count_dock/total,
                class_share = count_classical/total)


knitr::kable(by_mem_day, caption="Summary table - by day of the week")
Summary table - by day of the week
member_casual day_of_the_week total len_mean len_min len_max len_sd count_elect count_dock count_classical elec_share dock_share class_share
casual Friday 34056 0.3825049 hours 1 secs 61.25361 hours 5727.333 15828 3816 14412 0.4647639 0.1120507 0.4231853
casual Monday 25056 0.3416230 hours 2 secs 24.99750 hours 3210.948 11424 2412 11220 0.4559387 0.0962644 0.4477969
casual Saturday 48036 0.5291652 hours 2 secs 330.43194 hours 19295.513 17724 6828 23484 0.3689733 0.1421434 0.4888833
casual Sunday 34368 0.4869339 hours 3 secs 85.36861 hours 10061.609 13884 4224 16260 0.4039804 0.1229050 0.4731145
casual Thursday 28176 0.3608854 hours 1 secs 24.99806 hours 4211.954 12696 2832 12648 0.4505963 0.1005111 0.4488927
casual Tuesday 22584 0.3724724 hours 1 secs 48.86833 hours 5680.557 10296 2364 9924 0.4558980 0.1046759 0.4394261
casual Wednesday 25128 0.4276147 hours 1 secs 91.82722 hours 9381.646 11184 2784 11160 0.4450812 0.1107927 0.4441261
member Friday 151740 0.2054044 hours 0 secs 17.33750 hours 1246.815 50856 0 100884 0.3351522 0.0000000 0.6648478
member Monday 133620 0.2203209 hours 2 secs 24.99861 hours 2745.456 42060 0 91560 0.3147732 0.0000000 0.6852268
member Saturday 147828 0.2259902 hours 2 secs 24.45194 hours 1505.493 49056 0 98772 0.3318451 0.0000000 0.6681549
member Sunday 106464 0.2226606 hours 2 secs 18.22472 hours 1448.226 34596 0 71868 0.3249549 0.0000000 0.6750451
member Thursday 143616 0.2042194 hours 0 secs 24.99917 hours 1625.038 45492 0 98124 0.3167614 0.0000000 0.6832386
member Tuesday 126828 0.2006372 hours 1 secs 24.50278 hours 1247.780 38940 0 87888 0.3070300 0.0000000 0.6929700
member Wednesday 134484 0.2242114 hours 1 secs 24.99722 hours 2434.755 42288 12 92184 0.3144463 0.0000892 0.6854644

Supporting visualizations and key findings

Let’s start by visualizing the big two groups to see how the behave ifferently:


#relative amount
ggplot(data = by_mem, aes(x = "", y = total, fill = member_casual)) + 
  geom_bar(stat="identity", width=1, color="white")+
  coord_polar("y", start=0)+
  theme_void()+ # remove background, grid, numeric labels
  ggtitle("Relative casual riders and annual members")+
  theme_bw()


#total amount
ggplot(data = by_mem, aes(x = member_casual, y=total, fill = member_casual)) + 
  geom_bar(stat="identity")+
  ggtitle("Total casual riders and annual members")+
  theme_bw()


par(mfrow=c(1,2))
#total amount by ride elect
ggplot(data = by_mem, aes(x = member_casual, y=elec_share, fill = member_casual)) + 
  geom_bar(stat="identity")+
  geom_text(aes(label=scales::percent(elec_share), vjust=2), color="white", size=6)+
  ggtitle("Share of people who use electric bikes conditional on their member status")+
  theme_bw()

#total amount by ride elect
ggplot(data = by_mem, aes(x = member_casual, y=class_share, fill = member_casual)) + 
  geom_bar(stat="identity")+
  geom_text(aes(label=scales::percent(class_share), vjust=2), color="white", size=6)+
  ggtitle("Share of people who use classical bikes conditional on their member status")+
  theme_bw()


#Let's start plotting conditional on month 
ggplot(data = by_mem_month, aes(x = year_month_c)) + 
  geom_line(mapping = aes(y = total, color = member_casual))+
  ggtitle("Use by member type through time")+
  theme_bw()
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?

  
#Data by day
ggplot(data = by_mem_day, aes(x = factor(day_of_the_week), y = total, fill=member_casual)) + 
  geom_bar(stat="identity", position=position_dodge())+
  ggtitle("Use by member type and day of the week")+
  theme_bw()

Top three recommendations