This is a case study for the Google Certificate on Data Analysis.
The goal of this business task is to design a Marketing Strategy aimed at converting casual riders into annual members. In order to do so, the goal of this research is to understand how casual riders and annual members use Cyclistic bikes differently.
In order to access the data we need to go to click the link. This will lead us to the folders that contain the data, in order to do this report I used the folders that contain information about the last 12 months (from Sep 2020 to Aug 2021).
for (month in 1:8){
date<-paste("20210", month, sep="")
filename<-paste(date, "-divvy-tripdata.csv", sep="")
assign(paste("data", month, sep="_"),read.csv(filename))}
assign(paste("data", 9, sep="_"),read.csv("202009-divvy-tripdata.csv"))
for (month in 10:12){
date<-paste("2020", month, sep="")
filename<-paste(date, "-divvy-tripdata.csv", sep="")
assign(paste("data", month, sep="_"),read.csv(filename))}
Now let’s clean the data: to begin this, let’s present it in only one database, in order to do this first I will create two variables corresponding to the month and year, so in the future we will be able to assign a time period to each value.
data_1<-mutate(data_1, year=2021, month = 01)
data_2<-mutate(data_1, year=2021, month = 02)
data_3<-mutate(data_1, year=2021, month = 03)
data_4<-mutate(data_1, year=2021, month = 04)
data_5<-mutate(data_1, year=2021, month = 05)
data_6<-mutate(data_1, year=2021, month = 06)
data_7<-mutate(data_1, year=2021, month = 07)
data_8<-mutate(data_1, year=2021, month = 08)
data_9<-mutate(data_1, year=2021, month = 09)
data_10<-mutate(data_1, year=2020, month = 10)
data_11<-mutate(data_1, year=2020, month = 11)
data_12<-mutate(data_1, year=2020, month = 12)
final_database <- rbind(data_1, data_2, data_3, data_4, data_5, data_6, data_7, data_8, data_9, data_10, data_11, data_12)
# let's clean the environment by erasing some objects
rm(data_1, data_2, data_3, data_4, data_5, data_6, data_7, data_8, data_9, data_10, data_11, data_12)
rm(db, db_20, i, dataname, date, filename, month)
After creating one dataset that contains all the information, let’s create some variable. In this section I create three new variables: * ride_length: calculate the length of all the rides (in seconds and hours). * day_of_the_week: calculates which of the week the service was used. * year_month: this is a variable that contains both the year and the month when the service was used.
final_database <- mutate(final_database,
ride_length = difftime(ended_at, started_at),
ride_length_h = difftime(ended_at, started_at, units="hours"),
day_of_the_week = weekdays(as.Date(started_at)),
day_of_the_week_c = as.factor(weekdays(as.Date(started_at))),
year_month = zoo::as.yearmon(paste(year, month), "%Y %m"),
year_month_c = as.factor(zoo::as.yearmon(paste(year, month), "%Y %m")))
In this section I will eliminate all the observations that have a ride length that is negative. Notice that by doing this 24 observations are lost.
final_database_clean <- final_database[!(final_database$ride_length<0),]
In this section I will construct some descriptive statistics. I will always differentiate between the two main groups, those that are members and those that are casual users of the service, given that this is our main goal.
On top if this I will also get some summary statistics based on:
in order to find some patterns.
#Let's get some basic statistics to see if there are any outliers:
#head(final_database)
#I need to use a function to get the mode, otherwise the function mode() gives the type of variable
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
#Statistics based on group
by_mem<-group_by(final_database_clean, member_casual) %>%
summarise(
total=n(),
len_mean = mean(ride_length_h),
len_min = min(ride_length),
len_max = max(ride_length_h),
len_sd = sd(ride_length),
day_mode = getmode(day_of_the_week),
month_mode = getmode(year_month),
count_elect = sum(rideable_type == "electric_bike", na.rm=TRUE),
count_dock = sum(rideable_type == "docked_bike", na.rm=TRUE),
count_classical = sum(rideable_type == "classic_bike", na.rm=TRUE))
by_mem<-mutate(by_mem,
elec_share = count_elect/total,
dock_share = count_dock/total,
class_share = count_classical/total)
knitr::kable(by_mem, caption="Summary table")
| member_casual | total | len_mean | len_min | len_max | len_sd | day_mode | month_mode | count_elect | count_dock | count_classical | elec_share | dock_share | class_share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| casual | 217404 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.345 | Saturday | Jan 2021 | 93036 | 25260 | 99108 | 0.4279406 | 0.1161892 | 0.4558702 |
| member | 944580 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.832 | Friday | Jan 2021 | 303288 | 12 | 641280 | 0.3210824 | 0.0000127 | 0.6789049 |
#Statistics based on group and year/month
by_mem_month<-group_by(final_database_clean, member_casual, year_month_c, .drop=FALSE) %>%
summarise(
total=n(),
len_mean = mean(ride_length_h),
len_min = min(ride_length),
len_max = max(ride_length_h),
len_sd = sd(ride_length),
day_mode = getmode(day_of_the_week),
count_elect = sum(rideable_type == "electric_bike", na.rm=TRUE),
count_dock = sum(rideable_type == "docked_bike", na.rm=TRUE),
count_classical = sum(rideable_type == "classic_bike", na.rm=TRUE),
.groups = "keep")
by_mem_month<-mutate(by_mem_month,
elec_share = count_elect/total,
dock_share = count_dock/total,
class_share = count_classical/total)
knitr::kable(by_mem_month, caption="Summary table - by date")
| member_casual | year_month_c | total | len_mean | len_min | len_max | len_sd | day_mode | count_elect | count_dock | count_classical | elec_share | dock_share | class_share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| casual | Oct 2020 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | Nov 2020 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | Dec 2020 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | Jan 2021 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | Feb 2021 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | Mar 2021 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | Apr 2021 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | May 2021 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | Jun 2021 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | Jul 2021 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | Aug 2021 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| casual | Sep 2021 | 18117 | 0.4280765 hours | 1 secs | 330.43194 hours | 10976.623 | Saturday | 7753 | 2105 | 8259 | 0.4279406 | 0.1161892 | 0.4558702 |
| member | Oct 2020 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | Nov 2020 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | Dec 2020 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | Jan 2021 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | Feb 2021 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | Mar 2021 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | Apr 2021 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | May 2021 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | Jun 2021 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | Jul 2021 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | Aug 2021 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
| member | Sep 2021 | 78715 | 0.2145385 hours | 0 secs | 24.99917 hours | 1833.843 | Friday | 25274 | 1 | 53440 | 0.3210824 | 0.0000127 | 0.6789049 |
#Statistics based on group and day of the week
by_mem_day<-group_by(final_database_clean, member_casual, day_of_the_week, .drop=FALSE) %>%
summarise(
total=n(),
len_mean = mean(ride_length_h),
len_min = min(ride_length),
len_max = max(ride_length_h),
len_sd = sd(ride_length),
count_elect = sum(rideable_type == "electric_bike", na.rm=TRUE),
count_dock = sum(rideable_type == "docked_bike", na.rm=TRUE),
count_classical = sum(rideable_type == "classic_bike", na.rm=TRUE),
.groups = "keep")
by_mem_day<-mutate(by_mem_day,
elec_share = count_elect/total,
dock_share = count_dock/total,
class_share = count_classical/total)
knitr::kable(by_mem_day, caption="Summary table - by day of the week")
| member_casual | day_of_the_week | total | len_mean | len_min | len_max | len_sd | count_elect | count_dock | count_classical | elec_share | dock_share | class_share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| casual | Friday | 34056 | 0.3825049 hours | 1 secs | 61.25361 hours | 5727.333 | 15828 | 3816 | 14412 | 0.4647639 | 0.1120507 | 0.4231853 |
| casual | Monday | 25056 | 0.3416230 hours | 2 secs | 24.99750 hours | 3210.948 | 11424 | 2412 | 11220 | 0.4559387 | 0.0962644 | 0.4477969 |
| casual | Saturday | 48036 | 0.5291652 hours | 2 secs | 330.43194 hours | 19295.513 | 17724 | 6828 | 23484 | 0.3689733 | 0.1421434 | 0.4888833 |
| casual | Sunday | 34368 | 0.4869339 hours | 3 secs | 85.36861 hours | 10061.609 | 13884 | 4224 | 16260 | 0.4039804 | 0.1229050 | 0.4731145 |
| casual | Thursday | 28176 | 0.3608854 hours | 1 secs | 24.99806 hours | 4211.954 | 12696 | 2832 | 12648 | 0.4505963 | 0.1005111 | 0.4488927 |
| casual | Tuesday | 22584 | 0.3724724 hours | 1 secs | 48.86833 hours | 5680.557 | 10296 | 2364 | 9924 | 0.4558980 | 0.1046759 | 0.4394261 |
| casual | Wednesday | 25128 | 0.4276147 hours | 1 secs | 91.82722 hours | 9381.646 | 11184 | 2784 | 11160 | 0.4450812 | 0.1107927 | 0.4441261 |
| member | Friday | 151740 | 0.2054044 hours | 0 secs | 17.33750 hours | 1246.815 | 50856 | 0 | 100884 | 0.3351522 | 0.0000000 | 0.6648478 |
| member | Monday | 133620 | 0.2203209 hours | 2 secs | 24.99861 hours | 2745.456 | 42060 | 0 | 91560 | 0.3147732 | 0.0000000 | 0.6852268 |
| member | Saturday | 147828 | 0.2259902 hours | 2 secs | 24.45194 hours | 1505.493 | 49056 | 0 | 98772 | 0.3318451 | 0.0000000 | 0.6681549 |
| member | Sunday | 106464 | 0.2226606 hours | 2 secs | 18.22472 hours | 1448.226 | 34596 | 0 | 71868 | 0.3249549 | 0.0000000 | 0.6750451 |
| member | Thursday | 143616 | 0.2042194 hours | 0 secs | 24.99917 hours | 1625.038 | 45492 | 0 | 98124 | 0.3167614 | 0.0000000 | 0.6832386 |
| member | Tuesday | 126828 | 0.2006372 hours | 1 secs | 24.50278 hours | 1247.780 | 38940 | 0 | 87888 | 0.3070300 | 0.0000000 | 0.6929700 |
| member | Wednesday | 134484 | 0.2242114 hours | 1 secs | 24.99722 hours | 2434.755 | 42288 | 12 | 92184 | 0.3144463 | 0.0000892 | 0.6854644 |
Let’s start by visualizing the big two groups to see how the behave ifferently:
#relative amount
ggplot(data = by_mem, aes(x = "", y = total, fill = member_casual)) +
geom_bar(stat="identity", width=1, color="white")+
coord_polar("y", start=0)+
theme_void()+ # remove background, grid, numeric labels
ggtitle("Relative casual riders and annual members")+
theme_bw()
#total amount
ggplot(data = by_mem, aes(x = member_casual, y=total, fill = member_casual)) +
geom_bar(stat="identity")+
ggtitle("Total casual riders and annual members")+
theme_bw()
par(mfrow=c(1,2))
#total amount by ride elect
ggplot(data = by_mem, aes(x = member_casual, y=elec_share, fill = member_casual)) +
geom_bar(stat="identity")+
geom_text(aes(label=scales::percent(elec_share), vjust=2), color="white", size=6)+
ggtitle("Share of people who use electric bikes conditional on their member status")+
theme_bw()
#total amount by ride elect
ggplot(data = by_mem, aes(x = member_casual, y=class_share, fill = member_casual)) +
geom_bar(stat="identity")+
geom_text(aes(label=scales::percent(class_share), vjust=2), color="white", size=6)+
ggtitle("Share of people who use classical bikes conditional on their member status")+
theme_bw()
#Let's start plotting conditional on month
ggplot(data = by_mem_month, aes(x = year_month_c)) +
geom_line(mapping = aes(y = total, color = member_casual))+
ggtitle("Use by member type through time")+
theme_bw()
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
#Data by day
ggplot(data = by_mem_day, aes(x = factor(day_of_the_week), y = total, fill=member_casual)) +
geom_bar(stat="identity", position=position_dodge())+
ggtitle("Use by member type and day of the week")+
theme_bw()