This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
knitr::opts_chunk$set(echo = TRUE)
##All project details
This is for the Google Data Analytics Capstone project on Coursera. This is my first ever deep dive into a data analytics project. For the purposes of this exercise, we will be using public data from a bike share company called Divvy. In this exercise, I work for a fictional company, Divvy. In order to answer the key business questions, I will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.
To deliver recommendations by analyzing the 2022 trip data for Divvy to understand how casual riders, and annual members use the service differently. Using that information I will try to provide answers on how to convert casual riders to annual members.
I used the most recent twelve-month (#December_2021-September_2022) historical trip datasets provided by Divvy to perform this analysis. This includes twelve different files each containing a month of data ranging from August 2021, to July 2022. This is public data that you can use to explore how different customer types are using Divvy bikes. The data has been made available by Motivate International Inc. under this license.
Each dataset is made up of thirteen columns:
[1] ride_id rideable_type started_at
[4] ended_at start_station_name start_station_id
[7] end_station_name end_station_id start_lat
[10] start_lng end_lat end_lng
[13] member_casual
I will be using the programming language R to conduct this analysis.
First I need to install and load relevant packages: ## Package importation
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(timechange)
library(ggplot2)
library(dplyr)
##Next I need to load, read and assign a variable to all twelve csv files to Rstudio Environment.
Dec_21 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202112-divvy-tripdata.csv")
Nov_21 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202111-divvy-tripdata.csv")
Oct_21 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202110-divvy-tripdata.csv")
Jan_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202201-divvy-tripdata.csv")
Feb_22 <- read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202202-divvy-tripdata.csv")
Mar_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202203-divvy-tripdata.csv")
Apr_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202204-divvy-tripdata.csv")
May_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202205-divvy-tripdata.csv")
Jun_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202206-divvy-tripdata.csv")
Jul_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202207-divvy-tripdata.csv")
Aug_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202208-divvy-tripdata.csv")
Sep_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202209-divvy-publictripdata.csv")
To ensure consistency, I want to make sure all of these files have the same columns, so I will run colnames() on each csv file. Once I verify that all csv files contain each of the 13 columns shown above, I need to use rbind() to merge all files together vertically into a single large data frame.
Total_trip <- rbind(Dec_21, Nov_21, Oct_21, Jan_22, Feb_22, Mar_22, Apr_22, May_22, Jun_22, Jul_22 , Aug_22, Sep_22)
To make sure our data is clean, let’s check for and remove NA/null values. We will have to assign this to a new data frame.
Total_trip_1 <- na.omit(Total_trip)
This brought the total observations from 5,828,235 to 5,822,391 meaning that 5,8444 rows with NA/null information! Has been removed before performing further data entries and manipulation!
We are going to add a new column called “ride_length” to calculate the length of each trip in minutes. We will use the difftime() function to do this, and shown in the chunk below:
Total_trip_1$ride_length <- difftime(Total_trip_1$ended_at, Total_trip_1$started_at, units="mins")
##Let check the internal structure of (Total_trip_1) This will help to confirm the data_types that was previuosly loaded.
str(Total_trip_1)
## 'data.frame': 5822391 obs. of 14 variables:
## $ ride_id : chr "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr "2021-12-07 15:06:07" "2021-12-11 03:43:29" "2021-12-15 23:10:28" "2021-12-26 16:16:10" ...
## $ ended_at : chr "2021-12-07 15:13:42" "2021-12-11 04:10:23" "2021-12-15 23:23:14" "2021-12-26 16:30:53" ...
## $ start_station_name: chr "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
## $ start_station_id : chr "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
## $ end_station_name : chr "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
## $ end_station_id : chr "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr "member" "casual" "member" "member" ...
## $ ride_length : 'difftime' num 7.58333333333333 26.9 12.7666666666667 14.7166666666667 ...
## ..- attr(*, "units")= chr "mins"
## - attr(*, "na.action")= 'omit' Named int [1:5844] 3342 3391 3467 3586 3681 3692 3693 111319 111481 111526 ...
## ..- attr(*, "names")= chr [1:5844] "3342" "3391" "3467" "3586" ...
Using View(trip_data) to check out our new column, we can see that there are some negative ride times. Let’s check how many columns display negative ride lengths and then remove them:
Total_trip_2 <- Total_trip_1[!Total_trip_1$ride_length < 0,]
Now we are going to add four new columns. The first will be populated with the abbreviated day of the week that the ride started on. We will also make a column for the month and the year. We can use this to explore how both casual riders and members use the bikes differently depending on the day of the week, month, or year. Going further, lets add a column isolating the hour that the ride started_at
Even though we already have this data in the “started_at” and “ended_at” columns, breaking it down into its own columns will make it easier to aggregate later on.
Total_trip_2$month <- format(as.Date(Total_trip_2$started_at), "%m")
Total_trip_2$year <- format(as.Date(Total_trip_2$started_at), "%Y")
Total_trip_2$start_hour <- format(as.POSIXct(Total_trip_2$started_at), format = "%H")
Total_trip_2 <- mutate(Total_trip_2, started_at = as_datetime(started_at), ended_at = as_datetime(ended_at))
Total_trip_2 <- mutate(Total_trip_2, "days_of_week" = weekdays(started_at))
glimpse(Total_trip_2)
## Rows: 5,822,283
## Columns: 18
## $ ride_id <chr> "46F8167220E4431F", "73A77762838B32FD", "4CF4245205…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2021-12-07 15:06:07, 2021-12-11 03:43:29, 2021-12-…
## $ ended_at <dttm> 2021-12-07 15:13:42, 2021-12-11 04:10:23, 2021-12-…
## $ start_station_name <chr> "Laflin St & Cullerton St", "LaSalle Dr & Huron St"…
## $ start_station_id <chr> "13307", "KP1705001026", "KA1504000117", "KA1504000…
## $ end_station_name <chr> "Morgan St & Polk St", "Clarendon Ave & Leland Ave"…
## $ end_station_id <chr> "TA1307000130", "TA1307000119", "13137", "KP1705001…
## $ start_lat <dbl> 41.85483, 41.89441, 41.89936, 41.89939, 41.89558, 4…
## $ start_lng <dbl> -87.66366, -87.63233, -87.64852, -87.64854, -87.682…
## $ end_lat <dbl> 41.87197, 41.96797, 41.93758, 41.89488, 41.93125, 4…
## $ end_lng <dbl> -87.65097, -87.65000, -87.64410, -87.63233, -87.644…
## $ member_casual <chr> "member", "casual", "member", "member", "member", "…
## $ ride_length <drtn> 7.583333 mins, 26.900000 mins, 12.766667 mins, 14.…
## $ month <chr> "12", "12", "12", "12", "12", "12", "12", "12", "12…
## $ year <chr> "2021", "2021", "2021", "2021", "2021", "2021", "20…
## $ start_hour <chr> "15", "03", "23", "16", "11", "18", "15", "13", "14…
## $ days_of_week <chr> "Tuesday", "Saturday", "Wednesday", "Sunday", "Thur…
Before we move on to the analysis, we need to first convert ‘ride_length’ from Factor to Numeric so we can run calculations on the data.
is.factor(Total_trip_2$ride_length)
## [1] FALSE
Total_trip_2$ride_length <- as.numeric(as.character(Total_trip_2$ride_length))
is.numeric(Total_trip_2$ride_length)
## [1] TRUE
Now we have everything we need to move on to the analysis phase.
We will use the summary() function to quickly get the min, median, mean, and max for “ride_length”.
summary(Total_trip_2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.93 10.48 16.76 18.82 40705.02
##Lets compare ride_length between casual riders, and members.
#Number of rides per hour of the day
Total_trip_2 %>%
mutate(hour = format(as.POSIXct(Total_trip_2$started_at), format = "%H")) %>%
group_by(member_casual, hour) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, hour) %>%
ggplot(aes(x = hour, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Based on the graph above, we can come to the following conclusions: Members tend to have a higher frequency of rides between 6:00-9:00 and 15:00-19:00. This would seem to suggest that annual members use their bikes to commute to and from work. Casual riders have a spike in number of rides around 15:00-18:00 as well, but that isn’t reflected in the morning meaning this is likely not reflecting casual riders using the bikes to commute to and from work but rather perhaps after work activities.
Advertising can target casual riders around these times, promoting commuting to work with Divvy bikes company. Promoting discounted prices for these time slots could prove effective.
##Lets break it down by weekday.
# analyze ridership data by type and weekday
Total_trip_2 %>%
mutate(day_of_week = wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, day_of_week) %>% #groups by usertype(member_casual) and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, day_of_week) # sorts
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual day_of_week number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 403983 26.3
## 2 casual Mon 279170 23.4
## 3 casual Tue 275219 20.3
## 4 casual Wed 281149 19.8
## 5 casual Thu 306111 20.3
## 6 casual Fri 351729 21.3
## 7 casual Sat 498682 25.5
## 8 member Sun 393502 13.9
## 9 member Mon 472981 12.1
## 10 member Tue 541416 11.9
## 11 member Wed 538401 11.9
## 12 member Thu 530450 12.0
## 13 member Fri 491368 12.3
## 14 member Sat 458122 13.9
# Let's visualize the number of rides by rider type per weekday
Total_trip_2 %>%
mutate(day_of_week = wday(started_at, label = TRUE)) %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, day_of_week) %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
We can make some interesting discoveries from the graph shown above. We can see that during week day, members have a much higher frequency of rides than casual riders. Casual riders however, pull ahead with slightly more rides on the weekends, while the number of rides for members actually decreases slightly.
##Let’s create a visualization for average duration
Total_trip_2 %>%
mutate(day_of_week = wday(started_at, label = TRUE)) %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, day_of_week) %>%
ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
In this graph, we find that casual riders, while not taking quite as many rides, do in fact take much longer rides than members throughout each day of the week, with slight increases on the weekends. The average duration for members rides stays very consistent throughout the week.
##Let’s compare number_of_rides per month for casual and annual members.
Total_trip_2 %>%
mutate(month = format(as.Date(Total_trip_2$started_at), "%m")) %>%
group_by(member_casual, month) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, month) %>%
ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
We observe from this graph that casual riders always has less number of rides than annual members in every months of the year. It is worth noting that both members and casual riders number of rides follows a bell curve, meaning that both parties are using bikes more in the warmer months, than the colder ones. Casual riders bike trips dips drastically between November and March.
Advertising targeting casual riders should focus efforts on the months April-October.
##Lets see what type of bike (rideable_type) casual riders and members prefer.
Total_trip_2 %>%
group_by(member_casual, rideable_type) %>%
arrange(member_casual, rideable_type) %>%
ggplot(aes(x = rideable_type, fill = member_casual)) +
geom_bar(position = "dodge")
Based on this graph, both members and casual riders prefer to use the classic bike type as well electric bike type. It appear to be less important to focus on the type of bike the riders used.
##Key Takeaways
Casual riders take less trips during the week day than annual members do. We believe this to be due to members taking more rides on the way to work, and on the way home from work. We suggest increasing casual riders week day trips by advertising at work commute hours. (7:00-9:00 and 16:00-18:00)
We recommend advertising between May and September. These are the months with the highest frequencies of bike trips.
Casual riders and members both prefer the classic bike, but electric bikes aren’t too far behind.
We removed 1,312,658 rows of data for this analysis. These were NA/null values and unaccounted for negative ride time duration. This could have effected the integrity of this analysis. We recommend looking into the source of these problems so that future analysis is unaffected.
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.