Analysis of Bike share Data 2019-2020

Scenario

As a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago, the director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, their team wants to understand how casual riders and annual members use Cyclistic bikes differently.

For the purpose of this case study, the datasets shared by the company are appropriate and will answer the business questions. The data has been made available by Motivate International Inc. under license. Decided to download data for four quarters.

We have used data for four quarters and combined them to a single dataframe for twelve months. The time period for analysis is from April 2019 to March 2020.

step 1. install the required packages in R for completing the analysis

options(repos = list(CRAN="http://cran.rstudio.com/"))
install.packages("tidyverse")

## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)

## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpMjSoXn\downloaded_packages

install.packages("lubridate")

## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)

## package 'lubridate' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpMjSoXn\downloaded_packages

install.packages("ggplot2")

## Installing package into 'C:/Users/DELL/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)

## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\DELL\AppData\Local\Temp\RtmpMjSoXn\downloaded_packages

step 2. load the libraries

library(tidyverse)
library(lubridate)
library(ggplot2)

step 3. set your working directory or folder and change the file path according to folder in your computer.

knitr::opts_chunk$set(warning = FALSE)
knitr::opts_knit$set(root.dir = 'C:/Users/DELL/Documents/S/Google_dataanalytics/case_study_1/data')

step 4. Input the quaterly bikeshare data into respective data frames by using read.csv() function

q2_2019 <- read.csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read.csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read.csv("Divvy_Trips_2019_Q4.csv")
q1_2020 <- read.csv("Divvy_Trips_2020_Q1.csv")

step 5. Explore the dataset, Where relevant, make columns consistent and combine them into a single worksheet. Clean and transform your data to prepare for analysis. We can start by checking column names for consistency.

colnames(q3_2019)

##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"

colnames(q4_2019)

##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"

colnames(q2_2019)

##  [1] "X01...Rental.Details.Rental.ID"                   
##  [2] "X01...Rental.Details.Local.Start.Time"            
##  [3] "X01...Rental.Details.Local.End.Time"              
##  [4] "X01...Rental.Details.Bike.ID"                     
##  [5] "X01...Rental.Details.Duration.In.Seconds.Uncapped"
##  [6] "X03...Rental.Start.Station.ID"                    
##  [7] "X03...Rental.Start.Station.Name"                  
##  [8] "X02...Rental.End.Station.ID"                      
##  [9] "X02...Rental.End.Station.Name"                    
## [10] "User.Type"                                        
## [11] "Member.Gender"                                    
## [12] "X05...Member.Details.Member.Birthday.Year"

colnames(q1_2020)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

step 6. We can see that column names are inconsistent. “q2_2019” has a lot different column names than other quarters. So, to make the column names same we will follow the standard used in “q1_2020”. We will use rename() function.

(q4_2019 <- rename(q4_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))

(q3_2019 <- rename(q3_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))

(q2_2019 <- rename(q2_2019
                   ,ride_id = "X01...Rental.Details.Rental.ID"
                   ,rideable_type = "X01...Rental.Details.Bike.ID"
                   ,tripduration = "X01...Rental.Details.Duration.In.Seconds.Uncapped"
                   ,started_at = "X01...Rental.Details.Local.Start.Time"  
                   ,ended_at = "X01...Rental.Details.Local.End.Time"  
                   ,start_station_name = "X03...Rental.Start.Station.Name" 
                   ,start_station_id = "X03...Rental.Start.Station.ID"
                   ,end_station_name = "X02...Rental.End.Station.Name" 
                   ,end_station_id = "X02...Rental.End.Station.ID"
                   ,member_casual = "User.Type"
                   ,gender = "Member.Gender"
                   ,birthyear = "X05...Member.Details.Member.Birthday.Year"))

Step 7. Re-check column names for consistency. Now, they are consistently named.

colnames(q3_2019)

##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "tripduration"       "start_station_id"  
##  [7] "start_station_name" "end_station_id"     "end_station_name"  
## [10] "member_casual"      "gender"             "birthyear"

colnames(q4_2019)

##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "tripduration"       "start_station_id"  
##  [7] "start_station_name" "end_station_id"     "end_station_name"  
## [10] "member_casual"      "gender"             "birthyear"

colnames(q2_2019)

##  [1] "ride_id"            "started_at"         "ended_at"          
##  [4] "rideable_type"      "tripduration"       "start_station_id"  
##  [7] "start_station_name" "end_station_id"     "end_station_name"  
## [10] "member_casual"      "gender"             "birthyear"

colnames(q1_2020)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Step 8. Inspect the data frames and look for incongruities an find data type of each column

str(q1_2020)

## 'data.frame':    426887 obs. of  13 variables:
##  $ ride_id           : chr  "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
##  $ rideable_type     : chr  "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : chr  "2020-01-21 20:06:59" "2020-01-30 14:22:39" "2020-01-09 19:29:26" "2020-01-06 16:17:07" ...
##  $ ended_at          : chr  "2020-01-21 20:14:30" "2020-01-30 14:26:22" "2020-01-09 19:32:17" "2020-01-06 16:25:56" ...
##  $ start_station_name: chr  "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
##  $ start_station_id  : int  239 234 296 51 66 212 96 96 212 38 ...
##  $ end_station_name  : chr  "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
##  $ end_station_id    : int  326 318 117 24 212 96 212 212 96 100 ...
##  $ start_lat         : num  42 42 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  42 42 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...

str(q4_2019)

## 'data.frame':    704054 obs. of  12 variables:
##  $ ride_id           : int  25223640 25223641 25223642 25223643 25223644 25223645 25223646 25223647 25223648 25223649 ...
##  $ started_at        : chr  "2019-10-01 00:01:39" "2019-10-01 00:02:16" "2019-10-01 00:04:32" "2019-10-01 00:04:32" ...
##  $ ended_at          : chr  "2019-10-01 00:17:20" "2019-10-01 00:06:34" "2019-10-01 00:18:43" "2019-10-01 00:43:43" ...
##  $ rideable_type     : int  2215 6328 3003 3275 5294 1891 1061 1274 6011 2957 ...
##  $ tripduration      : chr  "940.0" "258.0" "850.0" "2,350.0" ...
##  $ start_station_id  : int  20 19 84 313 210 156 84 156 156 336 ...
##  $ start_station_name: chr  "Sheffield Ave & Kingsbury St" "Throop (Loomis) St & Taylor St" "Milwaukee Ave & Grand Ave" "Lakeview Ave & Fullerton Pkwy" ...
##  $ end_station_id    : int  309 241 199 290 382 226 142 463 463 336 ...
##  $ end_station_name  : chr  "Leavitt St & Armitage Ave" "Morgan St & Polk St" "Wabash Ave & Grand Ave" "Kedzie Ave & Palmer Ct" ...
##  $ member_casual     : chr  "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ gender            : chr  "Male" "Male" "Female" "Male" ...
##  $ birthyear         : int  1987 1998 1991 1990 1987 1994 1991 1995 1993 NA ...

str(q3_2019)

## 'data.frame':    1640718 obs. of  12 variables:
##  $ ride_id           : int  23479388 23479389 23479390 23479391 23479392 23479393 23479394 23479395 23479396 23479397 ...
##  $ started_at        : chr  "2019-07-01 00:00:27" "2019-07-01 00:01:16" "2019-07-01 00:01:48" "2019-07-01 00:02:07" ...
##  $ ended_at          : chr  "2019-07-01 00:20:41" "2019-07-01 00:18:44" "2019-07-01 00:27:42" "2019-07-01 00:27:10" ...
##  $ rideable_type     : int  3591 5353 6180 5540 6014 4941 3770 5442 2957 6091 ...
##  $ tripduration      : chr  "1,214.0" "1,048.0" "1,554.0" "1,503.0" ...
##  $ start_station_id  : int  117 381 313 313 168 300 168 313 43 43 ...
##  $ start_station_name: chr  "Wilton Ave & Belmont Ave" "Western Ave & Monroe St" "Lakeview Ave & Fullerton Pkwy" "Lakeview Ave & Fullerton Pkwy" ...
##  $ end_station_id    : int  497 203 144 144 62 232 62 144 195 195 ...
##  $ end_station_name  : chr  "Kimball Ave & Belmont Ave" "Western Ave & 21st St" "Larrabee St & Webster Ave" "Larrabee St & Webster Ave" ...
##  $ member_casual     : chr  "Subscriber" "Customer" "Customer" "Customer" ...
##  $ gender            : chr  "Male" "" "" "" ...
##  $ birthyear         : int  1992 NA NA NA NA 1990 NA NA NA NA ...

str(q2_2019)

## 'data.frame':    1108163 obs. of  12 variables:
##  $ ride_id           : int  22178529 22178530 22178531 22178532 22178533 22178534 22178535 22178536 22178537 22178538 ...
##  $ started_at        : chr  "2019-04-01 00:02:22" "2019-04-01 00:03:02" "2019-04-01 00:11:07" "2019-04-01 00:13:01" ...
##  $ ended_at          : chr  "2019-04-01 00:09:48" "2019-04-01 00:20:30" "2019-04-01 00:15:19" "2019-04-01 00:18:58" ...
##  $ rideable_type     : int  6251 6226 5649 4151 3270 3123 6418 4513 3280 5534 ...
##  $ tripduration      : chr  "446.0" "1,048.0" "252.0" "357.0" ...
##  $ start_station_id  : int  81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name: chr  "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id    : int  56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name  : chr  "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual     : chr  "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ gender            : chr  "Male" "Female" "Male" "Male" ...
##  $ birthyear         : int  1975 1984 1990 1993 1992 1999 1969 1991 NA NA ...

Step 9. Our next aim is to stack all four dataframes over one another into a single dataframe. So, as a first step, we will convert data type of “ride_id” and “rideable_type” columns to character so that they can stack correctly.

q4_2019 <-  mutate(q4_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
q3_2019 <-  mutate(q3_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
q2_2019 <-  mutate(q2_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))

Step 10. Next, we can stack all four data frames into one big data frame using bind_rows() function

all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020)

Step 11. To make data more consistent, remove lat, long, birthyear, and gender fields as this data was dropped from data collection since 2020

all_trips <- all_trips %>%  
  select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, tripduration))

Step 12. Inspect the new table all_trips that has been created and also re-check column names for consistency

colnames(all_trips)

## [1] "ride_id"            "started_at"         "ended_at"          
## [4] "rideable_type"      "start_station_id"   "start_station_name"
## [7] "end_station_id"     "end_station_name"   "member_casual"

Step 13. Now, we might feel our data is ready for analysis. But, lets re-check again by computing number of rows, column names, using summary() function and cross checking dimensions of data frame. Basically, carefully inspect the new table all_trips that has been created.

colnames(all_trips)

## [1] "ride_id"            "started_at"         "ended_at"          
## [4] "rideable_type"      "start_station_id"   "start_station_name"
## [7] "end_station_id"     "end_station_name"   "member_casual"

nrow(all_trips)

## [1] 3879822

dim(all_trips)

## [1] 3879822       9

head(all_trips)

##    ride_id          started_at            ended_at rideable_type
## 1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48          6251
## 2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30          6226
## 3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19          5649
## 4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58          4151
## 5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13          3270
## 6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56          3123
##   start_station_id        start_station_name end_station_id
## 1               81        Daley Center Plaza             56
## 2              317       Wood St & Taylor St             59
## 3              283 LaSalle St & Jackson Blvd            174
## 4               26  McClurg Ct & Illinois St            133
## 5              202      Halsted St & 18th St            129
## 6              420       Ellis Ave & 55th St            426
##            end_station_name member_casual
## 1 Desplaines St & Kinzie St    Subscriber
## 2 Wabash Ave & Roosevelt Rd    Subscriber
## 3     Canal St & Madison St    Subscriber
## 4  Kingsbury St & Kinzie St    Subscriber
## 5 Blue Island Ave & 18th St    Subscriber
## 6       Ellis Ave & 60th St    Subscriber

str(all_trips)

## 'data.frame':    3879822 obs. of  9 variables:
##  $ ride_id           : chr  "22178529" "22178530" "22178531" "22178532" ...
##  $ started_at        : chr  "2019-04-01 00:02:22" "2019-04-01 00:03:02" "2019-04-01 00:11:07" "2019-04-01 00:13:01" ...
##  $ ended_at          : chr  "2019-04-01 00:09:48" "2019-04-01 00:20:30" "2019-04-01 00:15:19" "2019-04-01 00:18:58" ...
##  $ rideable_type     : chr  "6251" "6226" "5649" "4151" ...
##  $ start_station_id  : int  81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name: chr  "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id    : int  56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name  : chr  "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual     : chr  "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...

summary(all_trips)

##    ride_id           started_at          ended_at         rideable_type     
##  Length:3879822     Length:3879822     Length:3879822     Length:3879822    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  start_station_id start_station_name end_station_id  end_station_name  
##  Min.   :  1.0    Length:3879822     Min.   :  1.0   Length:3879822    
##  1st Qu.: 77.0    Class :character   1st Qu.: 77.0   Class :character  
##  Median :174.0    Mode  :character   Median :174.0   Mode  :character  
##  Mean   :202.9                       Mean   :203.8                     
##  3rd Qu.:291.0                       3rd Qu.:291.0                     
##  Max.   :675.0                       Max.   :675.0                     
##                                      NA's   :1                         
##  member_casual     
##  Length:3879822    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Step 14. After careful scrutiny, there are a few problems we will need to fix. In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate that from four to two labels.First check how many observations fall under each user type.

table(all_trips$member_casual)

## 
##     casual   Customer     member Subscriber 
##      48480     857474     378407    2595461

Step 15. We can see that there is data under four labels in member type category. So, to change four types of membership to two, we use the following code chunk:

all_trips <-  all_trips %>% 
  mutate(member_casual = recode(member_casual
                                ,"Subscriber" = "member"
                                ,"Customer" = "casual"))

Step 16. Again re-run check for number of labels in member_casual using table() function.

table(all_trips$member_casual)

## 
##  casual  member 
##  905954 2973868

Step 17. Now, there are only two labels. Next, Add separate columns for the date, month, day, and year of each ride. This will allow us to aggregate ride data for each month, day, or year. Otherwise, we could only aggregate at the ride level with “ride_id”.

For that, first we should extract only date into a separate column using as.Date() function. Next, we need to use format() function to extract month, day of the week and year from date column. For more detailed explanation of the function click here

all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date),"%b")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%a")
all_trips$year <- format(as.Date(all_trips$date), "%Y")

Step 18. Create a column called “ride_length.” Calculate the length of each ride by subtracting the column “ended_at” from the column “started_at”. We can do this by using difftime() function to get time in seconds.

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

Step 19. Next, let’s inspect the structure of the columns in the dataframe “all_trips”

str(all_trips)

## 'data.frame':    3879822 obs. of  14 variables:
##  $ ride_id           : chr  "22178529" "22178530" "22178531" "22178532" ...
##  $ started_at        : chr  "2019-04-01 00:02:22" "2019-04-01 00:03:02" "2019-04-01 00:11:07" "2019-04-01 00:13:01" ...
##  $ ended_at          : chr  "2019-04-01 00:09:48" "2019-04-01 00:20:30" "2019-04-01 00:15:19" "2019-04-01 00:18:58" ...
##  $ rideable_type     : chr  "6251" "6226" "5649" "4151" ...
##  $ start_station_id  : int  81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name: chr  "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id    : int  56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name  : chr  "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
##  $ date              : Date, format: "2019-04-01" "2019-04-01" ...
##  $ month             : chr  "Apr" "Apr" "Apr" "Apr" ...
##  $ day_of_week       : chr  "Mon" "Mon" "Mon" "Mon" ...
##  $ year              : chr  "2019" "2019" "2019" "2019" ...
##  $ ride_length       : 'difftime' num  446 1048 252 357 ...
##   ..- attr(*, "units")= chr "secs"

Step 20. Convert data type of “ride_length” from factor to numeric so that we can run calculations on the data like maximum, mean, average, minimum etc.

is.factor(all_trips$ride_length)

## [1] FALSE

all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))

Step 21. Re-check whether datatype has been converted from factor to numeric

is.numeric(all_trips$ride_length)

## [1] TRUE

Step 22. Remove “bad” data. The data frame includes a few hundred entries, when bikes were taken out of docks and checked for quality by company. These values can be identified by negative value in “ride_length”. We will create a new version of the data frame (v2) where rows with negative “ride_length” data is being removed.

all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]

Step 23. Again carefully inspect the structure of the columns. We can see that number of rows have reduced from 3879822 to 3876042. So, there were 3780 rows with negative “ride_length”

str(all_trips_v2)

## 'data.frame':    3876042 obs. of  14 variables:
##  $ ride_id           : chr  "22178529" "22178530" "22178531" "22178532" ...
##  $ started_at        : chr  "2019-04-01 00:02:22" "2019-04-01 00:03:02" "2019-04-01 00:11:07" "2019-04-01 00:13:01" ...
##  $ ended_at          : chr  "2019-04-01 00:09:48" "2019-04-01 00:20:30" "2019-04-01 00:15:19" "2019-04-01 00:18:58" ...
##  $ rideable_type     : chr  "6251" "6226" "5649" "4151" ...
##  $ start_station_id  : int  81 317 283 26 202 420 503 260 211 211 ...
##  $ start_station_name: chr  "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
##  $ end_station_id    : int  56 59 174 133 129 426 500 499 211 211 ...
##  $ end_station_name  : chr  "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
##  $ date              : Date, format: "2019-04-01" "2019-04-01" ...
##  $ month             : chr  "Apr" "Apr" "Apr" "Apr" ...
##  $ day_of_week       : chr  "Mon" "Mon" "Mon" "Mon" ...
##  $ year              : chr  "2019" "2019" "2019" "2019" ...
##  $ ride_length       : num  446 1048 252 357 1007 ...

Step 24. Now, we can use the data for twelve months to do further analysis as we have completed a number of cleaning tasks and made it consistent.

Next, we can do a descriptive analysis on ride_length (all values in seconds) using following code chunk:

summary(all_trips_v2$ride_length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     412     712    1479    1289 9387024

Step 25. Our Aim is to find data about how casual riders use bike share compared to Annual members. So, we compare various parameters like average “ride_length” and number of trips for both categories of users.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, 
          FUN = mean)

##    all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                      Fri                3773.8351
## 2                      member                      Fri                 824.5305
## 3                      casual                      Mon                3372.2869
## 4                      member                      Mon                 842.5726
## 5                      casual                      Sat                3331.9138
## 6                      member                      Sat                 968.9337
## 7                      casual                      Sun                3581.4054
## 8                      member                      Sun                 919.9746
## 9                      casual                      Thu                3682.9847
## 10                     member                      Thu                 823.9278
## 11                     casual                      Tue                3596.3599
## 12                     member                      Tue                 826.1427
## 13                     casual                      Wed                3718.6619
## 14                     member                      Wed                 823.9996

Step 26. Notice how the days of the week are out of order. We can fix that by using the following code chunk:

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week,
                                    levels=c("Sun", "Mon", "Tue", 
                                             "Wed", "Thu", "Fri", "Sat"))

Step 27. Now, let’s compare the average “ride_length” again by each day for both category of users i.e. annual members and casual users.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week,
          FUN = mean)

##    all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                      Sun                3581.4054
## 2                      member                      Sun                 919.9746
## 3                      casual                      Mon                3372.2869
## 4                      member                      Mon                 842.5726
## 5                      casual                      Tue                3596.3599
## 6                      member                      Tue                 826.1427
## 7                      casual                      Wed                3718.6619
## 8                      member                      Wed                 823.9996
## 9                      casual                      Thu                3682.9847
## 10                     member                      Thu                 823.9278
## 11                     casual                      Fri                3773.8351
## 12                     member                      Fri                 824.5305
## 13                     casual                      Sat                3331.9138
## 14                     member                      Sat                 968.9337

Step 28. Next, we can analyze and find insights about user type by day of the week. We can first use Mutate() function to creates weekday field using wday(). The, followed by grouping data by user type and day of the week.

Then, we calculate number of rides and average duration of each ride to sort them by user type and day of the week. We use data pipes to complete all these analysis in the following code chunk:

 all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%                        ## creates weekday field using wday()
  group_by(member_casual, weekday) %>%                                        ## groups by user type and weekday
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%  ## calculates the number of rides and average duration
  arrange(member_casual, weekday)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        Sun              181293            3581.
##  2 casual        Mon              103296            3372.
##  3 casual        Tue               90510            3596.
##  4 casual        Wed               92457            3719.
##  5 casual        Thu              102679            3683.
##  6 casual        Fri              122404            3774.
##  7 casual        Sat              209543            3332.
##  8 member        Sun              267965             920.
##  9 member        Mon              472196             843.
## 10 member        Tue              508445             826.
## 11 member        Wed              500329             824.
## 12 member        Thu              484177             824.
## 13 member        Fri              452790             825.
## 14 member        Sat              287958             969.

Step 29. Now, Let’s visualize the number of rides v/s average duration plot grouped by rider type. We use ggplot() function to get a column barchart where “dodge” is used to group by “member_casual”

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")+
  labs(title = "Average duration v/s day of the week")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

1) From the above graph, we can see that average duration of “ridelength” for annual members is much lower than that for casual riders.So, if we target these casual riders, by introducing offers catering to this segment, we can increase annual membership enrollments into the company.

2) Amongst, casual riders, the average duration of “trip_length” is high on Fridays and Sundays. So, ads for memberships can be targeted on these days. Weekend offers might also be considered to increase revenue from casual riders.

Step 30. Let’s visualize the number of rides v/s weekday plot grouped by user type. Again,we use “dodge” for grouped column plots side by side

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%                        ## creates weekday field using wday()
  group_by(member_casual, weekday) %>%                                        ## groups by user type and weekday
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%  ## calculates the number of rides and average duration
  arrange(member_casual, weekday) %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")+
  labs(title = "Number of rides v/s day of the week")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

1) From the above graph, it is clear that number of trips by annual members are higher compared to casual riders especially during weekdays. Further analysis needs to be done to determine wether higher number of trips by annual members correspond to home-work commute during weekdays.

2) Again, we can observe casual riders have taken more number of trips during weekends compared to weekdays. Further analysis needs to be done to determine the reason for an increase in number of trips by casual users during weekend.

3) This graph also suggests the possibility of exploring weekend offers and ads targeting casual riders to convert to annual members.

Step 31. Next, we can export a SUMMARY FILE for further analysis. We Create a new csv file that we will use to visualize in Excel, Tableau, or my presentation software. By using write.csv() function, a new file with “all_trips_v2” data is created.

write.csv(all_trips_v2, "C:/Users/DELL/Documents/S/Google_dataanalytics/case_study_1/data/all_trips_final.csv", 
          row.names=FALSE)

Step 32. We also export a summary table from the data analysed in code chunk titled “Insights”

counts <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
write.csv(counts, file = 'C:/Users/DELL/Documents/S/Google_dataanalytics/case_study_1/data/avg_ride_length.csv')

bike_share_analysis

Sowparnika

2023-12-13

bike_share_analysis

Sowparnika

2023-12-13

Analysis of Bike share Data 2019-2020

Maximizing the number of annual memberships of a Bike-Share Company

Scenario

For the purpose of this case study, the datasets shared by the company are appropriate and will answer the business questions. The data has been made available by Motivate International Inc. under license. Decided to download data for four quarters.

After downloading the bike share data, unzip the files and create a folder on your desktop or Drive to house the files. Use appropriate file-naming conventions.

We have used data for four quarters and combined them to a single dataframe for twelve months. The time period for analysis is from April 2019 to March 2020.

step 1. install the required packages in R for completing the analysis

step 2. load the libraries

step 3. set your working directory or folder and change the file path according to folder in your computer.

step 4. Input the quaterly bikeshare data into respective data frames by using read.csv() function

step 5. Explore the dataset, Where relevant, make columns consistent and combine them into a single worksheet. Clean and transform your data to prepare for analysis. We can start by checking column names for consistency.

step 6. We can see that column names are inconsistent. “q2_2019” has a lot different column names than other quarters. So, to make the column names same we will follow the standard used in “q1_2020”. We will use rename() function.

Step 7. Re-check column names for consistency. Now, they are consistently named.

Step 8. Inspect the data frames and look for incongruities an find data type of each column

Step 9. Our next aim is to stack all four dataframes over one another into a single dataframe. So, as a first step, we will convert data type of “ride_id” and “rideable_type” columns to character so that they can stack correctly.

Step 10. Next, we can stack all four data frames into one big data frame using bind_rows() function

Step 11. To make data more consistent, remove lat, long, birthyear, and gender fields as this data was dropped from data collection since 2020

Step 12. Inspect the new table all_trips that has been created and also re-check column names for consistency

Step 13. Now, we might feel our data is ready for analysis. But, lets re-check again by computing number of rows, column names, using summary() function and cross checking dimensions of data frame. Basically, carefully inspect the new table all_trips that has been created.

Step 15. We can see that there is data under four labels in member type category. So, to change four types of membership to two, we use the following code chunk:

Step 16. Again re-run check for number of labels in member_casual using table() function.

Step 17. Now, there are only two labels. Next, Add separate columns for the date, month, day, and year of each ride. This will allow us to aggregate ride data for each month, day, or year. Otherwise, we could only aggregate at the ride level with “ride_id”.

For that, first we should extract only date into a separate column using as.Date() function. Next, we need to use format() function to extract month, day of the week and year from date column. For more detailed explanation of the function click here

Step 18. Create a column called “ride_length.” Calculate the length of each ride by subtracting the column “ended_at” from the column “started_at”. We can do this by using difftime() function to get time in seconds.

Step 19. Next, let’s inspect the structure of the columns in the dataframe “all_trips”

Step 20. Convert data type of “ride_length” from factor to numeric so that we can run calculations on the data like maximum, mean, average, minimum etc.

Step 21. Re-check whether datatype has been converted from factor to numeric

Step 23. Again carefully inspect the structure of the columns. We can see that number of rows have reduced from 3879822 to 3876042. So, there were 3780 rows with negative “ride_length”

Step 24. Now, we can use the data for twelve months to do further analysis as we have completed a number of cleaning tasks and made it consistent.

Next, we can do a descriptive analysis on ride_length (all values in seconds) using following code chunk:

Step 25. Our Aim is to find data about how casual riders use bike share compared to Annual members. So, we compare various parameters like average “ride_length” and number of trips for both categories of users.

Step 26. Notice how the days of the week are out of order. We can fix that by using the following code chunk:

Step 27. Now, let’s compare the average “ride_length” again by each day for both category of users i.e. annual members and casual users.

Step 28. Next, we can analyze and find insights about user type by day of the week. We can first use Mutate() function to creates weekday field using wday(). The, followed by grouping data by user type and day of the week.

Then, we calculate number of rides and average duration of each ride to sort them by user type and day of the week. We use data pipes to complete all these analysis in the following code chunk:

Step 29. Now, Let’s visualize the number of rides v/s average duration plot grouped by rider type. We use ggplot() function to get a column barchart where “dodge” is used to group by “member_casual”

1) From the above graph, we can see that average duration of “ridelength” for annual members is much lower than that for casual riders.So, if we target these casual riders, by introducing offers catering to this segment, we can increase annual membership enrollments into the company.

2) Amongst, casual riders, the average duration of “trip_length” is high on Fridays and Sundays. So, ads for memberships can be targeted on these days. Weekend offers might also be considered to increase revenue from casual riders.

Step 30. Let’s visualize the number of rides v/s weekday plot grouped by user type. Again,we use “dodge” for grouped column plots side by side

1) From the above graph, it is clear that number of trips by annual members are higher compared to casual riders especially during weekdays. Further analysis needs to be done to determine wether higher number of trips by annual members correspond to home-work commute during weekdays.

2) Again, we can observe casual riders have taken more number of trips during weekends compared to weekdays. Further analysis needs to be done to determine the reason for an increase in number of trips by casual users during weekend.

3) This graph also suggests the possibility of exploring weekend offers and ads targeting casual riders to convert to annual members.

Step 31. Next, we can export a SUMMARY FILE for further analysis. We Create a new csv file that we will use to visualize in Excel, Tableau, or my presentation software. By using write.csv() function, a new file with “all_trips_v2” data is created.

Step 32. We also export a summary table from the data analysed in code chunk titled “Insights”

We come to the end of bike-share case study! We have compared the data about casual users and annual members to obtain insights which would help in increasing memberships. Please do leave feedback regarding this analysis like what can be improved or included.