Introduction

In this data analysis project, we picked a business problem, collected data from various sources, and analyzed it. From that analysis we drew conclusions that could help the business grow. Data analysis usually proceeds in six stages: asking the right question, preparing the data, processing it, analyzing it, sharing the findings, and acting on them. This project follows that structure.

Business Task

To understand how casual customers and subscription members use Cyclist bikes differently, and to find ways to increase the number of customers and convert casual customers into subscribing members, so that the company earns more profit and riders get better services.

Data Source

In this project we use the Divvy_trips dataset, an open-source dataset of bike trips taken by Chicago riders in 2019: how long they rode, from where to where, when each trip started and ended, station names, birth year, and so on. Not all of this data is useful to us, so we need to clean and organize it. The data is split across four CSVs, one per quarter.

Loading libraries and setting the working directory.

tidyverse helps with data wrangling, i.e., removing errors and combining complex data sets to make them more accessible and easier to analyze. lubridate helps wrangle date attributes. ggplot2 helps with making graphs, labeling axes, adding titles and subtitles, and much more. The getwd() function displays your current working directory, and setwd("address") sets a new one. Knowing the working directory keeps our files organized and lets us add or remove files when necessary.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
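
The working-directory calls mentioned above are not shown in the output; a minimal sketch (the path is a placeholder for your own project folder) would be:

getwd()                          # print the current working directory
setwd("~/projects/divvy-analysis")  # placeholder path: the folder that holds the Divvy CSVs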

Step 1: Collecting Data

Step one of every data analysis project is to collect the data. It can be imported from CSVs or Excel sheets, or gathered by web scraping.

q1_2019 <- read_csv("Divvy_Trips_2019_Q1.csv")
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
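
read_csv() handles the CSV route. If the data had arrived as an Excel workbook instead, the readxl package could be used in much the same way; this is only an illustration (readxl is not loaded above and the file name is hypothetical):

library(readxl)
q1_2019 <- read_excel("Divvy_Trips_2019_Q1.xlsx", sheet = 1)  # hypothetical Excel version of the Q1 file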

Step 2: Wrangle Data and Combine Into a Single File

Data wrangling is the process of removing errors and combining complex data sets to make them more accessible and easier to analyze. We will combine all four CSVs after we wrangle the data. After importing the data, it is important to know what is in it, what type each column is, and to what extent it needs to be cleaned.

colnames(q1_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q2_2019) 
##  [1] "01 - Rental Details Rental ID"                   
##  [2] "01 - Rental Details Local Start Time"            
##  [3] "01 - Rental Details Local End Time"              
##  [4] "01 - Rental Details Bike ID"                     
##  [5] "01 - Rental Details Duration In Seconds Uncapped"
##  [6] "03 - Rental Start Station ID"                    
##  [7] "03 - Rental Start Station Name"                  
##  [8] "02 - Rental End Station ID"                      
##  [9] "02 - Rental End Station Name"                    
## [10] "User Type"                                       
## [11] "Member Gender"                                   
## [12] "05 - Member Details Member Birthday Year"
colnames(q3_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q4_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
(q2_2019 <- rename(q2_2019
                   ,trip_id = "01 - Rental Details Rental ID"
                   ,start_time = "01 - Rental Details Local Start Time"
                   ,end_time = "01 - Rental Details Local End Time"
                   ,bikeid = "01 - Rental Details Bike ID"
                   ,tripduration = "01 - Rental Details Duration In Seconds Uncapped"
                   ,from_station_id = "03 - Rental Start Station ID"
                   ,from_station_name = "03 - Rental Start Station Name"
                   ,to_station_id = "02 - Rental End Station ID"
                   ,to_station_name = "02 - Rental End Station Name"
                   ,usertype = "User Type"
                   ,gender = "Member Gender"
                   ,birthyear = "05 - Member Details Member Birthday Year"))
## # A tibble: 1,108,163 × 12
##     trip_id start_time          end_time            bikeid tripduration from_s…¹
##       <dbl> <dttm>              <dttm>               <dbl>        <dbl>    <dbl>
##  1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48   6251          446       81
##  2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30   6226         1048      317
##  3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19   5649          252      283
##  4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58   4151          357       26
##  5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13   3270         1007      202
##  6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56   3123          257      420
##  7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41   6418          548      503
##  8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11   4513          383      260
##  9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44   3280         2137      211
## 10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39   5534         2120      211
## # … with 1,108,153 more rows, 6 more variables: from_station_name <chr>,
## #   to_station_id <dbl>, to_station_name <chr>, usertype <chr>, gender <chr>,
## #   birthyear <dbl>, and abbreviated variable name ¹​from_station_id

Now that the column names are updated, let's check whether the changes took effect.

colnames(q1_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q2_2019) 
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q3_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
colnames(q4_2019)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"

Now that the column names are fixed for all four CSVs, we need to check their structure and make sure the data types match so that we can combine the four data frames into one. With all four combined, we can perform a single unified analysis.

Convert the data types of columns if needed so that the data frames stack and combine correctly; the mutate() function can change a column's data type, as sketched below.
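
In this dataset the four 2019 quarters already share compatible column types, so no conversion was actually needed; the sketch below is illustration only (not run here) of how mutate() would cast ID columns before binding:

q1_2019 <- mutate(q1_2019,
                  trip_id = as.character(trip_id),  # cast the numeric ID to character
                  bikeid  = as.character(bikeid))   # repeat for q2_2019, q3_2019, q4_2019 so types match before bind_rows()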

Now let's stack the individual quarters' data frames into one big data frame and take a look at the result.

all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2019)
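
A quick way to inspect the combined frame right away (not shown in the original output) is glimpse(), which lists every column with its type and a few sample values:

glimpse(all_trips)  # dplyr/tibble helper: one line per column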

Step 3: Clean up and add data to prepare for analysis

With the four quarters combined, the next task is to find and remove any bad data. We start by inspecting the new data frame we just created.

colnames(all_trips)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
nrow(all_trips)
## [1] 3818004
head(all_trips) 
## # A tibble: 6 × 12
##   trip_id start_time          end_time            bikeid tripd…¹ from_…² from_…³
##     <dbl> <dttm>              <dttm>               <dbl>   <dbl>   <dbl> <chr>  
## 1  2.22e7 2019-04-01 00:02:22 2019-04-01 00:09:48   6251     446      81 Daley …
## 2  2.22e7 2019-04-01 00:03:02 2019-04-01 00:20:30   6226    1048     317 Wood S…
## 3  2.22e7 2019-04-01 00:11:07 2019-04-01 00:15:19   5649     252     283 LaSall…
## 4  2.22e7 2019-04-01 00:13:01 2019-04-01 00:18:58   4151     357      26 McClur…
## 5  2.22e7 2019-04-01 00:19:26 2019-04-01 00:36:13   3270    1007     202 Halste…
## 6  2.22e7 2019-04-01 00:19:39 2019-04-01 00:23:56   3123     257     420 Ellis …
## # … with 5 more variables: to_station_id <dbl>, to_station_name <chr>,
## #   usertype <chr>, gender <chr>, birthyear <dbl>, and abbreviated variable
## #   names ¹​tripduration, ²​from_station_id, ³​from_station_name
summary(all_trips)
##     trip_id           start_time                    
##  Min.   :21742443   Min.   :2019-01-01 00:04:37.00  
##  1st Qu.:22873787   1st Qu.:2019-05-29 15:49:26.50  
##  Median :23962320   Median :2019-07-25 17:50:54.00  
##  Mean   :23915629   Mean   :2019-07-19 21:47:37.11  
##  3rd Qu.:24963703   3rd Qu.:2019-09-15 06:48:05.75  
##  Max.   :25962904   Max.   :2019-12-31 23:57:17.00  
##                                                     
##     end_time                          bikeid      tripduration     
##  Min.   :2019-01-01 00:11:07.00   Min.   :   1   Min.   :      61  
##  1st Qu.:2019-05-29 16:09:28.25   1st Qu.:1727   1st Qu.:     411  
##  Median :2019-07-25 18:12:23.00   Median :3451   Median :     709  
##  Mean   :2019-07-19 22:11:47.56   Mean   :3380   Mean   :    1450  
##  3rd Qu.:2019-09-15 08:30:13.25   3rd Qu.:5046   3rd Qu.:    1283  
##  Max.   :2020-01-21 13:54:35.00   Max.   :6946   Max.   :10628400  
##                                                                    
##  from_station_id from_station_name  to_station_id   to_station_name   
##  Min.   :  1.0   Length:3818004     Min.   :  1.0   Length:3818004    
##  1st Qu.: 77.0   Class :character   1st Qu.: 77.0   Class :character  
##  Median :174.0   Mode  :character   Median :174.0   Mode  :character  
##  Mean   :201.7                      Mean   :202.6                     
##  3rd Qu.:289.0                      3rd Qu.:291.0                     
##  Max.   :673.0                      Max.   :673.0                     
##                                                                       
##    usertype            gender            birthyear     
##  Length:3818004     Length:3818004     Min.   :1759    
##  Class :character   Class :character   1st Qu.:1979    
##  Mode  :character   Mode  :character   Median :1987    
##                                        Mean   :1984    
##                                        3rd Qu.:1992    
##                                        Max.   :2014    
##                                        NA's   :538751

After viewing the data, we decided that more columns are needed for a better analysis: splitting the start time into day, month, and year, and deriving rider age and ride length. With these we can analyze riders with respect to day, month, year, ride length, and age.

table(all_trips$usertype)
## 
##   Customer Subscriber 
##     880637    2937367
all_trips <- all_trips %>%
   mutate(usertype = recode(usertype
                                 ,"Subscriber" = "Member"
                                 ,"Customer" = "Casual"))      # relabel the two user types
all_trips$date <- as.Date(all_trips$start_time)                # ride date without the time component
all_trips$month <- format(as.Date(all_trips$date), "%m")       # month as "01" to "12"
all_trips$day <- format(as.Date(all_trips$date), "%d")         # day of the month
all_trips$year <- format(as.Date(all_trips$date), "%Y")        # four-digit year
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A") # weekday name, e.g. "Monday"
all_trips$age <- 2019-all_trips$birthyear                      # rider age in 2019
all_trips$ride_length <- difftime(all_trips$end_time,all_trips$start_time) # elapsed ride time

With the above code we extracted the day, month, and year of each ride, the rider's age, and the ride length. Next, convert ride_length and age to plain numbers so we can run calculations on them.

all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
all_trips$age <- as.numeric(as.character(all_trips$age))

Step 4: Conduct Descriptive Analysis

Descriptive analysis describes, shows, or summarizes data points. It is usually done by calculating the mean, median, minimum, and maximum.

summary(all_trips$ride_length)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    -56.37      6.85     11.82     24.17     21.40 177200.37
aggregate(all_trips$ride_length ~ all_trips$usertype, FUN = mean)
##   all_trips$usertype all_trips$ride_length
## 1             Casual              57.01734
## 2             Member              14.32765
aggregate(all_trips$ride_length ~ all_trips$usertype, FUN = median)
##   all_trips$usertype all_trips$ride_length
## 1             Casual              25.83333
## 2             Member               9.80000
aggregate(all_trips$ride_length ~ all_trips$usertype, FUN = max)
##   all_trips$usertype all_trips$ride_length
## 1             Casual              177200.4
## 2             Member              150943.9
aggregate(all_trips$ride_length ~ all_trips$usertype, FUN = min)
##   all_trips$usertype all_trips$ride_length
## 1             Casual             -48.28333
## 2             Member             -56.36667

As we can see, the minimum ride length is negative. This can happen when a bike is taken out of circulation for maintenance or quality checks. Because negative ride lengths and the quality-check rides (those starting from the "HQ QR" station) are misleading, we treat them as bad data, remove them, and save the result as a new data frame named all_trips_v2.

all_trips_v2 <- all_trips[!(all_trips$from_station_name == "HQ QR" | all_trips$ride_length<0),]
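
The same clean-up can be written in tidyverse style; this sketch gives the same result here, where the filtering columns contain no missing values:

all_trips_v2 <- all_trips %>%
  filter(!(from_station_name == "HQ QR" | ride_length < 0))  # drop quality-check rides and negative durations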

Let's add another condition to the previous calculation to get more detailed results. Previously we calculated the mean, median, min, and max by user type; now we also group by day of the week.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$usertype + all_trips_v2$day_of_week,
          FUN = mean)
##    all_trips_v2$usertype all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                 Casual                   Friday                 60.17561
## 2                 Member                   Friday                 13.89748
## 3                 Casual                   Monday                 54.49989
## 4                 Member                   Monday                 14.24928
## 5                 Casual                 Saturday                 54.06111
## 6                 Member                 Saturday                 16.30271
## 7                 Casual                   Sunday                 56.18519
## 8                 Member                   Sunday                 15.40290
## 9                 Casual                 Thursday                 59.95112
## 10                Member                 Thursday                 13.77979
## 11                Casual                  Tuesday                 57.41328
## 12                Member                  Tuesday                 14.15259
## 13                Casual                Wednesday                 60.33407
## 14                Member                Wednesday                 13.80984

To make the output easier to read, let's put the days of the week in their natural order.

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday",
                                                                       "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

After sorting:

aggregate(all_trips_v2$ride_length ~ all_trips_v2$usertype + all_trips_v2$day_of_week,
          FUN = mean)
##    all_trips_v2$usertype all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                 Casual                   Sunday                 56.18519
## 2                 Member                   Sunday                 15.40290
## 3                 Casual                   Monday                 54.49989
## 4                 Member                   Monday                 14.24928
## 5                 Casual                  Tuesday                 57.41328
## 6                 Member                  Tuesday                 14.15259
## 7                 Casual                Wednesday                 60.33407
## 8                 Member                Wednesday                 13.80984
## 9                 Casual                 Thursday                 59.95112
## 10                Member                 Thursday                 13.77979
## 11                Casual                   Friday                 60.17561
## 12                Member                   Friday                 13.89748
## 13                Casual                 Saturday                 54.06111
## 14                Member                 Saturday                 16.30271

Now let's see how many rides each type of rider takes on each day of the week, and how long those rides last on average.

all_trips_v2 %>%
  mutate(weekday = wday(start_time, label = TRUE)) %>% 
  group_by(usertype, day_of_week) %>% 
  summarise(number_of_rides = n() 
            ,average_duration = mean(ride_length)) %>% 
  arrange(usertype, day_of_week)
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups:   usertype [2]
##    usertype day_of_week number_of_rides average_duration
##    <chr>    <ord>                 <int>            <dbl>
##  1 Casual   Sunday               170173             56.2
##  2 Casual   Monday               101489             54.5
##  3 Casual   Tuesday               88655             57.4
##  4 Casual   Wednesday             89745             60.3
##  5 Casual   Thursday             101372             60.0
##  6 Casual   Friday               121141             60.2
##  7 Casual   Saturday             208056             54.1
##  8 Member   Sunday               256234             15.4
##  9 Member   Monday               458780             14.2
## 10 Member   Tuesday              497025             14.2
## 11 Member   Wednesday            494277             13.8
## 12 Member   Thursday             486915             13.8
## 13 Member   Friday               456966             13.9
## 14 Member   Saturday             287163             16.3
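
The repeated note about grouped output is only informational; it can be silenced by telling summarise() what to do with the grouping, for example:

all_trips_v2 %>%
  group_by(usertype, day_of_week) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length),
            .groups = "drop") %>%   # dropping the grouping suppresses the message
  arrange(usertype, day_of_week)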

Step 5: Visualization

As our analysis comes to its end, it is time to visualize our findings with graphs and charts:
1. Number of rides per usertype.
2. Average duration per usertype.
3. Average duration w.r.t Gender.
4. Age group of riders.

all_trips_v2 %>%
  mutate(weekday = wday(start_time, label = TRUE)) %>%
  group_by(usertype, day_of_week) %>%
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>%
  arrange(usertype, day_of_week) %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = usertype)) +
  ylim(0,1000000) +
  labs(title ="Number of Ride w.r.t Usertype") +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.

all_trips_v2 %>%
  mutate(day_of_week = wday(start_time, label = TRUE)) %>%
  group_by(usertype, day_of_week) %>%
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>%
  arrange(usertype, day_of_week) %>%
  ggplot(aes(x = day_of_week, y = average_duration, fill = usertype)) +
  ylim(0,80) +
  labs(title ="Average Duration of Ride w.r.t Usertype") +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.

all_trips_v2 %>%
  mutate(day_of_week = wday(start_time, label = TRUE)) %>%
  group_by(gender, day_of_week) %>%
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>%
  arrange(gender, day_of_week) %>%
  ggplot(aes(x = day_of_week, y = average_duration, fill = gender)) +
  ylim(0,80) + labs(title ="Average duration of Bike riding w.r.t Gender") +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.

hist(all_trips_v2$age,
  main = "Number of riders by ages",
  xlab = "Ages",
  ylab = "Number of Users",
  xlim = c(0,80),
  ylim = c(0,1500000),
  col = "gold")

Conclusion

Casual riders took far fewer trips than members in 2019 (about 0.9 million versus 2.9 million) but rode roughly four times longer on average (about 57 minutes versus 14 minutes). Member rides are concentrated on weekdays, which points to commuting, while casual rides peak on Saturday and Sunday, which points to leisure use. Most riders were born between the late 1970s and early 1990s, so the user base is largely in its late twenties to thirties.

Suggestions

Since casual riders already ride heavily on weekends and take much longer trips, efforts to convert them into members could focus on weekend membership promotions, pricing that rewards longer rides, and targeted offers at the stations and times where casual traffic is highest.