In this Data Analyst project, we picked a business problem and collected data from various sources & analyze it. After analyzing it, we made a certain conclusion which would help to grow the various businesses. Usually data analysis has 6 stages which are as follows:
To understand how customer and subscription members use Cyclist bikes differently, how to increase number of customer and convert those casual customer into subscription member so that they can enjoy more profit and better services.
In this project we have used Divvy_trips dataset which is a open source dataset which has cyclist data of citizens of Chicago of year 2019. How long they ride, from where to where, when they started and ended, station names, birth year and so on. Not all data in this dataset is beneficial to us so we need to clean and organize the data. Data source this project is going to use is split between 4 CSVs each CSV has a quarter’s data.
tidyverse helps in data wrangling i.e. removing errors and combining complex data sets to make them more accessible and easier to analyze. lubridate helps wrangle date attributes. ggplot2 helps in making of graphs, labeling axes, giving title, sub-title and much more. getwd() function displays your current working directory and setwd(“address”) sets you’re new working directory. We need to know our working directory so that we can keep our files organized and add or remove files if necessary.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
Step one of every data analyst project is to collect data. It can be imported via CSVs, excel sheets or could be web-scrapped.
q1_2019 <- read_csv("Divvy_Trips_2019_Q1.csv")
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
Data wrangling is the process of removing errors and combining complex data sets to make them more accessible and easier to analyze. We will combine all 4 CSVs after we wrangle the data. After importing data it is really important to know what is inn data, what type is it, to what extend it needs to be cleaned.
colnames(q1_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q2_2019)
## [1] "01 - Rental Details Rental ID"
## [2] "01 - Rental Details Local Start Time"
## [3] "01 - Rental Details Local End Time"
## [4] "01 - Rental Details Bike ID"
## [5] "01 - Rental Details Duration In Seconds Uncapped"
## [6] "03 - Rental Start Station ID"
## [7] "03 - Rental Start Station Name"
## [8] "02 - Rental End Station ID"
## [9] "02 - Rental End Station Name"
## [10] "User Type"
## [11] "Member Gender"
## [12] "05 - Member Details Member Birthday Year"
colnames(q3_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q4_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
(q2_2019 <- rename(q2_2019
,trip_id = "01 - Rental Details Rental ID"
,start_time = "01 - Rental Details Local Start Time"
,end_time = "01 - Rental Details Local End Time"
,bikeid = "01 - Rental Details Bike ID"
,tripduration = "01 - Rental Details Duration In Seconds Uncapped"
,from_station_id = "03 - Rental Start Station ID"
,from_station_name = "03 - Rental Start Station Name"
,to_station_id = "02 - Rental End Station ID"
,to_station_name = "02 - Rental End Station Name"
,usertype = "User Type"
,gender = "Member Gender"
,birthyear = "05 - Member Details Member Birthday Year"))
## # A tibble: 1,108,163 × 12
## trip_id start_time end_time bikeid tripduration from_s…¹
## <dbl> <dttm> <dttm> <dbl> <dbl> <dbl>
## 1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48 6251 446 81
## 2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30 6226 1048 317
## 3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19 5649 252 283
## 4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58 4151 357 26
## 5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13 3270 1007 202
## 6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56 3123 257 420
## 7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41 6418 548 503
## 8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11 4513 383 260
## 9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44 3280 2137 211
## 10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39 5534 2120 211
## # … with 1,108,153 more rows, 6 more variables: from_station_name <chr>,
## # to_station_id <dbl>, to_station_name <chr>, usertype <chr>, gender <chr>,
## # birthyear <dbl>, and abbreviated variable name ¹from_station_id
Now that column names are updated let’s check weather the changes we made are done or not.
colnames(q1_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q2_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q3_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q4_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
After all column names are fixed for all 4 CSV we need to check their structure. Now we need to match their data types to so that we can combine all 4 CSVs into one CSV. With all 4 CSV combined we can perform a combined analysis.
Convert data type of columns if needed so that they can stack and combine correctly and smoothly. We can do that by using mutate function which can change the data type.
Now lets’s stack individual quarter’s data frames into one big data frame and view what it looks like now.
all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2019)
After cleaning and wrangling the data we need to remove any Bad Data. It can be done by inspecting the new dataframe that we created.
colnames(all_trips)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
nrow(all_trips)
## [1] 3818004
head(all_trips)
## # A tibble: 6 × 12
## trip_id start_time end_time bikeid tripd…¹ from_…² from_…³
## <dbl> <dttm> <dttm> <dbl> <dbl> <dbl> <chr>
## 1 2.22e7 2019-04-01 00:02:22 2019-04-01 00:09:48 6251 446 81 Daley …
## 2 2.22e7 2019-04-01 00:03:02 2019-04-01 00:20:30 6226 1048 317 Wood S…
## 3 2.22e7 2019-04-01 00:11:07 2019-04-01 00:15:19 5649 252 283 LaSall…
## 4 2.22e7 2019-04-01 00:13:01 2019-04-01 00:18:58 4151 357 26 McClur…
## 5 2.22e7 2019-04-01 00:19:26 2019-04-01 00:36:13 3270 1007 202 Halste…
## 6 2.22e7 2019-04-01 00:19:39 2019-04-01 00:23:56 3123 257 420 Ellis …
## # … with 5 more variables: to_station_id <dbl>, to_station_name <chr>,
## # usertype <chr>, gender <chr>, birthyear <dbl>, and abbreviated variable
## # names ¹tripduration, ²from_station_id, ³from_station_name
summary(all_trips)
## trip_id start_time
## Min. :21742443 Min. :2019-01-01 00:04:37.00
## 1st Qu.:22873787 1st Qu.:2019-05-29 15:49:26.50
## Median :23962320 Median :2019-07-25 17:50:54.00
## Mean :23915629 Mean :2019-07-19 21:47:37.11
## 3rd Qu.:24963703 3rd Qu.:2019-09-15 06:48:05.75
## Max. :25962904 Max. :2019-12-31 23:57:17.00
##
## end_time bikeid tripduration
## Min. :2019-01-01 00:11:07.00 Min. : 1 Min. : 61
## 1st Qu.:2019-05-29 16:09:28.25 1st Qu.:1727 1st Qu.: 411
## Median :2019-07-25 18:12:23.00 Median :3451 Median : 709
## Mean :2019-07-19 22:11:47.56 Mean :3380 Mean : 1450
## 3rd Qu.:2019-09-15 08:30:13.25 3rd Qu.:5046 3rd Qu.: 1283
## Max. :2020-01-21 13:54:35.00 Max. :6946 Max. :10628400
##
## from_station_id from_station_name to_station_id to_station_name
## Min. : 1.0 Length:3818004 Min. : 1.0 Length:3818004
## 1st Qu.: 77.0 Class :character 1st Qu.: 77.0 Class :character
## Median :174.0 Mode :character Median :174.0 Mode :character
## Mean :201.7 Mean :202.6
## 3rd Qu.:289.0 3rd Qu.:291.0
## Max. :673.0 Max. :673.0
##
## usertype gender birthyear
## Length:3818004 Length:3818004 Min. :1759
## Class :character Class :character 1st Qu.:1979
## Mode :character Mode :character Median :1987
## Mean :1984
## 3rd Qu.:1992
## Max. :2014
## NA's :538751
After viewing data it was decided we need more columns for better analysis like splitting date and time column into Day, Month, Year. Extracting age and ride length. With this we can have insights about cyclist w.r.t Day, Month, Year, ride length and age.
table(all_trips$usertype)
##
## Customer Subscriber
## 880637 2937367
all_trips <- all_trips %>%
mutate(usertype = recode(usertype
,"Subscriber" = "Member"
,"Customer" = "Casual"))
all_trips$date <- as.Date(all_trips$start_time)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
all_trips$age <- 2019-all_trips$birthyear
all_trips$ride_length <- difftime(all_trips$end_time,all_trips$start_time)
With above code we extracted day, month, year of ride, age of the user and ride length. After this let’s calculate how much time rider took to reach from start point to end point.
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
all_trips$age <- as.numeric(as.character(all_trips$age))
Descriptive Analysis is the type of analysis of data that helps describe, show or summarize data points. It usually done by calculating mean, median, min and max.
summary(all_trips$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -56.37 6.85 11.82 24.17 21.40 177200.37
aggregate(all_trips$ride_length ~ all_trips$usertype, FUN = mean)
## all_trips$usertype all_trips$ride_length
## 1 Casual 57.01734
## 2 Member 14.32765
aggregate(all_trips$ride_length ~ all_trips$usertype, FUN = median)
## all_trips$usertype all_trips$ride_length
## 1 Casual 25.83333
## 2 Member 9.80000
aggregate(all_trips$ride_length ~ all_trips$usertype, FUN = max)
## all_trips$usertype all_trips$ride_length
## 1 Casual 177200.4
## 2 Member 150943.9
aggregate(all_trips$ride_length ~ all_trips$usertype, FUN = min)
## all_trips$usertype all_trips$ride_length
## 1 Casual -48.28333
## 2 Member -56.36667
As we can see minimum ride length is in negative. Reason for it could be that bike might have gone for maintenance or for repair. Because negative ride length is misleading and inappropriate it is considered as bad data so we will remove it from dataset and create a new data set by name all_trips_v2.
all_trips_v2 <- all_trips[!(all_trips$from_station_name == "HQ QR" | all_trips$ride_length<0),]
let’s add more condition to previous code to get better and more customized results. Previously we calculated mean, median, min, max w.r.t user type. Now day of week will be a added condition.
aggregate(all_trips_v2$ride_length ~ all_trips_v2$usertype + all_trips_v2$day_of_week,
FUN = mean)
## all_trips_v2$usertype all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 Casual Friday 60.17561
## 2 Member Friday 13.89748
## 3 Casual Monday 54.49989
## 4 Member Monday 14.24928
## 5 Casual Saturday 54.06111
## 6 Member Saturday 16.30271
## 7 Casual Sunday 56.18519
## 8 Member Sunday 15.40290
## 9 Casual Thursday 59.95112
## 10 Member Thursday 13.77979
## 11 Casual Tuesday 57.41328
## 12 Member Tuesday 14.15259
## 13 Casual Wednesday 60.33407
## 14 Member Wednesday 13.80984
To make it cleaner let’s sort data by days.
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday",
"Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
After Sort
aggregate(all_trips_v2$ride_length ~ all_trips_v2$usertype + all_trips_v2$day_of_week,
FUN = mean)
## all_trips_v2$usertype all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 Casual Sunday 56.18519
## 2 Member Sunday 15.40290
## 3 Casual Monday 54.49989
## 4 Member Monday 14.24928
## 5 Casual Tuesday 57.41328
## 6 Member Tuesday 14.15259
## 7 Casual Wednesday 60.33407
## 8 Member Wednesday 13.80984
## 9 Casual Thursday 59.95112
## 10 Member Thursday 13.77979
## 11 Casual Friday 60.17561
## 12 Member Friday 13.89748
## 13 Casual Saturday 54.06111
## 14 Member Saturday 16.30271
Now we’ll see what type of rider rides how much on what day.
all_trips_v2 %>%
mutate(weekday = wday(start_time, label = TRUE)) %>%
group_by(usertype, day_of_week) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(usertype, day_of_week)
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: usertype [2]
## usertype day_of_week number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 Casual Sunday 170173 56.2
## 2 Casual Monday 101489 54.5
## 3 Casual Tuesday 88655 57.4
## 4 Casual Wednesday 89745 60.3
## 5 Casual Thursday 101372 60.0
## 6 Casual Friday 121141 60.2
## 7 Casual Saturday 208056 54.1
## 8 Member Sunday 256234 15.4
## 9 Member Monday 458780 14.2
## 10 Member Tuesday 497025 14.2
## 11 Member Wednesday 494277 13.8
## 12 Member Thursday 486915 13.8
## 13 Member Friday 456966 13.9
## 14 Member Saturday 287163 16.3
As our analysis is coming to it’s end now is the time to visualize
our finding via graphs and charts.
1. Number of rides per usertype.
2. Average duration per usertype.
3. Average duration w.r.t Gender.
4. Age group of riders.
all_trips_v2 %>%
mutate(weekday = wday(start_time, label = TRUE)) %>%
group_by(usertype, day_of_week) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(usertype, day_of_week) %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = usertype)) +
ylim(0,1000000) +
labs(title ="Number of Ride w.r.t Usertype") +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
all_trips_v2 %>%
mutate(day_of_week = wday(start_time, label = TRUE)) %>%
group_by(usertype, day_of_week) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(usertype, day_of_week) %>%
ggplot(aes(x = day_of_week, y = average_duration, fill = usertype)) +
ylim(0,80) +
labs(title ="Average Duration of Ride w.r.t Usertype") +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'usertype'. You can override using the
## `.groups` argument.
all_trips_v2 %>%
mutate(day_of_week = wday(start_time, label = TRUE)) %>%
group_by(gender, day_of_week) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(gender, day_of_week) %>%
ggplot(aes(x = day_of_week, y = average_duration, fill = gender)) +
ylim(0,80) + labs(title ="Average duration of Bike riding w.r.t Gender") +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
hist(all_trips_v2$age,
main = "Number of riders by ages",
xlab = "Ages",
ylab = "Number of Users",
xlim = c(0,80),
ylim = c(0,1500000),
col = "gold")
Members have more bikes compared to casual riders.
We have more members riding in all months compared to casual riders.
Casual riders travel for a longer time period.
Members ride more throughout the entire weekday while the casual riders also have a high ride record during the weekends(Saturday and Sunday) compared to the other days of the week.
Casual riders go farther in terms of distance.