1 Greetings

Welcome to my Rmd. The reason why I created this Rmd is to improve my EDA (Exploratory Data Analysis) skill using R language.

2 Brief explanation about the data

The data which will be used is history of bike rent in New York on August 2019.

Columns Insight :
1. tripduration: duration of bike trip from start station until end station.
2. starttime: start time of rent.
3. stoptime: end time of rent.
4. start.staion.name: the initial place where the rent took place.
5. end.staion.name: the last place of rent.
6. bikeid : unique code of bike for tracking.
7. usertype: category of user who rent the bike (Subscriber & Customer).
8. birth.year : birth year of user.
9. gender: the range of characteristics pertaining to sex of (Male & Female).

3 Business Questions

Show top 10 favorite routes!
Compare average trip duration for top 10 favorite routes and top 10 non favorite routes!
Find out trip duration_type frequency based on user category type and gender!
On what day most of the bike rent occurs?
Find out the frequency of trip duration type at the most busies day then show the average and total trip duration on each trip duration type on that day!
Visualize bike rent trend every hour on each day in a week!

4 Wrangle

First thing first, lets do the wrangling. But what is Data Wrangling? Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question.

4.1 Import necessary library to help solving the business questions.

library(lubridate)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(knitr)

4.2 Read the data and assign it into an object called `bike`

bike <- read.csv("http://bit.ly/dwp-data-bike")

4.3 Inspect the data type from each columns

str(bike)

## 'data.frame':    46230 obs. of  9 variables:
##  $ tripduration      : int  261 172 525 219 262 820 543 478 646 265 ...
##  $ starttime         : chr  "2019-08-01 00:14:55.9900" "2019-08-01 00:23:06.9910" "2019-08-01 00:23:28.6170" "2019-08-01 00:32:36.1410" ...
##  $ stoptime          : chr  "2019-08-01 00:19:17.4780" "2019-08-01 00:25:59.1480" "2019-08-01 00:32:13.7000" "2019-08-01 00:36:15.2730" ...
##  $ start.station.name: chr  "JC Medical Center" "Dixon Mills" "Newport Pkwy" "Warren St" ...
##  $ end.station.name  : chr  "Liberty Light Rail" "Grove St PATH" "Hamilton Park" "City Hall" ...
##  $ bikeid            : int  26268 26162 29279 29598 26162 29598 29525 29641 29448 29477 ...
##  $ usertype          : chr  "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ birth.year        : int  1980 1996 1991 1988 1960 1981 1985 1978 1987 1984 ...
##  $ gender            : int  1 1 1 1 1 1 2 1 2 2 ...

From the data inspection above, there are some reworks need to be done.

4.4 Change data type for columns `starttime` and `stoptime`

head(bike$starttime)

## [1] "2019-08-01 00:14:55.9900" "2019-08-01 00:23:06.9910"
## [3] "2019-08-01 00:23:28.6170" "2019-08-01 00:32:36.1410"
## [5] "2019-08-01 00:41:26.6700" "2019-08-01 00:43:15.2990"

tail(bike$starttime)

## [1] "2019-08-31 23:35:54.0660" "2019-08-31 23:40:45.2450"
## [3] "2019-08-31 23:45:55.6660" "2019-08-31 23:50:27.9220"
## [5] "2019-08-31 23:52:21.9250" "2019-08-31 23:56:55.1530"

Why do I show the head and tail from bike$starttune, the reason is to obtain the date position whether the date is dd/mm/yy, mm/dd/yy, etc.

From the result above I can concluded the date position is yyyy/mm/mm (Month/Date/Year) _ hh/mm/ss (Hour/Minute/Second)

Lets change both starttime and stopdate with code below.

bike[ ,c("starttime", "stoptime")] <- lapply(bike[ , c("starttime", "stoptime")], ymd_hms)

Just want to make sure whether the data type have been change or not.

str(bike$starttime)

##  POSIXct[1:46230], format: "2019-08-01 00:14:55" "2019-08-01 00:23:06" "2019-08-01 00:23:28" ...

str(bike$stoptime)

##  POSIXct[1:46230], format: "2019-08-01 00:19:17" "2019-08-01 00:25:59" "2019-08-01 00:32:13" ...

4.5 After `startime` and `stoptime` have been change into datetime data type, extract hour from columns `starttime.`

bike$hour.rent <- hour(bike$starttime)

# Check the new columns
str(bike$hour.rent)

##  int [1:46230] 0 0 0 0 0 0 0 1 1 3 ...

Switch the data from columns hour.rent from integer into string (1 -> “1 PM”) with the purpose to give better visualization later on by implementing function sapply(..., FUN = swith).

 bike$hour.rent <- sapply(X = as.character(bike$hour.rent), FUN = switch,
                           "0" = "12 PM",
                           "1" = "1 PM", 
                           "2" = "2 PM", 
                           "3" = "3 PM", 
                           "4" = "4 PM",
                           "5" = "5 PM", 
                           "6" = "6 PM", 
                           "7" = "7 PM", 
                           "8" = "8 PM",
                           "9" = "9 PM", 
                           "10" = "10 PM",
                           "11" = "11 PM", 
                           "12" = "12 AM",
                           "13" = "1 AM", 
                           "14" = "2 AM",
                           "15" = "3 AM",
                           "16" = "4 AM",
                           "17" = "5 AM",
                           "18" = "6 AM",
                           "19" = "7 AM",
                           "20" = "8 AM",
                           "21" = "9 AM",
                           "22" = "10 AM",
                           "23" = "11 AM")

Level the factor column position hour.rent.

bike$hour.rent <- factor(bike$hour.rent, levels = c("12 PM", 
                                                       "1 PM",
                                                       "2 PM",
                                                       "3 PM",
                                                       "4 PM",
                                                       "5 PM",
                                                       "6 PM",
                                                       "7 PM",
                                                       "8 PM",
                                                       "9 PM",
                                                       "10 PM",
                                                       "11 PM",
                                                       "12 AM",
                                                       "1 AM",
                                                       "2 AM",
                                                       "3 AM",
                                                       "4 AM",
                                                       "5 AM",
                                                       "6 AM",
                                                       "7 AM",
                                                       "8 AM",
                                                       "9 AM",
                                                       "10 AM",
                                                       "11 AM"))   
str(bike$hour.rent)

##  Factor w/ 24 levels "12 PM","1 PM",..: 1 1 1 1 1 1 1 2 2 4 ...

4.6 Extract name of day from columns `starttime` and change the level so the level will start from Monday.

bike$dayname <- wday(bike$starttime, label = T, abbr = F, week_start = "1")

# Check the level
str(bike$dayname)

##  Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 4 4 4 4 4 4 4 4 4 4 ...

4.7 Classify duration rent category with the requirements below:

tripduration <= 500s : “Short Trip”
tripduration <= 1000s : “Medium Trip”
tripduration > 1000s : “Long trip”

# Make a looping function, in order to get the `tripduration_type` based on `tripduration` column
convert_tripdur <- function(x)
{
  if( x <= 500)
  {
    x <- "Short Trip"
  }else
    if(x <= 1000)
  {
    x <- "Medium Trip"
  }else{
    x <- "Long Trip"
  }
}

# Create a new column and apply the function
bike$tripdur_type <- sapply(X = bike$tripduration, FUN = convert_tripdur)

# Check the new column and make sure whether the function is correct
tail(subset(bike, select = c(tripduration, tripdur_type)),10)

Great! The function works perfectly. But due to the function was changing from integer to string, columns tripdur_type datatype must be change into factor.

# Change the data type
bike$tripdur_type <- as.factor(bike$tripdur_type)
str(bike$tripdur_type)

##  Factor w/ 3 levels "Long Trip","Medium Trip",..: 3 3 2 3 3 2 2 3 2 3 ...

4.8 It is better to combine `start.station.name` & `end.station.name` into one columns name `route`, with the purpose to make visualization easier for the best 10 routes.

# Combining two columns can be done easily using `paste()` function.
bike$route <- paste(bike$start.station.name, "-", bike$end.station.name)

# Show top 5 value after combining the two columns
head(bike$route)

## [1] "JC Medical Center - Liberty Light Rail"
## [2] "Dixon Mills - Grove St PATH"           
## [3] "Newport Pkwy - Hamilton Park"          
## [4] "Warren St - City Hall"                 
## [5] "Grove St PATH - Jersey & 3rd"          
## [6] "City Hall - Bergen Ave"

4.9 Instead of using number to represent gender it is better to changed it into Male & Female.

1 is represent Male
2 is represent Female.

#Make a looping funtion, in order to convert the gender from number into string
convert_gender <- function(x)
{
  if( x == 1)
  {
    x <- "Male"
  }
  else
  {
    x <- "Female"
  }
}

# Apply the funtion into columns `gender`
bike$gender <- sapply(X = bike$gender, FUN = convert_gender)

# Check wheter the function works 
head(subset(bike, select = gender),10)

After check the function works. Follow the rest of step like in sub-bab 4.7.

# Cchange it into factor
bike$gender <- as.factor(bike$gender)

str(bike$gender)

##  Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 1 2 1 1 ...

4.10 Change data type columns `usertype` from chr into factor

bike$usertype <- as.factor(bike$usertype)

str(bike$usertype)

##  Factor w/ 2 levels "Customer","Subscriber": 2 2 2 2 2 2 2 2 2 2 ...

4.11 Last but not least, check for missing value.

colSums(is.na(bike))

##       tripduration          starttime           stoptime start.station.name 
##                  0                  0                  0                  0 
##   end.station.name             bikeid           usertype         birth.year 
##                  0                  0                  0                  0 
##             gender          hour.rent            dayname       tripdur_type 
##                  0                  0                  0                  0 
##              route 
##                  0

Good! There are no missing value.

5 Exploratory Data Analysis

summary(bike)

##   tripduration      starttime                      stoptime                  
##  Min.   :  61.0   Min.   :2019-08-01 00:14:55   Min.   :2019-08-01 00:19:17  
##  1st Qu.: 240.0   1st Qu.:2019-08-08 18:04:08   1st Qu.:2019-08-08 18:12:52  
##  Median : 352.0   Median :2019-08-16 08:24:12   Median :2019-08-16 08:29:01  
##  Mean   : 478.2   Mean   :2019-08-16 12:04:04   Mean   :2019-08-16 12:12:03  
##  3rd Qu.: 587.0   3rd Qu.:2019-08-24 09:26:23   3rd Qu.:2019-08-24 09:36:51  
##  Max.   :2000.0   Max.   :2019-08-31 23:56:55   Max.   :2019-09-01 00:08:25  
##                                                                              
##  start.station.name end.station.name       bikeid            usertype    
##  Length:46230       Length:46230       Min.   :14967   Customer  : 6149  
##  Class :character   Class :character   1st Qu.:26289   Subscriber:40081  
##  Mode  :character   Mode  :character   Median :29287                     
##                                        Mean   :28459                     
##                                        3rd Qu.:29538                     
##                                        Max.   :39224                     
##                                                                          
##    birth.year      gender        hour.rent          dayname    
##  Min.   :1950   Female:14309   8 PM   : 5135   Monday   :6377  
##  1st Qu.:1976   Male  :31921   6 AM   : 4667   Tuesday  :6627  
##  Median :1984                  5 AM   : 4333   Wednesday:5958  
##  Mean   :1982                  7 AM   : 3521   Thursday :8470  
##  3rd Qu.:1989                  7 PM   : 3199   Friday   :8100  
##  Max.   :2003                  9 PM   : 2861   Saturday :6258  
##                                (Other):22514   Sunday   :4440  
##       tripdur_type      route          
##  Long Trip  : 4458   Length:46230      
##  Medium Trip:10166   Class :character  
##  Short Trip :31606   Mode  :character  
##                                        
##                                        
##                                        
##

The function summary is a generic function used to produce result summaries of the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument.

From the result above there are some information which might be the answer for the business question given, or at least guidance to do the EDA.

5.1 Show top 10 favorite routes.

fav_routes <- bike %>%
                group_by(route) %>%
                summarise(freq = n()) %>%
                arrange(desc(freq)) %>%
                head(10)

fav_routes

# Visualize 
plot_fav_routes <- ggplot(fav_routes, aes(x = freq, reorder(route, freq))) +
                      geom_col(aes(fill = freq)) +
                      geom_text(aes(label = freq),color = "white", size = 4,nudge_x = -40) +
                      labs(title = "Top 10 Favourite Routes",
                           subtitle = "",
                           x = "Freq in 1 Month",
                           y = "Route (Start Point - End Point)",
                           color = "") +
                      theme_bw()  +
                      theme(legend.position = "none",
                            legend.title = element_blank())

plot_fav_routes

5.2 Compare average tripduration for top 10 favorite routes and top 10 non favorite routes!

Count 10 less favorite route.

# Find top 10 least favorite route
less_fav_routes <- bike %>%
                    group_by(route) %>%
                    summarise(freq = n()) %>%
                    arrange(desc(freq)) %>%
                    tail(10)

less_fav_routes

Filter the top most and less favorite route.

# Filter top 10 less favorite route into bike data frame 
filter_less_fav_route <- filter(bike, bike$route %in% less_fav_routes$route)

# Check the function by using function `unique()`
unique(filter_less_fav_route$route)

##  [1] "Washington St - Journal Square"  "York St - Washington St"        
##  [3] "Washington St - Glenwood Ave"    "Washington St - Astor Place"    
##  [5] "Washington St - McGinley Square" "Warren St - Oakland Ave"        
##  [7] "York St - Hilltop"               "York St - McGinley Square"      
##  [9] "Washington St - York St"         "Warren St - Lincoln Park"

# Filter top 10 most favorite route into bike data frame 
filter_fav_route <- filter(bike, bike$route %in% fav_routes$route)

# Check the function by using function `unique()`
unique(filter_fav_route$route)

##  [1] "Dixon Mills - Grove St PATH"      "McGinley Square - Sip Ave"       
##  [3] "Brunswick & 6th - Grove St PATH"  "Brunswick St - Grove St PATH"    
##  [5] "Marin Light Rail - Grove St PATH" "Hamilton Park - Grove St PATH"   
##  [7] "Monmouth and 6th - Grove St PATH" "Jersey & 6th St - Grove St PATH" 
##  [9] "Grove St PATH - Hamilton Park"    "Grove St PATH - Marin Light Rail"

Calculate average trip duration for most favorite route.

avg_tripdur_fav_route <- mean(filter_fav_route$tripduration)

avg_tripdur_fav_route

## [1] 290.786

Calculate aerage trip duration for least favorite route

avg_tripdur_less_fav_route <- mean(filter_less_fav_route$tripduration)

avg_tripdur_less_fav_route

## [1] 866.9

5.3 Find out tripduration_type frequnce based on usercategory type and gender!

Create a new data frame consist of columns trip duration type, user type and gender .

# Since output from function `table()` is not a data frame, the output require to be change into data frame first before being visualize
triptype_usrcat_gen <- as.data.frame(table(bike$tripdur_type, bike$usertype, bike$gender))

head(triptype_usrcat_gen, 10)

Visualize by split the graph based on trip duration type and user type.

plot_triptype_usr_gen <- ggplot(triptype_usrcat_gen, aes(x = Freq, y = reorder(Var3,Freq), fill =  Var3)) +
                            geom_col(position = "dodge") +
                            facet_grid(Var1 ~ Var2,scales = "free_x") +
                            scale_fill_manual(values = c("red","blue")) +
                            labs(title = "Comparison Tripduration Type Frequency Bike Rent",
                                 subtitle = "Group by Usertype (Customer & Subscriber) and Gender (Male & Female)",
                                 x = "Frequency", y = "", fill = "") +
                            theme_light() +
                            theme(legend.position = "bottom")

plot_triptype_usr_gen

5.4 On what day most of the bike rent occurs?

knitr::include_graphics("function_summary_bike_dayname.png")

5.5 Find out the frequence of tripduration type at the most busies day, then show the average and total trip duration on each tripduration type on that day!

Subset the most busies day (Thursday) then implement function summarise() to: - Percentage of count trip duration type on that day for each tripdur_type. - mean tripduration. - Percentage of sum trip duration on each tripdur_type.

thursday <- bike %>% 
              subset(subset = bike$dayname == "Thursday")%>% 
              group_by(tripdur_type) %>% 
              summarise(freq = n(),
                        mean_tripdur = round(mean(tripduration), 1),
                        total_tripdur = sum(tripduration)) %>% 
              mutate(
                f_percent = freq / sum(freq),
                f_percent = round(f_percent * 100, 1),
                t_percent = total_tripdur / sum(total_tripdur),
                t_percent = round(t_percent * 100, 1),
                freq_percent = glue::glue("{freq} ({f_percent} %)"),
                total_percent = glue::glue("{total_tripdur} ({t_percent} %)")
              )
# Select the the required columns
thursday_new <- thursday %>% 
                  select(tripdur_type, freq_percent, total_percent, mean_tripdur)

thursday_new

5.6 Visualize bike rent trend every hour on each day in a week! And find out how many bike rent occur during peak hours on weekdays!

Count how many rent happen every day and hour

trend_day_hour <- as.data.frame(table(bike$dayname, bike$hour.rent))

head(trend_day_hour)

Visualize

plot_trend_day_hour <- ggplot(data = trend_day_hour, aes(x = Var1, y = Freq, fill = Var2))+
                          geom_bar(stat = "identity",position = "dodge") +
                          scale_fill_manual(values = c("red", "blue", "green", "yellow", "violet", "pink", "purple", "black",
                                                       "grey", "brown", "cyan", "orange", "beige", "light pink", "gold",
                                                       "sienna1", "hotpink2", "aquamarine", "wheat", "plum", "khaki", "azure",
                                                       "sky blue", "tan" )) +
                          labs(title = "Bike Rent Trend Every Hour",
                              subtitle = "Group by Day",
                              x = "Day Name",
                              y = "Freq",
                              color = "") +
                        theme_classic()  +
                        theme(legend.position = "bottom",
                              legend.title = element_blank())

plot_trend_day_hour

6 Insight

Grove St.Path is in the list of top 10 most favorite routes for start point or end point bike rent, it is because there are a rail station which connect into several important place in New York. Therefore, it is better if inventory management/planner team always ensures that the stock of bike is enough, especially during the most busies day and hour.
Comparison between average trip duration in most and less favorite routes:

Average most favorite routes : 290.7 sec / around 4 minutes
Average less favorite routes : 866.9 sec / around 14 minutes
From the comparison above, the trip duration type for average most favorite type can be categorized into short trip and for less favorite routes is long trip. Trip duration category can also be interpreted as how long the distance of travel between the start point until end point. Due to that interpretation, less favorite routes have less frequency of rent because many user obtain choose faster or less effort transportation compared to rent a bike

During August 2019, from total 46.230 rent of bike most of the rent are from user type Subscriber with total 40,081 rent.
Deeper observation, trip duration type and gender on each user type:

For user type Subscriber is dominated by male gender for each type of trip duration. The most common type of tripduration is for Short Trip followed by Medium Trip and finally for Long trips, with a significant comparison of frequencies for each type of rent.
Meanwhile, for user type Customer is more dominated by female gender for each type of trip duration. The type of trip duration that occurs is the same as user type Subscriber (Short Trip - Medium trip - Long trip), there is no significant comparison between each gender in every trip duration type.

The most busies day for bike rent is in Thursday with total 8470 rent happened on that day during August 2019.
On Thursday, the type of trip duration that occurred the most was short trips with the percentage of 73% followed by Medium trips with the percentage of 19.8% and the last one was Long Trips. The total duration of rent is directly proportional to the frequency of trip duration type, the more the total borrowed duration for a certain type of duration, the greater the accumulated total time borrowed on that day. The average trip duration for each type:

Long Trip : 1,346.9 sec / 22.5 min
Medium Trip : 680.5 sec / 11.3 min
Short Trip : 282 sec / 4.7 min

There are cycle of rent hour during weekday(Monday - Friday) and weekend(Saturday-Sunday).

Cycle hour of rent during weekday : Peak hours occur at 6-8 in the morning and 5-7 at night.
Peak hours occur between 6-8 in the morning can be happened because during those hour most people usually will go to work or school, and peak hours between 5-7 at night are due to most people just finished their daily activity and want to go back to their home.
Cycle hour of rent during weekend : During weekend there are no peak hours like in weekday, frequency of bike rent happened on every hour are relatively stable but the frequency of rent is not as much as on weekday.

Bike Rent EDA

Author : Victor Nugraha

1 Greetings

2 Brief explanation about the data

3 Business Questions

4 Wrangle

4.1 Import necessary library to help solving the business questions.

4.2 Read the data and assign it into an object called `bike`

4.3 Inspect the data type from each columns

4.4 Change data type for columns `starttime` and `stoptime`

4.5 After `startime` and `stoptime` have been change into datetime data type, extract hour from columns `starttime.`

4.6 Extract name of day from columns `starttime` and change the level so the level will start from Monday.

4.7 Classify duration rent category with the requirements below:

4.8 It is better to combine `start.station.name` & `end.station.name` into one columns name `route`, with the purpose to make visualization easier for the best 10 routes.

4.9 Instead of using number to represent gender it is better to changed it into Male & Female.

4.10 Change data type columns `usertype` from chr into factor

4.11 Last but not least, check for missing value.

5 Exploratory Data Analysis

5.1 Show top 10 favorite routes.

5.2 Compare average tripduration for top 10 favorite routes and top 10 non favorite routes!

5.3 Find out tripduration_type frequnce based on usercategory type and gender!

5.4 On what day most of the bike rent occurs?

5.5 Find out the frequence of tripduration type at the most busies day, then show the average and total trip duration on each tripduration type on that day!

5.6 Visualize bike rent trend every hour on each day in a week! And find out how many bike rent occur during peak hours on weekdays!

6 Insight

Bike Rent EDA

Author : Victor Nugraha

1 Greetings

2 Brief explanation about the data

3 Business Questions

4 Wrangle

4.1 Import necessary library to help solving the business questions.

4.2 Read the data and assign it into an object called bike

4.3 Inspect the data type from each columns

4.4 Change data type for columns starttime and stoptime

4.5 After startime and stoptime have been change into datetime data type, extract hour from columns starttime.

4.6 Extract name of day from columns starttime and change the level so the level will start from Monday.

4.7 Classify duration rent category with the requirements below:

4.8 It is better to combine start.station.name & end.station.name into one columns name route, with the purpose to make visualization easier for the best 10 routes.

4.9 Instead of using number to represent gender it is better to changed it into Male & Female.

4.10 Change data type columns usertype from chr into factor

4.11 Last but not least, check for missing value.

5 Exploratory Data Analysis

5.1 Show top 10 favorite routes.

5.2 Compare average tripduration for top 10 favorite routes and top 10 non favorite routes!

5.3 Find out tripduration_type frequnce based on usercategory type and gender!

5.4 On what day most of the bike rent occurs?

5.5 Find out the frequence of tripduration type at the most busies day, then show the average and total trip duration on each tripduration type on that day!

5.6 Visualize bike rent trend every hour on each day in a week! And find out how many bike rent occur during peak hours on weekdays!

6 Insight

4.2 Read the data and assign it into an object called `bike`

4.4 Change data type for columns `starttime` and `stoptime`

4.5 After `startime` and `stoptime` have been change into datetime data type, extract hour from columns `starttime.`

4.6 Extract name of day from columns `starttime` and change the level so the level will start from Monday.

4.8 It is better to combine `start.station.name` & `end.station.name` into one columns name `route`, with the purpose to make visualization easier for the best 10 routes.

4.10 Change data type columns `usertype` from chr into factor