Welcome to my Rmd. The reason why I created this Rmd is to improve my EDA (Exploratory Data Analysis) skill using R language.
The data which will be used is history of bike rent in New York on August 2019.
Columns Insight :
1. tripduration: duration of bike trip from start station until end station.
2. starttime: start time of rent.
3. stoptime: end time of rent.
4. start.staion.name: the initial place where the rent took place.
5. end.staion.name: the last place of rent.
6. bikeid : unique code of bike for tracking.
7. usertype: category of user who rent the bike (Subscriber & Customer).
8. birth.year : birth year of user.
9. gender: the range of characteristics pertaining to sex of (Male & Female).
First thing first, lets do the wrangling. But what is Data Wrangling? Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question.
library(lubridate)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(knitr)
bikebike <- read.csv("http://bit.ly/dwp-data-bike")
str(bike)
## 'data.frame': 46230 obs. of 9 variables:
## $ tripduration : int 261 172 525 219 262 820 543 478 646 265 ...
## $ starttime : chr "2019-08-01 00:14:55.9900" "2019-08-01 00:23:06.9910" "2019-08-01 00:23:28.6170" "2019-08-01 00:32:36.1410" ...
## $ stoptime : chr "2019-08-01 00:19:17.4780" "2019-08-01 00:25:59.1480" "2019-08-01 00:32:13.7000" "2019-08-01 00:36:15.2730" ...
## $ start.station.name: chr "JC Medical Center" "Dixon Mills" "Newport Pkwy" "Warren St" ...
## $ end.station.name : chr "Liberty Light Rail" "Grove St PATH" "Hamilton Park" "City Hall" ...
## $ bikeid : int 26268 26162 29279 29598 26162 29598 29525 29641 29448 29477 ...
## $ usertype : chr "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ birth.year : int 1980 1996 1991 1988 1960 1981 1985 1978 1987 1984 ...
## $ gender : int 1 1 1 1 1 1 2 1 2 2 ...
From the data inspection above, there are some reworks need to be done.
starttime and stoptimehead(bike$starttime)
## [1] "2019-08-01 00:14:55.9900" "2019-08-01 00:23:06.9910"
## [3] "2019-08-01 00:23:28.6170" "2019-08-01 00:32:36.1410"
## [5] "2019-08-01 00:41:26.6700" "2019-08-01 00:43:15.2990"
tail(bike$starttime)
## [1] "2019-08-31 23:35:54.0660" "2019-08-31 23:40:45.2450"
## [3] "2019-08-31 23:45:55.6660" "2019-08-31 23:50:27.9220"
## [5] "2019-08-31 23:52:21.9250" "2019-08-31 23:56:55.1530"
Why do I show the head and tail from bike$starttune, the reason is to obtain the date position whether the date is dd/mm/yy, mm/dd/yy, etc.
From the result above I can concluded the date position is yyyy/mm/mm (Month/Date/Year) _ hh/mm/ss (Hour/Minute/Second)
Lets change both starttime and stopdate with code below.
bike[ ,c("starttime", "stoptime")] <- lapply(bike[ , c("starttime", "stoptime")], ymd_hms)
Just want to make sure whether the data type have been change or not.
str(bike$starttime)
## POSIXct[1:46230], format: "2019-08-01 00:14:55" "2019-08-01 00:23:06" "2019-08-01 00:23:28" ...
str(bike$stoptime)
## POSIXct[1:46230], format: "2019-08-01 00:19:17" "2019-08-01 00:25:59" "2019-08-01 00:32:13" ...
startime and stoptime have been change into datetime data type, extract hour from columns starttime.bike$hour.rent <- hour(bike$starttime)
# Check the new columns
str(bike$hour.rent)
## int [1:46230] 0 0 0 0 0 0 0 1 1 3 ...
Switch the data from columns hour.rent from integer into string (1 -> “1 PM”) with the purpose to give better visualization later on by implementing function sapply(..., FUN = swith).
bike$hour.rent <- sapply(X = as.character(bike$hour.rent), FUN = switch,
"0" = "12 PM",
"1" = "1 PM",
"2" = "2 PM",
"3" = "3 PM",
"4" = "4 PM",
"5" = "5 PM",
"6" = "6 PM",
"7" = "7 PM",
"8" = "8 PM",
"9" = "9 PM",
"10" = "10 PM",
"11" = "11 PM",
"12" = "12 AM",
"13" = "1 AM",
"14" = "2 AM",
"15" = "3 AM",
"16" = "4 AM",
"17" = "5 AM",
"18" = "6 AM",
"19" = "7 AM",
"20" = "8 AM",
"21" = "9 AM",
"22" = "10 AM",
"23" = "11 AM")
Level the factor column position hour.rent.
bike$hour.rent <- factor(bike$hour.rent, levels = c("12 PM",
"1 PM",
"2 PM",
"3 PM",
"4 PM",
"5 PM",
"6 PM",
"7 PM",
"8 PM",
"9 PM",
"10 PM",
"11 PM",
"12 AM",
"1 AM",
"2 AM",
"3 AM",
"4 AM",
"5 AM",
"6 AM",
"7 AM",
"8 AM",
"9 AM",
"10 AM",
"11 AM"))
str(bike$hour.rent)
## Factor w/ 24 levels "12 PM","1 PM",..: 1 1 1 1 1 1 1 2 2 4 ...
starttime and change the level so the level will start from Monday.bike$dayname <- wday(bike$starttime, label = T, abbr = F, week_start = "1")
# Check the level
str(bike$dayname)
## Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 4 4 4 4 4 4 4 4 4 4 ...
tripduration <= 500s : “Short Trip”tripduration <= 1000s : “Medium Trip”tripduration > 1000s : “Long trip”# Make a looping function, in order to get the `tripduration_type` based on `tripduration` column
convert_tripdur <- function(x)
{
if( x <= 500)
{
x <- "Short Trip"
}else
if(x <= 1000)
{
x <- "Medium Trip"
}else{
x <- "Long Trip"
}
}
# Create a new column and apply the function
bike$tripdur_type <- sapply(X = bike$tripduration, FUN = convert_tripdur)
# Check the new column and make sure whether the function is correct
tail(subset(bike, select = c(tripduration, tripdur_type)),10)
Great! The function works perfectly. But due to the function was changing from integer to string, columns tripdur_type datatype must be change into factor.
# Change the data type
bike$tripdur_type <- as.factor(bike$tripdur_type)
str(bike$tripdur_type)
## Factor w/ 3 levels "Long Trip","Medium Trip",..: 3 3 2 3 3 2 2 3 2 3 ...
start.station.name & end.station.name into one columns name route, with the purpose to make visualization easier for the best 10 routes.# Combining two columns can be done easily using `paste()` function.
bike$route <- paste(bike$start.station.name, "-", bike$end.station.name)
# Show top 5 value after combining the two columns
head(bike$route)
## [1] "JC Medical Center - Liberty Light Rail"
## [2] "Dixon Mills - Grove St PATH"
## [3] "Newport Pkwy - Hamilton Park"
## [4] "Warren St - City Hall"
## [5] "Grove St PATH - Jersey & 3rd"
## [6] "City Hall - Bergen Ave"
#Make a looping funtion, in order to convert the gender from number into string
convert_gender <- function(x)
{
if( x == 1)
{
x <- "Male"
}
else
{
x <- "Female"
}
}
# Apply the funtion into columns `gender`
bike$gender <- sapply(X = bike$gender, FUN = convert_gender)
# Check wheter the function works
head(subset(bike, select = gender),10)
After check the function works. Follow the rest of step like in sub-bab 4.7.
# Cchange it into factor
bike$gender <- as.factor(bike$gender)
str(bike$gender)
## Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 1 2 1 1 ...
usertype from chr into factorbike$usertype <- as.factor(bike$usertype)
str(bike$usertype)
## Factor w/ 2 levels "Customer","Subscriber": 2 2 2 2 2 2 2 2 2 2 ...
colSums(is.na(bike))
## tripduration starttime stoptime start.station.name
## 0 0 0 0
## end.station.name bikeid usertype birth.year
## 0 0 0 0
## gender hour.rent dayname tripdur_type
## 0 0 0 0
## route
## 0
Good! There are no missing value.
summary(bike)
## tripduration starttime stoptime
## Min. : 61.0 Min. :2019-08-01 00:14:55 Min. :2019-08-01 00:19:17
## 1st Qu.: 240.0 1st Qu.:2019-08-08 18:04:08 1st Qu.:2019-08-08 18:12:52
## Median : 352.0 Median :2019-08-16 08:24:12 Median :2019-08-16 08:29:01
## Mean : 478.2 Mean :2019-08-16 12:04:04 Mean :2019-08-16 12:12:03
## 3rd Qu.: 587.0 3rd Qu.:2019-08-24 09:26:23 3rd Qu.:2019-08-24 09:36:51
## Max. :2000.0 Max. :2019-08-31 23:56:55 Max. :2019-09-01 00:08:25
##
## start.station.name end.station.name bikeid usertype
## Length:46230 Length:46230 Min. :14967 Customer : 6149
## Class :character Class :character 1st Qu.:26289 Subscriber:40081
## Mode :character Mode :character Median :29287
## Mean :28459
## 3rd Qu.:29538
## Max. :39224
##
## birth.year gender hour.rent dayname
## Min. :1950 Female:14309 8 PM : 5135 Monday :6377
## 1st Qu.:1976 Male :31921 6 AM : 4667 Tuesday :6627
## Median :1984 5 AM : 4333 Wednesday:5958
## Mean :1982 7 AM : 3521 Thursday :8470
## 3rd Qu.:1989 7 PM : 3199 Friday :8100
## Max. :2003 9 PM : 2861 Saturday :6258
## (Other):22514 Sunday :4440
## tripdur_type route
## Long Trip : 4458 Length:46230
## Medium Trip:10166 Class :character
## Short Trip :31606 Mode :character
##
##
##
##
The function summary is a generic function used to produce result summaries of the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument.
From the result above there are some information which might be the answer for the business question given, or at least guidance to do the EDA.
fav_routes <- bike %>%
group_by(route) %>%
summarise(freq = n()) %>%
arrange(desc(freq)) %>%
head(10)
fav_routes
# Visualize
plot_fav_routes <- ggplot(fav_routes, aes(x = freq, reorder(route, freq))) +
geom_col(aes(fill = freq)) +
geom_text(aes(label = freq),color = "white", size = 4,nudge_x = -40) +
labs(title = "Top 10 Favourite Routes",
subtitle = "",
x = "Freq in 1 Month",
y = "Route (Start Point - End Point)",
color = "") +
theme_bw() +
theme(legend.position = "none",
legend.title = element_blank())
plot_fav_routes
Count 10 less favorite route.
# Find top 10 least favorite route
less_fav_routes <- bike %>%
group_by(route) %>%
summarise(freq = n()) %>%
arrange(desc(freq)) %>%
tail(10)
less_fav_routes
Filter the top most and less favorite route.
# Filter top 10 less favorite route into bike data frame
filter_less_fav_route <- filter(bike, bike$route %in% less_fav_routes$route)
# Check the function by using function `unique()`
unique(filter_less_fav_route$route)
## [1] "Washington St - Journal Square" "York St - Washington St"
## [3] "Washington St - Glenwood Ave" "Washington St - Astor Place"
## [5] "Washington St - McGinley Square" "Warren St - Oakland Ave"
## [7] "York St - Hilltop" "York St - McGinley Square"
## [9] "Washington St - York St" "Warren St - Lincoln Park"
# Filter top 10 most favorite route into bike data frame
filter_fav_route <- filter(bike, bike$route %in% fav_routes$route)
# Check the function by using function `unique()`
unique(filter_fav_route$route)
## [1] "Dixon Mills - Grove St PATH" "McGinley Square - Sip Ave"
## [3] "Brunswick & 6th - Grove St PATH" "Brunswick St - Grove St PATH"
## [5] "Marin Light Rail - Grove St PATH" "Hamilton Park - Grove St PATH"
## [7] "Monmouth and 6th - Grove St PATH" "Jersey & 6th St - Grove St PATH"
## [9] "Grove St PATH - Hamilton Park" "Grove St PATH - Marin Light Rail"
Calculate average trip duration for most favorite route.
avg_tripdur_fav_route <- mean(filter_fav_route$tripduration)
avg_tripdur_fav_route
## [1] 290.786
Calculate aerage trip duration for least favorite route
avg_tripdur_less_fav_route <- mean(filter_less_fav_route$tripduration)
avg_tripdur_less_fav_route
## [1] 866.9
Create a new data frame consist of columns trip duration type, user type and gender .
# Since output from function `table()` is not a data frame, the output require to be change into data frame first before being visualize
triptype_usrcat_gen <- as.data.frame(table(bike$tripdur_type, bike$usertype, bike$gender))
head(triptype_usrcat_gen, 10)
Visualize by split the graph based on trip duration type and user type.
plot_triptype_usr_gen <- ggplot(triptype_usrcat_gen, aes(x = Freq, y = reorder(Var3,Freq), fill = Var3)) +
geom_col(position = "dodge") +
facet_grid(Var1 ~ Var2,scales = "free_x") +
scale_fill_manual(values = c("red","blue")) +
labs(title = "Comparison Tripduration Type Frequency Bike Rent",
subtitle = "Group by Usertype (Customer & Subscriber) and Gender (Male & Female)",
x = "Frequency", y = "", fill = "") +
theme_light() +
theme(legend.position = "bottom")
plot_triptype_usr_gen
knitr::include_graphics("function_summary_bike_dayname.png")
Subset the most busies day (Thursday) then implement function summarise() to: - Percentage of count trip duration type on that day for each tripdur_type. - mean tripduration. - Percentage of sum trip duration on each tripdur_type.
thursday <- bike %>%
subset(subset = bike$dayname == "Thursday")%>%
group_by(tripdur_type) %>%
summarise(freq = n(),
mean_tripdur = round(mean(tripduration), 1),
total_tripdur = sum(tripduration)) %>%
mutate(
f_percent = freq / sum(freq),
f_percent = round(f_percent * 100, 1),
t_percent = total_tripdur / sum(total_tripdur),
t_percent = round(t_percent * 100, 1),
freq_percent = glue::glue("{freq} ({f_percent} %)"),
total_percent = glue::glue("{total_tripdur} ({t_percent} %)")
)
# Select the the required columns
thursday_new <- thursday %>%
select(tripdur_type, freq_percent, total_percent, mean_tripdur)
thursday_new
Count how many rent happen every day and hour
trend_day_hour <- as.data.frame(table(bike$dayname, bike$hour.rent))
head(trend_day_hour)
Visualize
plot_trend_day_hour <- ggplot(data = trend_day_hour, aes(x = Var1, y = Freq, fill = Var2))+
geom_bar(stat = "identity",position = "dodge") +
scale_fill_manual(values = c("red", "blue", "green", "yellow", "violet", "pink", "purple", "black",
"grey", "brown", "cyan", "orange", "beige", "light pink", "gold",
"sienna1", "hotpink2", "aquamarine", "wheat", "plum", "khaki", "azure",
"sky blue", "tan" )) +
labs(title = "Bike Rent Trend Every Hour",
subtitle = "Group by Day",
x = "Day Name",
y = "Freq",
color = "") +
theme_classic() +
theme(legend.position = "bottom",
legend.title = element_blank())
plot_trend_day_hour
Grove St.Path is in the list of top 10 most favorite routes for start point or end point bike rent, it is because there are a rail station which connect into several important place in New York. Therefore, it is better if inventory management/planner team always ensures that the stock of bike is enough, especially during the most busies day and hour.
Comparison between average trip duration in most and less favorite routes:
For user type Subscriber is dominated by male gender for each type of trip duration. The most common type of tripduration is for Short Trip followed by Medium Trip and finally for Long trips, with a significant comparison of frequencies for each type of rent.
Meanwhile, for user type Customer is more dominated by female gender for each type of trip duration. The type of trip duration that occurs is the same as user type Subscriber (Short Trip - Medium trip - Long trip), there is no significant comparison between each gender in every trip duration type.
The most busies day for bike rent is in Thursday with total 8470 rent happened on that day during August 2019.
On Thursday, the type of trip duration that occurred the most was short trips with the percentage of 73% followed by Medium trips with the percentage of 19.8% and the last one was Long Trips. The total duration of rent is directly proportional to the frequency of trip duration type, the more the total borrowed duration for a certain type of duration, the greater the accumulated total time borrowed on that day. The average trip duration for each type:
Cycle hour of rent during weekday : Peak hours occur at 6-8 in the morning and 5-7 at night.
Peak hours occur between 6-8 in the morning can be happened because during those hour most people usually will go to work or school, and peak hours between 5-7 at night are due to most people just finished their daily activity and want to go back to their home.
Cycle hour of rent during weekend : During weekend there are no peak hours like in weekday, frequency of bike rent happened on every hour are relatively stable but the frequency of rent is not as much as on weekday.