1 Introduction
1.1 Transjakarta
Transjakarta, an Indonesian public transportation firm in Jakarta, operates a diverse range of transit options, encompassing large buses (both BRT and non-BRT) as well as mini buses (Mikrotrans). The system’s fundamental procedure involves passengers utilizing payment cards for Tap-In when they board and Tap-Out when they disembark, effectively serving as their ticket for the journey. Transjakarta users can be divided into three categories, namely regular users, Royaltrans and Free Service Cards. For users, the cost required for one trip is Rp. 3,500, for Royaltrans Rp. 20,000, while there is no fee for the Free Service Card.
1.2 Data’s Point of View
To build an analytical framework, data analysts cannot rely on incrementally filling in real-time transactions. Instead, they can proactively generate dummy data for testing purposes, ensuring that the framework and data structure align with in-depth analysis requirements. In this context, I have carried out an analysis of Transjakarta transaction data obtained from Kaggle because there is no data available to the public. The master data in it is original, except for dummy transactions. Basically this data is a simulation for transaction data in Transjakarta. Does not represent the actual data/structure used in Transjakarta.
2 Data Preparation
2.1 import the data
Before importing the data, we will call the library that will be used to manipulate the data
library(dplyr)
library(lubridate)
library(ggplot2)
library(viridis)
library(hrbrthemes)
library(wordcloud2)
library(tm)
library(readr)
library(extrafont)First we read the .csv file, which then saved in a new data frame called dataset
# Data Input and Checking Data
dataset <- read.csv("data_input/dfTransjakarta.csv",na.strings = "")
glimpse(dataset)#> Rows: 37,900
#> Columns: 22
#> $ transID <chr> "EIIW227B8L34VB", "LGXO740D2N47GZ", "DJWR385V2U57TO",…
#> $ payCardID <dbl> 1.800627e+14, 4.885332e+15, 4.996225e+15, 6.390992e+1…
#> $ payCardBank <chr> "emoney", "dki", "dki", "flazz", "flazz", "dki", "emo…
#> $ payCardName <chr> "Bajragin Usada", "Gandi Widodo", "Emong Wastuti", "S…
#> $ payCardSex <chr> "M", "F", "F", "F", "M", "F", "M", "F", "F", "M", "F"…
#> $ payCardBirthDate <int> 2008, 1997, 1992, 1978, 1982, 1993, 1974, 1991, 1992,…
#> $ corridorID <chr> "5", "6C", "R1A", "11D", "12", "1T", "JAK.18", NA, "B…
#> $ corridorName <chr> "Matraman Baru - Ancol", "Stasiun Tebet - Karet via P…
#> $ direction <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,…
#> $ tapInStops <chr> "P00142", "B01963P", "B00499P", "B05587P", "P00239", …
#> $ tapInStopsName <chr> "Pal Putih", "Kemenkes 2", "Gg. Kunir II", "Taman Elo…
#> $ tapInStopsLat <dbl> -6.184631, -6.228700, -6.133132, -6.195743, -6.149650…
#> $ tapInStopsLon <dbl> 106.8440, 106.8330, 106.8144, 106.9353, 106.8890, 106…
#> $ stopStartSeq <int> 7, 13, 38, 23, 5, 3, 25, 2, 1, 15, 17, 12, 4, 2, 5, 1…
#> $ tapInTime <chr> "2023-04-03 05:21:44", "2023-04-03 05:42:44", "2023-0…
#> $ tapOutStops <chr> "P00253", "B03307P", "B04962P", "B03090P", "P00098", …
#> $ tapOutStopsName <chr> "Tegalan", "Sampoerna Strategic", "Simpang Kunir Kemu…
#> $ tapOutStopsLat <dbl> -6.203101, -6.217152, -6.133731, -6.183068, -6.135355…
#> $ tapOutStopsLon <dbl> 106.8572, 106.8189, 106.8148, 106.9319, 106.8114, 106…
#> $ stopEndSeq <dbl> 12, 21, 39, 29, 15, 6, NA, 12, 16, 24, 23, 14, 6, 4, …
#> $ tapOutTime <chr> "2023-04-03 06:00:53", "2023-04-03 06:40:01", "2023-0…
#> $ payAmount <dbl> 3500, 3500, 3500, 3500, 3500, 20000, 0, 0, 20000, 350…
The dataset consists of 22 columns and 37900 rows, here is an explanation of each column :
| variable_name | Description |
|---|---|
| transID | Unique transaction id for every transaction |
| payCardID | Customers main identifier. The card customer use as ticket for entrance and exit |
| payCardBank | Customers card bank issuer name |
| payCardName | Customers name that is embedded in the card |
| payCardSex | Customers sex that is embedded in the card |
| payCardBirthDate | Customers birth year |
| corridorID | Corridor ID / Route ID as key for route grouping |
| corridorName | Corridor Name / Route Name contains Start and Finish for each route |
| direction | 0 for Go, 1 for Back. Direction of the route |
| tapInStops | Tap In (entrance) Stops ID for identifying stops name |
| tapInStopsName | Tap In (entrance) Stops name |
| tapInStopsLat | Tap In (entrance) latitude geographical coordinates of the bus stop |
| tapInStopsLon | Tap In (entrance) longitude geographical coordinates of the bus stop |
| stopStartSeq | Tap In (entrance) sequence of the bus stops |
| tapInTime | Tap In (entrance) date and time |
| tapOutStops | Tap Out (exit) Stops ID for identifying stops name |
| tapOutStopsName | Tap Out (exit) Stops name |
| tapOutStopsLat | Tap Out (exit) latitude geographical coordinates of the bus stop |
| tapOutStopsLon | Tap Out (exit) longitude geographical coordinates of the bus stop |
| stopEndSeq | Tap Out (exit) sequence of the bus stops |
| tapOutTime | Tap Out (exit) date and time |
| payAmount | The amount of money or fees that must be paid |
2.2 Handling Missing Values
#> transID payCardID payCardBank payCardName
#> 0 0 0 0
#> payCardSex payCardBirthDate corridorID corridorName
#> 0 0 1257 1930
#> direction tapInStops tapInStopsName tapInStopsLat
#> 0 1213 0 0
#> tapInStopsLon stopStartSeq tapInTime tapOutStops
#> 0 0 0 2289
#> tapOutStopsName tapOutStopsLat tapOutStopsLon stopEndSeq
#> 1344 1344 1344 1344
#> tapOutTime payAmount
#> 1344 1007
Based on the examination results, it was found that there are 1263 missing values in the columns “stopEndSeq,” “tapOutStopsLat,” “tapOutStopsLon,” and “payAmount.” These columns contain information about users exiting the bus, so it can be inferred that these missing values may indicate errors in tapping when entering or users exiting without tapping. also we found 1930 and 1257 missing value in corridorID and corridorName, this could be the old corridor that doesn’t exist anymore. Therefore, because these columns cannot be replaced and considering the large dataset volume, the decision is to drop the rows with missing values. For missing values in the tapInStops column, we can match the values based on the tapInStopsName column.
#The loop will fill in the NA values in column tapInStops based on column tapInStopsName
for (i in 1:nrow(dataset)) {
if (is.na(dataset$tapInStops[i])) {
matching_row <- which(dataset$tapInStopsName == dataset$tapInStopsName[i])[1]
dataset$tapInStops[i] <- dataset$tapInStops[matching_row]
}
}#> transID payCardID payCardBank payCardName
#> 0 0 0 0
#> payCardSex payCardBirthDate corridorID corridorName
#> 0 0 0 0
#> direction tapInStops tapInStopsName tapInStopsLat
#> 0 0 0 0
#> tapInStopsLon stopStartSeq tapInTime tapOutStops
#> 0 0 0 0
#> tapOutStopsName tapOutStopsLat tapOutStopsLon stopEndSeq
#> 0 0 0 0
#> tapOutTime payAmount
#> 0 0
Some of data type not in the corect type. column payCardBank,payCardSex, corridorID, corridorName, can be converted into a factor type to simplify analysis.tapInTime, tapOutTime can be converted into date-time format.
2.3 Data type manipulation
#create a new column containing user categories
clean_data <- clean_data %>%
mutate(user_Category = case_when(
payAmount == 3500 ~ "regular",
payAmount == 20000 ~ "royaltrans",
payAmount == 0 ~ "freecard",
TRUE ~ NA_character_
))
#Create a new column containing age
clean_data <- clean_data %>%
mutate(age = year(Sys.Date())-payCardBirthDate)
#create a new column containing age categories
clean_data <- clean_data %>%
mutate(age_Category = case_when(
age < 18 ~ "children",
age >= 18 & age < 65 ~ "adult",
age >= 60 ~ "elderly",
TRUE ~ NA_character_
))
#adjust the data type of the column
clean_data <- clean_data %>%
mutate(payCardBank = as.factor(payCardBank),
payCardSex = as.factor(payCardSex),
corridorID = as.factor(corridorID),
tapInStopsName = as.factor(tapInStopsName),
age_Category = as.factor(age_Category),
corridorName = as.factor(corridorName),
user_Category = as.factor(user_Category),
tapInTime = ymd_hms(tapInTime),
tapOutTime = ymd_hms(tapOutTime)
)
glimpse(clean_data)#> Rows: 32,675
#> Columns: 25
#> $ transID <chr> "EIIW227B8L34VB", "LGXO740D2N47GZ", "DJWR385V2U57TO",…
#> $ payCardID <dbl> 1.800627e+14, 4.885332e+15, 4.996225e+15, 6.390992e+1…
#> $ payCardBank <fct> emoney, dki, dki, flazz, flazz, dki, dki, dki, dki, d…
#> $ payCardName <chr> "Bajragin Usada", "Gandi Widodo", "Emong Wastuti", "S…
#> $ payCardSex <fct> M, F, F, F, M, F, F, M, F, F, F, M, M, F, M, M, M, F,…
#> $ payCardBirthDate <int> 2008, 1997, 1992, 1978, 1982, 1993, 1992, 2004, 1997,…
#> $ corridorID <fct> 5, 6C, R1A, 11D, 12, 1T, B14, 1Q, 5M, 9A, 4, JAK.46, …
#> $ corridorName <fct> Matraman Baru - Ancol, Stasiun Tebet - Karet via Patr…
#> $ direction <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,…
#> $ tapInStops <chr> "P00142", "B01963P", "B00499P", "B05587P", "P00239", …
#> $ tapInStopsName <fct> Pal Putih, Kemenkes 2, Gg. Kunir II, Taman Elok 1, Su…
#> $ tapInStopsLat <dbl> -6.184631, -6.228700, -6.133132, -6.195743, -6.149650…
#> $ tapInStopsLon <dbl> 106.8440, 106.8330, 106.8144, 106.9353, 106.8890, 106…
#> $ stopStartSeq <int> 7, 13, 38, 23, 5, 3, 1, 15, 17, 4, 2, 5, 1, 10, 19, 2…
#> $ tapInTime <dttm> 2023-04-03 05:21:44, 2023-04-03 05:42:44, 2023-04-03…
#> $ tapOutStops <chr> "P00253", "B03307P", "B04962P", "B03090P", "P00098", …
#> $ tapOutStopsName <chr> "Tegalan", "Sampoerna Strategic", "Simpang Kunir Kemu…
#> $ tapOutStopsLat <dbl> -6.203101, -6.217152, -6.133731, -6.183068, -6.135355…
#> $ tapOutStopsLon <dbl> 106.8572, 106.8189, 106.8148, 106.9319, 106.8114, 106…
#> $ stopEndSeq <dbl> 12, 21, 39, 29, 15, 6, 16, 24, 23, 6, 4, 17, 3, 11, 2…
#> $ tapOutTime <dttm> 2023-04-03 06:00:53, 2023-04-03 06:40:01, 2023-04-03…
#> $ payAmount <dbl> 3500, 3500, 3500, 3500, 3500, 20000, 20000, 3500, 350…
#> $ user_Category <fct> regular, regular, regular, regular, regular, royaltra…
#> $ age <dbl> 15, 26, 31, 45, 41, 30, 31, 19, 26, 35, 21, 48, 12, 1…
#> $ age_Category <fct> children, adult, adult, adult, adult, adult, adult, a…
3 Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and visualizing data sets to summarize their main characteristics, often with the help of statistical graphics and other data visualization techniques. First, we use the summary() function to extract basic statistical information from each column in the dataset
#> transID payCardID payCardBank payCardName
#> Length:32675 Min. :6.040e+10 bni : 2336 Length:32675
#> Class :character 1st Qu.:1.800e+14 brizzi: 2997 Class :character
#> Mode :character Median :3.507e+15 dki :16176 Mode :character
#> Mean :4.190e+17 emoney: 5991
#> 3rd Qu.:4.685e+15 flazz : 2743
#> Max. :4.998e+18 online: 2432
#>
#> payCardSex payCardBirthDate corridorID
#> F:17440 Min. :1946 1T : 369
#> M:15235 1st Qu.:1982 S21 : 362
#> Median :1991 JIS3 : 324
#> Mean :1990 JAK.06 : 313
#> 3rd Qu.:2001 11P : 311
#> Max. :2012 8C : 305
#> (Other):30691
#> corridorName direction
#> Cibubur - Balai Kota : 369 Min. :0.0000
#> Ciputat - CSW : 362 1st Qu.:0.0000
#> Harmoni - Jakarta International Stadium : 324 Median :1.0000
#> Pulo Gadung - Monas : 319 Mean :0.5006
#> Kampung Rambutan - Pondok Gede : 313 3rd Qu.:1.0000
#> Rusun Pondok Bambu - Walikota Jakarta Timur: 311 Max. :1.0000
#> (Other) :30677
#> tapInStops tapInStopsName tapInStopsLat tapInStopsLon
#> Length:32675 Penjaringan : 214 Min. :-6.395 Min. :106.6
#> Class :character Garuda Taman Mini: 186 1st Qu.:-6.246 1st Qu.:106.8
#> Mode :character BKN : 151 Median :-6.214 Median :106.8
#> BNN LRT : 143 Mean :-6.215 Mean :106.8
#> Pejaten : 140 3rd Qu.:-6.175 3rd Qu.:106.9
#> Penggilingan : 119 Max. :-6.089 Max. :107.0
#> (Other) :31722
#> stopStartSeq tapInTime tapOutStops
#> Min. : 0.00 Min. :2023-04-01 06:22:27.00 Length:32675
#> 1st Qu.: 4.00 1st Qu.:2023-04-10 09:28:08.00 Class :character
#> Median :10.00 Median :2023-04-16 17:50:56.00 Mode :character
#> Mean :13.61 Mean :2023-04-16 08:13:55.76
#> 3rd Qu.:19.00 3rd Qu.:2023-04-21 21:29:27.50
#> Max. :68.00 Max. :2023-04-30 21:51:08.00
#>
#> tapOutStopsName tapOutStopsLat tapOutStopsLon stopEndSeq
#> Length:32675 Min. :-6.395 Min. :106.6 Min. : 1.00
#> Class :character 1st Qu.:-6.248 1st Qu.:106.8 1st Qu.:11.00
#> Mode :character Median :-6.215 Median :106.8 Median :18.00
#> Mean :-6.215 Mean :106.8 Mean :21.29
#> 3rd Qu.:-6.174 3rd Qu.:106.9 3rd Qu.:30.00
#> Max. :-6.092 Max. :107.0 Max. :77.00
#>
#> tapOutTime payAmount user_Category
#> Min. :2023-04-01 07:27:31.0 Min. : 0 freecard :14685
#> 1st Qu.:2023-04-10 10:38:08.0 1st Qu.: 0 regular :16419
#> Median :2023-04-16 19:16:57.0 Median : 3500 royaltrans: 1571
#> Mean :2023-04-16 09:26:05.7 Mean : 2720
#> 3rd Qu.:2023-04-21 22:54:39.0 3rd Qu.: 3500
#> Max. :2023-04-30 23:23:18.0 Max. :20000
#>
#> age age_Category
#> Min. :11.00 adult :27558
#> 1st Qu.:22.00 children: 4706
#> Median :32.00 elderly : 411
#> Mean :32.81
#> 3rd Qu.:41.00
#> Max. :77.00
#>
Insight :
- The data has a duration of one month from 2023-04-01 to 2023-04-31
- The Cibubur - Balai kota route is the one with the most passengers
- Most payments are made with DKI bank cards
3.1 Define top 5 payment methods of Transjakarta passengers?
#create a barplot to compare Transjakarta payments
databank <- table(clean_data$payCardBank)
barplot(height = databank)
Answer : Top 5 payment method is DKI, emoney, flazz,
brizzi and online.
3.2 How is the distribution of Transjakarta users?
#distribution of users by gender and category
graphics::barplot(table(clean_data$payCardSex,droplevels(clean_data$user_Category)), legend.text = T)
Answer : 1. More Transjakarta users are women with a
small difference from men 2. Transjakarta users are dominated by ages
25-40 with most aged 35-40 years 3. Most of the users are regular
categories and free card users
3.3 At what time is Transjakarta busiest?
Answer : Transjakarta is busiest at 5-6 o’clock and
17-18 o’clock, 5 o’clock in the morning is the time to go to work and
school while 17 o’clock is the time to leave the office
3.4 Which route has the longest average travel time for passengers?
#To calculate the average time, we first need to find the time needed for each trip
clean_data <- clean_data %>%
mutate(duration = tapOutTime - tapInTime)
trav_time <- clean_data %>%
group_by(corridorID) %>%
summarise(mean_time = mean(duration)) %>%
ungroup() %>%
arrange(desc(mean_time))
trav_time5 <- head(trav_time, 5)
barplot(height = as.double(trav_time5$mean_time), names.arg = trav_time5$corridorID, horiz = TRUE)Answer: The route with the longest travel time is route 7B, route D31 and route 1B
3.5 how much transjakarta income during weekdays?
clean_data <- clean_data %>%
mutate(is_weekend = weekdays(tapInTime) %in% c("Saturday", "Sunday"),
is_weekend = case_when(
is_weekend == TRUE ~ "Weekend",
is_weekend == FALSE ~ "Weekday",
TRUE ~ NA_character_
))
data_weekday <- clean_data %>%
group_by(is_weekend) %>%
summarise(total = sum(payAmount)) %>%
ungroup()
data_weekday#> # A tibble: 2 × 2
#> is_weekend total
#> <chr> <dbl>
#> 1 Weekday 80808000
#> 2 Weekend 8078500
Answer Transjakarta revenue during weekdays in April 2023 is Rp.80.808.000
3.6 If we want to increase the capacity of seats specifically for women, which route will we choose?
data_female <- clean_data %>%
group_by(corridorID, payCardSex) %>%
summarise(total = n()) %>%
ungroup()
data_female <- data_female %>%
filter(payCardSex == "F") %>%
arrange(desc(total))
head(data_female)#> # A tibble: 6 × 3
#> corridorID payCardSex total
#> <fct> <fct> <int>
#> 1 9D F 208
#> 2 T11 F 204
#> 3 2E F 201
#> 4 JAK.112 F 191
#> 5 JAK.36 F 183
#> 6 2P F 168
Answer: 1. route 9D & T11 has the most female passengers, this could be a consideration for increasing the capacity of seats specifically for women
4 Explanatory Visualization
Explanatory visualization is a type of data visualization that is specifically designed to help people understand and interpret complex data or information. It goes beyond simply presenting data in a visual format; it aims to provide insights, tell a story, or convey a specific message to the audience.
4.1 comparing the distribution of payment methods based on Transjakarta user categories
plot1_data <- clean_data %>%
group_by(payCardBank, user_Category) %>%
summarise(total = n()) %>%
ungroup()
plot1 <- ggplot(data = plot1_data,
aes(x = total,
y = reorder(payCardBank, total))) +
geom_col(aes(fill = user_Category),
position = "dodge") +
labs(title="Transjakarta Payment Method",
subtitle = "distribution of payment methods for Transjakarta users based on user categories",
x="Number of Users",
y="Payment Method",
caption = "Source : Transjakarta dataset-Kaggle", fill = "Legends\n") +
theme_light() +
theme(plot.caption = element_text(hjust = 0), )
plot1
Insight - The bar plot above depicts the distribution
of Transjakarta users based on the payment method used and the user
category. It is observed that users in the Royaltrans, Freecard, and
Regular categories most frequently use the DKI card.
4.2 Time series of Transjakarta users based on each age category
plot2_data <- clean_data %>%
group_by(age_Category, hour(tapInTime)) %>%
summarise(total = n()) %>%
ungroup() %>%
rename(hour = "hour(tapInTime)")
plot3 <- ggplot(plot2_data, aes(x=hour, y=total, fill=age_Category)) +
geom_area(alpha=0.6 , size=.5, colour="white") +
scale_fill_viridis(discrete = T) +
labs(title="Transjakarta Users",
subtitle = "Time series of Transjakarta users based on each age category",
x="Operational hour",
y="Number of customers",
caption = "Source : Transjakarta dataset-Kaggle", fill = "Legends\n") +
scale_x_continuous(breaks = seq(0,23,1)) +
theme_ipsum() +
theme(plot.caption = element_text(hjust = 0),)
plot3
Insight - This plot shows the trend of Transjakarta
users every hour. It can be concluded that the highest use of
Transjakarta occurs at 06.00 and 17.00.
4.3 Word Cloud of Transjakarta corridor Name
medium.corpus = Corpus(VectorSource(clean_data$corridorName))
removeHTML = function(text){
text = gsub(pattern = '<.+\\">', '', text)
text = gsub(pattern = '</.+>', '', text)
return(text)
}
medium.corpus = medium.corpus %>%
#tm_map(content_transformer(removeHTML)) %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("SMART"))
tdm = TermDocumentMatrix(medium.corpus) %>%
as.matrix()
words = sort(rowSums(tdm), decreasing = TRUE)
df = data.frame(word = names(words), freq = words)
df = df %>%
filter(nchar(as.character(word)) > 2,
word != "don’")
uxc.colors = c("#fefefe", "#f4f2a8", "#030303")
uxc.background = "#00ccff"
# font_import()
fonts()#> NULL
wordcloud2(df,
color = rep_len(uxc.colors, nrow(df)),
backgroundColor = uxc.background,
fontFamily = "DM Sans",
size = 2.5,
minSize = 5,
rotateRatio = 0)- This wordcloud shows the routes most frequently used by Transjakarta users