1 Introduction

1.1 Transjakarta

Transjakarta, an Indonesian public transportation firm in Jakarta, operates a diverse range of transit options, encompassing large buses (both BRT and non-BRT) as well as mini buses (Mikrotrans). The system’s fundamental procedure involves passengers utilizing payment cards for Tap-In when they board and Tap-Out when they disembark, effectively serving as their ticket for the journey. Transjakarta users can be divided into three categories, namely regular users, Royaltrans and Free Service Cards. For users, the cost required for one trip is Rp. 3,500, for Royaltrans Rp. 20,000, while there is no fee for the Free Service Card.

1.2 Data’s Point of View

To build an analytical framework, data analysts cannot rely on incrementally filling in real-time transactions. Instead, they can proactively generate dummy data for testing purposes, ensuring that the framework and data structure align with in-depth analysis requirements. In this context, I have carried out an analysis of Transjakarta transaction data obtained from Kaggle because there is no data available to the public. The master data in it is original, except for dummy transactions. Basically this data is a simulation for transaction data in Transjakarta. Does not represent the actual data/structure used in Transjakarta.

2 Data Preparation

2.1 import the data

Before importing the data, we will call the library that will be used to manipulate the data

library(dplyr)
library(lubridate)
library(ggplot2)
library(viridis)
library(hrbrthemes)
library(wordcloud2)
library(tm)
library(readr)
library(extrafont)

First we read the .csv file, which then saved in a new data frame called dataset

# Data Input and Checking Data
dataset <- read.csv("data_input/dfTransjakarta.csv",na.strings = "")
glimpse(dataset)
#> Rows: 37,900
#> Columns: 22
#> $ transID          <chr> "EIIW227B8L34VB", "LGXO740D2N47GZ", "DJWR385V2U57TO",…
#> $ payCardID        <dbl> 1.800627e+14, 4.885332e+15, 4.996225e+15, 6.390992e+1…
#> $ payCardBank      <chr> "emoney", "dki", "dki", "flazz", "flazz", "dki", "emo…
#> $ payCardName      <chr> "Bajragin Usada", "Gandi Widodo", "Emong Wastuti", "S…
#> $ payCardSex       <chr> "M", "F", "F", "F", "M", "F", "M", "F", "F", "M", "F"…
#> $ payCardBirthDate <int> 2008, 1997, 1992, 1978, 1982, 1993, 1974, 1991, 1992,…
#> $ corridorID       <chr> "5", "6C", "R1A", "11D", "12", "1T", "JAK.18", NA, "B…
#> $ corridorName     <chr> "Matraman Baru - Ancol", "Stasiun Tebet - Karet via P…
#> $ direction        <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,…
#> $ tapInStops       <chr> "P00142", "B01963P", "B00499P", "B05587P", "P00239", …
#> $ tapInStopsName   <chr> "Pal Putih", "Kemenkes 2", "Gg. Kunir II", "Taman Elo…
#> $ tapInStopsLat    <dbl> -6.184631, -6.228700, -6.133132, -6.195743, -6.149650…
#> $ tapInStopsLon    <dbl> 106.8440, 106.8330, 106.8144, 106.9353, 106.8890, 106…
#> $ stopStartSeq     <int> 7, 13, 38, 23, 5, 3, 25, 2, 1, 15, 17, 12, 4, 2, 5, 1…
#> $ tapInTime        <chr> "2023-04-03 05:21:44", "2023-04-03 05:42:44", "2023-0…
#> $ tapOutStops      <chr> "P00253", "B03307P", "B04962P", "B03090P", "P00098", …
#> $ tapOutStopsName  <chr> "Tegalan", "Sampoerna Strategic", "Simpang Kunir Kemu…
#> $ tapOutStopsLat   <dbl> -6.203101, -6.217152, -6.133731, -6.183068, -6.135355…
#> $ tapOutStopsLon   <dbl> 106.8572, 106.8189, 106.8148, 106.9319, 106.8114, 106…
#> $ stopEndSeq       <dbl> 12, 21, 39, 29, 15, 6, NA, 12, 16, 24, 23, 14, 6, 4, …
#> $ tapOutTime       <chr> "2023-04-03 06:00:53", "2023-04-03 06:40:01", "2023-0…
#> $ payAmount        <dbl> 3500, 3500, 3500, 3500, 3500, 20000, 0, 0, 20000, 350…

The dataset consists of 22 columns and 37900 rows, here is an explanation of each column :

variable_name Description
transID Unique transaction id for every transaction
payCardID Customers main identifier. The card customer use as ticket for entrance and exit
payCardBank Customers card bank issuer name
payCardName Customers name that is embedded in the card
payCardSex Customers sex that is embedded in the card
payCardBirthDate Customers birth year
corridorID Corridor ID / Route ID as key for route grouping
corridorName Corridor Name / Route Name contains Start and Finish for each route
direction 0 for Go, 1 for Back. Direction of the route
tapInStops Tap In (entrance) Stops ID for identifying stops name
tapInStopsName Tap In (entrance) Stops name
tapInStopsLat Tap In (entrance) latitude geographical coordinates of the bus stop
tapInStopsLon Tap In (entrance) longitude geographical coordinates of the bus stop
stopStartSeq Tap In (entrance) sequence of the bus stops
tapInTime Tap In (entrance) date and time
tapOutStops Tap Out (exit) Stops ID for identifying stops name
tapOutStopsName Tap Out (exit) Stops name
tapOutStopsLat Tap Out (exit) latitude geographical coordinates of the bus stop
tapOutStopsLon Tap Out (exit) longitude geographical coordinates of the bus stop
stopEndSeq Tap Out (exit) sequence of the bus stops
tapOutTime Tap Out (exit) date and time
payAmount The amount of money or fees that must be paid

2.2 Handling Missing Values

#checking missing values
colSums(is.na(dataset))
#>          transID        payCardID      payCardBank      payCardName 
#>                0                0                0                0 
#>       payCardSex payCardBirthDate       corridorID     corridorName 
#>                0                0             1257             1930 
#>        direction       tapInStops   tapInStopsName    tapInStopsLat 
#>                0             1213                0                0 
#>    tapInStopsLon     stopStartSeq        tapInTime      tapOutStops 
#>                0                0                0             2289 
#>  tapOutStopsName   tapOutStopsLat   tapOutStopsLon       stopEndSeq 
#>             1344             1344             1344             1344 
#>       tapOutTime        payAmount 
#>             1344             1007

Based on the examination results, it was found that there are 1263 missing values in the columns “stopEndSeq,” “tapOutStopsLat,” “tapOutStopsLon,” and “payAmount.” These columns contain information about users exiting the bus, so it can be inferred that these missing values may indicate errors in tapping when entering or users exiting without tapping. also we found 1930 and 1257 missing value in corridorID and corridorName, this could be the old corridor that doesn’t exist anymore. Therefore, because these columns cannot be replaced and considering the large dataset volume, the decision is to drop the rows with missing values. For missing values in the tapInStops column, we can match the values based on the tapInStopsName column.

#The loop will fill in the NA values in column tapInStops based on column tapInStopsName
for (i in 1:nrow(dataset)) {
  if (is.na(dataset$tapInStops[i])) {
    matching_row <- which(dataset$tapInStopsName == dataset$tapInStopsName[i])[1]
    dataset$tapInStops[i] <- dataset$tapInStops[matching_row]
  }
}
# handling missing values
clean_data <- na.omit(dataset)
colSums(is.na(clean_data))
#>          transID        payCardID      payCardBank      payCardName 
#>                0                0                0                0 
#>       payCardSex payCardBirthDate       corridorID     corridorName 
#>                0                0                0                0 
#>        direction       tapInStops   tapInStopsName    tapInStopsLat 
#>                0                0                0                0 
#>    tapInStopsLon     stopStartSeq        tapInTime      tapOutStops 
#>                0                0                0                0 
#>  tapOutStopsName   tapOutStopsLat   tapOutStopsLon       stopEndSeq 
#>                0                0                0                0 
#>       tapOutTime        payAmount 
#>                0                0

Some of data type not in the corect type. column payCardBank,payCardSex, corridorID, corridorName, can be converted into a factor type to simplify analysis.tapInTime, tapOutTime can be converted into date-time format.

2.3 Data type manipulation

#create a new column containing user categories
clean_data <- clean_data %>%
  mutate(user_Category = case_when(
    payAmount  == 3500 ~ "regular",
    payAmount  == 20000  ~ "royaltrans",
    payAmount  == 0 ~ "freecard",
    TRUE ~ NA_character_
  ))

#Create a new column containing age
clean_data <- clean_data %>% 
  mutate(age = year(Sys.Date())-payCardBirthDate)

#create a new column containing age categories
clean_data <- clean_data %>%
  mutate(age_Category = case_when(
    age < 18 ~ "children",
    age >= 18 & age < 65 ~ "adult",
    age >= 60 ~ "elderly",
    TRUE ~ NA_character_
  ))


#adjust the data type of the column
clean_data <- clean_data %>% 
  mutate(payCardBank = as.factor(payCardBank),
         payCardSex = as.factor(payCardSex),
         corridorID = as.factor(corridorID),
         tapInStopsName = as.factor(tapInStopsName),
         age_Category = as.factor(age_Category),
         corridorName = as.factor(corridorName),
         user_Category = as.factor(user_Category),
         tapInTime = ymd_hms(tapInTime),
         tapOutTime = ymd_hms(tapOutTime)
         
  )

glimpse(clean_data)
#> Rows: 32,675
#> Columns: 25
#> $ transID          <chr> "EIIW227B8L34VB", "LGXO740D2N47GZ", "DJWR385V2U57TO",…
#> $ payCardID        <dbl> 1.800627e+14, 4.885332e+15, 4.996225e+15, 6.390992e+1…
#> $ payCardBank      <fct> emoney, dki, dki, flazz, flazz, dki, dki, dki, dki, d…
#> $ payCardName      <chr> "Bajragin Usada", "Gandi Widodo", "Emong Wastuti", "S…
#> $ payCardSex       <fct> M, F, F, F, M, F, F, M, F, F, F, M, M, F, M, M, M, F,…
#> $ payCardBirthDate <int> 2008, 1997, 1992, 1978, 1982, 1993, 1992, 2004, 1997,…
#> $ corridorID       <fct> 5, 6C, R1A, 11D, 12, 1T, B14, 1Q, 5M, 9A, 4, JAK.46, …
#> $ corridorName     <fct> Matraman Baru - Ancol, Stasiun Tebet - Karet via Patr…
#> $ direction        <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,…
#> $ tapInStops       <chr> "P00142", "B01963P", "B00499P", "B05587P", "P00239", …
#> $ tapInStopsName   <fct> Pal Putih, Kemenkes 2, Gg. Kunir II, Taman Elok 1, Su…
#> $ tapInStopsLat    <dbl> -6.184631, -6.228700, -6.133132, -6.195743, -6.149650…
#> $ tapInStopsLon    <dbl> 106.8440, 106.8330, 106.8144, 106.9353, 106.8890, 106…
#> $ stopStartSeq     <int> 7, 13, 38, 23, 5, 3, 1, 15, 17, 4, 2, 5, 1, 10, 19, 2…
#> $ tapInTime        <dttm> 2023-04-03 05:21:44, 2023-04-03 05:42:44, 2023-04-03…
#> $ tapOutStops      <chr> "P00253", "B03307P", "B04962P", "B03090P", "P00098", …
#> $ tapOutStopsName  <chr> "Tegalan", "Sampoerna Strategic", "Simpang Kunir Kemu…
#> $ tapOutStopsLat   <dbl> -6.203101, -6.217152, -6.133731, -6.183068, -6.135355…
#> $ tapOutStopsLon   <dbl> 106.8572, 106.8189, 106.8148, 106.9319, 106.8114, 106…
#> $ stopEndSeq       <dbl> 12, 21, 39, 29, 15, 6, 16, 24, 23, 6, 4, 17, 3, 11, 2…
#> $ tapOutTime       <dttm> 2023-04-03 06:00:53, 2023-04-03 06:40:01, 2023-04-03…
#> $ payAmount        <dbl> 3500, 3500, 3500, 3500, 3500, 20000, 20000, 3500, 350…
#> $ user_Category    <fct> regular, regular, regular, regular, regular, royaltra…
#> $ age              <dbl> 15, 26, 31, 45, 41, 30, 31, 19, 26, 35, 21, 48, 12, 1…
#> $ age_Category     <fct> children, adult, adult, adult, adult, adult, adult, a…

3 Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and visualizing data sets to summarize their main characteristics, often with the help of statistical graphics and other data visualization techniques. First, we use the summary() function to extract basic statistical information from each column in the dataset

summary(clean_data)
#>    transID            payCardID         payCardBank    payCardName       
#>  Length:32675       Min.   :6.040e+10   bni   : 2336   Length:32675      
#>  Class :character   1st Qu.:1.800e+14   brizzi: 2997   Class :character  
#>  Mode  :character   Median :3.507e+15   dki   :16176   Mode  :character  
#>                     Mean   :4.190e+17   emoney: 5991                     
#>                     3rd Qu.:4.685e+15   flazz : 2743                     
#>                     Max.   :4.998e+18   online: 2432                     
#>                                                                          
#>  payCardSex payCardBirthDate   corridorID   
#>  F:17440    Min.   :1946     1T     :  369  
#>  M:15235    1st Qu.:1982     S21    :  362  
#>             Median :1991     JIS3   :  324  
#>             Mean   :1990     JAK.06 :  313  
#>             3rd Qu.:2001     11P    :  311  
#>             Max.   :2012     8C     :  305  
#>                              (Other):30691  
#>                                       corridorName     direction     
#>  Cibubur - Balai Kota                       :  369   Min.   :0.0000  
#>  Ciputat - CSW                              :  362   1st Qu.:0.0000  
#>  Harmoni - Jakarta International Stadium    :  324   Median :1.0000  
#>  Pulo Gadung - Monas                        :  319   Mean   :0.5006  
#>  Kampung Rambutan - Pondok Gede             :  313   3rd Qu.:1.0000  
#>  Rusun Pondok Bambu - Walikota Jakarta Timur:  311   Max.   :1.0000  
#>  (Other)                                    :30677                   
#>   tapInStops                  tapInStopsName  tapInStopsLat    tapInStopsLon  
#>  Length:32675       Penjaringan      :  214   Min.   :-6.395   Min.   :106.6  
#>  Class :character   Garuda Taman Mini:  186   1st Qu.:-6.246   1st Qu.:106.8  
#>  Mode  :character   BKN              :  151   Median :-6.214   Median :106.8  
#>                     BNN LRT          :  143   Mean   :-6.215   Mean   :106.8  
#>                     Pejaten          :  140   3rd Qu.:-6.175   3rd Qu.:106.9  
#>                     Penggilingan     :  119   Max.   :-6.089   Max.   :107.0  
#>                     (Other)          :31722                                   
#>   stopStartSeq     tapInTime                      tapOutStops       
#>  Min.   : 0.00   Min.   :2023-04-01 06:22:27.00   Length:32675      
#>  1st Qu.: 4.00   1st Qu.:2023-04-10 09:28:08.00   Class :character  
#>  Median :10.00   Median :2023-04-16 17:50:56.00   Mode  :character  
#>  Mean   :13.61   Mean   :2023-04-16 08:13:55.76                     
#>  3rd Qu.:19.00   3rd Qu.:2023-04-21 21:29:27.50                     
#>  Max.   :68.00   Max.   :2023-04-30 21:51:08.00                     
#>                                                                     
#>  tapOutStopsName    tapOutStopsLat   tapOutStopsLon    stopEndSeq   
#>  Length:32675       Min.   :-6.395   Min.   :106.6   Min.   : 1.00  
#>  Class :character   1st Qu.:-6.248   1st Qu.:106.8   1st Qu.:11.00  
#>  Mode  :character   Median :-6.215   Median :106.8   Median :18.00  
#>                     Mean   :-6.215   Mean   :106.8   Mean   :21.29  
#>                     3rd Qu.:-6.174   3rd Qu.:106.9   3rd Qu.:30.00  
#>                     Max.   :-6.092   Max.   :107.0   Max.   :77.00  
#>                                                                     
#>    tapOutTime                      payAmount        user_Category  
#>  Min.   :2023-04-01 07:27:31.0   Min.   :    0   freecard  :14685  
#>  1st Qu.:2023-04-10 10:38:08.0   1st Qu.:    0   regular   :16419  
#>  Median :2023-04-16 19:16:57.0   Median : 3500   royaltrans: 1571  
#>  Mean   :2023-04-16 09:26:05.7   Mean   : 2720                     
#>  3rd Qu.:2023-04-21 22:54:39.0   3rd Qu.: 3500                     
#>  Max.   :2023-04-30 23:23:18.0   Max.   :20000                     
#>                                                                    
#>       age          age_Category  
#>  Min.   :11.00   adult   :27558  
#>  1st Qu.:22.00   children: 4706  
#>  Median :32.00   elderly :  411  
#>  Mean   :32.81                   
#>  3rd Qu.:41.00                   
#>  Max.   :77.00                   
#> 

Insight :

  1. The data has a duration of one month from 2023-04-01 to 2023-04-31
  2. The Cibubur - Balai kota route is the one with the most passengers
  3. Most payments are made with DKI bank cards

3.1 Define top 5 payment methods of Transjakarta passengers?

#create a barplot to compare Transjakarta payments
databank <- table(clean_data$payCardBank)
barplot(height = databank)

Answer : Top 5 payment method is DKI, emoney, flazz, brizzi and online.

3.2 How is the distribution of Transjakarta users?

#distribution of users by gender and category
graphics::barplot(table(clean_data$payCardSex,droplevels(clean_data$user_Category)), legend.text = T)

#distribution of users based on age
hist(x=clean_data$age)

Answer : 1. More Transjakarta users are women with a small difference from men 2. Transjakarta users are dominated by ages 25-40 with most aged 35-40 years 3. Most of the users are regular categories and free card users

3.3 At what time is Transjakarta busiest?

hist(x = hour(clean_data$tapInTime), freq = T, breaks = 24)

Answer : Transjakarta is busiest at 5-6 o’clock and 17-18 o’clock, 5 o’clock in the morning is the time to go to work and school while 17 o’clock is the time to leave the office

3.4 Which route has the longest average travel time for passengers?

#To calculate the average time, we first need to find the time needed for each trip
clean_data <- clean_data %>% 
  mutate(duration = tapOutTime - tapInTime)

trav_time <- clean_data %>% 
  group_by(corridorID) %>% 
  summarise(mean_time = mean(duration)) %>% 
  ungroup() %>% 
  arrange(desc(mean_time))

trav_time5 <- head(trav_time, 5)

barplot(height = as.double(trav_time5$mean_time), names.arg = trav_time5$corridorID, horiz = TRUE)

Answer: The route with the longest travel time is route 7B, route D31 and route 1B

3.5 how much transjakarta income during weekdays?

clean_data <- clean_data %>%
  mutate(is_weekend = weekdays(tapInTime) %in% c("Saturday", "Sunday"),
         is_weekend = case_when(
          is_weekend   == TRUE ~ "Weekend",
          is_weekend  == FALSE  ~ "Weekday",
          TRUE ~ NA_character_
          ))

data_weekday <- clean_data %>% 
  group_by(is_weekend) %>% 
  summarise(total = sum(payAmount)) %>% 
  ungroup()
  
data_weekday
#> # A tibble: 2 × 2
#>   is_weekend    total
#>   <chr>         <dbl>
#> 1 Weekday    80808000
#> 2 Weekend     8078500

Answer Transjakarta revenue during weekdays in April 2023 is Rp.80.808.000

3.6 If we want to increase the capacity of seats specifically for women, which route will we choose?

data_female <- clean_data %>% 
  group_by(corridorID, payCardSex) %>% 
  summarise(total = n()) %>% 
  ungroup()

data_female <- data_female %>% 
  filter(payCardSex == "F") %>% 
  arrange(desc(total))

head(data_female)
#> # A tibble: 6 × 3
#>   corridorID payCardSex total
#>   <fct>      <fct>      <int>
#> 1 9D         F            208
#> 2 T11        F            204
#> 3 2E         F            201
#> 4 JAK.112    F            191
#> 5 JAK.36     F            183
#> 6 2P         F            168

Answer: 1. route 9D & T11 has the most female passengers, this could be a consideration for increasing the capacity of seats specifically for women

4 Explanatory Visualization

Explanatory visualization is a type of data visualization that is specifically designed to help people understand and interpret complex data or information. It goes beyond simply presenting data in a visual format; it aims to provide insights, tell a story, or convey a specific message to the audience.

4.1 comparing the distribution of payment methods based on Transjakarta user categories

plot1_data <- clean_data %>% 
  group_by(payCardBank, user_Category) %>% 
  summarise(total = n()) %>% 
  ungroup()

plot1 <- ggplot(data = plot1_data, 
                aes(x = total, 
                    y = reorder(payCardBank, total))) +
  geom_col(aes(fill = user_Category), 
           position = "dodge") +
  labs(title="Transjakarta Payment Method",
       subtitle = "distribution of payment methods for Transjakarta users based on user categories",
       x="Number of Users", 
       y="Payment Method", 
       caption = "Source : Transjakarta dataset-Kaggle", fill = "Legends\n") +
  theme_light() +
  theme(plot.caption = element_text(hjust = 0), )
  

plot1

4.2 Time series of Transjakarta users based on each age category

plot2_data <- clean_data %>% 
  group_by(age_Category, hour(tapInTime)) %>% 
  summarise(total = n()) %>% 
  ungroup() %>% 
  rename(hour = "hour(tapInTime)")
  
plot3 <- ggplot(plot2_data, aes(x=hour, y=total, fill=age_Category)) + 
    geom_area(alpha=0.6 , size=.5, colour="white") +
    scale_fill_viridis(discrete = T) +
labs(title="Transjakarta Users",
       subtitle = "Time series of Transjakarta users based on each age category",
       x="Operational hour", 
       y="Number of customers", 
       caption = "Source : Transjakarta dataset-Kaggle", fill = "Legends\n") +
  scale_x_continuous(breaks = seq(0,23,1)) +
  theme_ipsum() +
  theme(plot.caption = element_text(hjust = 0),)

plot3

4.3 Word Cloud of Transjakarta corridor Name

medium.corpus = Corpus(VectorSource(clean_data$corridorName))

removeHTML = function(text){
  text = gsub(pattern = '<.+\\">', '', text)
  text = gsub(pattern = '</.+>', '', text)
  return(text)
}

medium.corpus = medium.corpus %>%
  #tm_map(content_transformer(removeHTML)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("SMART")) 

tdm = TermDocumentMatrix(medium.corpus) %>%
  as.matrix()
words = sort(rowSums(tdm), decreasing = TRUE)
df = data.frame(word = names(words), freq = words)

df = df %>%
  filter(nchar(as.character(word)) > 2,
         word != "don’")

uxc.colors = c("#fefefe", "#f4f2a8", "#030303")
uxc.background = "#00ccff"

# font_import()
fonts()
#> NULL
wordcloud2(df,
           color = rep_len(uxc.colors, nrow(df)),
           backgroundColor = uxc.background,
           fontFamily = "DM Sans",
           size = 2.5,
           minSize = 5,
           rotateRatio = 0)