1 Comparing the Best Flights Ticket Prices

1.1 Introduction

Welcome! This dataset contains information about the best flights ticket prices from Jakarta. We all know how important it is to obtain affordable flights tickets when planning a trip. Sometimes, we need to invest time in searching for the best deals or even stay up late to take advantage of cheaper ticket prices. This data has been collected from the online ticket agent, tiket.com, and compiled by Riki Akbar as the code author.

This dataset provides an opportunity for us to analyze and understand the trends in flights ticket prices from Jakarta to various destinations in Indonesia. With this data, we can explore opportunities to secure flights tickets at better prices when planning our travels

1.2 Data Preparation

1.2.1 Import Data

tc <- read.csv(file = "datainput/tiketcom_bestprice.csv")

#inspect data
head(tc)

1.2.2 Splitting Data Columns

The data we have is currently in a single column, even though it contains different pieces of information. This makes it appear messy, and we need to split it into several new columns. We can do this using the ‘tidyr’ library

library(tidyr)


# Split the column based on the delimiter "|"

ticket <- separate(tc, 
               col = extract_timestamp.origin.destination.depart_date.best_price, 
               into = c("Timestamp", "Origin", "Destination", "Depart_Date", "Best_Price"), 
               sep = "\\|")

# Check the data frame after the transformation
ticket

1.2.3 Reviewing Data Components

Next, let’s take a look at the data components using ‘glimpse()’ from the ‘dplyr’ library.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(ticket)
## Rows: 45,438
## Columns: 5
## $ Timestamp   <chr> "2022-12-24 11:25:37.656571", "2022-12-24 11:25:37.656571"…
## $ Origin      <chr> "JKTC", "JKTC", "JKTC", "JKTC", "JKTC", "JKTC", "JKTC", "J…
## $ Destination <chr> "BTJ", "BTJ", "BTJ", "BTJ", "BTJ", "BTJ", "BTJ", "BTJ", "B…
## $ Depart_Date <chr> "2023-04-25", "2023-04-04", "2023-04-27", "2023-04-29", "2…
## $ Best_Price  <chr> "2310582.0", "2310582.0", "2310582.0", "2310582.0", "23163…

1.2.4 Checking for Missing Values

colSums(is.na(ticket))
##   Timestamp      Origin Destination Depart_Date  Best_Price 
##           0           0           0           0           0
anyNA(ticket)
## [1] FALSE

1.2.5 Modifying Data Types

The data currently does not have appropriate data types for the five columns. They should have the following data types:

•Origin: factor

•Destination: factor

•Depart_Date: date

•Best_Price: integer

•Timestamp: timestamp using ‘as.POSIXct()’

# Modifying data types using mutate()

ticket <- ticket %>% 
            mutate(Timestamp = as.POSIXct(Timestamp,format = "%Y-%m-%d %H:%M:%S"),
                   Origin = as.factor(Origin),
                   Destination = as.factor(Destination),
                   Best_Price = as.integer(Best_Price),
                   Depart_Date = as.Date(Depart_Date, format = "%Y-%m-%d"))

str(ticket)
## 'data.frame':    45438 obs. of  5 variables:
##  $ Timestamp  : POSIXct, format: "2022-12-24 11:25:37" "2022-12-24 11:25:37" ...
##  $ Origin     : Factor w/ 1 level "JKTC": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Destination: Factor w/ 29 levels "BDJ","BDO","BIK",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Depart_Date: Date, format: "2023-04-25" "2023-04-04" ...
##  $ Best_Price : int  2310582 2310582 2310582 2310582 2316313 2474980 2474980 2474980 2474980 2474980 ...

In the code above, we use the ‘mutate()’ function to modify the data types of the columns as specified. We ensure that “Timestamp” is in the timestamp format using ‘as.POSIXct()’, “Origin” and “Destination” are converted to factors, “Best_Price” is an integer, and “Depart_Date” is in date format. After these modifications, we use ‘str()’ to review the data structure.

1.3 Data Explanation

This dataset contains the best flight prices from Jakarta (JKTC) to various cities in Indonesia according to their respective destinations. It includes information about the departure date, as explained in the “Depart_Date” column. Additionally, the ticket prices for these flights are provided in the “Best_Price” column.

To start, let’s determine the total number of rows in the flight dataset using ‘nrow()’:

# Calculate the total number of flights
nrow(ticket)
## [1] 45438

In the dataset, there are 45,438 flights from Jakarta (JKTC). However, we are interested in knowing the departure dates of these flights. We can do this by using the ‘unique()’ function:

# View all unique departure dates
unique(ticket$Depart_Date)
##   [1] "2023-04-25" "2023-04-04" "2023-04-27" "2023-04-29" "2023-04-01"
##   [6] "2023-04-24" "2023-04-02" "2023-04-03" "2023-04-26" "2023-04-09"
##  [11] "2023-04-05" "2023-04-28" "2023-04-06" "2023-04-07" "2023-04-08"
##  [16] "2023-04-12" "2023-04-13" "2023-04-30" "2023-04-10" "2023-04-11"
##  [21] "2023-04-23" "2023-04-16" "2023-04-17" "2023-04-18" "2023-04-19"
##  [26] "2023-04-20" "2023-04-21" "2023-04-22" "2023-04-14" "2023-04-15"
##  [31] "2023-01-15" "2023-01-16" "2023-01-09" "2023-01-13" "2023-01-02"
##  [36] "2023-01-05" "2023-01-27" "2023-01-03" "2023-01-08" "2023-01-06"
##  [41] "2023-01-29" "2023-01-30" "2023-01-10" "2023-01-22" "2023-01-23"
##  [46] "2023-01-01" "2023-01-20" "2023-01-04" "2023-01-25" "2023-01-11"
##  [51] "2023-01-12" "2023-01-19" "2023-01-18" "2023-01-07" "2023-01-26"
##  [56] "2023-01-24" "2023-01-31" "2023-01-14" "2023-01-17" "2023-01-28"
##  [61] "2023-01-21" "2023-02-16" "2023-02-02" "2023-02-23" "2023-02-09"
##  [66] "2023-02-11" "2023-02-18" "2023-02-04" "2023-02-25" "2023-02-17"
##  [71] "2023-02-15" "2023-02-14" "2023-02-19" "2023-02-20" "2023-02-24"
##  [76] "2023-02-01" "2023-02-21" "2023-02-06" "2023-02-28" "2023-02-27"
##  [81] "2023-02-05" "2023-02-26" "2023-02-03" "2023-02-08" "2023-02-07"
##  [86] "2023-02-13" "2023-02-12" "2023-02-10" "2023-02-22" "2023-03-28"
##  [91] "2023-03-30" "2023-03-23" "2023-03-25" "2023-03-06" "2023-03-05"
##  [96] "2023-03-10" "2023-03-13" "2023-03-12" "2023-03-17" "2023-03-19"
## [101] "2023-03-20" "2023-03-03" "2023-03-24" "2023-03-29" "2023-03-07"
## [106] "2023-03-27" "2023-03-26" "2023-03-04" "2023-03-09" "2023-03-08"
## [111] "2023-03-31" "2023-03-14" "2023-03-11" "2023-03-18" "2023-03-16"
## [116] "2023-03-15" "2023-03-02" "2023-03-01" "2023-03-22" "2023-03-21"
## [121] "2023-05-28" "2023-05-29" "2023-05-08" "2023-05-09" "2023-05-13"
## [126] "2023-05-14" "2023-05-15" "2023-05-16" "2023-05-31" "2023-05-10"
## [131] "2023-05-11" "2023-05-12" "2023-05-30" "2023-05-17" "2023-05-18"
## [136] "2023-05-19" "2023-05-24" "2023-05-25" "2023-05-26" "2023-05-27"
## [141] "2023-05-20" "2023-05-21" "2023-05-22" "2023-05-23" "2023-05-06"
## [146] "2023-05-07" "2023-05-02" "2023-05-03" "2023-05-04" "2023-05-05"
## [151] "2023-05-01" "2023-06-19" "2023-06-26" "2023-06-28" "2023-06-27"
## [156] "2023-06-21" "2023-06-20" "2023-06-14" "2023-06-13" "2023-06-12"
## [161] "2023-06-18" "2023-06-04" "2023-06-25" "2023-06-06" "2023-06-05"
## [166] "2023-06-22" "2023-06-24" "2023-06-02" "2023-06-01" "2023-06-23"
## [171] "2023-06-08" "2023-06-07" "2023-06-29" "2023-06-09" "2023-06-15"
## [176] "2023-06-17" "2023-06-16" "2023-06-11" "2023-06-10" "2023-06-03"
## [181] "2023-06-30" "2024-01-08" "2024-01-13" "2024-01-28" "2024-02-27"
## [186] "2024-02-12" "2024-03-28" "2024-04-27" "2024-04-12" "2024-05-27"
## [191] "2024-05-12" "2024-06-11" "2024-06-26" "2024-03-13" "2024-01-01"
## [196] "2024-01-20" "2024-04-23" "2024-05-23" "2024-05-08" "2024-06-07"
## [201] "2024-06-22" "2024-01-02" "2023-07-03" "2023-07-28" "2023-07-09"
## [206] "2023-07-08" "2023-07-16" "2023-07-15" "2023-07-18" "2023-07-17"
## [211] "2023-07-12" "2023-07-11" "2023-07-14" "2023-07-13" "2023-07-30"
## [216] "2023-07-10" "2023-07-31" "2023-07-19" "2023-07-05" "2023-07-27"
## [221] "2023-07-26" "2023-07-04" "2023-07-29" "2023-07-07" "2023-07-06"
## [226] "2023-07-23" "2023-07-01" "2023-07-22" "2023-07-25" "2023-07-02"
## [231] "2023-07-24" "2023-07-21" "2023-07-20"

Based on the “Depart_Date” column, there are departures in the year 2024. However, we only want to focus on the data for departures in the year 2023 from January to July. Here is how we can filter this data:

library(dplyr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
# Filter for the year 2023
data_tiket_2023 <- ticket %>%
  filter(year(Depart_Date) == 2023)

# Remove the day component from the "Depart_Date" column

data_tiket_2023 <- data_tiket_2023 %>%
  mutate(Depart_Date = as.Date(format(Depart_Date, "%Y-%m-01")))

data_tiket_2023

After this separation, it is observed that there are 45,395 flights in the dataset for the year 2023, from January to July.

Next, we want to find out the destinations from Jakarta using the ‘levels()’ function.

# View IATA codes using levels()

levels(ticket$Destination)
##  [1] "BDJ" "BDO" "BIK" "BKS" "BPN" "BTH" "BTJ" "DJB" "DJJ" "DPS" "JOG" "KNO"
## [13] "LOP" "MDC" "MKQ" "MLG" "PDG" "PGK" "PKU" "PLM" "PNK" "SOC" "SRG" "SUB"
## [25] "TKG" "TNJ" "TRK" "UPG" "YIA"

In the “Destination” column, there are flights to 29 cities in Indonesia. These destinations are identified by their International Air Transport Association (IATA) codes, which correspond to various airports in Indonesia. Here is an explanation of the IATA codes:

•BDJ: Syamsudin Noor Airport (Banjarmasin)

•BDO: Husein Sastranegara International Airport (Bandung)

•BIK: Frans Kaisiepo Airport (Biak)

•BKS: Fatmawati Soekarno Airport (Bengkulu)

•BPN: Sultan Aji Muhammad Sulaiman International Airport (Balikpapan)

•BTH: Hang Nadim Airport (Batam)

•BTJ: Sultan Iskandar Muda International Airport (Banda Aceh)

•DJB: Sultan Thaha Syaifuddin International Airport (Jambi)

•DJJ: Sentani Airport (Jayapura)

•DPS: Ngurah Rai International Airport (Denpasar, Bali)

•JOG: Adisutjipto International Airport (Yogyakarta)

•KNO: Kualanamu International Airport (Medan)

•LOP: Lombok International Airport (Praya)

•MDC: Sam Ratulangi International Airport (Manado)

•MKQ: Mopah Airport (Merauke)

•MLG: Abdul Rachman Saleh Airport (Malang)

•PDG: Minangkabau International Airport (Padang)

•PGK: Depati Amir International Airport (Pangkalpinang)

•PKU: Sultan Syarif Kasim II International Airport (Pekanbaru)

•PLM: Sultan Mahmud Badaruddin II International Airport (Palembang)

•PNK: Supadio International Airport (Pontianak)

•SOC: Adi Sumarmo Airport (Solo)

•SRG: Adisumarmo International Airport (Surakarta)

•SUB: Juanda International Airport (Surabaya)

•TKG: Radin Inten II Airport (Bandar Lampung)

•TNJ: Raja Haji Fisabilillah International Airport (Tanjung Pinang)

•TRK: Juwata Airport (Tarakan)

•UPG: Sultan Hasanuddin Airport (Makassar)

•YIA: Yogyakarta International Airport (Kulon Progo)

Next, we will explore the data further.

1.4 Data Exploration

This exploration will be based on business questions that need to be answered.

  1. Is there a pattern in the fluctuation of flights ticket prices based on booking time? Does booking tickets in advance result in better prices?
library(dplyr)

# Finding the average flight ticket prices

ticket_TS <- ticket %>% mutate(Booking_Month = format(Timestamp, "%Y-%m"))%>% group_by(Booking_Month) %>%  summarize(Average_Price = mean(Best_Price)) 
  
ticket_TS

From this data, we can observe the changes in flights ticket prices over time. For example, we can see that the average flights ticket prices tend to decrease from month to month.

In the provided output, the ticket prices in December 2022 (1,573,344) are higher than in January 2023 (1,477,963) and February 2023 (1,402,706).

This indicates that booking tickets in advance (e.g., a few months before the departure date) tends to result in better prices.

  1. Is there a significant difference in flights ticket prices for closely spaced departure dates?
# Import the necessary library
library(ggplot2)

# Group the data by "Depart_Date"
ticket_o <- data_tiket_2023 %>%
  group_by(Depart_Date) %>%
  summarize(Average_Price = mean(Best_Price)) 

# Create a ggplot for the output
ggplot(ticket_o, aes(x = Depart_Date, y = Average_Price)) +
  geom_line() +
  labs(title = "Perbedaan Harga Tiket Pesawat antara Tanggal Keberangkatan",
       x = "Tanggal Keberangkatan",
       y = "Rata-rata Harga Tiket") +
  theme_minimal()

In the provided plot, there is an upward price trend around February to April, followed by a gradual decrease after April towards July. This indicates a fluctuation in ticket prices from month to month. Further statistical analysis may be needed to investigate this.

3.Is there a price trend for specific days of the week?

# Import the necessary libraries
library(dplyr)
library(ggplot2)


# Extract the day of the week
data_tiket_2023 <- data_tiket_2023 %>%
  mutate(Day_of_Week = weekdays(Depart_Date))

# Group the data by the day of the week
ticket_week <- data_tiket_2023 %>%
  group_by(Day_of_Week) %>%
  summarize(Average_Price = mean(Best_Price))

ticket_week

Based on the analysis of the data output, we can see the average flights ticket prices for each day of the week. The results are as follows:

•Monday has an average airfare ticket price of approximately 1,580,924.

•Saturday has an average airfare ticket price of approximately 1,569,280.

•Sunday has an average airfare ticket price of approximately 1,385,607.

•Thursday has an average airfare ticket price of approximately 1,553,374.

•Wednesday has an average airfare ticket price of approximately 1,379,488.

From this data, we can observe differences in the average flights ticket prices between days of the week. For example, ticket prices on Sundays appear to be lower compared to Mondays or Thursdays.