Data Exploration - Regional Tourism

Practicum ~ Week 4

1 Introduction

1.1 Objectives

This report analyzes and visualizes the development of the tourism sector in Indonesia from August 2024 to August 2025. The analysis uses data reconstructed from infographics on tourist visits published by the Central Statistics Agency (BPS). It aims to understand seasonal trends, identify the contribution of each tourist market (foreign tourists, domestic tourists, and national tourists traveling abroad), and analyze the potential impact on the accommodation sector (hotels). The visualizations are built with the ggplot2 package and cover eight (8) different types of graphs.

1.2 Dataset Description

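The data are reconstructed from BPS infographics and cover August 2024 to August 2025: Foreign Tourist Visits (Wisman) in thousands of visits, Domestic Tourist Travel (Wisnus) in millions of trips, and National Tourist Travel (Wisnas). The data set also includes category shares for each tourist type and the room occupancy rate (TPK) by hotel type. The daily hotel occupancy and revenue values used for the distribution plots are simulated for demonstration purposes.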

1.3 Case Context

This case study analyzes monthly patterns and fluctuations in tourist movements (foreign tourists, domestic tourists, and national tourists). It aims to identify seasonal trends and the relative contribution of each tourist market to total travel. It also examines how differences in travel patterns over time (season/day) affect pressure on the capacity of the accommodation sector (hotels).

2 Data Preparation

Data preparation serves to make raw data structured, clean, and usable for data visualization using ggplot2.

The following is the preparation of Regional Tourism data from August 2024 to August 2025:

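# Packages used throughout this report
library(dplyr)     # %>%, filter(), mutate()
library(tidyr)     # pivot_longer()
library(ggplot2)   # plotting
library(ggridges)  # geom_density_ridges()
library(DT)        # datatable()
library(knitr)     # kable()

data_bulanan <- data.frame(
  Bulan = c("Agt'24", "Sep", "Okt", "Nov", "Des", 
            "Jan'25", "Feb", "Mar", "Apr", "Mei", "Jun", "Jul", "Agt"),
  WisMan = c(1339.95, 1279.26, 1193.87, 1092.07, 1228.63, 
             1156.01, 1022.89, 984.77, 1164.53, 1306.00, 1481.35, 1415.96, 1505.22), # thousand visits
  WisNus = c(75.88, 83.36, 81.43, 80.61, 101.08, 103.00, 
             90.49, 88.91, 128.58, 97.67, 105.11, 100.20, 93.57), # million trips
  WisNas = c(648.58, 661.19, 731.26, 750.06, 810.44, 990.11,
             759.07, 582.08, 926.60, 585.80, 727.56, 869.93, 684.93) # trips (thousands, per the line plot subtitle)
)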

urutan_bulan <- c("Agt'24", "Sep", "Okt", "Nov",
                  "Des", "Jan'25", "Feb", "Mar", "Apr", "Mei", "Jun", "Jul", "Agt")
data_bulanan$Bulan <- factor(data_bulanan$Bulan, levels = urutan_bulan)

data_kategori <- data.frame(
  Jenis = c(rep("Wisman", 3), rep("Wisnus", 3), rep("Wisnas", 3)),
  Kategori = c("Kebangsaan", "Kebangsaan", "Kebangsaan", "Provinsi",
               "Provinsi", "Provinsi", "Negara Tujuan", "Negara Tujuan", "Negara Tujuan"),
  Detail = c("Malaysia", "Australia", "Tiongkok", "Jawa Barat", "Jawa Timur",
             "Jawa Tengah", "Malaysia", "Arab Saudi", "Singapura"),
  Nilai_Persen = c(15.26, 10.35, 9.35, 17.79, 17.17,
                   11.72, 30.81, 22.86, 13.19)
)


data_hotel <- data.frame(
  Tipe_Hotel = c("Hotel Bintang", "Hotel Nonbintang"),
  TPK = c(50.51, 25.79) # percent
)


set.seed(42) 
data_distribusi <- data.frame(
  Hari = 1:365,
  TPK_Bintang = rnorm(365, mean = 55, sd = 8),
  TPK_Nonbintang = rnorm(365, mean = 30, sd = 6),
  Pemasukan = runif(365, 100, 500) * (rnorm(365, 1, 0.2) + 
                  data_bulanan$WisMan[sample(1:13, 365, replace=TRUE)]/1000)  # simulated revenue, used in the scatter plot
)
data_distribusi_long <- data_distribusi %>%
  pivot_longer(cols = starts_with("TPK"), names_to = "Tipe", values_to = "TPK") %>%
  mutate(Tipe = gsub("TPK_", "", Tipe))

# datatable() takes a single data object; data_distribusi is previewed separately below
datatable(data_bulanan,
          caption = "Data on tourist visits to Indonesia from August 2024 to August 2025")

kable(data_kategori, 
      caption = "Data Category")

Data Category
Jenis	Kategori	Detail	Nilai_Persen
Wisman	Kebangsaan	Malaysia	15.26
Wisman	Kebangsaan	Australia	10.35
Wisman	Kebangsaan	Tiongkok	9.35
Wisnus	Provinsi	Jawa Barat	17.79
Wisnus	Provinsi	Jawa Timur	17.17
Wisnus	Provinsi	Jawa Tengah	11.72
Wisnas	Negara Tujuan	Malaysia	30.81
Wisnas	Negara Tujuan	Arab Saudi	22.86
Wisnas	Negara Tujuan	Singapura	13.19

kable(data_hotel, 
      caption = "Data Hotel")

Data Hotel
Tipe_Hotel	TPK
Hotel Bintang	50.51
Hotel Nonbintang	25.79

Distribution Data

head(data_distribusi_long)

EXPLANATION:

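WisMan: Wisatawan Mancanegara (foreign tourists visiting Indonesia)
WisNus: Wisatawan Nusantara (domestic tourists traveling within Indonesia)
WisNas: Wisatawan Nasional (national tourists, i.e. Indonesian residents traveling abroad)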

3 Data Visualization

Definition: Data visualization in R is the process of displaying data in the form of graphs, diagrams, or plots to make it easier to understand and analyze. The goal is to see patterns, trends, relationships between variables, or anomalies (outliers) that may not be visible from numbers or tables alone.

3.1 Bar Chart

A Bar Chart presents categorical data as bars. Bar charts are simple and easy to read, especially when comparing one category with another. However, they are not suitable for continuous data, and they become hard to read when there are many categories.

For the Regional Tourism data, the bar chart below shows the top three nationalities of foreign tourists visiting Indonesia.

data_wisman_bar <- data_kategori %>% filter(Jenis == "Wisman")

ggplot(data_wisman_bar, aes(x = reorder(Detail, Nilai_Persen), y = Nilai_Persen, fill = Detail)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(Nilai_Persen, "%")), vjust = -0.5) +
  labs(
    title = "Top 3 Foreign Tourist Arrivals by Nationality (Aug 2025)",
    x = "Nationality",
    y = "Percentage (%)"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")

3.2 Line Plot

A Line Plot is a visualization that displays a series of data points that are connected by straight line segments.

A Line Plot is effective for visualizing time series and for seeing trends over time. Its advantages are that it shows trends and movements clearly and makes it easy to compare multiple data series. However, line plots are not suitable for unordered categorical data, and a plot with many series quickly becomes cluttered.

# Reshape data_bulanan to long format (data_tren) so each tourist type becomes one line
data_tren <- data_bulanan %>%
  pivot_longer(cols = c(WisMan, WisNus, WisNas),
               names_to = "Jenis_Wisatawan",
               values_to = "Jumlah") %>%
  mutate(Jenis_Wisatawan = factor(Jenis_Wisatawan, levels = c("WisMan", "WisNus",
                                                              "WisNas")))

ggplot(data_tren, aes(x = Bulan, y = Jumlah, color = Jenis_Wisatawan, group =
                      Jenis_Wisatawan)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces the deprecated size argument for lines
  geom_point(size = 3) +
  labs(
    title = "Trends in Tourist Visits/Travel (Aug 2024 - Aug 2025)",
    subtitle = "WisMan (thousands of visits), WisNus (millions of trips), WisNas (thousands of trips)",
    y = "Number of Visits/Trips",
    x = "Month",
    color = "Types of Tourists"
  ) +
  theme_minimal(base_size = 14) +
  scale_color_manual(values = c("WisMan" = "#D95F02", "WisNus" = "#1B9E77", "WisNas" =
                                           "#7570B3")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The graph above shows that all three series rose in April, which coincides with the holiday and vacation period. From June to August, WisMan stayed at its highest levels of the period (between 1,415.96 and 1,505.22 thousand visits), in line with the summer vacation season.
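Because the three series are recorded in different units, a shared y-axis compresses WisNus (values around 100) against WisMan (values above 1,000). One possible alternative, sketched below, gives each series its own panel and y-scale with `facet_wrap(scales = "free_y")`. The object name `p_facet` is introduced here only for illustration, and the values are copied from the Data Preparation section.

```r
library(ggplot2)
library(tidyr)

# Monthly figures copied from the Data Preparation section
bulan <- c("Agt'24", "Sep", "Okt", "Nov", "Des", "Jan'25", "Feb",
           "Mar", "Apr", "Mei", "Jun", "Jul", "Agt")
data_bulanan <- data.frame(
  Bulan  = factor(bulan, levels = bulan),  # keep chronological order on the x-axis
  WisMan = c(1339.95, 1279.26, 1193.87, 1092.07, 1228.63, 1156.01, 1022.89,
             984.77, 1164.53, 1306.00, 1481.35, 1415.96, 1505.22),
  WisNus = c(75.88, 83.36, 81.43, 80.61, 101.08, 103.00, 90.49,
             88.91, 128.58, 97.67, 105.11, 100.20, 93.57),
  WisNas = c(648.58, 661.19, 731.26, 750.06, 810.44, 990.11, 759.07,
             582.08, 926.60, 585.80, 727.56, 869.93, 684.93)
)

# Long format: one row per month and tourist type
data_tren <- pivot_longer(data_bulanan,
                          cols = c(WisMan, WisNus, WisNas),
                          names_to = "Jenis_Wisatawan",
                          values_to = "Jumlah")

# One panel per tourist type, each with its own y-scale
p_facet <- ggplot(data_tren, aes(x = Bulan, y = Jumlah, group = Jenis_Wisatawan)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  facet_wrap(~ Jenis_Wisatawan, ncol = 1, scales = "free_y") +
  labs(x = "Month", y = "Visits/trips (unit differs by panel)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_facet
```

With separate panels, the April peak in WisNus is visible without rescaling the data, at the cost of no longer comparing the series on one axis.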

3.3 Pie Chart

A Pie Chart is used to illustrate data proportions in a circular shape. This is a popular choice because it provides a quick and clear overview of the data. However, if there are too many data categories, it can be difficult to see them, especially if there are similar size slices. Therefore, if you want to use a pie chart, it is best to use it for data with 2 to 5 categories.

data_pie <- data_hotel %>%
  mutate(persen = TPK / sum(TPK),
         label = paste0(Tipe_Hotel, "\n", round(TPK, 2), "%"))

ggplot(data_pie, aes(x = "", y = persen, fill = Tipe_Hotel)) +
  geom_bar(stat = "identity", width = 1, color = "white", linewidth = 1) +
  coord_polar("y", start = 0) +
  # position_stack() centers each label in its own slice; manual cumsum()
  # positions do not follow the order in which geom_bar() stacks the bars
  geom_text(aes(label = label), position = position_stack(vjust = 0.5),
            size = 5, color = "black") +
  labs(
    title = "TPK Proportion of Star-Rated Hotels vs Non-Star-Rated Hotels (Aug 2025)",
    fill = "Tipe_Hotel"
  ) +
  theme_void(base_size = 14) +
  scale_fill_manual(values = c("Hotel Bintang" = "#FDB462", "Hotel Nonbintang" = "#B3DE69" )) +
  theme(legend.position = "none")

EXPLANATION:

TPK = Tingkat Penghunian Kamar (room occupancy rate)

From the pie chart above, the TPK of star-rated hotels (50.51%) is roughly double that of non-star hotels (25.79%), which may reflect a market preference for star-rated accommodation. Note that the labels do not add up to 100%: each slice is sized by its TPK relative to the sum of the two TPK values, while the label shows the raw TPK, which is an occupancy rate and not a share of national room capacity.
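The slice sizes described above can be reproduced with a short calculation. This is a minimal sketch using the two TPK values from `data_hotel`; the vector names `tpk` and `share` are introduced here for illustration.

```r
# TPK values (percent) copied from data_hotel
tpk <- c("Hotel Bintang" = 50.51, "Hotel Nonbintang" = 25.79)

# Each slice is the TPK divided by the sum of both TPK values
share <- tpk / sum(tpk)

round(100 * share, 1)
# Hotel Bintang: 66.2, Hotel Nonbintang: 33.8
```

The slices therefore cover about 66.2% and 33.8% of the circle, even though the labels read 50.51% and 25.79%.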

3.4 Scatter Plot

A Scatter Plot is a visualization that displays points where each point represents the values of two different variables. Scatter Plots can be used to test and visualize the relationship between two continuous variables. Scatter Plots are useful for showing the relationship (correlation) between variables and identifying outliers. However, they are not useful for comparing categories, and it is difficult to interpret the data if there are too many points.

ggplot(data_distribusi, aes(x = TPK_Bintang, y = Pemasukan)) +
  geom_point(alpha = 0.7, color = "#6A51A3", size = 2) +
  geom_smooth(method = "lm", col = "red", se = TRUE, linewidth = 1) +
  labs(
    title = "Hypothetical Relationship between Hotel Star TPK and Daily Revenue",
    x = "Daily Star Hotel TPK (Simulation - %)",
    y = "Hotel Daily Revenue (Hypothetical Unit)"
  ) +
  theme_minimal(base_size = 14)

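From the scatter plot above, the fitted line indicates only a weak negative relationship between the simulated daily star-hotel TPK and the hypothetical daily hotel revenue. Because Pemasukan is generated without reference to TPK_Bintang, a slope this small is best read as random noise rather than a real effect.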
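The strength of the relationship can also be quantified with a correlation coefficient. The sketch below is intended to mirror the random draws used for `data_distribusi` and then applies base R's `cor()` and `cor.test()`; the names `wisman`, `r`, and `uji` are introduced here for illustration only.

```r
# WisMan values copied from data_bulanan in the Data Preparation section
wisman <- c(1339.95, 1279.26, 1193.87, 1092.07, 1228.63, 1156.01, 1022.89,
            984.77, 1164.53, 1306.00, 1481.35, 1415.96, 1505.22)

set.seed(42)
# Same order of random draws as in the data_distribusi definition
tpk_bintang    <- rnorm(365, mean = 55, sd = 8)
tpk_nonbintang <- rnorm(365, mean = 30, sd = 6)
pemasukan      <- runif(365, 100, 500) *
  (rnorm(365, 1, 0.2) + wisman[sample(1:13, 365, replace = TRUE)] / 1000)

# Pearson correlation and its significance test
r   <- cor(tpk_bintang, pemasukan)
uji <- cor.test(tpk_bintang, pemasukan)

r            # expected to be close to zero: revenue does not depend on TPK here
uji$p.value  # a large p-value means the slope is indistinguishable from noise
```

A correlation near zero is the expected outcome for this simulation, so the scatter plot demonstrates the chart type rather than an empirical finding about hotels.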

3.5 Histogram

A Histogram is a type of visualization that presents the frequency of numerical data using adjacent bars. It is commonly used to display the shape and spread of a single numerical variable. Its advantage is that it shows the shape of the distribution and makes it easy to identify the most frequent value range. However, the picture depends on the chosen bin width (value range), and individual values are not shown.

ggplot(data_distribusi, aes(x = TPK_Bintang)) +
  geom_histogram(binwidth = 3, fill = "#3182BD", color = "white", alpha = 0.8) +
  geom_vline(aes(xintercept = mean(TPK_Bintang)), color = "red", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Daily Frequency Distribution of Star Hotel TPK (Simulation)",
    x = "TPK Star Hotel (%)",
    y = "Daily Frequency"
  ) +
  theme_classic(base_size = 14)

From the histogram above, the simulated occupancy rate of star-rated hotels is concentrated around its mean of roughly 55%, marked by the red dashed line. Days that fall below this average point to where efficiency and marketing efforts could raise occupancy.
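The dependence of a histogram on its bin width can be checked without plotting. The sketch below regenerates the simulated star-hotel TPK with the same seed and generator call, then counts observations per bin for two widths using base R's `hist(plot = FALSE)`. The helper `bin_counts()` is hypothetical and written only for this example.

```r
set.seed(42)
# Same generator call as data_distribusi$TPK_Bintang
tpk_bintang <- rnorm(365, mean = 55, sd = 8)

# Hypothetical helper: counts per bin for a given bin width
bin_counts <- function(x, width) {
  breaks <- seq(floor(min(x) / width) * width,
                ceiling(max(x) / width) * width,
                by = width)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

counts_narrow <- bin_counts(tpk_bintang, 3)   # width used in the plot above
counts_wide   <- bin_counts(tpk_bintang, 10)  # a much coarser alternative

length(counts_narrow)  # many bins: more detail, more noise
length(counts_wide)    # few bins: smoother, but detail is lost
```

Both versions contain the same 365 observations; only the grouping changes, which is why bin width should be reported alongside a histogram.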

3.6 Box Plot

A Box Plot, also known as a box-and-whisker plot, visualizes the five-number summary of a data set (minimum, first quartile, median, third quartile, and maximum). Box plots are well suited to comparing the spread and center of several groups, and they show the median and outliers clearly. However, the shape of the distribution is less visible than in Histograms and Density Plots.

ggplot(data_distribusi_long, aes(x = Tipe, y = TPK, fill = Tipe)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 8, linewidth = 1) +
  labs(
    title = "Boxplot Comparison of TPK Distribution for Star and Non-Star Hotels (Simulation)",
    x = "Hotel Type",
    y = "Room Occupancy Rate (%)"
  ) +
  theme_light(base_size = 14) +
  scale_fill_manual(values = c("Bintang" = "#E69F00", "Nonbintang" = "#56B4E9")) +
  theme(legend.position = "none")

From the box plot above, star-rated hotels have a much higher median TPK than non-star hotels, and a somewhat wider interquartile range.
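The quantities drawn by a box plot can be computed directly. The sketch below regenerates the simulated star-hotel TPK and derives the quartiles, the interquartile range, and the 1.5 x IQR fences beyond which `geom_boxplot()` draws points as outliers by default. The variable names are introduced here for illustration.

```r
set.seed(42)
# Same generator call as data_distribusi$TPK_Bintang
tpk_bintang <- rnorm(365, mean = 55, sd = 8)

# Quartiles and interquartile range (height of the box)
q   <- quantile(tpk_bintang, probs = c(0.25, 0.5, 0.75))
iqr <- q[[3]] - q[[1]]

# Fences: observations outside these limits are plotted as outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr
outliers <- tpk_bintang[tpk_bintang < lower_fence | tpk_bintang > upper_fence]

q                 # first quartile, median, third quartile
length(outliers)  # number of points that would be drawn individually
```

When outliers exist, the whiskers end at the most extreme observations inside the fences rather than at the minimum and maximum.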

3.7 Density Plot

A Density Plot is a graphical representation of the probability distribution of continuous numerical data. It is a smoothed version of a histogram. Density Plots are typically used to visualize the shape of continuous data distributions and are very useful for comparing the distributions of several data groups, making Density Plots better than Histograms for comparing distribution shapes between groups. However, they have the disadvantage of not showing the actual data frequency but rather density estimates that require curve interpretation.

ggplot(data_distribusi_long, aes(x = TPK, fill = Tipe)) +
  geom_density(alpha = 0.6) +
  labs(
    title = "Estimated Distribution Curve of TPK for Star and Non-Star Hotels (Simulation)",
    x = "Room Occupancy Rate (%)",
    y = "Density"
  ) +
  theme_minimal(base_size = 14) +
  scale_fill_manual(values = c("Bintang" = "#CC79A7", "Nonbintang" = "#009E73"))

3.8 Ridgeline Plot

A Ridgeline Plot is a series of overlapping density plots distributed vertically. It can be used to compare changes or differences in the distribution of numerical data across many categories very well. However, due to the overlap, it can hide some information, especially for curves with low values.

# data_kategori holds only a single percentage per category, which is not enough
# to estimate a distribution, so the simulated TPK data is used for the ridgeline demo.
ggplot(data_distribusi_long, aes(x = TPK, y = Tipe, fill = Tipe)) +
  geom_density_ridges(alpha = 0.8, scale = 1.5, rel_min_height = 0.01) +
  labs(
    title = "Comparison of TPK Distribution for Star and Non-Star Hotels (Ridgeline - Simulation)",
    x = "Room Occupancy Rate (%)",
    y = "Hotel Type"
  ) +
  theme_minimal(base_size = 10) +
  scale_fill_manual(values = c("Bintang" = "#8DD3C7", "Nonbintang" = "#BEBADA")) +
  theme(legend.position = "none")

4 Conclusion

Conclusion: The visualizations show that foreign and domestic tourism peaked in different months. Foreign tourist visits peaked in August 2025 at about 1.5 million (1,505.22 thousand) visits, while domestic tourist trips peaked in April 2025 at 128.58 million trips, which coincides with the Eid al-Fitr holiday. Comparing August 2024 with August 2025, all three series (foreign, domestic, and national tourists) increased year on year.
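The year-on-year comparison can be verified with a short calculation. This is a minimal sketch using the August 2024 and August 2025 values from `data_bulanan`; the data frame `yoy` and its column names are introduced here for illustration.

```r
# August 2024 and August 2025 values copied from data_bulanan
yoy <- data.frame(
  Jenis    = c("WisMan", "WisNus", "WisNas"),
  Agt_2024 = c(1339.95, 75.88, 648.58),
  Agt_2025 = c(1505.22, 93.57, 684.93)
)

# Percentage change from August 2024 to August 2025
yoy$Perubahan_Persen <- round(100 * (yoy$Agt_2025 / yoy$Agt_2024 - 1), 1)

yoy
# Perubahan_Persen: WisMan 12.3, WisNus 23.3, WisNas 5.6
```

All three changes are positive, with domestic travel (WisNus) showing the largest relative increase.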

Market Contribution: Domestic tourists have a primary domestic market on the island of Java, with West Java (17.79%) and East Java (17.17%) dominating as destination provinces.
Hotel TPK Distribution: Star-rated hotels recorded a room occupancy rate (TPK) of 50.51%, roughly double the 25.79% recorded by non-star hotels.
Relationship (Scatter Plot): In the hypothetical data, daily revenue shows only a weak relationship with star-hotel TPK (Section 3.4), so the simulation by itself cannot confirm TPK as an indicator of a hotel’s financial health; real revenue data would be needed to test that.

4.1 Source

The raw data was obtained from: Badan Pusat Statistik (BPS), 2024 to 2025
