1 Introduction

Kalibrr is a reliable source of information for finding and evaluating career opportunities. Job seekers and human resources professionals use it to find and assess companies and positions that fit their needs.

This project scrapes the website https://www.kalibrr.id/, a platform that lists job vacancies, companies, and other career opportunities. The goal is to collect data from Kalibrr with web scraping and then use that data to analyze and visualize the available job vacancies and companies. Once the data has been collected, it is analyzed and turned into informative visualizations, such as charts of job vacancy trends by industry sector, required education level, or most sought-after job location. These visualizations help explain the dynamics of the job market and offer useful insight to job seekers and companies.

The scraped data covers the following fields:

  • Posisi (position): The job title or role being offered.

  • Perusahaan (company): The company offering the position.

  • Lokasi (location): The location of the workplace.

  • Gaji (salary): The salary offered for the position.

  • Jenis (type): The employment type, such as full time, part time, or internship.

  • Batas (deadline): The closing date for submitting an application.

  • Level: The experience level required for the position, such as Entry Level / Junior, Apprentice.

Scraping is done in R with the rvest package, and tidyverse is used to clean the scraped data. The scraped data is stored in MongoDB Atlas. A scheduled job runs every day at 01:00 and takes five randomly selected job listings. The schedule is implemented as a workflow in GitHub Actions.
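The scraping step described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual scraper: the HTML snippet and the CSS class names (`job-card`, `job-title`, `company`, `location`) are hypothetical stand-ins, since the real page structure of Kalibrr is not shown in this report.

```r
library(rvest)

# Toy HTML standing in for a job listing page.
# The class names below are hypothetical, not Kalibrr's real markup.
page <- minimal_html('
  <div class="job-card">
    <h2 class="job-title">Social Media Admin</h2>
    <span class="company">SKINTIFIC</span>
    <span class="location">Jakarta Selatan, Indonesia</span>
  </div>
  <div class="job-card">
    <h2 class="job-title">Account Relation Officer</h2>
    <span class="company">Satu Group</span>
    <span class="location">Jakarta Utara, Indonesia</span>
  </div>
')

# One node per listing, then one field per node
cards <- html_elements(page, ".job-card")
jobs <- data.frame(
  time_scraped = Sys.time(),
  posisi       = html_text2(html_element(cards, ".job-title")),
  perusahaan   = html_text2(html_element(cards, ".company")),
  lokasi       = html_text2(html_element(cards, ".location")),
  stringsAsFactors = FALSE
)

# Keep five random listings (or all of them if fewer than five were found)
sampel <- jobs[sample(nrow(jobs), min(5, nrow(jobs))), ]
```

In the scheduled run, a data frame like `sampel` could then be written to the collection with mongolite's `insert()` method, for example `atlas_kalibrr$insert(sampel)`.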

2 MongoDB Atlas

library(mongolite)
## Warning: package 'mongolite' was built under R version 4.3.3
koleksi <- "kalibrr"
database <- "MDS_Scrape"
url_user <- "mongodb+srv://rahmianadra:12345@cluster0.albcw4e.mongodb.net/"
atlas_kalibrr <- mongo(
  collection = koleksi,
  db         = database,
  url        = url_user
)
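The connection string above embeds the user name and password directly in the report. One common alternative, sketched here as an assumption rather than as the project's setup, is to read the connection string from an environment variable. The variable name `ATLAS_URL` and the helper `get_atlas_url()` are hypothetical.

```r
# Read the MongoDB connection string from an environment variable.
# "ATLAS_URL" is a hypothetical name; it could be set in .Renviron locally
# or supplied as a repository secret in the scheduled workflow.
get_atlas_url <- function(var = "ATLAS_URL") {
  url <- Sys.getenv(var, unset = NA)
  if (is.na(url) || !nzchar(url)) {
    stop("Environment variable ", var, " is not set")
  }
  url
}
```

With that helper, the connection could be opened as `mongo(collection = koleksi, db = database, url = get_atlas_url())`, so that the credentials never appear in the rendered output.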

2.1 Number of Records Scraped

atlas_kalibrr$count()
## [1] 150

2.2 Displaying the Full Dataset

DT::datatable(atlas_kalibrr$find('{}'))

3 Data Visualization

library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(RColorBrewer)
dataku <- atlas_kalibrr$find('{}')
head(dataku)
##          time_scraped                                             posisi
## 1 2024-05-31 15:05:29       Officer Development Program (ODP) Batch 2024
## 2 2024-05-31 15:05:29                                 Social Media Admin
## 3 2024-05-31 15:05:29 Public Relations Analyst [Corporate Communication]
## 4 2024-05-31 15:05:29       Frontliner Staff (Teller & Customer Service)
## 5 2024-05-31 15:05:29  Human Capital Staff (Culture & Employer Branding)
## 6 2024-05-31 15:05:35                           Account Relation Officer
##              perusahaan                     lokasi               gaji     jenis
## 1 Bank Negara Indonesia   Jakarta Pusat, Indonesia Salary Undisclosed FULL_TIME
## 2             SKINTIFIC Jakarta Selatan, Indonesia Salary Undisclosed FULL_TIME
## 3       Kompas Gramedia   Jakarta Pusat, Indonesia Salary Undisclosed FULL_TIME
## 4      PT Bank BTPN Tbk   South Jakarta, Indonesia Salary Undisclosed FULL_TIME
## 5              FIFGROUP Central Jakarta, Indonesia Salary Undisclosed FULL_TIME
## 6            Satu Group   Jakarta Utara, Indonesia Salary Undisclosed FULL_TIME
##                 batas                            level
## 1 Apply before 29 Jun Entry Level / Junior, Apprentice
## 2 Apply before 29 Apr Entry Level / Junior, Apprentice
## 3 Apply before 13 May Entry Level / Junior, Apprentice
## 4 Apply before 31 Jul Entry Level / Junior, Apprentice
## 5 Apply before 27 Aug           Associate / Supervisor
## 6  Apply before 5 Jul Entry Level / Junior, Apprentice
glimpse(dataku)
## Rows: 150
## Columns: 8
## $ time_scraped <dttm> 2024-05-31 15:05:29, 2024-05-31 15:05:29, 2024-05-31 15:…
## $ posisi       <chr> "Officer Development Program (ODP) Batch 2024", "Social M…
## $ perusahaan   <chr> "Bank Negara Indonesia", "SKINTIFIC", "Kompas Gramedia", …
## $ lokasi       <chr> "Jakarta Pusat, Indonesia", "Jakarta Selatan, Indonesia",…
## $ gaji         <chr> "Salary Undisclosed", "Salary Undisclosed", "Salary Undis…
## $ jenis        <chr> "FULL_TIME", "FULL_TIME", "FULL_TIME", "FULL_TIME", "FULL…
## $ batas        <chr> "Apply before 29 Jun", "Apply before 29 Apr", "Apply befo…
## $ level        <chr> "Entry Level / Junior, Apprentice", "Entry Level / Junior…
# Filter the rows whose posisi value is "Social Media Admin"
data_filter <- dataku[dataku$posisi == "Social Media Admin", ]
head(data_filter)
##             time_scraped             posisi perusahaan
## 2    2024-05-31 15:05:29 Social Media Admin  SKINTIFIC
## 32   2024-06-03 08:56:29 Social Media Admin  SKINTIFIC
## 77   2024-06-07 08:57:12 Social Media Admin  SKINTIFIC
## 117  2024-06-11 08:57:10 Social Media Admin  SKINTIFIC
## NA                  <NA>               <NA>       <NA>
## NA.1                <NA>               <NA>       <NA>
##                          lokasi               gaji     jenis
## 2    Jakarta Selatan, Indonesia Salary Undisclosed FULL_TIME
## 32   Jakarta Selatan, Indonesia Salary Undisclosed FULL_TIME
## 77   Jakarta Selatan, Indonesia Salary Undisclosed FULL_TIME
## 117  Jakarta Selatan, Indonesia Salary Undisclosed FULL_TIME
## NA                         <NA>               <NA>      <NA>
## NA.1                       <NA>               <NA>      <NA>
##                    batas                            level
## 2    Apply before 29 Apr Entry Level / Junior, Apprentice
## 32   Apply before 29 Apr Entry Level / Junior, Apprentice
## 77   Apply before 29 Apr Entry Level / Junior, Apprentice
## 117  Apply before 29 Apr Entry Level / Junior, Apprentice
## NA                  <NA>                             <NA>
## NA.1                <NA>                             <NA>
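The all-NA rows in the output above are a side effect of logical subsetting: when `posisi` is missing, the comparison returns NA, and indexing a data frame with NA produces a row of NAs. The toy data below (not the scraped data) reproduces this and shows one way to avoid it by wrapping the condition in `which()`.

```r
# Toy data with one missing job title
jobs <- data.frame(
  posisi     = c("Social Media Admin", NA, "Teller"),
  perusahaan = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# NA in the logical index yields an extra row filled with NA
with_na <- jobs[jobs$posisi == "Social Media Admin", ]

# which() keeps only the positions where the condition is TRUE
no_na <- jobs[which(jobs$posisi == "Social Media Admin"), ]
```

`dplyr::filter(jobs, posisi == "Social Media Admin")` behaves like the `which()` version, because `filter()` drops rows where the condition is NA.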

3.1 Removing Duplicate Records

The scraping results contain duplicate records, as the repeated "Social Media Admin" rows above show, so the duplicates are removed before visualization.

# Count the number of entries before removing duplicates
initial_count <- nrow(dataku)

# Remove duplicate entries based on the 'posisi' column
data_clean <- dataku %>% distinct(posisi, .keep_all = TRUE)

# Count the number of entries after removing duplicates
clean_count <- nrow(data_clean)
# Create a data frame for visualization
count_data <- data.frame(
  Condition = c("Before Cleaning", "After Cleaning"),
  Count = c(initial_count, clean_count)
)
# Create a bar plot to visualize the number of entries before and after removing duplicates
ggplot(count_data, aes(x = Condition, y = Count, fill = Condition)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Number of Entries Before and After Removing Duplicates",
       x = "Condition",
       y = "Number of Entries") +
  theme_minimal()
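Note that `distinct(posisi, .keep_all = TRUE)` keeps one row per job title, so postings with the same title from different companies collapse into a single row. If that is not intended, a possible variant is to deduplicate on several columns at once. The sketch below uses toy data, and the choice of `posisi` plus `perusahaan` as the key is an assumption.

```r
library(dplyr)

# Toy data: the same title posted twice by company A and once by company B
toy <- data.frame(
  posisi     = c("Admin", "Admin", "Admin", "Teller"),
  perusahaan = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Deduplicating on the title alone merges the postings of A and B
by_title <- distinct(toy, posisi, .keep_all = TRUE)

# Deduplicating on title and company keeps them apart
by_title_company <- distinct(toy, posisi, perusahaan, .keep_all = TRUE)
```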

3.2 Visualizing the Number of Jobs by Location

# Bars are ordered from the most to the least frequent location
ggplot(data_clean, aes(x = reorder(lokasi, lokasi, function(x) -length(x)))) +
  geom_bar(aes(fill = lokasi), color = "black", show.legend = FALSE) +
  labs(title = "Number of Jobs by Location",
       x = "Location",
       y = "Number of Jobs",
       caption = "Source: Web scraping results") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.title = element_text(size = 12, face = "bold"),
    plot.caption = element_text(hjust = 0)
  ) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5)

3.3 Visualizing the Number of Jobs by Company

# Bars are ordered from the most to the least frequent company
ggplot(data_clean, aes(x = reorder(perusahaan, perusahaan, function(x) -length(x)))) +
  geom_bar(aes(fill = perusahaan), color = "black", show.legend = FALSE) +
  labs(title = "Number of Jobs by Company",
       x = "Company",
       y = "Number of Jobs",
       caption = "Source: Web scraping results") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.title = element_text(size = 12, face = "bold"),
    plot.caption = element_text(hjust = 0)
  ) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5)
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
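The warning above appears because the Set2 palette defines only 8 colours while the chart has more companies than that, so the extra bars receive no palette colour. One possible workaround, shown here as a sketch and not as the report's method, is to interpolate the palette to the required number of colours with `colorRampPalette()`. The value of `n_perusahaan` is a hypothetical stand-in.

```r
library(RColorBrewer)

# Hypothetical number of distinct companies; in the report this would be
# length(unique(data_clean$perusahaan))
n_perusahaan <- 12

# Interpolate the 8 Set2 colours into as many colours as needed
palet <- colorRampPalette(brewer.pal(8, "Set2"))(n_perusahaan)
```

The resulting vector can be passed to `scale_fill_manual(values = palet)` in place of `scale_fill_brewer(palette = "Set2")`.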

3.4 Visualizing the Number of Jobs by Level

# Bars are ordered from the most to the least frequent level
ggplot(data_clean, aes(x = reorder(level, level, function(x) -length(x)))) +
  geom_bar(aes(fill = level), color = "black", show.legend = FALSE) +
  labs(title = "Number of Jobs by Level",
       x = "Level",
       y = "Number of Jobs",
       caption = "Source: Web scraping results") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.title = element_text(size = 12, face = "bold"),
    plot.caption = element_text(hjust = 0)
  ) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5)

3.5 Visualizing the Distribution of Jobs by Employment Type

# Count jobs by employment type and convert the counts to percentages
jenis_count <- table(data_clean$jenis)
jenis_df <- data.frame(
  jenis  = names(jenis_count),
  jumlah = as.vector(jenis_count),
  persen = as.vector(prop.table(jenis_count)) * 100
)

# Build the pie chart from a plain data frame, so ggplot2 does not have to
# guess a scale for a <table> object
pie_chart <- ggplot(jenis_df, aes(x = "", y = jumlah, fill = jenis)) +
  geom_col(width = 1, color = "white") +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Jobs by Employment Type",
       fill = "Employment Type",
       caption = "Source: Web scraping results") +
  theme_void() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    plot.caption = element_text(hjust = 0),
    legend.position = "bottom"
  ) +
  geom_text(aes(label = paste0(round(persen, 1), "%")),
            position = position_stack(vjust = 0.5), color = "white")

print(pie_chart)

3.6 Visualizing Application Deadlines

ggplot(data_clean, aes(x = batas)) +
  geom_bar(aes(fill = batas), color = "black", show.legend = FALSE) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(title = "Application Deadlines",
       x = "Deadline",
       y = "Number of Jobs",
       caption = "Source: Web scraping results") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.title = element_text(size = 12, face = "bold"),
    plot.caption = element_text(hjust = 0)
  ) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, color = "black") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Pastel1 is 9
## Returning the palette you asked for with that many colors
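Because `batas` is stored as text such as "Apply before 29 Jun", the bars in this chart are ordered alphabetically, not chronologically. A sketch of converting the text to dates follows. The scraped text carries no year, so the year 2024 used here is an assumption, and the input vector is toy data in the same format as the scraped column.

```r
# Toy values in the same format as the scraped batas column
batas <- c("Apply before 29 Jun", "Apply before 5 Jul", "Apply before 29 Apr")

# Extract the day and the three-letter month from the end of each string
parts <- regmatches(batas, regexec("([0-9]{1,2}) ([A-Za-z]{3})$", batas))
hari  <- as.integer(vapply(parts, `[`, character(1), 2))

# month.abb is R's built-in vector of English month abbreviations,
# so this lookup does not depend on the system locale
bulan <- match(vapply(parts, `[`, character(1), 3), month.abb)

# Assumed year: the scraped text does not say which year the deadline is in
tahun <- 2024L
batas_tanggal <- as.Date(sprintf("%04d-%02d-%02d", tahun, bulan, hari))
```

A date column like `batas_tanggal` could then be mapped to the x axis so that the deadlines appear in calendar order.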

3.7 Visualizing the Number of Jobs by Position

# Bars are ordered from the most to the least frequent position
ggplot(data_clean, aes(x = reorder(posisi, posisi, function(x) -length(x)))) +
  geom_bar(aes(fill = posisi), color = "black", show.legend = FALSE) +
  labs(title = "Number of Jobs by Position",
       x = "Position",
       y = "Number of Jobs",
       caption = "Source: Web scraping results") +
  scale_fill_brewer(palette = "Pastel1") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.title = element_text(size = 12, face = "bold"),
    plot.caption = element_text(hjust = 0)
  ) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5)
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Pastel1 is 9
## Returning the palette you asked for with that many colors

3.8 Visualizing the Distribution of Jobs by Salary Range

# Count jobs by salary range and convert the counts to percentages
gaji_count <- table(data_clean$gaji)
gaji_df <- data.frame(
  gaji   = names(gaji_count),
  jumlah = as.vector(gaji_count),
  persen = as.vector(prop.table(gaji_count)) * 100
)

# Build the pie chart from a plain data frame, so ggplot2 does not have to
# guess a scale for a <table> object
pie_chart <- ggplot(gaji_df, aes(x = "", y = jumlah, fill = gaji)) +
  geom_col(width = 1, color = "white") +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Jobs by Salary Range",
       fill = "Salary Range",
       caption = "Source: Web scraping results") +
  theme_void() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    plot.caption = element_text(hjust = 0),
    legend.position = "bottom"
  ) +
  geom_text(aes(label = paste0(round(persen, 1), "%")),
            position = position_stack(vjust = 0.5), color = "white")

print(pie_chart)