Import Library

This exploration uses the tidyverse, a collection of packages built around the concept of tidy data for data transformation. Its core packages, all commonly used in data work, are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.
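Loading the tidyverse attaches its core packages in a single call. A minimal sketch, assuming the package has already been installed:

```r
# Assumes install.packages("tidyverse") was run once beforehand
library(tidyverse)
```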

Import data

This time we use a dataset from #tidytuesday, “Women in the workplace”, sourced from the Census Bureau and the Bureau of Labor Statistics.

The variables in the jobs_gender.csv dataset are described below:
- year: Year
- occupation: Specific job/career
- major_category: Broad category of occupation
- minor_category: Fine category of occupation
- total_workers: Total estimated full-time workers > 16 years old
- workers_male: Estimated MALE full-time workers > 16 years old
- workers_female: Estimated FEMALE full-time workers > 16 years old
- percent_female: The percent of females for specific occupation
- total_earnings: Total estimated median earnings for full-time workers > 16 years old
- total_earnings_male: Estimated MALE median earnings for full-time workers > 16 years old
- total_earnings_female: Estimated FEMALE median earnings for full-time workers > 16 years old
- wage_percent_of_male: Female wages as percent of male wages (NA for occupations with small sample size)
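A minimal sketch of reading a CSV with readr::read_csv(). The real jobs_gender.csv is not bundled here, so the example writes a tiny stand-in file with made-up values to a temporary path (toy_csv and its rows are hypothetical). With the real file, only the path passed to read_csv() would change.

```r
library(readr)

# Hypothetical stand-in for jobs_gender.csv: two made-up rows in a temp file
toy_csv <- tempfile(fileext = ".csv")
writeLines(
  c(
    "year,occupation,major_category,total_earnings_male,total_earnings_female",
    "2016,Occupation A,Category X,50000,45000",
    "2016,Occupation B,Category Y,,30000"
  ),
  toy_csv
)

# Empty fields are parsed as NA by default
jobs_gender <- read_csv(toy_csv, show_col_types = FALSE)
names(jobs_gender)
```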

Business Question

  1. From the dataset, we want to see the earnings gap between male and female workers in each major category, especially Computer, Engineering, and Science.
  2. We want to compare the number of male and female workers (workers_male / workers_female) in each major category.
  3. We want to see the distribution of total earnings for male and female workers.

Tidy Data

Check NA (Missing Values)
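A per-column count of missing values, like the one printed in this section, can be obtained with colSums(is.na(...)). The sketch below runs on a made-up table (toy is hypothetical); the printed counts in this section come from the full dataset, not from this toy table.

```r
library(tibble)

# Toy data with made-up values; the real dataset has more rows and columns
toy <- tibble(
  occupation            = c("A", "B", "C", "D"),
  total_earnings_male   = c(50000, NA, 42000, 61000),
  total_earnings_female = c(45000, 30000, NA, NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
na_counts
```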

##                  year            occupation        major_category 
##                     0                     0                     0 
##        minor_category         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female  wage_percent_of_male 
##                     4                    65                   846

Handling NA

There are several ways to handle NA:
- Drop the rows that contain NA when they make up less than 5% of the data
- Imputation (fill NA with a value): for numeric data use the mean or median, for categorical data use the mode
- Drop variables that contain many NA values
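The 5% rule of thumb can be checked by computing the share of missing values in a column. A small sketch in base R, using a made-up vector (earnings is hypothetical):

```r
# Toy column with made-up values: 1 missing out of 25 observations
earnings <- c(rep(40000, 24), NA)

# The mean of a logical vector is the share of TRUE values
na_share <- mean(is.na(earnings))
na_share

# Dropping rows is reasonable under this rule when fewer than 5% are missing
drop_rows <- na_share < 0.05
drop_rows
```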

Drop NA

Rows with NA in the total_earnings_female and total_earnings_male columns are dropped.
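One way to do this is tidyr::drop_na(), which removes rows with NA only in the columns named. A minimal sketch on made-up data (toy is hypothetical); whether the original analysis used this exact call is an assumption.

```r
library(tibble)
library(tidyr)

# Toy data with made-up values
toy <- tibble(
  occupation            = c("A", "B", "C", "D"),
  total_earnings_male   = c(50000, NA, 42000, 61000),
  total_earnings_female = c(45000, 30000, NA, 58000),
  wage_percent_of_male  = c(90, NA, NA, NA)
)

# Drop rows with NA in either earnings column; NA elsewhere is left alone
toy_clean <- drop_na(toy, total_earnings_male, total_earnings_female)
colSums(is.na(toy_clean))
```

Rows B and C are removed, while row D stays even though its wage_percent_of_male is still missing. This matches the pattern in the real data, where wage_percent_of_male keeps some NA values after the drop.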

Check the NA counts again

##                  year            occupation        major_category 
##                     0                     0                     0 
##        minor_category         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female  wage_percent_of_male 
##                     0                     0                   777

Example of NA imputation
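A sketch of numeric imputation with tidyr::replace_na(), filling the missing values with the column median. The data are made up (toy is hypothetical), and the choice of the median rather than the mean is an assumption, since the source does not say which value was used.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy data with made-up values
toy <- tibble(
  occupation           = c("A", "B", "C", "D", "E"),
  wage_percent_of_male = c(80, NA, 90, NA, 100)
)

# Replace NA with the median of the observed values
toy_imputed <- toy %>%
  mutate(
    wage_percent_of_male = replace_na(
      wage_percent_of_male,
      median(wage_percent_of_male, na.rm = TRUE)
    )
  )

sum(is.na(toy_imputed$wage_percent_of_male))
```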

##                  year            occupation        major_category 
##                     0                     0                     0 
##        minor_category         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female  wage_percent_of_male 
##                     0                     0                     0

Data Transformation 1

From the dataset, we want to see the earnings gap between male and female workers in each major category, especially Computer, Engineering, and Science, for the year 2016.

Shortcut for the pipe operator (%>%) in RStudio: Ctrl + Shift + M
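The transformation can be sketched as a dplyr pipeline: filter to 2016, group by major category, average the earnings, and subtract. The data are made up (toy is hypothetical), and averaging the per-occupation median earnings within a category is an assumption about how the gap is aggregated.

```r
library(tibble)
library(dplyr)

# Toy data with made-up values
toy <- tibble(
  year = c(2016, 2016, 2016, 2015),
  major_category = c(
    "Computer, Engineering, and Science",
    "Computer, Engineering, and Science",
    "Service",
    "Service"
  ),
  total_earnings_male   = c(80000, 60000, 30000, 31000),
  total_earnings_female = c(70000, 50000, 28000, 27000)
)

gap_2016 <- toy %>%
  filter(year == 2016) %>%                      # keep 2016 only
  group_by(major_category) %>%
  summarise(
    mean_male   = mean(total_earnings_male),
    mean_female = mean(total_earnings_female),
    .groups = "drop"
  ) %>%
  mutate(gap_earnings = mean_male - mean_female) %>%  # male minus female
  arrange(desc(gap_earnings))

gap_2016
```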

Visualization 1

Create an Algoritma theme for branded visualizations
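A reusable theme can be written as a function that starts from a built-in ggplot2 theme and overrides a few elements. This is only a sketch: theme_algoritma() is a hypothetical name, and the colours and settings are placeholders, not the actual brand styling.

```r
library(ggplot2)

# Hypothetical branded theme; colours and sizes are placeholders
theme_algoritma <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title       = element_text(face = "bold", colour = "#2b2b2b"),
      panel.grid.minor = element_blank(),
      legend.position  = "bottom"
    )
}

# Apply it to any ggplot object (mtcars ships with R)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Theme demo") +
  theme_algoritma()
```

Wrapping the theme in a function keeps every plot in the report consistent and lets the base font size be adjusted per plot.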

Plot 1
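The source does not show which chart was drawn here. One plausible sketch is a horizontal bar chart of the earnings gap per major category, built on made-up numbers (gap_toy and its values are hypothetical):

```r
library(tibble)
library(ggplot2)

# Toy summary table with made-up values
gap_toy <- tibble(
  major_category = c("Category X", "Category Y", "Category Z"),
  gap_earnings   = c(12000, 8000, 3000)
)

# reorder() sorts the bars by the size of the gap
plot_gap <- ggplot(
  gap_toy,
  aes(x = gap_earnings, y = reorder(major_category, gap_earnings))
) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Earnings gap by major category (toy data)",
    x = "Earnings gap (male - female)",
    y = NULL
  ) +
  theme_minimal()
```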

Interactive Visualization
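A static ggplot can be made interactive with plotly::ggplotly(). The sketch below also builds a custom tooltip with paste() through the text aesthetic. The data are made up (gap_toy is hypothetical), and the tooltip wording is only an illustration.

```r
library(tibble)
library(ggplot2)
library(plotly)

# Toy summary table with made-up values
gap_toy <- tibble(
  major_category = c("Category X", "Category Y", "Category Z"),
  gap_earnings   = c(12000, 8000, 3000)
)

# The text aesthetic holds the tooltip; <br> starts a new line in plotly
p <- ggplot(
  gap_toy,
  aes(
    x = gap_earnings,
    y = reorder(major_category, gap_earnings),
    text = paste("Category:", major_category, "<br>Gap:", gap_earnings)
  )
) +
  geom_col()

# tooltip = "text" shows only the custom text on hover
p_interactive <- ggplotly(p, tooltip = "text")
```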

Visualization 2

Interactive Visualization

Interactive Visualization

Communicate

Static Plot

Arrange the plots with ggarrange()
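A minimal sketch of ggpubr::ggarrange() placing two static plots side by side. The plots use the built-in mtcars data as stand-ins for the plots of this analysis.

```r
library(ggplot2)
library(ggpubr)

# Stand-in plots built from mtcars, which ships with R
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# One row, two columns; labels tag each panel
arranged <- ggarrange(p1, p2, ncol = 2, nrow = 1, labels = c("A", "B"))
```

Swapping ncol and nrow stacks the plots vertically instead.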

Export the plot to .png, .jpg, or .pdf format

## function (..., plotlist = NULL, filename = NULL, ncol = NULL, 
##     nrow = NULL, width = 480, height = 480, pointsize = 12, res = NA, 
##     verbose = TRUE) 
## {
##     if (is.null(filename)) 
##         filename <- .collapse(.random_string(), ".pdf", sep = "")
##     file.ext <- .file_ext(filename)
##     dev <- .device(filename)
##     dev.opts <- list(file = filename)
##     if (file.ext %in% c("ps", "eps")) 
##         dev.opts <- dev.opts %>% .add_item(onefile = FALSE, horizontal = FALSE)
##     else if (file.ext %in% c("png", "jpeg", "jpg", "bmp", "tiff")) 
##         dev.opts <- dev.opts %>% .add_item(width = width, height = height, 
##             pointsize = pointsize, res = res)
##     if (file.ext %in% c("pdf")) {
##         if (!missing(width)) 
##             dev.opts <- dev.opts %>% .add_item(width = width)
##         if (!missing(height)) 
##             dev.opts <- dev.opts %>% .add_item(height = height)
##         if (!missing(pointsize)) 
##             dev.opts <- dev.opts %>% .add_item(pointsize = pointsize)
##     }
##     plots <- c(list(...), plotlist)
##     nb.plots <- length(plots)
##     if (nb.plots == 1) 
##         plots <- plots[[1]]
##     else if (!is.null(ncol) | !is.null(nrow)) {
##         plots <- ggarrange(plotlist = plots, ncol = ncol, nrow = nrow)
##     }
##     if (inherits(plots, "ggarrange") & .is_list(plots)) 
##         nb.plots <- length(plots)
##     if (nb.plots > 1 & file.ext %in% c("eps", "ps", "png", "jpeg", 
##         "jpg", "tiff", "bmp", "svg")) {
##         filename <- gsub(paste0(".", file.ext), paste0("%03d.", 
##             file.ext), filename)
##         dev.opts$file <- filename
##         print(filename)
##     }
##     do.call(dev, dev.opts)
##     utils::capture.output(print(plots))
##     utils::capture.output(grDevices::dev.off())
##     message("file saved to ", filename)
## }
## <bytecode: 0x000000001ece5270>
## <environment: namespace:ggpubr>
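The source printed above is ggpubr::ggexport(), which picks the graphics device from the file extension. A minimal sketch of calling it, with a stand-in plot and a temporary output path (out_file is hypothetical):

```r
library(ggplot2)
library(ggpubr)

# Stand-in plot built from mtcars
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# The extension decides the device; .pdf is used here
out_file <- tempfile(fileext = ".pdf")
ggexport(p, filename = out_file)

file.exists(out_file)
```

Changing the extension to .png or .jpg switches to the matching device, where width, height, and res apply as shown in the function source.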

Interactive plot

Subplot
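For interactive plots, plotly::subplot() plays the role that ggarrange() plays for static ones. A minimal sketch with two stand-in plots from mtcars:

```r
library(ggplot2)
library(plotly)

# Convert two stand-in ggplots to plotly objects
p1 <- ggplotly(ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point())
p2 <- ggplotly(ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point())

# Place them in one row with a shared y axis and keep the x axis titles
combined <- subplot(p1, p2, nrows = 1, shareY = TRUE, titleX = TRUE)
```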

Flex Dashboard

Create a dashboard from the flexdashboard template using Rmd.

More about flexdashboard:
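Besides the RStudio menu, the template can be created from the console with rmarkdown::draft(). A sketch, assuming the flexdashboard package is installed; the output path is a hypothetical temporary file.

```r
library(rmarkdown)

# Hypothetical output path in the session's temporary directory
dashboard_file <- tempfile(pattern = "dashboard_", fileext = ".Rmd")

# Copy the flexdashboard template to that path without opening an editor
draft(
  dashboard_file,
  template = "flex_dashboard",
  package  = "flexdashboard",
  edit     = FALSE
)

file.exists(dashboard_file)
```

The resulting Rmd file can then be edited and knitted like any other R Markdown document.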
- https://rmarkdown.rstudio.com/flexdashboard/index.html

Homework:
  1. Demo reading data from formats other than CSV, managing input data, and reading data from SQL
  2. Use paste() to build a custom tooltip