Import Library

In this exploration we use a single package, tidyverse, which applies the tidy concept to data transformation. tidyverse bundles the packages commonly used for data transformation: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.

library(tidyverse)

Import data

This time we use a dataset from #tidytuesday, “Women in the workplace”, sourced from the Census Bureau and the Bureau of Labor.

workers <- read_csv("jobs_gender.csv")
workers

The variables in the jobs_gender.csv dataset are described below:
- year: Year
- occupation: Specific job/career
- major_category: Broad category of occupation
- minor_category: Fine category of occupation
- total_workers: Total estimated full-time workers > 16 years old
- workers_male: Estimated MALE full-time workers > 16 years old
- workers_female: Estimated FEMALE full-time workers > 16 years old
- percent_female: The percent of females for specific occupation
- total_earnings: Total estimated median earnings for full-time workers > 16 years old
- total_earnings_male: Estimated MALE median earnings for full-time workers > 16 years old
- total_earnings_female: Estimated FEMALE median earnings for full-time workers > 16 years old
- wage_percent_of_male: Female wages as percent of male wages; NA for occupations with small sample size

Business Question

- From this dataset, we want to see the earnings gap between male and female workers for each major category, especially Computer, Engineering, and Science.
- We want to compare the number of male and female workers (workers male/female) for each major category.
- We want to see how total earnings for male and female workers are distributed.

Tidy Data

Check NA (Missing Values)

workers %>% 
  is.na() %>% 
  colSums()

##                  year            occupation        major_category 
##                     0                     0                     0 
##        minor_category         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female  wage_percent_of_male 
##                     4                    65                   846

# the same check written without the pipe
colSums(is.na(workers))

##                  year            occupation        major_category 
##                     0                     0                     0 
##        minor_category         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female  wage_percent_of_male 
##                     4                    65                   846

Handling NA

There are several ways to handle NA:
- Drop the rows that contain NA when they make up < 5% of the data
- Imputation (fill NA with a value): for numeric data use the mean/median, for categorical data use the mode
- Drop variables that contain many NA
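The options above can be sketched on a small made-up data frame. The `toy` table and its columns are hypothetical and only stand in for the real dataset; the sketch computes the share of NA per column (to compare against the 5% rule of thumb), then imputes a numeric column with the median and a categorical column with the mode.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  earnings = c(100, NA, 300, 400),
  category = c("a", "b", NA, "b"),
  stringsAsFactors = FALSE
)

# Share of NA per column, to compare against the 5% rule of thumb.
na_share <- colMeans(is.na(toy))

# Numeric column: fill NA with the median of the observed values.
toy$earnings[is.na(toy$earnings)] <- median(toy$earnings, na.rm = TRUE)

# Categorical column: fill NA with the mode (most frequent observed value).
# table() ignores NA by default, so the mode is taken over observed values.
mode_value <- names(which.max(table(toy$category)))
toy$category[is.na(toy$category)] <- mode_value

na_share
toy
```

Base R is used here so the sketch runs without extra packages; `replace_na()` from tidyr does the same filling step inside a pipeline.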

Drop NA

Rows with NA in the total_earnings_female and total_earnings_male columns will be dropped.

workers <- workers %>% 
  drop_na(total_earnings_male, total_earnings_female)

Check the NA counts again

workers %>% 
  is.na() %>% 
  colSums()

##                  year            occupation        major_category 
##                     0                     0                     0 
##        minor_category         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female  wage_percent_of_male 
##                     0                     0                   777

Example of NA imputation

# fill NA in wage_percent_of_male with the mean of the observed values
workers_tanpa_na <- workers %>% 
  mutate(wage_percent_of_male = 
           replace_na(wage_percent_of_male, 
                      replace = mean(wage_percent_of_male, 
                                     na.rm = TRUE))) 

workers_tanpa_na %>% 
  is.na() %>% 
  colSums()

##                  year            occupation        major_category 
##                     0                     0                     0 
##        minor_category         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female  wage_percent_of_male 
##                     0                     0                     0

# drop wage_percent_of_male, which still contains 777 NA
workers_clean <- select(workers, -wage_percent_of_male)
workers_clean

# another way to select columns, using the pipe operator
workers %>% 
  select(-wage_percent_of_male)

Data Transformation 1

From this dataset, we want to see the earnings gap between male and female workers for each major category, especially Computer, Engineering, and Science, for 2016.

Pipe shortcut: Ctrl + Shift + M

data_agg <- workers_clean %>% 
  filter(year == 2016) %>% 
  mutate(gap_earnings = total_earnings_male - total_earnings_female) %>% 
  select(major_category, gap_earnings) %>% 
  group_by(major_category) %>% 
  summarise(mean_gap = round(mean(gap_earnings), 2))
data_agg
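To make the aggregation easier to follow, here is a minimal sketch of the same steps on a made-up table. The `toy` data, the category labels "A" and "B", and all earnings values are hypothetical; only the column names mirror the dataset.

```r
library(dplyr)

# Hypothetical toy data; one 2015 row is included to show the year filter.
toy <- tibble(
  year = c(2016, 2016, 2016, 2015),
  major_category = c("A", "A", "B", "A"),
  total_earnings_male = c(50000, 60000, 40000, 55000),
  total_earnings_female = c(45000, 50000, 38000, 50000)
)

toy_gap <- toy %>%
  filter(year == 2016) %>%                                   # keep 2016 only
  mutate(gap_earnings = total_earnings_male - total_earnings_female) %>%
  group_by(major_category) %>%                               # one group per category
  summarise(mean_gap = round(mean(gap_earnings), 2))         # average gap per group

toy_gap
```

A positive `mean_gap` means that, on average within that category, male median earnings exceed female median earnings.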

Visualization 1

Create the Algoritma theme to brand the visualizations

theme_algoritma <- theme(legend.key = element_rect(fill="black"),
           legend.background = element_rect(color="white", fill="#263238"),
           plot.subtitle = element_text(size=6, color="white"),
           panel.background = element_rect(fill="#dddddd"),
           panel.border = element_rect(fill=NA),
           panel.grid.minor.x = element_blank(),
           panel.grid.major.x = element_blank(),
           panel.grid.major.y = element_line(color="darkgrey", linetype=2),
           panel.grid.minor.y = element_blank(),
           plot.background = element_rect(fill="#263238"),
           text = element_text(color="white"),
           axis.text = element_text(color="white")
           
           )

Plot 1

library(glue) # pastes text together with variable values, used here for the tooltip
plot1 <- ggplot(data = data_agg, aes(x = reorder(major_category, 
                                        mean_gap), 
                                     y = mean_gap,
                                     text = glue("Major Category:{major_category} <br> Mean Gap : {mean_gap}") # or use paste()
                                     # text = paste("Major Category:", major_category, "<br>", "Mean Gap :" , mean_gap)
                                     )) +
  geom_col(fill = "dodgerblue4") +
  geom_col(data = filter(data_agg, 
                         major_category == "Computer, Engineering, and Science"), 
           fill = "firebrick4") +
  labs(x = NULL,
       y = NULL,
       title = "Gap earnings on male and female") +
  coord_flip() +
  theme_algoritma
plot1
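The tooltip text can be built with either glue() or paste(). The sketch below compares the two on hypothetical values (`major_category` and `mean_gap` here are stand-alone variables, not the columns of `data_agg`) and uses paste0() so that no extra spaces are inserted between the pieces.

```r
library(glue)

# Hypothetical values standing in for one row of the aggregated data.
major_category <- "Example category"
mean_gap <- 1234.5

# glue() interpolates the variables written inside curly braces.
tooltip_glue <- glue("Major Category: {major_category}<br>Mean Gap: {mean_gap}")

# paste0() concatenates the same pieces without a separator.
tooltip_paste <- paste0("Major Category: ", major_category,
                        "<br>", "Mean Gap: ", mean_gap)

# Both approaches should yield the same string.
same_text <- as.character(tooltip_glue) == tooltip_paste
same_text
```

Note that paste() (as opposed to paste0()) separates its arguments with a space by default, so the two forms only match exactly when the separators are accounted for.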

Interactive Visualization

library(plotly)
plot1_inter <- ggplotly(plot1, tooltip = "text")
plot1_inter

Data Transformation 2

workers_long <- pivot_longer(data = workers, 
                             cols = c(workers_male, workers_female),
                             names_to = "workers_gender", 
                             values_to = "value" )
workers_vis <- workers_long %>% 
  group_by(major_category, workers_gender) %>% 
  summarise(mean_value = round(mean(value),2)) %>% 
  ungroup() %>% 
  mutate(workers_gender = case_when(workers_gender == "workers_female" ~ "Female",
                                    TRUE ~ "Male"))
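As a minimal sketch of what pivot_longer() does in the step above, the example below reshapes a made-up two-row table. The `toy_wide` data and its values are hypothetical; the column names mirror the dataset.

```r
library(tidyr)

# Hypothetical wide table: one row per category, one column per gender.
toy_wide <- data.frame(
  major_category = c("A", "B"),
  workers_male = c(10, 20),
  workers_female = c(30, 40)
)

# Stack the two worker columns: the old column names go into
# `workers_gender` and the counts go into `value`.
toy_long <- pivot_longer(
  data = toy_wide,
  cols = c(workers_male, workers_female),
  names_to = "workers_gender",
  values_to = "value"
)

toy_long
```

Each input row becomes two output rows, one per gender, which is the shape a grouped summary and a `fill = workers_gender` bar chart expect.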

Visualization 2

options(scipen = 999)
plot2 <- ggplot(workers_vis, aes(x = reorder(major_category, 
                                             mean_value), 
                                 y = mean_value,
                                 text = glue("Gender: {workers_gender}
                                             Mean of workers: {mean_value}"))) +
  geom_col(aes(fill = workers_gender), position = "dodge") +
  coord_flip() +
  labs(x = NULL,
       y = NULL,
       title = "Composition of female and male workers") +
  theme(legend.position = "none") +
  theme_algoritma

Interactive Visualization

ggplotly(plot2, tooltip = "text")

Visualization 3

We want to see how total earnings for male and female workers are distributed

plot3 <- ggplot(workers_clean, aes(x = total_earnings_male, 
                          y = total_earnings_female)) +
  geom_point(aes(col = major_category)) +
  geom_smooth(method = "auto") +
  labs(x = "Total Earnings Male",
       y = "Total Earnings Female",
       title = "Distribution plot of earnings") +
  scale_color_brewer(palette = "Set3") +
  theme(legend.position = "none") +
  theme_algoritma
plot3

Interactive Visualization

plot3_inter <- ggplotly(plot3)
plot3_inter

Communicate

Static Plot

library(ggpubr)

Arrange the plots with ggarrange()

plot_arrange <- ggarrange(plot1, plot3, nrow = 2)
plot_arrange

Export the plot to .png, .jpg, or .pdf format

ggexport(plot_arrange, filename = "plot_jupiter.jpg")

Interactive plot

Subplot

subplot(plot1_inter, plot3_inter, nrows = 2, shareY = FALSE)

Flex Dashboard

library(flexdashboard)

Create a dashboard from the flexdashboard template using Rmd.
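One way to start from the template without going through the RStudio menu is rmarkdown::draft(). This is a minimal sketch, assuming the rmarkdown and flexdashboard packages are installed; the file name dashboard.Rmd and the temporary directory are illustrative choices, not part of the original workflow.

```r
library(rmarkdown)

# Write into a temporary directory so the sketch leaves no files behind.
out_file <- file.path(tempdir(), "dashboard.Rmd")

# Create a new Rmd file from the template shipped with flexdashboard.
draft(out_file,
      template = "flex_dashboard",
      package = "flexdashboard",
      edit = FALSE)  # do not open the file in an editor

# Knit the dashboard to HTML (uncomment to render).
# render(out_file)
```

The generated file can then be edited to hold the plots built earlier, one per dashboard panel.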

More about flex:
- https://rmarkdown.rstudio.com/flexdashboard/index.html
- https://rmarkdown.rstudio.com/flexdashboard/using.html#storyboards

Homework:
1. Demo reading data from file types other than csv, managing input data, and reading data from SQL
2. paste() for a custom tooltip

Jobs Gender Exploration

Sitta

1/27/2020
