This exploration uses tidyverse, a collection of packages built around the tidy concept for data transformation. The packages within tidyverse that are commonly used for data transformation are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.
The dataset comes from #tidytuesday: “Women in the workplace”, sourced from the Census Bureau and the Bureau of Labor.
The variables in the jobs_gender.csv dataset are described below:
- year: Year
- occupation: Specific job/career
- major_category: Broad category of occupation
- minor_category: Fine category of occupation
- total_workers: Total estimated full-time workers > 16 years old
- workers_male: Estimated MALE full-time workers > 16 years old
- workers_female: Estimated FEMALE full-time workers > 16 years old
- percent_female: The percent of females for specific occupation
- total_earnings: Total estimated median earnings for full-time workers > 16 years old
- total_earnings_male: Estimated MALE median earnings for full-time workers > 16 years old
- total_earnings_female: Estimated FEMALE median earnings for full-time workers > 16 years old
- wage_percent_of_male: Female wages as percent of male wages (NA for occupations with small sample size)
## year occupation major_category
## 0 0 0
## minor_category total_workers workers_male
## 0 0 0
## workers_female percent_female total_earnings
## 0 0 0
## total_earnings_male total_earnings_female wage_percent_of_male
## 4 65 846
There are several ways to handle NA values:
- Drop the rows that contain NA when they make up less than 5% of the data
- Imputation (fill each NA with a value): use the mean or median for numeric data and the mode for categorical data
- Drop the variables that contain many NA values
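To decide between dropping and imputing, it helps to measure how many rows are affected. The snippet below is a minimal sketch using a small made-up data frame (the name `toy` and its values are assumptions, not the real dataset); it computes the share of rows with at least one NA and compares it with the 5% rule of thumb.

```r
# Toy data standing in for the real dataset (values are made up)
toy <- data.frame(
  occupation      = c("a", "b", "c", "d", "e"),
  earnings_male   = c(100, NA, 120, 90, 110),
  earnings_female = c(80, 70, NA, 85, 95)
)

# TRUE for every row that has at least one NA
rows_with_na <- !complete.cases(toy)

# Share of affected rows, compared with the 5% rule of thumb
share_na <- mean(rows_with_na)
share_na
share_na < 0.05
```

In this toy example two of five rows contain NA, so dropping them would discard far more than 5% of the data.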
We drop the rows with NA in the total_earnings_female and total_earnings_male columns.
Check the NA counts again:
## year occupation major_category
## 0 0 0
## minor_category total_workers workers_male
## 0 0 0
## workers_female percent_female total_earnings
## 0 0 0
## total_earnings_male total_earnings_female wage_percent_of_male
## 0 0 777
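The code for the dropping step is not shown here. One possible way to do it, sketched on made-up data, is `drop_na()` from tidyr restricted to the two earnings columns; the data frame `toy` and its values are assumptions for illustration only.

```r
library(tidyr)

# Toy data with the same column names as the dataset (values are made up)
toy <- data.frame(
  occupation            = c("a", "b", "c", "d"),
  total_earnings_male   = c(100, NA, 120, 90),
  total_earnings_female = c(80, 70, NA, 85),
  wage_percent_of_male  = c(80, NA, NA, NA)
)

# Drop only the rows where either earnings column is NA;
# NA values in other columns are left untouched
toy_clean <- drop_na(toy, total_earnings_male, total_earnings_female)

# Count the remaining NA values per column
colSums(is.na(toy_clean))
```

As in the output above, the earnings columns end up with no NA, while wage_percent_of_male can still contain some.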
Example of NA imputation:
# Fill NA in wage_percent_of_male with the column mean
workers_tanpa_na <- workers %>%
  mutate(wage_percent_of_male =
           replace_na(wage_percent_of_male,
                      replace = mean(wage_percent_of_male,
                                     na.rm = TRUE)))
# Count the NA values per column after imputation
workers_tanpa_na %>%
  is.na() %>%
  colSums()
## year occupation major_category
## 0 0 0
## minor_category total_workers workers_male
## 0 0 0
## workers_female percent_female total_earnings
## 0 0 0
## total_earnings_male total_earnings_female wage_percent_of_male
## 0 0 0
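The example above imputes a numeric column with its mean. For a categorical column, the mode mentioned earlier could be used instead. The following is a minimal sketch on made-up data; `mode_value()` is a hypothetical helper, since base R has no built-in function for the statistical mode.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: most frequent non-NA value.
# table() ignores NA by default; ties resolve to the first value in sorted order.
mode_value <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Toy categorical column (values are made up)
toy <- tibble(
  major_category = c("Sales", "Service", NA, "Service", NA)
)

# Replace NA with the most frequent category
toy_imputed <- toy %>%
  mutate(major_category = replace_na(major_category,
                                     mode_value(major_category)))
toy_imputed
```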
From this dataset, we want to see the earnings gap between male and female workers for each major category, particularly Computer, Engineering, and Science, in 2016.
Pipe shortcut: Ctrl + Shift + M
# Mean male-female earnings gap per major category in 2016
data_agg <- workers_clean %>%
  filter(year == 2016) %>%
  mutate(gap_earnings = total_earnings_male - total_earnings_female) %>%
  select(major_category, gap_earnings) %>%
  group_by(major_category) %>%
  summarise(mean_gap = round(mean(gap_earnings), 2))
data_agg
Create an Algoritma theme for branded visualizations:
theme_algoritma <- theme(legend.key = element_rect(fill = "black"),
                         legend.background = element_rect(color = "white", fill = "#263238"),
                         plot.subtitle = element_text(size = 6, color = "white"),
                         panel.background = element_rect(fill = "#dddddd"),
                         panel.border = element_rect(fill = NA),
                         panel.grid.minor.x = element_blank(),
                         panel.grid.major.x = element_blank(),
                         panel.grid.major.y = element_line(color = "darkgrey", linetype = 2),
                         panel.grid.minor.y = element_blank(),
                         plot.background = element_rect(fill = "#263238"),
                         text = element_text(color = "white"),
                         axis.text = element_text(color = "white")
                         )
Plot 1
library(glue) # interpolates variable values into text, used here for the tooltip
plot1 <- ggplot(data = data_agg, aes(x = reorder(major_category,
                                                 mean_gap),
                                     y = mean_gap,
                                     text = glue("Major Category:{major_category} <br> Mean Gap : {mean_gap}") # or use paste()
                                     # text = paste("Major Category:", major_category, "<br>", "Mean Gap :" , mean_gap)
                                     )) +
  geom_col(fill = "dodgerblue4") +
  # Highlight the category of interest in a different color
  geom_col(data = filter(data_agg,
                         major_category == "Computer, Engineering, and Science"),
           fill = "firebrick4") +
  labs(x = NULL,
       y = NULL,
       title = "Gap earnings on male and female") +
  coord_flip() +
  theme_algoritma
plot1
# Reshape to long format: one row per occupation and gender
workers_long <- pivot_longer(data = workers,
                             cols = c(workers_male, workers_female),
                             names_to = "workers_gender",
                             values_to = "value")
# Mean number of workers per major category and gender
workers_vis <- workers_long %>%
  group_by(major_category, workers_gender) %>%
  summarise(mean_value = round(mean(value), 2)) %>%
  ungroup() %>%
  mutate(workers_gender = case_when(workers_gender == "workers_female" ~ "Female",
                                    TRUE ~ "Male"))
# Turn off scientific notation
options(scipen = 999)
plot2 <- ggplot(workers_vis, aes(x = reorder(major_category,
                                             mean_value),
                                 y = mean_value,
                                 text = glue("Gender: {workers_gender}
                                             Mean of workers: {mean_value}"))) +
  geom_col(aes(fill = workers_gender), position = "dodge") +
  coord_flip() +
  labs(x = NULL,
       y = NULL,
       title = "Composition of female and male workers") +
  theme(legend.position = "none") +
  theme_algoritma
We want to see how total earnings for male and female workers are distributed:
plot3 <- ggplot(workers_clean, aes(x = total_earnings_male,
                                   y = total_earnings_female)) +
  geom_point(aes(col = major_category)) +
  geom_smooth(method = "auto") +
  labs(x = "Total Earnings Male",
       y = "Total Earnings Female",
       title = "Distribution plot of earnings") +
  scale_color_brewer(palette = "Set3") +
  theme(legend.position = "none") +
  theme_algoritma
plot3
Arrange the plots with ggarrange()
Export the plots to .png, .jpg, or .pdf. The output below is the source of ggexport() from the ggpubr package:
## function (..., plotlist = NULL, filename = NULL, ncol = NULL,
## nrow = NULL, width = 480, height = 480, pointsize = 12, res = NA,
## verbose = TRUE)
## {
## if (is.null(filename))
## filename <- .collapse(.random_string(), ".pdf", sep = "")
## file.ext <- .file_ext(filename)
## dev <- .device(filename)
## dev.opts <- list(file = filename)
## if (file.ext %in% c("ps", "eps"))
## dev.opts <- dev.opts %>% .add_item(onefile = FALSE, horizontal = FALSE)
## else if (file.ext %in% c("png", "jpeg", "jpg", "bmp", "tiff"))
## dev.opts <- dev.opts %>% .add_item(width = width, height = height,
## pointsize = pointsize, res = res)
## if (file.ext %in% c("pdf")) {
## if (!missing(width))
## dev.opts <- dev.opts %>% .add_item(width = width)
## if (!missing(height))
## dev.opts <- dev.opts %>% .add_item(height = height)
## if (!missing(pointsize))
## dev.opts <- dev.opts %>% .add_item(pointsize = pointsize)
## }
## plots <- c(list(...), plotlist)
## nb.plots <- length(plots)
## if (nb.plots == 1)
## plots <- plots[[1]]
## else if (!is.null(ncol) | !is.null(nrow)) {
## plots <- ggarrange(plotlist = plots, ncol = ncol, nrow = nrow)
## }
## if (inherits(plots, "ggarrange") & .is_list(plots))
## nb.plots <- length(plots)
## if (nb.plots > 1 & file.ext %in% c("eps", "ps", "png", "jpeg",
## "jpg", "tiff", "bmp", "svg")) {
## filename <- gsub(paste0(".", file.ext), paste0("%03d.",
## file.ext), filename)
## dev.opts$file <- filename
## print(filename)
## }
## do.call(dev, dev.opts)
## utils::capture.output(print(plots))
## utils::capture.output(grDevices::dev.off())
## message("file saved to ", filename)
## }
## <bytecode: 0x000000001ece5270>
## <environment: namespace:ggpubr>
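The calls to ggarrange() and ggexport() themselves are not shown in these notes. The snippet below is a minimal sketch of how the two ggpubr functions might be combined; it uses the built-in mtcars data and hypothetical plot names (`p1`, `p2`) rather than the workshop plots, and writes to a temporary directory.

```r
library(ggplot2)
library(ggpubr)

# Two stand-in plots (not the workshop plots)
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# Place the plots side by side; the order of the arguments sets the order on the page
arranged <- ggarrange(p1, p2, ncol = 2, nrow = 1)

# Export to .png; the file extension selects the graphics device
out_file <- file.path(tempdir(), "arranged_plots.png")
ggexport(arranged, filename = out_file, width = 960, height = 480)
```

Changing the extension in `filename` to .jpg or .pdf switches the output format, as the function source above indicates.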
Build a dashboard from the flexdashboard template using Rmd.
More about flexdashboard:
- https://rmarkdown.rstudio.com/flexdashboard/index.html
Homework:
1. Demo reading data from formats other than csv, managing input data, and reading data from SQL
2. paste() for a custom tooltip