Data Management with tidyverse
Tidyverse
Tidyverse is a collection of R packages for data processing tasks such as importing, subsetting, visualizing, and transforming data. It was created by Hadley Wickham and his team to provide a complete set of tools for cleaning, tidying, and working with data.
ggplot2 handles data visualisation and lets users build complex graphics with an elegant syntax. dplyr is used for data wrangling, that is, data manipulation such as filtering, sorting, and aggregating. readr reads data from various formats such as CSV quickly and efficiently. tibble provides modern data frames, a data structure that improves on R's traditional data.frame.
lubridate makes working with dates easier, for example parsing and manipulating dates and times. forcats supports dealing with factors, that is, handling categorical data in R. tidyr plays a key role in data tidying, helping reshape data so that it follows tidy data principles.
Before you start coding, save your R script in the same folder as the extracted data folder!
Installing the packages and loading the library
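The rendered page shows only the startup messages, not the commands that produced them. A minimal sketch of the usual setup, assuming the package is installed from CRAN (the install line is commented out so the script can be re-run safely):

```r
# Install once per machine (assumption: installing from CRAN)
# install.packages("tidyverse")

# Attach the core tidyverse packages; this prints the "Attaching" and "Conflicts" messages
library(tidyverse)
```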
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Packages in tidyverse
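The call behind the listing is not echoed on the page. A likely candidate, shown here as an assumption, is `tidyverse_packages()`, which returns the names of all packages that belong to the tidyverse:

```r
library(tidyverse)

# Character vector of package names; the tidyverse package itself is included by default
tidyverse_packages()
```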
## [1] "broom" "conflicted" "cli" "dbplyr"
## [5] "dplyr" "dtplyr" "forcats" "ggplot2"
## [9] "googledrive" "googlesheets4" "haven" "hms"
## [13] "httr" "jsonlite" "lubridate" "magrittr"
## [17] "modelr" "pillar" "purrr" "ragg"
## [21] "readr" "readxl" "reprex" "rlang"
## [25] "rstudioapi" "rvest" "stringr" "tibble"
## [29] "tidyr" "xml2" "tidyverse"
Tibble
A tibble is essentially still a data frame, but it prints more cleanly than a regular data frame.
Creating a tibble (with a list column)
# A tibble can have a column (variable) of type list
coba <- tibble(x = c(1,2), y = list(c('h','i'), letters[1:7]))
coba
## # A tibble: 2 × 2
## x y
## <dbl> <list>
## 1 1 <chr [2]>
## 2 2 <chr [7]>
Creating a tibble (table)
# a tibble can hold variable names that contain spaces or start with a digit
peserta <- tibble(
`Nama Peserta` = c("Nicolas", "Thierry", "Bernard", "Jerome"),
`7Usia` = c(27, 25, 29, 26)
)
peserta
## # A tibble: 4 × 2
## `Nama Peserta` `7Usia`
## <chr> <dbl>
## 1 Nicolas 27
## 2 Thierry 25
## 3 Bernard 29
## 4 Jerome 26
# data.frame() does not preserve variable names that contain spaces or start with a digit (they are converted to syntactic names)
peserta2 <- data.frame(
`Nama Peserta` = c("Nicolas", "Thierry", "Bernard", "Jerome"),
`7Usia` = c(27, 25, 29, 26)
)
peserta2
## Nama.Peserta X7Usia
## 1 Nicolas 27
## 2 Thierry 25
## 3 Bernard 29
## 4 Jerome 26
How a tibble prints (table)
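The code for this output is not shown. A sketch that would print the same kind of summary, assuming the `nycflights13` package is installed and that the `flights` dataset is the one being displayed:

```r
# install.packages("nycflights13")  # assumption: not yet installed
library(tibble)
library(nycflights13)

# Printing a tibble shows only the first 10 rows and the columns that fit the console width
flights
```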
## Warning: package 'nycflights13' was built under R version 4.4.3
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Converting a tibble to a data frame
# convert the tibble to a data frame
flights2 <- as.data.frame(flights)
# show the first rows
head(flights2)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
## 5 -25 DL 461 N668DN LGA ATL 116 762 6 0
## 6 12 UA 1696 N39463 EWR ORD 150 719 5 58
## time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00
Tibble column names must be written in full
# create a tibble of employees
karyawan <- tibble(emp_id = c (1:5), name = c("Rick","Dan","Michelle","Ryan","Gary"))
karyawan
## # A tibble: 5 × 2
## emp_id name
## <int> <chr>
## 1 1 Rick
## 2 2 Dan
## 3 3 Michelle
## 4 4 Ryan
## 5 5 Gary
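The warning that follows comes from a call that is not echoed. It was presumably an abbreviated column name such as `karyawan$n` (an assumption based on the column named in the warning). A small sketch:

```r
library(tibble)

karyawan <- tibble(emp_id = 1:5,
                   name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"))

# Tibbles never do partial matching: an abbreviated name returns NULL with a warning
karyawan$n
```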
## Warning: Unknown or uninitialised column: `n`.
## NULL
# in other words, to access a variable you must write its name in full
karyawan$name
## [1] "Rick" "Dan" "Michelle" "Ryan" "Gary"
Data frame column names can be matched partially
# a data frame, by contrast, allows partial matching
karyawan2 <- data.frame(emp_id = c (1:5), name = c("Rick","Dan","Michelle","Ryan","Gary"))
karyawan2
## emp_id name
## 1 1 Rick
## 2 2 Dan
## 3 3 Michelle
## 4 4 Ryan
## 5 5 Gary
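The call that returns the names below is not shown. A sketch, assuming it was the abbreviated form `karyawan2$n`:

```r
karyawan2 <- data.frame(emp_id = 1:5,
                        name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"))

# `$` on a data.frame matches partially, so `n` resolves to the `name` column
karyawan2$n
```

Partial matching is convenient interactively but can silently pick the wrong column once more columns are added, which is one reason tibbles disallow it.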
## [1] "Rick" "Dan" "Michelle" "Ryan" "Gary"
Converting a data frame to a tibble
# create a data frame
dataku <- data.frame(
id = c(1, 2, 3, 4, 5),
name = c("Adam", "Eva", "Miki", "Yola", "Jack"),
age = c(46, 48, 21, 19, 17),
gender = c("male", rep("female", 3), "male"),
drives = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)
dataku
## id name age gender drives
## 1 1 Adam 46 male TRUE
## 2 2 Eva 48 female TRUE
## 3 3 Miki 21 female FALSE
## 4 4 Yola 19 female TRUE
## 5 5 Jack 17 male FALSE
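The conversion step itself is not echoed on the page. One way to do it is `as_tibble()`; the name `dataku_tbl` below is hypothetical:

```r
library(tibble)

dataku <- data.frame(
  id = c(1, 2, 3, 4, 5),
  name = c("Adam", "Eva", "Miki", "Yola", "Jack"),
  age = c(46, 48, 21, 19, 17),
  gender = c("male", rep("female", 3), "male"),
  drives = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Convert the data frame to a tibble; column types are kept as they are
dataku_tbl <- as_tibble(dataku)
dataku_tbl
```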
## # A tibble: 5 × 5
## id name age gender drives
## <dbl> <chr> <dbl> <chr> <lgl>
## 1 1 Adam 46 male TRUE
## 2 2 Eva 48 female TRUE
## 3 3 Miki 21 female FALSE
## 4 4 Yola 19 female TRUE
## 5 5 Jack 17 male FALSE
Building a tibble from several vectors
# create a tibble from several vectors
dataku3 <- tibble(
id = c(1, 2, 3, 4, 5),
name = c("Adam", "Eva", "Miki", "Yola", "Jack"),
age = c(46, 48, 21, 19, 17),
gender = c("male", rep("female", 3), "male"),
drives = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)
dataku3
## # A tibble: 5 × 5
## id name age gender drives
## <dbl> <chr> <dbl> <chr> <lgl>
## 1 1 Adam 46 male TRUE
## 2 2 Eva 48 female TRUE
## 3 3 Miki 21 female FALSE
## 4 4 Yola 19 female TRUE
## 5 5 Jack 17 male FALSE
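The three outputs that follow look like the results of `summary()`, `str()`, and `glimpse()`. The calls are not shown, so this is a reconstruction under that assumption:

```r
library(tibble)

dataku3 <- tibble(
  id = c(1, 2, 3, 4, 5),
  name = c("Adam", "Eva", "Miki", "Yola", "Jack"),
  age = c(46, 48, 21, 19, 17),
  gender = c("male", rep("female", 3), "male"),
  drives = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

summary(dataku3)  # per-column descriptive summary
str(dataku3)      # base R view of the structure
glimpse(dataku3)  # tidyverse view: one line per column, with its type
```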
## id name age gender
## Min. :1 Length:5 Min. :17.0 Length:5
## 1st Qu.:2 Class :character 1st Qu.:19.0 Class :character
## Median :3 Mode :character Median :21.0 Mode :character
## Mean :3 Mean :30.2
## 3rd Qu.:4 3rd Qu.:46.0
## Max. :5 Max. :48.0
## drives
## Mode :logical
## FALSE:2
## TRUE :3
##
##
##
## tibble [5 × 5] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:5] 1 2 3 4 5
## $ name : chr [1:5] "Adam" "Eva" "Miki" "Yola" ...
## $ age : num [1:5] 46 48 21 19 17
## $ gender: chr [1:5] "male" "female" "female" "female" ...
## $ drives: logi [1:5] TRUE TRUE FALSE TRUE FALSE
## Rows: 5
## Columns: 5
## $ id <dbl> 1, 2, 3, 4, 5
## $ name <chr> "Adam", "Eva", "Miki", "Yola", "Jack"
## $ age <dbl> 46, 48, 21, 19, 17
## $ gender <chr> "male", "female", "female", "female", "male"
## $ drives <lgl> TRUE, TRUE, FALSE, TRUE, FALSE
# convert the gender variable to a factor
dataku3$gender <- factor(dataku3$gender)
summary(dataku3)
## id name age gender drives
## Min. :1 Length:5 Min. :17.0 female:3 Mode :logical
## 1st Qu.:2 Class :character 1st Qu.:19.0 male :2 FALSE:2
## Median :3 Mode :character Median :21.0 TRUE :3
## Mean :3 Mean :30.2
## 3rd Qu.:4 3rd Qu.:46.0
## Max. :5 Max. :48.0
Building a tibble from a table
# enter a table row by row as a tibble
dataku4 <- tribble(
~id, ~name, ~age, ~gender, ~drives,
1, "Adam", 46, "male", TRUE,
2, "Eva", 48, "female", TRUE,
3, "Xaxi", 21, "female", FALSE,
4, "Yota", 19, "female", TRUE,
5, "Zack", 17, "male", FALSE,
)
dataku4
## # A tibble: 5 × 5
## id name age gender drives
## <dbl> <chr> <dbl> <chr> <lgl>
## 1 1 Adam 46 male TRUE
## 2 2 Eva 48 female TRUE
## 3 3 Xaxi 21 female FALSE
## 4 4 Yota 19 female TRUE
## 5 5 Zack 17 male FALSE
Accessing columns in a tibble
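The commands for the outputs below are not echoed. The following sketch uses standard subsetting forms that return results of the same shape (a vector, a one-column tibble, a single row, and the rows where `drives` is TRUE); which exact forms the author used is an assumption.

```r
library(tibble)

dataku4 <- tribble(
  ~id, ~name,  ~age, ~gender,  ~drives,
  1,   "Adam", 46,   "male",   TRUE,
  2,   "Eva",  48,   "female", TRUE,
  3,   "Xaxi", 21,   "female", FALSE,
  4,   "Yota", 19,   "female", TRUE,
  5,   "Zack", 17,   "male",   FALSE
)

dataku4$name               # character vector
dataku4["name"]            # one-column tibble
dataku4[["name"]]          # character vector
dataku4[, "name"]          # still a tibble: `[` never drops to a vector
dataku4[5, ]               # fifth row as a tibble
dataku4[dataku4$drives, ]  # rows where drives is TRUE
```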
## [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
## # A tibble: 5 × 1
## name
## <chr>
## 1 Adam
## 2 Eva
## 3 Xaxi
## 4 Yota
## 5 Zack
## [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
## # A tibble: 5 × 1
## name
## <chr>
## 1 Adam
## 2 Eva
## 3 Xaxi
## 4 Yota
## 5 Zack
## # A tibble: 1 × 5
## id name age gender drives
## <dbl> <chr> <dbl> <chr> <lgl>
## 1 5 Zack 17 male FALSE
## # A tibble: 3 × 5
## id name age gender drives
## <dbl> <chr> <dbl> <chr> <lgl>
## 1 1 Adam 46 male TRUE
## 2 2 Eva 48 female TRUE
## 3 4 Yota 19 female TRUE
Readr
readr is a package for importing data from files such as csv and txt. It is generally considered faster than read.csv, read.table, and similar base functions. Data imported with readr is returned as a tibble.
Reading data
# read a CSV file that uses a comma (,) as the separator
data1 <- read_csv("D:/Download/Data/penjualan.csv")
data1
## # A tibble: 5 × 5
## ID Nama Usia Gender Penjualan
## <chr> <chr> <dbl> <chr> <dbl>
## 1 321A Eiden 35 Laki-laki 1125600
## 2 44B Olivia 28 Perempuan 987000
## 3 984C Brandon 25 Laki-laki 2134000
## 4 653D Wendy 30 Perempuan 756000
## 5 178E Loki 26 Laki-laki 1698000
# read a CSV file that uses a semicolon (;) as the separator
dataku2 <- read_csv2("D:/Download/Data/penjualan2.csv")
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 5 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (3): ID, Nama, Gender
## dbl (2): Usia, Penjualan
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 5
## ID Nama Usia Gender Penjualan
## <chr> <chr> <dbl> <chr> <dbl>
## 1 321A Eiden 35 Laki-laki 1125600
## 2 44B Olivia 28 Perempuan 987000
## 3 984C Brandon 25 Laki-laki 2134000
## 4 653D Wendy 30 Perempuan 756000
## 5 178E Loki 26 Laki-laki 1698000
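The next output reports a tab delimiter, but the call is not shown. `read_tsv()` is the readr shortcut for tab-separated files. The sketch below writes a small temporary file (hypothetical content modelled on the sales data) so that it runs without the original files:

```r
library(readr)

# Hypothetical tab-separated file written to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("ID\tNama\tUsia",
             "321A\tEiden\t35",
             "44B\tOlivia\t28"), path)

# read_tsv() is equivalent to read_delim(path, delim = "\t")
contoh <- read_tsv(path, show_col_types = FALSE)
contoh
```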
## Rows: 5 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): ID, Nama, Gender
## dbl (1): Usia
## num (1): Penjualan
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 5
## ID Nama Usia Gender Penjualan
## <chr> <chr> <dbl> <chr> <dbl>
## 1 321A Eiden 35 Laki-laki 1125600
## 2 44B Olivia 28 Perempuan 987000
## 3 984C Brandon 25 Laki-laki 2134000
## 4 653D Wendy 30 Perempuan 756000
## 5 178E Loki 26 Laki-laki 1698000
# read a text file that uses a pipe (|) as the separator
dataku4 <- read_delim("D:/Download/Data/penjualan2.txt", delim='|')
## Rows: 5 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "|"
## chr (3): ID, Nama, Gender
## dbl (2): Usia, Penjualan
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 5
## ID Nama Usia Gender Penjualan
## <chr> <chr> <dbl> <chr> <dbl>
## 1 321A Eiden 35 Laki-laki 1125600
## 2 44B Olivia 28 Perempuan 987000
## 3 984C Brandon 25 Laki-laki 2134000
## 4 653D Wendy 30 Perempuan 756000
## 5 178E Loki 26 Laki-laki 1698000
# read_delim() is the general-purpose reader and can handle files with any separator
# read a CSV file that uses a comma (,) as the separator
dataku5 <- read_delim("D:/Download/Data/penjualan.csv", delim=',')
## Rows: 5 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): ID, Nama, Gender
## dbl (1): Usia
## num (1): Penjualan
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 5
## ID Nama Usia Gender Penjualan
## <chr> <chr> <dbl> <chr> <dbl>
## 1 321A Eiden 35 Laki-laki 1125600
## 2 44B Olivia 28 Perempuan 987000
## 3 984C Brandon 25 Laki-laki 2134000
## 4 653D Wendy 30 Perempuan 756000
## 5 178E Loki 26 Laki-laki 1698000
# read a text file that uses a tab (\t) as the separator
dataku6 <- read_delim("D:/Download/Data/penjualan.txt", delim='\t')
## Rows: 5 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): ID, Nama, Gender
## dbl (1): Usia
## num (1): Penjualan
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 5
## ID Nama Usia Gender Penjualan
## <chr> <chr> <dbl> <chr> <dbl>
## 1 321A Eiden 35 Laki-laki 1125600
## 2 44B Olivia 28 Perempuan 987000
## 3 984C Brandon 25 Laki-laki 2134000
## 4 653D Wendy 30 Perempuan 756000
## 5 178E Loki 26 Laki-laki 1698000
Saving data
# first build a tibble from several vectors
datatbb <- tibble(
id = c(1, 2, 3, 4, 5),
name = c("Adam", "Eva", "Miki", "Yola", "Jack"),
age = c(46, 48, 21, 19, 17),
gender = c("male", rep("female", 3), "male"),
drives = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)
datatbb
## # A tibble: 5 × 5
## id name age gender drives
## <dbl> <chr> <dbl> <chr> <lgl>
## 1 1 Adam 46 male TRUE
## 2 2 Eva 48 female TRUE
## 3 3 Miki 21 female FALSE
## 4 4 Yola 19 female TRUE
## 5 5 Jack 17 male FALSE
# export the tibble to a CSV file with a comma separator
write_csv(datatbb, 'cobakoma.csv')
# export the tibble to a CSV file with a semicolon separator
write_csv2(datatbb, 'cobatitikkoma.csv')
# export the tibble to a text file with a space separator
write_delim(datatbb, 'cobaspasi.txt', delim = ' ')
# export the tibble to a text file with a tab separator
write_tsv(datatbb, 'cobatab.txt')
Dplyr
dplyr is a package for data wrangling: transforming data frames, aggregating, computing descriptive statistics, joining data frames, modifying columns and rows, sorting data, and so on.
The pipe operator (%>%)
The pipe operator (%>%) expresses a sequence of operations clearly, one step after another. Its purpose is to make code easier to write and easier to read.
# create a vector
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.117, 0.829, 0.907)
# round the mean using nested base R function calls
round(mean(x), 2)
## [1] 0.51
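The second, identical result presumably comes from the piped version of the same calculation, which is not echoed. A sketch of how the nested call reads as a pipeline:

```r
library(magrittr)  # provides %>%; it is also available after library(tidyverse)

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.117, 0.829, 0.907)

# Read left to right: take x, compute the mean, then round to 2 decimals
x %>% mean() %>% round(2)
```

The pipe passes the value on its left as the first argument of the function on its right, so `x %>% mean() %>% round(2)` is the same as `round(mean(x), 2)`.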
## [1] 0.51
Select
select() is the function used to pick columns from a data frame.
# use the salary.csv data
gaji <- read_csv("D:/Download/Data/salary.csv")
# convert every character variable to a factor
gaji2 <- gaji %>% mutate_if(is.character, as.factor)
gaji2
## # A tibble: 200 × 9
## id `degree level` `family size` gender age `marital status`
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 2 university 5 male 36 married
## 2 3 senior_high 4 female 19 single
## 3 4 university 3 female 23 married
## 4 5 university 3 female 28 married
## 5 6 university 6 female 26 single
## 6 7 university 5 female 27 single
## 7 8 university 8 female 28 single
## 8 9 university 3 male 29 married
## 9 10 diploma 3 female 20 single
## 10 11 university 5 female 40 single
## # ℹ 190 more rows
## # ℹ 3 more variables: `residence type` <fct>, `work type` <fct>,
## # `annual salary` <dbl>
# select the id, gender, and annual salary columns
datagajigender <- select(gaji2, id, gender, `annual salary`)
datagajigender
## # A tibble: 200 × 3
## id gender `annual salary`
## <dbl> <fct> <dbl>
## 1 2 male 330650
## 2 3 female 79000
## 3 4 female 156750
## 4 5 female 349000
## 5 6 female 59000
## 6 7 female 49000
## 7 8 female 79200
## 8 9 male 499000
## 9 10 female 424150
## 10 11 female 160300
## # ℹ 190 more rows
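The next two outputs (columns `id` through `age`, then everything except `family size`) appear without their code. A sketch of selections that give results of that shape, run on a three-row miniature built from the first rows shown above (the name `gaji_mini` is hypothetical):

```r
library(dplyr)

# Miniature of the salary data: first three rows, six of the nine columns
gaji_mini <- tibble(
  id = c(2, 3, 4),
  `degree level` = c("university", "senior_high", "university"),
  `family size` = c(5, 4, 3),
  gender = c("male", "female", "female"),
  age = c(36, 19, 23),
  `annual salary` = c(330650, 79000, 156750)
)

select(gaji_mini, id:age)          # a range of adjacent columns
select(gaji_mini, -`family size`)  # everything except one column
```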
## # A tibble: 200 × 5
## id `degree level` `family size` gender age
## <dbl> <fct> <dbl> <fct> <dbl>
## 1 2 university 5 male 36
## 2 3 senior_high 4 female 19
## 3 4 university 3 female 23
## 4 5 university 3 female 28
## 5 6 university 6 female 26
## 6 7 university 5 female 27
## 7 8 university 8 female 28
## 8 9 university 3 male 29
## 9 10 diploma 3 female 20
## 10 11 university 5 female 40
## # ℹ 190 more rows
## # A tibble: 200 × 8
## id `degree level` gender age `marital status` `residence type`
## <dbl> <fct> <fct> <dbl> <fct> <fct>
## 1 2 university male 36 married own
## 2 3 senior_high female 19 single family
## 3 4 university female 23 married family
## 4 5 university female 28 married family
## 5 6 university female 26 single family
## 6 7 university female 27 single family
## 7 8 university female 28 single kost
## 8 9 university male 29 married family
## 9 10 diploma female 20 single rent
## 10 11 university female 40 single family
## # ℹ 190 more rows
## # ℹ 2 more variables: `work type` <fct>, `annual salary` <dbl>
# select all columns except family size and residence type
select(gaji2, -c(`family size`, `residence type`))
## # A tibble: 200 × 7
## id `degree level` gender age `marital status` `work type`
## <dbl> <fct> <fct> <dbl> <fct> <fct>
## 1 2 university male 36 married full_time
## 2 3 senior_high female 19 single student
## 3 4 university female 23 married full_time
## 4 5 university female 28 married full_time
## 5 6 university female 26 single full_time
## 6 7 university female 27 single entrepreneur
## 7 8 university female 28 single full_time
## 8 9 university male 29 married full_time
## 9 10 diploma female 20 single full_time
## 10 11 university female 40 single full_time
## # ℹ 190 more rows
## # ℹ 1 more variable: `annual salary` <dbl>
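The outputs that follow have no visible code. Their column sets are consistent with the name-based helpers `starts_with()` and `ends_with()`, so the sketch below uses those as an assumption, on a miniature built from the first rows of the data (`gaji_mini` is a hypothetical name):

```r
library(dplyr)

gaji_mini <- tibble(
  id = c(2, 3, 4),
  `family size` = c(5, 4, 3),
  gender = c("male", "female", "female"),
  age = c(36, 19, 23),
  `marital status` = c("married", "single", "married"),
  `annual salary` = c(330650, 79000, 156750)
)

select(gaji_mini, starts_with("a"))          # age, annual salary
select(gaji_mini, starts_with(c("a", "i")))  # matches for "a" first, then "i"
select(gaji_mini, ends_with("e"))            # family size, age
select(gaji_mini, ends_with("s"))            # marital status
```

When several patterns are given, the columns come back grouped by pattern rather than in their original order.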
## # A tibble: 200 × 2
## age `annual salary`
## <dbl> <dbl>
## 1 36 330650
## 2 19 79000
## 3 23 156750
## 4 28 349000
## 5 26 59000
## 6 27 49000
## 7 28 79200
## 8 29 499000
## 9 20 424150
## 10 40 160300
## # ℹ 190 more rows
## # A tibble: 200 × 3
## age `annual salary` id
## <dbl> <dbl> <dbl>
## 1 36 330650 2
## 2 19 79000 3
## 3 23 156750 4
## 4 28 349000 5
## 5 26 59000 6
## 6 27 49000 7
## 7 28 79200 8
## 8 29 499000 9
## 9 20 424150 10
## 10 40 160300 11
## # ℹ 190 more rows
## # A tibble: 200 × 4
## `family size` age `residence type` `work type`
## <dbl> <dbl> <fct> <fct>
## 1 5 36 own full_time
## 2 4 19 family student
## 3 3 23 family full_time
## 4 3 28 family full_time
## 5 6 26 family full_time
## 6 5 27 family entrepreneur
## 7 8 28 kost full_time
## 8 3 29 family full_time
## 9 3 20 rent full_time
## 10 5 40 family full_time
## # ℹ 190 more rows
## # A tibble: 200 × 5
## `family size` age `residence type` `work type` `annual salary`
## <dbl> <dbl> <fct> <fct> <dbl>
## 1 5 36 own full_time 330650
## 2 4 19 family student 79000
## 3 3 23 family full_time 156750
## 4 3 28 family full_time 349000
## 5 6 26 family full_time 59000
## 6 5 27 family entrepreneur 49000
## 7 8 28 kost full_time 79200
## 8 3 29 family full_time 499000
## 9 3 20 rent full_time 424150
## 10 5 40 family full_time 160300
## # ℹ 190 more rows
## # A tibble: 200 × 1
## `marital status`
## <fct>
## 1 married
## 2 single
## 3 married
## 4 married
## 5 single
## 6 single
## 7 single
## 8 married
## 9 single
## 10 single
## # ℹ 190 more rows
# select columns whose names contain the letter a or n
select(gaji2, contains(c("a","n")))
## # A tibble: 200 × 6
## `family size` age `marital status` `annual salary` gender `residence type`
## <dbl> <dbl> <fct> <dbl> <fct> <fct>
## 1 5 36 married 330650 male own
## 2 4 19 single 79000 female family
## 3 3 23 married 156750 female family
## 4 3 28 married 349000 female family
## 5 6 26 single 59000 female family
## 6 5 27 single 49000 female family
## 7 8 28 single 79200 female kost
## 8 3 29 married 499000 male family
## 9 3 20 single 424150 female rent
## 10 5 40 single 160300 female family
## # ℹ 190 more rows
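The last output in this part keeps only the numeric columns, again without showing the call. One way to select by column type is `where()`; whether the author used it or an older helper such as `select_if()` is not known. `gaji_mini` is a hypothetical miniature of the data:

```r
library(dplyr)

gaji_mini <- tibble(
  id = c(2, 3, 4),
  `family size` = c(5, 4, 3),
  gender = c("male", "female", "female"),
  age = c(36, 19, 23),
  `annual salary` = c(330650, 79000, 156750)
)

# Keep every column for which is.numeric() returns TRUE
select(gaji_mini, where(is.numeric))
```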
## # A tibble: 200 × 4
## id `family size` age `annual salary`
## <dbl> <dbl> <dbl> <dbl>
## 1 2 5 36 330650
## 2 3 4 19 79000
## 3 4 3 23 156750
## 4 5 3 28 349000
## 5 6 6 26 59000
## 6 7 5 27 49000
## 7 8 8 28 79200
## 8 9 3 29 499000
## 9 10 3 20 424150
## 10 11 5 40 160300
## # ℹ 190 more rows
Filter
filter() is the function used to pick rows that meet given criteria.
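The first two outputs below (37 male rows, then an empty tibble) are shown without code. A sketch of filters with the same effect, assuming the first keeps males and the second looks for missing gender values; `gaji_mini` is a hypothetical miniature of the data:

```r
library(dplyr)

gaji_mini <- tibble(
  id = c(2, 3, 4),
  gender = c("male", "female", "female"),
  age = c(36, 19, 23)
)

filter(gaji_mini, gender == "male")  # rows where the condition is TRUE
filter(gaji_mini, is.na(gender))     # zero rows here: no gender value is missing
```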
## # A tibble: 37 × 9
## id `degree level` `family size` gender age `marital status`
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 2 university 5 male 36 married
## 2 9 university 3 male 29 married
## 3 12 university 2 male 30 single
## 4 14 senior_high 3 male 17 single
## 5 21 university 3 male 25 married
## 6 23 university 3 male 34 married
## 7 26 senior_high 4 male 17 single
## 8 29 university 7 male 28 single
## 9 35 senior_high 4 male 20 single
## 10 36 diploma 4 male 39 married
## # ℹ 27 more rows
## # ℹ 3 more variables: `residence type` <fct>, `work type` <fct>,
## # `annual salary` <dbl>
## # A tibble: 0 × 9
## # ℹ 9 variables: id <dbl>, degree level <fct>, family size <dbl>, gender <fct>,
## # age <dbl>, marital status <fct>, residence type <fct>, work type <fct>,
## # annual salary <dbl>
# keep only the rows where gender is not missing
gaji3 <- filter(gaji2, !is.na(`gender`))
# keep the rows where gender is male and age is above 30
filter(gaji2, `gender` == "male", age > 30)
## # A tibble: 12 × 9
## id `degree level` `family size` gender age `marital status`
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 2 university 5 male 36 married
## 2 23 university 3 male 34 married
## 3 36 diploma 4 male 39 married
## 4 45 university 7 male 31 single
## 5 49 university 10 male 35 married
## 6 56 university 2 male 49 married
## 7 61 university 7 male 31 single
## 8 170 university 7 male 31 married
## 9 193 university 7 male 33 single
## 10 109 university 5 male 37 married
## 11 154 university 7 male 44 married
## 12 169 senior_high 4 male 38 single
## # ℹ 3 more variables: `residence type` <fct>, `work type` <fct>,
## # `annual salary` <dbl>
# keep the rows where work type is part_time or student, using the pipe
gaji2 %>%
filter(`work type` %in% c("part_time", "student"))
## # A tibble: 52 × 9
## id `degree level` `family size` gender age `marital status`
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 3 senior_high 4 female 19 single
## 2 12 university 2 male 30 single
## 3 13 diploma 6 female 31 single
## 4 15 university 2 female 16 single
## 5 17 university 3 female 33 single
## 6 22 senior_high 5 female 35 single
## 7 26 senior_high 4 male 17 single
## 8 27 senior_high 5 female 35 single
## 9 33 senior_high 5 female 43 single
## 10 35 senior_high 4 male 20 single
## # ℹ 42 more rows
## # ℹ 3 more variables: `residence type` <fct>, `work type` <fct>,
## # `annual salary` <dbl>
# select the id, degree level, gender, age, and annual salary columns,
# keeping only rows where age is between 30 and 40 (inclusive)
gaji2 %>%
select(id, `degree level`, `gender`, age, `annual salary`) %>%
filter(between(age, 30, 40))
## # A tibble: 73 × 5
## id `degree level` gender age `annual salary`
## <dbl> <fct> <fct> <dbl> <dbl>
## 1 2 university male 36 330650
## 2 11 university female 40 160300
## 3 12 university male 30 119200
## 4 13 diploma female 31 139000
## 5 16 university female 32 171750
## 6 17 university female 33 99000
## 7 18 university female 34 239200
## 8 20 diploma female 30 109000
## 9 22 senior_high female 35 119000
## 10 23 university male 34 96850
## # ℹ 63 more rows
# select the id, degree level, gender, age, and annual salary columns,
# keeping only rows where degree level is not university
gaji2 %>%
select(id, `degree level`,`gender`, age, `annual salary`) %>%
filter(`degree level` != "university")
## # A tibble: 81 × 5
## id `degree level` gender age `annual salary`
## <dbl> <fct> <fct> <dbl> <dbl>
## 1 3 senior_high female 19 79000
## 2 10 diploma female 20 424150
## 3 13 diploma female 31 139000
## 4 14 senior_high male 17 209000
## 5 20 diploma female 30 109000
## 6 22 senior_high female 35 119000
## 7 26 senior_high male 17 39400
## 8 27 senior_high female 35 493500
## 9 28 senior_high female 30 63200
## 10 31 senior_high female 31 199000
## # ℹ 71 more rows
# keep the rows where degree level contains the string "ior"
gaji2 %>%
filter(str_detect(`degree level`, "ior"))
## # A tibble: 61 × 9
## id `degree level` `family size` gender age `marital status`
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 3 senior_high 4 female 19 single
## 2 14 senior_high 3 male 17 single
## 3 22 senior_high 5 female 35 single
## 4 26 senior_high 4 male 17 single
## 5 27 senior_high 5 female 35 single
## 6 28 senior_high 5 female 30 married
## 7 31 senior_high 6 female 31 single
## 8 33 senior_high 5 female 43 single
## 9 34 senior_high 4 female 33 single
## 10 35 senior_high 4 male 20 single
## # ℹ 51 more rows
## # ℹ 3 more variables: `residence type` <fct>, `work type` <fct>,
## # `annual salary` <dbl>
Arrange
arrange() is the function used to sort data.
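The first two outputs below are sorted by age, ascending and then descending, but the calls are not echoed. A sketch of the corresponding calls on a hypothetical miniature of the data:

```r
library(dplyr)

gaji_mini <- tibble(
  id = c(2, 3, 4),
  age = c(36, 19, 23),
  `annual salary` = c(330650, 79000, 156750)
)

arrange(gaji_mini, age)        # ascending (the default)
arrange(gaji_mini, desc(age))  # descending
```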
## # A tibble: 200 × 9
## id `degree level` `family size` gender age `marital status`
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 57 university 3 female 15 single
## 2 155 school 4 male 15 single
## 3 132 university 5 female 15 single
## 4 15 university 2 female 16 single
## 5 60 senior_high 1 female 16 single
## 6 79 school 4 male 16 single
## 7 140 senior_high 4 female 16 single
## 8 14 senior_high 3 male 17 single
## 9 26 senior_high 4 male 17 single
## 10 71 university 3 female 17 single
## # ℹ 190 more rows
## # ℹ 3 more variables: `residence type` <fct>, `work type` <fct>,
## # `annual salary` <dbl>
## # A tibble: 200 × 9
## id `degree level` `family size` gender age `marital status`
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 43 senior_high 4 female 53 married
## 2 178 diploma 6 female 51 married
## 3 56 university 2 male 49 married
## 4 104 university 3 female 46 married
## 5 154 university 7 male 44 married
## 6 194 university 2 female 44 single
## 7 33 senior_high 5 female 43 single
## 8 150 senior_high 5 female 43 single
## 9 189 university 2 female 43 married
## 10 19 university 4 female 42 married
## # ℹ 190 more rows
## # ℹ 3 more variables: `residence type` <fct>, `work type` <fct>,
## # `annual salary` <dbl>
# sort by degree level (ascending), then by annual salary (ascending)
arrange(gaji2, `degree level`, `annual salary`)
## # A tibble: 200 × 9
## id `degree level` `family size` gender age `marital status`
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 100 diploma 7 female 37 married
## 2 92 diploma 4 female 25 single
## 3 20 diploma 8 female 30 single
## 4 36 diploma 4 male 39 married
## 5 72 diploma 2 female 32 married
## 6 77 diploma 1 female 22 single
## 7 13 diploma 6 female 31 single
## 8 178 diploma 6 female 51 married
## 9 39 diploma 3 female 22 single
## 10 38 diploma 4 female 22 single
## # ℹ 190 more rows
## # ℹ 3 more variables: `residence type` <fct>, `work type` <fct>,
## # `annual salary` <dbl>
# sort by degree level (ascending), then by annual salary (descending)
arrange(gaji2, `degree level`, desc(`annual salary`))
## # A tibble: 200 × 9
## id `degree level` `family size` gender age `marital status`
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 10 diploma 3 female 20 single
## 2 186 diploma 4 female 28 married
## 3 128 diploma 3 female 22 single
## 4 53 diploma 1 female 19 single
## 5 136 diploma 4 female 28 single
## 6 159 diploma 4 female 36 married
## 7 42 diploma 2 male 28 married
## 8 38 diploma 4 female 22 single
## 9 39 diploma 3 female 22 single
## 10 178 diploma 6 female 51 married
## # ℹ 190 more rows
## # ℹ 3 more variables: `residence type` <fct>, `work type` <fct>,
## # `annual salary` <dbl>
# select the id, degree level, and annual salary columns,
# then sort by annual salary (ascending)
gaji2 %>%
select(id, `degree level`, `annual salary`) %>%
arrange(`annual salary`)
## # A tibble: 200 × 3
## id `degree level` `annual salary`
## <dbl> <fct> <dbl>
## 1 117 senior_high 12500
## 2 154 university 25000
## 3 26 senior_high 39400
## 4 44 senior_high 41700
## 5 7 university 49000
## 6 114 university 49000
## 7 197 senior_high 49500
## 8 138 university 55200
## 9 6 university 59000
## 10 75 university 59000
## # ℹ 190 more rows
# select the id, degree level, and annual salary columns,
# then keep only the rows with degree level university,
# then sort by annual salary (descending)
gaji2 %>%
select(id, `degree level`, `annual salary`) %>%
filter(`degree level` == "university") %>%
arrange(desc(`annual salary`))
## # A tibble: 119 × 3
## id `degree level` `annual salary`
## <dbl> <fct> <dbl>
## 1 58 university 896000
## 2 180 university 599200
## 3 45 university 599000
## 4 9 university 499000
## 5 48 university 479000
## 6 59 university 442000
## 7 127 university 439000
## 8 55 university 419000
## 9 194 university 398650
## 10 141 university 398000
## # ℹ 109 more rows
# the order of the steps can be swapped (filter first, then select)
dataku5 <- gaji2 %>%
filter(`degree level` == "university") %>%
select(id, `annual salary`) %>%
arrange(desc(`annual salary`))
dataku5
## # A tibble: 119 × 2
## id `annual salary`
## <dbl> <dbl>
## 1 58 896000
## 2 180 599200
## 3 45 599000
## 4 9 499000
## 5 48 479000
## 6 59 442000
## 7 127 439000
## 8 55 419000
## 9 194 398650
## 10 141 398000
## # ℹ 109 more rows
Mutate
mutate() is the function used to create new columns (variables).
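The first output below has an extra `pajak` column, but its code is not shown. A sketch of a `mutate()` call that adds such a column, assuming `pajak` is 10% of the annual salary as in the later examples; `gaji_mini` is a hypothetical miniature of the data:

```r
library(dplyr)

gaji_mini <- tibble(
  id = c(2, 3, 4),
  `annual salary` = c(330650, 79000, 156750)
)

# Add a tax column; the existing columns are kept unchanged
mutate(gaji_mini, pajak = `annual salary` * 0.1)
```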
## # A tibble: 200 × 10
## id `degree level` `family size` gender age `marital status`
## <dbl> <fct> <dbl> <fct> <dbl> <fct>
## 1 2 university 5 male 36 married
## 2 3 senior_high 4 female 19 single
## 3 4 university 3 female 23 married
## 4 5 university 3 female 28 married
## 5 6 university 6 female 26 single
## 6 7 university 5 female 27 single
## 7 8 university 8 female 28 single
## 8 9 university 3 male 29 married
## 9 10 diploma 3 female 20 single
## 10 11 university 5 female 40 single
## # ℹ 190 more rows
## # ℹ 4 more variables: `residence type` <fct>, `work type` <fct>,
## # `annual salary` <dbl>, pajak <dbl>
# select id and annual salary,
# then add a pajak (tax) column equal to 10% of annual salary and a gaji utuh (net salary) column equal to annual salary minus pajak
datatambah <- gaji2 %>%
select(id, `annual salary`) %>%
mutate(pajak = `annual salary` * 0.1,
`gaji utuh` = `annual salary` - pajak
)
datatambah
## # A tibble: 200 × 4
## id `annual salary` pajak `gaji utuh`
## <dbl> <dbl> <dbl> <dbl>
## 1 2 330650 33065 297585
## 2 3 79000 7900 71100
## 3 4 156750 15675 141075
## 4 5 349000 34900 314100
## 5 6 59000 5900 53100
## 6 7 49000 4900 44100
## 7 8 79200 7920 71280
## 8 9 499000 49900 449100
## 9 10 424150 42415 381735
## 10 11 160300 16030 144270
## # ℹ 190 more rows
# select id, age, and annual salary,
# then add a kategori salary column with the values low and high:
# low if annual salary <= 450000
# high if annual salary > 450000
gaji2 %>%
select(id, age, `annual salary`) %>%
mutate(`kategori salary` = if_else(`annual salary` <= 450000, "low", "high"))
## # A tibble: 200 × 4
## id age `annual salary` `kategori salary`
## <dbl> <dbl> <dbl> <chr>
## 1 2 36 330650 low
## 2 3 19 79000 low
## 3 4 23 156750 low
## 4 5 28 349000 low
## 5 6 26 59000 low
## 6 7 27 49000 low
## 7 8 28 79200 low
## 8 9 29 499000 high
## 9 10 20 424150 low
## 10 11 40 160300 low
## # ℹ 190 more rows
# select id, gender, age, and annual salary,
# then add a kategori salary column with the values low, medium, and high:
# high if annual salary > 750000
# medium if annual salary > 250000 and <= 750000
# low otherwise
# finally keep only the rows where gender is male
cc <- datagaji %>%
select(id, gender, age, `annual salary`) %>%
mutate(`kategori salary` = case_when(
`annual salary` > 750000 ~"high",
`annual salary` > 250000 ~"medium",
TRUE ~"low")) %>%
filter(gender == "male")
cc
## # A tibble: 37 × 5
## id gender age `annual salary` `kategori salary`
## <dbl> <fct> <dbl> <dbl> <chr>
## 1 2 male 36 330650 medium
## 2 9 male 29 499000 medium
## 3 12 male 30 119200 low
## 4 14 male 17 209000 low
## 5 21 male 25 209000 low
## 6 23 male 34 96850 low
## 7 26 male 17 39400 low
## 8 29 male 28 381650 medium
## 9 35 male 20 114500 low
## 10 36 male 39 109000 low
## # ℹ 27 more rows
Summarize
summarise() computes descriptive statistics, collapsing a data frame into one row per group. It is most often applied to data grouped with group_by(); on ungrouped data it returns a single row.
## # A tibble: 1 × 1
## rata2
## <dbl>
## 1 209111.
# show the minimum and maximum annual salary
summarise(gaji2, minimum = min(`annual salary`), maksimum = max(`annual salary`))
## # A tibble: 1 × 2
## minimum maksimum
## <dbl> <dbl>
## 1 12500 948306
## # A tibble: 1 × 1
## total
## <dbl>
## 1 41822297
## # A tibble: 1 × 1
## `median age`
## <dbl>
## 1 28
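The one-row outputs above (rata2, total, and median age) are shown without the calls that produced them. Calls of the following shape yield summaries like these. This is a sketch on a small hypothetical tibble, `contoh`, with invented values; it is not the original gaji2 data, and the result names are chosen only to mirror the outputs.

```r
library(dplyr)

# hypothetical stand-in for gaji2 (values invented for illustration)
contoh <- tibble(
  age = c(25, 30, 28, 41),
  `annual salary` = c(100000, 250000, 180000, 90000)
)

rata2      <- summarise(contoh, rata2 = mean(`annual salary`))
total      <- summarise(contoh, total = sum(`annual salary`))
median_age <- summarise(contoh, `median age` = median(age))
```

Several statistics can also be requested in a single summarise() call, as the minimum and maximum example does.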
# show the mean annual salary by gender
gaji2 %>%
group_by(gender) %>%
summarise(rata2 = mean(`annual salary`))
## # A tibble: 2 × 2
## gender rata2
## <fct> <dbl>
## 1 female 205100.
## 2 male 226785.
# show the mean annual salary by gender and degree level
gaji2 %>%
group_by(gender, `degree level`) %>%
summarise(rata2 = mean(`annual salary`))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
## # A tibble: 7 × 3
## # Groups: gender [2]
## gender `degree level` rata2
## <fct> <fct> <dbl>
## 1 female diploma 175810
## 2 female senior_high 199629.
## 3 female university 213354.
## 4 male diploma 139000
## 5 male school 115667.
## 6 male senior_high 327802.
## 7 male university 230530.
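The message above appears because summarise() removes only the last grouping level, so the result is still grouped by gender. Setting `.groups = "drop"` returns an ungrouped tibble and silences the message. A minimal sketch on a hypothetical tibble, `contoh`, with invented values:

```r
library(dplyr)

# hypothetical data, not the original gaji2
contoh <- tibble(
  gender = c("female", "female", "male", "male"),
  `degree level` = c("diploma", "university", "diploma", "diploma"),
  `annual salary` = c(100000, 200000, 150000, 250000)
)

hasil <- contoh %>%
  group_by(gender, `degree level`) %>%
  summarise(rata2 = mean(`annual salary`), .groups = "drop")

is_grouped_df(hasil)  # no grouping remains on the result
```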
# show the mean of every numeric variable by gender
gaji2 %>%
group_by(gender) %>%
summarise_if(is.numeric, mean)
## # A tibble: 2 × 5
## gender id `family size` age `annual salary`
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 female 105. 4.14 28.5 205100.
## 2 male 84.8 4.57 28.5 226785.
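summarise_if() and summarise_all() still work, but since dplyr 1.0.0 they are superseded by across(). The same kind of summary can be written as below. `contoh` is a hypothetical stand-in for gaji2 with invented values.

```r
library(dplyr)

# hypothetical data, not the original gaji2
contoh <- tibble(
  gender = c("female", "female", "male"),
  age = c(20, 30, 40),
  `annual salary` = c(100000, 200000, 150000)
)

# mean of every numeric column within each gender
hasil <- contoh %>%
  group_by(gender) %>%
  summarise(across(where(is.numeric), mean))
```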
# keep female respondents whose marital status is single
# then show the mean age and annual salary by degree level
gaji2 %>%
filter(gender == "female", `marital status` == "single") %>%
select(`degree level`, age, `annual salary`) %>%
group_by(`degree level`) %>%
summarise_all(mean)
## # A tibble: 3 × 3
## `degree level` age `annual salary`
## <fct> <dbl> <dbl>
## 1 diploma 24.1 181425
## 2 senior_high 28.2 186651.
## 3 university 27.8 217620.
Join
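A join combines two tables through one or more key columns with matching values, producing a single dataset that holds the information from both. dplyr provides left_join(), right_join(), inner_join(), and full_join(), which differ in how they treat rows that have no match.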
# create the tables
tabely <- tribble(
~ID, ~y,
"A", 5,
"B", 5,
"C", 8,
"D", 0,
"F", 9)
tabelz <- tribble(
~ID, ~z,
"A", 30,
"B", 21,
"C", 22,
"D", 25,
"E", 29)
## # A tibble: 5 × 3
## ID y z
## <chr> <dbl> <dbl>
## 1 A 5 30
## 2 B 5 21
## 3 C 8 22
## 4 D 0 25
## 5 F 9 NA
## # A tibble: 5 × 3
## ID y z
## <chr> <dbl> <dbl>
## 1 A 5 30
## 2 B 5 21
## 3 C 8 22
## 4 D 0 25
## 5 E NA 29
## # A tibble: 4 × 3
## ID y z
## <chr> <dbl> <dbl>
## 1 A 5 30
## 2 B 5 21
## 3 C 8 22
## 4 D 0 25
## # A tibble: 6 × 3
## ID y z
## <chr> <dbl> <dbl>
## 1 A 5 30
## 2 B 5 21
## 3 C 8 22
## 4 D 0 25
## 5 F 9 NA
## 6 E NA 29
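The four outputs above are consistent with dplyr's four mutating joins applied to tabely and tabelz on the ID column, in the order left, right, inner, full. The calls are not shown in the text, so the sketch below is a reconstruction under that assumption, and the result names are hypothetical.

```r
library(dplyr)

tabely <- tribble(~ID, ~y, "A", 5, "B", 5, "C", 8, "D", 0, "F", 9)
tabelz <- tribble(~ID, ~z, "A", 30, "B", 21, "C", 22, "D", 25, "E", 29)

kiri  <- left_join(tabely, tabelz, by = "ID")   # keeps every row of tabely
kanan <- right_join(tabely, tabelz, by = "ID")  # keeps every row of tabelz
dalam <- inner_join(tabely, tabelz, by = "ID")  # keeps only IDs found in both
penuh <- full_join(tabely, tabelz, by = "ID")   # keeps every ID from either table
```

Unmatched rows are filled with NA, which is why F has no z value and E has no y value.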
# join on two key columns
tabel3 <- tribble(
~ID, ~year, ~items,
"A", 2015,3,
"A", 2016,7,
"A", 2017,6,
"B", 2015,4,
"B", 2016,8,
"B", 2017,7,
"C", 2015,4,
"C", 2016,6,
"C", 2017,6)
tabel4 <- tribble(
~ID, ~year, ~prices,
"A", 2015,900,
"A", 2016,850,
"A", 2017,1200,
"B", 2015,1300,
"B", 2016,1400,
"B", 2017,600,
"C", 2015,1500,
"C", 2016,1500,
"C", 2017,1300)
## Warning in left_join(tabel3, tabel4, by = "ID"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## # A tibble: 27 × 5
## ID year.x items year.y prices
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A 2015 3 2015 900
## 2 A 2015 3 2016 850
## 3 A 2015 3 2017 1200
## 4 A 2016 7 2015 900
## 5 A 2016 7 2016 850
## 6 A 2016 7 2017 1200
## 7 A 2017 6 2015 900
## 8 A 2017 6 2016 850
## 9 A 2017 6 2017 1200
## 10 B 2015 4 2015 1300
## # ℹ 17 more rows
## # A tibble: 9 × 4
## ID year items prices
## <chr> <dbl> <dbl> <dbl>
## 1 A 2015 3 900
## 2 A 2016 7 850
## 3 A 2017 6 1200
## 4 B 2015 4 1300
## 5 B 2016 8 1400
## 6 B 2017 7 600
## 7 C 2015 4 1500
## 8 C 2016 6 1500
## 9 C 2017 6 1300
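The warning and the 27-row result come from joining on ID alone: each ID occurs three times in both tables, so every row matches three rows on the other side. Joining on both ID and year pairs each row with exactly one partner, which gives the 9-row result. The two-key call itself is not shown in the text; a sketch of it on hypothetical three-row excerpts (`t3`, `t4`) of the tables:

```r
library(dplyr)

# hypothetical excerpts of tabel3 and tabel4
t3 <- tribble(~ID, ~year, ~items,  "A", 2015, 3,   "A", 2016, 7,   "B", 2015, 4)
t4 <- tribble(~ID, ~year, ~prices, "A", 2015, 900, "A", 2016, 850, "B", 2015, 1300)

# both ID and year must match for two rows to be paired
gabung2 <- left_join(t3, t4, by = c("ID", "year"))
```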
Tidyr
tidyr is a package for tidying data. Tidy data makes exploration, visualization, and modeling easier because it follows a fixed structure. Tidy data has three principles: each variable has its own column, each observation has its own row, and each value has its own cell.
Pivot longer
# build a tibble from a cross-tabulation
cases <- tribble(
~Country, ~"2011", ~"2012", ~"2013",
"France", 7000, 6900, 7000,
"Denmark", 5800, 6000, 6200,
"USA", 15000, 14000, 13000,
)
cases
## # A tibble: 3 × 4
## Country `2011` `2012` `2013`
## <chr> <dbl> <dbl> <dbl>
## 1 France 7000 6900 7000
## 2 Denmark 5800 6000 6200
## 3 USA 15000 14000 13000
# reshape the cross-tabulation above into long format,
# turning the year columns into rows
cases %>%
pivot_longer(-Country, names_to = "Year", values_to = "frequency")
## # A tibble: 9 × 3
## Country Year frequency
## <chr> <chr> <dbl>
## 1 France 2011 7000
## 2 France 2012 6900
## 3 France 2013 7000
## 4 Denmark 2011 5800
## 5 Denmark 2012 6000
## 6 Denmark 2013 6200
## 7 USA 2011 15000
## 8 USA 2012 14000
## 9 USA 2013 13000
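In the result above, Year is a character column because its values come from column names. When a numeric year is wanted, pivot_longer() can convert it through its names_transform argument. A sketch using the same cases table; the result name `panjang` is hypothetical.

```r
library(tibble)
library(tidyr)

cases <- tribble(
  ~Country, ~"2011", ~"2012", ~"2013",
  "France",   7000,  6900,  7000,
  "Denmark",  5800,  6000,  6200,
  "USA",     15000, 14000, 13000
)

panjang <- pivot_longer(
  cases, -Country,
  names_to = "Year", values_to = "frequency",
  names_transform = list(Year = as.integer)  # convert the year names to integers
)
```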
Pivot wider
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
# spread rows into columns:
# each value of type (cases and population) becomes its own column
penduduk %>%
pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
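The penduduk tibble is used above without being defined. Its printed contents match table2, an example dataset shipped with tidyr, so the sketch below assumes penduduk is a copy of it. If the original came from another source, only the first assignment changes.

```r
library(tidyr)

penduduk <- table2  # assumption: same contents as tidyr's table2

# one column per value of type, filled from count
lebar <- pivot_wider(penduduk, names_from = type, values_from = count)
```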
Separate
# create a tibble
stocks <- tibble(
region = c("Asia Tenggara", "Asia Tenggara", "Eropa Tengah", "Eropa Tengah", "Amerika Barat", "Amerika Barat", "Asia Timur"),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks
## # A tibble: 7 × 3
## region qtr return
## <chr> <dbl> <dbl>
## 1 Asia Tenggara 1 1.88
## 2 Asia Tenggara 2 0.59
## 3 Eropa Tengah 3 0.35
## 4 Eropa Tengah 4 NA
## 5 Amerika Barat 2 0.92
## 6 Amerika Barat 3 0.17
## 7 Asia Timur 4 2.66
# split region into two columns, Benua (continent) and Bagian (part)
pisah <- stocks %>%
separate(region, into = c("Benua", "Bagian"))
pisah
## # A tibble: 7 × 4
## Benua Bagian qtr return
## <chr> <chr> <dbl> <dbl>
## 1 Asia Tenggara 1 1.88
## 2 Asia Tenggara 2 0.59
## 3 Eropa Tengah 3 0.35
## 4 Eropa Tengah 4 NA
## 5 Amerika Barat 2 0.92
## 6 Amerika Barat 3 0.17
## 7 Asia Timur 4 2.66
Unite
# combine the Benua and Bagian columns into a single Region column
gabung <- pisah %>%
unite(Region, Benua, Bagian, sep=" ")
gabung
## # A tibble: 7 × 3
## Region qtr return
## <chr> <dbl> <dbl>
## 1 Asia Tenggara 1 1.88
## 2 Asia Tenggara 2 0.59
## 3 Eropa Tengah 3 0.35
## 4 Eropa Tengah 4 NA
## 5 Amerika Barat 2 0.92
## 6 Amerika Barat 3 0.17
## 7 Asia Timur 4 2.66
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
# split rate into two columns, cases and population, using "/" as the separator
kasus <- table3 %>%
separate(rate, into = c("cases", "population"), sep="/")
kasus
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
# split rate into two columns, cases and population, using "/" as the separator
# convert = TRUE lets separate() infer the types of the new columns
kasus2 <- table3 %>%
separate(rate, into = c("cases", "population"), sep="/", convert=TRUE)
kasus2
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
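separate() still works, but since tidyr 1.3.0 it is superseded by separate_wider_delim() and related functions. The newer function has no convert argument, so the type change is a separate step. A sketch, assuming table3 is the example dataset shipped with tidyr (which matches the printout above); the name `kasus3` is hypothetical.

```r
library(dplyr)
library(tidyr)

kasus3 <- table3 %>%
  # split rate at "/" into two character columns
  separate_wider_delim(rate, delim = "/", names = c("cases", "population")) %>%
  # convert both new columns to integer explicitly
  mutate(across(c(cases, population), as.integer))
```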
forcats
forcats is a package for working with categorical data (factors): inspecting the levels of a factor, reordering them, renaming them, merging them, and so on.
## # A tibble: 200 × 9
## id `degree level` `family size` gender age `marital status`
## <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 2 university 5 male 36 married
## 2 3 senior_high 4 female 19 single
## 3 4 university 3 female 23 married
## 4 5 university 3 female 28 married
## 5 6 university 6 female 26 single
## 6 7 university 5 female 27 single
## 7 8 university 8 female 28 single
## 8 9 university 3 male 29 married
## 9 10 diploma 3 female 20 single
## 10 11 university 5 female 40 single
## # ℹ 190 more rows
## # ℹ 3 more variables: `residence type` <chr>, `work type` <chr>,
## # `annual salary` <dbl>
# take a subset with the variables id, degree level, and annual salary
datagaji <- gaji %>%
select(id, `degree level`, `annual salary`)
glimpse(datagaji)
## Rows: 200
## Columns: 3
## $ id <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ `degree level` <chr> "university", "senior_high", "university", "university…
## $ `annual salary` <dbl> 330650, 79000, 156750, 349000, 59000, 49000, 79200, 49…
# convert degree level to a factor
datagaji$`degree level` <- as_factor(datagaji$`degree level`)
glimpse(datagaji)
## Rows: 200
## Columns: 3
## $ id <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ `degree level` <fct> university, senior_high, university, university, unive…
## $ `annual salary` <dbl> 330650, 79000, 156750, 349000, 59000, 49000, 79200, 49…
## [1] "university" "senior_high" "diploma" "school"
fct_count
## # A tibble: 4 × 2
## f n
## <fct> <int>
## 1 university 119
## 2 senior_high 61
## 3 diploma 17
## 4 school 3
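The level listing and the frequency table above appear without their code. They are consistent with levels() and fct_count() applied to the factor column. Note that as_factor() orders levels by first appearance, whereas base factor() sorts them alphabetically. A sketch on a hypothetical vector, `pendidikan`:

```r
library(forcats)

# hypothetical values, not the original data
pendidikan <- c("university", "senior_high", "university", "diploma")

f <- as_factor(pendidikan)
levels(f)                    # levels in order of first appearance
levels(factor(pendidikan))   # levels in alphabetical order

jumlah <- fct_count(f, sort = TRUE)  # frequency per level, most frequent first
```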
# count the frequency of each level of degree level, using the pipe %>%
datagaji$`degree level` %>%
fct_count()
## # A tibble: 4 × 2
## f n
## <fct> <int>
## 1 university 119
## 2 senior_high 61
## 3 diploma 17
## 4 school 3
fct_relevel
# reorder the levels of degree level as desired
datagaji$`degree level` <- datagaji$`degree level` %>%
fct_relevel(c("school", "senior_high", "diploma", "university"))
levels(datagaji$`degree level`)
## [1] "school" "senior_high" "diploma" "university"
fct_rev
# reverse the order of the levels of degree level
datagaji$`degree level` <- datagaji$`degree level` %>%
fct_rev()
levels(datagaji$`degree level`)
## [1] "university" "diploma" "senior_high" "school"
fct_expand
# add the level S3 to degree level
datagaji$`degree level` <- datagaji$`degree level` %>%
fct_expand("S3")
levels(datagaji$`degree level`)
## [1] "university" "diploma" "senior_high" "school" "S3"
fct_drop
# note the subset below: filtering out rows does not remove the unused levels
pilih <- datagaji %>%
filter(`degree level`!= "university")
levels(pilih$`degree level`)
## [1] "university" "diploma" "senior_high" "school" "S3"
## # A tibble: 5 × 2
## f n
## <fct> <int>
## 1 university 119
## 2 diploma 17
## 3 senior_high 61
## 4 school 3
## 5 S3 0
# drop the unused levels of degree level
datagaji$`degree level` <- datagaji$`degree level` %>%
fct_drop()
levels(datagaji$`degree level`)
## [1] "university" "diploma" "senior_high" "school"
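The call above removes the empty S3 level from datagaji itself, while the filtered subset pilih still carries its unused levels. fct_drop() can be applied to a subset in the same way. A sketch on a hypothetical factor:

```r
library(forcats)

f <- factor(c("university", "diploma", "school"))
sisa <- f[f != "university"]   # subsetting keeps all three levels

levels(sisa)                   # still lists university
levels(fct_drop(sisa))         # only the levels that still occur
```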
fct_collapse
# merge the levels school and senior_high into a single level, other
datagaji$`degree level` <- datagaji$`degree level` %>%
fct_collapse(other = c("school", "senior_high"))
levels(datagaji$`degree level`)
## [1] "university" "diploma" "other"
fct_recode
# rename the level diploma to D3 and university to S1
datagaji$`degree level` <- datagaji$`degree level` %>%
fct_recode(D3 = 'diploma', S1 = 'university' )
levels(datagaji$`degree level`)
## [1] "S1" "D3" "other"
## # A tibble: 200 × 3
## id `degree level` `annual salary`
## <dbl> <fct> <dbl>
## 1 2 S1 330650
## 2 3 other 79000
## 3 4 S1 156750
## 4 5 S1 349000
## 5 6 S1 59000
## 6 7 S1 49000
## 7 8 S1 79200
## 8 9 S1 499000
## 9 10 D3 424150
## 10 11 S1 160300
## # ℹ 190 more rows
lubridate
lubridate is a package for working with dates and times. It can also perform arithmetic on them.
Now
## [1] "2025-04-28 03:15:08 CEST"
ymd
## [1] "2021-10-04"
## [1] "2021-10-04"
## [1] "2021-10-04"
## Warning: All formats failed to parse. No formats found.
## [1] NA
## [1] "2021-10-04"
## [1] "2021-10-04"
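The outputs above are shown without their calls. ymd() parses input written in year-month-day order whatever the separator, and it also accepts numbers; input that cannot be read in that order yields NA with the warning shown. A sketch with inputs chosen for illustration, not necessarily those used originally:

```r
library(lubridate)

a  <- ymd("2021-10-04")
b  <- ymd("2021/10/04")
c3 <- ymd("20211004")
d  <- ymd(20211004)      # numeric input is accepted too
```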
# parse text written in year-month-day hour:minute:second order
ymd_hms("2021-10-04 19:30:59")
## [1] "2021-10-04 19:30:59 UTC"
## [1] "2021-10-04 19:30:59 WITA"
mdy
## [1] "2021-10-04"
## [1] "2021-10-04"
## [1] "2021-10-04"
## [1] "2021-10-04"
## [1] "2021-10-04"
## [1] "2021-10-04 19:30:59 UTC"
## [1] "2021-10-04 19:30:59 WIB"
dmy
## [1] "2021-10-04"
## [1] "2021-10-04"
## [1] "2021-10-04"
## [1] "2021-10-04"
## [1] "2021-10-04"
## [1] "2021-10-04 19:30:59 UTC"
## [1] "2021-10-04 19:30:59 WIT"
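mdy() and dmy() work like ymd() but expect month-day-year and day-month-year order. The _hms variants take a tz argument; the WITA, WIB, and WIT labels in the outputs are Indonesian time zones. The sketch below assumes the IANA zone names Asia/Makassar, Asia/Jakarta, and Asia/Jayapura for them, and the input strings are illustrative.

```r
library(lubridate)

t1 <- mdy("10/04/2021")   # month first
t2 <- dmy("04-10-2021")   # day first

# the same clock time, interpreted in three different zones
wita <- ymd_hms("2021-10-04 19:30:59", tz = "Asia/Makassar")
wib  <- mdy_hms("10-04-2021 19:30:59", tz = "Asia/Jakarta")
wit  <- dmy_hms("04-10-2021 19:30:59", tz = "Asia/Jayapura")
```

Without tz, the parsers return UTC, as in the first ymd_hms() output.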
Extracting date-time components
## [1] "2021-10-04 19:30:59 UTC"
## [1] 4
## [1] Mon
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
## [1] Oct
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
## [1] 2021
## [1] 19
## [1] 30
## [1] 59
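The values above are the components of a single date-time, obtained with lubridate's accessor functions. The calls are not shown, so the following is a sketch of what they presumably look like:

```r
library(lubridate)

datetime <- ymd_hms("2021-10-04 19:30:59")

day(datetime)                   # day of the month
wday(datetime, label = TRUE)    # weekday as an ordered factor
month(datetime, label = TRUE)   # month as an ordered factor
year(datetime)
hour(datetime)
minute(datetime)
second(datetime)
```

The weekday and month labels depend on the session locale; without label = TRUE both functions return numbers.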
Date-time arithmetic
## [1] "2016-10-04 19:30:59 UTC"
## [1] "2021-11-04 19:30:59 UTC"
## [1] "2021-10-11 19:30:59 UTC"
## [1] "2021-10-04 15:30:59 UTC"
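# add or subtract durations, which are exact lengths of time measured in seconds
# (dyears(1) is 365.25 days, so the clock time can shift)
datetime - dyears(5) # 5 years earlier, measured as a duration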
## [1] "2016-10-04 13:30:59 UTC"
## [1] "2021-11-04 06:00:59 UTC"
## [1] "2021-10-11 19:30:59 UTC"
## [1] "2021-10-04 15:30:59 UTC"
## [1] NA
## [1] "2021-03-01 18:41:59 UTC"
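The two groups of outputs differ because lubridate has two kinds of time span. Periods (years(), months(), and so on) follow the calendar and keep the clock time; durations (dyears(), dmonths(), and so on) are fixed numbers of seconds. Adding a period can also land on a date that does not exist, which gives NA; the %m+% operator rolls back to the last valid day instead. A sketch, with the last two inputs chosen for illustration:

```r
library(lubridate)

datetime <- ymd_hms("2021-10-04 19:30:59")

p <- datetime - years(5)    # calendar arithmetic: same clock time
d <- datetime - dyears(5)   # 5 * 365.25 days: clock time shifts by 6 hours

tidak_ada <- ymd("2021-01-31") + months(1)     # 31 February does not exist, so NA
mundur    <- ymd("2021-01-31") %m+% months(1)  # rolls back to 2021-02-28
```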
Date data example 1
# read the file contohtanggal.csv
datatgl <- read_csv("D:/Download/Data/contohtanggal.csv")
datatgl
## # A tibble: 4 × 1
## Tanggal
## <chr>
## 1 14/02/2008
## 2 25/12/2013
## 3 19/09/2017
## 4 23/06/2021
## # A tibble: 4 × 1
## Tanggal
## <date>
## 1 2008-02-14
## 2 2013-12-25
## 3 2017-09-19
## 4 2021-06-23
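Between the two printouts above, Tanggal changes from character to date. The conversion step is not shown; because the strings are in day/month/year order, dmy() inside mutate() would do it. A sketch that rebuilds the four values shown instead of reading the CSV file:

```r
library(dplyr)
library(lubridate)

# the four strings from the printout above
datatgl <- tibble(Tanggal = c("14/02/2008", "25/12/2013", "19/09/2017", "23/06/2021"))

# parse day/month/year text into Date values
datatgl <- datatgl %>%
  mutate(Tanggal = dmy(Tanggal))
```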
# extract the day, month, and year
datatgl %>% mutate(Tgl=day(Tanggal),
Bulan=month(Tanggal),
Tahun=year(Tanggal))
## # A tibble: 4 × 4
## Tanggal Tgl Bulan Tahun
## <date> <int> <dbl> <dbl>
## 1 2008-02-14 14 2 2008
## 2 2013-12-25 25 12 2013
## 3 2017-09-19 19 9 2017
## 4 2021-06-23 23 6 2021
# extract the weekday name, month name, and year
datatgl %>% mutate(Hari=wday(Tanggal, label=TRUE),
Bulan=month(Tanggal, label=TRUE),
Tahun=year(Tanggal))
## # A tibble: 4 × 4
## Tanggal Hari Bulan Tahun
## <date> <ord> <ord> <dbl>
## 1 2008-02-14 Thu Feb 2008
## 2 2013-12-25 Wed Dec 2013
## 3 2017-09-19 Tue Sep 2017
## 4 2021-06-23 Wed Jun 2021
Date data example 2
## Rows: 19 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): tahun, bulan, tanggal, jam, menit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 19 × 5
## tahun bulan tanggal jam menit
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 6 30 9 40
## 2 2013 5 7 16 57
## 3 2013 12 8 8 59
## 4 2013 5 14 18 41
## 5 2013 7 21 11 2
## 6 2013 1 1 18 17
## 7 2013 12 9 12 59
## 8 2013 8 13 19 20
## 9 2013 9 26 7 25
## 10 2013 4 30 13 23
## 11 2013 6 17 9 40
## 12 2013 11 22 13 20
## 13 2013 4 26 8 9
## 14 2013 3 25 20 54
## 15 2013 10 21 12 17
## 16 2013 1 23 20 24
## 17 2013 2 8 6 44
## 18 2013 8 5 7 57
## 19 2013 10 21 8 59
## # A tibble: 19 × 6
## tahun bulan tanggal jam menit Tgl
## <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 2013 6 30 9 40 2013-06-30
## 2 2013 5 7 16 57 2013-05-07
## 3 2013 12 8 8 59 2013-12-08
## 4 2013 5 14 18 41 2013-05-14
## 5 2013 7 21 11 2 2013-07-21
## 6 2013 1 1 18 17 2013-01-01
## 7 2013 12 9 12 59 2013-12-09
## 8 2013 8 13 19 20 2013-08-13
## 9 2013 9 26 7 25 2013-09-26
## 10 2013 4 30 13 23 2013-04-30
## 11 2013 6 17 9 40 2013-06-17
## 12 2013 11 22 13 20 2013-11-22
## 13 2013 4 26 8 9 2013-04-26
## 14 2013 3 25 20 54 2013-03-25
## 15 2013 10 21 12 17 2013-10-21
## 16 2013 1 23 20 24 2013-01-23
## 17 2013 2 8 6 44 2013-02-08
## 18 2013 8 5 7 57 2013-08-05
## 19 2013 10 21 8 59 2013-10-21
# combine the date and time parts into one date-time
tanggalku <- datatgl2 %>% mutate(waktu=make_datetime(year=tahun,month=bulan,day=tanggal, hour = jam, min=menit))
tanggalku
## # A tibble: 19 × 6
## tahun bulan tanggal jam menit waktu
## <dbl> <dbl> <dbl> <dbl> <dbl> <dttm>
## 1 2013 6 30 9 40 2013-06-30 09:40:00
## 2 2013 5 7 16 57 2013-05-07 16:57:00
## 3 2013 12 8 8 59 2013-12-08 08:59:00
## 4 2013 5 14 18 41 2013-05-14 18:41:00
## 5 2013 7 21 11 2 2013-07-21 11:02:00
## 6 2013 1 1 18 17 2013-01-01 18:17:00
## 7 2013 12 9 12 59 2013-12-09 12:59:00
## 8 2013 8 13 19 20 2013-08-13 19:20:00
## 9 2013 9 26 7 25 2013-09-26 07:25:00
## 10 2013 4 30 13 23 2013-04-30 13:23:00
## 11 2013 6 17 9 40 2013-06-17 09:40:00
## 12 2013 11 22 13 20 2013-11-22 13:20:00
## 13 2013 4 26 8 9 2013-04-26 08:09:00
## 14 2013 3 25 20 54 2013-03-25 20:54:00
## 15 2013 10 21 12 17 2013-10-21 12:17:00
## 16 2013 1 23 20 24 2013-01-23 20:24:00
## 17 2013 2 8 6 44 2013-02-08 06:44:00
## 18 2013 8 5 7 57 2013-08-05 07:57:00
## 19 2013 10 21 8 59 2013-10-21 08:59:00
## # A tibble: 19 × 7
## tahun bulan tanggal jam menit waktu waktu10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dttm> <dttm>
## 1 2013 6 30 9 40 2013-06-30 09:40:00 2012-08-30 00:40:00
## 2 2013 5 7 16 57 2013-05-07 16:57:00 2012-07-07 07:57:00
## 3 2013 12 8 8 59 2013-12-08 08:59:00 2013-02-06 23:59:00
## 4 2013 5 14 18 41 2013-05-14 18:41:00 2012-07-14 09:41:00
## 5 2013 7 21 11 2 2013-07-21 11:02:00 2012-09-20 02:02:00
## 6 2013 1 1 18 17 2013-01-01 18:17:00 2012-03-03 09:17:00
## 7 2013 12 9 12 59 2013-12-09 12:59:00 2013-02-08 03:59:00
## 8 2013 8 13 19 20 2013-08-13 19:20:00 2012-10-13 10:20:00
## 9 2013 9 26 7 25 2013-09-26 07:25:00 2012-11-25 22:25:00
## 10 2013 4 30 13 23 2013-04-30 13:23:00 2012-06-30 04:23:00
## 11 2013 6 17 9 40 2013-06-17 09:40:00 2012-08-17 00:40:00
## 12 2013 11 22 13 20 2013-11-22 13:20:00 2013-01-22 04:20:00
## 13 2013 4 26 8 9 2013-04-26 08:09:00 2012-06-25 23:09:00
## 14 2013 3 25 20 54 2013-03-25 20:54:00 2012-05-25 11:54:00
## 15 2013 10 21 12 17 2013-10-21 12:17:00 2012-12-21 03:17:00
## 16 2013 1 23 20 24 2013-01-23 20:24:00 2012-03-25 11:24:00
## 17 2013 2 8 6 44 2013-02-08 06:44:00 2012-04-09 21:44:00
## 18 2013 8 5 7 57 2013-08-05 07:57:00 2012-10-04 22:57:00
## 19 2013 10 21 8 59 2013-10-21 08:59:00 2012-12-20 23:59:00
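Two printouts in this example appear without code: the one with the Tgl column and the one with the waktu10 column. Tgl can be built with make_date(). The waktu10 values equal waktu minus a duration of ten months (dmonths(10), about 304.4 days), which is why the clock time also shifts by nine hours; that reading is inferred from the numbers and is not stated in the text. A sketch on the first two rows of the data:

```r
library(dplyr)
library(lubridate)

# first two rows of the printout above
datatgl2 <- tibble(tahun = 2013, bulan = c(6, 5), tanggal = c(30, 7),
                   jam = c(9, 16), menit = c(40, 57))

hasil <- datatgl2 %>%
  mutate(Tgl = make_date(year = tahun, month = bulan, day = tanggal),
         waktu = make_datetime(year = tahun, month = bulan, day = tanggal,
                               hour = jam, min = menit),
         waktu10 = waktu - dmonths(10))  # assumption: duration of ten months
```

Using months(10) instead would move the date back by ten calendar months and leave the clock time unchanged.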