Pengenalan Tidyverse

Package

Pakcage yang akan digunakan :

library(dplyr)
library(knitr)
library(kableExtra)

Dataset

Dataset yang digunakan untuk praktikum data wrangling adalah mtcars

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

dataset = mtcars
kbl(head(mtcars), caption = "Data mtcars") %>%  kable_styling()
Data mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

Data Wrangling

Pada bagian ini akan dilakukan praktik penggunaan fungsi fungsi summarise(), arrange(), filter(),mutate(), select()

summarise()

creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.

Pada contoh ini data akan diringkas untuk menghitung rata-rata jumlah cylinder (cyl) dan total jumlah carburator (carb) untuk automobile dengan tranmisi automatic dan manua; (am)

summarised_dataset = dataset %>%  group_by(am) %>%  summarise(mean_cyl = mean(cyl), total_carb = sum(carb))
summarised_dataset %>% kbl(caption = "Summarised Dataset") %>%  kable_styling()
Summarised Dataset
am mean_cyl total_carb
0 6.947368 52
1 5.076923 38

arrange()

orders the rows of a data frame by the values of selected columns.

Pada contoh berikut ini akan diurutkan mobil dengan penggunaan bahan bakar dari yang paling kecil ke yang paling besar

arranged_dataset = dataset %>%  arrange(mpg)
arranged_dataset%>%  kbl(caption = "Arranged Dataset") %>%  kable_styling() %>%  column_spec(2, bold = T, background = "#FA5754") %>%  scroll_box(height = "400px")
Arranged Dataset
mpg cyl disp hp drat wt qsec vs am gear carb
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

filter()

used to subset a data frame, retaining all rows that satisfy your conditions.

Untuk fungsi filter(), akan dilakukan pemilihan mobil dengan berat (wt) diatas 4000 lbs,

filtered_dataset = dataset %>%  filter(wt >4)
filtered_dataset %>% kbl(caption = "Filtered Dataset") %>% kable_styling()  %>%  column_spec(7, bold = T, background = "#FA5754")
Filtered Dataset
mpg cyl disp hp drat wt qsec vs am gear carb
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4

mutate()

adds new variables and preserves existing ones;

Untuk fungsi mutate(), akan dilakukan pembentukan dua variable yaitu : - wt_to_cyl = berat dibagi dengan jumlah cylinders - vs_str = yaitu string version dari variable vs.

mutated_dataset = dataset %>% mutate(wt_to_cyl = wt/cyl, vs_str = ifelse(vs == 0 , "V-shaped","straight"))
mutated_dataset %>% kbl(caption = "Mutated Dataset") %>% kable_styling()  %>%  column_spec(12, bold = T, background = "#FA5754") %>%  column_spec(13, bold = T, background = "#009CD9") %>% scroll_box(height = "400px")
Mutated Dataset
mpg cyl disp hp drat wt qsec vs am gear carb wt_to_cyl vs_str
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 0.4366667 V-shaped
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 0.4791667 V-shaped
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 0.5800000 straight
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0.5358333 straight
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 0.4300000 V-shaped
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 0.5766667 straight
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 0.4462500 V-shaped
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 0.7975000 straight
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 0.7875000 straight
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 0.5733333 straight
17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 0.5733333 straight
16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 0.5087500 V-shaped
17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 0.4662500 V-shaped
15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 0.4725000 V-shaped
10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 0.6562500 V-shaped
10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 0.6780000 V-shaped
14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 0.6681250 V-shaped
32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 0.5500000 straight
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 0.4037500 straight
33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 0.4587500 straight
21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 0.6162500 straight
15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 0.4400000 V-shaped
15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 0.4293750 V-shaped
13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 0.4800000 V-shaped
19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 0.4806250 V-shaped
27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 0.4837500 straight
26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 0.5350000 V-shaped
30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 0.3782500 straight
15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 0.3962500 V-shaped
19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 0.4616667 V-shaped
15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 0.4462500 V-shaped
21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 0.6950000 straight

select()

Select (and optionally rename) variables in a data frame, using a concise mini-language that makes it easy to refer to variables based on their name (e.g. a:f selects all columns from a on the left to f on the right).

Select non-black column
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Untuk fungsi select akan dipilih 4 variabel = hp, wt, am ,carb

selected_dataset = dataset %>% select(hp,wt,am, carb)
selected_dataset %>% kbl(caption = "Mutated Dataset") %>% kable_styling() %>%  scroll_box(height = "200px") 
Mutated Dataset
hp wt am carb
Mazda RX4 110 2.620 1 4
Mazda RX4 Wag 110 2.875 1 4
Datsun 710 93 2.320 1 1
Hornet 4 Drive 110 3.215 0 1
Hornet Sportabout 175 3.440 0 2
Valiant 105 3.460 0 1
Duster 360 245 3.570 0 4
Merc 240D 62 3.190 0 2
Merc 230 95 3.150 0 2
Merc 280 123 3.440 0 4
Merc 280C 123 3.440 0 4
Merc 450SE 180 4.070 0 3
Merc 450SL 180 3.730 0 3
Merc 450SLC 180 3.780 0 3
Cadillac Fleetwood 205 5.250 0 4
Lincoln Continental 215 5.424 0 4
Chrysler Imperial 230 5.345 0 4
Fiat 128 66 2.200 1 1
Honda Civic 52 1.615 1 2
Toyota Corolla 65 1.835 1 1
Toyota Corona 97 2.465 0 1
Dodge Challenger 150 3.520 0 2
AMC Javelin 150 3.435 0 2
Camaro Z28 245 3.840 0 4
Pontiac Firebird 175 3.845 0 2
Fiat X1-9 66 1.935 1 1
Porsche 914-2 91 2.140 1 2
Lotus Europa 113 1.513 1 2
Ford Pantera L 264 3.170 1 4
Ferrari Dino 175 2.770 1 6
Maserati Bora 335 3.570 1 8
Volvo 142E 109 2.780 1 2

cook it all !

Seluruh fungsi diatas dapat digunakan secara bersamaan, potongan kode di bawah akan menerapkan :

  • memilih colom cyl,disp,hp,wt
  • filter Displacement (disp) lebih dari 1000
  • mengelompokkan data berdasarkan jumlah silinder
  • menghitung total untuk variable disp,hp, dan wt
  • mengurutkan data berdasarkan variable wt
  • terakhir, mengembalikan variabel wt ke skala awal (1000 lbs)
wrangled_dataset = dataset %>% select(cyl,disp,hp,wt) %>% filter(disp >100) %>% group_by(cyl) %>%  summarise(total_disp = sum(disp),total_hp = sum(hp),total_wt = sum(wt)) %>% arrange(total_wt) %>% mutate( wt_1000 = total_wt*1000)
wrangled_dataset %>% kbl(caption = "Wrangled Dataset") %>% kable_styling()  
Wrangled Dataset
cyl total_disp total_hp total_wt wt_1000
4 756.9 547 16.045 16045
6 1283.2 856 21.820 21820
8 4943.4 2929 55.989 55989