Dalam prinsip data management, terdapat satu proses bernama data pre-processing. Dalam proses ini, data yang berukuran besar sering kali dikemas dalam format yang tidak terstruktur sehingga memerlukan pengolahan lebih dulu sebelum dilakukan analisis dan pemodelan. Untuk memudahkan proses selanjutnya, data terlebih dahulu diproses dengan mekanisme data wrangling/munging. Data wrangling dan data munging memiliki arti yang sama, merujuk pada makna data cleaning. Istilah ini diartikan sebagai proses mentransformasi atau menyiapkan data menjadi bentuk yang lebih rapi dan sistematis dan siap untuk dilakukan analisis. Sebuah dataset bisa memiliki bentuk memanjang atau melebar menyesuaikan kebutuhan analisis. Hal ini dapat dilakukan dengan menggunakan teknik data wrangling. Data wrangling bisa dilakukan secara manual dengan memanipulasi dataframe, atau melalui tibble den menggunakan package tidyverse. Dalam data munging/wrangling, dilakukan data discovering > structuring > cleaning > enriching > validating > publishing.
Meurpakan proses pembersihan, penataan, dan pengayaan data mentah ke dalam format yang diinginkan untuk menghasilkan pengambilan keputusan yang lebih baik dalam waktu yang lebih singkat. Data wrangling diterapkan dengan melakukan kegiatan-kegiatan sebagai berikut: 1. Membuat kolom/variabel baru dalam data frame 2. Subsetting atau memilih sebagian baris/kolom dari data 3. Sorting atau mengurutkan data 4. Recoding atau mengodekan ulang nilai-nilai dalam data 5. Merging atau menggabungkan data, meliputi penggabungan baris atau kolom. 6. Reshaping atau memformat ulang data, bisa menjadi wide format atau long format.
Bagaimana menyusun data dalam format yang lebih terstruktur, dimana setiap variabel membentuk sebuah kolom, setiap observasi membentuk sebuah baris, dan setiap sel merupakan pengukuran tunggal. Tidy data bisa jadi tidak tercapai (menjadi messy data) ketika beberapa kondisi, seperti adanya: 1. header kolom merupakan value 2. beberapa variabel disimpan dalam satu kolom 3. variabel tersimpan di baris dan kolom 4. observasi tunggal tersimpan di beberapa baris 5. baris tunggal menyimpan beberapa observasi Adanya data messy ini akan menghambat proses analisis selanjutnya.
Data ini terdiri dari 7 kolom dengan 1338 baris, dengan rincian variabel sebagai berikut:
| Variable | Description |
|---|---|
| age | umur penerima utama asuransi (tahun) |
| sex | jenis kelamin penerima asuransi (female/male) |
| bmi | body mass index (indeks massa tubuh), yang merupakan hasil bagi berat badan dengan kuadrat tinggi badan |
| children | banyak anak yang ditanggung asuransi |
| smoker | status perokok (ya/tidak) |
| region | daerah domisili penerima asuransi di Amerika Serikat, meliputi: northeast, southeast, southwest, dan northwest |
| charges | biaya kesehatan individu yang ditanggung pihak asuransi ($) |
Data diperoleh dari Kaggle.com melalui link link berupa data biaya kesehatan pemilik asuransi yang diperoleh dari buku Machine Learning with R oleh Brett Lantz.
dataset.df <- read.csv("/Users/User/Documents/RFiles/datasets/insurance.csv")
# menampilkan 6 record teratas
head(dataset.df)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
# menampilkan 6 record terbawah
tail(dataset.df)
## age sex bmi children smoker region charges
## 1333 52 female 44.70 3 no southwest 11411.685
## 1334 50 male 30.97 3 no northwest 10600.548
## 1335 18 female 31.92 0 no northeast 2205.981
## 1336 18 female 36.85 0 no southeast 1629.833
## 1337 21 female 25.80 0 no southwest 2007.945
## 1338 61 female 29.07 0 yes northwest 29141.360
#periksa dimensi
dim(dataset.df)
## [1] 1338 7
# menghitung nilai na
sum(is.na(dataset.df))
## [1] 0
# menghitung nilai null
sum(is.null(dataset.df))
## [1] 0
# menampilkan ringkasan data
summary(dataset.df)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
# melihat struktur data
str(dataset.df)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
Bisa dengan operator [, ], [[, ]], dan $.
# menyeleksi value kolom pertama pada record 2 sampai 7
dataset.df[2:7,1]
## [1] 18 28 33 32 31 46
# menyeleksi value record ke-3 pada kolom ke 2 sampai 5
dataset.df[3, 2:5]
## sex bmi children smoker
## 3 male 33 3 no
# menyeleksi record ke-2 semua kolom
dataset.df[2,]
## age sex bmi children smoker region charges
## 2 18 male 33.77 1 no southeast 1725.552
# menyeleksi 5 record teratas pada kolom sex
dataset.df[1:5,]$sex
## [1] "female" "male" "male" "male" "male"
# menyeleksi baris ke-3 sampai 5 dari kolom age, sex, bmi, dan region
dataset.df[3:5, c("age", "sex", "bmi", "region")]
## age sex bmi region
## 3 28 male 33.000 southeast
## 4 33 male 22.705 northwest
## 5 32 male 28.880 northwest
# menyeleksi berdasarkan kriteria tertentu
dataset.df[dataset.df$age > 63,]
## age sex bmi children smoker region charges
## 63 64 male 24.700 1 no northwest 30166.62
## 95 64 female 31.300 2 yes southwest 47291.06
## 200 64 female 39.330 0 no northeast 14901.52
## 329 64 female 33.800 1 yes southwest 47928.03
## 336 64 male 34.500 0 no southwest 13822.80
## 379 64 female 30.115 3 no northwest 16455.71
## 399 64 male 25.600 2 no southwest 14988.43
## 403 64 female 32.965 0 no northwest 14692.67
## 419 64 male 39.160 1 no southeast 14418.28
## 421 64 male 33.880 0 yes southeast 46889.26
## 535 64 male 40.480 0 no southeast 13831.12
## 604 64 female 39.050 3 no southeast 16085.13
## 636 64 male 38.190 0 no northeast 14410.93
## 665 64 female 22.990 0 yes southeast 27037.91
## 753 64 male 37.905 0 no northwest 14210.54
## 769 64 female 39.700 0 no southwest 14319.03
## 802 64 female 35.970 0 no southeast 14313.85
## 826 64 female 31.825 2 no northeast 16069.08
## 891 64 female 26.885 0 yes northwest 29330.98
## 1052 64 male 26.410 0 no northeast 14394.56
## 1242 64 male 36.960 2 yes southeast 49577.66
## 1266 64 male 23.760 0 yes southeast 26926.51
head(dataset.df[dataset.df$sex=="male" & dataset.df$region=="southwest", ])
## age sex bmi children smoker region charges
## 13 23 male 34.4 0 no southwest 1826.843
## 16 19 male 24.6 1 no southwest 1837.237
## 19 56 male 40.3 0 no southwest 10602.385
## 20 30 male 35.3 0 yes southwest 36837.467
## 30 31 male 36.3 2 yes southwest 38711.000
## 31 22 male 35.6 0 yes southwest 35585.576
# menyeleksi bisa dengan beberapa cara menghasilkan output sama
head(dataset.df[dataset.df["age"]>60,])
## age sex bmi children smoker region charges
## 12 62 female 26.290 0 yes southeast 27808.73
## 27 63 female 23.085 0 no northeast 14451.84
## 34 63 male 28.310 0 no northwest 13770.10
## 37 62 female 32.965 3 no northwest 15612.19
## 63 64 male 24.700 1 no northwest 30166.62
## 67 61 female 39.100 2 no southwest 14235.07
head(dataset.df[dataset.df$age>60,])
## age sex bmi children smoker region charges
## 12 62 female 26.290 0 yes southeast 27808.73
## 27 63 female 23.085 0 no northeast 14451.84
## 34 63 male 28.310 0 no northwest 13770.10
## 37 62 female 32.965 3 no northwest 15612.19
## 63 64 male 24.700 1 no northwest 30166.62
## 67 61 female 39.100 2 no southwest 14235.07
# misal ingin mengubah data baris ke-2 kolom sex menjadi lawannya
dataset.df[2, "sex"]
## [1] "male"
dataset.df[2, "sex"] <- "female"
dataset.df[2, "sex"]
## [1] "female"
dataset.df["bmi.category"] <- ""
dataset.df["bmi.category"][dataset.df["bmi"]<18.5] <-"underweight"
dataset.df["bmi.category"][dataset.df["bmi"]>25] <-"overweight"
dataset.df["bmi.category"][dataset.df["bmi.category"]==""] <-"normal"
head(dataset.df)
## age sex bmi children smoker region charges bmi.category
## 1 19 female 27.900 0 yes southwest 16884.924 overweight
## 2 18 female 33.770 1 no southeast 1725.552 overweight
## 3 28 male 33.000 3 no southeast 4449.462 overweight
## 4 33 male 22.705 0 no northwest 21984.471 normal
## 5 32 male 28.880 0 no northwest 3866.855 overweight
## 6 31 female 25.740 0 no southeast 3756.622 overweight
Bisa secara ascending atau urut dari terkecil, atau sebaliknya descending dari yang terbesar. Secara defaultnya order adalah ascending.
# mengurutkan berdasarkan usia termuda
head(dataset.df[order(dataset.df$age), ], 10)
## age sex bmi children smoker region charges bmi.category
## 2 18 female 33.770 1 no southeast 1725.552 overweight
## 23 18 male 34.100 0 no southeast 1137.011 overweight
## 32 18 female 26.315 0 no northeast 2198.190 overweight
## 47 18 female 38.665 2 no northeast 3393.356 overweight
## 51 18 female 35.625 0 no northeast 2211.131 overweight
## 58 18 male 31.680 2 yes southeast 34303.167 overweight
## 103 18 female 30.115 0 no northeast 21344.847 overweight
## 122 18 male 23.750 0 no northeast 1705.624 normal
## 158 18 male 25.175 0 yes northeast 15518.180 overweight
## 162 18 female 36.850 0 yes southeast 36149.484 overweight
# mengurutkan berdasarkan bmi terbesar dan usia tertua
# jika > 1 kriteria dan minimal salah satunya descending maka decreasing argument harus diisi
head(dataset.df[order(dataset.df$bmi, dataset.df$age, decreasing = c(T, T)), ], 10)
## age sex bmi children smoker region charges bmi.category
## 1318 18 male 53.13 0 no southeast 1163.463 overweight
## 1048 22 male 52.58 1 yes southeast 44501.398 overweight
## 848 23 male 50.38 1 no southeast 2438.055 overweight
## 117 58 male 49.06 0 no southeast 11381.325 overweight
## 287 46 female 48.07 2 no northeast 9432.925 overweight
## 1089 52 male 47.74 1 no southeast 9748.911 overweight
## 861 37 female 47.60 2 yes southwest 46113.511 overweight
## 402 47 male 47.52 1 no southeast 8083.920 overweight
## 544 54 female 47.41 0 yes southeast 63770.428 overweight
## 439 52 female 46.75 5 no southeast 12592.534 overweight
age.below18 <- dataset.df$age<=18
smoker.1 <- dataset.df$smoker==1
dataset.df[age.below18&smoker.1,]
## [1] age sex bmi children smoker
## [6] region charges bmi.category
## <0 rows> (or 0-length row.names)
# menghitung banyak NA di setiap kolom
colSums(is.na(dataset.df)==TRUE)
## age sex bmi children smoker region
## 0 0 0 0 0 0
## charges bmi.category
## 0 0
colSums(is.na(dataset.df)) #secara default TRUE
## age sex bmi children smoker region
## 0 0 0 0 0 0
## charges bmi.category
## 0 0
na.check <- colSums(is.na(dataset.df))>0
which(na.check)
## named integer(0)
colnames(dataset.df)[na.check]
## character(0)
duplicate <- duplicated(dataset.df)
dim(dataset.df[duplicate,])
## [1] 1 8
dataset.df[duplicate,]
## age sex bmi children smoker region charges bmi.category
## 582 19 male 30.59 0 no northwest 1639.563 overweight
Ada 1 baris di index 582 yang duplikat sehingga dihapus dengan:
dataset.df <- dataset.df[-582,]
dataset.df["sex"][dataset.df["sex"]=="female"] <- 0
dataset.df$sex[dataset.df$sex=="male"] <- 1
head(dataset.df)
## age sex bmi children smoker region charges bmi.category
## 1 19 0 27.900 0 yes southwest 16884.924 overweight
## 2 18 0 33.770 1 no southeast 1725.552 overweight
## 3 28 1 33.000 3 no southeast 4449.462 overweight
## 4 33 1 22.705 0 no northwest 21984.471 normal
## 5 32 1 28.880 0 no northwest 3866.855 overweight
## 6 31 0 25.740 0 no southeast 3756.622 overweight
dataset.df["smoker"][dataset.df["smoker"]=="no"] <- 0
dataset.df["smoker"][dataset.df["smoker"]=="yes"] <- 1
head(dataset.df)
## age sex bmi children smoker region charges bmi.category
## 1 19 0 27.900 0 1 southwest 16884.924 overweight
## 2 18 0 33.770 1 0 southeast 1725.552 overweight
## 3 28 1 33.000 3 0 southeast 4449.462 overweight
## 4 33 1 22.705 0 0 northwest 21984.471 normal
## 5 32 1 28.880 0 0 northwest 3866.855 overweight
## 6 31 0 25.740 0 0 southeast 3756.622 overweight
dataset.df["region"][dataset.df["region"]=="southwest"] <- "SW"
dataset.df["region"][dataset.df["region"]=="southeast"] <- "SE"
dataset.df["region"][dataset.df["region"]=="northwest"] <- "NW"
dataset.df["region"][dataset.df["region"]=="northeast"] <- "NE"
head(dataset.df)
## age sex bmi children smoker region charges bmi.category
## 1 19 0 27.900 0 1 SW 16884.924 overweight
## 2 18 0 33.770 1 0 SE 1725.552 overweight
## 3 28 1 33.000 3 0 SE 4449.462 overweight
## 4 33 1 22.705 0 0 NW 21984.471 normal
## 5 32 1 28.880 0 0 NW 3866.855 overweight
## 6 31 0 25.740 0 0 SE 3756.622 overweight
as.factor(dataset.df$sex)
## [1] 0 0 1 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 1 0
## [38] 1 1 1 0 0 1 0 1 1 0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 1 1 0 0 0 0 1 0 1 0 1 0 1
## [75] 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 1 1 1 1
## [112] 0 1 0 1 1 1 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0
## [149] 0 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 0 1 0 1
## [186] 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0
## [223] 1 1 1 1 1 0 0 1 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 1 1 0 1 1 1 0 1 0 1
## [260] 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 1 1 0 0 1 0 0 0 1 0 1 1 0 1 1
## [297] 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 0 1 0 1 0
## [334] 0 0 1 1 1 1 0 0 1 0 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 1 1
## [371] 0 0 0 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 1 1 0 0 1 0 1 1 0 0
## [408] 0 1 1 1 0 0 1 0 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 1 1 0 1 0 0 1 0
## [445] 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 1
## [482] 1 0 0 1 0 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1
## [519] 0 1 0 0 0 0 1 0 0 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0
## [556] 1 1 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 0 1 1
## [593] 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0
## [630] 1 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 1
## [667] 0 1 0 1 0 1 0 0 1 0 1 1 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 1 0 0 0 1 0
## [704] 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 1 1
## [741] 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1 0 0 1 1 0 1 1 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1
## [778] 1 1 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1
## [815] 0 0 1 0 0 1 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1
## [852] 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1 1 0 1 0 1 1 1 0 0 1 1 0 1 1 0 0 1 1 1 0 1
## [889] 1 0 0 1 1 1 0 0 1 0 0 1 1 1 1 0 0 1 0 1 0 1 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1
## [926] 0 0 0 1 1 0 1 0 1 0 1 0 1 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0
## [963] 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0
## [1000] 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 0 1 1 0 1
## [1037] 0 1 1 0 1 1 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0
## [1074] 0 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 0 0 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 0
## [1111] 1 0 0 1 1 1 1 1 0 0 1 0 0 0 0 1 0 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 1 1 1 0
## [1148] 1 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0
## [1185] 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 1 0 1 0 1 1 1 1 0 0 1 1 1 0 0 0 1
## [1222] 1 0 1 0 1 1 1 1 1 0 0 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 1
## [1259] 0 0 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 1 1 1 1 1
## [1296] 1 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0
## [1333] 1 0 0 0 0
## Levels: 0 1
as.factor(dataset.df$smoker)
## [1] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0
## [38] 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0
## [75] 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 0 1 0
## [112] 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0
## [149] 0 0 0 0 0 1 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [186] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 1 0 0
## [260] 1 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0
## [297] 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 1 0
## [334] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [371] 0 0 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 1 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [445] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0
## [482] 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [519] 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0
## [556] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [593] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1
## [630] 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0
## [667] 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## [704] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 1 0
## [741] 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [778] 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [815] 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0
## [852] 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0
## [889] 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0
## [926] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 0 0 0
## [963] 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
## [1000] 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 0 1 0 0 1
## [1037] 1 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
## [1074] 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [1111] 1 1 0 0 0 0 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
## [1148] 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1
## [1185] 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0
## [1222] 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0
## [1259] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 0 0
## [1296] 0 0 0 0 1 1 0 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
## [1333] 0 0 0 0 1
## Levels: 0 1
as.factor(dataset.df$region)
## [1] SW SE SE NW NW SE SE NW NE NW NE SE SW SE SE SW NE NE SW SW NE SW SE NE
## [25] NW SE NE NW NW SW SW NE SW NW SW NW NW SW NE SW NE SE SE SE NE SW NE NW
## [49] SE SE NE NW SW SE NW NW NE SE SE NW NE SE NW NW NW SW SW NW SE SE SE NE
## [73] SW SE SW NW SE SE NE NW NE NE SE NW SW NW NW SW NW NW SE NW NE NW SW SE
## [97] SW SE NE SW SW NE NE SE SW NW SW NW SE SE NW SW SW NW NE NE SE SE SE NW
## [121] SW NE NW NE NW NE SW SW NW SW NE NE SW NW NE SE SW NW SE SW NE NE SE NW
## [145] NW SE NW SE NW SW NW SE NE NE NE NW SE NE SE SE NW SE SW SW NW NE SW NW
## [169] NW NE SE SW NE SW NW SW NW SW SW NE NW SE NE NW SE NE SE SW SW NW SE SW
## [193] SE NW SE NW SW SE NW NE NW SE NW SE SW NE SE NE SW NE SW NW NW SE SW SW
## [217] NW SE SE SE SW NE SW SW SE SE SE SE NE NE NW SE SW SW NW SE SE SE NW SE
## [241] NE NE SW SW NE NW SE SE SW NE NE SW SE SW NE NE NW SE NW NW SW SE NE NW
## [265] SE SE SE NE SW NE SE SW NW NE NW NE NW SW SE SE NE NE NE NE SW SE NE NW
## [289] NE SE SW NE SE SE SW NE SW SE NW NW NE NE SE SE SW NW SW SE NE NW SW SW
## [313] SE SE SW NE NW NE NW NE NW NE SW NE SW NE SE NW SW SW NE NW NW NE NE SW
## [337] SE NW NE SE SW NW NE NE SE SE SE NE SE NW NW SW SW NE SE SW SE NW SE SE
## [361] NE SW SW SW SE NE NE NW NW NE NW NE NE SW SE NW NE SE NW SE NE NE SE SE
## [385] NE SW SE NW NW NW NE NW NE NE NE NW SW SE SW SE SW SE NW NW SW NE SE SW
## [409] SE SE NW NE NE SW NW SE SE SW SE NW SE SE NE NE SW SE NE NE NE NW SW NW
## [433] SW SW NW SE NE SW SE NE NW SW SE SE NW SW NE NW SW SW SW NW SW NW SE SW
## [457] SE NW SW SE SE SW NE NE NW SE SW NW NE SE SE NE SW NE SW NW NE NW SE SE
## [481] NW SE SE SW SW NE NW SW SE NW SW SE NE SW SW NE SW SW SE SW SW NE SE SE
## [505] SE NW NW NW NE SW NE SE NE SW SW SW SE NW SW NE NE SE NE SE SE SE NW SW
## [529] NE NE SE NE SE SE SE NE SW SW SE SE SW SE SE SE NW NW NE SW NE SE SW SE
## [553] SW NW NE SW NE SE NW NW NW NE SW SE SE NW NW NW SW NW SW SE SE NE NE NW
## [577] SE NE SW NE NE SE SE SW SE NE NW NE SE SW NW SE NE SE NE SE NE SW NW SE
## [601] NW SW SE NW SE NE NW NE SW SE SW NE NE SE SE NE SW SE SW SW SW SW NE NW
## [625] NW NE SE SW NW SW SW SE NE SW NE NW NE NE SE SW NW NE NW SE NE NW NE NE
## [649] NE SE SE SE SE SE SE SW NW NE NE SE SE NE SE SE SE SW NW NE SE SE NE SE
## [673] SE SE NW SE NW SW NW SW SW SW NW SW NE NE SE SW SE NE SW NW NW SW NW NE
## [697] SE NW SE SE NE SE NW NW SW SE NW NE NE SE SE NW NE SW SW NW NW NW NW NE
## [721] SW SW SW NE SE NW NE NE SE SE SW SW NE SW NW SE SW NE SW NE SE NE SE NW
## [745] NW SW NW SW NW SE NW NW SE NE NE NE SE NE SE NE SW SW NE NE NW SW SW SW
## [769] NW SW SW NE NW NW SE NW NE SE NW SW SE SE SW SE SW NE NW NE SE SE SW NE
## [793] SE NW NW SE NE SW NE SE SE SW SE SE NW NW NW SE NE SW NW SE NE SE SE NW
## [817] SW NE NW SW NW SE SE NW NE SE NE NE NW SW NW NW NW NW SE SW NE NE NW SW
## [841] NE SE SE NE SE SW SE SW NW NE NW NE NE NE SW SE NW SE SW SW SW NW NE SW
## [865] SW SE SW NE SW SW SE SE SW NE NW SW SE SW SW SW NW NE NE NW SE NE NW SW
## [889] NW NW SE NE SE NE SW NE NW SE NW NE SE NE SE SW NE NE SE SW SW NW NE NW
## [913] SW NW SE NW NE SW SE SW SW SW NW SW NE SW SW SE SE SE SW SW SW SE SW NE
## [937] NW SE SE SE SE NE NW SE SW SW NE NW SW NE SE NW SW NW SE SE NW NE NW NW
## [961] SW SE NE NW SW NW NW NE SE SE NE NW SW SE NE SE SE NE SE NE NE SW NE NE
## [985] SW NW NW NE NE SW NE SW SE NW NE SW SE NE NW NW SW SW SW NE NW NE NW NE
## [1009] NE SW SE SE NW SW NW NW SW NW NW SW SE SE SE SE SW NW NW SW NE NW SE NE
## [1033] NE NW SW SE NW NE NW NW NE NE SW NE NW NE SE NW SW NW NE NE SW NW NW SW
## [1057] SE SE NW SE SE SE NW SW SW SE NE NW SE SE NE NW NE NE SE SW NE SE SE SE
## [1081] NW NW SW NW SW NE NW SE SW SE NE SW NW SW NE NE SE NE SE NE SW SE SE SW
## [1105] SE NW NW SW SE NE SE SE NW NE SE NE SE SE NW SW SE NW NE NE NW SW SE SW
## [1129] SW SE SW NE NW NW NW SW NW SE NW SE SW SE SE SW NW SW NW SW SW NE NW SE
## [1153] NW NW NE SE NW NE SW NW SE SE NE NW NE SE SW SW NW NE SW SE NW NW SW NW
## [1177] SW NE SE NE NW SW NE SE NE NW NW NE SW NW NE NE NW NW NW NW SE NW SW NW
## [1201] NW NW NE SE NW SW SW NE SW NW SE NE SW NW NE SE SE SW NW NE SE SE SE NE
## [1225] SE NE SE SE NE NW SW NW SW SE NW NE NW NE SE SE SE NW SE SE SW SW SW SE
## [1249] NE NE SW SW SW SE SW NW NW NW NE NE SW SE SW NE SE SW NE SW SW NW NW SE
## [1273] SE SE SW NE NW NE NW SE NW NE SE SW NE NE SW SW SE NE SW SE NW NE SW NE
## [1297] SE NW NW SE NW SW SW NE SE NE NW SW SW NW NW SW SW NW NE SW SE NE NW NW
## [1321] NE SE SE NW NE NE SE NE SW SE SW SW NW NE SE SW NW
## Levels: NE NW SE SW
head(dataset.df)
## age sex bmi children smoker region charges bmi.category
## 1 19 0 27.900 0 1 SW 16884.924 overweight
## 2 18 0 33.770 1 0 SE 1725.552 overweight
## 3 28 1 33.000 3 0 SE 4449.462 overweight
## 4 33 1 22.705 0 0 NW 21984.471 normal
## 5 32 1 28.880 0 0 NW 3866.855 overweight
## 6 31 0 25.740 0 0 SE 3756.622 overweight
# misal membuat kolom id
id <- c(1:nrow(dataset.df))
# menggabungkan id ke sebelah kiri dataframe
dataset.df <- cbind(id, dataset.df)
head(dataset.df, 10)
## id age sex bmi children smoker region charges bmi.category
## 1 1 19 0 27.900 0 1 SW 16884.924 overweight
## 2 2 18 0 33.770 1 0 SE 1725.552 overweight
## 3 3 28 1 33.000 3 0 SE 4449.462 overweight
## 4 4 33 1 22.705 0 0 NW 21984.471 normal
## 5 5 32 1 28.880 0 0 NW 3866.855 overweight
## 6 6 31 0 25.740 0 0 SE 3756.622 overweight
## 7 7 46 0 33.440 1 0 SE 8240.590 overweight
## 8 8 37 0 27.740 3 0 NW 7281.506 overweight
## 9 9 37 1 29.830 2 0 NE 6406.411 overweight
## 10 10 60 0 25.840 0 0 NW 28923.137 overweight
# misal menggabungkan data frame dengan beberapa row baru
more.datasets <- data.frame(id = c(1339, 1340, 1341),
age = c(39, 28, 40),
sex = c(0, 0, 1),
bmi = c(39.3, 20.56, 21.44),
children = c(0, 0, 2),
smoker = c(0, 1, 1),
region = c("NW", "SE", "SE"),
charges = c(83942.12, 34345.2, 38263.49),
bmi.category = c("", "", ""))
more.datasets
## id age sex bmi children smoker region charges bmi.category
## 1 1339 39 0 39.30 0 0 NW 83942.12
## 2 1340 28 0 20.56 0 1 SE 34345.20
## 3 1341 40 1 21.44 2 1 SE 38263.49
more.datasets["bmi.category"][more.datasets["bmi"]<18.5] <-"underweight"
more.datasets["bmi.category"][more.datasets["bmi"]>25] <-"overweight"
more.datasets["bmi.category"][more.datasets["bmi.category"]==""] <-"normal"
more.datasets
## id age sex bmi children smoker region charges bmi.category
## 1 1339 39 0 39.30 0 0 NW 83942.12 overweight
## 2 1340 28 0 20.56 0 1 SE 34345.20 normal
## 3 1341 40 1 21.44 2 1 SE 38263.49 normal
dataset.df <- rbind(dataset.df, more.datasets)
tail(dataset.df)
## id age sex bmi children smoker region charges bmi.category
## 1336 1335 18 0 36.85 0 0 SE 1629.833 overweight
## 1337 1336 21 0 25.80 0 0 SW 2007.945 overweight
## 1338 1337 61 0 29.07 0 1 NW 29141.360 overweight
## 1339 1339 39 0 39.30 0 0 NW 83942.120 overweight
## 2100 1340 28 0 20.56 0 1 SE 34345.200 normal
## 3100 1341 40 1 21.44 2 1 SE 38263.490 normal
Melakukan perhitungan pada sekelompok data dengan fungsi tertentu seperti mean(), max(), min(), var(), aggregate(), dan lain-lain.
# menghitung varians dan std dari charge
v <- var(dataset.df$charges)
sqrt(v); v
## [1] 12282.16
## [1] 150851333
# menghitung banyak penerima asuransi berdasarkan jenis kelamin
aggregate(dataset.df$sex, list(Sex.Frequency = dataset.df$sex), FUN=length)
## Sex.Frequency x
## 1 0 665
## 2 1 675
# menghitung rata-rata bmi berdasarkan region
aggregate(dataset.df$bmi, list(Region = dataset.df$region), FUN=mean)
## Region x
## 1 NE 29.17350
## 2 NW 29.22658
## 3 SE 33.28847
## 4 SW 30.59662
# menghitung rata-rata charge berdasarkan smoker/non smoker
aggregate(dataset.df$charge, list(Status=dataset.df$smoker), FUN=mean)
## Status x
## 1 0 8511.62
## 2 1 32081.06
FUN bisa diisi dengan fungsi apa yang diinginkan dari aggregate tersebut, misal max, min, mean, length, atau sum.
Mengubah format data frame, bisa dari long ke wide atau sebaliknya. Reshaping data frame menggunakan function reshape() dengan beberapa argumen sebagai berikut: * data = data frame yang digunakan * varying = objek list dari nama variabel yang terbentuk dalam format wide * v.names = objek vektor dari nama kolom baru yang akan dibentuk dalam format long * timevar * idvar = kolom kelas dari data yang akan dispesifikasikan/digabung atau disebar/dipecah * timevar = nama baris yang akan dipecah ke beberapa kolom berdasarkan kategori tertentu * v.names = nama kategori yang akan dipecah ke beberapa kolom berdasarkan timevar (untuk to wide) * varying = kolom ke berapa dan berapa yang digabung menjadi satu dari format long to wide * direction = spesifikasi arah dataframe yang akan dibentuk apakah “long” atau “wide” 1. Wide to Long
set.seed(1234) # agar mendapatkan output yang sama setiap running proses
# diambil sebagian tabel
dataset2 <- data.frame(cbind(region=dataset.df$region, sex=dataset.df$sex, bmi=dataset.df$bmi, charges=dataset.df$charges))
head(dataset2 ,10)
## region sex bmi charges
## 1 SW 0 27.9 16884.924
## 2 SE 0 33.77 1725.5523
## 3 SE 1 33 4449.462
## 4 NW 1 22.705 21984.47061
## 5 NW 1 28.88 3866.8552
## 6 SE 0 25.74 3756.6216
## 7 SE 0 33.44 8240.5896
## 8 NW 0 27.74 7281.5056
## 9 NE 1 29.83 6406.4107
## 10 NW 0 25.84 28923.13692
# reshape untuk kolom region
df.wide = reshape(dataset2, idvar = "region", timevar = "sex", v.names=c
("bmi", "charges"), direction="wide")
## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : multiple rows match for sex=0: first taken
## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : multiple rows match for sex=1: first taken
df.wide
## region bmi.0 charges.0 bmi.1 charges.1
## 1 SW 27.9 16884.924 34.4 1826.843
## 2 SE 33.77 1725.5523 33 4449.462
## 4 NW 27.74 7281.5056 22.705 21984.47061
## 9 NE 30.78 10797.3362 29.83 6406.4107
df.long <- reshape(df.wide, idvar = "region", varying = list(c(2,4), c(3,5)), direction = "long")
df.long
## region time bmi.0 charges.0
## SW.1 SW 1 27.9 16884.924
## SE.1 SE 1 33.77 1725.5523
## NW.1 NW 1 27.74 7281.5056
## NE.1 NE 1 30.78 10797.3362
## SW.2 SW 2 34.4 1826.843
## SE.2 SE 2 33 4449.462
## NW.2 NW 2 22.705 21984.47061
## NE.2 NE 2 29.83 6406.4107