Munging/Wrangling R Data using R Basic Package

Introduction

Dalam prinsip data management, terdapat satu proses bernama data pre-processing. Dalam proses ini, data yang berukuran besar sering kali dikemas dalam format yang tidak terstruktur sehingga memerlukan pengolahan lebih dulu sebelum dilakukan analisis dan pemodelan. Untuk memudahkan proses selanjutnya, data terlebih dahulu diproses dengan mekanisme data wrangling/munging. Data wrangling dan data munging memiliki arti yang sama, merujuk pada makna data cleaning. Istilah ini diartikan sebagai proses mentransformasi atau menyiapkan data menjadi bentuk yang lebih rapi dan sistematis dan siap untuk dilakukan analisis. Sebuah dataset bisa memiliki bentuk memanjang atau melebar menyesuaikan kebutuhan analisis. Hal ini dapat dilakukan dengan menggunakan teknik data wrangling. Data wrangling bisa dilakukan secara manual dengan memanipulasi dataframe, atau melalui tibble den menggunakan package tidyverse. Dalam data munging/wrangling, dilakukan data discovering > structuring > cleaning > enriching > validating > publishing.

Data Wrangling

Meurpakan proses pembersihan, penataan, dan pengayaan data mentah ke dalam format yang diinginkan untuk menghasilkan pengambilan keputusan yang lebih baik dalam waktu yang lebih singkat. Data wrangling diterapkan dengan melakukan kegiatan-kegiatan sebagai berikut: 1. Membuat kolom/variabel baru dalam data frame 2. Subsetting atau memilih sebagian baris/kolom dari data 3. Sorting atau mengurutkan data 4. Recoding atau mengodekan ulang nilai-nilai dalam data 5. Merging atau menggabungkan data, meliputi penggabungan baris atau kolom. 6. Reshaping atau memformat ulang data, bisa menjadi wide format atau long format.

Tidy Data

Bagaimana menyusun data dalam format yang lebih terstruktur, dimana setiap variabel membentuk sebuah kolom, setiap observasi membentuk sebuah baris, dan setiap sel merupakan pengukuran tunggal. Tidy data bisa jadi tidak tercapai (menjadi messy data) ketika beberapa kondisi, seperti adanya: 1. header kolom merupakan value 2. beberapa variabel disimpan dalam satu kolom 3. variabel tersimpan di baris dan kolom 4. observasi tunggal tersimpan di beberapa baris 5. baris tunggal menyimpan beberapa observasi Adanya data messy ini akan menghambat proses analisis selanjutnya.

Prepare Datasets

Data ini terdiri dari 7 kolom dengan 1338 baris, dengan rincian variabel sebagai berikut:

Variable	Description
age	umur penerima utama asuransi (tahun)
sex	jenis kelamin penerima asuransi (female/male)
bmi	body mass index (indeks massa tubuh), yang merupakan hasil bagi berat badan dengan kuadrat tinggi badan
children	banyak anak yang ditanggung asuransi
smoker	status perokok (ya/tidak)
region	daerah domisili penerima asuransi di Amerika Serikat, meliputi: northeast, southeast, southwest, dan northwest
charges	biaya kesehatan individu yang ditanggung pihak asuransi ($)

Data diperoleh dari Kaggle.com melalui link link berupa data biaya kesehatan pemilik asuransi yang diperoleh dari buku Machine Learning with R oleh Brett Lantz.

Data Wrangling using R Basic Package

dataset.df <- read.csv("/Users/User/Documents/RFiles/datasets/insurance.csv")
# menampilkan 6 record teratas
head(dataset.df)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

# menampilkan 6 record terbawah
tail(dataset.df)

##      age    sex   bmi children smoker    region   charges
## 1333  52 female 44.70        3     no southwest 11411.685
## 1334  50   male 30.97        3     no northwest 10600.548
## 1335  18 female 31.92        0     no northeast  2205.981
## 1336  18 female 36.85        0     no southeast  1629.833
## 1337  21 female 25.80        0     no southwest  2007.945
## 1338  61 female 29.07        0    yes northwest 29141.360

Memeriksa dataset

#periksa dimensi
dim(dataset.df)

## [1] 1338    7

# menghitung nilai na
sum(is.na(dataset.df))

## [1] 0

# menghitung nilai null
sum(is.null(dataset.df))

## [1] 0

# menampilkan ringkasan data
summary(dataset.df)

##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

# melihat struktur data
str(dataset.df)

## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

Akses Elemen (Subsetting)

Bisa dengan operator [, ], [[, ]], dan $.

# menyeleksi value kolom pertama pada record 2 sampai 7
dataset.df[2:7,1]

## [1] 18 28 33 32 31 46

# menyeleksi value record ke-3 pada kolom ke 2 sampai 5
dataset.df[3, 2:5]

##    sex bmi children smoker
## 3 male  33        3     no

# menyeleksi record ke-2 semua kolom
dataset.df[2,]

##   age  sex   bmi children smoker    region  charges
## 2  18 male 33.77        1     no southeast 1725.552

# menyeleksi 5 record teratas pada kolom sex
dataset.df[1:5,]$sex

## [1] "female" "male"   "male"   "male"   "male"

# menyeleksi baris ke-3 sampai 5 dari kolom age, sex, bmi, dan region
dataset.df[3:5, c("age", "sex", "bmi", "region")]

##   age  sex    bmi    region
## 3  28 male 33.000 southeast
## 4  33 male 22.705 northwest
## 5  32 male 28.880 northwest

# menyeleksi berdasarkan kriteria tertentu
dataset.df[dataset.df$age > 63,]

##      age    sex    bmi children smoker    region  charges
## 63    64   male 24.700        1     no northwest 30166.62
## 95    64 female 31.300        2    yes southwest 47291.06
## 200   64 female 39.330        0     no northeast 14901.52
## 329   64 female 33.800        1    yes southwest 47928.03
## 336   64   male 34.500        0     no southwest 13822.80
## 379   64 female 30.115        3     no northwest 16455.71
## 399   64   male 25.600        2     no southwest 14988.43
## 403   64 female 32.965        0     no northwest 14692.67
## 419   64   male 39.160        1     no southeast 14418.28
## 421   64   male 33.880        0    yes southeast 46889.26
## 535   64   male 40.480        0     no southeast 13831.12
## 604   64 female 39.050        3     no southeast 16085.13
## 636   64   male 38.190        0     no northeast 14410.93
## 665   64 female 22.990        0    yes southeast 27037.91
## 753   64   male 37.905        0     no northwest 14210.54
## 769   64 female 39.700        0     no southwest 14319.03
## 802   64 female 35.970        0     no southeast 14313.85
## 826   64 female 31.825        2     no northeast 16069.08
## 891   64 female 26.885        0    yes northwest 29330.98
## 1052  64   male 26.410        0     no northeast 14394.56
## 1242  64   male 36.960        2    yes southeast 49577.66
## 1266  64   male 23.760        0    yes southeast 26926.51

head(dataset.df[dataset.df$sex=="male" & dataset.df$region=="southwest", ])

##    age  sex  bmi children smoker    region   charges
## 13  23 male 34.4        0     no southwest  1826.843
## 16  19 male 24.6        1     no southwest  1837.237
## 19  56 male 40.3        0     no southwest 10602.385
## 20  30 male 35.3        0    yes southwest 36837.467
## 30  31 male 36.3        2    yes southwest 38711.000
## 31  22 male 35.6        0    yes southwest 35585.576

# menyeleksi bisa dengan beberapa cara menghasilkan output sama
head(dataset.df[dataset.df["age"]>60,])

##    age    sex    bmi children smoker    region  charges
## 12  62 female 26.290        0    yes southeast 27808.73
## 27  63 female 23.085        0     no northeast 14451.84
## 34  63   male 28.310        0     no northwest 13770.10
## 37  62 female 32.965        3     no northwest 15612.19
## 63  64   male 24.700        1     no northwest 30166.62
## 67  61 female 39.100        2     no southwest 14235.07

head(dataset.df[dataset.df$age>60,])

##    age    sex    bmi children smoker    region  charges
## 12  62 female 26.290        0    yes southeast 27808.73
## 27  63 female 23.085        0     no northeast 14451.84
## 34  63   male 28.310        0     no northwest 13770.10
## 37  62 female 32.965        3     no northwest 15612.19
## 63  64   male 24.700        1     no northwest 30166.62
## 67  61 female 39.100        2     no southwest 14235.07

Mengubah Data

# misal ingin mengubah data baris ke-2 kolom sex menjadi lawannya
dataset.df[2, "sex"]

## [1] "male"

dataset.df[2, "sex"] <- "female"
dataset.df[2, "sex"]

## [1] "female"

Menambah kolom baru ke dataframe

dataset.df["bmi.category"] <- ""
dataset.df["bmi.category"][dataset.df["bmi"]<18.5] <-"underweight"
dataset.df["bmi.category"][dataset.df["bmi"]>25] <-"overweight"
dataset.df["bmi.category"][dataset.df["bmi.category"]==""] <-"normal"
head(dataset.df)

##   age    sex    bmi children smoker    region   charges bmi.category
## 1  19 female 27.900        0    yes southwest 16884.924   overweight
## 2  18 female 33.770        1     no southeast  1725.552   overweight
## 3  28   male 33.000        3     no southeast  4449.462   overweight
## 4  33   male 22.705        0     no northwest 21984.471       normal
## 5  32   male 28.880        0     no northwest  3866.855   overweight
## 6  31 female 25.740        0     no southeast  3756.622   overweight

Mengurutkan baris

Bisa secara ascending atau urut dari terkecil, atau sebaliknya descending dari yang terbesar. Secara defaultnya order adalah ascending.

# mengurutkan berdasarkan usia termuda
head(dataset.df[order(dataset.df$age), ], 10)

##     age    sex    bmi children smoker    region   charges bmi.category
## 2    18 female 33.770        1     no southeast  1725.552   overweight
## 23   18   male 34.100        0     no southeast  1137.011   overweight
## 32   18 female 26.315        0     no northeast  2198.190   overweight
## 47   18 female 38.665        2     no northeast  3393.356   overweight
## 51   18 female 35.625        0     no northeast  2211.131   overweight
## 58   18   male 31.680        2    yes southeast 34303.167   overweight
## 103  18 female 30.115        0     no northeast 21344.847   overweight
## 122  18   male 23.750        0     no northeast  1705.624       normal
## 158  18   male 25.175        0    yes northeast 15518.180   overweight
## 162  18 female 36.850        0    yes southeast 36149.484   overweight

# mengurutkan berdasarkan bmi terbesar dan usia tertua
# jika > 1 kriteria dan minimal salah satunya descending maka decreasing argument harus diisi
head(dataset.df[order(dataset.df$bmi, dataset.df$age, decreasing = c(T, T)), ], 10)

##      age    sex   bmi children smoker    region   charges bmi.category
## 1318  18   male 53.13        0     no southeast  1163.463   overweight
## 1048  22   male 52.58        1    yes southeast 44501.398   overweight
## 848   23   male 50.38        1     no southeast  2438.055   overweight
## 117   58   male 49.06        0     no southeast 11381.325   overweight
## 287   46 female 48.07        2     no northeast  9432.925   overweight
## 1089  52   male 47.74        1     no southeast  9748.911   overweight
## 861   37 female 47.60        2    yes southwest 46113.511   overweight
## 402   47   male 47.52        1     no southeast  8083.920   overweight
## 544   54 female 47.41        0    yes southeast 63770.428   overweight
## 439   52 female 46.75        5     no southeast 12592.534   overweight

Filter Data

Filter baris penerima asuransi yang perokok usia 18 tahun ke bawah

age.below18 <- dataset.df$age<=18
smoker.1 <- dataset.df$smoker==1
dataset.df[age.below18&smoker.1,]

## [1] age          sex          bmi          children     smoker      
## [6] region       charges      bmi.category
## <0 rows> (or 0-length row.names)

Filter kolom yang mengandung NA

# menghitung banyak NA di setiap kolom
colSums(is.na(dataset.df)==TRUE)

##          age          sex          bmi     children       smoker       region 
##            0            0            0            0            0            0 
##      charges bmi.category 
##            0            0

colSums(is.na(dataset.df)) #secara default TRUE

##          age          sex          bmi     children       smoker       region 
##            0            0            0            0            0            0 
##      charges bmi.category 
##            0            0

na.check <- colSums(is.na(dataset.df))>0
which(na.check)

## named integer(0)

colnames(dataset.df)[na.check]

## character(0)

Periksa data duplikat

duplicate <- duplicated(dataset.df)
dim(dataset.df[duplicate,])

## [1] 1 8

dataset.df[duplicate,]

##     age  sex   bmi children smoker    region  charges bmi.category
## 582  19 male 30.59        0     no northwest 1639.563   overweight

Ada 1 baris di index 582 yang duplikat sehingga dihapus dengan:

dataset.df <- dataset.df[-582,]

Recoding data frame

Mengubah nilai sex menjadi 0 untuk female dan 1 untuk male

dataset.df["sex"][dataset.df["sex"]=="female"] <- 0
dataset.df$sex[dataset.df$sex=="male"] <- 1
head(dataset.df)

##   age sex    bmi children smoker    region   charges bmi.category
## 1  19   0 27.900        0    yes southwest 16884.924   overweight
## 2  18   0 33.770        1     no southeast  1725.552   overweight
## 3  28   1 33.000        3     no southeast  4449.462   overweight
## 4  33   1 22.705        0     no northwest 21984.471       normal
## 5  32   1 28.880        0     no northwest  3866.855   overweight
## 6  31   0 25.740        0     no southeast  3756.622   overweight

Mengubah nilai smoker menjadi 0 untuk no dan 1 untuk yes:

dataset.df["smoker"][dataset.df["smoker"]=="no"] <- 0
dataset.df["smoker"][dataset.df["smoker"]=="yes"] <- 1
head(dataset.df)

##   age sex    bmi children smoker    region   charges bmi.category
## 1  19   0 27.900        0      1 southwest 16884.924   overweight
## 2  18   0 33.770        1      0 southeast  1725.552   overweight
## 3  28   1 33.000        3      0 southeast  4449.462   overweight
## 4  33   1 22.705        0      0 northwest 21984.471       normal
## 5  32   1 28.880        0      0 northwest  3866.855   overweight
## 6  31   0 25.740        0      0 southeast  3756.622   overweight

Mengubah region menjadi format yang lebih singkat:

dataset.df["region"][dataset.df["region"]=="southwest"] <- "SW"
dataset.df["region"][dataset.df["region"]=="southeast"] <- "SE"
dataset.df["region"][dataset.df["region"]=="northwest"] <- "NW"
dataset.df["region"][dataset.df["region"]=="northeast"] <- "NE"
head(dataset.df)

##   age sex    bmi children smoker region   charges bmi.category
## 1  19   0 27.900        0      1     SW 16884.924   overweight
## 2  18   0 33.770        1      0     SE  1725.552   overweight
## 3  28   1 33.000        3      0     SE  4449.462   overweight
## 4  33   1 22.705        0      0     NW 21984.471       normal
## 5  32   1 28.880        0      0     NW  3866.855   overweight
## 6  31   0 25.740        0      0     SE  3756.622   overweight

Membuat sex, smoker, dan region menjadi factor:

as.factor(dataset.df$sex)

##    [1] 0 0 1 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 1 0
##   [38] 1 1 1 0 0 1 0 1 1 0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 1 1 0 0 0 0 1 0 1 0 1 0 1
##   [75] 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 1 1 1 1
##  [112] 0 1 0 1 1 1 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0
##  [149] 0 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 0 1 0 1
##  [186] 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0
##  [223] 1 1 1 1 1 0 0 1 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 1 1 0 1 1 1 0 1 0 1
##  [260] 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 1 1 0 0 1 0 0 0 1 0 1 1 0 1 1
##  [297] 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 0 1 0 1 0
##  [334] 0 0 1 1 1 1 0 0 1 0 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 1 1
##  [371] 0 0 0 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 1 1 0 0 1 0 1 1 0 0
##  [408] 0 1 1 1 0 0 1 0 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 1 1 0 1 0 0 1 0
##  [445] 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 1
##  [482] 1 0 0 1 0 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1
##  [519] 0 1 0 0 0 0 1 0 0 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0
##  [556] 1 1 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 0 1 1
##  [593] 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0
##  [630] 1 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 1
##  [667] 0 1 0 1 0 1 0 0 1 0 1 1 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 1 0 0 0 1 0
##  [704] 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 1 1
##  [741] 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1 0 0 1 1 0 1 1 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1
##  [778] 1 1 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1
##  [815] 0 0 1 0 0 1 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1
##  [852] 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1 1 0 1 0 1 1 1 0 0 1 1 0 1 1 0 0 1 1 1 0 1
##  [889] 1 0 0 1 1 1 0 0 1 0 0 1 1 1 1 0 0 1 0 1 0 1 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1
##  [926] 0 0 0 1 1 0 1 0 1 0 1 0 1 1 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0
##  [963] 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0
## [1000] 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 0 1 1 0 1
## [1037] 0 1 1 0 1 1 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0
## [1074] 0 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 0 0 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 0
## [1111] 1 0 0 1 1 1 1 1 0 0 1 0 0 0 0 1 0 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 1 1 1 0
## [1148] 1 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0
## [1185] 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 1 0 1 0 1 1 1 1 0 0 1 1 1 0 0 0 1
## [1222] 1 0 1 0 1 1 1 1 1 0 0 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 1
## [1259] 0 0 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 1 1 1 1 1
## [1296] 1 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0
## [1333] 1 0 0 0 0
## Levels: 0 1

as.factor(dataset.df$smoker)

##    [1] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0
##   [38] 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0
##   [75] 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 0 1 0
##  [112] 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0
##  [149] 0 0 0 0 0 1 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
##  [186] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [223] 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 1 0 0
##  [260] 1 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0
##  [297] 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 1 0
##  [334] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
##  [371] 0 0 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [408] 0 0 0 0 1 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
##  [445] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0
##  [482] 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
##  [519] 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0
##  [556] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
##  [593] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1
##  [630] 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0
##  [667] 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
##  [704] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 1 0
##  [741] 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
##  [778] 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
##  [815] 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0
##  [852] 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0
##  [889] 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0
##  [926] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 0 0 0
##  [963] 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
## [1000] 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 0 1 0 0 1
## [1037] 1 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
## [1074] 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [1111] 1 1 0 0 0 0 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
## [1148] 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1
## [1185] 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0
## [1222] 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0
## [1259] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 0 0
## [1296] 0 0 0 0 1 1 0 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
## [1333] 0 0 0 0 1
## Levels: 0 1

as.factor(dataset.df$region)

##    [1] SW SE SE NW NW SE SE NW NE NW NE SE SW SE SE SW NE NE SW SW NE SW SE NE
##   [25] NW SE NE NW NW SW SW NE SW NW SW NW NW SW NE SW NE SE SE SE NE SW NE NW
##   [49] SE SE NE NW SW SE NW NW NE SE SE NW NE SE NW NW NW SW SW NW SE SE SE NE
##   [73] SW SE SW NW SE SE NE NW NE NE SE NW SW NW NW SW NW NW SE NW NE NW SW SE
##   [97] SW SE NE SW SW NE NE SE SW NW SW NW SE SE NW SW SW NW NE NE SE SE SE NW
##  [121] SW NE NW NE NW NE SW SW NW SW NE NE SW NW NE SE SW NW SE SW NE NE SE NW
##  [145] NW SE NW SE NW SW NW SE NE NE NE NW SE NE SE SE NW SE SW SW NW NE SW NW
##  [169] NW NE SE SW NE SW NW SW NW SW SW NE NW SE NE NW SE NE SE SW SW NW SE SW
##  [193] SE NW SE NW SW SE NW NE NW SE NW SE SW NE SE NE SW NE SW NW NW SE SW SW
##  [217] NW SE SE SE SW NE SW SW SE SE SE SE NE NE NW SE SW SW NW SE SE SE NW SE
##  [241] NE NE SW SW NE NW SE SE SW NE NE SW SE SW NE NE NW SE NW NW SW SE NE NW
##  [265] SE SE SE NE SW NE SE SW NW NE NW NE NW SW SE SE NE NE NE NE SW SE NE NW
##  [289] NE SE SW NE SE SE SW NE SW SE NW NW NE NE SE SE SW NW SW SE NE NW SW SW
##  [313] SE SE SW NE NW NE NW NE NW NE SW NE SW NE SE NW SW SW NE NW NW NE NE SW
##  [337] SE NW NE SE SW NW NE NE SE SE SE NE SE NW NW SW SW NE SE SW SE NW SE SE
##  [361] NE SW SW SW SE NE NE NW NW NE NW NE NE SW SE NW NE SE NW SE NE NE SE SE
##  [385] NE SW SE NW NW NW NE NW NE NE NE NW SW SE SW SE SW SE NW NW SW NE SE SW
##  [409] SE SE NW NE NE SW NW SE SE SW SE NW SE SE NE NE SW SE NE NE NE NW SW NW
##  [433] SW SW NW SE NE SW SE NE NW SW SE SE NW SW NE NW SW SW SW NW SW NW SE SW
##  [457] SE NW SW SE SE SW NE NE NW SE SW NW NE SE SE NE SW NE SW NW NE NW SE SE
##  [481] NW SE SE SW SW NE NW SW SE NW SW SE NE SW SW NE SW SW SE SW SW NE SE SE
##  [505] SE NW NW NW NE SW NE SE NE SW SW SW SE NW SW NE NE SE NE SE SE SE NW SW
##  [529] NE NE SE NE SE SE SE NE SW SW SE SE SW SE SE SE NW NW NE SW NE SE SW SE
##  [553] SW NW NE SW NE SE NW NW NW NE SW SE SE NW NW NW SW NW SW SE SE NE NE NW
##  [577] SE NE SW NE NE SE SE SW SE NE NW NE SE SW NW SE NE SE NE SE NE SW NW SE
##  [601] NW SW SE NW SE NE NW NE SW SE SW NE NE SE SE NE SW SE SW SW SW SW NE NW
##  [625] NW NE SE SW NW SW SW SE NE SW NE NW NE NE SE SW NW NE NW SE NE NW NE NE
##  [649] NE SE SE SE SE SE SE SW NW NE NE SE SE NE SE SE SE SW NW NE SE SE NE SE
##  [673] SE SE NW SE NW SW NW SW SW SW NW SW NE NE SE SW SE NE SW NW NW SW NW NE
##  [697] SE NW SE SE NE SE NW NW SW SE NW NE NE SE SE NW NE SW SW NW NW NW NW NE
##  [721] SW SW SW NE SE NW NE NE SE SE SW SW NE SW NW SE SW NE SW NE SE NE SE NW
##  [745] NW SW NW SW NW SE NW NW SE NE NE NE SE NE SE NE SW SW NE NE NW SW SW SW
##  [769] NW SW SW NE NW NW SE NW NE SE NW SW SE SE SW SE SW NE NW NE SE SE SW NE
##  [793] SE NW NW SE NE SW NE SE SE SW SE SE NW NW NW SE NE SW NW SE NE SE SE NW
##  [817] SW NE NW SW NW SE SE NW NE SE NE NE NW SW NW NW NW NW SE SW NE NE NW SW
##  [841] NE SE SE NE SE SW SE SW NW NE NW NE NE NE SW SE NW SE SW SW SW NW NE SW
##  [865] SW SE SW NE SW SW SE SE SW NE NW SW SE SW SW SW NW NE NE NW SE NE NW SW
##  [889] NW NW SE NE SE NE SW NE NW SE NW NE SE NE SE SW NE NE SE SW SW NW NE NW
##  [913] SW NW SE NW NE SW SE SW SW SW NW SW NE SW SW SE SE SE SW SW SW SE SW NE
##  [937] NW SE SE SE SE NE NW SE SW SW NE NW SW NE SE NW SW NW SE SE NW NE NW NW
##  [961] SW SE NE NW SW NW NW NE SE SE NE NW SW SE NE SE SE NE SE NE NE SW NE NE
##  [985] SW NW NW NE NE SW NE SW SE NW NE SW SE NE NW NW SW SW SW NE NW NE NW NE
## [1009] NE SW SE SE NW SW NW NW SW NW NW SW SE SE SE SE SW NW NW SW NE NW SE NE
## [1033] NE NW SW SE NW NE NW NW NE NE SW NE NW NE SE NW SW NW NE NE SW NW NW SW
## [1057] SE SE NW SE SE SE NW SW SW SE NE NW SE SE NE NW NE NE SE SW NE SE SE SE
## [1081] NW NW SW NW SW NE NW SE SW SE NE SW NW SW NE NE SE NE SE NE SW SE SE SW
## [1105] SE NW NW SW SE NE SE SE NW NE SE NE SE SE NW SW SE NW NE NE NW SW SE SW
## [1129] SW SE SW NE NW NW NW SW NW SE NW SE SW SE SE SW NW SW NW SW SW NE NW SE
## [1153] NW NW NE SE NW NE SW NW SE SE NE NW NE SE SW SW NW NE SW SE NW NW SW NW
## [1177] SW NE SE NE NW SW NE SE NE NW NW NE SW NW NE NE NW NW NW NW SE NW SW NW
## [1201] NW NW NE SE NW SW SW NE SW NW SE NE SW NW NE SE SE SW NW NE SE SE SE NE
## [1225] SE NE SE SE NE NW SW NW SW SE NW NE NW NE SE SE SE NW SE SE SW SW SW SE
## [1249] NE NE SW SW SW SE SW NW NW NW NE NE SW SE SW NE SE SW NE SW SW NW NW SE
## [1273] SE SE SW NE NW NE NW SE NW NE SE SW NE NE SW SW SE NE SW SE NW NE SW NE
## [1297] SE NW NW SE NW SW SW NE SE NE NW SW SW NW NW SW SW NW NE SW SE NE NW NW
## [1321] NE SE SE NW NE NE SE NE SW SE SW SW NW NE SE SW NW
## Levels: NE NW SE SW

head(dataset.df)

##   age sex    bmi children smoker region   charges bmi.category
## 1  19   0 27.900        0      1     SW 16884.924   overweight
## 2  18   0 33.770        1      0     SE  1725.552   overweight
## 3  28   1 33.000        3      0     SE  4449.462   overweight
## 4  33   1 22.705        0      0     NW 21984.471       normal
## 5  32   1 28.880        0      0     NW  3866.855   overweight
## 6  31   0 25.740        0      0     SE  3756.622   overweight

Menggabungkan Data

Menggabungkan data frame dengan suatu kolom lain dengan cbind

# misal membuat kolom id
id <- c(1:nrow(dataset.df))
# menggabungkan id ke sebelah kiri dataframe
dataset.df <- cbind(id, dataset.df)
head(dataset.df, 10)

##    id age sex    bmi children smoker region   charges bmi.category
## 1   1  19   0 27.900        0      1     SW 16884.924   overweight
## 2   2  18   0 33.770        1      0     SE  1725.552   overweight
## 3   3  28   1 33.000        3      0     SE  4449.462   overweight
## 4   4  33   1 22.705        0      0     NW 21984.471       normal
## 5   5  32   1 28.880        0      0     NW  3866.855   overweight
## 6   6  31   0 25.740        0      0     SE  3756.622   overweight
## 7   7  46   0 33.440        1      0     SE  8240.590   overweight
## 8   8  37   0 27.740        3      0     NW  7281.506   overweight
## 9   9  37   1 29.830        2      0     NE  6406.411   overweight
## 10 10  60   0 25.840        0      0     NW 28923.137   overweight

Menggabungkan data frame dengan suatu baris tertentu dengan rbind

# misal menggabungkan data frame dengan beberapa row baru
more.datasets <- data.frame(id = c(1339, 1340, 1341),
                            age = c(39, 28, 40),
                            sex = c(0, 0, 1),
                            bmi = c(39.3, 20.56, 21.44),
                            children = c(0, 0, 2),
                            smoker = c(0, 1, 1),
                            region = c("NW", "SE", "SE"),
                            charges = c(83942.12, 34345.2, 38263.49),
                            bmi.category = c("", "", ""))
more.datasets

##     id age sex   bmi children smoker region  charges bmi.category
## 1 1339  39   0 39.30        0      0     NW 83942.12             
## 2 1340  28   0 20.56        0      1     SE 34345.20             
## 3 1341  40   1 21.44        2      1     SE 38263.49

more.datasets["bmi.category"][more.datasets["bmi"]<18.5] <-"underweight"
more.datasets["bmi.category"][more.datasets["bmi"]>25] <-"overweight"
more.datasets["bmi.category"][more.datasets["bmi.category"]==""] <-"normal"
more.datasets

##     id age sex   bmi children smoker region  charges bmi.category
## 1 1339  39   0 39.30        0      0     NW 83942.12   overweight
## 2 1340  28   0 20.56        0      1     SE 34345.20       normal
## 3 1341  40   1 21.44        2      1     SE 38263.49       normal

dataset.df <- rbind(dataset.df, more.datasets)
tail(dataset.df)

##        id age sex   bmi children smoker region   charges bmi.category
## 1336 1335  18   0 36.85        0      0     SE  1629.833   overweight
## 1337 1336  21   0 25.80        0      0     SW  2007.945   overweight
## 1338 1337  61   0 29.07        0      1     NW 29141.360   overweight
## 1339 1339  39   0 39.30        0      0     NW 83942.120   overweight
## 2100 1340  28   0 20.56        0      1     SE 34345.200       normal
## 3100 1341  40   1 21.44        2      1     SE 38263.490       normal

Agregasi Data

Melakukan perhitungan pada sekelompok data dengan fungsi tertentu seperti mean(), max(), min(), var(), aggregate(), dan lain-lain.

# menghitung varians dan std dari charge
v <- var(dataset.df$charges)
sqrt(v); v

## [1] 12282.16

## [1] 150851333

# menghitung banyak penerima asuransi berdasarkan jenis kelamin
aggregate(dataset.df$sex, list(Sex.Frequency = dataset.df$sex), FUN=length)

##   Sex.Frequency   x
## 1             0 665
## 2             1 675

# menghitung rata-rata bmi berdasarkan region
aggregate(dataset.df$bmi, list(Region = dataset.df$region), FUN=mean)

##   Region        x
## 1     NE 29.17350
## 2     NW 29.22658
## 3     SE 33.28847
## 4     SW 30.59662

# menghitung rata-rata charge berdasarkan smoker/non smoker
aggregate(dataset.df$charge, list(Status=dataset.df$smoker), FUN=mean)

##   Status        x
## 1      0  8511.62
## 2      1 32081.06

FUN bisa diisi dengan fungsi apa yang diinginkan dari aggregate tersebut, misal max, min, mean, length, atau sum.

Reshaping Data Frame

Mengubah format data frame, bisa dari long ke wide atau sebaliknya. Reshaping data frame menggunakan function reshape() dengan beberapa argumen sebagai berikut: * data = data frame yang digunakan * varying = objek list dari nama variabel yang terbentuk dalam format wide * v.names = objek vektor dari nama kolom baru yang akan dibentuk dalam format long * timevar * idvar = kolom kelas dari data yang akan dispesifikasikan/digabung atau disebar/dipecah * timevar = nama baris yang akan dipecah ke beberapa kolom berdasarkan kategori tertentu * v.names = nama kategori yang akan dipecah ke beberapa kolom berdasarkan timevar (untuk to wide) * varying = kolom ke berapa dan berapa yang digabung menjadi satu dari format long to wide * direction = spesifikasi arah dataframe yang akan dibentuk apakah “long” atau “wide” 1. Wide to Long

set.seed(1234) # agar mendapatkan output yang sama setiap running proses
# diambil sebagian tabel
dataset2 <- data.frame(cbind(region=dataset.df$region, sex=dataset.df$sex, bmi=dataset.df$bmi, charges=dataset.df$charges))
head(dataset2  ,10)

##    region sex    bmi     charges
## 1      SW   0   27.9   16884.924
## 2      SE   0  33.77   1725.5523
## 3      SE   1     33    4449.462
## 4      NW   1 22.705 21984.47061
## 5      NW   1  28.88   3866.8552
## 6      SE   0  25.74   3756.6216
## 7      SE   0  33.44   8240.5896
## 8      NW   0  27.74   7281.5056
## 9      NE   1  29.83   6406.4107
## 10     NW   0  25.84 28923.13692

# reshape untuk kolom region
df.wide = reshape(dataset2, idvar = "region", timevar = "sex", v.names=c
                  ("bmi", "charges"), direction="wide")

## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : multiple rows match for sex=0: first taken

## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : multiple rows match for sex=1: first taken

df.wide

##   region bmi.0  charges.0  bmi.1   charges.1
## 1     SW  27.9  16884.924   34.4    1826.843
## 2     SE 33.77  1725.5523     33    4449.462
## 4     NW 27.74  7281.5056 22.705 21984.47061
## 9     NE 30.78 10797.3362  29.83   6406.4107

Long to Wide

df.long <- reshape(df.wide, idvar = "region", varying = list(c(2,4), c(3,5)), direction = "long")
df.long

##      region time  bmi.0   charges.0
## SW.1     SW    1   27.9   16884.924
## SE.1     SE    1  33.77   1725.5523
## NW.1     NW    1  27.74   7281.5056
## NE.1     NE    1  30.78  10797.3362
## SW.2     SW    2   34.4    1826.843
## SE.2     SE    2     33    4449.462
## NW.2     NW    2 22.705 21984.47061
## NE.2     NE    2  29.83   6406.4107