Pada materi kali ini penulis akan memberikan materi bagaimana cara preprocessing pada data regresi. Data yang digunakan adalah data prediksi harga rumah. Dan materi yang dibahas adalah:
1. Menangani nilai outlier
2. Menangani missing value
3. Menghapus atribute yang tidak digunakan
4. Menangani atribute kategori pada data regresi
5. Menampilkan nilai korelasi matriks dengan R
Kita import terlebih dahulu dataset yang akan digunakan yaitu data prediksi harga rumah
df <- read.csv("data/House-Price.csv", header=TRUE)
str(df)
## 'data.frame': 506 obs. of 19 variables:
## $ price : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
## $ resid_area : num 32.3 37.1 37.1 32.2 32.2 ...
## $ air_qual : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ room_num : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dist1 : num 4.35 4.99 5.03 6.21 6.16 6.22 5.87 6.04 6.18 6.67 ...
## $ dist2 : num 3.81 4.7 4.86 5.93 5.86 5.8 5.47 5.85 5.85 6.55 ...
## $ dist3 : num 4.18 5.12 5.01 6.16 6.37 6.23 5.7 6.25 6.3 6.85 ...
## $ dist4 : num 4.01 5.06 4.97 5.96 5.86 5.99 5.2 5.66 6 6.29 ...
## $ teachers : num 24.7 22.2 22.2 21.3 21.3 21.3 24.8 24.8 24.8 24.8 ...
## $ poor_prop : num 4.98 9.14 4.03 2.94 5.33 ...
## $ airport : chr "YES" "NO" "NO" "YES" ...
## $ n_hos_beds : num 5.48 7.33 7.39 9.27 8.82 ...
## $ n_hot_rooms: num 11.2 12.2 101.1 11.3 11.3 ...
## $ waterbody : chr "River" "Lake" "None" "Lake" ...
## $ rainfall : int 23 42 38 45 55 53 41 56 55 45 ...
## $ bus_ter : chr "YES" "YES" "YES" "YES" ...
## $ parks : num 0.0493 0.0461 0.0458 0.0472 0.0395 ...
## $ Sold : int 0 1 0 0 0 1 1 1 0 0 ...
summary(df)
## price resid_area air_qual room_num
## Min. : 5.00 Min. :30.46 Min. :0.3850 Min. :3.561
## 1st Qu.:17.02 1st Qu.:35.19 1st Qu.:0.4490 1st Qu.:5.886
## Median :21.20 Median :39.69 Median :0.5380 Median :6.208
## Mean :22.53 Mean :41.14 Mean :0.5547 Mean :6.285
## 3rd Qu.:25.00 3rd Qu.:48.10 3rd Qu.:0.6240 3rd Qu.:6.623
## Max. :50.00 Max. :57.74 Max. :0.8710 Max. :8.780
##
## age dist1 dist2 dist3
## Min. : 2.90 Min. : 1.130 Min. : 0.920 Min. : 1.150
## 1st Qu.: 45.02 1st Qu.: 2.270 1st Qu.: 1.940 1st Qu.: 2.232
## Median : 77.50 Median : 3.385 Median : 3.010 Median : 3.375
## Mean : 68.57 Mean : 3.972 Mean : 3.629 Mean : 3.961
## 3rd Qu.: 94.08 3rd Qu.: 5.367 3rd Qu.: 4.992 3rd Qu.: 5.407
## Max. :100.00 Max. :12.320 Max. :11.930 Max. :12.320
##
## dist4 teachers poor_prop airport
## Min. : 0.730 Min. :18.00 Min. : 1.73 Length:506
## 1st Qu.: 1.940 1st Qu.:19.80 1st Qu.: 6.95 Class :character
## Median : 3.070 Median :20.95 Median :11.36 Mode :character
## Mean : 3.619 Mean :21.54 Mean :12.65
## 3rd Qu.: 4.985 3rd Qu.:22.60 3rd Qu.:16.95
## Max. :11.940 Max. :27.40 Max. :37.97
##
## n_hos_beds n_hot_rooms waterbody rainfall
## Min. : 5.268 Min. : 10.06 Length:506 Min. : 3.00
## 1st Qu.: 6.635 1st Qu.: 11.19 Class :character 1st Qu.:28.00
## Median : 7.999 Median : 12.72 Mode :character Median :39.00
## Mean : 7.900 Mean : 13.04 Mean :39.18
## 3rd Qu.: 9.088 3rd Qu.: 14.17 3rd Qu.:50.00
## Max. :10.876 Max. :101.12 Max. :60.00
## NA's :8
## bus_ter parks Sold
## Length:506 Min. :0.03329 Min. :0.0000
## Class :character 1st Qu.:0.04646 1st Qu.:0.0000
## Mode :character Median :0.05351 Median :0.0000
## Mean :0.05445 Mean :0.4545
## 3rd Qu.:0.06140 3rd Qu.:1.0000
## Max. :0.08671 Max. :1.0000
##
untuk mengidentifikasi nilai outlier, variabel yang diambil ada 3 yaitu price
, resid_area
, n_hot_rooms
dan rainfall
pairs(~price+resid_area+n_hot_rooms+rainfall, data = df)
Hasil observasi terdapat beberapa temuan:
1. variabel n_hot_rooms
and rainfall
terdapat nilai outliers
2. variabel n_hos_beds
terdapat missing values
3. variabel bus_term
tidak berpengaruh terhadap hasil analisis
Pendekatan nilai outlier menggunakan “capping and flooring”. Pada metode ini suatu nilai diidentifikasi sebagai outlier jika melebihi P (percentile) ke 99 / P99 dan jika di bawah P1 dari nilai yang diberikan [1].
Untuk variabel n_hot_rooms
karena nilai max-nya > P99 maka nilai outlier-nya diturunkan.
quantile(df$n_hot_rooms, 0.99)
## 99%
## 15.39952
uv = 3*quantile(df$n_hot_rooms, 0.99)
df$n_hot_rooms[df$n_hot_rooms > uv] <- uv
summary(df$n_hot_rooms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.06 11.19 12.72 12.86 14.17 46.20
Untuk variabel rainfall
karena nilai min-nya < P1 maka nilai outlier-nya dinaikkan.
lv = 0.3*quantile(df$rainfall, 0.01)
df$rainfall[df$rainfall<lv] <- lv
summary(df$rainfall)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 28.00 39.00 39.19 50.00 60.00
Ada beberapa cara dalam penanganan missing value. Bisa memasukkan nilai 0, mengisi dengan mean/median/modus. Akan tetapi pada materi kali ini untuk mengatasi missing value menggunakan mean.
Sebelum ditangani atribute n_hos_beds
terdapat nilai missing value.
summary(df$n_hos_beds)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.268 6.635 7.999 7.900 9.088 10.876 8
Kemudian nilai missing value diisi dengan metode mean atau nilai rata-ratanya
mean(df$n_hos_beds)
## [1] NA
mean(df$n_hos_beds, na.rm = TRUE)
## [1] 7.899767
which(is.na(df$n_hos_beds))
## [1] 51 113 216 261 360 404 417 497
df$n_hos_beds[is.na(df$n_hos_beds)] <- mean(df$n_hos_beds, na.rm = TRUE)
Hasilnya tidak ada nilai missing value
summary(df$n_hos_beds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.268 6.659 7.963 7.900 9.076 10.876
which(is.na(df$n_hos_beds))
## integer(0)
Selanjutnya menghapus atribute yang tidak digunakan, pada tahap observasi ditemukan atribute yang tidak mempengaruhi pada proses analisis. Namun, ada lagi atribute yang tidak mempengaruhi yaitu atribute dist1
, dist2
, dist3
, dan dist4
. Maka 4 atribute itu harus dihapus.
df2 <-df[,-6:-9]
df <- df2
rm(df2)
Kemudian menghapus atribut bus_term
juga
df <-df[-13]
Pada proses analisis regresi data harus berupa numeric, tidak boleh ada data selain numeric termasuk kategorial. Namun, pada dataset kita terdapat atribute berbentuk kategori yaitu atribute airport
dan waterbody
. Bagaimana cara mengatasinya ?
R sudah menyediakan library dummies
, library ini digunakan untuk membuat atribute bayangan. Teknisnya atribute dibagi sesuai dengan kategorinya. Misalnya atribute airport
terdapat 2 kategori yaitu YES dan NO. Maka atributenya dibagi menjadi 2 yaitu airportNO
dan airportYES
.
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
df <- dummy.data.frame(df)
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
Penulis ingin menghapus atribute nomer 8 dan 13 yaitu atribute airportNO
dan waterbodyNONE
.
df <-df[,-8]
df <-df[,-13]
Korelasi adalah suatu nilai yang menunjukkan keeratan hubungan antara dua variabel. Dua variabel disebut berkorelasi positif bila kedua buah variabel tersebut mengalami kenaikan serupa. Dua variabel berkorelasi negatif bila kedua buah variabel berlawanan maksudnya satu variabelmengalami kenaikan ketika variabel lainnya turun.
Pada R sendiri ada function untuk menghitung nilai korelasi matriks, berikut contohnya:
cor(df)
## price resid_area air_qual room_num
## price 1.00000000 -0.484754379 -0.429300219 0.696303794
## resid_area -0.48475438 1.000000000 0.763651447 -0.391675853
## air_qual -0.42930022 0.763651447 1.000000000 -0.302188188
## room_num 0.69630379 -0.391675853 -0.302188188 1.000000000
## age -0.37799890 0.644778511 0.731470104 -0.240264931
## teachers 0.50565462 -0.383247556 -0.188932677 0.355501495
## poor_prop -0.74083599 0.603799716 0.590878921 -0.613808272
## airportYES 0.18286708 -0.115401422 -0.073903182 0.163773760
## n_hos_beds 0.10887997 0.005798539 -0.049552671 0.032009306
## n_hot_rooms 0.01700713 -0.003760916 0.007238427 0.014582669
## waterbodyLake 0.03623283 -0.026589652 -0.046393025 -0.004195248
## waterbodyLake and River -0.03749700 0.051648836 0.013849087 0.010554415
## waterbodyRiver 0.07175147 -0.098976235 -0.037772472 0.046250988
## rainfall -0.04720035 0.055845495 0.091955932 -0.064718054
## parks -0.39157407 0.707634823 0.915543600 -0.282816674
## Sold -0.15469836 0.024404423 -0.004017022 0.027147654
## age teachers poor_prop airportYES
## price -0.377998896 0.505654619 -0.740835993 0.182867077
## resid_area 0.644778511 -0.383247556 0.603799716 -0.115401422
## air_qual 0.731470104 -0.188932677 0.590878921 -0.073903182
## room_num -0.240264931 0.355501495 -0.613808272 0.163773760
## age 1.000000000 -0.261515012 0.602338529 0.005101429
## teachers -0.261515012 1.000000000 -0.374044317 0.069436684
## poor_prop 0.602338529 -0.374044317 1.000000000 -0.095053886
## airportYES 0.005101429 0.069436684 -0.095053886 1.000000000
## n_hos_beds -0.021011632 -0.008056261 -0.066007503 -0.006365440
## n_hot_rooms 0.013918222 -0.037007002 0.017035504 -0.055338457
## waterbodyLake 0.003451883 0.048716990 0.003197048 0.035490605
## waterbodyLake and River -0.004354081 -0.046980765 0.020620334 -0.070341428
## waterbodyRiver -0.088608658 0.094256280 -0.109003630 0.017340976
## rainfall 0.074684329 -0.045928130 0.061444338 -0.013170629
## parks 0.673850360 -0.187004256 0.552310042 -0.052503091
## Sold -0.016291411 0.042524703 -0.082776328 -0.070371381
## n_hos_beds n_hot_rooms waterbodyLake
## price 0.108879969 0.017007130 0.036232826
## resid_area 0.005798539 -0.003760916 -0.026589652
## air_qual -0.049552671 0.007238427 -0.046393025
## room_num 0.032009306 0.014582669 -0.004195248
## age -0.021011632 0.013918222 0.003451883
## teachers -0.008056261 -0.037007002 0.048716990
## poor_prop -0.066007503 0.017035504 0.003197048
## airportYES -0.006365440 -0.055338457 0.035490605
## n_hos_beds 1.000000000 -0.003129899 0.042277609
## n_hot_rooms -0.003129899 1.000000000 0.037924587
## waterbodyLake 0.042277609 0.037924587 1.000000000
## waterbodyLake and River 0.059482068 0.014754998 -0.196747290
## waterbodyRiver -0.074148044 -0.064096372 -0.366562891
## rainfall 0.058595806 0.014869392 -0.016170315
## parks -0.071271771 0.023756726 -0.034991348
## Sold 0.066847394 -0.090338284 -0.061413897
## waterbodyLake and River waterbodyRiver rainfall
## price -0.037497000 0.07175147 -0.04720035
## resid_area 0.051648836 -0.09897624 0.05584549
## air_qual 0.013849087 -0.03777247 0.09195593
## room_num 0.010554415 0.04625099 -0.06471805
## age -0.004354081 -0.08860866 0.07468433
## teachers -0.046980765 0.09425628 -0.04592813
## poor_prop 0.020620334 -0.10900363 0.06144434
## airportYES -0.070341428 0.01734098 -0.01317063
## n_hos_beds 0.059482068 -0.07414804 0.05859581
## n_hot_rooms 0.014754998 -0.06409637 0.01486939
## waterbodyLake -0.196747290 -0.36656289 -0.01617032
## waterbodyLake and River 1.000000000 -0.30409469 0.10923352
## waterbodyRiver -0.304094685 1.00000000 -0.03701623
## rainfall 0.109233521 -0.03701623 1.00000000
## parks 0.013264558 -0.04886215 0.07827804
## Sold -0.003116641 0.07284241 -0.04011386
## parks Sold
## price -0.391574065 -0.154698364
## resid_area 0.707634823 0.024404423
## air_qual 0.915543600 -0.004017022
## room_num -0.282816674 0.027147654
## age 0.673850360 -0.016291411
## teachers -0.187004256 0.042524703
## poor_prop 0.552310042 -0.082776328
## airportYES -0.052503091 -0.070371381
## n_hos_beds -0.071271771 0.066847394
## n_hot_rooms 0.023756726 -0.090338284
## waterbodyLake -0.034991348 -0.061413897
## waterbodyLake and River 0.013264558 -0.003116641
## waterbodyRiver -0.048862152 0.072842409
## rainfall 0.078278037 -0.040113860
## parks 1.000000000 0.004808074
## Sold 0.004808074 1.000000000
Agar tidak terlalu banyak angka, bisa kita kurangi dengan function round().
round(cor(df),2)
## price resid_area air_qual room_num age teachers
## price 1.00 -0.48 -0.43 0.70 -0.38 0.51
## resid_area -0.48 1.00 0.76 -0.39 0.64 -0.38
## air_qual -0.43 0.76 1.00 -0.30 0.73 -0.19
## room_num 0.70 -0.39 -0.30 1.00 -0.24 0.36
## age -0.38 0.64 0.73 -0.24 1.00 -0.26
## teachers 0.51 -0.38 -0.19 0.36 -0.26 1.00
## poor_prop -0.74 0.60 0.59 -0.61 0.60 -0.37
## airportYES 0.18 -0.12 -0.07 0.16 0.01 0.07
## n_hos_beds 0.11 0.01 -0.05 0.03 -0.02 -0.01
## n_hot_rooms 0.02 0.00 0.01 0.01 0.01 -0.04
## waterbodyLake 0.04 -0.03 -0.05 0.00 0.00 0.05
## waterbodyLake and River -0.04 0.05 0.01 0.01 0.00 -0.05
## waterbodyRiver 0.07 -0.10 -0.04 0.05 -0.09 0.09
## rainfall -0.05 0.06 0.09 -0.06 0.07 -0.05
## parks -0.39 0.71 0.92 -0.28 0.67 -0.19
## Sold -0.15 0.02 0.00 0.03 -0.02 0.04
## poor_prop airportYES n_hos_beds n_hot_rooms
## price -0.74 0.18 0.11 0.02
## resid_area 0.60 -0.12 0.01 0.00
## air_qual 0.59 -0.07 -0.05 0.01
## room_num -0.61 0.16 0.03 0.01
## age 0.60 0.01 -0.02 0.01
## teachers -0.37 0.07 -0.01 -0.04
## poor_prop 1.00 -0.10 -0.07 0.02
## airportYES -0.10 1.00 -0.01 -0.06
## n_hos_beds -0.07 -0.01 1.00 0.00
## n_hot_rooms 0.02 -0.06 0.00 1.00
## waterbodyLake 0.00 0.04 0.04 0.04
## waterbodyLake and River 0.02 -0.07 0.06 0.01
## waterbodyRiver -0.11 0.02 -0.07 -0.06
## rainfall 0.06 -0.01 0.06 0.01
## parks 0.55 -0.05 -0.07 0.02
## Sold -0.08 -0.07 0.07 -0.09
## waterbodyLake waterbodyLake and River waterbodyRiver
## price 0.04 -0.04 0.07
## resid_area -0.03 0.05 -0.10
## air_qual -0.05 0.01 -0.04
## room_num 0.00 0.01 0.05
## age 0.00 0.00 -0.09
## teachers 0.05 -0.05 0.09
## poor_prop 0.00 0.02 -0.11
## airportYES 0.04 -0.07 0.02
## n_hos_beds 0.04 0.06 -0.07
## n_hot_rooms 0.04 0.01 -0.06
## waterbodyLake 1.00 -0.20 -0.37
## waterbodyLake and River -0.20 1.00 -0.30
## waterbodyRiver -0.37 -0.30 1.00
## rainfall -0.02 0.11 -0.04
## parks -0.03 0.01 -0.05
## Sold -0.06 0.00 0.07
## rainfall parks Sold
## price -0.05 -0.39 -0.15
## resid_area 0.06 0.71 0.02
## air_qual 0.09 0.92 0.00
## room_num -0.06 -0.28 0.03
## age 0.07 0.67 -0.02
## teachers -0.05 -0.19 0.04
## poor_prop 0.06 0.55 -0.08
## airportYES -0.01 -0.05 -0.07
## n_hos_beds 0.06 -0.07 0.07
## n_hot_rooms 0.01 0.02 -0.09
## waterbodyLake -0.02 -0.03 -0.06
## waterbodyLake and River 0.11 0.01 0.00
## waterbodyRiver -0.04 -0.05 0.07
## rainfall 1.00 0.08 -0.04
## parks 0.08 1.00 0.00
## Sold -0.04 0.00 1.00
[1] G. Kanda, “Selecting the Appropriate Outlier Treatment for Common Industry Applications,” Stat. Data Anal., pp. 1–5, 2007.