The data used in this experiment comes from Kaggle (link: https://www.kaggle.com/camnugent/california-housing-prices). Each record describes the housing in one California district, with summary statistics drawn from the 1990 census.
rumah <- read.csv("housing.csv")
head(rumah)
dim(rumah)
#> [1] 20640 10
The column names are shortened so that later exploration is more efficient and less prone to typing errors.
library(dplyr)
rumah <- rumah %>%
rename(Age = housing_median_age,
Income = median_income,
Value = median_house_value)
head(rumah)
glimpse(rumah)
#> Rows: 20,640
#> Columns: 10
#> $ longitude <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.25, …
#> $ latitude <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84…
#> $ Age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52, 52…
#> $ total_rooms <dbl> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555, 35…
#> $ total_bedrooms <dbl> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, 434…
#> $ population <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 1551,…
#> $ households <dbl> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, 402…
#> $ Income <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591…
#> $ Value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299200…
#> $ ocean_proximity <chr> "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR …
The ocean_proximity column is converted from chr to factor, because it describes how close the houses are to the ocean using a small set of categories. The total_rooms, total_bedrooms, and households columns are converted to int, because they are counts and therefore whole numbers.
unique(rumah$ocean_proximity)
#> [1] "NEAR BAY" "<1H OCEAN" "INLAND" "NEAR OCEAN" "ISLAND"
rumah <- rumah %>%
mutate_at(vars(total_rooms, total_bedrooms, households), as.integer) %>%
mutate(ocean_proximity = as.factor(ocean_proximity))
glimpse(rumah)
#> Rows: 20,640
#> Columns: 10
#> $ longitude <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.25, …
#> $ latitude <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84…
#> $ Age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52, 52…
#> $ total_rooms <int> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555, 35…
#> $ total_bedrooms <int> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, 434…
#> $ population <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 1551,…
#> $ households <int> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, 402…
#> $ Income <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591…
#> $ Value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299200…
#> $ ocean_proximity <fct> NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR…
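Casting to integer is only safe when the values really are whole numbers, since as.integer() silently truncates any fractional part. The following is a minimal sketch of such a check, written in base R on a toy data frame; the `counts` object and the `is_whole()` helper are hypothetical names used for illustration and are not part of the original analysis.

```r
# Toy stand-in for the count columns (not the actual housing data)
counts <- data.frame(total_rooms = c(880, 7099, 1467),
                     households  = c(126, 1138, 177))

# TRUE when every non-missing value is a whole number
is_whole <- function(x) all(x == round(x), na.rm = TRUE)

whole <- vapply(counts, is_whole, logical(1))
whole

# Cast only when the check passes, so no fractional part is silently dropped
if (all(whole)) counts[] <- lapply(counts, as.integer)
str(counts)
```

The same check could be run on the real columns before the conversion step above.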
colSums(is.na(rumah))
#>       longitude        latitude             Age     total_rooms  total_bedrooms
#>               0               0               0               0             207
#>      population      households          Income           Value ocean_proximity
#>               0               0               0               0               0
There are 207 missing values in total_bedrooms. Before handling them, the distribution of the variable is examined so that the imputation method suits the data.
boxplot(rumah$total_bedrooms, col = 'brown2', xlab = 'Total Bedrooms')
Because total_bedrooms contains outliers, the missing values are filled with the median. The median is a good replacement value when outliers are present because, unlike the mean, it is barely affected by extreme values.
# Fill the missing values with the median
library(Hmisc)
rumah$total_bedrooms <- with(rumah, impute(total_bedrooms, median))
str(rumah)
#> 'data.frame': 20640 obs. of 10 variables:
#> $ longitude : num -122 -122 -122 -122 -122 ...
#> $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
#> $ Age : num 41 21 52 52 52 52 52 52 42 52 ...
#> $ total_rooms : int 880 7099 1467 1274 1627 919 2535 3104 2555 3549 ...
#> $ total_bedrooms : 'impute' int 129 1106 190 235 280 213 489 687 665 707 ...
#> ..- attr(*, "imputed")= int [1:207] 291 342 539 564 697 739 1098 1351 1457 1494 ...
#> $ population : num 322 2401 496 558 565 ...
#> $ households : int 126 1138 177 219 259 193 514 647 595 714 ...
#> $ Income : num 8.33 8.3 7.26 5.64 3.85 ...
#> $ Value : num 452600 358500 352100 341300 342200 ...
#> $ ocean_proximity: Factor w/ 5 levels "<1H OCEAN","INLAND",..: 4 4 4 4 4 4 4 4 4 4 ...
# impute() returns an object of class 'impute'; convert it back to a plain integer vector
rumah$total_bedrooms <- as.integer(rumah$total_bedrooms)
glimpse(rumah)
#> Rows: 20,640
#> Columns: 10
#> $ longitude <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.25, …
#> $ latitude <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84…
#> $ Age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52, 52…
#> $ total_rooms <int> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555, 35…
#> $ total_bedrooms <int> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, 434…
#> $ population <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 1551,…
#> $ households <int> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, 402…
#> $ Income <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591…
#> $ Value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299200…
#> $ ocean_proximity <fct> NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR…
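To illustrate why the median was preferred over the mean, here is a small base-R sketch on a hypothetical vector with one extreme value. It does not use Hmisc; the vector `x` and its values are invented for the example.

```r
# Toy vector with one extreme value and two missing entries (hypothetical data)
x <- c(100, 120, 130, 150, 5000, NA, NA)

med <- median(x, na.rm = TRUE)  # middle value of the observed data
avg <- mean(x, na.rm = TRUE)    # pulled upward by the extreme value

c(median = med, mean = avg)

# Base-R median imputation: replace each NA with the median of the observed values
x_imputed <- ifelse(is.na(x), med, x)
x_imputed
```

Here the median is 130 while the mean is 1100, so filling the gaps with the mean would insert values far from the typical observation.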
The data still contains variables whose unit does not match the target. total_rooms, total_bedrooms, and population are totals per block, not per house, so they are converted to per-household averages to match the unit of the target in this case, the house value (Value).
rumah$rooms <- round(rumah$total_rooms / rumah$households)
rumah$bedrooms <- round(rumah$total_bedrooms / rumah$households)
rumah$person <- round(rumah$population / rumah$households)
head(rumah)
summary(rumah)
#> longitude latitude Age total_rooms
#> Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
#> 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
#> Median :-118.5 Median :34.26 Median :29.00 Median : 2127
#> Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
#> 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
#> Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
#> total_bedrooms population households Income
#> Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
#> 1st Qu.: 297.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
#> Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
#> Mean : 536.8 Mean : 1425 Mean : 499.5 Mean : 3.8707
#> 3rd Qu.: 643.2 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
#> Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
#> Value ocean_proximity rooms bedrooms
#> Min. : 14999 <1H OCEAN :9136 Min. : 1.000 Min. : 0.00
#> 1st Qu.:119600 INLAND :6551 1st Qu.: 4.000 1st Qu.: 1.00
#> Median :179700 ISLAND : 5 Median : 5.000 Median : 1.00
#> Mean :206856 NEAR BAY :2290 Mean : 5.425 Mean : 1.05
#> 3rd Qu.:264725 NEAR OCEAN:2658 3rd Qu.: 6.000 3rd Qu.: 1.00
#> Max. :500001 Max. :142.000 Max. :34.00
#> person
#> Min. : 1.000
#> 1st Qu.: 2.000
#> Median : 3.000
#> Mean : 3.076
#> 3rd Qu.: 3.000
#> Max. :1243.000
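As a quick check of the per-household conversion, the sketch below recomputes the three ratios for the first three rows shown in the glimpse() output earlier. The `blocks` data frame is a hand-typed copy of those rows, and the `bedrooms_raw` column is an addition for comparison only; it is not part of the original pipeline.

```r
# First three rows of the block-level totals, copied from the glimpse() output
blocks <- data.frame(
  total_rooms    = c(880, 7099, 1467),
  total_bedrooms = c(129, 1106, 190),
  population     = c(322, 2401, 496),
  households     = c(126, 1138, 177)
)

# Per-household averages, rounded as in the analysis
blocks$rooms    <- round(blocks$total_rooms / blocks$households)
blocks$bedrooms <- round(blocks$total_bedrooms / blocks$households)
blocks$person   <- round(blocks$population / blocks$households)

# Unrounded ratio, shown only for comparison (not used in the original analysis)
blocks$bedrooms_raw <- blocks$total_bedrooms / blocks$households

blocks[, c("rooms", "bedrooms", "person", "bedrooms_raw")]
```

The rounded bedrooms value is 1 for all three rows even though the raw ratios differ, which hints at how much variation rounding can remove from this variable.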
# Distribution of the categorical variable
library(ggplot2)
ggplot(rumah, aes(x = ocean_proximity, y = Value, color = ocean_proximity)) +
  geom_boxplot(color = 'black') +
  geom_jitter(position = position_jitter(0.3)) +
  theme_classic()
💡 Insight :
# Total population by ocean_proximity
agg1 <- aggregate(population ~ ocean_proximity, data = rumah, FUN = sum)
agg1
library(scales)
ggplot(data = agg1, mapping = aes(y = reorder(ocean_proximity, population), x = population)) +
geom_col(aes(fill = population)) +
labs(y = "Ocean Proximity", x = "Population", title = "Population by Ocean Proximity") +
theme_minimal() +
theme(legend.position = "none")
💡 Insight :
# Distribution of the numeric variables
# Select only the numeric variables that are needed
# (columns 3, 7:9, 11:13 = Age, households, Income, Value, rooms, bedrooms, person)
library(tidyverse)
library(GGally)
rumah %>%
  select(c(3, 7:9, 11:13)) %>%
  ggpairs()
💡 Insight :
# Check for outliers with boxplots
rumah %>%
  select(c(3, 7:9, 11:13)) %>%
  gather() %>%
  ggplot(aes(x = key, y = value)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = "free") +
  ggtitle("Distribution of Age, Household, Income, House Value, Room, Bedrooms, Person") +
  theme(plot.title = element_text(hjust = 0.5))
Almost all variables have outliers; the exception is house age (Age). The outliers are handled by capping: values above the upper fence (Q3 + 1.5 × IQR) are replaced with the 95th percentile, and values below the lower fence (Q1 − 1.5 × IQR) are replaced with the 5th percentile. Capping limits the influence of extreme values on the summary statistics and on the analysis that follows.
# Replace upper outliers with the 95th percentile and lower outliers with the 5th percentile
f1 <- function(x) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = TRUE)   # Q1 and Q3
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)  # replacement values
  H <- 1.5 * IQR(x, na.rm = TRUE)                         # fence width
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  x
}
# Apply the function to the variables that have outliers
rumah2 <- rumah %>%
  mutate(across(c(7:9, 11:13), f1))
# Compare the summaries before and after outlier handling
summary(rumah)
#> longitude latitude Age total_rooms
#> Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
#> 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
#> Median :-118.5 Median :34.26 Median :29.00 Median : 2127
#> Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
#> 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
#> Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
#> total_bedrooms population households Income
#> Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
#> 1st Qu.: 297.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
#> Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
#> Mean : 536.8 Mean : 1425 Mean : 499.5 Mean : 3.8707
#> 3rd Qu.: 643.2 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
#> Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
#> Value ocean_proximity rooms bedrooms
#> Min. : 14999 <1H OCEAN :9136 Min. : 1.000 Min. : 0.00
#> 1st Qu.:119600 INLAND :6551 1st Qu.: 4.000 1st Qu.: 1.00
#> Median :179700 ISLAND : 5 Median : 5.000 Median : 1.00
#> Mean :206856 NEAR BAY :2290 Mean : 5.425 Mean : 1.05
#> 3rd Qu.:264725 NEAR OCEAN:2658 3rd Qu.: 6.000 3rd Qu.: 1.00
#> Max. :500001 Max. :142.000 Max. :34.00
#> person
#> Min. : 1.000
#> 1st Qu.: 2.000
#> Median : 3.000
#> Mean : 3.076
#> 3rd Qu.: 3.000
#> Max. :1243.000
# After outlier handling
summary(rumah2)
#> longitude latitude Age total_rooms
#> Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
#> 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
#> Median :-118.5 Median :34.26 Median :29.00 Median : 2127
#> Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
#> 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
#> Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
#> total_bedrooms population households Income
#> Min. : 1.0 Min. : 3 Min. : 1.0 Min. :0.4999
#> 1st Qu.: 297.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.:2.5634
#> Median : 435.0 Median : 1166 Median : 409.0 Median :3.5348
#> Mean : 536.8 Mean : 1425 Mean : 473.1 Mean :3.7775
#> 3rd Qu.: 643.2 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.:4.7432
#> Max. :6445.0 Max. :35682 Max. :1162.0 Max. :8.0113
#> Value ocean_proximity rooms bedrooms person
#> Min. : 14999 <1H OCEAN :9136 Min. :1.000 Min. :1 Min. :1.000
#> 1st Qu.:119600 INLAND :6551 1st Qu.:4.000 1st Qu.:1 1st Qu.:2.000
#> Median :179700 ISLAND : 5 Median :5.000 Median :1 Median :3.000
#> Mean :206365 NEAR BAY :2290 Mean :5.298 Mean :1 Mean :2.879
#> 3rd Qu.:264725 NEAR OCEAN:2658 3rd Qu.:6.000 3rd Qu.:1 3rd Qu.:3.000
#> Max. :489810 Max. :9.000 Max. :1 Max. :4.000
💡 Insight :
glimpse(rumah2)
#> Rows: 20,640
#> Columns: 13
#> $ longitude <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.25, …
#> $ latitude <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84…
#> $ Age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52, 52…
#> $ total_rooms <int> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555, 35…
#> $ total_bedrooms <int> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, 434…
#> $ population <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 1551,…
#> $ households <dbl> 126, 1162, 177, 219, 259, 193, 514, 647, 595, 714, 402…
#> $ Income <dbl> 7.300305, 7.300305, 7.257400, 5.643100, 3.846200, 4.03…
#> $ Value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299200…
#> $ ocean_proximity <fct> NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR BAY, NEAR…
#> $ rooms <dbl> 7, 6, 8, 6, 6, 5, 5, 5, 4, 5, 5, 5, 5, 4, 4, 4, 6, 4, …
#> $ bedrooms <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ person <dbl> 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, …
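The capping rule applied to produce rumah2 can be traced on a small vector. The sketch below repeats the same logic under the hypothetical name `cap_outliers()` and applies it to twenty invented values, one of which is extreme; it illustrates the rule and makes no claim about the real data.

```r
# Same logic as the capping function in the analysis, under a hypothetical name
cap_outliers <- function(x) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)  # Q1 and Q3
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)  # replacement values
  H    <- 1.5 * IQR(x, na.rm = TRUE)                      # fence width
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  x
}

# Twenty toy values; 100 lies far above the upper fence (Q3 + 1.5 * IQR)
x <- c(1:19, 100)
upper_fence <- unname(quantile(x, .75) + 1.5 * IQR(x))

capped <- cap_outliers(x)
upper_fence
tail(capped, 3)
```

Only the extreme value changes: it is pulled down to the 95th percentile of the original vector, while the other nineteen values are left as they were. Note that the percentile itself is computed from data that still includes the outlier.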
# Check the boxplots after outlier handling
rumah2 %>%
  select(c(7:9, 11:13)) %>%
  gather() %>%
  ggplot(aes(x = key, y = value)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = "free") +
  ggtitle("Distribution of Household, Income, House Value, Room, Bedrooms, Person") +
  theme(plot.title = element_text(hjust = 0.5))
After capping, the bedrooms variable takes only a single value, 1, so it is left out of the model.
rumah3 <- rumah2 %>%
  select('longitude', 'latitude', 'Age', 'households', 'Income', 'Value', 'ocean_proximity', 'rooms', 'person')
head(rumah3)
# Split the data into training and testing sets with an 80:20 ratio
set.seed(1234)
library(caret)
bagi <- createDataPartition(rumah3$Value, p = 0.8, list = FALSE)
training <- rumah3[bagi, ]
testing <- rumah3[-bagi, ]
# Validation method: 5-fold cross-validation
fit.control <- trainControl(method = "cv", number = 5)
# Linear regression
lr <- train(Value ~ ., data = training, method = "lm", trControl = fit.control)
summary(lr)
#>
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -295087 -44070 -8881 33148 496155
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) -2011786.199 96439.410 -20.861
#> longitude -25188.762 1119.256 -22.505
#> latitude -25012.310 1112.373 -22.486
#> Age 1188.662 48.726 24.395
#> households 22.440 2.088 10.750
#> Income 42335.655 502.076 84.321
#> ocean_proximityINLAND -39738.969 1916.022 -20.740
#> ocean_proximityISLAND 149993.171 30217.528 4.964
#> `ocean_proximityNEAR BAY` -6118.564 2097.134 -2.918
#> `ocean_proximityNEAR OCEAN` 2157.632 1729.550 1.248
#> rooms 1096.812 617.321 1.777
#> person -34713.659 789.728 -43.956
#> Pr(>|t|)
#> (Intercept) < 0.0000000000000002 ***
#> longitude < 0.0000000000000002 ***
#> latitude < 0.0000000000000002 ***
#> Age < 0.0000000000000002 ***
#> households < 0.0000000000000002 ***
#> Income < 0.0000000000000002 ***
#> ocean_proximityINLAND < 0.0000000000000002 ***
#> ocean_proximityISLAND 0.000000698 ***
#> `ocean_proximityNEAR BAY` 0.00353 **
#> `ocean_proximityNEAR OCEAN` 0.21223
#> rooms 0.07563 .
#> person < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 67460 on 16501 degrees of freedom
#> Multiple R-squared: 0.6497, Adjusted R-squared: 0.6494
#> F-statistic: 2782 on 11 and 16501 DF, p-value: < 0.00000000000000022
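The 5-fold cross-validation requested through trainControl() can be pictured with a short base-R sketch. It runs on simulated data (`toy`, an invented linear relationship) and assigns folds purely at random, which is a simplification; caret builds its folds with its own partitioning routine, so this is an illustration of the idea, not a reproduction of what train() does internally.

```r
set.seed(1234)

# Simulated data: y depends linearly on x plus noise (hypothetical example)
n <- 100
toy <- data.frame(x = runif(n))
toy$y <- 3 * toy$x + rnorm(n, sd = 0.5)

# Assign each row to one of k folds at random
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
rmse_per_fold <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = toy[fold != i, ])
  pred <- predict(fit, newdata = toy[fold == i, ])
  sqrt(mean((toy$y[fold == i] - pred)^2))
})

rmse_per_fold
mean(rmse_per_fold)  # the cross-validated RMSE is the average over folds
```

Each observation is held out exactly once, so the averaged RMSE is an estimate of out-of-sample error that does not touch the testing set.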
# Predict on the testing data
predicted_lr <- predict(lr, testing)
# Error and goodness of fit of the predictions
postResample(predicted_lr, testing$Value)
#>          RMSE      Rsquared           MAE
#> 68602.5786845     0.6446888 51101.9436921
range(rumah3$Value)
#> [1]  14999 489810
Based on the evaluation metrics above, the RMSE of the regression model (about 68,600) is still within tolerance: it amounts to roughly 14% of the range of Value (14,999 to 489,810). The R-squared of about 0.64 indicates that the predictors explain roughly 64% of the variability in house value. The MAE shows that, on average, a prediction misses the actual value by about 51,100, or roughly 11% of the same range.
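For readers who want to see what the three metrics measure, the sketch below computes them by hand on five invented observation/prediction pairs. It assumes, as I understand caret's default behaviour, that postResample() reports R-squared as the squared correlation between predictions and observations; the vectors `obs` and `pred` are hypothetical.

```r
# Hypothetical observed and predicted house values
obs  <- c(200000, 150000, 320000, 410000, 95000)
pred <- c(210000, 140000, 300000, 380000, 120000)

rmse <- sqrt(mean((pred - obs)^2))  # penalises large errors more strongly
mae  <- mean(abs(pred - obs))       # average absolute error
r2   <- cor(pred, obs)^2            # squared Pearson correlation (assumed definition)

c(RMSE = rmse, Rsquared = r2, MAE = mae)

# RMSE expressed as a share of the observed range, to judge its size
rmse / diff(range(obs))
```

Relating RMSE and MAE to the range of the target is one simple way to judge whether an error of a given size is large or small for the problem at hand.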
# Decision tree
dtr <- train(Value ~ ., data = training, method = "rpart", trControl = fit.control)
dtr
#> CART
#>
#> 16513 samples
#> 8 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold)
#> Summary of sample sizes: 13210, 13210, 13210, 13211, 13211
#> Resampling results across tuning parameters:
#>
#> cp RMSE Rsquared MAE
#> 0.05065877 81750.36 0.4850866 61566.79
#> 0.13246228 90650.90 0.3648337 70008.09
#> 0.31536241 106529.92 0.2997310 83849.88
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was cp = 0.05065877.
The results of the Decision Tree Regression (CART) model with 5-fold cross-validation are as follows:
# Predict on the testing data
predicted_dtr <- predict(dtr, testing)
# Error and goodness of fit of the predictions
postResample(predicted_dtr, testing$Value)
#>          RMSE      Rsquared           MAE
#> 87799.3471007     0.4176067 66236.6428320
Compared with the linear regression model above, the decision tree performs considerably worse on the testing data: its RMSE is higher (about 87,800 versus 68,600), its R-squared is lower (0.42 versus 0.64), and its MAE is higher (about 66,200 versus 51,100).
# Gradient boosting; verbose = FALSE is passed through to gbm to suppress the per-iteration training log
gbr <- train(Value ~ ., data = training, method = "gbm", trControl = fit.control, verbose = FALSE)
gbr
#> Stochastic Gradient Boosting
#>
#> 16513 samples
#> 8 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold)
#> Summary of sample sizes: 13209, 13210, 13210, 13211, 13212
#> Resampling results across tuning parameters:
#>
#> interaction.depth n.trees RMSE Rsquared MAE
#> 1 50 70102.96 0.6453714 52301.81
#> 1 100 65448.95 0.6777533 47987.37
#> 1 150 63389.80 0.6946316 46163.40
#> 2 50 64237.80 0.6920034 46955.40
#> 2 100 59410.09 0.7300034 42526.80
#> 2 150 57166.64 0.7486976 40564.45
#> 3 50 60589.77 0.7236238 44051.75
#> 3 100 56289.66 0.7566408 39929.50
#> 3 150 54349.92 0.7727556 38269.18
#>
#> Tuning parameter 'shrinkage' was held constant at a value of 0.1
#>
#> Tuning parameter 'n.minobsinnode' was held constant at a value of 10
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were n.trees = 150, interaction.depth =
#> 3, shrinkage = 0.1 and n.minobsinnode = 10.
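The nine rows in the tuning table correspond to a 3 × 3 grid over n.trees and interaction.depth, with shrinkage and n.minobsinnode held fixed. The sketch below rebuilds that grid in base R, attaches the cross-validated RMSE values copied from the output above, and applies the stated selection rule (smallest RMSE). The object names `grid` and `best` are illustrative.

```r
# Rebuild the tuning grid; expand.grid varies the first argument fastest,
# which matches the row order of the table printed by caret
grid <- expand.grid(n.trees = c(50, 100, 150),
                    interaction.depth = 1:3,
                    shrinkage = 0.1,
                    n.minobsinnode = 10)

# Cross-validated RMSE values copied from the output above
grid$RMSE <- c(70102.96, 65448.95, 63389.80,
               64237.80, 59410.09, 57166.64,
               60589.77, 56289.66, 54349.92)

# Selection rule: keep the combination with the smallest RMSE
best <- grid[which.min(grid$RMSE), ]
best
```

The selected row has n.trees = 150 and interaction.depth = 3, the largest values in the grid, which suggests that a wider grid might be worth exploring.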
# Predict on the testing data
predicted_gbr <- predict(gbr, testing)
# Error and goodness of fit of the predictions
postResample(predicted_gbr, testing$Value)
#>          RMSE      Rsquared           MAE
#> 54976.8352835     0.7720363 38207.1638667
# Compare the cross-validation results across models
model_list <- list(LinearRegression = lr,
                   DecisionTreeReg = dtr,
                   GradientBoostingReg = gbr)
res <- resamples(model_list)
summary(res)
#>
#> Call:
#> summary.resamples(object = res)
#>
#> Models: LinearRegression, DecisionTreeReg, GradientBoostingReg
#> Number of resamples: 5
#>
#> MAE
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> LinearRegression 50317.06 50510.59 50543.25 50562.85 50591.33 50852.04 0
#> DecisionTreeReg 60200.76 60431.68 61097.20 61566.79 61408.14 64696.15 0
#> GradientBoostingReg 37371.82 37842.39 38151.74 38269.18 38712.05 39267.89 0
#>
#> RMSE
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> LinearRegression 65614.68 67325.02 67522.41 67507.28 67864.57 69209.75 0
#> DecisionTreeReg 79663.78 79775.43 81735.12 81750.36 82525.30 85052.19 0
#> GradientBoostingReg 52669.30 53944.34 53999.19 54349.92 55245.07 55891.71 0
#>
#> Rsquared
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> LinearRegression 0.6299743 0.6484514 0.6487745 0.6491383 0.6506399 0.6678512
#> DecisionTreeReg 0.4483023 0.4731376 0.4822061 0.4850866 0.5077367 0.5140502
#> GradientBoostingReg 0.7616624 0.7665813 0.7708396 0.7727556 0.7787029 0.7859917
#> NA's
#> LinearRegression 0
#> DecisionTreeReg 0
#> GradientBoostingReg 0
# Variable importance of the gbr model
library(gbm)
varImp(gbr)
#> gbm variable importance
#>
#> Overall
#> Income 100.0000
#> ocean_proximityINLAND 28.6294
#> person 13.1267
#> longitude 11.9185
#> latitude 10.8788
#> Age 5.2926
#> rooms 1.5515
#> households 1.2740
#> ocean_proximityNEAR OCEAN 0.7748
#> ocean_proximityNEAR BAY 0.4942
#> ocean_proximityISLAND 0.0000
Based on the Gradient Boosting model, five terms contribute most to predicting house value (Value) in California: Income, ocean_proximity (through its INLAND level), person, longitude, and latitude. Income dominates with a scaled importance of 100, followed by the INLAND indicator at 28.6; the other three score between 10 and 14. These are relative importance scores, not tests of statistical significance.
With these variables, the Gradient Boosting model gives the most accurate predictions of the three models compared, with a test RMSE of about 55,000 and an R-squared of 0.77. This information can offer useful insight to parties involved in the housing market, such as buyers, sellers, and lenders, when making decisions about property values and prices in the region.
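The importance scores reported by varImp() run from 0 to 100 because, as far as I understand caret's default (scale = TRUE), the raw values are shifted so the smallest becomes 0 and then divided so the largest becomes 100. The sketch below applies that rescaling to invented raw values; both the numbers and the assumption about the exact formula should be checked against the caret documentation.

```r
# Hypothetical raw (unscaled) importance values, for illustration only
raw <- c(Income = 8.0e14, person = 1.1e14, Age = 4.5e13, island = 5.0e11)

# Assumed rescaling: shift the minimum to 0, then scale the maximum to 100
shifted <- raw - min(raw)
scaled  <- shifted / max(shifted) * 100

round(scaled, 2)
```

Under this rescaling a score of 0 marks the least influential term in the model, and every other score reads as a percentage of the most influential one.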