This analysis aims to understand the factors that influence house prices in the provided dataset. It proceeds in several stages: data exploration, feature selection, model validation, and interpretation of the results.
We start by reading the dataset and inspecting its structure.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
library(corrplot)
## corrplot 0.95 loaded
# Load data
house <- read.csv("data_input/HousePrices_HalfMil.csv")
# Check structure
str(house)
## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
## $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ City : int 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
summary(house)
## Area Garage FirePlace Baths
## Min. : 1.0 Min. :1.000 Min. :0.000 Min. :1.000
## 1st Qu.: 63.0 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
## Median :125.0 Median :2.000 Median :2.000 Median :3.000
## Mean :124.9 Mean :2.001 Mean :2.003 Mean :2.998
## 3rd Qu.:187.0 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :249.0 Max. :3.000 Max. :4.000 Max. :5.000
## White.Marble Black.Marble Indian.Marble Floors
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.333 Mean :0.3327 Mean :0.3343 Mean :0.4994
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## City Solar Electric Fiber
## Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :2.001 Mean :0.4987 Mean :0.5007 Mean :0.5005
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Glass.Doors Swiming.Pool Garden Prices
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 7725
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:33500
## Median :0.0000 Median :1.0000 Median :1.0000 Median :41850
## Mean :0.4999 Mean :0.5004 Mean :0.5016 Mean :42050
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:50750
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :77975
Before building a model, we need to choose the target variable.
- If we want to predict house prices, we need to check whether the dataset contains a price variable.
- If there is no price variable, we may need a different approach, such as grouping houses by size and amenities.
# List the candidate target variables
target_candidates <- names(house)
target_candidates
## [1] "Area" "Garage" "FirePlace" "Baths"
## [5] "White.Marble" "Black.Marble" "Indian.Marble" "Floors"
## [9] "City" "Solar" "Electric" "Fiber"
## [13] "Glass.Doors" "Swiming.Pool" "Garden" "Prices"
Conclusion: the dataset contains a Prices column, which is the natural target for price prediction. Had no price variable been present, we would have needed an alternative such as classifying houses by size or amenities.
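The check described above can also be done programmatically instead of by reading the column list. The following is a minimal sketch, assuming a small hypothetical data frame `toy` that stands in for `house`; only the column names matter here.

```r
# Hypothetical stand-in for `house`; the values are illustrative only.
toy <- data.frame(Area = c(120, 80), Garage = c(1, 2), Prices = c(43800, 37550))

# Case-insensitive search for column names that contain "price".
price_cols <- grep("price", names(toy), ignore.case = TRUE, value = TRUE)

# Use the first match as the target, or NA if no price-like column exists.
if (length(price_cols) > 0) {
  target <- price_cols[1]
} else {
  target <- NA_character_
}
print(target)
```

If `target` comes back as `NA`, that would be the signal to fall back to the alternative approach.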
Next, we analyze the relationships between variables to determine which features are most informative.
# Inspect the correlations among the numeric variables
cor_matrix <- cor(house)
corrplot(cor_matrix, method = "color", tl.cex = 0.7)
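A correlation plot is easier to act on when the correlations with the target are also ranked numerically. The sketch below uses simulated data (the columns `x1`, `x2`, `x3`, and `y` are hypothetical) to show one way of doing that with base R.

```r
set.seed(1)
n <- 200

# Simulated predictors; y depends strongly on x1, weakly on x2, not on x3.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + 0.5 * toy$x2 + rnorm(n)

cor_matrix <- cor(toy)

# Keep only the correlations of each predictor with the target column.
target_cor <- cor_matrix[setdiff(rownames(cor_matrix), "y"), "y"]

# Rank predictors by the absolute strength of their correlation with y.
ranked <- sort(abs(target_cor), decreasing = TRUE)
print(ranked)
```

For the real data, the same indexing would apply with the chosen target column name in place of `"y"`.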
# Use stepwise selection to choose the best features
model <- lm(Area ~ ., data = house)
step_model <- step(model, direction = "both")
## Start: AIC=-18694931
## Area ~ Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Indian.Marble + Floors + City + Solar + Electric + Fiber +
## Glass.Doors + Swiming.Pool + Garden + Prices
## Warning: attempting model selection on an essentially perfect fit is nonsense
##
## Step: AIC=-18694931
## Area ~ Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Floors + City + Solar + Electric + Fiber + Glass.Doors +
## Swiming.Pool + Garden + Prices
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - Garden 1 0 0 -18694932
## - Swiming.Pool 1 0 0 -18694931
## <none> 0 -18694931
## - Solar 1 12438200 12438200 1606988
## - Electric 1 278724620 278724620 3161714
## - FirePlace 1 666697351 666697351 3597770
## - Garage 1 820131672 820131672 3701334
## - Baths 1 1269527529 1269527529 3919802
## - Black.Marble 1 1454135907 1454135907 3987685
## - Glass.Doors 1 1563126117 1563126117 4023823
## - City 1 1853552578 1853552578 4109031
## - White.Marble 1 2343313175 2343313175 4226262
## - Fiber 1 2357069287 2357069287 4229189
## - Floors 1 2438470455 2438470455 4246165
## - Prices 1 2577219153 2577219153 4273835
##
## Step: AIC=-18694932
## Area ~ Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Floors + City + Solar + Electric + Fiber + Glass.Doors +
## Swiming.Pool + Prices
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - Swiming.Pool 1 0 0 -18694932
## <none> 0 -18694932
## + Garden 1 0 0 -18694931
## - Solar 1 12438435 12438435 1606995
## - Electric 1 278724638 278724638 3161712
## - FirePlace 1 666697540 666697540 3597768
## - Garage 1 820133164 820133164 3701333
## - Baths 1 1269527590 1269527590 3919800
## - Black.Marble 1 1454136440 1454136440 3987683
## - Glass.Doors 1 1563127595 1563127595 4023822
## - City 1 1853553182 1853553182 4109029
## - White.Marble 1 2343315597 2343315597 4226261
## - Fiber 1 2357073746 2357073746 4229188
## - Floors 1 2438476038 2438476038 4246164
## - Prices 1 2577224465 2577224465 4273834
##
## Step: AIC=-18694932
## Area ~ Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Floors + City + Solar + Electric + Fiber + Glass.Doors +
## Prices
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## <none> 0 -18694932
## + Garden 1 0 0 -18694931
## + Swiming.Pool 1 0 0 -18694931
## - Solar 1 12438439 12438439 1606993
## - Electric 1 278724669 278724669 3161710
## - FirePlace 1 666697959 666697959 3597766
## - Garage 1 820133421 820133421 3701331
## - Baths 1 1269529270 1269529270 3919798
## - Black.Marble 1 1454137034 1454137034 3987682
## - Glass.Doors 1 1563127672 1563127672 4023820
## - City 1 1853553403 1853553403 4109028
## - White.Marble 1 2343318548 2343318548 4226259
## - Fiber 1 2357074641 2357074641 4229186
## - Floors 1 2438477068 2438477068 4246162
## - Prices 1 2577225447 2577225447 4273832
summary(step_model)
##
## Call:
## lm(formula = Area ~ Garage + FirePlace + Baths + White.Marble +
## Black.Marble + Floors + City + Solar + Electric + Fiber +
## Glass.Doors + Prices, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.968e-07 0.000e+00 0.000e+00 0.000e+00 5.371e-06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.000e+01 6.069e-11 -6.591e+11 <2e-16 ***
## Garage -6.000e+01 1.593e-11 -3.768e+12 <2e-16 ***
## FirePlace -3.000e+01 8.832e-12 -3.397e+12 <2e-16 ***
## Baths -5.000e+01 1.067e-11 -4.687e+12 <2e-16 ***
## White.Marble -5.600e+02 8.793e-11 -6.368e+12 <2e-16 ***
## Black.Marble -2.000e+02 3.987e-11 -5.017e+12 <2e-16 ***
## Floors -6.000e+02 9.236e-11 -6.496e+12 <2e-16 ***
## City -1.400e+02 2.472e-11 -5.664e+12 <2e-16 ***
## Solar -1.000e+01 2.155e-11 -4.640e+11 <2e-16 ***
## Electric -5.000e+01 2.277e-11 -2.196e+12 <2e-16 ***
## Fiber -4.700e+02 7.359e-11 -6.387e+12 <2e-16 ***
## Glass.Doors -1.780e+02 3.422e-11 -5.201e+12 <2e-16 ***
## Prices 4.000e-02 5.989e-15 6.679e+12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.601e-09 on 499987 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.717e+24 on 12 and 499987 DF, p-value: < 2.2e-16
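Insight: Variables that are highly correlated with the target will be used in the prediction model. If multicollinearity is found, we may need to remove some variables. Note that the model above uses Area as the response, and the "essentially perfect fit" warning together with an R-squared of 1 indicates that Area is an almost exact linear function of the retained predictors, so the stepwise ranking carries little information here.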
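Residuals on the order of 1e-07 and an R-squared of 1 are what one would expect when the response is an exact linear combination of the predictors. The sketch below reproduces that situation on simulated data; the columns `a`, `b`, and `y` are hypothetical, and the absence of a noise term is an assumption made for illustration.

```r
set.seed(42)
n <- 100
toy <- data.frame(a = rnorm(n), b = rnorm(n))

# y is built with no noise term, so it is an exact linear function of a and b.
toy$y <- 2 * toy$a - 3 * toy$b + 5

fit <- lm(y ~ ., data = toy)

# Compute R-squared by hand from the residual and total sums of squares.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((toy$y - mean(toy$y))^2)
r2 <- 1 - ss_res / ss_tot
max_abs_resid <- max(abs(residuals(fit)))

print(c(r2 = r2, max_abs_resid = max_abs_resid))
```

When a fit looks like this on real data, it is worth asking whether one column was derived from the others before interpreting any coefficient.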
After selecting features, we build a regression model and validate it.
set.seed(123)
trainIndex <- createDataPartition(house$Area, p = 0.8, list = FALSE)
trainData <- house[trainIndex, ]
testData <- house[-trainIndex, ]
# Regression model
lm_model <- lm(Area ~ ., data = trainData)
predictions <- predict(lm_model, newdata = testData)
# Measure model accuracy on the test set
rmse <- sqrt(mean((predictions - testData$Area)^2))
r2 <- cor(predictions, testData$Area)^2
list(RMSE = rmse, R2 = r2)
## $RMSE
## [1] 2.019296e-09
##
## $R2
## [1] 1
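Conclusion: If RMSE is low and R^2 is high, the model fits well; otherwise we would need to try another approach, such as non-linear regression or other machine learning techniques. Here RMSE is about 2e-09 and R^2 is 1, which points to an exact linear relationship among the columns rather than ordinary predictive skill.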
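The two metrics used above can be wrapped in small helper functions. The sketch below is illustrative: the function names and the four-element vectors are hypothetical. It also shows that the squared correlation is not the same quantity as the coefficient of determination computed from sums of squares, because the squared correlation ignores any systematic bias in the predictions.

```r
# Root mean squared error between observed and predicted values.
rmse_fn <- function(actual, predicted) sqrt(mean((predicted - actual)^2))

# Squared Pearson correlation, as used in the report.
r2_cor <- function(actual, predicted) cor(predicted, actual)^2

# Coefficient of determination from residual and total sums of squares.
r2_ss <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)

print(rmse_fn(actual, predicted))
print(r2_cor(actual, predicted))
print(r2_ss(actual, predicted))
```

Reporting both R-squared variants on a held-out set is a cheap way to spot predictions that are well correlated with the truth but shifted or scaled.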
From the fitted model we can draw conclusions about the most influential factors. Because the response in this model is Area, the coefficients below describe associations with Area rather than with house prices.
summary(lm_model)
##
## Call:
## lm(formula = Area ~ ., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.176e-08 -1.000e-11 0.000e+00 0.000e+00 1.273e-06
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.000e+01 1.856e-11 -2.156e+12 <2e-16 ***
## Garage -6.000e+01 4.726e-12 -1.269e+13 <2e-16 ***
## FirePlace -3.000e+01 2.620e-12 -1.145e+13 <2e-16 ***
## Baths -5.000e+01 3.163e-12 -1.581e+13 <2e-16 ***
## White.Marble -5.600e+02 2.610e-11 -2.146e+13 <2e-16 ***
## Black.Marble -2.000e+02 1.183e-11 -1.690e+13 <2e-16 ***
## Indian.Marble NA NA NA NA
## Floors -6.000e+02 2.741e-11 -2.189e+13 <2e-16 ***
## City -1.400e+02 7.332e-12 -1.909e+13 <2e-16 ***
## Solar -1.000e+01 6.394e-12 -1.564e+12 <2e-16 ***
## Electric -5.000e+01 6.753e-12 -7.405e+12 <2e-16 ***
## Fiber -4.700e+02 2.183e-11 -2.153e+13 <2e-16 ***
## Glass.Doors -1.780e+02 1.015e-11 -1.754e+13 <2e-16 ***
## Swiming.Pool -6.758e-12 6.378e-12 -1.059e+00 0.289
## Garden -6.465e-12 6.378e-12 -1.014e+00 0.311
## Prices 4.000e-02 1.777e-15 2.251e+13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.017e-09 on 399986 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.619e+25 on 14 and 399986 DF, p-value: < 2.2e-16
Based on the regression results:
- Prices, Floors, Fiber, Glass.Doors, and White.Marble have a significant effect on Area (p-value < 0.05).
- Swiming.Pool and Garden are not significant (p-value > 0.05).
- Indian.Marble has an NA coefficient, which indicates multicollinearity or redundancy in the dataset.
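One common cause of an NA coefficient is a set of indicator columns that always sum to one, so that the last indicator is fully determined by the intercept and the others. The sketch below demonstrates this on simulated data; the columns `White`, `Black`, `Indian`, and `y` are hypothetical stand-ins, and whether the real marble columns behave this way would need to be checked on the dataset itself.

```r
set.seed(7)
n <- 300

# Each simulated house gets exactly one marble type.
marble <- sample(c("white", "black", "indian"), n, replace = TRUE)
toy <- data.frame(
  White  = as.integer(marble == "white"),
  Black  = as.integer(marble == "black"),
  Indian = as.integer(marble == "indian")
)
toy$y <- 10 * toy$White + 4 * toy$Black + rnorm(n)

# The three indicators sum to one in every row.
sums_to_one <- all(toy$White + toy$Black + toy$Indian == 1)

# With an intercept, the last indicator is redundant and lm() reports NA for it.
fit <- lm(y ~ White + Black + Indian, data = toy)
na_coefs <- names(coef(fit))[is.na(coef(fit))]

print(sums_to_one)
print(na_coefs)
```

Dropping one indicator from such a group, which then acts as the reference level, removes the redundancy without losing information.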
To make sure the model is valid and does not depend too heavily on relationships among the predictors, we run a Variance Inflation Factor (VIF) analysis to identify variables with high multicollinearity.
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(dplyr)
# Check for aliased variables, if any
alias(lm_model)
## Model :
## Area ~ Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Indian.Marble + Floors + City + Solar + Electric + Fiber +
## Glass.Doors + Swiming.Pool + Garden + Prices
##
## Complete :
## (Intercept) Garage FirePlace Baths White.Marble Black.Marble
## Indian.Marble 1 0 0 0 -1 -1
## Floors City Solar Electric Fiber Glass.Doors Swiming.Pool Garden
## Indian.Marble 0 0 0 0 0 0 0 0
## Prices
## Indian.Marble 0
# Remove the aliased variable
trainData_clean <- trainData %>% select(-Indian.Marble)
# Rebuild the model without the problematic variable
lm_model_clean <- lm(Area ~ ., data = trainData_clean)
# VIF analysis
vif_values <- vif(lm_model_clean)
vif_values
## Garage FirePlace Baths White.Marble Black.Marble Floors
## 1.466818 1.349252 1.965883 14.861714 3.055333 18.462154
## City Solar Electric Fiber Glass.Doors Swiming.Pool
## 3.520999 1.004987 1.120863 11.718414 2.532204 1.000039
## Garden Prices
## 1.000053 45.501648
high_vif <- names(vif_values[vif_values > 5])
high_vif
## [1] "White.Marble" "Floors" "Fiber" "Prices"
If any variable has VIF > 5, we remove such variables one at a time, starting with the highest VIF, and rebuild the model after each removal.
# Remove the variable with the highest VIF, one step at a time
trainData_reduced <- trainData_clean %>% select(-Prices)
lm_model_reduced <- lm(Area ~ ., data = trainData_reduced)
# Recheck VIF after the first removal
vif_values_reduced <- vif(lm_model_reduced)
vif_values_reduced
## Garage FirePlace Baths White.Marble Black.Marble Floors
## 1.000032 1.000022 1.000040 1.330191 1.330191 1.000012
## City Solar Electric Fiber Glass.Doors Swiming.Pool
## 1.000019 1.000039 1.000025 1.000034 1.000027 1.000037
## Garden
## 1.000049
# If any VIF is still high, continue by removing another variable
high_vif_reduced <- names(vif_values_reduced[vif_values_reduced > 5])
high_vif_reduced
## character(0)
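The remove-and-recheck procedure above can be automated. The sketch below is one possible approach using base R only. It computes VIF by hand as 1 / (1 - R^2) from regressing each predictor on the others, which should agree with `car::vif` for purely numeric predictors. The helper names `manual_vif` and `drop_high_vif` and the simulated columns are hypothetical.

```r
# VIF of each column: regress it on all other columns, then take 1 / (1 - R^2).
manual_vif <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    ss_res <- sum(residuals(fit)^2)
    ss_tot <- sum((df[[v]] - mean(df[[v]]))^2)
    1 / (ss_res / ss_tot)  # 1 - R^2 equals ss_res / ss_tot
  })
}

# Repeatedly drop the predictor with the largest VIF until all are <= threshold.
drop_high_vif <- function(df, threshold = 5) {
  repeat {
    if (ncol(df) < 2) break
    v <- manual_vif(df)
    if (max(v) <= threshold) break
    df <- df[, setdiff(names(df), names(v)[which.max(v)]), drop = FALSE]
  }
  df
}

set.seed(11)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)

# x3 is nearly the sum of x1 and x2, so it is highly collinear with them.
preds <- data.frame(x1 = x1, x2 = x2, x3 = x1 + x2 + rnorm(n, sd = 0.1))
kept <- names(drop_high_vif(preds))
print(kept)
```

The function takes only the predictor columns, so the response would need to be excluded from the data frame before calling it.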
After handling multicollinearity, we need to validate the reduced model.
# Evaluate model performance on the test data
predictions <- predict(lm_model_reduced, newdata = testData)
actuals <- testData$Area
# Compute the evaluation metrics
mse <- mean((predictions - actuals)^2)
rmse <- sqrt(mse)
r2 <- cor(predictions, actuals)^2
# Print the evaluation results
cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 5163.004
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 71.85405
cat("R-squared", r2, "\n")
## R-squared 4.408275e-06
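An R-squared this close to zero suggests the model does little better than always predicting the training mean. One way to check is to compare the model RMSE with that baseline. The sketch below does so on simulated data in which the response is unrelated to the predictor by construction; the column names and the 800/200 split are assumptions for illustration.

```r
set.seed(3)
n <- 1000
toy <- data.frame(x = rnorm(n))

# y has no relationship with x by construction.
toy$y <- 50 + rnorm(n, sd = 10)

train <- toy[1:800, ]
test  <- toy[801:1000, ]

fit <- lm(y ~ x, data = train)

# RMSE of the fitted model on the test set.
model_rmse <- sqrt(mean((predict(fit, newdata = test) - test$y)^2))

# RMSE of a baseline that always predicts the training mean.
baseline_rmse <- sqrt(mean((mean(train$y) - test$y)^2))

ratio <- model_rmse / baseline_rmse
print(c(model = model_rmse, baseline = baseline_rmse, ratio = ratio))
```

A ratio near 1 means the predictors add essentially nothing beyond the mean. Applied to the report, the baseline would be `mean(trainData$Area)` evaluated against `testData$Area`.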
Finally, we inspect the diagnostic plots to check that the residuals show no systematic pattern.
par(mfrow = c(2,2))
plot(lm_model_reduced)
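This analysis evaluated factors related to house prices using linear regression. With Prices among the predictors, the model fits Area almost exactly (R-squared of 1), and most variables appear significant. Once Indian.Marble is dropped for aliasing and Prices is removed to address its VIF of about 45.5, every remaining VIF falls close to 1, but the test-set R-squared drops to about 4.4e-06 with an RMSE of about 71.85, so the remaining variables explain almost none of the variation in Area on their own.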
As a next step, the same workflow can be applied to house price prediction by using Prices as the response. Further validation and exploration of other models may be needed to improve accuracy and generalization.