Introduction

This analysis aims to understand the factors that influence house prices, based on the provided dataset. It proceeds in several stages: data exploration, feature selection, model validation, and interpretation of the results.

Data Loading and Overview

We start by reading the dataset and inspecting its structure.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
library(corrplot)
## corrplot 0.95 loaded
# Load data
house <- read.csv("data_input/HousePrices_HalfMil.csv")

# Check structure
str(house)
## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
##  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
##  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
##  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
summary(house)
##       Area           Garage        FirePlace         Baths      
##  Min.   :  1.0   Min.   :1.000   Min.   :0.000   Min.   :1.000  
##  1st Qu.: 63.0   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :125.0   Median :2.000   Median :2.000   Median :3.000  
##  Mean   :124.9   Mean   :2.001   Mean   :2.003   Mean   :2.998  
##  3rd Qu.:187.0   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :249.0   Max.   :3.000   Max.   :4.000   Max.   :5.000  
##   White.Marble    Black.Marble    Indian.Marble        Floors      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.333   Mean   :0.3327   Mean   :0.3343   Mean   :0.4994  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       City           Solar           Electric          Fiber       
##  Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :2.000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :2.001   Mean   :0.4987   Mean   :0.5007   Mean   :0.5005  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   Glass.Doors      Swiming.Pool        Garden           Prices     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   : 7725  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:33500  
##  Median :0.0000   Median :1.0000   Median :1.0000   Median :41850  
##  Mean   :0.4999   Mean   :0.5004   Mean   :0.5016   Mean   :42050  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:50750  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :77975

Target Variable Selection

Before building a model, we need to choose the target variable.

- If we want to predict house prices, we need to check whether the dataset contains a price variable.
- If there is no price variable, we may need to change the approach, for example by grouping houses by size and amenities.

# Check the candidate target variables
target_candidates <- names(house)
target_candidates
##  [1] "Area"          "Garage"        "FirePlace"     "Baths"        
##  [5] "White.Marble"  "Black.Marble"  "Indian.Marble" "Floors"       
##  [9] "City"          "Solar"         "Electric"      "Fiber"        
## [13] "Glass.Doors"   "Swiming.Pool"  "Garden"        "Prices"

Conclusion: The dataset contains a Prices column, so Prices is the natural target for predicting house prices. Had no price variable been available, we would need to consider alternatives such as classifying houses by size or amenities.
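The check for a price-like column can also be done programmatically rather than by eye. The following is a minimal sketch: the column names are copied from the names(house) output above, while the search pattern "price|cost|value" is an illustrative assumption, not something taken from the dataset's documentation.

```r
# Column names copied from the names(house) output above
cols <- c("Area", "Garage", "FirePlace", "Baths",
          "White.Marble", "Black.Marble", "Indian.Marble", "Floors",
          "City", "Solar", "Electric", "Fiber",
          "Glass.Doors", "Swiming.Pool", "Garden", "Prices")

# Case-insensitive search for price-like column names
# (the pattern is an assumption chosen for illustration)
price_cols <- grep("price|cost|value", cols, ignore.case = TRUE, value = TRUE)
price_cols
```

On a live data frame, the same call would take names(house) in place of the hard-coded vector.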

Feature Selection

We analyze the relationships between variables to determine which features are most informative.

# Correlation between the numeric variables
cor_matrix <- cor(house)
# tl.cex controls the size of the variable labels
corrplot(cor_matrix, method = "color", tl.cex = 0.7)

# Use stepwise selection to pick the best features; the response in this fit is Area
model <- lm(Area ~ ., data = house)
step_model <- step(model, direction = "both")
## Start:  AIC=-18694931
## Area ~ Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Indian.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors + Swiming.Pool + Garden + Prices
## Warning: attempting model selection on an essentially perfect fit is nonsense
## 
## Step:  AIC=-18694931
## Area ~ Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Swiming.Pool + Garden + Prices
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
##                Df  Sum of Sq        RSS       AIC
## - Garden        1          0          0 -18694932
## - Swiming.Pool  1          0          0 -18694931
## <none>                                0 -18694931
## - Solar         1   12438200   12438200   1606988
## - Electric      1  278724620  278724620   3161714
## - FirePlace     1  666697351  666697351   3597770
## - Garage        1  820131672  820131672   3701334
## - Baths         1 1269527529 1269527529   3919802
## - Black.Marble  1 1454135907 1454135907   3987685
## - Glass.Doors   1 1563126117 1563126117   4023823
## - City          1 1853552578 1853552578   4109031
## - White.Marble  1 2343313175 2343313175   4226262
## - Fiber         1 2357069287 2357069287   4229189
## - Floors        1 2438470455 2438470455   4246165
## - Prices        1 2577219153 2577219153   4273835
## 
## Step:  AIC=-18694932
## Area ~ Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Swiming.Pool + Prices
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
##                Df  Sum of Sq        RSS       AIC
## - Swiming.Pool  1          0          0 -18694932
## <none>                                0 -18694932
## + Garden        1          0          0 -18694931
## - Solar         1   12438435   12438435   1606995
## - Electric      1  278724638  278724638   3161712
## - FirePlace     1  666697540  666697540   3597768
## - Garage        1  820133164  820133164   3701333
## - Baths         1 1269527590 1269527590   3919800
## - Black.Marble  1 1454136440 1454136440   3987683
## - Glass.Doors   1 1563127595 1563127595   4023822
## - City          1 1853553182 1853553182   4109029
## - White.Marble  1 2343315597 2343315597   4226261
## - Fiber         1 2357073746 2357073746   4229188
## - Floors        1 2438476038 2438476038   4246164
## - Prices        1 2577224465 2577224465   4273834
## 
## Step:  AIC=-18694932
## Area ~ Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Prices
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
##                Df  Sum of Sq        RSS       AIC
## <none>                                0 -18694932
## + Garden        1          0          0 -18694931
## + Swiming.Pool  1          0          0 -18694931
## - Solar         1   12438439   12438439   1606993
## - Electric      1  278724669  278724669   3161710
## - FirePlace     1  666697959  666697959   3597766
## - Garage        1  820133421  820133421   3701331
## - Baths         1 1269529270 1269529270   3919798
## - Black.Marble  1 1454137034 1454137034   3987682
## - Glass.Doors   1 1563127672 1563127672   4023820
## - City          1 1853553403 1853553403   4109028
## - White.Marble  1 2343318548 2343318548   4226259
## - Fiber         1 2357074641 2357074641   4229186
## - Floors        1 2438477068 2438477068   4246162
## - Prices        1 2577225447 2577225447   4273832
summary(step_model)
## 
## Call:
## lm(formula = Area ~ Garage + FirePlace + Baths + White.Marble + 
##     Black.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors + Prices, data = house)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.968e-07  0.000e+00  0.000e+00  0.000e+00  5.371e-06 
## 
## Coefficients:
##                Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  -4.000e+01  6.069e-11 -6.591e+11   <2e-16 ***
## Garage       -6.000e+01  1.593e-11 -3.768e+12   <2e-16 ***
## FirePlace    -3.000e+01  8.832e-12 -3.397e+12   <2e-16 ***
## Baths        -5.000e+01  1.067e-11 -4.687e+12   <2e-16 ***
## White.Marble -5.600e+02  8.793e-11 -6.368e+12   <2e-16 ***
## Black.Marble -2.000e+02  3.987e-11 -5.017e+12   <2e-16 ***
## Floors       -6.000e+02  9.236e-11 -6.496e+12   <2e-16 ***
## City         -1.400e+02  2.472e-11 -5.664e+12   <2e-16 ***
## Solar        -1.000e+01  2.155e-11 -4.640e+11   <2e-16 ***
## Electric     -5.000e+01  2.277e-11 -2.196e+12   <2e-16 ***
## Fiber        -4.700e+02  7.359e-11 -6.387e+12   <2e-16 ***
## Glass.Doors  -1.780e+02  3.422e-11 -5.201e+12   <2e-16 ***
## Prices        4.000e-02  5.989e-15  6.679e+12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.601e-09 on 499987 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.717e+24 on 12 and 499987 DF,  p-value: < 2.2e-16

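Insight: Variables that correlate strongly with the target will be used in the prediction model. If multicollinearity is found, we may need to remove some variables. Note that the stepwise fit above is essentially perfect (R-squared of 1, residuals near zero), which is why step() warns that model selection on such a fit is not meaningful: Area is almost exactly a linear combination of Prices and the other predictors.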
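Ranking predictors by their correlation with the target can be sketched as follows. This example uses simulated data, since the idea does not depend on the house dataset; the column names strong, weak, noise, and target are hypothetical.

```r
set.seed(42)
n <- 500

# Simulated predictors; only `strong` and `weak` drive the target
toy <- data.frame(
  strong = rnorm(n),
  weak   = rnorm(n),
  noise  = rnorm(n)
)
toy$target <- 5 * toy$strong + 2 * toy$weak + rnorm(n)

# Take the target's column of the correlation matrix,
# drop the target itself, and sort by absolute correlation
cor_with_target <- cor(toy)[, "target"]
predictors <- setdiff(names(cor_with_target), "target")
ranked <- sort(abs(cor_with_target[predictors]), decreasing = TRUE)
ranked
```

Pairwise correlation only captures linear, one-at-a-time association, so a ranking like this is a screening aid rather than a replacement for fitting the model.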

Model Validation

After selecting features, we build a regression model and validate it on held-out test data.

set.seed(123)
trainIndex <- createDataPartition(house$Area, p = 0.8, list = FALSE)
trainData <- house[trainIndex, ]
testData <- house[-trainIndex, ]

# Regression model
lm_model <- lm(Area ~ ., data = trainData)
predictions <- predict(lm_model, newdata = testData)

# Measure model accuracy
rmse <- sqrt(mean((predictions - testData$Area)^2))
r2 <- cor(predictions, testData$Area)^2
list(RMSE = rmse, R2 = r2)
## $RMSE
## [1] 2.019296e-09
## 
## $R2
## [1] 1

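Conclusion: A low RMSE and a high R^2 indicate a good fit; otherwise we would need to try other approaches, such as non-linear regression or other machine learning techniques. Here the RMSE is effectively zero (about 2e-09) and R^2 is 1, so the fit is essentially exact. A result this perfect usually points to a deterministic relationship between the predictors and the response rather than genuine predictive skill, so it should be read with caution.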
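The two metrics can be written as small helper functions. This is a sketch with hypothetical helper names (calc_rmse, calc_r2) and a four-point toy example. It also shows that the squared correlation used above is not the same quantity as 1 - SSE/SST: the squared correlation ignores systematic bias in the predictions, so it can look better than the model really is.

```r
# Root mean squared error between actual and predicted values
calc_rmse <- function(actual, predicted) {
  sqrt(mean((predicted - actual)^2))
}

# Coefficient of determination as 1 - SSE / SST
calc_r2 <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

actual    <- c(1, 2, 3, 4)
predicted <- c(1, 2, 3, 6)   # the last prediction is off by 2

rmse_value <- calc_rmse(actual, predicted)   # sqrt(4 / 4) = 1
r2_value   <- calc_r2(actual, predicted)     # 1 - 4 / 5 = 0.2
r2_cor     <- cor(predicted, actual)^2       # squared correlation, larger here

c(RMSE = rmse_value, R2 = r2_value, R2_cor = r2_cor)
```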

Model Interpretation & Recommendations

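From the fitted model, we can draw conclusions about which factors are most strongly associated with the response. Because the model was fit with Area as the response, the coefficients below describe Area, with Prices entering as a predictor.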

summary(lm_model)
## 
## Call:
## lm(formula = Area ~ ., data = trainData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.176e-08 -1.000e-11  0.000e+00  0.000e+00  1.273e-06 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)   -4.000e+01  1.856e-11 -2.156e+12   <2e-16 ***
## Garage        -6.000e+01  4.726e-12 -1.269e+13   <2e-16 ***
## FirePlace     -3.000e+01  2.620e-12 -1.145e+13   <2e-16 ***
## Baths         -5.000e+01  3.163e-12 -1.581e+13   <2e-16 ***
## White.Marble  -5.600e+02  2.610e-11 -2.146e+13   <2e-16 ***
## Black.Marble  -2.000e+02  1.183e-11 -1.690e+13   <2e-16 ***
## Indian.Marble         NA         NA         NA       NA    
## Floors        -6.000e+02  2.741e-11 -2.189e+13   <2e-16 ***
## City          -1.400e+02  7.332e-12 -1.909e+13   <2e-16 ***
## Solar         -1.000e+01  6.394e-12 -1.564e+12   <2e-16 ***
## Electric      -5.000e+01  6.753e-12 -7.405e+12   <2e-16 ***
## Fiber         -4.700e+02  2.183e-11 -2.153e+13   <2e-16 ***
## Glass.Doors   -1.780e+02  1.015e-11 -1.754e+13   <2e-16 ***
## Swiming.Pool  -6.758e-12  6.378e-12 -1.059e+00    0.289    
## Garden        -6.465e-12  6.378e-12 -1.014e+00    0.311    
## Prices         4.000e-02  1.777e-15  2.251e+13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.017e-09 on 399986 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.619e+25 on 14 and 399986 DF,  p-value: < 2.2e-16

Conclusion

Based on the regression results:

- All retained predictors, including Prices, Floors, Fiber, Glass.Doors, and White.Marble, have a significant effect on Area (p-value < 0.05).
- Swiming.Pool and Garden are not significant (p-value > 0.05).
- Indian.Marble has an NA coefficient, which indicates perfect multicollinearity or redundancy: it is a linear combination of other predictors in the dataset.
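Because the fitted coefficients are round numbers with near-zero standard errors, the equation for Area can be rearranged to express Prices instead. The sketch below does this for the first row of the dataset, using the values shown in the str(house) output and the coefficient magnitudes from summary(lm_model). Treat the resulting formula as an inference from this one fit, checked here on a single row, not as a documented property of the dataset.

```r
# First row of the dataset, copied from the str(house) output
row1 <- c(Area = 164, Garage = 2, FirePlace = 0, Baths = 2,
          White.Marble = 0, Black.Marble = 1, Floors = 0, City = 3,
          Solar = 1, Electric = 1, Fiber = 1, Glass.Doors = 1)

# Magnitudes of the fitted coefficients from summary(lm_model);
# Area gets weight 1 because it moves to the other side of the equation
coef_size <- c(Area = 1, Garage = 60, FirePlace = 30, Baths = 50,
               White.Marble = 560, Black.Marble = 200, Floors = 600, City = 140,
               Solar = 10, Electric = 50, Fiber = 470, Glass.Doors = 178)

# Area = -40 - sum(coef_size * x) + 0.04 * Prices, solved for Prices
# (1 / 0.04 = 25, and 40 is the negated intercept)
implied_price <- 25 * (40 + sum(coef_size[names(row1)] * row1))
implied_price
```

The result, 43800, matches the Prices value of the first row in the str(house) output, which is consistent with Prices being an exact linear function of the other columns.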

VIF Analysis

To check that the model is valid and not overly dependent on relationships among the predictors, we run a Variance Inflation Factor (VIF) analysis to identify variables with high multicollinearity.

library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(dplyr)
# Check for aliased (perfectly collinear) variables
alias(lm_model)
## Model :
## Area ~ Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Indian.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors + Swiming.Pool + Garden + Prices
## 
## Complete :
##               (Intercept) Garage FirePlace Baths White.Marble Black.Marble
## Indian.Marble  1           0      0         0    -1           -1          
##               Floors City Solar Electric Fiber Glass.Doors Swiming.Pool Garden
## Indian.Marble  0      0    0     0        0     0           0            0    
##               Prices
## Indian.Marble  0
# Remove the aliased variable
trainData_clean <- trainData %>% select(-Indian.Marble)

# Rebuild the model without the problematic variable
lm_model_clean <- lm(Area ~ ., data = trainData_clean)

# VIF analysis
vif_values <- vif(lm_model_clean)
vif_values
##       Garage    FirePlace        Baths White.Marble Black.Marble       Floors 
##     1.466818     1.349252     1.965883    14.861714     3.055333    18.462154 
##         City        Solar     Electric        Fiber  Glass.Doors Swiming.Pool 
##     3.520999     1.004987     1.120863    11.718414     2.532204     1.000039 
##       Garden       Prices 
##     1.000053    45.501648
high_vif <- names(vif_values[vif_values > 5])
high_vif
## [1] "White.Marble" "Floors"       "Fiber"        "Prices"
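The aliasing reported by alias(lm_model) says that Indian.Marble equals 1 minus White.Marble minus Black.Marble, which is the pattern produced when every level of a categorical variable gets its own 0/1 column alongside an intercept. The following sketch reproduces that situation on simulated data; the column names and coefficients are hypothetical.

```r
set.seed(1)
n <- 300

# One categorical variable expanded into three 0/1 indicator columns
marble <- sample(c("white", "black", "indian"), n, replace = TRUE)
toy <- data.frame(
  white  = as.integer(marble == "white"),
  black  = as.integer(marble == "black"),
  indian = as.integer(marble == "indian"),
  size   = rnorm(n)
)
toy$y <- 2 * toy$white - toy$black + 0.5 * toy$size + rnorm(n)

# With an intercept, white + black + indian always equals 1,
# so the last indicator carries no new information
fit <- lm(y ~ white + black + indian + size, data = toy)
coef(fit)             # the coefficient for `indian` is NA
alias(fit)$Complete   # shows indian = (Intercept) - white - black
```

Dropping one indicator, as done above with Indian.Marble, removes the redundancy without losing information; the dropped level becomes the baseline absorbed by the intercept.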

Handling High VIF Variables

If any variable has a VIF above 5, we remove them one at a time, starting with the highest, and rebuild the model. Here Prices has the highest VIF (about 45.5), so it is removed first.

# Remove the variable with the highest VIF, one step at a time
trainData_reduced <- trainData_clean %>% select(-Prices)
lm_model_reduced <- lm(Area ~ ., data = trainData_reduced)

# Recheck VIF after the first removal
vif_values_reduced <- vif(lm_model_reduced)
vif_values_reduced
##       Garage    FirePlace        Baths White.Marble Black.Marble       Floors 
##     1.000032     1.000022     1.000040     1.330191     1.330191     1.000012 
##         City        Solar     Electric        Fiber  Glass.Doors Swiming.Pool 
##     1.000019     1.000039     1.000025     1.000034     1.000027     1.000037 
##       Garden 
##     1.000049
# If any VIF is still high, continue by removing another variable
high_vif_reduced <- names(vif_values_reduced[vif_values_reduced > 5])
high_vif_reduced
## character(0)
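The remove-and-recheck procedure can be written as a loop. The sketch below computes VIF by hand in base R, using the definition 1 / (1 - R^2) from regressing each predictor on the others, instead of calling car::vif. The helper name manual_vif, the simulated data, and the threshold of 5 are assumptions made for illustration.

```r
# VIF of each predictor: 1 / (1 - R^2) from regressing it on the others
# (needs at least two predictors)
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(7)
n <- 1000
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
toy$d <- toy$a + toy$b + rnorm(n, sd = 0.1)   # d is nearly a + b

predictors <- c("a", "b", "c", "d")
threshold <- 5

# Drop the predictor with the highest VIF until all are under the threshold
repeat {
  vifs <- manual_vif(toy, predictors)
  if (max(vifs) <= threshold) break
  predictors <- setdiff(predictors, names(which.max(vifs)))
}

predictors
vifs
```

Removing one variable at a time matters because VIFs are interdependent: dropping d brings the VIFs of a and b back to about 1, so removing all three high-VIF variables at once would discard useful predictors.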

Model Validation

After handling multicollinearity, we need to validate the reduced model on the test data.

# Evaluate model performance on the test data
predictions <- predict(lm_model_reduced, newdata = testData)
actuals <- testData$Area

# Compute the evaluation metrics
mse <- mean((predictions - actuals)^2)
rmse <- sqrt(mse)
r2 <- cor(predictions, actuals)^2

# Print the evaluation results
cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 5163.004
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 71.85405
cat("R-squared", r2, "\n")
## R-squared 4.408275e-06

Without Prices, the model explains essentially none of the variance in Area (R-squared of about 4.4e-06, RMSE of about 71.9), so the remaining features carry almost no information about Area on their own.

Residual Analysis

We inspect the diagnostic plots to check that the residuals show no systematic pattern.

par(mfrow = c(2,2))
plot(lm_model_reduced)
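
As a numeric complement to the diagnostic plots, the sketch below shows what a systematic residual pattern looks like. It uses simulated data with a curved relationship and compares a straight-line fit against one that includes a quadratic term; all names and values are hypothetical.

```r
set.seed(3)
x <- runif(200, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(200)      # the true relationship is curved

fit_linear <- lm(y ~ x)              # misspecified: misses the curvature
fit_quad   <- lm(y ~ x + I(x^2))     # includes the quadratic term

# Residuals of a well-specified model should be unrelated to any
# function of the predictors; here we probe for leftover curvature
curvature  <- (x - mean(x))^2
cor_linear <- cor(resid(fit_linear), curvature)
cor_quad   <- cor(resid(fit_quad), curvature)

c(linear = cor_linear, quadratic = cor_quad)
```

The straight-line fit leaves residuals that track the curvature closely, which would appear as a U shape in the residuals-versus-fitted panel, while the quadratic fit leaves none.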

Closing Remarks

This analysis set out to evaluate the factors that influence house prices using linear regression. In practice, the models were fit with Area as the response. With Prices included, several variables are significantly associated with house area and the fit is essentially exact; after removing Prices to address multicollinearity, the remaining features explain almost none of the variation in Area (test R-squared of about 4.4e-06).

To predict house prices, the model would need to be refit with Prices as the response. Further validation and exploration of other models may also be needed to improve accuracy and generalization.