1. Introduction
In recent years, the real estate industry in South Jakarta, Indonesia has experienced rapid growth, making it an interesting area for research and experimentation. One common task in this field is forecasting house prices, which can be challenging due to the complex and dynamic nature of the market. In this experiment, we will use linear regression, a widely used statistical method, to analyze data and make predictions about house prices in South Jakarta from Kaggle dataset. By applying this technique, we aim to gain insights into the factors that affect house prices and improve our ability to forecast them accurately.
2. Data Wrangling & EDA
2.1 Read data
house <-read.csv("rumah_jaksel.csv", sep = ";")
knitr::kable(head(house, 10))
| HARGA | LT | LB | JKT | JKM | GRS | KOTA |
|---|---|---|---|---|---|---|
| 28.000.000.000 | 1100 | 700 | 5 | 6 | ADA | JAKSEL |
| 19.000.000.000 | 824 | 800 | 4 | 4 | ADA | JAKSEL |
| 4.700.000.000 | 500 | 400 | 4 | 3 | ADA | JAKSEL |
| 4.900.000.000 | 251 | 300 | 5 | 4 | ADA | JAKSEL |
| 28.000.000.000 | 1340 | 575 | 4 | 5 | ADA | JAKSEL |
| 10.000.000.000 | 460 | 300 | 4 | 4 | ADA | JAKSEL |
| 7.600.000.000 | 278 | 350 | 4 | 4 | ADA | JAKSEL |
| 5.250.000.000 | 511 | 300 | 3 | 2 | ADA | JAKSEL |
| 670.000.000 | 70 | 69 | 3 | 2 | TIDAK ADA | JAKSEL |
| 480.000.000 | 66 | 42 | 2 | 1 | TIDAK ADA | JAKSEL |
Explanation of each column in the dataset:
- HARGA = House prices (IDR)
- LT = Total Land Area
- LB = Total Building Area
- JKT = Number of Bedrooms
- JKM = Number of Bathrooms
- KOTA = City
- GRS: Has Garage? (binary: “ADA = yes”,“TIDAK ADA = no”)
The column names will be changed to make them easier to understand
and the KOTA column will be removed as it is not
needed.
house <- house %>%
select(-KOTA) %>%
rename(Price = HARGA,
Land.Area = LT,
Building.Area = LB,
Bedrooms = JKT,
Bathrooms = JKM,
Garage = GRS) %>%
mutate(Garage = ifelse(Garage == "ADA", "Yes", "No"))
knitr::kable(head(house, 10))
| Price | Land.Area | Building.Area | Bedrooms | Bathrooms | Garage |
|---|---|---|---|---|---|
| 28.000.000.000 | 1100 | 700 | 5 | 6 | Yes |
| 19.000.000.000 | 824 | 800 | 4 | 4 | Yes |
| 4.700.000.000 | 500 | 400 | 4 | 3 | Yes |
| 4.900.000.000 | 251 | 300 | 5 | 4 | Yes |
| 28.000.000.000 | 1340 | 575 | 4 | 5 | Yes |
| 10.000.000.000 | 460 | 300 | 4 | 4 | Yes |
| 7.600.000.000 | 278 | 350 | 4 | 4 | Yes |
| 5.250.000.000 | 511 | 300 | 3 | 2 | Yes |
| 670.000.000 | 70 | 69 | 3 | 2 | No |
| 480.000.000 | 66 | 42 | 2 | 1 | No |
2.2 Check the data structure
glimpse(house)
#> Rows: 1,001
#> Columns: 6
#> $ Price <chr> "28.000.000.000", "19.000.000.000", "4.700.000.000", "4.…
#> $ Land.Area <int> 1100, 824, 500, 251, 1340, 460, 278, 511, 70, 66, 449, 1…
#> $ Building.Area <int> 700, 800, 400, 300, 575, 300, 350, 300, 69, 42, 500, 188…
#> $ Bedrooms <int> 5, 4, 4, 5, 4, 4, 4, 3, 3, 2, 6, 2, 4, 4, 5, 4, 5, 4, 4,…
#> $ Bathrooms <int> 6, 4, 3, 4, 5, 4, 4, 2, 2, 1, 7, 3, 4, 4, 6, 4, 4, 4, 3,…
#> $ Garage <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
💡 Data structure check results:
- The dataset house consists of 6 columns and 1,001 rows
- The
Pricecolumn needs to be changed to numeric - There are indications that there are columns that need to be changed
to factors, namely the
Bedrooms,Bathrooms, andGaragecolumns
unique(house$Bedrooms)
#> [1] 5 4 3 2 6 9 8 7 10 17 11 1 27 22
unique(house$Bathrooms)
#> [1] 6 4 3 5 2 1 7 8 9 18 11 21 27 10
unique(house$Garage)
#> [1] "Yes" "No"
Based on the inspection above, it can be seen that besides the
Garage column has too many categories so only the
Garage column whose data type will be changed to
factor.
house$Price <- gsub("\\.", "", house$Price)
house <- house %>%
mutate(Price = as.numeric(Price),
Garage = as.factor(Garage))
glimpse(house)
#> Rows: 1,001
#> Columns: 6
#> $ Price <dbl> 28000000000, 19000000000, 4700000000, 4900000000, 280000…
#> $ Land.Area <int> 1100, 824, 500, 251, 1340, 460, 278, 511, 70, 66, 449, 1…
#> $ Building.Area <int> 700, 800, 400, 300, 575, 300, 350, 300, 69, 42, 500, 188…
#> $ Bedrooms <int> 5, 4, 4, 5, 4, 4, 4, 3, 3, 2, 6, 2, 4, 4, 5, 4, 5, 4, 4,…
#> $ Bathrooms <int> 6, 4, 3, 4, 5, 4, 4, 2, 2, 1, 7, 3, 4, 4, 6, 4, 4, 4, 3,…
#> $ Garage <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes…
2.3 Cleansing Data
colSums(is.na(house))
#> Price Land.Area Building.Area Bedrooms Bathrooms
#> 0 0 0 0 0
#> Garage
#> 0
Luckily there is no missing value in the dataset.
2.4 EDA
#distribution of data
hist(house$Price)
#distribution of data
boxplot(house$Price)
#correlation
ggcorr(house,label = TRUE , hjust = 1)
💡 Insight:
- There are outliers
- Right skewed data distribution
- There are 2 variables that have a strong correlation (>=0.5) with
PricenamelyLand.AreaandBuilding.Area
2.5 Separating Data to Train and Test
In order to test the model on later stage, I’ll separate the dataset into 2, training and testing data with a ratio 8:2.
set.seed(100)
intrain <- sample(nrow(house), nrow(house) * .8)
house_price_train <- house[intrain,]
house_price_test <- house[-intrain,]
3. Modelling
We will create several models based on feature selection:
- Model with all predictor
- Model selection based on correlation (correlation > 0.5)
- Model selection based on stepwise result (backward/forward/both)
3.1 Model Fitting
All Predictor
Model with all predictor.
model_all <- lm(formula = Price ~ . ,
data = house_price_train)
summary(model_all)
#>
#> Call:
#> lm(formula = Price ~ ., data = house_price_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -79354787519 -3953529547 -858300311 1427548393 135150271743
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3313604959 1423188139 -2.328 0.0201 *
#> Land.Area 22969163 1249775 18.379 < 0.0000000000000002 ***
#> Building.Area 10711579 1472806 7.273 0.000000000000845 ***
#> Bedrooms -261767528 447470305 -0.585 0.5587
#> Bathrooms 764119312 474298192 1.611 0.1076
#> GarageYes 2080230117 1167108925 1.782 0.0751 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13390000000 on 794 degrees of freedom
#> Multiple R-squared: 0.6124, Adjusted R-squared: 0.61
#> F-statistic: 250.9 on 5 and 794 DF, p-value: < 0.00000000000000022
💡 Insight:
- Most of the predictor has positive coefficient, except Bedrooms.
- Variabel prediktor yang signifikan berpengaruh terhadap Price adalah
Land.AreadanBuilding.Area - Adjusted R-squared: 0.61, artinya model kita bisa menjelaskan Price dengan baik sebesar 61%
Correlation
Model selection based on correlation (correlation > 0.5), namely
Land.Area and Building.Area
model_selection <- lm(formula = Price ~ Land.Area + Building.Area,
data = house_price_train)
summary(model_selection)
#>
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area, data = house_price_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -84141977037 -3819266578 -1043687619 804217559 135522012296
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -188052156 706002175 -0.266 0.79
#> Land.Area 23241202 1248727 18.612 < 0.0000000000000002 ***
#> Building.Area 11152227 1469746 7.588 0.0000000000000908 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13440000000 on 797 degrees of freedom
#> Multiple R-squared: 0.6081, Adjusted R-squared: 0.6071
#> F-statistic: 618.4 on 2 and 797 DF, p-value: < 0.00000000000000022
💡 Insight:
- Every predictor has positive coefficient.
- Semua prediktor berpengaruh signifikan terhadap Price.
- Adjusted R-squared: 0.6071, artinya model kita bisa menjelaskan Price dengan baik sebesar 60.71%
Stepwise
Model selection based on stepwise result.
# model tanpa prediktor dari data house untuk `price`
model_none <- lm(formula = Price ~ 1,
data = house )
model_forward <- step(object = model_none,
scope = list(upper = model_all),
direction = "forward",
trace = F)
model_backward <- step(object = model_all,
direction = "backward",
trace = F)
model_both <- step(object = model_none,
direction = "both",
scope = list(upper = model_all),
trace = F)
#model forward
summary(model_forward)
#>
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + Bathrooms +
#> Garage, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -83388265362 -4077544360 -868978732 1280926770 135800805889
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3256711926 1228917644 -2.650 0.00818 **
#> Land.Area 21436040 1085458 19.748 < 0.0000000000000002 ***
#> Building.Area 12172126 1277753 9.526 < 0.0000000000000002 ***
#> Bathrooms 514024107 228406771 2.250 0.02464 *
#> GarageYes 1802890425 1017526044 1.772 0.07673 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13300000000 on 996 degrees of freedom
#> Multiple R-squared: 0.5925, Adjusted R-squared: 0.5909
#> F-statistic: 362 on 4 and 996 DF, p-value: < 0.00000000000000022
💡 Insight:
- Every predictor has positive coefficient.
- Semua prediktor berpengaruh signifikan terhadap Price, kecuali
GarageYes - Adjusted R-squared: 0.5909, artinya model kita bisa menjelaskan Price dengan baik sebesar 59.09%
#model backward
summary(model_backward)
#>
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + Bathrooms +
#> Garage, data = house_price_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -79310117158 -4003089453 -829459418 1396519580 135017159775
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3554356015 1361822306 -2.610 0.00922 **
#> Land.Area 22940572 1248302 18.377 < 0.0000000000000002 ***
#> Building.Area 10737900 1471510 7.297 0.000000000000713 ***
#> Bathrooms 523932575 237352978 2.207 0.02757 *
#> GarageYes 2099328536 1166169478 1.800 0.07221 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13390000000 on 795 degrees of freedom
#> Multiple R-squared: 0.6123, Adjusted R-squared: 0.6103
#> F-statistic: 313.8 on 4 and 795 DF, p-value: < 0.00000000000000022
💡 Insight:
- Every predictor has positive coefficient.
- Semua prediktor berpengaruh signifikan terhadap Price, kecuali
GarageYes - Adjusted R-squared: 0.6103, artinya model kita bisa menjelaskan Price dengan baik sebesar 61.03%
#model both
summary(model_both)
#>
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + Bathrooms +
#> Garage, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -83388265362 -4077544360 -868978732 1280926770 135800805889
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3256711926 1228917644 -2.650 0.00818 **
#> Land.Area 21436040 1085458 19.748 < 0.0000000000000002 ***
#> Building.Area 12172126 1277753 9.526 < 0.0000000000000002 ***
#> Bathrooms 514024107 228406771 2.250 0.02464 *
#> GarageYes 1802890425 1017526044 1.772 0.07673 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13300000000 on 996 degrees of freedom
#> Multiple R-squared: 0.5925, Adjusted R-squared: 0.5909
#> F-statistic: 362 on 4 and 996 DF, p-value: < 0.00000000000000022
💡 Insight:
- Every predictor has positive coefficient.
- Semua prediktor berpengaruh signifikan terhadap Price, kecuali
GarageYes - Adjusted R-squared: 0.5909, artinya model kita bisa menjelaskan Price dengan baik sebesar 59.09%
3.2 Model Comparison with Goodness of Fit (Adjusted R-squared)
💡 Kesimpulan:
Model yang dapat menjelaskan variabel target dengan baik adalah
model_all , yakni sebesar 0.61 atau 61% dan
model_backward , yakni sebesar 0.6103 atau 61.03%. Karena
keduanya hampir sama, kita akan coba untuk menggunakan kedua model
tersebut untuk tahap selanjutnya.
3.3 Model Assumption
In this section, we will evaluate each models using Normality test, Homoscedasticity test, Multicollinearity test.
Normality of Residuals
model_all
hist(model_all$residuals)
shapiro.test(model_all$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_all$residuals
#> W = 0.56654, p-value < 0.00000000000000022
model_backward
hist(model_backward$residuals)
shapiro.test(model_backward$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_backward$residuals
#> W = 0.56711, p-value < 0.00000000000000022
💡 Kesimpulan: p-value pada kedua model < 0.00000000000000022, artinya residual data tidak berdistribusi normal. Hal tersebut mungkin saja terjadi karena adanya outlier.
Homoscedasticity of Residuals
model_all
library(lmtest)
bptest(model_all)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_all
#> BP = 118.29, df = 5, p-value < 0.00000000000000022
model_backward
bptest(model_backward)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_backward
#> BP = 118.32, df = 4, p-value < 0.00000000000000022
💡 Kesimpulan: p-value pada kedua model < 0.00000000000000022, artinya error menyebar tidak konstan atau heteroscedasticity
Multicollinearity
model_all
library(car)
vif(model_all)
#> Land.Area Building.Area Bedrooms Bathrooms Garage
#> 2.060187 2.065310 4.162510 4.215207 1.012512
model_backward
vif(model_backward)
#> Land.Area Building.Area Bathrooms Garage
#> 2.057036 2.063383 1.056489 1.011720
💡 Kesimpulan: seluruh variabel prediktor pada kedua model memenuhi asumsi no multicolinierity
3.4 Model Transformation
Berdasarkan tahap model assumption sebelumnya, diketahui bahwa residual data tidak berdistribusi normal dan error menyebar tidak konstan atau heteroscedasticity sehingga model perlu ditransformasi.
model_all_trans <- lm(formula = Price ~ Land.Area + Building.Area + log(Bedrooms) + log(Bathrooms) + Garage,
data = house_price_train)
summary(model_all_trans)
#>
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + log(Bedrooms) +
#> log(Bathrooms) + Garage, data = house_price_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -75913448296 -4204040173 -693569424 1637890101 134729823551
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6115363599 2363226112 -2.588 0.00984 **
#> Land.Area 22770430 1255169 18.141 < 0.0000000000000002 ***
#> Building.Area 10423148 1477745 7.053 0.0000000000038 ***
#> log(Bedrooms) -321846062 2290044432 -0.141 0.88827
#> log(Bathrooms) 4234710196 1980369671 2.138 0.03279 *
#> GarageYes 1888599934 1168313360 1.617 0.10638
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13360000000 on 794 degrees of freedom
#> Multiple R-squared: 0.6141, Adjusted R-squared: 0.6116
#> F-statistic: 252.7 on 5 and 794 DF, p-value: < 0.00000000000000022
#normalitas shapiro
shapiro.test(model_all_trans$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_all_trans$residuals
#> W = 0.57427, p-value < 0.00000000000000022
#heterosedastisitas bp
bptest(model_all_trans)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_all_trans
#> BP = 114.88, df = 5, p-value < 0.00000000000000022
model_backward_trans <- lm(formula = Price ~ Land.Area + Building.Area + log(Bathrooms) + Garage,
data = house_price_train)
summary(model_backward_trans)
#>
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + log(Bathrooms) +
#> Garage, data = house_price_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -75815559702 -4195687607 -715866009 1618120327 134693626211
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6314147492 1892048615 -3.337 0.000886 ***
#> Land.Area 22754314 1249149 18.216 < 0.0000000000000002 ***
#> Building.Area 10424346 1476809 7.059 0.00000000000367 ***
#> log(Bathrooms) 4034821715 1377191812 2.930 0.003489 **
#> GarageYes 1890659405 1167501017 1.619 0.105756
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13350000000 on 795 degrees of freedom
#> Multiple R-squared: 0.6141, Adjusted R-squared: 0.6121
#> F-statistic: 316.2 on 4 and 795 DF, p-value: < 0.00000000000000022
#normalitas shapiro
shapiro.test(model_backward_trans$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_backward_trans$residuals
#> W = 0.57444, p-value < 0.00000000000000022
#heterosedastisitas bp
bptest(model_backward_trans)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_backward_trans
#> BP = 114.69, df = 4, p-value < 0.00000000000000022
Setelah dilakukan transformasi data, ternyata nilai p-value tidak mengalami perubahan.
ps: Saya sudah mencoba melakukan handling outlier, namun p-value tetap < 0.05
3.4 Model Evaluation and Prediction
all_pred <- predict(model_all, house_price_test)
backward_pred <- predict(model_backward, house_price_test)
MAE
MAE(y_pred=all_pred, y_true=house_price_test$Price)
#> [1] 6343359557
MAE(y_pred=backward_pred, y_true=house_price_test$Price)
#> [1] 6339267013
# Check Target Variable Range
range(house_price_test$Price)
#> [1] 430000000 169000000000
169000000000-430000000
#> [1] 168570000000
- Nilai MAE relatif lebih kecil dibandingkan range data, namun error masih lumayan besar.
- model_backward mempunyai error lebih kecil dibandingkan model_all
MAPE
MAPE(y_pred = all_pred,
y_true = house_price_test$Price)*100
#> [1] 41.68089
MAPE(y_pred = backward_pred,
y_true = house_price_test$Price)*100
#> [1] 41.87019
- Dari model_all kita memperoleh MAPE sebesar 41,7%, artinya model dapat memprediksi dengan keakuratan sebesar 58.3%.
- Dari model_backward kita memperoleh MAPE sebesar 41,9%, artinya model dapat memprediksi dengan keakuratan sebesar 58.1%.
RMSE
RMSE(y_pred=all_pred, y_true=house_price_test$Price)
#> [1] 13080415605
RMSE(y_pred=backward_pred, y_true=house_price_test$Price)
#> [1] 13035505643
- Nilai RMSE relatif lebih kecil dibandingkan range data, namun error masih lumayan besar.
- model_backward mempunyai error lebih kecil dibandingkan model_all
4. Conclusion
- Model yang dibuat masih menghasilkan error yang besar, hal tersebut karena pada kedua model residual data tidak berdistribusi normal dan error menyebar tidak konstan atau heteroscedasticity.
- Oleh karena itu, saya merekomendasikan untuk menggunakan model lain yang lebih kompleks (model yang bebas asumsi)
- Meskipun begitu jika dilihat dari MAE dan RMSE, model_backward lebih baik dibandingkan model_all