LBB: Regression Model

Syahvan Alviansyah Diva Ritonga

2023-07-01

1. Introduction

In recent years, the real estate industry in South Jakarta, Indonesia has experienced rapid growth, making it an interesting area for research and experimentation. One common task in this field is forecasting house prices, which can be challenging due to the complex and dynamic nature of the market. In this experiment, we will use linear regression, a widely used statistical method, to analyze data and make predictions about house prices in South Jakarta from Kaggle dataset. By applying this technique, we aim to gain insights into the factors that affect house prices and improve our ability to forecast them accurately.

2. Data Wrangling & EDA

2.1 Read data

house <-read.csv("rumah_jaksel.csv", sep = ";")
knitr::kable(head(house, 10))
HARGA LT LB JKT JKM GRS KOTA
28.000.000.000 1100 700 5 6 ADA JAKSEL
19.000.000.000 824 800 4 4 ADA JAKSEL
4.700.000.000 500 400 4 3 ADA JAKSEL
4.900.000.000 251 300 5 4 ADA JAKSEL
28.000.000.000 1340 575 4 5 ADA JAKSEL
10.000.000.000 460 300 4 4 ADA JAKSEL
7.600.000.000 278 350 4 4 ADA JAKSEL
5.250.000.000 511 300 3 2 ADA JAKSEL
670.000.000 70 69 3 2 TIDAK ADA JAKSEL
480.000.000 66 42 2 1 TIDAK ADA JAKSEL

Explanation of each column in the dataset:

  1. HARGA = House prices (IDR)
  2. LT = Total Land Area
  3. LB = Total Building Area
  4. JKT = Number of Bedrooms
  5. JKM = Number of Bathrooms
  6. KOTA = City
  7. GRS: Has Garage? (binary: “ADA = yes”,“TIDAK ADA = no”)

The column names will be changed to make them easier to understand and the KOTA column will be removed as it is not needed.

house <- house %>% 
  select(-KOTA) %>% 
  rename(Price = HARGA,
         Land.Area = LT,
         Building.Area = LB,
         Bedrooms = JKT,
         Bathrooms = JKM,
         Garage = GRS) %>% 
  mutate(Garage = ifelse(Garage == "ADA", "Yes", "No"))
knitr::kable(head(house, 10))
Price Land.Area Building.Area Bedrooms Bathrooms Garage
28.000.000.000 1100 700 5 6 Yes
19.000.000.000 824 800 4 4 Yes
4.700.000.000 500 400 4 3 Yes
4.900.000.000 251 300 5 4 Yes
28.000.000.000 1340 575 4 5 Yes
10.000.000.000 460 300 4 4 Yes
7.600.000.000 278 350 4 4 Yes
5.250.000.000 511 300 3 2 Yes
670.000.000 70 69 3 2 No
480.000.000 66 42 2 1 No

2.2 Check the data structure

glimpse(house)
#> Rows: 1,001
#> Columns: 6
#> $ Price         <chr> "28.000.000.000", "19.000.000.000", "4.700.000.000", "4.…
#> $ Land.Area     <int> 1100, 824, 500, 251, 1340, 460, 278, 511, 70, 66, 449, 1…
#> $ Building.Area <int> 700, 800, 400, 300, 575, 300, 350, 300, 69, 42, 500, 188…
#> $ Bedrooms      <int> 5, 4, 4, 5, 4, 4, 4, 3, 3, 2, 6, 2, 4, 4, 5, 4, 5, 4, 4,…
#> $ Bathrooms     <int> 6, 4, 3, 4, 5, 4, 4, 2, 2, 1, 7, 3, 4, 4, 6, 4, 4, 4, 3,…
#> $ Garage        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …

💡 Data structure check results:

  • The dataset house consists of 6 columns and 1,001 rows
  • The Price column needs to be changed to numeric
  • There are indications that there are columns that need to be changed to factors, namely the Bedrooms, Bathrooms, and Garage columns
unique(house$Bedrooms)
#>  [1]  5  4  3  2  6  9  8  7 10 17 11  1 27 22
unique(house$Bathrooms)
#>  [1]  6  4  3  5  2  1  7  8  9 18 11 21 27 10
unique(house$Garage)
#> [1] "Yes" "No"

Based on the inspection above, it can be seen that besides the Garage column has too many categories so only the Garage column whose data type will be changed to factor.

house$Price <- gsub("\\.", "", house$Price)
house <- house %>% 
  mutate(Price = as.numeric(Price),
         Garage = as.factor(Garage))
glimpse(house)
#> Rows: 1,001
#> Columns: 6
#> $ Price         <dbl> 28000000000, 19000000000, 4700000000, 4900000000, 280000…
#> $ Land.Area     <int> 1100, 824, 500, 251, 1340, 460, 278, 511, 70, 66, 449, 1…
#> $ Building.Area <int> 700, 800, 400, 300, 575, 300, 350, 300, 69, 42, 500, 188…
#> $ Bedrooms      <int> 5, 4, 4, 5, 4, 4, 4, 3, 3, 2, 6, 2, 4, 4, 5, 4, 5, 4, 4,…
#> $ Bathrooms     <int> 6, 4, 3, 4, 5, 4, 4, 2, 2, 1, 7, 3, 4, 4, 6, 4, 4, 4, 3,…
#> $ Garage        <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes…

2.3 Cleansing Data

colSums(is.na(house)) 
#>         Price     Land.Area Building.Area      Bedrooms     Bathrooms 
#>             0             0             0             0             0 
#>        Garage 
#>             0

Luckily there is no missing value in the dataset.

2.4 EDA

#distribution of data
hist(house$Price)

#distribution of data
boxplot(house$Price)

#correlation
ggcorr(house,label = TRUE , hjust = 1)

💡 Insight:

  • There are outliers
  • Right skewed data distribution
  • There are 2 variables that have a strong correlation (>=0.5) with Price namely Land.Area and Building.Area

2.5 Separating Data to Train and Test

In order to test the model on later stage, I’ll separate the dataset into 2, training and testing data with a ratio 8:2.

set.seed(100)
intrain <- sample(nrow(house), nrow(house) * .8)
house_price_train <- house[intrain,]
house_price_test <- house[-intrain,]

3. Modelling

We will create several models based on feature selection:

  1. Model with all predictor
  2. Model selection based on correlation (correlation > 0.5)
  3. Model selection based on stepwise result (backward/forward/both)

3.1 Model Fitting

All Predictor

Model with all predictor.

model_all <- lm(formula = Price ~ . , 
                data = house_price_train)
summary(model_all)
#> 
#> Call:
#> lm(formula = Price ~ ., data = house_price_train)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -79354787519  -3953529547   -858300311   1427548393 135150271743 
#> 
#> Coefficients:
#>                  Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)   -3313604959  1423188139  -2.328               0.0201 *  
#> Land.Area        22969163     1249775  18.379 < 0.0000000000000002 ***
#> Building.Area    10711579     1472806   7.273    0.000000000000845 ***
#> Bedrooms       -261767528   447470305  -0.585               0.5587    
#> Bathrooms       764119312   474298192   1.611               0.1076    
#> GarageYes      2080230117  1167108925   1.782               0.0751 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13390000000 on 794 degrees of freedom
#> Multiple R-squared:  0.6124, Adjusted R-squared:   0.61 
#> F-statistic: 250.9 on 5 and 794 DF,  p-value: < 0.00000000000000022

💡 Insight:

  • Most of the predictor has positive coefficient, except Bedrooms.
  • Variabel prediktor yang signifikan berpengaruh terhadap Price adalah Land.Area dan Building.Area
  • Adjusted R-squared: 0.61, artinya model kita bisa menjelaskan Price dengan baik sebesar 61%

Correlation

Model selection based on correlation (correlation > 0.5), namely Land.Area and Building.Area

model_selection <- lm(formula = Price ~ Land.Area + Building.Area, 
                data = house_price_train)
summary(model_selection)
#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area, data = house_price_train)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -84141977037  -3819266578  -1043687619    804217559 135522012296 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)   -188052156  706002175  -0.266                 0.79    
#> Land.Area       23241202    1248727  18.612 < 0.0000000000000002 ***
#> Building.Area   11152227    1469746   7.588   0.0000000000000908 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13440000000 on 797 degrees of freedom
#> Multiple R-squared:  0.6081, Adjusted R-squared:  0.6071 
#> F-statistic: 618.4 on 2 and 797 DF,  p-value: < 0.00000000000000022

💡 Insight:

  • Every predictor has positive coefficient.
  • Semua prediktor berpengaruh signifikan terhadap Price.
  • Adjusted R-squared: 0.6071, artinya model kita bisa menjelaskan Price dengan baik sebesar 60.71%

Stepwise

Model selection based on stepwise result.

# model tanpa prediktor dari data house untuk `price`
model_none <- lm(formula = Price ~ 1,
                 data = house )

model_forward <- step(object = model_none,
                      scope = list(upper = model_all),
                      direction = "forward",
                      trace = F)

model_backward <- step(object = model_all,
                      direction = "backward",
                      trace = F)

model_both <- step(object = model_none,
                   direction = "both",
                   scope = list(upper = model_all),
                   trace = F)
#model forward
summary(model_forward)
#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + Bathrooms + 
#>     Garage, data = house)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -83388265362  -4077544360   -868978732   1280926770 135800805889 
#> 
#> Coefficients:
#>                  Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)   -3256711926  1228917644  -2.650              0.00818 ** 
#> Land.Area        21436040     1085458  19.748 < 0.0000000000000002 ***
#> Building.Area    12172126     1277753   9.526 < 0.0000000000000002 ***
#> Bathrooms       514024107   228406771   2.250              0.02464 *  
#> GarageYes      1802890425  1017526044   1.772              0.07673 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13300000000 on 996 degrees of freedom
#> Multiple R-squared:  0.5925, Adjusted R-squared:  0.5909 
#> F-statistic:   362 on 4 and 996 DF,  p-value: < 0.00000000000000022

💡 Insight:

  • Every predictor has positive coefficient.
  • Semua prediktor berpengaruh signifikan terhadap Price, kecuali GarageYes
  • Adjusted R-squared: 0.5909, artinya model kita bisa menjelaskan Price dengan baik sebesar 59.09%
#model backward
summary(model_backward)
#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + Bathrooms + 
#>     Garage, data = house_price_train)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -79310117158  -4003089453   -829459418   1396519580 135017159775 
#> 
#> Coefficients:
#>                  Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)   -3554356015  1361822306  -2.610              0.00922 ** 
#> Land.Area        22940572     1248302  18.377 < 0.0000000000000002 ***
#> Building.Area    10737900     1471510   7.297    0.000000000000713 ***
#> Bathrooms       523932575   237352978   2.207              0.02757 *  
#> GarageYes      2099328536  1166169478   1.800              0.07221 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13390000000 on 795 degrees of freedom
#> Multiple R-squared:  0.6123, Adjusted R-squared:  0.6103 
#> F-statistic: 313.8 on 4 and 795 DF,  p-value: < 0.00000000000000022

💡 Insight:

  • Every predictor has positive coefficient.
  • Semua prediktor berpengaruh signifikan terhadap Price, kecuali GarageYes
  • Adjusted R-squared: 0.6103, artinya model kita bisa menjelaskan Price dengan baik sebesar 61.03%
#model both
summary(model_both)
#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + Bathrooms + 
#>     Garage, data = house)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -83388265362  -4077544360   -868978732   1280926770 135800805889 
#> 
#> Coefficients:
#>                  Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)   -3256711926  1228917644  -2.650              0.00818 ** 
#> Land.Area        21436040     1085458  19.748 < 0.0000000000000002 ***
#> Building.Area    12172126     1277753   9.526 < 0.0000000000000002 ***
#> Bathrooms       514024107   228406771   2.250              0.02464 *  
#> GarageYes      1802890425  1017526044   1.772              0.07673 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13300000000 on 996 degrees of freedom
#> Multiple R-squared:  0.5925, Adjusted R-squared:  0.5909 
#> F-statistic:   362 on 4 and 996 DF,  p-value: < 0.00000000000000022

💡 Insight:

  • Every predictor has positive coefficient.
  • Semua prediktor berpengaruh signifikan terhadap Price, kecuali GarageYes
  • Adjusted R-squared: 0.5909, artinya model kita bisa menjelaskan Price dengan baik sebesar 59.09%

3.2 Model Comparison with Goodness of Fit (Adjusted R-squared)

💡 Kesimpulan:

Model yang dapat menjelaskan variabel target dengan baik adalah model_all , yakni sebesar 0.61 atau 61% dan model_backward , yakni sebesar 0.6103 atau 61.03%. Karena keduanya hampir sama, kita akan coba untuk menggunakan kedua model tersebut untuk tahap selanjutnya.

3.3 Model Assumption

In this section, we will evaluate each models using Normality test, Homoscedasticity test, Multicollinearity test.

Normality of Residuals

  • model_all
hist(model_all$residuals)

shapiro.test(model_all$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_all$residuals
#> W = 0.56654, p-value < 0.00000000000000022
  • model_backward
hist(model_backward$residuals)

shapiro.test(model_backward$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward$residuals
#> W = 0.56711, p-value < 0.00000000000000022

💡 Kesimpulan: p-value pada kedua model < 0.00000000000000022, artinya residual data tidak berdistribusi normal. Hal tersebut mungkin saja terjadi karena adanya outlier.

Homoscedasticity of Residuals

  • model_all
library(lmtest)
bptest(model_all)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_all
#> BP = 118.29, df = 5, p-value < 0.00000000000000022
  • model_backward
bptest(model_backward)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward
#> BP = 118.32, df = 4, p-value < 0.00000000000000022

💡 Kesimpulan: p-value pada kedua model < 0.00000000000000022, artinya error menyebar tidak konstan atau heteroscedasticity

Multicollinearity

  • model_all
library(car)
vif(model_all)
#>     Land.Area Building.Area      Bedrooms     Bathrooms        Garage 
#>      2.060187      2.065310      4.162510      4.215207      1.012512
  • model_backward
vif(model_backward)
#>     Land.Area Building.Area     Bathrooms        Garage 
#>      2.057036      2.063383      1.056489      1.011720

💡 Kesimpulan: seluruh variabel prediktor pada kedua model memenuhi asumsi no multicolinierity

3.4 Model Transformation

Berdasarkan tahap model assumption sebelumnya, diketahui bahwa residual data tidak berdistribusi normal dan error menyebar tidak konstan atau heteroscedasticity sehingga model perlu ditransformasi.

model_all_trans <- lm(formula = Price ~ Land.Area + Building.Area + log(Bedrooms) + log(Bathrooms) + Garage, 
                data = house_price_train)
summary(model_all_trans)
#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + log(Bedrooms) + 
#>     log(Bathrooms) + Garage, data = house_price_train)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -75913448296  -4204040173   -693569424   1637890101 134729823551 
#> 
#> Coefficients:
#>                   Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)    -6115363599  2363226112  -2.588              0.00984 ** 
#> Land.Area         22770430     1255169  18.141 < 0.0000000000000002 ***
#> Building.Area     10423148     1477745   7.053      0.0000000000038 ***
#> log(Bedrooms)   -321846062  2290044432  -0.141              0.88827    
#> log(Bathrooms)  4234710196  1980369671   2.138              0.03279 *  
#> GarageYes       1888599934  1168313360   1.617              0.10638    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13360000000 on 794 degrees of freedom
#> Multiple R-squared:  0.6141, Adjusted R-squared:  0.6116 
#> F-statistic: 252.7 on 5 and 794 DF,  p-value: < 0.00000000000000022
#normalitas shapiro
shapiro.test(model_all_trans$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_all_trans$residuals
#> W = 0.57427, p-value < 0.00000000000000022
#heterosedastisitas bp
bptest(model_all_trans)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_all_trans
#> BP = 114.88, df = 5, p-value < 0.00000000000000022
model_backward_trans <- lm(formula = Price ~ Land.Area + Building.Area + log(Bathrooms) + Garage, 
                data = house_price_train)
summary(model_backward_trans)
#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + log(Bathrooms) + 
#>     Garage, data = house_price_train)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -75815559702  -4195687607   -715866009   1618120327 134693626211 
#> 
#> Coefficients:
#>                   Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)    -6314147492  1892048615  -3.337             0.000886 ***
#> Land.Area         22754314     1249149  18.216 < 0.0000000000000002 ***
#> Building.Area     10424346     1476809   7.059     0.00000000000367 ***
#> log(Bathrooms)  4034821715  1377191812   2.930             0.003489 ** 
#> GarageYes       1890659405  1167501017   1.619             0.105756    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13350000000 on 795 degrees of freedom
#> Multiple R-squared:  0.6141, Adjusted R-squared:  0.6121 
#> F-statistic: 316.2 on 4 and 795 DF,  p-value: < 0.00000000000000022
#normalitas shapiro
shapiro.test(model_backward_trans$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward_trans$residuals
#> W = 0.57444, p-value < 0.00000000000000022
#heterosedastisitas bp
bptest(model_backward_trans)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward_trans
#> BP = 114.69, df = 4, p-value < 0.00000000000000022

Setelah dilakukan transformasi data, ternyata nilai p-value tidak mengalami perubahan.

ps: Saya sudah mencoba melakukan handling outlier, namun p-value tetap < 0.05

3.4 Model Evaluation and Prediction

all_pred <- predict(model_all, house_price_test)
backward_pred <- predict(model_backward, house_price_test)

MAE

MAE(y_pred=all_pred, y_true=house_price_test$Price)
#> [1] 6343359557
MAE(y_pred=backward_pred, y_true=house_price_test$Price)
#> [1] 6339267013
# Check Target Variable Range
range(house_price_test$Price)
#> [1]    430000000 169000000000
169000000000-430000000
#> [1] 168570000000
  • Nilai MAE relatif lebih kecil dibandingkan range data, namun error masih lumayan besar.
  • model_backward mempunyai error lebih kecil dibandingkan model_all

MAPE

MAPE(y_pred = all_pred,
    y_true = house_price_test$Price)*100
#> [1] 41.68089
MAPE(y_pred = backward_pred,
    y_true = house_price_test$Price)*100
#> [1] 41.87019
  • Dari model_all kita memperoleh MAPE sebesar 41,7%, artinya model dapat memprediksi dengan keakuratan sebesar 58.3%.
  • Dari model_backward kita memperoleh MAPE sebesar 41,9%, artinya model dapat memprediksi dengan keakuratan sebesar 58.1%.

RMSE

RMSE(y_pred=all_pred, y_true=house_price_test$Price)
#> [1] 13080415605
RMSE(y_pred=backward_pred, y_true=house_price_test$Price)
#> [1] 13035505643
  • Nilai RMSE relatif lebih kecil dibandingkan range data, namun error masih lumayan besar.
  • model_backward mempunyai error lebih kecil dibandingkan model_all

4. Conclusion

  • Model yang dibuat masih menghasilkan error yang besar, hal tersebut karena pada kedua model residual data tidak berdistribusi normal dan error menyebar tidak konstan atau heteroscedasticity.
  • Oleh karena itu, saya merekomendasikan untuk menggunakan model lain yang lebih kompleks (model yang bebas asumsi)
  • Meskipun begitu jika dilihat dari MAE dan RMSE, model_backward lebih baik dibandingkan model_all