LBB: Regression Model

Syahvan Alviansyah Diva Ritonga

2023-07-01

1. Introduction

In recent years, the real estate industry in South Jakarta, Indonesia has experienced rapid growth, making it an interesting area for research and experimentation. One common task in this field is forecasting house prices, which can be challenging due to the complex and dynamic nature of the market. In this experiment, we will use linear regression, a widely used statistical method, to analyze data and make predictions about house prices in South Jakarta from Kaggle dataset. By applying this technique, we aim to gain insights into the factors that affect house prices and improve our ability to forecast them accurately.

2. Data Wrangling & EDA

2.1 Read data

house <-read.csv("rumah_jaksel.csv", sep = ";")
knitr::kable(head(house, 10))

HARGA	LT	LB	JKT	JKM	GRS	KOTA
28.000.000.000	1100	700	5	6	ADA	JAKSEL
19.000.000.000	824	800	4	4	ADA	JAKSEL
4.700.000.000	500	400	4	3	ADA	JAKSEL
4.900.000.000	251	300	5	4	ADA	JAKSEL
28.000.000.000	1340	575	4	5	ADA	JAKSEL
10.000.000.000	460	300	4	4	ADA	JAKSEL
7.600.000.000	278	350	4	4	ADA	JAKSEL
5.250.000.000	511	300	3	2	ADA	JAKSEL
670.000.000	70	69	3	2	TIDAK ADA	JAKSEL
480.000.000	66	42	2	1	TIDAK ADA	JAKSEL

Explanation of each column in the dataset:

HARGA = House prices (IDR)
LT = Total Land Area
LB = Total Building Area
JKT = Number of Bedrooms
JKM = Number of Bathrooms
KOTA = City
GRS: Has Garage? (binary: “ADA = yes”,“TIDAK ADA = no”)

The column names will be changed to make them easier to understand and the KOTA column will be removed as it is not needed.

house <- house %>% 
  select(-KOTA) %>% 
  rename(Price = HARGA,
         Land.Area = LT,
         Building.Area = LB,
         Bedrooms = JKT,
         Bathrooms = JKM,
         Garage = GRS) %>% 
  mutate(Garage = ifelse(Garage == "ADA", "Yes", "No"))
knitr::kable(head(house, 10))

Price	Land.Area	Building.Area	Bedrooms	Bathrooms	Garage
28.000.000.000	1100	700	5	6	Yes
19.000.000.000	824	800	4	4	Yes
4.700.000.000	500	400	4	3	Yes
4.900.000.000	251	300	5	4	Yes
28.000.000.000	1340	575	4	5	Yes
10.000.000.000	460	300	4	4	Yes
7.600.000.000	278	350	4	4	Yes
5.250.000.000	511	300	3	2	Yes
670.000.000	70	69	3	2	No
480.000.000	66	42	2	1	No

2.2 Check the data structure

glimpse(house)

#> Rows: 1,001
#> Columns: 6
#> $ Price         <chr> "28.000.000.000", "19.000.000.000", "4.700.000.000", "4.…
#> $ Land.Area     <int> 1100, 824, 500, 251, 1340, 460, 278, 511, 70, 66, 449, 1…
#> $ Building.Area <int> 700, 800, 400, 300, 575, 300, 350, 300, 69, 42, 500, 188…
#> $ Bedrooms      <int> 5, 4, 4, 5, 4, 4, 4, 3, 3, 2, 6, 2, 4, 4, 5, 4, 5, 4, 4,…
#> $ Bathrooms     <int> 6, 4, 3, 4, 5, 4, 4, 2, 2, 1, 7, 3, 4, 4, 6, 4, 4, 4, 3,…
#> $ Garage        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …

💡 Data structure check results:

The dataset house consists of 6 columns and 1,001 rows
The Price column needs to be changed to numeric
There are indications that there are columns that need to be changed to factors, namely the Bedrooms, Bathrooms, and Garage columns

unique(house$Bedrooms)

#>  [1]  5  4  3  2  6  9  8  7 10 17 11  1 27 22

unique(house$Bathrooms)

#>  [1]  6  4  3  5  2  1  7  8  9 18 11 21 27 10

unique(house$Garage)

#> [1] "Yes" "No"

Based on the inspection above, it can be seen that besides the Garage column has too many categories so only the Garage column whose data type will be changed to factor.

house$Price <- gsub("\\.", "", house$Price)
house <- house %>% 
  mutate(Price = as.numeric(Price),
         Garage = as.factor(Garage))
glimpse(house)

#> Rows: 1,001
#> Columns: 6
#> $ Price         <dbl> 28000000000, 19000000000, 4700000000, 4900000000, 280000…
#> $ Land.Area     <int> 1100, 824, 500, 251, 1340, 460, 278, 511, 70, 66, 449, 1…
#> $ Building.Area <int> 700, 800, 400, 300, 575, 300, 350, 300, 69, 42, 500, 188…
#> $ Bedrooms      <int> 5, 4, 4, 5, 4, 4, 4, 3, 3, 2, 6, 2, 4, 4, 5, 4, 5, 4, 4,…
#> $ Bathrooms     <int> 6, 4, 3, 4, 5, 4, 4, 2, 2, 1, 7, 3, 4, 4, 6, 4, 4, 4, 3,…
#> $ Garage        <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes…

2.3 Cleansing Data

colSums(is.na(house))

#>         Price     Land.Area Building.Area      Bedrooms     Bathrooms 
#>             0             0             0             0             0 
#>        Garage 
#>             0

Luckily there is no missing value in the dataset.

2.4 EDA

#distribution of data
hist(house$Price)

#distribution of data
boxplot(house$Price)

#correlation
ggcorr(house,label = TRUE , hjust = 1)

💡 Insight:

There are outliers
Right skewed data distribution
There are 2 variables that have a strong correlation (>=0.5) with Price namely Land.Area and Building.Area

2.5 Separating Data to Train and Test

In order to test the model on later stage, I’ll separate the dataset into 2, training and testing data with a ratio 8:2.

set.seed(100)
intrain <- sample(nrow(house), nrow(house) * .8)
house_price_train <- house[intrain,]
house_price_test <- house[-intrain,]

3. Modelling

We will create several models based on feature selection:

Model with all predictor
Model selection based on correlation (correlation > 0.5)
Model selection based on stepwise result (backward/forward/both)

3.1 Model Fitting

All Predictor

Model with all predictor.

model_all <- lm(formula = Price ~ . , 
                data = house_price_train)
summary(model_all)

#> 
#> Call:
#> lm(formula = Price ~ ., data = house_price_train)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -79354787519  -3953529547   -858300311   1427548393 135150271743 
#> 
#> Coefficients:
#>                  Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)   -3313604959  1423188139  -2.328               0.0201 *  
#> Land.Area        22969163     1249775  18.379 < 0.0000000000000002 ***
#> Building.Area    10711579     1472806   7.273    0.000000000000845 ***
#> Bedrooms       -261767528   447470305  -0.585               0.5587    
#> Bathrooms       764119312   474298192   1.611               0.1076    
#> GarageYes      2080230117  1167108925   1.782               0.0751 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13390000000 on 794 degrees of freedom
#> Multiple R-squared:  0.6124, Adjusted R-squared:   0.61 
#> F-statistic: 250.9 on 5 and 794 DF,  p-value: < 0.00000000000000022

💡 Insight:

Most of the predictor has positive coefficient, except Bedrooms.
Variabel prediktor yang signifikan berpengaruh terhadap Price adalah Land.Area dan Building.Area
Adjusted R-squared: 0.61, artinya model kita bisa menjelaskan Price dengan baik sebesar 61%

Correlation

Model selection based on correlation (correlation > 0.5), namely Land.Area and Building.Area

model_selection <- lm(formula = Price ~ Land.Area + Building.Area, 
                data = house_price_train)
summary(model_selection)

#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area, data = house_price_train)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -84141977037  -3819266578  -1043687619    804217559 135522012296 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)   -188052156  706002175  -0.266                 0.79    
#> Land.Area       23241202    1248727  18.612 < 0.0000000000000002 ***
#> Building.Area   11152227    1469746   7.588   0.0000000000000908 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13440000000 on 797 degrees of freedom
#> Multiple R-squared:  0.6081, Adjusted R-squared:  0.6071 
#> F-statistic: 618.4 on 2 and 797 DF,  p-value: < 0.00000000000000022

💡 Insight:

Every predictor has positive coefficient.
Semua prediktor berpengaruh signifikan terhadap Price.
Adjusted R-squared: 0.6071, artinya model kita bisa menjelaskan Price dengan baik sebesar 60.71%

Stepwise

Model selection based on stepwise result.

# model tanpa prediktor dari data house untuk `price`
model_none <- lm(formula = Price ~ 1,
                 data = house )

model_forward <- step(object = model_none,
                      scope = list(upper = model_all),
                      direction = "forward",
                      trace = F)

model_backward <- step(object = model_all,
                      direction = "backward",
                      trace = F)

model_both <- step(object = model_none,
                   direction = "both",
                   scope = list(upper = model_all),
                   trace = F)

#model forward
summary(model_forward)

#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + Bathrooms + 
#>     Garage, data = house)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -83388265362  -4077544360   -868978732   1280926770 135800805889 
#> 
#> Coefficients:
#>                  Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)   -3256711926  1228917644  -2.650              0.00818 ** 
#> Land.Area        21436040     1085458  19.748 < 0.0000000000000002 ***
#> Building.Area    12172126     1277753   9.526 < 0.0000000000000002 ***
#> Bathrooms       514024107   228406771   2.250              0.02464 *  
#> GarageYes      1802890425  1017526044   1.772              0.07673 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13300000000 on 996 degrees of freedom
#> Multiple R-squared:  0.5925, Adjusted R-squared:  0.5909 
#> F-statistic:   362 on 4 and 996 DF,  p-value: < 0.00000000000000022

💡 Insight:

Every predictor has positive coefficient.
Semua prediktor berpengaruh signifikan terhadap Price, kecuali GarageYes
Adjusted R-squared: 0.5909, artinya model kita bisa menjelaskan Price dengan baik sebesar 59.09%

#model backward
summary(model_backward)

#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + Bathrooms + 
#>     Garage, data = house_price_train)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -79310117158  -4003089453   -829459418   1396519580 135017159775 
#> 
#> Coefficients:
#>                  Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)   -3554356015  1361822306  -2.610              0.00922 ** 
#> Land.Area        22940572     1248302  18.377 < 0.0000000000000002 ***
#> Building.Area    10737900     1471510   7.297    0.000000000000713 ***
#> Bathrooms       523932575   237352978   2.207              0.02757 *  
#> GarageYes      2099328536  1166169478   1.800              0.07221 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13390000000 on 795 degrees of freedom
#> Multiple R-squared:  0.6123, Adjusted R-squared:  0.6103 
#> F-statistic: 313.8 on 4 and 795 DF,  p-value: < 0.00000000000000022

💡 Insight:

Every predictor has positive coefficient.
Semua prediktor berpengaruh signifikan terhadap Price, kecuali GarageYes
Adjusted R-squared: 0.6103, artinya model kita bisa menjelaskan Price dengan baik sebesar 61.03%

#model both
summary(model_both)

#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + Bathrooms + 
#>     Garage, data = house)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -83388265362  -4077544360   -868978732   1280926770 135800805889 
#> 
#> Coefficients:
#>                  Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)   -3256711926  1228917644  -2.650              0.00818 ** 
#> Land.Area        21436040     1085458  19.748 < 0.0000000000000002 ***
#> Building.Area    12172126     1277753   9.526 < 0.0000000000000002 ***
#> Bathrooms       514024107   228406771   2.250              0.02464 *  
#> GarageYes      1802890425  1017526044   1.772              0.07673 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13300000000 on 996 degrees of freedom
#> Multiple R-squared:  0.5925, Adjusted R-squared:  0.5909 
#> F-statistic:   362 on 4 and 996 DF,  p-value: < 0.00000000000000022

💡 Insight:

Every predictor has positive coefficient.
Semua prediktor berpengaruh signifikan terhadap Price, kecuali GarageYes
Adjusted R-squared: 0.5909, artinya model kita bisa menjelaskan Price dengan baik sebesar 59.09%

3.2 Model Comparison with Goodness of Fit (Adjusted R-squared)

💡 Kesimpulan:

Model yang dapat menjelaskan variabel target dengan baik adalah model_all , yakni sebesar 0.61 atau 61% dan model_backward , yakni sebesar 0.6103 atau 61.03%. Karena keduanya hampir sama, kita akan coba untuk menggunakan kedua model tersebut untuk tahap selanjutnya.

3.3 Model Assumption

In this section, we will evaluate each models using Normality test, Homoscedasticity test, Multicollinearity test.

Normality of Residuals

model_all

hist(model_all$residuals)

shapiro.test(model_all$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_all$residuals
#> W = 0.56654, p-value < 0.00000000000000022

model_backward

hist(model_backward$residuals)

shapiro.test(model_backward$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward$residuals
#> W = 0.56711, p-value < 0.00000000000000022

💡 Kesimpulan: p-value pada kedua model < 0.00000000000000022, artinya residual data tidak berdistribusi normal. Hal tersebut mungkin saja terjadi karena adanya outlier.

Homoscedasticity of Residuals

model_all

library(lmtest)
bptest(model_all)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_all
#> BP = 118.29, df = 5, p-value < 0.00000000000000022

model_backward

bptest(model_backward)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward
#> BP = 118.32, df = 4, p-value < 0.00000000000000022

💡 Kesimpulan: p-value pada kedua model < 0.00000000000000022, artinya error menyebar tidak konstan atau heteroscedasticity

Multicollinearity

model_all

library(car)
vif(model_all)

#>     Land.Area Building.Area      Bedrooms     Bathrooms        Garage 
#>      2.060187      2.065310      4.162510      4.215207      1.012512

model_backward

vif(model_backward)

#>     Land.Area Building.Area     Bathrooms        Garage 
#>      2.057036      2.063383      1.056489      1.011720

💡 Kesimpulan: seluruh variabel prediktor pada kedua model memenuhi asumsi no multicolinierity

3.4 Model Transformation

Berdasarkan tahap model assumption sebelumnya, diketahui bahwa residual data tidak berdistribusi normal dan error menyebar tidak konstan atau heteroscedasticity sehingga model perlu ditransformasi.

model_all_trans <- lm(formula = Price ~ Land.Area + Building.Area + log(Bedrooms) + log(Bathrooms) + Garage, 
                data = house_price_train)
summary(model_all_trans)

#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + log(Bedrooms) + 
#>     log(Bathrooms) + Garage, data = house_price_train)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -75913448296  -4204040173   -693569424   1637890101 134729823551 
#> 
#> Coefficients:
#>                   Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)    -6115363599  2363226112  -2.588              0.00984 ** 
#> Land.Area         22770430     1255169  18.141 < 0.0000000000000002 ***
#> Building.Area     10423148     1477745   7.053      0.0000000000038 ***
#> log(Bedrooms)   -321846062  2290044432  -0.141              0.88827    
#> log(Bathrooms)  4234710196  1980369671   2.138              0.03279 *  
#> GarageYes       1888599934  1168313360   1.617              0.10638    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13360000000 on 794 degrees of freedom
#> Multiple R-squared:  0.6141, Adjusted R-squared:  0.6116 
#> F-statistic: 252.7 on 5 and 794 DF,  p-value: < 0.00000000000000022

#normalitas shapiro
shapiro.test(model_all_trans$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_all_trans$residuals
#> W = 0.57427, p-value < 0.00000000000000022

#heterosedastisitas bp
bptest(model_all_trans)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_all_trans
#> BP = 114.88, df = 5, p-value < 0.00000000000000022

model_backward_trans <- lm(formula = Price ~ Land.Area + Building.Area + log(Bathrooms) + Garage, 
                data = house_price_train)
summary(model_backward_trans)

#> 
#> Call:
#> lm(formula = Price ~ Land.Area + Building.Area + log(Bathrooms) + 
#>     Garage, data = house_price_train)
#> 
#> Residuals:
#>          Min           1Q       Median           3Q          Max 
#> -75815559702  -4195687607   -715866009   1618120327 134693626211 
#> 
#> Coefficients:
#>                   Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)    -6314147492  1892048615  -3.337             0.000886 ***
#> Land.Area         22754314     1249149  18.216 < 0.0000000000000002 ***
#> Building.Area     10424346     1476809   7.059     0.00000000000367 ***
#> log(Bathrooms)  4034821715  1377191812   2.930             0.003489 ** 
#> GarageYes       1890659405  1167501017   1.619             0.105756    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13350000000 on 795 degrees of freedom
#> Multiple R-squared:  0.6141, Adjusted R-squared:  0.6121 
#> F-statistic: 316.2 on 4 and 795 DF,  p-value: < 0.00000000000000022

#normalitas shapiro
shapiro.test(model_backward_trans$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward_trans$residuals
#> W = 0.57444, p-value < 0.00000000000000022

#heterosedastisitas bp
bptest(model_backward_trans)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward_trans
#> BP = 114.69, df = 4, p-value < 0.00000000000000022

Setelah dilakukan transformasi data, ternyata nilai p-value tidak mengalami perubahan.

ps: Saya sudah mencoba melakukan handling outlier, namun p-value tetap < 0.05

3.4 Model Evaluation and Prediction

all_pred <- predict(model_all, house_price_test)
backward_pred <- predict(model_backward, house_price_test)

MAE

MAE(y_pred=all_pred, y_true=house_price_test$Price)

#> [1] 6343359557

MAE(y_pred=backward_pred, y_true=house_price_test$Price)

#> [1] 6339267013

# Check Target Variable Range
range(house_price_test$Price)

#> [1]    430000000 169000000000

169000000000-430000000

#> [1] 168570000000

Nilai MAE relatif lebih kecil dibandingkan range data, namun error masih lumayan besar.
model_backward mempunyai error lebih kecil dibandingkan model_all

MAPE

MAPE(y_pred = all_pred,
    y_true = house_price_test$Price)*100

#> [1] 41.68089

MAPE(y_pred = backward_pred,
    y_true = house_price_test$Price)*100

#> [1] 41.87019

Dari model_all kita memperoleh MAPE sebesar 41,7%, artinya model dapat memprediksi dengan keakuratan sebesar 58.3%.
Dari model_backward kita memperoleh MAPE sebesar 41,9%, artinya model dapat memprediksi dengan keakuratan sebesar 58.1%.

RMSE

RMSE(y_pred=all_pred, y_true=house_price_test$Price)

#> [1] 13080415605

RMSE(y_pred=backward_pred, y_true=house_price_test$Price)

#> [1] 13035505643

Nilai RMSE relatif lebih kecil dibandingkan range data, namun error masih lumayan besar.
model_backward mempunyai error lebih kecil dibandingkan model_all

4. Conclusion

Model yang dibuat masih menghasilkan error yang besar, hal tersebut karena pada kedua model residual data tidak berdistribusi normal dan error menyebar tidak konstan atau heteroscedasticity.
Oleh karena itu, saya merekomendasikan untuk menggunakan model lain yang lebih kompleks (model yang bebas asumsi)
Meskipun begitu jika dilihat dari MAE dan RMSE, model_backward lebih baik dibandingkan model_all