1 Explanation

Content

The Green Wing 1985 House Price database is generated by computer, to help the very bigger in the field of machine learning

Notes on Specific Variables

  • Area : Land Size in Square Meter(s)

  • FirePlace : Number of Fire Place, ‘0’ means there is no Fire Place

  • Baths : Number of Bathrooms, ‘0’ means there is no Bathroom

  • White.Marble : Home interiors with White Marble, ‘0’ is No, ‘1’ is Yes

  • Black.Marble : Home interiors with Black Marble, ‘0’ is No, ‘1’ is Yes

  • Indian.Marble: Home interiors with Indian Marble, ‘0’ is No, ‘1’ is Yes

  • Floors : The house consists of more than one floor, ‘0’ is No, ‘1’ is Yes

  • City : “Undefined Variable”

  • Solar : The house electricity is supplied with Solar Panel, ‘0’ is No, ‘1’ is Yes

  • Electric : The house electricity is connected with Electricity Network, ‘0’ is No, ‘1’ is Yes

  • Fiber : The house using Fiber Manterials, ‘0’ is No, ‘1’ is Yes

  • Glass.Doors : The house with Glass Doors, ‘0’ is No, ‘1’ is Yes

  • Swiming.Pool : The house with Swimming Pool, ‘0’ is No, ‘1’ is Yes

  • Garden : The house with Garden, ‘0’ is No, ‘1’ is Yes

  • Prices : House price in USD

Data Source

2 Input Data

Input Library

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(manipulate)
library(performance)
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
options(scipen = 123)

Use read.csv() to input bulk data

house <- read.csv("HousePrices_HalfMil.csv")

2.1 Data Inspection

Check head & tail data of dataframe

head(house)

Check NA value in dataframe

colSums(is.na(house))
##          Area        Garage     FirePlace         Baths  White.Marble 
##             0             0             0             0             0 
##  Black.Marble Indian.Marble        Floors          City         Solar 
##             0             0             0             0             0 
##      Electric         Fiber   Glass.Doors  Swiming.Pool        Garden 
##             0             0             0             0             0 
##        Prices 
##             0

Check dataframe dimension

dim(house)
## [1] 500000     16

Check duplicated data

sum(duplicated(house) == T)
## [1] 5693
sum(duplicated(house) == F)
## [1] 494307

2.2 Data Cleansing

Check dataframe structure

str(house)
## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
##  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
##  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
##  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

Remove duplicated data

house <- distinct(house)

Check the unique values for each column to get the column to convert to the ‘factor’ data type

count_unique <- rapply(house,    # Counts of unique values
                       function(x) length(unique(x)))
count_unique  
##          Area        Garage     FirePlace         Baths  White.Marble 
##           249             3             5             5             2 
##  Black.Marble Indian.Marble        Floors          City         Solar 
##             2             2             2             3             2 
##      Electric         Fiber   Glass.Doors  Swiming.Pool        Garden 
##             2             2             2             2             2 
##        Prices 
##          2714

Dataframe summary

summary(house)
##       Area           Garage        FirePlace         Baths      
##  Min.   :  1.0   Min.   :1.000   Min.   :0.000   Min.   :1.000  
##  1st Qu.: 63.0   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :125.0   Median :2.000   Median :2.000   Median :3.000  
##  Mean   :124.9   Mean   :2.001   Mean   :2.003   Mean   :2.998  
##  3rd Qu.:187.0   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :249.0   Max.   :3.000   Max.   :4.000   Max.   :5.000  
##   White.Marble    Black.Marble    Indian.Marble        Floors      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.333   Mean   :0.3327   Mean   :0.3343   Mean   :0.4993  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       City           Solar           Electric          Fiber       
##  Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :2.000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :2.001   Mean   :0.4987   Mean   :0.5005   Mean   :0.5003  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   Glass.Doors   Swiming.Pool        Garden           Prices     
##  Min.   :0.0   Min.   :0.0000   Min.   :0.0000   Min.   : 7725  
##  1st Qu.:0.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:33500  
##  Median :1.0   Median :1.0000   Median :1.0000   Median :41850  
##  Mean   :0.5   Mean   :0.5004   Mean   :0.5016   Mean   :42047  
##  3rd Qu.:1.0   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:50750  
##  Max.   :1.0   Max.   :1.0000   Max.   :1.0000   Max.   :77975

Plot the columns correlation using (ggcorr())

ggcorr(house, label = T, label_round = 2, label_size = 2, hjust = 1, size = 4, col = "grey", layout.exp = 1)

3 House Price Prediction

3.1 Splitting Data

Data train & data test using set.seed & row index

RNGkind(sample.kind = "Rounding") #untuk menyamaratakan splitting antar komputer yang berbeda versi
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(50)
# train-test splitting
index_new <- sample(nrow(house), nrow(house)*0.75)
house_train <- house[index_new,]
house_test <- house[-index_new,]

Check data train & data test dimension

dim(house_train)
## [1] 370730     16
dim(house_test)
## [1] 123577     16

3.2 Linear Regression Model

Model using correlated all columns

model_all <- lm(formula = Prices ~., data = house_train)
summary(model_all)
## 
## Call:
## lm(formula = Prices ~ ., data = house_train)
## 
## Residuals:
##          Min           1Q       Median           3Q          Max 
## -0.000081214  0.000000000  0.000000000  0.000000001  0.000000204 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate           Std. Error            t value
## (Intercept)     999.99999986773810     0.00000000123062   812601680256.496
## Area             24.99999999999766     0.00000000000305  8197305526038.221
## Garage         1499.99999999959005     0.00000000026810  5594823763621.271
## FirePlace       750.00000000029877     0.00000000015498  4839348467431.281
## Baths          1250.00000000023033     0.00000000015497  8066053834128.667
## White.Marble  13999.99999999995271     0.00000000053640 26100111329769.492
## Black.Marble   4999.99999999938973     0.00000000053678  9314852788528.242
## Indian.Marble                   NA                   NA                 NA
## Floors        14999.99999999946704     0.00000000043815 34234529287136.609
## City           3499.99999999945248     0.00000000026837 13041770625334.861
## Solar           250.00000000043636     0.00000000043816   570567638944.143
## Electric       1250.00000000048408     0.00000000043816  2852866440816.003
## Fiber         11749.99999999971442     0.00000000043816 26816730596011.367
## Glass.Doors    4449.99999999948341     0.00000000043816 10156166006835.543
## Swiming.Pool      0.00000000044157     0.00000000043816              1.008
## Garden           -0.00000000043482     0.00000000043816             -0.992
##                          Pr(>|t|)    
## (Intercept)   <0.0000000000000002 ***
## Area          <0.0000000000000002 ***
## Garage        <0.0000000000000002 ***
## FirePlace     <0.0000000000000002 ***
## Baths         <0.0000000000000002 ***
## White.Marble  <0.0000000000000002 ***
## Black.Marble  <0.0000000000000002 ***
## Indian.Marble                  NA    
## Floors        <0.0000000000000002 ***
## City          <0.0000000000000002 ***
## Solar         <0.0000000000000002 ***
## Electric      <0.0000000000000002 ***
## Fiber         <0.0000000000000002 ***
## Glass.Doors   <0.0000000000000002 ***
## Swiming.Pool                0.314    
## Garden                      0.321    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0000001334 on 370715 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.182e+26 on 14 and 370715 DF,  p-value: < 0.00000000000000022

Note :

  • The target variable used is Price, model_all uses all predictor variables

  • model_all generates a value of R-squared : 1, which indicates overfitting, so a model will be made using predictors with correlation values from the largest to the smallest to the target (Floors, Fiber, White.Marble, City, Glass.Doors, etc.)

Model using Floors, Fiber, & White.Marble columns

model_cor_3 <- lm(formula = Prices ~ White.Marble+Floors+Fiber, data = house_train)
summary(model_cor_3)
## 
## Call:
## lm(formula = Prices ~ White.Marble + Floors + Fiber, data = house_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17554.8  -3590.7      9.3   3586.9  17505.2 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  24855.73      15.84  1569.2 <0.0000000000000002 ***
## White.Marble 11522.37      17.95   641.7 <0.0000000000000002 ***
## Floors       14989.06      16.93   885.2 <0.0000000000000002 ***
## Fiber        11734.98      16.93   693.0 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5155 on 370726 degrees of freedom
## Multiple R-squared:  0.8187, Adjusted R-squared:  0.8187 
## F-statistic: 5.581e+05 on 3 and 370726 DF,  p-value: < 0.00000000000000022

Note :

  • model_cor_3 using Floors, Fiber & White.Marble predictors gives an R-squared value: 0.8187

Model using Floors, Fiber, White.Marble, Indian.Marble & City columns

model_cor_5 <- lm(formula = Prices ~ Floors+Fiber+White.Marble+Indian.Marble+City, data = house_train)
summary(model_cor_5)
## 
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble + 
##     City, data = house_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11588.9  -2668.3      4.7   2663.0  11560.2 
## 
## Coefficients:
##                Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   20352.195     20.601   987.9 <0.0000000000000002 ***
## Floors        14998.165     12.411  1208.4 <0.0000000000000002 ***
## Fiber         11745.667     12.411   946.4 <0.0000000000000002 ***
## White.Marble   9022.287     15.203   593.4 <0.0000000000000002 ***
## Indian.Marble -4994.457     15.205  -328.5 <0.0000000000000002 ***
## City           3496.458      7.602   459.9 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3779 on 370724 degrees of freedom
## Multiple R-squared:  0.9026, Adjusted R-squared:  0.9026 
## F-statistic: 6.872e+05 on 5 and 370724 DF,  p-value: < 0.00000000000000022

Note :

  • model_cor_5 using Floors, Fiber, White.Marble, Indian.Marble & City predictors gives an R-squared value: 0.9026

Model using Floors, Fiber, White.Marble, Indian.Marble, City columns

model_cor_7 <- lm(formula = Prices ~ Floors+Fiber+White.Marble+Indian.Marble+City+Glass.Doors+Area,
                  data = house_train)
summary(model_cor_7)
## 
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble + 
##     City + Glass.Doors + Area, data = house_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6278  -1756      2   1757   6271 
## 
## Coefficients:
##                 Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   15002.0568    15.8122   948.8 <0.0000000000000002 ***
## Floors        15003.1385     8.1459  1841.8 <0.0000000000000002 ***
## Fiber         11752.5273     8.1459  1442.8 <0.0000000000000002 ***
## White.Marble   9020.1173     9.9781   904.0 <0.0000000000000002 ***
## Indian.Marble -5003.3251     9.9794  -501.4 <0.0000000000000002 ***
## City           3500.6258     4.9893   701.6 <0.0000000000000002 ***
## Glass.Doors    4441.7874     8.1459   545.3 <0.0000000000000002 ***
## Area             24.9456     0.0567   440.0 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2480 on 370722 degrees of freedom
## Multiple R-squared:  0.958,  Adjusted R-squared:  0.958 
## F-statistic: 1.209e+06 on 7 and 370722 DF,  p-value: < 0.00000000000000022

Note :

  • model_cor_7 using Floors, Fiber, White.Marble, Indian.Marble, City & Glass.Doors predictors gives an R-squared value: 0.958
pred <- predict(object = model_cor_7, newdata = house_test)

Predict with Confidence Interval

predict(object = model_cor_7, newdata = house_test, interval = "prediction", level = 0.95)[1:5,]
##         fit      lwr      upr
## 1  45789.33 40928.74 50649.93
## 3  47515.62 42655.04 52376.21
## 4  46567.73 41707.14 51428.33
## 11 46218.49 41357.90 51079.09
## 12 31159.82 26299.24 36020.41

4 Model Evaluation

4.1 Residual/Error

Model Residual/Error

data.frame(test = house_train$Prices, predict= model_cor_7$fitted.values, 
           error = house_train$Prices - model_cor_7$fitted.values)

Predict/Test Residual/Error

data.frame(test = house_test$Prices, predict= pred, 
           error = house_test$Prices - pred)

4.2 MSE

#Model
MSE(y_pred = model_cor_7$fitted.values, y_true = house_train$Prices)
## [1] 6149798
#Predict/Test
MSE(y_pred = pred, y_true = house_test$Prices)
## [1] 6162237

4.3 RMSE

#Model
RMSE(y_pred = model_cor_7$fitted.values, y_true = house_train$Prices)
## [1] 2479.879
#Predict/Test
RMSE(y_pred = pred, y_true = house_test$Prices)
## [1] 2482.385

4.4 MAE

#Model
MAE(y_pred = model_cor_7$fitted.values, y_true = house_train$Prices)
## [1] 2023.05
#Test/Predict
MAE(y_pred = pred, y_true = house_test$Prices)
## [1] 2027.706

4.5 MAPE

#Model
MAPE(y_pred = model_cor_7$fitted.values, y_true = house_train$Prices)
## [1] 0.05346924
#Test/Predict
MAPE(y_pred = pred, y_true = house_test$Prices)
## [1] 0.05363865

5 Linear Regression Assumption

5.1 Linearity

Linearity ‘Prices’ to ‘Area’

cor.test(house_train$Area, house_train$Prices)
## 
##  Pearson's product-moment correlation
## 
## data:  house_train$Area and house_train$Prices
## t = 90.805, df = 370728, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1443547 0.1506526
## sample estimates:
##       cor 
## 0.1475052

Untuk prediktor ‘Area’ dengan nilai korelasi yang terkecil diantara prediktor lainnya yaitu 14.75%, memiliki p-value < alpha (5%) sehingga ‘Area’ linear terhadap ‘Prices’, ‘Prices’ juga liniear dengan prediktor yang bernilai korelasi lebih besar dari ‘Area’ dan dapat digunakan dalam pemodelan.

5.2 Normality of Residual

hist(model_cor_7$residuals)

shapiro.test(model_cor_7$residuals[0:5000])
## 
##  Shapiro-Wilk normality test
## 
## data:  model_cor_7$residuals[0:5000]
## W = 0.99529, p-value = 0.0000000000119
library(nortest)
ad.test(model_cor_7$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  model_cor_7$residuals
## A = 326, p-value < 0.00000000000000022
check_normality(model_cor_7)
## Warning: Non-normality of residuals detected (p < .001).
  • Shapiro test terbatas di 5000 data

  • Asumsi normality dengan berbagai metode berbeda tidak terpenuhi karena p-value < alpha (5%)

5.3 Homoscedasticity of Residual

bptest(model_cor_7)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_cor_7
## BP = 6.4547, df = 7, p-value = 0.4878
check_heteroscedasticity(model_cor_7)
## OK: Error variance appears to be homoscedastic (p = 0.661).

Asumsi homoscedasticity terpenuhi karena p-value > alpha (5%)

5.4 No Multicolinearity

vif(model_cor_7)
##        Floors         Fiber  White.Marble Indian.Marble          City 
##      1.000003      1.000015      1.334494      1.334492      1.000013 
##   Glass.Doors          Area 
##      1.000014      1.000016
check_collinearity(model_cor_7)

6 Summary

  • Model generated using all predictor and we get value of R-squared : 1, which indicates overfitting, so a model will be made using predictors based one correlation values from the largest to the smallest to the target (Floors, Fiber, White.Marble, City, Glass.Doors, etc.)

  • The selected model uses 7 predictors, namely Floors, Fiber, White.Marble, Indian.Marble, City & Glass.Doors with R-squared value : 0.958 and MAPE value : 5.3%. The selected model is used to predict the test data and the MAPE value is obtained: 5.3%

  • Test the assumption of linear regression, normality of residuals detected (p < 0.05). So the error/residual distribution is not normally distributed