Content
The Green Wing 1985 House Price database is generated by computer, to help the very bigger in the field of machine learning
Notes on Specific Variables
Area : Land Size in Square Meter(s)
FirePlace : Number of Fire Place, ‘0’ means there is no Fire Place
Baths : Number of Bathrooms, ‘0’ means there is no Bathroom
White.Marble : Home interiors with White Marble, ‘0’ is No, ‘1’ is Yes
Black.Marble : Home interiors with Black Marble, ‘0’ is No, ‘1’ is Yes
Indian.Marble: Home interiors with Indian Marble, ‘0’ is No, ‘1’ is Yes
Floors : The house consists of more than one floor, ‘0’ is No, ‘1’ is Yes
City : “Undefined Variable”
Solar : The house electricity is supplied with Solar Panel, ‘0’ is No, ‘1’ is Yes
Electric : The house electricity is connected with Electricity Network, ‘0’ is No, ‘1’ is Yes
Fiber : The house using Fiber Manterials, ‘0’ is No, ‘1’ is Yes
Glass.Doors : The house with Glass Doors, ‘0’ is No, ‘1’ is Yes
Swiming.Pool : The house with Swimming Pool, ‘0’ is No, ‘1’ is Yes
Garden : The house with Garden, ‘0’ is No, ‘1’ is Yes
Prices : House price in USD
Input Library
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(manipulate)
library(performance)
library(MLmetrics)##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
library(lmtest)## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
options(scipen = 123)Use read.csv() to input bulk data
house <- read.csv("HousePrices_HalfMil.csv")Check head & tail data of dataframe
head(house)Check NA value in dataframe
colSums(is.na(house))## Area Garage FirePlace Baths White.Marble
## 0 0 0 0 0
## Black.Marble Indian.Marble Floors City Solar
## 0 0 0 0 0
## Electric Fiber Glass.Doors Swiming.Pool Garden
## 0 0 0 0 0
## Prices
## 0
Check dataframe dimension
dim(house)## [1] 500000 16
Check duplicated data
sum(duplicated(house) == T)## [1] 5693
sum(duplicated(house) == F)## [1] 494307
Check dataframe structure
str(house)## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
## $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ City : int 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
Remove duplicated data
house <- distinct(house)Check the unique values for each column to get the column to convert to the ‘factor’ data type
count_unique <- rapply(house, # Counts of unique values
function(x) length(unique(x)))
count_unique ## Area Garage FirePlace Baths White.Marble
## 249 3 5 5 2
## Black.Marble Indian.Marble Floors City Solar
## 2 2 2 3 2
## Electric Fiber Glass.Doors Swiming.Pool Garden
## 2 2 2 2 2
## Prices
## 2714
Dataframe summary
summary(house)## Area Garage FirePlace Baths
## Min. : 1.0 Min. :1.000 Min. :0.000 Min. :1.000
## 1st Qu.: 63.0 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
## Median :125.0 Median :2.000 Median :2.000 Median :3.000
## Mean :124.9 Mean :2.001 Mean :2.003 Mean :2.998
## 3rd Qu.:187.0 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :249.0 Max. :3.000 Max. :4.000 Max. :5.000
## White.Marble Black.Marble Indian.Marble Floors
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.333 Mean :0.3327 Mean :0.3343 Mean :0.4993
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## City Solar Electric Fiber
## Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :2.001 Mean :0.4987 Mean :0.5005 Mean :0.5003
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Glass.Doors Swiming.Pool Garden Prices
## Min. :0.0 Min. :0.0000 Min. :0.0000 Min. : 7725
## 1st Qu.:0.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:33500
## Median :1.0 Median :1.0000 Median :1.0000 Median :41850
## Mean :0.5 Mean :0.5004 Mean :0.5016 Mean :42047
## 3rd Qu.:1.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:50750
## Max. :1.0 Max. :1.0000 Max. :1.0000 Max. :77975
Plot the columns correlation using (ggcorr())
ggcorr(house, label = T, label_round = 2, label_size = 2, hjust = 1, size = 4, col = "grey", layout.exp = 1)Data train & data test using set.seed & row index
RNGkind(sample.kind = "Rounding") #untuk menyamaratakan splitting antar komputer yang berbeda versi## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(50)
# train-test splitting
index_new <- sample(nrow(house), nrow(house)*0.75)
house_train <- house[index_new,]
house_test <- house[-index_new,]Check data train & data test dimension
dim(house_train)## [1] 370730 16
dim(house_test)## [1] 123577 16
Model using correlated all columns
model_all <- lm(formula = Prices ~., data = house_train)
summary(model_all)##
## Call:
## lm(formula = Prices ~ ., data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.000081214 0.000000000 0.000000000 0.000000001 0.000000204
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 999.99999986773810 0.00000000123062 812601680256.496
## Area 24.99999999999766 0.00000000000305 8197305526038.221
## Garage 1499.99999999959005 0.00000000026810 5594823763621.271
## FirePlace 750.00000000029877 0.00000000015498 4839348467431.281
## Baths 1250.00000000023033 0.00000000015497 8066053834128.667
## White.Marble 13999.99999999995271 0.00000000053640 26100111329769.492
## Black.Marble 4999.99999999938973 0.00000000053678 9314852788528.242
## Indian.Marble NA NA NA
## Floors 14999.99999999946704 0.00000000043815 34234529287136.609
## City 3499.99999999945248 0.00000000026837 13041770625334.861
## Solar 250.00000000043636 0.00000000043816 570567638944.143
## Electric 1250.00000000048408 0.00000000043816 2852866440816.003
## Fiber 11749.99999999971442 0.00000000043816 26816730596011.367
## Glass.Doors 4449.99999999948341 0.00000000043816 10156166006835.543
## Swiming.Pool 0.00000000044157 0.00000000043816 1.008
## Garden -0.00000000043482 0.00000000043816 -0.992
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## Area <0.0000000000000002 ***
## Garage <0.0000000000000002 ***
## FirePlace <0.0000000000000002 ***
## Baths <0.0000000000000002 ***
## White.Marble <0.0000000000000002 ***
## Black.Marble <0.0000000000000002 ***
## Indian.Marble NA
## Floors <0.0000000000000002 ***
## City <0.0000000000000002 ***
## Solar <0.0000000000000002 ***
## Electric <0.0000000000000002 ***
## Fiber <0.0000000000000002 ***
## Glass.Doors <0.0000000000000002 ***
## Swiming.Pool 0.314
## Garden 0.321
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0000001334 on 370715 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.182e+26 on 14 and 370715 DF, p-value: < 0.00000000000000022
Note :
The target variable used is Price, model_all uses all predictor variables
model_all generates a value of R-squared : 1, which indicates overfitting, so a model will be made using predictors with correlation values from the largest to the smallest to the target (Floors, Fiber, White.Marble, City, Glass.Doors, etc.)
Model using Floors, Fiber, & White.Marble columns
model_cor_3 <- lm(formula = Prices ~ White.Marble+Floors+Fiber, data = house_train)
summary(model_cor_3)##
## Call:
## lm(formula = Prices ~ White.Marble + Floors + Fiber, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17554.8 -3590.7 9.3 3586.9 17505.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24855.73 15.84 1569.2 <0.0000000000000002 ***
## White.Marble 11522.37 17.95 641.7 <0.0000000000000002 ***
## Floors 14989.06 16.93 885.2 <0.0000000000000002 ***
## Fiber 11734.98 16.93 693.0 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5155 on 370726 degrees of freedom
## Multiple R-squared: 0.8187, Adjusted R-squared: 0.8187
## F-statistic: 5.581e+05 on 3 and 370726 DF, p-value: < 0.00000000000000022
Note :
Model using Floors, Fiber, White.Marble, Indian.Marble & City columns
model_cor_5 <- lm(formula = Prices ~ Floors+Fiber+White.Marble+Indian.Marble+City, data = house_train)
summary(model_cor_5)##
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble +
## City, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11588.9 -2668.3 4.7 2663.0 11560.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20352.195 20.601 987.9 <0.0000000000000002 ***
## Floors 14998.165 12.411 1208.4 <0.0000000000000002 ***
## Fiber 11745.667 12.411 946.4 <0.0000000000000002 ***
## White.Marble 9022.287 15.203 593.4 <0.0000000000000002 ***
## Indian.Marble -4994.457 15.205 -328.5 <0.0000000000000002 ***
## City 3496.458 7.602 459.9 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3779 on 370724 degrees of freedom
## Multiple R-squared: 0.9026, Adjusted R-squared: 0.9026
## F-statistic: 6.872e+05 on 5 and 370724 DF, p-value: < 0.00000000000000022
Note :
Model using Floors, Fiber, White.Marble, Indian.Marble, City columns
model_cor_7 <- lm(formula = Prices ~ Floors+Fiber+White.Marble+Indian.Marble+City+Glass.Doors+Area,
data = house_train)
summary(model_cor_7)##
## Call:
## lm(formula = Prices ~ Floors + Fiber + White.Marble + Indian.Marble +
## City + Glass.Doors + Area, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6278 -1756 2 1757 6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15002.0568 15.8122 948.8 <0.0000000000000002 ***
## Floors 15003.1385 8.1459 1841.8 <0.0000000000000002 ***
## Fiber 11752.5273 8.1459 1442.8 <0.0000000000000002 ***
## White.Marble 9020.1173 9.9781 904.0 <0.0000000000000002 ***
## Indian.Marble -5003.3251 9.9794 -501.4 <0.0000000000000002 ***
## City 3500.6258 4.9893 701.6 <0.0000000000000002 ***
## Glass.Doors 4441.7874 8.1459 545.3 <0.0000000000000002 ***
## Area 24.9456 0.0567 440.0 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2480 on 370722 degrees of freedom
## Multiple R-squared: 0.958, Adjusted R-squared: 0.958
## F-statistic: 1.209e+06 on 7 and 370722 DF, p-value: < 0.00000000000000022
Note :
pred <- predict(object = model_cor_7, newdata = house_test)Predict with Confidence Interval
predict(object = model_cor_7, newdata = house_test, interval = "prediction", level = 0.95)[1:5,]## fit lwr upr
## 1 45789.33 40928.74 50649.93
## 3 47515.62 42655.04 52376.21
## 4 46567.73 41707.14 51428.33
## 11 46218.49 41357.90 51079.09
## 12 31159.82 26299.24 36020.41
Model Residual/Error
data.frame(test = house_train$Prices, predict= model_cor_7$fitted.values,
error = house_train$Prices - model_cor_7$fitted.values)Predict/Test Residual/Error
data.frame(test = house_test$Prices, predict= pred,
error = house_test$Prices - pred)#Model
MSE(y_pred = model_cor_7$fitted.values, y_true = house_train$Prices)## [1] 6149798
#Predict/Test
MSE(y_pred = pred, y_true = house_test$Prices)## [1] 6162237
#Model
RMSE(y_pred = model_cor_7$fitted.values, y_true = house_train$Prices)## [1] 2479.879
#Predict/Test
RMSE(y_pred = pred, y_true = house_test$Prices)## [1] 2482.385
#Model
MAE(y_pred = model_cor_7$fitted.values, y_true = house_train$Prices)## [1] 2023.05
#Test/Predict
MAE(y_pred = pred, y_true = house_test$Prices)## [1] 2027.706
#Model
MAPE(y_pred = model_cor_7$fitted.values, y_true = house_train$Prices)## [1] 0.05346924
#Test/Predict
MAPE(y_pred = pred, y_true = house_test$Prices)## [1] 0.05363865
Linearity ‘Prices’ to ‘Area’
cor.test(house_train$Area, house_train$Prices)##
## Pearson's product-moment correlation
##
## data: house_train$Area and house_train$Prices
## t = 90.805, df = 370728, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1443547 0.1506526
## sample estimates:
## cor
## 0.1475052
Untuk prediktor ‘Area’ dengan nilai korelasi yang terkecil diantara prediktor lainnya yaitu 14.75%, memiliki p-value < alpha (5%) sehingga ‘Area’ linear terhadap ‘Prices’, ‘Prices’ juga liniear dengan prediktor yang bernilai korelasi lebih besar dari ‘Area’ dan dapat digunakan dalam pemodelan.
hist(model_cor_7$residuals)shapiro.test(model_cor_7$residuals[0:5000])##
## Shapiro-Wilk normality test
##
## data: model_cor_7$residuals[0:5000]
## W = 0.99529, p-value = 0.0000000000119
library(nortest)
ad.test(model_cor_7$residuals)##
## Anderson-Darling normality test
##
## data: model_cor_7$residuals
## A = 326, p-value < 0.00000000000000022
check_normality(model_cor_7)## Warning: Non-normality of residuals detected (p < .001).
Shapiro test terbatas di 5000 data
Asumsi normality dengan berbagai metode berbeda tidak terpenuhi karena p-value < alpha (5%)
bptest(model_cor_7)##
## studentized Breusch-Pagan test
##
## data: model_cor_7
## BP = 6.4547, df = 7, p-value = 0.4878
check_heteroscedasticity(model_cor_7)## OK: Error variance appears to be homoscedastic (p = 0.661).
Asumsi homoscedasticity terpenuhi karena p-value > alpha (5%)
vif(model_cor_7)## Floors Fiber White.Marble Indian.Marble City
## 1.000003 1.000015 1.334494 1.334492 1.000013
## Glass.Doors Area
## 1.000014 1.000016
check_collinearity(model_cor_7)Model generated using all predictor and we get value of R-squared : 1, which indicates overfitting, so a model will be made using predictors based one correlation values from the largest to the smallest to the target (Floors, Fiber, White.Marble, City, Glass.Doors, etc.)
The selected model uses 7 predictors, namely Floors, Fiber, White.Marble, Indian.Marble, City & Glass.Doors with R-squared value : 0.958 and MAPE value : 5.3%. The selected model is used to predict the test data and the MAPE value is obtained: 5.3%
Test the assumption of linear regression, normality of residuals detected (p < 0.05). So the error/residual distribution is not normally distributed