LBB Regression Model
Packages loaded: ggplot2, GGally, lubridate, car (with carData), dplyr, MLmetrics, zoo.
0. Read Data
id date price bedrooms bathrooms sqft_living
1 7129300520 20141013T000000 221900 3 1.00 1180
2 6414100192 20141209T000000 538000 3 2.25 2570
3 5631500400 20150225T000000 180000 2 1.00 770
sqft_lot floors waterfront view condition grade sqft_above sqft_basement
1 5650 1 0 0 3 7 1180 0
2 7242 2 0 0 3 7 2170 400
3 10000 1 0 0 3 6 770 0
yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
1 1955 0 98178 47.5112 -122.257 1340 5650
2 1951 1991 98125 47.7210 -122.319 1690 7639
3 1933 0 98028 47.7379 -122.233 2720 8062
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
1. Data Exploration
'data.frame': 21613 obs. of 21 variables:
$ id : num 7129300520 6414100192 5631500400 2487200875 1954400510 ...
$ date : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
$ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
2. Variable Selection Based on Business Relevance and Correlation
price: sale price
m2_living: interior living area in m2
m2_lot: lot area in m2
bedrooms: number of bedrooms
bathrooms: number of bathrooms
floors: number of floors
condition: condition of the house, from 1 (poor) to 5 (like new)
view: rating of the property's view (0 to 4)
waterfront: 1 if the property is on the waterfront, 0 otherwise
grade: house rating
Data correlation
Outlier Check
Bedrooms
Bathrooms
m2_living
m2_lot
floors
condition
waterfront
view
3. Modelling
Full Model
Call:
lm(formula = price ~ ., data = data_selected)
Residuals:
Min 1Q Median 3Q Max
-1231443 -125596 -17532 94899 4558095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -665307.8183 17191.3671 -38.700 <0.0000000000000002 ***
m2_living 594.5217 11.0632 53.739 <0.0000000000000002 ***
m2_lot -1.1718 0.1267 -9.247 <0.0000000000000002 ***
bedrooms -38380.0003 2134.0597 -17.985 <0.0000000000000002 ***
bathrooms 29075.7846 3221.7008 9.025 <0.0000000000000002 ***
floors -47120.4673 3530.2883 -13.347 <0.0000000000000002 ***
condition 51524.9237 2530.6973 20.360 <0.0000000000000002 ***
waterfront 581878.1782 19776.3159 29.423 <0.0000000000000002 ***
view 61576.6924 2341.1371 26.302 <0.0000000000000002 ***
grade 102659.4243 2223.5880 46.168 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 230000 on 21603 degrees of freedom
Multiple R-squared: 0.6078, Adjusted R-squared: 0.6077
F-statistic: 3720 on 9 and 21603 DF, p-value: < 0.00000000000000022
Coefficient interpretation (each statement holds the other predictors constant):
Each 1-unit increase in m2_living raises the predicted price by 594.5217
Each 1-unit increase in m2_lot lowers the predicted price by 1.1718
Each additional bedroom lowers the predicted price by 38380.0003
Each additional bathroom raises the predicted price by 29075.7846
Each additional floor lowers the predicted price by 47120.4673
Each 1-point increase in condition raises the predicted price by 51524.9237
A waterfront property (waterfront = 1) has a predicted price 581878.1782 higher than a non-waterfront one
Each 1-point increase in view raises the predicted price by 61576.6924
Each 1-point increase in grade raises the predicted price by 102659.4243
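To make the "other predictors held constant" reading concrete, here is a minimal Python sketch (the report itself was produced in R). It plugs the coefficients from the summary above into the linear equation and compares two predictions that differ by one bedroom. The house values are hypothetical and invented for illustration; they are not a row of the data set.

```python
# Coefficients copied from the lm() summary above.
COEF = {
    "(Intercept)": -665307.8183,
    "m2_living": 594.5217,
    "m2_lot": -1.1718,
    "bedrooms": -38380.0003,
    "bathrooms": 29075.7846,
    "floors": -47120.4673,
    "condition": 51524.9237,
    "waterfront": 581878.1782,
    "view": 61576.6924,
    "grade": 102659.4243,
}


def predict_price(house):
    """Intercept plus the sum of coefficient * value over all predictors."""
    return COEF["(Intercept)"] + sum(COEF[name] * value for name, value in house.items())


# Hypothetical house, for illustration only.
house = {
    "m2_living": 600, "m2_lot": 2000, "bedrooms": 3, "bathrooms": 2,
    "floors": 1, "condition": 3, "waterfront": 0, "view": 0, "grade": 7,
}

base = predict_price(house)
one_more_bedroom = predict_price({**house, "bedrooms": house["bedrooms"] + 1})

# The difference between the two predictions is the bedrooms coefficient.
print(round(one_more_bedroom - base, 4))
```

Changing any other single predictor by one unit shifts the prediction by that predictor's coefficient in the same way.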
Adj R-Square:
The adjusted R-squared is 0.6077, meaning the selected predictors explain about 61% of the variance in price; the remaining 39% is not explained by this model.
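The adjusted R-squared can be cross-checked from the printed Multiple R-squared, the number of observations and the number of predictors. This is a minimal Python sketch of the standard formula, not the code used in the report. Because the input R-squared is already rounded to four decimals, the result can differ from the printed value in the last digit.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


r2 = 0.6078  # Multiple R-squared as printed in the summary (rounded)
n = 21613    # observations
p = 9        # predictors; n - p - 1 = 21603 matches the residual degrees of freedom

adj = adjusted_r2(r2, n, p)
print(adj)
```

With 21613 observations and only 9 predictors the penalty is tiny, which is why the two R-squared values in the summary are almost identical.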
4. Feature Selection
Stepwise (Backward)
Start: AIC=533662.3
price ~ m2_living + m2_lot + bedrooms + bathrooms + floors +
condition + waterfront + view + grade
Df Sum of Sq RSS AIC
<none> 1142336728298222 533662
- bathrooms 1 4306977080413 1146643705378635 533742
- m2_lot 1 4521106021818 1146857834320040 533746
- floors 1 9420607646769 1151757335944991 533838
- bedrooms 1 17103181270161 1159439909568383 533981
- condition 1 21919689506038 1164256417804259 534071
- view 1 36581294998829 1178918023297051 534342
- waterfront 1 45777643537833 1188114371836054 534510
- grade 1 112711720747852 1255048449046074 535694
- m2_living 1 152705668498510 1295042396796732 536372
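The AIC column in the table above can be reproduced from the RSS column. The sketch below is in Python, not the R used for the report, and assumes the criterion R's step() reports for linear models: n * log(RSS / n) + 2 * k, where k counts the estimated coefficients including the intercept. Every candidate removal has a higher AIC than the `<none>` row, so backward elimination keeps all nine predictors.

```python
import math

N = 21613  # number of observations


def step_aic(rss, n_coefficients, n=N):
    """AIC up to an additive constant: n * log(RSS / n) + 2 * k."""
    return n * math.log(rss / n) + 2 * n_coefficients


# <none> row: all 9 predictors plus the intercept.
full = step_aic(1142336728298222, 10)

# "- bathrooms" row: one coefficient fewer, RSS taken from the table.
drop_bathrooms = step_aic(1146643705378635, 9)

print(round(full, 1), round(drop_bathrooms))
```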
Call:
lm(formula = price ~ m2_living + m2_lot + bedrooms + bathrooms +
floors + condition + waterfront + view + grade, data = data_selected)
Residuals:
Min 1Q Median 3Q Max
-1231443 -125596 -17532 94899 4558095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -665307.8183 17191.3671 -38.700 <0.0000000000000002 ***
m2_living 594.5217 11.0632 53.739 <0.0000000000000002 ***
m2_lot -1.1718 0.1267 -9.247 <0.0000000000000002 ***
bedrooms -38380.0003 2134.0597 -17.985 <0.0000000000000002 ***
bathrooms 29075.7846 3221.7008 9.025 <0.0000000000000002 ***
floors -47120.4673 3530.2883 -13.347 <0.0000000000000002 ***
condition 51524.9237 2530.6973 20.360 <0.0000000000000002 ***
waterfront 581878.1782 19776.3159 29.423 <0.0000000000000002 ***
view 61576.6924 2341.1371 26.302 <0.0000000000000002 ***
grade 102659.4243 2223.5880 46.168 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 230000 on 21603 degrees of freedom
Multiple R-squared: 0.6078, Adjusted R-squared: 0.6077
F-statistic: 3720 on 9 and 21603 DF, p-value: < 0.00000000000000022
Backward elimination removed no predictor: in the step output the `<none>` row has the lowest AIC (533662), so model_backward is identical to the full model. The coefficient interpretation and the adjusted R-squared (0.6077) are therefore the same as in section 3.
Regsubset
5. Error
RMSE (Root Mean Square Error)
[1] 229900.3
[1] 75000 7700000
MAE (Mean Absolute Error)
[1] 152322.6
[1] 75000 7700000
MAPE (Mean Absolute Percentage Error)
[1] 0.3192653
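The three error measures above can be written out directly. This is a minimal Python sketch with hypothetical toy values; the report itself uses R with the MLmetrics package loaded. Note that MAPE is a fraction, so the reported 0.3192653 corresponds to an average error of about 32% of the actual price. The last line cross-checks the reported RMSE against the residual sum of squares printed in the stepwise table.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy values, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))

# Cross-check with the report: RMSE = sqrt(RSS / n), RSS from the <none> row.
rmse_from_rss = math.sqrt(1142336728298222 / 21613)
print(round(rmse_from_rss, 1))
```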
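6. Assumptions
1. Residuals are normally distributed (Normality)
H0: the residuals are normally distributed. H1: the residuals are not normally distributed.
If the p-value < alpha (0.05), reject H0.
The residuals are considered normal when the p-value > 0.05 (assumption met).
Result: the Lilliefors test below gives a p-value < 2.2e-16, so H0 is rejected and the residuals are not normally distributed.
Histogram
Shapiro Test
Not applicable: R's shapiro.test accepts at most 5000 observations, and this data set has 21613.
Lilliefors Test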
Lilliefors (Kolmogorov-Smirnov) normality test
data: model_backward$residuals
D = 0.093559, p-value < 0.00000000000000022
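The D statistic reported above is the largest gap between the empirical distribution of the residuals and a normal distribution whose mean and standard deviation are estimated from the same residuals. The following Python sketch computes only that statistic, on hypothetical toy samples. It does not reproduce the Lilliefors p-value, which needs a dedicated approximation.

```python
from statistics import NormalDist, mean, stdev


def lilliefors_d(values):
    """Kolmogorov-Smirnov distance to a normal with estimated mean and sd."""
    xs = sorted(values)
    n = len(xs)
    dist = NormalDist(mean(xs), stdev(xs))  # parameters estimated from the sample
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


# Hypothetical toy samples, for illustration only.
symmetric = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
skewed = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 60.0]

print(lilliefors_d(symmetric), lilliefors_d(skewed))
```

The skewed sample, dominated by one very large value, gives a much larger D than the symmetric one. The long right tail of the price residuals (maximum residual 4558095) pushes D up in the same way.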
QQ Plot
2. Residuals show no pattern (Homoscedasticity)
H0: the model is homoscedastic. H1: the model is heteroscedastic.
The model is considered homoscedastic when the p-value > alpha; the closer the p-value is to 1, the better.
Result: p-value < 2.2e-16, so H0 is rejected and the residuals are heteroscedastic.
studentized Breusch-Pagan test
data: model_backward
BP = 2560.5, df = 9, p-value < 0.00000000000000022
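The studentized Breusch-Pagan statistic is n times the R-squared of an auxiliary regression of the squared residuals on the predictors, compared against a chi-square distribution with as many degrees of freedom as there are predictors (9 in the report). The Python sketch below is a simplified one-predictor version on hypothetical toy data, so it uses 1 degree of freedom. It is an illustration of the idea, not a reimplementation of the test run above.

```python
import math


def simple_r2(x, y):
    """R-squared of the one-predictor least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def breusch_pagan_1d(x, residuals):
    """Studentized BP statistic and p-value for a single predictor."""
    squared = [e ** 2 for e in residuals]
    bp = len(x) * simple_r2(x, squared)
    # Chi-square survival function for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(bp / 2))
    return bp, p_value


# Hypothetical toy residuals whose spread grows with x (heteroscedastic).
x = [float(i) for i in range(1, 21)]
residuals = [0.5 * xi * (-1) ** i for i, xi in enumerate(x)]

bp, p_value = breusch_pagan_1d(x, residuals)
print(bp, p_value)
```

Because the squared residuals grow with x, the auxiliary R-squared is high and the p-value is small, which is the same pattern the report's test detects.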
3. Predictors are not strongly correlated with each other (No multicollinearity)
A VIF below 10 is considered acceptable.
Result: every selected predictor has a VIF below 10.
m2_living m2_lot bedrooms bathrooms floors condition
3.920194 1.046210 1.610096 2.290920 1.551479 1.108446
waterfront view grade
1.196494 1.315486 2.792147
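Each VIF equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. Inverting the formula shows how much of each predictor is explained by the rest. This is a minimal Python sketch using the VIF values printed above; the report computed them in R.

```python
# VIF values copied from the output above.
VIF = {
    "m2_living": 3.920194, "m2_lot": 1.046210, "bedrooms": 1.610096,
    "bathrooms": 2.290920, "floors": 1.551479, "condition": 1.108446,
    "waterfront": 1.196494, "view": 1.315486, "grade": 2.792147,
}


def implied_r2(vif):
    """R-squared of a predictor regressed on the others, from VIF = 1 / (1 - R^2)."""
    return 1 - 1 / vif


for name, vif in VIF.items():
    print(f"{name:<11} VIF = {vif:.3f}  implied R^2 = {implied_r2(vif):.3f}")

all_below_threshold = all(vif < 10 for vif in VIF.values())
print(all_below_threshold)
```

A threshold of 10 corresponds to an implied R² of 0.9. The highest value here, m2_living, is well below that.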
4. Linearity
H0: correlation = 0. H1: correlation != 0.
Each predictor is paired with price and tested for a linear correlation.
Result: every pair has a p-value < alpha (0.05), so each predictor has a significant linear correlation with price.
   x          y      p_value
1  price      price  < 2.2e-16
2  m2_living  price  < 2.2e-16
3  m2_lot     price  7.97e-40
4  bedrooms   price  < 2.2e-16
5  bathrooms  price  < 2.2e-16
6  floors     price  < 2.2e-16
7  condition  price  8.94e-08
8  waterfront price  < 2.2e-16
9  view       price  < 2.2e-16
10 grade      price  < 2.2e-16
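The p-values in this table come from testing whether a Pearson correlation differs from zero. The Python sketch below shows the calculation on hypothetical toy data. It converts r into the statistic t = r * sqrt((n - 2) / (1 - r²)) and, as a simplifying assumption, takes the p-value from the standard normal distribution instead of the t distribution. With 21613 observations the two are practically identical, but for small samples the approximation is rough.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cor_test_normal_approx(x, y):
    """Return r, the test statistic, and a two-sided p-value (normal approximation)."""
    n = len(x)
    r = pearson_r(x, y)
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return r, t, p_value


# Hypothetical toy data: a linear trend plus a deterministic wobble.
x = [float(i) for i in range(50)]
y = [0.5 * xi + 10 * math.sin(i) for i, xi in enumerate(x)]

r, t, p_value = cor_test_normal_approx(x, y)
print(r, t, p_value)
```

A significant correlation only shows that a linear association exists. It does not show that the relationship is well described by a straight line, so residual plots remain a useful complement.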
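7. Conclusion
Only 2 of the 4 assumptions are met: the predictors show no problematic multicollinearity (all VIF < 10) and every predictor is linearly correlated with price. Normality of the residuals and homoscedasticity are both rejected (p-value < 2.2e-16 in the Lilliefors and Breusch-Pagan tests), so the model cannot be called a fully valid fit. model_backward has an adjusted R-squared of 0.6077, meaning the selected predictors explain about 61% of the variance in price; the remaining 39% is not explained by the model.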