Sebagai seorang penjual properti, kita ingin membuat model yang mana dapat memprediksi harga properti berdasarkan beberapa informasi yang ada pada data.
Tentukan variabel:
priceprice1. Read data house_data.csv
house <- read.csv("data_input/house_data.csv")
head(house)2. Cek struktur data
glimpse(house)#> Rows: 21,613
#> Columns: 9
#> $ price <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
length(unique(house$waterfront))#> [1] 2
unique(house$grade)#> [1] 7 6 8 11 9 5 10 12 4 3 13 1
💡 Hasil pemeriksaan struktur data: Ditemukan bahwa kita harus mengubah tipe data kategorik terbelih dahulu yang meliputi : - waterfront
3. Cleansing Data
# Ubah Tipe Data
house$waterfront <- as.factor(house$waterfront)
glimpse(house)#> Rows: 21,613
#> Columns: 9
#> $ price <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
# Cek missing value
colSums(is.na(house))#> price bedrooms bathrooms sqft_living sqft_lot floors
#> 0 0 0 0 0 0
#> waterfront grade yr_built
#> 0 0 0
4. EDA
#persebaran data
ggplot(house, aes(x = sqft_living, y = price)) +
geom_point() +
geom_hline(yintercept = mean(house$price), color = "red", linetype = "dashed") +
geom_vline(xintercept = mean(house$sqft_living), color = "blue", linetype = "dashed")#korelasi
cor(x = house$sqft_living,
y = house$price)#> [1] 0.7020351
# cek mean house$price
mean(house$price)#> [1] 540088.1
💡 Insight: - nilai korelasi 0.7020351 artinya Semakin besar luas rumah, maka harganya akan semakin tinggi
rata2 harga jual rumah ada di harga 540.088.1 dengan luas tanah perkiraan sebesar 2400 sqft_living
sqft_living memiliki korelasi yang kuat sebagai prediktor yaitu sebesar 0.7020351
Buatlah 3 model berdasarkan feature selection yg telah dipelajari 1. model all predictor 2. model selection based on correlation (korelasi > 0.5) 3. model selection hasil stepwise (backward/forward/both)
# 1. Model_all predictor
model_all <- lm(formula = price ~ .,
data = house)
summary(model_all)#>
#> Call:
#> lm(formula = price ~ ., data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1384206 -112972 -10077 91060 4251811
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6999106.70657 121576.94670 57.569 < 0.0000000000000002 ***
#> bedrooms -41484.20936 2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms 51710.08964 3437.50666 15.043 < 0.0000000000000002 ***
#> sqft_living 177.91392 3.29026 54.073 < 0.0000000000000002 ***
#> sqft_lot -0.23947 0.03679 -6.509 0.0000000000774 ***
#> floors 17283.13337 3426.85939 5.043 0.0000004609553 ***
#> waterfront1 721804.73094 17406.65326 41.467 < 0.0000000000000002 ***
#> grade 128813.92794 2149.93255 59.915 < 0.0000000000000002 ***
#> yr_built -3963.73577 64.04988 -61.885 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared: 0.6446, Adjusted R-squared: 0.6445
#> F-statistic: 4898 on 8 and 21604 DF, p-value: < 0.00000000000000022
# 2. Model_selection
model_sqft_living <- lm(formula = price ~ sqft_living,
data = house)
summary(model_sqft_living)#>
#> Call:
#> lm(formula = price ~ sqft_living, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1476062 -147486 -24043 106182 4362067
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -43580.743 4402.690 -9.899 <0.0000000000000002 ***
#> sqft_living 280.624 1.936 144.920 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 261500 on 21611 degrees of freedom
#> Multiple R-squared: 0.4929, Adjusted R-squared: 0.4928
#> F-statistic: 2.1e+04 on 1 and 21611 DF, p-value: < 0.00000000000000022
💡 dari ringkasan model_sqft_living terdapat perolehan informasi:
💡 p-value prediktor pada model_sqft_living:
0.0000000000000002, variabel sqft_living signifkan mempengaruhi
price.
💡 melihat sebaran data yang dihasilkan oleh sqft_living ~ price dan mmebuat line plot berdasarkan model_sqft_living yang telah dibuat
plot(house$sqft_living, house$price)
abline(model_sqft_living, col = "red")
## Mencoba mmebuat model tanpa tanpa outlier untuk cek high leverage
# membuat filter tanpa outlier dengan subset sqft_living dibawah 6000 berdasarkan plot yang telah dibuat
house_no_outlier <- house[house$sqft_living <= 6000,]
# Mengambil keputusan outlier dibawah 6000 karena data sqft_living cenderung berkumpul di range angka perkiraan 100-6000. Lebih dari 6000 data sudah mulai berpecah dan cenderung tidak berkumpul satu sama lain dalam sebarannyaBuatlah model linear regression dengan data
copiers_no_outlier:
# model dengan data tanpa outlier
model_no_outlier <- lm(formula = price ~ sqft_living,
data = house_no_outlier)
summary(model_no_outlier)#>
#> Call:
#> lm(formula = price ~ sqft_living, data = house_no_outlier)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -881194 -144906 -26189 101683 2380546
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -9067.419 4330.219 -2.094 0.0363 *
#> sqft_living 262.820 1.933 135.960 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 247100 on 21543 degrees of freedom
#> Multiple R-squared: 0.4618, Adjusted R-squared: 0.4618
#> F-statistic: 1.849e+04 on 1 and 21543 DF, p-value: < 0.00000000000000022
💡 Ketika model_selection dilakukan sebelumnya r-squared memiliki angka di 0.4928 dan setelah outlier dibuang dengan subset < 6000, r-squared mengalami penurunan menjadi 0.4618
# Memasukan line plot untuk cek high leverage
plot(house$sqft_living, house$price)
abline(model_sqft_living, col = "red")
abline(model_no_outlier, col = "green")
💡 Terdapat banyak sekali high leverage yang dihasilkan saat nilai
sqft_living sudah mencapai diangka lebih dari 6000
# membuat model_none terlebih dahulu untuk membuat model stepwise
model_none <- lm(formula = price ~ 1,
data = house)model_backward <- step(object = model_all,
direction = 'backward')#> Start: AIC=531533.8
#> price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
#> waterfront + grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> <none> 1035294741860252 531534
#> - floors 1 1218939726736 1036513681586988 531557
#> - sqft_lot 1 2030230655068 1037324972515320 531574
#> - bathrooms 1 10844095263314 1046138837123566 531757
#> - bedrooms 1 19802603600782 1055097345461034 531941
#> - waterfront 1 82402185733932 1117696927594184 533187
#> - sqft_living 1 140116227741713 1175410969601965 534275
#> - grade 1 172030644434158 1207325386294409 534854
#> - yr_built 1 183528097123653 1218822838983905 535059
model_forward <- step(object = model_none,
direction = 'forward',
scope = list(upper = model_all))#> Start: AIC=553875.8
#> price ~ 1
#>
#> Df Sum of Sq RSS AIC
#> + sqft_living 1 1435640399598809 1477276362322490 539204
#> + grade 1 1297612620095470 1615304141825828 541134
#> + bathrooms 1 803293306497671 2109623455423628 546904
#> + bedrooms 1 276958595500073 2635958166421226 551718
#> + waterfront 1 206679237434408 2706237524486890 552287
#> + floors 1 192086763313773 2720829998607526 552403
#> + sqft_lot 1 23417141523777 2889499620397522 553703
#> + yr_built 1 8497693415832 2904419068505468 553815
#> <none> 2912916761921299 553876
#>
#> Step: AIC=539203.5
#> price ~ sqft_living
#>
#> Df Sum of Sq RSS AIC
#> + grade 1 121320543948745 1355955818373745 537353
#> + waterfront 1 110238185400763 1367038176921727 537529
#> + yr_built 1 92854405407200 1384421956915290 537802
#> + bedrooms 1 40635382190095 1436640980132395 538603
#> + sqft_lot 1 3011349102420 1474265013220070 539161
#> + floors 1 229913654973 1477046448667517 539202
#> + bathrooms 1 147193010785 1477129169311705 539203
#> <none> 1477276362322490 539204
#>
#> Step: AIC=537353.4
#> price ~ sqft_living + grade
#>
#> Df Sum of Sq RSS AIC
#> + yr_built 1 199227645099154 1156728173274590 533921
#> + waterfront 1 108953582123633 1247002236250112 535545
#> + bedrooms 1 22141328690666 1333814489683078 537000
#> + floors 1 9622247208765 1346333571164980 537202
#> + bathrooms 1 7651853153980 1348303965219765 537233
#> + sqft_lot 1 2020142096807 1353935676276938 537323
#> <none> 1355955818373745 537353
#>
#> Step: AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> + waterfront 1 90115805935988 1066612367338602 532170
#> + bedrooms 1 20121301261862 1136606872012728 533544
#> + bathrooms 1 8626538372689 1148101634901902 533761
#> + floors 1 4395842729126 1152332330545465 533841
#> + sqft_lot 1 1713565021968 1155014608252622 533891
#> <none> 1156728173274590 533921
#>
#> Step: AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#>
#> Df Sum of Sq RSS AIC
#> + bedrooms 1 13902067534638 1052710299803964 531888
#> + bathrooms 1 8476715048277 1058135652290326 531999
#> + floors 1 4061693690658 1062550673647944 532089
#> + sqft_lot 1 1826102349598 1064786264989004 532135
#> <none> 1066612367338602 532170
#>
#> Step: AIC=531888.3
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms
#>
#> Df Sum of Sq RSS AIC
#> + bathrooms 1 13953491780043 1038756808023922 531602
#> + floors 1 4176630866150 1048533668937814 531804
#> + sqft_lot 1 2870998480425 1049839301323540 531831
#> <none> 1052710299803964 531888
#>
#> Step: AIC=531602
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms
#>
#> Df Sum of Sq RSS AIC
#> + sqft_lot 1 2243126436934 1036513681586988 531557
#> + floors 1 1431835508601 1037324972515321 531574
#> <none> 1038756808023922 531602
#>
#> Step: AIC=531557.2
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms + sqft_lot
#>
#> Df Sum of Sq RSS AIC
#> + floors 1 1218939726736 1035294741860252 531534
#> <none> 1036513681586988 531557
#>
#> Step: AIC=531533.8
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms + sqft_lot + floors
model_both <- step(object = model_none,
direction = 'both',
scope = list(upper = model_all))#> Start: AIC=553875.8
#> price ~ 1
#>
#> Df Sum of Sq RSS AIC
#> + sqft_living 1 1435640399598809 1477276362322490 539204
#> + grade 1 1297612620095470 1615304141825828 541134
#> + bathrooms 1 803293306497671 2109623455423628 546904
#> + bedrooms 1 276958595500073 2635958166421226 551718
#> + waterfront 1 206679237434408 2706237524486890 552287
#> + floors 1 192086763313773 2720829998607526 552403
#> + sqft_lot 1 23417141523777 2889499620397522 553703
#> + yr_built 1 8497693415832 2904419068505468 553815
#> <none> 2912916761921299 553876
#>
#> Step: AIC=539203.5
#> price ~ sqft_living
#>
#> Df Sum of Sq RSS AIC
#> + grade 1 121320543948745 1355955818373745 537353
#> + waterfront 1 110238185400763 1367038176921727 537529
#> + yr_built 1 92854405407200 1384421956915290 537802
#> + bedrooms 1 40635382190095 1436640980132395 538603
#> + sqft_lot 1 3011349102420 1474265013220070 539161
#> + floors 1 229913654973 1477046448667517 539202
#> + bathrooms 1 147193010785 1477129169311705 539203
#> <none> 1477276362322490 539204
#> - sqft_living 1 1435640399598809 2912916761921299 553876
#>
#> Step: AIC=537353.4
#> price ~ sqft_living + grade
#>
#> Df Sum of Sq RSS AIC
#> + yr_built 1 199227645099154 1156728173274590 533921
#> + waterfront 1 108953582123633 1247002236250112 535545
#> + bedrooms 1 22141328690666 1333814489683078 537000
#> + floors 1 9622247208765 1346333571164980 537202
#> + bathrooms 1 7651853153980 1348303965219765 537233
#> + sqft_lot 1 2020142096807 1353935676276938 537323
#> <none> 1355955818373745 537353
#> - grade 1 121320543948745 1477276362322490 539204
#> - sqft_living 1 259348323452084 1615304141825828 541134
#>
#> Step: AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> + waterfront 1 90115805935988 1066612367338602 532170
#> + bedrooms 1 20121301261862 1136606872012728 533544
#> + bathrooms 1 8626538372689 1148101634901902 533761
#> + floors 1 4395842729126 1152332330545465 533841
#> + sqft_lot 1 1713565021968 1155014608252622 533891
#> <none> 1156728173274590 533921
#> - yr_built 1 199227645099154 1355955818373745 537353
#> - grade 1 227693783640699 1384421956915290 537802
#> - sqft_living 1 241311622806594 1398039796081185 538014
#>
#> Step: AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#>
#> Df Sum of Sq RSS AIC
#> + bedrooms 1 13902067534638 1052710299803964 531888
#> + bathrooms 1 8476715048277 1058135652290326 531999
#> + floors 1 4061693690658 1062550673647944 532089
#> + sqft_lot 1 1826102349598 1064786264989004 532135
#> <none> 1066612367338602 532170
#> - waterfront 1 90115805935988 1156728173274590 533921
#> - yr_built 1 180389868911509 1247002236250112 535545
#> - grade 1 219517939490357 1286130306828959 536213
#> - sqft_living 1 222939932856996 1289552300195599 536270
#>
#> Step: AIC=531888.3
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms
#>
#> Df Sum of Sq RSS AIC
#> + bathrooms 1 13953491780043 1038756808023922 531602
#> + floors 1 4176630866150 1048533668937814 531804
#> + sqft_lot 1 2870998480425 1049839301323540 531831
#> <none> 1052710299803964 531888
#> - bedrooms 1 13902067534638 1066612367338602 532170
#> - waterfront 1 83896572208764 1136606872012728 533544
#> - yr_built 1 179366335438934 1232076635242899 535287
#> - grade 1 198282580205801 1250992880009766 535616
#> - sqft_living 1 217755228757072 1270465528561036 535950
#>
#> Step: AIC=531602
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms
#>
#> Df Sum of Sq RSS AIC
#> + sqft_lot 1 2243126436934 1036513681586988 531557
#> + floors 1 1431835508601 1037324972515321 531574
#> <none> 1038756808023922 531602
#> - bathrooms 1 13953491780043 1052710299803964 531888
#> - bedrooms 1 19378844266404 1058135652290326 531999
#> - waterfront 1 82547202240690 1121304010264612 533253
#> - sqft_living 1 136922615063114 1175679423087036 534276
#> - grade 1 184390684043560 1223147492067482 535132
#> - yr_built 1 190012863248603 1228769671272525 535231
#>
#> Step: AIC=531557.2
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms + sqft_lot
#>
#> Df Sum of Sq RSS AIC
#> + floors 1 1218939726736 1035294741860252 531534
#> <none> 1036513681586988 531557
#> - sqft_lot 1 2243126436934 1038756808023922 531602
#> - bathrooms 1 13325619736552 1049839301323540 531831
#> - bedrooms 1 20298780261196 1056812461848184 531974
#> - waterfront 1 82496530390614 1119010211977602 533210
#> - sqft_living 1 138930506698738 1175444188285726 534274
#> - grade 1 182635783415639 1219149465002627 535063
#> - yr_built 1 188639111042699 1225152792629688 535169
#>
#> Step: AIC=531533.8
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms + sqft_lot + floors
#>
#> Df Sum of Sq RSS AIC
#> <none> 1035294741860252 531534
#> - floors 1 1218939726736 1036513681586988 531557
#> - sqft_lot 1 2030230655068 1037324972515321 531574
#> - bathrooms 1 10844095263314 1046138837123566 531757
#> - bedrooms 1 19802603600782 1055097345461034 531941
#> - waterfront 1 82402185733932 1117696927594184 533187
#> - sqft_living 1 140116227741712 1175410969601964 534275
#> - grade 1 172030644434158 1207325386294410 534854
#> - yr_built 1 183528097123652 1218822838983905 535059
# SUMMARY MODEL STEPWISE
summary(model_backward)#>
#> Call:
#> lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
#> floors + waterfront + grade + yr_built, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1384206 -112972 -10077 91060 4251811
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6999106.70657 121576.94670 57.569 < 0.0000000000000002 ***
#> bedrooms -41484.20936 2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms 51710.08964 3437.50666 15.043 < 0.0000000000000002 ***
#> sqft_living 177.91392 3.29026 54.073 < 0.0000000000000002 ***
#> sqft_lot -0.23947 0.03679 -6.509 0.0000000000774 ***
#> floors 17283.13337 3426.85939 5.043 0.0000004609553 ***
#> waterfront1 721804.73094 17406.65326 41.467 < 0.0000000000000002 ***
#> grade 128813.92794 2149.93255 59.915 < 0.0000000000000002 ***
#> yr_built -3963.73577 64.04988 -61.885 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared: 0.6446, Adjusted R-squared: 0.6445
#> F-statistic: 4898 on 8 and 21604 DF, p-value: < 0.00000000000000022
# SUMMARY MODEL STEPWISE
summary(model_forward)#>
#> Call:
#> lm(formula = price ~ sqft_living + grade + yr_built + waterfront +
#> bedrooms + bathrooms + sqft_lot + floors, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1384206 -112972 -10077 91060 4251811
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6999106.70657 121576.94670 57.569 < 0.0000000000000002 ***
#> sqft_living 177.91392 3.29026 54.073 < 0.0000000000000002 ***
#> grade 128813.92794 2149.93255 59.915 < 0.0000000000000002 ***
#> yr_built -3963.73577 64.04988 -61.885 < 0.0000000000000002 ***
#> waterfront1 721804.73094 17406.65326 41.467 < 0.0000000000000002 ***
#> bedrooms -41484.20936 2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms 51710.08964 3437.50666 15.043 < 0.0000000000000002 ***
#> sqft_lot -0.23947 0.03679 -6.509 0.0000000000774 ***
#> floors 17283.13337 3426.85939 5.043 0.0000004609553 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared: 0.6446, Adjusted R-squared: 0.6445
#> F-statistic: 4898 on 8 and 21604 DF, p-value: < 0.00000000000000022
# SUMMARY MODEL STEPWISE
summary(model_both)#>
#> Call:
#> lm(formula = price ~ sqft_living + grade + yr_built + waterfront +
#> bedrooms + bathrooms + sqft_lot + floors, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1384206 -112972 -10077 91060 4251811
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6999106.70657 121576.94670 57.569 < 0.0000000000000002 ***
#> sqft_living 177.91392 3.29026 54.073 < 0.0000000000000002 ***
#> grade 128813.92794 2149.93255 59.915 < 0.0000000000000002 ***
#> yr_built -3963.73577 64.04988 -61.885 < 0.0000000000000002 ***
#> waterfront1 721804.73094 17406.65326 41.467 < 0.0000000000000002 ***
#> bedrooms -41484.20936 2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms 51710.08964 3437.50666 15.043 < 0.0000000000000002 ***
#> sqft_lot -0.23947 0.03679 -6.509 0.0000000000774 ***
#> floors 17283.13337 3426.85939 5.043 0.0000004609553 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared: 0.6446, Adjusted R-squared: 0.6445
#> F-statistic: 4898 on 8 and 21604 DF, p-value: < 0.00000000000000022
Membandingkan nilai R-Squared dari model-model yang telah dibuat
# periksa nilai R-Squared untuk setiap model
summary(model_sqft_living)$r.squared#> [1] 0.4928532
summary(model_all)$adj.r.squared#> [1] 0.6444532
summary(model_backward)$adj.r.squared#> [1] 0.6444532
summary(model_forward)$adj.r.squared#> [1] 0.6444532
summary(model_both)$adj.r.squared#> [1] 0.6444532
💡 Kesimpulan model : model menggunakan stepwise apapun hasilnya sama saja, karena setiap variable memang saling mempengaruhi.
# menambahkan hasil prediksi model_both stepwise ke kolom baru di data house
house$pred_both<- predict(object = model_both, newdata = house)
head(house)💡 saya memilih model both saja, karena menurut saya lebih stabil untuk kasus ini dikarenakan model sudah mengujinya dengan membuang dan menambahkan variable untuk mencoba mendapatkan hasil yang terbaik dari step perhitungan modelnya.
#membuat dataframe berisi data aktual, hasil prediksi, dan error dari model_both sebagai perbandingan hasil prediksi
res <- data.frame(aktual = house$price,
prediksi = model_both$fitted.values) %>%
mutate(error = prediksi - aktual)
head(res)library(MLmetrics)
# Menghitung RMSE dari res
RMSE(y_pred = res$prediksi,
y_true = res$aktual)#> [1] 218864.1
# cek range target variable
range(res$aktual)#> [1] 75000 7700000
#membuat dataframe berisi data aktual, hasil prediksi, dan error
res2 <- data.frame(aktual = house$price, #mengambil nilai aktual inequality
prediksi = model_sqft_living$fitted.values) %>% #mengambil nilai prediksi price di model_both
mutate(error = prediksi - aktual) #membuat kolom berisi hasil perhitungan nilai eror
head(res2)# Menghitung RMSE dari res
RMSE(y_pred = res2$prediksi,
y_true = res2$aktual)#> [1] 261440.8
Berdasarkan ‘RMSE’ model regresi manakah yang terbaik?
REGRESSI model_both<<<
💡 Kesimpulan :
💡 model_both Menjadi model terbaik dikarenakan saat memakai stepwise dengan mode lainnya, hasilnya sama saja, mendapatkan hasil yang terbilang cukup bagus dengan r-squared di angka 0.6444532 dan RMSE masih dapat diterima jika dilihat dari range datanya.
model_both dengan nilai RMSE : 218864.1, Dibandingkan model_selection yakni model_sqft_living dengan r-squared di angka 0.4928532 dan RMSE lebih besar dibandingkan model_both, yakni diangka 261440.8. Terdapat GAP RMSE yang lumayan jauh.
summary(model_both)#>
#> Call:
#> lm(formula = price ~ sqft_living + grade + yr_built + waterfront +
#> bedrooms + bathrooms + sqft_lot + floors, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1384206 -112972 -10077 91060 4251811
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6999106.70657 121576.94670 57.569 < 0.0000000000000002 ***
#> sqft_living 177.91392 3.29026 54.073 < 0.0000000000000002 ***
#> grade 128813.92794 2149.93255 59.915 < 0.0000000000000002 ***
#> yr_built -3963.73577 64.04988 -61.885 < 0.0000000000000002 ***
#> waterfront1 721804.73094 17406.65326 41.467 < 0.0000000000000002 ***
#> bedrooms -41484.20936 2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms 51710.08964 3437.50666 15.043 < 0.0000000000000002 ***
#> sqft_lot -0.23947 0.03679 -6.509 0.0000000000774 ***
#> floors 17283.13337 3426.85939 5.043 0.0000004609553 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared: 0.6446, Adjusted R-squared: 0.6445
#> F-statistic: 4898 on 8 and 21604 DF, p-value: < 0.00000000000000022
1. Interpretasi coefficient untuk prediktor kategorik:
2. Interpretasi coefficient untuk prediktor numerik:
intercept : 6999106.70657, artinya price akan bernilai 6999106.70657 dengan syarat semua variabel prediktornya bernilai 0
sqft_living : 177.91392, nilai price akan naik senilai 177.91392 ketika nilai meningkat sebanyak 1 sqft_living dan variable lainnya tetap
grade : 128813.92794, nilai price akan naik senilai 128813.92794 ketika nilai meningkat sebanyak 1 grade dan variable lainnya tetap
yr_built : -3963.73577, nilai price akan turun senilai -3963.73577 ketika nilai meningkat sebanyak 1 yr_built dan variable lainnya tetap
bedrooms : -41484.20936, nilai price akan turun senilai -41484.20936 ketika nilai meningkat sebanyak 1 bedrooms dan variable lainnya tetap
bathrooms : 51710.08964, nilai price akan naik senilai 51710.08964 ketika nilai meningkat sebanyak 1 bathrooms dan variable lainnya tetap
sqft_lot : -0.23947, nilai price akan turun senilai -0.23947 ketika nilai meningkat sebanyak 1 sqft_lot dan variable lainnya tetap
floors : 17283.13337, nilai price akan naik senilai 17283.13337 ketika nilai meningkat sebanyak 1 floors dan variable lainnya tetap
3. Signifikansi prediktor:
💡 p-value prediktor pada model_both:
0.0000000000000002, variabel lainnya saling signifkan mempengaruhi price
dalam harga suatu rumah.
4. Adjusted R Squared:
💡 Adjusted R-squared :0.6445, Model dapat menjelaskan dengan cukup baik sebesar 64.50%
dengan hasil diatas, saya memutuskan untuk membawa data-data outlier karena dirasa variable-variable yang memiliki outlier ini penting bagi business question kita untuk memprediksi harga properti rumah(dilihat dari data outlier yang sangat banyak juga di plot sqft_living ~ price, tidak mungkin kita akan membuang outlier2 ini).
MAPE(y_pred = res$prediksi,
y_true = res$aktual)*100#> [1] 29.75172
MAPE(y_pred = res2$prediksi,
y_true = res2$aktual)*100#> [1] 35.99486
💡 model_both mendapatkan hasil r-squared yang paling baik dari model lainnya dengan hasil model dapat menginterpretasikan cukup bagus sebesar 64.50% dengan error sebesar : 29.75% saja
dan mendapatkan hasil RMSE : 218864.1, dari range data 75.000 - 7.700.000 dengan RMSE 218864.1, hasil RMSE masih terbilang cukup baik.
LINK HASIL : https://rpubs.com/zdims17/Divedeeper_ML_RegressionModels