As property sellers, we want to build a model that can predict the price of a property from the information available in the data.
Define the variables:
Target variable: price
1. Read the data from house_data.csv
house <- read.csv("data_input/house_data.csv")
head(house)
2. Check the data structure
library(dplyr)
glimpse(house)
#> Rows: 21,613
#> Columns: 9
#> $ price <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
💡 Findings from the structure check: the following columns should be converted to factors.
- bedrooms -> factor
- floors -> factor
- grade -> factor
3. Data cleansing
library(lubridate)
house <- house %>%
  mutate_at(vars(bedrooms, floors, grade), as.factor) # convert these columns to factor
glimpse(house)
#> Rows: 21,613
#> Columns: 9
#> $ price <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms <fct> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors <fct> 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1.5, 1, 1.5, 2, 2, 1.5…
#> $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
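The same conversion can also be written in base R, without dplyr. The following is a minimal sketch on a made-up data frame; `toy`, `cat_cols` and all values are illustrative assumptions, not taken from house_data.csv.

```r
# Toy data standing in for the house data (values are made up)
toy <- data.frame(
  price    = c(200000, 350000, 500000, 275000),
  bedrooms = c(2, 3, 4, 3),
  floors   = c(1, 1.5, 2, 1),
  grade    = c(6, 7, 9, 7)
)

# Columns that describe categories rather than quantities
cat_cols <- c("bedrooms", "floors", "grade")

# Convert each of them to a factor
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Check the result: the three columns are factors, price stays numeric
sapply(toy, class)

# Frequency of each level; levels with very few rows give unstable coefficients
table(toy$bedrooms)
```

Running `table()` on each new factor is a cheap way to spot levels that occur only a handful of times before they enter a model.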
4. EDA
# data distribution
boxplot(house)
# distribution of the target
boxplot(house$price)
# check correlation
library(GGally)
ggcorr(house, label = TRUE)
💡 Insight:
- there are outliers
- the distribution is skewed to the right
- the predictor that correlates well with the target is bathrooms
# Modeling
Build 3 models using the feature selection methods covered so far:
1. a model with all predictors
2. a model selected by correlation (correlation > 0.5)
3. a model selected by stepwise regression (backward/forward/both)
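Selection by correlation (the second model in the list) could be automated roughly as follows. This is a sketch on synthetic data, not on the house data: `toy`, `x1` to `x3`, `cors`, `selected` and `model_cor` are made-up names, and only numeric columns are used because `cor()` does not accept factors.

```r
set.seed(1)  # reproducible toy data

# Made-up numeric data: x1 drives y strongly, x2 weakly, x3 not at all
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 3 * x1 + 0.5 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# Correlation of every predictor with the target
predictors <- setdiff(names(toy), "y")
cors <- sapply(toy[predictors], function(col) cor(col, toy$y))

# Keep the predictors whose absolute correlation exceeds the threshold
selected <- names(cors)[abs(cors) > 0.5]

# Build the formula from the selected names and fit the model
model_cor <- lm(reformulate(selected, response = "y"), data = toy)
formula(model_cor)
```

Using the absolute value keeps strongly negative predictors as well; whether 0.5 is a sensible cut-off depends on the data.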
head(house)
model_all <- lm(formula = price ~ ., data = house)
summary(model_all)
#>
#> Call:
#> lm(formula = price ~ ., data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1517103 -107587 -10970 85414 3813468
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 7147857.06153 249085.31875 28.696 < 0.0000000000000002 ***
#> bedrooms1 37424.97778 62910.89983 0.595 0.55192
#> bedrooms2 45623.36394 61348.99816 0.744 0.45708
#> bedrooms3 -6857.60993 61294.14995 -0.112 0.91092
#> bedrooms4 -44750.89791 61356.85974 -0.729 0.46579
#> bedrooms5 -35177.34817 61609.08243 -0.571 0.56802
#> bedrooms6 -78883.53811 62799.42407 -1.256 0.20909
#> bedrooms7 -131940.67030 70420.24010 -1.874 0.06100 .
#> bedrooms8 61874.15704 84454.54665 0.733 0.46379
#> bedrooms9 -264200.67930 105777.05165 -2.498 0.01251 *
#> bedrooms10 -85275.78492 135616.78332 -0.629 0.52949
#> bedrooms11 -321420.39093 218087.79358 -1.474 0.14055
#> bedrooms33 183125.68799 217960.15021 0.840 0.40082
#> bathrooms 59323.17930 3339.47381 17.764 < 0.0000000000000002 ***
#> sqft_living 150.82615 3.34495 45.091 < 0.0000000000000002 ***
#> sqft_lot -0.26362 0.03527 -7.474 0.000000000000081 ***
#> floors1.5 2065.20064 5516.86935 0.374 0.70815
#> floors2 -2086.69407 4016.66492 -0.520 0.60341
#> floors2.5 103156.01348 16939.45136 6.090 0.000000001150417 ***
#> floors3 119955.20419 9456.47094 12.685 < 0.0000000000000002 ***
#> floors3.5 221434.69776 74840.40158 2.959 0.00309 **
#> waterfront 694092.03079 16684.68661 41.601 < 0.0000000000000002 ***
#> grade3 -96635.81614 249563.42063 -0.387 0.69860
#> grade4 -159916.49032 220291.08263 -0.726 0.46789
#> grade5 -193776.86952 218338.52545 -0.888 0.37482
#> grade6 -140947.75035 217972.54746 -0.647 0.51788
#> grade7 -50876.31110 217938.41841 -0.233 0.81542
#> grade8 43430.97033 217988.12226 0.199 0.84208
#> grade9 194335.96998 218070.62723 0.891 0.37285
#> grade10 373542.29828 218182.72098 1.712 0.08690 .
#> grade11 634497.10197 218451.15444 2.905 0.00368 **
#> grade12 1089430.43387 219317.87445 4.967 0.000000683896980 ***
#> grade13 2253954.55040 226376.98970 9.957 < 0.0000000000000002 ***
#> yr_built -3588.43279 68.87915 -52.098 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 209200 on 21579 degrees of freedom
#> Multiple R-squared: 0.6759, Adjusted R-squared: 0.6754
#> F-statistic: 1364 on 33 and 21579 DF, p-value: < 0.00000000000000022
model_bathrooms <- lm(formula = price~bathrooms, data = house)
summary(model_bathrooms)
#>
#> Call:
#> lm(formula = price ~ bathrooms, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1438157 -184525 -41525 113220 5925322
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10708 6211 1.724 0.0847 .
#> bathrooms 250326 2760 90.714 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 312400 on 21611 degrees of freedom
#> Multiple R-squared: 0.2758, Adjusted R-squared: 0.2757
#> F-statistic: 8229 on 1 and 21611 DF, p-value: < 0.00000000000000022
$$price = 10708 + 250326 \times bathrooms$$
plot(house$bathrooms, house$price)
abline(model_bathrooms, col = "red")
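To see what the fitted line implies, the equation can be evaluated by hand. The sketch below copies the two rounded coefficients from the `summary(model_bathrooms)` output; `predict_price` is a hypothetical helper introduced only for illustration.

```r
# Coefficients copied (rounded) from the summary(model_bathrooms) output
intercept <- 10708
slope     <- 250326

# Hypothetical helper: predicted price for a given number of bathrooms
predict_price <- function(bathrooms) {
  intercept + slope * bathrooms
}

predict_price(2)            # 10708 + 250326 * 2 = 511360
predict_price(c(1, 2.25))   # vectorised over several houses
```

With the fitted object at hand, `predict(model_bathrooms, newdata = data.frame(bathrooms = 2))` should give nearly the same number; small differences come from the rounding of the printed coefficients.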
# STEPWISE
backward <- step(object = model_all, direction = "backward")
#> Start: AIC=529587.7
#> price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
#> waterfront + grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> <none> 943960178918381 529588
#> - sqft_lot 1 2443362459096 946403541377477 529642
#> - floors 5 10007431499118 953967610417499 529806
#> - bedrooms 12 12540920524799 956501099443180 529849
#> - bathrooms 1 13804322987608 957764501905989 529899
#> - waterfront 1 75704268598963 1019664447517344 531253
#> - sqft_living 1 88939924950670 1032900103869050 531532
#> - yr_built 1 118728880745514 1062689059663895 532146
#> - grade 11 242625670736310 1186585849654690 534510
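The AIC column in this trace can be reproduced by hand, which helps explain why the backward search stops immediately: every candidate removal has a higher AIC than the `<none>` row. The sketch assumes the formula that `extractAIC()` uses for `lm` objects, n * log(RSS / n) + 2 * edf, and copies its inputs from the output above; the variable names are illustrative.

```r
# Numbers copied from the step() trace and the model summary above
n   <- 21613              # rows in the data
rss <- 943960178918381    # RSS of the full model (the <none> row)
edf <- 34                 # 33 coefficients plus the intercept

# AIC of the full model; compare with "Start: AIC=529587.7" in the trace
aic_full <- n * log(rss / n) + 2 * edf
round(aic_full, 1)

# Dropping sqft_lot raises the RSS and removes one parameter
rss_drop <- 946403541377477
aic_drop <- n * log(rss_drop / n) + 2 * (edf - 1)
round(aic_drop)           # compare with the "- sqft_lot" row

# The removal is rejected because it makes the AIC worse
aic_drop > aic_full
```

Saving one parameter lowers the penalty by 2, but the fit term grows by far more, so `sqft_lot` stays in the model.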
summary(backward)
#>
#> Call:
#> lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
#> floors + waterfront + grade + yr_built, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1517103 -107587 -10970 85414 3813468
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 7147857.06153 249085.31875 28.696 < 0.0000000000000002 ***
#> bedrooms1 37424.97778 62910.89983 0.595 0.55192
#> bedrooms2 45623.36394 61348.99816 0.744 0.45708
#> bedrooms3 -6857.60993 61294.14995 -0.112 0.91092
#> bedrooms4 -44750.89791 61356.85974 -0.729 0.46579
#> bedrooms5 -35177.34817 61609.08243 -0.571 0.56802
#> bedrooms6 -78883.53811 62799.42407 -1.256 0.20909
#> bedrooms7 -131940.67030 70420.24010 -1.874 0.06100 .
#> bedrooms8 61874.15704 84454.54665 0.733 0.46379
#> bedrooms9 -264200.67930 105777.05165 -2.498 0.01251 *
#> bedrooms10 -85275.78492 135616.78332 -0.629 0.52949
#> bedrooms11 -321420.39093 218087.79358 -1.474 0.14055
#> bedrooms33 183125.68799 217960.15021 0.840 0.40082
#> bathrooms 59323.17930 3339.47381 17.764 < 0.0000000000000002 ***
#> sqft_living 150.82615 3.34495 45.091 < 0.0000000000000002 ***
#> sqft_lot -0.26362 0.03527 -7.474 0.000000000000081 ***
#> floors1.5 2065.20064 5516.86935 0.374 0.70815
#> floors2 -2086.69407 4016.66492 -0.520 0.60341
#> floors2.5 103156.01348 16939.45136 6.090 0.000000001150417 ***
#> floors3 119955.20419 9456.47094 12.685 < 0.0000000000000002 ***
#> floors3.5 221434.69776 74840.40158 2.959 0.00309 **
#> waterfront 694092.03079 16684.68661 41.601 < 0.0000000000000002 ***
#> grade3 -96635.81614 249563.42063 -0.387 0.69860
#> grade4 -159916.49032 220291.08263 -0.726 0.46789
#> grade5 -193776.86952 218338.52545 -0.888 0.37482
#> grade6 -140947.75035 217972.54746 -0.647 0.51788
#> grade7 -50876.31110 217938.41841 -0.233 0.81542
#> grade8 43430.97033 217988.12226 0.199 0.84208
#> grade9 194335.96998 218070.62723 0.891 0.37285
#> grade10 373542.29828 218182.72098 1.712 0.08690 .
#> grade11 634497.10197 218451.15444 2.905 0.00368 **
#> grade12 1089430.43387 219317.87445 4.967 0.000000683896980 ***
#> grade13 2253954.55040 226376.98970 9.957 < 0.0000000000000002 ***
#> yr_built -3588.43279 68.87915 -52.098 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 209200 on 21579 degrees of freedom
#> Multiple R-squared: 0.6759, Adjusted R-squared: 0.6754
#> F-statistic: 1364 on 33 and 21579 DF, p-value: < 0.00000000000000022
forward <- step(object = model_bathrooms, direction = "forward",
                scope = list(lower = model_bathrooms, upper = model_all))
#> Start: AIC=546904.4
#> price ~ bathrooms
#>
#> Df Sum of Sq RSS AIC
#> + grade 11 752394678422453 1357228777001178 537394
#> + sqft_living 1 632494286111913 1477129169311718 539203
#> + yr_built 1 175510961437577 1934112493986054 545029
#> + waterfront 1 158641800230300 1950981655193331 545217
#> + floors 5 46043144331791 2063580311091840 546438
#> + bedrooms 12 33177360169758 2076446095253873 546586
#> + sqft_lot 1 5576578954819 2104046876468812 546849
#> <none> 2109623455423631 546904
#>
#> Step: AIC=537393.7
#> price ~ bathrooms + grade
#>
#> Df Sum of Sq RSS AIC
#> + yr_built 1 225794937949651 1131433839051526 533463
#> + sqft_living 1 139395971837453 1217832805163725 535053
#> + waterfront 1 107174563579753 1250054213421424 535618
#> + floors 5 60169818251324 1297058958749854 536424
#> + bedrooms 12 22264807429182 1334963969571996 537060
#> + sqft_lot 1 174875436325 1357053901564852 537393
#> <none> 1357228777001178 537394
#>
#> Step: AIC=533463
#> price ~ bathrooms + grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> + waterfront 1 88000861122237 1043432977929290 531715
#> + sqft_living 1 79053517962434 1052380321089093 531900
#> + bedrooms 12 5893198032337 1125540641019190 533374
#> + floors 5 5025122100976 1126408716950551 533377
#> <none> 1131433839051526 533463
#> + sqft_lot 1 58165483405 1131375673568121 533464
#>
#> Step: AIC=531715
#> price ~ bathrooms + grade + yr_built + waterfront
#>
#> Df Sum of Sq RSS AIC
#> + sqft_living 1 73076552051395 970356425877895 530148
#> + bedrooms 12 5727253131939 1037705724797350 531620
#> + floors 5 4757227546980 1038675750382310 531626
#> <none> 1043432977929290 531715
#> + sqft_lot 1 87180090980 1043345797838310 531715
#>
#> Step: AIC=530147.8
#> price ~ bathrooms + grade + yr_built + waterfront + sqft_living
#>
#> Df Sum of Sq RSS AIC
#> + bedrooms 12 13627551392089 956728874485806 529866
#> + floors 5 12027309816169 958329116061726 529888
#> + sqft_lot 1 2138877360195 968217548517699 530102
#> <none> 970356425877895 530148
#>
#> Step: AIC=529866.1
#> price ~ bathrooms + grade + yr_built + waterfront + sqft_living +
#> bedrooms
#>
#> Df Sum of Sq RSS AIC
#> + floors 5 10325333108330 946403541377476 529642
#> + sqft_lot 1 2761264068308 953967610417498 529806
#> <none> 956728874485806 529866
#>
#> Step: AIC=529641.6
#> price ~ bathrooms + grade + yr_built + waterfront + sqft_living +
#> bedrooms + floors
#>
#> Df Sum of Sq RSS AIC
#> + sqft_lot 1 2443362459095 943960178918381 529588
#> <none> 946403541377476 529642
#>
#> Step: AIC=529587.7
#> price ~ bathrooms + grade + yr_built + waterfront + sqft_living +
#> bedrooms + floors + sqft_lot
summary(forward)
#>
#> Call:
#> lm(formula = price ~ bathrooms + grade + yr_built + waterfront +
#> sqft_living + bedrooms + floors + sqft_lot, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1517103 -107587 -10970 85414 3813468
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 7147857.06153 249085.31875 28.696 < 0.0000000000000002 ***
#> bathrooms 59323.17930 3339.47381 17.764 < 0.0000000000000002 ***
#> grade3 -96635.81614 249563.42063 -0.387 0.69860
#> grade4 -159916.49032 220291.08263 -0.726 0.46789
#> grade5 -193776.86952 218338.52545 -0.888 0.37482
#> grade6 -140947.75035 217972.54746 -0.647 0.51788
#> grade7 -50876.31110 217938.41841 -0.233 0.81542
#> grade8 43430.97033 217988.12226 0.199 0.84208
#> grade9 194335.96998 218070.62723 0.891 0.37285
#> grade10 373542.29828 218182.72098 1.712 0.08690 .
#> grade11 634497.10197 218451.15444 2.905 0.00368 **
#> grade12 1089430.43387 219317.87445 4.967 0.000000683896980 ***
#> grade13 2253954.55040 226376.98970 9.957 < 0.0000000000000002 ***
#> yr_built -3588.43279 68.87915 -52.098 < 0.0000000000000002 ***
#> waterfront 694092.03079 16684.68661 41.601 < 0.0000000000000002 ***
#> sqft_living 150.82615 3.34495 45.091 < 0.0000000000000002 ***
#> bedrooms1 37424.97778 62910.89983 0.595 0.55192
#> bedrooms2 45623.36394 61348.99816 0.744 0.45708
#> bedrooms3 -6857.60993 61294.14995 -0.112 0.91092
#> bedrooms4 -44750.89791 61356.85974 -0.729 0.46579
#> bedrooms5 -35177.34817 61609.08243 -0.571 0.56802
#> bedrooms6 -78883.53811 62799.42407 -1.256 0.20909
#> bedrooms7 -131940.67030 70420.24010 -1.874 0.06100 .
#> bedrooms8 61874.15704 84454.54665 0.733 0.46379
#> bedrooms9 -264200.67930 105777.05165 -2.498 0.01251 *
#> bedrooms10 -85275.78492 135616.78332 -0.629 0.52949
#> bedrooms11 -321420.39093 218087.79358 -1.474 0.14055
#> bedrooms33 183125.68799 217960.15021 0.840 0.40082
#> floors1.5 2065.20064 5516.86935 0.374 0.70815
#> floors2 -2086.69407 4016.66492 -0.520 0.60341
#> floors2.5 103156.01348 16939.45136 6.090 0.000000001150417 ***
#> floors3 119955.20419 9456.47094 12.685 < 0.0000000000000002 ***
#> floors3.5 221434.69776 74840.40158 2.959 0.00309 **
#> sqft_lot -0.26362 0.03527 -7.474 0.000000000000081 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 209200 on 21579 degrees of freedom
#> Multiple R-squared: 0.6759, Adjusted R-squared: 0.6754
#> F-statistic: 1364 on 33 and 21579 DF, p-value: < 0.00000000000000022
# Backward model performance
summary(backward)$adj.r.squared
#> [1] 0.6754443
# Forward model performance
summary(forward)$adj.r.squared
#> [1] 0.6754443
The two stepwise models, backward and forward, do not differ: both end up with all eight predictors and the same adjusted R-squared (0.6754443).
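The third stepwise option, direction = "both", is not run in this notebook. A minimal sketch of how it could look is given here on the built-in `mtcars` data, chosen only so the example runs on its own; `full` and `both` are illustrative names, and the selected formula for the house data may well be the full model again.

```r
# Built-in dataset used purely for illustration (not the house data)
full <- lm(mpg ~ ., data = mtcars)

# Start from the full model; at each step a term may be dropped,
# and a previously dropped term may be added back
both <- step(full, direction = "both", trace = 0)  # trace = 0 hides the log

# Predictors kept by the search
formula(both)

# step() only accepts moves that lower the AIC
extractAIC(both)[2] <= extractAIC(full)[2]
```

When no `scope` is given, the starting model acts as the upper limit, so the search can never add terms beyond those in `full`.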
house$price_bathrooms <- predict(object = model_bathrooms, newdata = house)
house$priceBackward <- predict(object = backward, newdata = house)
Based on RMSE, which regression model is the best?
library(MLmetrics) # provides RMSE() and MSE()
RMSE(y_pred = house$priceBackward, y_true = house$price) # model_all
#> [1] 208987
RMSE(y_pred = house$price_bathrooms, y_true = house$price) # model_bathrooms
#> [1] 312424.4
MSE(y_pred = house$priceBackward, y_true = house$price) # model_all
#> [1] 43675573910
MSE(y_pred = house$price_bathrooms, y_true = house$price) # model_bathrooms
#> [1] 97609006405
💡 Conclusion: Based on the predictions and the RMSE and MSE values, the all-predictor model (identical to the backward result) is the best: its error is lower than that of the simple linear regression on bathrooms, the model chosen by correlation.
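For reference, both metrics are simple enough to compute by hand. The sketch below defines two hypothetical helpers, `mse` and `rmse`, that mirror the argument names used above, and checks that the reported RMSE values are the square roots of the reported MSE values.

```r
# Hypothetical helpers mirroring the argument names used above
mse  <- function(y_pred, y_true) mean((y_true - y_pred)^2)
rmse <- function(y_pred, y_true) sqrt(mse(y_pred, y_true))

# Tiny made-up example: the errors are 0, 0 and 2
y_true <- c(1, 2, 5)
y_pred <- c(1, 2, 3)
mse(y_pred, y_true)    # (0 + 0 + 4) / 3
rmse(y_pred, y_true)   # square root of the value above

# RMSE is the square root of MSE, so the reported values are consistent
sqrt(43675573910)      # close to the reported RMSE of 208987
sqrt(97609006405)      # close to the reported RMSE of 312424.4
```

RMSE is in the same unit as price, which makes it easier to read than MSE. Both numbers here are measured on the same rows the models were fitted on, so they are likely to be somewhat optimistic compared with errors on unseen data.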
summary(model_all)
#>
#> Call:
#> lm(formula = price ~ ., data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1517103 -107587 -10970 85414 3813468
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 7147857.06153 249085.31875 28.696 < 0.0000000000000002 ***
#> bedrooms1 37424.97778 62910.89983 0.595 0.55192
#> bedrooms2 45623.36394 61348.99816 0.744 0.45708
#> bedrooms3 -6857.60993 61294.14995 -0.112 0.91092
#> bedrooms4 -44750.89791 61356.85974 -0.729 0.46579
#> bedrooms5 -35177.34817 61609.08243 -0.571 0.56802
#> bedrooms6 -78883.53811 62799.42407 -1.256 0.20909
#> bedrooms7 -131940.67030 70420.24010 -1.874 0.06100 .
#> bedrooms8 61874.15704 84454.54665 0.733 0.46379
#> bedrooms9 -264200.67930 105777.05165 -2.498 0.01251 *
#> bedrooms10 -85275.78492 135616.78332 -0.629 0.52949
#> bedrooms11 -321420.39093 218087.79358 -1.474 0.14055
#> bedrooms33 183125.68799 217960.15021 0.840 0.40082
#> bathrooms 59323.17930 3339.47381 17.764 < 0.0000000000000002 ***
#> sqft_living 150.82615 3.34495 45.091 < 0.0000000000000002 ***
#> sqft_lot -0.26362 0.03527 -7.474 0.000000000000081 ***
#> floors1.5 2065.20064 5516.86935 0.374 0.70815
#> floors2 -2086.69407 4016.66492 -0.520 0.60341
#> floors2.5 103156.01348 16939.45136 6.090 0.000000001150417 ***
#> floors3 119955.20419 9456.47094 12.685 < 0.0000000000000002 ***
#> floors3.5 221434.69776 74840.40158 2.959 0.00309 **
#> waterfront 694092.03079 16684.68661 41.601 < 0.0000000000000002 ***
#> grade3 -96635.81614 249563.42063 -0.387 0.69860
#> grade4 -159916.49032 220291.08263 -0.726 0.46789
#> grade5 -193776.86952 218338.52545 -0.888 0.37482
#> grade6 -140947.75035 217972.54746 -0.647 0.51788
#> grade7 -50876.31110 217938.41841 -0.233 0.81542
#> grade8 43430.97033 217988.12226 0.199 0.84208
#> grade9 194335.96998 218070.62723 0.891 0.37285
#> grade10 373542.29828 218182.72098 1.712 0.08690 .
#> grade11 634497.10197 218451.15444 2.905 0.00368 **
#> grade12 1089430.43387 219317.87445 4.967 0.000000683896980 ***
#> grade13 2253954.55040 226376.98970 9.957 < 0.0000000000000002 ***
#> yr_built -3588.43279 68.87915 -52.098 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 209200 on 21579 degrees of freedom
#> Multiple R-squared: 0.6759, Adjusted R-squared: 0.6754
#> F-statistic: 1364 on 33 and 21579 DF, p-value: < 0.00000000000000022
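Coefficients of factor predictors such as floors3 or grade11 are differences from a baseline level, which is the first level of the factor and has no row of its own in the table. The sketch below illustrates this on made-up data with a single factor; `toy`, `fit` and `means` are illustrative names.

```r
# Made-up data: price by number of floors (illustrative values only)
toy <- data.frame(
  price  = c(100, 120, 110, 150, 170, 160, 300, 320, 310),
  floors = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3))
)

fit <- lm(price ~ floors, data = toy)

# Intercept = mean of the baseline level (floors = 1);
# floors2 and floors3 = difference of each level's mean from the baseline
coef(fit)

# Group means, for comparison with the coefficients
means <- tapply(toy$price, toy$floors, mean)
means
```

Read the same way, the floors3 estimate of 119955.20 in the table above says that a 3-floor house is predicted to cost about 119955 more than a 1-floor house when all other predictors are held equal.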
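1. Interpretation of coefficients for categorical predictors:
2. Interpretation of coefficients for numeric predictors:
- The intercept of model_all is 7147857.06153.
- Each additional square foot of sqft_living raises the predicted price by about 150.83, holding the other predictors constant.
3. Predictor significance: the predictors with the smallest p-values (< 2e-16) are bathrooms, sqft_living, waterfront and yr_built, followed by sqft_lot; for grade, only levels 11 to 13 are significant at the 0.05 level.
4. Adjusted R-squared:
summary(model_all)$adj.r.squared
#> [1] 0.6754443
summary(model_bathrooms)$adj.r.squared
#> [1] 0.2757359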
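Adjusted R-squared penalises R-squared for the number of estimated slopes. As a rough check, the value for model_all can be recomputed from the rounded numbers printed in its summary; the variable names are illustrative, and the result matches only to the precision of the rounded input.

```r
# Values copied from the summary(model_all) output
r2 <- 0.6759    # Multiple R-squared, rounded as printed
n  <- 21613     # number of rows
p  <- 33        # number of slopes (first df of the F-statistic)

# Adjusted R-squared: shrink R-squared by the ratio of degrees of freedom
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)   # compare with "Adjusted R-squared: 0.6754"
```

Because n is large relative to p, the penalty is tiny here, so the gap between 0.6754 for model_all and 0.2757 for model_bathrooms reflects a real difference in fit rather than the extra parameters.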