As a property seller, we want to build a model that predicts property prices from the information available in the data.
Define the variables:
target: price
predictors: all columns other than price
1. Read data house_data.csv
house <- read.csv("data_input/house_data.csv")
head(house)
2. Check the data structure
glimpse(house)
#> Rows: 21,613
#> Columns: 9
#> $ price <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
💡 Result of the data structure check:
Columns that may be better treated as categorical: bedrooms, floors, waterfront, and grade
unique(house$bedrooms)
#> [1] 3 2 4 5 1 6 7 0 8 9 11 10 33
unique(house$floors)
#> [1] 1.0 2.0 1.5 3.0 2.5 3.5
unique(house$waterfront)
#> [1] 0 1
unique(house$grade)
#> [1] 7 6 8 11 9 5 10 12 4 3 13 1
Based on the check above, every column other than waterfront has too many categories, so only the waterfront column is converted to a factor.
house$waterfront <- as.factor(house$waterfront)
glimpse(house)
#> Rows: 21,613
#> Columns: 9
#> $ price <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
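The decision to convert only waterfront rests on how many distinct values each candidate column holds. The sketch below tabulates those counts in one call. The helper name `count_levels` and the `toy` data frame are assumptions made for illustration; the toy values are the first six records shown in the glimpse() output, not the full data.

```r
# Toy rows copied from the first six records printed by glimpse(house).
toy <- data.frame(
  bedrooms   = c(3, 3, 2, 4, 3, 4),
  floors     = c(1, 2, 1, 1, 1, 1),
  waterfront = c(0, 0, 0, 0, 0, 0),
  grade      = c(7, 7, 6, 7, 8, 11)
)

# Hypothetical helper: number of distinct values in each selected column.
count_levels <- function(data, cols) {
  sapply(data[cols], function(x) length(unique(x)))
}

level_counts <- count_levels(toy, c("bedrooms", "floors", "waterfront", "grade"))
print(level_counts)
```

Applied to the full `house` data, the unique() output listed earlier implies 13 distinct values for bedrooms, 6 for floors, 2 for waterfront, and 12 for grade.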
3. Data cleansing
colSums(is.na(house))
#>       price    bedrooms   bathrooms sqft_living    sqft_lot      floors
#>           0           0           0           0           0           0
#>  waterfront       grade    yr_built
#>           0           0           0
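The house data has no missing values, so the check returns zeros everywhere. To show what the same call reports when values are missing, here is a small sketch on a made-up data frame. The `toy_na` data and its NA positions are hypothetical and do not come from house_data.csv.

```r
# Hypothetical data with two missing entries, for illustration only.
toy_na <- data.frame(
  price    = c(221900, NA, 180000),
  bedrooms = c(3, 3, NA)
)

# Missing values per column (the same call used on house above).
na_counts <- colSums(is.na(toy_na))
print(na_counts)

# Quick yes/no check for the whole data frame.
has_missing <- anyNA(toy_na)

# One simple option when rows are incomplete: keep only complete rows.
complete_rows <- toy_na[complete.cases(toy_na), ]
```

Dropping incomplete rows is only one possible treatment; whether it is appropriate depends on how much data is lost.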
4. EDA
# distribution of price (histogram)
hist(house$price)
# distribution of price (boxplot)
boxplot(house$price)
# correlation
ggcorr(house, label = TRUE, hjust = 1)
💡 Insight:
The predictors most strongly correlated with price are grade, sqft_living, and bathrooms.
Build 3 models based on the feature selection methods covered so far
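The correlation-based shortlist used for model_selection can also be read off numerically instead of from the plot. The sketch below ranks numeric predictors by the absolute value of their correlation with the target. The helper name `rank_correlations` and the `toy` data frame are assumptions for illustration; the toy rows are the first six records from the glimpse() output, so the resulting ranking is not the ranking of the full data.

```r
# Toy rows copied from the first six records printed by glimpse(house).
toy <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000, 1225000),
  bedrooms    = c(3, 3, 2, 4, 3, 4),
  bathrooms   = c(1.00, 2.25, 1.00, 3.00, 2.00, 4.50),
  sqft_living = c(1180, 2570, 770, 1960, 1680, 5420),
  grade       = c(7, 7, 6, 7, 8, 11)
)

# Hypothetical helper: correlation of every numeric predictor with the target,
# ordered from strongest to weakest by absolute value.
rank_correlations <- function(data, target) {
  numeric_cols <- names(data)[sapply(data, is.numeric)]
  predictors <- setdiff(numeric_cols, target)
  cors <- sapply(predictors, function(col) cor(data[[col]], data[[target]]))
  cors[order(abs(cors), decreasing = TRUE)]
}

ranked <- rank_correlations(toy, "price")
print(ranked)
```

On the full data the same call would be `rank_correlations(house, "price")`; factor columns such as waterfront are skipped because they are not numeric.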
# model with all predictors
model_all <- lm(formula = price ~ .,
                data = house)
# model with the predictors most correlated with price
model_selection <- lm(formula = price ~ sqft_living + bathrooms + grade,
                      data = house)
# stepwise regression: backward elimination
model_backward <- step(object = model_all,
                       direction = "backward",
                       trace = FALSE)
# model with no predictors from the house data for `price`
model_none <- lm(formula = price ~ 1,
                 data = house)
# stepwise regression: forward selection
model_forward <- step(object = model_none,
                      scope = list(upper = model_all),
                      direction = "forward",
                      trace = FALSE)
# stepwise regression: both directions
model_both <- step(object = model_none,
                   direction = "both",
                   scope = list(upper = model_all),
                   trace = FALSE)
Based on RMSE, which regression model is the best?
# store the predictions in new columns
house$pred_all <- predict(object = model_all, newdata = house)
house$pred_selection <- predict(object = model_selection, newdata = house)
house$pred_backward <- predict(object = model_backward, newdata = house)
house$pred_forward <- predict(object = model_forward, newdata = house)
house$pred_both <- predict(object = model_both, newdata = house)
# RMSE model_all
RMSE(y_pred = house$pred_all,
     y_true = house$price)
#> [1] 218864.1
# RMSE model_selection
RMSE(y_pred = house$pred_selection,
     y_true = house$price)
#> [1] 249767.8
# RMSE model_forward
RMSE(y_pred = house$pred_forward,
     y_true = house$price)
#> [1] 218864.1
# RMSE model_backward
RMSE(y_pred = house$pred_backward,
     y_true = house$price)
#> [1] 218864.1
# RMSE model_both
RMSE(y_pred = house$pred_both,
     y_true = house$price)
#> [1] 218864.1
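Four of the five RMSE values are identical at the printed precision. One plausible explanation is that each stepwise search ended with the same predictor set as model_all, and this can be checked directly rather than inferred. The sketch below compares the term labels of two fitted models. The helper name `same_predictors` is an assumption for illustration, and the demonstration uses R's built-in mtcars data so that the block runs on its own.

```r
# Hypothetical helper: TRUE when two fitted models use the same set of terms,
# regardless of the order in which the terms appear in the formula.
same_predictors <- function(model_a, model_b) {
  terms_a <- attr(terms(model_a), "term.labels")
  terms_b <- attr(terms(model_b), "term.labels")
  setequal(terms_a, terms_b)
}

# Demonstration on the built-in mtcars data (not the house data).
fit_full      <- lm(mpg ~ wt + hp, data = mtcars)
fit_reordered <- lm(mpg ~ hp + wt, data = mtcars)
fit_small     <- lm(mpg ~ wt, data = mtcars)

check_same <- same_predictors(fit_full, fit_reordered)  # same terms, different order
check_diff <- same_predictors(fit_full, fit_small)      # fit_small lacks hp
```

With the models from this section, the corresponding call would be `same_predictors(model_all, model_backward)`, and likewise for model_forward and model_both.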
Compare the RMSE with the range of the target:
range(house$price)
#> [1] 75000 7700000
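To make the comparison concrete, the sketch below writes out the usual RMSE definition, the square root of the mean squared difference between actual and predicted values, and then expresses the reported RMSE as a share of the price range. The function name `rmse` and the three example predictions are assumptions for illustration; the RMSE and range values are the ones printed above.

```r
# Root mean squared error written out in base R.
rmse <- function(y_pred, y_true) {
  sqrt(mean((y_true - y_pred)^2))
}

# Small check with hypothetical predictions (errors of -10000, +10000, 0).
y_true <- c(221900, 538000, 180000)
y_pred <- c(231900, 528000, 180000)
toy_rmse <- rmse(y_pred, y_true)

# RMSE of model_all relative to the range of price, using the printed values.
rmse_best   <- 218864.1
price_range <- c(75000, 7700000)
relative_rmse_pct <- 100 * rmse_best / diff(price_range)
round(relative_rmse_pct, 2)
```

The ratio comes out at roughly 2.9% of the price range. Note that this RMSE is computed on the same rows the models were fitted on, so it describes fit rather than performance on unseen data.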
💡 Conclusion:
Based on RMSE, the best models are model_all, model_forward, model_backward, and model_both, each with an RMSE of 218864.1.
summary(model_all)
#>
#> Call:
#> lm(formula = price ~ ., data = house)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -1384206  -112972   -10077    91060  4251811
#>
#> Coefficients:
#>                  Estimate   Std. Error t value             Pr(>|t|)
#> (Intercept) 6999106.70657 121576.94670  57.569 < 0.0000000000000002 ***
#> bedrooms     -41484.20936   2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964   3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_living     177.91392      3.29026  54.073 < 0.0000000000000002 ***
#> sqft_lot         -0.23947      0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337   3426.85939   5.043      0.0000004609553 ***
#> waterfront1  721804.73094  17406.65326  41.467 < 0.0000000000000002 ***
#> grade        128813.92794   2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577     64.04988 -61.885 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared: 0.6446, Adjusted R-squared: 0.6445
#> F-statistic: 4898 on 8 and 21604 DF, p-value: < 0.00000000000000022
1. Interpretation of the coefficient for the categorical predictor:
2. Interpretation of the coefficients for the numeric predictors:
3. Predictor significance:
4. Adjusted R-squared:
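The four items above can each be read from the summary() output, and the same numbers can be pulled out programmatically. The sketch below shows one way to do that. It uses the built-in mtcars data with `am` converted to a factor as a stand-in for a 0/1 factor such as waterfront, so the values it prints are not those of the house model; the object names are assumptions for illustration.

```r
# Stand-in data: mtcars with a 0/1 factor, mirroring the role of waterfront.
cars <- mtcars
cars$am <- as.factor(cars$am)

fit <- lm(mpg ~ wt + am, data = cars)
fit_summary <- summary(fit)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(fit_summary)

# 1. Categorical predictor: difference of level 1 from the baseline level 0,
#    holding the other predictors constant.
coef_table["am1", "Estimate"]

# 2. Numeric predictor: change in the target per one-unit increase,
#    holding the other predictors constant.
coef_table["wt", "Estimate"]

# 3. Terms whose p-value is below 0.05.
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]

# 4. Adjusted R-squared.
adj_r2 <- fit_summary$adj.r.squared
```

For the house model the equivalent starting point would be `coef(summary(model_all))`, where the row for the categorical predictor is named waterfront1 and is read relative to the baseline level 0.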