Berikut adalah beberapa library yang digunakan
library(broom) #untuk merapikan tampilan data
library(glmnet) #metode seleksi LASSO dan RIDGE
library(glmnetUtils) #package tambahan dari glmnet yang memungkinkan syntax glmnet bisa dinput menggunakan object data.frame
library(leaps) #menggunakan fungsi regsubset sebagai metode seleksi peubah dengan bestforward
library(varbvs) #Data
library(ggplot2)
library(tidyverse)
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(repr)
library(MASS) #Data
library(MuMIn) #Model Averaging
Data yang digunakan merupakan harga rumah di Boston. Data ini terdiri dari 506 amatan yang mewakili data agregat 14 fitur untuk rumah dari berbagai pinggiran kota di Boston. Data ini juga dilengkapi dengan informasi seperti Kejahatan (CRIM), wilayah usaha non retail di dalam kota (INDUS), umur pemilik rumah (AGE), dan lain lain.
dat <- MASS::Boston #data harga perumahan di pinggiran Kota Boston
str(dat)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(dat)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Dilakukan partisi data untuk membagi menjadi data training dan data testing
set.seed(100)
index = sample(1:nrow(dat), 0.7*nrow(dat))
train = dat[index,] # Create the training data
test = dat[-index,] # Create the test data
dim(train)
## [1] 354 14
dim(test)
## [1] 152 14
Metode ini digunakan untuk menangani masalah multikolinieritas dengan memberikan estimator yang bias namun memberikan varian yang lebih kecil daripada metode kuadrat terkecil.
#creating custom Control Parameters
custom <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = F) #print iterasi
Data yang digunakan adalah data train dengan jumlah lipatan 10 dan akan diulang 5 kali.
#fitting Ridge Regression model
set.seed(1234)
ridge <- train(medv~.,train,
method="glmnet",
tuneGrid=expand.grid(alpha=0,lambda=seq(0.0001,1,length=10)),
trControl=custom)
ridge
## glmnet
##
## 354 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 318, 319, 318, 319, 319, 318, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0001 4.340939 0.7712224 3.102308
## 0.1112 4.340939 0.7712224 3.102308
## 0.2223 4.340939 0.7712224 3.102308
## 0.3334 4.340939 0.7712224 3.102308
## 0.4445 4.340939 0.7712224 3.102308
## 0.5556 4.340939 0.7712224 3.102308
## 0.6667 4.341042 0.7712146 3.102329
## 0.7778 4.348468 0.7706810 3.099810
## 0.8889 4.356811 0.7700437 3.097963
## 1.0000 4.365780 0.7693706 3.097781
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.5556.
Berdasarkan output di atas terdapat beberapa model regresi berbasis ridge dengan beberapa jenis lambda dengan cross validation 10 lipatan yang diulang 5 kali. Didapatkan alpha optimal sebesar 0 dan lambda optimal sebesar 0.50005.
#mean validation score
mean(ridge$resample$RMSE)
## [1] 4.340939
Dari beberapa jenis lambda yang dikeluarkan, didapatkan rata-rata RMSE sebesar 4.340939.
#plotting the model
plot(ridge, main = "Ridge Regression")
Dari plot di atas dapat dilihat bahwa nilai RMSE terkecil berada pada saat lambda di titik 0.5, sehingga lambda 0.5 merupakan lambda optimum.
#plotting important variables
plot(varImp(ridge,scale=F))
Dari plot di atas dapat dilihat bahwa variabel yang paling berpengaruh terhadap harga perumahan di Boston adalah peubah nox, diikuti dengan peubah rm, lalu chas.
#Ridge Regression with library(glmnet)
lambdas <- 10^seq(2,-3, by = -.1)
ridge_reg = glmnet(as.matrix(train[,-14]), train[,14], nlambda = 25, alpha = 0,family = 'gaussian', lambda = lambdas)
summary(ridge_reg)
## Length Class Mode
## a0 51 -none- numeric
## beta 663 dgCMatrix S4
## df 51 -none- numeric
## dim 2 -none- numeric
## lambda 51 -none- numeric
## dev.ratio 51 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 7 -none- call
## nobs 1 -none- numeric
ridge_reg
##
## Call: glmnet::glmnet(x = as.matrix(train[, -14]), y = train[, 14], nlambda = 25, alpha = 0, family = "gaussian", lambda = lambdas)
##
## Df %Dev Lambda
## 1 13 30.01 100.000
## 2 13 34.00 79.430
## 3 13 38.04 63.100
## 4 13 42.05 50.120
## 5 13 45.95 39.810
## 6 13 49.70 31.620
## 7 13 53.25 25.120
## 8 13 56.59 19.950
## 9 13 59.69 15.850
## 10 13 62.52 12.590
## 11 13 65.09 10.000
## 12 13 67.36 7.943
## 13 13 69.34 6.310
## 14 13 71.03 5.012
## 15 13 72.44 3.981
## 16 13 73.62 3.162
## 17 13 74.57 2.512
## 18 13 75.34 1.995
## 19 13 75.94 1.585
## 20 13 76.42 1.259
## 21 13 76.79 1.000
## 22 13 77.08 0.794
## 23 13 77.30 0.631
## 24 13 77.46 0.501
## 25 13 77.58 0.398
## 26 13 77.67 0.316
## 27 13 77.73 0.251
## 28 13 77.77 0.200
## 29 13 77.81 0.158
## 30 13 77.83 0.126
## 31 13 77.84 0.100
## 32 13 77.85 0.079
## 33 13 77.86 0.063
## 34 13 77.86 0.050
## 35 13 77.87 0.040
## 36 13 77.87 0.032
## 37 13 77.87 0.025
## 38 13 77.87 0.020
## 39 13 77.87 0.016
## 40 13 77.87 0.013
## 41 13 77.87 0.010
## 42 13 77.87 0.008
## 43 13 77.87 0.006
## 44 13 77.87 0.005
## 45 13 77.87 0.004
## 46 13 77.87 0.003
## 47 13 77.87 0.003
## 48 13 77.87 0.002
## 49 13 77.87 0.002
## 50 13 77.87 0.001
## 51 13 77.87 0.001
set.seed(100)
#tunning parameter lambda dg cv
cv_ridge <- cv.glmnet(as.matrix(train[,-14]), train[,14], alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min
optimal_lambda
## [1] 0.1258925
coef(cv_ridge, s="lambda.min")
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 33.512762193
## crim -0.116541536
## zn 0.045303128
## indus 0.015047943
## chas 2.864206507
## nox -10.422379184
## rm 3.659253006
## age -0.012198505
## dis -1.384865675
## rad 0.227556069
## tax -0.008280337
## ptratio -0.902912189
## black 0.008342041
## lstat -0.623529573
coef(cv_ridge, s=100)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 23.249596796
## crim -0.022001298
## zn 0.008746460
## indus -0.037653796
## chas 0.417908674
## nox -1.911639790
## rm 0.653628764
## age -0.007345027
## dis 0.053942947
## rad -0.019853504
## tax -0.001407383
## ptratio -0.143013850
## black 0.001933966
## lstat -0.064620516
Didapatkan lambda optimum sebesar 0.1258925
# Compute R^2 from true and predicted values
eval_results <- function(true, predicted, df) {
SSE <- sum((predicted - true)^2)
SST <- sum((true - mean(true))^2)
R_square <- 1 - SSE / SST
RMSE = sqrt(SSE/nrow(df))
# Model performance metrics
data.frame(RMSE = RMSE,
Rsquare = R_square)
}
# Prediction and evaluation on train data
predictions_train <- predict(ridge_reg, s = optimal_lambda, newx = as.matrix(train[,-14]))
eval_results(train[,14], predictions_train, train)
## RMSE Rsquare
## 1 4.186191 0.7782747
Pada data training dengan lambda terbaik, didapatkan RMSE sebesar 4.186191 dan R-square sebesar 0.7782747.
# Prediction and evaluation on test data
predictions_test <- predict(ridge_reg, s = optimal_lambda, newx = as.matrix(test[,-14]))
eval_results(test[,14], predictions_test, test)
## RMSE Rsquare
## 1 5.830073 0.6410751
Pada data testing dengan lambda terbaik, didapatkan RMSE sebesar 5.830073 dan R-square sebesar 0.6410751.
Metode Least Absolute Shrinkage and Selection Operator (LASSO) diperkenalkan pertama kali oleh Tibshirani pada tahun 1996. LASSO menyusutkan koefisien regresi dari variabel prediktor yang memiliki korelasi tinggi tinggi dengan galat, menjadi tepat pada nol atau mendekati nol.
lambdas <- 10^seq(2, -3, by = -.1)
# Setting alpha = 1 implements lasso regression
lasso_reg <- cv.glmnet(as.matrix(train[,-14]), train[,14], alpha = 1,
lambda = lambdas, standardize = TRUE, nfolds = 5)
# Best
lambda_best <- lasso_reg$lambda.min
lambda_best
## [1] 0.01995262
Didapatkan lambda terbaik sebesar 0.01995262.
lasso_model <- glmnet(as.matrix(train[,-14]), train[,14], alpha = 1,
lambda = lambda_best, standardize = TRUE)
predictions_train <- predict(lasso_model, s = lambda_best, newx = as.matrix(train[,-14]))
eval_results(train[,14], predictions_train, train)
## RMSE Rsquare
## 1 4.185119 0.7783883
Dengan menggunakan data training dengan lambda terbaik, diperoleh RMSE sebesar 4.185119 dan R-square sebesar 0.7783883.
predictions_test <- predict(lasso_model, s = lambda_best, newx = as.matrix(test[,-14]))
eval_results(test[,14], predictions_test, test)
## RMSE Rsquare
## 1 5.844554 0.6392898
Dengan menggunakan data testing dengan lambda terbaik, diperoleh RMSE sebesar 5.844554 dan R-square sebesar 0.6392898.
mod1 <- lm(medv~.,train, na.action = na.fail)
mod2 <- dredge(global.model=mod1, m.lim=c(10,13))
## Fixed term is "(Intercept)"
Minimal terdapat 10 peubah prediktor dan maksimal 13 peubah prediktor
mod3 <- model.avg(mod2, delta<4)
mod3
##
## Call:
## model.avg(object = mod2, subset = delta < 4)
##
## Component models:
## '2+3+4+5+7+8+9+10+11+12+13' '1+2+3+4+5+7+8+9+10+11+12+13'
## '2+3+4+5+6+7+8+9+10+11+12+13' '1+2+3+4+5+6+7+8+9+10+11+12+13'
##
## Coefficients:
## (Intercept) black chas crim dis lstat nox
## full 35.65021 0.008197076 2.769576 -0.122049 -1.44241 -0.6478746 -11.52519
## subset 35.65021 0.008197076 2.769576 -0.122049 -1.44241 -0.6478746 -11.52519
## ptratio rad rm tax zn age
## full -0.9215836 0.2651881 3.517981 -0.009768949 0.04945744 -0.003714323
## subset -0.9215836 0.2651881 3.517981 -0.009768949 0.04945744 -0.011705247
## indus
## full 0.009483636
## subset 0.033960012
summary(mod3)
##
## Call:
## model.avg(object = mod2, subset = delta < 4)
##
## Component model call:
## lm(formula = medv ~ <4 unique rhs>, data = train, na.action = na.fail)
##
## Component models:
## df logLik AICc delta weight
## 2+3+4+5+7+8+9+10+11+12+13 13 -1009.26 2045.59 0.00 0.49
## 1+2+3+4+5+7+8+9+10+11+12+13 14 -1008.94 2047.12 1.53 0.23
## 2+3+4+5+6+7+8+9+10+11+12+13 14 -1009.12 2047.48 1.89 0.19
## 1+2+3+4+5+6+7+8+9+10+11+12+13 15 -1008.80 2049.02 3.43 0.09
##
## Term codes:
## age black chas crim dis indus lstat nox ptratio rad
## 1 2 3 4 5 6 7 8 9 10
## rm tax zn
## 11 12 13
##
## Model-averaged coefficients:
## (full average)
## Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept) 35.650212 5.767077 5.787586 6.160 < 2e-16 ***
## black 0.008197 0.002781 0.002791 2.937 0.003318 **
## chas 2.769576 0.974071 0.977528 2.833 0.004608 **
## crim -0.122049 0.030904 0.031014 3.935 8.31e-05 ***
## dis -1.442410 0.210692 0.211425 6.822 < 2e-16 ***
## lstat -0.647875 0.055948 0.056144 11.540 < 2e-16 ***
## nox -11.525187 3.940808 3.954705 2.914 0.003565 **
## ptratio -0.921584 0.143636 0.144146 6.393 < 2e-16 ***
## rad 0.265188 0.067213 0.067451 3.932 8.44e-05 ***
## rm 3.517981 0.473696 0.475375 7.400 < 2e-16 ***
## tax -0.009769 0.003694 0.003707 2.636 0.008401 **
## zn 0.049457 0.014677 0.014729 3.358 0.000785 ***
## age -0.003714 0.010012 0.010037 0.370 0.711326
## indus 0.009484 0.037995 0.038109 0.249 0.803472
##
## (conditional average)
## Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept) 35.650212 5.767077 5.787586 6.160 < 2e-16 ***
## black 0.008197 0.002781 0.002791 2.937 0.003318 **
## chas 2.769576 0.974071 0.977528 2.833 0.004608 **
## crim -0.122049 0.030904 0.031014 3.935 8.31e-05 ***
## dis -1.442410 0.210692 0.211425 6.822 < 2e-16 ***
## lstat -0.647875 0.055948 0.056144 11.540 < 2e-16 ***
## nox -11.525187 3.940808 3.954705 2.914 0.003565 **
## ptratio -0.921584 0.143636 0.144146 6.393 < 2e-16 ***
## rad 0.265188 0.067213 0.067451 3.932 8.44e-05 ***
## rm 3.517981 0.473696 0.475375 7.400 < 2e-16 ***
## tax -0.009769 0.003694 0.003707 2.636 0.008401 **
## zn 0.049457 0.014677 0.014729 3.358 0.000785 ***
## age -0.011705 0.014911 0.014964 0.782 0.434076
## indus 0.033960 0.065866 0.066101 0.514 0.607419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Jika menggunakan full average, didapatkan beberapa peubah yang signifikan yaitu chas, crim, dis, lstat, nox, ptratio, rad, dan rm.
set.seed(123)
x <- cbind(1,matrix( rnorm(100*20,2,1),100,20))
e <- matrix( rnorm(100),100,1)
b <- c(1,rep(0:4,each=4))
y <- x%*%b+e
dt.all <- data.frame(y,x[, -1])
str( dt.all)
## 'data.frame': 100 obs. of 21 variables:
## $ y : num 71.6 64.3 64.7 100.1 90.3 ...
## $ X1 : num 1.44 1.77 3.56 2.07 2.13 ...
## $ X2 : num 1.29 2.26 1.75 1.65 1.05 ...
## $ X3 : num 4.2 3.31 1.73 2.54 1.59 ...
## $ X4 : num 1.285 1.247 1.061 0.947 1.563 ...
## $ X5 : num 1.926 0.831 1.365 1.971 2.671 ...
## $ X6 : num 1.398 1.006 3.027 2.751 0.491 ...
## $ X7 : num 3.074 1.973 1.967 0.484 2.79 ...
## $ X8 : num 1.272 0.46 1.307 2.119 0.635 ...
## $ X9 : num 2.36 1.34 2.86 3.15 2.28 ...
## $ X10: num 0.986 1.209 2.3 3.639 3.085 ...
## $ X11: num 1.004 0.96 1.982 1.868 -0.549 ...
## $ X12: num 2.916 2.801 1.063 0.599 2.16 ...
## $ X13: num 2.62 1.24 2.85 1.25 2.63 ...
## $ X14: num 1.25 1.678 0.852 2.354 2.425 ...
## $ X15: num 0.914 1.335 2.715 1.568 2.228 ...
## $ X16: num 1.18 1.69 1.1 2.63 3.12 ...
## $ X17: num 1.71 2.66 1.55 1.41 0.29 ...
## $ X18: num 1.81 1.53 -1.05 3.87 3.79 ...
## $ X19: num 0.711 1.345 1.943 3.257 3.587 ...
## $ X20: num 3.54 1.54 1.97 3.64 1.67 ...
mod1 <- lm( y~.,data=dt.all)
coef(mod1)
## (Intercept) X1 X2 X3 X4 X5
## 1.199813532 -0.073189937 -0.071293234 0.176350703 -0.006345167 1.015471670
## X6 X7 X8 X9 X10 X11
## 0.999708429 1.019633372 0.885459998 1.988429428 2.019416432 2.158248869
## X12 X13 X14 X15 X16 X17
## 2.013082122 2.867064930 3.029382418 3.046901211 2.970311314 3.954652929
## X18 X19 X20
## 3.951884182 4.035364936 3.853178398
#library( glmnet)
mod2 <- cv.glmnet(x[,-1], y, alpha=0) #pemilihan lambda dgn cv untuk ridge
mod2
##
## Call: glmnet::cv.glmnet(x = x[, -1], y = y, alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.4241 100 1.434 0.1982 20
## 1se 0.5108 98 1.543 0.2247 20
coef(mod2, s="lambda.min")
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 4.56512925
## V1 0.01864237
## V2 -0.14474607
## V3 0.13647799
## V4 0.09321257
## V5 1.01992103
## V6 1.02833345
## V7 1.03021554
## V8 0.87150470
## V9 1.90975375
## V10 1.89214977
## V11 2.12432856
## V12 1.99839789
## V13 2.69183965
## V14 2.86949361
## V15 2.91427164
## V16 2.72840558
## V17 3.73396794
## V18 3.70200292
## V19 3.88847919
## V20 3.67709282
mod3 <- cv.glmnet(x[, -1], y,alpha=1) #pemilihan lambda dgn cv untuk lambda
mod3
##
## Call: glmnet::cv.glmnet(x = x[, -1], y = y, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.01923 59 1.014 0.1213 19
## 1se 0.07763 44 1.134 0.1439 18
coef(mod3, s="lambda.min")
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.83533712
## V1 -0.03815930
## V2 -0.05393318
## V3 0.14314845
## V4 .
## V5 1.00548302
## V6 0.98429527
## V7 0.99965103
## V8 0.86434523
## V9 1.96683664
## V10 1.99554456
## V11 2.14570324
## V12 2.01162119
## V13 2.84395139
## V14 3.01020049
## V15 3.01257009
## V16 2.92811226
## V17 3.93506150
## V18 3.90997381
## V19 4.01843473
## V20 3.84020618
Berikut adalah beberapa library yang digunakan
library(tidyverse)
library(mlr3verse)
## Loading required package: mlr3
library(mlr3fselect)
library(DataExplorer)
library(skimr)
##
## Attaching package: 'skimr'
## The following object is masked from 'package:mlr3':
##
## partition
library(corrplot)
## corrplot 0.92 loaded
library(leaps)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
library(tidyverse)
library(dplyr)
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Data ini merupakan data harga rumah yang terdiri dari 81 peubah dan 1.460 data.
data_house <- read.csv("C:/0 SEM5/PSD/p14/house_price1.csv", stringsAsFactors = TRUE)
glimpse(data_house)
## Rows: 1,460
## Columns: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,…
## $ MSZoning <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, RL, RL, …
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, …
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612…
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pa…
## $ Alley <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ LotShape <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg, Reg, I…
## $ LandContour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, L…
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, …
## $ LotConfig <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside, Corner…
## $ LandSlope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ Neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mitchel, So…
## $ Condition1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN, Artery,…
## $ Condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Ar…
## $ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 2f…
## $ HouseStyle <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, 1Story, …
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,…
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,…
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19…
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19…
## $ RoofStyle <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable, Gable, …
## $ RoofMatl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompShg, Co…
## $ Exterior1st <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, VinylSd, Vi…
## $ Exterior2nd <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, VinylSd, Vi…
## $ MasVnrType <fct> BrkFace, None, BrkFace, None, BrkFace, None, Stone, Ston…
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, …
## $ ExterQual <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, Gd, …
## $ ExterCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ Foundation <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc, CBlock…
## $ BsmtQual <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, TA, Gd, …
## $ BsmtCond <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ BsmtExposure <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, No, Av, …
## $ BsmtFinType1 <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ, Rec, G…
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99…
## $ BsmtFinType2 <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf, Unf, U…
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17…
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10…
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Ga…
## $ HeatingQC <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, TA, Ex, …
## $ CentralAir <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,…
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, …
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, …
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,…
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10…
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,…
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,…
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,…
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ KitchenQual <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, Gd, …
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6…
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Typ, Typ, …
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,…
## $ FireplaceQu <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, NA, Gd, …
## $ GarageType <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, Attchd, …
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19…
## $ GarageFinish <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn, Unf, F…
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,…
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7…
## $ GarageQual <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, TA, TA, …
## $ GarageCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ PavedDrive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,…
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160…
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,…
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, …
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, …
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PoolQC <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Fence <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, NA, NA, N…
## $ MiscFeature <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, NA, NA, …
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,…
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10…
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20…
## $ SaleType <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New, WD, New…
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal, Normal,…
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, 307000, …
plot_intro(data = data_house,
geom_label_args = list(size=2.5))
Catatan:
plot_intro merupakan fungsi yang berasal dari package DataExplorer dan argumen utamanya adalah object berbentuk data.frame. argumen geom_label_args bisa diisi dengan opsi-opsi yang ada pada fungsi geom_label pada pacakge ggplot2.
skim_without_charts(data = data_house)
| Name | data_house |
| Number of rows | 1460 |
| Number of columns | 81 |
| _______________________ | |
| Column type frequency: | |
| factor | 43 |
| numeric | 38 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| MSZoning | 0 | 1.00 | FALSE | 5 | RL: 1151, RM: 218, FV: 65, RH: 16 |
| Street | 0 | 1.00 | FALSE | 2 | Pav: 1454, Grv: 6 |
| Alley | 1369 | 0.06 | FALSE | 2 | Grv: 50, Pav: 41 |
| LotShape | 0 | 1.00 | FALSE | 4 | Reg: 925, IR1: 484, IR2: 41, IR3: 10 |
| LandContour | 0 | 1.00 | FALSE | 4 | Lvl: 1311, Bnk: 63, HLS: 50, Low: 36 |
| Utilities | 0 | 1.00 | FALSE | 2 | All: 1459, NoS: 1 |
| LotConfig | 0 | 1.00 | FALSE | 5 | Ins: 1052, Cor: 263, Cul: 94, FR2: 47 |
| LandSlope | 0 | 1.00 | FALSE | 3 | Gtl: 1382, Mod: 65, Sev: 13 |
| Neighborhood | 0 | 1.00 | FALSE | 25 | NAm: 225, Col: 150, Old: 113, Edw: 100 |
| Condition1 | 0 | 1.00 | FALSE | 9 | Nor: 1260, Fee: 81, Art: 48, RRA: 26 |
| Condition2 | 0 | 1.00 | FALSE | 8 | Nor: 1445, Fee: 6, Art: 2, Pos: 2 |
| BldgType | 0 | 1.00 | FALSE | 5 | 1Fa: 1220, Twn: 114, Dup: 52, Twn: 43 |
| HouseStyle | 0 | 1.00 | FALSE | 8 | 1St: 726, 2St: 445, 1.5: 154, SLv: 65 |
| RoofStyle | 0 | 1.00 | FALSE | 6 | Gab: 1141, Hip: 286, Fla: 13, Gam: 11 |
| RoofMatl | 0 | 1.00 | FALSE | 8 | Com: 1434, Tar: 11, WdS: 6, WdS: 5 |
| Exterior1st | 0 | 1.00 | FALSE | 15 | Vin: 515, HdB: 222, Met: 220, Wd : 206 |
| Exterior2nd | 0 | 1.00 | FALSE | 16 | Vin: 504, Met: 214, HdB: 207, Wd : 197 |
| MasVnrType | 8 | 0.99 | FALSE | 4 | Non: 864, Brk: 445, Sto: 128, Brk: 15 |
| ExterQual | 0 | 1.00 | FALSE | 4 | TA: 906, Gd: 488, Ex: 52, Fa: 14 |
| ExterCond | 0 | 1.00 | FALSE | 5 | TA: 1282, Gd: 146, Fa: 28, Ex: 3 |
| Foundation | 0 | 1.00 | FALSE | 6 | PCo: 647, CBl: 634, Brk: 146, Sla: 24 |
| BsmtQual | 37 | 0.97 | FALSE | 4 | TA: 649, Gd: 618, Ex: 121, Fa: 35 |
| BsmtCond | 37 | 0.97 | FALSE | 4 | TA: 1311, Gd: 65, Fa: 45, Po: 2 |
| BsmtExposure | 38 | 0.97 | FALSE | 4 | No: 953, Av: 221, Gd: 134, Mn: 114 |
| BsmtFinType1 | 37 | 0.97 | FALSE | 6 | Unf: 430, GLQ: 418, ALQ: 220, BLQ: 148 |
| BsmtFinType2 | 38 | 0.97 | FALSE | 6 | Unf: 1256, Rec: 54, LwQ: 46, BLQ: 33 |
| Heating | 0 | 1.00 | FALSE | 6 | Gas: 1428, Gas: 18, Gra: 7, Wal: 4 |
| HeatingQC | 0 | 1.00 | FALSE | 5 | Ex: 741, TA: 428, Gd: 241, Fa: 49 |
| CentralAir | 0 | 1.00 | FALSE | 2 | Y: 1365, N: 95 |
| Electrical | 1 | 1.00 | FALSE | 5 | SBr: 1334, Fus: 94, Fus: 27, Fus: 3 |
| KitchenQual | 0 | 1.00 | FALSE | 4 | TA: 735, Gd: 586, Ex: 100, Fa: 39 |
| Functional | 0 | 1.00 | FALSE | 7 | Typ: 1360, Min: 34, Min: 31, Mod: 15 |
| FireplaceQu | 690 | 0.53 | FALSE | 5 | Gd: 380, TA: 313, Fa: 33, Ex: 24 |
| GarageType | 81 | 0.94 | FALSE | 6 | Att: 870, Det: 387, Bui: 88, Bas: 19 |
| GarageFinish | 81 | 0.94 | FALSE | 3 | Unf: 605, RFn: 422, Fin: 352 |
| GarageQual | 81 | 0.94 | FALSE | 5 | TA: 1311, Fa: 48, Gd: 14, Ex: 3 |
| GarageCond | 81 | 0.94 | FALSE | 5 | TA: 1326, Fa: 35, Gd: 9, Po: 7 |
| PavedDrive | 0 | 1.00 | FALSE | 3 | Y: 1340, N: 90, P: 30 |
| PoolQC | 1453 | 0.00 | FALSE | 3 | Gd: 3, Ex: 2, Fa: 2 |
| Fence | 1179 | 0.19 | FALSE | 4 | MnP: 157, GdP: 59, GdW: 54, MnW: 11 |
| MiscFeature | 1406 | 0.04 | FALSE | 4 | She: 49, Gar: 2, Oth: 2, Ten: 1 |
| SaleType | 0 | 1.00 | FALSE | 9 | WD: 1267, New: 122, COD: 43, Con: 9 |
| SaleCondition | 0 | 1.00 | FALSE | 6 | Nor: 1198, Par: 125, Abn: 101, Fam: 20 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 730.50 | 421.61 | 1 | 365.75 | 730.5 | 1095.25 | 1460 |
| MSSubClass | 0 | 1.00 | 56.90 | 42.30 | 20 | 20.00 | 50.0 | 70.00 | 190 |
| LotFrontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 |
| LotArea | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 |
| OverallQual | 0 | 1.00 | 6.10 | 1.38 | 1 | 5.00 | 6.0 | 7.00 | 10 |
| OverallCond | 0 | 1.00 | 5.58 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 |
| YearBuilt | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 |
| YearRemodAdd | 0 | 1.00 | 1984.87 | 20.65 | 1950 | 1967.00 | 1994.0 | 2004.00 | 2010 |
| MasVnrArea | 8 | 0.99 | 103.69 | 181.07 | 0 | 0.00 | 0.0 | 166.00 | 1600 |
| BsmtFinSF1 | 0 | 1.00 | 443.64 | 456.10 | 0 | 0.00 | 383.5 | 712.25 | 5644 |
| BsmtFinSF2 | 0 | 1.00 | 46.55 | 161.32 | 0 | 0.00 | 0.0 | 0.00 | 1474 |
| BsmtUnfSF | 0 | 1.00 | 567.24 | 441.87 | 0 | 223.00 | 477.5 | 808.00 | 2336 |
| TotalBsmtSF | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 |
| X1stFlrSF | 0 | 1.00 | 1162.63 | 386.59 | 334 | 882.00 | 1087.0 | 1391.25 | 4692 |
| X2ndFlrSF | 0 | 1.00 | 346.99 | 436.53 | 0 | 0.00 | 0.0 | 728.00 | 2065 |
| LowQualFinSF | 0 | 1.00 | 5.84 | 48.62 | 0 | 0.00 | 0.0 | 0.00 | 572 |
| GrLivArea | 0 | 1.00 | 1515.46 | 525.48 | 334 | 1129.50 | 1464.0 | 1776.75 | 5642 |
| BsmtFullBath | 0 | 1.00 | 0.43 | 0.52 | 0 | 0.00 | 0.0 | 1.00 | 3 |
| BsmtHalfBath | 0 | 1.00 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 |
| FullBath | 0 | 1.00 | 1.57 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 |
| HalfBath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 |
| BedroomAbvGr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 |
| KitchenAbvGr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 |
| TotRmsAbvGrd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 |
| Fireplaces | 0 | 1.00 | 0.61 | 0.64 | 0 | 0.00 | 1.0 | 1.00 | 3 |
| GarageYrBlt | 81 | 0.94 | 1978.51 | 24.69 | 1900 | 1961.00 | 1980.0 | 2002.00 | 2010 |
| GarageCars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 |
| GarageArea | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 |
| WoodDeckSF | 0 | 1.00 | 94.24 | 125.34 | 0 | 0.00 | 0.0 | 168.00 | 857 |
| OpenPorchSF | 0 | 1.00 | 46.66 | 66.26 | 0 | 0.00 | 25.0 | 68.00 | 547 |
| EnclosedPorch | 0 | 1.00 | 21.95 | 61.12 | 0 | 0.00 | 0.0 | 0.00 | 552 |
| X3SsnPorch | 0 | 1.00 | 3.41 | 29.32 | 0 | 0.00 | 0.0 | 0.00 | 508 |
| ScreenPorch | 0 | 1.00 | 15.06 | 55.76 | 0 | 0.00 | 0.0 | 0.00 | 480 |
| PoolArea | 0 | 1.00 | 2.76 | 40.18 | 0 | 0.00 | 0.0 | 0.00 | 738 |
| MiscVal | 0 | 1.00 | 43.49 | 496.12 | 0 | 0.00 | 0.0 | 0.00 | 15500 |
| MoSold | 0 | 1.00 | 6.32 | 2.70 | 1 | 5.00 | 6.0 | 8.00 | 12 |
| YrSold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 |
| SalePrice | 0 | 1.00 | 180921.20 | 79442.50 | 34900 | 129975.00 | 163000.0 | 214000.00 | 755000 |
Berdasarkan hasil syntax diatas, dapat diketahui bahwa terdapat beberapa kolom yang memiliki missing value. Terdapat 5 kolom yang mengalami banyak missing value, yaitu Alley, FireplaceQu, PoolQX, Fence, dan MiscFeature.
Catatan:
skim_without_charts merupakan fungsi yang berasal dari package skimr dan argumen utamanya adalah object berbentuk data.frame. Berdasarkan informasi diatas, kita tahu terdata beberapa kolom yang memiliki missing value. Namun hanya 5 kolom saja yang mengalami banyak missing value yaitu Alley,FireplaceQu,PoolQX,Fence,MiscFeature.
Dalam kasus ini kita akan menangani missing value dengan dua cara, yaitu 1. Mereplace missing value pada kolom-kolom yang memiliki banyak sekali missing value (diatas 500) 2. Menghapus baris-baris yang mengandung missing value pada kolom-kolom yang memiliki sedikti missing value (dibawah 500)
Berikut dibawah ini adalah sintaks untuk melakukan replace missing value, khususnya jika datanya berupa factor atau string. Kemudian na.omit digunakan untuk menghapus semua baris yang mengandung missing value
data_house1 <- data_house %>%
dplyr::select(-Id) %>%
mutate(
Alley = forcats :: fct_explicit_na(Alley, na_level = "ukn"),
FireplaceQu = forcats :: fct_explicit_na(FireplaceQu, na_level = "ukn"),
PoolQC = forcats :: fct_explicit_na(PoolQC, na_level = "ukn"),
Fence = forcats:: fct_explicit_na(Fence, na_level = "ukn"),
MiscFeature = forcats :: fct_explicit_na(MiscFeature, na_level = "ukn")) %>% na.omit
Kemudian kita akan lihat kembali data yang sudah kita tangani missing valuenya
plot_intro(data = data_house1)
skim_without_charts(data_house1)
| Name | data_house1 |
| Number of rows | 1094 |
| Number of columns | 80 |
| _______________________ | |
| Column type frequency: | |
| factor | 43 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| MSZoning | 0 | 1 | FALSE | 5 | RL: 850, RM: 173, FV: 54, RH: 9 |
| Street | 0 | 1 | FALSE | 2 | Pav: 1090, Grv: 4 |
| Alley | 0 | 1 | FALSE | 3 | ukn: 1017, Grv: 41, Pav: 36 |
| LotShape | 0 | 1 | FALSE | 4 | Reg: 760, IR1: 301, IR2: 26, IR3: 7 |
| LandContour | 0 | 1 | FALSE | 4 | Lvl: 991, Bnk: 45, HLS: 44, Low: 14 |
| Utilities | 0 | 1 | FALSE | 1 | All: 1094, NoS: 0 |
| LotConfig | 0 | 1 | FALSE | 5 | Ins: 830, Cor: 187, Cul: 44, FR2: 29 |
| LandSlope | 0 | 1 | FALSE | 3 | Gtl: 1045, Mod: 44, Sev: 5 |
| Neighborhood | 0 | 1 | FALSE | 25 | NAm: 173, Col: 122, Old: 96, Som: 75 |
| Condition1 | 0 | 1 | FALSE | 9 | Nor: 950, Fee: 52, Art: 42, RRA: 24 |
| Condition2 | 0 | 1 | FALSE | 6 | Nor: 1082, Fee: 5, Art: 2, Pos: 2 |
| BldgType | 0 | 1 | FALSE | 5 | 1Fa: 925, Twn: 90, Twn: 35, Dup: 24 |
| HouseStyle | 0 | 1 | FALSE | 8 | 1St: 540, 2St: 346, 1.5: 117, SLv: 43 |
| RoofStyle | 0 | 1 | FALSE | 5 | Gab: 843, Hip: 230, Gam: 10, Man: 6 |
| RoofMatl | 0 | 1 | FALSE | 7 | Com: 1078, WdS: 6, Tar: 5, WdS: 2 |
| Exterior1st | 0 | 1 | FALSE | 14 | Vin: 421, Met: 172, HdB: 151, Wd : 149 |
| Exterior2nd | 0 | 1 | FALSE | 16 | Vin: 412, Met: 169, Wd : 145, HdB: 138 |
| MasVnrType | 0 | 1 | FALSE | 4 | Non: 639, Brk: 327, Sto: 119, Brk: 9 |
| ExterQual | 0 | 1 | FALSE | 4 | TA: 646, Gd: 395, Ex: 46, Fa: 7 |
| ExterCond | 0 | 1 | FALSE | 4 | TA: 973, Gd: 104, Fa: 15, Ex: 2 |
| Foundation | 0 | 1 | FALSE | 5 | PCo: 518, CBl: 446, Brk: 122, Sto: 6 |
| BsmtQual | 0 | 1 | FALSE | 4 | TA: 486, Gd: 463, Ex: 113, Fa: 32 |
| BsmtCond | 0 | 1 | FALSE | 4 | TA: 1006, Gd: 51, Fa: 36, Po: 1 |
| BsmtExposure | 0 | 1 | FALSE | 4 | No: 734, Av: 174, Gd: 97, Mn: 89 |
| BsmtFinType1 | 0 | 1 | FALSE | 6 | Unf: 343, GLQ: 323, ALQ: 162, BLQ: 105 |
| BsmtFinType2 | 0 | 1 | FALSE | 6 | Unf: 972, Rec: 37, LwQ: 35, BLQ: 25 |
| Heating | 0 | 1 | FALSE | 4 | Gas: 1075, Gas: 16, Gra: 2, Oth: 1 |
| HeatingQC | 0 | 1 | FALSE | 5 | Ex: 594, TA: 298, Gd: 174, Fa: 27 |
| CentralAir | 0 | 1 | FALSE | 2 | Y: 1036, N: 58 |
| Electrical | 0 | 1 | FALSE | 5 | SBr: 1009, Fus: 67, Fus: 15, Fus: 2 |
| KitchenQual | 0 | 1 | FALSE | 4 | TA: 528, Gd: 454, Ex: 91, Fa: 21 |
| Functional | 0 | 1 | FALSE | 6 | Typ: 1024, Min: 25, Min: 21, Maj: 10 |
| FireplaceQu | 0 | 1 | FALSE | 6 | ukn: 511, Gd: 315, TA: 212, Fa: 24 |
| GarageType | 0 | 1 | FALSE | 6 | Att: 680, Det: 325, Bui: 63, Bas: 15 |
| GarageFinish | 0 | 1 | FALSE | 3 | Unf: 485, RFn: 333, Fin: 276 |
| GarageQual | 0 | 1 | FALSE | 5 | TA: 1031, Fa: 46, Gd: 11, Ex: 3 |
| GarageCond | 0 | 1 | FALSE | 5 | TA: 1050, Fa: 31, Po: 6, Gd: 5 |
| PavedDrive | 0 | 1 | FALSE | 3 | Y: 1023, N: 48, P: 23 |
| PoolQC | 0 | 1 | FALSE | 4 | ukn: 1088, Ex: 2, Fa: 2, Gd: 2 |
| Fence | 0 | 1 | FALSE | 5 | ukn: 882, MnP: 117, GdP: 46, GdW: 39 |
| MiscFeature | 0 | 1 | FALSE | 4 | ukn: 1059, She: 33, Oth: 1, Ten: 1 |
| SaleType | 0 | 1 | FALSE | 9 | WD: 928, New: 116, COD: 31, Con: 5 |
| SaleCondition | 0 | 1 | FALSE | 6 | Nor: 880, Par: 119, Abn: 70, Fam: 18 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| MSSubClass | 0 | 1 | 56.13 | 41.98 | 20 | 20.00 | 50.0 | 70.00 | 190 |
| LotFrontage | 0 | 1 | 70.76 | 24.51 | 21 | 60.00 | 70.0 | 80.00 | 313 |
| LotArea | 0 | 1 | 10132.35 | 8212.25 | 1300 | 7606.75 | 9444.5 | 11387.25 | 215245 |
| OverallQual | 0 | 1 | 6.25 | 1.37 | 2 | 5.00 | 6.0 | 7.00 | 10 |
| OverallCond | 0 | 1 | 5.58 | 1.07 | 2 | 5.00 | 5.0 | 6.00 | 9 |
| YearBuilt | 0 | 1 | 1972.41 | 31.19 | 1880 | 1953.00 | 1975.0 | 2003.00 | 2010 |
| YearRemodAdd | 0 | 1 | 1985.92 | 20.93 | 1950 | 1967.00 | 1995.0 | 2005.00 | 2010 |
| MasVnrArea | 0 | 1 | 109.86 | 190.67 | 0 | 0.00 | 0.0 | 171.75 | 1600 |
| BsmtFinSF1 | 0 | 1 | 448.19 | 468.73 | 0 | 0.00 | 384.5 | 712.75 | 5644 |
| BsmtFinSF2 | 0 | 1 | 45.25 | 159.08 | 0 | 0.00 | 0.0 | 0.00 | 1474 |
| BsmtUnfSF | 0 | 1 | 606.12 | 445.83 | 0 | 270.00 | 525.0 | 846.00 | 2336 |
| TotalBsmtSF | 0 | 1 | 1099.56 | 415.85 | 105 | 816.00 | 1023.0 | 1345.50 | 6110 |
| X1stFlrSF | 0 | 1 | 1173.81 | 387.68 | 438 | 894.00 | 1097.0 | 1413.50 | 4692 |
| X2ndFlrSF | 0 | 1 | 356.54 | 439.26 | 0 | 0.00 | 0.0 | 729.00 | 2065 |
| LowQualFinSF | 0 | 1 | 4.68 | 42.10 | 0 | 0.00 | 0.0 | 0.00 | 572 |
| GrLivArea | 0 | 1 | 1535.03 | 526.12 | 438 | 1164.00 | 1480.0 | 1779.00 | 5642 |
| BsmtFullBath | 0 | 1 | 0.42 | 0.51 | 0 | 0.00 | 0.0 | 1.00 | 2 |
| BsmtHalfBath | 0 | 1 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 |
| FullBath | 0 | 1 | 1.58 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 |
| HalfBath | 0 | 1 | 0.39 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 |
| BedroomAbvGr | 0 | 1 | 2.86 | 0.76 | 0 | 2.00 | 3.0 | 3.00 | 6 |
| KitchenAbvGr | 0 | 1 | 1.03 | 0.19 | 1 | 1.00 | 1.0 | 1.00 | 3 |
| TotRmsAbvGrd | 0 | 1 | 6.57 | 1.58 | 3 | 5.00 | 6.0 | 7.00 | 12 |
| Fireplaces | 0 | 1 | 0.61 | 0.63 | 0 | 0.00 | 1.0 | 1.00 | 3 |
| GarageYrBlt | 0 | 1 | 1978.57 | 25.93 | 1900 | 1960.00 | 1982.0 | 2003.00 | 2010 |
| GarageCars | 0 | 1 | 1.88 | 0.66 | 1 | 1.00 | 2.0 | 2.00 | 4 |
| GarageArea | 0 | 1 | 503.76 | 192.26 | 160 | 360.00 | 484.0 | 602.50 | 1418 |
| WoodDeckSF | 0 | 1 | 94.34 | 122.62 | 0 | 0.00 | 0.0 | 169.75 | 857 |
| OpenPorchSF | 0 | 1 | 46.95 | 64.82 | 0 | 0.00 | 28.0 | 68.00 | 547 |
| EnclosedPorch | 0 | 1 | 22.05 | 61.57 | 0 | 0.00 | 0.0 | 0.00 | 552 |
| X3SsnPorch | 0 | 1 | 3.27 | 29.66 | 0 | 0.00 | 0.0 | 0.00 | 508 |
| ScreenPorch | 0 | 1 | 16.50 | 58.46 | 0 | 0.00 | 0.0 | 0.00 | 480 |
| PoolArea | 0 | 1 | 3.01 | 40.71 | 0 | 0.00 | 0.0 | 0.00 | 648 |
| MiscVal | 0 | 1 | 23.55 | 167.14 | 0 | 0.00 | 0.0 | 0.00 | 2500 |
| MoSold | 0 | 1 | 6.34 | 2.69 | 1 | 5.00 | 6.0 | 8.00 | 12 |
| YrSold | 0 | 1 | 2007.79 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 |
| SalePrice | 0 | 1 | 187033.26 | 83165.33 | 35311 | 132500.00 | 165750.0 | 221000.00 | 755000 |
Setelah dilihat kembali ternyata ada kolom yang hanya memiliki satu kategori saja yaitu kolom Utilities. Sehingga kita perlu menghapusnya.
data_house1 <- data_house1 %>%
dplyr::select(-Utilities)
plot_histogram(data = data_house1,nrow=3,ncol = 3,
geom_histogram_args = list(fill="steelblue")
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Pada histogram-histogram di atas dapat dilihat sebaran-sebaran data. Terdapat berbagai sebaran yang terlihat. Pada variabel SalePrice, Lotfrontage, BsmtUnfSF, GrLivArea, terlihat data menyebar ke kanan. Pada variabel GarageYrBlt, YearBuilt, dan YearRemodAdd terlihat data menyebar ke kiri.
plot_bar(data = data_house1)
Dari barplot di atas dapat dilihat frekuensi amatan dari peubah-peubah pada data.
plot_scatterplot(data = data_house1 %>%
select_if(is.numeric),
by="SalePrice",geom_point_args = list(color="steelblue") )
Pada scatterplot-scatterplot di atas dapat dilihat korelasi peubah SalePrice dengan peubah-peubah lainnya. Terdapat peubah-peubah yang memiliki korelasi kuat dengan peubah SalePrice, seperti peubah GrLivArea. Terdapat pula peubah-peubah yang memiliki korelasi lemah dengan peubah SalePrice.
cor_mat <- cor(data_house1%>%
select_if(is.numeric),method = "spearman")
cor_mat[upper.tri(cor_mat,diag = TRUE)] <- NA
cor_df <- cor_mat %>%
as.data.frame() %>%
rownames_to_column(var = "Var1") %>%
pivot_longer(names_to = "Var2",
values_to = "corr",
-Var1) %>% na.omit
cor_df %>% filter(abs(corr)>0.6) %>% arrange(desc(abs(corr)))
## # A tibble: 31 × 3
## Var1 Var2 corr
## <chr> <chr> <dbl>
## 1 GarageYrBlt YearBuilt 0.895
## 2 X1stFlrSF TotalBsmtSF 0.877
## 3 GarageArea GarageCars 0.841
## 4 TotRmsAbvGrd GrLivArea 0.829
## 5 SalePrice OverallQual 0.823
## 6 GarageYrBlt YearRemodAdd 0.747
## 7 YearRemodAdd YearBuilt 0.738
## 8 SalePrice GrLivArea 0.731
## 9 SalePrice GarageCars 0.681
## 10 SalePrice FullBath 0.671
## # … with 21 more rows
Pada matriks korelasi di atas dapat dilihat berbagai korelasi antar peubah. Pada Matriks Korelasi di atas ditampilkan peubah-peubah yang memiliki korelasi di atas 0.6. Dapat dilihat korelasi yang tinggi antara peubah GarageYrBlt dan peubah YearBuilt yaitu sebesar 0.895.
cor_df %>% filter(abs(corr)<=0.6)
## # A tibble: 635 × 3
## Var1 Var2 corr
## <chr> <chr> <dbl>
## 1 LotFrontage MSSubClass -0.313
## 2 LotArea MSSubClass -0.255
## 3 OverallQual MSSubClass 0.0992
## 4 OverallQual LotFrontage 0.238
## 5 OverallQual LotArea 0.283
## 6 OverallCond MSSubClass -0.0763
## 7 OverallCond LotFrontage -0.0693
## 8 OverallCond LotArea -0.0873
## 9 OverallCond OverallQual -0.264
## 10 YearBuilt MSSubClass -0.00468
## # … with 625 more rows
Pada matriks korelasi di atas dapat dilihat berbagai korelasi antar peubah. Pada Matriks Korelasi di atas ditampilkan peubah-peubah yang memiliki korelasi di bawah 0.6. Dapat dilihat korelasi yang rendah antara peubah YearBuilt dan peubah MSSubClass yaitu sebesar -0.00468.
plot_bar(data = data_house1,with="SalePrice")
Dari barplot di atas dapat dilihat besarnya SalePrice dari peubah-peubah pada data.
Metode yang digunakan untuk seleksi peubah adalah regresi linear.
regresi1 <- lm(formula = SalePrice~.,data = data_house1)
broom::tidy(regresi1) %>% mutate(across(where(is.numeric), ~ format(round(.x,3), big.mark=",", scientific=F)))
## # A tibble: 239 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3644762. 1362175. -2.68 0.00760
## 2 MSSubClass -3.79 128. -0.0295 0.977
## 3 MSZoningFV 41403. 14265. 2.90 0.00380
## 4 MSZoningRH 28191. 15384. 1.83 0.0672
## 5 MSZoningRL 27117. 12657. 2.14 0.0324
## 6 MSZoningRM 23666. 11795. 2.01 0.0451
## 7 LotFrontage 77.8 57.4 1.36 0.176
## 8 LotArea 0.819 0.154 5.31 0.000000143
## 9 StreetPave 53484. 20624. 2.59 0.00967
## 10 AlleyPave -4026. 7366. -0.547 0.585
## # … with 229 more rows
best_forward <- regsubsets(SalePrice~.,data = data_house1,method = "forward",nvmax = 78)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 15 linear dependencies found
## Reordering variables and trying again:
summ_forward <- summary(best_forward)
result_gof <- data.frame(
Adj.R2 = c(which.max(summ_forward$adjr2),max(summ_forward$adjr2)),
CP = c(which.min(summ_forward$cp),min(summ_forward$cp)),
BIC = c(which.min(summ_forward$bic),min(summ_forward$bic))
)
result_gof
## Adj.R2 CP BIC
## 1 79.0000000 79.000000 41.000
## 2 0.9243032 5.802693 -2429.285
coef_forward <- coef(best_forward,result_gof$BIC[1])
enframe(coef_forward) %>% mutate(across(where(is.numeric), ~ format(round(.x,3), big.mark=",", scientific=F)))
## # A tibble: 42 × 2
## name value
## <chr> <dbl>
## 1 (Intercept) -1750371.
## 2 MSSubClass -142.
## 3 LotArea 1.82
## 4 LotConfigCulDSac 25170.
## 5 LandSlopeSev -76949.
## 6 NeighborhoodCrawfor 35948.
## 7 NeighborhoodNoRidge 106177.
## 8 NeighborhoodNridgHt 87107.
## 9 NeighborhoodSomerst 37153.
## 10 NeighborhoodStoneBr 90695.
## # … with 32 more rows
plot_forward <- data.frame(subset=seq_along(summ_forward$bic),
BIC = summ_forward$bic
)
ggplot(plot_forward,aes(subset,BIC))+
geom_line(size=1)+
geom_point(color="darkgreen",size=2)+
geom_vline(aes(xintercept=result_gof$BIC[1]),
color="orange",size=1.2)+
theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
Dengan metode forward didapatkan 41 peubah penjelas yang berpengaruh nyata terhadap y. Didapatkan Adj R Squared yang cukup besar yaitu 0.9243032.
best_backward <- regsubsets(SalePrice~.,data = data_house1,method = "backward",nvmax = 78)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 15 linear dependencies found
## Reordering variables and trying again:
summ_backward <- summary(best_backward)
result_gof2 <- data.frame(
Adj.R2 = c(which.max(summ_backward$adjr2),max(summ_backward$adjr2)),
CP = c(which.min(summ_backward$cp),min(summ_backward$cp)),
BIC = c(which.min(summ_backward$bic),min(summ_backward$bic))
)
result_gof2
## Adj.R2 CP BIC
## 1 79.0000000 79.00000 46.000
## 2 0.9228913 23.33181 -2432.324
coef_backward <- coef(best_backward,result_gof2$BIC[1])
enframe(coef_backward) %>% mutate(across(where(is.numeric), ~ format(round(.x,3), big.mark=",", scientific=F)))
## # A tibble: 47 × 2
## name value
## <chr> <dbl>
## 1 (Intercept) -1592018.
## 2 MSZoningFV 26696.
## 3 LotArea 1.40
## 4 StreetPave 13193.
## 5 LotConfigCulDSac 26339.
## 6 LandSlopeSev -46624.
## 7 NeighborhoodCrawfor 26811.
## 8 NeighborhoodEdwards -18847.
## 9 NeighborhoodMitchel -15130.
## 10 NeighborhoodNAmes -4774.
## # … with 37 more rows
plot_backward <- data.frame(subset=seq_along(summ_backward$bic),
BIC = summ_backward$bic
)
ggplot(plot_backward,aes(subset,BIC))+
geom_line(size=1)+
geom_point(color="darkgreen",size=2)+
geom_vline(aes(xintercept=result_gof$BIC[1]),
color="orange",size=1.2)+
theme_bw()
Dengan metode backward didapatkan hasil yang sama dengan metode forward yaitu dengan 41 peubah penjelas. Didapatkan Adj R Squared yang cukup besar yaitu 0.9228913.
Berikut adalah beberapa library yang digunakan
library("GGally")
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library("ggplot2")
library("dplyr")
library("factoextra")
library("readxl")
library("ggpubr")
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:plyr':
##
## mutate
Data yang digunakan adalah data sekolah yang berisi 9 peubah dan 27 data.
data <- read.csv("C:/0 SEM5/PSD/p14/DataSekolah.csv", sep=";")
tidyr::tibble(data)
## # A tibble: 27 × 9
## Wilayah standar_isi standar…¹ stand…² stand…³ stand…⁴ stand…⁵ stand…⁶ stand…⁷
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 95.7 93.8 93.6 80.4 87.2 93.5 95.0 94.1
## 2 2 96.3 94.5 93.3 79.8 83.6 94.2 94.4 94.4
## 3 3 96.2 95.5 94.2 78.3 79 95.5 92.2 95.8
## 4 4 94.6 92.2 90.5 76.9 79.8 92.2 95.3 93.4
## 5 5 95.3 93 92.2 76.3 77.7 91.7 93.7 94.2
## 6 6 95.7 92.5 92.6 75.5 76.0 90.2 95.0 93
## 7 7 96 93.8 94.4 86 88.2 93.8 94.8 95.1
## 8 8 95.3 94.7 94 84.9 88.1 94 97.3 95.8
## 9 9 97 95.4 94 79.6 87.6 93.9 95.1 95.7
## 10 10 95 96 95 82 82.5 92 96 90
## # … with 17 more rows, and abbreviated variable names ¹standar_proses,
## # ²standar_kompetensi, ³standar_pendidik, ⁴standar_sarana,
## # ⁵standar_pengelolaan, ⁶standar_biaya, ⁷standar_nilai
Selanjutnya dilakukan standardisasi peubah yang merupakan proses transformasi peubah agar memiliki rata-rata nol dan simpangan baku satu. Standarisasi data dilakukan apabila melihat perbedaan satuan pengukuran peubah-peubah yang digunakan.Pada data sekolah dilakukan standarisasi skala karena skala satuan berbeda menggunakan konsep jarak.
data.ok = data[,-1]
head(data.ok)
## standar_isi standar_proses standar_kompetensi standar_pendidik standar_sarana
## 1 95.72 93.82 93.65 80.39 87.21
## 2 96.31 94.52 93.31 79.76 83.62
## 3 96.17 95.50 94.17 78.33 79.00
## 4 94.63 92.20 90.54 76.88 79.83
## 5 95.34 93.00 92.21 76.28 77.66
## 6 95.67 92.48 92.57 75.48 76.05
## standar_pengelolaan standar_biaya standar_nilai
## 1 93.46 95.04 94.11
## 2 94.24 94.45 94.38
## 3 95.50 92.17 95.83
## 4 92.22 95.27 93.44
## 5 91.72 93.72 94.17
## 6 90.19 95.05 93.00
data.stdz = scale(data.ok)
apply(data.stdz, 2, mean) #cek mean = 0
## standar_isi standar_proses standar_kompetensi standar_pendidik
## 3.180569e-15 6.868605e-16 -2.178426e-15 -7.314739e-16
## standar_sarana standar_pengelolaan standar_biaya standar_nilai
## 3.829086e-16 3.070609e-15 3.268383e-16 2.322203e-16
apply(data.stdz, 2, sd) # cek
## standar_isi standar_proses standar_kompetensi standar_pendidik
## 1 1 1 1
## standar_sarana standar_pengelolaan standar_biaya standar_nilai
## 1 1 1 1
Analisis gerombol merupakan analisis yang bertujuan untuk mencari pola pengelompokkan dari suatu objek pada data peubah ganda. Analisis gerombol terbagi menjadi dua yaitu berhirarki dan tak berhirarki.Analisis gerombol berhirarki mempunyai karkateristik jumlah gerombol (k) yang tidak ditentukan terlebih dahulu sebaliknya gerombol tak berhirarki jumlah k perlu ditentukan terlebih dahulu.
Metode penggerombolan berhirarki digunakan apabila banyak gerombol yang akan dibentuk belum diketahui sebelumnya. Dalam metode berhirarki terdapat beberapa ukuran jarak antar gerombol, antara lain metode pautan tunggal (single linkage), pautan lengkap (complete linkage), pautan rataan (average linkage), metode ward, dan metode centroid. Fungsi jarak yang sering digunakan diantaranya jarak Euclidean dan jarak Mahalanobis.
Untuk menentukan cluster, pada metode hierarki dapat menggunakan koefisien silhouette, Within Sum of Square (WSS), serta dendogram. Dalam analisis ini, beberapa metode yang akan digunakan, yaitu metode pautan tunggal (single linkage), pautan lengkap (complete linkage), pautan rataan (average linkage), dan metode centroid dengan jarak Euclidean.
fviz_nbclust(data.stdz, FUNcluster = hcut, method = "silhouette", hc_method = "complete", hc_metric="euclidean")
fviz_nbclust(data.stdz, FUNcluster = hcut, method = "silhouette", hc_method = "average", hc_metric="euclidean")
fviz_nbclust(data.stdz, FUNcluster = hcut, method = "silhouette", hc_method = "centroid", hc_metric="euclidean")
fviz_nbclust(data.stdz, FUNcluster = hcut, method = "silhouette", hc_method = "single", hc_metric="euclidean")
Berdasarkan output di atas, dengan menggunakan average linkage, complete linkage, single linkage, dan centroid linkage, koefisien silhouette tetinggi terjadi saat jumlah kluster = 2 atau jumlah kluster optimal pada kedua linkage tersebut adalah sebanyak 2 kluster.
Dipilih metode Complete Linkage
fviz_dend(hclust(dist(data.stdz, method = "euclidean"), method = "complete"))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <]8;;https://github.com/kassambara/factoextra/issueshttps://github.com/kassambara/factoextra/issues]8;;>.
Dari output di atas dapat dilihat dendogram yang cukup seimbang.
#Interpretasi, utk melihat anggota dari setiap gerombol
hc.data <- eclust(data.ok, stand = TRUE, FUNcluster = "hclust", k=2, hc_method = "complete", hc_metric = "euclidean", graph = F)
hc.data$cluster #cluster dari setiap pengamatan
## [1] 1 1 1 2 2 2 1 1 1 2 1 2 2 1 1 1 2 1 1 2 1 1 1 1 2 2 1
aggregate(data.ok, by=list(cluster=hc.data$cluster), FUN = mean)
## cluster standar_isi standar_proses standar_kompetensi standar_pendidik
## 1 1 96.45118 95.21765 93.92706 81.24412
## 2 2 94.17200 91.80600 90.28900 77.14400
## standar_sarana standar_pengelolaan standar_biaya standar_nilai
## 1 86.16647 93.99353 94.90235 95.62765
## 2 77.51200 90.66800 91.96200 92.36200
Interpretasi:
Cluster 1 Berisi sekolah yang memiliki standar isi, standar proses, standar kompetensi, standar pendidik, standar sarana, standar pengelolaan, standar biaya, dan standar nilai yang lebih tinggi dibanding Cluster 2.
fviz_cluster(hc.data)
K-means merupakan salah satu algoritma yang bersifat unsupervised learning. K-Means memiliki fungsi untuk mengelompokkan data kedalam data cluster. Algoritma ini dapat menerima data tanpa ada label kategori. K-Means Clustering merupakan suatu metode yang mengelompokan data berbagai partisi. K Means Clustering memiliki objective yaitu meminimalisasi object function yang telah di atur pada proses klasterisasi. Dengan cara minimalisasi variasi antar 1 cluster dengan maksimalisasi variasi dengan data di cluster lainnya.
K-Means Clustering adalah suatu metode penganalisaan data atau metode Data Mining yang melakukan proses pemodelan unssupervised learning dan menggunakan metode yang mengelompokan data berbagai partisi. Penentuan banyak cluster pada metode ini dilakukan dengan menggunakan koefisien silhouette dan Within Sum of Square (WSS).
fviz_nbclust(data.stdz, FUNcluster = kmeans, method = "wss")
Dengan menggunakan metode WSS, dipilih garis yang menjadi siku secara eksploratif dan objektif yaitu pada titik k = 2.
kmeans.data <- eclust(data.ok, stand = TRUE, FUNcluster = "kmeans", k=2, graph = F)
kmeans.data$cluster
## [1] 2 2 2 1 1 1 2 2 2 2 2 1 1 2 2 2 1 2 2 1 2 2 2 2 1 1 2
kmeans.data$centers #Memunculkan centroidnya (data masih terstandardisasi, shg sulit diinterpretasikan)
## standar_isi standar_proses standar_kompetensi standar_pendidik standar_sarana
## 1 -0.9224109 -1.1080938 -0.8964315 -0.8740571 -1.0900098
## 2 0.4612054 0.5540469 0.4482158 0.4370286 0.5450049
## standar_pengelolaan standar_biaya standar_nilai
## 1 -1.0923991 -0.7226372 -0.8303075
## 2 0.5461996 0.3613186 0.4151537
aggregate(data, by=list(cluster=kmeans.data$cluster), FUN = mean)
## cluster Wilayah standar_isi standar_proses standar_kompetensi
## 1 1 14.22222 94.08000 91.34000 89.76556
## 2 2 13.88889 96.37056 95.26111 93.98667
## standar_pendidik standar_sarana standar_pengelolaan standar_biaya
## 1 76.60444 76.95778 90.52000 91.51333
## 2 81.28611 85.96278 93.88278 94.96333
## standar_nilai
## 1 92.62444
## 2 95.31500
Interpretasi:
Cluster 1 Berisi sekolah yang memiliki standar isi, standar proses, standar kompetensi, standar pendidik, standar sarana, standar pengelolaan, standar biaya, dan standar nilai yang lebih rendah dibanding Cluster 2.
fviz_cluster(kmeans.data)