Outline
Supervised Statistical Learning
In general, supervised statistical learning involves building a statistical model to predict or estimate an output based on one or more inputs. Problems of this nature arise in fields as diverse as business, medicine, astrophysics, and public policy.
The types of supervised statistical learning models discussed are as follows.
1. Ridge Regression
2. Lasso Regression
Machine Learning Classification
Variable Selection
Several variable selection methods are as follows.
1. Subset Selection
This approach involves identifying a subset of the p predictors believed to be related to the response. The model is then estimated using OLS on the selected subset of predictors. This selection method is further divided into best subset selection and stepwise selection (forward, backward, and hybrid).
2. Shrinkage
This approach involves fitting a model that contains all p predictors. However, the coefficients are estimated by shrinking them toward zero. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on the type of shrinkage, some coefficients may be estimated as exactly zero, so shrinkage methods can also perform variable selection.
3. Dimension Reduction
This approach projects the p predictors onto an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. These M projections are then used as predictors to fit a linear regression model with OLS (a minimal sketch follows this list).
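To make the dimension-reduction idea concrete, here is a minimal sketch added for illustration (not part of the course material): it projects the 13 Boston predictors used later in this document onto M principal components and fits OLS on the scores. The choice M = 5 is an arbitrary assumption; in practice M would be chosen by cross-validation (e.g., with pls::pcr).
# Minimal dimension-reduction sketch: project the p predictors onto M principal
# components, then fit OLS on the M scores (hand-rolled principal component regression)
X  <- scale(MASS::Boston[, -14])       # standardize the 13 predictors
pc <- prcomp(X)                        # principal components of the predictors
M  <- 5                                # assumed number of components (M < p)
Z  <- pc$x[, 1:M]                      # the M linear combinations (scores)
pcr_fit <- lm(MASS::Boston$medv ~ Z)   # OLS on the projections
summary(pcr_fit)$r.squared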
Unsupervised Learning: Cluster Analysis
Cluster analysis aims to find homogeneous subgroups among observations, based on data measured on an interval or ratio scale. Similarity/dissimilarity between objects in cluster analysis is measured using the concept of distance (Sumertajaya and Erfiani 2007).
The two best-known clustering approaches are as follows.
Non-hierarchical Clustering
In this method, the value of \(k\), the number of clusters, must be specified in advance. The choice of k can be made subjectively based on subject-matter knowledge. The distance usually used is the Euclidean distance (Sumertajaya and Erfiani 2007).
The steps of non-hierarchical cluster analysis are as follows (a minimal sketch follows this list).
1. Data preparation
2. Selection of the number of clusters and the distance concept
3. Application of k-means
4. Interpretation of the resulting clusters
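The steps above can be illustrated with a minimal k-means sketch (added here for illustration; the USArrests data, standardization, and k = 3 are assumptions, not results from the material):
dat_std <- scale(USArrests)            # 1. data preparation (standardize the variables)
set.seed(123)
km <- kmeans(dat_std, centers = 3)     # 2.-3. choose k = 3 and apply k-means (Euclidean distance)
table(km$cluster)                      # 4. cluster sizes
aggregate(USArrests, by = list(cluster = km$cluster), FUN = mean)  # 4. cluster profiles
library(factoextra)
fviz_cluster(km, data = dat_std)       # visualize the clusters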
Hierarchical Clustering
The steps of hierarchical cluster analysis are as follows (a minimal sketch follows this list).
1. Data preparation
2. Selection of the linkage method, the number of clusters, and the distance concept
3. Application of hierarchical clustering
4. Interpretation of the resulting clusters
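As above, a minimal hierarchical-clustering sketch (illustration only; USArrests, Euclidean distance, complete linkage, and k = 3 are assumed choices):
dat_std <- scale(USArrests)                     # 1. data preparation
d  <- dist(dat_std, method = "euclidean")       # 2. distance concept
hc <- hclust(d, method = "complete")            # 2.-3. linkage method and clustering
plot(hc, cex = 0.6)                             # dendrogram
groups <- cutree(hc, k = 3)                     # cut the tree into 3 clusters
table(groups)                                   # 4. interpret cluster sizes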
Clustering Techniques
Differences Between Hierarchical & Non-hierarchical Techniques
Hierarchical Clustering Methods
The packages used in this analysis are as follows.
library(broom) # tidies model output into data frames
library(modelr)
library(glmnet) # LASSO and RIDGE selection methods
library(glmnetUtils) # add-on for glmnet that lets the glmnet syntax accept a data.frame object
library(leaps) # provides regsubsets() for best-subset and stepwise variable selection
library(varbvs) # data
library(ggplot2)
library(tidyverse)
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(repr)
library(MASS) # data (Boston)
library(MuMIn) #Model Averaging
library(ISLR)
library(mlr3verse)
library(mlr3fselect)
library(ggplot2)
library(factoextra)
library(FactoMineR)
library(DataExplorer)
library(skimr)
library(corrplot)
library(ggpubr)
Supervised Statistical Learning
Tutorial Example
Data
dat <- MASS::Boston # housing values in the suburbs of Boston
str(dat)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(dat)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Pre-Processing
Data Partitioning
set.seed(100)
index = sample(1:nrow(dat), 0.7*nrow(dat))
train = dat[index,] # Create the training data
test = dat[-index,] # Create the test data
dim(train)
## [1] 354 14
dim(test)
## [1] 152 14
Ridge Regression
First Approach
#creating custom Control Parameters
custom <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 5,
                       verboseIter = FALSE) # whether to print each resampling iteration
#fitting Ridge Regression model
set.seed(1234)
ridge <- train(medv ~ ., train,
               method = "glmnet",
               tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 10)),
               trControl = custom)
ridge
## glmnet
##
## 354 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 318, 319, 318, 319, 319, 318, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0001 4.340939 0.7712224 3.102308
## 0.1112 4.340939 0.7712224 3.102308
## 0.2223 4.340939 0.7712224 3.102308
## 0.3334 4.340939 0.7712224 3.102308
## 0.4445 4.340939 0.7712224 3.102308
## 0.5556 4.340939 0.7712224 3.102308
## 0.6667 4.341042 0.7712146 3.102329
## 0.7778 4.348468 0.7706810 3.099810
## 0.8889 4.356811 0.7700437 3.097963
## 1.0000 4.365780 0.7693706 3.097781
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.5556.
#mean validation score
mean(ridge$resample$RMSE)
## [1] 4.340939
#plotting the model
plot(ridge, main = "Ridge Regression")
#plotting important variables
plot(varImp(ridge,scale=F))
Based on the plot above, the three variables with the largest contributions are, in order, nox, rm, and chas.
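The caret model above can also be evaluated on the held-out test set; a short sketch (not part of the original output):
pred_test <- predict(ridge, newdata = test)      # predictions from the tuned ridge model
postResample(pred = pred_test, obs = test$medv)  # RMSE, R-squared, MAE on the test data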
Second Approach
#Ridge Regression with library(glmnet)
lambdas <- 10^seq(2,-3, by = -.1)
ridge_reg = glmnet(as.matrix(train[,-14]), train[,14], nlambda = 25, alpha = 0,family = 'gaussian', lambda = lambdas)
summary(ridge_reg)
## Length Class Mode
## a0 51 -none- numeric
## beta 663 dgCMatrix S4
## df 51 -none- numeric
## dim 2 -none- numeric
## lambda 51 -none- numeric
## dev.ratio 51 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 7 -none- call
## nobs 1 -none- numeric
ridge_reg
##
## Call: glmnet::glmnet(x = as.matrix(train[, -14]), y = train[, 14], nlambda = 25, alpha = 0, family = "gaussian", lambda = lambdas)
##
## Df %Dev Lambda
## 1 13 30.01 100.000
## 2 13 34.00 79.430
## 3 13 38.04 63.100
## 4 13 42.05 50.120
## 5 13 45.95 39.810
## 6 13 49.70 31.620
## 7 13 53.25 25.120
## 8 13 56.59 19.950
## 9 13 59.69 15.850
## 10 13 62.52 12.590
## 11 13 65.09 10.000
## 12 13 67.36 7.943
## 13 13 69.34 6.310
## 14 13 71.03 5.012
## 15 13 72.44 3.981
## 16 13 73.62 3.162
## 17 13 74.57 2.512
## 18 13 75.34 1.995
## 19 13 75.94 1.585
## 20 13 76.42 1.259
## 21 13 76.79 1.000
## 22 13 77.08 0.794
## 23 13 77.30 0.631
## 24 13 77.46 0.501
## 25 13 77.58 0.398
## 26 13 77.67 0.316
## 27 13 77.73 0.251
## 28 13 77.77 0.200
## 29 13 77.81 0.158
## 30 13 77.83 0.126
## 31 13 77.84 0.100
## 32 13 77.85 0.079
## 33 13 77.86 0.063
## 34 13 77.86 0.050
## 35 13 77.87 0.040
## 36 13 77.87 0.032
## 37 13 77.87 0.025
## 38 13 77.87 0.020
## 39 13 77.87 0.016
## 40 13 77.87 0.013
## 41 13 77.87 0.010
## 42 13 77.87 0.008
## 43 13 77.87 0.006
## 44 13 77.87 0.005
## 45 13 77.87 0.004
## 46 13 77.87 0.003
## 47 13 77.87 0.003
## 48 13 77.87 0.002
## 49 13 77.87 0.002
## 50 13 77.87 0.001
## 51 13 77.87 0.001
set.seed(100)
# tuning the lambda parameter with cross-validation
cv_ridge <- cv.glmnet(as.matrix(train[,-14]), train[,14], alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min
optimal_lambda
## [1] 0.1258925
coef(cv_ridge, s="lambda.min")
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 33.512762193
## crim -0.116541536
## zn 0.045303128
## indus 0.015047943
## chas 2.864206507
## nox -10.422379184
## rm 3.659253006
## age -0.012198505
## dis -1.384865675
## rad 0.227556069
## tax -0.008280337
## ptratio -0.902912189
## black 0.008342041
## lstat -0.623529573
coef(cv_ridge, s=100)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 23.249596796
## crim -0.022001298
## zn 0.008746460
## indus -0.037653796
## chas 0.417908674
## nox -1.911639790
## rm 0.653628764
## age -0.007345027
## dis 0.053942947
## rad -0.019853504
## tax -0.001407383
## ptratio -0.143013850
## black 0.001933966
## lstat -0.064620516
# Compute R^2 from true and predicted values
eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE <- sqrt(SSE / nrow(df))
  # Model performance metrics
  data.frame(RMSE = RMSE,
             Rsquare = R_square)
}
# Prediction and evaluation on train data
predictions_train <- predict(ridge_reg, s = optimal_lambda, newx = as.matrix(train[,-14]))
eval_results(train[,14], predictions_train, train)
## RMSE Rsquare
## 1 4.186191 0.7782747
# Prediction and evaluation on test data
predictions_test <- predict(ridge_reg, s = optimal_lambda, newx = as.matrix(test[,-14]))
eval_results(test[,14], predictions_test, test)
## RMSE Rsquare
## 1 5.830073 0.6410751
Lasso Regression
lambdas <- 10^seq(2, -3, by = -.1)
# Setting alpha = 1 implements lasso regression
lasso_reg <- cv.glmnet(as.matrix(train[,-14]), train[,14], alpha = 1,
lambda = lambdas, standardize = TRUE, nfolds = 5)
# Best
lambda_best <- lasso_reg$lambda.min
lambda_best
## [1] 0.01995262
lasso_model <- glmnet(as.matrix(train[,-14]), train[,14], alpha = 1,
lambda = lambda_best, standardize = TRUE)
predictions_train <- predict(lasso_model, s = lambda_best, newx = as.matrix(train[,-14]))
eval_results(train[,14], predictions_train, train)
## RMSE Rsquare
## 1 4.185119 0.7783883
predictions_test <- predict(lasso_model, s = lambda_best, newx = as.matrix(test[,-14]))
eval_results(test[,14], predictions_test, test)
## RMSE Rsquare
## 1 5.844554 0.6392898
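Because the lasso can shrink coefficients exactly to zero, it is worth checking which predictors remain in the fitted model; a one-line sketch (not in the original output):
coef(lasso_model)   # predictors shown as "." have been dropped by the lasso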
Averaging Regression
mod1 <- lm(medv~.,train, na.action = na.fail)
mod2 <- dredge(global.model=mod1, m.lim=c(10,12))
## Fixed term is "(Intercept)"
mod3 <- model.avg(mod2, delta<4)
mod3
##
## Call:
## model.avg(object = mod2, subset = delta < 4)
##
## Component models:
## '2+3+4+5+7+8+9+10+11+12+13' '1+2+3+4+5+7+8+9+10+11+12+13'
## '2+3+4+5+6+7+8+9+10+11+12+13'
##
## Coefficients:
## (Intercept) black chas crim dis lstat
## full 35.63353 0.008185818 2.765414 -0.1220815 -1.439714 -0.6486161
## subset 35.63353 0.008185818 2.765414 -0.1220815 -1.439714 -0.6486161
## nox ptratio rad rm tax zn
## full -11.52498 -0.9210936 0.2647936 3.513183 -0.009723128 0.04946858
## subset -11.52498 -0.9210936 0.2647936 3.513183 -0.009723128 0.04946858
## age indus
## full -0.002933631 0.00707678
## subset -0.011686278 0.03381587
summary(mod3)
##
## Call:
## model.avg(object = mod2, subset = delta < 4)
##
## Component model call:
## lm(formula = medv ~ <3 unique rhs>, data = train, na.action = na.fail)
##
## Component models:
## df logLik AICc delta weight
## 2+3+4+5+7+8+9+10+11+12+13 13 -1009.26 2045.59 0.00 0.54
## 1+2+3+4+5+7+8+9+10+11+12+13 14 -1008.94 2047.12 1.53 0.25
## 2+3+4+5+6+7+8+9+10+11+12+13 14 -1009.12 2047.48 1.89 0.21
##
## Term codes:
## age black chas crim dis indus lstat nox ptratio rad
## 1 2 3 4 5 6 7 8 9 10
## rm tax zn
## 11 12 13
##
## Model-averaged coefficients:
## (full average)
## Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept) 35.633531 5.764676 5.785169 6.159 < 2e-16 ***
## black 0.008186 0.002780 0.002790 2.934 0.003351 **
## chas 2.765414 0.973260 0.976713 2.831 0.004635 **
## crim -0.122081 0.030899 0.031009 3.937 8.25e-05 ***
## dis -1.439714 0.209373 0.210101 6.852 < 2e-16 ***
## lstat -0.648616 0.055719 0.055914 11.600 < 2e-16 ***
## nox -11.524979 3.927020 3.940849 2.924 0.003450 **
## ptratio -0.921094 0.143484 0.143994 6.397 < 2e-16 ***
## rad 0.264794 0.066994 0.067231 3.939 8.20e-05 ***
## rm 3.513183 0.473014 0.474692 7.401 < 2e-16 ***
## tax -0.009723 0.003663 0.003676 2.645 0.008166 **
## zn 0.049469 0.014665 0.014717 3.361 0.000776 ***
## age -0.002934 0.009025 0.009047 0.324 0.745743
## indus 0.007077 0.033118 0.033215 0.213 0.831281
##
## (conditional average)
## Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept) 35.633531 5.764676 5.785169 6.159 < 2e-16 ***
## black 0.008186 0.002780 0.002790 2.934 0.003351 **
## chas 2.765414 0.973260 0.976713 2.831 0.004635 **
## crim -0.122081 0.030899 0.031009 3.937 8.25e-05 ***
## dis -1.439714 0.209373 0.210101 6.852 < 2e-16 ***
## lstat -0.648616 0.055719 0.055914 11.600 < 2e-16 ***
## nox -11.524979 3.927020 3.940849 2.924 0.003450 **
## ptratio -0.921094 0.143484 0.143994 6.397 < 2e-16 ***
## rad 0.264794 0.066994 0.067231 3.939 8.20e-05 ***
## rm 3.513183 0.473014 0.474692 7.401 < 2e-16 ***
## tax -0.009723 0.003663 0.003676 2.645 0.008166 **
## zn 0.049469 0.014665 0.014717 3.361 0.000776 ***
## age -0.011686 0.014906 0.014959 0.781 0.434677
## indus 0.033816 0.065853 0.066088 0.512 0.608874
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lecture Example
Data
set.seed(123)
x <- cbind(1, matrix(rnorm(100*20, 2, 1), 100, 20))  # design matrix: intercept column + 20 N(2,1) predictors
e <- matrix(rnorm(100), 100, 1)                      # error term
b <- c(1, rep(0:4, each = 4))                        # true coefficients: intercept 1, then 0,0,0,0,1,...,4 (four of each)
y <- x %*% b + e                                     # simulated response
dt.all <- data.frame(y,x[, -1])
str( dt.all)
## 'data.frame': 100 obs. of 21 variables:
## $ y : num 71.6 64.3 64.7 100.1 90.3 ...
## $ X1 : num 1.44 1.77 3.56 2.07 2.13 ...
## $ X2 : num 1.29 2.26 1.75 1.65 1.05 ...
## $ X3 : num 4.2 3.31 1.73 2.54 1.59 ...
## $ X4 : num 1.285 1.247 1.061 0.947 1.563 ...
## $ X5 : num 1.926 0.831 1.365 1.971 2.671 ...
## $ X6 : num 1.398 1.006 3.027 2.751 0.491 ...
## $ X7 : num 3.074 1.973 1.967 0.484 2.79 ...
## $ X8 : num 1.272 0.46 1.307 2.119 0.635 ...
## $ X9 : num 2.36 1.34 2.86 3.15 2.28 ...
## $ X10: num 0.986 1.209 2.3 3.639 3.085 ...
## $ X11: num 1.004 0.96 1.982 1.868 -0.549 ...
## $ X12: num 2.916 2.801 1.063 0.599 2.16 ...
## $ X13: num 2.62 1.24 2.85 1.25 2.63 ...
## $ X14: num 1.25 1.678 0.852 2.354 2.425 ...
## $ X15: num 0.914 1.335 2.715 1.568 2.228 ...
## $ X16: num 1.18 1.69 1.1 2.63 3.12 ...
## $ X17: num 1.71 2.66 1.55 1.41 0.29 ...
## $ X18: num 1.81 1.53 -1.05 3.87 3.79 ...
## $ X19: num 0.711 1.345 1.943 3.257 3.587 ...
## $ X20: num 3.54 1.54 1.97 3.64 1.67 ...
Linear Regression
mod1 <- lm( y~.,data=dt.all)
coef(mod1)
## (Intercept) X1 X2 X3 X4 X5
## 1.199813532 -0.073189937 -0.071293234 0.176350703 -0.006345167 1.015471670
## X6 X7 X8 X9 X10 X11
## 0.999708429 1.019633372 0.885459998 1.988429428 2.019416432 2.158248869
## X12 X13 X14 X15 X16 X17
## 2.013082122 2.867064930 3.029382418 3.046901211 2.970311314 3.954652929
## X18 X19 X20
## 3.951884182 4.035364936 3.853178398
Ridge Regression
# library(glmnet)
mod2 <- cv.glmnet(x[,-1], y, alpha=0) # lambda selection via cross-validation for ridge
mod2
##
## Call: glmnet::cv.glmnet(x = x[, -1], y = y, alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.4241 100 1.434 0.1982 20
## 1se 0.5108 98 1.543 0.2247 20
coef(mod2, s="lambda.min")
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 4.56512925
## V1 0.01864237
## V2 -0.14474607
## V3 0.13647799
## V4 0.09321257
## V5 1.01992103
## V6 1.02833345
## V7 1.03021554
## V8 0.87150470
## V9 1.90975375
## V10 1.89214977
## V11 2.12432856
## V12 1.99839789
## V13 2.69183965
## V14 2.86949361
## V15 2.91427164
## V16 2.72840558
## V17 3.73396794
## V18 3.70200292
## V19 3.88847919
## V20 3.67709282
Lasso Regression
mod3 <- cv.glmnet(x[, -1], y, alpha=1) # lambda selection via cross-validation for lasso
mod3
##
## Call: glmnet::cv.glmnet(x = x[, -1], y = y, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.01923 59 1.014 0.1213 19
## 1se 0.07763 44 1.134 0.1439 18
coef(mod3, s="lambda.min")
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.83533712
## V1 -0.03815930
## V2 -0.05393318
## V3 0.14314845
## V4 .
## V5 1.00548302
## V6 0.98429527
## V7 0.99965103
## V8 0.86434523
## V9 1.96683664
## V10 1.99554456
## V11 2.14570324
## V12 2.01162119
## V13 2.84395139
## V14 3.01020049
## V15 3.01257009
## V16 2.92811226
## V17 3.93506150
## V18 3.90997381
## V19 4.01843473
## V20 3.84020618
Averaging Regression
This step ran into problems when knitting to HTML, even after shrinking m.lim from m.lim=c(10,20) to m.lim=c(8,10).
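One possible workaround, shown only as a sketch under the assumption that a much smaller candidate set is acceptable, is to restrict the global model to a handful of predictors so that dredge() enumerates far fewer models:
# sketch only: 4 predictors give 2^4 = 16 candidate models instead of thousands
mod_small <- lm(y ~ X17 + X18 + X19 + X20, data = dt.all, na.action = na.fail)
cand <- dredge(mod_small)
avg  <- model.avg(cand, subset = delta < 4)
summary(avg)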
Variable Selection
Lecture Example
Data Hitters
Major League Baseball data from the 1986 and 1987 seasons: a data frame with 322 observations on major league players and the following 20 variables.
- AtBat: number of times at bat in 1986
- Hits: number of hits in 1986
- HmRun: number of home runs in 1986
- Runs: number of runs in 1986
- RBI: number of runs batted in in 1986
- Walks: number of walks in 1986
- Years: number of years in the major leagues
- CAtBat: number of times at bat during his career
- CHits: number of hits during his career
- CHmRun: number of home runs during his career
- CRuns: number of runs during his career
- CRBI: number of runs batted in during his career
- CWalks: number of walks during his career
- League: a factor with levels A and N indicating the player's league at the end of 1986
- Division: a factor with levels E and W indicating the player's division at the end of 1986
- PutOuts: number of put outs in 1986
- Assists: number of assists in 1986
- Errors: number of errors in 1986
- Salary: 1987 annual salary on opening day, in thousands of dollars
- NewLeague: a factor with levels A and N indicating the player's league at the beginning of 1987
data("Hitters")
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
## [19] "Salary" "NewLeague"
dim(Hitters)
## [1] 322 20
Pre-Processing
str(Hitters)
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
sum(is.na(Hitters$Salary)) # 59 baseball players have a missing (NA) salary
## [1] 59
Hitters=na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters$Salary))
## [1] 0
The 59 observations of baseball players whose salary is missing (not recorded) are removed from the data.
Descriptive Statistics
summary(Hitters)
## AtBat Hits HmRun Runs
## Min. : 19.0 Min. : 1.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:282.5 1st Qu.: 71.5 1st Qu.: 5.00 1st Qu.: 33.50
## Median :413.0 Median :103.0 Median : 9.00 Median : 52.00
## Mean :403.6 Mean :107.8 Mean :11.62 Mean : 54.75
## 3rd Qu.:526.0 3rd Qu.:141.5 3rd Qu.:18.00 3rd Qu.: 73.00
## Max. :687.0 Max. :238.0 Max. :40.00 Max. :130.00
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
## 1st Qu.: 30.00 1st Qu.: 23.00 1st Qu.: 4.000 1st Qu.: 842.5
## Median : 47.00 Median : 37.00 Median : 6.000 Median : 1931.0
## Mean : 51.49 Mean : 41.11 Mean : 7.312 Mean : 2657.5
## 3rd Qu.: 71.00 3rd Qu.: 57.00 3rd Qu.:10.000 3rd Qu.: 3890.5
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.00 Min. : 2.0 Min. : 3.0
## 1st Qu.: 212.0 1st Qu.: 15.00 1st Qu.: 105.5 1st Qu.: 95.0
## Median : 516.0 Median : 40.00 Median : 250.0 Median : 230.0
## Mean : 722.2 Mean : 69.24 Mean : 361.2 Mean : 330.4
## 3rd Qu.:1054.0 3rd Qu.: 92.50 3rd Qu.: 497.5 3rd Qu.: 424.5
## Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.0
## CWalks League Division PutOuts Assists
## Min. : 1.0 A:139 E:129 Min. : 0.0 Min. : 0.0
## 1st Qu.: 71.0 N:124 W:134 1st Qu.: 113.5 1st Qu.: 8.0
## Median : 174.0 Median : 224.0 Median : 45.0
## Mean : 260.3 Mean : 290.7 Mean :118.8
## 3rd Qu.: 328.5 3rd Qu.: 322.5 3rd Qu.:192.0
## Max. :1566.0 Max. :1377.0 Max. :492.0
## Errors Salary NewLeague
## Min. : 0.000 Min. : 67.5 A:141
## 1st Qu.: 3.000 1st Qu.: 190.0 N:122
## Median : 7.000 Median : 425.0
## Mean : 8.593 Mean : 535.9
## 3rd Qu.:13.000 3rd Qu.: 750.0
## Max. :32.000 Max. :2460.0
Full Regression Model
regfit.full=regsubsets(Salary~.,Hitters)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
regfit.full=regsubsets(Salary~.,data=Hitters ,nvmax=19)
reg.summary=summary(regfit.full)
reg.summary
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
From the full regression model of the Hitters data with Salary as the response, there are 19 candidate linear regression models (one for each subset size), and the predictors that appear most often across these models are Hits and CRBI.
Based on Adjusted R-Squared
reg.summary$adjr2 # the 11th model has the largest adjusted R-squared
## [1] 0.3188503 0.4208024 0.4450753 0.4672734 0.4808971 0.4972001 0.5007849
## [8] 0.5137083 0.5180572 0.5222606 0.5225706 0.5217245 0.5206736 0.5195431
## [15] 0.5178661 0.5162219 0.5144464 0.5126097 0.5106270
which.max(reg.summary$adjr2)
## [1] 11
plot(reg.summary$adjr2 ,xlab="Number of Variables ",
ylab="Adjusted RSq",type="l")
points(11,reg.summary$adjr2[11], col="red",cex=2,pch=20)
# and so on
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Based on the adjusted R-squared values, the 11th candidate model, which contains 11 predictors, has the highest adjusted R-squared, so it is the best model under this criterion. Besides adjusted R-squared, models can also be compared using the Cp and BIC values, as sketched below.
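Applying the same pattern to Mallows' Cp and BIC is a short extension (a sketch; the selected sizes are computed by the code rather than quoted from the original output):
which.min(reg.summary$cp)    # model size with the smallest Mallows' Cp
which.min(reg.summary$bic)   # model size with the smallest BIC
par(mfrow = c(1, 2))
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(reg.summary$cp), min(reg.summary$cp), col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), min(reg.summary$bic), col = "red", cex = 2, pch = 20)
par(mfrow = c(1, 1))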
Variable Selection
Forward Stepwise Selection
regfit.fwd=regsubsets(Salary~.,data=Hitters ,nvmax=19, method ="forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Backward Stepwise Selection
regfit.bwd=regsubsets (Salary~.,data=Hitters ,nvmax=19,method ="backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: backward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
## 4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*" " "
## 5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " " " "*" " " " " " "
## 5 ( 1 ) " " " " " " "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Comparison of Model Coefficients
coef(regfit.full,7)
## (Intercept) Hits Walks CAtBat CHits CHmRun
## 79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538
## DivisionW PutOuts
## -129.9866432 0.2366813
coef(regfit.fwd,7)
## (Intercept) AtBat Hits Walks CRBI CWalks
## 109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622 -0.3053070
## DivisionW PutOuts
## -127.1223928 0.2533404
coef(regfit.bwd,7)
## (Intercept) AtBat Hits Walks CRuns CWalks
## 105.6487488 -1.9762838 6.7574914 6.0558691 1.1293095 -0.7163346
## DivisionW PutOuts
## -116.1692169 0.3028847
Tutorial Example
Data House Price
Ask a home buyer to describe their dream house, and they probably will not begin with the height of the basement ceiling or the proximity to an east-west railroad; yet this playground competition's dataset proves that far more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
The dataset can be accessed through the following link: Dataset House Price
data_house <- read.csv("D:/MY COLLEGE/SEMESTER 5/PSD/DATA/house_price1.csv",stringsAsFactors = TRUE)
glimpse(data_house)
## Rows: 1,460
## Columns: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,…
## $ MSZoning <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, RL, RL, …
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, …
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612…
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pa…
## $ Alley <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ LotShape <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg, Reg, I…
## $ LandContour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, L…
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, …
## $ LotConfig <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside, Corner…
## $ LandSlope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ Neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mitchel, So…
## $ Condition1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN, Artery,…
## $ Condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Ar…
## $ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 2f…
## $ HouseStyle <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, 1Story, …
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,…
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,…
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19…
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19…
## $ RoofStyle <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable, Gable, …
## $ RoofMatl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompShg, Co…
## $ Exterior1st <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, VinylSd, Vi…
## $ Exterior2nd <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, VinylSd, Vi…
## $ MasVnrType <fct> BrkFace, None, BrkFace, None, BrkFace, None, Stone, Ston…
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, …
## $ ExterQual <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, Gd, …
## $ ExterCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ Foundation <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc, CBlock…
## $ BsmtQual <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, TA, Gd, …
## $ BsmtCond <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ BsmtExposure <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, No, Av, …
## $ BsmtFinType1 <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ, Rec, G…
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99…
## $ BsmtFinType2 <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf, Unf, U…
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17…
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10…
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Ga…
## $ HeatingQC <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, TA, Ex, …
## $ CentralAir <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,…
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, …
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, …
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,…
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10…
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,…
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,…
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,…
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ KitchenQual <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, Gd, …
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6…
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Typ, Typ, …
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,…
## $ FireplaceQu <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, NA, Gd, …
## $ GarageType <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, Attchd, …
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19…
## $ GarageFinish <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn, Unf, F…
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,…
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7…
## $ GarageQual <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, TA, TA, …
## $ GarageCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ PavedDrive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,…
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160…
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,…
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, …
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, …
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PoolQC <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Fence <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, NA, NA, N…
## $ MiscFeature <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, NA, NA, …
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,…
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10…
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20…
## $ SaleType <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New, WD, New…
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal, Normal,…
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, 307000, …
For users of R version 4.0.0 and above, the argument stringsAsFactors = TRUE is included so that string columns are converted to factors (from R 4.0.0 onward, read.csv() no longer does this by default).
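An equivalent alternative (a sketch, reusing the same file path as above): read the file without the argument and convert the character columns afterwards with dplyr.
data_house <- read.csv("D:/MY COLLEGE/SEMESTER 5/PSD/DATA/house_price1.csv") %>%
  mutate(across(where(is.character), as.factor))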
Exploratory Data Analysis (EDA)
General Characteristics
ggpubr::gghistogram(data = data_house,x = "SalePrice",fill = "#FAAB78")+scale_y_continuous(expand = c(0,0))
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
data_house %>%
mutate(SalePrice=log(SalePrice)) %>%
ggpubr::gghistogram(x = "SalePrice",fill = "#9BA17B")+scale_y_continuous(expand = c(0,0))
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
plot_intro(data = data_house,
geom_label_args = list(size=2.5))
Notes:
- plot_intro() is a function from the DataExplorer package; its main argument is a data.frame object.
- the geom_label_args argument accepts the options available in the geom_label() function from the ggplot2 package.
skim_without_charts(data = data_house)
| Name | data_house |
| Number of rows | 1460 |
| Number of columns | 81 |
| _______________________ | |
| Column type frequency: | |
| factor | 43 |
| numeric | 38 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| MSZoning | 0 | 1.00 | FALSE | 5 | RL: 1151, RM: 218, FV: 65, RH: 16 |
| Street | 0 | 1.00 | FALSE | 2 | Pav: 1454, Grv: 6 |
| Alley | 1369 | 0.06 | FALSE | 2 | Grv: 50, Pav: 41 |
| LotShape | 0 | 1.00 | FALSE | 4 | Reg: 925, IR1: 484, IR2: 41, IR3: 10 |
| LandContour | 0 | 1.00 | FALSE | 4 | Lvl: 1311, Bnk: 63, HLS: 50, Low: 36 |
| Utilities | 0 | 1.00 | FALSE | 2 | All: 1459, NoS: 1 |
| LotConfig | 0 | 1.00 | FALSE | 5 | Ins: 1052, Cor: 263, Cul: 94, FR2: 47 |
| LandSlope | 0 | 1.00 | FALSE | 3 | Gtl: 1382, Mod: 65, Sev: 13 |
| Neighborhood | 0 | 1.00 | FALSE | 25 | NAm: 225, Col: 150, Old: 113, Edw: 100 |
| Condition1 | 0 | 1.00 | FALSE | 9 | Nor: 1260, Fee: 81, Art: 48, RRA: 26 |
| Condition2 | 0 | 1.00 | FALSE | 8 | Nor: 1445, Fee: 6, Art: 2, Pos: 2 |
| BldgType | 0 | 1.00 | FALSE | 5 | 1Fa: 1220, Twn: 114, Dup: 52, Twn: 43 |
| HouseStyle | 0 | 1.00 | FALSE | 8 | 1St: 726, 2St: 445, 1.5: 154, SLv: 65 |
| RoofStyle | 0 | 1.00 | FALSE | 6 | Gab: 1141, Hip: 286, Fla: 13, Gam: 11 |
| RoofMatl | 0 | 1.00 | FALSE | 8 | Com: 1434, Tar: 11, WdS: 6, WdS: 5 |
| Exterior1st | 0 | 1.00 | FALSE | 15 | Vin: 515, HdB: 222, Met: 220, Wd : 206 |
| Exterior2nd | 0 | 1.00 | FALSE | 16 | Vin: 504, Met: 214, HdB: 207, Wd : 197 |
| MasVnrType | 8 | 0.99 | FALSE | 4 | Non: 864, Brk: 445, Sto: 128, Brk: 15 |
| ExterQual | 0 | 1.00 | FALSE | 4 | TA: 906, Gd: 488, Ex: 52, Fa: 14 |
| ExterCond | 0 | 1.00 | FALSE | 5 | TA: 1282, Gd: 146, Fa: 28, Ex: 3 |
| Foundation | 0 | 1.00 | FALSE | 6 | PCo: 647, CBl: 634, Brk: 146, Sla: 24 |
| BsmtQual | 37 | 0.97 | FALSE | 4 | TA: 649, Gd: 618, Ex: 121, Fa: 35 |
| BsmtCond | 37 | 0.97 | FALSE | 4 | TA: 1311, Gd: 65, Fa: 45, Po: 2 |
| BsmtExposure | 38 | 0.97 | FALSE | 4 | No: 953, Av: 221, Gd: 134, Mn: 114 |
| BsmtFinType1 | 37 | 0.97 | FALSE | 6 | Unf: 430, GLQ: 418, ALQ: 220, BLQ: 148 |
| BsmtFinType2 | 38 | 0.97 | FALSE | 6 | Unf: 1256, Rec: 54, LwQ: 46, BLQ: 33 |
| Heating | 0 | 1.00 | FALSE | 6 | Gas: 1428, Gas: 18, Gra: 7, Wal: 4 |
| HeatingQC | 0 | 1.00 | FALSE | 5 | Ex: 741, TA: 428, Gd: 241, Fa: 49 |
| CentralAir | 0 | 1.00 | FALSE | 2 | Y: 1365, N: 95 |
| Electrical | 1 | 1.00 | FALSE | 5 | SBr: 1334, Fus: 94, Fus: 27, Fus: 3 |
| KitchenQual | 0 | 1.00 | FALSE | 4 | TA: 735, Gd: 586, Ex: 100, Fa: 39 |
| Functional | 0 | 1.00 | FALSE | 7 | Typ: 1360, Min: 34, Min: 31, Mod: 15 |
| FireplaceQu | 690 | 0.53 | FALSE | 5 | Gd: 380, TA: 313, Fa: 33, Ex: 24 |
| GarageType | 81 | 0.94 | FALSE | 6 | Att: 870, Det: 387, Bui: 88, Bas: 19 |
| GarageFinish | 81 | 0.94 | FALSE | 3 | Unf: 605, RFn: 422, Fin: 352 |
| GarageQual | 81 | 0.94 | FALSE | 5 | TA: 1311, Fa: 48, Gd: 14, Ex: 3 |
| GarageCond | 81 | 0.94 | FALSE | 5 | TA: 1326, Fa: 35, Gd: 9, Po: 7 |
| PavedDrive | 0 | 1.00 | FALSE | 3 | Y: 1340, N: 90, P: 30 |
| PoolQC | 1453 | 0.00 | FALSE | 3 | Gd: 3, Ex: 2, Fa: 2 |
| Fence | 1179 | 0.19 | FALSE | 4 | MnP: 157, GdP: 59, GdW: 54, MnW: 11 |
| MiscFeature | 1406 | 0.04 | FALSE | 4 | She: 49, Gar: 2, Oth: 2, Ten: 1 |
| SaleType | 0 | 1.00 | FALSE | 9 | WD: 1267, New: 122, COD: 43, Con: 9 |
| SaleCondition | 0 | 1.00 | FALSE | 6 | Nor: 1198, Par: 125, Abn: 101, Fam: 20 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 730.50 | 421.61 | 1 | 365.75 | 730.5 | 1095.25 | 1460 |
| MSSubClass | 0 | 1.00 | 56.90 | 42.30 | 20 | 20.00 | 50.0 | 70.00 | 190 |
| LotFrontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 |
| LotArea | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 |
| OverallQual | 0 | 1.00 | 6.10 | 1.38 | 1 | 5.00 | 6.0 | 7.00 | 10 |
| OverallCond | 0 | 1.00 | 5.58 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 |
| YearBuilt | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 |
| YearRemodAdd | 0 | 1.00 | 1984.87 | 20.65 | 1950 | 1967.00 | 1994.0 | 2004.00 | 2010 |
| MasVnrArea | 8 | 0.99 | 103.69 | 181.07 | 0 | 0.00 | 0.0 | 166.00 | 1600 |
| BsmtFinSF1 | 0 | 1.00 | 443.64 | 456.10 | 0 | 0.00 | 383.5 | 712.25 | 5644 |
| BsmtFinSF2 | 0 | 1.00 | 46.55 | 161.32 | 0 | 0.00 | 0.0 | 0.00 | 1474 |
| BsmtUnfSF | 0 | 1.00 | 567.24 | 441.87 | 0 | 223.00 | 477.5 | 808.00 | 2336 |
| TotalBsmtSF | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 |
| X1stFlrSF | 0 | 1.00 | 1162.63 | 386.59 | 334 | 882.00 | 1087.0 | 1391.25 | 4692 |
| X2ndFlrSF | 0 | 1.00 | 346.99 | 436.53 | 0 | 0.00 | 0.0 | 728.00 | 2065 |
| LowQualFinSF | 0 | 1.00 | 5.84 | 48.62 | 0 | 0.00 | 0.0 | 0.00 | 572 |
| GrLivArea | 0 | 1.00 | 1515.46 | 525.48 | 334 | 1129.50 | 1464.0 | 1776.75 | 5642 |
| BsmtFullBath | 0 | 1.00 | 0.43 | 0.52 | 0 | 0.00 | 0.0 | 1.00 | 3 |
| BsmtHalfBath | 0 | 1.00 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 |
| FullBath | 0 | 1.00 | 1.57 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 |
| HalfBath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 |
| BedroomAbvGr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 |
| KitchenAbvGr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 |
| TotRmsAbvGrd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 |
| Fireplaces | 0 | 1.00 | 0.61 | 0.64 | 0 | 0.00 | 1.0 | 1.00 | 3 |
| GarageYrBlt | 81 | 0.94 | 1978.51 | 24.69 | 1900 | 1961.00 | 1980.0 | 2002.00 | 2010 |
| GarageCars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 |
| GarageArea | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 |
| WoodDeckSF | 0 | 1.00 | 94.24 | 125.34 | 0 | 0.00 | 0.0 | 168.00 | 857 |
| OpenPorchSF | 0 | 1.00 | 46.66 | 66.26 | 0 | 0.00 | 25.0 | 68.00 | 547 |
| EnclosedPorch | 0 | 1.00 | 21.95 | 61.12 | 0 | 0.00 | 0.0 | 0.00 | 552 |
| X3SsnPorch | 0 | 1.00 | 3.41 | 29.32 | 0 | 0.00 | 0.0 | 0.00 | 508 |
| ScreenPorch | 0 | 1.00 | 15.06 | 55.76 | 0 | 0.00 | 0.0 | 0.00 | 480 |
| PoolArea | 0 | 1.00 | 2.76 | 40.18 | 0 | 0.00 | 0.0 | 0.00 | 738 |
| MiscVal | 0 | 1.00 | 43.49 | 496.12 | 0 | 0.00 | 0.0 | 0.00 | 15500 |
| MoSold | 0 | 1.00 | 6.32 | 2.70 | 1 | 5.00 | 6.0 | 8.00 | 12 |
| YrSold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 |
| SalePrice | 0 | 1.00 | 180921.20 | 79442.50 | 34900 | 129975.00 | 163000.0 | 214000.00 | 755000 |
Missing Value Handling
There are two approaches, as follows.
- Replacing the missing values in columns where they are numerous (roughly >500), which applies when the data is a factor or string; na.omit() is then used to remove all remaining rows that contain missing values.
- Removing the rows that contain missing values in columns where they are few (roughly <500).
This analysis uses approach 1.
data_house1 <- data_house %>%
dplyr::select(-Id) %>%
mutate(
Alley = forcats::fct_explicit_na(Alley, na_level = "Ukn"),
FireplaceQu=forcats::fct_explicit_na(FireplaceQu,
na_level = "Ukn"
),
PoolQC = forcats::fct_explicit_na(PoolQC, na_level = "Ukn"),
Fence = forcats::fct_explicit_na(Fence, na_level = "Ukn"),
MiscFeature = forcats::fct_explicit_na(MiscFeature, na_level = "Ukn")
) %>% na.omit
plot_intro(data = data_house1)
skim_without_charts(data_house1)
| Name | data_house1 |
| Number of rows | 1094 |
| Number of columns | 80 |
| _______________________ | |
| Column type frequency: | |
| factor | 43 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| MSZoning | 0 | 1 | FALSE | 5 | RL: 850, RM: 173, FV: 54, RH: 9 |
| Street | 0 | 1 | FALSE | 2 | Pav: 1090, Grv: 4 |
| Alley | 0 | 1 | FALSE | 3 | Ukn: 1017, Grv: 41, Pav: 36 |
| LotShape | 0 | 1 | FALSE | 4 | Reg: 760, IR1: 301, IR2: 26, IR3: 7 |
| LandContour | 0 | 1 | FALSE | 4 | Lvl: 991, Bnk: 45, HLS: 44, Low: 14 |
| Utilities | 0 | 1 | FALSE | 1 | All: 1094, NoS: 0 |
| LotConfig | 0 | 1 | FALSE | 5 | Ins: 830, Cor: 187, Cul: 44, FR2: 29 |
| LandSlope | 0 | 1 | FALSE | 3 | Gtl: 1045, Mod: 44, Sev: 5 |
| Neighborhood | 0 | 1 | FALSE | 25 | NAm: 173, Col: 122, Old: 96, Som: 75 |
| Condition1 | 0 | 1 | FALSE | 9 | Nor: 950, Fee: 52, Art: 42, RRA: 24 |
| Condition2 | 0 | 1 | FALSE | 6 | Nor: 1082, Fee: 5, Art: 2, Pos: 2 |
| BldgType | 0 | 1 | FALSE | 5 | 1Fa: 925, Twn: 90, Twn: 35, Dup: 24 |
| HouseStyle | 0 | 1 | FALSE | 8 | 1St: 540, 2St: 346, 1.5: 117, SLv: 43 |
| RoofStyle | 0 | 1 | FALSE | 5 | Gab: 843, Hip: 230, Gam: 10, Man: 6 |
| RoofMatl | 0 | 1 | FALSE | 7 | Com: 1078, WdS: 6, Tar: 5, WdS: 2 |
| Exterior1st | 0 | 1 | FALSE | 14 | Vin: 421, Met: 172, HdB: 151, Wd : 149 |
| Exterior2nd | 0 | 1 | FALSE | 16 | Vin: 412, Met: 169, Wd : 145, HdB: 138 |
| MasVnrType | 0 | 1 | FALSE | 4 | Non: 639, Brk: 327, Sto: 119, Brk: 9 |
| ExterQual | 0 | 1 | FALSE | 4 | TA: 646, Gd: 395, Ex: 46, Fa: 7 |
| ExterCond | 0 | 1 | FALSE | 4 | TA: 973, Gd: 104, Fa: 15, Ex: 2 |
| Foundation | 0 | 1 | FALSE | 5 | PCo: 518, CBl: 446, Brk: 122, Sto: 6 |
| BsmtQual | 0 | 1 | FALSE | 4 | TA: 486, Gd: 463, Ex: 113, Fa: 32 |
| BsmtCond | 0 | 1 | FALSE | 4 | TA: 1006, Gd: 51, Fa: 36, Po: 1 |
| BsmtExposure | 0 | 1 | FALSE | 4 | No: 734, Av: 174, Gd: 97, Mn: 89 |
| BsmtFinType1 | 0 | 1 | FALSE | 6 | Unf: 343, GLQ: 323, ALQ: 162, BLQ: 105 |
| BsmtFinType2 | 0 | 1 | FALSE | 6 | Unf: 972, Rec: 37, LwQ: 35, BLQ: 25 |
| Heating | 0 | 1 | FALSE | 4 | Gas: 1075, Gas: 16, Gra: 2, Oth: 1 |
| HeatingQC | 0 | 1 | FALSE | 5 | Ex: 594, TA: 298, Gd: 174, Fa: 27 |
| CentralAir | 0 | 1 | FALSE | 2 | Y: 1036, N: 58 |
| Electrical | 0 | 1 | FALSE | 5 | SBr: 1009, Fus: 67, Fus: 15, Fus: 2 |
| KitchenQual | 0 | 1 | FALSE | 4 | TA: 528, Gd: 454, Ex: 91, Fa: 21 |
| Functional | 0 | 1 | FALSE | 6 | Typ: 1024, Min: 25, Min: 21, Maj: 10 |
| FireplaceQu | 0 | 1 | FALSE | 6 | Ukn: 511, Gd: 315, TA: 212, Fa: 24 |
| GarageType | 0 | 1 | FALSE | 6 | Att: 680, Det: 325, Bui: 63, Bas: 15 |
| GarageFinish | 0 | 1 | FALSE | 3 | Unf: 485, RFn: 333, Fin: 276 |
| GarageQual | 0 | 1 | FALSE | 5 | TA: 1031, Fa: 46, Gd: 11, Ex: 3 |
| GarageCond | 0 | 1 | FALSE | 5 | TA: 1050, Fa: 31, Po: 6, Gd: 5 |
| PavedDrive | 0 | 1 | FALSE | 3 | Y: 1023, N: 48, P: 23 |
| PoolQC | 0 | 1 | FALSE | 4 | Ukn: 1088, Ex: 2, Fa: 2, Gd: 2 |
| Fence | 0 | 1 | FALSE | 5 | Ukn: 882, MnP: 117, GdP: 46, GdW: 39 |
| MiscFeature | 0 | 1 | FALSE | 4 | Ukn: 1059, She: 33, Oth: 1, Ten: 1 |
| SaleType | 0 | 1 | FALSE | 9 | WD: 928, New: 116, COD: 31, Con: 5 |
| SaleCondition | 0 | 1 | FALSE | 6 | Nor: 880, Par: 119, Abn: 70, Fam: 18 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| MSSubClass | 0 | 1 | 56.13 | 41.98 | 20 | 20.00 | 50.0 | 70.00 | 190 |
| LotFrontage | 0 | 1 | 70.76 | 24.51 | 21 | 60.00 | 70.0 | 80.00 | 313 |
| LotArea | 0 | 1 | 10132.35 | 8212.25 | 1300 | 7606.75 | 9444.5 | 11387.25 | 215245 |
| OverallQual | 0 | 1 | 6.25 | 1.37 | 2 | 5.00 | 6.0 | 7.00 | 10 |
| OverallCond | 0 | 1 | 5.58 | 1.07 | 2 | 5.00 | 5.0 | 6.00 | 9 |
| YearBuilt | 0 | 1 | 1972.41 | 31.19 | 1880 | 1953.00 | 1975.0 | 2003.00 | 2010 |
| YearRemodAdd | 0 | 1 | 1985.92 | 20.93 | 1950 | 1967.00 | 1995.0 | 2005.00 | 2010 |
| MasVnrArea | 0 | 1 | 109.86 | 190.67 | 0 | 0.00 | 0.0 | 171.75 | 1600 |
| BsmtFinSF1 | 0 | 1 | 448.19 | 468.73 | 0 | 0.00 | 384.5 | 712.75 | 5644 |
| BsmtFinSF2 | 0 | 1 | 45.25 | 159.08 | 0 | 0.00 | 0.0 | 0.00 | 1474 |
| BsmtUnfSF | 0 | 1 | 606.12 | 445.83 | 0 | 270.00 | 525.0 | 846.00 | 2336 |
| TotalBsmtSF | 0 | 1 | 1099.56 | 415.85 | 105 | 816.00 | 1023.0 | 1345.50 | 6110 |
| X1stFlrSF | 0 | 1 | 1173.81 | 387.68 | 438 | 894.00 | 1097.0 | 1413.50 | 4692 |
| X2ndFlrSF | 0 | 1 | 356.54 | 439.26 | 0 | 0.00 | 0.0 | 729.00 | 2065 |
| LowQualFinSF | 0 | 1 | 4.68 | 42.10 | 0 | 0.00 | 0.0 | 0.00 | 572 |
| GrLivArea | 0 | 1 | 1535.03 | 526.12 | 438 | 1164.00 | 1480.0 | 1779.00 | 5642 |
| BsmtFullBath | 0 | 1 | 0.42 | 0.51 | 0 | 0.00 | 0.0 | 1.00 | 2 |
| BsmtHalfBath | 0 | 1 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 |
| FullBath | 0 | 1 | 1.58 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 |
| HalfBath | 0 | 1 | 0.39 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 |
| BedroomAbvGr | 0 | 1 | 2.86 | 0.76 | 0 | 2.00 | 3.0 | 3.00 | 6 |
| KitchenAbvGr | 0 | 1 | 1.03 | 0.19 | 1 | 1.00 | 1.0 | 1.00 | 3 |
| TotRmsAbvGrd | 0 | 1 | 6.57 | 1.58 | 3 | 5.00 | 6.0 | 7.00 | 12 |
| Fireplaces | 0 | 1 | 0.61 | 0.63 | 0 | 0.00 | 1.0 | 1.00 | 3 |
| GarageYrBlt | 0 | 1 | 1978.57 | 25.93 | 1900 | 1960.00 | 1982.0 | 2003.00 | 2010 |
| GarageCars | 0 | 1 | 1.88 | 0.66 | 1 | 1.00 | 2.0 | 2.00 | 4 |
| GarageArea | 0 | 1 | 503.76 | 192.26 | 160 | 360.00 | 484.0 | 602.50 | 1418 |
| WoodDeckSF | 0 | 1 | 94.34 | 122.62 | 0 | 0.00 | 0.0 | 169.75 | 857 |
| OpenPorchSF | 0 | 1 | 46.95 | 64.82 | 0 | 0.00 | 28.0 | 68.00 | 547 |
| EnclosedPorch | 0 | 1 | 22.05 | 61.57 | 0 | 0.00 | 0.0 | 0.00 | 552 |
| X3SsnPorch | 0 | 1 | 3.27 | 29.66 | 0 | 0.00 | 0.0 | 0.00 | 508 |
| ScreenPorch | 0 | 1 | 16.50 | 58.46 | 0 | 0.00 | 0.0 | 0.00 | 480 |
| PoolArea | 0 | 1 | 3.01 | 40.71 | 0 | 0.00 | 0.0 | 0.00 | 648 |
| MiscVal | 0 | 1 | 23.55 | 167.14 | 0 | 0.00 | 0.0 | 0.00 | 2500 |
| MoSold | 0 | 1 | 6.34 | 2.69 | 1 | 5.00 | 6.0 | 8.00 | 12 |
| YrSold | 0 | 1 | 2007.79 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 |
| SalePrice | 0 | 1 | 187033.26 | 83165.33 | 35311 | 132500.00 | 165750.0 | 221000.00 | 755000 |
There is a column with only one remaining category, namely Utilities, so it is simply dropped.
data_house1 <- data_house1 %>%
dplyr::select(-Utilities)
plot_histogram(data = data_house1,nrow=3,ncol = 3,
geom_histogram_args = list(fill="steelblue")
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot_bar(data = data_house1)
Variable Correlation
plot_scatterplot(data = data_house1 %>%
select_if(is.numeric),
by="SalePrice",geom_point_args = list(color="steelblue") )
cor_mat <- cor(data_house1%>%
select_if(is.numeric),method = "spearman")
cor_mat[upper.tri(cor_mat,diag = TRUE)] <- NA
cor_df <- cor_mat %>%
as.data.frame() %>%
rownames_to_column(var = "Var1") %>%
pivot_longer(names_to = "Var2",
values_to = "corr",
-Var1) %>% na.omit
cor_df %>% filter(abs(corr)>0.6) %>% arrange(desc(abs(corr)))
## # A tibble: 31 × 3
## Var1 Var2 corr
## <chr> <chr> <dbl>
## 1 GarageYrBlt YearBuilt 0.895
## 2 X1stFlrSF TotalBsmtSF 0.877
## 3 GarageArea GarageCars 0.841
## 4 TotRmsAbvGrd GrLivArea 0.829
## 5 SalePrice OverallQual 0.823
## 6 GarageYrBlt YearRemodAdd 0.747
## 7 YearRemodAdd YearBuilt 0.738
## 8 SalePrice GrLivArea 0.731
## 9 SalePrice GarageCars 0.681
## 10 SalePrice FullBath 0.671
## # … with 21 more rows
cor_df %>% filter(abs(corr)<=0.6)
## # A tibble: 635 × 3
## Var1 Var2 corr
## <chr> <chr> <dbl>
## 1 LotFrontage MSSubClass -0.313
## 2 LotArea MSSubClass -0.255
## 3 OverallQual MSSubClass 0.0992
## 4 OverallQual LotFrontage 0.238
## 5 OverallQual LotArea 0.283
## 6 OverallCond MSSubClass -0.0763
## 7 OverallCond LotFrontage -0.0693
## 8 OverallCond LotArea -0.0873
## 9 OverallCond OverallQual -0.264
## 10 YearBuilt MSSubClass -0.00468
## # … with 625 more rows
plot_bar(data = data_house1,with="SalePrice")
Variable Selection
Linear Regression
regresi1 <- lm(formula = SalePrice~.,data = data_house1)
broom::tidy(regresi1) %>% mutate(across(where(is.numeric),~ format(round(.x,3), big.mark=",", scientific=F)))
## # A tibble: 239 × 5
## term estimate std.error statistic p.value
## <chr> <chr> <chr> <chr> <chr>
## 1 (Intercept) "-3,644,761.772" "1,362,174.828" "-2.676" 0.008
## 2 MSSubClass " -3.786" " 128.499" "-0.029" 0.977
## 3 MSZoningFV " 41,403.127" " 14,264.781" " 2.902" 0.004
## 4 MSZoningRH " 28,190.701" " 15,383.702" " 1.833" 0.067
## 5 MSZoningRL " 27,116.671" " 12,657.393" " 2.142" 0.032
## 6 MSZoningRM " 23,665.844" " 11,794.937" " 2.006" 0.045
## 7 LotFrontage " 77.791" " 57.376" " 1.356" 0.176
## 8 LotArea " 0.819" " 0.154" " 5.305" 0.000
## 9 StreetPave " 53,484.238" " 20,623.555" " 2.593" 0.010
## 10 AlleyPave " -4,026.039" " 7,366.494" "-0.547" 0.585
## # … with 229 more rows
Forward Selection
best_forward <- regsubsets(SalePrice~.,data = data_house1,method = "forward",nvmax = 78)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 15 linear dependencies found
## Reordering variables and trying again:
summ_forward <- summary(best_forward)
result_gof <- data.frame(
Adj.R2 = c(which.max(summ_forward$adjr2),max(summ_forward$adjr2)),
CP = c(which.min(summ_forward$cp),min(summ_forward$cp)),
BIC = c(which.min(summ_forward$bic),min(summ_forward$bic))
)
result_gof
## Adj.R2 CP BIC
## 1 79.0000000 79.000000 41.000
## 2 0.9243032 5.802693 -2429.285
coef_forward <- coef(best_forward,result_gof$BIC[1])
enframe(coef_forward) %>% mutate(across(where(is.numeric), ~format(round(.x,3), big.mark=",", scientific=F)))
## # A tibble: 42 × 2
## name value
## <chr> <chr>
## 1 (Intercept) "-1,750,370.796"
## 2 MSSubClass " -141.815"
## 3 LotArea " 1.820"
## 4 LotConfigCulDSac " 25,170.144"
## 5 LandSlopeSev " -76,948.803"
## 6 NeighborhoodCrawfor " 35,947.652"
## 7 NeighborhoodNoRidge " 106,176.907"
## 8 NeighborhoodNridgHt " 87,107.138"
## 9 NeighborhoodSomerst " 37,152.844"
## 10 NeighborhoodStoneBr " 90,695.249"
## # … with 32 more rows
plot_forward <- data.frame(subset=seq_along(summ_forward$bic),BIC = summ_forward$bic)
ggplot(plot_forward,aes(subset,BIC))+
geom_line(size=1)+
geom_point(color="darkgreen",size=2)+
geom_vline(aes(xintercept=result_gof$BIC[1]), color="orange",size=1.2)+theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
Backward Selection
best_backward <- regsubsets(SalePrice~.,data = data_house1,method = "backward",nvmax = 78)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 15 linear dependencies found
## Reordering variables and trying again:
summ_backward <- summary(best_backward)
result_gof2 <- data.frame(
Adj.R2 = c(which.max(summ_backward$adjr2),max(summ_backward$adjr2)),
CP = c(which.min(summ_backward$cp),min(summ_backward$cp)),
BIC = c(which.min(summ_backward$bic),min(summ_backward$bic))
)
result_gof2
## Adj.R2 CP BIC
## 1 79.0000000 79.00000 46.000
## 2 0.9228913 23.33181 -2432.324
coef_backward <- coef(best_backward,result_gof2$BIC[1])
enframe(coef_backward) %>% mutate(across(where(is.numeric), ~format(round(.x,3), big.mark=",", scientific=F)))
## # A tibble: 47 × 2
## name value
## <chr> <chr>
## 1 (Intercept) "-1,592,018.215"
## 2 MSZoningFV " 26,696.330"
## 3 LotArea " 1.398"
## 4 StreetPave " 13,192.845"
## 5 LotConfigCulDSac " 26,338.556"
## 6 LandSlopeSev " -46,624.257"
## 7 NeighborhoodCrawfor " 26,810.626"
## 8 NeighborhoodEdwards " -18,846.596"
## 9 NeighborhoodMitchel " -15,130.495"
## 10 NeighborhoodNAmes " -4,773.610"
## # … with 37 more rows
plot_backward <- data.frame(subset=seq_along(summ_backward$bic), BIC = summ_backward$bic)
ggplot(plot_backward,aes(subset,BIC))+
geom_line(size=1)+
geom_point(color="darkgreen",size=2)+
geom_vline(aes(xintercept=result_gof2$BIC[1]),color="orange",size=1.2)+theme_bw()
Unsupervised Learning: Cluster Analysis
Lecture Example
Data
set.seed (17)
x=matrix(rnorm(50*2), ncol=2)
x[1:25,1]=x[1:25,1]+3
x[1:25,2]=x[1:25,2]-4
Non-hierarchical: K-means
K-means : k=2
km.out1=kmeans(x,2,nstart=20)
km.out1$cluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1 1 1 1 1 1 1 1 1
plot(x, col=(km.out1$cluster +1),
main="K-Means Clustering Results with K=2",
xlab="", ylab="", pch=20, cex=2)
K-means : k=3
set.seed(17)
km.out2=kmeans(x,3,nstart=20)
km.out2$cluster
## [1] 1 1 1 1 3 1 3 3 1 1 3 3 3 3 3 1 3 1 1 3 1 3 3 3 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [39] 2 2 2 2 2 2 2 2 1 2 2 2
plot(x, col=(km.out2$cluster +1),
main="K-Means Clustering Results with K=3",
xlab="", ylab="", pch=20, cex=2)
Comparison of WSS Values
km.out1$tot.withinss
## [1] 105.3784
km.out2$tot.withinss
## [1] 82.21278
Based on the smaller total within sum of squares (WSS), K-means with 3 clusters is more suitable than with 2 clusters, because the variability within the resulting clusters is smaller.
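As a check on how this quantity is computed, the short sketch below (assuming x and km.out2 from the chunks above) recomputes tot.withinss as the sum of squared Euclidean distances from each point to its assigned cluster centre.
# minimal check: tot.withinss = sum of squared distances to the assigned centres
wss_manual <- sum(sapply(1:3, function(k) {
  xk <- x[km.out2$cluster == k, , drop = FALSE]
  sum(sweep(xk, 2, km.out2$centers[k, ])^2)
}))
all.equal(wss_manual, km.out2$tot.withinss)  # should be TRUE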
Hierarchical: Dendrogram
hc.single=hclust(dist(x), method="single")
hc.complete=hclust(dist(x), method="complete")
hc.average=hclust(dist(x), method="average")
hc.centroid=hclust(dist(x), method="centroid")
hc.median=hclust(dist(x), method="median")
hc.ward=hclust(dist(x), method="ward.D")
Dendrogram: Single Linkage
plot(hc.single , main="Single Linkage", xlab="", sub="",
cex =.9)
Dendrogram: Complete Linkage
plot(hc.complete,main="Complete Linkage", xlab="", sub="",
cex =.9)
Dendrogram: Average Linkage
plot(hc.average , main="Average Linkage", xlab="", sub="",
cex =.9)
Dendrogram: Centroid Linkage
plot(hc.centroid, main="Centroid Linkage", xlab="", sub="",
cex =.9)
Dendrogram: Median Linkage
plot(hc.median, main="Median Linkage", xlab="", sub="",
cex =.9)
Dendrogram: Ward Linkage
plot(hc.ward, main="Ward Linkage", xlab="", sub="",
cex =.9)
The dendrograms from the six clustering methods above are all easy to read: the branches are neatly arranged, so it is straightforward to see which cluster branch each observation falls into.
Tutorial Example
Data
This dataset can be accessed via the following link: Dataset Sekolah
data <- read.table("D:/MY COLLEGE/SEMESTER 5/PSD/DATA/DataSekolah.csv",sep=";", header = T)
tidyr::tibble(data)
## # A tibble: 27 × 9
## Wilayah standar_isi standar…¹ stand…² stand…³ stand…⁴ stand…⁵ stand…⁶ stand…⁷
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 95.7 93.8 93.6 80.4 87.2 93.5 95.0 94.1
## 2 2 96.3 94.5 93.3 79.8 83.6 94.2 94.4 94.4
## 3 3 96.2 95.5 94.2 78.3 79 95.5 92.2 95.8
## 4 4 94.6 92.2 90.5 76.9 79.8 92.2 95.3 93.4
## 5 5 95.3 93 92.2 76.3 77.7 91.7 93.7 94.2
## 6 6 95.7 92.5 92.6 75.5 76.0 90.2 95.0 93
## 7 7 96 93.8 94.4 86 88.2 93.8 94.8 95.1
## 8 8 95.3 94.7 94 84.9 88.1 94 97.3 95.8
## 9 9 97 95.4 94 79.6 87.6 93.9 95.1 95.7
## 10 10 95 96 95 82 82.5 92 96 90
## # … with 17 more rows, and abbreviated variable names ¹standar_proses,
## # ²standar_kompetensi, ³standar_pendidik, ⁴standar_sarana,
## # ⁵standar_pengelolaan, ⁶standar_biaya, ⁷standar_nilai
Standardization
data.ok = data[,-1]
head(data.ok)
## standar_isi standar_proses standar_kompetensi standar_pendidik standar_sarana
## 1 95.72 93.82 93.65 80.39 87.21
## 2 96.31 94.52 93.31 79.76 83.62
## 3 96.17 95.50 94.17 78.33 79.00
## 4 94.63 92.20 90.54 76.88 79.83
## 5 95.34 93.00 92.21 76.28 77.66
## 6 95.67 92.48 92.57 75.48 76.05
## standar_pengelolaan standar_biaya standar_nilai
## 1 93.46 95.04 94.11
## 2 94.24 94.45 94.38
## 3 95.50 92.17 95.83
## 4 92.22 95.27 93.44
## 5 91.72 93.72 94.17
## 6 90.19 95.05 93.00
data.stdz = scale(data.ok)
apply(data.stdz, 2, mean) # check that each mean is 0
## standar_isi standar_proses standar_kompetensi standar_pendidik
## 3.180569e-15 6.868605e-16 -2.178426e-15 -7.314739e-16
## standar_sarana standar_pengelolaan standar_biaya standar_nilai
## 3.829086e-16 3.070609e-15 3.268383e-16 2.322203e-16
apply(data.stdz, 2, sd) # check that each standard deviation is 1
## standar_isi standar_proses standar_kompetensi standar_pendidik
## 1 1 1 1
## standar_sarana standar_pengelolaan standar_biaya standar_nilai
## 1 1 1 1
In cluster analysis, standardization is needed when the variables are on different scales, because the analysis is based on distances between observations. Afterwards, it is checked that each standardized variable has a mean of 0 and a standard deviation of 1.
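As a small sanity check (a sketch only, assuming data.ok and data.stdz from the chunks above), scale() is equivalent to subtracting each column mean and dividing by each column standard deviation.
# manual standardization: (x - mean) / sd, column by column
manual_stdz <- apply(data.ok, 2, function(v) (v - mean(v)) / sd(v))
all.equal(manual_stdz, data.stdz, check.attributes = FALSE)  # should be TRUE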
Hierarchical Clustering
Cluster Selection & Dendrogram
The number of clusters is chosen by comparing several linkage methods and distance measures in order to obtain the optimal number of clusters.
fviz_nbclust(data.stdz, FUNcluster = hcut, method = "silhouette", hc_method = "single", hc_metric="euclidean")
fviz_dend(hclust(dist(data.stdz, method = "euclidean"), method = "single"))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Here the number of clusters is chosen with single linkage, in which the distance between two clusters is measured by the closest distance between objects in different clusters, combined with the silhouette coefficient and Euclidean distance, the distance measure most commonly used when the variables are on the same scale.
fviz_nbclust(data.stdz, FUNcluster = hcut, method = "silhouette", hc_method = "complete", hc_metric="euclidean")
fviz_dend(hclust(dist(data.stdz, method = "euclidean"), method = "complete"))
Here the number of clusters is chosen with complete linkage, in which the distance between two clusters is measured by the farthest distance between objects in different clusters, combined with the silhouette coefficient and Euclidean distance, the distance measure most commonly used when the variables are on the same scale.
fviz_nbclust(data.stdz, FUNcluster = hcut, method = "silhouette", hc_method = "average", hc_metric="euclidean")
fviz_dend(hclust(dist(data.stdz, method = "euclidean"), method = "average"))
Here the number of clusters is chosen with average linkage, in which the distance between two clusters is computed as the average of the distances between members of the first cluster and members of the second, combined with the silhouette coefficient and Euclidean distance, the distance measure most commonly used when the variables are on the same scale.
fviz_nbclust(data.stdz, FUNcluster = hcut, method = "silhouette", hc_method = "centroid", hc_metric="euclidean")
fviz_dend(hclust(dist(data.stdz, method = "euclidean"), method = "centroid"))
Here the number of clusters is chosen with centroid linkage, in which the distance between two clusters is measured as the Euclidean distance between the cluster means (centroids), combined with the silhouette coefficient.
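To make these linkage definitions concrete, the illustrative sketch below (not part of the original analysis; it assumes data.stdz from above) takes the 2-cluster cut from complete linkage and shows that the single, complete, and average linkage distances between the two clusters are simply the minimum, maximum, and mean of the pairwise distances between objects in different clusters.
# pairwise Euclidean distances and a 2-cluster complete-linkage cut
d_mat <- as.matrix(dist(data.stdz, method = "euclidean"))
cl2 <- cutree(hclust(dist(data.stdz, method = "euclidean"), method = "complete"), k = 2)
between <- d_mat[cl2 == 1, cl2 == 2]  # distances between objects in different clusters
c(single = min(between), complete = max(between), average = mean(between))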
Subjectively, the author chooses complete linkage with the silhouette coefficient and Euclidean distance, because the resulting dendrogram is the easiest to read. The observations are then split into 2 clusters.
# Interpretation: inspect the members of each cluster
hc.data <- eclust(data.ok, stand = TRUE, FUNcluster = "hclust", k=2, hc_method = "complete", hc_metric = "euclidean", graph = F)
hc.data$cluster # cluster assignment of each observation
## [1] 1 1 1 2 2 2 1 1 1 2 1 2 2 1 1 1 2 1 1 2 1 1 1 1 2 2 1
Data Aggregation
aggregate(data.ok, by=list(cluster=hc.data$cluster), FUN = mean)
## cluster standar_isi standar_proses standar_kompetensi standar_pendidik
## 1 1 96.45118 95.21765 93.92706 81.24412
## 2 2 94.17200 91.80600 90.28900 77.14400
## standar_sarana standar_pengelolaan standar_biaya standar_nilai
## 1 86.16647 93.99353 94.90235 95.62765
## 2 77.51200 90.66800 91.96200 92.36200
Before the clusters are visualized in a cluster plot, the data are aggregated to obtain the mean of each variable per cluster.
Cluster Plot
fviz_cluster(hc.data)
Non-hierarchical Clustering
Determining the Number of Clusters
fviz_nbclust(data.stdz, FUNcluster = kmeans, method = "wss")
The number of clusters (\(k\)) is chosen using the within sum of squares (WSS): the value of \(k\) at which the curve bends like an elbow is taken, which gives a split into 2 clusters.
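The elbow curve produced by fviz_nbclust() can also be reproduced by hand; a minimal sketch (assuming data.stdz from above) runs kmeans() for k = 1 to 10 and plots the total within-cluster sum of squares.
# manual elbow curve: total within-cluster SS for k = 1, ..., 10
set.seed(17)
wss <- sapply(1:10, function(k) kmeans(data.stdz, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")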
Clustering
# set k = 2
kmeans.data <- eclust(data.ok, stand = TRUE, FUNcluster = "kmeans", k=2, graph = F)
kmeans.data$cluster
## [1] 2 2 2 1 1 1 2 2 2 2 2 1 1 2 2 2 1 2 2 1 2 2 2 2 1 1 2
Retrieving the Centroids
kmeans.data$centers # show the centroids (still on the standardized scale, so hard to interpret)
## standar_isi standar_proses standar_kompetensi standar_pendidik standar_sarana
## 1 -0.9224109 -1.1080938 -0.8964315 -0.8740571 -1.0900098
## 2 0.4612054 0.5540469 0.4482158 0.4370286 0.5450049
## standar_pengelolaan standar_biaya standar_nilai
## 1 -1.0923991 -0.7226372 -0.8303075
## 2 0.5461996 0.3613186 0.4151537
The centroids returned by the k-means clustering are still on the standardized scale and therefore hard to interpret, so data aggregation is needed.
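As a hedged alternative to aggregation (a sketch assuming kmeans.data and data.stdz from above), the standardized centroids can also be mapped back to the original scale using the centering and scaling constants stored by scale(); the result should match the per-cluster means from aggregate() below.
# back-transform the standardized centroids: original = standardized * sd + mean
ctr <- attr(data.stdz, "scaled:center")
scl <- attr(data.stdz, "scaled:scale")
sweep(sweep(kmeans.data$centers, 2, scl, "*"), 2, ctr, "+")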
Data Aggregation
aggregate(data, by=list(cluster=kmeans.data$cluster), FUN = mean)
## cluster Wilayah standar_isi standar_proses standar_kompetensi
## 1 1 14.22222 94.08000 91.34000 89.76556
## 2 2 13.88889 96.37056 95.26111 93.98667
## standar_pendidik standar_sarana standar_pengelolaan standar_biaya
## 1 76.60444 76.95778 90.52000 91.51333
## 2 81.28611 85.96278 93.88278 94.96333
## standar_nilai
## 1 92.62444
## 2 95.31500
Cluster Plot: K-means
fviz_cluster(kmeans.data)
Exercise
Data
The data used for this cluster analysis are secondary data obtained from the Kaggle website. The Mall_customers data consist of the variables Customer ID, Age, Annual income, and Spending score. Customer ID is a unique identifier for each customer, and Age is the customer's age. Spending score is a value (on a 1-100 scale) that the mall assigns to each customer based on behaviour such as visit time, the type of goods bought, and the amount of money spent; the higher the Spending Score, the more loyal the customer and the more they spend. Hierarchical and non-hierarchical cluster analysis therefore help the mall owner group its customers, so that the marketing team can develop the right strategy for each group. The data can be accessed via the following link:
Dataset Mall Customers
mc <- read.csv("D:/MY COLLEGE/SEMESTER 5/PSD/DATA/Mall_Customers.csv")
The Customer ID variable is not used in this analysis, so it is removed first.
Stage 1: Pre-processing
mc <- mc[,c("Age","Annual.Income","Spending.Score")]
str(mc)
## 'data.frame': 200 obs. of 3 variables:
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual.Income : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending.Score: int 39 81 6 77 40 76 6 94 3 72 ...
tidyr::tibble(mc)
## # A tibble: 200 × 3
## Age Annual.Income Spending.Score
## <int> <int> <int>
## 1 19 15 39
## 2 21 15 81
## 3 20 16 6
## 4 23 16 77
## 5 31 17 40
## 6 22 17 76
## 7 35 18 6
## 8 23 18 94
## 9 64 19 3
## 10 30 19 72
## # … with 190 more rows
Variable Standardization
Variable standardization transforms each variable so that it has a mean of zero and a standard deviation of one. It is carried out here because the variables used (Age, Annual income, and Spending score) are measured in different units, while both the non-hierarchical (k-means) and hierarchical methods rely on distances between observations, which are quite sensitive to differences in measurement units. In R, standardization is done with the scale() function. After standardization, each variable's mean is approximately 0 and its standard deviation is approximately 1.
mc_stdr <- scale(mc)
apply(mc_stdr,2,mean)
## Age Annual.Income Spending.Score
## -1.016906e-16 -8.144310e-17 -1.096708e-16
apply(mc_stdr,2,sd)
## Age Annual.Income Spending.Score
## 1 1 1
Exploratory Data Analysis (EDA)
boxplot(mc)
Based on the boxplots above, the distributions of Age, Annual Income, and Spending Score are skewed to the right, and Annual Income shows indications of upper outliers.
Hierarchical Clustering Analysis
In general, the number of clusters can be determined using several statistical criteria, among them the silhouette coefficient and the WSS (within sum of squares). The silhouette coefficient is computed from distances between observations: it measures how close an observation is to the other observations in its own cluster (cohesion) compared with observations in different clusters (separation). A larger coefficient indicates that the clusters formed are appropriate. The WSS criterion measures the variability within the clusters formed; the smaller this variability, the more appropriate the clusters. The most suitable number of clusters is compared using both criteria. In R, the fviz_nbclust() function from the factoextra package can be used to choose the number of clusters.
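As a minimal sketch of how the average silhouette width is obtained for one candidate solution (here complete linkage with k = 5, and assuming the cluster package, which ships with standard R installations, is available), fviz_nbclust() essentially repeats a computation like this for k = 1 to 10.
library(cluster)  # provides silhouette()
d_euc <- dist(mc_stdr, method = "euclidean")
cl5 <- cutree(hclust(d_euc, method = "complete"), k = 5)
sil <- silhouette(cl5, d_euc)  # silhouette width of every observation
mean(sil[, "sil_width"])       # average silhouette width for k = 5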
Stage 2
This cluster analysis uses the silhouette method because it is easier to work with during the analysis. Euclidean distance is used because it is the distance measure most often used across cluster analysis methods (Sumertajaya and Erfiani 2007). The linkage methods are compared in order to form suitable clusters.
fviz_nbclust(mc_stdr,FUNcluster = hcut,method = "silhouette",
hc_method = "single",hc_metric = "euclidean")
fviz_nbclust(mc_stdr,FUNcluster = hcut,method = "silhouette",
hc_method = "complete",hc_metric = "euclidean")
fviz_nbclust(mc_stdr,FUNcluster = hcut,method = "silhouette",
hc_method = "average",hc_metric = "euclidean")
fviz_nbclust(mc_stdr,FUNcluster = hcut,method = "silhouette",
hc_method = "centroid",hc_metric = "euclidean")
fviz_nbclust(mc_stdr,FUNcluster = hcut,method = "silhouette",
hc_method = "ward.D",hc_metric = "euclidean")
Based on the silhouette coefficient, the methods disagree: single and centroid linkage select 2 clusters as optimal, complete and average linkage select 5 clusters, while Ward linkage selects 6 clusters. For this analysis, the candidate methods are single, centroid, complete, and average linkage, since each pair agrees on the number of clusters and would therefore produce relatively similar groupings. The author can then choose one of these four linkage methods.
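Rather than reading the optimal k off each plot, it can also be extracted from the object returned by fviz_nbclust() (its $data component, which is also used later for the WSS comparison); a hedged sketch, assuming mc_stdr from above:
# optimal k per linkage = the k with the largest average silhouette width
opt_k <- sapply(c("single", "complete", "average", "centroid", "ward.D"), function(m) {
  p <- fviz_nbclust(mc_stdr, FUNcluster = hcut, method = "silhouette",
                    hc_method = m, hc_metric = "euclidean")
  k <- as.numeric(as.character(p$data$clusters))
  k[which.max(p$data$y)]
})
opt_k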
Stage 2: Dendrogram
linkage_methods <- c("single", "complete","average","centroid","ward.D")
hc_mall_dend <- lapply(linkage_methods, function(i)
hclust(dist(mc_stdr,method = 'euclidean'),method = i)
)
Single Linkage
fviz_dend(hc_mall_dend[[1]])
Complete Linkage
fviz_dend(hc_mall_dend[[2]])
Average Linkage
fviz_dend(hc_mall_dend[[3]])
Centroid Linkage
fviz_dend(hc_mall_dend[[4]])
Ward Linkage
Ward's method has much in common with ANOVA: the distance between two clusters is defined as the increase in the error sum of squares (ESS) that results from merging the two groups into a single cluster.
fviz_dend(hc_mall_dend[[5]])
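To make the ESS idea concrete, the illustrative sketch below (not from the original analysis) computes the increase in the error sum of squares when two small hypothetical groups of standardized observations are merged; Ward's criterion merges, at each step, the pair of clusters for which this increase is smallest.
# ESS of a group = sum of squared deviations from the group centroid
ess <- function(m) sum(scale(m, scale = FALSE)^2)
g1 <- mc_stdr[1:5, ]   # hypothetical group 1 (first five observations)
g2 <- mc_stdr[6:10, ]  # hypothetical group 2 (next five observations)
ess(rbind(g1, g2)) - (ess(g1) + ess(g2))  # ESS increase if g1 and g2 were merged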
The five dendrograms built with the different linkage methods and evaluated with the silhouette coefficient do not differ substantially, since the numbers of clusters they form are similar.
Stage 3
hc_mc <- eclust(mc,stand = TRUE,FUNcluster = "hclust",k=5,hc_method = "complete",hc_metric = "euclidean",graph = F)
hc_mc$cluster
## [1] 1 2 1 2 1 2 1 2 3 2 3 2 3 2 1 2 1 2 3 2 1 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3
## [38] 2 3 2 3 2 3 1 3 2 3 1 1 1 3 1 1 3 3 3 3 3 1 3 3 1 3 3 3 1 1 3 1 1 3 3 3 3
## [75] 3 1 1 1 1 3 3 1 3 3 1 3 3 1 1 3 3 1 3 1 1 1 3 1 3 1 1 3 3 1 3 1 3 3 3 3 3
## [112] 1 1 1 1 1 3 3 3 3 1 1 1 4 1 4 5 4 5 4 5 4 1 4 5 4 5 4 5 4 5 4 1 4 5 4 5 4
## [149] 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5
## [186] 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4
aggregate(mc,by =list(cluster=hc_mc$cluster),
FUN = mean)
## cluster Age Annual.Income Spending.Score
## 1 1 28.35417 50.29167 45.93750
## 2 2 24.80952 25.61905 80.23810
## 3 3 55.33333 47.31579 41.08772
## 4 4 32.69231 86.53846 82.12821
## 5 5 41.68571 88.22857 17.28571
Visualization: Cluster Plot
fviz_cluster(hc_mc)
Principal Component Analysis
PCA can be interpreted using the eigenvector (loading) of each principal component. The larger a variable's loading on a particular component, the larger that variable's contribution to building the component. Note also that a negative loading means the original variable contributes in the opposite direction: the larger the value of the original variable, the smaller the score on that principal component.
pca_mC <- prcomp(mc_stdr)
pca_mC$rotation
## PC1 PC2 PC3
## Age 0.70638235 -0.03014116 0.707188441
## Annual.Income -0.04802398 -0.99883160 0.005397916
## Spending.Score -0.70619946 0.03777499 0.707004506
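As a small check of this interpretation (assuming pca_mC and mc_stdr from above), each component score is a linear combination of the centred variables weighted by the loadings, so a negative loading (for example Spending.Score on PC1) lowers the score as the original value grows.
# PC scores = centred data %*% loadings
scores_manual <- sweep(mc_stdr, 2, pca_mC$center) %*% pca_mC$rotation
all.equal(unname(scores_manual), unname(pca_mC$x))  # should be TRUE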
Non-Hierarchical Clustering Analysis: K-means
Stage 2
fviz_nbclust(mc_stdr,FUNcluster = kmeans,method = "silhouette")
fviz_nbclust(mc_stdr,FUNcluster = kmeans,method = "wss")
For the silhouette criterion, the number of clusters with the highest coefficient is 8. For WSS, the number of clusters is chosen where the curve forms an elbow; in the plot above the curve bends at around 4 and 8 clusters, although this visual judgement is subjective. The two criteria therefore differ slightly in the optimal number of clusters, so for ease of interpretation the number of clusters is set to 4.
Stage 3
The eclust() function from the factoextra package is used to apply the k-means method. The $cluster element gives the cluster assignment of each observation; each cluster is then interpreted with the help of the mean of each variable computed per cluster.
kmeans_mc <- eclust(mc,stand = TRUE,FUNcluster = "kmeans",k=4,graph = F)
kmeans_mc$cluster
## [1] 3 3 3 3 3 3 2 3 2 3 2 3 2 3 2 3 3 3 2 3 3 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
## [38] 3 2 3 2 3 2 3 2 3 2 3 3 3 2 3 3 2 2 2 2 2 3 2 2 3 2 2 2 3 2 2 3 3 2 2 2 2
## [75] 2 3 2 2 3 2 2 3 2 2 3 2 2 3 3 2 2 3 2 2 3 3 2 3 2 3 3 2 2 3 2 3 2 2 2 2 2
## [112] 3 1 3 3 3 2 2 2 2 3 1 4 4 1 4 1 4 2 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4
## [149] 1 4 1 4 1 4 1 4 1 4 1 4 2 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1
## [186] 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4
kmeans_mc$centers
## Age Annual.Income Spending.Score
## 1 0.03711223 0.9876366 -1.1857814
## 2 1.08344244 -0.4893373 -0.3961802
## 3 -0.96008279 -0.7827991 0.3910484
## 4 -0.42773261 0.9724070 1.2130414
Stage 4
Approach 1: Standardized Centroids
The k-means interpretation based on the mean values in $centers is as follows.
- Cluster 1
This cluster consists of customers who are fairly young with a high income but little appetite for spending money on shopping.
- Cluster 2
This cluster consists of older customers with a low income and little appetite for spending, presumably retirees whose only income is a pension.
- Cluster 3
This cluster consists of very young customers with a low income but a large appetite for spending, which is a rather unusual group.
- Cluster 4
This cluster consists of fairly young customers with a high income and a large appetite for spending; these are likely the right target for the marketing team, given an appropriate strategy.
Approach 2: Aggregation
A second way to make the interpretation easier is to use the aggregate() function to compute the mean of each variable on the original scale for each of the clusters formed.
aggregate(mc,by =list(cluster = kmeans_mc$cluster),
FUN = mean)
## cluster Age Annual.Income Spending.Score
## 1 1 39.36842 86.50000 19.57895
## 2 2 53.98462 47.70769 39.96923
## 3 3 25.43860 40.00000 60.29825
## 4 4 32.87500 86.10000 81.52500
Approach 3: Cluster Plot
Before the cluster plot is drawn, the variables are first reduced to 2 principal components using PCA. Interpreting each cluster therefore requires interpreting these two components, and the two components will not necessarily explain the total variance of the original data well (this is checked after the PCA output below).
fviz_cluster(kmeans_mc)
Principal Component Analysis
pca_mall <- prcomp(mc_stdr)
pca_mall$rotation
## PC1 PC2 PC3
## Age 0.70638235 -0.03014116 0.707188441
## Annual.Income -0.04802398 -0.99883160 0.005397916
## Spending.Score -0.70619946 0.03777499 0.707004506
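As a quick check of the note above about how much information the two-dimensional cluster plot retains (assuming pca_mall from the chunk above), the cumulative proportion of variance explained by the first two components can be read from summary().
# proportion of total variance captured by the cluster plot's two axes
summary(pca_mall)$importance["Cumulative Proportion", 1:2]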
Comparison of the Two Methods
Comparison of the within sum of squares (WSS) between hierarchical clustering with complete linkage and non-hierarchical clustering (k-means), both using Euclidean distance.
d <- fviz_nbclust(mc_stdr,FUNcluster = hcut,method = "wss",
hc_method = "complete",hc_metric = "euclidean")
d
d$data
## clusters y
## 1 1 597.00000
## 2 2 472.44364
## 3 3 334.64043
## 4 4 210.61574
## 5 5 171.00969
## 6 6 156.28341
## 7 7 144.22360
## 8 8 116.22958
## 9 9 107.39757
## 10 10 99.36778
k <- fviz_nbclust(mc_stdr,FUNcluster = kmeans,method = "wss")
k
k$data
## clusters y
## 1 1 597.00000
## 2 2 387.43926
## 3 3 295.51779
## 4 4 204.19902
## 5 5 167.85741
## 6 6 133.19899
## 7 7 147.86523
## 8 8 104.68068
## 9 9 107.61878
## 10 10 89.69673
cbind(clusters=d$data[,1],complete=d$data[,2],kmeans=k$data[,2])
## clusters complete kmeans
## [1,] 1 597.00000 597.00000
## [2,] 2 472.44364 387.43926
## [3,] 3 334.64043 295.51779
## [4,] 4 210.61574 204.19902
## [5,] 5 171.00969 167.85741
## [6,] 6 156.28341 133.19899
## [7,] 7 144.22360 147.86523
## [8,] 8 116.22958 104.68068
## [9,] 9 107.39757 107.61878
## [10,] 10 99.36778 89.69673
Non-hierarchical clustering (k-means) with \(k\)=4 is more optimal than the hierarchical approach: at most numbers of clusters its WSS is smaller, so objects in different clusters are less similar while objects within the same cluster are more similar.
References
Sumertajaya IM, Erfiani. 2007. Analisis gerombol menggunakan metode two step cluster (studi kasus: data Potensi Desa Sensus Ekonomi 2003 wilayah Jawa Barat). Forum Statistika dan Komputasi. 12(1): 18-23.