1 Unsupervised learning: K-means clustering

The iris dataset includes the species of each flower (Species).
K-means is an unsupervised method, so the cluster analysis does not use this label. Clustering is performed using only the four flower measurements.

# 1. Load the iris data
data(iris)

# 2. Keep only the flower measurements (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
iris_data <- iris[, -5]  # all columns except Species

# 3. Run K-means clustering (k = 3)
set.seed(123)  # set a seed so the result is reproducible
kmeans_result <- kmeans(iris_data, centers = 3, nstart = 25)

# 4. Print the result
kmeans_result

## K-means clustering with 3 clusters of sizes 50, 62, 38
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     5.901613    2.748387     4.393548    1.433871
## 3     6.850000    3.073684     5.742105    2.071053
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# 5. Attach the cluster assignments to the data for plotting
library(ggplot2)
iris$Cluster <- as.factor(kmeans_result$cluster)

# 6. 2D scatter plot using Petal.Length and Petal.Width
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Cluster)) +
  geom_point(size = 3) +
  labs(title = "K-means Clustering of Iris Data",
       x = "Petal Length",
       y = "Petal Width") +
  theme_minimal()

# 7. Compare the K-means clusters with the actual species
table(iris$Cluster, iris$Species)

##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36
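
Cluster numbers are arbitrary, so the table is easier to read once each cluster is mapped to its most common species. The following is a minimal sketch rather than part of the original analysis; the names `features`, `km`, `tab`, `majority`, and `purity` are introduced here for illustration, and it works on a fresh copy of the data so it does not depend on earlier objects.

```r
# Sketch: summarise agreement between K-means clusters and species.
features <- datasets::iris[, 1:4]
species  <- datasets::iris$Species

set.seed(123)
km <- kmeans(features, centers = 3, nstart = 25)

# Cross-tabulate cluster membership against the known species
tab <- table(Cluster = km$cluster, Species = species)

# Majority species in each cluster (the cluster numbers themselves are arbitrary)
majority <- colnames(tab)[apply(tab, 1, which.max)]
names(majority) <- rownames(tab)

# Purity: share of observations that belong to their cluster's majority species
purity <- sum(apply(tab, 1, max)) / sum(tab)
majority
purity
```

For the table shown above, the same calculation gives (50 + 48 + 36) / 150, or about 0.893.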

2 Supervised learning methods

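K-means clustering is an unsupervised method, so its clusters will not always match the true labels (here, the Species column of the iris data set) perfectly. Discrepancies like the ones in the table above are typical.

Because the true labels (species) are already available, a supervised learning algorithm is worth considering. These methods are trained to minimize the error between predictions and true labels, so they generally predict the labels more accurately. Note that the confusion matrices in this section are computed on the same data the models were trained on, so they overstate the accuracy to expect on new data.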

2.1 Decision tree

library(rpart)
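# Train a decision tree model
# Note: iris now also contains the Cluster column added in section 1, so
# `Species ~ .` uses it as a predictor alongside the four measurements.
# The same applies to the models in sections 2.2 and 2.3.
model_tree <- rpart(Species ~ ., data = iris, method = "class")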

# Predict using the model
pred_tree <- predict(model_tree, iris, type = "class")

# Confusion matrix
table(Predicted = pred_tree, Actual = iris$Species)

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         49         5
##   virginica       0          1        45

2.2 Random Forest

Random forest is an ensemble method known for its high accuracy and robustness.

library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

# Train a random forest model
model_rf <- randomForest(Species ~ ., data = iris, ntree = 100)

# Predict using the model
pred_rf <- predict(model_rf, iris)

# Confusion matrix
table(Predicted = pred_rf, Actual = iris$Species)

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         50         0
##   virginica       0          0        50

2.3 Support Vector Machines (SVM)

SVMs can also be very effective for classification tasks.

library(e1071)
# Train an SVM model
model_svm <- svm(Species ~ ., data = iris)

# Predict using the model
pred_svm <- predict(model_svm, iris)

# Confusion matrix
table(Predicted = pred_svm, Actual = iris$Species)

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         5
##   virginica       0          2        45
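
The confusion matrices in this section score each model on the rows it was trained on. As a hedged illustration of the alternative, the sketch below holds out 30% of the rows and evaluates a decision tree on them. The split ratio, the seed, and the names `dat`, `test_idx`, `fit`, and `holdout_accuracy` are assumptions made for this example, and it uses the original five-column iris data without the cluster columns.

```r
# Sketch: evaluate a decision tree on held-out data instead of the training set.
library(rpart)

dat <- datasets::iris            # original 5 columns, no cluster labels

set.seed(123)
test_idx <- sample(nrow(dat), size = round(0.3 * nrow(dat)))  # 30% hold-out
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

fit  <- rpart(Species ~ ., data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix and accuracy on rows the model has not seen
conf <- table(Predicted = pred, Actual = test$Species)
holdout_accuracy <- sum(diag(conf)) / sum(conf)
conf
holdout_accuracy
```

The same pattern should carry over to `randomForest()` and `svm()` by swapping the fitting call; hold-out accuracy is usually a little lower than training accuracy.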

3 Cluster validation techniques

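If you want to keep using an unsupervised method such as K-means, cluster validation techniques can help you tune the model and get better results. Methods such as silhouette analysis or the elbow method help find a suitable number of clusters. You can also try the alternatives described in the subsections below.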
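
A minimal sketch of both diagnostics follows, assuming the `cluster` package (one of R's recommended packages) is available for `silhouette()`. The range of k values and the names `ks`, `wss`, and `avg_sil` are choices made for this illustration.

```r
# Sketch: elbow and silhouette diagnostics for choosing k.
library(cluster)   # provides silhouette()

features <- datasets::iris[, 1:4]
d  <- dist(features)
ks <- 2:6

set.seed(123)
wss     <- numeric(length(ks))   # total within-cluster sum of squares
avg_sil <- numeric(length(ks))   # average silhouette width
for (i in seq_along(ks)) {
  km <- kmeans(features, centers = ks[i], nstart = 25)
  wss[i]     <- km$tot.withinss
  avg_sil[i] <- mean(silhouette(km$cluster, d)[, "sil_width"])
}

data.frame(k = ks, wss = wss, avg_silhouette = avg_sil)
```

For the elbow method, plot `wss` against `ks` (for example with `plot(ks, wss, type = "b")`) and look for the k after which the curve flattens. For silhouette analysis, prefer the k with the largest average width.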

3.1 PCA (Principal Component Analysis) before clustering

Reducing the dimensionality of the data before clustering emphasizes the directions of greatest variance, which can improve cluster separation.

# Perform PCA

str(iris)

## 'data.frame':    150 obs. of  6 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Cluster     : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...

pca_result <- prcomp(iris[, -c(5:6)], scale. = TRUE)

# Apply K-means on first two principal components
set.seed(123)
kmeans_pca <- kmeans(pca_result$x[, 1:2], centers = 3, nstart = 25)

# Visualize the clusters after PCA
iris$Cluster_pca <- as.factor(kmeans_pca$cluster)
ggplot(iris, aes(x = pca_result$x[, 1], y = pca_result$x[, 2], color = Cluster_pca)) +
  geom_point(size = 3) +
  labs(title = "K-means Clustering after PCA",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal()

# Confusion matrix
table(Predicted = iris$Cluster_pca, Actual = iris$Species)

##          Actual
## Predicted setosa versicolor virginica
##         1     50          0         0
##         2      0         39        14
##         3      0         11        36
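
The clustering above keeps only the first two principal components, so it helps to check how much variance they retain. This is a small sketch on a fresh copy of the data; `var_explained` and `cum_explained` are names introduced here.

```r
# Sketch: how much variance do the leading principal components retain?
pca <- prcomp(datasets::iris[, 1:4], scale. = TRUE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)   # proportion per component
cum_explained <- cumsum(var_explained)          # cumulative proportion

round(data.frame(PC = 1:4,
                 proportion = var_explained,
                 cumulative = cum_explained), 3)
```

Retaining most of the variance does not guarantee better clusters. In the confusion matrix above, K-means on the two components puts 11 versicolor flowers in the virginica-dominated cluster, compared with 2 for K-means on the raw measurements.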

3.2 Gaussian Mixture Models (GMM)

A GMM assumes that the data are generated from a mixture of several Gaussian distributions, and it often yields better results than K-means on complex clustering problems.

library(mclust)

## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.

# Fit GMM model
gmm_model <- Mclust(iris_data, G=3)

# View clustering results
table(Predicted = gmm_model$classification, Actual = iris$Species)

##          Actual
## Predicted setosa versicolor virginica
##         1     50          0         0
##         2      0         45         0
##         3      0          5        50

3.3 Hierarchical clustering

Another alternative is hierarchical clustering, which can find natural groupings in the data more flexibly.

# Compute the distance matrix
d <- dist(iris_data)

# Apply hierarchical clustering
hc <- hclust(d)

# Cut tree into 3 clusters
hc_clusters <- cutree(hc, k = 3)

# Compare with actual species
table(Predicted = hc_clusters, Actual = iris$Species)

##          Actual
## Predicted setosa versicolor virginica
##         1     50          0         0
##         2      0         23        49
##         3      0         27         1
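
The result of hierarchical clustering depends on the linkage method, and `hclust()` uses complete linkage unless told otherwise. The sketch below compares three linkages on the same distance matrix. The choice of linkages and the names `linkages` and `results` are assumptions for this example, and no particular outcome is claimed.

```r
# Sketch: compare linkage methods for hierarchical clustering.
features <- datasets::iris[, 1:4]
species  <- datasets::iris$Species
d <- dist(features)

linkages <- c("complete", "average", "ward.D2")   # "complete" is the default
results <- lapply(linkages, function(m) {
  clusters <- cutree(hclust(d, method = m), k = 3)
  table(Predicted = clusters, Actual = species)
})
names(results) <- linkages
results
```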

4 Summary

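If you want higher predictive power and accuracy:

Supervised learning algorithms such as random forest, SVM, or decision trees are likely to outperform unsupervised algorithms such as K-means at recovering known labels.
For unsupervised methods, you can try a Gaussian mixture model (GMM) or apply dimensionality reduction (e.g., PCA) before clustering.
If unsupervised learning remains the priority, use cluster validation techniques (e.g., the silhouette score) to tune K-means clustering.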

5 KNN algorithm

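KNN (K-Nearest Neighbors) is a simple and widely used classification algorithm in machine learning.
It finds the k data points (neighbors) closest to a given input and assigns the majority label among those neighbors as the predicted class. In R, KNN is available through the class library.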

Steps for using KNN

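Load the required libraries: the knn() function comes from the class library, and caret is used to split the data set and evaluate performance.
Split the data: to evaluate the KNN model, split the data into a training set and a test set.
Normalize the data: KNN is distance-based, so it works better when the features are scaled or normalized.
Apply the KNN algorithm: use the knn() function to classify the test data based on the training data.
Tune k: change the value of k to see how the number of neighbors affects accuracy. In general, a smaller k makes the model more flexible but more sensitive to noise, while a larger k takes more neighbors into account and gives smoother predictions.
Cross-validate for the best k: cross-validation can also be used to choose the value of k.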

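KNN is a simple instance-based learning algorithm that classifies a point by the majority class of its nearest neighbors.
It is sensitive to feature scaling, so normalization matters.
To optimize classification performance, try several values of k, preferably using cross-validation to find the best one.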
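
One detail worth illustrating: if the scaling parameters are computed on the full data set before splitting, the test rows influence the minimum and maximum used for scaling. The sketch below instead fits the min-max parameters on the training rows only and reuses them for the test rows. `fit_minmax()` and `apply_minmax()` are hypothetical helpers written for this example, not functions from any package.

```r
# Sketch: min-max scaling fitted on the training rows only.
fit_minmax <- function(df) {
  list(min = sapply(df, min), max = sapply(df, max))
}
apply_minmax <- function(df, params) {
  # scale() subtracts `center` and divides by `scale`, column by column
  scaled <- scale(as.matrix(df),
                  center = params$min,
                  scale  = params$max - params$min)
  as.data.frame(scaled)
}

features <- datasets::iris[, 1:4]
set.seed(123)
train_idx <- sample(nrow(features), size = 105)   # 70% of the 150 rows

params       <- fit_minmax(features[train_idx, ])
train_scaled <- apply_minmax(features[train_idx, ], params)
test_scaled  <- apply_minmax(features[-train_idx, ], params)

# Training columns span exactly [0, 1]; test values may fall slightly outside
sapply(train_scaled, range)
sapply(test_scaled, range)
```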

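KNN is a supervised method because it needs labeled training data to make predictions.
You must supply a labeled data set (one where the class or target variable is known) so that the algorithm can make predictions for new, unseen data.
For a given test point, the algorithm finds the k closest training examples and either classifies the point by their majority class (for classification) or averages their values (for regression).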
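
To make the voting step concrete, here is a minimal base-R sketch that classifies a single query point. `knn_predict_one()` is a hypothetical helper written for illustration, and the toy data are made up; in practice `class::knn()` does this for a whole test set.

```r
# Sketch: classify one query point with k-nearest neighbours in base R.
knn_predict_one <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training row
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(query), "-")
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote (a tie resolves to the first level with the top count)
  names(which.max(table(nearest)))
}

# Toy data: two well-separated groups in two dimensions
train_x <- data.frame(x1 = c(1, 1.2, 0.8, 5, 5.1, 4.9),
                      x2 = c(1, 0.9, 1.1, 5, 4.8, 5.2))
train_y <- factor(c("a", "a", "a", "b", "b", "b"))

knn_predict_one(train_x, train_y, query = c(1.1, 1.0), k = 3)  # "a"
knn_predict_one(train_x, train_y, query = c(4.8, 5.0), k = 3)  # "b"
```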

# Load necessary libraries
library(class)  # For knn
library(caret)  # For train/test splitting

## Loading required package: lattice

# 1. Prepare the data (Remove the species column for the features)
set.seed(123)
str(iris)

## 'data.frame':    150 obs. of  7 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Cluster     : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Cluster_pca : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...

iris_data <- iris[, -c(5:7)]  # Features (numerical data)
iris_labels <- iris$Species  # Labels (species)

# 2. Normalize the data (scaling)
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
iris_data <- as.data.frame(lapply(iris_data, normalize))

# 3. Split the data into training and test sets
train_index <- createDataPartition(iris_labels, p = 0.7, list = FALSE)  # 70% for training
train_data <- iris_data[train_index, ]
test_data <- iris_data[-train_index, ]
train_labels <- iris_labels[train_index]
test_labels <- iris_labels[-train_index]

# 4. Apply the KNN algorithm (with k = 3, as an example)
knn_pred <- knn(train = train_data, test = test_data, cl = train_labels, k = 3)

# 5. Evaluate the performance (confusion matrix)
conf_matrix <- table(Predicted = knn_pred, Actual = test_labels)
print(conf_matrix)

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         2
##   virginica       0          1        13

# Calculate the accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Accuracy:", accuracy)

## Accuracy: 0.9333333

# Repeat with k = 5 (this overwrites knn_pred; evaluate it the same way as above)
knn_pred <- knn(train = train_data, test = test_data, cl = train_labels, k = 5)

# Cross-validation for optimal k
set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
knn_cv <- train(train_data, train_labels, method = "knn", trControl = ctrl, tuneLength = 10)
knn_cv$bestTune  # Optimal k

##   k
## 2 7
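
As a lighter-weight cross-check that does not need caret, `knn.cv()` from the class library performs leave-one-out cross-validation directly. The sketch below is an illustration under its own assumptions: it uses z-score scaling via `scale()` rather than the min-max normalization above, runs on the full data set, and the grid of odd k values is arbitrary. Its results need not agree with the 10-fold output shown above.

```r
# Sketch: leave-one-out cross-validation over k with class::knn.cv().
library(class)

features <- as.data.frame(scale(datasets::iris[, 1:4]))  # z-score scaling
labels   <- datasets::iris$Species

ks <- seq(1, 15, by = 2)
set.seed(123)   # knn.cv() breaks voting ties at random
loo_accuracy <- sapply(ks, function(k) {
  pred <- knn.cv(train = features, cl = labels, k = k)
  mean(pred == labels)
})

data.frame(k = ks, loo_accuracy = loo_accuracy)
ks[which.max(loo_accuracy)]   # first k with the highest accuracy
```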

iris clustering

백승훈

2024-10-19