Let’s compute the Euclidean, Manhattan, and Minkowski distances between the control and treatment 1 groups of the PlantGrowth dataset.
# Load the PlantGrowth dataset
data(PlantGrowth)
# Extract the weight variable for the control and treatment 1 groups
ctrl_group <- subset(PlantGrowth, group == "ctrl")$weight
trt1_group <- subset(PlantGrowth, group == "trt1")$weight
# Stack the two groups as rows so dist() treats each group as one 10-dimensional point
group_matrix <- rbind(ctrl_group, trt1_group)
euclidean_dist <- dist(group_matrix) # the default method is "euclidean"
manhattan_dist <- dist(group_matrix, method = "manhattan")
p_value <- 3 # Order p of the Minkowski distance (not a statistical p-value)
minkowski_dist <- dist(group_matrix, method = "minkowski", p = p_value)
Examine the results:
euclidean_dist
##            ctrl_group
## trt1_group   3.730697
manhattan_dist
##            ctrl_group
## trt1_group      10.17
minkowski_dist
##            ctrl_group
## trt1_group   2.899443
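All three metrics are special cases of one formula: the Minkowski distance of order p is the p-th root of the sum of the absolute differences raised to the power p, with p = 1 giving Manhattan and p = 2 giving Euclidean distance. The following is a minimal sketch that recomputes the values by hand; the helper `minkowski()` is a hypothetical name introduced here for illustration and is not part of base R.

```r
data(PlantGrowth)
ctrl_group <- subset(PlantGrowth, group == "ctrl")$weight
trt1_group <- subset(PlantGrowth, group == "trt1")$weight

# Hypothetical helper: Minkowski distance of order p between two numeric vectors
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

manual <- c(
  manhattan    = minkowski(ctrl_group, trt1_group, p = 1),
  euclidean    = minkowski(ctrl_group, trt1_group, p = 2),
  minkowski_p3 = minkowski(ctrl_group, trt1_group, p = 3)
)
print(manual)

# Compare against dist() for the same three settings
group_matrix <- rbind(ctrl_group, trt1_group)
builtin <- c(
  as.numeric(dist(group_matrix, method = "manhattan")),
  as.numeric(dist(group_matrix, method = "euclidean")),
  as.numeric(dist(group_matrix, method = "minkowski", p = 3))
)
print(all.equal(unname(manual), builtin))
```

The hand-computed values should agree with the output above (10.17, 3.730697, 2.899443). For a fixed pair of points the Minkowski distance never increases as p grows, which is why the three numbers shrink from Manhattan to Euclidean to p = 3.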
Next, let’s compare the first two cars in the mtcars dataset on the attributes “mpg”, “hp”, and “wt”.
data(mtcars)
# Select the three attributes for the first two cars (Mazda RX4 and Mazda RX4 Wag)
car1 <- mtcars[1, c("mpg", "hp", "wt")]
car2 <- mtcars[2, c("mpg", "hp", "wt")]
# Stack the two cars as rows and compute each distance
car_pair <- rbind(car1, car2)
euclidean_dist <- dist(car_pair)
manhattan_dist <- dist(car_pair, method = "manhattan")
p_value <- 3 # Order p of the Minkowski distance
minkowski_dist <- dist(car_pair, method = "minkowski", p = p_value)
Examine the results:
euclidean_dist
##               Mazda RX4
## Mazda RX4 Wag     0.255
manhattan_dist
##               Mazda RX4
## Mazda RX4 Wag     0.255
minkowski_dist
##               Mazda RX4
## Mazda RX4 Wag     0.255
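All three metrics return the same value here because the two cars have identical mpg and hp and differ only in wt. When a single coordinate differs by d, every Minkowski distance reduces to |d|, whatever the order p. A short sketch that checks this on the same two rows:

```r
data(mtcars)
car1 <- mtcars[1, c("mpg", "hp", "wt")]
car2 <- mtcars[2, c("mpg", "hp", "wt")]

# Per-attribute absolute differences between the two cars
abs_diff <- abs(unlist(car1) - unlist(car2))
print(abs_diff)

# With one non-zero component d, (|d|^p)^(1/p) equals |d| for every p
distances_by_p <- sapply(c(1, 2, 3), function(p) sum(abs_diff^p)^(1 / p))
print(distances_by_p)
```

To see the metrics diverge, pick two cars that differ in more than one attribute, for example rows 1 and 3 of mtcars.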
Let’s perform hierarchical clustering using Euclidean distance and Manhattan distance.
# Load the mtcars dataset
data(mtcars)
# Extract relevant columns for clustering
mtcars_subset <- mtcars[, c("mpg", "disp", "hp", "wt")]
# Calculate Euclidean distance matrix
euclidean_dist <- dist(mtcars_subset)
# Calculate Manhattan distance matrix
manhattan_dist <- dist(mtcars_subset, method = "manhattan")
# Perform hierarchical clustering using Euclidean distance
euclidean_cluster <- hclust(euclidean_dist)
# Perform hierarchical clustering using Manhattan distance
manhattan_cluster <- hclust(manhattan_dist)
Plot the dendrograms to compare the results:
plot(euclidean_cluster, main = "Hierarchical Clustering (Euclidean Distance)")
plot(manhattan_cluster, main = "Hierarchical Clustering (Manhattan Distance)")
The dendrograms differ noticeably in how the clusters form, which tells us that the two distance metrics produce distinct clustering outcomes.
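Comparing dendrograms by eye is subjective, so one way to make the comparison concrete is to cut both trees into the same number of groups and cross-tabulate the memberships. The sketch below assumes 3 groups; that number is an illustrative choice, not something derived from the data.

```r
data(mtcars)
mtcars_subset <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Same clusterings as above (hclust() uses complete linkage by default)
euclidean_cluster <- hclust(dist(mtcars_subset))
manhattan_cluster <- hclust(dist(mtcars_subset, method = "manhattan"))

# Assumption: cut both trees into 3 groups (illustrative value)
n_groups <- 3
euclidean_groups <- cutree(euclidean_cluster, k = n_groups)
manhattan_groups <- cutree(manhattan_cluster, k = n_groups)

# Cross-tabulate memberships: rows are Euclidean groups, columns are Manhattan groups
agreement <- table(euclidean = euclidean_groups, manhattan = manhattan_groups)
print(agreement)
```

If every row and column of the table has exactly one non-zero cell, the two metrics give the same partition at this cut; cars spread across several cells in a row were grouped differently. Note also that the four columns are on very different scales, so disp and hp contribute most to both distances; standardizing with `scale()` before calling `dist()` is a common alternative.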
Let’s classify the iris data with the K-nearest neighbors (KNN) algorithm.
# Load the iris dataset
data(iris)
# Set the random seed for reproducibility
set.seed(123)
# Create an index to randomly sample the data
train_index <- sample(1:nrow(iris), 120)
# Split the dataset into training and testing sets
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
# Perform KNN classification
library(class)
# Define the number of neighbors (K)
k <- 3
# Classify the test set; knn() has no separate training step and returns the predicted labels
knn_pred <- knn(train = train_data[, -5], test = test_data[, -5], cl = train_data[, 5], k = k)
# Compare predicted labels with actual labels
accuracy <- mean(knn_pred == test_data[, 5])
# Print the accuracy
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.966666666666667"
I randomly sample 120 of the 150 row indices of the iris dataset for the training set; the remaining 30 rows form the testing set. The number of neighbors (K) is set to 3. KNN has no separate training step: each test sample is classified by a majority vote among its 3 nearest training samples, using the sepal and petal lengths and widths as features and the corresponding species labels as classes.
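To make the voting step concrete, here is a minimal sketch that classifies a single test flower by hand, using Euclidean distance, and then asks `knn()` for the same flower. The variable names `query`, `nearest`, and `manual_label` are introduced here for illustration only.

```r
library(class)
data(iris)
set.seed(123)
train_index <- sample(seq_len(nrow(iris)), 120)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

# Take the first test flower as the query point
query <- unlist(test_data[1, 1:4])

# Euclidean distance from the query to every training flower
train_matrix <- as.matrix(train_data[, 1:4])
distances <- sqrt(rowSums(sweep(train_matrix, 2, query)^2))

# The k closest training flowers vote on the label
k <- 3
nearest <- order(distances)[seq_len(k)]
votes <- table(train_data$Species[nearest])
manual_label <- names(votes)[which.max(votes)]
print(manual_label)

# Label assigned by knn() to the same flower
knn_label <- as.character(
  knn(train = train_data[, 1:4], test = test_data[1, 1:4],
      cl = train_data$Species, k = k)
)
print(knn_label)
```

The two labels will normally agree. They can differ in rare tie situations, because `knn()` breaks voting ties at random and includes all candidates tied at the k-th distance, while this sketch simply takes the first maximum.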
An accuracy of 0.967 means that 29 of the 30 samples in the testing set (96.7%) were classified correctly by the KNN classification algorithm.
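A single accuracy figure does not show which species was misclassified. A confusion matrix breaks the result down by class; the sketch below repeats the split from above so that it runs on its own, and the names `predicted` and `confusion` are illustrative.

```r
library(class)
data(iris)
set.seed(123)
train_index <- sample(seq_len(nrow(iris)), 120)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

predicted <- knn(train = train_data[, -5], test = test_data[, -5],
                 cl = train_data[, 5], k = 3)

# Rows are predicted species, columns are actual species
confusion <- table(predicted = predicted, actual = test_data$Species)
print(confusion)

# Overall accuracy: correctly classified samples sit on the diagonal
overall_accuracy <- sum(diag(confusion)) / sum(confusion)
print(overall_accuracy)

# Per-species share of test samples that were classified correctly
per_species <- diag(confusion) / colSums(confusion)
print(per_species)
```

Reading the diagonal as "correct" is safe here because the predicted and actual labels share the same factor levels in the same order.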
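Let’s now cluster the iris measurements with K-means and compare the clusters with the actual species.
# Load the iris dataset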
data(iris)
# Select the relevant columns for clustering
iris_cluster <- iris[, 1:4]
# Set the number of clusters
k <- 3
# Perform K-means clustering
set.seed(123)
kmeans_result <- kmeans(iris_cluster, centers = k)
# Compare the predicted cluster labels to the actual species
table(kmeans_result$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36
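# Calculate clustering accuracy from the cross-tabulation
# (the diagonal counts as "correct" only when cluster numbers line up with the species order)
cluster_table <- table(kmeans_result$cluster, iris$Species)
cluster_accuracy <- sum(diag(cluster_table)) / sum(cluster_table)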
cluster_accuracy
## [1] 0.8933333
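The clustering accuracy of 0.893 is the ratio of correctly grouped samples (the diagonal of the table, 50 + 48 + 36 = 134) to the total number of samples (150). Reading the diagonal as "correct" works here only because the cluster numbers 1, 2, and 3 happen to line up with setosa, versicolor, and virginica in this run; K-means never sees the species labels, so its cluster numbers are arbitrary.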
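Because the cluster numbers are arbitrary, a score that does not depend on their order is safer. One simple option, sketched below, assigns each cluster the species that occurs most often in it and then scores the match. This is a majority-vote mapping (often called purity), swapped in here as an illustration; it is not the calculation used above, and `majority_species` and `mapped_accuracy` are hypothetical names.

```r
data(iris)
set.seed(123)
kmeans_result <- kmeans(iris[, 1:4], centers = 3)

# Rows are cluster numbers, columns are species
cluster_table <- table(cluster = kmeans_result$cluster, species = iris$Species)

# Map each cluster number to its most frequent species
majority_species <- colnames(cluster_table)[apply(cluster_table, 1, which.max)]
names(majority_species) <- rownames(cluster_table)
print(majority_species)

# Translate every sample's cluster number into a species name and score it
predicted_species <- majority_species[as.character(kmeans_result$cluster)]
mapped_accuracy <- mean(predicted_species == as.character(iris$Species))
print(mapped_accuracy)
```

When the cluster numbers already line up with the species order, this gives the same value as the diagonal calculation. One limitation to keep in mind: two clusters can map to the same species, so the score can look better than a strict one-to-one matching would allow.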