EX 1. Use a data set such as PlantGrowth in R to calculate three different distance metrics and discuss the results.

Let’s use the Euclidean, Manhattan, and Minkowski distance metrics.

# Load the PlantGrowth dataset
data(PlantGrowth)

# Extract the weight values for the ctrl and trt1 groups
ctrl_group <- subset(PlantGrowth, group == "ctrl")$weight
trt1_group <- subset(PlantGrowth, group == "trt1")$weight
  1. Euclidean Distance:
euclidean_dist <- dist(rbind(ctrl_group, trt1_group))
  2. Manhattan Distance:
manhattan_dist <- dist(rbind(ctrl_group, trt1_group), method = "manhattan")
  3. Minkowski Distance:
p_value <- 3  # Set the desired value for p
minkowski_dist <- dist(rbind(ctrl_group, trt1_group), method = "minkowski", p = p_value)

Examine the results:

euclidean_dist
##            ctrl_group
## trt1_group   3.730697
manhattan_dist
##            ctrl_group
## trt1_group      10.17
minkowski_dist
##            ctrl_group
## trt1_group   2.899443
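Manhattan gives the largest value (10.17) and Minkowski with p = 3 the smallest (about 2.90); this ordering is expected, since for a fixed pair of vectors the Lp distance never increases as p grows. As a sanity check, the same numbers can be reproduced directly from the Minkowski formula. The helper lp_dist below is not part of the exercise; it is a minimal sketch that reuses the ctrl_group and trt1_group vectors defined above.

# Hypothetical helper: Lp (Minkowski) distance between two numeric vectors
lp_dist <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

lp_dist(ctrl_group, trt1_group, p = 1)  # matches manhattan_dist (10.17)
lp_dist(ctrl_group, trt1_group, p = 2)  # matches euclidean_dist (~3.73)
lp_dist(ctrl_group, trt1_group, p = 3)  # matches minkowski_dist (~2.90)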

EX 2. Now use the higher-dimensional data set mtcars; apply the same three distance metrics as in the previous question and discuss the results.

Let’s consider the attributes “mpg”, “hp”, and “wt” for the first two cars.

data(mtcars)

car1 <- mtcars[1, c("mpg", "hp", "wt")]
car2 <- mtcars[2, c("mpg", "hp", "wt")]
  1. Euclidean Distance:
euclidean_dist <- dist(rbind(car1, car2))
  2. Manhattan Distance:
manhattan_dist <- dist(rbind(car1, car2), method = "manhattan")
  3. Minkowski Distance:
p_value <- 3  # Set the desired value for p
minkowski_dist <- dist(rbind(car1, car2), method = "minkowski", p = p_value)

Examine the results:

euclidean_dist
##               Mazda RX4
## Mazda RX4 Wag     0.255
manhattan_dist
##               Mazda RX4
## Mazda RX4 Wag     0.255
minkowski_dist
##               Mazda RX4
## Mazda RX4 Wag     0.255
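All three metrics return the same value here because the Mazda RX4 and the Mazda RX4 Wag have identical mpg and hp, so only wt differs; when a single coordinate differs, every Minkowski distance reduces to that one absolute difference (|2.620 - 2.875| = 0.255). A quick check, using the car1 and car2 rows selected above:

# Per-feature absolute differences between the two cars:
# mpg and hp are identical, so only wt (0.255) contributes to the distance
abs(car1 - car2)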

EX 3. Use the built-in data set mtcars to carry out hierarchical clustering using two different distance metrics and compare whether they give the same results. Discuss the results.

Let’s perform hierarchical clustering using Euclidean distance and Manhattan distance.

# Load the mtcars dataset
data(mtcars)

# Extract relevant columns for clustering
mtcars_subset <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Calculate Euclidean distance matrix
euclidean_dist <- dist(mtcars_subset)

# Calculate Manhattan distance matrix
manhattan_dist <- dist(mtcars_subset, method = "manhattan")

# Perform hierarchical clustering using Euclidean distance
euclidean_cluster <- hclust(euclidean_dist)

# Perform hierarchical clustering using Manhattan distance
manhattan_cluster <- hclust(manhattan_dist)

Compare the results by plotting the dendrograms:

plot(euclidean_cluster, main = "Hierarchical Clustering (Euclidean Distance)")

plot(manhattan_cluster, main = "Hierarchical Clustering (Manhattan Distance)")

The dendrograms differ in how the clusters form, which tells us that the two distance metrics generate distinct clustering outcomes.
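Dendrograms are hard to compare by eye, so one way to make the comparison concrete, offered here only as a sketch beyond what the exercise asks, is to cut both trees into the same number of clusters and cross-tabulate the memberships. If every row and column of the table has a single nonzero entry, the two metrics produce the same grouping up to relabeling; otherwise the counts show where they disagree.

# Cut both dendrograms into 3 clusters (an arbitrary choice for illustration)
euclidean_groups <- cutree(euclidean_cluster, k = 3)
manhattan_groups <- cutree(manhattan_cluster, k = 3)

# Cross-tabulate the two sets of cluster assignments
table(Euclidean = euclidean_groups, Manhattan = manhattan_groups)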

EX 4. Load the well-known Fisher’s iris flower data set, which consists of 150 samples from three species (50 samples per species). The four measured features are the lengths and widths of the sepals and petals. Use the KNN algorithm to analyze this iris data set by selecting 120 samples for training and 30 samples for testing.

# Load the iris dataset
data(iris)

# Set the random seed for reproducibility
set.seed(123)

# Create an index to randomly sample the data
train_index <- sample(1:nrow(iris), 120)

# Split the dataset into training and testing sets
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

# Perform KNN classification with the class package
library(class)

# Define the number of neighbors (K)
k <- 3

# Train the KNN model
knn_model <- knn(train = train_data[, -5], test = test_data[, -5], cl = train_data[, 5], k = k)

# Compare predicted labels with actual labels
accuracy <- sum(knn_model == test_data[, 5]) / length(knn_model)

# Print the accuracy
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.966666666666667"

I randomly sample 120 indices from the iris data set for the training set, while the remaining 30 samples are assigned to the testing set. When applying the KNN algorithm, the number of neighbors (K) is set to 3. The model is trained on the training data (lengths and widths of sepals and petals) and their corresponding species labels.

The accuracy value of about 96.7% is the proportion of correctly classified samples in the testing set produced by the KNN algorithm.
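Beyond the single accuracy number, a confusion matrix shows which species account for the misclassified test samples. This is a small addition to the exercise, using only the objects created above:

# Cross-tabulate predicted vs. actual species for the 30 test samples
table(Predicted = knn_model, Actual = test_data[, 5])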

EX 5. Use the iris data set to carry out K-means clustering. Compare the results to the actual classes and estimate the clustering accuracy.

# Load the iris dataset
data(iris)

# Select the relevant columns for clustering
iris_cluster <- iris[, 1:4]

# Set the number of clusters
k <- 3

# Perform K-means clustering
set.seed(123)
kmeans_result <- kmeans(iris_cluster, centers = k)

# Compare the predicted cluster labels to the actual species
table(kmeans_result$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36
# Calculate clustering accuracy
cluster_accuracy <- sum(diag(table(kmeans_result$cluster, iris$Species))) / sum(table(kmeans_result$cluster, iris$Species))
cluster_accuracy
## [1] 0.8933333

The result indicates the clustering accuracy of the K-means assignments compared to the actual species. The accuracy is calculated as the ratio of correctly classified samples to the total number of samples, about 89.3% here.
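One caveat: summing the diagonal of the confusion table only measures accuracy correctly because, for this seed, cluster 1 happens to align with setosa, cluster 2 with versicolor, and cluster 3 with virginica. A more label-independent sketch (an addition to the exercise, not part of the original solution) maps each cluster to its majority species before computing accuracy; here it gives the same 89.3%, but it would also be correct if K-means had numbered the clusters differently.

# Map each K-means cluster to the species that occurs most often within it
confusion <- table(kmeans_result$cluster, iris$Species)

# Count samples that belong to their cluster's majority species
sum(apply(confusion, 1, max)) / nrow(iris)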