Problem 1

Use a data set such as PlantGrowth in R to calculate three different distance metrics and discuss the results.

man <- dist(PlantGrowth, method = 'manhattan', diag = FALSE, upper = FALSE)

euc <- dist(PlantGrowth, method = 'euclidean', diag = FALSE, upper = FALSE)

can <- dist(PlantGrowth, method = 'canberra', diag = FALSE, upper = FALSE)
head(man)
## [1] 2.82 2.02 3.88 0.66 0.88 2.00
head(euc)
## [1] 1.9940411 1.4283557 2.7435743 0.4666905 0.6222540 1.4142136
head(can)
## [1] 0.28923077 0.21604278 0.37743191 0.07612457 0.10022779 0.21413276

The head() function returns the first few entries of the "dist" object: the distance between row 1 and row 2, the distance between row 1 and row 3, and so on.
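
To double-check that indexing, the "dist" object can be converted to a full labelled matrix. A quick sketch, assuming the man object from the chunk above is still in the workspace:

# The dist object stores only the lower triangle; as.matrix() expands it with row and column labels
as.matrix(man)[1:4, 1:4]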

As I expected, the ‘Manhattan’ distances are the largest of the three: summing the absolute coordinate differences always produces a value at least as large as the corresponding Euclidean distance.

The ‘Euclidean’ distances are similar in magnitude to the ‘Manhattan’ distances but consistently smaller.

Interestingly, the ‘Canberra’ distances follow the same pattern as the ‘Manhattan’ distances but are roughly an order of magnitude smaller, since each coordinate difference is divided by the sum of the two coordinate values before being summed.
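
One caveat: PlantGrowth has only one numeric column (weight); the group column is a factor, which dist() coerces to NA, and the help page for dist() says that when columns are dropped this way the sum is scaled up proportionally to the number of columns used. That would explain why the ‘Manhattan’ values above are exactly twice the raw weight differences. A sketch that uses only the numeric column (my reading of the situation, output not shown):

# Restrict to the numeric weight column so no coordinates are dropped as NA
man_w <- dist(PlantGrowth["weight"], method = 'manhattan')
euc_w <- dist(PlantGrowth["weight"], method = 'euclidean')
can_w <- dist(PlantGrowth["weight"], method = 'canberra')
head(man_w)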

Problem 2

Now use the higher-dimensional mtcars data set, apply the same three distance metrics as in the previous question, and discuss the results.

man <- dist(mtcars, method = 'manhattan', diag = FALSE, upper = FALSE)

euc <- dist(mtcars, method = 'euclidean', diag = FALSE, upper = FALSE)

can <- dist(mtcars, method = 'canberra', diag = FALSE, upper = FALSE)
head(man)
## [1]   0.815  79.300 108.795 275.430  84.640 347.960
head(euc)
## [1]   0.6153251  54.9086059  98.1125212 210.3374396  65.4717710 241.4076490
head(can)
## [1] 0.06944545 2.24735590 3.28919860 2.80289966 3.42095017 2.76114065

It is helpful to look at the ‘mtcars’ data set itself to explain these distances.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The first distance for all three methods is very small compared to all the other distances. When we look at the ‘mtcars’ data set, the first two rows are just variations on the same car, the ‘Mazda RX4’. These two vehicles differ only in their weight and 1/4-mile time (the heavier wagon version is slightly slower).

All three distance metrics show the same overall trend, with the ‘Canberra’ values one to two orders of magnitude smaller. Because the raw variables are not scaled, the ‘Manhattan’ and ‘Euclidean’ distances are dominated by the large-magnitude columns such as disp and hp, while the ‘Canberra’ distance normalizes each coordinate difference.
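
Since the raw columns sit on very different scales, it is also worth seeing what the distances look like after standardizing the variables. A sketch (the scaling step is my addition, not part of the assignment):

# Standardize each column to mean 0 and sd 1 so no single variable dominates the distances
mtcars_scaled <- scale(mtcars)
head(dist(mtcars_scaled, method = 'euclidean'))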

Problem 3:

Use the built-in data set mtcars to carry out hierarchical clustering using two different distance metrics and compare whether they give the same results. Discuss the results.

# Hierarchical clustering (complete linkage by default) on Manhattan distances
man_cluster <- hclust(dist(mtcars, method = 'manhattan'))
plot(man_cluster)

# The same clustering on Euclidean distances
euc_cluster <- hclust(dist(mtcars, method = 'euclidean'))
plot(euc_cluster)

The ‘Manhattan’ distance splits American cars in one major branch and European and Japanese cars in the other major branch.

The ‘Euclidean’ distance splits the cars into three major branches: American sedans in one, American muscle cars and the Mercedes in another, and European and Japanese cars in a third.

The major branches of the two trees are very different, but the arrangement of the leaves is similar in many places.
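
To compare the two trees more concretely than by eye, one option is to cut each dendrogram into the same number of groups and cross-tabulate the memberships. A sketch (the choice of three groups is arbitrary):

# Cut both trees into three clusters and see how the group memberships line up
table(manhattan = cutree(man_cluster, k = 3),
      euclidean = cutree(euc_cluster, k = 3))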

Problem 4:

Load the well-known Fisher’s iris flower data set, which consists of 150 samples from three species. Use kNN classification to analyze this iris data set by selecting 120 samples for training and 30 samples for testing.

# Randomly reorder the iris data set before splitting it into training and test sets
set.seed(199)
rows <- sample(nrow(iris))
iris_rand <- iris[rows,]
train <- iris_rand[1:120,]
test <- iris_rand[121:150,]

# Scale the four predictor columns (note: each set is scaled with its own mean and sd here)
train_scale <- scale(train[, 1:4])
test_scale <- scale(test[, 1:4])
  
# Fit a kNN classifier for k = 1 to 10 and report the test-set accuracy for each k

for (i in 1:10) {
  classifier_knn <- class::knn(train = train_scale,
                               test = test_scale,
                               cl = train$Species,
                               k = i)
  print(paste0("When k is ", i, " Accuracy is ", round(1 - mean(classifier_knn != test$Species), 3)))
}
## [1] "When k is 1 Accuracy is 1"
## [1] "When k is 2 Accuracy is 1"
## [1] "When k is 3 Accuracy is 1"
## [1] "When k is 4 Accuracy is 1"
## [1] "When k is 5 Accuracy is 0.967"
## [1] "When k is 6 Accuracy is 0.967"
## [1] "When k is 7 Accuracy is 1"
## [1] "When k is 8 Accuracy is 1"
## [1] "When k is 9 Accuracy is 0.967"
## [1] "When k is 10 Accuracy is 0.967"
# Confusion matrix for k = 10 (classifier_knn holds the predictions from the final loop iteration)
cm_10 <- table(test$Species, classifier_knn)
cm_10
##             classifier_knn
##              setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         1
##   virginica       0          0         9

The accuracy of this kNN classifier appears to depend more on how the data is split into training and test sets than on the value of k. The last time I ran this with a different seed I could not get an accuracy above 83%; with this seed most values of k reach 100%.
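
One detail in the chunk above: the test set is scaled with its own means and standard deviations rather than the training set's, so the two sets are not on exactly the same scale. A sketch of the alternative, reusing the attributes that scale() attaches to train_scale (k = 5 is an arbitrary choice here):

# Scale the test predictors with the training set's centering and scaling values
train_center <- attr(train_scale, "scaled:center")
train_sd <- attr(train_scale, "scaled:scale")
test_scale2 <- scale(test[, 1:4], center = train_center, scale = train_sd)
classifier_knn5 <- class::knn(train = train_scale, test = test_scale2,
                              cl = train$Species, k = 5)
mean(classifier_knn5 == test$Species)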

Problem 5:

Use the iris data set to carry out k-means clustering. Compare the results to the actual classes and estimate the clustering accuracy.

set.seed(991)
# k-means with 3 centers on the four numeric columns; nstart = 25 random starts to avoid poor local optima
classifier_kmeans <- stats::kmeans(iris[1:4], centers = 3, nstart = 25)

# Combine the cluster assignments with the true species labels
results <- dplyr::bind_cols(classifier_kmeans$cluster, iris$Species)
## New names:
## • `` -> `...1`
## • `` -> `...2`
results <- dplyr::rename(results, kmean_pred = "...1" , species = "...2")

# Map cluster IDs to species names (labels chosen by inspecting which species dominates each cluster)
results$pred_species <- dplyr::case_when(results$kmean_pred == 1 ~ 'setosa',
                                         results$kmean_pred == 2 ~ 'virginica',
                                         results$kmean_pred == 3 ~ 'versicolor')

print(paste0( "Accuracy is ", round(1-mean(results$pred_species != results$species), 3)))
## [1] "Accuracy is 0.893"
table(results$pred_species, results$species)
##             
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48        14
##   virginica       0          2        36

k-means gives an estimated accuracy of about 89%. Setosa is separated perfectly; all of the errors come from the overlap between versicolor and virginica, with 14 virginica flowers landing in the versicolor cluster and 2 versicolor flowers landing in the virginica cluster.
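
The cluster-to-species mapping above was written out by hand; because k-means cluster numbers change with the seed, it can be safer to derive the mapping from the data. A sketch that assigns each cluster the species occurring most often in it (ties ignored):

# Majority-vote mapping from k-means cluster IDs to species labels
tab <- table(classifier_kmeans$cluster, iris$Species)
cluster_to_species <- colnames(tab)[apply(tab, 1, which.max)]
pred_species2 <- cluster_to_species[classifier_kmeans$cluster]
mean(pred_species2 == as.character(iris$Species))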