I used the iris dataset to calculate the Euclidean distance, the Manhattan distance, and the Jaccard distance.
# Euclidean distance between two equal-length numeric columns
calc_distance <- function(x,y){
sqrt(sum((x - y)^2))
}
calc_distance(iris[1],iris[2])
## [1] 36.15785
dist(t(iris[1:2]),method='euclidean')
## Sepal.Length
## Sepal.Width 36.15785
# Manhattan (city-block) distance
calc_manhattan_distance <- function(x,y){
sum(abs(x - y))
}
calc_manhattan_distance(iris[1],iris[2])
## [1] 417.9
dist(t(iris[1:2]),method='manhattan')
## Sepal.Length
## Sepal.Width 417.9
# Jaccard distance on the sets of distinct values in each column:
# 1 - |intersection| / |union|. Needs dplyr for count() and the
# data-frame methods of intersect()/union().
library(dplyr)
calc_jaccard_distance <- function(x,y){
colnames(x) <- 'a'
colnames(y) <- 'a'
intersection <- count(intersect(x,y))
union <- count(union(x,y))
jac_dist <- 1 - intersection/union
return(jac_dist)
}
calc_jaccard_distance(iris[1],iris[2])
## n
## 1 0.9824561
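Because Jaccard is defined on sets, the same number can be checked with base R's vector set operations, which deduplicate automatically. A minimal sketch (the helper name jaccard_dist_vec is mine):
# hypothetical base-R check: Jaccard distance between the distinct
# values of two numeric vectors
jaccard_dist_vec <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}
jaccard_dist_vec(iris$Sepal.Length, iris$Sepal.Width)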
# takes two single-column data frames (one metric each) and returns the Euclidean distance between them
calc_distance <- function(x,y){
sqrt(sum((x - y)^2))
}
calc_distance(mtcars['cyl'],mtcars['mpg'])
## [1] 89.32586
calc_distance(mtcars['disp'],mtcars['mpg'])
## [1] 1391.495
calc_distance(mtcars['disp'],mtcars['cyl'])
## [1] 1441.252
dist(t(mtcars),method='euclidean')
## mpg cyl disp hp drat wt
## cyl 89.325864
## disp 1391.495462 1441.251772
## hp 824.375467 878.176520 656.640442
## drat 98.511658 19.078540 1459.404217 895.520090
## wt 102.877138 18.058047 1458.014195 895.374454 8.139647
## qsec 33.261091 68.310762 1390.078394 826.067286 81.255418 83.655198
## vs 115.623138 34.785054 1475.104291 911.994518 18.130932 17.371962
## am 115.849514 34.713110 1475.096156 911.588175 18.179403 17.641289
## gear 98.084199 18.867962 1459.033540 894.710009 2.981728 8.929562
## carb 105.320986 21.213203 1460.660559 896.136150 10.689747 8.596341
## qsec vs am gear
## cyl
## disp
## hp
## drat
## wt
## qsec
## vs 98.823784
## am 99.272958 3.605551
## gear 80.935531 18.920888 18.734994
## carb 86.787904 17.262677 16.462078 10.099505
calc_manhattan_distance <- function(x,y){
sum(abs(x - y))
}
calc_manhattan_distance(mtcars['cyl'],mtcars['mpg'])
## [1] 444.9
dist(t(mtcars),method='manhattan')
## mpg cyl disp hp drat wt qsec vs
## cyl 444.900
## disp 6740.200 7185.100
## hp 4051.100 4496.000 2852.900
## drat 527.810 86.610 7268.010 4578.910
## wt 539.948 95.048 7280.148 4591.048 37.466
## qsec 136.260 373.160 6811.940 4122.840 456.070 468.208
## vs 628.900 184.000 7369.100 4680.000 101.090 88.952 557.160
## am 629.900 185.000 7370.100 4681.000 102.090 89.952 558.160 13.000
## gear 524.900 84.000 7265.100 4576.000 10.830 42.896 453.160 104.000
## carb 552.900 108.000 7293.100 4604.000 47.290 40.106 481.160 76.000
## am gear
## cyl
## disp
## hp
## drat
## wt
## qsec
## vs
## am
## gear 105.000
## carb 77.000 46.000
calc_jaccard_distance <- function(x,y){
colnames(x) <- 'a'
colnames(y) <- 'a'
intersection <- count(intersect(x,y))
union <- count(union(x,y))
jac_dist <- 1 - intersection/union
return(jac_dist)
}
calc_jaccard_distance(mtcars['mpg'],mtcars['cyl'])
## n
## 1 1
calc_jaccard_distance(mtcars['drat'],mtcars['wt'])
## n
## 1 0.9591837
Hierarchical clustering with Euclidean distance splits off a small cluster early and concentrates most of the points in the first cluster (18/12/2), while Manhattan distance spreads the points more evenly across the three clusters (5/15/12).
mtcars_euclidean <- mtcars
mtcars_scaled <- as.data.frame(scale(mtcars_euclidean))
summary(mtcars_scaled) # means are all 0 after scaling
## mpg cyl disp hp
## Min. :-1.6079 Min. :-1.225 Min. :-1.2879 Min. :-1.3810
## 1st Qu.:-0.7741 1st Qu.:-1.225 1st Qu.:-0.8867 1st Qu.:-0.7320
## Median :-0.1478 Median :-0.105 Median :-0.2777 Median :-0.3455
## Mean : 0.0000 Mean : 0.000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4495 3rd Qu.: 1.015 3rd Qu.: 0.7688 3rd Qu.: 0.4859
## Max. : 2.2913 Max. : 1.015 Max. : 1.9468 Max. : 2.7466
## drat wt qsec vs
## Min. :-1.5646 Min. :-1.7418 Min. :-1.87401 Min. :-0.868
## 1st Qu.:-0.9661 1st Qu.:-0.6500 1st Qu.:-0.53513 1st Qu.:-0.868
## Median : 0.1841 Median : 0.1101 Median :-0.07765 Median :-0.868
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.000
## 3rd Qu.: 0.6049 3rd Qu.: 0.4014 3rd Qu.: 0.58830 3rd Qu.: 1.116
## Max. : 2.4939 Max. : 2.2553 Max. : 2.82675 Max. : 1.116
## am gear carb
## Min. :-0.8141 Min. :-0.9318 Min. :-1.1222
## 1st Qu.:-0.8141 1st Qu.:-0.9318 1st Qu.:-0.5030
## Median :-0.8141 Median : 0.4236 Median :-0.5030
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.1899 3rd Qu.: 0.4236 3rd Qu.: 0.7352
## Max. : 1.1899 Max. : 1.7789 Max. : 3.2117
library(dendextend) # for color_branches()
dist_mtcars_euclidean <- dist(mtcars_scaled,method='euclidean')
hier_clust <- hclust(dist_mtcars_euclidean,method='average')
cut_avg <- cutree(hier_clust, k = 3)
avg_dend_obj <- as.dendrogram(hier_clust)
avg_col_dend <- color_branches(avg_dend_obj, h = 5)
plot(avg_col_dend)
mtcars_euclidean$cluster <- cut_avg
table(mtcars_euclidean$cluster)
##
## 1 2 3
## 18 12 2
mtcars_manhattan <- mtcars
mtcars_scaled <- as.data.frame(scale(mtcars_manhattan))
# mtcars_scaled is identical to the scaled data above, so its summary is omitted
dist_mtcars_manhattan <- dist(mtcars_scaled,method='manhattan')
hier_clust <- hclust(dist_mtcars_manhattan,method='average')
cut_avg <- cutree(hier_clust, k = 3)
avg_dend_obj <- as.dendrogram(hier_clust)
avg_col_dend <- color_branches(avg_dend_obj, h = 5)
plot(avg_col_dend)
mtcars_manhattan$cluster <- cut_avg
table(mtcars_manhattan$cluster)
##
## 1 2 3
## 5 15 12
set.seed(123)
train_size <- floor(0.80 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = train_size)
train <- iris[train_ind,-5 ] #120 rows all columns
test <- iris[-train_ind,-5 ] # 30 rows all columns
iris_target_category <- iris[train_ind,5] #120 rows target column
iris_test_category <- iris[-train_ind,5] # 30 rows target column
library(class) # for knn()
results <- knn(train,test,cl=iris_target_category,k=13)
conf_mat <- table(results,iris_test_category)
accuracy <- sum(diag(conf_mat))/sum(conf_mat)
accuracy
## [1] 0.9666667
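A quick sanity check on the choice k = 13 is to loop over several odd values of k and compare test accuracy. A minimal sketch reusing the objects above (the helper name knn_accuracy is mine):
# hypothetical helper: test-set accuracy of knn for a given k
knn_accuracy <- function(k) {
  preds <- knn(train, test, cl = iris_target_category, k = k)
  mean(preds == iris_test_category)
}
sapply(seq(1, 21, by = 2), knn_accuracy) # odd k avoids voting ties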
library(ggplot2)
ggplot(iris,aes(Sepal.Length,Sepal.Width,color=Species)) + geom_point() + ggtitle("Iris Data Grouped by Species")
set.seed(123)
results_kmeans <- kmeans(iris[,c(1,2)], 3, nstart = 25)
results_kmeans
## K-means clustering with 3 clusters of sizes 50, 53, 47
##
## Cluster means:
## Sepal.Length Sepal.Width
## 1 5.006000 3.428000
## 2 5.773585 2.692453
## 3 6.812766 3.074468
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 3 2 3 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2
## [75] 3 3 3 3 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
##
## Within cluster sum of squares by cluster:
## [1] 13.1290 11.3000 12.6217
## (between_SS / total_SS = 71.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# map each cluster to the species it matches: the cluster means above line up
# with setosa (1), versicolor (2), and virginica (3)
iris$cluster_results <- c('setosa','versicolor','virginica')[results_kmeans$cluster]
table_kmeans <- table(iris$Species,iris$cluster_results)
accuracy <- sum(diag(table_kmeans))/sum(table_kmeans)
accuracy
## [1] 0.82
ggplot(iris,aes(Sepal.Length,Sepal.Width,color=cluster_results)) + geom_point() + ggtitle("Kmeans Clustering Using Iris Data - 82% Accuracy")
Provide a summary of what you have learned and give several screenshots to show that you have gone through this code.
K-means is a quick and simple clustering method, but because it starts from randomly chosen centroids it can come up with very different solutions when you run it more than once. To determine which run is best, you use the inertia, the sum of squared distances between each point and its assigned centroid. The lower the inertia, the better the model.
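In R's kmeans() the inertia is the tot.withinss component (listed under "Available components" above). A minimal sketch comparing several single-start runs to show the run-to-run variation:
# inertia (total within-cluster sum of squares) for five runs with
# nstart = 1, so each run keeps its own random initialization
set.seed(42) # seed chosen arbitrarily for reproducibility
inertias <- replicate(5, kmeans(iris[, 1:2], centers = 3, nstart = 1)$tot.withinss)
inertias # the lowest value marks the best of the five runs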
You can use k-means++ to make sure the initial centroids are not chosen purely at random: after the first centroid, each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already picked, so the starting points end up far apart. This decreases the time it takes K-means to converge.
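Base R's kmeans() has no built-in k-means++ option, but the seeding step fits in a few lines. A rough base-R sketch (the function name kmeanspp_centers is mine):
# hypothetical k-means++ seeding: each new center is sampled with
# probability proportional to the squared distance to its nearest
# already-chosen center
kmeanspp_centers <- function(x, k) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), 1), , drop = FALSE]
  while (nrow(centers) < k) {
    d2 <- apply(x, 1, function(p) min(colSums((t(centers) - p)^2)))
    centers <- rbind(centers, x[sample(nrow(x), 1, prob = d2), , drop = FALSE])
  }
  centers
}
kmeans(iris[, 1:2], centers = kmeanspp_centers(iris[, 1:2], 3))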
Mini-batch K-means is another option that is quicker than traditional K-means because each iteration updates the centroids from only a small random batch of points, at the cost of somewhat worse clusters (higher inertia).
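Base R has no mini-batch variant either; one option, assuming the ClusterR package and its MiniBatchKmeans() interface, is sketched below:
# assumes the ClusterR package; MiniBatchKmeans() updates centroids
# from random batches of points instead of the full dataset
library(ClusterR)
mb <- MiniBatchKmeans(as.matrix(iris[, 1:2]), clusters = 3, batch_size = 20, num_init = 5)
mb$centroids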
As k increases, inertia always decreases, so the lowest inertia alone cannot pick k. Instead, plot the inertia as k increases and pick the 'elbow', where the curve stops dropping steeply, as your k-value.
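Since factoextra is already loaded above, fviz_nbclust() draws this elbow plot directly; the same curve can also be built by hand. A minimal sketch:
# elbow plot: inertia ("wss") for k = 1..10 (fviz_nbclust's default range)
fviz_nbclust(iris[, 1:2], kmeans, method = "wss")
# the same curve by hand
wss <- sapply(1:10, function(k) kmeans(iris[, 1:2], centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "inertia")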
You can also calculate the silhouette score (closer to 1 is better; -1 is the worst). For each point it compares the average distance to the other points in its own cluster against the average distance to the points in the nearest other cluster, so it measures how well points fit their assigned cluster.
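The cluster package's silhouette() computes this from a clustering vector and a distance matrix. A minimal sketch using the k-means result above:
# average silhouette width of the k-means clustering above;
# values near 1 mean points sit firmly inside their clusters
library(cluster)
sil <- silhouette(results_kmeans$cluster, dist(iris[, 1:2]))
mean(sil[, "sil_width"])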
This exercise showed great example code of models in action and walked through examples of application and prediction for bagging ensembles, random forests, AdaBoost, gradient boosting, XGBoost, and others. I always enjoy seeing actual Python or R code in action.
#### Resources

* https://www.datacamp.com/community/tutorials/hierarchical-clustering-R