I used the iris dataset to calculate the Euclidean distance, the Manhattan distance, and the Jaccard distance.
# Euclidean distance between two equal-length numeric columns
calc_distance <- function(x,y){
sqrt(sum((x - y)^2))
}
calc_distance(iris[1],iris[2])
## [1] 36.15785
dist(t(iris[1:2]),method='euclidean')
## Sepal.Length
## Sepal.Width 36.15785
# Manhattan (city-block) distance
calc_manhattan_distance <- function(x,y){
sum(abs(x - y))
}
calc_manhattan_distance(iris[1],iris[2])
## [1] 417.9
dist(t(iris[1:2]),method='manhattan')
## Sepal.Length
## Sepal.Width 417.9
# Jaccard distance on the sets of distinct values in each column:
# 1 - |intersection| / |union|. Needs dplyr for count() and the
# data-frame methods of intersect()/union().
library(dplyr)
calc_jaccard_distance <- function(x,y){
colnames(x) <- 'a'
colnames(y) <- 'a'
intersection <- count(intersect(x,y))
union <- count(union(x,y))
jac_dist <- 1 - intersection/union
return(jac_dist)
}
calc_jaccard_distance(iris[1],iris[2])
## n
## 1 0.9824561
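Because Jaccard is defined on sets, the same number can be checked with base R's vector set operations, which deduplicate automatically. A minimal sketch (the helper name jaccard_dist_vec is mine):
# hypothetical base-R check: Jaccard distance between the distinct
# values of two numeric vectors
jaccard_dist_vec <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}
jaccard_dist_vec(iris$Sepal.Length, iris$Sepal.Width)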
# takes two single-column data frames (one metric each) and returns the Euclidean distance between them
calc_distance <- function(x,y){
sqrt(sum((x - y)^2))
}
calc_distance(mtcars['cyl'],mtcars['mpg'])
## [1] 89.32586
calc_distance(mtcars['disp'],mtcars['mpg'])
## [1] 1391.495
calc_distance(mtcars['disp'],mtcars['cyl'])
## [1] 1441.252
dist(t(mtcars),method='euclidean')
## mpg cyl disp hp drat wt
## cyl 89.325864
## disp 1391.495462 1441.251772
## hp 824.375467 878.176520 656.640442
## drat 98.511658 19.078540 1459.404217 895.520090
## wt 102.877138 18.058047 1458.014195 895.374454 8.139647
## qsec 33.261091 68.310762 1390.078394 826.067286 81.255418 83.655198
## vs 115.623138 34.785054 1475.104291 911.994518 18.130932 17.371962
## am 115.849514 34.713110 1475.096156 911.588175 18.179403 17.641289
## gear 98.084199 18.867962 1459.033540 894.710009 2.981728 8.929562
## carb 105.320986 21.213203 1460.660559 896.136150 10.689747 8.596341
## qsec vs am gear
## cyl
## disp
## hp
## drat
## wt
## qsec
## vs 98.823784
## am 99.272958 3.605551
## gear 80.935531 18.920888 18.734994
## carb 86.787904 17.262677 16.462078 10.099505
calc_manhattan_distance <- function(x,y){
sum(abs(x - y))
}
calc_manhattan_distance(mtcars['cyl'],mtcars['mpg'])
## [1] 444.9
dist(t(mtcars),method='manhattan')
## mpg cyl disp hp drat wt qsec vs
## cyl 444.900
## disp 6740.200 7185.100
## hp 4051.100 4496.000 2852.900
## drat 527.810 86.610 7268.010 4578.910
## wt 539.948 95.048 7280.148 4591.048 37.466
## qsec 136.260 373.160 6811.940 4122.840 456.070 468.208
## vs 628.900 184.000 7369.100 4680.000 101.090 88.952 557.160
## am 629.900 185.000 7370.100 4681.000 102.090 89.952 558.160 13.000
## gear 524.900 84.000 7265.100 4576.000 10.830 42.896 453.160 104.000
## carb 552.900 108.000 7293.100 4604.000 47.290 40.106 481.160 76.000
## am gear
## cyl
## disp
## hp
## drat
## wt
## qsec
## vs
## am
## gear 105.000
## carb 77.000 46.000
calc_jaccard_distance <- function(x,y){
colnames(x) <- 'a'
colnames(y) <- 'a'
intersection <- count(intersect(x,y))
union <- count(union(x,y))
jac_dist <- 1 - intersection/union
return(jac_dist)
}
calc_jaccard_distance(mtcars['mpg'],mtcars['cyl'])
## n
## 1 1
calc_jaccard_distance(mtcars['drat'],mtcars['wt'])
## n
## 1 0.9591837
Hierarchical clustering with Euclidean distance splits off a small cluster early and concentrates most of the points in the first cluster (18/12/2), while Manhattan distance spreads the points more evenly across the three clusters (5/15/12).
mtcars_euclidean <- mtcars
mtcars_scaled <- as.data.frame(scale(mtcars_euclidean))
summary(mtcars_scaled) # means are all 0 after scaling
## mpg cyl disp hp
## Min. :-1.6079 Min. :-1.225 Min. :-1.2879 Min. :-1.3810
## 1st Qu.:-0.7741 1st Qu.:-1.225 1st Qu.:-0.8867 1st Qu.:-0.7320
## Median :-0.1478 Median :-0.105 Median :-0.2777 Median :-0.3455
## Mean : 0.0000 Mean : 0.000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4495 3rd Qu.: 1.015 3rd Qu.: 0.7688 3rd Qu.: 0.4859
## Max. : 2.2913 Max. : 1.015 Max. : 1.9468 Max. : 2.7466
## drat wt qsec vs
## Min. :-1.5646 Min. :-1.7418 Min. :-1.87401 Min. :-0.868
## 1st Qu.:-0.9661 1st Qu.:-0.6500 1st Qu.:-0.53513 1st Qu.:-0.868
## Median : 0.1841 Median : 0.1101 Median :-0.07765 Median :-0.868
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.000
## 3rd Qu.: 0.6049 3rd Qu.: 0.4014 3rd Qu.: 0.58830 3rd Qu.: 1.116
## Max. : 2.4939 Max. : 2.2553 Max. : 2.82675 Max. : 1.116
## am gear carb
## Min. :-0.8141 Min. :-0.9318 Min. :-1.1222
## 1st Qu.:-0.8141 1st Qu.:-0.9318 1st Qu.:-0.5030
## Median :-0.8141 Median : 0.4236 Median :-0.5030
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.1899 3rd Qu.: 0.4236 3rd Qu.: 0.7352
## Max. : 1.1899 Max. : 1.7789 Max. : 3.2117
library(dendextend) # for color_branches()
dist_mtcars_euclidean <- dist(mtcars_scaled,method='euclidean')
hier_clust <- hclust(dist_mtcars_euclidean,method='average')
cut_avg <- cutree(hier_clust, k = 3)
avg_dend_obj <- as.dendrogram(hier_clust)
avg_col_dend <- color_branches(avg_dend_obj, h = 5)
plot(avg_col_dend)
mtcars_euclidean$cluster <- cut_avg
table(mtcars_euclidean$cluster)
##
## 1 2 3
## 18 12 2
mtcars_manhattan <- mtcars
mtcars_scaled <- as.data.frame(scale(mtcars_manhattan))
# mtcars_scaled is identical to the scaled data above, so its summary is omitted
dist_mtcars_manhattan <- dist(mtcars_scaled,method='manhattan')
hier_clust <- hclust(dist_mtcars_manhattan,method='average')
cut_avg <- cutree(hier_clust, k = 3)
avg_dend_obj <- as.dendrogram(hier_clust)
avg_col_dend <- color_branches(avg_dend_obj, h = 5)
plot(avg_col_dend)
mtcars_manhattan$cluster <- cut_avg
table(mtcars_manhattan$cluster)
##
## 1 2 3
## 5 15 12
set.seed(123)
train_size <- floor(0.80 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = train_size)
train <- iris[train_ind,-5 ] #120 rows all columns
test <- iris[-train_ind,-5 ] # 30 rows all columns
iris_target_category <- iris[train_ind,5] #120 rows target column
iris_test_category <- iris[-train_ind,5] # 30 rows target column
library(class) # for knn()
results <- knn(train,test,cl=iris_target_category,k=13)
conf_mat <- table(results,iris_test_category)
accuracy <- sum(diag(conf_mat))/sum(conf_mat)
accuracy
## [1] 0.9666667
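A quick sanity check on the choice k = 13 is to loop over several odd values of k and compare test accuracy. A minimal sketch reusing the objects above (the helper name knn_accuracy is mine):
# hypothetical helper: test-set accuracy of knn for a given k
knn_accuracy <- function(k) {
  preds <- knn(train, test, cl = iris_target_category, k = k)
  mean(preds == iris_test_category)
}
sapply(seq(1, 21, by = 2), knn_accuracy) # odd k avoids voting ties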
library(ggplot2)
ggplot(iris,aes(Sepal.Length,Sepal.Width,color=Species)) + geom_point() + ggtitle("Iris Data Grouped by Species")
set.seed(123)
results_kmeans <- kmeans(iris[,c(1,2)], 3, nstart = 25)
results_kmeans
## K-means clustering with 3 clusters of sizes 50, 53, 47
##
## Cluster means:
## Sepal.Length Sepal.Width
## 1 5.006000 3.428000
## 2 5.773585 2.692453
## 3 6.812766 3.074468
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 3 2 3 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2
## [75] 3 3 3 3 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
##
## Within cluster sum of squares by cluster:
## [1] 13.1290 11.3000 12.6217
## (between_SS / total_SS = 71.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# map each cluster to the species it matches: the cluster means above line up
# with setosa (1), versicolor (2), and virginica (3)
iris$cluster_results <- c('setosa','versicolor','virginica')[results_kmeans$cluster]
table_kmeans <- table(iris$Species,iris$cluster_results)
accuracy <- sum(diag(table_kmeans))/sum(table_kmeans)
accuracy
## [1] 0.82
ggplot(iris,aes(Sepal.Length,Sepal.Width,color=cluster_results)) + geom_point() + ggtitle("Kmeans Clustering Using Iris Data - 82% Accuracy")
Provide a summary of what you have learned and give several screenshots to show that you have gone through this code.
K-means is a quick and simple clustering method, but because it starts from randomly chosen centroids it can come up with very different solutions when you run it more than once. To determine which run is best, you use the inertia, the sum of squared distances between each point and its assigned centroid. The lower the inertia, the better the model.
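In R's kmeans() the inertia is the tot.withinss component (listed under "Available components" above). A minimal sketch comparing several single-start runs to show the run-to-run variation:
# inertia (total within-cluster sum of squares) for five runs with
# nstart = 1, so each run keeps its own random initialization
set.seed(42) # seed chosen arbitrarily for reproducibility
inertias <- replicate(5, kmeans(iris[, 1:2], centers = 3, nstart = 1)$tot.withinss)
inertias # the lowest value marks the best of the five runs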
You can use k-means++ to make sure the initial centroids are not chosen purely at random: after the first centroid, each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already picked, so the starting points end up far apart. This decreases the time it takes K-means to converge.
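Base R's kmeans() has no built-in k-means++ option, but the seeding step fits in a few lines. A rough base-R sketch (the function name kmeanspp_centers is mine):
# hypothetical k-means++ seeding: each new center is sampled with
# probability proportional to the squared distance to its nearest
# already-chosen center
kmeanspp_centers <- function(x, k) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), 1), , drop = FALSE]
  while (nrow(centers) < k) {
    d2 <- apply(x, 1, function(p) min(colSums((t(centers) - p)^2)))
    centers <- rbind(centers, x[sample(nrow(x), 1, prob = d2), , drop = FALSE])
  }
  centers
}
kmeans(iris[, 1:2], centers = kmeanspp_centers(iris[, 1:2], 3))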
Mini-batch K-means is another option that is quicker than traditional K-means because each iteration updates the centroids from only a small random batch of points, at the cost of somewhat worse clusters (higher inertia).
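Base R has no mini-batch variant either; one option, assuming the ClusterR package and its MiniBatchKmeans() interface, is sketched below:
# assumes the ClusterR package; MiniBatchKmeans() updates centroids
# from random batches of points instead of the full dataset
library(ClusterR)
mb <- MiniBatchKmeans(as.matrix(iris[, 1:2]), clusters = 3, batch_size = 20, num_init = 5)
mb$centroids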
As k increases, inertia always decreases, so the lowest inertia alone cannot pick k. Instead, plot the inertia as k increases and pick the 'elbow', where the curve stops dropping steeply, as your k-value.
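Since factoextra is already loaded above, fviz_nbclust() draws this elbow plot directly; the same curve can also be built by hand. A minimal sketch:
# elbow plot: inertia ("wss") for k = 1..10 (fviz_nbclust's default range)
fviz_nbclust(iris[, 1:2], kmeans, method = "wss")
# the same curve by hand
wss <- sapply(1:10, function(k) kmeans(iris[, 1:2], centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "inertia")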
You can also calculate the silhouette score (closer to 1 is better; -1 is the worst). For each point it compares the average distance to the other points in its own cluster against the average distance to the points in the nearest other cluster, so it measures how well points fit their assigned cluster.
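The cluster package's silhouette() computes this from a clustering vector and a distance matrix. A minimal sketch using the k-means result above:
# average silhouette width of the k-means clustering above;
# values near 1 mean points sit firmly inside their clusters
library(cluster)
sil <- silhouette(results_kmeans$cluster, dist(iris[, 1:2]))
mean(sil[, "sil_width"])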
This exercise showed great example code of models in action and walked through examples of application and prediction for bagging ensembles, random forests, AdaBoost, gradient boosting, XGBoost, and others. I always enjoy seeing actual Python or R code in action.
#### Resources

* https://www.datacamp.com/community/tutorials/hierarchical-clustering-R