1. Importing and preparation of the data

Dataset: Dog Breeds

Source: Keggle: https://www.kaggle.com/datasets/sujaykapadnis/dog-breeds?select=breed_traits.csv

#Importing the data

mydata <- read.table("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/MVA/MVA HW 2/breed_traits.csv", header=TRUE, sep=",")
head(mydata)
##                   Breed Affectionate.With.Family
## 1 Retrievers (Labrador)                        5
## 2       French Bulldogs                        5
## 3  German Shepherd Dogs                        5
## 4   Retrievers (Golden)                        5
## 5              Bulldogs                        4
## 6               Poodles                        5
##   Good.With.Young.Children Good.With.Other.Dogs Shedding.Level
## 1                        5                    5              4
## 2                        5                    4              3
## 3                        5                    3              4
## 4                        5                    5              4
## 5                        3                    3              3
## 6                        5                    3              1
##   Coat.Grooming.Frequency Drooling.Level Coat.Type Coat.Length
## 1                       2              2    Double       Short
## 2                       1              3    Smooth       Short
## 3                       2              2    Double      Medium
## 4                       2              2    Double      Medium
## 5                       3              3    Smooth       Short
## 6                       4              1     Curly        Long
##   Openness.To.Strangers Playfulness.Level Watchdog.Protective.Nature
## 1                     5                 5                          3
## 2                     5                 5                          3
## 3                     3                 4                          5
## 4                     5                 4                          3
## 5                     4                 4                          3
## 6                     5                 5                          5
##   Adaptability.Level Trainability.Level Energy.Level Barking.Level
## 1                  5                  5            5             3
## 2                  5                  4            3             1
## 3                  5                  5            5             3
## 4                  5                  5            3             1
## 5                  3                  4            3             2
## 6                  4                  5            4             4
##   Mental.Stimulation.Needs
## 1                        4
## 2                        3
## 3                        5
## 4                        4
## 5                        3
## 6                        5

Unit of observation is one dog breed.

Initial sample size: 195 dog breeds

Description of all variables before the data manipulation:

2. Data Manipulation

summary(mydata)
##     Breed           Affectionate.With.Family Good.With.Young.Children
##  Length:195         Min.   :0.000            Min.   :0.000           
##  Class :character   1st Qu.:4.000            1st Qu.:3.000           
##  Mode  :character   Median :5.000            Median :3.000           
##                     Mean   :4.477            Mean   :3.867           
##                     3rd Qu.:5.000            3rd Qu.:5.000           
##                     Max.   :5.000            Max.   :5.000           
##  Good.With.Other.Dogs Shedding.Level Coat.Grooming.Frequency
##  Min.   :0.000        Min.   :0.00   Min.   :0.000          
##  1st Qu.:3.000        1st Qu.:2.00   1st Qu.:2.000          
##  Median :3.000        Median :3.00   Median :2.000          
##  Mean   :3.513        Mean   :2.59   Mean   :2.277          
##  3rd Qu.:4.000        3rd Qu.:3.00   3rd Qu.:3.000          
##  Max.   :5.000        Max.   :5.00   Max.   :5.000          
##  Drooling.Level  Coat.Type         Coat.Length       
##  Min.   :0.00   Length:195         Length:195        
##  1st Qu.:1.00   Class :character   Class :character  
##  Median :2.00   Mode  :character   Mode  :character  
##  Mean   :1.79                                        
##  3rd Qu.:2.00                                        
##  Max.   :5.00                                        
##  Openness.To.Strangers Playfulness.Level Watchdog.Protective.Nature
##  Min.   :0.000         Min.   :0.000     Min.   :0.000             
##  1st Qu.:3.000         1st Qu.:3.000     1st Qu.:3.000             
##  Median :3.000         Median :4.000     Median :4.000             
##  Mean   :3.467         Mean   :3.631     Mean   :3.718             
##  3rd Qu.:4.000         3rd Qu.:4.000     3rd Qu.:5.000             
##  Max.   :5.000         Max.   :5.000     Max.   :5.000             
##  Adaptability.Level Trainability.Level  Energy.Level  
##  Min.   :0.000      Min.   :0.000      Min.   :0.000  
##  1st Qu.:3.000      1st Qu.:3.000      1st Qu.:3.000  
##  Median :4.000      Median :4.000      Median :4.000  
##  Mean   :3.774      Mean   :3.846      Mean   :3.723  
##  3rd Qu.:4.000      3rd Qu.:5.000      3rd Qu.:4.000  
##  Max.   :5.000      Max.   :5.000      Max.   :5.000  
##  Barking.Level   Mental.Stimulation.Needs
##  Min.   :0.000   Min.   :0.000           
##  1st Qu.:3.000   1st Qu.:3.000           
##  Median :3.000   Median :4.000           
##  Mean   :3.118   Mean   :3.662           
##  3rd Qu.:4.000   3rd Qu.:4.000           
##  Max.   :5.000   Max.   :5.000

We can see that we need to do some data adjustments (removing the n/a-s, factoring…).

mydata$Coat.Type <- factor(mydata$Coat.Type,
                                   levels = c("Double","Curly", "Smooth", "Wiry", "Silky", "Hairless", "Corded", "Wavy"),
                                   labels = c("Double","Curly", "Smooth", "Wiry", "Silky", "Hairless", "Corded", "Wavy"))

mydata$Coat.Length <- factor(mydata$Coat.Length,
                                   levels = c("Short","Medium", "Long"),
                                   labels = c("Short","Medium", "Long"))
library(tidyr)
mydata <- drop_na(mydata)

We dropped 4 units due to drop_na function. Let’s check how the data looks now:

summary(mydata)
##     Breed           Affectionate.With.Family Good.With.Young.Children
##  Length:191         Min.   :1.000            Min.   :1.00            
##  Class :character   1st Qu.:4.000            1st Qu.:3.00            
##  Mode  :character   Median :5.000            Median :3.00            
##                     Mean   :4.508            Mean   :3.89            
##                     3rd Qu.:5.000            3rd Qu.:5.00            
##                     Max.   :5.000            Max.   :5.00            
##                                                                      
##  Good.With.Other.Dogs Shedding.Level  Coat.Grooming.Frequency
##  Min.   :1.000        Min.   :1.000   Min.   :1.000          
##  1st Qu.:3.000        1st Qu.:2.000   1st Qu.:2.000          
##  Median :3.000        Median :3.000   Median :2.000          
##  Mean   :3.529        Mean   :2.607   Mean   :2.283          
##  3rd Qu.:4.000        3rd Qu.:3.000   3rd Qu.:3.000          
##  Max.   :5.000        Max.   :5.000   Max.   :5.000          
##                                                              
##  Drooling.Level    Coat.Type  Coat.Length Openness.To.Strangers
##  Min.   :1.000   Double :66   Short :86   Min.   :1.000        
##  1st Qu.:1.000   Smooth :66   Medium:77   1st Qu.:3.000        
##  Median :2.000   Wiry   :30   Long  :28   Median :3.000        
##  Mean   :1.791   Silky  : 9               Mean   :3.482        
##  3rd Qu.:2.000   Curly  : 7               3rd Qu.:4.000        
##  Max.   :5.000   Wavy   : 6               Max.   :5.000        
##                  (Other): 7                                    
##  Playfulness.Level Watchdog.Protective.Nature Adaptability.Level
##  Min.   :2.000     Min.   :1.000              Min.   :3.000     
##  1st Qu.:3.000     1st Qu.:3.000              1st Qu.:3.000     
##  Median :4.000     Median :4.000              Median :4.000     
##  Mean   :3.654     Mean   :3.738              Mean   :3.796     
##  3rd Qu.:4.000     3rd Qu.:5.000              3rd Qu.:4.000     
##  Max.   :5.000     Max.   :5.000              Max.   :5.000     
##                                                                 
##  Trainability.Level  Energy.Level   Barking.Level  
##  Min.   :1.000      Min.   :2.000   Min.   :1.000  
##  1st Qu.:3.000      1st Qu.:3.000   1st Qu.:3.000  
##  Median :4.000      Median :4.000   Median :3.000  
##  Mean   :3.869      Mean   :3.743   Mean   :3.115  
##  3rd Qu.:5.000      3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000      Max.   :5.000   Max.   :5.000  
##                                                    
##  Mental.Stimulation.Needs
##  Min.   :3.000           
##  1st Qu.:3.000           
##  Median :4.000           
##  Mean   :3.686           
##  3rd Qu.:4.000           
##  Max.   :5.000           
## 

We can see this already looks much better.

Interpretations:

After these adjustments, we will start with our clustering part of the homework.

3. RQ1: Can we classify dog breeds into meaningful groups based on their trainability, energy level, shedding level, adaptability, and affection level?

Variables I picked for further analysis are the following, as already mentioned in the research question:

#Saving standardized cluster variables into new data frame and standardization of cluster variables

mydata_clu_std <- as.data.frame(scale(mydata[c("Trainability.Level", "Energy.Level", "Adaptability.Level", "Affectionate.With.Family", "Shedding.Level")]))


head(mydata_clu_std)
##   Trainability.Level Energy.Level Adaptability.Level
## 1           1.309392    1.6482097          1.9422309
## 2           0.151550   -0.9751908          1.9422309
## 3           1.309392    1.6482097          1.9422309
## 4           1.309392   -0.9751908          1.9422309
## 5           0.151550   -0.9751908         -1.2835613
## 6           1.309392    0.3365095          0.3293348
##   Affectionate.With.Family Shedding.Level
## 1                0.6148339      1.6367984
## 2                0.6148339      0.4615033
## 3                0.6148339      1.6367984
## 4                0.6148339      1.6367984
## 5               -0.6344563      0.4615033
## 6                0.6148339     -1.8890868

Is data clusterable?

Checking for potential outliers with new variable Dissimilarity

mydata$Dissimilarity = sqrt(mydata_clu_std$Trainability.Level^2 + mydata_clu_std$Energy.Level^2 + mydata_clu_std$Adaptability.Level^2 + mydata_clu_std$Affectionate.With.Family^2 + mydata_clu_std$Shedding.Level^2 )
print(head(mydata[order(-mydata$Dissimilarity), c("Dissimilarity", "Breed")], 10))
##     Dissimilarity                     Breed
## 85       5.167183   Anatolian Shepherd Dogs
## 121      4.462448             Afghan Hounds
## 36       3.456591             Basset Hounds
## 99       3.413431       Neapolitan Mastiffs
## 120      3.406517                Beaucerons
## 157      3.380856 Spaniels (American Water)
## 189      3.367492                  Azawakhs
## 1        3.355666     Retrievers (Labrador)
## 3        3.355666      German Shepherd Dogs
## 62       3.275408         Airedale Terriers

Dissimilarity tells us how different each breed is from the rest of the breeds, where higher value means more different. We can see that we have highest dissimilarity (meaning most evident differences) with breeds Anatolian Shepherd Dogs and Afghan Hounds that could be possible outliers, therefore I decided to remove them.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#mydata <- mydata %>%
  #filter(!Breed %in% c("Anatolian Shepherd Dogs", "Afghan Hounds"))

Since this code isn’t working for me (I do not get any errors, but it just does not delete selected rows after I run the code) I left it here to example what I would normally use. But because of this situation I tried a new code for deleting with help of row numbers:

mydata <- mydata %>%
  slice(c(-85,-121))

Now we successfully have 189 observations to continue with.

mydata$ID <- seq(1, nrow(mydata)) #Creating a sequence again
mydata_clu_std <- as.data.frame(scale(mydata[c("Trainability.Level", "Energy.Level", "Adaptability.Level", "Affectionate.With.Family", "Shedding.Level")])) #Scaling the data again

rownames(mydata_clu_std) <- mydata$Breed #Renaming rows with breed names

Dissimilarity matrix

#Graphical presentation of dissimilarity matrix

#install.packages("factoextra")
library(factoextra) 
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distance <- get_dist(mydata_clu_std, 
                     method = "euclidian")

fviz_dist(Distance, 
          gradient = list(low = "darkred", 
                          mid = "grey95", 
                          high = "white")) 

This is a dissimilarity matrix with size 189 x 189. We see groups forming around diagonal, therefore data is clusterable, but I do not see any bigger groups forming which might not be the best sign. I expect Hopkins statistics to be above 0.5, but not too highly above.

Hopkins statistics

library(factoextra) 
get_clust_tendency(mydata_clu_std,
                   n = nrow(mydata_clu_std) - 1,
                   graph = FALSE) 
## $hopkins_stat
## [1] 0.5295679
## 
## $plot
## NULL

Hopkins statistics is successfully above 0.5 and is currently at level 0.53.

With these first steps we confirmed the first part of our analysis, answering the question “Is data clusterable?” We answered the question by fulfilling the following objectives:

  1. Numeric clustering variables
  2. Checking dissimilarity metrix - visible groups forming
  3. Hopkins statistics > 0.5

We can conclude that data is clusterable, although it is not the best since some criteria is barely passing, and move on to answering the question about how many clusters to create based on the given data.

How many clusters?

Hierarchical approach - Cluster Dendrogram

library(dplyr)
library(factoextra)
WARD <- mydata_clu_std %>%
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")          

WARD
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 189
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none"
## instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at
##   <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning
## was generated.

By this method, I would conclude it makes most sense to make 3 clusters. But let’s also check other methods.

K-means approach

#install.packages("BiocManager")
#BiocManager::install("NbClust")
library(factoextra)
library(NbClust)

fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

The elbow method in our case suggests 3, 6 or 8 clusters.

Silhouette Analysis

fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

Based on this method we try to maximize silhouette value through observing peaks. The highest one is suggesting 10 clusters.

library(NbClust)
# Determine the optimal number of clusters

nc <- NbClust(mydata_clu_std, distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans", index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 10 proposed 3 as the best number of clusters 
## * 2 proposed 8 as the best number of clusters 
## * 5 proposed 9 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

Based on this I will continue with 3 clusters.

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions

Clustering
## K-means clustering with 3 clusters of sizes 90, 43, 56
## 
## Cluster means:
##   Trainability.Level Energy.Level Adaptability.Level
## 1         -0.3271923   -0.5254591          0.1190422
## 2         -0.3483321   -0.1241318         -1.1511861
## 3          0.7933140    0.9398033          0.6926286
##   Affectionate.With.Family Shedding.Level
## 1                0.4256575    -0.33116256
## 2               -1.6329290    -0.06568033
## 3                0.5697639     0.58265865
## 
## Clustering vector:
##                 Retrievers (Labrador) 
##                                     3 
##                       French Bulldogs 
##                                     1 
##                  German Shepherd Dogs 
##                                     3 
##                   Retrievers (Golden) 
##                                     3 
##                              Bulldogs 
##                                     2 
##                               Poodles 
##                                     1 
##                               Beagles 
##                                     2 
##                           Rottweilers 
##                                     1 
##         Pointers (German Shorthaired) 
##                                     3 
##                            Dachshunds 
##                                     1 
##                 Pembroke Welsh Corgis 
##                                     3 
##                  Australian Shepherds 
##                                     2 
##                    Yorkshire Terriers 
##                                     1 
##                                Boxers 
##                                     2 
##                           Great Danes 
##                                     1 
##                      Siberian Huskies 
##                                     3 
##        Cavalier King Charles Spaniels 
##                                     1 
##                    Doberman Pinschers 
##                                     3 
##                  Miniature Schnauzers 
##                                     3 
##                              Shih Tzu 
##                                     1 
##                       Boston Terriers 
##                                     1 
##                 Bernese Mountain Dogs 
##                                     3 
##                           Pomeranians 
##                                     1 
##                              Havanese 
##                                     1 
##                            Cane Corso 
##                                     2 
##           Spaniels (English Springer) 
##                                     3 
##                    Shetland Sheepdogs 
##                                     3 
##                             Brittanys 
##                                     2 
##                                  Pugs 
##                                     3 
##                     Spaniels (Cocker) 
##                                     3 
##          Miniature American Shepherds 
##                                     3 
##                        Border Collies 
##                                     3 
##                              Mastiffs 
##                                     1 
##                            Chihuahuas 
##                                     1 
##                               Vizslas 
##                                     3 
##                         Basset Hounds 
##                                     2 
##                      Belgian Malinois 
##                                     2 
##                               Maltese 
##                                     1 
##                           Weimaraners 
##                                     3 
##                               Collies 
##                                     1 
##                         Newfoundlands 
##                                     1 
##                  Rhodesian Ridgebacks 
##                                     1 
##                             Shiba Inu 
##                                     1 
##          West Highland White Terriers 
##                                     1 
##                        Bichons Frises 
##                                     1 
##                           Bloodhounds 
##                                     2 
##             Spaniels (English Cocker) 
##                                     1 
##                                Akitas 
##                                     2 
##                 Portuguese Water Dogs 
##                                     3 
##           Retrievers (Chesapeake Bay) 
##                                     3 
##                            Dalmatians 
##                                     3 
##                          St. Bernards 
##                                     1 
##                             Papillons 
##                                     3 
##                Australian Cattle Dogs 
##                                     2 
##                          Bullmastiffs 
##                                     2 
##                              Samoyeds 
##                                     3 
##                     Scottish Terriers 
##                                     1 
##          Soft Coated Wheaten Terriers 
##                                     1 
##                              Whippets 
##                                     1 
##          Pointers (German Wirehaired) 
##                                     3 
##                      Chinese Shar-Pei 
##                                     1 
##                     Airedale Terriers 
##                                     2 
##          Wirehaired Pointing Griffons 
##                                     3 
##                         Bull Terriers 
##                                     2 
##                     Alaskan Malamutes 
##                                     2 
##                 Cardigan Welsh Corgis 
##                                     2 
##                      Giant Schnauzers 
##                                     3 
##                 Old English Sheepdogs 
##                                     1 
##                    Italian Greyhounds 
##                                     1 
##                        Great Pyrenees 
##                                     1 
##                    Dogues de Bordeaux 
##                                     1 
##                      Russell Terriers 
##                                     3 
##                        Cairn Terriers 
##                                     2 
##                      Irish Wolfhounds 
##                                     1 
##                       Setters (Irish) 
##                                     3 
##           Greater Swiss Mountain Dogs 
##                                     1 
##                   Miniature Pinschers 
##                                     3 
##                           Lhasa Apsos 
##                                     1 
##                       Chinese Crested 
##                                     1 
##                       Coton de Tulear 
##                                     1 
##           Staffordshire Bull Terriers 
##                                     3 
##       American Staffordshire Terriers 
##                                     1 
##                          Rat Terriers 
##                                     3 
##                            Chow Chows 
##                                     2 
##                              Basenjis 
##                                     2 
##                     Spaniels (Boykin) 
##                                     2 
##                     Lagotti Romagnoli 
##                                     1 
##                     Brussels Griffons 
##                                     1 
## Retrievers (Nova Scotia Duck Tolling) 
##                                     3 
##                   Norwegian Elkhounds 
##                                     3 
##                   Standard Schnauzers 
##                                     1 
##                       Dogo Argentinos 
##                                     3 
##                             Pekingese 
##                                     1 
##                            Keeshonden 
##                                     3 
##                       Border Terriers 
##                                     1 
##                           Leonbergers 
##                                     3 
##                      Tibetan Terriers 
##                                     1 
##                   Neapolitan Mastiffs 
##                                     2 
##                     Setters (English) 
##                                     1 
##              Retrievers (Flat-Coated) 
##                                     3 
##                               Borzois 
##                                     2 
##                   Fox Terriers (Wire) 
##                                     1 
##               Miniature Bull Terriers 
##                                     1 
##                      Belgian Tervuren 
##                                     2 
##                      Setters (Gordon) 
##                                     3 
##                        Silky Terriers 
##                                     1 
##                      Norwich Terriers 
##                                     1 
##                      Spinoni Italiani 
##                                     1 
##                         Japanese Chin 
##                                     1 
##                        Welsh Terriers 
##                                     1 
##                      Toy Fox Terriers 
##                                     3 
##                           Schipperkes 
##                                     1 
##               Parson Russell Terriers 
##                                     3 
##                              Pointers 
##                                     3 
##                     Belgian Sheepdogs 
##                                     2 
##                      Tibetan Spaniels 
##                                     1 
##                  American Eskimo Dogs 
##                                     3 
##                        Irish Terriers 
##                                     1 
##                            Beaucerons 
##                                     2 
##                             Boerboels 
##                                     1 
##                 Fox Terriers (Smooth) 
##                                     1 
##                       Bearded Collies 
##                                     1 
##                Black Russian Terriers 
##                                     2 
##              Black and Tan Coonhounds 
##                                     1 
##             Spaniels (Welsh Springer) 
##                                     3 
##            American Hairless Terriers 
##                                     1 
##                      Norfolk Terriers 
##                                     1 
##                        Xoloitzcuintli 
##                                     1 
##                   Manchester Terriers 
##                                     1 
##                   Kerry Blue Terriers 
##                                     1 
##                   Australian Terriers 
##                                     2 
##                    Spaniels (Clumber) 
##                                     1 
##                     Lakeland Terriers 
##                                     1 
##                   Bluetick Coonhounds 
##                                     2 
##                  English Toy Spaniels 
##                                     1 
##                      German Pinschers 
##                                     3 
##                      Tibetan Mastiffs 
##                                     2 
##                   Bedlington Terriers 
##                                     2 
##                            Greyhounds 
##                                     1 
##                                 Pulik 
##                                     1 
##                               Salukis 
##                                     1 
##                               Barbets 
##                                     2 
##                    Redbone Coonhounds 
##                                     1 
##                     Swedish Vallhunds 
##                                     3 
##                     Sealyham Terriers 
##                                     1 
##                    Spanish Water Dogs 
##                                     1 
##                               Briards 
##                                     2 
##                        Berger Picards 
##                                     2 
##             Entlebucher Mountain Dogs 
##                                     3 
##             Treeing Walker Coonhounds 
##                                     3 
##                   Icelandic Sheepdogs 
##                                     3 
##                    Wirehaired Vizslas 
##                                     3 
##                                 Pumik 
##                                     3 
##           Portuguese Podengo Pequenos 
##                                     1 
##             Spaniels (American Water) 
##                                     2 
##             Retrievers (Curly-Coated) 
##                                     1 
##                      Spaniels (Field) 
##                                     1 
##                               Lowchen 
##                                     1 
##            Nederlandse Kooikerhondjes 
##                                     3 
##                        Affenpinschers 
##                                     2 
##                     Finnish Lapphunds 
##                                     1 
##                   Scottish Deerhounds 
##                                     1 
##                     Norwegian Buhunds 
##                                     1 
##                Glen of Imaal Terriers 
##                                     1 
##         Setters (Irish Red and White) 
##                                     3 
##                         Ibizan Hounds 
##                                     1 
##                     Spaniels (Sussex) 
##                                     1 
##                  Bergamasco Sheepdogs 
##                                     2 
##                Spaniels (Irish Water) 
##                                     1 
##              Polish Lowland Sheepdogs 
##                                     1 
##                              Kuvaszok 
##                                     1 
##                            Komondorok 
##                                     1 
##                    Cirnechi dell’Etna 
##                                     1 
##                        Pharaoh Hounds 
##                                     1 
##               Dandie Dinmont Terriers 
##                                     2 
##                    Pyrenean Shepherds 
##                                     3 
##                         Skye Terriers 
##                                     1 
##                           Canaan Dogs 
##                                     2 
##           American English Coonhounds 
##                                     2 
##                              Chinooks 
##                                     1 
##                         Finnish Spitz 
##                                     3 
##         Grand Basset Griffon Vendeens 
##                                     3 
##                              Sloughis 
##                                     2 
##                              Harriers 
##                                     3 
##                        Cesky Terriers 
##                                     1 
##                    American Foxhounds 
##                                     2 
##                              Azawakhs 
##                                     2 
##                     English Foxhounds 
##                                     3 
##                  Norwegian Lundehunds 
##                                     2 
## 
## Within cluster sum of squares by cluster:
## [1] 260.0833 149.2129 142.2987
##  (between_SS / total_SS =  41.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

We can observe that the best raiton between SSbetween and SStotal we can achieve with this data is 41.3%.

Interpretation:

  • 41.3% of total variability is explained by between variability (by classification of breeds in groups).
library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

We can observe 3 clusters that include 90, 43 and 56 breeds.

We can see that there is some overlap between clusters. This happens because they are measured from multiple dimensions and we only present them in two, therefore it seems like they are overlaping.

We can also observe from the graph that there is cca 60.9% of total information shown with these two dimensions that we can observe in the graph.

Since there are not any specific outliers seen that I would like to drop, I will continue with my analysis.

Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
##   Trainability.Level Energy.Level Adaptability.Level
## 1         -0.3271923   -0.5254591          0.1190422
## 2         -0.3483321   -0.1241318         -1.1511861
## 3          0.7933140    0.9398033          0.6926286
##   Affectionate.With.Family Shedding.Level
## 1                0.4256575    -0.33116256
## 2               -1.6329290    -0.06568033
## 3                0.5697639     0.58265865

Profiling graph

Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)

library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Trainability.Level", "Energy.Level", "Adaptability.Level", "Affectionate.With.Family", "Shedding.Level"))

Figure$Group <- factor(Figure$ID, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$NameF <- factor(Figure$name, 
                       levels = c("Trainability.Level", "Energy.Level", "Adaptability.Level", "Affectionate.With.Family", "Shedding.Level"), 
                       labels = c("Trainability.Level", "Energy.Level", "Adaptability.Level", "Affectionate.With.Family", "Shedding.Level"))

library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

This is now main graph to describe our segments. Based on this we see that Group 3 is the one above average in all of the analyzed characteristics and the worst is Group 2, which is bellow average in all of the checked variables. Group 1 is above average in adaptability and affection with family.

But, we must be careful with shedding level variable since it is not defined well in the description of variables and above average shedding level could mean more shedding than average, which could possibly not be a big advantage of the cluster.

mydata$Group <- Clustering$cluster
#Checking if clustering variables successfully differentiate between groups - Appropriateness of used cluster variables

fit <- aov(cbind(Trainability.Level, Energy.Level, Adaptability.Level, Affectionate.With.Family,  Shedding.Level) ~ as.factor(Group), 
           data = mydata)

summary(fit)
##  Response Trainability.Level :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 34.610 17.3048  33.784 3.048e-13 ***
## Residuals        186 95.274  0.5122                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Energy.Level :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 43.791 21.8956  61.689 < 2.2e-16 ***
## Residuals        186 66.018  0.3549                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Adaptability.Level :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 32.491 16.2455  76.954 < 2.2e-16 ***
## Residuals        186 39.266  0.2111                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Affectionate.With.Family :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 84.906  42.453  356.96 < 2.2e-16 ***
## Residuals        186 22.121   0.119                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Shedding.Level :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2  20.842 10.4212  17.009 1.645e-07 ***
## Residuals        186 113.962  0.6127                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We conducted analysis of variance (ANOVA). For each paragraph, for each variable, we check the hypotheses, following the same pattern as I will present in an example.

Let’s make an example for the first variable:

  • H0: Mean(Trainability.Level, Group1) = Mean(Trainability.Level, Group2) = Mean(Trainability.Level, Group3)
  • H1: At leas one arithmetic mean is different

We can reject H0 at p<0.001 for all of the variables and conclude that we do not have any bad cluster variables in our case and we can continue the analysis.

Criterion validity

I decided to continue my analysis with checking my groups also based on coat length. I wasn’t able to add it to my cluster variables, because variable is not numeric. I will use Pearson Chi-Squared Test (without Yates correlation).

chi_square <- chisq.test(mydata$Coat.Length, as.factor(mydata$Group))
chi_square
## 
##  Pearson's Chi-squared test
## 
## data:  mydata$Coat.Length and as.factor(mydata$Group)
## X-squared = 8.8433, df = 4, p-value = 0.06514

With p>0.05 we cannot reject H0 that there is no association between coat length and classification. We can conclude that there is no statistically significant difference in coat length between the clusters at the 5% level of significance.

addmargins(chi_square$observed)
##                   as.factor(mydata$Group)
## mydata$Coat.Length   1   2   3 Sum
##             Short   38  24  23  85
##             Medium  33  17  27  77
##             Long    19   2   6  27
##             Sum     90  43  56 189
addmargins(round(chi_square$expected, 2))
##                   as.factor(mydata$Group)
## mydata$Coat.Length     1     2     3    Sum
##             Short  40.48 19.34 25.19  85.01
##             Medium 36.67 17.52 22.81  77.00
##             Long   12.86  6.14  8.00  27.00
##             Sum    90.01 43.00 56.00 189.01
round(chi_square$res, 2)
##                   as.factor(mydata$Group)
## mydata$Coat.Length     1     2     3
##             Short  -0.39  1.06 -0.44
##             Medium -0.61 -0.12  0.88
##             Long    1.71 -1.67 -0.71
library(effectsize)
effectsize::cramers_v(mydata$Coat.Length, mydata$Group)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.11              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

This would be my next steps if I was able to reject H0 at p<0.05, but I was not. So therefore I cannot use or interpret the steps.

Since validation was unsuccessful, I will try to use another variable - Barking level.

aggregate(mydata$Barking.Level, 
          by = list(mydata$Group), 
          FUN = mean)
##   Group.1        x
## 1       1 3.055556
## 2       2 2.860465
## 3       3 3.410714
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(mydata$Barking.Level, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  1.2774 0.2812
##       186

We cannot reject H0 at p>0.05 that the variances of barking level are equal in all three groups.

library(dplyr)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
## 
## Attaching package: 'rstatix'
## The following objects are masked from 'package:effectsize':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:stats':
## 
##     filter
mydata %>%
  group_by(Group) %>%
  shapiro_test(Barking.Level)
## # A tibble: 3 × 4
##   Group variable      statistic           p
##   <int> <chr>             <dbl>       <dbl>
## 1     1 Barking.Level     0.882 0.000000715
## 2     2 Barking.Level     0.880 0.000335   
## 3     3 Barking.Level     0.786 0.000000128

We reject H0 that variable is normally distributed in each of the groups, with p<0.001 for all three groups. Since normality is violated, we cannot conduct the ANOVA, therefore we do the Kruskal Wallis test:

kruskal.test(Barking.Level ~ Group, 
             data = mydata)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Barking.Level by Group
## Kruskal-Wallis chi-squared = 6.0004, df = 2, p-value = 0.04978
kruskal_effsize(Barking.Level ~ Group, 
                data = mydata)
## # A tibble: 1 × 5
##   .y.               n effsize method  magnitude
## * <chr>         <int>   <dbl> <chr>   <ord>    
## 1 Barking.Level   189  0.0215 eta2[H] small

At p=0.05 we can reject H0 that location distribution of variable Barking.Level is same in all three groups and therefore we can validate our clustering with criterion validity and conclude that there is statistically significant difference in barking level between the clusters, in our case small differences based on the effect size.

4. Conclusion

I clustered 189 dog breeds into three cluster, based on 5 standardized variables.

  • Cluster 1: The Low-Energy, Adaptable Companion Breeds: This cluster contains 89 breeds (47%), making it the largest group. Breeds in this cluster tend to have above-average adaptability and affection levels, though still lower than those in Cluster 3. They are characterized by having the lowest average energy and shedding levels, making them relatively low-maintenance. Their trainability and barking levels are also below average, but not as low as those in Cluster 2. This cluster includes many calm, adaptable, and affectionate companion breeds.

  • Cluster 2: The Low-Scoring, Independent Breeds: This is the smallest cluster, containing 45 breeds (23%). Breeds in this group generally score below average across all variables, especially in adaptability, affection, and trainability levels, making them the least people-oriented and hardest to train. However, they have slightly higher energy and shedding levels than Cluster 1. This group also includes breeds with the lowest barking levels, meaning they are relatively quiet. These breeds tend to be more independent, less affectionate, and less adaptable.

  • Cluster 3: The High-Energy, High-Maintenance Breeds: This cluster consists of 57 breeds (30%). Breeds in this group score above average in all five variables, with the highest trainability, energy, adaptability, affection, and shedding levels of all clusters. This means they are highly trainable, full of energy, adaptable to different environments, and very affectionate with family members. However, they also have the highest shedding and barking levels, meaning they require more maintenance and attention.