Dataset: Dog Breeds
Source: Keggle: https://www.kaggle.com/datasets/sujaykapadnis/dog-breeds?select=breed_traits.csv
#Importing the data
mydata <- read.table("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/MVA/MVA HW 2/breed_traits.csv", header=TRUE, sep=",")
head(mydata)
## Breed Affectionate.With.Family
## 1 Retrievers (Labrador) 5
## 2 French Bulldogs 5
## 3 German Shepherd Dogs 5
## 4 Retrievers (Golden) 5
## 5 Bulldogs 4
## 6 Poodles 5
## Good.With.Young.Children Good.With.Other.Dogs Shedding.Level
## 1 5 5 4
## 2 5 4 3
## 3 5 3 4
## 4 5 5 4
## 5 3 3 3
## 6 5 3 1
## Coat.Grooming.Frequency Drooling.Level Coat.Type Coat.Length
## 1 2 2 Double Short
## 2 1 3 Smooth Short
## 3 2 2 Double Medium
## 4 2 2 Double Medium
## 5 3 3 Smooth Short
## 6 4 1 Curly Long
## Openness.To.Strangers Playfulness.Level Watchdog.Protective.Nature
## 1 5 5 3
## 2 5 5 3
## 3 3 4 5
## 4 5 4 3
## 5 4 4 3
## 6 5 5 5
## Adaptability.Level Trainability.Level Energy.Level Barking.Level
## 1 5 5 5 3
## 2 5 4 3 1
## 3 5 5 5 3
## 4 5 5 3 1
## 5 3 4 3 2
## 6 4 5 4 4
## Mental.Stimulation.Needs
## 1 4
## 2 3
## 3 5
## 4 4
## 5 3
## 6 5
Unit of observation is one dog breed.
Initial sample size: 195 dog breeds
Description of all variables before the data manipulation:
Breed: Name of the dog breed
Affectionate With Family: Level of affection towards family (1–5)
Good With Young Children: Compatibility with young children (1–5)
Good With Other Dogs: Friendliness towards other dogs (1–5)
Shedding Level: Amount of shedding (1–5)
Coat Grooming: Frequency Frequency of grooming needed (1–5)
Drooling Level: Tendency to drool (1–5)
Coat Type: Type of coat (e.g., curly, smooth, wiry)
Coat Length: Length of the coat (e.g., short, medium)
Openness To Strangers: Friendliness towards strangers (1–5)
Playfulness Level: Playfulness of the breed (1–5)
Watchdog/Protective Nature: Level of protectiveness (1–5)
Adaptability Level: Ability to adapt to environments (1–5)
Trainability Level: Ease of training (1–5)
Energy Level: Activity and energy levels (1–5)
Barking Level: Tendency to bark (1–5)
Mental Stimulation Needs: Need for mental challenges (1–5)
summary(mydata)
## Breed Affectionate.With.Family Good.With.Young.Children
## Length:195 Min. :0.000 Min. :0.000
## Class :character 1st Qu.:4.000 1st Qu.:3.000
## Mode :character Median :5.000 Median :3.000
## Mean :4.477 Mean :3.867
## 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000
## Good.With.Other.Dogs Shedding.Level Coat.Grooming.Frequency
## Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:3.000 1st Qu.:2.00 1st Qu.:2.000
## Median :3.000 Median :3.00 Median :2.000
## Mean :3.513 Mean :2.59 Mean :2.277
## 3rd Qu.:4.000 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Drooling.Level Coat.Type Coat.Length
## Min. :0.00 Length:195 Length:195
## 1st Qu.:1.00 Class :character Class :character
## Median :2.00 Mode :character Mode :character
## Mean :1.79
## 3rd Qu.:2.00
## Max. :5.00
## Openness.To.Strangers Playfulness.Level Watchdog.Protective.Nature
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.467 Mean :3.631 Mean :3.718
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
## Adaptability.Level Trainability.Level Energy.Level
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000
## Mean :3.774 Mean :3.846 Mean :3.723
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
## Barking.Level Mental.Stimulation.Needs
## Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000
## Median :3.000 Median :4.000
## Mean :3.118 Mean :3.662
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000
We can see that we need to do some data adjustments (removing the n/a-s, factoring…).
mydata$Coat.Type <- factor(mydata$Coat.Type,
levels = c("Double","Curly", "Smooth", "Wiry", "Silky", "Hairless", "Corded", "Wavy"),
labels = c("Double","Curly", "Smooth", "Wiry", "Silky", "Hairless", "Corded", "Wavy"))
mydata$Coat.Length <- factor(mydata$Coat.Length,
levels = c("Short","Medium", "Long"),
labels = c("Short","Medium", "Long"))
library(tidyr)
mydata <- drop_na(mydata)
We dropped 4 units due to drop_na function. Let’s check how the data looks now:
summary(mydata)
## Breed Affectionate.With.Family Good.With.Young.Children
## Length:191 Min. :1.000 Min. :1.00
## Class :character 1st Qu.:4.000 1st Qu.:3.00
## Mode :character Median :5.000 Median :3.00
## Mean :4.508 Mean :3.89
## 3rd Qu.:5.000 3rd Qu.:5.00
## Max. :5.000 Max. :5.00
##
## Good.With.Other.Dogs Shedding.Level Coat.Grooming.Frequency
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :2.000
## Mean :3.529 Mean :2.607 Mean :2.283
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000
##
## Drooling.Level Coat.Type Coat.Length Openness.To.Strangers
## Min. :1.000 Double :66 Short :86 Min. :1.000
## 1st Qu.:1.000 Smooth :66 Medium:77 1st Qu.:3.000
## Median :2.000 Wiry :30 Long :28 Median :3.000
## Mean :1.791 Silky : 9 Mean :3.482
## 3rd Qu.:2.000 Curly : 7 3rd Qu.:4.000
## Max. :5.000 Wavy : 6 Max. :5.000
## (Other): 7
## Playfulness.Level Watchdog.Protective.Nature Adaptability.Level
## Min. :2.000 Min. :1.000 Min. :3.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000
## Mean :3.654 Mean :3.738 Mean :3.796
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
##
## Trainability.Level Energy.Level Barking.Level
## Min. :1.000 Min. :2.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :3.000
## Mean :3.869 Mean :3.743 Mean :3.115
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
##
## Mental.Stimulation.Needs
## Min. :3.000
## 1st Qu.:3.000
## Median :4.000
## Mean :3.686
## 3rd Qu.:4.000
## Max. :5.000
##
We can see this already looks much better.
Interpretations:
There are 77 breeds categorized with medium length of coat
Minimal level of playfulness observed in a dog breed was 2.
Average trainability level of breeds in the sample was 3.869.
There are 7 breeds in the sample with curly coat type.
After these adjustments, we will start with our clustering part of the homework.
Variables I picked for further analysis are the following, as already mentioned in the research question:
#Saving standardized cluster variables into new data frame and standardization of cluster variables
mydata_clu_std <- as.data.frame(scale(mydata[c("Trainability.Level", "Energy.Level", "Adaptability.Level", "Affectionate.With.Family", "Shedding.Level")]))
head(mydata_clu_std)
## Trainability.Level Energy.Level Adaptability.Level
## 1 1.309392 1.6482097 1.9422309
## 2 0.151550 -0.9751908 1.9422309
## 3 1.309392 1.6482097 1.9422309
## 4 1.309392 -0.9751908 1.9422309
## 5 0.151550 -0.9751908 -1.2835613
## 6 1.309392 0.3365095 0.3293348
## Affectionate.With.Family Shedding.Level
## 1 0.6148339 1.6367984
## 2 0.6148339 0.4615033
## 3 0.6148339 1.6367984
## 4 0.6148339 1.6367984
## 5 -0.6344563 0.4615033
## 6 0.6148339 -1.8890868
mydata$Dissimilarity = sqrt(mydata_clu_std$Trainability.Level^2 + mydata_clu_std$Energy.Level^2 + mydata_clu_std$Adaptability.Level^2 + mydata_clu_std$Affectionate.With.Family^2 + mydata_clu_std$Shedding.Level^2 )
print(head(mydata[order(-mydata$Dissimilarity), c("Dissimilarity", "Breed")], 10))
## Dissimilarity Breed
## 85 5.167183 Anatolian Shepherd Dogs
## 121 4.462448 Afghan Hounds
## 36 3.456591 Basset Hounds
## 99 3.413431 Neapolitan Mastiffs
## 120 3.406517 Beaucerons
## 157 3.380856 Spaniels (American Water)
## 189 3.367492 Azawakhs
## 1 3.355666 Retrievers (Labrador)
## 3 3.355666 German Shepherd Dogs
## 62 3.275408 Airedale Terriers
Dissimilarity tells us how different each breed is from the rest of the breeds, where higher value means more different. We can see that we have highest dissimilarity (meaning most evident differences) with breeds Anatolian Shepherd Dogs and Afghan Hounds that could be possible outliers, therefore I decided to remove them.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#mydata <- mydata %>%
#filter(!Breed %in% c("Anatolian Shepherd Dogs", "Afghan Hounds"))
Since this code isn’t working for me (I do not get any errors, but it just does not delete selected rows after I run the code) I left it here to example what I would normally use. But because of this situation I tried a new code for deleting with help of row numbers:
mydata <- mydata %>%
slice(c(-85,-121))
Now we successfully have 189 observations to continue with.
mydata$ID <- seq(1, nrow(mydata)) #Creating a sequence again
mydata_clu_std <- as.data.frame(scale(mydata[c("Trainability.Level", "Energy.Level", "Adaptability.Level", "Affectionate.With.Family", "Shedding.Level")])) #Scaling the data again
rownames(mydata_clu_std) <- mydata$Breed #Renaming rows with breed names
#Graphical presentation of dissimilarity matrix
#install.packages("factoextra")
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distance <- get_dist(mydata_clu_std,
method = "euclidian")
fviz_dist(Distance,
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
This is a dissimilarity matrix with size 189 x 189. We see groups forming around diagonal, therefore data is clusterable, but I do not see any bigger groups forming which might not be the best sign. I expect Hopkins statistics to be above 0.5, but not too highly above.
library(factoextra)
get_clust_tendency(mydata_clu_std,
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.5295679
##
## $plot
## NULL
Hopkins statistics is successfully above 0.5 and is currently at level 0.53.
With these first steps we confirmed the first part of our analysis, answering the question “Is data clusterable?” We answered the question by fulfilling the following objectives:
We can conclude that data is clusterable, although it is not the best since some criteria is barely passing, and move on to answering the question about how many clusters to create based on the given data.
library(dplyr)
library(factoextra)
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 189
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none"
## instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at
## <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning
## was generated.
By this method, I would conclude it makes most sense to make 3 clusters. But let’s also check other methods.
#install.packages("BiocManager")
#BiocManager::install("NbClust")
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
The elbow method in our case suggests 3, 6 or 8 clusters.
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")
Based on this method we try to maximize silhouette value through observing peaks. The highest one is suggesting 10 clusters.
library(NbClust)
# Determine the optimal number of clusters
nc <- NbClust(mydata_clu_std, distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 5 proposed 9 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Based on this I will continue with 3 clusters.
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
Clustering
## K-means clustering with 3 clusters of sizes 90, 43, 56
##
## Cluster means:
## Trainability.Level Energy.Level Adaptability.Level
## 1 -0.3271923 -0.5254591 0.1190422
## 2 -0.3483321 -0.1241318 -1.1511861
## 3 0.7933140 0.9398033 0.6926286
## Affectionate.With.Family Shedding.Level
## 1 0.4256575 -0.33116256
## 2 -1.6329290 -0.06568033
## 3 0.5697639 0.58265865
##
## Clustering vector:
## Retrievers (Labrador)
## 3
## French Bulldogs
## 1
## German Shepherd Dogs
## 3
## Retrievers (Golden)
## 3
## Bulldogs
## 2
## Poodles
## 1
## Beagles
## 2
## Rottweilers
## 1
## Pointers (German Shorthaired)
## 3
## Dachshunds
## 1
## Pembroke Welsh Corgis
## 3
## Australian Shepherds
## 2
## Yorkshire Terriers
## 1
## Boxers
## 2
## Great Danes
## 1
## Siberian Huskies
## 3
## Cavalier King Charles Spaniels
## 1
## Doberman Pinschers
## 3
## Miniature Schnauzers
## 3
## Shih Tzu
## 1
## Boston Terriers
## 1
## Bernese Mountain Dogs
## 3
## Pomeranians
## 1
## Havanese
## 1
## Cane Corso
## 2
## Spaniels (English Springer)
## 3
## Shetland Sheepdogs
## 3
## Brittanys
## 2
## Pugs
## 3
## Spaniels (Cocker)
## 3
## Miniature American Shepherds
## 3
## Border Collies
## 3
## Mastiffs
## 1
## Chihuahuas
## 1
## Vizslas
## 3
## Basset Hounds
## 2
## Belgian Malinois
## 2
## Maltese
## 1
## Weimaraners
## 3
## Collies
## 1
## Newfoundlands
## 1
## Rhodesian Ridgebacks
## 1
## Shiba Inu
## 1
## West Highland White Terriers
## 1
## Bichons Frises
## 1
## Bloodhounds
## 2
## Spaniels (English Cocker)
## 1
## Akitas
## 2
## Portuguese Water Dogs
## 3
## Retrievers (Chesapeake Bay)
## 3
## Dalmatians
## 3
## St. Bernards
## 1
## Papillons
## 3
## Australian Cattle Dogs
## 2
## Bullmastiffs
## 2
## Samoyeds
## 3
## Scottish Terriers
## 1
## Soft Coated Wheaten Terriers
## 1
## Whippets
## 1
## Pointers (German Wirehaired)
## 3
## Chinese Shar-Pei
## 1
## Airedale Terriers
## 2
## Wirehaired Pointing Griffons
## 3
## Bull Terriers
## 2
## Alaskan Malamutes
## 2
## Cardigan Welsh Corgis
## 2
## Giant Schnauzers
## 3
## Old English Sheepdogs
## 1
## Italian Greyhounds
## 1
## Great Pyrenees
## 1
## Dogues de Bordeaux
## 1
## Russell Terriers
## 3
## Cairn Terriers
## 2
## Irish Wolfhounds
## 1
## Setters (Irish)
## 3
## Greater Swiss Mountain Dogs
## 1
## Miniature Pinschers
## 3
## Lhasa Apsos
## 1
## Chinese Crested
## 1
## Coton de Tulear
## 1
## Staffordshire Bull Terriers
## 3
## American Staffordshire Terriers
## 1
## Rat Terriers
## 3
## Chow Chows
## 2
## Basenjis
## 2
## Spaniels (Boykin)
## 2
## Lagotti Romagnoli
## 1
## Brussels Griffons
## 1
## Retrievers (Nova Scotia Duck Tolling)
## 3
## Norwegian Elkhounds
## 3
## Standard Schnauzers
## 1
## Dogo Argentinos
## 3
## Pekingese
## 1
## Keeshonden
## 3
## Border Terriers
## 1
## Leonbergers
## 3
## Tibetan Terriers
## 1
## Neapolitan Mastiffs
## 2
## Setters (English)
## 1
## Retrievers (Flat-Coated)
## 3
## Borzois
## 2
## Fox Terriers (Wire)
## 1
## Miniature Bull Terriers
## 1
## Belgian Tervuren
## 2
## Setters (Gordon)
## 3
## Silky Terriers
## 1
## Norwich Terriers
## 1
## Spinoni Italiani
## 1
## Japanese Chin
## 1
## Welsh Terriers
## 1
## Toy Fox Terriers
## 3
## Schipperkes
## 1
## Parson Russell Terriers
## 3
## Pointers
## 3
## Belgian Sheepdogs
## 2
## Tibetan Spaniels
## 1
## American Eskimo Dogs
## 3
## Irish Terriers
## 1
## Beaucerons
## 2
## Boerboels
## 1
## Fox Terriers (Smooth)
## 1
## Bearded Collies
## 1
## Black Russian Terriers
## 2
## Black and Tan Coonhounds
## 1
## Spaniels (Welsh Springer)
## 3
## American Hairless Terriers
## 1
## Norfolk Terriers
## 1
## Xoloitzcuintli
## 1
## Manchester Terriers
## 1
## Kerry Blue Terriers
## 1
## Australian Terriers
## 2
## Spaniels (Clumber)
## 1
## Lakeland Terriers
## 1
## Bluetick Coonhounds
## 2
## English Toy Spaniels
## 1
## German Pinschers
## 3
## Tibetan Mastiffs
## 2
## Bedlington Terriers
## 2
## Greyhounds
## 1
## Pulik
## 1
## Salukis
## 1
## Barbets
## 2
## Redbone Coonhounds
## 1
## Swedish Vallhunds
## 3
## Sealyham Terriers
## 1
## Spanish Water Dogs
## 1
## Briards
## 2
## Berger Picards
## 2
## Entlebucher Mountain Dogs
## 3
## Treeing Walker Coonhounds
## 3
## Icelandic Sheepdogs
## 3
## Wirehaired Vizslas
## 3
## Pumik
## 3
## Portuguese Podengo Pequenos
## 1
## Spaniels (American Water)
## 2
## Retrievers (Curly-Coated)
## 1
## Spaniels (Field)
## 1
## Lowchen
## 1
## Nederlandse Kooikerhondjes
## 3
## Affenpinschers
## 2
## Finnish Lapphunds
## 1
## Scottish Deerhounds
## 1
## Norwegian Buhunds
## 1
## Glen of Imaal Terriers
## 1
## Setters (Irish Red and White)
## 3
## Ibizan Hounds
## 1
## Spaniels (Sussex)
## 1
## Bergamasco Sheepdogs
## 2
## Spaniels (Irish Water)
## 1
## Polish Lowland Sheepdogs
## 1
## Kuvaszok
## 1
## Komondorok
## 1
## Cirnechi dell’Etna
## 1
## Pharaoh Hounds
## 1
## Dandie Dinmont Terriers
## 2
## Pyrenean Shepherds
## 3
## Skye Terriers
## 1
## Canaan Dogs
## 2
## American English Coonhounds
## 2
## Chinooks
## 1
## Finnish Spitz
## 3
## Grand Basset Griffon Vendeens
## 3
## Sloughis
## 2
## Harriers
## 3
## Cesky Terriers
## 1
## American Foxhounds
## 2
## Azawakhs
## 2
## English Foxhounds
## 3
## Norwegian Lundehunds
## 2
##
## Within cluster sum of squares by cluster:
## [1] 260.0833 149.2129 142.2987
## (between_SS / total_SS = 41.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
We can observe that the best raiton between SSbetween and SStotal we can achieve with this data is 41.3%.
Interpretation:
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
We can observe 3 clusters that include 90, 43 and 56 breeds.
We can see that there is some overlap between clusters. This happens because they are measured from multiple dimensions and we only present them in two, therefore it seems like they are overlaping.
We can also observe from the graph that there is cca 60.9% of total information shown with these two dimensions that we can observe in the graph.
Since there are not any specific outliers seen that I would like to drop, I will continue with my analysis.
Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
## Trainability.Level Energy.Level Adaptability.Level
## 1 -0.3271923 -0.5254591 0.1190422
## 2 -0.3483321 -0.1241318 -1.1511861
## 3 0.7933140 0.9398033 0.6926286
## Affectionate.With.Family Shedding.Level
## 1 0.4256575 -0.33116256
## 2 -1.6329290 -0.06568033
## 3 0.5697639 0.58265865
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Trainability.Level", "Energy.Level", "Adaptability.Level", "Affectionate.With.Family", "Shedding.Level"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("Trainability.Level", "Energy.Level", "Adaptability.Level", "Affectionate.With.Family", "Shedding.Level"),
labels = c("Trainability.Level", "Energy.Level", "Adaptability.Level", "Affectionate.With.Family", "Shedding.Level"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
This is now main graph to describe our segments. Based on this we see that Group 3 is the one above average in all of the analyzed characteristics and the worst is Group 2, which is bellow average in all of the checked variables. Group 1 is above average in adaptability and affection with family.
But, we must be careful with shedding level variable since it is not defined well in the description of variables and above average shedding level could mean more shedding than average, which could possibly not be a big advantage of the cluster.
mydata$Group <- Clustering$cluster
#Checking if clustering variables successfully differentiate between groups - Appropriateness of used cluster variables
fit <- aov(cbind(Trainability.Level, Energy.Level, Adaptability.Level, Affectionate.With.Family, Shedding.Level) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response Trainability.Level :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 34.610 17.3048 33.784 3.048e-13 ***
## Residuals 186 95.274 0.5122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Energy.Level :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 43.791 21.8956 61.689 < 2.2e-16 ***
## Residuals 186 66.018 0.3549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Adaptability.Level :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 32.491 16.2455 76.954 < 2.2e-16 ***
## Residuals 186 39.266 0.2111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Affectionate.With.Family :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 84.906 42.453 356.96 < 2.2e-16 ***
## Residuals 186 22.121 0.119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Shedding.Level :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 20.842 10.4212 17.009 1.645e-07 ***
## Residuals 186 113.962 0.6127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We conducted analysis of variance (ANOVA). For each paragraph, for each variable, we check the hypotheses, following the same pattern as I will present in an example.
Let’s make an example for the first variable:
We can reject H0 at p<0.001 for all of the variables and conclude that we do not have any bad cluster variables in our case and we can continue the analysis.
I decided to continue my analysis with checking my groups also based on coat length. I wasn’t able to add it to my cluster variables, because variable is not numeric. I will use Pearson Chi-Squared Test (without Yates correlation).
chi_square <- chisq.test(mydata$Coat.Length, as.factor(mydata$Group))
chi_square
##
## Pearson's Chi-squared test
##
## data: mydata$Coat.Length and as.factor(mydata$Group)
## X-squared = 8.8433, df = 4, p-value = 0.06514
With p>0.05 we cannot reject H0 that there is no association between coat length and classification. We can conclude that there is no statistically significant difference in coat length between the clusters at the 5% level of significance.
addmargins(chi_square$observed)
## as.factor(mydata$Group)
## mydata$Coat.Length 1 2 3 Sum
## Short 38 24 23 85
## Medium 33 17 27 77
## Long 19 2 6 27
## Sum 90 43 56 189
addmargins(round(chi_square$expected, 2))
## as.factor(mydata$Group)
## mydata$Coat.Length 1 2 3 Sum
## Short 40.48 19.34 25.19 85.01
## Medium 36.67 17.52 22.81 77.00
## Long 12.86 6.14 8.00 27.00
## Sum 90.01 43.00 56.00 189.01
round(chi_square$res, 2)
## as.factor(mydata$Group)
## mydata$Coat.Length 1 2 3
## Short -0.39 1.06 -0.44
## Medium -0.61 -0.12 0.88
## Long 1.71 -1.67 -0.71
library(effectsize)
effectsize::cramers_v(mydata$Coat.Length, mydata$Group)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.11 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
This would be my next steps if I was able to reject H0 at p<0.05, but I was not. So therefore I cannot use or interpret the steps.
Since validation was unsuccessful, I will try to use another variable - Barking level.
aggregate(mydata$Barking.Level,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 3.055556
## 2 2 2.860465
## 3 3 3.410714
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$Barking.Level, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.2774 0.2812
## 186
We cannot reject H0 at p>0.05 that the variances of barking level are equal in all three groups.
library(dplyr)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following objects are masked from 'package:effectsize':
##
## cohens_d, eta_squared
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(Group) %>%
shapiro_test(Barking.Level)
## # A tibble: 3 × 4
## Group variable statistic p
## <int> <chr> <dbl> <dbl>
## 1 1 Barking.Level 0.882 0.000000715
## 2 2 Barking.Level 0.880 0.000335
## 3 3 Barking.Level 0.786 0.000000128
We reject H0 that variable is normally distributed in each of the groups, with p<0.001 for all three groups. Since normality is violated, we cannot conduct the ANOVA, therefore we do the Kruskal Wallis test:
kruskal.test(Barking.Level ~ Group,
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Barking.Level by Group
## Kruskal-Wallis chi-squared = 6.0004, df = 2, p-value = 0.04978
kruskal_effsize(Barking.Level ~ Group,
data = mydata)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 Barking.Level 189 0.0215 eta2[H] small
At p=0.05 we can reject H0 that location distribution of variable Barking.Level is same in all three groups and therefore we can validate our clustering with criterion validity and conclude that there is statistically significant difference in barking level between the clusters, in our case small differences based on the effect size.
I clustered 189 dog breeds into three cluster, based on 5 standardized variables.
Cluster 1: The Low-Energy, Adaptable Companion Breeds: This cluster contains 89 breeds (47%), making it the largest group. Breeds in this cluster tend to have above-average adaptability and affection levels, though still lower than those in Cluster 3. They are characterized by having the lowest average energy and shedding levels, making them relatively low-maintenance. Their trainability and barking levels are also below average, but not as low as those in Cluster 2. This cluster includes many calm, adaptable, and affectionate companion breeds.
Cluster 2: The Low-Scoring, Independent Breeds: This is the smallest cluster, containing 45 breeds (23%). Breeds in this group generally score below average across all variables, especially in adaptability, affection, and trainability levels, making them the least people-oriented and hardest to train. However, they have slightly higher energy and shedding levels than Cluster 1. This group also includes breeds with the lowest barking levels, meaning they are relatively quiet. These breeds tend to be more independent, less affectionate, and less adaptable.
Cluster 3: The High-Energy, High-Maintenance Breeds: This cluster consists of 57 breeds (30%). Breeds in this group score above average in all five variables, with the highest trainability, energy, adaptability, affection, and shedding levels of all clusters. This means they are highly trainable, full of energy, adaptable to different environments, and very affectionate with family members. However, they also have the highest shedding and barking levels, meaning they require more maintenance and attention.