# clear-up the environment
rm(list = ls())
# chunk options
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)This article is made for completing the assignment for Algoritma : Machine Learning Course in Unsupervised Learning Theia Batch 2022.
In this article we will try to make a cluster of dog breeds based on several measures commonly used by aspiring dog owners for choosing their best suited dog in their criteria. Note that every dog is a unique creature and they grow into unique dogs also in their adult life so this data is a generalization but it can be used as a prediction of how a dog will become later in their life based on genetic traits and the purpose of certain dog breeds is used in human life.
This dataset is taken from fmfilho github This data set contains information about 277 breeds and was extracted from the American Kennel Club website. All rights to the information contained here belong to them. This was created as a pastime project during the COVID-19 quarantine and is intended to be used for statistical analysis, as well as machine learning and natural language processing projects. Information about 277 dog breeds extracted from the American Kennel Club website using Beautiful Soup.
This article is dedicated to my furry friend Kuro. He’s a Dalmatian + Labrador Mix with a very gentle and sweet personality to humans but a fierce one to other dogs, a basic Alpha Dog trait :)
Meet Kuro the Dalmador (Dalmatian + Labrador)
Him and Me
options(scipen = 99) #non active science annotation (example : e-1)
library(dplyr)
library(tidyr)
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
library(tibble) dogs <- read.csv("data-input/akc-data-latest.csv")glimpse(dogs)#> Rows: 277
#> Columns: 21
#> $ X <chr> "Affenpinscher", "Afghan Hound", "Airedale…
#> $ description <chr> "The Affen’s apish look has been described…
#> $ temperament <chr> "Confident, Famously Funny, Fearless", "Di…
#> $ popularity <chr> "148", "113", "60", "47", "58", "", "175",…
#> $ min_height <dbl> 22.86, 63.50, 58.42, 60.96, 58.42, 50.80, …
#> $ max_height <dbl> 29.21, 68.58, 58.42, 71.12, 63.50, 63.50, …
#> $ min_weight <dbl> 3.175147, 22.679619, 22.679619, 31.751466,…
#> $ max_weight <dbl> 4.535924, 27.215542, 31.751466, 58.967008,…
#> $ min_expectancy <dbl> 12, 12, 11, 10, 10, 10, 11, 13, 11, 14, 12…
#> $ max_expectancy <dbl> 15, 15, 14, 13, 14, 12, 12, 15, 13, 16, 15…
#> $ group <chr> "Toy Group", "Hound Group", "Terrier Group…
#> $ grooming_frequency_value <dbl> 0.6, 0.8, 0.6, 0.8, 0.6, 0.2, 0.2, 0.4, 0.…
#> $ grooming_frequency_category <chr> "2-3 Times a Week Brushing", "Daily Brushi…
#> $ shedding_value <dbl> 0.6, 0.2, 0.4, 0.6, 0.6, 0.6, 0.4, 0.6, 0.…
#> $ shedding_category <chr> "Seasonal", "Infrequent", "Occasional", "S…
#> $ energy_level_value <dbl> 0.6, 0.8, 0.6, 0.8, 0.8, 0.8, 0.8, 0.8, 0.…
#> $ energy_level_category <chr> "Regular Exercise", "Energetic", "Regular …
#> $ trainability_value <dbl> 0.8, 0.2, 1.0, 1.0, 0.4, 0.6, 0.6, 1.0, 0.…
#> $ trainability_category <chr> "Easy Training", "May be Stubborn", "Eager…
#> $ demeanor_value <dbl> 1.0, 0.2, 0.8, 0.6, 0.8, 0.6, 0.6, 1.0, 0.…
#> $ demeanor_category <chr> "Outgoing", "Aloof/Wary", "Friendly", "Ale…
description <- 1 to 3 paragraphs describing the
breed “temperament <- breed temperament described in
keywordspopularity <- popularity ranking (1-195)min_height <- minimum height in cmmax_height <- maximum height in cmmin_weight <- minimum weight in kgmax_weight <- maximum weight in kgmin_expectancy <- minimum life expectancy in
yearsmax_expectancy <- maximum life expectancy in
yearsgroup <-one of 9 breed groups assigned by the akc (7
main groups and 2 extra)grooming_frequency_value <- A number representing
the level of required groominggrooming_frequency_category <- A categorization of
grooming requirementsshedding_value <- A number representing the level of
sheddingshedding_category <- A categorization of shedding
frequencyenergy_level_value <- A number representing the
breed’s energy levelenergy_level_category <- A categorization of energy
leveltrainability_value <- A number representing the
breed’s trainabilitytrainability_category <- A categorization of
trainabilitydemeanor_value <- A number representing the breed’s
reaction to strangers and other petsdemeanor_category <- A categorization of reaction to
strangers and other petsBased on the glimpse above, we know that we won’t need several columns because we want as many as possible. So we will pick numeric values on several measures for better cluster modeling.
We also want to separate traits because the dataset only contains one column of traits with comma separated values on the cell.
# remove Several Columns
dogs_clean <-
dogs %>%
dplyr::rename(name = X) %>%
select(-c(description, grooming_frequency_category, shedding_category, energy_level_category, trainability_category, demeanor_category))#separate row on temprament column
dogs_clean <- separate_rows(dogs_clean, temperament, sep = ",")
# separate several traits from temperament
dogs_clean <-
dogs_clean %>%
group_by(name) %>%
mutate(n = row_number()) %>%
pivot_wider(names_from = "n",
names_prefix = "trait_",
values_from = "temperament") #checking the dataframe
head(dogs_clean)The separating values of the temperament column could be
unused in this article, but I will try to keep this step because someday
if I want to further analyze there will not be much work to start from
scratch again.
Before we further do analysis, we will check if there’s any missing value in our dataset.
#check if there's missing value
colSums(is.na(dogs_clean))#> name popularity min_height
#> 0 0 0
#> max_height min_weight max_weight
#> 0 2 2
#> min_expectancy max_expectancy group
#> 3 3 0
#> grooming_frequency_value shedding_value energy_level_value
#> 7 20 6
#> trainability_value demeanor_value trait_1
#> 24 25 0
#> trait_2 trait_3
#> 2 3
Apparently, there’s many missing value in our dataset. This may be caused by the website script being updated and the Beautiful Soup for web scraping cannot handle updates of the web dataset. For the sake of this analysis and my furry friend, I will try to insert the right value from the web to each missing value in this dataset. (and I also enjoy surfing the web about dogs)
#check missing value in minimal weight and max weight
dogs_clean[rowSums(is.na(dogs_clean[,c(1,5,6)])) > 0,c(1,5,6)]#https://en.wikipedia.org/wiki/Australian_Kelpie
dogs_clean[dogs_clean$name == "Australian Kelpie", "min_weight"] <- 13
dogs_clean[dogs_clean$name == "Australian Kelpie", "max_weight"] <- 19
#https://en.wikipedia.org/wiki/Spinone_Italiano
dogs_clean[dogs_clean$name == "Spinone Italiano", "min_weight"] <- 32
dogs_clean[dogs_clean$name == "Spinone Italiano", "max_weight"] <- 37#check if there is 0 value in minimal weight and max weight
dogs_clean[dogs_clean$min_weight == 0 & dogs_clean$max_weight == 0,]#https://www.britannica.com/animal/Cane-Corso
dogs_clean[dogs_clean$name == "Cane Corso", "min_weight"] <- 41
dogs_clean[dogs_clean$name == "Cane Corso", "max_weight"] <- 50#check missing value in minimal and maximal life expectancy
dogs_clean[rowSums(is.na(dogs_clean[,c(1,7,8)])) > 0,c(1,7,8)]#https://www.wisdompanel.com/en-us/dog-breeds/braque-francais-type-pyrenees
dogs_clean[dogs_clean$name == "Braque Francais Pyrenean", "min_expectancy"] <- 12
dogs_clean[dogs_clean$name == "Braque Francais Pyrenean", "max_expectancy"] <- 14
#https://www.dailypaws.com/dogs-puppies/dog-breeds/carolina-dog
dogs_clean[dogs_clean$name == "Carolina Dog", "min_expectancy"] <- 12
dogs_clean[dogs_clean$name == "Carolina Dog", "max_expectancy"] <- 15
#https://en.wikipedia.org/wiki/Central_Asian_Shepherd_Dog
dogs_clean[dogs_clean$name == "Central Asian Shepherd Dog", "min_expectancy"] <- 12
dogs_clean[dogs_clean$name == "Central Asian Shepherd Dog", "max_expectancy"] <- 15#check if there is 0 value in minimal and maximal life expectancy
dogs_clean[dogs_clean$min_expectancy == 0 & dogs_clean$max_expectancy == 0,]#https://www.akc.org/dog-breeds/pyrenean-shepherd/
dogs_clean[dogs_clean$name == "Pyrenean Shepherd", "min_expectancy"] <- 17
dogs_clean[dogs_clean$name == "Pyrenean Shepherd", "max_expectancy"] <- 19#check missing value in grooming frequency value and shedding value
dogs_clean[rowSums(is.na(dogs_clean[,c(1,10,11)])) > 0,c(1,10,11)]#https://www.akc.org/dog-breeds/american-hairless-terrier/
dogs_clean[dogs_clean$name == "American Hairless Terrier", "grooming_frequency_value"] <- 0.2
dogs_clean[dogs_clean$name == "American Hairless Terrier", "shedding_value"] <- 0.2
#https://hypoallergenicdog.net/are-bedlington-terriers-hypoallergenic-dogs/
dogs_clean[dogs_clean$name == "Bedlington Terrier", "shedding_value"] <- 0.6
#https://a-z-animals.com/animals/biewer-terrier/
dogs_clean[dogs_clean$name == "Biewer Terrier", "shedding_value"] <- 0.2
#https://www.akc.org/dog-breeds/braque-du-bourbonnais/
dogs_clean[dogs_clean$name == "Braque du Bourbonnais", "shedding_value"] <- 0.2
#https://www.akc.org/dog-breeds/catahoula-leopard-dog/
dogs_clean[dogs_clean$name == "Catahoula Leopard Dog", "grooming_frequency_value"] <- 0.4
dogs_clean[dogs_clean$name == "Catahoula Leopard Dog", "shedding_value"] <- 0.6
#https://www.akc.org/dog-breeds/deutscher-wachtelhund/
dogs_clean[dogs_clean$name == "Deutscher Wachtelhund", "shedding_value"] <- 0.4
#https://www.akc.org/dog-breeds/french-spaniel/
dogs_clean[dogs_clean$name == "French Spaniel", "grooming_frequency_value"] <- 0.2
dogs_clean[dogs_clean$name == "French Spaniel", "shedding_value"] <- 0.6
#https://www.akc.org/dog-breeds/german-longhaired-pointer/
dogs_clean[dogs_clean$name == "German Longhaired Pointer", "shedding_value"] <- 0.6
#https://www.akc.org/dog-breeds/giant-schnauzer/
dogs_clean[dogs_clean$name == "Giant Schnauzer", "shedding_value"] <- 0.6
#https://www.akc.org/dog-breeds/irish-red-and-white-setter/
dogs_clean[dogs_clean$name == "Irish Red and White Setter", "shedding_value"] <- 0.4
#https://www.akc.org/dog-breeds/irish-terrier/
dogs_clean[dogs_clean$name == "Irish Terrier", "shedding_value"] <- 0.4
#https://www.akc.org/dog-breeds/jagdterrier/
dogs_clean[dogs_clean$name == "Jagdterrier", "grooming_frequency_value"] <- 0.2
dogs_clean[dogs_clean$name == "Jagdterrier", "shedding_value"] <- 0.4
#https://www.akc.org/dog-breeds/kerry-blue-terrier/
dogs_clean[dogs_clean$name == "Kerry Blue Terrier", "shedding_value"] <- 0.2
#https://www.akc.org/dog-breeds/poodle-miniature/
dogs_clean[dogs_clean$name == "Poodle (Miniature)", "grooming_frequency_value"] <- 0.6
dogs_clean[dogs_clean$name == "Poodle (Miniature)", "shedding_value"] <- 0.2
#https://www.akc.org/dog-breeds/porcelaine/
dogs_clean[dogs_clean$name == "Porcelaine", "shedding_value"] <- 0.6
#https://www.akc.org/dog-breeds/shikoku/
dogs_clean[dogs_clean$name == "Shikoku", "shedding_value"] <- 0.6
#https://www.akc.org/dog-breeds/spanish-mastiff/
dogs_clean[dogs_clean$name == "Spanish Mastiff", "grooming_frequency_value"] <- 0.2
#https://www.akc.org/dog-breeds/spinone-italiano/
dogs_clean[dogs_clean$name == "Spinone Italiano", "shedding_value"] <- 0.6
#https://www.akc.org/dog-breeds/tornjak/
dogs_clean[dogs_clean$name == "Tornjak", "grooming_frequency_value"] <- 0.4
dogs_clean[dogs_clean$name == "Tornjak", "shedding_value"] <- 0.8
#https://www.akc.org/dog-breeds/transylvanian-hound/
dogs_clean[dogs_clean$name == "Transylvanian Hound", "shedding_value"] <- 0.6
#https://www.akc.org/dog-breeds/welsh-springer-spaniel/
dogs_clean[dogs_clean$name == "Welsh Springer Spaniel", "shedding_value"] <- 0.6#check missing value in energy level value, trainability value, and demeanor value
dogs_clean[rowSums(is.na(dogs_clean[,c(1,11,12,13)])) > 0,c(1,11,12,13)]#https://www.akc.org/dog-breeds/american-hairless-terrier/
dogs_clean[dogs_clean$name == "American Hairless Terrier", "energy_level_value"] <- 0.6
dogs_clean[dogs_clean$name == "American Hairless Terrier", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "American Hairless Terrier", "demeanor_value"] <- 0.6
#https://www.akc.org/dog-breeds/braque-francais-pyrenean/
dogs_clean[dogs_clean$name == "Braque Francais Pyrenean", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Braque Francais Pyrenean", "demeanor_value"] <- 0.6
#https://www.akc.org/dog-breeds/broholmer/
dogs_clean[dogs_clean$name == "Broholmer", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Broholmer", "demeanor_value"] <- 0.4
#https://www.akc.org/dog-breeds/danish-swedish-farmdog/
dogs_clean[dogs_clean$name == "Danish-Swedish Farmdog", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Danish-Swedish Farmdog", "demeanor_value"] <- 0.4
#https://www.akc.org/dog-breeds/deutscher-wachtelhund/
dogs_clean[dogs_clean$name == "Deutscher Wachtelhund", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Deutscher Wachtelhund", "demeanor_value"] <- 0.6
#https://www.akc.org/dog-breeds/dogo-argentino/
dogs_clean[dogs_clean$name == "Dogo Argentino", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "Dogo Argentino", "demeanor_value"] <- 0.4
#https://www.akc.org/dog-breeds/estrela-mountain-dog/
dogs_clean[dogs_clean$name == "Estrela Mountain Dog", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Estrela Mountain Dog", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/french-spaniel/
dogs_clean[dogs_clean$name == "French Spaniel", "energy_level_value"] <- 0.8
dogs_clean[dogs_clean$name == "French Spaniel", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "French Spaniel", "demeanor_value"] <- 0.6
#https://www.akc.org/dog-breeds/hokkaido/
dogs_clean[dogs_clean$name == "Hokkaido", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Hokkaido", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/jagdterrier/
dogs_clean[dogs_clean$name == "Jagdterrier", "energy_level_value"] <- 0.6
dogs_clean[dogs_clean$name == "Jagdterrier", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Jagdterrier", "demeanor_value"] <- 0.4
#https://www.akc.org/dog-breeds/hokkaido/
dogs_clean[dogs_clean$name == "Hokkaido", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Hokkaido", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/karelian-bear-dog/
dogs_clean[dogs_clean$name == "Karelian Bear Dog", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Karelian Bear Dog", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/norrbottenspets/
dogs_clean[dogs_clean$name == "Norrbottenspets", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Norrbottenspets", "demeanor_value"] <- 0.6
#https://www.akc.org/dog-breeds/perro-de-presa-canario/
dogs_clean[dogs_clean$name == "Perro de Presa Canario", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Perro de Presa Canario", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/poodle-miniature/
dogs_clean[dogs_clean$name == "Poodle (Miniature)", "energy_level_value"] <- 0.8
dogs_clean[dogs_clean$name == "Poodle (Miniature)", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "Poodle (Miniature)", "demeanor_value"] <- 0.6
#https://www.akc.org/dog-breeds/porcelaine/
dogs_clean[dogs_clean$name == "Porcelaine", "trainability_value"] <- 0.2
#https://www.akc.org/dog-breeds/portuguese-sheepdog/
dogs_clean[dogs_clean$name == "Portuguese Sheepdog", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "Portuguese Sheepdog", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/pudelpointer/
dogs_clean[dogs_clean$name == "Pudelpointer", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "Pudelpointer", "demeanor_value"] <- 0.6
#https://www.akc.org/dog-breeds/pyrenean-mastiff/
dogs_clean[dogs_clean$name == "Pyrenean Mastiff", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Pyrenean Mastiff", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/rafeiro-do-alentejo/
dogs_clean[dogs_clean$name == "Rafeiro do Alentejo", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "Rafeiro do Alentejo", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/segugio-italiano/
dogs_clean[dogs_clean$name == "Segugio Italiano", "energy_level_value"] <- 0.6
dogs_clean[dogs_clean$name == "Segugio Italiano", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/shikoku/
dogs_clean[dogs_clean$name == "Shikoku", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Shikoku", "demeanor_value"] <- 0.6
#https://www.akc.org/dog-breeds/slovensky-cuvac/
dogs_clean[dogs_clean$name == "Slovensky Cuvac", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Slovensky Cuvac", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/slovensky-kopov/
dogs_clean[dogs_clean$name == "Slovensky Kopov", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Slovensky Kopov", "demeanor_value"] <- 0.6
#https://www.akc.org/dog-breeds/spanish-mastiff/
dogs_clean[dogs_clean$name == "Spanish Mastiff", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Spanish Mastiff", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/tornjak/
dogs_clean[dogs_clean$name == "Tornjak", "energy_level_value"] <- 0.6
dogs_clean[dogs_clean$name == "Tornjak", "trainability_value"] <- 0.6
#https://www.akc.org/dog-breeds/transylvanian-hound/
dogs_clean[dogs_clean$name == "Transylvanian Hound", "demeanor_value"] <- 0.2
#https://www.akc.org/dog-breeds/treeing-tennessee-brindle/
dogs_clean[dogs_clean$name == "Treeing Tennessee Brindle", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Treeing Tennessee Brindle", "demeanor_value"] <- 0.6We already insert the numeric columns to the missing value in our dataset. We will further analyze the value in categorical columns. (For making the dataset futureproof)
#check if there's missing value
colSums(is.na(dogs_clean))#> name popularity min_height
#> 0 0 0
#> max_height min_weight max_weight
#> 0 0 0
#> min_expectancy max_expectancy group
#> 0 0 0
#> grooming_frequency_value shedding_value energy_level_value
#> 0 0 0
#> trainability_value demeanor_value trait_1
#> 0 0 0
#> trait_2 trait_3
#> 2 3
If there’s a missing value in trait columns, we can look
at the demeanor value to compare it and recheck the right value. Because
demeanor value represents the dog’s general trait, well
behaved or not.
#compared demeanor value to the trait
dogs_clean[rowSums(is.na(dogs_clean[,c(14,15,16)])) > 0,c(1,14,15,16)]#adding right value to missing value
dogs_clean[dogs_clean$name == "Slovakian Wirehaired Pointer", 15] <- "Smart"We want to look further to all the values in the trait columns
unique(dogs_clean$trait_1)#> [1] "Confident"
#> [2] "Dignified"
#> [3] "Friendly"
#> [4] "Courageous"
#> [5] "Affectionate"
#> [6] "Loyal"
#> [7] "Sweet"
#> [8] "Playful"
#> [9] "Independent"
#> [10] "Energetic"
#> [11] "Sociable"
#> [12] "Eager"
#> [13] "Agile"
#> [14] "Alert"
#> [15] "Smart"
#> [16] "Charming"
#> [17] "Gentle"
#> [18] "Bright"
#> [19] "Good-Natured"
#> [20] "Intelligent"
#> [21] "Easy-Going"
#> [22] "Fearless"
#> [23] "Clever"
#> [24] "Mellow"
#> [25] "Devoted"
#> [26] "Cheerful"
#> [27] "Adaptable"
#> [28] "Versatile"
#> [29] "Faithful"
#> [30] "Family-Oriented"
#> [31] "Active"
#> [32] "Bold"
#> [33] "Spirited"
#> [34] "Upbeat"
#> [35] "Even-Tempered"
#> [36] "Inquisitive"
#> [37] "Reserved"
#> [38] "Playful but also Work-Oriented. Very Active and Upbeat."
#> [39] "Happy"
unique(dogs_clean$trait_2)#> [1] " Famously Funny" " Profoundly Loyal" " Clever"
#> [4] " Dignified" " Loyal" " Self-Confident"
#> [7] " Mellow" " Perky" " Easy-Going"
#> [10] " Alert" " Energetic" " Smart"
#> [13] " Happy" " Independent" " Versatile"
#> [16] " Curious" " Work-Oriented" " Trainable"
#> [19] " Courageous" " Bright" " Cheerful"
#> [22] " Patient" " Bouncy" " Faithful"
#> [25] " Charming" " Watchful" " Sociable"
#> [28] " Good-Natured" " Calm" " Devoted"
#> [31] " Intelligent" " Fun-Loving" " Eager"
#> [34] " Adaptable" " Bold" " Gentle"
#> [37] " Adventurous" " Graceful" " Friendly"
#> [40] " Amusing" " Active" " Proud"
#> [43] " Fearless" " Even-Tempered" " Lively"
#> [46] " Merry" " Playful" " Optimistic"
#> [49] " Spirited" " Confident" " Family-Oriented"
#> [52] " Outgoing" " Dashing" " Hardworking"
#> [55] " Noble" " Agile" " Keen"
#> [58] " Mischievous" " Strong-Willed" " Amiable"
#> [61] " Brave" " Inquisitive" " Loving"
#> [64] " Quick" " Good-Tempered" NA
#> [67] " Responsive" " Reserved" " Sprightly"
unique(dogs_clean$trait_3)#> [1] " Fearless" " Aristocratic"
#> [3] " Courageous" " Profoundly Loyal"
#> [5] " Playful" NA
#> [7] " Sociable" " Smart"
#> [9] " Sweet-Tempered" " Curious"
#> [11] " Intelligent" " Good-Natured"
#> [13] " Charming" " Reserved"
#> [15] " Lively" " Pleasant"
#> [17] " Exuberant" " Comical"
#> [19] " Spirited" " Deeply Affectionate"
#> [21] " Sweet-Natured" " Poised"
#> [23] " Determined" " Low-Key"
#> [25] " Reserved with Strangers" " Merry"
#> [27] " Charismatic" " Obedient"
#> [29] " Frollicking" " Hardworking"
#> [31] " Serious-Minded" " Observant"
#> [33] " Strong" " Peppy"
#> [35] " Amusing" " Brave"
#> [37] " Powerful" " Inquisitive"
#> [39] " Tenacious" " Calm"
#> [41] " Alert and Intelligent" " Devoted"
#> [43] " Energetic" " Plucky"
#> [45] " Regally Dignified" " Strong-Willed"
#> [47] " Active" " Lovable"
#> [49] " Enthusiastic" " Gentle"
#> [51] " Willing to Please" " Faithful"
#> [53] " Upbeat" " Loving"
#> [55] " Mischievous" " Busy"
#> [57] " Vigilant" " Majestic"
#> [59] " Watchful" " Kind"
#> [61] " Graceful" " Self-Confident"
#> [63] " Family-Oriented" " Sensitive"
#> [65] " Sassy" " Independent"
#> [67] " Gentlemanly" " Happy"
#> [69] " Proud" " Happy-Go-Lucky"
#> [71] " Wickedly Smart" " Spunky"
#> [73] " Outgoing" " Attentive"
#> [75] " Alert" " Humble"
#> [77] " Athletic" " Responsive"
#> [79] " Agile" " Good-Humored"
#> [81] " Vivacious" " Trainable"
#> [83] " Bold" " Dependable"
#> [85] " Noble" " Regal"
#> [87] " People-Oriented" " Funny"
#> [89] " Polite" " Tenderhearted"
#> [91] " Keen" " Docile"
#> [93] " Sweet" " Undemanding"
#> [95] " Positive" " Keenly Observant"
#> [97] " Quick" " Fun-Loving"
#> [99] " Perceptive" " Boisterous"
#> [101] " Regal in Manner" " Even-Tempered"
#> [103] " Very Smart" " Home-Loving"
#> [105] " Ready to Work" " Eager to Please"
#> [107] " Confident Guardian" " Independent-Minded"
#> [109] " Sense Of Humor" " Bright"
#> [111] " Keenly Alert" " Canny"
#> [113] " Deeply Devoted" " Courteous"
#> [115] " Entertaining" " Gregarious"
#> [117] " Tomboyish"
We can see that from the values above, there are too many variables of trait and I think it will not represent anything on the dataset, especially on secondary and third traits. I also think that there are many values that have similar meaning (because the writer I think is too creative with the description).
#remove column trait_2 and trait_3
dogs_clean <- dogs_clean %>%
select(-c(trait_2, trait_3)) %>%
rename(trait = trait_1)#simplify the unique values
dogs_clean$trait[dogs_clean$trait == "Dignified"] <- "Confident"
dogs_clean$trait[dogs_clean$trait == "Affectionate"] <- "Sweet"
dogs_clean$trait[dogs_clean$trait == "Sociable"] <- "Friendly"
dogs_clean$trait[dogs_clean$trait == "Eager"] <- "Independent"
dogs_clean$trait[dogs_clean$trait == "Agile"] <- "Energetic"
dogs_clean$trait[dogs_clean$trait == "Charming"] <- "Sweet"
dogs_clean$trait[dogs_clean$trait == "Good-Natured"] <- "Gentle"
dogs_clean$trait[dogs_clean$trait == "Bright"] <- "Smart"
dogs_clean$trait[dogs_clean$trait == "Intelligent"] <- "Smart"
dogs_clean$trait[dogs_clean$trait == "Clever"] <- "Smart"
dogs_clean$trait[dogs_clean$trait == "Easy-Going"] <- "Friendly"
dogs_clean$trait[dogs_clean$trait == "Fearless"] <- "Courageous"
dogs_clean$trait[dogs_clean$trait == "Devoted"] <- "Loyal"
dogs_clean$trait[dogs_clean$trait == "Cheerful"] <- "Friendly"
dogs_clean$trait[dogs_clean$trait == "Versatile"] <- "Adaptable"
dogs_clean$trait[dogs_clean$trait == "Faithful"] <- "Loyal"
dogs_clean$trait[dogs_clean$trait == "Family-Oriented"] <- "Sweet"
dogs_clean$trait[dogs_clean$trait == "Active"] <- "Energetic"
dogs_clean$trait[dogs_clean$trait == "Bold"] <- "Courageous"
dogs_clean$trait[dogs_clean$trait == "Spirited"] <- "Energetic"
dogs_clean$trait[dogs_clean$trait == "Upbeat" ] <- "Friendly"
dogs_clean$trait[dogs_clean$trait == "Inquisitive"] <- "Smart"
dogs_clean$trait[dogs_clean$trait == "Reserved"] <- "Independent"
dogs_clean$trait[dogs_clean$trait == "Playful but also Work-Oriented. Very Active and Upbeat."] <- "Energetic"
dogs_clean$trait[dogs_clean$trait == "Happy"] <- "Friendly"
dogs_clean$trait[dogs_clean$trait == "Even-Tempered"] <- "Gentle"unique(dogs_clean$trait)#> [1] "Confident" "Friendly" "Courageous" "Sweet" "Loyal"
#> [6] "Playful" "Independent" "Energetic" "Alert" "Smart"
#> [11] "Gentle" "Mellow" "Adaptable"
Final check of the missing values in the Dataset.
colSums(is.na(dogs_clean))#> name popularity min_height
#> 0 0 0
#> max_height min_weight max_weight
#> 0 0 0
#> min_expectancy max_expectancy group
#> 0 0 0
#> grooming_frequency_value shedding_value energy_level_value
#> 0 0 0
#> trainability_value demeanor_value trait
#> 0 0 0
We want to see the value in popularity column
dogs_clean$popularity#> [1] "148" "113" "60" "47" "58" "" "175" "122" "186" "136" "" "85"
#> [13] "166" "90" "" "55" "" "17" "" "140" "of" "of" "87" ""
#> [25] "39" "" "6" "127" "124" "141" "" "43" "125" "106" "187" "144"
#> [37] "22" "46" "" "138" "118" "49" "130" "121" "" "" "35" "88"
#> [49] "103" "21" "84" "11" "100" "" "" "" "132" "26" "" "98"
#> [61] "62" "5" "51" "69" "179" "32" "68" "" "" "" "18" ""
#> [73] "185" "45" "33" "79" "64" "190" "75" "183" "143" "30" "38" "81"
#> [85] "" "162" "" "12" "56" "176" "" "" "17" "of" "67" ""
#> [97] "" "" "52" "188" "94" "27" "135" "157" "" "" "149" "161"
#> [109] "184" "91" "4" "" "" "134" "2" "9" "" "63" "78" "174"
#> [121] "3" "115" "177" "16" "66" "74" "145" "" "" "189" "24" ""
#> [133] "" "152" "155" "146" "77" "116" "159" "76" "73" "" "104" ""
#> [145] "" "" "" "95" "129" "" "173" "" "163" "1" "99" "147"
#> [157] "" "" "93" "71" "168" "37" "133" "133" "29" "34" "110" "70"
#> [169] "19" "" "" "102" "150" "40" "126" "" "165" "97" "191" "108"
#> [181] "83" "72" "182" "54" "117" "92" "13" "" "" "156" "172" "171"
#> [193] "114" "170" "23" "7" "7" "7" "" "" "154" "" "" "50"
#> [205] "" "28" "160" "151" "" "181" "" "86" "142" "41" "" "8"
#> [217] "82" "" "" "48" "120" "59" "" "105" "158" "57" "164" ""
#> [229] "25" "44" "20" "" "14" "112" "178" "192" "" "" "" ""
#> [241] "123" "53" "" "153" "109" "" "80" "89" "180" "" "169" ""
#> [253] "" "" "131" "119" "96" "" "" "111" "" "" "137" "31"
#> [265] "36" "128" "107" "42" "" "61" "101" "65" "167" "" "140" ""
#> [277] "10"
The popularity is a ranking of dog breeds based on votes. I don’t think this column is necessary for our analysis since it is not objective for clustering purposes.
#create row ID for easier analysis
dogs_clean$ID <- seq.int(nrow(dogs_clean))
#change columns into factor
dogs_clean <- dogs_clean %>%
mutate(trait = as.factor(trait),
group = as.factor(group)) %>%
#remove popularity column
select(-c(popularity)) %>%
tibble::column_to_rownames("ID") %>%
relocate(group, .after = trait)
head(dogs_clean, 5)We see that the group and trait column
could be a comparator for the cluster, since it is already a group.
unique(dogs_clean$group)#> [1] Toy Group Hound Group Terrier Group
#> [4] Working Group Foundation Stock Service Non-Sporting Group
#> [7] Sporting Group Herding Group Miscellaneous Class
#> 9 Levels: Toy Group Hound Group Terrier Group ... Miscellaneous Class
unique(dogs_clean$trait)#> [1] Confident Friendly Courageous Sweet Loyal Playful
#> [7] Independent Energetic Alert Smart Gentle Mellow
#> [13] Adaptable
#> 13 Levels: Confident Friendly Courageous Sweet Loyal Playful ... Adaptable
These columns will further be used for comparation with cluster by numeric values.
ggcorr(dogs_clean, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)We see from the plot above that weight, height, and life expectancy is correlated very high with each other. It will be very good if we make Principal Component Analysis for the dataset to produce non-multicollinearity data and also reduce dimension (less than 11).
We want to maintain at least 80% of the data.
library(gridExtra)
par(mfrow=c(2,3))
hist(x = dogs_clean$max_height, freq = TRUE, main=NULL)
hist(x = dogs_clean$max_weight, freq = TRUE, main=NULL)
hist(x = dogs_clean$max_expectancy, freq = TRUE, main=NULL)
hist(x = dogs_clean$energy_level_value, freq = TRUE, main=NULL)
hist(x = dogs_clean$trainability_value, freq = TRUE, main=NULL)
hist(x = dogs_clean$demeanor_value, freq = TRUE, main=NULL)
Based on the plots above, we can see that 5 out of 6 columns spread is
following a normal distribution, although there are skewed properties on
each plot. This means at least the dataset can be clustered into low,
mid, high on each value.
We take on example, max_weight for the initial
analysis
ggplot(data = dogs_clean, aes(x = group, y = max_weight)) +
geom_boxplot(aes(fill = max_weight), width = 0.8, outlier.shape = NA) +
scale_x_discrete(guide = guide_axis(n.dodge = 2))ggplot(data = dogs_clean, aes(x = group, y = max_weight)) +
geom_boxplot(aes(fill = max_weight), width = 0.8, outlier.shape = NA) +
geom_hline(yintercept=7, color = "red", size=1) +
geom_hline(yintercept=35, color = "red", size=1) +
scale_x_discrete(guide = guide_axis(n.dodge = 2))
We’ll see that with using
max_weight there are at least
three clusters, combined with every column we may get more
clusters or maybe even less clusters.
Before we want to make PCA, we want to be able to differentiate between numeric columns and categorical columns.
#finding column that is numeric and categorical
dogs_noname <- dogs_clean %>% select(-name)
col_num <- dogs_noname %>%
select_if(is.numeric) %>%
colnames()
col_fac <- dogs_noname %>%
select_if(is.factor) %>%
colnames()
quantivar <- which(colnames(dogs_noname) %in% col_num)
qualivar <- which(colnames(dogs_noname) %in% col_fac)#principal component analysis
dogs_pca <- PCA(
X = dogs_noname,
scale.unit = TRUE,
quali.sup = qualivar,
ncp = 7,
graph = FALSE
)dogs_pca$eig#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 4.80271641 43.6610583 43.66106
#> comp 2 1.31420932 11.9473575 55.60842
#> comp 3 1.17864459 10.7149508 66.32337
#> comp 4 0.95965764 8.7241604 75.04753
#> comp 5 0.84670653 7.6973321 82.74486
#> comp 6 0.76488077 6.9534615 89.69832
#> comp 7 0.56967879 5.1788981 94.87722
#> comp 8 0.24100044 2.1909131 97.06813
#> comp 9 0.21197441 1.9270401 98.99517
#> comp 10 0.07319681 0.6654256 99.66060
#> comp 11 0.03733427 0.3394025 100.00000
Using PCA, I wanted to reduce the dimension but keep as much information as possible. In this case, I wanted to keep at least 80% of the Data.
With that criteria, I want to pick a new PCA axis until
comp 5. Which I can at least keep 82% of the data, while
reducing 6/11*100 (54,54%) of the axis (dimension).
We can extract values from PCA and this extracted value will be threaded into data frame for further analysis (for example : classification by Random Forrest)
dogs_pca_df <- as.data.frame(dogs_pca$ind$coord[,1:5])
dogs_pca_df <- cbind(dogs_pca_df, group = dogs_clean$group, trait = dogs_clean$trait, name = dogs_clean$name)
head(dogs_pca_df)ggcorr(dogs_pca_df, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)We also see that using PCA, the correlation is not happening. So this new axis is very useful for classification analysis.
Before we further making cluster of the dataset, we want to make dataframe that contains only numeric columns. We also scale the data for making uniform weight of the data.
#create numeric dataframe
#scale the numeric values on the dataset
dogs_numeric <- scale(dogs_clean %>%
select(-c(name, trait, group)))#visualize WSS, Silhouette, and Gap Stat for K Means Analysis
library(cowplot)
kmeans_wss <- fviz_nbclust(x = dogs_numeric, FUNcluster = kmeans, method = "wss") + theme(plot.title = element_blank())
kmeans_sil <- fviz_nbclust(x = dogs_numeric, FUNcluster = kmeans, method = "silhouette") + theme(plot.title = element_blank())
kmeans_gap <- fviz_nbclust(x = dogs_numeric, FUNcluster = kmeans, method = "gap_stat") + theme(plot.title = element_blank())
list.plot_km <- list() #list to store the plots
list.plot_km [[1]] <- kmeans_wss
list.plot_km [[2]] <- kmeans_sil
list.plot_km [[3]] <- kmeans_gap
plot_grid(plotlist = list.plot_km, ncol = 3)
Based on two out of three plot above, I decided to take 3 as a number
for clustering.
# k-means with k optimum
dogs_cluster_op <- kmeans(dogs_numeric, centers = 3)First we make plot for PCA 1 and PCA 2 that we use to further understand the PCA. Note that PCA is combination of many dimensions in the dataset so it couldn’t be interpreted easily without take a look at variables contributions to the axis.
#plotting PCA graph of variables
plot.PCA(x =dogs_pca,
choix = "var")plot_dim_1 <- fviz_contrib(X = dogs_pca, choice = "var", axes = 1) + theme(plot.title = element_blank())
plot_dim_2 <- fviz_contrib(X = dogs_pca, choice = "var", axes = 2) + theme(plot.title = element_blank())
list.plot_dim <- list()
list.plot_dim [[1]] <- plot_dim_1
list.plot_dim [[2]] <- plot_dim_2
plot_grid(plotlist = list.plot_dim, ncol = 2)Insight from PCA graph of variables :
weight and
height.life expectancy is also contributing significantly
on the PCA 1 but it is in reverse with weight and
height. Means that the value is extremely uncorrelated in
this PCA.shedding_value and
grooming_frequency_value is in reverse, indicating that
there is no correlation between those two.energy_level_value is correlated with
trainability_value which is very logic because dog that can
be train is the dog that has high energy level.energy_level_value contributed significantly to the
PCA 2.#plotting PCA graph of individuals
plot.PCA(x =dogs_pca,
choix = "ind",
invisible = "quali")
Insight from PCA graph of individuals :
group and trait on the PCA
Plotplot.PCA(
x = dogs_pca,
choix = "ind",
invisible = "quali",
habillage = "group"
)
Insight :
Foundation Stock Service is a dog breed group that
consists of purebred rare breeds that are currently being bred giving
them a reliable and reputable form of record keeping, so that’s why the
group is spreaded randomly because it doesn’t reflect the measures that
are being used in this analysisMiscellaneous Class is a dog breed that not yet
included to the Foundation Stock Service. It also doesn’t
reflect the measures that are being used in this analysis.plot.PCA(
x = dogs_pca,
choix = "ind",
invisible = "quali",
habillage = "trait"
)
Insight :
trait group is spread evenly to the plot and not
making a cluster. So the trait could be not reflected by the measures
this analysis uses.# visualize clustering on the PCA
fviz_cluster(object = dogs_cluster_op,
data = dogs_numeric,
)
Based on the plot above, the cluster is separated quite evenly by the
size of the region. Most of the data though contained in
cluster 1 and cluster 3. We will take a
further look for the cluster by profiling the cluster.
# adding cluster to the dataset
dogs_clean$cluster <- as.factor(dogs_cluster_op$cluster)# cluster profiling
dogs_clean %>%
select(-c(name, group, trait)) %>%
group_by(cluster) %>%
summarise_all(.funs = "mean") Based on profiling above using mean function for every
columns, we can see that the cluster reflections are :
Looking at height and weight:
Looking at life expectancy:
Looking at energy level value:
Looking at grooming frequency value:
From the analysis, we can take some summaries like :
We can reduce the dimension into 5 PCs with 82.7% of information
kept. The reason why it couldn’t be higher is because there are a lot of
measures that eventually are not significant to the analysis because it
can’t be measured with high accuracy and sometimes is very subjective.
For example the height, weight,
life expectancy can be measured well with already existing
metrics but trait like trainability,
demeanor are not easy to measure. (and to define
maybe?)
New Dataset obtained by using a PC could be used for classification purposes with a very good result because each of the dimensions is not correlated with each other. So the bias will be lower.
K-Means Clustering can be done to this dataset. Some of the groups in the dog breed resemble regions in the PCA plot even though some of the groups are overlapping other groups, and also there is some group that is not looking at the measures that is used in this dataset for the grouping.
There are three clusters produced in this Dataset. This can be used for example to classify the products that are used by each of the clusters. For example :