# clear-up the environment
rm(list = ls())

# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

Intro

This article is made for completing the assignment for Algoritma : Machine Learning Course in Unsupervised Learning Theia Batch 2022.

In this article we will try to make a cluster of dog breeds based on several measures commonly used by aspiring dog owners for choosing their best suited dog in their criteria. Note that every dog is a unique creature and they grow into unique dogs also in their adult life so this data is a generalization but it can be used as a prediction of how a dog will become later in their life based on genetic traits and the purpose of certain dog breeds is used in human life.

This dataset is taken from fmfilho github This data set contains information about 277 breeds and was extracted from the American Kennel Club website. All rights to the information contained here belong to them. This was created as a pastime project during the COVID-19 quarantine and is intended to be used for statistical analysis, as well as machine learning and natural language processing projects. Information about 277 dog breeds extracted from the American Kennel Club website using Beautiful Soup.

This article is dedicated to my furry friend Kuro. He’s a Dalmatian + Labrador Mix with a very gentle and sweet personality to humans but a fierce one to other dogs, a basic Alpha Dog trait :)

Meet Kuro the Dalmador (Dalmatian + Labrador)

Meet Kuro the Dalmador (Dalmatian + Labrador)

Him and Me

Him and Me

Explaratory Data Analysis

Load Library

options(scipen = 99) #non active science annotation (example : e-1)
library(dplyr)
library(tidyr)
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
library(tibble) 

Load Dataset

dogs <- read.csv("data-input/akc-data-latest.csv")
glimpse(dogs)
#> Rows: 277
#> Columns: 21
#> $ X                           <chr> "Affenpinscher", "Afghan Hound", "Airedale…
#> $ description                 <chr> "The Affen’s apish look has been described…
#> $ temperament                 <chr> "Confident, Famously Funny, Fearless", "Di…
#> $ popularity                  <chr> "148", "113", "60", "47", "58", "", "175",…
#> $ min_height                  <dbl> 22.86, 63.50, 58.42, 60.96, 58.42, 50.80, …
#> $ max_height                  <dbl> 29.21, 68.58, 58.42, 71.12, 63.50, 63.50, …
#> $ min_weight                  <dbl> 3.175147, 22.679619, 22.679619, 31.751466,…
#> $ max_weight                  <dbl> 4.535924, 27.215542, 31.751466, 58.967008,…
#> $ min_expectancy              <dbl> 12, 12, 11, 10, 10, 10, 11, 13, 11, 14, 12…
#> $ max_expectancy              <dbl> 15, 15, 14, 13, 14, 12, 12, 15, 13, 16, 15…
#> $ group                       <chr> "Toy Group", "Hound Group", "Terrier Group…
#> $ grooming_frequency_value    <dbl> 0.6, 0.8, 0.6, 0.8, 0.6, 0.2, 0.2, 0.4, 0.…
#> $ grooming_frequency_category <chr> "2-3 Times a Week Brushing", "Daily Brushi…
#> $ shedding_value              <dbl> 0.6, 0.2, 0.4, 0.6, 0.6, 0.6, 0.4, 0.6, 0.…
#> $ shedding_category           <chr> "Seasonal", "Infrequent", "Occasional", "S…
#> $ energy_level_value          <dbl> 0.6, 0.8, 0.6, 0.8, 0.8, 0.8, 0.8, 0.8, 0.…
#> $ energy_level_category       <chr> "Regular Exercise", "Energetic", "Regular …
#> $ trainability_value          <dbl> 0.8, 0.2, 1.0, 1.0, 0.4, 0.6, 0.6, 1.0, 0.…
#> $ trainability_category       <chr> "Easy Training", "May be Stubborn", "Eager…
#> $ demeanor_value              <dbl> 1.0, 0.2, 0.8, 0.6, 0.8, 0.6, 0.6, 1.0, 0.…
#> $ demeanor_category           <chr> "Outgoing", "Aloof/Wary", "Friendly", "Ale…
  • description <- 1 to 3 paragraphs describing the breed “
  • temperament <- breed temperament described in keywords
  • popularity <- popularity ranking (1-195)
  • min_height <- minimum height in cm
  • max_height <- maximum height in cm
  • min_weight <- minimum weight in kg
  • max_weight <- maximum weight in kg
  • min_expectancy <- minimum life expectancy in years
  • max_expectancy <- maximum life expectancy in years
  • group <-one of 9 breed groups assigned by the akc (7 main groups and 2 extra)
  • grooming_frequency_value <- A number representing the level of required grooming
  • grooming_frequency_category <- A categorization of grooming requirements
  • shedding_value <- A number representing the level of shedding
  • shedding_category <- A categorization of shedding frequency
  • energy_level_value <- A number representing the breed’s energy level
  • energy_level_category <- A categorization of energy level
  • trainability_value <- A number representing the breed’s trainability
  • trainability_category <- A categorization of trainability
  • demeanor_value <- A number representing the breed’s reaction to strangers and other pets
  • demeanor_category <- A categorization of reaction to strangers and other pets

Data Wrangling

Data Cleansing

Based on the glimpse above, we know that we won’t need several columns because we want as many as possible. So we will pick numeric values on several measures for better cluster modeling.

We also want to separate traits because the dataset only contains one column of traits with comma separated values on the cell.

# remove Several Columns
dogs_clean <- 
  dogs %>%
  dplyr::rename(name = X) %>%
  select(-c(description, grooming_frequency_category, shedding_category, energy_level_category, trainability_category, demeanor_category))
#separate row on temprament column
dogs_clean <- separate_rows(dogs_clean, temperament, sep = ",")

# separate several traits from temperament
dogs_clean <- 
  dogs_clean %>%
  group_by(name) %>%
  mutate(n = row_number()) %>%
  pivot_wider(names_from = "n", 
              names_prefix = "trait_",
              values_from = "temperament") 
#checking the dataframe
head(dogs_clean)

The separating values of the temperament column could be unused in this article, but I will try to keep this step because someday if I want to further analyze there will not be much work to start from scratch again.

Handle Missing Value

Before we further do analysis, we will check if there’s any missing value in our dataset.

#check if there's missing value
colSums(is.na(dogs_clean))
#>                     name               popularity               min_height 
#>                        0                        0                        0 
#>               max_height               min_weight               max_weight 
#>                        0                        2                        2 
#>           min_expectancy           max_expectancy                    group 
#>                        3                        3                        0 
#> grooming_frequency_value           shedding_value       energy_level_value 
#>                        7                       20                        6 
#>       trainability_value           demeanor_value                  trait_1 
#>                       24                       25                        0 
#>                  trait_2                  trait_3 
#>                        2                        3

Apparently, there’s many missing value in our dataset. This may be caused by the website script being updated and the Beautiful Soup for web scraping cannot handle updates of the web dataset. For the sake of this analysis and my furry friend, I will try to insert the right value from the web to each missing value in this dataset. (and I also enjoy surfing the web about dogs)

#check missing value in minimal weight and max weight
dogs_clean[rowSums(is.na(dogs_clean[,c(1,5,6)])) > 0,c(1,5,6)]
#https://en.wikipedia.org/wiki/Australian_Kelpie
dogs_clean[dogs_clean$name == "Australian Kelpie", "min_weight"] <- 13
dogs_clean[dogs_clean$name == "Australian Kelpie", "max_weight"] <- 19

#https://en.wikipedia.org/wiki/Spinone_Italiano
dogs_clean[dogs_clean$name == "Spinone Italiano", "min_weight"] <- 32
dogs_clean[dogs_clean$name == "Spinone Italiano", "max_weight"] <- 37
#check if there is 0 value in minimal weight and max weight
dogs_clean[dogs_clean$min_weight == 0 & dogs_clean$max_weight == 0,]
#https://www.britannica.com/animal/Cane-Corso
dogs_clean[dogs_clean$name == "Cane Corso", "min_weight"] <- 41
dogs_clean[dogs_clean$name == "Cane Corso", "max_weight"] <- 50
#check missing value in minimal and maximal life expectancy
dogs_clean[rowSums(is.na(dogs_clean[,c(1,7,8)])) > 0,c(1,7,8)]
#https://www.wisdompanel.com/en-us/dog-breeds/braque-francais-type-pyrenees
dogs_clean[dogs_clean$name == "Braque Francais Pyrenean", "min_expectancy"] <- 12
dogs_clean[dogs_clean$name == "Braque Francais Pyrenean", "max_expectancy"] <- 14

#https://www.dailypaws.com/dogs-puppies/dog-breeds/carolina-dog
dogs_clean[dogs_clean$name == "Carolina Dog", "min_expectancy"] <- 12
dogs_clean[dogs_clean$name == "Carolina Dog", "max_expectancy"] <- 15

#https://en.wikipedia.org/wiki/Central_Asian_Shepherd_Dog
dogs_clean[dogs_clean$name == "Central Asian Shepherd Dog", "min_expectancy"] <- 12
dogs_clean[dogs_clean$name == "Central Asian Shepherd Dog", "max_expectancy"] <- 15
#check if there is 0 value in minimal and maximal life expectancy
dogs_clean[dogs_clean$min_expectancy == 0 & dogs_clean$max_expectancy == 0,]
#https://www.akc.org/dog-breeds/pyrenean-shepherd/
dogs_clean[dogs_clean$name == "Pyrenean Shepherd", "min_expectancy"] <- 17
dogs_clean[dogs_clean$name == "Pyrenean Shepherd", "max_expectancy"] <- 19
#check missing value in grooming frequency value and shedding value
dogs_clean[rowSums(is.na(dogs_clean[,c(1,10,11)])) > 0,c(1,10,11)]
#https://www.akc.org/dog-breeds/american-hairless-terrier/
dogs_clean[dogs_clean$name == "American Hairless Terrier", "grooming_frequency_value"] <- 0.2
dogs_clean[dogs_clean$name == "American Hairless Terrier", "shedding_value"] <- 0.2

#https://hypoallergenicdog.net/are-bedlington-terriers-hypoallergenic-dogs/
dogs_clean[dogs_clean$name == "Bedlington Terrier", "shedding_value"] <- 0.6

#https://a-z-animals.com/animals/biewer-terrier/
dogs_clean[dogs_clean$name == "Biewer Terrier", "shedding_value"] <- 0.2

#https://www.akc.org/dog-breeds/braque-du-bourbonnais/
dogs_clean[dogs_clean$name == "Braque du Bourbonnais", "shedding_value"] <- 0.2

#https://www.akc.org/dog-breeds/catahoula-leopard-dog/
dogs_clean[dogs_clean$name == "Catahoula Leopard Dog", "grooming_frequency_value"] <- 0.4
dogs_clean[dogs_clean$name == "Catahoula Leopard Dog", "shedding_value"] <- 0.6

#https://www.akc.org/dog-breeds/deutscher-wachtelhund/
dogs_clean[dogs_clean$name == "Deutscher Wachtelhund", "shedding_value"] <- 0.4

#https://www.akc.org/dog-breeds/french-spaniel/
dogs_clean[dogs_clean$name == "French Spaniel", "grooming_frequency_value"] <- 0.2
dogs_clean[dogs_clean$name == "French Spaniel", "shedding_value"] <- 0.6

#https://www.akc.org/dog-breeds/german-longhaired-pointer/
dogs_clean[dogs_clean$name == "German Longhaired Pointer", "shedding_value"] <- 0.6

#https://www.akc.org/dog-breeds/giant-schnauzer/
dogs_clean[dogs_clean$name == "Giant Schnauzer", "shedding_value"] <- 0.6

#https://www.akc.org/dog-breeds/irish-red-and-white-setter/
dogs_clean[dogs_clean$name == "Irish Red and White Setter", "shedding_value"] <- 0.4

#https://www.akc.org/dog-breeds/irish-terrier/
dogs_clean[dogs_clean$name == "Irish Terrier", "shedding_value"] <- 0.4

#https://www.akc.org/dog-breeds/jagdterrier/
dogs_clean[dogs_clean$name == "Jagdterrier", "grooming_frequency_value"] <- 0.2
dogs_clean[dogs_clean$name == "Jagdterrier", "shedding_value"] <- 0.4

#https://www.akc.org/dog-breeds/kerry-blue-terrier/
dogs_clean[dogs_clean$name == "Kerry Blue Terrier", "shedding_value"] <- 0.2

#https://www.akc.org/dog-breeds/poodle-miniature/
dogs_clean[dogs_clean$name == "Poodle (Miniature)", "grooming_frequency_value"] <- 0.6
dogs_clean[dogs_clean$name == "Poodle (Miniature)", "shedding_value"] <- 0.2

#https://www.akc.org/dog-breeds/porcelaine/
dogs_clean[dogs_clean$name == "Porcelaine", "shedding_value"] <- 0.6

#https://www.akc.org/dog-breeds/shikoku/
dogs_clean[dogs_clean$name == "Shikoku", "shedding_value"] <- 0.6

#https://www.akc.org/dog-breeds/spanish-mastiff/
dogs_clean[dogs_clean$name == "Spanish Mastiff", "grooming_frequency_value"] <- 0.2

#https://www.akc.org/dog-breeds/spinone-italiano/
dogs_clean[dogs_clean$name == "Spinone Italiano", "shedding_value"] <- 0.6

#https://www.akc.org/dog-breeds/tornjak/
dogs_clean[dogs_clean$name == "Tornjak", "grooming_frequency_value"] <- 0.4
dogs_clean[dogs_clean$name == "Tornjak", "shedding_value"] <- 0.8

#https://www.akc.org/dog-breeds/transylvanian-hound/
dogs_clean[dogs_clean$name == "Transylvanian Hound", "shedding_value"] <- 0.6

#https://www.akc.org/dog-breeds/welsh-springer-spaniel/
dogs_clean[dogs_clean$name == "Welsh Springer Spaniel", "shedding_value"] <- 0.6
#check missing value in energy level value, trainability value, and demeanor value
dogs_clean[rowSums(is.na(dogs_clean[,c(1,11,12,13)])) > 0,c(1,11,12,13)]
#https://www.akc.org/dog-breeds/american-hairless-terrier/
dogs_clean[dogs_clean$name == "American Hairless Terrier", "energy_level_value"] <- 0.6
dogs_clean[dogs_clean$name == "American Hairless Terrier", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "American Hairless Terrier", "demeanor_value"] <- 0.6

#https://www.akc.org/dog-breeds/braque-francais-pyrenean/
dogs_clean[dogs_clean$name == "Braque Francais Pyrenean", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Braque Francais Pyrenean", "demeanor_value"] <- 0.6

#https://www.akc.org/dog-breeds/broholmer/
dogs_clean[dogs_clean$name == "Broholmer", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Broholmer", "demeanor_value"] <- 0.4

#https://www.akc.org/dog-breeds/danish-swedish-farmdog/
dogs_clean[dogs_clean$name == "Danish-Swedish Farmdog", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Danish-Swedish Farmdog", "demeanor_value"] <- 0.4

#https://www.akc.org/dog-breeds/deutscher-wachtelhund/
dogs_clean[dogs_clean$name == "Deutscher Wachtelhund", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Deutscher Wachtelhund", "demeanor_value"] <- 0.6

#https://www.akc.org/dog-breeds/dogo-argentino/
dogs_clean[dogs_clean$name == "Dogo Argentino", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "Dogo Argentino", "demeanor_value"] <- 0.4

#https://www.akc.org/dog-breeds/estrela-mountain-dog/
dogs_clean[dogs_clean$name == "Estrela Mountain Dog", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Estrela Mountain Dog", "demeanor_value"] <- 0.2
  
#https://www.akc.org/dog-breeds/french-spaniel/
dogs_clean[dogs_clean$name == "French Spaniel", "energy_level_value"] <- 0.8
dogs_clean[dogs_clean$name == "French Spaniel", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "French Spaniel", "demeanor_value"] <- 0.6

#https://www.akc.org/dog-breeds/hokkaido/
dogs_clean[dogs_clean$name == "Hokkaido", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Hokkaido", "demeanor_value"] <- 0.2
  
#https://www.akc.org/dog-breeds/jagdterrier/
dogs_clean[dogs_clean$name == "Jagdterrier", "energy_level_value"] <- 0.6
dogs_clean[dogs_clean$name == "Jagdterrier", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Jagdterrier", "demeanor_value"] <- 0.4

#https://www.akc.org/dog-breeds/hokkaido/
dogs_clean[dogs_clean$name == "Hokkaido", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Hokkaido", "demeanor_value"] <- 0.2

#https://www.akc.org/dog-breeds/karelian-bear-dog/
dogs_clean[dogs_clean$name == "Karelian Bear Dog", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Karelian Bear Dog", "demeanor_value"] <- 0.2

#https://www.akc.org/dog-breeds/norrbottenspets/
dogs_clean[dogs_clean$name == "Norrbottenspets", "trainability_value"] <- 0.8
dogs_clean[dogs_clean$name == "Norrbottenspets", "demeanor_value"] <- 0.6

#https://www.akc.org/dog-breeds/perro-de-presa-canario/
dogs_clean[dogs_clean$name == "Perro de Presa Canario", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Perro de Presa Canario", "demeanor_value"] <- 0.2

#https://www.akc.org/dog-breeds/poodle-miniature/
dogs_clean[dogs_clean$name == "Poodle (Miniature)", "energy_level_value"] <- 0.8
dogs_clean[dogs_clean$name == "Poodle (Miniature)", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "Poodle (Miniature)", "demeanor_value"] <- 0.6

#https://www.akc.org/dog-breeds/porcelaine/
dogs_clean[dogs_clean$name == "Porcelaine", "trainability_value"] <- 0.2

#https://www.akc.org/dog-breeds/portuguese-sheepdog/
dogs_clean[dogs_clean$name == "Portuguese Sheepdog", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "Portuguese Sheepdog", "demeanor_value"] <- 0.2

#https://www.akc.org/dog-breeds/pudelpointer/
dogs_clean[dogs_clean$name == "Pudelpointer", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "Pudelpointer", "demeanor_value"] <- 0.6

#https://www.akc.org/dog-breeds/pyrenean-mastiff/
dogs_clean[dogs_clean$name == "Pyrenean Mastiff", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Pyrenean Mastiff", "demeanor_value"] <- 0.2

#https://www.akc.org/dog-breeds/rafeiro-do-alentejo/
dogs_clean[dogs_clean$name == "Rafeiro do Alentejo", "trainability_value"] <- 1.0
dogs_clean[dogs_clean$name == "Rafeiro do Alentejo", "demeanor_value"] <- 0.2

#https://www.akc.org/dog-breeds/segugio-italiano/
dogs_clean[dogs_clean$name == "Segugio Italiano", "energy_level_value"] <- 0.6
dogs_clean[dogs_clean$name == "Segugio Italiano", "demeanor_value"] <- 0.2

#https://www.akc.org/dog-breeds/shikoku/
dogs_clean[dogs_clean$name == "Shikoku", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Shikoku", "demeanor_value"] <- 0.6

#https://www.akc.org/dog-breeds/slovensky-cuvac/
dogs_clean[dogs_clean$name == "Slovensky Cuvac", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Slovensky Cuvac", "demeanor_value"] <- 0.2

#https://www.akc.org/dog-breeds/slovensky-kopov/
dogs_clean[dogs_clean$name == "Slovensky Kopov", "trainability_value"] <- 0.8 
dogs_clean[dogs_clean$name == "Slovensky Kopov", "demeanor_value"] <- 0.6

#https://www.akc.org/dog-breeds/spanish-mastiff/
dogs_clean[dogs_clean$name == "Spanish Mastiff", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Spanish Mastiff", "demeanor_value"] <- 0.2

#https://www.akc.org/dog-breeds/tornjak/
dogs_clean[dogs_clean$name == "Tornjak", "energy_level_value"] <- 0.6
dogs_clean[dogs_clean$name == "Tornjak", "trainability_value"] <- 0.6

#https://www.akc.org/dog-breeds/transylvanian-hound/
dogs_clean[dogs_clean$name == "Transylvanian Hound", "demeanor_value"] <- 0.2

#https://www.akc.org/dog-breeds/treeing-tennessee-brindle/
dogs_clean[dogs_clean$name == "Treeing Tennessee Brindle", "trainability_value"] <- 0.6
dogs_clean[dogs_clean$name == "Treeing Tennessee Brindle", "demeanor_value"] <- 0.6

Further Cleansing

We already insert the numeric columns to the missing value in our dataset. We will further analyze the value in categorical columns. (For making the dataset futureproof)

#check if there's missing value
colSums(is.na(dogs_clean))
#>                     name               popularity               min_height 
#>                        0                        0                        0 
#>               max_height               min_weight               max_weight 
#>                        0                        0                        0 
#>           min_expectancy           max_expectancy                    group 
#>                        0                        0                        0 
#> grooming_frequency_value           shedding_value       energy_level_value 
#>                        0                        0                        0 
#>       trainability_value           demeanor_value                  trait_1 
#>                        0                        0                        0 
#>                  trait_2                  trait_3 
#>                        2                        3

If there’s a missing value in trait columns, we can look at the demeanor value to compare it and recheck the right value. Because demeanor value represents the dog’s general trait, well behaved or not.

#compared demeanor value to the trait 
dogs_clean[rowSums(is.na(dogs_clean[,c(14,15,16)])) > 0,c(1,14,15,16)]
#adding right value to missing value
dogs_clean[dogs_clean$name == "Slovakian Wirehaired Pointer", 15] <- "Smart"

We want to look further to all the values in the trait columns

unique(dogs_clean$trait_1)
#>  [1] "Confident"                                              
#>  [2] "Dignified"                                              
#>  [3] "Friendly"                                               
#>  [4] "Courageous"                                             
#>  [5] "Affectionate"                                           
#>  [6] "Loyal"                                                  
#>  [7] "Sweet"                                                  
#>  [8] "Playful"                                                
#>  [9] "Independent"                                            
#> [10] "Energetic"                                              
#> [11] "Sociable"                                               
#> [12] "Eager"                                                  
#> [13] "Agile"                                                  
#> [14] "Alert"                                                  
#> [15] "Smart"                                                  
#> [16] "Charming"                                               
#> [17] "Gentle"                                                 
#> [18] "Bright"                                                 
#> [19] "Good-Natured"                                           
#> [20] "Intelligent"                                            
#> [21] "Easy-Going"                                             
#> [22] "Fearless"                                               
#> [23] "Clever"                                                 
#> [24] "Mellow"                                                 
#> [25] "Devoted"                                                
#> [26] "Cheerful"                                               
#> [27] "Adaptable"                                              
#> [28] "Versatile"                                              
#> [29] "Faithful"                                               
#> [30] "Family-Oriented"                                        
#> [31] "Active"                                                 
#> [32] "Bold"                                                   
#> [33] "Spirited"                                               
#> [34] "Upbeat"                                                 
#> [35] "Even-Tempered"                                          
#> [36] "Inquisitive"                                            
#> [37] "Reserved"                                               
#> [38] "Playful but also Work-Oriented. Very Active and Upbeat."
#> [39] "Happy"
unique(dogs_clean$trait_2)
#>  [1] " Famously Funny"   " Profoundly Loyal" " Clever"          
#>  [4] " Dignified"        " Loyal"            " Self-Confident"  
#>  [7] " Mellow"           " Perky"            " Easy-Going"      
#> [10] " Alert"            " Energetic"        " Smart"           
#> [13] " Happy"            " Independent"      " Versatile"       
#> [16] " Curious"          " Work-Oriented"    " Trainable"       
#> [19] " Courageous"       " Bright"           " Cheerful"        
#> [22] " Patient"          " Bouncy"           " Faithful"        
#> [25] " Charming"         " Watchful"         " Sociable"        
#> [28] " Good-Natured"     " Calm"             " Devoted"         
#> [31] " Intelligent"      " Fun-Loving"       " Eager"           
#> [34] " Adaptable"        " Bold"             " Gentle"          
#> [37] " Adventurous"      " Graceful"         " Friendly"        
#> [40] " Amusing"          " Active"           " Proud"           
#> [43] " Fearless"         " Even-Tempered"    " Lively"          
#> [46] " Merry"            " Playful"          " Optimistic"      
#> [49] " Spirited"         " Confident"        " Family-Oriented" 
#> [52] " Outgoing"         " Dashing"          " Hardworking"     
#> [55] " Noble"            " Agile"            " Keen"            
#> [58] " Mischievous"      " Strong-Willed"    " Amiable"         
#> [61] " Brave"            " Inquisitive"      " Loving"          
#> [64] " Quick"            " Good-Tempered"    NA                 
#> [67] " Responsive"       " Reserved"         " Sprightly"
unique(dogs_clean$trait_3)
#>   [1] " Fearless"                " Aristocratic"           
#>   [3] " Courageous"              " Profoundly Loyal"       
#>   [5] " Playful"                 NA                        
#>   [7] " Sociable"                " Smart"                  
#>   [9] " Sweet-Tempered"          " Curious"                
#>  [11] " Intelligent"             " Good-Natured"           
#>  [13] " Charming"                " Reserved"               
#>  [15] " Lively"                  " Pleasant"               
#>  [17] " Exuberant"               " Comical"                
#>  [19] " Spirited"                " Deeply Affectionate"    
#>  [21] " Sweet-Natured"           " Poised"                 
#>  [23] " Determined"              " Low-Key"                
#>  [25] " Reserved with Strangers" " Merry"                  
#>  [27] " Charismatic"             " Obedient"               
#>  [29] " Frollicking"             " Hardworking"            
#>  [31] " Serious-Minded"          " Observant"              
#>  [33] " Strong"                  " Peppy"                  
#>  [35] " Amusing"                 " Brave"                  
#>  [37] " Powerful"                " Inquisitive"            
#>  [39] " Tenacious"               " Calm"                   
#>  [41] " Alert and Intelligent"   " Devoted"                
#>  [43] " Energetic"               " Plucky"                 
#>  [45] " Regally Dignified"       " Strong-Willed"          
#>  [47] " Active"                  " Lovable"                
#>  [49] " Enthusiastic"            " Gentle"                 
#>  [51] " Willing to Please"       " Faithful"               
#>  [53] " Upbeat"                  " Loving"                 
#>  [55] " Mischievous"             " Busy"                   
#>  [57] " Vigilant"                " Majestic"               
#>  [59] " Watchful"                " Kind"                   
#>  [61] " Graceful"                " Self-Confident"         
#>  [63] " Family-Oriented"         " Sensitive"              
#>  [65] " Sassy"                   " Independent"            
#>  [67] " Gentlemanly"             " Happy"                  
#>  [69] " Proud"                   " Happy-Go-Lucky"         
#>  [71] " Wickedly Smart"          " Spunky"                 
#>  [73] " Outgoing"                " Attentive"              
#>  [75] " Alert"                   " Humble"                 
#>  [77] " Athletic"                " Responsive"             
#>  [79] " Agile"                   " Good-Humored"           
#>  [81] " Vivacious"               " Trainable"              
#>  [83] " Bold"                    " Dependable"             
#>  [85] " Noble"                   " Regal"                  
#>  [87] " People-Oriented"         " Funny"                  
#>  [89] " Polite"                  " Tenderhearted"          
#>  [91] " Keen"                    " Docile"                 
#>  [93] " Sweet"                   " Undemanding"            
#>  [95] " Positive"                " Keenly Observant"       
#>  [97] " Quick"                   " Fun-Loving"             
#>  [99] " Perceptive"              " Boisterous"             
#> [101] " Regal in Manner"         " Even-Tempered"          
#> [103] " Very Smart"              " Home-Loving"            
#> [105] " Ready to Work"           " Eager to Please"        
#> [107] " Confident Guardian"      " Independent-Minded"     
#> [109] " Sense Of Humor"          " Bright"                 
#> [111] " Keenly Alert"            " Canny"                  
#> [113] " Deeply Devoted"          " Courteous"              
#> [115] " Entertaining"            " Gregarious"             
#> [117] " Tomboyish"

We can see that from the values above, there are too many variables of trait and I think it will not represent anything on the dataset, especially on secondary and third traits. I also think that there are many values that have similar meaning (because the writer I think is too creative with the description).

#remove column trait_2 and trait_3
dogs_clean <- dogs_clean %>%
  select(-c(trait_2, trait_3)) %>%
  rename(trait = trait_1)
#simplify the unique values
dogs_clean$trait[dogs_clean$trait == "Dignified"] <- "Confident"
dogs_clean$trait[dogs_clean$trait == "Affectionate"] <- "Sweet"
dogs_clean$trait[dogs_clean$trait == "Sociable"] <- "Friendly"
dogs_clean$trait[dogs_clean$trait == "Eager"] <- "Independent"
dogs_clean$trait[dogs_clean$trait == "Agile"] <- "Energetic"
dogs_clean$trait[dogs_clean$trait == "Charming"] <- "Sweet"
dogs_clean$trait[dogs_clean$trait == "Good-Natured"] <- "Gentle"
dogs_clean$trait[dogs_clean$trait == "Bright"] <- "Smart"
dogs_clean$trait[dogs_clean$trait == "Intelligent"] <- "Smart"
dogs_clean$trait[dogs_clean$trait == "Clever"] <- "Smart"
dogs_clean$trait[dogs_clean$trait == "Easy-Going"] <- "Friendly"
dogs_clean$trait[dogs_clean$trait == "Fearless"] <- "Courageous"
dogs_clean$trait[dogs_clean$trait == "Devoted"] <- "Loyal"
dogs_clean$trait[dogs_clean$trait == "Cheerful"] <- "Friendly"
dogs_clean$trait[dogs_clean$trait == "Versatile"] <- "Adaptable"
dogs_clean$trait[dogs_clean$trait == "Faithful"] <- "Loyal"
dogs_clean$trait[dogs_clean$trait == "Family-Oriented"] <- "Sweet"
dogs_clean$trait[dogs_clean$trait == "Active"] <- "Energetic"
dogs_clean$trait[dogs_clean$trait == "Bold"] <- "Courageous"
dogs_clean$trait[dogs_clean$trait == "Spirited"] <- "Energetic"
dogs_clean$trait[dogs_clean$trait == "Upbeat" ] <- "Friendly"
dogs_clean$trait[dogs_clean$trait == "Inquisitive"] <- "Smart"
dogs_clean$trait[dogs_clean$trait == "Reserved"] <- "Independent"
dogs_clean$trait[dogs_clean$trait == "Playful but also Work-Oriented. Very Active and Upbeat."] <- "Energetic"
dogs_clean$trait[dogs_clean$trait == "Happy"] <- "Friendly"
dogs_clean$trait[dogs_clean$trait == "Even-Tempered"] <- "Gentle"
unique(dogs_clean$trait)
#>  [1] "Confident"   "Friendly"    "Courageous"  "Sweet"       "Loyal"      
#>  [6] "Playful"     "Independent" "Energetic"   "Alert"       "Smart"      
#> [11] "Gentle"      "Mellow"      "Adaptable"

Final check of the missing values in the Dataset.

colSums(is.na(dogs_clean))
#>                     name               popularity               min_height 
#>                        0                        0                        0 
#>               max_height               min_weight               max_weight 
#>                        0                        0                        0 
#>           min_expectancy           max_expectancy                    group 
#>                        0                        0                        0 
#> grooming_frequency_value           shedding_value       energy_level_value 
#>                        0                        0                        0 
#>       trainability_value           demeanor_value                    trait 
#>                        0                        0                        0

We want to see the value in popularity column

dogs_clean$popularity
#>   [1] "148" "113" "60"  "47"  "58"  ""    "175" "122" "186" "136" ""    "85" 
#>  [13] "166" "90"  ""    "55"  ""    "17"  ""    "140" "of"  "of"  "87"  ""   
#>  [25] "39"  ""    "6"   "127" "124" "141" ""    "43"  "125" "106" "187" "144"
#>  [37] "22"  "46"  ""    "138" "118" "49"  "130" "121" ""    ""    "35"  "88" 
#>  [49] "103" "21"  "84"  "11"  "100" ""    ""    ""    "132" "26"  ""    "98" 
#>  [61] "62"  "5"   "51"  "69"  "179" "32"  "68"  ""    ""    ""    "18"  ""   
#>  [73] "185" "45"  "33"  "79"  "64"  "190" "75"  "183" "143" "30"  "38"  "81" 
#>  [85] ""    "162" ""    "12"  "56"  "176" ""    ""    "17"  "of"  "67"  ""   
#>  [97] ""    ""    "52"  "188" "94"  "27"  "135" "157" ""    ""    "149" "161"
#> [109] "184" "91"  "4"   ""    ""    "134" "2"   "9"   ""    "63"  "78"  "174"
#> [121] "3"   "115" "177" "16"  "66"  "74"  "145" ""    ""    "189" "24"  ""   
#> [133] ""    "152" "155" "146" "77"  "116" "159" "76"  "73"  ""    "104" ""   
#> [145] ""    ""    ""    "95"  "129" ""    "173" ""    "163" "1"   "99"  "147"
#> [157] ""    ""    "93"  "71"  "168" "37"  "133" "133" "29"  "34"  "110" "70" 
#> [169] "19"  ""    ""    "102" "150" "40"  "126" ""    "165" "97"  "191" "108"
#> [181] "83"  "72"  "182" "54"  "117" "92"  "13"  ""    ""    "156" "172" "171"
#> [193] "114" "170" "23"  "7"   "7"   "7"   ""    ""    "154" ""    ""    "50" 
#> [205] ""    "28"  "160" "151" ""    "181" ""    "86"  "142" "41"  ""    "8"  
#> [217] "82"  ""    ""    "48"  "120" "59"  ""    "105" "158" "57"  "164" ""   
#> [229] "25"  "44"  "20"  ""    "14"  "112" "178" "192" ""    ""    ""    ""   
#> [241] "123" "53"  ""    "153" "109" ""    "80"  "89"  "180" ""    "169" ""   
#> [253] ""    ""    "131" "119" "96"  ""    ""    "111" ""    ""    "137" "31" 
#> [265] "36"  "128" "107" "42"  ""    "61"  "101" "65"  "167" ""    "140" ""   
#> [277] "10"

The popularity is a ranking of dog breeds based on votes. I don’t think this column is necessary for our analysis since it is not objective for clustering purposes.

#create row ID for easier analysis
dogs_clean$ID <- seq.int(nrow(dogs_clean))

#change columns into factor
dogs_clean <- dogs_clean %>%
  mutate(trait = as.factor(trait), 
         group = as.factor(group)) %>%
  
  #remove popularity column
  select(-c(popularity)) %>% 
  tibble::column_to_rownames("ID") %>%
  relocate(group, .after = trait)

head(dogs_clean, 5)

Insight from Dataset

We see that the group and trait column could be a comparator for the cluster, since it is already a group.

unique(dogs_clean$group)
#> [1] Toy Group                Hound Group              Terrier Group           
#> [4] Working Group            Foundation Stock Service Non-Sporting Group      
#> [7] Sporting Group           Herding Group            Miscellaneous Class     
#> 9 Levels: Toy Group Hound Group Terrier Group ... Miscellaneous Class
unique(dogs_clean$trait)
#>  [1] Confident   Friendly    Courageous  Sweet       Loyal       Playful    
#>  [7] Independent Energetic   Alert       Smart       Gentle      Mellow     
#> [13] Adaptable  
#> 13 Levels: Confident Friendly Courageous Sweet Loyal Playful ... Adaptable

These columns will further be used for comparation with cluster by numeric values.

Data Pre-Analysis

Correlation of the Data

ggcorr(dogs_clean, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)

We see from the plot above that weight, height, and life expectancy is correlated very high with each other. It will be very good if we make Principal Component Analysis for the dataset to produce non-multicollinearity data and also reduce dimension (less than 11).

We want to maintain at least 80% of the data.

Spread of the Dataset

library(gridExtra)

par(mfrow=c(2,3))
hist(x = dogs_clean$max_height, freq = TRUE, main=NULL)
hist(x = dogs_clean$max_weight, freq = TRUE, main=NULL)
hist(x = dogs_clean$max_expectancy, freq = TRUE, main=NULL)
hist(x = dogs_clean$energy_level_value, freq = TRUE, main=NULL)
hist(x = dogs_clean$trainability_value, freq = TRUE, main=NULL)
hist(x = dogs_clean$demeanor_value, freq = TRUE, main=NULL)

Based on the plots above, we can see that 5 out of 6 columns spread is following a normal distribution, although there are skewed properties on each plot. This means at least the dataset can be clustered into low, mid, high on each value.

We take on example, max_weight for the initial analysis

ggplot(data = dogs_clean, aes(x = group, y = max_weight)) + 
  geom_boxplot(aes(fill = max_weight), width = 0.8, outlier.shape = NA) +
  scale_x_discrete(guide = guide_axis(n.dodge = 2))

ggplot(data = dogs_clean, aes(x = group, y = max_weight)) + 
  geom_boxplot(aes(fill = max_weight), width = 0.8, outlier.shape = NA) +
  geom_hline(yintercept=7, color = "red", size=1) +
  geom_hline(yintercept=35, color = "red", size=1) +
  scale_x_discrete(guide = guide_axis(n.dodge = 2))

We’ll see that with using max_weight there are at least three clusters, combined with every column we may get more clusters or maybe even less clusters.

UL : Principal Component Analysis (PCA)

Before we want to make PCA, we want to be able to differentiate between numeric columns and categorical columns.

#finding column that is numeric and categorical

dogs_noname <- dogs_clean %>% select(-name)

col_num <- dogs_noname %>%
  select_if(is.numeric) %>% 
  colnames()

col_fac <- dogs_noname %>% 
  select_if(is.factor) %>% 
  colnames()

quantivar <- which(colnames(dogs_noname) %in% col_num)
qualivar <- which(colnames(dogs_noname) %in% col_fac)
#principal component analysis

dogs_pca <- PCA(
  X = dogs_noname, 
  scale.unit = TRUE,
  quali.sup = qualivar,
  ncp = 7, 
  graph = FALSE  
  )
dogs_pca$eig
#>         eigenvalue percentage of variance cumulative percentage of variance
#> comp 1  4.80271641             43.6610583                          43.66106
#> comp 2  1.31420932             11.9473575                          55.60842
#> comp 3  1.17864459             10.7149508                          66.32337
#> comp 4  0.95965764              8.7241604                          75.04753
#> comp 5  0.84670653              7.6973321                          82.74486
#> comp 6  0.76488077              6.9534615                          89.69832
#> comp 7  0.56967879              5.1788981                          94.87722
#> comp 8  0.24100044              2.1909131                          97.06813
#> comp 9  0.21197441              1.9270401                          98.99517
#> comp 10 0.07319681              0.6654256                          99.66060
#> comp 11 0.03733427              0.3394025                         100.00000

Using PCA, I wanted to reduce the dimension but keep as much information as possible. In this case, I wanted to keep at least 80% of the Data.

With that criteria, I want to pick a new PCA axis until comp 5. Which I can at least keep 82% of the data, while reducing 6/11*100 (54,54%) of the axis (dimension).

We can extract values from PCA and this extracted value will be threaded into data frame for further analysis (for example : classification by Random Forrest)

dogs_pca_df <- as.data.frame(dogs_pca$ind$coord[,1:5])
dogs_pca_df <- cbind(dogs_pca_df, group = dogs_clean$group, trait = dogs_clean$trait, name = dogs_clean$name)
head(dogs_pca_df)
ggcorr(dogs_pca_df, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)

We also see that using PCA, the correlation is not happening. So this new axis is very useful for classification analysis.

UL : Clustering

Before we further making cluster of the dataset, we want to make dataframe that contains only numeric columns. We also scale the data for making uniform weight of the data.

#create numeric dataframe

#scale the numeric values on the dataset
dogs_numeric <- scale(dogs_clean %>%
  select(-c(name, trait, group)))
#visualize WSS, Silhouette, and Gap Stat for K Means Analysis

library(cowplot)

kmeans_wss <- fviz_nbclust(x = dogs_numeric, FUNcluster = kmeans, method = "wss") + theme(plot.title = element_blank())
kmeans_sil <- fviz_nbclust(x = dogs_numeric, FUNcluster = kmeans, method = "silhouette") + theme(plot.title = element_blank())
kmeans_gap <- fviz_nbclust(x = dogs_numeric, FUNcluster = kmeans, method = "gap_stat") + theme(plot.title = element_blank())

list.plot_km <- list() #list to store the plots
list.plot_km [[1]] <- kmeans_wss
list.plot_km [[2]] <- kmeans_sil 
list.plot_km [[3]] <- kmeans_gap

plot_grid(plotlist = list.plot_km, ncol = 3)

Based on two out of three plot above, I decided to take 3 as a number for clustering.

# k-means with k optimum
dogs_cluster_op <- kmeans(dogs_numeric, centers = 3)

Visualization of PCA and Clustering

First we make plot for PCA 1 and PCA 2 that we use to further understand the PCA. Note that PCA is combination of many dimensions in the dataset so it couldn’t be interpreted easily without take a look at variables contributions to the axis.

#plotting PCA graph of variables
plot.PCA(x =dogs_pca,
         choix = "var")

plot_dim_1 <- fviz_contrib(X = dogs_pca, choice = "var", axes = 1) + theme(plot.title = element_blank())
plot_dim_2 <- fviz_contrib(X = dogs_pca, choice = "var", axes = 2) + theme(plot.title = element_blank())

list.plot_dim <- list() 
list.plot_dim [[1]] <- plot_dim_1
list.plot_dim [[2]] <- plot_dim_2 

plot_grid(plotlist = list.plot_dim, ncol = 2)

Insight from PCA graph of variables :

  • Many variables related to the size of breed is contributing significantly on the PCA 1 such as weight and height.
  • The life expectancy is also contributing significantly on the PCA 1 but it is in reverse with weight and height. Means that the value is extremely uncorrelated in this PCA.
  • The shedding_value and grooming_frequency_value is in reverse, indicating that there is no correlation between those two.
  • The energy_level_value is correlated with trainability_value which is very logic because dog that can be train is the dog that has high energy level.
  • The energy_level_value contributed significantly to the PCA 2.
#plotting PCA graph of individuals

plot.PCA(x =dogs_pca,
         choix = "ind",
         invisible = "quali")

Insight from PCA graph of individuals :

  • There is no outliers that present in the dataset
  • The data is spreaded most on the first, second, and third quadrant of the plot and around the zero point of the plot

Visualize of group and trait on the PCA Plot

plot.PCA(
  x = dogs_pca,        
  choix = "ind",    
  invisible = "quali", 
  habillage = "group" 
)

Insight :

  • The group is separated but still overlaps to the other group, but we can still see that most of the data is placed near with the same color.
  • Foundation Stock Service is a dog breed group that consists of purebred rare breeds that are currently being bred giving them a reliable and reputable form of record keeping, so that’s why the group is spreaded randomly because it doesn’t reflect the measures that are being used in this analysis
  • Miscellaneous Class is a dog breed that not yet included to the Foundation Stock Service. It also doesn’t reflect the measures that are being used in this analysis.
plot.PCA(
  x = dogs_pca,        
  choix = "ind",    
  invisible = "quali", 
  habillage = "trait" 
)

Insight :

  • The trait group is spread evenly to the plot and not making a cluster. So the trait could be not reflected by the measures this analysis uses.
  • Could not be used to classify Dog, so the trait is only used for spesific case.

Visualize and Analysis Cluster

# visualize clustering on the PCA
fviz_cluster(object = dogs_cluster_op,
             data = dogs_numeric,
            )

Based on the plot above, the cluster is separated quite evenly by the size of the region. Most of the data though contained in cluster 1 and cluster 3. We will take a further look for the cluster by profiling the cluster.

# adding cluster to the dataset
dogs_clean$cluster <- as.factor(dogs_cluster_op$cluster)
# cluster profiling
dogs_clean %>%
  select(-c(name, group, trait)) %>%
  group_by(cluster) %>% 
  summarise_all(.funs = "mean") 

Based on profiling above using mean function for every columns, we can see that the cluster reflections are :

Looking at height and weight:

  • Cluster 1 <- Contained middle-sized Dog Breed.
  • Cluster 2 <- Contained big-sized Dog Breed.
  • Cluster 3 <- Contained small-sized Dog Breed.

Looking at life expectancy:

  • Cluster 1 <- Contained middle life expectancy
  • Cluster 2 <- Contained short life expectancy
  • Cluster 3 <- Contained long life expectancy

Looking at energy level value:

  • Cluster 1 is the highest of all cluster significantly

Looking at grooming frequency value:

  • Cluster 3 is the highest of all cluster significantly

Conclusion

From the analysis, we can take some summaries like :

  1. We can reduce the dimension into 5 PCs with 82.7% of information kept. The reason why it couldn’t be higher is because there are a lot of measures that eventually are not significant to the analysis because it can’t be measured with high accuracy and sometimes is very subjective. For example the height, weight, life expectancy can be measured well with already existing metrics but trait like trainability, demeanor are not easy to measure. (and to define maybe?)

  2. New Dataset obtained by using a PC could be used for classification purposes with a very good result because each of the dimensions is not correlated with each other. So the bias will be lower.

  3. K-Means Clustering can be done to this dataset. Some of the groups in the dog breed resemble regions in the PCA plot even though some of the groups are overlapping other groups, and also there is some group that is not looking at the measures that is used in this dataset for the grouping.

  4. There are three clusters produced in this Dataset. This can be used for example to classify the products that are used by each of the clusters. For example :

  • Recommend the Dog Food Producer to promote the higher calories of food to the cluster 1 without having a risk of obesity to the dog because the Cluster 1 has higher energy level (because they are more active)
  • Recommend grooming products like shampoo, soap, and hair care to the Dog Breed owner at Cluster 3. They can also put a higher profit margin to the product for Cluster 3 since the Dog is small-sized and usually being kept by wealthier Dog Owners.